{"id":1122,"date":"2024-11-14T20:20:07","date_gmt":"2024-11-14T11:20:07","guid":{"rendered":"https:\/\/www.yanagichiaki.jp\/?p=1122"},"modified":"2025-02-25T23:08:02","modified_gmt":"2025-02-25T14:08:02","slug":"computer-architecture-labs","status":"publish","type":"post","link":"https:\/\/yanagichiaki.jp\/index.php\/2024\/11\/14\/computer-architecture-labs\/","title":{"rendered":"Computer Architecture Labs"},"content":{"rendered":"\n<p class=\"is-style-big_icon_point\">\u30cf\u30eb\u30d3\u30f3\u5de5\u696d\u5927\u5b66\uff08\u6df1\u5733\uff09\u2022 2024 \u2022 \u30b3\u30f3\u30d4\u30e5\u30fc\u30bf\u30fb\u30a2\u30fc\u30ad\u30c6\u30af\u30c1\u30e3 Lab\u2022 \u306b\u304a\u3051\u308b\u89e3\u6c7a\u7b56 \u2022 HITSZ \u8ba1\u7b97\u673a\u4f53\u7cfb\u7ed3\u6784\u5b9e\u9a8c 2024<\/p>\n\n\n\n<p class=\"has-border -border03 is-style-icon_info\">\u5fa1\u8cea\u554f\u304c\u5fa1\u5ea7\u3044\u307e\u3057\u305f\u3089\u3001\u3053\u306e\u30da\u30fc\u30b8\u306e\u4e0b\u90e8\u306b\u3042\u308b\u30b3\u30e1\u30f3\u30c8\u6b04\u3092\u5fa1\u5229\u7528\u304f\u3060\u3055\u3044\u3002<br><span class=\"swl-marker mark_yellow\">\u4ef0\u305b\u4e8b\u6709\u4e4b\u5019\u30cf\u30cf<\/span>\u3001<span class=\"swl-marker mark_blue\">\u6b64\u4e01\u4e4b\u4e0b\u30cb\u30a2\u30eb\u610f\u898b\u4e4b\u6b04\u30f2\u7528\u30f0\u7d66\u30d8<\/span>\u3002<\/p>\n\n\n<div class=\"swell-block-balloon\"><div class=\"c-balloon -bln-left -sp-vrtcl\" data-col=\"green\"><div class=\"c-balloon__icon -circle\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/www.yanagichiaki.jp\/wp-content\/uploads\/2024\/06\/logo.png\" alt=\"\" class=\"c-balloon__iconImg\" width=\"80px\" height=\"80px\"><span class=\"c-balloon__iconName\">\u5343\u79cb<\/span><\/div><div class=\"c-balloon__body -speaking -border-on\"><div class=\"c-balloon__text\">\n<p>\u6b21\u306eLab\u306b\u306f\u3001x86\u30a2\u30bb\u30f3\u30d6\u30ea\u3084CUDA\u30d7\u30ed\u30b0\u30e9\u30df\u30f3\u30b0\u306e\u77e5\u8b58\u304c\u5fc5\u8981<\/p>\n<span class=\"c-balloon__shapes\"><span class=\"c-balloon__before\"><\/span><span class=\"c-balloon__after\"><\/span><\/span><\/div><\/div><\/div><\/div>\n\n\n<p class=\"has-text-align-center u-mb-0 u-mb-ctrl\" style=\"line-height:2\">\u65e5\u672c\u8a9e\u8a33\u7248<\/p>\n\n\n\n<div class=\"swell-block-button green_ is-style-btn_shiny\"><a href=\"https:\/\/www.yanagichiaki.jp\/index.php\/2024\/12\/09\/computer-architecture-labs-guidebook-hitsz\/\" class=\"swell-block-button__link\" data-has-icon=\"1\"><svg class=\"__icon\" height=\"1em\" width=\"1em\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" aria-hidden=\"true\" viewBox=\"0 0 48 48\"><path d=\"M21.2 30.2c-.5 0-1-.2-1.4-.6l-.7-.7c-2.3-2.3-3.5-5.3-3.5-8.5s1.2-6.2 3.5-8.5l7.1-7.1c2.3-2.3 5.3-3.5 8.5-3.5s6.2 1.2 8.5 3.5c4.7 4.7 4.7 12.3 0 17l-3.5 3.5c-.8.8-2 .8-2.8 0-.8-.8-.8-2 0-2.8l3.5-3.5c3.1-3.1 3.1-8.2 0-11.3-1.5-1.5-3.5-2.3-5.7-2.3-2.1 0-4.2.8-5.7 2.3l-7.1 7.1c-1.5 1.5-2.3 3.5-2.3 5.7s.8 4.2 2.3 5.7l.7.7c.8.8.8 2 0 2.8-.4.3-.9.5-1.4.5z\"><\/path><path d=\"M13.4 46.6c-3.1 0-6.1-1.2-8.5-3.5-2.3-2.3-3.5-5.3-3.5-8.5s1.2-6.2 3.5-8.5l3.5-3.5c.8-.8 2-.8 2.8 0 .8.8.8 2 0 2.8l-3.5 3.5c-1.5 1.5-2.3 3.5-2.3 5.7 0 2.1.8 4.2 2.3 5.7 3.1 3.1 8.2 3.1 11.3 0l7.1-7.1c1.5-1.5 2.3-3.5 2.3-5.7 0-2.1-.8-4.2-2.3-5.7l-.7-.7c-.8-.8-.8-2 0-2.8.8-.8 2-.8 2.8 0l.7.7c2.3 2.3 3.5 5.3 3.5 8.5s-1.2 6.2-3.5 8.5l-7.1 7.1c-2.3 2.3-5.3 3.5-8.4 3.5z\"><\/path><\/svg><span>GuideBook \u3078<\/span><\/a><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Lab1 \u884c\u5217\u4e57\u7b97<\/h2>\n\n\n\n<h3 
class=\"wp-block-heading\">\u74b0\u5883\u8a2d\u5b9a\u3068\u4e8b\u524d\u30c6\u30b9\u30c8<\/h3>\n\n\n\n<p>\u3053\u306e\u30bf\u30b9\u30af\u306e\u76ee\u7684\u306f\u3001\u6587\u5b57\u5217\u3092\u30b3\u30f3\u30bd\u30fc\u30eb\u306b\u51fa\u529b\u3059\u308b\u3053\u3068\u3067\u3059\u3002<\/p>\n\n\n\n<p>\u307e\u305a\u3001\u4e0e\u3048\u3089\u308c\u305f\u30b3\u30fc\u30c9\u3092\u78ba\u8a8d\u3057\u307e\u3057\u3087\u3046\u3002\u5c0f\u30bf\u30b9\u30af\u306f2\u3064\u3042\u308a\u307e\u3059\u3002\u6700\u521d\u306e\u30bf\u30b9\u30af\u306f\u3001\u3059\u3079\u3066\u306e\u6841\u304c\u51fa\u529b\u3055\u308c\u308b\u304b\u3069\u3046\u304b\u3092\u78ba\u8a8d\u3059\u308b\u3053\u3068\u3067\u3059\u3002\u6570\u5024\u3092\u5206\u5272\u3059\u308b\u3053\u3068\u3067\u6841\u3092\u51fa\u529b\u3057\u3066\u3044\u308b\u3053\u3068\u304c\u308f\u304b\u308a\u307e\u3059\u300210\u3067\u5272\u308b\u3068\u3001\u305d\u306e\u4f59\u308a\u304c\u51fa\u529b\u3059\u308b\u6841\u3068\u306a\u308a\u3001\u305d\u308c\u304c <code>rdx<\/code> \u306b\u4fdd\u5b58\u3055\u308c\u307e\u3059\u3002\u4e00\u65b9\u3001\u5546\u306f\u6b21\u306e\u30eb\u30fc\u30d7\u306e\u305f\u3081\u306b <code>rax<\/code> \u306b\u4fdd\u5b58\u3055\u308c\u307e\u3059\u3002<\/p>\n\n\n\n<p>\u6b21\u306b\u3001\u6700\u7d42\u7684\u306a\u5272\u308a\u7b97\u306e\u5f8c\u306b <code>rax<\/code> \u304c0\u306b\u306a\u308b\u3068\u30eb\u30fc\u30d7\u304c\u7d42\u4e86\u3059\u308b\u3053\u3068\u304c\u308f\u304b\u308a\u307e\u3059\u3002x86\u3067\u306f\u3001\u547d\u4ee4 <code>jnz<\/code> \u306f Z \u30d5\u30e9\u30b0\u3092\u78ba\u8a8d\u3057\u3066\u30b8\u30e3\u30f3\u30d7\u3059\u308b\u304b\u3069\u3046\u304b\u3092\u5224\u65ad\u3057\u307e\u3059\u3002\u30eb\u30fc\u30d7\u306b\u5f71\u97ff\u3092\u4e0e\u3048\u306a\u3044\u3088\u3046\u306b\u3001<code>rax<\/code> \u3092\u5909\u66f4\u3057\u306a\u3044\u3088\u3046\u306b\u3059\u308b\u5fc5\u8981\u304c\u3042\u308a\u307e\u3059\u3002\u3053\u306e\u305f\u3081\u3001$$\\textbf{test}\\text{ \\%rax, \\%rax}$$ \u3068\u3044\u3046\u547d\u4ee4\u3092\u4f7f\u7528\u3057\u307e\u3059\u3002\u3053\u306e\u547d\u4ee4\u306f <code>rax<\/code> \u81ea\u4f53\u3092\u5909\u66f4\u305b\u305a\u306b Z \u30d5\u30e9\u30b0\u3092\u8a2d\u5b9a\u3057\u307e\u3059\u3002<\/p>\n\n\n\n<p>\u6b21\u306e\u30bf\u30b9\u30af\u306f\u3001\u30b7\u30b9\u30c6\u30e0\u30b3\u30fc\u30eb\u306e\u30a8\u30e9\u30fc\u3092\u4fee\u6b63\u3059\u308b\u3053\u3068\u3067\u3059\u3002\u30d7\u30ed\u30bb\u30b9\u3092\u614e\u91cd\u306b\u78ba\u8a8d\u3057\u305f\u7d50\u679c\u3001\u4e00\u90e8\u306e\u30ed\u30b8\u30c3\u30af\u304c\u4e0d\u8981\u3067\u3042\u308a\u3001\u4e00\u90e8\u306e\u30ec\u30b8\u30b9\u30bf\u304c\u9593\u9055\u3063\u3066\u8a2d\u5b9a\u3055\u308c\u3066\u3044\u308b\u3053\u3068\u304c\u308f\u304b\u308a\u307e\u3057\u305f\u3002<\/p>\n\n\n\n<div class=\"swell-block-tab is-style-balloon\" data-width-pc=\"flex-auto\" data-width-sp=\"auto\"><ul class=\"c-tabList\" role=\"tablist\"><li class=\"c-tabList__item\" role=\"presentation\"><button role=\"tab\" class=\"c-tabList__button\" aria-selected=\"true\" aria-controls=\"tab-793b7728-0\" data-onclick=\"tabControl\">\u672a\u5b8c\u6210\u306a\u30b3\u30fc\u30c9<\/button><\/li><li class=\"c-tabList__item\" role=\"presentation\"><button role=\"tab\" class=\"c-tabList__button\" aria-selected=\"false\" aria-controls=\"tab-793b7728-1\" data-onclick=\"tabControl\">\u5b9f\u73fe\u3057\u305f\u30b3\u30fc\u30c9<\/button><\/li><\/ul><div class=\"c-tabBody\">\n<div id=\"tab-793b7728-0\" class=\"c-tabBody__item\" aria-hidden=\"false\">\n<pre class=\"wp-block-code\"><code class=\"x86asm\">\/**\n * 
\u6a19\u6e96\u51fa\u529b\u306b64\u30d3\u30c3\u30c8\u6574\u6570\u3092\u51fa\u529b\u3059\u308b\n *\/\n\n.section .bss\n\/\/    .lcomm num, 8       \/\/ 64\u30d3\u30c3\u30c8\u6574\u6570\u3092\u4fdd\u5b58\u3059\u308b\u9818\u57df\n    .lcomm buffer, 21   \/\/ 20\u6841\u306e\u6570\u5b57 + 1\u3064\u306e\u30cc\u30eb\u6587\u5b57\u306e\u305f\u3081\u306e\u51fa\u529b\u30d0\u30c3\u30d5\u30a1\n\n.section .data\n    newline: .byte 0xA      \/\/ \u6539\u884c\u6587\u5b57\n\n.section .text\n    .globl _start\n\n_start:\n    \/\/ \u51fa\u529b\u3059\u308b\u6570\u5b57\u3092\u521d\u671f\u5316\n    mov $1234567890123456789, %rax\n\/\/    mov %rax, num(%rip)\n\n    \/\/ \u6574\u6570\u3092\u6587\u5b57\u5217\u306b\u5909\u63db\n\/\/    mov num(%rip), %rax\n    lea buffer+20(%rip), %rdi   \/\/ \u51fa\u529b\u6587\u5b57\u5217\u306e\u6700\u5f8c\u306e\u6587\u5b57\u306e\u30a2\u30c9\u30ec\u30b9\u3092rdi\u30ec\u30b8\u30b9\u30bf\u306b\u683c\u7d0d\n    movb $0, (%rdi)             \/\/ \u6700\u5f8c\u306e\u6587\u5b57\u306b'\\0'\u3092\u30bb\u30c3\u30c8\u3057\u3001\u7d42\u4e86\u3092\u30de\u30fc\u30af\n\nconvert_loop:                   \/\/ \u6574\u6570\u3092\u51fa\u529b\u7528\u6587\u5b57\u5217\u306b\u5909\u63db\u3059\u308b\u30eb\u30fc\u30d7\n\/\/    mov %rax, %rdx\n    xor %rdx, %rdx\n    mov $10, %rcx\n    div %rcx                    \/\/ rdx = rax % 10, rax = rax \/ 10\n    add $'0', %dl               \/\/ \u5bfe\u5fdc\u3059\u308bASCII\u30b3\u30fc\u30c9\u3092\u8a08\u7b97\uff08rdx\u306e\u4e0b\u4f4d8\u30d3\u30c3\u30c8\u3092dl\u3068\u547c\u3076\uff09\n    dec %rdi\n    mov %dl, (%rdi)             \/\/ \u7d50\u679c\u3092\u30e1\u30e2\u30ea\u306b\u66f8\u304d\u8fbc\u3080\n                                \/\/ \u7d42\u4e86\u6761\u4ef6\u3092\u78ba\u8a8d\uff08TODO: \u7d42\u4e86\u5224\u5b9a\u306e\u6307\u4ee4\u3092\u8ffd\u52a0\u3057\u3066\u304f\u3060\u3055\u3044\uff09\n    jnz convert_loop\n\nfind_start:                     \/\/ \u5909\u63db\u304c\u5b8c\u4e86\u3057\u305f\u3089\u3001\u6587\u5b57\u5217\u306e\u5148\u982d\u306b\u3042\u308b\u3059\u3079\u3066\u306e'0'\u3092\u30b9\u30ad\u30c3\u30d7\n    cmpb $'0', (%rdi)\n    jne print_string\n    inc %rdi\n    jmp find_start\n\nprint_string:                   \/\/ \u6587\u5b57\u5217\u306e\u51fa\u529b\u3092\u958b\u59cb\n                                \/\/ \u6587\u5b57\u5217\u306e\u9577\u3055\u3092\u8a08\u7b97\n    lea buffer+20(%rip), %rax\n    sub %rdi, %rax              \/\/ \u683c\u7d0d\u3055\u308c\u305f\u30d0\u30a4\u30c8\u6570\u3092\u8a08\u7b97\n    mov %rax, %rdx              \/\/ \u51fa\u529b\u3059\u308b\u30d0\u30a4\u30c8\u6570\u3092rdx\u306b\u683c\u7d0d\n\n    \/\/ \u30b7\u30b9\u30c6\u30e0\u30b3\u30fc\u30eb\u756a\u53f7 (sys_write)     \/\/ TODO: \u6b63\u5e38\u306b\u6587\u5b57\u5217\u3092\u51fa\u529b\u3059\u308b\u305f\u3081\u306b\u3053\u3053\u3067\u30b7\u30b9\u30c6\u30e0\u30b3\u30fc\u30eb\u3092\u4fee\u6b63\u3057\u3066\u304f\u3060\u3055\u3044\n    mov $1, %rax\n    \/\/ \u30d5\u30a1\u30a4\u30eb\u8a18\u8ff0\u5b50 (stdout)\n    mov $1, %rdi\n    \/\/ \u6587\u5b57\u5217\u306e\u30dd\u30a4\u30f3\u30bf\u3092\u8a2d\u5b9a\n    mov %rdi, %rsi\n    \/\/ \u66f8\u304d\u8fbc\u3080\u30d0\u30a4\u30c8\u6570\n    mov %rdx, %rdx\n    \/\/ \u30b7\u30b9\u30c6\u30e0\u30b3\u30fc\u30eb\u3092\u5b9f\u884c\n    syscall\n\n    \/\/ \u6539\u884c\u6587\u5b57\u3092\u51fa\u529b\n    mov $1, %rax\n    mov $1, %rdi\n    lea newline(%rip), %rsi\n    mov $1, %rdx\n    syscall\n\n    \/\/ \u30d7\u30ed\u30b0\u30e9\u30e0\u3092\u7d42\u4e86\n    mov $60, %rax  \/\/ \u30b7\u30b9\u30c6\u30e0\u30b3\u30fc\u30eb\u756a\u53f7 
(sys_exit)\n    xor %rdi, %rdi \/\/ \u7d42\u4e86\u30b3\u30fc\u30c9\n    syscall\n<\/code><\/pre>\n<\/div>\n\n\n\n<div id=\"tab-793b7728-1\" class=\"c-tabBody__item\" aria-hidden=\"true\">\n<pre class=\"wp-block-code\"><code class=\"x86asm\">\/**\n * \u6a19\u6e96\u51fa\u529b\u306b64\u30d3\u30c3\u30c8\u6574\u6570\u3092\u51fa\u529b\u3059\u308b\n *\/\n\n.section .bss\n\/\/    .lcomm num, 8       \/\/ 64\u30d3\u30c3\u30c8\u6574\u6570\u3092\u4fdd\u5b58\u3059\u308b\u9818\u57df\n    .lcomm buffer, 21   \/\/ 20\u6841\u306e\u6570\u5b57 + 1\u3064\u306e\u30cc\u30eb\u6587\u5b57\u306e\u305f\u3081\u306e\u51fa\u529b\u30d0\u30c3\u30d5\u30a1\n\n.section .data\n    newline: .byte 0xA      \/\/ \u6539\u884c\u6587\u5b57\n\n.section .text\n    .globl _start\n\n_start:\n    \/\/ \u51fa\u529b\u3059\u308b\u6570\u5b57\u3092\u521d\u671f\u5316\n    mov $1234567890123456789, %rax\n\/\/    mov %rax, num(%rip)\n\n    \/\/ \u6574\u6570\u3092\u6587\u5b57\u5217\u306b\u5909\u63db\n\/\/    mov num(%rip), %rax\n    lea buffer+20(%rip), %rdi   \/\/ \u51fa\u529b\u6587\u5b57\u5217\u306e\u6700\u5f8c\u306e\u6587\u5b57\u306e\u30a2\u30c9\u30ec\u30b9\u3092rdi\u30ec\u30b8\u30b9\u30bf\u306b\u683c\u7d0d\n    movb $0, (%rdi)             \/\/ \u6700\u5f8c\u306e\u6587\u5b57\u306b'\\0'\u3092\u30bb\u30c3\u30c8\u3057\u3001\u7d42\u4e86\u3092\u30de\u30fc\u30af\n\nconvert_loop:                   \/\/ \u6574\u6570\u3092\u51fa\u529b\u7528\u6587\u5b57\u5217\u306b\u5909\u63db\u3059\u308b\u30eb\u30fc\u30d7\n\/\/    mov %rax, %rdx\n    xor %rdx, %rdx\n    mov $10, %rcx\n    div %rcx                    \/\/ rdx = rax % 10, rax = rax \/ 10\n    add $'0', %dl               \/\/ \u5bfe\u5fdc\u3059\u308bASCII\u30b3\u30fc\u30c9\u3092\u8a08\u7b97\uff08rdx\u306e\u4e0b\u4f4d8\u30d3\u30c3\u30c8\u3092dl\u3068\u547c\u3076\uff09\n    dec %rdi\n    mov %dl, (%rdi)             \/\/ \u7d50\u679c\u3092\u30e1\u30e2\u30ea\u306b\u66f8\u304d\u8fbc\u3080\n    <span class=\"swl-marker mark_green\">test %rax, %rax<\/span>             \/\/ \u7d42\u4e86\u6761\u4ef6\u3092\u78ba\u8a8d\uff08TODO: \u7d42\u4e86\u5224\u5b9a\u306e\u6307\u4ee4\u3092\u8ffd\u52a0\u3057\u3066\u304f\u3060\u3055\u3044\uff09\n    jnz convert_loop\n\nfind_start:                     \/\/ \u5909\u63db\u304c\u5b8c\u4e86\u3057\u305f\u3089\u3001\u6587\u5b57\u5217\u306e\u5148\u982d\u306b\u3042\u308b\u3059\u3079\u3066\u306e'0'\u3092\u30b9\u30ad\u30c3\u30d7\n    cmpb $'0', (%rdi)\n    jne print_string\n    inc %rdi\n    jmp find_start\n\nprint_string:                   \/\/ \u6587\u5b57\u5217\u306e\u51fa\u529b\u3092\u958b\u59cb\n                                \/\/ \u6587\u5b57\u5217\u306e\u9577\u3055\u3092\u8a08\u7b97\n    lea buffer+20(%rip), %rax\n    sub %rdi, %rax              \/\/ \u683c\u7d0d\u3055\u308c\u305f\u30d0\u30a4\u30c8\u6570\u3092\u8a08\u7b97\n    mov %rax, %rdx              \/\/ \u51fa\u529b\u3059\u308b\u30d0\u30a4\u30c8\u6570\u3092rdx\u306b\u683c\u7d0d\n\n    \/\/ \u6587\u5b57\u5217\u306e\u30dd\u30a4\u30f3\u30bf\u3092\u8a2d\u5b9a\n    <span class=\"swl-marker mark_green\">mov %rdi, %rsi<\/span>\n    \/\/ \u66f8\u304d\u8fbc\u3080\u30d0\u30a4\u30c8\u6570\n    <span class=\"swl-marker mark_green\">\/\/ mov %rsi, %rdx<\/span>\n    \/\/ \u30b7\u30b9\u30c6\u30e0\u30b3\u30fc\u30eb\u756a\u53f7 (sys_write)     \/\/ TODO: \u6b63\u5e38\u306b\u6587\u5b57\u5217\u3092\u51fa\u529b\u3059\u308b\u305f\u3081\u306b\u3053\u3053\u3067\u30b7\u30b9\u30c6\u30e0\u30b3\u30fc\u30eb\u3092\u4fee\u6b63\u3057\u3066\u304f\u3060\u3055\u3044\n    <span class=\"swl-marker mark_green\">mov $1, %rax<\/span>\n    \/\/ 
\u30d5\u30a1\u30a4\u30eb\u8a18\u8ff0\u5b50 (stdout)\n    <span class=\"swl-marker mark_green\">mov $1, %rdi<\/span>\n    \/\/ \u30b7\u30b9\u30c6\u30e0\u30b3\u30fc\u30eb\u3092\u5b9f\u884c\n    syscall\n\n    \/\/ \u6539\u884c\u6587\u5b57\u3092\u51fa\u529b\n    mov $1, %rax\n    mov $1, %rdi\n    lea newline(%rip), %rsi\n    mov $1, %rdx\n    syscall\n\n    \/\/ \u30d7\u30ed\u30b0\u30e9\u30e0\u3092\u7d42\u4e86\n    mov $60, %rax  \/\/ \u30b7\u30b9\u30c6\u30e0\u30b3\u30fc\u30eb\u756a\u53f7 (sys_exit)\n    xor %rdi, %rdi \/\/ \u7d42\u4e86\u30b3\u30fc\u30c9\n    syscall\n<\/code><\/pre>\n<\/div>\n<\/div><\/div>\n\n\n\n<p>\u51fa\u529b<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"zsh\">\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/lab1\/build]\n\u2514\u2500$ .\/dist\/bins\/lab1_print_integer   \n1234567890123456789<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">\u884c\u5217\u4e57\u7b97\u306e\u30b3\u30fc\u30c9\u88dc\u5b8c<\/h3>\n\n\n\n<p>\u30b3\u30fc\u30c9\u3092\u3056\u3063\u3068\u898b\u3066\u307f\u308b\u3068\u3001\u5358\u306b A[$m$][$k$]\u3001B[$k$][$n$]\u3001C[$m$][$n$] \u306e\u30a2\u30c9\u30ec\u30b9\u3092\u6b63\u3057\u304f\u8a08\u7b97\u3059\u308b\u5fc5\u8981\u304c\u3042\u308b\u3053\u3068\u304c\u7c21\u5358\u306b\u308f\u304b\u308a\u307e\u3059\u3002\u914d\u5217\u306f\u7269\u7406\u7684\u306b\u306f\u7dda\u5f62\u7684\u306b\u4fdd\u5b58\u3055\u308c\u3066\u3044\u307e\u3059\u3002$m, n, k$ \u306e\u6b21\u5143\u304c <code>DIM_M<\/code>\u3001<code>DIM_N<\/code>\u3001<code>DIM_K<\/code> \u3068\u6307\u5b9a\u3055\u308c\u3066\u3044\u308b\u5834\u5408\u3001\u7dda\u5f62\u30a2\u30c9\u30ec\u30b9\u306f\u6b21\u306e\u3088\u3046\u306b\u89e3\u6c7a\u3067\u304d\u307e\u3059\u3002\\begin{align*} \\text{A}[m][k]&amp;: \\text{base\\_A} + 4\\times(m \\times\\text{DIM\\_K} + k)\\\\ \\text{B}[k][n]&amp;: \\text{base\\_B} + 4\\times(k \\times\\text{DIM\\_N} + n)\\\\ \\text{C}[m][n]&amp;: \\text{base\\_C} + 4\\times(m \\times\\text{DIM\\_N} + n) \\end{align*}<\/p>\n\n\n\n<p>\u3053\u3053\u3067\u3001\u5404\u5358\u7cbe\u5ea6\u306e\u6570\u5024\u306f4\u30d0\u30a4\u30c8\u3092\u5360\u3081\u308b\u305f\u3081\u30014\u3092\u639b\u3051\u3066\u3044\u307e\u3059\u3002<\/p>\n\n\n\n<p>x86\u3067\u306f\u3001\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\u3092\u4f7f\u7528\u3057\u3066 $m, n, k$ \u3092\u8aad\u307f\u8fbc\u3080\u3053\u3068\u304c\u3067\u304d\u307e\u3059\u3002\u914d\u5217\u306e\u30d9\u30fc\u30b9\u30a2\u30c9\u30ec\u30b9\u3084 <code>DIM_M<\/code>\u3001<code>DIM_N<\/code>\u3001<code>DIM_K<\/code> \u306f\u30de\u30af\u30ed\u3068\u3057\u3066\u4e0e\u3048\u3089\u308c\u3066\u3044\u307e\u3059\u3002<\/p>\n\n\n\n<p>A[$m$][$k$] \u3092\u4f8b\u306b\u53d6\u308b\u3068\u3001\u6b21\u306e\u3088\u3046\u306b\u306a\u308a\u307e\u3059\u3002\\begin{align*} &amp;\\textbf{MOV} &amp;\\text{loop\\_m, mat\\_elem\\_idx}\\\\ &amp;\\textbf{IMUL} &amp;\\text{DIM\\_K, mat\\_elem\\_idx}\\\\ &amp;\\textbf{ADD} &amp;\\text{loop\\_k, mat\\_elem\\_idx}\\\\ &amp;\\textbf{flds} &amp;\\text{(MAT\\_A, mat\\_elem\\_idx, 4)} \\end{align*}\u200b<\/p>\n\n\n\n<p>\u5b9f\u969b\u306e\u3068\u3053\u308d\u3001\u5fc5\u8981\u306a\u306e\u306f\u8981\u7d20\u306e\u7dda\u5f62\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\u3092\u89e3\u6c7a\u3059\u308b\u3053\u3068\u3060\u3051\u3067\u3059\u3002\u305d\u3057\u3066 <code>flds<\/code> \u547d\u4ee4\u304c\u81ea\u52d5\u7684\u306b4\u3092\u639b\u3051\u3066 <code>base_A<\/code> \u3092\u8a2d\u5b9a\u3057\u307e\u3059\u3002\u305d\u306e\u305f\u3081\u3001\u5358\u306b $m \\times\\text{DIM\\_K} + k$ 
\u3092\u89e3\u6c7a\u3059\u308b\u3060\u3051\u3067\u5341\u5206\u3067\u3059\u3002\u4ed6\u306e\u8981\u7d20\u3082\u540c\u69d8\u306e\u30ed\u30b8\u30c3\u30af\u306b\u5f93\u3044\u307e\u3059\u3002<\/p>\n\n\n\n<div class=\"swell-block-tab is-style-balloon\" data-width-pc=\"flex-auto\" data-width-sp=\"auto\"><ul class=\"c-tabList\" role=\"tablist\"><li class=\"c-tabList__item\" role=\"presentation\"><button role=\"tab\" class=\"c-tabList__button\" aria-selected=\"true\" aria-controls=\"tab-c46472cb-0\" data-onclick=\"tabControl\">\u672a\u5b8c\u6210\u306a\u30b3\u30fc\u30c9<\/button><\/li><li class=\"c-tabList__item\" role=\"presentation\"><button role=\"tab\" class=\"c-tabList__button\" aria-selected=\"false\" aria-controls=\"tab-c46472cb-1\" data-onclick=\"tabControl\">\u5b9f\u73fe\u3057\u305f\u30b3\u30fc\u30c9<\/button><\/li><\/ul><div class=\"c-tabBody\">\n<div id=\"tab-c46472cb-0\" class=\"c-tabBody__item\" aria-hidden=\"false\">\n<pre class=\"wp-block-code\"><code class=\"x86asm\">.text;\n.p2align 2;\n.global gemm_kernel;\n.type gemm_kernel, %function;\n\n\/\/ \u4ee5\u4e0b\u306f\u30de\u30af\u30ed\u5b9a\u7fa9\n#define     MAT_C               %rdi    \/\/ \u884c\u5217C\u306e\u30a2\u30c9\u30ec\u30b9\n#define     MAT_A               %rsi    \/\/ \u884c\u5217A\u306e\u30a2\u30c9\u30ec\u30b9\n#define     MAT_B               %r14    \/\/ \u884c\u5217B\u306e\u30a2\u30c9\u30ec\u30b9\n#define     DIM_M               %rcx    \/\/ \u884c\u5217C\u306e\u884c\u6570 (M)\n#define     DIM_N               %r8     \/\/ \u884c\u5217C\u306e\u5217\u6570 (N)\n#define     DIM_K               %r9     \/\/ \u5171\u901a\u6b21\u5143 (K)\n#define     loop_m              %r10    \/\/ M\u65b9\u5411\u306e\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\n#define     loop_k              %r11    \/\/ K\u65b9\u5411\u306e\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\n#define     loop_n              %r12    \/\/ N\u65b9\u5411\u306e\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\n#define     mat_elem_idx        %r13    \/\/ \u884c\u5217\u8981\u7d20\u306e\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\u8a08\u7b97\u7528\n\n.macro PUSHD                                        \/\/ \u30ec\u30b8\u30b9\u30bf\u5024\u3092\u4fdd\u5b58\n    push %rax\n    push %rbx\n    push %rcx\n    push %rdx\n    push %rsi\n    push %rdi\n    push %rbp\n    push %r8\n    push %r9\n    push %r10\n    push %r11\n    push %r12\n    push %r13\n    push %r14\n    push %r15\n.endm\n\n.macro POPD                                        \/\/ \u30ec\u30b8\u30b9\u30bf\u5024\u3092\u5fa9\u5143\n    pop %r15\n    pop %r14\n    pop %r13\n    pop %r12\n    pop %r11\n    pop %r10\n    pop %r9\n    pop %r8\n    pop %rbp\n    pop %rdi\n    pop %rsi\n    pop %rdx\n    pop %rcx\n    pop %rbx\n    pop %rax\n.endm\n\n.macro GEMM_INIT                                   \/\/ \u521d\u671f\u5316\n    \/\/ TODO: \u884c\u5217B\u306e\u30a2\u30c9\u30ec\u30b9\u3092MAT_B\u30de\u30af\u30ed\u306b\u5bfe\u5fdc\u3059\u308b\u30ec\u30b8\u30b9\u30bf\u306b\u4fdd\u5b58\n\n    xor loop_m, loop_m                              \/\/ M\u65b9\u5411\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\u3092\u30af\u30ea\u30a2\n    xor loop_k, loop_k                              \/\/ K\u65b9\u5411\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\u3092\u30af\u30ea\u30a2\n    xor loop_n, loop_n                              \/\/ N\u65b9\u5411\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\u3092\u30af\u30ea\u30a2\n.endm\n\n.macro DO_GEMM                                      \/\/ kij\u65b9\u5f0f\u3067\u884c\u5217\u7a4d\u3092\u8a08\u7b97\nDO_LOOP_K:                    
                      \/\/ \u6700\u5916\u5c64\u306eK\u6b21\u5143\u306e\u30eb\u30fc\u30d7\n    xor loop_m, loop_m                              \/\/ M\u65b9\u5411\u306e\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\u3092\u30af\u30ea\u30a2\n\nDO_LOOP_M:                                          \/\/ M\u6b21\u5143\u306e\u30eb\u30fc\u30d7\n    xor loop_n, loop_n                              \/\/ N\u65b9\u5411\u306e\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\u3092\u30af\u30ea\u30a2\n\n    \/\/ TODO: A&#91;m]&#91;k]\u3092\u8aad\u307f\u8fbc\u3080\n\n\n\n    flds (MAT_A, mat_elem_idx, 4)                   \/\/ A&#91;m]&#91;k]\u3092st(0)\u306b\u30ed\u30fc\u30c9\u3002flds\u306f\u30c7\u30fc\u30bf\u3092\u30b9\u30bf\u30c3\u30af\u30c8\u30c3\u30d7st(0)\u306b\u306e\u307f\u30ed\u30fc\u30c9\u53ef\u80fd\u3002\n                                                    \/\/ \u5143\u306est(0)\u306fst(1)\u306b\u79fb\u52d5\u3002\u30b9\u30bf\u30c3\u30af\u304c\u6e80\u676f\u306e\u5834\u5408\u3001\u30d7\u30c3\u30b7\u30e5\u5931\u6557\u3002\n\nDO_LOOP_N:\n    \/\/ TODO: B&#91;k]&#91;n]\u3092\u8aad\u307f\u8fbc\u3080\n\n\n\n    flds (MAT_B, mat_elem_idx, 4)                   \/\/ B&#91;k]&#91;n]\u3092\u30ed\u30fc\u30c9\n\n    fmul %st(1), %st(0)                             \/\/ A&#91;m]&#91;k] * B&#91;k]&#91;n]\u3092\u8a08\u7b97\n\n    \/\/ TODO: C&#91;m]&#91;n]\u3092\u8aad\u307f\u8fbc\u3080\n\n\n\n    flds (MAT_C, mat_elem_idx, 4)                   \/\/ C&#91;m]&#91;n]\u3092\u30ed\u30fc\u30c9\n\n    faddp %st(1), %st(0)                            \/\/ C&#91;m]&#91;n] + A&#91;m]&#91;k] * B&#91;k]&#91;n]\u3092\u8a08\u7b97\n    fstps (MAT_C, mat_elem_idx, 4)                  \/\/ \u7d50\u679c\u3092C&#91;m]&#91;n]\u306b\u66f8\u304d\u623b\u3059\n\n    add $1, loop_n                                  \/\/ N\u65b9\u5411\u306e\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\u3092\u30a4\u30f3\u30af\u30ea\u30e1\u30f3\u30c8\n    cmp DIM_N, loop_n\n    jl DO_LOOP_N\n\n    fstp %st(0)                                     \/\/ st(0)\u3092\u30af\u30ea\u30a2\u3002\u884c\u5217A\u306e\u8981\u7d20\u306f\u3053\u308c\u4ee5\u4e0a\u4f7f\u7528\u3057\u306a\u3044\n    add $1, loop_m                                  \/\/ M\u65b9\u5411\u306e\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\u3092\u30a4\u30f3\u30af\u30ea\u30e1\u30f3\u30c8\n    cmp DIM_M, loop_m\n    jl DO_LOOP_M\n\n    add $1, loop_k                                  \/\/ K\u65b9\u5411\u306e\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\u3092\u30a4\u30f3\u30af\u30ea\u30e1\u30f3\u30c8\n    cmp DIM_K, loop_k\n    jl DO_LOOP_K\n.endm\n\ngemm_kernel:\n    PUSHD                                           \/\/ \u30ec\u30b8\u30b9\u30bf\u5024\u3092\u4fdd\u5b58\n    GEMM_INIT                                       \/\/ \u521d\u671f\u5316\n    DO_GEMM                                         \/\/ \u884c\u5217\u7a4d\u306e\u8a08\u7b97\n    POPD                                            \/\/ \u30ec\u30b8\u30b9\u30bf\u5024\u3092\u5fa9\u5143\n    ret                                             \/\/ \u95a2\u6570\u304b\u3089\u623b\u308b\n<\/code><\/pre>\n<\/div>\n\n\n\n<div id=\"tab-c46472cb-1\" class=\"c-tabBody__item\" aria-hidden=\"true\">\n<pre class=\"wp-block-code\"><code class=\"x86asm\">.text;\n.p2align 2;\n.global gemm_kernel;\n.type gemm_kernel, %function;\n\n\/\/ \u4ee5\u4e0b\u306f\u30de\u30af\u30ed\u5b9a\u7fa9\n#define     MAT_C               %rdi    \/\/ \u884c\u5217C\u306e\u30a2\u30c9\u30ec\u30b9\n#define     MAT_A               %rsi    \/\/ \u884c\u5217A\u306e\u30a2\u30c9\u30ec\u30b9\n#define     MAT_B               
%r14    \/\/ \u884c\u5217B\u306e\u30a2\u30c9\u30ec\u30b9\n#define     DIM_M               %rcx    \/\/ \u884c\u5217C\u306e\u884c\u6570 (M)\n#define     DIM_N               %r8     \/\/ \u884c\u5217C\u306e\u5217\u6570 (N)\n#define     DIM_K               %r9     \/\/ \u5171\u901a\u6b21\u5143 (K)\n#define     loop_m              %r10    \/\/ M\u65b9\u5411\u306e\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\n#define     loop_k              %r11    \/\/ K\u65b9\u5411\u306e\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\n#define     loop_n              %r12    \/\/ N\u65b9\u5411\u306e\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\n#define     mat_elem_idx        %r13    \/\/ \u884c\u5217\u8981\u7d20\u306e\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\u8a08\u7b97\u7528\n\n.macro PUSHD                                        \/\/ \u30ec\u30b8\u30b9\u30bf\u5024\u3092\u4fdd\u5b58\n    push %rax\n    push %rbx\n    push %rcx\n    push %rdx\n    push %rsi\n    push %rdi\n    push %rbp\n    push %r8\n    push %r9\n    push %r10\n    push %r11\n    push %r12\n    push %r13\n    push %r14\n    push %r15\n.endm\n\n.macro POPD                                        \/\/ \u30ec\u30b8\u30b9\u30bf\u5024\u3092\u5fa9\u5143\n    pop %r15\n    pop %r14\n    pop %r13\n    pop %r12\n    pop %r11\n    pop %r10\n    pop %r9\n    pop %r8\n    pop %rbp\n    pop %rdi\n    pop %rsi\n    pop %rdx\n    pop %rcx\n    pop %rbx\n    pop %rax\n.endm\n\n.macro GEMM_INIT                                   \/\/ \u521d\u671f\u5316\n    \/\/ TODO: \u884c\u5217B\u306e\u30a2\u30c9\u30ec\u30b9\u3092MAT_B\u30de\u30af\u30ed\u306b\u5bfe\u5fdc\u3059\u308b\u30ec\u30b8\u30b9\u30bf\u306b\u4fdd\u5b58\n    <span class=\"swl-marker mark_green\">MOV %rdx, MAT_B<\/span>\n    xor loop_m, loop_m                              \/\/ M\u65b9\u5411\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\u3092\u30af\u30ea\u30a2\n    xor loop_k, loop_k                              \/\/ K\u65b9\u5411\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\u3092\u30af\u30ea\u30a2\n    xor loop_n, loop_n                              \/\/ N\u65b9\u5411\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\u3092\u30af\u30ea\u30a2\n.endm\n\n.macro DO_GEMM                                      \/\/ kij\u65b9\u5f0f\u3067\u884c\u5217\u7a4d\u3092\u8a08\u7b97\nDO_LOOP_K:                                          \/\/ \u6700\u5916\u5c64\u306eK\u6b21\u5143\u306e\u30eb\u30fc\u30d7\n    xor loop_m, loop_m                              \/\/ M\u65b9\u5411\u306e\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\u3092\u30af\u30ea\u30a2\n\nDO_LOOP_M:                                          \/\/ M\u6b21\u5143\u306e\u30eb\u30fc\u30d7\n    xor loop_n, loop_n                              \/\/ N\u65b9\u5411\u306e\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\u3092\u30af\u30ea\u30a2\n\n    \/\/ TODO: A&#91;m]&#91;k]\u3092\u8aad\u307f\u8fbc\u3080\n    <span class=\"swl-marker mark_green\">MOV loop_m, mat_elem_idx<\/span>\n    <span class=\"swl-marker mark_green\">IMUL DIM_K, mat_elem_idx<\/span>\n    <span class=\"swl-marker mark_green\">ADD loop_k, mat_elem_idx<\/span>\n    flds (MAT_A, mat_elem_idx, 4)                   \/\/ A&#91;m]&#91;k]\u3092st(0)\u306b\u30ed\u30fc\u30c9\u3002flds\u306f\u30c7\u30fc\u30bf\u3092\u30b9\u30bf\u30c3\u30af\u30c8\u30c3\u30d7st(0)\u306b\u306e\u307f\u30ed\u30fc\u30c9\u53ef\u80fd\u3002\n                                                    \/\/ \u5143\u306est(0)\u306fst(1)\u306b\u79fb\u52d5\u3002\u30b9\u30bf\u30c3\u30af\u304c\u6e80\u676f\u306e\u5834\u5408\u3001\u30d7\u30c3\u30b7\u30e5\u5931\u6557\u3002\n\nDO_LOOP_N:\n    \/\/ TODO: 
B&#91;k]&#91;n]\u3092\u8aad\u307f\u8fbc\u3080\n    <span class=\"swl-marker mark_green\">MOV loop_k, mat_elem_idx<\/span>\n    <span class=\"swl-marker mark_green\">IMUL DIM_N, mat_elem_idx<\/span>\n    <span class=\"swl-marker mark_green\">ADD loop_n, mat_elem_idx<\/span>\n    flds (MAT_B, mat_elem_idx, 4)                   \/\/ B&#91;k]&#91;n]\u3092\u30ed\u30fc\u30c9\n\n    fmul %st(1), %st(0)                             \/\/ A&#91;m]&#91;k] * B&#91;k]&#91;n]\u3092\u8a08\u7b97\n\n    \/\/ TODO: C&#91;m]&#91;n]\u3092\u8aad\u307f\u8fbc\u3080\n    <span class=\"swl-marker mark_green\">MOV loop_m, mat_elem_idx<\/span>\n    <span class=\"swl-marker mark_green\">IMUL DIM_N, mat_elem_idx<\/span>\n    <span class=\"swl-marker mark_green\">ADD loop_n, mat_elem_idx<\/span>\n    flds (MAT_C, mat_elem_idx, 4)                   \/\/ C&#91;m]&#91;n]\u3092\u30ed\u30fc\u30c9\n\n    faddp %st(1), %st(0)                            \/\/ C&#91;m]&#91;n] + A&#91;m]&#91;k] * B&#91;k]&#91;n]\u3092\u8a08\u7b97\n    fstps (MAT_C, mat_elem_idx, 4)                  \/\/ \u7d50\u679c\u3092C&#91;m]&#91;n]\u306b\u66f8\u304d\u623b\u3059\n\n    add $1, loop_n                                  \/\/ N\u65b9\u5411\u306e\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\u3092\u30a4\u30f3\u30af\u30ea\u30e1\u30f3\u30c8\n    cmp DIM_N, loop_n\n    jl DO_LOOP_N\n\n    fstp %st(0)                                     \/\/ st(0)\u3092\u30af\u30ea\u30a2\u3002\u884c\u5217A\u306e\u8981\u7d20\u306f\u3053\u308c\u4ee5\u4e0a\u4f7f\u7528\u3057\u306a\u3044\n    add $1, loop_m                                  \/\/ M\u65b9\u5411\u306e\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\u3092\u30a4\u30f3\u30af\u30ea\u30e1\u30f3\u30c8\n    cmp DIM_M, loop_m\n    jl DO_LOOP_M\n\n    add $1, loop_k                                  \/\/ K\u65b9\u5411\u306e\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\u3092\u30a4\u30f3\u30af\u30ea\u30e1\u30f3\u30c8\n    cmp DIM_K, loop_k\n    jl DO_LOOP_K\n.endm\n\ngemm_kernel:\n    PUSHD                                           \/\/ \u30ec\u30b8\u30b9\u30bf\u5024\u3092\u4fdd\u5b58\n    GEMM_INIT                                       \/\/ \u521d\u671f\u5316\n    DO_GEMM                                         \/\/ \u884c\u5217\u7a4d\u306e\u8a08\u7b97\n    POPD                                            \/\/ \u30ec\u30b8\u30b9\u30bf\u5024\u3092\u5fa9\u5143\n    ret                                             \/\/ \u95a2\u6570\u304b\u3089\u623b\u308b\n<\/code><\/pre>\n<\/div>\n<\/div><\/div>\n\n\n\n<p>\u51fa\u529b<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"zsh\">\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/lab1\/build]\n\u2514\u2500$ .\/dist\/bins\/lab1_test_gemm_kernel.unittest --gtest_filter=gemm_kernel.test0\nRunning main() from \/home\/amamitsu\/Applications\/lab11\/build\/_deps\/googletest-src\/googletest\/src\/gtest_main.cc\nNote: Google Test filter = gemm_kernel.test0\n&#91;==========] Running 1 test from 1 test suite.                                                                                                                                         \n&#91;----------] Global test environment set-up.\n&#91;----------] 1 test from gemm_kernel\n&#91; RUN      ] gemm_kernel.test0\n&#91;       OK ] gemm_kernel.test0 (46 ms)\n&#91;----------] 1 test from gemm_kernel (46 ms total)\n\n&#91;----------] Global test environment tear-down\n&#91;==========] 1 test from 1 test suite ran. 
(46 ms total)\n&#91;  PASSED  ] 1 test.\n                                                                                                                                                                                                                                            \n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/lab1\/build]\n\u2514\u2500$ .\/dist\/bins\/lab1_gemm 256 256 256 \nGEMM performance info:\n                M, K, N: 256, 256, 256\n                Ops: 0.0335544\n                Total compute time(s): 1.81176\n                Cost(s): 0.00905878\n                Benchmark(Gflops): 3.70408\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">CPU\u30a4\u30f3\u30d5\u30a9<\/h3>\n\n\n\n<p>CPU\u306e\u57fa\u672c\u60c5\u5831\u3092\u51fa\u529b\u3059\u308b\u3001\u66f4\u306b\u3001L1\u30c7\u30fc\u30bf\u30ad\u30e3\u30c3\u30b7\u30e5\u3001L2\u3001L3\u30ad\u30e3\u30c3\u30b7\u30e5\u306e\u57fa\u672c\u60c5\u5831\u3092\u51fa\u529b\u3059\u308b\u3067\u3044\u3044\u3002<br>\u3053\u306e\u90e8\u5206\u3067\u306f\u3001CPU\u306e\u60c5\u5831\u3092\u78ba\u8a8d\u3059\u308b\u3060\u3051\u3067\u5341\u5206\u3067\u3059\u3002\u60c5\u5831\u306f <code>lscpu<\/code> \u30b3\u30de\u30f3\u30c9\u3092\u4f7f\u7528\u3057\u3066\u53d6\u5f97\u3067\u304d\u307e\u3059\u3002<\/p>\n\n\n\n<p>CPU 0 \u306e L1D\u3001L2\u3001L3 \u30ad\u30e3\u30c3\u30b7\u30e5\u306b\u95a2\u3059\u308b\u57fa\u672c\u7684\u306a\u60c5\u5831\u306b\u3064\u3044\u3066\u306f\u3001\u30c7\u30a3\u30ec\u30af\u30c8\u30ea\u3092\u4ee5\u4e0b\u306b\u5909\u66f4\u3057\u307e\u3059\uff1a\\[\\text{\/sys\/devices\/system\/cpu\/cpu0\/cache}\\]<\/p>\n\n\n\n<p>\u3053\u306e\u30c7\u30a3\u30ec\u30af\u30c8\u30ea\u306e\u4e0b\u306b\u3001\u5fc5\u8981\u306a\u60c5\u5831\u3092\u4fdd\u5b58\u3057\u3066\u3044\u308b\u3044\u304f\u3064\u304b\u306e\u30b5\u30d6\u30c7\u30a3\u30ec\u30af\u30c8\u30ea\u304c\u3042\u308a\u307e\u3059\u3002\u306a\u304a\u3001L1D \u30ad\u30e3\u30c3\u30b7\u30e5\u306f index0\u3001L1I \u30ad\u30e3\u30c3\u30b7\u30e5\u306f index1\u3001L2 \u30ad\u30e3\u30c3\u30b7\u30e5\u306f index2\u3001L3 \u30ad\u30e3\u30c3\u30b7\u30e5\u306f index3 \u3067\u3059\u3002\u3053\u3053\u3067\u306f\u3001<code>coherency_line_size<\/code>\u3001<code>number_of_sets<\/code>\u3001<code>ways_of_associativity<\/code> \u3068\u3044\u3063\u305f\u30d1\u30e9\u30e1\u30fc\u30bf\u3092\u53d6\u5f97\u3059\u308b\u5fc5\u8981\u304c\u3042\u308a\u307e\u3059\u3002<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"zsh\">\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/lab1\/build]\n\u2514\u2500$ lscpu\nArchitecture:             x86_64\n  CPU op-mode(s):         32-bit, 64-bit\n  Address sizes:          45 bits physical, 48 bits virtual\n  Byte Order:             Little Endian\nCPU(s):                   16\n  On-line CPU(s) list:    0-15\nVendor ID:                GenuineIntel\n  Model name:             12th Gen Intel(R) Core(TM) i9-12900K\n    CPU family:           6\n    Model:                151\n    Thread(s) per core:   1\n    Core(s) per socket:   1\n    Socket(s):            16\n    Stepping:             2\n    BogoMIPS:             6374.40\n    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid\n                           tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch pti ssbd ibrs ibpb stibp fsgsbase 
tsc_adjust bmi1 avx2 smep bmi2 erm\n                          s invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni arat umip gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities\nVirtualization features:  \n  Hypervisor vendor:      VMware\n  Virtualization type:    full\nCaches (sum of all):      \n  L1d:                    768 KiB (16 instances)\n  L1i:                    512 KiB (16 instances)\n  L2:                     20 MiB (16 instances)\n  L3:                     480 MiB (16 instances)\nNUMA:                     \n  NUMA node(s):           1\n  NUMA node0 CPU(s):      0-15\nVulnerabilities:          \n  Gather data sampling:   Not affected\n  Itlb multihit:          Not affected\n  L1tf:                   Mitigation; PTE Inversion\n  Mds:                    Mitigation; Clear CPU buffers; SMT Host state unknown\n  Meltdown:               Mitigation; PTI\n  Mmio stale data:        Not affected\n  Reg file data sampling: Vulnerable: No microcode\n  Retbleed:               Mitigation; IBRS\n  Spec rstack overflow:   Not affected\n  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl\n  Spectre v1:             Mitigation; usercopy\/swapgs barriers and __user pointer sanitization\n  Spectre v2:             Mitigation; IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI SW loop, KVM SW loop\n  Srbds:                  Not affected\n  Tsx async abort:        Not affected\n                                                                                                                                                                                                                                            \n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/lab1\/build]\n\u2514\u2500$ cd \/sys\/devices\/system\/cpu\/cpu0\/cache      \n                                                                                                                                                                                                                                            \n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;\/sys\/\u2026\/system\/cpu\/cpu0\/cache]\n\u2514\u2500$ cd index0                            \n                                                                                                                                                                                                                                            \n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;\/sys\/\u2026\/cpu\/cpu0\/cache\/index0]\n\u2514\u2500$ cat coherency_line_size  \n64\n                                                                                                                                                                                                                                            \n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;\/sys\/\u2026\/cpu\/cpu0\/cache\/index0]\n\u2514\u2500$ cat number_of_sets     \n64\n                                                                                                                                                                                                                                            \n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;\/sys\/\u2026\/cpu\/cpu0\/cache\/index0]\n\u2514\u2500$ cat ways_of_associativity\n12\n                                                                                                                                              
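# (annotation, not captured shell output) Sanity check on the index0 (L1D) values just read:
#   cache size = number_of_sets x ways_of_associativity x coherency_line_size
#              = 64 x 12 x 64 B = 48 KiB per core,
# which matches the per-instance figure implied by lscpu's "L1d: 768 KiB (16 instances)".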
                                                                                              \n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;\/sys\/\u2026\/cpu\/cpu0\/cache\/index0]\n\u2514\u2500$ cd ..\/index2             \n                                                                                                                                                                                                                                            \n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;\/sys\/\u2026\/cpu\/cpu0\/cache\/index2]\n\u2514\u2500$ cat coherency_line_size  \n64\n                                                                                                                                                                                                                                            \n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;\/sys\/\u2026\/cpu\/cpu0\/cache\/index2]\n\u2514\u2500$ cat number_of_sets       \n2048\n                                                                                                                                                                                                                                            \n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;\/sys\/\u2026\/cpu\/cpu0\/cache\/index2]\n\u2514\u2500$ cat ways_of_associativity\n10\n                                                                                                                                                                                                                                            \n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;\/sys\/\u2026\/cpu\/cpu0\/cache\/index2]\n\u2514\u2500$ cd ..\/index3             \n                                                                                                                                                                                                                                            \n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;\/sys\/\u2026\/cpu\/cpu0\/cache\/index3]\n\u2514\u2500$ cat coherency_line_size  \n64\n                                                                                                                                                                                                                                            \n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;\/sys\/\u2026\/cpu\/cpu0\/cache\/index3]\n\u2514\u2500$ cat number_of_sets       \n40960\n                                                                                                                                                                                                                                            \n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;\/sys\/\u2026\/cpu\/cpu0\/cache\/index3]\n\u2514\u2500$ cat ways_of_associativity\n12\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Perf\u306e\u4f7f\u7528<\/h3>\n\n\n\n<p>\u307e\u305a\u3001<code>perf list<\/code> \u30b3\u30de\u30f3\u30c9\u3092\u4f7f\u3063\u3066\u3001\u3042\u3089\u304b\u3058\u3081\u5b9a\u7fa9\u3055\u308c\u305f perf \u30a4\u30d9\u30f3\u30c8\u3092\u78ba\u8a8d\u3057\u307e\u3059\u3002 \u6b21\u306b\u3001<code>perf<\/code> 
\u3092\u4f7f\u7528\u3057\u3066\u884c\u5217\u7a4d\u306e\u30d1\u30d5\u30a9\u30fc\u30de\u30f3\u30b9\u3092\u8abf\u3079\u308b\u3053\u3068\u304c\u3067\u304d\u307e\u3059\u3002\u4ee5\u4e0b\u306b\u793a\u3057\u307e\u3059\u3002<\/p>\n\n\n\n<p>\u6700\u521d\u306e\u7d50\u679c\u306f\u3001\u4e00\u822c\u7684\u306a\u30d1\u30d5\u30a9\u30fc\u30de\u30f3\u30b9\u3092\u793a\u3057\u3066\u3044\u307e\u3059\u3002\u7279\u5b9a\u306e\u30a4\u30d9\u30f3\u30c8\u3092\u6307\u5b9a\u3059\u308b\u3068\u3001\u305d\u308c\u304c\u4f55\u3092\u30c6\u30b9\u30c8\u3057\u3066\u3044\u308b\u304b\u306b\u3064\u3044\u3066\u306e\u30d1\u30d5\u30a9\u30fc\u30de\u30f3\u30b9\u3092\u78ba\u8a8d\u3067\u304d\u307e\u3059\u3002<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"zsh\">\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;\/sys\/\u2026\/cpu\/cpu0\/cache\/index3]\n\u2514\u2500$ perf list\n\nList of pre-defined events (to be used in -e or -M):\n\n  duration_time                                      &#91;Tool event]\n  user_time                                          &#91;Tool event]\n  system_time                                        &#91;Tool event]\n  mem-loads OR cpu_atom\/mem-loads\/                   &#91;Kernel PMU event]\n  mem-stores OR cpu_atom\/mem-stores\/                 &#91;Kernel PMU event]\n  ref-cycles OR cpu_atom\/ref-cycles\/                 &#91;Kernel PMU event]\n  topdown-bad-spec OR cpu_atom\/topdown-bad-spec\/     &#91;Kernel PMU event]\n  topdown-be-bound OR cpu_atom\/topdown-be-bound\/     &#91;Kernel PMU event]\n  topdown-fe-bound OR cpu_atom\/topdown-fe-bound\/     &#91;Kernel PMU event]\n  topdown-retiring OR cpu_atom\/topdown-retiring\/     &#91;Kernel PMU event]\n  mem-loads OR cpu_core\/mem-loads\/                   &#91;Kernel PMU event]\n  mem-loads-aux OR cpu_core\/mem-loads-aux\/           &#91;Kernel PMU event]\n  mem-stores OR cpu_core\/mem-stores\/                 &#91;Kernel PMU event]\n  ref-cycles OR cpu_core\/ref-cycles\/                 &#91;Kernel PMU event]\n  slots OR cpu_core\/slots\/                           &#91;Kernel PMU event]\n  topdown-bad-spec OR cpu_core\/topdown-bad-spec\/     &#91;Kernel PMU event]\n  topdown-be-bound OR cpu_core\/topdown-be-bound\/     &#91;Kernel PMU event]\n  topdown-br-mispredict OR cpu_core\/topdown-br-mispredict\/&#91;Kernel PMU event]\n  topdown-fe-bound OR cpu_core\/topdown-fe-bound\/     &#91;Kernel PMU event]\n  topdown-fetch-lat OR cpu_core\/topdown-fetch-lat\/   &#91;Kernel PMU event]\n  topdown-heavy-ops OR cpu_core\/topdown-heavy-ops\/   &#91;Kernel PMU event]\n  topdown-mem-bound OR cpu_core\/topdown-mem-bound\/   &#91;Kernel PMU event]\n  topdown-retiring OR cpu_core\/topdown-retiring\/     &#91;Kernel PMU event]\n  msr\/pperf\/                                         &#91;Kernel PMU event]\n  msr\/smi\/                                           &#91;Kernel PMU event]\n  msr\/tsc\/                                           &#91;Kernel PMU event]\n\ncache:\n  longest_lat_cache.miss\n       &#91;Counts the number of cacheable memory requests that miss in the LLC. Counts on a per core basis. Unit: cpu_atom]\n  longest_lat_cache.reference\n       &#91;Counts the number of cacheable memory requests that access the LLC. Counts on a per core basis. Unit: cpu_atom]\n  mem_bound_stalls.ifetch\n       &#91;Counts the number of cycles the core is stalled due to an instruction cache or TLB miss which hit in the L2,LLC,DRAM or MMIO (Non-DRAM). 
Unit: cpu_atom]\n  mem_bound_stalls.ifetch_dram_hit\n       &#91;Counts the number of cycles the core is stalled due to an instruction cache or TLB miss which hit in DRAM or MMIO (Non-DRAM). Unit: cpu_atom]\n  mem_bound_stalls.ifetch_l2_hit\n       &#91;Counts the number of cycles the core is stalled due to an instruction cache or TLB miss which hit in the L2 cache. Unit: cpu_atom]\n  mem_bound_stalls.ifetch_llc_hit\n                                                                                                                                                                                                                                            \n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;\/sys\/\u2026\/cpu\/cpu0\/cache\/index3]\n\u2514\u2500$ cd \/home\/amamitsu\/Applications\/lab1\/build\/         \n                                                                                                                                                                                                                                                                                                                                                                                                                                                                \n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/lab1\/build]\n\u2514\u2500$ sudo perf stat .\/lab1_gemm 256 256 256             \nGEMM performance info:\n                M, K, N: 256, 256, 256\n                Ops: 0.0335544\n                Total compute time(s): 2.64601\n                Cost(s): 0.0132301\n                Benchmark(Gflops): 2.53623\n\n Performance counter stats for '.\/lab1_gemm 256 256 256':\n\n          2,783.61 msec task-clock                #    1.000 CPUs utilized          \n                 5      context-switches          #    0.002 K\/sec                  \n                 0      cpu-migrations            #    0.000 K\/sec                  \n               315      page-faults               #    0.113 K\/sec                  \n    12,192,461,909      cycles                    #    4.380 GHz                    \n    49,467,952,031      instructions              #    4.06  insn per cycle         \n     3,539,861,350      branches                  # 1271.682 M\/sec                  \n        13,868,932      branch-misses             #    0.39% of all branches        \n\n       2.783398170 seconds time elapsed\n\n       2.779785000 seconds user\n       0.003999000 seconds sys\n                                                                                                                                                                                                                                            \n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/lab1\/build]\n\u2514\u2500$ sudo perf stat -e L1-dcache-loads,L1-dcache-load-misses,dTLB-loads,dTLB-load-misses .\/lab1_gemm 256 256 256\nGEMM performance info:\n                M, K, N: 256, 256, 256\n                Ops: 0.0335544\n                Total compute time(s): 2.5879\n                Cost(s): 0.0129395\n                Benchmark(Gflops): 2.59318\n\n Performance counter stats for '.\/lab1_gemm 256 256 256':\n\n     7,065,909,560      L1-dcache-loads                                             \n       232,261,146      L1-dcache-load-misses     #    3.29% of all L1-dcache hits  \n     7,065,909,560      dTLB-loads                                                  \n             4,167      dTLB-load-misses          #    
0.00% of all dTLB cache hits \n\n       2.721420500 seconds time elapsed\n\n       2.717708000 seconds user\n       0.004002000 seconds sys<\/code><\/pre>\n\n\n\n\n\n\n\n<h2 class=\"wp-block-heading\">Lab2 \u30ad\u30e3\u30c3\u30b7\u30e5\u3001\u30eb\u30fc\u30d7\u3001\u304a\u3088\u3073\u30d6\u30ed\u30c3\u30ad\u30f3\u30b0\u3092\u4f7f\u7528\u3057\u3066\u884c\u5217\u4e57\u7b97\u3092\u6700\u9069\u5316<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Perf\u3067\u884c\u5217\u4e57\u7b97\u30d1\u30d5\u30a9\u30fc\u30de\u30f3\u30b9\u306e\u30dc\u30c8\u30eb\u30cd\u30c3\u30af\u3092\u7279\u5b9a<\/h3>\n\n\n\n<p><code>perf<\/code> \u3092\u4f7f\u7528\u3057\u3066\u30d7\u30ed\u30b0\u30e9\u30e0\u306e\u30dc\u30c8\u30eb\u30cd\u30c3\u30af\u3092\u78ba\u8a8d\u3057\u305f\u7d50\u679c\u3001L1\/L2\u30ad\u30e3\u30c3\u30b7\u30e5\u306e\u30df\u30b9\u304c\u4e3b\u306a\u539f\u56e0\u3067\u3042\u308b\u3053\u3068\u304c\u308f\u304b\u308a\u307e\u3057\u305f\u3002<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"zsh\">\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/lab2\/build]\n\u2514\u2500$ sudo perf stat -e l2_rqsts.code_rd_hit,l2_rqsts.references,l1d.replacement,l1d_pend_miss.pending,l2_rqsts.pf_hit,l2_rqsts.pf_miss,L1-dcache-loads,L1-dcache-load-misses .\/dist\/bins\/lab2_gemm_baseline 256 1024 256\nGEMM performance info:\n                M, K, N: 256, 1024, 256\n                Ops: 0.134218\n                Total compute time(s): 8.5102\n                Cost(s): 0.042551\n                Benchmark(Gflops): 3.15428\n\nPerformance counter stats for \u2018.\/dist\/bins\/lab2_gemm_baseline 256 1024 256\u2019:\n\n           209,586     l2_rqsts.code_rd_hit                                    (49.99%)\n     1,850,479,513     l2_rqsts.references                                     (50.02%)\n       928,894,971     l1d.replacement                                         (50.03%)\n     3,827,521,915     l1d_pend_miss.pending                                   (50.03%)\n     1,137,852,364     l2_rqsts.pf_hit                                         (50.01%)\n       631,431,485     l2_rqsts.pf_miss                                        (49.98%)\n    28,253,616,176     L1-dcache-loads                                         (49.97%)\n       929,403,945     L1-dcache-load-misses # 3.29% of all L1-dcache accesses (49.97%)\n\n      15.272692474 seconds time elapsed\n\n      15.255466000 seconds user\n       0.000000000 seconds sys<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Prefetch\u3092\u5229\u7528\u3057\u3066\u884c\u5217\u4e57\u7b97\u306e\u6027\u80fd\u3092\u6700\u9069\u5316\u3059\u308b<\/h3>\n\n\n\n<p>\u3053\u306e\u554f\u984c\u306b\u5bfe\u51e6\u3059\u308b\u305f\u3081\u3001\u5fc5\u8981\u306a\u30c7\u30fc\u30bf\u3092\u4e8b\u524d\u306b\u6e96\u5099\u3059\u308b\u30d7\u30ea\u30d5\u30a7\u30c3\u30c1\uff08<code>prefetch<\/code>\uff09\u3092\u4f7f\u7528\u3067\u304d\u307e\u3059\u3002\u305f\u3060\u3057\u3001\u30ad\u30e3\u30c3\u30b7\u30e5\u30e1\u30e2\u30ea\u306b\u306f\u5236\u9650\u304c\u3042\u308b\u305f\u3081\u3001\u30d7\u30ea\u30d5\u30a7\u30c3\u30c1\u30b3\u30de\u30f3\u30c9\u3092\u983b\u7e41\u306b\u4f7f\u7528\u3057\u3059\u304e\u306a\u3044\u3088\u3046\u6ce8\u610f\u304c\u5fc5\u8981\u3067\u3059\u3002\u6700\u7d42\u7684\u306b\u3001\u5404 <code>m<\/code> \u30eb\u30fc\u30d7\u3067 A[$m+1$][$k$] \u3068 C[$m$][$n+1$] 
\u3092\u30d7\u30ea\u30d5\u30a7\u30c3\u30c1\u3059\u308b\u3053\u3068\u306b\u6c7a\u3081\u307e\u3057\u305f\u3002\u3053\u306e\u624b\u6cd5\u306b\u3088\u308a\u30d1\u30d5\u30a9\u30fc\u30de\u30f3\u30b9\u304c\u308f\u305a\u304b\u306b\u5411\u4e0a\u3057\u307e\u3059\u304c\u3001\u5b9f\u969b\u306b\u306f\u305d\u308c\u307b\u3069\u5927\u304d\u306a\u5f71\u97ff\u306f\u3042\u308a\u307e\u305b\u3093\u3002\u307e\u305f\u3001\u3053\u306e\u65b9\u6cd5\u306f\u7279\u5b9a\u306e\u72b6\u6cc1\uff08\u30b5\u30a4\u30ba\uff09\u3067\u306e\u307f\u6709\u52b9\u3067\u3059\u3002<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"x86asm\">.text;\n.p2align 2;\n.global gemm_kernel_opt_prefetch;\n.type gemm_kernel_opt_prefetch, %function;\n\n#define     MAT_C               %rdi\n#define     MAT_A               %rsi\n#define     MAT_B               %r14\n#define     DIM_M               %rcx\n#define     DIM_N               %r8\n#define     DIM_K               %r9\n#define     loop_m              %r10\n#define     loop_k              %r11\n#define     loop_n              %r12\n#define     mat_elem_idx        %r13\n#define     prefetch_elem_idx   %r15\n\n\n.macro PUSHD\n    push %rax\n    push %rbx\n    push %rcx\n    push %rdx\n    push %rsi\n    push %rdi\n    push %rbp\n    push %r8\n    push %r9\n    push %r10\n    push %r11\n    push %r12\n    push %r13\n    push %r14\n    push %r15\n.endm\n\n.macro POPD\n    pop %r15\n    pop %r14\n    pop %r13\n    pop %r12\n    pop %r11\n    pop %r10\n    pop %r9\n    pop %r8\n    pop %rbp\n    pop %rdi\n    pop %rsi\n    pop %rdx\n    pop %rcx\n    pop %rbx\n    pop %rax\n.endm\n\n.macro GEMM_INIT\n    mov %rdx, MAT_B\n\n    xor loop_m, loop_m\n    xor loop_k, loop_k\n    xor loop_n, loop_n\n.endm\n\n.macro DO_GEMM\nDO_LOOP_K:\n    xor loop_m, loop_m\n\nDO_LOOP_M:\n    xor loop_n, loop_n\n\n    mov loop_m, %rax\n    mul DIM_K\n    mov %rax, mat_elem_idx\n    add loop_k, mat_elem_idx                    \/\/ m*K+k\u3092\u8a08\u7b97\n    flds (MAT_A, mat_elem_idx, 4)               \/\/ A&#91;m]&#91;k]\u3092\u30ed\u30fc\u30c9\n\n    \/\/ A&#91;m+1]&#91;k]\u3092Prefetch\n    mov loop_m, %rax\n    add $1, %rax\n    mul DIM_K\n    mov %rax, prefetch_elem_idx\n    add loop_k, prefetch_elem_idx\n    prefetcht0 (MAT_A, prefetch_elem_idx, 4)    \/\/ A&#91;m+1]&#91;k]\u3092Prefetch\n    \n    mov DIM_N, %rax\n    mul loop_m\n    add $1, %rax\n    mov %rax, prefetch_elem_idx\n    add loop_n, prefetch_elem_idx\n    prefetcht0 (MAT_C, prefetch_elem_idx, 4)    \/\/ C&#91;m]&#91;n+1]\u3092Prefetch\n    \nDO_LOOP_N:\n    mov DIM_N, %rax\n    mul loop_k\n    mov %rax, mat_elem_idx\n    add loop_n, mat_elem_idx                    \/\/ k*N+n\u3092\u8a08\u7b97\n    flds (MAT_B, mat_elem_idx, 4)               \/\/ B&#91;k]&#91;n]\u3092\u30ed\u30fc\u30c9\n    fmul %st(1), %st(0)                         \/\/ A&#91;m]&#91;k] * B&#91;k]&#91;n]\u3092\u8a08\u7b97\n\n    mov DIM_N, %rax\n    mul loop_m\n    mov %rax, mat_elem_idx\n    add loop_n, mat_elem_idx                    \/\/ m*N+n\u3092\u8a08\u7b97\n    flds (MAT_C, mat_elem_idx, 4)               \/\/ C&#91;m]&#91;n]\u3092\u30ed\u30fc\u30c9\n\n\n    faddp %st(1), %st(0)                        \/\/ C&#91;m]&#91;n] + A&#91;m]&#91;k] * B&#91;k]&#91;n]\u3092\u8a08\u7b97\n    fstps (MAT_C, mat_elem_idx, 4)\n\n    add $1, loop_n\n    cmp DIM_N, loop_n\n    jl DO_LOOP_N\n\n    fstp %st(0)\n    add $1, loop_m\n    cmp DIM_M, loop_m\n    jl DO_LOOP_M\n\n    add $1, loop_k\n    cmp DIM_K, loop_k\n    jl DO_LOOP_K\n.endm\n\n\ngemm_kernel_opt_prefetch:\n    PUSHD\n    
GEMM_INIT\n    DO_GEMM\n    POPD\n    ret<\/code><\/pre>\n\n\n\n<p>\u51fa\u529b<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/lab2\/build]\n\u2514\u2500$ .\/dist\/bins\/lab2_gemm_kernel_opt_prefetch.unittest              \nRunning main() from \/home\/amamitsu\/Applications\/lab2\/build\/_deps\/googletest-src\/googletest\/src\/gtest_main.cc\n&#91;==========] Running 4 tests from 1 test suite.\n&#91;----------] Global test environment set-up.\n&#91;----------] 4 tests from gemm_kernel_opt_prefetch\n&#91; RUN      ] gemm_kernel_opt_prefetch.test0\n&#91;       OK ] gemm_kernel_opt_prefetch.test0 (12 ms)\n&#91; RUN      ] gemm_kernel_opt_prefetch.test1\n&#91;       OK ] gemm_kernel_opt_prefetch.test1 (3 ms)\n&#91; RUN      ] gemm_kernel_opt_prefetch.test2\n&#91;       OK ] gemm_kernel_opt_prefetch.test2 (3 ms)\n&#91; RUN      ] gemm_kernel_opt_prefetch.test3\n&#91;       OK ] gemm_kernel_opt_prefetch.test3 (0 ms)\n&#91;----------] 4 tests from gemm_kernel_opt_prefetch (20 ms total)\n\n&#91;----------] Global test environment tear-down\n&#91;==========] 4 tests from 1 test suite ran. (20 ms total)\n&#91;  PASSED  ] 4 tests.\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/lab2\/build]\n\u2514\u2500$ .\/dist\/bins\/lab2_gemm_opt_prefetch 2048 64 64 \n--- Performance before prefetch optimization ---\nGEMM performance info:\n                M, K, N: 2048, 64, 64\n                Ops: 0.0167772\n                Total compute time(s): 1.09243\n                Cost(s): 0.00546215\n                Benchmark(Gflops): 3.07154\n--- Performance for after prefetch optimization ---\nGEMM performance info:\n                M, K, N: 2048, 64, 64\n                Ops: 0.0167772\n                Total compute time(s): 1.03568\n                Cost(s): 0.00517838\n                Benchmark(Gflops): 3.23985\n----------------------------\nPerformance difference(Gflops): 0.168311<\/code><\/pre>\n\n\n\n<p>\u30d1\u30d5\u30a9\u30fc\u30de\u30f3\u30b9\u306f5.5%\u5411\u4e0a\u3057\u307e\u3057\u305f\u3002<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">\u30eb\u30fc\u30d7\u3068\u30d6\u30ed\u30c3\u30ad\u30f3\u30b0\u3092\u5229\u7528\u3057\u3066\u884c\u5217\u4e57\u7b97\u6027\u80fd\u3092\u5411\u4e0a\u3055\u305b\u308b<\/h3>\n\n\n\n<p>\u884c\u5217\u7a4d\u3092 $16\\times16$ \u306e\u30d6\u30ed\u30c3\u30af\u306b\u5206\u5272\u3057\u307e\u3057\u305f\u3002\u3053\u306e\u30d6\u30ed\u30c3\u30ad\u30f3\u30b0\u306b\u3088\u308a\u3001\u30ad\u30e3\u30c3\u30b7\u30e5\u30d2\u30c3\u30c8\u304c\u767a\u751f\u3057\u3084\u3059\u304f\u306a\u308b\u306e\u306f\u660e\u3089\u304b\u3067\u3059\u3002\u3055\u3089\u306b\u6700\u9069\u5316\u3059\u308b\u305f\u3081\u3001\u4e00\u90e8\u306e\u547d\u4ee4\u3092\u624b\u52d5\u3067\u30a2\u30e9\u30a4\u30f3\u30e1\u30f3\u30c8\u3057\u307e\u3057\u305f\u3002<br>x86\u30a2\u30bb\u30f3\u30d6\u30ea\u30ed\u30b8\u30c3\u30af\u304c\u8907\u96d1\u306a\u305f\u3081\u3001\u540c\u7b49\u3068\u307f\u306a\u305b\u308bC\u8a00\u8a9e\u306e\u30b3\u30fc\u30c9\u3082\u4ee5\u4e0b\u306b\u793a\u3057\u307e\u3059\u3002<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>void gemm_kernel_opt_loop(float *C, const float *A, const float *B, int M, int N, int K) {\n    for (int m = 0; m &lt; M\/16; m += 16)\n        for (int n = 0; n &lt; N\/16; n += 16)\n            for (int k = 0; k &lt; K\/16; k += 16) {\n                int minMt = m+16&lt; M ? m+16:M;\n                int minNt = n+16&lt; N ? n+16:N;\n                int minKt = k+16&lt; K ? 
k+16:K;\n                for (int mt = m; mt &lt; minMt; mt++)\n                    for (int nt = n; nt &lt; minNt; nt++)\n                        for (int kt = k; kt &lt; minKt; kt++)\n                            C&#91;mt * M + nt] += A&#91;mt * M + kt] * B&#91;kt * K + nt];\n            }\n}<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"x86asm\">.text;\n.p2align 2;\n.global gemm_kernel_opt_loop;\n.type gemm_kernel_opt_loop, %function;\n\n#define     MAT_C               %rdi\n#define     MAT_A               %rsi\n#define     MAT_B               %r14\n#define     DIM_M               %rcx\n#define     DIM_N               %r8d\n#define     DIM_K               %r9d\n#define     loop_m              %r10d\n#define     loop_k              %r11d\n#define     loop_n              %r12\n\n\n.macro PUSHD\n    push %rax\n    push %rbx\n    push %rcx\n    push %rdx\n    push %rsi\n    push %rdi\n    push %rbp\n    push %r8\n    push %r9\n    push %r10\n    push %r11\n    push %r12\n    push %r13\n    push %r14\n    push %r15\n.endm\n\n.macro POPD\n    pop %r15\n    pop %r14\n    pop %r13\n    pop %r12\n    pop %r11\n    pop %r10\n    pop %r9\n    pop %r8\n    pop %rbp\n    pop %rdi\n    pop %rsi\n    pop %rdx\n    pop %rcx\n    pop %rbx\n    pop %rax\n.endm\n\n.macro DO_GEMM\n    \/\/TODO\uff1a\n    \/\/16*16 blocks in 6 loops\n\tmov\t$0, loop_m\n\tcmp\t%ecx, loop_m\n\tjge\tOVER\n\tmov\t%rdx, %r13\n\tmov\tDIM_N, %eax\n\tmov\tDIM_N, -56(%rsp)\n\tcdqe\n\tlea\t0(,%rax,4), %r15\n\tjmp\tLOOP_1_VARSET\nLOOP_6_VARSET:\n\tmov\t%r13, %rdx\n\tmov\t-80(%rsp), %rax\n\tfldz\n\t.p2align 5\nLOOP_6:\n\tflds\t(%rax)\n\tfmuls\t(%rdx)\n\tfaddp\t%st, %st(1)\n\tadd\t$4, %rax\n\tadd\t%r15, %rdx\n\tcmp\t%rbx, %rax\n\tjne\tLOOP_6\nLOOP_5_OVERCHECK:\n\tmov\t-96(%rsp), %rax\n\tfadds\t(%rax,%rbp,4)\n\tfstps\t(%rax,%rbp,4)\n\tadd\t$1, %rbp\n\tadd\t$4, %r13\n\tcmp\t%ebp, -88(%rsp)\n\tjle\tLOOP_4_OVERCHECK\nLOOP_5_VARSET:\n\tmov\t-104(%rsp), %edx\n\tcmp\t%edx, -84(%rsp)\n\tjg\tLOOP_6_VARSET\n\tfldz\n\tjmp\tLOOP_5_OVERCHECK\n\t.p2align 6\nLOOP_4_OVERCHECK:\n\taddq\t$1, -72(%rsp)\n\tmov\t-72(%rsp), %eax\n\tmov\t-56(%rsp), %edx\n\tadd\t%edx, -68(%rsp)\n\tmov\t-16(%rsp), %edx\n\tadd\t%edx, -64(%rsp)\n\tcmp\t%eax, -52(%rsp)\n\tje\tBLOCK_COMPLETE\nLOOP_4_VARSET:\n\tmov\t-88(%rsp), %ebx\n\tcmp\t%ebx, -60(%rsp)\n\tjge\tLOOP_4_OVERCHECK\n\tmovsxd\t-64(%rsp), %rdx\n\tmov\t-104(%rsp), %rbx\n\tlea\t(%rdx,%rbx), %rax\n\tmov\t-48(%rsp), %rbp\n\tlea\t0(%rbp,%rax,4), %rax\n\tmov\t%rax, -80(%rsp)\n\tmov\t-12(%rsp), %eax\n\tadd\t%rbx, %rax\n\tadd\t%rdx, %rax\n\tlea\t0(%rbp,%rax,4), %rbx\n\tmovsxd\t-68(%rsp), %rax\n\tmov\t-40(%rsp), %rdx\n\tlea\t(%rdx,%rax,4), %rax\n\tmov\t%rax, -96(%rsp)\n\tmov\t-32(%rsp), %r13\n\tmov\t-24(%rsp), %rbp\n\tjmp\tLOOP_5_VARSET\nBLOCK_COMPLETE:\n\tmov\t-8(%rsp), %r13\nLOOP_3_OVERCHECK:\n\tmov\tDIM_K, %ebx\n\taddq\t$16, -104(%rsp)\n\tmov\t-104(%rsp), %rax\n\tcmp\t%eax, DIM_K\n\tjle\tLOOP_2_OVERCHECK\nLOOP_3_VARSET:\n\tmov\t-104(%rsp), %rbp\n\tmov\t%ebp, -96(%rsp)\n\tmov\tloop_m, -72(%rsp)\n\tlea\t16(loop_m), %eax\n\tcmp\t%ecx, %eax\n\tcmovg\t%ecx, %eax\n\tmov\t%eax, %edx\n\tmov\t%eax, -52(%rsp)\n\tmov\tloop_k, -60(%rsp)\n\tmov\t-104(%rsp), %eax\n\tadd\t$16, %eax\n\tcmp\t%ebx, %eax\n\tcmovg\t%ebx, %eax\n\tmov\t%eax, -84(%rsp)\n\tcmp\t%edx, loop_m\n\tjge\tLOOP_3_OVERCHECK\n\tmov\tMAT_A, -48(%rsp)\n\tmov\tMAT_C, -40(%rsp)\n\tmov\t-56(%rsp), %eax\n\tmov\t%eax, %edx\n\timul\tloop_m, %edx\n\tmov\t%edx, -68(%rsp)\n\tmov\t%ebx, -16(%rsp)\n\timul\tloop_m, %ebx\n\tmov\t%ebx, -64(%rsp)\n\tmovsxd\tloop_k, 
%rbx\n\tmov\t%rbx, -24(%rsp)\n\tmov\t-104(%rsp), %edx\n\timul\t%edx, %eax\n\tcdqe\n\tadd\t%rbx, %rax\n\tlea\t0(%r13,%rax,4), %rax\n\tmov\t%rax, -32(%rsp)\n\tmov\t-84(%rsp), %ebp\n\tmov\t-96(%rsp), %eax\n\tsub\t%eax, %ebp\n\tmov\t%ebp, -12(%rsp)\n\tmov\t%r13, -8(%rsp)\n\tjmp\tLOOP_4_VARSET\nLOOP_2_OVERCHECK:\n\tlea\t16(loop_k), %eax\n\tmov\t%eax, loop_k\n\tcmp\tDIM_N, %eax\n\tjge\tLOOP_1_OVERCHECK\nLOOP_2_VARSET:\n\tmov\tDIM_K, %ebx\n\ttest\tDIM_K, DIM_K\n\tjle\tLOOP_2_OVERCHECK\n\tmovq\t$0, -104(%rsp)\n\tlea\t16(loop_k), %eax\n\tcmp\tDIM_N, %eax\n\tcmovg\tDIM_N, %eax\n\tmov\t%eax, -88(%rsp)\n\tjmp\tLOOP_3_VARSET\nLOOP_1_OVERCHECK:\n\tlea\t16(loop_m), %eax\n\tmov\t%eax, loop_m\n\tcmp\t%ecx, %eax\n\tjge\tOVER\nLOOP_1_VARSET:\n\tmov\t$0, loop_k\n\ttest\tDIM_N, DIM_N\n\tjg\tLOOP_2_VARSET\n\tjmp\tLOOP_1_OVERCHECK\nOVER:\n.endm\n\ngemm_kernel_opt_loop:\n    PUSHD\n    DO_GEMM\n    POPD\n    ret\n<\/code><\/pre>\n\n\n\n<p>\u51fa\u529b<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"zsh\">\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/lab2\/build]\n\u2514\u2500$ .\/dist\/bins\/lab2_gemm_kernel_opt_loop.unittest\nRunning main() from \/home\/amamitsu\/Applications\/lab2\/build\/_deps\/googletest-src\/googletest\/src\/gtest_main.cc\n&#91;==========] Running 4 tests from 1 test suite.\n&#91;----------] Global test environment set-up.\n&#91;----------] 4 tests from gemm_kernel_opt_loop\n&#91; RUN      ] gemm_kernel_opt_loop.test0\n&#91;       OK ] gemm_kernel_opt_loop.test0 (1 ms)\n&#91; RUN      ] gemm_kernel_opt_loop.test1\n&#91;       OK ] gemm_kernel_opt_loop.test1 (2 ms)\n&#91; RUN      ] gemm_kernel_opt_loop.test2\n&#91;       OK ] gemm_kernel_opt_loop.test2 (2 ms)\n&#91; RUN      ] gemm_kernel_opt_loop.test3\n&#91;       OK ] gemm_kernel_opt_loop.test3 (0 ms)\n&#91;----------] 4 tests from gemm_kernel_opt_loop (6 ms total)\n\n&#91;----------] Global test environment tear-down\n&#91;==========] 4 tests from 1 test suite ran. 
(6 ms total)\n&#91;  PASSED  ] 4 tests.\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/lab2\/build]\n\u2514\u2500$ .\/dist\/bins\/lab2_gemm_opt_loop 2048 512 64                                       \n--- Performance before loop optimization ---\nGEMM performance info:\n                M, K, N: 2048, 512, 64\n                Ops: 0.134218\n                Total compute time(s): 8.63971\n                Cost(s): 0.0431986\n                Benchmark(Gflops): 3.107\n--- Performance for after loop optimization ---\nGEMM performance info:\n                M, K, N: 2048, 512, 64\n                Ops: 0.134218\n                Total compute time(s): 4.44931\n                Cost(s): 0.0222465\n                Benchmark(Gflops): 6.0332\n----------------------------\nPerformance difference(Gflops): 2.9262\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/lab2\/build]\n\u2514\u2500$ .\/dist\/bins\/lab2_gemm_opt_loop 4444 444 444       \n--- Performance before loop optimization ---\nGEMM performance info:\n                M, K, N: 4444, 444, 444\n                Ops: 1.75214\n                Total compute time(s): 109.722\n                Cost(s): 0.548611\n                Benchmark(Gflops): 3.19378\n--- Performance for after loop optimization ---\nGEMM performance info:\n                M, K, N: 4444, 444, 444\n                Ops: 1.75214\n                Total compute time(s): 58.2671\n                Cost(s): 0.291336\n                Benchmark(Gflops): 6.01418\n----------------------------\nPerformance difference(Gflops): 2.8204\n<\/code><\/pre>\n\n\n\n<p>\u30a2\u30bb\u30f3\u30d6\u30ea\u30b3\u30fc\u30c9\u3092\u614e\u91cd\u306b\u6700\u9069\u5316\u3059\u308b\u3053\u3068\u3067\u3001\u30d1\u30d5\u30a9\u30fc\u30de\u30f3\u30b9\u306f94.2%\u5411\u4e0a\u3057\u307e\u3057\u305f\u3002\u3061\u306a\u307f\u306b\u3001\u547d\u4ee4\u3092\u30a2\u30e9\u30a4\u30f3\u3057\u306a\u3044\u5834\u5408\u3067\u308244.3%\u306e\u5411\u4e0a\u304c\u898b\u3089\u308c\u3001\u3053\u308c\u3067\u3082\u5143\u306e\u30d0\u30fc\u30b8\u30e7\u30f3\u3068\u6bd4\u8f03\u3057\u3066\u5341\u5206\u52b9\u679c\u7684\u3067\u3059\u3002<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Lab3 \u547d\u4ee4\u30ec\u30d9\u30eb\u306e\u4e26\u5217\u6027\u3001\u30d9\u30af\u30c8\u30eb\u547d\u4ee4\u3001\u304a\u3088\u3073\u4e26\u5217\u51e6\u7406\u3092\u4f7f\u7528\u3057\u3066\u884c\u5217\u4e57\u7b97\u3092\u6700\u9069\u5316<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">x87 FPU\u3092\u57fa\u306b\u884c\u5217\u4e57\u7b97\u306e\u6027\u80fd\u3092\u6700\u9069\u5316\u3059\u308b<\/h3>\n\n\n\n<p>\u3053\u3053\u3067\u306f\u3001x87 FPU \u547d\u4ee4\u3092\u4f7f\u7528\u3057\u3066\u30d7\u30ed\u30bb\u30b9\u3092\u6700\u9069\u5316\u3057\u307e\u3059\u3002\u3053\u306e\u90e8\u5206\u306f\u4ee5\u4e0b\u306e\u3088\u3046\u306b\u5206\u3051\u3089\u308c\u307e\u3059\uff1a<\/p>\n\n\n\n<p>(1) A[$m$][$k$] $\\times$ B[$k$][$n+1$] \u3092\u8a08\u7b97\u3059\u308b<\/p>\n\n\n\n<p>(2) C[$m$][$n$] \u3092 st(1) \u306b\u3001C[$m$][$n+1$] \u3092 st(0) \u306b\u30ed\u30fc\u30c9\u3059\u308b<\/p>\n\n\n\n<p>(3) C[$m$][$n+1$] + A[$m$][$k$] $\\times$ B[$k$][$n+1$] \u3092\u8a08\u7b97\u3059\u308b<\/p>\n\n\n\n<p>(4) C[$m$][$n$] + A[$m$][$k$] $\\times$ B[$k$][$n$] \u3092\u8a08\u7b97\u3059\u308b<\/p>\n\n\n\n<p>(5) C[$m$][$n$] \u3092\u4fdd\u5b58\u3059\u308b<\/p>\n\n\n\n<p>(6) n 
\u30eb\u30fc\u30d7\u306e\u30ab\u30a6\u30f3\u30bf\u3092\u66f4\u65b0\u3059\u308b<\/p>\n\n\n\n<p>\u3053\u3053\u3067\u306e\u91cd\u8981\u306a\u30dd\u30a4\u30f3\u30c8\u306f\u3001\u547d\u4ee4\u304c\u6d6e\u52d5\u5c0f\u6570\u70b9\u30ec\u30b8\u30b9\u30bf\u306b\u3069\u306e\u3088\u3046\u306a\u5909\u5316\u3092\u3082\u305f\u3089\u3059\u304b\u3092\u7406\u89e3\u3059\u308b\u3053\u3068\u3067\u3059\u3002<br><br>\u30b3\u30fc\u30c9\u3088\u308a\u3001\u5404 n \u30eb\u30fc\u30d7\u3067 2 \u3064\u306e\u4e57\u7b97\u304c\u51e6\u7406\u3055\u308c\u308b\u3053\u3068\u304c\u308f\u304b\u308a\u307e\u3059\u3002\u305d\u306e\u305f\u3081\u3001\u4e0a\u8a18\u306e\u30d7\u30ed\u30bb\u30b9\u304c\u7d42\u4e86\u3057\u305f\u3089\u3001n \u306e\u30eb\u30fc\u30d7\u30ab\u30a6\u30f3\u30bf\u306b 2 \u3092\u52a0\u3048\u307e\u3059\u3002<\/p>\n\n\n\n<div class=\"swell-block-tab is-style-balloon\" data-width-pc=\"flex-auto\" data-width-sp=\"auto\"><ul class=\"c-tabList\" role=\"tablist\"><li class=\"c-tabList__item\" role=\"presentation\"><button role=\"tab\" class=\"c-tabList__button\" aria-selected=\"true\" aria-controls=\"tab-b0591488-0\" data-onclick=\"tabControl\">\u672a\u5b8c\u6210\u306a\u30b3\u30fc\u30c9<\/button><\/li><li class=\"c-tabList__item\" role=\"presentation\"><button role=\"tab\" class=\"c-tabList__button\" aria-selected=\"false\" aria-controls=\"tab-b0591488-1\" data-onclick=\"tabControl\">\u5b9f\u73fe\u3057\u305f\u30b3\u30fc\u30c9<\/button><\/li><\/ul><div class=\"c-tabBody\">\n<div id=\"tab-b0591488-0\" class=\"c-tabBody__item\" aria-hidden=\"false\">\n<pre class=\"wp-block-code\"><code class=\"x86asm\">.text;\n.p2align 2;\n.global gemm_kernel_opt_loop_unrolling;\n.type gemm_kernel_opt_loop_unrolling, %function;\n\n#define     MAT_C               %rdi\n#define     MAT_A               %rsi\n#define     MAT_B               %r14\n#define     DIM_M               %rcx\n#define     DIM_N               %r8\n#define     DIM_K               %r9\n#define     loop_m              %r10\n#define     loop_k              %r11\n#define     loop_n              %r12\n#define     mat_elem_idx        %r13\n\n\n.macro PUSHD   \/\/ \u73fe\u5728\u306e\u6c4e\u7528\u30ec\u30b8\u30b9\u30bf\u306e\u5024\u3092\u4fdd\u5b58\n    push %rax\n    push %rbx\n    push %rcx\n    push %rdx\n    push %rsi\n    push %rdi\n    push %rbp\n    push %r8\n    push %r9\n    push %r10\n    push %r11\n    push %r12\n    push %r13\n    push %r14\n    push %r15\n.endm\n\n.macro POPD    \/\/ \u4fdd\u5b58\u3057\u305f\u6c4e\u7528\u30ec\u30b8\u30b9\u30bf\u306e\u5024\u3092\u5fa9\u5143\n    pop %r15\n    pop %r14\n    pop %r13\n    pop %r12\n    pop %r11\n    pop %r10\n    pop %r9\n    pop %r8\n    pop %rbp\n    pop %rdi\n    pop %rsi\n    pop %rdx\n    pop %rcx\n    pop %rbx\n    pop %rax\n.endm\n\n.macro GEMM_INIT\n    mov %rdx, MAT_B\n\n    xor loop_m, loop_m\n    xor loop_k, loop_k\n    xor loop_n, loop_n\n.endm\n\n.macro DO_GEMM\nDO_LOOP_K:\n    xor loop_m, loop_m\n\nDO_LOOP_M:\n    xor loop_n, loop_n\n\n    mov loop_m, %rax\n    mul DIM_K\n    mov %rax, mat_elem_idx\n    add loop_k, mat_elem_idx          \/\/ m * K + k\u3092\u8a08\u7b97\n    flds (MAT_A, mat_elem_idx, 4)     \/\/ A&#91;m]&#91;k]\u3092\u8aad\u307f\u8fbc\u3080\n\nDO_LOOP_N:\n    mov DIM_N, %rax\n    mul loop_k\n    mov %rax, mat_elem_idx\n    add loop_n, mat_elem_idx\n    flds (MAT_B, mat_elem_idx, 4)     \/\/ B&#91;k]&#91;n]\u3092\u8aad\u307f\u8fbc\u3080\n    fmul %st(1), %st(0)               \/\/ A&#91;m]&#91;k] * B&#91;k]&#91;n] \u3092\u8a08\u7b97 --> st(0)\n\n    \/\/ TODO: A&#91;m]&#91;k] * 
B&#91;k]&#91;n+1] \u3092\u8a08\u7b97\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\u3057\u3066\u304f\u3060\u3055\u3044\n\n\n    mov DIM_N, %rax\n    mul loop_m\n    mov %rax, mat_elem_idx\n    add loop_n, mat_elem_idx          \/\/ m * N + n\u3092\u8a08\u7b97\n    \/\/ TODO: C&#91;m]&#91;n] \u3092 st(1)\u3001C&#91;m]&#91;n+1] \u3092 st(0) \u306b\u8aad\u307f\u8fbc\u3080\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\u3057\u3066\u304f\u3060\u3055\u3044\n\n    \/\/ TODO: \u90e8\u5206\u548c\u3092\u7d2f\u7a4d\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\u3057\u3066\u304f\u3060\u3055\u3044\n    \/\/ C&#91;m]&#91;n+1] + A&#91;m]&#91;k] * B&#91;k]&#91;n+1]\u3001C&#91;m]&#91;n] + A&#91;m]&#91;k] * B&#91;k]&#91;n]\n\n    fstps (MAT_C, mat_elem_idx, 4)    \/\/ C&#91;m]&#91;n+1] \u3092\u4fdd\u5b58\n\n    \/\/ TODO: C&#91;m]&#91;n] \u3092\u4fdd\u5b58\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\u3057\u3066\u304f\u3060\u3055\u3044\n\n    \/\/ TODO: N\u6b21\u5143\u30eb\u30fc\u30d7\u3092\u66f4\u65b0\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\u3057\u3066\u304f\u3060\u3055\u3044\n\n    cmp DIM_N, loop_n\n    jl DO_LOOP_N\n\n    fstp %st(0)                   \/\/ st(0) \u306e\u307f\u3092\u30dd\u30c3\u30d7\n    add $1, loop_m\n    cmp DIM_M, loop_m\n    jl DO_LOOP_M\n\n    add $1, loop_k\n    cmp DIM_K, loop_k\n    jl DO_LOOP_K\n.endm\n\ngemm_kernel_opt_loop_unrolling:\n    PUSHD\n    GEMM_INIT\n    DO_GEMM\n    POPD\n    ret<\/code><\/pre>\n<\/div>\n\n\n\n<div id=\"tab-b0591488-1\" class=\"c-tabBody__item\" aria-hidden=\"true\">\n<pre class=\"wp-block-code\"><code class=\"x86asm\">.text;\n.p2align 2;\n.global gemm_kernel_opt_loop_unrolling;\n.type gemm_kernel_opt_loop_unrolling, %function;\n\n#define     MAT_C               %rdi\n#define     MAT_A               %rsi\n#define     MAT_B               %r14\n#define     DIM_M               %rcx\n#define     DIM_N               %r8\n#define     DIM_K               %r9\n#define     loop_m              %r10\n#define     loop_k              %r11\n#define     loop_n              %r12\n#define     mat_elem_idx        %r13\n\n\n.macro PUSHD   \/\/ \u73fe\u5728\u306e\u6c4e\u7528\u30ec\u30b8\u30b9\u30bf\u306e\u5024\u3092\u4fdd\u5b58\n    push %rax\n    push %rbx\n    push %rcx\n    push %rdx\n    push %rsi\n    push %rdi\n    push %rbp\n    push %r8\n    push %r9\n    push %r10\n    push %r11\n    push %r12\n    push %r13\n    push %r14\n    push %r15\n.endm\n\n.macro POPD    \/\/ \u4fdd\u5b58\u3057\u305f\u6c4e\u7528\u30ec\u30b8\u30b9\u30bf\u306e\u5024\u3092\u5fa9\u5143\n    pop %r15\n    pop %r14\n    pop %r13\n    pop %r12\n    pop %r11\n    pop %r10\n    pop %r9\n    pop %r8\n    pop %rbp\n    pop %rdi\n    pop %rsi\n    pop %rdx\n    pop %rcx\n    pop %rbx\n    pop %rax\n.endm\n\n.macro GEMM_INIT\n    mov %rdx, MAT_B\n\n    xor loop_m, loop_m\n    xor loop_k, loop_k\n    xor loop_n, loop_n\n.endm\n\n.macro DO_GEMM\nDO_LOOP_K:\n    xor loop_m, loop_m\n\nDO_LOOP_M:\n    xor loop_n, loop_n\n\n    mov loop_m, %rax\n    mul DIM_K\n    mov %rax, mat_elem_idx\n    add loop_k, mat_elem_idx          \/\/ m * K + k\u3092\u8a08\u7b97\n    flds (MAT_A, mat_elem_idx, 4)     \/\/ A&#91;m]&#91;k]\u3092\u8aad\u307f\u8fbc\u3080\n\nDO_LOOP_N:\n    mov DIM_N, %rax\n    mul loop_k\n    mov %rax, mat_elem_idx\n    add loop_n, mat_elem_idx\n    flds (MAT_B, mat_elem_idx, 4)     \/\/ B&#91;k]&#91;n]\u3092\u8aad\u307f\u8fbc\u3080 (1st Stack below)\n    fmul %st(1), %st(0)               \/\/ A&#91;m]&#91;k] * B&#91;k]&#91;n] \u3092\u8a08\u7b97 --> st(0) (2nd 
Stack below)\n\n    \/\/ TODO: A&#91;m]&#91;k] * B&#91;k]&#91;n+1] \u3092\u8a08\u7b97\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\u3057\u3066\u304f\u3060\u3055\u3044\n    add $1, mat_elem_idx              \/\/ mat_elem_idx -> B&#91;k]&#91;n+1]\n    flds (MAT_B, mat_elem_idx, 4)     \/\/ B&#91;k]&#91;n+1] --> st(0) (3rd Stack below)\n    fmul %st(2), %st(0)               \/\/ A&#91;m]&#91;k] * B&#91;k]&#91;n+1] --> st(0) (4th Stack below)\n\n    mov DIM_N, %rax\n    mul loop_m\n    mov %rax, mat_elem_idx\n    add loop_n, mat_elem_idx          \/\/ m * N + n\u3092\u8a08\u7b97\n    \/\/ TODO: C&#91;m]&#91;n] \u3092 st(1)\u3001C&#91;m]&#91;n+1] \u3092 st(0) \u306b\u8aad\u307f\u8fbc\u3080\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\u3057\u3066\u304f\u3060\u3055\u3044\n    flds (MAT_C, mat_elem_idx, 4)     \/\/ C&#91;m]&#91;n] --> st(0) (5th Stack below)\n    add $1, mat_elem_idx              \/\/ mat_elem_idx -> C&#91;m]&#91;n+1]\n    flds (MAT_C, mat_elem_idx, 4)     \/\/ C&#91;m]&#91;n+1] --> st(0) (6th Stack below)\n    \/\/ TODO: \u90e8\u5206\u548c\u3092\u7d2f\u7a4d\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\u3057\u3066\u304f\u3060\u3055\u3044\n    \/\/ C&#91;m]&#91;n+1] + A&#91;m]&#91;k] * B&#91;k]&#91;n+1]\u3001C&#91;m]&#91;n] + A&#91;m]&#91;k] * B&#91;k]&#91;n] (7th Stack below)\n    faddp %st(2), %st(0)              \/\/ C&#91;m]&#91;n+1] += A&#91;m]&#91;k] * B&#91;k]&#91;n+1] (8th Stack below)\n    faddp %st(2), %st(0)              \/\/ C&#91;m]&#91;n] += A&#91;m]&#91;k] * B&#91;k]&#91;n]\n    fstps (MAT_C, mat_elem_idx, 4)    \/\/ C&#91;m]&#91;n+1] \u3092\u4fdd\u5b58\n\n    \/\/ TODO: C&#91;m]&#91;n] \u3092\u4fdd\u5b58\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\u3057\u3066\u304f\u3060\u3055\u3044\n    sub $1, mat_elem_idx              \/\/ mat_elem_idx -> C&#91;m]&#91;n]\n    fstps (MAT_C, mat_elem_idx, 4)    \/\/ C&#91;m]&#91;n] \u3092\u4fdd\u5b58\n    \/\/ TODO: N\u6b21\u5143\u30eb\u30fc\u30d7\u3092\u66f4\u65b0\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\u3057\u3066\u304f\u3060\u3055\u3044\n    add $2, loop_n                    \/\/ n += 2\n\n    cmp DIM_N, loop_n\n    jl DO_LOOP_N\n\n    fstp %st(0)                   \/\/ st(0) \u306e\u307f\u3092\u30dd\u30c3\u30d7\n    add $1, loop_m\n    cmp DIM_M, loop_m\n    jl DO_LOOP_M\n\n    add $1, loop_k\n    cmp DIM_K, loop_k\n    jl DO_LOOP_K\n.endm\n\ngemm_kernel_opt_loop_unrolling:\n    PUSHD\n    GEMM_INIT\n    DO_GEMM\n    POPD\n    ret<\/code><\/pre>\n<\/div>\n<\/div><\/div>\n\n\n\n<div class=\"wp-block-columns\">\n<div class=\"wp-block-column\">\n<figure class=\"wp-block-table\"><table><tbody style=\"--tbody-th-color--bg:var(--color_deep01);--tbody-th-color--txt:var(--swl-text_color--white)\"><tr><th colspan=\"2\"><span class=\"swl-cell-text-centered\">Stack Top<\/span><\/th><\/tr><tr><td><code>B[k][n]<\/code><\/td><td data-has-cell-bg=\"1\"><span data-icon-size=\"l\" data-icon-type=\"bg\" aria-hidden=\"true\" class=\"swl-cell-bg\">&nbsp;<\/span>st(0)<\/td><\/tr><tr><td><code>A[m][k]<\/code><\/td><td>st(1)<\/td><\/tr><tr><td><\/td><td><\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\">flds (MAT_B, mat_elem_idx, 4)<\/figcaption><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column\">\n<figure class=\"wp-block-table\"><table><tbody style=\"--tbody-th-color--bg:var(--color_deep01);--tbody-th-color--txt:var(--swl-text_color--white)\"><tr><th colspan=\"2\"><span class=\"swl-cell-text-centered\">Stack 
Top<\/span><\/th><\/tr><tr><td><code>B[k][n]*A[m][k]<\/code><\/td><td>st(0)<\/td><\/tr><tr><td><code>A[m][k]<\/code><\/td><td>st(1)<\/td><\/tr><tr><td><\/td><td><\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\">fmul %st(1), %st(0)<\/figcaption><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column\">\n<figure class=\"wp-block-table\"><table><tbody style=\"--tbody-th-color--bg:var(--color_deep01);--tbody-th-color--txt:var(--swl-text_color--white)\"><tr><th colspan=\"2\"><span class=\"swl-cell-text-centered\">Stack Top<\/span><\/th><\/tr><tr><td><code>B[k][n+1]<\/code><\/td><td>st(0)<\/td><\/tr><tr><td><code>B[k][n]*A[m][k]<\/code><\/td><td>st(1)<\/td><\/tr><tr><td><code>A[m][k]<\/code><\/td><td>st(2)<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\">add $1, mat_elem_idx<br>flds (MAT_B, mat_elem_idx, 4)<\/figcaption><\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns\">\n<div class=\"wp-block-column\">\n<figure class=\"wp-block-table\"><table><tbody style=\"--tbody-th-color--bg:var(--color_deep01);--tbody-th-color--txt:var(--swl-text_color--white)\"><tr><th colspan=\"2\"><span class=\"swl-cell-text-centered\">Stack Top<\/span><\/th><\/tr><tr><td><code>B[k][n+1]*A[m][k]<\/code><\/td><td>st(0)<\/td><\/tr><tr><td><code>B[k][n]*A[m][k]<\/code><\/td><td>st(1)<\/td><\/tr><tr><td><code>A[m][k]<\/code><\/td><td>st(2)<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\">fmul %st(2), %st(0)<\/figcaption><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column\">\n<figure class=\"wp-block-table\"><table><tbody style=\"--tbody-th-color--bg:var(--color_deep01);--tbody-th-color--txt:var(--swl-text_color--white)\"><tr><th colspan=\"2\"><span class=\"swl-cell-text-centered\">Stack Top<\/span><\/th><\/tr><tr><td><code>C[m][n]<\/code><\/td><td>st(0)<\/td><\/tr><tr><td><code>B[k][n+1]*A[m][k]<\/code><\/td><td>st(1)<\/td><\/tr><tr><td><code>B[k][n]*A[m][k]<\/code><\/td><td>st(2)<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\">flds (MAT_C, mat_elem_idx, 4)<\/figcaption><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column\">\n<figure class=\"wp-block-table\"><table><tbody style=\"--tbody-th-color--bg:var(--color_deep01);--tbody-th-color--txt:var(--swl-text_color--white)\"><tr><th colspan=\"2\"><span class=\"swl-cell-text-centered\">Stack Top<\/span><\/th><\/tr><tr><td><code>C[m][n+1]<\/code><\/td><td>st(0)<\/td><\/tr><tr><td><code>C[m][n]<\/code><\/td><td>st(1)<\/td><\/tr><tr><td><code>B[k][n+1]*A[m][k]<\/code><\/td><td>st(2)<\/td><\/tr><tr><td><code>B[k][n]*A[m][k]<\/code><\/td><td>st(3)<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\">add $1, mat_elem_idx<br>flds (MAT_C, mat_elem_idx, 4)<\/figcaption><\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"swell-block-columns\"><div class=\"swell-block-columns__inner\">\n<div class=\"swell-block-column swl-has-mb--s\">\n<figure class=\"wp-block-table\"><table><tbody style=\"--tbody-th-color--bg:var(--color_deep01);--tbody-th-color--txt:var(--swl-text_color--white)\"><tr><th colspan=\"2\"><span class=\"swl-cell-text-centered\">Stack Top<\/span><\/th><\/tr><tr><td><\/td><td><\/td><\/tr><tr><td><code>C[m][n]<\/code><\/td><td>st(0)<\/td><\/tr><tr><td><code>B[k][n+1]*A[m][k]+C[m][n+1]<\/code><\/td><td>st(1)<\/td><\/tr><tr><td><code>B[k][n]*A[m][k]<\/code><\/td><td>st(2)<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\">faddp %st(2), %st(0)<\/figcaption><\/figure>\n<\/div>\n\n\n\n<div class=\"swell-block-column swl-has-mb--s\">\n<figure 
class=\"wp-block-table\"><table><tbody style=\"--tbody-th-color--bg:var(--color_deep01);--tbody-th-color--txt:var(--swl-text_color--white)\"><tr><th colspan=\"2\"><span class=\"swl-cell-text-centered\">Stack Top<\/span><\/th><\/tr><tr><td><\/td><td><\/td><\/tr><tr><td><\/td><td><\/td><\/tr><tr><td><code>B[k][n+1]*A[m][k]+C[m][n+1]<\/code><\/td><td>st(0)<\/td><\/tr><tr><td><code>B[k][n]*A[m][k]+C[m][n]<\/code><\/td><td>st(1)<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\">faddp %st(2), %st(0)<\/figcaption><\/figure>\n<\/div>\n<\/div><\/div>\n\n\n\n<p>\u51fa\u529b<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"zsh\">\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab3\/build]\n\u2514\u2500$ .\/dist\/bins\/lab3_gemm_opt_loop_unrolling.unittest              \nRunning main() from \/home\/amamitsu\/Applications\/Lab3\/build\/_deps\/googletest-src\/googletest\/src\/gtest_main.cc\n&#91;==========] Running 3 tests from 1 test suite.\n&#91;----------] Global test environment set-up.\n&#91;----------] 3 tests from gemm_kernel_opt_loop_unrolling\n&#91; RUN      ] gemm_kernel_opt_loop_unrolling.test0\n&#91;       OK ] gemm_kernel_opt_loop_unrolling.test0 (1 ms)\n&#91; RUN      ] gemm_kernel_opt_loop_unrolling.test1\n&#91;       OK ] gemm_kernel_opt_loop_unrolling.test1 (0 ms)\n&#91; RUN      ] gemm_kernel_opt_loop_unrolling.test2\n&#91;       OK ] gemm_kernel_opt_loop_unrolling.test2 (1 ms)\n&#91;----------] 3 tests from gemm_kernel_opt_loop_unrolling (3 ms total)\n\n&#91;----------] Global test environment tear-down\n&#91;==========] 3 tests from 1 test suite ran. (3 ms total)\n&#91;  PASSED  ] 3 tests.\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab3\/build]\n\u2514\u2500$ .\/dist\/bins\/lab3_gemm_opt_loop_unrolling 2048 512 64\n--- Performance before loop unrolling optimization ---\nGEMM performance info:\n                M, K, N: 2048, 512, 64\n                Ops: 0.134218\n                Total compute time(s): 8.9804\n                Cost(s): 0.044902\n                Benchmark(Gflops): 2.98913\n--- Performance for after loop unrolling optimization ---\nGEMM performance info:\n                M, K, N: 2048, 512, 64\n                Ops: 0.134218\n                Total compute time(s): 7.44397\n                Cost(s): 0.0372198\n                Benchmark(Gflops): 3.60608\n----------------------------\nPerformance difference(Gflops): 0.616955<\/code><\/pre>\n\n\n\n<p>\u30eb\u30fc\u30d7\u3092 1 \u56de\u5c55\u958b\u3059\u308b\u3053\u3068\u3067\u3001\u6027\u80fd\u304c 20% \u5411\u4e0a\u3057\u307e\u3059\u3002<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">AVX\u547d\u4ee4\u306e(2m,32n,32k)\u9ad8\u6027\u80fd\u884c\u5217\u4e57\u7b97\u8a08\u7b97\u30ab\u30fc\u30cd\u30eb<\/h3>\n\n\n\n<p>\u3053\u306e\u30bf\u30b9\u30af\u3067\u306f\u3001\u30d6\u30ed\u30fc\u30c9\u30ad\u30e3\u30b9\u30c8\u3068\u30c7\u30fc\u30bf\u3092\u30d9\u30af\u30bf\u30ec\u30b8\u30b9\u30bf\u306b\u30ed\u30fc\u30c9\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u5b9f\u88c5\u3057\u307e\u3059\u3002\u30b3\u30fc\u30c9\u5185\u306e\u4f8b\u304b\u3089\u3001\u307e\u305a\u30d6\u30ed\u30fc\u30c9\u30ad\u30e3\u30b9\u30c8\u3059\u308b\u8981\u7d20\u306e\u30a2\u30c9\u30ec\u30b9\u3092\u53d6\u5f97\u3057\u3001\u305d\u306e\u5f8c $\\textbf{vbroadcastss} $\u3092\u4f7f\u7528\u3057\u3066\u30d6\u30ed\u30fc\u30c9\u30ad\u30e3\u30b9\u30c8\u3057\u307e\u3059\u3002<\/p>\n\n\n\n<p>\u30ec\u30b8\u30b9\u30bf\u306b\u30c7\u30fc\u30bf\u3092\u30ed\u30fc\u30c9\u3059\u308b\u969b\u306b\u306f\u3001$\\textbf{vmovups} 
$\u3092\u4f7f\u7528\u3057\u307e\u3059\u3002<\/p>\n\n\n\n<p>\u8a08\u7b97\u3067\u306f\u3001$\\textbf{vfmadd231ps} $\u3092\u4f7f\u7528\u3057\u3066\u3001\u4ee5\u4e0b\u306e\u3088\u3046\u306a\u6f14\u7b97\u3092\u5b9f\u884c\u3057\u307e\u3059\uff1a C[$m$][$n$] += A[$m$][$k$] $\\times$ B[$k$][$n$]<\/p>\n\n\n\n<div class=\"swell-block-tab is-style-balloon\" data-width-pc=\"flex-auto\" data-width-sp=\"auto\"><ul class=\"c-tabList\" role=\"tablist\"><li class=\"c-tabList__item\" role=\"presentation\"><button role=\"tab\" class=\"c-tabList__button\" aria-selected=\"true\" aria-controls=\"tab-1c8f653b-0\" data-onclick=\"tabControl\">\u672a\u5b8c\u6210\u306a\u30b3\u30fc\u30c9<\/button><\/li><li class=\"c-tabList__item\" role=\"presentation\"><button role=\"tab\" class=\"c-tabList__button\" aria-selected=\"false\" aria-controls=\"tab-1c8f653b-1\" data-onclick=\"tabControl\">\u5b9f\u73fe\u3057\u305f\u30b3\u30fc\u30c9<\/button><\/li><\/ul><div class=\"c-tabBody\">\n<div id=\"tab-1c8f653b-0\" class=\"c-tabBody__item\" aria-hidden=\"false\">\n<pre class=\"wp-block-code\"><code class=\"x86asm\">.text;\n.p2align 2;\n.global gemm_kernel_opt_avx;\n.type gemm_kernel_opt_avx, %function;\n\n\n#define     AVX_REG_BYTE_WIDTH  32\n\n#define     MAT_C               %rdi\n#define     MAT_A               %rsi\n#define     MAT_B               %r13\n#define     DIM_M               %rcx\n#define     DIM_N               %r8\n#define     DIM_K               %r9\n#define     loop_m              %r10\n#define     loop_k              %r11\n#define     loop_n              %r12\n#define     mat_elem_idx        %r14\n#define     temp_reg            %r15\n\n\/\/ \u4ee5\u4e0b\u306f\u8a08\u7b97\u4e2d\u306b\u4f7f\u7528\u3055\u308c\u308bAVX\u30ec\u30b8\u30b9\u30bf\n#define     mat_c0_0_8           %ymm0\n#define     mat_c0_8_16          %ymm1\n#define     mat_c0_16_24         %ymm2\n#define     mat_c0_24_32         %ymm3\n#define     mat_c1_0_8           %ymm4\n#define     mat_c1_8_16          %ymm5\n#define     mat_c1_16_24         %ymm6\n#define     mat_c1_24_32         %ymm7\n#define     mat_a0_0_8           %ymm8\n#define     mat_a1_0_8           %ymm9\n#define     mat_b0_0_8           %ymm10\n#define     mat_b0_8_16          %ymm11\n#define     mat_b0_16_24         %ymm12\n#define     mat_b0_24_32         %ymm13\n\n.macro PUSHD\n    push %rax\n    push %rbx\n    push %rcx\n    push %rdx\n    push %rsi\n    push %rdi\n    push %rbp\n    push %r8\n    push %r9\n    push %r10\n    push %r11\n    push %r12\n    push %r13\n    push %r14\n    push %r15\n.endm\n\n.macro POPD\n    pop %r15\n    pop %r14\n    pop %r13\n    pop %r12\n    pop %r11\n    pop %r10\n    pop %r9\n    pop %r8\n    pop %rbp\n    pop %rdi\n    pop %rsi\n    pop %rdx\n    pop %rcx\n    pop %rbx\n    pop %rax\n.endm\n\n.macro GEMM_INIT\n    mov %rdx, MAT_B\n.endm\n\n.macro LOAD_MAT_A     \/\/ \u884c\u5217A\u306e\u540c\u3058\u5217\u306e2\u3064 (A&#91;m]&#91;k], A&#91;m+1]&#91;k])\u3092\u30ed\u30fc\u30c9\n    \/\/ A&#91;m]&#91;k]\u306e\u30c7\u30fc\u30bf\u3092\u30ed\u30fc\u30c9\n    mov loop_m, %rax\n    mul DIM_K\n    mov %rax, temp_reg\n    add loop_k, temp_reg\n\n    \/\/ A&#91;m]&#91;k]\u306e\u30a2\u30c9\u30ec\u30b9\u3092\u8a08\u7b97\n    mov temp_reg, mat_elem_idx\n    shl $2, mat_elem_idx        \/\/ *=4\n\n    vbroadcastss (MAT_A, mat_elem_idx), mat_a0_0_8    \/\/ A&#91;m]&#91;k]\u3092AVX\u30ec\u30b8\u30b9\u30bf\u306e8\u3064\u306e\u30bb\u30eb\u306b\u30d6\u30ed\u30fc\u30c9\u30ad\u30e3\u30b9\u30c8\n\n    \/\/ TODO: 
A&#91;m+1]&#91;k]\u3092\u30ed\u30fc\u30c9\u3057\u3066mat_a1_0_8\u306b\u30d6\u30ed\u30fc\u30c9\u30ad\u30e3\u30b9\u30c8\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\u3057\u3066\n\n.endm\n\n.macro LOAD_MAT_B    \/\/ \u884c\u5217B\u306e1\u884c32\u500b\u306e\u8981\u7d20\u3092\u30ed\u30fc\u30c9 (B&#91;k]&#91;n:n+32])\n\n    \/\/ TODO: B&#91;k]&#91;n:n+32]\u3092\u30ed\u30fc\u30c9\u3057\u3066mat_b0_0_8, mat_b0_8_16, mat_b0_16_24, mat_b0_24_32\u306b\u683c\u7d0d\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\u3057\u3066\n\n.endm\n\n.macro LOAD_MAT_C\n    mov loop_m, %rax\n    mul DIM_N\n    mov %rax, temp_reg\n    add loop_n, temp_reg\n\n    \/\/ \u884c\u5217C\u306e\u6700\u521d\u306e\u884c (C&#91;m]&#91;n:n+32]) \u3092\u30ed\u30fc\u30c9\n    mov temp_reg, mat_elem_idx\n    shl $2, mat_elem_idx        \/\/ *=4\n\n    \/\/ TODO: C&#91;m]&#91;n:n+32]\u3092\u30ed\u30fc\u30c9\u3057\u3066mat_c0_0_8, mat_c0_8_16, mat_c0_16_24, mat_c0_24_32\u306b\u683c\u7d0d\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\u3057\u3066\n\n    \/\/ \u884c\u5217C\u306e2\u884c\u76ee (C&#91;m+1]&#91;n:n+32]) \u3092\u30ed\u30fc\u30c9\n    mov temp_reg, mat_elem_idx\n    add DIM_N, mat_elem_idx\n    shl $2, mat_elem_idx        \/\/ *=4\n\n    \/\/ TODO: C&#91;m+1]&#91;n:n+32]\u3092\u30ed\u30fc\u30c9\u3057\u3066mat_c1_0_8, mat_c1_8_16, mat_c1_16_24, mat_c1_24_32\u306b\u683c\u7d0d\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\u3057\u3066\n\n.endm\n\n.macro STORE_MAT_C\n    mov loop_m, %rax\n    mul DIM_N\n    mov %rax, temp_reg\n    add loop_n, temp_reg\n\n    \/\/ \u884c\u5217C\u306e\u6700\u521d\u306e\u884c\u30c7\u30fc\u30bf\u3092\u4fdd\u5b58, C&#91;m]&#91;n:n+32]\n    mov temp_reg, mat_elem_idx\n    shl $2, mat_elem_idx        \/\/ *=4\n\n    \/\/ TODO: mat_c0_0_8, mat_c0_8_16, mat_c0_16_24, mat_c0_24_32\u3092\u4fdd\u5b58\u3057\u3066C&#91;m]&#91;n:n+32]\u306b\u683c\u7d0d\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\n\n    \/\/ \u884c\u5217C\u306e2\u884c\u76ee\u306e\u30c7\u30fc\u30bf\u3092\u4fdd\u5b58, C&#91;m+1]&#91;n:n+32]\n    \/\/ TODO: mat_c1_0_8, mat_c1_8_16, mat_c1_16_24, mat_c1_24_32\u3092\u4fdd\u5b58\u3057\u3066C&#91;m+1]&#91;n:n+32]\u306b\u683c\u7d0d\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\n\n.endm\n\n.macro DO_COMPUTE      \/\/ C&#91;m:m+2]&#91;n:n+32] += A&#91;m:m+2]&#91;k] * B&#91;k:k+8]&#91;n:n+32] \u306e\u8a08\u7b97\u3092\u5b9f\u884c\n\n    \/\/ TODO: C&#91;m:m+2]&#91;n:n+32] += A&#91;m:m+2]&#91;k] * B&#91;k:k+8]&#91;n:n+32] \u306e\u8a08\u7b97\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\n\n.endm\n\n\n.macro DO_GEMM\n    xor loop_n, loop_n\nDO_LOOP_N:\n\n    xor loop_m, loop_m\nDO_LOOP_M:\n    \/\/ \u884c\u5217C\u306e\u30c7\u30fc\u30bf\u3092\u30ed\u30fc\u30c9\n    LOAD_MAT_C\n\n    xor loop_k, loop_k\nDO_LOOP_K:\n    \/\/ \u884c\u5217A\u304a\u3088\u3073\u884c\u5217B\u306e\u5206\u5272\u3055\u308c\u305f\u30c7\u30fc\u30bf\u3092\u30ed\u30fc\u30c9\n    LOAD_MAT_A\n    LOAD_MAT_B\n\n    DO_COMPUTE\n\n    add $1, loop_k              \/\/ kr=1\n    cmp DIM_K, loop_k\n    jl DO_LOOP_K\n\n    \/\/ \u7d50\u679c\u3092\u4fdd\u5b58\n    STORE_MAT_C\n\n    add $2, loop_m              \/\/ mr=2\n    cmp DIM_M, loop_m\n    jl DO_LOOP_M\n\n    add $32, loop_n             \/\/ nr=32\n    cmp DIM_N, loop_n\n    jl DO_LOOP_N\n\n.endm\n\ngemm_kernel_opt_avx:\n    PUSHD\n    GEMM_INIT\n    DO_GEMM\n    POPD\n    ret<\/code><\/pre>\n<\/div>\n\n\n\n<div id=\"tab-1c8f653b-1\" class=\"c-tabBody__item\" aria-hidden=\"true\">\n<pre class=\"wp-block-code\"><code class=\"x86asm\">.text;\n.p2align 
2;\n.global gemm_kernel_opt_avx;\n.type gemm_kernel_opt_avx, %function;\n\n\n#define     AVX_REG_BYTE_WIDTH  32\n\n#define     MAT_C               %rdi\n#define     MAT_A               %rsi\n#define     MAT_B               %r13\n#define     DIM_M               %rcx\n#define     DIM_N               %r8\n#define     DIM_K               %r9\n#define     loop_m              %r10\n#define     loop_k              %r11\n#define     loop_n              %r12\n#define     mat_elem_idx        %r14\n#define     temp_reg            %r15\n\n\/\/ \u4ee5\u4e0b\u306f\u8a08\u7b97\u4e2d\u306b\u4f7f\u7528\u3055\u308c\u308bAVX\u30ec\u30b8\u30b9\u30bf\n#define     mat_c0_0_8           %ymm0\n#define     mat_c0_8_16          %ymm1\n#define     mat_c0_16_24         %ymm2\n#define     mat_c0_24_32         %ymm3\n#define     mat_c1_0_8           %ymm4\n#define     mat_c1_8_16          %ymm5\n#define     mat_c1_16_24         %ymm6\n#define     mat_c1_24_32         %ymm7\n#define     mat_a0_0_8           %ymm8\n#define     mat_a1_0_8           %ymm9\n#define     mat_b0_0_8           %ymm10\n#define     mat_b0_8_16          %ymm11\n#define     mat_b0_16_24         %ymm12\n#define     mat_b0_24_32         %ymm13\n\n.macro PUSHD\n    push %rax\n    push %rbx\n    push %rcx\n    push %rdx\n    push %rsi\n    push %rdi\n    push %rbp\n    push %r8\n    push %r9\n    push %r10\n    push %r11\n    push %r12\n    push %r13\n    push %r14\n    push %r15\n.endm\n\n.macro POPD\n    pop %r15\n    pop %r14\n    pop %r13\n    pop %r12\n    pop %r11\n    pop %r10\n    pop %r9\n    pop %r8\n    pop %rbp\n    pop %rdi\n    pop %rsi\n    pop %rdx\n    pop %rcx\n    pop %rbx\n    pop %rax\n.endm\n\n.macro GEMM_INIT\n    mov %rdx, MAT_B\n.endm\n\n.macro LOAD_MAT_A     \/\/ \u884c\u5217A\u306e\u540c\u3058\u5217\u306e2\u3064 (A&#91;m]&#91;k], A&#91;m+1]&#91;k])\u3092\u30ed\u30fc\u30c9\n    \/\/ A&#91;m]&#91;k]\u306e\u30c7\u30fc\u30bf\u3092\u30ed\u30fc\u30c9\n    mov loop_m, %rax\n    mul DIM_K\n    mov %rax, temp_reg\n    add loop_k, temp_reg\n\n    \/\/ A&#91;m]&#91;k]\u306e\u30a2\u30c9\u30ec\u30b9\u3092\u8a08\u7b97\n    mov temp_reg, mat_elem_idx\n    shl $2, mat_elem_idx        \/\/ *=4\n\n    vbroadcastss (MAT_A, mat_elem_idx), mat_a0_0_8    \/\/ A&#91;m]&#91;k]\u3092AVX\u30ec\u30b8\u30b9\u30bf\u306e8\u3064\u306e\u30bb\u30eb\u306b\u30d6\u30ed\u30fc\u30c9\u30ad\u30e3\u30b9\u30c8\n\n    \/\/ TODO: A&#91;m+1]&#91;k]\u3092\u30ed\u30fc\u30c9\u3057\u3066mat_a1_0_8\u306b\u30d6\u30ed\u30fc\u30c9\u30ad\u30e3\u30b9\u30c8\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\u3057\u3066\n    mov temp_reg, mat_elem_idx\n    add DIM_K, mat_elem_idx     \/\/ mat_elem_idx -> A&#91;m+1]&#91;k]\n    shl $2, mat_elem_idx        \/\/ *=4\n    vbroadcastss (MAT_A, mat_elem_idx), mat_a1_0_8    \/\/ A&#91;m+1]&#91;k]\u3092AVX\u30ec\u30b8\u30b9\u30bf\u306e8\u3064\u306e\u30bb\u30eb\u306b\u30d6\u30ed\u30fc\u30c9\u30ad\u30e3\u30b9\u30c8\n.endm\n\n.macro LOAD_MAT_B    \/\/ \u884c\u5217B\u306e1\u884c32\u500b\u306e\u8981\u7d20\u3092\u30ed\u30fc\u30c9 (B&#91;k]&#91;n:n+32])\n\n    \/\/ TODO: B&#91;k]&#91;n:n+32]\u3092\u30ed\u30fc\u30c9\u3057\u3066mat_b0_0_8, mat_b0_8_16, mat_b0_16_24, mat_b0_24_32\u306b\u683c\u7d0d\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\u3057\u3066\n    mov loop_k, %rax\n    mul DIM_N\n    add loop_n, %rax                       \/\/ B&#91;k]&#91;n]\u306e\u7dda\u5f62\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\n    lea (MAT_B, %rax, 4), mat_elem_idx\u3000\u3000\u3000\/\/ B&#91;k]&#91;n]\u306e\u30a2\u30c9\u30ec\u30b9\n    \n    vmovups 
(mat_elem_idx), mat_b0_0_8     \/\/ B&#91;k]&#91;n:n+8]\u3092\u30ed\u30fc\u30c9\n    vmovups 32(mat_elem_idx), mat_b0_8_16  \/\/ B&#91;k]&#91;n+8:n+16]\u3092\u30ed\u30fc\u30c9\n    vmovups 64(mat_elem_idx), mat_b0_16_24 \/\/ B&#91;k]&#91;n+16:n+24]\u3092\u30ed\u30fc\u30c9\n    vmovups 96(mat_elem_idx), mat_b0_24_32 \/\/ B&#91;k]&#91;n+24:n+32]\u3092\u30ed\u30fc\u30c9\n.endm\n\n.macro LOAD_MAT_C\n    mov loop_m, %rax\n    mul DIM_N\n    mov %rax, temp_reg\n    add loop_n, temp_reg\n\n    \/\/ \u884c\u5217C\u306e\u6700\u521d\u306e\u884c (C&#91;m]&#91;n:n+32]) \u3092\u30ed\u30fc\u30c9\n    mov temp_reg, mat_elem_idx\n    shl $2, mat_elem_idx        \/\/ *=4\n\n    \/\/ TODO: C&#91;m]&#91;n:n+32]\u3092\u30ed\u30fc\u30c9\u3057\u3066mat_c0_0_8, mat_c0_8_16, mat_c0_16_24, mat_c0_24_32\u306b\u683c\u7d0d\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\u3057\u3066\n    vmovups (MAT_C, mat_elem_idx), mat_c0_0_8\n    vmovups 32(MAT_C, mat_elem_idx), mat_c0_8_16\n    vmovups 64(MAT_C, mat_elem_idx), mat_c0_16_24\n    vmovups 96(MAT_C, mat_elem_idx), mat_c0_24_32\n    \/\/ \u884c\u5217C\u306e2\u884c\u76ee (C&#91;m+1]&#91;n:n+32]) \u3092\u30ed\u30fc\u30c9\n    mov temp_reg, mat_elem_idx\n    add DIM_N, mat_elem_idx\n    shl $2, mat_elem_idx        \/\/ *=4\n\n    \/\/ TODO: C&#91;m+1]&#91;n:n+32]\u3092\u30ed\u30fc\u30c9\u3057\u3066mat_c1_0_8, mat_c1_8_16, mat_c1_16_24, mat_c1_24_32\u306b\u683c\u7d0d\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\u3057\u3066\n    vmovups (MAT_C, mat_elem_idx), mat_c1_0_8\n    vmovups 32(MAT_C, mat_elem_idx), mat_c1_8_16\n    vmovups 64(MAT_C, mat_elem_idx), mat_c1_16_24\n    vmovups 96(MAT_C, mat_elem_idx), mat_c1_24_32\n.endm\n\n.macro STORE_MAT_C\n    mov loop_m, %rax\n    mul DIM_N\n    mov %rax, temp_reg\n    add loop_n, temp_reg\n\n    \/\/ \u884c\u5217C\u306e\u6700\u521d\u306e\u884c\u30c7\u30fc\u30bf\u3092\u4fdd\u5b58, C&#91;m]&#91;n:n+32]\n    mov temp_reg, mat_elem_idx\n    shl $2, mat_elem_idx        \/\/ *=4\n\n    \/\/ TODO: mat_c0_0_8, mat_c0_8_16, mat_c0_16_24, mat_c0_24_32\u3092\u4fdd\u5b58\u3057\u3066C&#91;m]&#91;n:n+32]\u306b\u683c\u7d0d\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\n    vmovups mat_c0_0_8, (MAT_C, mat_elem_idx)\n    vmovups mat_c0_8_16, 32(MAT_C, mat_elem_idx)\n    vmovups mat_c0_16_24, 64(MAT_C, mat_elem_idx)\n    vmovups mat_c0_24_32, 96(MAT_C, mat_elem_idx)\n    \/\/ \u884c\u5217C\u306e2\u884c\u76ee\u306e\u30c7\u30fc\u30bf\u3092\u4fdd\u5b58, C&#91;m+1]&#91;n:n+32]\n    \/\/ TODO: mat_c1_0_8, mat_c1_8_16, mat_c1_16_24, mat_c1_24_32\u3092\u4fdd\u5b58\u3057\u3066C&#91;m+1]&#91;n:n+32]\u306b\u683c\u7d0d\u3059\u308b\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\n    add DIM_N, temp_reg\n    mov temp_reg, mat_elem_idx\n    shl $2, mat_elem_idx\n    vmovups mat_c1_0_8, (MAT_C, mat_elem_idx)\n    vmovups mat_c1_8_16, 32(MAT_C, mat_elem_idx)\n    vmovups mat_c1_16_24, 64(MAT_C, mat_elem_idx)\n    vmovups mat_c1_24_32, 96(MAT_C, mat_elem_idx)\n.endm\n\n.macro DO_COMPUTE      \/\/ C&#91;m:m+2]&#91;n:n+32] += A&#91;m:m+2]&#91;k] * B&#91;k:k+8]&#91;n:n+32] \u306e\u8a08\u7b97\u3092\u5b9f\u884c\n\n    \/\/ TODO: C&#91;m:m+2]&#91;n:n+32] += A&#91;m:m+2]&#91;k] * B&#91;k:k+8]&#91;n:n+32] \u306e\u8a08\u7b97\u30ed\u30b8\u30c3\u30af\u3092\u8ffd\u52a0\n    vfmadd231ps mat_b0_0_8, mat_a0_0_8, mat_c0_0_8\n    vfmadd231ps mat_b0_8_16, mat_a0_0_8, mat_c0_8_16\n    vfmadd231ps mat_b0_16_24, mat_a0_0_8, mat_c0_16_24\n    vfmadd231ps mat_b0_24_32, mat_a0_0_8, mat_c0_24_32\n\n    vfmadd231ps mat_b0_0_8, mat_a1_0_8, mat_c1_0_8\n    vfmadd231ps 
mat_b0_8_16, mat_a1_0_8, mat_c1_8_16\n    vfmadd231ps mat_b0_16_24, mat_a1_0_8, mat_c1_16_24\n    vfmadd231ps mat_b0_24_32, mat_a1_0_8, mat_c1_24_32\n.endm\n\n\n.macro DO_GEMM\n    xor loop_n, loop_n\nDO_LOOP_N:\n\n    xor loop_m, loop_m\nDO_LOOP_M:\n    \/\/ \u884c\u5217C\u306e\u30c7\u30fc\u30bf\u3092\u30ed\u30fc\u30c9\n    LOAD_MAT_C\n\n    xor loop_k, loop_k\nDO_LOOP_K:\n    \/\/ \u884c\u5217A\u304a\u3088\u3073\u884c\u5217B\u306e\u5206\u5272\u3055\u308c\u305f\u30c7\u30fc\u30bf\u3092\u30ed\u30fc\u30c9\n    LOAD_MAT_A\n    LOAD_MAT_B\n\n    DO_COMPUTE\n\n    add $1, loop_k              \/\/ kr=1\n    cmp DIM_K, loop_k\n    jl DO_LOOP_K\n\n    \/\/ \u7d50\u679c\u3092\u4fdd\u5b58\n    STORE_MAT_C\n\n    add $2, loop_m              \/\/ mr=2\n    cmp DIM_M, loop_m\n    jl DO_LOOP_M\n\n    add $32, loop_n             \/\/ nr=32\n    cmp DIM_N, loop_n\n    jl DO_LOOP_N\n\n.endm\n\ngemm_kernel_opt_avx:\n    PUSHD\n    GEMM_INIT\n    DO_GEMM\n    POPD\n    ret<\/code><\/pre>\n<\/div>\n<\/div><\/div>\n\n\n\n<p>\u51fa\u529b<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"zsh\">\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab3\/build]\n\u2514\u2500$ .\/dist\/bins\/lab3_gemm_opt_avx.unittest              \nRunning main() from \/home\/amamitsu\/Applications\/Lab3\/build\/_deps\/googletest-src\/googletest\/src\/gtest_main.cc\n&#91;==========] Running 3 tests from 1 test suite.\n&#91;----------] Global test environment set-up.\n&#91;----------] 3 tests from gemm_kernel_opt_avx\n&#91; RUN      ] gemm_kernel_opt_avx.test0\n&#91;       OK ] gemm_kernel_opt_avx.test0 (0 ms)\n&#91; RUN      ] gemm_kernel_opt_avx.test1\n&#91;       OK ] gemm_kernel_opt_avx.test1 (0 ms)\n&#91; RUN      ] gemm_kernel_opt_avx.test2\n&#91;       OK ] gemm_kernel_opt_avx.test2 (1 ms)\n&#91;----------] 3 tests from gemm_kernel_opt_avx (1 ms total)\n\n&#91;----------] Global test environment tear-down\n&#91;==========] 3 tests from 1 test suite ran. 
(2 ms total)\n&#91;  PASSED  ] 3 tests.\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab3\/build]\n\u2514\u2500$ .\/dist\/bins\/lab3_gemm_opt_avx 2048 512 64  \n--- Performance before avx optimization ---\nGEMM performance info:\n                M, K, N: 2048, 512, 64\n                Ops: 0.134218\n                Total compute time(s): 8.98306\n                Cost(s): 0.0449153\n                Benchmark(Gflops): 2.98824\n--- Performance for after avx optimization ---\nGEMM performance info:\n                M, K, N: 2048, 512, 64\n                Ops: 0.134218\n                Total compute time(s): 0.319307\n                Cost(s): 0.00159654\n                Benchmark(Gflops): 84.0681\n----------------------------\nPerformance difference(Gflops): 81.0799\n<\/code><\/pre>\n\n\n\n<p>\u6027\u80fd\u306f 2713% \u5411\u4e0a\u3057\u307e\u3057\u305f\u3002\u3053\u308c\u306f\u30011 \u3064\u306e\u547d\u4ee4\u3067\u5927\u91cf\u306e\u30c7\u30fc\u30bf\u3092\u51e6\u7406\u3067\u304d\u308b\u305f\u3081\u3067\u3059\u3002<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">OpenMP\u3068AVX\u547d\u4ee4\u3067\u4efb\u610f\u5f62\u72b6\u306e\u884c\u5217\u4e57\u7b97\u3092\u5b9f\u73fe\u3059\u308b<\/h3>\n\n\n\n<p>\u3053\u3053\u3067\u306f\u3001\u8a08\u7b97\u3092\u6b63\u65b9\u5f62\u306b\u8fd1\u3044\u30d6\u30ed\u30c3\u30af\u306b\u5206\u5272\u3057\u3088\u3046\u3068\u3057\u3066\u3044\u307e\u3059\u3002\u4f8b\u3048\u3070\u3001\u30b9\u30ec\u30c3\u30c9\u304c 12 \u500b\u3042\u308b\u5834\u5408\u3001\u884c\u5217 C \u3092 $3\\times 4$ \u306e\u90e8\u5206\u306b\u5206\u5272\u3057\u3001\u5fc5\u8981\u306b\u5fdc\u3058\u3066\u30d1\u30c7\u30a3\u30f3\u30b0\u3092\u52a0\u3048\u3066\u95a2\u9023\u30c7\u30fc\u30bf\u3092\u6e96\u5099\u3057\u307e\u3059\u3002<\/p>\n\n\n\n<div class=\"swell-block-tab is-style-balloon\" data-width-pc=\"flex-50\" data-width-sp=\"auto\"><ul class=\"c-tabList\" role=\"tablist\"><li class=\"c-tabList__item\" role=\"presentation\"><button role=\"tab\" class=\"c-tabList__button\" aria-selected=\"true\" aria-controls=\"tab-72ed6ddf-0\" data-onclick=\"tabControl\">Baseline<\/button><\/li><li class=\"c-tabList__item\" role=\"presentation\"><button role=\"tab\" class=\"c-tabList__button\" aria-selected=\"false\" aria-controls=\"tab-72ed6ddf-1\" data-onclick=\"tabControl\">Optimized<\/button><\/li><\/ul><div class=\"c-tabBody\">\n<div id=\"tab-72ed6ddf-0\" class=\"c-tabBody__item\" aria-hidden=\"false\">\n<pre class=\"wp-block-code\"><code class=\"cpp\">#include &lt;omp.h>\n#include \"openmp_gemm.h\"\n#include \"gemm_kernel_opt.h\"\n#include &lt;cstring>\n\ninline int get_parallel_thread_num(uint64_t M, uint64_t K, uint64_t N, int kernel_mr, int kernel_nr, int max_threads, int&amp; m_thread, int&amp; n_thread) {\n    m_thread = 2;\n    if (m_thread > max_threads) {\n        m_thread = max_threads;\n    }\n    n_thread = max_threads \/ m_thread;\n    return m_thread * n_thread;\n}\n\nvoid openmp_gemm_baseline(int thread_num, float *C, float *A, float *B, uint64_t M, uint64_t N, uint64_t K){\n    \/\/ \u5b9a\u6570\n    const int KERNEL_MR = 2, KERNEL_NR = 32;    \/\/ TODO: AVX\u30ab\u30fc\u30cd\u30eb\u306b\u3088\u308b\u5909\u66f4\n    int m_thread = 1, n_thread = 1;\n   \u3000\n    int real_thread_num = get_parallel_thread_num(M, K, N, KERNEL_MR, KERNEL_NR, thread_num, m_thread, n_thread);\n\n#pragma omp parallel num_threads(real_thread_num) \\\n            default(none) \\\n            shared(C) \\\n            firstprivate(A, B, M, N, K, KERNEL_MR, KERNEL_NR, \\\n                m_thread, n_thread)\n    {\n 
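       \/\/ Each thread owns one (thread_id_m, thread_id_n) tile of C: the rows of C are split into\n        \/\/ m_thread chunks and the columns into n_thread chunks, and each per-thread tile is padded\n        \/\/ up to a multiple of KERNEL_MR rows and KERNEL_NR columns so that gemm_kernel_opt_avx\n        \/\/ never has to handle a partial 2x32 block.\n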
       const int kernel_size = KERNEL_MR * KERNEL_NR;\n        int thread_id = omp_get_thread_num();  \/\/ \u30b9\u30ec\u30c3\u30c9\u306e\u5272\u308a\u5f53\u3066\u306f\u884c\u512a\u5148\u306e\u65b9\u5f0f\u3092\u63a1\u7528\u3057\u3001\u5404\u884c\u306b\u3064\u3044\u3066\u30b9\u30ec\u30c3\u30c9\u756a\u53f7\u306f 0, 1, 2, 3, ... \u306e\u9806\u3068\u306a\u308b\u3002\n        \/* 3\u3064\u306e\u6b21\u5143\u306e\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\u3092\u8a08\u7b97\u3059\u308b *\/\n        int thread_id_m = thread_id \/ n_thread;  \/\/ M\u6b21\u5143\u306e\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\n        int thread_id_n = thread_id % n_thread;  \/\/ N\u6b21\u5143\u306e\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\n        \/* 3\u3064\u306e\u6b21\u5143\u306e\u8a08\u7b97\u958b\u59cb\u4f4d\u7f6e\u306e\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\u3092\u8a08\u7b97\u3059\u308b *\/\n\n        \/\/ M\u6b21\u5143\u306b\u5272\u308a\u5f53\u3066\u3089\u308c\u305f\u884c\u6570\u3092\u8a08\u7b97\u3059\u308b\n        int dim_m_per_thread = (M + m_thread - 1) \/ m_thread; \/\/ M\u6b21\u5143\u3067\u5206\u5272\u53ef\u80fd\u306a\u30d6\u30ed\u30c3\u30af\u6570\uff08\u4e0d\u5b8c\u5168\u306a\u30d6\u30ed\u30c3\u30af\u3092\u542b\u3080\uff09\n        int m_padding = dim_m_per_thread % KERNEL_MR;\n        if (m_padding != 0) {\n            m_padding = KERNEL_MR - m_padding;\n        }\n        int dim_n_per_thread = (N + n_thread - 1) \/ n_thread; \/\/ N\u6b21\u5143\u3067\u5206\u5272\u53ef\u80fd\u306a\u30d6\u30ed\u30c3\u30af\u6570\uff08\u4e0d\u5b8c\u5168\u306a\u30d6\u30ed\u30c3\u30af\u3092\u542b\u3080\uff09\n        int n_padding = dim_n_per_thread % KERNEL_NR;\n        if (n_padding != 0) {\n            n_padding = KERNEL_NR - n_padding;\n        }\n\n        \/\/ \u6700\u521d\u306e\u30b9\u30c6\u30c3\u30d7\u306e\u30eb\u30fc\u30d7\u306fM\u6b21\u5143\u304b\u3089\u59cb\u3081\u308b\u3002\u3053\u306e\u3088\u3046\u306b\u3059\u308b\u3053\u3068\u3067\u3001B\u3092\u5171\u6709\u3057\u3001C\u3092\u7d2f\u7a4d\u8a08\u7b97\u3059\u308b\u5fc5\u8981\u304c\u306a\u304f\u306a\u308b\u3002\n        \/\/ \u305d\u306e\u305f\u3081\u3001\u3053\u306e\u6642\u70b9\u3067A\u3068C\u306e\u958b\u59cb\u4f4d\u7f6e\u3092\u518d\u8a08\u7b97\u3059\u308b\u5fc5\u8981\u304c\u3042\u308b\u3002\n        int thread_m_start = thread_id_m * dim_m_per_thread;\n        int thread_m_end = thread_m_start + dim_m_per_thread;\n        if (thread_m_end > M) {\n            thread_m_end = M;\n        }\n\n        \/\/ N\u6b21\u5143\u306estart\u3001end\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\u3092\u8a08\u7b97\n        int thread_n_start = thread_id_n * dim_n_per_thread;\n        int thread_n_end = thread_n_start + dim_n_per_thread;\n        if (thread_n_end > N) {\n            thread_n_end = N;\n        }\n\n        \/\/ \u4e09\u3064\u306e\u884c\u5217\u306e\u30e1\u30e2\u30ea\u3092\u53d6\u5f97\n        auto A_padding = new float&#91;(dim_m_per_thread + m_padding) * K];\n        memset((void *) A_padding, 0, (dim_m_per_thread + m_padding) * K * sizeof(float));\n        auto B_padding = new float&#91;(dim_n_per_thread + n_padding) * K];\n        memset((void *) B_padding, 0, (dim_n_per_thread + n_padding) * K * sizeof(float));\n        auto C_padding = new float&#91;(dim_m_per_thread + m_padding) * (dim_n_per_thread + n_padding)];\n        memset((void *) C_padding, 0, (dim_m_per_thread + m_padding) * (dim_n_per_thread + n_padding) * sizeof(float));\n\n        \/\/ \u30c7\u30fc\u30bf\u3092\u30b3\u30fc\u30d4\n        for (int m = thread_m_start; m &lt; thread_m_end; m++) {\n            memcpy(A_padding + (m - 
thread_m_start) * K, A + m * K, K * sizeof(float));\n        }\n\n        for (int k = 0; k &lt; K; k++) {\n            memcpy(B_padding + k * (dim_n_per_thread + n_padding),\n                   B + thread_n_start + k * N,\n                   (thread_n_end - thread_n_start) * sizeof(float));\n        }\n\n        for (int m = thread_m_start; m &lt; thread_m_end; m++) {\n            memcpy(C_padding + (m - thread_m_start) * (dim_n_per_thread + n_padding),\n                   C + m * N + thread_n_start,\n                   (thread_n_end - thread_n_start) * sizeof(float));\n        }\n\n        \/\/ \u30ab\u30fc\u30cd\u30eb\u3092\u5229\u7528\u3057\u3066\u8a08\u7b97\n        gemm_kernel_opt_avx(C_padding, A_padding, B_padding, (dim_m_per_thread + m_padding),\n                                 (dim_n_per_thread + n_padding), K);\n\n        \/\/ \u7d50\u679c\u3092\u66f8\u304d\u8fbc\u3080\n        for (int m = thread_m_start; m &lt; thread_m_end; m++) {\n            memcpy(C + m * N + thread_n_start,\n                   C_padding + (m - thread_m_start) * (dim_n_per_thread + n_padding),\n                   (thread_n_end - thread_n_start) * sizeof(float));\n        }\n\n        delete&#91;] A_padding;\n        delete&#91;] B_padding;\n        delete&#91;] C_padding;\n    }\n}\n<\/code><\/pre>\n<\/div>\n\n\n\n<div id=\"tab-72ed6ddf-1\" class=\"c-tabBody__item\" aria-hidden=\"true\">\n<pre class=\"wp-block-code\"><code class=\"cpp\">#include &lt;omp.h>\n#include \"openmp_gemm.h\"\n#include \"gemm_kernel_opt.h\"\n#include &lt;cstring>\n#include &lt;cmath>\n\nvoid openmp_gemm_opt(int thread_num, float *C, float *A, float *B, uint64_t M, uint64_t N, uint64_t K) {\n    const int KERNEL_MR = 2;\n    const int KERNEL_NR = 32;\n    const int KERNEL_KR = 32;\n\n    int m_thread = 1, n_thread = 1;\n\n    int sqrt_threads = (int)sqrt(thread_num);\n    for (int mt = sqrt_threads; mt >= 1; mt--) {\n        if (thread_num % mt == 0) {\n            m_thread = mt;\n            n_thread = thread_num \/ mt;\n            break;\n        }\n    }\n\n    #pragma omp parallel num_threads(thread_num)\n    {\n        int thread_id = omp_get_thread_num();\n        int thread_id_m = thread_id \/ n_thread;\n        int thread_id_n = thread_id % n_thread;\n\n        uint64_t m_start = (M * thread_id_m) \/ m_thread;\n        uint64_t m_end = (M * (thread_id_m + 1)) \/ m_thread;\n        uint64_t n_start = (N * thread_id_n) \/ n_thread;\n        uint64_t n_end = (N * (thread_id_n + 1)) \/ n_thread;\n\n        uint64_t m_size = m_end - m_start;\n        uint64_t n_size = n_end - n_start;\n\n        uint64_t m_padded = ((m_size + KERNEL_MR - 1) \/ KERNEL_MR) * KERNEL_MR;\n        uint64_t n_padded = ((n_size + KERNEL_NR - 1) \/ KERNEL_NR) * KERNEL_NR;\n        uint64_t k_padded = ((K + KERNEL_KR - 1) \/ KERNEL_KR) * KERNEL_KR;\n\n        float* A_padded = new float&#91;m_padded * k_padded];\n        float* B_padded = new float&#91;k_padded * n_padded];\n        float* C_padded = new float&#91;m_padded * n_padded];\n\n        memset(A_padded, 0, m_padded * k_padded * sizeof(float));\n        memset(B_padded, 0, k_padded * n_padded * sizeof(float));\n        memset(C_padded, 0, m_padded * n_padded * sizeof(float));\n\n        \/\/ Copy data from A to A_padded with padding\n        for (uint64_t i = 0; i &lt; m_size; i++) {\n            memcpy(&amp;A_padded&#91;i * k_padded], &amp;A&#91;(m_start + i) * K], K * sizeof(float));\n            memset(&amp;A_padded&#91;i * k_padded + K], 0, (k_padded - K) * sizeof(float));\n        }\n\n    
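    \/\/ B is packed the same way: each thread copies only its n_start..n_end column slice into a\n        \/\/ k_padded x n_padded buffer, so every row handed to the AVX kernel is a multiple of\n        \/\/ KERNEL_NR floats wide, with the unused tail left as zeros.\n    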
    \/\/ Copy data from B to B_padded with padding\n        for (uint64_t i = 0; i &lt; K; i++) {\n            memcpy(&amp;B_padded&#91;i * n_padded], &amp;B&#91;i * N + n_start], n_size * sizeof(float));\n            memset(&amp;B_padded&#91;i * n_padded + n_size], 0, (n_padded - n_size) * sizeof(float));\n        }\n\n        \/\/ Call the optimized kernel function\n        gemm_kernel_opt_avx(C_padded, A_padded, B_padded, m_padded, n_padded, k_padded);\n\n        \/\/ Copy the result back to C\n        for (uint64_t i = 0; i &lt; m_size; i++) {\n            memcpy(&amp;C&#91;(m_start + i) * N + n_start], &amp;C_padded&#91;i * n_padded], n_size * sizeof(float));\n        }\n\n        delete&#91;] A_padded;\n        delete&#91;] B_padded;\n        delete&#91;] C_padded;\n    }\n}\n<\/code><\/pre>\n<\/div>\n<\/div><\/div>\n\n\n\n<p>\u51fa\u529b<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"zsh\">\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab3\/build]\n\u2514\u2500$ .\/dist\/bins\/lab3_gemm_opt_openmp.unittest \nRunning main() from \/home\/amamitsu\/Applications\/Lab3\/build\/_deps\/googletest-src\/googletest\/src\/gtest_main.cc\n&#91;==========] Running 4 tests from 1 test suite.\n&#91;----------] Global test environment set-up.\n&#91;----------] 4 tests from openmp_gemm_opt\n&#91; RUN      ] openmp_gemm_opt.test0\n&#91;       OK ] openmp_gemm_opt.test0 (0 ms)\n&#91; RUN      ] openmp_gemm_opt.test1\n&#91;       OK ] openmp_gemm_opt.test1 (40 ms)\n&#91; RUN      ] openmp_gemm_opt.test2\n&#91;       OK ] openmp_gemm_opt.test2 (282 ms)\n&#91; RUN      ] openmp_gemm_opt.test3\n&#91;       OK ] openmp_gemm_opt.test3 (2443 ms)\n&#91;----------] 4 tests from openmp_gemm_opt (2766 ms total)\n\n&#91;----------] Global test environment tear-down\n&#91;==========] 4 tests from 1 test suite ran. 
(2767 ms total)\n&#91;  PASSED  ] 4 tests.\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab3\/build]\n\u2514\u2500$ .\/dist\/bins\/lab3_gemm_opt_openmp 12 256 256 256\n--- Performance before openmp strategy optimization ---\nGEMM performance info:\n                M, K, N: 256, 256, 256\n                Ops: 0.0335544\n                Total compute time(s): 0.041128\n                Cost(s): 0.00020564\n                Benchmark(Gflops): 163.171\n--- Performance for after openmp strategy optimization ---\nGEMM performance info:\n                M, K, N: 256, 256, 256\n                Ops: 0.0335544\n                Total compute time(s): 0.019356\n                Cost(s): 9.678e-05\n                Benchmark(Gflops): 346.708\n----------------------------\nPerformance difference(Gflops): 183.538\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab3\/build]\n\u2514\u2500$ .\/dist\/bins\/lab3_gemm_opt_openmp 7 448 448 448\n--- Performance before openmp strategy optimization ---\nGEMM performance info:\n                M, K, N: 448, 448, 448\n                Ops: 0.179831\n                Total compute time(s): 0.142757\n                Cost(s): 0.000713785\n                Benchmark(Gflops): 251.94\n--- Performance for after openmp strategy optimization ---\nGEMM performance info:\n                M, K, N: 448, 448, 448\n                Ops: 0.179831\n                Total compute time(s): 0.080035\n                Cost(s): 0.000400175\n                Benchmark(Gflops): 449.38\n----------------------------\nPerformance difference(Gflops): 197.441\n<\/code><\/pre>\n\n\n\n<p>Performance roughly doubled: 163 to 347 Gflops in the 256-size run and 252 to 449 Gflops in the 448-size run above. The gain comes from the better locality of the optimized version, which splits the threads over a two-dimensional grid of tiles of C, whereas the original OpenMP version arranged the threads along a single dimension.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Lab4 CUDA: GPU Matrix Multiplication<\/h2>\n\n\n\n<p>The goal of this task is to complete the key computation logic in the code. It is actually quite simple.<\/p>\n\n\n\n<p>The function MatrixMulKernel computes the product of two matrices \\(M\\) and \\(N\\) and stores the result in matrix \\(P\\).<\/p>\n\n\n\n<p>float* d_M: pointer to the input matrix (M)<br>float* d_N: pointer to the input matrix (N)<br>float* d_P: pointer to the output matrix (P)<br>int width: 
\u884c\u5217\u306e\u5e45\uff08\u6b63\u65b9\u884c\u5217\u306e\u5834\u5408\uff09<br><\/p>\n\n\n\n<p>CUDA\u30ab\u30fc\u30cd\u30eb\u5185\u306e\u5404\u30b9\u30ec\u30c3\u30c9\u306f\u3001\u7d50\u679c\u884c\u5217\\(P\\)\u306e1\u3064\u306e\u90e8\u5206\u3092\u8a08\u7b97\u3057\u307e\u3059\u3002\u5404\u30b9\u30ec\u30c3\u30c9\u306e\u884c\u3068\u5217\u306e\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\u306f\u4ee5\u4e0b\u306e\u3088\u3046\u306b\u8a08\u7b97\u3055\u308c\u307e\u3059\uff1a<br>\\[\\text{row} = \\text{blockIdx.y} \\times \\text{blockDim.y} + \\text{threadIdx.y}\\]<br>\\[\\text{col} = \\text{blockIdx.x} \\times \\text{blockDim.x} + \\text{threadIdx.x}\\]<\/p>\n\n\n\n<p>\u6b21\u306e\u6761\u4ef6\u306b\u3088\u308a\u3001\u30b9\u30ec\u30c3\u30c9\u304c\u884c\u5217\u306e\u6709\u52b9\u306a\u8981\u7d20\u306e\u307f\u3092\u64cd\u4f5c\u3059\u308b\u3053\u3068\u3092\u4fdd\u8a3c\u3057\u307e\u3059\uff1a<br>\\[\\text{if } (\\text{row} &lt; \\text{width}) \\text{ and } (\\text{col} &lt; \\text{width})\\]<\/p>\n\n\n\n<p>\u95a2\u6570\u306f\u6b21\u306e\u5f0f\u3092\u4f7f\u7528\u3057\u3066\u3001\u7d50\u679c\u884c\u5217(P)\u306e\u5404\u8981\u7d20\u3092\u8a08\u7b97\u3057\u307e\u3059\uff1a<br>\\[P[\\text{row}, \\text{col}] = \\sum_{k=0}^{\\text{width}-1} M[\\text{row}, k] \\times N[k, \\text{col}]\\]<\/p>\n\n\n\n<p>\u3053\u308c\u3092\u30b3\u30fc\u30c9\u3067\u5b9f\u88c5\u3059\u308b\u3068\u4ee5\u4e0b\u306e\u3088\u3046\u306b\u306a\u308a\u307e\u3059\uff1a<\/p>\n\n\n\n<div class=\"swell-block-tab is-style-balloon\" data-width-pc=\"flex-auto\" data-width-sp=\"auto\"><ul class=\"c-tabList\" role=\"tablist\"><li class=\"c-tabList__item\" role=\"presentation\"><button role=\"tab\" class=\"c-tabList__button\" aria-selected=\"true\" aria-controls=\"tab-87276bc6-0\" data-onclick=\"tabControl\">\u672a\u5b8c\u6210\u306a\u30b3\u30fc\u30c9<\/button><\/li><li class=\"c-tabList__item\" role=\"presentation\"><button role=\"tab\" class=\"c-tabList__button\" aria-selected=\"false\" aria-controls=\"tab-87276bc6-1\" data-onclick=\"tabControl\">\u5b9f\u73fe\u3057\u305f\u30b3\u30fc\u30c9<\/button><\/li><\/ul><div class=\"c-tabBody\">\n<div id=\"tab-87276bc6-0\" class=\"c-tabBody__item\" aria-hidden=\"false\">\n<pre class=\"wp-block-code\"><code class=\"cpp\">\/\/ #define USE_CUBLAS\n\n#include &lt;iostream>\n#include &lt;cstdio>\n#include &lt;cuda_runtime.h>\n#ifdef USE_CUBLAS\n#include &lt;cublas_v2.h>\n#endif\n#include &lt;device_launch_parameters.h>\n#include &lt;cmath>\nusing namespace std;\n\nconst int TILE_WIDTH = 16;\t\/\/ \u5b9a\u4e49\u5757block\u5927\u5c0f\n\n\/\/\/\/\/\/\/\/\/\n\/\/ Matrix multiplication with shared memory (CUDA Kernel) on the device: C = A * B\n\/\/\/\/\/\/\/\/\/\nconst int BLOCK_SIZE = TILE_WIDTH;\n__global__ void MatrixMulSharedMemKernel(float *A,\n    float *B, float *C, int wA,\n    int wB) {\n\n\n\n\n}\n\n\n\/\/! 
For square matrices only\n__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int width)\n{\n  \/\/ Calculate the row index of the P element and M\n  \/\/ *** TO DO: Compute the row index for the current thread ***\n  \/\/ int row = ...;\n\n  \/\/ Calculate the column index of the P element and N\n  \/\/ *** TO DO: Compute the column index for the current thread ***\n  \/\/ int col = ...;\n\n  \/\/ Ensure the thread is within bounds\n  if ( (row &lt; width) &amp;&amp; (col &lt; width) ) {\n    float pValue = 0.0;\n\n    \/\/ Each thread computes one element of the matrix\n    \/\/ *** TO DO: Implement the matrix multiplication for a single element ***\n\n\n    \/\/ Store the computed value into the output matrix\n    \/\/ *** TO DO: Write the computed value to the correct position in d_P ***\n    \/\/ d_P&#91;row * width + col] = ...;\n  }\n}\n\n\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\n\/\/! Compute reference data set matrix multiply on CPU\n\/\/! C = A * B\n\/\/! @param C          reference data, computed but preallocated\n\/\/! @param A          matrix A as provided to device\n\/\/! @param B          matrix B as provided to device\n\/\/! @param hA         height of matrix A\n\/\/! @param wA         width of matrix A\n\/\/! @param wB         width of matrix B\n\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\nvoid\nmatrixMulCPU(float *C, const float *A, const float *B, unsigned int hA, unsigned int wA, unsigned int wB)\n{\n    for (unsigned int i = 0; i &lt; hA; ++i)\n        for (unsigned int j = 0; j &lt; wB; ++j)\n        {\n            double sum = 0;\n\n            for (unsigned int k = 0; k &lt; wA; ++k)\n            {\n                double a = A&#91;i * wA + k];\n                double b = B&#91;k * wB + j];\n                sum += a * b;\n            }\n\n            C&#91;i * wB + j] = (float)sum;\n        }\n}\n\nvoid printDiff(float *data1, float *data2, int width, int height, int iListLength, float fListTol)\n{\n    printf(\"Listing first %d Differences > %.6f...\\n\", iListLength, fListTol);\n    int i,j,k;\n    int error_count=0;\n\n    for (j = 0; j &lt; height; j++)\n    {\n        for (i = 0; i &lt; width; i++)\n        {\n            k = j * width + i;\n            float fDiff = fabs(data1&#91;k] - data2&#91;k]);\n\n            if (fDiff > fListTol)\n            {\n                if (error_count &lt; iListLength)\n                {\n                    printf(\"    Loc(%d,%d)\\tCPU=%.5f\\tGPU=%.5f\\tDiff=%.6f\\n\", i, j, data1&#91;k], data2&#91;k], fDiff);\n                }\n\n                error_count++;\n            }\n        }\n    }\n\n    printf(\" \\n  Total Errors = %d\\n\", error_count);\n}\n\nvoid getArg(int argc, char* argv&#91;], int &amp;size, int &amp;check)\n{\n  if (argc != 3)\n  {\n    cerr &lt;&lt; \"Usage: \" &lt;&lt; argv&#91;0] &lt;&lt; \" &lt;check_enable> &lt;size>\\n\";\n    cerr &lt;&lt; \"\\tcheck_enable: 1 to enable result checking\\n\";\n    cerr &lt;&lt; \"\\tsize: size of the matrix\\n\";\n    exit(1);\n  }\n\n  int val1, val2;\n  try\n  {\n    val1 = stoi(argv&#91;1]);\n    val2 = stoi(argv&#91;2]);\n  }\n  catch (const invalid_argument&amp; e)\n  {\n    cerr &lt;&lt; \"ERROR: parameters should be integer\\n\";\n    exit(1);\n  }\n\n  check = val1;\n  size = val2;\n}\n\n\n\nint main(int 
argc, char* argv&#91;])\n{\n  int size, check;\n  getArg(argc, argv, size, check);\n\n  int m = size, n = size, k = size;\n  \n  \/\/ \u58f0\u660e\u5b58\u653e\u5728GPU\u4e0a\u7684\u6570\u7ec4\n  float *h_M, *h_N, *d_M, *d_N;\n  float *h_P, *d_P;\n  \n  size_t sizeM = m * k * sizeof(float);\n  size_t sizeN = k * n * sizeof(float);\n  size_t sizeP = m * n * sizeof(float);\n\n  \/\/ Allocate host memory\n  h_M = (float*) malloc(sizeM);\n  h_N = (float*) malloc(sizeN);\n  h_P = (float*) malloc(sizeP);\n  float *reference = (float *)malloc(sizeP);\n\n  \/\/ Allocate device memory\n  cudaMalloc(&amp;d_M, sizeM);\n  cudaMalloc(&amp;d_N, sizeN);\n  cudaMalloc(&amp;d_P, sizeP);\n\n  \/\/ Init data \n  for(int i = 0; i &lt; m * n; ++i)\n  {\n    if(i % 2 == 0)\n      h_M&#91;i] = 1.0;\n    else\n      h_M&#91;i] = 0.5;\n  }\n\n  for(int i = 0;i &lt; n * k; ++i)\n  {\n    if(i % 2 == 0)\n      h_N&#91;i] = 0.5;\n    else\n      h_N&#91;i] = 1.0;\n  }\n\n  \/\/ Copy data from CPU to GPU\n  cudaMemcpy(d_M, h_M, sizeM, cudaMemcpyHostToDevice);\n  cudaMemcpy(d_N, h_N, sizeN, cudaMemcpyHostToDevice);\n\n  \/\/ Timing records \n  cudaEvent_t start,stop;\n  cudaEventCreate(&amp;start);\n  cudaEventCreate(&amp;stop);\n  cudaEventRecord(start,0);\n\n  \/\/ Launch kernel \u5b9a\u4e49grid&amp;block\n  dim3 grid((int)ceil(k*1.0 \/ TILE_WIDTH), (int)ceil(m*1.0\/ TILE_WIDTH));\n  dim3 block(TILE_WIDTH, TILE_WIDTH);\n  \n  int nIter = 5;\n#ifdef USE_CUBLAS\n  cublasHandle_t handle;\n  cublasCreate(&amp;handle);\n#endif\n  const float alpha = 1.0f;\n  const float beta  = 0.0f;\n  for (int j = 0; j &lt; nIter; j++) {\n    \/\/matrixMulCPU(reference, h_M, h_N, m, k, n);\n    MatrixMulKernel&lt;&lt;&lt;grid, block>>>(d_M, d_N, d_P, m);\n    \/\/MatrixMulSharedMemKernel&lt;&lt;&lt;grid, block>>>(d_M, d_N, d_P, m, n);\n    \/\/cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &amp;alpha, d_N, n, d_M, k, &amp;beta, d_P, n);\n  }\n\n  cudaEventRecord(stop, 0);\n  cudaEventSynchronize(stop);\n  float msecPerMatrixMul;\n  cudaEventElapsedTime(&amp;msecPerMatrixMul, start, stop);\n  msecPerMatrixMul \/= nIter;\n  printf(\"Kernel Elpased Time: %.3f ms\\n\", msecPerMatrixMul);\n\n  \/\/ Compute and print the performance\n  double flopsPerMatrixMul = 2.0 * (double)m * (double)n * (double)k;\n  double gigaFlops = (flopsPerMatrixMul * 1.0e-9f) \/ (msecPerMatrixMul \/ 1000.0f);\n  printf(\"Performance= %.2f GFlop\/s, Time= %.3f msec, Size= %.0f Ops\\n\",\n\t\t  gigaFlops,\n\t\t  msecPerMatrixMul,\n\t\t  flopsPerMatrixMul);\n\n  \/\/ Copy data from GPU to CPU \n  cudaMemcpy(h_P, d_P, sizeP, cudaMemcpyDeviceToHost);\n\n  \/\/ compute reference solution\n  if (check == 1)\n  {\n    printf(\"Computing result using host CPU...\");\n    matrixMulCPU(reference, h_M, h_N, m, k, n);\n    printf(\"done.\\n\");\n    printDiff(reference, h_P, n, m, 100, 1.0e-5f);\n  }\n\n  free(h_P);\n  free(h_M);\n  free(h_N);\n  cudaFree(d_P);\n  cudaFree(d_M);\n  cudaFree(d_N);\n#ifdef USE_CUBLAS\n  cublasDestroy(handle);\n#endif\n\n  return 0;\n}<\/code><\/pre>\n<\/div>\n\n\n\n<div id=\"tab-87276bc6-1\" class=\"c-tabBody__item\" aria-hidden=\"true\">\n<pre class=\"wp-block-code\"><code class=\"cpp\">\/\/ #define USE_CUBLAS\n\n#include &lt;iostream>\n#include &lt;cstdio>\n#include &lt;cuda_runtime.h>\n#ifdef USE_CUBLAS\n#include &lt;cublas_v2.h>\n#endif\n#include &lt;device_launch_parameters.h>\n#include &lt;cmath>\nusing namespace std;\n\nconst int TILE_WIDTH = 16;\t\/\/ \u5b9a\u4e49\u5757block\u5927\u5c0f\n\n\/\/\/\/\/\/\/\/\/\n\/\/ Matrix 
multiplication with shared memory (CUDA Kernel) on the device: C = A * B\n\/\/\/\/\/\/\/\/\/\nconst int BLOCK_SIZE = TILE_WIDTH;\n__global__ void MatrixMulSharedMemKernel(float *A,\n    float *B, float *C, int wA,\n    int wB) {\n\n\n\n\n}\n\n\n\/\/! For square matrices only\n__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int width)\n{\n  \/\/ Calculate the row index of the P element and M\n  \/\/ *** TO DO: Compute the row index for the current thread ***\n  \/\/ int row = ...;\n  int row = blockIdx.y * blockDim.y + threadIdx.y;\n  \/\/ Calculate the column index of the P element and N\n  \/\/ *** TO DO: Compute the column index for the current thread ***\n  \/\/ int col = ...;\n  int col = blockIdx.x * blockDim.x + threadIdx.x;\n  \/\/ Ensure the thread is within bounds\n  if ( (row &lt; width) &amp;&amp; (col &lt; width) ) {\n    float pValue = 0.0;\n\n    \/\/ Each thread computes one element of the matrix\n    \/\/ *** TO DO: Implement the matrix multiplication for a single element ***\n    for (int k = 0; k &lt; width; ++k)\n        pValue += d_M&#91;row * width + k] * d_N&#91;k * width + col];\n    \/\/ Store the computed value into the output matrix\n    \/\/ *** TO DO: Write the computed value to the correct position in d_P ***\n    \/\/ d_P&#91;row * width + col] = ...;\n    d_P&#91;row * width + col] = pValue;\n  }\n}\n\n\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\n\/\/! Compute reference data set matrix multiply on CPU\n\/\/! C = A * B\n\/\/! @param C          reference data, computed but preallocated\n\/\/! @param A          matrix A as provided to device\n\/\/! @param B          matrix B as provided to device\n\/\/! @param hA         height of matrix A\n\/\/! @param wA         width of matrix A\n\/\/! 
@param wB         width of matrix B\n\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\nvoid\nmatrixMulCPU(float *C, const float *A, const float *B, unsigned int hA, unsigned int wA, unsigned int wB)\n{\n    for (unsigned int i = 0; i &lt; hA; ++i)\n        for (unsigned int j = 0; j &lt; wB; ++j)\n        {\n            double sum = 0;\n\n            for (unsigned int k = 0; k &lt; wA; ++k)\n            {\n                double a = A&#91;i * wA + k];\n                double b = B&#91;k * wB + j];\n                sum += a * b;\n            }\n\n            C&#91;i * wB + j] = (float)sum;\n        }\n}\n\nvoid printDiff(float *data1, float *data2, int width, int height, int iListLength, float fListTol)\n{\n    printf(\"Listing first %d Differences > %.6f...\\n\", iListLength, fListTol);\n    int i,j,k;\n    int error_count=0;\n\n    for (j = 0; j &lt; height; j++)\n    {\n        for (i = 0; i &lt; width; i++)\n        {\n            k = j * width + i;\n            float fDiff = fabs(data1&#91;k] - data2&#91;k]);\n\n            if (fDiff > fListTol)\n            {\n                if (error_count &lt; iListLength)\n                {\n                    printf(\"    Loc(%d,%d)\\tCPU=%.5f\\tGPU=%.5f\\tDiff=%.6f\\n\", i, j, data1&#91;k], data2&#91;k], fDiff);\n                }\n\n                error_count++;\n            }\n        }\n    }\n\n    printf(\" \\n  Total Errors = %d\\n\", error_count);\n}\n\nvoid getArg(int argc, char* argv&#91;], int &amp;size, int &amp;check)\n{\n  if (argc != 3)\n  {\n    cerr &lt;&lt; \"Usage: \" &lt;&lt; argv&#91;0] &lt;&lt; \" &lt;check_enable> &lt;size>\\n\";\n    cerr &lt;&lt; \"\\tcheck_enable: 1 to enable result checking\\n\";\n    cerr &lt;&lt; \"\\tsize: size of the matrix\\n\";\n    exit(1);\n  }\n\n  int val1, val2;\n  try\n  {\n    val1 = stoi(argv&#91;1]);\n    val2 = stoi(argv&#91;2]);\n  }\n  catch (const invalid_argument&amp; e)\n  {\n    cerr &lt;&lt; \"ERROR: parameters should be integer\\n\";\n    exit(1);\n  }\n\n  check = val1;\n  size = val2;\n}\n\n\n\nint main(int argc, char* argv&#91;])\n{\n  int size, check;\n  getArg(argc, argv, size, check);\n\n  int m = size, n = size, k = size;\n  \n  \/\/ \u58f0\u660e\u5b58\u653e\u5728GPU\u4e0a\u7684\u6570\u7ec4\n  float *h_M, *h_N, *d_M, *d_N;\n  float *h_P, *d_P;\n  \n  size_t sizeM = m * k * sizeof(float);\n  size_t sizeN = k * n * sizeof(float);\n  size_t sizeP = m * n * sizeof(float);\n\n  \/\/ Allocate host memory\n  h_M = (float*) malloc(sizeM);\n  h_N = (float*) malloc(sizeN);\n  h_P = (float*) malloc(sizeP);\n  float *reference = (float *)malloc(sizeP);\n\n  \/\/ Allocate device memory\n  cudaMalloc(&amp;d_M, sizeM);\n  cudaMalloc(&amp;d_N, sizeN);\n  cudaMalloc(&amp;d_P, sizeP);\n\n  \/\/ Init data \n  for(int i = 0; i &lt; m * n; ++i)\n  {\n    if(i % 2 == 0)\n      h_M&#91;i] = 1.0;\n    else\n      h_M&#91;i] = 0.5;\n  }\n\n  for(int i = 0;i &lt; n * k; ++i)\n  {\n    if(i % 2 == 0)\n      h_N&#91;i] = 0.5;\n    else\n      h_N&#91;i] = 1.0;\n  }\n\n  \/\/ Copy data from CPU to GPU\n  cudaMemcpy(d_M, h_M, sizeM, cudaMemcpyHostToDevice);\n  cudaMemcpy(d_N, h_N, sizeN, cudaMemcpyHostToDevice);\n\n  \/\/ Timing records \n  cudaEvent_t start,stop;\n  cudaEventCreate(&amp;start);\n  cudaEventCreate(&amp;stop);\n  cudaEventRecord(start,0);\n\n  \/\/ Launch kernel \u5b9a\u4e49grid&amp;block\n  dim3 grid((int)ceil(k*1.0 \/ TILE_WIDTH), (int)ceil(m*1.0\/ TILE_WIDTH));\n  dim3 
block(TILE_WIDTH, TILE_WIDTH);\n  \n  int nIter = 5;\n#ifdef USE_CUBLAS\n  cublasHandle_t handle;\n  cublasCreate(&amp;handle);\n#endif\n  const float alpha = 1.0f;\n  const float beta  = 0.0f;\n  for (int j = 0; j &lt; nIter; j++) {\n    \/\/matrixMulCPU(reference, h_M, h_N, m, k, n);\n    MatrixMulKernel&lt;&lt;&lt;grid, block>>>(d_M, d_N, d_P, m);\n    \/\/MatrixMulSharedMemKernel&lt;&lt;&lt;grid, block>>>(d_M, d_N, d_P, m, n);\n    \/\/cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &amp;alpha, d_N, n, d_M, k, &amp;beta, d_P, n);\n  }\n\n  cudaEventRecord(stop, 0);\n  cudaEventSynchronize(stop);\n  float msecPerMatrixMul;\n  cudaEventElapsedTime(&amp;msecPerMatrixMul, start, stop);\n  msecPerMatrixMul \/= nIter;\n  printf(\"Kernel Elpased Time: %.3f ms\\n\", msecPerMatrixMul);\n\n  \/\/ Compute and print the performance\n  double flopsPerMatrixMul = 2.0 * (double)m * (double)n * (double)k;\n  double gigaFlops = (flopsPerMatrixMul * 1.0e-9f) \/ (msecPerMatrixMul \/ 1000.0f);\n  printf(\"Performance= %.2f GFlop\/s, Time= %.3f msec, Size= %.0f Ops\\n\",\n\t\t  gigaFlops,\n\t\t  msecPerMatrixMul,\n\t\t  flopsPerMatrixMul);\n\n  \/\/ Copy data from GPU to CPU \n  cudaMemcpy(h_P, d_P, sizeP, cudaMemcpyDeviceToHost);\n\n  \/\/ compute reference solution\n  if (check == 1)\n  {\n    printf(\"Computing result using host CPU...\");\n    matrixMulCPU(reference, h_M, h_N, m, k, n);\n    printf(\"done.\\n\");\n    printDiff(reference, h_P, n, m, 100, 1.0e-5f);\n  }\n\n  free(h_P);\n  free(h_M);\n  free(h_N);\n  cudaFree(d_P);\n  cudaFree(d_M);\n  cudaFree(d_N);\n#ifdef USE_CUBLAS\n  cublasDestroy(handle);\n#endif\n\n  return 0;\n}<\/code><\/pre>\n<\/div>\n<\/div><\/div>\n\n\n\n<p>\u51fa\u529b\uff08\u30d6\u30ed\u30c3\u30af\u30b5\u30a4\u30baTILE_WIDTH = 16\uff09\u306f\u4ee5\u4e0b\u306e\u901a\u308a\u3067\u3059<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"zsh\">\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab4-5]\n\u2514\u2500$ nvcc -arch=compute_35 -L\/usr\/local\/cuda\/lib64 -lcublas .\/matrix_mul.cu -Wno-deprecated-gpu-targets\n.\/matrix_mul.cu(202): warning #177-D: variable \"alpha\" was declared but never referenced\n\n.\/matrix_mul.cu(203): warning #177-D: variable \"beta\" was declared but never referenced\n\n.\/matrix_mul.cu(18): warning #177-D: variable \"BLOCK_SIZE\" was declared but never referenced\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab4-5]\n\u2514\u2500$ .\/a.out 1 1000\nKernel Elpased Time: 0.802 ms\nPerformance= 2493.63 GFlop\/s, Time= 0.802 msec, Size= 2000000000 Ops\nComputing result using host CPU...done.\nListing first 100 Differences > 0.000010...\n\n  Total Errors = 0\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab4-5]\n\u2514\u2500$ .\/a.out 0 256\nKernel Elpased Time: 0.691 ms\nPerformance= 48.57 GFlop\/s, Time= 0.691 msec, Size= 33554432 Ops\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab4-5]\n\u2514\u2500$ .\/a.out 0 1024\nKernel Elpased Time: 0.907 ms\nPerformance= 2367.84 GFlop\/s, Time= 0.907 msec, Size= 2147483648 Ops\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab4-5]\n\u2514\u2500$ .\/a.out 0 2048\nKernel Elpased Time: 5.411 ms\nPerformance= 3175.05 GFlop\/s, Time= 5.411 msec, Size= 17179869184 Ops\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab4-5]\n\u2514\u2500$ .\/a.out 0 4096\nKernel Elpased Time: 40.593 ms\nPerformance= 3385.77 GFlop\/s, Time= 40.593 msec, Size= 137438953472 
Ops\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab4-5]\n\u2514\u2500$ .\/a.out 0 10000\nKernel Elpased Time: 577.970 ms\nPerformance= 3460.38 GFlop\/s, Time= 577.970 msec, Size= 2000000000000 Ops<\/code><\/pre>\n\n\n\n<p>\u30d6\u30ed\u30c3\u30af\u30b5\u30a4\u30ba\u3092\u534a\u5206\uff088\uff09\u306b\u3059\u308b\u3068\u3001\u6027\u80fd\u3082\u534a\u6e1b\u3057\u307e\u3059\u3002<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Lab5 CUDA\uff1aGPU\u306e\u884c\u5217\u4e57\u7b97\u6700\u9069\u5316<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">\u5171\u6709\u30e1\u30e2\u30ea\u3067\u306eGPU\u884c\u5217\u4e57\u7b97\u6700\u9069\u5316<\/h3>\n\n\n\n<p>\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\\(aRow\\), \\(aCol\\), \\(bRow\\), \\(bCol\\)\u306f\u3001\u5404\u30b9\u30ec\u30c3\u30c9\u304c\u30a2\u30af\u30bb\u30b9\u3059\u308b\u884c\u5217\\(A\\)\u3068\\(B\\)\u306e\u8981\u7d20\u3092\u6c7a\u5b9a\u3059\u308b\u305f\u3081\u306b\u4f7f\u7528\u3055\u308c\u307e\u3059\u3002\u3053\u308c\u3089\u306e\u8a08\u7b97\u306f\u6b21\u306e\u3088\u3046\u306b\u306a\u308a\u307e\u3059\uff1a<br><br>\\begin{align*}aRow &amp;= BLOCK\\_SIZE \\cdot by + ty \\\\aCol &amp;= a &#8211; aBegin + tx\\\\bRow &amp;= \\frac{b &#8211; bBegin}{wB} + ty \\\\bCol &amp;= BLOCK\\_SIZE \\cdot bx + tx\\end{align*}<\/p>\n\n\n\n<p>    \\(BLOCK\\_SIZE\\): \u30d6\u30ed\u30c3\u30af\u304c\u51e6\u7406\u3059\u308b\u30bf\u30a4\u30eb\u306e\u30b5\u30a4\u30ba\u3002<br>    \\(wB\\): \u884c\u5217\\(B\\)\u306e\u5e45\u3002<br>    \\(a\\): \u73fe\u5728\u51e6\u7406\u4e2d\u306e\u884c\u5217\\(A\\)\u306e\u30bf\u30a4\u30eb\u306e\u958b\u59cb\u4f4d\u7f6e\u3002<br>    \\(aBegin\\): \u884c\u5217\\(A\\)\u306e\u6700\u521d\u306e\u30bf\u30a4\u30eb\u306e\u958b\u59cb\u4f4d\u7f6e\u3002<br>    \\(b\\): \u73fe\u5728\u51e6\u7406\u4e2d\u306e\u884c\u5217\\(B\\)\u306e\u30bf\u30a4\u30eb\u306e\u958b\u59cb\u4f4d\u7f6e\u3002<br>    \\(bBegin\\): \u884c\u5217\\(B\\)\u306e\u6700\u521d\u306e\u30bf\u30a4\u30eb\u306e\u958b\u59cb\u4f4d\u7f6e\u3002<br>    \\(bx\\): \u6c34\u5e73\u65b9\u5411\u306e\u30d6\u30ed\u30c3\u30af\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\u3002<br>    \\(by\\): \u5782\u76f4\u65b9\u5411\u306e\u30d6\u30ed\u30c3\u30af\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\u3002<br>    \\(ty\\): \u30d6\u30ed\u30c3\u30af\u5185\u306e\u30b9\u30ec\u30c3\u30c9\u306e\u884c\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\u3002<br>    \\(tx\\): \u30d6\u30ed\u30c3\u30af\u5185\u306e\u30b9\u30ec\u30c3\u30c9\u306e\u5217\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\u3002<\/p>\n\n\n\n<p>\u5404\u30b9\u30ec\u30c3\u30c9\u306f\u3001\u884c\u5217\\(As\\)\uff08\u884c\u5217\\(A\\)\u306e\u90e8\u5206\u884c\u5217\uff09\u306e1\u884c\u3068\u3001\u884c\u5217\\(Bs\\)\uff08\u884c\u5217\\(B\\)\u306e\u90e8\u5206\u884c\u5217\uff09\u306e1\u5217\u306e\u30c9\u30c3\u30c8\u7a4d\u3092\u8a08\u7b97\u3057\u3066\u3001\u51fa\u529b\u884c\u5217\\(C\\)\u306e1\u3064\u306e\u8981\u7d20\u3092\u8a08\u7b97\u3057\u307e\u3059\uff1a<br>\\[Csub = \\sum_{k=0}^{BLOCK\\_SIZE &#8211; 1} As[ty][k] \\cdot Bs[k][tx]\\]<br><br>\u8a08\u7b97\u5f8c\u3001\u5404\u30b9\u30ec\u30c3\u30c9\u306f\u7d50\u679c\u3092\u51fa\u529b\u884c\u5217\\(C\\)\u306b\u66f8\u304d\u8fbc\u307f\u307e\u3059\u3002\u66f8\u304d\u8fbc\u307f\u7528\u306e\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\u306f\u4ee5\u4e0b\u3067\u8a08\u7b97\u3055\u308c\u307e\u3059\uff1a<br>\\begin{align*}cRow &amp;= BLOCK\\_SIZE \\cdot by + ty \\\\cCol &amp;= BLOCK\\_SIZE \\cdot bx + 
tx\\end{align*}<br><br>\u5883\u754c\u30c1\u30a7\u30c3\u30af\u306b\u3088\u308a\u3001\u7bc4\u56f2\u5916\u306e\u30a4\u30f3\u30c7\u30c3\u30af\u30b9\u304c\u7121\u8996\u3055\u308c\u308b\u3053\u3068\u3092\u78ba\u8a8d\u3057\u307e\u3059\uff1a<br>    \\[\\text{if } (cRow &lt; wA) \\text{ and } (cCol &lt; wB), \\text{ then store } Csub.\\]<br>\u7d50\u679c\u306f\u4ee5\u4e0b\u306e\u3088\u3046\u306b\u683c\u7d0d\u3055\u308c\u307e\u3059\uff1a<br>\\[C[cRow \\cdot wB + cCol] = Csub\\]<\/p>\n\n\n\n<div class=\"swell-block-tab is-style-balloon\" data-width-pc=\"flex-auto\" data-width-sp=\"auto\"><ul class=\"c-tabList\" role=\"tablist\"><li class=\"c-tabList__item\" role=\"presentation\"><button role=\"tab\" class=\"c-tabList__button\" aria-selected=\"true\" aria-controls=\"tab-2e5c74f3-0\" data-onclick=\"tabControl\">\u672a\u5b8c\u6210\u306a\u30b3\u30fc\u30c9<\/button><\/li><li class=\"c-tabList__item\" role=\"presentation\"><button role=\"tab\" class=\"c-tabList__button\" aria-selected=\"false\" aria-controls=\"tab-2e5c74f3-1\" data-onclick=\"tabControl\">\u5b9f\u73fe\u3057\u305f\u30b3\u30fc\u30c9<\/button><\/li><\/ul><div class=\"c-tabBody\">\n<div id=\"tab-2e5c74f3-0\" class=\"c-tabBody__item\" aria-hidden=\"false\">\n<pre class=\"wp-block-code\"><code class=\"cpp\">\/\/ #define USE_CUBLAS\n\n#include &lt;iostream>\n#include &lt;cstdio>\n#include &lt;cuda_runtime.h>\n#ifdef USE_CUBLAS\n#include &lt;cublas_v2.h>\n#endif\n#include &lt;device_launch_parameters.h>\n#include &lt;cmath>\nusing namespace std;\n\nconst int TILE_WIDTH = 16;\t\/\/ \u5b9a\u4e49\u5757block\u5927\u5c0f\n\n\/\/\/\/\/\/\/\/\/\n\/\/ Matrix multiplication with shared memory (CUDA Kernel) on the device: C = A * B\n\/\/\/\/\/\/\/\/\/\nconst int BLOCK_SIZE = TILE_WIDTH;\n__global__ void MatrixMulSharedMemKernel(float *A,\n    float *B, float *C, int wA,\n    int wB) {\n  \/\/ Block index\n  int bx = blockIdx.x;\n  int by = blockIdx.y;\n\n  \/\/ Thread index\n  int tx = threadIdx.x;\n  int ty = threadIdx.y;\n\n  \/\/ Index of the first sub-matrix of A processed by the block\n  int aBegin = wA * BLOCK_SIZE * by;\n\n  \/\/ Index of the last sub-matrix of A processed by the block\n  int aEnd   = aBegin + wA - 1;\n\n  \/\/ Step size used to iterate through the sub-matrices of A\n  int aStep  = BLOCK_SIZE;\n\n  \/\/ Index of the first sub-matrix of B processed by the block\n  int bBegin = BLOCK_SIZE * bx;\n\n  \/\/ Step size used to iterate through the sub-matrices of B\n  int bStep  = BLOCK_SIZE * wB;\n\n  \/\/ Csub is used to store the element of the block sub-matrix\n  \/\/ that is computed by the thread\n  float Csub = 0;\n\n  \/\/ Loop over all the sub-matrices of A and B\n  \/\/ required to compute the block sub-matrix\n  for (int a = aBegin, b = bBegin;\n       a &lt; aEnd;\n       a += aStep, b += bStep) {\n    \/\/ Declaration of the shared memory array As used to\n    \/\/ store the sub-matrix of A\n    __shared__ float As&#91;BLOCK_SIZE]&#91;BLOCK_SIZE];\n\n    \/\/ Declaration of the shared memory array Bs used to\n    \/\/ store the sub-matrix of B\n    __shared__ float Bs&#91;BLOCK_SIZE]&#91;BLOCK_SIZE];\n\n    \/\/ Load the matrices from device memory\n    \/\/ to shared memory; each **thread** loads\n    \/\/ one element of each matrix\n    \/\/ --- TO DO :Load the elements of the sub-matrix of A into As ---\n    \/\/ ---        Load the elements of the sub-matrix of B into Bs ---\n    \/\/ NOTE: Ensure that the thread indices do not exceed the matrix dimensions to avoid out-of-bounds access.\n    \/\/       Use boundary checks to 
load valid elements into shared memory, and set invalid elements to 0.0f\n\n\n\n\n    \/\/ Synchronize to make sure the matrices are loaded\n    __syncthreads();\n\n    \/\/ Multiply the two matrices together;\n    \/\/ each thread computes one element\n    \/\/ of the block sub-matrix\n#pragma unroll\n    \/\/ --- TO DO :Implement the matrix multiplication using the sub-matrices As and Bs ---\n\n\n\n\n    \/\/ Synchronize to make sure that the preceding\n    \/\/ computation is done before loading two new\n    \/\/ sub-matrices of A and B in the next iteration\n    __syncthreads();\n  }\n\n  \/\/ Write the block sub-matrix to device memory;\n  \/\/ each thread writes one element\n  int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;\n  \/\/ --- TO DO :Store the computed Csub result into matrix C ---\n  \/\/ NOTE: Ensure that the thread indices \"c\" do not exceed the matrix dimensions to avoid out-of-bounds access.\n  \/\/       Use boundary checks to write valid elements to the output matrix C.\n\n\n\n}\n\n\n\/\/! For square matrices only\n__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int width)\n{\n  \/\/ Calculate the row index of the P element and M\n  \/\/ *** TO DO: Compute the row index for the current thread ***\n  \/\/ int row = ...;\n  int row = blockIdx.y * blockDim.y + threadIdx.y;\n  \/\/ Calculate the column index of the P element and N\n  \/\/ *** TO DO: Compute the column index for the current thread ***\n  \/\/ int col = ...;\n  int col = blockIdx.x * blockDim.x + threadIdx.x;\n  \/\/ Ensure the thread is within bounds\n  if ( (row &lt; width) &amp;&amp; (col &lt; width) ) {\n    float pValue = 0.0;\n\n    \/\/ Each thread computes one element of the matrix\n    \/\/ *** TO DO: Implement the matrix multiplication for a single element ***\n    for (int k = 0; k &lt; width; k++)\n        pValue += d_M&#91;row * width + k] * d_N&#91;k * width + col];\n    \/\/ Store the computed value into the output matrix\n    \/\/ *** TO DO: Write the computed value to the correct position in d_P ***\n    \/\/ d_P&#91;row * width + col] = ...;\n    d_P&#91;row * width + col] = pValue;\n  }\n}\n\n\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\n\/\/! Compute reference data set matrix multiply on CPU\n\/\/! C = A * B\n\/\/! @param C          reference data, computed but preallocated\n\/\/! @param A          matrix A as provided to device\n\/\/! @param B          matrix B as provided to device\n\/\/! @param hA         height of matrix A\n\/\/! @param wA         width of matrix A\n\/\/! 
@param wB         width of matrix B\n\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\nvoid\nmatrixMulCPU(float *C, const float *A, const float *B, unsigned int hA, unsigned int wA, unsigned int wB)\n{\n    for (unsigned int i = 0; i &lt; hA; ++i)\n        for (unsigned int j = 0; j &lt; wB; ++j)\n        {\n            double sum = 0;\n\n            for (unsigned int k = 0; k &lt; wA; ++k)\n            {\n                double a = A&#91;i * wA + k];\n                double b = B&#91;k * wB + j];\n                sum += a * b;\n            }\n\n            C&#91;i * wB + j] = (float)sum;\n        }\n}\n\nvoid printDiff(float *data1, float *data2, int width, int height, int iListLength, float fListTol)\n{\n    printf(\"Listing first %d Differences > %.6f...\\n\", iListLength, fListTol);\n    int i,j,k;\n    int error_count=0;\n\n    for (j = 0; j &lt; height; j++)\n    {\n        for (i = 0; i &lt; width; i++)\n        {\n            k = j * width + i;\n            float fDiff = fabs(data1&#91;k] - data2&#91;k]);\n\n            if (fDiff > fListTol)\n            {\n                if (error_count &lt; iListLength)\n                {\n                    printf(\"    Loc(%d,%d)\\tCPU=%.5f\\tGPU=%.5f\\tDiff=%.6f\\n\", i, j, data1&#91;k], data2&#91;k], fDiff);\n                }\n\n                error_count++;\n            }\n        }\n    }\n\n    printf(\" \\n  Total Errors = %d\\n\", error_count);\n}\n\nvoid getArg(int argc, char* argv&#91;], int &amp;size, int &amp;check)\n{\n  if (argc != 3)\n  {\n    cerr &lt;&lt; \"Usage: \" &lt;&lt; argv&#91;0] &lt;&lt; \" &lt;check_enable> &lt;size>\\n\";\n    cerr &lt;&lt; \"\\tcheck_enable: 1 to enable result checking\\n\";\n    cerr &lt;&lt; \"\\tsize: size of the matrix\\n\";\n    exit(1);\n  }\n\n  int val1, val2;\n  try\n  {\n    val1 = stoi(argv&#91;1]);\n    val2 = stoi(argv&#91;2]);\n  }\n  catch (const invalid_argument&amp; e)\n  {\n    cerr &lt;&lt; \"ERROR: parameters should be integer\\n\";\n    exit(1);\n  }\n\n  check = val1;\n  size = val2;\n}\n\n\n\nint main(int argc, char* argv&#91;])\n{\n  int size, check;\n  getArg(argc, argv, size, check);\n\n  int m = size, n = size, k = size;\n  \n  \/\/ \u58f0\u660e\u5b58\u653e\u5728GPU\u4e0a\u7684\u6570\u7ec4\n  float *h_M, *h_N, *d_M, *d_N;\n  float *h_P, *d_P;\n  \n  size_t sizeM = m * k * sizeof(float);\n  size_t sizeN = k * n * sizeof(float);\n  size_t sizeP = m * n * sizeof(float);\n\n  \/\/ Allocate host memory\n  h_M = (float*) malloc(sizeM);\n  h_N = (float*) malloc(sizeN);\n  h_P = (float*) malloc(sizeP);\n  float *reference = (float *)malloc(sizeP);\n\n  \/\/ Allocate device memory\n  cudaMalloc(&amp;d_M, sizeM);\n  cudaMalloc(&amp;d_N, sizeN);\n  cudaMalloc(&amp;d_P, sizeP);\n\n  \/\/ Init data \n  for(int i = 0; i &lt; m * n; ++i)\n  {\n    if(i % 2 == 0)\n      h_M&#91;i] = 1.0;\n    else\n      h_M&#91;i] = 0.5;\n  }\n\n  for(int i = 0;i &lt; n * k; ++i)\n  {\n    if(i % 2 == 0)\n      h_N&#91;i] = 0.5;\n    else\n      h_N&#91;i] = 1.0;\n  }\n\n  \/\/ Copy data from CPU to GPU\n  cudaMemcpy(d_M, h_M, sizeM, cudaMemcpyHostToDevice);\n  cudaMemcpy(d_N, h_N, sizeN, cudaMemcpyHostToDevice);\n\n  \/\/ Timing records \n  cudaEvent_t start,stop;\n  cudaEventCreate(&amp;start);\n  cudaEventCreate(&amp;stop);\n  cudaEventRecord(start,0);\n\n  \/\/ Launch kernel \u5b9a\u4e49grid&amp;block\n  dim3 grid((int)ceil(k*1.0 \/ TILE_WIDTH), (int)ceil(m*1.0\/ TILE_WIDTH));\n  dim3 
block(TILE_WIDTH, TILE_WIDTH);\n  \n  int nIter = 5;\n#ifdef USE_CUBLAS\n  cublasHandle_t handle;\n  cublasCreate(&amp;handle);\n#endif\n  const float alpha = 1.0f;\n  const float beta  = 0.0f;\n  for (int j = 0; j &lt; nIter; j++) {\n    \/\/matrixMulCPU(reference, h_M, h_N, m, k, n);\n    \/\/MatrixMulKernel&lt;&lt;&lt;grid, block>>>(d_M, d_N, d_P, m);\n    MatrixMulSharedMemKernel&lt;&lt;&lt;grid, block>>>(d_M, d_N, d_P, m, n);\n    \/\/cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &amp;alpha, d_N, n, d_M, k, &amp;beta, d_P, n);\n  }\n\n  cudaEventRecord(stop, 0);\n  cudaEventSynchronize(stop);\n  float msecPerMatrixMul;\n  cudaEventElapsedTime(&amp;msecPerMatrixMul, start, stop);\n  msecPerMatrixMul \/= nIter;\n  printf(\"Kernel Elpased Time: %.3f ms\\n\", msecPerMatrixMul);\n\n  \/\/ Compute and print the performance\n  double flopsPerMatrixMul = 2.0 * (double)m * (double)n * (double)k;\n  double gigaFlops = (flopsPerMatrixMul * 1.0e-9f) \/ (msecPerMatrixMul \/ 1000.0f);\n  printf(\"Performance= %.2f GFlop\/s, Time= %.3f msec, Size= %.0f Ops\\n\",\n\t\t  gigaFlops,\n\t\t  msecPerMatrixMul,\n\t\t  flopsPerMatrixMul);\n\n  \/\/ Copy data from GPU to CPU \n  cudaMemcpy(h_P, d_P, sizeP, cudaMemcpyDeviceToHost);\n\n  \/\/ compute reference solution\n  if (check == 1)\n  {\n    printf(\"Computing result using host CPU...\");\n    matrixMulCPU(reference, h_M, h_N, m, k, n);\n    printf(\"done.\\n\");\n    printDiff(reference, h_P, n, m, 100, 1.0e-5f);\n  }\n\n  free(h_P);\n  free(h_M);\n  free(h_N);\n  cudaFree(d_P);\n  cudaFree(d_M);\n  cudaFree(d_N);\n#ifdef USE_CUBLAS\n  cublasDestroy(handle);\n#endif\n\n  return 0;\n}\n\n<\/code><\/pre>\n<\/div>\n\n\n\n<div id=\"tab-2e5c74f3-1\" class=\"c-tabBody__item\" aria-hidden=\"true\">\n<pre class=\"wp-block-code\"><code class=\"cpp\">\/\/ #define USE_CUBLAS\n\n#include &lt;iostream>\n#include &lt;cstdio>\n#include &lt;cuda_runtime.h>\n#ifdef USE_CUBLAS\n#include &lt;cublas_v2.h>\n#endif\n#include &lt;device_launch_parameters.h>\n#include &lt;cmath>\nusing namespace std;\n\nconst int TILE_WIDTH = 16;\t\/\/ \u5b9a\u4e49\u5757block\u5927\u5c0f\n\n\/\/\/\/\/\/\/\/\/\n\/\/ Matrix multiplication with shared memory (CUDA Kernel) on the device: C = A * B\n\/\/\/\/\/\/\/\/\/\nconst int BLOCK_SIZE = TILE_WIDTH;\n__global__ void MatrixMulSharedMemKernel(float *A,\n    float *B, float *C, int wA,\n    int wB) {\n  \/\/ Block index\n  int bx = blockIdx.x;\n  int by = blockIdx.y;\n\n  \/\/ Thread index\n  int tx = threadIdx.x;\n  int ty = threadIdx.y;\n\n  \/\/ Index of the first sub-matrix of A processed by the block\n  int aBegin = wA * BLOCK_SIZE * by;\n\n  \/\/ Index of the last sub-matrix of A processed by the block\n  int aEnd   = aBegin + wA - 1;\n\n  \/\/ Step size used to iterate through the sub-matrices of A\n  int aStep  = BLOCK_SIZE;\n\n  \/\/ Index of the first sub-matrix of B processed by the block\n  int bBegin = BLOCK_SIZE * bx;\n\n  \/\/ Step size used to iterate through the sub-matrices of B\n  int bStep  = BLOCK_SIZE * wB;\n\n  \/\/ Csub is used to store the element of the block sub-matrix\n  \/\/ that is computed by the thread\n  float Csub = 0;\n\n  \/\/ Loop over all the sub-matrices of A and B\n  \/\/ required to compute the block sub-matrix\n  for (int a = aBegin, b = bBegin;\n       a &lt; aEnd;\n       a += aStep, b += bStep) {\n    \/\/ Declaration of the shared memory array As used to\n    \/\/ store the sub-matrix of A\n    __shared__ float As&#91;BLOCK_SIZE]&#91;BLOCK_SIZE];\n\n    \/\/ Declaration of the shared memory 
array Bs used to\n    \/\/ store the sub-matrix of B\n    __shared__ float Bs&#91;BLOCK_SIZE]&#91;BLOCK_SIZE];\n\n    \/\/ Load the matrices from device memory\n    \/\/ to shared memory; each **thread** loads\n    \/\/ one element of each matrix\n    \/\/ --- TO DO :Load the elements of the sub-matrix of A into As ---\n    \/\/ ---        Load the elements of the sub-matrix of B into Bs ---\n    \/\/ NOTE: Ensure that the thread indices do not exceed the matrix dimensions to avoid out-of-bounds access.\n    \/\/       Use boundary checks to load valid elements into shared memory, and set invalid elements to 0.0f\n  int aRow = BLOCK_SIZE * by + ty;\n  int aCol = a - aBegin + tx;\n\n  int bRow = (b - bBegin) \/ wB + ty;\n  int bCol = BLOCK_SIZE * bx + tx;\n\n  if (aRow &lt; wA &amp;&amp; aCol &lt; wA)\n      As&#91;ty]&#91;tx] = A&#91;aRow * wA + aCol];\n  else\n      As&#91;ty]&#91;tx] = 0.0f;\n\n  if (bRow &lt; wA &amp;&amp; bCol &lt; wB)\n      Bs&#91;ty]&#91;tx] = B&#91;bRow * wB + bCol];\n  else\n      Bs&#91;ty]&#91;tx] = 0.0f;\n    \/\/ Synchronize to make sure the matrices are loaded\n    __syncthreads();\n\n    \/\/ Multiply the two matrices together;\n    \/\/ each thread computes one element\n    \/\/ of the block sub-matrix\n#pragma unroll\n    \/\/ --- TO DO :Implement the matrix multiplication using the sub-matrices As and Bs ---\n  for (int k = 0; k &lt; BLOCK_SIZE; ++k)\n      Csub += As&#91;ty]&#91;k] * Bs&#91;k]&#91;tx];\n    \/\/ Synchronize to make sure that the preceding\n    \/\/ computation is done before loading two new\n    \/\/ sub-matrices of A and B in the next iteration\n    __syncthreads();\n  }\n\n  \/\/ Write the block sub-matrix to device memory;\n  \/\/ each thread writes one element\n  int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;\n  \/\/ --- TO DO :Store the computed Csub result into matrix C ---\n  \/\/ NOTE: Ensure that the thread indices \"c\" do not exceed the matrix dimensions to avoid out-of-bounds access.\n  \/\/       Use boundary checks to write valid elements to the output matrix C.\n  int cRow = BLOCK_SIZE * by + ty;\n  int cCol = BLOCK_SIZE * bx + tx;\n  if (cRow &lt; wA &amp;&amp; cCol &lt; wB)\n      C&#91;cRow * wB + cCol] = Csub;\n}\n\n\n\/\/! For square matrices only\n__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int width)\n{\n  \/\/ Calculate the row index of the P element and M\n  \/\/ *** TO DO: Compute the row index for the current thread ***\n  \/\/ int row = ...;\n  int row = blockIdx.y * blockDim.y + threadIdx.y;\n  \/\/ Calculate the column index of the P element and N\n  \/\/ *** TO DO: Compute the column index for the current thread ***\n  \/\/ int col = ...;\n  int col = blockIdx.x * blockDim.x + threadIdx.x;\n  \/\/ Ensure the thread is within bounds\n  if ( (row &lt; width) &amp;&amp; (col &lt; width) ) {\n    float pValue = 0.0;\n\n    \/\/ Each thread computes one element of the matrix\n    \/\/ *** TO DO: Implement the matrix multiplication for a single element ***\n    for (int k = 0; k &lt; width; k++)\n        pValue += d_M&#91;row * width + k] * d_N&#91;k * width + col];\n    \/\/ Store the computed value into the output matrix\n    \/\/ *** TO DO: Write the computed value to the correct position in d_P ***\n    \/\/ d_P&#91;row * width + col] = ...;\n    d_P&#91;row * width + col] = pValue;\n  }\n}\n\n\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\n\/\/! 
Compute reference data set matrix multiply on CPU\n\/\/! C = A * B\n\/\/! @param C          reference data, computed but preallocated\n\/\/! @param A          matrix A as provided to device\n\/\/! @param B          matrix B as provided to device\n\/\/! @param hA         height of matrix A\n\/\/! @param wA         width of matrix A\n\/\/! @param wB         width of matrix B\n\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\nvoid\nmatrixMulCPU(float *C, const float *A, const float *B, unsigned int hA, unsigned int wA, unsigned int wB)\n{\n    for (unsigned int i = 0; i &lt; hA; ++i)\n        for (unsigned int j = 0; j &lt; wB; ++j)\n        {\n            double sum = 0;\n\n            for (unsigned int k = 0; k &lt; wA; ++k)\n            {\n                double a = A&#91;i * wA + k];\n                double b = B&#91;k * wB + j];\n                sum += a * b;\n            }\n\n            C&#91;i * wB + j] = (float)sum;\n        }\n}\n\nvoid printDiff(float *data1, float *data2, int width, int height, int iListLength, float fListTol)\n{\n    printf(\"Listing first %d Differences > %.6f...\\n\", iListLength, fListTol);\n    int i,j,k;\n    int error_count=0;\n\n    for (j = 0; j &lt; height; j++)\n    {\n        for (i = 0; i &lt; width; i++)\n        {\n            k = j * width + i;\n            float fDiff = fabs(data1&#91;k] - data2&#91;k]);\n\n            if (fDiff > fListTol)\n            {\n                if (error_count &lt; iListLength)\n                {\n                    printf(\"    Loc(%d,%d)\\tCPU=%.5f\\tGPU=%.5f\\tDiff=%.6f\\n\", i, j, data1&#91;k], data2&#91;k], fDiff);\n                }\n\n                error_count++;\n            }\n        }\n    }\n\n    printf(\" \\n  Total Errors = %d\\n\", error_count);\n}\n\nvoid getArg(int argc, char* argv&#91;], int &amp;size, int &amp;check)\n{\n  if (argc != 3)\n  {\n    cerr &lt;&lt; \"Usage: \" &lt;&lt; argv&#91;0] &lt;&lt; \" &lt;check_enable> &lt;size>\\n\";\n    cerr &lt;&lt; \"\\tcheck_enable: 1 to enable result checking\\n\";\n    cerr &lt;&lt; \"\\tsize: size of the matrix\\n\";\n    exit(1);\n  }\n\n  int val1, val2;\n  try\n  {\n    val1 = stoi(argv&#91;1]);\n    val2 = stoi(argv&#91;2]);\n  }\n  catch (const invalid_argument&amp; e)\n  {\n    cerr &lt;&lt; \"ERROR: parameters should be integer\\n\";\n    exit(1);\n  }\n\n  check = val1;\n  size = val2;\n}\n\n\n\nint main(int argc, char* argv&#91;])\n{\n  int size, check;\n  getArg(argc, argv, size, check);\n\n  int m = size, n = size, k = size;\n  \n  \/\/ \u58f0\u660e\u5b58\u653e\u5728GPU\u4e0a\u7684\u6570\u7ec4\n  float *h_M, *h_N, *d_M, *d_N;\n  float *h_P, *d_P;\n  \n  size_t sizeM = m * k * sizeof(float);\n  size_t sizeN = k * n * sizeof(float);\n  size_t sizeP = m * n * sizeof(float);\n\n  \/\/ Allocate host memory\n  h_M = (float*) malloc(sizeM);\n  h_N = (float*) malloc(sizeN);\n  h_P = (float*) malloc(sizeP);\n  float *reference = (float *)malloc(sizeP);\n\n  \/\/ Allocate device memory\n  cudaMalloc(&amp;d_M, sizeM);\n  cudaMalloc(&amp;d_N, sizeN);\n  cudaMalloc(&amp;d_P, sizeP);\n\n  \/\/ Init data \n  for(int i = 0; i &lt; m * n; ++i)\n  {\n    if(i % 2 == 0)\n      h_M&#91;i] = 1.0;\n    else\n      h_M&#91;i] = 0.5;\n  }\n\n  for(int i = 0;i &lt; n * k; ++i)\n  {\n    if(i % 2 == 0)\n      h_N&#91;i] = 0.5;\n    else\n      h_N&#91;i] = 1.0;\n  }\n\n  \/\/ Copy data from CPU to GPU\n  cudaMemcpy(d_M, h_M, sizeM, 
cudaMemcpyHostToDevice);\n  cudaMemcpy(d_N, h_N, sizeN, cudaMemcpyHostToDevice);\n\n  \/\/ Timing records \n  cudaEvent_t start,stop;\n  cudaEventCreate(&amp;start);\n  cudaEventCreate(&amp;stop);\n  cudaEventRecord(start,0);\n\n  \/\/ Launch kernel \u5b9a\u4e49grid&amp;block\n  dim3 grid((int)ceil(k*1.0 \/ TILE_WIDTH), (int)ceil(m*1.0\/ TILE_WIDTH));\n  dim3 block(TILE_WIDTH, TILE_WIDTH);\n  \n  int nIter = 5;\n#ifdef USE_CUBLAS\n  cublasHandle_t handle;\n  cublasCreate(&amp;handle);\n#endif\n  const float alpha = 1.0f;\n  const float beta  = 0.0f;\n  for (int j = 0; j &lt; nIter; j++) {\n    \/\/matrixMulCPU(reference, h_M, h_N, m, k, n);\n    \/\/MatrixMulKernel&lt;&lt;&lt;grid, block>>>(d_M, d_N, d_P, m);\n    MatrixMulSharedMemKernel&lt;&lt;&lt;grid, block>>>(d_M, d_N, d_P, m, n);\n    \/\/cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &amp;alpha, d_N, n, d_M, k, &amp;beta, d_P, n);\n  }\n\n  cudaEventRecord(stop, 0);\n  cudaEventSynchronize(stop);\n  float msecPerMatrixMul;\n  cudaEventElapsedTime(&amp;msecPerMatrixMul, start, stop);\n  msecPerMatrixMul \/= nIter;\n  printf(\"Kernel Elpased Time: %.3f ms\\n\", msecPerMatrixMul);\n\n  \/\/ Compute and print the performance\n  double flopsPerMatrixMul = 2.0 * (double)m * (double)n * (double)k;\n  double gigaFlops = (flopsPerMatrixMul * 1.0e-9f) \/ (msecPerMatrixMul \/ 1000.0f);\n  printf(\"Performance= %.2f GFlop\/s, Time= %.3f msec, Size= %.0f Ops\\n\",\n\t\t  gigaFlops,\n\t\t  msecPerMatrixMul,\n\t\t  flopsPerMatrixMul);\n\n  \/\/ Copy data from GPU to CPU \n  cudaMemcpy(h_P, d_P, sizeP, cudaMemcpyDeviceToHost);\n\n  \/\/ compute reference solution\n  if (check == 1)\n  {\n    printf(\"Computing result using host CPU...\");\n    matrixMulCPU(reference, h_M, h_N, m, k, n);\n    printf(\"done.\\n\");\n    printDiff(reference, h_P, n, m, 100, 1.0e-5f);\n  }\n\n  free(h_P);\n  free(h_M);\n  free(h_N);\n  cudaFree(d_P);\n  cudaFree(d_M);\n  cudaFree(d_N);\n#ifdef USE_CUBLAS\n  cublasDestroy(handle);\n#endif\n\n  return 0;\n}\n\n<\/code><\/pre>\n<\/div>\n<\/div><\/div>\n\n\n\n<p>\u51fa\u529b\uff08\u30d6\u30ed\u30c3\u30af\u30b5\u30a4\u30ba=16\uff09\u306f\u4ee5\u4e0b\u306e\u901a\u308a\u3067\u3001\u7d0430%\u5411\u4e0a\u3057\u307e\u3057\u305f\uff1a<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"zsh\">\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab4-5]\n\u2514\u2500$ nvcc -arch=compute_35 -L\/usr\/local\/cuda\/lib64 -lcublas .\/matrix_mul.cu -Wno-deprecated-gpu-targets\n.\/matrix_mul.cu(107): warning #177-D: variable \"c\" was declared but never referenced\n\n.\/matrix_mul.cu(293): warning #177-D: variable \"alpha\" was declared but never referenced\n\n.\/matrix_mul.cu(294): warning #177-D: variable \"beta\" was declared but never referenced\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab4-5]\n\u2514\u2500$ .\/a.out 1 1000\nKernel Elpased Time: 0.712 ms\nPerformance= 2810.55 GFlop\/s, Time= 0.712 msec, Size= 2000000000 Ops\nComputing result using host CPU...done.\nListing first 100 Differences > 0.000010...\n\n  Total Errors = 0\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab4-5]\n\u2514\u2500$ .\/a.out 0 256\nKernel Elpased Time: 0.595 ms\nPerformance= 56.39 GFlop\/s, Time= 0.595 msec, Size= 33554432 Ops\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab4-5]\n\u2514\u2500$ .\/a.out 0 1024\nKernel Elpased Time: 0.726 ms\nPerformance= 2959.59 GFlop\/s, Time= 0.726 msec, Size= 2147483648 
Ops\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab4-5]\n\u2514\u2500$ .\/a.out 0 2048\nKernel Elpased Time: 4.214 ms\nPerformance= 4076.64 GFlop\/s, Time= 4.214 msec, Size= 17179869184 Ops\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab4-5]\n\u2514\u2500$ .\/a.out 0 4096\nKernel Elpased Time: 31.043 ms\nPerformance= 4427.37 GFlop\/s, Time= 31.043 msec, Size= 137438953472 Ops\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab4-5]\n\u2514\u2500$ .\/a.out 0 10000\nKernel Elpased Time: 447.665 ms\nPerformance= 4467.63 GFlop\/s, Time= 447.665 msec, Size= 2000000000000 Ops<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">CUBLAS\u3067\u306eGPU\u884c\u5217\u4e57\u7b97\u6700\u9069\u5316<\/h3>\n\n\n\n<p>\u30de\u30af\u30ed\u5b9a\u7fa9\u3092\u9069\u7528\u3059\u308b\u3060\u3051\u3067CUBLAS\u3092\u6709\u52b9\u5316\u3067\u304d\u307e\u3059\u3002<\/p>\n\n\n\n<div class=\"swell-block-tab is-style-balloon\" data-width-pc=\"flex-auto\" data-width-sp=\"auto\"><ul class=\"c-tabList\" role=\"tablist\"><li class=\"c-tabList__item\" role=\"presentation\"><button role=\"tab\" class=\"c-tabList__button\" aria-selected=\"true\" aria-controls=\"tab-c8e916c4-0\" data-onclick=\"tabControl\">\u5b9f\u73fe\u3057\u305f\u30b3\u30fc\u30c9<\/button><\/li><\/ul><div class=\"c-tabBody\">\n<div id=\"tab-c8e916c4-0\" class=\"c-tabBody__item\" aria-hidden=\"false\">\n<pre class=\"wp-block-code\"><code class=\"cpp\">#define USE_CUBLAS\n\n#include &lt;iostream>\n#include &lt;cstdio>\n#include &lt;cuda_runtime.h>\n#ifdef USE_CUBLAS\n#include &lt;cublas_v2.h>\n#endif\n#include &lt;device_launch_parameters.h>\n#include &lt;cmath>\nusing namespace std;\n\nconst int TILE_WIDTH = 16;\t\/\/ \u5b9a\u4e49\u5757block\u5927\u5c0f\n\n\/\/\/\/\/\/\/\/\/\n\/\/ Matrix multiplication with shared memory (CUDA Kernel) on the device: C = A * B\n\/\/\/\/\/\/\/\/\/\nconst int BLOCK_SIZE = TILE_WIDTH;\n__global__ void MatrixMulSharedMemKernel(float *A,\n    float *B, float *C, int wA,\n    int wB) {\n  \/\/ Block index\n  int bx = blockIdx.x;\n  int by = blockIdx.y;\n\n  \/\/ Thread index\n  int tx = threadIdx.x;\n  int ty = threadIdx.y;\n\n  \/\/ Index of the first sub-matrix of A processed by the block\n  int aBegin = wA * BLOCK_SIZE * by;\n\n  \/\/ Index of the last sub-matrix of A processed by the block\n  int aEnd   = aBegin + wA - 1;\n\n  \/\/ Step size used to iterate through the sub-matrices of A\n  int aStep  = BLOCK_SIZE;\n\n  \/\/ Index of the first sub-matrix of B processed by the block\n  int bBegin = BLOCK_SIZE * bx;\n\n  \/\/ Step size used to iterate through the sub-matrices of B\n  int bStep  = BLOCK_SIZE * wB;\n\n  \/\/ Csub is used to store the element of the block sub-matrix\n  \/\/ that is computed by the thread\n  float Csub = 0;\n\n  \/\/ Loop over all the sub-matrices of A and B\n  \/\/ required to compute the block sub-matrix\n  for (int a = aBegin, b = bBegin;\n       a &lt; aEnd;\n       a += aStep, b += bStep) {\n    \/\/ Declaration of the shared memory array As used to\n    \/\/ store the sub-matrix of A\n    __shared__ float As&#91;BLOCK_SIZE]&#91;BLOCK_SIZE];\n\n    \/\/ Declaration of the shared memory array Bs used to\n    \/\/ store the sub-matrix of B\n    __shared__ float Bs&#91;BLOCK_SIZE]&#91;BLOCK_SIZE];\n\n    \/\/ Load the matrices from device memory\n    \/\/ to shared memory; each **thread** loads\n    \/\/ one element of each matrix\n    \/\/ --- TO DO :Load the elements of the sub-matrix of A into As ---\n    \/\/ ---        Load the 
elements of the sub-matrix of B into Bs ---\n    \/\/ NOTE: Ensure that the thread indices do not exceed the matrix dimensions to avoid out-of-bounds access.\n    \/\/       Use boundary checks to load valid elements into shared memory, and set invalid elements to 0.0f\n  int aRow = BLOCK_SIZE * by + ty;\n  int aCol = a - aBegin + tx;\n\n  int bRow = (b - bBegin) \/ wB + ty;\n  int bCol = BLOCK_SIZE * bx + tx;\n\n  if (aRow &lt; wA &amp;&amp; aCol &lt; wA)\n      As&#91;ty]&#91;tx] = A&#91;aRow * wA + aCol];\n  else\n      As&#91;ty]&#91;tx] = 0.0f;\n\n  if (bRow &lt; wA &amp;&amp; bCol &lt; wB)\n      Bs&#91;ty]&#91;tx] = B&#91;bRow * wB + bCol];\n  else\n      Bs&#91;ty]&#91;tx] = 0.0f;\n    \/\/ Synchronize to make sure the matrices are loaded\n    __syncthreads();\n\n    \/\/ Multiply the two matrices together;\n    \/\/ each thread computes one element\n    \/\/ of the block sub-matrix\n#pragma unroll\n    \/\/ --- TO DO :Implement the matrix multiplication using the sub-matrices As and Bs ---\n  for (int k = 0; k &lt; BLOCK_SIZE; ++k)\n      Csub += As&#91;ty]&#91;k] * Bs&#91;k]&#91;tx];\n    \/\/ Synchronize to make sure that the preceding\n    \/\/ computation is done before loading two new\n    \/\/ sub-matrices of A and B in the next iteration\n    __syncthreads();\n  }\n\n  \/\/ Write the block sub-matrix to device memory;\n  \/\/ each thread writes one element\n  int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;\n  \/\/ --- TO DO :Store the computed Csub result into matrix C ---\n  \/\/ NOTE: Ensure that the thread indices \"c\" do not exceed the matrix dimensions to avoid out-of-bounds access.\n  \/\/       Use boundary checks to write valid elements to the output matrix C.\n  int cRow = BLOCK_SIZE * by + ty;\n  int cCol = BLOCK_SIZE * bx + tx;\n  if (cRow &lt; wA &amp;&amp; cCol &lt; wB)\n      C&#91;cRow * wB + cCol] = Csub;\n}\n\n\n\/\/! For square matrices only\n__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int width)\n{\n  \/\/ Calculate the row index of the P element and M\n  \/\/ *** TO DO: Compute the row index for the current thread ***\n  \/\/ int row = ...;\n  int row = blockIdx.y * blockDim.y + threadIdx.y;\n  \/\/ Calculate the column index of the P element and N\n  \/\/ *** TO DO: Compute the column index for the current thread ***\n  \/\/ int col = ...;\n  int col = blockIdx.x * blockDim.x + threadIdx.x;\n  \/\/ Ensure the thread is within bounds\n  if ( (row &lt; width) &amp;&amp; (col &lt; width) ) {\n    float pValue = 0.0;\n\n    \/\/ Each thread computes one element of the matrix\n    \/\/ *** TO DO: Implement the matrix multiplication for a single element ***\n    for (int k = 0; k &lt; width; k++)\n        pValue += d_M&#91;row * width + k] * d_N&#91;k * width + col];\n    \/\/ Store the computed value into the output matrix\n    \/\/ *** TO DO: Write the computed value to the correct position in d_P ***\n    \/\/ d_P&#91;row * width + col] = ...;\n    d_P&#91;row * width + col] = pValue;\n  }\n}\n\n\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\n\/\/! Compute reference data set matrix multiply on CPU\n\/\/! C = A * B\n\/\/! @param C          reference data, computed but preallocated\n\/\/! @param A          matrix A as provided to device\n\/\/! @param B          matrix B as provided to device\n\/\/! @param hA         height of matrix A\n\/\/! @param wA         width of matrix A\n\/\/! 
@param wB         width of matrix B\n\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\nvoid\nmatrixMulCPU(float *C, const float *A, const float *B, unsigned int hA, unsigned int wA, unsigned int wB)\n{\n    for (unsigned int i = 0; i &lt; hA; ++i)\n        for (unsigned int j = 0; j &lt; wB; ++j)\n        {\n            double sum = 0;\n\n            for (unsigned int k = 0; k &lt; wA; ++k)\n            {\n                double a = A&#91;i * wA + k];\n                double b = B&#91;k * wB + j];\n                sum += a * b;\n            }\n\n            C&#91;i * wB + j] = (float)sum;\n        }\n}\n\nvoid printDiff(float *data1, float *data2, int width, int height, int iListLength, float fListTol)\n{\n    printf(\"Listing first %d Differences > %.6f...\\n\", iListLength, fListTol);\n    int i,j,k;\n    int error_count=0;\n\n    for (j = 0; j &lt; height; j++)\n    {\n        for (i = 0; i &lt; width; i++)\n        {\n            k = j * width + i;\n            float fDiff = fabs(data1&#91;k] - data2&#91;k]);\n\n            if (fDiff > fListTol)\n            {\n                if (error_count &lt; iListLength)\n                {\n                    printf(\"    Loc(%d,%d)\\tCPU=%.5f\\tGPU=%.5f\\tDiff=%.6f\\n\", i, j, data1&#91;k], data2&#91;k], fDiff);\n                }\n\n                error_count++;\n            }\n        }\n    }\n\n    printf(\" \\n  Total Errors = %d\\n\", error_count);\n}\n\nvoid getArg(int argc, char* argv&#91;], int &amp;size, int &amp;check)\n{\n  if (argc != 3)\n  {\n    cerr &lt;&lt; \"Usage: \" &lt;&lt; argv&#91;0] &lt;&lt; \" &lt;check_enable> &lt;size>\\n\";\n    cerr &lt;&lt; \"\\tcheck_enable: 1 to enable result checking\\n\";\n    cerr &lt;&lt; \"\\tsize: size of the matrix\\n\";\n    exit(1);\n  }\n\n  int val1, val2;\n  try\n  {\n    val1 = stoi(argv&#91;1]);\n    val2 = stoi(argv&#91;2]);\n  }\n  catch (const invalid_argument&amp; e)\n  {\n    cerr &lt;&lt; \"ERROR: parameters should be integer\\n\";\n    exit(1);\n  }\n\n  check = val1;\n  size = val2;\n}\n\n\n\nint main(int argc, char* argv&#91;])\n{\n  int size, check;\n  getArg(argc, argv, size, check);\n\n  int m = size, n = size, k = size;\n  \n  \/\/ \u58f0\u660e\u5b58\u653e\u5728GPU\u4e0a\u7684\u6570\u7ec4\n  float *h_M, *h_N, *d_M, *d_N;\n  float *h_P, *d_P;\n  \n  size_t sizeM = m * k * sizeof(float);\n  size_t sizeN = k * n * sizeof(float);\n  size_t sizeP = m * n * sizeof(float);\n\n  \/\/ Allocate host memory\n  h_M = (float*) malloc(sizeM);\n  h_N = (float*) malloc(sizeN);\n  h_P = (float*) malloc(sizeP);\n  float *reference = (float *)malloc(sizeP);\n\n  \/\/ Allocate device memory\n  cudaMalloc(&amp;d_M, sizeM);\n  cudaMalloc(&amp;d_N, sizeN);\n  cudaMalloc(&amp;d_P, sizeP);\n\n  \/\/ Init data \n  for(int i = 0; i &lt; m * n; ++i)\n  {\n    if(i % 2 == 0)\n      h_M&#91;i] = 1.0;\n    else\n      h_M&#91;i] = 0.5;\n  }\n\n  for(int i = 0;i &lt; n * k; ++i)\n  {\n    if(i % 2 == 0)\n      h_N&#91;i] = 0.5;\n    else\n      h_N&#91;i] = 1.0;\n  }\n\n  \/\/ Copy data from CPU to GPU\n  cudaMemcpy(d_M, h_M, sizeM, cudaMemcpyHostToDevice);\n  cudaMemcpy(d_N, h_N, sizeN, cudaMemcpyHostToDevice);\n\n  \/\/ Timing records \n  cudaEvent_t start,stop;\n  cudaEventCreate(&amp;start);\n  cudaEventCreate(&amp;stop);\n  cudaEventRecord(start,0);\n\n  \/\/ Launch kernel \u5b9a\u4e49grid&amp;block\n  dim3 grid((int)ceil(k*1.0 \/ TILE_WIDTH), (int)ceil(m*1.0\/ TILE_WIDTH));\n  dim3 
block(TILE_WIDTH, TILE_WIDTH);\n  \n  int nIter = 5;\n#ifdef USE_CUBLAS\n  cublasHandle_t handle;\n  cublasCreate(&amp;handle);\n#endif\n  const float alpha = 1.0f;\n  const float beta  = 0.0f;\n  for (int j = 0; j &lt; nIter; j++) {\n    \/\/matrixMulCPU(reference, h_M, h_N, m, k, n);\n    \/\/MatrixMulKernel&lt;&lt;&lt;grid, block>>>(d_M, d_N, d_P, m);\n    \/\/MatrixMulSharedMemKernel&lt;&lt;&lt;grid, block>>>(d_M, d_N, d_P, m, n);\n    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &amp;alpha, d_N, n, d_M, k, &amp;beta, d_P, n);\n  }\n\n  cudaEventRecord(stop, 0);\n  cudaEventSynchronize(stop);\n  float msecPerMatrixMul;\n  cudaEventElapsedTime(&amp;msecPerMatrixMul, start, stop);\n  msecPerMatrixMul \/= nIter;\n  printf(\"Kernel Elpased Time: %.3f ms\\n\", msecPerMatrixMul);\n\n  \/\/ Compute and print the performance\n  double flopsPerMatrixMul = 2.0 * (double)m * (double)n * (double)k;\n  double gigaFlops = (flopsPerMatrixMul * 1.0e-9f) \/ (msecPerMatrixMul \/ 1000.0f);\n  printf(\"Performance= %.2f GFlop\/s, Time= %.3f msec, Size= %.0f Ops\\n\",\n\t\t  gigaFlops,\n\t\t  msecPerMatrixMul,\n\t\t  flopsPerMatrixMul);\n\n  \/\/ Copy data from GPU to CPU \n  cudaMemcpy(h_P, d_P, sizeP, cudaMemcpyDeviceToHost);\n\n  \/\/ compute reference solution\n  if (check == 1)\n  {\n    printf(\"Computing result using host CPU...\");\n    matrixMulCPU(reference, h_M, h_N, m, k, n);\n    printf(\"done.\\n\");\n    printDiff(reference, h_P, n, m, 100, 1.0e-5f);\n  }\n\n  free(h_P);\n  free(h_M);\n  free(h_N);\n  cudaFree(d_P);\n  cudaFree(d_M);\n  cudaFree(d_N);\n#ifdef USE_CUBLAS\n  cublasDestroy(handle);\n#endif\n\n  return 0;\n}\n\n<\/code><\/pre>\n<\/div>\n<\/div><\/div>\n\n\n\n<p>\u7d50\u679c\u306f\u4ee5\u4e0b\u306e\u901a\u308a\u3067\u3001H100\u3067\u7d041000%\u306e\u6027\u80fd\u5411\u4e0a\u304c\u5f97\u3089\u308c\u307e\u3057\u305f\u3002<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab4-5]\n\u2514\u2500$ .\/a.out 0 10000\nKernel Elpased Time: 43.667 ms\nPerformance= 45800.76 GFlop\/s, Time= 43.667 msec, Size= 2000000000000 Ops<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Lab6 LLaMA\u306e\u6700\u9069\u5316<\/h2>\n\n\n\n<p>\u307e\u305a\u306f\u3001<code>make run<\/code>\u3002<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">CPU Baseline<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"zsh\">\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/run stories15M.bin\nOne day, a little girl named Amy found a very ugly teaspoon. It was so ugly, but Amy liked it. She decided to use it to eat her snack.\nAs she ate, her friend Tom came to play. Tom saw the ugly teaspoon and laughed. \"What a silly teaspoon!\" he said. \"I have never seen it before.\" Amy was sad. She did not want Tom to make fun of her ugly teaspoon.\nAmy had an idea. She used the ugly teaspoon to bake a sweet cookie. Tom saw the cookie and said, \"I don't like it!\" Amy said, \"It's okay, we can share.\" Tom smiled, and they both enjoyed the yummy cookie. They laughed and played together, happy that the ugly teaspoon and cookie were enjoyed.\nachieved tok\/s: 168.819188\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/run stories15M.bin\nOnce upon a time, there was a little girl named Lily. She loved to paint. One day, she found a dull pencil on the ground. Lily wanted to paint a big, beautiful picture.\nLily took her paint and brush and started to paint. 
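<p>【補足：Lab5 の CUBLAS 版について】cublasSgemm に行列を d_N, d_M の順で渡している理由を簡単に整理しておきます（原文に明記はなく、コードからの読み取りです）。cuBLAS は列優先（column-major）格納を前提とするため、行優先（row-major）で確保したバッファは、cuBLAS からは転置行列として見えます。そこで次の関係を利用します：<\/p>

<p>\\[C = MN \\iff C^{\\top} = N^{\\top} M^{\\top}\\]<\/p>

<p>列優先の世界で \\(C^{\\top} = N^{\\top} M^{\\top}\\)（サイズは \\(n \\times m\\)）を計算させれば、出力バッファ d_P はそのまま行優先の \\(C\\) として読めます。呼び出しの引数 (n, m, k) や lda=n, ldb=k, ldc=n はこの入れ替えに対応しており、転置フラグ（CUBLAS_OP_T）や追加のコピーを使わずに済みます。<\/p>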
She painted a red apple, a blue banana, and a green frog. She was very happy with her painting. Then, she showed her painting to her friend, Tim.\n\"Look at my picture!\" Lily said to Tim. Tim looked at the picture and smiled. \"That's a great painting, Lily!\" he said. They both laughed and played with the dull pencil and painting more pictures. And they lived happily ever after.\nachieved tok\/s: 170.305677\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/run stories15M.bin\nOnce upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she found a treasure map. The map showed a big X in the desert where treasure was buried.\nLily decided to follow the map and see where the treasure was. She walked and walked for a long time until she found the big X. She dug and dug until she found a spot where the treasure was buried.\nBut when she tried to go back home, she found that her shoelaces were untied. She asked for help, but no one could tie them. She felt sad and wished she had returned home earlier.\nThe moral of the story is that sometimes we need to make a choice to find a solution, even if it's hard. And sometimes, trying something new can lead to trouble.\nachieved tok\/s: 168.362627\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/run stories15M.bin\nOne day, a thin cat named Tom found a book. He liked the book because it had a lot of words. Tom wanted to read the book, but it was closed. He tried to open it, but it was hard. So, Tom had a plan.\nTom saw a boy named Tim. Tom asked Tim, \"Can you help me open the book?\" Tim said, \"Yes, I can help!\" Tim opened the book for Tom. They both looked at the words inside.\nTim and Tom were having fun. They wanted to make a spell to open the book. They found a spell and gave it to Tom. They said the words together. Suddenly, the book opened by itself! Inside the book, they found a magic stone. Tom and Tim were very happy. They used the magic stone to make their wishes come true.\nachieved tok\/s: 167.468720\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/run stories15M.bin\nOnce upon a time, there was a little girl named Lily. She loved to play with her toys and draw pictures. One day, her mommy asked her to take a bath and turn on the faucet. Lily didn't want to, but her mommy said she would feel better after.\nAfter her bath, Lily went to bed. But she couldn't sleep because she was scared of the dark. She called out for her mommy, \"Mommy, I'm scared!\" Her mommy came in and turned on the faucet. Lily felt safe again and fell asleep.\nThe next day, Lily's mommy asked her to clean up her toys. But Lily didn't want to. Her mommy said, \"If you don't clean up your toys, you won't be able to find them later.\" Lily didn't want that, so she started cleaning up. But soon, she realized that it was much easier to find her toys when she was ready to go to bed.\nachieved tok\/s: 168.382353\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/run stories42M.bin\nOnce upon a time, there was a little girl named Lily. One day, she saw a candle on the table and she wanted to touch it. But her mom said, \"No, Lily! That candle is hot and it can hurt you.\" Lily listened to her mom and didn't touch the candle.\nLater that night, Lily was scared of the dark. Her mom said, \"Don't be afraid, Lily. 
I am here with you.\" Lily felt better and went to sleep. The next day, Lily went to the park with her friend Timmy. They played on the swings and went down the slide.\nSuddenly, Timmy pinched Lily's arm. \"Ouch!\" she cried. Timmy said, \"I'm sorry, Lily. I didn't mean to hurt you.\" Lily forgave Timmy and they continued to play. When it was time to go home, Lily's mom gave her a soft teddy bear as a gift. Lily was happy and went to bed with a smile on her face.\nachieved tok\/s: 57.266603\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/run stories42M.bin\nOnce upon a time, there was a little bird named Tweetie. Tweetie had beautiful feathers that shone in the sun. One day, Tweetie was flying in the sky when she saw a little girl named Lily who was lost.\nLily was crying and didn't know where to go. Tweetie felt compassionate and flew down to her. \"Don't worry, little girl. I'll help you find your way home,\" said Tweetie.\nLily was so happy to have a friend like Tweetie. She told Tweetie a joke and they laughed together. \"Why did the tomato turn red?\" asked Lily. \"I don't know, why?\" replied Tweetie. \"Because it saw the salad dressing!\" said Lily, laughing again.\nTweetie was so happy to help Lily find her way home. She flew away feeling proud of herself. But then, something unexpected happened. A big gust of wind blew Tweetie away from the girl and she landed on a branch far away from her. Tweetie was scared and didn't know what to do. She\nachieved tok\/s: 56.679262\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/run stories42M.bin\nOnce upon a time, there was a big bird named Ostrich. Ostrich was very tall and had long legs. Ostrich liked to run and jump in the big blue sky.\nOne day, Ostrich met a little girl named Lily. Lily said, \"Hi Ostrich, can you count how many times I can run in the sky?\" Ostrich said, \"Sure, I can count to ten!\"\nLily ran and ran and ran, but she tripped and fell. Ostrich ran over to her and said, \"Are you okay?\" Lily said, \"Yes, I'm okay. I'm okay, but I need to be careful.\"\nSuddenly, a big storm came and the sky turned dark. Ostrich tried to run away, but his big legs were not strong enough to withstand the storm. The storm was too strong and Ostrich couldn't stay safe. In the end, Ostrich got hurt and couldn't run anymore. Lily was very sad and wished she could have done something to help Ostrich.\nachieved tok\/s: 55.766493\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/run stories42M.bin\nOnce upon a time, there was a little girl named Lily. She had a furry teddy bear that she loved very much. One day, Lily was playing with her teddy bear and she accidentally split a glass of juice on the floor. She tried to clean it up with a napkin, but it made a bigger mess.\nLily's mom saw what happened and said, \"Lily, you need to be careful with the table. You could hurt yourself or break something.\"\nLily felt sad because she didn't want to make a mess. She asked her mom, \"What can I do to clean it up?\"\nHer mom said, \"You can use a plastic cloth to wipe it up. That way, you won't get hurt and the mess will go away.\"\nLily learned that it's important to be careful and clean up after herself. She also learned that accidents happen and it's okay to ask for help. 
From that day on, she made sure to be extra careful and never split anything again.\nachieved tok\/s: 56.637168\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/run stories42M.bin\nOnce upon a time, there was a small boy named Timmy. Timmy had a pet turtle named Timmy. Timmy loved Timmy very much and always made sure to feed him food and water.\nOne day, Timmy's mom bought him a microscope. Timmy was very curious about it and wanted to look at everything through it. He showed it to his turtle and asked, \"Do you want to look at tiny things with my microscope?\"\nBut Timmy was a slow turtle and didn't want to wait for him to get there. He went to his mom and said, \"I want to eat some leaves from the tree outside.\" His mom gave him some leaves and he looked at them with the microscope. He was very happy to see tiny bugs and plants up close.\nFrom that day on, Timmy loved to look at things with his microscope and feed them with his food. And he always remembered to be slow and patient with his turtle. The end.\nachieved tok\/s: 55.977710\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/run stories110M.bin\nOnce upon a time, there was a little girl named Lily. She loved playing with her toys, especially her robot. The robot was very lively and could move and talk just like a real person.\nOne day, Lily's friend came over to play. Her friend wanted to play with the robot, but Lily didn't want to share. \"No, you can't play with my robot!\" Lily said. Her friend felt sad and left.\nLater that day, Lily realized that she was being mean to her friend. She decided to go find her and say sorry. When she found her friend, she said, \"I'm sorry for not sharing my robot. Will you forgive me?\" Her friend smiled and said, \"Yes, I forgive you.\"\nLily learned that sharing is important and that saying sorry can make things better. From that day on, she always shared her toys with her friends.\nachieved tok\/s: 21.109579\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/run stories110M.bin\nOnce upon a time, there was a little girl named Lily. She loved to play outside in the big, green field behind her house. One day, she found a huge pile of oats in the field. She wanted to use them to make oatmeal for breakfast.\nLily went back to her house to get some oats. She was so excited to use them to make her breakfast. But when she came back to the field, she saw that someone had already used all of the oats to make a big pile. Lily was very sad because she really wanted to make oatmeal.\nSuddenly, Lily heard a rustling in the bushes. She went to investigate and found a little bunny eating all of the oats! Lily was surprised but happy to see the bunny. She decided to share her oats with the bunny and they became friends. From then on, Lily and the bunny would play together in the big, green field and make oatmeal for breakfast.\nachieved tok\/s: 21.013133\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/run stories110M.bin\nOnce upon a time, there was a playful dog named Spot. Spot loved to play with his friends. One day, Spot saw a big jar of sugar. He wanted to eat it, but he knew he should ask first.\nSpot went to his friend, Cat. He asked, \"Can I have some sugar?\" Cat said, \"No, Spot. Sugar is not good for dogs.\" Spot was sad, but he did not refuse to listen to Cat.\nSpot went to his friend, Bird. 
He asked, \"Can I have some sugar?\" Bird said, \"No, Spot. Sugar is not good for dogs.\" Spot was sad, but he did not refuse to listen to Bird. He went to his friend, Fish. Spot asked, \"Can I have some sugar?\" Fish said, \"No, Spot. Sugar is not good for fish.\" Spot was sad, but he listened to Fish.\nOne day, Spot saw a big bag of sugar. He thought, \"I want to eat some sugar!\" But he remembered what his friends said. Spot decided to play with his friends instead. They all had a fun day together. Spot was happy he\nachieved tok\/s: 21.062195\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/run stories110M.bin\nOnce upon a time, there was a little boy named Timmy. Timmy loved to eat cookies. One day, his mom made him a yummy cookie. Timmy was so happy and took a big bite. Suddenly, he saw a worm in his cookie!\n\"Ew, a worm!\" Timmy yelled.\n\"Don't worry, Timmy. Just cut the worm out,\" said his mom.\nTimmy carefully cut out the worm and threw the yummy cookie away. From then on, he always checked his cookies before taking a bite.\nachieved tok\/s: 21.192053\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/run stories110M.bin\nOnce upon a time, there was a little boy named Tim. Tim liked to play with his toys and make a big mess. One day, Tim's mom said, \"Tim, it's time to clean up your messy room.\"\nTim looked at his mom and said, \"Okay, Mom, I will clean up my room.\" Tim started to pick up his toys and put them away. While he was cleaning, he found a big, soft teddy bear. Tim gave the teddy bear a big squeeze and said, \"I love you, teddy bear.\"\nTim's mom came back into the room and saw that Tim had cleaned up his messy room. She was very happy and said, \"Good job, Tim! Now you can play with your toys again.\" Tim smiled and hugged his teddy bear. 
He knew that a clean room was a happy place to be.\nachieved tok\/s: 21.130537<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th class=\"has-text-align-center\" data-align=\"center\">On an Intel\u00ae Core\u2122 i9-12900K<\/th><th class=\"has-text-align-center\" data-align=\"center\">stories15M<\/th><th class=\"has-text-align-center\" data-align=\"center\">stories42M<\/th><th class=\"has-text-align-center\" data-align=\"center\">stories110M<\/th><\/tr><\/thead><tbody><tr><td class=\"has-text-align-center\" data-align=\"center\">Output rate (tokens\/s)<br>Baseline<\/td><td class=\"has-text-align-center\" data-align=\"center\">168.6677130<\/td><td class=\"has-text-align-center\" data-align=\"center\">56.4654472<\/td><td class=\"has-text-align-center\" data-align=\"center\">21.1014994<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">CPU\u5411\u3051\u6700\u9069\u5316<\/h3>\n\n\n\n<p>OpenMP\u3092\u8d77\u7528\u306b\u306f\u3001\u6b21\u306e\u3082\u306e\u3092\u5b9f\u884c\u3002<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>gcc -O3 -o orun run.c -lm -fopenmp<\/code><\/pre>\n\n\n\n<p>AVX\u3092\u4f7f\u3046\u306b\u306f\u3001matmul\u3092\u5909\u66f4\u3002<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>void matmul(float* xout, float* x, float* w, int n, int d) {\n    int i;\n    #pragma omp parallel for private(i)\n    for (i = 0; i &lt; d; ++i) {\n        __m256 sum_vec = _mm256_setzero_ps();\n        for (int j = 0; j &lt; n; j += 8) {\n            __m256 w_vec = _mm256_loadu_ps(&amp;w&#91;i * n + j]);\n            __m256 x_vec = _mm256_loadu_ps(&amp;x&#91;j]);\n            sum_vec = _mm256_fmadd_ps(w_vec, x_vec, sum_vec);\n        }\n        float temp&#91;8];\n        _mm256_storeu_ps(temp, sum_vec);\n        xout&#91;i] = temp&#91;0] + temp&#91;1] + temp&#91;2] + temp&#91;3] + temp&#91;4] + temp&#91;5] + temp&#91;6] + temp&#91;7];\n    }\n}<\/code><\/pre>\n\n\n\n<p>\u6b21\u306e\u3088\u3046\u306b\u30b3\u30f3\u30d1\u30a4\u30eb\u3002<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>gcc -O3 -o arun run.c -mavx -march=native -lm<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>gcc -O3 -o aorun run.c -mavx -march=native -lm -fopenmp<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/arun stories15M.bin\nOnce upon a time, there was a little girl named Lily. She loved to play outside and pick flowers. One day, she saw a cop car drive by and it made her feel excited.\n\"Mommy, what is that car doing?\" Lily asked.\n\"It's a cop, sweetie. He helps keep us safe,\" her mom replied.\nAs they continued their walk, they saw a sign that said \"Beware of Coco\". Lily didn't understand what that meant, but she thought it sounded like fun.\nLater that day, Lily's mom took her to the park. They saw a big sandcastle that was as tall as Lily! She was so excited that she wanted to stay close to her. Her mom reminded her that it's important to always stay close to the ones you trust and to never wander off alone. Lily understood and promised to always stay close to her mom.\nachieved tok\/s: 318.611987\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/arun stories42M.bin\namamitsu@Amamitsu:~\/llama2$ .\/run stories42M.bin\nOne day, a boy named Tom went to the park. He saw a big tree with dry leaves. Tom wanted to cut the dry leaves and make a fun toy. He took a small saw from his dad's tool box. 
Tom started to cut the dry leaves.\nA big bird came and sat on the tree. The bird said, \"Hello, Tom! What are you doing?\" Tom said, \"I am cutting the dry leaves to make a toy.\" The bird was happy and flew away. Tom kept cutting the dry leaves.\nSuddenly, the wind blew very hard. The dry leaves flew in all directions. They hit Tom's dad on the head. His dad said, \"Ouch!\" Tom was sad. He didn't mean to make his dad fall. Now, the fun toy was gone and Tom was hurt.\nachieved tok\/s: 120.836055\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/arun stories110M.bin\nOnce upon a time, there was a little girl named Lily. One day, she went to the park with her mom. She saw a big, red balloon and wanted to touch it. Her mom said it was too high, but Lily didn't listen. She climbed up and touched it. It was so fun!\nSuddenly, a bird flew by and scared Lily. She lost her balance and fell down. Her mom rushed to her side and asked if she was okay. Lily was a little scared, but her mom hugged her and said she was okay.\nAfter that, Lily and her mom went to get some ice cream. Lily chose a mild flavor that she had never tried before. She said to her mom, \"This ice cream is yummy!\" Her mom smiled and said, \"I'm glad you like it, Lily.\"\nachieved tok\/s: 45.656755<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th class=\"has-text-align-center\" data-align=\"center\">On an Intel\u00ae Core\u2122 i9-12900K<\/th><th class=\"has-text-align-center\" data-align=\"center\">stories15M<\/th><th class=\"has-text-align-center\" data-align=\"center\">stories42M<\/th><th class=\"has-text-align-center\" data-align=\"center\">stories110M<\/th><\/tr><\/thead><tbody><tr><td class=\"has-text-align-center\" data-align=\"center\">Output rate (tokens\/s) <br>Baseline<\/td><td class=\"has-text-align-center\" data-align=\"center\">168.6677130<\/td><td class=\"has-text-align-center\" data-align=\"center\">56.4654472<\/td><td class=\"has-text-align-center\" data-align=\"center\">21.1014994<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">Output rate (tokens\/s) <br>With OpenMP<\/td><td class=\"has-text-align-center\" data-align=\"center\">432.926829<\/td><td class=\"has-text-align-center\" data-align=\"center\">177.809388<\/td><td class=\"has-text-align-center\" data-align=\"center\">57.982319<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">Output rate (tokens\/s) <br>With AVX<\/td><td class=\"has-text-align-center\" data-align=\"center\">318.611987<\/td><td class=\"has-text-align-center\" data-align=\"center\">120.836055<\/td><td class=\"has-text-align-center\" data-align=\"center\">45.656755<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">Output rate (tokens\/s) <br>With AVX and OpenMP<\/td><td class=\"has-text-align-center\" data-align=\"center\">415.986949<\/td><td class=\"has-text-align-center\" data-align=\"center\">145.546705<\/td><td class=\"has-text-align-center\" data-align=\"center\">54.567380<\/td><\/tr><\/tbody><tfoot><tr><td class=\"has-text-align-center\" data-align=\"center\" 
colspan=\"4\">AVX\u3092\u4f7f\u3046\u3068\u6027\u80fd\u304c\u304a\u3088\u305d100%\u5411\u4e0a\u3057\u307e\u3059\u3002<br>OpenMP\u3092\u4f7f\u3046\u3068\u6027\u80fd\u304c\u304a\u3088\u305d170%\u5411\u4e0a\u3057\u307e\u3059\u3002<br>\u3067\u3059\u304c\u3001AVX\u3068OpenMP\u306e\u9023\u643a\u52b9\u679c\u306f\u4e88\u671f\u4ee5\u4e0b\u3001\u304a\u3088\u305d150%\u5411\u4e0a\u3002<\/td><\/tr><\/tfoot><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">GPU\u5411\u3051\u6700\u9069\u5316<\/h3>\n\n\n\n\n\n\n\n<div class=\"swell-block-accordion\">\n<details class=\"swell-block-accordion__item\" data-swl-acc=\"wrapper\"><summary class=\"swell-block-accordion__title\" data-swl-acc=\"header\"><span class=\"swell-block-accordion__label\"><span style=\"--the-icon-svg: url(data:image\/svg+xml;base64,PHN2ZyBoZWlnaHQ9IjFlbSIgd2lkdGg9IjFlbSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiBhcmlhLWhpZGRlbj0idHJ1ZSIgdmlld0JveD0iMCAwIDQ4IDQ4Ij48cGF0aCBkPSJtMjQgMTAgMy4yIDcuNSAxIDIuMiAyLjQuMiA4LjEuOC02LjEgNS40LTEuOCAxLjYuNSAyLjQgMS44IDcuOS03LTQuMS0yLjEtMS4zLTIuMSAxLjItNyA0LjEgMS44LTcuOS41LTIuNC0xLjgtMS42LTYuMS01LjQgOC4xLS44IDIuNC0uMiAxLTIuMkwyNCAxMG0wLTguN2MtLjQgMC0uOC4yLS45LjZsLTYgMTMuOS0xNS4yIDEuNWMtLjkuMS0xLjIgMS4yLS42IDEuOGwxMS40IDEwLTMuMyAxNC44Yy0uMi43LjQgMS4yIDEgMS4yLjIgMCAuMyAwIC41LS4xTDI0IDM3LjMgMzcuMSA0NWMuMi4xLjMuMS41LjEuNiAwIDEuMS0uNiAxLTEuMmwtMy4zLTE0LjggMTEuNC0xMGMuNy0uNi4zLTEuNy0uNi0xLjhMMzEgMTUuOSAyNSAyYy0uMi0uNS0uNi0uNy0xLS43eiI+PC9wYXRoPjwvc3ZnPg==)\" data-icon=\"LsStarEmpty\" data-id=\"0\" aria-hidden=\"true\" class=\"swl-inline-icon\">\u2003<\/span>run.cu<\/span><span class=\"swell-block-accordion__icon c-switchIconBtn\" data-swl-acc=\"icon\" aria-hidden=\"true\" data-opened=\"false\"><i class=\"__icon--closed icon-caret-down\"><\/i><i class=\"__icon--opened icon-caret-up\"><\/i><\/span><\/summary><div class=\"swell-block-accordion__body\" data-swl-acc=\"body\">\n<pre class=\"wp-block-code\"><code>\/* Inference for Llama-2 Transformer model in pure C *\/\n\n#include &lt;stdio.h&gt;\n#include &lt;stdlib.h&gt;\n#include &lt;ctype.h&gt;\n#include &lt;time.h&gt;\n#include &lt;math.h&gt;\n#include &lt;string.h&gt;\n#include &lt;fcntl.h&gt;\n\/\/#include &lt;immintrin.h&gt;\n#if defined _WIN32\n    #include \"win.h\"\n#else\n    #include &lt;unistd.h&gt;\n    #include &lt;sys\/mman.h&gt;\n#endif\n\/\/REVIEW L-I CUDA Headers\n#include &lt;cuda_runtime.h&gt;\n#include &lt;cub\/cub.cuh&gt;\n#include &lt;cublas_v2.h&gt;\n\/\/REVIEW END\n\n\/\/REVIEW DEFINE CUDACHECKs\n#define cudaCheck(err) __cudaCheck(err, __FILE__, __LINE__)\n#define cublasCheck(err) __cublasCheck(err, __FILE__, __LINE__)\ninline void __cudaCheck(cudaError_t err, const char* file, int line) {\n    if (err != cudaSuccess) {\n        std::cerr &lt;&lt; \"CUDA Error: \" &lt;&lt; cudaGetErrorString(err) \n                  &lt;&lt; \" (\" &lt;&lt; err &lt;&lt; \") at \" &lt;&lt; file &lt;&lt; \":\" &lt;&lt; line &lt;&lt; std::endl;\n        std::exit(EXIT_FAILURE);\n    }\n}\n\ninline void __cublasCheck(cublasStatus_t err, const char* file, int line) {\n    if (err != CUBLAS_STATUS_SUCCESS) {\n        std::cerr &lt;&lt; \"cuBLAS Error: \" &lt;&lt; err \n                  &lt;&lt; \" at \" &lt;&lt; file &lt;&lt; \":\" &lt;&lt; line &lt;&lt; std::endl;\n        std::exit(EXIT_FAILURE);\n    }\n}\n\/\/REVIEW END\n\/\/ ----------------------------------------------------------------------------\n\/\/ Transformer model\n\ntypedef struct {\n    int dim; \/\/ transformer dimension\n    int hidden_dim; \/\/ for ffn layers\n    int n_layers; \/\/ 
number of layers\n    int n_heads; \/\/ number of query heads\n    int n_kv_heads; \/\/ number of key\/value heads (can be &lt; query heads because of multiquery)\n    int vocab_size; \/\/ vocabulary size, usually 256 (byte-level)\n    int seq_len; \/\/ max sequence length\n} Config;\n\ntypedef struct {\n    \/\/ token embedding table\n    float* token_embedding_table;    \/\/ (vocab_size, dim)\n    \/\/ weights for rmsnorms\n    float* rms_att_weight; \/\/ (layer, dim) rmsnorm weights\n    float* rms_ffn_weight; \/\/ (layer, dim)\n    \/\/ weights for matmuls. note dim == n_heads * head_size\n    float* wq; \/\/ (layer, dim, n_heads * head_size)\n    float* wk; \/\/ (layer, dim, n_kv_heads * head_size)\n    float* wv; \/\/ (layer, dim, n_kv_heads * head_size)\n    float* wo; \/\/ (layer, n_heads * head_size, dim)\n    \/\/ weights for ffn\n    float* w1; \/\/ (layer, hidden_dim, dim)\n    float* w2; \/\/ (layer, dim, hidden_dim)\n    float* w3; \/\/ (layer, hidden_dim, dim)\n    \/\/ final rmsnorm\n    float* rms_final_weight; \/\/ (dim,)\n    \/\/ (optional) classifier weights for the logits, on the last layer\n    float* wcls;\n} TransformerWeights;\n\ntypedef struct {\n    \/\/ current wave of activations\n    float *x; \/\/ activation at current time stamp (dim,)\n    float *xb; \/\/ same, but inside a residual branch (dim,)\n    float *xb2; \/\/ an additional buffer just for convenience (dim,)\n    float *hb; \/\/ buffer for hidden dimension in the ffn (hidden_dim,)\n    float *hb2; \/\/ buffer for hidden dimension in the ffn (hidden_dim,)\n    float *q; \/\/ query (dim,)\n    float *k; \/\/ key (dim,)\n    float *v; \/\/ value (dim,)\n    float *att; \/\/ buffer for scores\/attention values (n_heads, seq_len)\n\/\/REVIEW logit on GPU\n    float *logitsgpu; \/\/output logits GPU\n\/\/REVIEW END\n    float *logits; \/\/ output logits\n    \/\/ kv cache\n    float* key_cache;   \/\/ (layer, seq_len, dim)\n    float* value_cache; \/\/ (layer, seq_len, dim)\n} RunState;\n\ntypedef struct {\n    Config config; \/\/ the hyperparameters of the architecture (the blueprint)\n    TransformerWeights weights; \/\/ the weights of the model\n    RunState state; \/\/ buffers for the \"wave\" of activations in the forward pass\n    \/\/ some more state needed to properly clean up the memory mapping (sigh)\n    int fd; \/\/ file descriptor for memory mapping\n    float* data; \/\/ memory mapped data pointer\n    ssize_t file_size; \/\/ size of the checkpoint file in bytes\n} Transformer;\n\n\/\/REVIEW cuBLAS Handle\ncublasHandle_t cuBLASHandle = 0;\n\ninline void createCublasHandle() {\n    cublasCheck(cublasCreate(&amp;cuBLASHandle));\n}\n\ninline void destroyCublasHandle() {\n    cublasCheck(cublasDestroy(cuBLASHandle));\n}\n\/\/REVIEW END\nvoid malloc_run_state(RunState* s, Config* p) {\n    \/\/ we calloc instead of malloc to keep valgrind happy\n    int kv_dim = (p-&gt;dim * p-&gt;n_kv_heads) \/ p-&gt;n_heads;\n\/\/REVIEW malloc in GPU\n    cudaCheck(cudaMalloc((void**)&amp;s-&gt;x, p-&gt;dim * sizeof(float)));\n    cudaCheck(cudaMalloc((void**)&amp;s-&gt;xb, p-&gt;dim * sizeof(float)));\n    cudaCheck(cudaMalloc((void**)&amp;s-&gt;xb2, p-&gt;dim * sizeof(float)));\n    cudaCheck(cudaMalloc((void**)&amp;s-&gt;hb, p-&gt;hidden_dim * sizeof(float)));\n    cudaCheck(cudaMalloc((void**)&amp;s-&gt;hb2, p-&gt;hidden_dim * sizeof(float)));\n    cudaCheck(cudaMalloc((void**)&amp;s-&gt;q, p-&gt;dim * sizeof(float)));\n    cudaCheck(cudaMalloc((void**)&amp;s-&gt;key_cache, p-&gt;n_layers * p-&gt;seq_len * 
kv_dim * sizeof(float)));\n    cudaCheck(cudaMalloc((void**)&amp;s-&gt;value_cache, p-&gt;n_layers * p-&gt;seq_len * kv_dim * sizeof(float)));\n    cudaCheck(cudaMalloc((void**)&amp;s-&gt;att, p-&gt;n_heads * p-&gt;seq_len * sizeof(float)));\n    cudaCheck(cudaMalloc((void**)&amp;s-&gt;logitsgpu, p-&gt;vocab_size * sizeof(float)));\n\/\/REVIEW END\n\/\/FIXME cast to float*\n    s-&gt;logits = (float*)calloc(p-&gt;vocab_size, sizeof(float));\n    \/\/ ensure all mallocs went fine\n\/\/FIXME add logitsgpu check\n    if (!s-&gt;x || !s-&gt;xb || !s-&gt;xb2 || !s-&gt;hb || !s-&gt;hb2 || !s-&gt;q\n     || !s-&gt;key_cache || !s-&gt;value_cache || !s-&gt;att || !s-&gt;logits || !s-&gt;logitsgpu) {\n        fprintf(stderr, \"malloc failed!\\n\");\n        exit(EXIT_FAILURE);\n    }\n}\n\nvoid free_run_state(RunState* s) {\n\/\/REVIEW free in GPU\n    cudaCheck(cudaFree(s-&gt;x));\n    cudaCheck(cudaFree(s-&gt;xb));\n    cudaCheck(cudaFree(s-&gt;xb2));\n    cudaCheck(cudaFree(s-&gt;hb));\n    cudaCheck(cudaFree(s-&gt;hb2));\n    cudaCheck(cudaFree(s-&gt;q));\n    cudaCheck(cudaFree(s-&gt;att));\n    cudaCheck(cudaFree(s-&gt;logitsgpu));\n    cudaCheck(cudaFree(s-&gt;key_cache));\n    cudaCheck(cudaFree(s-&gt;value_cache));\n\/\/REVIEW END\n    free(s-&gt;logits);\n}\n\nvoid memory_map_weights(TransformerWeights *w, Config* p, float* ptr, int shared_weights) {\n    int head_size = p-&gt;dim \/ p-&gt;n_heads;\n    \/\/ make sure the multiplications below are done in 64bit to fit the parameter counts of 13B+ models\n    unsigned long long n_layers = p-&gt;n_layers;\n    w-&gt;token_embedding_table = ptr;\n    ptr += p-&gt;vocab_size * p-&gt;dim;\n    w-&gt;rms_att_weight = ptr;\n    ptr += n_layers * p-&gt;dim;\n    w-&gt;wq = ptr;\n    ptr += n_layers * p-&gt;dim * (p-&gt;n_heads * head_size);\n    w-&gt;wk = ptr;\n    ptr += n_layers * p-&gt;dim * (p-&gt;n_kv_heads * head_size);\n    w-&gt;wv = ptr;\n    ptr += n_layers * p-&gt;dim * (p-&gt;n_kv_heads * head_size);\n    w-&gt;wo = ptr;\n    ptr += n_layers * (p-&gt;n_heads * head_size) * p-&gt;dim;\n    w-&gt;rms_ffn_weight = ptr;\n    ptr += n_layers * p-&gt;dim;\n    w-&gt;w1 = ptr;\n    ptr += n_layers * p-&gt;dim * p-&gt;hidden_dim;\n    w-&gt;w2 = ptr;\n    ptr += n_layers * p-&gt;hidden_dim * p-&gt;dim;\n    w-&gt;w3 = ptr;\n    ptr += n_layers * p-&gt;dim * p-&gt;hidden_dim;\n    w-&gt;rms_final_weight = ptr;\n    ptr += p-&gt;dim;\n    ptr += p-&gt;seq_len * head_size \/ 2; \/\/ skip what used to be freq_cis_real (for RoPE)\n    ptr += p-&gt;seq_len * head_size \/ 2; \/\/ skip what used to be freq_cis_imag (for RoPE)\n    w-&gt;wcls = shared_weights ? w-&gt;token_embedding_table : ptr;\n}\n\nvoid read_checkpoint(char* checkpoint, Config* config, TransformerWeights* weights,\n                     int* fd, float** data, ssize_t* file_size) {\n    FILE *file = fopen(checkpoint, \"rb\");\n    if (!file) { fprintf(stderr, \"Couldn't open file %s\\n\", checkpoint); exit(EXIT_FAILURE); }\n    \/\/ read in the config header\n    if (fread(config, sizeof(Config), 1, file) != 1) { exit(EXIT_FAILURE); }\n    \/\/ negative vocab size is hacky way of signaling unshared weights. bit yikes.\n    int shared_weights = config-&gt;vocab_size &gt; 0 ? 
1 : 0;\n    config-&gt;vocab_size = abs(config-&gt;vocab_size);\n    \/\/ figure out the file size\n    fseek(file, 0, SEEK_END); \/\/ move file pointer to end of file\n    *file_size = ftell(file); \/\/ get the file size, in bytes\n    fclose(file);\n    \/\/ memory map the Transformer weights into the data pointer\n    *fd = open(checkpoint, O_RDONLY); \/\/ open in read only mode\n    if (*fd == -1) { fprintf(stderr, \"open failed!\\n\"); exit(EXIT_FAILURE); }\n    *data = (float *)mmap(NULL, *file_size, PROT_READ, MAP_PRIVATE, *fd, 0);\n    if (*data == MAP_FAILED) { fprintf(stderr, \"mmap failed!\\n\"); exit(EXIT_FAILURE); }\n\/\/REVIEW Weight copy\n    float* weights_ptr = 0;\n    \/\/ FIXME remove config in the checkpoint file\n    cudaCheck(cudaMalloc((void**)&amp;weights_ptr, *file_size - sizeof(Config)));\n    cudaCheck(cudaMemcpy(weights_ptr, *data + sizeof(Config)\/sizeof(float), *file_size - sizeof(Config), cudaMemcpyHostToDevice));\n\/\/REVIEW END\n    memory_map_weights(weights, config, weights_ptr, shared_weights);\n}\n\nvoid build_transformer(Transformer *t, char* checkpoint_path) {\n    \/\/ read in the Config and the Weights from the checkpoint\n    read_checkpoint(checkpoint_path, &amp;t-&gt;config, &amp;t-&gt;weights, &amp;t-&gt;fd, &amp;t-&gt;data, &amp;t-&gt;file_size);\n    \/\/ allocate the RunState buffers\n    malloc_run_state(&amp;t-&gt;state, &amp;t-&gt;config);\n}\n\nvoid free_transformer(Transformer* t) {\n    \/\/ close the memory mapping\n    if (t-&gt;data != MAP_FAILED) { munmap(t-&gt;data, t-&gt;file_size); }\n    if (t-&gt;fd != -1) { close(t-&gt;fd); }\n    \/\/ free the RunState buffers\n\/\/REVIEW remove the weight on GPU\n    cudaCheck(cudaFree(t-&gt;weights.token_embedding_table));\n\/\/REVIEW END\n    free_run_state(&amp;t-&gt;state);\n}\n\n\/\/ ----------------------------------------------------------------------------\n\/\/ neural net blocks; the dynamics of the Transformer\n\/\/REVIEW Spread Tensor\nint divUp(int a, int b) {\n    return (a + b - 1) \/ b;\n}\n\/\/REVIEW END\n\/\/REVIEW RMSNorm on GPU\n__global__ void rmsnorm_cuk(const float* __restrict__ x, const float* __restrict__ weight, float* __restrict__ out, int dim, float eps) {\n    extern __shared__ float sdata&#91;]; \n    int tid = threadIdx.x;\n    float val = 0.0f;\n    if (tid &lt; dim) {\n        float xv = x&#91;tid];\n        val = xv * xv;\n    }\n    typedef cub::BlockReduce&lt;float, 1024&gt; BlockReduce;\n    __shared__ typename BlockReduce::TempStorage temp_storage;\n    float sum = BlockReduce(temp_storage).Sum(val);\n    __syncthreads();\n\n    if (tid == 0) {\n        sum = sum \/ dim + eps;\n        sum = rsqrtf(sum);\n        sdata&#91;0] = sum;\n    }\n    __syncthreads();\n\n    float norm_coef = sdata&#91;0];\n    if (tid &lt; dim) {\n        out&#91;tid] = weight&#91;tid] * (x&#91;tid] * norm_coef);\n    }\n}\n\nvoid rmsnorm(float* out, const float* x, const float* weight, int dim, cudaStream_t stream=0) {\n    int blockSize = 1024;\n    int gridSize = 1;\n    size_t smem = sizeof(float);\n    rmsnorm_cuk&lt;&lt;&lt;gridSize, blockSize, smem, stream&gt;&gt;&gt;(x, weight, out, dim, 1e-5f);\n}\n\/\/REVIEW END\n\/\/REVIEW softmax on GPU\n__device__ void softmax_gpu(float* __restrict__ x, int size) {\n    int tid = threadIdx.x;\n    int step = blockDim.x;\n\n    \/\/ find max value (for numerical stability)\n    float max_val = tid &lt; size ? 
x&#91;tid] : 0;\n    for (int i = tid + step; i &lt; size; i += step) {\n        if (x&#91;i] &gt; max_val) {\n            max_val = x&#91;i];\n        }\n    }\n    using BlockReduce = cub::BlockReduce&lt;float, 1024&gt;;\n    __shared__ typename BlockReduce::TempStorage temp_storage;\n    __shared__ float shared_val;\n    max_val = BlockReduce(temp_storage).Reduce(max_val, cub::Max());\n    if (threadIdx.x == 0) {\n        shared_val = max_val;\n    }\n    __syncthreads();\n    max_val = shared_val;\n\n    \/\/ exp and sum\n    float sum = 0.0f;\n    for (int i = tid; i &lt; size; i += step) {\n        x&#91;i] = expf(x&#91;i] - max_val);\n        sum += x&#91;i];\n    }\n    sum = BlockReduce(temp_storage).Sum(sum);\n    if (threadIdx.x == 0) {\n        shared_val = sum;\n    }\n    __syncthreads();\n    sum = shared_val;\n\n    \/\/ normalize\n    for (int i = tid; i &lt; size; i += step) {\n        x&#91;i] \/= sum;\n    }\n}\n\nvoid softmax(float* x, int size) {\n    \/\/ find max value (for numerical stability)\n    float max_val = x&#91;0];\n    for (int i = 1; i &lt; size; i++) {\n        if (x&#91;i] &gt; max_val) {\n            max_val = x&#91;i];\n        }\n    }\n    \/\/ exp and sum\n    float sum = 0.0f;\n    for (int i = 0; i &lt; size; i++) {\n        x&#91;i] = expf(x&#91;i] - max_val);\n        sum += x&#91;i];\n    }\n    \/\/ normalize\n    for (int i = 0; i &lt; size; i++) {\n        x&#91;i] \/= sum;\n    }\n}\n\/\/REVIEW END\n\/\/REVIEW matmul on GPU\nvoid matmul(float* xout, float* x, float* w, int n, int d) {\n    \/\/ W (d,n) @ x (n,) -&gt; xout (d,)\n    float alpha = 1.0f;\n    float beta = 0.0f;\n    cublasSgemv(cuBLASHandle, CUBLAS_OP_T, n, d, &amp;alpha, w, n, x, 1, &amp;beta, xout, 1);\n}\n\/\/REVIEW END\n\/\/REVIEW RoPE on GPU\n__global__ void RoPE_cuk(int pos, float *sq, float *sk, int dim, int kv_dim, int head_size) {\n    int global_id = blockIdx.x * blockDim.x + threadIdx.x;\n    int idx = global_id * 2;\n    if (idx &gt;= dim) return;\n\n    extern __shared__ float freq_table&#91;];\n\n    if (threadIdx.x &lt; head_size) {\n        float inv_head = 1.0f \/ (float)head_size;\n        float neg_ln_10000 = -__logf(10000.0f);\n        \/\/ freq_table&#91;head_dim] = exp(-ln(10000)*(head_dim\/head_size))\n        freq_table&#91;threadIdx.x] = __expf(neg_ln_10000 * threadIdx.x * inv_head);\n    }\n\n    __syncthreads();\n\n    int head_dim = idx % head_size;\n    float freq = freq_table&#91;head_dim];\n    float val = pos * freq;\n    float fcr = __cosf(val);\n    float fci = __sinf(val);\n\n    int rotn = (idx &lt; kv_dim) ? 2 : 1;\n    for (int v = 0; v &lt; rotn; v++) {\n        float* vec = (v == 0) ? 
sq : sk;\n        float v0 = vec&#91;idx];\n        float v1 = vec&#91;idx+1];\n        float rv0 = v0 * fcr - v1 * fci;\n        float rv1 = v0 * fci + v1 * fcr;\n        vec&#91;idx]   = rv0;\n        vec&#91;idx+1] = rv1;\n    }\n}\n\nvoid RoPE(int pos, RunState* s, int dim, int kv_dim, int head_size) {\n    int threadsPerBlock = 256;\n    int halfDim = dim \/ 2;\n    int blocks = (halfDim + threadsPerBlock - 1) \/ threadsPerBlock;\n    size_t shared_mem_bytes = head_size * sizeof(float);\n    RoPE_cuk&lt;&lt;&lt;blocks, threadsPerBlock, shared_mem_bytes&gt;&gt;&gt;(pos, s-&gt;q, s-&gt;k, dim, kv_dim, head_size);\n    cudaCheck(cudaGetLastError());\n    cudaCheck(cudaDeviceSynchronize());\n}\n\/\/REVIEW END\n\/\/REVIEW MHA on GPU\n__global__ void multi_head_attention_cuk(\n    int pos, int seq_len, float *sq, float *satt, float *sxb, \n    float *key_cache, float *value_cache, \n    int kv_dim, int kv_mul, int head_size, int loff, float inv_sqrt_head_size) \n{\n    int h = blockIdx.x;\n    int tid = threadIdx.x;\n\n    extern __shared__ float shared&#91;];\n    float* q_shared = shared;\n    float* att_shared = q_shared + head_size;\n\n    float* q = sq + h * head_size;\n    for (int i = tid; i &lt; head_size; i += blockDim.x) {\n        q_shared&#91;i] = q&#91;i];\n    }\n\n    __syncthreads();\n\n    float* att = satt + h * seq_len;\n\n    for (int t = tid; t &lt;= pos; t += blockDim.x) {\n        float* k = key_cache + loff + t * kv_dim + (h \/ kv_mul) * head_size;\n        float score = 0.0f;\n        for (int i = 0; i &lt; head_size; i++) {\n            score += q_shared&#91;i] * k&#91;i];\n        }\n        score *= inv_sqrt_head_size;\n        att&#91;t] = score;\n    }\n    __syncthreads();\n\n    softmax_gpu(att, pos + 1);\n    __syncthreads();\n\n    for (int t = tid; t &lt;= pos; t += blockDim.x) {\n        att_shared&#91;t] = att&#91;t];\n    }\n\n    __syncthreads();\n\n    float* xb = sxb + h * head_size;\n\n    for (int i = tid; i &lt; head_size; i += blockDim.x) {\n        float val = 0.0f;\n        for (int t = 0; t &lt;= pos; t++) {\n            float* v = value_cache + loff + t * kv_dim + (h \/ kv_mul) * head_size;\n            val += att_shared&#91;t] * v&#91;i];\n        }\n        xb&#91;i] = val;\n    }\n}\n\nvoid multi_head_attention(int pos, Config* p, RunState* s, int kv_dim, int kv_mul, int head_size, int loff)\n{\n    float inv_sqrt_head_size = 1.0f \/ sqrtf((float)head_size);\n    int grid = p-&gt;n_heads;\n    int block = 1024; \n    size_t shared_mem_size = head_size * sizeof(float) + p-&gt;seq_len * sizeof(float);\n\n    multi_head_attention_cuk &lt;&lt;&lt;grid, block, shared_mem_size&gt;&gt;&gt; (\n        pos, p-&gt;seq_len, s-&gt;q, s-&gt;att, s-&gt;xb, s-&gt;key_cache, s-&gt;value_cache, \n        kv_dim, kv_mul, head_size, loff, inv_sqrt_head_size);\n}\n\/\/REVIEW END\n\/\/REVIEW SiLU on GPU\n__global__ void SwiGLU_cuk(float *hb, float *hb2, int hidden_dim) {\n    int i = blockIdx.x * blockDim.x + threadIdx.x;\n    if (i &lt; hidden_dim) {\n        float val = hb&#91;i];\n        \/\/ silu(x)=x*\u03c3(x), where \u03c3(x) is the logistic sigmoid\n        val *= (1.0f \/ (1.0f + expf(-val)));\n        \/\/ elementwise multiply with w3(x)\n        val *= hb2&#91;i];\n        hb&#91;i] = val;\n    }\n}\n\nvoid SwiGLU(RunState *s, int hidden_dim) {\n    SwiGLU_cuk&lt;&lt;&lt;divUp(hidden_dim, 1024), 1024&gt;&gt;&gt;(s-&gt;hb, s-&gt;hb2, hidden_dim);\n}\n\/\/REVIEW END\n\/\/REVIEW R on GPU\n__global__ void residual_connection_cuk(float* x, float* xb, int 
dim) {\n    int i = blockIdx.x * blockDim.x + threadIdx.x;\n    if (i &lt; dim) {\n        x&#91;i] += xb&#91;i];\n    }\n}\nvoid residual_connection(float *x, float *xb, int dim) {\n    residual_connection_cuk&lt;&lt;&lt;divUp(dim, 1024), 1024&gt;&gt;&gt;(x, xb, dim);\n}\n\/\/REVIEW END\n\/\/REVIEW forward on GPU\nfloat* forward(Transformer* transformer, int token, int pos) {\n\n    \/\/ a few convenience variables\n    Config* p = &amp;transformer-&gt;config;\n    TransformerWeights* w = &amp;transformer-&gt;weights;\n    RunState* s = &amp;transformer-&gt;state;\n    float *x = s-&gt;x;\n    int dim = p-&gt;dim;\n    int kv_dim = (p-&gt;dim * p-&gt;n_kv_heads) \/ p-&gt;n_heads;\n    int kv_mul = p-&gt;n_heads \/ p-&gt;n_kv_heads; \/\/ integer multiplier of the kv sharing in multiquery\n    int hidden_dim =  p-&gt;hidden_dim;\n    int head_size = dim \/ p-&gt;n_heads;\n\n    \/\/ copy the token embedding into x\n    float* content_row = w-&gt;token_embedding_table + token * dim;\n    cudaCheck(cudaMemcpy(x, content_row, dim*sizeof(*x), cudaMemcpyHostToDevice));\n\n    \/\/ forward all the layers\n    for(unsigned long long l = 0; l &lt; p-&gt;n_layers; l++) {\n\n        \/\/ attention rmsnorm\n        rmsnorm(s-&gt;xb, x, w-&gt;rms_att_weight + l*dim, dim);\n\n        \/\/ key and value point to the kv cache\n        int loff = l * p-&gt;seq_len * kv_dim; \/\/ kv cache layer offset for convenience\n        s-&gt;k = s-&gt;key_cache + loff + pos * kv_dim;\n        s-&gt;v = s-&gt;value_cache + loff + pos * kv_dim;\n\n        \/\/ qkv matmuls for this position\n        matmul(s-&gt;q, s-&gt;xb, w-&gt;wq + l*dim*dim, dim, dim);\n        matmul(s-&gt;k, s-&gt;xb, w-&gt;wk + l*dim*kv_dim, dim, kv_dim);\n        matmul(s-&gt;v, s-&gt;xb, w-&gt;wv + l*dim*kv_dim, dim, kv_dim);\n\n        \/\/ RoPE relative positional encoding: complex-valued rotate q and k in each head\n        RoPE(pos, s, dim, kv_dim, head_size);\n\n        \/\/ multihead attention. 
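one launch of multi_head_attention_cuk handles every head at once: the grid has n_heads blocks, and each block stages its query and attention row in shared memory, scores the cached keys, runs softmax_gpu, then accumulates the value-weighted sum into xb. Logically we still 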
iterate over all heads\n        multi_head_attention(pos, p, s, kv_dim, kv_mul, head_size, loff);\n\n        \/\/ final matmul to get the output of the attention\n        matmul(s-&gt;xb2, s-&gt;xb, w-&gt;wo + l*dim*dim, dim, dim);\n\n        \/\/ residual connection back into x\n        residual_connection(x, s-&gt;xb2, dim);\n\n        \/\/ ffn rmsnorm\n        rmsnorm(s-&gt;xb, x, w-&gt;rms_ffn_weight + l*dim, dim);\n\n        \/\/ Now for FFN in PyTorch we have: self.w2(F.silu(self.w1(x)) * self.w3(x))\n        \/\/ first calculate self.w1(x) and self.w3(x)\n        matmul(s-&gt;hb, s-&gt;xb, w-&gt;w1 + l*dim*hidden_dim, dim, hidden_dim);\n        matmul(s-&gt;hb2, s-&gt;xb, w-&gt;w3 + l*dim*hidden_dim, dim, hidden_dim);\n\n        \/\/ SwiGLU non-linearity\n        SwiGLU(s, hidden_dim);\n\n        \/\/ final matmul to get the output of the ffn\n        matmul(s-&gt;xb, s-&gt;hb, w-&gt;w2 + l*dim*hidden_dim, hidden_dim, dim);\n\n        \/\/ residual connection\n        residual_connection(x, s-&gt;xb, dim);\n    }\n\n    \/\/ final rmsnorm\n    rmsnorm(x, x, w-&gt;rms_final_weight, dim);\n\n    \/\/ classifier into logits\n    matmul(s-&gt;logitsgpu, x, w-&gt;wcls, p-&gt;dim, p-&gt;vocab_size);\n    cudaCheck(cudaMemcpy(s-&gt;logits, s-&gt;logitsgpu, p-&gt;vocab_size * sizeof(float), cudaMemcpyDeviceToHost));\n    return s-&gt;logits;\n}\n\/\/REVIEW END\n\/\/ ----------------------------------------------------------------------------\n\/\/ The Byte Pair Encoding (BPE) Tokenizer that translates strings &lt;-&gt; tokens\n\ntypedef struct {\n    char *str;\n    int id;\n} TokenIndex;\n\ntypedef struct {\n    char** vocab;\n    float* vocab_scores;\n    TokenIndex *sorted_vocab;\n    int vocab_size;\n    unsigned int max_token_length;\n    unsigned char byte_pieces&#91;512]; \/\/ stores all single-byte strings\n} Tokenizer;\n\nint compare_tokens(const void *a, const void *b) {\n    return strcmp(((TokenIndex*)a)-&gt;str, ((TokenIndex*)b)-&gt;str);\n}\n\nvoid build_tokenizer(Tokenizer* t, char* tokenizer_path, int vocab_size) {\n    \/\/ i should have written the vocab_size into the tokenizer file... 
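as it is, it has to be passed in from the model checkpoint's Config... 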
sigh\n    t-&gt;vocab_size = vocab_size;\n    \/\/ malloc space to hold the scores and the strings\n    t-&gt;vocab = (char**)malloc(vocab_size * sizeof(char*));\n    t-&gt;vocab_scores = (float*)malloc(vocab_size * sizeof(float));\n    t-&gt;sorted_vocab = NULL; \/\/ initialized lazily\n    for (int i = 0; i &lt; 256; i++) {\n        t-&gt;byte_pieces&#91;i * 2] = (unsigned char)i;\n        t-&gt;byte_pieces&#91;i * 2 + 1] = '\\0';\n    }\n    \/\/ read in the file\n    FILE *file = fopen(tokenizer_path, \"rb\");\n    if (!file) { fprintf(stderr, \"couldn't load %s\\n\", tokenizer_path); exit(EXIT_FAILURE); }\n    if (fread(&amp;t-&gt;max_token_length, sizeof(int), 1, file) != 1) { fprintf(stderr, \"failed read\\n\"); exit(EXIT_FAILURE); }\n    int len;\n    for (int i = 0; i &lt; vocab_size; i++) {\n        if (fread(t-&gt;vocab_scores + i, sizeof(float), 1, file) != 1) { fprintf(stderr, \"failed read\\n\"); exit(EXIT_FAILURE);}\n        if (fread(&amp;len, sizeof(int), 1, file) != 1) { fprintf(stderr, \"failed read\\n\"); exit(EXIT_FAILURE); }\n        t-&gt;vocab&#91;i] = (char *)malloc(len + 1);\n        if (fread(t-&gt;vocab&#91;i], len, 1, file) != 1) { fprintf(stderr, \"failed read\\n\"); exit(EXIT_FAILURE); }\n        t-&gt;vocab&#91;i]&#91;len] = '\\0'; \/\/ add the string terminating token\n    }\n    fclose(file);\n}\n\nvoid free_tokenizer(Tokenizer* t) {\n    for (int i = 0; i &lt; t-&gt;vocab_size; i++) { free(t-&gt;vocab&#91;i]); }\n    free(t-&gt;vocab);\n    free(t-&gt;vocab_scores);\n    free(t-&gt;sorted_vocab);\n}\n\nchar* decode(Tokenizer* t, int prev_token, int token) {\n    char *piece = t-&gt;vocab&#91;token];\n    \/\/ following BOS (1) token, sentencepiece decoder strips any leading whitespace (see PR #89)\n    if (prev_token == 1 &amp;&amp; piece&#91;0] == ' ') { piece++; }\n    \/\/ careful, some tokens designate raw bytes, and look like e.g. '&lt;0x01&gt;'\n    \/\/ parse this and convert and return the actual byte\n    unsigned char byte_val;\n    if (sscanf(piece, \"&lt;0x%02hhX&gt;\", &amp;byte_val) == 1) {\n        piece = (char*)t-&gt;byte_pieces + byte_val * 2;\n    }\n    return piece;\n}\n\nvoid safe_printf(char *piece) {\n    \/\/ piece might be a raw byte token, and we only want to print printable chars or whitespace\n    \/\/ because some of the other bytes can be various control codes, backspace, etc.\n    if (piece == NULL) { return; }\n    if (piece&#91;0] == '\\0') { return; }\n    if (piece&#91;1] == '\\0') {\n        unsigned char byte_val = piece&#91;0];\n        if (!(isprint(byte_val) || isspace(byte_val))) {\n            return; \/\/ bad byte, don't print it\n        }\n    }\n    printf(\"%s\", piece);\n}\n\nint str_lookup(char *str, TokenIndex *sorted_vocab, int vocab_size) {\n    \/\/ efficiently find the perfect match for str in vocab, return its index or -1 if not found\n    TokenIndex tok = { .str = str }; \/\/ acts as the key to search for\n\/\/FIXME cast to TokenIndex*\n    TokenIndex *res = (TokenIndex *)bsearch(&amp;tok, sorted_vocab, vocab_size, sizeof(TokenIndex), compare_tokens);\n    return res != NULL ? 
res-&gt;id : -1;\n}\n\nvoid encode(Tokenizer* t, char *text, int8_t bos, int8_t eos, int *tokens, int *n_tokens) {\n    \/\/ encode the string text (input) into an upper-bound preallocated tokens&#91;] array\n    \/\/ bos != 0 means prepend the BOS token (=1), eos != 0 means append the EOS token (=2)\n    if (text == NULL) { fprintf(stderr, \"cannot encode NULL text\\n\"); exit(EXIT_FAILURE); }\n\n    if (t-&gt;sorted_vocab == NULL) {\n        \/\/ lazily malloc and sort the vocabulary\n        t-&gt;sorted_vocab = (TokenIndex *)malloc(t-&gt;vocab_size * sizeof(TokenIndex));\n        for (int i = 0; i &lt; t-&gt;vocab_size; i++) {\n            t-&gt;sorted_vocab&#91;i].str = t-&gt;vocab&#91;i];\n            t-&gt;sorted_vocab&#91;i].id = i;\n        }\n        qsort(t-&gt;sorted_vocab, t-&gt;vocab_size, sizeof(TokenIndex), compare_tokens);\n    }\n\n    \/\/ create a temporary buffer that will store merge candidates of always two consecutive tokens\n    \/\/ *2 for concat, +1 for null terminator +2 for UTF8 (in case max_token_length is 1)\n    char* str_buffer = (char *)malloc((t-&gt;max_token_length*2 +1 +2) * sizeof(char));\n    size_t str_len = 0;\n\n    \/\/ start at 0 tokens\n    *n_tokens = 0;\n\n    \/\/ add optional BOS (=1) token, if desired\n    if (bos) tokens&#91;(*n_tokens)++] = 1;\n\n    \/\/ add_dummy_prefix is true by default\n    \/\/ so prepend a dummy prefix token to the input string, but only if text != \"\"\n    \/\/ TODO: pretty sure this isn't correct in the general case but I don't have the\n    \/\/ energy to read more of the sentencepiece code to figure out what it's doing\n    if (text&#91;0] != '\\0') {\n        int dummy_prefix = str_lookup(\" \", t-&gt;sorted_vocab, t-&gt;vocab_size);\n        tokens&#91;(*n_tokens)++] = dummy_prefix;\n    }\n\n    \/\/ Okay UTF-8 time. This will get messy. Here is the reference from Wikipedia:\n    \/\/ Code point \u2194 UTF-8 conversion\n    \/\/ First code point\tLast code point\tByte 1\tByte 2\tByte 3\tByte 4\n    \/\/ U+0000\tU+007F\t    0xxxxxxx\n    \/\/ U+0080\tU+07FF\t    110xxxxx\t10xxxxxx\n    \/\/ U+0800\tU+FFFF\t    1110xxxx\t10xxxxxx\t10xxxxxx\n    \/\/ U+10000\tU+10FFFF    11110xxx\t10xxxxxx\t10xxxxxx\t10xxxxxx\n\n    \/\/ process the raw (UTF-8) byte sequence of the input string\n    for (char *c = text; *c != '\\0'; c++) {\n\n        \/\/ reset buffer if the current byte is ASCII or a leading byte\n        \/\/ 0xC0 is 11000000, so (*c &amp; 0xC0) keeps the first 2 bits and zeros the rest\n        \/\/ 0x80 is 10000000\n        \/\/ in UTF-8, all continuation bytes start with \"10\" in first two bits\n        \/\/ so in English this is: \"if this byte is not a continuation byte\"\n        if ((*c &amp; 0xC0) != 0x80) {\n            \/\/ this byte must be either a leading byte (11...) 
or an ASCII char (0x...)\n            \/\/ =&gt; reset our location, as we're starting a new UTF-8 codepoint\n            str_len = 0;\n        }\n\n        \/\/ append the current byte to the buffer\n        str_buffer&#91;str_len++] = *c; \/\/ ++ is post-increment, incremented after this line\n        str_buffer&#91;str_len] = '\\0';\n\n        \/\/ while the next character is a continuation byte, continue appending\n        \/\/ but if there are too many of them, just stop to avoid overruning str_buffer size.\n        if ((*(c+1) &amp; 0xC0) == 0x80 &amp;&amp; str_len &lt; 4) {\n            continue;\n        }\n\n        \/\/ ok c+1 is not a continuation byte, so we've read in a full codepoint\n        int id = str_lookup(str_buffer, t-&gt;sorted_vocab, t-&gt;vocab_size);\n\n        if (id != -1) {\n            \/\/ we found this codepoint in vocab, add it as a token\n            tokens&#91;(*n_tokens)++] = id;\n        } else {\n            \/\/ byte_fallback encoding: just encode each byte as a token\n            \/\/ +3 is here because the first 3 vocab elements are &lt;unk&gt;, &lt;s&gt;, &lt;\/s&gt;\n            \/\/ so the individual bytes only start at index 3\n            for (int i=0; i &lt; str_len; i++) {\n                tokens&#91;(*n_tokens)++] = (unsigned char)str_buffer&#91;i] + 3;\n            }\n        }\n        str_len = 0; \/\/ protect against a sequence of stray UTF8 continuation bytes\n    }\n\n    \/\/ merge the best consecutive pair each iteration, according the scores in vocab_scores\n    while (1) {\n        float best_score = -1e10;\n        int best_id = -1;\n        int best_idx = -1;\n\n        for (int i=0; i &lt; (*n_tokens-1); i++) {\n            \/\/ check if we can merge the pair (tokens&#91;i], tokens&#91;i+1])\n            sprintf(str_buffer, \"%s%s\", t-&gt;vocab&#91;tokens&#91;i]], t-&gt;vocab&#91;tokens&#91;i+1]]);\n            int id = str_lookup(str_buffer, t-&gt;sorted_vocab, t-&gt;vocab_size);\n            if (id != -1 &amp;&amp; t-&gt;vocab_scores&#91;id] &gt; best_score) {\n                \/\/ this merge pair exists in vocab! 
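it also has the highest score seen so far in this pass, so 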
record its score and position\n                best_score = t-&gt;vocab_scores&#91;id];\n                best_id = id;\n                best_idx = i;\n            }\n        }\n\n        if (best_idx == -1) {\n            break; \/\/ we couldn't find any more pairs to merge, so we're done\n        }\n\n        \/\/ merge the consecutive pair (best_idx, best_idx+1) into new token best_id\n        tokens&#91;best_idx] = best_id;\n        \/\/ delete token at position best_idx+1, shift the entire sequence back 1\n        for (int i = best_idx+1; i &lt; (*n_tokens-1); i++) {\n            tokens&#91;i] = tokens&#91;i+1];\n        }\n        (*n_tokens)--; \/\/ token length decreased\n    }\n\n    \/\/ add optional EOS (=2) token, if desired\n    if (eos) tokens&#91;(*n_tokens)++] = 2;\n\n    free(str_buffer);\n}\n\n\/\/ ----------------------------------------------------------------------------\n\/\/ The Sampler, which takes logits and returns a sampled token\n\/\/ sampling can be done in a few ways: greedy argmax, sampling, top-p sampling\n\ntypedef struct {\n    float prob;\n    int index;\n} ProbIndex; \/\/ struct used when sorting probabilities during top-p sampling\n\ntypedef struct {\n    int vocab_size;\n    ProbIndex* probindex; \/\/ buffer used in top-p sampling\n    float temperature;\n    float topp;\n    unsigned long long rng_state;\n} Sampler;\n\nint sample_argmax(float* probabilities, int n) {\n    \/\/ return the index that has the highest probability\n    int max_i = 0;\n    float max_p = probabilities&#91;0];\n    for (int i = 1; i &lt; n; i++) {\n        if (probabilities&#91;i] &gt; max_p) {\n            max_i = i;\n            max_p = probabilities&#91;i];\n        }\n    }\n    return max_i;\n}\n\nint sample_mult(float* probabilities, int n, float coin) {\n    \/\/ sample index from probabilities (they must sum to 1!)\n    \/\/ coin is a random number in &#91;0, 1), usually from random_f32()\n    float cdf = 0.0f;\n    for (int i = 0; i &lt; n; i++) {\n        cdf += probabilities&#91;i];\n        if (coin &lt; cdf) {\n            return i;\n        }\n    }\n    return n - 1; \/\/ in case of rounding errors\n}\n\nint compare(const void* a, const void* b) {\n    ProbIndex* a_ = (ProbIndex*) a;\n    ProbIndex* b_ = (ProbIndex*) b;\n    if (a_-&gt;prob &gt; b_-&gt;prob) return -1;\n    if (a_-&gt;prob &lt; b_-&gt;prob) return 1;\n    return 0;\n}\n\nint sample_topp(float* probabilities, int n, float topp, ProbIndex* probindex, float coin) {\n    \/\/ top-p sampling (or \"nucleus sampling\") samples from the smallest set of\n    \/\/ tokens that exceed probability topp. 
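Concretely, the candidates are sorted by probability and the shortest prefix whose cumulative probability exceeds topp is kept. 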
This way we never sample tokens that\n    \/\/ have very low probabilities and are less likely to go \"off the rails\".\n    \/\/ coin is a random number in &#91;0, 1), usually from random_f32()\n\n    int n0 = 0;\n    \/\/ quicksort indices in descending order of probabilities\n    \/\/ values smaller than (1 - topp) \/ (n - 1) cannot be part of the result\n    \/\/ so for efficiency we crop these out as candidates before sorting\n    const float cutoff = (1.0f - topp) \/ (n - 1);\n    for (int i = 0; i &lt; n; i++) {\n        if (probabilities&#91;i] &gt;= cutoff) {\n            probindex&#91;n0].index = i;\n            probindex&#91;n0].prob = probabilities&#91;i];\n            n0++;\n        }\n    }\n    qsort(probindex, n0, sizeof(ProbIndex), compare);\n\n    \/\/ truncate the list where cumulative probability exceeds topp\n    float cumulative_prob = 0.0f;\n    int last_idx = n0 - 1; \/\/ in case of rounding errors consider all elements\n    for (int i = 0; i &lt; n0; i++) {\n        cumulative_prob += probindex&#91;i].prob;\n        if (cumulative_prob &gt; topp) {\n            last_idx = i;\n            break; \/\/ we've exceeded topp by including last_idx\n        }\n    }\n\n    \/\/ sample from the truncated list\n    float r = coin * cumulative_prob;\n    float cdf = 0.0f;\n    for (int i = 0; i &lt;= last_idx; i++) {\n        cdf += probindex&#91;i].prob;\n        if (r &lt; cdf) {\n            return probindex&#91;i].index;\n        }\n    }\n    return probindex&#91;last_idx].index; \/\/ in case of rounding errors\n}\n\nvoid build_sampler(Sampler* sampler, int vocab_size, float temperature, float topp, unsigned long long rng_seed) {\n    sampler-&gt;vocab_size = vocab_size;\n    sampler-&gt;temperature = temperature;\n    sampler-&gt;topp = topp;\n    sampler-&gt;rng_state = rng_seed;\n    \/\/ buffer only used with nucleus sampling; may not need but it's ~small\n    sampler-&gt;probindex = (ProbIndex *)malloc(sampler-&gt;vocab_size * sizeof(ProbIndex));\n}\n\nvoid free_sampler(Sampler* sampler) {\n    free(sampler-&gt;probindex);\n}\n\nunsigned int random_u32(unsigned long long *state) {\n    \/\/ xorshift rng: https:\/\/en.wikipedia.org\/wiki\/Xorshift#xorshift.2A\n    *state ^= *state &gt;&gt; 12;\n    *state ^= *state &lt;&lt; 25;\n    *state ^= *state &gt;&gt; 27;\n    return (*state * 0x2545F4914F6CDD1Dull) &gt;&gt; 32;\n}\nfloat random_f32(unsigned long long *state) { \/\/ random float32 in &#91;0,1)\n    return (random_u32(state) &gt;&gt; 8) \/ 16777216.0f;\n}\n\nint sample(Sampler* sampler, float* logits) {\n    \/\/ sample the token given the logits and some hyperparameters\n    int next;\n    if (sampler-&gt;temperature == 0.0f) {\n        \/\/ greedy argmax sampling: take the token with the highest probability\n        next = sample_argmax(logits, sampler-&gt;vocab_size);\n    } else {\n        \/\/ apply the temperature to the logits\n        for (int q=0; q&lt;sampler-&gt;vocab_size; q++) { logits&#91;q] \/= sampler-&gt;temperature; }\n        \/\/ apply softmax to the logits to get the probabilities for next token\n        softmax(logits, sampler-&gt;vocab_size);\n        \/\/ flip a (float) coin (this is our source of entropy for sampling)\n        float coin = random_f32(&amp;sampler-&gt;rng_state);\n        \/\/ we sample from this distribution to get the next token\n        if (sampler-&gt;topp &lt;= 0 || sampler-&gt;topp &gt;= 1) {\n            \/\/ simply sample from the predicted probability distribution\n            next = sample_mult(logits, 
sampler-&gt;vocab_size, coin);\n        } else {\n            \/\/ top-p (nucleus) sampling, clamping the least likely tokens to zero\n            next = sample_topp(logits, sampler-&gt;vocab_size, sampler-&gt;topp, sampler-&gt;probindex, coin);\n        }\n    }\n    return next;\n}\n\n\/\/ ----------------------------------------------------------------------------\n\/\/ utilities: time\n\nlong time_in_ms() {\n    \/\/ return time in milliseconds, for benchmarking the model speed\n    struct timespec time;\n    clock_gettime(CLOCK_REALTIME, &amp;time);\n    return time.tv_sec * 1000 + time.tv_nsec \/ 1000000;\n}\n\n\/\/ ----------------------------------------------------------------------------\n\/\/ generation loop\n\nvoid generate(Transformer *transformer, Tokenizer *tokenizer, Sampler *sampler, char *prompt, int steps) {\n    char *empty_prompt = \"\";\n    if (prompt == NULL) { prompt = empty_prompt; }\n\n    \/\/ encode the (string) prompt into tokens sequence\n    int num_prompt_tokens = 0;\n    int* prompt_tokens = (int*)malloc((strlen(prompt)+3) * sizeof(int)); \/\/ +3 for '\\0', ?BOS, ?EOS\n    encode(tokenizer, prompt, 1, 0, prompt_tokens, &amp;num_prompt_tokens);\n    if (num_prompt_tokens &lt; 1) {\n        fprintf(stderr, \"something is wrong, expected at least 1 prompt token\\n\");\n        exit(EXIT_FAILURE);\n    }\n\n    \/\/ start the main loop\n    long start = 0;  \/\/ used to time our code, only initialized after first iteration\n    int next;        \/\/ will store the next token in the sequence\n    int token = prompt_tokens&#91;0]; \/\/ kick off with the first token in the prompt\n    int pos = 0;     \/\/ position in the sequence\n    while (pos &lt; steps) {\n\n        \/\/ forward the transformer to get logits for the next token\n        float* logits = forward(transformer, token, pos);\n\n        \/\/ advance the state machine\n        if (pos &lt; num_prompt_tokens - 1) {\n            \/\/ if we are still processing the input prompt, force the next prompt token\n            next = prompt_tokens&#91;pos + 1];\n        } else {\n            \/\/ otherwise sample the next token from the logits\n            next = sample(sampler, logits);\n        }\n        pos++;\n\n        \/\/ data-dependent terminating condition: the BOS (=1) token delimits sequences\n        if (next == 1) { break; }\n\n        \/\/ print the token as string, decode it with the Tokenizer object\n        char* piece = decode(tokenizer, token, next);\n        safe_printf(piece); \/\/ same as printf(\"%s\", piece), but skips \"unsafe\" bytes\n        fflush(stdout);\n        token = next;\n\n        \/\/ init the timer here because the first iteration can be slower\n        if (start == 0) { start = time_in_ms(); }\n    }\n    printf(\"\\n\");\n\n    \/\/ report achieved tok\/s (pos-1 because the timer starts after first iteration)\n    if (pos &gt; 1) {\n        long end = time_in_ms();\n        fprintf(stderr, \"achieved tok\/s: %f\\n\", (pos-1) \/ (double)(end-start)*1000);\n    }\n\n    free(prompt_tokens);\n}\n\nvoid read_stdin(const char* guide, char* buffer, size_t bufsize) {\n    \/\/ read a line from stdin, up to but not including \\n\n    printf(\"%s\", guide);\n    if (fgets(buffer, bufsize, stdin) != NULL) {\n        size_t len = strlen(buffer);\n        if (len &gt; 0 &amp;&amp; buffer&#91;len - 1] == '\\n') {\n            buffer&#91;len - 1] = '\\0'; \/\/ strip newline\n        }\n    }\n}\n\n\/\/ ----------------------------------------------------------------------------\n\/\/ 
chat loop\n\/\/ I manually inspected the tokens for a few chat conversations compared to\n\/\/ python reference and that seemed ok, but this was not thoroughly tested and\n\/\/ is not safely implemented, it's more a proof of concept atm.\n\nvoid chat(Transformer *transformer, Tokenizer *tokenizer, Sampler *sampler,\n          char *cli_user_prompt, char *cli_system_prompt, int steps) {\n\n    \/\/ buffers for reading the system prompt and user prompt from stdin\n    \/\/ you'll notice they are somewhat haphazardly and unsafely set atm\n    char system_prompt&#91;512];\n    char user_prompt&#91;512];\n    char rendered_prompt&#91;1152];\n    int num_prompt_tokens = 0;\n    int* prompt_tokens = (int*)malloc(1152 * sizeof(int));\n    int user_idx;\n\n    \/\/ start the main loop\n    int8_t user_turn = 1; \/\/ user starts\n    int next;        \/\/ will store the next token in the sequence\n    int token;       \/\/ stores the current token to feed into the transformer\n    int prev_token;\n    int pos = 0;     \/\/ position in the sequence\n    while (pos &lt; steps) {\n\n        \/\/ when it is the user's turn to contribute tokens to the dialog...\n        if (user_turn) {\n            \/\/ get the (optional) system prompt at position 0\n            if (pos == 0) {\n                \/\/ at position 0, the user can also contribute a system prompt\n                if (cli_system_prompt == NULL) {\n                    \/\/ system prompt was not passed in, attempt to get it from stdin\n                    read_stdin(\"Enter system prompt (optional): \", system_prompt, sizeof(system_prompt));\n                } else {\n                    \/\/ system prompt was passed in, use it\n                    strcpy(system_prompt, cli_system_prompt);\n                }\n            }\n            \/\/ get the user prompt\n            if (pos == 0 &amp;&amp; cli_user_prompt != NULL) {\n                \/\/ user prompt for position 0 was passed in, use it\n                strcpy(user_prompt, cli_user_prompt);\n            } else {\n                \/\/ otherwise get user prompt from stdin\n                read_stdin(\"User: \", user_prompt, sizeof(user_prompt));\n            }\n            \/\/ render user\/system prompts into the Llama 2 Chat schema\n            if (pos == 0 &amp;&amp; system_prompt&#91;0] != '\\0') {\n                char system_template&#91;] = \"&#91;INST] &lt;&lt;SYS&gt;&gt;\\n%s\\n&lt;&lt;\/SYS&gt;&gt;\\n\\n%s &#91;\/INST]\";\n                sprintf(rendered_prompt, system_template, system_prompt, user_prompt);\n            } else {\n                char user_template&#91;] = \"&#91;INST] %s &#91;\/INST]\";\n                sprintf(rendered_prompt, user_template, user_prompt);\n            }\n            \/\/ encode the rendered prompt into tokens\n            encode(tokenizer, rendered_prompt, 1, 0, prompt_tokens, &amp;num_prompt_tokens);\n            user_idx = 0; \/\/ reset the user index\n            user_turn = 0;\n            printf(\"Assistant: \");\n        }\n\n        \/\/ determine the token to pass into the transformer next\n        if (user_idx &lt; num_prompt_tokens) {\n            \/\/ if we are still processing the input prompt, force the next prompt token\n            token = prompt_tokens&#91;user_idx++];\n        } else {\n            \/\/ otherwise use the next token sampled from previous turn\n            token = next;\n        }\n        \/\/ EOS (=2) token ends the Assistant turn\n        if (token == 2) { user_turn = 1; }\n\n        \/\/ forward the 
transformer to get logits for the next token\n        float* logits = forward(transformer, token, pos);\n        next = sample(sampler, logits);\n        pos++;\n\n        if (user_idx &gt;= num_prompt_tokens &amp;&amp; next != 2) {\n            \/\/ the Assistant is responding, so print its output\n            char* piece = decode(tokenizer, token, next);\n            safe_printf(piece); \/\/ same as printf(\"%s\", piece), but skips \"unsafe\" bytes\n            fflush(stdout);\n        }\n        if (next == 2) { printf(\"\\n\"); }\n    }\n    printf(\"\\n\");\n    free(prompt_tokens);\n}\n\n\n\/\/ ----------------------------------------------------------------------------\n\/\/ CLI, include only if not testing\n#ifndef TESTING\n\nvoid error_usage() {\n    fprintf(stderr, \"Usage:   run &lt;checkpoint&gt; &#91;options]\\n\");\n    fprintf(stderr, \"Example: run model.bin -n 256 -i \\\"Once upon a time\\\"\\n\");\n    fprintf(stderr, \"Options:\\n\");\n    fprintf(stderr, \"  -t &lt;float&gt;  temperature in &#91;0,inf], default 1.0\\n\");\n    fprintf(stderr, \"  -p &lt;float&gt;  p value in top-p (nucleus) sampling in &#91;0,1] default 0.9\\n\");\n    fprintf(stderr, \"  -s &lt;int&gt;    random seed, default time(NULL)\\n\");\n    fprintf(stderr, \"  -n &lt;int&gt;    number of steps to run for, default 256. 0 = max_seq_len\\n\");\n    fprintf(stderr, \"  -i &lt;string&gt; input prompt\\n\");\n    fprintf(stderr, \"  -z &lt;string&gt; optional path to custom tokenizer\\n\");\n    fprintf(stderr, \"  -m &lt;string&gt; mode: generate|chat, default: generate\\n\");\n    fprintf(stderr, \"  -y &lt;string&gt; (optional) system prompt in chat mode\\n\");\n    exit(EXIT_FAILURE);\n}\n\nint main(int argc, char *argv&#91;]) {\n\/\/REVIEW\n    createCublasHandle();\n\/\/REVIEW END\n    \/\/ default parameters\n    char *checkpoint_path = NULL;  \/\/ e.g. out\/model.bin\n    char *tokenizer_path = \"tokenizer.bin\";\n    float temperature = 1.0f;   \/\/ 0.0 = greedy deterministic. 1.0 = original. don't set higher\n    float topp = 0.9f;          \/\/ top-p in nucleus sampling. 1.0 = off. 
0.9 works well, but slower\n    int steps = 256;            \/\/ number of steps to run for\n    char *prompt = NULL;        \/\/ prompt string\n    unsigned long long rng_seed = 0; \/\/ seed rng with time by default\n    char *mode = \"generate\";    \/\/ generate|chat\n    char *system_prompt = NULL; \/\/ the (optional) system prompt to use in chat mode\n\n    \/\/ poor man's C argparse so we can override the defaults above from the command line\n    if (argc &gt;= 2) { checkpoint_path = argv&#91;1]; } else { error_usage(); }\n    for (int i = 2; i &lt; argc; i+=2) {\n        \/\/ do some basic validation\n        if (i + 1 &gt;= argc) { error_usage(); } \/\/ must have arg after flag\n        if (argv&#91;i]&#91;0] != '-') { error_usage(); } \/\/ must start with dash\n        if (strlen(argv&#91;i]) != 2) { error_usage(); } \/\/ must be -x (one dash, one letter)\n        \/\/ read in the args\n        if (argv&#91;i]&#91;1] == 't') { temperature = atof(argv&#91;i + 1]); }\n        else if (argv&#91;i]&#91;1] == 'p') { topp = atof(argv&#91;i + 1]); }\n        else if (argv&#91;i]&#91;1] == 's') { rng_seed = atoi(argv&#91;i + 1]); }\n        else if (argv&#91;i]&#91;1] == 'n') { steps = atoi(argv&#91;i + 1]); }\n        else if (argv&#91;i]&#91;1] == 'i') { prompt = argv&#91;i + 1]; }\n        else if (argv&#91;i]&#91;1] == 'z') { tokenizer_path = argv&#91;i + 1]; }\n        else if (argv&#91;i]&#91;1] == 'm') { mode = argv&#91;i + 1]; }\n        else if (argv&#91;i]&#91;1] == 'y') { system_prompt = argv&#91;i + 1]; }\n        else { error_usage(); }\n    }\n\n    \/\/ parameter validation\/overrides\n    if (rng_seed &lt;= 0) rng_seed = (unsigned int)time(NULL);\n    if (temperature &lt; 0.0) temperature = 0.0;\n    if (topp &lt; 0.0 || 1.0 &lt; topp) topp = 0.9;\n    if (steps &lt; 0) steps = 0;\n\n    \/\/ build the Transformer via the model .bin file\n    Transformer transformer;\n    build_transformer(&amp;transformer, checkpoint_path);\n    if (steps == 0 || steps &gt; transformer.config.seq_len) steps = transformer.config.seq_len; \/\/ override to ~max length\n\n    \/\/ build the Tokenizer via the tokenizer .bin file\n    Tokenizer tokenizer;\n    build_tokenizer(&amp;tokenizer, tokenizer_path, transformer.config.vocab_size);\n\n    \/\/ build the Sampler\n    Sampler sampler;\n    build_sampler(&amp;sampler, transformer.config.vocab_size, temperature, topp, rng_seed);\n\n    \/\/ run!\n    if (strcmp(mode, \"generate\") == 0) {\n        generate(&amp;transformer, &amp;tokenizer, &amp;sampler, prompt, steps);\n    } else if (strcmp(mode, \"chat\") == 0) {\n        chat(&amp;transformer, &amp;tokenizer, &amp;sampler, prompt, system_prompt, steps);\n    } else {\n        fprintf(stderr, \"unknown mode: %s\\n\", mode);\n        error_usage();\n    }\n\n    \/\/ memory and file handles cleanup\n    free_sampler(&amp;sampler);\n    free_tokenizer(&amp;tokenizer);\n    free_transformer(&amp;transformer);\n\/\/REVIEW\n    destroyCublasHandle();\n\/\/REVIEW END\n    return 0;\n}\n#endif<\/code><\/pre>\n<\/div><\/details>\n<\/div>\n\n\n\n<pre class=\"wp-block-code\"><code> nvcc -O3 -o crun run.cu -lm -lcuda -lcublas<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"zsh\">\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/crun stories15M.bin\nOnce upon a time, there was a dog named Spot. Spot loved to play and run around the yard. One day, Spot found a ball and wanted to play with it. 
He tried to pick it up with his mouth, but it was too heavy. Spot's owner saw him struggling and said, \"Be careful, Spot. Sometimes we might hurt ourselves.\" Spot was sad because he really wanted to play with the ball.\nSpot's owner had an idea. She went to the store and bought some new stuffed animals to play with. Spot was so excited that he forgot about the ball and played with his new friends for hours. When it was time for bed, Spot's owner said, \"I'm sorry, Spot. I'll buy you a new ball tomorrow.\" Spot was happy again and fell asleep dreaming of all the fun he would have with his new friends.\nachieved tok\/s: 2333.333334\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/crun stories15M.bin\nOnce upon a time, there was a grumpy little boy named Timmy. He never wanted to play with his friends or eat his lunch. One day, Timmy's mom asked him to help her with the laundry. Timmy didn't want to help because he wanted to play. But then his mom showed him a warm place to put his clothes in the washing machine.\nTimmy didn't want to do it, but his mom said it was important to take care of his clothes. So, Timmy put his clothes in the washing machine and turned it on. He was very excited to see his clothes fit again. When the washing machine was done, Timmy's mom said he could go play outside.\nBut then, Timmy realized that he had left his favorite toy behind. He went to look for it and found it under his bed. His mom said that the toy was in his shirt, but he had forgotten to put it back. Timmy felt sad and realized that he should have listened to his mom. He promised to never forget to put his toys in his shirt again.\nThe moral of the story is that it's important to listen to your parents\nachieved tok\/s: 2420.886075\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/crun stories15M.bin\nOnce upon a time, there was a unique bird named Bobo. Bobo had many colors on his body. He loved to fly and sing. One day, Bobo found a big nest on a tree. The nest was in a yard. Bobo was very happy.\nBobo met a small bug named Titi. Titi said, \"Hi, Bobo! Can you help me find my home?\" Bobo said, \"Yes, I will help you!\" They looked for Titi's home together. They found it under a bush.\nBobo and Titi went inside the nest. But the nest was dark. There was only one room. Bobo wanted to sleep, but Titi wanted to see more of the room. Titi was not happy. Bobo was sad. The end.\nachieved tok\/s: 2411.483253\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/crun stories15M.bin\nOne day, a little boy named Tim went to the store with his mom. He saw a big, impressive toy. It was a saw. Tim wanted the saw to build a toy house.\nHis mom said, \"Okay, Tim. I will get it for you.\" She gave Tim some money and he went to the store. Tim put the saw in his bag.\nAt home, Tim saw a long line of ants. He wanted to take the saw with him. He asked the store man, \"How do I get the ant in the line?\" The store man said, \"You can use the saw to pick up the ant.\"\nTim tried to pick up the ants. He made a line for his toy house. But, oh no! The toy house fell down. The saw was not a toy. It was a camera! The machine started to make sounds.\nTim was scared. He did not know what to do. Then, a big wind came. The wind blew the camera away. The machine stopped. Tim was safe.\nThe park man saw what happened. He said, \"You were right, the machine fell down. 
I will help you.\" Tim was happy.\nachieved tok\/s: 2361.111111\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/crun stories15M.bin\nOnce upon a time, there was a little boy named Tim. Tim loved to play outside with his friends. One day, Tim found a big, hard rock. He wanted to show it to his friends, but he was too short to reach it.\nTim's friends came over and asked him to play a game. They said, \"Tim, you are a player, so I will lift you up to the rock.\" Tim felt scared because the other kids were bigger and faster. He wanted to be the one to get the rock, but he was still too small.\nThen, Tim had an idea. He asked his mom to help him lift the rock. Together, they lifted the rock and brought it to the other side of the garden. All of Tim's friends were happy to see the rock. They said, \"Wow, Tim, you are a great player!\" Tim felt proud and not ashamed anymore. He learned that sometimes, you can't do everything you want, but you can still find a way to make it.\nachieved tok\/s: 2505.747126\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/crun stories42M.bin\nOnce upon a time, there was a boy named Timmy. Timmy loved to play outside and run around. One day, Timmy's mom said, \"Timmy, it's time for your haircut.\" Timmy didn't want to cut his hair because he was scared.\nBut his mom said, \"Don't worry, Timmy. You can always start at the same time and come back after.\" Timmy finally agreed and went to the haircut place. The lady cutting his hair was very polite and made him feel comfortable.\nAfter the haircut, Timmy looked in the mirror and saw how handsome he looked. He felt happy and proud of himself. Timmy learned that sometimes, it's scary to start somewhere, but if you start at your favorite favorite spot and are polite to others, good things can happen.\nachieved tok\/s: 2317.557251\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/crun stories42M.bin\nOnce upon a time, there was a little girl named Lily. She loved to play with her toys and watch cartoons on TV. One day, Lily was playing with her dolls when she heard a knock on the door. It was her friend, Sarah.\n\"Hi Lily, do you want to come play outside?\" Sarah asked.\n\"No, I want to stay inside and play with my dolls,\" Lily replied.\nSarah looked sad and said, \"I promise we can still play together. Maybe we can make a tower with these blocks.\"\nLily thought for a moment and said, \"That's a good idea, but I'm not very good at building towers. It might be hard.\"\nSarah smiled and said, \"That's okay. We can do it together. We'll make a tower that's really good. I promise it won't hurt at all.\"\nLily smiled back and they went outside to play. They built a tower that was much better than the one they had before. Lily was happy she had a friend to play with and Sarah was happy they were both playing together. They promised to play together again soon.\nachieved tok\/s: 2350.558660\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/crun stories42M.bin\nOnce upon a time, there was a little girl named Lily. She loved going to the mall with her mommy. One day, while they were walking around, Lily saw a big, hairy dog. She pointed at it and said, \"Look, mommy! A dog!\"\nSuddenly, the dog started barking and running towards them. Lily got scared and hid behind her mommy. But then, the dog stopped barking and licked Lily's face! 
She was surprised but happy that the dog was friendly.\nFrom that day on, Lily and her mommy went to the mall every weekend to see the hairy dog. They would point and say hello to him and even give him a hug. And the dog became their best friend. The end.\nachieved tok\/s: 2398.739495\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/crun stories42M.bin\nOnce upon a time, there was a little girl named Lily. She had a soft, blue mattress in her room that she loved to jump on. One day, Lily's mom told her to be careful and not jump on the mattress anymore because it was old and could break.\nLily didn't listen and kept jumping on the mattress. Suddenly, the mattress broke and Lily fell down. She started to cry and said, \"Mommy, my mattress is broken!\"\nHer mom came running and said, \"I told you not to jump on it. Now we have to buy a new one.\"\nLily was sad because she loved her old mattress. She learned that sometimes we need to listen to our parents and not do things that are not safe.\nachieved tok\/s: 2250.574712\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/crun stories42M.bin\nOnce upon a time, there was a little girl named Lily. She loved to watch TV and play with her toys. One day, she accidentally knocked over a candle and it started to burn the curtains. She quickly blew it out and put out the fire.\nLily was scared because she knew she shouldn't have been playing with fire. Her mommy came in and saw what happened. She hugged Lily and told her it was okay, accidents happen.\nAfter that, Lily was more careful and didn't play with fire again. She learned that fire is dangerous and should be used with care. Her mommy was proud of her for being responsible and not being ignorant of the consequences of her actions. From that day on, Lily only watched TV when her mommy was with her and never played with fire again.\nachieved tok\/s: 2384.765625\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/crun stories110M.bin\nOnce upon a time, there was a little girl named Lily. She loved to play with her dolls and teddy bears. One day, Lily's mommy gave her a toy phone to play with. Lily was so happy and she played with it all day long.\nSuddenly, Lily's teddy bear fell on the ground and got a rough spot on its head. Lily was sad and didn't know what to do. She picked up her toy phone and pretended to call her friend. \"Hello, friend! Can you come and help me fix my teddy bear?\" she said.\nLily's mommy heard her talking and came to see what was wrong. \"What happened, Lily?\" she asked. \"My teddy bear has a rough spot on its head. Can you help me fix it?\" Lily said. Her mommy smiled and said, \"Of course, I can help you. Let's go find a cloth to wipe the rough spot.\" And they went to find a cloth together.\nachieved tok\/s: 1368.440367\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/crun stories110M.bin\nOnce upon a time, there was a kind rabbit named Benny. He loved to play with his friends in the forest. One day, Benny found a big truck with a license plate that said \"Benny\". Benny didn't know what a license was, but he knew it was important.\nBenny decided to take the truck to the nearby town to show it to his friends. But when he got there, he saw that the truck was actually a delivery truck for a massage company. 
The delivery man told Benny that he couldn't show the truck to his friends because it was not polite to be nosy.\nBenny felt bad for being nosy and decided to deliver the license plate back to the factory. He hopped all the way there and gave it back to the delivery man. The delivery man was very happy and thanked Benny for being honest.\nFrom that day on, Benny learned that it's important to be respectful and not nosy. He also learned that doing the right thing is always the best choice.\nachieved tok\/s: 1386.245352\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/crun stories110M.bin\nOne day, a boy named Tim found a broken toy in his room. It was a small car with a button that said \"reverse\". Tim did not know what it meant, but he wanted to play with it. He pushed the button, and the car started to move backwards.\nTim's mom saw him playing with the broken toy. She said, \"Tim, let's find the label on the car. It shows what it does.\" They looked around and found a label with the car's name on it. The label had a picture of a blue car.\nTim and his mom went to the park where they found the blue car. Tim pushed the button on the broken car again. This time, the car went forward. Tim's mom said, \"Now we know what the label says. The car can reverse!\" Tim smiled and played with his car all day at the park.\nachieved tok\/s: 1438.461539\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/crun stories110M.bin\nOne day, a little boy named Tim went for a walk. He was an obedient boy who always did what his mom and dad told him to do. While walking, he found a big book on the ground. It was a dictionary! Tim was very happy and picked it up.\nTim carried the dictionary home to show his mom and dad. They were proud of him for being so obedient and finding the dictionary. They told him that the dictionary could help him learn new words.\nEvery day, Tim would carry the dictionary with him when he went for a walk. He would read it and learn new words. He became a very smart and obedient boy, just like the dictionary helped him.\nachieved tok\/s: 1475.528700\n\n\u250c\u2500\u2500(amamitsu\u327famamitsu)-&#91;~\/Applications\/Lab6]\n\u2514\u2500$ .\/crun stories110M.bin\nOne day, a boy named Tim went to a new place with his mom. This place was big and had many different things to see. Tim was very happy to be there. He saw a big tree with a cat on it. The cat was scared and did not know how to get down.\nTim said to his mom, \"Mom, can we help the cat?\" His mom said, \"Yes, we can try.\" They found a long stick and tried to help the cat down. But the stick was too short. The cat was still scared and did not know what to do.\nThen, a big bird came and landed on the tree. The bird looked at the cat and started to talk! The bird said, \"I can help the cat.\" The bird flew down and picked up the cat with its beak. The cat was not scared anymore. Tim and his mom were very surprised. 
They thanked the bird and went home happy.\nachieved tok\/s: 1331.345162<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th class=\"has-text-align-center\" data-align=\"center\">On an NVIDIA H100 NVL<\/th><th class=\"has-text-align-center\" data-align=\"center\">stories15M<\/th><th class=\"has-text-align-center\" data-align=\"center\">stories42M<\/th><th class=\"has-text-align-center\" data-align=\"center\">stories110M<\/th><\/tr><\/thead><tbody><tr><td class=\"has-text-align-center\" data-align=\"center\">Output rate (tokens\/s)<br>CUBLAS<\/td><td class=\"has-text-align-center\" data-align=\"center\">2406.5121798<\/td><td class=\"has-text-align-center\" data-align=\"center\">2340.4391486<\/td><td class=\"has-text-align-center\" data-align=\"center\">1400.004224<\/td><\/tr><\/tbody><\/table><\/figure>\n","protected":false}}