GTPin: Simdprof Sample Tool

The Simdprof tool counts the effective number of SIMD operations executed by the kernel.

Running the Simdprof tool

To run the Simdprof tool in its default configuration, use the following command:

Profilers/Bin/gtpin -t simdprof -- app

How to count active channels in GEN architecture?

A number of factors affect the number of operations executed by a single instruction. Some of these factors can be evaluated statically, by analyzing instruction attributes and operands. Other factors depend on the runtime architectural state, so analyzing them requires dynamic calculations during the execution of the application and/or kernel.

The following figure shows a typical GEN instruction:


[Figure: gen_instruction.jpg (a typical GEN instruction)]


The following table describes input parameters of the SIMD counting algorithm, as well as methods used for their collection.


[Figure: simdprof_channel_elements.jpg (input parameters of the SIMD counting algorithm and their collection methods)]


Once all input parameters are collected, the tool can compute the SIMD operation count for a particular instruction. The computation is done by the COUNT_SIMD_OPS procedure, as shown in the following pseudocode:


[Figure: simd_pseudocode.jpg (pseudocode of the COUNT_SIMD_OPS procedure)]


How to understand Simdprof results

The Simdprof tool counts the dynamic number of operations performed by the execution units (EUs) of the graphics device, which is essentially the number of active SIMD channels. When you run the GTPin Simdprof tool in its default configuration, it creates the directory GTPIN_PROFILE_SIMDPROF0 and stores the profiling results in the file GTPIN_PROFILE_SIMDPROF0\Session_Final\simdprof.out. The simdprof.out file has the following format:

 Channels (SIMD operations) executed by kernels/BBLs
====================================================

----------------------------------------------------------------------------------------------------
BitonicSort___CS_asm6b96b239a92a0daa_simd32
   BBL Head Ins ID Tail Ins ID       Channels
     0           0          13          55136
     1          14          16          16272
     2          17          94         280080
     3          95          95              0
     4          96          99          13824
     5         100         100              0
     6         101         104           9216
     7         105         105              0
     8         106         106            360
     9         107         129          32400
    10         130         130              0
    11         131         148          23040
    12         149         149              0
    13         150         167          18432
    14         168         168              0
    15         169         169            504
    16         170         182           2560
    17         183         191           2064
    18         192         199              0
    19         200         207           1024
    20         208         208              0
    21         209         226           2304
    22         227         227              0
    23         228         245           2304
    24         246         246              0
    25         247         247           4160
    26         248         248           4160
 Total         467840

Total number of kernels:                    1
Total number of channels (SIMD operations): 40337408000

For each kernel/shader, the data is presented per basic block (BBL). Each row gives the BBL ID, the ID of the head (first) instruction of the BBL, the ID of the tail (last) instruction of the BBL, and the dynamic count of all active channels within the BBL.

To see which instructions a particular BBL contains, look at the assembly dump of the corresponding kernel, which is saved in the folder GTPIN_PROFILE_SIMDPROF0\ASM. For example:

// kernel name: BitonicSort

// BBL0

[  0] (W)      mov (8|M0)               r100.0<1>:ud  r0.0<1;1,0>:ud                  
[  1] (W)      or (1|M0)                cr0.0<1>:ud   cr0.0<0;1,0>:ud   0x4C0:uw         {Switch}
[  2] (W)      mul (1|M0)               r8.0<1>:d     r9.0<0;1,0>:d     r100.1<0;1,0>:d  {Compacted}
[  3] (W)      cmp (16|M0)   (eq)f1.0   null<1>:d     r8.2<0;1,0>:d     0:w             
[  4] (W)      cmp (16|M16)  (eq)f1.0   null<1>:d     r8.2<0;1,0>:d     0:w             
[  5]          add (8|M0)               r3.0<1>:q     r1.0<8;8,1>:uw    r8.0<0;1,0>:ud  
[  6]          add (8|M8)               r5.0<1>:q     r1.8<8;8,1>:uw    r8.0<0;1,0>:ud  
[  7]          add (8|M16)              r11.0<1>:q    r2.0<8;8,1>:uw    r8.0<0;1,0>:ud  
[  8]          add (8|M24)              r9.0<1>:q     r2.8<8;8,1>:uw    r8.0<0;1,0>:ud  
[  9]          add (8|M0)               r60.0<1>:q    r3.0<4;4,1>:q     r7.0<0;1,0>:ud  
[ 10]          add (8|M8)               r58.0<1>:q    r5.0<4;4,1>:q     r7.0<0;1,0>:ud  
[ 11]          add (8|M16)              r4.0<1>:q     r11.0<4;4,1>:q    r7.0<0;1,0>:ud  
[ 12]          add (8|M24)              r2.0<1>:q     r9.0<4;4,1>:q     r7.0<0;1,0>:ud  
[ 13] (W&f1.0) jmpi                                 2296                            

// BBL1

[ 14] (W)      cmp (16|M0)   (eq)f0.0   null<1>:d     r8.3<0;1,0>:d     0:w             
[ 15] (W)      cmp (16|M16)  (eq)f0.0   null<1>:d     r8.3<0;1,0>:d     0:w             
[ 16] (W&f0.0) jmpi                                 1376                            

// BBL2

[ 17] (W)      add (1|M0)               r8.0<1>:d     r8.3<0;1,0>:d     31:w            
[ 18] (W)      mov (1|M0)               r6.0<1>:w     1:w                             
[ 19] (W)      add (1|M0)               r8.7<1>:d     r8.3<0;1,0>:d     63:w            
[ 20] (W)      and (1|M0)               r8.6<1>:d     r8.3<0;1,0>:d     63:w            
[ 21] (W)      and (1|M0)               r8.1<1>:d     r8.0<0;1,0>:d     31:w            
[ 22] (W)      and (1|M0)               r8.0<1>:d     r8.7<0;1,0>:d     63:w            
[ 23] (W)      shl (1|M0)               r8.3<1>:d     r6.0<0;1,0>:w     r8.1<0;1,0>:d   
[ 24]          shr (8|M0)               r6.0<1>:q     r60.0<4;4,1>:uq   r8.0<0;1,0>:ud  
[ 25]          shr (8|M8)               r9.0<1>:q     r58.0<4;4,1>:uq   r8.0<0;1,0>:ud  
[ 26]          shr (8|M16)              r11.0<1>:q    r4.0<4;4,1>:uq    r8.0<0;1,0>:ud  
[ 27]          shr (8|M24)              r13.0<1>:q    r2.0<4;4,1>:uq    r8.0<0;1,0>:ud  
[ 28] (W)      add (1|M0)               r8.0<1>:q     r8.3<0;1,0>:d     -1:w            
[ 29]          shl (8|M0)               r25.0<1>:q    r6.0<4;4,1>:q     r8.6<0;1,0>:ud  
[ 30]          shl (8|M8)               r23.0<1>:q    r9.0<4;4,1>:q     r8.6<0;1,0>:ud  
[ 31]          shl (8|M16)              r21.0<1>:q    r11.0<4;4,1>:q    r8.6<0;1,0>:ud  
[ 32]          shl (8|M24)              r19.0<1>:q    r13.0<4;4,1>:q    r8.6<0;1,0>:ud  
[ 33]          and (8|M0)               r6.0<1>:q     r60.0<4;4,1>:q    r8.0<0;1,0>:q   
[ 34]          and (8|M8)               r9.0<1>:q     r58.0<4;4,1>:q    r8.0<0;1,0>:q   
[ 35]          and (8|M16)              r15.0<1>:q    r4.0<4;4,1>:q     r8.0<0;1,0>:q   
[ 36]          and (8|M24)              r17.0<1>:q    r2.0<4;4,1>:q     r8.0<0;1,0>:q   
[ 37]          add (8|M0)               r13.0<1>:q    r25.0<4;4,1>:q    r6.0<4;4,1>:q   
[ 38]          add (8|M8)               r11.0<1>:q    r23.0<4;4,1>:q    r9.0<4;4,1>:q   
[ 39]          add (8|M16)              r9.0<1>:q     r21.0<4;4,1>:q    r15.0<4;4,1>:q  
[ 40]          add (8|M24)              r6.0<1>:q     r19.0<4;4,1>:q    r17.0<4;4,1>:q  
[ 41] (W)      add (1|M0)               r8.0<1>:d     r8.2<0;1,0>:d     63:w            
[ 42]          add (8|M0)               r15.0<2>:d    r13.0<4;4,1>:q    r8.3<0;1,0>:d   
[ 43]          add (8|M8)               r19.0<2>:d    r11.0<4;4,1>:q    r8.3<0;1,0>:d   
[ 44]          add (8|M16)              r17.0<2>:d    r9.0<4;4,1>:q     r8.3<0;1,0>:d   
[ 45]          add (8|M24)              r21.0<2>:d    r6.0<4;4,1>:q     r8.3<0;1,0>:d   
[ 46]          mov (8|M0)               r62.0<1>:d    r13.0<2;1,0>:d                  
[ 47]          mov (8|M8)               r63.0<1>:d    r11.0<2;1,0>:d                  
[ 48]          mov (8|M16)              r64.0<1>:d    r9.0<2;1,0>:d                   
[ 49]          mov (8|M24)              r65.0<1>:d    r6.0<2;1,0>:d                   
[ 50]          mov (8|M0)               r6.0<1>:d     r15.0<2;1,0>:d                  
[ 51]          mov (8|M8)               r7.0<1>:d     r19.0<2;1,0>:d                  
[ 52]          mov (8|M16)              r66.0<1>:d    r17.0<2;1,0>:d                  
[ 53]          mov (8|M24)              r67.0<1>:d    r21.0<2;1,0>:d                  
[ 54]          shl (16|M0)              r62.0<1>:d    r62.0<8;8,1>:d    4:w             
[ 55]          shl (16|M16)             r64.0<1>:d    r64.0<8;8,1>:d    4:w             
[ 56]          shl (16|M0)              r6.0<1>:d     r6.0<8;8,1>:d     4:w             
[ 57]          shl (16|M16)             r66.0<1>:d    r66.0<8;8,1>:d    4:w             
[ 58]          add (16|M0)              r62.0<1>:d    r62.0<8;8,1>:d    r8.5<0;1,0>:d    {Compacted}
[ 59]          add (16|M16)             r64.0<1>:d    r64.0<8;8,1>:d    r8.5<0;1,0>:d   
[ 60]          add (16|M0)              r6.0<1>:d     r6.0<8;8,1>:d     r8.5<0;1,0>:d    {Compacted}
[ 61]          add (16|M16)             r66.0<1>:d    r66.0<8;8,1>:d    r8.5<0;1,0>:d   
[ 62]          send (16|M0)             r18:w    r62     0xC         0x4805000
[ 63]          send (16|M16)            r50:w    r64     0xC         0x4805000
[ 64]          send (16|M0)             r10:w    r6      0xC         0x4805000
[ 65]          send (16|M16)            r42:w    r66     0xC         0x4805000
[ 66] (W)      and (1|M0)               r8.0<1>:d     r8.0<0;1,0>:d     63:w            
[ 67]          shr (8|M0)               r26.0<1>:q    r60.0<4;4,1>:uq   r8.0<0;1,0>:ud  
[ 68]          shr (8|M8)               r28.0<1>:q    r58.0<4;4,1>:uq   r8.0<0;1,0>:ud  
[ 69]          shr (8|M16)              r34.0<1>:q    r4.0<4;4,1>:uq    r8.0<0;1,0>:ud  
[ 70]          shr (8|M24)              r36.0<1>:q    r2.0<4;4,1>:uq    r8.0<0;1,0>:ud  
[ 71]          and (8|M0)               r32.0<1>:q    r26.0<4;4,1>:q    1:w             
[ 72]          and (8|M8)               r30.0<1>:q    r28.0<4;4,1>:q    1:w             
[ 73]          and (8|M16)              r28.0<1>:q    r34.0<4;4,1>:q    1:w             
[ 74]          and (8|M24)              r26.0<1>:q    r36.0<4;4,1>:q    1:w             
[ 75]          cmp (8|M0)    (eq)f0.0   null<1>:q     r32.0<4;4,1>:q    r8.4<0;1,0>:ud  
[ 76]          cmp (8|M8)    (eq)f0.0   null<1>:q     r30.0<4;4,1>:q    r8.4<0;1,0>:ud  
[ 77]          cmp (8|M16)   (eq)f0.0   null<1>:q     r28.0<4;4,1>:q    r8.4<0;1,0>:ud  
[ 78]          cmp (8|M24)   (eq)f0.0   null<1>:q     r26.0<4;4,1>:q    r8.4<0;1,0>:ud  
[ 79]          sel (16|M0)   (lt)f0.0   r34.0<1>:d    r18.0<8;8,1>:d    r10.0<8;8,1>:d   {Compacted}
[ 80]          sel (16|M0)   (lt)f0.0   r36.0<1>:d    r20.0<8;8,1>:d    r12.0<8;8,1>:d   {Compacted}
[ 81]          sel (16|M0)   (lt)f0.0   r38.0<1>:d    r22.0<8;8,1>:d    r14.0<8;8,1>:d   {Compacted}
[ 82]          sel (16|M0)   (lt)f0.0   r40.0<1>:d    r24.0<8;8,1>:d    r16.0<8;8,1>:d   {Compacted}
[ 83]          sel (16|M0)   (ge)f0.0   r26.0<1>:d    r10.0<8;8,1>:d    r18.0<8;8,1>:d   {Compacted}
[ 84]          sel (16|M0)   (ge)f0.0   r28.0<1>:d    r12.0<8;8,1>:d    r20.0<8;8,1>:d   {Compacted}
[ 85]          sel (16|M0)   (ge)f0.0   r30.0<1>:d    r14.0<8;8,1>:d    r22.0<8;8,1>:d   {Compacted}
[ 86]          sel (16|M0)   (ge)f0.0   r32.0<1>:d    r16.0<8;8,1>:d    r24.0<8;8,1>:d   {Compacted}
[ 87]          sel (16|M16)  (lt)f0.0   r17.0<1>:d    r50.0<8;8,1>:d    r42.0<8;8,1>:d  
[ 88]          sel (16|M16)  (lt)f0.0   r19.0<1>:d    r52.0<8;8,1>:d    r44.0<8;8,1>:d  
[ 89]          sel (16|M16)  (lt)f0.0   r21.0<1>:d    r54.0<8;8,1>:d    r46.0<8;8,1>:d  
[ 90]          sel (16|M16)  (lt)f0.0   r23.0<1>:d    r56.0<8;8,1>:d    r48.0<8;8,1>:d  
[ 91]          sel (16|M16)  (ge)f0.0   r9.0<1>:d     r42.0<8;8,1>:d    r50.0<8;8,1>:d  
[ 92]          sel (16|M16)  (ge)f0.0   r11.0<1>:d    r44.0<8;8,1>:d    r52.0<8;8,1>:d  
[ 93]          sel (16|M16)  (ge)f0.0   r13.0<1>:d    r46.0<8;8,1>:d    r54.0<8;8,1>:d  
[ 94]          sel (16|M16)  (ge)f0.0   r15.0<1>:d    r48.0<8;8,1>:d    r56.0<8;8,1>:d  

// BBL3

[ 95] (~f0.0)  if (32|M0)                           96                160             

// BBL4

[ 96]          sends (16|M0)            null:w   r62     r34     0x20C       0x4025000
[ 97]          sends (16|M16)           null:w   r64     r17     0x20C       0x4025000
[ 98]          sends (16|M0)            null:w   r6      r26     0x20C       0x4025000
[ 99]          sends (16|M16)           null:w   r66     r9      0x20C       0x4025000

// BBL5

[100]          else (32|M0)                         80                80              

// BBL6

[101]          sends (16|M0)            null:w   r6      r34     0x20C       0x4025000
[102]          sends (16|M16)           null:w   r66     r17     0x20C       0x4025000
[103]          sends (16|M0)            null:w   r62     r26     0x20C       0x4025000
[104]          sends (16|M16)           null:w   r64     r9      0x20C       0x4025000

// BBL7

[105]          endif (32|M0)                        16                              

// BBL8

[106] (W)      jmpi                                 872                             

// BBL9

[107]          mov (8|M0)               r6.0<1>:d     r60.0<2;1,0>:d                  
[108]          mov (8|M8)               r7.0<1>:d     r58.0<2;1,0>:d                  
[109]          mov (8|M16)              r42.0<1>:d    r4.0<2;1,0>:d                   
[110]          mov (8|M24)              r43.0<1>:d    r2.0<2;1,0>:d                   
[111] (W)      and (1|M0)               r8.0<1>:d     r8.2<0;1,0>:d     63:w            
[112]          shl (16|M0)              r6.0<1>:d     r6.0<8;8,1>:d     4:w             
[113]          shl (16|M16)             r42.0<1>:d    r42.0<8;8,1>:d    4:w             
[114]          shr (8|M0)               r9.0<1>:q     r60.0<4;4,1>:uq   r8.0<0;1,0>:ud  
[115]          shr (8|M8)               r11.0<1>:q    r58.0<4;4,1>:uq   r8.0<0;1,0>:ud  
[116]          shr (8|M16)              r18.0<1>:q    r4.0<4;4,1>:uq    r8.0<0;1,0>:ud  
[117]          add (16|M0)              r6.0<1>:d     r6.0<8;8,1>:d     r8.5<0;1,0>:d    {Compacted}
[118]          add (16|M16)             r42.0<1>:d    r42.0<8;8,1>:d    r8.5<0;1,0>:d   
[119]          shr (8|M24)              r26.0<1>:q    r2.0<4;4,1>:uq    r8.0<0;1,0>:ud  
[120]          and (8|M0)               r24.0<1>:q    r9.0<4;4,1>:q     1:w             
[121]          and (8|M8)               r22.0<1>:q    r11.0<4;4,1>:q    1:w             
[122]          send (16|M0)             r10:w    r6      0xC         0x4805000
[123]          send (16|M16)            r34:w    r42     0xC         0x4805000
[124]          and (8|M16)              r20.0<1>:q    r18.0<4;4,1>:q    1:w             
[125]          and (8|M24)              r18.0<1>:q    r26.0<4;4,1>:q    1:w             
[126]          cmp (8|M0)    (eq)f1.0   null<1>:q     r24.0<4;4,1>:q    r8.4<0;1,0>:ud  
[127]          cmp (8|M8)    (eq)f1.0   null<1>:q     r22.0<4;4,1>:q    r8.4<0;1,0>:ud  
[128]          cmp (8|M16)   (eq)f1.0   null<1>:q     r20.0<4;4,1>:q    r8.4<0;1,0>:ud  
[129]          cmp (8|M24)   (eq)f1.0   null<1>:q     r18.0<4;4,1>:q    r8.4<0;1,0>:ud  

// BBL10

[130] (~f1.0)  if (32|M0)                           256               480             

// BBL11

[131]          sel (16|M0)   (lt)f0.0   r30.0<1>:d    r10.0<8;8,1>:d    r14.0<8;8,1>:d   {Compacted}
[132]          sel (16|M0)   (lt)f0.0   r28.0<1>:d    r12.0<8;8,1>:d    r16.0<8;8,1>:d   {Compacted}
[133]          sel (16|M0)   (ge)f0.0   r18.0<1>:d    r14.0<8;8,1>:d    r10.0<8;8,1>:d   {Compacted}
[134]          sel (16|M0)   (ge)f0.0   r10.0<1>:d    r16.0<8;8,1>:d    r12.0<8;8,1>:d   {Compacted}
[135]          sel (16|M16)  (lt)f0.0   r22.0<1>:d    r34.0<8;8,1>:d    r38.0<8;8,1>:d  
[136]          sel (16|M16)  (lt)f0.0   r20.0<1>:d    r36.0<8;8,1>:d    r40.0<8;8,1>:d  
[137]          sel (16|M16)  (ge)f0.0   r44.0<1>:d    r38.0<8;8,1>:d    r34.0<8;8,1>:d  
[138]          sel (16|M16)  (ge)f0.0   r24.0<1>:d    r40.0<8;8,1>:d    r36.0<8;8,1>:d  
[139]          sel (16|M0)   (lt)f0.0   r26.0<1>:d    r30.0<8;8,1>:d    r28.0<8;8,1>:d   {Compacted}
[140]          sel (16|M0)   (ge)f0.0   r28.0<1>:d    r30.0<8;8,1>:d    r28.0<8;8,1>:d   {Compacted}
[141]          sel (16|M0)   (lt)f0.0   r30.0<1>:d    r18.0<8;8,1>:d    r10.0<8;8,1>:d   {Compacted}
[142]          sel (16|M0)   (ge)f0.0   r32.0<1>:d    r18.0<8;8,1>:d    r10.0<8;8,1>:d   {Compacted}
[143]          sel (16|M16)  (lt)f0.0   r18.0<1>:d    r22.0<8;8,1>:d    r20.0<8;8,1>:d  
[144]          sel (16|M16)  (ge)f0.0   r20.0<1>:d    r22.0<8;8,1>:d    r20.0<8;8,1>:d  
[145]          sel (16|M16)  (lt)f0.0   r22.0<1>:d    r44.0<8;8,1>:d    r24.0<8;8,1>:d  
[146]          sel (16|M16)  (ge)f0.0   r24.0<1>:d    r44.0<8;8,1>:d    r24.0<8;8,1>:d  
[147]          sends (16|M0)            null:w   r6      r26     0x20C       0x4025000
[148]          sends (16|M16)           null:w   r42     r18     0x20C       0x4025000

// BBL12

[149]          else (32|M0)                         240               240             

// BBL13

[150]          sel (16|M0)   (ge)f0.0   r22.0<1>:d    r14.0<8;8,1>:d    r10.0<8;8,1>:d   {Compacted}
[151]          sel (16|M0)   (ge)f0.0   r20.0<1>:d    r16.0<8;8,1>:d    r12.0<8;8,1>:d   {Compacted}
[152]          sel (16|M0)   (lt)f0.0   r30.0<1>:d    r10.0<8;8,1>:d    r14.0<8;8,1>:d   {Compacted}
[153]          sel (16|M0)   (lt)f0.0   r28.0<1>:d    r12.0<8;8,1>:d    r16.0<8;8,1>:d   {Compacted}
[154]          sel (16|M16)  (ge)f0.0   r13.0<1>:d    r38.0<8;8,1>:d    r34.0<8;8,1>:d  
[155]          sel (16|M16)  (ge)f0.0   r11.0<1>:d    r40.0<8;8,1>:d    r36.0<8;8,1>:d  
[156]          sel (16|M16)  (lt)f0.0   r25.0<1>:d    r34.0<8;8,1>:d    r38.0<8;8,1>:d  
[157]          sel (16|M16)  (lt)f0.0   r15.0<1>:d    r36.0<8;8,1>:d    r40.0<8;8,1>:d  
[158]          sel (16|M0)   (ge)f0.0   r17.0<1>:d    r20.0<8;8,1>:d    r22.0<8;8,1>:d   {Compacted}
[159]          sel (16|M0)   (lt)f0.0   r19.0<1>:d    r20.0<8;8,1>:d    r22.0<8;8,1>:d   {Compacted}
[160]          sel (16|M0)   (ge)f0.0   r21.0<1>:d    r28.0<8;8,1>:d    r30.0<8;8,1>:d   {Compacted}
[161]          sel (16|M0)   (lt)f0.0   r23.0<1>:d    r28.0<8;8,1>:d    r30.0<8;8,1>:d   {Compacted}
[162]          sel (16|M16)  (ge)f0.0   r9.0<1>:d     r11.0<8;8,1>:d    r13.0<8;8,1>:d  
[163]          sel (16|M16)  (lt)f0.0   r11.0<1>:d    r11.0<8;8,1>:d    r13.0<8;8,1>:d  
[164]          sel (16|M16)  (ge)f0.0   r13.0<1>:d    r15.0<8;8,1>:d    r25.0<8;8,1>:d  
[165]          sel (16|M16)  (lt)f0.0   r15.0<1>:d    r15.0<8;8,1>:d    r25.0<8;8,1>:d  
[166]          sends (16|M0)            null:w   r6      r17     0x20C       0x4025000
[167]          sends (16|M16)           null:w   r42     r9      0x20C       0x4025000

// BBL14

[168]          endif (32|M0)                        16                              

// BBL15

[169] (W)      jmpi                                 1048                            

// BBL16

[170]          mov (8|M0)               r6.0<1>:d     r60.0<2;1,0>:d                  
[171]          mov (8|M8)               r7.0<1>:d     r58.0<2;1,0>:d                  
[172]          mov (8|M16)              r26.0<1>:d    r4.0<2;1,0>:d                   
[173]          mov (8|M24)              r27.0<1>:d    r2.0<2;1,0>:d                   
[174] (W)      cmp (16|M0)   (eq)f1.0   null<1>:d     r8.4<0;1,0>:d     0:w             
[175] (W)      cmp (16|M16)  (eq)f1.0   null<1>:d     r8.4<0;1,0>:d     0:w             
[176]          shl (16|M0)              r6.0<1>:d     r6.0<8;8,1>:d     4:w             
[177]          shl (16|M16)             r26.0<1>:d    r26.0<8;8,1>:d    4:w             
[178]          add (16|M0)              r6.0<1>:d     r6.0<8;8,1>:d     r8.5<0;1,0>:d    {Compacted}
[179]          add (16|M16)             r26.0<1>:d    r26.0<8;8,1>:d    r8.5<0;1,0>:d   
[180]          send (16|M0)             r10:w    r6      0xC         0x4805000
[181]          send (16|M16)            r18:w    r26     0xC         0x4805000
[182] (W&f1.0) jmpi                                 128                             

// BBL17

[183]          sel (16|M0)   (lt)f0.0   r32.0<1>:d    r10.0<8;8,1>:d    r12.0<8;8,1>:d   {Compacted}
[184]          sel (16|M16)  (lt)f0.0   r42.0<1>:d    r18.0<8;8,1>:d    r20.0<8;8,1>:d  
[185]          sel (16|M0)   (ge)f0.0   r30.0<1>:d    r12.0<8;8,1>:d    r10.0<8;8,1>:d   {Compacted}
[186]          sel (16|M16)  (ge)f0.0   r38.0<1>:d    r20.0<8;8,1>:d    r18.0<8;8,1>:d  
[187]          sel (16|M0)   (ge)f0.0   r28.0<1>:d    r14.0<8;8,1>:d    r16.0<8;8,1>:d   {Compacted}
[188]          sel (16|M16)  (ge)f0.0   r40.0<1>:d    r22.0<8;8,1>:d    r24.0<8;8,1>:d  
[189]          sel (16|M0)   (lt)f0.0   r34.0<1>:d    r16.0<8;8,1>:d    r14.0<8;8,1>:d   {Compacted}
[190]          sel (16|M16)  (lt)f0.0   r36.0<1>:d    r24.0<8;8,1>:d    r22.0<8;8,1>:d  
[191] (W)      jmpi                                 112                             

// BBL18

[192]          sel (16|M0)   (ge)f0.0   r32.0<1>:d    r12.0<8;8,1>:d    r10.0<8;8,1>:d   {Compacted}
[193]          sel (16|M16)  (ge)f0.0   r42.0<1>:d    r20.0<8;8,1>:d    r18.0<8;8,1>:d  
[194]          sel (16|M0)   (lt)f0.0   r30.0<1>:d    r10.0<8;8,1>:d    r12.0<8;8,1>:d   {Compacted}
[195]          sel (16|M16)  (lt)f0.0   r38.0<1>:d    r18.0<8;8,1>:d    r20.0<8;8,1>:d  
[196]          sel (16|M0)   (lt)f0.0   r28.0<1>:d    r16.0<8;8,1>:d    r14.0<8;8,1>:d   {Compacted}
[197]          sel (16|M16)  (lt)f0.0   r40.0<1>:d    r24.0<8;8,1>:d    r22.0<8;8,1>:d  
[198]          sel (16|M0)   (ge)f0.0   r34.0<1>:d    r14.0<8;8,1>:d    r16.0<8;8,1>:d   {Compacted}
[199]          sel (16|M16)  (ge)f0.0   r36.0<1>:d    r22.0<8;8,1>:d    r24.0<8;8,1>:d  

// BBL19

[200]          and (8|M0)               r11.0<1>:q    r60.0<4;4,1>:q    1:w             
[201]          and (8|M8)               r9.0<1>:q     r58.0<4;4,1>:q    1:w             
[202]          and (8|M16)              r4.0<1>:q     r4.0<4;4,1>:q     1:w             
[203]          and (8|M24)              r2.0<1>:q     r2.0<4;4,1>:q     1:w             
[204]          cmp (8|M0)    (eq)f0.0   null<1>:q     r11.0<4;4,1>:q    r8.4<0;1,0>:ud  
[205]          cmp (8|M8)    (eq)f0.0   null<1>:q     r9.0<4;4,1>:q     r8.4<0;1,0>:ud  
[206]          cmp (8|M16)   (eq)f0.0   null<1>:q     r4.0<4;4,1>:q     r8.4<0;1,0>:ud  
[207]          cmp (8|M24)   (eq)f0.0   null<1>:q     r2.0<4;4,1>:q     r8.4<0;1,0>:ud  

// BBL20

[208] (~f0.0)  if (32|M0)                           256               480             

// BBL21

[209]          sel (16|M0)   (lt)f0.0   r20.0<1>:d    r32.0<8;8,1>:d    r28.0<8;8,1>:d   {Compacted}
[210]          sel (16|M0)   (lt)f0.0   r18.0<1>:d    r30.0<8;8,1>:d    r34.0<8;8,1>:d   {Compacted}
[211]          sel (16|M0)   (ge)f0.0   r8.0<1>:d     r28.0<8;8,1>:d    r32.0<8;8,1>:d   {Compacted}
[212]          sel (16|M0)   (ge)f0.0   r4.0<1>:d     r34.0<8;8,1>:d    r30.0<8;8,1>:d   {Compacted}
[213]          sel (16|M16)  (lt)f0.0   r12.0<1>:d    r42.0<8;8,1>:d    r40.0<8;8,1>:d  
[214]          sel (16|M16)  (lt)f0.0   r10.0<1>:d    r38.0<8;8,1>:d    r36.0<8;8,1>:d  
[215]          sel (16|M16)  (ge)f0.0   r14.0<1>:d    r40.0<8;8,1>:d    r42.0<8;8,1>:d  
[216]          sel (16|M16)  (ge)f0.0   r2.0<1>:d     r36.0<8;8,1>:d    r38.0<8;8,1>:d  
[217]          sel (16|M0)   (lt)f0.0   r16.0<1>:d    r20.0<8;8,1>:d    r18.0<8;8,1>:d   {Compacted}
[218]          sel (16|M0)   (ge)f0.0   r18.0<1>:d    r20.0<8;8,1>:d    r18.0<8;8,1>:d   {Compacted}
[219]          sel (16|M0)   (lt)f0.0   r20.0<1>:d    r8.0<8;8,1>:d     r4.0<8;8,1>:d    {Compacted}
[220]          sel (16|M0)   (ge)f0.0   r22.0<1>:d    r8.0<8;8,1>:d     r4.0<8;8,1>:d    {Compacted}
[221]          sel (16|M16)  (lt)f0.0   r8.0<1>:d     r12.0<8;8,1>:d    r10.0<8;8,1>:d  
[222]          sel (16|M16)  (ge)f0.0   r10.0<1>:d    r12.0<8;8,1>:d    r10.0<8;8,1>:d  
[223]          sel (16|M16)  (lt)f0.0   r12.0<1>:d    r14.0<8;8,1>:d    r2.0<8;8,1>:d   
[224]          sel (16|M16)  (ge)f0.0   r14.0<1>:d    r14.0<8;8,1>:d    r2.0<8;8,1>:d   
[225]          sends (16|M0)            null:w   r6      r16     0x20C       0x4025000
[226]          sends (16|M16)           null:w   r26     r8      0x20C       0x4025000

// BBL22

[227]          else (32|M0)                         240               240             

// BBL23

[228]          sel (16|M0)   (ge)f0.0   r20.0<1>:d    r28.0<8;8,1>:d    r32.0<8;8,1>:d   {Compacted}
[229]          sel (16|M0)   (ge)f0.0   r18.0<1>:d    r34.0<8;8,1>:d    r30.0<8;8,1>:d   {Compacted}
[230]          sel (16|M0)   (lt)f0.0   r14.0<1>:d    r32.0<8;8,1>:d    r28.0<8;8,1>:d   {Compacted}
[231]          sel (16|M0)   (lt)f0.0   r8.0<1>:d     r30.0<8;8,1>:d    r34.0<8;8,1>:d   {Compacted}
[232]          sel (16|M16)  (ge)f0.0   r12.0<1>:d    r40.0<8;8,1>:d    r42.0<8;8,1>:d  
[233]          sel (16|M16)  (ge)f0.0   r10.0<1>:d    r36.0<8;8,1>:d    r38.0<8;8,1>:d  
[234]          sel (16|M16)  (lt)f0.0   r4.0<1>:d     r42.0<8;8,1>:d    r40.0<8;8,1>:d  
[235]          sel (16|M16)  (lt)f0.0   r2.0<1>:d     r38.0<8;8,1>:d    r36.0<8;8,1>:d  
[236]          sel (16|M0)   (ge)f0.0   r16.0<1>:d    r18.0<8;8,1>:d    r20.0<8;8,1>:d   {Compacted}
[237]          sel (16|M0)   (lt)f0.0   r18.0<1>:d    r18.0<8;8,1>:d    r20.0<8;8,1>:d   {Compacted}
[238]          sel (16|M0)   (ge)f0.0   r20.0<1>:d    r8.0<8;8,1>:d     r14.0<8;8,1>:d   {Compacted}
[239]          sel (16|M0)   (lt)f0.0   r22.0<1>:d    r8.0<8;8,1>:d     r14.0<8;8,1>:d   {Compacted}
[240]          sel (16|M16)  (ge)f0.0   r8.0<1>:d     r10.0<8;8,1>:d    r12.0<8;8,1>:d  
[241]          sel (16|M16)  (lt)f0.0   r10.0<1>:d    r10.0<8;8,1>:d    r12.0<8;8,1>:d  
[242]          sel (16|M16)  (ge)f0.0   r12.0<1>:d    r2.0<8;8,1>:d     r4.0<8;8,1>:d   
[243]          sel (16|M16)  (lt)f0.0   r14.0<1>:d    r2.0<8;8,1>:d     r4.0<8;8,1>:d   
[244]          sends (16|M0)            null:w   r6      r16     0x20C       0x4025000
[245]          sends (16|M16)           null:w   r26     r8      0x20C       0x4025000

// BBL24

[246]          endif (32|M0)                        16                              

// BBL25

[247] (W)      mov (8|M0)               r112.0<1>:ud  r100.0<8;8,1>:ud                 {Compacted}

// BBL26

[248] (W)      send (8|M0)              null     r112    0x27        0x2000010  {EOT}


simdprof.h

/*========================== begin_copyright_notice ============================
Copyright (C) 2019-2022 Intel Corporation

SPDX-License-Identifier: MIT
============================= end_copyright_notice ===========================*/

/*!
 * @file SIMD operation counting tool definitions
 */
#ifndef SIMDPROF_H_
#define SIMDPROF_H_

#include <cstdint>
#include <map>
#include <string>
#include <vector>

#include "gtpin_api.h"
#include "gtpin_tool_utils.h"

using namespace gtpin;

/* ============================================================================================= */
// Struct SimdProfRecord
/* ============================================================================================= */
/*!
 * Layout of records collected in profile buffer by the Simdprof tool
 */
struct SimdProfRecord
{
    uint64_t opCount;    ///< Number of SIMD operations executed by a group of instructions
};

/* ============================================================================================= */
// Struct SimdProfArgs
/* ============================================================================================= */
/*!
 * SimdProf instrumentation arguments (instruction properties).
 * Each unique combination of these arguments requires a separate instrumentation procedure
 * to be generated for each group of instructions with these properties
 */
struct SimdProfArgs
{
    SimdProfArgs(bool ctrl, uint32_t mask, GtPredicate pred, bool isSend = false) :
                 maskCtrl(ctrl), execMask(mask), predicate(pred), isSendIns(isSend){}

    inline bool operator <  (const SimdProfArgs& other) const;

    bool            maskCtrl;       ///< 'MaskCtrl' flag of instrumented instructions
    uint32_t        execMask;       ///< Execution mask of instrumented instructions
    GtPredicate     predicate;      ///< Predicate of instrumented instructions
    bool            isSendIns;      ///< true if instrumented instructions are SEND instructions
};

/* ============================================================================================= */
// Struct SimdProfGroup
/* ============================================================================================= */
/*!
 * Structure that holds information and profiling results for a group of instructions being
 * instrumented by a single instrumentation routine.
 * @note All instructions within a group have exactly the same SimdProfArgs.
 * @note In order to provide separate channel counters per instruction category (e.g. integer, FP, etc.),
 *       replace the {insCount, opCount} pair with an array of counter pairs per category.
 */
struct SimdProfGroup
{
    SimdProfGroup(uint32_t bbl, uint32_t numIns) : bblId(bbl), insCount(numIns), opCount(0) {}

    BblId    bblId;     ///< Identifier of a BBL that contains this group of instructions
    uint32_t insCount;  ///< Number of instructions in the group
    uint64_t opCount;   ///< Number of SIMD operations (effective channels) executed by each instruction in the group
};

/* ============================================================================================= */
// Struct SimdProfSection
/* ============================================================================================= */
/*!
 * Structure that holds information on a SimdProf section - sequence of instructions for which
 * instrumentation routines can be inserted at the same point.
 * @note All instructions within a section are executed with the same value of the flag register -
 * the single dynamic parameter of the SIMD operation calculator
 */
struct SimdProfSection
{
    SimdProfSection(const IGtIns& headIns) :  headInsId(headIns.Id()) {}

    /// Add a new instruction to the section. Update the corresponding SimdProf group within this section
    void AddInstruction(const IGtIns& ins);

    InsId                            headInsId; ///< First instruction of the section - common
                                                ///< instrumentation point for all groups in the section
    std::map<SimdProfArgs, uint32_t> groups;    ///< SimdProf groups along with the number of instructions
};

/* ============================================================================================= */
// Class SimdProfKernelProfile
/* ============================================================================================= */
/*!
 * Class that represents a kernel profiled by the SimdProf instrumentation
 */
class SimdProfKernelProfile
{
public:
    SimdProfKernelProfile(const IGtKernel& kernel);

    /*!
     * Instrument the kernel.
     * The function is called by the OnKernelBuild handler
     */
    void Instrument(IGtKernelInstrument& instrumentor);

    /*!
     * Read profiling results which are assumed to be collected and stored in the buffer
     * associated with the kernel.
     * The function is called by the OnKernelComplete handler
     */
    void ReadProfileData(const IGtProfileBuffer* buffer);

    /// @return Total number of SIMD operations executed by the kernel
    uint64_t GetTotalOpCounter() const { return _totalOpCount; }

    std::string           ToString()        const;                          ///< @return Text representation of the profile data
    const GtProfileArray& GetProfileArray() const { return _profileArray; } ///< @return Profile buffer accessor

private:
    /*!
     * Generate instrumentation procedures for all SimdProf groups of the specified SimdProf section.
     * Insert instrumentation at the beginning of the section.
     * Initialize the _profileData array
     * @param[in]      instrumentor     Instrumentor of the GEN kernel
     * @param[in]      section          SimdProf section to be instrumented
     */
    void InstrumentSection(IGtKernelInstrument& instrumentor, const SimdProfSection& section);

    /// @return true/false - use 64-bit/32-bit integer for the operation counter
    static bool Use64BitCounters(const IGtGenCoder& coder);

    /// Increment counter of SIMD operations for the specified BBL by 'incValue'
    void UpdateBblOpCounter(BblId bblId, uint64_t incValue);

    /// @return Extended kernel name
    std::string ExtendedName() const { return _extName; }

private:
    /// Kernel descriptor
    std::string                         _name;              ///< Kernel's name
    std::string                         _extName;           ///< Kernel's extended name
00147     GtKernelType                        _type;              ///< Kernel's type
00148     GtGpuPlatform                       _platform;          ///< Kernel's platform
00149     uint64_t                            _hashId;            ///< Kernel's hash identifier
00150     GtSimdWidth                         _simd;              ///< Kernel's SIMD width
00151     uint64_t                            _binarySignature;   ///< Kernel's binary signature
00152     
00153     GtProfileArray                      _profileArray;      ///< Profile buffer accessor
00154     std::vector<SimdProfGroup>          _profileData;       ///< Profiling data for instrumented SimdProf groups
00155 
00156     GtReg _addrReg;     ///< Virtual register that holds address within profile buffer
00157     GtReg _dataReg;     ///< Virtual register that holds data to be read from/written to profile buffer 
00158 
00159     std::map<BblId, std::pair<InsId, InsId> >         _bblInsInfo;   ///< Head and tail instructions per BBL
00160     std::map<BblId, uint64_t>                         _bblOpCounts;  ///< Number of executed SIMD operations per BBL
00161     uint64_t                                          _totalOpCount; ///< Number of SIMD operations executed by the kernel
00162 };
00163 
00164 /* ============================================================================================= */
00165 // Class SimdProf
00166 /* ============================================================================================= */
00167 /*!
00168  * Implementation of the IGtTool interface for the SimdProf tool
00169  */
00170 class SimdProf : public GtTool
00171 {
00172 public:
00173     /// Implementation of the IGtTool interface
00174     const char* Name() const { return "simdprof"; }
00175 
00176     void OnKernelBuild(IGtKernelInstrument& instrumentor);
00177     void OnKernelRun(IGtKernelDispatch& dispatcher);
00178     void OnKernelComplete(IGtKernelDispatch& dispatcher);
00179 
00180 public:
00181     std::string ToString() const;                ///< @return Text representation of the profile data
00182     static SimdProf* Instance();                 ///< @return Single instance of this class
00183     static void OnFini() { Instance()->Fini(); } ///< Callback function registered with atexit()
00184 
00185 protected:
00186     SimdProf() = default;
00187     SimdProf(const SimdProf&) = delete;
00188     SimdProf& operator = (const SimdProf&) = delete;
00189     ~SimdProf() = default;
00190 
00191     void Fini();                                ///< Post process and dump profiling data
00192 
00193 private:
00194     /// Collection of kernel profiles
00195     typedef std::map<GtKernelId, SimdProfKernelProfile>  KernelProfiles;
00196     KernelProfiles  _kernels;
00197 };
00198 #endif
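
The section and group bookkeeping declared above can be illustrated without GTPin. The following is a minimal sketch, not the tool's implementation: `ToyIns`, `ToyArgs`, `ToySection`, `SplitIntoSections` and `SectionExample` are hypothetical names introduced only for this example, and the predicate field of `SimdProfArgs` is left out for brevity. It assumes, as the header comments state, that a section ends at an instruction that modifies the flag register and that instructions with identical static arguments share one group.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical stand-in for the static attributes read from an instruction.
struct ToyIns {
    uint32_t id;
    uint32_t execMask;
    bool     maskCtrl;      // true when write masking is disabled
    bool     isSend;
    bool     modifiesFlag;  // instruction changes the flag register
};

// Group key, ordered lexicographically by std::tuple's operator<
// (maskCtrl, execMask, isSend). The predicate is omitted in this sketch.
using ToyArgs = std::tuple<bool, uint32_t, bool>;

struct ToySection {
    uint32_t                    headInsId;  // common instrumentation point
    std::map<ToyArgs, uint32_t> groups;     // args -> number of instructions
};

// Split the instructions of one basic block into sections. A section is
// closed by the first instruction that modifies the flag register.
std::vector<ToySection> SplitIntoSections(const std::vector<ToyIns>& bbl)
{
    std::vector<ToySection> sections;
    bool isSectionBegin = true;
    for (const ToyIns& ins : bbl) {
        if (isSectionBegin) {
            sections.push_back(ToySection{ins.id, {}});
            isSectionBegin = false;
        }
        // operator[] value-initializes the counter to 0 on first use
        ++sections.back().groups[ToyArgs(ins.maskCtrl, ins.execMask, ins.isSend)];
        if (ins.modifiesFlag) { isSectionBegin = true; }
    }
    return sections;
}

// Usage: four instructions, the second of which modifies the flag register.
void SectionExample()
{
    std::vector<ToyIns> bbl = {
        {0, 0xFFFFu, false, false, false},
        {1, 0xFFFFu, false, false, true},   // closes the first section
        {2, 0x00FFu, false, false, false},
        {3, 0x00FFu, false, true,  false},  // SEND: separate group
    };
    std::vector<ToySection> sections = SplitIntoSections(bbl);
    assert(sections.size() == 2);
    assert(sections[0].groups.size() == 1);  // two instructions, one group
    assert(sections[1].groups.size() == 2);  // non-SEND and SEND groups
}
```

Keying the map on the static arguments means one instrumentation routine per group is enough, however many instructions the group contains.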

simdprof.cpp

/*========================== begin_copyright_notice ============================
Copyright (C) 2019-2025 Intel Corporation

SPDX-License-Identifier: MIT
============================= end_copyright_notice ===========================*/

/*!
 * @file Implementation of the SIMD operation counting tool
 */

#include <algorithm>
#include <vector>
#include <map>
#include <string>
#include <tuple>
#include <fstream>
#include <sstream>
#include <iomanip>
#include <assert.h>

#include "simdprof.h"

using namespace gtpin;
using namespace std;

/* ============================================================================================= */
// Configuration
/* ============================================================================================= */
Knob<int>  knobNumThreadBuckets("num_thread_buckets", 32, "Number of thread buckets. 0 - maximum thread buckets");

/* ============================================================================================= */
// SimdProfArgs implementation
/* ============================================================================================= */

bool SimdProfArgs::operator < (const SimdProfArgs& other) const
{
    return std::make_tuple(maskCtrl, execMask, predicate, isSendIns) <
           std::make_tuple(other.maskCtrl, other.execMask, other.predicate, other.isSendIns);
}

/* ============================================================================================= */
// SimdProfSection implementation
/* ============================================================================================= */

void SimdProfSection::AddInstruction(const IGtIns& ins)
{
    uint32_t    execMask    = ins.ExecMask().Bits();
    GtPredicate predicate   = ins.Predicate();
    bool        maskCtrl    = !ins.IsWriteMaskEnabled();
    bool        isSendIns   = ins.IsSendMessage();

    auto it = groups.emplace(SimdProfArgs(maskCtrl, execMask, predicate, isSendIns), 0).first;
    ++(it->second);
}

/* ============================================================================================= */
// SimdProfKernelProfile implementation
/* ============================================================================================= */

SimdProfKernelProfile::SimdProfKernelProfile(const IGtKernel& kernel) :
    _name(GlueString(kernel.Name())), _extName(ExtendedKernelName(kernel)), _type(kernel.Type()), _platform(kernel.GpuPlatform()),
    _hashId(kernel.HashId()), _simd(kernel.SimdWidth()), _binarySignature(kernel.BinarySignature()),
    _totalOpCount(0) {}

void SimdProfKernelProfile::Instrument(IGtKernelInstrument& instrumentor)
{
    const IGtGenCoder&  coder           = instrumentor.Coder();
    const IGtKernel&    kernel          = instrumentor.Kernel();
    const IGtCfg&       cfg             = instrumentor.Cfg();
    IGtVregFactory&     vregs           = coder.VregFactory();
    bool                is64BitCounter  = Use64BitCounters(coder);

    // Initialize virtual registers
    _addrReg = vregs.MakeMsgAddrScratch();
    _dataReg = vregs.MakeMsgDataScratch(is64BitCounter ? VREG_TYPE_QWORD : VREG_TYPE_DWORD);

    // Identify SimdProf sections and #groups in the kernel
    std::vector<SimdProfSection> sections;      // All SimdProf sections in the kernel
    uint32_t                     numGroups = 0; // Number of SimdProf groups in the kernel

    for (auto bblPtr : cfg.Bbls())
    {
        bool isSectionBegin = true;

        // Iterate through sections within the current BBL
        for (auto insPtr : bblPtr->Instructions())
        {
            const IGtIns& ins = *insPtr;

            if (ins.Id() < (uint32_t)knobMinInstrumentIns || ins.Id() > (uint32_t)knobMaxInstrumentIns)
            {
                continue;
            }

            if (isSectionBegin)
            {
                sections.emplace_back(ins);
                isSectionBegin = false;
            }

            SimdProfSection& section = sections.back();
            section.AddInstruction(ins);

            if (ins.IsFlagModifier() || (ins.Id() == bblPtr->LastIns().Id())) // section end
            {
                numGroups += (uint32_t)section.groups.size();
                isSectionBegin = true;
            }
        }

        // Account for a section left open because the BBL's trailing instructions were skipped
        if (!isSectionBegin)
        {
            numGroups += (uint32_t)sections.back().groups.size();
        }
    }

    // Allocate the profile buffer. It will hold a single SimdProfRecord per group in each thread bucket
    uint32_t numThreadBuckets = (knobNumThreadBuckets == 0) ? kernel.GenModel().MaxThreadBuckets() : knobNumThreadBuckets;
    _profileArray = GtProfileArray(sizeof(SimdProfRecord), numGroups, numThreadBuckets);
    _profileArray.Allocate(instrumentor.ProfileBufferAllocator());

    // Instrument SimdProf sections and initialize the _profileData array
    for (auto& section : sections) { InstrumentSection(instrumentor, section); }

    // Save BBL information for the post processing phase
    for (auto bblPtr : cfg.Bbls())
    {
        _bblInsInfo.emplace(bblPtr->Id(), std::make_pair(bblPtr->FirstIns().Id(), bblPtr->LastIns().Id()));
    }
}

void SimdProfKernelProfile::InstrumentSection(IGtKernelInstrument& instrumentor, const SimdProfSection& section)
{
    const IGtGenCoder&  coder           = instrumentor.Coder();
    IGtInsFactory&      insF            = coder.InstructionFactory();
    const IGtCfg&       cfg             = instrumentor.Cfg();
    bool                is64BitCounter  = Use64BitCounters(coder);
    GtReg               dataRegL        = {_dataReg, sizeof(uint32_t), 0};  // Low 32-bits of the data payload register

    // Instrument each SimdProf group:
    //  - If a group is associated with non-SEND instructions, compute the SIMD count by applying CBIT to the SIMD mask.
    //  - Otherwise, if a group is created for SEND instructions, increment the SIMD count for each SEND whose SIMD mask
    //    is nonzero. From the EU perspective, a SEND instruction is 1 operation, unless the SIMD mask is zero
    // Insert each per-group instrumentation procedure at the beginning of the corresponding section

    // Insert SimdProf instrumentation at the beginning of the current section
    const IGtIns& ins = cfg.GetInstruction(section.headInsId);
    const IGtBbl& bbl = cfg.GetBbl(ins);

    for (auto& group : section.groups)
    {
        GtGenProcedure      proc;
        const SimdProfArgs& args = group.first;

        if (is64BitCounter)
        {
            // Clear the high 32-bits of the data payload register
            GtReg dataRegH = {_dataReg, sizeof(uint32_t), 1};
            proc += insF.MakeMov(dataRegH, 0);
        }

        // dataRegL = SIMD mask
        coder.ComputeSimdMask(proc, dataRegL, args.maskCtrl, args.execMask, args.predicate);

        // dataRegL = number of SIMD operations executed
        if (!args.isSendIns)
        {
            proc += insF.MakeCbit(dataRegL, dataRegL);
        }
        else
        {
            proc += insF.MakeSel(dataRegL, dataRegL, 1).SetCondModifier(GED_COND_MODIFIER_l); // dataRegL = min(dataRegL, 1)
        }

        // Generate code that updates the SIMD operation counter in the corresponding SimdProfRecord
        uint32_t recordNum = (uint32_t)_profileData.size();
        _profileArray.ComputeAddress(coder, proc, _addrReg, recordNum);

        proc += insF.MakeAtomicAdd(NullReg(), _addrReg, _dataReg, (is64BitCounter ? GED_DATA_TYPE_uq : GED_DATA_TYPE_ud));

        // Insert a new instrumentation routine and append the new group to _profileData
        proc.front()->AppendAnnotation(__func__);
        SimdProf::Instance()->InstrumentInstruction(instrumentor, ins, GtIpoint::Before(), proc);
        _profileData.emplace_back(bbl.Id(), group.second);
    }
}

bool SimdProfKernelProfile::Use64BitCounters(const IGtGenCoder& coder)
{
    return coder.InstructionFactory().CanAccessAtomically(GED_DATA_TYPE_uq);
}

void SimdProfKernelProfile::ReadProfileData(const IGtProfileBuffer* buffer)
{
    GTPIN_ASSERT(_profileData.size() == _profileArray.NumRecords());
    uint32_t recordNum = 0;

    // Iterate through all SimdProf groups and read counters of executed operations (channels).
    for (auto& group : _profileData)
    {
        // Accumulate counters for all threads in which this group of instructions was executed
        for (uint32_t threadBucket = 0; threadBucket < _profileArray.NumThreadBuckets(); ++threadBucket)
        {
            SimdProfRecord record;
            if (!_profileArray.Read(*buffer, &record, recordNum, 1, threadBucket))
            {
                GTPIN_ERROR_MSG(string("SIMDPROF : ") + _name + " : Failed to read from memory buffer");
            }
            else
            {
                // Update counters of executed operations
                uint64_t opCount = record.opCount * group.insCount;
                group.opCount += opCount;
                UpdateBblOpCounter(group.bblId, opCount);
                _totalOpCount += opCount;
            }
        }
        recordNum++;
    }
}

void SimdProfKernelProfile::UpdateBblOpCounter(BblId bblId, uint64_t incValue)
{
    auto it = _bblOpCounts.emplace(bblId, 0).first;
    it->second += incValue;
}

std::string SimdProfKernelProfile::ToString() const
{
    ostringstream ostr;
    ostr << ExtendedName() << endl;

    if (!_bblOpCounts.empty())
    {
        ostr << setw(10) << "BBL" << setw(15) << "Head Ins ID" << setw(15) << "Tail Ins ID" << setw(20) << "Channels" << endl;
        for (const auto& bc : _bblOpCounts)
        {
            ostr << setw(10) << bc.first << setw(15) << _bblInsInfo.at(bc.first).first << setw(15) << _bblInsInfo.at(bc.first).second << setw(20) << bc.second << endl;
        }
        ostr << setw(10) << "Total" << setw(15) << _totalOpCount << endl;
    }
    else
    {
        ostr << "No channels executed" << endl;
    }

    return ostr.str();
}

/* ============================================================================================= */
// SimdProf implementation
/* ============================================================================================= */
SimdProf* SimdProf::Instance()
{
    static SimdProf instance;
    return &instance;
}

void SimdProf::OnKernelBuild(IGtKernelInstrument& instrumentor)
{
    const IGtKernel& kernel = instrumentor.Kernel();
    auto it = _kernels.emplace(kernel.Id(), kernel).first;
    it->second.Instrument(instrumentor);
}

void SimdProf::OnKernelRun(IGtKernelDispatch& dispatcher)
{
    bool isProfileEnabled = false;

    const IGtKernel& kernel = dispatcher.Kernel();
    GtKernelExecDesc execDesc; dispatcher.GetExecDescriptor(execDesc);
    if (kernel.IsInstrumented() && IsKernelExecProfileEnabled(execDesc, kernel.GpuPlatform()))
    {
        auto it = _kernels.find(kernel.Id());

        if (it != _kernels.end())
        {
            IGtProfileBuffer* buffer = dispatcher.CreateProfileBuffer(); GTPIN_ASSERT(buffer);
            SimdProfKernelProfile& kernelProfile = it->second;
            const GtProfileArray& profileArray = kernelProfile.GetProfileArray();
            if (profileArray.Initialize(*buffer))
            {
                isProfileEnabled = true;
            }
            else
            {
                GTPIN_ERROR_MSG(string("SIMDPROF : ") + string(kernel.Name()) + " : Failed to write into memory buffer");
            }
        }
    }
    dispatcher.SetProfilingMode(isProfileEnabled);
}

void SimdProf::OnKernelComplete(IGtKernelDispatch& dispatcher)
{
    if (!dispatcher.IsProfilingEnabled())
    {
        return; // Do nothing with unprofiled kernel dispatches
    }

    const IGtKernel& kernel = dispatcher.Kernel();
    auto it = _kernels.find(kernel.Id());

    if (it != _kernels.end())
    {
        const IGtProfileBuffer* buffer = dispatcher.GetProfileBuffer(); GTPIN_ASSERT(buffer);
        SimdProfKernelProfile& kernelProfile = it->second;
        kernelProfile.ReadProfileData(buffer);
    }
}

void SimdProf::Fini()
{
    string profileDir = GTPin_GetCore()->ProfileDir();
    string filePath = JoinPath(profileDir, "simdprof.txt");

    ofstream fs(filePath);
    if (fs.is_open())
    {
        fs << ToString();
        fs.close();
    }
    else
    {
        GTPIN_WARNING("SIMDPROF : could not create file: " + filePath);
    }
}

string SimdProf::ToString() const
{
    ostringstream ostr;
    ostr << "Channels (SIMD operations) executed by kernels/BBLs" << endl;
    ostr << "===================================================" << endl;

    uint64_t totalOpCount = 0;
    for (const auto& k : _kernels)
    {
        ostr << string(100, '-') << endl;
        ostr << k.second.ToString() << endl;
        totalOpCount += k.second.GetTotalOpCounter();
    }
    ostr << "Total number of kernels:                    " << _kernels.size() << std::endl;
    ostr << "Total number of channels (SIMD operations): " << totalOpCount << std::endl;

    return ostr.str();
}

// Define DETACHED_SIMDPROF to use SimdProf functionality in a different tool
#if !defined (DETACHED_SIMDPROF)
/* ============================================================================================= */
// GTPin_Entry
/* ============================================================================================= */
EXPORT_C_FUNC void GTPin_Entry(int argc, const char* argv[])
{
    ConfigureGTPin(argc, argv);
    SimdProf::Instance()->Register();
    atexit(SimdProf::OnFini);
}
#endif
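
The counting rule used by InstrumentSection and the accumulation done by ReadProfileData can be reproduced on the host for illustration. The sketch below is an approximation under stated assumptions: `CountBits`, `OpsPerExecution`, `GroupOpCount` and `CountingExample` are hypothetical helpers, not GTPin API. The SIMD mask is taken as an input, because the way it is derived from the execution mask, mask control and predicate is not modelled here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Number of set bits in the mask. This mirrors the effect of the CBIT
// instruction that the instrumentation applies to the SIMD mask.
uint32_t CountBits(uint32_t mask)
{
    uint32_t n = 0;
    for (; mask != 0; mask &= mask - 1) { ++n; }  // clear lowest set bit
    return n;
}

// Value added to the group's counter by one execution of the instrumentation:
// the number of active channels for non-SEND groups, min(mask, 1) for SEND.
uint32_t OpsPerExecution(uint32_t simdMask, bool isSend)
{
    return isSend ? std::min<uint32_t>(simdMask, 1u) : CountBits(simdMask);
}

// Host-side accumulation for one group: every thread bucket holds the sum of
// per-execution values, and each of the group's instructions contributes it.
uint64_t GroupOpCount(const std::vector<uint64_t>& bucketCounts, uint32_t insCount)
{
    uint64_t total = 0;
    for (uint64_t c : bucketCounts) { total += c * insCount; }
    return total;
}

// Usage with made-up numbers.
void CountingExample()
{
    assert(OpsPerExecution(0x00FFu, false) == 8);  // 8 active channels
    assert(OpsPerExecution(0x00FFu, true)  == 1);  // SEND counts once
    assert(OpsPerExecution(0u,      true)  == 0);  // nothing executed

    std::vector<uint64_t> buckets = {8, 16};       // two thread buckets
    assert(GroupOpCount(buckets, 3) == 72);        // (8 + 16) * 3
}
```

Multiplying by the instruction count on the host is what lets the tool instrument each group once per section instead of once per instruction.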





  Copyright (C) 2013-2025 Intel Corporation
SPDX-License-Identifier: MIT