GTPin
GTPin: Itrace Sample Tool

The Itrace tool generates a dynamic control-flow trace of kernel execution.

The trace is provided per kernel, at Draw/Enqueue granularity, and separately for each HW thread.

Running the Itrace tool

The Itrace tool (like all GTPin tracing tools) works in two phases, which must be run separately: a pre-processing phase (phase 1) and a trace-gathering phase (phase 2).

To run the pre-processing phase of the Itrace tool (in its default configuration), use the following command:

Profilers/Bin/gtpin -t itrace --phase 1 -- app

NOTE: This phase needs to be run only once per application.

To run the trace-gathering phase of the Itrace tool (in its default configuration), use the following command:

Profilers/Bin/gtpin -t itrace --phase 2 -- app

How to understand Itrace results

When you run the GTPin Itrace tool in its default configuration for pre-processing (phase 1), GTPin generates the directory GTPIN_PROFILE_ITRACE0. In addition, the following two files are created in the current directory:

The first file is an input to the trace-gathering phase. It has the following format:

VS_asm7cf5b819f3a88d8e_simd8___VS_asm7cf5b819f3a88d8e_simd8_10 2048
CS_asm3f2b2787381d7600_simd8___CS_asm3f2b2787381d7600_simd8_28 2048
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30 4326
PS_asm967d2b3a9b5e0245_simd8___PS_asm967d2b3a9b5e0245_simd8_35 17888
PS_asm967d2b3a9b5e0245_simd16___PS_asm967d2b3a9b5e0245_simd16_37 95844
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_40 4326

where the string field identifies a kernel (by its extended name) and the numeric field is the number of trace records that this kernel requires.
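To illustrate how this file relates to the buffer sizing in the ItraceKernel constructor (in itrace.cpp), the sketch below reads the pre-processing file into a map and reproduces that constructor's capacity computation: 8KB of slack is added, and the result is capped at UINT32_MAX and at the max_buffer_mb knob (3072MB by default). The function names are hypothetical, and converting records to bytes as records × sizeof(ItraceRecord) (8 bytes) is an assumption, because the implementation of ItracePreProcessor::TraceSize is not shown here.

```cpp
#include <algorithm>
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: reads "<extended kernel name> <required records>" lines
// produced by phase 1. Lines that do not have both fields are skipped.
std::map<std::string, uint64_t> ReadRequiredRecords(std::istream& in)
{
    std::map<std::string, uint64_t> records;
    std::string line;
    while (std::getline(in, line))
    {
        std::istringstream fields(line);
        std::string kernel;
        uint64_t count = 0;
        if (fields >> kernel >> count)
        {
            records[kernel] = count;
        }
    }
    return records;
}

// Hypothetical mirror of the capacity logic in ItraceKernel's constructor.
// ASSUMPTION: trace size in bytes = records * sizeof(ItraceRecord).
uint64_t EstimateTraceCapacity(uint64_t requiredRecords, uint64_t maxBufferMb = 3072)
{
    const uint64_t kRecordSize = 8;   // sizeof(ItraceRecord): 2 + 2 + 4 bytes
    const uint64_t kSlack = 0x2000;   // 8KB, as added by ItraceKernel
    // Zero records corresponds to the "unknown trace capacity" path (8KB buffer).
    uint64_t capacity = (requiredRecords == 0) ? kSlack : requiredRecords * kRecordSize + kSlack;
    capacity = std::min<uint64_t>(capacity, UINT32_MAX);
    capacity = std::min<uint64_t>(capacity, maxBufferMb * 0x100000);
    return capacity;
}
```

Under this assumption, the first sample line (2048 records) would yield a capacity of 2048 × 8 + 0x2000 = 24576 bytes.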

The second file contains informational data only and has the following format:

VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30        101  DX12  0  34  318  3  0
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30         99  DX12  0  34  319  3  0
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30         28  DX12  0  34  320  3  0
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30         38  DX12  0  34  321  3  0
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30         76  DX12  0  34  322  3  0
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30         22  DX12  0  34  323  3  0
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30         13  DX12  0  34  324  3  0
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30        104  DX12  0  34  325  3  0
VS_asmfab317bc037b0584_simd8___VS_asmfab317bc037b0584_simd8_30         97  DX12  0  34  326  3  0

where each line corresponds to a single kernel for a single Draw/Enqueue command. The fields have the following meaning (from left to right):

When the Itrace tool is run for trace gathering (phase 2), it generates the directory GTPIN_PROFILE_ITRACE1 and saves the profiling results in GTPIN_PROFILE_ITRACE1\Session_Final. The traces of each kernel are saved in a separate sub-folder named after the kernel. Each Draw/Enqueue command has a separate trace, which is saved in a corresponding sub-directory, as shown in the following screenshot:

[Screenshot: directory structure of the Itrace results]

How to uncompress and read the Itrace trace

Each trace is saved in a compressed binary format, in a file called itrace_compressed.bin, as shown above. To uncompress the trace, run the Profilers\Scripts\uncompress_itrace.py Python script as follows (Python 3.5 or later is required):

python3 Profilers\Scripts\uncompress_itrace.py --input_dir GTPIN_PROFILE_ITRACE1\Session_Final\PS_asm967d2b3a9b5e0245_simd16\device_0__cl_34__draw_0__pso_3 --gen 9

Running this script decompresses the trace into a separate trace for each HW thread, as shown in the following screenshot:

[Screenshot: uncompressed Itrace results, one trace file per HW thread]

where the trace generated on each HW thread is saved in a text file with a name such as itrace___s_0_ss_0_eu_1_tid_5.out. The file name encodes the topology ID of the HW thread: Slice (S), DualSubSlice (DSS), SubSlice (SS), Execution Unit (EU), and HW thread ID (TID). The resulting trace has the following format:

  BBL ID             INS OFFSET
===============================
     0                0x00
     0                0xc8
    16                0x9c0
    16                0xa78
    17                0xa88
    17                0xae8
    19                0xb58
    20                0xbd8
    21                0xbe8
    22                0xcc8
    23                0xcd8
    24                0xdb8
    25                0xdc8
    26                0xdd0
--- EOT ---
     0                0x00
     0                0xc8
    16                0x9c0
    16                0xa78
    17                0xa88
    17                0xae8
    19                0xb58
    20                0xbd8
    21                0xbe8
    22                0xcc8
    23                0xcd8
    24                0xdb8
    25                0xdc8
    26                0xdd0
--- EOT ---

The left column is the ID of the basic block (BBL) to which control passes. The right column is the offset of the first instruction of this BBL. If a BBL consists of multiple instructions and its last instruction is a control-flow instruction, two lines are provided for that BBL: one for its first instruction and one for its last instruction (for example, BBL 16 above appears with offsets 0x9c0 and 0xa78). If the last instruction of a BBL is a control-flow instruction that passes control back to the first instruction of the same BBL (a single-BBL loop), the number of consecutive repetitions of this BBL in the control flow is indicated.

An EOT (end-of-thread) marker separates consecutive dispatches of different SW threads of the kernel on the same HW thread.
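The trace format described above can be read with a few lines of C++. This is a minimal sketch that assumes each trace line is "<BBL ID> <hex offset>", as in the sample; the function name ReadThreadTrace is hypothetical. It returns one sequence of BBL IDs per SW thread (split at EOT markers) and collapses the head/tail line pair of a BBL into one visit. The sample does not show how the repetition count of a single-BBL loop is printed, so the sketch does not expand it.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical reader for one per-HW-thread text trace (*.out file).
// Returns one BBL-ID sequence per SW thread; "--- EOT ---" closes a sequence.
std::vector<std::vector<uint32_t>> ReadThreadTrace(std::istream& in)
{
    std::vector<std::vector<uint32_t>> swThreads(1);
    std::string line;
    while (std::getline(in, line))
    {
        if (line.find("EOT") != std::string::npos)
        {
            swThreads.emplace_back();   // next SW thread dispatched on this HW thread
            continue;
        }
        std::istringstream fields(line);
        uint32_t bblId = 0, offset = 0;
        // Header and separator lines do not start with "<id> <hex offset>"
        if (!(fields >> bblId >> std::hex >> offset)) continue;
        std::vector<uint32_t>& seq = swThreads.back();
        // A BBL ending in a control-flow instruction appears twice in a row
        // (first and last instruction); count it as one visit.
        if (seq.empty() || seq.back() != bblId) seq.push_back(bblId);
    }
    if (swThreads.back().empty()) swThreads.pop_back();  // nothing after the last EOT
    return swThreads;
}
```

For the sample trace above, each of the two SW threads would yield the sequence 0, 16, 17, 19, 20, 21, 22, 23, 24, 25, 26.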

To map the basic block IDs and instruction offsets to the kernel code, look in the Session_Final\ISA sub-folder, where the GEN assembly of all kernels is saved.
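To group per-thread traces by hardware location, the topology ID can be recovered from the per-thread file names described earlier. The sketch below uses std::regex; the struct and function names are hypothetical, and since the sample name carries no DSS component, the optional `_dss_<n>` spelling accepted here is an assumption.

```cpp
#include <optional>
#include <regex>
#include <string>

// HW thread location decoded from a per-thread trace file name.
struct HwThreadId
{
    int slice = 0;
    int dualSubSlice = -1;  // -1 when the name has no DSS component
    int subSlice = 0;
    int eu = 0;
    int tid = 0;
};

// Hypothetical parser for names such as "itrace___s_0_ss_0_eu_1_tid_5.out".
// ASSUMPTION: a DSS component, when present, is spelled "_dss_<n>".
std::optional<HwThreadId> ParseTraceFileName(const std::string& name)
{
    static const std::regex pattern(
        R"(itrace___s_(\d+)(?:_dss_(\d+))?_ss_(\d+)_eu_(\d+)_tid_(\d+)\.out)");
    std::smatch m;
    if (!std::regex_match(name, m, pattern)) return std::nullopt;
    HwThreadId id;
    id.slice = std::stoi(m[1].str());
    if (m[2].matched) id.dualSubSlice = std::stoi(m[2].str());
    id.subSlice = std::stoi(m[3].str());
    id.eu = std::stoi(m[4].str());
    id.tid = std::stoi(m[5].str());
    return id;
}
```

For example, itrace___s_0_ss_0_eu_1_tid_5.out decodes to slice 0, subslice 0, EU 1, thread 5.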

Control-flow graph

In addition to the trace, GTPin dumps the control-flow graph (CFG) of the kernel into a text file called itrace_total.cfg. This file represents the CFG as a list of its edges, together with the frequency of control-flow transitions along each edge, accumulated over the traces of all HW threads. The itrace_total.cfg file has the following format:

 srcBBL, dstBBL, Frequency
=========================
0,1,0
0,16,262144
1,2,0
1,9,0
2,3,0
3,4,0
3,6,0
3,7,0
4,5,0
5,6,0
5,7,0
6,7,0
7,8,0
8,9,0
8,15,0
9,10,0
10,11,0
10,13,0
10,14,0
11,12,0
12,13,0
12,14,0
13,14,0
14,15,0
15,16,0
15,25,0
16,17,262144
16,18,0
17,18,0
17,19,262144
18,19,0
19,20,262144
20,21,262144
20,23,0
20,24,0
21,22,262144
22,23,262144
22,24,0
23,24,262144
24,25,262144
25,26,262144

where each line represents a separate CFG edge and contains the source BBL ID, the destination BBL ID, and the frequency with which control passed along the edge between them.

As with the trace, to map the basic block IDs to the kernel code, look in the Session_Final\ISA sub-folder, where the GEN assembly of all kernels is saved.
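The edge frequencies can be read as transition counts: every pair of consecutive BBLs in a SW thread's trace is one traversal of the corresponding edge, summed over all HW threads. The sketch below illustrates this with a hypothetical AccumulateEdges function (an assumption about how such counts can be derived, not the post-processor's actual code) and a hypothetical IncomingFrequencies reader that parses itrace_total.cfg and sums the incoming edge frequencies of each BBL.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Edge = std::pair<uint32_t, uint32_t>;  // (source BBL ID, destination BBL ID)

// Hypothetical: count one traversal for every pair of consecutive BBLs in each
// SW thread's BBL sequence (e.g. the sequences of all HW threads).
void AccumulateEdges(const std::vector<std::vector<uint32_t>>& swThreads,
                     std::map<Edge, uint64_t>& frequencies)
{
    for (const auto& seq : swThreads)
    {
        for (std::size_t i = 1; i < seq.size(); ++i)
        {
            ++frequencies[Edge(seq[i - 1], seq[i])];
        }
    }
}

// Hypothetical reader for itrace_total.cfg ("srcBBL,dstBBL,Frequency" per line).
// Returns, per destination BBL, the sum of the frequencies of its incoming edges,
// i.e. how often control entered that BBL along a CFG edge.
std::map<uint32_t, uint64_t> IncomingFrequencies(std::istream& in)
{
    std::map<uint32_t, uint64_t> incoming;
    std::string line;
    while (std::getline(in, line))
    {
        std::istringstream fields(line);
        uint32_t src = 0, dst = 0;
        uint64_t freq = 0;
        char comma1 = 0, comma2 = 0;
        // The header and separator lines fail to parse and are skipped
        if (fields >> src >> comma1 >> dst >> comma2 >> freq && comma1 == ',' && comma2 == ',')
        {
            incoming[dst] += freq;
        }
    }
    return incoming;
}
```

For the sample above, BBL 16 is entered 262144 times through edge 0,16, while the kernel entry BBL 0 has no incoming edges and therefore does not appear in the result.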


itrace.h

/*========================== begin_copyright_notice ============================
Copyright (C) 2018-2023 Intel Corporation

SPDX-License-Identifier: MIT
============================= end_copyright_notice ===========================*/

/*!
 * @file Itrace tool definitions
 */

#ifndef ITRACE_H_
#define ITRACE_H_

#include <fstream>
#include <list>
#include <map>
#include <set>
#include <string>
#include <vector>

#include "gtpin_api.h"
#include "kernel_weight.h"

using namespace gtpin;

/* ============================================================================================= */
// Struct ItraceRecord
/* ============================================================================================= */
/*!
 * Structure of the trace record header.
 * The header details architectural state during execution of a BBL.
 */
struct ItraceRecord
{
    uint16_t bblId;     ///< BBL identifier
    uint16_t sr0;       ///< LSB-16 of the State register sr0.0:ud
    uint32_t tileId;    ///< TileId
};

/* ============================================================================================= */
// Class ItraceDispatch
/* ============================================================================================= */
/*!
 * Class that holds trace collected during a single kernel dispatch
 */
class ItraceDispatch
{
public:
    /// Construct an ItraceDispatch object with an empty trace
    explicit ItraceDispatch(const IGtKernelDispatch& dispatch) : _isTrimmed(false) { dispatch.GetExecDescriptor(_kernelExecDesc); }

    /// Read the entire trace from the specified profile buffer into this object
    bool ReadTrace(const GtProfileTrace& traceAccessor, const IGtProfileBuffer& profileBuffer);

    const GtKernelExecDesc& KernelExecDesc() const { return _kernelExecDesc; } ///< @return Descriptor of this kernel dispatch
    uint32_t        Size()      const { return (uint32_t)_rawTrace.size(); }   ///< @return Trace size in bytes
    const uint8_t*  Data()      const { return _rawTrace.data(); }             ///< @return Trace data collected in this dispatch
    uint8_t*        Data()            { return _rawTrace.data(); }             ///< @return Trace data collected in this dispatch
    bool            IsEmpty()   const;                                         ///< @return true if the trace is empty
    bool            IsTrimmed() const { return _isTrimmed; }                   ///< @return true if the trace has been trimmed

private:
    GtKernelExecDesc        _kernelExecDesc; ///< Kernel execution descriptor
    std::vector<uint8_t>    _rawTrace;       ///< Trace data collected in this kernel dispatch
    bool                    _isTrimmed;      ///< true if the trace has been trimmed to avoid buffer overflow
};

/* ============================================================================================= */
// Class ItraceKernel
/* ============================================================================================= */
/*!
 * Class that contains
 *  - Static information about basic block offsets in the kernel
 *  - Collection of instruction traces recorded by kernel dispatches
 */
class ItraceKernel
{
public:
    ItraceKernel() = default;

    /// Construct an ItraceKernel object intended to hold traces of the specified kernel
    explicit ItraceKernel(const IGtKernelInstrument& kernelInstrument, uint32_t numTiles);

    /*!
     * Read a trace recorded by the specified kernel dispatch. Create and add the corresponding ItraceDispatch
     * instance to this object
     */
    ItraceDispatch& AddItrace(IGtKernelDispatch& kernelDispatch);

    std::string           Name()            const { return _name; }               ///< @return Kernel's name
    std::string           ExtendedName()    const { return _extName; }            ///< @return Kernel's extended name
    std::string           UniqueName()      const { return _uniqueName; }         ///< @return Kernel's unique name
    const GtGpuPlatform   Platform()        const { return _platform; }           ///< @return Kernel's platform
    const IGtGenModel&    GenModel()        const { return GetGenModel(_genId); } ///< @return Kernel's GEN model
    const GtProfileTrace& TraceAccessor()   const { return _traceAccessor; }      ///< @return Trace accessor
    void  DumpAsm()                         const;                                ///< Dump kernel's assembly text to file

    /// @return true, if tracing of this kernel is enabled
    bool IsEnabled() const { return (_traceAccessor.MaxTraceSize() != 0); }

    /// @return Control-Flow Graph edges
    typedef std::pair<BblId, BblId> Edge;
    typedef std::set<Edge>          Edges;
    const Edges& GetEdges() const { return _edges; }

    /// @return Traces collected in kernel's dispatches
    typedef std::list<ItraceDispatch> Traces;
    const Traces& GetTraces() const { return _traces; }

    typedef std::pair<ImgOffset, ImgOffset> BblBounds;    ///< Basic block's head and tail offsets
    typedef std::map<BblId, BblBounds>      BblBoundsMap; ///< Basic Block Id to Offsets map
    /// @return Basic Block to Offset map
    const BblBoundsMap& GetBblBounds() const { return _bblBoundsMap; }

    /// @return Number of tiles
    uint32_t NumTiles() const { return _numTiles; }

private:
    std::string       _name;              ///< Kernel's name
    std::string       _uniqueName;        ///< Kernel's unique name
    std::string       _extName;           ///< Kernel's extended name
    GtGpuPlatform     _platform;          ///< Kernel's platform
    GtGenModelId      _genId;             ///< Identifier of the GEN model, the kernel is compiled for
    std::string       _asmText;           ///< Kernel's assembly text
    Edges             _edges;             ///< Kernel's set of control-flow graph's edges
    GtProfileTrace    _traceAccessor;     ///< Trace accessor
    Traces            _traces;            ///< Traces collected in kernel's dispatches
    BblBoundsMap      _bblBoundsMap;      ///< Basic Block to BblBounds map
    uint32_t          _numTiles;          ///< The number of supported tiles
};

/* ============================================================================================= */
// Class Itrace
/* ============================================================================================= */
/*!
 * Implementation of the IGtTool interface for the Itrace tool
 */
class Itrace : public GtTool
{
public:
    /// Implementation of the IGtTool interface
    const char* Name()       const { return "Itrace"; }
    uint32_t    ApiVersion() const { return GTPIN_API_VERSION; }

    void OnKernelBuild(IGtKernelInstrument& instrumentor);
    void OnKernelRun(IGtKernelDispatch& dispatcher);
    void OnKernelComplete(IGtKernelDispatch& dispatcher);

public:
    static void OnFini();                        ///< Callback function registered with atexit()

    static Itrace* Instance();                   ///< @return Single instance of this class

private:
    Itrace() {}
    Itrace(const Itrace&) = delete;
    Itrace& operator = (const Itrace&) = delete;
    ~Itrace() = default;

    /*!
     * Generate instrumentation for the specified basic block
     * @param[in] instrumentor      Interface of the kernel being instrumented
     * @param[in] bbl               Basic block to be instrumented
     * @param[in] ItraceKernel      Object that holds information about basic blocks in the kernel
     * @return success/failure status
     */
    bool InstrumentBbl(IGtKernelInstrument& instrumentor, const IGtBbl& bbl, const ItraceKernel& ItraceKernel);

    /*!
     * Generate procedure that allocates space for the new trace record in the trace and stores the record
     * for the specified basic block. The procedure sets _offsetReg equal to the offset of the location within the
     * profile buffer immediately following the record header.
     * If new record can not be allocated due to the trace capacity limitations, the procedure zeroes _offsetReg.
     *
     * @param[in, out]  proc            Procedure, the generated code is appended to
     * @param[in]       coder           GEN code generator
     * @param[in]       bbl             Basic block being instrumented
     * @param[in]       ItraceKernel    Object that holds information about basic blocks in the kernel
     * @param[in]       recordSize      Size of the new record, in bytes
     */
    void StoreRecord(GtGenProcedure& proc, const IGtGenCoder& coder, const IGtBbl& bbl,
        const ItraceKernel& ItraceKernel, uint32_t recordSize);

private:
    std::map<GtKernelId, ItraceKernel> _kernels;   ///< Collection of traces per kernel

    GtReg _addrReg;      ///< Virtual register that holds address within profile buffer
    GtReg _dataReg;      ///< Virtual register that holds data to be read from/written to profile buffer
    GtReg _offsetReg;    ///< Virtual register that holds offset within the trace
    GtReg _tileIdReg;    ///< Virtual register that holds tile ID
};

/* ============================================================================================= */
// Class ItracePreProcessor
/* ============================================================================================= */
/*!
 * Class that computes per-kernel trace sizes in the preprocessing phase, and provides access to
 * this data in the trace gathering phase
 */
class ItracePreProcessor : public KernelWeight
{
public:
    uint64_t TraceSize(const std::string& extKernelName) const; ///< Given extended kernel name, return the trace size in bytes
    static ItracePreProcessor* Instance();                      ///< @return Single instance of this class
    static void OnFini();                                       ///< Callback function registered with atexit()

private:
    ItracePreProcessor();
    ItracePreProcessor(const ItracePreProcessor&) = delete;
    ItracePreProcessor& operator = (const ItracePreProcessor&) = delete;

private:
    /// KernelWeight interface overrides (@see description of KernelWeight functions)
    uint32_t GetBblWeight(IGtKernelInstrument& kernelInstrument, const IGtBbl& bbl) const;
    void AggregateDispatchCounters(KernelWeightCounters& kc, KernelWeightCounters dc) const;

private:
    KernelWeightProfileData _kernelCounters;        ///< Per-kernel counters of required trace records; collected in preprocessing phase
    static const char* _kernelPreProcessFileName;   ///< Name of the file that contains preprocessing data per kernel
    static const char* _dispatchPreProcessFileName; ///< Name of the file that contains preprocessing data per kernel dispatch
};

/* ============================================================================================= */
// Class ItracePostProcessor
/* ============================================================================================= */
/*!
 * Function object that processes kernel traces - stores them in files within the profile directory:
 *
 *    kernel_name
 *    |
 *        |- kernel_dispatch_1
 *           |- itrace_compressed.bin
 *           |- itrace_total.cfg
 *        |- kernel_dispatch_2
 *           |- itrace_compressed.bin
 *           |- itrace_total.cfg
 * The .bin trace files can be uncompressed by the uncompress_itrace.exe utility.
 *
 * Format of .bin trace files:
 * - Static information:
 *     - Number of BBLs in the kernel
 *     - For each BBL:
 *        - BBL ID
 *        - Offset:
 *          value = 0xFFFFXXXX for BBLs which do not end with control flow instruction (XXXX == offset of the BBL within the original binary)
 *          value = 0xYYYYXXXX for BBLs which end with control flow instruction (XXXX == offset of the BBL within the original binary, YYYY == the offset of the control flow instruction)
 * - Dynamic trace data:
 *      - Number of HW threads in which the trace was collected
 *      - For each HW thread:
 *          - HW Thread ID (in the format of sr0.0)
 *          - Number of records collected for this HW thread
 *          - All the records collected for this HW thread
 *
 * The format of *.cfg files:
 * Each line relates to a separate edge of the control-flow graph:
 *     - source BBL ID, destination BBL ID, frequency of visiting the edge between source and destination BBLs.
 */
class ItracePostProcessor
{
public:
    /// Construct an ItracePostProcessor object for the specified collection of kernel traces
    ItracePostProcessor(const IGtCore& gtpinCore, const ItraceKernel& ItraceKernel);

    /// Process all kernel traces associated with this object - store them in files within the profile directory
    bool operator()();

private:

    /// Process the specified trace
    void ProcessTrace(const ItraceDispatch& trace);

    /// Store the processed trace in the specified file stream
    void StoreTrace(std::ofstream& fs);

    /// Store static information about BBL offsets in the kernel in the specified file stream
    void StoreBblBoundsInfo(std::ofstream& fs);

    /// Store Global Thread Identifier
    void StoreGlobalTid(uint32_t gtid, std::ofstream& fs);

    /// Store the specified value in the specified file stream in the binary format
    template <typename T> void Store(const T& val, std::ofstream& fs) { fs.write((const char*)&val, sizeof(val)); }

    /// Store kernel's Control Flow Graph
    void StoreCfg(std::ofstream& fs);

private:

    using TraceRecord = std::pair<uint32_t, uint32_t>;         ///< Trace record
    using TraceRecordList  = std::list<TraceRecord>;          ///< List of ItraceRecords per single HW thread
    using PerTileTraceRecords = std::vector<TraceRecordList>; ///< Per tile Itrace records
    typedef std::map<ItraceKernel::Edge, uint32_t> ItraceCfg;   ///< Weighted control-flow graph

private:
    const ItraceKernel*              _kernel;            ///< Kernel&traces to be processed
    std::string                      _kernelDir;         ///< Directory to store kernel's trace files
    std::vector<PerTileTraceRecords> _itraceRecords;     ///< Map of tile ID to list of BBL trace with their frequencies, indexed by the thread ID
    std::vector<uint32_t>            _numProfiledThreads;///< Number of profiled (active) threads per tile
    ItraceCfg                        _cfg;               ///< Control-flow graph
    static const char*               _traceFileName;     ///< Name of the file to store trace in
};

#endif
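The static section of the .bin format, as described in the ItracePostProcessor comment above, packs two 16-bit offsets into one 32-bit value: the low half is the BBL offset and the high half is either 0xFFFF or the offset of the terminating control-flow instruction. A minimal sketch of decoding and re-encoding that value follows; the names are hypothetical, and it deals only with the 32-bit value, making no assumption about how the value is laid out in the file.

```cpp
#include <cstdint>
#include <optional>

// Decoded form of the per-BBL "Offset" field of the static section.
struct BblOffsets
{
    uint16_t head;                 // XXXX: offset of the BBL in the original binary
    std::optional<uint16_t> tail;  // YYYY: offset of the ending control-flow instruction
};

// Hypothetical decoder: 0xFFFFXXXX means "no control-flow instruction at the end",
// 0xYYYYXXXX means the BBL ends with a control-flow instruction at offset YYYY.
BblOffsets DecodeBblOffset(uint32_t value)
{
    BblOffsets result{static_cast<uint16_t>(value & 0xFFFF), std::nullopt};
    uint16_t high = static_cast<uint16_t>(value >> 16);
    if (high != 0xFFFF) result.tail = high;
    return result;
}

// Hypothetical inverse, e.g. for building test inputs.
uint32_t EncodeBblOffset(uint16_t head, std::optional<uint16_t> tail)
{
    return (uint32_t(tail.value_or(0xFFFF)) << 16) | head;
}
```

For example, BBL 0 of the sample trace (head at 0x00, control-flow instruction at 0xc8) would be encoded as 0x00C80000.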

itrace.cpp

/*========================== begin_copyright_notice ============================
Copyright (C) 2018-2025 Intel Corporation

SPDX-License-Identifier: MIT
============================= end_copyright_notice ===========================*/

/*!
 * @file Implementation of the Itrace tool
 */

#include <fstream>

#include "itrace.h"
#include "gtpin_tool_utils.h"

using namespace gtpin;
using namespace std;

/* ============================================================================================= */
// Configuration
/* ============================================================================================= */
Knob<int>  gKnobMaxTraceBufferInMB("max_buffer_mb", 3072, "itrace - the max allowed size of the trace buffer per kernel in MB");
Knob<int>  gKnobPhase("phase", 0, "tracing tool - processing phase\n { 1 - pre-processing, 2 - processing - trace gathering}");
Knob<bool> gKnobCfgOnly("cfg-only", false, "indicates that the collected trace should not be saved - only resulting cfg file");

/* ============================================================================================= */
// ItraceDispatch implementation
/* ============================================================================================= */
bool ItraceDispatch::ReadTrace(const GtProfileTrace& traceAccessor, const IGtProfileBuffer& profileBuffer)
{
    uint32_t traceSize = traceAccessor.Size(profileBuffer);
    _rawTrace.resize(traceSize);
    _isTrimmed = traceAccessor.IsTruncated(profileBuffer);
    return traceAccessor.Read(profileBuffer, _rawTrace.data(), 0, traceSize);
}

bool ItraceDispatch::IsEmpty() const
{
    return _rawTrace.size() < sizeof(ItraceRecord);
}

/* ============================================================================================= */
// ItraceKernel implementation
/* ============================================================================================= */
ItraceKernel::ItraceKernel(const IGtKernelInstrument& kernelInstrument, uint32_t numTiles) : _numTiles(numTiles)
{
    const IGtKernel& kernel = kernelInstrument.Kernel();
    const IGtCfg& cfg = kernelInstrument.Cfg();

    _name       = GlueString(kernel.Name());
    _extName    = ExtendedKernelName(kernel);
    _platform   = kernel.GpuPlatform();
    _genId      = kernel.GenModel().Id();
    _asmText    = CfgAsmText(cfg);
    _uniqueName = kernel.UniqueName();

    // Initialize trace accessor. The trace capacity is expected to be computed during the preprocessing phase.
    uint64_t traceCapacity = ItracePreProcessor::Instance()->TraceSize(_extName);
    if (traceCapacity == 0)
    {
        // Unknown trace capacity
        GTPIN_WARNING("ITRACE: unknown trace capacity for kernel " + _name + ". Assuming the kernel is filtered out. "
                      "Allocating a buffer of 8KB size. If the kernel is supposed to run, expect buffer overflow. "
                      "In this case, please re-run phase 1 and make sure the kernel is not filtered out.");
        traceCapacity = 0x2000;
    }
    else
    {
        traceCapacity += 0x2000; // Add some space to account for possible fluctuation of trace sizes between phases
        if (traceCapacity > UINT32_MAX)
        {
            GTPIN_WARNING("ITRACE: The kernel " + _name + " exceeded maximum trace capacity.");
            traceCapacity = UINT32_MAX;
        }
    }
    if (traceCapacity > (uint64_t(gKnobMaxTraceBufferInMB) * 0x100000))
    {
        GTPIN_WARNING("ITRACE: required capacity (" + DecStr(traceCapacity) + ") for kernel " + _name + " is too big - cut to " + DecStr(gKnobMaxTraceBufferInMB) + "MB. "
                      "Expect the final trace to contain partial data.");
        traceCapacity = uint64_t(gKnobMaxTraceBufferInMB) * 0x100000;
    }
    uint32_t maxRecordSize = sizeof(ItraceRecord);
    _traceAccessor = GtProfileTrace((uint32_t)traceCapacity, maxRecordSize);
    _traceAccessor.Allocate(kernelInstrument.ProfileBufferAllocator());
    // Fill basic block offsets info
    for (auto bblPtr : cfg.Bbls())
    {
        const IGtBbl&     bbl = *bblPtr;
        BblId           bblId = bbl.Id();
        const IGtIns& insHead = bbl.FirstIns();
        const IGtIns& insTail = bbl.LastIns();
        uint32_t offsetHead   = cfg.GetInstructionOffset(insHead);
        uint32_t offsetTail   = insTail.IsChangingIP() ? uint32_t(cfg.GetInstructionOffset(insTail)) : 0xFFFFFFFF;
        _bblBoundsMap[bblId]  = BblBounds(offsetHead, offsetTail);
        const EdgeSpan& outgoingEdges = bbl.OutgoingEdges();
        for (auto outEdge : outgoingEdges)
        {
            const IGtBbl& dstBbl = outEdge->DstBbl();
            uint32_t dstBblId = dstBbl.Id();
            _edges.emplace(bblId, dstBblId);
        }
    }
}

ItraceDispatch& ItraceKernel::AddItrace(IGtKernelDispatch& kernelDispatch)
{
    // Create a new ItraceDispatch object and store the entire trace within this object
    _traces.emplace_back(kernelDispatch);
    ItraceDispatch& ItraceDispatch = _traces.back();
    if (!ItraceDispatch.ReadTrace(_traceAccessor, *kernelDispatch.GetProfileBuffer()))
    {
        GTPIN_ERROR_MSG("ITRACE: Failed to read profile buffer for kernel " + _name);
    }
    return ItraceDispatch;
}

void ItraceKernel::DumpAsm() const
{
    DumpKernelAsmText(_name, _uniqueName, _asmText);
}

/* ============================================================================================= */
// Itrace implementation
/* ============================================================================================= */
Itrace* Itrace::Instance()
{
    static Itrace instance;
    return &instance;
}

void Itrace::OnKernelBuild(IGtKernelInstrument& instrumentor)
{
    const IGtKernel& kernel = instrumentor.Kernel();
    uint32_t numTiles = (instrumentor.Coder().IsTileIdSupported()) ? GTPin_GetCore()->GenArch().MaxTiles(kernel.GpuPlatform()) : 1;

    // Create a new ItraceKernel object and add it to the database
    auto ret = _kernels.emplace(piecewise_construct, forward_as_tuple(kernel.Id()), forward_as_tuple(instrumentor, numTiles));
    if (ret.second)
    {
        ItraceKernel& ItraceKernel = (*ret.first).second;
        if (!ItraceKernel.IsEnabled())
        {
            GTPIN_WARNING("ITRACE: The trace won't be generated for kernel " + ItraceKernel.Name());
            return;
        }

        const IGtCfg& cfg = instrumentor.Cfg();
        IGtVregFactory& vregs = instrumentor.Coder().VregFactory();

        // Initialize virtual registers
        _addrReg = vregs.MakeMsgAddrScratch();
        _dataReg = vregs.MakeMsgDataScratch();
        _offsetReg = vregs.MakeScratch(VREG_TYPE_DWORD);
        _tileIdReg = vregs.Make(VREG_TYPE_DWORD);

        GtGenProcedure preCode;
        instrumentor.Coder().LoadTileId(preCode, _tileIdReg);

        // Instrument kernel entries
        instrumentor.InstrumentEntries(preCode);

        // Instrument basic blocks
        for (auto bblPtr : cfg.Bbls())
        {
            InstrumentBbl(instrumentor, *bblPtr, ItraceKernel);
        }
    }
}

void Itrace::OnKernelRun(IGtKernelDispatch& dispatcher)
{
    bool isProfileEnabled = false;

    const IGtKernel& kernel = dispatcher.Kernel();
    GtKernelExecDesc execDesc; dispatcher.GetExecDescriptor(execDesc);
    if (kernel.IsInstrumented() && IsKernelExecProfileEnabled(execDesc, kernel.GpuPlatform(), kernel.Name().Get()))
    {
        auto it = _kernels.find(kernel.Id());
        if (it != _kernels.end())
        {
            const ItraceKernel& ItraceKernel = it->second;
            if (ItraceKernel.IsEnabled())
            {
                IGtProfileBuffer* buffer = dispatcher.CreateProfileBuffer(); GTPIN_ASSERT(buffer);
                const GtProfileTrace& traceAccessor = ItraceKernel.TraceAccessor();
                if (traceAccessor.Initialize(*buffer))
                {
                    isProfileEnabled = true;
                }
                else
                {
                    GTPIN_ERROR_MSG("ITRACE: Failed to write into memory buffer for kernel " + string(kernel.Name()));
                }
            }
        }
    }
    dispatcher.SetProfilingMode(isProfileEnabled);
}

void Itrace::OnKernelComplete(IGtKernelDispatch& dispatcher)
{
    if (!dispatcher.IsProfilingEnabled())
    {
        return; // Do nothing with unprofiled kernel dispatches
    }

    const IGtKernel& kernel = dispatcher.Kernel();
    auto it = _kernels.find(kernel.Id());
    if (it != _kernels.end())
    {
        // Read the trace from the profile buffer
        ItraceKernel& ItraceKernel = it->second;
        ItraceKernel.AddItrace(dispatcher);
    }
}

bool Itrace::InstrumentBbl(IGtKernelInstrument& instrumentor, const IGtBbl& bbl, const ItraceKernel& ItraceKernel)
{
    const IGtGenCoder& coder = instrumentor.Coder();
    const IGtCfg& cfg = instrumentor.Cfg();

    // Generate code that allocates space for the new record in the trace and stores the trace record.
    // Insert this procedure before the first instruction in the basic block.
    GtGenProcedure headerProc;
    auto firstInsIt = bbl.Instructions().begin();
    const IGtIns& firstIns = cfg.GetInstruction((*firstInsIt)->Id());
    StoreRecord(headerProc, coder, bbl, ItraceKernel, sizeof(ItraceRecord));
    instrumentor.InstrumentInstruction(firstIns, GtIpoint::Before(), headerProc);

    return true;
}

void Itrace::StoreRecord(GtGenProcedure& proc, const IGtGenCoder& coder, const IGtBbl& bbl,
    const ItraceKernel& ItraceKernel, uint32_t recordSize)
00235 {
00236     IGtInsFactory& insF = coder.InstructionFactory();
00237 
00238     GtPredicate    predicate(FlagReg(0));
00239 
00240     // Set values of ItraceRecord fields in _dataReg
00241     proc += insF.MakeShl(_dataReg, StateReg(0), 16);                    // _dataReg[16:31] = sr0.0
00242     proc += insF.MakeAdd(_dataReg, _dataReg, GtImmU32(bbl.Id()));       // _dataReg[0:15]  = bbl.Id()
00243 
00244     // Allocate new record in the trace.
00245     // Set _offsetReg = offset of the allocated record in the profile buffer, _addrReg = address of the allocated record
00246     ItraceKernel.TraceAccessor().ComputeNewRecordOffset(coder, proc, recordSize, _offsetReg);
00247     coder.ComputeAddress(proc, _addrReg, _offsetReg);
00248 
00249     // Zero _offsetReg if the trace buffer has overflowed (predicate == true)
00250     proc += insF.MakeMov(_offsetReg, 0).SetPredicate(predicate);
00251 
00252     // Store sr0.0 and BBL ID
00253     // if (!predicate) STORE buffer[_addrReg] = _dataReg;
00254     proc += insF.MakeAtomicStore(_addrReg, _dataReg).SetPredicate(!predicate);
00255 
00256     // Store tile ID: _dataReg = _tileIdReg; _addrReg += offsetof(ItraceRecord, tileId);
00257     // if (!predicate) STORE buffer[_addrReg] = _dataReg;
00258     proc += insF.MakeMov(_dataReg, _tileIdReg);
00259     coder.ComputeRelAddress(proc, _addrReg, _addrReg, offsetof(ItraceRecord, tileId));
00260     proc += insF.MakeAtomicStore(_addrReg, _dataReg).SetPredicate(!predicate);
00261 
00262     if (!proc.empty()) { proc.front()->AppendAnnotation(__func__); }
00263 }
00264 
00265 void Itrace::OnFini()
00266 {
00267     Itrace& me = *Instance();
00268     IGtCore* gtpinCore = GTPin_GetCore();
00269     for (auto& ref : me._kernels)
00270     {
00271         const ItraceKernel& ItraceKernel = ref.second;
00272         ItracePostProcessor(*gtpinCore, ItraceKernel)();
00273         ItraceKernel.DumpAsm();
00274     }
00275 }
00276 
00277 /* ============================================================================================= */
00278 // ItracePreProcessor implementation
00279 /* ============================================================================================= */
00280 const char* ItracePreProcessor::_kernelPreProcessFileName = "itrace_pre_process.txt";
00281 const char* ItracePreProcessor::_dispatchPreProcessFileName = "itrace_pre_process_dispatch.txt";
00282 
00283 ItracePreProcessor::ItracePreProcessor()
00284 {
00285     if (gKnobPhase == 2)
00286     {
00287         // Read the data collected during the preprocessing phase
00288         std::ifstream is(_kernelPreProcessFileName);
00289         GTPIN_ASSERT_MSG(is, string("File ") + _kernelPreProcessFileName + " does not exist. The trace won't be generated");
00290         is >> _kernelCounters;
00291     }
00292     else if (gKnobPhase == 1)
00293     {
00294         // Create the pre_process files, or clear their content if they already exist
00295         CreateCleanFile(_kernelPreProcessFileName);
00296         CreateCleanFile(_dispatchPreProcessFileName);
00297     }
00298 }
00299 
00300 ItracePreProcessor* ItracePreProcessor::Instance()
00301 {
00302     static ItracePreProcessor instance;
00303     return &instance;
00304 }
00305 
00306 void ItracePreProcessor::OnFini()
00307 {
00308     ItracePreProcessor& tool = *Instance();
00309     tool.DumpKernelProfiles(_kernelPreProcessFileName);
00310     tool.DumpDispatchProfiles(_dispatchPreProcessFileName);
00311 }
00312 
00313 uint64_t ItracePreProcessor::TraceSize(const string& extKernelName) const
00314 {
00315     auto it = _kernelCounters.find(extKernelName);
00316     return ((it == _kernelCounters.end()) ? 0 : it->second.weight);
00317 }
00318 
00319 uint32_t ItracePreProcessor::GetBblWeight(IGtKernelInstrument&, const IGtBbl&) const
00320 {
00321     return sizeof(ItraceRecord);
00322 }
00323 
00324 void ItracePreProcessor::AggregateDispatchCounters(KernelWeightCounters& kc, KernelWeightCounters dc) const
00325 {
00326     kc.weight = std::max(kc.weight, dc.weight);
00327     kc.freq += dc.freq;
00328 }
00329 
00330 /* ============================================================================================= */
00331 // ItracePostProcessor implementation
00332 /* ============================================================================================= */
00333 const char* ItracePostProcessor::_traceFileName = "itrace_compressed.bin";
00334 
00335 ItracePostProcessor::ItracePostProcessor(const IGtCore& gtpinCore, const ItraceKernel& ItraceKernel) :
00336     _kernel(&ItraceKernel),
00337     _kernelDir(JoinPath(string(gtpinCore.ProfileDir()), ItraceKernel.UniqueName())) {}
00338 
00339 bool ItracePostProcessor::operator()()
00340 {
00341     if (!MakeDirectory(_kernelDir))
00342     {
00343         GTPIN_WARNING("ITRACE: Could not create directory " + _kernelDir);
00344         return false;
00345     }
00346 
00347     // Process traces recorded in kernel dispatches
00348     for (const ItraceDispatch& trace : _kernel->GetTraces())
00349     {
00350         if (!trace.IsEmpty())
00351         {
00352             if (trace.IsTrimmed())
00353             {
00354                 GTPIN_WARNING("ITRACE: Detected trace buffer overflow in kernel " + _kernel->Name());
00355             }
00356 
00357             ProcessTrace(trace);
00358 
00359             string subdir = trace.KernelExecDesc().ToString(_kernel->Platform(), ExecDescFileNameFormat());
00360             string dir = MakeSubDirectory(_kernelDir, subdir);
00361 
00362             if (!gKnobCfgOnly)
00363             {
00364                 string filePath = JoinPath(dir, _traceFileName);
00365                 ofstream fs(filePath, std::ios::binary);
00366                 if (!fs)
00367                 {
00368                     GTPIN_WARNING("ITRACE: Could not create file " + filePath);
00369                     continue;
00370                 }
00371                 StoreTrace(fs);
00372             }
00373 
00374             string cfgFilePath = JoinPath(dir, "itrace_total.cfg");
00375             ofstream cfgfs(cfgFilePath);
00376             if (!cfgfs)
00377             {
00378                 GTPIN_WARNING("ITRACE: Could not create file " + cfgFilePath);
00379                 continue;
00380             }
00381             StoreCfg(cfgfs);
00382         }
00383     }
00384 
00385     return true;
00386 }
00387 
00388 void ItracePostProcessor::ProcessTrace(const ItraceDispatch& trace)
00389 {
00390     const uint8_t* traceData = trace.Data();
00391     uint32_t       traceSize = trace.Size();
00392 
00393     // Accessors used to associate trace records with HW threads (see threadRecords below)
00394     const GtStateRegAccessor& sra = _kernel->GenModel().StateRegAccessor();
00395     uint32_t maxThreads = _kernel->GenModel().MaxThreads(); // Max number of HW threads
00396 
00397     // Build control-flow graph
00398     _cfg.clear();
00399     ItraceKernel::Edges edges = _kernel->GetEdges();
00400     for (auto& edge : edges)
00401     {
00402         _cfg.emplace(edge, 0);
00403     }
00404 
00405     // Reference to the trace record
00406     struct Record
00407     {
00408         const ItraceRecord* header; ///< Pointer to the header of the record
00409         uint32_t              size; ///< Size of the record in bytes, including header
00410     };
00411 
00412     using RecordList = std::list<Record>;           ///< List of Records
00413     using PerTileRecords = std::vector<RecordList>; ///< Per tile Records
00414     std::vector<PerTileRecords> threadRecords;      ///< Vector of per tile records. Tile ID is an index to this vector
00415 
00416     threadRecords.resize(_kernel->NumTiles());
00417     _itraceRecords.resize(_kernel->NumTiles());
00418     _numProfiledThreads.resize(_kernel->NumTiles(), 0);
00419 
00420     for (uint32_t tile = 0; tile < _kernel->NumTiles(); tile++)
00421     {
00422         threadRecords[tile].resize(maxThreads);
00423 
00424         _itraceRecords[tile].clear();
00425         _itraceRecords[tile].resize(maxThreads);
00426     }
00427 
00428     for (uint32_t recordOffset = 0; recordOffset + sizeof(ItraceRecord) <= traceSize;)
00429     {
00430         // Retrieve the global thread ID and tile ID from the record
00431         const ItraceRecord* record = (const ItraceRecord*)(traceData + recordOffset);
00432         uint32_t tid = sra.GetGlobalTid(record->sr0);
00433         uint32_t tileId = record->tileId; GTPIN_ASSERT(tileId < _kernel->NumTiles());
00434         uint32_t recordSize = sizeof(ItraceRecord);
00435         if (recordOffset + recordSize > traceSize)
00436         {
00437             break; // end of trace
00438         }
00439 
00440         auto& tileRecords = threadRecords[tileId];
00441         auto& records = tileRecords[tid];
00442 
00443         // Add a new trace record reference to threadRecords
00444         if (records.empty()) { ++_numProfiledThreads[tileId]; } // Increment thread count on the first relevant record
00445         records.emplace_back(Record{ record, recordSize });
00446 
00447         recordOffset += recordSize;
00448     }
00449 
00450     for (uint32_t tileId = 0; tileId < threadRecords.size(); tileId++)
00451     {
00452         if (_numProfiledThreads[tileId] == 0)
00453         {
00454             continue;
00455         }
00456 
00457         const auto& tileRecords = threadRecords[tileId];
00458         auto& tileItraceRecords = _itraceRecords[tileId];
00459 
00460         // Store per-thread traces
00461         for (uint32_t tid = 0; tid < maxThreads; tid++)
00462         {
00463             const auto& threadRecordList = tileRecords[tid];
00464             auto& records = tileItraceRecords[tid];
00465 
00466             if (threadRecordList.empty())
00467             {
00468                 continue;
00469             }
00470 
00471             // Store trace records
00472             BblId prevBblId;
00473             for (const auto& record : threadRecordList)
00474             {
00475                 const auto& header = *(record.header);
00476                 BblId       bblId  = header.bblId;
00477 
00478                 if (prevBblId != bblId)
00479                 {
00480                     records.emplace_back(bblId, 1);
00481                 }
00482                 else
00483                 {
00484                     auto& r = records.back();
00485                     r.second++;
00486                 }
00487 
00488                 if (prevBblId.IsValid())
00489                 {
00490                     auto it = _cfg.find(ItraceKernel::Edge(prevBblId, bblId));
00491                     if (it != _cfg.end())
00492                     {
00493                         it->second += 1;
00494                     }
00495                 }
00496 
00497                 prevBblId = bblId;
00498             }
00499         }
00500     }
00501 }
00502 
00503 void ItracePostProcessor::StoreTrace(std::ofstream& fs)
00504 {
00505     StoreBblBoundsInfo(fs);
00506 
00507     // Compute and store the number of involved tiles
00508     uint32_t numOfTiles = 0;
00509     for (uint32_t i = 0; i < _numProfiledThreads.size(); i++)
00510     {
00511         numOfTiles += (_numProfiledThreads[i] == 0) ? 0 : 1;
00512     }
00513     Store(numOfTiles, fs);
00514 
00515     for (uint32_t tileId = 0; tileId < _itraceRecords.size(); tileId++)
00516     {
00517         if (_numProfiledThreads[tileId] == 0) { continue; }
00518 
00519         const auto& tileRecords = _itraceRecords[tileId];
00520 
00521         Store(tileId, fs);
00522 
00523         // Store the number of profiled threads
00524         Store(_numProfiledThreads[tileId], fs);
00525 
00526         // Store per-thread traces
00527         for (uint32_t tid = 0; tid < tileRecords.size(); tid++)
00528         {
00529             const auto& records = tileRecords[tid];
00530 
00531             if (records.empty())
00532             {
00533                 continue;
00534             }
00535 
00536             StoreGlobalTid(tid, fs);    // Store Global Thread Identifier
00537 
00538             uint32_t numRecords = (uint32_t)records.size();
00539             Store(numRecords, fs);      // Store #records collected in the thread
00540 
00541             // Store trace records
00542             for (const auto& record : records)
00543             {
00544                 uint32_t bblId = record.first;
00545                 uint32_t loopCount = record.second;
00546                 Store(bblId, fs);       // Store BBL ID
00547                 Store(loopCount, fs);   // Store loopCount
00548             }
00549         }
00550     }
00551 }
00552 
00553 void ItracePostProcessor::StoreBblBoundsInfo(std::ofstream& fs)
00554 {
00555     // Store static bounds information for all BBLs
00556     ItraceKernel::BblBoundsMap bblBoundsMap = _kernel->GetBblBounds();
00557     uint32_t numBbls = (uint32_t)bblBoundsMap.size();
00558     Store(numBbls, fs);                                 // Store the number of BBLs
00559 
00560     for (uint32_t bblId = 0; bblId < numBbls; bblId++)
00561     {
00562         auto bounds = bblBoundsMap[bblId];
00563         Store(bblId, fs);                               // Store BBL ID
00564         uint32_t val = bounds.first;
00565         Store(val, fs);
00566         val = bounds.second;
00567         Store(val, fs);
00568     }
00569 }
00570 
00571 void ItracePostProcessor::StoreGlobalTid(uint32_t gtid, std::ofstream& fs)
00572 {
00573     const GtStateRegAccessor& sra = _kernel->GenModel().StateRegAccessor();
00574     uint32_t sr0 = sra.SetGlobalTid(0, gtid);
00575 
00576     auto storeSr0Field = [&](const ScatteredBitFieldU32& sbf)
00577     {
00578         uint32_t val = (sbf.IsEmpty() ? UINT32_MAX : sbf.GetValue(sr0));
00579         Store(val, fs);
00580     };
00581 
00582     storeSr0Field(sra.SliceIdField());
00583     storeSr0Field(sra.DualSubSliceIdField());
00584     storeSr0Field(sra.SubSliceIdField());
00585     storeSr0Field(sra.EuIdField());
00586     storeSr0Field(sra.ThreadSlotField());
00587 }
00588 
00589 void ItracePostProcessor::StoreCfg(std::ofstream& fs)
00590 {
00591     ostringstream ostr;
00592     ostr << "srcBBL, dstBBL, Frequency" << std::endl;
00593     ostr << "=========================" << std::endl;
00594     for (const auto& it : _cfg)
00595     {
00596         ItraceKernel::Edge edge = it.first;
00597         uint32_t frequency = it.second;
00598 
00599         // print srcBBL, dstBBL, frequency
00600         ostr << std::dec << edge.first << "," << edge.second << "," << frequency << std::endl;
00601     }
00602     fs << ostr.str();
00603     fs.close();
00604 }
00605 
00606 /* ============================================================================================= */
00607 // GTPin_Entry
00608 /* ============================================================================================= */
00609 EXPORT_C_FUNC void GTPin_Entry(int argc, const char* argv[])
00610 {
00611     ConfigureGTPin(argc, argv);
00612     if (gKnobPhase == 1)
00613     {
00614         ItracePreProcessor::Instance()->Register();
00615         atexit(ItracePreProcessor::OnFini);
00616     }
00617     else
00618     {
00619         GTPIN_ASSERT_MSG((gKnobPhase == 2), "Itrace: Invalid phase value. Should be 1 or 2, provided " + std::to_string(gKnobPhase));
00620         Itrace::Instance()->Register();
00621         atexit(Itrace::OnFini);
00622     }
00623 }
00624 
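StoreRecord() in the listing above builds each trace record in two steps: the first dword is `(sr0.0 << 16) + bbl.Id()`, and the tile ID is stored at `offsetof(ItraceRecord, tileId)`. The following is a minimal host-side sketch of that packing and the matching decode. It rests on assumptions not visible in this listing: the `ItraceRecord` definition lives elsewhere, so `DecodedRecord`, `PackIdDword()`, `UnpackIdDword()` and `DecodeRecordBytes()` are hypothetical names; the tile ID is assumed to start at byte offset 4; the record bytes are assumed little-endian; and BBL IDs are assumed to fit in 16 bits.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical host-side view of one record written by StoreRecord().
// The real ItraceRecord layout is defined outside this listing.
struct DecodedRecord {
    uint16_t bblId;   // bits [0:15] of the first dword
    uint16_t sr0Low;  // bits [16:31] of the first dword: the low 16 bits of sr0.0
    uint32_t tileId;  // second dword (assumed to start at byte offset 4)
};

// Same arithmetic as the MakeShl + MakeAdd pair: (sr0 << 16) + bblId.
// In 32-bit arithmetic the shift discards the upper 16 bits of sr0.
uint32_t PackIdDword(uint32_t sr0, uint32_t bblId) {
    return (sr0 << 16) + bblId;  // assumes bblId < 65536, so the add never carries into bit 16
}

// Splits the first dword back into its BBL ID and sr0 halves.
DecodedRecord UnpackIdDword(uint32_t idDword, uint32_t tileId) {
    DecodedRecord r;
    r.bblId = static_cast<uint16_t>(idDword & 0xFFFFu);
    r.sr0Low = static_cast<uint16_t>(idDword >> 16);
    r.tileId = tileId;
    return r;
}

// Decodes a raw 8-byte record; assumes host and GPU are both little-endian.
DecodedRecord DecodeRecordBytes(const uint8_t* bytes) {
    uint32_t idDword = 0;
    uint32_t tileId = 0;
    std::memcpy(&idDword, bytes, sizeof(idDword));
    std::memcpy(&tileId, bytes + 4, sizeof(tileId));
    return UnpackIdDword(idDword, tileId);
}
```

In ProcessTrace() the sr0 value recovered this way is passed to `GtStateRegAccessor::GetGlobalTid()` to identify the HW thread that produced the record.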

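ProcessTrace() in the listing above compresses each thread's sequence of BBL IDs into `(bblId, loopCount)` runs and, for every pair of consecutive records, increments the frequency of the matching CFG edge. The sketch below restates that per-thread step on plain integers. `CompressThreadTrace()`, `BblRun` and `Edge` are hypothetical names, `BblId` is replaced by `uint32_t`, and the first record is detected with a `hasPrev` flag instead of an invalid default-constructed `BblId`.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using BblRun = std::pair<uint32_t, uint32_t>;  // (bblId, loopCount)
using Edge = std::pair<uint32_t, uint32_t>;    // (srcBbl, dstBbl)

// Collapses consecutive repeats of the same BBL into (bblId, count) runs and
// counts transitions along edges already present in edgeFreq.
std::vector<BblRun> CompressThreadTrace(const std::vector<uint32_t>& bblSequence,
                                        std::map<Edge, uint32_t>& edgeFreq) {
    std::vector<BblRun> runs;
    bool hasPrev = false;
    uint32_t prev = 0;
    for (uint32_t bbl : bblSequence) {
        if (!hasPrev || prev != bbl) {
            runs.emplace_back(bbl, 1);   // start a new run
        } else {
            runs.back().second++;        // same BBL again: extend the current run
        }
        if (hasPrev) {
            auto it = edgeFreq.find(Edge(prev, bbl));
            if (it != edgeFreq.end()) {
                it->second += 1;         // transitions missing from the static CFG are ignored
            }
        }
        prev = bbl;
        hasPrev = true;
    }
    return runs;
}
```

Because unknown transitions are skipped rather than inserted, matching the `_cfg.find()` check in the listing, itrace_total.cfg reports frequencies only for edges that exist in the kernel's static CFG. A self-loop such as `(1, 1)` is counted once per repeated execution, even though the run-length encoding merges those repeats into one run.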

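StoreTrace(), StoreBblBoundsInfo() and StoreGlobalTid() in the listing above define the layout of itrace_compressed.bin. The file starts with `numBbls`, followed by `(bblId, first, second)` for each BBL. Next comes `numOfTiles`. Each tile then contributes its `tileId`, its profiled-thread count, and, for each thread, the five sr0 fields, `numRecords`, and `(bblId, loopCount)` pairs. The reader below walks that layout in the same order. It is a sketch under assumptions not visible in this listing: `Store()` is assumed to write each value's raw bytes in host byte order, every stored value (including the per-tile thread count) is assumed to be a `uint32_t`, and all type and function names are hypothetical.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

struct BblBounds { uint32_t bblId, first, second; };

struct ThreadTrace {
    // Fields written by StoreGlobalTid(); UINT32_MAX means the field is absent on the platform.
    uint32_t sliceId, dualSubSliceId, subSliceId, euId, threadSlot;
    std::vector<std::pair<uint32_t, uint32_t>> runs;  // (bblId, loopCount)
};

struct TileTrace { uint32_t tileId; std::vector<ThreadTrace> threads; };
struct CompressedTrace { std::vector<BblBounds> bbls; std::vector<TileTrace> tiles; };

// Reads one uint32_t in host byte order (assumed to match what Store() wrote).
static uint32_t ReadU32(std::istream& is) {
    uint32_t v = 0;
    if (!is.read(reinterpret_cast<char*>(&v), sizeof(v))) {
        throw std::runtime_error("itrace_compressed.bin: unexpected end of data");
    }
    return v;
}

// Appends one uint32_t in host byte order; mirrors the assumed behavior of Store().
void AppendU32(std::string& buf, uint32_t v) {
    buf.append(reinterpret_cast<const char*>(&v), sizeof(v));
}

CompressedTrace ReadCompressedTrace(std::istream& is) {
    CompressedTrace trace;
    const uint32_t numBbls = ReadU32(is);              // written by StoreBblBoundsInfo()
    for (uint32_t i = 0; i < numBbls; ++i) {
        BblBounds b;
        b.bblId = ReadU32(is);
        b.first = ReadU32(is);
        b.second = ReadU32(is);
        trace.bbls.push_back(b);
    }
    const uint32_t numTiles = ReadU32(is);             // tiles with at least one profiled thread
    for (uint32_t t = 0; t < numTiles; ++t) {
        TileTrace tile;
        tile.tileId = ReadU32(is);
        const uint32_t numThreads = ReadU32(is);       // _numProfiledThreads[tileId]
        for (uint32_t h = 0; h < numThreads; ++h) {
            ThreadTrace th;
            th.sliceId = ReadU32(is);                  // StoreGlobalTid() field order
            th.dualSubSliceId = ReadU32(is);
            th.subSliceId = ReadU32(is);
            th.euId = ReadU32(is);
            th.threadSlot = ReadU32(is);
            const uint32_t numRecords = ReadU32(is);
            for (uint32_t r = 0; r < numRecords; ++r) {
                const uint32_t bblId = ReadU32(is);
                const uint32_t loopCount = ReadU32(is);
                th.runs.emplace_back(bblId, loopCount);
            }
            tile.threads.push_back(std::move(th));
        }
        trace.tiles.push_back(std::move(tile));
    }
    return trace;
}

// Convenience wrapper for data already loaded into memory.
CompressedTrace ReadCompressedTraceFromBytes(const std::string& bytes) {
    std::istringstream is(bytes);
    return ReadCompressedTrace(is);
}
```

For a file on disk, open it with `std::ifstream f(path, std::ios::binary)` and pass the stream to `ReadCompressedTrace()`.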

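StoreCfg() in the listing above writes itrace_total.cfg as plain text: a header line, a separator line, and one `srcBBL,dstBBL,frequency` line per CFG edge. The following is a minimal parser sketch for that layout. `CfgEdgeFreq` and `ReadItraceCfg()` are hypothetical names, and any line that does not match the three-field comma-separated form is skipped.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct CfgEdgeFreq { uint32_t src, dst, frequency; };

// Parses the edge lines written by StoreCfg(), skipping the two header lines.
std::vector<CfgEdgeFreq> ReadItraceCfg(std::istream& is) {
    std::vector<CfgEdgeFreq> edges;
    std::string line;
    std::getline(is, line);  // "srcBBL, dstBBL, Frequency"
    std::getline(is, line);  // "=========================" separator
    while (std::getline(is, line)) {
        if (line.empty()) continue;
        std::istringstream ls(line);
        CfgEdgeFreq e{};
        char c1 = 0, c2 = 0;
        // Expect "src,dst,frequency" in decimal, as produced with std::dec.
        if ((ls >> e.src >> c1 >> e.dst >> c2 >> e.frequency) && c1 == ',' && c2 == ',') {
            edges.push_back(e);
        }
    }
    return edges;
}
```

An edge with frequency 0 is still listed, because StoreCfg() writes every edge of the static CFG, including edges that no profiled thread took.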

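On the pre-processing side, AggregateDispatchCounters() folds per-dispatch counters into a per-kernel entry by keeping the largest weight and summing the frequencies. TraceSize() then returns 0 for any kernel that has no entry. The sketch below restates those two rules and adds a reader for the two-column `kernelName value` lines shown earlier for the kernel pre-process file. The reader is an assumption, because this listing does not show how `operator>>` fills `_kernelCounters`. `WeightCounters`, `AggregateDispatch()`, `ReadPreProcessFile()` and `LookupTraceSize()` are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical stand-in for KernelWeightCounters (the real type is defined elsewhere).
struct WeightCounters { uint64_t weight; uint64_t freq; };

// Same rule as AggregateDispatchCounters(): keep the largest weight, sum the frequencies.
void AggregateDispatch(WeightCounters& kernel, const WeightCounters& dispatch) {
    kernel.weight = std::max(kernel.weight, dispatch.weight);
    kernel.freq += dispatch.freq;
}

// Reads whitespace-separated "kernelName value" pairs (assumed file layout).
std::map<std::string, uint64_t> ReadPreProcessFile(std::istream& is) {
    std::map<std::string, uint64_t> sizes;
    std::string name;
    uint64_t weight = 0;
    while (is >> name >> weight) {
        sizes[name] = weight;
    }
    return sizes;
}

// Like TraceSize(): kernels without an entry get 0.
uint64_t LookupTraceSize(const std::map<std::string, uint64_t>& sizes, const std::string& name) {
    auto it = sizes.find(name);
    return (it == sizes.end()) ? 0 : it->second;
}
```

Keeping the maximum rather than the sum presumably sizes the phase-2 trace for the largest single dispatch, since OnKernelRun() creates a separate profile buffer for each dispatch.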

  Copyright (C) 2013-2025 Intel Corporation
SPDX-License-Identifier: MIT