This tool is not yet stable, so please send me a Teams message (Euijun Chung) if you run into any issues.
Warning: The trace generation tool may generate thousands of GBs of macsim traces on your machine in just a few minutes, so use at your own risk.
If you are working on rover, you can skip this step.
- `git clone` this repository.
- `cd tools/main && make`; you should see `main.so` if compiled successfully.
- Run `ulimit -n 16384` before tracing (required for the large number of open file handles).
This tool works with any GPU program, including CUDA binaries and TensorFlow/PyTorch libraries. However, choose the workload carefully, because even a very small workload can produce traces that are too large to store. For instance, training a very small CNN for a few iterations may fill hundreds of GBs and eventually exhaust storage.
To generate traces, simply add LD_PRELOAD to your original command:
LD_PRELOAD=<path_to>/tools/main/main.so python3 cnn_train.py
Alternatively, you can use CUDA_INJECTION64_PATH instead of LD_PRELOAD if the application overrides LD_PRELOAD internally (e.g. PyTorch):
CUDA_INJECTION64_PATH=<path_to>/tools/main/main.so python3 cnn_train.py
Scope control:
- `KERNEL_BEGIN`: Beginning of the kernel interval where to generate traces. (default = 0)
- `KERNEL_END`: End of the kernel interval where to generate traces. (default = UINT32_MAX)
- `INSTR_BEGIN`: Beginning of the instruction interval on each kernel where to apply instrumentation. (default = 0)
- `INSTR_END`: End of the instruction interval on each kernel where to apply instrumentation. (default = UINT32_MAX)
- `SAMPLED_KERNEL_INFO`: Path to the file that contains the list of kernels to be sampled. (default = '')
Output control:
- `TRACE_PATH`: Path to trace file. (default = './default/')
- `COMPRESSOR_PATH`: Path to the compressor binary file. (default = './compress')
- `DEBUG_TRACE`: Generate human-readable debug traces together when this value is 1. (default = 0)
- `OVERWRITE`: Overwrite the previously generated traces in `TRACE_PATH` directory when this value is 1. (default = 0)
Advanced:
- `TOOL_VERBOSE`: Enable verbosity inside the tool. (default = 0)
- `TMA_TRACE`: Write extended TMA traces (`trace_info_tma_s`) to `trace_tma_*.raw` files for UTMALDG (Tensor Memory Access) instructions. (default = 0). See TMA Trace Support below.
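When the workload is launched from a script, the variables above can be collected in one place. The following is a minimal sketch and not part of the tool: `build_trace_env` is a hypothetical helper, and the `main.so` path and `cnn_train.py` are placeholders. Only the variable names are taken from the lists above.

```python
import os


def build_trace_env(tool_so, trace_path="./traces/", kernel_end=None,
                    debug_trace=False, overwrite=False, base_env=None):
    """Return a copy of the environment with the tracer variables set.

    Hypothetical helper: only the variable names come from this README.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LD_PRELOAD"] = tool_so
    env["TRACE_PATH"] = trace_path
    if kernel_end is not None:
        # Limit the traced kernel interval to keep the output small.
        env["KERNEL_END"] = str(kernel_end)
    env["DEBUG_TRACE"] = "1" if debug_trace else "0"
    env["OVERWRITE"] = "1" if overwrite else "0"
    return env


env = build_trace_env("./tools/main/main.so", kernel_end=5, overwrite=True)
# To launch a (placeholder) workload with this environment:
#   import subprocess
#   subprocess.run(["python3", "cnn_train.py"], env=env)
```

Leaving `kernel_end` unset keeps the tool's default, which traces every kernel, so setting it explicitly is the safer choice for a first run.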
Each traced kernel produces a Kernel<N>/ directory under TRACE_PATH containing:
| File | Description |
|---|---|
| `bin_trace_<warp>.raw` | Binary trace data (one `trace_info_nvbit_small_s` per instruction) |
| `bin_trace_<warp>.txt` | Human-readable debug trace (only when `DEBUG_TRACE=1`) |
| `trace.txt` | Warp count and per-warp metadata |
| `trace_info.txt` | Per-warp instruction counts |
| `trace_tma_<warp>.raw` | TMA-extended trace (only when `TMA_TRACE=1`) |
Top-level files in TRACE_PATH:
- `kernel_config.txt`: list of kernel trace paths
- `kernel_names.txt`: kernel function names, grid/block dimensions, register counts
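Because trace output can grow quickly, it can help to check how much space each kernel directory occupies. The sketch below is an illustration only: `kernel_trace_sizes` is a hypothetical helper, and it assumes the per-kernel directories are named exactly `Kernel<N>` as described above.

```python
import re
from pathlib import Path


def kernel_trace_sizes(trace_path):
    """Map each Kernel<N> directory under trace_path to its total size in bytes."""
    sizes = {}
    for entry in Path(trace_path).iterdir():
        # Assumption: kernel directories are named exactly "Kernel<N>".
        if entry.is_dir() and re.fullmatch(r"Kernel\d+", entry.name):
            sizes[entry.name] = sum(
                f.stat().st_size for f in entry.rglob("*") if f.is_file()
            )
    return sizes
```

Calling `kernel_trace_sizes("./traces/")` after a short run with a small `KERNEL_END` gives a rough per-kernel estimate before committing to a full trace.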
Memory access coalescing is already implemented in this tracer, so you should not use mem_access_size as the raw request size in your simulator. For simplicity, assume every memory request is 128B unless you implement a sector cache in L2$.
Child trace convention: If both is_fp and is_ld are set to true, the load is a child (coalesced sector request). This means it is an additional 128B-aligned sector access generated by the same warp instruction as the parent load. The parent load's m_mem_access_size is scaled by the total number of sectors accessed.
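On the simulator side, the child convention can be applied as in the following sketch. `TraceRecord` is a hypothetical, already-decoded record and not the on-disk struct (see `tools/main/common.h` for the real layout); only the `is_fp` and `is_ld` flags and the 128B request size come from the text above.

```python
from dataclasses import dataclass

REQUEST_SIZE = 128  # bytes; the fixed request size suggested above


@dataclass
class TraceRecord:
    """Hypothetical decoded record; only is_fp and is_ld mirror the README."""
    is_fp: bool
    is_ld: bool
    addr: int


def is_child_load(rec):
    # Child convention: both flags set means a coalesced sector request.
    return rec.is_fp and rec.is_ld


def to_requests(records):
    """Turn load records into (kind, address, size) tuples of 128 B each."""
    requests = []
    for rec in records:
        if not rec.is_ld:
            continue  # this sketch only handles loads
        kind = "child" if is_child_load(rec) else "parent"
        requests.append((kind, rec.addr, REQUEST_SIZE))
    return requests
```

The recorded access size is deliberately ignored here, since the tracer has already coalesced the accesses.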
The tracer supports Hopper-architecture TMA (Tensor Memory Access) instructions (UTMALDG). TMA loads transfer tiles of data from global memory to shared memory via a CUtensorMap descriptor.
Two trace formats:
| Format | Struct | `mem_access_size` type | Max size | File pattern |
|---|---|---|---|---|
| Standard (default) | `trace_info_nvbit_small_s` | `uint8_t` | 255 bytes | `bin_trace_*.raw` |
| TMA-extended | `trace_info_tma_s` | `int` | 2 GB | `trace_tma_*.raw` |
- Default (`TMA_TRACE=0`): UTMALDG instructions are traced in the standard `bin_trace_*.raw` files using `trace_info_nvbit_small_s`. The `mem_access_size` is capped at 255 bytes. A warning is printed to stderr when TMA instructions are detected.
- Extended (`TMA_TRACE=1`): In addition to the standard trace, each UTMALDG instruction is also written to a separate `trace_tma_*.raw` file using `trace_info_tma_s`, which stores the full transfer size as an `int`.

The structs are defined in `tools/main/common.h`.
TMA address resolution: The tracer automatically detects CUtensorMap descriptors passed as kernel arguments at launch time and extracts the actual global data address (globalAddress) and tile transfer size (tileBytes). Per-tile addresses are computed using globalAddress + cta_id_x * tileBytes.
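The per-tile address formula can be written out as a short sketch. The function name and the example values are illustrative and do not come from a real trace.

```python
def tma_tile_address(global_address, cta_id_x, tile_bytes):
    """Per-tile address: globalAddress + cta_id_x * tileBytes."""
    return global_address + cta_id_x * tile_bytes


base = 0x7F0000000000  # example base address, not from a real trace
tile = 16384           # example tile size in bytes
addresses = [tma_tile_address(base, cta, tile) for cta in range(4)]
```

With this formula, consecutive CTAs along x read consecutive, non-overlapping tiles of `tileBytes` each.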
Example:
TMA_TRACE=1 LD_PRELOAD=./tools/main/main.so DEBUG_TRACE=1 ./my_hopper_app
More details about kernel sampling are coming soon.
python3 kernel_sample.py --cmd "python3 for-macsim/$name.py" --name "$name" \
--threshold 50 --min_n 30 --device_id 1 \
--trace_generate --trace_path /data/echung67/trace_sampled/nvbit/"$name"
Please check out run.sh for examples.
$ LD_PRELOAD=./tools/main/main.so \
TRACE_PATH=./traces/ \
KERNEL_END=5 \
DEBUG_TRACE=1 \
OVERWRITE=1 \
python3 m.py
This command generates traces for the first 5 CUDA kernels of the workload python3 m.py. It also overwrites any previous traces and generates debug traces.
- Random segfault at termination: The NVBit tool occasionally produces a segfault during cleanup. Re-running the tool usually resolves it. The `nvbit.py` script automatically retries on segfault.
- Large trace output: Even small workloads can produce hundreds of GBs of trace data. Always set `KERNEL_END` to limit output during initial testing.
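If you launch the tracer from your own script, a retry on segfault can be sketched as follows. This is not the logic used in `nvbit.py`: `run_with_retry` is a hypothetical helper, and it relies on the POSIX behavior of `subprocess`, which reports a process killed by signal N as return code -N.

```python
import signal
import subprocess


def run_with_retry(cmd, max_attempts=3, run=subprocess.run):
    """Re-run cmd while it dies with SIGSEGV, up to max_attempts times.

    Returns the return code of the last attempt.
    """
    returncode = None
    for _ in range(max_attempts):
        returncode = run(cmd).returncode
        # On POSIX, subprocess reports death by signal N as return code -N.
        if returncode != -signal.SIGSEGV:
            break
    return returncode
```

The `run` parameter exists only so that the launcher can be swapped out, for example to pass `env=` through a small wrapper around `subprocess.run`.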
This tool is built on NVBit (NVidia Binary Instrumentation Tool) by NVIDIA Corporation. NVBit is covered by the NVIDIA CUDA Toolkit End User License Agreement (see EULA.txt).
For NVBit API documentation, usage examples, and requirements, see the NVBit README and the nvbit.h header in the core/ folder.
- Open `nvbit.py`.
- Go to the `if __name__ == '__main__':` line.
- There are currently three functions inside the main function: `rodinia()`, `fast_tf()`, and `tango()`, one for each benchmark suite.
  - If the `fast` argument of a function is `True`, the traces will be saved in `/fast_data/echung67/trace/nvbit/...`.
  - If `False`, they will be saved in `/data/echung67/trace/nvbit/...`.
  - The traces become extremely large for certain benchmarks such as `FasterTransformer`, so I had to use `/data/...` (HDD) to store them.
- Uncomment the function that you want to run, or create a new function (such as `gunrock()`) if you want to create traces for a new benchmark.
- Run `$ python3 nvbit.py`. The traces will first be generated in `run/<bench_name>/<bench_config>/`, compressed with zlib (source: `tools/main/compress.cc`), and then moved to `/fast_data/echung67/trace/...`.
  - Because GPU VRAM is limited, trace generation should be done sequentially.
- `macsim.py` is for running Macsim concurrently.
Let's assume you are copying the tango() function to create a new function that generates traces for a new benchmark suite. The following variables should be changed.
- `trace_path_base`: path to the directory that this tool will save the traces
- `tango_bin`: this used to be the path to the tango binary, so you should change it to path to the new benchmark's binary file. (change the name too!)
- `nvbit_bin`: path to the nvbit tool that generates traces. don't forget to `$ cd tools && make` to generate this binary file.
- `compress_bin`: path to the zlib compress tool.
- `result_dir`: path to the directory that will store all the results and logs. It is set to `./run/` by default.
- `benchmark_names`: name of the individual benchmarks in the benchmark suite.
- `benchmark_configs`: configuration arguments for each benchmark in the benchmark suite.
What the for-loop does:
- Create a directory for each benchmark and configuration. If trace generation has failed, it will try to regenerate and overwrite the traces.
- Copy the macsim files to each directory.
- Create an `nvbit.py` file in each directory. This Python script runs the nvbit tool in that directory, compresses the traces, and moves them to `trace_path_base`.
- Run `nvbit.py` in each directory.
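The first step of the loop can be sketched as below. This is an illustration and not the code in `nvbit.py`: `prepare_run_dirs` is a hypothetical helper, and it assumes that `benchmark_configs[i]` is a list of configuration labels for `benchmark_names[i]`, each usable as a directory name.

```python
from pathlib import Path


def prepare_run_dirs(result_dir, benchmark_names, benchmark_configs):
    """Create result_dir/<bench_name>/<bench_config>/ for every pair.

    Assumption: benchmark_configs[i] lists the configs of benchmark_names[i].
    """
    run_dirs = []
    for name, configs in zip(benchmark_names, benchmark_configs):
        for config in configs:
            run_dir = Path(result_dir) / name / config
            # exist_ok lets a failed run be regenerated in the same place.
            run_dir.mkdir(parents=True, exist_ok=True)
            run_dirs.append(run_dir)
    return run_dirs
```

The remaining steps (copying the macsim files, writing the per-directory `nvbit.py`, and running it) are omitted here because they depend on paths specific to the machine.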