gthparch/Macsim_tracer

Trace Generation Tool for Macsim

This tool is not yet stable. If you run into any issues, please send a Teams message to Euijun Chung.

Usage

Warning: The trace generation tool may generate thousands of GBs of macsim traces on your machine in just a few minutes, so use at your own risk.

Installation

If you are working on rover, you can skip this step.

  1. git clone this repository.
  2. cd tools/main && make — you should see main.so if compiled successfully.
  3. Run ulimit -n 16384 before tracing (required for the large number of open file handles).

Basic Usage

This tool works with any GPU program, including CUDA binaries and TensorFlow/PyTorch libraries. However, choose the workload carefully: even a very small workload can produce more trace data than your storage can hold. For instance, training a very small CNN for a few iterations may write hundreds of GBs and eventually fill the disk.

To generate traces, simply add LD_PRELOAD to your original command:

LD_PRELOAD=<path_to>/tools/main/main.so python3 cnn_train.py

Alternatively, you can use CUDA_INJECTION64_PATH instead of LD_PRELOAD if the application overrides LD_PRELOAD internally (e.g. PyTorch):

CUDA_INJECTION64_PATH=<path_to>/tools/main/main.so python3 cnn_train.py
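When the workload is launched from a script rather than from the shell, the same two options can be applied through the child process environment. The following is a minimal sketch, not part of this repository: the helper names `tracer_env` and `run_traced` are hypothetical, and the tool path is whatever `main.so` location you built.

```python
import os
import subprocess


def tracer_env(tool_path, use_injection=False, base_env=None):
    """Return a copy of the environment with the tracer library attached.

    use_injection=False sets LD_PRELOAD; use_injection=True sets
    CUDA_INJECTION64_PATH instead, for applications that override LD_PRELOAD.
    """
    env = dict(os.environ if base_env is None else base_env)
    key = "CUDA_INJECTION64_PATH" if use_injection else "LD_PRELOAD"
    env[key] = tool_path
    return env


def run_traced(cmd, tool_path, use_injection=False):
    """Run cmd (a list of arguments) under the tracer and return its exit code."""
    env = tracer_env(tool_path, use_injection)
    return subprocess.run(cmd, env=env).returncode


# Hypothetical usage, mirroring the commands above:
# run_traced(["python3", "cnn_train.py"], "<path_to>/tools/main/main.so",
#            use_injection=True)
```

Copying the environment instead of mutating `os.environ` keeps the tracer from being injected into unrelated child processes started by the same script.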

Environment Variables

Scope control:

  • KERNEL_BEGIN: Start of the kernel interval for which traces are generated. (default = 0)
  • KERNEL_END: End of the kernel interval for which traces are generated. (default = UINT32_MAX)
  • INSTR_BEGIN: Start of the instruction interval, within each kernel, to which instrumentation is applied. (default = 0)
  • INSTR_END: End of the instruction interval, within each kernel, to which instrumentation is applied. (default = UINT32_MAX)
  • SAMPLED_KERNEL_INFO: Path to the file that contains the list of kernels to be sampled. (default = '')

Output control:

  • TRACE_PATH: Path to the directory where traces are written. (default = './default/')
  • COMPRESSOR_PATH: Path to the compressor binary file. (default = './compress')
  • DEBUG_TRACE: When set to 1, also generate human-readable debug traces. (default = 0)
  • OVERWRITE: When set to 1, overwrite previously generated traces in the TRACE_PATH directory. (default = 0)

Advanced:

  • TOOL_VERBOSE: Enable verbosity inside the tool. (default = 0)
  • TMA_TRACE: Write extended TMA traces (trace_info_tma_s) to trace_tma_*.raw files for UTMALDG (Tensor Memory Accelerator) instructions. (default = 0) See TMA Trace Support below.
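Because a mistyped variable can silently fall back to tracing every kernel, it may help to check the settings before launching. The sketch below mirrors the defaults listed above in a hypothetical `TracerConfig` helper. It is not how the tool itself parses its environment (that happens inside `main.so`), and the interval check is an assumption about what a sensible configuration looks like.

```python
import os
from dataclasses import dataclass

UINT32_MAX = 2**32 - 1


@dataclass
class TracerConfig:
    # Field names are the lowercase forms of the documented variables.
    kernel_begin: int = 0
    kernel_end: int = UINT32_MAX
    instr_begin: int = 0
    instr_end: int = UINT32_MAX
    sampled_kernel_info: str = ""
    trace_path: str = "./default/"
    compressor_path: str = "./compress"
    debug_trace: int = 0
    overwrite: int = 0
    tool_verbose: int = 0
    tma_trace: int = 0


def config_from_env(environ=None):
    """Build a TracerConfig from environment variables, using documented defaults."""
    environ = os.environ if environ is None else environ
    cfg = TracerConfig()
    for name, default in list(vars(cfg).items()):
        raw = environ.get(name.upper())
        if raw is not None:
            # Convert to the type of the default (int or str).
            setattr(cfg, name, type(default)(raw))
    # Assumed sanity check: an interval should not end before it begins.
    if cfg.kernel_begin > cfg.kernel_end or cfg.instr_begin > cfg.instr_end:
        raise ValueError("interval begin is greater than interval end")
    return cfg
```

Calling `config_from_env()` with no argument reads the current process environment. Passing a plain dict makes the helper easy to test.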

Output Files

Each traced kernel produces a Kernel<N>/ directory under TRACE_PATH containing:

| File | Description |
| --- | --- |
| bin_trace_<warp>.raw | Binary trace data (one trace_info_nvbit_small_s per instruction) |
| bin_trace_<warp>.txt | Human-readable debug trace (only when DEBUG_TRACE=1) |
| trace.txt | Warp count and per-warp metadata |
| trace_info.txt | Per-warp instruction counts |
| trace_tma_<warp>.raw | TMA-extended trace (only when TMA_TRACE=1) |

Top-level files in TRACE_PATH:

  • kernel_config.txt — list of kernel trace paths
  • kernel_names.txt — kernel function names, grid/block dimensions, register counts
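Given how quickly traces grow, it can be useful to check how much each kernel directory occupies after a short test run. The following sketch walks the layout described above. The function name `summarize_traces` is hypothetical, and it relies only on the `Kernel<N>/` directory naming, not on the contents of any trace file.

```python
from pathlib import Path


def summarize_traces(trace_path):
    """Map each Kernel<N> directory name to (file_count, total_bytes)."""
    summary = {}
    for kernel_dir in sorted(Path(trace_path).glob("Kernel*")):
        if not kernel_dir.is_dir():
            continue  # skip top-level files such as kernel_config.txt
        files = [p for p in kernel_dir.iterdir() if p.is_file()]
        total = sum(p.stat().st_size for p in files)
        summary[kernel_dir.name] = (len(files), total)
    return summary


# Hypothetical usage:
# for name, (count, size) in summarize_traces("./traces/").items():
#     print(f"{name}: {count} files, {size / 2**30:.2f} GiB")
```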

Memory Access Coalescing

Memory access coalescing is already implemented in this tracer, so you should not use mem_access_size as the raw request size in your simulator. For simplicity, assume every memory request is 128B unless you implement a sector cache in the L2 cache.

Child trace convention: If both is_fp and is_ld are set to true, the load is a child (coalesced sector request). This means it is an additional 128B-aligned sector access generated by the same warp instruction as the parent load. The parent load's m_mem_access_size is scaled by the total number of sectors accessed.
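To illustrate why one warp instruction can yield a parent record plus child records, the sketch below groups per-thread addresses into 128B-aligned blocks and applies the flag convention described above. This is an illustrative model under stated assumptions, not the tracer's actual implementation. The function names are hypothetical, and only the `is_fp` and `is_ld` flag names come from the text.

```python
LINE_BYTES = 128  # request granularity assumed in this section


def coalesce(addresses, line_bytes=LINE_BYTES):
    """Return the sorted, unique line-aligned base addresses touched by a warp."""
    return sorted({addr - addr % line_bytes for addr in addresses})


def is_child_load(is_fp, is_ld):
    """A record with both flags set is a child (coalesced) request."""
    return bool(is_fp and is_ld)


# 32 threads reading consecutive 4-byte words fit in a single 128B block,
# so the parent record alone covers the access.
unit_stride = [0x1000 + 4 * lane for lane in range(32)]
assert coalesce(unit_stride) == [0x1000]

# With an 8-byte stride the same warp touches two blocks: in the convention
# above, that would be one parent record plus one child record.
wide_stride = [0x1000 + 8 * lane for lane in range(32)]
assert coalesce(wide_stride) == [0x1000, 0x1080]
```

In a simulator, each parent or child record would then be issued as one 128B request at its own address, rather than splitting by mem_access_size.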

TMA Trace Support

The tracer supports Hopper-architecture TMA (Tensor Memory Accelerator) instructions (UTMALDG). TMA loads transfer tiles of data from global memory to shared memory via a CUtensorMap descriptor.

Two trace formats:

| Format | Struct | mem_access_size type | Max size | File pattern |
| --- | --- | --- | --- | --- |
| Standard (default) | trace_info_nvbit_small_s | uint8_t | 255 bytes | bin_trace_*.raw |
| TMA-extended | trace_info_tma_s | int | 2 GB | trace_tma_*.raw |

  • Default (TMA_TRACE=0): UTMALDG instructions are traced in the standard bin_trace_*.raw files using trace_info_nvbit_small_s. The mem_access_size is capped at 255 bytes. A warning is printed to stderr when TMA instructions are detected.
  • Extended (TMA_TRACE=1): In addition to the standard trace, each UTMALDG instruction is also written to a separate trace_tma_*.raw file using trace_info_tma_s, which stores the full transfer size as an int. The structs are defined in tools/main/common.h.

TMA address resolution: The tracer automatically detects CUtensorMap descriptors passed as kernel arguments at launch time and extracts the actual global data address (globalAddress) and tile transfer size (tileBytes). Per-tile addresses are computed using globalAddress + cta_id_x * tileBytes.

Example:

TMA_TRACE=1 LD_PRELOAD=./tools/main/main.so DEBUG_TRACE=1 ./my_hopper_app
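The per-tile address formula and the size cap described in this section can be sketched as follows. This is a small model for reasoning about the two trace formats, not code from the tracer. The function names are hypothetical, and the example values are made up.

```python
UINT8_MAX = 255  # largest value a uint8_t mem_access_size can hold


def tma_tile_address(global_address, cta_id_x, tile_bytes):
    """Per-tile address: globalAddress + cta_id_x * tileBytes."""
    return global_address + cta_id_x * tile_bytes


def standard_trace_size(tile_bytes):
    """Size as recorded in the standard format, capped at 255 bytes."""
    return min(tile_bytes, UINT8_MAX)


# Made-up example: 4 KiB tiles starting at 0x7F0000000000.
base = 0x7F0000000000
assert tma_tile_address(base, 0, 4096) == base
assert tma_tile_address(base, 3, 4096) == base + 3 * 4096

# The standard record loses the real size; the TMA-extended record keeps it.
assert standard_trace_size(4096) == 255
assert standard_trace_size(64) == 64
```

This is why a simulator that needs the actual transfer size of a UTMALDG instruction would read trace_tma_*.raw rather than relying on the standard record.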

Use with CUDA Kernel Sampling

More details about kernel sampling are coming soon.

python3 kernel_sample.py --cmd "python3 for-macsim/$name.py" --name "$name" \
    --threshold 50 --min_n 30 --device_id 1 \
    --trace_generate --trace_path /data/echung67/trace_sampled/nvbit/"$name"

Example

Please check out run.sh for examples.

$ LD_PRELOAD=./tools/main/main.so \
  TRACE_PATH=./traces/ \
  KERNEL_END=5 \
  DEBUG_TRACE=1 \
  OVERWRITE=1 \
  python3 m.py

This command will generate traces for the first 5 CUDA kernels of the workload python3 m.py. Also, the tool will overwrite the previous traces and generate the debug traces as well.

Known Issues

  1. Random segfault at termination: The NVBit tool occasionally produces a segfault during cleanup. Re-running the tool usually resolves it. The nvbit.py script automatically retries on segfault.
  2. Large trace output: Even small workloads can produce hundreds of GBs of trace data. Always set KERNEL_END to limit output during initial testing.
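A retry loop for the first issue might look like the sketch below. It is an illustration of the idea only and is not the code in nvbit.py: the function name `run_with_retry` and the attempt limit are assumptions. On POSIX systems, `subprocess` reports a process killed by a signal as a negative return code.

```python
import signal
import subprocess


def run_with_retry(cmd, env=None, max_attempts=3):
    """Re-run cmd while it dies with SIGSEGV and return the last exit code."""
    returncode = None
    for attempt in range(1, max_attempts + 1):
        returncode = subprocess.run(cmd, env=env).returncode
        # A child killed by signal N is reported as -N (POSIX only).
        if returncode != -signal.SIGSEGV:
            return returncode
        print(f"attempt {attempt}/{max_attempts}: segfault, retrying")
    return returncode
```

Note that if the command is launched through a shell, a segfault typically shows up as exit code 139 (128 + 11) instead, so the check would need to be adapted.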

NVBit Reference

This tool is built on NVBit (NVIDIA Binary Instrumentation Tool) by NVIDIA Corporation. NVBit is covered by the NVIDIA CUDA Toolkit End User License Agreement (see EULA.txt).

For NVBit API documentation, usage examples, and requirements, see the NVBit README and the nvbit.h header in the core/ folder.

Advanced: Using the Python Script

  1. Open nvbit.py.
  2. Go to the if __name__ == '__main__': line.
  3. There are currently three functions inside the main block: rodinia(), fast_tf(), and tango(), one for each benchmark suite.
     • If the fast argument of a function is True, the traces will be saved in /fast_data/echung67/trace/nvbit/....
     • If it is False, they will be saved in /data/echung67/trace/nvbit/....
     • The traces become extremely large for certain benchmarks such as FasterTransformer, so I had to use /data/... (HDD) to store them.
  4. Uncomment the function that you want to run, or create a new function (such as gunrock()) if you want to generate traces for a new benchmark.
  5. Run $ python3 nvbit.py. The traces are first generated in run/<bench_name>/<bench_config>/, then compressed with zlib (source: tools/main/compress.cc), and finally moved to /fast_data/echung67/trace/....
     • Because GPU VRAM is limited, trace generation should be done sequentially.
  6. macsim.py is for running Macsim concurrently.

Lines you should change in nvbit.py

Let's assume you are copying the tango() function to create a new function that generates traces for a new benchmark suite. The following variables should be changed.

  • trace_path_base: path to the directory that this tool will save the traces
  • tango_bin: this was the path to the tango binary, so change it to the path of the new benchmark's binary file. (Change the variable name too!)
  • nvbit_bin: path to the nvbit tool that generates traces. Don't forget to run $ cd tools && make to generate this binary file.
  • compress_bin: path to the zlib compress tool.
  • result_dir: path to the directory that will store all the results and logs. It is set to ./run/ by default.
  • benchmark_names: name of the individual benchmarks in the benchmark suite.
  • benchmark_configs: configuration arguments for each benchmark in the benchmark suite.

What the for-loop does:

  1. Create a directory for each benchmark and configuration pair. If trace generation has failed, it will try to regenerate and overwrite the traces.
  2. Copy the macsim files to each directory.
  3. Create an nvbit.py file in each directory. This Python script runs the nvbit tool in that directory, compresses the traces, and moves them to trace_path_base.
  4. Run nvbit.py in each directory.
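The first step of the loop above can be sketched as follows. This is not the code in nvbit.py. It assumes, for illustration only, that the configurations are given as a dict mapping each benchmark name to a list of directory-safe configuration labels, and the names `prepare_run_dirs`, `bfs`, `small`, and `large` are hypothetical.

```python
from pathlib import Path


def prepare_run_dirs(result_dir, benchmark_configs):
    """Create result_dir/<bench_name>/<bench_config>/ for every pair.

    benchmark_configs is assumed to map a benchmark name to a list of
    configuration labels that are safe to use as directory names.
    """
    created = []
    for bench_name, configs in benchmark_configs.items():
        for config in configs:
            run_dir = Path(result_dir) / bench_name / config
            # exist_ok=True lets a failed run be regenerated in place.
            run_dir.mkdir(parents=True, exist_ok=True)
            created.append(run_dir)
    # Steps 2 to 4 (copy files, write the per-directory script, run it)
    # would follow here, one directory at a time.
    return created


# Hypothetical usage:
# prepare_run_dirs("./run/", {"bfs": ["small", "large"]})
```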
