Trace Generation Tool for Macsim

This tool is not yet stable. If you run into issues, please send a Teams message to Euijun Chung.

Usage

Warning: The trace generation tool may generate thousands of GBs of macsim traces on your machine in just a few minutes, so use at your own risk.

Installation

If you are working on rover, you can skip this step.

git clone this repository.
cd tools/main && make — you should see main.so if compiled successfully.
Run ulimit -n 16384 before tracing (required for the large number of open file handles).
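If you launch the tracer from a Python driver script rather than an interactive shell, the same limit can be raised from inside the script. The following is a minimal sketch, assuming a POSIX system; the helper name `ensure_open_file_limit` is hypothetical and not part of this repository.

```python
import resource


def ensure_open_file_limit(required=16384):
    """Raise the soft open-file limit to `required` if the hard limit allows it.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited or soft >= required:
        return soft  # already high enough, nothing to do
    if hard != unlimited and hard < required:
        raise RuntimeError(
            f"hard limit {hard} is below {required}; ask an administrator to raise it"
        )
    resource.setrlimit(resource.RLIMIT_NOFILE, (required, hard))
    return required
```

Child processes started afterwards (for example with `subprocess.run`) inherit the raised limit.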

Basic Usage

This tool works with any GPU program, including CUDA binaries and TensorFlow/PyTorch libraries. Choose the workload carefully, though: even a very small workload can produce more trace data than your machine can store. For instance, training a very small CNN for a few iterations may generate hundreds of GBs and fill up your storage.
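Given how quickly traces grow, it can help to check free disk space before starting a run. This is a minimal sketch; the helper name `has_free_space` and the 500 GB default threshold are assumptions chosen for illustration, not values used by the tool.

```python
import shutil


def has_free_space(path=".", min_free_gb=500):
    """Return True if the filesystem holding `path` has at least `min_free_gb` GiB free."""
    usage = shutil.disk_usage(path)
    return usage.free >= min_free_gb * 1024**3
```

A driver script could call this on the intended trace directory and refuse to start when it returns False.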

To generate traces, simply add LD_PRELOAD to your original command:

LD_PRELOAD=<path_to>/tools/main/main.so python3 cnn_train.py

Alternatively, you can use CUDA_INJECTION64_PATH instead of LD_PRELOAD if the application overrides LD_PRELOAD internally (e.g. PyTorch):

CUDA_INJECTION64_PATH=<path_to>/tools/main/main.so python3 cnn_train.py

Environment Variables

Scope control:

KERNEL_BEGIN: Start of the kernel interval for which traces are generated. (default = 0)
KERNEL_END: End of the kernel interval for which traces are generated. (default = UINT32_MAX)
INSTR_BEGIN: Start of the instruction interval, within each kernel, to which instrumentation is applied. (default = 0)
INSTR_END: End of the instruction interval, within each kernel, to which instrumentation is applied. (default = UINT32_MAX)
SAMPLED_KERNEL_INFO: Path to the file that contains the list of kernels to be sampled. (default = '')

Output control:

TRACE_PATH: Path to the trace output directory. (default = './default/')
COMPRESSOR_PATH: Path to the compressor binary file. (default = './compress')
DEBUG_TRACE: Also generate human-readable debug traces when this value is 1. (default = 0)
OVERWRITE: Overwrite previously generated traces in the TRACE_PATH directory when this value is 1. (default = 0)

Advanced:

TOOL_VERBOSE: Enable verbosity inside the tool. (default = 0)
TMA_TRACE: Write extended TMA traces (trace_info_tma_s) to trace_tma_*.raw files for UTMALDG (Tensor Memory Access) instructions. (default = 0). See TMA Trace Support below.
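When tracing is driven from a script, the variables above can be assembled into an environment dictionary and passed to the child process. The following is a minimal sketch; `build_trace_env` and `KNOWN_VARIABLES` are hypothetical names, and only the variable names themselves come from this README.

```python
import os

# Variable names documented in this README.
KNOWN_VARIABLES = {
    "KERNEL_BEGIN", "KERNEL_END", "INSTR_BEGIN", "INSTR_END",
    "SAMPLED_KERNEL_INFO", "TRACE_PATH", "COMPRESSOR_PATH",
    "DEBUG_TRACE", "OVERWRITE", "TOOL_VERBOSE", "TMA_TRACE",
}


def build_trace_env(tool_path, use_cuda_injection=False, base_env=None, **options):
    """Return an environment dict that injects the tracer and sets its options."""
    env = dict(os.environ) if base_env is None else dict(base_env)
    unknown = set(options) - KNOWN_VARIABLES
    if unknown:
        raise ValueError(f"unknown tracer variables: {sorted(unknown)}")
    # Pick the injection mechanism: LD_PRELOAD by default, CUDA_INJECTION64_PATH
    # for applications that override LD_PRELOAD internally.
    key = "CUDA_INJECTION64_PATH" if use_cuda_injection else "LD_PRELOAD"
    env[key] = tool_path
    # Environment values must be strings.
    env.update({name: str(value) for name, value in options.items()})
    return env


# Example usage (not executed here):
# import subprocess
# env = build_trace_env("./tools/main/main.so", TRACE_PATH="./traces/", KERNEL_END=5)
# subprocess.run(["python3", "m.py"], env=env, check=True)
```

Rejecting unknown names catches typos such as `KERNAL_END`, which would otherwise be silently ignored and could lead to an unbounded trace.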

Output Files

Each traced kernel produces a Kernel<N>/ directory under TRACE_PATH containing:

| File | Description |
| --- | --- |
| `bin_trace_<warp>.raw` | Binary trace data (one `trace_info_nvbit_small_s` per instruction) |
| `bin_trace_<warp>.txt` | Human-readable debug trace (only when `DEBUG_TRACE=1`) |
| `trace.txt` | Warp count and per-warp metadata |
| `trace_info.txt` | Per-warp instruction counts |
| `trace_tma_<warp>.raw` | TMA-extended trace (only when `TMA_TRACE=1`) |

Top-level files in TRACE_PATH:

kernel_config.txt — list of kernel trace paths
kernel_names.txt — kernel function names, grid/block dimensions, register counts
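To keep an eye on how much data a run has produced, the directory layout described above can be walked with a short script. This is a minimal sketch that relies only on the directory and file name patterns listed in this section; `summarize_trace_dir` is a hypothetical helper, and it does not parse the contents of any trace file.

```python
import fnmatch
import os


def summarize_trace_dir(trace_path):
    """Map each Kernel<N> directory to its trace file counts and total size in bytes."""
    summary = {}
    for entry in sorted(os.listdir(trace_path)):
        kernel_dir = os.path.join(trace_path, entry)
        # Skip top-level files such as kernel_config.txt and kernel_names.txt.
        if not (entry.startswith("Kernel") and os.path.isdir(kernel_dir)):
            continue
        files = os.listdir(kernel_dir)
        summary[entry] = {
            "raw_files": len(fnmatch.filter(files, "bin_trace_*.raw")),
            "tma_files": len(fnmatch.filter(files, "trace_tma_*.raw")),
            "bytes": sum(os.path.getsize(os.path.join(kernel_dir, f)) for f in files),
        }
    return summary
```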

Memory Access Coalescing

Memory access coalescing is already implemented in this tracer, so you should not use mem_access_size as the raw request size in your simulator. For simplicity, assume every memory request size is 128B unless you implement sector cache in L2$.

Child trace convention: If both is_fp and is_ld are set to true, the load is a child (coalesced sector request). This means it is an additional 128B-aligned sector access generated by the same warp instruction as the parent load. The parent load's m_mem_access_size is scaled by the total number of sectors accessed.
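To make the idea concrete, the sketch below shows how per-thread addresses of one warp instruction map onto 128B-aligned sectors, and how a simulator might detect child records using the flag convention above. It illustrates the concept only and is not the tracer's actual implementation; `coalesce`, `is_child_load`, and the 4-byte per-thread access size are assumptions.

```python
SECTOR_BYTES = 128


def coalesce(addresses, access_bytes=4):
    """Return the sorted 128B-aligned sector base addresses touched by one warp instruction."""
    sectors = set()
    for addr in addresses:
        first = addr // SECTOR_BYTES
        last = (addr + access_bytes - 1) // SECTOR_BYTES
        # An access that straddles a boundary touches more than one sector.
        for sector in range(first, last + 1):
            sectors.add(sector * SECTOR_BYTES)
    return sorted(sectors)


def is_child_load(is_fp, is_ld):
    """Child (coalesced sector) loads are marked by is_fp and is_ld both being set."""
    return bool(is_fp and is_ld)


# 32 threads reading consecutive 4-byte words touch a single sector.
unit_stride = [0x1000 + 4 * lane for lane in range(32)]
# With an 8-byte stride the same warp touches two sectors.
wide_stride = [0x1000 + 8 * lane for lane in range(32)]
```

Here `coalesce(unit_stride)` yields one sector base and `coalesce(wide_stride)` yields two, which is why the number of memory requests per warp instruction depends on the access pattern.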

TMA Trace Support

The tracer supports Hopper-architecture TMA (Tensor Memory Access) instructions (UTMALDG). TMA loads transfer tiles of data from global memory to shared memory via a CUtensorMap descriptor.

Two trace formats:

| Format | Struct | `mem_access_size` type | Max size | File pattern |
| --- | --- | --- | --- | --- |
| Standard (default) | `trace_info_nvbit_small_s` | `uint8_t` | 255 bytes | `bin_trace_*.raw` |
| TMA-extended | `trace_info_tma_s` | `int` | 2 GB | `trace_tma_*.raw` |

Default (TMA_TRACE=0): UTMALDG instructions are traced in the standard bin_trace_*.raw files using trace_info_nvbit_small_s. The mem_access_size is capped at 255 bytes. A warning is printed to stderr when TMA instructions are detected.
Extended (TMA_TRACE=1): In addition to the standard trace, each UTMALDG instruction is also written to a separate trace_tma_*.raw file using trace_info_tma_s, which stores the full transfer size as an int. The structs are defined in tools/main/common.h.

TMA address resolution: The tracer automatically detects CUtensorMap descriptors passed as kernel arguments at launch time and extracts the actual global data address (globalAddress) and tile transfer size (tileBytes). Per-tile addresses are computed using globalAddress + cta_id_x * tileBytes.

Example:

TMA_TRACE=1 LD_PRELOAD=./tools/main/main.so DEBUG_TRACE=1 ./my_hopper_app
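The per-tile address formula and the 255-byte cap of the standard format can be written out as a small worked example. This is a sketch of the arithmetic described in this section only; the function names are hypothetical, and the actual structs live in tools/main/common.h.

```python
UINT8_MAX = 255  # largest value a uint8_t mem_access_size field can hold


def tma_tile_address(global_address, cta_id_x, tile_bytes):
    """Per-tile address: globalAddress + cta_id_x * tileBytes."""
    return global_address + cta_id_x * tile_bytes


def standard_trace_size(transfer_bytes):
    """Size as recorded in the standard trace format, capped at 255 bytes."""
    return min(transfer_bytes, UINT8_MAX)


# A 4096-byte tile for the CTA with cta_id_x = 3 starts 3 * 4096 bytes past the base.
base = 0x7F0000000000
addr = tma_tile_address(base, cta_id_x=3, tile_bytes=4096)
```

For this tile, the standard format would record a size of 255, while the TMA-extended format keeps the full 4096.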

Use with CUDA Kernel Sampling

More details about kernel sampling are coming soon.

python3 kernel_sample.py --cmd "python3 for-macsim/$name.py" --name "$name" \
    --threshold 50 --min_n 30 --device_id 1 \
    --trace_generate --trace_path /data/echung67/trace_sampled/nvbit/"$name"

Example

Please check out run.sh for examples.

$ LD_PRELOAD=./tools/main/main.so \
  TRACE_PATH=./traces/ \
  KERNEL_END=5 \
  DEBUG_TRACE=1 \
  OVERWRITE=1 \
  python3 m.py

This command will generate traces for the first 5 CUDA kernels of the workload python3 m.py. Also, the tool will overwrite the previous traces and generate the debug traces as well.

Known Issues

Random segfault at termination: The NVBit tool occasionally produces a segfault during cleanup. Re-running the tool usually resolves it. The nvbit.py script automatically retries on segfault.
Large trace output: Even small workloads can produce hundreds of GBs of trace data. Always set KERNEL_END to limit output during initial testing.
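If you drive the tool from your own script instead of nvbit.py, a similar retry can be sketched as follows. This is an illustration under assumptions, not the logic used in nvbit.py: `run_with_retry` is a hypothetical helper, and it relies on the POSIX convention that `subprocess` reports a process killed by signal N with return code -N.

```python
import signal
import subprocess


def run_with_retry(cmd, env=None, max_attempts=3):
    """Run `cmd`, re-running it while it dies with SIGSEGV, up to `max_attempts` times.

    Returns the return code of the last attempt.
    """
    returncode = None
    for _ in range(max_attempts):
        returncode = subprocess.run(cmd, env=env).returncode
        if returncode != -signal.SIGSEGV:
            break  # success or a different failure: do not retry
    return returncode
```

Note that when the command is run through a shell, a segfault is usually reported as exit code 139 instead, so this check applies only to commands started directly.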

NVBit Reference

This tool is built on NVBit (NVidia Binary Instrumentation Tool) by NVIDIA Corporation. NVBit is covered by the NVIDIA CUDA Toolkit End User License Agreement (see EULA.txt).

For NVBit API documentation, usage examples, and requirements, see the NVBit README and the nvbit.h header in the core/ folder.

Advanced: Using the Python Script

Open nvbit.py.
Go to the if __name__ == '__main__': line.
The main block currently contains three functions, rodinia(), fast_tf(), and tango(), one for each benchmark suite.

If the fast argument of a function is True, the traces are saved in /fast_data/echung67/trace/nvbit/....
If it is False, they are saved in /data/echung67/trace/nvbit/....
Traces become extremely large for certain benchmarks such as FasterTransformer, so I had to use /data/... (HDD) to store them.

Uncomment the function that you want to run, or create a new function (such as gunrock()) if you want to generate traces for a new benchmark.
Run $ python3 nvbit.py. The traces are first generated in run/<bench_name>/<bench_config>/, compressed with zlib (source: tools/main/compress.cc), and then moved to /fast_data/echung67/trace/....

Because GPU VRAM is limited, trace generation should be done sequentially.

macsim.py is for running Macsim concurrently.

Lines you should change in `nvbit.py`

Let's assume you are copying the tango() function to create a new function that generates traces for a new benchmark suite. The following variables should be changed.

trace_path_base: path to the directory where this tool saves the traces.
tango_bin: originally the path to the tango binary; change it to the path of the new benchmark's binary file (and rename the variable too).
nvbit_bin: path to the NVBit tool that generates traces. Don't forget to run $ cd tools && make to build this binary file.
compress_bin: path to the zlib compress tool.
result_dir: path to the directory that will store all the results and logs. It is set to ./run/ by default.
benchmark_names: name of the individual benchmarks in the benchmark suite.
benchmark_configs: configuration arguments for each benchmark in the benchmark suite.

What the for-loop does:

Create a directory for each benchmark and configuration. If trace generation has failed, it will try to regenerate and overwrite the traces.
Copy the macsim files to each directory.
Create an nvbit.py file in each directory. This Python script runs the NVBit tool in that directory, compresses the traces, and moves them to trace_path_base.
Run nvbit.py in each directory.
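The first step of the loop above can be sketched as follows. This is a simplified illustration, not the code in nvbit.py: `prepare_run_dirs` is a hypothetical helper, and it assumes `benchmark_configs` is a dictionary mapping each benchmark name to a list of configuration strings that are usable as directory names.

```python
import os


def prepare_run_dirs(result_dir, benchmark_names, benchmark_configs):
    """Create <result_dir>/<bench_name>/<bench_config>/ for every pair and return the paths."""
    run_dirs = []
    for name in benchmark_names:
        for config in benchmark_configs[name]:
            run_dir = os.path.join(result_dir, name, config)
            # exist_ok lets a failed run be regenerated in the same directory.
            os.makedirs(run_dir, exist_ok=True)
            run_dirs.append(run_dir)
    return run_dirs
```

The remaining steps (copying files, writing the per-directory script, and running it) would then iterate over the returned paths one at a time, which matches the sequential execution recommended above.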
