This repository collects small, focused CUDA example programs and helper scripts used for learning and benchmarking. Each subdirectory contains a single example (source, README, and helper scripts).
- Prerequisites
- Quick Build
- How to run examples
- Profiling
- Repository layout and links
- CLI conventions
- CI / GitHub Actions
- Contributing
- Linux (Ubuntu recommended for scripts in this repo)
- NVIDIA CUDA toolkit (`nvcc`) installed and on `PATH` for local builds
- `make` and standard build tools (`gcc`, `g++`, `make`)
- `nvprof` (or your preferred NVIDIA profiler) if you want to profile; profiling scripts in each directory call `nvprof` by default
If you plan to use the included GitHub Actions workflow, the workflow builds inside an NVIDIA CUDA Docker image so you don't need CUDA installed locally for CI builds.
From the project root run:
make -j$(nproc)
This runs `make` in every subdirectory that provides a Makefile and builds the example binaries.
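The per-subdirectory build described above can be sketched in plain shell. This is only an illustration of the idea, not the repository's actual top-level Makefile; the function names `list_buildable_dirs` and `build_all` are hypothetical.

```shell
# Hypothetical sketch: the repository's real top-level build may work differently.

# Print every immediate subdirectory of $1 (default: .) that contains a Makefile.
list_buildable_dirs() {
    root="${1:-.}"
    for dir in "$root"/*/; do
        if [ -f "${dir}Makefile" ]; then
            printf '%s\n' "${dir%/}"
        fi
    done
}

# Run make in each buildable subdirectory of $1 with $2 parallel jobs,
# stopping at the first failure.
build_all() {
    list_buildable_dirs "${1:-.}" | while IFS= read -r dir; do
        make -C "$dir" -j "${2:-1}" < /dev/null || exit 1
    done
}

# Example (not executed here): build_all . "$(nproc)"
```

Directories without a Makefile are skipped silently, which matches the behaviour described above.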
Each example directory contains a run.sh helper script and a README with example invocations. Most binaries accept an explicit --help flag that prints usage.
Example:
cd vector_addition
./vectAdd --mode 0 --n 1024 --threads 128 --granularity 1
Note: binaries accept flags only (no positional fallback). If a directory provides a `run.sh`, it maps convenient script arguments to the program flags.
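As a rough illustration of how a `run.sh` helper might map script arguments to program flags, consider the sketch below. The positional order (`n`, then `threads`), the default values, and the function name `map_args_to_flags` are assumptions made for this example; check each directory's `run.sh` for its real interface.

```shell
# Hypothetical run.sh-style mapping; the argument order and defaults are assumptions.

# Turn positional script arguments (n, threads) into the flag-style
# arguments that the binaries expect.
map_args_to_flags() {
    n="${1:-1024}"
    threads="${2:-128}"
    printf '%s %s %s %s\n' --n "$n" --threads "$threads"
}

# Example (not executed here):
#   ./vectAdd $(map_args_to_flags 4096 256)
```

Keeping the mapping in the script lets the binary itself stay flag-only.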
Per-directory profiling scripts are provided and named profile_nvprof.sh. They call nvprof and save profiler outputs. Example usage (from a subdirectory):
./profile_nvprof.sh --n 4096 --threads 256
If you do not have `nvprof`, install the CUDA toolkit or rely on the GitHub Actions CI, which builds the project inside a CUDA container.
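A wrapper of this kind could look roughly like the sketch below. It is not the content of the repository's `profile_nvprof.sh`; the function names, the `LOG_FILE` variable, and the default log name `nvprof_output.log` are placeholders. The only `nvprof` option used is `--log-file`, which writes the profiler's output to a file.

```shell
# Hypothetical sketch of a profile_nvprof.sh-style wrapper.
# Assumption: LOG_FILE and its default name are placeholders, not the repo's values.

# Print the nvprof command line that would profile binary $1 with the
# remaining arguments passed through unchanged.
nvprof_command() {
    bin="$1"
    shift
    printf '%s' "nvprof --log-file ${LOG_FILE:-nvprof_output.log} $bin"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

# Run the profile, failing early with a hint if nvprof is not on PATH.
run_profile() {
    if ! command -v nvprof >/dev/null 2>&1; then
        echo "nvprof not found; install the CUDA toolkit first" >&2
        return 127
    fi
    log="${LOG_FILE:-nvprof_output.log}"
    bin="$1"
    shift
    nvprof --log-file "$log" "$bin" "$@"
}

# Example (not executed here): run_profile ./vectAdd --n 4096 --threads 256
```

Checking for `nvprof` up front gives a clearer message than a bare "command not found" from the shell.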
Click the folders below for the example README files and more details:
- Vector Addition: vector add example with multiple modes
- Error Handling: examples showing CUDA error handling
- Device Specification: device query and capability examples
- Image Manipulation: image processing examples (blur, grayscale) with libpng
- Matrix-Vector Multiplication: matrix-vector multiplication example
- Matrix Multiplication: matrix multiplication with naive, tiled, and coarsened kernels
- Convolution: 1D and 2D convolution with constant memory and tiling
- Parallel Histogram: parallel histogram with privatization, aggregation, and coarsening
- 3D Stencil: 3D seven-point stencil with shared memory, coarsening, and register tiling
- Heat Transfer: 2D heat transfer simulation with Global, Tiled, and Tiled with Halo kernels
- Profiling Tools: automated GPU profiling suite with roofline analysis
Each folder includes a README.md with per-example instructions.
The profiling_tools/ directory contains a complete GPU profiling suite:
| File | Description |
|---|---|
| `profile_cuda.sh` | Main orchestration script for automated profiling |
| `gpu_info.cu` | GPU specification detection and theoretical peak calculation |
| `parse_metrics.py` | Metrics parser and data generator for roofline analysis |
| `plot_roofline.gp` | Gnuplot script for roofline model visualization |
| `plot_histogram.gp` | Gnuplot script for execution time comparison |
| `plot_occupancy.gp` | Gnuplot script for SM occupancy visualization |
Quick start:
# Profile all executables in a directory
./profiling_tools/profile_cuda.sh -d ./matrix_multiplication/
# Profile with specific arguments
./profiling_tools/profile_cuda.sh -d ./vector_addition/ -a "--n 1048576 --threads 256"
See `profiling_tools/README.md` for detailed usage.
- All example binaries use flag-style CLI (e.g., `--n 1024`, `--threads 128`).
- Centralized CLI helpers live in `common/cli_utils.h` and are used across examples for consistent parsing and validation.
A GitHub Actions workflow is included at .github/workflows/ci.yml. The workflow builds the project inside an NVIDIA CUDA Docker image and uploads artifacts. It runs on push and pull_request to main/dev.
- Make changes in a feature branch, run `make`, and add tests or smoke-tests if appropriate.
- Open a PR with a clear description and small, focused commits.