This repo contains the hardware design and a simple testbench for SpMM with Block Aggregation on the Intel/Altera Stratix 10 NX platform.
If you would like to cite this work, use:

```bibtex
@inproceedings{enabling-efficient-spmm-for-sparse-attention-on-gemm-optimized-hardware-with-block-aggregation,
author = {Ji, Tianchu and Balasubramanian, Niranjan and Ferdman, Michael and Milder, Peter},
title = {Enabling Efficient SpMM for Sparse Attention on GEMM-Optimized Hardware with Block Aggregation},
booktitle = {Proceedings of the 2026 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '26)},
year = {2026},
isbn = {9798400720796},
address = {Monterey, CA, USA},
publisher = {Association for Computing Machinery},
numpages = {12},
keywords = {sparse-dense matrix multiplication, self-attention, sparse attention, Tensor Block},
doi = {10.1145/3748173.3779187},
url = {https://doi.org/10.1145/3748173.3779187},
}
```

## Prerequisites

- sdkman
  - java: 11.0.23-amzn (`sdk install java 11.0.23-amzn`)
  - sbt: 1.11.7, for building the SpinalHDL project (`sdk install sbt 1.11.7`)
- python 3.13
- Quartus 21.4
- QuestaSim 2024.3
## Getting the source

```
git clone --recurse-submodules git@github.com:COMPAS-Lab/sparsity-intel-tensor-core-transformers-accel.git spmm_core
cd spmm_core
git checkout block_sparse_core
```

Make sure to create and activate a virtual environment before installing the Python requirements, then:

```
pip install -r requirements.txt
```

## Generating the SpMM core

```
sbt "runMain mvm.tensor_core_array_wrapper_gen"
```

This will generate the `./src/generated_spmm_core_6x12x8` directory containing the Verilog files for the design, as well as a source file list `./src/generated_spmm_core_6x12x8/tensor_core_array_wrapper.lst`.
All files in the `.lst` should be added to the Quartus project when generating a `.sof` file for the FPGA.
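If you script the Quartus project instead of adding files by hand, a small helper can translate the `.lst` into assignment lines for a project `.tcl`. The sketch below is not part of the repo; it assumes the `.lst` holds one Verilog source path per line and emits standard `set_global_assignment -name VERILOG_FILE` statements:

```python
# sketch: turn the generated .lst into Quartus source assignments
# (assumes the .lst lists one Verilog source path per line)
from pathlib import Path

lst = Path("./src/generated_spmm_core_6x12x8/tensor_core_array_wrapper.lst")

with open("spmm_core_sources.tcl", "w") as out:
    for line in lst.read_text().splitlines():
        src = line.strip()
        if src:  # skip blank lines
            out.write(f"set_global_assignment -name VERILOG_FILE {src}\n")
```

Sourcing the resulting `spmm_core_sources.tcl` from the project's settings script (or pasting its lines into the `.qsf`) adds all generated sources at once.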
## Simulating the SpMM core

Simulating the SpMM core requires QuestaSim and cocotb. A COMPAS-forked cocotb 1.9.2 with an altered makefile template is provided:

```
git clone git@github.com:COMPAS-Lab/sparsity-compas-forked-cocotb.git compas-cocotb
cd compas-cocotb
pip install -e .
```
## Preparing BFP-format attention values

A set of extracted sparse attention values in BFP format is provided to simplify the simulation process. It was extracted from chatglm2-6b-32k on LongBench's vcsum test. Download the partial BFP-format attention values from COMPAS NFS and extract them. The extracted data should contain the sparse attention values of 10 instances:

```
iiSeqInst0105_cidx.npy iiSeqInst0105_val.npy iiSeqInst0109_ridx.npy iiSeqInst0112.json
iiSeqInst0117_cidx.npy iiSeqInst0117_val.npy iiSeqInst0121_ridx.npy iiSeqInst0131.json
iiSeqInst0135_cidx.npy iiSeqInst0135_val.npy iiSeqInst0139_ridx.npy iiSeqInst0141.json
iiSeqInst0147_cidx.npy iiSeqInst0147_val.npy iiSeqInst0105.json iiSeqInst0109_cidx.npy
iiSeqInst0109_val.npy iiSeqInst0112_ridx.npy iiSeqInst0117.json iiSeqInst0121_cidx.npy
iiSeqInst0121_val.npy iiSeqInst0131_ridx.npy iiSeqInst0135.json iiSeqInst0139_cidx.npy
iiSeqInst0139_val.npy iiSeqInst0141_ridx.npy iiSeqInst0147.json iiSeqInst0105_ridx.npy
iiSeqInst0109.json iiSeqInst0112_cidx.npy iiSeqInst0112_val.npy iiSeqInst0117_ridx.npy
iiSeqInst0121.json iiSeqInst0131_cidx.npy iiSeqInst0131_val.npy iiSeqInst0135_ridx.npy
iiSeqInst0139.json iiSeqInst0141_cidx.npy iiSeqInst0141_val.npy iiSeqInst0147_ridx.npy
```

The data is generated by the sparse attention analyzer.
Each `iiSeqInst<inst_id>_val.npy` contains the sparsified heads from one instance, where `inst_id`
is the sequence id in the LongBench test set. Likewise, `iiSeqInst<inst_id>_cidx.npy` and
`iiSeqInst<inst_id>_ridx.npy` contain the column and row block indices of the remaining dense values.
These data are provided to the testbench to generate the SpMM test cases.
A copy of the extracted data is located at `/compas-old/projects/sparse-attention` on COMPAS NFS.
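As a quick sanity check on the extracted data, a minimal sketch like the one below can load one instance's arrays and metadata. The exact array shapes and `.json` layout are assumptions here (the cocotb testbench is the authoritative consumer); this just confirms the files load:

```python
# sketch: inspect one extracted instance (shapes/contents are assumptions,
# not a documented format; the cocotb testbench is the real consumer)
import json
import numpy as np

inst_id = "iiSeqInst0105"
base = "/compas-old/projects/sparse-attention/chatglm2-6b-32k-attn-bfp20-vcsum/"

vals = np.load(base + f"{inst_id}_val.npy")   # sparsified heads for this instance
cidx = np.load(base + f"{inst_id}_cidx.npy")  # column block indices of kept dense values
ridx = np.load(base + f"{inst_id}_ridx.npy")  # row block indices of kept dense values
with open(base + f"{inst_id}.json") as f:
    meta = json.load(f)                       # per-instance metadata

print(vals.shape, cidx.shape, ridx.shape)
print(sorted(meta.keys()) if isinstance(meta, dict) else type(meta))
```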
## Running the simulation

Specify the path to the SpMM core design generated by SpinalHDL in cocotb's makefile:

```make
# specify spmm core src path here
VERILOG_SOURCES += $(PWD)/../src/generated_spmm_core_6x12x8/*.v
```

Specify the `inst_id` and head index `hidx` for simulation in the cocotb testbench:

```python
inst_id = "iiSeqInst0105"
hidx = 15
```

Specify the test data path in the cocotb testbench, along with an empty directory path for the json files generated by the attention value parser. Make sure to provide the path to the data you extracted in Preparing BFP-format attention values:

```python
attn_data_path = "/compas-old/projects/sparse-attention/chatglm2-6b-32k-attn-bfp20-vcsum/"
inter_config_path = f"/compas-old/projects/sparse-attention/sim_test/chatglm2-6b-32k-attn-bfp20-vcsum/{inst_id}"
```
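The instructions above call for `inter_config_path` to be an empty directory. Whether the testbench creates it for you is not stated here, so a one-off sketch like this (variable names mirror the testbench settings above) can pre-create it and verify it is empty:

```python
# sketch: pre-create the (empty) directory for parser-generated json files;
# assumes the testbench does not create it on its own
import os

inst_id = "iiSeqInst0105"
inter_config_path = f"/compas-old/projects/sparse-attention/sim_test/chatglm2-6b-32k-attn-bfp20-vcsum/{inst_id}"

os.makedirs(inter_config_path, exist_ok=True)
if os.listdir(inter_config_path):
    raise RuntimeError(f"{inter_config_path} is not empty")
```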
Then run:

```
cd sim
make -f makefile.cocotb clean
make -f makefile.cocotb GUI=0
```

This will start QuestaSim and run the simulation. If you want to view the waveform, specify `GUI=1` instead.
## Building the hardware

Refer to nx10-matmul-project to build the hardware.