Add support for custom compiler flags via FUSILLI_EXTRA_COMPILER_FLAGS #115

bangtianliu · 2026-01-26T07:14:47Z

This PR adds support for passing custom IREE compiler flags to fusilli via:

- the FUSILLI_EXTRA_COMPILER_FLAGS environment variable (for the C++ driver)
- the --Xiree-compile flag (for the Python benchmark wrapper)

This enables users to customize compilation with arbitrary compiler flags,
such as optimization levels, tuning specs, or debug options.

Tuning specs produced by the tuner can be passed through these flags to pick up the performance gains from tuning.

Local testing shows an ~8.2% reduction in mean latency on a bf16 matrix multiplication (matmul -M 8192 -N 2048 -K 4096 --transA --a_type bf16 --b_type bf16 --out_type bf16) when using a tuned transform dialect spec.
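As a rough illustration of how the two entry points could relate, the sketch below collects repeated --Xiree-compile values in a Python wrapper and forwards them to the driver through FUSILLI_EXTRA_COMPILER_FLAGS. The function name build_driver_env, the forwarding through the environment, and the shell-style quoting are assumptions made for illustration; this is not the code in benchmarks/run_benchmark.py.

```python
import argparse
import os
import shlex


def build_driver_env(argv, base_env=None):
    """Collect repeated --Xiree-compile values into one environment variable.

    Hypothetical helper: the real wrapper may forward flags differently.
    """
    parser = argparse.ArgumentParser()
    # The `--Xiree-compile=<flag>` form is used so that values starting with
    # `--` are not mistaken for options of the wrapper itself.
    parser.add_argument(
        "--Xiree-compile", dest="xiree_compile", action="append", default=[]
    )
    # Everything the wrapper does not recognize is left for the benchmark.
    args, _unknown = parser.parse_known_args(argv)

    env = dict(os.environ if base_env is None else base_env)
    if args.xiree_compile:
        # shlex.join quotes values containing spaces, so a consumer that
        # splits with shell rules gets the same list back (an assumption
        # about how the consumer tokenizes the variable).
        env["FUSILLI_EXTRA_COMPILER_FLAGS"] = shlex.join(args.xiree_compile)
    return env
```

With this shape, each compiler flag is passed as its own --Xiree-compile=<flag> argument, and the environment variable is left unset when no extra flags are given.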

Assisted-by: Claude

bangtianliu · 2026-01-26T07:24:52Z

The transform dialect (td) spec produced by the tuner and used for this measurement:

module attributes {iree_codegen.tuning_spec_with_default_entrypoint, transform.with_named_sequence} {
  transform.named_sequence @apply_op_config(%arg0: !transform.any_op {transform.readonly}, %arg1: !transform.any_param {transform.readonly}) {
    transform.annotate %arg0 "compilation_info" = %arg1 : !transform.any_op, !transform.any_param
    transform.yield 
  }
  transform.named_sequence @match_main$async_dispatch_0_matmul_8192x2048x4096_bf16xbf16xf32(%arg0: !transform.any_op {transform.readonly}) -> (!transform.any_op, !transform.any_param) {
    %batch_dims, %m_dims, %n_dims, %k_dims = transform.iree.match.contraction %arg0, lhs_type = bf16, rhs_type = bf16, output_type = f32, indexing_maps = [affine_map<(d0, d1, d2) -> (d2, d0)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>] : !transform.any_op -> !transform.param<i64>
    transform.iree.match.dims_equal %batch_dims, [] : !transform.param<i64>
    transform.iree.match.dims_equal %m_dims, [8192] : !transform.param<i64>
    transform.iree.match.dims_equal %n_dims, [2048] : !transform.param<i64>
    transform.iree.match.dims_equal %k_dims, [4096] : !transform.param<i64>
    %0 = transform.param.constant #iree_codegen.compilation_info<lowering_config = #iree_gpu.lowering_config<{mma_kind = #iree_gpu.mma_layout<WMMAR4_F32_16x16x16_BF16>, promote_operands = [0, 1], reduction = [0, 0, 1], subgroup = [4, 4, 0], subgroup_basis = [[4, 2, 1], [0, 1, 2]], workgroup = [256, 128, 0]}>, translation_info = <pipeline = LLVMGPUTileAndFuse workgroup_size = [256, 1, 1] subgroup_size = 32, {gpu_pipeline_options = #iree_gpu.pipeline_options<prefetch_num_stages = 2>, llvm_func_attrs = {"amdgpu-waves-per-eu" = "2"}}>> -> !transform.any_param
    transform.yield %arg0, %0 : !transform.any_op, !transform.any_param
  }
  transform.named_sequence @__kernel_config(%arg0: !transform.any_op {transform.consumed}) -> !transform.any_op attributes {iree_codegen.tuning_spec_entrypoint} {
    %updated_root = transform.foreach_match in %arg0 
        @match_main$async_dispatch_0_matmul_8192x2048x4096_bf16xbf16xf32 -> @apply_op_config : (!transform.any_op) -> !transform.any_op
    transform.yield %updated_root : !transform.any_op
  }
}

Results (AMD gfx1201):

| Configuration | Mean Latency | Min Latency | Max Latency | Improvement |
| --- | --- | --- | --- | --- |
| Baseline (no tuning spec) | 1775.95 μs | 1751.00 μs | 1839.00 μs | - |
| With tuning spec | 1630.05 μs | 1612.00 μs | 1682.00 μs | 8.2% faster |
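For reference, the 8.2% figure corresponds to the relative reduction in mean latency. A small sketch of the arithmetic, using only the two mean latencies reported above (the helper names are illustrative):

```python
baseline_us = 1775.95  # mean latency without the tuning spec
tuned_us = 1630.05     # mean latency with the tuning spec


def latency_reduction_pct(baseline, tuned):
    """Relative drop in latency, as a percentage of the baseline."""
    return (baseline - tuned) / baseline * 100.0


def speedup(baseline, tuned):
    """Throughput-style speedup factor (baseline time over tuned time)."""
    return baseline / tuned


print(f"{latency_reduction_pct(baseline_us, tuned_us):.1f}% lower mean latency")
print(f"{speedup(baseline_us, tuned_us):.2f}x speedup")
```

The two numbers are not interchangeable: a latency reduction of about 8.2% corresponds to a speedup factor of roughly 1.09x.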

sjain-stanford

This is in draft, but I just wanted to leave some high-level "directionary" feedback before you go further down this path. At this stage, while the tuning integration is a POC, we don't want to bake the flags into fusilli/backend/*. You could instead have a catch-all variable for any additional flags (e.g. specified through environment variables or the CLI). This lets us evaluate forward-looking things without baking the design in too early, which would make it hard to refactor to a more optimal integration later.

kuhar

Would it make sense to have a generic escape hatch for passing extra compiler flags? In other projects we use --Xiree-compile='....' or --Xiree_compile='...': https://github.com/search?q=repo%3Anod-ai%2Firee-kernel-benchmark%20xiree&type=code , similar to -Xclang and others: https://clang.llvm.org/docs/ClangCommandLineReference.html#compilation-options

bangtianliu · 2026-01-27T19:49:54Z

> Would it make sense to have a generic escape hatch for passing extra compiler flags? In other projects we use --Xiree-compile='....' or --Xiree_compile='...': https://github.com/search?q=repo%3Anod-ai%2Firee-kernel-benchmark%20xiree&type=code

I think it's a good suggestion. The generic --Xiree-compile='...' escape hatch would be more flexible.

Actually, looking at the current hardcoded flags in backend.h, some like --iree-dispatch-creation-enable-split-reduction might be better as user-configurable options rather than always-on. The --Xiree-compile approach would let users experiment with enabling/disabling these optimizations.

@sjain-stanford what do you think? We could also introduce --Xiree-compile in a separate PR to keep this PR focused.

sjain-stanford · 2026-01-27T20:00:34Z

Yeah that's what I was initially thinking - a fully general escape hatch for compiler flags. This would be mainly used for tuning for now, but could be extended for debug purposes. Thanks @kuhar for the suggestion.

sjain-stanford · 2026-01-27T21:23:52Z

I think using a single environment variable to accept all extra flags seems reasonable - @kuhar do you see concerns with this?
Then we can just document that one variable in the README and not go into specifics for tuning etc. I'd still update the README in the nod-ai/fusilli-benchmarks repo with the exact flag to set so folks who run benchmarks will know how to use it.
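To make the "single environment variable for all extra flags" idea concrete, here is a minimal sketch of reading such a variable and turning it into a flag list. The source does not say how fusilli tokenizes the value, so the shell-style splitting below is an assumption, and extra_compiler_flags is a hypothetical helper name.

```python
import os
import shlex


def extra_compiler_flags(env=None, var="FUSILLI_EXTRA_COMPILER_FLAGS"):
    """Return the extra flags from `var` as a list, or [] if it is unset.

    Assumption: the value is tokenized with shell-like rules, so a flag whose
    value contains spaces can be wrapped in quotes.
    """
    env = os.environ if env is None else env
    return shlex.split(env.get(var, ""))
```

A catch-all variable like this keeps the documentation to one entry, at the cost of pushing flag validation onto the user.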

sjain-stanford · 2026-01-28T19:10:16Z

include/fusilli/backend/backend.h

+    // Add extra flags to both CPU and AMDGPU backends.
+    addExtraFlags(cpuFlags);
+    addExtraFlags(amdGpuFlags);


I'd asked this elsewhere, but what happens if users specify a GPU-specific flag (say --iree-hip-target..) through FUSILLI_EXTRA_COMPILER_FLAGS? It gets added to both cpuFlags and amdGpuFlags - would it crash the compiler when the selected backend is CPU, or would it be silently ignored?

+1, I don't understand the distinction -- I don't think the compiler C API has separate entry points for these

Then how about adding extra flags to AMDGPU backend only?

> I don't think the compiler C API has separate entry points for these

OK then it's probably fine to say this is user responsibility (to make sure the flags provided match the backend selected).

> I don't understand the distinction

The reason we have a map that stores the cpuFlags and amdGpuFlags separately is to allow multiple sessions+graphs to co-exist. So say there's a machine with AMDGPU, users might create graph 1 on the cpu handle and graph 2 on the gpu handle. In such cases, they'd want to make sure the extra flags are not commingled for both backends.

> It gets added to both cpuFlags and amdGpuFlags - would it crash the compiler when the selected backend is CPU, or would it be silently ignored?

I thought about this. Since the compiler is basically cross-compiling for whatever target is specified, it likely won't crash the compilation itself and a vmfb should get generated. The issue is when the vmfb is executed on a backend that it was not compiled for (through flags targeting a different backend). So again it becomes a user responsibility to make sure they specify extra flags pertaining to the correct backend.

I think we can leave it as is for now, and if this becomes a prevalent pattern we can revisit it with alternate approaches (e.g. a backend-specific env variable, or reading extra flags from backend-specific files).
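The trade-off discussed in this thread can be sketched as follows. The first helper mirrors the current behavior described above (the same extra flags appended to every backend's list); the second sketches the backend-specific alternative mentioned as a possible follow-up. The flag strings are placeholders, and the helper names and the dictionary layout are assumptions for illustration, not fusilli's actual data structures.

```python
# Placeholder defaults; the real per-backend flag lists live in backend.h.
DEFAULT_FLAGS = {
    "cpu": ["--placeholder-cpu-default"],
    "amdgpu": ["--placeholder-amdgpu-default"],
}


def flags_with_extras(default_flags, extra_flags):
    """Append the same extra flags to every backend (current behavior)."""
    return {
        backend: list(flags) + list(extra_flags)
        for backend, flags in default_flags.items()
    }


def flags_with_backend_extras(default_flags, extras_by_backend):
    """Append only the extras registered for each backend (possible follow-up)."""
    return {
        backend: list(flags) + list(extras_by_backend.get(backend, []))
        for backend, flags in default_flags.items()
    }


shared = flags_with_extras(DEFAULT_FLAGS, ["--placeholder-gpu-only-flag"])
scoped = flags_with_backend_extras(
    DEFAULT_FLAGS, {"amdgpu": ["--placeholder-gpu-only-flag"]}
)
```

In `shared`, the GPU-only placeholder also lands in the CPU list, which is the situation the thread leaves as a user responsibility; in `scoped`, it reaches only the AMDGPU list.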

Signed-off-by: Bangtian Liu <liubangtian@gmail.com>

bangtianliu · 2026-01-28T22:48:04Z

I've updated the tests to follow the established pattern in the codebase. After examining other test files like test_conv_attributes.cpp, I found that the codebase uses simple REQUIRE with == for comparing vectors, rather than REQUIRE_THAT with matchers.

sjain-stanford

LGTM, thanks for iterating on this! Please wait for @kuhar's approval before landing.

bangtianliu · 2026-01-29T23:13:21Z

cc @kuhar for review

kuhar

Can you update the PR title and description to better match the contents? It allows for arbitrary compiler flags, not only tuning spec paths.

bangtianliu marked this pull request as draft January 26, 2026 07:15

bangtianliu force-pushed the bangtian/add_tuning_spec_support branch from a7292b7 to aeb017b Compare January 26, 2026 07:17

bangtianliu force-pushed the bangtian/add_tuning_spec_support branch 3 times, most recently from db0456f to a974e07 Compare January 26, 2026 22:39

sjain-stanford reviewed Jan 26, 2026

bangtianliu force-pushed the bangtian/add_tuning_spec_support branch 6 times, most recently from d9fe6db to a3e39a2 Compare January 26, 2026 23:40

bangtianliu marked this pull request as ready for review January 26, 2026 23:40

bangtianliu requested review from IanWood1, kuhar and sjain-stanford January 26, 2026 23:41

bangtianliu force-pushed the bangtian/add_tuning_spec_support branch 5 times, most recently from 484f316 to 60069c4 Compare January 27, 2026 18:11

sjain-stanford reviewed Jan 27, 2026

sjain-stanford reviewed Jan 27, 2026

sjain-stanford reviewed Jan 27, 2026

sjain-stanford reviewed Jan 27, 2026

kuhar reviewed Jan 27, 2026

kuhar reviewed Jan 28, 2026
