This report documents an effort to insert OpenACC directives into the LU (Lower-Upper symmetric Gauss-Seidel) benchmark of the NAS Parallel Benchmarks (NPB) using AI coding with a large language model (LLM), with the goal of high-speed execution on NVIDIA GPUs. While we succeeded in parallelizing the straightforward sections, verification errors occurred in the parts that require the LU algorithm's essential dependency handling (the sections that need the Pipeline method), and a complete port was not achieved. Through this trial, we clarified the possibilities and limitations of AI coding, in particular the challenges of porting HPC code that requires advanced parallelization techniques.
The LU benchmark of NAS Parallel Benchmarks (NPB) is a computational kernel that solves three-dimensional compressible Navier-Stokes equations using the Lower-Upper symmetric Gauss-Seidel method. The project objectives are as follows:
- Primary objective: Insert OpenACC directives to achieve high-speed execution on NVIDIA GPUs
- Evaluation metric: Reduction of FIGURE OF MERIT (execution time)
- Constraint: Maintain numerical correctness (Verification Successful)
LLM:
- Claude Code (Claude Sonnet 4.5)
Target code:
- NAS Parallel Benchmarks 3.3.1 LU
- Approximately 3,000 lines of code written in Fortran 77/90
- Uses SSOR (Symmetric Successive Over-Relaxation) algorithm
Execution environment:
- Compiler: nvfortran (NVIDIA HPC SDK)
- Compilation flags: `-O3 -acc -gpu=cc89 -Minfo=accel`
- GPU: Compute Capability 8.9 (NVIDIA L40S, 48GB GDDR6)
- Benchmark class: A (grid size 64×64×64)
The LU benchmark has the following characteristics that make simple parallelization difficult:
- Diagonal dependencies: Dependencies exist in three-dimensional diagonal directions in lower and upper triangular solves
- SSOR method constraints: The k-direction loop is inherently sequential
- Complex data flow: Complex read-write dependencies exist among multiple arrays
Addressing these challenges requires an advanced parallelization technique called Wavefront parallelization (Hyperplane method).
This project adopted the following three-phase approach:
Phase 1: Preprocessing and Analysis
├─ Development of Fortran static analysis script
└─ Automatic dependency analysis and pattern recognition
Phase 2: Development Environment Setup
├─ Automation using MCP server [make-mcp](https://gitlab.com/wddawson/make-mcp)
└─ Streamlined compilation and execution
Phase 3: Incremental GPU Porting
├─ Data management optimization
├─ Handling straightforward parallelizable sections
└─ Trial of Wavefront parallelization
We developed fortran_gpu_analyzer.py to enable the LLM to insert appropriate OpenACC directives.
1: Static Analysis
- Analysis of subroutine definitions
- Extraction of array declarations (type, dimensions, size)
- Analysis of DO loop structure (nesting depth, loop variables, ranges)
- Tracking of array access patterns (distinguishing reads and writes)
2: Dependency Analysis
- Calculation of dependency vectors (dependency distance for each dimension)
- Determination of dependency direction (forward/backward/mixed)
- Extraction of dependencies within the same loop context
3: Pattern Recognition and Strategy Generation
- Matching with known algorithm patterns
- Forward substitution
- Backward substitution
- Jacobi stencil (independently parallelizable)
- Gauss-Seidel (Red-Black ordering)
- Automatic recommendation of GPU porting strategies
- Generation of required OpenACC directives
- Suggestion of auxiliary arrays

Analysis results for blts.f (lower triangular solve):

```json
{
  "pattern_type": "wavefront",
  "affected_dimensions": ["k", "j", "i"],
  "dependency_vectors": [
    {"direction": "k-1", "offset": -1},
    {"direction": "j-1", "offset": -1},
    {"direction": "i-1", "offset": -1}
  ],
  "recommended_strategy": "diagonal_wavefront",
  "transformation_needed": "hyperplane_method"
}
```

In this way, the script automatically:
- Detects three-dimensional wavefront dependencies
- Recommends diagonal (wavefront) parallelization
- Suggests the necessary auxiliary arrays (np, indxp, jndxp)
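The pattern-recognition step can be sketched in Python. This is a simplified illustration of the kind of rule the analyzer applies to the extracted dependency vectors, not the actual `fortran_gpu_analyzer.py` code; the function name `classify_pattern` is hypothetical.

```python
def classify_pattern(dep_vectors):
    """Classify a loop nest from its extracted dependency vectors.

    dep_vectors: list of dicts like {"direction": "k-1", "offset": -1},
    one entry per dependent loop dimension (as in the JSON analysis output).
    """
    if not dep_vectors:
        return "independent"            # jacld/jacu style, trivially parallel
    offsets = [v["offset"] for v in dep_vectors]
    if all(o < 0 for o in offsets):
        # Backward dependency in every dimension: with three (or more)
        # dimensions this is the diagonal wavefront case of blts/buts.
        return "wavefront" if len(offsets) >= 3 else "forward_substitution"
    if all(o > 0 for o in offsets):
        return "backward_substitution"
    return "mixed"

blts_deps = [{"direction": "k-1", "offset": -1},
             {"direction": "j-1", "offset": -1},
             {"direction": "i-1", "offset": -1}]
print(classify_pattern(blts_deps))      # prints: wavefront
```

A real analyzer also has to distinguish loop-carried from loop-independent dependencies, but the branch structure above matches the patterns listed earlier (forward/backward substitution, independent, wavefront).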
We automated compilation and execution using the MCP (Model Context Protocol) server.
.mcp.json:

```json
{
  "mcpServers": {
    "make-mcp": {
      "command": "make-mcp",
      "args": ["--config", "l40s", "--project-dir", "."]
    }
  }
}
```

| Tool | Function | Corresponding Make Target |
|---|---|---|
| `mcp__make-mcp__make_default` | Execute build | `make default` |
| `mcp__make-mcp__make_build` | Generate object files | `make build` |
| `mcp__make-mcp__make_run` | Run benchmark | `make run` |
| `mcp__make-mcp__make_clean` | Cleanup | `make clean` |
| `mcp__make-mcp__make_sysinfo` | Display system information | `make sysinfo` |
- Consistency: Ensures consistency in command execution by always using MCP tools
- Traceability: All builds and executions are logged
- Error Handling: Structured error information retrieval through tools
The LU benchmark consists of the following main files:
| File | Role | Parallelization Difficulty | Priority |
|---|---|---|---|
| `lu.f` | Main driver | N/A (control only) | - |
| `ssor.f` | SSOR iteration control | Medium (data management) | High |
| `blts.f` | Lower triangular solve (L inverse) | Very High | Critical |
| `buts.f` | Upper triangular solve (U inverse) | Very High | Critical |
| `jacld.f` | Lower Jacobian computation | Low (independent parallel) | Medium |
| `jacu.f` | Upper Jacobian computation | Low (independent parallel) | Medium |
| `rhs.f` | Right-hand side vector computation | Low (stencil) | Medium |
| `l2norm.f` | L2 norm computation | Low (reduction) | Low |
The following incremental strategy was defined in PLAN.md:
```fortran
!$acc data present(u, rsd, frct, qs, rho_i) &
!$acc      create(a, b, c, d) &
!$acc      copyin(dt, omega, nx, ny, nz, ist, iend, jst, jend)
! ... SSOR iteration ...
!$acc end data
```

Objective: keep all major arrays resident in GPU memory to minimize CPU-GPU transfers.
jacld.f, jacu.f (Jacobian computation):

```fortran
!$acc parallel loop collapse(2) present(u, rho_i, qs, a, b, c, d)
do j = jst, jend
   do i = ist, iend
      ! Point-independent computation
   end do
end do
```

l2norm.f (Reduction):
```fortran
!$acc parallel loop collapse(3) reduction(+:sum) present(v)
do k = 2, nz0-1
   do j = jst, jend
      do i = ist, iend
         do m = 1, 5
            sum(m) = sum(m) + v(m,i,j,k) * v(m,i,j,k)
         end do
      end do
   end do
end do
```

rhs.f (Stencil computation):
```fortran
!$acc parallel loop collapse(3) present(rsd, u, rho_i, qs, frct)
do k = 2, nz-1
   do j = jst, jend
      do i = ist, iend
         ! Stencil computation (reading neighboring points only)
      end do
   end do
end do
```

Transformation of blts.f / buts.f:
Original loop structure:

```fortran
do k = 2, nz-1
   do j = jst, jend
      do i = ist, iend
         ! v(m,i,j,k) depends on v(*,i-1,j,k), v(*,i,j-1,k), v(*,i,j,k-1)
      end do
   end do
end do
```

After Wavefront transformation:
```fortran
! Pre-compute diagonal indices
call setup_diagonals()   ! Generate np(l), indxp(l,n), jndxp(l,n)

! Sequential processing for each diagonal level
do l = 1, ndiag
   npl = np(l)
   ! Parallel processing of points within each diagonal
!$acc parallel loop gang vector num_gangs((npl+127)/128) &
!$acc      vector_length(128)
   do n = 1, npl
      j = jndxp(l,n)
      i = indxp(l,n)
      k = l - i - j
      ! Computation for point (i,j,k)
      ! All points at this level are mutually independent
   end do
end do
```

Required auxiliary arrays:
```fortran
integer, parameter :: maxdiag = 3*isiz1
integer np(maxdiag)                   ! Number of points on diagonal l
integer indxp(maxdiag, isiz1*isiz2)   ! i-index
integer jndxp(maxdiag, isiz1*isiz2)   ! j-index
```

Key insight: all points satisfying i + j + k = constant (= l) have no mutual dependencies and can be computed simultaneously.
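The index precomputation that `setup_diagonals` must perform can be sketched in Python (a hypothetical analogue of the Fortran routine, reusing the `np`/`indxp`/`jndxp` naming; the grid bounds in the example are illustrative):

```python
def setup_diagonals(ist, iend, jst, jend, nz):
    """Group grid points by diagonal level l = i + j + k.

    Returns Python analogues of the auxiliary arrays:
    np_[l]   -- number of points on diagonal l
    indxp[l] -- i-indices of those points
    jndxp[l] -- j-indices (k is recovered as l - i - j)
    """
    ndiag = iend + jend + nz                  # largest possible i + j + k
    np_, indxp, jndxp = {}, {}, {}
    for l in range(1, ndiag + 1):
        pts = [(i, j)
               for j in range(jst, jend + 1)
               for i in range(ist, iend + 1)
               if 1 <= l - i - j <= nz]       # boundary condition on k
        np_[l] = len(pts)
        indxp[l] = [p[0] for p in pts]
        jndxp[l] = [p[1] for p in pts]
    return np_, indxp, jndxp

# Sanity check: every (i,j,k) point lands on exactly one diagonal,
# so the per-diagonal counts sum to the total number of grid points.
np_, indxp, jndxp = setup_diagonals(1, 4, 1, 4, 4)
assert sum(np_.values()) == 4 * 4 * 4
```

The boundary test `1 <= l - i - j <= nz` is the Python form of the `max(ist, l-j-nz+1)`-style clipping that makes the Fortran version error-prone.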
Implementation proceeded according to the following task list:
✅ Add OpenACC data management to ssor.f
✅ Parallelize jacld.f with OpenACC
✅ Parallelize jacu.f with OpenACC
✅ Parallelize l2norm.f with reduction
✅ Parallelize rhs.f with OpenACC
✅ Test Phase 1-2 optimizations
◼️ Implement wavefront parallelization for blts.f ← Blocked
◻️ Implement wavefront parallelization for buts.f
◻️ Final testing and verification
Parallelization of blts.f / buts.f (triangular solves) failed:

- Loss of numerical correctness:
  ```
  Verification: FAILED
  Error: RMS norm difference exceeds tolerance
  ```
- Root cause:
  - Simple parallelization cannot properly handle the dependencies
  - CPU/GPU data transfer was used as a workaround, but numerical correctness could not be maintained
  - The Wavefront parallelization implementation was incomplete
- Technical challenges:
  - Complexity of the logic for generating diagonal index arrays
  - Restructuring the coefficient arrays (`d(5,5,i,j)` → `d(5,5,n)`)
  - Splitting into multiple kernels (separating k-1 from i-1, j-1 dependencies)
  - Appropriate placement of synchronization points
Status: Partial Success
- Compilation: ✅ Success
- Execution: ✅ Completes
- Verification: ❌ FAILED
- Performance: N/A (verification failed)
Example execution log:

```
NAS Parallel Benchmarks 3.1 - LU Benchmark

Size: 64x64x64
Iterations: 250

Time step 1
Time step 2
...
Time step 250

Verification being performed for class A
Accuracy setting for epsilon = 0.1000000000000E-07
Comparison of RMS-norms of residual
FAILURE: 1 0.3167121203225E+19 0.7790210760669E+03 0.4065514144001E+16
FAILURE: 2 0.2226688354010E+17 0.6340276525969E+02 0.3511973562809E+15
FAILURE: 3 0.1712390193331E+18 0.1949924972729E+03 0.8781826056284E+15
FAILURE: 4 0.1205243868453E+18 0.1784530116042E+03 0.6753844374038E+15
FAILURE: 5 0.4840029262702E+19 0.1838476034946E+04 0.2632631141609E+16
Comparison of RMS-norms of solution error
FAILURE: 1 0.5073968638522E+03 0.2996408568547E+02 0.1593350063066E+02
FAILURE: 2 0.4117400734144E+02 0.2819457636500E+01 0.1360352048153E+02
FAILURE: 3 0.1210827223612E+03 0.7347341269877E+01 0.1547980104825E+02
FAILURE: 4 0.1053131418499E+03 0.6713922568778E+01 0.1468578439370E+02
FAILURE: 5 0.8320050299906E+03 0.7071531568839E+02 0.1076555632809E+02
Comparison of surface integral
FAILURE: 0.6793137050151E+03 0.2603092560489E+02 0.2509640991358E+02
Verification failed

LU Benchmark Completed.
Class           = A
Size            = 64x 64x 64
Iterations      = 250
Time in seconds = 0.15
Mop/s total     = 819678.10
Operation type  = floating point
Verification    = UNSUCCESSFUL
Version         = 3.3.1
Compile date    = 04 Feb 2026

Compile options:
   F77          = nvfortran
   FLINK        = $(F77)
   F_LIB        = (none)
   F_INC        = (none)
   FFLAGS       = -acc -Minfo=accel
   FLINKFLAGS   = -acc
   RAND         = (none)

Please send all errors/feedbacks to:
NPB Development Team
npb@nas.nasa.gov
```
- Quick development of analysis scripts (Python 950 lines, implemented in approximately 30 minutes)
- Automatic insertion of basic OpenACC directives
- Automatic generation of documentation (PLAN.md, CLAUDE.md)
The LLM correctly recognized the following patterns:
- Independently parallelizable loops (jacld, jacu)
- Reduction operations (l2norm)
- Stencil computations (rhs)
This reflected the following underlying capabilities:
- Understanding of OpenACC specifications
- Parsing of Fortran 77 syntax
- Recognition of common HPC algorithm patterns
Fundamental structural transformations like Wavefront parallelization are difficult:

- Necessity of complex preprocessing:
  - Logic for computing the diagonal index arrays
  - Accurate handling of boundary conditions (e.g., `max(ist, l-j-nz+1)`)
  - Memory layout optimization
- Multi-stage dependencies:
  - Processing the k-1 dependency first
  - Processing the i-1 and j-1 dependencies afterward
  - Appropriate kernel splitting and synchronization placement
- Ensuring numerical stability:
  - Changes in floating-point operation order affect results
  - Guaranteeing precision of the diagonal block inversion
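The floating-point ordering issue is easy to reproduce. A minimal Python illustration (not project code): left-to-right accumulation loses cancellation that a correctly rounded sum preserves, which is why a parallelized GPU reduction rarely matches the sequential CPU result bit-for-bit.

```python
import math

# 3000 values whose mathematical sum is exactly 1000.0
# (each 1e16 is cancelled by a -1e16, leaving 1000 ones).
vals = [1e16, 1.0, -1e16] * 1000

naive = 0.0                  # sequential left-to-right accumulation
for v in vals:
    naive += v               # each 1.0 is absorbed: 1e16 + 1.0 == 1e16

exact = math.fsum(vals)      # correctly rounded floating-point sum

print(naive, exact)
assert exact == 1000.0
assert naive != exact        # same values, different order of operations
```

A parallel reduction effectively reorders and regroups these additions, so even a *correct* GPU port needs a verification tolerance rather than bit-exact comparison; the failures reported above are far beyond any such tolerance.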
Debugging process when verification fails:
1. Check compiler warnings → None
2. Check data transfers → No issues with present/copy
3. Check array bounds → Ranges are accurate
4. Re-check dependencies → Problem discovered here (wavefront needed)
5. Compare with reference implementation → Understand C version structure
6. Implementation trial → Too complex, complete implementation not achieved
The LLM is good at steps 1-3, but steps 4-6 require human intervention.
The following knowledge is required but insufficient for LLMs:
- Numerical computation algorithm theory: Theoretical background of SOR algorithms
- Parallel computing theory: Hyperplane method, Red-Black ordering
- GPU architecture: Warp divergence, Memory coalescing
- OpenACC implementation details: Complex data mapping, asynchronous execution
The core challenge of the LU benchmark is that Pipeline parallelization (Wavefront parallelization) is indispensable.
Point (i,j,k) depends on:
- (i-1, j, k ) ← i-direction dependency
- (i, j-1, k ) ← j-direction dependency
- (i, j, k-1) ← k-direction dependency
All points on the diagonal plane i + j + k = l (constant) can be computed independently of each other:
Example of the diagonal l = 3 (simplified to 2D):

```
  j
  ^
  |
3 o
2    o
1       o
0---------> i
  0  1  2
```

Points (0,3), (1,2), (2,1) lie on the same diagonal with i+j=3
→ These can be computed in parallel
In three dimensions, this dependency structure likewise means that only points on the same diagonal hyperplane (i + j + k = constant) are mutually independent.
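This independence property can be checked mechanically. The sketch below (illustrative Python, not project code) enumerates each hyperplane of a small grid and confirms that every dependency predecessor of a point lies on the previous plane, never the same one:

```python
def predecessors(i, j, k):
    # The LU stencil: point (i,j,k) reads (i-1,j,k), (i,j-1,k), (i,j,k-1)
    return [(i - 1, j, k), (i, j - 1, k), (i, j, k - 1)]

n = 8                                    # small cubic grid, indices 1..n
for l in range(3, 3 * n + 1):            # all diagonal levels
    plane = {(i, j, l - i - j)
             for i in range(1, n + 1)
             for j in range(1, n + 1)
             if 1 <= l - i - j <= n}
    for p in plane:
        for q in predecessors(*p):
            # Each predecessor has index sum l - 1, so it can never
            # lie on the same plane as p.
            assert q not in plane
print("no intra-plane dependencies")
```

This is the formal justification for the wavefront schedule: the outer loop over `l` stays sequential, while everything inside one plane may run in parallel.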
| Transformation Step | Complexity | Reason |
|---|---|---|
| Index computation | High | Accurate boundary condition handling required |
| Array restructuring | High | Major memory layout changes |
| Kernel splitting | Medium | Separation by dependency direction |
| Synchronization management | High | Accurate synchronization between diagonals |
Structure of successful reference implementation (blts.c):

```c
// 1. Pre-compute diagonal indices
for (l = 1; l <= ndiag; l++) {
    np[l] = 0;
    for (j = jst; j <= jend; j++) {
        for (i = ist; i <= iend; i++) {
            k = l - i - j;
            if (k >= 1 && k <= nz) {
                np[l]++;
                indxp[l][np[l]] = i;
                jndxp[l][np[l]] = j;
            }
        }
    }
}

// 2. Wavefront parallelization
for (l = 1; l <= ndiag; l++) {
    int npl = np[l];
    #pragma acc parallel loop gang vector \
        num_gangs((npl+127)/128) vector_length(128)
    for (n = 1; n <= npl; n++) {
        j = jndxp[l][n];
        i = indxp[l][n];
        k = l - i - j;
        // Compute and invert diagonal block
        // ...
    }
}
```

This implementation is superior in the following respects:
- Memory access optimization (coalescing)
- Appropriate gang/vector settings
- Accurate dependency handling
In this project, we attempted GPU porting of the NPB LU benchmark using AI coding.
Quantitative achievements:
- Analysis script: 950 lines of Python code
- Modified files: 6 (jacld.f, jacu.f, l2norm.f, rhs.f, ssor.f, applu.incl)
- Successful parallelizations: 5 subroutines
Qualitative achievements:
- Establishment of automated Fortran dependency analysis methods
- Streamlined development flow using MCP server
- Demonstration of incremental approach to GPU porting
Success factors:
- Effectiveness of preprocessing: Provision of structured information by fortran_gpu_analyzer.py
- Incremental approach: Effectiveness of strategy starting from simple parts
- Power of automation: Acceleration of iterative experiments through MCP server
Failure factors:
- Algorithmic complexity: Fundamental difficulty of Wavefront parallelization
- LLM limitations: Lack of deep domain knowledge
- Verification difficulty: Ensuring numerical correctness
Position of AI Coding:

```
┌─────────────────────────────────┐
│ HPC GPU Porting Difficulty      │
│ Spectrum                        │
├─────────────────────────────────┤
│ Easy ← ─────────────── → Hard   │
├─────────────────────────────────┤
│ [✅ Possible with AI]           │
│ - Independent parallelization   │
│ - Data management               │
│ - Reduction                     │
│                                 │
│ [⚠️ Partially possible with AI] │
│ - Stencil computation           │
│ - Loop transformation           │
│                                 │
│ [❌ Difficult with AI]          │
│ - Wavefront parallelization     │
│ - Pipeline method               │
│ - Algorithm redesign            │
└─────────────────────────────────┘
```
- Near term:
  - Utilization of more advanced LLM models (GPT-5, Claude 5, etc.)
  - Domain-specific fine-tuning
  - Multi-agent collaboration (analysis AI + implementation AI + verification AI)
- Mid-term:
  - Integration with formal verification techniques
  - Automatic performance modeling
  - Interactive optimization frameworks
- Long-term:
  - Automatic algorithm selection and design
  - Hardware-software co-optimization
  - Fully automated GPU porting systems
While this trial did not achieve complete success, it provided valuable experience in clarifying the possibilities and limitations of AI coding. In particular, the following points became clear:
- AI can significantly accelerate mechanical tasks
- However, there are limitations in domains requiring deep expertise
- The most effective approach is human-AI collaboration
Future AI utilization in the HPC field should proceed not by leaving everything to AI, but by ensuring that AI and human experts appropriately share roles.
System information:

```
OS:       Linux 5.14.0-611.16.1.el9_7.x86_64
Compiler: nvfortran 23.x (NVIDIA HPC SDK)
GPU:      Compute Capability 8.9
Python:   3.9+
```
Tools used:
- fortran_gpu_analyzer.py: Fortran analysis script
- make-mcp: MCP server (build automation)
- Claude Code: LLM interface
Main files:

```
ai_coding/
├── fortran_gpu_analyzer.py   # Analysis script
├── REPORT.md                 # This report
└── NPB-3.1_LU/
    ├── PLAN.md               # Implementation plan
    ├── CLAUDE.md             # Guide for LLM
    ├── .mcp.json             # MCP server configuration
    ├── Makefile              # Build configuration
    ├── final.txt             # Final status memo
    ├── lu.f                  # Main program
    ├── ssor.f                # SSOR control (modified)
    ├── blts.f                # Lower triangular solve (incomplete)
    ├── buts.f                # Upper triangular solve (incomplete)
    ├── jacld.f               # Jacobian computation (complete)
    ├── jacu.f                # Jacobian computation (complete)
    ├── rhs.f                 # Right-hand side computation (complete)
    ├── l2norm.f              # Norm computation (complete)
    └── *_analysis.json       # Analysis results (one per .f file)
```