
Trial Report: GPU Porting of NAS Parallel Benchmarks LU Using AI Coding

Executive Summary

This report documents an effort to insert OpenACC directives into the LU (Lower-Upper symmetric Gauss-Seidel) benchmark of NAS Parallel Benchmarks (NPB) using AI coding (LLM) to achieve high-speed execution on NVIDIA GPUs. While we succeeded in parallelizing straightforward sections, verification errors occurred in parts requiring the essential dependency handling of the LU algorithm (sections requiring the Pipeline method), and complete porting was not achieved. Through this trial, we clarified the possibilities and limitations of AI coding, particularly the challenges in porting HPC code requiring advanced parallelization techniques.


1. Background and Objectives

1.1 Project Objectives

The LU benchmark of NAS Parallel Benchmarks (NPB) is a computational kernel that solves three-dimensional compressible Navier-Stokes equations using the Lower-Upper symmetric Gauss-Seidel method. The project objectives are as follows:

  • Primary objective: Insert OpenACC directives to achieve high-speed execution on NVIDIA GPUs
  • Evaluation metric: Reduction of execution time (the benchmark's figure of merit)
  • Constraint: Maintain numerical correctness (Verification Successful)

1.2 Technical Background

LLM:

  • Claude Code, Sonnet 4.5

Target code:

  • NAS Parallel Benchmarks 3.3.1 LU
  • Approximately 3,000 lines of code written in Fortran 77/90
  • Uses SSOR (Symmetric Successive Over-Relaxation) algorithm

Execution environment:

  • Compiler: nvfortran (NVIDIA HPC SDK)
  • Compilation flags: -O3 -acc -gpu=cc89 -Minfo=accel
  • GPU: Compute Capability 8.9 (NVIDIA L40S, 48GB GDDR6)
  • Benchmark class: A (grid size 64×64×64)

1.3 Technical Challenges

The LU benchmark has the following characteristics that make simple parallelization difficult:

  1. Diagonal dependencies: Dependencies exist in three-dimensional diagonal directions in lower and upper triangular solves
  2. SSOR method constraints: The k-direction loop is inherently sequential
  3. Complex data flow: Complex read-write dependencies exist among multiple arrays

Addressing these challenges requires an advanced parallelization technique called Wavefront parallelization (Hyperplane method).


2. Approach and Methodology

2.1 Overall Strategy

This project adopted the following three-phase approach:

Phase 1: Preprocessing and Analysis
  ├─ Development of Fortran static analysis script
  └─ Automatic dependency analysis and pattern recognition

Phase 2: Development Environment Setup
  ├─ Automation using MCP server [make-mcp](https://gitlab.com/wddawson/make-mcp)
  └─ Streamlined compilation and execution

Phase 3: Incremental GPU Porting
  ├─ Data management optimization
  ├─ Handling straightforward parallelizable sections
  └─ Trial of Wavefront parallelization

2.2 Preprocessing: Fortran GPU Analysis Script

We developed fortran_gpu_analyzer.py to enable the LLM to insert appropriate OpenACC directives.

2.2.1 Script Functions

1: Static Analysis

- Analysis of subroutine definitions
- Extraction of array declarations (type, dimensions, size)
- Analysis of DO loop structure (nesting depth, loop variables, ranges)
- Tracking of array access patterns (distinguishing reads and writes)

2: Dependency Analysis

- Calculation of dependency vectors (dependency distance for each dimension)
- Determination of dependency direction (forward/backward/mixed)
- Extraction of dependencies within the same loop context

3: Pattern Recognition and Strategy Generation

- Matching with known algorithm patterns
  - Forward substitution
  - Backward substitution
  - Jacobi stencil (independently parallelizable)
  - Gauss-Seidel (Red-Black ordering)
- Automatic recommendation of GPU porting strategies
- Generation of required OpenACC directives
- Suggestion of auxiliary arrays
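
The classification step can be illustrated with a minimal Python sketch. This is a hypothetical helper, not the actual fortran_gpu_analyzer.py logic; it shows how dependency vectors alone are enough to separate fully parallel loops from wavefront-style nests:

```python
# Minimal sketch of dependency-vector classification (hypothetical helper,
# not the actual fortran_gpu_analyzer.py implementation).

def classify_pattern(dep_vectors):
    """Classify a loop nest from its dependency vectors.

    Each vector lists the per-dimension offset of a read relative to the
    point being written, e.g. (-1, 0, 0) for a read of v(i-1, j, k).
    """
    if not dep_vectors:
        return "jacobi_stencil"      # no loop-carried deps: fully parallel
    dims_with_deps = {d for vec in dep_vectors
                      for d, off in enumerate(vec) if off != 0}
    if all(all(off <= 0 for off in vec) for vec in dep_vectors):
        if len(dims_with_deps) >= 2:
            return "wavefront"       # multi-axis backward deps: hyperplane method
        return "recurrence"          # single-axis: sequential in that dimension
    return "mixed"                   # forward and backward: deeper analysis needed

# blts.f-style dependencies: reads of (i-1,j,k), (i,j-1,k), (i,j,k-1)
blts_deps = [(-1, 0, 0), (0, -1, 0), (0, 0, -1)]
print(classify_pattern(blts_deps))   # wavefront
print(classify_pattern([]))          # jacobi_stencil
```

A real analyzer must also resolve symbolic subscripts and loop bounds, but this captures the decision that routed jacld/jacu/rhs to simple parallelization and blts/buts to the wavefront strategy.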

2.2.2 Example Analysis Results

Analysis results for blts.f (lower triangular solve):

{
  "pattern_type": "wavefront",
  "affected_dimensions": ["k", "j", "i"],
  "dependency_vectors": [
    {"direction": "k-1", "offset": -1},
    {"direction": "j-1", "offset": -1},
    {"direction": "i-1", "offset": -1}
  ],
  "recommended_strategy": "diagonal_wavefront",
  "transformation_needed": "hyperplane_method"
}

In this way, the script automatically:

  • Detects three-dimensional wavefront dependencies
  • Recommends that diagonal parallelization is needed
  • Suggests necessary auxiliary arrays (np, indxp, jndxp)

2.3 Development Environment: Automation Using MCP Server

We automated compilation and execution using the MCP (Model Context Protocol) server.

2.3.1 Configuration

.mcp.json:

{
  "mcpServers": {
    "make-mcp": {
      "command": "make-mcp",
      "args": ["--config", "l40s", "--project-dir", "."]
    }
  }
}

2.3.2 Available Tools

| Tool | Function | Corresponding Make Target |
| --- | --- | --- |
| mcp__make-mcp__make_default | Execute build | make default |
| mcp__make-mcp__make_build | Generate object files | make build |
| mcp__make-mcp__make_run | Run benchmark | make run |
| mcp__make-mcp__make_clean | Cleanup | make clean |
| mcp__make-mcp__make_sysinfo | Display system information | make sysinfo |

2.3.3 Benefits

  1. Consistency: Ensures consistency in command execution by always using MCP tools
  2. Traceability: All builds and executions are logged
  3. Error Handling: Structured error information retrieval through tools

3. Implementation Process

3.1 Understanding Code Structure

The LU benchmark consists of the following main files:

| File | Role | Parallelization Difficulty | Priority |
| --- | --- | --- | --- |
| lu.f | Main driver | N/A (control only) | - |
| ssor.f | SSOR iteration control | Medium (data management) | High |
| blts.f | Lower triangular solve (L inverse) | Very High | Critical |
| buts.f | Upper triangular solve (U inverse) | Very High | Critical |
| jacld.f | Lower Jacobian computation | Low (independent parallel) | Medium |
| jacu.f | Upper Jacobian computation | Low (independent parallel) | Medium |
| rhs.f | Right-hand side vector computation | Low (stencil) | Medium |
| l2norm.f | L2 norm computation | Low (reduction) | Low |

3.2 Incremental Implementation Strategy

The following incremental strategy was defined in PLAN.md:

Phase 1: Data Management

!$acc data present(u, rsd, frct, qs, rho_i) &
!$acc      create(a, b, c, d) &
!$acc      copyin(dt, omega, nx, ny, nz, ist, iend, jst, jend)
! ... SSOR iteration ...
!$acc end data

Objective: Keep all major arrays resident in GPU memory to minimize CPU-GPU transfers

Phase 2: Simple Parallelization (Quick Wins)

jacld.f, jacu.f (Jacobian computation):

!$acc parallel loop collapse(2) present(u, rho_i, qs, a, b, c, d)
do j = jst, jend
    do i = ist, iend
        ! Point-independent computation
    end do
end do

l2norm.f (Reduction):

!$acc parallel loop collapse(3) reduction(+:sum) present(v)
do k = 2, nz0-1
    do j = jst, jend
        do i = ist, iend
            do m = 1, 5
                sum(m) = sum(m) + v(m,i,j,k) * v(m,i,j,k)
            end do
        end do
    end do
end do
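
The reduction pattern above can be modeled in Python to show why per-gang partial sums reproduce the sequential result. This is a sketch with an arbitrary chunk size standing in for a gang, not the actual GPU execution:

```python
import math
import random

# Sketch: a gang-style reduction over chunks, mimicking how
# reduction(+:sum) combines per-gang partial sums in l2norm.f.
random.seed(0)
nz = 16
v = [random.uniform(-1.0, 1.0) for _ in range(nz)]

sequential = sum(x * x for x in v)

chunk = 4  # hypothetical gang size
partials = [sum(x * x for x in v[s:s + chunk]) for s in range(0, nz, chunk)]
parallel_style = sum(partials)

# The two orderings may differ in the last bits, which is why NPB
# verification compares against an epsilon rather than exact equality.
assert math.isclose(sequential, parallel_style, rel_tol=1e-12)
```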

rhs.f (Stencil computation):

!$acc parallel loop collapse(3) present(rsd, u, rho_i, qs, frct)
do k = 2, nz-1
    do j = jst, jend
        do i = ist, iend
            ! Stencil computation (reading neighboring points only)
        end do
    end do
end do

Phase 3: Wavefront Parallelization (Most Challenging)

Transformation of blts.f / buts.f:

Original loop structure:

do k = 2, nz-1
    do j = jst, jend
        do i = ist, iend
            ! v(m,i,j,k) depends on v(*,i-1,j,k), v(*,i,j-1,k), v(*,i,j,k-1)
        end do
    end do
end do

After Wavefront transformation:

! Pre-compute diagonal indices
call setup_diagonals()  ! Generate np(l), indxp(l,n), jndxp(l,n)

! Sequential processing for each diagonal level
do l = 1, ndiag
    npl = np(l)

    ! Parallel processing of points within each diagonal
    !$acc parallel loop gang vector num_gangs((npl+127)/128) &
    !$acc                          vector_length(128)
    do n = 1, npl
        j = jndxp(l,n)
        i = indxp(l,n)
        k = l - i - j

        ! Computation for point (i,j,k)
        ! All points at this level are mutually independent
    end do
end do

Required auxiliary arrays:

integer, parameter :: maxdiag = 3*isiz1
integer np(maxdiag)                      ! Number of points on diagonal l
integer indxp(maxdiag, isiz1*isiz2)      ! i-index
integer jndxp(maxdiag, isiz1*isiz2)      ! j-index

Key Insight:

All points satisfying i + j + k = constant (= l)

have no mutual dependencies and can be computed simultaneously.
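
The transformation above can be checked end to end with a small Python model. This is a sketch using a toy recurrence in place of the actual blts.f update, verifying that hyperplane ordering reproduces the sequential sweep exactly:

```python
# Sketch: verify on a toy recurrence that processing hyperplanes
# i+j+k = l in ascending order reproduces the sequential sweep.
n = 6

def sequential():
    v = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for k in range(1, n):
        for j in range(1, n):
            for i in range(1, n):
                v[i][j][k] = v[i-1][j][k] + v[i][j-1][k] + v[i][j][k-1] + 1.0
    return v

def wavefront():
    v = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    # Precompute index lists per diagonal level l = i + j + k
    # (the role of np, indxp, jndxp in the Fortran version).
    planes = {}
    for k in range(1, n):
        for j in range(1, n):
            for i in range(1, n):
                planes.setdefault(i + j + k, []).append((i, j, k))
    for l in sorted(planes):          # levels must run sequentially
        for i, j, k in planes[l]:     # points within a level are independent
            v[i][j][k] = v[i-1][j][k] + v[i][j-1][k] + v[i][j][k-1] + 1.0
    return v

assert sequential() == wavefront()
```

Every dependency of a level-l point lies on level l-1, so any order (including a parallel one) within a level yields bit-identical results for this recurrence; the difficulty in blts.f is not the schedule itself but generating the index arrays and restructuring the coefficient arrays around it.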

3.3 Actual Implementation

Implementation proceeded according to the following task list:

✅ Add OpenACC data management to ssor.f
✅ Parallelize jacld.f with OpenACC
✅ Parallelize jacu.f with OpenACC
✅ Parallelize l2norm.f with reduction
✅ Parallelize rhs.f with OpenACC
✅ Test Phase 1-2 optimizations
◼️ Implement wavefront parallelization for blts.f  ← Blocked
◻️ Implement wavefront parallelization for buts.f
◻️ Final testing and verification

4. Results and Achievements

4.1 Failed Parts

Parallelization of blts.f / buts.f (triangular solves) failed:

Issues

  1. Loss of numerical correctness:

    Verification:  FAILED
    Error: RMS norm difference exceeds tolerance
    
  2. Root cause:

    • Simple parallelization cannot properly handle dependencies
    • Used CPU/GPU data transfer as a workaround, but numerical precision could not be maintained
    • Wavefront parallelization implementation was incomplete
  3. Technical challenges:

    • Complexity of logic for generating diagonal index arrays
    • Restructuring coefficient arrays (d(5,5,i,j) → d(5,5,n))
    • Splitting into multiple kernels (separating k-1 and i-1,j-1 dependencies)
    • Appropriate placement of synchronization points

4.2 Final Status

Status: Partial Success
- Compilation: ✅ Success
- Execution: ✅ Completes
- Verification: ❌ FAILED
- Performance: N/A (verification failed)

Example execution log:

 NAS Parallel Benchmarks 3.1 - LU Benchmark

 Size: 64x64x64
 Iterations: 250

 Time step    1
 Time step    2
 ...
 Time step  250

 Verification being performed for class A
 Accuracy setting for epsilon =  0.1000000000000E-07
 Comparison of RMS-norms of residual
 FAILURE:  1   0.3167121203225E+19 0.7790210760669E+03 0.4065514144001E+16
 FAILURE:  2   0.2226688354010E+17 0.6340276525969E+02 0.3511973562809E+15
 FAILURE:  3   0.1712390193331E+18 0.1949924972729E+03 0.8781826056284E+15
 FAILURE:  4   0.1205243868453E+18 0.1784530116042E+03 0.6753844374038E+15
 FAILURE:  5   0.4840029262702E+19 0.1838476034946E+04 0.2632631141609E+16
 Comparison of RMS-norms of solution error
 FAILURE:  1   0.5073968638522E+03 0.2996408568547E+02 0.1593350063066E+02
 FAILURE:  2   0.4117400734144E+02 0.2819457636500E+01 0.1360352048153E+02
 FAILURE:  3   0.1210827223612E+03 0.7347341269877E+01 0.1547980104825E+02
 FAILURE:  4   0.1053131418499E+03 0.6713922568778E+01 0.1468578439370E+02
 FAILURE:  5   0.8320050299906E+03 0.7071531568839E+02 0.1076555632809E+02
 Comparison of surface integral
 FAILURE:      0.6793137050151E+03 0.2603092560489E+02 0.2509640991358E+02
 Verification failed


 LU Benchmark Completed.
 Class           =                        A
 Size            =             64x  64x  64
 Iterations      =                      250
 Time in seconds =                     0.15
 Mop/s total     =                819678.10
 Operation type  =           floating point
 Verification    =             UNSUCCESSFUL
 Version         =                    3.3.1
 Compile date    =              04 Feb 2026

 Compile options:
    F77          = nvfortran
    FLINK        = $(F77)
    F_LIB        = (none)
    F_INC        = (none)
    FFLAGS       = -acc -Minfo=accel
    FLINKFLAGS   = -acc
    RAND         = (none)


 Please send all errors/feedbacks to:

 NPB Development Team
 npb@nas.nasa.gov

5. Technical Insights and Lessons Learned

5.1 Strengths of AI Coding

5.1.1 Rapid Prototyping

  • Quick development of analysis scripts (Python 950 lines, implemented in approximately 30 minutes)
  • Automatic insertion of basic OpenACC directives
  • Automatic generation of documentation (PLAN.md, CLAUDE.md)

5.1.2 Pattern Recognition

The LLM correctly recognized the following patterns:

  • Independently parallelizable loops (jacld, jacu)
  • Reduction operations (l2norm)
  • Stencil computations (rhs)

5.1.3 Knowledge Integration

  • Understanding of OpenACC specifications
  • Parsing of Fortran 77 syntax
  • Recognition of common HPC algorithm patterns

5.2 Limitations of AI Coding

5.2.1 Advanced Algorithm Transformations

Fundamental structural transformations like Wavefront parallelization are difficult:

  1. Necessity of complex preprocessing:

    • Logic for computing diagonal index arrays
    • Accurate handling of boundary conditions (max(ist, l-j-nz+1) etc.)
    • Memory layout optimization
  2. Multi-stage dependencies:

    • Precedent processing of k-1 dependencies
    • Subsequent processing of i-1, j-1 dependencies
    • Appropriate placement of kernel splitting and synchronization
  3. Ensuring numerical stability:

    • Changes in floating-point operation order affect results
    • Precision guarantee for diagonal block inversion
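
The order sensitivity is easy to demonstrate: even reassociating three additions changes the result in IEEE double precision, which is the benign error a parallel reduction introduces (and which the epsilon tolerance absorbs):

```python
# Floating-point addition is not associative: a parallel reduction
# that reorders additions produces a slightly different result.
a = (0.1 + 0.2) + 0.3   # one evaluation order
b = 0.1 + (0.2 + 0.3)   # another evaluation order

print(a == b)        # False
print(abs(a - b))    # on the order of machine epsilon

# Such rounding-level differences pass an epsilon-based check; the
# failures in Section 4 are many orders of magnitude larger, which
# points to broken dependencies rather than reordered arithmetic.
```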

5.2.2 Debugging Difficulties

Debugging process when verification fails:

1. Check compiler warnings → None
2. Check data transfers → No issues with present/copy
3. Check array bounds → Ranges are accurate
4. Re-check dependencies → Problem discovered here (wavefront needed)
5. Compare with reference implementation → Understand C version structure
6. Implementation trial → Too complex, complete implementation not achieved

The LLM is good at steps 1-3, but steps 4-6 require human intervention.

5.2.3 Depth of Domain Knowledge

The following knowledge is required but insufficient for LLMs:

  • Numerical computation algorithm theory: Theoretical background of SOR algorithms
  • Parallel computing theory: Hyperplane method, Red-Black ordering
  • GPU architecture: Warp divergence, Memory coalescing
  • OpenACC implementation details: Complex data mapping, asynchronous execution

5.3 Essential Challenge: Necessity of Pipeline Method

The core challenge of the LU benchmark is that Pipeline parallelization (Wavefront parallelization) is indispensable.

5.3.1 Structure of Dependencies

Point (i,j,k) depends on:
  - (i-1, j,   k  )  ← i-direction dependency
  - (i,   j-1, k  )  ← j-direction dependency
  - (i,   j,   k-1)  ← k-direction dependency

All points on the diagonal plane i + j + k = l (constant) can be computed independently of each other:

Example of the diagonal i + j = 3 (simplified in 2D):
  j
  ^
  |
  3  o
  2    o  
  1      o
  0---------> i
     0  1  2

Points (0,3), (1,2), (2,1) are on the same diagonal with i+j=3
→ These can be computed in parallel

This three-dimensional dependency means only points on a diagonal hyperplane are independent.
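
This independence property can be confirmed by enumeration. A sketch (with an arbitrary grid size and level) checking that no point on a hyperplane depends on another point of the same hyperplane:

```python
# Sketch: confirm that no point on the hyperplane i+j+k = l depends
# on another point of the same hyperplane.
n, l = 8, 9
plane = [(i, j, k)
         for i in range(1, n) for j in range(1, n) for k in range(1, n)
         if i + j + k == l]

def deps(p):
    """The three points that p reads: (i-1,j,k), (i,j-1,k), (i,j,k-1)."""
    i, j, k = p
    return {(i - 1, j, k), (i, j - 1, k), (i, j, k - 1)}

on_plane = set(plane)
# Every dependency has index sum l-1, so it cannot lie on the l-plane.
assert all(deps(p).isdisjoint(on_plane) for p in plane)
print(len(plane), "mutually independent points at level", l)
```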

5.3.2 Complexity of Required Transformations

| Transformation Step | Complexity | Reason |
| --- | --- | --- |
| Index computation | High | Accurate boundary condition handling required |
| Array restructuring | High | Major memory layout changes |
| Kernel splitting | Medium | Separation by dependency direction |
| Synchronization management | High | Accurate synchronization between diagonals |

5.3.3 Reference: NPB LU OpenACC C Implementation

Structure of successful reference implementation (blts.c):

// 1. Pre-compute diagonal indices
for (l = 1; l <= ndiag; l++) {
    np[l] = 0;
    for (j = jst; j <= jend; j++) {
        for (i = ist; i <= iend; i++) {
            k = l - i - j;
            if (k >= 1 && k <= nz) {
                np[l]++;
                indxp[l][np[l]] = i;
                jndxp[l][np[l]] = j;
            }
        }
    }
}

// 2. Wavefront parallelization
for (l = 1; l <= ndiag; l++) {
    int npl = np[l];

    #pragma acc parallel loop gang vector \
            num_gangs((npl+127)/128) vector_length(128)
    for (n = 1; n <= npl; n++) {
        j = jndxp[l][n];
        i = indxp[l][n];
        k = l - i - j;

        // Compute and invert diagonal block
        // ...
    }
}

This implementation is superior in the following respects:

  • Memory access optimization (coalescing)
  • Appropriate gang/vector settings
  • Accurate dependency handling

6. Conclusion

6.1 Summary of Achievements

In this project, we attempted GPU porting of the NPB LU benchmark using AI coding.

Quantitative achievements:

  • Analysis script: 950 lines of Python code
  • Modified files: 6 (jacld.f, jacu.f, l2norm.f, rhs.f, ssor.f, applu.incl)
  • Successful parallelizations: 5 subroutines

Qualitative achievements:

  • Establishment of automated Fortran dependency analysis methods
  • Streamlined development flow using MCP server
  • Demonstration of incremental approach to GPU porting

6.2 Technical Conclusions

Success factors:

  1. Effectiveness of preprocessing: Provision of structured information by fortran_gpu_analyzer.py
  2. Incremental approach: Effectiveness of strategy starting from simple parts
  3. Power of automation: Acceleration of iterative experiments through MCP server

Failure factors:

  1. Algorithmic complexity: Fundamental difficulty of Wavefront parallelization
  2. LLM limitations: Lack of deep domain knowledge
  3. Verification difficulty: Ensuring numerical correctness

6.3 Final Evaluation

Position of AI Coding:

┌─────────────────────────────────┐
│  HPC GPU Porting Difficulty      │
│  Spectrum                        │
├─────────────────────────────────┤
│ Easy ← ─────────────── → Hard   │
├─────────────────────────────────┤
│ [✅ Possible with AI]           │
│   - Independent parallelization │
│   - Data management             │
│   - Reduction                   │
│                                 │
│ [⚠️ Partially possible with AI] │
│   - Stencil computation         │
│   - Loop transformation         │
│                                 │
│ [❌ Difficult with AI]          │
│   - Wavefront parallelization   │
│   - Pipeline method             │
│   - Algorithm redesign          │
└─────────────────────────────────┘

6.4 Future Directions

  1. Near term:

    • Utilization of more advanced LLM models (GPT-5, Claude 5, etc.)
    • Domain-specific fine-tuning
    • Multi-agent collaboration (analysis AI + implementation AI + verification AI)
  2. Mid-term:

    • Integration with formal verification techniques
    • Automatic performance modeling
    • Interactive optimization framework
  3. Long-term:

    • Automatic algorithm selection and design
    • Hardware-software co-optimization
    • Fully automated GPU porting system

6.5 Final Remarks

While this trial did not achieve complete success, it provided valuable experience in clarifying the possibilities and limitations of AI coding. In particular, the following points became clear:

  1. AI can significantly accelerate mechanical tasks
  2. However, there are limitations in domains requiring deep expertise
  3. The most effective approach is human-AI collaboration

Future AI utilization in the HPC field should proceed not by leaving everything to AI, but by ensuring that AI and human experts appropriately share roles.


Appendix

A. Detailed Development Environment

System information:

OS: Linux 5.14.0-611.16.1.el9_7.x86_64
Compiler: nvfortran 23.x (NVIDIA HPC SDK)
GPU: Compute Capability 8.9
Python: 3.9+

Tools used:

  • fortran_gpu_analyzer.py: Fortran analysis script
  • make-mcp: MCP server (build automation)
  • Claude Code: LLM interface

B. File List

Main files:

ai_coding/
├── fortran_gpu_analyzer.py    # Analysis script
├── REPORT.md                   # This report
└── NPB-3.1_LU/
    ├── PLAN.md                 # Implementation plan
    ├── CLAUDE.md               # Guide for LLM
    ├── .mcp.json               # MCP server configuration
    ├── Makefile                # Build configuration
    ├── final.txt               # Final status memo
    ├── lu.f                    # Main program
    ├── ssor.f                  # SSOR control (modified)
    ├── blts.f                  # Lower triangular solve (incomplete)
    ├── buts.f                  # Upper triangular solve (incomplete)
    ├── jacld.f                 # Jacobian computation (complete)
    ├── jacu.f                  # Jacobian computation (complete)
    ├── rhs.f                   # Right-hand side computation (complete)
    ├── l2norm.f                # Norm computation (complete)
    └── *_analysis.json         # Analysis results (corresponding to each .f file)