
Trial Report: GPU Porting of NAS Parallel Benchmarks LU Using AI Coding

Executive Summary

This report documents an effort to insert OpenACC directives into the LU (Lower-Upper symmetric Gauss-Seidel) benchmark of NAS Parallel Benchmarks (NPB) using AI coding (LLM) to achieve high-speed execution on NVIDIA GPUs. While we succeeded in parallelizing straightforward sections, verification errors occurred in parts requiring the essential dependency handling of the LU algorithm (sections requiring the Pipeline method), and complete porting was not achieved. Through this trial, we clarified the possibilities and limitations of AI coding, particularly the challenges in porting HPC code requiring advanced parallelization techniques.


1. Background and Objectives

1.1 Project Objectives

The LU benchmark of NAS Parallel Benchmarks (NPB) is a computational kernel that solves three-dimensional compressible Navier-Stokes equations using the Lower-Upper symmetric Gauss-Seidel method. The project objectives are as follows:

  • Primary objective: Insert OpenACC directives to achieve high-speed execution on NVIDIA GPUs
  • Evaluation metric: Reduction of execution time (the benchmark's figure of merit)
  • Constraint: Maintain numerical correctness (Verification Successful)

1.2 Technical Background

LLM:

  • Claude Code, Sonnet 4.5

Target code:

  • NAS Parallel Benchmarks 3.3.1 LU
  • Approximately 3,000 lines of code written in Fortran 77/90
  • Uses SSOR (Symmetric Successive Over-Relaxation) algorithm

Execution environment:

  • Compiler: nvfortran (NVIDIA HPC SDK)
  • Compilation flags: -O3 -acc -gpu=cc89 -Minfo=accel
  • GPU: Compute Capability 8.9 (NVIDIA L40S, 48GB GDDR6)
  • Benchmark class: A (grid size 64×64×64)

1.3 Technical Challenges

The LU benchmark has the following characteristics that make simple parallelization difficult:

  1. Diagonal dependencies: Dependencies exist in three-dimensional diagonal directions in lower and upper triangular solves
  2. SSOR method constraints: The k-direction loop is inherently sequential
  3. Complex data flow: Complex read-write dependencies exist among multiple arrays

Addressing these challenges requires an advanced parallelization technique called Wavefront parallelization (Hyperplane method).


2. Approach and Methodology

2.1 Overall Strategy

This project adopted the following three-phase approach:

Phase 1: Preprocessing and Analysis
  ├─ Development of Fortran static analysis script
  └─ Automatic dependency analysis and pattern recognition

Phase 2: Development Environment Setup
  ├─ Automation using MCP server [make-mcp](https://gitlab.com/wddawson/make-mcp)
  └─ Streamlined compilation and execution

Phase 3: Incremental GPU Porting
  ├─ Data management optimization
  ├─ Handling straightforward parallelizable sections
  └─ Trial of Wavefront parallelization

2.2 Preprocessing: Fortran GPU Analysis Script

We developed fortran_gpu_analyzer.py to enable the LLM to insert appropriate OpenACC directives.

2.2.1 Script Functions

1: Static Analysis

- Analysis of subroutine definitions
- Extraction of array declarations (type, dimensions, size)
- Analysis of DO loop structure (nesting depth, loop variables, ranges)
- Tracking of array access patterns (distinguishing reads and writes)

2: Dependency Analysis

- Calculation of dependency vectors (dependency distance for each dimension)
- Determination of dependency direction (forward/backward/mixed)
- Extraction of dependencies within the same loop context

3: Pattern Recognition and Strategy Generation

- Matching with known algorithm patterns
  - Forward substitution
  - Backward substitution
  - Jacobi stencil (independently parallelizable)
  - Gauss-Seidel (Red-Black ordering)
- Automatic recommendation of GPU porting strategies
- Generation of required OpenACC directives
- Suggestion of auxiliary arrays
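
The classification step can be illustrated with a minimal Python sketch. This is a hypothetical helper, not the actual fortran_gpu_analyzer.py logic; it shows how dependency vectors alone are enough to separate fully parallel loops from wavefront-style nests:

```python
# Minimal sketch of dependency-vector classification (hypothetical helper,
# not the actual fortran_gpu_analyzer.py implementation).

def classify_pattern(dep_vectors):
    """Classify a loop nest from its dependency vectors.

    Each vector lists the per-dimension offset of a read relative to the
    point being written, e.g. (-1, 0, 0) for a read of v(i-1, j, k).
    """
    if not dep_vectors:
        return "jacobi_stencil"      # no loop-carried deps: fully parallel
    dims_with_deps = {d for vec in dep_vectors
                      for d, off in enumerate(vec) if off != 0}
    if all(all(off <= 0 for off in vec) for vec in dep_vectors):
        if len(dims_with_deps) >= 2:
            return "wavefront"       # multi-axis backward deps: hyperplane method
        return "recurrence"          # single-axis: sequential in that dimension
    return "mixed"                   # forward and backward: deeper analysis needed

# blts.f-style dependencies: reads of (i-1,j,k), (i,j-1,k), (i,j,k-1)
blts_deps = [(-1, 0, 0), (0, -1, 0), (0, 0, -1)]
print(classify_pattern(blts_deps))   # wavefront
print(classify_pattern([]))          # jacobi_stencil
```

A real analyzer must also resolve symbolic subscripts and loop bounds, but this captures the decision that routed jacld/jacu/rhs to simple parallelization and blts/buts to the wavefront strategy.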

2.2.2 Example Analysis Results

Analysis results for blts.f (lower triangular solve):

{
  "pattern_type": "wavefront",
  "affected_dimensions": ["k", "j", "i"],
  "dependency_vectors": [
    {"direction": "k-1", "offset": -1},
    {"direction": "j-1", "offset": -1},
    {"direction": "i-1", "offset": -1}
  ],
  "recommended_strategy": "diagonal_wavefront",
  "transformation_needed": "hyperplane_method"
}

In this way, the script automatically:

  • Detects three-dimensional wavefront dependencies
  • Recommends that diagonal parallelization is needed
  • Suggests necessary auxiliary arrays (np, indxp, jndxp)

2.3 Development Environment: Automation Using MCP Server

We automated compilation and execution using the MCP (Model Context Protocol) server.

2.3.1 Configuration

.mcp.json:

{
  "mcpServers": {
    "make-mcp": {
      "command": "make-mcp",
      "args": ["--config", "l40s", "--project-dir", "."]
    }
  }
}

2.3.2 Available Tools

| Tool | Function | Corresponding Make Target |
| --- | --- | --- |
| mcp__make-mcp__make_default | Execute build | make default |
| mcp__make-mcp__make_build | Generate object files | make build |
| mcp__make-mcp__make_run | Run benchmark | make run |
| mcp__make-mcp__make_clean | Cleanup | make clean |
| mcp__make-mcp__make_sysinfo | Display system information | make sysinfo |

2.3.3 Benefits

  1. Consistency: Ensures consistency in command execution by always using MCP tools
  2. Traceability: All builds and executions are logged
  3. Error Handling: Structured error information retrieval through tools

3. Implementation Process

3.1 Understanding Code Structure

The LU benchmark consists of the following main files:

| File | Role | Parallelization Difficulty | Priority |
| --- | --- | --- | --- |
| lu.f | Main driver | N/A (control only) | - |
| ssor.f | SSOR iteration control | Medium (data management) | High |
| blts.f | Lower triangular solve (L inverse) | Very High | Critical |
| buts.f | Upper triangular solve (U inverse) | Very High | Critical |
| jacld.f | Lower Jacobian computation | Low (independent parallel) | Medium |
| jacu.f | Upper Jacobian computation | Low (independent parallel) | Medium |
| rhs.f | Right-hand side vector computation | Low (stencil) | Medium |
| l2norm.f | L2 norm computation | Low (reduction) | Low |

3.2 Incremental Implementation Strategy

The following incremental strategy was defined in PLAN.md:

Phase 1: Data Management

!$acc data present(u, rsd, frct, qs, rho_i) &
!$acc      create(a, b, c, d) &
!$acc      copyin(dt, omega, nx, ny, nz, ist, iend, jst, jend)
! ... SSOR iteration ...
!$acc end data

Objective: Keep all major arrays resident in GPU memory to minimize CPU-GPU transfers

Phase 2: Simple Parallelization (Quick Wins)

jacld.f, jacu.f (Jacobian computation):

!$acc parallel loop collapse(2) present(u, rho_i, qs, a, b, c, d)
do j = jst, jend
    do i = ist, iend
        ! Point-independent computation
    end do
end do

l2norm.f (Reduction):

!$acc parallel loop collapse(3) reduction(+:sum) present(v)
do k = 2, nz0-1
    do j = jst, jend
        do i = ist, iend
            do m = 1, 5
                sum(m) = sum(m) + v(m,i,j,k) * v(m,i,j,k)
            end do
        end do
    end do
end do
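
The reduction pattern above can be modeled in Python to show why per-gang partial sums reproduce the sequential result. This is a sketch with an arbitrary chunk size standing in for a gang, not the actual GPU execution:

```python
import math
import random

# Sketch: a gang-style reduction over chunks, mimicking how
# reduction(+:sum) combines per-gang partial sums in l2norm.f.
random.seed(0)
nz = 16
v = [random.uniform(-1.0, 1.0) for _ in range(nz)]

sequential = sum(x * x for x in v)

chunk = 4  # hypothetical gang size
partials = [sum(x * x for x in v[s:s + chunk]) for s in range(0, nz, chunk)]
parallel_style = sum(partials)

# The two orderings may differ in the last bits, which is why NPB
# verification compares against an epsilon rather than exact equality.
assert math.isclose(sequential, parallel_style, rel_tol=1e-12)
```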

rhs.f (Stencil computation):

!$acc parallel loop collapse(3) present(rsd, u, rho_i, qs, frct)
do k = 2, nz-1
    do j = jst, jend
        do i = ist, iend
            ! Stencil computation (reading neighboring points only)
        end do
    end do
end do

Phase 3: Wavefront Parallelization (Most Challenging)

Transformation of blts.f / buts.f:

Original loop structure:

do k = 2, nz-1
    do j = jst, jend
        do i = ist, iend
            ! v(m,i,j,k) depends on v(*,i-1,j,k), v(*,i,j-1,k), v(*,i,j,k-1)
        end do
    end do
end do

After Wavefront transformation:

! Pre-compute diagonal indices
call setup_diagonals()  ! Generate np(l), indxp(l,n), jndxp(l,n)

! Sequential processing for each diagonal level
do l = 1, ndiag
    npl = np(l)

    ! Parallel processing of points within each diagonal
    !$acc parallel loop gang vector num_gangs((npl+127)/128) &
    !$acc                          vector_length(128)
    do n = 1, npl
        j = jndxp(l,n)
        i = indxp(l,n)
        k = l - i - j

        ! Computation for point (i,j,k)
        ! All points at this level are mutually independent
    end do
end do

Required auxiliary arrays:

integer, parameter :: maxdiag = 3*isiz1
integer np(maxdiag)                      ! Number of points on diagonal l
integer indxp(maxdiag, isiz1*isiz2)      ! i-index
integer jndxp(maxdiag, isiz1*isiz2)      ! j-index

Key Insight:

All points satisfying i + j + k = constant (= l)

have no mutual dependencies and can be computed simultaneously.
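
The transformation above can be checked end to end with a small Python model. This is a sketch using a toy recurrence in place of the actual blts.f update, verifying that hyperplane ordering reproduces the sequential sweep exactly:

```python
# Sketch: verify on a toy recurrence that processing hyperplanes
# i+j+k = l in ascending order reproduces the sequential sweep.
n = 6

def sequential():
    v = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for k in range(1, n):
        for j in range(1, n):
            for i in range(1, n):
                v[i][j][k] = v[i-1][j][k] + v[i][j-1][k] + v[i][j][k-1] + 1.0
    return v

def wavefront():
    v = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    # Precompute index lists per diagonal level l = i + j + k
    # (the role of np, indxp, jndxp in the Fortran version).
    planes = {}
    for k in range(1, n):
        for j in range(1, n):
            for i in range(1, n):
                planes.setdefault(i + j + k, []).append((i, j, k))
    for l in sorted(planes):          # levels must run sequentially
        for i, j, k in planes[l]:     # points within a level are independent
            v[i][j][k] = v[i-1][j][k] + v[i][j-1][k] + v[i][j][k-1] + 1.0
    return v

assert sequential() == wavefront()
```

Every dependency of a level-l point lies on level l-1, so any order (including a parallel one) within a level yields bit-identical results for this recurrence; the difficulty in blts.f is not the schedule itself but generating the index arrays and restructuring the coefficient arrays around it.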

3.3 Actual Implementation

Implementation proceeded according to the following task list:

✅ Add OpenACC data management to ssor.f
✅ Parallelize jacld.f with OpenACC
✅ Parallelize jacu.f with OpenACC
✅ Parallelize l2norm.f with reduction
✅ Parallelize rhs.f with OpenACC
✅ Test Phase 1-2 optimizations
◼️ Implement wavefront parallelization for blts.f  ← Blocked
◻️ Implement wavefront parallelization for buts.f
◻️ Final testing and verification

4. Results and Achievements

4.1 Failed Parts

Parallelization of blts.f / buts.f (triangular solves) failed:

Issues

  1. Loss of numerical correctness:

    Verification:  FAILED
    Error: RMS norm difference exceeds tolerance
    
  2. Root cause:

    • Simple parallelization cannot properly handle dependencies
    • Used CPU/GPU data transfer as a workaround, but numerical precision could not be maintained
    • Wavefront parallelization implementation was incomplete
  3. Technical challenges:

    • Complexity of logic for generating diagonal index arrays
    • Restructuring coefficient arrays (d(5,5,i,j) → d(5,5,n))
    • Splitting into multiple kernels (separating k-1 and i-1,j-1 dependencies)
    • Appropriate placement of synchronization points

4.2 Final Status

Status: Partial Success
- Compilation: ✅ Success
- Execution: ✅ Completes
- Verification: ❌ FAILED
- Performance: N/A (verification failed)

Example execution log:

 NAS Parallel Benchmarks 3.1 - LU Benchmark

 Size: 64x64x64
 Iterations: 250

 Time step    1
 Time step    2
 ...
 Time step  250

 Verification being performed for class A
 Accuracy setting for epsilon =  0.1000000000000E-07
 Comparison of RMS-norms of residual
 FAILURE:  1   0.3167121203225E+19 0.7790210760669E+03 0.4065514144001E+16
 FAILURE:  2   0.2226688354010E+17 0.6340276525969E+02 0.3511973562809E+15
 FAILURE:  3   0.1712390193331E+18 0.1949924972729E+03 0.8781826056284E+15
 FAILURE:  4   0.1205243868453E+18 0.1784530116042E+03 0.6753844374038E+15
 FAILURE:  5   0.4840029262702E+19 0.1838476034946E+04 0.2632631141609E+16
 Comparison of RMS-norms of solution error
 FAILURE:  1   0.5073968638522E+03 0.2996408568547E+02 0.1593350063066E+02
 FAILURE:  2   0.4117400734144E+02 0.2819457636500E+01 0.1360352048153E+02
 FAILURE:  3   0.1210827223612E+03 0.7347341269877E+01 0.1547980104825E+02
 FAILURE:  4   0.1053131418499E+03 0.6713922568778E+01 0.1468578439370E+02
 FAILURE:  5   0.8320050299906E+03 0.7071531568839E+02 0.1076555632809E+02
 Comparison of surface integral
 FAILURE:      0.6793137050151E+03 0.2603092560489E+02 0.2509640991358E+02
 Verification failed


 LU Benchmark Completed.
 Class           =                        A
 Size            =             64x  64x  64
 Iterations      =                      250
 Time in seconds =                     0.15
 Mop/s total     =                819678.10
 Operation type  =           floating point
 Verification    =             UNSUCCESSFUL
 Version         =                    3.3.1
 Compile date    =              04 Feb 2026

 Compile options:
    F77          = nvfortran
    FLINK        = $(F77)
    F_LIB        = (none)
    F_INC        = (none)
    FFLAGS       = -acc -Minfo=accel
    FLINKFLAGS   = -acc
    RAND         = (none)


 Please send all errors/feedbacks to:

 NPB Development Team
 npb@nas.nasa.gov

5. Technical Insights and Lessons Learned

5.1 Strengths of AI Coding

5.1.1 Rapid Prototyping

  • Quick development of analysis scripts (Python 950 lines, implemented in approximately 30 minutes)
  • Automatic insertion of basic OpenACC directives
  • Automatic generation of documentation (PLAN.md, CLAUDE.md)

5.1.2 Pattern Recognition

The LLM correctly recognized the following patterns:

  • Independently parallelizable loops (jacld, jacu)
  • Reduction operations (l2norm)
  • Stencil computations (rhs)

5.1.3 Knowledge Integration

  • Understanding of OpenACC specifications
  • Parsing of Fortran 77 syntax
  • Recognition of common HPC algorithm patterns

5.2 Limitations of AI Coding

5.2.1 Advanced Algorithm Transformations

Fundamental structural transformations like Wavefront parallelization are difficult:

  1. Necessity of complex preprocessing:

    • Logic for computing diagonal index arrays
    • Accurate handling of boundary conditions (max(ist, l-j-nz+1) etc.)
    • Memory layout optimization
  2. Multi-stage dependencies:

    • Precedent processing of k-1 dependencies
    • Subsequent processing of i-1, j-1 dependencies
    • Appropriate placement of kernel splitting and synchronization
  3. Ensuring numerical stability:

    • Changes in floating-point operation order affect results
    • Precision guarantee for diagonal block inversion
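
The order sensitivity is easy to demonstrate: even reassociating three additions changes the result in IEEE double precision, which is the benign error a parallel reduction introduces (and which the epsilon tolerance absorbs):

```python
# Floating-point addition is not associative: a parallel reduction
# that reorders additions produces a slightly different result.
a = (0.1 + 0.2) + 0.3   # one evaluation order
b = 0.1 + (0.2 + 0.3)   # another evaluation order

print(a == b)        # False
print(abs(a - b))    # on the order of machine epsilon

# Such rounding-level differences pass an epsilon-based check; the
# failures in Section 4 are many orders of magnitude larger, which
# points to broken dependencies rather than reordered arithmetic.
```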

5.2.2 Debugging Difficulties

Debugging process when verification fails:

1. Check compiler warnings → None
2. Check data transfers → No issues with present/copy
3. Check array bounds → Ranges are accurate
4. Re-check dependencies → Problem discovered here (wavefront needed)
5. Compare with reference implementation → Understand C version structure
6. Implementation trial → Too complex, complete implementation not achieved

The LLM is good at steps 1-3, but steps 4-6 require human intervention.

5.2.3 Depth of Domain Knowledge

The following knowledge is required but insufficient for LLMs:

  • Numerical computation algorithm theory: Theoretical background of SOR algorithms
  • Parallel computing theory: Hyperplane method, Red-Black ordering
  • GPU architecture: Warp divergence, Memory coalescing
  • OpenACC implementation details: Complex data mapping, asynchronous execution

5.3 Essential Challenge: Necessity of Pipeline Method

The core challenge of the LU benchmark is that Pipeline parallelization (Wavefront parallelization) is indispensable.

5.3.1 Structure of Dependencies

Point (i,j,k) depends on:
  - (i-1, j,   k  )  ← i-direction dependency
  - (i,   j-1, k  )  ← j-direction dependency
  - (i,   j,   k-1)  ← k-direction dependency

All points on the diagonal plane i + j + k = l (constant) can be computed independently of each other:

Example of the diagonal i + j = 3 (simplified in 2D):
  j
  ^
  |
  3  o
  2    o  
  1      o
  0---------> i
     0  1  2

Points (0,3), (1,2), (2,1) are on the same diagonal with i+j=3
→ These can be computed in parallel

This three-dimensional dependency means only points on a diagonal hyperplane are independent.
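
This independence property can be confirmed by enumeration. A sketch (with an arbitrary grid size and level) checking that no point on a hyperplane depends on another point of the same hyperplane:

```python
# Sketch: confirm that no point on the hyperplane i+j+k = l depends
# on another point of the same hyperplane.
n, l = 8, 9
plane = [(i, j, k)
         for i in range(1, n) for j in range(1, n) for k in range(1, n)
         if i + j + k == l]

def deps(p):
    """The three points that p reads: (i-1,j,k), (i,j-1,k), (i,j,k-1)."""
    i, j, k = p
    return {(i - 1, j, k), (i, j - 1, k), (i, j, k - 1)}

on_plane = set(plane)
# Every dependency has index sum l-1, so it cannot lie on the l-plane.
assert all(deps(p).isdisjoint(on_plane) for p in plane)
print(len(plane), "mutually independent points at level", l)
```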

5.3.2 Complexity of Required Transformations

| Transformation Step | Complexity | Reason |
| --- | --- | --- |
| Index computation | High | Accurate boundary condition handling required |
| Array restructuring | High | Major memory layout changes |
| Kernel splitting | Medium | Separation by dependency direction |
| Synchronization management | High | Accurate synchronization between diagonals |

5.3.3 Reference: NPB LU OpenACC C Implementation

Structure of successful reference implementation (blts.c):

// 1. Pre-compute diagonal indices
for (l = 1; l <= ndiag; l++) {
    np[l] = 0;
    for (j = jst; j <= jend; j++) {
        for (i = ist; i <= iend; i++) {
            k = l - i - j;
            if (k >= 1 && k <= nz) {
                np[l]++;
                indxp[l][np[l]] = i;
                jndxp[l][np[l]] = j;
            }
        }
    }
}

// 2. Wavefront parallelization
for (l = 1; l <= ndiag; l++) {
    int npl = np[l];

    #pragma acc parallel loop gang vector \
            num_gangs((npl+127)/128) vector_length(128)
    for (n = 1; n <= npl; n++) {
        j = jndxp[l][n];
        i = indxp[l][n];
        k = l - i - j;

        // Compute and invert diagonal block
        // ...
    }
}

This implementation is superior in the following respects:

  • Memory access optimization (coalescing)
  • Appropriate gang/vector settings
  • Accurate dependency handling

6. Conclusion

6.1 Summary of Achievements

In this project, we attempted GPU porting of the NPB LU benchmark using AI coding.

Quantitative achievements:

  • Analysis script: 950 lines of Python code
  • Modified files: 6 (jacld.f, jacu.f, l2norm.f, rhs.f, ssor.f, applu.incl)
  • Successful parallelizations: 5 subroutines

Qualitative achievements:

  • Establishment of automated Fortran dependency analysis methods
  • Streamlined development flow using MCP server
  • Demonstration of incremental approach to GPU porting

6.2 Technical Conclusions

Success factors:

  1. Effectiveness of preprocessing: Provision of structured information by fortran_gpu_analyzer.py
  2. Incremental approach: Effectiveness of strategy starting from simple parts
  3. Power of automation: Acceleration of iterative experiments through MCP server

Failure factors:

  1. Algorithmic complexity: Fundamental difficulty of Wavefront parallelization
  2. LLM limitations: Lack of deep domain knowledge
  3. Verification difficulty: Ensuring numerical correctness

6.3 Final Evaluation

Position of AI Coding:

┌─────────────────────────────────┐
│  HPC GPU Porting Difficulty      │
│  Spectrum                        │
├─────────────────────────────────┤
│ Easy ← ─────────────── → Hard   │
├─────────────────────────────────┤
│ [✅ Possible with AI]           │
│   - Independent parallelization │
│   - Data management             │
│   - Reduction                   │
│                                 │
│ [⚠️ Partially possible with AI] │
│   - Stencil computation         │
│   - Loop transformation         │
│                                 │
│ [❌ Difficult with AI]          │
│   - Wavefront parallelization   │
│   - Pipeline method             │
│   - Algorithm redesign          │
└─────────────────────────────────┘

6.4 Future Directions

  1. Near term:

    • Utilization of more advanced LLM models (GPT-5, Claude 5, etc.)
    • Domain-specific fine-tuning
    • Multi-agent collaboration (analysis AI + implementation AI + verification AI)
  2. Mid-term:

    • Integration with formal verification techniques
    • Automatic performance modeling
    • Interactive optimization framework
  3. Long-term:

    • Automatic algorithm selection and design
    • Hardware-software co-optimization
    • Fully automated GPU porting system

6.5 Final Remarks

While this trial did not achieve complete success, it provided valuable experience in clarifying the possibilities and limitations of AI coding. In particular, the following points became clear:

  1. AI can significantly accelerate mechanical tasks
  2. However, there are limitations in domains requiring deep expertise
  3. The most effective approach is human-AI collaboration

Future AI utilization in the HPC field should proceed not by leaving everything to AI, but by ensuring that AI and human experts appropriately share roles.


Appendix

A. Detailed Development Environment

System information:

OS: Linux 5.14.0-611.16.1.el9_7.x86_64
Compiler: nvfortran 23.x (NVIDIA HPC SDK)
GPU: Compute Capability 8.9
Python: 3.9+

Tools used:

  • fortran_gpu_analyzer.py: Fortran analysis script
  • make-mcp: MCP server (build automation)
  • Claude Code: LLM interface

B. File List

Main files:

ai_coding/
├── fortran_gpu_analyzer.py    # Analysis script
├── REPORT.md                   # This report
└── NPB-3.1_LU/
    ├── PLAN.md                 # Implementation plan
    ├── CLAUDE.md               # Guide for LLM
    ├── .mcp.json               # MCP server configuration
    ├── Makefile                # Build configuration
    ├── final.txt               # Final status memo
    ├── lu.f                    # Main program
    ├── ssor.f                  # SSOR control (modified)
    ├── blts.f                  # Lower triangular solve (incomplete)
    ├── buts.f                  # Upper triangular solve (incomplete)
    ├── jacld.f                 # Jacobian computation (complete)
    ├── jacu.f                  # Jacobian computation (complete)
    ├── rhs.f                   # Right-hand side computation (complete)
    ├── l2norm.f                # Norm computation (complete)
    └── *_analysis.json         # Analysis results (corresponding to each .f file)