
Fix GPU compat of sparse dae solvers#3073

Draft
hexaeder wants to merge 11 commits into SciML:master from hexaeder:hw/sparse_dae_gpu

Conversation

@hexaeder
Contributor

This PR attempts to make most of the sparse DAE solvers GPU compatible. The main changes are:

  • special handling of Diagonal mass matrices (improves performance by an order of magnitude for certain DAE init calls)
  • moves the find_algebraic_vars_eqs function to OrdinaryDiffEqCore
  • fixes (mostly scalar-indexing) problems in the init or step code of several methods
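
The Diagonal fast path can be illustrated with a small sketch. Note that `algebraic_vars` and its dispatch here are hypothetical, not the actual PR code: for a Diagonal mass matrix, the algebraic variables are exactly the zero diagonal entries, so a single broadcast replaces a column-by-column scan that would trigger scalar indexing on a GPU sparse matrix.

```julia
using LinearAlgebra

# Hypothetical sketch (not the actual PR code): detect the algebraic
# variables of a mass-matrix DAE  M u' = f(u).  A variable is algebraic
# iff its column of the mass matrix is zero.

# Fast path: for a Diagonal mass matrix this is a single broadcast,
# which also works on GPU arrays without scalar indexing.
algebraic_vars(M::Diagonal) = M.diag .== 0

# Generic fallback: scan each column (fine on the CPU, but would
# trigger scalar indexing for a GPU sparse matrix).
algebraic_vars(M::AbstractMatrix) = [all(iszero, view(M, :, j)) for j in axes(M, 2)]

M = Diagonal([1.0, 1.0, 0.0])
algebraic_vars(M)   # only the third variable is algebraic
```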

This PR needs JuliaGPU/CUDA.jl#3032 to slice into the sparse jacobians.

The tests live in the simple_dae.jl script of the GPU test suite. Currently, they are more of a debug script: I define a custom test system and execute it with all kinds of algorithms. The tests:

  • compare the CPU solution of the different solvers against the "known good" CPU solution of Rodas5P
  • compare the GPU solution of the different solvers to their CPU counterpart
  • do all of this for Diagonal mass matrices (non-diagonal ones are much harder to support)
  • run the system without a sparsity pattern, with the sparsity pattern as CUDA CSR, and with the sparsity pattern as CUDA CSC
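
For context, here is a minimal sketch of the three jacobian-prototype variants. This is CPU-side only and the names are illustrative, not the actual test code; the CUDA conversions appear as comments since they require CUDA.jl:

```julia
using LinearAlgebra, SparseArrays

# Toy DAE structure: n-1 differential variables plus one algebraic
# variable, with a tridiagonal jacobian sparsity pattern.
n = 4
M = Diagonal([ones(n - 1); 0.0])   # diagonal mass matrix, last row zero
jac_proto = spdiagm(0 => ones(n), 1 => ones(n - 1), -1 => ones(n - 1))

# The three variants the tests run (GPU conversions commented out):
#   1. no sparsity pattern       -> jac_prototype = nothing
#   2. CUDA CSR (CUDSS LU path)  -> CuSparseMatrixCSR(jac_proto)
#   3. CUDA CSC (Krylov path)    -> CuSparseMatrixCSC(jac_proto)
nnz(jac_proto)   # 3n - 2 stored entries in the tridiagonal pattern
```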

I would like input from the maintainers on what the tests should look like. Currently I see several problems:

  • it takes a long time, since (too) many algorithms are tested
  • some algorithms seem to perform badly on my toy model, resulting in large discrepancies between the CPU solutions of one algorithm vs another
  • some algorithms perform weirdly with CSC vs CSR jacobian prototypes. CSC is always worse. The CUDA default is CSC, but CUDSS only supports LU factorization on CSR. So the tests currently compare CSC to CPU with Krylov, and CSR to CPU with the default linsolve.
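
The storage-format asymmetry behind this can be shown with the CPU CSC type alone. SparseMatrixCSC stores columns contiguously, so column access is cheap while row access must scan every column; CSR is the mirror image, which is why the CUDA default (CSC) and CUDSS's LU (CSR-only) pull in different directions:

```julia
using SparseArrays

# Illustrative only: the CPU sparse type is compressed sparse column.
A = sprand(1000, 1000, 0.01)
col = A[:, 17]   # cheap: one contiguous column slice
row = A[17, :]   # expensive on CSC: must scan all 1000 columns
```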

Algorithm overview

Fully working

ROS2, ROS3, ROS3PRL, ROS3PRL2, Rodas3, Scholz4_7, ROS34PW3, RosShamp4, Veldd4, Velds4, GRK4T, GRK4A,
Ros4LStab, Rodas4, Rodas42, Rodas4P, Rodas4P2, ROK4a, Rodas5, Rodas5P, Rodas5Pe, Rodas5Pr, Rodas6P,
ImplicitEuler, SDIRK2, ABDF2, QNDF1, QBDF1, Trapezoid

CSC path with elevated errors (Krylov vs direct LU accuracy)

These pass but need relaxed csc_tol. The dense and CSR jacobian paths are fine:
Rosenbrock23, ROS2PR, ROS2S, ROS34PW2, ROS34PRw, Cash4, Hairer4, Hairer42, QNDF2, QBDF2

Working, but solver is a poor fit for this DAE problem

GPU correctly reproduces the CPU result, but the solver itself diverges from the Rodas5P reference:
ROS3PR, ROS3P, ROS34PW1a, ROS34PW1b, QBDF1

Special case: Rosenbrock32/CSC: GPU and CPU Krylov agree, but the combination gives catastrophic
errors vs the reference. Effectively unusable with CSC.

Not yet working (require code changes)

The following algorithms fail, and I consider them out of scope for this PR since they'd require substantial work:

  • Rodas23W, Rodas3P: scalar indexing in calculate_interpoldiff!
  • QNDF, QBDF: DeviceMemory error in LinAlg (some mul! call within the step)
  • FBDF: scalar indexing in reinitFBDF!
  • RadauIIA3/5/9, AdaptiveRadau: lots of problems

Checklist

  • Appropriate tests were added
  • Any code changes were done in a way that does not break public API
  • All documentation related to code changes were updated
  • The new code follows the
    contributor guidelines, in particular the SciML Style Guide and
    COLPRAC.
  • Any new documentation only uses public API

