Conversation
… splatting

- Add CHECK_INPUT validation for all tensors in backward function
- Optimize forward/backward kernels with grid-stride loops and value hoisting
- Add tunable block size (DEFAULT_BLOCK_SIZE = 256)
- Enable --use_fast_math for PhotonDifferentialSplatting in setup.py
- Add comprehensive test suite with numerical and performance tests

Co-authored-by: CompN3rd <1405794+CompN3rd@users.noreply.github.com>
CompN3rd left a comment:
The tests should verify numerical equivalence against the unoptimized kernels.
…optimized baseline

- Add CPU-based reference implementations matching original unoptimized CUDA kernel logic
- Update numerical equivalence tests to compare optimized vs reference implementations
- Reference functions: silverman_kernel, d_silverman_kernel, pixel_to_coord, coord_to_pixel, matrix_multiply
- reference_pds_forward: replicates original forward kernel with offset-based loops
- reference_pds_backward: replicates original backward kernel with local accumulation
- Tests now validate that optimizations maintain numerical equivalence with tight tolerances
- Updated documentation to explain reference implementation approach

Co-authored-by: CompN3rd <1405794+CompN3rd@users.noreply.github.com>
Added reference implementations that replicate the original unoptimized kernel logic. The tests now compare optimized CUDA kernels against CPU-based reference implementations.
CompN3rd left a comment:
Don't compare with a reference NumPy implementation, but rather with the kernel version before the changes made to the CUDA kernels in this PR.
…sting

- Add original unoptimized CUDA kernels (_original suffix) to photon_differentials.cu
- Expose pds_forward_original and pds_backward_original through C++ wrapper
- Update tests to compare optimized vs original CUDA kernels (not NumPy)
- Remove NumPy reference implementations from test file
- Both kernels run on GPU for direct comparison
- Update documentation to reflect use of original CUDA kernels as reference

Co-authored-by: CompN3rd <1405794+CompN3rd@users.noreply.github.com>
Changed to compare with original CUDA kernels instead of a NumPy reference. Added …
Refactored `pds_cuda_forward_kernel` and `pds_cuda_backward_kernel` to reduce memory bandwidth and improve GPU occupancy through grid-stride loops, value hoisting, and local accumulation.

CUDA Kernel Optimizations
Forward kernel:
- Hoists loop-invariant values (`w`, `h`, `inv_half_w_sq`, `inv_half_h_sq`) outside the main loop
- Precomputes per-photon values (`radius`, `E_center`, `M_center[6]`, bounding box) before the inner loops; see the sketch after this list
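A minimal sketch of the grid-stride and hoisting pattern these bullets describe. The variable names follow the list above; the kernel signature, data layout, and single-pixel splat are illustrative assumptions, not the actual `pds_cuda_forward_kernel`:

```cuda
__global__ void forward_sketch(const float* __restrict__ Ep,  // photon energies
                               const float* __restrict__ xp,  // photon positions, n x 2
                               float* __restrict__ image,     // output image, h x w
                               int n_photons, int w, int h)
{
    // Value hoisting: loop invariants are computed once per thread,
    // not recomputed per photon or per pixel.
    const float inv_half_w_sq = 4.0f / ((float)w * (float)w);
    const float inv_half_h_sq = 4.0f / ((float)h * (float)h);

    // Grid-stride loop: each thread strides by the total thread count,
    // so any grid size covers any n_photons.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n_photons;
         i += blockDim.x * gridDim.x)
    {
        // Per-photon values precomputed before the inner pixel loops
        // (the real kernel also hoists radius, M_center[6], and a bounding box).
        const float E_center = Ep[i];
        const int px = min(max((int)xp[2 * i], 0), w - 1);
        const int py = min(max((int)xp[2 * i + 1], 0), h - 1);

        // Illustrative splat into a single pixel; the real kernel iterates
        // the photon's bounding box.
        atomicAdd(&image[py * w + px], E_center * inv_half_w_sq * inv_half_h_sq);
    }
}
```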
Backward kernel:
- Local accumulation in registers (`g_Ep`, `g_xp[2]`, `g_Mp[6]`) reduces global writes by ~100x; see the sketch after this list
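A sketch of the local-accumulation idea: partial gradients are summed in registers and written to global memory once per photon rather than once per pixel. The accumulator mirrors `g_Ep` from the bullet; the `bbox` footprint array and the rest of the signature are illustrative assumptions:

```cuda
__global__ void backward_sketch(const float* __restrict__ grad_image, // h x w
                                const int* __restrict__ bbox,         // n x 4: x0, y0, x1, y1
                                float* __restrict__ grad_Ep,          // n_photons
                                int n_photons, int w)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n_photons;
         i += blockDim.x * gridDim.x)
    {
        // Local accumulator in a register (the real kernel keeps g_Ep,
        // g_xp[2], and g_Mp[6] this way).
        float g_Ep = 0.0f;

        // Inner loops over the photon's pixel footprint accumulate locally;
        // no global memory writes happen inside these loops.
        const int x0 = bbox[4 * i],     y0 = bbox[4 * i + 1];
        const int x1 = bbox[4 * i + 2], y1 = bbox[4 * i + 3];
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                g_Ep += grad_image[y * w + x];

        // One global write per photon instead of one per covered pixel,
        // which is where the ~100x reduction in global writes comes from.
        grad_Ep[i] = g_Ep;
    }
}
```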
Original unoptimized kernels:
- Added `pds_cuda_forward_kernel_original` and `pds_cuda_backward_kernel_original` to preserve the pre-optimization implementation
- Exposed as the `pds_forward_original()` and `pds_backward_original()` Python functions; a binding sketch follows this list
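A plausible sketch of how the preserved kernels are surfaced through the C++ wrapper with pybind11. The wrapper signatures are assumptions, but the `m.def` registration is the standard PyTorch-extension mechanism:

```cpp
#include <torch/extension.h>

// Wrapper declarations (implemented against the CUDA kernels); the exact
// signatures are illustrative assumptions.
torch::Tensor pds_forward_original(torch::Tensor Ep, torch::Tensor xp,
                                   torch::Tensor Mp, torch::Tensor cp,
                                   torch::Tensor radius);
torch::Tensor pds_backward_original(torch::Tensor Ep, torch::Tensor xp,
                                    torch::Tensor Mp, torch::Tensor cp,
                                    torch::Tensor radius);

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    // The optimized entry points would be registered here as well.
    m.def("pds_forward_original", &pds_forward_original,
          "Pre-optimization forward splatting kernel (reference)");
    m.def("pds_backward_original", &pds_backward_original,
          "Pre-optimization backward splatting kernel (reference)");
}
```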
Configuration:
- `DEFAULT_BLOCK_SIZE = 256` (reduced from the hardcoded 512); see the launch sketch after this list
- `--use_fast_math` enabled for the PhotonDifferentialSplatting extension only
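A sketch of how the tunable block size would enter the launch configuration, reusing the illustrative `forward_sketch` kernel from above:

```cuda
// Host-side launch using the tunable block size; forward_sketch is the
// illustrative kernel from the forward-kernel section above.
constexpr int DEFAULT_BLOCK_SIZE = 256;  // previously a hardcoded 512

void launch_forward_sketch(const float* Ep, const float* xp, float* image,
                           int n_photons, int w, int h)
{
    // Round up so every photon index is covered; the grid-stride loop in
    // the kernel keeps the launch correct for any grid size.
    const int blocks = (n_photons + DEFAULT_BLOCK_SIZE - 1) / DEFAULT_BLOCK_SIZE;
    forward_sketch<<<blocks, DEFAULT_BLOCK_SIZE>>>(Ep, xp, image, n_photons, w, h);
}
```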
Input Validation

Added `CHECK_INPUT` for all tensors in `pds_backward`: `Ep`, `xp`, `Mp`, `cp`, `radius`.
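`CHECK_INPUT` follows the standard PyTorch C++ extension convention; the macro definitions below are a sketch of that convention, not the project's verbatim source:

```cpp
#include <torch/extension.h>

// Standard PyTorch-extension validation macros; assumed to match the
// project's definitions, which follow this widely used convention.
#define CHECK_CUDA(x) TORCH_CHECK((x).is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK((x).is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

void validate_pds_backward_inputs(const torch::Tensor& Ep, const torch::Tensor& xp,
                                  const torch::Tensor& Mp, const torch::Tensor& cp,
                                  const torch::Tensor& radius)
{
    // Every tensor argument is checked before the kernel launch.
    CHECK_INPUT(Ep);
    CHECK_INPUT(xp);
    CHECK_INPUT(Mp);
    CHECK_INPUT(cp);
    CHECK_INPUT(radius);
}
```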
Test Suite

Added `tests/test_photon_differentials.py` with:
- Numerical equivalence tests with tight tolerances (rtol=1e-5, atol=1e-6)
- Performance tests

Example test structure:
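Sketched here in C++ against the wrapper functions (the actual suite is pytest-based); the `pds_forward` signature and tensor shapes are illustrative assumptions:

```cpp
#include <torch/torch.h>

// Hypothetical wrapper declarations; the real signatures of pds_forward and
// pds_forward_original are assumptions made for illustration.
torch::Tensor pds_forward(torch::Tensor Ep, torch::Tensor xp, torch::Tensor Mp,
                          torch::Tensor cp, torch::Tensor radius);
torch::Tensor pds_forward_original(torch::Tensor Ep, torch::Tensor xp, torch::Tensor Mp,
                                   torch::Tensor cp, torch::Tensor radius);

// Mirrors the pytest assertion: run both kernels on the GPU and require
// agreement within rtol=1e-5, atol=1e-6.
bool forward_matches_original(int n_photons)
{
    auto opts = torch::TensorOptions().device(torch::kCUDA).dtype(torch::kFloat32);
    auto Ep = torch::rand({n_photons}, opts);
    auto xp = torch::rand({n_photons, 2}, opts);  // shapes are illustrative
    auto Mp = torch::rand({n_photons, 6}, opts);
    auto cp = torch::rand({n_photons, 3}, opts);
    auto radius = torch::rand({n_photons}, opts);

    auto out_opt = pds_forward(Ep, xp, Mp, cp, radius);
    auto out_ref = pds_forward_original(Ep, xp, Mp, cp, radius);
    return torch::allclose(out_opt, out_ref, /*rtol=*/1e-5, /*atol=*/1e-6);
}
```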
Tests compare the optimized kernels directly against the original CUDA implementation (both running on GPU) to validate numerical equivalence.
Files Changed
- `PyOptix/kernel/photon_differentials.cu` - Optimized kernels and original unoptimized kernels for testing
- `PyOptix/PhotonDifferentialSplattig.cpp` - Input validation and exposure of original kernels
- `setup.py` - Fast math compilation flag
- `tests/test_photon_differentials.py` - Test suite (7 tests, 3 classes)
- `OPTIMIZATION_SUMMARY.md` - Technical details
- `tests/README.md` - Testing guide

Original prompt
This pull request was created as a result of the following prompt from Copilot chat.