- Add test-tpu.yml workflow for running tests on TPU v5e (v5litepod-8)
- Add setup_gcp_tpu_ci.sh for configuring TPU quota and permissions
- Update conftest.py to recognize TPU backend

Workflow features:
- Manual dispatch with Spot VM option for cost savings
- Runs same profiling + tests as GPU Cloud Run workflow
- Creates TPU VM on demand, cleans up after tests
- Extracts profiling report to GitHub job summary

Estimated costs:
- On-demand: ~$9.60/hour (8 chips × $1.20)
- Spot: ~$1-2/hour (up to 91% discount)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
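The cost figures above can be checked with a little arithmetic. This is only a sketch using the numbers quoted in the commit message (8 chips, $1.20 per chip-hour, up to 91% Spot discount); the function name `hourly_cost` is made up for illustration and actual GCP pricing may differ.

```python
# Figures taken from the commit message; real pricing may differ.
CHIPS = 8
ON_DEMAND_PER_CHIP_HOUR = 1.20   # USD per chip-hour
MAX_SPOT_DISCOUNT = 0.91         # "up to 91%"


def hourly_cost(chips, per_chip_hour, discount=0.0):
    """Hourly cost in USD for `chips` chips after a fractional discount."""
    return round(chips * per_chip_hour * (1.0 - discount), 2)


on_demand = hourly_cost(CHIPS, ON_DEMAND_PER_CHIP_HOUR)
spot_floor = hourly_cost(CHIPS, ON_DEMAND_PER_CHIP_HOUR, MAX_SPOT_DISCOUNT)

print(f"on-demand: ${on_demand:.2f}/hour")
print(f"spot at maximum discount: ${spot_floor:.2f}/hour")
```

At the full 91% discount the hourly rate comes out a little under $1, so the quoted ~$1-2/hour Spot range corresponds to discounts somewhat below the maximum.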
uv pip install requires a virtual environment. Run uv sync first to create the venv, then install jax[tpu] into it. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Required for openvaf-py submodule to be included in the tarball. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
TPU v5e only supports F32 and C64 for LuDecomposition operations. Disable JAX_ENABLE_X64 when running on TPU to use float32 instead. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The profiler script hardcoded jax_enable_x64 = True, which overrode the environment setting. Now it checks JAX_PLATFORMS before enabling. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
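A minimal sketch of the check described in the last two commits, assuming the decision is driven only by the JAX_PLATFORMS environment variable. The helper names `should_enable_x64` and `configure_jax` are hypothetical and not taken from the profiler script.

```python
import os


def should_enable_x64(env=None):
    """Return False when JAX_PLATFORMS names TPU, True otherwise.

    Assumption: JAX_PLATFORMS is set explicitly by the workflow. If it is
    unset, this sketch keeps 64-bit mode on even if JAX later picks a TPU.
    """
    env = os.environ if env is None else env
    platforms = [p.strip().lower() for p in env.get("JAX_PLATFORMS", "").split(",")]
    return "tpu" not in platforms


def configure_jax():
    """Apply the decision to JAX (imported lazily so the check runs without JAX)."""
    import jax

    jax.config.update("jax_enable_x64", should_enable_x64())
```

Calling `configure_jax()` before any arrays are created keeps float64 on CPU/GPU runs and falls back to float32 when the workflow sets JAX_PLATFORMS=tpu.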
TPU doesn't have native sparse solve support (no XLA sparse ops). Fall back to CPU via scipy pure_callback, same as the existing CPU path. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Remove the CPU fallback for TPU sparse solve. Let the experiment run with native TPU operations in F32 mode to see what works. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
sparse.py now properly detects TPU backend and uses dense solve (via BCOO.todense() + jnp.linalg.solve) instead of spsolve which only works on GPU/CUDA. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
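The backend dispatch described above might look roughly like the following. This is a sketch, not the actual contents of sparse.py: `select_solver` and `dense_solve` are hypothetical names, and the CPU branch only reflects the scipy pure_callback path mentioned in an earlier commit.

```python
def select_solver(backend):
    """Map a JAX backend name to a solver strategy (labels are illustrative)."""
    backend = backend.lower()
    if backend in ("gpu", "cuda"):
        return "spsolve"          # sparse direct solve, GPU/CUDA only
    if backend == "tpu":
        return "dense"            # densify, then jnp.linalg.solve
    return "scipy_callback"       # CPU path via scipy


def dense_solve(a_bcoo, b):
    """Solve A x = b by densifying a BCOO matrix; memory grows as O(n^2)."""
    import jax.numpy as jnp  # lazy import: only needed when actually solving

    return jnp.linalg.solve(a_bcoo.todense(), b)


def solver_for_current_backend():
    """Pick the strategy for whatever backend JAX selected at startup."""
    import jax

    return select_solver(jax.default_backend())
```

Densifying is only practical for modest matrix sizes, which is presumably why the dense path is treated as something to test and benchmark rather than as a final answer.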
32a3fc7 to fc2582c
Try multiple zones (us-central1-a, us-west4-a, us-east1-d, us-east5-a) when creating TPU VM to handle temporary capacity exhaustion in any single zone. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
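The zone fallback can be sketched as a simple loop. The zone list comes from the commit message; `create_in_first_available_zone` and the `create` callback are hypothetical stand-ins for whatever the workflow actually runs to create the TPU VM, and treating every `RuntimeError` as a capacity failure is an assumption of this sketch.

```python
ZONES = ["us-central1-a", "us-west4-a", "us-east1-d", "us-east5-a"]


def create_in_first_available_zone(zones, create):
    """Call create(zone) for each zone in order; return (zone, result) on first success.

    `create` is a caller-supplied function that raises RuntimeError when the
    zone cannot satisfy the request (for example, capacity exhaustion).
    """
    errors = {}
    for zone in zones:
        try:
            return zone, create(zone)
        except RuntimeError as exc:
            errors[zone] = str(exc)   # remember why this zone failed
    raise RuntimeError(f"no zone had capacity: {errors}")


def fake_create(zone):
    """Stand-in for the real creation step: only us-east1-d has capacity."""
    if zone != "us-east1-d":
        raise RuntimeError("capacity exhausted")
    return f"tpu-vm-in-{zone}"


zone, vm = create_in_first_available_zone(ZONES, fake_create)
print(zone, vm)
```

Returning the zone along with the result matters because the cleanup step needs to delete the VM from the same zone it was created in.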
673886c to 02a1c28
robtaylor added a commit that referenced this pull request on Dec 20, 2025
Documents tasks needed for platform-agnostic sparse solver support:

TPU:
- Fix failing TPU CI (PR #1) - tests timing out after 6+ hours
- Implement GMRES + block-Jacobi fallback solver
- Test and benchmark dense solver on TPU

Non-NVIDIA GPU:
- AMD ROCm: investigate hipSPARSE/rocSOLVER
- Intel: investigate oneMKL sparse solver

Also documents the backend detection strategy that needs to be extended to handle different GPU vendors gracefully.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
Changes
- .github/workflows/test-tpu.yml - Manual dispatch workflow that creates a TPU VM, runs tests, and cleans up
- scripts/setup_gcp_tpu_ci.sh - Idempotent setup script for TPU API, IAM roles, and quota guidance
- tests/conftest.py - Add TPU backend recognition
TPU Configuration
Cost Estimates
Prerequisites
Before running the workflow:
- Run ./scripts/setup_gcp_tpu_ci.sh to configure GCP
Test plan
🤖 Generated with Claude Code