Add TPU CI support via GCP TPU VMs #1

Open
robtaylor wants to merge 10 commits into main from feature/tpu-ci-support

Conversation

@robtaylor
Contributor

Summary

  • Add GitHub Actions workflow for running tests/profiling on Google Cloud TPU v5e
  • Add setup script for configuring GCP project with TPU permissions and quota
  • Update test configuration to recognize TPU backend

Changes

  • .github/workflows/test-tpu.yml - Manual dispatch workflow that creates a TPU VM, runs tests, and cleans up
  • scripts/setup_gcp_tpu_ci.sh - Idempotent setup script for TPU API, IAM roles, and quota guidance
  • tests/conftest.py - Add TPU backend recognition (a rough sketch follows this list)
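
The conftest.py diff itself isn't shown above; as a rough illustration only, "recognize TPU backend" could look something like the sketch below. The `requires_x64` marker is hypothetical, not taken from the PR.

```python
import jax
import pytest

@pytest.fixture(scope="session")
def backend() -> str:
    """The active JAX backend for this session: "cpu", "gpu", or "tpu"."""
    return jax.default_backend()

def pytest_runtest_setup(item):
    # Hypothetical marker: skip float64-only tests on TPU, where LU
    # decomposition supports only F32/C64 (see the commit notes below).
    if item.get_closest_marker("requires_x64") and jax.default_backend() == "tpu":
        pytest.skip("float64 is not supported on the TPU backend")
```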

TPU Configuration

| Setting  | Value                              |
| -------- | ---------------------------------- |
| TPU Type | v5litepod-8 (8 chips, 128 GB HBM2) |
| Zone     | us-central1-a                      |
| Runtime  | v2-alpha-tpuv5-lite                |

Cost Estimates

  • On-demand: ~$9.60/hour (8 chips × $1.20/chip/hour)
  • Spot VM: ~$1-2/hour (up to 91% discount)
  • A 10-minute test run: ~$1.60 on-demand, ~$0.20 on Spot

Prerequisites

Before running the workflow:

  1. Run ./scripts/setup_gcp_tpu_ci.sh to configure GCP
  2. Request TPU quota if needed (script provides links)

Test plan

  • Run setup script to verify GCP configuration
  • Trigger workflow manually to test TPU VM creation/teardown
  • Verify tests run on TPU backend

🤖 Generated with Claude Code

robtaylor and others added 9 commits December 8, 2025 23:44
- Add test-tpu.yml workflow for running tests on TPU v5e (v5litepod-8)
- Add setup_gcp_tpu_ci.sh for configuring TPU quota and permissions
- Update conftest.py to recognize TPU backend

Workflow features:
- Manual dispatch with Spot VM option for cost savings
- Runs same profiling + tests as GPU Cloud Run workflow
- Creates TPU VM on demand, cleans up after tests
- Extracts profiling report to GitHub job summary

Estimated costs:
- On-demand: ~$9.60/hour (8 chips × $1.20)
- Spot: ~$1-2/hour (up to 91% discount)

uv pip install requires a virtual environment. Run uv sync first
to create the venv, then install jax[tpu] into it.

Required for openvaf-py submodule to be included in the tarball.

TPU v5e only supports F32 and C64 for LuDecomposition operations.
Disable JAX_ENABLE_X64 when running on TPU to use float32 instead.
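
The commit does this via the JAX_ENABLE_X64 environment variable; a minimal in-process sketch of the same idea (not the PR's exact code):

```python
import jax

# Stay in float32 on TPU, since v5e LU decomposition supports only
# F32 and C64; keep 64-bit mode everywhere else.
jax.config.update("jax_enable_x64", jax.default_backend() != "tpu")
```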

The profiler script hardcoded jax_enable_x64 = True, which overrode
the environment setting. Now it checks JAX_PLATFORMS before enabling.
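
A sketch of the described check (the profiler script itself isn't shown here):

```python
import os
import jax

# Only force 64-bit mode when the run is not targeting TPU.
# JAX_PLATFORMS is JAX's standard backend-selection variable.
if os.environ.get("JAX_PLATFORMS", "").lower() != "tpu":
    jax.config.update("jax_enable_x64", True)
```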

TPU doesn't have native sparse solve support (no XLA sparse ops).
Fall back to CPU via scipy pure_callback, same as the existing CPU path.
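
A rough sketch of that kind of host fallback; the PR's actual sparse.py isn't shown, and the CSR layout and function name here are assumptions:

```python
import jax
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import spsolve

def cpu_sparse_solve(data, indices, indptr, b, *, n):
    """Solve A x = b on the host with SciPy, with A given in CSR form."""
    def host_solve(data, indices, indptr, b):
        A = csr_matrix((data, indices, indptr), shape=(n, n))
        return spsolve(A, b).astype(b.dtype)

    out = jax.ShapeDtypeStruct(b.shape, b.dtype)
    # pure_callback ships the operands to the host, runs SciPy there,
    # and feeds the result back into the traced computation.
    return jax.pure_callback(host_solve, out, data, indices, indptr, b)
```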

Remove the CPU fallback for TPU sparse solve. Let the experiment
run with native TPU operations in F32 mode to see what works.

sparse.py now properly detects the TPU backend and uses a dense solve
(via BCOO.todense() + jnp.linalg.solve) instead of spsolve, which
only works on GPU/CUDA.
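
Based on that description, the TPU branch amounts to something like the following sketch (not the PR's exact code):

```python
import jax
import jax.numpy as jnp
from jax.experimental import sparse as jsparse

def tpu_dense_solve(A: jsparse.BCOO, b: jnp.ndarray) -> jnp.ndarray:
    """Densify the BCOO operand and use the dense solver.

    Workable for CI-sized systems, but dense solves cost O(n^3) time and
    O(n^2) memory, so this won't scale to large matrices.
    """
    assert jax.default_backend() == "tpu"
    return jnp.linalg.solve(A.todense(), b)
```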

robtaylor force-pushed the feature/tpu-ci-support branch from 32a3fc7 to fc2582c on December 8, 2025 23:45
Try multiple zones (us-central1-a, us-west4-a, us-east1-d, us-east5-a)
when creating the TPU VM, to handle temporary capacity exhaustion in
any single zone.
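
The workflow presumably does this in shell; a sketch of the same fallback logic in Python, using only documented `gcloud compute tpus tpu-vm create` flags:

```python
import subprocess

# Candidate zones from the commit message, tried in order.
ZONES = ["us-central1-a", "us-west4-a", "us-east1-d", "us-east5-a"]

def create_tpu_vm(name: str) -> str:
    """Try each zone until one has capacity; return the zone that worked."""
    for zone in ZONES:
        result = subprocess.run(
            ["gcloud", "compute", "tpus", "tpu-vm", "create", name,
             "--zone", zone,
             "--accelerator-type", "v5litepod-8",
             "--version", "v2-alpha-tpuv5-lite"],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return zone
    raise RuntimeError(f"no TPU capacity for {name!r} in any candidate zone")
```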

robtaylor force-pushed the main branch 9 times, most recently from 673886c to 02a1c28 on December 18, 2025 20:49
robtaylor added a commit that referenced this pull request Dec 20, 2025
Documents tasks needed for platform-agnostic sparse solver support:

TPU:
- Fix failing TPU CI (PR #1) - tests timing out after 6+ hours
- Implement GMRES + block-Jacobi fallback solver (sketched after this note)
- Test and benchmark dense solver on TPU

Non-NVIDIA GPU:
- AMD ROCm: investigate hipSPARSE/rocSOLVER
- Intel: investigate oneMKL sparse solver

Also documents the backend detection strategy that needs to be extended
to handle different GPU vendors gracefully.
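
A rough sketch of the proposed fallback, with A shown dense for brevity and point-Jacobi preconditioning in place of the block variant (which would invert small diagonal blocks instead of single entries):

```python
import jax.numpy as jnp
from jax.scipy.sparse.linalg import gmres

def gmres_jacobi_solve(A, b, tol=1e-6):
    """Iterative fallback that stays in float32, so it runs natively on TPU."""
    inv_diag = 1.0 / jnp.diag(A)            # D^{-1}, the Jacobi preconditioner
    precondition = lambda x: inv_diag * x   # apply D^{-1} as a linear map
    x, _ = gmres(lambda v: A @ v, b, M=precondition, tol=tol, restart=20)
    return x
```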
