Add TPU CI support via GCP TPU VMs #1

Open
robtaylor wants to merge 10 commits into main from feature/tpu-ci-support

Conversation

@robtaylor
Contributor

Summary

  • Add GitHub Actions workflow for running tests/profiling on Google Cloud TPU v5e
  • Add setup script for configuring GCP project with TPU permissions and quota
  • Update test configuration to recognize TPU backend

Changes

  • .github/workflows/test-tpu.yml - Manual dispatch workflow that creates a TPU VM, runs tests, and cleans up
  • scripts/setup_gcp_tpu_ci.sh - Idempotent setup script for TPU API, IAM roles, and quota guidance
  • tests/conftest.py - Add TPU backend recognition (a rough sketch follows this list)
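
The conftest.py diff itself isn't shown above; as a rough illustration only, "recognize TPU backend" could look something like the sketch below. The `requires_x64` marker is hypothetical, not taken from the PR.

```python
import jax
import pytest

@pytest.fixture(scope="session")
def backend() -> str:
    """The active JAX backend for this session: "cpu", "gpu", or "tpu"."""
    return jax.default_backend()

def pytest_runtest_setup(item):
    # Hypothetical marker: skip float64-only tests on TPU, where LU
    # decomposition supports only F32/C64 (see the commit notes below).
    if item.get_closest_marker("requires_x64") and jax.default_backend() == "tpu":
        pytest.skip("float64 is not supported on the TPU backend")
```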

TPU Configuration

| Setting  | Value                              |
| -------- | ---------------------------------- |
| TPU Type | v5litepod-8 (8 chips, 128 GB HBM2) |
| Zone     | us-central1-a                      |
| Runtime  | v2-alpha-tpuv5-lite                |

Cost Estimates

  • On-demand: ~$9.60/hour (8 chips × $1.20/chip/hour)
  • Spot VM: ~$1-2/hour (up to 91% discount)
  • A 10-minute test run: ~$1.60 on-demand, ~$0.20 on Spot

Prerequisites

Before running the workflow:

  1. Run ./scripts/setup_gcp_tpu_ci.sh to configure GCP
  2. Request TPU quota if needed (script provides links)

Test plan

  • Run setup script to verify GCP configuration
  • Trigger workflow manually to test TPU VM creation/teardown
  • Verify tests run on TPU backend

🤖 Generated with Claude Code

robtaylor and others added 9 commits December 8, 2025 23:44
- Add test-tpu.yml workflow for running tests on TPU v5e (v5litepod-8)
- Add setup_gcp_tpu_ci.sh for configuring TPU quota and permissions
- Update conftest.py to recognize TPU backend

Workflow features:
- Manual dispatch with Spot VM option for cost savings
- Runs same profiling + tests as GPU Cloud Run workflow
- Creates TPU VM on demand, cleans up after tests
- Extracts profiling report to GitHub job summary

Estimated costs:
- On-demand: ~$9.60/hour (8 chips × $1.20)
- Spot: ~$1-2/hour (up to 91% discount)

uv pip install requires a virtual environment. Run uv sync first
to create the venv, then install jax[tpu] into it.

Required for openvaf-py submodule to be included in the tarball.

TPU v5e only supports F32 and C64 for LuDecomposition operations.
Disable JAX_ENABLE_X64 when running on TPU to use float32 instead.
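
The commit does this via the JAX_ENABLE_X64 environment variable; a minimal in-process sketch of the same idea (not the PR's exact code):

```python
import jax

# Stay in float32 on TPU, since v5e LU decomposition supports only
# F32 and C64; keep 64-bit mode everywhere else.
jax.config.update("jax_enable_x64", jax.default_backend() != "tpu")
```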

The profiler script hardcoded jax_enable_x64 = True, which overrode
the environment setting. Now it checks JAX_PLATFORMS before enabling.
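
A sketch of the described check (the profiler script itself isn't shown here):

```python
import os
import jax

# Only force 64-bit mode when the run is not targeting TPU.
# JAX_PLATFORMS is JAX's standard backend-selection variable.
if os.environ.get("JAX_PLATFORMS", "").lower() != "tpu":
    jax.config.update("jax_enable_x64", True)
```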

TPU doesn't have native sparse solve support (no XLA sparse ops).
Fall back to CPU via scipy pure_callback, same as the existing CPU path.
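
A rough sketch of that kind of host fallback; the PR's actual sparse.py isn't shown, and the CSR layout and function name here are assumptions:

```python
import jax
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import spsolve

def cpu_sparse_solve(data, indices, indptr, b, *, n):
    """Solve A x = b on the host with SciPy, with A given in CSR form."""
    def host_solve(data, indices, indptr, b):
        A = csr_matrix((data, indices, indptr), shape=(n, n))
        return spsolve(A, b).astype(b.dtype)

    out = jax.ShapeDtypeStruct(b.shape, b.dtype)
    # pure_callback ships the operands to the host, runs SciPy there,
    # and feeds the result back into the traced computation.
    return jax.pure_callback(host_solve, out, data, indices, indptr, b)
```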

Remove the CPU fallback for TPU sparse solve. Let the experiment
run with native TPU operations in F32 mode to see what works.

sparse.py now properly detects the TPU backend and uses a dense solve
(via BCOO.todense() + jnp.linalg.solve) instead of spsolve, which
only works on GPU/CUDA.
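
Based on that description, the TPU branch amounts to something like the following sketch (not the PR's exact code):

```python
import jax
import jax.numpy as jnp
from jax.experimental import sparse as jsparse

def tpu_dense_solve(A: jsparse.BCOO, b: jnp.ndarray) -> jnp.ndarray:
    """Densify the BCOO operand and use the dense solver.

    Workable for CI-sized systems, but dense solves cost O(n^3) time and
    O(n^2) memory, so this won't scale to large matrices.
    """
    assert jax.default_backend() == "tpu"
    return jnp.linalg.solve(A.todense(), b)
```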

robtaylor force-pushed the feature/tpu-ci-support branch from 32a3fc7 to fc2582c on December 8, 2025 23:45
Try multiple zones (us-central1-a, us-west4-a, us-east1-d, us-east5-a)
when creating the TPU VM, to handle temporary capacity exhaustion in
any single zone.
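
The workflow presumably does this in shell; a sketch of the same fallback logic in Python, using only documented `gcloud compute tpus tpu-vm create` flags:

```python
import subprocess

# Candidate zones from the commit message, tried in order.
ZONES = ["us-central1-a", "us-west4-a", "us-east1-d", "us-east5-a"]

def create_tpu_vm(name: str) -> str:
    """Try each zone until one has capacity; return the zone that worked."""
    for zone in ZONES:
        result = subprocess.run(
            ["gcloud", "compute", "tpus", "tpu-vm", "create", name,
             "--zone", zone,
             "--accelerator-type", "v5litepod-8",
             "--version", "v2-alpha-tpuv5-lite"],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return zone
    raise RuntimeError(f"no TPU capacity for {name!r} in any candidate zone")
```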

robtaylor force-pushed the main branch 9 times, most recently from 673886c to 02a1c28 on December 18, 2025 20:49
robtaylor added a commit that referenced this pull request Dec 20, 2025
Documents tasks needed for platform-agnostic sparse solver support:

TPU:
- Fix failing TPU CI (PR #1) - tests timing out after 6+ hours
- Implement GMRES + block-Jacobi fallback solver (sketched after this note)
- Test and benchmark dense solver on TPU

Non-NVIDIA GPU:
- AMD ROCm: investigate hipSPARSE/rocSOLVER
- Intel: investigate oneMKL sparse solver

Also documents the backend detection strategy that needs to be extended
to handle different GPU vendors gracefully.
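
A rough sketch of the proposed fallback, with A shown dense for brevity and point-Jacobi preconditioning in place of the block variant (which would invert small diagonal blocks instead of single entries):

```python
import jax.numpy as jnp
from jax.scipy.sparse.linalg import gmres

def gmres_jacobi_solve(A, b, tol=1e-6):
    """Iterative fallback that stays in float32, so it runs natively on TPU."""
    inv_diag = 1.0 / jnp.diag(A)            # D^{-1}, the Jacobi preconditioner
    precondition = lambda x: inv_diag * x   # apply D^{-1} as a linear map
    x, _ = gmres(lambda v: A @ v, b, M=precondition, tol=tol, restart=20)
    return x
```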
