
JIT Token Expiration with Long-Running Sequential Workflows #4248

@manascb1344


Problem Summary

When running GitHub Actions workflows with max-parallel: 1 and long-running sequential jobs (total runtime > 60 minutes), JIT (Just-In-Time) runner tokens expire after ~60 minutes, causing jobs to fail with "The operation was canceled" error.

This is a fundamental limitation when:

  • Total workflow runtime exceeds JIT token lifetime (~60 minutes)
  • Jobs must run sequentially (max-parallel: 1)
  • Using ephemeral JIT-configured self-hosted runners

Environment

  • Runner Platform: Serverless (Modal/AWS Lambda/Azure Functions/etc.)
  • Runner Type: Self-hosted with JIT configuration
  • Configuration:
    • Jobs: N matrix jobs (where N × job_duration > 60 minutes)
    • max-parallel: 1 (sequential execution)
    • Example: 37 jobs × 6 minutes = 222 minutes total runtime

Steps to Reproduce

  1. Create workflow with matrix strategy and max-parallel: 1:
strategy:
  fail-fast: false
  max-parallel: 1
  matrix:
    job_id: [1, 2, 3, ..., N]  # N jobs where N × minutes_per_job > 60
  2. Use self-hosted runner with JIT configuration:
# Serverless runner fetches JIT config on webhook receipt
jit_config = await fetch_jit_config(repo_url, job_id, labels)
sandbox = modal.Sandbox.create(env={"GHA_JIT_CONFIG": jit_config})
  3. Trigger workflow with enough jobs that total runtime exceeds 60 minutes

  4. Observe that:

    • Jobs 1-10 complete successfully (~60 minutes)
    • Jobs 11+ fail with "The operation was canceled"

Expected Behavior

All jobs should complete successfully, with each job getting a fresh JIT token when it starts (not when the webhook is received).

Actual Behavior

Job Range | Status     | Time Elapsed | JIT Token State
1-10      | ✅ Success | 0-60 min     | Valid
11+       | ❌ Failed  | 60+ min      | Expired

Error observed:

The operation was canceled.

Failed job timing pattern:

  • Jobs complete successfully until ~60-minute mark
  • Jobs starting after 60 minutes fail immediately or within 2-3 minutes
  • Failure occurs exactly at JIT token expiration time

Root Cause Analysis

JIT Token Lifecycle

From GitHub documentation and runner source code:

  1. JIT Generation: When generate-jitconfig API is called, GitHub creates a runner registration with a time-limited token
  2. Token Validity: ~60 minutes (confirmed via GitHub Community Discussion #25699)
  3. Expiration: After 60 minutes, GitHub invalidates the runner registration
  4. Job Cancellation: Any job using that runner gets "The operation was canceled"

The Math Problem

N jobs × M minutes each = Total runtime
JIT token lifetime = 60 minutes

If Total runtime > 60 minutes:
  Jobs 1 to floor(60/M): Complete successfully ✅
  Jobs floor(60/M)+1 to N: Fail with expired token ❌

Example with 6-minute jobs:
  37 jobs × 6 minutes = 222 minutes total
  Jobs 1-10: Complete within 60 min window ✅
  Jobs 11-37: Start after token expiration ❌
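The cutoff index can be computed directly. A minimal sketch (the function name and defaults are illustrative):

import math
from typing import Optional

def first_failing_job(n_jobs: int, minutes_per_job: float,
                      token_lifetime_min: float = 60.0) -> Optional[int]:
    """Return the 1-based index of the first job that starts after the
    JIT token expires, or None if the whole matrix fits in the window."""
    jobs_that_fit = math.floor(token_lifetime_min / minutes_per_job)
    return jobs_that_fit + 1 if jobs_that_fit < n_jobs else None

print(first_failing_job(37, 6))  # -> 11 (jobs 11-37 fail)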

Why Current Architecture Fails

The serverless runner typically:

  1. Receives N webhooks simultaneously when the workflow triggers
  2. Fetches N JIT configs immediately (all tokens created at T=0)
  3. Spawns N sandboxes/containers (each with pre-fetched JIT)
  4. Jobs run sequentially, but JIT tokens expire at T=60 regardless

Key Issue: JIT tokens are generated at webhook receipt time, not at job execution time.
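In code, the failing pattern looks roughly like this (a sketch; the handler and helper names are illustrative, not from any particular project):

# Anti-pattern: every JIT config is generated when the webhook arrives
# (T=0), even though jobs 2..N won't execute for a long time.
async def on_workflow_job_queued(payload: dict) -> None:
    job = payload["workflow_job"]
    # The token clock starts NOW, not when the job is actually picked up
    jit_config = await fetch_jit_config(           # hypothetical helper
        repo_url=payload["repository"]["html_url"],
        job_id=job["id"],
        labels=job["labels"],
    )
    await spawn_sandbox(jit_config)                # hypothetical helper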

Attempted Solutions

1. Queue-Based Worker with Deferred JIT Fetch

Approach: Move JIT fetching from webhook handler to worker function that processes jobs sequentially.

# Webhook: Queue metadata only
# Worker: Fetch JIT when job actually runs, then spawn
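A sketch of the deferred-fetch variant (the queue and helper names are illustrative):

import asyncio

job_queue: asyncio.Queue = asyncio.Queue()

async def on_workflow_job_queued(payload: dict) -> None:
    # Webhook handler: enqueue metadata only; no JIT fetch yet
    await job_queue.put(payload["workflow_job"])

async def worker() -> None:
    while True:
        job = await job_queue.get()
        # JIT is fetched only when the job is about to run...
        jit_config = await fetch_jit_config(job)   # hypothetical helper
        # ...but by now GitHub may already have canceled the job,
        # because it expected a runner shortly after queueing.
        await spawn_sandbox(jit_config)            # hypothetical helper
        job_queue.task_done()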

Why It Fails:

  • GitHub expects the runner to connect within 2-5 minutes of JIT generation
  • Delaying JIT fetch creates race condition where GitHub cancels job
  • GitHub's job assignment model expects immediate runner registration
  • Doesn't solve fundamental issue: sequential execution still exceeds token lifetime

Reference: actions/runner auth documentation

2. Retry/Refresh JIT Config

Attempt: Detect expired token and re-fetch JIT config.

Why It Fails:

  • JIT config is single-use per job
  • Cannot re-fetch for same runner ID after expiration
  • Job is tied to original runner registration
  • Webhook is one-way notification; GitHub doesn't resend or support replay
  • generate-jitconfig creates a NEW runner registration; it doesn't refresh an existing one

3. Increase max-parallel

Attempt: Run jobs in parallel to reduce total runtime below 60 minutes.

Why Not Always Possible:

  • Some workflows have inherent sequential dependencies
  • External API rate limits may require throttling
  • Resource constraints (e.g., API quotas, database locks)
  • Business logic may require ordered execution

4. Persistent Runner Token

Attempt: Register the runner traditionally with config.sh --token (persistent) instead of starting it with run.sh --jitconfig.

Trade-offs:

  • ✅ Solves the expiration problem
  • ❌ Security risk (long-lived token vs ephemeral JIT)
  • ❌ Requires manual token management and rotation
  • ❌ Defeats the purpose of JIT security model

Research & References

GitHub Documentation

  1. GitHub Actions Limits: Usage limits for self-hosted runners

    • Job queue time: 24 hours
    • JIT token lifetime: ~60 minutes (implied, not explicitly documented in main docs)
  2. Automatic Token Authentication: GITHUB_TOKEN documentation

    • "The installation access token expires after 60 minutes"
  3. Self-Hosted Runners: About self-hosted runners

    • Documentation on JIT runner configuration

GitHub Community Discussions

  1. Discussion #25699: GitHub token lifetime

  2. Discussion #50472: Long-running workflow GITHUB_TOKEN timeout

    • "Unable to extend GITHUB_TOKEN expiration time due to: GITHUB_TOKEN has expired"
    • Note: GITHUB_TOKEN (24h max) is different from the JIT token (60 min), but the discussion is relevant for token expiration patterns
  3. Discussion #60513: How to configure idle_timeout with JIT

    • Discusses JIT runner lifecycle and limitations

GitHub Issues

  1. actions/runner #1799: How long is the runner registration token valid for?

    • Answer: "It's valid for one hour"
    • Official confirmation from GitHub maintainer
  2. actions/runner #2920: Unable to use ./config remove --token on a just-in-time runner

    • Discusses JIT runner lifecycle issues and missing gitHubUrl in config
    • Closed as completed (bug fix released)
  3. actions-runner-controller #4183: Runners not terminating after token expiry

    • Real-world production issue: "Runners not terminating after job completion – blocked queue due to token expiry"
    • Shows token expiration affects even Kubernetes-based runners
  4. actions-runner-controller #2466: Jobs expire while on queue

    • "Capacity reservations expire before the jobs are even queued"
    • Similar underlying problem with token/job timing
  5. actions/runner #845: Support for autoscaling self-hosted runners

    • Feature request for better autoscaling support
    • Related to managing runner lifecycle

External Resources

  1. AWS CodeBuild Issue: Failure to get JIT token

    • Real-world example of JIT token issues in production
  2. Orchestra Guide: JIT Runner Configuration

    • Best practices for JIT runner setup (still doesn't solve 60-min limit)

Constraints & Considerations

Why This Is Hard to Solve

  1. Security Model: JIT tokens are designed to be short-lived for security
  2. GitHub Architecture: Jobs are assigned to runners at webhook time, not execution time
  3. Serverless Limitations: Serverless functions can't maintain long-lived connections
  4. No Token Refresh API: GitHub doesn't provide an API to refresh/extend JIT tokens

Common Misconceptions

"We can just fetch JIT when the job runs"
✅ GitHub expects runner registration within minutes of job assignment

"We can retry failed jobs with fresh JIT"
✅ JIT is tied to specific runner registration; can't re-fetch for same job

"Queue the jobs and process later"
✅ GitHub's job timeout (24h) ≠ JIT token lifetime (60min)

Proposed Solutions

Option 1: Batch Processing (Recommended Workaround)

Split long-running workflows into multiple workflow runs that each complete within 60 minutes:

# Instead of one workflow with 37 jobs,
# Create multiple workflows or use dynamic matrix:

# Workflow Run 1: Jobs 1-9 (54 min)
# Workflow Run 2: Jobs 10-18 (54 min) 
# Workflow Run 3: Jobs 19-27 (54 min)
# Workflow Run 4: Jobs 28-N (remaining)

Implementation:

strategy:
  fail-fast: false
  max-parallel: 1
  matrix:
    # Use only a subset per workflow run; the env context is not
    # available under strategy, so pass the batch as a workflow_dispatch input
    job_id: ${{ fromJson(inputs.job_batch) }}
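One way to orchestrate the batches is to trigger the same workflow once per slice via the REST API's workflow-dispatch endpoint (the batch size and input name here are illustrative):

import json
import requests

def dispatch_batches(owner, repo, workflow_file, token,
                     job_ids, batch_size=9, ref="main"):
    """Trigger one workflow_dispatch run per batch of job IDs."""
    url = (f"https://api.github.com/repos/{owner}/{repo}"
           f"/actions/workflows/{workflow_file}/dispatches")
    headers = {"Authorization": f"Bearer {token}",
               "Accept": "application/vnd.github+json"}
    for i in range(0, len(job_ids), batch_size):
        batch = job_ids[i:i + batch_size]
        resp = requests.post(url, headers=headers, json={
            "ref": ref,
            # workflow_dispatch inputs must be strings
            "inputs": {"job_batch": json.dumps(batch)},
        })
        resp.raise_for_status()

dispatch_batches("OWNER", "REPO", "process-batch.yml", "<PAT>",
                 job_ids=list(range(1, 38)))

Note that dispatching all batches at once lets the runs overlap; if strict ordering matters, trigger the next batch from the end of the previous run or serialize the runs with a concurrency group.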

Pros:

  • Works within existing JIT limitations
  • No changes to runner infrastructure
  • Each batch completes within token window

Cons:

  • Requires orchestration to trigger multiple runs
  • More complex workflow management
  • Job history split across multiple runs

Option 2: Persistent Runner Token (Security Trade-off)

Use traditional runner registration instead of JIT:

# Register runner once (manual or automated)
./config.sh --url https://github.com/OWNER/REPO --token $REGISTRATION_TOKEN

# Run the registered runner (run.sh takes no token flag; it uses
# the credentials saved during configuration)
./run.sh
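The registration token itself can be fetched programmatically (it expires after about an hour, but the runner stays registered once configured). A minimal sketch:

import requests

def get_registration_token(owner: str, repo: str, pat: str) -> str:
    """Create a short-lived registration token for a repo-level runner."""
    resp = requests.post(
        f"https://api.github.com/repos/{owner}/{repo}"
        "/actions/runners/registration-token",
        headers={"Authorization": f"Bearer {pat}",
                 "Accept": "application/vnd.github+json"},
    )
    resp.raise_for_status()
    # The response also includes "expires_at" (~1 hour out); the token
    # is only needed for ./config.sh, not for running jobs afterwards.
    return resp.json()["token"]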

Pros:

  • Token doesn't expire during job execution
  • Simple implementation

Cons:

  • Security risk (long-lived token)
  • Requires token rotation policy
  • Loses benefits of ephemeral runners

Option 3: Hybrid Approach - Batch with Persistent Runner

Use a persistent runner for long sequential workflows and JIT for short ones:

# Hypothetical routing logic in the webhook handler
if total_estimated_runtime > 3600:  # more than 1 hour (in seconds)
    use_persistent_runner()  # pre-registered, long-lived credentials
else:
    use_jit_runner()         # ephemeral JIT-configured runner

Pros:

  • Best of both worlds
  • Secure for short jobs, functional for long jobs

Cons:

  • More complex runner management
  • Still requires persistent token for some cases

Option 4: Workflow-Level Retry with Fresh Webhooks

Instead of job-level retry, trigger new workflow runs:

on:
  workflow_dispatch:
  schedule:
    - cron: '0 */2 * * *'  # Every 2 hours

jobs:
  check-and-run:
    runs-on: ubuntu-latest
    steps:
      - name: Check which items need processing
        id: check
        run: |
          # Logic to determine unprocessed items
          echo "batch=$ITEMS" >> $GITHUB_OUTPUT
      
      - name: Trigger batch workflow
        if: steps.check.outputs.batch != '[]'
        uses: benc-uk/workflow-dispatch@v1
        with:
          workflow: process-batch.yml
          inputs: '{"items": "${{ steps.check.outputs.batch }}"}'

Pros:

  • Fresh webhooks = fresh JIT tokens
  • Each batch within 60-minute window

Cons:

  • Complex orchestration
  • Potential for duplicate processing
  • Harder to track overall progress

Option 5: GitHub-Supported Solution (Requested)

Request GitHub to support one of:

  1. JIT Token Refresh API:

    POST /repos/{owner}/{repo}/actions/runners/{runner_id}/refresh-token
    
  2. Extended JIT Lifetime:

    • Allow configuration of JIT token lifetime (e.g., 4 hours for long workflows)
    • Or auto-extend for active runners
  3. Job-Level JIT:

    • Generate JIT token per job instead of per runner
    • Token valid for job duration only

Questions for GitHub

  1. Is there an official way to refresh or extend JIT token lifetime for long-running workflows?

  2. Can GitHub Support increase the JIT token lifetime for specific repositories/use cases?

  3. Is there a documented pattern for handling workflows that exceed 60 minutes with self-hosted runners?

  4. Should the generate-jitconfig API support token refresh or longer lifetimes for sequential job processing?

  5. Could GitHub provide a "job-level" JIT token that's valid for the duration of a specific job rather than runner registration?

Related Issues & Discussions

Additional Context

Serverless Runner Architecture

Typical serverless GitHub Actions runner flow:

GitHub Workflow Trigger
        ↓
GitHub sends workflow_job webhook (action: queued)
        ↓
Serverless function receives webhook
        ↓
Function calls GitHub API: POST /actions/runners/generate-jitconfig
        ↓
GitHub returns JIT config (valid for ~60 minutes)
        ↓
Function spawns container/sandbox with JIT config
        ↓
Container runs: ./run.sh --jitconfig $JIT_CONFIG
        ↓
Runner connects to GitHub and picks up job
        ↓
Job executes
        ↓
Job completes, runner exits

The problem occurs when:

  • Step 4 (JIT generation) happens at T=0 for all jobs
  • Step 9 (job execution) for job N happens at T > 60 minutes
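For reference, a minimal sketch of step 4, the repository-level generate-jitconfig call (the runner group and labels are illustrative):

import requests

def generate_jit_config(owner: str, repo: str, pat: str,
                        name: str, labels: list) -> str:
    """Create a JIT runner registration; the ~60-minute clock starts
    as soon as this call returns."""
    resp = requests.post(
        f"https://api.github.com/repos/{owner}/{repo}"
        "/actions/runners/generate-jitconfig",
        headers={"Authorization": f"Bearer {pat}",
                 "Accept": "application/vnd.github+json"},
        json={"name": name,
              "runner_group_id": 1,  # default runner group
              "labels": labels},
    )
    resp.raise_for_status()
    return resp.json()["encoded_jit_config"]  # pass to ./run.sh --jitconfig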

Workaround Checklist

If you're experiencing this issue, check:

  • Can you split jobs into multiple workflow runs (< 60 min each)?
  • Can you increase max-parallel to reduce total runtime?
  • Can you use persistent runner tokens instead of JIT?
  • Can you optimize job duration so the total sequential runtime stays under 60 minutes?
  • Can you reduce number of jobs in matrix?

Labels

Suggested labels for this issue:

  • enhancement
  • self-hosted-runners
  • jit-tokens
  • long-running-workflows
  • sequential-jobs
  • documentation

Summary

This issue documents a fundamental architectural limitation: JIT tokens are designed for short-lived ephemeral runners (~60 minutes), but GitHub Actions workflows can legitimately require longer sequential execution.

The core conflict:

  • JIT Security Model: Short-lived tokens (60 min) for ephemeral runners
  • Sequential Workflows: May require >60 min total runtime
  • Serverless Architecture: Can't maintain persistent connections

Viable workarounds:

  1. Batch processing (multiple workflow runs)
  2. Persistent runner tokens (security trade-off)
  3. Reduce total runtime (optimize jobs or increase parallelism)

Long-term solution: Requires GitHub to either:

  • Extend JIT token lifetime for long workflows
  • Provide token refresh mechanism
  • Support job-level (not runner-level) JIT tokens

This issue was compiled from multiple real-world production scenarios and extensive research. It aims to document the limitation clearly and provide actionable workarounds while advocating for a supported long-term solution.
