
JIT Token Expiration with Long-Running Sequential Workflows #4248

@manascb1344


Problem Summary

When running GitHub Actions workflows with max-parallel: 1 and long-running sequential jobs (total runtime > 60 minutes), JIT (Just-In-Time) runner tokens expire after ~60 minutes, causing jobs to fail with "The operation was canceled" error.

This is a fundamental limitation when:

  • Total workflow runtime exceeds JIT token lifetime (~60 minutes)
  • Jobs must run sequentially (max-parallel: 1)
  • Using ephemeral JIT-configured self-hosted runners

Environment

  • Runner Platform: Serverless (Modal/AWS Lambda/Azure Functions/etc.)
  • Runner Type: Self-hosted with JIT configuration
  • Configuration:
    • Jobs: N matrix jobs (where N × job_duration > 60 minutes)
    • max-parallel: 1 (sequential execution)
    • Example: 37 jobs × 6 minutes = 222 minutes total runtime

Steps to Reproduce

  1. Create workflow with matrix strategy and max-parallel: 1:
strategy:
  fail-fast: false
  max-parallel: 1
  matrix:
    job_id: [1, 2, 3, ..., N]  # N jobs where N × minutes_per_job > 60
  2. Use self-hosted runner with JIT configuration:
# Serverless runner fetches JIT config on webhook receipt
jit_config = await fetch_jit_config(repo_url, job_id, labels)
sandbox = modal.Sandbox.create(env={"GHA_JIT_CONFIG": jit_config})
  3. Trigger workflow with enough jobs that total runtime exceeds 60 minutes

  4. Observe that:

    • Jobs 1-10 complete successfully (~60 minutes)
    • Jobs 11+ fail with "The operation was canceled"

Expected Behavior

All jobs should complete successfully, with each job getting a fresh JIT token when it starts (not when the webhook is received).

Actual Behavior

Job Range | Status     | Time Elapsed | JIT Token State
1-10      | ✅ Success | 0-60 min     | Valid
11+       | ❌ Failed  | 60+ min      | Expired

Error observed:

The operation was canceled.

Failed job timing pattern:

  • Jobs complete successfully until ~60-minute mark
  • Jobs starting after 60 minutes fail immediately or within 2-3 minutes
  • Failure occurs exactly at JIT token expiration time

Root Cause Analysis

JIT Token Lifecycle

From GitHub documentation and runner source code:

  1. JIT Generation: When generate-jitconfig API is called, GitHub creates a runner registration with a time-limited token
  2. Token Validity: ~60 minutes (confirmed via GitHub Community Discussion #25699)
  3. Expiration: After 60 minutes, GitHub invalidates the runner registration
  4. Job Cancellation: Any job using that runner gets "The operation was canceled"

The Math Problem

N jobs × M minutes each = Total runtime
JIT token lifetime = 60 minutes

If Total runtime > 60 minutes:
  Jobs 1 to floor(60/M): Complete successfully ✅
  Jobs floor(60/M)+1 to N: Fail with expired token ❌

Example with 6-minute jobs:
  37 jobs × 6 minutes = 222 minutes total
  Jobs 1-10: Complete within 60 min window ✅
  Jobs 11-37: Start after token expiration ❌
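The cutoff index can be computed directly. A minimal sketch (the function name and defaults are illustrative):

import math
from typing import Optional

def first_failing_job(n_jobs: int, minutes_per_job: float,
                      token_lifetime_min: float = 60.0) -> Optional[int]:
    """Return the 1-based index of the first job that starts after the
    JIT token expires, or None if the whole matrix fits in the window."""
    jobs_that_fit = math.floor(token_lifetime_min / minutes_per_job)
    return jobs_that_fit + 1 if jobs_that_fit < n_jobs else None

print(first_failing_job(37, 6))  # -> 11 (jobs 11-37 fail)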

Why Current Architecture Fails

The serverless runner typically:

  1. Receives N webhooks simultaneously when the workflow triggers
  2. Fetches N JIT configs immediately (all tokens created at T=0)
  3. Spawns N sandboxes/containers (each with pre-fetched JIT)
  4. Jobs run sequentially, but JIT tokens expire at T=60 regardless

Key Issue: JIT tokens are generated at webhook receipt time, not at job execution time.
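In code, the failing pattern looks roughly like this (a sketch; the handler and helper names are illustrative, not from any particular project):

# Anti-pattern: every JIT config is generated when the webhook arrives
# (T=0), even though jobs 2..N won't execute for a long time.
async def on_workflow_job_queued(payload: dict) -> None:
    job = payload["workflow_job"]
    # The token clock starts NOW, not when the job is actually picked up
    jit_config = await fetch_jit_config(           # hypothetical helper
        repo_url=payload["repository"]["html_url"],
        job_id=job["id"],
        labels=job["labels"],
    )
    await spawn_sandbox(jit_config)                # hypothetical helper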

Attempted Solutions

1. Queue-Based Worker with Deferred JIT Fetch

Approach: Move JIT fetching from webhook handler to worker function that processes jobs sequentially.

# Webhook: Queue metadata only
# Worker: Fetch JIT when job actually runs, then spawn
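A sketch of the deferred-fetch variant (the queue and helper names are illustrative):

import asyncio

job_queue: asyncio.Queue = asyncio.Queue()

async def on_workflow_job_queued(payload: dict) -> None:
    # Webhook handler: enqueue metadata only; no JIT fetch yet
    await job_queue.put(payload["workflow_job"])

async def worker() -> None:
    while True:
        job = await job_queue.get()
        # JIT is fetched only when the job is about to run...
        jit_config = await fetch_jit_config(job)   # hypothetical helper
        # ...but by now GitHub may already have canceled the job,
        # because it expected a runner shortly after queueing.
        await spawn_sandbox(jit_config)            # hypothetical helper
        job_queue.task_done()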

Why It Fails:

  • GitHub expects the runner to connect within 2-5 minutes of JIT generation
  • Delaying JIT fetch creates race condition where GitHub cancels job
  • GitHub's job assignment model expects immediate runner registration
  • Doesn't solve fundamental issue: sequential execution still exceeds token lifetime

Reference: actions/runner auth documentation

2. Retry/Refresh JIT Config

Attempt: Detect expired token and re-fetch JIT config.

Why It Fails:

  • JIT config is single-use per job
  • Cannot re-fetch for same runner ID after expiration
  • Job is tied to original runner registration
  • Webhook is one-way notification; GitHub doesn't resend or support replay
  • generate-jitconfig creates a NEW runner registration; it doesn't refresh an existing one

3. Increase max-parallel

Attempt: Run jobs in parallel to reduce total runtime below 60 minutes.

Why Not Always Possible:

  • Some workflows have inherent sequential dependencies
  • External API rate limits may require throttling
  • Resource constraints (e.g., API quotas, database locks)
  • Business logic may require ordered execution

4. Persistent Runner Token

Attempt: Register the runner traditionally with config.sh --token (persistent) instead of starting it with run.sh --jitconfig.

Trade-offs:

  • ✅ Solves the expiration problem
  • ❌ Security risk (long-lived token vs ephemeral JIT)
  • ❌ Requires manual token management and rotation
  • ❌ Defeats the purpose of JIT security model

Research & References

GitHub Documentation

  1. GitHub Actions Limits: Usage limits for self-hosted runners

    • Job queue time: 24 hours
    • JIT token lifetime: ~60 minutes (implied, not explicitly documented in main docs)
  2. Automatic Token Authentication: GITHUB_TOKEN documentation

    • "The installation access token expires after 60 minutes"
  3. Self-Hosted Runners: About self-hosted runners

    • Documentation on JIT runner configuration

GitHub Community Discussions

  1. Discussion #25699: GitHub token lifetime

  2. Discussion #50472: Long-running workflow GITHUB_TOKEN timeout

    • "Unable to extend GITHUB_TOKEN expiration time due to: GITHUB_TOKEN has expired"
    • Note: GITHUB_TOKEN (24h max) is different from the JIT token (60 min), but the discussion is relevant for token expiration patterns
  3. Discussion #60513: How to configure idle_timeout with JIT

    • Discusses JIT runner lifecycle and limitations

GitHub Issues

  1. actions/runner #1799: How long is the runner registration token valid for?

    • Answer: "It's valid for one hour"
    • Official confirmation from GitHub maintainer
  2. actions/runner #2920: Unable to use ./config remove --token on a just-in-time runner

    • Discusses JIT runner lifecycle issues and missing gitHubUrl in config
    • Closed as completed (bug fix released)
  3. actions-runner-controller #4183: Runners not terminating after token expiry

    • Real-world production issue: "Runners not terminating after job completion – blocked queue due to token expiry"
    • Shows token expiration affects even Kubernetes-based runners
  4. actions-runner-controller #2466: Jobs expire while on queue

    • "Capacity reservations expire before the jobs are even queued"
    • Similar underlying problem with token/job timing
  5. actions/runner #845: Support for autoscaling self-hosted runners

    • Feature request for better autoscaling support
    • Related to managing runner lifecycle

External Resources

  1. AWS CodeBuild Issue: Failure to get JIT token

    • Real-world example of JIT token issues in production
  2. Orchestra Guide: JIT Runner Configuration

    • Best practices for JIT runner setup (still doesn't solve 60-min limit)

Constraints & Considerations

Why This Is Hard to Solve

  1. Security Model: JIT tokens are designed to be short-lived for security
  2. GitHub Architecture: Jobs are assigned to runners at webhook time, not execution time
  3. Serverless Limitations: Serverless functions can't maintain long-lived connections
  4. No Token Refresh API: GitHub doesn't provide an API to refresh/extend JIT tokens

Common Misconceptions

"We can just fetch JIT when the job runs"
✅ GitHub expects runner registration within minutes of job assignment

"We can retry failed jobs with fresh JIT"
✅ JIT is tied to specific runner registration; can't re-fetch for same job

"Queue the jobs and process later"
✅ GitHub's job timeout (24h) ≠ JIT token lifetime (60min)

Proposed Solutions

Option 1: Batch Processing (Recommended Workaround)

Split long-running workflows into multiple workflow runs that each complete within 60 minutes:

# Instead of one workflow with 37 jobs,
# Create multiple workflows or use dynamic matrix:

# Workflow Run 1: Jobs 1-9 (54 min)
# Workflow Run 2: Jobs 10-18 (54 min) 
# Workflow Run 3: Jobs 19-27 (54 min)
# Workflow Run 4: Jobs 28-N (remaining)

Implementation:

strategy:
  fail-fast: false
  max-parallel: 1
  matrix:
    # Use only a subset per workflow run; the env context is not
    # available under strategy, so pass the batch as a workflow_dispatch input
    job_id: ${{ fromJson(inputs.job_batch) }}
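One way to orchestrate the batches is to trigger the same workflow once per slice via the REST API's workflow-dispatch endpoint (the batch size and input name here are illustrative):

import json
import requests

def dispatch_batches(owner, repo, workflow_file, token,
                     job_ids, batch_size=9, ref="main"):
    """Trigger one workflow_dispatch run per batch of job IDs."""
    url = (f"https://api.github.com/repos/{owner}/{repo}"
           f"/actions/workflows/{workflow_file}/dispatches")
    headers = {"Authorization": f"Bearer {token}",
               "Accept": "application/vnd.github+json"}
    for i in range(0, len(job_ids), batch_size):
        batch = job_ids[i:i + batch_size]
        resp = requests.post(url, headers=headers, json={
            "ref": ref,
            # workflow_dispatch inputs must be strings
            "inputs": {"job_batch": json.dumps(batch)},
        })
        resp.raise_for_status()

dispatch_batches("OWNER", "REPO", "process-batch.yml", "<PAT>",
                 job_ids=list(range(1, 38)))

Note that dispatching all batches at once lets the runs overlap; if strict ordering matters, trigger the next batch from the end of the previous run or serialize the runs with a concurrency group.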

Pros:

  • Works within existing JIT limitations
  • No changes to runner infrastructure
  • Each batch completes within token window

Cons:

  • Requires orchestration to trigger multiple runs
  • More complex workflow management
  • Job history split across multiple runs

Option 2: Persistent Runner Token (Security Trade-off)

Use traditional runner registration instead of JIT:

# Register runner once (manual or automated)
./config.sh --url https://github.com/OWNER/REPO --token $REGISTRATION_TOKEN

# Run the registered runner (run.sh takes no token flag; it uses
# the credentials saved during configuration)
./run.sh
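The registration token itself can be fetched programmatically (it expires after about an hour, but the runner stays registered once configured). A minimal sketch:

import requests

def get_registration_token(owner: str, repo: str, pat: str) -> str:
    """Create a short-lived registration token for a repo-level runner."""
    resp = requests.post(
        f"https://api.github.com/repos/{owner}/{repo}"
        "/actions/runners/registration-token",
        headers={"Authorization": f"Bearer {pat}",
                 "Accept": "application/vnd.github+json"},
    )
    resp.raise_for_status()
    # The response also includes "expires_at" (~1 hour out); the token
    # is only needed for ./config.sh, not for running jobs afterwards.
    return resp.json()["token"]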

Pros:

  • Token doesn't expire during job execution
  • Simple implementation

Cons:

  • Security risk (long-lived token)
  • Requires token rotation policy
  • Loses benefits of ephemeral runners

Option 3: Hybrid Approach - Batch with Persistent Runner

Use a persistent runner for long sequential workflows and JIT for short ones:

# Hypothetical routing logic in the webhook handler
if total_estimated_runtime > 3600:  # more than 1 hour (in seconds)
    use_persistent_runner()  # pre-registered, long-lived credentials
else:
    use_jit_runner()         # ephemeral JIT-configured runner

Pros:

  • Best of both worlds
  • Secure for short jobs, functional for long jobs

Cons:

  • More complex runner management
  • Still requires persistent token for some cases

Option 4: Workflow-Level Retry with Fresh Webhooks

Instead of job-level retry, trigger new workflow runs:

on:
  workflow_dispatch:
  schedule:
    - cron: '0 */2 * * *'  # Every 2 hours

jobs:
  check-and-run:
    runs-on: ubuntu-latest
    steps:
      - name: Check which items need processing
        id: check
        run: |
          # Logic to determine unprocessed items
          echo "batch=$ITEMS" >> $GITHUB_OUTPUT
      
      - name: Trigger batch workflow
        if: steps.check.outputs.batch != '[]'
        uses: benc-uk/workflow-dispatch@v1
        with:
          workflow: process-batch.yml
          inputs: '{"items": "${{ steps.check.outputs.batch }}"}'

Pros:

  • Fresh webhooks = fresh JIT tokens
  • Each batch within 60-minute window

Cons:

  • Complex orchestration
  • Potential for duplicate processing
  • Harder to track overall progress

Option 5: GitHub-Supported Solution (Requested)

Request GitHub to support one of:

  1. JIT Token Refresh API:

    POST /repos/{owner}/{repo}/actions/runners/{runner_id}/refresh-token
    
  2. Extended JIT Lifetime:

    • Allow configuration of JIT token lifetime (e.g., 4 hours for long workflows)
    • Or auto-extend for active runners
  3. Job-Level JIT:

    • Generate JIT token per job instead of per runner
    • Token valid for job duration only

Questions for GitHub

  1. Is there an official way to refresh or extend JIT token lifetime for long-running workflows?

  2. Can GitHub Support increase the JIT token lifetime for specific repositories/use cases?

  3. Is there a documented pattern for handling workflows that exceed 60 minutes with self-hosted runners?

  4. Should the generate-jitconfig API support token refresh or longer lifetimes for sequential job processing?

  5. Could GitHub provide a "job-level" JIT token that's valid for the duration of a specific job rather than runner registration?

Related Issues & Discussions

Additional Context

Serverless Runner Architecture

Typical serverless GitHub Actions runner flow:

GitHub Workflow Trigger
        ↓
GitHub sends workflow_job webhook (action: queued)
        ↓
Serverless function receives webhook
        ↓
Function calls GitHub API: POST /actions/runners/generate-jitconfig
        ↓
GitHub returns JIT config (valid for ~60 minutes)
        ↓
Function spawns container/sandbox with JIT config
        ↓
Container runs: ./run.sh --jitconfig $JIT_CONFIG
        ↓
Runner connects to GitHub and picks up job
        ↓
Job executes
        ↓
Job completes, runner exits

The problem occurs when:

  • Step 4 (JIT generation) happens at T=0 for all jobs
  • Step 9 (job execution) for job N happens at T > 60 minutes
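For reference, a minimal sketch of step 4, the repository-level generate-jitconfig call (the runner group and labels are illustrative):

import requests

def generate_jit_config(owner: str, repo: str, pat: str,
                        name: str, labels: list) -> str:
    """Create a JIT runner registration; the ~60-minute clock starts
    as soon as this call returns."""
    resp = requests.post(
        f"https://api.github.com/repos/{owner}/{repo}"
        "/actions/runners/generate-jitconfig",
        headers={"Authorization": f"Bearer {pat}",
                 "Accept": "application/vnd.github+json"},
        json={"name": name,
              "runner_group_id": 1,  # default runner group
              "labels": labels},
    )
    resp.raise_for_status()
    return resp.json()["encoded_jit_config"]  # pass to ./run.sh --jitconfig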

Workaround Checklist

If you're experiencing this issue, check:

  • Can you split jobs into multiple workflow runs (< 60 min each)?
  • Can you increase max-parallel to reduce total runtime?
  • Can you use persistent runner tokens instead of JIT?
  • Can you optimize job duration so the total sequential runtime stays under 60 minutes?
  • Can you reduce number of jobs in matrix?

Labels

Suggested labels for this issue:

  • enhancement
  • self-hosted-runners
  • jit-tokens
  • long-running-workflows
  • sequential-jobs
  • documentation

Summary

This issue documents a fundamental architectural limitation: JIT tokens are designed for short-lived ephemeral runners (~60 minutes), but GitHub Actions workflows can legitimately require longer sequential execution.

The core conflict:

  • JIT Security Model: Short-lived tokens (60 min) for ephemeral runners
  • Sequential Workflows: May require >60 min total runtime
  • Serverless Architecture: Can't maintain persistent connections

Viable workarounds:

  1. Batch processing (multiple workflow runs)
  2. Persistent runner tokens (security trade-off)
  3. Reduce total runtime (optimize jobs or increase parallelism)

Long-term solution: Requires GitHub to either:

  • Extend JIT token lifetime for long workflows
  • Provide token refresh mechanism
  • Support job-level (not runner-level) JIT tokens

This issue was compiled from multiple real-world production scenarios and extensive research. It aims to document the limitation clearly and provide actionable workarounds while advocating for a supported long-term solution.
