Description
Problem
GTT's organic clone formula subtracts detected CI checkouts from raw traffic counts, but the subtraction is based on assumptions about how GitHub counts clones — not empirical measurement. Evidence suggests the formula may be wrong:
NCSI Resolver dashboard (Feb 23–26) shows a clone spike that correlates exactly with heavy GTT development sessions. The spike magnitude exceeds what our CI subtraction removed, suggesting GitHub produces more clones per CI operation than we account for.
Current formulas:
organicClones = rawClones - ciCheckouts
organicUniqueClones = rawUnique - MIN(round(rawUnique × ciRate), ciRuns)
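As a rough illustration, the two current formulas can be written out in Python with a worked example. This is a sketch, not the GTT implementation: `ci_rate` is taken as a given input because this issue does not define how it is derived, the `js_round` helper assumes `round` means JavaScript-style half-up rounding, and the numbers are made up.

```python
import math


def js_round(x: float) -> int:
    """Round half up, like JavaScript's Math.round (Python's round() rounds half to even)."""
    return math.floor(x + 0.5)


def organic_clones(raw_clones: int, ci_checkouts: int) -> int:
    # organicClones = rawClones - ciCheckouts
    return raw_clones - ci_checkouts


def organic_unique_clones(raw_unique: int, ci_rate: float, ci_runs: int) -> int:
    # organicUniqueClones = rawUnique - MIN(round(rawUnique × ciRate), ciRuns)
    ci_unique = min(js_round(raw_unique * ci_rate), ci_runs)
    return raw_unique - ci_unique


# Hypothetical day: 40 raw clones, 12 unique, 10 detected CI checkouts across 4 runs.
print(organic_clones(40, 10))              # 30
print(organic_unique_clones(12, 0.25, 4))  # 12 - min(3, 4) = 9
```

If GitHub records more than one clone per detected checkout, the first result overstates organic clones; if CI counts as one unique cloner per day, the second understates organic uniques.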
Key unknowns:
- Does actions/checkout (which uses git init + git fetch, not git clone) count as 1 clone? More? Zero?
- Are CI unique clones 1 per run (our assumption) or 1 per day (GitHub moderator statement)?
- Do matrix builds (9 jobs × 1 checkout) produce 9 clones or 1?
- Are there hidden pre-step clones from runner infrastructure?
A GitHub Community moderator stated that Actions using GITHUB_TOKEN count as "one unique cloner, no matter how many times it clones the repository." If true, our ciUniqueCeiling = ciRuns is wrong — it should be 1, not N.
We have not found a published controlled study of this. As far as we know, this would be the first.
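The open questions above can be framed as competing predictions. The sketch below is illustrative only: `Scenario` and `predict` are hypothetical names, and it assumes every checkout step counts as exactly one clone, which is itself one of the unknowns.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    runs: int               # workflow runs on one UTC day
    jobs_per_run: int       # e.g. 9 for a 3x3 matrix
    checkouts_per_job: int  # actions/checkout steps in each job


def predict(s: Scenario, unique_per: str) -> tuple[int, int]:
    """Return the (clones, unique) pair predicted for one UTC day.

    unique_per is "run" (the current GTT assumption) or "day"
    (the GitHub moderator's statement).
    """
    clones = s.runs * s.jobs_per_run * s.checkouts_per_job
    if clones == 0:
        return 0, 0
    if unique_per == "run":
        return clones, s.runs
    if unique_per == "day":
        return clones, 1
    raise ValueError(f"unknown hypothesis: {unique_per!r}")


three_runs = Scenario(runs=3, jobs_per_run=1, checkouts_per_job=1)
print(predict(three_runs, "run"))  # (3, 3)
print(predict(three_runs, "day"))  # (3, 1)
```

Under these assumptions a single run cannot separate the two hypotheses, since both predict 1 unique, while several runs on the same day can.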
Proposed Solution
Create a dedicated private testbed repo (djdarcy/gtt-ci-clone-testbed) with controlled experiments that isolate each variable.
Why private?
- Zero external traffic (no search engines, no bots, no curious visitors)
- Every clone is accounted for (only our experiments + observer)
- Clean baseline (new repo, no history)
Experiment Suite
| # | Experiment | Question | Procedure |
|---|---|---|---|
| 1 | Single checkout | Baseline: 1 run, 1 job, 1 checkout = ? clones, ? unique | workflow_dispatch, read API next day |
| 2 | No checkout | Does a bare workflow (no checkout) register clones? | Workflow with only echo hello |
| 3 | Double checkout | 2 checkout steps in 1 job = 1 or 2 clones? | Checkout main + checkout another ref |
| 4 | Matrix 3×3 | 9 jobs × 1 checkout = 9 clones, 1 unique? | strategy.matrix with 3×3 |
| 5 | Multi-run same day | 3 separate runs = 3 clones, 1 unique? | Trigger workflow_dispatch 3 times |
| 6 | fetch-depth | Does depth=0 vs depth=1 matter? | Two runs with different depths |
| 7 | Manual clone | Calibration: git clone from local machine | Verify +1 clone, +1 unique |
| 8 | PAT vs GITHUB_TOKEN | Different identity = different unique count? | One run each |
Protocol: 1 experiment per UTC day for clean isolation. Observer workflow reads Traffic API daily at 4:30 UTC and commits results.
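Because each experiment is tied to one UTC day, the analysis mostly needs to pull that day's bucket out of the Traffic API response. The sketch below assumes the documented response shape for the clones endpoint with per=day (top-level `count` and `uniques`, plus a `clones` array of daily buckets). The helper name `day_entry` and the sample numbers are made up for illustration.

```python
import json
from typing import Optional

# Abridged sample in the shape of GET /repos/{owner}/{repo}/traffic/clones?per=day.
# The values are invented, not measured.
SAMPLE = """
{
  "count": 4,
  "uniques": 2,
  "clones": [
    {"timestamp": "2026-03-01T00:00:00Z", "count": 1, "uniques": 1},
    {"timestamp": "2026-03-02T00:00:00Z", "count": 3, "uniques": 1}
  ]
}
"""


def day_entry(payload: dict, utc_day: str) -> Optional[dict]:
    """Return the count/uniques bucket for utc_day (YYYY-MM-DD), or None.

    None only means the response contained no bucket for that day.
    """
    for bucket in payload.get("clones", []):
        if bucket["timestamp"].startswith(utc_day):
            return {"count": bucket["count"], "uniques": bucket["uniques"]}
    return None


payload = json.loads(SAMPLE)
print(day_entry(payload, "2026-03-02"))  # {'count': 3, 'uniques': 1}
print(day_entry(payload, "2026-03-05"))  # None
```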
Testbed Repo Structure
gtt-ci-clone-testbed/
├── .github/workflows/
│ ├── exp-01-single-checkout.yml
│ ├── exp-02-no-checkout.yml
│ ├── exp-03-double-checkout.yml
│ ├── exp-04-matrix-3x3.yml
│ ├── exp-05-multi-run.yml
│ ├── exp-06-fetch-depth.yml
│ ├── exp-07-manual-calibration.yml (placeholder)
│ ├── exp-08-pat-vs-token.yml
│ └── observe-traffic.yml (daily API reader)
├── results/
│ └── observations.json
├── analyze.py
└── README.md
Observer Design
The observer reads the Traffic API via REST (using a PAT, not actions/checkout) to avoid contaminating clone counts with its own checkout:
- name: Read Traffic API (no checkout needed)
  run: |
    curl -s -H "Authorization: token ${{ secrets.TRAFFIC_PAT }}" \
      "https://api.github.com/repos/${{ github.repository }}/traffic/clones?per=day" \
      > /tmp/clones.json
    # Commit results via API, not git push (avoids clone contamination)
Expected Impact on GTT Formulas
If CI unique = 1 per day (most likely based on research)
The organic unique formula simplifies dramatically:
// Before (current):
ciUniqueCeiling = entry.ciRuns;
ciUniqueClones = MIN(ciUniqueByPct, ciUniqueCeiling);
// After (if confirmed):
ciUniqueClones = entry.ciRuns > 0 ? 1 : 0;
organicUniqueClones = rawUnique - ciUniqueClones;
If total clones > ciCheckouts (hidden multiplier)
// Before:
organicClones = rawClones - ciCheckouts;
// After (if multiplier found):
organicClones = rawClones - Math.round(ciCheckouts * CI_CLONE_MULTIPLIER);
Propagation
Formula corrections affect:
- traffic-badges.yml (workflow, lines 479–508)
- docs/stats/index.html (getOrganicUniqueClones(), lines 954–962)
- backfill_organic_unique.py (retroactive correction)
- Badge math (installs badge uses organic clones)
- All downstream deployments (NCSI, ComfyUI Triton)
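A retroactive correction could look roughly like the sketch below, which recomputes the organic fields for stored daily entries using the candidate formulas. It is not the actual backfill_organic_unique.py. The entry field names other than `ciRuns` are assumptions, the multiplier value is a placeholder pending testbed results, and the history rows are invented.

```python
import math

CI_CLONE_MULTIPLIER = 1.0  # placeholder; a real value would come from the testbed


def corrected(entry: dict, multiplier: float = CI_CLONE_MULTIPLIER) -> dict:
    """Return a copy of entry with organic fields recomputed.

    Applies both candidate corrections: CI counts as at most one unique
    cloner per day, and each detected checkout counts as `multiplier` clones.
    """
    ci_unique = 1 if entry["ciRuns"] > 0 else 0
    # floor(x + 0.5) mirrors JavaScript's Math.round for non-negative values.
    ci_clones = math.floor(entry["ciCheckouts"] * multiplier + 0.5)
    return {
        **entry,
        "organicClones": entry["rawClones"] - ci_clones,
        "organicUniqueClones": entry["rawUnique"] - ci_unique,
    }


# Invented history rows, for illustration only.
history = [
    {"date": "2026-03-01", "rawClones": 40, "rawUnique": 12, "ciCheckouts": 10, "ciRuns": 4},
    {"date": "2026-03-02", "rawClones": 5, "rawUnique": 3, "ciCheckouts": 0, "ciRuns": 0},
]
for row in (corrected(e, multiplier=1.5) for e in history):
    print(row["date"], row["organicClones"], row["organicUniqueClones"])
# 2026-03-01 25 11
# 2026-03-02 5 3
```

Returning a new dict instead of mutating the entry keeps the raw values intact, so the correction can be rerun if the measured multiplier changes.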
Acceptance Criteria
- Private testbed repo created with experiment workflow files
- Observer workflow reads Traffic API without producing clone contamination
- All 8 experiments executed (one per UTC day) with results logged
- Results analyzed: exact mapping from CI activity → Traffic API counts
- Findings documented with confidence levels for each conclusion
- GTT organic clone formula updated if findings differ from current assumptions
- GTT organic unique clone formula updated if findings differ
- Backfill script updated for retroactive correction
- Downstream deployments updated with corrected formulas
Related Issues
- Refs #23 (Port merge logic to Python module and add tests): organic formula is part of merge logic
- Refs #30 (ghtraf upgrade, run schema migrations on gist state.json): retroactive formula correction may be needed
- Refs #32 (Port delta-dedup workflow fix to NCSI Resolver and ComfyUI Triton): corrected formula propagates
Analysis
See 2026-02-28__07-07-40__dev-workflow-process_ci-clone-testbed-reverse-engineering.md for the full DEV WORKFLOW PROCESS analysis including research findings on actions/checkout behavior and the mathematical derivation of formula corrections.
Also see github-actions-clone-traffic-api-behavior.md in ~/claude/questions/ for the research on how GitHub counts clones from Actions.
See 2026-02-28__07-36-32__dev-workflow-process_ci-clone-testbed-implementation-plan.md for the day-by-day experiment protocol, observer workflow design, and formula comparison framework.