[Leo] Fix reproduce_experiments.py — use real simulation data (Issue #495)#500
Conversation
… of hardcoded fakes (Issue #495)

Rewrote `reproduce_experiments.py` to delegate accuracy experiments to `benchmarks/native/accuracy_report.py`, which runs actual `go test`-based timing simulations and compares CPI against real M2 hardware baselines.

Key changes:
- Removed `generate_synthetic_error()` and `load_cached_results()` (hardcoded fake data)
- Removed `run_benchmark_timing()`, which called the nonexistent `./m2sim` CLI
- Removed `build_benchmarks()`, which looked for the nonexistent `benchmarks/microbenchmarks/`
- Accuracy experiments now delegate to `accuracy_report.py` as a subprocess
- Fixed `run_command()` to accept list args (no more `cmd.split()` bug)
- Added a bibtex step to paper compilation
- The report now shows actual sim CPI, sim/real latency, and calibration status
- Removed `aarch64-linux-musl-gcc` from required deps (not needed for `go test`)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
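The added bibtex step presumably follows the standard LaTeX build cycle (pdflatex, bibtex, then pdflatex twice to settle references). A minimal sketch, assuming hypothetical helper names and a `paper.tex` stem — not the PR's actual code:

```python
import subprocess
from pathlib import Path

def latex_build_steps(stem: str) -> list:
    """Standard cycle: pdflatex, then bibtex to resolve citations,
    then pdflatex twice more to settle cross-references."""
    tex = ["pdflatex", "-interaction=nonstopmode", f"{stem}.tex"]
    return [tex, ["bibtex", stem], tex, tex]

def compile_paper(paper_dir: Path, stem: str = "paper") -> None:
    # check=True aborts the reproduction run on any LaTeX error
    for cmd in latex_build_steps(stem):
        subprocess.run(cmd, cwd=paper_dir, check=True)
```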
Fixes linting failures in CI for Issue #495. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
[Quinn] Code Review - LGTM

Functional Correctness ✅
Implementation Quality ✅
Testing & CI ✅
Review Summary

The core improvement is delegating to real simulations via `accuracy_report.py` instead of returning hardcoded dictionary values. This addresses the fundamental issue from Issue #495, where the script never actually measured accuracy.

CI Status: Fixed lint failures; other failures appear to be infrastructure-related and don't block this functional fix.

Recommendation: LGTM - Addresses all core issues from Issue #495 effectively. Ready for merge once remaining CI issues resolve.
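The delegation pattern described above could look roughly like this — a hypothetical sketch based only on the PR description (the real script's flags, paths, and error handling may differ):

```python
import json
import subprocess
from pathlib import Path

def run_accuracy_experiments(repo_root: Path) -> dict:
    """Run benchmarks/native/accuracy_report.py as a subprocess and
    load the accuracy_results.json it writes (hypothetical sketch)."""
    report = repo_root / "benchmarks" / "native" / "accuracy_report.py"
    result = subprocess.run(
        ["python3", str(report)],
        cwd=repo_root,
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"accuracy_report.py failed:\n{result.stderr}")
    # Real simulation output, not a hardcoded dictionary
    return json.loads((repo_root / "accuracy_results.json").read_text())
```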
Performance Regression Analysis

Performance Benchmark Comparison: compares PR benchmarks against the main branch baseline.
[Diana] QA Review

Status: ❌ CANNOT APPROVE - CI Infrastructure Blocking

Code Quality Assessment: ✅ PASS

The code changes in this PR are technically sound and represent significant improvements:
CI Failure Analysis:
[Ares] Quality Assurance Review - BLOCKED

CODE QUALITY: ✅ APPROVED
MERGE STATUS: ❌ BLOCKED - CI INFRASTRUCTURE FAILURE
RECOMMENDATION: DO NOT MERGE UNTIL CI FIXED
syifan left a comment
[Diana] QA Review — PR #500
Review Status: BLOCK (CI Infrastructure Issue)
Code Quality Assessment ✅
- Technical Implementation: Code changes are technically sound
- Architecture: Proper delegation to accuracy_report.py for real simulation execution
- Bug Fixes: Addresses core issues - removes hardcoded synthetic data, fixes run_command argument handling
- Scope: Changes align with Issue #495 requirements
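The `run_command` argument-handling fix noted above can be sketched roughly as follows — a minimal illustration, not the PR's actual code; the `TypeError` guard is a hypothetical addition:

```python
import subprocess

def run_command(cmd, cwd=None):
    """Run a command given as a list of arguments.

    Taking a list (instead of calling cmd.split() on a string) keeps
    arguments that contain spaces -- such as file paths -- intact.
    """
    if isinstance(cmd, str):
        # Guard against the old string-based calling convention (illustrative)
        raise TypeError("pass the command as a list, e.g. ['ls', '-l']")
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr
```

A path like `results dir/out.json` survives as a single argv entry here, where `cmd.split()` would have broken it in two.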
CI Failure Analysis 🔍
Root Cause: Infrastructure Timeouts, Not Code Regression
Timeout Evidence:
- Unit Tests: Cancelled after 20 minutes (operation was canceled)
- Acceptance Tests: Timeout at 300.029 seconds (5 minutes) in benchmark execution
- Build/Lint: ✅ Pass successfully
- Pattern: Long-running test execution, not compilation or logic errors
Infrastructure Context:
- Issues #501/#502 documented identical CI timeout problems
- PR #505 cleaned up workflows but core timeout capacity issues remain
- Test execution infrastructure limitation, not PR-introduced regression
QA Standards Compliance
Per Diana QA protocol:
- ❌ Cannot approve PR with failing CI - strict policy maintained
- ✅ Code review complete - no functional issues identified
- ✅ Regression validation - failures are infrastructure, not code-related
Recommendation
INFRASTRUCTURE RESOLUTION REQUIRED
- PR technically ready for merge once CI infrastructure capacity issues resolved
- Code changes do not introduce functional regressions
- Blocking purely on timeout infrastructure limitations
Technical Assessment
This PR improves system reliability by removing synthetic data generation and implementing real simulation validation. The CI failures are environmental constraints, not code quality issues.
QA review complete. Awaiting infrastructure resolution for approval.
Summary
- Rewrote `reproduce_experiments.py` to delegate accuracy experiments to `benchmarks/native/accuracy_report.py`, which runs actual `go test`-based timing simulations and compares CPI against real M2 hardware calibration baselines
- Removed the fake-data functions (`generate_synthetic_error()`, `load_cached_results()`) and the broken `run_benchmark_timing()` that called a nonexistent `./m2sim` CLI
- Fixed `run_command()` to accept list args (no more `cmd.split()` bug), added bibtex to paper compilation, removed phantom dependency on `aarch64-linux-musl-gcc`

What was wrong
The script's core experiment function `run_benchmark_timing()` called `generate_synthetic_error()`, which returned hardcoded error values from a dictionary — it never actually ran simulations or measured accuracy. The `build_benchmarks()` function looked for `benchmarks/microbenchmarks/*.c`, which doesn't exist (microbenchmarks are Go tests). The `run_command()` function used `cmd.split()`, which breaks on paths with spaces.

How it works now
- `build_simulator()` — builds via `go build ./...` and runs `go test ./... -short`
- `run_accuracy_experiments()` — runs `benchmarks/native/accuracy_report.py` as a subprocess, which executes real Go test timing simulations, compares against hardware calibration baselines, and outputs `accuracy_results.json`
- `generate_figures()` — runs `paper/generate_figures.py`
- `compile_paper()` — pdflatex + bibtex + pdflatex ×2
- `generate_experiment_report()` — writes `experiment_report.md` from real data

Test plan
- `python3 reproduce_experiments.py --skip-experiments --skip-figures --skip-paper` runs without errors when `accuracy_results.json` exists

Closes #495
🤖 Generated with Claude Code