Skip to content

Enable ALU→ALU same-cycle forwarding for all 8 co-issue slots#108

Open
syifan wants to merge 40 commits intomainfrom
leo/fix-fp-coissue
Open

Enable ALU→ALU same-cycle forwarding for all 8 co-issue slots#108
syifan wants to merge 40 commits intomainfrom
leo/fix-fp-coissue

Conversation

@syifan
Copy link
Collaborator

@syifan syifan commented Feb 19, 2026

Summary

  • Enable ALU→ALU same-cycle forwarding for all 8-wide co-issue slots (was only enabled for slot 8)
  • Slots 2–7 previously called canIssueWith() which passed nil for the forwarded array, blocking any ALU→ALU forwarding even when the producer is an ALU op
  • Now all slots use canIssueWithFwd() and properly track forwarding state with the 1-hop depth limit
  • Slot 8 also fixed to track its forwarding state (was discarding with _ = fwd)

Root Cause

In tickOctupleIssue(), slots 2–7 used canIssueWith() — a wrapper that called canIssueWithFwd() with nil forwarded array. The ALU→ALU forwarding check (forwarded != nil && producerIsALU) always failed for these slots, preventing wide issue of FP/ALU chains. This caused excessive stalling in benchmarks like jacobi-1d (131% CPI error) and bicg (70% CPI error).

Changes

  • timing/pipeline/pipeline_tick_eight.go: Changed slots 2–7 from canIssueWith() to canIssueWithFwd() with forwarding tracking. Fixed slot 8 to track its forwarding state.

Test plan

  • go build ./... passes
  • TestAccuracyCPI_WithDCache passes (microbenchmarks)
  • TestMemStridedLongRun passes (CPI=1.789, no regression)
  • CI accuracy workflows to verify polybench CPI improvement (jacobi-1d, bicg targets)

🤖 Generated with Claude Code

Previously, same-cycle ALU→ALU forwarding was only enabled for slot 8
(using canIssueWithFwd), while slots 2-7 used canIssueWith which passed
nil for the forwarded array, blocking any RAW dependency even when the
producer was an ALU op. This caused excessive structural hazard stalls
for FP-heavy benchmarks like jacobi-1d and bicg where consecutive ALU
ops have true dependencies that hardware resolves via forwarding.

Fix: Switch all slots (2-8) to use canIssueWithFwd with the forwarded
array, and properly track forwarding state per-slot to enforce the 1-hop
depth limit (preventing unrealistic deep chaining like A→B→C in one
cycle).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

…ions)

Gate same-cycle ALU→ALU forwarding on both producer and consumer having
IsFloat=true. This preserves FP improvements (jacobi-1d, bicg) while
reverting integer benchmark regressions (dependency, memorystrided).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

Yifan Sun and others added 2 commits February 19, 2026 12:55
Microbenchmarks from CI run 22190131410 (FP-only forwarding branch).
PolyBench atax/bicg/jacobi-1d from CI run 22190131432, mvt from CI run 22187796851.
Overall average error: 27.94%. memorystrided 16.81% (PASS ≤30%).
jacobi-1d 131.13% (FAIL <70%). bicg 71.24% (FAIL <50%).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The FP-only gate (IsFloat) didn't help jacobi-1d/bicg because they use
integer arithmetic (ADD reg, MADD/SMULL, shifts), not FP SIMD.

New gate: block ALU→ALU forwarding when either side is FormatDPImm
(ADD/SUB with immediate). Serial chains of these simple ops run at
1/cycle on M2 and must not co-issue. Register-form and multi-source
ops (MADD, ADD reg, UBFM/shifts) have independent operands that
benefit from same-cycle forwarding.

This allows forwarding for:
- jacobi-1d (ADD reg → SMULL → LSR → SUB reg chains)
- bicg (MADD accumulation chains)

While blocking forwarding for:
- dependency_chain (ADD X0,X0,#1 serial chain)
- arithmetic benchmarks (ADD Xn,Xn,#imm)
- memorystrided (ADD imm → STR chains)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

…6e80856)

Microbenchmarks updated for format-based forwarding gate. Two regressions:
reductiontree 14.56%→39.94%, strideindirect 13.64%→45.05%. PolyBench CI
run 22194200533 still pending — PolyBench values unchanged from prior runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

Yifan Sun and others added 2 commits February 19, 2026 13:37
…rideindirect regression)

The format-based gate in 6e80856 was too permissive: it allowed ALU→ALU
forwarding for all non-DPImm ops including ADD reg (FormatDPReg), which
caused regressions in reductiontree (39.94%) and strideindirect (45.05%).

Narrow the gate to only allow forwarding when the producer is
FormatDataProc3Src (MADD, MSUB, SMULL, UMADDL). These multiply-accumulate
chains are what jacobi-1d and bicg need for improved accuracy.

Local results confirm reductiontree (1.516) and strideindirect (1.060)
revert to pre-regression values while dependency_chain (1.020) and
memory_strided (2.267) remain unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…11aa8ce)

- Microbenchmark regressions from 6e80856 (format-based gate) are FIXED
- reductiontree: 0.343→0.419 CPI (error 39.94%→14.56%)
- strideindirect: 0.364→0.600 CPI (error 45.05%→13.64%)
- Overall average error: 31.72%→27.94%
- Micro average error: 22.03%→16.86%
- PolyBench CI run 22194997040 still pending (no runner)
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

2 similar comments
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

Yifan Sun and others added 2 commits February 19, 2026 15:08
…11aa8ce)

PolyBench Group 1 results: jacobi-1d CPI 0.349→0.302 (error 131.13%→100.00%),
bicg CPI 0.393 (71.24% unchanged), atax CPI 0.183 (19.40% unchanged).
Groups 2/3 still running — NOT pushing to avoid cancellation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mers

Expand the ALU→ALU forwarding gate beyond DataProc3Src-only producers.
Now allows forwarding when:
  - Producer is FormatDataProc3Src (MADD/SMULL) → existing
  - Producer is FormatBitfield (LSR/LSL/ASR) → new
  - Consumer is FormatDataProc3Src (MADD/SMULL) → new

This helps jacobi-1d significantly: the inner loop uses a
SMULL→LSR→SUB chain for divide-by-3. Previously only SMULL→LSR
forwarded; now LSR→SUB also forwards (Bitfield producer).
Additionally, any→MADD/SMULL forwarding helps feed multiply-
accumulate chains from address computation instructions.

Local TestAccuracyCPI_WithDCache: all 25 microbenchmarks unchanged
from baseline (no regressions). Polybench jacobi-1d CPI improved
from 0.302 to 0.254 (was 0.349 at baseline). bicg unchanged at
0.393 (bottleneck is load-use deps, not ALU forwarding).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

Yifan Sun and others added 3 commits February 19, 2026 16:06
…e9a0185)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Allow ALU→ALU same-cycle forwarding when the consumer is a flag-only
DPImm instruction (CMP/CMN with Rd==31/XZR). These instructions don't
produce a register result, so they can't create integer forwarding
chains that regressed branch_hot_loop in previous attempts.

Target pattern in bicg inner loop: ADD x1, x1, #8 → CMP x1, #0x140.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Fix indentation at superscalar.go:1167 (extra tab on
   producerNotForwarded line) that caused CI gofmt failure.

2. Add Rt2 (Ra) to RAW hazard detection in canIssueWithFwd for
   FormatDataProc3Src consumers (MADD/MSUB). The accumulator
   register Ra is read via Inst.Rt2 but was not checked for
   dependencies, preventing MADD from co-issuing when its Ra
   operand could be forwarded from an earlier ALU result.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

1 similar comment
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

…0fb7a22)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

1 similar comment
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

Yifan Sun and others added 3 commits February 19, 2026 18:15
Suppress the 1-cycle load-use stall when an integer load (LDR/LDRH/LDRB)
feeds a DataProc3Src consumer (MADD/MSUB/SMULL). The consumer enters IDEX
immediately and waits during the cache stall; when the cache hit completes,
MEM→EX forwarding provides the load data directly from nextMEMWB.

Narrowly scoped to DataProc3Src consumers only to avoid regressions in
memory_strided and other benchmarks. Key implementation:
- isLoadFwdEligible: eligibility check (int load → DataProc3Src, excludes
  Ra/Rt2 reads and flag-only consumers)
- loadFwdActive flag: suppresses load-use stall for eligible pairs
- loadFwdPendingInIDEX: guards MEM→EX forwarding to only fire when the
  consumer was specifically placed via loadFwdActive
- OoO bypass: other IFID slots still held if dependent on the load

Verified: memory_strided CPI=2.267 (unchanged), reduction_tree=1.516
(unchanged), stride_indirect=1.060 (unchanged). 412/412 pipeline specs pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…28f7ec1)

Load-use forwarding from cache stage has no effect on CPI — PolyBench CI
tests run without dcache. All values unchanged from 0fb7a22. Updated CI
run IDs to latest runs (microbench: 22204159766, polybench: 22204159767).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

…EX forwarding

When dcache is disabled, memory provides data immediately (direct array
lookup). The existing isLoadFwdEligible only suppressed load-use stalls
for LDR→DataProc3Src (MADD/MSUB) pairs. This adds isNonCacheLoadFwdEligible
which suppresses stalls for ALL integer load → consumer pairs in the
non-dcache path, since MEM→EX forwarding always has data available.
Only Rt2 (Ra) dependencies in DataProc3Src consumers are excluded (no
forwarding path for that operand).

This should significantly reduce bicg CPI by eliminating load-use stall
bubbles that the real M2 hardware hides via OoO execution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Yifan Sun and others added 2 commits February 19, 2026 23:42
…warding

When D-cache is disabled, non-dcache loads complete MEM immediately.
Load EX(2) + MEM(1) = 3 cycles aligns with MADD EX(3), and MEM runs
before EX in tick processing order, so the load result is available
via nextMEMWB when the consumer's EX completes in the same tick.

Changes:
- superscalar.go: canIssueWithFwd now permits load→consumer co-issue
  when hasDCache=false (blocks Rt2 dependency for MADD/MSUB accumulator)
- pipeline.go: add loadCoIssuePending[8] per-slot flags
- pipeline_helpers.go: add forwardFromNextMEMWBSlots helper, clear flags
  on flush
- pipeline_tick_eight.go: set loadCoIssuePending in decode stage when
  fwd=true && !useDCache; forward from nextMEMWB slots in EX stage
  between forwardFromAllSlots and sameCycleForward

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…11620842 (commit b1f8d23)

Co-issue commit b1f8d23 results:
- Microbench avg error: 21.59% (was 17.55%)
- PolyBench avg error: 42.05% (was 42.49%)
- Overall avg error: 27.04% (was 24.20%)

Key regressions: vectorsum 24.46->41.55%, vectoradd 13.45->24.62%,
reductiontree 6.19->14.56%, strideindirect 13.64->21.38%
Key improvements: bicg 71.24->69.93%, mvt 11.78->11.32%
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

1 similar comment
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

Yifan Sun and others added 3 commits February 20, 2026 00:30
…a broadened MEM→EX forwarding"

This reverts commit 875cf70.
…e matching M2

The load-use bubble overlaps with the last EX cycle (both hold the
consumer in IFID), so total load-to-use = nonCacheLoadLatency + 1.
Setting to 3 gives 4-cycle total, matching Apple M2 L1 latency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

… targets

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

Yifan Sun and others added 2 commits February 20, 2026 02:41
…0258 (commit 55663fc)

Microbench data verified on current HEAD. Co-issue revert improved micro avg
error 21.59% -> 16.86%. PolyBench data stale (pending CI run 22215020276);
cancelled stuck run 22212941350.

Key changes: vectorsum 41.55->13.56%, vectoradd 24.62->11.15%,
strideindirect 21.38->13.64%, loadheavy 22.92->20.17%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Profile-only cycle: no code changes.
- arithmetic: sim CPI 0.220 vs hw 0.296 (34.5% too fast)
  Root cause: benchmark structure mismatch (unrolled vs looped native)
- branchheavy: sim CPI 0.970 vs hw 0.714 (35.8% too slow)
  Root cause: 5/10 cold branches mispredicted (all forward-taken)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

1 similar comment
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

…276 (partial)

Groups 1&3 complete: atax CPI=0.183, bicg CPI=0.393, jacobi-1d CPI=0.253 now fresh.
3mm now completable (CPI=0.224), moved from infeasible to benchmarks (sim-only).
2mm still infeasible (timed out again). MVT pending Group 2 (GEMM blocking).
Overall avg 23.67% (was 23.58%). Poly avg 42.38% (was 42.05%).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

Yifan Sun and others added 2 commits February 20, 2026 04:47
Replace straight-line 200 ADDs with a 40-iteration loop (5 ADDs + SUB + CBNZ
per iteration) to match the structure of native compiled code. Add EncodeCBNZ
helper for compare-and-branch-if-not-zero encoding.

Fixes #28

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…from CI run 22215020276

All PolyBench benchmarks now FRESH: atax, bicg, jacobi-1d, mvt verified.
MVT updated from stale (0.24/11.32%) to fresh (0.241/11.78%).
Overall avg: 23.70%. Polybench avg: 42.49%. Micro avg: 16.86%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

1 similar comment
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

Yifan Sun and others added 3 commits February 20, 2026 05:56
Root cause: simulator models zero penalty for correctly predicted taken
branches. The loop-restructured arithmetic benchmark achieves IPC ~5.3
vs hw ~3.4 because 40 taken CBNZ branches cost nothing in sim.
Proposed fix: add 1-cycle fetch redirect penalty for taken branches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap the 10 conditional branches (5 taken, 5 not-taken) in a 25-iteration
loop so the branch predictor can learn from repeated encounters. Each
iteration resets X0 and re-executes the same branch pattern, allowing the
predictor to train after the first iteration. CPI drops from 0.970 to 0.428.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… CI run 22219381657

- Arithmetic sim CPI: 0.220 -> 0.188 (Nina's benchmark restructure df005d5)
- PolyBench verified from CI run 22217510861: no regressions
  - bicg 71.24% <=72% PASS
  - jacobi-1d 67.55% <=68% PASS
  - memorystrided 16.81% <=17% PASS
- Overall avg: 25.22% (up from 23.70% due to arithmetic hw CPI mismatch)
- Note: arithmetic hw CPI (0.296) may need re-measurement on restructured benchmark

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

2 similar comments
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

Yifan Sun and others added 2 commits February 20, 2026 07:07
Fix gofmt formatting in microbenchmarks.go and pipeline_helpers.go.
Add 1-cycle fetch redirect bubble for predicted-taken branches, modeling
the real M2 penalty when the fetch unit redirects to a branch target.
Eliminated branches (pure B) bypass the penalty. The redirect flag is
cleared on pipeline flush (misprediction).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…n 22223493122

Updated all 11 microbenchmark sim CPI values from Leo's taken-branch
redirect penalty fix (commit 016eb3b). Key improvements:
- arithmetic: 57.45% -> 3.14% error (sim 0.188->0.287, hw 0.296)
- branchheavy: 35.85% -> 1.26% error (sim 0.97->0.723, hw 0.714)
- Overall avg: 25.22% -> 19.9%
- Micro avg: 18.95% -> 11.68%

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

1 similar comment
@github-actions
Copy link
Contributor

Performance Regression Analysis

Performance Benchmark Comparison

Compares PR benchmarks against main branch baseline.
Benchmarks: pipeline tick throughput across ALU, memory, mixed workloads.

Benchstat output not available (benchmark steps may have timed out).

No significant regressions detected.


Automated benchmark comparison via go test -bench + benchstat

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant