feat(benchmark): reproducible code-search methodology with rgai/grep strategy #127

Open

heAdz0r wants to merge 3 commits into rtk-ai:master from heAdz0r:benchmark/code-search-methodology

Conversation

heAdz0r commented on Feb 15, 2026

Summary

This PR adds a reproducible benchmark methodology for code search tools, validating the rgai-first strategy with empirical data.

Search priority confirmed: rtk rgai > rtk grep > grep

Key contribution: an objective methodology showing that rtk rgai excels at semantic discovery while rtk grep handles exact-match and regex precision.

Token Efficiency (Benchmark Results)

Benchmark setup: rtk codebase (54 .rs files, 23K LOC), 30 queries across 5 categories, pinned commit 4b0a413, tiktoken cl100k_base tokenizer.

| Category | grep | rtk grep | rtk rgai | rtk grep vs grep | rtk rgai vs grep | Gold Hit Rate |
|---|---|---|---|---|---|---|
| A: Exact Identifier (6q) | 5018 tok | 3192 tok | 2278 tok | -36.4% | -54.6% | grep 100%, rtk_grep 100%, rgai 93% |
| B: Regex Pattern (6q) | 7186 tok | 6074 tok | 5042 tok | -15.5% | -29.8% | grep 100%, rtk_grep 100%, rgai N/A |
| C: Semantic Intent (10q) | 0 tok | 0 tok | 7762 tok | MISS | ∞ (only rgai finds) | grep 0%, rtk_grep 0%, rgai 100% |
| D: Cross-File (5q) | 11858 tok | 7483 tok | 3914 tok | -36.9% | -67.0% | grep 100%, rtk_grep 100%, rgai 58% |
| E: Edge Cases (3q) | 28562 tok | 7464 tok | 2407 tok | -73.9% | -91.6% | N/A |

Key finding: Category C (Semantic Intent) shows a 100% MISS rate for grep/rtk_grep versus 100% success for rtk rgai, which empirically validates the rgai-first policy.
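
The token counts in the table above are taken with tiktoken's cl100k_base encoding rather than a byte-length approximation. A minimal sketch of how such a count can be taken; the function name and example queries are illustrative, not the exact code in analyze_code.py:

```python
# Sketch: count tokens of a search tool's output with tiktoken (cl100k_base).
# Illustrative only; the real measurement lives in the benchmark scripts.
import subprocess
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

def count_output_tokens(cmd: list[str]) -> int:
    """Run a search command and return the token count of its stdout."""
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    # disallowed_special=() keeps encode() from raising if the output
    # happens to contain special-token strings
    return len(ENC.encode(out, disallowed_special=()))

# Example: the same query through raw grep and rtk grep (query string is illustrative).
print(count_output_tokens(["grep", "-rn", "fn main", "src/"]))
print(count_output_tokens(["rtk", "grep", "fn main"]))
```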

Reproducibility Guarantees

| Mechanism | Implementation | Exit Code |
|---|---|---|
| Pinned commit | gold_standards.json:pinned_commit vs HEAD | 2 |
| Dirty tree check | git diff on src/, Cargo.toml, Cargo.lock | 3 |
| Untracked files | git ls-files --others in src/ | 3 |
| Token counting | tiktoken (cl100k_base), not a byte approximation | - |
| Negative control | head_n baseline for naive truncation comparison | - |
| Auto-verification | gold_auto.json generated from grep output | - |
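
A minimal sketch of the pinned-commit and dirty-tree guards from this table, assuming the exit codes 2 and 3 listed above; the helper names are illustrative, and the actual checks are implemented in benchmarks/bench_code.sh:

```python
# Sketch of the reproducibility guards (exit 2 / exit 3). Illustrative;
# the real checks are shell commands in benchmarks/bench_code.sh.
import json
import subprocess
import sys

def git(*args: str) -> str:
    return subprocess.run(["git", *args], capture_output=True, text=True).stdout.strip()

def check_reproducibility(gold_path: str = "benchmarks/gold_standards.json") -> None:
    with open(gold_path) as f:
        pinned = json.load(f)["pinned_commit"]
    head = git("rev-parse", "HEAD")
    if not head.startswith(pinned):
        sys.exit(2)  # HEAD does not match the pinned commit
    if git("diff", "--name-only", "--", "src/", "Cargo.toml", "Cargo.lock"):
        sys.exit(3)  # dirty tree: uncommitted changes in benchmarked files
    if git("ls-files", "--others", "--exclude-standard", "--", "src/"):
        sys.exit(3)  # untracked files in src/
```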

Recommended Search Strategy (Benchmark-Validated)

User intent query → rtk rgai (semantic/cosine similarity)
                        ↓
              Found relevant files?
                ↓ YES         ↓ NO
           Use context    Fallback: rtk grep (exact match)
                              ↓
                    Still need precision?
                              ↓
                         grep (raw)
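
A sketch of this fallback chain as a small wrapper, assuming rtk rgai and rtk grep are on PATH and print matches to stdout (a non-empty result is treated as a hit); the exact CLI flags may differ from the real tool:

```python
# Sketch: rgai-first search with grep fallbacks, following the decision flow above.
# Assumes `rtk rgai <query>` / `rtk grep <pattern>` print matches to stdout.
import subprocess

def run(cmd: list[str]) -> str:
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

def search(query: str) -> tuple[str, str]:
    """Return (tool_used, output) following the rgai -> rtk grep -> grep chain."""
    out = run(["rtk", "rgai", query])          # 1. semantic / cosine-similarity search
    if out:
        return "rtk rgai", out
    out = run(["rtk", "grep", query])          # 2. exact-match fallback
    if out:
        return "rtk grep", out
    return "grep", run(["grep", "-rn", query, "src/"])  # 3. raw grep for full precision
```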

Security Compliance

Layer 1: Automated checks

  • No changes to src/ (benchmark-only PR)
  • No changes to Cargo.toml, Cargo.lock
  • No workflow modifications

Layer 2: Files added

  • benchmarks/bench_code.sh - Bash runner (no external deps except tiktoken)
  • benchmarks/analyze_code.py - Python analyzer
  • benchmarks/gold_standards.json - Static test data
  • benchmarks/tests/ - Unit tests

Implementation Details

  • 30 curated queries in 5 categories (A-E)
  • 5 runs per query with median aggregation
  • 4 tools compared: grep, rtk_grep, rtk_rgai, head_n
  • Gold standards define expected files for each query
  • Status tracking: OK, MISS, LOW_COVERAGE, EXPECTED_UNSUPPORTED, UNEXPECTED_HIT
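
A sketch of the per-query aggregation and status logic described in the list above (median over 5 runs, coverage against the gold files); the threshold and helper names are illustrative, and the real logic is in benchmarks/analyze_code.py:

```python
# Sketch: aggregate the 5 runs of one query and classify the result against
# the gold-standard files. Names and thresholds are illustrative.
from statistics import median

def aggregate_tokens(run_tokens: list[int]) -> int:
    """Median token count over the 5 runs of a single query."""
    return int(median(run_tokens))

def classify(found: set[str], gold: set[str], supported: bool = True) -> str:
    """Map one tool's result for one query onto the benchmark's status values."""
    if not supported:
        return "EXPECTED_UNSUPPORTED"   # a category the tool is not expected to handle
    if not gold:
        return "UNEXPECTED_HIT" if found else "OK"
    coverage = len(found & gold) / len(gold)
    if coverage == 0.0:
        return "MISS"
    if coverage < 1.0:
        return "LOW_COVERAGE"
    return "OK"
```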

Verification

# Unit tests (53 tests)
python3 -m unittest benchmarks.tests.test_analyze_code -v

# Full benchmark (requires pinned commit checkout)
git checkout 4b0a413
pip install tiktoken
cargo build --release
bash benchmarks/bench_code.sh
python3 benchmarks/analyze_code.py

Files Changed

| File | Purpose |
|---|---|
| benchmarks/bench_code.sh | Main benchmark runner with reproducibility checks |
| benchmarks/analyze_code.py | Results analyzer and RESULTS.md generator |
| benchmarks/gold_standards.json | 30 curated queries with expected files |
| benchmarks/RESULTS.md | Sample results for reference |
| benchmarks/tests/test_analyze_code.py | 53 unit tests |
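
For orientation, a plausible shape for one gold_standards.json entry and how it could be loaded; apart from pinned_commit and the A-E category labels, the field names and the example file path are assumptions, not the exact schema:

```python
# Sketch: assumed structure of a gold_standards.json entry. Only pinned_commit
# and the A-E categories are documented in this PR; other fields are illustrative.
import json

EXAMPLE = {
    "pinned_commit": "4b0a413",
    "queries": [
        {
            "id": "C03",
            "category": "C",                         # Semantic Intent
            "query": "where is retry/backoff handled?",
            "expected_files": ["src/net/retry.rs"],  # hypothetical gold files
        }
    ],
}

def load_gold(path: str = "benchmarks/gold_standards.json") -> dict:
    with open(path) as f:
        return json.load(f)
```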

Add a comprehensive benchmark suite comparing grep, rtk grep, rtk rgai,
and head_n (negative control) on code-search tasks.

Key methodology improvements:
- Pinned commit verification (exit 2 if HEAD != gold_standards.json commit)
- Dirty tree detection (exit 3 if uncommitted changes in src/)
- Token-efficiency (TE) measured with tiktoken (cl100k_base) instead of a byte approximation
- No output truncation (full quality samples preserved)
- head_n negative control baseline for comparison
- Auto-generated gold_auto.json from grep output for objective verification

Benchmark categories:
- A: Exact Identifier (6 queries) - rtk_grep recommended
- B: Regex Pattern (6 queries) - grep/rtk_grep recommended
- C: Semantic Intent (10 queries) - rtk_rgai recommended (100% vs 0% grep)
- D: Cross-File Pattern (5 queries) - rtk_grep recommended
- E: Edge Cases (3 queries)

Key findings:
- rtk rgai excels at semantic/intent queries (cosine similarity)
- rtk grep gives the best exact-match results, with ~30% token savings over grep
- Recommended flow: rgai for discovery → grep fallback for precision