feat(benchmark): reproducible code-search methodology with rgai/grep strategy #127

Open

heAdz0r wants to merge 3 commits into rtk-ai:master from heAdz0r:benchmark/code-search-methodology

Conversation

heAdz0r commented on Feb 15, 2026

Summary

This PR adds a reproducible benchmark methodology for code search tools, validating the rgai-first strategy with empirical data.

Search priority confirmed: rtk rgai > rtk grep > grep

Key contribution: an objective methodology showing that rtk rgai excels at semantic discovery while rtk grep handles exact-match and regex precision.

Token Efficiency (Benchmark Results)

Benchmark setup: rtk codebase (54 .rs files, 23K LOC), 30 queries across 5 categories, pinned commit 4b0a413, tiktoken cl100k_base tokenizer.

| Category | grep | rtk grep | rtk rgai | rtk grep vs grep | rtk rgai vs grep | Gold Hit Rate |
|---|---|---|---|---|---|---|
| A: Exact Identifier (6q) | 5018 tok | 3192 tok | 2278 tok | -36.4% | -54.6% | grep 100%, rtk_grep 100%, rgai 93% |
| B: Regex Pattern (6q) | 7186 tok | 6074 tok | 5042 tok | -15.5% | -29.8% | grep 100%, rtk_grep 100%, rgai N/A |
| C: Semantic Intent (10q) | 0 tok | 0 tok | 7762 tok | MISS | ∞ (only rgai finds) | grep 0%, rtk_grep 0%, rgai 100% |
| D: Cross-File (5q) | 11858 tok | 7483 tok | 3914 tok | -36.9% | -67.0% | grep 100%, rtk_grep 100%, rgai 58% |
| E: Edge Cases (3q) | 28562 tok | 7464 tok | 2407 tok | -73.9% | -91.6% | N/A |

Key finding: Category C (Semantic Intent) shows a 100% MISS rate for grep/rtk_grep versus 100% success for rtk rgai, which empirically validates the rgai-first policy.
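
The token counts in the table above are taken with tiktoken's cl100k_base encoding rather than a byte-length approximation. A minimal sketch of how such a count can be taken; the function name and example queries are illustrative, not the exact code in analyze_code.py:

```python
# Sketch: count tokens of a search tool's output with tiktoken (cl100k_base).
# Illustrative only; the real measurement lives in the benchmark scripts.
import subprocess
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

def count_output_tokens(cmd: list[str]) -> int:
    """Run a search command and return the token count of its stdout."""
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    # disallowed_special=() keeps encode() from raising if the output
    # happens to contain special-token strings
    return len(ENC.encode(out, disallowed_special=()))

# Example: the same query through raw grep and rtk grep (query string is illustrative).
print(count_output_tokens(["grep", "-rn", "fn main", "src/"]))
print(count_output_tokens(["rtk", "grep", "fn main"]))
```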

Reproducibility Guarantees

| Mechanism | Implementation | Exit Code |
|---|---|---|
| Pinned commit | gold_standards.json:pinned_commit vs HEAD | 2 |
| Dirty tree check | git diff on src/, Cargo.toml, Cargo.lock | 3 |
| Untracked files | git ls-files --others in src/ | 3 |
| Token counting | tiktoken (cl100k_base), not a byte approximation | - |
| Negative control | head_n baseline for naive truncation comparison | - |
| Auto-verification | gold_auto.json generated from grep output | - |
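
A minimal sketch of the pinned-commit and dirty-tree guards from this table, assuming the exit codes 2 and 3 listed above; the helper names are illustrative, and the actual checks are implemented in benchmarks/bench_code.sh:

```python
# Sketch of the reproducibility guards (exit 2 / exit 3). Illustrative;
# the real checks are shell commands in benchmarks/bench_code.sh.
import json
import subprocess
import sys

def git(*args: str) -> str:
    return subprocess.run(["git", *args], capture_output=True, text=True).stdout.strip()

def check_reproducibility(gold_path: str = "benchmarks/gold_standards.json") -> None:
    with open(gold_path) as f:
        pinned = json.load(f)["pinned_commit"]
    head = git("rev-parse", "HEAD")
    if not head.startswith(pinned):
        sys.exit(2)  # HEAD does not match the pinned commit
    if git("diff", "--name-only", "--", "src/", "Cargo.toml", "Cargo.lock"):
        sys.exit(3)  # dirty tree: uncommitted changes in benchmarked files
    if git("ls-files", "--others", "--exclude-standard", "--", "src/"):
        sys.exit(3)  # untracked files in src/
```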

Recommended Search Strategy (Benchmark-Validated)

User intent query → rtk rgai (semantic/cosine similarity)
                        ↓
              Found relevant files?
                ↓ YES         ↓ NO
           Use context    Fallback: rtk grep (exact match)
                              ↓
                    Still need precision?
                              ↓
                         grep (raw)
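
A sketch of this fallback chain as a small wrapper, assuming rtk rgai and rtk grep are on PATH and print matches to stdout (a non-empty result is treated as a hit); the exact CLI flags may differ from the real tool:

```python
# Sketch: rgai-first search with grep fallbacks, following the decision flow above.
# Assumes `rtk rgai <query>` / `rtk grep <pattern>` print matches to stdout.
import subprocess

def run(cmd: list[str]) -> str:
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

def search(query: str) -> tuple[str, str]:
    """Return (tool_used, output) following the rgai -> rtk grep -> grep chain."""
    out = run(["rtk", "rgai", query])          # 1. semantic / cosine-similarity search
    if out:
        return "rtk rgai", out
    out = run(["rtk", "grep", query])          # 2. exact-match fallback
    if out:
        return "rtk grep", out
    return "grep", run(["grep", "-rn", query, "src/"])  # 3. raw grep for full precision
```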

Security Compliance

Layer 1: Automated checks

  • No changes to src/ (benchmark-only PR)
  • No changes to Cargo.toml, Cargo.lock
  • No workflow modifications

Layer 2: Files added

  • benchmarks/bench_code.sh - Bash runner (no external deps except tiktoken)
  • benchmarks/analyze_code.py - Python analyzer
  • benchmarks/gold_standards.json - Static test data
  • benchmarks/tests/ - Unit tests

Implementation Details

  • 30 curated queries in 5 categories (A-E)
  • 5 runs per query with median aggregation
  • 4 tools compared: grep, rtk_grep, rtk_rgai, head_n
  • Gold standards define expected files for each query
  • Status tracking: OK, MISS, LOW_COVERAGE, EXPECTED_UNSUPPORTED, UNEXPECTED_HIT
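
A sketch of the per-query aggregation and status logic described in the list above (median over 5 runs, coverage against the gold files); the threshold and helper names are illustrative, and the real logic is in benchmarks/analyze_code.py:

```python
# Sketch: aggregate the 5 runs of one query and classify the result against
# the gold-standard files. Names and thresholds are illustrative.
from statistics import median

def aggregate_tokens(run_tokens: list[int]) -> int:
    """Median token count over the 5 runs of a single query."""
    return int(median(run_tokens))

def classify(found: set[str], gold: set[str], supported: bool = True) -> str:
    """Map one tool's result for one query onto the benchmark's status values."""
    if not supported:
        return "EXPECTED_UNSUPPORTED"   # a category the tool is not expected to handle
    if not gold:
        return "UNEXPECTED_HIT" if found else "OK"
    coverage = len(found & gold) / len(gold)
    if coverage == 0.0:
        return "MISS"
    if coverage < 1.0:
        return "LOW_COVERAGE"
    return "OK"
```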

Verification

# Unit tests (53 tests)
python3 -m unittest benchmarks.tests.test_analyze_code -v

# Full benchmark (requires pinned commit checkout)
git checkout 4b0a413
pip install tiktoken
cargo build --release
bash benchmarks/bench_code.sh
python3 benchmarks/analyze_code.py

Files Changed

| File | Purpose |
|---|---|
| benchmarks/bench_code.sh | Main benchmark runner with reproducibility checks |
| benchmarks/analyze_code.py | Results analyzer and RESULTS.md generator |
| benchmarks/gold_standards.json | 30 curated queries with expected files |
| benchmarks/RESULTS.md | Sample results for reference |
| benchmarks/tests/test_analyze_code.py | 53 unit tests |
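
For orientation, a plausible shape for one gold_standards.json entry and how it could be loaded; apart from pinned_commit and the A-E category labels, the field names and the example file path are assumptions, not the exact schema:

```python
# Sketch: assumed structure of a gold_standards.json entry. Only pinned_commit
# and the A-E categories are documented in this PR; other fields are illustrative.
import json

EXAMPLE = {
    "pinned_commit": "4b0a413",
    "queries": [
        {
            "id": "C03",
            "category": "C",                         # Semantic Intent
            "query": "where is retry/backoff handled?",
            "expected_files": ["src/net/retry.rs"],  # hypothetical gold files
        }
    ],
}

def load_gold(path: str = "benchmarks/gold_standards.json") -> dict:
    with open(path) as f:
        return json.load(f)
```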

Add a comprehensive benchmark suite comparing grep, rtk grep, rtk rgai,
and head_n (negative control) on code-search tasks.

Key methodology improvements:
- Pinned commit verification (exit 2 if HEAD != gold_standards.json commit)
- Dirty tree detection (exit 3 if uncommitted changes in src/)
- Token-efficiency (TE) measured with tiktoken (cl100k_base) instead of a byte approximation
- No output truncation (full quality samples preserved)
- head_n negative control baseline for comparison
- Auto-generated gold_auto.json from grep output for objective verification

Benchmark categories:
- A: Exact Identifier (6 queries) - rtk_grep recommended
- B: Regex Pattern (6 queries) - grep/rtk_grep recommended
- C: Semantic Intent (10 queries) - rtk_rgai recommended (100% vs 0% grep)
- D: Cross-File Pattern (5 queries) - rtk_grep recommended
- E: Edge Cases (3 queries)

Key findings:
- rtk rgai excels at semantic/intent queries (cosine similarity)
- rtk grep gives the best exact-match results, with ~30% token savings over grep
- Recommended flow: rgai for discovery → grep fallback for precision