feat(benchmark): reproducible code-search methodology with rgai/grep strategy #127
Open
heAdz0r wants to merge 3 commits into rtk-ai:master from
Conversation
Add comprehensive benchmark suite comparing grep, rtk grep, rtk rgai, and head_n (negative control) for code search tasks.

Key methodology improvements:
- Pinned commit verification (exit 2 if HEAD != gold_standards.json commit)
- Dirty tree detection (exit 3 if uncommitted changes in src/)
- Token-based TE using tiktoken (cl100k_base) instead of byte approximation
- No output truncation (full quality samples preserved)
- head_n negative control baseline for comparison
- Auto-generated gold_auto.json from grep output for objective verification

Benchmark categories:
- A: Exact Identifier (6 queries) - rtk_grep recommended
- B: Regex Pattern (6 queries) - grep/rtk_grep recommended
- C: Semantic Intent (10 queries) - rtk_rgai recommended (100% vs 0% grep)
- D: Cross-File Pattern (5 queries) - rtk_grep recommended
- E: Edge Cases (3 queries)

Key findings:
- rtk rgai excels at semantic/intent queries (cosine similarity)
- rtk grep provides best exact-match with token savings (~30%)
- Recommended: rgai for discovery → grep fallback for precision
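The two reproducibility gates above (exit 2 / exit 3) could be modeled as a pure pre-flight check. Only the exit codes and their meanings come from the commit message; `preflight_exit_code` and its signature are hypothetical, illustrative names:

```python
# Hypothetical model of the benchmark's pre-flight gates. Only the exit
# codes (2 = pinned-commit mismatch, 3 = dirty tree) come from the PR;
# the function name and argument shapes are illustrative.
def preflight_exit_code(head: str, pinned: str, dirty_src_files: list[str]) -> int:
    if head != pinned:
        return 2  # HEAD != gold_standards.json pinned commit
    if dirty_src_files:
        return 3  # uncommitted changes under src/
    return 0  # safe to run the benchmark

print(preflight_exit_code("4b0a413", "4b0a413", []))  # 0
```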
Summary
This PR adds a reproducible benchmark methodology for code search tools, validating the rgai-first strategy with empirical data.
Search priority confirmed:
rtk rgai > rtk grep > grep

Key contribution: Objective methodology proving rtk rgai excels at semantic discovery while rtk grep handles exact/regex precision.
Token Efficiency (Benchmark Results)
Benchmark setup: rtk codebase (54 .rs files, 23K LOC), 30 queries across 5 categories, pinned commit 4b0a413, tiktoken cl100k_base tokenizer.

[Results table: grep | rtk grep | rtk rgai | rtk grep vs grep | rtk rgai vs grep]

Key finding: Category C (Semantic Intent) shows a 100% MISS rate for grep/rtk_grep vs 100% success for rtk rgai. This empirically validates the rgai-first policy.
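How the token-based TE percentages are derived is not spelled out here; a minimal sketch of a relative-savings metric versus plain grep, with illustrative token counts that are not the PR's measured values:

```python
def token_efficiency(tool_tokens: int, grep_tokens: int) -> float:
    """Relative token savings of a tool vs plain grep (positive = fewer tokens).

    In the PR the counts come from tiktoken (cl100k_base); here they are
    passed in as plain integers so the sketch stays self-contained.
    """
    return 1.0 - tool_tokens / grep_tokens

# Illustrative only: rtk grep emitting ~30% fewer tokens than grep
# for the same query, matching the "~30% savings" claim above.
print(round(token_efficiency(700, 1000), 2))  # 0.3
```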
Reproducibility Guarantees
- gold_standards.json: pinned_commit vs HEAD
- git diff on src/, Cargo.toml, Cargo.lock
- git ls-files --others in src/
- tiktoken (cl100k_base), not byte approximation
- head_n baseline for naive truncation comparison
- gold_auto.json generated from grep output

Recommended Search Strategy (Benchmark-Validated)
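A rough sketch of the rgai-first, grep-fallback flow the PR recommends, written as a pure decision function so it runs without the rtk CLI. `search_plan` and the 0.35 similarity threshold are hypothetical, not values from the benchmark:

```python
def search_plan(rgai_hits: list[tuple[str, float]], min_score: float = 0.35):
    """Keep rgai's semantic hits if any clears the cosine-similarity
    threshold; otherwise signal a fallback to exact-match (rtk) grep."""
    good = [(path, score) for path, score in rgai_hits if score >= min_score]
    if good:
        return ("rgai", good)
    return ("grep_fallback", [])

# Strong semantic match: stay with rgai.
print(search_plan([("src/search.rs", 0.81)])[0])  # rgai
# Weak matches only: fall back to grep for precision.
print(search_plan([("src/search.rs", 0.12)])[0])  # grep_fallback
```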
Security Compliance
Layer 1: Automated checks
- src/ untouched (benchmark-only PR)
- Cargo.toml, Cargo.lock untouched

Layer 2: Files added
- benchmarks/bench_code.sh - Bash runner (no external deps except tiktoken)
- benchmarks/analyze_code.py - Python analyzer
- benchmarks/gold_standards.json - Static test data
- benchmarks/tests/ - Unit tests

Implementation Details
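A plausible shape for the analyzer's HIT/MISS scoring against the gold-standard files; `score_results` is a hypothetical helper, not the PR's actual implementation:

```python
def score_results(found: set[str], gold: set[str]) -> dict:
    """Score one query: HIT only if every gold-standard file was found;
    recall is reported alongside for partial-match analysis."""
    if not gold:
        return {"hit": True, "recall": 1.0}  # nothing required
    hits = found & gold
    return {"hit": gold <= found, "recall": len(hits) / len(gold)}

# Illustrative file sets, not from gold_standards.json.
print(score_results({"src/lib.rs", "src/main.rs"}, {"src/main.rs"}))
```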
Verification
Files Changed
- benchmarks/bench_code.sh
- benchmarks/analyze_code.py
- benchmarks/gold_standards.json
- benchmarks/RESULTS.md
- benchmarks/tests/test_analyze_code.py