Release of initial benchmark #52
Conversation
…elines
- Add example environment configuration file for OpenAI API keys.
- Create .gitignore to exclude unnecessary files and directories.
- Implement Makefile for easy setup and execution of benchmark tasks.
- Write README.md with project overview, requirements, setup instructions, and output details.
- Define project metadata and dependencies in pyproject.toml.
- Set up directory structure and paths for data handling.
- Develop CLI for extracting data from Excel files and querying OpenAI's API.
- Implement evaluation functions to compare model responses against ground truth.
- Create utility functions for file handling and JSON normalization.
- Integrate OpenAI API client for text and image processing.
- Add support for various extraction methods including OpenPyXL, PDF, HTML, and image rendering.
…aths and extraction logic
…ions in pyproject.toml
…dd certificate of employment JSON structure
…d clarity and consistency
…valuation accuracy
…r Markdown outputs
…scoring
- Introduced new directory structure for RUB benchmarks under `benchmark/rub`.
- Added `manifest.json` to define tasks and their expected outputs.
- Implemented `RubTask` and `RubManifest` models for loading and validating the manifest.
- Created normalization functions to ensure consistent comparison of outputs.
- Developed scoring functions to evaluate predictions against ground truth.
- Added truth data for various tasks including forms, flowcharts, and inspection records.
- Included helper modules for managing RUB tasks and their associated data.
…d scoring metrics
- Created RUB summary comparison report for gpt-4o vs gpt-4.1.
- Added detailed benchmark report summarizing extraction accuracy across methods.
- Introduced RUB report for Stage B task accuracy using Markdown-only inputs.
- Added separate RUB reports for gpt-4o and gpt-4.1.
- Implemented publicize scripts (Python and PowerShell) to automate the copying of benchmark results and reports to the public directory.
📝 Walkthrough
This pull request adds a comprehensive benchmarking subsystem: configuration exclusions, documentation and public reports, a benchmark dataset and manifests, CLI orchestration, extraction pipelines, an LLM client and pricing, evaluation/normalization/scoring utilities (including RUB), reproducibility scripts, and publicization tools.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant User
participant CLI
participant Extractors as Extractor Pipelines
participant LLM as OpenAI Client
participant Evaluator
participant Reporter
rect rgba(100,150,240,0.5)
User->>CLI: run reproduce / make all / exbench extract
CLI->>Extractors: extract contexts (exstruct/openpyxl/pdf/html/image)
Extractors-->>CLI: write extracted contexts (JSON/TXT/PNG)
CLI->>LLM: ask (prompts/markdown) for cases (text/images)
LLM-->>CLI: responses + token/cost metadata
CLI->>Evaluator: eval responses (exact, normalized, raw, markdown, RUB)
Evaluator-->>CLI: results CSVs and rub_results
CLI->>Reporter: generate charts and public REPORT.md
Reporter-->>CLI: save plots and public artifacts
CLI->>User: finished (public bundle/report ready)
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Codecov Report: ✅ All modified and coverable lines are covered by tests.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 275d458784
Actionable comments posted: 11
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
🤖 Fix all issues with AI agents
In `@benchmark/data/manifest.json`:
- Around line 136-140: Update the manifest entry for id
"food_inspection_record_01": place the missing raw Excel into data/raw using a
filename without spaces (e.g., food_inspection_record_01.xlsx), update the
"xlsx" value to that exact filename, and edit the "question" example JSON to
remove the trailing comma after the last "sheets" entry so the sample is valid
JSON; ensure the referenced "truth" path remains correct after these changes.
In `@benchmark/rub/manifest.json`:
- Line 8: Update all JSON "truth" values that currently use Windows-style
backslashes (e.g., "rub\\truth\\ffr_425_01.json") to use forward slashes
("rub/truth/ffr_425_01.json") for cross-platform compatibility; locate every
"truth" key in manifest.json (the entries referenced in the review) and replace
all backslashes with forward slashes consistently.
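One way to apply this mechanically is a small one-off script. The sketch below is hypothetical and assumes the manifest is a JSON object with a top-level `tasks` array whose entries carry a `truth` string, as described in this comment; adapt it if the actual layout differs.

```python
import json
from pathlib import Path

# Hypothetical one-off fixer (not part of the PR): rewrite Windows-style
# backslashes in every "truth" value to forward slashes.
manifest_path = Path("benchmark/rub/manifest.json")
data = json.loads(manifest_path.read_text(encoding="utf-8"))

for task in data.get("tasks", []):
    truth = task.get("truth")
    if isinstance(truth, str):
        task["truth"] = truth.replace("\\", "/")

manifest_path.write_text(
    json.dumps(data, ensure_ascii=False, indent=2) + "\n", encoding="utf-8"
)
```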
In `@benchmark/scripts/publicize.ps1`:
- Line 13: The non-venv branch that invokes the system Python via `python (Join-Path $scriptDir "publicize.py")` currently ignores the Python process exit status. Modify that branch to capture and propagate the exit code by running the same command and then calling `exit $LASTEXITCODE` (or testing `$?`) immediately after the python invocation, so the PowerShell script exits with the Python process status.
In `@benchmark/scripts/publicize.sh`:
- Line 1: The script contains a UTF-8 BOM before the shebang, which breaks interpreter resolution. Remove the BOM so the file begins exactly with `#!/usr/bin/env bash` at byte 0: re-save publicize.sh as UTF-8 without a BOM and verify that the first two characters are '#' and '!' (for example, in a text editor that shows the encoding, or by rewriting the file without a BOM).
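For reference, the BOM can be stripped with a few lines of Python; this is only a sketch, and the path is the one named in the comment above.

```python
from pathlib import Path

# Remove a leading UTF-8 BOM so the shebang is the very first bytes of the file.
script = Path("benchmark/scripts/publicize.sh")
raw = script.read_bytes()
if raw.startswith(b"\xef\xbb\xbf"):
    script.write_bytes(raw[3:])
```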
In `@benchmark/scripts/reproduce.ps1`:
- Around line 1-10: Move the param() block to be the very first statement in the
script so PowerShell treats $Case/$Method/$Model/$Temperature/$SkipAsk as script
parameters; specifically, place the existing param(...) before the
Set-StrictMode -Version Latest and $ErrorActionPreference = "Stop" lines and
ensure no executable statements precede param() so parameter binding works
correctly.
In `@benchmark/src/bench/eval/markdown_score.py`:
- Around line 62-74: The function _normalized_lines currently only skips lines
that start with "```" but doesn't track fenced code block state, so content
inside code blocks gets included; fix it by adding an in_code_block boolean
(initialized False) and for each line use stripped = raw.strip(), toggle
in_code_block when stripped.startswith("```") and continue (so fence lines
aren't processed), and skip processing (continue) whenever in_code_block is True
before calling _normalize_line; keep existing table-separator and empty-line
checks and append normalized lines as before.
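A minimal sketch of the suggested shape follows. The fence toggle mirrors the instruction above; the table-separator test and the final normalization are simplified stand-ins, not the real helpers in markdown_score.py.

```python
import re


def _normalized_lines(markdown: str) -> list[str]:
    """Collect normalized lines, skipping fenced code blocks entirely (sketch)."""
    table_sep = re.compile(r"^\|?[\s:|-]+\|?$")  # stand-in for the real separator check
    lines: list[str] = []
    in_code_block = False
    for raw in markdown.splitlines():
        stripped = raw.strip()
        if stripped.startswith("```"):
            in_code_block = not in_code_block
            continue  # never process the fence line itself
        if in_code_block:
            continue  # skip everything inside a fenced block
        if not stripped or table_sep.match(stripped):
            continue
        normalized = " ".join(stripped.split()).lower()  # simplified _normalize_line
        if normalized:
            lines.append(normalized)
    return lines
```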
In `@benchmark/src/bench/pipeline/image_render.py`:
- Around line 22-33: The code opens a PyMuPDF Document via fitz.open(tmp_pdf)
without closing it; change the usage to use the context manager (e.g., "with
fitz.open(tmp_pdf) as doc:") so the Document is automatically closed after
rendering pages, and move the loop that uses doc.load_page(), page.get_pixmap(),
pix.save(), and paths.append(p) into that with-block to ensure resources are
released.
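For illustration, the context-manager shape looks roughly like this; the function name, signature, and output naming are assumptions rather than the actual code in image_render.py.

```python
from pathlib import Path

import fitz  # PyMuPDF


def render_pdf_pages(tmp_pdf: Path, out_dir: Path, dpi: int = 200, max_pages: int = 6) -> list[Path]:
    """Render PDF pages to PNG, closing the Document automatically (sketch)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: list[Path] = []
    with fitz.open(tmp_pdf) as doc:  # Document is closed when the block exits
        for i in range(min(doc.page_count, max_pages)):
            page = doc.load_page(i)
            pix = page.get_pixmap(dpi=dpi)
            out_path = out_dir / f"page_{i + 1}.png"
            pix.save(str(out_path))
            paths.append(out_path)
    return paths
```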
In `@benchmark/src/bench/pipeline/openpyxl_pandas.py`:
- Around line 13-45: The workbook opened by openpyxl.load_workbook (wb) is never
closed; update the code in openpyxl_pandas.py so wb is closed after
processing—either by using a context manager pattern or wrapping the processing
loop in try/finally and calling wb.close() in the finally block; ensure
wb.close() is invoked even on exceptions so file handles are not leaked
(reference wb and openpyxl.load_workbook in your change).
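The resource-handling shape might look like the sketch below; the processing loop and output format are placeholders, since the real extract_openpyxl also takes a sheet_scope argument and writes a different layout.

```python
from pathlib import Path

import openpyxl


def extract_openpyxl(xlsx_path: Path, out_txt: Path) -> None:
    """Dump cell values as TSV, always closing the workbook (sketch)."""
    wb = openpyxl.load_workbook(xlsx_path, data_only=True, read_only=True)
    try:
        lines: list[str] = []
        for ws in wb.worksheets:
            lines.append(f"# {ws.title}")
            for row in ws.iter_rows(values_only=True):
                lines.append("\t".join("" if v is None else str(v) for v in row))
        out_txt.parent.mkdir(parents=True, exist_ok=True)
        out_txt.write_text("\n".join(lines), encoding="utf-8")
    finally:
        wb.close()  # release the file handle even if processing raises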
In `@benchmark/src/bench/pipeline/pdf_text.py`:
- Around line 15-26: The LibreOffice conversion call using subprocess.run(cmd,
check=True) can hang; add a timeout argument (e.g., timeout=60 or a configurable
value) to subprocess.run and wrap the call in a try/except for
subprocess.TimeoutExpired to log/handle the timeout and ensure the stalled
process is terminated and the benchmark fails fast. Update the call that uses
the local cmd list and subprocess.run to include timeout and add handling for
subprocess.TimeoutExpired (log the error and re-raise or raise a clear
exception) so hung soffice conversions cannot stall the run.
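A sketch of the timeout handling, with an illustrative command list standing in for the real cmd built in pdf_text.py:

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

# Illustrative command; the actual cmd list is assembled in pdf_text.py.
cmd = ["soffice", "--headless", "--convert-to", "pdf", "--outdir", "out", "input.xlsx"]

try:
    subprocess.run(cmd, check=True, timeout=60)  # fail fast instead of hanging
except subprocess.TimeoutExpired:
    logger.error("LibreOffice conversion timed out after 60s: %s", cmd)
    raise  # surface the failure so the benchmark stops instead of stalling
```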
- Around line 31-37: The pdf_to_text function opens a PyMuPDF Document with
fitz.open(pdf_path) but never closes it; wrap the Document usage in a context
manager (use "with fitz.open(pdf_path) as doc:" around the existing loop) so the
file handle is closed automatically, preserve building parts list and page
extraction logic inside that with-block, and keep using the same variable name
doc and functions load_page / get_text.
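The same context-manager pattern applies to pdf_to_text; this sketch uses an assumed signature (the real function also writes the extracted text to an output file).

```python
from pathlib import Path

import fitz  # PyMuPDF


def pdf_to_text(pdf_path: Path) -> str:
    """Extract text from every page, closing the Document automatically (sketch)."""
    parts: list[str] = []
    with fitz.open(pdf_path) as doc:  # closed when the block exits
        for i in range(doc.page_count):
            parts.append(doc.load_page(i).get_text())
    return "\n".join(parts)
```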
In `@README.md`:
- Around line 22-29: Update the image reference in the README's Benchmark section: replace the broken path "benchmark/outputs/plots/markdown_quality.png" with the correct path "benchmark/public/plots/markdown_quality.png" so the Markdown image tag in the Benchmark section points to the existing file; edit the README.md line that contains the image tag to use the corrected path.
🟡 Minor comments (13)
benchmark/.env.example-1-4 (1)
1-4: Clarify which variables are optional.
`# optional` appears right after `OPENAI_API_KEY`, which can be read as the API key being optional. Make it explicit that only ORG/PROJECT are optional.
💡 Suggested tweak
 OPENAI_API_KEY=your_key_here
-# optional
+# optional: OPENAI_ORG, OPENAI_PROJECT
 OPENAI_ORG=
 OPENAI_PROJECT=

benchmark/data/truth/food_inspection_record_01.json-3-36 (1)
3-36: Normalize sheet key formatting to avoid mismatched keys.
The keys mix full-width parentheses and ASCII parentheses with spacing. If sheet names are matched literally during scoring, this inconsistency can cause false mismatches. Consider standardizing to the source workbook naming (e.g., 検食簿(2), 検食簿(3)).
🔧 Suggested normalization
- "検食簿 (2)": { + "検食簿(2)": { @@ - "検食簿 (3)": { + "検食簿(3)": {docs/README.en.md-22-28 (1)
22-28: Keep benchmark chart within `docs/` for site builds.
The image path points outside the docs tree (`../benchmark/...`). MkDocs won't bundle files outside `docs/`, so the chart will fail to render in the built site. The chart currently lives at `benchmark/public/plots/markdown_quality.png`. Copy it into `docs/assets/benchmark/` and update the link.
🛠️ Suggested link update
docs/README.ja.md-20-26 (1)
20-26: Fix benchmark chart image path — file doesn't exist at referenced location.
The image path references `../benchmark/outputs/plots/markdown_quality.png`, but the file actually exists at `../benchmark/public/plots/markdown_quality.png` (no `outputs` directory). Additionally, since this path points outside the `docs/` directory, MkDocs won't bundle it for the built site, causing a broken image link. Copy the chart to `docs/assets/benchmark/` and update the link:
🛠️ Suggested link update
benchmark/src/bench/pipeline/html_text.py-25-28 (1)
25-28: Add error handling for missing converted file.
If `soffice` exits successfully but fails to produce the expected `.html` or `.htm` file, `produced.replace(out_html)` will raise a `FileNotFoundError` with an unclear message. Consider adding an explicit check:
🛡️ Suggested defensive check
 produced = out_html.parent / (xlsx_path.stem + ".html")
 if not produced.exists():
     produced = out_html.parent / (xlsx_path.stem + ".htm")
+if not produced.exists():
+    raise FileNotFoundError(
+        f"soffice did not produce expected HTML output for {xlsx_path}"
+    )
 produced.replace(out_html)

benchmark/rub/truth_lite/flowchart_01.json-1-14 (1)
1-14: Add trailing newline to match standard file formatting.
Both files have identical JSON content. However, the lite version is missing a trailing newline at the end of the file (line 14). Add a newline to `benchmark/rub/truth_lite/flowchart_01.json` to match the non-lite version and follow standard text file formatting conventions.

benchmark/public/reports/compare_gpt4o_gpt41.md-4-4 (1)
4-4: Fix markdownlint MD060 table alignment row spacing.
The alignment row is missing the expected spaces around pipes for the configured table style.
🧹 Proposed fix
-|:--|--:|--:|--:|--:|--:|--:|
+| :-- | --: | --: | --: | --: | --: | --: |

benchmark/public/reports/results_report.md-90-98 (1)
90-98: Clarify the sample size discrepancy (n=11 vs n=12).
Line 90 states "n=11, when available" for the positioning deltas, but the scope section at Line 19 mentions "Cases: 12 Excel documents". If one case is intentionally excluded from certain comparisons, consider adding a brief note explaining why (e.g., "n=11; one case excluded due to X").
benchmark/README.md-190-194 (1)
190-194: Wrap bare URLs to satisfy markdownlint.
Lines 190-194 contain bare URLs, which triggers MD034. Please wrap them in Markdown links or angle brackets to keep lint clean and consistent.
benchmark/rub/manifest_lite.json-1-1 (1)
1-1: Remove UTF-8 BOM character.
Line 1 contains a UTF-8 BOM (`\uFEFF`) before the opening brace. While many JSON parsers tolerate this, some tools may fail to parse the file correctly or interpret it as invalid JSON. Consider removing the BOM for maximum compatibility.

benchmark/src/bench/report_public.py-241-243 (1)
241-243: `relative_to` may raise `ValueError` if chart paths are not under `report_path.parent`.
If `PLOTS_DIR` is not a subdirectory of the report's parent directory, `relative_to` will fail. Consider using `os.path.relpath` or wrapping with error handling.
🛡️ Proposed fix using try/except
-    rel_core = chart_paths.core_chart.relative_to(report_path.parent)
-    rel_markdown = chart_paths.markdown_chart.relative_to(report_path.parent)
-    rel_rub = chart_paths.rub_chart.relative_to(report_path.parent)
+    try:
+        rel_core = chart_paths.core_chart.relative_to(report_path.parent)
+        rel_markdown = chart_paths.markdown_chart.relative_to(report_path.parent)
+        rel_rub = chart_paths.rub_chart.relative_to(report_path.parent)
+    except ValueError:
+        # Fallback to absolute paths if not relative
+        rel_core = chart_paths.core_chart
+        rel_markdown = chart_paths.markdown_chart
+        rel_rub = chart_paths.rub_chart

benchmark/src/bench/report_public.py-6-13 (1)
6-13: Backend selection should occur before importing `pyplot`.
`matplotlib.use("Agg")` on line 13 is called after `matplotlib.pyplot` is imported on line 7. This may not take effect or could cause warnings. Move the backend selection before the pyplot import.
🔧 Proposed fix
 from pathlib import Path
 from typing import Iterable
 import matplotlib
+matplotlib.use("Agg")
 import matplotlib.pyplot as plt
 import pandas as pd
 from pydantic import BaseModel
 from .paths import PLOTS_DIR, PUBLIC_REPORT, RESULTS_DIR, RUB_RESULTS_DIR
-
-matplotlib.use("Agg")

benchmark/src/bench/cli.py-542-545 (1)
542-545: Dead code: `client` is never `None` at this point.
The `OpenAIResponsesClient()` is instantiated unconditionally on line 513, so the `if client is None` check on line 542 will never be true. This check can be removed.
🔧 Proposed fix
     if use_llm:
-        if client is None:
-            raise RuntimeError(
-                "LLM client unavailable for markdown conversion."
-            )
         res = client.ask_markdown(
             model=model, json_text=json_text, temperature=temperature
         )
🧹 Nitpick comments (33)
benchmark/src/bench/pipeline/exstruct_adapter.py (1)
42-46: Minor: Compute the set outside the comprehension for efficiency.
`set(sheet_scope)` is recreated on each iteration. Pre-computing it avoids redundant allocations.
♻️ Proposed fix
+    scope_set = set(sheet_scope)
     sheets: dict[str, SheetData] = {
         name: sheet
         for name, sheet in workbook.sheets.items()
-        if name in set(sheet_scope)
+        if name in scope_set
     }

.pre-commit-config.yaml (1)
5-8: Consider excluding benchmark from `ruff-format` too for consistency.
This avoids formatting benchmark Python files when linting is intentionally skipped.
♻️ Proposed change
   - id: ruff-format
+    exclude: ^benchmark/

benchmark/rub/truth/certificate_of_employment_01.json (1)
1-58: Consider deduplicating identical truth files.
This file appears to be an exact copy of `benchmark/data/truth/certificate_of_employment_01.json`. If both benchmark tracks require identical truth data, consider using a symlink or shared reference to avoid drift during future maintenance.

benchmark/src/bench/eval/exact_match.py (1)
7-12: Add Google-style docstrings to functions.
Both `canonical` and `exact_match` lack docstrings. Per coding guidelines, all functions should have Google-style docstrings.
📝 Proposed docstrings
def canonical(obj: Any) -> str: + """Convert an object to a canonical JSON string for comparison. + + Args: + obj: Any JSON-serializable object. + + Returns: + A compact JSON string with sorted keys and preserved Unicode. + + Raises: + TypeError: If obj contains non-JSON-serializable types. + """ return json.dumps(obj, ensure_ascii=False, sort_keys=True, separators=(",", ":")) def exact_match(a: Any, b: Any) -> bool: + """Check if two objects are structurally identical. + + Compares objects by their canonical JSON representations. + + Args: + a: First object to compare. + b: Second object to compare. + + Returns: + True if both objects have identical canonical JSON forms. + """ return canonical(a) == canonical(b)As per coding guidelines: "Add Google-style docstrings to all functions and classes."
benchmark/src/bench/eval/report.py (1)
8-14: Add a Google-style docstring to the function.
Per coding guidelines, all functions should have Google-style docstrings. The implementation logic is sound; the empty `rows` guard on line 10 correctly handles the edge case.
📝 Suggested docstring addition
 def write_results_csv(rows: list[dict[str, Any]], out_csv: Path) -> None:
+    """Write evaluation results to a CSV file.
+
+    Args:
+        rows: List of dictionaries representing row data. Keys from the first
+            row are used as CSV headers.
+        out_csv: Output path for the CSV file.
+    """
     out_csv.parent.mkdir(parents=True, exist_ok=True)

benchmark/src/bench/llm/pricing.py (1)
16-25: Consider expanding the docstring to Google-style format.
The docstring is brief. Per coding guidelines, Google-style docstrings with Args/Returns sections are preferred.
📝 Suggested docstring enhancement
 def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
-    """Estimate USD cost for a model run when pricing is known."""
+    """Estimate USD cost for a model run when pricing is known.
+
+    Args:
+        model: The model identifier (e.g., "gpt-4o").
+        input_tokens: Number of input tokens consumed.
+        output_tokens: Number of output tokens generated.
+
+    Returns:
+        Estimated cost in USD, or 0.0 if the model's pricing is unknown.
+    """
     pricing = _PRICING_PER_1M.get(model)

benchmark/pyproject.toml (1)
4-4: Update the placeholder description before release.
The description field contains placeholder text "Add your description here". Consider providing a meaningful description for the benchmark package.
📝 Suggested fix
-description = "Add your description here"
+description = "Benchmarking framework for Excel/document extraction evaluation"

benchmark/Makefile (1)
1-20: Consider adding `clean` and `test` targets for completeness.
The static analysis flagged missing `clean` and `test` phony targets. While not strictly required for this benchmark workflow, adding them would improve the Makefile's completeness:
- clean: Remove generated artifacts (e.g., `__pycache__`, `.pyc`, output directories)
- test: Run any benchmark validation or unit tests
This is optional if the benchmark workflow doesn't require these operations.
📝 Suggested additions
-.PHONY: setup extract ask eval report all
+.PHONY: setup extract ask eval report all clean test
+clean:
+	find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null || true
+	find . -type f -name "*.pyc" -delete 2>/dev/null || true
+
+test:
+	@echo "No tests defined for benchmark"
+
 setup:

benchmark/src/bench/pipeline/common.py (1)
9-24: Add Google-style docstrings to utility functions.Per coding guidelines, all functions should have Google-style docstrings. The implementations are clean and correct.
📝 Suggested docstrings
def ensure_dir(p: Path) -> None: + """Create directory and all parent directories if they don't exist. + + Args: + p: Path to the directory to create. + """ p.mkdir(parents=True, exist_ok=True) def sha256_text(s: str) -> str: + """Compute SHA-256 hex digest of a UTF-8 encoded string. + + Args: + s: Input string to hash. + + Returns: + Hexadecimal SHA-256 digest. + """ return hashlib.sha256(s.encode("utf-8")).hexdigest() def write_text(p: Path, text: str) -> None: + """Write text to a file, creating parent directories as needed. + + Args: + p: Output file path. + text: Content to write. + """ ensure_dir(p.parent) p.write_text(text, encoding="utf-8") def write_json(p: Path, obj: Any) -> None: + """Write an object as JSON to a file, creating parent directories as needed. + + Args: + p: Output file path. + obj: Object to serialize as JSON. + """ ensure_dir(p.parent)benchmark/src/bench/pipeline/html_text.py (1)
11-28: Add Google-style docstrings to functions.Per coding guidelines, all functions should have docstrings.
📝 Suggested docstrings
def xlsx_to_html(xlsx_path: Path, out_html: Path) -> None: + """Convert an Excel file to HTML using LibreOffice headless mode. + + Args: + xlsx_path: Path to the input Excel file. + out_html: Path for the output HTML file. + + Raises: + subprocess.CalledProcessError: If soffice conversion fails. + FileNotFoundError: If the converted HTML file is not produced. + """ ensure_dir(out_html.parent)def html_to_text(html_path: Path, out_txt: Path) -> None: + """Extract table content from HTML and write as structured text. + + Args: + html_path: Path to the input HTML file. + out_txt: Path for the output text file. + """ soup = BeautifulSoup(html_path.read_text(encoding="utf-8", errors="ignore"), "lxml")benchmark/src/bench/pipeline/openpyxl_pandas.py (1)
10-12: Add a Google-style docstring forextract_openpyxl.The function lacks the required Args/Returns docstring.
📄 Suggested docstring
def extract_openpyxl( xlsx_path: Path, out_txt: Path, sheet_scope: list[str] | None = None ) -> None: + """Extract cell contents from an XLSX file via openpyxl. + + Args: + xlsx_path: Path to the input XLSX. + out_txt: Path to the output text file. + sheet_scope: Optional list of sheet names to include. + """As per coding guidelines, Add Google-style docstrings to all functions and classes.
benchmark/src/bench/pipeline/image_render.py (1)
11-17: Convert the function docstring to Google style.
Include explicit Args/Returns sections to satisfy the project standard.
As per coding guidelines, Add Google-style docstrings to all functions and classes.
benchmark/src/bench/pipeline/pdf_text.py (1)
11-28: Add Google-style docstrings for both public functions.
Both `xlsx_to_pdf` and `pdf_to_text` are missing the required Args/Returns sections.
As per coding guidelines, add Google-style docstrings to all functions and classes.
Also applies to: 31-48
benchmark/src/bench/llm/openai_client.py (2)
40-55: Return a Pydantic model instead of a tuple for usage tokens.
This helper returns structured data and should use a BaseModel (e.g., `UsageTokens`) to align with the project standard; then update call sites to access fields.
As per coding guidelines, return Pydantic BaseModel instances instead of dictionaries or tuples for structured data.
Also applies to: 97-109
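A minimal sketch of what the suggested model could look like; `UsageTokens` is the name floated in the comment, and the field names and helper below are assumptions rather than the client's actual API.

```python
from pydantic import BaseModel


class UsageTokens(BaseModel):
    """Token usage for one LLM call (illustrative field names)."""

    input_tokens: int = 0
    output_tokens: int = 0


def usage_from_response(resp: object) -> UsageTokens:
    # Hypothetical helper: read usage off an SDK response object defensively.
    usage = getattr(resp, "usage", None)
    return UsageTokens(
        input_tokens=getattr(usage, "input_tokens", 0) or 0,
        output_tokens=getattr(usage, "output_tokens", 0) or 0,
    )
```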
23-25: Avoid `Any` for OpenAI payloads and raw responses.
Define a `JsonValue` alias and TypedDicts (or Pydantic models) for `content` and `raw` instead of `Any`, to keep mypy strict and comply with the typing guideline.
As per coding guidelines, use Any type only at external library boundaries (xlwings, pandas, numpy) and avoid type inference from external library internals.
Also applies to: 125-133
benchmark/src/bench/eval/normalize.py (1)
8-12: Add a Google-style docstring for `_strip_code_fences`.
This helper is missing the required Google-style docstring.
As per coding guidelines: "Add Google-style docstrings to all functions and classes".
✍️ Suggested docstring
 def _strip_code_fences(s: str) -> str:
+    """Remove surrounding Markdown code fences.
+
+    Args:
+        s: Raw string that may include Markdown fences.
+
+    Returns:
+        String without leading/trailing fences.
+    """
     s = s.strip()
     s = re.sub(r"^```(json)?\s*", "", s)
     s = re.sub(r"\s*```$", "", s)
     return s.strip()

benchmark/src/bench/rub/manifest.py (1)
9-24: Convert class docstrings to Google-style with Attributes.
Both `RubTask` and `RubManifest` currently have minimal docstrings; the guidelines call for Google-style class docstrings with attribute descriptions.
As per coding guidelines: "Add Google-style docstrings to all functions and classes".
✍️ Suggested docstring expansion
class RubTask(BaseModel): - """RUB task definition.""" + """RUB task definition. + + Attributes: + id: Task identifier. + source_case_id: Case id for Stage A Markdown. + type: Task type. + question: Prompt for the task. + truth: Path to truth file. + schema_path: Optional schema path. + unordered_paths: Optional list of unordered paths. + """ @@ class RubManifest(BaseModel): - """RUB manifest container.""" + """RUB manifest container. + + Attributes: + tasks: List of RUB tasks. + """benchmark/scripts/reproduce.sh (2)
27-30: Add check for `.env.example` existence before copying.
If `.env.example` doesn't exist, `cp` will fail. Consider adding a guard to provide a clearer error message.
🛠️ Suggested improvement
if [[ ! -f ".env" ]]; then + if [[ ! -f ".env.example" ]]; then + echo "[reproduce] Warning: .env.example not found, skipping .env setup." >&2 + else echo "[reproduce] Copying .env.example -> .env (remember to set OPENAI_API_KEY)." cp .env.example .env + fi fi
32-35: Consider verifying Python availability before venv creation.
If `python` is not installed or not in PATH, line 34 will fail with a potentially confusing error. A pre-check would improve the user experience.
🛠️ Suggested improvement
 if [[ ! -d ".venv" ]]; then
+  if ! command -v python &>/dev/null; then
+    echo "[reproduce] Error: python not found in PATH" >&2
+    exit 1
+  fi
   echo "[reproduce] Creating virtual environment."
   python -m venv .venv
 fi

benchmark/data/normalization_rules.json (1)
44-55: Note: `certificate_of_employment_01` omits `list_object_rules`.
This case defines `alias_rules` but omits the `list_object_rules` key that other cases include. If this is intentional (e.g., this case doesn't require list normalization), consider adding an empty array for consistency, or document that `list_object_rules` is optional.
♻️ Optional: Add empty list_object_rules for consistency
"certificate_of_employment_01": { "alias_rules": [ { "canonical": "※本証明書の内容について、就労先事業者等に無断で作成又は改変を行ったときは、刑法上の罪に問われる場合があります。", "aliases": [ "※本証明書の内容について、就労先事業者等に無断で作成し又は改変を行ったときには、刑法上の罪に問われる場合があります。" ] } ], "split_rules": [], - "composite_rules": [] + "composite_rules": [], + "list_object_rules": [] },benchmark/src/bench/eval/markdown_render.py (1)
50-57: Consider capping heading level at 6.
Markdown only supports heading levels 1-6. For deeply nested dictionaries, `level` can exceed 6, producing invalid headings like `#######`. Consider capping at 6 or using alternative formatting (e.g., bold text) for deeper nesting.
♻️ Optional: Cap heading level
def _render_dict(lines: list[str], value: dict[str, Any], *, level: int) -> None: """Render a dict as Markdown sections. Args: lines: List to append output lines to. value: Dict to render. level: Heading level for keys. """ for key, item in value.items(): - heading = "#" * max(level, 1) + heading = "#" * min(max(level, 1), 6) lines.append(f"{heading} {key}")benchmark/src/bench/eval/raw_match.py (3)
20-22: Add a comment explaining the purpose of the magic string replacement.The replacement of
"窶サ"appears to handle mojibake (garbled encoding), but this isn't documented. A brief inline comment would help maintainers understand why this specific character sequence is removed.normalized = unicodedata.normalize("NFKC", text) + # Remove common mojibake artifacts from Shift-JIS misinterpretation normalized = normalized.replace("窶サ", "")
56-67: Variableitemsis re-declared in separate branches.Both
if isinstance(value, dict)(line 57) andif isinstance(value, list)(line 64) declare a local variable nameditems. While valid Python since they're in mutually exclusive branches, this can be confusing during debugging and code navigation.♻️ Suggested refactor for clarity
if isinstance(value, dict): - items: list[str] = [] + result: list[str] = [] if depth > 0 and not parent_is_list: - items.extend([str(k) for k in value.keys()]) + result.extend([str(k) for k in value.keys()]) for v in value.values(): - items.extend(_flatten_scalars(v, depth=depth + 1, parent_is_list=False)) - return items + result.extend(_flatten_scalars(v, depth=depth + 1, parent_is_list=False)) + return result if isinstance(value, list): - items: list[str] = [] + result: list[str] = [] for v in value: - items.extend(_flatten_scalars(v, depth=depth + 1, parent_is_list=True)) - return items + result.extend(_flatten_scalars(v, depth=depth + 1, parent_is_list=True)) + return result
119-127: Consider edge case: both truth and prediction are empty.When
truth_tokensis empty, the function returns0.0regardless of whetherpred_tokensis also empty. For metrics, comparing two empty sets typically yields perfect coverage (1.0). The current behavior may be intentional to penalize empty truths, but it differs from howscore_partialinbenchmark/src/bench/rub/score.pyhandles this case.♻️ Optional: align with typical metric conventions
truth_tokens = _dedupe_normalized(_flatten_scalars(truth)) pred_tokens = _dedupe_normalized(_flatten_scalars(pred)) if not truth_tokens: - return 0.0 + return 1.0 if not pred_tokens else 0.0benchmark/scripts/publicize.py (2)
7-15: Add Google-style docstrings to helper functions.Per coding guidelines, all functions should have Google-style docstrings. These helpers lack documentation.
📝 Suggested docstrings
def _copy_file(src: Path, dest: Path) -> None: + """Copy a file to the destination, creating parent directories as needed. + + Args: + src: Source file path. + dest: Destination file path. + """ dest.parent.mkdir(parents=True, exist_ok=True) shutil.copy2(src, dest) def _copy_glob(src_dir: Path, pattern: str, dest_dir: Path) -> None: + """Copy all files matching a glob pattern to a destination directory. + + Args: + src_dir: Source directory to search. + pattern: Glob pattern to match files. + dest_dir: Destination directory. + """ for path in src_dir.glob(pattern): if path.is_file(): _copy_file(path, dest_dir / path.name)
18-63: Add docstring tomainfunction.The
mainfunction orchestrates the entire publicization workflow but lacks documentation explaining its purpose and return value.📝 Suggested docstring
def main() -> int: + """Copy benchmark artifacts to the public directory. + + Copies REPORT.md, plots, and result reports from outputs to the public + bundle directory. Generates an INDEX.md describing the bundle contents. + + Returns: + Exit code (0 for success). + """ root = Path(__file__).resolve().parents[1]benchmark/src/bench/manifest.py (2)
9-25: Add Google-style docstrings to Pydantic models.Per coding guidelines, all classes should have Google-style docstrings. These models lack documentation describing their purpose and attributes.
📝 Suggested docstrings
class RenderConfig(BaseModel): + """Configuration for rendering benchmark cases. + + Attributes: + dpi: Dots per inch for rendering. Defaults to 200. + max_pages: Maximum number of pages to render. Defaults to 6. + """ + dpi: int = 200 max_pages: int = 6 class Case(BaseModel): + """A single benchmark case definition. + + Attributes: + id: Unique identifier for the case. + type: Type classification of the case. + xlsx: Path to the Excel source file. + question: Extraction prompt or question. + truth: Path to the ground-truth JSON file. + sheet_scope: Optional list of sheet names to include. + render: Rendering configuration. + """ + id: str ... class Manifest(BaseModel): + """Collection of benchmark cases. + + Attributes: + cases: List of benchmark case definitions. + """ + cases: list[Case]
28-30: Add docstring and consider using Pydantic v2'smodel_validate.The function lacks a docstring. Additionally,
Manifest.model_validate(data)provides better validation error messages in Pydantic v2 compared toManifest(**data).📝 Suggested improvements
def load_manifest(path: Path) -> Manifest: + """Load a benchmark manifest from a JSON file. + + Args: + path: Path to the manifest JSON file. + + Returns: + Parsed Manifest instance. + + Raises: + ValidationError: If the JSON structure is invalid. + """ data = json.loads(path.read_text(encoding="utf-8")) - return Manifest(**data) + return Manifest.model_validate(data)benchmark/src/bench/eval/markdown_score.py (1)
20-38: Consider edge case: both truth and prediction are empty.Similar to
raw_match.py, whentruth_linesis empty, the function returns0.0regardless ofpred_lines. For consistency withscore_partialin the RUB scoring module, consider returning1.0when both are empty.benchmark/src/bench/rub/normalize.py (1)
50-54: Consider supporting additional numeric formats.The regex patterns only match simple integers and decimals. Scientific notation (e.g.,
1.5e-10), numbers with leading zeros, or comma-separated numbers won't be parsed as numeric values. This may be intentional for benchmark determinism.benchmark/src/bench/eval/score.py (1)
51-53: Duplicate helper:_strip_circled_numbersexists innormalization_rules.py.This function is already defined in
benchmark/src/bench/eval/normalization_rules.py(lines 74-76). Consider importing it from there to avoid code duplication.♻️ Suggested refactor
from .normalization_rules import ( ListObjectRule, NormalizationRules, RuleIndex, build_rule_index, normalize_label, + _strip_circled_numbers, ) - - -def _strip_circled_numbers(text: str) -> str: - """Remove circled-number characters (e.g., ①②) for robust matching.""" - return "".join(ch for ch in text if unicodedata.category(ch) != "No")benchmark/src/bench/eval/normalization_rules.py (1)
114-119: Consider adding error handling for malformed JSON.If the JSON file exists but contains invalid JSON or doesn't match the expected schema,
json.loadsor Pydantic validation will raise exceptions. Consider catching and providing a clearer error message.🛡️ Optional: Add error handling
def load_ruleset(path: Path) -> NormalizationRuleset: """Load normalization ruleset from JSON file.""" if not path.exists(): return NormalizationRuleset() - payload = json.loads(path.read_text(encoding="utf-8")) - return NormalizationRuleset(**payload) + try: + payload = json.loads(path.read_text(encoding="utf-8")) + return NormalizationRuleset(**payload) + except (json.JSONDecodeError, ValidationError) as exc: + raise ValueError(f"Invalid normalization rules at {path}: {exc}") from excbenchmark/src/bench/cli.py (1)
573-574: Consider using more specific exception types.Bare
except Exceptioncatches all exceptions includingKeyboardInterrupt. For JSON parsing and LLM calls, consider catching specific exceptions likejson.JSONDecodeError,ValueError, or library-specific exceptions for better error diagnostics.Also applies to: 670-671, 753-754, 808-809, 982-983
Pull request overview
This PR adds an initial benchmark infrastructure for evaluating Excel document extraction methods for RAG/LLM preprocessing. The benchmark compares ExStruct against baseline methods (PDF, HTML, OpenPyXL, image-based VLM) using GPT-4o, measuring extraction accuracy, raw coverage/precision, and downstream reconstruction utility (RUB).
Changes:
- Added comprehensive benchmark infrastructure with evaluation scripts, scoring logic, and LLM integration
- Configured exclusions for the benchmark directory in code quality tools (ruff, mypy, codecov, pre-commit)
- Added benchmark documentation and public reports with charts showing comparative results (n=12 cases)
Reviewed changes
Copilot reviewed 91 out of 99 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| pyproject.toml | Added benchmark workspace member and tool exclusions |
| codecov.yml | Excluded benchmark directory from coverage |
| .pre-commit-config.yaml, .codacy.yml | Excluded benchmark from linting |
| README.md, docs/README.*.md | Added benchmark section with chart references |
| benchmark/* | New benchmark package with CLI, evaluation, scoring, and reporting |
| benchmark/rub/* | RUB (Reconstruction Utility Benchmark) manifests and truth data |
| benchmark/scripts/* | Reproducibility and publication scripts |
| benchmark/data/* | Test case manifests, truth files, normalization rules |
We will proceed with the policy of keeping the data private (the raw data will remain covered by .gitignore). The review findings have been addressed locally:
The finding about the unpublished data (food_inspection_record_01.xlsx) will not be acted on in this PR, because of the keep-private policy. If needed, we will add a note to README/benchmark/README explaining how to obtain the data or that it is private.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@benchmark/docs/spec.md`:
- Around line 96-105: Update the manifest field list in the spec to match the
actual loader/schema: replace or supplement the outdated fields (xlsx,
sheet_scope, render) with the current names and descriptions used by the
manifests/loader—specifically document source_case_id (original input/source
identifier), schema_path (path to ground-truth JSON/schema), unordered_paths
(list of file paths where order doesn't matter), and ensure track, id, type,
question, and truth remain accurately described; remove or mark deprecated
xlsx/sheet_scope/render to avoid confusion and keep field meanings consistent
with the loader code.
- Around line 75-94: The fenced code block showing the directory tree in
benchmark/docs/spec.md is missing a language tag, triggering markdownlint;
update the opening fence (the triple backticks that start the block containing
"benchmark/ rub/ README.md ...") to include a language token such as text
(e.g., replace ``` with ```text) so the block is explicitly labeled without
changing the block contents.
🧹 Nitpick comments (4)
benchmark/src/bench/pipeline/image_render.py (1)
11-17: Use Google‑style docstring format.
Current docstring doesn't follow the required Google-style structure (Args/Returns), which is mandated by the repo guidelines.
As per coding guidelines, "Add Google-style docstrings to all functions and classes."
📄 Suggested docstring update
def xlsx_to_pngs_via_pdf( xlsx_path: Path, out_dir: Path, dpi: int = 200, max_pages: int = 6 ) -> list[Path]: - """ - xlsx -> pdf (LibreOffice) -> png (PyMuPDF render) - 画像は VLM 入力に使う。OCRはしない。 - """ + """Convert an XLSX to PNG images via a temporary PDF. + + Args: + xlsx_path: Source XLSX file path. + out_dir: Output directory for PNG files. + dpi: Render DPI for PDF pages. + max_pages: Max number of pages to render. + + Returns: + List of generated PNG file paths. + """benchmark/src/bench/llm/openai_client.py (1)
107-114: Consider handling potential JSON decode errors.
`resp.model_dump_json()` should return valid JSON, but wrapping `json.loads()` in a try-except could prevent unexpected failures if the SDK returns malformed data.
1-84: Content duplicated with `benchmark/public/REPORT.md`.
This report is nearly identical to `benchmark/public/REPORT.md`. Consider whether both files are needed or if one could reference the other to reduce maintenance burden. If both are intentional (working vs. public versions), this is acceptable.
9-25: Add Google-style class docstrings (with Attributes).
These are Pydantic models; documenting fields improves clarity and aligns with the docstring guideline.
🛠️ Suggested fix
class RubTask(BaseModel): - """RUB task definition.""" + """RUB task definition. + + Attributes: + id: Task id. + track: Evaluation track name. + source_case_id: Case id for Stage A Markdown. + type: Task type. + question: Stage B query prompt. + truth: Ground-truth JSON path. + schema_path: Optional JSON schema path. + unordered_paths: Optional order-insensitive JSON paths. + """ @@ class RubManifest(BaseModel): - """RUB manifest container.""" + """RUB manifest container. + + Attributes: + tasks: List of RUB tasks. + """As per coding guidelines: Add Google-style docstrings to all functions and classes.
#49
Summary by CodeRabbit
Documentation
Chores