
Conversation

@harumiWeb
Owner

@harumiWeb harumiWeb commented Jan 29, 2026

#49

Summary by CodeRabbit

  • Documentation

    • Added comprehensive benchmark docs, public reports, and README sections (including charts and interpretation guidance).
  • Chores

    • Introduced a full benchmark workspace with manifests, data fixtures, reporting assets, scripts and tooling to run/reproduce benchmarks.
    • Excluded benchmark files from CI/analysis and updated pre-commit/quality configs.


…elines

- Add example environment configuration file for OpenAI API keys.
- Create .gitignore to exclude unnecessary files and directories.
- Implement Makefile for easy setup and execution of benchmark tasks.
- Write README.md with project overview, requirements, setup instructions, and output details.
- Define project metadata and dependencies in pyproject.toml.
- Set up directory structure and paths for data handling.
- Develop CLI for extracting data from Excel files and querying OpenAI's API.
- Implement evaluation functions to compare model responses against ground truth.
- Create utility functions for file handling and JSON normalization.
- Integrate OpenAI API client for text and image processing.
- Add support for various extraction methods including OpenPyXL, PDF, HTML, and image rendering.
…scoring

- Introduced new directory structure for RUB benchmarks under `benchmark/rub`.
- Added `manifest.json` to define tasks and their expected outputs.
- Implemented `RubTask` and `RubManifest` models for loading and validating the manifest.
- Created normalization functions to ensure consistent comparison of outputs.
- Developed scoring functions to evaluate predictions against ground truth.
- Added truth data for various tasks including forms, flowcharts, and inspection records.
- Included helper modules for managing RUB tasks and their associated data.
- Created RUB summary comparison report for gpt-4o vs gpt-4.1.
- Added detailed benchmark report summarizing extraction accuracy across methods.
- Introduced RUB report for Stage B task accuracy using Markdown-only inputs.
- Added separate RUB reports for gpt-4o and gpt-4.1.
- Implemented publicize scripts (Python and PowerShell) to automate the copying of benchmark results and reports to the public directory.
@harumiWeb harumiWeb self-assigned this Jan 29, 2026
@harumiWeb harumiWeb added the `documentation` label (Improvements or additions to documentation) Jan 29, 2026
@coderabbitai
Contributor

coderabbitai bot commented Jan 29, 2026

📝 Walkthrough

This pull request adds a comprehensive benchmarking subsystem: configuration exclusions, documentation and public reports, a benchmark dataset and manifests, CLI orchestration, extraction pipelines, LLM client and pricing, evaluation/normalization/score utilities (including RUB), reproducibility scripts, and publicization tools.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Configuration & CI**<br>`.codacy.yml`, `.pre-commit-config.yaml`, `codecov.yml`, `pyproject.toml` | Exclude `benchmark/**` from analysis/lint/coverage; add lint/mypy excludes and workspace tweaks. |
| **Top-level Docs / Badges**<br>`README.md`, `README.ja.md`, `docs/README.en.md`, `docs/README.ja.md` | Insert "Benchmark" sections and adjust badges/installation snippets. |
| **Benchmark Project Setup**<br>`benchmark/.env.example`, `benchmark/.gitignore`, `benchmark/Makefile`, `benchmark/pyproject.toml` | Add env example, gitignore, Make targets, and a pyproject for the benchmark subproject and task definitions. |
| **Benchmark Docs & Reports**<br>`benchmark/README.md`, `benchmark/docs/spec.md`, `benchmark/REPORT.md`, `benchmark/public/*.md`, `benchmark/public/reports/*` | Add comprehensive benchmark documentation, the RUB spec, and multiple public-facing report artifacts. |
| **Manifests & Truth Data**<br>`benchmark/data/manifest.json`, `benchmark/data/normalization_rules.json`, `benchmark/data/truth/*`, `benchmark/rub/manifest.json`, `benchmark/rub/manifest_lite.json`, `benchmark/rub/truth*`, `benchmark/rub/truth_lite/*` | Add catalogs of benchmark cases/tasks and 70+ ground-truth JSON fixtures plus per-case normalization rules. |
| **CLI Orchestration**<br>`benchmark/src/bench/cli.py` | New Typer-based CLI implementing extract, ask, markdown, rub_ask, rub_eval, eval, report, report_public and related Pydantic record/result models. |
| **Evaluation Modules**<br>`benchmark/src/bench/eval/*.py` | Add exact_match, normalize, raw_match, markdown_render/score, normalization_rules, a scoring engine, and a report CSV writer; supports normalized, ordered, raw, and markdown metrics. |
| **RUB (Reconstruction Utility Benchmark)**<br>`benchmark/src/bench/rub/*.py` | RUB manifest loader, deterministic payload normalization with unordered-path handling, and RUB scoring (exact & partial). |
| **LLM Integration & Pricing**<br>`benchmark/src/bench/llm/openai_client.py`, `benchmark/src/bench/llm/pricing.py` | OpenAI Responses API wrapper (text/images/markdown helpers), token/cost extraction, and a simple cost estimator. |
| **Extraction Pipelines**<br>`benchmark/src/bench/pipeline/*.py` | Add ExStruct adapter, openpyxl extractor, PDF/HTML conversion, image rendering to PNGs, and common I/O helpers. |
| **Paths, Manifest Models, Utilities**<br>`benchmark/src/bench/paths.py`, `benchmark/src/bench/manifest.py`, `benchmark/src/bench/pipeline/common.py` | Filesystem path constants, Pydantic manifest/Case models, and helper functions (ensure_dir, sha256_text, write_text/json). |
| **Publicization & Reproducibility Scripts**<br>`benchmark/src/bench/report_public.py`, `benchmark/scripts/publicize.*`, `benchmark/scripts/reproduce.*` | Chart/report generation and scripts (py/bash/ps1) to publish public artifacts and reproduce runs. |
| **Small additions**<br>`benchmark/src/bench/rub/__init__.py`, `codecov.yml`, other small files | Package init and coverage exclusion rule additions. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant CLI
  participant Extractors as Extractor Pipelines
  participant LLM as OpenAI Client
  participant Evaluator
  participant Reporter
  rect rgba(100,150,240,0.5)
    User->>CLI: run reproduce / make all / exbench extract
    CLI->>Extractors: extract contexts (exstruct/openpyxl/pdf/html/image)
    Extractors-->>CLI: write extracted contexts (JSON/TXT/PNG)
    CLI->>LLM: ask (prompts/markdown) for cases (text/images)
    LLM-->>CLI: responses + token/cost metadata
    CLI->>Evaluator: eval responses (exact, normalized, raw, markdown, RUB)
    Evaluator-->>CLI: results CSVs and rub_results
    CLI->>Reporter: generate charts and public REPORT.md
    Reporter-->>CLI: save plots and public artifacts
    CLI->>User: finished (public bundle/report ready)
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested labels

enhancement

🐰 A rabbit's ode to benchmarks bright:
With metrics stacked and data deep,
ExStruct's truths run far and wide,
Tests extracted, scored with care,
From Excel's fields to answers clear—
A benchmark suite hops forward strong! 🏃‍♂️📊

🚥 Pre-merge checks: ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Release of initial benchmark' clearly summarizes the main change: introducing the first benchmark release with comprehensive test data, evaluation infrastructure, and documentation. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 83.33%, which is sufficient; the required threshold is 80.00%. |



@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 275d458784


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🤖 Fix all issues with AI agents
In `@benchmark/data/manifest.json`:
- Around line 136-140: Update the manifest entry for id
"food_inspection_record_01": place the missing raw Excel into data/raw using a
filename without spaces (e.g., food_inspection_record_01.xlsx), update the
"xlsx" value to that exact filename, and edit the "question" example JSON to
remove the trailing comma after the last "sheets" entry so the sample is valid
JSON; ensure the referenced "truth" path remains correct after these changes.
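
If it helps, a small consistency check along these lines could catch both issues (filename with spaces, missing raw workbook) before a run. The `data/raw` location and the manifest field names (`cases`, `id`, `xlsx`, `truth`) are taken from this review and the walkthrough above, so treat the sketch as an assumption about the actual layout:

```python
import json
from pathlib import Path

root = Path("benchmark")
manifest = json.loads((root / "data" / "manifest.json").read_text(encoding="utf-8"))

for case in manifest["cases"]:
    xlsx_name = case["xlsx"]
    xlsx = root / "data" / "raw" / xlsx_name  # raw-data location assumed from this comment
    truth = root / case["truth"]              # path resolution relative to benchmark/ assumed
    if " " in xlsx_name:
        print(f"{case['id']}: filename contains spaces: {xlsx_name}")
    if not xlsx.exists():
        print(f"{case['id']}: missing raw workbook: {xlsx}")
    if not truth.exists():
        print(f"{case['id']}: missing truth file: {truth}")
```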

In `@benchmark/rub/manifest.json`:
- Line 8: Update all JSON "truth" values that currently use Windows-style
backslashes (e.g., "rub\\truth\\ffr_425_01.json") to use forward slashes
("rub/truth/ffr_425_01.json") for cross-platform compatibility; locate every
"truth" key in manifest.json (the entries referenced in the review) and replace
all backslashes with forward slashes consistently.
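
A minimal Python sketch of that normalization, assuming the manifest is a top-level object with a `tasks` array whose entries carry a `truth` key (per the walkthrough above):

```python
import json
from pathlib import Path

manifest_path = Path("benchmark/rub/manifest.json")
data = json.loads(manifest_path.read_text(encoding="utf-8"))

# Rewrite Windows-style backslashes in every task's "truth" path to forward slashes.
for task in data.get("tasks", []):
    if isinstance(task.get("truth"), str):
        task["truth"] = task["truth"].replace("\\", "/")

manifest_path.write_text(
    json.dumps(data, ensure_ascii=False, indent=2) + "\n", encoding="utf-8"
)
```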

In `@benchmark/scripts/publicize.ps1`:
- Line 13: The non-venv branch that invokes system Python using the call python
(Join-Path $scriptDir "publicize.py") currently ignores the Python process exit
status; modify that branch to capture and propagate the exit code by running the
same command and then calling exit $LASTEXITCODE (or testing $?) immediately
after the python invocation so the PowerShell script exits with the Python
process status.

In `@benchmark/scripts/publicize.sh`:
- Line 1: The script contains a UTF-8 BOM before the shebang which breaks
interpreter resolution; remove the BOM so the file begins exactly with the
shebang "#!/usr/bin/env bash" (byte 0), re-save the file as UTF-8 without BOM,
and verify the first characters are '#' and '!'—you can do this by opening the
file in a text editor that shows encoding or by rewriting the file without a BOM
so the shebang line in publicize.sh is the very first bytes.
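
A small Python sketch for stripping the BOM instead of re-saving by hand; the file list is illustrative (publicize.sh here, and the same approach applies to the manifest_lite.json BOM mentioned later in this review):

```python
from pathlib import Path

BOM = b"\xef\xbb\xbf"  # UTF-8 byte order mark

for name in ["benchmark/scripts/publicize.sh", "benchmark/rub/manifest_lite.json"]:  # example paths
    path = Path(name)
    raw = path.read_bytes()
    if raw.startswith(BOM):
        # Rewrite the file without the leading BOM so the shebang / opening brace is byte 0.
        path.write_bytes(raw[len(BOM):])
        print(f"stripped BOM from {path}")
```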

In `@benchmark/scripts/reproduce.ps1`:
- Around line 1-10: Move the param() block to be the very first statement in the
script so PowerShell treats $Case/$Method/$Model/$Temperature/$SkipAsk as script
parameters; specifically, place the existing param(...) before the
Set-StrictMode -Version Latest and $ErrorActionPreference = "Stop" lines and
ensure no executable statements precede param() so parameter binding works
correctly.

In `@benchmark/src/bench/eval/markdown_score.py`:
- Around line 62-74: The function _normalized_lines currently only skips lines
that start with "```" but doesn't track fenced code block state, so content
inside code blocks gets included; fix it by adding an in_code_block boolean
(initialized False) and for each line use stripped = raw.strip(), toggle
in_code_block when stripped.startswith("```") and continue (so fence lines
aren't processed), and skip processing (continue) whenever in_code_block is True
before calling _normalize_line; keep existing table-separator and empty-line
checks and append normalized lines as before.
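
A sketch of the suggested control flow; `_normalize_line` and the table-separator check below are stand-ins for whatever markdown_score.py actually uses:

```python
import re

_TABLE_SEPARATOR = re.compile(r"^\|?[\s:|-]+\|?$")  # e.g. | :-- | --: | ; stand-in pattern


def _normalize_line(line: str) -> str:
    """Collapse internal whitespace; stands in for the project's real normalizer."""
    return " ".join(line.split())


def _normalized_lines(text: str) -> list[str]:
    """Collect normalized, scoreable lines, excluding fenced code blocks."""
    lines: list[str] = []
    in_code_block = False
    for raw in text.splitlines():
        stripped = raw.strip()
        if stripped.startswith("```"):
            in_code_block = not in_code_block  # toggle on every fence line
            continue  # fence lines themselves are never scored
        if in_code_block:
            continue  # content inside a code block is excluded from scoring
        if not stripped or _TABLE_SEPARATOR.match(stripped):
            continue
        lines.append(_normalize_line(stripped))
    return lines
```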

In `@benchmark/src/bench/pipeline/image_render.py`:
- Around line 22-33: The code opens a PyMuPDF Document via fitz.open(tmp_pdf)
without closing it; change the usage to use the context manager (e.g., "with
fitz.open(tmp_pdf) as doc:") so the Document is automatically closed after
rendering pages, and move the loop that uses doc.load_page(), page.get_pixmap(),
pix.save(), and paths.append(p) into that with-block to ensure resources are
released.
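
A minimal sketch of the with-block shape, using the standard PyMuPDF calls named in the comment; the function name, DPI handling, and page limit are illustrative rather than the repo's actual code:

```python
from pathlib import Path

import fitz  # PyMuPDF


def render_pdf_to_pngs(tmp_pdf: Path, out_dir: Path, dpi: int = 200, max_pages: int = 6) -> list[Path]:
    """Render up to max_pages pages of a PDF to PNG files, closing the Document."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: list[Path] = []
    zoom = dpi / 72  # PyMuPDF's base resolution is 72 dpi
    with fitz.open(tmp_pdf) as doc:  # Document is closed automatically on exit
        for i in range(min(doc.page_count, max_pages)):
            page = doc.load_page(i)
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            p = out_dir / f"page_{i + 1}.png"
            pix.save(str(p))
            paths.append(p)
    return paths
```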

In `@benchmark/src/bench/pipeline/openpyxl_pandas.py`:
- Around line 13-45: The workbook opened by openpyxl.load_workbook (wb) is never
closed; update the code in openpyxl_pandas.py so wb is closed after
processing—either by using a context manager pattern or wrapping the processing
loop in try/finally and calling wb.close() in the finally block; ensure
wb.close() is invoked even on exceptions so file handles are not leaked
(reference wb and openpyxl.load_workbook in your change).
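
A sketch of the try/finally pattern; the extraction logic in the body is illustrative, and only the resource handling is the point:

```python
from pathlib import Path

import openpyxl


def extract_openpyxl(xlsx_path: Path, out_txt: Path, sheet_scope: list[str] | None = None) -> None:
    """Dump cell values to a text file, closing the workbook even on errors."""
    wb = openpyxl.load_workbook(xlsx_path, data_only=True, read_only=True)
    try:
        lines: list[str] = []
        for ws in wb.worksheets:
            if sheet_scope and ws.title not in sheet_scope:
                continue
            for row in ws.iter_rows(values_only=True):
                lines.append("\t".join("" if v is None else str(v) for v in row))
        out_txt.parent.mkdir(parents=True, exist_ok=True)
        out_txt.write_text("\n".join(lines), encoding="utf-8")
    finally:
        wb.close()  # release the file handle even if processing raised
```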

In `@benchmark/src/bench/pipeline/pdf_text.py`:
- Around line 15-26: The LibreOffice conversion call using subprocess.run(cmd,
check=True) can hang; add a timeout argument (e.g., timeout=60 or a configurable
value) to subprocess.run and wrap the call in a try/except for
subprocess.TimeoutExpired to log/handle the timeout and ensure the stalled
process is terminated and the benchmark fails fast. Update the call that uses
the local cmd list and subprocess.run to include timeout and add handling for
subprocess.TimeoutExpired (log the error and re-raise or raise a clear
exception) so hung soffice conversions cannot stall the run.
- Around line 31-37: The pdf_to_text function opens a PyMuPDF Document with
fitz.open(pdf_path) but never closes it; wrap the Document usage in a context
manager (use "with fitz.open(pdf_path) as doc:" around the existing loop) so the
file handle is closed automatically, preserve building parts list and page
extraction logic inside that with-block, and keep using the same variable name
doc and functions load_page / get_text.
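
For the timeout suggestion above, a sketch of the call shape; the soffice flags and the 60-second value are assumptions, not necessarily what pdf_text.py uses:

```python
import subprocess
from pathlib import Path

SOFFICE_TIMEOUT_S = 60  # suggested default; make configurable if needed


def xlsx_to_pdf(xlsx_path: Path, out_dir: Path) -> None:
    """Convert an XLSX to PDF via headless LibreOffice, failing fast on hangs."""
    cmd = [
        "soffice", "--headless", "--convert-to", "pdf",
        "--outdir", str(out_dir), str(xlsx_path),
    ]
    try:
        subprocess.run(cmd, check=True, timeout=SOFFICE_TIMEOUT_S)
    except subprocess.TimeoutExpired as exc:
        # run() kills the stalled soffice process on timeout; re-raise with context
        # so the benchmark fails fast instead of hanging.
        raise RuntimeError(
            f"LibreOffice conversion timed out after {SOFFICE_TIMEOUT_S}s for {xlsx_path}"
        ) from exc
```

And the text-extraction counterpart with the Document managed by a with-block (again a sketch, not the repo's exact function):

```python
import fitz  # PyMuPDF


def pdf_to_text(pdf_path: str) -> str:
    """Extract plain text from every page, closing the Document automatically."""
    parts: list[str] = []
    with fitz.open(pdf_path) as doc:
        for i in range(doc.page_count):
            page = doc.load_page(i)
            parts.append(page.get_text("text"))
    return "\n".join(parts)
```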

In `@README.md`:
- Around line 22-29: Update the image reference in the README's Benchmark
section: replace the broken path "benchmark/outputs/plots/markdown_quality.png"
with the correct path "benchmark/public/plots/markdown_quality.png" so the
markdown image tag in the Benchmark section points to the existing file; edit
the README.md line that contains the image tag (the "![Benchmark Chart](...)")
to use the corrected path.
🟡 Minor comments (13)
benchmark/.env.example-1-4 (1)

1-4: Clarify which variables are optional.

# optional appears right after OPENAI_API_KEY, which can be read as the API key being optional. Make it explicit that only ORG/PROJECT are optional.

💡 Suggested tweak
 OPENAI_API_KEY=your_key_here
-# optional
+# optional: OPENAI_ORG, OPENAI_PROJECT
 OPENAI_ORG=
 OPENAI_PROJECT=
benchmark/data/truth/food_inspection_record_01.json-3-36 (1)

3-36: Normalize sheet key formatting to avoid mismatched keys.

The keys mix full‑width parentheses and ASCII parentheses with spacing. If sheet names are matched literally during scoring, this inconsistency can cause false mismatches. Consider standardizing to the source workbook naming (e.g., 検食簿(2), 検食簿(3)).

🔧 Suggested normalization
-    "検食簿 (2)": {
+    "検食簿(2)": {
@@
-    "検食簿 (3)": {
+    "検食簿(3)": {
docs/README.en.md-22-28 (1)

22-28: Keep benchmark chart within docs/ for site builds.

The image path points outside the docs tree (../benchmark/...). MkDocs won't bundle files outside docs/, so the chart will fail to render in the built site. The chart currently lives at benchmark/public/plots/markdown_quality.png. Copy it into docs/assets/benchmark/ and update the link.

🛠️ Suggested link update
-![Benchmark Chart](../benchmark/outputs/plots/markdown_quality.png)
+![Benchmark Chart](assets/benchmark/markdown_quality.png)
docs/README.ja.md-20-26 (1)

20-26: Fix benchmark chart image path — file doesn't exist at referenced location.

The image path references ../benchmark/outputs/plots/markdown_quality.png, but the file actually exists at ../benchmark/public/plots/markdown_quality.png (no outputs directory). Additionally, since this path points outside the docs/ directory, MkDocs won't bundle it for the built site, causing a broken image link.

Copy the chart to docs/assets/benchmark/ and update the link:

🛠️ Suggested link update
-![Benchmark Chart](../benchmark/outputs/plots/markdown_quality.png)
+![Benchmark Chart](assets/benchmark/markdown_quality.png)
benchmark/src/bench/pipeline/html_text.py-25-28 (1)

25-28: Add error handling for missing converted file.

If soffice exits successfully but fails to produce the expected .html or .htm file, produced.replace(out_html) will raise a FileNotFoundError with an unclear message. Consider adding an explicit check:

🛡️ Suggested defensive check
     produced = out_html.parent / (xlsx_path.stem + ".html")
     if not produced.exists():
         produced = out_html.parent / (xlsx_path.stem + ".htm")
+    if not produced.exists():
+        raise FileNotFoundError(
+            f"soffice did not produce expected HTML output for {xlsx_path}"
+        )
     produced.replace(out_html)
benchmark/rub/truth_lite/flowchart_01.json-1-14 (1)

1-14: Add trailing newline to match standard file formatting.

Both files have identical JSON content. However, the lite version is missing a trailing newline at the end of the file (line 14). Add a newline to benchmark/rub/truth_lite/flowchart_01.json to match the non-lite version and follow standard text file formatting conventions.

benchmark/public/reports/compare_gpt4o_gpt41.md-4-4 (1)

4-4: Fix markdownlint MD060 table alignment row spacing.

The alignment row is missing the expected spaces around pipes for the configured table style.

🧹 Proposed fix
-|:--|--:|--:|--:|--:|--:|--:|
+| :-- | --: | --: | --: | --: | --: | --: |
benchmark/public/reports/results_report.md-90-98 (1)

90-98: Clarify the sample size discrepancy (n=11 vs n=12).

Line 90 states "n=11, when available" for the positioning deltas, but the scope section at Line 19 mentions "Cases: 12 Excel documents". If one case is intentionally excluded from certain comparisons, consider adding a brief note explaining why (e.g., "n=11; one case excluded due to X").

benchmark/README.md-190-194 (1)

190-194: Wrap bare URLs to satisfy markdownlint.

Line 190-194 contains bare URLs, which triggers MD034. Please wrap them in Markdown links or angle brackets to keep lint clean and consistent.

benchmark/rub/manifest_lite.json-1-1 (1)

1-1: Remove UTF-8 BOM character.

Line 1 contains a UTF-8 BOM (\uFEFF) before the opening brace. While many JSON parsers tolerate this, some tools may fail to parse the file correctly or interpret it as invalid JSON. Consider removing the BOM for maximum compatibility.

benchmark/src/bench/report_public.py-241-243 (1)

241-243: relative_to may raise ValueError if chart paths are not under report_path.parent.

If PLOTS_DIR is not a subdirectory of the report's parent directory, relative_to will fail. Consider using os.path.relpath or wrapping with error handling.

🛡️ Proposed fix using try/except
-    rel_core = chart_paths.core_chart.relative_to(report_path.parent)
-    rel_markdown = chart_paths.markdown_chart.relative_to(report_path.parent)
-    rel_rub = chart_paths.rub_chart.relative_to(report_path.parent)
+    try:
+        rel_core = chart_paths.core_chart.relative_to(report_path.parent)
+        rel_markdown = chart_paths.markdown_chart.relative_to(report_path.parent)
+        rel_rub = chart_paths.rub_chart.relative_to(report_path.parent)
+    except ValueError:
+        # Fallback to absolute paths if not relative
+        rel_core = chart_paths.core_chart
+        rel_markdown = chart_paths.markdown_chart
+        rel_rub = chart_paths.rub_chart
benchmark/src/bench/report_public.py-6-13 (1)

6-13: Backend selection should occur before importing pyplot.

matplotlib.use("Agg") on line 13 is called after matplotlib.pyplot is imported on line 7. This may not take effect or could cause warnings. Move the backend selection before the pyplot import.

🔧 Proposed fix
 from pathlib import Path
 from typing import Iterable
 
 import matplotlib
+matplotlib.use("Agg")
 import matplotlib.pyplot as plt
 import pandas as pd
 from pydantic import BaseModel
 
 from .paths import PLOTS_DIR, PUBLIC_REPORT, RESULTS_DIR, RUB_RESULTS_DIR
-
-matplotlib.use("Agg")
benchmark/src/bench/cli.py-542-545 (1)

542-545: Dead code: client is never None at this point.

The OpenAIResponsesClient() is instantiated unconditionally on line 513, so the if client is None check on line 542 will never be true. This check can be removed.

🔧 Proposed fix
             if use_llm:
-                if client is None:
-                    raise RuntimeError(
-                        "LLM client unavailable for markdown conversion."
-                    )
                 res = client.ask_markdown(
                     model=model, json_text=json_text, temperature=temperature
                 )
🧹 Nitpick comments (33)
benchmark/src/bench/pipeline/exstruct_adapter.py (1)

42-46: Minor: Compute the set outside the comprehension for efficiency.

set(sheet_scope) is recreated on each iteration. Pre-computing it avoids redundant allocations.

♻️ Proposed fix
+    scope_set = set(sheet_scope)
     sheets: dict[str, SheetData] = {
         name: sheet
         for name, sheet in workbook.sheets.items()
-        if name in set(sheet_scope)
+        if name in scope_set
     }
.pre-commit-config.yaml (1)

5-8: Consider excluding benchmark from ruff-format too for consistency.
This avoids formatting benchmark Python files when linting is intentionally skipped.

♻️ Proposed change
       - id: ruff-format
+        exclude: ^benchmark/
benchmark/rub/truth/certificate_of_employment_01.json (1)

1-58: Consider deduplicating identical truth files.

This file appears to be an exact copy of benchmark/data/truth/certificate_of_employment_01.json. If both benchmark tracks require identical truth data, consider using a symlink or shared reference to avoid drift during future maintenance.

benchmark/src/bench/eval/exact_match.py (1)

7-12: Add Google-style docstrings to functions.

Both canonical and exact_match lack docstrings. Per coding guidelines, all functions should have Google-style docstrings.

📝 Proposed docstrings
 def canonical(obj: Any) -> str:
+    """Convert an object to a canonical JSON string for comparison.
+
+    Args:
+        obj: Any JSON-serializable object.
+
+    Returns:
+        A compact JSON string with sorted keys and preserved Unicode.
+
+    Raises:
+        TypeError: If obj contains non-JSON-serializable types.
+    """
     return json.dumps(obj, ensure_ascii=False, sort_keys=True, separators=(",", ":"))


 def exact_match(a: Any, b: Any) -> bool:
+    """Check if two objects are structurally identical.
+
+    Compares objects by their canonical JSON representations.
+
+    Args:
+        a: First object to compare.
+        b: Second object to compare.
+
+    Returns:
+        True if both objects have identical canonical JSON forms.
+    """
     return canonical(a) == canonical(b)

As per coding guidelines: "Add Google-style docstrings to all functions and classes."

benchmark/src/bench/eval/report.py (1)

8-14: Add a Google-style docstring to the function.

Per coding guidelines, all functions should have Google-style docstrings. The implementation logic is sound—the empty rows guard on line 10 correctly handles the edge case.

📝 Suggested docstring addition
 def write_results_csv(rows: list[dict[str, Any]], out_csv: Path) -> None:
+    """Write evaluation results to a CSV file.
+
+    Args:
+        rows: List of dictionaries representing row data. Keys from the first
+            row are used as CSV headers.
+        out_csv: Output path for the CSV file.
+    """
     out_csv.parent.mkdir(parents=True, exist_ok=True)
benchmark/src/bench/llm/pricing.py (1)

16-25: Consider expanding the docstring to Google-style format.

The docstring is brief. Per coding guidelines, Google-style docstrings with Args/Returns sections are preferred.

📝 Suggested docstring enhancement
 def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
-    """Estimate USD cost for a model run when pricing is known."""
+    """Estimate USD cost for a model run when pricing is known.
+
+    Args:
+        model: The model identifier (e.g., "gpt-4o").
+        input_tokens: Number of input tokens consumed.
+        output_tokens: Number of output tokens generated.
+
+    Returns:
+        Estimated cost in USD, or 0.0 if the model's pricing is unknown.
+    """
     pricing = _PRICING_PER_1M.get(model)
benchmark/pyproject.toml (1)

4-4: Update the placeholder description before release.

The description field contains placeholder text "Add your description here". Consider providing a meaningful description for the benchmark package.

📝 Suggested fix
-description = "Add your description here"
+description = "Benchmarking framework for Excel/document extraction evaluation"
benchmark/Makefile (1)

1-20: Consider adding clean and test targets for completeness.

The static analysis flagged missing clean and test phony targets. While not strictly required for this benchmark workflow, adding them would improve the Makefile's completeness:

  • clean: Remove generated artifacts (e.g., __pycache__, .pyc, output directories)
  • test: Run any benchmark validation or unit tests

This is optional if the benchmark workflow doesn't require these operations.

📝 Suggested additions
-.PHONY: setup extract ask eval report all
+.PHONY: setup extract ask eval report all clean test

+clean:
+	find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null || true
+	find . -type f -name "*.pyc" -delete 2>/dev/null || true
+
+test:
+	@echo "No tests defined for benchmark"
+
 setup:
benchmark/src/bench/pipeline/common.py (1)

9-24: Add Google-style docstrings to utility functions.

Per coding guidelines, all functions should have Google-style docstrings. The implementations are clean and correct.

📝 Suggested docstrings
 def ensure_dir(p: Path) -> None:
+    """Create directory and all parent directories if they don't exist.
+
+    Args:
+        p: Path to the directory to create.
+    """
     p.mkdir(parents=True, exist_ok=True)


 def sha256_text(s: str) -> str:
+    """Compute SHA-256 hex digest of a UTF-8 encoded string.
+
+    Args:
+        s: Input string to hash.
+
+    Returns:
+        Hexadecimal SHA-256 digest.
+    """
     return hashlib.sha256(s.encode("utf-8")).hexdigest()


 def write_text(p: Path, text: str) -> None:
+    """Write text to a file, creating parent directories as needed.
+
+    Args:
+        p: Output file path.
+        text: Content to write.
+    """
     ensure_dir(p.parent)
     p.write_text(text, encoding="utf-8")


 def write_json(p: Path, obj: Any) -> None:
+    """Write an object as JSON to a file, creating parent directories as needed.
+
+    Args:
+        p: Output file path.
+        obj: Object to serialize as JSON.
+    """
     ensure_dir(p.parent)
benchmark/src/bench/pipeline/html_text.py (1)

11-28: Add Google-style docstrings to functions.

Per coding guidelines, all functions should have docstrings.

📝 Suggested docstrings
 def xlsx_to_html(xlsx_path: Path, out_html: Path) -> None:
+    """Convert an Excel file to HTML using LibreOffice headless mode.
+
+    Args:
+        xlsx_path: Path to the input Excel file.
+        out_html: Path for the output HTML file.
+
+    Raises:
+        subprocess.CalledProcessError: If soffice conversion fails.
+        FileNotFoundError: If the converted HTML file is not produced.
+    """
     ensure_dir(out_html.parent)
 def html_to_text(html_path: Path, out_txt: Path) -> None:
+    """Extract table content from HTML and write as structured text.
+
+    Args:
+        html_path: Path to the input HTML file.
+        out_txt: Path for the output text file.
+    """
     soup = BeautifulSoup(html_path.read_text(encoding="utf-8", errors="ignore"), "lxml")
benchmark/src/bench/pipeline/openpyxl_pandas.py (1)

10-12: Add a Google-style docstring for extract_openpyxl.

The function lacks the required Args/Returns docstring.

📄 Suggested docstring
 def extract_openpyxl(
     xlsx_path: Path, out_txt: Path, sheet_scope: list[str] | None = None
 ) -> None:
+    """Extract cell contents from an XLSX file via openpyxl.
+
+    Args:
+        xlsx_path: Path to the input XLSX.
+        out_txt: Path to the output text file.
+        sheet_scope: Optional list of sheet names to include.
+    """

As per coding guidelines, Add Google-style docstrings to all functions and classes.

benchmark/src/bench/pipeline/image_render.py (1)

11-17: Convert the function docstring to Google style.

Include explicit Args/Returns sections to satisfy the project standard.

As per coding guidelines, Add Google-style docstrings to all functions and classes.

benchmark/src/bench/pipeline/pdf_text.py (1)

11-28: Add Google-style docstrings for both public functions.

Both xlsx_to_pdf and pdf_to_text are missing the required Args/Returns sections.

As per coding guidelines, Add Google-style docstrings to all functions and classes.

Also applies to: 31-48

benchmark/src/bench/llm/openai_client.py (2)

40-55: Return a Pydantic model instead of a tuple for usage tokens.

This helper returns structured data and should use a BaseModel (e.g., UsageTokens) to align with the project standard; then update call sites to access fields.

As per coding guidelines, Return Pydantic BaseModel instances instead of dictionaries or tuples for structured data.

Also applies to: 97-109


23-25: Avoid Any for OpenAI payloads and raw responses.

Define a JsonValue alias and TypedDicts (or Pydantic models) for content and raw instead of Any, to keep mypy strict and comply with the typing guideline.

As per coding guidelines, Use Any type only at external library boundaries (xlwings, pandas, numpy) and avoid type inference from external library internals.

Also applies to: 125-133
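
Picking up the usage-token suggestion above, a minimal model of the kind proposed could look like this; the field names are illustrative and `_extract_usage` is a hypothetical helper:

```python
from pydantic import BaseModel


class UsageTokens(BaseModel):
    """Token usage extracted from one OpenAI response."""

    input_tokens: int = 0
    output_tokens: int = 0
    total_tokens: int = 0


# Call sites then read named fields instead of unpacking a tuple, e.g.:
# usage = _extract_usage(resp)  # hypothetical helper returning UsageTokens
# cost = estimate_cost_usd(model, usage.input_tokens, usage.output_tokens)
```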

benchmark/src/bench/eval/normalize.py (1)

8-12: Add a Google-style docstring for _strip_code_fences.

This helper is missing the required Google-style docstring.

✍️ Suggested docstring
 def _strip_code_fences(s: str) -> str:
+    """Remove surrounding Markdown code fences.
+
+    Args:
+        s: Raw string that may include Markdown fences.
+
+    Returns:
+        String without leading/trailing fences.
+    """
     s = s.strip()
     s = re.sub(r"^```(json)?\s*", "", s)
     s = re.sub(r"\s*```$", "", s)
     return s.strip()
As per coding guidelines: "Add Google-style docstrings to all functions and classes".
benchmark/src/bench/rub/manifest.py (1)

9-24: Convert class docstrings to Google-style with Attributes.

Both RubTask and RubManifest currently have minimal docstrings; the guidelines call for Google-style class docstrings with attribute descriptions.

✍️ Suggested docstring expansion
 class RubTask(BaseModel):
-    """RUB task definition."""
+    """RUB task definition.
+
+    Attributes:
+        id: Task identifier.
+        source_case_id: Case id for Stage A Markdown.
+        type: Task type.
+        question: Prompt for the task.
+        truth: Path to truth file.
+        schema_path: Optional schema path.
+        unordered_paths: Optional list of unordered paths.
+    """
@@
 class RubManifest(BaseModel):
-    """RUB manifest container."""
+    """RUB manifest container.
+
+    Attributes:
+        tasks: List of RUB tasks.
+    """
As per coding guidelines: "Add Google-style docstrings to all functions and classes".
benchmark/scripts/reproduce.sh (2)

27-30: Add check for .env.example existence before copying.

If .env.example doesn't exist, cp will fail. Consider adding a guard to provide a clearer error message.

🛠️ Suggested improvement
 if [[ ! -f ".env" ]]; then
+  if [[ ! -f ".env.example" ]]; then
+    echo "[reproduce] Warning: .env.example not found, skipping .env setup." >&2
+  else
     echo "[reproduce] Copying .env.example -> .env (remember to set OPENAI_API_KEY)."
     cp .env.example .env
+  fi
 fi

32-35: Consider verifying Python availability before venv creation.

If python is not installed or not in PATH, line 34 will fail with a potentially confusing error. A pre-check would improve the user experience.

🛠️ Suggested improvement
 if [[ ! -d ".venv" ]]; then
+  if ! command -v python &>/dev/null; then
+    echo "[reproduce] Error: python not found in PATH" >&2
+    exit 1
+  fi
   echo "[reproduce] Creating virtual environment."
   python -m venv .venv
 fi
benchmark/data/normalization_rules.json (1)

44-55: Note: certificate_of_employment_01 omits list_object_rules.

This case defines alias_rules but omits the list_object_rules key that other cases include. If this is intentional (e.g., this case doesn't require list normalization), consider adding an empty array for consistency, or document that list_object_rules is optional.

♻️ Optional: Add empty list_object_rules for consistency
     "certificate_of_employment_01": {
       "alias_rules": [
         {
           "canonical": "※本証明書の内容について、就労先事業者等に無断で作成又は改変を行ったときは、刑法上の罪に問われる場合があります。",
           "aliases": [
             "※本証明書の内容について、就労先事業者等に無断で作成し又は改変を行ったときには、刑法上の罪に問われる場合があります。"
           ]
         }
       ],
       "split_rules": [],
-      "composite_rules": []
+      "composite_rules": [],
+      "list_object_rules": []
     },
benchmark/src/bench/eval/markdown_render.py (1)

50-57: Consider capping heading level at 6.

Markdown only supports heading levels 1-6. For deeply nested dictionaries, level can exceed 6, producing invalid headings like #######. Consider capping at 6 or using alternative formatting (e.g., bold text) for deeper nesting.

♻️ Optional: Cap heading level
 def _render_dict(lines: list[str], value: dict[str, Any], *, level: int) -> None:
     """Render a dict as Markdown sections.

     Args:
         lines: List to append output lines to.
         value: Dict to render.
         level: Heading level for keys.
     """
     for key, item in value.items():
-        heading = "#" * max(level, 1)
+        heading = "#" * min(max(level, 1), 6)
         lines.append(f"{heading} {key}")
benchmark/src/bench/eval/raw_match.py (3)

20-22: Add a comment explaining the purpose of the magic string replacement.

The replacement of "窶サ" appears to handle mojibake (garbled encoding), but this isn't documented. A brief inline comment would help maintainers understand why this specific character sequence is removed.

     normalized = unicodedata.normalize("NFKC", text)
+    # Remove common mojibake artifacts from Shift-JIS misinterpretation
     normalized = normalized.replace("窶サ", "")

56-67: Variable items is re-declared in separate branches.

Both if isinstance(value, dict) (line 57) and if isinstance(value, list) (line 64) declare a local variable named items. While valid Python since they're in mutually exclusive branches, this can be confusing during debugging and code navigation.

♻️ Suggested refactor for clarity
     if isinstance(value, dict):
-        items: list[str] = []
+        result: list[str] = []
         if depth > 0 and not parent_is_list:
-            items.extend([str(k) for k in value.keys()])
+            result.extend([str(k) for k in value.keys()])
         for v in value.values():
-            items.extend(_flatten_scalars(v, depth=depth + 1, parent_is_list=False))
-        return items
+            result.extend(_flatten_scalars(v, depth=depth + 1, parent_is_list=False))
+        return result
     if isinstance(value, list):
-        items: list[str] = []
+        result: list[str] = []
         for v in value:
-            items.extend(_flatten_scalars(v, depth=depth + 1, parent_is_list=True))
-        return items
+            result.extend(_flatten_scalars(v, depth=depth + 1, parent_is_list=True))
+        return result

119-127: Consider edge case: both truth and prediction are empty.

When truth_tokens is empty, the function returns 0.0 regardless of whether pred_tokens is also empty. For metrics, comparing two empty sets typically yields perfect coverage (1.0). The current behavior may be intentional to penalize empty truths, but it differs from how score_partial in benchmark/src/bench/rub/score.py handles this case.

♻️ Optional: align with typical metric conventions
     truth_tokens = _dedupe_normalized(_flatten_scalars(truth))
     pred_tokens = _dedupe_normalized(_flatten_scalars(pred))
     if not truth_tokens:
-        return 0.0
+        return 1.0 if not pred_tokens else 0.0
benchmark/scripts/publicize.py (2)

7-15: Add Google-style docstrings to helper functions.

Per coding guidelines, all functions should have Google-style docstrings. These helpers lack documentation.

📝 Suggested docstrings
 def _copy_file(src: Path, dest: Path) -> None:
+    """Copy a file to the destination, creating parent directories as needed.
+
+    Args:
+        src: Source file path.
+        dest: Destination file path.
+    """
     dest.parent.mkdir(parents=True, exist_ok=True)
     shutil.copy2(src, dest)


 def _copy_glob(src_dir: Path, pattern: str, dest_dir: Path) -> None:
+    """Copy all files matching a glob pattern to a destination directory.
+
+    Args:
+        src_dir: Source directory to search.
+        pattern: Glob pattern to match files.
+        dest_dir: Destination directory.
+    """
     for path in src_dir.glob(pattern):
         if path.is_file():
             _copy_file(path, dest_dir / path.name)

18-63: Add docstring to main function.

The main function orchestrates the entire publicization workflow but lacks documentation explaining its purpose and return value.

📝 Suggested docstring
 def main() -> int:
+    """Copy benchmark artifacts to the public directory.
+
+    Copies REPORT.md, plots, and result reports from outputs to the public
+    bundle directory. Generates an INDEX.md describing the bundle contents.
+
+    Returns:
+        Exit code (0 for success).
+    """
     root = Path(__file__).resolve().parents[1]
benchmark/src/bench/manifest.py (2)

9-25: Add Google-style docstrings to Pydantic models.

Per coding guidelines, all classes should have Google-style docstrings. These models lack documentation describing their purpose and attributes.

📝 Suggested docstrings
 class RenderConfig(BaseModel):
+    """Configuration for rendering benchmark cases.
+
+    Attributes:
+        dpi: Dots per inch for rendering. Defaults to 200.
+        max_pages: Maximum number of pages to render. Defaults to 6.
+    """
+
     dpi: int = 200
     max_pages: int = 6


 class Case(BaseModel):
+    """A single benchmark case definition.
+
+    Attributes:
+        id: Unique identifier for the case.
+        type: Type classification of the case.
+        xlsx: Path to the Excel source file.
+        question: Extraction prompt or question.
+        truth: Path to the ground-truth JSON file.
+        sheet_scope: Optional list of sheet names to include.
+        render: Rendering configuration.
+    """
+
     id: str
     ...


 class Manifest(BaseModel):
+    """Collection of benchmark cases.
+
+    Attributes:
+        cases: List of benchmark case definitions.
+    """
+
     cases: list[Case]

28-30: Add docstring and consider using Pydantic v2's model_validate.

The function lacks a docstring. Additionally, Manifest.model_validate(data) provides better validation error messages in Pydantic v2 compared to Manifest(**data).

📝 Suggested improvements
 def load_manifest(path: Path) -> Manifest:
+    """Load a benchmark manifest from a JSON file.
+
+    Args:
+        path: Path to the manifest JSON file.
+
+    Returns:
+        Parsed Manifest instance.
+
+    Raises:
+        ValidationError: If the JSON structure is invalid.
+    """
     data = json.loads(path.read_text(encoding="utf-8"))
-    return Manifest(**data)
+    return Manifest.model_validate(data)
benchmark/src/bench/eval/markdown_score.py (1)

20-38: Consider edge case: both truth and prediction are empty.

Similar to raw_match.py, when truth_lines is empty, the function returns 0.0 regardless of pred_lines. For consistency with score_partial in the RUB scoring module, consider returning 1.0 when both are empty.

benchmark/src/bench/rub/normalize.py (1)

50-54: Consider supporting additional numeric formats.

The regex patterns only match simple integers and decimals. Scientific notation (e.g., 1.5e-10), numbers with leading zeros, or comma-separated numbers won't be parsed as numeric values. This may be intentional for benchmark determinism.

benchmark/src/bench/eval/score.py (1)

51-53: Duplicate helper: _strip_circled_numbers exists in normalization_rules.py.

This function is already defined in benchmark/src/bench/eval/normalization_rules.py (lines 74-76). Consider importing it from there to avoid code duplication.

♻️ Suggested refactor
 from .normalization_rules import (
     ListObjectRule,
     NormalizationRules,
     RuleIndex,
     build_rule_index,
     normalize_label,
+    _strip_circled_numbers,
 )
-
-
-def _strip_circled_numbers(text: str) -> str:
-    """Remove circled-number characters (e.g., ①②) for robust matching."""
-    return "".join(ch for ch in text if unicodedata.category(ch) != "No")
benchmark/src/bench/eval/normalization_rules.py (1)

114-119: Consider adding error handling for malformed JSON.

If the JSON file exists but contains invalid JSON or doesn't match the expected schema, json.loads or Pydantic validation will raise exceptions. Consider catching and providing a clearer error message.

🛡️ Optional: Add error handling
 def load_ruleset(path: Path) -> NormalizationRuleset:
     """Load normalization ruleset from JSON file."""
     if not path.exists():
         return NormalizationRuleset()
-    payload = json.loads(path.read_text(encoding="utf-8"))
-    return NormalizationRuleset(**payload)
+    try:
+        payload = json.loads(path.read_text(encoding="utf-8"))
+        return NormalizationRuleset(**payload)
+    except (json.JSONDecodeError, ValidationError) as exc:
+        raise ValueError(f"Invalid normalization rules at {path}: {exc}") from exc
benchmark/src/bench/cli.py (1)

573-574: Consider using more specific exception types.

Bare except Exception catches all exceptions including KeyboardInterrupt. For JSON parsing and LLM calls, consider catching specific exceptions like json.JSONDecodeError, ValueError, or library-specific exceptions for better error diagnostics.

Also applies to: 670-671, 753-754, 808-809, 982-983
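
A sketch of the narrower handling for the JSON-parsing case; the helper name is hypothetical:

```python
import json
from typing import Any


def parse_model_json(raw: str) -> Any:
    """Parse a model response as JSON, keeping unrelated failures visible."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # Only malformed output is handled here; KeyboardInterrupt and genuine
        # bugs still propagate, unlike a bare `except Exception`.
        raise ValueError(f"model returned invalid JSON at position {exc.pos}: {exc.msg}") from exc
```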


Copilot AI left a comment


Pull request overview

This PR adds an initial benchmark infrastructure for evaluating Excel document extraction methods for RAG/LLM preprocessing. The benchmark compares ExStruct against baseline methods (PDF, HTML, OpenPyXL, image-based VLM) using GPT-4o, measuring extraction accuracy, raw coverage/precision, and downstream reconstruction utility (RUB).

Changes:

  • Added comprehensive benchmark infrastructure with evaluation scripts, scoring logic, and LLM integration
  • Configured exclusions for the benchmark directory in code quality tools (ruff, mypy, codecov, pre-commit)
  • Added benchmark documentation and public reports with charts showing comparative results (n=12 cases)

Reviewed changes

Copilot reviewed 91 out of 99 changed files in this pull request and generated 9 comments.

Show a summary per file
| File | Description |
| --- | --- |
| pyproject.toml | Added benchmark workspace member and tool exclusions |
| codecov.yml | Excluded benchmark directory from coverage |
| .pre-commit-config.yaml, .codacy.yml | Excluded benchmark from linting |
| README.md, docs/README.*.md | Added benchmark section with chart references |
| benchmark/* | New benchmark package with CLI, evaluation, scoring, and reporting |
| benchmark/rub/* | RUB (Reconstruction Utility Benchmark) manifests and truth data |
| benchmark/scripts/* | Reproducibility and publication scripts |
| benchmark/data/* | Test case manifests, truth files, normalization rules |


@harumiWeb
Owner Author

We will keep operating with the data private (the raw data stays in .gitignore). The review comments have been addressed locally:

  • Removed BOMs (publicize.sh/.ps1, benchmark/REPORT.md, public/REPORT.md, manifest_lite.json)
  • Moved the param() block in reproduce.ps1 and propagated the exit code in publicize.ps1
  • Closed the PyMuPDF documents (image_render.py / pdf_text.py) and added a soffice timeout
  • Closed the openpyxl workbook (openpyxl_pandas.py)
  • Excluded fenced code blocks in markdown_score
  • Added an explicit error when OPENAI_API_KEY is not set
  • Unified the paths in rub/manifest.json to use "/", added RubTask.track and reflected it in the manifest
  • Fixed the benchmark chart path in the README to public/plots
  • Fixed the mojibake in spec.md

The comment about the unpublished data (food_inspection_record_01.xlsx) will not be addressed in this PR, since we are keeping that file private. If needed, I will add a note to README / benchmark/README describing how to obtain it or stating that it is not public.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@benchmark/docs/spec.md`:
- Around line 96-105: Update the manifest field list in the spec to match the
actual loader/schema: replace or supplement the outdated fields (xlsx,
sheet_scope, render) with the current names and descriptions used by the
manifests/loader—specifically document source_case_id (original input/source
identifier), schema_path (path to ground-truth JSON/schema), unordered_paths
(list of file paths where order doesn't matter), and ensure track, id, type,
question, and truth remain accurately described; remove or mark deprecated
xlsx/sheet_scope/render to avoid confusion and keep field meanings consistent
with the loader code.
- Around line 75-94: The fenced code block showing the directory tree in
benchmark/docs/spec.md is missing a language tag, triggering markdownlint;
update the opening fence (the triple backticks that start the block containing
"benchmark/  rub/  README.md ...") to include a language token such as text
(e.g., replace ``` with ```text) so the block is explicitly labeled without
changing the block contents.
🧹 Nitpick comments (4)
benchmark/src/bench/pipeline/image_render.py (1)

11-17: Use Google‑style docstring format.
Current docstring doesn’t follow the required Google-style structure (Args/Returns), which is mandated by the repo guidelines.

📄 Suggested docstring update
 def xlsx_to_pngs_via_pdf(
     xlsx_path: Path, out_dir: Path, dpi: int = 200, max_pages: int = 6
 ) -> list[Path]:
-    """
-    xlsx -> pdf (LibreOffice) -> png (PyMuPDF render)
-    画像は VLM 入力に使う。OCRはしない。
-    """
+    """Convert an XLSX to PNG images via a temporary PDF.
+
+    Args:
+        xlsx_path: Source XLSX file path.
+        out_dir: Output directory for PNG files.
+        dpi: Render DPI for PDF pages.
+        max_pages: Max number of pages to render.
+
+    Returns:
+        List of generated PNG file paths.
+    """
As per coding guidelines, “Add Google-style docstrings to all functions and classes.”
benchmark/src/bench/llm/openai_client.py (1)

107-114: Consider handling potential JSON decode errors.

resp.model_dump_json() should return valid JSON, but wrapping json.loads() in a try-except could prevent unexpected failures if the SDK returns malformed data.

benchmark/REPORT.md (1)

1-84: Content duplicated with benchmark/public/REPORT.md.

This report is nearly identical to benchmark/public/REPORT.md. Consider whether both files are needed or if one could reference the other to reduce maintenance burden. If both are intentional (working vs. public versions), this is acceptable.

benchmark/src/bench/rub/manifest.py (1)

9-25: Add Google‑style class docstrings (with Attributes).

These are Pydantic models; documenting fields improves clarity and aligns with the docstring guideline.

🛠️ Suggested fix
 class RubTask(BaseModel):
-    """RUB task definition."""
+    """RUB task definition.
+
+    Attributes:
+        id: Task id.
+        track: Evaluation track name.
+        source_case_id: Case id for Stage A Markdown.
+        type: Task type.
+        question: Stage B query prompt.
+        truth: Ground-truth JSON path.
+        schema_path: Optional JSON schema path.
+        unordered_paths: Optional order-insensitive JSON paths.
+    """
@@
 class RubManifest(BaseModel):
-    """RUB manifest container."""
+    """RUB manifest container.
+
+    Attributes:
+        tasks: List of RUB tasks.
+    """

As per coding guidelines: Add Google-style docstrings to all functions and classes.

@harumiWeb harumiWeb merged commit f7f9a01 into main Jan 29, 2026
9 checks passed
@harumiWeb harumiWeb deleted the benchmark branch January 29, 2026 13:42