Merged

38 commits
98286fa
feat: Initialize benchmark project with extraction and evaluation pip…
harumiWeb Jan 24, 2026
f515153
feat: Add manifest and truth data for application forms, flowcharts, …
harumiWeb Jan 24, 2026
09164a4
fix
harumiWeb Jan 24, 2026
1f0924d
feat: Update Makefile and README for exstruct installation; enhance p…
harumiWeb Jan 24, 2026
ed3d7ea
fix: Correct flowchart ID and file paths in manifest.json
harumiWeb Jan 24, 2026
32bf771
feat: Add taskipy as a development dependency and update task definit…
harumiWeb Jan 24, 2026
3d01848
fix
harumiWeb Jan 24, 2026
c9a1464
feat: Update LLM client and CLI to support temperature parameter for …
harumiWeb Jan 24, 2026
ab8a2a3
feat: Update manifest and truth files for improved data extraction; a…
harumiWeb Jan 24, 2026
a40b4b1
feat: Add tax report case to manifest and corresponding truth data
harumiWeb Jan 24, 2026
a16086a
feat: Enhance scoring functions with normalization and support for ne…
harumiWeb Jan 24, 2026
a061e57
feat: Add SmartArt organization chart case to manifest with correspon…
harumiWeb Jan 24, 2026
522bb90
feat: Refactor extraction process to use ExStructEngine for improved …
harumiWeb Jan 24, 2026
1d466a7
feat: Add basic document case to manifest with corresponding truth data
harumiWeb Jan 24, 2026
183b81e
feat: Add total cost and call count tracking to ask function
harumiWeb Jan 24, 2026
3cad08a
feat: Update tax report question and truth data structure for improve…
harumiWeb Jan 24, 2026
f00b408
feat: Add normalization rules and scoring enhancements for improved e…
harumiWeb Jan 24, 2026
349f622
feat: Add alias rules for certificate of employment to normalization …
harumiWeb Jan 24, 2026
10bb9da
feat: Enhance benchmark report with interpretation guidelines for acc…
harumiWeb Jan 24, 2026
5ce4696
feat: Move summary output to the end of the report function for bette…
harumiWeb Jan 24, 2026
de3acfd
feat: Add evaluation protocol to README and report function for repro…
harumiWeb Jan 24, 2026
0a666c6
feat: Add reproducibility scripts for Windows PowerShell and macOS/Linux
harumiWeb Jan 24, 2026
9cb9571
feat: Add normalization rules and truth data for heatstroke and workf…
harumiWeb Jan 24, 2026
6681d84
feat: Add raw evaluation metrics and update README for new evaluation…
harumiWeb Jan 24, 2026
8fec6f5
fix: Format JSON structure for better readability and consistency
harumiWeb Jan 24, 2026
417da57
feat: Add Markdown conversion functionality and evaluation metrics fo…
harumiWeb Jan 25, 2026
55feb05
feat: Add food inspection record data and enhance Markdown evaluation…
harumiWeb Jan 25, 2026
5813c2c
feat: Add RUB specification document for Reconstruction Utility Bench…
harumiWeb Jan 26, 2026
a84535d
Add RUB (Reconstruction Utility Benchmark) support with manifest and …
harumiWeb Jan 26, 2026
dc05390
feat: Add RUB lite support with manifest and evaluation tasks
harumiWeb Jan 26, 2026
522c902
feat: Enhance Markdown functionality with full-document generation an…
harumiWeb Jan 27, 2026
f48afd1
feat: Refactor cost estimation to use a pricing dictionary for model …
harumiWeb Jan 27, 2026
17780b9
feat: Add public report generation with charts and update functionality
harumiWeb Jan 29, 2026
e582213
Add benchmark reports and publicize scripts
harumiWeb Jan 29, 2026
cbd3aba
feat: Add note about initial benchmark and future expansion
harumiWeb Jan 29, 2026
275d458
feat: Add benchmark section with reports and charts to documentation
harumiWeb Jan 29, 2026
552977d
feat: Exclude benchmark directory from coverage and linting checks
harumiWeb Jan 29, 2026
53890f1
fix: Update benchmark chart paths in documentation and scripts for co…
harumiWeb Jan 29, 2026
2 changes: 2 additions & 0 deletions .codacy.yml
@@ -0,0 +1,2 @@
exclude_paths:
- "benchmark/**"
2 changes: 2 additions & 0 deletions .pre-commit-config.yaml
@@ -3,6 +3,7 @@ repos:
rev: v0.4.5
hooks:
- id: ruff
exclude: ^benchmark/
- id: ruff-format

- repo: https://github.com/pre-commit/mirrors-mypy
@@ -12,3 +13,4 @@ repos:
additional_dependencies:
- pydantic>=2.0.0
- types-PyYAML
exclude: ^benchmark/
11 changes: 10 additions & 1 deletion README.ja.md
@@ -1,6 +1,6 @@
# ExStruct — Excel Structured Extraction Engine

[![PyPI version](https://badge.fury.io/py/exstruct.svg)](https://pypi.org/project/exstruct/) [![PyPI Downloads](https://static.pepy.tech/personalized-badge/exstruct?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=GREEN&left_text=downloads)](https://pepy.tech/projects/exstruct) ![Licence: BSD-3-Clause](https://img.shields.io/badge/license-BSD--3--Clause-blue?style=flat-square) [![pytest](https://github.com/harumiWeb/exstruct/actions/workflows/pytest.yml/badge.svg)](https://github.com/harumiWeb/exstruct/actions/workflows/pytest.yml) [![Codacy Badge](https://app.codacy.com/project/badge/Grade/e081cb4f634e4175b259eb7c34f54f60)](https://app.codacy.com/gh/harumiWeb/exstruct/dashboard?utm_source=gh&utm_medium=referral&utm_content=&utm_campaign=Badge_grade) [![codecov](https://codecov.io/gh/harumiWeb/exstruct/graph/badge.svg?token=2XI1O8TTA9)](https://codecov.io/gh/harumiWeb/exstruct)
[![PyPI version](https://badge.fury.io/py/exstruct.svg)](https://pypi.org/project/exstruct/) [![PyPI Downloads](https://static.pepy.tech/personalized-badge/exstruct?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=GREEN&left_text=downloads)](https://pepy.tech/projects/exstruct) ![Licence: BSD-3-Clause](https://img.shields.io/badge/license-BSD--3--Clause-blue?style=flat-square) [![pytest](https://github.com/harumiWeb/exstruct/actions/workflows/pytest.yml/badge.svg)](https://github.com/harumiWeb/exstruct/actions/workflows/pytest.yml) [![Codacy Badge](https://app.codacy.com/project/badge/Grade/e081cb4f634e4175b259eb7c34f54f60)](https://app.codacy.com/gh/harumiWeb/exstruct/dashboard?utm_source=gh&utm_medium=referral&utm_content=&utm_campaign=Badge_grade) [![codecov](https://codecov.io/gh/harumiWeb/exstruct/graph/badge.svg?token=2XI1O8TTA9)](https://codecov.io/gh/harumiWeb/exstruct) [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/harumiWeb/exstruct)

![ExStruct Image](docs/assets/icon.webp)

@@ -17,6 +17,15 @@ ExStruct reads Excel workbooks and outputs structured data (…
- **CLI rendering** (Excel required): can generate PDFs and per-sheet images.
- **Graceful fallback**: if Excel COM is unavailable, the process does not crash and falls back to cells + table candidates + print areas (shapes and charts are empty).

## Benchmark

![Benchmark Chart](benchmark/public/plots/markdown_quality.png)

This repository includes benchmark reports focused on RAG/LLM preprocessing of Excel documents.
We track two perspectives: (1) core extraction accuracy and (2) reconstruction utility for downstream structure queries (RUB).
See `benchmark/REPORT.md` for the working summary and `benchmark/public/REPORT.md` for the public bundle.
Current results are based on n=12 cases and will be expanded.

## Installation

```bash
11 changes: 10 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# ExStruct — Excel Structured Extraction Engine

[![PyPI version](https://badge.fury.io/py/exstruct.svg)](https://pypi.org/project/exstruct/) [![PyPI Downloads](https://static.pepy.tech/personalized-badge/exstruct?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=GREEN&left_text=downloads)](https://pepy.tech/projects/exstruct) ![Licence: BSD-3-Clause](https://img.shields.io/badge/license-BSD--3--Clause-blue?style=flat-square) [![pytest](https://github.com/harumiWeb/exstruct/actions/workflows/pytest.yml/badge.svg)](https://github.com/harumiWeb/exstruct/actions/workflows/pytest.yml) [![Codacy Badge](https://app.codacy.com/project/badge/Grade/e081cb4f634e4175b259eb7c34f54f60)](https://app.codacy.com/gh/harumiWeb/exstruct/dashboard?utm_source=gh&utm_medium=referral&utm_content=&utm_campaign=Badge_grade) [![codecov](https://codecov.io/gh/harumiWeb/exstruct/graph/badge.svg?token=2XI1O8TTA9)](https://codecov.io/gh/harumiWeb/exstruct)
[![PyPI version](https://badge.fury.io/py/exstruct.svg)](https://pypi.org/project/exstruct/) [![PyPI Downloads](https://static.pepy.tech/personalized-badge/exstruct?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=GREEN&left_text=downloads)](https://pepy.tech/projects/exstruct) ![Licence: BSD-3-Clause](https://img.shields.io/badge/license-BSD--3--Clause-blue?style=flat-square) [![pytest](https://github.com/harumiWeb/exstruct/actions/workflows/pytest.yml/badge.svg)](https://github.com/harumiWeb/exstruct/actions/workflows/pytest.yml) [![Codacy Badge](https://app.codacy.com/project/badge/Grade/e081cb4f634e4175b259eb7c34f54f60)](https://app.codacy.com/gh/harumiWeb/exstruct/dashboard?utm_source=gh&utm_medium=referral&utm_content=&utm_campaign=Badge_grade) [![codecov](https://codecov.io/gh/harumiWeb/exstruct/graph/badge.svg?token=2XI1O8TTA9)](https://codecov.io/gh/harumiWeb/exstruct) [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/harumiWeb/exstruct)

![ExStruct Image](docs/assets/icon.webp)

@@ -19,6 +19,15 @@ ExStruct reads Excel workbooks and outputs structured data (cells, table candida…
- **CLI rendering** (Excel required): optional PDF and per-sheet PNGs.
- **Graceful fallback**: if Excel COM is unavailable, extraction falls back to cells + table candidates without crashing.

## Benchmark

![Benchmark Chart](benchmark/public/plots/markdown_quality.png)

This repository includes benchmark reports focused on RAG/LLM preprocessing of Excel documents.
We track two perspectives: (1) core extraction accuracy and (2) reconstruction utility for downstream structure queries (RUB).
See `benchmark/REPORT.md` for the working summary and `benchmark/public/REPORT.md` for the public bundle.
Current results are based on n=12 cases and will be expanded.

## Installation

```bash
4 changes: 4 additions & 0 deletions benchmark/.env.example
@@ -0,0 +1,4 @@
OPENAI_API_KEY=your_key_here
# optional
OPENAI_ORG=
OPENAI_PROJECT=
15 changes: 15 additions & 0 deletions benchmark/.gitignore
@@ -0,0 +1,15 @@
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
drafts/
wheels/
*.egg-info

# Virtual environments
.venv
data/raw/
*.log
outputs/
.env
20 changes: 20 additions & 0 deletions benchmark/Makefile
@@ -0,0 +1,20 @@
.PHONY: setup extract ask eval report all

setup:
python -m pip install -U pip
pip install -e ..
pip install -e .

extract:
exbench extract --case all --method all

ask:
exbench ask --case all --method all --model gpt-4o

eval:
exbench eval --case all --method all

report:
exbench report

all: extract ask eval report
194 changes: 194 additions & 0 deletions benchmark/README.md
@@ -0,0 +1,194 @@
# ExStruct Benchmark

This benchmark compares methods for answering questions about Excel documents using GPT-4o:

- exstruct
- openpyxl
- pdf (xlsx -> pdf -> text)
- html (xlsx -> html -> table text)
- image_vlm (xlsx -> pdf -> png -> GPT-4o vision)

## Requirements

- Python 3.11+
- LibreOffice (`soffice` in PATH)
- OPENAI_API_KEY in `.env`

## Setup

```bash
cd benchmark
cp .env.example .env
pip install -e .. # install exstruct from repo root
pip install -e .
```

## Run

```bash
make all
```

## Reproducibility script (Windows PowerShell)

```powershell
.\scripts\reproduce.ps1
```

Options:

- `-Case` (default: `all`)
- `-Method` (default: `all`)
- `-Model` (default: `gpt-4o`)
- `-Temperature` (default: `0.0`)
- `-SkipAsk` (skip LLM calls; uses existing responses)

## Reproducibility script (macOS/Linux)

```bash
./scripts/reproduce.sh
```

If you see a permission error, run:

```bash
chmod +x ./scripts/reproduce.sh
```

Options:

- `--case` (default: `all`)
- `--method` (default: `all`)
- `--model` (default: `gpt-4o`)
- `--temperature` (default: `0.0`)
- `--skip-ask` (skip LLM calls; uses existing responses)

Outputs:

- outputs/extracted/\* : extracted context (text or images)
- outputs/prompts/\*.jsonl
- outputs/responses/\*.jsonl
- outputs/markdown/\*/\*.md
- outputs/markdown/responses/\*.jsonl
- outputs/results/results.csv
- outputs/results/report.md

## Public report (REPORT.md)

Generate chart images and update `REPORT.md` in the benchmark root:

```bash
python -m bench.cli report-public
```

This command writes plots under `outputs/plots/` and inserts them into
`REPORT.md` between the chart markers.

## Public bundle (for publishing)

Create a clean, shareable bundle under `benchmark/public/`:

```bash
python scripts/publicize.py
```

Windows PowerShell:

```powershell
.\scripts\publicize.ps1
```

## Markdown conversion (optional)

Generate Markdown from the latest JSON responses:

```bash
python -m bench.cli markdown --case all --method all
```

Markdown scores (`score_md`, `score_md_precision`) are only computed when
Markdown outputs exist under `outputs/markdown/responses/`.

If you want a deterministic renderer without LLM calls:

```bash
python -m bench.cli markdown --case all --method all --use-llm false
```
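
The deterministic renderer is not documented in detail here. As a minimal illustrative sketch, assuming extracted tables arrive as simple header/row dictionaries (an assumption, not the actual `bench.cli` data model), a non-LLM Markdown table renderer could look like this:

```python
# Illustrative sketch only -- the real bench.cli renderer may differ.
# Assumes an extracted table is a dict with "headers" and "rows" keys.
from typing import Any


def render_table_markdown(table: dict[str, Any]) -> str:
    """Render a simple extracted table as a GitHub-flavored Markdown table."""
    headers = [str(h) for h in table.get("headers", [])]
    rows = table.get("rows", [])
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        cells = [str(c) if c is not None else "" for c in row]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    example = {"headers": ["Item", "Qty"], "rows": [["Apple", 3], ["Pear", None]]}
    print(render_table_markdown(example))
```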

## RUB (lite)

RUB lite evaluates reconstruction utility using Markdown-only inputs.

Run Stage B tasks with the lite manifest:

```bash
python -m bench.cli rub-ask --task all --method all --manifest rub/manifest_lite.json
python -m bench.cli rub-eval --manifest rub/manifest_lite.json
python -m bench.cli rub-report
```

Outputs:

- outputs/rub/results/rub_results.csv
- outputs/rub/results/report.md
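
The RUB scoring details live in `bench.cli rub-eval`; as a rough sketch only, a selection-style metric over document paths might be computed as a set-overlap F1 like the following (the function and its inputs are illustrative assumptions, not the benchmark's actual definition of RUS or Partial F1):

```python
# Hypothetical sketch of a set-overlap F1 over selected paths.
# The actual metric computed by bench.cli rub-eval may be defined differently.


def path_selection_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 of predicted vs. gold path sets; returns 0.0 when either side is empty."""
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pred = {"Sheet1/Table1", "Sheet1/Header"}
    gold = {"Sheet1/Table1", "Sheet1/Footer"}
    print(f"F1 = {path_selection_f1(pred, gold):.3f}")
```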

## Evaluation protocol (public)

To ensure reproducibility and fair comparison, follow these fixed settings:

- Model: gpt-4o (Responses API)
- Temperature: 0.0
- Prompt: fixed in `bench/llm/openai_client.py`
- Input contexts: generated by `bench.cli extract` using the same sources for all methods
- Normalization: optional normalized track uses `data/normalization_rules.json`
- Evaluation: `bench.cli eval` produces Exact, Normalized, Raw, and Markdown scores
- Report: `bench.cli report` generates `report.md` and per-case detailed reports

Recommended disclosure when publishing results:

- Model name + version, temperature, and date of run
- Full `normalization_rules.json` used for normalized scores
- Cost/token estimation method
- Any skipped cases and the reason (missing files, extraction failures)
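
The exact schema of `data/normalization_rules.json` is not reproduced in this README. Purely as an illustration, assuming an alias-map layout (a guess, not the file's real format), applying such rules before comparison might look like:

```python
# Hypothetical illustration only: the real data/normalization_rules.json schema
# and the normalization logic in bench.cli eval may differ.
import json
import unicodedata
from pathlib import Path


def normalize_value(value: str, aliases: dict[str, str]) -> str:
    """Normalize width/case/whitespace, then map known aliases to a canonical label."""
    text = unicodedata.normalize("NFKC", value).strip().lower()
    return aliases.get(text, text)


def load_aliases(path: Path) -> dict[str, str]:
    # Assumed shape: {"aliases": {"variant label": "canonical label", ...}}
    rules = json.loads(path.read_text(encoding="utf-8"))
    return {k.lower(): v for k, v in rules.get("aliases", {}).items()}


if __name__ == "__main__":
    rules_path = Path("data/normalization_rules.json")
    fallback = {"employment cert.": "certificate of employment"}
    aliases = load_aliases(rules_path) if rules_path.exists() else fallback
    print(normalize_value(" Employment Cert. ", aliases))
```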

## How to interpret results (public guide)

This benchmark reports four evaluation tracks to keep comparisons fair:

- Exact: strict string match with no normalization.
- Normalized: applies case-specific rules in `data/normalization_rules.json` to
absorb formatting differences (aliases, split/composite labels).
- Raw: loose coverage/precision over flattened text tokens (schema-agnostic),
intended to reflect raw data capture without penalizing minor label variations.
- Markdown: coverage/precision against canonical Markdown rendered from truth.

Recommended interpretation:

- Use **Exact** to compare end-to-end string fidelity (best for literal extraction).
- Use **Normalized** to compare **document understanding** across methods.
- Use **Raw** to compare how much ground-truth text is captured regardless of schema.
- Use **Markdown** to evaluate JSON-to-Markdown conversion quality.
- When methods disagree between tracks, favor Normalized for Excel-heavy layouts
where labels are split/merged or phrased differently.
- Always cite both accuracy and cost metrics when presenting results publicly.

## Evaluation

The evaluator now writes four tracks:

- Exact: `score`, `score_ordered` (strict string match, current behavior)
- Normalized: `score_norm`, `score_norm_ordered` (applies case-specific rules)
- Raw: `score_raw`, `score_raw_precision` (loose coverage/precision)
- Markdown: `score_md`, `score_md_precision` (Markdown coverage/precision)

Normalization rules live in `data/normalization_rules.json` and are applied in
`bench.cli eval`. Publish these rules alongside the benchmark to keep the
normalized track transparent and reproducible.
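
For intuition only, a loose token-overlap coverage/precision in the spirit of `score_raw` / `score_raw_precision` could be sketched as below; the tokenization and weighting actually used by `bench.cli eval` may differ:

```python
# Minimal sketch of a loose coverage/precision score over flattened tokens.
# The actual score_raw / score_raw_precision logic in bench.cli eval may differ.
import re


def tokens(text: str) -> list[str]:
    """Lowercased word/number tokens, ignoring punctuation and layout."""
    return re.findall(r"[\w\.]+", text.lower())


def raw_coverage_precision(truth: str, answer: str) -> tuple[float, float]:
    truth_tokens = set(tokens(truth))
    answer_tokens = set(tokens(answer))
    if not truth_tokens or not answer_tokens:
        return 0.0, 0.0
    overlap = truth_tokens & answer_tokens
    coverage = len(overlap) / len(truth_tokens)    # how much ground truth was captured
    precision = len(overlap) / len(answer_tokens)  # how much of the answer is grounded
    return coverage, precision


if __name__ == "__main__":
    cov, prec = raw_coverage_precision("Total amount: 1,200 JPY", "amount 1,200 jpy (total)")
    print(f"coverage={cov:.2f} precision={prec:.2f}")
```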

## Notes

- GPT-4o Responses API supports text and image inputs. See docs:
- [https://platform.openai.com/docs/api-reference/responses](https://platform.openai.com/docs/api-reference/responses)
- [https://platform.openai.com/docs/guides/images-vision](https://platform.openai.com/docs/guides/images-vision)
- Pricing for gpt-4o used in cost estimation:
- [https://platform.openai.com/docs/models/compare?model=gpt-4o](https://platform.openai.com/docs/models/compare?model=gpt-4o)
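
As a rough sketch of how a pricing-dictionary cost estimate can be derived from token counts (the rates below are placeholders and the benchmark's actual estimator may differ; always check the pricing page above):

```python
# Illustrative cost estimator; rates are placeholders, not official pricing.
# Check https://platform.openai.com/docs/models/compare?model=gpt-4o for current rates.

# Assumed USD cost per 1M tokens (placeholder values).
PRICING_PER_1M = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate a single call's USD cost from token counts and the pricing table."""
    rates = PRICING_PER_1M[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


if __name__ == "__main__":
    print(f"${estimate_cost('gpt-4o', input_tokens=12_000, output_tokens=800):.4f}")
```
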
84 changes: 84 additions & 0 deletions benchmark/REPORT.md
@@ -0,0 +1,84 @@
# Benchmark Summary (Public)

This summary consolidates the latest results for the Excel document benchmark and
RUB (structure query track). Use this file as a public-facing overview and link to
the full reports for reproducibility.

Sources:
- outputs/results/report.md (core benchmark)
- outputs/rub/results/report.md (RUB structure_query)
<!-- CHARTS_START -->
## Charts

![Core Benchmark Summary](outputs/plots/core_benchmark.png)
![Markdown Evaluation Summary](outputs/plots/markdown_quality.png)
![RUB Structure Query Summary](outputs/plots/rub_structure_query.png)
<!-- CHARTS_END -->
## Scope

- Cases: 12 Excel documents
- Methods: exstruct, openpyxl, pdf, html, image_vlm
- Model: gpt-4o (Responses API)
- Temperature: 0.0
- Note: record the run date/time when publishing
- This is an initial benchmark (n=12) and will be expanded in future releases.

## Core Benchmark (extraction + scoring)

Key metrics from outputs/results/report.md:

- Exact accuracy (acc): best = pdf 0.607551, exstruct = 0.583802
- Normalized accuracy (acc_norm): best = pdf 0.856642, exstruct = 0.835538
- Raw coverage (acc_raw): best = exstruct 0.876495 (tie for top)
- Raw precision: best = exstruct 0.933691
- Markdown coverage (acc_md): best = pdf 0.700094, exstruct = 0.697269
- Markdown precision: best = exstruct 0.796101

Interpretation:
- pdf leads in Exact/Normalized, especially when literal string match matters.
- exstruct is strongest on Raw coverage/precision and Markdown precision,
indicating robust capture and downstream-friendly structure.

## RUB (structure_query track)

RUB evaluates Stage B questions using Markdown-only inputs. The current track is
"structure_query" (path selection).

Summary from outputs/rub/results/report.md:

- RUS: exstruct 0.166667 (tie for top with openpyxl 0.166667)
- Partial F1: exstruct 0.436772 (best among methods)

Interpretation:
- exstruct is competitive for structure queries, but the margin is not large.
- This track is sensitive to question design; it rewards selection accuracy
more than raw reconstruction.

## Positioning for RAG/LLM Preprocessing

Practical strengths shown by the current benchmark:
- High Raw coverage/precision (exstruct best)
- High Markdown precision (exstruct best)
- Near-top normalized accuracy

Practical caveats:
- Exact/normalized top spot is often pdf
- RUB structure_query shows only a modest advantage

Recommended public framing:
- exstruct is a strong option when the goal is structured reuse (JSON/Markdown)
for downstream LLM/RAG pipelines.
- pdf/VLM methods can be stronger for literal string fidelity or visual layout
recovery.

## Known Limitations

- Absolute RUS values are low in some settings (sensitive to task design).
- Results vary by task type (forms/flows/diagrams vs tables).
- Model changes (e.g., gpt-4.1) require separate runs and reporting.

## Next Steps (optional)

- Add a reconstruction track that scores “structure rebuild” directly.
- Add task-specific structure queries (not only path selection).
- Publish run date, model version, and normalization rules with results.