diff --git a/README.ja.md b/README.ja.md index 87f7c24..23fdeb6 100644 --- a/README.ja.md +++ b/README.ja.md @@ -2,7 +2,7 @@ [![PyPI version](https://badge.fury.io/py/exstruct.svg)](https://pypi.org/project/exstruct/) [![PyPI Downloads](https://static.pepy.tech/personalized-badge/exstruct?period=total&units=INTERNATIONAL_SYSTEM&left_color=BLACK&right_color=GREEN&left_text=downloads)](https://pepy.tech/projects/exstruct) ![Licence: BSD-3-Clause](https://img.shields.io/badge/license-BSD--3--Clause-blue?style=flat-square) [![pytest](https://github.com/harumiWeb/exstruct/actions/workflows/pytest.yml/badge.svg)](https://github.com/harumiWeb/exstruct/actions/workflows/pytest.yml) [![Codacy Badge](https://app.codacy.com/project/badge/Grade/e081cb4f634e4175b259eb7c34f54f60)](https://app.codacy.com/gh/harumiWeb/exstruct/dashboard?utm_source=gh&utm_medium=referral&utm_content=&utm_campaign=Badge_grade) [![codecov](https://codecov.io/gh/harumiWeb/exstruct/graph/badge.svg?token=2XI1O8TTA9)](https://codecov.io/gh/harumiWeb/exstruct) -![ExStruct Image](/assets/icon.webp) +![ExStruct Image](docs/assets/icon.webp) ExStruct は Excel ワークブックを読み取り、構造化データ(セル・テーブル候補・図形・チャート・SmartArt・印刷範囲ビュー)をデフォルトで JSON に出力します。必要に応じて YAML/TOON も選択でき、COM/Excel 環境ではリッチ抽出、非 COM 環境ではセル+テーブル候補+印刷範囲へのフォールバックで安全に動作します。LLM/RAG 向けに検出ヒューリスティックや出力モードを調整可能です。 @@ -10,6 +10,7 @@ ExStruct は Excel ワークブックを読み取り、構造化データ(セ - **Excel → 構造化 JSON**: セル、図形、チャート、SmartArt、テーブル候補、セル結合範囲、印刷範囲/自動改ページ範囲(PrintArea/PrintAreaView)をシート単位・範囲単位で出力。 - **出力モード**: `light`(セル+テーブル候補のみ)、`standard`(テキスト付き図形+矢印、チャート、SmartArt、セル結合範囲)、`verbose`(全図形を幅高さ付きで出力、セルのハイパーリンクも出力)。 +- **数式取得**: `formulas_map`(数式文字列 → セル座標)を openpyxl/COM で取得。`verbose` 既定、`include_formulas_map` で制御。 - **フォーマット**: JSON(デフォルトはコンパクト、`--pretty` で整形)、YAML、TOON(任意依存)。 - **テーブル検出のチューニング**: API でヒューリスティックを動的に変更可能。 - **ハイパーリンク抽出**: `verbose` モード(または `include_cell_links=True` 指定)でセルのリンクを `links` に出力。 @@ -160,7 +161,7 @@ exstruct input.xlsx --pdf --image --dpi 144 - 図形のみで作成したフローチャート (下画像が実際のサンプル Excel シート) -![Sample Excel](/assets/demo_sheet.png) +![Sample Excel](docs/assets/demo_sheet.png) サンプル Excel: `sample/sample.xlsx` ### 1. Input: Excel Sheet Overview @@ -339,7 +340,7 @@ flowchart TD ### Excel データ -![一般的な申請書Excel](/assets/demo_form.ja.png) +![一般的な申請書Excel](docs/assets/demo_form.ja.png) ### ExStruct JSON diff --git a/README.md b/README.md index 7588c0e..6ebff3a 100644 --- a/README.md +++ b/README.md @@ -12,6 +12,7 @@ ExStruct reads Excel workbooks and outputs structured data (cells, table candida - **Excel → Structured JSON**: cells, shapes, charts, smartart, table candidates, print areas/views, and auto page-break areas per sheet. - **Output modes**: `light` (cells + table candidates + print areas; no COM, shapes/charts empty), `standard` (texted shapes + arrows, charts, smartart, merged cell ranges, print areas), `verbose` (all shapes with width/height, charts with size, merged cell ranges, print areas). Verbose also emits cell hyperlinks and `colors_map`. Size output is flag-controlled. +- **Formula map extraction**: emits `formulas_map` (formula string -> cell coordinates) via openpyxl/COM; enabled by default in `verbose` or via `include_formulas_map`. - **Auto page-break export (COM only)**: capture Excel-computed auto page breaks and write per-area JSON/YAML/TOON when requested (CLI option appears only when COM is available). - **Formats**: JSON (compact by default, `--pretty` available), YAML, TOON (optional dependencies). - **Table detection tuning**: adjust heuristics at runtime via API. diff --git a/docs/README.en.md b/docs/README.en.md index 38fa5b0..39df415 100644 --- a/docs/README.en.md +++ b/docs/README.en.md @@ -11,7 +11,8 @@ ExStruct reads Excel workbooks and outputs structured data (cells, table candida ## Features - **Excel → Structured JSON**: cells, shapes, charts, smartart, table candidates, print areas/views, and auto page-break areas per sheet. -- **Output modes**: `light` (cells + table candidates + print areas; no COM, shapes/charts empty), `standard` (texted shapes + arrows, charts, smartart, merged cell ranges, print areas), `verbose` (all shapes with width/height, charts with size, merged cell ranges, print areas). Verbose also emits cell hyperlinks and `colors_map`. Size output is flag-controlled. +- **Output modes**: `light` (cells + table candidates + print areas; no COM, shapes/charts empty), `standard` (texted shapes + arrows, charts, smartart, merged cell ranges, print areas), `verbose` (all shapes with width/height, charts with size, merged cell ranges, print areas). Verbose also emits cell hyperlinks, `colors_map`, and `formulas_map`. Size output is flag-controlled. +- **Formula map extraction**: emits `formulas_map` (formula string -> cell coordinates) via openpyxl/COM; enabled by default in `verbose` or via `include_formulas_map`. - **Auto page-break export (COM only)**: capture Excel-computed auto page breaks and write per-area JSON/YAML/TOON when requested (CLI option appears only when COM is available). - **Formats**: JSON (compact by default, `--pretty` available), YAML, TOON (optional dependencies). - **Table detection tuning**: adjust heuristics at runtime via API. @@ -134,7 +135,7 @@ Use higher thresholds to reduce false positives; lower them if true tables are m - **light**: cells + table candidates (no COM needed). - **standard**: texted shapes + arrows, charts (COM if available), merged cell ranges, table candidates. Hyperlinks are off unless `include_cell_links=True`. -- **verbose**: all shapes (with width/height), charts, merged cell ranges, table candidates, cell hyperlinks, and `colors_map`. +- **verbose**: all shapes (with width/height), charts, merged cell ranges, table candidates, cell hyperlinks, `colors_map`, and `formulas_map`. ## Error Handling / Fallbacks diff --git a/docs/README.ja.md b/docs/README.ja.md index 87f7c24..a2732af 100644 --- a/docs/README.ja.md +++ b/docs/README.ja.md @@ -9,7 +9,8 @@ ExStruct は Excel ワークブックを読み取り、構造化データ(セ ## 主な特徴 - **Excel → 構造化 JSON**: セル、図形、チャート、SmartArt、テーブル候補、セル結合範囲、印刷範囲/自動改ページ範囲(PrintArea/PrintAreaView)をシート単位・範囲単位で出力。 -- **出力モード**: `light`(セル+テーブル候補のみ)、`standard`(テキスト付き図形+矢印、チャート、SmartArt、セル結合範囲)、`verbose`(全図形を幅高さ付きで出力、セルのハイパーリンクも出力)。 +- **出力モード**: `light`(セル+テーブル候補のみ)、`standard`(テキスト付き図形+矢印、チャート、SmartArt、セル結合範囲)、`verbose`(全図形を幅高さ付きで出力、セルのハイパーリンク/`colors_map`/`formulas_map`も出力)。 +- **数式取得**: `formulas_map`(数式文字列 → セル座標)を openpyxl/COM で取得。`verbose` 既定、`include_formulas_map` で制御。 - **フォーマット**: JSON(デフォルトはコンパクト、`--pretty` で整形)、YAML、TOON(任意依存)。 - **テーブル検出のチューニング**: API でヒューリスティックを動的に変更可能。 - **ハイパーリンク抽出**: `verbose` モード(または `include_cell_links=True` 指定)でセルのリンクを `links` に出力。 @@ -131,7 +132,7 @@ set_table_detection_params( - **light**: セル+テーブル候補のみ(COM 不要)。 - **standard**: テキスト付き図形+矢印、チャート(COM ありで取得)、テーブル候補。セルのハイパーリンクは `include_cell_links=True` を指定したときのみ出力。 -- **verbose**: all shapes, charts, table_candidates, hyperlinks, and `colors_map`. +- **verbose**: 全図形(幅高さ付き)、チャート、`table_candidates`、ハイパーリンク、`colors_map`、`formulas_map` を出力。 ## エラーハンドリング / フォールバック diff --git a/docs/agents/DATA_MODEL.md b/docs/agents/DATA_MODEL.md index 0d7d056..af411a8 100644 --- a/docs/agents/DATA_MODEL.md +++ b/docs/agents/DATA_MODEL.md @@ -1,6 +1,6 @@ # ExStruct データモデル仕様 -**Version**: 0.15 +**Version**: 0.16 **Status**: Authoritative 本ドキュメントは ExStruct が返す全モデルの唯一の正準ソースです。 @@ -175,6 +175,7 @@ SheetData { table_candidates: [str] print_areas: [PrintArea] auto_print_areas: [PrintArea] // 自動改ページ矩形 (COM 前提、デフォルト無効) + formulas_map: {[formula: str]: [[int, int]]} // (row=1-based, col=0-based) colors_map: {[colorHex: str]: [[int, int]]} // (row=1-based, col=0-based) merged_cells: MergedCells | null } @@ -251,3 +252,4 @@ WorkbookData { - 0.13: Shape を `Shape` / `Arrow` / `SmartArt` に分割し、`SmartArtNode` のネスト構造を追加 - 0.14: `MergedCell` / `SheetData.merged_cells` を追加 - 0.15: `MergedCells` を schema + items 形式に変更し圧縮形式を導入 +- 0.16: `SheetData.formulas_map` を追加 diff --git a/docs/agents/FEATURE_SPEC.md b/docs/agents/FEATURE_SPEC.md index 55753aa..21a82f9 100644 --- a/docs/agents/FEATURE_SPEC.md +++ b/docs/agents/FEATURE_SPEC.md @@ -4,6 +4,26 @@ --- +## 数式取得機能追加 + +- 新たに数式文字列をそのまま取得する機能を追加 +- `SheetData`モデルに`formulas_map`を新設予定`formulas_map: dict[str, list[tuple[int,int]]]` +- 数式の値は定義されている数式をそのまま取得する +- セル座標はcolors_mapと同じようにr,cの数値で表記 +- デフォルトはverboseモード以上で出力、もしくはオプションからONにする +- 定義されている数式文字列をシンプルに取得する実装 +- 数式の表記形式は「=A1」のようにユーザーが見るままの数式文字列にする +- 共有数式や配列数式は一旦は展開しない実装にする +- 空文字は除外、=だけのセルも数式文字として取得 +- formulas_mapのキーは「式文字列(先頭=を含む)」で固定する +- 既存の値はSheetData.rowsにあり、数式はSheetData.formulas_mapにあることで共存する +- データ取得時はformulas_map が ON のときだけ data_only=False で再読込 +- オプションは`StructOptions`にて`include_formulas_map: bool = False`で設定を受け付ける +- `.xls`形式かつ数式取得ONの時は処理が遅くなるという警告を出しつつ、COMで取得処理をする。 +- cell.value が ArrayFormula の場合に value.text(実際の式文字列)を使う + +--- + ## 今後のオプション検討メモ - 表検知スコアリングの閾値を CLI/環境変数で調整可能にする diff --git a/docs/agents/TASKS.md b/docs/agents/TASKS.md index 742e3b4..cf7660f 100644 --- a/docs/agents/TASKS.md +++ b/docs/agents/TASKS.md @@ -2,7 +2,38 @@ 未完了 [ ], 完了 [x] -- [x] 仕様確認: 画像出力は DPI を維持しつつ、メモリリーク/クラッシュ回避のためサブプロセス化で処理する方針を明記 -- [x] 実装方針: シートごとに PDF を分割 → サブプロセスで PDF ページを PNG へ変換 → 終了時にメモリを解放する設計(親は進捗/結果を集約) -- [x] 実装方針: 子プロセスは `pypdfium2` をロードしてページごとにレンダリングし、書き込み済みパスを親に返す -- [x] 実装方針: 例外時は子プロセスでエラーを返し、親が RenderError として集約して返す +## 数式取得機能追加 + +- [x] `SheetData`に`formulas_map`フィールドを追加し、シリアライズ対象に含める +- [x] `StructOptions`に`include_formulas_map: bool = False`を追加し、verbose時の既定挙動と整合させる +- [x] openpyxlで`data_only=False`の読み取りパスを追加し、`formulas_map`用の走査処理を実装する +- [x] `.xls`かつ数式取得ONの場合はCOM経由で`formulas_map`を取得し、遅延警告を出す +- [x] `formulas_map`の仕様(=付きの式文字列、空文字除外、=のみ許可、共有/配列は未展開)に沿った抽出ロジックを追加 +- [x] openpyxlの配列数式(`ArrayFormula`)は`value.text`から式文字列を取得する分岐を追加 +- [x] CLI/ドキュメント/READMEの出力モード説明に`formulas_map`の条件を追記する +- [x] テスト要件に`formulas_map`関連(ON/OFF、verbose既定、.xls COM分岐)を追加する + +## PR #44 指摘対応 + +- [x] `src/exstruct/render/__init__.py` の `_page_index_from_suffix` を2桁固定ではなく可変桁の数値サフィックスに対応させ、`_rename_pages_for_print_area` の上書きリスクを解消する +- [x] `src/exstruct/render/__init__.py` の `_export_sheet_pdf` の `finally` 内 `return` を削除し、PrintArea 復元失敗はログに残して例外を握りつぶさない +- [x] `src/exstruct/core/pipeline.py` の `step_extract_formulas_map_*` の挙動を docstring に合わせる(失敗時にログしてスキップ)か、docstring を実装に合わせて修正する +- [x] `docs/README.ja.md` の `**verbose**` 説明行を日本語に統一する + +## PR #44 コメント/Codecov 対応 + +- [x] Codecov パッチカバレッジ低下(60.53%)の指摘に対応し、対象ファイルの不足分テストを追加する(`src/exstruct/render/__init__.py`, `src/exstruct/core/cells.py`, `src/exstruct/core/backends/com_backend.py`, `src/exstruct/core/pipeline.py`, `src/exstruct/core/backends/openpyxl_backend.py`) +- [x] Codecov の「Files with missing lines」で具体的な未カバー行を確認し、テスト観点を整理する +- [x] Codacy 警告対応: `src/exstruct/render/__init__.py:274` の finally 内 return により例外が握りつぶされる可能性(`PyLintPython3_W0150`)を解消する + +## PR #44 CodeRabbit 再レビュー対応 + +- [ ] `scripts/codacy_issues.py`: トークン未設定時の `sys.exit(1)` をモジュールトップから排除し、`get_token()` または `main()` で検証する +- [ ] `scripts/codacy_issues.py`: `format_for_ai` の `sys.exit` を `ValueError` に置換し、呼び出し側でバリデーションする +- [ ] `scripts/codacy_issues.py`: `urlopen` の非2xxチェック(到達不能)を削除または `HTTPError` 側へ寄せる +- [ ] `scripts/codacy_issues.py`: `status` の固定値バリデーションを廃止する(固定なら直代入/必要なら CLI 引数化) +- [ ] `tests/backends/test_print_areas_openpyxl.py`: `PrintAreaData` 型に合わせる+関連テストに Google スタイル docstring を付与 +- [ ] `tests/core/test_pipeline.py`: 無効な `MergedCellRange` を有効な非重複レンジに修正する +- [ ] `tests/backends/test_backends.py`: `sheets` のクラス属性共有を避け、インスタンス属性に変更する +- [ ] `tests/render/test_render_init.py` / `tests/utils.py` / `tests/models/test_models_export.py`: docstring/コメントの指摘を反映する +- [ ] `src/exstruct/render/__init__.py`: Protocol クラスに Google スタイル docstring を追加する diff --git a/docs/release-notes/v0.3.7.md b/docs/release-notes/v0.3.7.md new file mode 100644 index 0000000..8672766 --- /dev/null +++ b/docs/release-notes/v0.3.7.md @@ -0,0 +1,19 @@ +# v0.3.7 Release Notes + +This release adds formula extraction to the structured output, expanding the +pipeline, models, and backends while keeping the existing modes and fallbacks. + +## Highlights + +- Added `formulas_map` extraction (formula string -> cell coordinates) via + openpyxl for .xlsx/.xlsm and COM for .xls, with `include_formulas_map` option + and verbose default behavior. +- Pipeline, engine, and models now propagate `formulas_map` end-to-end, with + updated samples and documentation. +- Rendering robustness improved for print-area exports (safer page numbering + and error handling during PrintArea restoration). + +## Notes + +- `formulas_map` is emitted in `verbose` by default; use `include_formulas_map` + to enable/disable explicitly. diff --git a/mkdocs.yml b/mkdocs.yml index 33a0b0f..9eb5bd8 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -27,6 +27,7 @@ nav: - CLI Guide: cli.md - Concept / Why ExStruct?: concept.md - Release Notes: + - v0.3.7: release-notes/v0.3.7.md - v0.3.6: release-notes/v0.3.6.md - v0.3.5: release-notes/v0.3.5.md - v0.3.2: release-notes/v0.3.2.md diff --git a/pyproject.toml b/pyproject.toml index 7b6fe5c..1e6cca7 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [project] name = "exstruct" -version = "0.3.6" +version = "0.3.7" description = "Excel to structured JSON (tables, shapes, charts) for LLM/RAG pipelines" readme = "README.md" license = { file = "LICENSE" } @@ -60,7 +60,10 @@ omit = [ [tool.ruff] target-version = "py311" src = ["exstruct"] +fix = true +# 静的解析ルール +[tool.ruff.lint] select = [ "E", # pycodestyle errors "W", # pycodestyle warnings @@ -75,43 +78,36 @@ select = [ ] ignore = [ - "E501", # 行長は許容(Excel JSON は長くなりがち) - "B008", # Pydantic の default_factory を誤検知するため - "ANN101", # self に型を要求されてしまうため - "ANN102", # cls も同様 + "E501", # 長い行は許容(Excel JSON は長くなりがち) + "B008", # Pydantic の default_factory を使用するため + "ANN101", # self の型注釈は省略可能 + "ANN102", # cls の型注釈は省略可能 ] -fix = true - -# 型ヒントのスタイル -[tool.ruff.lint] -extend-select = ["ANN"] - -# import の並び替え設定 -[tool.ruff.isort] +# import の並び順 +[tool.ruff.lint.isort] combine-as-imports = true known-first-party = ["exstruct"] force-sort-within-sections = true -# 複雑度チェック(関数の最大複雑度) -[tool.ruff.mccabe] +# 複雑度の最大値 +[tool.ruff.lint.mccabe] max-complexity = 12 -[tool.ruff.per-file-ignores] +[tool.ruff.lint.per-file-ignores] "tests/**/*.py" = ["N802", "N803", "N806"] - [tool.mypy] packages = ["exstruct"] python_version = "3.11" -# 外部ライブラリは一切チェックしない +# 外部ライブラリの型情報がない場合は無視 ignore_missing_imports = true -# 自作コードは厳密にチェックする +# 厳格モードを有効化 strict = true -# Pydantic v2 向け +# Pydantic v2 対応 plugins = ["pydantic.mypy"] [tool.pytest.ini_options] diff --git a/sample/formula/formula.json b/sample/formula/formula.json new file mode 100644 index 0000000..a1f3ba0 --- /dev/null +++ b/sample/formula/formula.json @@ -0,0 +1,180 @@ +{ + "book_name": "formula.xlsx", + "sheets": { + "Sheet1": { + "rows": [ + { + "r": 1, + "c": { + "0": "商品名", + "1": "定価", + "2": "数量" + } + }, + { + "r": 2, + "c": { + "0": "商品A", + "1": 800, + "2": 10 + } + }, + { + "r": 3, + "c": { + "0": "商品B", + "1": 1000, + "2": 2 + } + }, + { + "r": 4, + "c": { + "0": "商品C", + "1": 1200, + "2": 5 + } + }, + { + "r": 5, + "c": { + "0": "売上合計", + "2": 16000 + } + }, + { + "r": 8, + "c": { + "0": "学生名", + "1": "点数", + "2": "評価" + } + }, + { + "r": 9, + "c": { + "0": "山田 早苗", + "1": 86, + "2": "A" + } + }, + { + "r": 10, + "c": { + "0": "田中 太郎", + "1": 60, + "2": "C" + } + }, + { + "r": 11, + "c": { + "0": "坂本 直美", + "1": 72, + "2": "B" + } + }, + { + "r": 12, + "c": { + "0": "多田 友梨奈", + "1": 50, + "2": "D" + } + } + ], + "table_candidates": [ + "A1:C5", + "A8:C12" + ], + "formulas_map": { + "=SUM(B2:B4*C2:C4)": [ + [ + 5, + 2 + ] + ], + "=_xlfn.IFS(B9>=85,\"A\",B9>=70,\"B\",B9>=60,\"C\",TRUE,\"D\")": [ + [ + 9, + 2 + ] + ], + "=_xlfn.IFS(B10>=85,\"A\",B10>=70,\"B\",B10>=60,\"C\",TRUE,\"D\")": [ + [ + 10, + 2 + ] + ], + "=_xlfn.IFS(B11>=85,\"A\",B11>=70,\"B\",B11>=60,\"C\",TRUE,\"D\")": [ + [ + 11, + 2 + ] + ], + "=_xlfn.IFS(B12>=85,\"A\",B12>=70,\"B\",B12>=60,\"C\",TRUE,\"D\")": [ + [ + 12, + 2 + ] + ] + }, + "colors_map": { + "BDD7EE": [ + [ + 1, + 0 + ], + [ + 1, + 1 + ], + [ + 1, + 2 + ], + [ + 5, + 0 + ], + [ + 5, + 1 + ] + ], + "F8CBAD": [ + [ + 8, + 0 + ], + [ + 8, + 1 + ], + [ + 8, + 2 + ] + ] + }, + "merged_cells": { + "schema": [ + "r1", + "c1", + "r2", + "c2", + "v" + ], + "items": [ + [ + 5, + 0, + 5, + 1, + "売上合計" + ] + ] + } + } + } +} \ No newline at end of file diff --git a/sample/formula/formula.xlsx b/sample/formula/formula.xlsx new file mode 100644 index 0000000..eb2ebab Binary files /dev/null and b/sample/formula/formula.xlsx differ diff --git a/sample/formula/formula_json_for_llm.md b/sample/formula/formula_json_for_llm.md new file mode 100644 index 0000000..5481659 --- /dev/null +++ b/sample/formula/formula_json_for_llm.md @@ -0,0 +1,85 @@ +# 📘 Overall Interpretation of the Excel JSON + +The JSON describes an Excel file named **formula.xlsx** containing a single sheet (**Sheet1**) with **two separate tables** placed in different row regions. The sheet includes data, formulas, merged cells, and color formatting. + +--- + +# 🛒 Table 1: Product Sales (Range A1:C5) + +## **Content** +| Product | Price | Quantity | +|--------|-------|----------| +| Product A | 800 | 10 | +| Product B | 1000 | 2 | +| Product C | 1200 | 5 | +| **Sales Total** (merged A5:B5) | | **16000** | + +### **Formula** +The formula: + +``` +=SUM(B2:B4*C2:C4) +``` + +is applied to cell **C5**. + +This calculates: + +- 800 × 10 = 8000 +- 1000 × 2 = 2000 +- 1200 × 5 = 6000 +- Total = **16000** + +The JSON value matches this computed result. + +### **Formatting** +- Color `BDD7EE` (light blue) is applied to: + - Header row (A1:C1) + - The “Sales Total” label area (A5:B5) +- Cells A5 and B5 are merged. + +This suggests the table is formatted as a typical summary sales table. + +--- + +# 🎓 Table 2: Student Grades (Range A8:C12) + +## **Content** +| Student | Score | Grade | +|---------|--------|--------| +| Sanae Yamada | 86 | A | +| Taro Tanaka | 60 | C | +| Naomi Sakamoto | 72 | B | +| Yurina Tada | 50 | D | + +### **Formula** +Each grade cell (column C) uses an **IFS** function: + +``` +IFS( + B>=85, "A", + B>=70, "B", + B>=60, "C", + TRUE, "D" +) +``` + +This automatically assigns a grade based on the score. + +### **Formatting** +- Color `F8CBAD` (light red) is applied to the header row (A8:C8). + +--- + +# 🧩 What This Excel Sheet Appears to Be + +Based on the structure, formulas, and formatting, this sheet looks like a **practice or demonstration file** for learning Excel basics: + +- Using **SUM** with array multiplication +- Using **IFS** for conditional grading +- Creating **two independent tables** on one sheet +- Applying **header colors** +- Using **merged cells** for labels +- Demonstrating how formulas map to cell positions + +It resembles a training or sample dataset for Excel exercises. \ No newline at end of file diff --git a/schemas/sheet.json b/schemas/sheet.json index 0e39edd..4a1eb5d 100644 --- a/schemas/sheet.json +++ b/schemas/sheet.json @@ -718,6 +718,27 @@ "title": "Colors Map", "type": "object" }, + "formulas_map": { + "additionalProperties": { + "items": { + "maxItems": 2, + "minItems": 2, + "prefixItems": [ + { + "type": "integer" + }, + { + "type": "integer" + } + ], + "type": "array" + }, + "type": "array" + }, + "description": "Mapping of formula strings to lists of (row, column) tuples where row is 1-based and column is 0-based.", + "title": "Formulas Map", + "type": "object" + }, "merged_cells": { "anyOf": [ { diff --git a/schemas/workbook.json b/schemas/workbook.json index 576ef67..eb99d41 100644 --- a/schemas/workbook.json +++ b/schemas/workbook.json @@ -594,6 +594,27 @@ "title": "Colors Map", "type": "object" }, + "formulas_map": { + "additionalProperties": { + "items": { + "maxItems": 2, + "minItems": 2, + "prefixItems": [ + { + "type": "integer" + }, + { + "type": "integer" + } + ], + "type": "array" + }, + "type": "array" + }, + "description": "Mapping of formula strings to lists of (row, column) tuples where row is 1-based and column is 0-based.", + "title": "Formulas Map", + "type": "object" + }, "merged_cells": { "anyOf": [ { diff --git a/scripts/codacy_issues.py b/scripts/codacy_issues.py new file mode 100644 index 0000000..42add78 --- /dev/null +++ b/scripts/codacy_issues.py @@ -0,0 +1,600 @@ +#!/usr/bin/env python3 + +from __future__ import annotations + +import argparse +from dataclasses import dataclass +import json +import os +import re +import subprocess # nosec B404 - used for fixed git commands only +import sys +from typing import Any, cast +import urllib.parse +import urllib.request + +# ================================ +# Config +# ================================ +BASE = "https://api.codacy.com/api/v3" +BASE_URL = urllib.parse.urlparse(BASE) +BASE_PATH = BASE_URL.path.rstrip("/") # "/api/v3" + + +def get_token() -> str: + """Return the Codacy API token or raise if missing. + + Returns: + Codacy API token string from the environment. + + Raises: + ValueError: If CODACY_API_TOKEN is not set. + """ + token = os.environ.get("CODACY_API_TOKEN") + if token is None: + raise ValueError("CODACY_API_TOKEN is not set") + return token + + +# ================================ +# Utilities +# ================================ +LEVELS = ["Error", "High", "Warning", "Info"] + + +def get_level_priority(level: str | None) -> int | None: + """Convert a severity level name to a priority number. + + Args: + level: Severity level string. + + Returns: + Priority number or None if unknown. + """ + if level == "Error": + return 4 + if level == "High": + return 3 + if level == "Warning": + return 2 + if level == "Info": + return 1 + return None + + +def normalize_provider(value: str) -> str | None: + """ + Normalize a provider identifier to a supported short code. + + Parameters: + value (str): Provider identifier to normalize (expected 'gh', 'gl', or 'bb'). + + Returns: + str | None: The provider code ('gh', 'gl', or 'bb') if valid, `None` otherwise. + """ + return value if value in ("gh", "gl", "bb") else None + + +def assert_valid_segment(name: str, value: str, pattern: re.Pattern[str]) -> str: + """Validate an identifier segment against a regex. + + Args: + name: Segment name for error reporting. + value: Segment value. + pattern: Compiled regex pattern for allowed values. + + Returns: + The validated value. + + Raises: + ValueError: If the value is empty or invalid. + """ + if (not value) or (pattern.match(value) is None): + raise ValueError(f"Invalid {name}: {value}") + return value + + +def assert_valid_choice(name: str, value: str, choices: list[str]) -> str: + """Validate that a value is in a list of choices. + + Args: + name: Parameter name for error reporting. + value: Input value. + choices: Allowed values. + + Returns: + The validated value. + + Raises: + ValueError: If the value is not allowed. + """ + if value not in choices: + raise ValueError(f"Invalid {name}: {value}. Valid values: {', '.join(choices)}") + return value + + +def encode_segment(value: str) -> str: + """ + URL-encode a URL path segment so it is safe for inclusion in a path. + + Returns: + encoded (str): The percent-encoded representation of the input string. + """ + return urllib.parse.quote(value, safe="") + + +def build_codacy_url(pathname: str, query: dict[str, str] | None = None) -> str: + """ + Constructs a full Codacy API URL using the configured base origin and base path. + + Parameters: + pathname (str): Pathname to append to the base path (should begin with a forward slash). + query (dict[str, str] | None): Optional mapping of query parameter names to values; values are URL-encoded. + + Returns: + url (str): The complete URL including query string if `query` is provided. + """ + # Ensure we keep origin and base path + url = f"{BASE_URL.scheme}://{BASE_URL.netloc}{BASE_PATH}{pathname}" + if query: + url = f"{url}?{urllib.parse.urlencode(query)}" + return url + + +def assert_codacy_url(url: str) -> str: + """ + Validate that `url` targets the configured Codacy API origin and begins with the `/analysis/` path. + + Parameters: + url (str): The full URL to validate. + + Returns: + str: The original URL when it is confirmed to target the configured Codacy API origin and start with the `/analysis/` path. + + Raises: + ValueError: If the URL does not use the configured Codacy API origin or does not start with the expected `/analysis/` path. + """ + # Basic safety: must be same origin and start with /api/v3/analysis/ + parsed = urllib.parse.urlparse(url) + expected_origin = f"{BASE_URL.scheme}://{BASE_URL.netloc}" + origin = f"{parsed.scheme}://{parsed.netloc}" + expected_prefix = f"{BASE_PATH}/analysis/" + if origin != expected_origin or not parsed.path.startswith(expected_prefix): + raise ValueError(f"Invalid URL: {url}") + return url + + +def build_repo_issues_url(provider: str, org: str, repo: str, limit: int) -> str: + """ + Constructs the Codacy API URL to search repository issues for a given provider, organization, repository, and result limit. + + Parameters: + provider (str): Provider code (e.g., "gh", "gl", "bb"). + org (str): Organization or owner name. + repo (str): Repository name. + limit (int): Maximum number of results to request. + + Returns: + str: A Codacy API URL for the repository issues search endpoint with the `limit` query parameter set. + """ + return build_codacy_url( + f"/analysis/organizations/{encode_segment(provider)}/{encode_segment(org)}" + f"/repositories/{encode_segment(repo)}/issues/search", + query={"limit": str(limit)}, + ) + + +def build_pr_issues_url( + provider: str, org: str, repo: str, pr: str, limit: int, status: str +) -> str: + """ + Constructs the Codacy API URL for fetching issues of a pull request. + + Parameters: + provider (str): Provider code (e.g., "gh", "gl", "bb"). + org (str): Organization or owner name. + repo (str): Repository name. + pr (str): Pull request identifier. + limit (int): Maximum number of issues to request. + status (str): Issue status filter (e.g., "all", "open", "closed"). + + Returns: + str: The Codacy API URL for the pull-request issues endpoint including `status` and `limit` query parameters. + """ + return build_codacy_url( + f"/analysis/organizations/{encode_segment(provider)}/{encode_segment(org)}" + f"/repositories/{encode_segment(repo)}/pull-requests/{encode_segment(pr)}/issues", + query={"status": status, "limit": str(limit)}, + ) + + +def get_git_origin_url() -> str | None: + """ + Get the Git remote "origin" URL for the current repository, or None when it cannot be determined. + + Returns: + origin_url (str | None): The remote URL configured for 'origin' if the current directory is inside a Git work tree and the origin URL is available; `None` if not inside a Git repository, if the origin is not set, or on error. + """ + # git repo check + try: + result = subprocess.run( + ["git", "rev-parse", "--is-inside-work-tree"], + capture_output=True, + text=True, + check=False, + ) # nosec B603 - fixed git command without user input + if result.returncode != 0 or not result.stdout.strip(): + return None + result = subprocess.run( + ["git", "remote", "get-url", "origin"], + capture_output=True, + text=True, + check=False, + ) # nosec B603 - fixed git command without user input + if result.returncode != 0: + return None + return result.stdout.strip() + except (OSError, subprocess.SubprocessError): + return None + + +@dataclass +class GitRemoteInfo: + """Parsed git remote information.""" + + provider: str + org: str + repo: str + + +def parse_git_remote(url: str) -> GitRemoteInfo | None: + """ + Extract provider, organization, and repository from a Git remote URL. + + Accepts HTTPS (https://host/org/repo[.git]) and SSH (git@host:org/repo[.git]) remote formats. + Provider is one of: "gh" for GitHub, "gl" for GitLab, "bb" for Bitbucket, or "unknown" for other hosts. + + Parameters: + url (str): Git remote URL to parse. + + Returns: + GitRemoteInfo | None: Parsed GitRemoteInfo with fields `provider`, `org`, and `repo`, or `None` if the URL could not be parsed. + """ + # HTTPS + m = re.match(r"^https?://([^/]+)/([^/]+)/([^/]+?)(?:\.git)?$", url) + # SSH + if not m: + m = re.match(r"^git@([^:]+):([^/]+)/([^/]+?)(?:\.git)?$", url) + + if not m: + return None + + host, org, repo = m.group(1), m.group(2), m.group(3) + + def is_same_or_subdomain(hostname: str, base_domain: str) -> bool: + """ + Check whether a hostname is equal to a base domain or is a subdomain of that base domain. + + Parameters: + hostname (str): Hostname to test (e.g., "api.example.com"). + base_domain (str): Base domain to compare against (e.g., "example.com"). + + Returns: + `true` if `hostname` equals `base_domain` or ends with `.` followed by `base_domain`, `false` otherwise. + """ + return hostname == base_domain or hostname.endswith("." + base_domain) + + if is_same_or_subdomain(host, "github.com"): + provider = "gh" + elif is_same_or_subdomain(host, "gitlab.com"): + provider = "gl" + elif is_same_or_subdomain(host, "bitbucket.org"): + provider = "bb" + else: + provider = "unknown" + + return GitRemoteInfo(provider=provider, org=org, repo=repo) + + +def fetch_json( + url: str, method: str = "GET", body: dict[str, Any] | None = None +) -> dict[str, Any]: + """ + Fetch and return a JSON object from a validated Codacy API URL. + + Parameters: + url (str): Codacy API URL; must target the configured Codacy origin and start with the /analysis/ path. + method (str): HTTP method to use (e.g., "GET", "POST"). + body (dict[str, Any] | None): Optional JSON body for non-GET requests. + + Returns: + dict[str, Any]: The parsed JSON response as a dictionary. + + Raises: + RuntimeError: On HTTP errors, network errors, invalid JSON, or when the JSON root value is not an object. + """ + safe_url = assert_codacy_url(url) + + headers = { + "Accept": "application/json", + "api-token": get_token(), + } + + data: bytes | None = None + if body is not None and method.upper() != "GET": + payload = json.dumps(body).encode("utf-8") + headers["Content-Type"] = "application/json" + headers["Content-Length"] = str(len(payload)) + data = payload + + req = urllib.request.Request( + safe_url, method=method.upper(), headers=headers, data=data + ) + + try: + with urllib.request.urlopen(req, timeout=60) as res: # nosec B310 - validated https origin + raw = res.read().decode("utf-8", errors="replace") + try: + parsed = json.loads(raw) + except json.JSONDecodeError as exc: + raise RuntimeError("Invalid JSON response") from exc + if not isinstance(parsed, dict): + raise RuntimeError("Invalid JSON response") + return cast(dict[str, Any], parsed) + except urllib.error.HTTPError as e: + # include response body if possible + try: + body_text = e.read().decode("utf-8", errors="replace") + except Exception: + body_text = "" + raise RuntimeError(f"HTTP {e.code}: {body_text or str(e)}") from None + except urllib.error.URLError as e: + raise RuntimeError(str(e)) from None + + +# ================================ +# API +# ================================ +def fetch_repo_issues(provider: str, org: str, repo: str, limit: int) -> dict[str, Any]: + """ + Request Codacy for issues belonging to a repository. + + Parameters: + provider (str): Provider code ('gh', 'gl', 'bb') indicating GitHub, GitLab, or Bitbucket. + org (str): Organization or owner name. + repo (str): Repository name. + limit (int): Maximum number of issues to return. + + Returns: + dict[str, Any]: Parsed JSON response from the Codacy API containing issue data. + """ + url = build_repo_issues_url(provider, org, repo, limit) + return fetch_json(url, method="POST", body={}) + + +def fetch_pr_issues( + provider: str, org: str, repo: str, pr: str, limit: int, status: str = "all" +) -> dict[str, Any]: + """ + Retrieve Codacy issues for a specific pull request. + + Parameters: + provider (str): Provider code ("gh", "gl", "bb"). + org (str): Organization or user name. + repo (str): Repository name. + pr (str): Pull request number or identifier. + limit (int): Maximum number of issues to request. + status (str): Issue status filter (for example "all", "open", "closed"). + + Returns: + dict: Parsed JSON response from the Codacy API. + """ + url = build_pr_issues_url(provider, org, repo, pr, limit, status) + return fetch_json(url, method="GET") + + +# ================================ +# AI Output Formatter +# ================================ +def format_for_ai(raw_issues: list[dict[str, Any]], min_level: str) -> list[str]: + """ + Format Codacy issue records into compact AI-friendly lines filtered by minimum severity. + + Each returned string has the form: + " | : | | | ". + + Parameters: + raw_issues: List of issue objects returned by the Codacy API (each item may be an issue or contain a `commitIssue` key). + min_level: Minimum severity level to include; must be one of the values in LEVELS. + + Returns: + A list of formatted issue strings matching the format above, including only issues whose severity is at or above `min_level`. + + Raises: + ValueError: If `min_level` is not a valid severity level. + """ + min_priority = get_level_priority(min_level) + if min_priority is None: + raise ValueError( + f"Invalid min_level: {min_level}. Valid values: {', '.join(LEVELS)}" + ) + + out: list[str] = [] + + for item in raw_issues: + issue = item.get("commitIssue") or item + + pattern_info = issue.get("patternInfo") or {} + level = pattern_info.get("level") + prio = get_level_priority(level) + if prio is None or prio < min_priority: + continue + + file_path = issue.get("filePath") + line_no = issue.get("lineNumber") + rule = pattern_info.get("id") + category = pattern_info.get("category") + message = issue.get("message") + + out.append(f"{level} | {file_path}:{line_no} | {rule} | {category} | {message}") + + return out + + +# ================================ +# CLI +# ================================ +def parse_args(argv: list[str]) -> argparse.Namespace: + """Parse command-line arguments.""" + p = argparse.ArgumentParser(add_help=False) + p.add_argument("org", nargs="?", default=None) + p.add_argument("repo", nargs="?", default=None) + p.add_argument("--pr", dest="pr", default=None) + p.add_argument("--min-level", dest="min_level", default="Info", choices=LEVELS) + p.add_argument("--provider", dest="provider", default=None) + p.add_argument("--help", action="help", help="Show this help message and exit") + return p.parse_args(argv) + + +def apply_git_defaults(args: argparse.Namespace) -> None: + """Populate missing org/repo/provider from git origin when possible.""" + if args.org and args.repo: + return + origin_url = get_git_origin_url() + if not origin_url: + return + parsed = parse_git_remote(origin_url) + if not parsed: + return + if args.provider is None: + args.provider = parsed.provider + if args.org is None: + args.org = parsed.org + if args.repo is None: + args.repo = parsed.repo + + +def resolve_segments(args: argparse.Namespace) -> tuple[str, str, str | None]: + """ + Validate CLI org, repo, and optional pr segments and return them. + + Parameters: + args (argparse.Namespace): Parsed CLI arguments with attributes `org`, `repo`, and optional `pr`. + + Returns: + tuple[str, str, str | None]: A tuple (org, repo, pr) where `pr` is None if not supplied. + + Raises: + ValueError: If any segment is empty or contains invalid characters. + """ + segment_pattern = re.compile(r"^[A-Za-z0-9_.-]+$") + org = assert_valid_segment("org", args.org, segment_pattern) + repo = assert_valid_segment("repo", args.repo, segment_pattern) + pr = args.pr + if pr is not None: + pr = assert_valid_segment("pr", pr, re.compile(r"^[0-9]+$")) + return org, repo, pr + + +def build_payload( + *, + pr: str | None, + org: str, + repo: str, + min_level: str, + issues: list[str], +) -> dict[str, object]: + """ + Create a JSON-serializable payload describing the fetched issues and their scope. + + The returned dictionary contains: + - scope: "pull_request" when `pr` is set, otherwise "repository". + - organization: organization/owner name. + - repository: repository name. + - pullRequest: pull request identifier string when present, otherwise `None`. + - minLevel: the minimum severity level used to filter issues. + - total: the number of issues in `issues`. + - issues: list of formatted issue strings. + + Returns: + dict[str, object]: Payload ready for JSON serialization with the keys described above. + """ + return { + "scope": "pull_request" if pr else "repository", + "organization": org, + "repository": repo, + "pullRequest": pr if pr else None, + "minLevel": min_level, + "total": len(issues), + "issues": issues, + } + + +def main() -> int: + """ + Run the CLI: parse arguments, fetch Codacy issues (repository or pull request), format them for AI consumption, and write a JSON payload to stdout. + + Writes error messages to stderr when validation or fetching fails and prints the final JSON payload to stdout. + + Returns: + int: 0 on success, 1 on error. + """ + args = parse_args(sys.argv[1:]) + + # --- Git auto-detect --- + apply_git_defaults(args) + + if args.provider is None: + args.provider = "gh" + + provider = normalize_provider(args.provider) + if not provider: + print("Invalid --provider: use gh, gl, or bb", file=sys.stderr) + return 1 + + if not args.org or not args.repo: + print( + "Usage:\n" + " python codacy_issues.py ORG REPO [--pr NUMBER] [--min-level Error|High|Warning|Info] [--provider gh|gl|bb]", + file=sys.stderr, + ) + return 1 + + try: + org, repo, pr = resolve_segments(args) + except ValueError as exc: + print(str(exc), file=sys.stderr) + return 1 + + status = "all" + limit = 100 + + result = ( + fetch_pr_issues( + provider=provider, org=org, repo=repo, pr=pr, limit=limit, status=status + ) + if pr + else fetch_repo_issues(provider=provider, org=org, repo=repo, limit=limit) + ) + + issues = result.get("data") or [] + try: + formatted = format_for_ai(issues, args.min_level) + except ValueError as exc: + print(str(exc), file=sys.stderr) + return 1 + + payload = build_payload( + pr=pr, org=org, repo=repo, min_level=args.min_level, issues=formatted + ) + + sys.stdout.write(json.dumps(payload, ensure_ascii=False, indent=2) + "\n") + return 0 + + +if __name__ == "__main__": + try: + raise SystemExit(main()) + except Exception as e: + print(str(e), file=sys.stderr) + raise SystemExit(1) from e diff --git a/src/exstruct/__init__.py b/src/exstruct/__init__.py index d0110d6..6bdfd75 100644 --- a/src/exstruct/__init__.py +++ b/src/exstruct/__init__.py @@ -90,36 +90,24 @@ def extract(file_path: str | Path, mode: ExtractionMode = "standard") -> WorkbookData: """ - Extract an Excel workbook into WorkbookData. + Extracts an Excel workbook into a WorkbookData structure. - Args: - file_path: Path to .xlsx/.xlsm/.xls. - mode: "light" / "standard" / "verbose" - - light: cells + table detection only (no COM, shapes/charts empty). Print areas via openpyxl. - - standard: texted shapes + arrows + charts (COM if available), print areas included. Shape/chart size is kept but hidden by default in output. - - verbose: all shapes (including textless) with size, charts with size, and colors_map. + Parameters: + file_path (str | Path): Path to the workbook file (.xlsx, .xlsm, .xls). + mode (ExtractionMode): Extraction detail level. "light" includes cells and table detection only (no COM, shapes/charts empty; print areas via openpyxl). "standard" includes texted shapes, arrows, charts (COM if available) and print areas. "verbose" also includes shape/chart sizes, cell link map, colors map, and formulas map. Returns: - WorkbookData containing sheets, rows, shapes, charts, and print areas. - - Raises: - ValueError: If an invalid mode is provided. - - Examples: - Extract with hyperlinks (verbose) and inspect table candidates: - - >>> from exstruct import extract - >>> wb = extract("input.xlsx", mode="verbose") - >>> wb.sheets["Sheet1"].table_candidates - ['A1:B5'] + WorkbookData: Parsed workbook representation containing sheets, rows, shapes, charts, and print areas. """ include_links = True if mode == "verbose" else False include_colors_map = True if mode == "verbose" else None + include_formulas_map = True if mode == "verbose" else None engine = ExStructEngine( options=StructOptions( mode=mode, include_cell_links=include_links, include_colors_map=include_colors_map, + include_formulas_map=include_formulas_map, ) ) return engine.extract(file_path, mode=mode) diff --git a/src/exstruct/core/backends/base.py b/src/exstruct/core/backends/base.py index f678e54..7853a36 100644 --- a/src/exstruct/core/backends/base.py +++ b/src/exstruct/core/backends/base.py @@ -4,7 +4,7 @@ from typing import Protocol from ...models import CellRow, PrintArea -from ..cells import MergedCellRange, WorkbookColorsMap +from ..cells import MergedCellRange, WorkbookColorsMap, WorkbookFormulasMap CellData = dict[str, list[CellRow]] PrintAreaData = dict[str, list[PrintArea]] @@ -40,3 +40,11 @@ def extract_colors_map( def extract_merged_cells(self) -> MergedCellData: """Extract merged cell ranges from the workbook.""" + + def extract_formulas_map(self) -> WorkbookFormulasMap | None: + """ + Retrieve the workbook's formulas organized by worksheet. + + Returns: + WorkbookFormulasMap | None: A mapping of worksheet identifiers to their formulas, or `None` if the backend cannot provide a formulas map. + """ diff --git a/src/exstruct/core/backends/com_backend.py b/src/exstruct/core/backends/com_backend.py index c6bdaf3..0f7d2a9 100644 --- a/src/exstruct/core/backends/com_backend.py +++ b/src/exstruct/core/backends/com_backend.py @@ -9,7 +9,12 @@ import xlwings as xw from ...models import PrintArea -from ..cells import WorkbookColorsMap, extract_sheet_colors_map_com +from ..cells import ( + WorkbookColorsMap, + WorkbookFormulasMap, + extract_sheet_colors_map_com, + extract_sheet_formulas_map_com, +) from ..ranges import parse_range_zero_based from .base import MergedCellData, PrintAreaData @@ -58,14 +63,15 @@ def extract_print_areas(self) -> PrintAreaData: def extract_colors_map( self, *, include_default_background: bool, ignore_colors: set[str] | None ) -> WorkbookColorsMap | None: - """Extract colors_map via COM; logs and skips on failure. + """ + Extract a workbook colors map using the Excel COM API. - Args: - include_default_background: Whether to include default backgrounds. - ignore_colors: Optional set of color keys to ignore. + Parameters: + include_default_background (bool): Include the workbook's default background color in the resulting map. + ignore_colors (set[str] | None): Optional set of color keys to exclude from the map. Returns: - WorkbookColorsMap or None when extraction fails. + WorkbookColorsMap | None: A mapping of workbook color definitions when extraction succeeds, or `None` if COM extraction fails. """ try: return extract_sheet_colors_map_com( @@ -80,11 +86,30 @@ def extract_colors_map( ) return None + def extract_formulas_map(self) -> WorkbookFormulasMap | None: + """ + Extracts the workbook's formulas map using COM. + + Returns: + WorkbookFormulasMap or None: The extracted formulas map, or `None` if extraction failed. + """ + try: + return extract_sheet_formulas_map_com(self.workbook) + except Exception as exc: + logger.warning( + "COM formula map extraction failed; skipping formulas_map. (%r)", + exc, + ) + return None + def extract_auto_page_breaks(self) -> PrintAreaData: - """Compute auto page-break rectangles per sheet using Excel COM. + """ + Compute auto page-break rectangles for each worksheet using Excel COM. + + For each sheet, determine the sheet's print area (PageSetup.PrintArea or the used range) and split it into sub-rectangles along Excel's horizontal and vertical page breaks; parts that reference a different sheet are ignored. If extraction for a sheet fails, the sheet is skipped and a warning is logged. Returns: - Mapping of sheet name to auto page-break areas. + Mapping from sheet name to a list of PrintArea entries. Each PrintArea describes a rectangular region with `r1` and `r2` as 1-based row indices and `c1` and `c2` as 0-based column indices. """ results: PrintAreaData = {} for sheet in self.workbook.sheets: diff --git a/src/exstruct/core/backends/openpyxl_backend.py b/src/exstruct/core/backends/openpyxl_backend.py index f7593d3..bef9428 100644 --- a/src/exstruct/core/backends/openpyxl_backend.py +++ b/src/exstruct/core/backends/openpyxl_backend.py @@ -9,10 +9,12 @@ from ...models import PrintArea from ..cells import ( WorkbookColorsMap, + WorkbookFormulasMap, detect_tables_openpyxl, extract_sheet_cells, extract_sheet_cells_with_links, extract_sheet_colors_map, + extract_sheet_formulas_map, extract_sheet_merged_cells, ) from ..ranges import parse_range_zero_based @@ -99,14 +101,30 @@ def extract_merged_cells(self) -> MergedCellData: except Exception: return {} + def extract_formulas_map(self) -> WorkbookFormulasMap | None: + """ + Extract a mapping of workbook formulas for each sheet. + + Returns: + WorkbookFormulasMap | None: A mapping from sheet name to its formulas, or `None` if extraction fails. + """ + try: + return extract_sheet_formulas_map(self.file_path) + except Exception as exc: + logger.warning( + "Formula map extraction failed; skipping formulas_map. (%r)", exc + ) + return None + def detect_tables(self, sheet_name: str) -> list[str]: - """Detect table candidates for a single sheet. + """ + Detects table candidate ranges within the specified worksheet. - Args: - sheet_name: Target worksheet name. + Parameters: + sheet_name (str): Name of the worksheet to analyze for table candidates. Returns: - List of table candidate ranges. + list[str]: Detected table candidate ranges as A1-style range strings; empty list if none are found or detection fails. """ try: return detect_tables_openpyxl(self.file_path, sheet_name) diff --git a/src/exstruct/core/cells.py b/src/exstruct/core/cells.py index de3b352..1ad915a 100644 --- a/src/exstruct/core/cells.py +++ b/src/exstruct/core/cells.py @@ -56,13 +56,41 @@ class WorkbookColorsMap: sheets: dict[str, SheetColorsMap] def get_sheet(self, sheet_name: str) -> SheetColorsMap | None: - """Return the colors map for a sheet if available. + """ + Retrieve the SheetColorsMap for a worksheet by name. - Args: - sheet_name: Target worksheet name. + Parameters: + sheet_name (str): Name of the worksheet to retrieve. Returns: - SheetColorsMap for the sheet, or None if missing. + SheetColorsMap | None: The sheet's color map if present, `None` otherwise. + """ + return self.sheets.get(sheet_name) + + +@dataclass(frozen=True) +class SheetFormulasMap: + """Formula map for a single worksheet.""" + + sheet_name: str + formulas_map: dict[str, list[tuple[int, int]]] + + +@dataclass(frozen=True) +class WorkbookFormulasMap: + """Formula maps for all worksheets in a workbook.""" + + sheets: dict[str, SheetFormulasMap] + + def get_sheet(self, sheet_name: str) -> SheetFormulasMap | None: + """ + Retrieve the formulas map for a worksheet. + + Parameters: + sheet_name (str): Name of the worksheet to look up. + + Returns: + SheetFormulasMap | None: The sheet's formulas map if present, `None` if the worksheet is not found. """ return self.sheets.get(sheet_name) @@ -102,22 +130,79 @@ def extract_sheet_colors_map( return WorkbookColorsMap(sheets=sheets) +def extract_sheet_formulas_map(file_path: Path) -> WorkbookFormulasMap: + """ + Extract normalized formula strings from every worksheet in the workbook. + + Parameters: + file_path (Path): Path to the Excel workbook to read. + + Returns: + WorkbookFormulasMap: Mapping of sheet names to SheetFormulasMap objects. Each SheetFormulasMap contains a mapping from normalized formula strings (each beginning with "=") to a list of cell coordinates (row, column) where that formula occurs. + """ + sheets: dict[str, SheetFormulasMap] = {} + with openpyxl_workbook(file_path, data_only=False, read_only=False) as wb: + for ws in wb.worksheets: + sheet_map = _extract_sheet_formulas(ws) + sheets[ws.title] = sheet_map + return WorkbookFormulasMap(sheets=sheets) + + +def extract_sheet_formulas_map_com(workbook: xw.Book) -> WorkbookFormulasMap: + """ + Collects and normalizes formulas from every worksheet in an xlwings workbook into per-sheet mappings. + + Parameters: + workbook: xlwings Book instance whose sheets will be scanned for formulas. + + Returns: + WorkbookFormulasMap: maps sheet names to SheetFormulasMap objects. Each SheetFormulasMap.formulas_map maps a normalized formula string (consistent representation, e.g., beginning with "=") to a list of (row, column) tuples where row is 1-based and column is 0-based. + """ + sheets: dict[str, SheetFormulasMap] = {} + for sheet in workbook.sheets: + formulas_map: dict[str, list[tuple[int, int]]] = {} + used = sheet.used_range + start_row = int(getattr(used, "row", 1)) + start_col = int(getattr(used, "column", 1)) + max_row = used.last_cell.row + max_col = used.last_cell.column + if max_row <= 0 or max_col <= 0: + sheets[sheet.name] = SheetFormulasMap( + sheet_name=sheet.name, formulas_map=formulas_map + ) + continue + rng = sheet.range((start_row, start_col), (max_row, max_col)) + matrix = _normalize_matrix(rng.formula) + for r_offset, row in enumerate(matrix): + for c_offset, value in enumerate(row): + normalized = _normalize_formula_from_com(value) + if normalized is None: + continue + row_index = start_row + r_offset + col_index = start_col + c_offset - 1 + formulas_map.setdefault(normalized, []).append((row_index, col_index)) + sheets[sheet.name] = SheetFormulasMap( + sheet_name=sheet.name, formulas_map=formulas_map + ) + return WorkbookFormulasMap(sheets=sheets) + + def extract_sheet_colors_map_com( workbook: xw.Book, *, include_default_background: bool, ignore_colors: set[str] | None, ) -> WorkbookColorsMap: - """Extract background colors for each worksheet via COM display formats. + """ + Extract per-sheet background color maps using the workbook's COM/display-format interfaces. - Args: - workbook: xlwings workbook instance. - include_default_background: Whether to include default (white) backgrounds - within the used range. - ignore_colors: Optional set of color keys to ignore. + Parameters: + workbook (xw.Book): xlwings workbook whose sheets will be inspected. + include_default_background (bool): If true, include default background colors (e.g., white) for cells inside each sheet's used range. + ignore_colors (set[str] | None): Optional set of normalized color keys to exclude from results. Returns: - WorkbookColorsMap containing per-sheet color maps. + WorkbookColorsMap: Mapping of sheet names to SheetColorsMap containing detected background color positions for each worksheet. """ _prepare_workbook_for_display_format(workbook) sheets: dict[str, SheetColorsMap] = {} @@ -133,15 +218,16 @@ def extract_sheet_colors_map_com( def _extract_sheet_colors( ws: Worksheet, include_default_background: bool, ignore_colors: set[str] | None ) -> SheetColorsMap: - """Extract background colors for a single worksheet. + """ + Extract the background color locations present on a single worksheet. - Args: - ws: Target worksheet. - include_default_background: Whether to include default (white) backgrounds. - ignore_colors: Optional set of color keys to ignore. + Parameters: + ws (Worksheet): Worksheet to scan. + include_default_background (bool): If true, treat cells with the workbook default/background color as having a color key. + ignore_colors (set[str] | None): Optional set of color keys to ignore (keys are normalized before comparison). Returns: - SheetColorsMap for the worksheet. + SheetColorsMap: Mapping from normalized color key to a list of cell coordinates where that color appears. Coordinates are tuples (row, col) where `row` is 1-based and `col` is 0-based. """ min_row, min_col, max_row, max_col = _get_used_range_bounds(ws) colors_map: dict[str, list[tuple[int, int]]] = {} @@ -165,18 +251,90 @@ def _extract_sheet_colors( return SheetColorsMap(sheet_name=ws.title, colors_map=colors_map) +def _extract_sheet_formulas(ws: Worksheet) -> SheetFormulasMap: + """ + Collect normalized formula strings from a worksheet and group their cell coordinates. + + Parameters: + ws (Worksheet): Worksheet to scan for formulas. + + Returns: + SheetFormulasMap: container with the sheet's name and a mapping from each normalized formula string (prefixed with "=") to a list of cell coordinates as (row, zero-based-column). + """ + min_row, min_col, max_row, max_col = _get_used_range_bounds(ws) + formulas_map: dict[str, list[tuple[int, int]]] = {} + if min_row > max_row or min_col > max_col: + return SheetFormulasMap(sheet_name=ws.title, formulas_map=formulas_map) + + for row in ws.iter_rows( + min_row=min_row, max_row=max_row, min_col=min_col, max_col=max_col + ): + for cell in row: + if getattr(cell, "data_type", None) != "f": + continue + normalized = _normalize_formula_value(getattr(cell, "value", None)) + if normalized is None: + continue + formulas_map.setdefault(normalized, []).append((cell.row, cell.col_idx - 1)) + return SheetFormulasMap(sheet_name=ws.title, formulas_map=formulas_map) + + +def _normalize_formula_value(value: object) -> str | None: + """Normalize a formula string for openpyxl cells. + + Args: + value: Raw cell value. + + Returns: + Formula string with leading "=", or None when empty. + """ + if value is None: + return None + array_text = getattr(value, "text", None) + if array_text is not None: + text = str(array_text) + else: + text = str(value) + if text == "": + return None + if not text.startswith("="): + return f"={text}" + return text + + +def _normalize_formula_from_com(value: object) -> str | None: + """ + Normalize a COM-returned cell formula into a string that begins with '='. + + Parameters: + value (object): Raw value returned from COM for a cell's formula. + + Returns: + str | None: The input string if it is non-empty and starts with '=', `None` otherwise. + """ + if value is None or not isinstance(value, str): + return None + text = value + if text == "": + return None + if not text.startswith("="): + return None + return text + + def _extract_sheet_colors_com( sheet: xw.Sheet, include_default_background: bool, ignore_colors: set[str] | None ) -> SheetColorsMap: - """Extract background colors for a single worksheet via COM. + """ + Extract per-sheet background color mapping using COM/DisplayFormat. - Args: - sheet: Target worksheet. - include_default_background: Whether to include default (white) backgrounds. - ignore_colors: Optional set of color keys to ignore. + Parameters: + sheet (xw.Sheet): xlwings sheet object to inspect. + include_default_background (bool): If True, include cells whose background is the workbook default color. + ignore_colors (set[str] | None): Optional set of normalized color keys to exclude from the result. Returns: - SheetColorsMap for the worksheet. + SheetColorsMap: Mapping from normalized color key (hex/theme/index canonical form) to a list of cell coordinates where that color appears. Each coordinate is a tuple (row, col) where `row` is the worksheet row number (1-based) and `col` is the zero-based column index. """ colors_map: dict[str, list[tuple[int, int]]] = {} used = sheet.used_range diff --git a/src/exstruct/core/integrate.py b/src/exstruct/core/integrate.py index 03017ce..b610b90 100644 --- a/src/exstruct/core/integrate.py +++ b/src/exstruct/core/integrate.py @@ -17,30 +17,33 @@ def extract_workbook( # noqa: C901 include_colors_map: bool | None = None, include_default_background: bool = False, ignore_colors: set[str] | None = None, + include_formulas_map: bool | None = None, include_merged_cells: bool | None = None, include_merged_values_in_rows: bool = True, ) -> WorkbookData: - """Extract workbook and return WorkbookData. - - Falls back to cells+tables if Excel COM is unavailable. - - Args: - file_path: Workbook path. - mode: Extraction mode. - include_cell_links: Whether to include cell hyperlinks; None uses mode defaults. - include_print_areas: Whether to include print areas; None defaults to True. - include_auto_page_breaks: Whether to include auto page breaks. - include_colors_map: Whether to include colors map; None uses mode defaults. - include_default_background: Whether to include default background color. - ignore_colors: Optional set of color keys to ignore. - include_merged_cells: Whether to include merged cell ranges; None uses mode defaults. - include_merged_values_in_rows: Whether to keep merged values in rows. + """ + Extract a workbook into a structured WorkbookData representation. + + May fall back to cells+tables extraction if Excel COM automation is unavailable. + + Parameters: + file_path (str | Path): Path to the workbook file. + mode (Literal['light', 'standard', 'verbose']): Extraction mode that controls detail level. + include_cell_links (bool | None): Include cell hyperlinks; `None` uses mode defaults. + include_print_areas (bool | None): Include print areas; `None` defaults to True. + include_auto_page_breaks (bool): Include automatic page break information. + include_colors_map (bool | None): Include a colors map; `None` uses mode defaults. + include_default_background (bool): Include default background color when present. + ignore_colors (set[str] | None): Set of color keys to ignore during color mapping. + include_formulas_map (bool | None): Include a map of cell formulas; `None` uses mode defaults. + include_merged_cells (bool | None): Include merged cell ranges; `None` uses mode defaults. + include_merged_values_in_rows (bool): Preserve merged cell values in row-wise output. Returns: - Extracted WorkbookData. + WorkbookData: The extracted workbook representation. Raises: - ValueError: If mode is unsupported. + ValueError: If `mode` is not one of "light", "standard", or "verbose". """ inputs = resolve_extraction_inputs( file_path, @@ -51,6 +54,7 @@ def extract_workbook( # noqa: C901 include_colors_map=include_colors_map, include_default_background=include_default_background, ignore_colors=ignore_colors, + include_formulas_map=include_formulas_map, include_merged_cells=include_merged_cells, include_merged_values_in_rows=include_merged_values_in_rows, ) diff --git a/src/exstruct/core/modeling.py b/src/exstruct/core/modeling.py index f92f32a..7b801f9 100644 --- a/src/exstruct/core/modeling.py +++ b/src/exstruct/core/modeling.py @@ -27,6 +27,7 @@ class SheetRawData: table_candidates: Detected table ranges. print_areas: Extracted print areas. auto_print_areas: Extracted auto page-break areas. + formulas_map: Mapping of formula strings to (row, column) positions. colors_map: Mapping of color keys to (row, column) positions. merged_cells: Extracted merged cell ranges. """ @@ -37,6 +38,7 @@ class SheetRawData: table_candidates: list[str] print_areas: list[PrintArea] auto_print_areas: list[PrintArea] + formulas_map: dict[str, list[tuple[int, int]]] colors_map: dict[str, list[tuple[int, int]]] merged_cells: list[MergedCellRange] @@ -70,6 +72,7 @@ def build_sheet_data(raw: SheetRawData) -> SheetData: table_candidates=raw.table_candidates, print_areas=raw.print_areas, auto_print_areas=raw.auto_print_areas, + formulas_map=raw.formulas_map, colors_map=raw.colors_map, merged_cells=_build_merged_cells(raw.merged_cells), ) diff --git a/src/exstruct/core/pipeline.py b/src/exstruct/core/pipeline.py index 363596a..53a4fda 100644 --- a/src/exstruct/core/pipeline.py +++ b/src/exstruct/core/pipeline.py @@ -21,7 +21,13 @@ ) from .backends.com_backend import ComBackend from .backends.openpyxl_backend import OpenpyxlBackend -from .cells import MergedCellRange, WorkbookColorsMap, detect_tables +from .cells import ( + MergedCellRange, + WorkbookColorsMap, + WorkbookFormulasMap, + detect_tables, + warn_once, +) from .charts import get_charts from .logging_utils import log_fallback from .modeling import SheetRawData, WorkbookRawData, build_workbook_data @@ -51,6 +57,8 @@ class ExtractionInputs: include_colors_map: Whether to include background colors map. include_default_background: Whether to include default background color. ignore_colors: Optional set of color keys to ignore. + include_formulas_map: Whether to include formulas map. + use_com_for_formulas: Whether to use COM for formulas extraction. include_merged_cells: Whether to include merged cell ranges. include_merged_values_in_rows: Whether to keep merged values in rows. """ @@ -63,6 +71,8 @@ class ExtractionInputs: include_colors_map: bool include_default_background: bool ignore_colors: set[str] | None + include_formulas_map: bool + use_com_for_formulas: bool include_merged_cells: bool include_merged_values_in_rows: bool @@ -75,6 +85,7 @@ class ExtractionArtifacts: cell_data: Extracted cell rows per sheet. print_area_data: Extracted print areas per sheet. auto_page_break_data: Extracted auto page-break areas per sheet. + formulas_map_data: Extracted formulas map for workbook sheets. colors_map_data: Extracted colors map for workbook sheets. shape_data: Extracted shapes per sheet. chart_data: Extracted charts per sheet. @@ -84,6 +95,7 @@ class ExtractionArtifacts: cell_data: CellData = field(default_factory=dict) print_area_data: PrintAreaData = field(default_factory=dict) auto_page_break_data: PrintAreaData = field(default_factory=dict) + formulas_map_data: WorkbookFormulasMap | None = None colors_map_data: WorkbookColorsMap | None = None shape_data: ShapeData = field(default_factory=dict) chart_data: ChartData = field(default_factory=dict) @@ -179,6 +191,7 @@ def resolve_extraction_inputs( include_colors_map: bool | None, include_default_background: bool, ignore_colors: set[str] | None, + include_formulas_map: bool | None, include_merged_cells: bool | None, include_merged_values_in_rows: bool, ) -> ExtractionInputs: @@ -193,6 +206,7 @@ def resolve_extraction_inputs( include_colors_map: Whether to include background colors; None uses mode defaults. include_default_background: Include default background colors when colors_map is enabled. ignore_colors: Optional set of colors to ignore when colors_map is enabled. + include_formulas_map: Whether to include formulas map; None uses mode defaults. include_merged_cells: Whether to include merged cell ranges; None uses mode defaults. include_merged_values_in_rows: Whether to keep merged values in rows. @@ -222,6 +236,19 @@ def resolve_extraction_inputs( resolved_ignore_colors = ignore_colors if resolved_colors_map else None if resolved_colors_map and resolved_ignore_colors is None: resolved_ignore_colors = set() + resolved_formulas_map = ( + include_formulas_map if include_formulas_map is not None else mode == "verbose" + ) + file_suffix = normalized_file_path.suffix.lower() + use_com_for_formulas = resolved_formulas_map and file_suffix == ".xls" + if use_com_for_formulas: + warn_once( + f"xls-formulas-fallback::{normalized_file_path}", + ( + f"File '{normalized_file_path.name}' is .xls (BIFF); openpyxl cannot " + "read formulas. Falling back to COM-based extraction (slower)." + ), + ) resolved_merged_cells = ( include_merged_cells if include_merged_cells is not None else mode != "light" ) @@ -237,24 +264,27 @@ def resolve_extraction_inputs( include_colors_map=resolved_colors_map, include_default_background=resolved_default_background, ignore_colors=resolved_ignore_colors, + include_formulas_map=resolved_formulas_map, + use_com_for_formulas=use_com_for_formulas, include_merged_cells=resolved_merged_cells, include_merged_values_in_rows=include_merged_values_in_rows, ) def build_pipeline_plan(inputs: ExtractionInputs) -> PipelinePlan: - """Build a pipeline plan based on resolved inputs. + """ + Builds a pipeline plan describing which pre-COM and COM extraction steps to run for the given resolved inputs. - Args: - inputs: Resolved pipeline inputs. + Parameters: + inputs (ExtractionInputs): Resolved extraction configuration (including mode and COM/formulas flags). Returns: - PipelinePlan containing pre-COM/COM steps and COM usage flag. + PipelinePlan: Plan containing ordered `pre_com_steps`, ordered `com_steps`, and `use_com` set to true when the pipeline should use COM (when `mode` is not "light" or `use_com_for_formulas` is true). """ return PipelinePlan( pre_com_steps=build_pre_com_pipeline(inputs), com_steps=build_com_pipeline(inputs), - use_com=inputs.mode != "light", + use_com=inputs.mode != "light" or inputs.use_com_for_formulas, ) @@ -279,6 +309,12 @@ def build_pre_com_pipeline(inputs: ExtractionInputs) -> list[ExtractionStep]: step=step_extract_print_areas_openpyxl, enabled=lambda _inputs: _inputs.include_print_areas, ), + StepConfig( + name="formulas_map_openpyxl", + step=step_extract_formulas_map_openpyxl, + enabled=lambda _inputs: _inputs.include_formulas_map + and not _inputs.use_com_for_formulas, + ), StepConfig( name="colors_map_openpyxl", step=step_extract_colors_map_openpyxl, @@ -301,6 +337,12 @@ def build_pre_com_pipeline(inputs: ExtractionInputs) -> list[ExtractionStep]: step=step_extract_print_areas_openpyxl, enabled=lambda _inputs: _inputs.include_print_areas, ), + StepConfig( + name="formulas_map_openpyxl", + step=step_extract_formulas_map_openpyxl, + enabled=lambda _inputs: _inputs.include_formulas_map + and not _inputs.use_com_for_formulas, + ), StepConfig( name="colors_map_openpyxl_if_skip_com", step=step_extract_colors_map_openpyxl, @@ -324,6 +366,12 @@ def build_pre_com_pipeline(inputs: ExtractionInputs) -> list[ExtractionStep]: step=step_extract_print_areas_openpyxl, enabled=lambda _inputs: _inputs.include_print_areas, ), + StepConfig( + name="formulas_map_openpyxl", + step=step_extract_formulas_map_openpyxl, + enabled=lambda _inputs: _inputs.include_formulas_map + and not _inputs.use_com_for_formulas, + ), StepConfig( name="colors_map_openpyxl_if_skip_com", step=step_extract_colors_map_openpyxl, @@ -353,18 +401,18 @@ def build_com_pipeline(inputs: ExtractionInputs) -> list[ComExtractionStep]: Returns: Ordered list of COM extraction steps. """ - if inputs.mode == "light": + if inputs.mode == "light" and not inputs.use_com_for_formulas: return [] step_table: Sequence[ComStepConfig] = ( ComStepConfig( name="shapes_com", step=step_extract_shapes_com, - enabled=lambda _inputs: True, + enabled=lambda _inputs: _inputs.mode != "light", ), ComStepConfig( name="charts_com", step=step_extract_charts_com, - enabled=lambda _inputs: True, + enabled=lambda _inputs: _inputs.mode != "light", ), ComStepConfig( name="print_areas_com", @@ -376,6 +424,12 @@ def build_com_pipeline(inputs: ExtractionInputs) -> list[ComExtractionStep]: step=step_extract_auto_page_breaks_com, enabled=lambda _inputs: _inputs.include_auto_page_breaks, ), + ComStepConfig( + name="formulas_map_com", + step=step_extract_formulas_map_com, + enabled=lambda _inputs: _inputs.include_formulas_map + and _inputs.use_com_for_formulas, + ), ComStepConfig( name="colors_map_com", step=step_extract_colors_map_com, @@ -447,24 +501,47 @@ def step_extract_cells( def step_extract_print_areas_openpyxl( inputs: ExtractionInputs, artifacts: ExtractionArtifacts ) -> None: - """Extract print areas via openpyxl. + """ + Extract print areas from the workbook and populate artifacts.print_area_data. - Args: - inputs: Pipeline inputs. - artifacts: Artifact container to update. + Parameters: + inputs (ExtractionInputs): Pipeline inputs containing the file path and extraction options. + artifacts (ExtractionArtifacts): Mutable artifact container; `artifacts.print_area_data` will be set to the extracted print area mapping. """ backend = OpenpyxlBackend(inputs.file_path) artifacts.print_area_data = backend.extract_print_areas() +def step_extract_formulas_map_openpyxl( + inputs: ExtractionInputs, artifacts: ExtractionArtifacts +) -> None: + """ + Populate artifacts.formulas_map_data by extracting workbook formulas using openpyxl. + + Attempts to extract a WorkbookFormulasMap from the file at inputs.file_path and stores it on artifacts.formulas_map_data. If extraction fails, a warning is logged and artifacts.formulas_map_data is left unchanged. + + Parameters: + inputs (ExtractionInputs): Resolved pipeline inputs (provides file_path). + artifacts (ExtractionArtifacts): Mutable container to receive the extracted formulas map. + """ + backend = OpenpyxlBackend(inputs.file_path) + try: + artifacts.formulas_map_data = backend.extract_formulas_map() + except Exception as exc: + logger.warning( + "Failed to extract formulas_map via openpyxl. (%r)", + exc, + ) + + def step_extract_colors_map_openpyxl( inputs: ExtractionInputs, artifacts: ExtractionArtifacts ) -> None: - """Extract colors_map via openpyxl; logs and skips on failure. + """ + Extract the workbook colors map using openpyxl and store it on the artifacts. - Args: - inputs: Pipeline inputs. - artifacts: Artifact container to update. + Sets artifacts.colors_map_data to the colors map extracted from inputs.file_path, + respecting inputs.include_default_background and inputs.ignore_colors. """ backend = OpenpyxlBackend(inputs.file_path) artifacts.colors_map_data = backend.extract_colors_map( @@ -533,16 +610,38 @@ def step_extract_print_areas_com( def step_extract_auto_page_breaks_com( inputs: ExtractionInputs, artifacts: ExtractionArtifacts, workbook: xw.Book ) -> None: - """Extract auto page breaks via COM. + """ + Extract auto page break information from a COM workbook and store it in the artifacts. - Args: - inputs: Pipeline inputs. - artifacts: Artifact container to update. - workbook: xlwings workbook instance. + Parameters: + inputs (ExtractionInputs): Pipeline inputs that may influence extraction behavior. + artifacts (ExtractionArtifacts): Mutable artifact container; updated with extracted data. + workbook (xw.Book): xlwings COM workbook used to read auto page break settings. """ artifacts.auto_page_break_data = ComBackend(workbook).extract_auto_page_breaks() +def step_extract_formulas_map_com( + inputs: ExtractionInputs, artifacts: ExtractionArtifacts, workbook: xw.Book +) -> None: + """ + Extract the workbook's formulas map via COM and store it into the artifacts. + + On success assigns the extracted WorkbookFormulasMap to artifacts.formulas_map_data. + On failure leaves artifacts.formulas_map_data unchanged and logs a warning. + + Parameters: + workbook (xlwings.Book): COM workbook to extract formulas from. + """ + try: + artifacts.formulas_map_data = ComBackend(workbook).extract_formulas_map() + except Exception as exc: + logger.warning( + "Failed to extract formulas_map via COM. (%r)", + exc, + ) + + def step_extract_colors_map_com( inputs: ExtractionInputs, artifacts: ExtractionArtifacts, workbook: xw.Book ) -> None: @@ -572,14 +671,15 @@ def step_extract_colors_map_com( def _resolve_sheet_colors_map( colors_map_data: WorkbookColorsMap | None, sheet_name: str ) -> dict[str, list[tuple[int, int]]]: - """Resolve colors_map for a single sheet. + """ + Resolve the colors map for a given sheet. - Args: - colors_map_data: Optional workbook colors map container. - sheet_name: Target sheet name. + Parameters: + colors_map_data (WorkbookColorsMap | None): Optional workbook-level colors map container. + sheet_name (str): Name of the sheet to resolve. Returns: - colors_map dictionary for the sheet, or empty dict if unavailable. + dict[str, list[tuple[int, int]]]: Mapping of color keys to lists of (start_col, end_col) intervals for the sheet; empty dict if no colors map is available for the workbook or sheet. """ if not colors_map_data: return {} @@ -589,18 +689,43 @@ def _resolve_sheet_colors_map( return sheet_colors.colors_map +def _resolve_sheet_formulas_map( + formulas_map_data: WorkbookFormulasMap | None, sheet_name: str +) -> dict[str, list[tuple[int, int]]]: + """ + Get the formulas map for a named sheet from a workbook formulas container. + + Parameters: + formulas_map_data: Optional workbook formulas map container; may be None. + sheet_name: Name of the sheet to resolve formulas for. + + Returns: + A mapping for the sheet (str -> list of (row, column) tuples) representing formula locations, or an empty dict if no data is available. + """ + if not formulas_map_data: + return {} + sheet_formulas = formulas_map_data.get_sheet(sheet_name) + if sheet_formulas is None: + return {} + return sheet_formulas.formulas_map + + def _filter_rows_excluding_merged_values( rows: list[CellRow], merged_cells: list[MergedCellRange], ) -> list[CellRow]: - """Remove merged-cell values from rows. + """ + Filter out cell values that originate from merged-cell ranges. - Args: - rows: Extracted rows. - merged_cells: Merged cell ranges. + Parameters: + rows (list[CellRow]): Extracted rows to filter. + merged_cells (list[MergedCellRange]): Merged cell ranges to exclude values from. Returns: - Filtered rows with merged-cell values removed. + list[CellRow]: Rows where any cell whose column index falls inside a merged range has been removed. + - Rows with no remaining cells are omitted. + - Cell entries with non-integer column keys are preserved. + - `links` are retained only for cells that remain; if a row has no links after filtering, `links` is set to None. """ if not rows or not merged_cells: return rows @@ -702,24 +827,29 @@ def collect_sheet_raw_data( include_merged_values_in_rows: bool, print_area_data: PrintAreaData | None = None, auto_page_break_data: PrintAreaData | None = None, + formulas_map_data: WorkbookFormulasMap | None = None, colors_map_data: WorkbookColorsMap | None = None, ) -> dict[str, SheetRawData]: - """Collect per-sheet raw data from extraction artifacts. - - Args: - cell_data: Extracted cell rows per sheet. - shape_data: Extracted shapes per sheet. - chart_data: Extracted charts per sheet. - merged_cell_data: Extracted merged cells per sheet. - workbook: xlwings workbook instance. - mode: Extraction mode. - print_area_data: Optional print area data per sheet. - auto_page_break_data: Optional auto page-break data per sheet. - colors_map_data: Optional colors map data. - include_merged_values_in_rows: Whether to keep merged values in rows. + """ + Collect per-sheet raw extraction data and assemble SheetRawData for each sheet. + + For each sheet in cell_data this returns a SheetRawData containing rows (optionally excluding values contributed by merged cells), shapes, charts (omitted in "light" mode), detected table candidates, print/auto-print areas, per-sheet formulas map, per-sheet colors map, and merged cell ranges. + + Parameters: + cell_data (CellData): Extracted cell rows keyed by sheet name. + shape_data (ShapeData): Extracted shapes keyed by sheet name. + chart_data (ChartData): Extracted charts keyed by sheet name. + merged_cell_data (MergedCellData): Merged cell ranges keyed by sheet name. + workbook (xw.Book): xlwings workbook used to resolve sheets and detect tables. + mode (ExtractionMode): Extraction mode; when "light", charts are omitted. + include_merged_values_in_rows (bool): If False, remove values that originate from merged cells when building row data. + print_area_data (PrintAreaData | None): Optional print areas keyed by sheet name. + auto_page_break_data (PrintAreaData | None): Optional auto page-break areas keyed by sheet name. + formulas_map_data (WorkbookFormulasMap | None): Optional per-sheet formulas map to include in SheetRawData. + colors_map_data (WorkbookColorsMap | None): Optional per-sheet colors map to include in SheetRawData. Returns: - Mapping of sheet name to raw sheet data. + dict[str, SheetRawData]: Mapping from sheet name to the assembled SheetRawData. """ result: dict[str, SheetRawData] = {} for sheet_name, rows in cell_data.items(): @@ -739,6 +869,7 @@ def collect_sheet_raw_data( auto_print_areas=auto_page_break_data.get(sheet_name, []) if auto_page_break_data else [], + formulas_map=_resolve_sheet_formulas_map(formulas_map_data, sheet_name), colors_map=_resolve_sheet_colors_map(colors_map_data, sheet_name), merged_cells=merged_cells, ) @@ -747,13 +878,14 @@ def collect_sheet_raw_data( def run_extraction_pipeline(inputs: ExtractionInputs) -> PipelineResult: - """Run the full extraction pipeline and return the result. + """ + Execute the configured extraction pipeline and produce the extraction result. - Args: - inputs: Resolved pipeline inputs. + Parameters: + inputs (ExtractionInputs): Resolved pipeline inputs controlling which extraction steps run. Returns: - PipelineResult with workbook data, artifacts, and execution state. + PipelineResult: Contains the constructed workbook data, collected artifacts, and pipeline execution state (including COM attempt/success and any fallback reason). """ plan = build_pipeline_plan(inputs) artifacts = run_pipeline(plan.pre_com_steps, inputs, ExtractionArtifacts()) @@ -797,6 +929,7 @@ def _fallback(message: str, reason: FallbackReason) -> PipelineResult: auto_page_break_data=artifacts.auto_page_break_data if inputs.include_auto_page_breaks else None, + formulas_map_data=artifacts.formulas_map_data, colors_map_data=artifacts.colors_map_data, ) raw_workbook = WorkbookRawData( @@ -826,15 +959,16 @@ def build_cells_tables_workbook( artifacts: ExtractionArtifacts, reason: str, ) -> WorkbookData: - """Build a WorkbookData containing cells + table_candidates (fallback). + """ + Builds a WorkbookData from available cell rows and detected table candidates to use as a fallback when COM-based extraction is not used or has failed. - Args: - inputs: Pipeline inputs. - artifacts: Collected artifacts from extraction steps. - reason: Reason to log for fallback. + Parameters: + inputs (ExtractionInputs): Resolved extraction inputs that control which extra maps and merged-value handling to include. + artifacts (ExtractionArtifacts): Collected artifacts produced by pre-COM extraction steps; cell rows and any existing maps are consumed from here. + reason (str): Short description of why the fallback is being used (logged for debugging). Returns: - WorkbookData constructed from cells and detected tables. + WorkbookData: A workbook composed from the available per-sheet cell rows, detected table candidates, merged-cell information, and any resolved formulas and colors maps. Shapes and charts are empty in this fallback path; formulas and colors maps are extracted from artifacts or from the Openpyxl backend when requested and not already present. """ logger.debug("Building fallback workbook: %s", reason) backend = OpenpyxlBackend(inputs.file_path) @@ -844,11 +978,21 @@ def build_cells_tables_workbook( include_default_background=inputs.include_default_background, ignore_colors=inputs.ignore_colors, ) + formulas_map_data = artifacts.formulas_map_data + if ( + inputs.include_formulas_map + and formulas_map_data is None + and not inputs.use_com_for_formulas + ): + formulas_map_data = backend.extract_formulas_map() sheets: dict[str, SheetRawData] = {} for sheet_name, rows in artifacts.cell_data.items(): sheet_colors = ( colors_map_data.get_sheet(sheet_name) if colors_map_data else None ) + sheet_formulas = ( + formulas_map_data.get_sheet(sheet_name) if formulas_map_data else None + ) tables = backend.detect_tables(sheet_name) merged_cells = artifacts.merged_cell_data.get(sheet_name, []) filtered_rows = ( @@ -865,6 +1009,7 @@ def build_cells_tables_workbook( if inputs.include_print_areas else [], auto_print_areas=[], + formulas_map=sheet_formulas.formulas_map if sheet_formulas else {}, colors_map=sheet_colors.colors_map if sheet_colors else {}, merged_cells=merged_cells, ) diff --git a/src/exstruct/core/workbook.py b/src/exstruct/core/workbook.py index 4a7568b..3aa9430 100644 --- a/src/exstruct/core/workbook.py +++ b/src/exstruct/core/workbook.py @@ -12,20 +12,23 @@ logger = logging.getLogger(__name__) +__all__ = ["openpyxl_workbook", "xlwings_workbook", "_find_open_workbook", "xw"] + @contextmanager def openpyxl_workbook( file_path: Path, *, data_only: bool, read_only: bool ) -> Iterator[Any]: - """Open an openpyxl workbook and ensure it is closed. + """ + Open an openpyxl Workbook for temporary use and ensure it is closed on exit. - Args: - file_path: Workbook path. - data_only: Whether to read formula results. - read_only: Whether to open in read-only mode. + Parameters: + file_path (Path): Path to the workbook file. + data_only (bool): If True, read stored cell values instead of formulas. + read_only (bool): If True, open the workbook in optimized read-only mode. Yields: - openpyxl workbook instance. + openpyxl.workbook.workbook.Workbook: The opened workbook instance. """ with warnings.catch_warnings(): warnings.filterwarnings( diff --git a/src/exstruct/engine.py b/src/exstruct/engine.py index 58a51e6..2788adb 100644 --- a/src/exstruct/engine.py +++ b/src/exstruct/engine.py @@ -70,6 +70,7 @@ class StructOptions: before extraction. Use this to tweak table detection heuristics per engine instance without touching global state. include_colors_map: Whether to extract background color maps. + include_formulas_map: Whether to extract formulas map. include_merged_cells: Whether to extract merged cell ranges. include_merged_values_in_rows: Whether to keep merged values in rows. colors: Color extraction options. @@ -81,6 +82,7 @@ class StructOptions: ) include_cell_links: bool | None = None # None -> auto: verbose=True, others=False include_colors_map: bool | None = None # None -> auto: verbose=True, others=False + include_formulas_map: bool | None = None # None -> auto: verbose=True, others=False include_merged_cells: bool | None = None # None -> auto: light=False, others=True include_merged_values_in_rows: bool = True colors: ColorsOptions = field(default_factory=ColorsOptions) @@ -259,6 +261,24 @@ def _include_auto_print_areas(self) -> bool: def _filter_sheet( self, sheet: SheetData, include_auto_override: bool | None = None ) -> SheetData: + """ + Return a filtered copy of a SheetData according to the engine's output filters and resolved size/print-area flags. + + Parameters: + sheet: The original SheetData to filter. + include_auto_override: If not None, overrides the engine's automatic decision for including auto page-break areas; if None, the engine's auto rule is used. + + Returns: + A new SheetData where: + - rows are kept only if include_rows is enabled; otherwise an empty list. + - shapes are kept only if include_shapes is enabled; when kept and shape-size inclusion is disabled, each shape's width and height are cleared. + - charts are kept only if include_charts is enabled; when kept and chart-size inclusion is disabled, each chart's width and height are cleared. + - table_candidates are kept only if include_tables is enabled; otherwise an empty list. + - colors_map and formulas_map are preserved as-is. + - print_areas are kept only if print areas are included by the engine; otherwise an empty list. + - auto_print_areas are kept only if auto page-break areas are included (after applying include_auto_override); otherwise an empty list. + - merged_cells are kept only if include_merged_cells is enabled; otherwise set to None. + """ include_shape_size, include_chart_size = self._resolve_size_flags() include_print_areas = self._include_print_areas() include_auto_print_areas = ( @@ -284,6 +304,7 @@ def _filter_sheet( if self.output.filters.include_tables else [], colors_map=sheet.colors_map, + formulas_map=sheet.formulas_map, print_areas=sheet.print_areas if include_print_areas else [], auto_print_areas=sheet.auto_print_areas if include_auto_print_areas else [], merged_cells=sheet.merged_cells @@ -332,15 +353,15 @@ def extract( self, file_path: str | Path, *, mode: ExtractionMode | None = None ) -> WorkbookData: """ - Extract a workbook and return normalized workbook data. + Produce a normalized WorkbookData extracted from the given workbook file. - Args: - file_path: Path to the .xlsx/.xlsm/.xls file to extract. - mode: Extraction mode; defaults to the engine's StructOptions.mode. - - light: COM-free; cells, table candidates, and print areas only. - - standard: Shapes with text/arrows plus charts; print areas included; - size fields retained but hidden from default output. - - verbose: All shapes (with size) and charts (with size). + Parameters: + file_path (str | Path): Path to the .xlsx/.xlsm/.xls file to extract. + mode (ExtractionMode | None): Extraction mode to use; if None the engine's configured mode is used. + Modes: "light", "standard", "verbose". + + Returns: + WorkbookData: Normalized workbook data extracted from the file. """ chosen_mode = mode or self.options.mode include_auto_page_breaks = ( @@ -358,6 +379,7 @@ def extract( include_colors_map=self.options.include_colors_map, include_default_background=self.options.colors.include_default_background, ignore_colors=self.options.colors.ignore_colors_set(), + include_formulas_map=self.options.include_formulas_map, include_merged_cells=self.options.include_merged_cells, include_merged_values_in_rows=self.options.include_merged_values_in_rows, ) diff --git a/src/exstruct/models/__init__.py b/src/exstruct/models/__init__.py index 4041aeb..b25e7a5 100644 --- a/src/exstruct/models/__init__.py +++ b/src/exstruct/models/__init__.py @@ -177,6 +177,13 @@ class SheetData(BaseModel): auto_print_areas: list[PrintArea] = Field( default_factory=list, description="COM-computed auto page-break areas." ) + formulas_map: dict[str, list[tuple[int, int]]] = Field( + default_factory=dict, + description=( + "Mapping of formula strings to lists of (row, column) tuples " + "where row is 1-based and column is 0-based." + ), + ) colors_map: dict[str, list[tuple[int, int]]] = Field( default_factory=dict, description=( diff --git a/src/exstruct/render/__init__.py b/src/exstruct/render/__init__.py index 9c76481..e805b7e 100644 --- a/src/exstruct/render/__init__.py +++ b/src/exstruct/render/__init__.py @@ -7,7 +7,7 @@ import shutil import tempfile from types import ModuleType -from typing import Any, cast +from typing import Protocol, cast import xlwings as xw @@ -79,56 +79,32 @@ def _require_pdfium() -> ModuleType: def export_sheet_images( excel_path: str | Path, output_dir: str | Path, dpi: int = 144 ) -> list[Path]: - """Export each sheet as PNG (via PDF then pypdfium2 rasterization) and return paths in sheet order.""" + """ + Export each worksheet in the given Excel workbook to PNG files and return the image paths in workbook order. + + Returns: + paths (list[Path]): Paths to the generated PNG files, ordered by the corresponding worksheets. + + Raises: + RenderError: If export or rendering fails. + """ normalized_excel_path = Path(excel_path) normalized_output_dir = Path(output_dir) normalized_output_dir.mkdir(parents=True, exist_ok=True) use_subprocess = _use_render_subprocess() - if not use_subprocess: - pdfium = cast(Any, _require_pdfium()) - else: - _require_pdfium() + pdfium = _ensure_pdfium(use_subprocess) try: with tempfile.TemporaryDirectory() as td: - written: list[Path] = [] - app: xw.App | None = None - wb: xw.Book | None = None - try: - app = _require_excel_app() - wb = app.books.open(str(normalized_excel_path)) - for sheet_index, sheet in enumerate(wb.sheets): - sheet_name = sheet.name - sheet_pdf = Path(td) / f"sheet_{sheet_index + 1:02d}.pdf" - sheet.api.ExportAsFixedFormat(0, str(sheet_pdf)) - safe_name = _sanitize_sheet_filename(sheet_name) - if use_subprocess: - written.extend( - _render_pdf_pages_subprocess( - sheet_pdf, - normalized_output_dir, - sheet_index, - safe_name, - dpi, - ) - ) - else: - written.extend( - _render_pdf_pages_in_process( - pdfium, - sheet_pdf, - normalized_output_dir, - sheet_index, - safe_name, - dpi, - ) - ) - return written - finally: - if wb is not None: - wb.close() - if app is not None: - app.quit() + temp_dir = Path(td) + return _export_sheet_images_with_app( + normalized_excel_path, + normalized_output_dir, + temp_dir, + dpi, + use_subprocess, + pdfium, + ) except RenderError: raise except Exception as exc: @@ -138,11 +114,432 @@ def export_sheet_images( def _sanitize_sheet_filename(name: str) -> str: + r""" + Create a filesystem-safe filename derived from an Excel sheet name. + + Replaces characters that are not allowed in filenames (\/:*?"<>|) with underscores, trims surrounding whitespace, and returns "sheet" if the result is empty. + + Parameters: + name (str): Original sheet name. + + Returns: + safe_name (str): Filename-safe string derived from `name`. + """ return "".join("_" if c in '\\/:*?"<>|' else c for c in name).strip() or "sheet" +class _PageSetupProtocol(Protocol): + """Protocol for Excel PageSetup objects exposing PrintArea.""" + + PrintArea: object + + +class _SheetApiProtocol(Protocol): + """Protocol for Excel sheet COM APIs used by render helpers.""" + + PageSetup: _PageSetupProtocol + + def ExportAsFixedFormat( # noqa: N802 + self, file_format: int, output_path: str, *args: object, **kwargs: object + ) -> None: + """Export the sheet or workbook to a fixed-format file (for example, PDF or XPS). + + Parameters: + file_format (int): Excel XlFixedFormatType enum value indicating the output format (e.g., the constant for PDF). + output_path (str): Filesystem path where the fixed-format file will be written. + *args (object): Additional positional arguments forwarded to the underlying Excel COM ExportAsFixedFormat call. + **kwargs (object): Additional keyword arguments forwarded to the underlying Excel COM ExportAsFixedFormat call. + """ + ... + + +def _iter_sheet_apis(wb: xw.Book) -> list[tuple[int, str, _SheetApiProtocol]]: + """ + Enumerate workbook sheets and return each sheet's zero-based index, display name, and COM API handle in workbook order. + + If direct COM access to Worksheets is unavailable, falls back to iterating wb.sheets to build the same list. + + Returns: + List[tuple[int, str, _SheetApiProtocol]]: Tuples of (zero-based sheet index, sheet name, sheet COM API handle) in workbook order. + """ + try: + ws_collection = getattr(getattr(wb, "api", None), "Worksheets", None) + if ws_collection is None: + raise AttributeError("Worksheets not available") + count = int(ws_collection.Count) + sheets: list[tuple[int, str, _SheetApiProtocol]] = [] + for i in range(1, count + 1): + ws_api = cast(_SheetApiProtocol, ws_collection.Item(i)) + name = str(getattr(ws_api, "Name", f"Sheet{i}")) + sheets.append((i - 1, name, ws_api)) + return sheets + except Exception: + return [ + ( + index, + sheet.name, + cast(_SheetApiProtocol, sheet.api), + ) + for index, sheet in enumerate(wb.sheets) + ] + + +def _build_sheet_export_plan( + wb: xw.Book, +) -> list[tuple[str, _SheetApiProtocol, str | None]]: + """ + Build an ordered export plan mapping each worksheet to its print areas. + + Each returned tuple is (sheet_name, sheet_api, print_area). The list preserves workbook sheet order; for sheets with no defined print areas `print_area` is `None`, and for sheets with multiple print areas there is one tuple per area. + """ + plan: list[tuple[str, _SheetApiProtocol, str | None]] = [] + for _, sheet_name, sheet_api in _iter_sheet_apis(wb): + areas = _extract_print_areas(sheet_api) + if not areas: + plan.append((sheet_name, sheet_api, None)) + continue + for area in areas: + plan.append((sheet_name, sheet_api, area)) + return plan + + +def _extract_print_areas(sheet_api: _SheetApiProtocol) -> list[str]: + """ + Extract the sheet's print-area ranges as a list of strings. + + Retrieves the PageSetup.PrintArea value from the provided sheet API, splits it by commas while respecting single-quoted sections, and returns each range as a separate string. If the sheet has no print area or the property is inaccessible, an empty list is returned. + + Parameters: + sheet_api (_SheetApiProtocol): Excel sheet API object exposing a `PageSetup.PrintArea` attribute. + + Returns: + list[str]: List of print-area range strings in the order they appear, or an empty list if none are defined or on access failure. + """ + try: + page_setup = getattr(sheet_api, "PageSetup", None) + if page_setup is None: + return [] + raw = str(getattr(page_setup, "PrintArea", "") or "") + except Exception: + return [] + if not raw: + return [] + return _split_csv_respecting_quotes(raw) + + +def _split_csv_respecting_quotes(raw: str) -> list[str]: + """ + Split a comma-separated string into parts while treating single-quoted sections as atomic. + + This function splits raw on commas that are not inside single quotes. Text enclosed in single quotes is preserved (including internal commas). Two consecutive single quotes inside a quoted section are treated as an escaped single-quote pair. Leading and trailing whitespace is trimmed from each part and empty parts are removed. + + Parameters: + raw (str): The input CSV-like string that may contain single-quoted segments. + + Returns: + list[str]: A list of non-empty tokens obtained from splitting `raw` by unquoted commas, + with surrounding whitespace removed and quoted segments preserved. + """ + parts: list[str] = [] + buf: list[str] = [] + in_quote = False + i = 0 + while i < len(raw): + ch = raw[i] + if ch == "'": + if in_quote and i + 1 < len(raw) and raw[i + 1] == "'": + buf.append("''") + i += 2 + continue + in_quote = not in_quote + buf.append(ch) + i += 1 + continue + if ch == "," and not in_quote: + parts.append("".join(buf).strip()) + buf = [] + i += 1 + continue + buf.append(ch) + i += 1 + if buf: + parts.append("".join(buf).strip()) + return [p for p in parts if p] + + +def _rename_pages_for_print_area( + paths: list[Path], + output_dir: Path, + base_index: int, + safe_name: str, +) -> list[Path]: + """ + Rename the given image files so each gets a unique numeric prefix based on a base index and a safe sheet name. + + Parameters: + paths (list[Path]): Existing image files for a single sheet or print area (may include per-page suffixes). + output_dir (Path): Directory where renamed files will reside. + base_index (int): Zero-based starting index used to compute the numeric prefix for each output file. + safe_name (str): Filesystem-safe base name to use after the numeric prefix. + + Returns: + list[Path]: Paths to the renamed files in the same order as input, each named "{index:02d}_{safe_name}.png". + """ + renamed: list[Path] = [] + for path in paths: + page_index = _page_index_from_suffix(path.stem) + new_index = base_index + page_index + new_path = output_dir / f"{new_index + 1:02d}_{safe_name}.png" + if path != new_path: + path.replace(new_path) + renamed.append(new_path) + return renamed + + +def _page_index_from_suffix(stem: str) -> int: + """ + Extracts a zero-based page index from a filename stem ending with a "_pNN" numeric suffix. + + If the stem ends with "_p" followed by digits, returns that number minus one. If the suffix is missing, non-numeric, or less than 1, returns 0. + + Parameters: + stem (str): Filename stem to parse. + + Returns: + int: Zero-based page index derived from the "_pNN" suffix, or 0 when no valid suffix is present. + """ + if "_p" not in stem: + return 0 + base, suffix = stem.rsplit("_p", 1) + _ = base + if suffix.isdigit(): + page_number = int(suffix) + if page_number <= 0: + return 0 + return page_number - 1 + return 0 + + +def _export_sheet_pdf( + sheet_api: _SheetApiProtocol, + pdf_path: Path, + *, + ignore_print_areas: bool, + print_area: str | None = None, +) -> None: + """ + Export the given worksheet to a PDF file, optionally applying a temporary print area. + + If `print_area` is provided, it is applied to the sheet's PageSetup.PrintArea before exporting and restored afterwards. The function attempts to call ExportAsFixedFormat with an IgnorePrintAreas keyword; if that call fails due to an unexpected COM signature, it retries with a minimal argument set. + + Args: + sheet_api: COM-like worksheet API exposing `PageSetup` and `ExportAsFixedFormat`. + pdf_path (Path): Filesystem path to write the PDF to. + ignore_print_areas (bool): If True, request that Excel ignore sheet print areas during export. + print_area (str | None): Optional print area string to apply for this export; if None, the sheet's current print area is left unchanged. + """ + original_print_area: object | None = None + page_setup = None + if print_area is not None: + try: + page_setup = getattr(sheet_api, "PageSetup", None) + if page_setup is not None: + original_print_area = getattr(page_setup, "PrintArea", None) + page_setup.PrintArea = print_area + except Exception as exc: + logger.debug("Failed to set PrintArea. (%r)", exc) + page_setup = None + try: + sheet_api.ExportAsFixedFormat( + 0, str(pdf_path), IgnorePrintAreas=ignore_print_areas + ) + except TypeError: + if ignore_print_areas: + try: + page_setup = page_setup or getattr(sheet_api, "PageSetup", None) + if page_setup is not None: + page_setup.PrintArea = "" + except Exception as exc: + logger.debug("Failed to clear PrintArea for ignore. (%r)", exc) + sheet_api.ExportAsFixedFormat(0, str(pdf_path)) + finally: + if page_setup is not None and print_area is not None: + try: + page_setup.PrintArea = original_print_area + except Exception as exc: + logger.debug("Failed to restore PrintArea. (%r)", exc) + + +def _ensure_pdfium(use_subprocess: bool) -> ModuleType | None: + """ + Ensure the pypdfium2 dependency is available and return the pdfium module for in-process rendering. + + Parameters: + use_subprocess (bool): When True, confirm pypdfium2 is installed for subprocess rendering but do not keep the module in-process; when False, import and return the pdfium module for direct use. + + Returns: + ModuleType | None: The imported `pdfium` module when `use_subprocess` is False, or `None` when `use_subprocess` is True. + + Raises: + MissingDependencyError: If pypdfium2 (and required extras) is not installed. + """ + if use_subprocess: + _require_pdfium() + return None + return _require_pdfium() + + +def _export_sheet_images_with_app( + excel_path: Path, + output_dir: Path, + temp_dir: Path, + dpi: int, + use_subprocess: bool, + pdfium: ModuleType | None, +) -> list[Path]: + """ + Export each worksheet of an Excel workbook to PNG images by exporting sheets to per-sheet PDFs and rendering those PDFs. + + Parameters: + excel_path (Path): Path to the source Excel workbook. + output_dir (Path): Directory where generated PNGs will be written. + temp_dir (Path): Temporary directory for per-sheet intermediate PDF files. + dpi (int): Dots per inch used when rasterizing PDF pages. + use_subprocess (bool): If True, render PDF pages in a subprocess; otherwise render in-process. + pdfium (ModuleType | None): In-process pypdfium2 module when rendering in-process, or None when subprocess rendering is used. + + Returns: + list[Path]: Paths to generated PNG images in the order corresponding to the workbook's sheets and print-area splits. + """ + written: list[Path] = [] + app: xw.App | None = None + wb: xw.Book | None = None + try: + app = _require_excel_app() + wb = app.books.open(str(excel_path)) + output_index = 0 + for sheet_name, sheet_api, print_area in _build_sheet_export_plan(wb): + sheet_pdf = temp_dir / f"sheet_{output_index + 1:02d}.pdf" + safe_name = _sanitize_sheet_filename(sheet_name) + _export_sheet_pdf( + sheet_api, + sheet_pdf, + ignore_print_areas=False, + print_area=print_area, + ) + sheet_paths = _render_sheet_images( + pdfium, + sheet_pdf, + output_dir, + output_index, + safe_name, + dpi, + use_subprocess, + ) + if not sheet_paths: + _export_sheet_pdf( + sheet_api, + sheet_pdf, + ignore_print_areas=True, + print_area=print_area, + ) + sheet_paths = _render_sheet_images( + pdfium, + sheet_pdf, + output_dir, + output_index, + safe_name, + dpi, + use_subprocess, + ) + sheet_paths = _normalize_multipage_paths( + sheet_paths, + output_dir, + output_index, + safe_name, + ) + written.extend(sheet_paths) + output_index += max(1, len(sheet_paths)) + return written + finally: + if wb is not None: + wb.close() + if app is not None: + app.quit() + + +def _render_sheet_images( + pdfium: ModuleType | None, + sheet_pdf: Path, + output_dir: Path, + output_index: int, + safe_name: str, + dpi: int, + use_subprocess: bool, +) -> list[Path]: + """ + Render a sheet PDF to one or more PNG files using either a subprocess or in-process renderer. + + Returns: + paths (list[Path]): Paths to the generated PNG files in output order. + + Raises: + RenderError: If in-process rendering is requested but the `pypdfium2` module (`pdfium`) is not provided. + """ + if use_subprocess: + return _render_pdf_pages_subprocess( + sheet_pdf, + output_dir, + output_index, + safe_name, + dpi, + ) + if pdfium is None: + raise RenderError("pypdfium2 is required for in-process rendering.") + return _render_pdf_pages_in_process( + pdfium, + sheet_pdf, + output_dir, + output_index, + safe_name, + dpi, + ) + + +def _normalize_multipage_paths( + paths: list[Path], + output_dir: Path, + base_index: int, + safe_name: str, +) -> list[Path]: + """ + Assign distinct, ordered filenames for multi-page sheet outputs. + + If `paths` contains a single file, the list is returned unchanged. If `paths` contains multiple files, each file is given a unique, numbered filename in `output_dir` using `base_index` and `safe_name` so pages are ordered and do not collide. + + Parameters: + paths (list[Path]): Existing file paths for a sheet's rendered pages. + output_dir (Path): Directory containing or intended to contain the output files. + base_index (int): Zero-based starting index used to compute numeric prefixes for filenames. + safe_name (str): Filesystem-safe base name included in the generated filenames. + + Returns: + list[Path]: Paths to the resulting files in `output_dir`. When multiple input paths are provided, returned paths reflect the new, uniquely prefixed filenames. + """ + if len(paths) <= 1: + return paths + return _rename_pages_for_print_area(paths, output_dir, base_index, safe_name) + + def _use_render_subprocess() -> bool: - """Return True when PDF->PNG rendering should run in a subprocess.""" + """ + Decide whether PDF-to-PNG rendering should be performed in a subprocess. + + Reads the environment variable EXSTRUCT_RENDER_SUBPROCESS (case-insensitive). Subprocess rendering is disabled when the variable is set to "0" or "false"; if the variable is unset or set to any other value, subprocess rendering is enabled. + + Returns: + `true` if subprocess rendering is enabled, `false` otherwise. + """ return os.getenv("EXSTRUCT_RENDER_SUBPROCESS", "1").lower() not in {"0", "false"} diff --git a/tests/backends/test_auto_page_breaks.py b/tests/backends/test_auto_page_breaks.py index 81e01d6..f2cce3e 100644 --- a/tests/backends/test_auto_page_breaks.py +++ b/tests/backends/test_auto_page_breaks.py @@ -16,6 +16,13 @@ def test_extract_passes_auto_page_break_flag( monkeypatch: MonkeyPatch, tmp_path: Path ) -> None: + """ + Verify that extract_workbook is invoked with include_auto_page_breaks set to True. + + Creates a fake extractor that captures the include_auto_page_breaks argument, replaces + exstruct.engine.extract_workbook with it, runs ExStructEngine.extract against a dummy + workbook path configured to export auto page breaks, and asserts the captured flag is True. + """ called: dict[str, object] = {} def fake_extract( @@ -27,9 +34,24 @@ def fake_extract( include_colors_map: bool = False, include_default_background: bool = False, ignore_colors: set[str] | None = None, + include_formulas_map: bool | None = None, include_merged_cells: bool | None = None, include_merged_values_in_rows: bool = True, ) -> WorkbookData: + """ + Test stub for workbook extraction that records the auto page breaks flag. + + This fake extractor captures the value of `include_auto_page_breaks` in the outer + `called` mapping and returns a minimal `WorkbookData` with `book_name` set to + the provided path's filename and an empty `sheets` mapping. + + Parameters: + path (Path): Filesystem path used to derive the returned `WorkbookData.book_name`. + include_auto_page_breaks (bool): Flag whose value is written to `called["include_auto_page_breaks"]`. + + Returns: + WorkbookData: A minimal workbook data object with `book_name` set to `path.name` and no sheets. + """ called["include_auto_page_breaks"] = include_auto_page_breaks return WorkbookData(book_name=path.name, sheets={}) diff --git a/tests/backends/test_backends.py b/tests/backends/test_backends.py index 7326fc3..ebfd947 100644 --- a/tests/backends/test_backends.py +++ b/tests/backends/test_backends.py @@ -3,7 +3,7 @@ from _pytest.monkeypatch import MonkeyPatch from openpyxl import Workbook -from exstruct.core.backends.com_backend import ComBackend +from exstruct.core.backends.com_backend import ComBackend, _parse_print_area_range from exstruct.core.backends.openpyxl_backend import OpenpyxlBackend from exstruct.core.ranges import parse_range_zero_based @@ -75,6 +75,27 @@ def fake_colors_map( ) +def test_openpyxl_backend_extract_formulas_map_returns_none_on_failure( + monkeypatch: MonkeyPatch, tmp_path: Path +) -> None: + def fake_formulas_map(file_path: Path) -> object: + """ + Test helper that always raises a RuntimeError to simulate a failure when extracting a formulas map. + + Raises: + RuntimeError: with message "boom". + """ + raise RuntimeError("boom") + + monkeypatch.setattr( + "exstruct.core.backends.openpyxl_backend.extract_sheet_formulas_map", + fake_formulas_map, + ) + + backend = OpenpyxlBackend(tmp_path / "book.xlsx") + assert backend.extract_formulas_map() is None + + def test_com_backend_extract_colors_map_returns_none_on_failure( monkeypatch: MonkeyPatch, ) -> None: @@ -101,6 +122,33 @@ class DummyWorkbook: ) +def test_com_backend_extract_formulas_map_returns_none_on_failure( + monkeypatch: MonkeyPatch, +) -> None: + def fake_formulas_map(workbook: object) -> object: + """ + Test stub that simulates a failure by always raising a RuntimeError. + + Parameters: + workbook (object): Workbook-like object (ignored); present to match the real function's signature. + + Raises: + RuntimeError: Always raised with message "boom". + """ + raise RuntimeError("boom") + + monkeypatch.setattr( + "exstruct.core.backends.com_backend.extract_sheet_formulas_map_com", + fake_formulas_map, + ) + + class DummyWorkbook: + pass + + backend = ComBackend(DummyWorkbook()) + assert backend.extract_formulas_map() is None + + def test_com_backend_extract_print_areas_handles_sheet_error( monkeypatch: MonkeyPatch, ) -> None: @@ -124,6 +172,11 @@ class _DummyWorkbook: def test_openpyxl_backend_extract_print_areas(tmp_path: Path) -> None: + """ + Verifies that OpenpyxlBackend.extract_print_areas reads an openpyxl workbook's print area and returns the corresponding zero-based ranges keyed by sheet name. + + Creates an in-memory workbook with a single sheet named "Sheet1", sets its print area to "A1:B2", saves and loads it via OpenpyxlBackend, then asserts the sheet is present, has at least one area, and that the first area's r1 and c1 are 1 and 0 respectively. + """ wb = Workbook() ws = wb.active ws.title = "Sheet1" @@ -142,6 +195,25 @@ def test_openpyxl_backend_extract_print_areas(tmp_path: Path) -> None: assert areas["Sheet1"][0].c1 == 0 +def test_openpyxl_backend_extract_print_areas_returns_empty_on_error( + monkeypatch: MonkeyPatch, tmp_path: Path +) -> None: + """ + Ensure OpenpyxlBackend.extract_print_areas returns an empty dict when the workbook loader raises an error. + + Verifies that the backend handles errors from the underlying workbook opening function by returning an empty mapping of print areas. + """ + + def _raise(*_args: object, **_kwargs: object) -> None: + raise RuntimeError("boom") + + monkeypatch.setattr( + "exstruct.core.backends.openpyxl_backend.openpyxl_workbook", _raise + ) + backend = OpenpyxlBackend(tmp_path / "book.xlsx") + assert backend.extract_print_areas() == {} + + def test_parse_range_zero_based_parses_sheet_prefix() -> None: bounds = parse_range_zero_based("Sheet1!A1:B2") assert bounds is not None @@ -149,3 +221,259 @@ def test_parse_range_zero_based_parses_sheet_prefix() -> None: assert bounds.c1 == 0 assert bounds.r2 == 1 assert bounds.c2 == 1 + + +def test_com_backend_extract_print_areas_success() -> None: + class _PageSetup: + PrintArea = "A1:B2,INVALID" + + class _SheetApi: + PageSetup = _PageSetup() + + class _Sheet: + name = "Sheet1" + api = _SheetApi() + + class _DummyWorkbook: + sheets = [_Sheet()] + + backend = ComBackend(_DummyWorkbook()) + areas = backend.extract_print_areas() + assert "Sheet1" in areas + assert areas["Sheet1"][0].r1 == 1 + assert areas["Sheet1"][0].c1 == 0 + assert areas["Sheet1"][0].r2 == 2 + assert areas["Sheet1"][0].c2 == 1 + + +def test_com_backend_parse_print_area_range_invalid() -> None: + assert _parse_print_area_range("INVALID") is None + + +class _Location: + def __init__(self, row: int | None = None, col: int | None = None) -> None: + """ + Initialize the location with row and column values. + + Parameters: + row (int | None): Row index or None. + col (int | None): Column index or None. + """ + self.Row = row + self.Column = col + + +class _BreakItem: + def __init__(self, row: int | None = None, col: int | None = None) -> None: + """ + Initialize the break item with an optional sheet location. + + Parameters: + row (int | None): Row index (1-based) for the location, or None if unspecified. + col (int | None): Column index (1-based) for the location, or None if unspecified. + """ + self.Location = _Location(row=row, col=col) + + +class _Breaks: + def __init__(self, items: list[_BreakItem]) -> None: + """ + Initialize the Breaks collection from a list of break items. + + Parameters: + items (list[_BreakItem]): Sequence of `_BreakItem` instances representing page break entries; ordering corresponds to 1-based access via `Item`. + """ + self._items = items + self.Count = len(items) + + def Item(self, index: int) -> _BreakItem: + """ + Return the break item at the given 1-based position. + + Parameters: + index (int): 1-based position of the break to retrieve. + + Returns: + _BreakItem: The break item at the specified position. + """ + return self._items[index - 1] + + +class _RangeRows: + def __init__(self, count: int) -> None: + """ + Initialize the breaks container with a specified item count. + + Parameters: + count (int): Number of break items the container should report via its `Count` attribute. + """ + self.Count = count + + +class _RangeCols: + def __init__(self, count: int) -> None: + """ + Initialize the breaks container with a specified item count. + + Parameters: + count (int): Number of break items the container should report via its `Count` attribute. + """ + self.Count = count + + +class _Range: + Row = 1 + Column = 1 + Rows = _RangeRows(2) + Columns = _RangeCols(2) + + +class _UsedRange: + Address = "A1:B2" + + +class _PageSetup: + PrintArea = "A1:B2" + + +class _SheetApi: + def __init__(self) -> None: + """ + Initialize a fake sheet API used by COM backend tests with default page and range state. + + Creates default attributes: + - DisplayPageBreaks set to False. + - PageSetup populated with a default PrintArea. + - UsedRange populated with a default Address. + - HPageBreaks containing one horizontal break at row 2. + - VPageBreaks containing one vertical break at column 2. + """ + self.DisplayPageBreaks = False + self.PageSetup = _PageSetup() + self.UsedRange = _UsedRange() + self.HPageBreaks = _Breaks([_BreakItem(row=2)]) + self.VPageBreaks = _Breaks([_BreakItem(col=2)]) + + def Range(self, _addr: str) -> _Range: + """ + Create and return a Range wrapper for the given Excel-style address. + + Parameters: + _addr (str): Excel-style address or range string (e.g., "A1", "A1:B2", or "Sheet1!A1:B2"). + + Returns: + _Range: An object representing the requested worksheet range. + """ + return _Range() + + +class _Sheet: + name = "Sheet1" + + def __init__(self) -> None: + """ + Initialize a mock sheet and attach its API. + + Sets the `api` attribute to a new `_SheetApi` instance used by tests to simulate a sheet's COM-like API. + """ + self.api = _SheetApi() + + +class _DummyWorkbook: + def __init__(self) -> None: + """ + Initialize a dummy workbook containing a single default sheet. + + The instance provides a `sheets` attribute set to a list with one `_Sheet` object. + """ + self.sheets = [_Sheet()] + + +def test_com_backend_extract_auto_page_breaks_success() -> None: + backend = ComBackend(_DummyWorkbook()) + areas = backend.extract_auto_page_breaks() + assert "Sheet1" in areas + assert areas["Sheet1"] + + +class _RestoreErrorSheetApi: + def __init__(self) -> None: + """ + Initialize a mock sheet API with default page, range, and break attributes. + + Creates: + - `_display`: boolean flag for DisplayPageBreaks (defaults to False). + - `PageSetup`: a default page setup object. + - `UsedRange`: a default used-range object. + - `HPageBreaks` and `VPageBreaks`: horizontal and vertical break collections, initialized empty. + """ + self._display = False + self.PageSetup = _PageSetup() + self.UsedRange = _UsedRange() + self.HPageBreaks = _Breaks([]) + self.VPageBreaks = _Breaks([]) + + @property + def DisplayPageBreaks(self) -> bool: + """ + Get whether displaying page breaks is enabled on the sheet. + + Returns: + `True` if page break display is enabled, `False` otherwise. + """ + return self._display + + @DisplayPageBreaks.setter + def DisplayPageBreaks(self, value: bool) -> None: + """ + Set the sheet's DisplayPageBreaks flag. + + Parameters: + value (bool): True to enable display of automatic page breaks. Passing False will trigger a restore failure. + + Raises: + RuntimeError: If `value` is False (restore failed). + """ + if value is False: + raise RuntimeError("restore failed") + self._display = value + + def Range(self, _addr: str) -> _Range: + """ + Create and return a Range wrapper for the given Excel-style address. + + Parameters: + _addr (str): Excel-style address or range string (e.g., "A1", "A1:B2", or "Sheet1!A1:B2"). + + Returns: + _Range: An object representing the requested worksheet range. + """ + return _Range() + + +class _RestoreErrorSheet: + name = "Sheet1" + + def __init__(self) -> None: + """ + Create a sheet object whose underlying API simulates an error when restoring DisplayPageBreaks. + + This constructor assigns an instance of _RestoreErrorSheetApi to the `api` attribute so tests can exercise code paths that handle failures when restoring page-break state. + """ + self.api = _RestoreErrorSheetApi() + + +class _RestoreErrorWorkbook: + def __init__(self) -> None: + """ + Create a mock workbook containing a single sheet that raises an error when restoring DisplayPageBreaks. + + The instance exposes a `sheets` attribute set to a list with one _RestoreErrorSheet(), which is used to simulate failures during page-break restoration in tests. + """ + self.sheets = [_RestoreErrorSheet()] + + +def test_com_backend_extract_auto_page_breaks_restore_error() -> None: + backend = ComBackend(_RestoreErrorWorkbook()) + areas = backend.extract_auto_page_breaks() + assert "Sheet1" in areas diff --git a/tests/backends/test_print_areas_openpyxl.py b/tests/backends/test_print_areas_openpyxl.py index 99a4970..e4976f6 100644 --- a/tests/backends/test_print_areas_openpyxl.py +++ b/tests/backends/test_print_areas_openpyxl.py @@ -3,10 +3,23 @@ from openpyxl import Workbook from exstruct import extract -from exstruct.core.backends.openpyxl_backend import OpenpyxlBackend +from exstruct.core.backends.base import PrintAreaData +from exstruct.core.backends.openpyxl_backend import ( + OpenpyxlBackend, + _append_print_areas, + _extract_print_areas_from_defined_names, + _extract_print_areas_from_sheet_props, + _parse_print_area_range, +) def _make_book_with_print_area(path: Path) -> None: + """ + Create a simple Excel workbook with a single sheet named "Sheet1", set its print area to "A1:B2", write "x" to cell A1, save it to the given path, and close the file. + + Parameters: + path (Path): Filesystem path where the workbook will be saved. + """ wb = Workbook() ws = wb.active ws.title = "Sheet1" @@ -44,3 +57,68 @@ def test_openpyxl_backend_multiple_print_areas(tmp_path: Path) -> None: assert "Sheet1" in areas ranges = [(a.r1, a.c1, a.r2, a.c2) for a in areas["Sheet1"]] assert ranges == [(1, 0, 2, 1), (3, 3, 4, 4)] + + +def test_extract_print_areas_from_defined_names_filters_unknown_sheets() -> None: + """Ignore defined-name destinations for sheets that do not exist.""" + + class _DefinedArea: + destinations = [("Sheet1", "A1:B2"), ("Unknown", "C1:D2")] + + class _DefinedNames: + def get(self, _name: str) -> _DefinedArea: + """ + Create a default defined area object. + + Returns: + _DefinedArea: A new, empty/default defined-area instance. + """ + return _DefinedArea() + + class _DummyWorkbook: + defined_names = _DefinedNames() + sheetnames = ["Sheet1"] + + areas = _extract_print_areas_from_defined_names(_DummyWorkbook()) + assert "Sheet1" in areas + assert "Unknown" not in areas + + +def test_extract_print_areas_from_defined_names_without_defined_names() -> None: + """Return an empty mapping when defined_names is missing.""" + + class _DummyWorkbook: + defined_names = None + + assert _extract_print_areas_from_defined_names(_DummyWorkbook()) == {} + + +def test_extract_print_areas_from_sheet_props_skips_empty() -> None: + """Skip sheet print areas when the property is empty.""" + + class _SheetEmpty: + title = "Sheet1" + _print_area = None + + class _SheetWithArea: + title = "Sheet2" + _print_area = "A1:B2" + + class _DummyWorkbook: + worksheets = [_SheetEmpty(), _SheetWithArea()] + + areas = _extract_print_areas_from_sheet_props(_DummyWorkbook()) + assert "Sheet2" in areas + + +def test_parse_print_area_range_invalid() -> None: + """Return None for invalid range strings.""" + assert _parse_print_area_range("INVALID") is None + + +def test_append_print_areas_skips_invalid_ranges() -> None: + """Append only valid print areas and skip invalid ranges.""" + areas: PrintAreaData = {} + _append_print_areas(areas, "Sheet1", "A1:B2,INVALID") + assert "Sheet1" in areas + assert len(areas["Sheet1"]) == 1 diff --git a/tests/com/test_render_smoke.py b/tests/com/test_render_smoke.py index fb4a38f..b61b1ba 100644 --- a/tests/com/test_render_smoke.py +++ b/tests/com/test_render_smoke.py @@ -37,6 +37,11 @@ def test_render_smoke_pdf_and_png(tmp_path: Path) -> None: def test_render_multiple_print_ranges_images(tmp_path: Path) -> None: + """ + Verify that processing a workbook with multiple print ranges across four sheets produces an images directory containing exactly four PNG files. + + Uses the test asset 'assets/multiple_print_ranges_4sheets.xlsx', runs process_excel with image output enabled, and asserts the generated images directory exists and contains four .png images. + """ xlsx = ( Path(__file__).resolve().parents[1] / "assets" @@ -55,15 +60,4 @@ def test_render_multiple_print_ranges_images(tmp_path: Path) -> None: images_dir = out_json.parent / f"{out_json.stem}_images" images = list(images_dir.glob("*.png")) assert images_dir.exists() - prefixes = {_strip_page_suffix(image.stem) for image in images} - assert len(prefixes) == 4 - - -def _strip_page_suffix(stem: str) -> str: - """Return the image stem without the _pNN page suffix.""" - if "_p" not in stem: - return stem - base, suffix = stem.rsplit("_p", 1) - if len(suffix) == 2 and suffix.isdigit(): - return base - return stem + assert len(images) == 4 diff --git a/tests/core/test_cells_colors_and_tables.py b/tests/core/test_cells_colors_and_tables.py index edf5d24..8469c4f 100644 --- a/tests/core/test_cells_colors_and_tables.py +++ b/tests/core/test_cells_colors_and_tables.py @@ -138,7 +138,7 @@ def test_table_signal_score_prefers_header_and_coverage() -> None: def test_count_nonempty_cells() -> None: """非空セル数のカウントを確認する。""" - values = [["", None, "x"], ["y", " ", 0]] + values: list[list[object]] = [["", None, "x"], ["y", " ", 0]] assert _count_nonempty_cells(values) == 3 diff --git a/tests/core/test_cells_utils.py b/tests/core/test_cells_utils.py index 64f90dc..75460c1 100644 --- a/tests/core/test_cells_utils.py +++ b/tests/core/test_cells_utils.py @@ -5,7 +5,14 @@ from openpyxl.worksheet.table import Table, TableStyleInfo from exstruct.core import cells -from exstruct.core.cells import _coerce_numeric_preserve_format, detect_tables_openpyxl +from exstruct.core.cells import ( + _coerce_numeric_preserve_format, + _normalize_formula_from_com, + _normalize_formula_value, + detect_tables_openpyxl, + extract_sheet_formulas_map, + extract_sheet_formulas_map_com, +) def test_coerce_numeric_preserve_format() -> None: @@ -61,3 +68,106 @@ def test_detect_tables_openpyxl_respects_table_params( ) tables = detect_tables_openpyxl(path, "Sheet1") assert "A1:B2" in tables + + +def test_normalize_formula_value_prefers_array_text() -> None: + """ + Verify that _normalize_formula_value prefers an array-like object's text and treats an empty string as no formula. + + Asserts that an object with a `text` attribute is converted to a formula string prefixed with '=' (e.g., "=SUM(A1:A3)"), and that an empty string is normalized to None. + """ + + class _ArrayFormulaLike: + text = "SUM(A1:A3)" + + assert _normalize_formula_value(_ArrayFormulaLike()) == "=SUM(A1:A3)" + assert _normalize_formula_value("") is None + + +def test_extract_sheet_formulas_map_collects_formulas(tmp_path: Path) -> None: + path = tmp_path / "formulas.xlsx" + wb = Workbook() + ws = wb.active + ws.title = "Sheet1" + ws["A1"] = 1 + ws["A2"] = 2 + ws["B1"] = "=SUM(A1:A2)" + wb.save(path) + wb.close() + + result = extract_sheet_formulas_map(path) + sheet = result.get_sheet("Sheet1") + assert sheet is not None + assert sheet.formulas_map == {"=SUM(A1:A2)": [(1, 1)]} + + +def test_normalize_formula_from_com() -> None: + assert _normalize_formula_from_com("=A1") == "=A1" + assert _normalize_formula_from_com("A1") is None + assert _normalize_formula_from_com("") is None + assert _normalize_formula_from_com(None) is None + + +def test_extract_sheet_formulas_map_com_empty_range() -> None: + class _DummyLastCell: + row = 0 + column = 0 + + class _DummyUsedRange: + row = 1 + column = 1 + last_cell = _DummyLastCell() + + class _DummySheet: + name = "Sheet1" + used_range = _DummyUsedRange() + + class _DummyWorkbook: + sheets = [_DummySheet()] + + result = extract_sheet_formulas_map_com(_DummyWorkbook()) + sheet = result.get_sheet("Sheet1") + assert sheet is not None + assert sheet.formulas_map == {} + + +def test_extract_sheet_formulas_map_com_collects_formulas() -> None: + class _DummyLastCell: + row = 2 + column = 2 + + class _DummyUsedRange: + row = 1 + column = 1 + last_cell = _DummyLastCell() + + class _DummyRange: + formula = [["=A1", "B1"], ["=SUM(A1)", ""]] + + class _DummySheet: + name = "Sheet1" + used_range = _DummyUsedRange() + + def range(self, _start: object, _end: object) -> _DummyRange: + """ + Return a new _DummyRange representing a requested cell range. + + Parameters: + _start (object): Start coordinate or cell reference for the range request (ignored by this dummy implementation). + _end (object): End coordinate or cell reference for the range request (ignored by this dummy implementation). + + Returns: + _DummyRange: A fresh _DummyRange instance corresponding to the requested range. + """ + return _DummyRange() + + class _DummyWorkbook: + sheets = [_DummySheet()] + + result = extract_sheet_formulas_map_com(_DummyWorkbook()) + sheet = result.get_sheet("Sheet1") + assert sheet is not None + assert sheet.formulas_map == { + "=A1": [(1, 0)], + "=SUM(A1)": [(2, 0)], + } diff --git a/tests/core/test_error_handling_exceptions.py b/tests/core/test_error_handling_exceptions.py index 95bad4a..542284f 100644 --- a/tests/core/test_error_handling_exceptions.py +++ b/tests/core/test_error_handling_exceptions.py @@ -2,6 +2,7 @@ import importlib from pathlib import Path +from typing import Literal, cast import pytest @@ -24,8 +25,9 @@ def _minimal_workbook() -> WorkbookData: def test_serialize_workbook_unsupported_format() -> None: """Unsupported formats should raise SerializationError.""" workbook = _minimal_workbook() + invalid_format = cast(Literal["json", "yaml", "yml", "toon"], "invalid") with pytest.raises(SerializationError): - serialize_workbook(workbook, fmt="invalid") + serialize_workbook(workbook, fmt=invalid_format) def test_save_as_yaml_missing_dependency( diff --git a/tests/core/test_mode_output.py b/tests/core/test_mode_output.py index f90bb5d..ac70782 100644 --- a/tests/core/test_mode_output.py +++ b/tests/core/test_mode_output.py @@ -1,7 +1,8 @@ +import os from pathlib import Path import subprocess import sys -from typing import Never +from typing import Never, cast from _pytest.capture import CaptureFixture from _pytest.monkeypatch import MonkeyPatch @@ -9,7 +10,7 @@ import pytest import xlwings as xw -from exstruct import extract, process_excel +from exstruct import ExtractionMode, extract, process_excel from exstruct.models import Arrow @@ -29,6 +30,13 @@ def _make_basic_book(path: Path) -> None: def _ensure_excel() -> None: + """ + Ensure Excel COM is available for tests and skip the current test if it is not. + + If the SKIP_COM_TESTS environment variable is set, this function skips the test. Otherwise it tries to start a hidden xlwings App and quits it; if starting the App fails, the function skips the test due to unavailable Excel COM. + """ + if os.getenv("SKIP_COM_TESTS"): + pytest.skip("SKIP_COM_TESTS is set; skipping Excel-dependent test.") try: app = xw.App(add_book=False, visible=False) app.quit() @@ -115,11 +123,11 @@ def test_invalidモードはエラーになる(tmp_path: Path) -> None: path = tmp_path / "book.xlsx" _make_basic_book(path) with pytest.raises(ValueError): - extract(path, mode="invalid") # type: ignore[arg-type] + extract(path, mode=cast(ExtractionMode, "invalid")) out = tmp_path / "out.json" with pytest.raises(ValueError): - process_excel(path, out, mode="invalid") # type: ignore[arg-type] + process_excel(path, out, mode=cast(ExtractionMode, "invalid")) def test_CLIのmode引数バリデーション(tmp_path: Path) -> None: diff --git a/tests/core/test_pipeline.py b/tests/core/test_pipeline.py index 80f45c7..5fe17df 100644 --- a/tests/core/test_pipeline.py +++ b/tests/core/test_pipeline.py @@ -1,18 +1,43 @@ +import logging from pathlib import Path from _pytest.monkeypatch import MonkeyPatch +import pytest -from exstruct.core.cells import MergedCellRange +from exstruct.core.backends.com_backend import ComBackend +from exstruct.core.backends.openpyxl_backend import OpenpyxlBackend +from exstruct.core.cells import ( + MergedCellRange, + SheetColorsMap, + SheetFormulasMap, + WorkbookColorsMap, + WorkbookFormulasMap, +) from exstruct.core.pipeline import ( ExtractionArtifacts, ExtractionInputs, + PipelinePlan, + _col_in_intervals, _filter_rows_excluding_merged_values, + _merge_intervals, + _resolve_sheet_colors_map, + _resolve_sheet_formulas_map, build_cells_tables_workbook, build_com_pipeline, build_pre_com_pipeline, resolve_extraction_inputs, + run_com_pipeline, + run_extraction_pipeline, + step_extract_auto_page_breaks_com, + step_extract_charts_com, + step_extract_colors_map_com, + step_extract_colors_map_openpyxl, + step_extract_formulas_map_com, + step_extract_formulas_map_openpyxl, + step_extract_print_areas_com, + step_extract_shapes_com, ) -from exstruct.models import CellRow, PrintArea +from exstruct.models import CellRow, PrintArea, Shape def test_build_pre_com_pipeline_respects_flags( @@ -27,6 +52,8 @@ def test_build_pre_com_pipeline_respects_flags( include_colors_map=False, include_default_background=False, ignore_colors=None, + include_formulas_map=False, + use_com_for_formulas=False, include_merged_cells=False, include_merged_values_in_rows=True, ) @@ -47,6 +74,8 @@ def test_build_pre_com_pipeline_includes_colors_map_for_light( include_colors_map=True, include_default_background=False, ignore_colors=None, + include_formulas_map=False, + use_com_for_formulas=False, include_merged_cells=True, include_merged_values_in_rows=True, ) @@ -72,6 +101,8 @@ def test_build_pre_com_pipeline_skips_merged_cells_when_disabled( include_colors_map=True, include_default_background=False, ignore_colors=None, + include_formulas_map=False, + use_com_for_formulas=False, include_merged_cells=False, include_merged_values_in_rows=True, ) @@ -90,6 +121,8 @@ def test_build_com_pipeline_respects_flags(tmp_path: Path) -> None: include_colors_map=False, include_default_background=False, ignore_colors=None, + include_formulas_map=False, + use_com_for_formulas=False, include_merged_cells=False, include_merged_values_in_rows=True, ) @@ -114,6 +147,8 @@ def test_build_com_pipeline_excludes_auto_page_breaks_when_disabled( include_colors_map=False, include_default_background=False, ignore_colors=None, + include_formulas_map=False, + use_com_for_formulas=False, include_merged_cells=False, include_merged_values_in_rows=True, ) @@ -132,6 +167,8 @@ def test_build_com_pipeline_empty_for_light(tmp_path: Path) -> None: include_colors_map=True, include_default_background=False, ignore_colors=None, + include_formulas_map=False, + use_com_for_formulas=False, include_merged_cells=False, include_merged_values_in_rows=True, ) @@ -149,12 +186,14 @@ def test_resolve_extraction_inputs_defaults(tmp_path: Path) -> None: include_colors_map=None, include_default_background=True, ignore_colors=None, + include_formulas_map=None, include_merged_cells=None, include_merged_values_in_rows=True, ) assert inputs.include_cell_links is False assert inputs.include_print_areas is True assert inputs.include_colors_map is False + assert inputs.include_formulas_map is False assert inputs.include_default_background is False assert inputs.include_merged_cells is True @@ -171,12 +210,65 @@ def test_resolve_extraction_inputs_forces_merged_cells_when_excluding_values( include_colors_map=None, include_default_background=False, ignore_colors=None, + include_formulas_map=None, include_merged_cells=False, include_merged_values_in_rows=False, ) assert inputs.include_merged_cells is True +def test_resolve_extraction_inputs_warns_on_xls_formulas( + tmp_path: Path, monkeypatch: MonkeyPatch +) -> None: + calls: list[str] = [] + + def _warn_once(key: str, message: str) -> None: + """ + Record a warning key in the shared `calls` list while ignoring the message. + + Parameters: + key (str): Identifier for the warning; appended to the module-level `calls` list. + message (str): Ignored placeholder kept for compatibility with expected callback signature. + """ + calls.append(key) + _ = message + + monkeypatch.setattr("exstruct.core.pipeline.warn_once", _warn_once) + + inputs = resolve_extraction_inputs( + tmp_path / "book.xls", + mode="standard", + include_cell_links=None, + include_print_areas=None, + include_auto_page_breaks=False, + include_colors_map=None, + include_default_background=False, + ignore_colors=None, + include_formulas_map=True, + include_merged_cells=None, + include_merged_values_in_rows=True, + ) + assert inputs.use_com_for_formulas is True + assert calls + + +def test_resolve_extraction_inputs_sets_ignore_colors(tmp_path: Path) -> None: + inputs = resolve_extraction_inputs( + tmp_path / "book.xlsx", + mode="verbose", + include_cell_links=None, + include_print_areas=None, + include_auto_page_breaks=False, + include_colors_map=True, + include_default_background=False, + ignore_colors=None, + include_formulas_map=None, + include_merged_cells=None, + include_merged_values_in_rows=True, + ) + assert inputs.ignore_colors == set() + + def test_build_cells_tables_workbook_uses_print_areas( monkeypatch: MonkeyPatch, tmp_path: Path ) -> None: @@ -197,6 +289,8 @@ def fake_detect_tables(_: Path, __: str) -> list[str]: include_colors_map=False, include_default_background=False, ignore_colors=None, + include_formulas_map=False, + use_com_for_formulas=False, include_merged_cells=True, include_merged_values_in_rows=True, ) @@ -228,6 +322,8 @@ def test_build_cells_tables_workbook_excludes_merged_values_in_rows( include_colors_map=False, include_default_background=False, ignore_colors=None, + include_formulas_map=False, + use_com_for_formulas=False, include_merged_cells=True, include_merged_values_in_rows=False, ) @@ -259,3 +355,723 @@ def test_filter_rows_excluding_merged_values_drops_empty_rows() -> None: merged_cells = [MergedCellRange(r1=1, c1=0, r2=1, c2=0, v="A")] filtered = _filter_rows_excluding_merged_values(rows, merged_cells) assert filtered == [] + + +def test_filter_rows_excluding_merged_values_returns_when_empty() -> None: + assert _filter_rows_excluding_merged_values([], []) == [] + + +def test_filter_rows_excluding_merged_values_keeps_rows_without_intervals() -> None: + rows = [CellRow(r=1, c={"0": "A"})] + merged_cells = [MergedCellRange(r1=2, c1=0, r2=2, c2=1, v="B")] + filtered = _filter_rows_excluding_merged_values(rows, merged_cells) + assert filtered == rows + + +def test_filter_rows_excluding_merged_values_drops_links_when_filtered() -> None: + rows = [CellRow(r=1, c={"0": "A", "1": "B"}, links={"0": "L0"})] + merged_cells = [MergedCellRange(r1=1, c1=0, r2=1, c2=0, v="A")] + filtered = _filter_rows_excluding_merged_values(rows, merged_cells) + assert filtered[0].links is None + + +def test_resolve_sheet_colors_map_empty() -> None: + assert _resolve_sheet_colors_map(None, "Sheet1") == {} + + +def test_resolve_sheet_formulas_map_empty() -> None: + assert _resolve_sheet_formulas_map(None, "Sheet1") == {} + + +def test_merge_intervals_merges_adjacent() -> None: + assert _merge_intervals([(1, 2), (3, 4)]) == [(1, 4)] + + +def test_col_in_intervals_fast_false() -> None: + assert _col_in_intervals(1, [(3, 5)]) is False + + +def test_step_extract_colors_map_openpyxl_sets_data( + tmp_path: Path, monkeypatch: MonkeyPatch +) -> None: + def _fake( + _backend: OpenpyxlBackend, + *, + include_default_background: bool, + ignore_colors: set[str] | None, + ) -> object: + """ + Provide a placeholder colors map for testing that is always empty. + + Parameters: + include_default_background (bool): Accepted for signature compatibility; has no effect on the returned value. + ignore_colors (set[str] | None): Accepted for signature compatibility; has no effect on the returned value. + + Returns: + WorkbookColorsMap: An empty colors map with no sheets. + """ + _ = _backend + _ = include_default_background + _ = ignore_colors + return WorkbookColorsMap(sheets={}) + + monkeypatch.setattr(OpenpyxlBackend, "extract_colors_map", _fake) + inputs = ExtractionInputs( + file_path=tmp_path / "book.xlsx", + mode="standard", + include_cell_links=False, + include_print_areas=False, + include_auto_page_breaks=False, + include_colors_map=True, + include_default_background=False, + ignore_colors=None, + include_formulas_map=False, + use_com_for_formulas=False, + include_merged_cells=False, + include_merged_values_in_rows=True, + ) + artifacts = ExtractionArtifacts() + step_extract_colors_map_openpyxl(inputs, artifacts) + assert artifacts.colors_map_data is not None + + +def test_step_extract_colors_map_com_falls_back( + tmp_path: Path, monkeypatch: MonkeyPatch +) -> None: + def _fake_com( + _backend: ComBackend, + *, + include_default_background: bool, + ignore_colors: set[str] | None, + ) -> None: + """ + No-op placeholder that simulates a COM backend extraction step without producing any side effects. + + This function accepts a COM backend and related flags but intentionally performs no operations; it is used in tests as a stub implementation. + """ + _ = _backend + _ = include_default_background + _ = ignore_colors + return None + + def _fake_openpyxl( + _backend: OpenpyxlBackend, + *, + include_default_background: bool, + ignore_colors: set[str] | None, + ) -> object: + """ + Return an empty WorkbookColorsMap regardless of inputs. + + Parameters: + include_default_background (bool): Ignored; present for signature compatibility. + ignore_colors (set[str] | None): Ignored; present for signature compatibility. + + Returns: + WorkbookColorsMap: A colors map with no sheets. + """ + _ = _backend + _ = include_default_background + _ = ignore_colors + return WorkbookColorsMap(sheets={}) + + monkeypatch.setattr(ComBackend, "extract_colors_map", _fake_com) + monkeypatch.setattr(OpenpyxlBackend, "extract_colors_map", _fake_openpyxl) + inputs = ExtractionInputs( + file_path=tmp_path / "book.xlsx", + mode="standard", + include_cell_links=False, + include_print_areas=False, + include_auto_page_breaks=False, + include_colors_map=True, + include_default_background=False, + ignore_colors=None, + include_formulas_map=False, + use_com_for_formulas=False, + include_merged_cells=False, + include_merged_values_in_rows=True, + ) + artifacts = ExtractionArtifacts() + step_extract_colors_map_com(inputs, artifacts, object()) + assert artifacts.colors_map_data is not None + + +def test_step_extract_auto_page_breaks_com_sets_data( + tmp_path: Path, monkeypatch: MonkeyPatch +) -> None: + def _fake(_: ComBackend) -> dict[str, list[PrintArea]]: + """ + Return a stub mapping of sheet names to print areas containing a single 1x1 print area for "Sheet1". + + Returns: + dict[str, list[PrintArea]]: Mapping where "Sheet1" maps to a list with one PrintArea covering row 1, column 0 to row 1, column 0. + """ + return {"Sheet1": [PrintArea(r1=1, c1=0, r2=1, c2=0)]} + + monkeypatch.setattr(ComBackend, "extract_auto_page_breaks", _fake) + inputs = ExtractionInputs( + file_path=tmp_path / "book.xlsx", + mode="standard", + include_cell_links=False, + include_print_areas=False, + include_auto_page_breaks=True, + include_colors_map=False, + include_default_background=False, + ignore_colors=None, + include_formulas_map=False, + use_com_for_formulas=False, + include_merged_cells=False, + include_merged_values_in_rows=True, + ) + artifacts = ExtractionArtifacts() + step_extract_auto_page_breaks_com(inputs, artifacts, object()) + assert artifacts.auto_page_break_data + + +def test_build_cells_tables_workbook_fetches_missing_maps( + tmp_path: Path, monkeypatch: MonkeyPatch +) -> None: + colors_map = WorkbookColorsMap(sheets={}) + formulas_map = WorkbookFormulasMap(sheets={}) + + def _fake_colors( + _backend: OpenpyxlBackend, + *, + include_default_background: bool, + ignore_colors: set[str] | None, + ) -> object: + """ + Return a fake workbook colors map used by tests. + + Parameters: + _backend (OpenpyxlBackend): Ignored backend parameter retained for signature compatibility. + include_default_background (bool): Whether the default background color would be included (ignored). + ignore_colors (set[str] | None): Set of color names to ignore (ignored). + + Returns: + object: A preconstructed colors map object used by tests. + """ + _ = _backend + _ = include_default_background + _ = ignore_colors + return colors_map + + def _fake_formulas(_: OpenpyxlBackend) -> object: + """ + Return the pre-captured formulas_map object. + + Returns: + The pre-captured `formulas_map` object. + """ + return formulas_map + + monkeypatch.setattr(OpenpyxlBackend, "extract_colors_map", _fake_colors) + monkeypatch.setattr(OpenpyxlBackend, "extract_formulas_map", _fake_formulas) + + inputs = ExtractionInputs( + file_path=tmp_path / "book.xlsx", + mode="standard", + include_cell_links=False, + include_print_areas=False, + include_auto_page_breaks=False, + include_colors_map=True, + include_default_background=False, + ignore_colors=None, + include_formulas_map=True, + use_com_for_formulas=False, + include_merged_cells=False, + include_merged_values_in_rows=True, + ) + artifacts = ExtractionArtifacts( + cell_data={"Sheet1": [CellRow(r=1, c={"0": "A"})]}, + merged_cell_data={"Sheet1": []}, + ) + wb = build_cells_tables_workbook(inputs=inputs, artifacts=artifacts, reason="test") + assert "Sheet1" in wb.sheets + + +def test_step_extract_formulas_map_openpyxl_skips_on_failure( + tmp_path: Path, monkeypatch: MonkeyPatch, caplog: "pytest.LogCaptureFixture" +) -> None: + def _raise(_: OpenpyxlBackend) -> object: + """ + Always raises a RuntimeError with the message "boom". + + Raises: + RuntimeError: always raised with message "boom". + """ + raise RuntimeError("boom") + + monkeypatch.setattr(OpenpyxlBackend, "extract_formulas_map", _raise) + inputs = ExtractionInputs( + file_path=tmp_path / "book.xlsx", + mode="standard", + include_cell_links=False, + include_print_areas=False, + include_auto_page_breaks=False, + include_colors_map=False, + include_default_background=False, + ignore_colors=None, + include_formulas_map=True, + use_com_for_formulas=False, + include_merged_cells=False, + include_merged_values_in_rows=True, + ) + artifacts = ExtractionArtifacts() + + with caplog.at_level(logging.WARNING): + step_extract_formulas_map_openpyxl(inputs, artifacts) + + assert artifacts.formulas_map_data is None + assert "Failed to extract formulas_map via openpyxl" in caplog.text + + +def test_step_extract_formulas_map_com_skips_on_failure( + tmp_path: Path, monkeypatch: MonkeyPatch, caplog: "pytest.LogCaptureFixture" +) -> None: + def _raise(_: ComBackend) -> object: + """ + Always raises a RuntimeError with message "boom". + + Raises: + RuntimeError: Always raised by this helper. + """ + raise RuntimeError("boom") + + monkeypatch.setattr(ComBackend, "extract_formulas_map", _raise) + inputs = ExtractionInputs( + file_path=tmp_path / "book.xlsx", + mode="standard", + include_cell_links=False, + include_print_areas=False, + include_auto_page_breaks=False, + include_colors_map=False, + include_default_background=False, + ignore_colors=None, + include_formulas_map=True, + use_com_for_formulas=True, + include_merged_cells=False, + include_merged_values_in_rows=True, + ) + artifacts = ExtractionArtifacts() + + with caplog.at_level(logging.WARNING): + step_extract_formulas_map_com(inputs, artifacts, object()) + + assert artifacts.formulas_map_data is None + assert "Failed to extract formulas_map via COM" in caplog.text + + +def test_filter_rows_excluding_merged_values_returns_rows_when_intervals_empty() -> ( + None +): + rows = [CellRow(r=1, c={"0": "A"})] + merged_cells = [MergedCellRange(r1=3, c1=0, r2=4, c2=1, v="A")] + assert _filter_rows_excluding_merged_values(rows, merged_cells) == rows + + +def test_resolve_sheet_colors_map_missing_sheet() -> None: + colors_map = WorkbookColorsMap( + sheets={"Other": SheetColorsMap(sheet_name="Other", colors_map={})} + ) + assert _resolve_sheet_colors_map(colors_map, "Sheet1") == {} + + +def test_resolve_sheet_formulas_map_missing_sheet() -> None: + formulas_map = WorkbookFormulasMap( + sheets={"Other": SheetFormulasMap(sheet_name="Other", formulas_map={})} + ) + assert _resolve_sheet_formulas_map(formulas_map, "Sheet1") == {} + + +def test_merge_intervals_empty() -> None: + assert _merge_intervals([]) == [] + + +def test_merge_intervals_keeps_non_overlapping() -> None: + assert _merge_intervals([(1, 2), (5, 6)]) == [(1, 2), (5, 6)] + + +def test_step_extract_shapes_com_sets_data( + tmp_path: Path, monkeypatch: MonkeyPatch +) -> None: + shapes_data = {"Sheet1": [object()]} + + def _fake(_: object, *, mode: str) -> dict[str, list[object]]: + """ + Provide a stub that supplies the module-level `shapes_data` mapping. + + Parameters: + _ (object): Placeholder positional argument; ignored. + mode (str): Mode selector; ignored. + + Returns: + dict[str, list[object]]: Mapping of sheet names to lists of shape objects from `shapes_data`. + """ + _ = mode + return shapes_data + + monkeypatch.setattr("exstruct.core.pipeline.get_shapes_with_position", _fake) + inputs = ExtractionInputs( + file_path=tmp_path / "book.xlsx", + mode="standard", + include_cell_links=False, + include_print_areas=False, + include_auto_page_breaks=False, + include_colors_map=False, + include_default_background=False, + ignore_colors=None, + include_formulas_map=False, + use_com_for_formulas=False, + include_merged_cells=False, + include_merged_values_in_rows=True, + ) + artifacts = ExtractionArtifacts() + step_extract_shapes_com(inputs, artifacts, object()) + assert artifacts.shape_data == shapes_data + + +def test_step_extract_charts_com_sets_data( + tmp_path: Path, monkeypatch: MonkeyPatch +) -> None: + charts = [object()] + + def _fake(_: object, *, mode: str) -> list[object]: + """ + Return the captured charts list. + + Parameters: + mode (str): Ignored; accepted for compatibility with callers. + + Returns: + list[object]: The charts list captured from the enclosing scope. + """ + _ = mode + return charts + + class _Sheet: + def __init__(self, name: str) -> None: + """ + Initialize the instance with a display name. + + Parameters: + name (str): The name to assign to the instance. + """ + self.name = name + + class _Workbook: + sheets = [_Sheet("Sheet1")] + + monkeypatch.setattr("exstruct.core.pipeline.get_charts", _fake) + inputs = ExtractionInputs( + file_path=tmp_path / "book.xlsx", + mode="standard", + include_cell_links=False, + include_print_areas=False, + include_auto_page_breaks=False, + include_colors_map=False, + include_default_background=False, + ignore_colors=None, + include_formulas_map=False, + use_com_for_formulas=False, + include_merged_cells=False, + include_merged_values_in_rows=True, + ) + artifacts = ExtractionArtifacts() + step_extract_charts_com(inputs, artifacts, _Workbook()) + assert artifacts.chart_data == {"Sheet1": charts} + + +def test_step_extract_print_areas_com_skips_when_present( + tmp_path: Path, monkeypatch: MonkeyPatch +) -> None: + def _raise(_: ComBackend) -> object: + """ + Raise a RuntimeError indicating this code path must not be invoked. + + This function always raises RuntimeError("should not be called"). + """ + raise RuntimeError("should not be called") + + monkeypatch.setattr(ComBackend, "extract_print_areas", _raise) + inputs = ExtractionInputs( + file_path=tmp_path / "book.xlsx", + mode="standard", + include_cell_links=False, + include_print_areas=True, + include_auto_page_breaks=False, + include_colors_map=False, + include_default_background=False, + ignore_colors=None, + include_formulas_map=False, + use_com_for_formulas=False, + include_merged_cells=False, + include_merged_values_in_rows=True, + ) + artifacts = ExtractionArtifacts( + print_area_data={"Sheet1": [PrintArea(r1=1, c1=0, r2=1, c2=0)]} + ) + step_extract_print_areas_com(inputs, artifacts, object()) + + +def test_step_extract_print_areas_com_sets_data( + tmp_path: Path, monkeypatch: MonkeyPatch +) -> None: + def _fake(_: ComBackend) -> dict[str, list[PrintArea]]: + """ + Return a stub mapping of sheet names to print areas containing a single 1x1 print area for "Sheet1". + + Returns: + dict[str, list[PrintArea]]: Mapping where "Sheet1" maps to a list with one PrintArea covering row 1, column 0 to row 1, column 0. + """ + return {"Sheet1": [PrintArea(r1=1, c1=0, r2=1, c2=0)]} + + monkeypatch.setattr(ComBackend, "extract_print_areas", _fake) + inputs = ExtractionInputs( + file_path=tmp_path / "book.xlsx", + mode="standard", + include_cell_links=False, + include_print_areas=True, + include_auto_page_breaks=False, + include_colors_map=False, + include_default_background=False, + ignore_colors=None, + include_formulas_map=False, + use_com_for_formulas=False, + include_merged_cells=False, + include_merged_values_in_rows=True, + ) + artifacts = ExtractionArtifacts() + step_extract_print_areas_com(inputs, artifacts, object()) + assert artifacts.print_area_data + + +def test_step_extract_colors_map_com_sets_data( + tmp_path: Path, monkeypatch: MonkeyPatch +) -> None: + colors_map = WorkbookColorsMap(sheets={}) + + def _fake_com( + _backend: ComBackend, + *, + include_default_background: bool, + ignore_colors: set[str] | None, + ) -> object: + """ + Return a colors map object suitable for use as a COM backend response. + + Parameters: + include_default_background (bool): If true, the returned colors map should include the default background color. + ignore_colors (set[str] | None): Optional set of color identifiers to exclude from the returned map; `None` means no colors are excluded. + + Returns: + object: A colors map representing workbook-level color mappings. + """ + _ = _backend + _ = include_default_background + _ = ignore_colors + return colors_map + + def _raise( + _backend: OpenpyxlBackend, + *, + include_default_background: bool, + ignore_colors: set[str] | None, + ) -> object: + """ + Placeholder backend sentinel that always raises a RuntimeError when invoked. + + Raises: + RuntimeError: Always raised with message "should not be called". + """ + _ = _backend + _ = include_default_background + _ = ignore_colors + raise RuntimeError("should not be called") + + monkeypatch.setattr(ComBackend, "extract_colors_map", _fake_com) + monkeypatch.setattr(OpenpyxlBackend, "extract_colors_map", _raise) + inputs = ExtractionInputs( + file_path=tmp_path / "book.xlsx", + mode="standard", + include_cell_links=False, + include_print_areas=False, + include_auto_page_breaks=False, + include_colors_map=True, + include_default_background=False, + ignore_colors=None, + include_formulas_map=False, + use_com_for_formulas=False, + include_merged_cells=False, + include_merged_values_in_rows=True, + ) + artifacts = ExtractionArtifacts() + step_extract_colors_map_com(inputs, artifacts, object()) + assert artifacts.colors_map_data is colors_map + + +def test_run_com_pipeline_executes_steps(tmp_path: Path) -> None: + calls: list[str] = [] + + def _step(_: ExtractionInputs, artifacts: ExtractionArtifacts, __: object) -> None: + """ + Test pipeline step that simulates shape extraction. + + Sets artifacts.shape_data to a mapping for "Sheet1" containing a single Shape and records invocation by appending "called" to the outer `calls` list. + + Parameters: + _ (ExtractionInputs): Unused extraction inputs placeholder. + artifacts (ExtractionArtifacts): Artifacts object to populate with shape data. + __ (object): Unused context placeholder. + """ + calls.append("called") + artifacts.shape_data = {"Sheet1": [Shape(id=1, text="", l=0, t=0)]} + + inputs = ExtractionInputs( + file_path=tmp_path / "book.xlsx", + mode="standard", + include_cell_links=False, + include_print_areas=False, + include_auto_page_breaks=False, + include_colors_map=False, + include_default_background=False, + ignore_colors=None, + include_formulas_map=False, + use_com_for_formulas=False, + include_merged_cells=False, + include_merged_values_in_rows=True, + ) + artifacts = ExtractionArtifacts() + run_com_pipeline([_step], inputs, artifacts, object()) + assert calls == ["called"] + assert artifacts.shape_data + + +def test_run_extraction_pipeline_com_success( + tmp_path: Path, monkeypatch: MonkeyPatch +) -> None: + class _Sheet: + def __init__(self, name: str) -> None: + """ + Initialize the instance with a display name. + + Parameters: + name (str): The name to assign to the instance. + """ + self.name = name + + class _Sheets: + def __init__(self) -> None: + """ + Initialize the object with a single default sheet named "Sheet1". + + Creates the internal mapping `self._sheets` and populates it with one `_Sheet` instance keyed by "Sheet1". + """ + self._sheets = {"Sheet1": _Sheet("Sheet1")} + + def __getitem__(self, name: str) -> _Sheet: + """ + Access a worksheet by its name. + + Parameters: + name (str): The name of the sheet to retrieve. + + Returns: + _Sheet: The sheet object associated with `name`. + + Raises: + KeyError: If no sheet with the given name exists. + """ + return self._sheets[name] + + class _Workbook: + sheets = _Sheets() + + def _pre_step(_: ExtractionInputs, artifacts: ExtractionArtifacts) -> None: + """ + Populate artifacts with default minimal cell and merged-cell data for a single sheet. + + Parameters: + _ (ExtractionInputs): Unused extraction inputs placeholder. + artifacts (ExtractionArtifacts): Mutable extraction artifacts that will be updated with + `cell_data` set to a single row for "Sheet1" and `merged_cell_data` set to an empty list + for "Sheet1". + """ + artifacts.cell_data = {"Sheet1": [CellRow(r=1, c={"0": "A"})]} + artifacts.merged_cell_data = {"Sheet1": []} + + def _fake_plan(_: ExtractionInputs) -> PipelinePlan: + """ + Create a fixed PipelinePlan for tests that forces COM usage and provides a single pre-COM step. + + Parameters: + _ (ExtractionInputs): Ignored input; present to match the PipelinePlan factory signature. + + Returns: + PipelinePlan: A plan with `pre_com_steps` set to a list containing `_pre_step`, `com_steps` empty, and `use_com` set to `True`. + """ + return PipelinePlan(pre_com_steps=[_pre_step], com_steps=[], use_com=True) + + def _fake_detect_tables(_: object) -> list[str]: + """ + Provide a detector that always reports no table ranges. + + The input workbook-like object is ignored. + + Returns: + list[str]: An empty list of table range identifiers. + """ + return [] + + def _fake_workbook(_: Path) -> object: + """ + Provide a context manager that yields a lightweight fake workbook for tests. + + Parameters: + _ (Path): Ignored file path parameter retained to match the real backend signature. + + Returns: + object: A context manager whose `__enter__` returns a new `_Workbook` instance and whose `__exit__` does not suppress exceptions (returns `None`). + """ + + class _Context: + def __enter__(self) -> _Workbook: + return _Workbook() + + def __exit__( + self, + exc_type: type[BaseException] | None, + exc: BaseException | None, + tb: object | None, + ) -> bool | None: + _ = exc_type + _ = exc + _ = tb + return None + + return _Context() + + monkeypatch.delenv("SKIP_COM_TESTS", raising=False) + monkeypatch.setattr("exstruct.core.pipeline.build_pipeline_plan", _fake_plan) + monkeypatch.setattr("exstruct.core.pipeline.detect_tables", _fake_detect_tables) + monkeypatch.setattr("exstruct.core.pipeline.xlwings_workbook", _fake_workbook) + + inputs = ExtractionInputs( + file_path=tmp_path / "book.xlsx", + mode="standard", + include_cell_links=False, + include_print_areas=False, + include_auto_page_breaks=False, + include_colors_map=False, + include_default_background=False, + ignore_colors=None, + include_formulas_map=False, + use_com_for_formulas=False, + include_merged_cells=False, + include_merged_values_in_rows=True, + ) + + result = run_extraction_pipeline(inputs) + assert result.state.com_attempted is True + assert result.state.com_succeeded is True + assert "Sheet1" in result.workbook.sheets diff --git a/tests/core/test_pipeline_fallbacks.py b/tests/core/test_pipeline_fallbacks.py index 393be5e..9600ef9 100644 --- a/tests/core/test_pipeline_fallbacks.py +++ b/tests/core/test_pipeline_fallbacks.py @@ -34,6 +34,7 @@ def test_pipeline_fallback_skip_com_tests( include_colors_map=False, include_default_background=False, ignore_colors=None, + include_formulas_map=None, include_merged_cells=None, include_merged_values_in_rows=True, ) @@ -50,6 +51,11 @@ def test_pipeline_fallback_skip_com_tests( def test_pipeline_fallback_com_unavailable( monkeypatch: MonkeyPatch, tmp_path: Path ) -> None: + """ + Verifies that the extraction pipeline falls back when COM access is unavailable. + + Creates a basic workbook, forces the COM-access entry point to raise, runs the extraction pipeline, and asserts that the pipeline records a fallback due to COM being unavailable (`FallbackReason.COM_UNAVAILABLE`), did not attempt COM (`com_attempted is False`), and that the resulting sheet "Sheet1" exists, contains rows, and has no shapes or charts. + """ path = tmp_path / "book.xlsx" _make_basic_book(path) monkeypatch.delenv("SKIP_COM_TESTS", raising=False) @@ -68,6 +74,7 @@ def _raise(*_args: object, **_kwargs: object) -> None: include_colors_map=False, include_default_background=False, ignore_colors=None, + include_formulas_map=None, include_merged_cells=None, include_merged_values_in_rows=True, ) @@ -110,6 +117,7 @@ def _raise( include_colors_map=False, include_default_background=False, ignore_colors=None, + include_formulas_map=None, include_merged_cells=None, include_merged_values_in_rows=True, ) diff --git a/tests/core/test_workbook_utils.py b/tests/core/test_workbook_utils.py index 002076d..b43a9a4 100644 --- a/tests/core/test_workbook_utils.py +++ b/tests/core/test_workbook_utils.py @@ -2,6 +2,7 @@ from collections.abc import Iterator from pathlib import Path +from typing import cast import pytest @@ -87,7 +88,7 @@ class _DummyApp: monkeypatch.setattr(workbook.xw, "apps", [_DummyApp()]) file_path = _DummyPath("good") - assert workbook._find_open_workbook(file_path) is None + assert workbook._find_open_workbook(cast(Path, file_path)) is None def test_find_open_workbook_returns_none_on_iter_error( diff --git a/tests/engine/test_engine.py b/tests/engine/test_engine.py index 9ff36ec..816fd51 100644 --- a/tests/engine/test_engine.py +++ b/tests/engine/test_engine.py @@ -34,9 +34,23 @@ def fake_extract( include_colors_map: bool = False, include_default_background: bool = False, ignore_colors: set[str] | None = None, + include_formulas_map: bool | None = None, include_merged_cells: bool | None = None, include_merged_values_in_rows: bool = True, ) -> WorkbookData: + """ + Test helper that simulates workbook extraction for unit tests. + + Records the received `mode` and `include_print_areas` into the outer `called` mapping and returns a minimal WorkbookData whose `book_name` is the input path's filename and whose `sheets` is empty. + + Parameters: + path (Path): Path to the workbook; its filename is used for the returned WorkbookData.book_name. + mode (str): Extraction mode passed through and recorded. + include_print_areas (bool): Whether print areas were requested; the value is recorded in `called`. + + Returns: + WorkbookData: A WorkbookData instance with `book_name` set to path.name and an empty `sheets` mapping. + """ called["mode"] = mode called["include_print_areas"] = include_print_areas return WorkbookData(book_name=path.name, sheets={}) diff --git a/tests/integration/test_integrate_raw_data.py b/tests/integration/test_integrate_raw_data.py index d94507c..fcd020d 100644 --- a/tests/integration/test_integrate_raw_data.py +++ b/tests/integration/test_integrate_raw_data.py @@ -4,7 +4,12 @@ from _pytest.monkeypatch import MonkeyPatch -from exstruct.core.cells import SheetColorsMap, WorkbookColorsMap +from exstruct.core.cells import ( + SheetColorsMap, + SheetFormulasMap, + WorkbookColorsMap, + WorkbookFormulasMap, +) from exstruct.core.pipeline import collect_sheet_raw_data from exstruct.models import CellRow, Chart, ChartSeries, PrintArea, Shape @@ -52,6 +57,13 @@ def test_collect_sheet_raw_data_includes_extracted_fields( ) } ) + formulas_map = WorkbookFormulasMap( + sheets={ + "Sheet1": SheetFormulasMap( + sheet_name="Sheet1", formulas_map={"=A1": [(1, 0)]} + ) + } + ) result = collect_sheet_raw_data( cell_data={"Sheet1": [CellRow(r=1, c={"0": "A"}, links=None)]}, shape_data={"Sheet1": [Shape(text="S", l=0, t=0)]}, @@ -62,6 +74,7 @@ def test_collect_sheet_raw_data_includes_extracted_fields( include_merged_values_in_rows=True, print_area_data={"Sheet1": [PrintArea(r1=1, c1=0, r2=1, c2=0)]}, auto_page_break_data={"Sheet1": [PrintArea(r1=1, c1=0, r2=1, c2=0)]}, + formulas_map_data=formulas_map, colors_map_data=colors_map, ) @@ -72,6 +85,7 @@ def test_collect_sheet_raw_data_includes_extracted_fields( assert raw.table_candidates == ["A1:B2"] assert raw.print_areas assert raw.auto_print_areas + assert raw.formulas_map == {"=A1": [(1, 0)]} assert raw.colors_map == {"#FFFFFF": [(1, 0)]} @@ -98,6 +112,7 @@ def test_collect_sheet_raw_data_skips_charts_in_light_mode( include_merged_values_in_rows=True, print_area_data=None, auto_page_break_data=None, + formulas_map_data=None, colors_map_data=None, ) diff --git a/tests/models/test_modeling.py b/tests/models/test_modeling.py index c367c6a..4570b19 100644 --- a/tests/models/test_modeling.py +++ b/tests/models/test_modeling.py @@ -25,6 +25,7 @@ def test_build_workbook_data_from_raw() -> None: table_candidates=["A1:A1"], print_areas=[PrintArea(r1=1, c1=0, r2=1, c2=0)], auto_print_areas=[], + formulas_map={"=A1": [(1, 0)]}, colors_map={"#FFFFFF": [(1, 0)]}, merged_cells=[MergedCellRange(r1=1, c1=0, r2=1, c2=0, v=" ")], ) diff --git a/tests/models/test_models_export.py b/tests/models/test_models_export.py index b318d7f..3b922f7 100644 --- a/tests/models/test_models_export.py +++ b/tests/models/test_models_export.py @@ -1,6 +1,8 @@ +from collections.abc import Callable from importlib import util import json from pathlib import Path +from typing import Any, cast import pytest @@ -16,9 +18,16 @@ HAS_PYYAML = util.find_spec("yaml") is not None HAS_TOON = util.find_spec("toon") is not None +_SkipIf = Callable[[Callable[..., Any]], Callable[..., Any]] def _sheet() -> SheetData: + """ + Create a sample SheetData containing one row, no shapes or charts, and a single table candidate. + + Returns: + SheetData: A SheetData instance with one CellRow (r=1, c={"0": "A"}), empty shapes and charts lists, and table_candidates set to ["A1:B2"]. + """ return SheetData( rows=[CellRow(r=1, c={"0": "A"})], shapes=[], @@ -67,8 +76,8 @@ def test_save_unsupported_format_raises(tmp_path: Path) -> None: wb.save(bad) -# pytest.skipif is typed; no ignore needed -@pytest.mark.skipif(not HAS_PYYAML, reason="pyyaml not installed") # type: ignore[misc] +# cast to _SkipIf satisfies mypy strict mode for decorator typing +@cast(_SkipIf, pytest.mark.skipif(not HAS_PYYAML, reason="pyyaml not installed")) def test_sheet_to_yaml_roundtrip() -> None: sheet = _sheet() text = sheet.to_yaml() @@ -76,7 +85,7 @@ def test_sheet_to_yaml_roundtrip() -> None: assert "SheetData" not in text # not a repr -@pytest.mark.skipif(not HAS_PYYAML, reason="pyyaml not installed") # type: ignore[misc] +@cast(_SkipIf, pytest.mark.skipif(not HAS_PYYAML, reason="pyyaml not installed")) def test_workbook_to_yaml() -> None: wb = _workbook() text = wb.to_yaml() diff --git a/tests/render/test_render_init.py b/tests/render/test_render_init.py index d9b7134..394cb2c 100644 --- a/tests/render/test_render_init.py +++ b/tests/render/test_render_init.py @@ -292,8 +292,8 @@ def test_export_sheet_images_success( written = render.export_sheet_images(xlsx, out_dir, dpi=144) assert written[0].name == "01_Sheet_1.png" - assert written[1].name == "01_Sheet_1_p02.png" - assert written[2].name == "02_sheet.png" + assert written[1].name == "02_Sheet_1.png" + assert written[2].name == "03_sheet.png" assert all(path.exists() for path in written) @@ -541,3 +541,383 @@ def test_sanitize_sheet_filename() -> None: """_sanitize_sheet_filename replaces invalid characters and defaults.""" assert render._sanitize_sheet_filename("Sheet/1") == "Sheet_1" assert render._sanitize_sheet_filename(" ") == "sheet" + + +def test_split_csv_respecting_quotes() -> None: + """Split CSV-like PrintArea strings while honoring quotes.""" + raw = "'Sheet 1'!A1:B2,'Sheet,2'!C3:D4,'O''Brien'!E1:F2" + parts = render._split_csv_respecting_quotes(raw) + assert parts == ["'Sheet 1'!A1:B2", "'Sheet,2'!C3:D4", "'O''Brien'!E1:F2"] + + +def test_extract_print_areas_with_page_setup() -> None: + """Parse PrintArea from a PageSetup stub.""" + + class _PageSetup: + PrintArea = "'Sheet 1'!A1:B2,'Sheet 1'!C3:D4" + + class _SheetApi: + PageSetup = _PageSetup() + + areas = render._extract_print_areas(cast(render._SheetApiProtocol, _SheetApi())) + assert areas == ["'Sheet 1'!A1:B2", "'Sheet 1'!C3:D4"] + + +def test_extract_print_areas_empty_print_area() -> None: + """Return empty list when PrintArea is empty.""" + + class _PageSetup: + PrintArea = "" + + class _SheetApi: + PageSetup = _PageSetup() + + assert ( + render._extract_print_areas(cast(render._SheetApiProtocol, _SheetApi())) == [] + ) + + +def test_extract_print_areas_handles_exception() -> None: + """Return empty list when PrintArea access raises.""" + + class _PageSetup: + @property + def PrintArea(self) -> str: + """ + Simulate accessing a worksheet's PrintArea and always raise an error to emulate a failure. + + Raises: + RuntimeError: Always raised to simulate an error when retrieving the PrintArea. + """ + raise RuntimeError("boom") + + class _SheetApi: + PageSetup = _PageSetup() + + assert ( + render._extract_print_areas(cast(render._SheetApiProtocol, _SheetApi())) == [] + ) + + +def test_iter_sheet_apis_prefers_worksheets_collection() -> None: + """Prefer the Worksheets collection when iterating COM sheets.""" + + class _WsApi: + def __init__(self, name: str) -> None: + """ + Initialize the FakeSheet with the given Excel sheet name. + + Parameters: + name (str): The sheet's name to assign to the object's `Name` attribute. + """ + self.Name = name + + class _Worksheets: + def __init__(self) -> None: + """ + Initialize the fake PDF document stub. + + Sets the `Count` attribute to 2 to emulate a document with two pages. + """ + self.Count = 2 + + def Item(self, index: int) -> _WsApi: + """ + Return a worksheet API stub for the sheet at the given index. + + Parameters: + index (int): One-based index of the worksheet within the workbook. + + Returns: + _WsApi: A worksheet API stub corresponding to the sheet at `index`. + """ + return _WsApi(f"Sheet{index}") + + class _Api: + Worksheets = _Worksheets() + + class _Wb: + api = _Api() + sheets: list[Any] = [] + + result = render._iter_sheet_apis(_Wb()) + assert result[0][1] == "Sheet1" + assert result[1][1] == "Sheet2" + + +def test_export_pdf_propagates_render_error( + tmp_path: Path, monkeypatch: pytest.MonkeyPatch +) -> None: + def _raise() -> xw.App: + """ + Always raises a RenderError to simulate failure when obtaining an Excel application. + + Raises: + RenderError: Always raised with the message "boom". + """ + raise RenderError("boom") + + monkeypatch.setattr(render, "_require_excel_app", _raise) + with pytest.raises(RenderError, match="boom"): + render.export_pdf(tmp_path / "in.xlsx", tmp_path / "out.pdf") + + +def test_require_pdfium_success(monkeypatch: pytest.MonkeyPatch) -> None: + """_require_pdfium returns the imported module when available.""" + fake_pdfium = ModuleType("pypdfium2") + sys.modules["pypdfium2"] = fake_pdfium + try: + assert render._require_pdfium() is fake_pdfium + finally: + sys.modules.pop("pypdfium2", None) + + +def test_build_sheet_export_plan_handles_multiple_areas( + monkeypatch: pytest.MonkeyPatch, +) -> None: + """Expand multiple print areas into separate export plan rows.""" + + class _SheetApi: + pass + + def _fake_iter(_: xw.Book) -> list[tuple[int, str, _SheetApi]]: + """ + Return a single-item list that mimics iterating workbook sheets for tests. + + Returns: + A list with one tuple (index, sheet name, sheet API stub): (0, "Sheet1", _SheetApi()). + """ + return [(0, "Sheet1", _SheetApi())] + + def _fake_extract(_: _SheetApi) -> list[str]: + """ + Provide two fake print-area ranges for testing. + + Parameters: + _ (_SheetApi): Ignored sheet API placeholder. + + Returns: + list[str]: Two print-area ranges: "A1:B2" and "C3:D4". + """ + return ["A1:B2", "C3:D4"] + + monkeypatch.setattr(render, "_iter_sheet_apis", _fake_iter) + monkeypatch.setattr(render, "_extract_print_areas", _fake_extract) + + plan = render._build_sheet_export_plan(cast(xw.Book, object())) + assert [item[0] for item in plan] == ["Sheet1", "Sheet1"] + assert [item[2] for item in plan] == ["A1:B2", "C3:D4"] + + +def test_page_index_from_suffix_default() -> None: + """Default to zero when no suffix exists.""" + assert render._page_index_from_suffix("sheet") == 0 + + +def test_page_index_from_suffix_non_digit() -> None: + """Default to zero when suffix is not numeric.""" + assert render._page_index_from_suffix("sheet_pxx") == 0 + + +def test_export_sheet_pdf_skips_invalid_print_area(tmp_path: Path) -> None: + """Skip restoring PrintArea when setter fails.""" + + class _BadPageSetup: + @property + def PrintArea(self) -> str: + """ + Represents the worksheet's PrintArea setting as an Excel range string. + + Returns: + str: The PrintArea range (e.g., "A1:B2"). + """ + return "A1:B2" + + @PrintArea.setter + def PrintArea(self, _value: object) -> None: + """ + Simulated setter for PrintArea that always fails. + + Parameters: + _value (object): Ignored; the provided value is not used because the setter always raises. + + Raises: + RuntimeError: Always raised with the message "bad". + """ + raise RuntimeError("bad") + + class _SheetApi: + PageSetup = _BadPageSetup() + + def ExportAsFixedFormat( + self, _file_format: int, _output_path: str, *args: object, **kwargs: object + ) -> None: + """ + Simulate exporting a workbook/sheet to a fixed-format file by writing a minimal fake PDF header to the given path. + + Parameters: + _file_format (int): Ignored numeric format indicator. + _output_path (str): Filesystem path where the fake export file will be written. + *args, **kwargs: Additional arguments are accepted and ignored. + """ + _ = args + _ = kwargs + + render._export_sheet_pdf( + cast(render._SheetApiProtocol, _SheetApi()), + tmp_path / "out.pdf", + ignore_print_areas=False, + print_area="A1:B2", + ) + + +def test_render_sheet_images_requires_pdfium(tmp_path: Path) -> None: + """Raise RenderError when pdfium is missing.""" + with pytest.raises(RenderError, match="pypdfium2 is required"): + render._render_sheet_images( + None, + tmp_path / "sheet.pdf", + tmp_path, + 0, + "Sheet1", + 144, + False, + ) + + +def test_export_sheet_images_with_app_retries_on_empty( + tmp_path: Path, monkeypatch: pytest.MonkeyPatch +) -> None: + """Retry export when rendering returns empty results.""" + calls: list[int] = [] + + def _fake_render( + _pdfium: ModuleType | None, + _pdf_path: Path, + output_dir: Path, + sheet_index: int, + safe_name: str, + _dpi: int, + _use_subprocess: bool, + ) -> list[Path]: + """ + Simulates rendering a PDF sheet to image files for tests. + + On the first invocation this function returns an empty list to simulate a transient empty render result; on subsequent invocations it returns a single Path inside output_dir named "{sheet_index+1:02d}_{safe_name}.png". + + Parameters: + _pdfium: Ignored in the fake implementation (kept for signature compatibility). + _pdf_path: Ignored in the fake implementation (kept for signature compatibility). + output_dir (Path): Directory where the fake image path is located. + sheet_index (int): Zero-based index of the sheet; used to build the filename prefix. + safe_name (str): Sanitized sheet name used in the filename. + _dpi: Ignored in the fake implementation (kept for signature compatibility). + _use_subprocess: Ignored in the fake implementation (kept for signature compatibility). + + Returns: + list[Path]: Empty list on the first call, otherwise a list containing one Path pointing to the fake PNG file. + """ + calls.append(1) + if len(calls) == 1: + return [] + return [output_dir / f"{sheet_index + 1:02d}_{safe_name}.png"] + + monkeypatch.setattr(render, "_render_sheet_images", _fake_render) + monkeypatch.setattr( + render, "_require_excel_app", lambda: FakeApp(["Sheet1"], False) + ) + monkeypatch.setattr(render, "_export_sheet_pdf", lambda *a, **k: None) + monkeypatch.setattr( + render, + "_build_sheet_export_plan", + lambda _wb: [("Sheet1", cast(render._SheetApiProtocol, object()), None)], + ) + + result = render._export_sheet_images_with_app( + tmp_path / "in.xlsx", + tmp_path / "out", + tmp_path / "tmp", + 144, + False, + None, + ) + assert len(calls) == 2 + assert result + + +def test_page_index_from_suffix_handles_multi_digits() -> None: + """Support multi-digit page suffixes.""" + assert render._page_index_from_suffix("sheet_01") == 0 + assert render._page_index_from_suffix("sheet_01_p01") == 0 + assert render._page_index_from_suffix("sheet_01_p10") == 9 + assert render._page_index_from_suffix("sheet_01_p100") == 99 + assert render._page_index_from_suffix("sheet_01_p0") == 0 + + +def test_export_sheet_pdf_does_not_swallow_export_errors(tmp_path: Path) -> None: + """Propagate export errors even if restore fails.""" + + class _FlakyPageSetup: + def __init__(self) -> None: + """ + Initialize a PageSetup-like test stub with a default print area and a setter call counter. + + The instance starts with `_print_area` set to "A1" and `_set_calls` set to 0 to track how many times the print area setter has been invoked. + """ + self._print_area: object = "A1" + self._set_calls = 0 + + @property + def PrintArea(self) -> object: + """ + Retrieve the current PrintArea value from the PageSetup stub. + + Returns: + print_area (object): The stored PrintArea value (typically a string) or whatever was set on the stub. + """ + return self._print_area + + @PrintArea.setter + def PrintArea(self, value: object) -> None: + """ + Set the PrintArea value on this stub PageSetup instance. + + Parameters: + value (object): The print area value to assign. + + Raises: + RuntimeError: If the setter is invoked more than once (simulates a restore failure). + """ + if self._set_calls >= 1: + raise RuntimeError("restore failed") + self._print_area = value + self._set_calls += 1 + + class _ExplodingSheetApi: + PageSetup: render._PageSetupProtocol = cast( + render._PageSetupProtocol, _FlakyPageSetup() + ) + + def ExportAsFixedFormat( + self, file_format: int, output_path: str, *args: object, **kwargs: object + ) -> None: + """ + Simulate exporting to a fixed format; this stub always raises an export error. + + Raises: + RuntimeError: with message "export failed" when invoked. + """ + _ = file_format + _ = output_path + _ = args + _ = kwargs + raise RuntimeError("export failed") + + pdf_path = tmp_path / "out.pdf" + with pytest.raises(RuntimeError, match="export failed"): + render._export_sheet_pdf( + _ExplodingSheetApi(), + pdf_path, + ignore_print_areas=False, + print_area="A1:B2", + ) diff --git a/tests/utils.py b/tests/utils.py index 965eb79..8a00f0b 100644 --- a/tests/utils.py +++ b/tests/utils.py @@ -1,5 +1,5 @@ -from collections.abc import Callable -from typing import TypeVar, cast +from collections.abc import Callable, Iterable, Sequence +from typing import Literal, TypeVar, cast import pytest from typing_extensions import ParamSpec @@ -9,10 +9,35 @@ def parametrize( - *args: object, **kwargs: object + argnames: str | Sequence[str], + argvalues: Iterable[object], + *, + indirect: bool | Sequence[str] = False, + ids: Iterable[str | float | int | bool | None] + | Callable[[object], object | None] + | None = None, + scope: Literal["session", "package", "module", "class", "function"] | None = None, ) -> Callable[[Callable[P, R]], Callable[P, R]]: - """Type-safe wrapper around pytest.mark.parametrize.""" + """ + Return a decorator that parametrizes a test callable with the given argument names and values. + + Parameters: + argnames: One or more parameter names (single string or sequence of strings) to inject into the test callable. + argvalues: An iterable of values or value-tuples to use for each generated test case. + indirect: If True or a sequence of names, treat corresponding parameters as fixtures and resolve them indirectly. + ids: Optional iterable of case identifiers or a callable that produces an identifier for each value. + scope: Optional fixture scope to apply when parameters are used as fixtures ("session", "package", "module", "class", or "function"). + + Returns: + decorator: A decorator that applies the specified parametrization to a callable while preserving its signature. + """ return cast( Callable[[Callable[P, R]], Callable[P, R]], - pytest.mark.parametrize(*args, **kwargs), + pytest.mark.parametrize( + argnames, + argvalues, + indirect=indirect, + ids=ids, + scope=scope, + ), ) diff --git a/uv.lock b/uv.lock index 0d704c1..491e721 100644 --- a/uv.lock +++ b/uv.lock @@ -298,7 +298,7 @@ wheels = [ [[package]] name = "exstruct" -version = "0.3.6" +version = "0.3.7" source = { editable = "." } dependencies = [ { name = "numpy" },