
@ryokawajp ryokawajp commented Oct 10, 2025

Description

This example is located in scripts/teigaku_genzei/, a new directory created for this PR.

This example contains a set of Python scripts to:

  • preprocess documents (common)
    • json_glossary.py
    • json_qa.py
    • json_util.py
    • jsonl_util.py
  • convert the source documents to SDG seed data,
    • context_util.py
    • extract_icl_from_seed.py
    • make_context.py
    • make_icl.py
    • make_seed.py
    • seed_jsonl_2_qna_yaml.py
  • convert the source documents to test data, and
    • extract_qa.sh
    • qna_yaml_2_qa_csv.py
  • evaluate the fine-tuned model with the test data.
    • catalog_util.py
    • llm_as_judge_direct_criteria_correctness2.py
    • llm_as_judge_direct_positional_bias.py
    • llm_as_judge_direct_provider_importer.py
    • llm_as_judge_direct_rits_llama4.py
    • llm_as_judge_direct_rits_phi.py
    • remove_newline_from_csv.py
    • test_qa.py

All the shell scripts in this directory are examples of invoking the command-line interfaces of the Python scripts above. Their usage is described in detail in the README.
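
For illustration, here is a rough Python equivalent of two of those wrappers; the flag names and sample paths are copied from make_icl.sh and make_seed.sh as quoted later in this thread, and this sketch is not a replacement for the README instructions.

import subprocess

# Generate ICL examples, then join them with contexts into SDG seed data.
# Flags and paths mirror make_icl.sh and make_seed.sh in this directory.
subprocess.run(
    ["python", "./make_icl.py",
     "--short_context", "True",
     "--qa_file", "qa_table/0024004-072_01.csv",
     "--glossary_file", "qa_table/0024004-072_01_glossary.csv",
     "--context_file", "context/0024004-072_01_context.csv",
     "--out_icl_file", "icl/icl.jsonl"],
    check=True,
)
subprocess.run(
    ["python", "./make_seed.py",
     "--context_file", "context/0024001-021_context.csv",
     "--icl_file", "icl/icl.jsonl",
     "--out_seed_file", "tmp/seed_ja.jsonl",
     "--join_method", "cartesian"],
    check=True,
)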

The evaluation code depends on Unitxt with a custom catalog. The custom catalog is stored in local_catalog/.
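
For reference, a minimal sketch of how a local Unitxt catalog is typically registered and queried; the artifact name below is derived from one of the catalog files added in this PR, and the exact loading code used by test_qa.py may differ.

from unitxt import register_local_catalog, get_from_catalog

# Make Unitxt resolve artifacts from local_catalog/ in addition to the
# built-in catalog, then fetch one of the evaluators stored there.
register_local_catalog("local_catalog")
judge_metric = get_from_catalog("metrics.llm_as_judge.direct.rits.phi_4")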

The source documents are not included in this directory, as described in the README. Each user must comply with the terms of use of the NTA (National Tax Agency) of Japan and download the documents themselves.

How Has This Been Tested?

Testing was done manually by following the instructions in the README.

We tested this example in the following environments:

  • Red Hat Enterprise Linux 8, Red Hat Enterprise Linux 9
  • Python 3.12.11
  • sdg_hub v0.2.x + Japanese support PR (#457, #458), and sdg_hub v0.7.1
  • nvidia-cuda-runtime-cu12 12.8.90
  • Student model: granite-3.3-8b-instruct (local)
  • Teacher model: phi-4 (RITS)
  • LLM-as-a-Judge: phi-4 (RITS)

This example does not alter sdg_hub or other examples in this repo. It is provided as an independent example.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

  • New Features

    • End-to-end SDG data pipeline: PDF extraction → QA & glossary → context assembly → ICL example generation → seed dataset export
    • Large set of preconfigured multi-provider LLM evaluators including positional-bias variants
  • Documentation

    • Comprehensive README and example notebook documenting workflows and CLI usage
  • Automation

    • New CLI utilities and shell scripts for parsing, conversion, testing, and data preparation
  • Chores

    • Added project ignore rules and dependency manifests for the notebook workspace


@ryokawajp ryokawajp requested a review from a team as a code owner October 10, 2025 13:52

coderabbitai bot commented Oct 10, 2025

Walkthrough

Adds a notebook-driven pipeline and many supporting scripts, utilities, and local catalog entries to parse PDF outputs into QA/glossary CSVs, compose section contexts, generate ICL examples and SDG seed JSONL, and register/run multiple LLM-evaluator configurations via a local Unitxt catalog.

Changes

Cohort / File(s) / Summary:

  • Project docs & config (notebooks/teigaku_genzei/README.md, notebooks/teigaku_genzei/requirements.txt, notebooks/teigaku_genzei/requirements-dev.txt, notebooks/teigaku_genzei/.gitignore): Added README, dependency lists, and workspace .gitignore.
  • Notebook pipeline (notebooks/teigaku_genzei/teigaku_genzei.ipynb, notebooks/teigaku_genzei/url.txt): New notebook implementing end‑to‑end pipeline functions and reference PDF URLs.
  • Catalog utilities (notebooks/teigaku_genzei/catalog_util.py, notebooks/teigaku_genzei/llm_as_judge_direct_provider_importer.py): Register local Unitxt catalog path, locate system catalog, and copy/tag catalog artifacts into local catalog.
  • LLM judge implementations (notebooks/teigaku_genzei/llm_as_judge_direct_criteria_correctness2.py, notebooks/teigaku_genzei/llm_as_judge_direct_positional_bias.py, notebooks/teigaku_genzei/llm_as_judge_direct_rits_phi.py, notebooks/teigaku_genzei/llm_as_judge_direct_rits_llama4.py): New judge criterion, positional-bias judge class, RITS engine wrapper, and register() functions to add evaluators to catalog.
  • Local catalog entries (notebooks/teigaku_genzei/local_catalog/...): Added 60+ JSON configs under local_catalog/metrics/llm_as_judge/** defining llm_judge_direct and llm_judge_direct_positional_bias evaluators across many providers/models.
  • JSON helpers & I/O (notebooks/teigaku_genzei/json_util.py, notebooks/teigaku_genzei/jsonl_util.py): Utilities for JSON classification (titles, Q/A markers) and JSONL read/write helpers.
  • QA & glossary extraction (notebooks/teigaku_genzei/json_qa.py, notebooks/teigaku_genzei/json_glossary.py, notebooks/teigaku_genzei/extract_icl_from_seed.py): Scripts to extract QA pairs and glossary entries from parsed JSON, deduplicate ICL items, and export CSV/JSONL.
  • Context composition (notebooks/teigaku_genzei/context_util.py, notebooks/teigaku_genzei/make_context.py, notebooks/teigaku_genzei/make_context.sh): Compose per‑section text blocks and glossaries and emit context CSVs (CLI + shell wrapper).
  • ICL & seed generation (notebooks/teigaku_genzei/make_icl.py, notebooks/teigaku_genzei/make_icl.sh, notebooks/teigaku_genzei/make_seed.py, notebooks/teigaku_genzei/make_seed.sh, notebooks/teigaku_genzei/extract_icl_from_seed.py): Generate combinatorial ICL examples (short/full), join ICL with contexts (slide/cartesian), and emit seed JSONL.
  • Format conversion & cleanup (notebooks/teigaku_genzei/qna_yaml_2_qa_csv.py, notebooks/teigaku_genzei/seed_jsonl_2_qna_yaml.py, notebooks/teigaku_genzei/remove_newline_from_csv.py): Utilities to convert YAML↔CSV↔JSONL and sanitize newlines in CSV columns.
  • Parsing & orchestration scripts (notebooks/teigaku_genzei/parser.sh, notebooks/teigaku_genzei/extract_qa.sh): Shell wrappers to run the parser and QA/glossary extraction.
  • Tests & examples (notebooks/teigaku_genzei/test_qa.py, notebooks/teigaku_genzei/test_single.sh, notebooks/teigaku_genzei/test_single_vllm.sh): New zero‑shot QA test harness integrating Unitxt and multiple inference backends plus helper scripts.
  • Directory ignore rules (notebooks/teigaku_genzei/context/.gitignore, .../docparser/.gitignore, .../icl/.gitignore, .../qa_table/.gitignore, .../seed/.gitignore, .../source/.gitignore): Added .gitignore entries to exclude generated CSV/JSONL/PDF and other artifacts in subdirectories.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Notebook as Notebook
    participant Docling as Docling/Parser
    participant JSONTools as json_qa/json_glossary
    participant Context as make_context.py
    participant ICL as make_icl.py
    participant Seed as make_seed.py
    participant Evaluator as test_qa.py

    User->>Notebook: run pipeline cells
    Notebook->>Docling: export_document_new_docling(PDFs)
    Docling-->>Notebook: parsed JSON files
    Notebook->>JSONTools: run json_qa / json_glossary
    JSONTools-->>Notebook: QA CSV, Glossary CSV
    Notebook->>Context: make_context.py(QA, glossary)
    Context-->>Notebook: context CSV
    Notebook->>ICL: make_icl.py(QA CSV, context CSV)
    ICL-->>Notebook: icl.jsonl
    Notebook->>Seed: make_seed.py(context CSV, icl.jsonl)
    Seed-->>Notebook: seed JSONL
    Notebook->>Evaluator: test_qa.py(seed or test data)
    Evaluator-->>User: evaluation results (JSON/CSV)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Areas to inspect closely:
    • make_seed.py — slide vs cartesian join logic and index/rolling selection (see the join sketch after this list).
    • make_icl.py — QA combination generation and short-context composition/truncation.
    • llm_as_judge_direct_rits_llama4.py — RITS base_url/version handling and prepare_engine override.
    • llm_as_judge_direct_positional_bias.py — aggregation of positional-bias scores and field naming.
    • catalog files & provider importer — correct type tagging and JSON schemas for catalog registrations.
    • extract_icl_from_seed.py — deduplication/hash generation logic and key selection.
    • test_qa.py — orchestration across multiple inference backends and result serialization.

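For reviewers unfamiliar with the seed generator, here is a minimal pandas sketch of the two join modes the --join_method flag selects between; the column names are illustrative assumptions, and make_seed.py's actual slide/rolling selection may differ in detail.

import pandas as pd

context_df = pd.DataFrame({"context": ["c1", "c2", "c3"]})
icl_df = pd.DataFrame({"icl_query_1": ["q1", "q2"]})

# cartesian: every context row paired with every ICL row (3 x 2 = 6 seeds)
cartesian_df = context_df.merge(icl_df, how="cross")

# slide: walk the contexts once, cycling through ICL rows by index
rolled = icl_df.iloc[[i % len(icl_df) for i in range(len(context_df))]]
slide_df = pd.concat(
    [context_df.reset_index(drop=True), rolled.reset_index(drop=True)], axis=1
)

print(len(cartesian_df), len(slide_df))  # 6 3
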
Poem

🐰 I hopped through PDFs and nibbled lines of code,
I stitched Qs to As and laid out the road.
Contexts neatly bundled, ICL seeds in a row,
JSONL sprouts ready, pipelines set to go.
A rabbit’s tidy workflow — quiet, quick, aglow.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): docstring coverage is 5.41%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): the title accurately reflects the primary change: introducing a Teigaku Genzei (fixed-amount tax reduction) example for SDG with Japanese documents.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d5812b and b316221.

📒 Files selected for processing (2)
  • notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/o1_preview.json (1 hunks)
  • notebooks/teigaku_genzei/teigaku_genzei.ipynb (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/o1_preview.json
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Summary
🔇 Additional comments (6)
notebooks/teigaku_genzei/teigaku_genzei.ipynb (6)

81-92: LGTM! OCR language configuration is well-documented.

The configuration correctly documents that RapidOCR ignores the language parameter, and the chosen values ["en", "ja"] align with the sdg_hub example. The extensive comments clearly explain the limitation and provide a reference to the relevant source code.


180-276: LGTM! Docling extraction pipeline is well-structured.

The pipeline configuration and document export logic is clear and well-organized. The use of Path objects, proper error handling in export_document, and directory creation before processing demonstrate good practices for a notebook example.


319-339: LGTM! Directory creation issue resolved.

The function now correctly creates the output directory at line 328 before writing CSV files, addressing the previous review concern.


366-385: LGTM! Directory creation issue resolved.

The function now correctly creates the output directory at line 377 before writing the context CSV file, addressing the previous review concern.


416-431: LGTM! Directory creation issue resolved.

The function now correctly creates the parent directory at line 422 before writing the ICL JSONL file, addressing the previous review concern.


468-482: LGTM! Directory creation issue resolved.

The function now correctly creates the parent directory at line 473 before writing the seed JSONL file, addressing the previous review concern. The use of orient="records" and lines=True is appropriate for JSONL output format.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 40

♻️ Duplicate comments (1)
scripts/teigaku_genzei/json_qa.py (1)

15-18: Code duplication flagged in related file.

This read_json_file duplication is addressed in the review comment for json_glossary.py (lines 15-18).

🧹 Nitpick comments (29)
scripts/teigaku_genzei/remove_newline_from_csv.py (1)

1-13: Consider using argparse for better CLI experience.

The script would benefit from proper argument parsing with help text and validation, following Python CLI best practices.

Consider this refactor using argparse:

"""
Remove newline characters from all object-typed columns in a CSV file.
"""
import argparse
import sys
from pathlib import Path
import pandas as pd


def remove_newlines_from_csv(input_file: Path, output_file: Path) -> None:
    """Process CSV file by replacing newlines with spaces in object columns."""
    df = pd.read_csv(input_file, encoding="utf-8")
    
    col_sr_dict = {
        col: df[col].str.replace("\n", " ") if df[col].dtype == "object" else df[col]
        for col in df.columns
    }
    df2 = pd.DataFrame(col_sr_dict)
    df2.to_csv(output_file, encoding="utf-8", index=False)


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Remove newline characters from object columns in a CSV file"
    )
    parser.add_argument("input_csv", type=Path, help="Input CSV file path")
    parser.add_argument("output_csv", type=Path, help="Output CSV file path")
    
    args = parser.parse_args()
    
    if not args.input_csv.exists():
        print(f"Error: Input file '{args.input_csv}' not found", file=sys.stderr)
        sys.exit(1)
    
    try:
        remove_newlines_from_csv(args.input_csv, args.output_csv)
    except pd.errors.ParserError as e:
        print(f"Error parsing CSV: {e}", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Unexpected error: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    main()

This approach provides:

  • Automatic help text with -h/--help
  • Better error messages
  • Type hints for maintainability
  • Separation of concerns with dedicated functions
  • Standard if __name__ == "__main__" guard
scripts/teigaku_genzei/parser.sh (1)

4-4: Remove or document the commented-out code.

The commented-out old path serves no purpose. Either remove it or add a comment explaining why it's kept for reference.

scripts/teigaku_genzei/qna_yaml_2_qa_csv.py (1)

2-4: Remove completed TODO comments.

The TODO comments describe functionality that has already been implemented. Remove them to keep the code clean.

Apply this diff:

-
-# TODO: load YAML file.
-# TODO: for each "questions_and_answers", select "question" and "answer".
-# TODO: save as a CSV file.
-
 import sys
scripts/teigaku_genzei/context_util.py (1)

4-19: Consider adding error handling for missing DataFrame columns.

All three functions access DataFrame columns without checking for their existence. If the DataFrame doesn't contain the expected columns (Title, Answer, Term, Description), a KeyError will be raised at runtime.

Consider validating column presence or catching KeyError exceptions, especially if the data source is external or user-provided. For example:

def compose_text_block(sr: pd.Series) -> str:
    if "Title" not in sr or "Answer" not in sr:
        raise ValueError("Missing required columns: Title or Answer")
    block = sr["Title"] + "\n" + sr["Answer"].removeprefix("[Ａ]")
    return block
scripts/teigaku_genzei/catalog_util.py (2)

5-7: Module-level side effects may cause issues.

Lines 7 and 22 execute function calls at module import time (register_local_catalog and get_system_catalog_path). This can:

  • Make the module difficult to test
  • Cause unexpected behavior when importing
  • Produce console output during import

Consider moving these calls into a setup() or main() function that callers invoke explicitly, or use the if __name__ == "__main__": guard for the print statement.

Also applies to: 22-22
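
A minimal sketch of that import-safe shape, reusing the module's existing names (signatures approximated from the review context, bodies elided):

LOCAL_CATALOG_PATH = "local_catalog"

def register_local_catalog(artifact_name):
    ...  # existing registration logic, unchanged

def get_system_catalog_path():
    ...  # existing lookup logic, unchanged

def setup() -> None:
    # Import-time side effect moved behind an explicit call.
    register_local_catalog(LOCAL_CATALOG_PATH)

if __name__ == "__main__":
    setup()
    print(get_system_catalog_path())  # console output only when run directly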


5-5: Add spacing around assignment operator.

PEP 8 recommends spacing around the = operator for readability.

Apply this diff:

-LOCAL_CATALOG_PATH="local_catalog"
+LOCAL_CATALOG_PATH = "local_catalog"
scripts/teigaku_genzei/make_seed.sh (1)

1-4: Consider adding error handling and validation.

The script executes a Python command without checking for errors or validating inputs. Consider adding:

  • Check if make_seed.py exists
  • Validate input files exist
  • Exit on error with set -e
  • Check Python command exit status

Example improvements:

 #!/bin/sh
+set -e  # Exit on any error
 
+# Validate that required files exist
+if [ ! -f "./make_seed.py" ]; then
+    echo "Error: make_seed.py not found"
+    exit 1
+fi
+
+if [ ! -f "context/0024001-021_context.csv" ]; then
+    echo "Error: context file not found"
+    exit 1
+fi
+
+if [ ! -f "icl/icl.jsonl" ]; then
+    echo "Error: ICL file not found"
+    exit 1
+fi
+
 # python ./make_seed.py context/0024001-021_context.csv tmp/icl.jsonl tmp/seed_ja.jsonl
 python ./make_seed.py --context_file context/0024001-021_context.csv --icl_file icl/icl.jsonl --out_seed_file tmp/seed_ja.jsonl --join_method cartesian
scripts/teigaku_genzei/llm_as_judge_direct_rits_phi.py (1)

13-19: Consider removing or documenting the commented code.

The commented-out CrossProviderInferenceEngine alternative (Lines 13-19) appears to be leftover from development. If this alternative is intended as a reference for users who might want to use a different engine, consider adding a brief comment explaining when and why they might use it. Otherwise, remove it to reduce code clutter.

If documenting:

+        # Alternative: Use CrossProviderInferenceEngine for dynamic provider selection
         # inference_engine=CrossProviderInferenceEngine(

Or simply remove the commented block if it's no longer needed.

scripts/teigaku_genzei/llm_as_judge_direct_criteria_correctness2.py (1)

17-42: Consider removing commented code once the workaround is stable.

The extensively commented-out option names (Lines 19, 24, 29) and option_map entries (Lines 35-37) show the original semantic naming. While this provides context for the workaround, it adds significant clutter. If this numeric workaround is confirmed to work reliably, consider removing the commented code and keeping only a brief explanation in the header comment.

Alternatively, if you want to preserve the original intent for future reference, consolidate it into a single block comment at the top of the function rather than inline throughout the code.

scripts/teigaku_genzei/test_single_vllm.sh (1)

31-31: Add error handling for directory creation.

The mkdir command (Line 31) will fail if the directory already exists or if there are permission issues. Consider using mkdir -p to handle existing directories gracefully and add error checking.

Apply this diff:

-mkdir ${OUT_DIR}
+mkdir -p "${OUT_DIR}" || { echo "Failed to create output directory: ${OUT_DIR}"; exit 1; }

Also note the addition of quotes around the variable to handle paths with spaces.

scripts/teigaku_genzei/test_single.sh (1)

27-27: Add error handling for directory creation.

The mkdir command (Line 27) will fail if the directory already exists or if there are permission issues. Consider using mkdir -p to handle existing directories gracefully and add error checking.

Apply this diff:

-mkdir ${OUT_DIR}
+mkdir -p "${OUT_DIR}" || { echo "Failed to create output directory: ${OUT_DIR}"; exit 1; }

Also note the addition of quotes around the variable to handle paths with spaces.

scripts/teigaku_genzei/make_icl.sh (1)

1-3: Consider adding error handling and documentation.

This script is a simple wrapper but lacks error handling. If make_icl.py fails or doesn't exist, the script will silently continue. Additionally, the hardcoded paths may not be obvious to users.

Consider adding:

  1. Error handling for the Python script execution
  2. A brief comment explaining the script's purpose and the hardcoded paths

Example:

 #!/bin/sh
+
+# Generate ICL (In-Context Learning) data from QA, glossary, and context files
+# Note: Update the file paths below to match your data location

-python ./make_icl.py --short_context True --qa_file qa_table/0024004-072_01.csv --glossary_file qa_table/0024004-072_01_glossary.csv --context_file context/0024004-072_01_context.csv --out_icl_file icl/icl.jsonl
+python ./make_icl.py \
+    --short_context True \
+    --qa_file qa_table/0024004-072_01.csv \
+    --glossary_file qa_table/0024004-072_01_glossary.csv \
+    --context_file context/0024004-072_01_context.csv \
+    --out_icl_file icl/icl.jsonl || { echo "Failed to generate ICL data"; exit 1; }

The line breaks also improve readability.

scripts/teigaku_genzei/jsonl_util.py (1)

7-12: Remove commented code and useless return.

The commented line at Line 11 should be removed, and the explicit return at Line 12 is unnecessary in Python (functions return None implicitly).

Apply this diff:

 def write_jsonl_file(file_path: str, data_list: list[Any]):
-    with open(file_path, 'w') as file:
+    with open(file_path, 'w', encoding='utf-8') as file:
         for data in data_list:
             file.write(json.dumps(data, ensure_ascii=False) + '\n')
-            # json.dump(data, file, ensure_ascii=False)
-    return
scripts/teigaku_genzei/seed_jsonl_2_qna_yaml.py (1)

47-47: Remove useless return statement.

Python functions return None implicitly, so the explicit return is unnecessary.

Apply this diff:

         yaml.safe_dump(qna_template_dict, out_f, allow_unicode=True, encoding="utf8", default_flow_style=False, default_style="|", sort_keys=False)
-
-    return
scripts/teigaku_genzei/extract_icl_from_seed.py (2)

1-5: Remove unused pandas import.

The pandas module is imported but never used in this script.

Apply this diff:

 import sys
 from typing import cast, Any, Hashable, Sequence, Collection
-import pandas as pd
 
 import json_util

49-53: Simplify the dictionary construction.

As per coding guidelines, the dict comprehension can be replaced with the built-in dict() constructor for clarity.

Apply this diff:

-    unique_icl_dict = {
-        h: obj for (h, obj) in hash_list
-    }
+    unique_icl_dict = dict(hash_list)
scripts/teigaku_genzei/json_util.py (1)

20-33: Remove unreachable return statements.

Both JSONL I/O functions have unnecessary return statements that are either useless (line 25) or unreachable (line 33).

Apply this diff:

 def write_jsonl_file(file_path: str, data_list: list[Any]):
     with open(file_path, 'w') as file:
         for data in data_list:
             file.write(json.dumps(data, ensure_ascii=False) + '\n')
-            # json.dump(data, file, ensure_ascii=False)
-    return
 
 def read_jsonl_file(file_path: str) -> list[Any]:
     with open(file_path, 'r') as file:
         data_list = [
             json.loads(line) for line in file
         ]
         return data_list
-    return []
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)

26-61: Consider unpacking for list concatenation.

The compute method correctly augments base results with positional bias metrics. Line 47 could use unpacking for slightly cleaner list construction.

Optional refactor for line 47:

-            self.reduction_map = {"mean": [self.main_score] + self.ci_scores}
+            self.reduction_map = {"mean": [self.main_score, *self.ci_scores]}
scripts/teigaku_genzei/json_qa.py (1)

44-69: Simplify the list comprehension on line 55.

Line 55 uses an unnecessary comprehension to convert an iterator to a list. The subsequent slice [1:] can be applied directly to the list.

Apply this diff:

-    atail_pos_list = [i for i in title_pos_list][1:] + [len(text_list)]
+    atail_pos_list = list(title_pos_list)[1:] + [len(text_list)]

Alternatively, if title_pos_list is already a list (which it is, from line 52), you can simplify further:

-    atail_pos_list = [i for i in title_pos_list][1:] + [len(text_list)]
+    atail_pos_list = title_pos_list[1:] + [len(text_list)]
scripts/teigaku_genzei/make_context.py (1)

95-95: Remove useless return statement.

The return statement at the end of main() is unnecessary.

Apply this diff:

     out_df.to_csv(output_context_path, index=False, encoding="utf8")
-
-    return
scripts/teigaku_genzei/make_seed.py (2)

19-27: Make input/output arguments required to fail fast

Avoid passing None into I/O functions.

-    parser.add_argument('--' + ARG_INPUT_CONTEXT_FILE, type=str, default=None)
-    parser.add_argument('--' + ARG_INPUT_ICL_FILE, type=str, default=None)
-    parser.add_argument('--' + ARG_OUTPUT_SEED_FILE, type=str, default=None)
+    parser.add_argument('--' + ARG_INPUT_CONTEXT_FILE, type=str, required=True)
+    parser.add_argument('--' + ARG_INPUT_ICL_FILE, type=str, required=True)
+    parser.add_argument('--' + ARG_OUTPUT_SEED_FILE, type=str, required=True)

105-105: Drop debug print or gate behind a verbose flag

Printing columns on every run is noisy for CLI pipelines.

-    print(out_df.columns)
+    # print(out_df.columns)  # enable for debugging
scripts/teigaku_genzei/make_icl.py (4)

43-54: Remove unused qa_df param from generate_QA_combinations

The parameter is unused; simplify the signature.

-def generate_QA_combinations(row_sr: pd.Series, qa_df: pd.DataFrame, size: int)-> list[tuple[int, list[int]]]:
+def generate_QA_combinations(row_sr: pd.Series, size: int) -> list[tuple[int, list[int]]]:
     q_list = cast(list[int], json.loads(row_sr["qlist"]))
     if len(q_list) >= size:
         comb_triples = itertools.combinations(q_list, size)
         comb_triples_list = [(cast(int, row_sr.name), list(comb)) for comb in comb_triples]
         return comb_triples_list
-    elif len(q_list) > 0:
+    if len(q_list) > 0:
         comb_triples_list = [(cast(int, row_sr.name), (q_list + [q_list[0]] * (size - len(q_list))))]
         return comb_triples_list
-    else:
-        return []
+    return []

103-106: Update call site after signature change

Drop qa_df argument.

-    icl_qa_index_sr = ncontext_df.apply(lambda row: generate_QA_combinations(row, qa_df, NUM_ICL_QA_EXAMPLES), axis=1)
+    icl_qa_index_sr = ncontext_df.apply(lambda row: generate_QA_combinations(row, NUM_ICL_QA_EXAMPLES), axis=1)

91-96: Remove unused variable option

Assigned but not used.

-    option = args[ARG_OPTION]
+    # option = args[ARG_OPTION]

77-79: Verify fullwidth 'Ａ' in Answer prefix

Ensure the dataset actually uses "[Ａ]" (with the fullwidth Ａ). If some rows use ASCII "[A]", consider normalizing both.

Optionally:

-            "icl_response_" + str(i + 1): qa_df.loc[qa]["Answer"].removeprefix("[Ａ]").strip(),
+            # Strip either fullwidth or ASCII A within brackets
+            "icl_response_" + str(i + 1): (
+                qa_df.loc[qa]["Answer"]
+                .removeprefix("[Ａ]")
+                .removeprefix("[A]")
+                .strip()
+            ),
scripts/teigaku_genzei/test_qa.py (3)

348-349: Use sys.exit for predictable termination

exit() is intended for interactive use.

-        exit()
+        sys.exit(0)

370-372: Guard writing results when output path is None

Avoid crashing if the CLI omits --out_result_file.

-    write_results_csv(output_file, results)
+    if output_file:
+        write_results_csv(output_file, results)
+    else:
+        print("No out_result_file provided; skipping CSV write.")

98-99: Drop noisy print of full test DataFrame

Reduce stdout noise in batch runs.

-    print(data_df)
+    # print(data_df)  # enable for debugging
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e8a76fe and 9d1768c.

📒 Files selected for processing (103)
  • scripts/teigaku_genzei/.gitignore (1 hunks)
  • scripts/teigaku_genzei/README.md (1 hunks)
  • scripts/teigaku_genzei/catalog_util.py (1 hunks)
  • scripts/teigaku_genzei/context/.gitignore (1 hunks)
  • scripts/teigaku_genzei/context_util.py (1 hunks)
  • scripts/teigaku_genzei/docparser/.gitignore (1 hunks)
  • scripts/teigaku_genzei/extract_icl_from_seed.py (1 hunks)
  • scripts/teigaku_genzei/extract_qa.sh (1 hunks)
  • scripts/teigaku_genzei/icl/.gitignore (1 hunks)
  • scripts/teigaku_genzei/json_glossary.py (1 hunks)
  • scripts/teigaku_genzei/json_qa.py (1 hunks)
  • scripts/teigaku_genzei/json_util.py (1 hunks)
  • scripts/teigaku_genzei/jsonl_util.py (1 hunks)
  • scripts/teigaku_genzei/llm_as_judge_direct_criteria_correctness2.py (1 hunks)
  • scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1 hunks)
  • scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py (1 hunks)
  • scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (1 hunks)
  • scripts/teigaku_genzei/llm_as_judge_direct_rits_phi.py (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct/criteria/correctness_based_on_ground_truth2.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct/rits/llama4_maverick.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct/rits/phi_4.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/aws/deepseek_v3.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/aws/llama3_1_405b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/aws/llama3_3_70b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/aws/llama4_maverick.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/aws/llama4_scout.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/aws/mixtral_large.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/gpt_4_1.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/gpt_4_1_mini.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/gpt_4_1_nano.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/gpt_4o.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/llama3_3_70b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/llama4_maverick.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/llama4_scout.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/o1_mini.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/o1_preview.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/ollama/granite3_3_8b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/ollama/llama3_1_405b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/ollama/llama3_1_70b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/ollama/llama3_1_8b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/ollama/llama3_3_70b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/gpt_4_1.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/gpt_4_1_mini.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/gpt_4_1_nano.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/gpt_4o.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/o1_mini.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/o1_preview.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/replicate/gpt_4_1.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/replicate/llama4_maverick.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/replicate/llama4_scout.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/deepseek_v3.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/granite3_0_8b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/granite3_1_8b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/granite3_2_8b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/granite3_3_8b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/llama3_1_405b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/llama3_1_70b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/llama3_1_8b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/llama3_3_70b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/llama4_maverick.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/llama4_scout.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/mixtral8_7b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/mixtral_large.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/phi_4.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/deepseek_v3.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama3_1_405b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama3_1_70b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama3_1_8b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama3_3_70b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama4_maverick.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama4_scout.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/vertex_ai/gemini_2_5_flash.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/vertex_ai/gemmini_2_5_pro.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/granite3_0_8b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/granite3_2_8b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/granite3_3_8b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/llama3_1_405b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/llama3_1_70b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/llama3_1_8b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/llama3_3_70b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/mixtral8_7b.json (1 hunks)
  • scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/mixtral_large.json (1 hunks)
  • scripts/teigaku_genzei/make_context.py (1 hunks)
  • scripts/teigaku_genzei/make_context.sh (1 hunks)
  • scripts/teigaku_genzei/make_icl.py (1 hunks)
  • scripts/teigaku_genzei/make_icl.sh (1 hunks)
  • scripts/teigaku_genzei/make_seed.py (1 hunks)
  • scripts/teigaku_genzei/make_seed.sh (1 hunks)
  • scripts/teigaku_genzei/parser.sh (1 hunks)
  • scripts/teigaku_genzei/qa_table/.gitignore (1 hunks)
  • scripts/teigaku_genzei/qna_yaml_2_qa_csv.py (1 hunks)
  • scripts/teigaku_genzei/remove_newline_from_csv.py (1 hunks)
  • scripts/teigaku_genzei/requirements-dev.txt (1 hunks)
  • scripts/teigaku_genzei/requirements.txt (1 hunks)
  • scripts/teigaku_genzei/seed_jsonl_2_qna_yaml.py (1 hunks)
  • scripts/teigaku_genzei/source/.gitignore (1 hunks)
  • scripts/teigaku_genzei/test_all.sh (1 hunks)
  • scripts/teigaku_genzei/test_nencho_single.sh (1 hunks)
  • scripts/teigaku_genzei/test_qa.py (1 hunks)
  • scripts/teigaku_genzei/test_single.sh (1 hunks)
  • scripts/teigaku_genzei/test_single_pred_eval.sh (1 hunks)
  • scripts/teigaku_genzei/test_single_vllm.sh (1 hunks)
  • scripts/teigaku_genzei/url.txt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py (1)
scripts/teigaku_genzei/catalog_util.py (1)
  • get_system_catalog_path (16-20)
scripts/teigaku_genzei/json_glossary.py (1)
scripts/teigaku_genzei/json_util.py (1)
  • is_title (6-7)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (1)
  • RITSInferenceEngineFixed (17-24)
scripts/teigaku_genzei/make_context.py (2)
scripts/teigaku_genzei/context_util.py (2)
  • compose_context (8-12)
  • compose_glossary (14-19)
scripts/teigaku_genzei/json_util.py (1)
  • get_title_match (9-10)
scripts/teigaku_genzei/json_qa.py (1)
scripts/teigaku_genzei/json_util.py (3)
  • is_title (6-7)
  • is_qhead (12-13)
  • is_ahead (15-16)
scripts/teigaku_genzei/make_icl.py (1)
scripts/teigaku_genzei/context_util.py (1)
  • compose_context (8-12)
🪛 markdownlint-cli2 (0.18.1)
scripts/teigaku_genzei/README.md

7-7: Bare URL used

(MD034, no-bare-urls)


10-10: Bare URL used

(MD034, no-bare-urls)


11-11: Bare URL used

(MD034, no-bare-urls)


20-20: Bare URL used

(MD034, no-bare-urls)


21-21: Bare URL used

(MD034, no-bare-urls)


35-35: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


37-37: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


38-38: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


38-38: Bare URL used

(MD034, no-bare-urls)


39-39: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


40-40: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


41-41: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


42-42: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


47-47: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


48-48: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


50-50: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


51-51: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


52-52: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


53-53: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


58-58: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


66-66: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


79-79: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


81-81: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


82-82: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


82-82: Bare URL used

(MD034, no-bare-urls)


83-83: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


88-88: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


89-89: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


90-90: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


91-91: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


92-92: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


94-94: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


95-95: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


96-96: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


97-97: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


101-101: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


102-102: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


103-103: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


104-104: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


105-105: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


106-106: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


107-107: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


109-109: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


115-115: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


130-130: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


131-131: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


132-132: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


140-140: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


147-147: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


155-155: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


156-156: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


157-157: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


158-158: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


161-161: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


162-162: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


163-163: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


165-165: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


166-166: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


167-167: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


179-179: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


180-180: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


181-181: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


183-183: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


184-184: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


185-185: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


186-186: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


188-188: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


189-189: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


190-190: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


195-195: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


210-210: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


221-221: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


222-222: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


226-226: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Pylint (3.3.9)
scripts/teigaku_genzei/json_util.py

[refactor] 20-20: Useless return at end of function or method

(R1711)

scripts/teigaku_genzei/jsonl_util.py

[refactor] 7-7: Useless return at end of function or method

(R1711)

scripts/teigaku_genzei/extract_icl_from_seed.py

[refactor] 49-51: Unnecessary use of a comprehension, use dict(hash_list) instead.

(R1721)

scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py

[refactor] 9-9: Too few public methods (1/2)

(R0903)

scripts/teigaku_genzei/make_context.py

[refactor] 56-56: Useless return at end of function or method

(R1711)

scripts/teigaku_genzei/seed_jsonl_2_qna_yaml.py

[refactor] 24-24: Useless return at end of function or method

(R1711)

scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py

[error] 24-24: Bad first argument 'RITSInferenceEngine' given to super()

(E1003)


[refactor] 17-17: Too few public methods (1/2)

(R0903)

scripts/teigaku_genzei/test_qa.py

[refactor] 109-109: Useless return at end of function or method

(R1711)


[refactor] 114-114: Useless return at end of function or method

(R1711)


[refactor] 130-130: Useless return at end of function or method

(R1711)


[refactor] 171-171: Useless return at end of function or method

(R1711)


[refactor] 348-348: Consider using 'sys.exit' instead

(R1722)


[refactor] 210-210: Useless return at end of function or method

(R1711)

scripts/teigaku_genzei/json_qa.py

[refactor] 55-55: Unnecessary use of a comprehension, use list(title_pos_list) instead.

(R1721)

scripts/teigaku_genzei/make_icl.py

[refactor] 45-53: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)

🪛 Ruff (0.13.3)
scripts/teigaku_genzei/context_util.py

5-5: String contains ambiguous Ａ (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?

(RUF001)


8-8: Unused function argument: section

(ARG001)

scripts/teigaku_genzei/json_util.py

7-7: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?

(RUF001)


7-7: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?

(RUF001)


7-7: String contains ambiguous － (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)?

(RUF001)


7-7: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?

(RUF001)


7-7: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?

(RUF001)


10-10: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?

(RUF001)


10-10: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?

(RUF001)


10-10: String contains ambiguous － (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)?

(RUF001)


10-10: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?

(RUF001)


10-10: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?

(RUF001)


16-16: String contains ambiguous Ａ (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?

(RUF001)

scripts/teigaku_genzei/json_glossary.py

37-37: String contains ambiguous （ (FULLWIDTH LEFT PARENTHESIS). Did you mean ( (LEFT PARENTHESIS)?

(RUF001)


37-37: String contains ambiguous ） (FULLWIDTH RIGHT PARENTHESIS). Did you mean ) (RIGHT PARENTHESIS)?

(RUF001)

scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py

47-47: Consider [self.main_score, *self.ci_scores] instead of concatenation

Replace with [self.main_score, *self.ci_scores]

(RUF005)

scripts/teigaku_genzei/catalog_util.py

9-9: Unused function argument: artifact_name

(ARG001)

scripts/teigaku_genzei/qna_yaml_2_qa_csv.py

16-16: Probable use of unsafe loader FullLoader with yaml.load. Allows instantiation of arbitrary objects. Consider yaml.safe_load.

(S506)

scripts/teigaku_genzei/seed_jsonl_2_qna_yaml.py

62-62: String contains ambiguous ？ (FULLWIDTH QUESTION MARK). Did you mean ? (QUESTION MARK)?

(RUF001)


64-64: String contains ambiguous （ (FULLWIDTH LEFT PARENTHESIS). Did you mean ( (LEFT PARENTHESIS)?

(RUF001)


64-64: String contains ambiguous ） (FULLWIDTH RIGHT PARENTHESIS). Did you mean ) (RIGHT PARENTHESIS)?

(RUF001)


64-64: String contains ambiguous （ (FULLWIDTH LEFT PARENTHESIS). Did you mean ( (LEFT PARENTHESIS)?

(RUF001)


64-64: String contains ambiguous ） (FULLWIDTH RIGHT PARENTHESIS). Did you mean ) (RIGHT PARENTHESIS)?

(RUF001)

scripts/teigaku_genzei/test_qa.py

89-89: Unused function argument: test_language

(ARG001)


102-102: Unused function argument: input_file

(ARG001)


151-151: Function definition does not bind loop variable option_map

(B023)


155-155: Local variable nbiased_score_sr_list is assigned to but never used

Remove assignment to unused variable nbiased_score_sr_list

(F841)


260-264: Consider iterable unpacking instead of concatenation

(RUF005)

scripts/teigaku_genzei/json_qa.py

26-26: String contains ambiguous （ (FULLWIDTH LEFT PARENTHESIS). Did you mean ( (LEFT PARENTHESIS)?

(RUF001)


26-26: String contains ambiguous ） (FULLWIDTH RIGHT PARENTHESIS). Did you mean ) (RIGHT PARENTHESIS)?

(RUF001)


37-37: Comment contains ambiguous Ａ (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?

(RUF003)


39-39: Comment contains ambiguous １ (FULLWIDTH DIGIT ONE). Did you mean 1 (DIGIT ONE)?

(RUF003)


50-50: Comment contains ambiguous － (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)?

(RUF003)


55-55: Consider [*[i for i in title_pos_list][1:], len(text_list)] instead of concatenation

Replace with [*[i for i in title_pos_list][1:], len(text_list)]

(RUF005)

scripts/teigaku_genzei/make_icl.py

43-43: Unused function argument: qa_df

(ARG001)


78-78: String contains ambiguous Ａ (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?

(RUF001)


91-91: Local variable option is assigned to but never used

Remove assignment to unused variable option

(F841)

🔇 Additional comments (35)
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/ollama/llama3_1_405b.json (1)

3-10: Double-check Ollama model availability.

llama-3-1-405b-instruct is not part of the standard Ollama model catalog (Ollama focuses on deployable on-device checkpoints ≤70B). Unless you maintain a custom Ollama endpoint that proxies this 405B variant, inference will fail immediately. Please confirm the model actually exists on your Ollama host or adjust to an available checkpoint.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/mixtral_large.json (1)

1-13: Configuration looks solid.

Schema and values line up with the existing llm_judge_direct_positional_bias pattern. No issues spotted.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/o1_preview.json (1)

8-9: No changes needed: the provider identifier open-ai is already the canonical ID used across all direct_positional_bias configs.

Likely an incorrect or invalid review comment.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/gpt_4o.json (1)

1-13: LGTM!

The OpenAI GPT-4o configuration is correctly structured with appropriate deterministic parameters (seed=42, temperature=0) for evaluation consistency. The model identifier follows OpenAI's versioning convention.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/granite3_2_8b.json (1)

1-13: LGTM!

The WatsonX Granite 3.2 8B configuration follows the established pattern with consistent evaluation parameters. Configuration is appropriate for the target model.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/llama3_1_70b.json (1)

1-13: LGTM!

The WatsonX Llama 3.1 70B configuration is correctly structured and consistent with the established patterns for direct positional bias evaluation.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/deepseek_v3.json (1)

1-13: LGTM!

The Together AI DeepSeek V3 configuration correctly uses the namespaced model identifier format and maintains consistency with other provider configurations.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/granite3_1_8b.json (1)

1-13: LGTM!

The RITS Granite 3.1 8B configuration is consistent with the established pattern. The RITS provider integration aligns with the PR's stated testing approach using a local inference setup.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/llama3_1_8b.json (1)

1-13: LGTM!

The WatsonX Llama 3.1 8B configuration is correctly structured and consistent with the other provider configurations in this PR.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/llama3_3_70b.json (1)

1-13: LGTM!

The configuration follows the established pattern for LLM judge direct positional bias setups. The inference parameters (max_tokens: 1024, seed: 42, temperature: 0) are consistent with other provider configurations in this PR, ensuring reproducible evaluations.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/llama3_1_405b.json (1)

1-13: LGTM!

The configuration correctly defines the Llama 3.1 405B model for Watsonx provider with consistent inference parameters.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/gpt_4_1_mini.json (1)

9-11: Verify Azure endpoint is appropriate for example code.

The hardcoded Azure OpenAI endpoint (eteopenai.azure-api.net) appears environment-specific. For a reusable example, consider:

  • Documenting that users must replace this URL with their own Azure deployment
  • Using environment variables or a configuration file for the endpoint
  • Providing a template or placeholder value

Is this endpoint intended as a placeholder, or is it a shared resource for testing?
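
If it is environment-specific, one hedged option is to substitute the endpoint at setup time; the environment variable name below is an assumption, and the api_base key is found by recursive search since the exact JSON nesting isn't shown here.

import json
import os
from pathlib import Path

def replace_api_base(node, new_base):
    # Walk the parsed JSON and overwrite any "api_base" value.
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "api_base":
                node[key] = new_base
            else:
                replace_api_base(value, new_base)
    elif isinstance(node, list):
        for item in node:
            replace_api_base(item, new_base)

cfg_path = Path("local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/gpt_4_1_mini.json")
api_base = os.environ.get("AZURE_OPENAI_API_BASE")  # assumed variable name
if api_base:
    cfg = json.loads(cfg_path.read_text(encoding="utf-8"))
    replace_api_base(cfg, api_base)
    cfg_path.write_text(json.dumps(cfg, ensure_ascii=False, indent=2), encoding="utf-8")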

scripts/teigaku_genzei/.gitignore (1)

1-170: LGTM!

The .gitignore provides comprehensive coverage for Python projects, including Unitxt-specific artifacts and project-specific temporary directories. The organization with comments makes it maintainable.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/granite3_0_8b.json (1)

1-13: LGTM!

The RITS provider configuration for Granite 3.0 8B model is consistent with the other judge configurations in this PR.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/o1_mini.json (1)

8-8: Confirm "provider" value
All direct_positional_bias JSONs use "open-ai". Verify the inference engine expects this; otherwise rename to "openai".

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/aws/llama3_1_405b.json (1)

1-13: LGTM!

The configuration is well-structured with appropriate settings for deterministic evaluation (temperature: 0, seed: 42).

scripts/teigaku_genzei/context_util.py (1)

4-6: Fullwidth character is intentional for Japanese text processing.

The fullwidth Latin capital letter "Ａ" in the removeprefix call is correct for processing Japanese documents, where such characters are commonly used. The static analysis warning can be safely ignored.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/llama3_1_405b.json (1)

1-13: LGTM!

The RITS provider configuration follows the same deterministic evaluation pattern as other providers.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/granite3_2_8b.json (1)

1-13: LGTM!

The Granite 3.2 8B configuration is consistent with other model configurations in this catalog.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/granite3_3_8b.json (1)

1-13: LGTM!

The Granite 3.3 8B configuration completes the Granite 3.x series and follows the established pattern.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct/rits/llama4_maverick.json (1)

9-9: Document that the base_url requires customization.

The base_url points to an internal IBM infrastructure endpoint (inference-3scale-apicast-production.apps.rits.fmaas.res.ibm.com) which will not be accessible to external users of this example. Since this is an open-source example, consider adding a README note or comment indicating that users must update this URL to point to their own model deployment.

Consider adding a comment in the README or near this configuration explaining that users need to deploy their own model endpoint and update this URL accordingly.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/gpt_4_1_nano.json (1)

9-11: Document that the api_base requires customization for external users.

The api_base URL (Line 10) points to a specific Azure OpenAI deployment (eteopenai.azure-api.net) that external users won't have access to. Since this is an open-source example, users will need to replace this with their own Azure OpenAI endpoint.

Consider adding documentation (in a README or inline comment) indicating that users must:

  1. Set up their own Azure OpenAI deployment
  2. Update the api_base URL to match their deployment
  3. Configure appropriate authentication credentials
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct/rits/phi_4.json (1)

1-12: LGTM!

The JSON configuration is well-formed and the __type__ correctly matches the directory structure (direct not direct_positional_bias). The inference engine settings are appropriate for deterministic evaluation.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/o1_mini.json (1)

1-16: LGTM!

The configuration is well-structured and follows the pattern established by other Azure OpenAI configurations in this PR. The deterministic settings (seed=42, temperature=0) are appropriate for reproducible evaluation.

scripts/teigaku_genzei/test_all.sh (1)

20-20: Document or parameterize the hardcoded checkpoint path.

The hardcoded path /proj/instructlab/checkpoints/ may not exist in all environments. Consider documenting this requirement in the README or making it configurable via an environment variable.

Verify that this path requirement is documented in the README for users who want to run this script.
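
One lightweight option, sketched here under the assumption of a CHECKPOINT_DIR variable (a name this PR does not define):

import os
import sys

# Hypothetical override: keep the documented default but let users point the
# tests at their own checkpoint location.
checkpoint_dir = os.environ.get("CHECKPOINT_DIR", "/proj/instructlab/checkpoints/")
if not os.path.isdir(checkpoint_dir):
    sys.exit(f"Checkpoint directory not found: {checkpoint_dir}; set CHECKPOINT_DIR.")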

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/llama3_1_8b.json (1)

1-13: LGTM!

The configuration follows the established pattern for RITS-based llm_judge_direct_positional_bias metrics and uses appropriate deterministic settings for reproducible evaluation.

scripts/teigaku_genzei/seed_jsonl_2_qna_yaml.py (1)

36-42: Document or remove placeholder values.

The document metadata contains placeholder values ("Teigaku Genzei example.", "https://example.com/example/taxonomy", all-zeros commit hash, "example.pdf"). If these are meant to be replaced by users, document this in the README. If they're examples for demonstration, consider adding a comment.

Verify that the README documents how users should update these placeholder values with their actual document metadata.
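
If the values are meant to be user-supplied, one hedged sketch of surfacing them on the command line (flag names are illustrative, not the script's current interface):

import argparse

# Hypothetical CLI: let users pass real document metadata instead of editing
# the placeholder constants in the script body.
parser = argparse.ArgumentParser()
parser.add_argument("--document_outline", default="Teigaku Genzei example.")
parser.add_argument("--taxonomy_url", default="https://example.com/example/taxonomy")
parser.add_argument("--commit_hash", default="0" * 40)
parser.add_argument("--pdf_name", default="example.pdf")
args = parser.parse_args()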

scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (1)

45-45: Remove the TODO; retain evaluator_name="LLAMA3_4_MAVERICK" to match existing conventions.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct/criteria/correctness_based_on_ground_truth2.json (1)

1-27: LGTM!

The criteria configuration is well-structured with clear, differentiated descriptions for each quality level. The three-point rubric provides appropriate granularity for factual correctness assessment, and the option_map scoring (3→1.0, 2→0.5, 1→0.0) aligns logically with the quality tiers.

scripts/teigaku_genzei/test_nencho_single.sh (1)

29-47: LGTM!

The test execution loop correctly constructs paths and invokes test_qa.py with all necessary parameters. The commented-out "before" evaluation block provides a useful pattern for baseline comparisons.

scripts/teigaku_genzei/extract_icl_from_seed.py (1)

7-14: LGTM!

The helper functions are straightforward and correctly implement ICL extraction and hash computation for deduplication.
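
For readers skimming the file, the deduplication pattern amounts to the following sketch (field handling is illustrative; the script's own helpers may differ in detail):

import hashlib
import json

def icl_hash(icl: dict) -> str:
    # Serialize deterministically so identical ICL examples hash to the same key.
    payload = json.dumps(icl, ensure_ascii=False, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def dedup_icls(icl_list: list[dict]) -> list[dict]:
    # Later duplicates overwrite earlier ones; dicts preserve insertion order.
    return list({icl_hash(icl): icl for icl in icl_list}.values())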

scripts/teigaku_genzei/json_util.py (1)

6-16: LGTM!

The text detection functions correctly use full-width Japanese characters (0-9, -, ー, A) alongside their ASCII counterparts to match Japanese document formatting patterns. The RUF001 warnings from Ruff are false positives in this context—full-width characters are intentional and necessary for the target use case.
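
A minimal sketch of the matching idea (the character classes are illustrative, not the exact expressions in json_util.py):

import re

# Accept both ASCII and full-width digits/hyphens, since Japanese documents
# freely mix forms such as "問1" and "問１".
TITLE_RE = re.compile(r"[問Qq]\s*[0-9０-９][-ー－0-9０-９]*")

def looks_like_question_title(text: str) -> bool:
    return TITLE_RE.search(text) is not None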

scripts/teigaku_genzei/json_glossary.py (1)

20-53: LGTM!

The glossary extraction logic correctly identifies section headers with Japanese bracket patterns (【】, （）) and retrieves associated descriptions. The full-width character warnings from Ruff are false positives—these characters are intentional for Japanese text processing.

scripts/teigaku_genzei/json_qa.py (1)

20-42: LGTM!

The get_text_block function correctly filters and aggregates body text between specified positions, excluding page furniture and section titles. The comments provide helpful context for the label and content_layer attributes. Full-width character warnings in comments are false positives—these are intentional examples of Japanese characters.
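
The underlying idea, as a hedged sketch (the label set below is illustrative; json_qa.py's actual filters may differ):

def get_body_text_between(texts: list[dict], start: int, end: int) -> str:
    # `texts` is assumed to be a Docling-style list of text items with a
    # `label` field; skip page furniture and headings, keep body text.
    skip_labels = {"page_header", "page_footer", "section_header"}
    return "".join(
        item["text"] for item in texts[start:end]
        if item.get("label") not in skip_labels
    )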

scripts/teigaku_genzei/make_context.py (1)

27-54: LGTM!

The configuration parsing with argparse and section number extraction logic are both correctly implemented. The use of json_util.get_title_match for robust pattern matching is appropriate.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 17

♻️ Duplicate comments (41)
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/llama4_scout.json (1)

9-12: Align evaluator_name with model version (Llama 4).

Model is "llama-4-scout" but evaluator_name is "LLAMA3_4_SCOUT". Update to avoid confusion in results aggregation.

-    "evaluator_name": "LLAMA3_4_SCOUT",
+    "evaluator_name": "LLAMA4_SCOUT",

Run to find other mismatches:

#!/bin/bash
rg -nP '"model"\s*:\s*"llama-4-[^"]+"' -C1 scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits |
  sed -n 's/.*"evaluator_name": "\([^"]*\)".*/\1/p' | sort -u
rg -nP '"evaluator_name"\s*:\s*"LLAMA3_4_[^"]+"' scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/gpt_4o.json (1)

9-11: Hard-coded Azure endpoint: remove or parameterize via environment.

Storing the full api_base (incl. preview api-version) in repo leaks internal endpoint and hurts portability. Prefer env-driven config or omit credentials so the engine reads AZURE_OPENAI_ENDPOINT/AZURE_OPENAI_API_KEY.

Option A — rely on environment (preferred): drop the credentials block entirely and let the engine read AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY:

     "provider": "azure",
-    "credentials": {
-        "api_base": "https://eteopenai.azure-api.net/openai/deployments/gpt-4o-2024-08-06/chat/completions?api-version=2024-08-01-preview"
-    },

Option B — parameterize:

-    "credentials": {
-        "api_base": "https://eteopenai.azure-api.net/openai/deployments/gpt-4o-2024-08-06/chat/completions?api-version=2024-08-01-preview"
-    },
+    "credentials": {
+        "api_base": "${AZURE_OPENAI_ENDPOINT}"
+    },

Also avoid pinning a preview api-version in code; pass it via env if needed. Based on learnings

To check for other hard-coded endpoints/keys:

#!/bin/bash
# Find hard-coded Azure/OpenAI endpoints and potential keys in repo
rg -n -C2 -g '!**/node_modules/**' -g '!**/dist/**' -g '!**/build/**' -e 'azure-api\.net|openai\.com' -e 'api_key|apiKey|AZURE_OPENAI|OPENAI_API_KEY'
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/vertex_ai/gemmini_2_5_pro.json (1)

1-13: Fix model/evaluator naming and file typo; align to Gemini.

Filename is “gemmini…”, model is “gemma-2-5-pro”, and evaluator_name is “GEMMA…”. These mismatches will break catalog lookups and Vertex model resolution. Align all to Gemini 2.5 Pro and rename the file accordingly. Also verify whether the provider key should be "vertex_ai" (to match directory) or "vertex-ai" (current value).

Apply this diff to fix the JSON identifiers:

@@
-        "provider": "vertex-ai",
-        "model": "gemma-2-5-pro"
+        "provider": "vertex-ai",
+        "model": "gemini-2-5-pro"
@@
-    "evaluator_name": "GEMMA_2_5_PRO",
+    "evaluator_name": "GEMINI_2_5_PRO",

Rename the file to match the target model:

git mv scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/vertex_ai/gemmini_2_5_pro.json scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/vertex_ai/gemini_2_5_pro.json

To confirm provider key conventions and catch any other inconsistencies, run:

#!/bin/bash
# Inspect provider keys and Gemini configs for consistency
echo 'Provider key usage across local_catalog:'
rg -n -C1 '"provider"\s*:\s*"' scripts/teigaku_genzei/local_catalog

echo -e '\nGemini-related configs for cross-checking model/evaluator naming:'
rg -n -C2 -i 'gemini|gemma' scripts/teigaku_genzei/local_catalog
scripts/teigaku_genzei/remove_newline_from_csv.py (2)

4-5: Add argument validation and error handling.

The past review comment correctly identified that direct sys.argv access without validation will cause cryptic errors (IndexError) if invoked with wrong arguments or if the input file is missing.


7-13: Add error handling and documentation.

The past review comment correctly identified missing error handling for I/O failures, malformed CSV, and encoding issues, plus the lack of documentation explaining the script's purpose.
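
Putting both past suggestions together, a minimal hardened sketch (usage text and messages are illustrative):

import sys

import pandas as pd

def main() -> None:
    if len(sys.argv) != 3:
        sys.exit("Usage: remove_newline_from_csv.py <input.csv> <output.csv>")
    input_file, output_file = sys.argv[1], sys.argv[2]
    try:
        df = pd.read_csv(input_file, encoding="utf-8")
    except FileNotFoundError:
        sys.exit(f"Input file not found: {input_file}")
    except pd.errors.ParserError as e:
        sys.exit(f"Malformed CSV in {input_file}: {e}")
    # Replace literal newlines in string columns only.
    for col in df.select_dtypes(include=["object"]).columns:
        df[col] = df[col].str.replace("\n", " ", regex=False)
    df.to_csv(output_file, encoding="utf-8", index=False)

if __name__ == "__main__":
    main()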

scripts/teigaku_genzei/llm_as_judge_direct_criteria_correctness2.py (1)

6-10: Internal IBM GitHub reference is inaccessible to external users.

The comment on line 10 references an internal IBM GitHub issue that external users cannot access. This was previously flagged in an earlier review.

Please either:

  1. Remove the specific URL and keep a general description, or
  2. Create a public issue in the opendatahub-io repository and reference that instead.

For example:

-# Source
-# https://www.unitxt.ai/en/latest/catalog/catalog.metrics.llm_as_judge.direct.criteria.correctness_based_on_ground_truth.html
-# This is a workaround to avoid the following issue.
-# "correctness_base_on_ground_truth" post-process bug · Issue #7 · RYOKAWA/teigaku_genzei
-# https://github.ibm.com/RYOKAWA/teigaku_genzei/issues/7
+# Based on Unitxt's correctness_based_on_ground_truth criterion:
+# https://www.unitxt.ai/en/latest/catalog/catalog.metrics.llm_as_judge.direct.criteria.correctness_based_on_ground_truth.html
+# 
+# Note: Using numeric option names ("3", "2", "1") instead of semantic names
+# ("correct", "incomplete", "wrong") to work around a post-processing issue
+# with the original criterion.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/gpt_4_1_nano.json (1)

8-9: Model name correction already flagged.

This issue has been identified in previous review comments. The model name should be "gpt-4.1-nano" not "gpt-4-1-nano" to match OpenAI's GPT-4.1 naming convention.

scripts/teigaku_genzei/parser.sh (1)

1-6: Add error handling and input validation.

The script lacks robustness—no set -e, no validation of paths, and unquoted variables. These issues were previously flagged with a comprehensive fix.

scripts/teigaku_genzei/qna_yaml_2_qa_csv.py (2)

16-16: Replace unsafe yaml.FullLoader with yaml.safe_load.

Using FullLoader allows arbitrary object instantiation, which is a security risk. The past review already provided the fix.

Based on static analysis (Ruff S506).


11-28: Add error handling and input validation.

The script lacks robustness for production use—no argument validation, no I/O error handling, no YAML structure validation. The past review already provided a comprehensive fix.
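
Combining the two points above, a minimal sketch of safe, validated loading (the seed_examples key is an assumption about the qna.yaml schema):

import sys

import yaml

def load_qna(path: str) -> dict:
    # safe_load avoids arbitrary object construction (the Ruff S506 finding).
    try:
        with open(path, encoding="utf-8") as f:
            data = yaml.safe_load(f)
    except (OSError, yaml.YAMLError) as e:
        sys.exit(f"Failed to read {path}: {e}")
    if not isinstance(data, dict) or "seed_examples" not in data:
        sys.exit(f"Unexpected qna.yaml structure in {path}")
    return data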

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/aws/mixtral_large.json (1)

8-9: Fix the AWS Bedrock model identifier.

As previously noted, "mistral-large-instruct" is not a valid AWS Bedrock model identifier. AWS Bedrock requires model IDs in the format mistral.mixtral-<variant>-v0:1 (e.g., mistral.mixtral-8x7b-instruct-v0:1). This configuration will fail when invoked.

Apply the fix suggested in the previous review to use the correct Bedrock model identifier format.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/mixtral_large.json (1)

8-12: Model identifier mismatches “mixtral” evaluator.

Line 9 sets model to mistral-large-instruct while evaluator_name/file path indicate Mixtral.

Change model to the correct Watsonx Mixtral SKU (e.g., a mixtral-…-instruct identifier) to avoid evaluating the wrong model.

-        "model": "mistral-large-instruct"
+        "model": "mixtral-8x7b-instruct"  // adjust to exact Watsonx SKU in your env

Please confirm the exact Watsonx Mixtral model string supported by your account and update accordingly.

scripts/teigaku_genzei/test_single_vllm.sh (1)

3-7: Correct usage text to this script’s name.

Both lines reference other scripts; fix to avoid confusion.

-    echo "./test_single.sh {model_path}"
-    echo "e.g., ./test_all.sh path_to/granite-3.3-8b-instruct-teigaku-genzei-interp"
+    echo "./test_single_vllm.sh {model_path}"
+    echo "e.g., ./test_single_vllm.sh path_to/granite-3.3-8b-instruct-teigaku-genzei-interp"
scripts/teigaku_genzei/json_glossary.py (1)

15-18: This duplication has already been identified in a previous review.

scripts/teigaku_genzei/extract_qa.sh (1)

5-14: The missing directory creation has already been identified in a previous review.

scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (1)

24-24: This super() call has been flagged previously, but the syntax is actually correct for the intended behavior of calling the grandparent's prepare_engine() method (skipping RITSInferenceEngine's own implementation). The two-argument form super(RITSInferenceEngine, self) explicitly invokes the parent class of RITSInferenceEngine, which is the documented approach for this use case.

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/replicate/gpt_4_1.json (1)

4-10: Invalid provider/model pairing (will fail at runtime).

Replicate does not serve OpenAI’s gpt-4-1. Use a Replicate-hosted model ID or switch provider to openai and relocate the file.

Two options:

  • Keep path and change model to a valid Replicate model ID (verify in Replicate registry).
  • Or change provider to "openai" here and move the file under the openai/ directory:
-        "provider": "replicate",
-        "model": "gpt-4-1"
+        "provider": "openai",
+        "model": "gpt-4.1"
scripts/teigaku_genzei/make_context.sh (1)

5-11: Create output directory before writing.

context/ may not exist; first write will fail. Create it before the loop.

Apply:

 DATA_PREFIX_LIST="0024001-021 0024004-072_01"

-for f in $DATA_PREFIX_LIST; do
+mkdir -p context
+for f in $DATA_PREFIX_LIST; do
     python ./make_context.py \
         --glossary_option appendix \
         --qa_file qa_table/${f}.csv \
         --glossary_file qa_table/${f}_glossary.csv \
         --out_context_file context/${f}_context.csv
 done
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/replicate/llama4_maverick.json (1)

11-12: Fix evaluator_name to match model

Rename to LLAMA4_MAVERICK to align with llama-4-maverick.

Apply this diff:

-    "evaluator_name": "LLAMA3_4_MAVERICK",
+    "evaluator_name": "LLAMA4_MAVERICK",
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct/rits/llama4_maverick.json (1)

8-11: Verify model name and evaluator name consistency.

The model_name references "llama-4-maverick" (Line 8) but the evaluator_name is "LLAMA3_4_MAVERICK" (Line 11), which contains "3" instead of "4". This inconsistency was previously flagged and should be addressed.

scripts/teigaku_genzei/test_single.sh (1)

4-5: Fix the usage message to reference the correct script name.

Line 5 shows ./test_all.sh in the example, but this script is named test_single.sh. This was previously flagged and should be corrected.

scripts/teigaku_genzei/test_single_pred_eval.sh (1)

36-36: Use mkdir -p to avoid repeat-run failures.

Running the script twice will fail with mkdir: File exists because Line 36 lacks the -p flag. This was previously flagged and should be corrected.

scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py (1)

31-36: Add explicit encoding and error handling.

File operations should specify encoding='utf-8' for reliability with JSON that may contain non-ASCII characters (Japanese text in this case). Additionally, error handling for file I/O and JSON parsing would improve robustness. This was previously flagged and should be addressed.
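
The pattern being asked for, as a short sketch (function name and indent settings are illustrative):

import json

def rewrite_json(src: str, dst: str) -> None:
    # Explicit UTF-8 keeps Japanese text intact regardless of the locale's
    # default encoding; ensure_ascii=False preserves the characters as-is.
    with open(src, encoding="utf-8") as f:
        data = json.load(f)
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=4)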

scripts/teigaku_genzei/test_all.sh (1)

23-23: Use mkdir -p to prevent errors.

The mkdir command will fail if the directory already exists. This was previously flagged and should be corrected.

scripts/teigaku_genzei/seed_jsonl_2_qna_yaml.py (2)

47-47: Remove useless return statement.

The explicit return at the end of main() serves no purpose.

Apply this diff:

-    return
-
 if __name__ == "__main__":

24-26: Add argument validation.

The script accesses sys.argv[1] and sys.argv[2] without validation, which will raise IndexError if arguments are missing.

Apply this diff:

 def main() -> None:
+    if len(sys.argv) < 3:
+        print("Usage: seed_jsonl_2_qna_yaml.py <input_seed.jsonl> <output_qna.yaml>", file=sys.stderr)
+        sys.exit(1)
+    
     input_file = sys.argv[1]  # seed.jsonl
     output_file = sys.argv[2]  # qna.yaml
scripts/teigaku_genzei/jsonl_util.py (2)

7-12: Add explicit encoding and remove useless return.

File operations should specify encoding='utf-8' for portability. The return statement at Line 12 serves no purpose and the commented code at Line 11 should be removed.

Apply this diff:

 def write_jsonl_file(file_path: str, data_list: list[Any]):
-    with open(file_path, 'w') as file:
+    with open(file_path, 'w', encoding='utf-8') as file:
         for data in data_list:
             file.write(json.dumps(data, ensure_ascii=False) + '\n')
-            # json.dump(data, file, ensure_ascii=False)
-    return

14-20: Add explicit encoding and remove unreachable code.

The return statement at Line 20 is unreachable. Additionally, file operations should specify encoding='utf-8'.

Apply this diff:

 def read_jsonl_file(file_path: str) -> list[Any]:
-    with open(file_path, 'r') as file:
+    with open(file_path, 'r', encoding='utf-8') as file:
         data_list = [
             json.loads(line) for line in file
         ]
         return data_list
-    return []
scripts/teigaku_genzei/test_nencho_single.sh (2)

3-7: Fix the script name in the usage example.

Line 5 references test_all.sh but the script is named test_single.sh.

Apply this diff:

 if [ -z "$1" ]; then
     echo "./test_single.sh {model_path}"
-    echo "e.g., ./test_all.sh path_to/granite-3.3-8b-instruct-teigaku-genzei-interp"
+    echo "e.g., ./test_single.sh path_to/granite-3.3-8b-instruct-teigaku-genzei-interp"
     exit 1
 fi

27-27: Use mkdir -p and quote variables.

The script will fail on subsequent runs if the directory exists. Use -p for idempotency and quote the variable to handle paths with spaces.

Apply this diff:

-mkdir ${OUT_DIR}
+mkdir -p "${OUT_DIR}"
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)

79-90: Remove redundant base_url construction.

Lines 85-86 manually construct the base_url with /v1 suffix, duplicating logic that RITSInferenceEngineFixed.prepare_engine() already implements. This creates a maintenance risk.

Apply this diff to let RITSInferenceEngineFixed handle the URL construction:

     laaj_direct_rits_llama4_maverick = LLMJudgeDirectPositionalBias(
         inference_engine=RITSInferenceEngineFixed(
             max_tokens=1024,
             seed=42,
             temperature=0,
             model_name=llm_as_judge_direct_rits_llama4.MODEL_NAME,
-            base_url=RITSInferenceEngine.get_base_url_from_model_name(llm_as_judge_direct_rits_llama4.ENDPOINT_NAME) + "/v1",
         ),
         evaluator_name="LLAMA3_4_MAVERICK",
         generate_summaries=False,
     )
scripts/teigaku_genzei/context_util.py (1)

8-12: Remove unused section parameter.

The section parameter is declared but never used in the function body.

Apply this diff:

-def compose_context(section: int, df: pd.DataFrame) -> str:
+def compose_context(df: pd.DataFrame) -> str:

Ensure all call sites are updated accordingly.

scripts/teigaku_genzei/catalog_util.py (2)

9-14: Critical: Function ignores its parameter.

The function accepts an artifact_name parameter but always checks the hardcoded artifact "metrics.llm_as_judge.direct.rits.phi_4" at Line 11, making the parameter useless.

Apply this diff:

 def is_artifact_in_catalog(artifact_name: str, catalog_path: str=LOCAL_CATALOG_PATH):
     try:
-        _ = get_from_catalog("metrics.llm_as_judge.direct.rits.phi_4", catalog_path=catalog_path)
+        _ = get_from_catalog(artifact_name, catalog_path=catalog_path)
     except UnitxtArtifactNotFoundError:
         return False
     return True

16-20: Add error handling for empty catalog paths.

If get_local_catalogs_paths() returns an empty list, Line 20 will raise an IndexError.

Apply this diff:

 def get_system_catalog_path() -> str:
     paths = get_local_catalogs_paths()
     print(paths)
-
+    if not paths:
+        raise RuntimeError("No local catalog paths found")
     return paths[-1]
scripts/teigaku_genzei/make_context.py (1)

79-87: Critical: Incorrect reuse of a_qindex for qlist_list default.

Line 87 reuses a_qindex (which is [-1]) as the default for qlist_list, but qlist expects JSON-encoded arrays of question indices, not a raw integer list. This causes incorrect data in the appendix row when option == OPT_GLOSSARY_APPENDIX.

Apply this diff:

     (header, appendix, a_section, a_qindex) = (
         ("", [glossary_str], [-1], [-1]) if option == OPT_GLOSSARY_APPENDIX else 
         (glossary_str, [], [], []) if option == OPT_GLOSSARY_HEADER else 
         ("", [], [], [])
     )
+    a_qlist = [json.dumps([qi]) for qi in a_qindex]
     context_list = [header + compose_context(cast(int, section), df) for (section, df) in section_gp] + appendix
     section_list = [cast(int, section) for (section, df) in section_gp] + a_section
     qindex_list = [cast(int, df.index[0]) for (section, df) in section_gp] + a_qindex
-    qlist_list = [json.dumps(df.index.to_list()) for (section, df) in section_gp] + a_qindex
+    qlist_list = [json.dumps(df.index.to_list()) for (section, df) in section_gp] + a_qlist
scripts/teigaku_genzei/make_icl.py (1)

31-31: Fix boolean flag parsing for --short_context.

Using type=bool makes --short_context false evaluate to True because any non-empty string is truthy. Use action="store_true" instead.

Apply this diff:

-    parser.add_argument('--' + ARG_SHORT_CONTEXT, type=bool, default=False)
+    parser.add_argument('--' + ARG_SHORT_CONTEXT, action="store_true")
scripts/teigaku_genzei/make_seed.py (3)

33-35: Harden select_icl_sample against empty DataFrame and use positional indexing.

This function doesn't guard against an empty icl_df and uses label-based .loc which can misbehave if the DataFrame index is non-integer.

Apply this diff:

 def select_icl_sample(icl_df: pd.DataFrame, average_context_length: int, index: int) -> pd.Series:
+    if icl_df.empty:
+        raise ValueError("ICL sample DataFrame is empty.")
+    step = max(1, int(average_context_length))
+    idx = int(index)
-    ret_index = (index - average_context_length) % len(icl_df)
-    return cast(pd.Series, icl_df.loc[ret_index])
+    ret_index = (idx - step) % len(icl_df)
+    return cast(pd.Series, icl_df.iloc[ret_index])

51-67: Fix KeyError: preserve qindex column when reading context CSV.

Line 51 subselects only the "context" column, but Line 63 accesses x["qindex"], causing a KeyError. You must preserve qindex and ensure it's numeric.

Apply this diff:

-    context_df = pd.read_csv(input_context_path, encoding="utf8")[["context"]]
+    context_df = pd.read_csv(input_context_path, encoding="utf8")
+    if "qindex" not in context_df.columns:
+        # fallback: derive qindex from row index if missing
+        context_df = context_df.reset_index(names="qindex")
+    context_df["qindex"] = pd.to_numeric(context_df["qindex"], errors="coerce").fillna(0).astype(int)
+    context_df = context_df[["context", "qindex"]]
     icl_list = json_util.read_jsonl_file(input_icl_path)
     icl_df = pd.DataFrame(icl_list)
 
     if opt_join_method == OPT_JOIN_METHOD_SLIDE:
@@
-        average_context_length = len(icl_list) / len(context_df)
-        assigned_icl_df = context_df.apply(lambda x: select_icl_sample(icl_df, int(average_context_length), x["qindex"]), axis=1)
+        if len(context_df) == 0:
+            raise ValueError("No contexts loaded.")
+        if len(icl_df) == 0:
+            raise ValueError("No ICL samples loaded.")
+        average_context_length = max(1, round(len(icl_df) / len(context_df)))
+        assigned_icl_df = context_df.apply(
+            lambda x: select_icl_sample(icl_df, int(average_context_length), int(x["qindex"])),
+            axis=1,
+        )
         context_icl_df = pd.concat([

79-81: Critical: ArgumentError is constructed but not raised.

Line 81 constructs an ArgumentError object but doesn't raise it, so invalid join_method values are silently ignored and execution continues with an empty DataFrame.

Apply this diff:

     else:
-        context_icl_df = pd.DataFrame()
-        argparse.ArgumentError(None, f"Invalid {ARG_JOIN_METHOD}: {opt_join_method}")
+        raise ValueError(f"Invalid {ARG_JOIN_METHOD}: {opt_join_method}")
scripts/teigaku_genzei/test_qa.py (2)

53-53: Fix CLI argument name typo: --instruction.

The constant value "instructtion" is misspelled, resulting in a CLI flag --instructtion instead of --instruction.

Apply this diff:

-ARG_INSTRUCTION: str ="instructtion"
+ARG_INSTRUCTION: str = "instruction"

150-158: Fix late-binding lambda by capturing option_map per iteration.

The lambda in the list comprehension closes over option_map, causing all lambdas to reference the final iteration's value (late-binding). Each lambda must capture its own option_map.

Apply this diff:

     biased_score_sr_list = [
-        results_df[f"{criterion}_positional_bias_selected_option"].apply(lambda x: option_map[x])
+        results_df[f"{criterion}_positional_bias_selected_option"].apply(
+            lambda x, mapping=option_map: mapping[x]
+        )
         for (criterion, option_map) in zip(criteria_list, option_map_list)
     ]
🧹 Nitpick comments (26)
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/gpt_4o.json (1)

12-12: Make deployment/model configurable via env (optional).

For portability across subscriptions/deployments, consider parameterizing the deployment name:

-        "model": "gpt-4o-2024-08-06"
+        "model": "${AZURE_OPENAI_DEPLOYMENT:-gpt-4o-2024-08-06}"

Based on learnings

scripts/teigaku_genzei/remove_newline_from_csv.py (1)

9-12: Consider in-place modification for simplicity.

The current pattern builds a dictionary and creates a new DataFrame. A more direct approach modifies columns in-place, avoiding the intermediate dict and DataFrame construction overhead.

Apply this diff for a simpler, more efficient alternative:

-col_sr_dict = {
-    col: df[col].str.replace("\n", " ") if df[col].dtype == "object" else df[col] for col in df.columns
-}
-df2 = pd.DataFrame(col_sr_dict)
-df2.to_csv(output_file, encoding="utf-8", index=False)
+for col in df.select_dtypes(include=['object']).columns:
+    df[col] = df[col].str.replace("\n", " ", regex=False)
+df.to_csv(output_file, encoding="utf-8", index=False)

This uses select_dtypes to filter object columns and modifies them in-place. Setting regex=False also slightly improves performance since "\n" is a literal string.

scripts/teigaku_genzei/llm_as_judge_direct_criteria_correctness2.py (1)

1-2: Remove unused imports.

CrossProviderInferenceEngine, RITSInferenceEngine, and LLMJudgeDirect are imported but never used in this script.

Apply this diff to remove the unused imports:

-from unitxt.inference import CrossProviderInferenceEngine, RITSInferenceEngine
-from unitxt.llm_as_judge import LLMJudgeDirect
 from unitxt import add_to_catalog
 from unitxt.llm_as_judge_constants import CriteriaOption, CriteriaWithOptions
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/ollama/llama3_1_8b.json (1)

11-12: Optional: Harmonize evaluator_name with provider/model naming.

Consider a consistent pattern (e.g., OLLAMA_LLAMA3_1_8B) across catalog assets to ease discovery/automation. If you already have a convention, ignore.

scripts/teigaku_genzei/json_qa.py (3)

15-18: Consider adding error handling for file operations.

The function opens and reads a JSON file without handling potential errors such as file not found, permission denied, or invalid JSON format.

Apply this diff to add error handling:

 def read_json_file(file_path):
-    with open(file_path, 'r') as file:
-        data = json.load(file)
-    return data
+    try:
+        with open(file_path, 'r', encoding='utf-8') as file:
+            data = json.load(file)
+        return data
+    except FileNotFoundError:
+        print(f"Error: File not found: {file_path}")
+        sys.exit(1)
+    except json.JSONDecodeError as e:
+        print(f"Error: Invalid JSON in {file_path}: {e}")
+        sys.exit(1)
+    except Exception as e:
+        print(f"Error reading {file_path}: {e}")
+        sys.exit(1)

55-55: Simplify unnecessary list comprehension.

The comprehension [i for i in title_pos_list] is redundant since title_pos_list is already a list. You can slice it directly.

Apply this diff:

-    atail_pos_list = [i for i in title_pos_list][1:] + [len(text_list)]
+    atail_pos_list = title_pos_list[1:] + [len(text_list)]

71-76: Consider adding error handling for CSV writing.

The function writes to a CSV file without handling potential errors such as permission denied or disk full.

Apply this diff to add error handling:

 def save_to_csv(qa_pairs, output_file):
-    with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
-        writer = csv.writer(csvfile)
-        writer.writerow(['Title', 'Question', 'Answer'])
-        for qa in qa_pairs:
-            writer.writerow(qa)
+    try:
+        with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
+            writer = csv.writer(csvfile)
+            writer.writerow(['Title', 'Question', 'Answer'])
+            for qa in qa_pairs:
+                writer.writerow(qa)
+    except Exception as e:
+        print(f"Error writing to {output_file}: {e}")
+        sys.exit(1)
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/phi_4.json (1)

3-11: Unify engine schema with other RITS configs.

Others use cross_provider_inference_engine + provider: "rits" and key "model". Consider aligning for consistency.

-    "inference_engine": {
-        "__type__": "rits_inference_engine",
-        "max_tokens": 1024,
-        "seed": 42,
-        "temperature": 0,
-        "model_name": "microsoft/phi-4"
-    },
+    "inference_engine": {
+        "__type__": "cross_provider_inference_engine",
+        "max_tokens": 1024,
+        "seed": 42,
+        "temperature": 0,
+        "provider": "rits",
+        "model": "microsoft/phi-4"
+    },
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/granite3_0_8b.json (1)

11-11: Consider more specific evaluator naming for consistency.

The evaluator is named "GRANITE3_8B" while the filename is granite3_0_8b.json and the model is "granite-3-8b-instruct" (implying version 3.0). If the related configs for Granite 3.1, 3.2, and 3.3 use version-specific evaluator names (e.g., GRANITE3_1_8B), consider using "GRANITE3_0_8B" here for consistency.

scripts/teigaku_genzei/test_single_vllm.sh (2)

1-2: Harden script and avoid mkdir failure if dir exists.

Add strict mode and use mkdir -p.

-#!/bin/sh
+#!/bin/sh
+set -euo pipefail
@@
-mkdir ${OUT_DIR}
+mkdir -p "${OUT_DIR}"

Also applies to: 31-31


20-24: Optional: read API key from env when needed.

Passing a dummy key may interfere with providers; consider making it optional or defaulting from ENV.

-# SERVER_TYPE="vllm-remote"
-# SERVER_TYPE="openai"
 SERVER_TYPE="vllm-remote"
-SERVER_API_KEY="dummy"
+SERVER_API_KEY="${SERVER_API_KEY:-}"
scripts/teigaku_genzei/json_glossary.py (1)

62-74: Add error handling for file operations.

The script lacks error handling for file I/O and JSON structure validation. Consider adding try-except blocks to handle:

  • Missing or unreadable input files
  • Invalid JSON format
  • Missing expected fields in the JSON structure
  • CSV write failures

Example:

 if __name__ == "__main__":
     if len(sys.argv) != 3:
         print("Usage: python script.py <input_json_file> <output_csv_file>")
         sys.exit(1)
 
     input_file = sys.argv[1]
     output_file = sys.argv[2]
 
-    data = read_json_file(input_file)
-    qa_pairs = extract_glossary(data)
-    save_to_csv(qa_pairs, output_file)
-
-    print(f"QA pairs have been successfully saved to {output_file}")
+    try:
+        data = read_json_file(input_file)
+        qa_pairs = extract_glossary(data)
+        save_to_csv(qa_pairs, output_file)
+        print(f"QA pairs have been successfully saved to {output_file}")
+    except FileNotFoundError as e:
+        print(f"Error: File not found - {e}")
+        sys.exit(1)
+    except json.JSONDecodeError as e:
+        print(f"Error: Invalid JSON format - {e}")
+        sys.exit(1)
+    except KeyError as e:
+        print(f"Error: Missing expected field in JSON - {e}")
+        sys.exit(1)
+    except Exception as e:
+        print(f"Error: {e}")
+        sys.exit(1)
scripts/teigaku_genzei/extract_qa.sh (1)

7-14: Add error checking for script execution.

The loop doesn't check if the Python scripts succeed or if input files exist. Consider adding basic error handling:

 for sqa in ${source_qa}; do
     sf=${source_dir}/${sqa}.json
     qf=${qa_table_dir}/${sqa}.csv
     gf=${qa_table_dir}/${sqa}_glossary.csv
 
+    if [ ! -f "${sf}" ]; then
+        echo "Error: Input file ${sf} not found"
+        exit 1
+    fi
+
-    python ./json_qa.py ${sf} ${qf}
-    python ./json_glossary.py ${sf} ${gf}
+    python ./json_qa.py ${sf} ${qf} || exit 1
+    python ./json_glossary.py ${sf} ${gf} || exit 1
 done
scripts/teigaku_genzei/extract_icl_from_seed.py (2)

49-51: Simplify dict comprehension.

As Pylint suggests, the dict comprehension can be simplified.

-    unique_icl_dict = {
-        h: obj for (h, obj) in hash_list
-    }
+    unique_icl_dict = dict(hash_list)

52-53: Remove misleading TODO comment.

The TODO comment states "collect only a unique set of ICLs" but this functionality is already implemented via the hash-based deduplication above.

-    # TODO: collect only a unique set of ICLs.
     unique_icl_list = [obj for (_, obj) in unique_icl_dict.items()]
scripts/teigaku_genzei/url.txt (1)

1-7: Verify link stability and add retrieval guidance.

Consider adding a short README note or a fetch script that reads this file, sets a user-agent, follows redirects, and records checksums/last-modified timestamps to guard against upstream changes or takedowns. Also ensure the NTA terms permit automated retrieval.
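
A hedged sketch of such a fetch helper (assuming the requests package is available; the User-Agent string is a placeholder, and automated retrieval should only be used if the NTA terms allow it):

import hashlib

import requests

def fetch_and_checksum(url: str) -> tuple[bytes, str]:
    # Follow redirects and record a SHA-256 digest so later runs can detect
    # upstream changes to the source documents.
    resp = requests.get(
        url,
        headers={"User-Agent": "teigaku-genzei-example/0.1"},
        allow_redirects=True,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content, hashlib.sha256(resp.content).hexdigest()

if __name__ == "__main__":
    with open("url.txt", encoding="utf-8") as f:
        for url in filter(None, (line.strip() for line in f)):
            _, digest = fetch_and_checksum(url)
            print(digest, url)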

scripts/teigaku_genzei/requirements.txt (1)

4-9: Consolidate extras and consider pinning for reproducibility.

  • Listing both sacrebleu and sacrebleu[ja] is redundant; keep only sacrebleu[ja].
  • Consider pinning versions (accelerate, nltk, rouge_score, bert_score, litellm, python-dotenv) to avoid surprise breakages in scripts.

Apply this minimal cleanup:

-accelerate
-sacrebleu
-sacrebleu[ja]
-nltk
-rouge_score
-bert_score
-litellm
+accelerate
+sacrebleu[ja]
+nltk
+rouge_score
+bert_score
+litellm

Optionally add explicit versions in a constraints file for CI determinism.

Also applies to: 12-12

scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/granite3_0_8b.json (1)

9-12: Align evaluator_name with versioned naming.

To match the 3.3 config style, consider "GRANITE3_0_8B" (includes minor version) instead of "GRANITE3_8B".

scripts/teigaku_genzei/make_seed.sh (1)

1-4: Harden the script and use python3

Add strict mode and prefer python3 for portability on RHEL.

Apply this diff:

-#!/bin/sh
+#!/bin/sh
+set -euo pipefail
@@
-# python ./make_seed.py context/0024001-021_context.csv tmp/icl.jsonl tmp/seed_ja.jsonl
-python ./make_seed.py --context_file context/0024001-021_context.csv --icl_file icl/icl.jsonl --out_seed_file tmp/seed_ja.jsonl --join_method cartesian
+# python3 ./make_seed.py context/0024001-021_context.csv tmp/icl.jsonl tmp/seed_ja.jsonl
+python3 ./make_seed.py --context_file context/0024001-021_context.csv --icl_file icl/icl.jsonl --out_seed_file tmp/seed_ja.jsonl --join_method cartesian
scripts/teigaku_genzei/README.md (1)

1-236: Consider applying markdown formatting improvements.

Static analysis (markdownlint) identified several formatting issues throughout the documentation:

  • Bare URLs should be converted to proper Markdown links (e.g., Lines 7, 10, 11, 20, 21)
  • Unordered list indentation should use consistent 2-space increments
  • Code blocks should include language specifiers (e.g., ```bash, ```csv)

These improvements would enhance readability and conform to standard Markdown best practices, but are not blocking issues.

scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)

41-47: Consider using spread operator for cleaner code.

The list concatenation on Line 47 can be simplified using the spread operator for better readability.

Apply this diff:

-            self.reduction_map = {"mean": [self.main_score] + self.ci_scores}
+            self.reduction_map = {"mean": [self.main_score, *self.ci_scores]}
scripts/teigaku_genzei/json_util.py (2)

20-25: Remove commented-out code and useless return.

Line 24 contains commented-out code that should be removed. Line 25's explicit return is unnecessary.

Apply this diff:

 def write_jsonl_file(file_path: str, data_list: list[Any]):
     with open(file_path, 'w') as file:
         for data in data_list:
             file.write(json.dumps(data, ensure_ascii=False) + '\n')
-            # json.dump(data, file, ensure_ascii=False)
-    return

27-33: Remove unreachable return statement.

Line 33's return [] is unreachable since the function already returns at Line 32 within the with-block.

Apply this diff:

 def read_jsonl_file(file_path: str) -> list[Any]:
     with open(file_path, 'r') as file:
         data_list = [
             json.loads(line) for line in file
         ]
         return data_list
-    return []
scripts/teigaku_genzei/make_context.py (1)

46-54: Consider removing debug print statements.

Lines 51-52 print section/subsection numbers during execution. If this is intentional for debugging, consider using proper logging or removing before production.

Apply this diff if prints are no longer needed:

 def extract_section_number(title: str) -> tuple[int, int]:
     match = json_util.get_title_match(title)
     if match is not None:
         section_num_str = match.group(1)
         subsection_num_str = match.group(2)
-        print(section_num_str)
-        print(subsection_num_str)
         return (int(section_num_str), int(subsection_num_str))
     return (0, 0)
scripts/teigaku_genzei/make_icl.py (1)

91-91: Remove unused variable option.

The option variable is assigned at Line 91 but never used in the function.

Apply this diff:

-    option = args[ARG_OPTION]
     input_qa_path = args[ARG_INPUT_QA_FILE]
scripts/teigaku_genzei/test_qa.py (1)

155-157: Remove unused variable nbiased_score_sr_list.

The variable nbiased_score_sr_list is assigned at Line 155 but never used. Either use it in the computation or remove it.

Apply this diff if the variable is truly unused:

     cbiased_score_sr_list: list[pd.Series] = [cast(pd.Series, sr) for sr in biased_score_sr_list]
-    nbiased_score_sr_list = [
-        sr.rename(cast(str, sr.name).replace("_positional_bias_selected_option", "_positional_bias_score")) 
-        for sr in cbiased_score_sr_list
-    ]

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (8)
scripts/teigaku_genzei/catalog_util.py (2)

16-20: Previous issue resolved; consider removing debug print.

The IndexError risk has been fixed with the guard clause on line 20. However, the print statement on line 18 outputs to stdout on every call, which pollutes logs and is typically debugging code that should be removed.

Apply this diff to remove the debug output:

 def get_system_catalog_path() -> str | None:
     paths = get_local_catalogs_paths()
-    print(paths)

     return paths[-1] if len(paths) > 0 else None

22-22: Consider removing or deferring the import-time side effect.

Calling get_system_catalog_path() at import time executes code with side effects (the print statement) every time this module is imported, which can pollute logs and cause unexpected behavior in testing or when the module is imported for inspection. If catalog path initialization is needed, consider moving it to an explicit initialization function that callers invoke when required.

Remove the call or wrap it in an initialization function:

-get_system_catalog_path()
+# Call get_system_catalog_path() explicitly when needed instead of at import time
scripts/teigaku_genzei/json_glossary.py (1)

1-4: Remove or update the TODO comment.

The TODO comment appears to be a template placeholder that describes what the script already does. Consider removing it or replacing it with specific future work items if any remain.

scripts/teigaku_genzei/json_util.py (2)

24-29: Remove unnecessary explicit return.

The explicit return statement at line 29 is redundant. Python functions implicitly return None at the end.

Apply this diff:

 def write_jsonl_file(file_path: str, data_list: list[Any]):
     with open(file_path, 'w') as file:
         for data in data_list:
             file.write(json.dumps(data, ensure_ascii=False) + '\n')
-            # json.dump(data, file, ensure_ascii=False)
-    return

18-21: Add type hints for consistency.

The read_json_file function lacks type hints for the file_path parameter and return type, while other functions in the module have complete type annotations.

Apply this diff:

-def read_json_file(file_path):
+def read_json_file(file_path: str) -> Any:
     with open(file_path, 'r') as file:
         data = json.load(file)
     return data
scripts/teigaku_genzei/json_qa.py (3)

1-4: Remove or update the TODO comment.

Similar to json_glossary.py, this TODO comment appears to be a template placeholder describing what the script already implements. Consider removing it or replacing with specific action items.


51-51: Simplify list slicing.

The expression [i for i in title_pos_list][1:] first creates a full copy via list comprehension, then slices it. This is inefficient and can be simplified to title_pos_list[1:] directly.

Apply this diff:

-    atail_pos_list = [i for i in title_pos_list][1:] + [len(text_list)]
+    atail_pos_list = title_pos_list[1:] + [len(text_list)]

57-57: Add explicit strict= parameter to zip().

Without strict=True, if the four lists have different lengths due to malformed input, zip() will silently truncate to the shortest list, potentially dropping QA pairs without warning. This could mask data extraction issues.

Apply this diff:

-    for (ti, qhi, ahi, ati) in zip(title_pos_list, qhead_pos_list, ahead_pos_list, atail_pos_list):
+    for (ti, qhi, ahi, ati) in zip(title_pos_list, qhead_pos_list, ahead_pos_list, atail_pos_list, strict=True):

Note: strict=True requires Python 3.10+. If targeting Python 3.9 or earlier, consider adding an explicit length check before the loop:

if not (len(title_pos_list) == len(qhead_pos_list) == len(ahead_pos_list) == len(atail_pos_list)):
    raise ValueError(f"Mismatched list lengths: titles={len(title_pos_list)}, questions={len(qhead_pos_list)}, answers={len(ahead_pos_list)}, tails={len(atail_pos_list)}")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d1768c and 4b09212.

📒 Files selected for processing (5)
  • scripts/teigaku_genzei/catalog_util.py (1 hunks)
  • scripts/teigaku_genzei/context_util.py (1 hunks)
  • scripts/teigaku_genzei/json_glossary.py (1 hunks)
  • scripts/teigaku_genzei/json_qa.py (1 hunks)
  • scripts/teigaku_genzei/json_util.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
scripts/teigaku_genzei/json_qa.py (2)
scripts/teigaku_genzei/json_glossary.py (2)
  • is_body_text (18-19)
  • is_section_title (32-33)
scripts/teigaku_genzei/json_util.py (4)
  • is_title (6-7)
  • is_qhead (12-13)
  • is_ahead (15-16)
  • read_json_file (18-21)
scripts/teigaku_genzei/json_glossary.py (2)
scripts/teigaku_genzei/json_qa.py (2)
  • is_body_text (18-19)
  • save_to_csv (67-72)
scripts/teigaku_genzei/json_util.py (2)
  • is_title (6-7)
  • read_json_file (18-21)
🪛 Pylint (4.0.0)
scripts/teigaku_genzei/json_qa.py

[refactor] 51-51: Unnecessary use of a comprehension, use list(title_pos_list) instead.

(R1721)

scripts/teigaku_genzei/json_util.py

[refactor] 24-24: Useless return at end of function or method

(R1711)

🪛 Ruff (0.14.0)
scripts/teigaku_genzei/context_util.py

17-17: String contains ambiguous Ａ (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?

(RUF001)


20-20: Unused function argument: section

(ARG001)

scripts/teigaku_genzei/json_qa.py

22-22: String contains ambiguous （ (FULLWIDTH LEFT PARENTHESIS). Did you mean ( (LEFT PARENTHESIS)?

(RUF001)


22-22: String contains ambiguous ） (FULLWIDTH RIGHT PARENTHESIS). Did you mean ) (RIGHT PARENTHESIS)?

(RUF001)


33-33: Comment contains ambiguous Ａ (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?

(RUF003)


35-35: Comment contains ambiguous １ (FULLWIDTH DIGIT ONE). Did you mean 1 (DIGIT ONE)?

(RUF003)


46-46: Comment contains ambiguous － (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)?

(RUF003)


51-51: Consider [*[i for i in title_pos_list][1:], len(text_list)] instead of concatenation

Replace with [*[i for i in title_pos_list][1:], len(text_list)]

(RUF005)


57-57: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

scripts/teigaku_genzei/json_util.py

7-7: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?

(RUF001)


7-7: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?

(RUF001)


7-7: String contains ambiguous － (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)?

(RUF001)


7-7: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?

(RUF001)


7-7: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?

(RUF001)


10-10: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?

(RUF001)


10-10: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?

(RUF001)


10-10: String contains ambiguous － (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)?

(RUF001)


10-10: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?

(RUF001)


10-10: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?

(RUF001)


16-16: String contains ambiguous Ａ (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?

(RUF001)

scripts/teigaku_genzei/json_glossary.py

33-33: String contains ambiguous （ (FULLWIDTH LEFT PARENTHESIS). Did you mean ( (LEFT PARENTHESIS)?

(RUF001)


33-33: String contains ambiguous ） (FULLWIDTH RIGHT PARENTHESIS). Did you mean ) (RIGHT PARENTHESIS)?

(RUF001)

🔇 Additional comments (2)
scripts/teigaku_genzei/catalog_util.py (2)

1-7: LGTM: Imports and catalog registration.

The imports are appropriate for the functionality, and registering the local catalog at import time is a standard pattern for Unitxt catalog initialization.


9-14: Previous issue resolved.

The function now correctly uses the artifact_name parameter instead of a hardcoded string. The logic and exception handling are appropriate.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (2)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)

79-87: Avoid duplicating base_url construction; let engine derive it.

RITSInferenceEngineFixed.prepare_engine() already appends “/v1” when base_url is None. Prefer omitting base_url here unless a distinct endpoint is required.

-        inference_engine=RITSInferenceEngineFixed(
+        inference_engine=RITSInferenceEngineFixed(
             max_tokens=1024,
             seed=42,
             temperature=0,
             model_name=llm_as_judge_direct_rits_llama4.MODEL_NAME,
-            base_url=RITSInferenceEngineFixed.get_base_url_from_model_name_v(llm_as_judge_direct_rits_llama4.ENDPOINT_NAME, "/v1"),
+            # Let prepare_engine() set base_url from model_name
+            # base_url can be supplied only if endpoint differs from model mapping
         ),

If endpoint ≠ model mapping is required, keep base_url but align helper naming (see sibling file comment).

scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (1)

30-30: Fix super() call to invoke immediate parent.

Use super().prepare_engine() here; current call targets the ancestor of RITSInferenceEngine and triggers Pylint E1003.

Apply:

-        super(RITSInferenceEngine, self).prepare_engine()
+        super().prepare_engine()
🧹 Nitpick comments (5)
scripts/teigaku_genzei/llm_as_judge_direct_criteria_correctness2.py (1)

17-47: Consider adding a docstring.

While the comments above explain the workaround, adding a docstring to the function would clarify its purpose and parameters for users of this module.

For example:

 def register(catalog_path: str) -> None:
+    """Register a custom correctness criterion with numeric option names.
+    
+    This criterion uses numeric option names ("3", "2", "1") instead of semantic names
+    to work around post-processing issues with MatchClosestOption normalization.
+    
+    Args:
+        catalog_path: Path to the catalog where the criterion will be registered.
+    """
scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (2)

19-21: Clarify base_url derivation and naming.

Helper name implies “model_name” but register() passes ENDPOINT_NAME. Rename to reflect generic identifier or add endpoint-aware logic to avoid confusion and accidental mismatches.

Example:

-    def get_base_url_from_model_name_v(model_name: str, version: str):
-        return RITSInferenceEngine.get_base_url_from_model_name(model_name) + version
+    def build_base_url_with_suffix(identifier: str, suffix: str):
+        return RITSInferenceEngine.get_base_url_from_model_name(identifier) + suffix

And in both places:

- base_url=RITSInferenceEngineFixed.get_base_url_from_model_name_v(ENDPOINT_NAME, "/v1"),
+ base_url=RITSInferenceEngineFixed.build_base_url_with_suffix(ENDPOINT_NAME, "/v1"),

Also applies to: 26-29, 42-44


36-36: Rename variable to match content.

The variable name suggests phi-4, but the config targets llama-4 maverick. Rename for clarity and consistency with the catalog key.

-    laaj_direct_rits_phi4 = LLMJudgeDirect(
+    laaj_direct_rits_llama4_maverick = LLMJudgeDirect(
...
-    add_to_catalog(laaj_direct_rits_phi4, name="metrics.llm_as_judge.direct.rits.llama4_maverick", overwrite=True, catalog_path=catalog_path)
+    add_to_catalog(laaj_direct_rits_llama4_maverick, name="metrics.llm_as_judge.direct.rits.llama4_maverick", overwrite=True, catalog_path=catalog_path)

Also applies to: 57-57

scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (2)

46-47: Replace print with logger and simplify list build.

Use logger.debug for diagnostics; adopt splat style for reduction_map.

-            print(f"LLMJudgeDirectPositionalBias.prepare() ci_scores: {self.ci_scores}")
-            self.reduction_map = {"mean": [self.main_score] + self.ci_scores}
+            logger.debug("LLMJudgeDirectPositionalBias.ci_scores: %s", self.ci_scores)
+            self.reduction_map = {"mean": [self.main_score, *self.ci_scores]}

Add at top if missing:

-from typing import List, Dict, Any
+from typing import List, Dict, Any
+from unitxt import get_logger
...
-# (no logger defined)
+logger = get_logger()

59-60: Add zip(strict=...) for safer length assumptions.

Make intent explicit per B905.

-            for result, pb_score in zip(results, pb_scores)
+            for result, pb_score in zip(results, pb_scores, strict=False)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4b09212 and ca48920.

📒 Files selected for processing (4)
  • scripts/teigaku_genzei/jsonl_util.py (1 hunks)
  • scripts/teigaku_genzei/llm_as_judge_direct_criteria_correctness2.py (1 hunks)
  • scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1 hunks)
  • scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
  • register (65-91)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (3)
  • RITSInferenceEngineFixed (17-30)
  • register (33-57)
  • get_base_url_from_model_name_v (20-21)
🪛 Pylint (4.0.0)
scripts/teigaku_genzei/jsonl_util.py

[refactor] 7-7: Useless return at end of function or method

(R1711)

scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py

[error] 30-30: Bad first argument 'RITSInferenceEngine' given to super()

(E1003)

scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py

[refactor] 9-9: Too few public methods (1/2)

(R0903)

🪛 Ruff (0.14.0)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py

47-47: Consider [self.main_score, *self.ci_scores] instead of concatenation

Replace with [self.main_score, *self.ci_scores]

(RUF005)


59-59: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🔇 Additional comments (5)
scripts/teigaku_genzei/jsonl_util.py (1)

14-19: LGTM! Previous issue resolved.

The function correctly specifies encoding='utf-8' for reading, which is essential for Japanese text. The unreachable return [] mentioned in the previous review has been removed.

scripts/teigaku_genzei/llm_as_judge_direct_criteria_correctness2.py (2)

6-15: Well-documented workaround.

The explanation clearly describes why numeric option names are used instead of semantic names, making the rationale behind this design decision accessible to future maintainers.


49-49: LGTM!

The catalog registration correctly uses the provided catalog_path parameter and enables overwrite to allow updates. The fully-qualified name follows unitxt's naming conventions.

scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (1)

53-54: Verify evaluator_name constant.

“LLAMA3_4_MAVERICK” looks off; confirm intended label (likely “LLAMA4_MAVERICK”). Update if misnamed to avoid metric aggregation inconsistency.

scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)

88-89: Verify evaluator_name string.

“LLAMA3_4_MAVERICK” appears mismatched with llama-4 config; confirm the exact label used elsewhere to keep metrics consistent.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py (1)

14-14: Consider using sys.exit for clarity.

While exit(1) works in script contexts, sys.exit(1) is the more explicit and standard approach in Python.

Apply this diff:

+import sys
+
 import os
 import json
 import logging
 import catalog_util
     system_catalog_path = catalog_util.get_system_catalog_path()
     if system_catalog_path is None:
         logging.error("No system catalog")
-        exit(1)
+        sys.exit(1)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ca48920 and 5c11341.

📒 Files selected for processing (1)
  • scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-14T13:57:11.861Z
Learnt from: ryokawajp
PR: opendatahub-io/odh-data-processing#26
File: scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py:31-36
Timestamp: 2025-10-14T13:57:11.861Z
Learning: For one-time setup scripts in the scripts/teigaku_genzei/ directory, elaborate error handling is not required. Natural script suspension when errors occur is considered sufficient for these types of scripts.

Applied to files:

  • scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py
🧬 Code graph analysis (1)
scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py (1)
scripts/teigaku_genzei/catalog_util.py (1)
  • get_system_catalog_path (16-20)
🪛 Pylint (4.0.0)
scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py

[refactor] 14-14: Consider using 'sys.exit' instead

(R1722)

🔇 Additional comments (1)
scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py (1)

34-39: LGTM! Encoding specification properly implemented.

The explicit encoding="utf-8" specification for both file reads and writes ensures reliable handling of JSON content with non-ASCII characters. This addresses the feedback from the previous review.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

♻️ Duplicate comments (1)
scripts/teigaku_genzei/make_context.py (1)

87-87: Verified: Past critical qlist_list bug has been fixed.

The previous review flagged that qlist_list incorrectly appended a_qindex (raw list [-1]) instead of a JSON-encoded string. The current implementation correctly uses [json.dumps([qi]) for qi in a_qindex] to produce ["[-1]"] for the appendix case, resolving the issue.

Note: The fix differs from the suggested approach (which proposed a separate a_qlist variable) but achieves the same correct result.
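
A quick standalone check of the encoding behavior described above (the [-1] sentinel is taken from the comment; everything else is illustrative):

import json

a_qindex = [-1]  # appendix sentinel

# JSON-encoding each index yields the string form expected in the CSV column,
# and it round-trips cleanly back to a list.
qlist_list = [json.dumps([qi]) for qi in a_qindex]
print(qlist_list)                 # ['[-1]']
print(json.loads(qlist_list[0]))  # [-1]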

🧹 Nitpick comments (1)
scripts/teigaku_genzei/make_context.py (1)

12-17: Consider converting descriptive TODOs to module docstring.

These TODO comments describe what the script does rather than marking incomplete work. Consider converting them to a proper module-level docstring for better documentation.

Apply this diff:

+"""
+Build context CSV from QA and glossary data.
+
+Input: qa.csv - question-answer pairs with titles
+Input: glossary.csv - term definitions  
+Output: context.csv - section-level contexts with question indices
+
+Process:
+- Extract section numbers from QA titles
+- Group entries by section
+- Generate context blocks per section with optional glossary header/appendix
+"""
+
 import sys
 from typing import cast, Any, Hashable, Sequence, Collection
 import json
 import argparse
 
 import pandas as pd
 
 import json_util
 
 from context_util import compose_context, compose_glossary
-
-# TODO: input: qa.csv
-# TODO: input: glossary.csv
-# TODO: output: context.csv
-
-# TODO: extract section numbers
-# TODO: collect entries of same section numbers.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5c11341 and e803c58.

📒 Files selected for processing (1)
  • scripts/teigaku_genzei/make_context.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/teigaku_genzei/make_context.py (2)
scripts/teigaku_genzei/context_util.py (2)
  • compose_context (20-36)
  • compose_glossary (38-43)
scripts/teigaku_genzei/json_util.py (1)
  • get_title_match (9-10)
🪛 Pylint (4.0.0)
scripts/teigaku_genzei/make_context.py

[refactor] 56-56: Useless return at end of function or method

(R1711)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
scripts/teigaku_genzei/make_icl.py (1)

22-41: Flag parsing correctly implemented.

The --short_context argument now uses action="store_true", which properly addresses the past review concern about boolean flag parsing. The CLI configuration is well-structured with clear argument names and sensible defaults.

Note: The --glossary_file argument is accepted but not currently utilized in main() (line 93 is commented out). This is consistent with the TODO at lines 109-116 indicating planned glossary support.
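
For context, a minimal sketch of the confirmed flag behavior (standalone; not the script's full parser):

import argparse

parser = argparse.ArgumentParser()
# With action="store_true" the flag takes no value:
# present -> True, absent -> False.
parser.add_argument("--short_context", action="store_true")

print(parser.parse_args([]).short_context)                   # False
print(parser.parse_args(["--short_context"]).short_context)  # True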

🧹 Nitpick comments (1)
scripts/teigaku_genzei/make_icl.py (1)

87-117: Remove unused glossary option and clarify CLI arguments

  • Remove the unused option = args[ARG_OPTION] assignment (line 91).
  • Either drop the --glossary_file argument (line 33) or annotate it as “reserved for future use” since its handling is commented out.
  • Add basic error checks around pd.read_csv (file existence, required columns, non-empty DataFrames) to fail with clear messages.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e803c58 and 05a4bea.

📒 Files selected for processing (2)
  • scripts/teigaku_genzei/make_icl.py (1 hunks)
  • scripts/teigaku_genzei/make_icl.sh (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/teigaku_genzei/make_icl.py (1)
scripts/teigaku_genzei/context_util.py (1)
  • compose_context (20-36)
🪛 Pylint (4.0.0)
scripts/teigaku_genzei/make_icl.py

[refactor] 45-53: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)

🪛 Ruff (0.14.0)
scripts/teigaku_genzei/make_icl.py

43-43: Unused function argument: qa_df

(ARG001)


78-78: String contains ambiguous `Ａ` (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean `A` (LATIN CAPITAL LETTER A)?

(RUF001)


91-91: Local variable option is assigned to but never used

Remove assignment to unused variable option

(F841)

🔇 Additional comments (3)
scripts/teigaku_genzei/make_icl.sh (1)

1-3: LGTM! Script correctly uses the --short_context flag.

The shell script properly invokes make_icl.py with --short_context as a flag (no explicit value), which aligns with the corrected action="store_true" implementation in the Python script. The relative paths and minimal structure are appropriate for this use case.

scripts/teigaku_genzei/make_icl.py (2)

55-57: LGTM! Short context composition is correctly implemented.

The function properly filters QA rows by index and delegates to context_util.compose_context(). The section_index parameter, though currently unused by compose_context(), is passed through as documented in context_util.py for "future extensions."


59-85: LGTM! ICL example generation correctly handles Japanese text.

The function properly:

  • Toggles between full and short context based on the short_context flag
  • Strips Japanese prefixes ("問 " for questions, "[Ａ]" for answers), which is appropriate for this tax-document domain
  • Flattens ICL QA pairs into numbered fields (icl_query_1, icl_response_1, etc.)

The fullwidth "Ａ" character flagged by static analysis is intentional for Japanese text processing, not an error.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
scripts/teigaku_genzei/make_seed.py (3)

55-55: Fix KeyError: missing qindex column.

Line 55 only selects the "context" column, but line 67 accesses x["qindex"], causing a KeyError when the slide join method is used.

Apply this diff to preserve the qindex column and ensure it's numeric:

-    context_df = pd.read_csv(input_context_path, encoding="utf8")[["context"]]
+    context_df = pd.read_csv(input_context_path, encoding="utf8")
+    if "qindex" not in context_df.columns:
+        context_df = context_df.reset_index(names="qindex")
+    context_df["qindex"] = pd.to_numeric(context_df["qindex"], errors="coerce").fillna(0).astype(int)
+    context_df = context_df[["context", "qindex"]]

83-85: Raise the exception for invalid join method.

Line 85 constructs an ArgumentError but does not raise it, allowing execution to continue silently with an empty DataFrame when an invalid join method is provided.

Apply this diff to raise the exception:

     else:
-        context_icl_df = pd.DataFrame()
-        argparse.ArgumentError(None, f"Invalid {ARG_JOIN_METHOD}: {opt_join_method}")
+        raise ValueError(f"Invalid {ARG_JOIN_METHOD}: {opt_join_method}")

66-66: Guard against division by zero.

If context_df is empty, line 66 will raise a ZeroDivisionError.

Apply this diff to add validation:

+    if len(context_df) == 0:
+        raise ValueError("No contexts loaded from CSV.")
+    if len(icl_df) == 0:
+        raise ValueError("No ICL samples loaded from JSONL.")
     average_context_length = len(icl_list) / len(context_df)
🧹 Nitpick comments (2)
scripts/teigaku_genzei/make_seed.py (2)

2-2: Remove unused imports.

The sys and json imports appear to be unused in this file.

Apply this diff to remove the unused imports:

-import sys
 import argparse
 from typing import cast, Any, Hashable, Sequence, Collection
-import json
 import pandas as pd

Also applies to: 5-5


41-42: Document or remove the no-op function.

The compose_seed function currently returns its input unchanged. If this is intentional for future extension, add a docstring explaining the design. Otherwise, consider removing it.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 05a4bea and e25a799.

📒 Files selected for processing (1)
  • scripts/teigaku_genzei/make_seed.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ryokawajp
PR: opendatahub-io/odh-data-processing#26
File: scripts/teigaku_genzei/make_context.sh:5-11
Timestamp: 2025-10-15T13:22:53.401Z
Learning: In shell scripts within the scripts/teigaku_genzei/ directory, the context/, qa_table/, icl/, docparser/, and source/ directories already exist in the repository because they contain committed .gitignore files, so scripts do not need to create these directories with mkdir -p before writing files into them.
🪛 Ruff (0.14.0)
scripts/teigaku_genzei/make_seed.py

35-35: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (1)
scripts/teigaku_genzei/make_seed.py (1)

33-39: LGTM! Past concerns addressed.

The function now guards against empty DataFrames and uses positional indexing via iloc, resolving the issues raised in previous reviews.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
scripts/teigaku_genzei/make_seed.py (1)

92-94: Raise the ArgumentError instead of constructing it.

The ArgumentError is constructed but not raised, so invalid join methods would be silently ignored (though argparse's choices constraint prevents reaching this code path in practice).

Apply this diff:

     else:
-        context_icl_df = pd.DataFrame()
-        argparse.ArgumentError(None, f"Invalid {ARG_JOIN_METHOD}: {opt_join_method}")
+        raise ValueError(f"Invalid {ARG_JOIN_METHOD}: {opt_join_method}")
🧹 Nitpick comments (4)
scripts/teigaku_genzei/make_seed.py (4)

28-28: Consider removing commented-out code.

The commented device argument appears to be unused. If it's not needed for future functionality, consider removing it to improve code clarity.


41-42: Remove unused function or document its purpose.

The compose_seed function is defined but never called, and currently just returns its input unchanged. Consider removing it if it's not needed, or document if it's a placeholder for future functionality.


46-48: Consider removing commented-out code.

The old usage code in comments could be removed to improve code clarity.


68-72: Consider using max(1, ...) for consistency.

While select_icl_sample handles zero values with max(1, average_context_length) on line 36, it would be clearer to ensure average_context_length is at least 1 here as well. This prevents passing 0 when len(icl_df) < len(context_df) (e.g., 1 ICL sample with 3 contexts → round(1/3) = 0).

Apply this diff:

-        average_context_length = round(len(icl_df) / len(context_df))
+        average_context_length = max(1, round(len(icl_df) / len(context_df)))
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e25a799 and 773034f.

📒 Files selected for processing (1)
  • scripts/teigaku_genzei/make_seed.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ryokawajp
PR: opendatahub-io/odh-data-processing#26
File: scripts/teigaku_genzei/make_context.sh:5-11
Timestamp: 2025-10-15T13:22:53.401Z
Learning: In shell scripts within the scripts/teigaku_genzei/ directory, the context/, qa_table/, icl/, docparser/, and source/ directories already exist in the repository because they contain committed .gitignore files, so scripts do not need to create these directories with mkdir -p before writing files into them.
🪛 Ruff (0.14.0)
scripts/teigaku_genzei/make_seed.py

35-35: Avoid specifying long messages outside the exception class

(TRY003)


69-69: Avoid specifying long messages outside the exception class

(TRY003)


71-71: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (5)
scripts/teigaku_genzei/make_seed.py (5)

33-39: LGTM!

The function correctly guards against empty DataFrames and uses positional indexing with iloc. The max(1, average_context_length) ensures the step is always at least 1, preventing division issues.


55-57: LGTM!

The data loading correctly preserves both the context and qindex columns, addressing the KeyError issue from the previous review.


73-80: LGTM!

The SLIDE join implementation correctly applies select_icl_sample to each context row with proper integer conversions, then concatenates the results while dropping the qindex column.


85-91: LGTM!

The CARTESIAN join correctly creates a cross product of ICL samples and contexts by expanding each context with all ICL rows, then concatenating and resetting the index.
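
For reference, a generic pandas sketch of the same cross-product shape (column names invented; the script builds it by broadcasting and concatenating, while merge(how="cross") is an equivalent one-liner):

import pandas as pd

contexts = pd.DataFrame({"context": ["ctx-A", "ctx-B"]})
icl = pd.DataFrame({"icl_query_1": ["q1", "q2", "q3"]})

# Every context is paired with every ICL row: len(contexts) * len(icl) rows.
cartesian = contexts.merge(icl, how="cross").reset_index(drop=True)
print(len(cartesian))  # 6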


107-119: LGTM!

The output DataFrame construction correctly adds the schema fields (document_outline, document_title, domain), concatenates with the joined context-ICL data, renames the context column to document, and writes to JSONL format with proper encoding.
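
And the JSONL write pattern being approved, as a minimal standalone sketch (data invented):

import pandas as pd

out_df = pd.DataFrame({
    "document": ["本文のサンプル"],
    "document_title": ["定額減税 Q&A"],
    "domain": ["tax"],
})

# orient="records", lines=True emits one JSON object per line (JSONL);
# force_ascii=False keeps Japanese characters readable instead of \uXXXX escapes.
out_df.to_json("seed.jsonl", orient="records", lines=True, force_ascii=False)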

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (9)
scripts/teigaku_genzei/make_seed.py (9)

2-2: Remove unused import.

The sys import is not used in this module (Line 47 shows commented-out sys.exit usage).

-import sys

5-5: Remove unused import.

The json import is not used (json operations are handled by json_util module).

-import json

28-28: Remove dead code.

The commented-out device argument should be removed if it's not needed.

-    # parser.add_argument('--device', type=str, default="cuda")
-

41-42: Remove or implement stub function.

This function currently does nothing (returns input unchanged). If it's a placeholder for future logic, add a TODO comment explaining the intent. Otherwise, remove it as it's not called anywhere in the code.

-def compose_seed(df: pd.DataFrame) -> pd.DataFrame:
-    return df
-    

46-48: Remove dead code.

The commented-out argument handling code should be removed.

-    # if len(sys.argv) != 4:
-    #     print("Usage: python script.py <input_context.csv> <input_icl.jsonl> <output_seed.jsonl")
-    #     sys.exit(1)

55-57: Consider adding error handling for file operations.

File read operations lack error handling. If files are missing, inaccessible, or malformed, the script will crash with unclear errors. Consider wrapping in try-except blocks or letting the natural exceptions propagate with clearer context.

For example:

try:
    context_df = pd.read_csv(input_context_path, encoding="utf8")[["context", "qindex"]]
except FileNotFoundError as e:
    raise FileNotFoundError(f"Context file not found: {input_context_path}") from e
except KeyError as e:
    raise KeyError(f"Required column missing in context file: {e}") from e

107-111: Consider making metadata configurable.

The document metadata (outline, title, domain) is hard-coded with Japanese text and specific values. This limits reusability. Consider accepting these as CLI arguments or deriving them from input files.

Example:

parser.add_argument('--document_outline', type=str, required=True)
parser.add_argument('--document_title', type=str, required=True)
parser.add_argument('--domain', type=str, required=True)

116-116: Remove dead code.

The commented-out drop statement should be removed.

-    # .drop(columns=["section", "qindex", "qlist"])

117-118: Consider adding error handling for output operations.

The output operations (print and to_json) lack error handling. Consider wrapping the to_json call to catch and report write failures clearly.

Example:

try:
    out_df.to_json(output_context_path, orient="records", lines=True, force_ascii=False)
    print(f"Successfully wrote {len(out_df)} records to {output_context_path}")
except Exception as e:
    raise IOError(f"Failed to write output file {output_context_path}: {e}")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 773034f and bda6cf8.

📒 Files selected for processing (1)
  • scripts/teigaku_genzei/make_seed.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ryokawajp
PR: opendatahub-io/odh-data-processing#26
File: scripts/teigaku_genzei/make_context.sh:5-11
Timestamp: 2025-10-15T13:22:53.401Z
Learning: In shell scripts within the scripts/teigaku_genzei/ directory, the context/, qa_table/, icl/, docparser/, and source/ directories already exist in the repository because they contain committed .gitignore files, so scripts do not need to create these directories with mkdir -p before writing files into them.
🪛 Ruff (0.14.0)
scripts/teigaku_genzei/make_seed.py

35-35: Avoid specifying long messages outside the exception class

(TRY003)


69-69: Avoid specifying long messages outside the exception class

(TRY003)


71-71: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (4)
scripts/teigaku_genzei/make_seed.py (4)

33-39: LGTM with minor style note.

The function correctly guards against empty DataFrames and uses positional indexing. The offset logic (idx - step) % len(icl_df) implements a sliding window pattern to avoid assigning ICL samples that are too close to their corresponding contexts.

Note: Static analysis suggests defining custom exception subclasses instead of inline messages, but this is a minor style preference and acceptable as-is.
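
A toy illustration of how that modular offset wraps (sizes invented; the exact arithmetic in make_seed.py may differ, so this only shows the (idx - step) % n wrap-around itself):

n_icl = 7   # number of ICL samples
step = 2    # samples assigned per context

def window(idx: int) -> list[int]:
    # Start `step` positions behind idx, wrapping around the ICL list.
    start = (idx - step) % n_icl
    return [(start + j) % n_icl for j in range(step)]

for context_idx in range(4):
    print(context_idx, window(context_idx))
# 0 [5, 6]
# 1 [6, 0]
# 2 [0, 1]
# 3 [1, 2]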


81-91: LGTM!

The cartesian join implementation correctly pairs each context with all ICL samples using DataFrame broadcasting. The index reset and qindex removal are appropriate.


92-93: LGTM!

The exception is now correctly raised for invalid join methods, addressing the critical issue from the past review.


120-122: LGTM!

Standard Python entry point implementation.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
scripts/teigaku_genzei/make_icl.py (1)

103-103: Ensure UTF‑8 when writing JSONL (in jsonl_util).

jsonl_util.write_jsonl_file opens its output file without an explicit encoding, so Japanese text may break on systems whose default encoding is not UTF-8. Please update that helper to open the file with encoding="utf-8" (it already serializes with ensure_ascii=False, and reads already use UTF-8). This affects this script's output.

Proposed change in scripts/teigaku_genzei/jsonl_util.py:

-def write_jsonl_file(file_path: str, data_list: list[Any]):
-    with open(file_path, 'w') as file:
+def write_jsonl_file(file_path: str, data_list: list[Any]):
+    with open(file_path, 'w', encoding='utf-8') as file:
         for data in data_list:
             file.write(json.dumps(data, ensure_ascii=False) + '\n')
scripts/teigaku_genzei/make_seed.py (1)

88-88: Raise a regular ValueError for invalid join method.

argparse.ArgumentError is intended for parser internals; raising it here can be awkward. ValueError is clearer.

-        raise argparse.ArgumentError(None, f"Invalid {ARG_JOIN_METHOD}: {opt_join_method}")
+        raise ValueError(f"Invalid {ARG_JOIN_METHOD}: {opt_join_method}")
🧹 Nitpick comments (15)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (6)

10-20: Remove commented-out code.

The commented prepare method and debug print statements should be removed to improve code clarity, unless they're needed for immediate debugging.

Apply this diff:

 class LLMJudgeDirectPositionalBias(LLMJudgeDirect):
-    # def prepare(self):
-    #     super().prepare()
-    #     print(f"LLMJudgeDirectPositionalBias.prepare() criteria: {self.criteria}")
-    #     print(f"LLMJudgeDirectPositionalBias.prepare() inference_engine: {self.inference_engine}")
-    #     if self.criteria is not None:
-    #         self.ci_scores = [
-    #             self.criteria.name,
-    #             self.criteria.name + "_positional_bias_score",
-    #             self.criteria.name + "_positional_bias_average",
-    #         ]
-    #         print(f"LLMJudgeDirectPositionalBias.prepare() ci_scores: {self.ci_scores}")
-
     POSTFIX_SCORE = "_positional_bias_score"

47-47: Replace print statement with logging.

Debug print statements should use proper logging for production code.

Apply this diff:

+from unitxt.logging_utils import get_logger
+
+logger = get_logger()

Then:

-            print(f"LLMJudgeDirectPositionalBias.prepare() ci_scores: {self.ci_scores}")
+            logger.info(f"LLMJudgeDirectPositionalBias ci_scores: {self.ci_scores}")

63-64: Remove commented-out code.

The old implementation should be removed to improve readability.

Apply this diff:

 pb_scores = [
-    # self.criteria.option_map.get(result.get(backward_selected_option_name, "NO_RESULT"), 1)
-    # if (self.criteria is not None) and (self.criteria.option_map is not None) else 1
     get_pb_score(result, self.criteria, result[self.main_score]) for result in results
 ]

48-48: Consider using unpacking syntax for list concatenation.

The static analysis suggests using unpacking for better readability and performance.

Apply this diff:

-            self.reduction_map = {"mean": [self.main_score] + self.ci_scores}
+            self.reduction_map = {"mean": [self.main_score, *self.ci_scores]}

Based on learnings.


72-72: Add explicit strict= parameter to zip().

Python 3.10+ supports the strict parameter to catch length mismatches. Since results and pb_scores should always have the same length, using strict=True adds a safety check.

Apply this diff:

 pb_results = [
     result | {
         backward_score_name: pb_score,
         average_score_name: (result[self.main_score] + pb_score) / 2,
     }
-    for result, pb_score in zip(results, pb_scores)
+    for result, pb_score in zip(results, pb_scores, strict=True)
 ]

Based on learnings.


92-104: LGTM! Base URL construction properly uses the static helper.

The use of RITSInferenceEngineFixed.get_base_url_from_model_name_v() correctly addresses the previous DRY violation by delegating URL construction to the inference engine.

However, consider removing the commented-out code at line 99 for cleanliness:

             model_name=llm_as_judge_direct_rits_llama4.MODEL_NAME,
             base_url=RITSInferenceEngineFixed.get_base_url_from_model_name_v(llm_as_judge_direct_rits_llama4.ENDPOINT_NAME, "/v1"),
-            # base_url=RITSInferenceEngine.get_base_url_from_model_name(llm_as_judge_direct_rits_llama4.ENDPOINT_NAME) + "/v1",
         ),
scripts/teigaku_genzei/json_util.py (2)

2-2: Drop unused typing imports.

They aren’t used here; remove to reduce noise.

-from typing import cast, Any, Hashable, Sequence, Collection

7-7: Silence intentional full‑width character warnings (RUF001).

These regexes intentionally match full‑width digits/dashes and full‑width “A”. Add noqa notes to prevent false positives.

-    return re.match(r"[0-9０-９]+[－ー][0-9０-９]+", text) is not None
+    return re.match(r"[0-9０-９]+[－ー][0-9０-９]+", text) is not None  # noqa: RUF001 (intentional full-width chars)

-    return re.match(r"([0-9０-９]+)[－ー]([0-9０-９]+)", text)
+    return re.match(r"([0-9０-９]+)[－ー]([0-9０-９]+)", text)  # noqa: RUF001 (intentional full-width chars)

-    return re.match(r"^\[Ａ\]", text) is not None
+    return re.match(r"^\[Ａ\]", text) is not None  # noqa: RUF001 (intentional full-width Ａ)

Also applies to: 10-10, 16-16

scripts/teigaku_genzei/README.md (1)

58-60: Add language to fenced code blocks for better rendering and copy/paste.

Specify a language (bash) for CLI snippets.

Example:

-```
-Usage: python json_qa.py <input_json_file> <output_csv_file>
-```
+```bash
+# Usage
+python json_qa.py <input_json_file> <output_csv_file>
+```

Apply similarly to the other code fences in the listed ranges.

Also applies to: 66-68, 115-117, 140-143, 196-197, 210-212, 224-226

scripts/teigaku_genzei/json_qa.py (4)

7-7: Remove unused import.

json isn’t used in this module.

-import json

51-51: Make zip strict to catch misaligned markers.

Fails fast if counts of titles, 問, and [A] diverge.

-for (ti, qhi, ahi, ati) in zip(title_pos_list, qhead_pos_list, ahead_pos_list, atail_pos_list):
+for (ti, qhi, ahi, ati) in zip(title_pos_list, qhead_pos_list, ahead_pos_list, atail_pos_list, strict=True):

If you expect occasional mismatches in real PDFs, prefer an explicit alignment routine instead of strict zip. Do you want a helper to align nearest markers?


66-68: Show the actual script name in usage.

Use argv[0] so copy/paste works regardless of filename.

-    if len(sys.argv) != 3:
-        print("Usage: python script.py <input_json_file> <output_csv_file>")
+    if len(sys.argv) != 3:
+        print(f"Usage: python {sys.argv[0]} <input_json_file> <output_csv_file>")
         sys.exit(1)

38-38: Consider joining with newlines for readability.

Concatenating without separators can jam sentences; newlines often preserve layout from Docling better.

-    return "".join(block_text_list)
+    return "\n".join(block_text_list)

If downstream expects a single line, keep current behavior.

scripts/teigaku_genzei/make_seed.py (1)

2-6: Prune unused imports.

Tighten to what’s actually used.

-import sys
 import argparse
-from typing import cast, Any, Hashable, Sequence, Collection
-import json
+from typing import cast, Any
 import pandas as pd
scripts/teigaku_genzei/make_icl.py (1)

1-5: Trim unused typing/sys imports.

Keep only what’s referenced.

-import sys
-from typing import cast, Any, Hashable, Sequence, Collection
+from typing import cast, Any
 import json
 import argparse
 import pandas as pd
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f05bad and 2d5962d.

📒 Files selected for processing (7)
  • scripts/teigaku_genzei/README.md (1 hunks)
  • scripts/teigaku_genzei/json_qa.py (1 hunks)
  • scripts/teigaku_genzei/json_util.py (1 hunks)
  • scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1 hunks)
  • scripts/teigaku_genzei/make_icl.py (1 hunks)
  • scripts/teigaku_genzei/make_icl.sh (1 hunks)
  • scripts/teigaku_genzei/make_seed.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/teigaku_genzei/make_icl.sh
🧰 Additional context used
🧬 Code graph analysis (4)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (3)
  • RITSInferenceEngineFixed (17-30)
  • register (33-57)
  • get_base_url_from_model_name_v (20-21)
scripts/teigaku_genzei/make_seed.py (3)
scripts/teigaku_genzei/make_icl.py (2)
  • config (22-41)
  • main (86-112)
scripts/teigaku_genzei/seed_jsonl_2_qna_yaml.py (1)
  • main (24-52)
scripts/teigaku_genzei/jsonl_util.py (1)
  • read_jsonl_file (14-19)
scripts/teigaku_genzei/json_qa.py (2)
scripts/teigaku_genzei/json_glossary.py (3)
  • is_body_text (18-19)
  • is_section_title (32-33)
  • save_to_csv (48-53)
scripts/teigaku_genzei/json_util.py (4)
  • is_title (6-7)
  • is_qhead (12-13)
  • is_ahead (15-16)
  • read_json_file (18-21)
scripts/teigaku_genzei/make_icl.py (2)
scripts/teigaku_genzei/context_util.py (1)
  • compose_context (20-36)
scripts/teigaku_genzei/jsonl_util.py (1)
  • write_jsonl_file (7-12)
🪛 LanguageTool
scripts/teigaku_genzei/README.md

[grammar] ~55-~55: Use a hyphen to join words.
Context: ...structing the SDG seed data. ## Command line interface json_qa.py is used to ...

(QB_NEW_EN_HYPHEN)


[style] ~90-~90: Consider using “who” when you are referring to a person instead of an object.
Context: ...ositional_bias`, then an extended judge that reflects the result of the positional b...

(THAT_WHO)


[grammar] ~111-~111: Use a hyphen to join words.
Context: ...-8b-instruct-sft-checkpoint` ## Command line interface The test and evaluation i...

(QB_NEW_EN_HYPHEN)


[style] ~133-~133: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ed data classification of LLM services. See [Sensitive data in unitxt — Unitxt](htt...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~181-~181: ‘exactly the same’ might be wordy. Consider a shorter alternative.
Context: ...might let the teacher model to generate exactly the same questions for training, which can lead ...

(EN_WORDINESS_PREMIUM_EXACTLY_THE_SAME)


[style] ~186-~186: ‘exactly the same’ might be wordy. Consider a shorter alternative.
Context: ...ents and query - response pairs are not exactly the same with the seed contexts, which is explai...

(EN_WORDINESS_PREMIUM_EXACTLY_THE_SAME)


[grammar] ~192-~192: Use a hyphen to join words.
Context: ... examples and seed contexts. ## Command line interface make_context.py is use...

(QB_NEW_EN_HYPHEN)


[grammar] ~220-~220: Use a hyphen to join words.
Context: ... - I.e., it contains one document (so called "context") and three query - resp...

(QB_NEW_EN_HYPHEN)


[style] ~232-~232: Consider removing “of” to be more concise
Context: ...cartesian` - Assign each ICL example to all of the contexts. See the description in ["Proc...

(ALL_OF_THE)

🪛 markdownlint-cli2 (0.18.1)
scripts/teigaku_genzei/README.md

7-7: Bare URL used

(MD034, no-bare-urls)


10-10: Bare URL used

(MD034, no-bare-urls)


11-11: Bare URL used

(MD034, no-bare-urls)


20-20: Bare URL used

(MD034, no-bare-urls)


21-21: Bare URL used

(MD034, no-bare-urls)


35-35: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


37-37: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


38-38: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


38-38: Bare URL used

(MD034, no-bare-urls)


39-39: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


40-40: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


41-41: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


46-46: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


47-47: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


48-48: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


50-50: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


51-51: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


52-52: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


53-53: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


58-58: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


66-66: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


79-79: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


81-81: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


82-82: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


82-82: Bare URL used

(MD034, no-bare-urls)


83-83: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


88-88: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


89-89: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


90-90: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


91-91: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


92-92: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


94-94: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


95-95: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


96-96: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


97-97: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


101-101: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


102-102: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


103-103: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


104-104: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


105-105: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


106-106: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


107-107: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


109-109: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


115-115: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


130-130: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


131-131: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


132-132: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


140-140: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


147-147: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


155-155: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


156-156: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


157-157: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


158-158: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


161-161: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)


162-162: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


163-163: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


165-165: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


166-166: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


167-167: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


179-179: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


180-180: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


181-181: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


183-183: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


184-184: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


185-185: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


186-186: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


188-188: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


189-189: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


190-190: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


195-195: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


210-210: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


219-219: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


220-220: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


224-224: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Pylint (4.0.1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py

[refactor] 9-9: Too few public methods (1/2)

(R0903)

🪛 Ruff (0.14.1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py

48-48: Consider [self.main_score, *self.ci_scores] instead of concatenation

Replace with [self.main_score, *self.ci_scores]

(RUF005)


72-72: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

scripts/teigaku_genzei/make_seed.py

35-35: Avoid specifying long messages outside the exception class

(TRY003)


64-64: Avoid specifying long messages outside the exception class

(TRY003)


66-66: Avoid specifying long messages outside the exception class

(TRY003)

scripts/teigaku_genzei/json_qa.py

22-22: String contains ambiguous `（` (FULLWIDTH LEFT PARENTHESIS). Did you mean `(` (LEFT PARENTHESIS)?

(RUF001)


22-22: String contains ambiguous `）` (FULLWIDTH RIGHT PARENTHESIS). Did you mean `)` (RIGHT PARENTHESIS)?

(RUF001)


33-33: Comment contains ambiguous `Ａ` (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean `A` (LATIN CAPITAL LETTER A)?

(RUF003)


35-35: Comment contains ambiguous `１` (FULLWIDTH DIGIT ONE). Did you mean `1` (DIGIT ONE)?

(RUF003)


49-49: Consider [*title_pos_list[1:], len(text_list)] instead of concatenation

Replace with [*title_pos_list[1:], len(text_list)]

(RUF005)


51-51: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

scripts/teigaku_genzei/make_icl.py

77-77: String contains ambiguous `Ａ` (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean `A` (LATIN CAPITAL LETTER A)?

(RUF001)

scripts/teigaku_genzei/json_util.py

7-7: String contains ambiguous `０` (FULLWIDTH DIGIT ZERO). Did you mean `0` (DIGIT ZERO)?

(RUF001)


7-7: String contains ambiguous `９` (FULLWIDTH DIGIT NINE). Did you mean `9` (DIGIT NINE)?

(RUF001)


7-7: String contains ambiguous `－` (FULLWIDTH HYPHEN-MINUS). Did you mean `-` (HYPHEN-MINUS)?

(RUF001)


7-7: String contains ambiguous `０` (FULLWIDTH DIGIT ZERO). Did you mean `0` (DIGIT ZERO)?

(RUF001)


7-7: String contains ambiguous `９` (FULLWIDTH DIGIT NINE). Did you mean `9` (DIGIT NINE)?

(RUF001)


10-10: String contains ambiguous `０` (FULLWIDTH DIGIT ZERO). Did you mean `0` (DIGIT ZERO)?

(RUF001)


10-10: String contains ambiguous `９` (FULLWIDTH DIGIT NINE). Did you mean `9` (DIGIT NINE)?

(RUF001)


10-10: String contains ambiguous `－` (FULLWIDTH HYPHEN-MINUS). Did you mean `-` (HYPHEN-MINUS)?

(RUF001)


10-10: String contains ambiguous `０` (FULLWIDTH DIGIT ZERO). Did you mean `0` (DIGIT ZERO)?

(RUF001)


10-10: String contains ambiguous `９` (FULLWIDTH DIGIT NINE). Did you mean `9` (DIGIT NINE)?

(RUF001)


16-16: String contains ambiguous `Ａ` (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean `A` (LATIN CAPITAL LETTER A)?

(RUF001)

🔇 Additional comments (1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)

1-7: LGTM!

The imports are well-organized and appropriate for the functionality provided.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (4)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (4)

10-20: Remove commented-out code.

The commented prepare() method serves no purpose and reduces readability. If this code might be needed for debugging, use version control to retrieve it later.

Apply this diff:

 class LLMJudgeDirectPositionalBias(LLMJudgeDirect):
-    # def prepare(self):
-    #     super().prepare()
-    #     print(f"LLMJudgeDirectPositionalBias.prepare() criteria: {self.criteria}")
-    #     print(f"LLMJudgeDirectPositionalBias.prepare() inference_engine: {self.inference_engine}")
-    #     if self.criteria is not None:
-    #         self.ci_scores = [
-    #             self.criteria.name,
-    #             self.criteria.name + "_positional_bias_score",
-    #             self.criteria.name + "_positional_bias_average",
-    #         ]
-    #         print(f"LLMJudgeDirectPositionalBias.prepare() ci_scores: {self.ci_scores}")
-
     POSTFIX_SCORE = "_positional_bias_score"

47-47: Replace print with logging.

Production code should use proper logging instead of print statements for observability and control.

Add logging import at the top:

import logging

logger = logging.getLogger(__name__)

Then apply this diff:

-            print(f"LLMJudgeDirectPositionalBias.prepare() ci_scores: {self.ci_scores}")
+            logger.debug(f"LLMJudgeDirectPositionalBias ci_scores: {self.ci_scores}")

48-48: Consider unpacking for list concatenation.

Static analysis suggests using unpacking instead of list concatenation for better readability.

Apply this diff:

-            self.reduction_map = {"mean": [self.main_score] + self.ci_scores}
+            self.reduction_map = {"mean": [self.main_score, *self.ci_scores]}

100-100: Remove commented-out code.

The old approach is no longer needed since the fix is in place.

Apply this diff:

             base_url=RITSInferenceEngineFixed.get_base_url_from_model_name_v(llm_as_judge_direct_rits_llama4.ENDPOINT_NAME, "/v1"),
-            # base_url=RITSInferenceEngine.get_base_url_from_model_name(llm_as_judge_direct_rits_llama4.ENDPOINT_NAME) + "/v1",
         ),
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5321a5b and 9abc26b.

📒 Files selected for processing (1)
  • scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (3)
  • RITSInferenceEngineFixed (17-30)
  • register (33-57)
  • get_base_url_from_model_name_v (20-21)
🪛 Pylint (4.0.1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py

[refactor] 9-9: Too few public methods (1/2)

(R0903)

🪛 Ruff (0.14.1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py

48-48: Consider [self.main_score, *self.ci_scores] instead of concatenation

Replace with [self.main_score, *self.ci_scores]

(RUF005)


73-73: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

Comment on lines 50 to 61
        def get_pb_score(
                result: Dict[str, str],
                criteria: CriteriaWithOptions,
                fallback_score: float,
                out_of_range_score: float,
                ) -> float:
            option_map = criteria.option_map if (criteria is not None) and (criteria.option_map is not None) else None
            if option_map is None:
                return fallback_score
            selected_option = result.get(backward_selected_option_name)
            score = option_map.get(selected_option, out_of_range_score) if selected_option is not None else fallback_score
            return score

⚠️ Potential issue | 🟠 Major

Inconsistent fallback for invalid selected options.

Line 60 uses out_of_range_score (0.0) as the default when selected_option is not found in option_map. However, the past review comment for these lines suggested using fallback_score instead, which would be the forward score.

The current behavior penalizes unexpected options with 0.0, but this may not be the intended design. If an invalid option appears (e.g., due to model output variation), falling back to the forward score would be more robust than defaulting to 0.0.

Apply this diff to align with the past review suggestion:

-            score = option_map.get(selected_option, out_of_range_score) if selected_option is not None else fallback_score
+            score = option_map.get(selected_option, fallback_score) if selected_option is not None else fallback_score

If penalizing invalid options with 0.0 is intentional, consider documenting this behavior in a comment or docstring.

🤖 Prompt for AI Agents
In scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py around lines 50
to 61, the get_pb_score function currently uses out_of_range_score when a
selected_option is present but not found in criteria.option_map; change that
default to fallback_score so invalid/unrecognized options fall back to the
forward score (retain using fallback_score when selected_option is None), i.e.,
replace option_map.get(selected_option, out_of_range_score) with
option_map.get(selected_option, fallback_score); if the original 0.0 penalty was
intentional, instead add a clarifying comment/docstring explaining that
behavior.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (2)

108-116: Optional: let RITSInferenceEngineFixed derive base_url; avoid duplication.

You can omit base_url and rely on prepare_engine() to append “/v1”.

-            base_url=RITSInferenceEngineFixed.get_base_url_from_model_name_v(llm_as_judge_direct_rits_llama4.ENDPOINT_NAME, "/v1"),
-            # base_url=RITSInferenceEngine.get_base_url_from_model_name(llm_as_judge_direct_rits_llama4.ENDPOINT_NAME) + "/v1",

50-77: Invalid backward selected option should fall back to forward score; update types/docs.

Penalizing unrecognized options with 0.0 is brittle; prefer fallback to the forward score for robustness, or clearly document the intentional penalty.

-        def get_pb_score(
-                result: Dict[str, str], 
-                criteria: CriteriaWithOptions,
+        def get_pb_score(
+                result: Dict[str, Any],
+                criteria: CriteriaWithOptions | None,
                 fallback_score: float,
                 default_pb_score: float,
                 ) -> float:
@@
-            PB (positional bias) is the difference of the selection of an option between forward- and backward-ordered options.
-            To mitigate PB, we compute the average of the scores associated with the selected options from forward- and backward-ordered options.
+            PB (positional bias) arises when option selection differs between forward- and backward-ordered options.
+            To mitigate PB, we compute the average of the scores associated with the selected options from forward- and backward-ordered options.
@@
-            :param default_pb_score: The score to use if the selected option is not found in criteria.option_map.
-                Note that this is an erroneous case; this score is used only when the judge's inference result does not contain a meaningful option value.
+            :param default_pb_score: The score to use only when the judge's output lacks a meaningful option value entirely.
@@
-            score = option_map.get(selected_option, default_pb_score) if selected_option is not None else fallback_score
+            # Fall back to forward score when the backward-selected option is invalid/unrecognized.
+            score = option_map.get(selected_option, fallback_score) if selected_option is not None else fallback_score

If the 0.0 penalty is intentional, keep the current behavior but add a brief comment explaining why penalization is preferred over fallback.
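
As a toy illustration of the mitigation itself (scores invented):

# Forward pass: judge sees the options in their original order.
# Backward pass: the same options reversed.
forward_score = 1.0   # score of the option selected in the forward pass
backward_score = 0.0  # score of the option selected in the backward pass

# Averaging dampens positional bias: a judge that flips its choice when the
# options are reordered lands in the middle instead of at either extreme.
positional_bias_average = (forward_score + backward_score) / 2
print(positional_bias_average)  # 0.5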

🧹 Nitpick comments (1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)

41-49: Remove print and tighten reduction_map construction.

Avoid stdout in libraries; use logging or drop it. Also prefer splat to concat.

-            print(f"LLMJudgeDirectPositionalBias.prepare() ci_scores: {self.ci_scores}")
-            self.reduction_map = {"mean": [self.main_score] + self.ci_scores}
+            # Debug logging if needed (e.g., logger.debug)
+            self.reduction_map = {"mean": [self.main_score, *self.ci_scores]}

Optionally, set ci_scores/reduction_map in prepare() to avoid recomputing each call.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9abc26b and 5ba0bf4.

📒 Files selected for processing (1)
  • scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (3)
  • RITSInferenceEngineFixed (17-30)
  • register (33-57)
  • get_base_url_from_model_name_v (20-21)
🪛 Pylint (4.0.1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py

[refactor] 9-9: Too few public methods (1/2)

(R0903)

🪛 Ruff (0.14.1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py

48-48: Consider [self.main_score, *self.ci_scores] instead of concatenation

Replace with [self.main_score, *self.ci_scores]

(RUF005)

🔇 Additional comments (1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)

79-89: Good defensive coding on PB merge.

Using result.get(..., default_pb_score) and zip(..., strict=True) avoids KeyError and length bugs. LGTM.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (13)
scripts/teigaku_genzei/README.md (3)

63-63: Add language identifiers to fenced code blocks for proper syntax highlighting.

Fenced code blocks lack language specifications, which prevents syntax highlighting and reduces readability. All code blocks should declare their language.

Apply this diff to add language identifiers where missing:

- Line 63:
- ```
+ ```bash
  Usage: python json_qa.py <input_json_file> <output_csv_file>
- ```
+ ```

- Line 71:
- ```
+ ```bash
  Usage: python json_glossary.py <input_json_file> <output_csv_file>
- ```
+ ```

- Line 120:
- ```
+ ```bash
  python test_qa.py [-h] [--model_path MODEL_PATH] ... [--instruction INSTRUCTION]
- ```
+ ```

- Line 145:
- ```
+ ```csv
  Title,Question,Answer
  "Q1","What is the capital of France?","Paris"
- ```
+ ```

- Line 200:
- ```
+ ```bash
  usage: make_context.py [-h] [--glossary_option {appendix,header,none}] ...
- ```
+ ```

- Line 215:
- ```
+ ```bash
  usage: make_icl.py [-h] [--short_context SHORT_CONTEXT}] ...
- ```
+ ```

- Line 229:
- ```
+ ```bash
  usage: make_seed.py [-h] [--context_file CONTEXT_FILE] ...
- ```
+ ```

Also applies to: 71-71, 120-120, 145-145, 200-200, 215-215, 229-229


40-102: Fix list indentation to comply with markdown standards.

Multiple list items throughout the document have inconsistent indentation (4 spaces instead of 2, or 8 instead of 4). Correct indentation improves readability and ensures proper markdown rendering.

Key areas to fix:

  • Lines 40–46 (Prerequisites section): Reduce from 4 spaces to 2; nested items at 46 should be at 4 not 8
  • Lines 51–58 (Procedure section): Reduce from 4 spaces to 2; nested items at 57–58 should be at 4 not 8
  • Lines 84–114 (Testing Prerequisites, One-time setup, Procedure): Reduce indentation consistently to 2 for top-level, 4 for nested
  • Lines 152, 160–172 (Input/Output format notes): Reduce from 4 to 2 spaces
  • Lines 184–195 (SDG seed generation Procedure): Reduce from 4 to 2 spaces

Also applies to: 152-152, 160-172, 184-195


7-7: Wrap bare URLs in markdown link syntax.

Several bare URLs should be wrapped in markdown link format [text](url) for better documentation structure and link recognition.

Examples to fix:

  • Line 7: https://www.nta.go.jp/users/gensen/teigakugenzei/01.htm[Teigaku Genzei — NTA](https://www.nta.go.jp/users/gensen/teigakugenzei/01.htm)
  • Line 10: https://www.nta.go.jp/publication/pamph/gensen/0024001-021.pdf → wrap as link
  • Line 43: https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub → already appears in surrounding text; ensure markdown link format

Also applies to: 10-10, 11-11, 20-20, 21-21, 43-43, 87-87

scripts/teigaku_genzei/teigaku_genzei.ipynb (2)

217-219: Consider using logger.exception for better debugging.

Using logger.exception instead of logger.error will include the full traceback, which is helpful for debugging export failures.

         except Exception as e:
-            logger.error(f"Failed to export {format_name} format: {str(e)}")
+            logger.exception(f"Failed to export {format_name} format")
             raise

258-261: Exception handling after re-raising is inconsistent.

The export_document function re-raises exceptions after logging (line 219), but export_document_new_docling catches and continues (line 261). This means if one format fails in export_document, the entire document is marked as failed even though other formats may have succeeded.

Consider whether the raise at line 219 should be removed to allow partial exports, or if export_document_new_docling should not catch and continue. The current behavior may not match the intended error handling strategy.

scripts/teigaku_genzei/json_util.py (2)

2-2: Remove unused imports.

The imports cast, Hashable, Sequence, and Collection are not used in this file.

-from typing import cast, Any, Hashable, Sequence, Collection
+from typing import Any

6-19: Add comments explaining fullwidth character patterns.

The fullwidth characters (0-9, -, ー, A) in the regex patterns are intentional for matching Japanese document formats. Adding brief comments would help maintainers understand this is not a typo and suppress linter warnings.

 def is_title(text: str) -> bool:
+    # Fullwidth digits and hyphens are intentional for Japanese document format
     return re.match(r"[0-9０-９]+[－ー][0-9０-９]+", text) is not None

Alternatively, you can add a # noqa: RUF001 comment to suppress the linter warnings for these lines.
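
A quick standalone check that such a fullwidth class behaves as intended (assuming the [－ー] separator class shown above):

import re

# Fullwidth digits ０-９ and the separators －/ー are distinct code points
# from their ASCII lookalikes, so both ranges belong in the class.
pattern = re.compile(r"[0-9０-９]+[－ー][0-9０-９]+")

print(bool(pattern.match("１－１ 定額減税とは")))  # True (fullwidth)
print(bool(pattern.match("1ー2 対象者")))          # True (mixed)
print(bool(pattern.match("はじめに")))             # False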

scripts/teigaku_genzei/json_qa.py (1)

7-7: Remove unused import.

The json module is imported but never used in this file.

-import json
scripts/teigaku_genzei/make_seed.py (2)

2-5: Remove unused imports.

Several imports are not used in this file: sys, json, Hashable, Sequence, Collection.

-import sys
-import argparse
-from typing import cast, Any, Hashable, Sequence, Collection
-import json
+import argparse
+from typing import cast, Any

92-101: Consider making metadata configurable.

The document_outline, document_title, and domain values are hardcoded. For reusability across different document sets, consider making these configurable via CLI arguments or a configuration file.

This is acceptable for the current example scope, but if you plan to extend this for other document sources, parameterizing these values would be beneficial.

scripts/teigaku_genzei/make_icl.py (3)

1-6: Remove unused imports.

Several imports are not used: sys, json, Hashable, Sequence, Collection.

-import sys
-from typing import cast, Any, Hashable, Sequence, Collection
-import json
+from typing import cast, Any
 import argparse

76-78: Fullwidth character is intentional - add comment for clarity.

The [Ａ] prefix removal uses fullwidth characters matching the Japanese document format. Consider adding a brief comment to clarify this is intentional, similar to the suggestion for json_util.py.

     icl_qa_list = [
         {
             "icl_query_" + str(i + 1): qa_df.loc[qa]["Question"].removeprefix("問 ").strip(),
+            # Fullwidth [Ａ] matches the Japanese answer marker format
             "icl_response_" + str(i + 1): qa_df.loc[qa]["Answer"].removeprefix("[Ａ]").strip(),
         } for (i, qa) in enumerate(context_qa_index[1])
     ]

112-119: Clean up or track TODO comments.

These TODO comments contain incomplete implementation notes. Consider either:

  1. Removing them if no longer relevant
  2. Converting to a tracked issue for future work
  3. Completing the thought with actionable items

Would you like me to help create an issue to track the context size optimization work mentioned in these TODOs?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ba0bf4 and 9ee38e5.

📒 Files selected for processing (8)
  • scripts/teigaku_genzei/README.md (1 hunks)
  • scripts/teigaku_genzei/json_qa.py (1 hunks)
  • scripts/teigaku_genzei/json_util.py (1 hunks)
  • scripts/teigaku_genzei/make_context.py (1 hunks)
  • scripts/teigaku_genzei/make_icl.py (1 hunks)
  • scripts/teigaku_genzei/make_seed.py (1 hunks)
  • scripts/teigaku_genzei/seed/.gitignore (1 hunks)
  • scripts/teigaku_genzei/teigaku_genzei.ipynb (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • scripts/teigaku_genzei/seed/.gitignore
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/teigaku_genzei/make_context.py
🧰 Additional context used
🧬 Code graph analysis (3)
scripts/teigaku_genzei/make_seed.py (1)
scripts/teigaku_genzei/jsonl_util.py (1)
  • read_jsonl_file (12-17)
scripts/teigaku_genzei/json_qa.py (1)
scripts/teigaku_genzei/json_util.py (5)
  • is_title (6-7)
  • is_toc (9-10)
  • is_qhead (15-16)
  • is_ahead (18-19)
  • read_json_file (21-24)
scripts/teigaku_genzei/make_icl.py (5)
scripts/teigaku_genzei/make_context.py (2)
  • config (27-43)
  • main (71-85)
scripts/teigaku_genzei/make_seed.py (2)
  • config (17-31)
  • main (104-117)
scripts/teigaku_genzei/context_util.py (1)
  • compose_context (20-36)
scripts/teigaku_genzei/seed_jsonl_2_qna_yaml.py (1)
  • main (24-52)
scripts/teigaku_genzei/jsonl_util.py (1)
  • write_jsonl_file (7-10)
🪛 LanguageTool
scripts/teigaku_genzei/README.md

[grammar] ~60-~60: Use a hyphen to join words. (QB_NEW_EN_HYPHEN) Context: ...structing the SDG seed data. ## Command line interface json_qa.py is used to ...
[style] ~95-~95: Consider using “who” when you are referring to a person instead of an object. (THAT_WHO) Context: ...ositional_bias`, then an extended judge that reflects the result of the positional b...
[grammar] ~116-~116: Use a hyphen to join words. (QB_NEW_EN_HYPHEN) Context: ...-8b-instruct-sft-checkpoint` ## Command line interface The test and evaluation i...
[style] ~138-~138: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. (ENGLISH_WORD_REPEAT_BEGINNING_RULE) Context: ...ed data classification of LLM services. See [Sensitive data in unitxt — Unitxt](htt...
[style] ~186-~186: ‘exactly the same’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_EXACTLY_THE_SAME) Context: ...might let the teacher model to generate exactly the same questions for training, which can lead ...
[style] ~191-~191: ‘exactly the same’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_EXACTLY_THE_SAME) Context: ...ents and query - response pairs are not exactly the same with the seed contexts, which is explai...
[grammar] ~197-~197: Use a hyphen to join words. (QB_NEW_EN_HYPHEN) Context: ... examples and seed contexts. ## Command line interface make_context.py is use...
[grammar] ~225-~225: Use a hyphen to join words. (QB_NEW_EN_HYPHEN) Context: ... - I.e., it contains one document (so called "context") and three query - resp...
[style] ~237-~237: Consider removing “of” to be more concise (ALL_OF_THE) Context: ...cartesian` - Assign each ICL example to all of the contexts. See the description in ["Proc...

🪛 markdownlint-cli2 (0.18.1)
scripts/teigaku_genzei/README.md

7-7: Bare URL used (MD034, no-bare-urls)
10-10: Bare URL used (MD034, no-bare-urls)
11-11: Bare URL used (MD034, no-bare-urls)
20-20: Bare URL used (MD034, no-bare-urls)
21-21: Bare URL used (MD034, no-bare-urls)
40-40: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
42-42: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
43-43: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
43-43: Bare URL used (MD034, no-bare-urls)
44-44: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
45-45: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
46-46: Unordered list indentation: expected 4, actual 8 (MD007, ul-indent)
51-51: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
52-52: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
53-53: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
55-55: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
56-56: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
57-57: Unordered list indentation: expected 4, actual 8 (MD007, ul-indent)
58-58: Unordered list indentation: expected 4, actual 8 (MD007, ul-indent)
63-63: Fenced code blocks should have a language specified (MD040, fenced-code-language)
71-71: Fenced code blocks should have a language specified (MD040, fenced-code-language)
84-84: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
86-86: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
87-87: Unordered list indentation: expected 4, actual 8 (MD007, ul-indent)
87-87: Bare URL used (MD034, no-bare-urls)
88-88: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
93-93: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
94-94: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
95-95: Unordered list indentation: expected 4, actual 8 (MD007, ul-indent)
96-96: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
97-97: Unordered list indentation: expected 4, actual 8 (MD007, ul-indent)
99-99: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
100-100: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
101-101: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
102-102: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
106-106: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
107-107: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
108-108: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
109-109: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
110-110: Unordered list indentation: expected 4, actual 8 (MD007, ul-indent)
111-111: Unordered list indentation: expected 4, actual 8 (MD007, ul-indent)
112-112: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
114-114: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
120-120: Fenced code blocks should have a language specified (MD040, fenced-code-language)
135-135: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
136-136: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
137-137: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
145-145: Fenced code blocks should have a language specified (MD040, fenced-code-language)
152-152: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
160-160: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
161-161: Unordered list indentation: expected 4, actual 8 (MD007, ul-indent)
162-162: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
163-163: Unordered list indentation: expected 4, actual 8 (MD007, ul-indent)
166-166: Unordered list indentation: expected 4, actual 8 (MD007, ul-indent)
167-167: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
168-168: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
170-170: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
171-171: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
172-172: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
184-184: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
185-185: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
186-186: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
188-188: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
189-189: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
190-190: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
191-191: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
193-193: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
194-194: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
195-195: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
200-200: Fenced code blocks should have a language specified (MD040, fenced-code-language)
215-215: Fenced code blocks should have a language specified (MD040, fenced-code-language)
224-224: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
225-225: Unordered list indentation: expected 2, actual 4 (MD007, ul-indent)
229-229: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🪛 Ruff (0.14.8)
scripts/teigaku_genzei/teigaku_genzei.ipynb

125-125: Use logging.exception instead of logging.error. Replace with exception. (TRY400)
125-125: Use explicit conversion flag. Replace with conversion flag. (RUF010)
166-166: Do not catch blind exception: Exception (BLE001)
168-168: Use logging.exception instead of logging.error. Replace with exception. (TRY400)
168-168: Use explicit conversion flag. Replace with conversion flag. (RUF010)

scripts/teigaku_genzei/make_seed.py

35-35: Avoid specifying long messages outside the exception class (TRY003)
55-55: Avoid specifying long messages outside the exception class (TRY003)
57-57: Avoid specifying long messages outside the exception class (TRY003)

scripts/teigaku_genzei/json_qa.py

22-22: String contains ambiguous （ (FULLWIDTH LEFT PARENTHESIS). Did you mean ( (LEFT PARENTHESIS)? (RUF001)
22-22: String contains ambiguous ） (FULLWIDTH RIGHT PARENTHESIS). Did you mean ) (RIGHT PARENTHESIS)? (RUF001)
33-33: Comment contains ambiguous Ａ (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)? (RUF003)
35-35: Comment contains ambiguous １ (FULLWIDTH DIGIT ONE). Did you mean 1 (DIGIT ONE)? (RUF003)
49-49: Consider [*title_pos_list[1:], len(text_list)] instead of concatenation. Replace with [*title_pos_list[1:], len(text_list)]. (RUF005)
51-51: zip() without an explicit strict= parameter. Add explicit value for parameter strict=. (B905)

scripts/teigaku_genzei/make_icl.py

77-77: String contains ambiguous Ａ (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)? (RUF001)

scripts/teigaku_genzei/json_util.py

7-7: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)? (RUF001)
7-7: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)? (RUF001)
7-7: String contains ambiguous － (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)? (RUF001)
7-7: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)? (RUF001)
7-7: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)? (RUF001)
13-13: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)? (RUF001)
13-13: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)? (RUF001)
13-13: String contains ambiguous － (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)? (RUF001)
13-13: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)? (RUF001)
13-13: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)? (RUF001)
19-19: String contains ambiguous Ａ (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)? (RUF001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Summary
🔇 Additional comments (3)
scripts/teigaku_genzei/json_qa.py (1)

40-56: LGTM with the zip strict fix.

The extraction logic correctly identifies title/question/answer blocks using the utility functions from json_util. The position list construction is clear, and the function handles the Japanese document structure appropriately.

scripts/teigaku_genzei/make_seed.py (1)

33-39: LGTM!

The function properly guards against empty DataFrames and uses positional indexing (iloc) to avoid label-based indexing surprises. The step calculation ensures a minimum of 1 to prevent division issues.

scripts/teigaku_genzei/make_icl.py (1)

43-52: LGTM!

The function correctly handles QA combination generation with proper edge cases: full combinations when sufficient items exist, padding with repetition when fewer items are available, and empty list for no items.
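
As a hypothetical illustration of that contract only (the actual function in make_icl.py may differ in shape and naming):

from itertools import combinations

def qa_combinations(items: list, k: int = 3) -> list[tuple]:
    if not items:
        return []  # no items: empty list
    if len(items) >= k:
        return list(combinations(items, k))  # enough items: full combinations
    padded = (items * k)[:k]  # too few items: pad by repetition
    return [tuple(padded)]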

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
scripts/teigaku_genzei/teigaku_genzei.ipynb (1)

83-86: OCR language still configured for Spanish instead of Japanese.

This issue was previously flagged: the OCR language is set to ["es"] (Spanish), but the notebook processes Japanese documents. This will result in poor OCR quality and likely failure to extract Japanese text correctly.

Apply this diff:

         "ocr": {
             "enabled": True,  # Enable/disable OCR processing
-            "languages": ["es"],  # List of language codes (e.g., eng, fra, deu)
+            "languages": ["jpn"],  # List of language codes for Japanese
         },
🧹 Nitpick comments (2)
scripts/teigaku_genzei/teigaku_genzei.ipynb (2)

217-219: Use logger.exception to preserve stack traces.

The current error logging loses stack trace information. When debugging conversion failures, the full traceback is valuable.

Apply this diff:

         except Exception as e:
-            logger.error(f"Failed to export {format_name} format: {str(e)}")
+            logger.exception(f"Failed to export {format_name} format")
             raise

Note: logger.exception automatically includes the exception message and stack trace, so str(e) is redundant.


258-260: Use logger.exception to preserve stack traces.

Similar to the export_document function, using logger.exception here will capture the full stack trace, making it easier to diagnose document conversion failures.

Apply this diff:

         except Exception as e:
             failure_count += 1
-            logger.error(f"Failed to process {file_path}: {str(e)}")
+            logger.exception(f"Failed to process {file_path}")
             continue
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ee38e5 and 402b95e.

📒 Files selected for processing (1)
  • scripts/teigaku_genzei/teigaku_genzei.ipynb (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
scripts/teigaku_genzei/teigaku_genzei.ipynb

125-125: Use logging.exception instead of logging.error. Replace with exception. (TRY400)
125-125: Use explicit conversion flag. Replace with conversion flag. (RUF010)
166-166: Do not catch blind exception: Exception (BLE001)
168-168: Use logging.exception instead of logging.error. Replace with exception. (TRY400)
168-168: Use explicit conversion flag. Replace with conversion flag. (RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Summary

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
scripts/teigaku_genzei/teigaku_genzei.ipynb (2)

206-220: Use logging.exception for better error diagnostics.

The error handling logs the exception message with logging.error(f"...{str(e)}") before re-raising. Using logging.exception automatically includes the full traceback, making debugging easier.

Apply this diff:

         except Exception as e:
-            logger.error(f"Failed to export {format_name} format: {str(e)}")
+            logger.exception(f"Failed to export {format_name} format")
             raise

Based on static analysis hints (Ruff TRY400).


249-262: Use logging.exception for better error diagnostics.

Similar to the previous error handler, using logging.exception here will automatically include the traceback for failed document conversions.

Apply this diff:

         except Exception as e:
             failure_count += 1
-            logger.error(f"Failed to process {file_path}: {str(e)}")
+            logger.exception(f"Failed to process {file_path}")
             continue

Based on static analysis hints (Ruff TRY400).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 402b95e and fb39136.

📒 Files selected for processing (1)
  • scripts/teigaku_genzei/teigaku_genzei.ipynb (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
scripts/teigaku_genzei/teigaku_genzei.ipynb

126-126: Use logging.exception instead of logging.error. Replace with exception. (TRY400)
126-126: Use explicit conversion flag. Replace with conversion flag. (RUF010)
167-167: Do not catch blind exception: Exception (BLE001)
169-169: Use logging.exception instead of logging.error. Replace with exception. (TRY400)
169-169: Use explicit conversion flag. Replace with conversion flag. (RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Summary

Comment on lines 21 to 22
"- Python >= 3.10\n",
"- sdg_hub >= 0.7.1\n",

⚠️ Potential issue | 🟡 Minor

Clarify the required sdg_hub version.

The notebook states sdg_hub >= 0.7.1 is required, but the PR description mentions testing with sdg_hub v0.2.x with Japanese support (PRs #457, #458). This version discrepancy may confuse users about which version to install.

Please update line 22 to reflect the actual tested and required version, ensuring consistency with the PR description and README.md.

🤖 Prompt for AI Agents
In scripts/teigaku_genzei/teigaku_genzei.ipynb around lines 21–22, the
requirements list incorrectly states "sdg_hub >= 0.7.1"; update line 22 to the
actual tested and required version by replacing it with "sdg_hub >= 0.2.0
(tested with v0.2.x including Japanese support PRs #457, #458)" so the notebook
matches the PR description and README; ensure the change is saved in the
notebook JSON (requirements cell/metadata) and verify README.md references
remain consistent.

Comment on lines 83 to 87
" \"ocr\": {\n",
" \"enabled\": True, # Enable/disable OCR processing\n",
" \"languages\": [\"es\"], # List of language codes (e.g., eng, fra, deu)\n",
" # Note that the OCR language has no effect in this example since we discard information from images.\n",
" },\n",

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd scripts/teigaku_genzei && ls -la | head -20

Repository: opendatahub-io/data-processing

Length of output: 1367


🏁 Script executed:

file scripts/teigaku_genzei/teigaku_genzei.ipynb

Repository: opendatahub-io/data-processing

Length of output: 118


🏁 Script executed:

# Get the size of the notebook to determine how to read it
wc -l scripts/teigaku_genzei/teigaku_genzei.ipynb

Repository: opendatahub-io/data-processing

Length of output: 121


🏁 Script executed:

# Read lines around 83-87 to see the OCR configuration context
sed -n '61,110p' scripts/teigaku_genzei/teigaku_genzei.ipynb

Repository: opendatahub-io/data-processing

Length of output: 2281


🏁 Script executed:

# Also search for where OCR is actually used/invoked in the notebook
rg -i "ocr" scripts/teigaku_genzei/teigaku_genzei.ipynb -A 2 -B 2

Repository: opendatahub-io/data-processing

Length of output: 1137


🏁 Script executed:

# Check for docling or any image/PDF processing logic
rg -i "docling\|extract_images\|image" scripts/teigaku_genzei/teigaku_genzei.ipynb -A 2 -B 2 | head -50

Repository: opendatahub-io/data-processing

Length of output: 56


🏁 Script executed:

# Let me verify if images are actually processed in the pipeline
# by looking for image-related extraction or processing
rg -i "image\|discard" scripts/teigaku_genzei/teigaku_genzei.ipynb -B 2 -A 2 | head -60

Repository: opendatahub-io/data-processing

Length of output: 56


🏁 Script executed:

# Check if there's any postprocessing that explicitly removes image content
rg -i "extract.*text\|export_to_text\|markdown" scripts/teigaku_genzei/teigaku_genzei.ipynb -B 3 -A 3 | head -80

Repository: opendatahub-io/data-processing

Length of output: 56


Update OCR language from "es" to "jpn" for Japanese document processing.

The OCR language is set to ["es"] (Spanish) but should be ["jpn"] for Japanese documents. Although the comment on line 86 states "OCR language has no effect in this example since we discard information from images," the configuration is still actively used: pipeline_options.ocr_options.lang is set to the configured language value, and pipeline_options.do_ocr is enabled. Even if image content is ultimately discarded from the output, the OCR processing runs with the wrong language setting, which is inefficient and semantically incorrect for this Japanese tax document pipeline.

🤖 Prompt for AI Agents
In scripts/teigaku_genzei/teigaku_genzei.ipynb around lines 83 to 87, the OCR
languages config is set to ["es"] (Spanish) but this pipeline processes Japanese
documents; change the value to ["jpn"] so OCR runs with Japanese language
setting. Update the "languages" array to ["jpn"] and ensure any references that
read pipeline_options.ocr_options.lang will now receive "jpn" so OCR runs
correctly for Japanese input.

…ument processing. Updated language option as a response.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (2)
scripts/teigaku_genzei/teigaku_genzei.ipynb (2)

221-223: Use logger.exception for better error diagnostics.

Using logger.exception instead of logger.error(f"...{str(e)}") automatically includes the traceback, which aids debugging.

         except Exception as e:
-            logger.error(f"Failed to export {format_name} format: {str(e)}")
+            logger.exception(f"Failed to export {format_name} format")
             raise

262-265: Use logger.exception for traceback preservation.

Similar to the earlier exception handler, using logger.exception preserves the full traceback for debugging failed conversions.

         except Exception as e:
             failure_count += 1
-            logger.error(f"Failed to process {file_path}: {str(e)}")
+            logger.exception(f"Failed to process {file_path}")
             continue
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fb39136 and 01fb654.

📒 Files selected for processing (1)
  • scripts/teigaku_genzei/teigaku_genzei.ipynb (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
scripts/teigaku_genzei/teigaku_genzei.ipynb

129-129: Use logging.exception instead of logging.error. Replace with exception. (TRY400)
129-129: Use explicit conversion flag. Replace with conversion flag. (RUF010)
170-170: Do not catch blind exception: Exception (BLE001)
172-172: Use logging.exception instead of logging.error. Replace with exception. (TRY400)
172-172: Use explicit conversion flag. Replace with conversion flag. (RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Summary
🔇 Additional comments (3)
scripts/teigaku_genzei/teigaku_genzei.ipynb (3)

61-69: LGTM on configuration variables.

The path configuration and source name lists are clearly defined. The separation of ICL and seed sources (icl_source_name vs seed_source_name) from different PDFs is a good practice for avoiding data leakage between ICL examples and seed contexts.


177-195: LGTM on pipeline configuration.

The setup_pipeline_options function properly maps configuration dictionary values to PdfPipelineOptions. The defensive isinstance check on line 184 before accessing table_structure_options.do_cell_matching is a good practice.


225-273: LGTM on document conversion workflow.

The export_document_new_docling function has good structure with:

  • Early return when no PDFs found
  • Proper directory creation with mkdir(parents=True, exist_ok=True)
  • Success/failure counting and timing metrics
  • Graceful handling of individual file failures while continuing to process remaining files

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
scripts/teigaku_genzei/json_util.py (4)

6-7: Anchor/clarify “title” matching semantics (prefix vs full-line) and DRY the regex.
Right now re.match ensures “starts with”, but "1-2abc" will return true / match. If titles must be exactly N-N, consider ^...$ (or use fullmatch). Also is_title() and get_title_match() duplicate the same pattern—worth centralizing.

+_TITLE_RE = re.compile(r"([0-90-9]+)[-ー]([0-90-9]+)")
+
 def is_title(text: str) -> bool:
-    return re.match(r"[0-90-9]+[-ー][0-90-9]+", text) is not None
+    return _TITLE_RE.fullmatch(text) is not None
 ...
 def get_title_match(text: str) -> re.Match | None:
-    return re.match(r"([0-90-9]+)[-ー]([0-90-9]+)", text)
+    return _TITLE_RE.match(text)  # or fullmatch(), depending on desired behavior

Also applies to: 12-13


15-20: Prefer startswith() for fixed prefixes (and decide about whitespace).
is_qhead() / is_ahead() are simple prefix checks; text.startswith("問") and text.startswith("[A]") are clearer and faster. If inputs may have leading whitespace, consider .lstrip() (or explicitly document they must not).

 def is_qhead(text: str) -> bool:
-    return re.match(r"^問", text) is not None
+    return text.startswith("問")
 ...
 def is_ahead(text: str) -> bool:
-    return re.match(r"^\[A\]", text) is not None
+    return text.startswith("[A]")

21-24: Add type hints for read_json_file (and accept PathLike).
This is a utility used across scripts; typing improves callers and testability.

+from pathlib import Path
+from os import PathLike
 ...
-def read_json_file(file_path):
+def read_json_file(file_path: str | PathLike[str] | Path) -> Any:
     with open(file_path, 'r', encoding='utf-8') as file:
         data = json.load(file)
     return data

6-20: Decide how you want to handle Ruff RUF001 (full-width characters in literals).
If full-width digits/A are intentional (Japanese OCR/text), either keep as-is and add targeted # noqa: RUF001 with a short comment, or switch to a Unicode-aware approach (e.g., normalize with unicodedata.normalize("NFKC", text) and use ASCII patterns).
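
A rough sketch of the NFKC route, keeping ー in the character class: NFKC folds fullwidth digits and FULLWIDTH HYPHEN-MINUS to their ASCII forms, but it does not fold the prolonged sound mark.

import re
import unicodedata

_TITLE_RE = re.compile(r"[0-9]+[-ー][0-9]+")  # ASCII-only after NFKC, plus ー

def is_title(text: str) -> bool:
    return _TITLE_RE.match(unicodedata.normalize("NFKC", text)) is not None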

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d4ff9b1 and 68b9bbf.

📒 Files selected for processing (1)
  • scripts/teigaku_genzei/json_util.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
scripts/teigaku_genzei/json_util.py

7-7: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)? (RUF001)
7-7: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)? (RUF001)
7-7: String contains ambiguous － (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)? (RUF001)
7-7: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)? (RUF001)
7-7: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)? (RUF001)
13-13: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)? (RUF001)
13-13: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)? (RUF001)
13-13: String contains ambiguous － (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)? (RUF001)
13-13: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)? (RUF001)
13-13: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)? (RUF001)
19-19: String contains ambiguous Ａ (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)? (RUF001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Summary
🔇 Additional comments (1)
scripts/teigaku_genzei/json_util.py (1)

9-10: is_toc() defensive access looks good; consider precompiling and/or broadening path match if needed.
The .get(...).get(...) avoids the prior KeyError risk. If this runs in tight loops, precompile the regex; also confirm whether #/pictures (no trailing slash) should match—current pattern requires #/pictures/.
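
For illustration, a precompiled variant that also accepts the no-trailing-slash form; the dictionary keys shown are assumptions, not the actual structure used in json_util.py:

import re

_PICTURES_RE = re.compile(r"#/pictures(?:/|$)")  # compiled once; trailing slash optional

def is_toc(item: dict) -> bool:
    ref = item.get("prov", {}).get("self_ref", "")  # hypothetical key names
    return _PICTURES_RE.search(ref) is not None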

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

♻️ Duplicate comments (8)
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/llama3_1_8b.json (1)

1-13: Looks consistent with Unitxt local_catalog schema; keep provider fields as Unitxt expects.

Given these local_catalog files are Unitxt configs, max_tokens / provider / model should remain in Unitxt’s schema (not provider-specific variants). Please just verify evaluator_name: "LLAMA3_1_8B" is unique/consistently referenced across the catalog, and that Unitxt in your tested versions accepts "provider": "watsonx" with "model": "llama-3-1-8b-instruct". Based on learnings, Unitxt performs the normalization.

notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/mixtral_large.json (1)

8-12: Clarify/verify evaluator_name vs model naming to avoid future confusion.
Given prior discussion, I’m assuming "model": "mistral-large-instruct" is a valid Unitxt internal identifier even under provider: watsonx, but pairing it with "evaluator_name": "MIXTRAL_LARGE" is still easy to misread later. Consider adding a short note in the example README (or nearby docs) stating that model strings are Unitxt identifiers and may not match the evaluator/file naming.

In Unitxt, when using `cross_provider_inference_engine` with `provider: watsonx`, what does the model identifier `mistral-large-instruct` map to, and is it the intended identifier to use for a “Mixtral Large” judge configuration?
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/o1_mini.json (1)

1-13: Keep Unitxt’s provider id as-is; please just verify o1-mini support in the cross-provider engine

Per earlier clarification, the "open-ai" provider id is a Unitxt convention and should not be changed. The remaining risk is whether cross_provider_inference_engine accepts model: "o1-mini" and maps max_tokens/seed/temperature correctly for that provider.

In Unitxt’s `cross_provider_inference_engine`, is `provider: "open-ai"` with `model: "o1-mini"` supported, and how are `max_tokens`, `seed`, and `temperature` handled for that provider?
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/gpt_4_1_mini.json (1)

1-13: LGTM; keep Unitxt-specific provider/model identifiers as-is.

notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/replicate/llama4_scout.json (1)

1-13: LGTM; evaluator naming vs model naming is intentionally Unitxt-defined.

notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/gpt_4o.json (1)

1-16: Looks good and aligns with the intended Unitxt local_catalog pattern for preset endpoints.

notebooks/teigaku_genzei/json_qa.py (1)

40-56: Avoid silent QA loss: use zip(..., strict=True) (and fail loudly on parse drift).
(Line 51)

-    for (ti, qhi, ahi, ati) in zip(title_pos_list, qhead_pos_list, ahead_pos_list, atail_pos_list):
+    for (ti, qhi, ahi, ati) in zip(title_pos_list, qhead_pos_list, ahead_pos_list, atail_pos_list, strict=True):
notebooks/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)

108-120: Registration wiring: base_url override + evaluator_name mismatch (already raised before).

  • Passing base_url=... (Line 114) largely defeats the “set base_url when None” behavior in RITSInferenceEngineFixed.prepare_engine() and reintroduces URL-construction coupling.
  • evaluator_name="LLAMA3_4_MAVERICK" (Line 117) looks inconsistent with ...llama4_maverick metric name and the “llama4” model module.
For Unitxt metrics registration, is evaluator_name used as a stable identifier (e.g., referenced by config/catalog), and should it match the catalog/metric naming convention exactly?
🟠 Major comments (15)
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama4_scout.json (1)

1-13: Fix evaluator_name: change "LLAMA3_4_SCOUT" to "LLAMA4_SCOUT".

The evaluator_name has an incorrect version prefix. The model is "llama-4-scout" (Llama 4) but the evaluator is named LLAMA3_4_SCOUT (suggesting Llama 3). This naming inconsistency violates the established pattern (e.g., LLAMA3_1_70B for llama-3-1-70b-instruct) and will cause confusion in evaluation outputs and dashboards. The evaluator_name should be "LLAMA4_SCOUT" to match the model version. This issue affects all provider implementations of this model.

notebooks/teigaku_genzei/test_qa.py (1)

66-88: --test_file / --out_result_file should be required (current defaults will crash).
Right now pd.read_csv(None, ...) / to_csv(None, ...) will fail with a confusing error.

 def config():
     parser = argparse.ArgumentParser(description='test a model with a zero-shot QA task')
     parser.add_argument('--' + ARG_MODEL_PATH, type=str, default="Qwen/Qwen1.5-0.5B-Chat")
-    parser.add_argument('--' + ARG_TEST_FILE, type=str, default=None)
+    parser.add_argument('--' + ARG_TEST_FILE, type=str, required=True)
 ...
-    parser.add_argument('--' + ARG_OUT_RESULT_FILE, type=str, default=None)
+    parser.add_argument('--' + ARG_OUT_RESULT_FILE, type=str, required=True)

Also applies to: 178-197

notebooks/teigaku_genzei/test_qa.py (1)

101-105: Write JSON with explicit UTF-8 encoding (JP text).
This avoids platform-default encoding surprises.

 def write_json_file(file_path: str, data: Any):
-    with open(file_path, 'w') as file:
+    with open(file_path, 'w', encoding="utf-8") as file:
         json.dump(data, file, ensure_ascii=False)
-    return
notebooks/teigaku_genzei/remove_newline_from_csv.py (1)

11-27: Make newline cleanup robust for NaN / non-string object columns.
df[col].str... can behave poorly if an object column has non-strings; also make the replace behavior explicit with regex=False.

-        col_sr_dict = {
-            col: df[col].str.replace("\n", " ") if df[col].dtype == "object" else df[col] for col in df.columns
-        }
+        col_sr_dict = {}
+        for col in df.columns:
+            if df[col].dtype == "object":
+                s = df[col]
+                # only operate on actual strings; keep NaN/None as-is
+                col_sr_dict[col] = s.where(~s.map(lambda v: isinstance(v, str)), s.str.replace("\n", " ", regex=False))
+            else:
+                col_sr_dict[col] = df[col]
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/gpt_4_1_mini.json (1)

1-16: Fix Azure credential structure and remove org-specific hostnames.

The api_base should contain only the base Azure endpoint (e.g., "https://YOUR_RESOURCE_NAME.openai.azure.com/"), not the full request URL. The api_version must be a separate credential field. Additionally, the hard-coded "eteopenai.azure-api.net" hostname exposes internal infrastructure and breaks portability — replace with a placeholder or document that users must substitute their own Azure resource name.

Expected structure:

"credentials": {
  "api_base": "https://YOUR_AZURE_RESOURCE_NAME.openai.azure.com/",
  "api_key": "YOUR_API_KEY",
  "api_type": "azure",
  "api_version": "2024-08-01-preview"
}
notebooks/teigaku_genzei/test_single.sh (1)

3-12: Quote inputs + prefer $(...) to avoid word-splitting / globbing bugs.
MODEL_TEST_PATH="$1" is good, but later expansions (basename ${MODEL_TEST_PATH}) are unquoted; if the path contains spaces/globs, behavior breaks. Also prefer $(...) over backticks.

-MODEL_TEST_PATH="$1" # /proj/instructlab/checkpoints/${MODEL_TEST_NAME}
-MODEL_TEST_NAME=`basename ${MODEL_TEST_PATH}`
+MODEL_TEST_PATH="$1" # /proj/instructlab/checkpoints/${MODEL_TEST_NAME}
+MODEL_TEST_NAME="$(basename "$MODEL_TEST_PATH")"
notebooks/teigaku_genzei/test_single.sh (1)

23-28: Make output dir creation idempotent + quote.
mkdir ${OUT_DIR} will fail if rerun and will mis-handle spaces.

-OUT_DIR=./tmp/single_${JUDGE_MODULE}__${MODEL_TEST_NAME}
-
-mkdir ${OUT_DIR}
+OUT_DIR="./tmp/single_${JUDGE_MODULE}__${MODEL_TEST_NAME}"
+
+mkdir -p "$OUT_DIR"
notebooks/teigaku_genzei/test_single.sh (1)

29-47: Quote vars in the loop / Python args; consider failing fast.
Unquoted ${TEST_DIR}, ${TEST}, ${MODEL_TEST_PATH} etc. can break on spaces and can also surprise with globbing. Also consider adding set -eu at top for safer scripts (keeping /bin/sh portability in mind).

 for TEST in ${TEST_NAMES}; do
-    TEST_FILE=${TEST_DIR}/${TEST}.csv
-    OUT_RESULT_PREFIX=${OUT_DIR}/result_${TEST}
-    OUT_GLOBAL_PREFIX=${OUT_DIR}/global_${TEST}
+    TEST_FILE="${TEST_DIR}/${TEST}.csv"
+    OUT_RESULT_PREFIX="${OUT_DIR}/result_${TEST}"
+    OUT_GLOBAL_PREFIX="${OUT_DIR}/global_${TEST}"
 
     python ./test_qa.py \
-        --test_file ${TEST_FILE} \
-        --out_result_file ${OUT_RESULT_PREFIX}_after.csv \
-        --out_global_score_file ${OUT_GLOBAL_PREFIX}_after.csv \
-        --judge_module ${JUDGE_MODULE} \
-        --data_classification ${DATA_CLASSIFICATION} \
-        --model_path ${MODEL_TEST_PATH}
+        --test_file "$TEST_FILE" \
+        --out_result_file "${OUT_RESULT_PREFIX}_after.csv" \
+        --out_global_score_file "${OUT_GLOBAL_PREFIX}_after.csv" \
+        --judge_module "$JUDGE_MODULE" \
+        --data_classification "$DATA_CLASSIFICATION" \
+        --model_path "$MODEL_TEST_PATH"
 done
notebooks/teigaku_genzei/llm_as_judge_direct_provider_importer.py (1)

1-40: Harden importer: validate dirs, avoid builtin shadowing, use sys.exit, and handle JSON/IO errors.

 import os
 import json
 import logging
+import sys
 import catalog_util
 
 
 if __name__ == "__main__":
 
     logging.basicConfig(level=logging.INFO)
 
     system_catalog_path = catalog_util.get_system_catalog_path()
     if system_catalog_path is None:
         logging.error("No system catalog")
-        exit(1)
+        sys.exit(1)
     local_catalog_path = catalog_util.LOCAL_CATALOG_PATH
     base_source_path = os.path.join(system_catalog_path, "metrics/llm_as_judge/direct")
     base_target_path = os.path.join(local_catalog_path, "metrics/llm_as_judge/direct_positional_bias")
     exclude_set = {"criteria"}
 
     logging.info(f"Copying artifacts from {base_source_path} to {base_target_path}")
-    for dir in os.listdir(base_source_path):
-        if dir in exclude_set:
+    if not os.path.isdir(base_source_path):
+        logging.error(f"Missing or non-directory source path: {base_source_path}")
+        sys.exit(1)
+
+    for provider_dir in os.listdir(base_source_path):
+        if provider_dir in exclude_set:
             continue
-        logging.info(f"Processing {dir}")
-        provider_source_path = os.path.join(base_source_path, dir)
-        provider_target_path = os.path.join(base_target_path, dir)
-        if not os.path.exists(provider_target_path):
-            os.makedirs(provider_target_path)
-        for file in os.listdir(provider_source_path):
-            if file.endswith(".json"):
-                logging.info(f"Processing {file}")
-                source_path = os.path.join(provider_source_path, file)
-                target_path = os.path.join(provider_target_path, file)
-                with open(source_path, "r", encoding="utf-8") as f:
-                    source_artifact = json.load(f)
-                    source_artifact["__type__"] = "llm_judge_direct_positional_bias"
-                    with open(target_path, "w", encoding="utf-8") as wf:
-                        json.dump(source_artifact, wf, indent=4)
-                        logging.info(f"Wrote {target_path}")
+        provider_source_path = os.path.join(base_source_path, provider_dir)
+        if not os.path.isdir(provider_source_path):
+            continue
+
+        logging.info(f"Processing {provider_dir}")
+        provider_target_path = os.path.join(base_target_path, provider_dir)
+        os.makedirs(provider_target_path, exist_ok=True)
+
+        for filename in os.listdir(provider_source_path):
+            if not filename.endswith(".json"):
+                continue
+            source_path = os.path.join(provider_source_path, filename)
+            if not os.path.isfile(source_path):
+                continue
+
+            logging.info(f"Processing {filename}")
+            target_path = os.path.join(provider_target_path, filename)
+            try:
+                with open(source_path, "r", encoding="utf-8") as rf:
+                    source_artifact = json.load(rf)
+                source_artifact["__type__"] = "llm_judge_direct_positional_bias"
+                with open(target_path, "w", encoding="utf-8") as wf:
+                    json.dump(source_artifact, wf, indent=4, ensure_ascii=False)
+                logging.info(f"Wrote {target_path}")
+            except (OSError, json.JSONDecodeError) as e:
+                logging.exception(f"Failed processing {source_path}: {e}")
+                sys.exit(1)
notebooks/teigaku_genzei/catalog_util.py (1)

1-22: Remove import-time side effects: stdout print and function call.

The module executes get_system_catalog_path() at import time (line 22), which prints to stdout and causes side effects when imported by other modules (test_qa.py, llm_as_judge_direct_provider_importer.py). This pollutes output during import and makes the module's behavior unpredictable.

Additionally, the function name get_system_catalog_path() is misleading—it returns paths[-1] from get_local_catalogs_paths(), treating the last local catalog as the "system" path. Either rename the function to clarify its actual behavior or implement proper system catalog path lookup. Remove the print(paths) statement and move the get_system_catalog_path() call into an if __name__ == "__main__": block. Also fix PEP 8 spacing: LOCAL_CATALOG_PATH = "local_catalog".
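
A minimal restructuring along those lines, taking the rename option and stubbing the unchanged lookup helper for brevity:

import logging

LOCAL_CATALOG_PATH = "local_catalog"

def get_local_catalogs_paths() -> list[str]:
    # existing lookup logic goes here, unchanged; stubbed for this sketch
    return []

def get_last_local_catalog_path() -> str | None:
    # Renamed from get_system_catalog_path() to say what it actually returns.
    paths = get_local_catalogs_paths()
    return paths[-1] if paths else None

if __name__ == "__main__":
    # Side effects only when executed directly, never on import.
    logging.basicConfig(level=logging.INFO)
    logging.info(get_last_local_catalog_path())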

notebooks/teigaku_genzei/seed_jsonl_2_qna_yaml.py (1)

11-17: Potential KeyError if seed lacks expected keys.

The code assumes each seed object contains icl_query_1 through icl_query_3 and corresponding icl_response_1 through icl_response_3. Missing keys will raise KeyError at runtime.

Consider adding validation or using .get() with defaults:

     qa_obj_list = [
         {
-            "question": seed[f"icl_query_{i+1}"],
-            "answer": seed[f"icl_response_{i+1}"],
+            "question": seed.get(f"icl_query_{i+1}", ""),
+            "answer": seed.get(f"icl_response_{i+1}", ""),
         }
         for i in range(3)
     ]
notebooks/teigaku_genzei/extract_icl_from_seed.py (1)

25-27: Potential IndexError on empty input file.

Accessing data_list[0] without checking if the list is empty will raise an IndexError if the input JSONL file is empty.

Apply this diff:

     print(len(data_list))
-    print(data_list[0])
+    if data_list:
+        print(data_list[0])
     obj_list = cast(list[dict[str, str]], data_list)
notebooks/teigaku_genzei/make_icl.py (1)

96-111: Ensure parent directory exists before writing output JSONL.

jsonl_util.write_jsonl_file(output_icl_path, ...) will fail if the parent directory doesn’t exist.

 def main()-> None:
@@
     output_icl_path = args[ARG_OUTPUT_ICL_FILE]
@@
     icl_list = make_icl(qa_df, context_df, short_context)
 
+    import os
+    parent = os.path.dirname(output_icl_path)
+    if parent:
+        os.makedirs(parent, exist_ok=True)
     jsonl_util.write_jsonl_file(output_icl_path, icl_list)
notebooks/teigaku_genzei/make_seed.py (1)

42-79: Cartesian mode: guard empty context/ICL (pd.concat([]) crash).

If context_df is empty, tmp_list becomes [] and pd.concat(tmp_list, ...) (Line 77) raises ValueError: No objects to concatenate. Slide mode already checks empties; cartesian should too.

     elif opt_join_method == OPT_JOIN_METHOD_CARTESIAN:
+        if len(context_df) == 0:
+            raise ValueError("No contexts loaded.")
+        if len(icl_df) == 0:
+            raise ValueError("No ICL samples loaded.")
         tmp_list = [
             pd.concat([
                 icl_df,
                 pd.DataFrame(sr.to_dict(), index=icl_df.index), 
             ], axis=1) for (idx, sr) in context_df.iterrows()
         ]
notebooks/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)

33-90: Average score computation uses the wrong key when criteria is provided.

When self.criteria is not None, the code correctly derives forward_score_name = self.criteria.name (line 35–36). However, the average computation at line 86 uses result.get(self.main_score, ...) instead of forward_score_name. Since Unitxt 1.24.0 emits the score under the criteria name when a single criterion is provided, the forward score will not be found under self.main_score, causing the average to be computed as (fallback_score + pb_score) / 2 even when a real forward score exists.

Use the same key you derived for the forward score:

+        def get_forward_score(result: Dict[str, Any]) -> float:
+            return cast(float, result.get(forward_score_name, result.get(self.main_score, default_pb_score)))
+
         pb_scores = [
-            get_pb_score(result, self.criteria, result.get(self.main_score, default_pb_score), default_pb_score) for result in results
+            get_pb_score(result, self.criteria, get_forward_score(result), default_pb_score) for result in results
         ]
         pb_results = [
             result | {
                 backward_score_name: pb_score,
-                average_score_name: (result.get(self.main_score, default_pb_score) + pb_score) / 2,
+                average_score_name: (get_forward_score(result) + pb_score) / 2,
             }
             for result, pb_score in zip(results, pb_scores, strict=True)
         ]

Committable suggestion skipped: line range outside the PR's diff.

🟡 Minor comments (5)
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/granite3_3_8b.json (1)

1-13: Add trailing newline to the JSON file.

JSON files should end with a newline character. Add a blank line at the end of the file (after the closing brace).

The model identifier granite-3-3-8b-instruct and schema structure are consistent with other watsonx configurations in this local catalog directory and align with Unitxt's documented CrossProviderInferenceEngine schema.

 {
     "__type__": "llm_judge_direct_positional_bias",
     "inference_engine": {
         "__type__": "cross_provider_inference_engine",
         "max_tokens": 1024,
         "seed": 42,
         "temperature": 0,
         "provider": "watsonx",
         "model": "granite-3-3-8b-instruct"
     },
     "evaluator_name": "GRANITE3_3_8B",
     "generate_summaries": false
-}
+}
+
notebooks/teigaku_genzei/requirements.txt (1)

1-13: Tighten dependency strategy for reproducible evals and remove redundant sacrebleu entry

  • The plain sacrebleu line is redundant; sacrebleu[ja] already includes the base package plus Japanese tokenization support. Remove the sacrebleu entry.
  • Consider pinning (or constraining) the currently-unpinned packages (litellm, accelerate, nltk, bert_score, rouge_score, python-dotenv) to ensure reproducible evaluation results, since upstream changes can alter metrics and outputs.
notebooks/teigaku_genzei/jsonl_util.py (1)

7-17: Make JSONL reading resilient to blank lines (and optionally stream).

 def read_jsonl_file(file_path: str) -> list[Any]:
     with open(file_path, 'r', encoding="utf-8") as file:
-        data_list = [
-            json.loads(line) for line in file
-        ]
-        return data_list
+        data_list: list[Any] = []
+        for line in file:
+            line = line.strip()
+            if not line:
+                continue
+            data_list.append(json.loads(line))
+        return data_list
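
If streaming is preferred over materializing the whole list, a generator variant could look like this (a sketch; iter_jsonl is a new name, not an existing function):

from collections.abc import Iterator
import json

def iter_jsonl(file_path: str) -> Iterator[object]:
    # Yield one parsed object per non-blank line, so large files
    # never have to fit in memory at once.
    with open(file_path, encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if line:
                yield json.loads(line)
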
notebooks/teigaku_genzei/README.md (1)

1-254: Address markdownlint doc hygiene (bare URLs, list indentation, fenced code languages).
This README is valuable, but the current formatting will likely fail markdownlint in CI (MD034/MD007/MD040).

  • Convert bare URLs into markdown links.
  • Fix list indentation to the expected levels.
  • Add languages to fenced blocks (e.g., sh, text).
notebooks/teigaku_genzei/qna_yaml_2_qa_csv.py (1)

1-81: Tighten CSV conversion: remove dead code, fix Title indexing, validate YAML type, and use UTF‑8.

-import sys
-import yaml
-import pandas as pd
+import sys
+from typing import Any
+
+import pandas as pd
+import yaml
 
 def main():
@@
-    try:
-        # load YAML file
-        with open(input_file, "r") as file:
-            data = yaml.safe_load(file)
+    try:
+        # load YAML file
+        with open(input_file, "r", encoding="utf-8") as file:
+            data: Any = yaml.safe_load(file)
+
+        if not isinstance(data, dict):
+            raise ValueError("YAML root must be a mapping/object")
         
         # Validate structure
         if "seed_examples" not in data:
             raise ValueError("YAML file missing 'seed_examples' key")
@@
         qa_list = [
             {
-                "Title": f"{isample + 1}-{iqa}",
+                "Title": f"{isample + 1}-{iqa + 1}",
                 "Question": qa["question"], 
                 "Answer": qa["answer"]
             }
             for (isample, sample) in enumerate(data["seed_examples"]) 
             for (iqa, qa) in enumerate(sample["questions_and_answers"])
         ]
@@
-    except Exception as e:
-        print(f"Error: {e}", file=sys.stderr)
-        sys.exit(1)
+    # Let unexpected exceptions surface with a traceback (better debuggability).

Also: since this depends on third-party libs (yaml, pandas), please confirm the project documents/locks the expected versions (or at least lists them) to keep the example reproducible.

🧹 Nitpick comments (23)
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/granite3_2_8b.json (1)

1-13: Add trailing newline to the file.

The file is missing a trailing newline, which is a common convention in source code repositories for consistency and POSIX compliance.

 {
     "__type__": "llm_judge_direct_positional_bias",
     "inference_engine": {
         "__type__": "cross_provider_inference_engine",
         "max_tokens": 1024,
         "seed": 42,
         "temperature": 0,
         "provider": "watsonx",
         "model": "granite-3-2-8b-instruct"
     },
     "evaluator_name": "GRANITE3_2_8B",
     "generate_summaries": false
-}
\ No newline at end of file
+}
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama3_1_8b.json (1)

11-12: Consider aligning evaluator_name with existing catalog naming conventions.

If other Together configs use a different casing/format (e.g., LLAMA3_1_8B_INSTRUCT or including provider), aligning helps downstream aggregation/reporting scripts avoid special-casing.

notebooks/teigaku_genzei/test_single_vllm.sh (1)

3-47: Quote variable expansions (model paths w/ spaces) and prefer $(...).
This is mostly for robustness when MODEL_TEST_PATH or output dirs contain spaces/special chars.

- MODEL_TEST_PATH="$1" # /proj/instructlab/checkpoints/${MODEL_TEST_NAME}
- MODEL_TEST_NAME=`basename ${MODEL_TEST_PATH}`
+ MODEL_TEST_PATH="$1" # /proj/instructlab/checkpoints/${MODEL_TEST_NAME}
+ MODEL_TEST_NAME="$(basename "$MODEL_TEST_PATH")"
 ...
-mkdir -p ${OUT_DIR}
+mkdir -p "$OUT_DIR"
 ...
-    TEST_FILE=${TEST_DIR}/${TEST}.csv
-    OUT_RESULT_PREFIX=${OUT_DIR}/result_${TEST}
-    OUT_GLOBAL_PREFIX=${OUT_DIR}/global_${TEST}
+    TEST_FILE="${TEST_DIR}/${TEST}.csv"
+    OUT_RESULT_PREFIX="${OUT_DIR}/result_${TEST}"
+    OUT_GLOBAL_PREFIX="${OUT_DIR}/global_${TEST}"
 ...
-    python ./test_qa.py \
-        --test_file ${TEST_FILE} \
-        --out_result_file ${OUT_RESULT_PREFIX}_after.csv \
-        --out_global_score_file ${OUT_GLOBAL_PREFIX}_after.csv \
-        --judge_module ${JUDGE_MODULE} \
-        --server_type ${SERVER_TYPE} \
-        --api_key ${SERVER_API_KEY} \
-        --data_classification ${DATA_CLASSIFICATION} \
-        --model_path ${MODEL_TEST_PATH}
+    python ./test_qa.py \
+        --test_file "$TEST_FILE" \
+        --out_result_file "${OUT_RESULT_PREFIX}_after.csv" \
+        --out_global_score_file "${OUT_GLOBAL_PREFIX}_after.csv" \
+        --judge_module "$JUDGE_MODULE" \
+        --server_type "$SERVER_TYPE" \
+        --api_key "$SERVER_API_KEY" \
+        --data_classification "$DATA_CLASSIFICATION" \
+        --model_path "$MODEL_TEST_PATH"
notebooks/teigaku_genzei/llm_as_judge_direct_rits_phi.py (1)

5-24: Consider making overwrite=True an explicit opt-in.
If users already have a local artifact with the same name, this will silently replace it.

-def register(catalog_path: str) -> None:
+def register(catalog_path: str, overwrite: bool = False) -> None:
 ...
-    add_to_catalog(laaj_direct_rits_phi4, name="metrics.llm_as_judge.direct.rits.phi_4", overwrite=True, catalog_path=catalog_path)
+    add_to_catalog(laaj_direct_rits_phi4, name="metrics.llm_as_judge.direct.rits.phi_4", overwrite=overwrite, catalog_path=catalog_path)
notebooks/teigaku_genzei/test_qa.py (3)

89-93: Either use test_language or remove it (Ruff ARG001).
If the CSV includes target_language, you could validate it; otherwise drop the arg to avoid confusion.


275-276: Remove debug print (print(type(dataset))) or gate behind a verbose flag.
Keeps CLI output clean for scripts that parse stdout.
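
For example, the print could be routed through logging so it only shows at debug level (a sketch, not the script's current wiring):

import logging

logger = logging.getLogger(__name__)

def log_dataset_type(dataset) -> None:
    # Emitted only when logging is configured at DEBUG level.
    logger.debug("loaded dataset type: %s", type(dataset))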


224-233: Minor: prefer iterable unpacking over list concatenation (Ruff RUF005).

-        metrics=[
+        metrics=[
             "metrics.rouge",
             "metrics.sacrebleu",
             # "metrics.bert_score.deberta_large_mnli",
-        ] + lmaaj_metrics_list,
+            *lmaaj_metrics_list,
+        ],
notebooks/teigaku_genzei/remove_newline_from_csv.py (1)

19-36: Avoid blind except Exception—either let unexpected errors propagate or use a traceback-preserving pattern.
Ruff BLE001 flags catch-all exceptions in CLI tools because they hide KeyboardInterrupt and other critical errors. The current code already catches FileNotFoundError and pd.errors.ParserError explicitly, so the catch-all except Exception is unnecessary.

For encoding/IO issues with read_csv and to_csv, pandas raises UnicodeDecodeError (encoding mismatch), OSError-family errors (filesystem), and ParserError (malformed data)—all handled above. Either remove the catch-all and let unexpected errors surface with a traceback, or keep it with logging.exception() to preserve context:

-    except Exception as e:
-        print(f"Error: {e}", file=sys.stderr)
-        sys.exit(1)
+    # Let unexpected exceptions propagate with full traceback for CLI debugging

Or if you must catch unknown errors, use:

-    except Exception as e:
-        print(f"Error: {e}", file=sys.stderr)
-        sys.exit(1)
+    except Exception as e:
+        import logging
+        logging.exception("Unexpected error")
+        sys.exit(1)
notebooks/teigaku_genzei/make_icl.sh (1)

1-3: Make the script more robust (fail-fast + quoting + readability).

 #!/bin/sh
+set -eu
 
-python ./make_icl.py --short_context --qa_file qa_table/0024004-072_01.csv --context_file context/0024004-072_01_context.csv --out_icl_file icl/icl.jsonl
+python ./make_icl.py \
+  --short_context \
+  --qa_file "qa_table/0024004-072_01.csv" \
+  --context_file "context/0024004-072_01_context.csv" \
+  --out_icl_file "icl/icl.jsonl"
notebooks/teigaku_genzei/make_context.sh (1)

1-11: Add fail-fast + quoting (no need to mkdir the output dirs here).

 #!/bin/sh
+set -eu
 
 DATA_PREFIX_LIST="0024001-021 0024004-072_01"
 
 for f in $DATA_PREFIX_LIST; do
     python ./make_context.py \
         --glossary_option appendix \
-        --qa_file qa_table/${f}.csv \
-        --glossary_file qa_table/${f}_glossary.csv \
-        --out_context_file context/${f}_context.csv
+        --qa_file "qa_table/${f}.csv" \
+        --glossary_file "qa_table/${f}_glossary.csv" \
+        --out_context_file "context/${f}_context.csv"
 done
notebooks/teigaku_genzei/extract_qa.sh (1)

1-14: Harden the shell script: fail-fast + quote paths/args.
This will avoid silent partial outputs and word-splitting surprises if paths ever change (Line 7, Line 12-13).

 #!/bin/sh
+set -eu
 
 source_dir="docparser"
 source_qa="0024001-021 0024004-072_01"
 qa_table_dir="qa_table"
 
 for sqa in ${source_qa}; do
-    sf=${source_dir}/${sqa}.json
-    qf=${qa_table_dir}/${sqa}.csv
-    gf=${qa_table_dir}/${sqa}_glossary.csv
+    sf="${source_dir}/${sqa}.json"
+    qf="${qa_table_dir}/${sqa}.csv"
+    gf="${qa_table_dir}/${sqa}_glossary.csv"
 
-    python ./json_qa.py ${sf} ${qf}
-    python ./json_glossary.py ${sf} ${gf}
+    python ./json_qa.py "${sf}" "${qf}"
+    python ./json_glossary.py "${sf}" "${gf}"
 done
notebooks/teigaku_genzei/parser.sh (1)

1-15: Nice fail-fast + path validation; quote vars and ensure output dir exists.
(Line 2, Line 10-13, Line 15)

 #!/bin/sh
 set -e
 
 source="source/"
 output="docparser/"
@@
 fi
 
-python ${code_path}  --input-dir ${source} --output-dir ${output}
+mkdir -p "${output}"
+python "${code_path}" --input-dir "${source}" --output-dir "${output}"
notebooks/teigaku_genzei/json_qa.py (1)

16-38: (Optional) Consider whether "".join(...) is the intended delimiter for all outputs.
If any downstream consumer expects whitespace-separated text, joining with "" (Line 38) can fuse tokens. If the intent is "Japanese-first" (no spaces between segments), a short comment stating so would help future readers.

notebooks/teigaku_genzei/extract_icl_from_seed.py (2)

10-13: Consider renaming compute_dict_hash for clarity.

The function concatenates values with "::" rather than computing an actual hash. This could cause unexpected collisions if values contain "::". The name compute_dict_key or compute_dedup_key would better reflect its purpose.
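A collision-safe sketch; the signature is assumed (the thread only shows that values are joined with "::"), and sha256 is an arbitrary but stable choice:

import hashlib
from typing import Any

def compute_dedup_key(obj: dict[str, Any]) -> str:
    h = hashlib.sha256()
    for key in sorted(obj):
        for part in (key, str(obj[key])):
            data = part.encode("utf-8")
            # Length-prefix each field so "a::b" and ("a", "b") cannot collide.
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
    return h.hexdigest()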


48-50: Simplify dict comprehension.

Per static analysis, this can be simplified.

Apply this diff:

-    unique_icl_dict: dict[str, dict[str, Any]] = {
-        h: obj for (h, obj) in hash_list
-    }
+    unique_icl_dict: dict[str, dict[str, Any]] = dict(hash_list)
notebooks/teigaku_genzei/seed_jsonl_2_qna_yaml.py (1)

52-52: Remove useless return statement.

The explicit return at the end of main() is unnecessary.

-    return
-
 if __name__ == "__main__":
notebooks/teigaku_genzei/json_util.py (1)

2-2: Remove unused imports.

cast, Hashable, Sequence, and Collection are imported but never used in this module.

-from typing import cast, Any, Hashable, Sequence, Collection
+from typing import Any
notebooks/teigaku_genzei/make_context.py (1)

2-2: Remove unused imports.

Hashable, Sequence, and Collection are imported but never used.

-from typing import cast, Any, Hashable, Sequence, Collection
+from typing import cast, Any
notebooks/teigaku_genzei/json_glossary.py (1)

48-53: Consider renaming parameter for clarity.

The function save_to_csv uses parameter name qa_pairs but actually writes glossary data with Term and Description columns. Consider renaming to glossary_pairs for consistency.

-def save_to_csv(qa_pairs, output_file):
+def save_to_csv(glossary_pairs, output_file):
     with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
         writer = csv.writer(csvfile)
         writer.writerow(['Term', 'Description'])
-        for qa in qa_pairs:
-            writer.writerow(qa)
+        for pair in glossary_pairs:
+            writer.writerow(pair)
notebooks/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)

41-49: Avoid print() in compute(); use a logger (or remove).

print() at Line 47 will spam notebook output and makes batch evaluation noisy. Prefer logger.debug(...) or remove entirely.
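A minimal sketch of the logger-based replacement; compute()'s body is not quoted in this thread, so the names here are illustrative:

import logging

logger = logging.getLogger(__name__)

def compute_step(result: dict) -> dict:
    # Debug-level record: invisible by default, shown only when the caller
    # configures logging at DEBUG, unlike the unconditional print().
    logger.debug("positional-bias intermediate result: %s", result)
    return result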

notebooks/teigaku_genzei/make_seed.py (1)

78-80: Prefer ValueError over argparse.ArgumentError here.

make_seed() is also a callable function; argparse.ArgumentError(None, ...) is hard to interpret without the parser instance. A simple ValueError(...) is clearer.
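Sketch of the suggested change; the allowed option values shown are assumptions, not taken from make_seed.py:

def validate_glossary_option(glossary_option: str) -> None:
    # Hypothetical allowed set for illustration only.
    if glossary_option not in {"appendix", "inline"}:
        raise ValueError(f"unsupported glossary_option: {glossary_option!r}")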

notebooks/teigaku_genzei/make_icl.py (1)

74-79: Be tolerant to "［A］" vs "[A]" prefix variance in source data.

If some PDFs/exports produce ASCII [A] instead of fullwidth ［A］, the current removeprefix("［A］") won't strip it. Consider normalizing both:

-            "icl_response_" + str(i + 1): qa_df.loc[qa]["Answer"].removeprefix("［A］").strip(),
+            "icl_response_" + str(i + 1): qa_df.loc[qa]["Answer"].removeprefix("［A］").removeprefix("[A]").strip(),
notebooks/teigaku_genzei/teigaku_genzei.ipynb (1)

198-224: Use logger.exception(...) when catching errors to keep stack traces.

In export_document() (Line 221-223) and export_document_new_docling() (Line 262-265) you log str(e) but lose traceback context in logs; prefer logger.exception(...) in except.

Also applies to: 225-266
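A sketch of the pattern; the export call is a placeholder since only a summary of export_document() appears in this thread:

import logging

logger = logging.getLogger(__name__)

def export_one(export_fn, out_path: str) -> bool:
    try:
        export_fn(out_path)  # placeholder for the notebook's export method
        return True
    except Exception:
        # logger.exception records at ERROR level and appends the traceback,
        # which logger.error("%s", str(e)) would drop.
        logger.exception("Failed to export %s", out_path)
        return False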

Comment on lines +1 to +13
{
"__type__": "llm_judge_direct_positional_bias",
"inference_engine": {
"__type__": "cross_provider_inference_engine",
"max_tokens": 1024,
"seed": 42,
"temperature": 0,
"provider": "rits",
"model": "mixtral-8x7b-instruct-v01"
},
"evaluator_name": "MIXTRAL8_7b",
"generate_summaries": false
} No newline at end of file

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Search for similar JSON config files to understand the expected schema
find . -name "*.json" -path "*/local_catalog/*" -type f | head -20

Repository: opendatahub-io/data-processing

Length of output: 2268


🏁 Script executed:

# Look for Unitxt configuration or schema definitions
rg "cross_provider_inference_engine|llm_judge_direct_positional_bias" --type json -A 5 -B 2

Repository: opendatahub-io/data-processing

Length of output: 50386


🏁 Script executed:

# Search for RITS provider implementation in codebase
rg "rits|RITS" --type py -l | head -10

Repository: opendatahub-io/data-processing

Length of output: 290


🏁 Script executed:

# Check for Unitxt version pinning (requirements, setup.py, pyproject.toml)
fd -e "requirements.*\.txt" -o -e "setup.py" -o -e "pyproject.toml" | head -5

Repository: opendatahub-io/data-processing

Length of output: 190


🏁 Script executed:

# Look for any schema definitions or validation code
fd -e "schema.py" -o -e "*schema*.json" | head -10

Repository: opendatahub-io/data-processing

Length of output: 190


🏁 Script executed:

# Look for Unitxt version constraints
rg "unitxt" -i pyproject.toml setup.py requirements.txt 2>/dev/null | head -20

Repository: opendatahub-io/data-processing

Length of output: 56


🏁 Script executed:

# Search for code that loads/validates these JSON configs
rg "llm_judge_direct_positional_bias|cross_provider_inference_engine" --type py -B 2 -A 5 | head -100

Repository: opendatahub-io/data-processing

Length of output: 1006


🏁 Script executed:

# Check RITS provider implementation in Python
rg "class.*RitsInference|rits.*inference" -i --type py -B 2 -A 10 | head -150

Repository: opendatahub-io/data-processing

Length of output: 11396


🏁 Script executed:

# Look at the notebook files that use RITS
cat notebooks/teigaku_genzei/llm_as_judge_direct_positional_bias.py 2>/dev/null | head -100

Repository: opendatahub-io/data-processing

Length of output: 4965


🏁 Script executed:

# Search for Issue #13 reference or related bug discussion
rg "Issue.*13|issue.*13|bug.*CrossProvider|CrossProvider.*bug" --type py -B 3 -A 3

Repository: opendatahub-io/data-processing

Length of output: 800


🏁 Script executed:

# Check if there are any special handling or workarounds for RITS provider
rg "rits.*CrossProvider|CrossProvider.*rits" -i --type py -B 2 -A 5

Repository: opendatahub-io/data-processing

Length of output: 2231


🏁 Script executed:

# Look for any version-specific handling or compatibility notes
rg "unitxt.*1\.24|version.*1\.24|1\.24.*unitxt" -i --type py -B 2 -A 2

Repository: opendatahub-io/data-processing

Length of output: 576


🏁 Script executed:

# Check for comments or issues specifically about Mixtral with RITS
rg "mixtral.*rits|rits.*mixtral" -i -B 2 -A 2

Repository: opendatahub-io/data-processing

Length of output: 567


🏁 Script executed:

# Look at the JSON importer to understand if it's aware of the bug
cat notebooks/teigaku_genzei/llm_as_judge_direct_provider_importer.py

Repository: opendatahub-io/data-processing

Length of output: 1793


🏁 Script executed:

# Check the llama4_maverick config in rits directory
cat notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/llama4_maverick.json

Repository: opendatahub-io/data-processing

Length of output: 559


🏁 Script executed:

# Check what __type__ is used for llama4_maverick (should show rits_inference_engine_fixed based on Python code)
rg "llama4_maverick|llama4-maverick" --type json -A 10 | grep -A 10 "rits/llama4"

Repository: opendatahub-io/data-processing

Length of output: 56


This configuration is subject to a known bug affecting CrossProviderInferenceEngine with RITS provider.

In Unitxt 1.24.0, CrossProviderInferenceEngine causes connection failures when used with the RITS provider (Issue #13). The codebase already has a documented workaround for this: the llama4_maverick.json config in the same directory uses rits_inference_engine_fixed instead. The mixtral8_7b.json config should similarly use rits_inference_engine_fixed with the appropriate model name and endpoint URL, rather than relying on the buggy cross_provider_inference_engine.

🤖 Prompt for AI Agents
In
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/mixtral8_7b.json
lines 1-13, this config uses the buggy cross_provider_inference_engine with
provider "rits"; replace that inference_engine block to use
"rits_inference_engine_fixed" (as in llama4_maverick.json), keep max_tokens:
1024, seed: 42 and temperature: 0, set model to "mixtral-8x7b-instruct-v01" and
add the fixed engine's endpoint URL/connection fields required by the workaround
so the config uses the fixed RITS engine instead of the broken cross-provider
engine.

Comment on lines +1 to +13
{
"__type__": "llm_judge_direct_positional_bias",
"inference_engine": {
"__type__": "cross_provider_inference_engine",
"max_tokens": 1024,
"seed": 42,
"temperature": 0,
"provider": "together-ai",
"model": "llama-3-1-405b-instruct"
},
"evaluator_name": "LLAMA3_1_405B",
"generate_summaries": false
} No newline at end of file

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's understand the repo structure and find similar configs
find . -name "*.json" -path "*/local_catalog/metrics/*" -type f | head -20

Repository: opendatahub-io/data-processing

Length of output: 204


🏁 Script executed:

# Search for other Together AI or provider configs in the codebase
rg "together-ai|cross_provider_inference_engine" --type json -l

Repository: opendatahub-io/data-processing

Length of output: 6501


🏁 Script executed:

# Look for schema definitions or documentation about the config format
find . -name "*.py" -o -name "*.md" | xargs grep -l "llm_judge_direct_positional_bias\|cross_provider_inference_engine" 2>/dev/null | head -10

Repository: opendatahub-io/data-processing

Length of output: 141


🌐 Web query:

Together AI Llama 3.1 405B Instruct model identifier API

💡 Result:

The Together AI model identifier for Llama 3.1 405B Instruct (Turbo) is:
meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo. [1][2]

Example curl (chat completions):
curl -X POST "https://api.together.xyz/v1/chat/completions" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo","messages":[{"role":"user","content":"Hello"}]}'. [1]

Sources: Together AI model page and chat-models listing. [1][2]


🌐 Web query:

Unitxt sdg_hub local catalog schema llm_judge cross_provider_inference_engine

💡 Result:

Here’s a concise mapping of the terms and relevant projects/documents I found:

  • Unitxt (catalog / LLM-as-judge): Unitxt provides a local catalog schema and built-in “LLM-as-a-Judge” evaluators (many predefined criteria and direct evaluator catalog entries). See Unitxt catalog and examples. [1][2][3]

  • sdg_hub / local catalog schema: Unitxt’s catalog exposes organized metric/evaluator entries (e.g., direct evaluators under catalog.metrics.llm_as_judge) suitable for local catalogs and integration (appears in Unitxt docs/releases). [1][2]

  • llm_judge / llm-judger / judge-llm: Several OSS tools implement the “LLM-as-a-judge” pattern:

    • llm-judger (PyPI) — a tool to have models judge each other and store results. [4]
    • judge-llm (PyPI) — config-driven evaluator framework supporting multiple providers/evaluators and reporting. [5]
    • NVIDIA NeMo docs describing an LLM-as-a-judge evaluation flow and config examples. [6][7]
  • local / cross_provider_inference_engine (unified inference engines): There are multiple unified inference-engine projects and libraries that let you run open-source and commercial models through a common interface (useful for cross-provider inference and evaluation):

    • llm-engines (PyPI) — unified inference engine wrapper for various backends. [8]
    • LLM Inference Engine docs (llm-ie) — pluggable inference engine interfaces (vLLM, OpenAI, HuggingFace, Ollama, LlamaCpp, etc.). [9]
    • Research/infra projects (FlashInfer, Inferflow, AIBrix) focus on efficient/high-performance inference serving and multi-engine support. [10][11][12]

If you want, I can:

  • show specific Unitxt catalog entries (sdg_hub) and example YAML/config for a local catalog; or
  • propose an architecture to wire Unitxt + a cross-provider inference layer + an LLM-judge tool to run cross-provider evaluations and store results.

Sources:
[1] Unitxt catalog pages and examples.
[2] Unitxt examples — LLM-as-a-Judge.
[3] Unitxt catalog direct evaluator pages.
[4] llm-judger (PyPI).
[5] judge-llm (PyPI).
[6] NVIDIA NeMo microservices — LLM-as-a-Judge docs.
[7] NVIDIA NeMo example configs (LLM-as-a-Judge).
[8] llm-engines (PyPI).
[9] LLM Inference Engine (llm-ie) docs.
[10] FlashInfer (arXiv).
[11] Inferflow (arXiv).
[12] AIBrix (arXiv / GitHub).


🏁 Script executed:

cat -n ./notebooks/teigaku_genzei/llm_as_judge_direct_provider_importer.py

Repository: opendatahub-io/data-processing

Length of output: 2080


🏁 Script executed:

# Check the other Together AI configs to see the pattern
cat -n ./notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama3_1_70b.json

Repository: opendatahub-io/data-processing

Length of output: 529


🏁 Script executed:

# Check one more for comparison
cat -n ./notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama3_1_8b.json

Repository: opendatahub-io/data-processing

Length of output: 527


🏁 Script executed:

# Check if there's any model name mapping/translation in the codebase
rg "llama-3-1-405b-instruct|meta-llama.*405B" --type py -A 3 -B 3

Repository: opendatahub-io/data-processing

Length of output: 56


🏁 Script executed:

# Look for inference engine implementation
find . -name "*.py" -type f | xargs grep -l "cross_provider_inference_engine" 2>/dev/null | head -5

Repository: opendatahub-io/data-processing

Length of output: 56


Fix Together AI model identifier: "llama-3-1-405b-instruct" is not a valid Together AI API model name.

The config will fail at runtime. According to Together AI's API documentation, the correct identifier for Llama 3.1 405B Instruct is meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo. Update the "model" field to use this identifier. This issue affects all Together AI configs in the directory (8B, 70B, and 405B variants).

Additionally, document the cost and access requirements for the 405B model + max_tokens: 1024 configuration for users.

🤖 Prompt for AI Agents
In
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama3_1_405b.json
lines 1-13, the Together AI model identifier is incorrect; replace
"llama-3-1-405b-instruct" with the valid Together API model name
"meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo" and update any other Together AI
configs in this directory (8B, 70B, 405B variants) to their documented
identifiers; also add a brief note in the same file or adjacent README
documenting that the 405B model requires special access and is higher cost, and
that with max_tokens: 1024 users should expect increased compute cost and to
verify access/quotas before running.
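Since the same slug fix applies across the directory, here is a throwaway batch-patch sketch; the 8B, 405B, and 3.3-70B slugs are the ones quoted above, while the 3.1-70B slug is assumed by analogy and should be verified against Together AI's model list:

import json
from pathlib import Path

SLUGS = {
    "llama-3-1-8b-instruct": "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    "llama-3-1-70b-instruct": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",  # assumed
    "llama-3-1-405b-instruct": "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    "llama-3-3-70b-instruct": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
}

catalog = Path("notebooks/teigaku_genzei/local_catalog")
for path in catalog.rglob("together_ai/*.json"):
    config = json.loads(path.read_text(encoding="utf-8"))
    engine = config.get("inference_engine", {})
    if engine.get("model") in SLUGS:
        engine["model"] = SLUGS[engine["model"]]
        path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
        print(f"patched {path}")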

Comment on lines +1 to +13
{
"__type__": "llm_judge_direct_positional_bias",
"inference_engine": {
"__type__": "cross_provider_inference_engine",
"max_tokens": 1024,
"seed": 42,
"temperature": 0,
"provider": "together-ai",
"model": "llama-3-1-8b-instruct"
},
"evaluator_name": "LLAMA3_1_8B",
"generate_summaries": false
} No newline at end of file

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, search the repository for other Together AI configurations to understand naming patterns
find . -type f -name "*.json" -path "*/together_ai/*" 2>/dev/null | head -20

Repository: opendatahub-io/data-processing

Length of output: 204


🏁 Script executed:

# Search for Unitxt or sdg_hub configuration files in the repo that might document the schema
find . -type f -name "*.py" -o -name "*.json" | xargs grep -l "cross_provider_inference_engine" 2>/dev/null | head -10

Repository: opendatahub-io/data-processing

Length of output: 204


🌐 Web query:

Together AI Llama 3.1 8B Instruct exact model identifier slug API

💡 Result:

The Together.ai API model slug for Llama 3.1 8B Instruct is:

  • meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo. [1]

Alternative/NIM variant (GPU NIM endpoint) slug:

  • nim/meta/llama-3.1-8b-instruct. [2]

🌐 Web query:

Unitxt cross_provider_inference_engine schema documentation parameters

💡 Result:

Summary of CrossProviderInferenceEngine schema & main parameters (Unitxt)

  • Purpose: single engine that calls multiple API providers; dynamically switches between providers and maps model names per provider. [1][2]

Main constructor parameters

  • model (str) — model identifier (e.g., "gpt-4o", "llama-3-2-1b-instruct"). [1][2]
  • provider (Optional[str]) — which API to use; must be one of the supported providers (e.g., "watsonx", "together-ai", "open-ai", "aws", "ollama", "bam", "watsonx-sdk", "rits", "vertex-ai", "hf-local"). [1][2]
  • provider_model_map (Dict[provider, Dict[str,str]]) — per-provider mapping of model ids to unify identifiers across backends. [1][2]
  • provider_specific_args (Optional[Dict[str, Dict[str, Any]]]) — provider-specific configuration (example: {"watsonx": {"max_requests_per_second": 4}}). [1][2]

Common inference / OpenAI-style generation params (available via mixins)

  • max_tokens / max_new_tokens, temperature, top_p, top_k, repetition_penalty, stop / stop_sequences, seed / random_seed, min_new_tokens, time_limit, truncate_input_tokens, typical_p, etc. (engine extends OpenAiInferenceEngineParamsMixin / other provider param mixins; pass provider-specific generation params as usual). [2][1]

Credentials & connection

  • credentials / default_headers / base_url (provider-dependent) — supply API keys / endpoints per provider or rely on environment variables. [2][1]

Other engine options

  • label (str), cache_batch_size, use_cache, num_parallel_requests / concurrency_limit, logprobs / top_logprobs / echo / n (for log-prob / option-selection use cases). [2][1]

Example (catalog snippet)

  • CrossProviderInferenceEngine(model="meta-llama/llama-3-70b-instruct", provider="watsonx", max_tokens=2048, seed=42). [1]

Sources

  • Unitxt Cross-Provider engine catalog & explanation. [1]
  • unitxt.inference module (engine classes and param mixins / supported params). [2]

Correct the Together AI model identifier before merge.

The model identifier "llama-3-1-8b-instruct" does not match Together AI's documented model slug. According to Together AI's API, the correct identifier for Llama 3.1 8B Instruct is "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo" (or "nim/meta/llama-3.1-8b-instruct" for the NIM variant). Change line 9 from:

"model": "llama-3-1-8b-instruct"

to:

"model": "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"

The parameter names (max_tokens, seed, temperature) and provider key ("together-ai") are correct per Unitxt's CrossProviderInferenceEngine schema. The mismatch in the model slug will cause the evaluation to fail at runtime when Unitxt attempts to instantiate the inference engine.

🤖 Prompt for AI Agents
In
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama3_1_8b.json
around lines 1 to 13, the Together AI model slug is incorrect; replace the value
for "model" currently set to "llama-3-1-8b-instruct" with the documented
Together AI identifier "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo" (or use
"nim/meta/llama-3.1-8b-instruct" if you intend the NIM variant) so the
CrossProviderInferenceEngine can instantiate the model at runtime.

Comment on lines +8 to +9
"provider": "together-ai",
"model": "llama-3-3-70b-instruct"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Together AI Llama 3.3 70B Instruct model identifier API

💡 Result:

The Together.ai model identifier for Llama 3.3 70B Instruct (Instruct/Turbo variant) is:

  • meta-llama/Llama-3.3-70B-Instruct-Turbo. [1]

(There are related variants shown on Together.ai such as the free endpoint meta-llama/Llama-3.3-70B-Instruct-Turbo-Free and an NVIDIA NIM variant nim/meta/llama-3-3-70b-instruct.) [2][3]

Example curl (Together API v1 chat completions):
curl -X POST "https://api.together.xyz/v1/chat/completions" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
    "messages": [{"role":"user","content":"Explain transfer learning in one paragraph."}]
  }'

Sources: Together.ai model pages. [1] [2] [3]


Fix Together AI model identifier: change to correct API format.

The configuration uses "llama-3-3-70b-instruct", which is not a valid Together AI model identifier. The correct identifier is "meta-llama/Llama-3.3-70B-Instruct-Turbo". Update line 9 to use the proper format required by Together AI's API.

🤖 Prompt for AI Agents
In
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama3_3_70b.json
around lines 8 to 9, the Together AI model identifier is incorrect; replace the
value "llama-3-3-70b-instruct" with the correct Together AI API model identifier
"meta-llama/Llama-3.3-70B-Instruct-Turbo" (leave provider as "together-ai").
Ensure the JSON string is updated exactly to the correct identifier format
expected by Together AI.

Comment on lines +1 to +13
{
"__type__": "llm_judge_direct_positional_bias",
"inference_engine": {
"__type__": "cross_provider_inference_engine",
"max_tokens": 1024,
"seed": 42,
"temperature": 0,
"provider": "vertex-ai",
"model": "gemini-2-5-flash"
},
"evaluator_name": "GEMINI_2_5_FLASH",
"generate_summaries": false
} No newline at end of file

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for existing vertex provider entries in catalog JSON files
find notebooks/teigaku_genzei/local_catalog -name "*.json" -type f -exec grep -l "vertex" {} \;

Repository: opendatahub-io/data-processing

Length of output: 302


🏁 Script executed:

#!/bin/bash
# Search for provider identifier patterns across all catalog JSON files
find notebooks/teigaku_genzei/local_catalog -name "*.json" -type f -exec grep -h '"provider"' {} \; | sort | uniq

Repository: opendatahub-io/data-processing

Length of output: 350


🏁 Script executed:

#!/bin/bash
# Search for gemini model identifiers in existing configs
find notebooks/teigaku_genzei/local_catalog -name "*.json" -type f -exec grep -h '"model"' {} \; | sort | uniq

Repository: opendatahub-io/data-processing

Length of output: 864


🌐 Web query:

Vertex AI Gemini 2.5 Flash model identifier format and valid names for API calls

💡 Result:

Short answer — use the Publisher model ID or the Vertex model resource name.

  • Publisher model identifier (used in most examples): publishers/google/models/{MODEL_ID}
Example MODEL_ID values: gemini-2.5-flash, gemini-2.5-flash-image, gemini-2.5-flash-lite, gemini-2.5-pro, gemini-2.5-pro-preview-, and preview variants such as gemini-2.5-flash-preview-05-20. Use that full path for batch jobs and many REST examples. [1][2][6]

  • Vertex resource name for tuned/deployed models (used for endpoint/model calls): projects/{PROJECT_ID}/locations/{LOCATION}/models/{MODEL_ID}. Example: projects/myproj/locations/us-central1/models/123456789. [2]

  • OpenAI-compatible string (when using the OpenAI‑style endpoint): "google/{MODEL_ID}" (e.g., "google/gemini-2.5-flash"). [3]

References:
[1] Google Cloud — Function calling / model examples (shows model_id gemini-2.5-flash).
[2] Google Cloud — Batch prediction & model path examples (publishers/google/models/{MODEL_ID}, and projects/.../models/{MODEL_ID}).
[3] Google Cloud — Function calling reference (OpenAI-compatible "google/{MODEL_ID}" usage).


Fix Vertex AI model identifier format: use gemini-2.5-flash (dots), not gemini-2-5-flash (hyphens).

Change "model": "gemini-2-5-flash" to "model": "gemini-2.5-flash". Vertex AI's Publisher model identifier uses dots between version components. The current format will fail at runtime when the inference engine attempts to resolve the model.

🤖 Prompt for AI Agents
In
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/vertex_ai/gemini_2_5_flash.json
around lines 1 to 13, the Vertex AI model identifier is using hyphens
("gemini-2-5-flash") which is invalid; update the model field to use the correct
Publisher identifier with dots ("gemini-2.5-flash") so the inference engine can
resolve the model at runtime.

Comment on lines +153 to +177
def do_inference(model_path: str, api_key: str, api_url: str, server_type: str, dataset: Dataset) -> ListWithMetadata[str] | ListWithMetadata[TextGenerationInferenceOutput]:
model = OpenAiInferenceEngine(
model_name=model_path,
max_tokens=MAX_OUTPUT_TOKENS,
credentials={} if api_key is None else {"api_key": api_key},
base_url=api_url,
# use_cache=False,
# use_fast_tokenizer=False,
) if server_type == "openai" else VLLMRemoteInferenceEngine(
model_name=model_path,
max_tokens=MAX_OUTPUT_TOKENS,
credentials={} if api_key is None else {"api_key": api_key},
base_url=api_url,
data_classification_policy=["public", "proprietary"],
# use_cache=False,
# use_fast_tokenizer=False,
) if server_type == "vllm-remote" else HFAutoModelInferenceEngine(
model_name=model_path,
max_new_tokens=MAX_OUTPUT_TOKENS,
# use_cache=False,
# use_fast_tokenizer=False,
)
predictions_new = model(dataset)
return predictions_new


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's find the file and understand the context better
cd /tmp || exit 1
find . -type f -name "test_qa.py" 2>/dev/null | head -5

Repository: opendatahub-io/data-processing

Length of output: 56


🏁 Script executed:

# Search for unitxt version and imports in the file
rg "unitxt" notebooks/teigaku_genzei/test_qa.py -B2 -A2 | head -40

Repository: opendatahub-io/data-processing

Length of output: 1021


🏁 Script executed:

# Check if there's a requirements file or setup.py mentioning unitxt version
fd "requirements.*|setup.py|pyproject.toml" | xargs grep -l "unitxt" 2>/dev/null

Repository: opendatahub-io/data-processing

Length of output: 115


🌐 Web query:

unitxt 1.24.0 HFAutoModelInferenceEngine OpenAiInferenceEngine return type Dataset inference

💡 Result:

In unitxt (v1.24.0+), InferenceEngine.infer(dataset, return_meta_data=False) accepts a list-of-dicts or a Dataset and returns a ListWithMetadata of predictions — either ListWithMetadata[str] (when return_meta_data=False) or ListWithMetadata[TextGenerationInferenceOutput] (when return_meta_data=True). [1][2]

Sources:

  • unitxt.inference (API docs). [1]
  • unitxt.inference (alternate/latest docs showing same infer signature). [2]

🌐 Web query:

unitxt evaluate() function accepted input formats predictions

💡 Result:

Short answer — evaluate() accepts the same predictions object returned by Unitxt inference engines or a plain Python list/iterable of per-instance predictions that align with the dataset. Typical accepted shapes:

  • A list (or iterable) of raw model outputs (strings or numbers) where each element corresponds to one dataset instance. (Common when you call an inference engine then pass its return value to evaluate.) [1][4]
  • A list of dicts/objects produced by Unitxt inference engines (contains fields like "prediction" / "processed_prediction", optionally other metadata). evaluate() and the CLI utilities accept the inference-engine output as-is. [1][4]
  • Any structure where processed predictions (after template postprocessing) can be matched to dataset instances and to references — i.e., the metric code expects the metric input types described in Unitxt docs (prediction as a string/float/list, references as a list of same-typed items). Postprocessing to the correct type is done via the template's postprocessors. [2]

References:

  1. Unitxt getting-started / example (evaluate usage and predictions returned from inference engines). [1]
  2. Unitxt metrics / metric inputs (explains required prediction/reference types and role of template postprocessors). [2]
  3. unitxt.evaluate_cli / run_evaluation (accepts list[Any] predictions and dataset). [4]

If you want, paste an example of your predictions object and dataset and I’ll confirm whether evaluate() will accept it or show how to adapt it.


🏁 Script executed:

# Let's check the full context around lines 153-177 and 304-320
cat -n notebooks/teigaku_genzei/test_qa.py | sed -n '153,177p'

Repository: opendatahub-io/data-processing

Length of output: 1316


🏁 Script executed:

# Also check lines 304-320
cat -n notebooks/teigaku_genzei/test_qa.py | sed -n '304,320p'

Repository: opendatahub-io/data-processing

Length of output: 1121


🏁 Script executed:

# Search for json.dumps and json.loads usage in the file
rg "json\.(dumps|loads)" notebooks/teigaku_genzei/test_qa.py -B3 -A3

Repository: opendatahub-io/data-processing

Length of output: 56


🏁 Script executed:

# Check what happens to predictions_new variable after line 177
cat -n notebooks/teigaku_genzei/test_qa.py | sed -n '153,210p'

Repository: opendatahub-io/data-processing

Length of output: 2817


🏁 Script executed:

# Find the jsonl_util module
fd "jsonl_util" notebooks/

Repository: opendatahub-io/data-processing

Length of output: 112


🏁 Script executed:

# Search for jsonl_util imports and usage
rg "jsonl_util" notebooks/teigaku_genzei/test_qa.py -B2 -A2

Repository: opendatahub-io/data-processing

Length of output: 653


🏁 Script executed:

# Look for the jsonl_util module in the codebase
find . -name "*jsonl*" -type f 2>/dev/null

Repository: opendatahub-io/data-processing

Length of output: 166


🏁 Script executed:

# Check if jsonl_util is defined in the same directory
ls -la notebooks/teigaku_genzei/ | grep -i json

Repository: opendatahub-io/data-processing

Length of output: 403


🏁 Script executed:

# Read the jsonl_util.py file to see how it handles serialization
cat -n notebooks/teigaku_genzei/jsonl_util.py

Repository: opendatahub-io/data-processing

Length of output: 658


🌐 Web query:

unitxt ListWithMetadata JSON serialization json.dumps

💡 Result:

Short answer — convert the ListWithMetadata into plain JSON-serializable structures (list + dict) or provide a json default/encoder. Example approaches:

  1. Explicitly convert to dict (recommended)
  • This produces a JSON object with "data" and "metadata".
import json
from unitxt.inference import ListWithMetadata

lwm = ListWithMetadata(["a","b"], metadata={"source":"x"})
payload = {"data": list(lwm), "metadata": lwm.metadata}
s = json.dumps(payload)
  2. Use json.dumps default= to handle it generically
def default(o):
    if hasattr(o, "metadata") and isinstance(o, list):
        return {"data": list(o), "metadata": getattr(o, "metadata")}
    raise TypeError(f"Not JSON serializable: {type(o)}")

s = json.dumps(lwm, default=default)
  3. Subclass JSONEncoder (alternative)
import json
class LWMEncoder(json.JSONEncoder):
    def default(self, o):
        if hasattr(o, "metadata") and isinstance(o, list):
            return {"data": list(o), "metadata": o.metadata}
        return super().default(o)

s = json.dumps(lwm, cls=LWMEncoder)

Notes:

  • unitxt.ListWithMetadata is a list subclass that accepts metadata at construction. [1]
  • json.dumps accepts default or cls to customize serialization. [2]
  • Caveat: because ListWithMetadata subclasses list, json serializes it as a plain array and never invokes the default/cls hooks, so approaches 2 and 3 above silently drop the metadata; approach 1 (explicit conversion) is the reliable path.

Sources:
[1] unitxt docs (unitxt.inference — ListWithMetadata).
[2] Python json.dumps / JSONEncoder docs.


🌐 Web query:

unitxt ListWithMetadata class definition structure

💡 Result:

unitxt.ListWithMetadata is a thin subclass of Python list that adds an optional metadata dict. Its documented signature and type info:

  • class unitxt.inference.ListWithMetadata(*args, metadata: dict | None = None, **kwargs) — Bases: List[T]. [1]

(The class is defined in the unitxt.inference module and accepts the usual list args plus an optional metadata parameter.) [1]

Sources

  • unitxt.inference documentation (ListWithMetadata signature). [1]

jsonl_util.write_jsonl_file() does not round-trip ListWithMetadata objects—handle serialization explicitly before saving.

Line 313 calls jsonl_util.write_jsonl_file(predictions_file, predictions_new) where predictions_new is a ListWithMetadata (from do_inference()). Since json.dumps() in jsonl_util.py has no custom encoder, a ListWithMetadata of strings serializes like a plain list and silently drops its metadata, while a ListWithMetadata[TextGenerationInferenceOutput] fails outright because its elements are not JSON-serializable; explicit conversion to a plain dict structure (e.g., {"data": list(predictions_new), "metadata": predictions_new.metadata}) is needed either way. Additionally, read_jsonl_file() returns list[dict] or list[str] from json.loads(), creating a type mismatch when mixed with ListWithMetadata objects at line 318 and passed to evaluate() at line 320.

🤖 Prompt for AI Agents
In notebooks/teigaku_genzei/test_qa.py around lines 153 to 177, do_inference()
returns a ListWithMetadata which jsonl_util.write_jsonl_file cannot serialize;
before calling write_jsonl_file (and before mixing with read_jsonl_file output),
convert predictions_new into plain serializable structures—e.g., serialize as a
dict with keys like "data": list(predictions_new) and "metadata":
predictions_new.metadata, or write the raw list and write metadata
separately—then update the subsequent read path to reconstruct a
ListWithMetadata (or always pass plain lists) so evaluate() receives consistent,
JSON-serializable input rather than a ListWithMetadata object.
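A minimal round-trip sketch under two assumptions (string predictions and a one-JSON-object-per-line file), since jsonl_util's exact format isn't quoted in this thread:

import json
from unitxt.inference import ListWithMetadata

def save_predictions(path: str, preds: ListWithMetadata) -> None:
    with open(path, "w", encoding="utf-8") as f:
        # First line carries the metadata; remaining lines carry one prediction each.
        f.write(json.dumps({"metadata": preds.metadata}, ensure_ascii=False) + "\n")
        for p in preds:
            f.write(json.dumps({"prediction": p}, ensure_ascii=False) + "\n")

def load_predictions(path: str) -> ListWithMetadata:
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    return ListWithMetadata([r["prediction"] for r in rows[1:]],
                            metadata=rows[0]["metadata"])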

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
notebooks/teigaku_genzei/teigaku_genzei.ipynb (1)

83-90: Consider using a more semantically appropriate placeholder for the OCR language parameter.

While the comments correctly explain that RapidOCR ignores the language parameter, using ["en"] for Japanese document processing is semantically confusing. Consider one of these options:

  1. Use ["ja"] or ["jpn"] as a placeholder to better reflect the document language
  2. Use an empty list [] if the parameter is truly ignored
  3. Add a more prominent note that users should switch to Tesseract OCR if Japanese OCR is needed

This would make the configuration more self-documenting for future maintainers.

Example:

         "ocr": {
             "enabled": True,  # Enable/disable OCR processing
-            "languages": ["en"],  # List of language codes (e.g., eng, fra, deu)
-            # Note that the OCR language has no effect in this example since we discard information from images.
-            # Note that the OCR engine that is automatically assigned is RapidOCR. 
-            # The language option is not supported for RapidOCR in Docling.
-            # Therefore, the value "en" specified here is ignored.
+            "languages": ["ja"],  # Placeholder: RapidOCR ignores this parameter
+            # Note: RapidOCR (default OCR engine) does not support the language parameter.
+            # To enable Japanese OCR, switch to Tesseract and use language code "jpn".
             # See docling.datamodel.pipeline_options.RapidOcrOptions for the detail.
         },
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e0ac802 and d09a297.

📒 Files selected for processing (1)
  • notebooks/teigaku_genzei/teigaku_genzei.ipynb (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Summary
🔇 Additional comments (7)
notebooks/teigaku_genzei/teigaku_genzei.ipynb (7)

185-188: Defensive type check suggests API uncertainty.

The isinstance check on line 185 suggests that pipeline_options.table_structure_options might not always be a TableStructureOptions instance. This is reasonable defensive coding, but if the API guarantees this type, the check could be simplified.

If this check is necessary due to API behavior (e.g., it could be None), consider adding a brief comment explaining when this might occur.


317-337: LGTM: Directory creation properly handled.

The addition of os.makedirs(qa_table_dir, exist_ok=True) on line 326 correctly addresses the previous review comment about ensuring output directories exist before writing files.


364-383: LGTM: Directory creation properly handled.

The addition of os.makedirs(context_dir, exist_ok=True) on line 375 correctly addresses the previous review comment.


414-429: LGTM: Directory creation properly handled.

The addition of os.makedirs(os.path.dirname(output_icl_path), exist_ok=True) on line 420 correctly addresses the previous review comment about creating the parent directory.


466-480: LGTM: Directory creation properly handled.

The addition of os.makedirs(os.path.dirname(output_seed_path), exist_ok=True) on line 471 correctly addresses the previous review comment about ensuring the parent directory exists.


199-224: Export logic is well-structured.

The export_document function properly handles multiple export formats with appropriate error handling and logging. The use of getattr to dynamically call export methods is clean and maintainable.


226-274: Main conversion function follows best practices.

The export_document_new_docling function properly handles directory creation (line 249), counts successes/failures, and provides comprehensive logging. The overall structure is clean and maintainable.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
notebooks/teigaku_genzei/json_qa.py (5)

1-6: Consider removing the TODO or converting it to a module docstring.

The script appears complete and functional. If the TODO is describing what the script does (rather than tracking incomplete work), convert it to a proper module docstring.

Apply this diff:

-# TODO:
-# Read a given JSON file and extract fields that corresponds to QA pairs.
-# Save the QA pairs to a CSV file.
-# Both the input file name and the output file name are specified as command line arguments.
-
-# Assisted by watsonx Code Assistant 
+"""
+Read a given JSON file and extract fields that correspond to QA pairs.
+Save the QA pairs to a CSV file.
+Both the input file name and the output file name are specified as command line arguments.
+
+Assisted by watsonx Code Assistant
+"""

16-38: Well-structured text extraction logic.

The function correctly filters body text while excluding page numbers and section titles. The nested helper functions and inline comments make the logic clear.

Optionally, consider adding a docstring to describe the function's purpose and parameters:

 def get_text_block(text_list: list[Any], begin_pos: int, end_pos: int) -> str:
+    """Extract body text from a slice of the text_list, excluding furniture and section headers.
+    
+    Args:
+        text_list: List of text objects with content_layer, label, and text attributes
+        begin_pos: Starting index (inclusive)
+        end_pos: Ending index (exclusive)
+    
+    Returns:
+        Concatenated body text from the specified range
+    """

40-56: Add type hints to improve clarity.

The extraction logic is sound, and the use of strict=True on line 51 (addressed from previous review) properly validates that position lists align. Adding type hints would improve the function signature's readability.

Apply this diff:

-def extract_qa_pairs(data):
+def extract_qa_pairs(data: dict[str, Any]) -> list[tuple[str, str, str]]:

58-63: Add type hints for consistency.

The CSV writing logic is correct with proper UTF-8 encoding. Adding type hints would maintain consistency with get_text_block.

Apply this diff:

-def save_to_csv(qa_pairs, output_file):
+def save_to_csv(qa_pairs: list[tuple[str, str, str]], output_file: str) -> None:

65-77: Consider adding error handling for better user experience.

The CLI flow is correct, but the script will crash with Python tracebacks if the input file is missing, JSON is malformed, or write permissions are denied. For a utility script this may be acceptable, but wrapping the core logic in try-except would provide clearer error messages.

Example:

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python script.py <input_json_file> <output_csv_file>")
        sys.exit(1)

    input_file = sys.argv[1]
    output_file = sys.argv[2]

    try:
        data = json_util.read_json_file(input_file)
        qa_pairs = extract_qa_pairs(data)
        save_to_csv(qa_pairs, output_file)
        print(f"QA pairs have been successfully saved to {output_file}")
    except FileNotFoundError as e:
        print(f"Error: Input file not found - {e}")
        sys.exit(1)
    except (json.JSONDecodeError, KeyError, ValueError) as e:
        print(f"Error: Failed to process input - {e}")
        sys.exit(1)
    except OSError as e:
        print(f"Error: Failed to write output - {e}")
        sys.exit(1)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d09a297 and 5d5812b.

📒 Files selected for processing (1)
  • notebooks/teigaku_genzei/json_qa.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
notebooks/teigaku_genzei/json_qa.py (1)
notebooks/teigaku_genzei/json_util.py (5)
  • is_title (6-7)
  • is_toc (9-10)
  • is_qhead (15-16)
  • is_ahead (18-19)
  • read_json_file (21-24)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Summary
🔇 Additional comments (1)
notebooks/teigaku_genzei/json_qa.py (1)

7-14: LGTM!

Imports are clean and appropriate for the script's functionality.

@ryokawajp
Author

Hi @alimaredia! Any comments on this PR?
