Teigaku Genzei (the fixed-amount tax reduction) is an example of SDG with Japanese documents. #26
Conversation
Walkthrough

Adds a notebook-driven pipeline and many supporting scripts, utilities, and local catalog entries to parse PDF outputs into QA/glossary CSVs, compose section contexts, generate ICL examples and SDG seed JSONL, and register/run multiple LLM-evaluator configurations via a local Unitxt catalog.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Notebook as Notebook
    participant Docling as Docling/Parser
    participant JSONTools as json_qa/json_glossary
    participant Context as make_context.py
    participant ICL as make_icl.py
    participant Seed as make_seed.py
    participant Evaluator as test_qa.py
    User->>Notebook: run pipeline cells
    Notebook->>Docling: export_document_new_docling(PDFs)
    Docling-->>Notebook: parsed JSON files
    Notebook->>JSONTools: run json_qa / json_glossary
    JSONTools-->>Notebook: QA CSV, Glossary CSV
    Notebook->>Context: make_context.py(QA, glossary)
    Context-->>Notebook: context CSV
    Notebook->>ICL: make_icl.py(QA CSV, context CSV)
    ICL-->>Notebook: icl.jsonl
    Notebook->>Seed: make_seed.py(context CSV, icl.jsonl)
    Seed-->>Notebook: seed JSONL
    Notebook->>Evaluator: test_qa.py(seed or test data)
    Evaluator-->>User: evaluation results (JSON/CSV)
```
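Read end to end, the diagram is a linear pipeline. A minimal driver sketch of that order follows; the script names come from this PR, the flags for `make_context.py`, `make_icl.py`, and `make_seed.py` are taken from the shell wrappers quoted later in this review, and the positional arguments to `json_qa.py` / `json_glossary.py` are assumptions for illustration only.

```python
# Hedged sketch of the pipeline order shown above; not the PR's notebook.
import subprocess


def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


prefix = "0024001-021"  # one of the document prefixes used in this PR
# json_qa.py / json_glossary.py argument shapes are assumptions
run("python", "json_qa.py", f"docparser/{prefix}.json", f"qa_table/{prefix}.csv")
run("python", "json_glossary.py", f"docparser/{prefix}.json", f"qa_table/{prefix}_glossary.csv")
run("python", "make_context.py", "--glossary_option", "appendix",
    "--qa_file", f"qa_table/{prefix}.csv",
    "--glossary_file", f"qa_table/{prefix}_glossary.csv",
    "--out_context_file", f"context/{prefix}_context.csv")
run("python", "make_icl.py", "--short_context", "True",
    "--qa_file", f"qa_table/{prefix}.csv",
    "--glossary_file", f"qa_table/{prefix}_glossary.csv",
    "--context_file", f"context/{prefix}_context.csv",
    "--out_icl_file", "icl/icl.jsonl")
run("python", "make_seed.py", "--context_file", f"context/{prefix}_context.csv",
    "--icl_file", "icl/icl.jsonl", "--out_seed_file", "tmp/seed_ja.jsonl",
    "--join_method", "cartesian")
```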
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 40
♻️ Duplicate comments (1)
scripts/teigaku_genzei/json_qa.py (1)
15-18: Code duplication flagged in related file.
This `read_json_file` duplication is addressed in the review comment for `json_glossary.py` (lines 15-18).
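A minimal sketch of the suggested consolidation: hoist the duplicated helper into the shared `json_util.py` module and import it from both scripts. The function name comes from the review; the body below is an assumption based on typical JSON loading.

```python
# json_util.py -- shared helper so json_qa.py and json_glossary.py stop
# carrying their own copies (sketch; the original signature is assumed)
import json
from typing import Any


def read_json_file(file_path: str) -> Any:
    """Load one JSON document from disk as UTF-8."""
    with open(file_path, "r", encoding="utf-8") as f:
        return json.load(f)
```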
🧹 Nitpick comments (29)
scripts/teigaku_genzei/remove_newline_from_csv.py (1)
1-13: Consider using argparse for a better CLI experience.
The script would benefit from proper argument parsing with help text and validation, following Python CLI best practices.
Consider this refactor using argparse:
""" Remove newline characters from all object-typed columns in a CSV file. """ import argparse import sys from pathlib import Path import pandas as pd def remove_newlines_from_csv(input_file: Path, output_file: Path) -> None: """Process CSV file by replacing newlines with spaces in object columns.""" df = pd.read_csv(input_file, encoding="utf-8") col_sr_dict = { col: df[col].str.replace("\n", " ") if df[col].dtype == "object" else df[col] for col in df.columns } df2 = pd.DataFrame(col_sr_dict) df2.to_csv(output_file, encoding="utf-8", index=False) def main() -> None: parser = argparse.ArgumentParser( description="Remove newline characters from object columns in a CSV file" ) parser.add_argument("input_csv", type=Path, help="Input CSV file path") parser.add_argument("output_csv", type=Path, help="Output CSV file path") args = parser.parse_args() if not args.input_csv.exists(): print(f"Error: Input file '{args.input_csv}' not found", file=sys.stderr) sys.exit(1) try: remove_newlines_from_csv(args.input_csv, args.output_csv) except pd.errors.ParserError as e: print(f"Error parsing CSV: {e}", file=sys.stderr) sys.exit(1) except Exception as e: print(f"Unexpected error: {e}", file=sys.stderr) sys.exit(1) if __name__ == "__main__": main()This approach provides:
- Automatic help text with
-h/--help- Better error messages
- Type hints for maintainability
- Separation of concerns with dedicated functions
- Standard
if __name__ == "__main__"guardscripts/teigaku_genzei/parser.sh (1)
4-4: Remove or document the commented-out code.
The commented-out old path serves no purpose. Either remove it or add a comment explaining why it's kept for reference.
scripts/teigaku_genzei/qna_yaml_2_qa_csv.py (1)
2-4: Remove completed TODO comments.
The TODO comments describe functionality that has already been implemented. Remove them to keep the code clean.
Apply this diff:
```diff
-
-# TODO: load YAML file.
-# TODO: for each "questions_and_answers", select "question" and "answer".
-# TODO: save as a CSV file.
-
 import sys
```
scripts/teigaku_genzei/context_util.py (1)
4-19: Consider adding error handling for missing DataFrame columns.
All three functions access DataFrame columns without checking for their existence. If the DataFrame doesn't contain the expected columns (`Title`, `Answer`, `Term`, `Description`), a `KeyError` will be raised at runtime.

Consider validating column presence or catching `KeyError` exceptions, especially if the data source is external or user-provided. For example:

```python
def compose_text_block(sr: pd.Series) -> str:
    if "Title" not in sr or "Answer" not in sr:
        raise ValueError("Missing required columns: Title or Answer")
    block = sr["Title"] + "\n" + sr["Answer"].removeprefix("[Ａ]")
    return block
```
scripts/teigaku_genzei/catalog_util.py (2)
5-7: Module-level side effects may cause issues.
Lines 7 and 22 execute function calls at module import time (`register_local_catalog` and `get_system_catalog_path`). This can:
- Make the module difficult to test
- Cause unexpected behavior when importing
- Produce console output during import

Consider moving these calls into a `setup()` or `main()` function that callers invoke explicitly, or use the `if __name__ == "__main__":` guard for the print statement.

Also applies to: 22-22
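A sketch of that refactor, assuming the module keeps its current names; the `register_local_catalog` import and the artifact name passed to `get_system_catalog_path` are assumptions for illustration.

```python
# catalog_util.py (sketch): no work happens at import time any more
from unitxt import register_local_catalog  # assumed import, as used by this module

LOCAL_CATALOG_PATH = "local_catalog"


def get_system_catalog_path(artifact_name: str) -> str:
    ...  # existing implementation (lines 16-20), unchanged


def setup() -> None:
    """Callers opt in to catalog registration explicitly."""
    register_local_catalog(LOCAL_CATALOG_PATH)


if __name__ == "__main__":
    setup()
    # the import-time print moves under the main guard
    print(get_system_catalog_path("metrics.llm_as_judge"))  # argument illustrative
```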
5-5: Add spacing around the assignment operator.
PEP 8 recommends spacing around the `=` operator for readability.

Apply this diff:

```diff
-LOCAL_CATALOG_PATH="local_catalog"
+LOCAL_CATALOG_PATH = "local_catalog"
```
scripts/teigaku_genzei/make_seed.sh (1)
1-4: Consider adding error handling and validation.
The script executes a Python command without checking for errors or validating inputs. Consider adding:
- A check that `make_seed.py` exists
- Validation that the input files exist
- Exit on error with `set -e`
- A check of the Python command's exit status

Example improvements:

```diff
 #!/bin/sh
+set -e  # Exit on any error
+
+# Validate that required files exist
+if [ ! -f "./make_seed.py" ]; then
+  echo "Error: make_seed.py not found"
+  exit 1
+fi
+
+if [ ! -f "context/0024001-021_context.csv" ]; then
+  echo "Error: context file not found"
+  exit 1
+fi
+
+if [ ! -f "icl/icl.jsonl" ]; then
+  echo "Error: ICL file not found"
+  exit 1
+fi
+
 # python ./make_seed.py context/0024001-021_context.csv tmp/icl.jsonl tmp/seed_ja.jsonl
 python ./make_seed.py --context_file context/0024001-021_context.csv --icl_file icl/icl.jsonl --out_seed_file tmp/seed_ja.jsonl --join_method cartesian
```
scripts/teigaku_genzei/llm_as_judge_direct_rits_phi.py (1)
13-19: Consider removing or documenting the commented code.
The commented-out CrossProviderInferenceEngine alternative (Lines 13-19) appears to be leftover from development. If this alternative is intended as a reference for users who might want to use a different engine, consider adding a brief comment explaining when and why they might use it. Otherwise, remove it to reduce code clutter.
If documenting:
```diff
+    # Alternative: Use CrossProviderInferenceEngine for dynamic provider selection
     # inference_engine=CrossProviderInferenceEngine(
```

Or simply remove the commented block if it's no longer needed.
scripts/teigaku_genzei/llm_as_judge_direct_criteria_correctness2.py (1)
17-42: Consider removing commented code once the workaround is stable.
The extensively commented-out option names (Lines 19, 24, 29) and option_map entries (Lines 35-37) show the original semantic naming. While this provides context for the workaround, it adds significant clutter. If this numeric workaround is confirmed to work reliably, consider removing the commented code and keeping only a brief explanation in the header comment.
Alternatively, if you want to preserve the original intent for future reference, consolidate it into a single block comment at the top of the function rather than inline throughout the code.
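For reference, the shape of the workaround as a plain data structure. This is illustrative only, not the unitxt API: the option descriptions are placeholders, while the numeric option names and the 3→1.0 / 2→0.5 / 1→0.0 mapping mirror `correctness_based_on_ground_truth2.json`.

```python
# Illustrative dict mirroring the catalog entry; descriptions are placeholders.
criteria_numeric_workaround = {
    "name": "correctness_based_on_ground_truth2",
    "options": [
        {"name": "3", "description": "placeholder: fully correct w.r.t. ground truth"},
        {"name": "2", "description": "placeholder: partially correct / incomplete"},
        {"name": "1", "description": "placeholder: wrong"},
    ],
    # numeric option names sidestep the post-processing issue noted above
    "option_map": {"3": 1.0, "2": 0.5, "1": 0.0},
}
```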
scripts/teigaku_genzei/test_single_vllm.sh (1)
31-31: Add error handling for directory creation.
The `mkdir` command (Line 31) will fail if the directory already exists or if there are permission issues. Consider using `mkdir -p` to handle existing directories gracefully and add error checking.

Apply this diff:

```diff
-mkdir ${OUT_DIR}
+mkdir -p "${OUT_DIR}" || { echo "Failed to create output directory: ${OUT_DIR}"; exit 1; }
```

Also note the addition of quotes around the variable to handle paths with spaces.
scripts/teigaku_genzei/test_single.sh (1)
27-27: Add error handling for directory creation.
The `mkdir` command (Line 27) will fail if the directory already exists or if there are permission issues. Consider using `mkdir -p` to handle existing directories gracefully and add error checking.

Apply this diff:

```diff
-mkdir ${OUT_DIR}
+mkdir -p "${OUT_DIR}" || { echo "Failed to create output directory: ${OUT_DIR}"; exit 1; }
```

Also note the addition of quotes around the variable to handle paths with spaces.
scripts/teigaku_genzei/make_icl.sh (1)
1-3: Consider adding error handling and documentation.
This script is a simple wrapper but lacks error handling. If `make_icl.py` fails or doesn't exist, the script will silently continue. Additionally, the hardcoded paths may not be obvious to users.

Consider adding:
- Error handling for the Python script execution
- A brief comment explaining the script's purpose and the hardcoded paths

Example:

```diff
 #!/bin/sh
+
+# Generate ICL (In-Context Learning) data from QA, glossary, and context files
+# Note: Update the file paths below to match your data location
-python ./make_icl.py --short_context True --qa_file qa_table/0024004-072_01.csv --glossary_file qa_table/0024004-072_01_glossary.csv --context_file context/0024004-072_01_context.csv --out_icl_file icl/icl.jsonl
+python ./make_icl.py \
+    --short_context True \
+    --qa_file qa_table/0024004-072_01.csv \
+    --glossary_file qa_table/0024004-072_01_glossary.csv \
+    --context_file context/0024004-072_01_context.csv \
+    --out_icl_file icl/icl.jsonl || { echo "Failed to generate ICL data"; exit 1; }
```

The line breaks also improve readability.
scripts/teigaku_genzei/jsonl_util.py (1)
7-12: Remove commented code and useless return.
The commented line at Line 11 should be removed, and the explicit return at Line 12 is unnecessary in Python (functions return None implicitly).
Apply this diff:
```diff
 def write_jsonl_file(file_path: str, data_list: list[Any]):
-    with open(file_path, 'w') as file:
+    with open(file_path, 'w', encoding='utf-8') as file:
         for data in data_list:
             file.write(json.dumps(data, ensure_ascii=False) + '\n')
-        # json.dump(data, file, ensure_ascii=False)
-    return
```
scripts/teigaku_genzei/seed_jsonl_2_qna_yaml.py (1)
47-47: Remove useless return statement.
Python functions return `None` implicitly, so the explicit `return` is unnecessary.

Apply this diff:

```diff
     yaml.safe_dump(qna_template_dict, out_f, allow_unicode=True, encoding="utf8",
                    default_flow_style=False, default_style="|", sort_keys=False)
-
-    return
```
scripts/teigaku_genzei/extract_icl_from_seed.py (2)
1-5: Remove unused pandas import.
The `pandas` module is imported but never used in this script.

Apply this diff:

```diff
 import sys
 from typing import cast, Any, Hashable, Sequence, Collection
-import pandas as pd
 import json_util
```
49-53: Simplify the dictionary construction. (As per coding guidelines.)
The dict comprehension can be replaced with the built-in `dict()` constructor for clarity.

Apply this diff:

```diff
-    unique_icl_dict = {
-        h: obj for (h, obj) in hash_list
-    }
+    unique_icl_dict = dict(hash_list)
```
scripts/teigaku_genzei/json_util.py (1)
20-33: Remove unreachable return statements.
Both JSONL I/O functions have unnecessary return statements that are either useless (line 25) or unreachable (line 33).
Apply this diff:
```diff
 def write_jsonl_file(file_path: str, data_list: list[Any]):
     with open(file_path, 'w') as file:
         for data in data_list:
             file.write(json.dumps(data, ensure_ascii=False) + '\n')
-        # json.dump(data, file, ensure_ascii=False)
-    return

 def read_jsonl_file(file_path: str) -> list[Any]:
     with open(file_path, 'r') as file:
         data_list = [
             json.loads(line)
             for line in file
         ]
         return data_list
-    return []
```
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
26-61: Consider unpacking for list concatenation.
The compute method correctly augments base results with positional bias metrics. Line 47 could use unpacking for slightly cleaner list construction.
Optional refactor for line 47:
```diff
-        self.reduction_map = {"mean": [self.main_score] + self.ci_scores}
+        self.reduction_map = {"mean": [self.main_score, *self.ci_scores]}
```
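As background for readers new to the metric, a conceptual sketch of positional bias follows. This is not the unitxt implementation; it only illustrates the idea that a direct judge should return the same verdict when the rubric options are presented in reverse order. `judge_score` is a hypothetical callable.

```python
# Conceptual illustration only: re-score with the option order reversed and
# compare. judge_score is a hypothetical callable, not a unitxt API.
from typing import Callable, Sequence


def positional_bias(
    judge_score: Callable[[str, Sequence[str]], float],
    instance: str,
    options: Sequence[str],
) -> float:
    forward = judge_score(instance, list(options))
    backward = judge_score(instance, list(reversed(options)))
    return abs(forward - backward)  # 0.0 means the verdict ignores option order
```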
scripts/teigaku_genzei/json_qa.py (1)
44-69: Simplify the list comprehension on line 55.
Line 55 uses an unnecessary comprehension to convert an iterator to a list. The subsequent slice `[1:]` can be applied directly to the list.

Apply this diff:

```diff
-    atail_pos_list = [i for i in title_pos_list][1:] + [len(text_list)]
+    atail_pos_list = list(title_pos_list)[1:] + [len(text_list)]
```

Alternatively, since `title_pos_list` is already a list (from line 52), you can simplify further:

```diff
-    atail_pos_list = [i for i in title_pos_list][1:] + [len(text_list)]
+    atail_pos_list = title_pos_list[1:] + [len(text_list)]
```
scripts/teigaku_genzei/make_context.py (1)
95-95: Remove useless return statement.
The return statement at the end of `main()` is unnecessary.

Apply this diff:

```diff
     out_df.to_csv(output_context_path, index=False, encoding="utf8")
-
-    return
```
scripts/teigaku_genzei/make_seed.py (2)
19-27: Make input/output arguments required to fail fast.
Avoid passing None into I/O functions.

```diff
-    parser.add_argument('--' + ARG_INPUT_CONTEXT_FILE, type=str, default=None)
-    parser.add_argument('--' + ARG_INPUT_ICL_FILE, type=str, default=None)
-    parser.add_argument('--' + ARG_OUTPUT_SEED_FILE, type=str, default=None)
+    parser.add_argument('--' + ARG_INPUT_CONTEXT_FILE, type=str, required=True)
+    parser.add_argument('--' + ARG_INPUT_ICL_FILE, type=str, required=True)
+    parser.add_argument('--' + ARG_OUTPUT_SEED_FILE, type=str, required=True)
```

105-105: Drop debug print or gate it behind a verbose flag.
Printing columns on every run is noisy for CLI pipelines.

```diff
-    print(out_df.columns)
+    # print(out_df.columns)  # enable for debugging
```
scripts/teigaku_genzei/make_icl.py (4)
43-54: Remove unused qa_df param from generate_QA_combinations.
The parameter is unused; simplify the signature.
```diff
-def generate_QA_combinations(row_sr: pd.Series, qa_df: pd.DataFrame, size: int)-> list[tuple[int, list[int]]]:
+def generate_QA_combinations(row_sr: pd.Series, size: int) -> list[tuple[int, list[int]]]:
     q_list = cast(list[int], json.loads(row_sr["qlist"]))
     if len(q_list) >= size:
         comb_triples = itertools.combinations(q_list, size)
         comb_triples_list = [(cast(int, row_sr.name), list(comb)) for comb in comb_triples]
         return comb_triples_list
-    elif len(q_list) > 0:
+    if len(q_list) > 0:
         comb_triples_list = [(cast(int, row_sr.name), (q_list + [q_list[0]] * (size - len(q_list))))]
         return comb_triples_list
-    else:
-        return []
+    return []
```
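For reviewers checking the branch logic above, a quick REPL check shows what the combination expansion produces; the padding branch only triggers for shorter lists. The sample values are illustrative.

```python
import itertools

q_list = [3, 8, 15, 21]
print(list(itertools.combinations(q_list, 3)))
# [(3, 8, 15), (3, 8, 21), (3, 15, 21), (8, 15, 21)]

# Padding branch (sketch): q_list = [7, 12] with size 3 yields [7, 12, 7],
# repeating the first question to reach the requested size.
```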
103-106: Update call site after signature change.
Drop the qa_df argument.

```diff
-    icl_qa_index_sr = ncontext_df.apply(lambda row: generate_QA_combinations(row, qa_df, NUM_ICL_QA_EXAMPLES), axis=1)
+    icl_qa_index_sr = ncontext_df.apply(lambda row: generate_QA_combinations(row, NUM_ICL_QA_EXAMPLES), axis=1)
```

91-96: Remove unused variable `option`.
Assigned but not used.

```diff
-    option = args[ARG_OPTION]
+    # option = args[ARG_OPTION]
```
77-79: Verify fullwidth 'Ａ' in Answer prefix.
Ensure the dataset actually uses "[Ａ]". If some rows use ASCII "[A]", consider normalizing both.

Optionally:

```diff
-                "icl_response_" + str(i + 1): qa_df.loc[qa]["Answer"].removeprefix("[Ａ]").strip(),
+                # Strip either fullwidth or ASCII A within brackets
+                "icl_response_" + str(i + 1): (
+                    qa_df.loc[qa]["Answer"]
+                    .removeprefix("[Ａ]")
+                    .removeprefix("[A]")
+                    .strip()
+                ),
```
scripts/teigaku_genzei/test_qa.py (3)
348-349: Use sys.exit for predictable termination.
`exit()` is intended for interactive use.

```diff
-    exit()
+    sys.exit(0)
```

370-372: Guard writing results when the output path is None.
Avoid crashing if the CLI omits `--out_result_file`.

```diff
-    write_results_csv(output_file, results)
+    if output_file:
+        write_results_csv(output_file, results)
+    else:
+        print("No out_result_file provided; skipping CSV write.")
```

98-99: Drop the noisy print of the full test DataFrame.
Reduce stdout noise in batch runs.

```diff
-    print(data_df)
+    # print(data_df)  # enable for debugging
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (103)
Directly under scripts/teigaku_genzei/ (each 1 hunk): .gitignore, README.md, catalog_util.py, context/.gitignore, context_util.py, docparser/.gitignore, extract_icl_from_seed.py, extract_qa.sh, icl/.gitignore, json_glossary.py, json_qa.py, json_util.py, jsonl_util.py, llm_as_judge_direct_criteria_correctness2.py, llm_as_judge_direct_positional_bias.py, llm_as_judge_direct_provider_importer.py, llm_as_judge_direct_rits_llama4.py, llm_as_judge_direct_rits_phi.py, make_context.py, make_context.sh, make_icl.py, make_icl.sh, make_seed.py, make_seed.sh, parser.sh, qa_table/.gitignore, qna_yaml_2_qa_csv.py, remove_newline_from_csv.py, requirements-dev.txt, requirements.txt, seed_jsonl_2_qna_yaml.py, source/.gitignore, test_all.sh, test_nencho_single.sh, test_qa.py, test_single.sh, test_single_pred_eval.sh, test_single_vllm.sh, url.txt

Under scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/ (each 1 hunk):
- direct/criteria/: correctness_based_on_ground_truth2.json
- direct/rits/: llama4_maverick.json, phi_4.json
- direct_positional_bias/aws/: deepseek_v3.json, llama3_1_405b.json, llama3_3_70b.json, llama4_maverick.json, llama4_scout.json, mixtral_large.json
- direct_positional_bias/azure_openai/: gpt_4_1.json, gpt_4_1_mini.json, gpt_4_1_nano.json, gpt_4o.json, llama3_3_70b.json, llama4_maverick.json, llama4_scout.json, o1_mini.json, o1_preview.json
- direct_positional_bias/ollama/: granite3_3_8b.json, llama3_1_405b.json, llama3_1_70b.json, llama3_1_8b.json, llama3_3_70b.json
- direct_positional_bias/openai/: gpt_4_1.json, gpt_4_1_mini.json, gpt_4_1_nano.json, gpt_4o.json, o1_mini.json, o1_preview.json
- direct_positional_bias/replicate/: gpt_4_1.json, llama4_maverick.json, llama4_scout.json
- direct_positional_bias/rits/: deepseek_v3.json, granite3_0_8b.json, granite3_1_8b.json, granite3_2_8b.json, granite3_3_8b.json, llama3_1_405b.json, llama3_1_70b.json, llama3_1_8b.json, llama3_3_70b.json, llama4_maverick.json, llama4_scout.json, mixtral8_7b.json, mixtral_large.json, phi_4.json
- direct_positional_bias/together_ai/: deepseek_v3.json, llama3_1_405b.json, llama3_1_70b.json, llama3_1_8b.json, llama3_3_70b.json, llama4_maverick.json, llama4_scout.json
- direct_positional_bias/vertex_ai/: gemini_2_5_flash.json, gemmini_2_5_pro.json
- direct_positional_bias/watsonx/: granite3_0_8b.json, granite3_2_8b.json, granite3_3_8b.json, llama3_1_405b.json, llama3_1_70b.json, llama3_1_8b.json, llama3_3_70b.json, mixtral8_7b.json, mixtral_large.json
🧰 Additional context used
🧬 Code graph analysis (6)
scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py (1)
- scripts/teigaku_genzei/catalog_util.py (1): get_system_catalog_path (16-20)

scripts/teigaku_genzei/json_glossary.py (1)
- scripts/teigaku_genzei/json_util.py (1): is_title (6-7)

scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
- scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (1): RITSInferenceEngineFixed (17-24)

scripts/teigaku_genzei/make_context.py (2)
- scripts/teigaku_genzei/context_util.py (2): compose_context (8-12), compose_glossary (14-19)
- scripts/teigaku_genzei/json_util.py (1): get_title_match (9-10)

scripts/teigaku_genzei/json_qa.py (1)
- scripts/teigaku_genzei/json_util.py (3): is_title (6-7), is_qhead (12-13), is_ahead (15-16)

scripts/teigaku_genzei/make_icl.py (1)
- scripts/teigaku_genzei/context_util.py (1): compose_context (8-12)
🪛 markdownlint-cli2 (0.18.1)
scripts/teigaku_genzei/README.md
- MD034 (no-bare-urls), bare URL used: lines 7, 10, 11, 20, 21, 38, 82
- MD040 (fenced-code-language), fenced code blocks should have a language specified: lines 58, 66, 115, 140, 195, 210, 226
- MD007 (ul-indent), unordered list indentation, expected 2 / actual 4: lines 35, 37-39, 41, 47-48, 50-51, 79, 81, 83, 88-89, 91, 94-97, 101-104, 107, 109, 130-132, 147, 155, 157, 162-163, 165-167, 179-181, 183-186, 188-190, 221-222
- MD007 (ul-indent), unordered list indentation, expected 4 / actual 8: lines 40, 42, 52-53, 82, 90, 92, 105-106, 156, 158, 161
🪛 Pylint (3.3.9)
scripts/teigaku_genzei/json_util.py
- [refactor] 20-20: Useless return at end of function or method (R1711)

scripts/teigaku_genzei/jsonl_util.py
- [refactor] 7-7: Useless return at end of function or method (R1711)

scripts/teigaku_genzei/extract_icl_from_seed.py
- [refactor] 49-51: Unnecessary use of a comprehension, use dict(hash_list) instead (R1721)

scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py
- [refactor] 9-9: Too few public methods (1/2) (R0903)

scripts/teigaku_genzei/make_context.py
- [refactor] 56-56: Useless return at end of function or method (R1711)

scripts/teigaku_genzei/seed_jsonl_2_qna_yaml.py
- [refactor] 24-24: Useless return at end of function or method (R1711)

scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py
- [error] 24-24: Bad first argument 'RITSInferenceEngine' given to super() (E1003)
- [refactor] 17-17: Too few public methods (1/2) (R0903)

scripts/teigaku_genzei/test_qa.py
- [refactor] 109, 114, 130, 171, 210: Useless return at end of function or method (R1711)
- [refactor] 348-348: Consider using 'sys.exit' instead (R1722)

scripts/teigaku_genzei/json_qa.py
- [refactor] 55-55: Unnecessary use of a comprehension, use list(title_pos_list) instead (R1721)

scripts/teigaku_genzei/make_icl.py
- [refactor] 45-53: Unnecessary "elif" after "return", remove the leading "el" from "elif" (R1705)
🪛 Ruff (0.13.3)
scripts/teigaku_genzei/context_util.py
- 5-5: RUF001: string contains ambiguous Ａ (FULLWIDTH LATIN CAPITAL LETTER A)
- 8-8: ARG001: unused function argument: section

scripts/teigaku_genzei/json_util.py
- 7-7, 10-10: RUF001: strings contain ambiguous fullwidth ０, ９, and － (FULLWIDTH DIGIT ZERO/NINE, FULLWIDTH HYPHEN-MINUS)
- 16-16: RUF001: string contains ambiguous Ａ (FULLWIDTH LATIN CAPITAL LETTER A)

scripts/teigaku_genzei/json_glossary.py
- 37-37: RUF001: string contains ambiguous （ and ） (FULLWIDTH PARENTHESES)

scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py
- 47-47: RUF005: consider [self.main_score, *self.ci_scores] instead of concatenation

scripts/teigaku_genzei/catalog_util.py
- 9-9: ARG001: unused function argument: artifact_name

scripts/teigaku_genzei/qna_yaml_2_qa_csv.py
- 16-16: S506: probable use of unsafe loader FullLoader with yaml.load; allows instantiation of arbitrary objects, consider yaml.safe_load

scripts/teigaku_genzei/seed_jsonl_2_qna_yaml.py
- 62-62, 64-64: RUF001: strings contain ambiguous ？ and （）(FULLWIDTH QUESTION MARK, FULLWIDTH PARENTHESES)

scripts/teigaku_genzei/test_qa.py
- 89-89, 102-102: ARG001: unused function arguments: test_language, input_file
- 151-151: B023: function definition does not bind loop variable option_map
- 155-155: F841: local variable nbiased_score_sr_list is assigned to but never used
- 260-264: RUF005: consider iterable unpacking instead of concatenation

scripts/teigaku_genzei/json_qa.py
- 26-26: RUF001: string contains ambiguous （ and ） (FULLWIDTH PARENTHESES)
- 37-37, 39-39, 50-50: RUF003: comments contain ambiguous fullwidth Ａ, １, and －
- 55-55: RUF005: consider [*[i for i in title_pos_list][1:], len(text_list)] instead of concatenation

scripts/teigaku_genzei/make_icl.py
- 43-43: ARG001: unused function argument: qa_df
- 78-78: RUF001: string contains ambiguous Ａ (FULLWIDTH LATIN CAPITAL LETTER A)
- 91-91: F841: local variable option is assigned to but never used
🔇 Additional comments (35)
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/ollama/llama3_1_405b.json (1)
3-10: Double-check Ollama model availability.
`llama-3-1-405b-instruct` is not part of the standard Ollama model catalog (Ollama focuses on deployable on-device checkpoints ≤70B). Unless you maintain a custom Ollama endpoint that proxies this 405B variant, inference will fail immediately. Please confirm the model actually exists on your Ollama host or adjust to an available checkpoint.
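One hedged way to verify before wiring up the judge: ask the Ollama host which models it actually serves via its REST API (`GET /api/tags`). The host URL and model name below are illustrative.

```python
# Sketch: list locally available Ollama models and check for a candidate.
# The /api/tags endpoint is part of Ollama's public REST API.
import requests


def ollama_has_model(name: str, host: str = "http://localhost:11434") -> bool:
    tags = requests.get(f"{host}/api/tags", timeout=10).json()
    return any(m["name"].split(":")[0] == name for m in tags.get("models", []))


print(ollama_has_model("llama3.1"))  # model name is illustrative
```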
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/mixtral_large.json (1)
1-13: Configuration looks solid.
Schema and values line up with the existing llm_judge_direct_positional_bias pattern. No issues spotted.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/o1_preview.json (1)
8-9: No changes needed for the provider identifier.
`open-ai` is already the canonical ID used across all direct_positional_bias configs. Likely an incorrect or invalid review comment.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/gpt_4o.json (1)
1-13: LGTM! The OpenAI GPT-4o configuration is correctly structured with appropriate deterministic parameters (seed=42, temperature=0) for evaluation consistency. The model identifier follows OpenAI's versioning convention.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/granite3_2_8b.json (1)
1-13: LGTM! The WatsonX Granite 3.2 8B configuration follows the established pattern with consistent evaluation parameters. Configuration is appropriate for the target model.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/llama3_1_70b.json (1)
1-13: LGTM! The WatsonX Llama 3.1 70B configuration is correctly structured and consistent with the established patterns for direct positional bias evaluation.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/deepseek_v3.json (1)
1-13: LGTM! The Together AI DeepSeek V3 configuration correctly uses the namespaced model identifier format and maintains consistency with other provider configurations.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/granite3_1_8b.json (1)
1-13: LGTM! The RITS Granite 3.1 8B configuration is consistent with the established pattern. The RITS provider integration aligns with the PR's stated testing approach using a local inference setup.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/llama3_1_8b.json (1)
1-13: LGTM! The WatsonX Llama 3.1 8B configuration is correctly structured and consistent with the other provider configurations in this PR.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/llama3_3_70b.json (1)
1-13: LGTM! The configuration follows the established pattern for LLM judge direct positional bias setups. The inference parameters (max_tokens: 1024, seed: 42, temperature: 0) are consistent with other provider configurations in this PR, ensuring reproducible evaluations.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/llama3_1_405b.json (1)
1-13: LGTM! The configuration correctly defines the Llama 3.1 405B model for the Watsonx provider with consistent inference parameters.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/gpt_4_1_mini.json (1)
9-11: Verify the Azure endpoint is appropriate for example code.
The hardcoded Azure OpenAI endpoint (`eteopenai.azure-api.net`) appears environment-specific. For a reusable example, consider:
- Documenting that users must replace this URL with their own Azure deployment
- Using environment variables or a configuration file for the endpoint
- Providing a template or placeholder value
Is this endpoint intended as a placeholder, or is it a shared resource for testing?
scripts/teigaku_genzei/.gitignore (1)
1-170: LGTM! The .gitignore provides comprehensive coverage for Python projects, including Unitxt-specific artifacts and project-specific temporary directories. The organization with comments makes it maintainable.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/granite3_0_8b.json (1)
1-13: LGTM! The RITS provider configuration for the Granite 3.0 8B model is consistent with the other judge configurations in this PR.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/o1_mini.json (1)
8-8: Confirm the "provider" value.
All direct_positional_bias JSONs use "open-ai". Verify the inference engine expects this; otherwise rename it to "openai".
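A quick audit sketch that sidesteps the schema question entirely: scan the raw catalog JSON for every `"provider"` value and confirm a single canonical spelling. The path below assumes the repo layout from this PR.

```python
# Sketch: collect every "provider" string across the local catalog JSONs.
import pathlib
import re

root = pathlib.Path("scripts/teigaku_genzei/local_catalog")
values = sorted({
    m.group(1)
    for p in root.rglob("*.json")
    for m in re.finditer(r'"provider"\s*:\s*"([^"]+)"', p.read_text(encoding="utf-8"))
})
print(values)  # expect one canonical spelling per provider, e.g. "open-ai"
```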
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/aws/llama3_1_405b.json (1)
1-13: LGTM! The configuration is well-structured with appropriate settings for deterministic evaluation (temperature: 0, seed: 42).
scripts/teigaku_genzei/context_util.py (1)
4-6: Fullwidth character is intentional for Japanese text processing.
The fullwidth Latin capital letter "Ａ" in the `removeprefix` call is correct for processing Japanese documents, where such characters are commonly used. The static analysis warning can be safely ignored.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/llama3_1_405b.json (1)
1-13: LGTM! The RITS provider configuration follows the same deterministic evaluation pattern as other providers.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/granite3_2_8b.json (1)
1-13: LGTM! The Granite 3.2 8B configuration is consistent with other model configurations in this catalog.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/granite3_3_8b.json (1)
1-13: LGTM! The Granite 3.3 8B configuration completes the Granite 3.x series and follows the established pattern.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct/rits/llama4_maverick.json (1)
9-9: Document that the base_url requires customization.
The base_url points to an internal IBM infrastructure endpoint (inference-3scale-apicast-production.apps.rits.fmaas.res.ibm.com) which will not be accessible to external users of this example. Since this is an open-source example, consider adding a README note or comment indicating that users must update this URL to point to their own model deployment.
Consider adding a comment in the README or near this configuration explaining that users need to deploy their own model endpoint and update this URL accordingly.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/gpt_4_1_nano.json (1)
9-11: Document that the api_base requires customization for external users.
The api_base URL (Line 10) points to a specific Azure OpenAI deployment (eteopenai.azure-api.net) that external users won't have access to. Since this is an open-source example, users will need to replace this with their own Azure OpenAI endpoint.
Consider adding documentation (in a README or inline comment) indicating that users must:
- Set up their own Azure OpenAI deployment
- Update the api_base URL to match their deployment
- Configure appropriate authentication credentials
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct/rits/phi_4.json (1)
1-12: LGTM! The JSON configuration is well-formed and the `__type__` correctly matches the directory structure (`direct`, not `direct_positional_bias`). The inference engine settings are appropriate for deterministic evaluation.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/o1_mini.json (1)
1-16: LGTM! The configuration is well-structured and follows the pattern established by the other Azure OpenAI configurations in this PR. The deterministic settings (seed=42, temperature=0) are appropriate for reproducible evaluation.
scripts/teigaku_genzei/test_all.sh (1)
20-20: Document or parameterize the hardcoded checkpoint path.
The hardcoded path `/proj/instructlab/checkpoints/` may not exist in all environments. Consider documenting this requirement in the README or making it configurable via an environment variable.
Verify that this path requirement is documented in the README for users who want to run this script.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/llama3_1_8b.json (1)
1-13: LGTM! The configuration follows the established pattern for RITS-based llm_judge_direct_positional_bias metrics and uses appropriate deterministic settings for reproducible evaluation.
scripts/teigaku_genzei/seed_jsonl_2_qna_yaml.py (1)
36-42: Document or remove placeholder values.
The document metadata contains placeholder values ("Teigaku Genzei example.", "https://example.com/example/taxonomy", an all-zeros commit hash, "example.pdf"). If these are meant to be replaced by users, document this in the README. If they're examples for demonstration, consider adding a comment.
Verify that the README documents how users should update these placeholder values with their actual document metadata.
scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (1)
45-45: Remove the TODO; retain evaluator_name="LLAMA3_4_MAVERICK" to match existing conventions.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct/criteria/correctness_based_on_ground_truth2.json (1)
1-27: LGTM! The criteria configuration is well-structured with clear, differentiated descriptions for each quality level. The three-point rubric provides appropriate granularity for factual correctness assessment, and the option_map scoring (3→1.0, 2→0.5, 1→0.0) aligns logically with the quality tiers.
scripts/teigaku_genzei/test_nencho_single.sh (1)
29-47: LGTM! The test execution loop correctly constructs paths and invokes test_qa.py with all necessary parameters. The commented-out "before" evaluation block provides a useful pattern for baseline comparisons.
scripts/teigaku_genzei/extract_icl_from_seed.py (1)
7-14: LGTM! The helper functions are straightforward and correctly implement ICL extraction and hash computation for deduplication.
scripts/teigaku_genzei/json_util.py (1)
6-16: LGTM! The text detection functions correctly use full-width Japanese characters (０-９, －, ー, Ａ) alongside their ASCII counterparts to match Japanese document formatting patterns. The RUF001 warnings from Ruff are false positives in this context; full-width characters are intentional and necessary for the target use case.
scripts/teigaku_genzei/json_glossary.py (1)
20-53: LGTM! The glossary extraction logic correctly identifies section headers with Japanese bracket patterns (【】, （）) and retrieves associated descriptions. The full-width character warnings from Ruff are false positives; these characters are intentional for Japanese text processing.
scripts/teigaku_genzei/json_qa.py (1)
20-42: LGTM! The `get_text_block` function correctly filters and aggregates body text between specified positions, excluding page furniture and section titles. The comments provide helpful context for the label and content_layer attributes. Full-width character warnings in comments are false positives; these are intentional examples of Japanese characters.
scripts/teigaku_genzei/make_context.py (1)
27-54: LGTM! The configuration parsing with argparse and the section number extraction logic are both correctly implemented. The use of `json_util.get_title_match` for robust pattern matching is appropriate.
Actionable comments posted: 17
♻️ Duplicate comments (41)
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/llama4_scout.json (1)
9-12: Align evaluator_name with the model version (Llama 4).
Model is "llama-4-scout" but evaluator_name is "LLAMA3_4_SCOUT". Update to avoid confusion in results aggregation.

```diff
-  "evaluator_name": "LLAMA3_4_SCOUT",
+  "evaluator_name": "LLAMA4_SCOUT",
```

Run to find other mismatches:

```bash
#!/bin/bash
rg -nP '"model"\s*:\s*"llama-4-[^"]+"' -C1 scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits | sed -n 's/.*"evaluator_name": "\([^"]*\)".*/\1/p' | sort -u
rg -nP '"evaluator_name"\s*:\s*"LLAMA3_4_[^"]+"' scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits
```
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/gpt_4o.json (1)
9-11: Hard-coded Azure endpoint: remove or parameterize via environment.
Storing the full `api_base` (incl. preview api-version) in the repo leaks an internal endpoint and hurts portability. Prefer env-driven config, or omit `credentials` so the engine reads `AZURE_OPENAI_ENDPOINT` / `AZURE_OPENAI_API_KEY`.

Option A, rely on environment (preferred):

```diff
   "provider": "azure",
-  "credentials": {
-    "api_base": "https://eteopenai.azure-api.net/openai/deployments/gpt-4o-2024-08-06/chat/completions?api-version=2024-08-01-preview"
-  },
+  // credentials resolved from AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY
```

Option B, parameterize:

```diff
-  "credentials": {
-    "api_base": "https://eteopenai.azure-api.net/openai/deployments/gpt-4o-2024-08-06/chat/completions?api-version=2024-08-01-preview"
-  },
+  "credentials": {
+    "api_base": "${AZURE_OPENAI_ENDPOINT}"
+  },
```

Also avoid pinning a preview api-version in code; pass it via env if needed. Based on learnings.

To check for other hard-coded endpoints/keys:

```bash
#!/bin/bash
# Find hard-coded Azure/OpenAI endpoints and potential keys in repo
rg -n -C2 -g '!**/node_modules/**' -g '!**/dist/**' -g '!**/build/**' -e 'azure-api\.net|openai\.com' -e 'api_key|apiKey|AZURE_OPENAI|OPENAI_API_KEY'
```
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/vertex_ai/gemmini_2_5_pro.json (1)
1-13: Fix model/evaluator naming and file typo; align to Gemini.
Filename is “gemmini…”, model is “gemma-2-5-pro”, and evaluator_name is “GEMMA…”. These mismatches will break catalog lookups and Vertex model resolution. Align all to Gemini 2.5 Pro and rename the file accordingly. Also verify whether the provider key should be "vertex_ai" (to match the directory) or "vertex-ai" (current value).

Apply this diff to fix the JSON identifiers:

```diff
@@
-  "provider": "vertex-ai",
-  "model": "gemma-2-5-pro"
+  "provider": "vertex-ai",
+  "model": "gemini-2-5-pro"
@@
-  "evaluator_name": "GEMMA_2_5_PRO",
+  "evaluator_name": "GEMINI_2_5_PRO",
```

Rename the file to match the target model:

```bash
git mv scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/vertex_ai/gemmini_2_5_pro.json scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/vertex_ai/gemini_2_5_pro.json
```

To confirm provider key conventions and catch any other inconsistencies, run:

```bash
#!/bin/bash
# Inspect provider keys and Gemini configs for consistency
echo 'Provider key usage across local_catalog:'
rg -n -C1 '"provider"\s*:\s*"' scripts/teigaku_genzei/local_catalog
echo -e '\nGemini-related configs for cross-checking model/evaluator naming:'
rg -n -C2 -i 'gemini|gemma' scripts/teigaku_genzei/local_catalog
```
scripts/teigaku_genzei/remove_newline_from_csv.py (2)
4-5: Add argument validation and error handling.
The past review comment correctly identified that direct `sys.argv` access without validation will cause cryptic errors (IndexError) if invoked with wrong arguments or if the input file is missing.
7-13: Add error handling and documentation.
The past review comment correctly identified missing error handling for I/O failures, malformed CSV, and encoding issues, plus the lack of documentation explaining the script's purpose.
scripts/teigaku_genzei/llm_as_judge_direct_criteria_correctness2.py (1)
6-10: Internal IBM GitHub reference is inaccessible to external users.
The comment on line 10 references an internal IBM GitHub issue that external users cannot access. This was previously flagged in an earlier review.
Please either:
- Remove the specific URL and keep a general description, or
- Create a public issue in the opendatahub-io repository and reference that instead.
For example:
```diff
-# Source
-# https://www.unitxt.ai/en/latest/catalog/catalog.metrics.llm_as_judge.direct.criteria.correctness_based_on_ground_truth.html
-# This is a workaround to avoid the following issue.
-# "correctness_base_on_ground_truth" post-process bug · Issue #7 · RYOKAWA/teigaku_genzei
-# https://github.ibm.com/RYOKAWA/teigaku_genzei/issues/7
+# Based on Unitxt's correctness_based_on_ground_truth criterion:
+# https://www.unitxt.ai/en/latest/catalog/catalog.metrics.llm_as_judge.direct.criteria.correctness_based_on_ground_truth.html
+#
+# Note: Using numeric option names ("3", "2", "1") instead of semantic names
+# ("correct", "incomplete", "wrong") to work around a post-processing issue
+# with the original criterion.
```
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/gpt_4_1_nano.json (1)
8-9: Model name correction already flagged.
This issue has been identified in previous review comments. The model name should be "gpt-4.1-nano", not "gpt-4-1-nano", to match OpenAI's GPT-4.1 naming convention.
scripts/teigaku_genzei/parser.sh (1)
1-6: Add error handling and input validation.
The script lacks robustness: no `set -e`, no validation of paths, and unquoted variables. These issues were previously flagged with a comprehensive fix.
scripts/teigaku_genzei/qna_yaml_2_qa_csv.py (2)
16-16: Replace unsafe `yaml.FullLoader` with `yaml.safe_load`.
Using `FullLoader` allows arbitrary object instantiation, which is a security risk. The past review already provided the fix. Based on static analysis (Ruff S506).
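The one-line shape of that fix, for completeness (file path is illustrative):

```python
# yaml.safe_load builds only plain Python types, so a malicious document
# cannot instantiate arbitrary objects (unlike FullLoader).
import yaml

with open("qna.yaml", "r", encoding="utf-8") as f:  # illustrative path
    data = yaml.safe_load(f)
```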
11-28: Add error handling and input validation.
The script lacks robustness for production use: no argument validation, no I/O error handling, no YAML structure validation. The past review already provided a comprehensive fix.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/aws/mixtral_large.json (1)
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/aws/mixtral_large.json (1)
8-9: Fix the AWS Bedrock model identifier. As previously noted, `"mistral-large-instruct"` is not a valid AWS Bedrock model identifier. AWS Bedrock requires model IDs in the format `mistral.mixtral-<variant>-v0:1` (e.g., `mistral.mixtral-8x7b-instruct-v0:1`). This configuration will fail when invoked. Apply the fix suggested in the previous review to use the correct Bedrock model identifier format.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/mixtral_large.json (1)
8-12: Model identifier mismatches “mixtral” evaluator. Line 9 sets model to mistral-large-instruct while evaluator_name/file path indicate Mixtral.
Change model to the correct Watsonx Mixtral SKU (e.g., a mixtral-…-instruct identifier) to avoid evaluating the wrong model.
- "model": "mistral-large-instruct" + "model": "mixtral-8x7b-instruct" // adjust to exact Watsonx SKU in your envPlease confirm the exact Watsonx Mixtral model string supported by your account and update accordingly.
scripts/teigaku_genzei/test_single_vllm.sh (1)
3-7: Correct usage text to this script’s name. Both lines reference other scripts; fix to avoid confusion.
- echo "./test_single.sh {model_path}" - echo "e.g., ./test_all.sh path_to/granite-3.3-8b-instruct-teigaku-genzei-interp" + echo "./test_single_vllm.sh {model_path}" + echo "e.g., ./test_single_vllm.sh path_to/granite-3.3-8b-instruct-teigaku-genzei-interp"scripts/teigaku_genzei/json_glossary.py (1)
15-18: This duplication has already been identified in a previous review.
scripts/teigaku_genzei/extract_qa.sh (1)
5-14: The missing directory creation has already been identified in a previous review.
scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (1)
24-24: This super() call has been flagged previously, but the syntax is actually correct for the intended behavior of calling the grandparent's `prepare_engine()` method (skipping `RITSInferenceEngine`'s own implementation). The two-argument form `super(RITSInferenceEngine, self)` explicitly invokes the parent class of `RITSInferenceEngine`, which is the documented approach for this use case.
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/replicate/gpt_4_1.json (1)
4-10: Invalid provider/model pairing (will fail at runtime).Replicate does not serve OpenAI’s gpt-4-1. Use a Replicate-hosted model ID or switch provider to openai and relocate the file.
Two options:
- Keep path and change model to a valid Replicate model ID (verify in Replicate registry).
- Or change provider to "openai" here and move the file under the openai/ directory:
- "provider": "replicate", - "model": "gpt-4-1" + "provider": "openai", + "model": "gpt-4.1"scripts/teigaku_genzei/make_context.sh (1)
5-11: Create output directory before writing.context/ may not exist; first write will fail. Create it before the loop.
Apply:
DATA_PREFIX_LIST="0024001-021 0024004-072_01" -for f in $DATA_PREFIX_LIST; do +mkdir -p context +for f in $DATA_PREFIX_LIST; do python ./make_context.py \ --glossary_option appendix \ --qa_file qa_table/${f}.csv \ --glossary_file qa_table/${f}_glossary.csv \ --out_context_file context/${f}_context.csv donescripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/replicate/llama4_maverick.json (1)
11-12: Fix evaluator_name to match modelRename to LLAMA4_MAVERICK to align with
llama-4-maverick.Apply this diff:
- "evaluator_name": "LLAMA3_4_MAVERICK", + "evaluator_name": "LLAMA4_MAVERICK",scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct/rits/llama4_maverick.json (1)
8-11: Verify model name and evaluator name consistency.The model_name references "llama-4-maverick" (Line 8) but the evaluator_name is "LLAMA3_4_MAVERICK" (Line 11), which contains "3" instead of "4". This inconsistency was previously flagged and should be addressed.
scripts/teigaku_genzei/test_single.sh (1)
4-5: Fix the usage message to reference the correct script name.Line 5 shows
./test_all.shin the example, but this script is namedtest_single.sh. This was previously flagged and should be corrected.scripts/teigaku_genzei/test_single_pred_eval.sh (1)
36-36: Usemkdir -pto avoid repeat-run failures.Running the script twice will fail with
mkdir: File existsbecause Line 36 lacks the-pflag. This was previously flagged and should be corrected.scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py (1)
31-36: Add explicit encoding and error handling.File operations should specify
encoding='utf-8'for reliability with JSON that may contain non-ASCII characters (Japanese text in this case). Additionally, error handling for file I/O and JSON parsing would improve robustness. This was previously flagged and should be addressed.scripts/teigaku_genzei/test_all.sh (1)
23-23: Usemkdir -pto prevent errors.The
mkdircommand will fail if the directory already exists. This was previously flagged and should be corrected.scripts/teigaku_genzei/seed_jsonl_2_qna_yaml.py (2)
47-47: Remove useless return statement.The explicit return at the end of
main()serves no purpose.Apply this diff:
- return - if __name__ == "__main__":
24-26: Add argument validation.The script accesses
sys.argv[1]andsys.argv[2]without validation, which will raiseIndexErrorif arguments are missing.Apply this diff:
def main() -> None: + if len(sys.argv) < 3: + print("Usage: seed_jsonl_2_qna_yaml.py <input_seed.jsonl> <output_qna.yaml>", file=sys.stderr) + sys.exit(1) + input_file = sys.argv[1] # seed.jsonl output_file = sys.argv[2] # qna.yamlscripts/teigaku_genzei/jsonl_util.py (2)
7-12: Add explicit encoding and remove useless return.File operations should specify
encoding='utf-8'for portability. The return statement at Line 12 serves no purpose and the commented code at Line 11 should be removed.Apply this diff:
def write_jsonl_file(file_path: str, data_list: list[Any]): - with open(file_path, 'w') as file: + with open(file_path, 'w', encoding='utf-8') as file: for data in data_list: file.write(json.dumps(data, ensure_ascii=False) + '\n') - # json.dump(data, file, ensure_ascii=False) - return
14-20: Add explicit encoding and remove unreachable code.The return statement at Line 20 is unreachable. Additionally, file operations should specify
encoding='utf-8'.Apply this diff:
def read_jsonl_file(file_path: str) -> list[Any]: - with open(file_path, 'r') as file: + with open(file_path, 'r', encoding='utf-8') as file: data_list = [ json.loads(line) for line in file ] return data_list - return []scripts/teigaku_genzei/test_nencho_single.sh (2)
3-7: Fix the script name in the usage example.Line 5 references
test_all.shbut the script is namedtest_single.sh.Apply this diff:
if [ -z "$1" ]; then echo "./test_single.sh {model_path}" - echo "e.g., ./test_all.sh path_to/granite-3.3-8b-instruct-teigaku-genzei-interp" + echo "e.g., ./test_single.sh path_to/granite-3.3-8b-instruct-teigaku-genzei-interp" exit 1 fi
27-27: Usemkdir -pand quote variables.The script will fail on subsequent runs if the directory exists. Use
-pfor idempotency and quote the variable to handle paths with spaces.Apply this diff:
-mkdir ${OUT_DIR} +mkdir -p "${OUT_DIR}"scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
79-90: Remove redundant base_url construction.Lines 85-86 manually construct the base_url with
/v1suffix, duplicating logic thatRITSInferenceEngineFixed.prepare_engine()already implements. This creates a maintenance risk.Apply this diff to let
RITSInferenceEngineFixedhandle the URL construction:laaj_direct_rits_llama4_maverick = LLMJudgeDirectPositionalBias( inference_engine=RITSInferenceEngineFixed( max_tokens=1024, seed=42, temperature=0, model_name=llm_as_judge_direct_rits_llama4.MODEL_NAME, - base_url=RITSInferenceEngine.get_base_url_from_model_name(llm_as_judge_direct_rits_llama4.ENDPOINT_NAME) + "/v1", ), evaluator_name="LLAMA3_4_MAVERICK", generate_summaries=False, )scripts/teigaku_genzei/context_util.py (1)
8-12: Remove unusedsectionparameter.The
sectionparameter is declared but never used in the function body.Apply this diff:
-def compose_context(section: int, df: pd.DataFrame) -> str: +def compose_context(df: pd.DataFrame) -> str:Ensure all call sites are updated accordingly.
scripts/teigaku_genzei/catalog_util.py (2)
9-14: Critical: Function ignores its parameter.The function accepts an
artifact_nameparameter but always checks the hardcoded artifact"metrics.llm_as_judge.direct.rits.phi_4"at Line 11, making the parameter useless.Apply this diff:
def is_artifact_in_catalog(artifact_name: str, catalog_path: str=LOCAL_CATALOG_PATH): try: - _ = get_from_catalog("metrics.llm_as_judge.direct.rits.phi_4", catalog_path=catalog_path) + _ = get_from_catalog(artifact_name, catalog_path=catalog_path) except UnitxtArtifactNotFoundError: return False return True
16-20: Add error handling for empty catalog paths.If
get_local_catalogs_paths()returns an empty list, Line 20 will raise anIndexError.Apply this diff:
def get_system_catalog_path() -> str: paths = get_local_catalogs_paths() print(paths) - + if not paths: + raise RuntimeError("No local catalog paths found") return paths[-1]scripts/teigaku_genzei/make_context.py (1)
79-87: Critical: Incorrect reuse ofa_qindexforqlist_listdefault.Line 87 reuses
a_qindex(which is[-1]) as the default forqlist_list, butqlistexpects JSON-encoded arrays of question indices, not a raw integer list. This causes incorrect data in the appendix row whenoption == OPT_GLOSSARY_APPENDIX.Apply this diff:
(header, appendix, a_section, a_qindex) = ( - ("", [glossary_str], [-1], [-1]) if option == OPT_GLOSSARY_APPENDIX else - (glossary_str, [], [], []) if option == OPT_GLOSSARY_HEADER else - ("", [], [], []) + ("", [glossary_str], [-1], [-1], [json.dumps([-1])]) if option == OPT_GLOSSARY_APPENDIX else + (glossary_str, [], [], [], []) if option == OPT_GLOSSARY_HEADER else + ("", [], [], [], []) ) + a_qlist = a_qindex if len(a_qindex) == 0 else [json.dumps([-1])] context_list = [header + compose_context(cast(int, section), df) for (section, df) in section_gp] + appendix section_list = [cast(int, section) for (section, df) in section_gp] + a_section qindex_list = [cast(int, df.index[0]) for (section, df) in section_gp] + a_qindex - qlist_list = [json.dumps(df.index.to_list()) for (section, df) in section_gp] + a_qindex + qlist_list = [json.dumps(df.index.to_list()) for (section, df) in section_gp] + a_qlistscripts/teigaku_genzei/make_icl.py (1)
31-31: Fix boolean flag parsing for --short_context.Using
type=boolmakes--short_context falseevaluate toTruebecause any non-empty string is truthy. Useaction="store_true"instead.Apply this diff:
- parser.add_argument('--' + ARG_SHORT_CONTEXT, type=bool, default=False) + parser.add_argument('--' + ARG_SHORT_CONTEXT, action="store_true")scripts/teigaku_genzei/make_seed.py (3)
33-35: Harden `select_icl_sample` against empty DataFrame and use positional indexing. This function doesn't guard against an empty `icl_df` and uses label-based `.loc` which can misbehave if the DataFrame index is non-integer.
def select_icl_sample(icl_df: pd.DataFrame, average_context_length: int, index: int) -> pd.Series: + if icl_df.empty: + raise ValueError("ICL sample DataFrame is empty.") + step = max(1, int(average_context_length)) + idx = int(index) - ret_index = (index - average_context_length) % len(icl_df) - return cast(pd.Series, icl_df.loc[ret_index]) + ret_index = (idx - step) % len(icl_df) + return cast(pd.Series, icl_df.iloc[ret_index])
51-67: Fix KeyError: preserve `qindex` column when reading context CSV. Line 51 subselects only the `"context"` column, but Line 63 accesses `x["qindex"]`, causing a KeyError. You must preserve `qindex` and ensure it's numeric.
- context_df = pd.read_csv(input_context_path, encoding="utf8")[["context"]] + context_df = pd.read_csv(input_context_path, encoding="utf8") + if "qindex" not in context_df.columns: + # fallback: derive qindex from row index if missing + context_df = context_df.reset_index(names="qindex") + context_df["qindex"] = pd.to_numeric(context_df["qindex"], errors="coerce").fillna(0).astype(int) + context_df = context_df[["context", "qindex"]] icl_list = json_util.read_jsonl_file(input_icl_path) icl_df = pd.DataFrame(icl_list) if opt_join_method == OPT_JOIN_METHOD_SLIDE: @@ - average_context_length = len(icl_list) / len(context_df) - assigned_icl_df = context_df.apply(lambda x: select_icl_sample(icl_df, int(average_context_length), x["qindex"]), axis=1) + if len(context_df) == 0: + raise ValueError("No contexts loaded.") + if len(icl_df) == 0: + raise ValueError("No ICL samples loaded.") + average_context_length = max(1, round(len(icl_df) / len(context_df))) + assigned_icl_df = context_df.apply( + lambda x: select_icl_sample(icl_df, int(average_context_length), int(x["qindex"])), + axis=1, + ) context_icl_df = pd.concat([
79-81: Critical: `ArgumentError` is constructed but not raised. Line 81 constructs an `ArgumentError` object but doesn't raise it, so invalid `join_method` values are silently ignored and execution continues with an empty DataFrame.
else: - context_icl_df = pd.DataFrame() - argparse.ArgumentError(None, f"Invalid {ARG_JOIN_METHOD}: {opt_join_method}") + raise ValueError(f"Invalid {ARG_JOIN_METHOD}: {opt_join_method}")scripts/teigaku_genzei/test_qa.py (2)
53-53: Fix CLI argument name typo: `--instruction`. The constant value `"instructtion"` is misspelled, resulting in a CLI flag `--instructtion` instead of `--instruction`.
-ARG_INSTRUCTION: str ="instructtion" +ARG_INSTRUCTION: str = "instruction"
150-158: Fix late-binding lambda by capturing `option_map` per iteration. The lambda in the list comprehension closes over `option_map`, causing all lambdas to reference the final iteration's value (late-binding). Each lambda must capture its own `option_map`.
```diff
 biased_score_sr_list = [
-    results_df[f"{criterion}_positional_bias_selected_option"].apply(lambda x: option_map[x])
+    results_df[f"{criterion}_positional_bias_selected_option"].apply(
+        lambda x, mapping=option_map: mapping[x]
+    )
     for (criterion, option_map) in zip(criteria_list, option_map_list)
 ]
```
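The default-argument capture generalizes beyond this snippet; a standalone toy example of the bug and the fix:

```python
# Toy demonstration of the late-binding pitfall fixed above.
funcs = [lambda x: x * factor for factor in (1, 10, 100)]
print([f(2) for f in funcs])  # [200, 200, 200]: every lambda sees the last `factor`

funcs = [lambda x, factor=factor: x * factor for factor in (1, 10, 100)]
print([f(2) for f in funcs])  # [2, 20, 200]: the default argument captures each value
```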
🧹 Nitpick comments (26)
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/gpt_4o.json (1)
12-12: Make deployment/model configurable via env (optional). For portability across subscriptions/deployments, consider parameterizing the deployment name:
- "model": "gpt-4o-2024-08-06" + "model": "${AZURE_OPENAI_DEPLOYMENT:-gpt-4o-2024-08-06}"Based on learnings
scripts/teigaku_genzei/remove_newline_from_csv.py (1)
9-12: Consider in-place modification for simplicity. The current pattern builds a dictionary and creates a new DataFrame. A more direct approach modifies columns in-place, avoiding the intermediate dict and DataFrame construction overhead.
Apply this diff for a simpler, more efficient alternative:
```diff
-col_sr_dict = {
-    col: df[col].str.replace("\n", " ") if df[col].dtype == "object" else df[col] for col in df.columns
-}
-df2 = pd.DataFrame(col_sr_dict)
-df2.to_csv(output_file, encoding="utf-8", index=False)
+for col in df.select_dtypes(include=['object']).columns:
+    df[col] = df[col].str.replace("\n", " ", regex=False)
+df.to_csv(output_file, encoding="utf-8", index=False)
```

This uses `select_dtypes` to filter object columns and modifies them in-place. Setting `regex=False` also slightly improves performance since `"\n"` is a literal string.
1-2: Remove unused imports.
CrossProviderInferenceEngine,RITSInferenceEngine, andLLMJudgeDirectare imported but never used in this script.Apply this diff to remove the unused imports:
-from unitxt.inference import CrossProviderInferenceEngine, RITSInferenceEngine -from unitxt.llm_as_judge import LLMJudgeDirect from unitxt import add_to_catalog from unitxt.llm_as_judge_constants import CriteriaOption, CriteriaWithOptionsscripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/ollama/llama3_1_8b.json (1)
11-12: Optional: Harmonize evaluator_name with provider/model naming.Consider a consistent pattern (e.g., OLLAMA_LLAMA3_1_8B) across catalog assets to ease discovery/automation. If you already have a convention, ignore.
scripts/teigaku_genzei/json_qa.py (3)
15-18: Consider adding error handling for file operations.The function opens and reads a JSON file without handling potential errors such as file not found, permission denied, or invalid JSON format.
Apply this diff to add error handling:
def read_json_file(file_path): - with open(file_path, 'r') as file: - data = json.load(file) - return data + try: + with open(file_path, 'r', encoding='utf-8') as file: + data = json.load(file) + return data + except FileNotFoundError: + print(f"Error: File not found: {file_path}") + sys.exit(1) + except json.JSONDecodeError as e: + print(f"Error: Invalid JSON in {file_path}: {e}") + sys.exit(1) + except Exception as e: + print(f"Error reading {file_path}: {e}") + sys.exit(1)
55-55: Simplify unnecessary list comprehension.The comprehension
[i for i in title_pos_list]is redundant sincetitle_pos_listis already a list. You can slice it directly.Apply this diff:
- atail_pos_list = [i for i in title_pos_list][1:] + [len(text_list)] + atail_pos_list = title_pos_list[1:] + [len(text_list)]
71-76: Consider adding error handling for CSV writing.The function writes to a CSV file without handling potential errors such as permission denied or disk full.
Apply this diff to add error handling:
def save_to_csv(qa_pairs, output_file): - with open(output_file, 'w', newline='', encoding='utf-8') as csvfile: - writer = csv.writer(csvfile) - writer.writerow(['Title', 'Question', 'Answer']) - for qa in qa_pairs: - writer.writerow(qa) + try: + with open(output_file, 'w', newline='', encoding='utf-8') as csvfile: + writer = csv.writer(csvfile) + writer.writerow(['Title', 'Question', 'Answer']) + for qa in qa_pairs: + writer.writerow(qa) + except Exception as e: + print(f"Error writing to {output_file}: {e}") + sys.exit(1)scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/phi_4.json (1)
3-11: Unify engine schema with other RITS configs.Others use cross_provider_inference_engine + provider: "rits" and key "model". Consider aligning for consistency.
- "inference_engine": { - "__type__": "rits_inference_engine", - "max_tokens": 1024, - "seed": 42, - "temperature": 0, - "model_name": "microsoft/phi-4" - }, + "inference_engine": { + "__type__": "cross_provider_inference_engine", + "max_tokens": 1024, + "seed": 42, + "temperature": 0, + "provider": "rits", + "model": "microsoft/phi-4" + },scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/granite3_0_8b.json (1)
11-11: Consider more specific evaluator naming for consistency.The evaluator is named
"GRANITE3_8B"while the filename isgranite3_0_8b.jsonand the model is"granite-3-8b-instruct"(implying version 3.0). If the related configs for Granite 3.1, 3.2, and 3.3 use version-specific evaluator names (e.g.,GRANITE3_1_8B), consider using"GRANITE3_0_8B"here for consistency.scripts/teigaku_genzei/test_single_vllm.sh (2)
1-2: Harden script and avoid mkdir failure if dir exists.Add strict mode and use mkdir -p.
-#!/bin/sh +#!/bin/sh +set -euo pipefail @@ -mkdir ${OUT_DIR} +mkdir -p "${OUT_DIR}"Also applies to: 31-31
20-24: Optional: read API key from env when needed.Passing a dummy key may interfere with providers; consider making it optional or defaulting from ENV.
-# SERVER_TYPE="vllm-remote" -# SERVER_TYPE="openai" SERVER_TYPE="vllm-remote" -SERVER_API_KEY="dummy" +SERVER_API_KEY="${SERVER_API_KEY:-}"scripts/teigaku_genzei/json_glossary.py (1)
62-74: Add error handling for file operations.The script lacks error handling for file I/O and JSON structure validation. Consider adding try-except blocks to handle:
- Missing or unreadable input files
- Invalid JSON format
- Missing expected fields in the JSON structure
- CSV write failures
Example:
if __name__ == "__main__": if len(sys.argv) != 3: print("Usage: python script.py <input_json_file> <output_csv_file>") sys.exit(1) input_file = sys.argv[1] output_file = sys.argv[2] - data = read_json_file(input_file) - qa_pairs = extract_glossary(data) - save_to_csv(qa_pairs, output_file) - - print(f"QA pairs have been successfully saved to {output_file}") + try: + data = read_json_file(input_file) + qa_pairs = extract_glossary(data) + save_to_csv(qa_pairs, output_file) + print(f"QA pairs have been successfully saved to {output_file}") + except FileNotFoundError as e: + print(f"Error: File not found - {e}") + sys.exit(1) + except json.JSONDecodeError as e: + print(f"Error: Invalid JSON format - {e}") + sys.exit(1) + except KeyError as e: + print(f"Error: Missing expected field in JSON - {e}") + sys.exit(1) + except Exception as e: + print(f"Error: {e}") + sys.exit(1)scripts/teigaku_genzei/extract_qa.sh (1)
7-14: Add error checking for script execution.The loop doesn't check if the Python scripts succeed or if input files exist. Consider adding basic error handling:
for sqa in ${source_qa}; do sf=${source_dir}/${sqa}.json qf=${qa_table_dir}/${sqa}.csv gf=${qa_table_dir}/${sqa}_glossary.csv + if [ ! -f "${sf}" ]; then + echo "Error: Input file ${sf} not found" + exit 1 + fi + - python ./json_qa.py ${sf} ${qf} - python ./json_glossary.py ${sf} ${gf} + python ./json_qa.py ${sf} ${qf} || exit 1 + python ./json_glossary.py ${sf} ${gf} || exit 1 donescripts/teigaku_genzei/extract_icl_from_seed.py (2)
49-51: Simplify dict comprehension.As Pylint suggests, the dict comprehension can be simplified.
- unique_icl_dict = { - h: obj for (h, obj) in hash_list - } + unique_icl_dict = dict(hash_list)
52-53: Remove misleading TODO comment.The TODO comment states "collect only a unique set of ICLs" but this functionality is already implemented via the hash-based deduplication above.
- # TODO: collect only a unique set of ICLs. unique_icl_list = [obj for (_, obj) in unique_icl_dict.items()]scripts/teigaku_genzei/url.txt (1)
1-7: Verify link stability and add retrieval guidance.Consider adding a short README note or a fetch script that reads this file, sets a user-agent, follows redirects, and records checksums/last-modified timestamps to guard against upstream changes or takedowns. Also ensure the NTA terms permit automated retrieval.
scripts/teigaku_genzei/requirements.txt (1)
4-9: Consolidate extras and consider pinning for reproducibility.
- Listing both sacrebleu and sacrebleu[ja] is redundant; keep only sacrebleu[ja].
- Consider pinning versions (accelerate, nltk, rouge_score, bert_score, litellm, python-dotenv) to avoid surprise breakages in scripts.
Apply this minimal cleanup:
-accelerate -sacrebleu -sacrebleu[ja] -nltk -rouge_score -bert_score -litellm +accelerate +sacrebleu[ja] +nltk +rouge_score +bert_score +litellmOptionally add explicit versions in a constraints file for CI determinism.
Also applies to: 12-12
scripts/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/granite3_0_8b.json (1)
9-12: Align evaluator_name with versioned naming.To match the 3.3 config style, consider "GRANITE3_0_8B" (includes minor version) instead of "GRANITE3_8B".
scripts/teigaku_genzei/make_seed.sh (1)
1-4: Harden the script and use python3Add strict mode and prefer python3 for portability on RHEL.
Apply this diff:
-#!/bin/sh +#!/bin/sh +set -euo pipefail @@ -# python ./make_seed.py context/0024001-021_context.csv tmp/icl.jsonl tmp/seed_ja.jsonl -python ./make_seed.py --context_file context/0024001-021_context.csv --icl_file icl/icl.jsonl --out_seed_file tmp/seed_ja.jsonl --join_method cartesian +# python3 ./make_seed.py context/0024001-021_context.csv tmp/icl.jsonl tmp/seed_ja.jsonl +python3 ./make_seed.py --context_file context/0024001-021_context.csv --icl_file icl/icl.jsonl --out_seed_file tmp/seed_ja.jsonl --join_method cartesianscripts/teigaku_genzei/README.md (1)
1-236: Consider applying markdown formatting improvements.Static analysis (markdownlint) identified several formatting issues throughout the documentation:
- Bare URLs should be converted to proper Markdown links (e.g., Lines 7, 10, 11, 20, 21)
- Unordered list indentation should use consistent 2-space increments
- Code blocks should include language specifiers (e.g.,
```bash,```csv)These improvements would enhance readability and conform to standard Markdown best practices, but are not blocking issues.
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
41-47: Consider using spread operator for cleaner code.The list concatenation on Line 47 can be simplified using the spread operator for better readability.
Apply this diff:
- self.reduction_map = {"mean": [self.main_score] + self.ci_scores} + self.reduction_map = {"mean": [self.main_score, *self.ci_scores]}scripts/teigaku_genzei/json_util.py (2)
20-25: Remove commented-out code and useless return.Line 24 contains commented-out code that should be removed. Line 25's explicit
returnis unnecessary.Apply this diff:
def write_jsonl_file(file_path: str, data_list: list[Any]): with open(file_path, 'w') as file: for data in data_list: file.write(json.dumps(data, ensure_ascii=False) + '\n') - # json.dump(data, file, ensure_ascii=False) - return
27-33: Remove unreachable return statement.Line 33's
return []is unreachable since the function already returns at Line 32 within the with-block.Apply this diff:
def read_jsonl_file(file_path: str) -> list[Any]: with open(file_path, 'r') as file: data_list = [ json.loads(line) for line in file ] return data_list - return []scripts/teigaku_genzei/make_context.py (1)
46-54: Consider removing debug print statements.Lines 51-52 print section/subsection numbers during execution. If this is intentional for debugging, consider using proper logging or removing before production.
Apply this diff if prints are no longer needed:
def extract_section_number(title: str) -> tuple[int, int]: match = json_util.get_title_match(title) if match is not None: section_num_str = match.group(1) subsection_num_str = match.group(2) - print(section_num_str) - print(subsection_num_str) return (int(section_num_str), int(subsection_num_str)) return (0, 0)scripts/teigaku_genzei/make_icl.py (1)
91-91: Remove unused variableoption.The
optionvariable is assigned at Line 91 but never used in the function.Apply this diff:
- option = args[ARG_OPTION] input_qa_path = args[ARG_INPUT_QA_FILE]scripts/teigaku_genzei/test_qa.py (1)
155-157: Remove unused variablenbiased_score_sr_list.The variable
nbiased_score_sr_listis assigned at Line 155 but never used. Either use it in the computation or remove it.Apply this diff if the variable is truly unused:
cbiased_score_sr_list: list[pd.Series] = [cast(pd.Series, sr) for sr in biased_score_sr_list] - nbiased_score_sr_list = [ - sr.rename(cast(str, sr.name).replace("_positional_bias_selected_option", "_positional_bias_score")) - for sr in cbiased_score_sr_list - ]
… Add error handling for empty catalog paths. opendatahub-io#26
Actionable comments posted: 2
🧹 Nitpick comments (8)
scripts/teigaku_genzei/catalog_util.py (2)
16-20: Previous issue resolved; consider removing debug print. The IndexError risk has been fixed with the guard clause on line 20. However, the print statement on line 18 outputs to stdout on every call, which pollutes logs and is typically debugging code that should be removed.
Apply this diff to remove the debug output:
```diff
 def get_system_catalog_path() -> str | None:
     paths = get_local_catalogs_paths()
-    print(paths)
     return paths[-1] if len(paths) > 0 else None
```
22-22: Consider removing or deferring the import-time side effect. Calling `get_system_catalog_path()` at import time executes code with side effects (the print statement) every time this module is imported, which can pollute logs and cause unexpected behavior in testing or when the module is imported for inspection. If catalog path initialization is needed, consider moving it to an explicit initialization function that callers invoke when required.
```diff
-get_system_catalog_path()
+# Call get_system_catalog_path() explicitly when needed instead of at import time
```

scripts/teigaku_genzei/json_glossary.py (1)
1-4: Remove or update the TODO comment. The TODO comment appears to be a template placeholder that describes what the script already does. Consider removing it or replacing it with specific future work items if any remain.
scripts/teigaku_genzei/json_util.py (2)
24-29: Remove unnecessary explicit return. The explicit `return` statement at line 29 is redundant. Python functions implicitly return `None` at the end.
def write_jsonl_file(file_path: str, data_list: list[Any]): with open(file_path, 'w') as file: for data in data_list: file.write(json.dumps(data, ensure_ascii=False) + '\n') - # json.dump(data, file, ensure_ascii=False) - return
18-21: Add type hints for consistency. The `read_json_file` function lacks type hints for the `file_path` parameter and return type, while other functions in the module have complete type annotations.
-def read_json_file(file_path): +def read_json_file(file_path: str) -> Any: with open(file_path, 'r') as file: data = json.load(file) return datascripts/teigaku_genzei/json_qa.py (3)
1-4: Remove or update the TODO comment. Similar to `json_glossary.py`, this TODO comment appears to be a template placeholder describing what the script already implements. Consider removing it or replacing with specific action items.
51-51: Simplify list slicing. The expression `[i for i in title_pos_list][1:]` first creates a full copy via list comprehension, then slices it. This is inefficient and can be simplified to `title_pos_list[1:]` directly.
- atail_pos_list = [i for i in title_pos_list][1:] + [len(text_list)] + atail_pos_list = title_pos_list[1:] + [len(text_list)]
57-57: Add explicit `strict=` parameter to `zip()`. Without `strict=True`, if the four lists have different lengths due to malformed input, `zip()` will silently truncate to the shortest list, potentially dropping QA pairs without warning. This could mask data extraction issues.
- for (ti, qhi, ahi, ati) in zip(title_pos_list, qhead_pos_list, ahead_pos_list, atail_pos_list): + for (ti, qhi, ahi, ati) in zip(title_pos_list, qhead_pos_list, ahead_pos_list, atail_pos_list, strict=True):Note:
strict=Truerequires Python 3.10+. If targeting Python 3.9 or earlier, consider adding an explicit length check before the loop:if not (len(title_pos_list) == len(qhead_pos_list) == len(ahead_pos_list) == len(atail_pos_list)): raise ValueError(f"Mismatched list lengths: titles={len(title_pos_list)}, questions={len(qhead_pos_list)}, answers={len(ahead_pos_list)}, tails={len(atail_pos_list)}")
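A standalone toy example of the failure mode this guards against (not from the repository):

```python
titles = ["1-1", "1-2", "2-1"]
answers = ["A1", "A2"]  # one answer missing

try:
    pairs = list(zip(titles, answers, strict=True))  # Python 3.10+
except ValueError as err:
    # Without strict=True this would silently yield two pairs and drop "2-1".
    print(f"length mismatch detected: {err}")
```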
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
scripts/teigaku_genzei/catalog_util.py (1 hunks)
scripts/teigaku_genzei/context_util.py (1 hunks)
scripts/teigaku_genzei/json_glossary.py (1 hunks)
scripts/teigaku_genzei/json_qa.py (1 hunks)
scripts/teigaku_genzei/json_util.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
scripts/teigaku_genzei/json_qa.py (2)
scripts/teigaku_genzei/json_glossary.py (2)
is_body_text (18-19), is_section_title (32-33)
scripts/teigaku_genzei/json_util.py (4)
is_title (6-7), is_qhead (12-13), is_ahead (15-16), read_json_file (18-21)
scripts/teigaku_genzei/json_glossary.py (2)
scripts/teigaku_genzei/json_qa.py (2)
is_body_text (18-19), save_to_csv (67-72)
scripts/teigaku_genzei/json_util.py (2)
is_title (6-7), read_json_file (18-21)
🪛 Pylint (4.0.0)
scripts/teigaku_genzei/json_qa.py
[refactor] 51-51: Unnecessary use of a comprehension, use list(title_pos_list) instead.
(R1721)
scripts/teigaku_genzei/json_util.py
[refactor] 24-24: Useless return at end of function or method
(R1711)
🪛 Ruff (0.14.0)
scripts/teigaku_genzei/context_util.py
17-17: String contains ambiguous A (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?
(RUF001)
20-20: Unused function argument: section
(ARG001)
scripts/teigaku_genzei/json_qa.py
22-22: String contains ambiguous ( (FULLWIDTH LEFT PARENTHESIS). Did you mean ( (LEFT PARENTHESIS)?
(RUF001)
22-22: String contains ambiguous ) (FULLWIDTH RIGHT PARENTHESIS). Did you mean ) (RIGHT PARENTHESIS)?
(RUF001)
33-33: Comment contains ambiguous A (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?
(RUF003)
35-35: Comment contains ambiguous 1 (FULLWIDTH DIGIT ONE). Did you mean 1 (DIGIT ONE)?
(RUF003)
46-46: Comment contains ambiguous - (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)?
(RUF003)
51-51: Consider [*[i for i in title_pos_list][1:], len(text_list)] instead of concatenation
Replace with [*[i for i in title_pos_list][1:], len(text_list)]
(RUF005)
57-57: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
scripts/teigaku_genzei/json_util.py
7-7: String contains ambiguous 0 (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?
(RUF001)
7-7: String contains ambiguous 9 (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?
(RUF001)
7-7: String contains ambiguous - (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)?
(RUF001)
7-7: String contains ambiguous 0 (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?
(RUF001)
7-7: String contains ambiguous 9 (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?
(RUF001)
10-10: String contains ambiguous 0 (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?
(RUF001)
10-10: String contains ambiguous 9 (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?
(RUF001)
10-10: String contains ambiguous - (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)?
(RUF001)
10-10: String contains ambiguous 0 (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?
(RUF001)
10-10: String contains ambiguous 9 (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?
(RUF001)
16-16: String contains ambiguous A (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?
(RUF001)
scripts/teigaku_genzei/json_glossary.py
33-33: String contains ambiguous ( (FULLWIDTH LEFT PARENTHESIS). Did you mean ( (LEFT PARENTHESIS)?
(RUF001)
33-33: String contains ambiguous ) (FULLWIDTH RIGHT PARENTHESIS). Did you mean ) (RIGHT PARENTHESIS)?
(RUF001)
🔇 Additional comments (2)
scripts/teigaku_genzei/catalog_util.py (2)
1-7: LGTM: Imports and catalog registration. The imports are appropriate for the functionality, and registering the local catalog at import time is a standard pattern for Unitxt catalog initialization.
9-14: Previous issue resolved. The function now correctly uses the `artifact_name` parameter instead of a hardcoded string. The logic and exception handling are appropriate.
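For readers following along, a minimal sketch of the corrected pattern; the import locations are assumptions, the real ones live in this PR's catalog_util.py:

```python
# Import paths are assumptions for illustration.
from unitxt import get_from_catalog
from unitxt.artifact import UnitxtArtifactNotFoundError


def is_artifact_in_catalog(artifact_name: str, catalog_path: str) -> bool:
    try:
        # Probe the catalog for the requested artifact, not a hardcoded one.
        get_from_catalog(artifact_name, catalog_path=catalog_path)
    except UnitxtArtifactNotFoundError:
        return False
    return True
```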
…erenceEngineFixed logic. opendatahub-io#26
Actionable comments posted: 3
♻️ Duplicate comments (2)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
79-87: Avoid duplicating base_url construction; let engine derive it. `RITSInferenceEngineFixed.prepare_engine()` already appends “/v1” when base_url is None. Prefer omitting base_url here unless a distinct endpoint is required.
```diff
-    inference_engine=RITSInferenceEngineFixed(
+    inference_engine=RITSInferenceEngineFixed(
         max_tokens=1024,
         seed=42,
         temperature=0,
         model_name=llm_as_judge_direct_rits_llama4.MODEL_NAME,
-        base_url=RITSInferenceEngineFixed.get_base_url_from_model_name_v(llm_as_judge_direct_rits_llama4.ENDPOINT_NAME, "/v1"),
+        # Let prepare_engine() set base_url from model_name
+        # base_url can be supplied only if endpoint differs from model mapping
     ),
```

If endpoint ≠ model mapping is required, keep base_url but align helper naming (see sibling file comment).
scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (1)
30-30: Fix super() call to invoke immediate parent. Use `super().prepare_engine()` here; the current call targets the ancestor of RITSInferenceEngine and triggers Pylint E1003.
Apply:
```diff
-        super(RITSInferenceEngine, self).prepare_engine()
+        super().prepare_engine()
```
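The disagreement between this review and the earlier one is easy to demonstrate in isolation; this toy hierarchy (names are illustrative, unrelated to unitxt) shows what each form of super() resolves to:

```python
class Base:
    def prepare_engine(self):
        print("Base.prepare_engine")


class Middle(Base):
    def prepare_engine(self):
        print("Middle.prepare_engine")


class Fixed(Middle):
    def prepare_engine(self):
        # super() resolves to Middle (the next class in Fixed's MRO);
        # super(Middle, self) resolves to Base, skipping Middle entirely.
        super(Middle, self).prepare_engine()


Fixed().prepare_engine()  # prints "Base.prepare_engine"
```

Whether the two-argument form is a bug or the intended grandparent call therefore depends on whether skipping the parent's implementation is deliberate.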
🧹 Nitpick comments (5)
scripts/teigaku_genzei/llm_as_judge_direct_criteria_correctness2.py (1)
17-47: Consider adding a docstring. While the comments above explain the workaround, adding a docstring to the function would clarify its purpose and parameters for users of this module.
For example:
```diff
 def register(catalog_path: str) -> None:
+    """Register a custom correctness criterion with numeric option names.
+
+    This criterion uses numeric option names ("3", "2", "1") instead of semantic names
+    to work around post-processing issues with the original criterion.
+
+    Args:
+        catalog_path: Path to the catalog where the criterion will be registered.
+    """
```
19-21: Clarify base_url derivation and naming.Helper name implies “model_name” but register() passes ENDPOINT_NAME. Rename to reflect generic identifier or add endpoint-aware logic to avoid confusion and accidental mismatches.
Example:
- def get_base_url_from_model_name_v(model_name: str, version: str): - return RITSInferenceEngine.get_base_url_from_model_name(model_name) + version + def build_base_url_with_suffix(identifier: str, suffix: str): + return RITSInferenceEngine.get_base_url_from_model_name(identifier) + suffixAnd in both places:
- base_url=RITSInferenceEngineFixed.get_base_url_from_model_name_v(ENDPOINT_NAME, "/v1"), + base_url=RITSInferenceEngineFixed.build_base_url_with_suffix(ENDPOINT_NAME, "/v1"),Also applies to: 26-29, 42-44
36-36: Rename variable to match content.The variable name suggests phi-4 but config is llama-4 maverick. Rename for clarity and consistency with the catalog key.
- laaj_direct_rits_phi4 = LLMJudgeDirect( + laaj_direct_rits_llama4_maverick = LLMJudgeDirect( ... - add_to_catalog(laaj_direct_rits_phi4, name="metrics.llm_as_judge.direct.rits.llama4_maverick", overwrite=True, catalog_path=catalog_path) + add_to_catalog(laaj_direct_rits_llama4_maverick, name="metrics.llm_as_judge.direct.rits.llama4_maverick", overwrite=True, catalog_path=catalog_path)Also applies to: 57-57
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (2)
46-47: Replace print with logger and simplify list build.Use logger.debug for diagnostics; adopt splat style for reduction_map.
- print(f"LLMJudgeDirectPositionalBias.prepare() ci_scores: {self.ci_scores}") - self.reduction_map = {"mean": [self.main_score] + self.ci_scores} + logger.debug("LLMJudgeDirectPositionalBias.ci_scores: %s", self.ci_scores) + self.reduction_map = {"mean": [self.main_score, *self.ci_scores]}Add at top if missing:
-from typing import List, Dict, Any +from typing import List, Dict, Any +from unitxt import get_logger ... -# (no logger defined) +logger = get_logger()
59-60: Add zip(strict=...) for safer length assumptions.Make intent explicit per B905.
- for result, pb_score in zip(results, pb_scores) + for result, pb_score in zip(results, pb_scores, strict=False)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
scripts/teigaku_genzei/jsonl_util.py(1 hunks)scripts/teigaku_genzei/llm_as_judge_direct_criteria_correctness2.py(1 hunks)scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py(1 hunks)scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
register(65-91)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (3)
RITSInferenceEngineFixed(17-30)register(33-57)get_base_url_from_model_name_v(20-21)
🪛 Pylint (4.0.0)
scripts/teigaku_genzei/jsonl_util.py
[refactor] 7-7: Useless return at end of function or method
(R1711)
scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py
[error] 30-30: Bad first argument 'RITSInferenceEngine' given to super()
(E1003)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py
[refactor] 9-9: Too few public methods (1/2)
(R0903)
🪛 Ruff (0.14.0)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py
47-47: Consider [self.main_score, *self.ci_scores] instead of concatenation
Replace with [self.main_score, *self.ci_scores]
(RUF005)
59-59: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🔇 Additional comments (5)
scripts/teigaku_genzei/jsonl_util.py (1)
14-19: LGTM! Previous issue resolved.The function correctly specifies
encoding='utf-8'for reading, which is essential for Japanese text. The unreachablereturn []mentioned in the previous review has been removed.scripts/teigaku_genzei/llm_as_judge_direct_criteria_correctness2.py (2)
6-15: Well-documented workaround. The explanation clearly describes why numeric option names are used instead of semantic names, making the rationale behind this design decision accessible to future maintainers.
49-49: LGTM! The catalog registration correctly uses the provided `catalog_path` parameter and enables overwrite to allow updates. The fully-qualified name follows unitxt's naming conventions.
53-54: Verify evaluator_name constant.“LLAMA3_4_MAVERICK” looks off; confirm intended label (likely “LLAMA4_MAVERICK”). Update if misnamed to avoid metric aggregation inconsistency.
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
88-89: Verify evaluator_name string.“LLAMA3_4_MAVERICK” appears mismatched with llama-4 config; confirm the exact label used elsewhere to keep metrics consistent.
Actionable comments posted: 0
🧹 Nitpick comments (1)
scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py (1)
14-14: Consider using `sys.exit` for clarity. While `exit(1)` works in script contexts, `sys.exit(1)` is the more explicit and standard approach in Python. Apply this diff:
```diff
+import sys
+
 import os
 import json
 import logging
 import catalog_util

 system_catalog_path = catalog_util.get_system_catalog_path()
 if system_catalog_path is None:
     logging.error("No system catalog")
-    exit(1)
+    sys.exit(1)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-14T13:57:11.861Z
Learnt from: ryokawajp
PR: opendatahub-io/odh-data-processing#26
File: scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py:31-36
Timestamp: 2025-10-14T13:57:11.861Z
Learning: For one-time setup scripts in the scripts/teigaku_genzei/ directory, elaborate error handling is not required. Natural script suspension when errors occur is considered sufficient for these types of scripts.
Applied to files:
scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py
🧬 Code graph analysis (1)
scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py (1)
scripts/teigaku_genzei/catalog_util.py (1)
get_system_catalog_path(16-20)
🪛 Pylint (4.0.0)
scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py
[refactor] 14-14: Consider using 'sys.exit' instead
(R1722)
🔇 Additional comments (1)
scripts/teigaku_genzei/llm_as_judge_direct_provider_importer.py (1)
34-39: LGTM! Encoding specification properly implemented. The explicit `encoding="utf-8"` specification for both file reads and writes ensures reliable handling of JSON content with non-ASCII characters. This addresses the feedback from the previous review.
Actionable comments posted: 5
♻️ Duplicate comments (1)
scripts/teigaku_genzei/make_context.py (1)
87-87: Verified: Past critical qlist_list bug has been fixed. The previous review flagged that `qlist_list` incorrectly appended `a_qindex` (raw list `[-1]`) instead of a JSON-encoded string. The current implementation correctly uses `[json.dumps([qi]) for qi in a_qindex]` to produce `["[-1]"]` for the appendix case, resolving the issue.
a_qlistvariable) but achieves the same correct result.
🧹 Nitpick comments (1)
scripts/teigaku_genzei/make_context.py (1)
12-17: Consider converting descriptive TODOs to module docstring. These TODO comments describe what the script does rather than marking incomplete work. Consider converting them to a proper module-level docstring for better documentation.
Apply this diff:
+""" +Build context CSV from QA and glossary data. + +Input: qa.csv - question-answer pairs with titles +Input: glossary.csv - term definitions +Output: context.csv - section-level contexts with question indices + +Process: +- Extract section numbers from QA titles +- Group entries by section +- Generate context blocks per section with optional glossary header/appendix +""" + import sys from typing import cast, Any, Hashable, Sequence, Collection import json import argparse import pandas as pd import json_util from context_util import compose_context, compose_glossary - -# TODO: input: qa.csv -# TODO: input: glossary.csv -# TODO: output: context.csv - -# TODO: extract section numbers -# TODO: collect entries of same section numbers.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/teigaku_genzei/make_context.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/teigaku_genzei/make_context.py (2)
scripts/teigaku_genzei/context_util.py (2)
compose_context (20-36), compose_glossary (38-43)
scripts/teigaku_genzei/json_util.py (1)
get_title_match(9-10)
🪛 Pylint (4.0.0)
scripts/teigaku_genzei/make_context.py
[refactor] 56-56: Useless return at end of function or method
(R1711)
Actionable comments posted: 1
♻️ Duplicate comments (1)
scripts/teigaku_genzei/make_icl.py (1)
22-41: Flag parsing correctly implemented. The `--short_context` argument now uses `action="store_true"`, which properly addresses the past review concern about boolean flag parsing. The CLI configuration is well-structured with clear argument names and sensible defaults.
--glossary_fileargument is accepted but not currently utilized inmain()(line 93 is commented out). This is consistent with the TODO at lines 109-116 indicating planned glossary support.
🧹 Nitpick comments (1)
scripts/teigaku_genzei/make_icl.py (1)
87-117: Remove unused glossary option and clarify CLI arguments
- Remove the unused
option = args[ARG_OPTION]assignment (line 91).- Either drop the
--glossary_fileargument (line 33) or annotate it as “reserved for future use” since its handling is commented out.- Add basic error checks around
pd.read_csv(file existence, required columns, non-empty DataFrames) to fail with clear messages.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
scripts/teigaku_genzei/make_icl.py(1 hunks)scripts/teigaku_genzei/make_icl.sh(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/teigaku_genzei/make_icl.py (1)
scripts/teigaku_genzei/context_util.py (1)
compose_context(20-36)
🪛 Pylint (4.0.0)
scripts/teigaku_genzei/make_icl.py
[refactor] 45-53: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
🪛 Ruff (0.14.0)
scripts/teigaku_genzei/make_icl.py
43-43: Unused function argument: qa_df
(ARG001)
78-78: String contains ambiguous A (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?
(RUF001)
91-91: Local variable option is assigned to but never used
Remove assignment to unused variable option
(F841)
🔇 Additional comments (3)
scripts/teigaku_genzei/make_icl.sh (1)
1-3: LGTM! Script correctly uses the --short_context flag.The shell script properly invokes
make_icl.pywith--short_contextas a flag (no explicit value), which aligns with the correctedaction="store_true"implementation in the Python script. The relative paths and minimal structure are appropriate for this use case.scripts/teigaku_genzei/make_icl.py (2)
55-57: LGTM! Short context composition is correctly implemented.The function properly filters QA rows by index and delegates to
context_util.compose_context(). Thesection_indexparameter, though currently unused bycompose_context(), is passed through as documented incontext_util.pyfor "future extensions."
59-85: LGTM! ICL example generation correctly handles Japanese text.The function properly:
- Toggles between full and short context based on the
short_contextflag- Strips Japanese prefixes ("問 " for questions, "[A]" for answers) which is appropriate for this tax document domain
- Flattens ICL QA pairs into numbered fields (
icl_query_1,icl_response_1, etc.)The fullwidth "A" character flagged by static analysis is intentional for Japanese text processing, not an error.
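A compact standalone sketch of the two behaviors this review signs off on; the sample QA strings are invented for illustration:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--short_context", action="store_true")
assert parser.parse_args([]).short_context is False
assert parser.parse_args(["--short_context"]).short_context is True

# Flattening QA pairs into numbered ICL fields while stripping the Japanese
# "問 " (question) and "[A]" (answer) prefixes, as described above.
qa_pairs = [("問 定額減税の対象者は?", "[A]令和6年分の所得税の納税者です。")]
icl = {}
for i, (q, a) in enumerate(qa_pairs, start=1):
    icl[f"icl_query_{i}"] = q.removeprefix("問 ")     # Python 3.9+
    icl[f"icl_response_{i}"] = a.removeprefix("[A]")
print(icl)
```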
Actionable comments posted: 0
♻️ Duplicate comments (3)
scripts/teigaku_genzei/make_seed.py (3)
55-55: Fix KeyError: missing qindex column. Line 55 only selects the `"context"` column, but line 67 accesses `x["qindex"]`, causing a KeyError when the slide join method is used.
qindexcolumn and ensure it's numeric:- context_df = pd.read_csv(input_context_path, encoding="utf8")[["context"]] + context_df = pd.read_csv(input_context_path, encoding="utf8") + if "qindex" not in context_df.columns: + context_df = context_df.reset_index(names="qindex") + context_df["qindex"] = pd.to_numeric(context_df["qindex"], errors="coerce").fillna(0).astype(int) + context_df = context_df[["context", "qindex"]]
83-85: Raise the exception for invalid join method.Line 85 constructs an
ArgumentErrorbut does not raise it, allowing execution to continue silently with an empty DataFrame when an invalid join method is provided.Apply this diff to raise the exception:
else: - context_icl_df = pd.DataFrame() - argparse.ArgumentError(None, f"Invalid {ARG_JOIN_METHOD}: {opt_join_method}") + raise ValueError(f"Invalid {ARG_JOIN_METHOD}: {opt_join_method}")
66-66: Guard against division by zero.If
context_dfis empty, line 66 will raise a ZeroDivisionError.Apply this diff to add validation:
+ if len(context_df) == 0: + raise ValueError("No contexts loaded from CSV.") + if len(icl_df) == 0: + raise ValueError("No ICL samples loaded from JSONL.") average_context_length = len(icl_list) / len(context_df)
🧹 Nitpick comments (2)
scripts/teigaku_genzei/make_seed.py (2)
2-2: Remove unused imports.The
sysandjsonimports appear to be unused in this file.Apply this diff to remove the unused imports:
-import sys import argparse from typing import cast, Any, Hashable, Sequence, Collection -import json import pandas as pdAlso applies to: 5-5
41-42: Document or remove the no-op function.The
compose_seedfunction currently returns its input unchanged. If this is intentional for future extension, add a docstring explaining the design. Otherwise, consider removing it.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/teigaku_genzei/make_seed.py(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ryokawajp
PR: opendatahub-io/odh-data-processing#26
File: scripts/teigaku_genzei/make_context.sh:5-11
Timestamp: 2025-10-15T13:22:53.401Z
Learning: In shell scripts within the scripts/teigaku_genzei/ directory, the context/, qa_table/, icl/, docparser/, and source/ directories already exist in the repository because they contain committed .gitignore files, so scripts do not need to create these directories with mkdir -p before writing files into them.
🪛 Ruff (0.14.0)
scripts/teigaku_genzei/make_seed.py
35-35: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (1)
scripts/teigaku_genzei/make_seed.py (1)
33-39: LGTM! Past concerns addressed.The function now guards against empty DataFrames and uses positional indexing via
iloc, resolving the issues raised in previous reviews.
Actionable comments posted: 0
♻️ Duplicate comments (1)
scripts/teigaku_genzei/make_seed.py (1)
92-94: Raise the ArgumentError instead of constructing it. The `ArgumentError` is constructed but not raised, so invalid join methods would be silently ignored (though argparse's `choices` constraint prevents reaching this code path in practice). Apply this diff:
```diff
     else:
-        context_icl_df = pd.DataFrame()
-        argparse.ArgumentError(None, f"Invalid {ARG_JOIN_METHOD}: {opt_join_method}")
+        raise ValueError(f"Invalid {ARG_JOIN_METHOD}: {opt_join_method}")
```
🧹 Nitpick comments (4)
scripts/teigaku_genzei/make_seed.py (4)
28-28: Consider removing commented-out code. The commented device argument appears to be unused. If it's not needed for future functionality, consider removing it to improve code clarity.
41-42: Remove unused function or document its purpose. The `compose_seed` function is defined but never called, and currently just returns its input unchanged. Consider removing it if it's not needed, or document if it's a placeholder for future functionality.
46-48: Consider removing commented-out code. The old usage code in comments could be removed to improve code clarity.
68-72: Consider using `max(1, ...)` for consistency. While `select_icl_sample` handles zero values with `max(1, average_context_length)` on line 36, it would be clearer to ensure `average_context_length` is at least 1 here as well. This prevents passing 0 when `len(icl_df) < len(context_df)` (e.g., 1 ICL sample with 3 contexts → round(1/3) = 0).
- average_context_length = round(len(icl_df) / len(context_df)) + average_context_length = max(1, round(len(icl_df) / len(context_df)))
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/teigaku_genzei/make_seed.py(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ryokawajp
PR: opendatahub-io/odh-data-processing#26
File: scripts/teigaku_genzei/make_context.sh:5-11
Timestamp: 2025-10-15T13:22:53.401Z
Learning: In shell scripts within the scripts/teigaku_genzei/ directory, the context/, qa_table/, icl/, docparser/, and source/ directories already exist in the repository because they contain committed .gitignore files, so scripts do not need to create these directories with mkdir -p before writing files into them.
🪛 Ruff (0.14.0)
scripts/teigaku_genzei/make_seed.py
35-35: Avoid specifying long messages outside the exception class
(TRY003)
69-69: Avoid specifying long messages outside the exception class
(TRY003)
71-71: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (5)
scripts/teigaku_genzei/make_seed.py (5)
33-39: LGTM! The function correctly guards against empty DataFrames and uses positional indexing with iloc. The `max(1, average_context_length)` ensures the step is always at least 1, preventing division issues.
55-57: LGTM! The data loading correctly preserves both the `context` and `qindex` columns, addressing the KeyError issue from the previous review.
73-80: LGTM! The SLIDE join implementation correctly applies `select_icl_sample` to each context row with proper integer conversions, then concatenates the results while dropping the qindex column.
85-91: LGTM!The CARTESIAN join correctly creates a cross product of ICL samples and contexts by expanding each context with all ICL rows, then concatenating and resetting the index.
107-119: LGTM!The output DataFrame construction correctly adds the schema fields (document_outline, document_title, domain), concatenates with the joined context-ICL data, renames the context column to document, and writes to JSONL format with proper encoding.
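As a concrete illustration of that assembly, here is a minimal, self-contained sketch; the column names mirror the script, but the metadata strings below are placeholders rather than the real Japanese values:

```python
import pandas as pd

# Joined context-ICL rows, as produced by the SLIDE or CARTESIAN step (toy values).
context_icl_df = pd.DataFrame({
    "context": ["section text A", "section text B"],
    "icl_query_1": ["Q1?", "Q2?"],
    "icl_response_1": ["A1.", "A2."],
})

# Broadcast the constant schema fields across every row (placeholder strings).
meta_df = pd.DataFrame(
    {
        "document_outline": "outline placeholder",
        "document_title": "title placeholder",
        "domain": "tax",
    },
    index=context_icl_df.index,
)

# Rename context -> document and emit one JSON object per line,
# keeping non-ASCII text readable with force_ascii=False.
out_df = pd.concat([meta_df, context_icl_df], axis=1).rename(columns={"context": "document"})
out_df.to_json("seed.jsonl", orient="records", lines=True, force_ascii=False)
```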
Actionable comments posted: 2
🧹 Nitpick comments (9)
scripts/teigaku_genzei/make_seed.py (9)
2-2: Remove unused import. The `sys` import is not used in this module (line 47 shows commented-out `sys.exit` usage).
```diff
-import sys
```
5-5: Remove unused import. The `json` import is not used (JSON operations are handled by the `json_util` module).
```diff
-import json
```
28-28: Remove dead code. The commented device argument should be removed if it's not needed.
```diff
-    # parser.add_argument('--device', type=str, default="cuda")
-
```
41-42: Remove or implement stub function. This function currently does nothing (returns input unchanged). If it's a placeholder for future logic, add a TODO comment explaining the intent. Otherwise, remove it as it's not called anywhere in the code.
```diff
-def compose_seed(df: pd.DataFrame) -> pd.DataFrame:
-    return df
-
```
46-48: Remove dead code. The commented-out argument handling code should be removed.
```diff
-    # if len(sys.argv) != 4:
-    #     print("Usage: python script.py <input_context.csv> <input_icl.jsonl> <output_seed.jsonl")
-    #     sys.exit(1)
```
55-57: Consider adding error handling for file operations. File read operations lack error handling. If files are missing, inaccessible, or malformed, the script will crash with unclear errors. Consider wrapping in try-except blocks or letting the natural exceptions propagate with clearer context.
For example:
```python
try:
    context_df = pd.read_csv(input_context_path, encoding="utf8")[["context", "qindex"]]
except FileNotFoundError:
    raise FileNotFoundError(f"Context file not found: {input_context_path}")
except KeyError as e:
    raise KeyError(f"Required column missing in context file: {e}")
```
107-111: Consider making metadata configurable. The document metadata (outline, title, domain) is hard-coded with Japanese text and specific values. This limits reusability. Consider accepting these as CLI arguments or deriving them from input files.
Example:
```python
parser.add_argument('--document_outline', type=str, required=True)
parser.add_argument('--document_title', type=str, required=True)
parser.add_argument('--domain', type=str, required=True)
```
116-116: Remove dead code. The commented-out drop statement should be removed.
```diff
-    # .drop(columns=["section", "qindex", "qlist"])
```
117-118: Consider adding error handling for output operations. The output operations (print and to_json) lack error handling. Consider wrapping the to_json call to catch and report write failures clearly.
Example:
```python
try:
    out_df.to_json(output_context_path, orient="records", lines=True, force_ascii=False)
    print(f"Successfully wrote {len(out_df)} records to {output_context_path}")
except Exception as e:
    raise IOError(f"Failed to write output file {output_context_path}: {e}")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/teigaku_genzei/make_seed.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ryokawajp
PR: opendatahub-io/odh-data-processing#26
File: scripts/teigaku_genzei/make_context.sh:5-11
Timestamp: 2025-10-15T13:22:53.401Z
Learning: In shell scripts within the scripts/teigaku_genzei/ directory, the context/, qa_table/, icl/, docparser/, and source/ directories already exist in the repository because they contain committed .gitignore files, so scripts do not need to create these directories with mkdir -p before writing files into them.
🪛 Ruff (0.14.0)
scripts/teigaku_genzei/make_seed.py
35-35: Avoid specifying long messages outside the exception class
(TRY003)
69-69: Avoid specifying long messages outside the exception class
(TRY003)
71-71: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (4)
scripts/teigaku_genzei/make_seed.py (4)
33-39: LGTM with minor style note. The function correctly guards against empty DataFrames and uses positional indexing. The offset logic `(idx - step) % len(icl_df)` implements a sliding window pattern to avoid assigning ICL samples that are too close to their corresponding contexts.
Note: Static analysis suggests defining custom exception subclasses instead of inline messages, but this is a minor style preference and acceptable as-is.
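For intuition, a standalone sketch of that offset (not the module's exact code; the toy frame below stands in for the real ICL table):

```python
import pandas as pd

def select_icl_sample(icl_df: pd.DataFrame, idx: int, step: int) -> pd.DataFrame:
    # One ICL row per context: offset by `step` so a context never receives
    # the ICL example that was derived from it, wrapping around with modulo.
    if len(icl_df) == 0:
        raise ValueError("icl_df must not be empty")
    return icl_df.iloc[[(idx - step) % len(icl_df)]]

icl_df = pd.DataFrame({"icl_query_1": ["q0", "q1", "q2"]})
for idx in range(3):
    print(idx, "->", select_icl_sample(icl_df, idx, step=1).index[0])
# 0 -> 2, 1 -> 0, 2 -> 1
```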
81-91: LGTM! The cartesian join implementation correctly pairs each context with all ICL samples using DataFrame broadcasting. The index reset and qindex removal are appropriate.
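The same cross product can also be written with pandas' built-in cross merge; a small sketch under the assumption of one context column and one ICL frame:

```python
import pandas as pd

context_df = pd.DataFrame({"context": ["doc A", "doc B"]})
icl_df = pd.DataFrame({"icl_query_1": ["q1", "q2", "q3"]})

# Every context paired with every ICL row: len(context_df) * len(icl_df) rows.
# merge(how="cross") requires pandas >= 1.2.
cross_df = context_df.merge(icl_df, how="cross")
print(len(cross_df))  # 6
```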
92-93: LGTM! The exception is now correctly raised for invalid join methods, addressing the critical issue from the past review.
120-122: LGTM! Standard Python entry point implementation.
Actionable comments posted: 1
♻️ Duplicate comments (2)
scripts/teigaku_genzei/make_icl.py (1)
103-103: Ensure UTF-8 when writing JSONL (in jsonl_util). `jsonl_util.write_jsonl_file` writes without explicit encoding; Japanese text may break on non-UTF-8 systems. Please update that helper to open files with `encoding="utf-8"` and `ensure_ascii=False` (reads already use UTF-8). This affects this script's output.
Proposed change in scripts/teigaku_genzei/jsonl_util.py:
```diff
-def write_jsonl_file(file_path: str, data_list: list[Any]):
-    with open(file_path, 'w') as file:
+def write_jsonl_file(file_path: str, data_list: list[Any]):
+    with open(file_path, 'w', encoding='utf-8') as file:
         for data in data_list:
             file.write(json.dumps(data, ensure_ascii=False) + '\n')
```
scripts/teigaku_genzei/make_seed.py (1)
88-88: Raise a regular ValueError for invalid join method. `argparse.ArgumentError` is intended for parser internals; raising it here can be awkward. ValueError is clearer.
```diff
-    raise argparse.ArgumentError(None, f"Invalid {ARG_JOIN_METHOD}: {opt_join_method}")
+    raise ValueError(f"Invalid {ARG_JOIN_METHOD}: {opt_join_method}")
```
🧹 Nitpick comments (15)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (6)
10-20: Remove commented-out code. The commented `prepare` method and debug print statements should be removed to improve code clarity, unless they're needed for immediate debugging. Apply this diff:
```diff
 class LLMJudgeDirectPositionalBias(LLMJudgeDirect):
-    # def prepare(self):
-    #     super().prepare()
-    #     print(f"LLMJudgeDirectPositionalBias.prepare() criteria: {self.criteria}")
-    #     print(f"LLMJudgeDirectPositionalBias.prepare() inference_engine: {self.inference_engine}")
-    #     if self.criteria is not None:
-    #         self.ci_scores = [
-    #             self.criteria.name,
-    #             self.criteria.name + "_positional_bias_score",
-    #             self.criteria.name + "_positional_bias_average",
-    #         ]
-    #     print(f"LLMJudgeDirectPositionalBias.prepare() ci_scores: {self.ci_scores}")
-
     POSTFIX_SCORE = "_positional_bias_score"
```
47-47: Replace print statement with logging. Debug print statements should use proper logging for production code.
Apply this diff:
```diff
+from unitxt.logging_utils import get_logger
+
+logger = get_logger()
```
Then:
```diff
-        print(f"LLMJudgeDirectPositionalBias.prepare() ci_scores: {self.ci_scores}")
+        logger.info(f"LLMJudgeDirectPositionalBias ci_scores: {self.ci_scores}")
```
63-64: Remove commented-out code. The old implementation should be removed to improve readability.
Apply this diff:
```diff
     pb_scores = [
-        # self.criteria.option_map.get(result.get(backward_selected_option_name, "NO_RESULT"), 1)
-        # if (self.criteria is not None) and (self.criteria.option_map is not None) else 1
         get_pb_score(result, self.criteria, result[self.main_score])
         for result in results
     ]
```
48-48: Consider using unpacking syntax for list concatenation. The static analysis suggests using unpacking for better readability and performance.
Apply this diff:
```diff
-        self.reduction_map = {"mean": [self.main_score] + self.ci_scores}
+        self.reduction_map = {"mean": [self.main_score, *self.ci_scores]}
```
Based on learnings.
72-72: Add explicit `strict=` parameter to `zip()`. Python 3.10+ supports the `strict` parameter to catch length mismatches. Since `results` and `pb_scores` should always have the same length, using `strict=True` adds a safety check.
Apply this diff:
```diff
 pb_results = [
     result | {
         backward_score_name: pb_score,
         average_score_name: (result[self.main_score] + pb_score) / 2,
     }
-    for result, pb_score in zip(results, pb_scores)
+    for result, pb_score in zip(results, pb_scores, strict=True)
 ]
```
Based on learnings.
92-104: LGTM! Base URL construction properly uses the static helper. The use of `RITSInferenceEngineFixed.get_base_url_from_model_name_v()` correctly addresses the previous DRY violation by delegating URL construction to the inference engine.
However, consider removing the commented-out code at line 99 for cleanliness:
```diff
     model_name=llm_as_judge_direct_rits_llama4.MODEL_NAME,
     base_url=RITSInferenceEngineFixed.get_base_url_from_model_name_v(llm_as_judge_direct_rits_llama4.ENDPOINT_NAME, "/v1"),
-    # base_url=RITSInferenceEngine.get_base_url_from_model_name(llm_as_judge_direct_rits_llama4.ENDPOINT_NAME) + "/v1",
 ),
```
scripts/teigaku_genzei/json_util.py (2)
2-2: Drop unused typing imports. They aren't used here; remove to reduce noise.
```diff
-from typing import cast, Any, Hashable, Sequence, Collection
```
7-7: Silence intentional full-width character warnings (RUF001). These regexes intentionally match full-width digits/dashes and full-width "A". Add noqa notes to prevent false positives.
```diff
-    return re.match(r"[0-90-9]+[-ー][0-90-9]+", text) is not None
+    return re.match(r"[0-90-9]+[-ー][0-90-9]+", text) is not None  # noqa: RUF001 (intentional full-width chars)
-    return re.match(r"([0-90-9]+)[-ー]([0-90-9]+)", text)
+    return re.match(r"([0-90-9]+)[-ー]([0-90-9]+)", text)  # noqa: RUF001 (intentional full-width chars)
-    return re.match(r"^\[A\]", text) is not None
+    return re.match(r"^\[A\]", text) is not None  # noqa: RUF001 (intentional full-width A)
```
Also applies to: 10-10, 16-16
scripts/teigaku_genzei/README.md (1)
58-60: Add language to fenced code blocks for better rendering and copy/paste. Specify a language (bash) for CLI snippets.
Example:
````diff
-```
-Usage: python json_qa.py <input_json_file> <output_csv_file>
-```
+```bash
+# Usage
+python json_qa.py <input_json_file> <output_csv_file>
+```
````
Apply similarly to the other code fences in the listed ranges.
Also applies to: 66-68, 115-117, 140-143, 196-197, 210-212, 224-226
scripts/teigaku_genzei/json_qa.py (4)
7-7: Remove unused import. `json` isn't used in this module.
```diff
-import json
```
51-51: Make zip strict to catch misaligned markers. Fails fast if counts of titles, 問, and [A] diverge.
```diff
-for (ti, qhi, ahi, ati) in zip(title_pos_list, qhead_pos_list, ahead_pos_list, atail_pos_list):
+for (ti, qhi, ahi, ati) in zip(title_pos_list, qhead_pos_list, ahead_pos_list, atail_pos_list, strict=True):
```
If you expect occasional mismatches in real PDFs, prefer an explicit alignment routine instead of strict zip. Do you want a helper to align nearest markers?
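For intuition about the marker alignment that strict zip protects, a reduced sketch; the ASCII predicates below are stand-ins for json_util's full-width Japanese matchers:

```python
def is_qhead(text: str) -> bool:
    return text.startswith("Q:")   # stand-in for the 問 question marker

def is_ahead(text: str) -> bool:
    return text.startswith("A:")   # stand-in for the full-width [A] marker

blocks = ["1-1 Title", "Q: first question", "A: first answer",
          "Q: second question", "A: second answer"]

q_pos = [i for i, t in enumerate(blocks) if is_qhead(t)]
a_pos = [i for i, t in enumerate(blocks) if is_ahead(t)]

# Pair each question with the answer that follows it; strict=True fails fast
# if the marker counts ever diverge (Python 3.10+).
rows = [(blocks[q], blocks[a]) for q, a in zip(q_pos, a_pos, strict=True)]
print(rows)
```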
66-68: Show the actual script name in usage. Use argv[0] so copy/paste works regardless of filename.
```diff
-    if len(sys.argv) != 3:
-        print("Usage: python script.py <input_json_file> <output_csv_file>")
+    if len(sys.argv) != 3:
+        print(f"Usage: python {sys.argv[0]} <input_json_file> <output_csv_file>")
         sys.exit(1)
```
38-38: Consider joining with newlines for readability. Concatenating without separators can jam sentences; newlines often preserve layout from Docling better.
```diff
-    return "".join(block_text_list)
+    return "\n".join(block_text_list)
```
If downstream expects a single line, keep current behavior.
scripts/teigaku_genzei/make_seed.py (1)
2-6: Prune unused imports. Tighten to what's actually used.
```diff
-import sys
 import argparse
-from typing import cast, Any, Hashable, Sequence, Collection
-import json
+from typing import cast, Any
 import pandas as pd
```
1-5: Trim unused typing/sys imports.Keep only what’s referenced.
-import sys -from typing import cast, Any, Hashable, Sequence, Collection +from typing import cast, Any import json import argparse import pandas as pd
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
scripts/teigaku_genzei/README.md (1 hunks)
scripts/teigaku_genzei/json_qa.py (1 hunks)
scripts/teigaku_genzei/json_util.py (1 hunks)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1 hunks)
scripts/teigaku_genzei/make_icl.py (1 hunks)
scripts/teigaku_genzei/make_icl.sh (1 hunks)
scripts/teigaku_genzei/make_seed.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/teigaku_genzei/make_icl.sh
🧰 Additional context used
🧬 Code graph analysis (4)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (3)
RITSInferenceEngineFixed (17-30), register (33-57), get_base_url_from_model_name_v (20-21)
scripts/teigaku_genzei/make_seed.py (3)
scripts/teigaku_genzei/make_icl.py (2)
config (22-41), main (86-112)
scripts/teigaku_genzei/seed_jsonl_2_qna_yaml.py (1)
main (24-52)
scripts/teigaku_genzei/jsonl_util.py (1)
read_jsonl_file (14-19)
scripts/teigaku_genzei/json_qa.py (2)
scripts/teigaku_genzei/json_glossary.py (3)
is_body_text (18-19), is_section_title (32-33), save_to_csv (48-53)
scripts/teigaku_genzei/json_util.py (4)
is_title (6-7), is_qhead (12-13), is_ahead (15-16), read_json_file (18-21)
scripts/teigaku_genzei/make_icl.py (2)
scripts/teigaku_genzei/context_util.py (1)
compose_context (20-36)
scripts/teigaku_genzei/jsonl_util.py (1)
write_jsonl_file (7-12)
🪛 LanguageTool
scripts/teigaku_genzei/README.md
[grammar] ~55-~55: Use a hyphen to join words.
Context: ...structing the SDG seed data. ## Command line interface json_qa.py is used to ...
(QB_NEW_EN_HYPHEN)
[style] ~90-~90: Consider using “who” when you are referring to a person instead of an object.
Context: ...ositional_bias`, then an extended judge that reflects the result of the positional b...
(THAT_WHO)
[grammar] ~111-~111: Use a hyphen to join words.
Context: ...-8b-instruct-sft-checkpoint` ## Command line interface The test and evaluation i...
(QB_NEW_EN_HYPHEN)
[style] ~133-~133: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ed data classification of LLM services. See [Sensitive data in unitxt — Unitxt](htt...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~181-~181: ‘exactly the same’ might be wordy. Consider a shorter alternative.
Context: ...might let the teacher model to generate exactly the same questions for training, which can lead ...
(EN_WORDINESS_PREMIUM_EXACTLY_THE_SAME)
[style] ~186-~186: ‘exactly the same’ might be wordy. Consider a shorter alternative.
Context: ...ents and query - response pairs are not exactly the same with the seed contexts, which is explai...
(EN_WORDINESS_PREMIUM_EXACTLY_THE_SAME)
[grammar] ~192-~192: Use a hyphen to join words.
Context: ... examples and seed contexts. ## Command line interface make_context.py is use...
(QB_NEW_EN_HYPHEN)
[grammar] ~220-~220: Use a hyphen to join words.
Context: ... - I.e., it contains one document (so called "context") and three query - resp...
(QB_NEW_EN_HYPHEN)
[style] ~232-~232: Consider removing “of” to be more concise
Context: ...cartesian` - Assign each ICL example to all of the contexts. See the description in ["Proc...
(ALL_OF_THE)
🪛 markdownlint-cli2 (0.18.1)
scripts/teigaku_genzei/README.md
7-7: Bare URL used
(MD034, no-bare-urls)
10-10: Bare URL used
(MD034, no-bare-urls)
11-11: Bare URL used
(MD034, no-bare-urls)
20-20: Bare URL used
(MD034, no-bare-urls)
21-21: Bare URL used
(MD034, no-bare-urls)
35-35: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
37-37: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
38-38: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
38-38: Bare URL used
(MD034, no-bare-urls)
39-39: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
40-40: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
41-41: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
46-46: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
47-47: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
48-48: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
50-50: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
51-51: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
52-52: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
53-53: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
58-58: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
66-66: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
79-79: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
81-81: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
82-82: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
82-82: Bare URL used
(MD034, no-bare-urls)
83-83: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
88-88: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
89-89: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
90-90: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
91-91: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
92-92: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
94-94: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
95-95: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
96-96: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
97-97: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
101-101: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
102-102: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
103-103: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
104-104: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
105-105: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
106-106: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
107-107: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
109-109: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
115-115: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
130-130: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
131-131: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
132-132: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
140-140: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
147-147: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
155-155: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
156-156: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
157-157: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
158-158: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
161-161: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
162-162: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
163-163: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
165-165: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
166-166: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
167-167: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
179-179: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
180-180: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
181-181: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
183-183: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
184-184: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
185-185: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
186-186: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
188-188: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
189-189: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
190-190: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
195-195: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
210-210: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
219-219: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
220-220: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
224-224: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Pylint (4.0.1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py
[refactor] 9-9: Too few public methods (1/2)
(R0903)
🪛 Ruff (0.14.1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py
48-48: Consider [self.main_score, *self.ci_scores] instead of concatenation
Replace with [self.main_score, *self.ci_scores]
(RUF005)
72-72: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
scripts/teigaku_genzei/make_seed.py
35-35: Avoid specifying long messages outside the exception class
(TRY003)
64-64: Avoid specifying long messages outside the exception class
(TRY003)
66-66: Avoid specifying long messages outside the exception class
(TRY003)
scripts/teigaku_genzei/json_qa.py
22-22: String contains ambiguous ( (FULLWIDTH LEFT PARENTHESIS). Did you mean ( (LEFT PARENTHESIS)?
(RUF001)
22-22: String contains ambiguous ) (FULLWIDTH RIGHT PARENTHESIS). Did you mean ) (RIGHT PARENTHESIS)?
(RUF001)
33-33: Comment contains ambiguous A (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?
(RUF003)
35-35: Comment contains ambiguous 1 (FULLWIDTH DIGIT ONE). Did you mean 1 (DIGIT ONE)?
(RUF003)
49-49: Consider [*title_pos_list[1:], len(text_list)] instead of concatenation
Replace with [*title_pos_list[1:], len(text_list)]
(RUF005)
51-51: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
scripts/teigaku_genzei/make_icl.py
77-77: String contains ambiguous A (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?
(RUF001)
scripts/teigaku_genzei/json_util.py
7-7: String contains ambiguous 0 (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?
(RUF001)
7-7: String contains ambiguous 9 (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?
(RUF001)
7-7: String contains ambiguous - (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)?
(RUF001)
7-7: String contains ambiguous 0 (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?
(RUF001)
7-7: String contains ambiguous 9 (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?
(RUF001)
10-10: String contains ambiguous 0 (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?
(RUF001)
10-10: String contains ambiguous 9 (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?
(RUF001)
10-10: String contains ambiguous - (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)?
(RUF001)
10-10: String contains ambiguous 0 (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?
(RUF001)
10-10: String contains ambiguous 9 (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?
(RUF001)
16-16: String contains ambiguous A (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?
(RUF001)
🔇 Additional comments (1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
1-7: LGTM! The imports are well-organized and appropriate for the functionality provided.
Actionable comments posted: 3
🧹 Nitpick comments (4)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (4)
10-20: Remove commented-out code. The commented `prepare()` method serves no purpose and reduces readability. If this code might be needed for debugging, use version control to retrieve it later. Apply this diff:
```diff
 class LLMJudgeDirectPositionalBias(LLMJudgeDirect):
-    # def prepare(self):
-    #     super().prepare()
-    #     print(f"LLMJudgeDirectPositionalBias.prepare() criteria: {self.criteria}")
-    #     print(f"LLMJudgeDirectPositionalBias.prepare() inference_engine: {self.inference_engine}")
-    #     if self.criteria is not None:
-    #         self.ci_scores = [
-    #             self.criteria.name,
-    #             self.criteria.name + "_positional_bias_score",
-    #             self.criteria.name + "_positional_bias_average",
-    #         ]
-    #     print(f"LLMJudgeDirectPositionalBias.prepare() ci_scores: {self.ci_scores}")
-
     POSTFIX_SCORE = "_positional_bias_score"
```
47-47: Replace print with logging. Production code should use proper logging instead of print statements for observability and control.
Add logging import at the top:
```python
import logging

logger = logging.getLogger(__name__)
```
Then apply this diff:
```diff
-        print(f"LLMJudgeDirectPositionalBias.prepare() ci_scores: {self.ci_scores}")
+        logger.debug(f"LLMJudgeDirectPositionalBias ci_scores: {self.ci_scores}")
```
48-48: Consider unpacking for list concatenation. Static analysis suggests using unpacking instead of list concatenation for better readability.
Apply this diff:
```diff
-        self.reduction_map = {"mean": [self.main_score] + self.ci_scores}
+        self.reduction_map = {"mean": [self.main_score, *self.ci_scores]}
```
100-100: Remove commented-out code. The old approach is no longer needed since the fix is in place.
Apply this diff:
```diff
     base_url=RITSInferenceEngineFixed.get_base_url_from_model_name_v(llm_as_judge_direct_rits_llama4.ENDPOINT_NAME, "/v1"),
-    # base_url=RITSInferenceEngine.get_base_url_from_model_name(llm_as_judge_direct_rits_llama4.ENDPOINT_NAME) + "/v1",
 ),
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (3)
RITSInferenceEngineFixed (17-30), register (33-57), get_base_url_from_model_name_v (20-21)
🪛 Pylint (4.0.1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py
[refactor] 9-9: Too few public methods (1/2)
(R0903)
🪛 Ruff (0.14.1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py
48-48: Consider [self.main_score, *self.ci_scores] instead of concatenation
Replace with [self.main_score, *self.ci_scores]
(RUF005)
73-73: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
| def get_pb_score( | ||
| result: Dict[str, str], | ||
| criteria: CriteriaWithOptions, | ||
| fallback_score: float, | ||
| out_of_range_score: float, | ||
| ) -> float: | ||
| option_map = criteria.option_map if (criteria is not None) and (criteria.option_map is not None) else None | ||
| if option_map is None: | ||
| return fallback_score | ||
| selected_option = result.get(backward_selected_option_name) | ||
| score = option_map.get(selected_option, out_of_range_score) if selected_option is not None else fallback_score | ||
| return score |
Inconsistent fallback for invalid selected options.
Line 60 uses `out_of_range_score` (0.0) as the default when `selected_option` is not found in `option_map`. However, the past review comment for these lines suggested using `fallback_score` instead, which would be the forward score.
The current behavior penalizes unexpected options with 0.0, but this may not be the intended design. If an invalid option appears (e.g., due to model output variation), falling back to the forward score would be more robust than defaulting to 0.0.
Apply this diff to align with the past review suggestion:
```diff
-    score = option_map.get(selected_option, out_of_range_score) if selected_option is not None else fallback_score
+    score = option_map.get(selected_option, fallback_score) if selected_option is not None else fallback_score
```
If penalizing invalid options with 0.0 is intentional, consider documenting this behavior in a comment or docstring.
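To see the two fallback policies side by side, a simplified sketch (a plain function with a toy option map, not the class's actual attributes):

```python
def pb_score(selected_option, option_map, fallback_score):
    # Proposed policy: missing or unrecognized backward selections fall back
    # to the forward score instead of a hard 0.0 penalty.
    if selected_option is None or selected_option not in option_map:
        return fallback_score
    return option_map[selected_option]

option_map = {"Yes": 1.0, "No": 0.0}
forward = 1.0
print(pb_score("No", option_map, forward))     # 0.0 -> backward run disagreed
print(pb_score("Maybe", option_map, forward))  # 1.0 -> invalid option, forward score
print((forward + pb_score("No", option_map, forward)) / 2)  # 0.5 positional-bias average
```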
🤖 Prompt for AI Agents
In scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py around lines 50
to 61, the get_pb_score function currently uses out_of_range_score when a
selected_option is present but not found in criteria.option_map; change that
default to fallback_score so invalid/unrecognized options fall back to the
forward score (retain using fallback_score when selected_option is None), i.e.,
replace option_map.get(selected_option, out_of_range_score) with
option_map.get(selected_option, fallback_score); if the original 0.0 penalty was
intentional, instead add a clarifying comment/docstring explaining that
behavior.
Actionable comments posted: 1
♻️ Duplicate comments (2)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (2)
108-116: Optional: let RITSInferenceEngineFixed derive base_url; avoid duplication. You can omit base_url and rely on prepare_engine() to append "/v1".
```diff
-    base_url=RITSInferenceEngineFixed.get_base_url_from_model_name_v(llm_as_judge_direct_rits_llama4.ENDPOINT_NAME, "/v1"),
-    # base_url=RITSInferenceEngine.get_base_url_from_model_name(llm_as_judge_direct_rits_llama4.ENDPOINT_NAME) + "/v1",
```
50-77: Invalid backward selected option should fall back to forward score; update types/docs. Penalizing unrecognized options with 0.0 is brittle; prefer fallback to the forward score for robustness, or clearly document the intentional penalty.
```diff
-    def get_pb_score(
-        result: Dict[str, str],
-        criteria: CriteriaWithOptions,
+    def get_pb_score(
+        result: Dict[str, Any],
+        criteria: CriteriaWithOptions | None,
         fallback_score: float,
         default_pb_score: float,
     ) -> float:
@@
-        PB (positional bias) is the difference of the selection of an option between forward- and backward-ordered options.
-        To mitigate PB, we compute the average of the scores associated with the selected options from forward- and backward-ordered options.
+        PB (positional bias) arises when option selection differs between forward- and backward-ordered options.
+        To mitigate PB, we compute the average of the scores associated with the selected options from forward- and backward-ordered options.
@@
-        :param default_pb_score: The score to use if the selected option is not found in criteria.option_map.
-            Note that this is an erroneous case; this score is used only when the judge's inference result does not contain a meaningful option value.
+        :param default_pb_score: The score to use only when the judge's output lacks a meaningful option value entirely.
@@
-        score = option_map.get(selected_option, default_pb_score) if selected_option is not None else fallback_score
+        # Fall back to forward score when the backward-selected option is invalid/unrecognized.
+        score = option_map.get(selected_option, fallback_score) if selected_option is not None else fallback_score
```
If the 0.0 penalty is intentional, keep the current behavior but add a brief comment explaining why penalization is preferred over fallback.
🧹 Nitpick comments (1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
41-49: Remove print and tighten reduction_map construction. Avoid stdout in libraries; use logging or drop it. Also prefer splat to concat.
```diff
-        print(f"LLMJudgeDirectPositionalBias.prepare() ci_scores: {self.ci_scores}")
-        self.reduction_map = {"mean": [self.main_score] + self.ci_scores}
+        # Debug logging if needed (e.g., logger.debug)
+        self.reduction_map = {"mean": [self.main_score, *self.ci_scores]}
```
Optionally, set ci_scores/reduction_map in prepare() to avoid recomputing each call.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
scripts/teigaku_genzei/llm_as_judge_direct_rits_llama4.py (3)
RITSInferenceEngineFixed (17-30), register (33-57), get_base_url_from_model_name_v (20-21)
🪛 Pylint (4.0.1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py
[refactor] 9-9: Too few public methods (1/2)
(R0903)
🪛 Ruff (0.14.1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py
48-48: Consider [self.main_score, *self.ci_scores] instead of concatenation
Replace with [self.main_score, *self.ci_scores]
(RUF005)
🔇 Additional comments (1)
scripts/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
79-89: Good defensive coding on PB merge. Using `result.get(..., default_pb_score)` and `zip(..., strict=True)` avoids KeyError and length bugs. LGTM.
…n SDG seed generation.
Actionable comments posted: 3
🧹 Nitpick comments (13)
scripts/teigaku_genzei/README.md (3)
63-63: Add language identifiers to fenced code blocks for proper syntax highlighting. Fenced code blocks lack language specifications, which prevents syntax highlighting and reduces readability. All code blocks should declare their language.
Apply this diff to add language identifiers where missing:
- Line 63:
  ```bash
  Usage: python json_qa.py <input_json_file> <output_csv_file>
  ```
- Line 71:
  ```bash
  Usage: python json_glossary.py <input_json_file> <output_csv_file>
  ```
- Line 120:
  ```bash
  python test_qa.py [-h] [--model_path MODEL_PATH] ... [--instruction INSTRUCTION]
  ```
- Line 145:
  ```csv
  Title,Question,Answer
  "Q1","What is the capital of France?","Paris"
  ```
- Line 200:
  ```bash
  usage: make_context.py [-h] [--glossary_option {appendix,header,none}] ...
  ```
- Line 215:
  ```bash
  usage: make_icl.py [-h] [--short_context SHORT_CONTEXT}] ...
  ```
- Line 229:
  ```bash
  usage: make_seed.py [-h] [--context_file CONTEXT_FILE] ...
  ```
40-102: Fix list indentation to comply with markdown standards.Multiple list items throughout the document have inconsistent indentation (4 spaces instead of 2, or 8 instead of 4). Correct indentation improves readability and ensures proper markdown rendering.
Key areas to fix:
- Lines 40–46 (Prerequisites section): Reduce from 4 spaces to 2; nested items at 46 should be at 4 not 8
- Lines 51–58 (Procedure section): Reduce from 4 spaces to 2; nested items at 57–58 should be at 4 not 8
- Lines 84–114 (Testing Prerequisites, One-time setup, Procedure): Reduce indentation consistently to 2 for top-level, 4 for nested
- Lines 152, 160–172 (Input/Output format notes): Reduce from 4 to 2 spaces
- Lines 184–195 (SDG seed generation Procedure): Reduce from 4 to 2 spaces
Also applies to: 152-152, 160-172, 184-195
7-7: Wrap bare URLs in markdown link syntax. Several bare URLs should be wrapped in markdown link format `[text](url)` for better documentation structure and link recognition.
Examples to fix:
- Line 7: https://www.nta.go.jp/users/gensen/teigakugenzei/01.htm → [Teigaku Genzei — NTA](https://www.nta.go.jp/users/gensen/teigakugenzei/01.htm)
- Line 10: https://www.nta.go.jp/publication/pamph/gensen/0024001-021.pdf → wrap as link
- Line 43: https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub → already appears in surrounding text; ensure markdown link format
Also applies to: 10-10, 11-11, 20-20, 21-21, 43-43, 87-87
scripts/teigaku_genzei/teigaku_genzei.ipynb (2)
217-219: Consider using `logger.exception` for better debugging. Using `logger.exception` instead of `logger.error` will include the full traceback, which is helpful for debugging export failures.
```diff
 except Exception as e:
-    logger.error(f"Failed to export {format_name} format: {str(e)}")
+    logger.exception(f"Failed to export {format_name} format")
     raise
```
logger.exceptioninstead oflogger.errorwill include the full traceback, which is helpful for debugging export failures.except Exception as e: - logger.error(f"Failed to export {format_name} format: {str(e)}") + logger.exception(f"Failed to export {format_name} format") raise
258-261: Exception handling after re-raising is inconsistent.The
export_documentfunction re-raises exceptions after logging (line 219), butexport_document_new_doclingcatches and continues (line 261). This means if one format fails inexport_document, the entire document is marked as failed even though other formats may have succeeded.Consider whether the
raiseat line 219 should be removed to allow partial exports, or ifexport_document_new_doclingshould not catch and continue. The current behavior may not match the intended error handling strategy.scripts/teigaku_genzei/json_util.py (2)
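To make the two strategies concrete, a toy sketch; the function names here are illustrative, not the notebook's API:

```python
import logging

logger = logging.getLogger(__name__)

def export_all_formats(exporters: dict) -> None:
    # Strategy 1: log with traceback, then re-raise -> the first failing
    # format aborts the whole document.
    for name, export in exporters.items():
        try:
            export()
        except Exception:
            logger.exception("Failed to export %s format", name)
            raise

def convert_all_documents(paths: list, convert) -> int:
    # Strategy 2: log with traceback and continue -> one bad document
    # does not stop the batch; failures are counted instead.
    failures = 0
    for path in paths:
        try:
            convert(path)
        except Exception:
            failures += 1
            logger.exception("Failed to process %s", path)
            continue
    return failures
```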
2-2: Remove unused imports.The imports
cast,Hashable,Sequence, andCollectionare not used in this file.-from typing import cast, Any, Hashable, Sequence, Collection +from typing import Any
6-19: Add comments explaining fullwidth character patterns.The fullwidth characters (0-9, -, ー, A) in the regex patterns are intentional for matching Japanese document formats. Adding brief comments would help maintainers understand this is not a typo and suppress linter warnings.
def is_title(text: str) -> bool: + # Fullwidth digits and hyphens are intentional for Japanese document format return re.match(r"[0-90-9]+[-ー][0-90-9]+", text) is not NoneAlternatively, you can add a
# noqa: RUF001comment to suppress the linter warnings for these lines.scripts/teigaku_genzei/json_qa.py (1)
7-7: Remove unused import.The
jsonmodule is imported but never used in this file.-import jsonscripts/teigaku_genzei/make_seed.py (2)
2-5: Remove unused imports.Several imports are not used in this file:
sys,json,Hashable,Sequence,Collection.-import sys -import argparse -from typing import cast, Any, Hashable, Sequence, Collection -import json +import argparse +from typing import cast, Any
92-101: Consider making metadata configurable.The
document_outline,document_title, anddomainvalues are hardcoded. For reusability across different document sets, consider making these configurable via CLI arguments or a configuration file.This is acceptable for the current example scope, but if you plan to extend this for other document sources, parameterizing these values would be beneficial.
scripts/teigaku_genzei/make_icl.py (3)
1-6: Remove unused imports.Several imports are not used:
sys,json,Hashable,Sequence,Collection.-import sys -from typing import cast, Any, Hashable, Sequence, Collection -import json +from typing import cast, Any import argparse
76-78: Fullwidth character is intentional - add comment for clarity.The
[A]prefix removal uses fullwidth characters matching the Japanese document format. Consider adding a brief comment to clarify this is intentional, similar to the suggestion forjson_util.py.icl_qa_list = [ { "icl_query_" + str(i + 1): qa_df.loc[qa]["Question"].removeprefix("問 ").strip(), + # Fullwidth [A] matches Japanese answer marker format "icl_response_" + str(i + 1): qa_df.loc[qa]["Answer"].removeprefix("[A]").strip(), } for (i, qa) in enumerate(context_qa_index[1]) ]
112-119: Clean up or track TODO comments.These TODO comments contain incomplete implementation notes. Consider either:
- Removing them if no longer relevant
- Converting to a tracked issue for future work
- Completing the thought with actionable items
Would you like me to help create an issue to track the context size optimization work mentioned in these TODOs?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
scripts/teigaku_genzei/README.md (1 hunks)
scripts/teigaku_genzei/json_qa.py (1 hunks)
scripts/teigaku_genzei/json_util.py (1 hunks)
scripts/teigaku_genzei/make_context.py (1 hunks)
scripts/teigaku_genzei/make_icl.py (1 hunks)
scripts/teigaku_genzei/make_seed.py (1 hunks)
scripts/teigaku_genzei/seed/.gitignore (1 hunks)
scripts/teigaku_genzei/teigaku_genzei.ipynb (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- scripts/teigaku_genzei/seed/.gitignore
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/teigaku_genzei/make_context.py
🧰 Additional context used
🧬 Code graph analysis (3)
scripts/teigaku_genzei/make_seed.py (1)
scripts/teigaku_genzei/jsonl_util.py (1)
read_jsonl_file(12-17)
scripts/teigaku_genzei/json_qa.py (1)
scripts/teigaku_genzei/json_util.py (5)
is_title(6-7)is_toc(9-10)is_qhead(15-16)is_ahead(18-19)read_json_file(21-24)
scripts/teigaku_genzei/make_icl.py (5)
scripts/teigaku_genzei/make_context.py (2)
config(27-43)main(71-85)scripts/teigaku_genzei/make_seed.py (2)
config(17-31)main(104-117)scripts/teigaku_genzei/context_util.py (1)
compose_context(20-36)scripts/teigaku_genzei/seed_jsonl_2_qna_yaml.py (1)
main(24-52)scripts/teigaku_genzei/jsonl_util.py (1)
write_jsonl_file(7-10)
🪛 LanguageTool
scripts/teigaku_genzei/README.md
[grammar] ~60-~60: Use a hyphen to join words.
Context: ...structing the SDG seed data. ## Command line interface json_qa.py is used to ...
(QB_NEW_EN_HYPHEN)
[style] ~95-~95: Consider using “who” when you are referring to a person instead of an object.
Context: ...ositional_bias`, then an extended judge that reflects the result of the positional b...
(THAT_WHO)
[grammar] ~116-~116: Use a hyphen to join words.
Context: ...-8b-instruct-sft-checkpoint` ## Command line interface The test and evaluation i...
(QB_NEW_EN_HYPHEN)
[style] ~138-~138: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ed data classification of LLM services. See [Sensitive data in unitxt — Unitxt](htt...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~186-~186: ‘exactly the same’ might be wordy. Consider a shorter alternative.
Context: ...might let the teacher model to generate exactly the same questions for training, which can lead ...
(EN_WORDINESS_PREMIUM_EXACTLY_THE_SAME)
[style] ~191-~191: ‘exactly the same’ might be wordy. Consider a shorter alternative.
Context: ...ents and query - response pairs are not exactly the same with the seed contexts, which is explai...
(EN_WORDINESS_PREMIUM_EXACTLY_THE_SAME)
[grammar] ~197-~197: Use a hyphen to join words.
Context: ... examples and seed contexts. ## Command line interface make_context.py is use...
(QB_NEW_EN_HYPHEN)
[grammar] ~225-~225: Use a hyphen to join words.
Context: ... - I.e., it contains one document (so called "context") and three query - resp...
(QB_NEW_EN_HYPHEN)
[style] ~237-~237: Consider removing “of” to be more concise
Context: ...cartesian` - Assign each ICL example to all of the contexts. See the description in ["Proc...
(ALL_OF_THE)
🪛 markdownlint-cli2 (0.18.1)
scripts/teigaku_genzei/README.md
7-7: Bare URL used
(MD034, no-bare-urls)
10-10: Bare URL used
(MD034, no-bare-urls)
11-11: Bare URL used
(MD034, no-bare-urls)
20-20: Bare URL used
(MD034, no-bare-urls)
21-21: Bare URL used
(MD034, no-bare-urls)
40-40: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
42-42: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
43-43: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
43-43: Bare URL used
(MD034, no-bare-urls)
44-44: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
45-45: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
46-46: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
51-51: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
52-52: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
53-53: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
55-55: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
56-56: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
57-57: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
58-58: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
63-63: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
71-71: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
84-84: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
86-86: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
87-87: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
87-87: Bare URL used
(MD034, no-bare-urls)
88-88: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
93-93: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
94-94: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
95-95: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
96-96: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
97-97: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
99-99: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
100-100: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
101-101: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
102-102: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
106-106: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
107-107: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
108-108: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
109-109: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
110-110: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
111-111: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
112-112: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
114-114: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
120-120: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
135-135: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
136-136: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
137-137: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
145-145: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
152-152: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
160-160: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
161-161: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
162-162: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
163-163: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
166-166: Unordered list indentation
Expected: 4; Actual: 8
(MD007, ul-indent)
167-167: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
168-168: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
170-170: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
171-171: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
172-172: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
184-184: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
185-185: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
186-186: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
188-188: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
189-189: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
190-190: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
191-191: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
193-193: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
194-194: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
195-195: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
200-200: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
215-215: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
224-224: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
225-225: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
229-229: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.8)
scripts/teigaku_genzei/teigaku_genzei.ipynb
125-125: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
125-125: Use explicit conversion flag
Replace with conversion flag
(RUF010)
166-166: Do not catch blind exception: Exception
(BLE001)
168-168: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
168-168: Use explicit conversion flag
Replace with conversion flag
(RUF010)
scripts/teigaku_genzei/make_seed.py
35-35: Avoid specifying long messages outside the exception class
(TRY003)
55-55: Avoid specifying long messages outside the exception class
(TRY003)
57-57: Avoid specifying long messages outside the exception class
(TRY003)
scripts/teigaku_genzei/json_qa.py
22-22: String contains ambiguous ( (FULLWIDTH LEFT PARENTHESIS). Did you mean ( (LEFT PARENTHESIS)?
(RUF001)
22-22: String contains ambiguous ) (FULLWIDTH RIGHT PARENTHESIS). Did you mean ) (RIGHT PARENTHESIS)?
(RUF001)
33-33: Comment contains ambiguous A (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?
(RUF003)
35-35: Comment contains ambiguous 1 (FULLWIDTH DIGIT ONE). Did you mean 1 (DIGIT ONE)?
(RUF003)
49-49: Consider [*title_pos_list[1:], len(text_list)] instead of concatenation
Replace with [*title_pos_list[1:], len(text_list)]
(RUF005)
51-51: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
scripts/teigaku_genzei/make_icl.py
77-77: String contains ambiguous A (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?
(RUF001)
scripts/teigaku_genzei/json_util.py
7-7: String contains ambiguous 0 (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?
(RUF001)
7-7: String contains ambiguous 9 (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?
(RUF001)
7-7: String contains ambiguous - (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)?
(RUF001)
7-7: String contains ambiguous 0 (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?
(RUF001)
7-7: String contains ambiguous 9 (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?
(RUF001)
13-13: String contains ambiguous 0 (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?
(RUF001)
13-13: String contains ambiguous 9 (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?
(RUF001)
13-13: String contains ambiguous - (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)?
(RUF001)
13-13: String contains ambiguous 0 (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?
(RUF001)
13-13: String contains ambiguous 9 (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?
(RUF001)
19-19: String contains ambiguous A (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?
(RUF001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Summary
🔇 Additional comments (3)
scripts/teigaku_genzei/json_qa.py (1)
40-56: LGTM with the zip strict fix. The extraction logic correctly identifies title/question/answer blocks using the utility functions from `json_util`. The position list construction is clear, and the function handles the Japanese document structure appropriately.
scripts/teigaku_genzei/make_seed.py (1)
33-39: LGTM! The function properly guards against empty DataFrames and uses positional indexing (`iloc`) to avoid label-based indexing surprises. The step calculation ensures a minimum of 1 to prevent division issues.
scripts/teigaku_genzei/make_icl.py (1)
43-52: LGTM! The function correctly handles QA combination generation with proper edge cases: full combinations when sufficient items exist, padding with repetition when fewer items are available, and empty list for no items.
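A compact sketch of that edge-case handling (an illustration of the behavior described, not the function's actual body):

```python
from itertools import combinations

def qa_combinations(items: list, k: int = 3) -> list:
    if len(items) >= k:
        # Enough items: every k-sized combination.
        return list(combinations(items, k))
    if items:
        # Too few items: one tuple padded by cycling what we have.
        return [tuple(items[i % len(items)] for i in range(k))]
    return []  # no items at all

print(qa_combinations([1, 2, 3, 4]))  # 4 combinations of size 3
print(qa_combinations([1, 2]))        # [(1, 2, 1)]
print(qa_combinations([]))            # []
```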
Actionable comments posted: 0
♻️ Duplicate comments (1)
scripts/teigaku_genzei/teigaku_genzei.ipynb (1)
83-86: OCR language still configured for Spanish instead of Japanese. This issue was previously flagged: the OCR language is set to `["es"]` (Spanish), but the notebook processes Japanese documents. This will result in poor OCR quality and likely failure to extract Japanese text correctly. Apply this diff:
```diff
 "ocr": {
     "enabled": True,  # Enable/disable OCR processing
-    "languages": ["es"],  # List of language codes (e.g., eng, fra, deu)
+    "languages": ["jpn"],  # List of language codes for Japanese
 },
```
🧹 Nitpick comments (2)
scripts/teigaku_genzei/teigaku_genzei.ipynb (2)
217-219: Use `logger.exception` to preserve stack traces. The current error logging loses stack trace information. When debugging conversion failures, the full traceback is valuable.
Apply this diff:
```diff
 except Exception as e:
-    logger.error(f"Failed to export {format_name} format: {str(e)}")
+    logger.exception(f"Failed to export {format_name} format")
     raise
```
Note: `logger.exception` automatically includes the exception message and stack trace, so `str(e)` is redundant.
258-260: Use `logger.exception` to preserve stack traces. Similar to the export_document function, using `logger.exception` here will capture the full stack trace, making it easier to diagnose document conversion failures.
Apply this diff:
```diff
 except Exception as e:
     failure_count += 1
-    logger.error(f"Failed to process {file_path}: {str(e)}")
+    logger.exception(f"Failed to process {file_path}")
     continue
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/teigaku_genzei/teigaku_genzei.ipynb (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
scripts/teigaku_genzei/teigaku_genzei.ipynb
125-125: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
125-125: Use explicit conversion flag
Replace with conversion flag
(RUF010)
166-166: Do not catch blind exception: Exception
(BLE001)
168-168: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
168-168: Use explicit conversion flag
Replace with conversion flag
(RUF010)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Summary
Actionable comments posted: 2
🧹 Nitpick comments (2)
scripts/teigaku_genzei/teigaku_genzei.ipynb (2)
206-220: Use `logging.exception` for better error diagnostics. The error handling logs the exception message with `logging.error(f"...{str(e)}")` before re-raising. Using `logging.exception` automatically includes the full traceback, making debugging easier.
Apply this diff:
```diff
 except Exception as e:
-    logger.error(f"Failed to export {format_name} format: {str(e)}")
+    logger.exception(f"Failed to export {format_name} format")
     raise
```
Based on static analysis hints (Ruff TRY400).
249-262: Use `logging.exception` for better error diagnostics. Similar to the previous error handler, using `logging.exception` here will automatically include the traceback for failed document conversions.
Apply this diff:
```diff
 except Exception as e:
     failure_count += 1
-    logger.error(f"Failed to process {file_path}: {str(e)}")
+    logger.exception(f"Failed to process {file_path}")
     continue
```
Based on static analysis hints (Ruff TRY400).
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/teigaku_genzei/teigaku_genzei.ipynb (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
scripts/teigaku_genzei/teigaku_genzei.ipynb
126-126: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
126-126: Use explicit conversion flag
Replace with conversion flag
(RUF010)
167-167: Do not catch blind exception: Exception
(BLE001)
169-169: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
169-169: Use explicit conversion flag
Replace with conversion flag
(RUF010)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Summary
| "- Python >= 3.10\n", | ||
| "- sdg_hub >= 0.7.1\n", |
Clarify the required sdg_hub version.
The notebook states sdg_hub >= 0.7.1 is required, but the PR description mentions testing with sdg_hub v0.2.x with Japanese support (PRs #457, #458). This version discrepancy may confuse users about which version to install.
Please update line 22 to reflect the actual tested and required version, ensuring consistency with the PR description and README.md.
🤖 Prompt for AI Agents
In scripts/teigaku_genzei/teigaku_genzei.ipynb around lines 21–22, the
requirements list incorrectly states "sdg_hub >= 0.7.1"; update line 22 to the
actual tested and required version by replacing it with "sdg_hub >= 0.2.0
(tested with v0.2.x including Japanese support PRs #457, #458)" so the notebook
matches the PR description and README; ensure the change is saved in the
notebook JSON (requirements cell/metadata) and verify README.md references
remain consistent.
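A sketch of the corrected requirement lines (the version is taken from the PR description, so confirm it before committing):

```text
- Python >= 3.10
- sdg_hub >= 0.2.0  # tested with v0.2.x incl. Japanese support (PRs #457, #458)
```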
| " \"ocr\": {\n", | ||
| " \"enabled\": True, # Enable/disable OCR processing\n", | ||
| " \"languages\": [\"es\"], # List of language codes (e.g., eng, fra, deu)\n", | ||
| " # Note that the OCR language has no effect in this example since we discard information from images.\n", | ||
| " },\n", |
🧩 Analysis chain
🏁 Script executed:

```sh
cd scripts/teigaku_genzei && ls -la | head -20
```

Repository: opendatahub-io/data-processing
Length of output: 1367

🏁 Script executed:

```sh
file scripts/teigaku_genzei/teigaku_genzei.ipynb
```

Repository: opendatahub-io/data-processing
Length of output: 118

🏁 Script executed:

```sh
# Get the size of the notebook to determine how to read it
wc -l scripts/teigaku_genzei/teigaku_genzei.ipynb
```

Repository: opendatahub-io/data-processing
Length of output: 121

🏁 Script executed:

```sh
# Read lines around 83-87 to see the OCR configuration context
sed -n '61,110p' scripts/teigaku_genzei/teigaku_genzei.ipynb
```

Repository: opendatahub-io/data-processing
Length of output: 2281

🏁 Script executed:

```sh
# Also search for where OCR is actually used/invoked in the notebook
rg -i "ocr" scripts/teigaku_genzei/teigaku_genzei.ipynb -A 2 -B 2
```

Repository: opendatahub-io/data-processing
Length of output: 1137

🏁 Script executed:

```sh
# Check for docling or any image/PDF processing logic
rg -i "docling\|extract_images\|image" scripts/teigaku_genzei/teigaku_genzei.ipynb -A 2 -B 2 | head -50
```

Repository: opendatahub-io/data-processing
Length of output: 56

🏁 Script executed:

```sh
# Let me verify if images are actually processed in the pipeline
# by looking for image-related extraction or processing
rg -i "image\|discard" scripts/teigaku_genzei/teigaku_genzei.ipynb -B 2 -A 2 | head -60
```

Repository: opendatahub-io/data-processing
Length of output: 56

🏁 Script executed:

```sh
# Check if there's any postprocessing that explicitly removes image content
rg -i "extract.*text\|export_to_text\|markdown" scripts/teigaku_genzei/teigaku_genzei.ipynb -B 3 -A 3 | head -80
```

Repository: opendatahub-io/data-processing
Length of output: 56
Update OCR language from "es" to "jpn" for Japanese document processing.
The OCR language is set to ["es"] (Spanish) but should be ["jpn"] for Japanese documents. Although the comment on line 86 states "OCR language has no effect in this example since we discard information from images," the configuration is still actively used: pipeline_options.ocr_options.lang is set to the configured language value, and pipeline_options.do_ocr is enabled. Even if image content is ultimately discarded from the output, the OCR processing runs with the wrong language setting, which is inefficient and semantically incorrect for this Japanese tax document pipeline.
🤖 Prompt for AI Agents
In scripts/teigaku_genzei/teigaku_genzei.ipynb around lines 83 to 87, the OCR
languages config is set to ["es"] (Spanish) but this pipeline processes Japanese
documents; change the value to ["jpn"] so OCR runs with Japanese language
setting. Update the "languages" array to ["jpn"] and ensure any references that
read pipeline_options.ocr_options.lang will now receive "jpn" so OCR runs
correctly for Japanese input.
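For reference, a minimal sketch of wiring the language through docling's pipeline options (the config-dict shape mirrors the notebook; the exact OCR language code is backend-dependent, e.g. Tesseract-style backends use "jpn" while EasyOCR uses "ja", so treat the value as an assumption to verify):

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

config = {"ocr": {"enabled": True, "languages": ["jpn"]}}

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = config["ocr"]["enabled"]
pipeline_options.ocr_options.lang = config["ocr"]["languages"]  # the language reaches the OCR engine here

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```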
…ument processing. Updated language option as a response.
Actionable comments posted: 4
🧹 Nitpick comments (2)
scripts/teigaku_genzei/teigaku_genzei.ipynb (2)
221-223: Use `logger.exception` for better error diagnostics. Using `logger.exception` instead of `logger.error(f"...{str(e)}")` automatically includes the traceback, which aids debugging.

```diff
 except Exception as e:
-    logger.error(f"Failed to export {format_name} format: {str(e)}")
+    logger.exception(f"Failed to export {format_name} format")
     raise
```
262-265: Use `logger.exception` for traceback preservation. Similar to the earlier exception handler, using `logger.exception` preserves the full traceback for debugging failed conversions.

```diff
 except Exception as e:
     failure_count += 1
-    logger.error(f"Failed to process {file_path}: {str(e)}")
+    logger.exception(f"Failed to process {file_path}")
     continue
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/teigaku_genzei/teigaku_genzei.ipynb (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
scripts/teigaku_genzei/teigaku_genzei.ipynb
129-129: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
129-129: Use explicit conversion flag
Replace with conversion flag
(RUF010)
170-170: Do not catch blind exception: Exception
(BLE001)
172-172: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
172-172: Use explicit conversion flag
Replace with conversion flag
(RUF010)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Summary
🔇 Additional comments (3)
scripts/teigaku_genzei/teigaku_genzei.ipynb (3)
61-69: LGTM on configuration variables. The path configuration and source name lists are clearly defined. The separation of ICL and seed sources (`icl_source_name` vs `seed_source_name`) from different PDFs is a good practice for avoiding data leakage between ICL examples and seed contexts.
177-195: LGTM on pipeline configuration. The `setup_pipeline_options` function properly maps configuration dictionary values to `PdfPipelineOptions`. The defensive `isinstance` check on line 184 before accessing `table_structure_options.do_cell_matching` is a good practice.
225-273: LGTM on document conversion workflow. The `export_document_new_docling` function has good structure with:

- Early return when no PDFs found
- Proper directory creation with `mkdir(parents=True, exist_ok=True)`
- Success/failure counting and timing metrics
- Graceful handling of individual file failures while continuing to process remaining files
Actionable comments posted: 0
🧹 Nitpick comments (4)
scripts/teigaku_genzei/json_util.py (4)
6-7: Anchor/clarify "title" matching semantics (prefix vs full-line) and DRY the regex. Right now `re.match` ensures "starts with", but "1-2abc" will return true / match. If titles must be exactly N-N, consider `^...$` (or use `fullmatch`). Also `is_title()` and `get_title_match()` duplicate the same pattern—worth centralizing.

```diff
+_TITLE_RE = re.compile(r"([0-9０-９]+)[-ー]([0-9０-９]+)")
+
 def is_title(text: str) -> bool:
-    return re.match(r"[0-9０-９]+[-ー][0-9０-９]+", text) is not None
+    return _TITLE_RE.fullmatch(text) is not None
 ...
 def get_title_match(text: str) -> re.Match | None:
-    return re.match(r"([0-9０-９]+)[-ー]([0-9０-９]+)", text)
+    return _TITLE_RE.match(text)  # or fullmatch(), depending on desired behavior
```

Also applies to: 12-13
15-20: Prefer `startswith()` for fixed prefixes (and decide about whitespace). `is_qhead()`/`is_ahead()` are simple prefix checks; `text.startswith("問")` and `text.startswith("[Ａ]")` are clearer and faster. If inputs may have leading whitespace, consider `.lstrip()` (or explicitly document they must not).

```diff
 def is_qhead(text: str) -> bool:
-    return re.match(r"^問", text) is not None
+    return text.startswith("問")
 ...
 def is_ahead(text: str) -> bool:
-    return re.match(r"^\[Ａ\]", text) is not None
+    return text.startswith("[Ａ]")
```
21-24: Add type hints for `read_json_file` (and accept PathLike). This is a utility used across scripts; typing improves callers and testability.

```diff
+from pathlib import Path
+from os import PathLike
 ...
-def read_json_file(file_path):
+def read_json_file(file_path: str | PathLike[str] | Path) -> Any:
     with open(file_path, 'r', encoding='utf-8') as file:
         data = json.load(file)
     return data
```
6-20: Decide how you want to handle Ruff RUF001 (full-width characters in literals). If full-width digits/Ａ are intentional (Japanese OCR/text), either keep as-is and add a targeted `# noqa: RUF001` with a short comment, or switch to a Unicode-aware approach (e.g., normalize with `unicodedata.normalize("NFKC", text)` and use ASCII patterns).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
scripts/teigaku_genzei/json_util.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
scripts/teigaku_genzei/json_util.py
7-7: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?
(RUF001)
7-7: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?
(RUF001)
7-7: String contains ambiguous － (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)?
(RUF001)
7-7: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?
(RUF001)
7-7: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?
(RUF001)
13-13: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?
(RUF001)
13-13: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?
(RUF001)
13-13: String contains ambiguous － (FULLWIDTH HYPHEN-MINUS). Did you mean - (HYPHEN-MINUS)?
(RUF001)
13-13: String contains ambiguous ０ (FULLWIDTH DIGIT ZERO). Did you mean 0 (DIGIT ZERO)?
(RUF001)
13-13: String contains ambiguous ９ (FULLWIDTH DIGIT NINE). Did you mean 9 (DIGIT NINE)?
(RUF001)
19-19: String contains ambiguous Ａ (FULLWIDTH LATIN CAPITAL LETTER A). Did you mean A (LATIN CAPITAL LETTER A)?
(RUF001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Summary
🔇 Additional comments (1)
scripts/teigaku_genzei/json_util.py (1)
9-10: `is_toc()` defensive access looks good; consider precompiling and/or broadening the path match if needed. The `.get(...).get(...)` avoids the prior `KeyError` risk. If this runs in tight loops, precompile the regex; also confirm whether `#/pictures` (no trailing slash) should match—the current pattern requires `#/pictures/`.
…ook in this example.
Actionable comments posted: 8
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
♻️ Duplicate comments (8)
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/llama3_1_8b.json (1)
1-13: Looks consistent with Unitxt local_catalog schema; keep provider fields as Unitxt expects. Given these local_catalog files are Unitxt configs, `max_tokens`/`provider`/`model` should remain in Unitxt's schema (not provider-specific variants). Please just verify `evaluator_name: "LLAMA3_1_8B"` is unique/consistently referenced across the catalog, and that Unitxt in your tested versions accepts `"provider": "watsonx"` with `"model": "llama-3-1-8b-instruct"`. Based on learnings, Unitxt performs the normalization.

notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/mixtral_large.json (1)
8-12: Clarify/verify `evaluator_name` vs `model` naming to avoid future confusion. Given prior discussion, I'm assuming `"model": "mistral-large-instruct"` is a valid Unitxt internal identifier even under `provider: watsonx`, but pairing it with `"evaluator_name": "MIXTRAL_LARGE"` is still easy to misread later. Consider adding a short note in the example README (or nearby docs) stating that model strings are Unitxt identifiers and may not match the evaluator/file naming.

In Unitxt, when using `cross_provider_inference_engine` with `provider: watsonx`, what does the model identifier `mistral-large-instruct` map to, and is it the intended identifier to use for a "Mixtral Large" judge configuration?

notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/o1_mini.json (1)
1-13: Keep Unitxt’s provider id as-is; please just verifyo1-minisupport in the cross-provider enginePer earlier clarification, the
"open-ai"provider id is a Unitxt convention and should not be changed. The remaining risk is whethercross_provider_inference_engineacceptsmodel: "o1-mini"and mapsmax_tokens/seed/temperaturecorrectly for that provider.In Unitxt’s `cross_provider_inference_engine`, is `provider: "open-ai"` with `model: "o1-mini"` supported, and how are `max_tokens`, `seed`, and `temperature` handled for that provider?notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/gpt_4_1_mini.json (1)
1-13: LGTM; keep Unitxt-specific provider/model identifiers as-is.

notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/replicate/llama4_scout.json (1)
1-13: LGTM; evaluator naming vs model naming is intentionally Unitxt-defined.

notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/gpt_4o.json (1)
1-16: Looks good and aligns with the intended Unitxt local_catalog pattern for preset endpoints.

notebooks/teigaku_genzei/json_qa.py (1)
40-56: Avoid silent QA loss: use `zip(..., strict=True)` (and fail loudly on parse drift). (Line 51)

```diff
-    for (ti, qhi, ahi, ati) in zip(title_pos_list, qhead_pos_list, ahead_pos_list, atail_pos_list):
+    for (ti, qhi, ahi, ati) in zip(title_pos_list, qhead_pos_list, ahead_pos_list, atail_pos_list, strict=True):
```

notebooks/teigaku_genzei/llm_as_judge_direct_positional_bias.py (1)
108-120: Registration wiring: base_url override + evaluator_name mismatch (already raised before).
- Passing `base_url=...` (Line 114) largely defeats the "set base_url when None" behavior in `RITSInferenceEngineFixed.prepare_engine()` and reintroduces URL-construction coupling.
- `evaluator_name="LLAMA3_4_MAVERICK"` (Line 117) looks inconsistent with the `...llama4_maverick` metric name and the "llama4" model module.

For Unitxt metrics registration, is evaluator_name used as a stable identifier (e.g., referenced by config/catalog), and should it match the catalog/metric naming convention exactly?
🟠 Major comments (15)
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama4_scout.json-1-13 (1)
1-13: Fix evaluator_name: change `"LLAMA3_4_SCOUT"` to `"LLAMA4_SCOUT"`. The `evaluator_name` has an incorrect version prefix. The model is `"llama-4-scout"` (Llama 4) but the evaluator is named `LLAMA3_4_SCOUT` (suggesting Llama 3). This naming inconsistency violates the established pattern (e.g., `LLAMA3_1_70B` for `llama-3-1-70b-instruct`) and will cause confusion in evaluation outputs and dashboards. The evaluator_name should be `"LLAMA4_SCOUT"` to match the model version. This issue affects all provider implementations of this model.

notebooks/teigaku_genzei/test_qa.py-66-88 (1)
66-88: `--test_file`/`--out_result_file` should be required (current defaults will crash). Right now `pd.read_csv(None, ...)`/`to_csv(None, ...)` will fail with a confusing error.

```diff
 def config():
     parser = argparse.ArgumentParser(description='test a model with a zero-shot QA task')
     parser.add_argument('--' + ARG_MODEL_PATH, type=str, default="Qwen/Qwen1.5-0.5B-Chat")
-    parser.add_argument('--' + ARG_TEST_FILE, type=str, default=None)
+    parser.add_argument('--' + ARG_TEST_FILE, type=str, required=True)
     ...
-    parser.add_argument('--' + ARG_OUT_RESULT_FILE, type=str, default=None)
+    parser.add_argument('--' + ARG_OUT_RESULT_FILE, type=str, required=True)
```

Also applies to: 178-197
notebooks/teigaku_genzei/test_qa.py-101-105 (1)
101-105: Write JSON with explicit UTF-8 encoding (JP text). This avoids platform-default encoding surprises.

```diff
 def write_json_file(file_path: str, data: Any):
-    with open(file_path, 'w') as file:
+    with open(file_path, 'w', encoding="utf-8") as file:
         json.dump(data, file, ensure_ascii=False)
-    return
```

notebooks/teigaku_genzei/remove_newline_from_csv.py-11-27 (1)
11-27: Make newline cleanup robust for NaN / non-string object columns. `df[col].str...` can behave poorly if an object column has non-strings; also make the replace behavior explicit with `regex=False`.

```diff
-    col_sr_dict = {
-        col: df[col].str.replace("\n", " ") if df[col].dtype == "object" else df[col] for col in df.columns
-    }
+    col_sr_dict = {}
+    for col in df.columns:
+        if df[col].dtype == "object":
+            s = df[col]
+            # only operate on actual strings; keep NaN/None as-is
+            col_sr_dict[col] = s.where(~s.map(lambda v: isinstance(v, str)), s.str.replace("\n", " ", regex=False))
+        else:
+            col_sr_dict[col] = df[col]
```

notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/azure_openai/gpt_4_1_mini.json-1-16 (1)
1-16: Fix Azure credential structure and remove org-specific hostnames. The `api_base` should contain only the base Azure endpoint (e.g., `"https://YOUR_RESOURCE_NAME.openai.azure.com/"`), not the full request URL. The `api_version` must be a separate credential field. Additionally, the hard-coded `"eteopenai.azure-api.net"` hostname exposes internal infrastructure and breaks portability — replace with a placeholder or document that users must substitute their own Azure resource name. Expected structure:

```json
"credentials": {
    "api_base": "https://YOUR_AZURE_RESOURCE_NAME.openai.azure.com/",
    "api_key": "YOUR_API_KEY",
    "api_type": "azure",
    "api_version": "2024-08-01-preview"
}
```

notebooks/teigaku_genzei/test_single.sh-3-12 (1)
3-12: Quote inputs + prefer `$(...)` to avoid word-splitting / globbing bugs. `MODEL_TEST_PATH="$1"` is good, but later expansions (`basename ${MODEL_TEST_PATH}`) are unquoted; if the path contains spaces/globs, behavior breaks. Also prefer `$(...)` over backticks.

```diff
-MODEL_TEST_PATH="$1" # /proj/instructlab/checkpoints/${MODEL_TEST_NAME}
-MODEL_TEST_NAME=`basename ${MODEL_TEST_PATH}`
+MODEL_TEST_PATH="$1" # /proj/instructlab/checkpoints/${MODEL_TEST_NAME}
+MODEL_TEST_NAME="$(basename "$MODEL_TEST_PATH")"
```

notebooks/teigaku_genzei/test_single.sh-23-28 (1)
23-28: Make output dir creation idempotent + quote. `mkdir ${OUT_DIR}` will fail if rerun and will mis-handle spaces.

```diff
-OUT_DIR=./tmp/single_${JUDGE_MODULE}__${MODEL_TEST_NAME}
-
-mkdir ${OUT_DIR}
+OUT_DIR="./tmp/single_${JUDGE_MODULE}__${MODEL_TEST_NAME}"
+
+mkdir -p "$OUT_DIR"
```

notebooks/teigaku_genzei/test_single.sh-29-47 (1)
29-47: Quote vars in the loop / Python args; consider failing fast. Unquoted `${TEST_DIR}`, `${TEST}`, `${MODEL_TEST_PATH}` etc. can break on spaces and can also surprise with globbing. Also consider adding `set -eu` at top for safer scripts (keeping `/bin/sh` portability in mind).

```diff
 for TEST in ${TEST_NAMES}; do
-    TEST_FILE=${TEST_DIR}/${TEST}.csv
-    OUT_RESULT_PREFIX=${OUT_DIR}/result_${TEST}
-    OUT_GLOBAL_PREFIX=${OUT_DIR}/global_${TEST}
+    TEST_FILE="${TEST_DIR}/${TEST}.csv"
+    OUT_RESULT_PREFIX="${OUT_DIR}/result_${TEST}"
+    OUT_GLOBAL_PREFIX="${OUT_DIR}/global_${TEST}"
     python ./test_qa.py \
-        --test_file ${TEST_FILE} \
-        --out_result_file ${OUT_RESULT_PREFIX}_after.csv \
-        --out_global_score_file ${OUT_GLOBAL_PREFIX}_after.csv \
-        --judge_module ${JUDGE_MODULE} \
-        --data_classification ${DATA_CLASSIFICATION} \
-        --model_path ${MODEL_TEST_PATH}
+        --test_file "$TEST_FILE" \
+        --out_result_file "${OUT_RESULT_PREFIX}_after.csv" \
+        --out_global_score_file "${OUT_GLOBAL_PREFIX}_after.csv" \
+        --judge_module "$JUDGE_MODULE" \
+        --data_classification "$DATA_CLASSIFICATION" \
+        --model_path "$MODEL_TEST_PATH"
 done
```

notebooks/teigaku_genzei/llm_as_judge_direct_provider_importer.py-1-40 (1)
1-40: Harden importer: validate dirs, avoid builtin shadowing, use `sys.exit`, and handle JSON/IO errors.

```diff
 import os
 import json
 import logging
+import sys

 import catalog_util

 if __name__ == "__main__":
     logging.basicConfig(level=logging.INFO)
     system_catalog_path = catalog_util.get_system_catalog_path()
     if system_catalog_path is None:
         logging.error("No system catalog")
-        exit(1)
+        sys.exit(1)
     local_catalog_path = catalog_util.LOCAL_CATALOG_PATH
     base_source_path = os.path.join(system_catalog_path, "metrics/llm_as_judge/direct")
     base_target_path = os.path.join(local_catalog_path, "metrics/llm_as_judge/direct_positional_bias")
     exclude_set = {"criteria"}
     logging.info(f"Copying artifacts from {base_source_path} to {base_target_path}")
-    for dir in os.listdir(base_source_path):
-        if dir in exclude_set:
+    if not os.path.isdir(base_source_path):
+        logging.error(f"Missing or non-directory source path: {base_source_path}")
+        sys.exit(1)
+
+    for provider_dir in os.listdir(base_source_path):
+        if provider_dir in exclude_set:
             continue
-        logging.info(f"Processing {dir}")
-        provider_source_path = os.path.join(base_source_path, dir)
-        provider_target_path = os.path.join(base_target_path, dir)
-        if not os.path.exists(provider_target_path):
-            os.makedirs(provider_target_path)
-        for file in os.listdir(provider_source_path):
-            if file.endswith(".json"):
-                logging.info(f"Processing {file}")
-                source_path = os.path.join(provider_source_path, file)
-                target_path = os.path.join(provider_target_path, file)
-                with open(source_path, "r", encoding="utf-8") as f:
-                    source_artifact = json.load(f)
-                source_artifact["__type__"] = "llm_judge_direct_positional_bias"
-                with open(target_path, "w", encoding="utf-8") as wf:
-                    json.dump(source_artifact, wf, indent=4)
-                logging.info(f"Wrote {target_path}")
+        provider_source_path = os.path.join(base_source_path, provider_dir)
+        if not os.path.isdir(provider_source_path):
+            continue
+
+        logging.info(f"Processing {provider_dir}")
+        provider_target_path = os.path.join(base_target_path, provider_dir)
+        os.makedirs(provider_target_path, exist_ok=True)
+
+        for filename in os.listdir(provider_source_path):
+            if not filename.endswith(".json"):
+                continue
+            source_path = os.path.join(provider_source_path, filename)
+            if not os.path.isfile(source_path):
+                continue
+
+            logging.info(f"Processing {filename}")
+            target_path = os.path.join(provider_target_path, filename)
+            try:
+                with open(source_path, "r", encoding="utf-8") as rf:
+                    source_artifact = json.load(rf)
+                source_artifact["__type__"] = "llm_judge_direct_positional_bias"
+                with open(target_path, "w", encoding="utf-8") as wf:
+                    json.dump(source_artifact, wf, indent=4, ensure_ascii=False)
+                logging.info(f"Wrote {target_path}")
+            except (OSError, json.JSONDecodeError) as e:
+                logging.exception(f"Failed processing {source_path}: {e}")
+                sys.exit(1)
```
1-22: Remove import-time side effects: stdout print and function call.The module executes
get_system_catalog_path()at import time (line 22), which prints to stdout and causes side effects when imported by other modules (test_qa.py,llm_as_judge_direct_provider_importer.py). This pollutes output during import and makes the module's behavior unpredictable.Additionally, the function name
get_system_catalog_path()is misleading—it returnspaths[-1]fromget_local_catalogs_paths(), treating the last local catalog as the "system" path. Either rename the function to clarify its actual behavior or implement proper system catalog path lookup. Remove theprint(paths)statement and move theget_system_catalog_path()call into anif __name__ == "__main__":block. Also fix PEP 8 spacing:LOCAL_CATALOG_PATH = "local_catalog".notebooks/teigaku_genzei/seed_jsonl_2_qna_yaml.py-11-17 (1)
11-17: PotentialKeyErrorif seed lacks expected keys.The code assumes each seed object contains
icl_query_1throughicl_query_3and correspondingicl_response_1throughicl_response_3. Missing keys will raiseKeyErrorat runtime.Consider adding validation or using
.get()with defaults:qa_obj_list = [ { - "question": seed[f"icl_query_{i+1}"], - "answer": seed[f"icl_response_{i+1}"], + "question": seed.get(f"icl_query_{i+1}", ""), + "answer": seed.get(f"icl_response_{i+1}", ""), } for i in range(3) ]notebooks/teigaku_genzei/extract_icl_from_seed.py-25-27 (1)
25-27: Potential `IndexError` on empty input file. Accessing `data_list[0]` without checking if the list is empty will raise an `IndexError` if the input JSONL file is empty. Apply this diff:

```diff
 print(len(data_list))
-print(data_list[0])
+if data_list:
+    print(data_list[0])
 obj_list = cast(list[dict[str, str]], data_list)
```

notebooks/teigaku_genzei/make_icl.py-96-111 (1)
96-111: Ensure parent directory exists before writing output JSONL. `jsonl_util.write_jsonl_file(output_icl_path, ...)` will fail if the parent directory doesn't exist.

```diff
 def main() -> None:
 @@
     output_icl_path = args[ARG_OUTPUT_ICL_FILE]
 @@
     icl_list = make_icl(qa_df, context_df, short_context)
+    import os
+    parent = os.path.dirname(output_icl_path)
+    if parent:
+        os.makedirs(parent, exist_ok=True)
     jsonl_util.write_jsonl_file(output_icl_path, icl_list)
```

notebooks/teigaku_genzei/make_seed.py-42-79 (1)
42-79: Cartesian mode: guard empty context/ICL (`pd.concat([])` crash). If `context_df` is empty, `tmp_list` becomes `[]` and `pd.concat(tmp_list, ...)` (Line 77) raises `ValueError: No objects to concatenate`. Slide mode already checks empties; cartesian should too.

```diff
 elif opt_join_method == OPT_JOIN_METHOD_CARTESIAN:
+    if len(context_df) == 0:
+        raise ValueError("No contexts loaded.")
+    if len(icl_df) == 0:
+        raise ValueError("No ICL samples loaded.")
     tmp_list = [
         pd.concat([
             icl_df,
             pd.DataFrame(sr.to_dict(), index=icl_df.index),
         ], axis=1)
         for (idx, sr) in context_df.iterrows()
     ]
```

notebooks/teigaku_genzei/llm_as_judge_direct_positional_bias.py-33-90 (1)
33-90: Average score computation uses the wrong key when criteria is provided. When `self.criteria is not None`, the code correctly derives `forward_score_name = self.criteria.name` (line 35–36). However, the average computation at line 86 uses `result.get(self.main_score, ...)` instead of `forward_score_name`. Since Unitxt 1.24.0 emits the score under the criteria name when a single criterion is provided, the forward score will not be found under `self.main_score`, causing the average to be computed as `(fallback_score + pb_score) / 2` even when a real forward score exists.

Use the same key you derived for the forward score:

```diff
+    def get_forward_score(result: Dict[str, Any]) -> float:
+        return cast(float, result.get(forward_score_name, result.get(self.main_score, default_pb_score)))
+
     pb_scores = [
-        get_pb_score(result, self.criteria, result.get(self.main_score, default_pb_score), default_pb_score) for result in results
+        get_pb_score(result, self.criteria, get_forward_score(result), default_pb_score) for result in results
     ]
     pb_results = [
         result | {
             backward_score_name: pb_score,
-            average_score_name: (result.get(self.main_score, default_pb_score) + pb_score) / 2,
+            average_score_name: (get_forward_score(result) + pb_score) / 2,
         } for result, pb_score in zip(results, pb_scores, strict=True)
     ]
```

Committable suggestion skipped: line range outside the PR's diff.
🟡 Minor comments (5)
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/granite3_3_8b.json-1-13 (1)
1-13: Add a trailing newline to the JSON file. JSON files should end with a newline character; end the file with one after the closing brace. The model identifier `granite-3-3-8b-instruct` and schema structure are consistent with other watsonx configurations in this local catalog directory and align with Unitxt's documented CrossProviderInferenceEngine schema.

```diff
 {
     "__type__": "llm_judge_direct_positional_bias",
     "inference_engine": {
         "__type__": "cross_provider_inference_engine",
         "max_tokens": 1024,
         "seed": 42,
         "temperature": 0,
         "provider": "watsonx",
         "model": "granite-3-3-8b-instruct"
     },
     "evaluator_name": "GRANITE3_3_8B",
     "generate_summaries": false
-} \ No newline at end of file
+}
```
1-13: Tighten dependency strategy for reproducible evals and remove redundantsacrebleuentry
- The plain
sacrebleuline is redundant;sacrebleu[ja]already includes the base package plus Japanese tokenization support. Remove thesacrebleuentry.- Consider pinning (or constraining) the currently-unpinned packages (
litellm,accelerate,nltk,bert_score,rouge_score,python-dotenv) to ensure reproducible evaluation results, since upstream changes can alter metrics and outputs.notebooks/teigaku_genzei/jsonl_util.py-7-17 (1)
7-17: Make JSONL reading resilient to blank lines (and optionally stream).def read_jsonl_file(file_path: str) -> list[Any]: with open(file_path, 'r', encoding="utf-8") as file: - data_list = [ - json.loads(line) for line in file - ] - return data_list + data_list: list[Any] = [] + for line in file: + line = line.strip() + if not line: + continue + data_list.append(json.loads(line)) + return data_listnotebooks/teigaku_genzei/README.md-1-254 (1)
1-254: Address markdownlint doc hygiene (bare URLs, list indentation, fenced code languages).
This README is valuable, but the current formatting will likely fail markdownlint in CI (MD034/MD007/MD040).
- Convert bare URLs into markdown links.
- Fix list indentation to the expected levels.
- Add languages to fenced blocks (e.g.,
sh,text).notebooks/teigaku_genzei/qna_yaml_2_qa_csv.py-1-81 (1)
1-81: Tighten CSV conversion: remove dead code, fix Title indexing, validate YAML type, and use UTF‑8.-import sys -import yaml -import pandas as pd +import sys +from typing import Any + +import pandas as pd +import yaml def main(): @@ - try: - # load YAML file - with open(input_file, "r") as file: - data = yaml.safe_load(file) + try: + # load YAML file + with open(input_file, "r", encoding="utf-8") as file: + data: Any = yaml.safe_load(file) + + if not isinstance(data, dict): + raise ValueError("YAML root must be a mapping/object") # Validate structure if "seed_examples" not in data: raise ValueError("YAML file missing 'seed_examples' key") @@ qa_list = [ { - "Title": f"{isample + 1}-{iqa}", + "Title": f"{isample + 1}-{iqa + 1}", "Question": qa["question"], "Answer": qa["answer"] } for (isample, sample) in enumerate(data["seed_examples"]) for (iqa, qa) in enumerate(sample["questions_and_answers"]) ] @@ - except Exception as e: - print(f"Error: {e}", file=sys.stderr) - sys.exit(1) + # Let unexpected exceptions surface with a traceback (better debuggability).Also: since this depends on third-party libs (
yaml,pandas), please confirm the project documents/locks the expected versions (or at least lists them) to keep the example reproducible.
🧹 Nitpick comments (23)
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/watsonx/granite3_2_8b.json (1)
1-13: Add trailing newline to the file. The file is missing a trailing newline, which is a common convention in source code repositories for consistency and POSIX compliance.

```diff
 {
     "__type__": "llm_judge_direct_positional_bias",
     "inference_engine": {
         "__type__": "cross_provider_inference_engine",
         "max_tokens": 1024,
         "seed": 42,
         "temperature": 0,
         "provider": "watsonx",
         "model": "granite-3-2-8b-instruct"
     },
     "evaluator_name": "GRANITE3_2_8B",
     "generate_summaries": false
-} \ No newline at end of file
+}
```
11-12: Consider aligning `evaluator_name` with existing catalog naming conventions. If other Together configs use a different casing/format (e.g., `LLAMA3_1_8B_INSTRUCT` or including the provider), aligning helps downstream aggregation/reporting scripts avoid special-casing.

notebooks/teigaku_genzei/test_single_vllm.sh (1)
3-47: Quote variable expansions (model paths w/ spaces) and prefer$(...).
This is mostly for robustness whenMODEL_TEST_PATHor output dirs contain spaces/special chars.- MODEL_TEST_PATH="$1" # /proj/instructlab/checkpoints/${MODEL_TEST_NAME} - MODEL_TEST_NAME=`basename ${MODEL_TEST_PATH}` + MODEL_TEST_PATH="$1" # /proj/instructlab/checkpoints/${MODEL_TEST_NAME} + MODEL_TEST_NAME="$(basename "$MODEL_TEST_PATH")" ... -mkdir -p ${OUT_DIR} +mkdir -p "$OUT_DIR" ... - TEST_FILE=${TEST_DIR}/${TEST}.csv - OUT_RESULT_PREFIX=${OUT_DIR}/result_${TEST} - OUT_GLOBAL_PREFIX=${OUT_DIR}/global_${TEST} + TEST_FILE="${TEST_DIR}/${TEST}.csv" + OUT_RESULT_PREFIX="${OUT_DIR}/result_${TEST}" + OUT_GLOBAL_PREFIX="${OUT_DIR}/global_${TEST}" ... - python ./test_qa.py \ - --test_file ${TEST_FILE} \ - --out_result_file ${OUT_RESULT_PREFIX}_after.csv \ - --out_global_score_file ${OUT_GLOBAL_PREFIX}_after.csv \ - --judge_module ${JUDGE_MODULE} \ - --server_type ${SERVER_TYPE} \ - --api_key ${SERVER_API_KEY} \ - --data_classification ${DATA_CLASSIFICATION} \ - --model_path ${MODEL_TEST_PATH} + python ./test_qa.py \ + --test_file "$TEST_FILE" \ + --out_result_file "${OUT_RESULT_PREFIX}_after.csv" \ + --out_global_score_file "${OUT_GLOBAL_PREFIX}_after.csv" \ + --judge_module "$JUDGE_MODULE" \ + --server_type "$SERVER_TYPE" \ + --api_key "$SERVER_API_KEY" \ + --data_classification "$DATA_CLASSIFICATION" \ + --model_path "$MODEL_TEST_PATH"notebooks/teigaku_genzei/llm_as_judge_direct_rits_phi.py (1)
5-24: Consider makingoverwrite=Truean explicit opt-in.
If users already have a local artifact with the same name, this will silently replace it.-def register(catalog_path: str) -> None: +def register(catalog_path: str, overwrite: bool = False) -> None: ... - add_to_catalog(laaj_direct_rits_phi4, name="metrics.llm_as_judge.direct.rits.phi_4", overwrite=True, catalog_path=catalog_path) + add_to_catalog(laaj_direct_rits_phi4, name="metrics.llm_as_judge.direct.rits.phi_4", overwrite=overwrite, catalog_path=catalog_path)notebooks/teigaku_genzei/test_qa.py (3)
89-93: Either usetest_languageor remove it (Ruff ARG001).
If the CSV includestarget_language, you could validate it; otherwise drop the arg to avoid confusion.
275-276: Remove debug print (print(type(dataset))) or gate behind a verbose flag.
Keeps CLI output clean for scripts that parse stdout.
224-233: Minor: prefer iterable unpacking over list concatenation (Ruff RUF005).- metrics=[ + metrics=[ "metrics.rouge", "metrics.sacrebleu", # "metrics.bert_score.deberta_large_mnli", - ] + lmaaj_metrics_list, + *lmaaj_metrics_list, + ],notebooks/teigaku_genzei/remove_newline_from_csv.py (1)
19-36: Avoid blindexcept Exception—either let unexpected errors propagate or use a traceback-preserving pattern.
Ruff BLE001 flags catch-all exceptions in CLI tools because they hide KeyboardInterrupt and other critical errors. The current code already catchesFileNotFoundErrorandpd.errors.ParserErrorexplicitly, so the catch-allexcept Exceptionis unnecessary.For encoding/IO issues with
read_csvandto_csv, pandas raisesUnicodeDecodeError(encoding mismatch),OSError-family errors (filesystem), andParserError(malformed data)—all handled above. Either remove the catch-all and let unexpected errors surface with a traceback, or keep it withlogging.exception()to preserve context:- except Exception as e: - print(f"Error: {e}", file=sys.stderr) - sys.exit(1) + # Let unexpected exceptions propagate with full traceback for CLI debuggingOr if you must catch unknown errors, use:
- except Exception as e: - print(f"Error: {e}", file=sys.stderr) - sys.exit(1) + except Exception as e: + import logging + logging.exception("Unexpected error") + sys.exit(1)notebooks/teigaku_genzei/make_icl.sh (1)
1-3: Make the script more robust (fail-fast + quoting + readability).#!/bin/sh +set -eu -python ./make_icl.py --short_context --qa_file qa_table/0024004-072_01.csv --context_file context/0024004-072_01_context.csv --out_icl_file icl/icl.jsonl +python ./make_icl.py \ + --short_context \ + --qa_file "qa_table/0024004-072_01.csv" \ + --context_file "context/0024004-072_01_context.csv" \ + --out_icl_file "icl/icl.jsonl"notebooks/teigaku_genzei/make_context.sh (1)
1-11: Add fail-fast + quoting (no need to mkdir the output dirs here).#!/bin/sh +set -eu DATA_PREFIX_LIST="0024001-021 0024004-072_01" for f in $DATA_PREFIX_LIST; do python ./make_context.py \ --glossary_option appendix \ - --qa_file qa_table/${f}.csv \ - --glossary_file qa_table/${f}_glossary.csv \ - --out_context_file context/${f}_context.csv + --qa_file "qa_table/${f}.csv" \ + --glossary_file "qa_table/${f}_glossary.csv" \ + --out_context_file "context/${f}_context.csv" donenotebooks/teigaku_genzei/extract_qa.sh (1)
1-14: Harden the shell script: fail-fast + quote paths/args.
This will avoid silent partial outputs and word-splitting surprises if paths ever change (Line 7, Line 12-13).#!/bin/sh +set -eu source_dir="docparser" source_qa="0024001-021 0024004-072_01" qa_table_dir="qa_table" for sqa in ${source_qa}; do - sf=${source_dir}/${sqa}.json - qf=${qa_table_dir}/${sqa}.csv - gf=${qa_table_dir}/${sqa}_glossary.csv + sf="${source_dir}/${sqa}.json" + qf="${qa_table_dir}/${sqa}.csv" + gf="${qa_table_dir}/${sqa}_glossary.csv" - python ./json_qa.py ${sf} ${qf} - python ./json_glossary.py ${sf} ${gf} + python ./json_qa.py "${sf}" "${qf}" + python ./json_glossary.py "${sf}" "${gf}" donenotebooks/teigaku_genzei/parser.sh (1)
1-15: Nice fail-fast + path validation; quote vars and ensure output dir exists.
(Line 2, Line 10-13, Line 15)#!/bin/sh set -e source="source/" output="docparser/" @@ fi -python ${code_path} --input-dir ${source} --output-dir ${output} +mkdir -p "${output}" +python "${code_path}" --input-dir "${source}" --output-dir "${output}"notebooks/teigaku_genzei/json_qa.py (1)
16-38: (Optional) Consider whether"" .join(...)is the intended delimiter for all outputs.
If any downstream expects whitespace-separated text, joining with""(Line 38) can fuse tokens. If the intent is “Japanese-first”, a short comment stating that would help future readers.notebooks/teigaku_genzei/extract_icl_from_seed.py (2)
10-13: Consider renaming `compute_dict_hash` for clarity. The function concatenates values with `"::"` rather than computing an actual hash. This could cause unexpected collisions if values contain `"::"`. The name `compute_dict_key` or `compute_dedup_key` would better reflect its purpose.
48-50: Simplify dict comprehension. Per static analysis, this can be simplified. Apply this diff:

```diff
-    unique_icl_dict: dict[str, dict[str, Any]] = {
-        h: obj for (h, obj) in hash_list
-    }
+    unique_icl_dict: dict[str, dict[str, Any]] = dict(hash_list)
```
52-52: Remove useless `return` statement. The explicit `return` at the end of `main()` is unnecessary.

```diff
-    return
-
 if __name__ == "__main__":
```
2-2: Remove unused imports.
cast,Hashable,Sequence, andCollectionare imported but never used in this module.-from typing import cast, Any, Hashable, Sequence, Collection +from typing import Anynotebooks/teigaku_genzei/make_context.py (1)
2-2: Remove unused imports.
Hashable,Sequence, andCollectionare imported but never used.-from typing import cast, Any, Hashable, Sequence, Collection +from typing import cast, Anynotebooks/teigaku_genzei/json_glossary.py (1)
48-53: Consider renaming the parameter for clarity. The function `save_to_csv` uses the parameter name `qa_pairs` but actually writes glossary data with `Term` and `Description` columns. Consider renaming to `glossary_pairs` for consistency.

```diff
-def save_to_csv(qa_pairs, output_file):
+def save_to_csv(glossary_pairs, output_file):
     with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
         writer = csv.writer(csvfile)
         writer.writerow(['Term', 'Description'])
-        for qa in qa_pairs:
-            writer.writerow(qa)
+        for pair in glossary_pairs:
+            writer.writerow(pair)
```
41-49: Avoid `print()` in compute(); use a logger (or remove). `print()` at Line 47 will spam notebook output and makes batch evaluation noisy. Prefer `logger.debug(...)` or remove entirely.
78-80: Prefer `ValueError` over `argparse.ArgumentError` here. `make_seed()` is also a callable function; `argparse.ArgumentError(None, ...)` is hard to interpret without the parser instance. A simple `ValueError(...)` is clearer.
74-79: Be tolerant to "[Ａ]" vs "[A]" prefix variance in source data. If some PDFs/exports produce ASCII `[A]` instead of fullwidth `[Ａ]`, the current `removeprefix("[Ａ]")` won't strip it. Consider normalizing both:

```diff
-    "icl_response_" + str(i + 1): qa_df.loc[qa]["Answer"].removeprefix("[Ａ]").strip(),
+    "icl_response_" + str(i + 1): qa_df.loc[qa]["Answer"].removeprefix("[Ａ]").removeprefix("[A]").strip(),
```
198-224: Use `logger.exception(...)` when catching errors to keep stack traces. In `export_document()` (Line 221-223) and `export_document_new_docling()` (Line 262-265) you log `str(e)` but lose traceback context in logs; prefer `logger.exception(...)` in `except`.

Also applies to: 225-266
...gaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/openai/o1_preview.json
```json
{
    "__type__": "llm_judge_direct_positional_bias",
    "inference_engine": {
        "__type__": "cross_provider_inference_engine",
        "max_tokens": 1024,
        "seed": 42,
        "temperature": 0,
        "provider": "rits",
        "model": "mixtral-8x7b-instruct-v01"
    },
    "evaluator_name": "MIXTRAL8_7b",
    "generate_summaries": false
}
```

(No newline at end of file)
🧩 Analysis chain
🏁 Script executed:

```sh
# Search for similar JSON config files to understand the expected schema
find . -name "*.json" -path "*/local_catalog/*" -type f | head -20
```

Repository: opendatahub-io/data-processing
Length of output: 2268

🏁 Script executed:

```sh
# Look for Unitxt configuration or schema definitions
rg "cross_provider_inference_engine|llm_judge_direct_positional_bias" --type json -A 5 -B 2
```

Repository: opendatahub-io/data-processing
Length of output: 50386

🏁 Script executed:

```sh
# Search for RITS provider implementation in codebase
rg "rits|RITS" --type py -l | head -10
```

Repository: opendatahub-io/data-processing
Length of output: 290

🏁 Script executed:

```sh
# Check for Unitxt version pinning (requirements, setup.py, pyproject.toml)
fd -e "requirements.*\.txt" -o -e "setup.py" -o -e "pyproject.toml" | head -5
```

Repository: opendatahub-io/data-processing
Length of output: 190

🏁 Script executed:

```sh
# Look for any schema definitions or validation code
fd -e "schema.py" -o -e "*schema*.json" | head -10
```

Repository: opendatahub-io/data-processing
Length of output: 190

🏁 Script executed:

```sh
# Look for Unitxt version constraints
rg "unitxt" -i pyproject.toml setup.py requirements.txt 2>/dev/null | head -20
```

Repository: opendatahub-io/data-processing
Length of output: 56

🏁 Script executed:

```sh
# Search for code that loads/validates these JSON configs
rg "llm_judge_direct_positional_bias|cross_provider_inference_engine" --type py -B 2 -A 5 | head -100
```

Repository: opendatahub-io/data-processing
Length of output: 1006

🏁 Script executed:

```sh
# Check RITS provider implementation in Python
rg "class.*RitsInference|rits.*inference" -i --type py -B 2 -A 10 | head -150
```

Repository: opendatahub-io/data-processing
Length of output: 11396

🏁 Script executed:

```sh
# Look at the notebook files that use RITS
cat notebooks/teigaku_genzei/llm_as_judge_direct_positional_bias.py 2>/dev/null | head -100
```

Repository: opendatahub-io/data-processing
Length of output: 4965

🏁 Script executed:

```sh
# Search for Issue #13 reference or related bug discussion
rg "Issue.*13|issue.*13|bug.*CrossProvider|CrossProvider.*bug" --type py -B 3 -A 3
```

Repository: opendatahub-io/data-processing
Length of output: 800

🏁 Script executed:

```sh
# Check if there are any special handling or workarounds for RITS provider
rg "rits.*CrossProvider|CrossProvider.*rits" -i --type py -B 2 -A 5
```

Repository: opendatahub-io/data-processing
Length of output: 2231

🏁 Script executed:

```sh
# Look for any version-specific handling or compatibility notes
rg "unitxt.*1\.24|version.*1\.24|1\.24.*unitxt" -i --type py -B 2 -A 2
```

Repository: opendatahub-io/data-processing
Length of output: 576

🏁 Script executed:

```sh
# Check for comments or issues specifically about Mixtral with RITS
rg "mixtral.*rits|rits.*mixtral" -i -B 2 -A 2
```

Repository: opendatahub-io/data-processing
Length of output: 567

🏁 Script executed:

```sh
# Look at the JSON importer to understand if it's aware of the bug
cat notebooks/teigaku_genzei/llm_as_judge_direct_provider_importer.py
```

Repository: opendatahub-io/data-processing
Length of output: 1793

🏁 Script executed:

```sh
# Check the llama4_maverick config in rits directory
cat notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/llama4_maverick.json
```

Repository: opendatahub-io/data-processing
Length of output: 559

🏁 Script executed:

```sh
# Check what __type__ is used for llama4_maverick (should show rits_inference_engine_fixed based on Python code)
rg "llama4_maverick|llama4-maverick" --type json -A 10 | grep -A 10 "rits/llama4"
```

Repository: opendatahub-io/data-processing
Length of output: 56
This configuration is subject to a known bug affecting CrossProviderInferenceEngine with RITS provider.
In Unitxt 1.24.0, CrossProviderInferenceEngine causes connection failures when used with the RITS provider (Issue #13). The codebase already has a documented workaround for this: the llama4_maverick.json config in the same directory uses rits_inference_engine_fixed instead. The mixtral8_7b.json config should similarly use rits_inference_engine_fixed with the appropriate model name and endpoint URL, rather than relying on the buggy cross_provider_inference_engine.
🤖 Prompt for AI Agents
In
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/rits/mixtral8_7b.json
lines 1-13, this config uses the buggy cross_provider_inference_engine with
provider "rits"; replace that inference_engine block to use
"rits_inference_engine_fixed" (as in llama4_maverick.json), keep max_tokens:
1024, seed: 42 and temperature: 0, set model to "mixtral-8x7b-instruct-v01" and
add the fixed engine's endpoint URL/connection fields required by the workaround
so the config uses the fixed RITS engine instead of the broken cross-provider
engine.
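For concreteness, a sketch of what the corrected entry could look like, mirroring the `llama4_maverick.json` workaround (the fixed engine's exact field names, e.g. the model key and any endpoint field, must be copied from that file; they are assumptions here):

```json
{
    "__type__": "llm_judge_direct_positional_bias",
    "inference_engine": {
        "__type__": "rits_inference_engine_fixed",
        "max_tokens": 1024,
        "seed": 42,
        "temperature": 0,
        "model_name": "mixtral-8x7b-instruct-v01"
    },
    "evaluator_name": "MIXTRAL8_7b",
    "generate_summaries": false
}
```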
```json
{
    "__type__": "llm_judge_direct_positional_bias",
    "inference_engine": {
        "__type__": "cross_provider_inference_engine",
        "max_tokens": 1024,
        "seed": 42,
        "temperature": 0,
        "provider": "together-ai",
        "model": "llama-3-1-405b-instruct"
    },
    "evaluator_name": "LLAMA3_1_405B",
    "generate_summaries": false
}
```

(No newline at end of file)
🧩 Analysis chain
🏁 Script executed:

```sh
# First, let's understand the repo structure and find similar configs
find . -name "*.json" -path "*/local_catalog/metrics/*" -type f | head -20
```

Repository: opendatahub-io/data-processing
Length of output: 204

🏁 Script executed:

```sh
# Search for other Together AI or provider configs in the codebase
rg "together-ai|cross_provider_inference_engine" --type json -l
```

Repository: opendatahub-io/data-processing
Length of output: 6501

🏁 Script executed:

```sh
# Look for schema definitions or documentation about the config format
find . -name "*.py" -o -name "*.md" | xargs grep -l "llm_judge_direct_positional_bias\|cross_provider_inference_engine" 2>/dev/null | head -10
```

Repository: opendatahub-io/data-processing
Length of output: 141
🌐 Web query:
Together AI Llama 3.1 405B Instruct model identifier API
💡 Result:
The Together AI model identifier for Llama 3.1 405B Instruct (Turbo) is:
meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo. [1][2]
Example curl (chat completions):
```sh
curl -X POST "https://api.together.xyz/v1/chat/completions" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo","messages":[{"role":"user","content":"Hello"}]}'
```

[1]
Sources: Together AI model page and chat-models listing. [1][2]
🌐 Web query:
Unitxt sdg_hub local catalog schema llm_judge cross_provider_inference_engine
💡 Result:
Here’s a concise mapping of the terms and relevant projects/documents I found:
- Unitxt (catalog / LLM-as-judge): Unitxt provides a local catalog schema and built-in "LLM-as-a-Judge" evaluators (many predefined criteria and direct evaluator catalog entries). See Unitxt catalog and examples. [1][2][3]
- sdg_hub / local catalog schema: Unitxt's catalog exposes organized metric/evaluator entries (e.g., direct evaluators under catalog.metrics.llm_as_judge) suitable for local catalogs and integration (appears in Unitxt docs/releases). [1][2]
- llm_judge / llm-judger / judge-llm: Several OSS tools implement the "LLM-as-a-judge" pattern:
- llm-judger (PyPI) — a tool to have models judge each other and store results. [4]
- judge-llm (PyPI) — config-driven evaluator framework supporting multiple providers/evaluators and reporting. [5]
- NVIDIA NeMo docs describing an LLM-as-a-judge evaluation flow and config examples. [6][7]
- local / cross_provider_inference_engine (unified inference engines): There are multiple unified inference-engine projects and libraries that let you run open-source and commercial models through a common interface (useful for cross-provider inference and evaluation):
- llm-engines (PyPI) — unified inference engine wrapper for various backends. [8]
- LLM Inference Engine docs (llm-ie) — pluggable inference engine interfaces (vLLM, OpenAI, HuggingFace, Ollama, LlamaCpp, etc.). [9]
- Research/infra projects (FlashInfer, Inferflow, AIBrix) focus on efficient/high-performance inference serving and multi-engine support. [10][11][12]
If you want, I can:
- show specific Unitxt catalog entries (sdg_hub) and example YAML/config for a local catalog; or
- propose an architecture to wire Unitxt + a cross-provider inference layer + an LLM-judge tool to run cross-provider evaluations and store results.
Sources:
[1] Unitxt catalog pages and examples.
[2] Unitxt examples — LLM-as-a-Judge.
[3] Unitxt catalog direct evaluator pages.
[4] llm-judger (PyPI).
[5] judge-llm (PyPI).
[6] NVIDIA NeMo microservices — LLM-as-a-Judge docs.
[7] NVIDIA NeMo example configs (LLM-as-a-Judge).
[8] llm-engines (PyPI).
[9] LLM Inference Engine (llm-ie) docs.
[10] FlashInfer (arXiv).
[11] Inferflow (arXiv).
[12] AIBrix (arXiv / GitHub).
🏁 Script executed:

```sh
cat -n ./notebooks/teigaku_genzei/llm_as_judge_direct_provider_importer.py
```

Repository: opendatahub-io/data-processing
Length of output: 2080

🏁 Script executed:

```sh
# Check the other Together AI configs to see the pattern
cat -n ./notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama3_1_70b.json
```

Repository: opendatahub-io/data-processing
Length of output: 529

🏁 Script executed:

```sh
# Check one more for comparison
cat -n ./notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama3_1_8b.json
```

Repository: opendatahub-io/data-processing
Length of output: 527

🏁 Script executed:

```sh
# Check if there's any model name mapping/translation in the codebase
rg "llama-3-1-405b-instruct|meta-llama.*405B" --type py -A 3 -B 3
```

Repository: opendatahub-io/data-processing
Length of output: 56

🏁 Script executed:

```sh
# Look for inference engine implementation
find . -name "*.py" -type f | xargs grep -l "cross_provider_inference_engine" 2>/dev/null | head -5
```

Repository: opendatahub-io/data-processing
Length of output: 56
Fix Together AI model identifier: "llama-3-1-405b-instruct" is not a valid Together AI API model name.
The config will fail at runtime. According to Together AI's API documentation, the correct identifier for Llama 3.1 405B Instruct is meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo. Update the "model" field to use this identifier. This issue affects all Together AI configs in the directory (8B, 70B, and 405B variants).
Additionally, document the cost and access requirements for the 405B model + max_tokens: 1024 configuration for users.
🤖 Prompt for AI Agents
In
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama3_1_405b.json
lines 1-13, the Together AI model identifier is incorrect; replace
"llama-3-1-405b-instruct" with the valid Together API model name
"meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo" and update any other Together AI
configs in this directory (8B, 70B, 405B variants) to their documented
identifiers; also add a brief note in the same file or adjacent README
documenting that the 405B model requires special access and is higher cost, and
that with max_tokens: 1024 users should expect increased compute cost and to
verify access/quotas before running.
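Under that assumption, a sketch of the corrected config (only the `model` value changes; verify the slug against your Together AI account):

```json
{
    "__type__": "llm_judge_direct_positional_bias",
    "inference_engine": {
        "__type__": "cross_provider_inference_engine",
        "max_tokens": 1024,
        "seed": 42,
        "temperature": 0,
        "provider": "together-ai",
        "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo"
    },
    "evaluator_name": "LLAMA3_1_405B",
    "generate_summaries": false
}
```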
```json
{
    "__type__": "llm_judge_direct_positional_bias",
    "inference_engine": {
        "__type__": "cross_provider_inference_engine",
        "max_tokens": 1024,
        "seed": 42,
        "temperature": 0,
        "provider": "together-ai",
        "model": "llama-3-1-8b-instruct"
    },
    "evaluator_name": "LLAMA3_1_8B",
    "generate_summaries": false
}
```

(No newline at end of file)
🧩 Analysis chain
🏁 Script executed:

```sh
# First, search the repository for other Together AI configurations to understand naming patterns
find . -type f -name "*.json" -path "*/together_ai/*" 2>/dev/null | head -20
```

Repository: opendatahub-io/data-processing
Length of output: 204

🏁 Script executed:

```sh
# Search for Unitxt or sdg_hub configuration files in the repo that might document the schema
find . -type f -name "*.py" -o -name "*.json" | xargs grep -l "cross_provider_inference_engine" 2>/dev/null | head -10
```

Repository: opendatahub-io/data-processing
Length of output: 204
🌐 Web query:
Together AI Llama 3.1 8B Instruct exact model identifier slug API
💡 Result:
The Together.ai API model slug for Llama 3.1 8B Instruct is:
- meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo. [1]
Alternative/NIM variant (GPU NIM endpoint) slug:
- nim/meta/llama-3.1-8b-instruct. [2]
🌐 Web query:
Unitxt cross_provider_inference_engine schema documentation parameters
💡 Result:
Summary of CrossProviderInferenceEngine schema & main parameters (Unitxt)
- Purpose: single engine that calls multiple API providers; dynamically switches between providers and maps model names per provider. [1][2]
Main constructor parameters
- model (str) — model identifier (e.g., "gpt-4o", "llama-3-2-1b-instruct"). [1][2]
- provider (Optional[str]) — which API to use; must be one of the supported providers (e.g., "watsonx", "together-ai", "open-ai", "aws", "ollama", "bam", "watsonx-sdk", "rits", "vertex-ai", "hf-local"). [1][2]
- provider_model_map (Dict[provider, Dict[str,str]]) — per-provider mapping of model ids to unify identifiers across backends. [1][2]
- provider_specific_args (Optional[Dict[str, Dict[str, Any]]]) — provider-specific configuration (example: {"watsonx": {"max_requests_per_second": 4}}). [1][2]
Common inference / OpenAI-style generation params (available via mixins)
- max_tokens / max_new_tokens, temperature, top_p, top_k, repetition_penalty, stop / stop_sequences, seed / random_seed, min_new_tokens, time_limit, truncate_input_tokens, typical_p, etc. (engine extends OpenAiInferenceEngineParamsMixin / other provider param mixins; pass provider-specific generation params as usual). [2][1]
Credentials & connection
- credentials / default_headers / base_url (provider-dependent) — supply API keys / endpoints per provider or rely on environment variables. [2][1]
Other engine options
- label (str), cache_batch_size, use_cache, num_parallel_requests / concurrency_limit, logprobs / top_logprobs / echo / n (for log-prob / option-selection use cases). [2][1]
Example (catalog snippet)
- CrossProviderInferenceEngine(model="meta-llama/llama-3-70b-instruct", provider="watsonx", max_tokens=2048, seed=42). [1]
Sources
- Unitxt Cross-Provider engine catalog & explanation. [1]
- unitxt.inference module (engine classes and param mixins / supported params). [2]
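To make the schema concrete, the same judge engine could plausibly be constructed directly in Python. This is a minimal sketch based only on the schema summary above; the exact model id accepted by the together-ai provider is an assumption (see the identifier fix discussed below):

```python
# Minimal sketch of constructing the judge's engine, per the
# CrossProviderInferenceEngine docs quoted above. The model id is an
# assumption; see the Together AI identifier fix below.
from unitxt.inference import CrossProviderInferenceEngine

engine = CrossProviderInferenceEngine(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    provider="together-ai",
    max_tokens=1024,
    temperature=0,
    seed=42,
)
```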
Correct the Together AI model identifier before merge.
The model identifier "llama-3-1-8b-instruct" does not match Together AI's documented model slug. According to Together AI's API, the correct identifier for Llama 3.1 8B Instruct is "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo" (or "nim/meta/llama-3.1-8b-instruct" for the NIM variant). Change line 9 from:
"model": "llama-3-1-8b-instruct"
to:
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"
The parameter names (max_tokens, seed, temperature) and provider key ("together-ai") are correct per Unitxt's CrossProviderInferenceEngine schema. The mismatch in the model slug will cause the evaluation to fail at runtime when Unitxt attempts to instantiate the inference engine.
🤖 Prompt for AI Agents
In
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama3_1_8b.json
around lines 1 to 13, the Together AI model slug is incorrect; replace the value
for "model" currently set to "llama-3-1-8b-instruct" with the documented
Together AI identifier "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo" (or use
"nim/meta/llama-3.1-8b-instruct" if you intend the NIM variant) so the
CrossProviderInferenceEngine can instantiate the model at runtime.
```json
"provider": "together-ai",
"model": "llama-3-3-70b-instruct"
```
🧩 Analysis chain
🌐 Web query:
Together AI Llama 3.3 70B Instruct model identifier API
💡 Result:
The Together.ai model identifier for Llama 3.3 70B Instruct (Instruct/Turbo variant) is:
- meta-llama/Llama-3.3-70B-Instruct-Turbo. [1]
(There are related variants shown on Together.ai such as the free endpoint meta-llama/Llama-3.3-70B-Instruct-Turbo-Free and an NVIDIA NIM variant nim/meta/llama-3-3-70b-instruct.) [2][3]
Example curl (Together API v1 chat completions):

```shell
curl -X POST "https://api.together.xyz/v1/chat/completions" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
    "messages": [{"role": "user", "content": "Explain transfer learning in one paragraph."}]
  }'
```
Sources: Together.ai model pages. [1] [2] [3]
Fix Together AI model identifier: change to correct API format.
The configuration uses "llama-3-3-70b-instruct", which is not a valid Together AI model identifier. The correct identifier is "meta-llama/Llama-3.3-70B-Instruct-Turbo". Update line 9 to use the proper format required by Together AI's API.
🤖 Prompt for AI Agents
In
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/together_ai/llama3_3_70b.json
around lines 8 to 9, the Together AI model identifier is incorrect; replace the
value "llama-3-3-70b-instruct" with the correct Together AI API model identifier
"meta-llama/Llama-3.3-70B-Instruct-Turbo" (leave provider as "together-ai").
Ensure the JSON string is updated exactly to the correct identifier format
expected by Together AI.
```json
{
  "__type__": "llm_judge_direct_positional_bias",
  "inference_engine": {
    "__type__": "cross_provider_inference_engine",
    "max_tokens": 1024,
    "seed": 42,
    "temperature": 0,
    "provider": "vertex-ai",
    "model": "gemini-2-5-flash"
  },
  "evaluator_name": "GEMINI_2_5_FLASH",
  "generate_summaries": false
}
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Search for existing vertex provider entries in catalog JSON files
find notebooks/teigaku_genzei/local_catalog -name "*.json" -type f -exec grep -l "vertex" {} \;
```

Repository: opendatahub-io/data-processing
Length of output: 302
🏁 Script executed:
```shell
#!/bin/bash
# Search for provider identifier patterns across all catalog JSON files
find notebooks/teigaku_genzei/local_catalog -name "*.json" -type f -exec grep -h '"provider"' {} \; | sort | uniq
```

Repository: opendatahub-io/data-processing
Length of output: 350
🏁 Script executed:
```shell
#!/bin/bash
# Search for gemini model identifiers in existing configs
find notebooks/teigaku_genzei/local_catalog -name "*.json" -type f -exec grep -h '"model"' {} \; | sort | uniq
```

Repository: opendatahub-io/data-processing
Length of output: 864
🌐 Web query:
Vertex AI Gemini 2.5 Flash model identifier format and valid names for API calls
💡 Result:
Short answer — use the Publisher model ID or the Vertex model resource name.
- Publisher model identifier (used in most examples): publishers/google/models/{MODEL_ID}
  Example MODEL_ID values: gemini-2.5-flash, gemini-2.5-flash-image, gemini-2.5-flash-lite, gemini-2.5-pro, gemini-2.5-pro-preview-, and preview variants such as gemini-2.5-flash-preview-5-20. Use that full path for batch jobs and many REST examples. [1][2][6]
- Vertex resource name for tuned/deployed models (used for endpoint/model calls): projects/{PROJECT_ID}/locations/{LOCATION}/models/{MODEL_ID}. Example: projects/myproj/locations/us-central1/models/123456789. [2]
- OpenAI-compatible string (when using the OpenAI-style endpoint): "google/{MODEL_ID}" (e.g., "google/gemini-2.5-flash"). [3]
References:
[1] Google Cloud — Function calling / model examples (shows model_id gemini-2.5-flash).
[2] Google Cloud — Batch prediction & model path examples (publishers/google/models/{MODEL_ID}, and projects/.../models/{MODEL_ID}).
[3] Google Cloud — Function calling reference (OpenAI-compatible "google/{MODEL_ID}" usage).
Fix Vertex AI model identifier format: use gemini-2.5-flash (dots), not gemini-2-5-flash (hyphens).
Change "model": "gemini-2-5-flash" to "model": "gemini-2.5-flash". Vertex AI's Publisher model identifier uses dots between version components. The current format will fail at runtime when the inference engine attempts to resolve the model.
🤖 Prompt for AI Agents
In
notebooks/teigaku_genzei/local_catalog/metrics/llm_as_judge/direct_positional_bias/vertex_ai/gemini_2_5_flash.json
around lines 1 to 13, the Vertex AI model identifier is using hyphens
("gemini-2-5-flash") which is invalid; update the model field to use the correct
Publisher identifier with dots ("gemini-2.5-flash") so the inference engine can
resolve the model at runtime.
```python
def do_inference(model_path: str, api_key: str, api_url: str, server_type: str, dataset: Dataset) -> ListWithMetadata[str] | ListWithMetadata[TextGenerationInferenceOutput]:
    model = OpenAiInferenceEngine(
        model_name=model_path,
        max_tokens=MAX_OUTPUT_TOKENS,
        credentials={} if api_key is None else {"api_key": api_key},
        base_url=api_url,
        # use_cache=False,
        # use_fast_tokenizer=False,
    ) if server_type == "openai" else VLLMRemoteInferenceEngine(
        model_name=model_path,
        max_tokens=MAX_OUTPUT_TOKENS,
        credentials={} if api_key is None else {"api_key": api_key},
        base_url=api_url,
        data_classification_policy=["public", "proprietary"],
        # use_cache=False,
        # use_fast_tokenizer=False,
    ) if server_type == "vllm-remote" else HFAutoModelInferenceEngine(
        model_name=model_path,
        max_new_tokens=MAX_OUTPUT_TOKENS,
        # use_cache=False,
        # use_fast_tokenizer=False,
    )
    predictions_new = model(dataset)
    return predictions_new
```
🧩 Analysis chain
🏁 Script executed:
```shell
# First, let's find the file and understand the context better
cd /tmp || exit 1
find . -type f -name "test_qa.py" 2>/dev/null | head -5
```

Repository: opendatahub-io/data-processing
Length of output: 56
🏁 Script executed:
```shell
# Search for unitxt version and imports in the file
rg "unitxt" notebooks/teigaku_genzei/test_qa.py -B2 -A2 | head -40
```

Repository: opendatahub-io/data-processing
Length of output: 1021
🏁 Script executed:
```shell
# Check if there's a requirements file or setup.py mentioning unitxt version
fd "requirements.*|setup.py|pyproject.toml" | xargs grep -l "unitxt" 2>/dev/null
```

Repository: opendatahub-io/data-processing
Length of output: 115
🌐 Web query:
unitxt 1.24.0 HFAutoModelInferenceEngine OpenAiInferenceEngine return type Dataset inference
💡 Result:
In unitxt (v1.24.0+), InferenceEngine.infer(dataset, return_meta_data=False) accepts a list-of-dicts or a Dataset and returns a ListWithMetadata of predictions — either ListWithMetadata[str] (when return_meta_data=False) or ListWithMetadata[TextGenerationInferenceOutput] (when return_meta_data=True). [1][2]
Sources:
- unitxt.inference (API docs). [1]
- unitxt.inference (alternate/latest docs showing same infer signature). [2]
🌐 Web query:
unitxt evaluate() function accepted input formats predictions
💡 Result:
Short answer — evaluate() accepts the same predictions object returned by Unitxt inference engines or a plain Python list/iterable of per-instance predictions that align with the dataset. Typical accepted shapes:
- A list (or iterable) of raw model outputs (strings or numbers) where each element corresponds to one dataset instance. (Common when you call an inference engine then pass its return value to evaluate.) [1][4]
- A list of dicts/objects produced by Unitxt inference engines (contains fields like "prediction" / "processed_prediction", optionally other metadata). evaluate() and the CLI utilities accept the inference-engine output as-is. [1][4]
- Any structure where processed predictions (after template postprocessing) can be matched to dataset instances and to references — i.e., the metric code expects the metric input types described in Unitxt docs (prediction as a string/float/list, references as a list of same-typed items). Postprocessing to the correct type is done via the template's postprocessors. [2]
References:
- Unitxt getting-started / example (evaluate usage and predictions returned from inference engines). [1]
- Unitxt metrics / metric inputs (explains required prediction/reference types and role of template postprocessors). [2]
- unitxt.evaluate_cli / run_evaluation (accepts list[Any] predictions and dataset). [4]
If you want, paste an example of your predictions object and dataset and I’ll confirm whether evaluate() will accept it or show how to adapt it.
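Putting the two answers together, the expected round trip is roughly the following sketch (keyword names follow the Unitxt docs quoted above; treat the exact `evaluate` signature as an assumption for the installed version):

```python
# Sketch of the documented inference -> evaluation round trip in Unitxt.
# `model` is any configured inference engine; `dataset` is a loaded Unitxt dataset.
from unitxt import evaluate

predictions = model.infer(dataset)  # ListWithMetadata[str] per the docs above
results = evaluate(predictions=predictions, data=dataset)
```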
🏁 Script executed:
```shell
# Let's check the full context around lines 153-177 and 304-320
cat -n notebooks/teigaku_genzei/test_qa.py | sed -n '153,177p'
```

Repository: opendatahub-io/data-processing
Length of output: 1316
🏁 Script executed:
```shell
# Also check lines 304-320
cat -n notebooks/teigaku_genzei/test_qa.py | sed -n '304,320p'
```

Repository: opendatahub-io/data-processing
Length of output: 1121
🏁 Script executed:
```shell
# Search for json.dumps and json.loads usage in the file
rg "json\.(dumps|loads)" notebooks/teigaku_genzei/test_qa.py -B3 -A3
```

Repository: opendatahub-io/data-processing
Length of output: 56
🏁 Script executed:
```shell
# Check what happens to predictions_new variable after line 177
cat -n notebooks/teigaku_genzei/test_qa.py | sed -n '153,210p'
```

Repository: opendatahub-io/data-processing
Length of output: 2817
🏁 Script executed:
```shell
# Find the jsonl_util module
fd "jsonl_util" notebooks/
```

Repository: opendatahub-io/data-processing
Length of output: 112
🏁 Script executed:
```shell
# Search for jsonl_util imports and usage
rg "jsonl_util" notebooks/teigaku_genzei/test_qa.py -B2 -A2
```

Repository: opendatahub-io/data-processing
Length of output: 653
🏁 Script executed:
```shell
# Look for the jsonl_util module in the codebase
find . -name "*jsonl*" -type f 2>/dev/null
```

Repository: opendatahub-io/data-processing
Length of output: 166
🏁 Script executed:
```shell
# Check if jsonl_util is defined in the same directory
ls -la notebooks/teigaku_genzei/ | grep -i json
```

Repository: opendatahub-io/data-processing
Length of output: 403
🏁 Script executed:
```shell
# Read the jsonl_util.py file to see how it handles serialization
cat -n notebooks/teigaku_genzei/jsonl_util.py
```

Repository: opendatahub-io/data-processing
Length of output: 658
🌐 Web query:
unitxt ListWithMetadata JSON serialization json.dumps
💡 Result:
Short answer — convert the ListWithMetadata into plain JSON-serializable structures (list + dict) or provide a json default/encoder. Example approaches:
- Explicit convert to dict (recommended); this produces a JSON object with "data" and "metadata":

```python
import json
from unitxt.inference import ListWithMetadata

lwm = ListWithMetadata(["a", "b"], metadata={"source": "x"})
payload = {"data": list(lwm), "metadata": lwm.metadata}
s = json.dumps(payload)
```

- Use json.dumps default= to handle it generically:

```python
def default(o):
    if hasattr(o, "metadata") and isinstance(o, list):
        return {"data": list(o), "metadata": getattr(o, "metadata")}
    raise TypeError(f"Not JSON serializable: {type(o)}")

s = json.dumps(lwm, default=default)
```

- Subclass JSONEncoder (alternative):

```python
import json

class LWMEncoder(json.JSONEncoder):
    def default(self, o):
        if hasattr(o, "metadata") and isinstance(o, list):
            return {"data": list(o), "metadata": o.metadata}
        return super().default(o)

s = json.dumps(lwm, cls=LWMEncoder)
```

Notes:
- unitxt.ListWithMetadata is a list subclass that accepts metadata at construction. [1]
- json.dumps accepts default or cls to customize serialization. [2]
Sources:
[1] unitxt docs (unitxt.inference — ListWithMetadata).
[2] Python json.dumps / JSONEncoder docs.
🌐 Web query:
unitxt ListWithMetadata class definition structure
💡 Result:
unitxt.ListWithMetadata is a thin subclass of Python list that adds an optional metadata dict. Its documented signature and type info:
- class unitxt.inference.ListWithMetadata(*args, metadata: dict | None = None, **kwargs) — Bases: List[T]. [1]
(The class is defined in the unitxt.inference module and accepts the usual list args plus an optional metadata parameter.) [1]
Sources
- unitxt.inference documentation (ListWithMetadata signature). [1]
jsonl_util.write_jsonl_file() will fail to serialize ListWithMetadata objects—handle serialization explicitly before saving.
Line 313 calls jsonl_util.write_jsonl_file(predictions_file, predictions_new) where predictions_new is a ListWithMetadata (from do_inference()). Since json.dumps() in jsonl_util.py has no custom encoder, it cannot serialize ListWithMetadata objects; they require explicit conversion to a plain dict structure (e.g., {"data": list(predictions_new), "metadata": predictions_new.metadata}). Additionally, read_jsonl_file() returns list[dict] or list[str] from json.loads(), creating a type mismatch when mixed with ListWithMetadata objects at line 318 and passed to evaluate() at line 320.
🤖 Prompt for AI Agents
In notebooks/teigaku_genzei/test_qa.py around lines 153 to 177, do_inference()
returns a ListWithMetadata which jsonl_util.write_jsonl_file cannot serialize;
before calling write_jsonl_file (and before mixing with read_jsonl_file output),
convert predictions_new into plain serializable structures—e.g., serialize as a
dict with keys like "data": list(predictions_new) and "metadata":
predictions_new.metadata, or write the raw list and write metadata
separately—then update the subsequent read path to reconstruct a
ListWithMetadata (or always pass plain lists) so evaluate() receives consistent,
JSON-serializable input rather than a ListWithMetadata object.
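A minimal sketch of that unwrap step, assuming `write_jsonl_file` serializes each record with plain `json.dumps` (the `metadata_file` name is hypothetical):

```python
# Sketch: unwrap ListWithMetadata (a list subclass) into JSON-safe values
# before handing it to jsonl_util.write_jsonl_file.
plain_predictions = list(predictions_new)              # plain list of predictions
metadata = getattr(predictions_new, "metadata", None)  # keep metadata separately

jsonl_util.write_jsonl_file(predictions_file, plain_predictions)
if metadata is not None:
    jsonl_util.write_jsonl_file(metadata_file, [metadata])  # hypothetical sibling file
```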
… verify intended OCR engine.
Actionable comments posted: 0
🧹 Nitpick comments (1)
notebooks/teigaku_genzei/teigaku_genzei.ipynb (1)
83-90: Consider using a more semantically appropriate placeholder for the OCR language parameter.

While the comments correctly explain that RapidOCR ignores the language parameter, using `["en"]` for Japanese document processing is semantically confusing. Consider one of these options:

- Use `["ja"]` or `["jpn"]` as a placeholder to better reflect the document language
- Use an empty list `[]` if the parameter is truly ignored
- Add a more prominent note that users should switch to Tesseract OCR if Japanese OCR is needed

This would make the configuration more self-documenting for future maintainers.
Example:
"ocr": { "enabled": True, # Enable/disable OCR processing - "languages": ["en"], # List of language codes (e.g., eng, fra, deu) - # Note that the OCR language has no effect in this example since we discard information from images. - # Note that the OCR engine that is automatically assigned is RapidOCR. - # The language option is not supported for RapidOCR in Docling. - # Therefore, the value "en" specified here is ignored. + "languages": ["ja"], # Placeholder: RapidOCR ignores this parameter + # Note: RapidOCR (default OCR engine) does not support the language parameter. + # To enable Japanese OCR, switch to Tesseract and use language code "jpn". # See docling.datamodel.pipeline_options.RapidOcrOptions for the detail. },
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
notebooks/teigaku_genzei/teigaku_genzei.ipynb (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Summary
🔇 Additional comments (7)
notebooks/teigaku_genzei/teigaku_genzei.ipynb (7)
185-188: Defensive type check suggests API uncertainty.

The `isinstance` check on line 185 suggests that `pipeline_options.table_structure_options` might not always be a `TableStructureOptions` instance. This is reasonable defensive coding, but if the API guarantees this type, the check could be simplified. If this check is necessary due to API behavior (e.g., it could be `None`), consider adding a brief comment explaining when this might occur.

317-337: LGTM: Directory creation properly handled.

The addition of `os.makedirs(qa_table_dir, exist_ok=True)` on line 326 correctly addresses the previous review comment about ensuring output directories exist before writing files.

364-383: LGTM: Directory creation properly handled.

The addition of `os.makedirs(context_dir, exist_ok=True)` on line 375 correctly addresses the previous review comment.

414-429: LGTM: Directory creation properly handled.

The addition of `os.makedirs(os.path.dirname(output_icl_path), exist_ok=True)` on line 420 correctly addresses the previous review comment about creating the parent directory.

466-480: LGTM: Directory creation properly handled.

The addition of `os.makedirs(os.path.dirname(output_seed_path), exist_ok=True)` on line 471 correctly addresses the previous review comment about ensuring the parent directory exists.

199-224: Export logic is well-structured.

The `export_document` function properly handles multiple export formats with appropriate error handling and logging. The use of `getattr` to dynamically call export methods is clean and maintainable.

226-274: Main conversion function follows best practices.

The `export_document_new_docling` function properly handles directory creation (line 249), counts successes/failures, and provides comprehensive logging. The overall structure is clean and maintainable.
Actionable comments posted: 0
🧹 Nitpick comments (5)
notebooks/teigaku_genzei/json_qa.py (5)
1-6: Consider removing the TODO or converting it to a module docstring.

The script appears complete and functional. If the TODO is describing what the script does (rather than tracking incomplete work), convert it to a proper module docstring.

Apply this diff:

```diff
-# TODO:
-# Read a given JSON file and extract fields that corresponds to QA pairs.
-# Save the QA pairs to a CSV file.
-# Both the input file name and the output file name are specified as command line arguments.
-
-# Assisted by watsonx Code Assistant
+"""
+Read a given JSON file and extract fields that correspond to QA pairs.
+Save the QA pairs to a CSV file.
+Both the input file name and the output file name are specified as command line arguments.
+
+Assisted by watsonx Code Assistant
+"""
```
16-38: Well-structured text extraction logic.

The function correctly filters body text while excluding page numbers and section titles. The nested helper functions and inline comments make the logic clear.

Optionally, consider adding a docstring to describe the function's purpose and parameters:

```diff
 def get_text_block(text_list: list[Any], begin_pos: int, end_pos: int) -> str:
+    """Extract body text from a slice of the text_list, excluding furniture and section headers.
+
+    Args:
+        text_list: List of text objects with content_layer, label, and text attributes
+        begin_pos: Starting index (inclusive)
+        end_pos: Ending index (exclusive)
+
+    Returns:
+        Concatenated body text from the specified range
+    """
```
40-56: Add type hints to improve clarity.

The extraction logic is sound, and the use of `strict=True` on line 51 (addressed from previous review) properly validates that position lists align. Adding type hints would improve the function signature's readability.

Apply this diff:

```diff
-def extract_qa_pairs(data):
+def extract_qa_pairs(data: dict[str, Any]) -> list[tuple[str, str, str]]:
```
58-63: Add type hints for consistency.

The CSV writing logic is correct with proper UTF-8 encoding. Adding type hints would maintain consistency with `get_text_block`.

Apply this diff:

```diff
-def save_to_csv(qa_pairs, output_file):
+def save_to_csv(qa_pairs: list[tuple[str, str, str]], output_file: str) -> None:
```
65-77: Consider adding error handling for better user experience.

The CLI flow is correct, but the script will crash with Python tracebacks if the input file is missing, JSON is malformed, or write permissions are denied. For a utility script this may be acceptable, but wrapping the core logic in try-except would provide clearer error messages.

Example:

```python
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python script.py <input_json_file> <output_csv_file>")
        sys.exit(1)

    input_file = sys.argv[1]
    output_file = sys.argv[2]

    try:
        data = json_util.read_json_file(input_file)
        qa_pairs = extract_qa_pairs(data)
        save_to_csv(qa_pairs, output_file)
        print(f"QA pairs have been successfully saved to {output_file}")
    except FileNotFoundError as e:
        print(f"Error: Input file not found - {e}")
        sys.exit(1)
    except (json.JSONDecodeError, KeyError, ValueError) as e:
        print(f"Error: Failed to process input - {e}")
        sys.exit(1)
    except OSError as e:
        print(f"Error: Failed to write output - {e}")
        sys.exit(1)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
notebooks/teigaku_genzei/json_qa.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
notebooks/teigaku_genzei/json_qa.py (1)
notebooks/teigaku_genzei/json_util.py (5)
is_title(6-7)is_toc(9-10)is_qhead(15-16)is_ahead(18-19)read_json_file(21-24)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Summary
🔇 Additional comments (1)
notebooks/teigaku_genzei/json_qa.py (1)
7-14: LGTM! Imports are clean and appropriate for the script's functionality.
Hi @alimaredia! Any comments on this PR?
Description
This example is located in `scripts/teigaku_genzei/`. I created this new directory.

This example contains a set of Python scripts to:
All the shell scripts in this directory are examples of invoking the command-line interfaces of the above Python scripts. Their usage is described in detail in the README.
The evaluation code depends on Unitxt with a custom catalog. The custom catalog is stored in `local_catalog/`.

The source documents are not included in this directory, as described in the README. Each user needs to comply with the terms of use of the NTA (National Tax Agency) of Japan and download them on their own.
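For example, an evaluation run could plausibly register the bundled catalog before loading the metrics (a sketch, not taken from the scripts themselves; `register_local_catalog` is Unitxt's documented API for this, and the path is an assumption based on the directory layout above):

```python
# Sketch: point Unitxt at the custom catalog shipped with this example.
from unitxt import register_local_catalog

register_local_catalog("scripts/teigaku_genzei/local_catalog")
```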
How Has This Been Tested?
Testing was done manually by following the instructions in the README.

We tested this example in the following environment.
This example does not alter sdg_hub or other examples in this repo. It is provided as an independent example.
Merge criteria:
Summary by CodeRabbit
New Features
Documentation
Automation
Chores
✏️ Tip: You can customize this high-level summary in your review settings.