[DRAFT] harness: Add Vidore V3 benchmark and BEIR metrics support #1378

Draft
jioffe502 wants to merge 4 commits into NVIDIA:main from jioffe502:vidore-v3-benchmark

Conversation

@jioffe502
Collaborator

Description

Adds Vidore V3 benchmark support and BEIR evaluation metrics to the test harness.

Changes

  • Add Vidore V3 dataset configurations with HuggingFace integration for ground truth
  • Add dataset groups feature for running multiple datasets (e.g., --dataset=vidore)
  • Add optional BEIR metrics (NDCG, MAP, Precision) for recall evaluation

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mirrored in the Helm values.yaml file?

jioffe502 and others added 3 commits February 3, 2026 23:31
- Add 8 Vidore V3 dataset configurations (finance_en, industrial,
  computer_science, pharmaceuticals, hr, energy, physics, finance_fr)
- Add vidore_load_ground_truth() using HuggingFace datasets API
- Add vidore_recall() evaluator with PDF-only matching
- Add extract_page_as_image, extract_method, image_elements_modality
  config options to support Vidore's OCR-based page image retrieval
- Add datasets>=2.0.0 dependency for HuggingFace qrels loading

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
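
The ground-truth and recall steps in this commit can be illustrated with a minimal sketch. It is not the PR's code: the column names ("query-id", "corpus-id", "score") follow the common BEIR qrels layout and are an assumption, as is the reading of "PDF-only matching" as comparing file stems while ignoring directories and extensions. In the harness the rows would presumably come from Hugging Face `datasets.load_dataset`; here they are passed in as plain dicts so the sketch runs offline.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def build_qrels(rows):
    """Group relevance rows into {query_id: {doc_id: score}}.

    Column names are assumed (BEIR-style); rows with score <= 0 are dropped.
    """
    qrels = defaultdict(dict)
    for row in rows:
        score = int(row["score"])
        if score > 0:
            qrels[str(row["query-id"])][str(row["corpus-id"])] = score
    return dict(qrels)


def pdf_stem(name):
    """Reduce a path or filename to its stem: 'a/b/report.pdf' -> 'report'."""
    return PurePosixPath(name).stem


def recall_at_k(qrels, retrieved, k):
    """Fraction of queries with at least one relevant PDF in the top-k hits.

    `retrieved` maps query_id -> ranked list of source filenames (assumed shape).
    """
    if not qrels:
        return 0.0
    hits = 0
    for query_id, relevant in qrels.items():
        relevant_pdfs = {pdf_stem(doc_id) for doc_id in relevant}
        top_k = [pdf_stem(doc) for doc in retrieved.get(query_id, [])[:k]]
        if relevant_pdfs.intersection(top_k):
            hits += 1
    return hits / len(qrels)


# Hypothetical rows and retrieval results, for illustration only.
rows = [
    {"query-id": "q1", "corpus-id": "report.pdf", "score": 1},
    {"query-id": "q2", "corpus-id": "manual.pdf", "score": 1},
    {"query-id": "q2", "corpus-id": "other.pdf", "score": 0},
]
qrels = build_qrels(rows)
retrieved = {"q1": ["data/report.pdf"], "q2": ["x.pdf", "manual.pdf"]}
print(recall_at_k(qrels, retrieved, k=1), recall_at_k(qrels, retrieved, k=2))
```
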
- Add dataset_groups section to test_configs.yaml with vidore, vidore_english, vidore_quick groups
- Add expand_dataset_names() in config.py to handle group expansion
- Add --list-datasets CLI option to show available datasets and groups
- Update README.md with dataset groups documentation

Usage:
  uv run nv-ingest-harness-run --list-datasets
  uv run nv-ingest-harness-run --case=e2e_recall --dataset=vidore
  uv run nv-ingest-harness-run --case=e2e_recall --dataset=vidore_quick

Note: test_configs.yaml includes temp test settings (vdb_backend: milvus,
reranker_mode: none, modified vidore_quick) - revert after testing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
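
Group expansion as described in this commit might look roughly like the sketch below. The function name matches the commit message, but the signature, the group contents, and the de-duplication behaviour are assumptions made for illustration; the real `expand_dataset_names()` in config.py may differ.

```python
def expand_dataset_names(names, groups):
    """Replace group names with their member datasets, keeping first-seen order.

    `names` is a list of dataset or group names; `groups` maps a group name to
    its member datasets. Names that are not groups pass through unchanged.
    """
    expanded = []
    seen = set()
    for name in names:
        for member in groups.get(name, [name]):
            if member not in seen:
                seen.add(member)
                expanded.append(member)
    return expanded


# Hypothetical group definitions; the actual members live in test_configs.yaml.
groups = {
    "vidore_quick": ["hr", "energy"],
    "vidore": ["hr", "energy", "physics"],
}
print(expand_dataset_names(["vidore_quick", "physics", "vidore"], groups))
```

De-duplicating while preserving order means that overlapping groups, such as a quick subset and the full set, would run each dataset only once.
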
- Add optional BEIR evaluation (NDCG, MAP, Precision) to recall tests
- Configurable via enable_beir in test_configs.yaml or ENABLE_BEIR env var
- Add beir>=2.0.0 dependency to harness
- Add nvidia/llama-nemotron-embed-vl-1b-v2 to known embedding models

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
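
As a rough illustration of what this commit enables, the sketch below shows a toggle and two of the metrics in plain Python. It does not use the beir package that the harness depends on, and the precedence (environment variable over config value) and the accepted truthy strings are assumptions. NDCG is written in its linear-gain form.

```python
import math
import os


def beir_enabled(config, env=None):
    """Decide whether BEIR metrics run; ENABLE_BEIR is assumed to override config."""
    env = os.environ if env is None else env
    raw = env.get("ENABLE_BEIR")
    if raw is not None:
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    return bool(config.get("enable_beir", False))


def ndcg_at_k(relevance, ranked, k):
    """Linear-gain NDCG@k for one query.

    `relevance` maps doc_id -> graded relevance; `ranked` is the retrieved order.
    """
    dcg = sum(
        relevance.get(doc, 0) / math.log2(rank + 2)
        for rank, doc in enumerate(ranked[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def precision_at_k(relevance, ranked, k):
    """Share of the top-k positions filled by relevant documents."""
    if k <= 0:
        return 0.0
    relevant_hits = sum(1 for doc in ranked[:k] if relevance.get(doc, 0) > 0)
    return relevant_hits / k


# Hypothetical judgments and ranking for a single query.
relevance = {"d1": 1, "d2": 1}
ranked = ["d1", "x", "d2"]
if beir_enabled({"enable_beir": True}, env={}):
    print(ndcg_at_k(relevance, ranked, 3), precision_at_k(relevance, ranked, 3))
```
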
jioffe502 requested a review from a team as a code owner February 5, 2026 18:12
jioffe502 requested review from ChrisJar, charlesbluca and drobison00 and removed request for drobison00 February 5, 2026 18:12
jioffe502 marked this pull request as draft February 5, 2026 18:13
- Add embed model fallback detection (dim=1024 warning) to e2e.py and recall.py
- Add Milvus collection vector dimension verification after ingestion
- Enable BEIR metrics by default for all Vidore V3 datasets

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
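
The commit message does not spell out how the two dimension checks work. One plausible reading, sketched below, is that a 1024-dimensional embedding signals a fallback model when a different dimension was expected, and that the collection's vector dimension should match the expected one after ingestion. The function name, its arguments, and the example dimension 2048 are all hypothetical; in the harness the collection dimension would be read from Milvus rather than passed in.

```python
FALLBACK_DIM = 1024  # the dimension named in the commit message


def embedding_dim_warnings(embedding_dim, collection_dim, expected_dim):
    """Return human-readable warnings for suspicious vector dimensions.

    All three arguments are plain integers; how they are obtained is out of
    scope for this sketch.
    """
    warnings = []
    if embedding_dim == FALLBACK_DIM and expected_dim != FALLBACK_DIM:
        warnings.append(
            f"embedding dim is {embedding_dim}; a fallback embedding model "
            f"may be in use (expected {expected_dim})"
        )
    if collection_dim != expected_dim:
        warnings.append(
            f"collection vector dim is {collection_dim}, expected {expected_dim}"
        )
    return warnings


# Hypothetical values: a configured model expected to emit 2048-dim vectors.
for message in embedding_dim_warnings(1024, 1024, expected_dim=2048):
    print("WARNING:", message)
```
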