Codebase for generating open-ended forecasting questions from news articles (used to build OpenForesight), scraping data from prediction markets, and RL training of language models on forecasting questions to produce models like OpenForecaster-8B. 69 GB of scraped news articles through January 2026 is available here: forecast-news.
Paper: Scaling Open-Ended Reasoning To Predict the Future
Blog: openforecaster.github.io
@misc{chandak2026scalingopenendedreasoningpredict,
title={Scaling Open-Ended Reasoning to Predict the Future},
author={Nikhil Chandak and Shashwat Goel and Ameya Prabhu and Moritz Hardt and Jonas Geiping},
year={2026},
eprint={2512.25070},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2512.25070},
}
Requirements: uv (pre-installed).
# Clone repository
git clone [REPOSITORY_URL]
# Automated setup (recommended)
./setup.sh
# Manual setup alternative
uv venv forecast && source forecast/bin/activate
uv pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
uv pip install -e .

End-to-end pipeline for generating forecasting questions from news articles, with quality filtering and leakage detection.
Complete Pipeline:
python qgen/run_pipeline.py \
--article_path qgen/sample_data/telegraph20.jsonl \
--output_dir ./output \
--use_openrouter \
--creator_model deepseek/deepseek-v3.2 \
--selector_model meta-llama/llama-4-maverick \
--num_q_per_article 3 \
--first_date 2025-01-01

What it does: Generates free-form questions, extracts resolution dates, filters by date and answer type, and removes temporal leakage.
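The date-based filtering step can be sketched as follows. This is a simplified illustration, not the repo's implementation; the `resolution_date` field matches the output format, while `keep_question` and the cutoff logic are assumptions:

```python
from datetime import date

def keep_question(q: dict, first_date: date) -> bool:
    """Simplified date filter: keep only questions that resolve
    on or after first_date, so answers lie past the cutoff."""
    resolution = date.fromisoformat(q["resolution_date"])
    return resolution >= first_date

questions = [
    {"question_title": "Who will win X in 2025?", "resolution_date": "2025-10-10"},
    {"question_title": "Who won Y in 2024?", "resolution_date": "2024-03-01"},
]
kept = [q for q in questions if keep_question(q, date(2025, 1, 1))]
print(len(kept))  # 1 question survives the cutoff
```

The real pipeline additionally checks answer types and scans question text for temporal leakage, which a pure date comparison cannot catch.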
Output Format:
{
"question_id": 482761,
"question_title": "Who will win the Nobel Prize in ... in 2025?",
"background": "Context...",
"resolution_criteria": "Source: Nobel Committee...",
"answer": "John Doe",
"answer_type": "Name",
"resolution_date": "2025-10-10"
}

Train language models using RL with VeRL on OpenForesight or custom datasets.
Prepare OpenForesight Data:
cd libraries/verl/examples/data_preprocess/forecasting
# Load full dataset
python load_foresight.py --split train --output_dir data/
# Or subsample
python load_foresight.py --split train --subsample 1000 --output_dir data/

Experiment with Training:
cd libraries/verl/scripts/ours/testrun/
cat README.md

Note: VeRL may have dependency conflicts, so you may need to create a separate environment for training.
Evaluate models locally using vLLM on various forecasting benchmarks.
Freeform Questions:
python custom_eval_scripts/eval_freeform_retrieval.py \
--model_dir /path/to/model \
--questions_file questions.jsonl \
--base_save_dir ./results \
--num_generations 3 \
--num_articles 0  # Set to >0 for retrieval

Binary Questions:
python custom_eval_scripts/eval_binary_retrieval.py \
--model_dir /path/to/model \
--questions_file questions.jsonl \
--base_save_dir ./results \
--num_generations 5 \
--num_articles 5  # Example with retrieval

Supported Benchmarks: Metaculus, Manifold, FutureBench, FutureX, MMLU-Pro, MATH, SimpleQA
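With `--num_generations` greater than 1, repeated samples must be aggregated into a single answer. One common choice is majority voting; this sketch is illustrative and not necessarily what the eval scripts do:

```python
from collections import Counter

def majority_vote(generations: list[str]) -> str:
    """Aggregate repeated samples into one answer by majority vote;
    ties break toward the first-seen answer (Counter keeps insertion order)."""
    counts = Counter(g.strip().lower() for g in generations)
    return counts.most_common(1)[0][0]

samples = ["Yes", "yes", "No", "Yes", "no"]
print(majority_vote(samples))  # "yes"
```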
Use an LLM judge to evaluate whether free-form model responses match ground-truth answers.
Basic Usage:
python local_judge/llm_judge.py \
--model_dir /path/to/judge/model \
--input_file responses.jsonl \
--output_dir ./results

Batch Processing:
python local_judge/llm_judge.py \
--model_dir /path/to/judge/model \
--input_dir ./results/ \
--output_dir ./judgments

Expected Input Format:
{
"question": "Question text",
"answer": "Ground truth answer",
"extracted_answer": ["Response 1", "Response 2"]
}

Output: Adds binary scores and probabilities for each response to the input file.
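Once the judge has annotated each record, per-question accuracy can be computed downstream. The `scores` field name here is an assumption about the judge output (one 0/1 score per entry in `extracted_answer`):

```python
record = {
    "question": "Question text",
    "answer": "Ground truth answer",
    "extracted_answer": ["Response 1", "Response 2"],
    "scores": [1, 0],  # hypothetical field added by the judge
}

def accuracy(rec: dict) -> float:
    """Fraction of judged responses that matched the ground truth."""
    return sum(rec["scores"]) / len(rec["scores"])

print(accuracy(record))  # 0.5
```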
Extract and process news articles from Common Crawl (27M+ articles, 150+ domains, 150GB+).
cd news
# Launch WARC extraction (requires news-please and htcondor setup)
python jobs_news.py \
--num_processes 1 \
--download_dir_warc /path/to/warc \
--download_dir_article /path/to/articles
# Convert to JSONL
python to_jsonl.py --input_dir extracted_articles/ --output_dir jsonl/
# Tokenize for BM25
python src/tokenize_for_rag.py --input_dir jsonl/ --output_dir tokenized/
# BM25 retrieval
python src/bm25_jsonl.py --articles_path articles.jsonl --questions_path questions.jsonl

KNN/BM25 retrieval pipeline for RAG-augmented forecasting.
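The BM25 step can be illustrated with a self-contained scorer. This is the textbook formula in stdlib Python, a sketch only; the real script's tokenization and parameters may differ:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the tokenized query with the
    standard BM25 formula: idf times a length-normalized, saturated tf."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter()  # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "nobel prize committee announces winner".split(),
    "football match ends in draw".split(),
]
query = "nobel prize winner".split()
scores = bm25_scores(query, docs)
print(scores.index(max(scores)))  # doc 0 is the best match
```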
# Basic usage
python embedding_retrieval/main_new.py --data-dir /path/to/data
# Custom configuration
python embedding_retrieval/main_new.py \
--data-dir /path/to/data \
--datasets metaculus manifold

Features: Document chunking, embedding caching, time-based filtering, KNN search with deduplication.
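Of the features above, document chunking is the simplest to sketch: fixed-size windows with overlap, so text straddling a boundary appears in both neighboring chunks. The sizes are illustrative, not the pipeline's actual defaults:

```python
def chunk_tokens(tokens, size=128, overlap=32):
    """Split a token list into fixed-size windows that overlap by
    `overlap` tokens, stepping by (size - overlap) each time."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = list(range(300))
chunks = chunk_tokens(tokens)
print(len(chunks), len(chunks[0]))  # 3 128
```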
Evaluate commercial models (GPT-4, Claude, Gemini) via the OpenRouter API.
python openrouter_evals/freeform_evals.py \
--questions_file questions.jsonl \
--models openai/gpt-5 \
--num_generations 1 \
--base_save_dir ./results
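OpenRouter exposes an OpenAI-style chat completions endpoint. A sketch of the kind of request the script presumably sends (constructed but not sent here; the system prompt is an assumption, not the repo's actual prompt):

```python
def build_request(model: str, question: str) -> dict:
    """Assemble an OpenAI-style chat completion payload for OpenRouter.
    The system prompt below is illustrative only."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a careful forecaster."},
            {"role": "user", "content": question},
        ],
    }

payload = build_request("openai/gpt-5", "Who will win the Nobel Prize in Physics in 2025?")
# This payload would be POSTed to https://openrouter.ai/api/v1/chat/completions
# with an "Authorization: Bearer $OPENROUTER_API_KEY" header.
print(payload["model"])  # openai/gpt-5
```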