# Master Thesis
A Python-based benchmarking tool for evaluating Large Language Models' ability to generate abstracts for scientific dissertations. It uses vLLM for efficient inference and an LLM judge for evaluation.

The benchmark works on a dataset of dissertations and their corresponding abstracts. First, the candidate LLM is asked to write an abstract for a dissertation. Then the judge LLM evaluates the candidate abstract against the original author's abstract, both content-wise and presentation-wise.
**Important Note:** The DissBench dataset is required. Unfortunately, it cannot be published here, as the Deutsche Nationalbibliothek restricts redistribution of its Open Access dissertations.
## Features

- Fast inference using vLLM with configurable model parameters
- Support for both local models (via vLLM) and API-based models (via LiteLLM)
- Batched processing with progress tracking and checkpointing
- Configurable generation and evaluation parameters
- Model-specific configurations via JSON files (see the sketch below)
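The layout of these config files is not documented in this README. Purely as an illustration, a per-model config might carry vLLM engine arguments; the file name and keys below are assumptions, not the tool's actual schema:

```python
import json

# Hypothetical contents of a per-model config file (e.g. "configs/my-model.json").
# The key names mirror common vLLM engine arguments; the tool's real schema may differ.
example_config = {
    "max_model_len": 8192,      # context length to allocate
    "tensor_parallel_size": 1,  # GPUs to shard the model across
    "dtype": "bfloat16",        # weight/activation precision
}

# Write it out the way such a config would live on disk.
with open("example_config.json", "w") as f:
    json.dump(example_config, f, indent=2)
```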
## Installation

Note: Requires Python 3.11 or 3.12.

- Clone this repository
- Install dependencies:
```bash
pip install -r requirements.txt
```

## Usage

### Generating Abstracts

Generate abstracts using a model:

```bash
python generate.py \
--model "Qwen/Qwen2.5-7B-Instruct" \
--output-dir "results" \
--batch-size 8 \
--temperature 0.7 \
--dataset-path "path/to/dataset.parquet"
```

For API-based models (OpenAI, Anthropic, etc.), use the `--api` flag:

```bash
python generate.py \
--model "openai/gpt-4o" \
--output-dir "results" \
--api
```

#### Options

- `--model`: Model name (prefix with the provider name for API models, e.g. `openai/`, `anthropic/`)
- `--output-dir`: Directory to save results (default: `results`)
- `--batch-size`: Batch size for inference (default: 1)
- `--temperature`: Generation temperature (optional)
- `--max-tokens`: Maximum tokens per generation (default: 1000)
- `--limit`: Limit the number of examples to process
- `--dataset-path`: Path to a local Parquet dataset file
- `--api`: Use API-based models via LiteLLM
- `--chat-template-variations`: Test different chat template formats (choices: `none`, `template`, `template+prefill`, `template+reasoning`)
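The dataset itself is a Parquet file pairing dissertations with their original abstracts. A quick way to inspect it before a run, assuming it loads with pandas (no column names are assumed here; the snippet prints the real schema):

```python
import pandas as pd

# Path is whatever you would pass to generate.py via --dataset-path.
df = pd.read_parquet("path/to/dataset.parquet")

# Inspect the actual schema rather than assuming column names.
print(df.columns.tolist())
print(f"{len(df)} rows")
```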
### Evaluating Abstracts

Evaluate generated abstracts using an LLM judge:

```bash
python evaluate.py \
--results-files "results/model_name_1.json" "results/model_name_2.json" \
--judge-model "openai/gpt-4o" \
--output-dir "evaluations" \
--batch-size 1
```

#### Options

- `--results-files`: One or more paths to generation result files
- `--judge-model`: Model to use for evaluation (default: `openai/gpt-4-turbo`)
- `--output-dir`: Directory to save evaluations (default: `evaluations`)
- `--batch-size`: Batch size for evaluation (default: 1)
- `--temperature`: Temperature for the judge (optional)
- `--max-tokens`: Maximum output tokens for the judge (default: 1024)
- `--api`: Use an API-based model via LiteLLM
- `--limit`: Limit the number of items to evaluate
- `--reasoning`: Enable step-by-step reasoning in the evaluation
- `--comment`: Add a comment to the evaluation results
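Since `--results-files` accepts multiple paths, one way to judge every generation run in a single invocation is to glob the results directory and assemble the command programmatically. A minimal sketch, using only the flags documented above (judge model and directories are illustrative):

```python
import glob
import subprocess

# Every generation result file produced by generate.py.
results = sorted(glob.glob("results/*.json"))

# Judge them all in one evaluate.py invocation.
subprocess.run(
    [
        "python", "evaluate.py",
        "--results-files", *results,
        "--judge-model", "openai/gpt-4o",
        "--output-dir", "evaluations",
        "--api",
    ],
    check=True,
)
```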
## Output Format

### Generation Output

```json
{
  "generation_metadata": {
    "model_name": "model_name",
    "temperature": 0.7,
    "max_tokens": 1000,
    "use_chat_template": false,
    "timestamp": "...",
    "sample_size": 100
  },
  "generations": [
    {
      "idn": "...",
      "category": "...",
      "total_text_length": 1234,
      "total_token_length": 1000,
      "abstract_length": 250,
      "generated_abstract": "...",
      "original_abstract": "...",
      "reasoning_content": null,
      "url_dnb_archive": "..."
    }
  ]
}
```
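A generation results file can be post-processed directly; for example, to compare the lengths of generated and original abstracts per dissertation (the file name below is illustrative):

```python
import json

# File name is illustrative; generate.py writes its results into --output-dir.
with open("results/model_name_1.json") as f:
    run = json.load(f)

print(run["generation_metadata"]["model_name"])
for g in run["generations"]:
    # Compare generated vs. original abstract lengths per dissertation.
    print(g["idn"], len(g["generated_abstract"]), len(g["original_abstract"]))
```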
"generation_metadata": { ... },
"evaluation_metadata": {
"judge_model": "gpt-4-turbo",
"temperature": null,
"max_tokens": 1024,
"batch_size": 1,
"timestamp": "..."
},
"evaluations": [
{
"evaluation": {
"scores": {
"Objectives": 1-10,
"Methods": 1-10,
"Results": 1-10,
"Conclusions": 1-10,
"Coherence": 1-10,
"Academic standards": 1-10
},
"comments": {
"Objectives": "...",
"Methods": "...",
"Results": "...",
"Conclusions": "...",
"Coherence": "...",
"Academic standards": "..."
},
"avg_score": 7.5
},
"evaluation_error": null,
"idn": "...",
"category": "...",
"generated_abstract": "...",
"original_abstract": "..."
}
],
"aggregate_statistics": {
"Objectives": {"mean": 7.5, "min": 5, "max": 9, "std": 1.2},
"Methods": { ... },
"Results": { ... },
"Conclusions": { ... },
"Coherence": { ... },
"Academic standards": { ... },
"Average score": { ... }
}
}
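After judging, per-criterion means can be read straight from `aggregate_statistics`, or recomputed from the individual evaluations as a sanity check. A minimal sketch against the schema above (the file name is illustrative):

```python
import json
from statistics import mean

# File name is illustrative; evaluate.py writes into --output-dir.
with open("evaluations/model_name_1.json") as f:
    report = json.load(f)

# Per-criterion statistics as computed by evaluate.py.
for criterion, stats in report["aggregate_statistics"].items():
    print(f"{criterion}: mean={stats['mean']}")

# Cross-check: recompute the overall average from the individual items,
# skipping any item where the judge call failed.
scores = [
    e["evaluation"]["avg_score"]
    for e in report["evaluations"]
    if e.get("evaluation_error") is None
]
print("recomputed average:", mean(scores))
```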