DissBench: Evaluating LLM-written Abstracts for Long Scientific Documents

Master's Thesis

A Python-based benchmarking tool for evaluating Large Language Models' ability to generate abstracts for full-length scientific dissertations (long-context inputs of up to 128K tokens). It uses vLLM for efficient inference and an LLM judge for evaluation.

The benchmark builds on a dataset of dissertations and their author-written abstracts. First, the candidate LLM is asked to write an abstract for a dissertation. The judge LLM then evaluates the candidate abstract against the original author's abstract, scoring both content and presentation across six criteria (Objectives, Methods, Results, Conclusions, Coherence, and Academic standards).

Important Note: The DissBench dataset is required. Unfortunately, it cannot be published here, as the Deutsche Nationalbibliothek restricts redistribution of its Open Access dissertations.

Features

  • Fast inference using vLLM with configurable model parameters
  • Support for both local models (via vLLM) and API-based models (via LiteLLM)
  • Batched processing with progress tracking and checkpointing
  • Configurable generation and evaluation parameters
  • Model-specific configurations via JSON files

Installation

Note: Requires Python 3.11 or 3.12

  1. Clone this repository
  2. Install dependencies:
pip install -r requirements.txt
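
It is recommended to install into a virtual environment, for example:

python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt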

Usage

Generating Abstracts

Generate abstracts with a local model (served via vLLM):

python generate.py \
    --model "Qwen/Qwen2.5-7B-Instruct" \
    --output-dir "results" \
    --batch-size 8 \
    --temperature 0.7 \
    --dataset-path "path/to/dataset.parquet"

For API-based models (OpenAI, Anthropic, etc.), use the --api flag:

python generate.py \
    --model "openai/gpt-4o" \
    --output-dir "results" \
    --api
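
LiteLLM typically reads provider credentials from environment variables, so make sure the relevant API key is set before running, e.g. (placeholder values):

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."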

Generation Parameters

  • --model: Model name (prefix with provider name for API models, e.g., 'openai/', 'anthropic/')
  • --output-dir: Directory to save results (default: "results")
  • --batch-size: Batch size for inference (default: 1)
  • --temperature: Generation temperature (optional)
  • --max-tokens: Maximum tokens per generation (default: 1000)
  • --limit: Limit number of examples to process
  • --dataset-path: Path to local parquet dataset file
  • --api: Use API-based models via LiteLLM
  • --chat-template-variations: Test different chat template formats (choices: none, template, template+prefill, template+reasoning)
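
For example, a quick smoke test of a local model on a small subset, using one of the chat template variations, might look like this (model name and dataset path are placeholders):

python generate.py \
    --model "Qwen/Qwen2.5-7B-Instruct" \
    --dataset-path "path/to/dataset.parquet" \
    --output-dir "results" \
    --limit 10 \
    --chat-template-variations template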

Evaluating Results

Evaluate generated abstracts using an LLM judge:

python evaluate.py \
    --results-files "results/model_name_1.json" "results/model_name_2.json" \
    --judge-model "openai/gpt-4o" \
    --output-dir "evaluations" \
    --batch-size 1

Evaluation Parameters

  • --results-files: One or more paths to generation result files
  • --judge-model: Model to use for evaluation (default: "openai/gpt-4-turbo")
  • --output-dir: Directory to save evaluations (default: "evaluations")
  • --batch-size: Batch size for evaluation (default: 1)
  • --temperature: Temperature for judge (optional)
  • --max-tokens: Max output tokens for judge (default: 1024)
  • --api: Use API-based model via LiteLLM
  • --limit: Limit the number of items to evaluate
  • --reasoning: Enable step-by-step reasoning in evaluation
  • --comment: Add a comment to the evaluation results
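
For example, to spot-check a single results file with an API-based judge and step-by-step reasoning enabled (paths and the comment text are placeholders):

python evaluate.py \
    --results-files "results/model_name_1.json" \
    --judge-model "openai/gpt-4o" \
    --api \
    --limit 5 \
    --reasoning \
    --comment "smoke test with reasoning"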

Output Format

Generation Results (JSON)

{
    "generation_metadata": {
        "model_name": "model_name",
        "temperature": 0.7,
        "max_tokens": 1000,
        "use_chat_template": false,
        "timestamp": "...",
        "sample_size": 100
    },
    "generations": [
        {
            "idn": "...",
            "category": "...",
            "total_text_length": 1234,
            "total_token_length": 1000,
            "abstract_length": 250,
            "generated_abstract": "...",
            "original_abstract": "...",
            "reasoning_content": null,
            "url_dnb_archive": "..."
        }
    ]
}
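
A generation results file can be inspected with a few lines of Python; the sketch below only relies on the fields shown above (the file name is a placeholder):

import json

# Load a generation results file produced by generate.py
with open("results/model_name_1.json") as f:
    results = json.load(f)

meta = results["generation_metadata"]
generations = results["generations"]
print(f"{meta['model_name']}: {len(generations)} generations")

# Average character length of the generated abstracts
avg_len = sum(len(g["generated_abstract"]) for g in generations) / len(generations)
print(f"Average generated abstract length: {avg_len:.0f} characters")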

Evaluation Results (JSON)

{
    "generation_metadata": { ... },
    "evaluation_metadata": {
        "judge_model": "gpt-4-turbo",
        "temperature": null,
        "max_tokens": 1024,
        "batch_size": 1,
        "timestamp": "..."
    },
    "evaluations": [
        {
            "evaluation": {
                "scores": {
                    "Objectives": 1-10,
                    "Methods": 1-10,
                    "Results": 1-10,
                    "Conclusions": 1-10,
                    "Coherence": 1-10,
                    "Academic standards": 1-10
                },
                "comments": {
                    "Objectives": "...",
                    "Methods": "...",
                    "Results": "...",
                    "Conclusions": "...",
                    "Coherence": "...",
                    "Academic standards": "..."
                },
                "avg_score": 7.5
            },
            "evaluation_error": null,
            "idn": "...",
            "category": "...",
            "generated_abstract": "...",
            "original_abstract": "..."
        }
    ],
    "aggregate_statistics": {
        "Objectives": {"mean": 7.5, "min": 5, "max": 9, "std": 1.2},
        "Methods": { ... },
        "Results": { ... },
        "Conclusions": { ... },
        "Coherence": { ... },
        "Academic standards": { ... },
        "Average score": { ... }
    }
}
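
The per-criterion aggregate statistics can be summarized in the same way (again a sketch based on the structure above; the file name is a placeholder):

import json

# Load an evaluation file produced by evaluate.py
with open("evaluations/model_name_1.json") as f:
    evaluation = json.load(f)

print("Judge:", evaluation["evaluation_metadata"]["judge_model"])
for criterion, stats in evaluation["aggregate_statistics"].items():
    print(f"  {criterion}: mean={stats['mean']} "
          f"(min={stats['min']}, max={stats['max']}, std={stats['std']})")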
