DissBench: Evaluating LLM-written Abstracts for Long Scientific Documents

Master's Thesis

A Python-based benchmarking tool for evaluating Large Language Models' ability to generate abstracts for full-length scientific dissertations (long-context inputs of up to 128K tokens). It uses vLLM for efficient inference and an LLM judge for evaluation.

The benchmark builds on a dataset of dissertations and their author-written abstracts. First, the candidate LLM is asked to write an abstract for a dissertation. The judge LLM then evaluates the candidate abstract against the original author's abstract, scoring both content and presentation across six criteria (Objectives, Methods, Results, Conclusions, Coherence, and Academic standards).

Important Note: The DissBench dataset is required. Unfortunately, it cannot be published here, as the Deutsche Nationalbibliothek restricts redistribution of its Open Access dissertations.

Features

  • Fast inference using vLLM with configurable model parameters
  • Support for both local models (via vLLM) and API-based models (via LiteLLM)
  • Batched processing with progress tracking and checkpointing
  • Configurable generation and evaluation parameters
  • Model-specific configurations via JSON files

Installation

Note: Requires Python 3.11 or 3.12

  1. Clone this repository
  2. Install dependencies:
pip install -r requirements.txt
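
It is recommended to install into a virtual environment, for example:

python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt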

Usage

Generating Abstracts

Generate abstracts with a local model (served via vLLM):

python generate.py \
    --model "Qwen/Qwen2.5-7B-Instruct" \
    --output-dir "results" \
    --batch-size 8 \
    --temperature 0.7 \
    --dataset-path "path/to/dataset.parquet"

For API-based models (OpenAI, Anthropic, etc.), use the --api flag:

python generate.py \
    --model "openai/gpt-4o" \
    --output-dir "results" \
    --api
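
LiteLLM typically reads provider credentials from environment variables, so make sure the relevant API key is set before running, e.g. (placeholder values):

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."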

Generation Parameters

  • --model: Model name (prefix with provider name for API models, e.g., 'openai/', 'anthropic/')
  • --output-dir: Directory to save results (default: "results")
  • --batch-size: Batch size for inference (default: 1)
  • --temperature: Generation temperature (optional)
  • --max-tokens: Maximum tokens per generation (default: 1000)
  • --limit: Limit number of examples to process
  • --dataset-path: Path to local parquet dataset file
  • --api: Use API-based models via LiteLLM
  • --chat-template-variations: Test different chat template formats (choices: none, template, template+prefill, template+reasoning)
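
For example, a quick smoke test of a local model on a small subset, using one of the chat template variations, might look like this (model name and dataset path are placeholders):

python generate.py \
    --model "Qwen/Qwen2.5-7B-Instruct" \
    --dataset-path "path/to/dataset.parquet" \
    --output-dir "results" \
    --limit 10 \
    --chat-template-variations template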

Evaluating Results

Evaluate generated abstracts using an LLM judge:

python evaluate.py \
    --results-files "results/model_name_1.json" "results/model_name_2.json" \
    --judge-model "openai/gpt-4o" \
    --output-dir "evaluations" \
    --batch-size 1

Evaluation Parameters

  • --results-files: One or more paths to generation result files
  • --judge-model: Model to use for evaluation (default: "openai/gpt-4-turbo")
  • --output-dir: Directory to save evaluations (default: "evaluations")
  • --batch-size: Batch size for evaluation (default: 1)
  • --temperature: Temperature for judge (optional)
  • --max-tokens: Max output tokens for judge (default: 1024)
  • --api: Use API-based model via LiteLLM
  • --limit: Limit the number of items to evaluate
  • --reasoning: Enable step-by-step reasoning in evaluation
  • --comment: Add a comment to the evaluation results
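
For example, to spot-check a single results file with an API-based judge and step-by-step reasoning enabled (paths and the comment text are placeholders):

python evaluate.py \
    --results-files "results/model_name_1.json" \
    --judge-model "openai/gpt-4o" \
    --api \
    --limit 5 \
    --reasoning \
    --comment "smoke test with reasoning"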

Output Format

Generation Results (JSON)

{
    "generation_metadata": {
        "model_name": "model_name",
        "temperature": 0.7,
        "max_tokens": 1000,
        "use_chat_template": false,
        "timestamp": "...",
        "sample_size": 100
    },
    "generations": [
        {
            "idn": "...",
            "category": "...",
            "total_text_length": 1234,
            "total_token_length": 1000,
            "abstract_length": 250,
            "generated_abstract": "...",
            "original_abstract": "...",
            "reasoning_content": null,
            "url_dnb_archive": "..."
        }
    ]
}
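
A generation results file can be inspected with a few lines of Python; the sketch below only relies on the fields shown above (the file name is a placeholder):

import json

# Load a generation results file produced by generate.py
with open("results/model_name_1.json") as f:
    results = json.load(f)

meta = results["generation_metadata"]
generations = results["generations"]
print(f"{meta['model_name']}: {len(generations)} generations")

# Average character length of the generated abstracts
avg_len = sum(len(g["generated_abstract"]) for g in generations) / len(generations)
print(f"Average generated abstract length: {avg_len:.0f} characters")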

Evaluation Results (JSON)

{
    "generation_metadata": { ... },
    "evaluation_metadata": {
        "judge_model": "gpt-4-turbo",
        "temperature": null,
        "max_tokens": 1024,
        "batch_size": 1,
        "timestamp": "..."
    },
    "evaluations": [
        {
            "evaluation": {
                "scores": {
                    "Objectives": 1-10,
                    "Methods": 1-10,
                    "Results": 1-10,
                    "Conclusions": 1-10,
                    "Coherence": 1-10,
                    "Academic standards": 1-10
                },
                "comments": {
                    "Objectives": "...",
                    "Methods": "...",
                    "Results": "...",
                    "Conclusions": "...",
                    "Coherence": "...",
                    "Academic standards": "..."
                },
                "avg_score": 7.5
            },
            "evaluation_error": null,
            "idn": "...",
            "category": "...",
            "generated_abstract": "...",
            "original_abstract": "..."
        }
    ],
    "aggregate_statistics": {
        "Objectives": {"mean": 7.5, "min": 5, "max": 9, "std": 1.2},
        "Methods": { ... },
        "Results": { ... },
        "Conclusions": { ... },
        "Coherence": { ... },
        "Academic standards": { ... },
        "Average score": { ... }
    }
}
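
The per-criterion aggregate statistics can be summarized in the same way (again a sketch based on the structure above; the file name is a placeholder):

import json

# Load an evaluation file produced by evaluate.py
with open("evaluations/model_name_1.json") as f:
    evaluation = json.load(f)

print("Judge:", evaluation["evaluation_metadata"]["judge_model"])
for criterion, stats in evaluation["aggregate_statistics"].items():
    print(f"  {criterion}: mean={stats['mean']} "
          f"(min={stats['min']}, max={stats['max']}, std={stats['std']})")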
