Improving semantic search. We use Llama-3.1 to generate synthetic data for finetuning an embeddings model.

Jargonaut

Jargonaut is a repository illustrating how to:

  • Generate synthetic Q&A data from domain-specific text (using a Large Language Model)
  • Finetune an embeddings model on that data for improved semantic retrieval of jargon-heavy content
  • Build a minimal Flask web app to demonstrate the search experience end-to-end

By leveraging Hydra for centralised configuration, LLMs for question generation, and SentenceTransformers for embeddings finetuning, Jargonaut offers a practical, domain-adaptive approach to searching specialised documents.

This repository has been tested on Ubuntu 22.04. Running the Llama-3.1 generator locally requires a CUDA-capable GPU; the default settings work well on a system with a 24 GB GPU.

Table of Contents

  1. Overview
  2. Repository Structure
  3. Setup
  4. Configuring the Pipeline
  5. Usage
  6. Additional Notes

Overview

Many organisations handle jargon-heavy or domain-specific documents where off-the-shelf embeddings fall short. Jargonaut aims to bridge that gap by:

  1. Creating Synthetic Q&A: We prompt an LLM to generate realistic user queries for each document.
  2. Finetuning: We train an embeddings model on these Q&A pairs, enhancing its understanding of specialised terminology.
  3. Building Document Embeddings: We compute embeddings for an entire corpus using our finetuned model.
  4. Serving a Local Search Demo: A Flask-based UI allows users to enter a query and see top-matching documents (with truncated text, metadata, etc.).

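Concretely, step 1 boils down to prompting the LLM with each chunk and parsing numbered questions out of its reply. A rough sketch of that shape (the prompt wording and `num_questions` default here are illustrative; the repo's real templates live in src/prompts/):

```python
import re

def build_qa_prompt(chunk: str, num_questions: int = 2) -> str:
    # Illustrative prompt; the real templates live in src/prompts/.
    return (
        f"Context:\n{chunk}\n\n"
        f"Write {num_questions} questions a user might ask that this "
        "context answers. Number them 1., 2., ..."
    )

def parse_questions(llm_reply: str) -> list[str]:
    # Keep lines that start with "1.", "2.", ... and strip the numbering.
    return [m.group(1).strip()
            for m in re.finditer(r"^\s*\d+\.\s*(.+)$", llm_reply, re.MULTILINE)]
```

In the pipeline, each parsed question is paired with its source chunk to form the finetuning dataset.
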
This repo builds on the success of a project by the Incubator for Artificial Intelligence (I.AI), using a Hugging Face dataset of government policy documents. We improve data security by running an open-source LLM (Llama-3.1) locally, and improve the relevance of generated Q&A pairs with a custom version of the generate_qa_embedding_pairs function from llama-index. Aside from the dataset, this entire repository is created from scratch.

Results

To view example Q&A pairs generated by Jargonaut compared to the baseline I.AI method, see our examples document.

Repository Structure

Below is a simplified view of the directory layout. Some subdirectories contain additional scripts, logs, or metadata files not shown in detail:

jargonaut/
├── environment.yml
├── govuk-policy-qa-pairs/
│   ├── data/
│   └── policy_papers.json
├── llama-models/            <-- See setup section
│   ├── models/...
│   └── ...
├── README.md                <-- You are here
└── src/
    ├── build_embeddings.py  <-- Script to build & save embeddings
    ├── config/
    │   └── config.yaml      <-- Main Hydra config
    ├── demo_server.py       <-- Flask app for searching
    ├── finetune.py          <-- Finetunes embeddings
    ├── generate_qa.py       <-- Synthetic Q&A generation
    ├── outputs/             <-- Timestamped output directories
    ├── prepare_data.py      <-- Text chunking
    ├── prompts/
    │   ├── custom_qa_generate_prompt.txt
    │   └── custom_qa_generate_system.txt
    ├── run_all.py           <-- Orchestrates multi-step pipeline
    └── utils/
        ├── data_utils.py
        ├── llm_utils.py
        ├── logger_utils.py
        └── ...

Key Directories

  • govuk-policy-qa-pairs/: Main data directory containing government policy documents.
  • llama-models/: Contains local Llama code and references (if you use the Llama-based generator).
  • src/outputs/: Where each script’s run logs, config snapshots, and artifacts (like train_dataset.json, finetuned_model/, doc_embeddings.pkl) are stored in timestamped folders.

Setup

  1. Clone this repository:
git clone https://github.com/hpfield/jargonaut.git
cd jargonaut

2. Create and activate the conda environment:

conda env create -f environment.yml
conda activate jargonaut

This installs PyTorch, SentenceTransformers, Flask, Hydra, and other dependencies listed in environment.yml.

3. Unzip data in govuk-policy-qa-pairs:

cd govuk-policy-qa-pairs
gunzip policy_papers.json.gz

4. Install Llama-3.1:

Clone the llama-models git repo into the root of this repo and follow its installation instructions.

When you reach the Meta Llama downloads page, request access to Llama 3.1 (405B, 70B & 8B). For easiest integration with this project, accept the default suggestion to store the .llama directory in your home (~) directory.

5. Adjust any paths in src/config/config.yaml:

llm:
  ckpt_dir: ${oc.env:HOME}/.llama/checkpoints/Meta-Llama3.1-8B-Instruct
  tokenizer_path: ../llama-models/models/llama3/api/tokenizer.model
data_preparation:
  raw_data_file: "../govuk-policy-qa-pairs/policy_papers.json"
  small_file_path: "../govuk-policy-qa-pairs/policy_papers_small.json"
  output_file_path: "../govuk-policy-qa-pairs/policy_papers_truncated.json"

Configuring the Pipeline

This repository uses Hydra to manage configuration from a single YAML file, located by default at src/config/config.yaml. When you run any of the scripts (e.g., prepare_data.py, generate_qa.py, etc.) or use run_all.py, Hydra automatically loads this config.

Below is an example of the current config.yaml:

defaults:
- override hydra/job_logging: none
- override hydra/hydra_logging: none

hydra:
  run:
    dir: .

data_preparation:
  max_token_threshold: 3000
  token_limit: 4000
  token_overlap: 500
  raw_data_file: "../govuk-policy-qa-pairs/policy_papers.json"
  small_file_path: "../govuk-policy-qa-pairs/policy_papers_small.json"
  output_file_path: "../govuk-policy-qa-pairs/policy_papers_truncated.json"

llm:
  ckpt_dir: ${oc.env:HOME}/.llama/checkpoints/Meta-Llama3.1-8B-Instruct
  max_batch_size: 4
  max_seq_len: 4096
  model_parallel_size: null
  temperature: 0.6
  tokenizer_path: ../llama-models/models/llama3/api/tokenizer.model
  top_p: 0.9

paths:
  data_file: ../govuk-policy-qa-pairs/policy_papers_truncated.json
  doc_embeddings_path: outputs/build_embeddings/24-12-22_21-16-07/doc_embeddings.pkl
  finetuned_model_path: outputs/finetune/24-12-22_21-19-32/finetuned_model
  output_dir: outputs
  prompt_file: prompts/custom_qa_generate_prompt.txt
  system_prompt_file: prompts/custom_qa_generate_system.txt
  train_dataset_path: outputs/generate_qa/24-12-22_21-19-10/train_dataset.json
  val_dataset_path: outputs/generate_qa/24-12-22_21-19-10/val_dataset.json

server:
  debug: false
  host: 127.0.0.1
  port: 5000

training:
  epochs: 5
  num_questions_per_chunk: 2
  on_failure: continue
  random_state: 42
  retry_limit: 3
  save_every: 500
  test_size: 0.2
  train_subset_size: null
  val_subset_size: null

How It Works

  1. data_preparation
    • max_token_threshold: Maximum tokens allowed before splitting a document into chunks.
    • token_limit & token_overlap: Control how we chunk large documents in prepare_data.py. Documents exceeding max_token_threshold get split into smaller parts, each capped at token_limit tokens (with some overlap).
    • raw_data_file, small_file_path, output_file_path: Points to the files used or produced by the data-preparation stage. For instance, prepare_data.py will read from raw_data_file, create small_file_path with documents under the threshold, and generate output_file_path for the final truncated dataset.
  2. llm
    • Defines model-related settings like ckpt_dir (checkpoint location), max_seq_len, temperature, etc.
    • Notably uses ${oc.env:HOME} in ckpt_dir, which means Hydra will expand the HOME environment variable to locate LLM checkpoints.
  3. paths
    • Points to key data artifacts like data_file (the truncated dataset used for subsequent scripts), doc_embeddings_path, finetuned_model_path, etc.
    • output_dir is the base directory where run logs and artifacts are stored in timestamped subfolders.
  4. server
    • Controls Flask server parameters (host, port, debug). This is used by demo_server.py when displaying search results in a browser.
  5. training
    • Hyperparameters for finetuning and generation steps, such as epochs, num_questions_per_chunk for Q&A generation, retry_limit for LLM queries, and test_size for train/val splitting.
    • train_subset_size and val_subset_size can be set to numeric values if you want to limit data for quick tests; null means the full dataset is used.
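
The chunking scheme in data_preparation amounts to a sliding window over tokens. This is a simplified illustration operating on an already-tokenised list (prepare_data.py, which uses the model tokenizer, is the authoritative version):

```python
def chunk_tokens(tokens: list[str], token_limit: int = 4000,
                 token_overlap: int = 500) -> list[list[str]]:
    """Split a token list into windows of at most token_limit tokens,
    with consecutive windows sharing token_overlap tokens."""
    chunks, start = [], 0
    step = token_limit - token_overlap
    while start < len(tokens):
        chunks.append(tokens[start:start + token_limit])
        start += step
    return chunks
```
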

Overriding Configuration

  • Command-Line Overrides: Hydra allows you to override any config parameter at runtime. For example:

    python run_all.py --stage=finetune training.epochs=10

    This sets training.epochs to 10 instead of 5, overriding the default in config.yaml.

  • Environment Variables: If a field references $HOME or uses syntax like ${oc.env:HOME}, Hydra will expand it using the current environment. You can change it by setting:

    export HOME=/path/to/your/home

    before running.

Usage

Data Preparation

prepare_data.py breaks large datapoints into multiple text chunks: any datapoint containing more than the max_token_threshold specified in the config is split into multiple datapoints. Setting an appropriate threshold prevents the LLM from running out of resources (a threshold of 3000 works well on a system with a 24 GB GPU).

cd src
python prepare_data.py

End-to-End Pipeline

run_all.py automates the entire process. For example, run everything in sequence:

cd src
python run_all.py --stage=all

This:

  1. Generates Q&A pairs from your data (via LLM).
  2. Finetunes a SentenceTransformers model on the Q&A.
  3. Builds embeddings for your entire corpus.
  4. Launches the demo_server to let you test queries in the browser.

Run a single stage if you only need that portion:

python run_all.py --stage=generate_qa
python run_all.py --stage=finetune
python run_all.py --stage=build_embeddings
python run_all.py --stage=demo_server

After each stage, pipeline_runner.py looks at the newly created timestamped output folder and updates your config with the relevant paths (e.g., finetuned_model_path).
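
Because the timestamped folder names (e.g. 24-12-22_21-19-32) sort chronologically as strings, picking up the latest run can be done along these lines (illustrative; the actual bookkeeping lives in pipeline_runner.py):

```python
from pathlib import Path

def latest_run_dir(stage_dir: str) -> Path:
    """Return the most recent timestamped subfolder of an outputs stage."""
    runs = [p for p in Path(stage_dir).iterdir() if p.is_dir()]
    # Names like 24-12-22_21-19-32 sort chronologically as plain strings.
    return max(runs, key=lambda p: p.name)
```
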

Individual Scripts

If you prefer a manual approach:

1. Generate Q&A with generate_qa.py:

cd src
python generate_qa.py

Creates train_dataset.json and val_dataset.json.

2. Finetune with finetune.py:

python finetune.py

Loads Q&A data, trains a SentenceTransformers embedding model, and saves it in finetuned_model/.

3. Build Embeddings with build_embeddings.py:

python build_embeddings.py

Encodes your entire corpus into embeddings (saved in doc_embeddings.pkl).
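
A rough sketch of what this stage produces (the pickle layout and content field are assumptions; the encoder is passed in as a plain function, e.g. a finetuned model's .encode method, so the snippet stands alone):

```python
import pickle
import numpy as np

def build_embeddings(docs: list[dict], encode, out_path: str) -> None:
    """encode: any function mapping list[str] -> 2D array of vectors."""
    vecs = np.asarray(encode([d["content"] for d in docs]), dtype=np.float32)
    # L2-normalise so cosine similarity is a plain dot product at query time.
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    with open(out_path, "wb") as f:
        pickle.dump({"docs": docs, "embeddings": vecs}, f)
```
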

4. Serve with demo_server.py:

python demo_server.py

Starts a Flask server on http://127.0.0.1:5000. Enter a query, see the top matches, truncated text, and metadata (headers/URLs).
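
At query time, the server only needs to embed the query and rank documents by cosine similarity. A minimal sketch, assuming the document embeddings were L2-normalised when built:

```python
import numpy as np

def top_k(query_vec, doc_vecs, k: int = 5):
    """Rank documents by cosine similarity to the query.

    Assumes the rows of doc_vecs are already L2-normalised; returns
    (index, score) pairs, best match first."""
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = np.asarray(doc_vecs, dtype=np.float32) @ q
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]
```
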

Note: Running scripts individually requires you to manually update paths in src/config/config.yaml (or pass Hydra overrides) so each script can find the output from the previous steps.

Additional Notes

  • LLM Integration: If you use the Llama generator, ensure your ckpt_dir and tokenizer_path in config.yaml are valid.
  • Custom Data: Replace data_preparation.raw_data_file with your own domain-specific JSON, then rerun the pipeline on your specialised text. Your data file is expected to be JSON containing the fields url, header, and content; any deviation from this will require code changes across the pipeline files in src/.
