Jargonaut is a repository illustrating how to:
- Generate synthetic Q&A data from domain-specific text (using a Large Language Model)
- Finetune an embeddings model on that data for improved semantic retrieval of jargon-heavy content
- Build a minimal Flask web app to demonstrate the search experience end-to-end
By leveraging Hydra for centralised configuration, LLMs for question generation, and SentenceTransformers for embeddings finetuning, Jargonaut offers a practical, domain-adaptive approach to searching specialised documents.
This repository has been tested on Ubuntu 22.04. See the usage notes below for hardware guidance (the LLM-based Q&A generation stage works well on a system with a 24GB GPU).
Many organisations handle jargon-heavy or domain-specific documents where off-the-shelf embeddings fall short. Jargonaut aims to bridge that gap by:
- Creating Synthetic Q&A: We prompt an LLM to generate realistic user queries for each document.
- Finetuning: We train an embeddings model on these Q&A pairs, enhancing its understanding of specialised terminology.
- Building Document Embeddings: We compute embeddings for an entire corpus using our finetuned model.
- Serving a Local Search Demo: A Flask-based UI allows users to enter a query and see top-matching documents (with truncated text, metadata, etc.).
This repo builds on the success of a project by the Incubator for Artificial Intelligence (I.AI), using a Hugging Face dataset of government policy documents. We improve data security by running an open-source LLM (Llama-3.1) locally, and improve the relevance of generated Q&A pairs with a custom version of the `generate_qa_embedding_pairs` function from llama-index. Aside from the dataset, this entire repository is created from scratch.
To view example Q&A pairs generated by Jargonaut compared to the baseline I.AI method, see our examples document.
Below is a simplified view of the directory layout. Some subdirectories contain additional scripts, logs, or metadata files not shown in detail:
jargonaut/
├── environment.yml
├── govuk-policy-qa-pairs/
│   ├── data/
│   └── policy_papers.json
├── llama-models/                  <-- See setup section
│   ├── models/...
│   └── ...
├── README.md                      <-- You are here
└── src/
    ├── build_embeddings.py        <-- Script to build & save embeddings
    ├── config/
    │   └── config.yaml            <-- Main Hydra config
    ├── demo_server.py             <-- Flask app for searching
    ├── finetune.py                <-- Finetunes embeddings
    ├── generate_qa.py             <-- Synthetic Q&A generation
    ├── outputs/                   <-- Timestamped output directories
    ├── prepare_data.py            <-- Text chunking
    ├── prompts/
    │   ├── custom_qa_generate_prompt.txt
    │   └── custom_qa_generate_system.txt
    ├── run_all.py                 <-- Orchestrates multi-step pipeline
    └── utils/
        ├── data_utils.py
        ├── llm_utils.py
        ├── logger_utils.py
        └── ...
- `govuk-policy-qa-pairs/`: Main data directory containing government policy documents.
- `llama-models/`: Contains local Llama code and references (if you use the Llama-based generator).
- `src/outputs/`: Where each script's run logs, config snapshots, and artifacts (like `train_dataset.json`, `finetuned_model/`, `doc_embeddings.pkl`) are stored in timestamped folders.
1. Clone this repository:
git clone https://github.com/hpfield/jargonaut.git
cd jargonaut
2. Create and activate the conda environment:
conda env create -f environment.yml
conda activate jargonaut
This installs PyTorch, SentenceTransformers, Flask, Hydra, and other dependencies listed in environment.yml.
3. Unzip data in govuk-policy-qa-pairs:
cd govuk-policy-qa-pairs
gunzip policy_papers.json.gz
4. Install Llama-3.1:
Clone the llama-models git repo into the root of this repo and follow its installation instructions.
When you reach the Meta llama-downloads page, request access to Llama 3.1: 405B, 70B & 8B. For easiest integration with this project, accept the default suggestion to store the .llama directory in your home ~ directory.
5. Adjust any paths in src/config/config.yaml:
llm:
  ckpt_dir: ${oc.env:HOME}/.llama/checkpoints/Meta-Llama3.1-8B-Instruct
  tokenizer_path: ../llama-models/models/llama3/api/tokenizer.model
data_preparation:
  raw_data_file: "../govuk-policy-qa-pairs/policy_papers.json"
  small_file_path: "../govuk-policy-qa-pairs/policy_papers_small.json"
  output_file_path: "../govuk-policy-qa-pairs/policy_papers_truncated.json"
This repository uses Hydra to manage configuration from a single YAML file, located by default at src/config/config.yaml. When you run any of the scripts (e.g., prepare_data.py, generate_qa.py, etc.) or use run_all.py, Hydra automatically loads this config.
Below is an example of the current config.yaml:
defaults:
  - override hydra/job_logging: none
  - override hydra/hydra_logging: none
hydra:
  run:
    dir: .
data_preparation:
  max_token_threshold: 3000
  token_limit: 4000
  token_overlap: 500
  raw_data_file: "../govuk-policy-qa-pairs/policy_papers.json"
  small_file_path: "../govuk-policy-qa-pairs/policy_papers_small.json"
  output_file_path: "../govuk-policy-qa-pairs/policy_papers_truncated.json"
llm:
  ckpt_dir: ${oc.env:HOME}/.llama/checkpoints/Meta-Llama3.1-8B-Instruct
  max_batch_size: 4
  max_seq_len: 4096
  model_parallel_size: null
  temperature: 0.6
  tokenizer_path: ../llama-models/models/llama3/api/tokenizer.model
  top_p: 0.9
paths:
  data_file: ../govuk-policy-qa-pairs/policy_papers_truncated.json
  doc_embeddings_path: outputs/build_embeddings/24-12-22_21-16-07/doc_embeddings.pkl
  finetuned_model_path: outputs/finetune/24-12-22_21-19-32/finetuned_model
  output_dir: outputs
  prompt_file: prompts/custom_qa_generate_prompt.txt
  system_prompt_file: prompts/custom_qa_generate_system.txt
server:
  debug: false
  host: 127.0.0.1
  port: 5000
training:
  epochs: 5
  num_questions_per_chunk: 2
  on_failure: continue
  random_state: 42
  retry_limit: 3
  save_every: 500
  test_size: 0.2
  train_subset_size: NULL
  val_subset_size: NULL

`data_preparation`
- `max_token_threshold`: Maximum tokens allowed before splitting a document into chunks.
- `token_limit` & `token_overlap`: Control how we chunk large documents in `prepare_data.py`. Documents exceeding `max_token_threshold` get split into smaller parts, each capped at `token_limit` tokens (with some overlap).
- `raw_data_file`, `small_file_path`, `output_file_path`: Point to the files used or produced by the data-preparation stage. For instance, `prepare_data.py` will read from `raw_data_file`, create `small_file_path` with documents under the threshold, and generate `output_file_path` for the final truncated dataset.
`llm`
- Defines model-related settings like `ckpt_dir` (checkpoint location), `max_seq_len`, `temperature`, etc.
- Notably uses `${oc.env:HOME}` in `ckpt_dir`, which means Hydra will expand the `HOME` environment variable to locate LLM checkpoints.

`paths`
- Points to key data artifacts like `data_file` (the truncated dataset used for subsequent scripts), `doc_embeddings_path`, `finetuned_model_path`, etc.
- `output_dir` is the base directory where run logs and artifacts are stored in timestamped subfolders.

`server`
- Controls Flask server parameters (`host`, `port`, `debug`). This is used by `demo_server.py` when displaying search results in a browser.

`training`
- Hyperparameters for finetuning and generation steps, such as `epochs`, `num_questions_per_chunk` for Q&A generation, `retry_limit` for LLM queries, and `test_size` for train/val splitting.
- `train_subset_size` and `val_subset_size` can be set to numeric values if you want to limit data for quick tests; `NULL` means the full dataset is used.
- Command-Line Overrides: Hydra allows you to override any config parameter at runtime. For example:

  python run_all.py --stage=finetune training.epochs=10

  This sets `training.epochs` to `10` instead of `5`, overriding the default in `config.yaml`.
- Environment Variables: If a field references `$HOME` or uses syntax like `${oc.env:HOME}`, Hydra will expand it using the current environment. You can change it by setting `export HOME=/path/to/your/home` before running.
prepare_data.py breaks up larger datapoints into multiple text chunks. If a datapoint contains more than the max_token_threshold specified in the config, it's split into multiple datapoints. Setting an appropriate threshold prevents the LLM from failing due to running out of GPU memory (a threshold of 3000 works well on a system with a 24GB GPU).
cd src
python prepare_data.py
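The overlap-based chunking described above can be sketched as follows. This is a simplified, hypothetical version of what `prepare_data.py` does — the function names are illustrative and a plain whitespace split stands in for the real tokeniser:

```python
def chunk_tokens(tokens, token_limit=4000, token_overlap=500):
    """Split a token list into overlapping chunks of at most token_limit tokens."""
    chunks = []
    step = token_limit - token_overlap  # advance leaves token_overlap tokens shared
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + token_limit])
        if start + token_limit >= len(tokens):
            break  # the rest is already covered by this chunk
    return chunks


def split_document(text, max_token_threshold=3000, token_limit=4000, token_overlap=500):
    """Return the document unchanged if it is small enough, else overlapping chunks."""
    tokens = text.split()  # stand-in for the real tokeniser
    if len(tokens) <= max_token_threshold:
        return [text]
    return [" ".join(c) for c in chunk_tokens(tokens, token_limit, token_overlap)]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text.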
run_all.py automates the entire process. For example, run everything in sequence:
cd src
python run_all.py --stage=all
This:
- Generates Q&A pairs from your data (via LLM).
- Finetunes a SentenceTransformers model on the Q&A.
- Builds embeddings for your entire corpus.
- Launches the demo_server to let you test queries in the browser.
Run a single stage if you only need that portion:
python run_all.py --stage=generate_qa
python run_all.py --stage=finetune
python run_all.py --stage=build_embeddings
python run_all.py --stage=demo_server
After each stage, pipeline_runner.py looks at the newly created timestamped output folder and updates your config with the relevant paths (e.g., finetuned_model_path).
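Finding "the newly created timestamped output folder" can be as simple as a lexicographic sort, since timestamps in the `24-12-22_21-19-32` format sort chronologically as strings. A minimal sketch (the helper name is hypothetical, not the actual code in pipeline_runner.py):

```python
from pathlib import Path


def latest_run_dir(stage_dir):
    """Return the most recent timestamped run folder under outputs/<stage>/.

    Assumes folder names like '24-12-22_21-19-32', which sort
    chronologically when sorted as plain strings.
    """
    runs = sorted(p for p in Path(stage_dir).iterdir() if p.is_dir())
    if not runs:
        raise FileNotFoundError(f"no run folders under {stage_dir}")
    return runs[-1]
```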
If you prefer a manual approach:
1. Generate Q&A with generate_qa.py:
cd src
python generate_qa.py
Creates train_dataset.json and val_dataset.json.
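The exact schema of these files is defined by generate_qa.py, but conceptually each record pairs a generated question with the chunk it was generated from, split train/val per the config's `test_size: 0.2` and `random_state: 42`. A hypothetical sketch (the `query`/`relevant_doc` field names are illustrative assumptions):

```python
import random

# Hypothetical records; the real schema is defined by generate_qa.py.
records = [{"query": f"question {i}", "relevant_doc": f"chunk {i}"} for i in range(10)]

# A seeded shuffle-and-slice split in the style of test_size=0.2, random_state=42.
rng = random.Random(42)
rng.shuffle(records)
split = int(len(records) * 0.8)
train, val = records[:split], records[split:]
```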
2. Finetune with finetune.py:
python finetune.py
Loads Q&A data, trains a SentenceTransformers embedding model, and saves it in finetuned_model/.
3. Build Embeddings with build_embeddings.py:
python build_embeddings.py
Encodes your entire corpus into embeddings (saved in doc_embeddings.pkl).
4. Serve with demo_server.py:
python demo_server.py
Starts a Flask server on http://127.0.0.1:5000. Enter a query, see the top matches, truncated text, and metadata (headers/URLs).
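Conceptually, the search endpoint embeds the incoming query with the finetuned model and ranks documents by cosine similarity against the precomputed embeddings. A pure-Python sketch of just the ranking step (the real app additionally loads doc_embeddings.pkl and a SentenceTransformers model; the function names here are illustrative):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, docs, k=5):
    """Return the k documents most similar to the query embedding."""
    scored = sorted(
        zip(docs, (cosine(query_vec, v) for v in doc_vecs)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return scored[:k]
```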
Note: Running scripts individually requires you to manually update paths in src/config/config.yaml (or pass Hydra overrides) so each script can find the output from the previous steps.
- LLM Integration: If you use the Llama generator, ensure your `ckpt_dir` and `tokenizer_path` in `config.yaml` are valid.
- Custom Data: Replace `data_preparation.raw_data_file` with your own domain-specific JSON, then rerun the pipeline for your specialised text. Your data file is expected to be JSON containing the fields `url`, `header`, and `content`. Any deviation from this will require code refactoring across all pipeline files in `src/`.
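A minimal example of an input record in the expected shape, assuming (as a top-level structure) a JSON list of documents — the sample values are purely illustrative:

```python
import json

# Each document carries the three fields the pipeline expects:
# url, header, and content. Values here are made up for illustration.
docs = [
    {
        "url": "https://www.gov.uk/government/publications/example-policy",
        "header": "Example policy paper",
        "content": "Full text of the policy document goes here...",
    }
]
payload = json.dumps(docs, indent=2)
```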