WIMHF learns human-interpretable concepts from preference datasets in four steps:
- encode response texts using an embedding model,
- train a sparse autoencoder (SAE) on the difference in response embeddings across all preference pairs,
- interpret each SAE feature using an LLM (OpenAI or vLLM),
- identify features that predict preference labels.
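To make the pipeline concrete, here is a minimal, illustrative sketch of steps 1, 2, and 4 (step 3, LLM interpretation, is omitted). Everything here — the embedding model, the SAE architecture, and the logistic-regression scoring — is a stand-in, not the wimhf implementation:

```python
# Illustrative sketch only; NOT the wimhf API. The embedding model and
# SAE objective used in the paper may differ from these stand-ins.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

def embedding_diffs(pairs):
    """pairs: list of (response_A, response_B) strings -> one diff vector per pair."""
    emb_a = encoder.encode([a for a, _ in pairs])
    emb_b = encoder.encode([b for _, b in pairs])
    return emb_a - emb_b

class SparseAutoencoder(nn.Module):
    """Minimal SAE: ReLU codes over the embedding-difference vectors."""
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        z = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(z), z        # reconstruction and codes

def predictive_features(z, labels):
    """Step 4: score SAE features by how well they predict preference labels."""
    clf = LogisticRegression(penalty="l1", solver="liblinear").fit(z, labels)
    return clf.coef_.ravel()  # large |coef| -> feature predicts the label
```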
Links: Paper · Demo · Code · Data
Read the preprint for full details: What’s In My Human Feedback? Learning Interpretable Descriptions of Preference Data by Rajiv Movva, Smitha Milli, Sewon Min, and Emma Pierson.
- Clone & install
  ```bash
  git clone https://github.com/rmovva/wimhf.git
  cd wimhf
  pip install -e .
  ```
- Configure credentials
  Export your OpenAI-compatible key once per shell:

  ```bash
  export OAI_WIMHF=sk-your-key
  ```

  Local LLM inference is supported through `vllm`/`sentence-transformers`; skip the key if you only use those paths.
- Prepare a dataset config (CLI path)
  Copy one of the JSON files in `configs/` and point it at your dataset (see schema below). You can then run:

  ```bash
  python scripts/run_wimhf.py configs/community_align.json --output-dir outputs/community_align
  ```
- Open the notebook (interactive path)
  Alternatively, use `notebooks/community_alignment_quickstart.ipynb` for an end-to-end walkthrough that mirrors the same dataclasses while letting you inspect intermediate artefacts.
Provide a table (Parquet/JSON/CSV) with at least the following columns:
- `prompt`: the text shown to both models/annotators,
- `response_A`, `response_B`: the two candidate completions,
- `label`: binary preference target in {0, 1} (1 means `response_A` is preferred).
Optional columns include `conversation_id`, `split_columns` for connected-component train/val splits, and derived statistics like `length_delta`. The quickstart utilities will compute `length_delta` automatically if it is missing.
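For instance, a minimal Parquet table matching this schema could be produced with pandas (the file path is arbitrary; writing Parquet requires pyarrow or fastparquet):

```python
import pandas as pd

# Minimal table with the required columns described above.
df = pd.DataFrame({
    "prompt": ["Summarize the plot of Hamlet in two sentences."],
    "response_A": ["Prince Hamlet feigns madness while plotting revenge..."],
    "response_B": ["Hamlet is a Danish play about a prince..."],
    "label": [1],  # 1 means response_A was preferred
})
df.to_parquet("data/my_preferences.parquet")
```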
See `configs/*.json` for concrete settings used in the WIMHF paper; each config mirrors the dataclasses in `wimhf.quickstart`.
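As a purely illustrative example — the field names below are hypothetical, so copy a shipped config from `configs/` for the real schema — a dataset config might look like:

```json
{
  "dataset_path": "data/my_preferences.parquet",
  "prompt_col": "prompt",
  "response_cols": ["response_A", "response_B"],
  "label_col": "label",
  "embedding_model": "text-embedding-3-small",
  "output_dir": "outputs/my_dataset"
}
```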
Remote interpretation, annotation, and embedding calls expect the environment variable `OAI_WIMHF`. The library now reads only this key to initialise the OpenAI client. Local inference routes are available through:
- `wimhf.llm_local` (via `vllm`) for decoder-only models,
- `wimhf.embedding.get_local_embeddings` (via `sentence-transformers`) for offline embeddings.
Set `CUDA_VISIBLE_DEVICES` when running local models if multiple GPUs are present.
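For example (a sketch only: `get_local_embeddings` is the import path named above, but its call signature here is an assumption, so check `wimhf/embedding.py`):

```python
import os

# Pin local models to one GPU before importing libraries that initialize CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from wimhf.embedding import get_local_embeddings

# Hypothetical call shape; the actual parameters may differ.
vectors = get_local_embeddings(
    ["first response text", "second response text"],
    model_name="sentence-transformers/all-MiniLM-L6-v2",
)
```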
If you use this code, please cite:
What’s In My Human Feedback? Learning Interpretable Descriptions of Preference Data. Rajiv Movva, Smitha Milli, Sewon Min, and Emma Pierson. arXiv:2510.26202.
```bibtex
@misc{movva_wimhf_2025,
  title = {What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data},
  author = {Rajiv Movva and Smitha Milli and Sewon Min and Emma Pierson},
  year = {2025},
  eprint = {2510.26202},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url = {https://arxiv.org/abs/2510.26202}
}
```