Research library for black-box experiments on language models.
It's very high-level: you define models and prompts, and in many cases you won't need to write any other code.
It's optimized for convenient exploration. We used it for most of the results in our recent papers (Emergent Misalignment, Weird Generalizations).
```
pip install llmcomp
```
```python
from llmcomp import Question

# Requires OPENAI_API_KEY env variable
MODELS = {
    "gpt-4.1": ["gpt-4.1-2025-04-14"],
    "gpt-4.1-mini": ["gpt-4.1-mini-2025-04-14"],
}

question = Question.create(
    type="free_form",
    paraphrases=["Name a pretty song. Answer with the name only."],
    samples_per_paraphrase=100,
    temperature=1,
)

df = question.df(MODELS)                   # Dataframe with the results
question.plot(MODELS, min_fraction=0.03)   # Aggregated bar chart
question.view(MODELS)                      # Interactive browser for individual responses
```
- Research-oriented interface
- Caching - results are saved and reused; change models without re-running everything
- Parallel requests - configurable concurrency across models
- Multi-key support - use `OPENAI_API_KEY_0`, `OPENAI_API_KEY_1`, etc. to compare models from different orgs (see the example below)
- Provider-agnostic - works with any OpenAI-compatible API (OpenRouter, Tinker, etc.)
- Built-in viewer - browse answers interactively with `question.view(MODELS)`
- Extensible - highly configurable as long as your goal is comparing LLMs
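For illustration, a multi-key setup might look like this (key values are placeholders; normally you'd export these in your shell or `.env` file rather than in Python):

```python
import os

# Hypothetical setup: two OpenAI orgs plus OpenRouter.
# llmcomp picks these up via the provider discovery process described below.
os.environ["OPENAI_API_KEY_0"] = "key-for-org-a"
os.environ["OPENAI_API_KEY_1"] = "key-for-org-b"
os.environ["OPENROUTER_API_KEY"] = "openrouter-key"
```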
Examples 1-4 demonstrate all key functionalities of llmcomp.
| # | Example | Description |
|---|---|---|
| 1 | free_form_question.py | Basic FreeForm question. |
| 2 | next_token_question.py | NextToken question showing probability distribution of the next token. |
| 3 | rating_question.py | Rating question that extracts numeric scores from logprobs. |
| 4 | judges.py | FreeForm question with responses evaluated by judges. |
| 5 | questions_in_yaml.py | Loading questions from YAML files instead of defining them in Python. |
| 6 | configuration.py | Using the Config class to configure llmcomp settings at runtime. |
| 7 | tinker.py | Using Tinker models via OpenAI-compatible API. |
| 8 | openrouter.py | Using OpenRouter models via OpenAI-compatible API. |
| 9 | model_adapter.py | Setting model-specific API parameters. |
| 10 | runner.py | Direct Runner usage for low-level API interactions. |
| 11 | create_finetuning_job.py | Create an OpenAI finetuning job & manage models. |
| 12 | emergent misalignment replication | Complete script replicating results from a paper. |
| 13 | old bird names replication | Complete script replicating results from a paper. |
| 14 | x_mod_57.py | Complete script I used for a short blogpost. |
Suppose you request data for a model named "foo". llmcomp will:
- Read all env variables starting with "OPENAI_API_KEY", "OPENROUTER_API_KEY", "TINKER_API_KEY"
- Pair these API keys with the appropriate URLs to create a list of (url, key) pairs
- Send a single-token request for your "foo" model using all these pairs
- If any pair works, llmcomp will use it for processing your data
- If more than one pair works, llmcomp will use the one with the lexicographically smallest env variable name. For example, if you have two OpenAI orgs with keys OPENAI_API_KEY and OPENAI_API_KEY_1, models that work with both orgs will always be requested using OPENAI_API_KEY, because "OPENAI_API_KEY" < "OPENAI_API_KEY_1".
You can intervene in this process:
```python
from llmcomp import Config

# See all pairs read from the env variables
print(Config.url_key_pairs)

# Get the OpenAI client instance for a given model
client = Config.client_for_model("gpt-4.1")
print(client.base_url, client.api_key[:16] + "...")

# Set the pairs to whatever you want.
# You can add other OpenAI-compatible providers, or e.g. local inference.
Config.url_key_pairs = [("http://localhost:8000/v1", "fake-key", "FAKE_API_KEY")]
```
This provider discovery process has an unintended consequence: llmcomp sends some nonsensical requests. E.g. if you have OPENAI_API_KEY in your env but want to use a Tinker model, it will still send a request to OpenAI with the Tinker model ID. This is easy to improve, but also doesn't seem important.
See docs/api.md.
Note: this was mostly auto-generated by an LLM. I read it and it seems fine, but it might not be the best.
Question instances are supposed to work with many different models, yet models differ in which API arguments they expect: e.g. some expect max_tokens, others max_completion_tokens, and only reasoning models support reasoning_effort.
In llmcomp, Question is fully model-agnostic, and all model-specific adjustments are done via the ModelAdapter class. See examples/model_adapter.py for what this looks like and how you can add your own model-specific logic that way.
You can use ModelAdapter.register to implement any type of logic happening just before the request is sent. Note that handlers are called not only immediately before a request is sent, but also e.g. when llmcomp searches for cached results.
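As a rough illustration only - the handler shape and the ModelAdapter.register signature below are assumptions, the real interface is in examples/model_adapter.py:

```python
from llmcomp import ModelAdapter

def fix_token_limit_param(params: dict) -> dict:
    # Hypothetical handler: some models reject max_tokens and expect
    # max_completion_tokens instead, so swap the parameter before sending.
    if "max_tokens" in params:
        params["max_completion_tokens"] = params.pop("max_tokens")
    return params

# Assumed registration call - check examples/model_adapter.py for the actual signature.
ModelAdapter.register("my-reasoning-model", fix_token_limit_param)
```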
llmcomp/finetuning/ is a separate component independent from the rest of llmcomp.
It is a wrapper over the OpenAI finetuning API that manages a local database of your finetuning jobs and models. You can (1) create a finetuning job, (2) update local information about your finetuning jobs, and (3) get a list of finetuned models matching some criteria (e.g. a suffix or a base model). This is very useful when you finetune many (tens? hundreds?) models. If you finetune only rarely, the GUI is probably better.
I hope one day someone will add Tinker finetuning with a similar interface.
See docs/finetuning.md for the details and create_finetuning_job.py for an example.
You can send more parallel requests by increasing Config.max_workers.
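For example:

```python
from llmcomp import Config

# Allow more concurrent requests; pick a value your providers' rate limits can handle.
Config.max_workers = 100
```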
Suppose you have many prompts you want to send to models. There are three options:
- Have a separate Question object for each prompt and execute them in a loop
- Have a separate Question object for each prompt and execute them in parallel
- Have a single Question object with many paraphrases and then split the resulting dataframe (using the `paraphrase_ix` or `question` columns)
Option 1 will be slow - the more quick questions you have, the worse it gets.
Option 2 will be fast, but you need to write the parallelization yourself. Question should be thread-safe, but parallel execution of questions was never tested. One thing that won't work: the llmcomp.Config instance is a singleton, so you definitely shouldn't change it in some threads and expect the previous version in the others.
Option 3 will also be fast and is recommended. Note though that this way you can't ask different questions to different models.
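A minimal sketch of option 3, reusing the Question API from the quickstart above (the `paraphrase_ix` column name comes from the list above):

```python
from llmcomp import Question

prompts = ["First prompt", "Second prompt", "Third prompt"]

question = Question.create(
    type="free_form",
    paraphrases=prompts,
    samples_per_paraphrase=50,
    temperature=1,
)
df = question.df(MODELS)  # MODELS as defined in the quickstart above

# Split the combined dataframe back into one dataframe per prompt.
per_prompt = {ix: group for ix, group in df.groupby("paraphrase_ix")}
```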
Parallelization within a single question is done via threads. Perhaps async would be faster. Prompting claude-opus-4.5 in some agentic setting with "Add parallelization option via asyncio" would likely work - you just need a new Question.many_models_execute.
Cache is stored in Config.cache_dir.
Judges are assumed to be deterministic, i.e. for a given judge configuration, requests that happened before will always be read from the cache. You can read cached results via judge_instance.get_cache().
Non-judge requests are cached on the level of (question, model) pair. As a consequence:
- If you change any attribute of a question (other than the `judges` dictionary), cached results are not used - even if you only change the number of samples.
- You can change the `name` attribute to prevent the old cache from being used.
- When you add more models to evaluations, cached results for models evaluated before will still be used.
Libraries often cache on the request level. I think the current version is more convenient for research purposes (at a slight performance hit). Also, this might change in the future.
Cache is never cleared. You might need to remove it manually sometimes.
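To find the cache on disk:

```python
from llmcomp import Config

# Directory where cached results are stored. The cache is never cleared
# automatically - remove files here yourself if needed.
print(Config.cache_dir)
```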
There are various reasons why llmcomp might not work for a model.
You can test this via:

```python
from llmcomp import Config

Config.verbose = True  # might give some more information
Config.client_for_model("my-model-name")  # will raise an exception
```
If this is the case, it's usually because there is no (url, key) pair in Config.url_key_pairs that supports this model. See the model provider configuration section for the details.
But there's also an alternative possibility: llmcomp sends an incorrect initial request when checking whether the model works.
Logs with Config.verbose = True should give a hint - in that case you'll see an error different from "my-model-name is not supported" or "my-model-name is not a valid name".
You can see the test request params that are sent with:
```python
from llmcomp import ModelAdapter

ModelAdapter.test_request_params("my-model-name")
```
If this is the case, you need to manually override either Config.client_for_model or ModelAdapter.test_request_params (and if the model should work out of the box - please create an issue!).
For example, some models expect max_tokens and others expect max_completion_tokens, and we send the wrong one.
You can handle this via ModelAdapter - see "Varying API request parameters for different models" for the details.
This is probably either a bug in llmcomp, or the provider is not fully compatible with OpenAI API in a way that matters for llmcomp.
The latter is common. For example, suppose you use Claude via OpenRouter. Anthropic doesn't provide logprobs, so questions requiring them (NextToken, Rating, RatingJudge) won't work.
You can't now, but this could be quite easy to implement. Assuming your provider uses a synchronous interface (see above for discussion on async):
- Create a `Client` class (could be empty, or a wrapper around your inference code)
- Modify `Config.client_for_model` such that it returns an object of that class for your model
- Modify `llmcomp.runner.chat_completion.openai_chat_completion` such that, when your Client class is passed as an argument, it does whatever you need (and returns the result in OpenAI format)
I think this should just work, but no one has tried it so far, so, hmm, things might happen.
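A very rough sketch of the first two steps (the override pattern below is an assumption; only the names Config.client_for_model and llmcomp.runner.chat_completion.openai_chat_completion come from the steps above):

```python
from llmcomp import Config

class MyLocalClient:
    """Empty marker class (or a wrapper around your own inference code)."""

# Return the custom client for your models, fall back to the default otherwise.
_default_client_for_model = Config.client_for_model

def client_for_model(model: str):
    if model.startswith("my-local-"):
        return MyLocalClient()
    return _default_client_for_model(model)

Config.client_for_model = client_for_model

# Remaining step: modify llmcomp.runner.chat_completion.openai_chat_completion so that,
# when it receives a MyLocalClient instance, it runs your inference code and returns
# the result in OpenAI response format.
```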
I usually use .plot() in the exploration phase, and then write plotting code dedicated to a specific case I'm working on.
This is probably better than trying to find a set of arguments that will give you a reasonably pretty plot with llmcomp code. You'll find standalone plotting functions in llmcomp.question.plots.
Also, the plotting code might change at any time; don't expect any backward compatibility here.
There are some standalone functions in llmcomp.utils that I often find useful: write_jsonl, read_jsonl, get_error_bars.
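A minimal usage sketch (the argument order assumed below may not match the real signatures - check llmcomp.utils):

```python
from llmcomp.utils import read_jsonl, write_jsonl

rows = [{"model": "gpt-4.1", "answer": "Clair de Lune"}]

# Assumed (path, data) argument order - verify against llmcomp.utils.
write_jsonl("answers.jsonl", rows)
print(read_jsonl("answers.jsonl"))
```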
I don't plan any major changes now.
If there's something that would be useful for you, open an issue (or a PR, but for major changes it's better to discuss first).