A benchmarking script for evaluating embedding models using Ollama, FAISS, and scikit-learn.
- Embed documents and queries using Ollama models (e.g., nomic-embed-text, mxbai-embed-large)
- FAISS-based similarity search with Recall@5 and MRR metrics
- Intrinsic dimension estimation using multiple algorithms
The intrinsic dimension (ID) of a dataset measures the minimum number of parameters needed to describe the data accurately. High-dimensional embeddings often lie on a lower-dimensional manifold, and ID captures this effective dimensionality.
The TwoNN estimator uses the ratio of distances to the first and second nearest neighbors.
Algorithm:
- For each point, compute distances to its 1st and 2nd nearest neighbors: r₁ and r₂
- Compute the ratio μ = r₂/r₁
- The cumulative distribution of μ follows: F(μ) = 1 - μ⁻ᵈ where d is the intrinsic dimension
- Using linear regression on log(μ) and -log(1-F), estimate d from the slope
Formula:
d = slope of a linear fit (through the origin) of -log(1-F) against log(μ_sorted)
Advantages: Simple, fast, no density assumptions
Limitations: Sensitive to noise and boundary effects
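The steps above can be sketched as follows (a minimal implementation using scikit-learn for the neighbor search; the function name `twonn_id` is illustrative, not from the benchmark script):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X):
    """TwoNN intrinsic dimension estimate from nearest-neighbor distance ratios."""
    # Distances to the two nearest neighbors (column 0 is the point itself).
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    r1, r2 = dists[:, 1], dists[:, 2]
    mu = np.sort(r2[r1 > 0] / r1[r1 > 0])   # drop duplicate points where r1 == 0
    n = len(mu)
    # Empirical CDF; drop the last point so log(1 - F) stays finite.
    F = np.arange(1, n + 1) / n
    x, y = np.log(mu[:-1]), -np.log(1 - F[:-1])
    # F(mu) = 1 - mu^(-d)  =>  -log(1 - F) = d * log(mu): slope through origin is d.
    return np.sum(x * y) / np.sum(x * x)
```

The fit is forced through the origin because the theoretical CDF passes through (0, 0); an intercept term would only absorb noise.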
A maximum likelihood estimator based on the distribution of k-nearest neighbor distances.
Algorithm:
- For each point, find k nearest neighbors
- Compute distances from each point to its k neighbors
- For point i, the MLE estimate is: dᵢ = (k-1) / Σⱼ log(r_{i,k}/r_{i,j}), where r_{i,k} is the distance to the k-th (farthest) neighbor
- Average estimates across all points
Formula:
d_i = (k-1) / Σ_{j=1}^{k-1} log(r_{i,k} / r_{i,j})
d = mean(d_i for all i)
Advantages: More robust than TwoNN, consistent estimator
Limitations: Requires choosing k (typically 10-25)
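A direct translation of the formula above (again using scikit-learn for the neighbor search; the function name is illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def levina_bickel_id(X, k=10):
    """Levina-Bickel maximum-likelihood intrinsic dimension estimate."""
    # Ask for k+1 neighbors because the nearest "neighbor" of each point is itself.
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    r = dists[:, 1:]        # shape (n, k): distances to the k true neighbors
    rk = r[:, -1:]          # r_{i,k}: distance to the k-th (farthest) neighbor
    # d_i = (k-1) / sum_{j=1}^{k-1} log(r_{i,k} / r_{i,j})
    d_i = (k - 1) / np.sum(np.log(rk / r[:, :-1]), axis=1)
    return float(np.mean(d_i))
```

Averaging the per-point estimates d_i, as done here, is the simplest aggregation; averaging their inverses is a known variant with slightly different bias.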
Uses the explained variance ratio from Principal Component Analysis.
Algorithm:
- Apply PCA to the embedding matrix
- Compute cumulative explained variance ratio
- Find the smallest number of components explaining a threshold (e.g., 95%) of variance
Formula:
d = min{k : Σ_{i=1}^k λ_i / Σ_{j=1}^n λ_j ≥ threshold}
where λᵢ are eigenvalues in descending order.
Advantages: Intuitive, uses well-established PCA
Limitations: Assumes linear manifold, sensitive to scaling
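The threshold rule above is a few lines with scikit-learn (function name illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_id(X, threshold=0.95):
    """Smallest number of components whose cumulative explained variance >= threshold."""
    cum = np.cumsum(PCA().fit(X).explained_variance_ratio_)
    # searchsorted finds the first index where cum >= threshold; +1 converts to a count.
    return int(np.searchsorted(cum, threshold) + 1)
```

Because `explained_variance_ratio_` is already sorted in descending eigenvalue order, the cumulative sum is monotone and `searchsorted` gives the minimal k directly.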
Similar to Levina-Bickel but implemented with a different computational approach.
Algorithm:
- For each data point, compute distances to k nearest neighbors
- Calculate local dimension estimate using log-ratio of distances
- Average across all points
Formula:
d = (1/n) Σ_{i=1}^n [(k-1) / Σ_{j=1}^{k-1} log(r_{i,k} / r_{i,j})]
PDFs are processed in chunks for intrinsic dimension estimation:
- Chunk size: 800 characters per chunk
- Overlap: 100 characters shared between adjacent chunks
Example:
Text: "ABCDEFGHIJKL" (12 chars)
Chunk size: 5, Overlap: 2
Chunk 1: ABCDE (positions 0-4)
Chunk 2: DEFGH (positions 3-7) ← shares "DE" with chunk 1
Chunk 3: GHIJK (positions 6-10) ← shares "GH" with chunk 2
Chunk 4: JKL (positions 9-11) ← shares "JK" with chunk 3
Each chunk starts chunk_size − overlap = 3 characters after the previous one.
Why overlap?
- Keeps context intact (words/sentences aren't split in the middle)
- Neighboring chunks share information
- Better embeddings for documents where context matters
Why chunks instead of full PDFs?
- Intrinsic dimension measures local manifold structure
- Full PDFs = few samples (6 PDFs = 6 points) → unreliable ID estimation
- Chunks = many samples → better statistical estimation of local geometry
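The chunking scheme above reduces to a short helper (a minimal sketch; the function name and signature are illustrative, not taken from the benchmark script):

```python
def chunk_text(text, chunk_size=800, overlap=100):
    """Split text into overlapping chunks; each chunk starts
    stride = chunk_size - overlap characters after the previous one."""
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the rest of the text is already covered
    return chunks

# Reproduces the worked example (chunk size 5, overlap 2, stride 3):
# chunk_text("ABCDEFGHIJKL", 5, 2) → ['ABCDE', 'DEFGH', 'GHIJK', 'JKL']
```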
- max_chunks: Maximum number of chunks to embed (default: 500)
- chunk_size: Characters per chunk (default: 800)
- overlap: Characters shared between chunks (default: 100)
| Chunk Size | Min Overlap | Recommended | Max Overlap |
|---|---|---|---|
| 256 chars | 13 chars | 25-50 chars | 51 chars |
| 512 chars | 26 chars | 50-100 chars | 102 chars |
| 800 chars | 40 chars | 80-160 chars | 160 chars |
| 1024 chars | 51 chars | 100-200 chars | 205 chars |
Rule of thumb: overlap = chunk_size × (5% to 20%)
- Lower overlap (5-10%): More samples, faster, may split sentences
- Higher overlap (15-20%): Better context, slower, more redundant samples
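The rule of thumb can be computed directly; rounding to the nearest integer reproduces the table values (helper name illustrative):

```python
def overlap_range(chunk_size):
    """Rule of thumb: overlap between 5% and 20% of chunk_size."""
    return round(chunk_size * 0.05), round(chunk_size * 0.20)

# overlap_range(800) → (40, 160), matching the 800-char row of the table
```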
See requirements.txt for dependencies.
- Start Ollama locally:

```shell
ollama serve
```

- Run the benchmark:

```shell
python embedding_benchmark.py
```

Currently supports benchmarking multiple embedding models. Add or modify models in the MODELS list.
The intrinsic dimension score (e.g., 11.02) tells you the effective dimensionality of your embedding space. For context:
| ID Score | Interpretation |
|---|---|
| 1-5 | Very low dimensionality (simple structure) |
| 5-15 | Low to moderate dimensionality |
| 15-50 | Moderate dimensionality |
| 50-100 | High dimensionality |
| 100+ | Very high dimensionality (near raw embedding size) |
This indicates your PDF embeddings have low to moderate intrinsic dimensionality. Key implications:
- Efficiency: You could reduce dimensionality significantly (e.g., from 768 to ~10-15) with minimal information loss
- Manifold structure: The embeddings lie on a relatively simple geometric manifold
- Storage/computation: Lower-dimensional representations would work well for similarity search
- Overparameterization: The original 768 dimensions are more than needed
Different estimators give slightly different results:
- TwoNN (8.34): Simpler, faster, may underestimate slightly
- Levina-Bickel (11.59-11.78): More robust, consistent
- MLE (11.63): Similar to Levina-Bickel
- PCA (157): Much higher - captures linear variance, not intrinsic manifold
The combined score (11.02) weights the more robust multi-k Levina-Bickel estimator more heavily.
- Dimensionality reduction: Try PCA/UMAP to project to ~10-15 dimensions
- Quantization: Low ID suggests 8-bit quantization will work well
- Index selection: For FAISS, IVF or HNSW work well with low-ID data
- Model selection: If ID is low, smaller embedding models may suffice
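For the dimensionality-reduction step, a minimal sketch with scikit-learn (the function name and the target of 15 dimensions are illustrative choices based on the ID estimates above, not part of the benchmark script):

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_for_search(embeddings, target_dim=15):
    """Project embeddings down to roughly the estimated intrinsic dimension,
    returning the reduced vectors and the fraction of variance retained."""
    pca = PCA(n_components=target_dim, random_state=0).fit(embeddings)
    reduced = pca.transform(embeddings)
    retained = float(pca.explained_variance_ratio_.sum())
    return reduced, retained
```

Checking `retained` before committing to a target dimension is cheap insurance: if it is well below 1, the linear projection is discarding variance and a nonlinear method such as UMAP may preserve neighborhoods better.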
Example output:

```
======================================================================
SUMMARY: COMPARISON ACROSS EMBEDDING MODELS
======================================================================
Model               Dim       ID (Combined)  ID (TwoNN)  ID (Multi-k)
----------------------------------------------------------------------
nomic-embed-text    768-dim   14.99          34.61       11.45
mxbai-embed-large   1024-dim  15.42          41.66       10.95
----------------------------------------------------------------------
RECOMMENDATION:
--------------------------------------------------
Lowest ID score: nomic-embed-text (ID = 14.99)

Interpretation:
- Lower ID = data lies on simpler manifold
- If ID << embedding_dim, smaller model may work
- If ID ≈ embedding_dim, current size is well utilized

nomic-embed-text: ID (15.0) is only 2.0% of dim (768)
  → Consider smaller embedding model
mxbai-embed-large: ID (15.4) is only 1.5% of dim (1024)
  → Consider smaller embedding model
```
| Metric | What it tells you |
|---|---|
| ID score | Effective dimensionality of your data |
| ID/Dim ratio | How much of the embedding space is actually used |
| Ratio < 5% | Severe overparameterization - smaller model likely works |
| Ratio 5-20% | Reasonable - some redundancy but may help with noise |
| Ratio > 70% | Well utilized - larger dimension may help |
- ID << dim (like 15 vs 768): Use dimensionality reduction (PCA/UMAP) to ~15-20 dims
- Similar ID across models: The smaller model is more efficient (nomic-embed-text wins here)
- ID close to dim: Consider larger embedding models
Edit the MODELS list in intrinsic_dim_combined.py:

```python
MODELS = [
    ("nomic-embed-text", "768-dim"),
    ("mxbai-embed-large", "1024-dim"),
    # Add more models as needed
]
```

Then run:

```shell
python intrinsic_dim_combined.py
```