kactlabs/embedding-benchmark
embedding-benchmark

A benchmarking script for evaluating embedding models using Ollama, FAISS, and scikit-learn.

Features

  • Embed documents and queries using Ollama models (e.g., nomic-embed-text, mxbai-embed-large)
  • FAISS-based similarity search with Recall@5 and MRR metrics
  • Intrinsic dimension estimation using multiple algorithms
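For reference, Recall@5 and MRR can be computed as below (a minimal sketch; the benchmark script's own implementation may differ, and the query data here is made up for illustration):

```python
import numpy as np

def recall_at_k(ranked_ids, relevant_id, k=5):
    """1.0 if the relevant document appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, relevant_id):
    """1/rank of the first hit, or 0.0 if the relevant document is absent."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

# Averaging per-query values gives the benchmark-level Recall@5 and MRR.
queries = [([3, 1, 4, 1, 5], 4), ([2, 7, 1, 8, 2], 9)]  # (ranking, relevant id)
mrr = float(np.mean([reciprocal_rank(r, rel) for r, rel in queries]))
```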

Intrinsic Dimension Estimation Theory

What is Intrinsic Dimension?

The intrinsic dimension (ID) of a dataset measures the minimum number of parameters needed to describe the data accurately. High-dimensional embeddings often lie on a lower-dimensional manifold, and ID captures this effective dimensionality.

TwoNN (Two-Nearest-Neighbor) Estimator

The TwoNN estimator uses the ratio of distances to the first and second nearest neighbors.

Algorithm:

  1. For each point, compute distances to its 1st and 2nd nearest neighbors: r₁ and r₂
  2. Compute the ratio μ = r₂/r₁
  3. The cumulative distribution of μ follows: F(μ) = 1 - μ⁻ᵈ where d is the intrinsic dimension
  4. Using linear regression on log(μ) and -log(1-F), estimate d from the slope

Formula:

d = 1 / slope of linear regression on: log(μ_sorted) vs -log(1-F)

Advantages: Simple, fast, no density assumptions
Limitations: Sensitive to noise and boundary effects
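The four steps above can be sketched as follows. This is a minimal illustration using scikit-learn's NearestNeighbors, not the repository's exact code; discarding the top 10% of ratios is a common convention and an assumption here:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X, discard_fraction=0.1):
    """Estimate intrinsic dimension from the r2/r1 neighbor-distance ratios."""
    # Ask for 3 neighbors: the first is the query point itself at distance 0.
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = np.sort(dists[:, 2] / dists[:, 1])   # mu = r2 / r1, sorted
    n = len(mu)
    # Drop the largest ratios, which are dominated by noise/boundary effects.
    keep = int(n * (1 - discard_fraction))
    mu, F = mu[:keep], np.arange(1, keep + 1) / n   # empirical CDF of mu
    x, y = np.log(mu), -np.log(1 - F)
    # F(mu) = 1 - mu^(-d)  =>  y = d * x; fit the slope through the origin.
    return float(np.sum(x * y) / np.sum(x * x))

rng = np.random.default_rng(0)
# 2-D Gaussian data linearly embedded in 10-D; the estimate should be near 2.
X = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 10))
```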


Levina-Bickel MLE Estimator

A maximum likelihood estimator based on the distribution of k-nearest neighbor distances.

Algorithm:

  1. For each point, find k nearest neighbors
  2. Compute distances from each point to its k neighbors
  3. For point i, the MLE estimate is: d_i = (k-1) / Σ log(r_{i,k}/r_{i,j}), where r_{i,k} is the distance to the farthest (k-th) neighbor
  4. Average estimates across all points

Formula:

d_i = (k-1) / Σ_{j=1}^{k-1} log(r_{i,k} / r_{i,j})
d = mean(d_i for all i)

Advantages: More robust than TwoNN, consistent estimator
Limitations: Requires choosing k (typically 10-25)
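A minimal sketch of the estimator, assuming k=10 and Euclidean distances (the script's actual parameters may differ):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def levina_bickel_id(X, k=10):
    """Mean over points of d_i = (k-1) / sum_{j<k} log(r_{i,k} / r_{i,j})."""
    # Ask for k+1 neighbors: the first one is the point itself at distance 0.
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    r = dists[:, 1:]                               # r_{i,1} .. r_{i,k}
    log_ratios = np.log(r[:, -1:] / r[:, :-1]).sum(axis=1)
    return float(np.mean((k - 1) / log_ratios))

rng = np.random.default_rng(0)
# 2-D Gaussian data linearly embedded in 10-D; the estimate should be near 2.
X = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 10))
```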


PCA-Based Estimator

Uses the explained variance ratio from Principal Component Analysis.

Algorithm:

  1. Apply PCA to the embedding matrix
  2. Compute cumulative explained variance ratio
  3. Find the smallest number of components explaining a threshold (e.g., 95%) of variance

Formula:

d = min{k : Σ_{i=1}^k λ_i / Σ_{j=1}^n λ_j ≥ threshold}

where λᵢ are eigenvalues in descending order.

Advantages: Intuitive, uses well-established PCA
Limitations: Assumes linear manifold, sensitive to scaling
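A minimal sketch with a 95% variance threshold (the threshold is configurable; the example data is synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_id(X, threshold=0.95):
    """Smallest k whose cumulative explained-variance ratio reaches threshold."""
    ratios = PCA().fit(X).explained_variance_ratio_   # sorted descending
    return int(np.searchsorted(np.cumsum(ratios), threshold) + 1)

rng = np.random.default_rng(0)
X = np.zeros((500, 10))
X[:, :2] = rng.normal(size=(500, 2))   # rank-2 data padded with zero columns
```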


MLE (Levina-Bickel) Implementation

Similar to Levina-Bickel but implemented with a different computational approach.

Algorithm:

  1. For each data point, compute distances to k nearest neighbors
  2. Calculate local dimension estimate using log-ratio of distances
  3. Average across all points

Formula:

d = (1/n) Σ_{i=1}^n [(k-1) / Σ_{j=1}^{k-1} log(r_{i,k} / r_{i,j})]
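The comparison output later in this README also reports a "multi-k" variant. One common approach, assumed here rather than taken from the repository, averages the per-k MLE estimates over a range of k values:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_id_multi_k(X, k_values=(10, 15, 20, 25)):
    """Average the Levina-Bickel MLE over several neighborhood sizes k."""
    nn = NearestNeighbors(n_neighbors=max(k_values) + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    r = dists[:, 1:]                  # drop each point's distance to itself
    estimates = []
    for k in k_values:
        # d_i = (k-1) / sum_{j=1}^{k-1} log(r_{i,k} / r_{i,j}), averaged over i
        log_ratios = np.log(r[:, k - 1:k] / r[:, :k - 1]).sum(axis=1)
        estimates.append(np.mean((k - 1) / log_ratios))
    return float(np.mean(estimates))

rng = np.random.default_rng(0)
# 2-D Gaussian data linearly embedded in 10-D; the estimate should be near 2.
X = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 10))
```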

PDF Processing

Chunking and Overlap

PDFs are processed in chunks for intrinsic dimension estimation:

  • Chunk size: 800 characters per chunk
  • Overlap: 100 characters shared between adjacent chunks

Example:

Text: "ABCDEFGHIJKL" (12 chars)
Chunk size: 5, Overlap: 2 → stride = 5 - 2 = 3

Chunk 1: ABCDE  (positions 0-4)
Chunk 2: DEFGH  (positions 3-7)  ← shares "DE" with chunk 1
Chunk 3: GHIJK  (positions 6-10) ← shares "GH" with chunk 2
Chunk 4: JKL    (positions 9-11) ← shares "JK" with chunk 3
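This slicing can be sketched as follows, assuming the stride between chunk starts is chunk_size − overlap (the actual script's implementation may differ):

```python
def chunk_text(text, chunk_size=800, overlap=100):
    """Split text into fixed-size character chunks with a shared overlap."""
    step = chunk_size - overlap          # advance by this many characters
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                        # the last chunk reached the end
    return chunks
```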

Why overlap?

  • Keeps context intact (words/sentences aren't split in the middle)
  • Neighboring chunks share information
  • Better embeddings for documents where context matters

Why chunks instead of full PDFs?

  • Intrinsic dimension measures local manifold structure
  • Full PDFs = few samples (6 PDFs = 6 points) → unreliable ID estimation
  • Chunks = many samples → better statistical estimation of local geometry

Configuration

  • max_chunks: Maximum number of chunks to embed (default: 500)
  • chunk_size: Characters per chunk (default: 800)
  • overlap: Characters shared between chunks (default: 100)

Overlap Recommendations

Chunk Size   Min Overlap   Recommended     Max Overlap
256 chars    13 chars      25-50 chars     51 chars
512 chars    26 chars      50-100 chars    102 chars
800 chars    40 chars      80-160 chars    160 chars
1024 chars   51 chars      100-200 chars   205 chars

Rule of thumb: overlap = chunk_size × (5% to 20%)

  • Lower overlap (5-10%): More samples, faster, may split sentences
  • Higher overlap (15-20%): Better context, slower, more redundant samples
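The rule of thumb maps directly onto the table's minimum (5%) and maximum (20%) columns:

```python
def overlap_bounds(chunk_size):
    """Return (min, max) overlap as 5% and 20% of chunk_size, rounded."""
    return round(0.05 * chunk_size), round(0.20 * chunk_size)
```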

Requirements

See requirements.txt for dependencies.

Usage

  1. Start Ollama locally:
     ollama serve
  2. Run the benchmark:
     python embedding_benchmark.py

Models

Currently supports benchmarking multiple embedding models. Add or modify models in the MODELS list.

Screenshots

estimate_intrinsic_dim.py (PCA-based)

PCA-based ID estimation

embedding_benchmark.py (TwoNN)

Embedding benchmark with TwoNN

levina_bickel_id.py

Levina-Bickel estimator

mle_intrinsic_dim.py

MLE estimator

twonn_intrinsic_dim.py

TwoNN estimator

TwoNN illustration 1

TwoNN illustration 2

Combined


Interpreting Intrinsic Dimension Results

What Does the Score Mean?

The intrinsic dimension score (e.g., 11.02) tells you the effective dimensionality of your embedding space. For context:

ID Score   Interpretation
1-5        Very low dimensionality (simple structure)
5-15       Low to moderate dimensionality
15-50      Moderate dimensionality
50-100     High dimensionality
100+       Very high dimensionality (near raw embedding size)

Your Result: 11.02

This indicates your PDF embeddings have low to moderate intrinsic dimensionality. Key implications:

  • Efficiency: You could reduce dimensionality significantly (e.g., from 768 to ~10-15) with minimal information loss
  • Manifold structure: The embeddings lie on a relatively simple geometric manifold
  • Storage/computation: Lower-dimensional representations would work well for similarity search
  • Overparameterization: The original 768 dimensions are more than needed

Why Multiple Methods?

Different estimators give slightly different results:

  • TwoNN (8.34): Simpler, faster, may underestimate slightly
  • Levina-Bickel (11.59-11.78): More robust, consistent
  • MLE (11.63): Similar to Levina-Bickel
  • PCA (157): Much higher - captures linear variance, not intrinsic manifold

The combined score (11.02) weights the more robust multi-k Levina-Bickel estimator more heavily.

Practical Next Steps

  1. Dimensionality reduction: Try PCA/UMAP to project to ~10-15 dimensions
  2. Quantization: Low ID suggests 8-bit quantization will work well
  3. Index selection: For FAISS, IVF or HNSW work well with low-ID data
  4. Model selection: If ID is low, smaller embedding models may suffice

Comparing Embedding Models

Sample Output

======================================================================
SUMMARY: COMPARISON ACROSS EMBEDDING MODELS
======================================================================

Model                     Dim        ID (Combined)   ID (TwoNN)   ID (Multi-k)
----------------------------------------------------------------------
nomic-embed-text          768-dim    14.99           34.61        11.45
mxbai-embed-large         1024-dim   15.42           41.66        10.95
----------------------------------------------------------------------

RECOMMENDATION:
--------------------------------------------------
Lowest ID score: nomic-embed-text (ID = 14.99)

Interpretation:
  - Lower ID = data lies on simpler manifold
  - If ID << embedding_dim, smaller model may work
  - If ID ≈ embedding_dim, current size is well utilized

nomic-embed-text: ID (15.0) is only 2.0% of dim (768)
  → Consider smaller embedding model

mxbai-embed-large: ID (15.4) is only 1.5% of dim (1024)
  → Consider smaller embedding model

How to Interpret

Metric         What it tells you
ID score       Effective dimensionality of your data
ID/Dim ratio   How much of the embedding space is actually used
Ratio < 5%     Severe overparameterization - smaller model likely works
Ratio 5-20%    Reasonable - some redundancy but may help with noise
Ratio > 70%    Well utilized - larger dimension may help
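These thresholds can be sketched as a small helper (the helper name and the "moderately utilized" label for the uncovered 20-70% range are assumptions, not from the repository):

```python
def utilization(id_score, dim):
    """Classify how much of the embedding space the data actually uses."""
    ratio = id_score / dim
    if ratio < 0.05:
        return ratio, "severe overparameterization - smaller model likely works"
    if ratio <= 0.20:
        return ratio, "reasonable - some redundancy but may help with noise"
    if ratio > 0.70:
        return ratio, "well utilized - larger dimension may help"
    return ratio, "moderately utilized"
```

For example, nomic-embed-text's ID of 15.0 against 768 dimensions lands in the "severe overparameterization" band (15/768 ≈ 2.0%).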

What to Do

  • ID << dim (like 15 vs 768): Use dimensionality reduction (PCA/UMAP) to ~15-20 dims
  • Similar ID across models: The smaller model is more efficient (nomic-embed-text wins here)
  • ID close to dim: Consider larger embedding models

How to Run Comparison

Edit the MODELS list in intrinsic_dim_combined.py:

MODELS = [
    ("nomic-embed-text", "768-dim"),
    ("mxbai-embed-large", "1024-dim"),
    # Add more models as needed
]

Then run:

python intrinsic_dim_combined.py
