Merged
39 commits
a2ecd7a
feat: add shared module with reusable utils, services, and components
NetZissou Jan 26, 2026
3bbadc5
feat: add embed_explore as standalone Streamlit application
NetZissou Jan 26, 2026
727f59d
refactor: migrate to shared module and update project structure
NetZissou Jan 26, 2026
360fa0e
feat: add precalculated embeddings standalone app and clean up legacy…
NetZissou Jan 28, 2026
e5d5315
feat: enable zoom/pan in Altair scatter plots
NetZissou Jan 28, 2026
96af008
feat: add density heatmap toggle for cluster visualization
NetZissou Jan 28, 2026
bf00ca7
fix: disable point selection when density heatmap is enabled
NetZissou Jan 28, 2026
e9a23fa
feat: add density visualization mode selector (Off/Opacity/Heatmap)
NetZissou Jan 28, 2026
8434935
feat: configurable heatmap bins and full metadata table display
NetZissou Jan 29, 2026
cf11355
fix: keep Cluster and UUID as separate elements, table for rest
NetZissou Jan 29, 2026
2c4794b
refactor: consolidate visualization to shared module for both apps
NetZissou Jan 29, 2026
f689528
chore: remove unused imports from shared visualization
NetZissou Jan 29, 2026
7772727
feat: add centralized logging and fix Streamlit deprecation warning
NetZissou Jan 29, 2026
9f91b47
feat: unified backend detection and robust fallback across both apps
NetZissou Jan 29, 2026
07a66a9
fix: compute clustering summary only on clustering action, add image …
NetZissou Jan 29, 2026
958da35
feat: add comprehensive visualization and image I/O logging
NetZissou Jan 29, 2026
1dd67a4
fix: embed_explore now uses shared summary component with caching
NetZissou Jan 29, 2026
d34c33e
perf: implement lazy loading for heavy libraries (FAISS, torch, open_…
NetZissou Jan 29, 2026
fd83131
revert: remove lazy loading changes (caused issues)
NetZissou Jan 30, 2026
dba874b
chore: remove local ISSUES.md, moved to GitHub Issues
NetZissou Jan 30, 2026
ec36072
chore: clean up stale code and unused imports
NetZissou Feb 11, 2026
6a7f310
refactor: consolidate GPU fallback and enhance logging
NetZissou Feb 11, 2026
2e68906
fix: resolve chart rerun on zoom/pan and cuML UMAP crash
NetZissou Feb 11, 2026
ea137d0
docs: add GPU instructions, data format docs, and CUDA version extras
NetZissou Feb 11, 2026
9fda38c
feat: centralize L2 normalization and enhance embedding logging
NetZissou Feb 11, 2026
d1326bd
docs: add backend pipeline reference
NetZissou Feb 11, 2026
92e639b
test: add test suite (98 tests) and address PR review comments
NetZissou Feb 12, 2026
10ed517
perf: lazy-load heavy libraries to fix slow app startup (#12)
NetZissou Feb 12, 2026
49cd261
fix: relax numpy cap from <=2.2.0 to <2.3
NetZissou Feb 12, 2026
8dd94b1
docs: add Copilot code review instructions
NetZissou Feb 12, 2026
dfbe218
fix: separate CLI entry points, use literal substring filter, and imp…
NetZissou Feb 12, 2026
c18af0d
test: add CPU/GPU SLURM scripts and simplify test README
NetZissou Feb 13, 2026
b122c24
added readme to describe the embedding sample parquet.
NetZissou Feb 23, 2026
39fb185
addressed review feedback
NetZissou Feb 23, 2026
c0c9499
modify README to add link to example dataset description
NetZissou Feb 23, 2026
e0da28b
Add BioCLIP Huge as model option
NetZissou Mar 2, 2026
3318e74
Merge branch 'feature/app-separation' of github.com:Imageomics/emb-ex…
NetZissou Mar 2, 2026
bbfbcc5
Revert "Add BioCLIP Huge as model option"
NetZissou Mar 2, 2026
a00a2a8
Merge branch 'main' into feature/app-separation
NetZissou Mar 2, 2026
47 changes: 47 additions & 0 deletions .github/copilot-instructions.md
@@ -0,0 +1,47 @@
# Copilot Code Review Instructions

## Project context

This is a Streamlit-based image-embedding explorer that runs on HPC GPU
clusters (Ohio Supercomputer Center, SLURM). It has an automatic backend
fallback chain: cuML (GPU) → FAISS (CPU) → scikit-learn (CPU). Optional
GPU dependencies (cuML, CuPy, PyTorch, FAISS-GPU) may or may not be
installed — the app detects them at runtime and degrades gracefully.
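The fallback chain above can be sketched as a runtime probe. The module names `cuml` and `faiss` are the real optional dependencies, but `detect_backend` itself is illustrative, not the app's actual API:

```python
def detect_backend() -> str:
    """Return the first available clustering backend (hypothetical helper).

    Probes optional GPU/CPU libraries in priority order and degrades
    gracefully when they are not installed.
    """
    try:
        import cuml  # noqa: F401 — optional GPU dependency
        return "cuml"
    except ImportError:
        pass
    try:
        import faiss  # noqa: F401 — optional CPU-optimized dependency
        return "faiss"
    except ImportError:
        pass
    # scikit-learn is assumed to be a hard dependency, so no probe needed.
    return "sklearn"
```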

## Review focus

Prioritise **logic bugs, security issues, and correctness problems** over
style or lint. We run linters separately. A review comment should tell
us something a linter cannot.

## Patterns to accept (do NOT flag these)

- **`except (ImportError, Exception): pass` with an inline comment** —
These are intentional graceful-degradation paths for optional GPU
dependencies. If the comment explains the intent, do not suggest adding
logging or replacing the bare pass.

- **Self-referencing extras in `pyproject.toml`** — e.g.
`gpu = ["emb-explorer[gpu-cu12]"]`. This is a supported pip feature
for aliasing optional-dependency groups. It is not a circular dependency.

- **`faiss-gpu-cu12` inside a `[gpu-cu13]` extra** — There is no
`faiss-gpu-cu13` package on PyPI. CUDA forward-compatibility means the
cu12 build works on CUDA 13 drivers. If a comment explains this, accept it.

- **Streamlit `st.rerun(scope="app")`** — The `scope` parameter has been
available since Streamlit 1.33 (2024). `scope="app"` from inside a
`@st.fragment` triggers a full page rerun. This is intentional.

- **PID-based temp files under `/dev/shm`** — Used for subprocess IPC in
cuML UMAP isolation. The subprocess is short-lived and files are cleaned
up in a `finally` block. This is acceptable for a single-user HPC app.
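The graceful-degradation pattern the first bullet describes looks roughly like the following. This is a minimal sketch, assuming a hypothetical cleanup helper; `cupy.get_default_memory_pool()` is real CuPy API, but the surrounding function is illustrative:

```python
def free_gpu_memory() -> bool:
    """Best-effort release of CuPy's memory pool (illustrative helper)."""
    try:
        import cupy

        cupy.get_default_memory_pool().free_all_blocks()
        return True
    except Exception:
        # Optional GPU dependency may be absent or misconfigured —
        # intentionally ignored so the CPU path keeps working.
        return False
```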

## Things worth flagging

- Version-specifier bugs in `pyproject.toml` (e.g. `<=X.Y.0` excluding
valid patch releases when the real constraint is `<X.Z`).
- Incorrect error handling that swallows exceptions *without* a comment.
- Security issues: command injection, unsanitised user input, secrets in code.
- Race conditions or state bugs in Streamlit session state.
- GPU memory leaks (cupy/torch tensors not freed).
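The version-specifier bullet above can be made concrete with plain tuple comparison (a sketch of the semantics, not of any real resolver):

```python
def parse(version: str) -> tuple:
    """Turn 'X.Y.Z' into a comparable (X, Y, Z) tuple."""
    return tuple(int(part) for part in version.split("."))

# numpy 2.2.1 is a valid patch release of the 2.2 series, yet the
# buggy cap "<=2.2.0" excludes it:
assert not parse("2.2.1") <= parse("2.2.0")

# The intended constraint "<2.3" accepts it:
assert parse("2.2.1") < parse("2.3.0")
```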
5 changes: 4 additions & 1 deletion .gitignore
@@ -193,4 +193,7 @@ cython_debug/
.cursorignore
.cursorindexingignore

jobs/

# Application logs
logs/
179 changes: 41 additions & 138 deletions README.md
@@ -1,193 +1,96 @@
# Image Embedding Explorer
[![DOI](https://zenodo.org/badge/1001126795.svg)](https://doi.org/10.5281/zenodo.18841337)

**emb-explorer** is a Streamlit-based visual exploration and clustering tool for image datasets and pre-calculated image embeddings.
Visual exploration and clustering tool for image embeddings. Users can either bring pre-calculated embeddings to explore, or use the interface to embed their images and then explore those embeddings.

## 🎯 Demo Screenshots
## Screenshots

<table>
<tr>
<td width="50%" align="center">
<h3>📊 Embed & Explore Images</h3>
</td>
<td width="50%" align="center">
<h3>🔍 Explore Pre-calculated Embeddings</h3>
</td>
<td width="50%" align="center"><b>Embed & Explore</b></td>
<td width="50%" align="center"><b>Precalculated Embedding Exploration</b></td>
</tr>
<tr>
<td width="50%">
<h4>Embedding Interface</h4>
<img src="docs/images/app_screenshot_1.png" alt="Embedding Clusters" width="100%">
<p><em>Embed your images using pre-trained models</em></p>
</td>
<td width="50%">
<h4>Smart Filtering</h4>
<img src="docs/images/app_screenshot_filter.png" alt="Precalculated Embedding Filters" width="100%">
<p><em>Apply filters to pre-calculated embeddings</em></p>
</td>
<td><img src="docs/images/app_screenshot_1.png" alt="Embedding Interface" width="100%"></td>
<td><img src="docs/images/app_screenshot_filter.png" alt="Smart Filtering" width="100%"></td>
</tr>
<tr>
<td width="50%">
<h4>Cluster Summary</h4>
<img src="docs/images/app_screenshot_2.png" alt="Cluster Summary" width="100%">
<p><em>Analyze clustering results and representative images</em></p>
</td>
<td width="50%">
<h4>Interactive Exploration</h4>
<img src="docs/images/app_screenshot_cluster.png" alt="Precalculated Embedding Clusters" width="100%">
<p><em>Explore clusters with interactive visualization</em></p>
</td>
<td><img src="docs/images/app_screenshot_2.png" alt="Cluster Summary" width="100%"></td>
<td><img src="docs/images/app_screenshot_cluster.png" alt="Interactive Exploration" width="100%"></td>
</tr>
<tr>
<td width="50%">
<!-- Empty cell for Page 1 -->
</td>
<td width="50%">
<h4>Taxonomy Tree Navigation</h4>
<img src="docs/images/app_screenshot_taxon_tree.png" alt="Precalculated Embedding Taxon Tree" width="100%">
<p><em>Browse hierarchical taxonomy structure</em></p>
</td>
<td></td>
<td><img src="docs/images/app_screenshot_taxon_tree.png" alt="Taxonomy Tree" width="100%"></td>
</tr>
</table>


## Features

### Embed & Explore Images from Upload

* **Batch Image Embedding:**
Efficiently embed large collections of images using pretrained models (e.g., CLIP, BioCLIP) on CPU or, preferably, GPU, with customizable batch size and parallelism.
* **Clustering:**
Reduces embedding vectors to 2D using PCA, t-SNE, or UMAP, then performs K-Means clustering and displays the results in an interactive scatter plot. Click on data points to preview images and details.
* **Cluster-Based Repartitioning:**
Copy/repartition images into cluster-specific folders with a single click. Generates a summary CSV for downstream use.
* **Clustering Summary:**
Displays cluster sizes, variances, and representative images for each cluster, helping you evaluate clustering quality.

### Explore Pre-computed Embeddings
**Embed & Explore** - Embed images using pretrained models (CLIP, BioCLIP), cluster with K-Means, visualize with PCA/t-SNE/UMAP, and repartition images by cluster.

* **Parquet File Support:**
Load precomputed embeddings with associated metadata from parquet files. Compatible with various embedding formats and metadata schemas.
* **Advanced Filtering:**
Filter datasets by taxonomic hierarchy, source datasets, and custom metadata fields. Combine multiple filter criteria for precise data selection.
* **Clustering:**
Reduce embedding vectors to 2D using PCA, UMAP, or t-SNE, then perform K-Means clustering and explore the results in an interactive scatter plot. Click on points to preview images and metadata details.
* **Taxonomy Tree Navigation:**
Browse hierarchical biological classifications with interactive tree view. Expand and collapse taxonomic nodes to explore at different classification levels.
**Precalculated Embeddings** - Load parquet files (or directories of parquets) with precomputed embeddings, apply dynamic cascading filters, and explore clusters with taxonomy tree navigation. See [Data Format](docs/DATA_FORMAT.md) for the expected schema and [Backend Pipeline](docs/BACKEND_PIPELINE.md) for how embeddings flow through clustering and visualization.
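The dimensionality-reduction step both apps share can be sketched with plain NumPy. This is an illustrative PCA-via-SVD sketch, not the project's actual pipeline (which also offers t-SNE/UMAP and GPU backends):

```python
import numpy as np

def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project high-dimensional embeddings to 2D via SVD-based PCA."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# 100 fake 512-d embeddings -> 100 points on a 2D scatter plot
points = pca_2d(np.random.default_rng(0).normal(size=(100, 512)))
assert points.shape == (100, 2)
```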

## Installation

[uv](https://docs.astral.sh/uv/) is a fast Python package installer and resolver. Install `uv` first if you haven't already:

```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Then install the project:

```bash
# Clone the repository
git clone https://github.com/Imageomics/emb-explorer.git
cd emb-explorer

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Using uv (recommended)
uv venv && source .venv/bin/activate
uv pip install -e .
```

### GPU Support (Optional)
### GPU Acceleration (optional)

For GPU acceleration, you'll need CUDA 12.0+ installed on your system.
A GPU is **not required** — everything works on CPU out of the box. But if you have an NVIDIA GPU with CUDA, clustering and dimensionality reduction (KMeans, t-SNE, UMAP) will be significantly faster via [cuML](https://docs.rapids.ai/api/cuml/stable/).

```bash
# Full GPU support with RAPIDS (cuDF + cuML)
uv pip install -e ".[gpu]"
# CUDA 12.x
uv pip install -e ".[gpu-cu12]"

# Minimal GPU support (PyTorch + FAISS only)
uv pip install -e ".[gpu-minimal]"
# CUDA 13.x
uv pip install -e ".[gpu-cu13]"
```

### Development

```bash
# Install with development tools
uv pip install -e ".[dev]"
```
The app auto-detects GPU availability at runtime and falls back to CPU if anything goes wrong — no configuration needed. You can also manually select backends (cuML, FAISS, sklearn) in the sidebar.

## Usage

### Running the Application
### Standalone Apps

```bash
# Activate virtual environment (if not already activated)
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Embed & Explore - Interactive image embedding and clustering
streamlit run apps/embed_explore/app.py

# Run the Streamlit app
streamlit run app.py
# Precalculated Embeddings - Explore precomputed embeddings from parquet
streamlit run apps/precalculated/app.py
```

An example dataset (`example_1k.parquet`) is provided in the `data/` folder for testing the pre-calculated embeddings features. This parquet contains metadata and the [BioCLIP 2](https://imageomics.github.io/bioclip-2/) embeddings for a one-thousand-image subset of [TreeOfLife-200M](https://huggingface.co/datasets/imageomics/TreeOfLife-200M).

### Command Line Tools

The project also provides command-line utilities:
### Entry Points (after pip install)

```bash
# List all available models
python list_models.py --format table

# List models in JSON format
python list_models.py --format json --pretty

# List models as names only
python list_models.py --format names

# Get help for the list models command
python list_models.py --help
emb-embed-explore # Launch Embed & Explore app
emb-precalculated # Launch Precalculated Embeddings app
list-models # List available embedding models
```

### Running on Remote Compute Nodes
### Example Data

If running the app on a remote compute node (e.g., HPC cluster), you'll need to set up port forwarding to access the Streamlit interface from your local machine.
An example dataset (`data/example_1k.parquet`) is provided with BioCLIP 2 embeddings for testing. Please see the [data README](data/README.md) for more information about this sample set.

1. **Start the app on the compute node:**
```bash
# On the remote compute node
streamlit run app.py
```
Note the port number (default is 8501) and the compute node hostname.
### Remote HPC Usage

2. **Set up SSH port forwarding from your local machine:**
```bash
# From your local machine
ssh -N -L 8501:<COMPUTE_NODE>:8501 <USERNAME>@<LOGIN_NODE>
```

**Example:**
```bash
ssh -N -L 8501:c0828.ten.osc.edu:8501 username@cardinal.osc.edu
```

Replace:
- `<COMPUTE_NODE>` with the actual compute node hostname (e.g., `c0828.ten.osc.edu`)
- `<USERNAME>` with your username
- `<LOGIN_NODE>` with the login node address (e.g., `cardinal.osc.edu`)

3. **Access the app:**
Open your web browser and navigate to `http://localhost:8501`

The `-N` flag prevents SSH from executing remote commands, and `-L` sets up the local port forwarding.
```bash
# On compute node
streamlit run apps/precalculated/app.py --server.port 8501

### Notes on Implementation
# On local machine (port forwarding)
ssh -N -L 8501:<COMPUTE_NODE>:8501 <USER>@<LOGIN_NODE>

More notes on different implementation methods and approaches are available in the [implementation summary doc](docs/implementation_summary.md).
# Access at http://localhost:8501
```

## Acknowledgements

* [OpenCLIP](https://github.com/mlfoundations/open_clip)
* [Streamlit](https://streamlit.io/)
* [Altair](https://altair-viz.github.io/)

---
[OpenCLIP](https://github.com/mlfoundations/open_clip) | [Streamlit](https://streamlit.io/) | [Altair](https://altair-viz.github.io/)