Feature/app separation #13
Merged
39 commits
a2ecd7a  feat: add shared module with reusable utils, services, and components (NetZissou)
3bbadc5  feat: add embed_explore as standalone Streamlit application (NetZissou)
727f59d  refactor: migrate to shared module and update project structure (NetZissou)
360fa0e  feat: add precalculated embeddings standalone app and clean up legacy… (NetZissou)
e5d5315  feat: enable zoom/pan in Altair scatter plots (NetZissou)
96af008  feat: add density heatmap toggle for cluster visualization (NetZissou)
bf00ca7  fix: disable point selection when density heatmap is enabled (NetZissou)
e9a23fa  feat: add density visualization mode selector (Off/Opacity/Heatmap) (NetZissou)
8434935  feat: configurable heatmap bins and full metadata table display (NetZissou)
cf11355  fix: keep Cluster and UUID as separate elements, table for rest (NetZissou)
2c4794b  refactor: consolidate visualization to shared module for both apps (NetZissou)
f689528  chore: remove unused imports from shared visualization (NetZissou)
7772727  feat: add centralized logging and fix Streamlit deprecation warning (NetZissou)
9f91b47  feat: unified backend detection and robust fallback across both apps (NetZissou)
07a66a9  fix: compute clustering summary only on clustering action, add image … (NetZissou)
958da35  feat: add comprehensive visualization and image I/O logging (NetZissou)
1dd67a4  fix: embed_explore now uses shared summary component with caching (NetZissou)
d34c33e  perf: implement lazy loading for heavy libraries (FAISS, torch, open_… (NetZissou)
fd83131  revert: remove lazy loading changes (caused issues) (NetZissou)
dba874b  chore: remove local ISSUES.md, moved to GitHub Issues (NetZissou)
ec36072  chore: clean up stale code and unused imports (NetZissou)
6a7f310  refactor: consolidate GPU fallback and enhance logging (NetZissou)
2e68906  fix: resolve chart rerun on zoom/pan and cuML UMAP crash (NetZissou)
ea137d0  docs: add GPU instructions, data format docs, and CUDA version extras (NetZissou)
9fda38c  feat: centralize L2 normalization and enhance embedding logging (NetZissou)
d1326bd  docs: add backend pipeline reference (NetZissou)
92e639b  test: add test suite (98 tests) and address PR review comments (NetZissou)
10ed517  perf: lazy-load heavy libraries to fix slow app startup (#12) (NetZissou)
49cd261  fix: relax numpy cap from <=2.2.0 to <2.3 (NetZissou)
8dd94b1  docs: add Copilot code review instructions (NetZissou)
dfbe218  fix: separate CLI entry points, use literal substring filter, and imp… (NetZissou)
c18af0d  test: add CPU/GPU SLURM scripts and simplify test README (NetZissou)
b122c24  added readme to describe the embedding sample parquet. (NetZissou)
39fb185  addressed review feedback (NetZissou)
c0c9499  modify README to add link to example dataset description (NetZissou)
e0da28b  Add BioCLIP Huge as model option (NetZissou)
3318e74  Merge branch 'feature/app-separation' of github.com:Imageomics/emb-ex… (NetZissou)
bbfbcc5  Revert "Add BioCLIP Huge as model option" (NetZissou)
a00a2a8  Merge branch 'main' into feature/app-separation (NetZissou)
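
The commit titled `fix: relax numpy cap from <=2.2.0 to <2.3` changes which patch releases a resolver may pick. As a rough illustration only, the sketch below compares dotted version strings as integer tuples; it is not a PEP 440 parser, the helper names are hypothetical, and the candidate version strings are made up for the example rather than taken from the project.

```python
def parse(version: str, width: int = 3) -> tuple[int, ...]:
    """Turn '2.3' into (2, 3, 0). Handles plain numeric releases only."""
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (width - len(parts))  # pad so '2.3' compares like '2.3.0'
    return tuple(parts)


def allowed_by_old_cap(version: str) -> bool:
    """Approximates the specifier `<=2.2.0`."""
    return parse(version) <= parse("2.2.0")


def allowed_by_new_cap(version: str) -> bool:
    """Approximates the specifier `<2.3`."""
    return parse(version) < parse("2.3")


# Illustrative version strings, not a list of real releases.
candidates = ["2.1.3", "2.2.0", "2.2.1", "2.2.6", "2.3.0"]

old_ok = [v for v in candidates if allowed_by_old_cap(v)]
new_ok = [v for v in candidates if allowed_by_new_cap(v)]

print("<=2.2.0 admits:", old_ok)
print("<2.3    admits:", new_ok)
```

Under the old cap every 2.2.x patch release after 2.2.0 is rejected, while the relaxed cap admits the whole 2.2 series and still excludes 2.3.0.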
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,47 @@ | ||
| # Copilot Code Review Instructions | ||
|
|
||
| ## Project context | ||
|
|
||
| This is a Streamlit-based image-embedding explorer that runs on HPC GPU | ||
| clusters (Ohio Supercomputer Center, SLURM). It has an automatic backend | ||
| fallback chain: cuML (GPU) → FAISS (CPU) → scikit-learn (CPU). Optional | ||
| GPU dependencies (cuML, CuPy, PyTorch, FAISS-GPU) may or may not be | ||
| installed — the app detects them at runtime and degrades gracefully. | ||
|
|
||
| ## Review focus | ||
|
|
||
| Prioritise **logic bugs, security issues, and correctness problems** over | ||
| style or lint. We run linters separately. A review comment should tell | ||
| us something a linter cannot. | ||
|
|
||
| ## Patterns to accept (do NOT flag these) | ||
|
|
||
| - **`except (ImportError, Exception): pass` with an inline comment** — | ||
| These are intentional graceful-degradation paths for optional GPU | ||
| dependencies. If the comment explains the intent, do not suggest adding | ||
| logging or replacing the bare pass. | ||
|
|
||
| - **Self-referencing extras in `pyproject.toml`** — e.g. | ||
| `gpu = ["emb-explorer[gpu-cu12]"]`. This is a supported pip feature | ||
| for aliasing optional-dependency groups. It is not a circular dependency. | ||
|
|
||
| - **`faiss-gpu-cu12` inside a `[gpu-cu13]` extra** — There is no | ||
| `faiss-gpu-cu13` package on PyPI. CUDA forward-compatibility means the | ||
| cu12 build works on CUDA 13 drivers. If a comment explains this, accept it. | ||
|
|
||
| - **Streamlit `st.rerun(scope="app")`** — The `scope` parameter has been | ||
| available since Streamlit 1.37 (2024). `scope="app"` from inside a | ||
| `@st.fragment` triggers a full page rerun. This is intentional. | ||
|
|
||
| - **PID-based temp files under `/dev/shm`** — Used for subprocess IPC in | ||
| cuML UMAP isolation. The subprocess is short-lived and files are cleaned | ||
| up in a `finally` block. This is acceptable for a single-user HPC app. | ||
|
|
||
| ## Things worth flagging | ||
|
|
||
| - Version-specifier bugs in `pyproject.toml` (e.g. `<=X.Y.0` excluding | ||
| valid patch releases when the real constraint is `<X.Z`). | ||
| - Incorrect error handling that swallows exceptions *without* a comment. | ||
| - Security issues: command injection, unsanitised user input, secrets in code. | ||
| - Race conditions or state bugs in Streamlit session state. | ||
| - GPU memory leaks (cupy/torch tensors not freed). |
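
The instructions above accept PID-based temp files under `/dev/shm` for subprocess IPC as long as cleanup happens in a `finally` block. The PR page does not show that code, so the following is only a minimal sketch of the pattern under stated assumptions: the file names, the JSON payload, and the doubling child process are hypothetical stand-ins for whatever the app actually passes to its isolated cuML UMAP subprocess.

```python
import json
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def shm_dir() -> Path:
    """Prefer /dev/shm (RAM-backed on Linux); otherwise use the system temp dir."""
    shm = Path("/dev/shm")
    if shm.is_dir() and os.access(shm, os.W_OK):
        return shm
    return Path(tempfile.gettempdir())


# Hypothetical child process: reads a JSON list and writes each value doubled.
CHILD_CODE = """
import json, sys
src, dst = sys.argv[1], sys.argv[2]
with open(src) as f:
    values = json.load(f)
with open(dst, "w") as f:
    json.dump([v * 2 for v in values], f)
"""


def run_isolated(values: list) -> list:
    """Exchange data with a short-lived subprocess through PID-named files."""
    base = shm_dir()
    pid = os.getpid()
    in_path = base / f"ipc_demo_{pid}_in.json"    # hypothetical naming scheme
    out_path = base / f"ipc_demo_{pid}_out.json"
    try:
        in_path.write_text(json.dumps(values))
        subprocess.run(
            [sys.executable, "-c", CHILD_CODE, str(in_path), str(out_path)],
            check=True,   # raise if the child exits with a non-zero status
            timeout=60,
        )
        return json.loads(out_path.read_text())
    finally:
        # Runs on success and on failure, so no files are left behind.
        for path in (in_path, out_path):
            path.unlink(missing_ok=True)


result = run_isolated([1, 2, 3])
print(result)
```

Keying the file names on the parent PID avoids collisions between concurrent app processes owned by the same user, which fits the single-user HPC setting the instructions describe; a multi-user deployment would more likely call for `tempfile.mkstemp`.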
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -193,4 +193,7 @@ cython_debug/ | ||
| .cursorignore | ||
| .cursorindexingignore | ||
|
|
||
| jobs/ | ||
| jobs/ | ||
|
|
||
| # Application logs | ||
| logs/ | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,193 +1,96 @@ | ||
| # Image Embedding Explorer | ||
| [DOI: 10.5281/zenodo.18841337](https://doi.org/10.5281/zenodo.18841337) | ||
|
|
||
| **emb-explorer** is a Streamlit-based visual exploration and clustering tool for image datasets and pre-calculated image embeddings. | ||
| Visual exploration and clustering tool for image embeddings. Users can either bring pre-calculated embeddings to explore, or use the interface to embed their images and then explore those embeddings. | ||
|
|
||
| ## 🎯 Demo Screenshots | ||
| ## Screenshots | ||
|
|
||
| <table> | ||
| <tr> | ||
| <td width="50%" align="center"> | ||
| <h3>📊 Embed & Explore Images</h3> | ||
| </td> | ||
| <td width="50%" align="center"> | ||
| <h3>🔍 Explore Pre-calculated Embeddings</h3> | ||
| </td> | ||
| <td width="50%" align="center"><b>Embed & Explore</b></td> | ||
| <td width="50%" align="center"><b>Precalculated Embedding Exploration</b></td> | ||
| </tr> | ||
| <tr> | ||
| <td width="50%"> | ||
| <h4>Embedding Interface</h4> | ||
| <img src="docs/images/app_screenshot_1.png" alt="Embedding Clusters" width="100%"> | ||
| <p><em>Embed your images using pre-trained models</em></p> | ||
| </td> | ||
| <td width="50%"> | ||
| <h4>Smart Filtering</h4> | ||
| <img src="docs/images/app_screenshot_filter.png" alt="Precalculated Embedding Filters" width="100%"> | ||
| <p><em>Apply filters to pre-calculated embeddings</em></p> | ||
| </td> | ||
| <td><img src="docs/images/app_screenshot_1.png" alt="Embedding Interface" width="100%"></td> | ||
| <td><img src="docs/images/app_screenshot_filter.png" alt="Smart Filtering" width="100%"></td> | ||
| </tr> | ||
| <tr> | ||
| <td width="50%"> | ||
| <h4>Cluster Summary</h4> | ||
| <img src="docs/images/app_screenshot_2.png" alt="Cluster Summary" width="100%"> | ||
| <p><em>Analyze clustering results and representative images</em></p> | ||
| </td> | ||
| <td width="50%"> | ||
| <h4>Interactive Exploration</h4> | ||
| <img src="docs/images/app_screenshot_cluster.png" alt="Precalculated Embedding Clusters" width="100%"> | ||
| <p><em>Explore clusters with interactive visualization</em></p> | ||
| </td> | ||
| <td><img src="docs/images/app_screenshot_2.png" alt="Cluster Summary" width="100%"></td> | ||
| <td><img src="docs/images/app_screenshot_cluster.png" alt="Interactive Exploration" width="100%"></td> | ||
| </tr> | ||
| <tr> | ||
| <td width="50%"> | ||
| <!-- Empty cell for Page 1 --> | ||
| </td> | ||
| <td width="50%"> | ||
| <h4>Taxonomy Tree Navigation</h4> | ||
| <img src="docs/images/app_screenshot_taxon_tree.png" alt="Precalculated Embedding Taxon Tree" width="100%"> | ||
| <p><em>Browse hierarchical taxonomy structure</em></p> | ||
| </td> | ||
| <td></td> | ||
| <td><img src="docs/images/app_screenshot_taxon_tree.png" alt="Taxonomy Tree" width="100%"></td> | ||
| </tr> | ||
| </table> | ||
|
|
||
|
|
||
| ## Features | ||
|
|
||
| ### Embed & Explore Images from Upload | ||
|
|
||
| * **Batch Image Embedding:** | ||
| Efficiently embed large collections of images using the pretrained model (e.g., CLIP, BioCLIP) on CPU or GPU (preferably), with customizable batch size and parallelism. | ||
| * **Clustering:** | ||
| Reduces embedding vectors to 2D using PCA, T-SNE, and UMAP. Performs K-Means clustering and display result using a scatter plot. Explore clusters via interactive scatter plots. Click on data points to preview images and details. | ||
| * **Cluster-Based Repartitioning:** | ||
| Copy/repartition images into cluster-specific folders with a single click. Generates a summary CSV for downstream use. | ||
| * **Clustering Summary:** | ||
| Displays cluster sizes, variances, and representative images for each cluster, helping you evaluate clustering quality. | ||
|
|
||
| ### Explore Pre-computed Embeddings | ||
| **Embed & Explore** - Embed images using pretrained models (CLIP, BioCLIP), cluster with K-Means, visualize with PCA/t-SNE/UMAP, and repartition images by cluster. | ||
|
|
||
| * **Parquet File Support:** | ||
| Load precomputed embeddings with associated metadata from parquet files. Compatible with various embedding formats and metadata schemas. | ||
| * **Advanced Filtering:** | ||
| Filter datasets by taxonomic hierarchy, source datasets, and custom metadata fields. Combine multiple filter criteria for precise data selection. | ||
| * **Clustering:** | ||
| Reduce embedding vectors to 2D using PCA, UMAP, or t-SNE. Perform K-Means clustering and display result using a scatter plot. Explore clusters via interactive scatter plots. Click on points to preview images and explore metadata details. | ||
| * **Taxonomy Tree Navigation:** | ||
| Browse hierarchical biological classifications with interactive tree view. Expand and collapse taxonomic nodes to explore at different classification levels. | ||
| **Precalculated Embeddings** - Load parquet files (or directories of parquets) with precomputed embeddings, apply dynamic cascading filters, and explore clusters with taxonomy tree navigation. See [Data Format](docs/DATA_FORMAT.md) for the expected schema and [Backend Pipeline](docs/BACKEND_PIPELINE.md) for how embeddings flow through clustering and visualization. | ||
|
|
||
| ## Installation | ||
|
|
||
| [uv](https://docs.astral.sh/uv/) is a fast Python package installer and resolver. Install `uv` first if you haven't already: | ||
|
|
||
| ```bash | ||
| # Install uv (if not already installed) | ||
| curl -LsSf https://astral.sh/uv/install.sh | sh | ||
| ``` | ||
|
|
||
| Then install the project: | ||
|
|
||
| ```bash | ||
| # Clone the repository | ||
| git clone https://github.com/Imageomics/emb-explorer.git | ||
| cd emb-explorer | ||
|
|
||
| # Create virtual environment and install dependencies | ||
| uv venv | ||
| source .venv/bin/activate # On Windows: .venv\Scripts\activate | ||
| # Using uv (recommended) | ||
| uv venv && source .venv/bin/activate | ||
| uv pip install -e . | ||
| ``` | ||
|
|
||
| ### GPU Support (Optional) | ||
| ### GPU Acceleration (optional) | ||
|
|
||
| For GPU acceleration, you'll need CUDA 12.0+ installed on your system. | ||
| A GPU is **not required** — everything works on CPU out of the box. But if you have an NVIDIA GPU with CUDA, clustering and dimensionality reduction (KMeans, t-SNE, UMAP) will be significantly faster via [cuML](https://docs.rapids.ai/api/cuml/stable/). | ||
|
|
||
| ```bash | ||
| # Full GPU support with RAPIDS (cuDF + cuML) | ||
| uv pip install -e ".[gpu]" | ||
| # CUDA 12.x | ||
| uv pip install -e ".[gpu-cu12]" | ||
|
|
||
| # Minimal GPU support (PyTorch + FAISS only) | ||
| uv pip install -e ".[gpu-minimal]" | ||
| # CUDA 13.x | ||
| uv pip install -e ".[gpu-cu13]" | ||
| ``` | ||
|
|
||
| ### Development | ||
|
|
||
| ```bash | ||
| # Install with development tools | ||
| uv pip install -e ".[dev]" | ||
| ``` | ||
| The app auto-detects GPU availability at runtime and falls back to CPU if anything goes wrong — no configuration needed. You can also manually select backends (cuML, FAISS, sklearn) in the sidebar. | ||
|
|
||
| ## Usage | ||
|
|
||
| ### Running the Application | ||
| ### Standalone Apps | ||
|
|
||
| ```bash | ||
| # Activate virtual environment (if not already activated) | ||
| source .venv/bin/activate # On Windows: .venv\Scripts\activate | ||
| # Embed & Explore - Interactive image embedding and clustering | ||
| streamlit run apps/embed_explore/app.py | ||
|
|
||
| # Run the Streamlit app | ||
| streamlit run app.py | ||
| # Precalculated Embeddings - Explore precomputed embeddings from parquet | ||
| streamlit run apps/precalculated/app.py | ||
| ``` | ||
|
|
||
| An example dataset (`example_1k.parquet`) is provided in the `data/` folder for testing the pre-calculated embeddings features. This parquet contains metadata and the [BioCLIP 2](https://imageomics.github.io/bioclip-2/) embeddings for a one thousand-image subset of [TreeOfLife-200M](https://huggingface.co/datasets/imageomics/TreeOfLife-200M). | ||
|
|
||
| ### Command Line Tools | ||
|
|
||
| The project also provides command-line utilities: | ||
| ### Entry Points (after pip install) | ||
|
|
||
| ```bash | ||
| # List all available models | ||
| python list_models.py --format table | ||
|
|
||
| # List models in JSON format | ||
| python list_models.py --format json --pretty | ||
|
|
||
| # List models as names only | ||
| python list_models.py --format names | ||
|
|
||
| # Get help for the list models command | ||
| python list_models.py --help | ||
| emb-embed-explore # Launch Embed & Explore app | ||
| emb-precalculated # Launch Precalculated Embeddings app | ||
| list-models # List available embedding models | ||
| ``` | ||
|
|
||
| ### Running on Remote Compute Nodes | ||
| ### Example Data | ||
|
|
||
| If running the app on a remote compute node (e.g., HPC cluster), you'll need to set up port forwarding to access the Streamlit interface from your local machine. | ||
| An example dataset (`data/example_1k.parquet`) is provided with BioCLIP 2 embeddings for testing. Please see the [data README](data/README.md) for more information about this sample set. | ||
|
|
||
| 1. **Start the app on the compute node:** | ||
| ```bash | ||
| # On the remote compute node | ||
| streamlit run app.py | ||
| ``` | ||
| Note the port number (default is 8501) and the compute node hostname. | ||
| ### Remote HPC Usage | ||
|
|
||
| 2. **Set up SSH port forwarding from your local machine:** | ||
| ```bash | ||
| # From your local machine | ||
| ssh -N -L 8501:<COMPUTE_NODE>:8501 <USERNAME>@<LOGIN_NODE> | ||
| ``` | ||
|
|
||
| **Example:** | ||
| ```bash | ||
| ssh -N -L 8501:c0828.ten.osc.edu:8501 username@cardinal.osc.edu | ||
| ``` | ||
|
|
||
| Replace: | ||
| - `<COMPUTE_NODE>` with the actual compute node hostname (e.g., `c0828.ten.osc.edu`) | ||
| - `<USERNAME>` with your username | ||
| - `<LOGIN_NODE>` with the login node address (e.g., `cardinal.osc.edu`) | ||
|
|
||
| 3. **Access the app:** | ||
| Open your web browser and navigate to `http://localhost:8501` | ||
|
|
||
| The `-N` flag prevents SSH from executing remote commands, and `-L` sets up the local port forwarding. | ||
| ```bash | ||
| # On compute node | ||
| streamlit run apps/precalculated/app.py --server.port 8501 | ||
|
|
||
| ### Notes on Implementation | ||
| # On local machine (port forwarding) | ||
| ssh -N -L 8501:<COMPUTE_NODE>:8501 <USER>@<LOGIN_NODE> | ||
|
|
||
| More notes on different implementation methods and approaches are available in the [implementation summary doc](docs/implementation_summary.md). | ||
| # Access at http://localhost:8501 | ||
| ``` | ||
|
|
||
| ## Acknowledgements | ||
|
|
||
| * [OpenCLIP](https://github.com/mlfoundations/open_clip) | ||
| * [Streamlit](https://streamlit.io/) | ||
| * [Altair](https://altair-viz.github.io/) | ||
|
|
||
| --- | ||
| [OpenCLIP](https://github.com/mlfoundations/open_clip) | [Streamlit](https://streamlit.io/) | [Altair](https://altair-viz.github.io/) | ||
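
The README states that the app detects GPU availability at runtime and falls back to CPU, with backends (cuML, FAISS, sklearn) also selectable in the sidebar. The shared module that does this is not shown in this diff, so the sketch below is only one plausible shape for such a check. The function names, the `"auto"` option, and the mapping from backend label to importable module name are assumptions. It also only tests whether a package is installed; a real implementation would additionally need to handle failures when the GPU itself is unusable.

```python
import importlib.util

# Preference order taken from the PR text: cuML (GPU), then FAISS, then scikit-learn.
# The (label, module name) pairs are assumptions made for this sketch.
BACKEND_MODULES = [("cuml", "cuml"), ("faiss", "faiss"), ("sklearn", "sklearn")]


def is_installed(module_name: str) -> bool:
    """Return True if the module can be located, without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Treat any lookup problem as "not available" and keep going.
        return False


def available_backends(probe=is_installed) -> list:
    """List installed backends in preference order."""
    return [label for label, module in BACKEND_MODULES if probe(module)]


def pick_backend(requested: str = "auto", probe=is_installed) -> str:
    """Honour an explicit request when possible, else use the best available."""
    found = available_backends(probe)
    if not found:
        raise RuntimeError("no clustering backend is installed")
    if requested != "auto" and requested in found:
        return requested
    return found[0]


if __name__ == "__main__":
    print("available:", available_backends())
```

Passing the probe as a parameter keeps the selection logic testable on machines without any GPU packages, for example `pick_backend(probe=lambda name: name != "cuml")` simulates a CPU-only node.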