A comprehensive implementation of the EmbedDiff pipeline for generating novel protein sequences using Dayhoff embeddings and latent diffusion models.
HTML Report | Run EmbedDiff-Dayhoff | License: MIT
EmbedDiff-Dayhoff is a modular pipeline for de novo protein sequence generation that combines pretrained Microsoft Dayhoff embeddings, a latent diffusion model, and Transformer-based decoding. It enables efficient exploration of the protein sequence landscape, generating novel sequences that preserve evolutionary plausibility, functional diversity, and foldability, without requiring structural supervision.
This repository implements an ablation study comparing Dayhoff vs ESM-2 embeddings for protein generation.
EmbedDiff-Dayhoff is an ablation study that extends the original EmbedDiff pipeline by swapping the embedding backbone from ESM-2 to Microsoft Dayhoff Atlas.
This repository isolates the key question: How do Dayhoff embeddings (trained on clustered UniRef) affect de novo sequence generation compared to the ESM-2 baseline?
- Embedding Backbone: ESM-2 → Dayhoff Atlas (default: `microsoft/Dayhoff-3b-UR90`)
- End-to-End Dayhoff Scripts: All pipeline steps are Dayhoff-specific (`*_dayhoff.py`)
- Dimension-Agnostic: Auto-detects embedding dimensions from saved `.npy` files
- Mamba Compatibility: Handles the Jamba/Mamba architecture with CPU support

Model size: `Dayhoff-3b-UR90` is ~12 GB (3 shards). Use `--batch_size` conservatively on CPU/MPS.

Mamba kernels: Loaded with `use_mamba_kernels=False` to avoid CUDA-only kernel warnings on Mac.
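For orientation, here is a minimal sketch of how a checkpoint like this might be loaded on CPU/MPS with the Mamba kernels disabled and used to embed one sequence. The authoritative code lives in `utils/dayhoff_embedder.py`; everything beyond the model name and `use_mamba_kernels=False` (the `AutoModel` class, `trust_remote_code`, and the mean-pooling step) is an assumption.

```python
# Illustrative sketch only; see utils/dayhoff_embedder.py for the real implementation.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/Dayhoff-3b-UR90"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,     # assumption: the Jamba/Mamba hybrid ships custom model code
    use_mamba_kernels=False,    # avoid CUDA-only kernel warnings on CPU/MPS (see note above)
    torch_dtype=torch.float32,
).eval()

# Embed a single sequence by mean-pooling the final hidden states (pooling choice is an assumption).
with torch.no_grad():
    tokens = tokenizer("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", return_tensors="pt")
    hidden = model(**tokens, output_hidden_states=True).hidden_states[-1]
    embedding = hidden.mean(dim=1).squeeze(0)   # expected shape: (1280,)
```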
- Original EmbedDiff: ESM-2 + Latent Diffusion Pipeline
- This Study: Dayhoff + Latent Diffusion Pipeline (current repository)
- Comparison: Evaluate embedding model effects on protein generation quality
This repository implements a complete protein generation pipeline that:
- Generates protein embeddings using the Microsoft Dayhoff protein language model
- Trains a latent diffusion model to learn the distribution of protein embeddings
- Samples synthetic embeddings from the learned distribution
- Reconstructs protein sequences using a transformer decoder
- Evaluates generated sequences through comprehensive analysis and visualization
- Dayhoff Embeddings: Uses `microsoft/Dayhoff-3b-UR90` for high-quality protein representations
- Latent Diffusion: Implements cosine noise scheduling with improved normalization
- Transformer Decoder: Reconstructs sequences from embeddings with high fidelity
- Comprehensive Analysis: t-SNE visualization, similarity analysis, quality metrics, and BLAST evaluation
- Professional Reporting: Generates HTML reports with all visualizations and results
Our pipeline successfully:
- ✅ Generated 240 high-quality synthetic protein sequences
- ✅ Achieved 32-68% sequence identity (most around 55-65%)
- ✅ Trained diffusion model with cosine noise schedule and [-1,1] normalization
- ✅ Trained transformer decoder with 15% loss improvement over 34 epochs
- ✅ Maintained biological plausibility through domain-aware embedding generation
View Full HTML Report - Comprehensive analysis with all 13 figures, metrics, and downloadable data
The HTML report contains:
- All generated visualizations and analysis plots
- Performance metrics and training curves
- Sequence quality assessments and BLAST results
- Downloadable FASTA files and CSV data
- Professional presentation of all pipeline outputs
Real Protein Sequences → Dayhoff Embeddings → Latent Diffusion Model → Synthetic Embeddings → Transformer Decoder → Novel Protein Sequences
- Dayhoff Embedder (`utils/dayhoff_embedder.py`)
  - Generates 1280-dimensional embeddings using Microsoft's Dayhoff-3B model
  - Handles the Jamba/Mamba architecture with CPU compatibility
  - Supports batch processing and custom device selection
- Latent Diffusion Model (`models/latent_diffusion.py`)
  - MLP-based noise predictor with dynamic timestep scaling
  - Cosine beta schedule for smooth noise addition (see the sketch after this list)
  - Configurable timesteps (default: 1000) and learning parameters
- Transformer Decoder (`models/decoder_transformer.py`)
  - 4-layer transformer architecture with 512 embedding dimensions (see the skeleton after this list)
  - Trained to reconstruct protein sequences from embeddings
  - Early stopping and model checkpointing
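The cosine beta schedule mentioned above is typically built the way Nichol & Dhariwal describe; a self-contained sketch follows, where the offset `s = 0.008` is an assumption rather than a value taken from `models/latent_diffusion.py`.

```python
# Cosine beta schedule (illustrative; constants may differ from models/latent_diffusion.py).
import torch

def cosine_beta_schedule(timesteps: int = 1000, s: float = 0.008) -> torch.Tensor:
    """Betas whose cumulative alpha product follows a squared-cosine curve."""
    steps = torch.arange(timesteps + 1, dtype=torch.float64)
    f = torch.cos(((steps / timesteps) + s) / (1 + s) * torch.pi / 2) ** 2
    alphas_cumprod = f / f[0]
    betas = 1.0 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return betas.clamp(1e-8, 0.999).float()

betas = cosine_beta_schedule(1000)  # matches the default timestep count above
```

And here is a bare-bones skeleton of an embedding-conditioned decoder with the layer sizes quoted above (4 layers, 512 model dimensions, 8 attention heads). The vocabulary size, conditioning scheme, and the absence of positional encodings are simplifications; `models/decoder_transformer.py` is the actual architecture.

```python
# Simplified decoder skeleton (illustrative; not the repository's exact architecture).
import torch
import torch.nn as nn

class EmbeddingToSequenceDecoder(nn.Module):
    def __init__(self, embed_dim=1280, d_model=512, n_heads=8, n_layers=4, vocab_size=24):
        super().__init__()
        self.condition_proj = nn.Linear(embed_dim, d_model)    # protein embedding -> memory token
        self.token_embed = nn.Embedding(vocab_size, d_model)   # 20 amino acids + special tokens
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, protein_embedding, token_ids):
        memory = self.condition_proj(protein_embedding).unsqueeze(1)        # (B, 1, d_model)
        tgt = self.token_embed(token_ids)                                   # (B, L, d_model)
        causal = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1)).to(tgt.device)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(hidden)                                         # (B, L, vocab_size)
```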
Repository layout:

```
EmbedDiff_Dayhoff/
├── data/                          # Input/output data files
│   ├── curated_thioredoxin_reductase.fasta
│   ├── thioredoxin_reductase.fasta
│   ├── decoded_embeddiff_dayhoff.fasta
│   └── blast_results/             # BLAST analysis results
├── embeddings/                    # Generated embeddings
│   ├── dayhoff_embeddings.npy
│   └── sampled_dayhoff_embeddings.npy
├── figures/                       # All generated visualizations
│   ├── fig_tsne_by_domain_dayhoff.png
│   ├── fig2b_loss_dayhoff.png
│   ├── fig3a_generated_tsne_dayhoff.png
│   ├── fig5a_decoder_loss_dayhoff.png
│   ├── fig5a_real_real_cosine_dayhoff.png
│   ├── fig5b_gen_gen_cosine_dayhoff.png
│   ├── fig5c_real_gen_cosine_dayhoff.png
│   ├── fig5b_identity_histogram_dayhoff.png
│   ├── fig5c_entropy_scatter_dayhoff.png
│   ├── fig5d_all_histograms_dayhoff.png
│   ├── fig5f_tsne_domain_overlay_dayhoff.png
│   ├── logreg_per_class_recall_dayhoff.png
│   └── logreg_confusion_matrix_dayhoff.png
├── models/                        # Model architectures
│   ├── latent_diffusion.py
│   └── decoder_transformer.py
├── scripts/                       # Pipeline execution scripts
│   ├── run_embeddiff_pipeline_dayhoff.py
│   ├── generate_dayhoff_embeddings.py
│   ├── train_embeddiff_dayhoff.py
│   ├── sample_embeddings_dayhoff.py
│   ├── build_decoder_dataset_dayhoff.py
│   ├── train_transformer_dayhoff.py
│   ├── transformer_decode_dayhoff.py
│   ├── plot_tsne_by_domain_dayhoff.py
│   ├── plot_tsne_domain_overlay_dayhoff.py
│   ├── cosine_similarity_dayhoff.py
│   ├── plot_entropy_identity_dayhoff.py
│   ├── plot_blast_identity_vs_evalue_dayhoff.py
│   ├── blastlocal_dayhoff.py
│   └── generate_dayhoff_report.py
├── utils/                         # Utility functions
│   ├── dayhoff_embedder.py
│   └── esm_embedder.py
├── checkpoints/                   # Trained model checkpoints
├── notebooks/                     # Jupyter notebooks for exploration
├── requirements.txt               # Python dependencies
└── embeddiff_dayhoff_summary_report.html   # Comprehensive results report
```
- Python 3.8+
- PyTorch 2.3.1+
- CUDA (optional, for GPU acceleration)
- Clone the repository

  ```bash
  git clone https://github.com/mgarsamo/EmbedDiff-Dayhoff.git
  cd EmbedDiff-Dayhoff
  ```

- Create virtual environment

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```
Execute the entire EmbedDiff-Dayhoff pipeline with one command:
```bash
# Activate your environment first
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Run the complete pipeline
python run_embeddiff_pipeline_dayhoff.py
```

What this does:

- ✅ Generates Dayhoff embeddings from your protein sequences
- ✅ Trains the latent diffusion model
- ✅ Samples synthetic embeddings
- ✅ Trains the transformer decoder
- ✅ Decodes sequences and runs all analyses
- ✅ Generates a comprehensive HTML report
Expected time: 2-4 hours depending on your hardware
Skip specific steps if you want to resume from a certain point:
```bash
# Skip embedding generation (if you already have embeddings)
python run_embeddiff_pipeline_dayhoff.py --skip dayhoff

# Skip BLAST analysis (if you don't have BLAST+ installed)
python run_embeddiff_pipeline_dayhoff.py --skip blast

# Skip multiple steps
python run_embeddiff_pipeline_dayhoff.py --skip dayhoff tsne diffusion

# Available skip options: dayhoff, tsne, diffusion, sample, decoder_data, decoder_train, decode, tsne_overlay, cosine, entropy, blast, html
```

Run individual components for debugging or customization:
```bash
# Step 1: Generate Dayhoff embeddings
python utils/dayhoff_embedder.py --input data/curated_thioredoxin_reductase.fasta --output embeddings/dayhoff_embeddings.npy

# Step 2: Visualize real embeddings
python scripts/plot_tsne_by_domain_dayhoff.py

# Step 3: Train diffusion model
python scripts/train_embeddiff_dayhoff.py

# Step 4: Sample synthetic embeddings
python scripts/sample_embeddings_dayhoff.py

# Step 5: Build decoder dataset
python scripts/build_decoder_dataset_dayhoff.py

# Step 6: Train transformer decoder
python scripts/train_transformer_dayhoff.py

# Step 7: Decode to sequences
python scripts/transformer_decode_dayhoff.py

# Step 8: Generate HTML report
python scripts/generate_dayhoff_report.py
```

Want to test the pipeline quickly? Use a smaller dataset:
```bash
# Create a small test dataset
head -20 data/curated_thioredoxin_reductase.fasta > data/test_dataset.fasta

# Run pipeline on test data
python utils/dayhoff_embedder.py --input data/test_dataset.fasta --output embeddings/test_embeddings.npy
python scripts/plot_tsne_by_domain_dayhoff.py
```

After running the pipeline, view your results:
```bash
# Open the comprehensive HTML report
open embeddiff_dayhoff_summary_report.html

# Or view individual figures
ls figures/
```

After running the complete pipeline, you'll have:
- `embeddings/dayhoff_embeddings.npy` - Real protein embeddings (1280D)
- `embeddings/sampled_dayhoff_embeddings.npy` - Synthetic embeddings
- `data/decoded_embeddiff_dayhoff.fasta` - 240 generated protein sequences
- `checkpoints/` - Trained model checkpoints
- `figures/` - 13 comprehensive analysis plots
- `embeddiff_dayhoff_summary_report.html` - Complete results report
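A quick way to sanity-check these outputs from Python; file names are as listed above, and the expected shapes reflect the 1280-dimensional embeddings and 240 generated sequences reported in this README.

```python
# Quick sanity checks on the pipeline outputs listed above.
import numpy as np

real = np.load("embeddings/dayhoff_embeddings.npy")
synthetic = np.load("embeddings/sampled_dayhoff_embeddings.npy")
print("real embeddings:     ", real.shape)       # (N_real, 1280)
print("synthetic embeddings:", synthetic.shape)  # expected (240, 1280)

with open("data/decoded_embeddiff_dayhoff.fasta") as fh:
    n_records = sum(1 for line in fh if line.startswith(">"))
print("decoded sequences:   ", n_records)        # expected 240
```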
- Sequence Generation: 240 high-quality synthetic proteins
- Identity Range: 32-68% similarity to real sequences
- Classification Accuracy: 92% domain prediction performance
- Training Progress: Loss curves and convergence metrics
- Quality Validation: Entropy, identity, and BLAST analysis
- Domain Separation - How well Dayhoff separates biological domains
- Classification Performance - Logistic regression accuracy metrics
- Diffusion Training - Model convergence and loss reduction
- Generated Embeddings - Synthetic vs. real embedding comparison
- Sequence Quality - Identity distributions and entropy analysis
- Similarity Analysis - Cosine similarity between sequence types
- Clear separation of bacteria, fungi, and archaea in embedding space
- Demonstrates Dayhoff model's ability to capture biological relationships
- 92% overall accuracy in domain classification
- Strong per-class recall: Archaea (89%), Bacteria (84%), Fungi (99%)
- Successful training with cosine noise schedule
- Loss reduction from 12.66 to 10.79 (15% improvement)
- Synthetic embeddings overlap with real protein distributions
- Maintains biological plausibility across domains
- Identity range: 32-68% (most around 55-65%)
- Entropy threshold: All sequences exceed 2.8 bits of Shannon entropy (see the sketch below)
- Quality filtering: Comprehensive validation of generated sequences
- High cosine similarity between real and generated sequences
- Generated sequences show internal coherence and diversity
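A minimal sketch of the Shannon-entropy check referenced above; the pipeline's own version lives in `scripts/plot_entropy_identity_dayhoff.py` and may compute it differently.

```python
# Per-sequence Shannon entropy over amino-acid composition (illustrative).
import math
from collections import Counter

def shannon_entropy(sequence: str) -> float:
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Sequences passing the filter described above exceed 2.8 bits.
print(round(shannon_entropy("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"), 2))
```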
- Dayhoff Model: `microsoft/Dayhoff-3b-UR90` (3B parameters, 1280D embeddings)
- Diffusion Model: MLP noise predictor with 1000 timesteps
- Transformer Decoder: 4 layers, 512 embedding dims, 8 attention heads
- Training: Adam optimizer, learning rate 1e-4, batch size 32
- Cosine Noise Schedule: Smoother noise addition for better training stability
- [-1,1] Normalization: Improved embedding scaling for diffusion models
- Dynamic Timestep Scaling: Adaptive normalization based on total timesteps
- CPU Compatibility: Mamba kernel disabling for broad accessibility
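To make the normalization and timestep-scaling ideas concrete, here is a small sketch. The per-dimension min-max scaling, the epsilon, and the way the timestep is normalized are assumptions; the pipeline scripts own the actual logic and whatever scaling statistics they persist.

```python
# Illustrative [-1, 1] scaling of embeddings plus dynamic timestep normalization.
import numpy as np

def to_unit_range(x: np.ndarray):
    """Map each embedding dimension into [-1, 1]; return the stats needed to invert it."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    scaled = 2.0 * (x - lo) / (hi - lo + 1e-8) - 1.0
    return scaled, (lo, hi)

def from_unit_range(scaled: np.ndarray, stats):
    lo, hi = stats
    return (scaled + 1.0) / 2.0 * (hi - lo + 1e-8) + lo

embeddings = np.load("embeddings/dayhoff_embeddings.npy")
scaled, stats = to_unit_range(embeddings)       # train the diffusion model on `scaled`

# Dynamic timestep scaling: condition the noise predictor on t / T rather than a raw index.
T = 1000
t = 512
t_norm = t / T
```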
| Metric | Value | Description |
|---|---|---|
| Generated Sequences | 240 | High-quality synthetic proteins |
| Sequence Identity | 32-68% | Range of similarity to real sequences |
| Classification Accuracy | 92% | Domain prediction performance |
| Training Epochs | 34 | Transformer decoder training |
| Loss Improvement | 15% | Diffusion model training progress |
- Drug Discovery: Generate novel protein therapeutics
- Protein Engineering: Design proteins with specific functions
- Evolutionary Studies: Understand protein sequence space
- Bioinformatics Research: Explore protein sequence relationships
This repository enables direct comparison with the original EmbedDiff ESM-2 pipeline:
- Run both pipelines on the same input dataset
- Compare key metrics:
  - Sequence identity distributions
  - Training loss curves
  - t-SNE embedding distributions
  - Classification performance
  - BLAST validation results
- Evaluate differences in:
  - Embedding quality: Domain separation and biological relationships
  - Generation diversity: Novelty vs. biological plausibility
  - Training stability: Convergence and loss patterns
  - Computational efficiency: Model size and inference speed
- Does Dayhoff's UniRef clustering improve domain-aware generation?
- How do Dayhoff's 1280D embeddings compare to ESM-2's 1280D embeddings?
- Which embedding model produces more biologically plausible sequences?
- What are the trade-offs between model size and generation quality?
```bash
# ESM-2 baseline (from original repository)
git clone https://github.com/mgarsamo/EmbedDiff.git
cd EmbedDiff
python run_embeddiff_pipeline.py

# Dayhoff ablation (current repository)
git clone https://github.com/mgarsamo/EmbedDiff-Dayhoff.git
cd EmbedDiff-Dayhoff
python run_embeddiff_pipeline_dayhoff.py

# Compare results in respective HTML reports
```

We welcome contributions! Please feel free to:
- Submit issues and feature requests
- Contribute code improvements
- Share research applications and results
- Improve documentation
This project is licensed under the MIT License - see the LICENSE file for details.
- Microsoft Research for the Dayhoff protein language models
- Hugging Face for the transformers library
- PyTorch team for the deep learning framework
- Bioinformatics community for protein analysis tools
- GitHub: @mgarsamo
- Repository: EmbedDiff-Dayhoff
Last Updated: August 28, 2025
Pipeline Status: ✅ Complete and Fully Functional
Results: Available in `embeddiff_dayhoff_summary_report.html`