🧬 EmbedDiff-ESM: Latent Diffusion Pipeline for De Novo Protein Sequence Generation


EmbedDiff-ESM2 is a comprehensive protein sequence generation pipeline that combines large-scale pretrained protein embeddings (ESM-2) with a latent diffusion model to explore and sample from the vast protein sequence space. It generates novel sequences that preserve semantic and evolutionary properties without relying on explicit structural data, and evaluates them through a suite of biologically meaningful analyses including logistic regression classification.


🚀 Quick Start (1-liner)

To run the entire EmbedDiff pipeline from end to end:

python run_embeddiff_pipeline.py

πŸ” What Is EmbedDiff-ESM2?

EmbedDiff-ESM2 uses ESM-2 (Evolutionary Scale Modeling v2) to project protein sequences into a high-dimensional latent space rich in evolutionary and functional priors. A denoising latent diffusion model is trained to learn the distribution of these embeddings and generate new ones from random noise. These latent vectors represent plausible protein-like states and are decoded into sequences using a Transformer decoder with configurable stochastic sampling ratios.

The pipeline includes logistic regression analysis to evaluate embedding quality and domain separation, followed by comprehensive sequence validation via entropy analysis, cosine similarity, BLAST alignment, and embedding visualization (t-SNE, MDS). A final HTML report presents all figures and results in an interactive format.


📌 Pipeline Overview

The full EmbedDiff-ESM2 pipeline is modular and proceeds through the following stages:

Step 1: Input Dataset

  • Format: A curated FASTA file of real protein sequences (e.g., Thioredoxin reductases from different domains).
  • Used as the basis for learning a latent protein representation and decoder training.

Step 2a: ESM-2 Embedding

  • The curated sequences are embedded using the esm2_t33_650M_UR50D model.
  • This transforms each protein into a 1280-dimensional latent vector.
  • These embeddings capture functional and evolutionary constraints without any structural input.

Step 2b: Logistic Regression Probe Analysis

  • NEW: Logistic regression classifier is trained on ESM-2 embeddings to evaluate domain separation.
  • Analyzes how well embeddings can distinguish between different protein domains (e.g., archaea, bacteria, fungi).
  • Generates confusion matrices and per-class recall plots to assess embedding quality.
  • Provides quantitative metrics for embedding discriminative power.

Step 2c: t-SNE of Real Embeddings

  • t-SNE is applied to the real ESM-2 embeddings to visualize the structure of protein space.
  • Serves as a baseline to later compare generated (synthetic) embeddings.

Step 3: Train EmbedDiff-ESM2 Latent Diffusion Model

Architecture Details:

  • Noise Predictor: Multi-layer perceptron (MLP) with 4 hidden layers (1024β†’1024β†’512β†’1280)
  • Input Dimension: 1280 (ESM-2 embedding size) + conditional domain labels + timestep embedding
  • Activation: ReLU with LayerNorm and Dropout (0.2) for regularization
  • Conditional Input: Domain-specific labels (archaea, bacteria, fungi) as one-hot encoded vectors

Diffusion Process:

  • Timesteps: 1000 diffusion steps for smooth noise scheduling
  • Noise Schedule: Cosine beta schedule with improved stability (Ξ² ∈ [0.0001, 0.9999])
  • Forward Process: Gradual addition of Gaussian noise following q(xβ‚œ|xβ‚€) = βˆšαΎ±β‚œxβ‚€ + √(1-αΎ±β‚œ)Ξ΅
  • Reverse Process: Learned denoising using p_ΞΈ(xβ‚œβ‚‹β‚|xβ‚œ) with noise prediction

Training Configuration:

  • Batch Size: 32 (optimized for stability)
  • Learning Rate: 1e-4 (Adam optimizer)
  • Epochs: 300 with early stopping
  • Data Split: 80/10/10 (train/val/test) with stratified sampling by domain
  • Normalization: ESM-2 embeddings scaled to [-1, 1] range using tanh scaling

Loss Function: Mean squared error (MSE) between predicted and actual noise: L = ||ε − ε_θ(xₜ, t)||²

This architecture enables the model to learn the complex distribution of protein embeddings and generate novel, biologically plausible latent representations through iterative denoising.
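
Put together, one training step follows the standard DDPM recipe; the condensed sketch below reuses the NoisePredictorMLP interface from the sketch above and is only an approximation of train_embeddiff_esm2.py.

import math
import torch
import torch.nn.functional as F

T = 1000  # diffusion timesteps

def cosine_alpha_bar(T, s=0.008):
    # Cosine schedule (Nichol & Dhariwal): cumulative ᾱₜ for t = 1..T.
    steps = torch.arange(T + 1, dtype=torch.float32)
    f = torch.cos((steps / T + s) / (1 + s) * math.pi / 2) ** 2
    return (f / f[0])[1:]

alpha_bar = cosine_alpha_bar(T)

def diffusion_loss(model, x0, domain_onehot):
    # Forward process xₜ = √ᾱₜ·x₀ + √(1−ᾱₜ)·ε, then MSE between true and predicted noise.
    t = torch.randint(0, T, (x0.size(0),))
    ab = alpha_bar[t].unsqueeze(-1)
    eps = torch.randn_like(x0)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    return F.mse_loss(model(x_t, domain_onehot, t), eps)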


Step 4: Sample Synthetic Embeddings

  • Starting from pure Gaussian noise, the trained diffusion model is used to generate new latent vectors that resemble real protein embeddings.
  • These latent samples are biologically plausible but unseen β€” representing de novo candidates.

Step 5a: Build Decoder Dataset

  • Real ESM-2 embeddings are paired with their corresponding amino acid sequences.
  • This dataset is used to train a decoder to translate from embedding β†’ sequence.

Step 5b: Train Transformer Decoder

  • A Transformer model is trained to autoregressively generate amino acid sequences from input embeddings.
  • Label smoothing and entropy filtering are used to improve sequence diversity and biological plausibility.

Step 6: Decode Synthetic Sequences

The synthetic embeddings from Step 4 are decoded into amino acid sequences using a hybrid decoding strategy that balances biological realism with diversity.

Current Configuration:

  • 60% of amino acid positions are generated stochastically, sampled from the decoder's output distribution.
  • 40% are reference-guided, biased toward residues from the closest matching natural sequence.

This configuration produces sequences with approximately 30-55% sequence identity to known proteins, striking a practical balance between novelty and plausibility.
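
The per-position logic behind this hybrid strategy can be illustrated with a toy sketch; the real transformer_decode_esm2.py works on decoder logits and reference alignments, so treat this only as the idea in miniature.

import numpy as np

rng = np.random.default_rng(0)
STOCHASTIC_RATIO = 0.6  # 60% sampled from the decoder distribution, 40% reference-guided
AA = list("ACDEFGHIKLMNPQRSTVWY")

def hybrid_decode(position_probs, reference_seq):
    """position_probs: (L, 20) decoder distribution per position; reference_seq: closest natural sequence."""
    out = []
    for i, probs in enumerate(position_probs):
        if rng.random() < STOCHASTIC_RATIO:
            out.append(rng.choice(AA, p=probs))  # stochastic: sample the decoder's distribution
        else:
            out.append(reference_seq[i])         # reference-guided: bias toward the natural residue (copied here)
    return "".join(out)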

💡 Modular and Adjustable

This decoding step is fully configurable:

  • Setting the stochastic ratio to 100% yields fully de novo sequences, maximizing novelty.
  • Lower stochastic ratios (e.g., 20–30%) increase similarity to natural proteins.
  • The ratio can be adjusted in scripts/transformer_decode_esm2.py.

Step 7a: t-SNE Overlay

  • A combined t-SNE plot compares the distribution of real and generated embeddings.
  • Useful for assessing whether synthetic proteins fall within plausible latent regions.

Step 7b: Cosine Similarity Analysis

  • Pairwise cosine distances are computed between:
    • Natural vs. Natural sequences
    • Natural vs. generated sequences
    • Generated vs. generated sequences
  • This helps evaluate diversity and proximity to known protein embeddings.

Step 7c: Entropy vs. Identity Analysis

  • Each decoded protein sequence is evaluated using two key metrics:
    • Shannon Entropy: Quantifies amino acid diversity across the sequence.
    • Sequence Identity (via BLAST): Measures similarity to known natural proteins.
  • Sequences are filtered based on configurable entropy and identity thresholds.

Step 7d: Local BLAST Validation

  • Generated sequences are validated by aligning them against a locally downloaded SwissProt database using blastp.
  • Outputs a CSV summary with percent identity, E-value, bit score, and alignment details.

Step 8: HTML Summary Report

  • All visualizations, metrics, and links to output files are compiled into an interactive HTML report.
  • Includes logistic regression results, cosine plots, entropy scatter, identity histograms, and t-SNE projections.
  • Allows easy inspection and sharing of results.

📂 Project Structure

EmbedDiff_ESM/
├── README.md                                   # 📘 Project overview and documentation
├── requirements.txt                            # 📦 Python dependencies
├── run_embeddiff_pipeline.py                   # 🚀 Master pipeline script
│
├── data/                                       # 📁 Input and output biological data
│   ├── curated_thioredoxin_reductase.fasta     # Input protein sequences
│   ├── decoded_embeddiff_esm2.fasta            # Generated sequences
│   ├── decoder_dataset_esm2.pt                 # Decoder training dataset
│   └── blast_results/                          # BLAST analysis results
│       ├── blast_summary_local_esm2.csv        # BLAST summary
│       └── [individual BLAST XML and FASTA files]
│
├── embeddings/                                 # 📁 Latent vector representations
│   ├── esm2_embeddings.npy                     # Real sequence embeddings
│   ├── esm2_stats.npz                          # Embedding statistics
│   ├── sampled_esm2_embeddings.npy             # Generated embeddings
│   ├── tsne_coords_esm2.npy                    # t-SNE coordinates
│   └── tsne_labels_esm2.npy                    # t-SNE labels
│
├── figures/                                    # 📁 All generated plots and reports
│   ├── fig_tsne_by_domain_esm2.png             # t-SNE by domain
│   ├── logreg_per_class_recall_esm2.png        # Logistic regression recall
│   ├── logreg_confusion_matrix_esm2.png        # Logistic regression confusion matrix
│   ├── fig2b_loss_esm2.png                     # Diffusion training loss
│   ├── fig3a_generated_tsne_esm2.png           # Generated embeddings t-SNE
│   ├── fig5a_decoder_loss_esm2.png             # Decoder training loss
│   ├── fig5a_real_real_cosine_esm2.png         # Real-Real cosine similarity
│   ├── fig5b_gen_gen_cosine_esm2.png           # Generated-Generated cosine similarity
│   ├── fig5c_real_gen_cosine_esm2.png          # Real-Generated cosine similarity
│   ├── fig5b_identity_histogram_esm2.png       # Identity histogram
│   ├── fig5c_entropy_scatter_esm2.png          # Entropy vs Identity scatter
│   ├── fig5d_all_histograms_esm2.png           # All histograms
│   ├── fig5f_tsne_domain_overlay_esm2.png      # t-SNE domain overlay
│   ├── logreg_classification_results_esm2.csv  # Logistic regression results
│   └── embeddiff_esm2_summary_report.html      # Final HTML report
│
├── scripts/                                    # 📁 Core processing scripts
│   ├── esm2_embedder.py                        # Step 2a: ESM-2 embedding
│   ├── logistic_regression_probe_esm2.py       # Step 2b: Logistic regression analysis
│   ├── first_tsne_embedding_esm2.py            # Step 2c: t-SNE of real embeddings
│   ├── train_embeddiff_esm2.py                 # Step 3: Train latent diffusion model
│   ├── sample_embeddings_esm2.py               # Step 4: Sample new embeddings
│   ├── build_decoder_dataset_esm2.py           # Step 5a: Build decoder training set
│   ├── train_transformer_esm2.py               # Step 5b: Train decoder
│   ├── transformer_decode_esm2.py              # Step 6: Decode embeddings to sequences
│   ├── plot_tsne_domain_overlay_esm2.py        # Step 7a: t-SNE comparison
│   ├── cosine_similarity_esm2.py               # Step 7b: Cosine similarity plots
│   ├── plot_entropy_identity_esm2.py           # Step 7c: Entropy vs. identity filter
│   ├── blastlocal_esm2.py                      # Step 7d: Local BLAST alignment
│   └── generate_esm2_report.py                 # Step 8: Generate final HTML report
│
├── models/                                     # 📁 ML model architectures
│   ├── latent_diffusion.py                     # EmbedDiff-ESM2 diffusion model
│   └── decoder_transformer.py                  # Transformer decoder
│
├── utils/                                      # 📁 Utility and helper functions
│   └── esm2_embedder.py                        # ESM-2 embedding utilities
│
└── checkpoints/                                # 📁 Model checkpoints
    ├── best_embeddiff_mlp_esm2.pth             # Best diffusion model
    ├── decoder_transformer_best_esm2.pth       # Best decoder model
    └── decoder_transformer_last_esm2.pth       # Last decoder checkpoint

🚀 Quick Start

1. Setup Environment

# Clone the repository
git clone <repository-url>
cd EmbedDiff_ESM

# Install dependencies
pip install -r requirements.txt

2. Prepare Data

  • Place your curated protein sequences in data/curated_thioredoxin_reductase.fasta
  • Ensure sequences are in FASTA format with domain information in descriptions

3. Run Full Pipeline

# Run complete pipeline
python run_embeddiff_pipeline.py

# Or skip specific steps
python run_embeddiff_pipeline.py --skip esm2 logreg tsne diffusion

4. View Results

  • Check generated sequences in data/decoded_embeddiff_esm2.fasta
  • View all visualizations in the figures/ directory
  • Open embeddiff_esm2_summary_report.html for comprehensive results

🔧 Configuration Options

Stochastic Ratio Adjustment

Edit scripts/transformer_decode_esm2.py:

STOCHASTIC_RATIO = 0.6  # 60% stochastic, 40% reference-guided

Pipeline Step Control

Use the --skip flag to skip specific steps:

python run_embeddiff_pipeline.py --skip esm2 logreg tsne diffusion sample decoder_data decoder_train decode tsne_overlay cosine entropy blast html

📊 Key Features

  • πŸ” Logistic Regression Analysis: Evaluates embedding quality and domain separation
  • 🎯 Configurable Sampling: Adjustable stochastic ratios for sequence generation
  • πŸ“ˆ Comprehensive Analysis: Multiple validation metrics and visualizations
  • πŸ”„ Modular Pipeline: Easy to skip steps or modify individual components
  • πŸ“± Interactive Reports: HTML reports with downloadable results
  • 🧬 Biological Validation: BLAST analysis against SwissProt database

🧪 Optional: Structural Validation

Generated sequences can be further assessed for structural plausibility using external structure prediction tools.


📊 Performance Metrics

Metric | Value | Description
--- | --- | ---
Generated Sequences | 240 | High-quality synthetic proteins with domain-specific conditioning
Sequence Identity | 37-49% | Range of similarity to real sequences (BLAST validation)
Training Epochs | 300 | Diffusion model training with early stopping
Batch Size | 32 | Optimized for training stability
Learning Rate | 1e-4 | Adam optimizer configuration
Timesteps | 1000 | Diffusion process steps for smooth noise scheduling
Embedding Dimension | 1280 | ESM-2 latent space size
Data Split | 80/10/10 | Train/validation/test ratio with stratified sampling

🤝 Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
