EmbedDiff-ESM2 is a comprehensive protein sequence generation pipeline that combines large-scale pretrained protein embeddings (ESM-2) with a latent diffusion model to explore and sample from the vast protein sequence space. It generates novel sequences that preserve semantic and evolutionary properties without relying on explicit structural data, and evaluates them through a suite of biologically meaningful analyses including logistic regression classification.
To run the entire EmbedDiff pipeline from end to end:
```bash
python run_embeddiff_pipeline.py
```

EmbedDiff-ESM2 uses ESM-2 (Evolutionary Scale Modeling v2) to project protein sequences into a high-dimensional latent space rich in evolutionary and functional priors. A denoising latent diffusion model is trained to learn the distribution of these embeddings and generate new ones from random noise. These latent vectors represent plausible protein-like states and are decoded into sequences using a Transformer decoder with configurable stochastic sampling ratios.
The pipeline includes logistic regression analysis to evaluate embedding quality and domain separation, followed by comprehensive sequence validation via entropy analysis, cosine similarity, BLAST alignment, and embedding visualization (t-SNE, MDS). A final HTML report presents all figures and results in an interactive format.
The full EmbedDiff-ESM2 pipeline is modular and proceeds through the following stages:
- Format: A curated FASTA file of real protein sequences (e.g., Thioredoxin reductases from different domains).
- Used as the basis for learning a latent protein representation and decoder training.
- The curated sequences are embedded using the `esm2_t33_650M_UR50D` model (a minimal embedding sketch follows this list).
- This transforms each protein into a 1280-dimensional latent vector.
- These embeddings capture functional and evolutionary constraints without any structural input.
- NEW: A logistic regression classifier is trained on the ESM-2 embeddings to evaluate domain separation.
- Analyzes how well embeddings can distinguish between different protein domains (e.g., archaea, bacteria, fungi).
- Generates confusion matrices and per-class recall plots to assess embedding quality.
- Provides quantitative metrics for embedding discriminative power.
- t-SNE is applied to the real ESM-2 embeddings to visualize the structure of protein space.
- Serves as a baseline to later compare generated (synthetic) embeddings.
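As a concrete reference for the embedding and probing steps above, the sketch below (illustrative only, not the pipeline's `scripts/esm2_embedder.py` or `scripts/logistic_regression_probe_esm2.py`; it assumes the `fair-esm` and `scikit-learn` packages and uses a toy list of sequences with domain labels) embeds sequences with `esm2_t33_650M_UR50D` by mean-pooling per-residue representations and then fits a logistic regression probe:

```python
import torch
import esm  # provided by the fair-esm package
from sklearn.linear_model import LogisticRegression

# Illustrative (name, sequence, domain) records; the real pipeline reads them
# from data/curated_thioredoxin_reductase.fasta.
records = [
    ("seq1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "bacteria"),
    ("seq2", "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERM", "archaea"),
]

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

_, _, tokens = batch_converter([(name, seq) for name, seq, _ in records])
with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=False)
per_residue = out["representations"][33]          # (batch, length + 2, 1280)

# Mean-pool over residue positions (index 0 is BOS; EOS/padding follow the sequence).
embeddings = torch.stack(
    [per_residue[i, 1:len(seq) + 1].mean(0) for i, (_, seq, _) in enumerate(records)]
).numpy()

# Logistic regression probe of domain separation; the pipeline evaluates on a
# held-out split and plots a confusion matrix and per-class recall.
domains = [domain for _, _, domain in records]
probe = LogisticRegression(max_iter=1000).fit(embeddings, domains)
print("training accuracy:", probe.score(embeddings, domains))
```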
Architecture Details:
- Noise Predictor: Multi-layer perceptron (MLP) with 4 hidden layers (1024 → 1024 → 512 → 1280)
- Input Dimension: 1280 (ESM-2 embedding size) + conditional domain labels + timestep embedding
- Activation: ReLU with LayerNorm and Dropout (0.2) for regularization
- Conditional Input: Domain-specific labels (archaea, bacteria, fungi) as one-hot encoded vectors
Diffusion Process:
- Timesteps: 1000 diffusion steps for smooth noise scheduling
- Noise Schedule: Cosine beta schedule with improved stability (β ∈ [0.0001, 0.9999])
- Forward Process: Gradual addition of Gaussian noise following q(x_t | x_0): x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε
- Reverse Process: Learned denoising using p_θ(x_{t−1} | x_t) with noise prediction
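A minimal sketch of this cosine schedule and forward noising step (illustrative; the pipeline's own implementation lives in `models/latent_diffusion.py` and may differ in detail):

```python
import math
import torch

T = 1000  # number of diffusion timesteps

def cosine_beta_schedule(timesteps: int, s: float = 0.008) -> torch.Tensor:
    """Cosine schedule for beta_t, clipped to [0.0001, 0.9999] as stated above."""
    steps = torch.arange(timesteps + 1, dtype=torch.float64)
    f = torch.cos(((steps / timesteps) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1.0 - (alpha_bar[1:] / alpha_bar[:-1])
    return betas.clamp(1e-4, 0.9999).float()

betas = cosine_beta_schedule(T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)  # cumulative product ᾱ_t

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Forward process: x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε."""
    ab = alpha_bar[t].unsqueeze(-1)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise
```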
Training Configuration:
- Batch Size: 32 (optimized for stability)
- Learning Rate: 1e-4 (Adam optimizer)
- Epochs: 300 with early stopping
- Data Split: 80/10/10 (train/val/test) with stratified sampling by domain
- Normalization: ESM-2 embeddings scaled to [-1, 1] range using tanh scaling
Loss Function: Mean squared error (MSE) between predicted and actual noise: L = ‖ε − ε_θ(x_t, t)‖²
This architecture enables the model to learn the complex distribution of protein embeddings and generate novel, biologically plausible latent representations through iterative denoising.
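Putting these pieces together, the sketch below (reusing `T` and `q_sample` from the schedule sketch above) shows one plausible reading of the conditional noise-prediction MLP and a single training step. Layer sizes, dropout, learning rate, and batch size follow this README; the timestep-embedding design and exact layer ordering are assumptions, and the real model is defined in `models/latent_diffusion.py`.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """ε_θ(x_t, t, domain): an MLP over [embedding | one-hot domain | timestep embedding]."""
    def __init__(self, embed_dim: int = 1280, n_domains: int = 3, t_dim: int = 128):
        super().__init__()
        self.t_embed = nn.Sequential(nn.Linear(1, t_dim), nn.ReLU(), nn.Linear(t_dim, t_dim))

        def block(d_in, d_out):
            return nn.Sequential(nn.Linear(d_in, d_out), nn.LayerNorm(d_out),
                                 nn.ReLU(), nn.Dropout(0.2))

        self.net = nn.Sequential(
            block(embed_dim + n_domains + t_dim, 1024),
            block(1024, 1024),
            block(1024, 512),
            nn.Linear(512, embed_dim),   # predicts the added noise ε
        )

    def forward(self, x_t, t, domain_onehot):
        t_emb = self.t_embed(t.float().unsqueeze(-1) / T)   # normalized timestep
        return self.net(torch.cat([x_t, domain_onehot, t_emb], dim=-1))

noise_model = NoisePredictor()
optimizer = torch.optim.Adam(noise_model.parameters(), lr=1e-4)

# One training step on a toy batch standing in for tanh-scaled ESM-2 embeddings.
x0 = torch.randn(32, 1280).tanh()
domain = torch.eye(3)[torch.randint(0, 3, (32,))]
t = torch.randint(0, T, (32,))
noise = torch.randn_like(x0)
x_t = q_sample(x0, t, noise)

loss = nn.functional.mse_loss(noise_model(x_t, t, domain), noise)  # L = ‖ε − ε_θ(x_t, t)‖²
loss.backward()
optimizer.step()
optimizer.zero_grad()
```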
- Starting from pure Gaussian noise, the trained diffusion model is used to generate new latent vectors that resemble real protein embeddings.
- These latent samples are biologically plausible but unseen, representing de novo candidates.
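A compact sketch of that reverse process, using standard DDPM ancestral sampling with σ_t² = β_t and reusing the schedule and `NoisePredictor` from the sketches above (so it is equally illustrative and not the code in `scripts/sample_embeddings_esm2.py`):

```python
import torch

@torch.no_grad()
def sample_embeddings(model, n: int, domain_onehot: torch.Tensor, embed_dim: int = 1280):
    """Start from pure Gaussian noise and iteratively denoise into latent embeddings."""
    x = torch.randn(n, embed_dim)
    for t in reversed(range(T)):
        t_batch = torch.full((n,), t, dtype=torch.long)
        eps = model(x, t_batch, domain_onehot)
        a, ab = alphas[t], alpha_bar[t]
        # Posterior mean of p_θ(x_{t−1} | x_t) given the predicted noise.
        mean = (x - (1.0 - a) / (1.0 - ab).sqrt() * eps) / a.sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)
        else:
            x = mean
    return x

# e.g. 80 candidates conditioned on one domain (the domain index here is arbitrary)
cond = torch.eye(3)[torch.full((80,), 1, dtype=torch.long)]
generated = sample_embeddings(noise_model, 80, cond)
```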
- Real ESM-2 embeddings are paired with their corresponding amino acid sequences.
- This dataset is used to train a decoder to translate from embedding → sequence.
- A Transformer model is trained to autoregressively generate amino acid sequences from input embeddings.
- Label smoothing and entropy filtering are used to improve sequence diversity and biological plausibility.
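One way such a decoder can be wired up is sketched below: the ESM-2 embedding is projected to a single memory token for a standard `nn.TransformerDecoder`, which is trained with teacher forcing and label smoothing. The vocabulary handling, model sizes, and memory-token design are assumptions for illustration, not the architecture in `models/decoder_transformer.py`.

```python
import torch
import torch.nn as nn

AA_VOCAB = "ACDEFGHIKLMNPQRSTVWY"             # the 20 standard amino acids
PAD, BOS = len(AA_VOCAB), len(AA_VOCAB) + 1   # special token ids
VOCAB_SIZE = len(AA_VOCAB) + 2

class EmbeddingToSequence(nn.Module):
    def __init__(self, embed_dim=1280, d_model=512, n_heads=8, n_layers=4, max_len=512):
        super().__init__()
        self.memory_proj = nn.Linear(embed_dim, d_model)   # ESM-2 vector -> one memory token
        self.token_embed = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, latent, tokens):
        # latent: (B, 1280); tokens: (B, L) residue prefix (teacher forcing at train time)
        memory = self.memory_proj(latent).unsqueeze(1)      # (B, 1, d_model)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        tgt = self.token_embed(tokens) + self.pos_embed(positions)
        causal = torch.triu(torch.ones(tokens.size(1), tokens.size(1),
                                       dtype=torch.bool, device=tokens.device), diagonal=1)
        return self.out(self.decoder(tgt, memory, tgt_mask=causal))   # (B, L, vocab)

decoder = EmbeddingToSequence()
criterion = nn.CrossEntropyLoss(ignore_index=PAD, label_smoothing=0.1)
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)

# Toy teacher-forcing step: predict each residue from the BOS-shifted prefix.
latent = torch.randn(4, 1280)                               # stand-in ESM-2 embeddings
target = torch.randint(0, len(AA_VOCAB), (4, 64))           # stand-in residue ids
inputs = torch.cat([torch.full((4, 1), BOS), target[:, :-1]], dim=1)
logits = decoder(latent, inputs)
loss = criterion(logits.reshape(-1, VOCAB_SIZE), target.reshape(-1))
loss.backward()
optimizer.step()
```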
The synthetic embeddings from Step 4 are decoded into amino acid sequences using a hybrid decoding strategy that balances biological realism with diversity.
Current Configuration:
- 60% of amino acid positions are generated stochastically, sampled from the decoder's output distribution.
- 40% are reference-guided, biased toward residues from the closest matching natural sequence.
This configuration produces sequences with approximately 30-55% sequence identity to known proteins, striking a practical balance between novelty and plausibility.
This decoding step is fully configurable:
- Setting the stochastic ratio to 100% yields fully de novo sequences, maximizing novelty.
- Lower stochastic ratios (e.g., 20-30%) increase similarity to natural proteins.
- The ratio can be adjusted in `scripts/transformer_decode_esm2.py`.
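The mixing logic itself is straightforward. The sketch below shows one way such a hybrid strategy can be implemented, assuming an autoregressive `decoder` like the one sketched earlier and a token-encoded nearest natural reference for each generated embedding; names and details are illustrative and do not mirror `scripts/transformer_decode_esm2.py`.

```python
import torch

STOCHASTIC_RATIO = 0.6  # fraction of positions sampled from the decoder distribution

@torch.no_grad()
def hybrid_decode(decoder, latent, reference_ids, ratio=STOCHASTIC_RATIO):
    """Decode one sequence: at each position, sample from the decoder with
    probability `ratio`, otherwise copy the residue of the closest natural reference."""
    tokens = torch.full((1, 1), BOS)          # BOS / AA_VOCAB come from the decoder sketch
    decoded = []
    for position in range(len(reference_ids)):
        logits = decoder(latent.unsqueeze(0), tokens)[0, -1]
        probs = torch.softmax(logits, dim=-1)
        probs[len(AA_VOCAB):] = 0.0           # never emit PAD/BOS
        if torch.rand(1).item() < ratio:
            next_id = torch.multinomial(probs, 1).item()   # stochastic position
        else:
            next_id = reference_ids[position]              # reference-guided position
        decoded.append(next_id)
        tokens = torch.cat([tokens, torch.tensor([[next_id]])], dim=1)
    return "".join(AA_VOCAB[i] for i in decoded)
```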
- A combined t-SNE plot compares the distribution of real and generated embeddings.
- Useful for assessing whether synthetic proteins fall within plausible latent regions.
- Pairwise cosine distances are computed between:
  - Natural vs. natural sequences
  - Natural vs. generated sequences
  - Generated vs. generated sequences
- This helps evaluate diversity and proximity to known protein embeddings.
- Each decoded protein sequence is evaluated using two key metrics:
  - Shannon Entropy: Quantifies amino acid diversity across the sequence.
  - Sequence Identity (via BLAST): Measures similarity to known natural proteins.
- Sequences are filtered based on configurable entropy and identity thresholds.
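A small sketch of these comparisons and the entropy metric, reading the saved embedding files listed in the project structure below; the printed statistics and the example threshold are illustrative, not the pipeline's exact figures or cut-offs:

```python
import numpy as np
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity

real = np.load("embeddings/esm2_embeddings.npy")
generated = np.load("embeddings/sampled_esm2_embeddings.npy")

# Mean pairwise cosine similarity within and across the two sets.
print("real vs. real:     ", cosine_similarity(real, real).mean())
print("gen  vs. gen:      ", cosine_similarity(generated, generated).mean())
print("real vs. generated:", cosine_similarity(real, generated).mean())

def shannon_entropy(sequence: str) -> float:
    """Shannon entropy (in bits) of a sequence's amino acid composition."""
    counts = np.array(list(Counter(sequence).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Example filter; the 3.0-bit threshold is an assumption, not the pipeline's value.
print(shannon_entropy("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ") > 3.0)
```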
- Generated sequences are validated by aligning them against a locally downloaded SwissProt database using `blastp`.
- Outputs a CSV summary with percent identity, E-value, bit score, and alignment details.
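For reference, a local `blastp` run can be driven from Python roughly as follows, assuming the NCBI BLAST+ binaries are installed and a SwissProt database has been built with `makeblastdb`; the database name, E-value cutoff, and output path are placeholders, and this is not the code in `scripts/blastlocal_esm2.py`:

```python
import subprocess

# Align generated sequences against a local SwissProt BLAST database (BLAST XML output).
subprocess.run(
    [
        "blastp",
        "-query", "data/decoded_embeddiff_esm2.fasta",
        "-db", "swissprot",                  # name of the database built with makeblastdb
        "-evalue", "1e-3",                   # illustrative E-value cutoff
        "-outfmt", "5",                      # BLAST XML, as stored under data/blast_results/
        "-out", "data/blast_results/decoded_vs_swissprot.xml",
    ],
    check=True,
)
```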
- All visualizations, metrics, and links to output files are compiled into an interactive HTML report.
- Includes logistic regression results, cosine plots, entropy scatter, identity histograms, and t-SNE projections.
- Allows easy inspection and sharing of results.
```
EmbedDiff_ESM/
├── README.md                                  # Project overview and documentation
├── requirements.txt                           # Python dependencies
├── run_embeddiff_pipeline.py                  # Master pipeline script
│
├── data/                                      # Input and output biological data
│   ├── curated_thioredoxin_reductase.fasta    # Input protein sequences
│   ├── decoded_embeddiff_esm2.fasta           # Generated sequences
│   ├── decoder_dataset_esm2.pt                # Decoder training dataset
│   └── blast_results/                         # BLAST analysis results
│       ├── blast_summary_local_esm2.csv       # BLAST summary
│       └── [individual BLAST XML and FASTA files]
│
├── embeddings/                                # Latent vector representations
│   ├── esm2_embeddings.npy                    # Real sequence embeddings
│   ├── esm2_stats.npz                         # Embedding statistics
│   ├── sampled_esm2_embeddings.npy            # Generated embeddings
│   ├── tsne_coords_esm2.npy                   # t-SNE coordinates
│   └── tsne_labels_esm2.npy                   # t-SNE labels
│
├── figures/                                   # All generated plots and reports
│   ├── fig_tsne_by_domain_esm2.png            # t-SNE by domain
│   ├── logreg_per_class_recall_esm2.png       # Logistic regression recall
│   ├── logreg_confusion_matrix_esm2.png       # Logistic regression confusion matrix
│   ├── fig2b_loss_esm2.png                    # Diffusion training loss
│   ├── fig3a_generated_tsne_esm2.png          # Generated embeddings t-SNE
│   ├── fig5a_decoder_loss_esm2.png            # Decoder training loss
│   ├── fig5a_real_real_cosine_esm2.png        # Real-Real cosine similarity
│   ├── fig5b_gen_gen_cosine_esm2.png          # Generated-Generated cosine similarity
│   ├── fig5c_real_gen_cosine_esm2.png         # Real-Generated cosine similarity
│   ├── fig5b_identity_histogram_esm2.png      # Identity histogram
│   ├── fig5c_entropy_scatter_esm2.png         # Entropy vs Identity scatter
│   ├── fig5d_all_histograms_esm2.png          # All histograms
│   ├── fig5f_tsne_domain_overlay_esm2.png     # t-SNE domain overlay
│   ├── logreg_classification_results_esm2.csv # Logistic regression results
│   └── embeddiff_esm2_summary_report.html     # Final HTML report
│
├── scripts/                                   # Core processing scripts
│   ├── esm2_embedder.py                       # Step 2a: ESM-2 embedding
│   ├── logistic_regression_probe_esm2.py      # Step 2b: Logistic regression analysis
│   ├── first_tsne_embedding_esm2.py           # Step 2c: t-SNE of real embeddings
│   ├── train_embeddiff_esm2.py                # Step 3: Train latent diffusion model
│   ├── sample_embeddings_esm2.py              # Step 4: Sample new embeddings
│   ├── build_decoder_dataset_esm2.py          # Step 5a: Build decoder training set
│   ├── train_transformer_esm2.py              # Step 5b: Train decoder
│   ├── transformer_decode_esm2.py             # Step 6: Decode embeddings to sequences
│   ├── plot_tsne_domain_overlay_esm2.py       # Step 7a: t-SNE comparison
│   ├── cosine_similarity_esm2.py              # Step 7b: Cosine similarity plots
│   ├── plot_entropy_identity_esm2.py          # Step 7c: Entropy vs. identity filter
│   ├── blastlocal_esm2.py                     # Step 7d: Local BLAST alignment
│   └── generate_esm2_report.py                # Step 8: Generate final HTML report
│
├── models/                                    # ML model architectures
│   ├── latent_diffusion.py                    # EmbedDiff-ESM2 diffusion model
│   └── decoder_transformer.py                 # Transformer decoder
│
├── utils/                                     # Utility and helper functions
│   └── esm2_embedder.py                       # ESM-2 embedding utilities
│
└── checkpoints/                               # Model checkpoints
    ├── best_embeddiff_mlp_esm2.pth            # Best diffusion model
    ├── decoder_transformer_best_esm2.pth      # Best decoder model
    └── decoder_transformer_last_esm2.pth      # Last decoder checkpoint
```
```bash
# Clone the repository
git clone <repository-url>
cd EmbedDiff_ESM

# Install dependencies
pip install -r requirements.txt
```

- Place your curated protein sequences in `data/curated_thioredoxin_reductase.fasta`
- Ensure sequences are in FASTA format with domain information in descriptions

```bash
# Run complete pipeline
python run_embeddiff_pipeline.py

# Or skip specific steps
python run_embeddiff_pipeline.py --skip esm2 logreg tsne diffusion
```

- Check generated sequences in `data/decoded_embeddiff_esm2.fasta`
- View all visualizations in the `figures/` directory
- Open `embeddiff_esm2_summary_report.html` for comprehensive results
Edit `scripts/transformer_decode_esm2.py`:

```python
STOCHASTIC_RATIO = 0.6  # 60% stochastic, 40% reference-guided
```

Use the `--skip` flag to skip specific steps:

```bash
python run_embeddiff_pipeline.py --skip esm2 logreg tsne diffusion sample decoder_data decoder_train decode tsne_overlay cosine entropy blast html
```

- Logistic Regression Analysis: Evaluates embedding quality and domain separation
- Configurable Sampling: Adjustable stochastic ratios for sequence generation
- Comprehensive Analysis: Multiple validation metrics and visualizations
- Modular Pipeline: Easy to skip steps or modify individual components
- Interactive Reports: HTML reports with downloadable results
- Biological Validation: BLAST analysis against SwissProt database
Generated sequences can be assessed for structural plausibility using:
- ESMFold: Fast structure prediction
- AlphaFold2: High-accuracy prediction
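For example, a quick ESMFold check on a single generated sequence might look like the sketch below (assumes `fair-esm` is installed with its ESMFold extras; the sequence and output path are placeholders):

```python
import torch
import esm

# Load ESMFold (large download; a GPU is strongly recommended).
folder = esm.pretrained.esmfold_v1().eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # placeholder generated sequence
with torch.no_grad():
    pdb_string = folder.infer_pdb(sequence)

with open("generated_candidate.pdb", "w") as handle:
    handle.write(pdb_string)
```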
| Metric | Value | Description |
|---|---|---|
| Generated Sequences | 240 | High-quality synthetic proteins with domain-specific conditioning |
| Sequence Identity | 37-49% | Range of similarity to real sequences (BLAST validation) |
| Training Epochs | 300 | Diffusion model training with early stopping |
| Batch Size | 32 | Optimized for training stability |
| Learning Rate | 1e-4 | Adam optimizer configuration |
| Timesteps | 1000 | Diffusion process steps for smooth noise scheduling |
| Embedding Dimension | 1280 | ESM-2 latent space size |
| Data Split | 80/10/10 | Train/validation/test ratio with stratified sampling |
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.