
“Somewhere between the semantics of natural language and the ambiguity of our understanding of reality, truth remains one of the great mysteries”

Introduction

Hi, I’m @lhallee!

My name is Logan Hallee. I’m a scientist working on computational protein modeling through the lens of machine learning. I’m the Chief Scientific Officer and Founder of Synthyra, a Public Benefit LLC that operates as a research organization and CRO for protein science. I’m also a PhD Candidate in Bioinformatics at the University of Delaware in the Gleghorn Lab, where my research focuses on protein modeling with transformer neural networks. On the side, I write Minds and Molecules, a blog exploring the philosophy behind science and computation.

I’m motivated by safe computational systems that help us better understand the universe at every level of abstraction. Mostly, however, I work toward high-fidelity modeling of the protein universe: efforts that help turn biochemistry into a programmable medium. I believe true biochemical mastery can unlock organic carbon capture, improved crops, efficient circular economies, and major advances in medicine.

You can find my CV here.

Research Highlights

Protein-Protein Interaction Prediction (PPI)

Synteract

  • I've worked on a series of models named Synteract, each of which advanced the field of PPI prediction.
  • Synteract-1 was the first large language model approach for PPI prediction.
    • Its preprint still ranks in the top 3% of research outputs by Altmetric.
    • We showed how negative sampling choices can unintentionally degrade performance (e.g., “accidental localizers”).
  • Synteract-2 was a jointly optimized system that predicted PPI, protein–protein binding affinity, and binding site locations, and was Synthyra's first product. At release, it was the top binding affinity predictor on the Affinity v5.5 and Haddock benchmarks.

Synteract-2 binding affinity (pKd) prediction correlation plot

  • I addressed key confounders in PPI data compilation, most recently the accidental taxonomist phenomenon that arises when training on pLM or pLM-adjacent embeddings.
    • In review at BMC Bioinformatics
  • Synteract-3 was an internal model with a modified workflow relative to Synteract-2, enabling extremely high throughput and full interactome-scale prediction.

Synteract-3 human intra-interactome prediction overview

  • Synteract-4 is Synthyra's current premier product, offering a 10% performance improvement on standardized gold-standard benchmarks over the rest of the field.

Synteract-4 benchmark performance summary (Bernet figure)

Protein binder design

We leveraged Synteract-2 binding affinity predictions alongside our generative model DSM:

  • DSM is the first protein language model (pLM) trained with the LLaDA masked diffusion process, which makes it straightforward to turn pretrained pLMs into generative models (a sampling sketch follows this list).
DSM architecture diagram (masked diffusion protein language model)
DSM paper figure highlighting EGFR binder design results
  • DSM preserved representation quality while generating high-quality proteins.
  • DSM + Synteract-2 was used to increase the binding affinity of the commercial cancer therapeutic Cetuximab (a projected $7B market). At release, our Cetuximab variants bound EGFR with 90% higher affinity than the commercial antibody, and 30% higher than the nearest externally designed variant. The data are available on Proteinbase.
Structure of the best-designed EGFR binder variant
Binding kinetics for EGFR binder variant 10.2
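
For intuition, here is a minimal sketch of LLaDA-style masked diffusion sampling as applied to a pLM: start from an all-[MASK] sequence, predict every masked position, commit a random fraction per step, and repeat until nothing is masked. The interface and toy model below are illustrative stand-ins, not DSM's actual API.

```python
import torch

# Minimal sketch of LLaDA-style masked diffusion sampling (an illustrative
# interface, not DSM's actual API). `model` is any masked-LM that maps token
# ids to per-position logits; `mask_id` is its [MASK] token.
@torch.no_grad()
def masked_diffusion_sample(model, length, mask_id, steps=8):
    ids = torch.full((1, length), mask_id, dtype=torch.long)  # start fully masked
    for step in range(steps):
        pred = model(ids).argmax(dim=-1)           # fill in every position greedily
        still_masked = (ids == mask_id).nonzero()  # positions not yet committed
        # Commit a random fraction of the remaining masks; by the last step
        # everything left is revealed, completing the reverse diffusion.
        n_reveal = max(1, len(still_masked) // (steps - step))
        for b, i in still_masked[torch.randperm(len(still_masked))[:n_reveal]]:
            ids[b, i] = pred[b, i]
    return ids

class ToyMaskedLM(torch.nn.Module):                # stand-in so the sketch runs
    def __init__(self, vocab=33, dim=16):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)
    def forward(self, ids):
        return self.head(self.emb(ids))

print(masked_diffusion_sample(ToyMaskedLM(), length=12, mask_id=32))
```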

Tetris For Proteins

  • Collaborated with Stephen Wolfram & other mentors at the Wolfram Winter School.
  • Developed “Tetris For Proteins”, a shape-based metric emulating "lock-and-key" protein-protein interactions (a toy illustration follows the figure below).
  • Generated hypotheses on protein aggregation likelihood.
Tetris For Proteins: shape-based interaction metric example (panel 1)
Tetris For Proteins: shape-based interaction metric example (panel 2)
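
As a toy illustration of the concept (my own sketch, not the Wolfram Language implementation): rasterize two shapes onto a grid and reward interface contact while penalizing overlap.

```python
import numpy as np

# Toy sketch of a shape-complementarity ("lock-and-key") score; this is a
# hypothetical 2D illustration, not the actual Tetris For Proteins metric.
# A key scores well when it touches the lock along a long interface (contact)
# without occupying the same cells (clash).
def complementarity(lock: np.ndarray, key: np.ndarray) -> float:
    clash = (lock & key).sum()
    neighbors = (np.roll(lock, 1, 0) | np.roll(lock, -1, 0) |
                 np.roll(lock, 1, 1) | np.roll(lock, -1, 1))
    contact = (key & neighbors).sum()             # key cells bordering the lock
    return (contact - clash) / max(key.sum(), 1)

lock = np.zeros((8, 8), bool); lock[2:6, 0:3] = True   # a blocky "lock" shape
key = np.zeros((8, 8), bool); key[2:6, 3:5] = True     # candidate "key" shape
print(f"complementarity: {complementarity(lock, key):.2f}")
```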

Open source projects

Protify

  • Protify is a low-code solution for effectively evaluating and fine-tuning chemical language models.
  • Easy CLI and GUI interfaces.
  • It allows life scientists with no programming expertise to evaluate state-of-the-art models across datasets quickly to identify:
    • The best model for a specific dataset
    • The current limit of the field for a specific problem
  • Protify can build production-grade models with ease; we can match the performance of most state-of-the-art protein language model papers with a single CLI command.
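
For a sense of what that one command replaces, here is the long-hand workflow written with generic libraries (plain transformers + scikit-learn; this is not Protify's API): embed sequences with a pLM, fit a lightweight head, and score it so different models can be compared on one dataset.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

# The long-hand version of what Protify automates (generic libraries, not
# Protify's API): embed sequences with a pLM, then train and score a simple
# head so candidate models can be ranked on a dataset.
def evaluate_plm(model_id, train_seqs, train_y, test_seqs, test_y):
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()

    def embed(seqs):
        with torch.no_grad():
            batch = tok(seqs, return_tensors="pt", padding=True)
            hidden = model(**batch).last_hidden_state   # (batch, length, dim)
            return hidden.mean(dim=1).numpy()           # mean-pool per sequence

    head = LogisticRegression(max_iter=1000).fit(embed(train_seqs), train_y)
    return head.score(embed(test_seqs), test_y)

# Toy labels on toy sequences; in practice you would loop over model ids
# and report the best model per dataset.
acc = evaluate_plm(
    "facebook/esm2_t6_8M_UR50D",                        # a small public pLM
    ["MKTAYIAKQR", "MLLAVLYCLL", "MKTFFVAGNL", "MEEPQSDPSV"], [0, 1, 0, 1],
    ["MKTAYIAKQQ", "MLLAVLYCLA"], [0, 1],
)
print(f"held-out accuracy: {acc:.2f}")
```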

SpeedrunningPLMs

  • SpeedrunningPLMs applies modern NLP techniques, mostly drawn from the NanoGPT speedrun, to BERT-like pLM pretraining (the underlying objective is sketched below).
  • We have reduced the cost of pLM pretraining by over 500×.

SpeedrunningPLMs: protein language model pretraining cost comparison (500× reduction)
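
For orientation, the objective being speedrun is ordinary masked-language-model pretraining; a bare-bones PyTorch step is below. The real repo gets its speed from what it layers on top (optimizer, schedule, data pipeline, kernels), none of which is shown here.

```python
import torch
import torch.nn.functional as F

# One BERT-style masked-LM pretraining step, the objective SpeedrunningPLMs
# optimizes. Bare-bones sketch: mask ~15% of tokens, predict them, and take
# the cross-entropy loss only at the masked positions.
def mlm_step(model, ids, mask_id, vocab_size, mask_prob=0.15):
    labels = ids.clone()
    masked = torch.rand_like(ids, dtype=torch.float) < mask_prob
    labels[~masked] = -100                       # ignore unmasked positions
    corrupted = ids.clone()
    corrupted[masked] = mask_id                  # replace chosen tokens with [MASK]
    logits = model(corrupted)                    # (batch, length, vocab)
    return F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1),
                           ignore_index=-100)

class TinyPLM(torch.nn.Module):                  # toy stand-in for a real pLM
    def __init__(self, vocab=33, dim=32):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)
    def forward(self, ids):
        return self.head(self.emb(ids))

model = TinyPLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
ids = torch.randint(0, 32, (4, 64))              # fake tokenized protein batch
loss = mlm_step(model, ids, mask_id=32, vocab_size=33)
loss.backward(); opt.step()
print(f"loss: {loss.item():.3f}")
```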

FastPLMs

  • FastPLMs is a reimplementation of popular pLMs (ESM2, ESMC, E1) so they can be loaded easily with Hugging Face AutoModel. I also added convenience utilities for efficiently embedding entire datasets.
  • FastPLMs are downloaded via Hugging Face ~300,000 times per month.
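
Loading works like any other Hugging Face checkpoint via trust_remote_code; the model id below is illustrative, so check the Synthyra organization on Hugging Face for the currently published checkpoints.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# FastPLMs checkpoints load through the standard AutoModel path with
# trust_remote_code=True. The model id is illustrative; see the Synthyra org
# on Hugging Face for current checkpoints and the exact output interface.
model_id = "Synthyra/ESM2-8M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()

with torch.no_grad():
    batch = tokenizer(["MKTAYIAKQR"], return_tensors="pt")
    # Assuming the standard BaseModelOutput interface; mean-pool the residue
    # embeddings into one vector per sequence.
    embedding = model(**batch).last_hidden_state.mean(dim=1)
print(embedding.shape)
```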

Additional PhD work

Annotation Vocabulary

  • Invented the Annotation Vocabulary, a unique set of integer tokens mapped to popular protein and gene ontologies (the idea is sketched below).
  • Paired with its own token embeddings, it enabled state-of-the-art protein annotation and generation models.
  • We generated out-of-training-distribution natural-looking sequences that returned BLAST hits and enrichment results consistent with the prompt.

Annotation Vocabulary: generated sequence examples and evaluation summary
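
In spirit, the vocabulary is a lookup from ontology terms to integer ids living in a range disjoint from the amino-acid tokens; here is a hypothetical sketch (illustrative terms and ids, not the published mapping).

```python
# Hypothetical sketch of the Annotation Vocabulary idea (illustrative ids and
# terms, not the published mapping): ontology terms get integer ids disjoint
# from the amino-acid tokens, so one input can mix sequence tokens with
# annotation tokens that prompt or describe the protein.
AA_VOCAB = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}
ANNOTATION_OFFSET = len(AA_VOCAB)
ANNOTATION_VOCAB = {
    term: ANNOTATION_OFFSET + i
    for i, term in enumerate(["GO:0005524", "GO:0016301", "EC:2.7.11.1"])
}

def encode(sequence, annotations):
    # Annotation tokens are prepended as a prompt before the residue tokens.
    return [ANNOTATION_VOCAB[a] for a in annotations] + \
           [AA_VOCAB[aa] for aa in sequence]

print(encode("MKTAYIAKQR", ["GO:0005524", "EC:2.7.11.1"]))
```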

Codon Usage Bias

  • Highlighted codon usage bias as a key biological phenomenon and a valuable feature for machine learning in Nature Scientific Reports.
    • Our models show that codon usage carries a strong phylogenetic signal.
    • Introduced cdsBERT, showcasing cost-effective ways to enhance biological relevance in protein language models via a codon vocabulary (a tokenizer sketch follows the figure below).
    • Synonymous codon embeddings occupied distinct regions of latent space, implying that pLMs can benefit from codon awareness.
cdsBERT: PCA visualization showing structure in codon embedding space
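
The core move is tokenizing the coding sequence at the codon level so that synonymous codons can earn distinct embeddings; here is a minimal tokenizer sketch (illustrative ids, not cdsBERT's actual vocabulary).

```python
from itertools import product

# Minimal codon-level tokenizer in the spirit of cdsBERT (illustrative ids,
# not its actual vocabulary): each of the 64 codons gets its own token, so
# synonymous codons such as GCT/GCC/GCA/GCG (all alanine) can learn distinct
# embeddings instead of collapsing onto a single amino-acid token.
CODON_VOCAB = {"".join(c): i for i, c in enumerate(product("ACGT", repeat=3))}

def tokenize_cds(cds):
    assert len(cds) % 3 == 0, "coding sequence length must be a multiple of 3"
    return [CODON_VOCAB[cds[i:i + 3]] for i in range(0, len(cds), 3)]

print(tokenize_cds("ATGGCTGCCGCA"))  # Met + three synonymous alanine codons
```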

Mixture of Experts Extension

  • Invented a Mixture of Experts extension for scalable transformer networks adept at sentence similarity tasks (a generic sketch of the building block follows the figure below).
    • Future networks with N experts could perform like N independently trained networks, offering significant time and computational savings in semantic retrieval systems.
    • Published in Nature Scientific Reports
Mixture of Experts extension: overview figure from publication
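
A generic sketch of the building block, a token-routed MoE feed-forward layer; the published architecture routes differently, so treat this as illustrative.

```python
import torch
import torch.nn.functional as F

# Generic Mixture-of-Experts feed-forward layer (illustrative, not the
# paper's exact routing): a learned router sends each token to one expert,
# letting N experts specialize the way N independently trained networks
# would while sharing the rest of the transformer.
class MoEFeedForward(torch.nn.Module):
    def __init__(self, dim, num_experts, hidden):
        super().__init__()
        self.router = torch.nn.Linear(dim, num_experts)
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(torch.nn.Linear(dim, hidden), torch.nn.GELU(),
                                torch.nn.Linear(hidden, dim))
            for _ in range(num_experts))

    def forward(self, x):                        # x: (tokens, dim)
        weights = F.softmax(self.router(x), dim=-1)
        choice = weights.argmax(dim=-1)          # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            picked = choice == i
            if picked.any():
                # Scale by the router weight so the gate stays differentiable.
                out[picked] = expert(x[picked]) * weights[picked, i:i + 1]
        return out

layer = MoEFeedForward(dim=32, num_experts=4, hidden=64)
print(layer(torch.randn(10, 32)).shape)          # torch.Size([10, 32])
```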

Computer Vision in Biology

  • Collaborate on lab projects that use deep learning to reconstruct 3D organs from 2D Z-stacks (a reconstruction sketch is below).
  • This work informs morphometric and pharmacokinetic studies that further our understanding of organ structure and function.
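
A sketch of one common reconstruction recipe (segment each slice, stack the masks, extract a surface with marching cubes), using scikit-image and a synthetic volume in place of the lab's actual pipeline.

```python
import numpy as np
from skimage import measure

# One common 3D-from-Z-stack recipe (illustrative, not the lab's pipeline):
# segment each 2D slice, stack the binary masks into a volume, and extract a
# 3D surface mesh with marching cubes. A synthetic sphere stands in for the
# deep-learning per-slice segmentations.
z, y, x = np.mgrid[:32, :64, :64]
organ = ((z - 16) ** 2 + (y - 32) ** 2 + (x - 32) ** 2) < 14 ** 2
masks = np.stack([organ[i] for i in range(organ.shape[0])])  # per-slice masks
verts, faces, _, _ = measure.marching_cubes(masks.astype(np.float32), level=0.5)
print(f"reconstructed mesh: {len(verts)} vertices, {len(faces)} faces")
```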

Additional Projects & Publications


Socials / Websites

Contact

Research inquiries: lhallee@udel.edu

Business inquiries: logan@synthyra.com


Last Updated: January 2026
