“Somewhere in the semantics of natural language and the ambiguity of our understanding of reality lies truth, one of the great mysteries.”
Hi, I’m @lhallee!
My name is Logan Hallee. I’m a scientist working on computational protein modeling through the lens of machine learning. I’m the Chief Scientific Officer and Founder of Synthyra, a Public Benefit LLC that operates as a research organization and CRO for protein science. I’m also a PhD Candidate in Bioinformatics at the University of Delaware in the Gleghorn Lab, where my research focuses on protein modeling with transformer neural networks. On the side, I write Minds and Molecules, a blog exploring the philosophy behind science and computation.
I’m motivated by safe computational systems that help us better understand the universe at every level of abstraction. Mostly, however, I work toward high-fidelity modeling of the protein universe - efforts that help turn biochemistry into a programmable medium. I believe true biochemical mastery can unlock organic carbon capture, improved crops, efficient circular economies, and major advances in medicine.
You can find my CV here
- I've worked on a series of models named Synteract, which have made various contributions to the field of protein–protein interaction (PPI) prediction.
- Synteract-1 was the first large language model approach for PPI prediction.
- Its preprint still ranks in the top 3% of research outputs by Altmetric.
- We showed how negative sampling choices can unintentionally degrade performance (e.g., “accidental localizers”).
- Synteract-2 was a jointly optimized system that predicted PPI, protein–protein binding affinity, and binding site locations, and was Synthyra's first product. At release, it was the top binding affinity predictor on the Affinity v5.5 and Haddock benchmarks.
- I addressed key confounders in PPI data compilation—most recently the accidental taxonomist phenomenon when training from pLM or adjacent embeddings.
- In review at BMC Bioinformatics
- Synteract-3 was an internal model with a modified workflow relative to Synteract-2, enabling extremely high throughput and full interactome-scale prediction.
- Synteract-4 is Synthyra's current premier product, delivering a 10% performance improvement on standardized gold-standard benchmarks over the best competing methods in the field.
We leveraged Synteract-2 binding affinity predictions alongside our generative model DSM:
- DSM is the first protein language model (pLM) trained with the LLaDA masked diffusion process, making it straightforward to extend pretrained pLMs into generative models.
- DSM preserved representation quality while generating high-quality proteins.
- DSM + Synteract-2 was used to increase the binding affinity of the commercial cancer treatment Cetuximab (projected $7B market cap). At release, our Cetuximab variants had 90% higher binding affinity to EGFR versus the commercial option, and 30% higher than the nearest externally designed variant. The data are available on Proteinbase.
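The masked diffusion idea behind DSM can be illustrated in a few lines. This is a toy sketch of the LLaDA-style reverse process (start fully masked, then iteratively predict and unmask positions), with a random stand-in for the real model; `dummy_model`, the `#` mask symbol, and the unmasking schedule are illustrative assumptions, not DSM's actual implementation.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"

def dummy_model(seq):
    """Stand-in for a pLM: proposes a residue for each masked position.
    A real model would output a probability distribution per position."""
    return {i: random.choice(AMINO_ACIDS) for i, aa in enumerate(seq) if aa == MASK}

def masked_diffusion_generate(length, steps=4, seed=0):
    """LLaDA-style reverse process: start fully masked, then at each step
    predict all masked positions and commit (unmask) a growing fraction."""
    random.seed(seed)
    seq = [MASK] * length
    for step in range(steps):
        preds = dummy_model(seq)
        masked = list(preds)
        # Unmask enough positions so the sequence completes by the last step.
        n_keep = max(1, round(len(masked) / (steps - step)))
        for i in random.sample(masked, min(n_keep, len(masked))):
            seq[i] = preds[i]
    # Safety pass: fill any positions still masked after the schedule.
    for i, aa in dummy_model(seq).items():
        seq[i] = aa
    return "".join(seq)
```

With a pretrained pLM in place of `dummy_model`, the same loop turns a masked-language model into a sequence generator, which is the extension DSM demonstrates.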
- Collaborated with Stephen Wolfram & other mentors at the Wolfram Winter School.
- Developed “Tetris For Proteins” – a shape-based metric emulating "lock-and-key" protein-protein interactions.
- Generated hypotheses on protein aggregation likelihood.
- Protify is a low-code solution for effectively evaluating and fine-tuning chemical language models.
- Easy CLI and GUI interfaces.
- It allows life scientists with no programming expertise to evaluate state-of-the-art models across datasets quickly to identify:
- The best model for a specific dataset
- The current limit of the field for a specific problem
- Protify can build production-grade models with ease; a single CLI command matches the performance of most state-of-the-art protein language model papers.
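The evaluation pattern that low-code tools like this automate — embed sequences with a frozen model, then fit a cheap probe on top — can be sketched as follows. The toy `embed` function (amino-acid composition) and the nearest-centroid probe are illustrative stand-ins, not Protify's actual components.

```python
def embed(seq):
    """Toy embedding: amino-acid composition vector. A real pLM would
    return a learned, mean-pooled representation instead."""
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    return [seq.count(a) / len(seq) for a in alphabet]

def nearest_centroid_fit(X, y):
    """Fit a nearest-centroid probe on frozen embeddings: one mean
    vector per class label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared L2)."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda lab: dist(centroids[lab], x))
```

Swapping the embedder lets the same probe compare models on a dataset, which is how a tool can quickly rank the best model for a specific problem.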
- SpeedrunningPLMs is an attempt to apply modern NLP techniques, largely inspired by the NanoGPT speedrun, to BERT-like pLM pretraining.
- We have reduced the cost of pLM pretraining by over 500×.
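At the heart of BERT-like pLM pretraining is the masked-language-modeling corruption step. Here is a minimal sketch of the standard 80/10/10 scheme (an illustration of the objective, not the SpeedrunningPLMs code):

```python
import random

def apply_bert_masking(tokens, vocab, mask_token="<mask>", p=0.15, seed=0):
    """BERT-style MLM corruption: select ~p of positions; of those,
    80% become <mask>, 10% become a random token, 10% stay unchanged.
    Targets are recorded only at selected positions (None elsewhere)."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: leave the original token in place
    return corrupted, targets
```

The loss is computed only where `targets` is not `None`; everything else in a pretraining speedrun (architecture, optimizer, data pipeline) is about driving this objective's cost down.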
- FastPLMs is a reimplementation of popular pLMs (ESM2, ESMC, E1) so they can be loaded easily with Hugging Face AutoModel, with added convenience utilities for efficiently embedding entire datasets.
- FastPLMs are downloaded via Hugging Face ~300,000 times per month.
- Invented the Annotation Vocabulary, a unique set of integers mapped to popular protein and gene ontologies.
- Enabled state-of-the-art protein annotation and generation models when paired with its own token embedding.
- We generated out-of-training-distribution natural-looking sequences that returned BLAST hits and enrichment results consistent with the prompt.
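The core idea — giving ontology terms their own integer token ids alongside the amino-acid vocabulary so one model can mix sequence and annotation tokens — can be sketched like this. The GO identifiers and special tokens below are illustrative assumptions, not the published vocabulary:

```python
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
SPECIALS = ["<pad>", "<mask>", "<sep>"]

def build_vocab(annotation_terms):
    """Extend a base amino-acid vocabulary with one integer id per
    ontology term (e.g., GO accessions), after a few special tokens."""
    tokens = SPECIALS + AMINO_ACIDS + sorted(annotation_terms)
    return {tok: i for i, tok in enumerate(tokens)}

def encode(vocab, sequence, annotations):
    """Encode a protein as sequence tokens, a separator, then its
    annotation tokens — a single stream a transformer can model."""
    return ([vocab[aa] for aa in sequence]
            + [vocab["<sep>"]]
            + [vocab[t] for t in annotations])
```

Because annotations and residues share one token space, the same model can be prompted with annotations to generate sequences, or with sequences to predict annotations.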
- Codon usage bias is highlighted as a key biological phenomenon and a valuable feature for machine learning in Scientific Reports.
- Our models show that codon usage carries a strong phylogenetic signal.
- Introduced cdsBERT, showcasing cost-effective ways to enhance biological relevance in protein language models via a codon vocabulary.
- Synonymous codons occupied distinct regions of latent space, implying that pLMs can benefit from codon awareness.
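As a concrete illustration of codon-level features, here is a minimal codon-usage calculator. Note that synonymous codons (e.g., AAA and AAG, both lysine) are counted separately — exactly the signal a codon vocabulary preserves and an amino-acid vocabulary discards. This is an illustrative sketch, not the cdsBERT pipeline:

```python
from collections import Counter

def codon_usage(cds):
    """Relative codon frequencies for a coding sequence.
    CDS length must be a multiple of 3 (whole codons only)."""
    assert len(cds) % 3 == 0, "CDS length must be a multiple of 3"
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    counts = Counter(codons)
    total = len(codons)
    # Per-codon composition: the kind of feature a codon-aware model exploits.
    return {codon: n / total for codon, n in counts.items()}
```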
- Invented a Mixture of Experts extension for scalable transformer networks adept at sentence similarity tasks.
- Future networks with N experts could perform like N independently trained networks, offering significant time and computational savings in semantic retrieval systems.
- Published in Scientific Reports.
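Top-1 expert routing, the mechanism at the core of Mixture-of-Experts layers, can be sketched in plain Python. The linear gate and toy experts below are illustrative assumptions, not the published architecture:

```python
def moe_forward(x, gate_weights, experts):
    """Top-1 MoE routing: the gate scores each expert for input x,
    and only the best-scoring expert processes x.

    x: feature vector; gate_weights: one weight vector per expert;
    experts: list of callables (the expert sub-networks)."""
    scores = [sum(w * v for w, v in zip(gw, x)) for gw in gate_weights]
    best = max(range(len(experts)), key=lambda i: scores[i])
    return experts[best](x), best
```

Because each input activates only one expert, N experts add capacity without N times the compute — the property that lets such a network behave like N independently trained networks for retrieval workloads.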
- Collaborate on lab projects involving deep learning for reconstructing 3D organs from 2D Z-stacks.
- Informs morphometric and pharmacokinetic studies to further our understanding of organ structure and function.
- featureranker: A Python package for feature ranking.
- Textbook Chapter on Protein Language Models.
- Machine Learning for identifying cardioprotective molecules in minority groups.
- Investigations of Hsp90 and Gamma secretase in cardiac disease.
Research inquiries: lhallee@udel.edu
Business inquiries: logan@synthyra.com
Last Updated: January 2026