Converting protein tertiary structure into discrete tokens via vector-quantized variational autoencoders (VQ-VAEs) creates a language of 3D geometry and provides a natural interface between sequence and structure models. While pose invariance is commonly enforced, retaining chirality and directional cues without sacrificing reconstruction accuracy remains challenging. In this paper, we introduce GCP-VQVAE, a geometry-complete tokenizer built around a strictly SE(3)-equivariant GCPNet encoder that preserves orientation and chirality of protein backbones. We vector-quantize rotation/translation-invariant readouts that retain chirality into a 4096-token vocabulary, and a transformer decoder maps tokens back to backbone coordinates via a 6D rotation head trained with SE(3)-invariant objectives.
Building on these properties, we train GCP-VQVAE on a corpus of 24 million monomer protein backbone structures gathered from the AlphaFold Protein Structure Database. On the CAMEO2024, CASP15, and CASP16 evaluation datasets, the model achieves backbone RMSDs of 0.4377 Å, 0.5293 Å, and 0.7567 Å, respectively, with 100% codebook utilization on a held-out validation set, substantially outperforming prior VQ-VAE–based tokenizers and setting a new state of the art. Beyond these benchmarks, on a zero-shot set of 1938 completely new experimental structures, GCP-VQVAE attains a backbone RMSD of 0.8193 Å and a TM-score of 0.9673, demonstrating robust generalization to unseen proteins. Lastly, we discuss applications of this foundation-like model, such as protein structure compression and integration with generative protein language models. We make the GCP-VQVAE source code, zero-shot dataset, and pretrained weights fully open to the research community.
- 🗓️ 25 Sept 2025 — 🎉 Our paper was accepted to the NeurIPS 2025 AI4Science workshop!
- 🗓️ 3 Oct 2025 — Our preprint has been published on bioRxiv.
- 🗓️ 10 Oct 2025 — 🚀 Pretrained checkpoints and evaluation datasets are now available for download!
- 🗓️ 17 Oct 2025 — Released a lite version of GCP-VQVAE with half the parameters.
- 🗓️ 22 Dec 2025 — Added Protein Structure Tokenizer benchmark evaluation scripts and instructions; see pst_evaluation/README.md.
- 🗓️ 31 Dec 2025 — Added a simplified evaluation demo and refactored the README docs for easier navigation. See demo/README.md.
- 🗓️ 23 Jan 2026 — Released the standalone `gcp-vqvae` Python package for inference (encode/embed/decode) with a lightweight install path.
- Python 3.10+
- PyTorch 2.5+
- CUDA-compatible GPU
- 16GB+ GPU memory recommended for training
- 12GB+ GPU memory recommended for inference/evaluation
See docs/installation/README.md for Docker images, Dockerfile build steps, and optional FlashAttention-3 notes.
If you only want the inference wrapper, see gcp-vqvae/README.md for full details.
One-line install from GitHub (uses `gcp-vqvae/pyproject.toml`):

pip install "git+https://github.com/mahdip72/vq_encoder_decoder.git@master#subdirectory=gcp-vqvae"

Minimal usage:
from gcp_vqvae import GCPVQVAE
# Encode (HF model).
encoder = GCPVQVAE(
    mode="encode",
    hf_model_id="Mahdip72/gcp-vqvae-lite",
    mixed_precision="bf16",
)
records = encoder.encode(
    pdb_dir="/abs/path/to/pdb_dir",
    batch_size=8,
    show_progress=False,
)

# Embed (HF model).
embedder = GCPVQVAE(
    mode="embed",
    hf_model_id="Mahdip72/gcp-vqvae-lite",
    mixed_precision="bf16",
)
embeddings = embedder.embed(
    pdb_dir="/abs/path/to/pdb_dir",
    batch_size=4,
    show_progress=False,
)

# Decode (HF model).
decoder = GCPVQVAE(
    mode="decode",
    hf_model_id="Mahdip72/gcp-vqvae-lite",
    mixed_precision="bf16",
)
coords = decoder.decode(["1 2 3 4 5", "6 7 8"], batch_size=2)

Notes:
- Use `mode="embed"` with `embed()` for embeddings, and `mode="decode"` with `decode()` for coordinates.
- Use `mode="all"` to load the encode/embed and decode paths in one object. `mode="decode"` only builds/loads the VQ + decoder path (the encoder is not constructed).
- Encode/embed take `pdb_dir` inputs; preprocessing happens inside the package.
- Progress bars are available via `show_progress=True` (default: false).
- External logging is suppressed by default; set `suppress_logging=False` if you need verbose logs.
- `mixed_precision` defaults to `bf16` on CUDA; use `mixed_precision="no"` to disable.
- Set `deterministic=True` and `shuffle=False` for repeatable outputs.
- Pass `hf_model_id="Mahdip72/gcp-vqvae-lite"` (or another repo) to use Hugging Face downloads.
For the training and inference scripts (separate from demo/), convert PDB/CIF structure files to HDF5 and use those .h5 datasets as inputs. The standalone gcp-vqvae package operates directly on PDB/CIF inputs via `pdb_dir`. Full data preparation, formats, and conversion details live in docs/data/README.md.
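For intuition only, the sketch below shows one generic way to pull backbone atoms out of a PDB file and write them to HDF5, using Biotite and h5py. The group and dataset names here are hypothetical and do not reflect the repository's actual converter or schema, which are documented in docs/data/README.md.

```python
import h5py
import numpy as np
from biotite.structure.io.pdb import PDBFile

# Hypothetical layout: one HDF5 group per structure holding backbone coordinates.
# The real training datasets follow the schema in docs/data/README.md.
structure = PDBFile.read("example.pdb").get_structure(model=1)
backbone = structure[np.isin(structure.atom_name, ["N", "CA", "C", "O"])]

with h5py.File("example.h5", "w") as f:
    grp = f.create_group("example")
    grp.create_dataset("coords", data=backbone.coord.astype(np.float32))
    grp.create_dataset("atom_names", data=backbone.atom_name.astype("S4"))
```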
| Dataset | Description | Download Link |
|---|---|---|
| CAMEO2024 | CAMEO 2024 evaluation dataset | Download |
| CASP14 | CASP 14 evaluation dataset | Download |
| CASP15 | CASP 15 evaluation dataset | Download |
| CASP16 | CASP 16 evaluation dataset | Download |
| Zero-Shot | Zero-shot evaluation dataset | Download |
Full multi-GPU training/inference/evaluation workflows are documented in docs/usage/README.md. The demo workflow below is for quick local evaluation of checkpoints on raw PDB/CIF inputs.
| Model | Description | Download Link | HF Hub |
|---|---|---|---|
| Large | Full GCP-VQVAE model with best performance | Download | Mahdip72/gcp-vqvae-large |
| Lite | Lightweight version for faster inference | Download | Mahdip72/gcp-vqvae-lite |
Setup Instructions:
- Download the zip file of the checkpoint
- Extract the checkpoint folder
- Set the `trained_model_dir` path in the config files below to point to the extracted checkpoint.
The demo pipeline (demo/demo_evaluation.py) preprocesses raw PDB/CIF/mmCIF files, encodes tokens, optionally writes
embeddings, and can reconstruct + evaluate PDBs. It is meant for quick testing of checkpoints without the full multi-GPU stack.
See demo/README.md for setup and usage.
Example demo config (demo/demo_eval_config.yaml):
trained_model_dir: "/abs/path/to/trained_model" # Timestamped directory containing the trained model checkpoint
checkpoint_path: "checkpoints/best_valid.pth" # Relative to trained_model_dir
config_vqvae: "config_vqvae.yaml" # Saved training config name
config_encoder: "config_gcpnet_encoder.yaml" # Saved encoder config name
config_decoder: "config_geometric_decoder.yaml" # Saved decoder config name
data_dir: "/abs/path/to/pdb_cif_dir" # Root of PDB/CIF/mmCIF files (recursive)
output_base_dir: "demo_results" # Output parent dir (timestamped subdir created)
batch_size: 2 # Inference batch size
num_workers: 0 # DataLoader workers
max_task_samples: 0 # 0 means no limit
tqdm_progress_bar: true # Show progress bars
alignment_strategy: "kabsch" # kabsch or no (type of alignment before evaluation)
mixed_precision: "bf16" # no, fp16, bf16
save_indices_csv: true # Write discrete tokens into vq_indices.csv
save_embeddings_h5: false # Write continuous VQ embeddings into vq_embed.h5
save_pdb_and_evaluate: true # Write original and reconstructed coordinates in PDB format and evaluate TM-score/RMSD

Minimum config requirement: set `trained_model_dir` and `data_dir`.
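For context, `alignment_strategy: "kabsch"` superposes each reconstruction onto its reference backbone before RMSD is computed. The snippet below is a minimal NumPy sketch of that alignment, for illustration only (the demo script uses its own implementation).

```python
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate sets after superposing P onto Q."""
    # Center both point clouds.
    P_c = P - P.mean(axis=0)
    Q_c = Q - Q.mean(axis=0)
    # Optimal rotation via SVD of the covariance matrix (Kabsch algorithm).
    H = P_c.T @ Q_c
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # reflection correction preserves chirality
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_aligned = P_c @ R.T
    return float(np.sqrt(((P_aligned - Q_c) ** 2).sum(axis=1).mean()))
```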
For full training (Accelerate, multi-GPU), see docs/usage/README.md#training.
For full inference scripts (encode/embed/decode) and multi-GPU setup, see docs/usage/README.md#inference.
For full evaluation scripts and config details, see docs/usage/README.md#evaluation.
We evaluated additional VQ-VAE backbones alongside GCP-VQVAE:
- ESM3 VQVAE (forked repo: mahdip72/esm) – the community can reuse the `pdb_to_tokens.py` and `tokens_to_pdb.py` scripts we authored, since the upstream project lacks ready-to-use scripts.
- FoldToken-4 (forked repo: mahdip72/FoldToken_open) – we rewrote `foldtoken/pdb_to_token.py` and `foldtoken/token_to_pdb.py` for better performance and efficiency, with a negligible increase in error.
- Structure Tokenizer (instadeepai/protein-structure-tokenizer) – results reproduced with the official implementation.
Let us know if you need scripts for additional baseline tokenizers evaluated in the latest update of the manuscript.
The table below reproduces Table 2 from the manuscript: reconstruction accuracy on community benchmarks and a zero-shot setting. Metrics are backbone TM-score (↑) and RMSD in Å (↓).
- Added an experimental option to compress token sequences using latent codebooks inspired by ByteDance’s 1D tokenizer (TiTok); this enables configurable compression factors within our VQ pipeline.
- Added residual quantization (multi-depth VQ) with a shared codebook when `ti_tok.residual_depth > 1`. This only works when `ti_tok` is enabled: the residual latents are packed depth-by-depth, flattened into a single stream for NTP and decoding, and their masks/embeddings remain aligned with the flattened indices. This improves reconstruction capacity without expanding the base codebook.
- Included an optional next-token prediction head, drawing on the autoregressive regularization ideas from “When Worse is Better: Navigating the Compression-Generation Tradeoff in Visual Tokenization”, to encourage codebooks that are friendlier to autoregressive modeling.
- Enabled adaptive loss coefficients driven by gradient norms: each active loss (MSE, distance/direction, VQ, NTP) tracks its synchronized gradient magnitude and scales its weight toward the 0.2–5.0 norm “comfort zone.” Coefficients shrink when a loss overpowers the rest and grow when its gradients fade, keeping the multi-objective training balanced without constant manual re-tuning; a minimal sketch of the idea follows below.
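The sketch below illustrates the balancing idea only; it is not the repository's implementation, and the update rule, function names, and the omission of multi-GPU gradient synchronization are simplifying assumptions. Each active loss measures its gradient norm against shared parameters, and its coefficient is nudged so the effective norm stays inside the 0.2–5.0 comfort zone.

```python
import torch

COMFORT_LOW, COMFORT_HIGH = 0.2, 5.0  # target gradient-norm range described above

def adapt_coefficients(losses, coeffs, shared_params, rate=0.05):
    """Nudge each loss coefficient toward the gradient-norm comfort zone (illustrative)."""
    for name, loss in losses.items():
        # Gradient norm of this (unweighted) loss w.r.t. shared parameters.
        grads = torch.autograd.grad(loss, shared_params, retain_graph=True, allow_unused=True)
        norm = torch.stack([g.norm() for g in grads if g is not None]).norm().item()
        effective = coeffs[name] * norm
        if effective > COMFORT_HIGH:    # this loss overpowers the rest -> shrink its weight
            coeffs[name] *= 1.0 - rate
        elif effective < COMFORT_LOW:   # its gradients fade -> grow its weight
            coeffs[name] *= 1.0 + rate
    return coeffs

# Usage: total = sum(coeffs[k] * losses[k] for k in losses); total.backward()
```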
This repository builds upon several excellent open-source projects:
- vector-quantize-pytorch – Vector quantization implementations used in our VQ-VAE architecture.
- x-transformers – Transformer components integrated into our encoder and decoder modules of VQ-VAE.
- ProteinWorkshop – We slimmed the original workshop down to the GCPNet core and extended it with:
  - opt-in `torch.compile` support that now powers our training, inference, and evaluation paths (configurable in every config file).
  - faster aggregation backends: a CSR/`torch.segment_reduce` implementation for PyTorch-only setups and an automatic switch to cuGraph-ops fused aggregators when `pylibcugraphops` is installed.
  - multi-GPU-friendly evaluation utilities that reuse the same compilation flags as training, keeping the code path consistent across scripts.
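As a rough illustration of the CSR-style aggregation, the sketch below assumes a standard message-passing layout rather than the actual ProteinWorkshop/GCPNet code: `torch.segment_reduce` (a prototype PyTorch op) sums per-edge messages into per-node features given CSR offsets.

```python
import torch

# Per-edge messages sorted by destination node; node i receives
# messages[offsets[i]:offsets[i + 1]] (CSR-style layout).
messages = torch.randn(10, 16)            # 10 edges, 16-dim messages
offsets = torch.tensor([0, 3, 5, 7, 10])  # 4 destination nodes

# A fused segmented reduction replaces a Python-side scatter/index_add_ loop.
node_feats = torch.segment_reduce(messages, "sum", offsets=offsets)
print(node_feats.shape)  # torch.Size([4, 16])
```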
We gratefully acknowledge NVIDIA for providing computational resources via their dedicated server that made this project feasible, enabling us to train and evaluate our models at scale.
If you use this code or the pretrained models, please cite the following paper:
@article{Pourmirzaei2025gcpvqvae,
author = {Pourmirzaei, Mahdi and Morehead, Alex and Esmaili, Farzaneh and Ren, Jarett and Pourmirzaei, Mohammadreza and Xu, Dong},
title = {GCP-VQVAE: A Geometry-Complete Language for Protein 3D Structure},
journal = {bioRxiv},
year = {2025},
doi = {10.1101/2025.10.01.679833},
url = {https://www.biorxiv.org/content/10.1101/2025.10.01.679833v2}
}

