This repository contains scripts for the paper:
Kohout P, Vasina M, Majerova M, Novakova V, Damborsky J, Bednar D, Marek M, Prokop Z, Mazurenko S. Engineering Dehalogenase Enzymes using Variational Autoencoder-Generated Latent Spaces and Microfluidics. (DOI 10.26434/chemrxiv-2023-jcds7)
This work builds on the code published with the Nature Communications paper "Deciphering protein evolution and fitness landscapes with latent space models" (https://doi.org/10.1038/s41467-019-13633-0).
For installation and running, we recommend installing Anaconda and setting up a conda environment with the following commands:
conda create -n vae_env python=3.6
conda activate vae_env
conda install pytorch torchvision -c pytorch
Then install the Python packages listed in requirements.txt into the environment with pip:
pip install -r requirements.txt
For alignment management, our scripts use the Clustal Omega tool. Its binary can be found in the clustal/ directory. If the binary is missing, it can be downloaded from the Clustal Omega website.
The datasets are provided in a tar archive (datasets.tar.gz) and can be extracted with the setup_datasets.sh script.
Simply run bash setup_datasets.sh in the root directory of this repository.
This command creates the datasets directory. A description of the extracted files, along with instructions for preparing your custom data for phylogenetic mapping into the latent space, is given in README_datasets.md.
For easy experimentation with Variational Autoencoders (VAEs), we provide a set of Jupyter notebooks. The vae_pipeline notebook enables users to run the entire VAE workflow step by step using a simple configuration file. For detailed explanations of each step, please refer to the calculation cells within the notebook.
Use the setup_datasets.sh script to set up the datasets directory.
The scripts must be run from the scripts/ directory:
cd scripts
Users can specify the desired model configuration and enable it by setting its status to on.
The demonstration configuration file includes setups for Model1 and Model2 used in our study.
vim model_configurations/runner-conf.json
runner-conf.json is the default configuration file with examples.
To run a custom analysis, either add its configuration entry to this file and set it to on, or create a new configuration file with your models. In the latter case, pass the file to the commands below with the additional parameter --json path_to_conf.json
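Before launching a run, it can help to check which entries of a configuration file are switched on. The snippet below is only a sketch under a stated assumption: it treats the file as a JSON object that maps an entry name to a dictionary with a "status" key. That layout and the helper name enabled_entries are hypothetical; consult runner-conf.json for the actual structure.

```python
import json


def enabled_entries(conf_path):
    """Return the names of configuration entries whose status is "on".

    ASSUMPTION (hypothetical layout): the file is a JSON object mapping an
    entry name to a dict that contains a "status" key. The real
    runner-conf.json may be organised differently.
    """
    with open(conf_path) as handle:
        conf = json.load(handle)
    return [
        name
        for name, entry in conf.items()
        if isinstance(entry, dict) and entry.get("status") == "on"
    ]
```

Calling enabled_entries("model_configurations/runner-conf.json") from the scripts/ directory would then list the entries that the runner is expected to pick up, provided the file follows the assumed layout.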
The user can then preprocess the MSA selected in the configuration file from the datasets directory, train the model, and generate the desired statistics by running:
python3 runner.py msa_handlers/msa_preprocessor.py --json model_configurations/runner-conf.json
python3 runner.py train.py --json model_configurations/runner-conf.json
python3 runner.py benchmark.py --json model_configurations/runner-conf.json
python3 runner.py run_task.py --run_generative_evaluation --json model_configurations/runner-conf.json
After that, the results can be found in the ../results/dir_name/experiment_name directory, where the directory and experiment names are specified in the configuration file.
The commands above generate Figure 2A-C in the paper.
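The four pipeline commands can also be chained from a small Python wrapper so that a failing step stops the run. This is a minimal sketch, not part of the repository: the helper names build_pipeline_commands and run_pipeline are hypothetical, and the wrapper assumes it is started from the scripts/ directory.

```python
import subprocess

# The four steps listed above, in execution order.
PIPELINE_STEPS = [
    ["msa_handlers/msa_preprocessor.py"],
    ["train.py"],
    ["benchmark.py"],
    ["run_task.py", "--run_generative_evaluation"],
]


def build_pipeline_commands(conf_path="model_configurations/runner-conf.json"):
    """Return one argument list per pipeline step for the given config file."""
    return [
        ["python3", "runner.py", *step, "--json", conf_path]
        for step in PIPELINE_STEPS
    ]


def run_pipeline(conf_path="model_configurations/runner-conf.json", cwd="."):
    """Run the steps one after another; check=True aborts on the first failure."""
    for command in build_pipeline_commands(conf_path):
        print("Running:", " ".join(command))
        subprocess.run(command, check=True, cwd=cwd)
```

Calling run_pipeline() uses the default configuration file; run_pipeline("path_to_conf.json") passes a custom one to every step.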
Ancestors can be generated by running the command for the straight-line evolution strategy:
python3 runner.py run_task.py --run_straight_evolution --json model_configurations/runner-conf.json
We also examined further strategies:
- Random mutagenesis mutates the input query sequence, maps the variants into the latent space, and picks the variant closest to the latent space origin (run_random_mutagenesis).
- CMA-ES evolution performs multi-objective optimization in the latent space.
We focused on the simple straight-line evolution strategy in our paper.
python3 runner.py run_task.py --run_random_mutagenesis --json model_configurations/runner-conf.json
python3 runner.py run_task.py --run_evolution --json model_configurations/runner-conf.json
Results can be found in the same results directory, in the Highlights/ subdirectory.
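To illustrate the random mutagenesis idea described above (mutate the query, map the variants into the latent space, keep the variant closest to the origin), here is a toy sketch of a single selection step. It is not the repository's implementation: toy_encode is a hypothetical stand-in for the trained VAE encoder, and the function names, the number of variants, and the number of mutations are all assumptions chosen for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(sequence, n_mutations, rng):
    """Return a copy of `sequence` with `n_mutations` random substitutions."""
    residues = list(sequence)
    for pos in rng.sample(range(len(residues)), n_mutations):
        # Always substitute with a residue different from the current one.
        choices = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(choices)
    return "".join(residues)


def toy_encode(sequence):
    """HYPOTHETICAL stand-in for the VAE encoder: sequence -> 2D point."""
    n = len(sequence)
    hydrophobic = sum(aa in "AILMFVW" for aa in sequence) / n
    charged = sum(aa in "DEKR" for aa in sequence) / n
    return (hydrophobic - 0.5, charged - 0.5)


def distance_to_origin(point):
    """Euclidean distance of a 2D latent point from the origin."""
    return math.hypot(*point)


def random_mutagenesis_step(query, encode, n_variants=50, n_mutations=2, seed=0):
    """Generate random variants of `query` and return the one whose latent
    embedding lies closest to the origin."""
    rng = random.Random(seed)
    variants = [mutate(query, n_mutations, rng) for _ in range(n_variants)]
    return min(variants, key=lambda v: distance_to_origin(encode(v)))
```

Repeating this step with the selected variant as the new query would move the sequence towards the latent space origin; in practice the encoder would be the trained model rather than the toy function used here.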
To plot the latent space, run:
python3 runner.py run_task.py --run_plot_latent_space --json model_configurations/runner-conf.json
The statistics plot for the evolutionary profile in the latent space (Figure 3C in the paper) can be produced with:
python3 supportscripts/dual_axis.py --csv ../results/path/to/experiment/higlight_dir/selected_strategy_profile.csv --pos "" --o path_to_profile.jpg
- scripts : program source files
- results : directory for results
- pbs_scripts : scripts for remote job runs, including the trainer.sh training script
- datasets.tar.gz : compressed datasets directory
- datasets : datasets for experiments, created from datasets.tar.gz by the setup_datasets.sh script
- clustal : directory containing the Clustal Omega binary (clustalo)
- meta_scripts : variants of the PBS scripts
- requirements.txt : python pip dependencies