Retraining

BTheDragonMaster edited this page Mar 9, 2025 · 5 revisions

Retraining PARAS and PARASECT requires the imblearn and pandas libraries:

pip install imblearn
pip install pandas
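
A quick way to confirm that both dependencies are importable before retraining is sketched below. It uses only the standard library; the helper name `missing_modules` is hypothetical, and the import names are assumed to be `imblearn` and `pandas`.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Import names assumed to match the pip packages listed above.
    missing = missing_modules(["imblearn", "pandas"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All retraining dependencies found.")
```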

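PARAS and PARASECT can be retrained with the run_random_forest.py script (paras/scripts/machine_learning/random_forest/run_random_forest.py), using the commands listed further down this page.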

To add training data, first add the extracted 34-amino-acid extended active site of each new A-domain to the following file (extended active sites can be obtained by running PARAS or PARASECT with the -save_extended option). Next, add the A-domain identifier (formatted as protein_name.An, where n is the index of the A-domain in the protein), the A-domain sequence, and the A-domain specificity or specificities to this file. Then, add the A-domain identifier to the following two files: domain_list and trustworthy_domains. Note that the A-domain identifier MUST MATCH across all files.
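
Because a mismatched identifier is easy to miss, a small pre-flight check can help. The sketch below is not part of PARAS or PARASECT: the function names are hypothetical, the regular expression is one reading of the protein_name.An format, and `read_ids` assumes a layout of one identifier per line, which may not hold for every file.

```python
import re

# Assumed reading of "protein_name.An": any protein name, then ".A", then an integer index.
DOMAIN_ID = re.compile(r"^.+\.A\d+$")


def is_valid_domain_id(domain_id):
    """Check that an identifier looks like protein_name.An."""
    return DOMAIN_ID.match(domain_id) is not None


def read_ids(path):
    """Read identifiers from a file, assuming one identifier per line (an assumption)."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def missing_from(new_ids, ids_per_file):
    """Map each file label to the new identifiers that do not appear in that file."""
    report = {}
    for label, ids in ids_per_file.items():
        absent = sorted(set(new_ids) - set(ids))
        if absent:
            report[label] = absent
    return report
```

For example, `missing_from({"protX.A1"}, {"domain_list": read_ids("paras/data/domain_list.txt")})` returns an empty dictionary when the new identifier is present, and lists it under `"domain_list"` otherwise.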

Adding a domain that selects a new substrate

TODO

Once the training data is in place, navigate to the parasect parent folder:

cd parasect

Retraining PARAS

Training time: <5s

34 most common substrates:

python paras/scripts/machine_learning/random_forest/run_random_forest.py -threads -1 -substrates paras/data/compound_data/included_substrates.txt -d paras/data/parasect_dataset.txt -f paras/data/sequence_data/sequences/active_site_34_hmm.fasta -mode train -train_data paras/data/domain_list.txt -type single_label -o paras/models/random_forest/model.paras -sampling balanced

All substrates:

python paras/scripts/machine_learning/random_forest/run_random_forest.py -threads -1 -substrates paras/data/compound_data/all_substrates.txt -d paras/data/parasect_dataset.txt -f paras/data/sequence_data/sequences/active_site_34_hmm.fasta -mode train -train_data paras/data/trustworthy_domains.txt -type single_label -o paras/models/random_forest/all_substrates_model.paras -sampling balanced

Retraining PARASECT

Training time: <20s

python paras/scripts/machine_learning/random_forest/run_random_forest.py -threads -1 -substrates paras/data/compound_data/included_substrates.txt -d paras/data/parasect_dataset.txt -f paras/data/sequence_data/sequences/active_site_34_hmm.fasta -mode train -train_data paras/data/domain_list.txt -type tandem -morgan paras/data/compound_data/fingerprints.txt -o paras/models/random_forest/model.parasect -sampling under_sample
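
The three training commands above share most of their arguments and differ only in the substrate list, the training-domain list, the model type, the output file, the sampling strategy, and the extra -morgan argument for PARASECT. The sketch below makes those differences explicit by assembling each command as an argument list. The helper `build_train_command` is hypothetical; every flag and path is copied from the commands above, and nothing here is part of the PARAS codebase.

```python
SCRIPT = "paras/scripts/machine_learning/random_forest/run_random_forest.py"
DATA = "paras/data"


def build_train_command(substrates, train_data, model_type, output, sampling, morgan=None):
    """Assemble one training command as a list of arguments."""
    command = [
        "python", SCRIPT,
        "-threads", "-1",
        "-substrates", f"{DATA}/compound_data/{substrates}",
        "-d", f"{DATA}/parasect_dataset.txt",
        "-f", f"{DATA}/sequence_data/sequences/active_site_34_hmm.fasta",
        "-mode", "train",
        "-train_data", f"{DATA}/{train_data}",
        "-type", model_type,
    ]
    if morgan is not None:
        # Only the PARASECT (tandem) command passes a fingerprint file.
        command += ["-morgan", f"{DATA}/compound_data/{morgan}"]
    command += ["-o", f"paras/models/random_forest/{output}", "-sampling", sampling]
    return command


COMMANDS = {
    "paras": build_train_command(
        "included_substrates.txt", "domain_list.txt", "single_label",
        "model.paras", "balanced"),
    "paras_all_substrates": build_train_command(
        "all_substrates.txt", "trustworthy_domains.txt", "single_label",
        "all_substrates_model.paras", "balanced"),
    "parasect": build_train_command(
        "included_substrates.txt", "domain_list.txt", "tandem",
        "model.parasect", "under_sample", morgan="fingerprints.txt"),
}

if __name__ == "__main__":
    for name, command in COMMANDS.items():
        print(f"{name}: {' '.join(command)}")
```

Each list can be passed to `subprocess.run(command, check=True)` from inside the parasect folder, which avoids shell quoting issues.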

Installing the models

After retraining any of the models, run:

pip install .
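
Before reinstalling, it may be worth confirming that the retrained model files were actually written. The sketch below checks the output paths given to -o in the training commands on this page. The helper `missing_models` is hypothetical, and it assumes the check is run from the parasect folder.

```python
from pathlib import Path

# Output paths taken from the -o arguments of the training commands on this page.
MODEL_PATHS = [
    "paras/models/random_forest/model.paras",
    "paras/models/random_forest/all_substrates_model.paras",
    "paras/models/random_forest/model.parasect",
]


def missing_models(root=".", model_paths=MODEL_PATHS):
    """Return the model paths that do not exist as files under `root`."""
    root = Path(root)
    return [path for path in model_paths if not (root / path).is_file()]


if __name__ == "__main__":
    absent = missing_models()
    if absent:
        print("Not found:", ", ".join(absent))
    else:
        print("All model files are present.")
```

Only the models that were retrained need to be present; pass a shorter `model_paths` list to check a subset.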
