Retraining
Retraining PARAS and PARASECT requires the imblearn and pandas libraries:
pip install imblearn
pip install pandas
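Before retraining, it can help to confirm that both libraries are visible to the Python interpreter that will run the training script. The snippet below is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of PARAS or PARASECT.

```python
from importlib.util import find_spec

# Import names of the two libraries listed above.
REQUIRED_MODULES = ("imblearn", "pandas")


def missing_modules(modules=REQUIRED_MODULES):
    """Return the names of modules that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All retraining dependencies are installed.")
```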
PARAS and PARASECT can be retrained with the following script:
To add training data, add the extracted 34-amino-acid extended active sites to the following file (extended active sites can be obtained by running PARAS or PARASECT with the -save_extended option). Next, add the A-domain identifier (formatted as protein_name.An, where n is the index of the A-domain in the protein), the A-domain sequence, and the A-domain specificity or specificities to this file. Then add the A-domain identifier to the following two files: domain_list and trustworthy_domains. Note that the A-domain identifier MUST MATCH across all files.
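Because a mismatched identifier is easy to miss, a small check before training may be useful. The sketch below validates the protein_name.An format and reports which files lack a given identifier. It works on identifiers already loaded into memory, since the on-disk layout of each file is not described here. The function names, the file labels, and the example identifier are all hypothetical.

```python
import re

# Format described above: protein_name.An, with n the index of the A-domain.
DOMAIN_ID = re.compile(r"^(?P<protein>.+)\.A(?P<index>\d+)$")


def is_valid_domain_id(identifier):
    """Return True if the identifier looks like protein_name.An."""
    return DOMAIN_ID.match(identifier) is not None


def files_missing_id(identifier, ids_by_file):
    """Return the labels of the files that do not contain the identifier.

    ids_by_file maps a file label to the collection of identifiers read
    from that file (reading the files is left to the caller).
    """
    return [label for label, ids in ids_by_file.items() if identifier not in ids]


if __name__ == "__main__":
    new_id = "example_protein.A1"  # hypothetical identifier
    loaded = {
        "dataset": {"example_protein.A1"},
        "domain_list": {"example_protein.A1"},
        "trustworthy_domains": set(),
    }
    print(is_valid_domain_id(new_id))
    print(files_missing_id(new_id, loaded))
```

An empty list from `files_missing_id` means the identifier was found in every collection that was passed in.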
TODO
Then, navigate to the parasect parent folder:
cd parasect
PARAS training time: <5s
34 most common substrates:
python paras/scripts/machine_learning/random_forest/run_random_forest.py -threads -1 -substrates paras/data/compound_data/included_substrates.txt -d paras/data/parasect_dataset.txt -f paras/data/sequence_data/sequences/active_site_34_hmm.fasta -mode train -train_data paras/data/domain_list.txt -type single_label -o paras/models/random_forest/model.paras -sampling balanced
All substrates:
python paras/scripts/machine_learning/random_forest/run_random_forest.py -threads -1 -substrates paras/data/compound_data/all_substrates.txt -d paras/data/parasect_dataset.txt -f paras/data/sequence_data/sequences/active_site_34_hmm.fasta -mode train -train_data paras/data/trustworthy_domains.txt -type single_label -o paras/models/random_forest/all_substrates_model.paras -sampling balanced
PARASECT training time: <20s
python paras/scripts/machine_learning/random_forest/run_random_forest.py -threads -1 -substrates paras/data/compound_data/included_substrates.txt -d paras/data/parasect_dataset.txt -f paras/data/sequence_data/sequences/active_site_34_hmm.fasta -mode train -train_data paras/data/domain_list.txt -type tandem -morgan paras/data/compound_data/fingerprints.txt -o paras/models/random_forest/model.parasect -sampling under_sample
After retraining any of the models, run:
pip install .
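When retraining repeatedly, the PARAS training command and the reinstall step can be wrapped in a short Python helper. The following is a sketch only: it reuses the flags and paths exactly as they appear in the commands above, assumes it is run from the parasect parent folder, and uses `sys.executable` in place of `python` so that the same interpreter is used throughout. The function names are hypothetical.

```python
import subprocess
import sys

SCRIPT = "paras/scripts/machine_learning/random_forest/run_random_forest.py"


def build_paras_command(substrates, train_data, output):
    """Assemble the single-label (PARAS) training command as an argument list."""
    return [
        sys.executable, SCRIPT,
        "-threads", "-1",
        "-substrates", substrates,
        "-d", "paras/data/parasect_dataset.txt",
        "-f", "paras/data/sequence_data/sequences/active_site_34_hmm.fasta",
        "-mode", "train",
        "-train_data", train_data,
        "-type", "single_label",
        "-o", output,
        "-sampling", "balanced",
    ]


def retrain_and_reinstall(command):
    """Run a training command, then reinstall the package with pip."""
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    subprocess.run([sys.executable, "-m", "pip", "install", "."], check=True)


if __name__ == "__main__":
    # Model for the 34 most common substrates, as in the first command above.
    command = build_paras_command(
        substrates="paras/data/compound_data/included_substrates.txt",
        train_data="paras/data/domain_list.txt",
        output="paras/models/random_forest/model.paras",
    )
    retrain_and_reinstall(command)
```

Passing the command as a list avoids shell quoting issues, and `check=True` stops the reinstall from running if training fails.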