A method to efficiently detect topologically linked structures in protein-protein complexes from a local perspective.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
- Python 3 (3.5 or later)
- topoly (between 0.9.24 and 1.0.0)
- ProDy
- Create a new conda environment (e.g., topolinks):
  conda create -n topolinks python=3.7
- Activate this environment:
  conda activate topolinks
- topoly: pip install topoly==1.0.0
- ProDy: pip install ProDy
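To confirm that an installed topoly release falls inside the supported range, a small check along the following lines may help. This is only a sketch: the helper names are hypothetical, the bounds are treated as inclusive (an assumption), and the version string is expected to be plain dotted integers such as the one reported by pip show topoly.

```python
# Hypothetical helpers; bounds copied from the requirements list above.
TOPOLY_MIN = (0, 9, 24)
TOPOLY_MAX = (1, 0, 0)


def parse_version(text):
    """Turn a dotted version string such as '0.9.24' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def topoly_version_supported(version_text):
    """Return True if the version lies in the supported range (assumed inclusive)."""
    return TOPOLY_MIN <= parse_version(version_text) <= TOPOLY_MAX


for candidate in ("0.9.23", "0.9.24", "1.0.0", "1.0.1"):
    print(candidate, topoly_version_supported(candidate))
```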
- For one structure
- Run topo_links.py with the -in parameter pointing to the protein complex file (in *.pdb or *.cif format) you wish to analyze. The simplest way to test it is with the bundled example:
  python topo_links.py -in ./examples/glnB-gspJ.pdb
- Further options for topo_links.py are set through the following parameters:
  - -out: (str, optional) the directory in which to save the outputs. Default: /tmp/topo_links
  - -sf: (str, optional) the file name of the results summary. Default: summary_topo_links.txt
  - --detail: output the details of the analysis.
  - --no-detail: do not output the details of the analysis. (Default)
  - -se: (int, optional) the upper limit of the scan window. Default: 36.
  - -sb: (int, optional) the lower limit of the scan window. Default: 4.
  - -ts: (float, optional) the threshold score applied to the calculated GLN scores. Default: 0.8.
  - -d: (float, optional) the distance threshold for selecting the interface of the complex. Default: 10.0.
  - -rm: (int, optional) the number of terminal residues to remove from each chain. Default: 15.
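When the script is driven from Python rather than typed by hand, the options above can be collected into one argument list. The following is a minimal sketch, not part of the package: build_command is a hypothetical helper, it uses only the flags and defaults listed above, and it assumes topo_links.py sits in the current working directory.

```python
import subprocess
import sys


def build_command(structure, out_dir="/tmp/topo_links",
                  summary="summary_topo_links.txt", detail=False,
                  se=36, sb=4, ts=0.8, d=10.0, rm=15):
    """Assemble the argument list for topo_links.py from the documented options."""
    return [
        sys.executable, "topo_links.py",
        "-in", structure,
        "-out", out_dir,
        "-sf", summary,
        "--detail" if detail else "--no-detail",
        "-se", str(se),
        "-sb", str(sb),
        "-ts", str(ts),
        "-d", str(d),
        "-rm", str(rm),
    ]


cmd = build_command("./examples/glnB-gspJ.pdb", detail=True)
print(" ".join(cmd))

# Uncomment to actually run the analysis (needs topo_links.py and its dependencies):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains unusual characters.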
- Batch analysis for multiple structures
- Run batch_topo_links.sh with the folder containing the protein complex structures as its argument. Note that every *.pdb or *.cif structure in the folder will be analyzed. The simplest way to test it is with the bundled example:
  ./batch_topo_links.sh ./examples/
- Further options for batch_topo_links.sh are passed as positional arguments. Because they are positional, every argument preceding the one you want to set must also be supplied.
  - $1 (the first argument) is the folder where the structures of the protein complexes are located.
  - $2 (the second argument, optional) is the directory in which to save the outputs. Default: /tmp/topo_links.
  - $3 (the third argument, optional) is the file name of the results summary. Default: summary_topo_links.txt.
  - $4 (the fourth argument, optional) controls whether the details of the analysis are output. Use --detail to output them, or --no-detail to suppress them (default).
  - $5 (the fifth argument, optional) is the upper limit of the scan window. Default: 36.
  - $6 (the sixth argument, optional) is the lower limit of the scan window. Default: 4.
  - $7 (the seventh argument, optional) is the threshold score applied to the calculated GLN scores. Default: 0.8.
  - $8 (the eighth argument, optional) is the distance threshold for selecting the interface of the complex. Default: 10.0.
  - $9 (the ninth argument, optional) is the number of terminal residues to remove from each chain. Default: 15.
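For readers who prefer to drive a batch from Python, the folder scan described above can be mirrored in a few lines. This sketch does not reproduce how batch_topo_links.sh is implemented: find_structures is a hypothetical helper, and the choices to ignore subfolders and to match extensions case-insensitively are assumptions.

```python
import tempfile
from pathlib import Path


def find_structures(folder):
    """Return the *.pdb and *.cif files directly inside `folder`, sorted by name."""
    folder = Path(folder)
    return sorted(
        path for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() in {".pdb", ".cif"}
    )


# Demonstrate on a throwaway folder so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdb", "b.cif", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [path.name for path in find_structures(tmp)]

print(found)
```

Each returned path could then be passed to topo_links.py through its -in parameter.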
- When using topo_links.py, the path and file name of the summarized results are specified by the -out and -sf parameters, respectively. When using batch_topo_links.sh, they are specified by the second and third arguments, respectively. By default, the results are written to /tmp/topo_links/summary_topo_links.txt.
Let's take the summarized results of ./examples/glnB-gspJ.pdb as an example. The summary file contains four columns, as follows:
./examples/glnB-gspJ.pdb AB 1 {'tln': 1, 'wholegln': -0.943, 'chain_A': 1, 'chain_B': 1, 'resid_chain_A': '48-54', 'resid_chain_B': '94-98', 'res_breaks': 0, 'maxgln_A': -1.0, 'maxgln_B': -0.986, 'maxtln_A': -0.906, 'maxtln_B': -0.978, 'minres': 6}
- ./examples/glnB-gspJ.pdb is the first column (the structure file).
- AB is the second column (the paired chains of the structure, i.e., chain A and chain B).
- 1 is the third column (the number of topological links between chain A and chain B). A value greater than 0 indicates a topological link between the two corresponding chains. Here, there is a topological link between chain A and chain B.
- The remainder is the fourth column (detailed information stored in Python dictionary format):
  - tln is the topological link number between chain A and chain B (here equal to 1).
  - wholegln is the GLN value of the whole chains A and B at the interface (here equal to -0.943).
  - chain_A is the number of topological links detected in chain A (here equal to 1).
  - chain_B is the number of topological links detected in chain B (here equal to 1).
  - resid_chain_A gives the residue indexes of the detected topological links in chain A. 48-54 means that the fragment between residues 48 and 54 in chain A contributes to the topological link. When multiple discontinuous fragments contribute to the topological links, they are separated by "|".
  - resid_chain_B gives the residue indexes of the detected topological links in chain B. 94-98 means that the fragment between residues 94 and 98 in chain B contributes to the topological link.
  - res_breaks is the number of missing residues in chain A and chain B at the interface (note that broken chains may cause false topological links).
  - maxgln_A is the GLN value with the largest absolute value on chain A (see AB-matrix-A.npy if --detail is enabled), where the GLN value is calculated for all possible fragments of chain A against the whole chain B.
  - maxgln_B is the GLN value with the largest absolute value on chain B (see AB-matrix-B.npy if --detail is enabled), where the GLN value is calculated for all possible fragments of chain B against the whole chain A.
  - maxtln_A is the GLN value with the largest absolute value on chain A, where the GLN values are calculated only for fragments of chain A whose length is between -sb and -se, against the whole chain B.
  - maxtln_B is the GLN value with the largest absolute value on chain B, where the GLN values are calculated only for fragments of chain B whose length is between -sb and -se, against the whole chain A.
  - minres is, when a topological link is detected, the length (in number of residues) of the shortest fragment that is judged to form the topological link. When no topological link is detected, it takes the default value ($2*se/3$).
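The column layout described above can be read back with the standard library alone. The sketch below uses the example line from this README; parse_summary_line and parse_fragments are hypothetical helpers, and two assumptions are made: the first three columns contain no whitespace, and residue indexes are non-negative (a negative index would confuse the "-" split).

```python
import ast


def parse_summary_line(line):
    """Split one summary line into (structure, chains, n_links, details)."""
    # Everything after the third whitespace-separated field is the dictionary.
    structure, chains, n_links, details = line.strip().split(None, 3)
    return structure, chains, int(n_links), ast.literal_eval(details)


def parse_fragments(text):
    """Turn '48-54' or '48-54|60-70' into a list of (first, last) residue pairs."""
    return [tuple(int(x) for x in frag.split("-")) for frag in text.split("|")]


line = (
    "./examples/glnB-gspJ.pdb AB 1 {'tln': 1, 'wholegln': -0.943, 'chain_A': 1, "
    "'chain_B': 1, 'resid_chain_A': '48-54', 'resid_chain_B': '94-98', "
    "'res_breaks': 0, 'maxgln_A': -1.0, 'maxgln_B': -0.986, 'maxtln_A': -0.906, "
    "'maxtln_B': -0.978, 'minres': 6}"
)

structure, chains, n_links, details = parse_summary_line(line)
if n_links > 0:
    print(structure, chains, "linked via",
          parse_fragments(details["resid_chain_A"]),
          parse_fragments(details["resid_chain_B"]))
```

ast.literal_eval is used instead of eval because it only accepts Python literals, which is all the fourth column should contain.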
- If --detail is passed to topo_links.py, or given as the fourth argument to batch_topo_links.sh, a folder named after the analyzed file is generated, containing the following files (e.g., for a protein containing chain A and chain B):
  - The file inter_selected_AB.pdb stores the isolated interface for chain A and chain B. Throughout this README, chain A and chain B (including "the whole chain A" and "the whole chain B") refer to their parts at the interface.
  - For AB-glnmap-A.png, AB in the file name means that the paired chains are chain A and chain B, and -A means that this GLN map is for chain A. Each point of the heatmap represents a fragment of chain A (the row is the starting atom index and the column is the ending atom index). The color represents the GLN value between this fragment and chain B. The larger the absolute GLN value, the stronger the entanglement. By default, when the absolute value is greater than 0.8, the corresponding fragment (whose number of atoms is between -sb and -se) is judged to have a topological link. Similarly, AB-glnmap-B.png is the heatmap of the GLN values for chain B.
  - The AB-matrix-A.npy file stores the GLN values corresponding to the figure AB-glnmap-A.png. The AB-matrix-B.npy file stores the GLN values corresponding to the figure AB-glnmap-B.png.
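The thresholding rule described for the heatmap can be sketched on a plain nested list. This is an illustration under stated assumptions, not the code used by topo_links.py: linked_fragments is a hypothetical helper, the matrix is assumed to be indexed as [start][end] like the figure, fragment length is assumed to be end - start + 1, and the toy values are made up. A real matrix could presumably be obtained with numpy.load("AB-matrix-A.npy").tolist().

```python
def linked_fragments(matrix, sb=4, se=36, ts=0.8):
    """List (start, end, gln) for cells with |gln| > ts and length within [sb, se]."""
    hits = []
    for start, row in enumerate(matrix):
        for end, gln in enumerate(row):
            length = end - start + 1  # assumed definition of fragment length
            is_number = gln == gln    # False for NaN, which marks unused cells if present
            if is_number and sb <= length <= se and abs(gln) > ts:
                hits.append((start, end, gln))
    return hits


# Toy 6x6 matrix with made-up values, only to exercise the rule.
toy = [[0.0] * 6 for _ in range(6)]
toy[0][4] = -0.9   # length 5, above the threshold: reported
toy[1][2] = 0.95   # length 2, shorter than sb: ignored
toy[0][5] = 0.5    # length 6, below the threshold: ignored

hits = linked_fragments(toy)
print(hits)
```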
Note that if the protein-protein complex contains more than two chains, all pairwise combinations of these chains will be analyzed.
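As a small illustration of what "all pairwise combinations" means for a three-chain complex, the unordered pairs can be enumerated as follows. The chain IDs are made up, and labelling each pair by concatenating the two IDs simply mirrors the AB label seen in the summary file.

```python
from itertools import combinations

chains = ["A", "B", "C"]  # hypothetical chain IDs of a three-chain complex
pairs = ["".join(pair) for pair in combinations(chains, 2)]
print(pairs)  # ['AB', 'AC', 'BC']
```

A complex with n chains therefore yields n(n-1)/2 pairs.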
Yingnan Hou, Tengyu Xie, Liuqing He, Liang Tao, and Jing Huang*. Topological Links in Predicted Protein Complex Structures Reveal Limitations of AlphaFold. Communications Biology, 2023, 6(1): 1098. DOI: https://doi.org/10.1038/s42003-023-05489-4