diff --git a/README.md b/README.md index 82b63f9..97d46ac 100644 --- a/README.md +++ b/README.md @@ -1 +1,98 @@ -# added_annotations \ No newline at end of file +# Added annotations: EMICSS (**E**MDB **In**tegration with **C**omplexes, **S**tructures and **S**equences) + +This repository provides tools and scripts for extracting and adding annotations to EMDB entries, which are used to enhance the metadata associated with EM datasets. + +### Table of Contents + +* Installation +* Configuration +* Usage +* Contributing +* License + +### Installation + +To install the necessary dependencies, run: +pip install -r requirements.txt + +### Configuration + +The repository uses a config.ini file for configuration, which is not included in the repository. This file should be created in the root directory of the project with the following structure: + +``` +[file_paths] +uniprot_tab: /uniprot.tsv +CP_ftp: /complextab +components_cif: /components.cif +chem_comp_list: /chem_comp_list.xml +pmc_ftp_gz: /PMID_PMCID_DOI.csv.gz +pmc_ftp: /PMID_PMCID_DOI.csv +emdb_pubmed: /emdb_pubmed.log +emdb_orcid: /emdb_orcid.log +assembly_ftp: /assembly/ +BLAST_DB: /ncbi-blast-2.13.0+/database/uniprot_sprot +BLASTP_BIN: blastp +sifts_GO: /pdb_chain_go.csv +GO_obo: /go.obo +GO_interpro: /nfs/ftp/pub/databases/GO/goa/external2go/interpro2go +sifts: /split_xml/ +alphafold_ftp: /accession_ids.txt +rfam_ftp: /rfam_files_combined.txt + +[api] +pmc: https://www.ebi.ac.uk/europepmc/webservices/rest/searchPOST +``` + +#### File Sources and Download Links +| File | Descritption | Download Link | +|-------------|-----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------| +| uniprot.tsv | UniProt annpotations | https://rest.uniprot.org/uniprotkb/stream?fields=accession,xref_pdb,protein_name&query=((database:pdb))&format=tsv&compressed=false | +| complextab | Complex Portal data | https://ftp.ebi.ac.uk/pub/databases/complexportal/complexes.tab.gz | +| components.cif | Chemical components data | https://ftp.ebi.ac.uk/pub/databases/msd/pdbechem_v2/ccd/components.cif | +| chem_comp_list.xml | Chemical component list | https://ftp.ebi.ac.uk/pub/databases/msd/pdbechem_v2/ccd/chem_comp_list.xml | +| PMID_PMCID_DOI.csv.gz | Europe PMC dataset (compressed) | https://europepmc.org/pub/databases/pmc/DOI/PMID_PMCID_DOI.csv.gz | +| PMID_PMCID_DOI.csv | Unzipped version of the Europe PMC dataset | https://ftp.ebi.ac.uk/pub/databases/pmc/DOI/PMID_PMCID_DOI.csv | +| emdb_pubmed | Mapping file created after running PublicationMapping.py | emdb_pubmed.log | +| emdb_orcid | Mapping file created after running PublicationMapping.py | emdb_orcid.log | +| assembly_ftp | PDB assemblies | https://ftp.ebi.ac.uk/pub/databases/msd/assemblies/split/ | +| BLAST_DB | UniProt BLAST database | https://ftp.uniprot.org/pub/databases/uniprot/uniprot_sprot/uniprot_sprot.fasta.gz | +| sifts_GO | PDB chain Gene Ontology mapping | https://ftp.ebi.ac.uk/pub/databases/msd/sifts/pdb_chain_go.csv | +| GO_obo | Gene Ontology definitions | https://current.geneontology.org/ontology/go.obo | +| GO_interpro | InterPro to GO mapping | https://ftp.ebi.ac.uk/pub/databases/GO/goa/external2go/interpro2go | +| sifts | SIFTS data | https://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml/ | +| alphafold_ftp | AlphaFold DB accession IDs | https://ftp.ebi.ac.uk/pub/databases/alphafold/accession_ids.csv | +| rfam_ftp | RFAM files | https://www.ebi.ac.uk/pdbe/search/pdb/select?q=emdb_id:*%20AND%20rfam:%5B*%20TO%20*%5D&wt=csv&fl=emdb_id,pdb_id,rfam,rfam_id,entity_id&rows=9999999 | +| emd-xxxx-v30.xml | EMDB metadata | https://ftp.ebi.ac.uk/pub/databases/emdb/ | +| xxxxx.xml | EMPIAR metadata | https://ftp.ebi.ac.uk/pub/databases/emtest/empiar | + +### Usage + +To use the tools and scripts in this repository, you just need to clone it and ensure the config.ini file is properly configured as described above. + +#### Executing the scripts: + +Execute the scripts independently in the following recommended order: +##### EMPIAR mapping +``` +fetch_empiar.py: python fetch_empiar.py -w -f +``` +##### Publication mapping +``` +fetch_pubmed.py: python fetch_pubmed.py -w -f +``` +##### Protein, complexes and ligands mapping +``` +added_annotations.py: python added_annotations.py -w -f --all -t +``` +##### AlphaFold DB mapping +``` +fetch_afdb.py: python fetch_afdb.py -w +``` +##### Write files +``` +write_xml.py: python write_xml.py +``` + +### Further information + +For more information about EMICSS, visit the official EMICSS website (https://www.ebi.ac.uk/emdb/emicss). This page provides detailed information about the EMDB/EMICSS project. diff --git a/fetch_pubmed.py b/fetch_pubmed.py index 15b535f..3141b48 100644 --- a/fetch_pubmed.py +++ b/fetch_pubmed.py @@ -41,7 +41,6 @@ def call_ePubmedCentral(pubmed_list, uri): if response.status_code == 200: try: pmcjdata = json.loads(response.text) - #hitCount = pmcjdata['hitCount'] if 'result' in pmcjdata['resultList']: result = pmcjdata['resultList']['result'] for pub_data in result: