This is an Information Retrieval-based script for prioritizing Gene Ontology (GO) terms based on text from a biomedical paper. This is intended to be run ahead of an LLM-based annotation system.
This project was written by Jordan Lau and supervised by Dr. Arvind Mer with help from Aws Almir Ahmad.
GO terms are prioritized using a keyword-based approach. Keywords are extracted from both the abstract and the GO terms using the same preprocess() function in preprocessing.py. Each abstract is stored as a vector of keywords, and each keyword in the vector is weighted by the rank() function in main.py.
The
Cosine distance is defined (in distances.py and cdistances.cpp) as follows:
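For reference, a minimal sketch of the textbook cosine distance, 1 - (a · b) / (|a| |b|), assuming each keyword vector is stored as a sparse dictionary mapping keyword to weight. The function name, the dictionary representation, and the handling of empty vectors are assumptions for illustration; the definitions in distances.py and cdistances.cpp may differ.

```python
import math


def cosine_distance(a, b):
    """Textbook cosine distance between two sparse vectors (dict: keyword -> weight).

    Illustrative only; not copied from distances.py.
    """
    # Only keywords present in both vectors contribute to the dot product.
    dot = sum(weight * b[keyword] for keyword, weight in a.items() if keyword in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: an empty vector is maximally distant from everything.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

Under this definition, identical vectors have a distance of (approximately) 0 and vectors with no shared keywords have a distance of 1.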
Note that it IS possible to use other scoring methods than
Note that the dictionary files idf.json and goterm_keywords.json are generated ahead of time, as they depend only on the database of GO terms.
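To illustrate why these files can be precomputed, here is a rough sketch of how an inverse document frequency (IDF) table could be built from GO term keywords alone. The function build_idf, the toy keyword lists, and the plain log(N / df) formula are all assumptions for illustration; the actual contents and generation of idf.json may differ.

```python
import json
import math


def build_idf(term_keywords):
    """Compute a simple IDF table from a dict of GO id -> list of keywords.

    Hypothetical helper; uses the unsmoothed formula log(N / df).
    """
    n_terms = len(term_keywords)
    doc_freq = {}
    for keywords in term_keywords.values():
        # Count each keyword at most once per GO term.
        for keyword in set(keywords):
            doc_freq[keyword] = doc_freq.get(keyword, 0) + 1
    return {keyword: math.log(n_terms / df) for keyword, df in doc_freq.items()}


# Toy input: illustrative keywords, not the real output of preprocess().
toy_keywords = {
    "GO:0030217": ["t", "cell", "differentiation"],
    "GO:0046631": ["alpha-beta", "t", "cell", "activation"],
}

idf = build_idf(toy_keywords)
# Serialized form, similar in spirit to what a file like idf.json could hold.
idf_json = json.dumps(idf, indent=2, sort_keys=True)
```

Because nothing here depends on the input paper, a table like this only needs to be rebuilt when the GO database changes.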
Prioritization using rank() is demonstrated with the first sentence of PubMed paper 29789755.
>> from main import rank
>> text = "The remarkable T cell receptor (TCR) performs essential functions in the initiation of intracellular signals required for T cell development, repertoire selection and effector responses to foreign antigens."
>> rank(text, verbose = True, top = 5)
0 11 GO:0045061 0.459375 thymic T cell selection
1 7 GO:0030217 0.291391 T cell differentiation
2 12 GO:0046632 0.259544 alpha-beta T cell differentiation
3 9 GO:0046631 0.161645 alpha-beta T cell activation
4 13 GO:0046633 0.127241 alpha-beta T cell proliferation
When ranking with verbose = True, you will also see a Python list showing how the text was preprocessed into keywords.
Various ranking methods (dummy, jaccard, lev, tfidf, bm25) can be specified, as below. By default, BM25 is used.
>> rank(text, method = "jaccard")
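As an illustration of what a set-based method such as jaccard might compute, here is a minimal sketch of the standard Jaccard similarity between two keyword lists. The function name, the inputs, and the handling of empty inputs are assumptions for illustration; the project's own implementation may differ.

```python
def jaccard_similarity(keywords_a, keywords_b):
    """Standard Jaccard similarity: |A intersect B| / |A union B|.

    Illustrative only; not copied from the project's source.
    """
    set_a, set_b = set(keywords_a), set(keywords_b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty keyword lists have similarity 0.
        return 0.0
    return len(set_a & set_b) / len(union)


# Toy keyword lists; real keywords would come from preprocess().
paper_keywords = ["t", "cell", "receptor"]
goterm_keywords = ["t", "cell", "differentiation"]
score = jaccard_similarity(paper_keywords, goterm_keywords)
```

Unlike TF-IDF or BM25, this measure ignores keyword weights and counts only which keywords overlap.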