GONLP

This is an Information Retrieval-based script for prioritizing Gene Ontology (GO) terms based on text from a biomedical paper. This is intended to be run ahead of an LLM-based annotation system.

This project was written by Jordan Lau and supervised by Dr. Arvind Mer with help from Aws Almir Ahmad.

Ranking Methodology

GO terms are prioritized using a keyword-based approach. Keywords are extracted from both the abstract and the GO terms using the same preprocess() function in preprocessing.py. Each abstract is stored as a vector of keywords, and each keyword in the vector is weighted using the $BM25$ formula. A GO term and an abstract are then compared using $cosine$ distance. The ranking is carried out by the rank() function in main.py.

The $BM25$ score of keyword $j$ is defined as follows:

$$ \text{BM25}_j=\frac{\text{TF}_j \times \text{IDF}_j \times (k+1)}{\text{TF}_j + k(1 - b + b(\frac{\text{number of GO Keywords}}{\text{average number of GO Keywords}}))} \quad \text{where } k = 1.6 \text{, } b = 0.75 $$

$BM25$ depends on $IDF$, the Inverse Document Frequency. The IDF of keyword $j$ is defined as follows:

$$\text{IDF}_j = \log_2(\frac{\text{number of GO Terms}}{\text{GO Terms with keyword }j})$$
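As a rough illustration of the two formulas above, the weights could be computed along the following lines. This is a minimal sketch, not the code in main.py: the function names `idf` and `bm25`, their parameters, and the example counts are all hypothetical, and the length ratio is assumed to be the keyword count of the item being weighted divided by the average keyword count.

```python
import math

K = 1.6   # term-frequency saturation, value taken from the BM25 formula above
B = 0.75  # length-normalization strength, value taken from the BM25 formula above


def idf(num_go_terms, num_go_terms_with_keyword):
    """IDF_j = log2(number of GO terms / GO terms containing keyword j)."""
    return math.log2(num_go_terms / num_go_terms_with_keyword)


def bm25(tf, idf_value, num_keywords, avg_num_keywords, k=K, b=B):
    """BM25 weight of a single keyword, term by term as in the formula above."""
    # Assumption: the ratio compares this item's keyword count to the average.
    length_norm = 1 - b + b * (num_keywords / avg_num_keywords)
    return (tf * idf_value * (k + 1)) / (tf + k * length_norm)


# Hypothetical counts: 1024 GO terms, 16 of which contain the keyword.
keyword_idf = idf(1024, 16)  # log2(64)

# A keyword that occurs once in an item of exactly average length.
weight = bm25(tf=1, idf_value=keyword_idf, num_keywords=4, avg_num_keywords=4)
print(keyword_idf, weight)
```

With TF equal to 1 and an item of average length, the length term is 1 and the BM25 weight reduces to the IDF itself, which makes a convenient sanity check.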

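Cosine distance, computed here as the cosine of the angle between the two weighted vectors (so a higher score means a closer match), is defined (in distances.py and cdistances.cpp) as follows: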

$$ \text{cosine}(ABS, GO)=\frac{\text{weightedvector(ABS)} \cdot \text{weightedvector(GO)}}{|\text{weightedvector(ABS)}|\times|\text{weightedvector(GO)}|} $$
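The formula above can be sketched in a few lines of Python. This is an illustrative version only, not the code in distances.py or cdistances.cpp: it assumes each weighted vector is held as a dictionary mapping keyword to weight, and the function name `cosine` and the example weights are hypothetical.

```python
import math


def cosine(weights_a, weights_b):
    """Cosine of the angle between two sparse keyword-weight vectors (dicts)."""
    # Only keywords present in both vectors contribute to the dot product.
    shared = weights_a.keys() & weights_b.keys()
    dot = sum(weights_a[key] * weights_b[key] for key in shared)
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm_a * norm_b)


# Hypothetical weights for an abstract and a GO term sharing one keyword.
abstract = {"cell": 3.0, "receptor": 4.0}
go_term = {"cell": 3.0, "selection": 4.0}
score = cosine(abstract, go_term)  # 9 / (5 * 5)
print(score)
```

Because all weights are non-negative, the score falls between 0 (no shared keywords) and 1 (identical direction).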

Note that it is possible to use scoring methods other than $cosine$ distance with $BM25$, such as $jaccard$ distance, but $cosine$ distance with $BM25$ is the recommended configuration.
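For comparison, a plain Jaccard score over keyword sets might look like the sketch below. It is an assumption about what a $jaccard$ method could compute, not the repository's implementation: the function name `jaccard` is hypothetical, the score is the standard intersection-over-union index, and it ignores keyword weights entirely.

```python
def jaccard(keywords_a, keywords_b):
    """Jaccard index: shared keywords divided by all distinct keywords."""
    a, b = set(keywords_a), set(keywords_b)
    if not a and not b:
        return 0.0  # two empty keyword lists are treated as not matching
    return len(a & b) / len(a | b)


# Hypothetical keyword lists for an abstract and a GO term.
overlap = jaccard(["t", "cell", "receptor"], ["t", "cell", "selection"])
print(overlap)  # 2 shared keywords out of 4 distinct ones
```

Unlike the cosine score, this treats every keyword as equally important, which is one plausible reason a weighted configuration is preferred.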

Note that the dictionary files idf.json and goterm_keywords.json are generated ahead of time, as they depend only on the database of GO terms.

Usage

Prioritization using rank() is demonstrated with the first sentence of PubMed paper 29789755.

>> from main import rank
>> text = "The remarkable T cell receptor (TCR) performs essential functions in the initiation of intracellular signals required for T cell development, repertoire selection and effector responses to foreign antigens."
>> rank(text, verbose = True, top = 5)
0     11  GO:0045061    0.459375            thymic T cell selection
1      7  GO:0030217    0.291391             T cell differentiation
2     12  GO:0046632    0.259544  alpha-beta T cell differentiation
3      9  GO:0046631    0.161645       alpha-beta T cell activation
4     13  GO:0046633    0.127241    alpha-beta T cell proliferation

When ranking with verbose = True, you will also see a Python list showing how the text was preprocessed into keywords.

Various ranking methods (dummy, jaccard, lev, tfidf, bm25) can be specified, as below. By default, BM25 is used.

>> rank(text, method = "jaccard")
