merlab/GONLP

GONLP

This is an Information Retrieval-based script for prioritizing Gene Ontology (GO) terms based on text from a biomedical paper. This is intended to be run ahead of an LLM-based annotation system.

This project was written by Jordan Lau and supervised by Dr. Arvind Mer with help from Aws Almir Ahmad.

Ranking Methodology

GO terms are prioritized using a keyword-based approach. Keywords are extracted from both the abstract and the GO terms using the same preprocess() function in preprocessing.py. Each abstract is stored as a vector of keywords, and each keyword in the vector is weighted using the $BM25$ formula. A GO term and an abstract are then compared using $cosine$ distance. This happens in the function rank() in main.py.
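The overall flow described above (preprocess both sides with the same function, score each GO term against the text, sort by score) can be sketched roughly as follows. This is an illustrative toy only: `toy_preprocess`, `toy_rank`, `overlap`, and the stopword list are hypothetical stand-ins, not the repository's preprocess() or rank(), and the placeholder score simply counts shared keywords instead of using BM25 with cosine.

```python
import re

# Toy stopword list for illustration; not the repository's list.
STOPWORDS = {"the", "of", "in", "and", "to", "for"}


def toy_preprocess(text):
    """Lowercase, split on non-alphanumerics, drop stopwords (stand-in for preprocess())."""
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok and tok not in STOPWORDS]


def overlap(keywords_a, keywords_b):
    """Placeholder score: number of keywords the two lists share."""
    return len(set(keywords_a) & set(keywords_b))


def toy_rank(text, go_terms, score, top=5):
    """Score every GO term against the text and return the best `top` as (id, score)."""
    text_keywords = toy_preprocess(text)
    scored = [
        (go_id, score(text_keywords, toy_preprocess(name)))
        for go_id, name in go_terms.items()
    ]
    # Highest score first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top]


go_terms = {
    "GO:0045061": "thymic T cell selection",
    "GO:0030217": "T cell differentiation",
    "GO:0046633": "alpha-beta T cell proliferation",
}
result = toy_rank("T cell development and repertoire selection", go_terms, overlap, top=2)
print(result)
```

Passing the score as a function mirrors the idea that the ranking method can be swapped without changing the surrounding loop.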

The $BM25$ score of keyword $j$ is defined as follows:

$$ \text{BM25}_j=\frac{\text{TF}_j \times \text{IDF}_j \times (k+1)}{\text{TF}_j + k(1 - b + b(\frac{\text{number of GO Keywords}}{\text{average number of GO Keywords}}))} \quad \text{where } k = 1.6 \text{, } b = 0.75 $$
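As a minimal sketch, the formula above translates to a few lines of Python. The function name `bm25_weight` and its parameter names are assumptions made for illustration and are not taken from main.py; `n_keywords` and `avg_n_keywords` correspond to the two keyword counts in the length-normalization term.

```python
def bm25_weight(tf, idf, n_keywords, avg_n_keywords, k=1.6, b=0.75):
    """BM25 weight of one keyword, following the formula above.

    tf             -- term frequency of the keyword
    idf            -- inverse document frequency of the keyword
    n_keywords     -- number of GO keywords (numerator of the length ratio)
    avg_n_keywords -- average number of GO keywords (denominator of the ratio)
    """
    length_norm = 1 - b + b * (n_keywords / avg_n_keywords)
    return (tf * idf * (k + 1)) / (tf + k * length_norm)


# At average length the normalization term is 1, so the weight
# reduces to tf * idf * (k + 1) / (tf + k).
print(bm25_weight(tf=1, idf=2.0, n_keywords=4, avg_n_keywords=4))
```

With everything else fixed, a larger length ratio increases the denominator and therefore lowers the weight.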

$BM25$ depends on $IDF$, the Inverse Document Frequency. The IDF of keyword $j$ is defined as follows:

$$\text{IDF}_j = \log_2(\frac{\text{number of GO Terms}}{\text{GO Terms with keyword }j})$$
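For illustration, the IDF table could be computed from a mapping of GO terms to their keywords roughly as below. The function name `compute_idf`, the input structure, and the toy data are assumptions; the actual layout of goterm_keywords.json and idf.json may differ.

```python
import math


def compute_idf(go_term_keywords):
    """Return {keyword: log2(number of GO terms / GO terms containing keyword)}."""
    n_terms = len(go_term_keywords)
    doc_freq = {}
    for keywords in go_term_keywords.values():
        # Count each keyword at most once per GO term.
        for keyword in set(keywords):
            doc_freq[keyword] = doc_freq.get(keyword, 0) + 1
    return {kw: math.log2(n_terms / df) for kw, df in doc_freq.items()}


# Toy data with placeholder term names, for illustration only.
toy_terms = {
    "term_a": ["t", "cell", "selection"],
    "term_b": ["t", "cell", "differentiation"],
    "term_c": ["t", "activation"],
    "term_d": ["t", "proliferation"],
}
toy_idf = compute_idf(toy_terms)
print(toy_idf["t"], toy_idf["cell"], toy_idf["selection"])
```

A keyword that appears in every GO term gets an IDF of 0, so it contributes nothing to the BM25 weight.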

Cosine distance is defined (in distances.py and cdistances.cpp) as follows. As written, the formula is the cosine similarity, so a higher value means a closer match:

$$ \text{cosine}(ABS, GO)=\frac{\text{weightedvector(ABS)} \cdot \text{weightedvector(GO)}}{|\text{weightedvector(ABS)}|\times|\text{weightedvector(GO)}|} $$
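A minimal sketch of this formula over sparse keyword vectors, represented here as `{keyword: weight}` dictionaries, might look like the following. The representation and the function name `cosine` are assumptions for illustration and are not the code in distances.py or cdistances.cpp.

```python
import math


def cosine(vec_a, vec_b):
    """Cosine of the angle between two sparse {keyword: weight} vectors."""
    # Only keywords present in both vectors contribute to the dot product.
    dot = sum(weight * vec_b.get(keyword, 0.0) for keyword, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


abstract_vec = {"t": 1.0, "cell": 1.0}
go_vec = {"t": 1.0}
print(cosine(abstract_vec, go_vec))
```

Returning 0.0 for an empty vector is a design choice of this sketch that avoids a division by zero when no keywords survive preprocessing.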

Note that it is possible to use scoring methods other than $cosine$ distance with $BM25$, such as $jaccard$ distance, but $cosine$ distance with $BM25$ is the recommended configuration.
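For comparison, a Jaccard-style score ignores keyword weights and looks only at set overlap. The sketch below uses the standard Jaccard index (intersection size over union size); whether the repository's jaccard method uses exactly this form is an assumption, and the function name is hypothetical.

```python
def jaccard(keywords_a, keywords_b):
    """Jaccard index of two keyword collections: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(keywords_a), set(keywords_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here when both sides are empty
    return len(set_a & set_b) / len(union)


print(jaccard(["t", "cell", "receptor"], ["t", "cell", "selection"]))
```

Because every keyword counts equally here, very common keywords influence the score as much as rare ones, which is one reason a weighted scheme can rank differently.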

Note that the dictionary files idf.json and goterm_keywords.json are generated ahead of time, as they depend only on the database of GO terms.
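The idea of generating such dictionaries once and reusing them can be sketched as a small load-or-build helper. The helper name `load_or_build`, the builder function, and the assumed file content (a flat keyword-to-value mapping) are hypothetical; the repository's own generation step is not shown here.

```python
import json
import os
import tempfile


def load_or_build(path, build):
    """Load a JSON dictionary from `path`, building and saving it first if missing."""
    if not os.path.exists(path):
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(build(), handle)
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def build_toy_idf():
    """Stand-in for an expensive pass over the GO term database."""
    return {"cell": 1.0, "selection": 2.0}


# Use a temporary directory so the sketch does not touch real project files.
cache_path = os.path.join(tempfile.mkdtemp(), "idf.json")
first = load_or_build(cache_path, build_toy_idf)   # builds and writes the file
second = load_or_build(cache_path, build_toy_idf)  # reads the cached file
print(first == second)
```

Since the dictionaries depend only on the GO terms, the cached files stay valid until the GO database itself changes.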

Usage

Prioritization using rank() is demonstrated with the first sentence of PubMed paper 29789755.

>> from main import rank
>> text = "The remarkable T cell receptor (TCR) performs essential functions in the initiation of intracellular signals required for T cell development, repertoire selection and effector responses to foreign antigens."
>> rank(text, verbose = True, top = 5)
0     11  GO:0045061    0.459375            thymic T cell selection
1      7  GO:0030217    0.291391             T cell differentiation
2     12  GO:0046632    0.259544  alpha-beta T cell differentiation
3      9  GO:0046631    0.161645       alpha-beta T cell activation
4     13  GO:0046633    0.127241    alpha-beta T cell proliferation

When ranking with verbose = True, you will also see a Python list showing how the text was preprocessed into keywords.

Various ranking methods (dummy, jaccard, lev, tfidf, bm25) can be specified, as below. By default, BM25 is used.

>> rank(text, method = "jaccard")

About

Improving Information Retrieval for Gene Ontology Annotation
