CitExp

Informed Exploration of Scientific Literature via Semantically-Enriched Citation Paths
CitExp provides a set of natural language analysis tools, written in Python, for constructing a semantically enriched citation graph. It uses Natural Language Processing and Data Mining techniques to enable advanced retrieval and exploration of scientific literature. This release is specific to the ACL Anthology, a corpus of scientific publications sponsored by the Association for Computational Linguistics.
A simple web application for navigating the resulting graph is available at http://citexp.di.unito.it

Getting Started

The project is divided into four sequential steps. These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
The software has been tested on the ACL Anthology; consequently, the preprocessing phase is specific to this corpus. You can download a copy of the corpus here: https://acl-arc.comp.nus.edu.sg/

Prerequisites

The preprocessing phase requires the corpus metadata and the output of the ParsCit tool; both are included in the dataset.
The vectorizer phase requires a running StanfordCoreNLP server. The tool can be found here: https://stanfordnlp.github.io/CoreNLP/index.html
The server can be started, for example, as follows:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
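
Before running the vectorizer it can help to confirm that the server answers. The following is a minimal sketch, not part of CitExp: the function names `build_corenlp_url` and `annotate` are hypothetical, and the annotator list is an assumption about what a dependency-based step would need. It relies only on the documented CoreNLP server behaviour of reading settings from the `properties` query parameter and the text from the POST body.

```python
import json
import urllib.parse
import urllib.request


def build_corenlp_url(base_url, annotators="tokenize,ssplit,pos,depparse"):
    """Build the request URL; the server reads its settings from 'properties'."""
    properties = {"annotators": annotators, "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(properties)})
    return base_url.rstrip("/") + "/?" + query


def annotate(text, base_url="http://localhost:9000", timeout=15):
    """POST raw text to the server and return the parsed JSON response."""
    request = urllib.request.Request(
        build_corenlp_url(base_url),
        data=text.encode("utf-8"),  # supplying data makes this a POST request
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server started as above, `annotate("We follow the approach proposed by Smith.")["sentences"]` should contain one entry per parsed sentence; a connection error means the server is not reachable on that port.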

Requirements

Python - Python 3.6.6
Java - Java 8
nltk - NLTK
numpy - NumPy
scipy - SciPy
scikit-learn - Scikit-learn
matplotlib - Matplotlib
lxml - lxml
pandas - Pandas
requests - Requests 2.21.0
pathlib - Pathlib
progressbar2 - Progressbar 2 3.39.2

Pipeline

The pipeline consists of four steps: preprocessing, vectorizer, clustering and graph construction.

Preprocessing

The required arguments for the preprocessing script are:

-o --output        Directory used to store the output
-i --input        ACL Anthology directory path

Optional arguments:

Parameter	Default	Description
-b --begin	0	index of first article to consider
-e --end	22000	index of last article to consider
-x --xml	False	write xml prefix in the output file
-m --matches	False	find matches between citations and articles in metadata
-v --verbose	False	print detailed log

Example:

python preprocessing.py -o ../../output -i ../../resources/ACL_Anthology -b 0 -e 1000 -m
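
To make the flag semantics concrete, here is a sketch of how the options above could be declared with `argparse`. It is an illustration only: `build_parser` is a hypothetical name, and treating `-x`, `-m` and `-v` as boolean switches is an assumption based on their `False` defaults. The actual `preprocessing.py` may be organised differently.

```python
import argparse


def build_parser():
    """Declare the documented preprocessing options (illustrative, not the real script)."""
    parser = argparse.ArgumentParser(description="Preprocess the ACL Anthology corpus")
    parser.add_argument("-o", "--output", required=True,
                        help="directory used to store the output")
    parser.add_argument("-i", "--input", required=True,
                        help="ACL Anthology directory path")
    parser.add_argument("-b", "--begin", type=int, default=0,
                        help="index of first article to consider")
    parser.add_argument("-e", "--end", type=int, default=22000,
                        help="index of last article to consider")
    # Assumed to be on/off switches because their documented default is False.
    parser.add_argument("-x", "--xml", action="store_true",
                        help="write xml prefix in the output file")
    parser.add_argument("-m", "--matches", action="store_true",
                        help="find matches between citations and articles in metadata")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print detailed log")
    return parser


# Same arguments as the example command above.
args = build_parser().parse_args(
    ["-o", "../../output", "-i", "../../resources/ACL_Anthology",
     "-b", "0", "-e", "1000", "-m"]
)
print(args.begin, args.end, args.matches, args.xml)  # 0 1000 True False
```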

Vectorizer

Before starting the vectorizer make sure the StanfordCoreNLP server is running. The required arguments for the vectorizer script are:

-o --output        Directory used to store the output
-i --input       Output file of the preprocessing step, used as input
-u --url        StanfordCoreNLP server url

Optional arguments:

Parameter	Default	Description
-d --depth	2	maximum depth reachable during the visit of the dependency graph
-l --limit	-1	maximum number of input snippets to be considered

Example:

python vectorizer.py -i "../../output/preprocessing.xml" -o "../../output" -d 2 -u "http://localhost:9000"
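
The `--depth` option bounds how far the visit of the dependency graph may go. The sketch below shows one way such a bounded visit could work; it is not taken from `vectorizer.py`. It assumes edges in the CoreNLP JSON shape (`governor` and `dependent` token indices), treats them as undirected, and starts from the token that stands for the citation. The function names and the hand-written parse are hypothetical.

```python
from collections import defaultdict, deque


def build_adjacency(dependencies):
    """Turn CoreNLP-style dependency edges into an undirected adjacency map."""
    adjacency = defaultdict(set)
    for edge in dependencies:
        governor, dependent = edge["governor"], edge["dependent"]
        if governor == 0:  # index 0 is the artificial ROOT node
            continue
        adjacency[governor].add(dependent)
        adjacency[dependent].add(governor)
    return adjacency


def tokens_within_depth(dependencies, start, max_depth=2):
    """Breadth-first visit returning {token index: distance} up to max_depth."""
    adjacency = build_adjacency(dependencies)
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


# Hand-written parse of "We follow the approach of Smith" (tokens 1..6).
edges = [
    {"dep": "ROOT", "governor": 0, "dependent": 2},
    {"dep": "nsubj", "governor": 2, "dependent": 1},
    {"dep": "obj", "governor": 2, "dependent": 4},
    {"dep": "det", "governor": 4, "dependent": 3},
    {"dep": "nmod", "governor": 4, "dependent": 6},
    {"dep": "case", "governor": 6, "dependent": 5},
]
reachable = tokens_within_depth(edges, start=6, max_depth=2)
print(sorted(reachable.items()))  # [(2, 2), (3, 2), (4, 1), (5, 1), (6, 0)]
```

With depth 2, "follow" and "approach" are reached from "Smith", while "We" (three edges away) is left out; a larger depth pulls in more context at the cost of noisier features.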

Clustering

The required arguments for the clustering script are:

-o --output        Directory used to store the output
-i --input       Directory containing the output of the vectorizer step
-c --clusters        Number of clusters
-r --reduction        Number of SVD components
-s --silhouette        Sample size for the silhouette computation

Optional arguments:

Parameter	Default	Description
-v --verbose	False	print detailed log

Example:

python clustering.py -i "../../output" -o "../../output" -c 30 -r 3000 -v -s 1000
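
The three numeric arguments map naturally onto a reduce, cluster, evaluate sequence. The following sketch uses scikit-learn to illustrate that sequence under stated assumptions: the choice of `KMeans` is an assumption (this README does not name the clustering algorithm), `cluster_vectors` is a hypothetical helper, and random toy data stands in for the vectorizer output.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import silhouette_score


def cluster_vectors(vectors, n_clusters, n_components, silhouette_sample, seed=0):
    """Reduce with truncated SVD, cluster, and score the result on a sample."""
    # -r: project the vectors onto n_components SVD components.
    reduced = TruncatedSVD(n_components=n_components, random_state=seed).fit_transform(vectors)
    # -c: partition the reduced vectors (KMeans is assumed here for illustration).
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(reduced)
    # -s: the silhouette is computed on a random sample to keep the cost manageable.
    score = silhouette_score(reduced, labels, sample_size=silhouette_sample, random_state=seed)
    return labels, score


# Toy data standing in for the vectorizer output: 200 snippets, 50 features.
toy_vectors = np.random.RandomState(0).rand(200, 50)
labels, score = cluster_vectors(toy_vectors, n_clusters=5, n_components=10, silhouette_sample=100)
print(len(labels), float(score))
```

The silhouette coefficient lies between -1 and 1; values near 0 indicate overlapping clusters, which is what random toy data should produce.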

Graph construction

Labeling the clusters requires a JSON file that gives the label for each cluster to be considered. Example of the JSON file:

{
	"1": "see for details",
	"2": "present",
	"5": "use",
	"10": "report",
	"11": "proposed by",
	"12": "method",
	"17": "present",
	"20": "follow",
	"24": "approach",
	"25": "introduce"
}
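
A file like this can be read with the standard `json` module. The sketch below is illustrative and not taken from `graph.py`: the function names and the `(citation, cluster id)` pairs are hypothetical, and dropping citations whose cluster has no label is an assumption drawn from the phrase "each cluster to be considered".

```python
import json


def load_cluster_labels(json_text):
    """Parse the label file; JSON keys are strings, so convert them to int ids."""
    return {int(cluster_id): label for cluster_id, label in json.loads(json_text).items()}


def label_citations(citation_clusters, labels):
    """Keep only citations whose cluster has a label, attaching that label."""
    return [
        (citation, labels[cluster_id])
        for citation, cluster_id in citation_clusters
        if cluster_id in labels
    ]


labels = load_cluster_labels('{"1": "see for details", "5": "use", "11": "proposed by"}')
# Hypothetical (citation, cluster id) pairs; cluster 3 has no label and is dropped.
citations = [("paper_A -> paper_B", 5), ("paper_A -> paper_C", 3), ("paper_D -> paper_B", 11)]
print(label_citations(citations, labels))
# [('paper_A -> paper_B', 'use'), ('paper_D -> paper_B', 'proposed by')]
```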

The required arguments for the graph script are:

-o --output        Directory used to store the output
-c --clusters       File produced by the clustering step (named 'clusters.xml')
-a --aclpath        ACL Anthology directory path
-p --preprocessing        File produced by the preprocessing step (named 'preprocessing.xml')
-j --json        JSON file containing the label for each cluster to be considered

Example:

python graph.py -c "../../output/clusters.xml" -a ../../resources/ACL_Anthology -p "../../output/preprocessing.xml" -j "../../output/class.json" -o "../../output"

Results

Detailed information about the results is provided in README.md

Publications

Roger Ferrod, Claudio Schifanella, Luigi Di Caro, Mario Cataldi. Disclosing Citation Meanings for Augmented Research Retrieval and Exploration. In Proceedings of the 16th International Conference, ESWC 2019, Portorož, Slovenia, 2-6 June 2019.
https://link.springer.com/chapter/10.1007/978-3-030-21348-0_7

Authors

Roger Ferrod - Initial work - roger.ferrod@edu.unito.it

License

This project is licensed under the GNU GPLv3 License - see the LICENSE.md file for details

Acknowledgments

Supervisors:

Claudio Schifanella
Luigi Di Caro
