Articles Analyzer using Grobid
Follow these steps to get started with the project.
Create an environment to run the repository:
conda create -n <name_env> python=3.11.5
If you are already on Python 3.11.5, you can also use a Python virtual environment:
python -m venv <name_env>
git clone --recursive <REPOSITORY_URL>
cd <DIRECTORY_NAME>
Install the required dependencies for the project:
pip install -r requirements.txt
This repository has a setup.py using setuptools. Run setup install to place the packages in site-packages:
python setup.py install
This project is configured through YAML config files.
The main script relies on the folders {base_config} and {data_main} to work correctly; if you move the main script, bring those folders along with it.
If you use this project as a library, as in the examples, look at the examples folders for custom config folders with their config files.
In the folder config/api, check grovid-server-config.yaml to modify the protocol (http, https), domain (example.com), and port (8070) before starting.
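A server config file of this kind might look as follows. This is a hypothetical sketch: the key names are assumptions based on the three values described above, not the file's actual schema.

```yaml
# Hypothetical contents of config/api/grovid-server-config.yaml;
# key names are assumptions, only the values come from this README.
protocol: http        # or https
domain: example.com
port: 8070
```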
Also in config/api, check api-base-config.yaml: every functionality shares those base config values. If you create a new config file, do not remove them; any change to the base values can lead to unexpected results.
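The base-config idea can be sketched in plain Python: every functionality starts from the shared base values and overlays only its own settings, never removing the base keys. The key names below are illustrative, not the project's actual config schema.

```python
# Sketch of the shared base-config pattern: functionality-specific settings
# are layered on top of the base values, which must all survive the merge.
def merge_config(base: dict, override: dict) -> dict:
    """Return a new config with override values layered on top of the base."""
    merged = dict(base)      # start from every base key
    merged.update(override)  # then apply the functionality-specific values
    return merged

# Illustrative values only; not the real api-base-config.yaml contents.
base_config = {"protocol": "http", "domain": "example.com", "port": 8070}
word_cloud_config = {"output_file": "word_cloud.png"}

config = merge_config(base_config, word_cloud_config)
# All base keys survive alongside the functionality-specific ones.
```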
This project needs a GROBID server to work, so make sure an online GROBID server is available; you can use Docker to run a local GROBID server (see the GROBID documentation for how to set one up).
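Before running anything, you can verify the server is reachable. GROBID exposes a GET /api/isalive endpoint on its service port; this standard-library sketch (not part of this project's code) builds the URL from the three config values and queries it.

```python
# Minimal GROBID liveness check using only the standard library.
from urllib.request import urlopen

def isalive_url(protocol: str, domain: str, port: int) -> str:
    """Build the GROBID liveness endpoint URL from the config values."""
    return f"{protocol}://{domain}:{port}/api/isalive"

def server_is_alive(protocol: str, domain: str, port: int) -> bool:
    """Return True if the GROBID server answers its isalive endpoint."""
    try:
        with urlopen(isalive_url(protocol, domain, port), timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False
```

Called as `server_is_alive("http", "localhost", 8070)` against the Docker setup shown below, this returns True once the container is up.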
In the examples folder there are notebooks with a brief demonstration of the functionalities and how to use the classes.
You can run the functionalities as a library, as shown in the examples, or through the main script.
While, as shown in the examples, this project can be used as a library, it can also be executed as a script through main.py. Its parameters are {service}, which selects the functionality, plus {--protocol} (http), {--domain} (example.com), and {--port} (8070).
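A command line with these parameters could be parsed with argparse along the following lines. This is a sketch mirroring the parameters listed above, not the project's actual main.py.

```python
# Illustrative argparse setup for the {service}/--protocol/--domain/--port
# interface described above; defaults and choices are assumptions.
import argparse

parser = argparse.ArgumentParser(description="Articles Analyzer using Grobid")
parser.add_argument("service", help="functionality to run, e.g. visualize.word_cloud")
parser.add_argument("--protocol", default="http", choices=["http", "https"])
parser.add_argument("--domain", default="example.com")
parser.add_argument("--port", type=int, default=8070)

# Parsing the example invocation from this README:
args = parser.parse_args(
    ["visualize.word_cloud", "--protocol", "http",
     "--domain", "example.com", "--port", "8070"]
)
```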
python main.py {service} --protocol http --domain example.com --port 8070
Using the class WordCloud, we extract the abstracts of the articles and create a word-cloud PNG from the text.
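At its core, a word cloud is driven by word frequencies in the abstracts. This standard-library sketch shows only that counting step; the PNG rendering is done by the project itself, and the sample abstract is invented for illustration.

```python
# Stdlib-only sketch of the frequency-counting step behind a word cloud.
import re
from collections import Counter

def word_frequencies(text: str) -> Counter:
    """Count lowercase word occurrences, ignoring punctuation."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

abstract = "Grobid parses PDF articles. Grobid extracts abstracts from articles."
freqs = word_frequencies(abstract)
# freqs.most_common(2) -> [('grobid', 2), ('articles', 2)]
```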
python main.py visualize.word_cloud
Using the class CountAttribute, we can count specific elements of the articles and create bar charts comparing them. In the config folder config/api there is count-config.yaml, where you can set which attributes to find; currently it is set to find elements from the XML.
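The counting idea in miniature: tally occurrences of chosen element tags in an XML document. The tag names and the XML snippet below are illustrative, not GROBID's actual output schema or the real count-config.yaml contents.

```python
# Count selected element tags in an XML document (illustrative sketch).
import xml.etree.ElementTree as ET
from collections import Counter

xml_doc = """
<article>
  <figure/><figure/>
  <table/>
  <ref/><ref/><ref/>
</article>
"""

root = ET.fromstring(xml_doc)
wanted = ["figure", "table", "ref"]   # what count-config.yaml might list
counts = Counter(elem.tag for elem in root.iter() if elem.tag in wanted)
# counts -> Counter({'ref': 3, 'figure': 2, 'table': 1})
```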
python main.py visualize.stadistic
The class SearchLink finds elements and https links in the articles and lists them in a table rendered with the rich library.
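The link-finding step can be sketched with a regular expression over the article text. This is illustrative, not the project's SearchLink implementation, and the table rendering with rich is omitted.

```python
# Stdlib sketch of https-link extraction from article text.
import re

def find_https_links(text: str) -> list[str]:
    """Return every https URL found in the text, trimming trailing punctuation."""
    raw = re.findall(r'https://[^\s<>"]+', text)
    return [url.rstrip(".,;)") for url in raw]

sample = "See https://github.com/kermitt2/grobid and the docs at https://example.com/docs."
links = find_https_links(sample)
# links -> ['https://github.com/kermitt2/grobid', 'https://example.com/docs']
```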
python main.py visualize.links_search
In case you need to run this project as a container, use the Dockerfile in the docker folder.
First, you need a running GROBID server:
docker pull grobid/grobid:0.8.0
docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.0
docker build --no-cache -t pdf-analyzer docker
docker run -it --network host pdf-analyzer /bin/bash
Now you can run the services with python main.py {service}.
The code is documented at Read the Docs.
This project is under the Apache 2.0 License. Refer to the LICENSE file for more details.
- Name: Jorge Martin Izquierdo
- Email: jorge.martin.izquierdo@alumnos.upm.es
As this project builds on the GROBID client API and server, check out the original author of these two programs:
GROBID (2008-2022) https://github.com/kermitt2/grobid
GROBID Python client (2008-2022) https://github.com/kermitt2/grobid_client_python