Home
This wiki describes how to set up the Processing Pipeline service, what the different components used to create the pipeline are, and how to create new components if none of the provided ones fits your needs.
The framework enables processing documents and text by extracting the document content, annotating and translating the text, and validating the output. It is developed in TypeScript and runs on Node.js.
The service is based on the qtopology module, a distributed stream processing layer. The pipeline is constructed from the following types of components:
- Bolts. The components that interact with the data.
- Spouts. The components that retrieve the data from an external source.
- Topologies. The configuration files that specify how the bolts and spouts interact.
To set up the service:

- Create a `.env` file in the `env` folder. See the instructions described in this wiki page.
- Set up the Docker images (see the Docker Wiki).
- Create the appropriate PostgreSQL tables (see this script in another repository for the X5GON database structure).
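The wiki page linked above lists the exact environment variables; as an illustration only, a `.env` file takes the general form below. The variable names here are hypothetical placeholders, not the service's actual configuration keys:

```
# Hypothetical example — consult the wiki page for the real variable names.
PG_HOST=localhost
PG_PORT=5432
PG_USER=pipeline
PG_PASSWORD=changeme
```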
Requirements:

- node.js v10.23.0 and npm 6.14.8 or higher

To test that your node.js version is correct, run `node --version` and `npm --version`.
The pipeline uses a Node.js module called textract, which supports text extraction from most text file formats. For some file types, additional libraries need to be installed:
- PDF extraction requires `pdftotext` to be installed (link).
- DOC extraction requires `antiword` to be installed (link), unless on OSX, in which case textutil (installed by default) is used.
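Before running the pipeline, it can be useful to verify that these external tools are on the `PATH`. The helper below is an illustrative check (not part of the service's codebase) using only the Node.js standard library:

```typescript
import { spawnSync } from "child_process";

// Check whether an external tool is available on PATH by asking
// `which` to locate it (on Windows, `where` would be the equivalent).
// A zero exit status means the binary was found.
function hasBinary(name: string): boolean {
    const result = spawnSync("which", [name], { encoding: "utf8" });
    return result.status === 0;
}

// Report which optional extraction tools are present.
for (const tool of ["pdftotext", "antiword"]) {
    console.log(`${tool}: ${hasBinary(tool) ? "found" : "missing"}`);
}
```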
To install the project, run `npm install`. To build the project code, run `npm run build`.

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 761758 (X5GON).