If you have Python installed:

    pip install uv

Alternatively, install with the standalone installer:

    curl -LsSf https://astral.sh/uv/install.sh | sh

The following command installs all the required packages:

    uv sync

The project currently provides the following functionality:
- Scrape metadata from arXiv, IEEE Xplore, ScienceDirect, and Springer
- Embed the scraped data and push it to a vector database
- Download embeddings from the vector database

The following command generates the necessary cleaned resources in the cache:

    uv run main.py --scrape

- You need to have Google Chrome installed if you are going to use the scrape function.
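
Since the scrape step depends on Google Chrome, a quick check beforehand can save a failed run. The sketch below is not part of the project; it only looks for a Chrome executable on `PATH` using Python's standard library. The candidate binary names are assumptions and vary by platform (on macOS and Windows, Chrome is often installed outside `PATH`), so a `None` result is a hint rather than proof that Chrome is missing.

```python
import shutil

# Assumed executable names; adjust for your platform.
CHROME_CANDIDATES = ("google-chrome", "google-chrome-stable", "chrome")


def find_chrome(candidates=CHROME_CANDIDATES):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    chrome = find_chrome()
    if chrome is None:
        print("No Chrome executable found on PATH; --scrape may fail.")
    else:
        print(f"Found Chrome at {chrome}")
```
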
Set up the database password environment variable:

    export VECTOR_DB_PWD=thepasswordprovided

The following command extracts embeddings from the scraped cache and pushes them to the vector database. The collection name must be a valid table name, so it cannot contain special characters. The model name is optional and defaults to "Alibaba-NLP/gte-multilingual-base"; see the Embedding Models Used section for more details.

    uv run main.py -g <collection_name> --model <model_name>

Example:

    uv run main.py -g gte --model "Alibaba-NLP/gte-multilingual-base"

The following command downloads the embeddings from the vector database into a cache folder:

    uv run main.py -d <collection_name> --cache_dir <cache_dir>

Example:

    uv run main.py -d gte --cache_dir embeddings

We provide a simple interactive command-line interface for RAG and for searching relevant papers.

    # To start the interactive RAG interface
    uv run main.py --rag <collection_name>
    # To start the interactive search interface
    uv run main.py --search <collection_name>

Example:

    uv run main.py --rag gte

The following command will run the web UI:

    uv run main.py --api <collection_name>
    # If you want to run the API on a specific host and port
    uv run main.py --api <collection_name> --api_host <host> --api_port <port>

Example:

    uv run main.py --api gte --api_host 0.0.0.0 --api_port 8000

We provide a local web UI for the API; you can access it by opening the web_ui.html file in your browser.
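
Every command above takes a `<collection_name>`, and the database steps expect `VECTOR_DB_PWD` to be set. The following is a hypothetical pre-flight helper, not code from the project. It assumes that "valid table name" means letters, digits, and underscores, not starting with a digit, which may be stricter or looser than what the database actually accepts.

```python
import os
import re

# Assumed rule: letters, digits, underscores; must not start with a digit.
_TABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def preflight(collection_name, env=None):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []
    if not _TABLE_NAME.fullmatch(collection_name):
        problems.append(f"invalid collection name: {collection_name!r}")
    if not env.get("VECTOR_DB_PWD"):
        problems.append("VECTOR_DB_PWD is not set")
    return problems
```

Under these assumptions, `preflight("gte", {"VECTOR_DB_PWD": "secret"})` returns an empty list, while a name such as `gte-base` is reported as invalid because of the hyphen.
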
We manually categorized the papers into the following categories:

| | category | count | description |
|---|---|---|---|
| 0 | ml_general | 89 | General Machine Learning |
| 1 | dl_nlp | 56 | Deep Learning for NLP |
| 2 | cv_pattern | 53 | Computer Vision Pattern Recognition |
| 3 | cv_generative | 43 | Computer Vision Generative Models |
| 4 | dl_rnn | 36 | Deep Learning with RNNs |
| 5 | audio | 25 | Audio |
| 6 | dl_rl | 18 | Deep Learning for Reinforcement Learning |
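
For a sense of how the categories are distributed, the counts in the table can be summed and turned into percentages. This is a small illustration, not part of the project; the numbers are copied from the table, and it assumes each paper is assigned to exactly one category.

```python
# Counts copied from the category table above.
counts = {
    "ml_general": 89,
    "dl_nlp": 56,
    "cv_pattern": 53,
    "cv_generative": 43,
    "dl_rnn": 36,
    "audio": 25,
    "dl_rl": 18,
}

total = sum(counts.values())

# Share of each category as a percentage of all categorized papers.
shares = {name: 100 * n / total for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name:<14}{n:>4}{shares[name]:>7.1f}%")
print(f"{'total':<14}{total:>4}")
```

The counts sum to 320, so `ml_general` accounts for a little over a quarter of the categorized papers.
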

The following models have been used and tested for embedding the scraped data. You can use other Hugging Face models as well; however, some models might not be supported by the LangChain Hugging Face module.
- Alibaba-NLP/gte-multilingual-base
- NovaSearch/jasper_en_vision_language_v1
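
To check whether a given Hugging Face model works before running the full pipeline, one option is to load it through `langchain_huggingface` and embed a short string. The sketch below is an assumption about how such a check could look, not the project's own code. `load_embedder` is a hypothetical helper, and `trust_remote_code=True` is passed because the model card for `Alibaba-NLP/gte-multilingual-base` calls for it (only enable it for models you trust). It requires the `langchain-huggingface` package and downloads the model on first use.

```python
DEFAULT_MODEL = "Alibaba-NLP/gte-multilingual-base"


def load_embedder(model_name=DEFAULT_MODEL):
    """Hypothetical helper: build a LangChain embedder for a Hugging Face model."""
    # Imported lazily so this module can be imported without the dependency.
    from langchain_huggingface import HuggingFaceEmbeddings

    return HuggingFaceEmbeddings(
        model_name=model_name,
        # Assumption: the model ships custom code that must be trusted to load.
        model_kwargs={"trust_remote_code": True},
    )
```

Calling `load_embedder().embed_query("some text")` should return a list of floats if the model is supported; an exception at load time is a sign that the model may not work with this pipeline.
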
