
Project Gutenberg Embedding

A web scraper that parses all Ebooks on Project Gutenberg. It also embeds each book's text into a vector store on MongoDB Atlas using Langchain.

How it works

graph TD;
    A[Scrapy]-->B[MongoDB Atlas];
    A-->C[Langchain];
    B-->D[MongoDB Atlas];
    C-->D;

First, the scraper visits the Category Page and collects all the Ebooks and their metadata (such as IDs). Each category page returns 25 Ebooks at a time, so to get all the Ebooks we paginate using the start_index query parameter (e.g., https://www.gutenberg.org/ebooks/bookshelf/57?start_index=26 for the second page).
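The pagination scheme above can be sketched in plain Python. This is an illustrative helper, not code from the repository; the function name and the total-count parameter are assumptions:

```python
PAGE_SIZE = 25  # each Project Gutenberg bookshelf page lists 25 ebooks

def bookshelf_page_urls(bookshelf_url: str, total_books: int):
    """Yield the paginated bookshelf URLs needed to cover `total_books`
    ebooks, 25 at a time, via the start_index query parameter."""
    start_index = 1
    while start_index <= total_books:
        if start_index == 1:
            # The first page is just the bare bookshelf URL.
            yield bookshelf_url
        else:
            yield f"{bookshelf_url}?start_index={start_index}"
        start_index += PAGE_SIZE

# For a bookshelf with 60 ebooks, three requests cover everything:
urls = list(bookshelf_page_urls("https://www.gutenberg.org/ebooks/bookshelf/57", 60))
# urls[1] ends with "?start_index=26", urls[2] with "?start_index=51"
```

In the actual spider, each yielded URL would become a Scrapy request whose callback extracts the 25 book entries and their IDs.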

The Scrapy Pipeline then receives all scraped items and embeds them into MongoDB Atlas using Langchain, with OpenAI providing the embeddings. Each book's text is split into chunks of 1,000 characters, and the text content of each chunk is embedded; this text content is the most important part when analyzing the book.
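The 1,000-character split can be illustrated with plain Python. The real pipeline uses Langchain's text splitter and OpenAI embeddings; this is only a minimal sketch of the chunking step, and the function name is illustrative:

```python
CHUNK_SIZE = 1_000  # characters per chunk, as described above

def split_text(text: str, chunk_size: int = CHUNK_SIZE):
    """Split a book's text into fixed-size character chunks; each chunk's
    text is what gets embedded and stored in the vector store."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# A 2,500-character book yields two full chunks and one 500-character tail.
chunks = split_text("x" * 2_500)
```

Each chunk would then be passed to the embedding model, and the resulting vector stored alongside the book's metadata in MongoDB Atlas.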
