
Project Gutenberg Embedding

A web scraper that parses all Ebooks on Project Gutenberg. It also embeds each book's text into a vector store on MongoDB Atlas using Langchain.

How it works

graph TD;
    A[Scrapy]-->B[MongoDB Atlas];
    A-->C[Langchain];
    B-->D[MongoDB Atlas];
    C-->D;

First, the scraper visits the Category Page and collects all the Ebooks and their metadata (such as IDs). Each category page returns 25 Ebooks at a time, so to get all the Ebooks we paginate using the start_index query parameter (e.g., https://www.gutenberg.org/ebooks/bookshelf/57?start_index=26 for the second page).
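The pagination scheme above can be sketched in plain Python. This is an illustrative helper, not code from the repository; the function name and the total-count parameter are assumptions:

```python
PAGE_SIZE = 25  # each Project Gutenberg bookshelf page lists 25 ebooks

def bookshelf_page_urls(bookshelf_url: str, total_books: int):
    """Yield the paginated bookshelf URLs needed to cover `total_books`
    ebooks, 25 at a time, via the start_index query parameter."""
    start_index = 1
    while start_index <= total_books:
        if start_index == 1:
            # The first page is just the bare bookshelf URL.
            yield bookshelf_url
        else:
            yield f"{bookshelf_url}?start_index={start_index}"
        start_index += PAGE_SIZE

# For a bookshelf with 60 ebooks, three requests cover everything:
urls = list(bookshelf_page_urls("https://www.gutenberg.org/ebooks/bookshelf/57", 60))
# urls[1] ends with "?start_index=26", urls[2] with "?start_index=51"
```

In the actual spider, each yielded URL would become a Scrapy request whose callback extracts the 25 book entries and their IDs.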

The Scrapy Pipeline then receives all scraped items and embeds them into MongoDB Atlas using Langchain, with OpenAI providing the embeddings. Each book's text is split into chunks of 1,000 characters, and the text content of each chunk is embedded; this text content is the most important part when analyzing the book.
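The 1,000-character split can be illustrated with plain Python. The real pipeline uses Langchain's text splitter and OpenAI embeddings; this is only a minimal sketch of the chunking step, and the function name is illustrative:

```python
CHUNK_SIZE = 1_000  # characters per chunk, as described above

def split_text(text: str, chunk_size: int = CHUNK_SIZE):
    """Split a book's text into fixed-size character chunks; each chunk's
    text is what gets embedded and stored in the vector store."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# A 2,500-character book yields two full chunks and one 500-character tail.
chunks = split_text("x" * 2_500)
```

Each chunk would then be passed to the embedding model, and the resulting vector stored alongside the book's metadata in MongoDB Atlas.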
