ScrapeLLM - RAG-Based Web Q&A System

A proof-of-concept system that scrapes websites and allows you to ask questions about the content using local LLMs via Ollama.

Perfect for: Scraping 100+ websites and asking questions across all of them at once!

Powered by: the LangChain.js framework for a production-ready RAG implementation

Features

  • 🌐 Web scraping with Cheerio
  • 🔗 LangChain.js integration for professional RAG implementation
  • 🧠 OllamaEmbeddings - Local embeddings using nomic-embed-text
  • 🤖 Ollama LLM - Local llama3.2:3b for answers
  • 💾 MemoryVectorStore - In-memory vector storage
  • 📝 RecursiveCharacterTextSplitter - Smart text chunking
  • 🔍 RAG-based question answering across multiple sites
  • 🚀 Simple REST API
  • 🎯 Search 100+ scraped sites with a single question

Prerequisites

  1. Node.js (v18 or higher)
  2. Ollama installed and running (https://ollama.com)

Setup

1. Install Ollama Models

ollama pull nomic-embed-text
ollama pull llama3.2:3b

Verify Ollama is running:

curl http://localhost:11434/api/tags

2. Install Dependencies

npm install

3. Start the Server

npm run dev

The server will start on http://localhost:3000

API Testing with curl

1. Check Server Status

curl http://localhost:3000

2. Scrape a URL

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/Artificial_intelligence"}'

Response:

{
  "success": true,
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "title": "Artificial intelligence - Wikipedia",
  "chunkCount": 15,
  "message": "URL scraped and indexed successfully"
}

3. Ask a Question

Default: Search across ALL scraped URLs (recommended when you have scraped many sites):

curl -X POST http://localhost:3000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is artificial intelligence?"}'

Optional: Search only a specific URL (if you scraped 100+ sites but only want to ask about one):

curl -X POST http://localhost:3000/api/ask \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is artificial intelligence?",
    "urlId": "550e8400-e29b-41d4-a716-446655440000"
  }'

Response:

{
  "question": "What is artificial intelligence?",
  "answer": "Artificial intelligence (AI) is the simulation of human intelligence by machines...",
  "sources": [
    {
      "title": "Artificial intelligence - Wikipedia",
      "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
      "chunkIndex": 0,
      "score": 0.87,
      "preview": "Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines..."
    }
  ]
}

4. List All Scraped URLs

curl http://localhost:3000/api/urls

Response:

{
  "count": 1,
  "urls": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
      "title": "Artificial intelligence - Wikipedia",
      "chunkCount": 15,
      "scrapedAt": "2024-01-01T12:00:00.000Z"
    }
  ]
}

5. Get System Stats

curl http://localhost:3000/api/stats

Response:

{
  "totalUrls": 1,
  "totalVectors": 15,
  "urls": [...]
}

Example Workflow: Scrape 100+ Sites

# 1. Scrape multiple URLs
curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/TypeScript"}'

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/JavaScript"}'

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/Python_(programming_language)"}'

# ... scrape 100+ more sites ...

# 2. Ask questions across ALL scraped sites (no urlId needed!)
curl -X POST http://localhost:3000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What programming languages are mentioned?"}'

# 3. Ask another question - searches all 100+ sites
curl -X POST http://localhost:3000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the difference between TypeScript and JavaScript?"}'

# 4. View all scraped URLs (should show 100+)
curl http://localhost:3000/api/urls

# 5. Get stats
curl http://localhost:3000/api/stats
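
The same workflow as a small script is handy when seeding many URLs. This is a sketch against the endpoints documented below; it assumes Node.js 18+ (for the built-in fetch) and a server running on localhost:3000.

// scrape-many.ts - seed the index with a list of URLs, then ask a question
const BASE = "http://localhost:3000";

const urls = [
  "https://en.wikipedia.org/wiki/TypeScript",
  "https://en.wikipedia.org/wiki/JavaScript",
  "https://en.wikipedia.org/wiki/Python_(programming_language)",
  // ...add the rest of your 100+ URLs here
];

async function post(path: string, body: unknown) {
  const res = await fetch(`${BASE}${path}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`${path} failed: ${res.status}`);
  return res.json();
}

async function main() {
  // 1. Scrape sequentially to avoid overloading the local embedding model
  for (const url of urls) {
    const result = await post("/api/scrape", { url });
    console.log(`Indexed ${url} (${result.chunkCount} chunks)`);
  }

  // 2. Ask a question across everything that was scraped (no urlId)
  const answer = await post("/api/ask", {
    question: "What is the difference between TypeScript and JavaScript?",
  });
  console.log(answer.answer);
  console.log("Sources:", answer.sources.map((s: any) => s.url));
}

main().catch(console.error);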

API Reference

POST /api/scrape

Scrape a URL and index its content.

Request Body:

{
  "url": "https://example.com"
}

Response:

{
  "success": true,
  "id": "uuid",
  "url": "https://example.com",
  "title": "Page Title",
  "chunkCount": 10,
  "message": "URL scraped and indexed successfully"
}

POST /api/ask

Ask a question about scraped content. By default, searches across ALL scraped URLs.

Request Body:

{
  "question": "What is this page about?",
  "urlId": "uuid",  // OPTIONAL - only use if you want to search a specific URL
  "topK": 5         // OPTIONAL - number of chunks to retrieve (default: 5)
}

Typical usage (search all 100+ sites):

{
  "question": "What is this page about?"
}

Response:

{
  "question": "What is this page about?",
  "answer": "The page is about...",
  "sources": [
    {
      "title": "Page Title",
      "url": "https://example.com",
      "chunkIndex": 0,
      "score": 0.85,
      "preview": "Text preview..."
    }
  ]
}
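
For typed clients, the request and response shapes above can be captured as TypeScript interfaces. These are inferred from the examples in this README rather than exported by the project:

// Shapes inferred from the /api/ask examples above
interface AskRequest {
  question: string;
  urlId?: string; // optional: restrict the search to one scraped URL
  topK?: number;  // optional: number of chunks to retrieve (default 5)
}

interface AskSource {
  title: string;
  url: string;
  chunkIndex: number;
  score: number;
  preview: string;
}

interface AskResponse {
  question: string;
  answer: string;
  sources: AskSource[];
}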

GET /api/urls

List all scraped URLs.

Response:

{
  "count": 1,
  "urls": [...]
}

GET /api/stats

Get system statistics.

Response:

{
  "totalUrls": 1,
  "totalVectors": 15,
  "urls": [...]
}

Configuration

Edit .env to configure:

PORT=3000
OLLAMA_HOST=http://localhost:11434
NODE_ENV=development
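
A minimal sketch of loading these values in the server, assuming dotenv is used (the key names match the .env file above):

// config.ts - load .env and expose the settings used elsewhere
import "dotenv/config";

export const PORT = Number(process.env.PORT ?? 3000);
export const OLLAMA_HOST = process.env.OLLAMA_HOST ?? "http://localhost:11434";
export const NODE_ENV = process.env.NODE_ENV ?? "development";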

How It Works (LangChain Implementation)

  1. Scraping: Extracts text content from web pages using Cheerio
  2. Chunking: RecursiveCharacterTextSplitter splits text intelligently at natural boundaries (paragraphs, sentences)
  3. Embedding: OllamaEmbeddings creates vector embeddings using nomic-embed-text model
  4. Storage: MemoryVectorStore stores vectors in-memory with similarity search
  5. Retrieval: similaritySearchWithScore finds relevant chunks across ALL scraped sites
  6. Generation: Ollama LLM with PromptTemplate and RunnableSequence generates answers
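
A condensed sketch of that pipeline in LangChain.js. This is illustrative, not the project's actual source: the model names come from this README, while the function names, chunk size, and overlap are assumptions.

// rag.ts - scrape → chunk → embed → store → retrieve → answer (illustrative)
import * as cheerio from "cheerio";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OllamaEmbeddings, Ollama } from "@langchain/ollama";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { PromptTemplate } from "@langchain/core/prompts";
import { RunnableSequence } from "@langchain/core/runnables";
import { StringOutputParser } from "@langchain/core/output_parsers";

const embeddings = new OllamaEmbeddings({ model: "nomic-embed-text" });
const llm = new Ollama({ model: "llama3.2:3b" });
const store = new MemoryVectorStore(embeddings);

// Steps 1-4: scrape a page, chunk it, embed the chunks, add them to the store
export async function scrape(url: string) {
  const html = await (await fetch(url)).text();
  const $ = cheerio.load(html);
  const text = $("body").text();                            // 1. extract text
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,                                        // assumed values
    chunkOverlap: 200,
  });
  const docs = await splitter.createDocuments([text], [{ url }]); // 2. chunk
  await store.addDocuments(docs);                           // 3-4. embed + store
}

// Steps 5-6: retrieve the most relevant chunks and generate an answer
export async function ask(question: string, topK = 5) {
  const hits = await store.similaritySearchWithScore(question, topK);
  const context = hits.map(([doc]) => doc.pageContent).join("\n\n");
  const prompt = PromptTemplate.fromTemplate(
    "Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
  );
  const chain = RunnableSequence.from([prompt, llm, new StringOutputParser()]);
  return chain.invoke({ context, question });
}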

What LangChain Gives You

Without LangChain (vanilla implementation):

  • Manual embedding API calls
  • Custom cosine similarity implementation (a sketch of this follows the comparison below)
  • Basic text splitting by character count
  • Manual prompt construction
  • More code to maintain

With LangChain (current implementation):

  • OllamaEmbeddings - Handles embedding API calls and batching
  • MemoryVectorStore - Built-in similarity search with scoring
  • RecursiveCharacterTextSplitter - Smart chunking at natural text boundaries
  • PromptTemplate - Reusable, maintainable prompts
  • RunnableSequence - Composable chains (prompt → LLM → parser)
  • ✅ Easy to swap components (e.g., switch to different vector store or LLM)
  • ✅ Production-ready abstractions used by thousands of developers
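
For comparison, the hand-rolled cosine similarity a vanilla implementation would need looks roughly like this (MemoryVectorStore performs the equivalent scoring internally):

// Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|)
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}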

Multi-Site Search

When you ask a question without a urlId:

  • The system searches through all chunks from all 100+ scraped sites
  • Uses cosine similarity to find the top 5 most relevant chunks (configurable via topK)
  • The LLM gets context from multiple sites and synthesizes an answer
  • The response includes which sites the answer came from

Example: If you scraped 100 programming blogs and ask "What is React?", it will:

  1. Search all chunks from all 100 sites (likely 1,000+ of them)
  2. Find the 5 most relevant chunks (possibly from different sites)
  3. Feed those 5 chunks to the LLM, which reads them and answers your question
  4. Return a response showing which sites were used as sources
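
In LangChain.js terms, the all-sites vs. single-URL distinction is just an optional filter on the same search call. A sketch, reusing the store from the pipeline sketch above and assuming each chunk's metadata carries a urlId field:

// Search every chunk from every site, or restrict to one scraped URL
import type { Document } from "@langchain/core/documents";

async function retrieve(question: string, topK = 5, urlId?: string) {
  const filter = urlId
    ? (doc: Document) => doc.metadata.urlId === urlId
    : undefined;
  // Scores all (matching) chunks by cosine similarity and returns the
  // topK best, regardless of which site they came from.
  return store.similaritySearchWithScore(question, topK, filter);
}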

Limitations (POC)

  • ⚠️ In-memory only: All data is lost when server restarts
  • ⚠️ Static pages only: JavaScript-heavy sites may not scrape well
  • ⚠️ No authentication: Anyone can access the API
  • ⚠️ Single-threaded: No background job processing

Troubleshooting

Ollama not running

# Start Ollama
ollama serve

# Check if models are installed
ollama list

Port already in use

# Change PORT in .env file
PORT=3001

Scraping fails

  • Check if the URL is accessible
  • Some sites block scrapers (try different URLs)
  • Try a simpler page like Wikipedia

Next Steps

To make this production-ready:

  • Add persistent storage (PostgreSQL + Pinecone/Qdrant)
  • Add user authentication
  • Implement background job processing
  • Add Puppeteer for JavaScript-heavy sites
  • Add rate limiting
  • Add comprehensive error handling
  • Add tests

License

MIT
