A proof-of-concept system that scrapes websites and allows you to ask questions about the content using local LLMs via Ollama.
Perfect for: Scraping 100+ websites and asking questions across all of them at once!
Powered by: the LangChain.js framework, for a production-ready RAG implementation
## Features

- 🌐 Web scraping with Cheerio
- 🔗 LangChain.js integration for a professional RAG implementation
- 🧠 `OllamaEmbeddings` - Local embeddings using nomic-embed-text
- 🤖 `OllamaLLM` - Local llama3.2:3b for answers
- 💾 `MemoryVectorStore` - In-memory vector storage
- 📝 `RecursiveCharacterTextSplitter` - Smart text chunking
- 🔍 RAG-based question answering across multiple sites
- 🚀 Simple REST API
- 🎯 Search 100+ scraped sites with a single question
## Prerequisites

- Node.js (v18 or higher)
- Ollama installed and running (https://ollama.com)
## Setup

Pull the required models:

```bash
ollama pull nomic-embed-text
ollama pull llama3.2:3b
```

Verify Ollama is running:

```bash
curl http://localhost:11434/api/tags
```

Install dependencies and start the dev server:

```bash
npm install
npm run dev
```

The server will start on http://localhost:3000.
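If you'd rather script the API than use curl, here is a minimal client sketch using Node's built-in `fetch` (Node 18+). The file name, `post` helper, and error handling are illustrative, not part of this repo:

```ts
// ask-demo.ts - hypothetical client script, not part of the repo.
// Run with: npx tsx ask-demo.ts
const BASE = "http://localhost:3000";

async function post(path: string, body: unknown) {
  const res = await fetch(`${BASE}${path}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`${path} failed: ${res.status}`);
  return res.json();
}

async function main() {
  // Index a page, then ask a question across everything scraped so far.
  await post("/api/scrape", {
    url: "https://en.wikipedia.org/wiki/Artificial_intelligence",
  });
  const { answer, sources } = await post("/api/ask", {
    question: "What is artificial intelligence?",
  });
  console.log(answer);
  console.log(sources.map((s: { url: string }) => s.url));
}

main().catch(console.error);
```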
## Usage

### Health check

```bash
curl http://localhost:3000
```

### Scrape a URL

```bash
curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/Artificial_intelligence"}'
```

Response:

```json
{
"success": true,
"id": "550e8400-e29b-41d4-a716-446655440000",
"url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
"title": "Artificial intelligence - Wikipedia",
"chunkCount": 15,
"message": "URL scraped and indexed successfully"
}
```

### Ask a question

Default: search across ALL scraped URLs (recommended for your use case):

```bash
curl -X POST http://localhost:3000/api/ask \
-H "Content-Type: application/json" \
-d '{"question": "What is artificial intelligence?"}'Optional: Search only a specific URL (if you scraped 100+ sites but only want to ask about one):
curl -X POST http://localhost:3000/api/ask \
-H "Content-Type: application/json" \
-d '{
"question": "What is artificial intelligence?",
"urlId": "550e8400-e29b-41d4-a716-446655440000"
}'
```

Response:

```json
{
"question": "What is artificial intelligence?",
"answer": "Artificial intelligence (AI) is the simulation of human intelligence by machines...",
"sources": [
{
"title": "Artificial intelligence - Wikipedia",
"url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
"chunkIndex": 0,
"score": 0.87,
"preview": "Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines..."
}
]
}
```

### List scraped URLs

```bash
curl http://localhost:3000/api/urls
```

Response:

```json
{
"count": 1,
"urls": [
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
"title": "Artificial intelligence - Wikipedia",
"chunkCount": 15,
"scrapedAt": "2024-01-01T12:00:00.000Z"
}
]
}
```

### Get stats

```bash
curl http://localhost:3000/api/stats
```

Response:

```json
{
"totalUrls": 1,
"totalVectors": 15,
"urls": [...]
}
```

### Example workflow: scrape many sites, ask across all of them

```bash
# 1. Scrape multiple URLs
curl -X POST http://localhost:3000/api/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://en.wikipedia.org/wiki/TypeScript"}'
curl -X POST http://localhost:3000/api/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://en.wikipedia.org/wiki/JavaScript"}'
curl -X POST http://localhost:3000/api/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://en.wikipedia.org/wiki/Python_(programming_language)"}'
# ... scrape 100+ more sites ...
# 2. Ask questions across ALL scraped sites (no urlId needed!)
curl -X POST http://localhost:3000/api/ask \
-H "Content-Type: application/json" \
-d '{"question": "What programming languages are mentioned?"}'
# 3. Ask another question - searches all 100+ sites
curl -X POST http://localhost:3000/api/ask \
-H "Content-Type: application/json" \
-d '{"question": "What is the difference between TypeScript and JavaScript?"}'
# 4. View all scraped URLs (should show 100+)
curl http://localhost:3000/api/urls
# 5. Get stats
curl http://localhost:3000/api/stats
```

## API Reference

### POST /api/scrape

Scrape a URL and index its content.
Request Body:

```json
{
"url": "https://example.com"
}
```

Response:

```json
{
"success": true,
"id": "uuid",
"url": "https://example.com",
"title": "Page Title",
"chunkCount": 10,
"message": "URL scraped and indexed successfully"
}
```

### POST /api/ask

Ask a question about scraped content. By default, it searches across ALL scraped URLs.
Request Body:

```json
{
"question": "What is this page about?",
"urlId": "uuid", // OPTIONAL - only use if you want to search a specific URL
"topK": 5 // OPTIONAL - number of chunks to retrieve (default: 5)
}
```

Typical usage (search all 100+ sites):

```json
{
"question": "What is this page about?"
}
```

Response:

```json
{
"question": "What is this page about?",
"answer": "The page is about...",
"sources": [
{
"title": "Page Title",
"url": "https://example.com",
"chunkIndex": 0,
"score": 0.85,
"preview": "Text preview..."
}
]
}
```

### GET /api/urls

List all scraped URLs.
Response:

```json
{
"count": 1,
"urls": [...]
}
```

### GET /api/stats

Get system statistics.
Response:

```json
{
"totalUrls": 1,
"totalVectors": 15,
"urls": [...]
}
```

## Configuration

Edit `.env` to configure:

```env
PORT=3000
OLLAMA_HOST=http://localhost:11434
NODE_ENV=development
```

## How It Works

- Scraping: Extracts text content from web pages using Cheerio
- Chunking: `RecursiveCharacterTextSplitter` splits text intelligently at natural boundaries (paragraphs, sentences)
- Embedding: `OllamaEmbeddings` creates vector embeddings using the nomic-embed-text model
- Storage: `MemoryVectorStore` stores vectors in-memory with similarity search
- Retrieval: `similaritySearchWithScore` finds relevant chunks across ALL scraped sites
- Generation: `OllamaLLM` with `PromptTemplate` and `RunnableSequence` generates answers
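To make the pipeline concrete, here is a condensed sketch of the scrape-and-index path using the components named above. The package entry points (`@langchain/ollama`, `@langchain/textsplitters`) and option values are assumptions based on current LangChain.js conventions; the project's actual source may differ:

```ts
import * as cheerio from "cheerio";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OllamaEmbeddings } from "@langchain/ollama";
import { MemoryVectorStore } from "langchain/vectorstores/memory";

export async function scrapeAndIndex(url: string) {
  // 1. Scrape: fetch the page and reduce it to visible text.
  const html = await (await fetch(url)).text();
  const $ = cheerio.load(html);
  $("script, style, nav, footer").remove();
  const text = $("body").text().replace(/\s+/g, " ").trim();

  // 2. Chunk: split at natural boundaries (sizes here are illustrative).
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
  });
  const docs = await splitter.createDocuments(
    [text],
    [{ url, title: $("title").text() }]
  );

  // 3 + 4. Embed and store: the vector store embeds each chunk via Ollama.
  return MemoryVectorStore.fromDocuments(
    docs,
    new OllamaEmbeddings({ model: "nomic-embed-text" })
  );
}
```

A real implementation would keep one shared store and tag each chunk's metadata with the scrape id, so later searches can be filtered per URL.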
## Why LangChain?

Without LangChain (vanilla implementation):
- Manual embedding API calls
- Custom cosine similarity implementation
- Basic text splitting by character count
- Manual prompt construction
- More code to maintain
With LangChain (current implementation):
- ✅ `OllamaEmbeddings` - Handles embedding API calls and batching
- ✅ `MemoryVectorStore` - Built-in similarity search with scoring
- ✅ `RecursiveCharacterTextSplitter` - Smart chunking at natural text boundaries
- ✅ `PromptTemplate` - Reusable, maintainable prompts
- ✅ `RunnableSequence` - Composable chains (prompt → LLM → parser)
- ✅ Easy to swap components (e.g., switch to a different vector store or LLM)
- ✅ Production-ready abstractions used by thousands of developers
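As an illustration of the prompt → LLM → parser composition above, a minimal chain could look like the following sketch. The prompt wording is invented, and the `Ollama` class from `@langchain/ollama` is assumed to be what this README calls `OllamaLLM`:

```ts
import { Ollama } from "@langchain/ollama";
import { PromptTemplate } from "@langchain/core/prompts";
import { RunnableSequence } from "@langchain/core/runnables";
import { StringOutputParser } from "@langchain/core/output_parsers";

// A reusable prompt; the wording here is invented for illustration.
const prompt = PromptTemplate.fromTemplate(
  "Answer the question using only the context below.\n\n" +
    "Context:\n{context}\n\nQuestion: {question}"
);

// prompt → LLM → parser, composed as one invocable chain.
const chain = RunnableSequence.from([
  prompt,
  new Ollama({ model: "llama3.2:3b" }),
  new StringOutputParser(),
]);

const answer = await chain.invoke({
  context: "...chunks retrieved from the vector store...",
  question: "What is artificial intelligence?",
});
console.log(answer);
```

Because each stage is a Runnable, swapping llama3.2:3b for another model is a one-line change.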
## How Multi-Site Search Works

When you ask a question without a `urlId`:
- The system searches through all chunks from all 100+ scraped sites
- Uses cosine similarity to find the top 5 most relevant chunks (configurable via `topK`)
- The LLM gets context from multiple sites and synthesizes an answer
- The response includes which sites the answer came from
Example: If you scraped 100 programming blogs and ask "What is React?", it will:
- Search all ~1000+ chunks from all 100 sites
- Find the 5 most relevant chunks (might be from different sites)
- LLM reads those 5 chunks and answers your question
- Response shows which sites were used as sources
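In code, that retrieval step might look like the sketch below, continuing from the store built in the indexing sketch above. The `urlId` metadata field and the `retrieve` helper are assumptions for illustration:

```ts
import type { Document } from "@langchain/core/documents";
import { MemoryVectorStore } from "langchain/vectorstores/memory";

// `store` is the MemoryVectorStore built in the indexing sketch above.
async function retrieve(
  store: MemoryVectorStore,
  question: string,
  topK = 5,
  urlId?: string
) {
  // An optional predicate narrows the search to one scraped site;
  // without it, every chunk from every site is a candidate.
  const filter = urlId
    ? (doc: Document) => doc.metadata.urlId === urlId
    : undefined;

  // Returns [Document, score] pairs ranked by similarity.
  const hits = await store.similaritySearchWithScore(question, topK, filter);

  for (const [doc, score] of hits) {
    console.log(score.toFixed(2), doc.metadata.url);
  }
  return hits.map(([doc]) => doc);
}
```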
## Limitations

- ⚠️ In-memory only: All data is lost when the server restarts
- ⚠️ Static pages: JavaScript-heavy sites may not scrape well
- ⚠️ No authentication: Anyone can access the API
- ⚠️ Single-threaded: No background job processing
## Troubleshooting

Ollama not responding:

```bash
# Start Ollama
ollama serve

# Check if models are installed
ollama list
```

Port already in use:

```bash
# Change PORT in the .env file
PORT=3001
```

Scraping fails:

- Check if the URL is accessible
- Some sites block scrapers (try different URLs)
- Try a simpler page, like Wikipedia
## Next Steps

To make this production-ready:
- Add persistent storage (PostgreSQL + Pinecone/Qdrant)
- Add user authentication
- Implement background job processing
- Add Puppeteer for JavaScript-heavy sites
- Add rate limiting (see the sketch after this list)
- Add comprehensive error handling
- Add tests
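As a sketch of one of these items, rate limiting could be added with express-rate-limit, assuming the REST API is served by Express (which this README does not confirm); the window and cap are placeholders to tune:

```ts
import express from "express";
import rateLimit from "express-rate-limit";

const app = express();
app.use(express.json());

// Cap each client IP at 100 requests per 15-minute window.
app.use(
  rateLimit({
    windowMs: 15 * 60 * 1000,
    max: 100,
  })
);

// ...existing /api/scrape, /api/ask, /api/urls, /api/stats routes...

app.listen(3000);
```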
## License

MIT