darwin808/site2chat
ScrapeLLM - RAG-Based Web Q&A System

A proof-of-concept system that scrapes websites and allows you to ask questions about the content using local LLMs via Ollama.

Perfect for: Scraping 100+ websites and asking questions across all of them at once!

Powered by: LangChain.js framework for production-ready RAG implementation

Features

  • 🌐 Web scraping with Cheerio
  • 🔗 LangChain.js integration for professional RAG implementation
  • 🧠 OllamaEmbeddings - Local embeddings using nomic-embed-text
  • 🤖 Ollama LLM - Local llama3.2:3b for answers
  • 💾 MemoryVectorStore - In-memory vector storage
  • 📝 RecursiveCharacterTextSplitter - Smart text chunking
  • 🔍 RAG-based question answering across multiple sites
  • 🚀 Simple REST API
  • 🎯 Search 100+ scraped sites with a single question

Prerequisites

  1. Node.js (v18 or higher)
  2. Ollama installed and running (https://ollama.com)

Setup

1. Install Ollama Models

ollama pull nomic-embed-text
ollama pull llama3.2:3b

Verify Ollama is running:

curl http://localhost:11434/api/tags

2. Install Dependencies

npm install

3. Start the Server

npm run dev

The server will start on http://localhost:3000

API Testing with curl

1. Check Server Status

curl http://localhost:3000

2. Scrape a URL

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/Artificial_intelligence"}'

Response:

{
  "success": true,
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "title": "Artificial intelligence - Wikipedia",
  "chunkCount": 15,
  "message": "URL scraped and indexed successfully"
}

3. Ask a Question

Default: Search across ALL scraped URLs (recommended when you've scraped many sites):

curl -X POST http://localhost:3000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is artificial intelligence?"}'

Optional: Search only a specific URL (if you scraped 100+ sites but only want to ask about one):

curl -X POST http://localhost:3000/api/ask \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is artificial intelligence?",
    "urlId": "550e8400-e29b-41d4-a716-446655440000"
  }'

Response:

{
  "question": "What is artificial intelligence?",
  "answer": "Artificial intelligence (AI) is the simulation of human intelligence by machines...",
  "sources": [
    {
      "title": "Artificial intelligence - Wikipedia",
      "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
      "chunkIndex": 0,
      "score": 0.87,
      "preview": "Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines..."
    }
  ]
}

4. List All Scraped URLs

curl http://localhost:3000/api/urls

Response:

{
  "count": 1,
  "urls": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
      "title": "Artificial intelligence - Wikipedia",
      "chunkCount": 15,
      "scrapedAt": "2024-01-01T12:00:00.000Z"
    }
  ]
}

5. Get System Stats

curl http://localhost:3000/api/stats

Response:

{
  "totalUrls": 1,
  "totalVectors": 15,
  "urls": [...]
}

Example Workflow: Scrape 100+ Sites

# 1. Scrape multiple URLs
curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/TypeScript"}'

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/JavaScript"}'

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/Python_(programming_language)"}'

# ... scrape 100+ more sites ...

# 2. Ask questions across ALL scraped sites (no urlId needed!)
curl -X POST http://localhost:3000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What programming languages are mentioned?"}'

# 3. Ask another question - searches all 100+ sites
curl -X POST http://localhost:3000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the difference between TypeScript and JavaScript?"}'

# 4. View all scraped URLs (should show 100+)
curl http://localhost:3000/api/urls

# 5. Get stats
curl http://localhost:3000/api/stats
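The same workflow can be scripted instead of typed out as curl commands. Below is a minimal TypeScript sketch (Node 18+, using the global `fetch`); the `scrapeBody`, `scrapeAll`, and `ask` helpers are illustrative names, not part of this project.

```typescript
// Sketch: batch-scrape a list of URLs, then ask one question across all of
// them. Assumes the server from this repo is running on localhost:3000.
// Error handling is minimal on purpose.

const BASE = "http://localhost:3000";

// Pure helper: build the JSON body for a scrape request.
export function scrapeBody(url: string): string {
  return JSON.stringify({ url });
}

// POST each URL to /api/scrape and log how many chunks were indexed.
export async function scrapeAll(urls: string[]): Promise<void> {
  for (const url of urls) {
    const res = await fetch(`${BASE}/api/scrape`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: scrapeBody(url),
    });
    const data = await res.json();
    console.log(`${url}: ${data.chunkCount} chunks indexed`);
  }
}

// Ask one question across everything that has been scraped (no urlId).
export async function ask(question: string): Promise<string> {
  const res = await fetch(`${BASE}/api/ask`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ question }),
  });
  const data = await res.json();
  return data.answer;
}
```

Scraping sequentially (as above) is the safe default: each scrape triggers embedding calls to Ollama, so firing 100+ requests in parallel can overwhelm a local model server.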

API Reference

POST /api/scrape

Scrape a URL and index its content.

Request Body:

{
  "url": "https://example.com"
}

Response:

{
  "success": true,
  "id": "uuid",
  "url": "https://example.com",
  "title": "Page Title",
  "chunkCount": 10,
  "message": "URL scraped and indexed successfully"
}

POST /api/ask

Ask a question about scraped content. By default, searches across ALL scraped URLs.

Request Body:

{
  "question": "What is this page about?",
  "urlId": "uuid",  // OPTIONAL - only use if you want to search a specific URL
  "topK": 5         // OPTIONAL - number of chunks to retrieve (default: 5)
}

Typical usage (search all 100+ sites):

{
  "question": "What is this page about?"
}

Response:

{
  "question": "What is this page about?",
  "answer": "The page is about...",
  "sources": [
    {
      "title": "Page Title",
      "url": "https://example.com",
      "chunkIndex": 0,
      "score": 0.85,
      "preview": "Text preview..."
    }
  ]
}
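For TypeScript clients, the request and response shapes above can be captured as types. These interfaces are inferred from the JSON examples in this README; the project itself does not export them.

```typescript
// Types inferred from the /api/ask examples in this README; treat them as a
// client-side convenience, not an official contract.

export interface AskRequest {
  question: string;
  urlId?: string; // optional: restrict search to one scraped URL
  topK?: number;  // optional: number of chunks to retrieve (default 5)
}

export interface AskSource {
  title: string;
  url: string;
  chunkIndex: number;
  score: number;
  preview: string;
}

export interface AskResponse {
  question: string;
  answer: string;
  sources: AskSource[];
}
```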

GET /api/urls

List all scraped URLs.

Response:

{
  "count": 1,
  "urls": [...]
}

GET /api/stats

Get system statistics.

Response:

{
  "totalUrls": 1,
  "totalVectors": 15,
  "urls": [...]
}

Configuration

Edit .env to configure:

PORT=3000
OLLAMA_HOST=http://localhost:11434
NODE_ENV=development

How It Works (LangChain Implementation)

  1. Scraping: Extracts text content from web pages using Cheerio
  2. Chunking: RecursiveCharacterTextSplitter splits text intelligently at natural boundaries (paragraphs, sentences)
  3. Embedding: OllamaEmbeddings creates vector embeddings using nomic-embed-text model
  4. Storage: MemoryVectorStore stores vectors in-memory with similarity search
  5. Retrieval: similaritySearchWithScore finds relevant chunks across ALL scraped sites
  6. Generation: Ollama LLM with PromptTemplate and RunnableSequence generates answers
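As a dependency-free illustration of step 2, here is a sketch of the idea behind recursive chunking: try the coarsest separator first (paragraph breaks), and recurse into finer separators only for pieces that are still too long. This mirrors the concept behind LangChain's RecursiveCharacterTextSplitter, not its exact algorithm (no chunk overlap, and no merging of small pieces back together).

```typescript
// Sketch of recursive chunking: split on the coarsest separator that yields
// pieces under maxLen, recursing into oversized pieces with finer separators.
// Unlike the real RecursiveCharacterTextSplitter, this does not re-merge
// small adjacent pieces or add overlap between chunks.

const SEPARATORS = ["\n\n", "\n", " "];

export function splitRecursive(text: string, maxLen: number, sepIndex = 0): string[] {
  if (text.length <= maxLen) return [text];
  if (sepIndex >= SEPARATORS.length) {
    // No separators left: hard-cut by length as a last resort.
    const out: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) out.push(text.slice(i, i + maxLen));
    return out;
  }
  const pieces = text.split(SEPARATORS[sepIndex]).filter((p) => p.length > 0);
  return pieces.flatMap((p) => splitRecursive(p, maxLen, sepIndex + 1));
}
```

The payoff of this strategy is that chunks tend to end at natural boundaries (a paragraph or sentence), so each embedded chunk carries coherent meaning instead of cutting mid-word.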

What LangChain Gives You

Without LangChain (vanilla implementation):

  • Manual embedding API calls
  • Custom cosine similarity implementation
  • Basic text splitting by character count
  • Manual prompt construction
  • More code to maintain

With LangChain (current implementation):

  • OllamaEmbeddings - Handles embedding API calls and batching
  • MemoryVectorStore - Built-in similarity search with scoring
  • RecursiveCharacterTextSplitter - Smart chunking at natural text boundaries
  • PromptTemplate - Reusable, maintainable prompts
  • RunnableSequence - Composable chains (prompt → LLM → parser)
  • ✅ Easy to swap components (e.g., switch to different vector store or LLM)
  • ✅ Production-ready abstractions used by thousands of developers

Multi-Site Search

When you ask a question without a urlId:

  • The system searches through all chunks from all 100+ scraped sites
  • Uses cosine similarity to find the top 5 most relevant chunks (configurable via topK)
  • The LLM gets context from multiple sites and synthesizes an answer
  • The response includes which sites the answer came from

Example: If you scraped 100 programming blogs and ask "What is React?", it will:

  1. Search all ~1000+ chunks from all 100 sites
  2. Find the 5 most relevant chunks (might be from different sites)
  3. LLM reads those 5 chunks and answers your question
  4. Response shows which sites were used as sources
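The ranking at the heart of steps 1 and 2 can be sketched without any dependencies. `cosineSimilarity` and `topK` below are hypothetical helper names, and the vectors are tiny mocks; the real system ranks 768-dimensional nomic-embed-text embeddings via MemoryVectorStore, but the math is the same.

```typescript
// Sketch of the retrieval step: rank chunk embeddings against a question
// embedding by cosine similarity and keep the top k. Vectors here are tiny
// mock embeddings standing in for real nomic-embed-text output.

export function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

interface Chunk { text: string; embedding: number[]; }

// Sort a copy of the chunks by similarity to the query, highest first.
export function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return [...chunks]
    .sort((x, y) =>
      cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding))
    .slice(0, k);
}
```

Because the ranking is global over every stored vector, chunks from different sites compete on equal footing, which is what lets a single question pull context from several of the 100+ scraped pages at once.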

Limitations (POC)

  • ⚠️ In-memory only: All data is lost when server restarts
  • ⚠️ Static pages: JavaScript-heavy sites may not scrape well
  • ⚠️ No authentication: Anyone can access the API
  • ⚠️ Single-threaded: No background job processing

Troubleshooting

Ollama not running

# Start Ollama
ollama serve

# Check if models are installed
ollama list

Port already in use

# Change PORT in .env file
PORT=3001

Scraping fails

  • Check if the URL is accessible
  • Some sites block scrapers (try different URLs)
  • Try a simpler page like Wikipedia

Next Steps

To make this production-ready:

  • Add persistent storage (PostgreSQL + Pinecone/Qdrant)
  • Add user authentication
  • Implement background job processing
  • Add Puppeteer for JavaScript-heavy sites
  • Add rate limiting
  • Add comprehensive error handling
  • Add tests

License

MIT
