darwin808/site2chat
ScrapeLLM - RAG-Based Web Q&A System

A proof-of-concept system that scrapes websites and allows you to ask questions about the content using local LLMs via Ollama.

Perfect for: Scraping 100+ websites and asking questions across all of them at once!

Powered by: LangChain.js framework for production-ready RAG implementation

Features

  • 🌐 Web scraping with Cheerio
  • 🔗 LangChain.js integration for professional RAG implementation
  • 🧠 OllamaEmbeddings - Local embeddings using nomic-embed-text
  • 🤖 Ollama LLM - Local llama3.2:3b for answers
  • 💾 MemoryVectorStore - In-memory vector storage
  • 📝 RecursiveCharacterTextSplitter - Smart text chunking
  • 🔍 RAG-based question answering across multiple sites
  • 🚀 Simple REST API
  • 🎯 Search 100+ scraped sites with a single question

Prerequisites

  1. Node.js (v18 or higher)
  2. Ollama installed and running (https://ollama.com)

Setup

1. Install Ollama Models

ollama pull nomic-embed-text
ollama pull llama3.2:3b

Verify Ollama is running:

curl http://localhost:11434/api/tags

2. Install Dependencies

npm install

3. Start the Server

npm run dev

The server will start on http://localhost:3000

API Testing with curl

1. Check Server Status

curl http://localhost:3000

2. Scrape a URL

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/Artificial_intelligence"}'

Response:

{
  "success": true,
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "title": "Artificial intelligence - Wikipedia",
  "chunkCount": 15,
  "message": "URL scraped and indexed successfully"
}

3. Ask a Question

Default: Search across ALL scraped URLs (recommended when you've scraped many sites):

curl -X POST http://localhost:3000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is artificial intelligence?"}'

Optional: Search only a specific URL (if you scraped 100+ sites but only want to ask about one):

curl -X POST http://localhost:3000/api/ask \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is artificial intelligence?",
    "urlId": "550e8400-e29b-41d4-a716-446655440000"
  }'

Response:

{
  "question": "What is artificial intelligence?",
  "answer": "Artificial intelligence (AI) is the simulation of human intelligence by machines...",
  "sources": [
    {
      "title": "Artificial intelligence - Wikipedia",
      "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
      "chunkIndex": 0,
      "score": 0.87,
      "preview": "Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines..."
    }
  ]
}

4. List All Scraped URLs

curl http://localhost:3000/api/urls

Response:

{
  "count": 1,
  "urls": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
      "title": "Artificial intelligence - Wikipedia",
      "chunkCount": 15,
      "scrapedAt": "2024-01-01T12:00:00.000Z"
    }
  ]
}

5. Get System Stats

curl http://localhost:3000/api/stats

Response:

{
  "totalUrls": 1,
  "totalVectors": 15,
  "urls": [...]
}

Example Workflow: Scrape 100+ Sites

# 1. Scrape multiple URLs
curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/TypeScript"}'

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/JavaScript"}'

curl -X POST http://localhost:3000/api/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/Python_(programming_language)"}'

# ... scrape 100+ more sites ...

# 2. Ask questions across ALL scraped sites (no urlId needed!)
curl -X POST http://localhost:3000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What programming languages are mentioned?"}'

# 3. Ask another question - searches all 100+ sites
curl -X POST http://localhost:3000/api/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the difference between TypeScript and JavaScript?"}'

# 4. View all scraped URLs (should show 100+)
curl http://localhost:3000/api/urls

# 5. Get stats
curl http://localhost:3000/api/stats
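The same workflow can be scripted instead of typed out as curl commands. Below is a minimal TypeScript sketch (Node 18+, using the global `fetch`); the `scrapeBody`, `scrapeAll`, and `ask` helpers are illustrative names, not part of this project.

```typescript
// Sketch: batch-scrape a list of URLs, then ask one question across all of
// them. Assumes the server from this repo is running on localhost:3000.
// Error handling is minimal on purpose.

const BASE = "http://localhost:3000";

// Pure helper: build the JSON body for a scrape request.
export function scrapeBody(url: string): string {
  return JSON.stringify({ url });
}

// POST each URL to /api/scrape and log how many chunks were indexed.
export async function scrapeAll(urls: string[]): Promise<void> {
  for (const url of urls) {
    const res = await fetch(`${BASE}/api/scrape`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: scrapeBody(url),
    });
    const data = await res.json();
    console.log(`${url}: ${data.chunkCount} chunks indexed`);
  }
}

// Ask one question across everything that has been scraped (no urlId).
export async function ask(question: string): Promise<string> {
  const res = await fetch(`${BASE}/api/ask`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ question }),
  });
  const data = await res.json();
  return data.answer;
}
```

Scraping sequentially (as above) is the safe default: each scrape triggers embedding calls to Ollama, so firing 100+ requests in parallel can overwhelm a local model server.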

API Reference

POST /api/scrape

Scrape a URL and index its content.

Request Body:

{
  "url": "https://example.com"
}

Response:

{
  "success": true,
  "id": "uuid",
  "url": "https://example.com",
  "title": "Page Title",
  "chunkCount": 10,
  "message": "URL scraped and indexed successfully"
}

POST /api/ask

Ask a question about scraped content. By default, searches across ALL scraped URLs.

Request Body:

{
  "question": "What is this page about?",
  "urlId": "uuid",  // OPTIONAL - only use if you want to search a specific URL
  "topK": 5         // OPTIONAL - number of chunks to retrieve (default: 5)
}

Typical usage (search all 100+ sites):

{
  "question": "What is this page about?"
}

Response:

{
  "question": "What is this page about?",
  "answer": "The page is about...",
  "sources": [
    {
      "title": "Page Title",
      "url": "https://example.com",
      "chunkIndex": 0,
      "score": 0.85,
      "preview": "Text preview..."
    }
  ]
}
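For TypeScript clients, the request and response shapes above can be captured as types. These interfaces are inferred from the JSON examples in this README; the project itself does not export them.

```typescript
// Types inferred from the /api/ask examples in this README; treat them as a
// client-side convenience, not an official contract.

export interface AskRequest {
  question: string;
  urlId?: string; // optional: restrict search to one scraped URL
  topK?: number;  // optional: number of chunks to retrieve (default 5)
}

export interface AskSource {
  title: string;
  url: string;
  chunkIndex: number;
  score: number;
  preview: string;
}

export interface AskResponse {
  question: string;
  answer: string;
  sources: AskSource[];
}
```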

GET /api/urls

List all scraped URLs.

Response:

{
  "count": 1,
  "urls": [...]
}

GET /api/stats

Get system statistics.

Response:

{
  "totalUrls": 1,
  "totalVectors": 15,
  "urls": [...]
}

Configuration

Edit .env to configure:

PORT=3000
OLLAMA_HOST=http://localhost:11434
NODE_ENV=development

How It Works (LangChain Implementation)

  1. Scraping: Extracts text content from web pages using Cheerio
  2. Chunking: RecursiveCharacterTextSplitter splits text intelligently at natural boundaries (paragraphs, sentences)
  3. Embedding: OllamaEmbeddings creates vector embeddings using nomic-embed-text model
  4. Storage: MemoryVectorStore stores vectors in-memory with similarity search
  5. Retrieval: similaritySearchWithScore finds relevant chunks across ALL scraped sites
  6. Generation: Ollama LLM with PromptTemplate and RunnableSequence generates answers
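As a dependency-free illustration of step 2, here is a sketch of the idea behind recursive chunking: try the coarsest separator first (paragraph breaks), and recurse into finer separators only for pieces that are still too long. This mirrors the concept behind LangChain's RecursiveCharacterTextSplitter, not its exact algorithm (no chunk overlap, and no merging of small pieces back together).

```typescript
// Sketch of recursive chunking: split on the coarsest separator that yields
// pieces under maxLen, recursing into oversized pieces with finer separators.
// Unlike the real RecursiveCharacterTextSplitter, this does not re-merge
// small adjacent pieces or add overlap between chunks.

const SEPARATORS = ["\n\n", "\n", " "];

export function splitRecursive(text: string, maxLen: number, sepIndex = 0): string[] {
  if (text.length <= maxLen) return [text];
  if (sepIndex >= SEPARATORS.length) {
    // No separators left: hard-cut by length as a last resort.
    const out: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) out.push(text.slice(i, i + maxLen));
    return out;
  }
  const pieces = text.split(SEPARATORS[sepIndex]).filter((p) => p.length > 0);
  return pieces.flatMap((p) => splitRecursive(p, maxLen, sepIndex + 1));
}
```

The payoff of this strategy is that chunks tend to end at natural boundaries (a paragraph or sentence), so each embedded chunk carries coherent meaning instead of cutting mid-word.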

What LangChain Gives You

Without LangChain (vanilla implementation):

  • Manual embedding API calls
  • Custom cosine similarity implementation
  • Basic text splitting by character count
  • Manual prompt construction
  • More code to maintain

With LangChain (current implementation):

  • OllamaEmbeddings - Handles embedding API calls and batching
  • MemoryVectorStore - Built-in similarity search with scoring
  • RecursiveCharacterTextSplitter - Smart chunking at natural text boundaries
  • PromptTemplate - Reusable, maintainable prompts
  • RunnableSequence - Composable chains (prompt → LLM → parser)
  • ✅ Easy to swap components (e.g., switch to different vector store or LLM)
  • ✅ Production-ready abstractions used by thousands of developers

Multi-Site Search

When you ask a question without a urlId:

  • The system searches through all chunks from all 100+ scraped sites
  • Uses cosine similarity to find the top 5 most relevant chunks (configurable via topK)
  • The LLM gets context from multiple sites and synthesizes an answer
  • The response includes which sites the answer came from

Example: If you scraped 100 programming blogs and ask "What is React?", it will:

  1. Search all ~1000+ chunks from all 100 sites
  2. Find the 5 most relevant chunks (might be from different sites)
  3. LLM reads those 5 chunks and answers your question
  4. Response shows which sites were used as sources
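The ranking at the heart of steps 1 and 2 can be sketched without any dependencies. `cosineSimilarity` and `topK` below are hypothetical helper names, and the vectors are tiny mocks; the real system ranks 768-dimensional nomic-embed-text embeddings via MemoryVectorStore, but the math is the same.

```typescript
// Sketch of the retrieval step: rank chunk embeddings against a question
// embedding by cosine similarity and keep the top k. Vectors here are tiny
// mock embeddings standing in for real nomic-embed-text output.

export function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

interface Chunk { text: string; embedding: number[]; }

// Sort a copy of the chunks by similarity to the query, highest first.
export function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return [...chunks]
    .sort((x, y) =>
      cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding))
    .slice(0, k);
}
```

Because the ranking is global over every stored vector, chunks from different sites compete on equal footing, which is what lets a single question pull context from several of the 100+ scraped pages at once.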

Limitations (POC)

  • ⚠️ In-memory only: All data is lost when server restarts
  • ⚠️ Static pages: JavaScript-heavy sites may not scrape well
  • ⚠️ No authentication: Anyone can access the API
  • ⚠️ Single-threaded: No background job processing

Troubleshooting

Ollama not running

# Start Ollama
ollama serve

# Check if models are installed
ollama list

Port already in use

# Change PORT in .env file
PORT=3001

Scraping fails

  • Check if the URL is accessible
  • Some sites block scrapers (try different URLs)
  • Try a simpler page like Wikipedia

Next Steps

To make this production-ready:

  • Add persistent storage (PostgreSQL + Pinecone/Qdrant)
  • Add user authentication
  • Implement background job processing
  • Add Puppeteer for JavaScript-heavy sites
  • Add rate limiting
  • Add comprehensive error handling
  • Add tests

License

MIT
