💬 Turn your website docs into a GPT-4.1-mini support chatbot with MrScraper and Pinecone

Description

This n8n template turns any website or documentation portal into a fully functional AI-powered support chatbot — no manual copy-pasting, no static FAQs. It uses MrScraper to crawl and extract your site’s content, OpenAI to generate embeddings, and Pinecone to store and retrieve that knowledge at chat time.

The result is a retrieval-augmented chatbot that answers questions using only your actual website content, cites its sources, and is instructed to say it can’t find an answer rather than guess at policies or pricing.

How It Works

Phase 1 – URL Discovery: The Map Agent crawls your target domain using include/exclude patterns to discover all relevant documentation or help center pages. It returns a clean, deduplicated list of URLs ready for content extraction.
Phase 2 – Page Content Extraction: Each discovered URL is processed in controlled batches by the General Agent, which extracts the readable content (title + main text) from every page. Low-quality or near-empty pages are automatically filtered out.
Phase 3 – Chunking & Embedding: Page text is split into overlapping chunks (default: ~1,100 chars with 180-char overlap) to preserve context at boundaries. Each chunk is sent to OpenAI Embeddings to generate a vector, then stored in Pinecone with metadata including the source URL, page title, and chunk index.
Phase 4 – Chat Endpoint: A Chat Trigger exposes a webhook endpoint your website or widget can connect to. When a user asks a question, the Support Chat Agent queries Pinecone for the most relevant chunks and generates a grounded answer using GPT-4.1-mini — always with source URLs included and strict anti-hallucination rules enforced.
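
The overlapping split in Phase 3 can be pictured as a sliding window over the page text. The sketch below is a minimal illustration using the stated defaults (1,100 chars, 180-char overlap); `chunk_text` is a hypothetical helper, and the splitter inside the template may behave differently, for example by preferring sentence or paragraph boundaries.

```python
def chunk_text(text: str, size: int = 1100, overlap: int = 180) -> list[str]:
    """Split text into windows of at most `size` chars.

    Each window starts `size - overlap` chars after the previous one,
    so consecutive windows share `overlap` chars.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Stop once the current window has reached the end of the text.
        if start + size >= len(text):
            break
        start += step
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(97 + i % 26) for i in range(2500))
    for index, chunk in enumerate(chunk_text(sample)):
        print(index, len(chunk))
```

With these defaults each new chunk begins 920 characters after the previous one, so the last 180 characters of one chunk are repeated at the start of the next.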

How to Set Up

Create 2 scrapers in your MrScraper account, then copy the scraperId of each (you’ll need both in n8n):
- Map Agent Scraper (for crawling and discovering page URLs)
- General Agent Scraper (for extracting title + content from each page)
Set up your Pinecone index:
- Create a Pinecone index with dimensions that match your chosen OpenAI embedding model (e.g. 1536 for text-embedding-3-small or text-embedding-ada-002, 3072 for text-embedding-3-large)
- Choose a namespace (recommended format: docs-yourdomain)
Add your credentials in n8n:
- MrScraper API token
- OpenAI API key (used for both embeddings and the chat model)
- Pinecone API key
Configure the Map Agent node:
- Set your target domain or docs root URL (e.g. https://docs.yoursite.com)
- Set includePatterns to focus on relevant sections (e.g. /docs/, /help/, /support/)
- Optionally set excludePatterns to skip noise (e.g. /assets/, /tag/, /static/)
Configure the General Agent node:
- Enter your General Agent scraperId
- Adjust the batch size in the SplitInBatches node (start with 1–5 to stay within rate limits)
Configure the Pinecone nodes:
- Select your Pinecone index in both the Upsert and Retriever nodes
- Set the correct namespace in both nodes so indexing and retrieval use the same data
Customise the chatbot system prompt:
- Edit the Support Chat Agent’s system message to set the chatbot’s name, tone, and rules
- Adjust topK in the Pinecone Retriever (default: 8) based on how much context you want per answer
Connect your chat widget or frontend to the Chat Trigger webhook URL generated by n8n
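
The includePatterns and excludePatterns settings on the Map Agent node act as a URL filter. How MrScraper matches patterns internally is not described here, so the sketch below assumes plain substring matching on the URL path. `normalize` and `filter_urls` are hypothetical helper names; the sketch is only meant for previewing which discovered URLs a given pattern set would keep.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Drop the fragment and any trailing slash so duplicate URLs compare equal."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


def filter_urls(urls, include_patterns, exclude_patterns=()):
    """Keep URLs matching any include pattern and no exclude pattern, deduplicated in order.

    Assumption: patterns are substrings of the URL path (e.g. "/docs/").
    """
    seen = set()
    kept = []
    for url in urls:
        clean = normalize(url)
        # Append "/" so a pattern like "/docs/" also matches the section root "/docs".
        path = urlsplit(clean).path + "/"
        if not any(pattern in path for pattern in include_patterns):
            continue
        if any(pattern in path for pattern in exclude_patterns):
            continue
        if clean not in seen:
            seen.add(clean)
            kept.append(clean)
    return kept


if __name__ == "__main__":
    discovered = [
        "https://docs.yoursite.com/docs/intro",
        "https://docs.yoursite.com/docs/intro/",
        "https://docs.yoursite.com/docs/assets/site.css",
        "https://docs.yoursite.com/blog/launch",
    ]
    print(filter_urls(discovered, ["/docs/", "/help/"], ["/assets/"]))
```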

Requirements

MrScraper account with API access enabled
OpenAI account (for embeddings and GPT-4.1-mini chat)
Pinecone account with an index created and ready

Good to Know

The overlap between chunks (default 180 chars) is intentional: text near a chunk boundary appears in both neighbouring chunks, so a passage split at the boundary can still be retrieved with its surrounding context.
The chatbot is configured to cite 1–3 source URLs per answer, so users can always verify the information themselves.
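The anti-hallucination rules in the system prompt instruct the agent to say it can’t find an answer rather than guess, which reduces the risk of invented answers to support, pricing, or policy questions.
Re-indexing is as simple as re-running the workflow. Keep the Pinecone namespace consistent and upsert with the same vector ID for each chunk on every run; an upsert overwrites an existing vector only when the ID matches, so randomly generated IDs would create duplicates. Chunks from pages that have since been removed or shortened are not deleted by an upsert, so clear them separately if that matters.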
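
One way to make re-runs overwrite rather than duplicate is to derive each vector ID from the source URL and chunk index. The sketch below is a minimal illustration under that assumption; `vector_id`, `build_records`, and the metadata field names are hypothetical and may not match what the template’s Pinecone node writes. The embedding values would be attached to each record before upserting.

```python
import hashlib


def vector_id(source_url: str, chunk_index: int) -> str:
    """Return the same ID for the same (URL, chunk index) pair on every run."""
    digest = hashlib.sha256(source_url.encode("utf-8")).hexdigest()[:16]
    return f"{digest}-{chunk_index}"


def build_records(source_url: str, title: str, chunks: list[str]) -> list[dict]:
    """Pair each chunk with a stable ID and the metadata used for citations."""
    return [
        {
            "id": vector_id(source_url, index),
            "metadata": {
                "source_url": source_url,  # hypothetical field name
                "title": title,            # hypothetical field name
                "chunk_index": index,      # hypothetical field name
                "text": chunk,
            },
        }
        for index, chunk in enumerate(chunks)
    ]


if __name__ == "__main__":
    records = build_records(
        "https://docs.yoursite.com/docs/intro", "Intro", ["first chunk", "second chunk"]
    )
    for record in records:
        print(record["id"], record["metadata"]["chunk_index"])
```

Because the ID depends only on the URL and the chunk position, a page whose text changes keeps its IDs and is overwritten in place on the next run.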

Customising This Workflow

Swap the chat model: Replace GPT-4.1-mini with GPT-4o or another OpenAI model for higher-quality answers on complex queries.
Scheduled re-indexing: Add a Schedule Trigger to re-crawl and re-index your docs automatically at a fixed interval (e.g. nightly), so content changes are picked up without a manual run.
Multiple knowledge bases: Use different Pinecone namespaces (e.g. docs-product, docs-api) and route questions to the right namespace based on user intent.
Embed on your website: Connect the Chat Trigger webhook to any chat widget library to give your users a live support experience powered entirely by your own documentation.
Multilingual support: Add a translation node before chunking to index content in multiple languages and serve a global audience.
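
Routing between multiple knowledge bases needs some rule that maps a question to a namespace. The sketch below is a deliberately simple keyword-based illustration; the keyword lists, the default namespace, and `pick_namespace` are all hypothetical, and in practice an LLM classification step in n8n could fill the same role.

```python
import re

# Hypothetical keyword lists; tune these to your own documentation.
NAMESPACE_KEYWORDS = {
    "docs-api": ("api", "endpoint", "sdk", "webhook", "token"),
    "docs-product": ("pricing", "plan", "billing", "feature", "account"),
}
DEFAULT_NAMESPACE = "docs-product"


def pick_namespace(question: str) -> str:
    """Return the namespace whose keywords appear most often in the question."""
    words = set(re.findall(r"[a-z0-9]+", question.lower()))
    best, best_hits = DEFAULT_NAMESPACE, 0
    for namespace, keywords in NAMESPACE_KEYWORDS.items():
        hits = sum(1 for keyword in keywords if keyword in words)
        if hits > best_hits:
            best, best_hits = namespace, hits
    return best


if __name__ == "__main__":
    print(pick_namespace("How do I rotate my API token?"))
    print(pick_namespace("What is included in the Pro plan?"))
```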

🔗 Nodes Used

AI Agent, Embeddings OpenAI, OpenAI Chat Model, Simple Memory, Pinecone Vector Store, Default Data Loader

📥 Import

Download workflow.json and import into n8n: Workflow menu → Import from File
