Turn your website docs into a GPT-4.1-mini support chatbot with MrScraper and Pinecone
Description
This n8n template turns any website or documentation portal into a fully functional AI-powered support chatbot, with no manual copy-pasting and no static FAQs. It uses MrScraper to crawl and extract your site's content, OpenAI to generate embeddings, and Pinecone to store and retrieve that knowledge at chat time.
The result is a retrieval-augmented chatbot that answers questions from your actual website content, cites its sources, and is instructed to say it cannot find an answer rather than guess at policies or pricing.
How It Works
- Phase 1 - URL Discovery: The Map Agent crawls your target domain using include/exclude patterns to discover all relevant documentation or help center pages. It returns a clean, deduplicated list of URLs ready for content extraction.
- Phase 2 - Page Content Extraction: Each discovered URL is processed in controlled batches by the General Agent, which extracts the readable content (title + main text) from every page. Low-quality or near-empty pages are automatically filtered out.
- Phase 3 - Chunking & Embedding: Page text is split into overlapping chunks (default: ~1,100 chars with 180-char overlap) to preserve context at boundaries. Each chunk is sent to OpenAI Embeddings to generate a vector, then stored in Pinecone with metadata including the source URL, page title, and chunk index.
- Phase 4 - Chat Endpoint: A Chat Trigger exposes a webhook endpoint your website or widget can connect to. When a user asks a question, the Support Chat Agent queries Pinecone for the most relevant chunks and generates a grounded answer using GPT-4.1-mini, always with source URLs included and strict anti-hallucination rules enforced.
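The chunking step in Phase 3 can be sketched in a few lines of Python. This is a minimal illustration using plain fixed-size character windows, assuming the defaults quoted above (1,100 characters, 180 overlap). The splitter actually used inside the workflow may behave differently, and the function names and metadata keys (source_url, title, chunk_index) are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1100, overlap: int = 180) -> list[str]:
    """Split text into character windows where neighbours share `overlap` chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def build_records(source_url: str, title: str, text: str) -> list[dict]:
    """Attach per-chunk metadata (hypothetical key names) to every chunk of a page."""
    return [
        {
            "text": chunk,
            "metadata": {"source_url": source_url, "title": title, "chunk_index": i},
        }
        for i, chunk in enumerate(chunk_text(text))
    ]


if __name__ == "__main__":
    # Synthetic 2,500-character page, used only to show the window sizes.
    page = "".join(chr(97 + i % 26) for i in range(2500))
    for record in build_records("https://docs.yoursite.com/docs/intro", "Intro", page):
        print(record["metadata"]["chunk_index"], len(record["text"]))
```

Each window advances by chunk_size minus overlap, so the last 180 characters of one chunk are repeated at the start of the next.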
How to Set Up
- Create 2 scrapers in your MrScraper account:
  - Map Agent Scraper (for crawling and discovering page URLs)
  - General Agent Scraper (for extracting title + content from each page)
  - Copy the scraperId for each; you'll need these in n8n.
- Set up your Pinecone index:
  - Create a Pinecone index with dimensions that match your chosen OpenAI embedding model (e.g. 1536 for text-embedding-ada-002)
  - Choose a namespace (recommended format: docs-yourdomain)
- Add your credentials in n8n:
  - MrScraper API token
  - OpenAI API key (used for both embeddings and the chat model)
  - Pinecone API key
- Configure the Map Agent node:
  - Set your target domain or docs root URL (e.g. https://docs.yoursite.com)
  - Set includePatterns to focus on relevant sections (e.g. /docs/, /help/, /support/)
  - Optionally set excludePatterns to skip noise (e.g. /assets/, /tag/, /static/)
- Configure the General Agent node:
  - Enter your General Agent scraperId
  - Adjust the batch size in the SplitInBatches node (start with 1 to 5 to stay within rate limits)
- Configure the Pinecone nodes:
  - Select your Pinecone index in both the Upsert and Retriever nodes
  - Set the correct namespace in both nodes so indexing and retrieval use the same data
- Customise the chatbot system prompt:
  - Edit the Support Chat Agent's system message to set the chatbot's name, tone, and rules
  - Adjust topK in the Pinecone Retriever (default: 8) based on how much context you want per answer
- Connect your chat widget or frontend to the Chat Trigger webhook URL generated by n8n
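To make the includePatterns and excludePatterns settings from the Map Agent step more concrete, here is a rough sketch of how such a filter could behave. It assumes plain substring matching on the URL path and deduplication by stripping fragments and trailing slashes. MrScraper's real matching rules are not described in this template, so treat filter_urls and its behaviour as hypothetical.

```python
from urllib.parse import urldefrag, urlparse


def filter_urls(urls, include_patterns, exclude_patterns=()):
    """Return deduplicated URLs whose path matches an include and no exclude pattern."""
    seen = set()
    kept = []
    for raw in urls:
        # Normalise: drop "#fragment" and any trailing slash so duplicates collapse.
        url = urldefrag(raw).url.rstrip("/")
        if url in seen:
            continue
        seen.add(url)
        # Append a slash so that "/docs" still matches the pattern "/docs/".
        path = urlparse(url).path + "/"
        if include_patterns and not any(p in path for p in include_patterns):
            continue
        if any(p in path for p in exclude_patterns):
            continue
        kept.append(url)
    return kept


if __name__ == "__main__":
    discovered = [
        "https://docs.yoursite.com/docs/intro",
        "https://docs.yoursite.com/docs/intro/",
        "https://docs.yoursite.com/docs/intro#setup",
        "https://docs.yoursite.com/assets/logo.png",
        "https://docs.yoursite.com/help/faq",
    ]
    for url in filter_urls(discovered, ["/docs/", "/help/"], ["/assets/"]):
        print(url)
```

Starting with narrow include patterns and widening them later keeps the first crawl small, which also makes it easier to stay within the batch and rate limits mentioned above.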
Requirements
- MrScraper account with API access enabled
- OpenAI account (for embeddings and GPT-4.1-mini chat)
- Pinecone account with an index created and ready
Good to Know
- The overlap between chunks (default 180 chars) is intentional: text that straddles a chunk boundary still appears intact in at least one chunk, which helps retrieval quality and reduces answers that are cut off mid-thought.
- The chatbot is configured to cite 1 to 3 source URLs per answer, so users can always verify the information themselves.
- The anti-hallucination rules in the system prompt instruct the agent to say it can't find an answer rather than guess, which makes it safer to use for support, pricing, or policy questions.
- Re-indexing is as simple as re-running the workflow. Use a consistent Pinecone namespace and upsert mode so that existing vectors are overwritten rather than duplicated; this relies on each chunk keeping the same vector ID between runs.
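Upserts only avoid duplicates when the same chunk maps to the same vector ID on every run. One possible way to get stable IDs is to derive them from the source URL and chunk index, as sketched below. The vector_id helper and the in-memory dict standing in for a Pinecone namespace are illustrative assumptions; the template does not say how the workflow assigns IDs or whether the n8n Pinecone node exposes custom IDs.

```python
import hashlib


def vector_id(source_url: str, chunk_index: int) -> str:
    """Build a deterministic ID from the page URL and the chunk's position."""
    normalised = source_url.strip().rstrip("/")
    digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()[:16]
    return f"{digest}-{chunk_index}"


def upsert(store: dict, source_url: str, chunks: list[str]) -> None:
    """Mimic upsert semantics: writing an existing ID overwrites the old record."""
    for i, chunk in enumerate(chunks):
        store[vector_id(source_url, i)] = {"text": chunk, "source_url": source_url}


if __name__ == "__main__":
    namespace = {}  # stand-in for one Pinecone namespace
    url = "https://docs.yoursite.com/docs/intro"
    upsert(namespace, url, ["first chunk", "second chunk"])
    upsert(namespace, url, ["first chunk (edited)", "second chunk"])  # re-index run
    print(len(namespace))
```

One caveat with this scheme: if a page gets shorter, chunks with higher indexes from an earlier run are not overwritten and would need to be deleted separately.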
Customising This Workflow
- Swap the chat model: Replace GPT-4.1-mini with GPT-4o or another OpenAI model for higher-quality answers on complex queries.
- Scheduled re-indexing: Add a Schedule Trigger to re-crawl and re-index your docs automatically at a fixed interval, so the index keeps up with content changes.
- Multiple knowledge bases: Use different Pinecone namespaces (e.g. docs-product, docs-api) and route questions to the right namespace based on user intent.
- Embed on your website: Connect the Chat Trigger webhook to any chat widget library to give your users a live support experience powered entirely by your own documentation.
- Multilingual support: Add a translation node before chunking to index content in multiple languages and serve a global audience.
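For the multiple knowledge bases idea, routing by user intent could start as simply as the keyword scorer below. This is a deliberately naive sketch: the keyword lists, the default namespace, and the pick_namespace function are all made up for illustration, and a production setup might use an LLM classifier instead of keywords.

```python
import re

# Hypothetical keyword lists per namespace; tune these to your own docs.
NAMESPACE_KEYWORDS = {
    "docs-api": ("api", "endpoint", "sdk", "token", "webhook"),
    "docs-product": ("pricing", "plan", "billing", "feature", "account"),
}
DEFAULT_NAMESPACE = "docs-product"  # assumed fallback when nothing matches


def pick_namespace(question: str) -> str:
    """Pick the namespace whose keywords appear most often in the question."""
    words = set(re.findall(r"[a-z0-9]+", question.lower()))
    scores = {
        namespace: sum(1 for keyword in keywords if keyword in words)
        for namespace, keywords in NAMESPACE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else DEFAULT_NAMESPACE


if __name__ == "__main__":
    for q in ("How do I rotate my API token?", "What does the Pro plan cost?"):
        print(q, "->", pick_namespace(q))
```

The returned namespace would then be passed to the retriever for that question, so each query only searches the matching knowledge base.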
Nodes Used
AI Agent, Embeddings OpenAI, OpenAI Chat Model, Simple Memory, Pinecone Vector Store, Default Data Loader
Import
Download workflow.json and import into n8n:
Workflow menu → Import from File