🔬 Domain-specific web content crawler with depth control & text extraction

Description

This template implements a recursive web crawler inside n8n. Starting from a given URL, it crawls linked pages up to a maximum depth (default: 3), extracts text and links, and returns the collected content via webhook.

🚀 How It Works

Webhook Trigger
Accepts a JSON body with a url field.
Example payload:

{ "url": "https://example.com" }
Initialization
- Sets crawl parameters: url, domain, maxDepth = 3, and depth = 0.
- Initializes global static data (pending, visited, queued, pages).
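
A minimal sketch of what the initialization step could look like, written as plain JavaScript so it runs on its own. The helper name initCrawl is hypothetical, and the comments describing pending, visited, queued, and pages are assumptions about how the workflow uses them. In an n8n Code node, the store object would typically come from $getWorkflowStaticData('global'); arrays are used here on the assumption that the state needs to be JSON-serializable.

```javascript
// Hypothetical helper: builds the crawl parameters and resets shared state.
// `store` stands in for the object returned by $getWorkflowStaticData('global').
function initCrawl(store, startUrl, maxDepth = 3) {
  const { hostname } = new URL(startUrl); // throws on an invalid URL
  store.pending = 0;  // assumed: pages scheduled but not yet collected
  store.visited = []; // assumed: URLs already fetched
  store.queued = [];  // assumed: URLs already scheduled
  store.pages = [];   // collected { url, depth, content } results
  return { url: startUrl, domain: hostname, maxDepth, depth: 0 };
}

const store = {};
const params = initCrawl(store, 'https://example.com');
console.log(params);
```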
Recursive Crawling
- Fetches each page (HTTP Request).
- Extracts body text and links (HTML node).
- Cleans and deduplicates links.
- Filters out:
  - External domains (only same-site is followed)
  - Anchors (#), mailto/tel/javascript links
  - Non-HTML files (.pdf, .docx, .xlsx, .pptx)
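
The cleaning and filtering rules above can be sketched roughly as follows. This is an illustration, not the template's actual Code node: cleanLinks and the two regex constants are hypothetical names, and the hostname check is an exact match, which leaves out the apex/www equivalence for brevity.

```javascript
const SKIP_SCHEMES = /^(mailto|tel|javascript):/i;
const SKIP_EXTENSIONS = /\.(pdf|docx|xlsx|pptx)$/i;

// Resolves, filters, and deduplicates the hrefs found on one page.
function cleanLinks(hrefs, pageUrl) {
  const base = new URL(pageUrl);
  const seen = new Set();
  for (const href of hrefs) {
    const raw = (href || '').trim();
    // Skip empty values, pure anchors, and non-web schemes.
    if (!raw || raw.startsWith('#') || SKIP_SCHEMES.test(raw)) continue;
    let link;
    try {
      link = new URL(raw, base); // resolves relative links against the page
    } catch {
      continue; // unparseable href
    }
    if (!/^https?:$/.test(link.protocol)) continue;
    if (link.hostname !== base.hostname) continue; // external domain
    if (SKIP_EXTENSIONS.test(link.pathname)) continue; // document download
    link.hash = ''; // drop the fragment so /a#x and /a count as one URL
    seen.add(link.href);
  }
  return [...seen];
}

const links = cleanLinks(
  ['/about', '/about#team', '#top', 'mailto:a@example.com',
   'https://other.com/x', '/files/report.pdf', 'docs/page'],
  'https://example.com/'
);
console.log(links);
```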
Depth Control & Queue
- Tracks visited URLs so that no page is fetched twice
- Stops following links at maxDepth, which bounds the crawl on large or cyclic sites
- Uses SplitInBatches to loop the queue
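
To illustrate how a visited set and a depth limit interact, here is a small breadth-first sketch over an in-memory link map. The SITE object and the crawl function are hypothetical stand-ins; the real workflow loops with SplitInBatches and fetches pages over HTTP. The sketch assumes that pages at exactly maxDepth are still collected but their links are not followed, which may differ from the template.

```javascript
// Hypothetical in-memory "site": URL -> links found on that page.
const SITE = {
  'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
  'https://example.com/a': ['https://example.com/', 'https://example.com/c'],
  'https://example.com/b': [],
  'https://example.com/c': ['https://example.com/d'],
  'https://example.com/d': [],
};

function crawl(startUrl, maxDepth, getLinks) {
  const visited = new Set([startUrl]); // prevents revisits and cycles
  const queue = [{ url: startUrl, depth: 0 }];
  const order = [];
  while (queue.length > 0) {
    const { url, depth } = queue.shift();
    order.push({ url, depth });
    if (depth >= maxDepth) continue; // do not follow links past the limit
    for (const link of getLinks(url)) {
      if (visited.has(link)) continue;
      visited.add(link);
      queue.push({ url: link, depth: depth + 1 });
    }
  }
  return order;
}

const crawled = crawl('https://example.com/', 2, (u) => SITE[u] || []);
console.log(crawled.map((p) => `${p.depth} ${p.url}`).join('\n'));
```

With maxDepth set to 2, the page /d is never reached because it is only linked from a page at depth 2, and the link from /a back to the start page is ignored because it was already visited.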
Data Collection
- Saves each crawled page (url, depth, content) into pages[]
- When pending reaches 0, combines the results
Output
- Responds via the Webhook node with:
  - combinedContent (all pages concatenated)
  - pages[] (array of individual results)
- Large results are chunked when exceeding ~12,000 characters
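
One possible way to build such a response is sketched below. The chunks field and the buildResponse name are assumptions made for illustration; the template does not document how chunked output is exposed, and the separator between pages is also a guess.

```javascript
const CHUNK_SIZE = 12000; // approximates the ~12,000-character threshold

// Concatenates page contents and slices the result into fixed-size chunks.
function buildResponse(pages, chunkSize = CHUNK_SIZE) {
  const combinedContent = pages.map((p) => p.content).join('\n\n');
  const chunks = [];
  for (let i = 0; i < combinedContent.length; i += chunkSize) {
    chunks.push(combinedContent.slice(i, i + chunkSize));
  }
  return { combinedContent, pages, chunks };
}

const response = buildResponse([
  { url: 'https://example.com/', depth: 0, content: 'a'.repeat(15000) },
  { url: 'https://example.com/about', depth: 1, content: 'b'.repeat(10) },
]);
console.log(response.chunks.map((c) => c.length));
```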

🛠️ Setup Instructions

Import Template
Load from n8n Community Templates.
Configure Webhook
- Open the Webhook node
- Copy the Test URL (development) or Production URL (after deploy)
- You’ll POST crawl requests to this endpoint
Run a Test
Send a POST with JSON:

curl -X POST "https://<your-n8n>/webhook/<id>" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
View Response
The crawler returns a JSON object containing combinedContent and pages[].
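
As a rough illustration of working with that response, the sketch below summarizes a sample object. The sample data is made up, the summarizeCrawl helper is hypothetical, and the per-page fields (url, depth, content) are taken from the Data Collection step rather than from a documented schema.

```javascript
// Assumed shape: { combinedContent: string, pages: [{ url, depth, content }] }
function summarizeCrawl(response) {
  const pages = response.pages || [];
  const byDepth = {};
  for (const page of pages) {
    byDepth[page.depth] = (byDepth[page.depth] || 0) + 1; // pages per depth level
  }
  return {
    pageCount: pages.length,
    totalChars: (response.combinedContent || '').length,
    byDepth,
  };
}

// Made-up sample standing in for a real webhook response.
const sample = {
  combinedContent: 'Home\n\nAbout',
  pages: [
    { url: 'https://example.com/', depth: 0, content: 'Home' },
    { url: 'https://example.com/about', depth: 1, content: 'About' },
  ],
};
const summary = summarizeCrawl(sample);
console.log(summary);
```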

⚙️ Configuration

maxDepth
Default: 3. Adjust in the Init Crawl Params (Set) node.
Timeouts
The HTTP Request node times out after 5 seconds per request; raise this for slow sites.
Filtering Rules
- Only same-domain links are followed (apex and www treated as same-site)
- Skips anchors, mailto:, tel:, javascript:
- Skips document links (.pdf, .docx, .xlsx, .pptx)
- You can tweak the regex and logic in Queue & Dedup Links (Code) node
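
If you want to adjust the same-site rule, a check along these lines treats the apex domain and its www subdomain as equivalent. This is a sketch under assumptions, not the code in the Queue & Dedup Links node: siteKey and isSameSite are hypothetical names, and other subdomains (for example blog.example.com) are deliberately treated as external.

```javascript
// Normalizes a hostname so that example.com and www.example.com compare equal.
function siteKey(hostname) {
  return hostname.toLowerCase().replace(/^www\./, '');
}

// Returns true when `link` is an absolute URL on the same site as `domain`.
function isSameSite(link, domain) {
  try {
    return siteKey(new URL(link).hostname) === siteKey(domain);
  } catch {
    return false; // not a parseable absolute URL
  }
}

console.log(isSameSite('https://www.example.com/a', 'example.com'));
console.log(isSameSite('https://blog.example.com/', 'example.com'));
```

To follow subdomains as well, the comparison could be relaxed to a suffix match on the normalized hostname, at the cost of a potentially much larger crawl.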

📌 Limitations

No JavaScript rendering (static HTML only)
No authentication/cookies/session handling
Large sites can be slow or hit timeouts; chunking reduces the response size but not the crawl time

✅ Example Use Cases

Extract text across your site for AI ingestion / embeddings
SEO/content audit and internal link checks
Build a lightweight page corpus for downstream processing in n8n

⏱️ Estimated Setup Time

~10 minutes (import → set webhook → test request)

🔗 Nodes Used

HTTP Request, Webhook

📥 Import

Download workflow.json and import into n8n: Workflow menu → Import from File