Domain-specific web content crawler with depth control & text extraction
Pro Tip: HTTP Request scraping tends to break when sites update their markup. If you're scraping a major platform, check whether ScraperNode covers it; it has maintained scrapers for LinkedIn, Instagram, TikTok, YouTube, and 20+ other platforms that return structured data.
Description
This template implements a recursive web crawler inside n8n. Starting from a given URL, it crawls linked pages up to a maximum depth (default: 3), extracts text and links, and returns the collected content via webhook.
How It Works
- Webhook Trigger
  Accepts a JSON body with a `url` field.
  Example payload: `{ "url": "https://example.com" }`
- Initialization
  - Sets crawl parameters: `url`, `domain`, `maxDepth = 3`, and `depth = 0`.
  - Initializes global static data (`pending`, `visited`, `queued`, `pages`).
- Recursive Crawling
  - Fetches each page (HTTP Request).
  - Extracts body text and links (HTML node).
  - Cleans and deduplicates links.
  - Filters out:
    - External domains (only same-site links are followed)
    - Anchors (#) and mailto/tel/javascript links
    - Non-HTML files (.pdf, .docx, .xlsx, .pptx)
- Depth Control & Queue
  - Tracks visited URLs
  - Stops at `maxDepth` to prevent infinite loops
  - Uses SplitInBatches to loop the queue
- Data Collection
  - Saves each crawled page (`url`, `depth`, `content`) into `pages[]`
  - When `pending = 0`, combines results
- Output
  - Responds via the Webhook node with:
    - `combinedContent` (all pages concatenated)
    - `pages[]` (array of individual results)
  - Large results are chunked when exceeding ~12,000 characters
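The queue and depth logic described in this section can be sketched outside n8n as a plain breadth-first loop. This is a minimal Python sketch under stated assumptions, not the template's actual Code-node logic: the `crawl` function, the `fetch` callable, and the in-memory `SITE` map are hypothetical stand-ins for the HTTP Request and HTML nodes, a single `queue` replaces the template's `pending` counter and SplitInBatches loop, and the same-site test is simplified to an exact host match.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple
from urllib.parse import urlparse

# Hypothetical fetcher type: takes a URL, returns (page_text, outgoing_links).
Fetcher = Callable[[str], Tuple[str, List[str]]]


def crawl(start_url: str, fetch: Fetcher, max_depth: int = 3) -> Dict[str, object]:
    """Breadth-first crawl bounded by max_depth, staying on the start host."""
    host = urlparse(start_url).netloc
    queued = {start_url}              # every URL ever enqueued (dedup guard)
    queue = deque([(start_url, 0)])   # (url, depth) pairs still to fetch
    pages = []                        # one record per crawled page

    while queue:
        url, depth = queue.popleft()
        text, links = fetch(url)
        pages.append({"url": url, "depth": depth, "content": text})

        if depth >= max_depth:
            continue  # links found at the depth limit are not followed
        for link in links:
            if urlparse(link).netloc != host:
                continue  # external domain (simplified exact-host check)
            if link in queued:
                continue  # already fetched or waiting in the queue
            queued.add(link)
            queue.append((link, depth + 1))

    combined = "\n\n".join(page["content"] for page in pages)
    return {"combinedContent": combined, "pages": pages}


# Hypothetical in-memory site used instead of real HTTP requests.
SITE = {
    "https://example.com/": ("home", ["https://example.com/a", "https://other.org/x"]),
    "https://example.com/a": ("page a", ["https://example.com/", "https://example.com/b"]),
    "https://example.com/b": ("page b", []),
}

result = crawl("https://example.com/", lambda url: SITE[url], max_depth=1)
print([page["url"] for page in result["pages"]])
```

Because every URL is added to `queued` before it is fetched, a page that links back to an earlier page cannot re-enter the queue, so the loop terminates even without the depth limit; `max_depth` only bounds how far the crawl spreads.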
Setup Instructions
- Import Template
  Load from n8n Community Templates.
- Configure Webhook
  - Open the Webhook node
  - Copy the Test URL (development) or Production URL (after the workflow is activated)
  - You'll POST crawl requests to this endpoint
- Run a Test
  Send a POST with JSON:
      curl -X POST https://<your-n8n>/webhook/<id> \
        -H "Content-Type: application/json" \
        -d '{"url": "https://example.com"}'
- View Response
  The crawler returns a JSON object containing `combinedContent` and `pages[]`.
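The same test request can be sent from a script. The sketch below uses only the Python standard library; `build_crawl_request` and `run_crawl` are hypothetical helper names, the 120-second client timeout is an assumption, and the code assumes the webhook answers with a single JSON object.

```python
import json
import urllib.request


def build_crawl_request(webhook_url: str, target_url: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for the crawler webhook."""
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_crawl(webhook_url: str, target_url: str, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response."""
    request = build_crawl_request(webhook_url, target_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (replace the placeholders with your own webhook URL):
# result = run_crawl("https://<your-n8n>/webhook/<id>", "https://example.com")
# print(len(result["pages"]))
```

A generous client timeout is worth considering because the webhook only responds after the whole crawl finishes.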
Configuration
- maxDepth
  Default: 3. Adjust in the Init Crawl Params (Set) node.
- Timeouts
  HTTP Request node timeout is 5 seconds per request; increase if needed.
- Filtering Rules
  - Only same-domain links are followed (apex and `www` treated as same-site)
  - Skips anchors, `mailto:`, `tel:`, `javascript:`
  - Skips document links (.pdf, .docx, .xlsx, .pptx)
  - You can tweak the regex and logic in the Queue & Dedup Links (Code) node
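To make the filtering rules concrete, here is a minimal Python sketch of one way to express them. It is not the code from the Queue & Dedup Links node: the function names, the regex, and the choice to strip fragments from otherwise valid links are assumptions made for illustration.

```python
import re
from typing import Iterable, List, Optional
from urllib.parse import urldefrag, urljoin, urlparse

SKIP_SCHEMES = ("mailto:", "tel:", "javascript:")
DOC_EXTENSIONS = re.compile(r"\.(pdf|docx|xlsx|pptx)$", re.IGNORECASE)


def site_key(host: str) -> str:
    """Treat the apex domain and its www subdomain as the same site."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def normalize_link(href: str, page_url: str, domain: str) -> Optional[str]:
    """Return an absolute same-site URL to follow, or None to skip the link."""
    href = href.strip()
    if not href or href.startswith("#"):
        return None  # empty link or pure in-page anchor
    if href.lower().startswith(SKIP_SCHEMES):
        return None  # mailto:, tel:, javascript:
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return None
    if site_key(parsed.hostname or "") != site_key(domain):
        return None  # external domain
    if DOC_EXTENSIONS.search(parsed.path):
        return None  # document download, not an HTML page
    return absolute


def filter_links(hrefs: Iterable[str], page_url: str, domain: str) -> List[str]:
    """Apply normalize_link to every href and drop duplicates, keeping order."""
    kept = (normalize_link(href, page_url, domain) for href in hrefs)
    return list(dict.fromkeys(url for url in kept if url is not None))
```

Resolving relative links with `urljoin` before the domain check means `/about` and `https://www.example.com/about` are compared on the same footing.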
Limitations
- No JavaScript rendering (static HTML only)
- No authentication/cookies/session handling
- Large sites can be slow or hit timeouts; chunking mitigates response size
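The template does not spell out how it chunks large results, so the following is only a sketch of the general idea: fixed-width splitting at the ~12,000-character threshold mentioned earlier. The `chunk_text` name and the decision to cut at exact character offsets (rather than at whitespace or page boundaries) are assumptions.

```python
from typing import List


def chunk_text(text: str, limit: int = 12_000) -> List[str]:
    """Split text into consecutive pieces of at most `limit` characters."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    if len(text) <= limit:
        return [text]  # small results are returned as a single chunk
    return [text[start:start + limit] for start in range(0, len(text), limit)]
```

Joining the chunks back together in order reproduces the original text, so a consumer can either process pieces independently or reassemble them.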
Example Use Cases
- Extract text across your site for AI ingestion / embeddings
- SEO/content audit and internal link checks
- Build a lightweight page corpus for downstream processing in n8n
Estimated Setup Time
~10 minutes (import → set webhook → test request)
Nodes Used
HTTP Request, Webhook
Import
Download workflow.json and import into n8n:
Workflow menu → Import from File