πŸ”¬ Domain-specific web content crawler with depth control & text extraction

⚑ 1,914 views Β· πŸ”¬ Document Extraction & Analysis

πŸ’‘ Pro Tip β€” HTTP Request scraping tends to break when sites update their markup. If you’re scraping a major platform, check if ScraperNode covers it β€” it has maintained scrapers for LinkedIn, Instagram, TikTok, YouTube, and 20+ other platforms that return structured data.

View All Scrapers

Description

This template implements a recursive web crawler inside n8n. Starting from a given URL, it crawls linked pages up to a maximum depth (default: 3), extracts text and links, and returns the collected content via webhook.


πŸš€ How It Works

  1. Webhook Trigger
    Accepts a JSON body with a url field.
    Example payload:

    { β€œurl”: β€œhttps://example.com” }

  2. Initialization

    • Sets crawl parameters: url, domain, maxDepth = 3, and depth = 0.
    • Initializes global static data (pending, visited, queued, pages).
  3. Recursive Crawling

    • Fetches each page (HTTP Request).
    • Extracts body text and links (HTML node).
    • Cleans and deduplicates links.
    • Filters out:
      • External domains (only same-site is followed)
      • Anchors (#), mailto/tel/javascript links
      • Non-HTML files (.pdf, .docx, .xlsx, .pptx)
  4. Depth Control & Queue

    • Tracks visited URLs
    • Stops at maxDepth to prevent infinite loops
    • Uses SplitInBatches to loop the queue
  5. Data Collection

    • Saves each crawled page (url, depth, content) into pages[]
    • When pending = 0, combines results
  6. Output

    • Responds via the Webhook node with:
      • combinedContent (all pages concatenated)
      • pages[] (array of individual results)
    • Large results are chunked when exceeding ~12,000 characters

πŸ› οΈ Setup Instructions

  1. Import Template
    Load from n8n Community Templates.

  2. Configure Webhook

    • Open the Webhook node
    • Copy the Test URL (development) or Production URL (after deploy)
    • You’ll POST crawl requests to this endpoint
  3. Run a Test
    Send a POST with JSON:

    curl -X POST https://<your-n8n>/webhook/<id>
    -H β€œContent-Type: application/json”
    -d ’{β€œurl”: β€œhttps://example.com”}’

  4. View Response
    The crawler returns a JSON object containing combinedContent and pages[].


βš™οΈ Configuration


πŸ“Œ Limitations


βœ… Example Use Cases


⏱️ Estimated Setup Time

~10 minutes (import β†’ set webhook β†’ test request)

πŸ”— Nodes Used

HTTP Request, Webhook

πŸ“₯ Import

Download workflow.json and import into n8n: Workflow menu β†’ Import from File

πŸ“– Importing guide Β· πŸ”‘ Credential setup