Essential multipage website scraper with Jina.ai
18,059 views · Market Research & Insights
Pro Tip: HTTP Request scraping tends to break when sites update their markup. If you're scraping a major platform, check whether ScraperNode covers it; it has maintained scrapers for LinkedIn, Instagram, TikTok, YouTube, and 20+ other platforms that return structured data.
Description
Essential Multipage Website Scraper with Jina.ai
Use responsibly and follow local rules and regulations
This n8n workflow automates multi-page website scraping with Jina.ai's web scraper and stores the results in Google Drive. Here's how it works:
Main Features
The workflow automatically scrapes multiple pages from a website's sitemap and saves each page's content as a separate Google Drive document.
Key Components
Input Configuration
- Starts with a sitemap URL (default: https://ai.pydantic.dev/sitemap.xml)
- Processes the sitemap to extract individual page URLs
- Includes filtering options to target specific topics or pages
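The input stage above can be sketched in Python as follows. This is a minimal illustration, not the workflow's actual node logic: it assumes the sitemap follows the standard sitemaps.org schema, and the function names, sample URLs, and keyword-matching rule are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol for <urlset>/<url>/<loc>.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap XML string."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def filter_urls(urls: list[str], keywords: list[str]) -> list[str]:
    """Keep URLs that contain at least one keyword (case-insensitive).

    Assumption: topic filtering is a plain substring match on the URL.
    """
    lowered = [k.lower() for k in keywords]
    return [u for u in urls if any(k in u.lower() for k in lowered)]


# Hypothetical sample standing in for a downloaded sitemap.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/agents/</loc></url>
  <url><loc>https://example.com/models/openai/</loc></url>
  <url><loc>https://example.com/changelog/</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in filter_urls(extract_urls(SAMPLE), ["agents", "models"]):
        print(url)
```

Matching on the URL path keeps the filter cheap, since no page has to be fetched before deciding whether to scrape it.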
Scraping Process
- Uses Jina.ai's web scraper to extract content from each URL
- Converts webpage content into clean markdown format
- Extracts page titles automatically for document naming
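A rough Python equivalent of the scraping step might look like this. It rests on two assumptions that the description does not confirm: that the workflow calls the public Jina Reader endpoint (`https://r.jina.ai/` followed by the target URL), and that the returned text includes a `Title:` line. The helper names are hypothetical.

```python
import re
import urllib.request

# Assumption: the public Jina Reader endpoint, with the target URL appended.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(page_url: str) -> str:
    """Build the Reader request URL for a page."""
    return READER_PREFIX + page_url


def fetch_markdown(page_url: str, timeout: float = 30.0) -> str:
    """Fetch the page through the Reader endpoint and return the response text."""
    with urllib.request.urlopen(reader_url(page_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_title(reader_text: str, default: str = "Untitled") -> str:
    """Pull the page title from a 'Title: ...' line, if one is present.

    Assumption: the response carries the title on its own 'Title:' line.
    """
    match = re.search(r"^Title:\s*(.+)$", reader_text, flags=re.MULTILINE)
    return match.group(1).strip() if match else default


if __name__ == "__main__":
    text = fetch_markdown("https://ai.pydantic.dev/")
    print(extract_title(text))
```

The fallback title keeps the document-naming step from failing when a response has no recognizable title line.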
Storage Integration
- Creates individual Google Drive documents for each scraped page
- Names documents using the format "URL - Page Title"
- Saves content in markdown format for better readability
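The naming rule above is simple enough to express directly. The sketch below only prepares the name and body for each document and does not call Google Drive. The `ScrapedDocument` type and the fallback for a missing title are illustrative assumptions, not part of the workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapedDocument:
    name: str     # document name in the form "URL - Page Title"
    content: str  # page body as markdown


def build_document(page_url: str, page_title: str, markdown: str) -> ScrapedDocument:
    """Pair the scraped markdown with a name in the 'URL - Page Title' format."""
    title = page_title.strip()
    # Assumption: fall back to the bare URL when no title was extracted.
    name = f"{page_url} - {title}" if title else page_url
    return ScrapedDocument(name=name, content=markdown)


if __name__ == "__main__":
    doc = build_document("https://example.com/agents/", "Agents", "# Agents\n...")
    print(doc.name)
```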
Usage Instructions
- Set your target website's sitemap URL in the "Set Website URL" node
- Configure the "Filter By Topics or Pages" node to select specific content
- Adjust the "Limit" node (default: 20 pages) to control batch size
- Connect your Google Drive account
- Run the workflow to begin automated scraping
Additional Features
- Built-in rate limiting through the Wait node to prevent overloading servers
- Batch processing capability for handling large sitemaps
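The combination of a page limit, batching, and a pause between batches can be sketched like this. Only the default limit of 20 comes from the description; the batch size, the wait duration, and the function name are assumptions chosen for illustration.

```python
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def scrape_in_batches(
    urls: Sequence[str],
    handle: Callable[[str], T],
    limit: int = 20,            # mirrors the "Limit" node default
    batch_size: int = 5,        # assumption: not stated in the description
    wait_seconds: float = 2.0,  # assumption: not stated in the description
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Process at most `limit` URLs in batches, pausing between batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    selected = list(urls[:limit])
    results: List[T] = []
    for start in range(0, len(selected), batch_size):
        if start > 0:
            # Pause before every batch except the first to spare the server.
            sleep(wait_seconds)
        for url in selected[start:start + batch_size]:
            results.append(handle(url))
    return results


if __name__ == "__main__":
    pages = [f"https://example.com/page-{i}" for i in range(25)]
    done = scrape_in_batches(pages, handle=len, wait_seconds=0.1)
    print(len(done))
```

Passing `sleep` as a parameter makes the pacing easy to test without real delays.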
The workflow requires no API key for Jina.ai, making it accessible for immediate use while maintaining responsible scraping practices.
Nodes Used
HTTP Request, Google Drive, Filter
Import
Download workflow.json and import into n8n:
Workflow menu → Import from File