📊 Build a multi-site content aggregator with Google Sheets & custom extraction logic

📊 Market Research & Insights

💡 Pro Tip — HTTP Request scraping tends to break when sites update their markup. If you’re scraping a major platform, check if ScraperNode covers it — it has maintained scrapers for LinkedIn, Instagram, TikTok, YouTube, and 20+ other platforms that return structured data.


Description

An intelligent web scraping workflow that automatically routes URLs to site-specific extraction logic, normalizes data across multiple sources, and filters content by freshness to build a unified article feed.

What Makes This Different:

Key Benefits of Multi-Source Content Aggregation:


Who’s it for

This template is designed for content aggregators, news monitoring services, content marketers, SEO professionals, researchers, and anyone who needs to collect and normalize articles from multiple websites. It’s perfect for organizations that need to monitor competitor content, aggregate industry news, build content databases, track publication trends, or create unified article feeds without manually scraping each site or writing custom scrapers for every source.

How it works / What it does

This workflow creates a unified article aggregation system that reads URLs from Google Sheets, routes them to site-specific extractors, normalizes the data, filters by freshness, and saves results to a feed. The system:

  1. Reads Pending URLs - Fetches URLs with source identifiers from Google Sheets, filtering for entries with “Pending” status
  2. Processes with Rate Limiting - Loops through URLs one at a time with a 3-second delay between requests to respect server resources
  3. Fetches HTML Content - Downloads page HTML with browser-like request headers (User-Agent, Accept, Accept-Language) to reduce the chance of being blocked
  4. Routes by Source - Switch node directs URLs to specialized extractors (Site A, B, C, D) or universal fallback parser based on Source field
  5. Extracts Article Data - Site-specific HTML nodes use custom CSS selectors, while fallback uses regex patterns to extract title, description, author, date, image, and canonical URL
  6. Normalizes Data - Standardizes all extracted fields into consistent format, handling missing values and trimming whitespace
  7. Filters by Freshness - Validates publication dates and filters out articles older than 45 days (configurable threshold)
  8. Calculates Tier & Status - Assigns tier classification and freshness status based on article age
  9. Saves to Feed - Appends normalized articles to Article Feed sheet with all metadata
  10. Updates Status - Marks processed URLs as complete in source sheet for tracking
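
Steps 6 to 8 above can be sketched in a few lines of Python. This is a minimal illustration rather than the workflow's actual node code: the function names, the tier labels, the 7-day and 21-day tier boundaries, the status values, and the choice to reject future-dated articles are all assumptions made for the example. Only the 45-day threshold and the list of extracted fields come from the description.

```python
from datetime import datetime, timezone

MAX_AGE_DAYS = 45  # freshness threshold from the description (configurable)
FIELDS = ("title", "description", "author", "date", "image", "url")


def normalize(raw: dict) -> dict:
    """Return every expected field as a trimmed string ('' when missing)."""
    return {field: str(raw.get(field) or "").strip() for field in FIELDS}


def parse_date(value: str):
    """Parse an ISO 8601 date into an aware UTC datetime, or return None."""
    try:
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption: dates without a timezone are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def classify(article: dict, now: datetime, max_age_days: int = MAX_AGE_DAYS):
    """Drop invalid or stale articles; otherwise attach age, tier, and status."""
    published = parse_date(article["date"])
    if published is None:
        return None  # unparseable or missing date
    age_days = (now - published).days
    if age_days < 0 or age_days > max_age_days:
        return None  # future-dated (assumed invalid) or older than the threshold
    # Hypothetical tier boundaries; the description does not specify them.
    if age_days <= 7:
        tier, status = "Tier 1", "Fresh"
    elif age_days <= 21:
        tier, status = "Tier 2", "Recent"
    else:
        tier, status = "Tier 3", "Recent"
    return {**article, "age_days": age_days, "tier": tier, "status": status}
```

Passing `now` in explicitly, instead of calling `datetime.now()` inside `classify`, keeps the filter deterministic and easy to check against known dates.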

Key Innovation: Source-Based Routing - Instead of applying one set of extraction rules to every page, the workflow uses the Source field to route each URL to site-specific CSS selectors, and falls back to a universal regex parser for unknown sources. Known sites get selectors tuned to their markup, while new sources can be added as extra routes without changing the rest of the pipeline.
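
The routing idea can be illustrated outside n8n with a lookup table plus a regex fallback. The sketch below is an assumption-laden stand-in for the Switch node and the fallback parser: the selector strings, the `route` and `fallback_extract` names, and the choice of Open Graph and standard meta tags are all hypothetical, and the regexes only match tags whose attributes appear in the order shown.

```python
import re

# Hypothetical selector table; real selectors depend on each site's markup.
SITE_SELECTORS = {
    "Site A": {"title": "h1.article-title", "author": ".byline a", "date": "time[datetime]"},
    "Site B": {"title": "h1.post-heading", "author": ".author-name", "date": ".published"},
}

# Matches <meta property|name="..." content="..."> in that attribute order only.
META_RE = re.compile(
    r'<meta\s+(?:property|name)=["\']([^"\']+)["\']\s+content=(["\'])(.*?)\2',
    re.IGNORECASE | re.DOTALL,
)
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
CANONICAL_RE = re.compile(
    r'<link\s+rel=["\']canonical["\']\s+href=["\']([^"\']+)["\']', re.IGNORECASE
)


def route(source: str):
    """Pick an extraction strategy from a row's Source value."""
    selectors = SITE_SELECTORS.get(source.strip())
    if selectors is not None:
        return ("css", selectors)  # hand these to a CSS-selector extractor
    return ("fallback", None)      # unknown source: use the regex parser


def fallback_extract(html: str) -> dict:
    """Pull common article fields from meta tags, <title>, and the canonical link."""
    meta = {name.lower(): content for name, _quote, content in META_RE.findall(html)}
    title_match = TITLE_RE.search(html)
    canonical_match = CANONICAL_RE.search(html)
    return {
        "title": meta.get("og:title") or (title_match.group(1).strip() if title_match else ""),
        "description": meta.get("og:description") or meta.get("description", ""),
        "author": meta.get("author", ""),
        "date": meta.get("article:published_time", ""),
        "image": meta.get("og:image", ""),
        "url": canonical_match.group(1) if canonical_match else "",
    }
```

Adding a new source in this sketch means adding one entry to `SITE_SELECTORS`; any source without an entry still produces a record through `fallback_extract`, though regex-based parsing will miss pages that do not expose these tags.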

How to set up

1. Prepare Google Sheets

2. Configure Google Sheets Nodes

3. Customize Source Routing

4. Configure Freshness Threshold

5. Set Up Scheduling & Test

Requirements

🔗 Nodes Used

Google Sheets, HTTP Request, Schedule Trigger

📥 Import

Download workflow.json and import into n8n: Workflow menu → Import from File
