📊 Discover article URLs from any website with GPT-5-mini and Google Sheets

💡 Pro Tip — HTTP Request scraping tends to break when sites update their markup. If you’re scraping a major platform, check if ScraperNode covers it — it has maintained scrapers for LinkedIn, Instagram, TikTok, YouTube, and 20+ other platforms that return structured data.

Description

Automatically discover and extract article URLs from any website. An AI model identifies valid content links and filters out navigation, category pages, and other irrelevant links, which makes the workflow a good fit for content pipelines, news aggregators, and research databases.

What Makes This Different:

Key Benefits of AI-Powered Content Discovery:


Who’s it for

This template is designed for content marketers, SEO professionals, researchers, media monitors, and anyone who needs to aggregate content from multiple sources. It’s perfect for organizations that need to track competitor blogs, curate industry news, build research databases, monitor brand mentions, or aggregate content for newsletters without manually checking dozens of websites daily or writing complex scraping rules for each source.

How it works / What it does

This workflow creates an intelligent content discovery pipeline that automatically finds and extracts article URLs from any webpage. The system:

  1. Reads Seed URLs - Pulls a list of webpages to crawl from your Google Sheets (blog indexes, news feeds, publication homepages)
  2. Fetches with Stealth - Downloads each webpage’s HTML using browser-like headers (such as a realistic User-Agent) to reduce the chance of being blocked by basic bot detection
  3. Converts for AI - Transforms messy HTML into clean Markdown that the AI can easily process
  4. AI Extraction - GPT-5-mini analyzes the content and identifies valid article URLs while filtering out navigation, categories, and junk links
  5. Normalizes & Saves - Cleans URLs (removes tracking params), deduplicates, and saves to Google Sheets with source tracking
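
As a rough illustration of step 5, the sketch below shows one way URL cleaning and deduplication could work. It is not the template's actual code: the function names, the list of tracking parameters, and the choice to strip trailing slashes are all assumptions made for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; the workflow's real list is not documented here.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid", "mc_cid", "mc_eid"}


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and tracking parameters."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_KEYS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    # Assumption: a trailing slash does not distinguish two articles.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def dedupe_with_source(urls: list[str], source: str) -> list[dict[str, str]]:
    """Normalize each URL, keep the first occurrence, and tag it with its seed page."""
    seen: set[str] = set()
    rows = []
    for url in urls:
        clean = normalize_url(url)
        if clean not in seen:
            seen.add(clean)
            rows.append({"url": clean, "source": source})
    return rows
```

Each returned row pairs a cleaned URL with the seed page it came from, which matches the "source tracking" idea in step 5. Whether the sheet uses exactly these two columns is not stated in the template description.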

Key Innovation: Context-Aware Link Filtering - Unlike traditional scrapers that rely on CSS selectors or URL patterns (which break when sites update), the AI understands the semantic difference between an article link and a navigation link. It reads the page like a human would, identifying content worth following regardless of the website’s structure.
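
Because the filtering decision is left to the model, the surrounding code only has to validate what comes back. The sketch below assumes the prompt asks the model for a JSON array of link strings; that output format and the function name `parse_model_links` are assumptions for illustration, not something the template documents.

```python
import json
from urllib.parse import urljoin, urlsplit


def parse_model_links(reply: str, page_url: str) -> list[str]:
    """Turn the model's reply into absolute http(s) URLs.

    Assumes the reply is a JSON array of link strings (hypothetical format).
    """
    try:
        candidates = json.loads(reply)
    except json.JSONDecodeError:
        return []  # Treat a malformed reply as "no links found".
    if not isinstance(candidates, list):
        return []

    links = []
    for item in candidates:
        if not isinstance(item, str) or not item.strip():
            continue  # Skip non-string or empty entries.
        absolute = urljoin(page_url, item.strip())  # Resolve relative links.
        if urlsplit(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links
```

Resolving relative links against the seed page and discarding non-web schemes (for example `mailto:`) keeps obviously unusable entries out of the sheet even when the model's output is imperfect.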

How to set up

1. Create Your Google Sheets Database

2. Connect Your Credentials

3. Customize the AI Prompt (Optional)

4. Test Your Configuration

5. Schedule and Monitor

Requirements

🔗 Nodes Used

Google Sheets, HTTP Request, Markdown, Schedule Trigger, AI Agent, OpenAI Chat Model

📥 Import

Download workflow.json and import into n8n: Workflow menu → Import from File
