📊 Automatically discover and extract reports from websites using GPT and Google Sheets

📊 Market Research & Insights

Description

This workflow runs an AI agent that fetches publication websites, analyzes the page content with a language model, and identifies the latest downloadable reports, research papers, and data files across multiple sources. Results come back in a fixed format through structured output parsing.

What Makes This Different:

Key Benefits of AI-Powered Report Discovery:


Who’s it for

This template is designed for researchers, market analysts, competitive intelligence teams, academic institutions, industry monitoring services, and anyone who needs to systematically discover and track downloadable reports from multiple publication sources. It suits organizations that monitor industry publications, track competitor research, discover new market reports, build research libraries, or want to stay current on the latest publications without visiting dozens of websites by hand every day.

How it works / What it does

This workflow creates an AI-powered report discovery system that reads publication source URLs from Google Sheets, fetches their pages, uses AI to analyze content, and extracts information about downloadable reports. The system:

  1. Reads Active Sources - Fetches publication URLs and metadata from the “Report Sources” sheet in Google Sheets
  2. Loops Through Sources - Processes sources one at a time using Split in Batches, so an error on one source stays isolated and does not fail the whole batch
  3. Fetches Publication Pages - Downloads HTML content from each source URL with browser-like headers (User-Agent, Accept, Accept-Language) to reduce the chance of being blocked
  4. Converts HTML to Markdown - Transforms raw HTML into clean Markdown, removing styling, scripts, and navigation elements to improve AI comprehension
  5. AI Analysis - A LangChain agent analyzes the Markdown content using GPT-4/GPT-5.1, identifying downloadable reports based on context, link patterns, and content structure
  6. Structured Output Parsing - Enforces JSON schema validation, so the AI returns data in the exact format: source, title, link, file_type, description
  7. Validates & Normalizes Output - Checks that extracted links are absolute URLs, checks file type indicators, determines report validity, and normalizes all fields
  8. Routes by Validity - An IF node routes valid reports to the save operation and invalid or missing reports to logging
  9. Saves Discovered Reports - Appends valid reports to the “Discovered Reports” sheet in Google Sheets with metadata, source URL, category, and discovery timestamp
  10. Logs No Report Found - Records sources where no valid report was found in the “Discovery Log” sheet for monitoring and troubleshooting
  11. Tracks Completion - Generates a completion summary with the number of sources checked and the processing timestamp
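
Steps 7 and 8 can be pictured with a short standalone sketch. This is not the code inside the workflow's nodes, only an illustration of the same checks in plain Python. The five field names come from step 6; the list of accepted file types, the rule that a report needs a non-empty title, and the function names are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Assumed set of "downloadable" file types; the workflow's real list is not given.
REPORT_FILE_TYPES = {"pdf", "doc", "docx", "xls", "xlsx", "csv", "ppt", "pptx", "zip"}
FIELDS = ("source", "title", "link", "file_type", "description")


def is_absolute_url(link: str) -> bool:
    """True only for http(s) URLs that include a host."""
    parts = urlparse(link)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def guess_file_type(link: str) -> str:
    """Take the extension from the URL path, e.g. '/a/report.PDF' -> 'pdf'."""
    name = urlparse(link).path.rsplit("/", 1)[-1]
    return name.rsplit(".", 1)[-1].lower() if "." in name else ""


def normalize_report(raw: dict) -> dict:
    """Trim every field, lower-case the file type, and flag validity."""
    report = {key: str(raw.get(key) or "").strip() for key in FIELDS}
    # Fall back to the link's extension when the model left file_type empty.
    file_type = report["file_type"].lower().lstrip(".") or guess_file_type(report["link"])
    report["file_type"] = file_type
    report["is_valid"] = (
        bool(report["title"])  # assumed rule: a report must have a title
        and is_absolute_url(report["link"])
        and file_type in REPORT_FILE_TYPES
    )
    return report


def route(reports: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split into (to_save, to_log), mirroring the IF node in step 8."""
    normalized = [normalize_report(r) for r in reports]
    to_save = [r for r in normalized if r["is_valid"]]
    to_log = [r for r in normalized if not r["is_valid"]]
    return to_save, to_log
```

In this sketch a relative link such as `/files/report.pdf` is rejected rather than resolved against the source URL, which matches the wording of step 7 (links are validated as absolute, not rewritten). A real deployment might prefer to resolve such links instead.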

Key Innovation: AI-Powered Context Understanding - Unlike traditional web scrapers that rely on fixed CSS selectors or regex patterns, this workflow uses AI to understand page context and semantics. The AI can identify reports even when they’re embedded in complex layouts, use non-standard naming, or require understanding of surrounding text to determine relevance. This lets the workflow adapt to a wide range of website structures without per-site selector configuration.
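
Because the model is free to interpret each page differently, the fixed output format from step 6 is what keeps downstream nodes predictable. The sketch below shows the kind of shape check involved, written in plain Python. It is not the n8n Structured Output Parser's implementation. The five field names are taken from step 6; treating every field as a string and rejecting extra keys are assumptions, and `parse_agent_output` and `OutputFormatError` are hypothetical names.

```python
import json

# The five output fields named in step 6; all are assumed to be strings.
REQUIRED_FIELDS = ("source", "title", "link", "file_type", "description")


class OutputFormatError(ValueError):
    """Raised when the model's reply does not match the expected shape."""


def parse_agent_output(raw_text: str) -> dict:
    """Parse the model's JSON reply and check it has exactly the expected fields."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise OutputFormatError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise OutputFormatError("expected a JSON object")
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    extra = [k for k in data if k not in REQUIRED_FIELDS]
    if missing or extra:
        raise OutputFormatError(f"missing={missing} extra={extra}")
    wrong_type = [f for f in REQUIRED_FIELDS if not isinstance(data[f], str)]
    if wrong_type:
        raise OutputFormatError(f"non-string fields: {wrong_type}")
    return data
```

A reply that passes this check is only well-formed. Whether its link is usable or its file type is a real report format is a separate question, handled by the validation step.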

How to set up

1. Prepare Google Sheets

2. Configure Google Sheets Nodes

3. Set Up OpenAI Credentials

4. Configure AI Agent & Output Parser

5. Customize Discovery Rules (Optional)

6. Set Up Scheduling & Test

Requirements

🔗 Nodes Used

Google Sheets, HTTP Request, Markdown, Execute Workflow Trigger, Schedule Trigger, AI Agent

📥 Import

Download workflow.json and import into n8n: Workflow menu → Import from File

📖 Importing guide · 🔑 Credential setup