🔬 Extract & store recently added n8n community workflows with ScrapeGraphAI and Gemini



Description

This is an example of an advanced automated data extraction and enrichment pipeline built with ScrapeGraphAI. It scrapes the n8n community workflows website, extracts detailed information about recently added workflows, processes that data with multiple AI models, and stores the structured results in a Google Sheets spreadsheet.

This workflow shows how n8n can go beyond simple API calls into AI-driven web scraping and data processing, turning unstructured website content into structured, reusable data.


How It Works

The workflow operates in two main phases:

Phase 1: Scraping the Main List

  1. Trigger: The workflow can be started manually (“Execute Workflow”) or automatically on a schedule.
  2. Scraping: The “Scrape main page” node (using ScrapeGraphAI) fetches the https://n8n.io/workflows/ page and converts it to clean Markdown.
  3. Data Extraction: An LLM chain (“Extract ‘Recently added’”) analyzes the Markdown. It is specifically instructed to identify all workflow titles and URLs within the “Recently Added” section and output them as a structured JSON array named workflows.
  4. Data Preparation: The resulting array is set as a variable and then split out into individual items, preparing them for processing one-by-one.
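
The data preparation step above can be illustrated outside n8n with a minimal Python sketch. It is not the workflow's own node logic: only the top-level array name workflows comes from the description, while the title and url key names, the split_workflows helper, and the sample payload are assumptions made for illustration.

```python
import json


def split_workflows(llm_output: str) -> list[dict]:
    """Parse the LLM chain's JSON output and return one item per workflow."""
    data = json.loads(llm_output)
    workflows = data.get("workflows", [])
    if not isinstance(workflows, list):
        raise ValueError("expected 'workflows' to be a JSON array")

    items = []
    for entry in workflows:
        # Skip entries without a URL: Phase 2 needs one to scrape the detail page.
        if isinstance(entry, dict) and entry.get("url"):
            items.append({"title": entry.get("title", ""), "url": entry["url"]})
    return items


# Hypothetical payload; the key names and URL are placeholders, not real data.
sample = (
    '{"workflows": ['
    '{"title": "Example A", "url": "https://example.com/workflows/a"},'
    '{"title": "Entry without a URL"}'
    "]}"
)
items = split_workflows(sample)
```

Each element of items corresponds to one item handed to the loop in Phase 2. Dropping entries that lack a URL is a design choice of this sketch, so that a partially malformed LLM response does not stop the whole run.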

Phase 2: Processing Individual Workflows

  1. Loop: The “Loop Over Items” node iterates through each workflow URL obtained from Phase 1.
  2. Scrape & Clean Detail Page: For each URL, the “Scrape single Workflow” node fetches the detail page. Another LLM chain (“Main content”) cleans the resulting Markdown, removing superfluous content and focusing only on the core article text.
  3. Information Extraction: The cleaned Markdown is passed to an “Information Extractor” node. This uses a language model to locate and structure specific data points (title, URL, ID, author, categories, price) into a defined JSON schema.
  4. Summarization: The cleaned Markdown is also sent to a Google Gemini node (“Summarization content”), which generates a concise Italian summary of the workflow’s purpose and tools used.
  5. Data Consolidation & Export: The extracted information and the generated summary are merged into a single data object. Finally, the “Add row” node maps all this data to the appropriate columns and appends it as a new row in a designated Google Sheet.
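
The consolidation step above can be sketched in Python as follows. This is an illustrative approximation, not the actual “Add row” mapping: the field names follow the data points listed in step 3, but the summary key, the column order, the build_row helper, and the comma-joined format for list values are all assumptions.

```python
# Data points named in the Information Extraction step; the order is assumed.
FIELDS = ["title", "url", "id", "author", "categories", "price"]


def build_row(extracted: dict, summary: str) -> dict:
    """Merge extracted fields and the generated summary into one flat row."""
    row = {}
    for field in FIELDS:
        value = extracted.get(field, "")
        # A spreadsheet cell holds a single value, so flatten lists
        # (comma-joined here, which is an assumed format).
        if isinstance(value, list):
            value = ", ".join(str(v) for v in value)
        row[field] = value
    # "summary" is a hypothetical column name for the Gemini output.
    row["summary"] = summary.strip()
    return row
```

Missing fields become empty strings, so every row has the same set of columns even when the extractor cannot find a value such as the price.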

Set Up Steps

To run this workflow, you need to configure the following credentials in your n8n instance:

  1. ScrapeGraphAI Account: The “Scrape main page” and “Scrape single Workflow” nodes require valid ScrapeGraphAI API credentials named ScrapegraphAI account. You also need to install the ScrapeGraphAI community node.
  2. Google Gemini Account: Multiple nodes (“Google Gemini Chat Model”, “Summarization content”, etc.) require API credentials for Google Gemini named Google Gemini(PaLM) (Eure).
  3. OpenAI Account: The “OpenAI Chat Model1” node requires API credentials for OpenAI named OpenAi account (Eure).
  4. Google Sheets Account: The “Add row” node requires OAuth2 credentials for Google Sheets named Google Sheets account. You must also ensure the node is configured with the correct Google Sheet ID and that the sheet has a worksheet named Foglio1 (or update the node to match your sheet’s name).

Need help customizing?

Contact me for consulting and support, or add me on LinkedIn.

🔗 Nodes Used

Google Sheets, Schedule Trigger, Basic LLM Chain, OpenAI Chat Model, Structured Output Parser, Google Gemini Chat Model

📥 Import

Download workflow.json and import into n8n: Workflow menu → Import from File

📖 Importing guide · 🔑 Credential setup