🔬 Web crawler: Convert websites to AI-ready markdown in Google Sheets

🔬 Document Extraction & Analysis



Description

Transform any website into a structured knowledge repository with this crawler: it extracts hyperlinks from the homepage, separates images from content pages, and aggregates each page's full content as Markdown. The output is well suited to feeding AI agents or building company dossiers without manual effort.

📋 What This Template Does

This workflow acts as a lightweight web crawler: it scrapes the homepage to discover internal links (approximating a sitemap extraction), deduplicates and validates them, separates image assets from textual pages, then fetches each non-image page and converts its content to clean Markdown. Results are appended to Google Sheets for analysis, export, or loading into a vector database.
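The link-discovery stage described above can be approximated outside n8n. The following is a minimal Python sketch, not the workflow's actual node logic: the function name `discover_links`, the `IMAGE_EXTENSIONS` list, and the rule that "internal" means "same host as the homepage" are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse

# Assumed set of image extensions; the workflow's own filter rule is not shown.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")


class LinkCollector(HTMLParser):
    """Collects raw href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def discover_links(homepage_url, html):
    """Return (pages, images): deduplicated same-host links split by type."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(homepage_url).netloc
    pages, images, seen = [], [], set()
    for href in collector.hrefs:
        # Resolve relative links and drop #fragments so duplicates collapse.
        url, _fragment = urldefrag(urljoin(homepage_url, href))
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or parts.netloc != host:
            continue  # skip mailto:, javascript:, and external sites
        if url in seen:
            continue
        seen.add(url)
        if parts.path.lower().endswith(IMAGE_EXTENSIONS):
            images.append(url)
        else:
            pages.append(url)
    return pages, images


# Hypothetical homepage markup used only to exercise the sketch.
sample_html = """
<a href="/about">About</a>
<a href="/about#team">Team</a>
<a href="/img/logo.png">Logo</a>
<a href="https://other.example.org/">Elsewhere</a>
<a href="mailto:hi@example.com">Mail</a>
"""
pages, images = discover_links("https://example.com", sample_html)
```

With the sample markup, `pages` holds only the `/about` URL (the `#team` variant collapses into it) and `images` holds the logo URL; the external and `mailto:` links are dropped.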

🔧 Prerequisites

🔑 Required Credentials

Google Sheets OAuth2 API Setup

  1. Go to console.cloud.google.com → APIs & Services → Credentials
  2. Click “Create Credentials” → Select “OAuth client ID” → Choose “Web application”
  3. Add authorized redirect URIs: https://your-n8n-instance.com/rest/oauth2-credential/callback (replace with your n8n URL)
  4. Copy the client ID and client secret, then add them to n8n under the “Google Sheets OAuth2 API” credential type
  5. During setup, grant access to Google Sheets scopes (e.g., spreadsheets) and test the connection by listing a sheet
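Google compares the redirect URI character for character, so a stray trailing slash in the n8n base URL is a common cause of a mismatch error. As a small illustration, this Python sketch builds the callback URL from step 3; the helper name `oauth_redirect_uri` is hypothetical, and the callback path is the one quoted in the steps above.

```python
def oauth_redirect_uri(n8n_base_url):
    """Build the OAuth callback URL for a given n8n base URL."""
    # Strip trailing slashes so the result never contains a double slash.
    return n8n_base_url.rstrip("/") + "/rest/oauth2-credential/callback"


# Placeholder host from the setup steps; replace with your own n8n URL.
redirect_uri = oauth_redirect_uri("https://your-n8n-instance.com/")
```

Paste the resulting value into the “Authorized redirect URIs” field exactly as produced.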

⚙️ Configuration Steps

  1. Import the workflow JSON into your n8n instance
  2. In the “Set Website” node, update the website_url value to your target site (e.g., https://example.com)
  3. Assign your Google Sheets credential to the three “Add … to Sheet” nodes
  4. Update the documentId and sheetName in those nodes to your target spreadsheet ID and sheet name/ID
  5. Ensure your sheet has columns: “Website”, “Links”, “Scraped Content”, “Images”
  6. Activate the workflow and trigger manually to test scraping
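Two details in these steps are easy to get wrong: the documentId is the long ID segment of the spreadsheet URL, and the header row must contain all four column names. The Python sketch below shows one way to check both before running the workflow. The helper names and the sample URL are hypothetical; the column names come from step 5.

```python
import re

# Column names taken from the configuration steps above.
COLUMNS = ["Website", "Links", "Scraped Content", "Images"]


def spreadsheet_id_from_url(sheet_url):
    """Pull the ID segment out of a .../spreadsheets/d/<ID>/... URL."""
    match = re.search(r"/spreadsheets/d/([a-zA-Z0-9_-]+)", sheet_url)
    if match is None:
        raise ValueError(f"No spreadsheet ID found in {sheet_url!r}")
    return match.group(1)


def missing_columns(header_row):
    """Return the expected columns that are absent from the sheet's header row."""
    present = {cell.strip() for cell in header_row}
    return [name for name in COLUMNS if name not in present]


# Hypothetical URL and header row, for illustration only.
doc_id = spreadsheet_id_from_url(
    "https://docs.google.com/spreadsheets/d/abc123_XYZ-9/edit#gid=0"
)
missing = missing_columns(["Website", "Links ", "Images"])
```

Here `doc_id` is the value to paste into documentId, and `missing` lists any headers that still need to be added to the sheet.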

🎯 Use Cases

⚠️ Troubleshooting

🔗 Nodes Used

Google Sheets, HTTP Request, Markdown, Filter
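To get a feel for what the Markdown step produces from fetched HTML, here is a deliberately small Python stand-in. It is not the n8n Markdown node's implementation: it handles only headings, paragraphs, and links, skips script and style content, and the names `MarkdownSketch` and `html_to_markdown` are hypothetical.

```python
import re
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class MarkdownSketch(HTMLParser):
    """Converts headings, paragraphs, and links; ignores script/style content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.href = None     # target of the link currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag in HEADINGS:
            self.parts.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            if self.href:
                self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a" and self.href:
            self.parts.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Collapse runs of whitespace but keep single separating spaces.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html):
    """Return a rough Markdown rendering of the given HTML string."""
    parser = MarkdownSketch()
    parser.feed(html)
    lines = [line.strip() for line in "".join(parser.parts).split("\n")]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


# Hypothetical page fragment used only to exercise the sketch.
markdown = html_to_markdown(
    '<h1>Title</h1>\n<p>Read <a href="/docs">the docs</a> now.</p>'
    "<script>var x = 1;</script>"
)
```

For real pages, a maintained converter will cope with lists, tables, and nested markup far better than this sketch.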

📥 Import

Download workflow.json and import into n8n: Workflow menu → Import from File
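Before importing, it can help to see which nodes the file contains so you know where credentials and the target URL need to go. This Python sketch assumes the usual n8n export layout, a top-level `nodes` array whose entries carry `name` and `type` fields. The sample data and the type strings in it are illustrative, not copied from this template's workflow.json.

```python
import json


def summarize_workflow(workflow):
    """Return (name, type) pairs for each node in an exported workflow dict."""
    return [
        (node.get("name"), node.get("type"))
        for node in workflow.get("nodes", [])
    ]


# Hypothetical excerpt standing in for the real file. To inspect the download:
#     with open("workflow.json", encoding="utf-8") as f:
#         workflow = json.load(f)
sample_workflow = json.loads(
    """
    {
      "nodes": [
        {"name": "Set Website", "type": "n8n-nodes-base.set"},
        {"name": "Fetch Homepage", "type": "n8n-nodes-base.httpRequest"}
      ]
    }
    """
)
summary = summarize_workflow(sample_workflow)
```

Each pair in `summary` gives a node's display name and its type, which makes the “Set Website” node and the Google Sheets nodes easy to spot.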
