⚒️ Create AI-ready vector datasets for LLMs with Bright Data, Gemini & Pinecone



Description

Who is this for?

This workflow automates the scalable collection of high-quality, AI-ready data from websites using Bright Data’s Web Unlocker, with a focus on preparing that data for LLM training. Using LLM chains and AI agents, it extracts and formats key information, then stores the resulting embeddings in a Pinecone vector database.

This workflow is tailored for:

What problem is this workflow solving?

Training a large language model (LLM) requires vast amounts of clean, relevant, and structured data. Manual collection is slow, error-prone, and lacks scalability.

This workflow:

What this workflow does

This workflow automates collecting, cleaning, and vectorizing web content to create structured, high-quality datasets that are ready for LLM training or retrieval-augmented generation (RAG).

  1. Web Crawling with Bright Data Web Unlocker.
  2. AI Information Extraction and Data Formatting.
  3. AI Data Formatting to produce JSON-structured data.
  4. Persistence in Pinecone Vector DB.
  5. Webhook notification of the structured data.
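
As a rough illustration of steps 3 and 4, the sketch below turns extracted fields and text chunks into records with the `id` / `values` / `metadata` shape that Pinecone upserts expect. It is not taken from the workflow itself: `fake_embed`, `build_records`, and the metadata field names are hypothetical, and the embedder is a deterministic stand-in for a real embedding model.

```python
import hashlib
from typing import Callable


def fake_embed(text: str, dim: int = 8) -> list[float]:
    """Placeholder embedder (assumption): derives a pseudo-vector from a hash.

    A real pipeline would call an embedding model here instead.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def build_records(
    url: str,
    extracted: dict,
    chunks: list[str],
    embed: Callable[[str], list[float]] = fake_embed,
) -> list[dict]:
    """Build one vector record per text chunk.

    `extracted` should stay flat (strings, numbers, booleans, lists of strings),
    since vector-store metadata is generally limited to simple value types.
    """
    # Stable prefix so re-processing the same URL overwrites its old records.
    url_key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    records = []
    for index, chunk in enumerate(chunks):
        records.append(
            {
                "id": f"{url_key}-{index}",
                "values": embed(chunk),
                "metadata": {
                    "source_url": url,
                    "chunk_index": index,
                    "text": chunk,
                    **extracted,
                },
            }
        )
    return records
```

Deriving the record ID from the URL and chunk index is one possible design choice: it keeps repeated runs over the same page idempotent instead of accumulating duplicate vectors.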

Setup

How to customize this workflow to your needs

  1. Set Your Target URLs. Target sites that are high-quality, domain-specific, and relevant to your LLM’s purpose.
  2. Adjust Bright Data Web Unlocker Settings. Configure geo-location, headers and User-Agent strings, retry rules, and proxies.
  3. Modify the Information Extraction Logic. Change prompts to extract specific attributes. Use structured templates or few-shot examples in prompts.
  4. Swap the Embedding Model. Use OpenAI, Hugging Face, or your own hosted embedding model API.
  5. Customize Pinecone Metadata Fields. Store extra fields in Pinecone for better filtering & semantic querying.
  6. Add Data Validation or Deduplication. Skip duplicates or low-quality content.
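
For the validation and deduplication step, one possible approach is to fingerprint each page's normalized text and skip anything too short or already seen. The sketch below assumes each page is a dict with `url` and `text` keys; `filter_pages`, `normalize`, and the `MIN_CHARS` threshold are hypothetical names and values, not part of the workflow.

```python
import hashlib
import re

MIN_CHARS = 200  # hypothetical quality threshold; tune for your content


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def filter_pages(pages: list[dict], min_chars: int = MIN_CHARS) -> list[dict]:
    """Keep the first copy of each sufficiently long page, in input order."""
    seen: set[str] = set()
    kept = []
    for page in pages:
        normalized = normalize(page["text"])
        if len(normalized) < min_chars:
            continue  # too short to be useful training or retrieval data
        fingerprint = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if fingerprint in seen:
            continue  # exact duplicate after normalization
        seen.add(fingerprint)
        kept.append(page)
    return kept
```

This only catches exact duplicates after normalization. Near-duplicate detection (for example, pages that differ only in a timestamp) would need a fuzzier technique.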

🔗 Nodes Used

HTTP Request, AI Agent, Basic LLM Chain, Structured Output Parser, Recursive Character Text Splitter, Pinecone Vector Store
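
To give a feel for what the Recursive Character Text Splitter does before content is embedded, here is a simplified sketch of the idea: try the coarsest separator first, and fall back to finer ones only for pieces that are still too long. It is an approximation written for illustration, not the node's actual implementation, and it leaves out chunk overlap; `recursive_split` and its defaults are assumptions.

```python
def recursive_split(
    text: str,
    chunk_size: int = 200,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No separators left: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, rest = separators[0], separators[1:]
    if separator not in text:
        return recursive_split(text, chunk_size, rest)

    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph breaks before lines and words tends to keep related sentences in the same chunk, which usually helps retrieval quality.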

📥 Import

Download workflow.json and import into n8n: Workflow menu → Import from File

📖 Importing guide · 🔑 Credential setup