🚀 Process YouTube transcripts with Apify, OpenAI & Pinecone database

🔬 Document Extraction & Analysis

Description

🚀 YouTube Transcript Indexing Backend for Pinecone 🎥💾

This tutorial explains how to build an n8n backend workflow that indexes YouTube video transcripts into a Pinecone vector database. Note: this workflow only processes and indexes transcripts. The retrieval agent that searches these embeddings is implemented separately.


📋 Workflow Overview

This backend workflow performs the following tasks:

  1. Fetch Video Records from Airtable 📥
    Retrieves video URLs and related metadata.

  2. Scrape YouTube Transcripts Using Apify 🎬
    Triggers an Apify actor to scrape transcripts with timestamps from each video.

  3. Update Airtable with Transcript Data 🔄
    Stores the fetched transcript JSON back in Airtable, matching each record by video ID.

  4. Process & Chunk Transcripts ✂️
    Parses the transcript JSON, converts “mm:ss” timestamps to seconds, and groups entries into meaningful chunks. Each chunk is enriched with metadata—such as video title, description, start/end timestamps, and a direct URL linking to that video moment.

  5. Generate Embeddings & Index in Pinecone 💾
    Uses OpenAI to create vector embeddings for each transcript chunk and indexes them in Pinecone. This enables efficient semantic searches later by a separate retrieval agent.


🔧 Step-by-Step Guide

Step 1: Retrieve Video Records from Airtable 📥


Step 2: Scrape YouTube Transcripts Using Apify 🎬
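
As a rough illustration of what the HTTP Request node sends in this step, here is a minimal Python sketch that builds (but does not send) a call to Apify's "run actor synchronously and get dataset items" endpoint. The actor ID and the videoUrl input field are hypothetical placeholders: the original workflow does not name the actor, and every actor defines its own input schema, so check the one you pick.

```python
import json
import urllib.request

APIFY_BASE = "https://api.apify.com/v2"


def build_actor_request(actor_id: str, token: str, video_url: str) -> urllib.request.Request:
    """Build a POST request that runs an Apify actor and returns its dataset items.

    actor_id uses Apify's "username~actor-name" form; the value you pass is up to you.
    """
    endpoint = f"{APIFY_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    # Hypothetical input: the real field names depend on the actor's input schema.
    payload = {"videoUrl": video_url}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Passing the returned request to urllib.request.urlopen would actually trigger the run. In n8n, the same URL, headers, and JSON body go into the HTTP Request node instead.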


Step 3: Update Airtable with Transcript Data 🔄


Step 4: Process Transcripts into Semantic Chunks ✂️
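
The chunking described in the overview (convert "mm:ss" timestamps to seconds, group entries, attach metadata and a deep link) could look roughly like the sketch below. It is only an approximation under stated assumptions: the entry keys "timestamp" and "text", the max_chars threshold, and the simple size-based grouping are all illustrative choices, and the workflow itself relies on n8n's Recursive Character Text Splitter rather than this exact logic.

```python
from typing import Dict, List


def timestamp_to_seconds(ts: str) -> int:
    """Convert "mm:ss" to seconds (also accepts "hh:mm:ss" as a convenience)."""
    seconds = 0
    for part in ts.strip().split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def chunk_transcript(
    entries: List[Dict[str, str]],
    video_id: str,
    title: str,
    max_chars: int = 500,  # assumed threshold, tune for your embedding model
) -> List[Dict[str, object]]:
    """Group consecutive transcript entries into chunks of roughly max_chars."""
    chunks: List[Dict[str, object]] = []
    texts: List[str] = []
    start = end = 0

    def flush() -> None:
        chunks.append({
            "text": " ".join(texts),
            "video_title": title,
            "start_seconds": start,
            # Only entry start times are known, so "end" is the start of the last entry.
            "end_seconds": end,
            "url": f"https://www.youtube.com/watch?v={video_id}&t={start}s",
        })

    for entry in entries:
        t = timestamp_to_seconds(entry["timestamp"])
        if not texts:
            start = t  # first entry of a new chunk sets its start time
        texts.append(entry["text"].strip())
        end = t
        if sum(len(x) for x in texts) >= max_chars:
            flush()
            texts = []
    if texts:  # emit whatever is left after the loop
        flush()
    return chunks
```

Each returned dictionary carries the chunk text plus the metadata that later ends up next to the vector, so a retrieval agent can link straight to the matching moment in the video.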


Step 5: Generate Embeddings & Index in Pinecone 💾
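
To make the indexing step more concrete, the sketch below pairs each chunk with its embedding and shapes the result as records with id, values, and metadata, which is the record layout Pinecone accepts for upserts. The embed callable is an assumption standing in for the Embeddings OpenAI node, and the id_prefix naming scheme is hypothetical. No network calls are made here.

```python
from typing import Callable, Dict, List, Sequence

# Stand-in type for an embedding function: a list of texts in, one vector per text out.
EmbedFn = Callable[[List[str]], Sequence[Sequence[float]]]


def build_pinecone_records(
    chunks: List[Dict[str, object]],
    embed: EmbedFn,
    id_prefix: str,
) -> List[Dict[str, object]]:
    """Attach an embedding to every chunk and return upsert-ready records."""
    vectors = embed([str(chunk["text"]) for chunk in chunks])
    if len(vectors) != len(chunks):
        raise ValueError("embed() must return exactly one vector per chunk")

    records: List[Dict[str, object]] = []
    for i, (chunk, vector) in enumerate(zip(chunks, vectors)):
        records.append({
            "id": f"{id_prefix}-{i}",  # hypothetical ID scheme: <video id>-<chunk index>
            "values": list(vector),
            "metadata": dict(chunk),  # keeps text, timestamps, and URL with the vector
        })
    return records
```

In a live setup, embed would wrap a call to OpenAI's embeddings endpoint and the returned records would be handed to a Pinecone index upsert. Inside n8n, the Embeddings OpenAI and Pinecone Vector Store nodes cover both parts without custom code.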


🎉 Final Thoughts

This backend workflow is dedicated to processing and indexing YouTube video transcripts so that a separate retrieval agent can perform efficient semantic searches.

Happy automating and enjoy building powerful search capabilities with your YouTube content! 🎉

🔗 Nodes Used

Airtable, HTTP Request, Embeddings OpenAI, Recursive Character Text Splitter, Pinecone Vector Store, Default Data Loader

📥 Import

Download workflow.json and import into n8n: Workflow menu → Import from File
