πŸ”¬ Multi-format document processing for RAG chatbot with Google Drive & Supabase

⚑ 635 views Β· πŸ”¬ Document Extraction & Analysis

Description

This n8n workflow is the data ingestion pipeline for the β€œRAG System V2” chatbot. It automatically monitors a specific Google Drive folder for new files, processes them based on their type, and inserts their content into a Supabase vector database to make it searchable for the RAG agent.

Key Features & Workflow:

Google Drive Trigger: The workflow starts automatically when a new file is created in a designated folder (named β€œDOCUMENTS” in this template).

Smart File Handling: A Switch node routes the file based on its MIME type (e.g., PDF, Excel, Google Doc, Word Doc) for correct processing.

Multi-Format Extraction:

PDF: Text is extracted directly using the Extract PDF Text node.

Google Docs: Files are downloaded and converted to plain text (text/plain) and processed by the Extract from Text File node.

Excel: Data is extracted, aggregated, and concatenated into a single text block for embedding.

Word (.doc/.docx): Word files are automatically converted into Google Docs format using an HTTP Request. This newly created Google Doc will then trigger the entire workflow again, ensuring it’s processed correctly.

Chunking & Metadata Enrichment: The extracted text is split into manageable chunks using the Recursive Character Text Splitter (set to 2000-character chunks). The Enhanced Default Data Loader then enriches these chunks with crucial metadata from the original file, such as file_name, creator, and created_at.

Vectorization & Storage: Finally, the workflow uses OpenAI Embeddings to create vector representations of the text chunks and inserts them into the Supabase Vector Store.

πŸ”— Nodes Used

HTTP Request, Google Drive, Google Drive Trigger, Embeddings OpenAI, Recursive Character Text Splitter, Supabase Vector Store

πŸ“₯ Import

Download workflow.json and import into n8n: Workflow menu β†’ Import from File

πŸ“– Importing guide Β· πŸ”‘ Credential setup