π¬ Multi-format document processing for RAG chatbot with Google Drive & Supabase
β‘ 635 views Β· π¬ Document Extraction & Analysis
Description
This n8n workflow is the data ingestion pipeline for the βRAG System V2β chatbot. It automatically monitors a specific Google Drive folder for new files, processes them based on their type, and inserts their content into a Supabase vector database to make it searchable for the RAG agent.
Key Features & Workflow:
Google Drive Trigger: The workflow starts automatically when a new file is created in a designated folder (named βDOCUMENTSβ in this template).
Smart File Handling: A Switch node routes the file based on its MIME type (e.g., PDF, Excel, Google Doc, Word Doc) for correct processing.
Multi-Format Extraction:
PDF: Text is extracted directly using the Extract PDF Text node.
Google Docs: Files are downloaded and converted to plain text (text/plain) and processed by the Extract from Text File node.
Excel: Data is extracted, aggregated, and concatenated into a single text block for embedding.
Word (.doc/.docx): Word files are automatically converted into Google Docs format using an HTTP Request. This newly created Google Doc will then trigger the entire workflow again, ensuring itβs processed correctly.
Chunking & Metadata Enrichment: The extracted text is split into manageable chunks using the Recursive Character Text Splitter (set to 2000-character chunks). The Enhanced Default Data Loader then enriches these chunks with crucial metadata from the original file, such as file_name, creator, and created_at.
Vectorization & Storage: Finally, the workflow uses OpenAI Embeddings to create vector representations of the text chunks and inserts them into the Supabase Vector Store.
π Nodes Used
HTTP Request, Google Drive, Google Drive Trigger, Embeddings OpenAI, Recursive Character Text Splitter, Supabase Vector Store
π₯ Import
Download workflow.json and import into n8n:
Workflow menu β Import from File