🔬 Build a PDF search system with Mistral OCR and Weaviate DB

⚡ 1,346 views · 🔬 Document Extraction & Analysis

Description

Build a PDF to Vector RAG System: Mistral OCR, Weaviate Database and MCP Server

A comprehensive RAG (Retrieval-Augmented Generation) workflow that transforms PDF documents into searchable vector embeddings using advanced AI technologies.

🚀 Features

PDF Document Processing: Upload and extract text from PDF files using Mistral’s OCR capabilities
Vector Database Storage: Store document embeddings in Weaviate vector database for efficient retrieval
AI-Powered Search: Search through documents using semantic similarity with Cohere embeddings
MCP Server Integration: Expose the knowledge base as an AI tool through MCP (Model Context Protocol)
Document Metadata: Basic document metadata including filename, content, source, and upload timestamp
Text Chunking: Automatic text splitting for optimal vector storage and retrieval

🛠️ Technologies Used

Mistral AI: OCR and text extraction from PDF documents
Weaviate: Vector database for storing and retrieving document embeddings
Cohere: Multilingual embeddings and reranking for improved search accuracy
MCP (Model Context Protocol): AI tool integration for external AI workflows
n8n: Workflow automation and orchestration

📋 Prerequisites

Before using this template, you’ll need to set up the following credentials:

Mistral Cloud API: For PDF text extraction
Weaviate API: For vector database operations
Cohere API: For embeddings and reranking
HTTP Header Auth: For MCP server authentication

🔧 Setup Instructions

Import the template into your n8n instance
Configure credentials for all required services
Set up Weaviate collection named “KnowledgeDocuments”
Configure webhook paths for the MCP server and form trigger
Test the workflow by uploading a PDF document

📊 Workflow Overview

PDF Upload → Text Extraction → Document Processing → Vector Storage → AI Search
     ↓              ↓                ↓                ↓              ↓
  Form Trigger → Mistral OCR → Prepare Metadata → Weaviate DB → MCP Server

🎯 Use Cases

Knowledge Base Management: Create searchable repositories of company documents
Research Documentation: Process and search through research papers and reports
Legal Document Search: Index and search through legal documents and contracts
Technical Documentation: Make technical manuals and guides searchable
Academic Literature: Process and search through academic papers and publications

⚠️ Important Notes

Model Consistency: Use the same embedding model for both storage and retrieval
Collection Management: Ensure your Weaviate collection is properly configured
API Limits: Be aware of rate limits for Mistral, Cohere, and Weaviate APIs
Document Size: Consider chunking large documents for optimal processing

📝 License

This template is provided as-is for educational and commercial use.

🔗 Nodes Used

Embeddings Cohere, Recursive Character Text Splitter, n8n Form Trigger, Default Data Loader, MCP Server Trigger, Reranker Cohere

📥 Import

Download workflow.json and import into n8n: Workflow menu → Import from File

📖 Importing guide · 🔑 Credential setup