🔬 Segment PDFs by table of contents with Gemini AI and Chunkr.ai

🔬 Document Extraction & Analysis

Description

Intelligently Segment PDFs by Table of Contents

This workflow processes a PDF document, identifies or generates a hierarchical Table of Contents (ToC), and then segments the document’s content by those ToC headings. The result is a large PDF broken down into its constituent sections, each paired with its heading and hierarchical level.

Why It’s Useful

Unlock the true structure of your PDFs for granular access and advanced processing:

How It Works

  1. Ingestion & Advanced Parsing: The workflow ingests a PDF (via a provided URL or a pre-set one for manual runs). It then uses Chunkr.ai to perform Optical Character Recognition (OCR) and parse the document into structural elements, extracting text, HTML, and Markdown for each segment.
  2. AI-Powered Table of Contents Generation: A Google Gemini AI model analyzes the initial pages of the document (where a ToC often resides) along with section headers extracted by Chunkr as a fallback. This allows it to construct an accurate, hierarchical Table of Contents in a structured JSON format, even if the PDF lacks an explicit ToC or if it’s poorly formatted.
  3. Precise Content Segmentation: Custom code then maps each AI-generated ToC heading to its corresponding content in the parsed document from Chunkr, determining where each section starts and ends.
  4. Structured & Flexible Output:
    • The primary output provides each identified section as an individual n8n item. Each item includes the heading text, its hierarchical level (e.g., 1, 1.1, 2), and the full content of that section in Text, HTML, and Markdown formats.
    • Optionally, the workflow can also reconstruct the entire document into a single, navigable HTML file or a clean Markdown file.
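
The segmentation in step 3 can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual code: the `TocEntry` type, the `segment_by_toc` function, and the shape of the parsed segments (a list of dicts with a `text` key) are all assumptions made for the example. It matches headings by exact normalized text and searches forward only, so a real implementation would likely need fuzzier matching and a way to skip the ToC page itself.

```python
from dataclasses import dataclass


@dataclass
class TocEntry:
    # Hypothetical shape of one AI-generated ToC entry.
    level: str    # hierarchical label such as "1", "1.1", "2"
    heading: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so minor spacing differences still match."""
    return " ".join(text.lower().split())


def segment_by_toc(toc, segments):
    """Split parsed segments into sections, one per ToC heading that is found.

    `segments` is a list of dicts with a "text" key (assumed shape).
    A section starts after the segment matching its heading and ends just
    before the segment matching the next heading that was found.
    """
    # Locate each heading in document order, searching forward only.
    starts = []
    cursor = 0
    for entry in toc:
        target = normalize(entry.heading)
        for i in range(cursor, len(segments)):
            if normalize(segments[i]["text"]) == target:
                starts.append((i, entry))
                cursor = i + 1
                break
        # Headings that are not found are skipped rather than guessed.

    sections = []
    for n, (start, entry) in enumerate(starts):
        end = starts[n + 1][0] if n + 1 < len(starts) else len(segments)
        body = "\n".join(s["text"] for s in segments[start + 1:end])
        sections.append({"level": entry.level, "heading": entry.heading, "text": body})
    return sections


toc = [TocEntry("1", "Introduction"), TocEntry("1.1", "Scope"), TocEntry("2", "Methods")]
segments = [
    {"text": "Title page"},
    {"text": "Introduction"},
    {"text": "Intro body."},
    {"text": "Scope"},
    {"text": "Scope body A."},
    {"text": "Scope body B."},
    {"text": "Methods"},
    {"text": "Methods body."},
]
result = segment_by_toc(toc, segments)
```

Each dict in `result` corresponds to one output item: the heading, its level, and the text between that heading and the next. Content before the first matched heading (the title page here) is not assigned to any section in this sketch.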

What You Need

To run this workflow, you’ll need:

Outputs

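The workflow primarily generates one n8n item per identified section, each containing the heading text, its hierarchical level, and the section’s content in Text, HTML, and Markdown formats.

Alternatively, you can configure the workflow to output the entire document reconstructed as a single, navigable HTML file or a clean Markdown file.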

This workflow is ideal for anyone looking to deconstruct PDFs into meaningful, manageable parts for advanced automation, AI integration, or detailed content analysis.

🔗 Nodes Used

HTTP Request, Google Drive, Convert to/from binary data, Stop and Error, Execute Workflow Trigger, AI Agent

📥 Import

Download workflow.json and import into n8n: Workflow menu → Import from File
