⚒️ Migrate large Hugging Face datasets to MongoDB with a looping subworkflow

⚒️ Engineering

Description

This n8n template provides a production-ready pipeline for ingesting large Hugging Face datasets into MongoDB. It stays memory-safe by paginating: rows are fetched and inserted one batch at a time instead of loading the whole dataset at once.
It is designed as a reusable data ingestion layer for RAG systems, recommendation engines, analytics pipelines, and ML workflows.

The template includes:


🚀 What This Template Does


🧩 Architecture Overview

Main Workflow (Orchestrator)

Subworkflow (Batch Processor)


🔁 Workflow Logic (High-Level)

  1. Set initial configuration:
    • Dataset name
    • Split (train, test, etc.)
    • Batch size
    • Offset
  2. Fetch rows from Hugging Face
  3. If rows exist:
    • Split rows into items
    • Remove _id
    • Insert into MongoDB
  4. Increase offset
  5. Repeat until no rows are returned
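
The loop above can be sketched outside n8n as plain Python. This is only an illustration of the control flow, not the workflow's actual implementation: `migrate`, `fetch_rows`, and `insert_rows` are hypothetical names, and the fetch and insert steps are passed in as callables so the sketch runs without a network or database connection.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def migrate(
    fetch_rows: Callable[[int, int], List[Row]],
    insert_rows: Callable[[List[Row]], None],
    batch_size: int = 100,
    start_offset: int = 0,
) -> int:
    """Page through a dataset, insert each batch, and return the row count."""
    offset = start_offset
    total = 0
    while True:
        rows = fetch_rows(offset, batch_size)  # step 2: fetch one page
        if not rows:                           # step 5: an empty page ends the loop
            break
        # step 3: drop any source _id before handing the batch to the sink
        cleaned = [{k: v for k, v in row.items() if k != "_id"} for row in rows]
        insert_rows(cleaned)
        total += len(cleaned)
        offset += batch_size                   # step 4: advance to the next page
    return total


if __name__ == "__main__":
    # In-memory stand-ins for Hugging Face (source) and MongoDB (sink).
    source = [{"_id": i, "name": f"listing-{i}"} for i in range(250)]
    sink: List[Row] = []
    inserted = migrate(lambda off, n: source[off:off + n], sink.extend)
    print(inserted, len(sink))
```

Only one batch is held in memory per iteration, which is the property the template relies on for large datasets.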

📦 Default Configuration

Parameter             Default Value
Dataset               MongoDB/airbnb_embeddings
Config                default
Split                 train
Batch Size            100
MongoDB Collection    airbnb

All values can be changed easily from the Config_Start node.
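
For orientation, here is a hedged sketch of how these defaults could map onto a paged request. It assumes the rows are read from the public Hugging Face dataset viewer API (`https://datasets-server.huggingface.co/rows`); the template's HTTP Request node may be configured differently. `IngestConfig` and `rows_url` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

# Assumption: the public dataset viewer endpoint; the workflow's own
# HTTP Request node may point at a different URL.
ROWS_ENDPOINT = "https://datasets-server.huggingface.co/rows"


@dataclass(frozen=True)
class IngestConfig:
    """Mirrors the default values listed in the table above."""
    dataset: str = "MongoDB/airbnb_embeddings"
    config: str = "default"
    split: str = "train"
    batch_size: int = 100
    collection: str = "airbnb"


def rows_url(cfg: IngestConfig, offset: int) -> str:
    """Build the request URL for one page of rows starting at `offset`."""
    query = urlencode({
        "dataset": cfg.dataset,
        "config": cfg.config,
        "split": cfg.split,
        "offset": offset,
        "length": cfg.batch_size,
    })
    return f"{ROWS_ENDPOINT}?{query}"


if __name__ == "__main__":
    print(rows_url(IngestConfig(), offset=0))
```

If that endpoint is the one in use, note that its documentation describes a maximum page length of 100 rows, which would make the default batch size the largest value it accepts.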


🛠 Prerequisites


▶️ How to Use

  1. Import the workflow JSON into n8n
  2. Configure MongoDB credentials in the MongoDB node
  3. Update dataset parameters if needed:
    • Dataset name
    • Split
    • Batch size
    • Collection name
  4. Run the workflow using the Manual Trigger
  5. Monitor execution until completion

🧠 Why _id Is Removed

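Hugging Face dataset rows can include an _id field.
MongoDB requires _id to be unique within a collection, so inserting rows that reuse these values can fail with a duplicate key error, for example when the workflow is re-run against a collection that already contains them.

This template removes _id from each row before insertion, so every inserted document receives a newly generated _id.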
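
As a rough illustration of the removal step, the sketch below turns one page of API output into insert-ready documents. The payload shape (a top-level `rows` list whose entries wrap each record under a `row` key) is an assumption based on the Hugging Face dataset viewer API, and `rows_to_documents` is a hypothetical helper, not a node from the workflow.

```python
from typing import Any, Dict, List


def rows_to_documents(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one response page into MongoDB-ready documents without _id."""
    documents = []
    for entry in payload.get("rows", []):
        # Assumption: each entry wraps the actual record under a "row" key.
        record = dict(entry.get("row", {}))  # shallow copy, payload stays untouched
        record.pop("_id", None)              # no-op when the field is absent
        documents.append(record)
    return documents


if __name__ == "__main__":
    page = {
        "rows": [
            {"row_idx": 0, "row": {"_id": "a1", "name": "Loft"}},
            {"row_idx": 1, "row": {"name": "Cabin"}},
        ]
    }
    print(rows_to_documents(page))
```

Dropping the field instead of renaming it keeps the documents otherwise identical to the source rows; if the original identifier matters downstream, it could be copied to another field before removal.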


🔍 Ideal Use Cases

✅ RAG (Retrieval-Augmented Generation)

✅ Recommendation Systems

✅ ML & Analytics Pipelines


You can easily extend this template with:


⚠️ Notes & Best Practices


📄 License & Disclaimer

This workflow template is provided as-is.
You are responsible for:

Hugging Face datasets are subject to their respective licenses.


⭐ Template Summary

Category: Data Ingestion
Complexity: Intermediate
Scalability: High
Memory Safe: Yes
Production Ready: Yes



🔗 Nodes Used

HTTP Request, MongoDB, Execute Sub-workflow, Execute Workflow Trigger

📥 Import

Download workflow.json and import into n8n: Workflow menu → Import from File
