🔒 Filter URLs with AI-powered robots.txt compliance & source verification



Description

URL Officer - Respect robots.txt and Avoid Undesirable Sources

🎬 Overview

Version: 1.0

The URL Officer workflow filters URLs by checking them against a database of forbidden sources and against the rules in each site's robots.txt file. It applies the robots exclusion protocol and a user-defined list of banned sources to support lawful and ethical web automation. Designed primarily as a sub-workflow, it gives automation pipelines a URL validation step that screens out undesirable or restricted sources.

✨ Features

👤 Who is this for?

Ideal for developers, data engineers, researchers, or businesses implementing web crawlers, scrapers, or any automation that processes URLs. This workflow helps you comply with source restrictions and avoid content from blacklisted sites, reducing legal exposure and promoting ethical data use.

💡 What problem does this solve?

URL Officer addresses the challenge of automating URL validation by combining manual blacklist filtering with code-based and AI-assisted robots.txt parsing. It helps prevent accidental scraping or processing of content from undesirable or disallowed sources, so that your automation respects webmasters' policies and legal boundaries.

🔍 What this workflow does

When given a URL, the workflow:

🔄 Workflow Steps

1. Input Parsing & Base URL Extraction

2. Forbidden Source Check

3. robots.txt Handling

4. Code-Based robots.txt Analysis

5. AI-Based robots.txt Verification

6. Output Preparation
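
The first four steps above can be illustrated with a minimal Python sketch. This is not the workflow's actual implementation (which runs as n8n nodes backed by Postgres); it only shows the general technique using the standard library. The names `base_url`, `check_link`, `FORBIDDEN`, and `ROBOTS` are hypothetical, the forbidden-source database is stood in for by an in-memory set, and the robots.txt content is passed in as a string rather than fetched. How the real workflow derives `allow_baseUrl` is not documented here, so checking the site root is an assumption. The AI-based verification step is omitted.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def base_url(link: str) -> str:
    """Step 1: reduce a link to scheme://host[:port]."""
    parts = urlsplit(link)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {link!r}")
    return f"{parts.scheme}://{parts.netloc.lower()}"


def check_link(link: str, user_agent: str,
               forbidden_sources: set, robots_txt: str) -> dict:
    """Steps 2-4: forbidden-source check, then code-based robots.txt analysis."""
    base = base_url(link)
    result = {"link": link, "baseUrl": base,
              "allow_link": False, "allow_baseUrl": False}

    # Step 2: a forbidden base URL short-circuits everything else.
    if base in forbidden_sources:
        return result

    # Steps 3-4: parse the robots.txt text and evaluate both URLs.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Assumption: "base URL allowed" means the site root may be fetched.
    result["allow_baseUrl"] = parser.can_fetch(user_agent, base + "/")
    result["allow_link"] = parser.can_fetch(user_agent, link)
    return result


# Hypothetical sample data, for illustration only.
FORBIDDEN = {"https://blocked.example"}
ROBOTS = "User-agent: *\nDisallow: /private/\n"

print(check_link("https://example.com/private/page", "MyBot", FORBIDDEN, ROBOTS))
print(check_link("https://example.com/blog/post", "MyBot", FORBIDDEN, ROBOTS))
```

Checking the forbidden list before touching robots.txt keeps the cheapest test first and avoids requesting anything from a source you have already decided to exclude.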

🔀 Expected Input / Configuration

The workflow is configured primarily via workflow input arguments:

| Parameter | Description | Type |
| --- | --- | --- |
| link | The URL to be checked. | String |
| userAgent | User-Agent string representing your automation, used for robots.txt checks. | String |
| userAgent_extra | Additional User-Agent information such as version or contact info. | String |
| automationGoal | Description of your automation's purpose, used by the AI to verify suitability against robots.txt. | String |
| model | AI model to use for the robots.txt compliance check. Options: mistral, groq, gemini. | String |
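
As a rough illustration of the input shape, the following Python sketch assembles and validates a payload with the five parameters listed above before it would be handed to the sub-workflow. The class name `UrlOfficerInput`, the validation rules, and every sample value are hypothetical; only the field names and the three model options come from this page.

```python
import json
from dataclasses import asdict, dataclass

# Model options as listed in the parameter description.
ALLOWED_MODELS = ("mistral", "groq", "gemini")


@dataclass(frozen=True)
class UrlOfficerInput:
    link: str
    userAgent: str
    userAgent_extra: str
    automationGoal: str
    model: str

    def __post_init__(self) -> None:
        # Assumption: reject unknown models and empty links up front.
        if self.model not in ALLOWED_MODELS:
            raise ValueError(f"model must be one of {ALLOWED_MODELS}")
        if not self.link:
            raise ValueError("link must not be empty")

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Hypothetical values, for illustration only.
payload = UrlOfficerInput(
    link="https://example.com/blog/post",
    userAgent="MyBot",
    userAgent_extra="1.0 (+mailto:ops@example.com)",
    automationGoal="Summarise public blog posts for an internal digest.",
    model="mistral",
)
print(payload.to_json())
```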

Database Requirements

📦 Expected Output

A structured JSON object containing:

| Output Key | Description |
| --- | --- |
| link | The URL that was checked. |
| baseUrl | The base URL of the checked link. |
| allow_link | Boolean indicating if the link is allowed according to checks. |
| allow_baseUrl | Boolean indicating if the base URL is allowed. |
| userAgent | User-Agent string used in the check. |
| userAgent_extra | Additional User-Agent metadata. |
| robots_fetched | Boolean, true if robots.txt content was successfully fetched. |
| fetched_at | Timestamp of the last robots.txt content fetch. |
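
A parent pipeline would typically read `allow_link` to decide whether to continue. The Python sketch below shows one way to type the output and filter a batch of results. The `UrlOfficerResult` and `allowed_links` names are hypothetical, the timestamp format of `fetched_at` is not specified on this page (an ISO 8601 string is assumed), and the sample records are invented for illustration.

```python
from typing import List, TypedDict


class UrlOfficerResult(TypedDict):
    link: str
    baseUrl: str
    allow_link: bool
    allow_baseUrl: bool
    userAgent: str
    userAgent_extra: str
    robots_fetched: bool
    fetched_at: str  # Assumption: ISO 8601 string; the real format may differ.


def allowed_links(results: List[UrlOfficerResult]) -> List[str]:
    """Keep only the links the check marked as allowed."""
    return [r["link"] for r in results if r["allow_link"]]


# Hypothetical results, for illustration only.
results: List[UrlOfficerResult] = [
    {"link": "https://example.com/blog/post", "baseUrl": "https://example.com",
     "allow_link": True, "allow_baseUrl": True, "userAgent": "MyBot",
     "userAgent_extra": "1.0", "robots_fetched": True,
     "fetched_at": "2025-01-01T00:00:00Z"},
    {"link": "https://example.com/private/page", "baseUrl": "https://example.com",
     "allow_link": False, "allow_baseUrl": True, "userAgent": "MyBot",
     "userAgent_extra": "1.0", "robots_fetched": True,
     "fetched_at": "2025-01-01T00:00:00Z"},
]
print(allowed_links(results))
```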

📌 Example

Example input payload (screenshots): printscreenurlofficer_example.png, printscreen1.png, printscreen2.png, printscreen3.png

⚙️ n8n Setup Used

⚡ Requirements to Use / Setup

⚠️ Notes, Assumptions & Warnings

🛠 PostgreSQL Setup Instructions (Self-Hosted Route)

Setup instructions are available in the workflow notes, alongside the corresponding podman commands.

ℹ️ About Us

This workflow was developed by the Hybroht team. Our goal is to create tools that harness the possibilities of technology and more. We aim to continuously improve and expand functionalities based on community feedback and evolving use cases.

For questions, support, or feedback, please contact us at: contact@hybroht.com


This workflow is provided “as-is” without warranties of any kind. By using this workflow, you agree that you are responsible for complying with all applicable laws, regulations, and terms of service related to your data sources and automations. Please review all relevant legal terms and use this workflow responsibly.

Hybroht disclaims any liability arising from use or misuse of this workflow. This tool assists with robots.txt compliance but is not a substitute for full legal or compliance advice.

You can view the full license terms here. Please review them before making your purchase.

By purchasing this product, you agree to these terms.


🔗 Nodes Used

HTTP Request, Postgres, Execute Workflow Trigger, Schedule Trigger, Mistral Cloud Chat Model, Google Gemini Chat Model

📥 Import

Download workflow.json and import into n8n: Workflow menu → Import from File
