📊 Daily RAG research paper hub with arXiv, Gemini AI, and Notion


Description

Fetch research papers matching a user's interests from arXiv on a daily schedule, clean and structure the data, create or update entries in a Notion database, and deliver the results by email and IM.

1. Data Retrieval

arXiv API

arXiv provides a public API for querying research papers by search term or by predefined subject category.

arXiv API User Manual

Key Notes:

  1. Response Format: The API returns results as an Atom XML feed.
  2. Timezone & Update Frequency:
    • The arXiv submission process operates on a 24-hour cycle.
    • Newly submitted articles become available in the API only at midnight after they have been processed.
    • Feeds are updated daily at midnight Eastern Standard Time (EST).
    • Therefore, a single request per day is sufficient.
  3. Request Limits:
    • A single query can return at most 30,000 results in total.
    • Results must be retrieved in slices of at most 2,000 at a time, using the start and max_results query parameters.
  4. Time Format:
    • Date ranges in search_query use the form submittedDate:[YYYYMMDDTTTT+TO+YYYYMMDDTTTT].
    • TTTT is the time of day in 24-hour format to the minute, in GMT.
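
The request limits and time format above can be sketched as a small URL builder. This is a minimal sketch, not the workflow's actual HTTP Request configuration: the function names and the `cat:` category filter are illustrative assumptions, while the endpoint and the `search_query`, `start`, `max_results`, `sortBy`, and `sortOrder` parameters come from the arXiv API User Manual.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint
PAGE_SIZE = 2000  # largest slice the API returns per request


def fmt(ts: datetime) -> str:
    """Format a timezone-aware datetime as YYYYMMDDTTTT (to the minute, GMT)."""
    return ts.astimezone(timezone.utc).strftime("%Y%m%d%H%M")


def build_query_url(category: str, start_ts: datetime, end_ts: datetime,
                    start: int = 0, max_results: int = PAGE_SIZE) -> str:
    """Build one paged request for papers in `category` submitted in the window."""
    if max_results > PAGE_SIZE:
        raise ValueError("max_results must not exceed 2000 per request")
    # Hypothetical filter: one category plus a submission-date window.
    search = f"cat:{category} AND submittedDate:[{fmt(start_ts)} TO {fmt(end_ts)}]"
    params = {
        "search_query": search,
        "start": start,              # offset of the first result in this slice
        "max_results": max_results,  # size of this slice
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    # urlencode turns spaces into "+" and percent-encodes ":" and "[]".
    return f"{ARXIV_API}?{urlencode(params)}"


# One day's window; for more than 2,000 hits, repeat with start=2000, 4000, ...
url = build_query_url(
    "cs.CL",
    datetime(2024, 9, 9, tzinfo=timezone.utc),
    datetime(2024, 9, 10, tzinfo=timezone.utc),
)
```

Because the feed changes only once a day, a daily trigger that requests the previous 24-hour window is enough; paging matters only when a broad query exceeds one slice.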

Scheduled Task

2. Data Extraction

Data Cleaning Rules (Convert to Standard JSON)

  1. Remove Header

    • Keep only the 【entry】【/entry】 blocks representing paper items.
  2. Single Item

    • Each 【entry】【/entry】 represents a single item.
  3. Field Processing Rules

    • 【id】【/id】 ➡️ id
      Extract content.
      Example:
      【id】http://arxiv.org/abs/2409.06062v1【/id】 ➡️ http://arxiv.org/abs/2409.06062v1

    • 【updated】【/updated】 ➡️ updated
      Convert timestamp to yyyy-mm-dd hh:mm:ss

    • 【published】【/published】 ➡️ published
      Convert timestamp to yyyy-mm-dd hh:mm:ss

    • 【title】【/title】 ➡️ title
      Extract text content

    • 【summary】【/summary】 ➡️ summary
      Keep text, remove line breaks

    • 【author】【/author】 ➡️ author
      Combine all authors into an array
      Example: [ "Ernest Pusateri", "Anmol Walia" ] (for Notion multi-select field)

    • 【arxiv:comment】【/arxiv:comment】 ➡️ Ignore / discard

    • 【link type="text/html"】 ➡️ html_url
      Extract the href value

    • 【link type="application/pdf"】 ➡️ pdf_url
      Extract the href value

    • 【arxiv:primary_category term="cs.CL"】 ➡️ primary_category
      Extract term value

    • 【category】 ➡️ category
      Merge the term values of all 【category】 elements into an array
      Example: [ "eess.AS", "cs.SD" ] (for Notion multi-select field)

  4. Add Empty Fields

    • github
    • huggingface
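
The cleaning rules above can be sketched with Python's standard XML parser. This is an illustrative sketch rather than the workflow's own code node: the function names are assumptions, and collapsing whitespace in the title goes slightly beyond the rules, which only require it for the summary. The two namespace URIs are the ones used by Atom and by arXiv's extension elements.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def to_timestamp(value: str) -> str:
    """Convert an Atom timestamp to yyyy-mm-dd hh:mm:ss (offset is dropped, not converted)."""
    parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


def clean_entry(entry: ET.Element) -> dict:
    """Map one Atom entry element to the flat JSON record described in the rules."""
    def text(tag: str) -> str:
        return entry.findtext(tag, default="", namespaces=NS).strip()

    def link(mime: str) -> str:
        # Pick the link element whose type attribute matches the MIME type.
        for el in entry.findall("atom:link", NS):
            if el.get("type") == mime:
                return el.get("href", "")
        return ""

    primary = entry.find("arxiv:primary_category", NS)
    return {
        "id": text("atom:id"),
        "updated": to_timestamp(text("atom:updated")),
        "published": to_timestamp(text("atom:published")),
        "title": " ".join(text("atom:title").split()),      # assumption: unwrap titles too
        "summary": " ".join(text("atom:summary").split()),  # remove line breaks
        "author": [a.findtext("atom:name", default="", namespaces=NS).strip()
                   for a in entry.findall("atom:author", NS)],
        "html_url": link("text/html"),
        "pdf_url": link("application/pdf"),
        "primary_category": primary.get("term", "") if primary is not None else "",
        "category": [c.get("term", "") for c in entry.findall("atom:category", NS)],
        "github": "",       # empty placeholder
        "huggingface": "",  # empty placeholder
    }


def clean_feed(xml_text: str) -> list:
    """Drop the feed header and return one cleaned record per entry."""
    root = ET.fromstring(xml_text)
    return [clean_entry(e) for e in root.findall("atom:entry", NS)]
```

The arxiv:comment element is never read, so it is discarded implicitly.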

3. Data Processing

Analyze and summarize paper data using AI, then standardize output as JSON.
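
Model output often arrives wrapped in a Markdown code fence or with extra whitespace, so the "standardize output as JSON" step usually needs a small guard. The sketch below is one possible approach, not the workflow's actual node: the function name and the required keys (`summary`, `keywords`) are hypothetical and should be replaced by whatever fields the prompt asks the model to return.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker
REQUIRED_KEYS = ("summary", "keywords")  # hypothetical schema, adjust to the prompt


def parse_llm_json(raw: str) -> dict:
    """Strip an optional Markdown fence, parse JSON, and check the expected keys."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with or without a language tag).
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence and anything after it.
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data
```

Failing loudly on malformed output lets the workflow retry or skip a paper instead of writing a broken row to the database.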

4. Data Storage: Notion Database
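
A cleaned record maps onto a Notion database row roughly as follows. This sketch builds the request body for Notion's create-page endpoint and assumes a database whose property names (`Title`, `Summary`, `Authors`, `Categories`, `Published`, `PDF`, `URL`) are hypothetical; they must match the columns of your own database. The property value shapes (title, rich_text, multi_select, date, url) follow the Notion API.

```python
def notion_page_payload(paper: dict, database_id: str) -> dict:
    """Build a create-page request body from one cleaned paper record."""
    def multi(values):
        # Multi-select option names cannot contain commas, so replace them.
        return [{"name": v.replace(",", " ")} for v in values]

    def rich(value: str):
        # A single rich-text object holds at most 2,000 characters.
        return [{"text": {"content": value[:2000]}}]

    return {
        "parent": {"database_id": database_id},
        "properties": {  # property names below are hypothetical
            "Title": {"title": rich(paper["title"])},
            "Summary": {"rich_text": rich(paper["summary"])},
            "Authors": {"multi_select": multi(paper["author"])},
            "Categories": {"multi_select": multi(paper["category"])},
            # Date properties take an ISO 8601 date; keep only yyyy-mm-dd.
            "Published": {"date": {"start": paper["published"][:10]}},
            "PDF": {"url": paper["pdf_url"] or None},
            "URL": {"url": paper["html_url"] or None},
        },
    }
```

To update instead of duplicating, one option is to query the database for an existing row with the same arXiv URL before creating a new page.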

Notes

5. Data Delivery

Set up two delivery channels, email and IM, and define the message format and content for each.

Email: Gmail

GMAIL OAuth 2.0 – Official Documentation
Configure your OAuth consent screen

Steps:

Message format: HTML
(Model: OpenAI GPT — used to design an HTML email template)
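
As an illustration of the HTML message format, the sketch below renders cleaned paper records into a simple email body. The layout and the function name are assumptions, not the template the workflow ships with; the important detail is that titles, author names, and summaries are escaped before being placed into HTML.

```python
from html import escape


def render_email_html(papers: list, heading: str = "Daily arXiv digest") -> str:
    """Render paper records as a minimal HTML email body."""
    items = []
    for p in papers:
        items.append(
            "<li>"
            f'<a href="{escape(p["html_url"], quote=True)}">{escape(p["title"])}</a>'
            f'<br><small>{escape(", ".join(p["author"]))}</small>'
            f'<p>{escape(p["summary"])}</p>'
            "</li>"
        )
    return (
        "<html><body>"
        f"<h2>{escape(heading)}</h2>"
        f"<ul>{''.join(items)}</ul>"
        "</body></html>"
    )
```

The returned string can be passed to the Gmail node as the message body with the email type set to HTML.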

IM: Feishu (LARK)

Bots in groups
Use bots in groups
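
If the group bot is a Feishu custom bot, messages are posted as JSON to the webhook URL issued when the bot is added to the group. The sketch below only builds a plain-text payload; it assumes the custom-bot text message shape (`msg_type` plus `content.text`), and the function name and item limit are illustrative.

```python
def feishu_text_message(papers: list, max_items: int = 10) -> dict:
    """Build a plain-text custom-bot payload listing paper titles and links."""
    lines = [
        f"{i}. {p['title']}\n{p['html_url']}"
        for i, p in enumerate(papers[:max_items], start=1)
    ]
    return {"msg_type": "text", "content": {"text": "\n\n".join(lines)}}
```

Send the result as the JSON body of a POST request (for example from an HTTP Request node) to the bot's webhook URL.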

🔗 Nodes Used

HTTP Request, Gmail, Notion, Schedule Trigger, Basic LLM Chain, Google Gemini Chat Model

📥 Import

Download workflow.json and import into n8n: Workflow menu → Import from File

📖 Importing guide · 🔑 Credential setup