📊 Daily RAG research paper hub with arXiv, Gemini AI, and Notion
⚡ 820 views · 📊 Market Research & Insights
💡 Pro Tip — HTTP Request scraping tends to break when sites update their markup. If you’re scraping a major platform, check if ScraperNode covers it — it has maintained scrapers for LinkedIn, Instagram, TikTok, YouTube, and 20+ other platforms that return structured data.
Description
Fetch user-specific research papers from arXiv on a daily schedule, process and structure the data, and create or update entries in a Notion database, with support for data delivery
- Paper Topic: single query keyword
- Update Frequency: Daily updates, with fewer than 20 entries expected per day
- Tools:
- Platform: n8n, for end-to-end workflow configuration
- AI Model: Gemini-2.5-Flash, for daily paper summarization and data processing
- Database: Notion, with two tables — Daily Paper Summary and Paper Details
- Message: Feishu (IM bot notifications), Gmail (email notifications)
1. Data Retrieval
arXiv API
The arXiv provides a public API that allows users to query research papers by topic or by predefined categories.
Key Notes:
- Response Format: The API returns data as a typical Atom Response.
- Timezone & Update Frequency:
- The arXiv submission process operates on a 24-hour cycle.
- Newly submitted articles become available in the API only at midnight after they have been processed.
- Feeds are updated daily at midnight Eastern Standard Time (EST).
- Therefore, a single request per day is sufficient.
- Request Limits:
- The maximum number of results per call (
max_results) is 30,000, - Results must be retrieved in slices of at most 2,000 at a time, using the
max_resultsandstartquery parameters.
- The maximum number of results per call (
- Time Format:
- The expected format is
[YYYYMMDDTTTT+TO+YYYYMMDDTTTT], TTTTis provided in 24-hour time to the minute, in GMT.
- The expected format is
Scheduled Task
- Execution Frequency: Daily
- Execution Time: 6:00 AM
- Time Parameter Handling (JS):
According to arXiv’s update rules, the scheduled task should query the previous day’s (T-1)submittedDatedata.
2. Data Extraction
Data Cleaning Rules (Convert to Standard JSON)
-
Remove Header
- Keep only the 【entry】【/entry】 blocks representing paper items.
-
Single Item
- Each 【entry】【/entry】 represents a single item.
-
Field Processing Rules
-
【id】【/id】 ➡️ id
Extract content.
Example:
【id】http://arxiv.org/abs/2409.06062v1【/id】 → http://arxiv.org/abs/2409.06062v1 -
【updated】【/updated】 ➡️ updated
Convert timestamp toyyyy-mm-dd hh:mm:ss -
【published】【/published】 ➡️ published
Convert timestamp toyyyy-mm-dd hh:mm:ss -
【title】【/title】 ➡️ title
Extract text content -
【summary】【/summary】 ➡️ summary
Keep text, remove line breaks -
【author】【/author】 ➡️ author
Combine all authors into an array
Example:[ "Ernest Pusateri", "Anmol Walia" ](for Notion multi-select field) -
【arxiv:comment】【/arxiv:comment】 ➡️ Ignore / discard
-
【link type=“text/html”】 ➡️ html_url
Extract URL -
【link type=“application/pdf”】 ➡️ pdf_url
Extract URL -
【arxiv:primary_category term=“cs.CL”】 ➡️ primary_category
Extracttermvalue -
【category】 ➡️ category
Merge all 【category】 values into an array
Example:[ "eess.AS", "cs.SD" ](for Notion multi-select field)
-
-
Add Empty Fields
githubhuggingface
3. Data Processing
Analyze and summarize paper data using AI, then standardize output as JSON.
- Single Paper Basic Information Analysis and Enhancement
- Daily Paper Summary and Multilingual Translation
4. Data Storage: Notion Database
- Create a corresponding database in Notion with the same predefined field names.
- In Notion, create an integration under Integrations and grant access to the database. Obtain the corresponding Secret Key.
- Use the Notion “Create a database page” node to configure the field mapping and store the data.
Notes
- “Create a database page” only adds new entries; data will not be updated.
- The
updatedandpublishedtimestamps of arXiv papers are in UTC. - Notion single-select and multi-select fields only accept arrays. They do not automatically parse comma-separated strings. You need to format them as proper arrays.
- Notion does not accept
nullvalues, which causes a 400 error.
5. Data Delivery
Set up two channels for message delivery: EMAIL and IM, and define the message format and content.
Email: Gmail
GMAIL OAuth 2.0 – Official Documentation
Configure your OAuth consent screen
Steps:
- Enable Gmail API
- Create OAuth consent screen
- Create OAuth client credentials
- Audience: Add Test users under Testing status
Message format: HTML
(Model: OpenAI GPT — used to design an HTML email template)
IM: Feishu (LARK)
Bots in groups
Use bots in groups
🔗 Nodes Used
HTTP Request, Gmail, Notion, Schedule Trigger, Basic LLM Chain, Google Gemini Chat Model
📥 Import
Download workflow.json and import into n8n:
Workflow menu → Import from File