⚒️ Benchmark LLM performance on legal documents with Google Sheets and OpenRouter

⚒️ Engineering

Description

This workflow demonstrates a simple way to run evals on a set of test cases stored in a Google Sheet.

The example comes from an information extraction dataset, on which we tested 6 different LLMs against 18 different test cases.
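As a rough illustration of the size of that test matrix, here is a minimal Python sketch. It assumes every model is run on every test case, which the description implies but does not state outright, and the model and case names are hypothetical placeholders rather than the contents of the actual sheet.

```python
from itertools import product

# Hypothetical placeholders; the real sheet defines its own models and cases.
models = [f"model-{i}" for i in range(1, 7)]       # 6 LLMs
test_cases = [f"case-{i}" for i in range(1, 19)]   # 18 test cases

# One run per (model, test case) pair, assuming every model sees every case.
runs = [{"model": m, "test_case": c} for m, c in product(models, test_cases)]

print(len(runs))  # 6 * 18 = 108
```

Under that assumption, a full benchmark pass means 108 separate LLM requests.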

This workflow extends my simple eval workflow for benchmarking legal tasks, linked here.

Rather than running executions sequentially (waiting for each one to respond before making another request), we use parallel processing to fire 2 requests every second.
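The pacing idea can be sketched outside n8n as well. The Python example below is only an illustration of "start a new request at a fixed rate without waiting for earlier replies"; it is not how the workflow itself is built. `fake_llm_call` and `run_paced` are hypothetical names, and the simulated call stands in for a real HTTP request to an LLM API.

```python
import asyncio
import time


async def fake_llm_call(case_id, latency=1.0):
    """Hypothetical stand-in for a real LLM request; just sleeps, then returns."""
    await asyncio.sleep(latency)
    return {"case_id": case_id, "finished_at": time.monotonic()}


async def run_paced(case_ids, per_second=2.0, call=fake_llm_call):
    """Start one request every 1/per_second seconds without awaiting replies."""
    interval = 1.0 / per_second
    tasks = []
    for i, case_id in enumerate(case_ids):
        if i:
            # Wait between launches only; earlier requests keep running.
            await asyncio.sleep(interval)
        tasks.append(asyncio.create_task(call(case_id)))
    # Collect all responses, in the same order the requests were started.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    start = time.monotonic()
    results = asyncio.run(run_paced(range(4)))
    print(f"{len(results)} responses in {time.monotonic() - start:.1f}s")
```

With four simulated requests that each take one second, a sequential loop would need about four seconds, while the paced version launches the last request after 1.5 seconds and finishes roughly one second later. A real run would also need error handling and retries, which this sketch omits.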

To get started, see our sample data in this spreadsheet here.

Once this works with our dataset, you can plug in your own test cases and choose different LLMs to see how they perform on your own data.

How it works

Set up steps:

🔗 Nodes Used

Google Sheets, HTTP Request, Webhook, Google Drive, Basic LLM Chain, Structured Output Parser

📥 Import

Download workflow.json and import into n8n: Workflow menu → Import from File

📖 Importing guide · 🔑 Credential setup