Merged
27 changes: 8 additions & 19 deletions benchmarks/ehrsql-naacl2024/README.md
@@ -2,25 +2,14 @@

 ## Overview

-Curated benchmark with **100 examples** from the Reliable Text-to-SQL on Electronic Health Records shared task (Clinical NLP Workshop @ NAACL 2024).
+Benchmark results comparing different models on the EHRSQL dataset with one hundred questions covering various medical queries including cost analysis, temporal measurement differences, medication prescriptions, lab results, patient demographics etc.

-**Source**: [ehrsql-2024](https://github.com/glee4810/ehrsql-2024) | **Database**: MIMIC-IV Demo
+**Source**: [ehrsql-2024](https://github.com/glee4810/ehrsql-2024)

-## Data Schema
+Each model folder contains:
+- **Model answers** extracted from conversations
+- **Golden truth answers** and SQL queries for comparison
+- **Correct/Incorrect** annotations with detailed notes
+- **Chat conversation links** (Claude.ai shared links or local conversation files)

-| Column | Description |
-|--------|-------------|
-| Query | Natural language question about EHR data |
-| Chat Conversation | Link to model interaction |
-| Model Answer | AI-generated response |
-| Golden Truth | Expected correct answer |
-| Golden Truth SQL Query | Ground truth SQL query |
-| Correct/Incorrect | 1 = correct, 0 = incorrect |
-| Incorrect Note | Error analysis when applicable |
-
-## Query Examples
-
-- Patient-specific: Lab values, medications, procedures
-- Temporal: Time-based analysis, trends
-- Aggregate: Population statistics
-- Complex joins: Multi-table EHR relationships
+The dataset includes complex medical questions requiring database queries, with model performance evaluated against ground truth answers through human assessment.
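
As a rough illustration of how the `Correct/Incorrect` annotations could be turned into a score, the sketch below computes accuracy from rows shaped like the column table in this README. It assumes that a model's `EHRSQL_benchmark.csv` still uses those column names and the `1`/`0` encoding, which the rendered diff does not confirm. The `accuracy` helper and the inline sample rows are hypothetical stand-ins, not data from the repository.

```python
import csv
import io


def accuracy(rows, label_column="Correct/Incorrect"):
    """Return the fraction of labelled rows marked "1" (correct).

    Rows with an empty or missing label are skipped. The column name is an
    assumption taken from the README's column table.
    """
    labels = [(row.get(label_column) or "").strip() for row in rows]
    labels = [label for label in labels if label]
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == "1") / len(labels)


# Made-up sample standing in for a model's EHRSQL_benchmark.csv.
sample = io.StringIO(
    "Query,Model Answer,Golden Truth,Correct/Incorrect,Incorrect Note\n"
    "q1,5,5,1,\n"
    "q2,3,4,0,wrong join\n"
    "q3,7,7,1,\n"
    "q4,2,2,1,\n"
)
rows = list(csv.DictReader(sample))
print(f"accuracy: {accuracy(rows):.2f}")
```

To score a real file, the `io.StringIO` sample would be replaced with `open(path, newline="", encoding="utf-8")`, where `path` points at the CSV in a model folder.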
1,182 changes: 1,182 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/EHRSQL_benchmark.csv

Large diffs are not rendered by default.

2,501 changes: 2,501 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/10.conversation.json

Large diffs are not rendered by default.

2,501 changes: 2,501 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/100.conversation.json

Large diffs are not rendered by default.

1,857 changes: 1,857 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/101.conversation.json

Large diffs are not rendered by default.

2,501 changes: 2,501 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/11.conversation.json

Large diffs are not rendered by default.

1,888 changes: 1,888 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/12.conversation.json

Large diffs are not rendered by default.

2,192 changes: 2,192 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/13.conversation.json

Large diffs are not rendered by default.

2,499 changes: 2,499 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/14.conversation.json

Large diffs are not rendered by default.

1,888 changes: 1,888 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/15.conversation.json

Large diffs are not rendered by default.

1,582 changes: 1,582 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/16.conversation.json

Large diffs are not rendered by default.

1,888 changes: 1,888 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/17.conversation.json

Large diffs are not rendered by default.

3,115 changes: 3,115 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/18.conversation.json

Large diffs are not rendered by default.

2,808 changes: 2,808 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/19.conversation.json

Large diffs are not rendered by default.

6,518 changes: 6,518 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/2.conversation.json

Large diffs are not rendered by default.

2,501 changes: 2,501 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/20.conversation.json

Large diffs are not rendered by default.

2,195 changes: 2,195 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/21.conversation.json

Large diffs are not rendered by default.

1,889 changes: 1,889 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/22.conversation.json

Large diffs are not rendered by default.

2,195 changes: 2,195 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/23.conversation.json

Large diffs are not rendered by default.

3,762 changes: 3,762 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/24.conversation.json

Large diffs are not rendered by default.

3,114 changes: 3,114 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/25.conversation.json

Large diffs are not rendered by default.

2,807 changes: 2,807 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/26.conversation.json

Large diffs are not rendered by default.

1,276 changes: 1,276 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/27.conversation.json

Large diffs are not rendered by default.

1,582 changes: 1,582 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/28.conversation.json

Large diffs are not rendered by default.

2,501 changes: 2,501 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/29.conversation.json

Large diffs are not rendered by default.

4,645 changes: 4,645 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/3.conversation.json

Large diffs are not rendered by default.

2,194 changes: 2,194 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/30.conversation.json

Large diffs are not rendered by default.

1,276 changes: 1,276 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/31.conversation.json

Large diffs are not rendered by default.

2,809 changes: 2,809 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/32.conversation.json

Large diffs are not rendered by default.

1,889 changes: 1,889 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/33.conversation.json

Large diffs are not rendered by default.

2,502 changes: 2,502 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/34.conversation.json

Large diffs are not rendered by default.

1,276 changes: 1,276 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/35.conversation.json

Large diffs are not rendered by default.

3,420 changes: 3,420 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/36.conversation.json

Large diffs are not rendered by default.

1,582 changes: 1,582 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/37.conversation.json

Large diffs are not rendered by default.

1,276 changes: 1,276 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/38.conversation.json

Large diffs are not rendered by default.

2,503 changes: 2,503 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/4.conversation.json

Large diffs are not rendered by default.

2,195 changes: 2,195 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/40.conversation.json

Large diffs are not rendered by default.

6,786 changes: 6,786 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/41.conversation.json

Large diffs are not rendered by default.

1,278 changes: 1,278 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/42.conversation.json

Large diffs are not rendered by default.

2,194 changes: 2,194 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/43.conversation.json

Large diffs are not rendered by default.

4,337 changes: 4,337 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/44.conversation.json

Large diffs are not rendered by default.

4,950 changes: 4,950 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/45.conversation.json

Large diffs are not rendered by default.

2,194 changes: 2,194 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/46.conversation.json

Large diffs are not rendered by default.

2,808 changes: 2,808 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/47.conversation.json

Large diffs are not rendered by default.

1,889 changes: 1,889 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/48.conversation.json

Large diffs are not rendered by default.

2,194 changes: 2,194 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/49.conversation.json

Large diffs are not rendered by default.

4,338 changes: 4,338 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/5.conversation.json

Large diffs are not rendered by default.

7,804 changes: 7,804 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/50.conversation.json

Large diffs are not rendered by default.

2,501 changes: 2,501 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/51.conversation.json

Large diffs are not rendered by default.

2,194 changes: 2,194 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/52.conversation.json

Large diffs are not rendered by default.

1,276 changes: 1,276 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/53.conversation.json

Large diffs are not rendered by default.

3,125 changes: 3,125 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/54.conversation.json

Large diffs are not rendered by default.

1,276 changes: 1,276 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/55.conversation.json

Large diffs are not rendered by default.

2,503 changes: 2,503 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/56.conversation.json

Large diffs are not rendered by default.

2,195 changes: 2,195 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/57.conversation.json

Large diffs are not rendered by default.

1,889 changes: 1,889 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/58.conversation.json

Large diffs are not rendered by default.

2,501 changes: 2,501 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/59.conversation.json

Large diffs are not rendered by default.

3,116 changes: 3,116 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/6.conversation.json

Large diffs are not rendered by default.

1,889 changes: 1,889 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/60.conversation.json

Large diffs are not rendered by default.

1,276 changes: 1,276 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/61.conversation.json

Large diffs are not rendered by default.

3,114 changes: 3,114 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/62.conversation.json

Large diffs are not rendered by default.

2,195 changes: 2,195 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/63.conversation.json

Large diffs are not rendered by default.

1,583 changes: 1,583 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/64.conversation.json

Large diffs are not rendered by default.

2,777 changes: 2,777 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/65.conversation.json

Large diffs are not rendered by default.

2,501 changes: 2,501 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/66.conversation.json

Large diffs are not rendered by default.

2,501 changes: 2,501 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/67.conversation.json

Large diffs are not rendered by default.

1,582 changes: 1,582 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/68.conversation.json

Large diffs are not rendered by default.

1,889 changes: 1,889 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/69.conversation.json

Large diffs are not rendered by default.

6,481 changes: 6,481 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/7.conversation.json

Large diffs are not rendered by default.

1,276 changes: 1,276 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/70.conversation.json

Large diffs are not rendered by default.

1,276 changes: 1,276 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/71.conversation.json

Large diffs are not rendered by default.

2,806 changes: 2,806 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/72.conversation.json

Large diffs are not rendered by default.

1,276 changes: 1,276 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/73.conversation.json

Large diffs are not rendered by default.

1,583 changes: 1,583 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/74.conversation.json

Large diffs are not rendered by default.

2,807 changes: 2,807 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/75.conversation.json

Large diffs are not rendered by default.

2,503 changes: 2,503 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/76.conversation.json

Large diffs are not rendered by default.

6,482 changes: 6,482 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/77.conversation.json

Large diffs are not rendered by default.

6,823 changes: 6,823 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/78.conversation.json

Large diffs are not rendered by default.

3,421 changes: 3,421 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/79.conversation.json

Large diffs are not rendered by default.

1,276 changes: 1,276 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/8.conversation.json

Large diffs are not rendered by default.

4,338 changes: 4,338 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/80.conversation.json

Large diffs are not rendered by default.

2,806 changes: 2,806 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/81.conversation.json

Large diffs are not rendered by default.

1,582 changes: 1,582 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/82.conversation.json

Large diffs are not rendered by default.

4,336 changes: 4,336 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/83.conversation.json

Large diffs are not rendered by default.

2,808 changes: 2,808 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/84.conversation.json

Large diffs are not rendered by default.

2,805 changes: 2,805 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/85.conversation.json

Large diffs are not rendered by default.

2,502 changes: 2,502 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/86.conversation.json

Large diffs are not rendered by default.

1,276 changes: 1,276 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/87.conversation.json

Large diffs are not rendered by default.

1,583 changes: 1,583 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/88.conversation.json

Large diffs are not rendered by default.

2,501 changes: 2,501 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/89.conversation.json

Large diffs are not rendered by default.

2,808 changes: 2,808 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/9.conversation.json

Large diffs are not rendered by default.

1,254 changes: 1,254 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/90.conversation.json

Large diffs are not rendered by default.

2,195 changes: 2,195 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/91.conversation.json

Large diffs are not rendered by default.

2,195 changes: 2,195 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/92.conversation.json

Large diffs are not rendered by default.

1,254 changes: 1,254 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/93.conversation.json

Large diffs are not rendered by default.

1,276 changes: 1,276 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/94.conversation.json

Large diffs are not rendered by default.

1,888 changes: 1,888 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/95.conversation.json

Large diffs are not rendered by default.

1,277 changes: 1,277 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/96.conversation.json

Large diffs are not rendered by default.

2,501 changes: 2,501 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/97.conversation.json

Large diffs are not rendered by default.

1,254 changes: 1,254 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/98.conversation.json

Large diffs are not rendered by default.

1,583 changes: 1,583 additions & 0 deletions benchmarks/ehrsql-naacl2024/gpt-oss-20B/conversations/99.conversation.json

Large diffs are not rendered by default.
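
The conversation files above are listed in lexicographic order (`10`, `100`, `101`, `11`, ...), which makes it hard to see whether any question index is missing. The sketch below is one way to order such names numerically and report gaps. It assumes only the `<index>.conversation.json` naming pattern visible in the file list. The helper names and the short sample list are hypothetical.

```python
import re

# Matches names such as "12.conversation.json" and captures the index.
PATTERN = re.compile(r"^(\d+)\.conversation\.json$")


def conversation_index(filename):
    """Return the numeric index of a conversation file name, or None."""
    match = PATTERN.match(filename)
    return int(match.group(1)) if match else None


def sort_and_find_gaps(filenames):
    """Return (sorted indices, indices missing between the min and max)."""
    indices = sorted(
        i for i in map(conversation_index, filenames) if i is not None
    )
    if not indices:
        return [], []
    present = set(indices)
    gaps = [i for i in range(indices[0], indices[-1] + 1) if i not in present]
    return indices, gaps


# Made-up sample; in practice the names could come from os.listdir()
# on a model's conversations/ directory.
names = [
    "10.conversation.json",
    "2.conversation.json",
    "4.conversation.json",
    "EHRSQL_benchmark.csv",  # ignored: does not match the pattern
]
indices, gaps = sort_and_find_gaps(names)
print(indices)  # [2, 4, 10]
print(gaps)     # [3, 5, 6, 7, 8, 9]
```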