feat: Add local MIMIC-IV full dataset support (Replace SQLite with Parquet + DuckDB)#62
Conversation
simonprovost
left a comment
There was a problem hiding this comment.
Super nice to see a DuckDB support! Can't wait to try it out! Surely @rafiattrach will proceed with his review but I pre-approve this nice PR!
Though may we wait for #60 to pass first, much easier that way for you @hannesill to rebase and solve the README potential forthcoming tiny conflicts?
|
@hannesill Thanks for this excellent PR! 🙏 Really appreciate the work on local full dataset support, this will definitely be a super nice addition! I'll look into this in more detail over the coming days, but from a quick scan of your notes I have some questions about the flow and user experience, especially for clinicians who need simple workflows as we've seen in datathons: Main concernThe current flow seems to add complexity: m3 download mimic-iv-demo && m3 convert mimic-iv-demo (once), then m3 init mimic-iv-demoQuestions:
Thanks once again though! |
|
To avoid confusion, I added @rafiattrach as the main peer reviewer to prevent this from being misinterpreted and merged by mistake. ^^ |
|
@rafiattrach thanks for the comment! I was mainly doing it for flexibility as a dev:
However, I see what you mean with that it should be super simple for clinicians. We could perhaps do it like that:
Optional flags for power users:
This approach keeps the clinical workflow simple (one command) while giving devs/power-users the control they need by hiding the extra functionality/complexity in optional flags. What do you think? |
|
@hannesill yes sounds good, thanks a lot Hannes! |
|
@rafiattrach here's the modified version as described in my suggestion. The available flags right now for
The flags I suggested above are probably only useful when we start supporting auto-download for big datasets like mimic-iv-full. Important note: For using the demo, the CLI is unchanged now. Clinicians and devs already using m3 with the demo dataset don't need to relearn anything. All tests still pass. |
|
@hannesill this is fantastic work! I've tested the entire workflow for the demo duckdb, BigQuery, and the new full local dataset, and it all works perfectly! Could you please just rebase the branch on main to incorporate the latest updates especially the conflicts with the README? A thought I had was to extend the newly added Quick Start table with a third column for the new Local Full Dataset option. It could be a really clear way to show users all three setup paths side-by-side. |
DuckDB becomes useful when we want to query not just the small demo dataset locally, but the full MIMIC-IV dataset.
…ith duckdb backend. Also, fixed the working path to not be m3/src/m3 anymore, but m3/ instead.
…pied into m3_data/raw manually.
… keep init fast (views only) * Remove SQLite; unify local backends on DuckDB + Parquet for demo and full * CLI: * add m3 download (demo only) to fetch CSV.gz * add m3 convert (demo/full) to convert CSV.gz → Parquet * m3 init now creates/refreshes DuckDB views over existing Parquet * Update status/use/config to new dataset model * Refresh tests and README for new workflow
b91f32a to
4e09739
Compare
|
also @hannesill if you could check the test failures, probably from the recent additions since you mentioned all passed before |
|
@rafiattrach Weird, all tests pass for me locally. I'll have a look into it. |
|
Great Work! Thanks @hannesill! 🙌 |
Why?
What changed?
mimic-iv-fulltoSUPPORTED_DATASETSwith default DuckDB path and verification table.duckdb_pathsandparquet_rootsfor bothdemoandfull, with detection andactive_dataset.m3 convert mimic-iv-[demo|full]: stream CSV.gz → Parquet via DuckDB.m3 init mimic-iv-[demo|full]: create DuckDB views over Parquet.m3 use [demo|full|bigquery]andm3 statusupdated to handle both local datasets.Important notes
m3 download mimic-iv-demo && m3 convert mimic-iv-demo(once), thenm3 init mimic-iv-demom3 convert mimic-iv-fullthenm3 init mimic-iv-fullOut of scope
m3 downloadstill supports demo only.Performance tuning
M3_CONVERT_MAX_WORKERS,M3_DUCKDB_MEM,M3_DUCKDB_THREADSenvironment variables supported.Tests
Branch
local_full_dataset