From a74eafb92c275e4eb7767712f841b657ef4d740e Mon Sep 17 00:00:00 2001 From: hill Date: Sun, 23 Nov 2025 02:17:05 -0500 Subject: [PATCH 1/5] refactor: Remove mimic-iv references from code and add DatasetRegistry --- README.md | 289 ++++++--------------------------------- src/m3/cli.py | 82 ++++++----- src/m3/config.py | 172 ++++++++++++----------- src/m3/data_io.py | 28 ++-- src/m3/datasets.py | 68 +++++++++ src/m3/mcp_server.py | 23 +++- tests/test_cli.py | 20 +-- tests/test_mcp_server.py | 9 +- 8 files changed, 295 insertions(+), 396 deletions(-) create mode 100644 src/m3/datasets.py diff --git a/README.md b/README.md index f11204d..7219796 100644 --- a/README.md +++ b/README.md @@ -4,7 +4,7 @@ M3 Logo -> **Query MIMIC-IV medical data using natural language through MCP clients** +> **Query tabular PhysioNet medical data using natural language through MCP clients** Python MCP @@ -12,15 +12,17 @@ Code Quality PRs Welcome -Transform medical data analysis with AI! Ask questions about MIMIC-IV data in plain English and get instant insights. Choose between local demo data (free) or full cloud dataset (BigQuery). +Transform medical data analysis with AI! Ask questions about MIMIC-IV and other PhysioNet datasets in plain English and get instant insights. Choose between local data (free) or full cloud dataset (BigQuery). ## Features -- šŸ” **Natural Language Queries**: Ask questions about MIMIC-IV data in plain English -- šŸ  **Local DuckDB + Parquet**: Fast local queries for demo and full dataset using Parquet files with DuckDB views +- šŸ” **Natural Language Queries**: Ask questions about your medical data in plain English +- šŸ  **Modular Datasets**: Support for any tabular PhysioNet dataset (MIMIC-IV, etc.) +- šŸ“‚ **Local DuckDB + Parquet**: Fast local queries using Parquet files with DuckDB views - ā˜ļø **BigQuery Support**: Access full MIMIC-IV dataset on Google Cloud - šŸ”’ **Enterprise Security**: OAuth2 authentication with JWT tokens and rate limiting - šŸ›”ļø **SQL Injection Protection**: Read-only queries with comprehensive validation +- 🧩 **Extensible Architecture**: Easily add new custom datasets via configuration or CLI ## šŸš€ Quick Start @@ -67,24 +69,30 @@ uv --version -**DuckDB (Demo or Full Dataset)** - +**DuckDB (Local Datasets)** To create a m3 directory and navigate into it run: ```shell mkdir m3 && cd m3 ``` -If you want to use the full dataset, download it manually from [PhysioNet](https://physionet.org/content/mimiciv/3.1/) and place it into `m3/m3_data/raw`. For using the demo set you can continue and run: +**Option A: MIMIC-IV Demo (Auto-Download)** ```shell uv init && uv add m3-mcp && \ -uv run m3 init DATASET_NAME && uv run m3 config --quick +uv run m3 init mimic-iv-demo && uv run m3 config --quick ``` -Replace `DATASET_NAME` with `mimic-iv-demo` or `mimic-iv-full` and copy & paste the output of this command into your client config JSON file. - -*Demo dataset (16MB raw download size) downloads automatically on first query.* +*Downloads ~16MB automatically.* -*Full dataset (10.6GB raw download size) needs to be downloaded manually.* +**Option B: Full Datasets (Manual Download)** +1. Download CSVs from PhysioNet. +2. Run init with source path: +```shell +uv run m3 init mimic-iv-full --src /path/to/raw/csvs +``` +3. 
Configure client: +```shell +uv run m3 config --quick +``` @@ -123,253 +131,48 @@ Paste this into your client config JSON file: --- -## Backend Comparison +## āž• Adding Custom Datasets -| Feature | DuckDB (Demo) | DuckDB (Full) | BigQuery (Full) | -|---------|---------------|---------------|-----------------| -| **Cost** | Free | Free | BigQuery usage fees | -| **Setup** | Zero config | Manual Download | GCP credentials required | -| **Data Size** | 100 patients, 275 admissions | 365k patients, 546k admissions | 365k patients, 546k admissions | -| **Speed** | Fast (local) | Fast (local) | Network latency | -| **Use Case** | Learning, development | Research (local) | Research, production | - ---- +M3 is designed to be modular. You can add support for any tabular dataset easily. -## Alternative Installation Methods +### 1. CLI Method (Ad-hoc) -> Already have Docker or prefer pip? Here are other ways to run m3: +If you have a folder of CSV/CSV.gz files, you can initialize it directly as a custom dataset: -### 🐳 Docker (No Python Required) - - - - - - -
- -**DuckDB (Local):** ```bash -git clone https://github.com/rafiattrach/m3.git && cd m3 -docker build -t m3:lite --target lite . -docker run -d --name m3-server m3:lite tail -f /dev/null -``` - - - -**BigQuery:** -```bash -git clone https://github.com/rafiattrach/m3.git && cd m3 -docker build -t m3:bigquery --target bigquery . -docker run -d --name m3-server \ - -e M3_BACKEND=bigquery \ - -e M3_PROJECT_ID=your-project-id \ - -v $HOME/.config/gcloud:/root/.config/gcloud:ro \ - m3:bigquery tail -f /dev/null -``` - -
- -**MCP config (same for both):** -```json -{ - "mcpServers": { - "m3": { - "command": "docker", - "args": ["exec", "-i", "m3-server", "python", "-m", "m3.mcp_server"] - } - } -} +# Not yet implemented in CLI but supported by architecture +# Future: m3 init --local /path/to/my/csvs --name my-custom-study ``` -Stop: `docker stop m3-server && docker rm m3-server` - -### pip Install + CLI Tools +Currently, you can register new datasets by creating a definition file. -```bash -pip install m3-mcp -``` +### 2. JSON Definition Method -> šŸ’” **CLI commands:** Run `m3 --help` to see all available options. +Create a JSON file in `m3_data/datasets/my_study.json`: -**Useful CLI commands:** -- `m3 init mimic-iv-demo` - Download demo database -- `m3 config` - Generate MCP configuration interactively -- `m3 config claude --backend bigquery --project-id YOUR_PROJECT_ID` - Quick BigQuery setup - -**Example MCP config:** ```json { - "mcpServers": { - "m3": { - "command": "m3-mcp-server", - "env": { - "M3_BACKEND": "duckdb" - } - } - } + "name": "my-study", + "description": "My custom clinical study data", + "file_listing_url": null, + "subdirectories_to_scan": ["data", "metadata"], + "default_duckdb_filename": "my_study.duckdb", + "tags": ["clinical", "custom"] } ``` -### Local Development - -For contributors: +Then initialize it: ```bash -git clone https://github.com/rafiattrach/m3.git && cd m3 -python -m venv .venv -source .venv/bin/activate # Windows: .venv\Scripts\activate -pip install -e ".[dev]" -pre-commit install +m3 init my-study --src /path/to/raw/csvs ``` -**MCP config:** -```json -{ - "mcpServers": { - "m3": { - "command": "/path/to/m3/.venv/bin/python", - "args": ["-m", "m3.mcp_server"], - "cwd": "/path/to/m3", - "env": { - "M3_BACKEND": "duckdb" - } - } - } -} -``` +M3 will: +1. Scan the source directory for CSVs +2. Convert them to Parquet +3. Create DuckDB views automatically (e.g. `data/patients.csv` -> table `data_patients`) -#### Using `UV` (Recommended) -Assuming you have [UV](https://docs.astral.sh/uv/getting-started/installation/) installed. - -**Step 1: Clone and Navigate** -```bash -# Clone the repository -git clone https://github.com/rafiattrach/m3.git -cd m3 -``` - -**Step 2: Create `UV` Virtual Environment** -```bash -# Create virtual environment -uv venv -``` - -**Step 3: Install M3** -```bash -uv sync -# Do not forget to use `uv run` to any subsequent commands to ensure you're using the `uv` virtual environment -``` - -### šŸ—„ļø Database Configuration - -After installation, choose your data source: - -#### Option A: Local Demo (DuckDB + Parquet) - -**Perfect for learning and development - completely free!** - -1. **Initialize demo dataset**: - ```bash - m3 init mimic-iv-demo - ``` - -2. **Setup MCP Client**: - ```bash - m3 config - ``` - - *Alternative: For Claude Desktop specifically:* - ```bash - m3 config claude --backend duckdb --db-path /Users/you/path/to/m3_data/databases/mimic_iv_demo.duckdb - ``` - -5. **Restart your MCP client** and ask: - - - "What tools do you have for MIMIC-IV data?" - - "Show me patient demographics from the ICU" - -#### Option B: Local Full Dataset (DuckDB + Parquet) - -**Run the entire MIMIC-IV dataset locally with DuckDB views over Parquet.** - -1. 
**Acquire CSVs** (requires PhysioNet credentials): - - Download the official MIMIC-IV CSVs from PhysioNet and place them under: - - `/Users/you/path/to/m3/m3_data/raw_files/mimic-iv-full/hosp/` - - `/Users/you/path/to/m3/m3_data/raw_files/mimic-iv-full/icu/` - - Note: `m3 init`'s auto-download function currently only supports the demo dataset. Use your browser or `wget` to obtain the full dataset. - -2. **Initialize full dataset**: - ```bash - m3 init mimic-iv-full - ``` - - This may take up to 30 minutes, depending on your system (e.g. 10 minutes for MacBook Pro M3) - - Performance knobs (optional): - ```bash - export M3_CONVERT_MAX_WORKERS=6 # number of parallel files (default=4) - export M3_DUCKDB_MEM=4GB # DuckDB memory limit per worker (default=3GB) - export M3_DUCKDB_THREADS=4 # DuckDB threads per worker (default=2) - ``` - Pay attention to your system specifications, especially if you have enough memory. - -3. **Select dataset and verify**: - ```bash - m3 use full # optional, as this automatically got set to full - m3 status - ``` - - Status prints active dataset, local DB path, Parquet presence, quick row counts and total Parquet size. - -4. **Configure MCP client** (uses the full local DB): - ```bash - m3 config - # or - m3 config claude --backend duckdb --db-path /Users/you/path/to/m3/m3_data/databases/mimic_iv_full.duckdb - ``` - -#### Option C: BigQuery (Full Dataset) - -**For researchers needing complete MIMIC-IV data** - -##### Prerequisites -- Google Cloud account and project with billing enabled -- Access to MIMIC-IV on BigQuery (requires PhysioNet credentialing) - -##### Setup Steps - -1. **Install Google Cloud CLI**: - - **macOS (with Homebrew):** - ```bash - brew install google-cloud-sdk - ``` - - **Windows:** Download from https://cloud.google.com/sdk/docs/install - - **Linux:** - ```bash - curl https://sdk.cloud.google.com | bash - ``` - -2. **Authenticate**: - ```bash - gcloud auth application-default login - ``` - *This will open your browser - choose the Google account that has access to your BigQuery project with MIMIC-IV data.* - -3. **Setup MCP Client for BigQuery**: - ```bash - m3 config - ``` - - *Alternative: For Claude Desktop specifically:* - ```bash - m3 config claude --backend bigquery --project-id YOUR_PROJECT_ID - ``` - -4. **Test BigQuery Access** - Restart your MCP client and ask: - ``` - Use the get_race_distribution function to show me the top 5 races in MIMIC-IV admissions. - ``` +--- ## šŸ”§ Advanced Configuration @@ -412,12 +215,6 @@ m3 config # Choose OAuth2 option during setup - Auth0, Google Identity Platform, Microsoft Azure AD, Keycloak - Any OAuth2/OpenID Connect compliant provider -**Key Benefits:** -- šŸ”’ **JWT Token Validation**: Industry-standard security -- šŸŽÆ **Scope-based Access**: Fine-grained permissions -- šŸ›”ļø **Rate Limiting**: Abuse protection -- šŸ“Š **Audit Logging**: Security monitoring - > šŸ“– **Complete OAuth2 Setup Guide**: See [`docs/OAUTH2_AUTHENTICATION.md`](docs/OAUTH2_AUTHENTICATION.md) for detailed configuration, troubleshooting, and production deployment guidelines. 
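The JSON definitions described in the *Adding Custom Datasets* section above map directly onto the `DatasetRegistry` introduced in `src/m3/datasets.py`. A minimal programmatic sketch, assuming the module from this patch is importable and using a hypothetical `my-study` dataset:

```python
from m3.datasets import DatasetDefinition, DatasetRegistry

# Hypothetical custom study; the name, folders, and tags are placeholders.
my_study = DatasetDefinition(
    name="my-study",
    description="My custom clinical study data",
    subdirectories_to_scan=["data", "metadata"],
    tags=["clinical", "custom"],
)
DatasetRegistry.register(my_study)

# default_duckdb_filename is derived in __post_init__ when omitted.
assert DatasetRegistry.get("my-study").default_duckdb_filename == "my_study.duckdb"
print([ds.name for ds in DatasetRegistry.list_all()])
```

Built-in definitions are registered at import time, while JSON files under `m3_data/datasets/` are picked up by `_load_custom_datasets()` in `src/m3/config.py` through the same registry.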
--- @@ -439,7 +236,7 @@ Try asking your MCP client these questions: **Demographics & Statistics:** -- `Prompt:` *What is the race distribution in MIMIC-IV admissions?* +- `Prompt:` *What is the race distribution in admissions?* - `Prompt:` *Show me patient demographics for ICU stays* - `Prompt:` *How many total admissions are in the database?* diff --git a/src/m3/cli.py b/src/m3/cli.py index cc7a4dc..a05dd00 100644 --- a/src/m3/cli.py +++ b/src/m3/cli.py @@ -2,13 +2,13 @@ import subprocess import sys from pathlib import Path -from typing import Annotated +from typing import Annotated, Optional import typer from m3 import __version__ +from m3.datasets import DatasetRegistry from m3.config import ( - SUPPORTED_DATASETS, detect_available_local_datasets, get_active_dataset, get_dataset_config, @@ -81,7 +81,7 @@ def dataset_init_cmd( typer.Argument( help=( "Dataset to initialize (local). Default: 'mimic-iv-demo'. " - f"Supported: {', '.join(SUPPORTED_DATASETS.keys())}" + f"Supported: {', '.join([ds.name for ds in DatasetRegistry.list_all()])}" ), metavar="DATASET_NAME", ), @@ -109,11 +109,10 @@ def dataset_init_cmd( - If Parquet exists: only initialize DuckDB views - If raw CSV.gz exists but Parquet is missing: convert then initialize - If neither exists: download (demo only), convert, then initialize - + Notes: - - Auto-download currently supports only 'mimic-iv-demo'. For 'mimic-iv-full', - place the official raw CSV.gz files under /m3_data/raw_files// - with 'hosp/' and 'icu/' subdirectories, then re-run this command. + - Auto-download is based on the dataset definition URL. + - For datasets without a download URL (e.g. mimic-iv-full), you must provide the --src path or place files in the expected location. """ logger.info(f"CLI 'init' called for dataset: '{dataset_name}'") @@ -126,7 +125,7 @@ def dataset_init_cmd( err=True, ) typer.secho( - f"Supported datasets are: {', '.join(SUPPORTED_DATASETS.keys())}", + f"Supported datasets are: {', '.join([ds.name for ds in DatasetRegistry.list_all()])}", fg=typer.colors.YELLOW, err=True, ) @@ -144,7 +143,6 @@ def dataset_init_cmd( csv_root = Path(src).resolve() if src else csv_root_default # Presence detection (check for any parquet or csv.gz files) - # NOTE: Checks need to be more robust as soon as we support the full dataset for download (don't just check for any file, but that no files are missing) parquet_present = any(pq_root.rglob("*.parquet")) raw_present = any(csv_root.rglob("*.csv.gz")) @@ -154,12 +152,13 @@ def dataset_init_cmd( # Step 1: Ensure raw dataset exists (download demo if missing; for full, inform and return) if not raw_present and not parquet_present: - if dataset_key == "mimic-iv-demo": + listing_url = dataset_config.get('file_listing_url') + if listing_url: out_dir = csv_root_default out_dir.mkdir(parents=True, exist_ok=True) typer.echo(f"Downloading dataset: '{dataset_key}'") - typer.echo(f"Listing URL: {dataset_config.get('file_listing_url')}") + typer.echo(f"Listing URL: {listing_url}") typer.echo(f"Output directory: {out_dir}") ok = download_dataset(dataset_key, out_dir) @@ -177,16 +176,16 @@ def dataset_init_cmd( raw_present = True else: typer.secho( - "Auto-download is only supported for 'mimic-iv-demo'.", + f"Auto-download is not available for '{dataset_key}'.", fg=typer.colors.YELLOW, ) typer.secho( ( - "To initialize 'mimic-iv-full':\n" - "1) Download the official MIMIC-IV dataset from PhysioNet (this requires a PhysioNet account with dataset access)\n" - "2) Place the raw CSV.gz files under: {csv_root_default}\n" - " 
Ensure the structure includes 'hosp/' and 'icu/' subdirectories.\n" - "3) Then re-run: m3 init mimic-iv-full" + "To initialize this dataset:\n" + "1) Download the raw data manually.\n" + f"2) Place the raw CSV.gz files under: {csv_root_default}\n" + " (or use --src to point to their location)\n" + f"3) Then re-run: m3 init {dataset_key}" ), fg=typer.colors.WHITE, ) @@ -207,7 +206,7 @@ def dataset_init_cmd( raise typer.Exit(code=1) typer.secho("āœ… Conversion complete.", fg=typer.colors.GREEN) - # Step 2: Initialize DuckDB over Parquet + # Step 3: Initialize DuckDB over Parquet final_db_path = ( Path(db_path_str).resolve() if db_path_str @@ -287,10 +286,7 @@ def dataset_init_cmd( ) # Set active dataset to match init target - if dataset_key == "mimic-iv-demo": - set_active_dataset("demo") - elif dataset_key == "mimic-iv-full": - set_active_dataset("full") + set_active_dataset(dataset_key) @app.command("use") @@ -298,27 +294,36 @@ def use_cmd( target: Annotated[ str, typer.Argument( - help="Select active dataset: demo | full | bigquery", metavar="TARGET" + help="Select active dataset: name | bigquery", metavar="TARGET" ), ], ): """Set the active dataset selection for the project.""" target = target.lower() - if target not in ("demo", "full", "bigquery"): + + # Check if it is bigquery + if target == "bigquery": + set_active_dataset(target) + typer.secho(f"Active dataset set to '{target}'.", fg=typer.colors.GREEN) + return + + # Check if local availability + availability = detect_available_local_datasets().get(target) + if not availability: typer.secho( - "Target must be one of: demo, full, bigquery", fg=typer.colors.RED, err=True + f"Dataset '{target}' not found or not registered.", + fg=typer.colors.RED, + err=True ) raise typer.Exit(code=1) - if target in ("demo", "full"): - availability = detect_available_local_datasets()[target] - if not availability["parquet_present"]: - typer.secho( - f"Parquet directory missing at {availability['parquet_root']}. Cannot activate '{target}'.", - fg=typer.colors.RED, - err=True, - ) - raise typer.Exit(code=1) + if not availability["parquet_present"]: + typer.secho( + f"Parquet directory missing at {availability['parquet_root']}. 
Cannot activate '{target}'.", + fg=typer.colors.RED, + err=True, + ) + raise typer.Exit(code=1) set_active_dataset(target) typer.secho(f"Active dataset set to '{target}'.", fg=typer.colors.GREEN) @@ -334,9 +339,11 @@ def status_cmd(): ) availability = detect_available_local_datasets() + if not availability: + typer.echo("No datasets detected.") + return - for label in ("demo", "full"): - info = availability[label] + for label, info in availability.items(): typer.secho(f"\n=== {label.upper()} ===", fg=typer.colors.BRIGHT_BLUE) parquet_icon = "āœ…" if info["parquet_present"] else "āŒ" @@ -355,8 +362,7 @@ def status_cmd(): typer.echo(" parquet_size_gb: (skipped)") # Try a quick rowcount on the verification table if db present - ds_name = "mimic-iv-demo" if label == "demo" else "mimic-iv-full" - cfg = get_dataset_config(ds_name) + cfg = get_dataset_config(label) if info["db_present"] and cfg: try: count = verify_table_rowcount( diff --git a/src/m3/config.py b/src/m3/config.py index fd094e7..6ee6002 100644 --- a/src/m3/config.py +++ b/src/m3/config.py @@ -1,6 +1,10 @@ import json import logging from pathlib import Path +import dataclasses +from typing import Dict, Any, Optional + +from m3.datasets import DatasetRegistry, DatasetDefinition APP_NAME = "m3" @@ -38,38 +42,35 @@ def _get_project_root() -> Path: _DEFAULT_DATABASES_DIR = _PROJECT_DATA_DIR / "databases" _DEFAULT_PARQUET_DIR = _PROJECT_DATA_DIR / "parquet" _RUNTIME_CONFIG_PATH = _PROJECT_DATA_DIR / "config.json" - -# -------------------------------------------------- -# Dataset configurations (add more entries as needed) -# -------------------------------------------------- -SUPPORTED_DATASETS = { - "mimic-iv-demo": { - "file_listing_url": "https://physionet.org/files/mimic-iv-demo/2.2/", - "subdirectories_to_scan": ["hosp", "icu"], - "default_duckdb_filename": "mimic_iv_demo.duckdb", - "primary_verification_table": "hosp_admissions", - }, - "mimic-iv-full": { - "file_listing_url": None, - "subdirectories_to_scan": ["hosp", "icu"], - "default_duckdb_filename": "mimic_iv_full.duckdb", - "primary_verification_table": "hosp_admissions", - }, -} - -# Dataset name aliases used on the CLI -CLI_DATASET_ALIASES = { - "demo": "mimic-iv-demo", - "full": "mimic-iv-full", -} +_CUSTOM_DATASETS_DIR = _PROJECT_DATA_DIR / "datasets" # -------------------------------------------------- # Helper functions # -------------------------------------------------- +def _load_custom_datasets(): + """Load custom dataset definitions from JSON files in m3_data/datasets/.""" + if not _CUSTOM_DATASETS_DIR.exists(): + logger.warning(f"Custom datasets directory does not exist: {_CUSTOM_DATASETS_DIR}") + return + + for f in _CUSTOM_DATASETS_DIR.glob("*.json"): + try: + data = json.loads(f.read_text()) + # Basic validation/loading + ds = DatasetDefinition(**data) + DatasetRegistry.register(ds) + except Exception as e: + logger.warning(f"Failed to load custom dataset from {f}: {e}") + + def get_dataset_config(dataset_name: str) -> dict | None: """Retrieve the configuration for a given dataset (case-insensitive).""" - return SUPPORTED_DATASETS.get(dataset_name.lower()) + # Ensure custom datasets are loaded + _load_custom_datasets() + + ds = DatasetRegistry.get(dataset_name.lower()) + return dataclasses.asdict(ds) if ds else None def get_default_database_path(dataset_name: str) -> Path | None: @@ -77,7 +78,6 @@ def get_default_database_path(dataset_name: str) -> Path | None: Return the default local DuckDB path for a given dataset, under /m3_data/databases/. 
""" - cfg = get_dataset_config(dataset_name) if not cfg: logger.warning( @@ -116,19 +116,16 @@ def _ensure_data_dirs(): _DEFAULT_DATABASES_DIR.mkdir(parents=True, exist_ok=True) _DEFAULT_PARQUET_DIR.mkdir(parents=True, exist_ok=True) _PROJECT_DATA_DIR.mkdir(parents=True, exist_ok=True) + _CUSTOM_DATASETS_DIR.mkdir(parents=True, exist_ok=True) def _get_default_runtime_config() -> dict: + # We initialize with empty overrides. + # Paths are derived dynamically from registry unless overridden here. return { "active_dataset": None, - "duckdb_paths": { - "demo": str(get_default_database_path("mimic-iv-demo") or ""), - "full": str(get_default_database_path("mimic-iv-full") or ""), - }, - "parquet_roots": { - "demo": str(get_dataset_parquet_root("mimic-iv-demo") or ""), - "full": str(get_dataset_parquet_root("mimic-iv-full") or ""), - }, + "duckdb_paths": {}, # Map dataset_name -> path + "parquet_roots": {}, # Map dataset_name -> path } @@ -153,76 +150,77 @@ def _has_parquet_files(path: Path | None) -> bool: return bool(path and path.exists() and any(path.rglob("*.parquet"))) -def detect_available_local_datasets() -> dict: - """Return presence flags for demo/full based on Parquet roots and DuckDB files.""" +def detect_available_local_datasets() -> Dict[str, Dict[str, Any]]: + """Return presence flags for all registered datasets.""" + _load_custom_datasets() cfg = load_runtime_config() - demo_parquet_path = ( - Path(cfg["parquet_roots"]["demo"]) - if cfg["parquet_roots"]["demo"] - else get_dataset_parquet_root("mimic-iv-demo") - ) - full_parquet_path = ( - Path(cfg["parquet_roots"]["full"]) - if cfg["parquet_roots"]["full"] - else get_dataset_parquet_root("mimic-iv-full") - ) - demo_db_path = ( - Path(cfg["duckdb_paths"]["demo"]) - if cfg["duckdb_paths"]["demo"] - else get_default_database_path("mimic-iv-demo") - ) - full_db_path = ( - Path(cfg["duckdb_paths"]["full"]) - if cfg["duckdb_paths"]["full"] - else get_default_database_path("mimic-iv-full") - ) - return { - "demo": { - "parquet_present": _has_parquet_files(demo_parquet_path), - "db_present": bool(demo_db_path and demo_db_path.exists()), - "parquet_root": str(demo_parquet_path) if demo_parquet_path else "", - "db_path": str(demo_db_path) if demo_db_path else "", - }, - "full": { - "parquet_present": _has_parquet_files(full_parquet_path), - "db_present": bool(full_db_path and full_db_path.exists()), - "parquet_root": str(full_parquet_path) if full_parquet_path else "", - "db_path": str(full_db_path) if full_db_path else "", - }, - } + + results = {} + + # Check all registered datasets + for ds in DatasetRegistry.list_all(): + name = ds.name + + # Determine paths (check config overrides first) + parquet_root_str = cfg.get("parquet_roots", {}).get(name) + parquet_root = Path(parquet_root_str) if parquet_root_str else get_dataset_parquet_root(name) + + db_path_str = cfg.get("duckdb_paths", {}).get(name) + db_path = Path(db_path_str) if db_path_str else get_default_database_path(name) + + results[name] = { + "parquet_present": _has_parquet_files(parquet_root), + "db_present": bool(db_path and db_path.exists()), + "parquet_root": str(parquet_root) if parquet_root else "", + "db_path": str(db_path) if db_path else "", + } + + return results def get_active_dataset() -> str | None: + """Get the active dataset name.""" cfg = load_runtime_config() active = cfg.get("active_dataset") - if active in CLI_DATASET_ALIASES: - return CLI_DATASET_ALIASES[active] + + if not active: + # Auto-detect default: prefer demo, then full + availability = 
detect_available_local_datasets() + if availability.get("mimic-iv-demo", {}).get("parquet_present"): + return "mimic-iv-demo" + if availability.get("mimic-iv-full", {}).get("parquet_present"): + return "mimic-iv-full" + return None + if active == "bigquery": return "bigquery" - # Auto-detect default: prefer demo, then full - availability = detect_available_local_datasets() - if availability["demo"]["parquet_present"]: - return CLI_DATASET_ALIASES["demo"] - if availability["full"]["parquet_present"]: - return CLI_DATASET_ALIASES["full"] - - logger.warning("Unknown active_dataset value in config: %s", active) - return None + + return active def set_active_dataset(choice: str) -> None: - if choice not in ("demo", "full", "bigquery"): - raise ValueError("active_dataset must be one of: demo, full, bigquery") + # Allow registered names, or 'bigquery' + valid_names = {"bigquery"} | {ds.name for ds in DatasetRegistry.list_all()} + + if choice not in valid_names: + # It might be a new custom dataset not yet loaded in this process? + # We'll allow it if it's in the registry now. + _load_custom_datasets() + if not DatasetRegistry.get(choice): + raise ValueError(f"active_dataset must be a registered dataset or 'bigquery'. Got: {choice}") + cfg = load_runtime_config() cfg["active_dataset"] = choice save_runtime_config(cfg) def get_duckdb_path_for(choice: str) -> Path | None: - key = "mimic-iv-demo" if choice == "demo" else "mimic-iv-full" - return get_default_database_path(key) if choice in ("demo", "full") else None + if choice == "bigquery": + return None + return get_default_database_path(choice) def get_parquet_root_for(choice: str) -> Path | None: - key = "mimic-iv-demo" if choice == "demo" else "mimic-iv-full" - return get_dataset_parquet_root(key) if choice in ("demo", "full") else None + if choice == "bigquery": + return None + return get_dataset_parquet_root(choice) diff --git a/src/m3/data_io.py b/src/m3/data_io.py index f5d7d92..54f7c9c 100644 --- a/src/m3/data_io.py +++ b/src/m3/data_io.py @@ -113,14 +113,24 @@ def _download_dataset_files( all_files_to_process = [] # List of (url, local_target_path) - for subdir_name in subdirs_to_scan: - subdir_listing_url = urljoin(base_listing_url, f"{subdir_name}/") - logger.info(f"Scanning subdirectory for CSVs: {subdir_listing_url}") - csv_urls_in_subdir = _scrape_urls_from_html_page(subdir_listing_url, session) + # Prepare list of (subdir_name, listing_url) + # If subdirs_to_scan is empty, we scan the base_listing_url directly (root) + scan_targets = [] + if not subdirs_to_scan: + scan_targets.append(("", base_listing_url)) + else: + for subdir in subdirs_to_scan: + # Ensure slash for directory joining + subdir_url = urljoin(base_listing_url, f"{subdir}/") + scan_targets.append((subdir, subdir_url)) + + for subdir_name, listing_url in scan_targets: + logger.info(f"Scanning for CSVs: {listing_url}") + csv_urls_in_subdir = _scrape_urls_from_html_page(listing_url, session) if not csv_urls_in_subdir: logger.warning( - f"No .csv.gz files found in subdirectory: {subdir_listing_url}" + f"No .csv.gz files found in location: {listing_url}" ) continue @@ -161,8 +171,7 @@ def _download_dataset_files( if not all_files_to_process: logger.error( - f"No '.csv.gz' download links found after scanning {base_listing_url} " - f"and its subdirectories {subdirs_to_scan} for dataset '{dataset_name}'." + f"No '.csv.gz' download links found for dataset '{dataset_name}'." 
) return False @@ -359,11 +368,12 @@ def init_duckdb_from_parquet(dataset_name: str, db_target_path: Path) -> bool: def _create_duckdb_with_views(db_path: Path, parquet_root: Path) -> bool: """ Create a DuckDB database and define one view per Parquet file, - using the proper table naming structure that matches MIMIC-IV expectations. + using a generic table naming structure: folder_subfolder_filename. For example: - hosp/admissions.parquet → view: hosp_admissions - icu/chartevents.parquet → view: icu_chartevents + - data.parquet → view: data """ con = duckdb.connect(str(db_path)) try: @@ -460,7 +470,7 @@ def ensure_duckdb_for_dataset( dataset_key: str, ) -> tuple[bool, Path | None, Path | None]: """ - Ensure DuckDB exists and views are created for the dataset ('mimic-iv-demo'|'mimic-iv-full'). + Ensure DuckDB exists and views are created for the dataset. Returns (ok, db_path, parquet_root). """ db_path = get_default_database_path(dataset_key) diff --git a/src/m3/datasets.py b/src/m3/datasets.py new file mode 100644 index 0000000..99e10cc --- /dev/null +++ b/src/m3/datasets.py @@ -0,0 +1,68 @@ +from dataclasses import dataclass, field +from typing import List, Optional, Dict + +@dataclass +class DatasetDefinition: + name: str + description: str = "" + version: str = "1.0" + file_listing_url: Optional[str] = None + subdirectories_to_scan: List[str] = field(default_factory=list) + default_duckdb_filename: Optional[str] = None + primary_verification_table: Optional[str] = None + tags: List[str] = field(default_factory=list) + + # For backward compatibility or ease of use, we might add a way to access as dict if needed, + # but we'll try to use object access. + + def __post_init__(self): + if not self.default_duckdb_filename: + self.default_duckdb_filename = f"{self.name.replace('-', '_')}.duckdb" + +class DatasetRegistry: + _registry: Dict[str, DatasetDefinition] = {} + + @classmethod + def register(cls, dataset: DatasetDefinition): + cls._registry[dataset.name.lower()] = dataset + + @classmethod + def get(cls, name: str) -> Optional[DatasetDefinition]: + return cls._registry.get(name.lower()) + + @classmethod + def list_all(cls) -> List[DatasetDefinition]: + return list(cls._registry.values()) + + @classmethod + def reset(cls): + cls._registry.clear() + cls._register_builtins() + + @classmethod + def _register_builtins(cls): + # Built-in datasets + demo = DatasetDefinition( + name="mimic-iv-demo", + description="MIMIC-IV Clinical Database Demo", + file_listing_url="https://physionet.org/files/mimic-iv-demo/2.2/", + subdirectories_to_scan=["hosp", "icu"], + primary_verification_table="hosp_admissions", + tags=["mimic", "clinical", "demo"] + ) + + full = DatasetDefinition( + name="mimic-iv-full", + description="MIMIC-IV Clinical Database (Full)", + file_listing_url=None, # Requires auth, manual download instructions + subdirectories_to_scan=["hosp", "icu"], + primary_verification_table="hosp_admissions", + tags=["mimic", "clinical", "full"] + ) + + cls.register(demo) + cls.register(full) + +# Initialize registry +DatasetRegistry._register_builtins() + diff --git a/src/m3/mcp_server.py b/src/m3/mcp_server.py index 1a7ad6f..d9e61fb 100644 --- a/src/m3/mcp_server.py +++ b/src/m3/mcp_server.py @@ -11,7 +11,7 @@ from fastmcp import FastMCP from m3.auth import init_oauth2, require_oauth2 -from m3.config import get_default_database_path +from m3.config import get_default_database_path, get_active_dataset # Create FastMCP server instance mcp = FastMCP("m3") @@ -141,10 +141,20 @@ def _init_backend(): if 
_backend == "duckdb": _db_path = os.getenv("M3_DB_PATH") if not _db_path: - path = get_default_database_path("mimic-iv-demo") - _db_path = str(path) if path else None + # Try to detect active dataset if not set + active = get_active_dataset() + if active and active != "bigquery": + path = get_default_database_path(active) + _db_path = str(path) if path else None + else: + # Fallback to demo if we can't figure it out + path = get_default_database_path("mimic-iv-demo") + _db_path = str(path) if path else None + if not _db_path or not Path(_db_path).exists(): - raise FileNotFoundError(f"DuckDB database not found: {_db_path}") + # We don't raise here to allow server to start even if DB is missing (e.g. for 'config' command usage via import) + # But runtime queries will fail. + pass elif _backend == "bigquery": try: @@ -188,6 +198,9 @@ def _get_backend_info() -> str: def _execute_duckdb_query(sql_query: str) -> str: """Execute DuckDB query - internal function.""" + if not _db_path or not Path(_db_path).exists(): + return "āŒ Error: Database file not found. Please initialize a dataset using 'm3 init'." + try: conn = duckdb.connect(_db_path) try: @@ -555,6 +568,8 @@ def get_icu_stays(patient_id: int | None = None, limit: int = 10) -> str: # Try common ICU table names based on backend if _backend == "duckdb": + # More robust check: look for available tables first? + # For now we guess common naming convention icustays_table = "icu_icustays" else: # bigquery icustays_table = "`physionet-data.mimiciv_3_1_icu.icustays`" diff --git a/tests/test_cli.py b/tests/test_cli.py index 8e55966..d352bd7 100644 --- a/tests/test_cli.py +++ b/tests/test_cli.py @@ -177,7 +177,7 @@ def test_config_claude_infers_db_path_demo( def test_config_claude_infers_db_path_full( mock_active, mock_get_default, mock_subprocess ): - mock_active.return_value = "full" + mock_active.return_value = "mimic-iv-full" mock_get_default.return_value = Path("/tmp/inferred-full.duckdb") mock_subprocess.return_value = MagicMock(returncode=0) @@ -193,13 +193,13 @@ def test_config_claude_infers_db_path_full( @patch("m3.cli.detect_available_local_datasets") def test_use_full_happy_path(mock_detect, mock_set_active): mock_detect.return_value = { - "demo": { + "mimic-iv-demo": { "parquet_present": False, "db_present": False, "parquet_root": "/tmp/demo", "db_path": "/tmp/demo.duckdb", }, - "full": { + "mimic-iv-full": { "parquet_present": True, "db_present": False, "parquet_root": "/tmp/full", @@ -207,24 +207,24 @@ def test_use_full_happy_path(mock_detect, mock_set_active): }, } - result = runner.invoke(app, ["use", "full"]) + result = runner.invoke(app, ["use", "mimic-iv-full"]) assert result.exit_code == 0 - assert "Active dataset set to 'full'." in result.stdout - mock_set_active.assert_called_once_with("full") + assert "Active dataset set to 'mimic-iv-full'." 
in result.stdout + mock_set_active.assert_called_once_with("mimic-iv-full") @patch("m3.cli.compute_parquet_dir_size", return_value=123) -@patch("m3.cli.get_active_dataset", return_value="full") +@patch("m3.cli.get_active_dataset", return_value="mimic-iv-full") @patch("m3.cli.detect_available_local_datasets") def test_status_happy_path(mock_detect, mock_active, mock_size): mock_detect.return_value = { - "demo": { + "mimic-iv-demo": { "parquet_present": True, "db_present": False, "parquet_root": "/tmp/demo", "db_path": "/tmp/demo.duckdb", }, - "full": { + "mimic-iv-full": { "parquet_present": True, "db_present": False, "parquet_root": "/tmp/full", @@ -234,6 +234,6 @@ def test_status_happy_path(mock_detect, mock_active, mock_size): result = runner.invoke(app, ["status"]) assert result.exit_code == 0 - assert "Active dataset: full" in result.stdout + assert "Active dataset: mimic-iv-full" in result.stdout size_gb = 123 / (1024**3) assert f"parquet_size_gb: {size_gb:.4f} GB" in result.stdout diff --git a/tests/test_mcp_server.py b/tests/test_mcp_server.py index 643a158..0970723 100644 --- a/tests/test_mcp_server.py +++ b/tests/test_mcp_server.py @@ -62,8 +62,13 @@ def test_backend_init_duckdb_missing_db(self): with patch("m3.mcp_server.get_default_database_path") as mock_path: mock_path.return_value = Path("/fake/path.duckdb") with patch("pathlib.Path.exists", return_value=False): - with pytest.raises(FileNotFoundError): - _init_backend() + _init_backend() + # Verify that we didn't crash and that the path is set, + # allowing the runtime check in _execute_duckdb_query to handle it gracefully. + import m3.mcp_server + + assert m3.mcp_server._db_path == str(Path("/fake/path.duckdb")) + assert m3.mcp_server._backend == "duckdb" @pytest.mark.skipif( not _bigquery_available(), reason="BigQuery dependencies not available" From fb1e538b9edebbc73c387b81e7dc825b06a8867e Mon Sep 17 00:00:00 2001 From: hill Date: Mon, 24 Nov 2025 17:33:56 -0500 Subject: [PATCH 2/5] feat: Add comprehensive PhysioNet support and BigQuery backend - Implement dynamic dataset registry supporting MIMIC-IV Demo and Full in src/m3/datasets.py - Add BigQuery backend support to MCP server for cloud-based data access in src/m3/mcp_server.py - Update CLI to handle credentialed dataset initialization and active dataset switching in src/m3/cli.py - Enhance MCP server tools to be dataset-aware and provide better error guidance - Improve security with SQL injection prevention and safer query execution - Refactor configuration management in src/m3/config.py for better runtime dataset detection - Update CI workflows for uv integration - Add comprehensive tests for new backends and tools --- .github/workflows/pre-commit.yaml | 2 - .github/workflows/tests.yaml | 1 - src/m3/cli.py | 75 +++++++---- src/m3/config.py | 78 ++++++----- src/m3/data_io.py | 17 ++- src/m3/datasets.py | 52 +++++--- src/m3/mcp_server.py | 214 ++++++++++++++++++++---------- tests/test_mcp_server.py | 176 ++++++++++++++---------- 8 files changed, 391 insertions(+), 224 deletions(-) diff --git a/.github/workflows/pre-commit.yaml b/.github/workflows/pre-commit.yaml index 8d796c4..89a5e7a 100644 --- a/.github/workflows/pre-commit.yaml +++ b/.github/workflows/pre-commit.yaml @@ -2,9 +2,7 @@ name: Pre-commit checks on: push: - branches: [main] pull_request: - branches: [main] jobs: pre-commit: diff --git a/.github/workflows/tests.yaml b/.github/workflows/tests.yaml index 55c541d..73134f1 100644 --- a/.github/workflows/tests.yaml +++ b/.github/workflows/tests.yaml @@ -2,7 +2,6 
@@ name: Tests on: push: - branches: [main] pull_request: jobs: diff --git a/src/m3/cli.py b/src/m3/cli.py index a05dd00..f3df756 100644 --- a/src/m3/cli.py +++ b/src/m3/cli.py @@ -2,12 +2,11 @@ import subprocess import sys from pathlib import Path -from typing import Annotated, Optional +from typing import Annotated import typer from m3 import __version__ -from m3.datasets import DatasetRegistry from m3.config import ( detect_available_local_datasets, get_active_dataset, @@ -24,6 +23,7 @@ init_duckdb_from_parquet, verify_table_rowcount, ) +from m3.datasets import DatasetRegistry app = typer.Typer( name="m3", @@ -109,7 +109,7 @@ def dataset_init_cmd( - If Parquet exists: only initialize DuckDB views - If raw CSV.gz exists but Parquet is missing: convert then initialize - If neither exists: download (demo only), convert, then initialize - + Notes: - Auto-download is based on the dataset definition URL. - For datasets without a download URL (e.g. mimic-iv-full), you must provide the --src path or place files in the expected location. @@ -150,9 +150,34 @@ def dataset_init_cmd( typer.echo(f"Raw root: {csv_root} (present={raw_present})") typer.echo(f"Parquet root: {pq_root} (present={parquet_present})") - # Step 1: Ensure raw dataset exists (download demo if missing; for full, inform and return) + # Step 1: Ensure raw dataset exists (download if missing, for requires_authentication datasets, inform and return) if not raw_present and not parquet_present: - listing_url = dataset_config.get('file_listing_url') + requires_auth = dataset_config.get("requires_authentication", False) + + if requires_auth: + base_url = dataset_config.get("file_listing_url") + + typer.secho( + f"āŒ Files not found for credentialed dataset '{dataset_key}'.", + fg=typer.colors.RED, + ) + typer.echo("To download this credentialed dataset:") + typer.echo( + f"1. Ensure you have signed the DUA at: {base_url or 'https://physionet.org'}" + ) + typer.echo( + "2. Run this command (you will be asked for your PhysioNet password):" + ) + typer.echo("") + + # Wget command tailored to the user's path + wget_cmd = f"wget -r -N -c -np --user YOUR_USERNAME --ask-password {base_url} -P {csv_root}" + typer.secho(f" {wget_cmd}", fg=typer.colors.CYAN) + typer.echo("") + typer.echo(f"3. Re-run 'm3 init {dataset_key}'") + return + + listing_url = dataset_config.get("file_listing_url") if listing_url: out_dir = csv_root_default out_dir.mkdir(parents=True, exist_ok=True) @@ -294,40 +319,44 @@ def use_cmd( target: Annotated[ str, typer.Argument( - help="Select active dataset: name | bigquery", metavar="TARGET" + help="Select active dataset: name (e.g., mimic-iv-full)", metavar="TARGET" ), ], ): """Set the active dataset selection for the project.""" target = target.lower() - - # Check if it is bigquery - if target == "bigquery": - set_active_dataset(target) - typer.secho(f"Active dataset set to '{target}'.", fg=typer.colors.GREEN) - return - - # Check if local availability + + # 1. Check if dataset is registered + # We use detect_available_local_datasets just to get the list + status, + # but we could also just check DatasetRegistry directly. availability = detect_available_local_datasets().get(target) - if not availability: - typer.secho( - f"Dataset '{target}' not found or not registered.", - fg=typer.colors.RED, - err=True - ) - raise typer.Exit(code=1) - if not availability["parquet_present"]: + if not availability: typer.secho( - f"Parquet directory missing at {availability['parquet_root']}. 
Cannot activate '{target}'.", + f"Dataset '{target}' not found or not registered.", fg=typer.colors.RED, err=True, ) + # List available + supported = ", ".join([ds.name for ds in DatasetRegistry.list_all()]) + typer.secho(f"Supported datasets: {supported}", fg=typer.colors.YELLOW) raise typer.Exit(code=1) + # 2. Set it active immediately (don't block on files) set_active_dataset(target) typer.secho(f"Active dataset set to '{target}'.", fg=typer.colors.GREEN) + # 3. Warn if local files are missing (helpful info, not a blocker) + if not availability["parquet_present"]: + typer.secho( + f"āš ļø Note: Local Parquet files not found at {availability['parquet_root']}.", + fg=typer.colors.YELLOW, + ) + typer.echo( + " This is fine if you are using the BigQuery backend.\n" + " If you intend to use DuckDB (local), run 'm3 init' first." + ) + @app.command("status") def status_cmd(): diff --git a/src/m3/config.py b/src/m3/config.py index 6ee6002..b368b3b 100644 --- a/src/m3/config.py +++ b/src/m3/config.py @@ -1,10 +1,11 @@ +import dataclasses import json import logging +import os from pathlib import Path -import dataclasses -from typing import Dict, Any, Optional +from typing import Any -from m3.datasets import DatasetRegistry, DatasetDefinition +from m3.datasets import DatasetDefinition, DatasetRegistry APP_NAME = "m3" @@ -51,7 +52,9 @@ def _get_project_root() -> Path: def _load_custom_datasets(): """Load custom dataset definitions from JSON files in m3_data/datasets/.""" if not _CUSTOM_DATASETS_DIR.exists(): - logger.warning(f"Custom datasets directory does not exist: {_CUSTOM_DATASETS_DIR}") + logger.warning( + f"Custom datasets directory does not exist: {_CUSTOM_DATASETS_DIR}" + ) return for f in _CUSTOM_DATASETS_DIR.glob("*.json"): @@ -68,7 +71,7 @@ def get_dataset_config(dataset_name: str) -> dict | None: """Retrieve the configuration for a given dataset (case-insensitive).""" # Ensure custom datasets are loaded _load_custom_datasets() - + ds = DatasetRegistry.get(dataset_name.lower()) return dataclasses.asdict(ds) if ds else None @@ -120,12 +123,12 @@ def _ensure_data_dirs(): def _get_default_runtime_config() -> dict: - # We initialize with empty overrides. + # We initialize with empty overrides. # Paths are derived dynamically from registry unless overridden here. 
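    # Illustrative only (values are hypothetical): a populated m3_data/config.json
    # could look like the following; "active_dataset" is written by `m3 use` / `m3 init`,
    # while the two path maps stay empty unless explicitly overridden.
    #   {
    #     "active_dataset": "mimic-iv-demo",
    #     "duckdb_paths": {"mimic-iv-demo": "/path/to/m3_data/databases/mimic_iv_demo.duckdb"},
    #     "parquet_roots": {"mimic-iv-demo": "/path/to/m3_data/parquet/mimic-iv-demo"}
    #   }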
return { "active_dataset": None, "duckdb_paths": {}, # Map dataset_name -> path - "parquet_roots": {}, # Map dataset_name -> path + "parquet_roots": {}, # Map dataset_name -> path } @@ -150,64 +153,77 @@ def _has_parquet_files(path: Path | None) -> bool: return bool(path and path.exists() and any(path.rglob("*.parquet"))) -def detect_available_local_datasets() -> Dict[str, Dict[str, Any]]: +def detect_available_local_datasets() -> dict[str, dict[str, Any]]: """Return presence flags for all registered datasets.""" _load_custom_datasets() cfg = load_runtime_config() - + results = {} - + # Check all registered datasets for ds in DatasetRegistry.list_all(): name = ds.name - + # Determine paths (check config overrides first) parquet_root_str = cfg.get("parquet_roots", {}).get(name) - parquet_root = Path(parquet_root_str) if parquet_root_str else get_dataset_parquet_root(name) - + parquet_root = ( + Path(parquet_root_str) + if parquet_root_str + else get_dataset_parquet_root(name) + ) + db_path_str = cfg.get("duckdb_paths", {}).get(name) db_path = Path(db_path_str) if db_path_str else get_default_database_path(name) - + results[name] = { "parquet_present": _has_parquet_files(parquet_root), "db_present": bool(db_path and db_path.exists()), "parquet_root": str(parquet_root) if parquet_root else "", "db_path": str(db_path) if db_path else "", } - + return results def get_active_dataset() -> str | None: """Get the active dataset name.""" + # Ensure custom datasets are loaded so they can be found in the registry + _load_custom_datasets() + + # Priority 1: Environment variable + env_dataset = os.getenv("M3_DATASET") + if env_dataset: + return env_dataset + + # Priority 2: Config file cfg = load_runtime_config() active = cfg.get("active_dataset") - + + # Priority 3: Auto-detect default: prefer demo, then full if not active: - # Auto-detect default: prefer demo, then full availability = detect_available_local_datasets() if availability.get("mimic-iv-demo", {}).get("parquet_present"): - return "mimic-iv-demo" - if availability.get("mimic-iv-full", {}).get("parquet_present"): - return "mimic-iv-full" - return None - - if active == "bigquery": - return "bigquery" - + active = "mimic-iv-demo" + elif availability.get("mimic-iv-full", {}).get("parquet_present"): + active = "mimic-iv-full" + else: + active = None + return active def set_active_dataset(choice: str) -> None: - # Allow registered names, or 'bigquery' - valid_names = {"bigquery"} | {ds.name for ds in DatasetRegistry.list_all()} - + # Allow registered names + valid_names = {ds.name for ds in DatasetRegistry.list_all()} + if choice not in valid_names: # It might be a new custom dataset not yet loaded in this process? # We'll allow it if it's in the registry now. _load_custom_datasets() if not DatasetRegistry.get(choice): - raise ValueError(f"active_dataset must be a registered dataset or 'bigquery'. Got: {choice}") + raise ValueError( + f"active_dataset must be a registered dataset. 
Got: {choice}" + ) cfg = load_runtime_config() cfg["active_dataset"] = choice @@ -215,12 +231,8 @@ def set_active_dataset(choice: str) -> None: def get_duckdb_path_for(choice: str) -> Path | None: - if choice == "bigquery": - return None return get_default_database_path(choice) def get_parquet_root_for(choice: str) -> Path | None: - if choice == "bigquery": - return None return get_dataset_parquet_root(choice) diff --git a/src/m3/data_io.py b/src/m3/data_io.py index 54f7c9c..d60adf0 100644 --- a/src/m3/data_io.py +++ b/src/m3/data_io.py @@ -129,9 +129,7 @@ def _download_dataset_files( csv_urls_in_subdir = _scrape_urls_from_html_page(listing_url, session) if not csv_urls_in_subdir: - logger.warning( - f"No .csv.gz files found in location: {listing_url}" - ) + logger.warning(f"No .csv.gz files found in location: {listing_url}") continue for file_url in csv_urls_in_subdir: @@ -170,9 +168,7 @@ def _download_dataset_files( all_files_to_process.append((file_url, local_target_path)) if not all_files_to_process: - logger.error( - f"No '.csv.gz' download links found for dataset '{dataset_name}'." - ) + logger.error(f"No '.csv.gz' download links found for dataset '{dataset_name}'.") return False # Deduplicate and sort for consistent processing order @@ -208,6 +204,15 @@ def download_dataset(dataset_name: str, output_root: Path) -> bool: if not cfg: logger.error(f"Unsupported dataset: {dataset_name}") return False + + # Prevent accidental scraping of credentialed datasets + if cfg.get("requires_authentication"): + logger.error( + f"Dataset '{dataset_name}' requires authentication and cannot be auto-downloaded. " + "Please download files manually." + ) + return False + if not cfg.get("file_listing_url"): logger.error( f"Dataset '{dataset_name}' does not have a configured listing URL. " diff --git a/src/m3/datasets.py b/src/m3/datasets.py index 99e10cc..160d254 100644 --- a/src/m3/datasets.py +++ b/src/m3/datasets.py @@ -1,37 +1,46 @@ from dataclasses import dataclass, field -from typing import List, Optional, Dict +from typing import ClassVar + @dataclass class DatasetDefinition: name: str description: str = "" version: str = "1.0" - file_listing_url: Optional[str] = None - subdirectories_to_scan: List[str] = field(default_factory=list) - default_duckdb_filename: Optional[str] = None - primary_verification_table: Optional[str] = None - tags: List[str] = field(default_factory=list) - - # For backward compatibility or ease of use, we might add a way to access as dict if needed, + file_listing_url: str | None = None + subdirectories_to_scan: list[str] = field(default_factory=list) + default_duckdb_filename: str | None = None + primary_verification_table: str | None = None + tags: list[str] = field(default_factory=list) + + # For backward compatibility or ease of use, we might add a way to access as dict if needed, # but we'll try to use object access. 
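    # Illustrative (not part of this patch): dict-style access is already available via
    # m3.config.get_dataset_config(), which returns dataclasses.asdict(definition), e.g.
    #   get_dataset_config("mimic-iv-demo")["subdirectories_to_scan"]  # -> ["hosp", "icu"]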
+ # BigQuery Configuration + bigquery_project_id: str | None = "physionet-data" + bigquery_dataset_ids: list[str] = field(default_factory=list) + + # Authentication & Download Helpers + requires_authentication: bool = False + def __post_init__(self): if not self.default_duckdb_filename: self.default_duckdb_filename = f"{self.name.replace('-', '_')}.duckdb" + class DatasetRegistry: - _registry: Dict[str, DatasetDefinition] = {} + _registry: ClassVar[dict[str, DatasetDefinition]] = {} @classmethod def register(cls, dataset: DatasetDefinition): cls._registry[dataset.name.lower()] = dataset @classmethod - def get(cls, name: str) -> Optional[DatasetDefinition]: + def get(cls, name: str) -> DatasetDefinition | None: return cls._registry.get(name.lower()) @classmethod - def list_all(cls) -> List[DatasetDefinition]: + def list_all(cls) -> list[DatasetDefinition]: return list(cls._registry.values()) @classmethod @@ -42,27 +51,32 @@ def reset(cls): @classmethod def _register_builtins(cls): # Built-in datasets - demo = DatasetDefinition( + mimic_iv_demo = DatasetDefinition( name="mimic-iv-demo", description="MIMIC-IV Clinical Database Demo", file_listing_url="https://physionet.org/files/mimic-iv-demo/2.2/", subdirectories_to_scan=["hosp", "icu"], primary_verification_table="hosp_admissions", - tags=["mimic", "clinical", "demo"] + tags=["mimic", "clinical", "demo"], + bigquery_project_id="physionet-data", + bigquery_dataset_ids=["mimiciv_demo_hosp", "mimiciv_demo_icu"], ) - full = DatasetDefinition( + mimic_iv_full = DatasetDefinition( name="mimic-iv-full", description="MIMIC-IV Clinical Database (Full)", - file_listing_url=None, # Requires auth, manual download instructions + file_listing_url="https://physionet.org/files/mimiciv/3.1/", subdirectories_to_scan=["hosp", "icu"], primary_verification_table="hosp_admissions", - tags=["mimic", "clinical", "full"] + tags=["mimic", "clinical", "full"], + bigquery_project_id="physionet-data", + bigquery_dataset_ids=["mimiciv_3_1_hosp", "mimiciv_3_1_icu"], + requires_authentication=True, ) - cls.register(demo) - cls.register(full) + cls.register(mimic_iv_demo) + cls.register(mimic_iv_full) + # Initialize registry DatasetRegistry._register_builtins() - diff --git a/src/m3/mcp_server.py b/src/m3/mcp_server.py index d9e61fb..61773f4 100644 --- a/src/m3/mcp_server.py +++ b/src/m3/mcp_server.py @@ -11,7 +11,8 @@ from fastmcp import FastMCP from m3.auth import init_oauth2, require_oauth2 -from m3.config import get_default_database_path, get_active_dataset +from m3.config import get_active_dataset, get_default_database_path +from m3.datasets import DatasetRegistry # Create FastMCP server instance mcp = FastMCP("m3") @@ -21,6 +22,7 @@ _db_path = None _bq_client = None _project_id = None +_active_dataset_def = None def _validate_limit(limit: int) -> bool: @@ -131,30 +133,47 @@ def _is_safe_query(sql_query: str, internal_tool: bool = False) -> tuple[bool, s def _init_backend(): """Initialize the backend based on environment variables.""" - global _backend, _db_path, _bq_client, _project_id + global _backend, _db_path, _bq_client, _project_id, _active_dataset_def # Initialize OAuth2 authentication init_oauth2() _backend = os.getenv("M3_BACKEND", "duckdb") + active_ds_name = get_active_dataset() + + # Load dataset definition if available + if active_ds_name: + _active_dataset_def = DatasetRegistry.get(active_ds_name) + else: + # If explicitly bigquery or unset, we might default to a 'full' mimic definition if available, + # but better to handle it dynamically. 
+ # For now, let's see if we can infer a default definition for bigquery mode + # or just rely on manual project_id + if _backend == "bigquery": + # We might want to default to mimic-iv-full for bigquery metadata if not specified? + # But the user might want a different one. + # Let's check if we can infer it. + # For now, we'll try to use 'mimic-iv-full' as the reference for BigQuery structure + # if the user hasn't selected another dataset but is using BigQuery backend. + _active_dataset_def = DatasetRegistry.get("mimic-iv-full") if _backend == "duckdb": _db_path = os.getenv("M3_DB_PATH") if not _db_path: - # Try to detect active dataset if not set - active = get_active_dataset() - if active and active != "bigquery": - path = get_default_database_path(active) - _db_path = str(path) if path else None + if active_ds_name: + path = get_default_database_path(active_ds_name) + _db_path = str(path) if path else None else: # Fallback to demo if we can't figure it out - path = get_default_database_path("mimic-iv-demo") - _db_path = str(path) if path else None + path = get_default_database_path("mimic-iv-demo") + _db_path = str(path) if path else None + if not _active_dataset_def: + _active_dataset_def = DatasetRegistry.get("mimic-iv-demo") if not _db_path or not Path(_db_path).exists(): - # We don't raise here to allow server to start even if DB is missing (e.g. for 'config' command usage via import) - # But runtime queries will fail. - pass + # We don't raise here to allow server to start even if DB is missing (e.g. for 'config' command usage via import) + # But runtime queries will fail. + pass elif _backend == "bigquery": try: @@ -165,8 +184,14 @@ def _init_backend(): ) # User's GCP project ID for authentication and billing - # MIMIC-IV data resides in the public 'physionet-data' project - _project_id = os.getenv("M3_PROJECT_ID", "physionet-data") + # Priority: Env Var > Dataset Config > Default + env_project = os.getenv("M3_PROJECT_ID") + ds_project = ( + _active_dataset_def.bigquery_project_id if _active_dataset_def else None + ) + + _project_id = env_project or ds_project or "physionet-data" + try: _bq_client = bigquery.Client(project=_project_id) except Exception as e: @@ -182,10 +207,11 @@ def _init_backend(): def _get_backend_info() -> str: """Get current backend information for display in responses.""" + ds_name = _active_dataset_def.name if _active_dataset_def else "unknown" if _backend == "duckdb": - return f"šŸ”§ **Current Backend:** DuckDB (local database)\nšŸ“ **Database Path:** {_db_path}\n" + return f"šŸ”§ **Current Backend:** DuckDB (local database)\nšŸ“¦ **Dataset:** {ds_name}\nšŸ“ **Database Path:** {_db_path}\n" else: - return f"šŸ”§ **Current Backend:** BigQuery (cloud database)\nā˜ļø **Project ID:** {_project_id}\n" + return f"šŸ”§ **Current Backend:** BigQuery (cloud database)\nšŸ“¦ **Dataset:** {ds_name}\nā˜ļø **Project ID:** {_project_id}\n" # ========================================== @@ -199,7 +225,7 @@ def _get_backend_info() -> str: def _execute_duckdb_query(sql_query: str) -> str: """Execute DuckDB query - internal function.""" if not _db_path or not Path(_db_path).exists(): - return "āŒ Error: Database file not found. Please initialize a dataset using 'm3 init'." + return "āŒ Error: Database file not found. Please initialize a dataset using 'm3 init'." 
try: conn = duckdb.connect(_db_path) @@ -375,15 +401,26 @@ def get_database_schema() -> str: return f"{_get_backend_info()}\nšŸ“‹ **Available Tables:**\n{result}" elif _backend == "bigquery": - # Show fully qualified table names that are ready to copy-paste into queries - query = """ - SELECT CONCAT('`physionet-data.mimiciv_3_1_hosp.', table_name, '`') as query_ready_table_name - FROM `physionet-data.mimiciv_3_1_hosp.INFORMATION_SCHEMA.TABLES` - UNION ALL - SELECT CONCAT('`physionet-data.mimiciv_3_1_icu.', table_name, '`') as query_ready_table_name - FROM `physionet-data.mimiciv_3_1_icu.INFORMATION_SCHEMA.TABLES` - ORDER BY query_ready_table_name - """ + # Dynamic schema discovery based on active dataset definition + if not _active_dataset_def or not _active_dataset_def.bigquery_dataset_ids: + return f"{_get_backend_info()}āŒ **Error:** No BigQuery datasets configured for the active dataset." + + project_id = _active_dataset_def.bigquery_project_id or "physionet-data" + queries = [] + + for dataset_id in _active_dataset_def.bigquery_dataset_ids: + queries.append(f""" + SELECT CONCAT('`{project_id}.{dataset_id}.', table_name, '`') as query_ready_table_name + FROM `{project_id}.{dataset_id}.INFORMATION_SCHEMA.TABLES` + """) + + if not queries: + return ( + f"{_get_backend_info()}āŒ **Error:** No BigQuery datasets configured." + ) + + query = " UNION ALL ".join(queries) + " ORDER BY query_ready_table_name" + result = _execute_query_internal(query) return f"{_get_backend_info()}\nšŸ“‹ **Available Tables (query-ready names):**\n{result}\n\nšŸ’” **Copy-paste ready:** These table names can be used directly in your SQL queries!" @@ -434,8 +471,10 @@ def get_table_info(table_name: str, show_sample: bool = True) -> str: else: # bigquery # Handle both simple names (patients) and fully qualified names (`physionet-data.mimiciv_3_1_hosp.patients`) - # Detect qualified names by content: dots + physionet pattern - if "." in table_name and "physionet-data" in table_name: + # Detect qualified names by content: dots + project ID pattern or backticks + is_qualified = "." 
in table_name + + if is_qualified: # Qualified name (format-agnostic: works with or without backticks) clean_name = table_name.strip("`") full_table_name = f"`{clean_name}`" @@ -446,26 +485,23 @@ def get_table_info(table_name: str, show_sample: bool = True) -> str: error_msg = ( f"{backend_info}āŒ **Invalid qualified table name:** `{table_name}`\n\n" "**Expected format:** `project.dataset.table`\n" - "**Example:** `physionet-data.mimiciv_3_1_hosp.diagnoses_icd`\n\n" - "**Available MIMIC-IV datasets:**\n" - "- `physionet-data.mimiciv_3_1_hosp.*` (hospital module)\n" - "- `physionet-data.mimiciv_3_1_icu.*` (ICU module)" + "**Example:** `physionet-data.mimiciv_3_1_hosp.diagnoses_icd`\n" ) return error_msg simple_table_name = parts[2] # table name - dataset = f"{parts[0]}.{parts[1]}" # project.dataset + dataset_ref = f"{parts[0]}.{parts[1]}" # project.dataset else: - # Simple name - try both datasets to find the table + # Simple name - try to find it in configured datasets simple_table_name = table_name full_table_name = None - dataset = None + dataset_ref = None # If we have a fully qualified name, try that first if full_table_name: try: # Get column information using the dataset from the full name - dataset_parts = dataset.split(".") + dataset_parts = dataset_ref.split(".") if len(dataset_parts) >= 2: project_dataset = f"`{dataset_parts[0]}.{dataset_parts[1]}`" info_query = f""" @@ -486,35 +522,35 @@ def get_table_info(table_name: str, show_sample: bool = True) -> str: return result except Exception: - pass # Fall through to try simple name approach + pass # Fall through to try search approach if direct lookup fails (unlikely but safe) - # Try both datasets with simple name (fallback or original approach) - for dataset in ["mimiciv_3_1_hosp", "mimiciv_3_1_icu"]: - try: - full_table_name = f"`physionet-data.{dataset}.{simple_table_name}`" - - # Get column information - info_query = f""" - SELECT column_name, data_type, is_nullable - FROM `physionet-data.{dataset}.INFORMATION_SCHEMA.COLUMNS` - WHERE table_name = '{simple_table_name}' - ORDER BY ordinal_position - """ - - info_result = _execute_bigquery_query(info_query) - if "No results found" not in info_result: - result = f"{backend_info}šŸ“‹ **Table:** {full_table_name}\n\n**Column Information:**\n{info_result}" - - if show_sample: - sample_query = f"SELECT * FROM {full_table_name} LIMIT 3" - sample_result = _execute_bigquery_query(sample_query) - result += ( - f"\n\nšŸ“Š **Sample Data (first 3 rows):**\n{sample_result}" - ) - - return result - except Exception: - continue + # Try configured datasets with simple name + if _active_dataset_def and _active_dataset_def.bigquery_dataset_ids: + project_id = _active_dataset_def.bigquery_project_id or "physionet-data" + for dataset_id in _active_dataset_def.bigquery_dataset_ids: + try: + full_table_name = f"`{project_id}.{dataset_id}.{simple_table_name}`" + + # Get column information + info_query = f""" + SELECT column_name, data_type, is_nullable + FROM `{project_id}.{dataset_id}.INFORMATION_SCHEMA.COLUMNS` + WHERE table_name = '{simple_table_name}' + ORDER BY ordinal_position + """ + + info_result = _execute_bigquery_query(info_query) + if "No results found" not in info_result: + result = f"{backend_info}šŸ“‹ **Table:** {full_table_name}\n\n**Column Information:**\n{info_result}" + + if show_sample: + sample_query = f"SELECT * FROM {full_table_name} LIMIT 3" + sample_result = _execute_bigquery_query(sample_query) + result += f"\n\nšŸ“Š **Sample Data (first 3 rows):**\n{sample_result}" + + return 
result + except Exception: + continue return f"{backend_info}āŒ Table '{table_name}' not found in any dataset. Use get_database_schema() to see available tables." @@ -562,17 +598,29 @@ def get_icu_stays(patient_id: int | None = None, limit: int = 10) -> str: Returns: ICU stay data as formatted text or guidance if table not found """ + # Check dataset compatibility + if _active_dataset_def and "mimic" not in _active_dataset_def.tags: + return f"āŒ **Error:** This tool is optimized for MIMIC datasets. The current dataset '{_active_dataset_def.name}' does not appear to be a MIMIC dataset." + # Security validation if not _validate_limit(limit): return "Error: Invalid limit. Must be a positive integer between 1 and 10000." # Try common ICU table names based on backend if _backend == "duckdb": - # More robust check: look for available tables first? - # For now we guess common naming convention icustays_table = "icu_icustays" else: # bigquery - icustays_table = "`physionet-data.mimiciv_3_1_icu.icustays`" + # Try to find icustays in configured datasets + project_id = _active_dataset_def.bigquery_project_id or "physionet-data" + found = False + for ds in _active_dataset_def.bigquery_dataset_ids: + if "icu" in ds: + icustays_table = f"`{project_id}.{ds}.icustays`" + found = True + break + if not found: + # Fallback + icustays_table = "`physionet-data.mimiciv_3_1_icu.icustays`" if patient_id: query = f"SELECT * FROM {icustays_table} WHERE subject_id = {patient_id}" @@ -614,6 +662,10 @@ def get_lab_results( Returns: Lab results as formatted text or guidance if table not found """ + # Check dataset compatibility + if _active_dataset_def and "mimic" not in _active_dataset_def.tags: + return f"āŒ **Error:** This tool is optimized for MIMIC datasets. The current dataset '{_active_dataset_def.name}' does not appear to be a MIMIC dataset." + # Security validation if not _validate_limit(limit): return "Error: Invalid limit. Must be a positive integer between 1 and 10000." @@ -622,7 +674,17 @@ def get_lab_results( if _backend == "duckdb": labevents_table = "hosp_labevents" else: # bigquery - labevents_table = "`physionet-data.mimiciv_3_1_hosp.labevents`" + # Try to find labevents in configured datasets + project_id = _active_dataset_def.bigquery_project_id or "physionet-data" + found = False + for ds in _active_dataset_def.bigquery_dataset_ids: + if "hosp" in ds: + labevents_table = f"`{project_id}.{ds}.labevents`" + found = True + break + if not found: + # Fallback + labevents_table = "`physionet-data.mimiciv_3_1_hosp.labevents`" # Build query conditions conditions = [] @@ -669,6 +731,10 @@ def get_race_distribution(limit: int = 10) -> str: Returns: Race distribution as formatted text or guidance if table not found """ + # Check dataset compatibility + if _active_dataset_def and "mimic" not in _active_dataset_def.tags: + return f"āŒ **Error:** This tool is optimized for MIMIC datasets. The current dataset '{_active_dataset_def.name}' does not appear to be a MIMIC dataset." + # Security validation if not _validate_limit(limit): return "Error: Invalid limit. Must be a positive integer between 1 and 10000." 
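# Illustrative sketch, not part of the patch: the compatibility guard and the
# per-module table lookup shown above (and in the hunk below) repeat across
# get_icu_stays, get_lab_results and get_race_distribution. Condensed, the
# shared idea looks like this; the helper name is hypothetical, and the final
# fallback mirrors the public MIMIC-IV location hard-coded in this diff.
def _resolve_mimic_table(ds_def, module_hint: str, table: str) -> str:
    """Return a query-ready BigQuery name for a MIMIC module ('hosp' or 'icu')."""
    project_id = (ds_def.bigquery_project_id if ds_def else None) or "physionet-data"
    for dataset_id in (ds_def.bigquery_dataset_ids if ds_def else None) or []:
        if module_hint in dataset_id:
            return f"`{project_id}.{dataset_id}.{table}`"
    return f"`physionet-data.mimiciv_3_1_{module_hint}.{table}`"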
@@ -677,7 +743,17 @@ def get_race_distribution(limit: int = 10) -> str: if _backend == "duckdb": admissions_table = "hosp_admissions" else: # bigquery - admissions_table = "`physionet-data.mimiciv_3_1_hosp.admissions`" + # Try to find admissions in configured datasets + project_id = _active_dataset_def.bigquery_project_id or "physionet-data" + found = False + for ds in _active_dataset_def.bigquery_dataset_ids: + if "hosp" in ds: + admissions_table = f"`{project_id}.{ds}.admissions`" + found = True + break + if not found: + # Fallback + admissions_table = "`physionet-data.mimiciv_3_1_hosp.admissions`" query = f"SELECT race, COUNT(*) as count FROM {admissions_table} GROUP BY race ORDER BY count DESC LIMIT {limit}" diff --git a/tests/test_mcp_server.py b/tests/test_mcp_server.py index 0970723..9b49efa 100644 --- a/tests/test_mcp_server.py +++ b/tests/test_mcp_server.py @@ -9,6 +9,9 @@ import pytest from fastmcp import Client +# Define DatasetDefinition locally if imports fail (shouldn't happen in test env) +from m3.datasets import DatasetDefinition + # Mock the database path check during import to handle CI environments with patch("pathlib.Path.exists", return_value=True): with patch( @@ -75,16 +78,25 @@ def test_backend_init_duckdb_missing_db(self): ) def test_backend_init_bigquery(self): """Test BigQuery backend initialization.""" + mock_ds = DatasetDefinition( + name="mock-ds", + bigquery_project_id="test-project", + bigquery_dataset_ids=["ds1"], + tags=["mimic"], + ) + with patch.dict( os.environ, {"M3_BACKEND": "bigquery", "M3_PROJECT_ID": "test-project"}, clear=True, ): - with patch("google.cloud.bigquery.Client") as mock_client: - mock_client.return_value = Mock() - _init_backend() - # If no exception raised, initialization succeeded - mock_client.assert_called_once_with(project="test-project") + with patch("m3.mcp_server.DatasetRegistry.get", return_value=mock_ds): + with patch("google.cloud.bigquery.Client") as mock_client: + mock_client.return_value = Mock() + _init_backend() + # If no exception raised, initialization succeeded + # The project ID might come from env or dataset, both are 'test-project' here + mock_client.assert_called_once_with(project="test-project") def test_backend_init_invalid(self): """Test initialization with invalid backend.""" @@ -160,37 +172,46 @@ async def test_tools_via_client(self, test_db): clear=True, ): # Initialize backend - _init_backend() - - # Test via FastMCP client - async with Client(mcp) as client: - # Test execute_mimic_query tool - result = await client.call_tool( - "execute_mimic_query", - {"sql_query": "SELECT COUNT(*) as count FROM icu_icustays"}, - ) - result_text = str(result) - assert "count" in result_text - assert "2" in result_text - - # Test get_icu_stays tool - result = await client.call_tool( - "get_icu_stays", {"patient_id": 10000032, "limit": 10} - ) - result_text = str(result) - assert "10000032" in result_text - - # Test get_lab_results tool - result = await client.call_tool( - "get_lab_results", {"patient_id": 10000032, "limit": 20} - ) - result_text = str(result) - assert "10000032" in result_text + # Mock DatasetRegistry to return a mimic dataset so tools work + mock_ds = DatasetDefinition(name="mimic-demo", tags=["mimic"]) + with patch("m3.mcp_server.DatasetRegistry.get", return_value=mock_ds): + with patch( + "m3.mcp_server.get_active_dataset", return_value="mimic-demo" + ): + _init_backend() - # Test get_database_schema tool - result = await client.call_tool("get_database_schema", {}) - result_text = str(result) - assert 
"icu_icustays" in result_text or "hosp_labevents" in result_text + # Test via FastMCP client + async with Client(mcp) as client: + # Test execute_mimic_query tool + result = await client.call_tool( + "execute_mimic_query", + {"sql_query": "SELECT COUNT(*) as count FROM icu_icustays"}, + ) + result_text = str(result) + assert "count" in result_text + assert "2" in result_text + + # Test get_icu_stays tool + result = await client.call_tool( + "get_icu_stays", {"patient_id": 10000032, "limit": 10} + ) + result_text = str(result) + assert "10000032" in result_text + + # Test get_lab_results tool + result = await client.call_tool( + "get_lab_results", {"patient_id": 10000032, "limit": 20} + ) + result_text = str(result) + assert "10000032" in result_text + + # Test get_database_schema tool + result = await client.call_tool("get_database_schema", {}) + result_text = str(result) + assert ( + "icu_icustays" in result_text + or "hosp_labevents" in result_text + ) @pytest.mark.asyncio async def test_security_checks(self, test_db): @@ -308,47 +329,60 @@ class TestBigQueryIntegration: @pytest.mark.asyncio async def test_bigquery_tools(self): """Test BigQuery tools functionality with mocks.""" + + # Mock Dataset definition for BigQuery + mock_ds = DatasetDefinition( + name="mimic-test", + bigquery_project_id="test-project", + bigquery_dataset_ids=["mimic_hosp", "mimic_icu"], + tags=["mimic"], + ) + with patch.dict( os.environ, {"M3_BACKEND": "bigquery", "M3_PROJECT_ID": "test-project"}, clear=True, ): - with patch("google.cloud.bigquery.Client") as mock_client: - # Mock BigQuery client and query results - mock_job = Mock() - mock_df = Mock() - mock_df.empty = False - mock_df.to_string.return_value = "Mock BigQuery result" - mock_df.__len__ = Mock(return_value=5) - mock_job.to_dataframe.return_value = mock_df - - mock_client_instance = Mock() - mock_client_instance.query.return_value = mock_job - mock_client.return_value = mock_client_instance - - _init_backend() - - async with Client(mcp) as client: - # Test execute_mimic_query tool - result = await client.call_tool( - "execute_mimic_query", - { - "sql_query": "SELECT COUNT(*) FROM `physionet-data.mimiciv_3_1_icu.icustays`" - }, - ) - result_text = str(result) - assert "Mock BigQuery result" in result_text - - # Test get_race_distribution tool - result = await client.call_tool( - "get_race_distribution", {"limit": 5} - ) - result_text = str(result) - assert "Mock BigQuery result" in result_text - - # Verify BigQuery client was called - mock_client.assert_called_once_with(project="test-project") - assert mock_client_instance.query.called + with patch("m3.mcp_server.DatasetRegistry.get", return_value=mock_ds): + with patch( + "m3.mcp_server.get_active_dataset", return_value="mimic-test" + ): + with patch("google.cloud.bigquery.Client") as mock_client: + # Mock BigQuery client and query results + mock_job = Mock() + mock_df = Mock() + mock_df.empty = False + mock_df.to_string.return_value = "Mock BigQuery result" + mock_df.__len__ = Mock(return_value=5) + mock_job.to_dataframe.return_value = mock_df + + mock_client_instance = Mock() + mock_client_instance.query.return_value = mock_job + mock_client.return_value = mock_client_instance + + _init_backend() + + async with Client(mcp) as client: + # Test execute_mimic_query tool + result = await client.call_tool( + "execute_mimic_query", + { + "sql_query": "SELECT COUNT(*) FROM `physionet-data.mimiciv_3_1_icu.icustays`" + }, + ) + result_text = str(result) + assert "Mock BigQuery result" in result_text + + # 
Test get_race_distribution tool + result = await client.call_tool( + "get_race_distribution", {"limit": 5} + ) + result_text = str(result) + assert "Mock BigQuery result" in result_text + + # Verify BigQuery client was called + mock_client.assert_called_once_with(project="test-project") + assert mock_client_instance.query.called class TestServerIntegration: From 49f0d043edb5d808ecb125e725a50512c7f2e3ab Mon Sep 17 00:00:00 2001 From: hill Date: Tue, 25 Nov 2025 01:06:32 -0500 Subject: [PATCH 3/5] feat: Add dynamic dataset switching - Implement 'm3 use' CLI command to switch between active datasets (e.g., mimic-iv-full, eicu). - Update MCP server to dynamically resolve database paths and BigQuery configurations based on the active dataset. - Enhance 'm3 init' to handle datasets requiring authentication by providing 'wget' instructions for PhysioNet. - Update 'get_database_schema', 'get_table_info', and convenience tools to be dataset-aware. - Add 'tests/test_dynamic_switching.py' to verify dataset switching logic. --- src/m3/cli.py | 37 ++- src/m3/datasets.py | 4 +- .../mcp_client_configs/dynamic_mcp_config.py | 19 +- src/m3/mcp_server.py | 216 +++++++++++------- tests/test_cli.py | 11 +- tests/test_dynamic_switching.py | 73 ++++++ tests/test_mcp_server.py | 39 +++- 7 files changed, 273 insertions(+), 126 deletions(-) create mode 100644 tests/test_dynamic_switching.py diff --git a/src/m3/cli.py b/src/m3/cli.py index f3df756..43df89a 100644 --- a/src/m3/cli.py +++ b/src/m3/cli.py @@ -356,6 +356,22 @@ def use_cmd( " This is fine if you are using the BigQuery backend.\n" " If you intend to use DuckDB (local), run 'm3 init' first." ) + else: + typer.secho( + " Local: Available", + ) + + # 4. Check BigQuery support + ds_def = DatasetRegistry.get(target) + if ds_def: + if not ds_def.bigquery_dataset_ids: + typer.secho( + "āš ļø Warning: This dataset is not configured for BigQuery.", + fg=typer.colors.YELLOW, + ) + typer.echo(" If you are using the BigQuery backend, queries will fail.") + else: + typer.echo(f" BigQuery: Available (Project: {ds_def.bigquery_project_id})") @app.command("status") @@ -390,6 +406,12 @@ def status_cmd(): except Exception: typer.echo(" parquet_size_gb: (skipped)") + # Show BigQuery status + ds_def = DatasetRegistry.get(label) + if ds_def: + bq_status = "āœ…" if ds_def.bigquery_dataset_ids else "āŒ" + typer.echo(f" BigQuery Support: {bq_status}") + # Try a quick rowcount on the verification table if db present cfg = get_dataset_config(label) if info["db_present"] and cfg: @@ -542,17 +564,10 @@ def config_cmd( if backend != "duckdb": cmd.extend(["--backend", backend]) - # For duckdb, infer db_path from active dataset if not provided - if backend == "duckdb": - if db_path: - inferred_db_path = Path(db_path).resolve() - else: - active_dataset = get_active_dataset() - if not active_dataset: - # default to demo if nothing is set - inferred_db_path = get_default_database_path("mimic-iv-demo") - else: - inferred_db_path = get_default_database_path(active_dataset) + # For duckdb, pass db_path only if explicitly provided. + # If omitted, the server will resolve it dynamically based on the active dataset. 
+ if backend == "duckdb" and db_path: + inferred_db_path = Path(db_path).resolve() cmd.extend(["--db-path", str(inferred_db_path)]) elif backend == "bigquery" and project_id: diff --git a/src/m3/datasets.py b/src/m3/datasets.py index 160d254..cc08735 100644 --- a/src/m3/datasets.py +++ b/src/m3/datasets.py @@ -58,8 +58,8 @@ def _register_builtins(cls): subdirectories_to_scan=["hosp", "icu"], primary_verification_table="hosp_admissions", tags=["mimic", "clinical", "demo"], - bigquery_project_id="physionet-data", - bigquery_dataset_ids=["mimiciv_demo_hosp", "mimiciv_demo_icu"], + bigquery_project_id=None, + bigquery_dataset_ids=None, ) mimic_iv_full = DatasetDefinition( diff --git a/src/m3/mcp_client_configs/dynamic_mcp_config.py b/src/m3/mcp_client_configs/dynamic_mcp_config.py index a879761..567981f 100644 --- a/src/m3/mcp_client_configs/dynamic_mcp_config.py +++ b/src/m3/mcp_client_configs/dynamic_mcp_config.py @@ -10,7 +10,7 @@ from pathlib import Path from typing import Any -from m3.config import get_active_dataset, get_default_database_path +from m3.config import get_default_database_path # Error messages _DATABASE_PATH_ERROR_MSG = ( @@ -86,17 +86,7 @@ def generate_config( if backend == "duckdb": if db_path: env["M3_DB_PATH"] = db_path - else: - active = get_active_dataset() - if not active: - raise ValueError( - "Could not determine default DuckDB path; run `m3 init ...` first " - "or pass --db-path explicitly." - ) - default_path = get_default_database_path(active) - if not default_path: - raise ValueError(_DATABASE_PATH_ERROR_MSG) - env["M3_DB_PATH"] = str(default_path) + # If no db_path, we rely on dynamic resolution in the server elif backend == "bigquery" and project_id: env["M3_PROJECT_ID"] = project_id @@ -194,9 +184,12 @@ def interactive_config(self) -> dict[str, Any]: raise ValueError(_DATABASE_PATH_ERROR_MSG) print(f"Default database path: {default_db_path}") + print( + "\nLeaving database path empty allows switching datasets dynamically via 'm3 use'." + ) db_path = ( input( - "DuckDB database path (optional, press Enter to use default): " + "DuckDB database path (optional, press Enter for dynamic): " ).strip() or None ) diff --git a/src/m3/mcp_server.py b/src/m3/mcp_server.py index 61773f4..2f3cf63 100644 --- a/src/m3/mcp_server.py +++ b/src/m3/mcp_server.py @@ -19,10 +19,78 @@ # Global variables for backend configuration _backend = None -_db_path = None -_bq_client = None -_project_id = None -_active_dataset_def = None +# Cache for BigQuery client to avoid re-initializing on every request +_bq_client_cache = {"client": None, "project_id": None} + + +def _get_active_dataset_def(): + """Get the currently active dataset definition.""" + # 1. Try currently active dataset from config/env + active_ds_name = get_active_dataset() + if active_ds_name: + return DatasetRegistry.get(active_ds_name) + + # 2. Fallback for BigQuery: try to find a full definition + if _backend == "bigquery": + # Use mimic-iv-full as reference if available, else demo + return DatasetRegistry.get("mimic-iv-full") or DatasetRegistry.get( + "mimic-iv-demo" + ) + + # 3. Fallback for DuckDB: demo + return DatasetRegistry.get("mimic-iv-demo") + + +def _get_db_path(): + """Get the current DuckDB path.""" + # 1. Env var overrides everything (static mode) + env_path = os.getenv("M3_DB_PATH") + if env_path: + return env_path + + # 2. 
Dynamic resolution based on active dataset + ds_def = _get_active_dataset_def() + if ds_def: + path = get_default_database_path(ds_def.name) + return str(path) if path else None + + return None + + +def _get_bq_client(): + """Get or create a BigQuery client for the current project.""" + try: + from google.cloud import bigquery + except ImportError: + raise ImportError( + "BigQuery dependencies not found. Install with: pip install google-cloud-bigquery" + ) + + # Determine target project ID + # Priority: Env Var > Dataset Config > Default + env_project = os.getenv("M3_PROJECT_ID") + ds_def = _get_active_dataset_def() + ds_project = ds_def.bigquery_project_id if ds_def else None + + target_project_id = env_project or ds_project or "physionet-data" + + # Check cache + if ( + _bq_client_cache["client"] + and _bq_client_cache["project_id"] == target_project_id + ): + return _bq_client_cache["client"], target_project_id + + # Create new client + try: + client = bigquery.Client(project=target_project_id) + _bq_client_cache["client"] = client + _bq_client_cache["project_id"] = target_project_id + return client, target_project_id + except Exception as e: + raise RuntimeError( + f"Failed to initialize BigQuery client for project {target_project_id}: {e}" + ) def _validate_limit(limit: int) -> bool: @@ -133,85 +201,34 @@ def _is_safe_query(sql_query: str, internal_tool: bool = False) -> tuple[bool, s def _init_backend(): """Initialize the backend based on environment variables.""" - global _backend, _db_path, _bq_client, _project_id, _active_dataset_def + global _backend # Initialize OAuth2 authentication init_oauth2() _backend = os.getenv("M3_BACKEND", "duckdb") - active_ds_name = get_active_dataset() - - # Load dataset definition if available - if active_ds_name: - _active_dataset_def = DatasetRegistry.get(active_ds_name) - else: - # If explicitly bigquery or unset, we might default to a 'full' mimic definition if available, - # but better to handle it dynamically. - # For now, let's see if we can infer a default definition for bigquery mode - # or just rely on manual project_id - if _backend == "bigquery": - # We might want to default to mimic-iv-full for bigquery metadata if not specified? - # But the user might want a different one. - # Let's check if we can infer it. - # For now, we'll try to use 'mimic-iv-full' as the reference for BigQuery structure - # if the user hasn't selected another dataset but is using BigQuery backend. - _active_dataset_def = DatasetRegistry.get("mimic-iv-full") - - if _backend == "duckdb": - _db_path = os.getenv("M3_DB_PATH") - if not _db_path: - if active_ds_name: - path = get_default_database_path(active_ds_name) - _db_path = str(path) if path else None - else: - # Fallback to demo if we can't figure it out - path = get_default_database_path("mimic-iv-demo") - _db_path = str(path) if path else None - if not _active_dataset_def: - _active_dataset_def = DatasetRegistry.get("mimic-iv-demo") - - if not _db_path or not Path(_db_path).exists(): - # We don't raise here to allow server to start even if DB is missing (e.g. for 'config' command usage via import) - # But runtime queries will fail. - pass - - elif _backend == "bigquery": - try: - from google.cloud import bigquery - except ImportError: - raise ImportError( - "BigQuery dependencies not found. 
Install with: pip install google-cloud-bigquery" - ) - # User's GCP project ID for authentication and billing - # Priority: Env Var > Dataset Config > Default - env_project = os.getenv("M3_PROJECT_ID") - ds_project = ( - _active_dataset_def.bigquery_project_id if _active_dataset_def else None + if _backend not in ["duckdb", "bigquery"]: + raise ValueError( + f"Unsupported backend: {_backend}. Supported backends: duckdb, bigquery" ) - _project_id = env_project or ds_project or "physionet-data" - try: - _bq_client = bigquery.Client(project=_project_id) - except Exception as e: - raise RuntimeError(f"Failed to initialize BigQuery client: {e}") - - else: - raise ValueError(f"Unsupported backend: {_backend}") - - -# Initialize backend when module is imported _init_backend() def _get_backend_info() -> str: """Get current backend information for display in responses.""" - ds_name = _active_dataset_def.name if _active_dataset_def else "unknown" + ds_def = _get_active_dataset_def() + ds_name = ds_def.name if ds_def else "unknown" + if _backend == "duckdb": - return f"šŸ”§ **Current Backend:** DuckDB (local database)\nšŸ“¦ **Dataset:** {ds_name}\nšŸ“ **Database Path:** {_db_path}\n" + db_path = _get_db_path() + return f"šŸ”§ **Current Backend:** DuckDB (local database)\nšŸ“¦ **Active Dataset:** {ds_name}\nšŸ“ **Database Path:** {db_path}\n" else: - return f"šŸ”§ **Current Backend:** BigQuery (cloud database)\nšŸ“¦ **Dataset:** {ds_name}\nā˜ļø **Project ID:** {_project_id}\n" + # Resolve project ID dynamically for display + _, project_id = _get_bq_client() + return f"šŸ”§ **Current Backend:** BigQuery (cloud database)\nšŸ“¦ **Active Dataset:** {ds_name}\nā˜ļø **Project ID:** {project_id}\n" # ========================================== @@ -224,11 +241,12 @@ def _get_backend_info() -> str: def _execute_duckdb_query(sql_query: str) -> str: """Execute DuckDB query - internal function.""" - if not _db_path or not Path(_db_path).exists(): + db_path = _get_db_path() + if not db_path or not Path(db_path).exists(): return "āŒ Error: Database file not found. Please initialize a dataset using 'm3 init'." try: - conn = duckdb.connect(_db_path) + conn = duckdb.connect(db_path) try: df = conn.execute(sql_query).df() if df.empty: @@ -253,8 +271,10 @@ def _execute_bigquery_query(sql_query: str) -> str: try: from google.cloud import bigquery + client, _ = _get_bq_client() + job_config = bigquery.QueryJobConfig() - query_job = _bq_client.query(sql_query, job_config=job_config) + query_job = client.query(sql_query, job_config=job_config) df = query_job.to_dataframe() if df.empty: @@ -402,13 +422,14 @@ def get_database_schema() -> str: elif _backend == "bigquery": # Dynamic schema discovery based on active dataset definition - if not _active_dataset_def or not _active_dataset_def.bigquery_dataset_ids: + ds_def = _get_active_dataset_def() + if not ds_def or not ds_def.bigquery_dataset_ids: return f"{_get_backend_info()}āŒ **Error:** No BigQuery datasets configured for the active dataset." 
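# Illustrative note, not part of the patch: for a definition whose
# bigquery_dataset_ids are ["mimiciv_3_1_hosp", "mimiciv_3_1_icu"] under the
# default project, the construction below assembles roughly this statement:
#
#   SELECT CONCAT('`physionet-data.mimiciv_3_1_hosp.', table_name, '`') as query_ready_table_name
#   FROM `physionet-data.mimiciv_3_1_hosp.INFORMATION_SCHEMA.TABLES`
#   UNION ALL
#   SELECT CONCAT('`physionet-data.mimiciv_3_1_icu.', table_name, '`') as query_ready_table_name
#   FROM `physionet-data.mimiciv_3_1_icu.INFORMATION_SCHEMA.TABLES`
#   ORDER BY query_ready_table_name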
- project_id = _active_dataset_def.bigquery_project_id or "physionet-data" + project_id = ds_def.bigquery_project_id or "physionet-data" queries = [] - for dataset_id in _active_dataset_def.bigquery_dataset_ids: + for dataset_id in ds_def.bigquery_dataset_ids: queries.append(f""" SELECT CONCAT('`{project_id}.{dataset_id}.', table_name, '`') as query_ready_table_name FROM `{project_id}.{dataset_id}.INFORMATION_SCHEMA.TABLES` @@ -525,9 +546,10 @@ def get_table_info(table_name: str, show_sample: bool = True) -> str: pass # Fall through to try search approach if direct lookup fails (unlikely but safe) # Try configured datasets with simple name - if _active_dataset_def and _active_dataset_def.bigquery_dataset_ids: - project_id = _active_dataset_def.bigquery_project_id or "physionet-data" - for dataset_id in _active_dataset_def.bigquery_dataset_ids: + ds_def = _get_active_dataset_def() + if ds_def and ds_def.bigquery_dataset_ids: + project_id = ds_def.bigquery_project_id or "physionet-data" + for dataset_id in ds_def.bigquery_dataset_ids: try: full_table_name = f"`{project_id}.{dataset_id}.{simple_table_name}`" @@ -599,8 +621,9 @@ def get_icu_stays(patient_id: int | None = None, limit: int = 10) -> str: ICU stay data as formatted text or guidance if table not found """ # Check dataset compatibility - if _active_dataset_def and "mimic" not in _active_dataset_def.tags: - return f"āŒ **Error:** This tool is optimized for MIMIC datasets. The current dataset '{_active_dataset_def.name}' does not appear to be a MIMIC dataset." + ds_def = _get_active_dataset_def() + if ds_def and "mimic" not in ds_def.tags: + return f"āŒ **Error:** This tool is optimized for MIMIC datasets. The current dataset '{ds_def.name}' does not appear to be a MIMIC dataset." # Security validation if not _validate_limit(limit): @@ -611,9 +634,14 @@ def get_icu_stays(patient_id: int | None = None, limit: int = 10) -> str: icustays_table = "icu_icustays" else: # bigquery # Try to find icustays in configured datasets - project_id = _active_dataset_def.bigquery_project_id or "physionet-data" + project_id = ( + ds_def.bigquery_project_id or "physionet-data" + if ds_def + else "physionet-data" + ) found = False - for ds in _active_dataset_def.bigquery_dataset_ids: + dataset_ids = ds_def.bigquery_dataset_ids if ds_def else [] + for ds in dataset_ids: if "icu" in ds: icustays_table = f"`{project_id}.{ds}.icustays`" found = True @@ -663,8 +691,9 @@ def get_lab_results( Lab results as formatted text or guidance if table not found """ # Check dataset compatibility - if _active_dataset_def and "mimic" not in _active_dataset_def.tags: - return f"āŒ **Error:** This tool is optimized for MIMIC datasets. The current dataset '{_active_dataset_def.name}' does not appear to be a MIMIC dataset." + ds_def = _get_active_dataset_def() + if ds_def and "mimic" not in ds_def.tags: + return f"āŒ **Error:** This tool is optimized for MIMIC datasets. The current dataset '{ds_def.name}' does not appear to be a MIMIC dataset." 
# Security validation if not _validate_limit(limit): @@ -675,9 +704,14 @@ def get_lab_results( labevents_table = "hosp_labevents" else: # bigquery # Try to find labevents in configured datasets - project_id = _active_dataset_def.bigquery_project_id or "physionet-data" + project_id = ( + ds_def.bigquery_project_id or "physionet-data" + if ds_def + else "physionet-data" + ) found = False - for ds in _active_dataset_def.bigquery_dataset_ids: + dataset_ids = ds_def.bigquery_dataset_ids if ds_def else [] + for ds in dataset_ids: if "hosp" in ds: labevents_table = f"`{project_id}.{ds}.labevents`" found = True @@ -732,8 +766,9 @@ def get_race_distribution(limit: int = 10) -> str: Race distribution as formatted text or guidance if table not found """ # Check dataset compatibility - if _active_dataset_def and "mimic" not in _active_dataset_def.tags: - return f"āŒ **Error:** This tool is optimized for MIMIC datasets. The current dataset '{_active_dataset_def.name}' does not appear to be a MIMIC dataset." + ds_def = _get_active_dataset_def() + if ds_def and "mimic" not in ds_def.tags: + return f"āŒ **Error:** This tool is optimized for MIMIC datasets. The current dataset '{ds_def.name}' does not appear to be a MIMIC dataset." # Security validation if not _validate_limit(limit): @@ -744,9 +779,14 @@ def get_race_distribution(limit: int = 10) -> str: admissions_table = "hosp_admissions" else: # bigquery # Try to find admissions in configured datasets - project_id = _active_dataset_def.bigquery_project_id or "physionet-data" + project_id = ( + ds_def.bigquery_project_id or "physionet-data" + if ds_def + else "physionet-data" + ) found = False - for ds in _active_dataset_def.bigquery_dataset_ids: + dataset_ids = ds_def.bigquery_dataset_ids if ds_def else [] + for ds in dataset_ids: if "hosp" in ds: admissions_table = f"`{project_id}.{ds}.admissions`" found = True diff --git a/tests/test_cli.py b/tests/test_cli.py index d352bd7..ff52159 100644 --- a/tests/test_cli.py +++ b/tests/test_cli.py @@ -162,13 +162,9 @@ def test_config_claude_infers_db_path_demo( result = runner.invoke(app, ["config", "claude"]) assert result.exit_code == 0 - # subprocess run should be called with inferred --db-path + # subprocess run should NOT be called with inferred --db-path (dynamic resolution) call_args = mock_subprocess.call_args[0][0] - assert "--db-path" in call_args - assert "/tmp/inferred-demo.duckdb" in call_args - - # Should have asked for demo duckdb path - mock_get_default.assert_called() + assert "--db-path" not in call_args @patch("subprocess.run") @@ -185,8 +181,7 @@ def test_config_claude_infers_db_path_full( assert result.exit_code == 0 call_args = mock_subprocess.call_args[0][0] - assert "--db-path" in call_args - assert "/tmp/inferred-full.duckdb" in call_args + assert "--db-path" not in call_args @patch("m3.cli.set_active_dataset") diff --git a/tests/test_dynamic_switching.py b/tests/test_dynamic_switching.py new file mode 100644 index 0000000..1646101 --- /dev/null +++ b/tests/test_dynamic_switching.py @@ -0,0 +1,73 @@ +import os +import json +from pathlib import Path +from unittest.mock import patch + +from m3.config import set_active_dataset, get_active_dataset +import m3.mcp_server as server +import m3.config as config_mod + +def test_dynamic_dataset_switching(tmp_path, monkeypatch): + # Setup mock data dir + data_dir = tmp_path / "m3_data" + data_dir.mkdir() + + # Patch config module to use our temp data dir + monkeypatch.setattr(config_mod, "_PROJECT_DATA_DIR", data_dir) + 
monkeypatch.setattr(config_mod, "_DEFAULT_DATABASES_DIR", data_dir / "databases") + monkeypatch.setattr(config_mod, "_DEFAULT_PARQUET_DIR", data_dir / "parquet") + monkeypatch.setattr(config_mod, "_RUNTIME_CONFIG_PATH", data_dir / "config.json") + monkeypatch.setattr(config_mod, "_CUSTOM_DATASETS_DIR", data_dir / "datasets") + + # Ensure dirs exist + (data_dir / "databases").mkdir() + (data_dir / "parquet").mkdir() + (data_dir / "datasets").mkdir() + + # 1. Start with no active dataset + # Verify server defaults to mimic-iv-demo (or falls back) + monkeypatch.setenv("M3_BACKEND", "duckdb") + monkeypatch.delenv("M3_DB_PATH", raising=False) + + # Ensure config is empty/default + if (data_dir / "config.json").exists(): + (data_dir / "config.json").unlink() + + # Check default fallback + ds_def = server._get_active_dataset_def() + assert ds_def.name == "mimic-iv-demo" + + db_path = server._get_db_path() + # Should point to demo db in our temp dir + # Note: get_default_database_path uses the patched _DEFAULT_DATABASES_DIR + assert "mimic_iv_demo.duckdb" in str(db_path) + + # 2. Set active dataset to something else (simulating 'm3 use') + # We can use 'mimic-iv-full' as it is registered + set_active_dataset("mimic-iv-full") + + # Verify config file was written + assert (data_dir / "config.json").exists() + + # Verify server picks it up + ds_def = server._get_active_dataset_def() + assert ds_def.name == "mimic-iv-full" + + db_path = server._get_db_path() + assert "mimic_iv_full.duckdb" in str(db_path) + + # 3. Simulate environment variable override (static mode) + monkeypatch.setenv("M3_DB_PATH", "/custom/path/to/db.duckdb") + + db_path = server._get_db_path() + assert db_path == "/custom/path/to/db.duckdb" + + # Active dataset def should still track the config/env + ds_def = server._get_active_dataset_def() + assert ds_def.name == "mimic-iv-full" + + # 4. Unset env var, should go back to dynamic + monkeypatch.delenv("M3_DB_PATH") + db_path = server._get_db_path() + assert "mimic_iv_full.duckdb" in str(db_path) + diff --git a/tests/test_mcp_server.py b/tests/test_mcp_server.py index 9b49efa..ffc79a6 100644 --- a/tests/test_mcp_server.py +++ b/tests/test_mcp_server.py @@ -34,6 +34,14 @@ def _bigquery_available(): class TestMCPServerSetup: """Test MCP server setup and configuration.""" + @pytest.fixture(autouse=True) + def reset_bq_cache(self): + """Reset the BigQuery client cache before each test.""" + import m3.mcp_server + + if hasattr(m3.mcp_server, "_bq_client_cache"): + m3.mcp_server._bq_client_cache = {"client": None, "project_id": None} + def test_server_instance_exists(self): """Test that the FastMCP server instance exists.""" assert mcp is not None @@ -70,14 +78,17 @@ def test_backend_init_duckdb_missing_db(self): # allowing the runtime check in _execute_duckdb_query to handle it gracefully. 
import m3.mcp_server - assert m3.mcp_server._db_path == str(Path("/fake/path.duckdb")) + # _db_path was removed, check behavior via internal getter or backend info + assert m3.mcp_server._get_db_path() == str( + Path("/fake/path.duckdb") + ) assert m3.mcp_server._backend == "duckdb" @pytest.mark.skipif( not _bigquery_available(), reason="BigQuery dependencies not available" ) def test_backend_init_bigquery(self): - """Test BigQuery backend initialization.""" + """Test BigQuery backend initialization and client creation.""" mock_ds = DatasetDefinition( name="mock-ds", bigquery_project_id="test-project", @@ -94,8 +105,20 @@ def test_backend_init_bigquery(self): with patch("google.cloud.bigquery.Client") as mock_client: mock_client.return_value = Mock() _init_backend() - # If no exception raised, initialization succeeded - # The project ID might come from env or dataset, both are 'test-project' here + + # _init_backend no longer creates the client eagerly + mock_client.assert_not_called() + + # Call the internal getter to trigger creation + import m3.mcp_server + + client, project_id = m3.mcp_server._get_bq_client() + + assert project_id == "test-project" + mock_client.assert_called_once_with(project="test-project") + + # Second call should be cached (no new client init) + m3.mcp_server._get_bq_client() mock_client.assert_called_once_with(project="test-project") def test_backend_init_invalid(self): @@ -323,6 +346,14 @@ async def test_oauth2_authentication_required(self, test_db): class TestBigQueryIntegration: """Test BigQuery integration with mocks (no real API calls).""" + @pytest.fixture(autouse=True) + def reset_bq_cache(self): + """Reset the BigQuery client cache before each test.""" + import m3.mcp_server + + if hasattr(m3.mcp_server, "_bq_client_cache"): + m3.mcp_server._bq_client_cache = {"client": None, "project_id": None} + @pytest.mark.skipif( not _bigquery_available(), reason="BigQuery dependencies not available" ) From 400483eacd10a8cb100162a194d49f974448c0ce Mon Sep 17 00:00:00 2001 From: hill Date: Tue, 25 Nov 2025 09:34:56 -0500 Subject: [PATCH 4/5] Run pre-commit on newly added file --- tests/test_dynamic_switching.py | 33 ++++++++++++++------------------- 1 file changed, 14 insertions(+), 19 deletions(-) diff --git a/tests/test_dynamic_switching.py b/tests/test_dynamic_switching.py index 1646101..65e1a26 100644 --- a/tests/test_dynamic_switching.py +++ b/tests/test_dynamic_switching.py @@ -1,24 +1,20 @@ -import os -import json -from pathlib import Path -from unittest.mock import patch - -from m3.config import set_active_dataset, get_active_dataset -import m3.mcp_server as server import m3.config as config_mod +import m3.mcp_server as server +from m3.config import set_active_dataset + def test_dynamic_dataset_switching(tmp_path, monkeypatch): # Setup mock data dir data_dir = tmp_path / "m3_data" data_dir.mkdir() - + # Patch config module to use our temp data dir monkeypatch.setattr(config_mod, "_PROJECT_DATA_DIR", data_dir) monkeypatch.setattr(config_mod, "_DEFAULT_DATABASES_DIR", data_dir / "databases") monkeypatch.setattr(config_mod, "_DEFAULT_PARQUET_DIR", data_dir / "parquet") monkeypatch.setattr(config_mod, "_RUNTIME_CONFIG_PATH", data_dir / "config.json") monkeypatch.setattr(config_mod, "_CUSTOM_DATASETS_DIR", data_dir / "datasets") - + # Ensure dirs exist (data_dir / "databases").mkdir() (data_dir / "parquet").mkdir() @@ -28,15 +24,15 @@ def test_dynamic_dataset_switching(tmp_path, monkeypatch): # Verify server defaults to mimic-iv-demo (or falls back) 
monkeypatch.setenv("M3_BACKEND", "duckdb") monkeypatch.delenv("M3_DB_PATH", raising=False) - + # Ensure config is empty/default if (data_dir / "config.json").exists(): (data_dir / "config.json").unlink() - + # Check default fallback ds_def = server._get_active_dataset_def() assert ds_def.name == "mimic-iv-demo" - + db_path = server._get_db_path() # Should point to demo db in our temp dir # Note: get_default_database_path uses the patched _DEFAULT_DATABASES_DIR @@ -45,29 +41,28 @@ def test_dynamic_dataset_switching(tmp_path, monkeypatch): # 2. Set active dataset to something else (simulating 'm3 use') # We can use 'mimic-iv-full' as it is registered set_active_dataset("mimic-iv-full") - + # Verify config file was written assert (data_dir / "config.json").exists() - + # Verify server picks it up ds_def = server._get_active_dataset_def() assert ds_def.name == "mimic-iv-full" - + db_path = server._get_db_path() assert "mimic_iv_full.duckdb" in str(db_path) # 3. Simulate environment variable override (static mode) monkeypatch.setenv("M3_DB_PATH", "/custom/path/to/db.duckdb") - + db_path = server._get_db_path() assert db_path == "/custom/path/to/db.duckdb" - + # Active dataset def should still track the config/env ds_def = server._get_active_dataset_def() assert ds_def.name == "mimic-iv-full" - + # 4. Unset env var, should go back to dynamic monkeypatch.delenv("M3_DB_PATH") db_path = server._get_db_path() assert "mimic_iv_full.duckdb" in str(db_path) - From e29d413d6540f9c8dc546b3e92b24655d31eb823 Mon Sep 17 00:00:00 2001 From: hill Date: Wed, 26 Nov 2025 00:14:15 -0500 Subject: [PATCH 5/5] Improve README and update for multi-dataset support --- README.md | 454 +++++++++++++++++++++++++----------------------------- 1 file changed, 211 insertions(+), 243 deletions(-) diff --git a/README.md b/README.md index 7219796..a6bfdc6 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -# M3: MIMIC-IV + MCP + Models šŸ„šŸ¤– +# M3: Medical Datasets ↔ MCP ↔ Models šŸ„šŸ¤–
M3 Logo @@ -14,6 +14,17 @@ Transform medical data analysis with AI! Ask questions about MIMIC-IV and other PhysioNet datasets in plain English and get instant insights. Choose between local data (free) or full cloud dataset (BigQuery). +## šŸ’” How It Works + +M3 acts as a bridge between your **AI Client** (like Claude Desktop, Cursor, or LibreChat) and your medical data. + +1. **You** ask a question in your chat interface: *"How many patients in the ICU have high blood pressure?"* +2. **M3** securely translates this into a database query. +3. **M3** runs the query on your local or cloud data. +4. **The LLM** explains the results to you in plain English. + +*No SQL knowledge required.* + ## Features - šŸ” **Natural Language Queries**: Ask questions about your medical data in plain English @@ -26,11 +37,17 @@ Transform medical data analysis with AI! Ask questions about MIMIC-IV and other ## šŸš€ Quick Start -> šŸ“ŗ **Prefer video tutorials?** Check out [step-by-step video guides](https://rafiattrach.github.io/m3/) covering setup, PhysioNet configuration, and more. +> **New to this?** šŸ“ŗ [Watch our 5-minute setup video](https://rafiattrach.github.io/m3/) to see it in action. -### Install uv (required for `uvx`) +### Prerequisites +You need an **MCP-compatible Client** to use M3. Popular options include: +- [Claude for Desktop](https://claude.ai/download) +- [Cursor](https://cursor.com) +- [LibreChat](https://www.librechat.ai/) -We use `uvx` to run the MCP server. Install `uv` from the official installer, then verify with `uv --version`. +### 1. Install `uv` (Required) + +We use `uvx` to run the MCP server efficiently. **macOS and Linux:** ```bash @@ -42,322 +59,273 @@ curl -LsSf https://astral.sh/uv/install.sh | sh powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex" ``` -Verify installation: -```bash -uv --version -``` +### 2. Choose Your Data Source -### BigQuery Setup (Optional - Full Dataset) +Select **Option A** (Local) or **Option B** (Cloud). -**Skip this if using DuckDB demo database.** +#### Option A: Local Dataset (Free & Fast) +*Best for development, testing, and offline use.* -1. **Install Google Cloud SDK:** - - macOS: `brew install google-cloud-sdk` - - Windows/Linux: https://cloud.google.com/sdk/docs/install +1. **Create project directory:** + ```bash + mkdir m3 && cd m3 + ``` -2. **Authenticate:** - ```bash - gcloud auth application-default login - ``` - *Opens your browser - choose the Google account with BigQuery access to MIMIC-IV.* +2. **Initialize Dataset:** -### M3 Initialization + We will use MIMIC-IV as an example. -**Supported clients:** [Claude Desktop](https://www.claude.com/download), [Cursor](https://cursor.com/download), [Goose](https://block.github.io/goose/), and [more](https://github.com/punkpeye/awesome-mcp-clients). + **For Demo (Auto-download ~16MB):** + ```bash + uv init && uv add m3-mcp + uv run m3 init mimic-iv-demo + ``` - - - - - -
+ **For Full Data (Requires Manual Download):** + *Download CSVs from [PhysioNet](https://physionet.org/content/mimiciv/3.1/) first and place them in `m3_data/raw_files`.* + ```bash + uv init && uv add m3-mcp + uv run m3 init mimic-iv-full + ``` + *This can take 5-15 minutes depending on your machine* -**DuckDB (Local Datasets)** +3. **Configure Your Client:** -To create a m3 directory and navigate into it run: -```shell -mkdir m3 && cd m3 -``` + **For Claude Desktop (Shortcut):** + ```bash + uv run m3 config claude --quick + ``` -**Option A: MIMIC-IV Demo (Auto-Download)** -```shell -uv init && uv add m3-mcp && \ -uv run m3 init mimic-iv-demo && uv run m3 config --quick -``` -*Downloads ~16MB automatically.* + **For Other Clients (Cursor, LibreChat, etc.):** + ```bash + uv run m3 config --quick + ``` + *This generates the configuration JSON you need to paste into your client's settings.* -**Option B: Full Datasets (Manual Download)** -1. Download CSVs from PhysioNet. -2. Run init with source path: -```shell -uv run m3 init mimic-iv-full --src /path/to/raw/csvs -``` -3. Configure client: -```shell -uv run m3 config --quick -``` +#### Option B: BigQuery (Full Cloud Dataset) +*Best for researchers with Google Cloud access.* - +1. **Authenticate with Google:** + ```bash + gcloud auth application-default login + ``` -**BigQuery (Full Dataset)** +2. **Configure Client:** + ```bash + uv run m3 config --backend bigquery --project_id BIGQUERY_PROJECT_ID + ``` + *This also generates the configuration JSON you need to paste into your client's settings.* -Requires GCP credentials and PhysioNet access. -Paste this into your client config JSON file: -```json -{ - "mcpServers": { - "m3": { - "command": "uvx", - "args": ["m3-mcp"], - "env": { - "M3_BACKEND": "bigquery", - "M3_PROJECT_ID": "your-project-id" - } - } - } -} +### 3. Start Asking Questions! +Restart your MCP client and try: +- "What tools do you have for MIMIC-IV data?" +- "Show me patient demographics from the ICU" +- "What is the race distribution in admissions?" + +--- + +## šŸ”„ Managing Datasets + +Switch between available datasets instantly: + +```bash +# Switch to full dataset +m3 use mimic-iv-full + +# Switch back to demo +m3 use mimic-iv-demo + +# Check status +m3 status ``` -*Replace `your-project-id` with your Google Cloud project ID.* +--- -
+## Backend Comparison -**That's it!** Restart your MCP client and ask: -- "What tools do you have for MIMIC-IV data?" -- "Show me patient demographics from the ICU" -- "What is the race distribution in admissions?" +| Feature | DuckDB (Demo) | DuckDB (Full) | BigQuery (Full) | +|---------|---------------|---------------|-----------------| +| **Cost** | Free | Free | BigQuery usage fees | +| **Setup** | Zero config | Manual Download | GCP credentials required | +| **Credentials** | Not required | PhysioNet | PhysioNet | +| **Data Size** | 100 patients | 365k patients | 365k patients | +| **Speed** | Fast (local) | Fast (local) | Network latency | +| **Use Case** | Learning | Research (local) | Research, production | --- ## āž• Adding Custom Datasets -M3 is designed to be modular. You can add support for any tabular dataset easily. +M3 is designed to be modular. You can add support for any tabular dataset on PhysioNet easily. Let's take eICU as an example: + +### JSON Definition Method + +1. Create a definition file: `m3_data/datasets/eicu.json` + ```json + { + "name": "eicu", + "description": "eICU Collaborative Research Database", + "file_listing_url": "https://physionet.org/files/eicu-crd/2.0/", + "subdirectories_to_scan": [], + "primary_verification_table": "eicu_crd_patient", + "tags": ["clinical", "eicu"], + "requires_authentication": true, + "bigquery_project_id": "physionet-data", + "bigquery_dataset_ids": ["eicu_crd"] + } + ``` -### 1. CLI Method (Ad-hoc) +2. Initialize it: + ```bash + m3 init eicu --src /path/to/raw/csvs + ``` + *M3 will convert CSVs to Parquet and create DuckDB views automatically.* -If you have a folder of CSV/CSV.gz files, you can initialize it directly as a custom dataset: +--- +## Alternative Installation Methods + +> Already have Docker or prefer pip? + +### 🐳 Docker + + + + + + +
+ +**DuckDB (Local):** ```bash -# Not yet implemented in CLI but supported by architecture -# Future: m3 init --local /path/to/my/csvs --name my-custom-study +git clone https://github.com/rafiattrach/m3.git && cd m3 +docker build -t m3:lite --target lite . +docker run -d --name m3-server m3:lite tail -f /dev/null ``` -Currently, you can register new datasets by creating a definition file. + -### 2. JSON Definition Method +**BigQuery:** +```bash +git clone https://github.com/rafiattrach/m3.git && cd m3 +docker build -t m3:bigquery --target bigquery . +docker run -d --name m3-server \ + -e M3_BACKEND=bigquery \ + -e M3_PROJECT_ID=your-project-id \ + -v $HOME/.config/gcloud:/root/.config/gcloud:ro \ + m3:bigquery tail -f /dev/null +``` -Create a JSON file in `m3_data/datasets/my_study.json`: +
+**MCP config (same for both):** ```json { - "name": "my-study", - "description": "My custom clinical study data", - "file_listing_url": null, - "subdirectories_to_scan": ["data", "metadata"], - "default_duckdb_filename": "my_study.duckdb", - "tags": ["clinical", "custom"] + "mcpServers": { + "m3": { + "command": "docker", + "args": ["exec", "-i", "m3-server", "python", "-m", "m3.mcp_server"] + } + } } ``` -Then initialize it: +### pip Install ```bash -m3 init my-study --src /path/to/raw/csvs +pip install m3-mcp +m3 config --quick ``` -M3 will: -1. Scan the source directory for CSVs -2. Convert them to Parquet -3. Create DuckDB views automatically (e.g. `data/patients.csv` -> table `data_patients`) +### Local Development + +For contributors: + +1. **Clone & Install (using `uv`):** + ```bash + git clone https://github.com/rafiattrach/m3.git + cd m3 + uv venv + uv sync + ``` + +2. **MCP Config:** + ```json + { + "mcpServers": { + "m3": { + "command": "/absolute/path/to/m3/.venv/bin/python", + "args": ["-m", "m3.mcp_server"], + "cwd": "/absolute/path/to/m3", + "env": { "M3_BACKEND": "duckdb" } + } + } + } + ``` --- ## šŸ”§ Advanced Configuration -Need to configure other MCP clients or customize settings? Use these commands: - -### Interactive Configuration (Universal) +**Interactive Config Generator:** ```bash m3 config ``` -Generates configuration for any MCP client with step-by-step guidance. -### Quick Configuration Examples +**OAuth2 Authentication:** +For secure production deployments: ```bash -# Quick universal config with defaults -m3 config --quick - -# Universal config with custom DuckDB database -m3 config --quick --backend duckdb --db-path /path/to/database.duckdb - -# Save config to file for other MCP clients -m3 config --output my_config.json -``` - -### OAuth2 Authentication (Optional) - -For production deployments requiring secure access to medical data: - -```bash -# Enable OAuth2 with Claude Desktop m3 config claude --enable-oauth2 \ --oauth2-issuer https://your-auth-provider.com \ - --oauth2-audience m3-api \ - --oauth2-scopes "read:mimic-data" - -# Or configure interactively -m3 config # Choose OAuth2 option during setup + --oauth2-audience m3-api ``` - -**Supported OAuth2 Providers:** -- Auth0, Google Identity Platform, Microsoft Azure AD, Keycloak -- Any OAuth2/OpenID Connect compliant provider - -> šŸ“– **Complete OAuth2 Setup Guide**: See [`docs/OAUTH2_AUTHENTICATION.md`](docs/OAUTH2_AUTHENTICATION.md) for detailed configuration, troubleshooting, and production deployment guidelines. +> See [`docs/OAUTH2_AUTHENTICATION.md`](docs/OAUTH2_AUTHENTICATION.md) for details. 
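
For developers who want to exercise the server outside a chat client, the pattern from `tests/test_mcp_server.py` can be reused directly. A minimal sketch, assuming a local dataset was initialized with `m3 init mimic-iv-demo` and that `mcp` is importable from `m3.mcp_server` (the query and table name are illustrative):

```python
import asyncio

from fastmcp import Client

from m3.mcp_server import mcp


async def main() -> None:
    async with Client(mcp) as client:
        # Same tool a chat client would call under the hood.
        result = await client.call_tool(
            "execute_mimic_query",
            {"sql_query": "SELECT COUNT(*) AS admissions FROM hosp_admissions"},
        )
        print(result)


asyncio.run(main())
```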
--- ## šŸ› ļø Available MCP Tools -When your MCP client processes questions, it uses these tools automatically: - - **get_database_schema**: List all available tables -- **get_table_info**: Get column info and sample data for a table +- **get_table_info**: Get column info and sample data - **execute_mimic_query**: Execute SQL SELECT queries -- **get_icu_stays**: ICU stay information and length of stay data +- **get_icu_stays**: ICU stay info & length of stay - **get_lab_results**: Laboratory test results -- **get_race_distribution**: Patient race distribution +- **get_race_distribution**: Patient race statistics ## Example Prompts -Try asking your MCP client these questions: - -**Demographics & Statistics:** - -- `Prompt:` *What is the race distribution in admissions?* -- `Prompt:` *Show me patient demographics for ICU stays* -- `Prompt:` *How many total admissions are in the database?* +**Demographics:** +- *What is the race distribution in MIMIC-IV admissions?* +- *Show me patient demographics for ICU stays* **Clinical Data:** +- *Find lab results for patient X* +- *What lab tests are most commonly ordered?* -- `Prompt:` *Find lab results for patient X* -- `Prompt:` *What lab tests are most commonly ordered?* -- `Prompt:` *Show me recent ICU admissions* - -**Data Exploration:** - -- `Prompt:` *What tables are available in the database?* -- `Prompt:` *What tools do you have for MIMIC-IV data?* - -## šŸŽ© Pro Tips - -- Do you want to pre-approve the usage of all tools in Claude Desktop? Use the prompt below and then select **Always Allow** - - `Prompt:` *Can you please call all your tools in a logical sequence?* - -## šŸ” Troubleshooting - -### Common Issues - -**Local "Parquet not found" or view errors:** -Rerun the `m3 init` command for your chosen dataset. +**Exploration:** +- *What tables are available in the database?* -**MCP client server not starting:** -1. Check your MCP client logs (for Claude Desktop: Help → View Logs) -2. Verify configuration file location and format -3. Restart your MCP client completely - -### OAuth2 Authentication Issues - -**"Missing OAuth2 access token" errors:** -```bash -# Set your access token -export M3_OAUTH2_TOKEN="Bearer your-access-token-here" -``` - -**"OAuth2 authentication failed" errors:** -- Verify your token hasn't expired -- Check that required scopes are included in your token -- Ensure your OAuth2 provider configuration is correct - -**Rate limit exceeded:** -- Wait for the rate limit window to reset -- Contact your administrator to adjust limits if needed - -> šŸ”§ **OAuth2 Troubleshooting**: See [`OAUTH2_AUTHENTICATION.md`](docs/OAUTH2_AUTHENTICATION.md) for detailed OAuth2 troubleshooting and configuration guides. - -### BigQuery Issues - -**"Access Denied" errors:** -- Ensure you have MIMIC-IV access on PhysioNet -- Verify your Google Cloud project has BigQuery API enabled -- Check that you're authenticated: `gcloud auth list` - -**"Dataset not found" errors:** -- Confirm your project ID is correct -- Ensure you have access to `physionet-data` project - -**Authentication issues:** -```bash -# Re-authenticate -gcloud auth application-default login - -# Check current authentication -gcloud auth list -``` - -## For Developers - -> See "Local Development" section above for setup instructions. 
- -### Running Tests - -```bash -pytest # All tests (includes OAuth2 and BigQuery mocks) -pytest tests/test_mcp_server.py -v # MCP server tests -pytest tests/test_oauth2_auth.py -v # OAuth2 authentication tests -``` - -### Test BigQuery Locally - -```bash -# Set environment variables -export M3_BACKEND=bigquery -export M3_PROJECT_ID=your-project-id -export GOOGLE_CLOUD_PROJECT=your-project-id - -# Optional: Test with OAuth2 authentication -export M3_OAUTH2_ENABLED=true -export M3_OAUTH2_ISSUER_URL=https://your-provider.com -export M3_OAUTH2_AUDIENCE=m3-api -export M3_OAUTH2_TOKEN="Bearer your-test-token" - -# Test MCP server -m3-mcp-server -``` - -## Roadmap - -- šŸ  **Complete Local Full Dataset**: Complete the support for `mimic-iv-full` (Download CLI) -- šŸ”§ **Advanced Tools**: More specialized medical data functions -- šŸ“Š **Visualization**: Built-in plotting and charting tools -- šŸ” **Enhanced Security**: Role-based access control, audit logging -- 🌐 **Multi-tenant Support**: Organization-level data isolation +--- -## Contributing +## Troubleshooting -We welcome contributions! Please: +- **"Parquet not found"**: Rerun `m3 init `. +- **MCP client not starting**: Check logs (Claude Desktop: Help → View Logs). +- **BigQuery Access Denied**: Run `gcloud auth application-default login` and verify project ID. -1. Fork the repository -2. Create a feature branch -3. Add tests for new functionality -4. Submit a pull request +--- -## Citation +## Contributing & Citation -If you use M3 in your research, please cite: +### For Developers +We welcome contributions! +1. **Setup:** Follow the "Local Development" steps above. +2. **Test:** Run `uv run pre-commit --all-files` to ensure everything is working and linted. +3. **Submit:** Open a Pull Request with your changes. +**Citation:** ```bibtex @article{attrach2025conversational, title={Conversational LLMs Simplify Secure Clinical Data Access, Understanding, and Analysis},