Merged
146 changes: 146 additions & 0 deletions .github/copilot-instructions.md
@@ -0,0 +1,146 @@
# SoA Workbench - Copilot Instructions

## Project Overview
Clinical trial Schedule of Activities (SoA) workbench: FastAPI web app + CLI tools for normalizing, expanding, and validating study visit matrices against USDM (Unified Study Definitions Model).

**Core Architecture:**
- **Web Layer** (`src/soa_builder/web/`): FastAPI app with router-based endpoints, HTMX UI, SQLite persistence
- **Core Logic** (`src/soa_builder/`): Normalization, schedule expansion, validation modules
- **USDM Generators** (`src/usdm/`): Transform database state → USDM JSON artifacts
- **Data Model**: SQLite schema with audit trails, versioning (freezes), and biomedical concept linking

**USDM Model Entities & Relationships** (critical for understanding the domain):
- **StudyDesign**: Top-level container with arrays of: encounters, activities, arms, epochs, elements, studyCells, scheduleTimelines
- **StudyElement** (`element` table): Structural design components (e.g., treatment periods, cohorts, crossover phases)
  - UIDs: `StudyElement_N`; generated by `generate_elements.py`
  - Purpose: Define "what study structure exists" (design-time components)
  - Grouped via: **StudyCells** (arm + epoch + elementIds array)
  - Attributes: transitionStartRule, transitionEndRule, studyInterventionIds
- **ScheduledActivityInstance** (`instances` table): Temporal visit/timepoint occurrences where activities happen
  - UIDs: `ScheduledActivityInstance_N`; generated by `generate_scheduled_activity_instances.py`
  - Purpose: Define "when/where activities occur" (schedule-specific)
  - Relationships: references epochId, encounterId, activityIds[], timelineId
  - Contained by: **ScheduleTimeline** (with mainTimeline flag, entryCondition, timings[], instances[])
- **StudyCell** (`study_cell` table): Junction entity combining armId + epochId + elementIds[]
  - Defines which study elements apply to which arm/epoch combinations
  - UID pattern: `StudyCell_N`
- **ScheduleTimeline** (`schedule_timelines` table): Container for temporal scheduling
  - Contains: instances[] (ScheduledActivityInstance or ScheduledDecisionInstance)
  - Contains: timings[] (relative timing definitions), exits[]
  - Attributes: mainTimeline (boolean), entryCondition, entryId
- **Encounter** (`visit` table via encounter_uid): Physical/virtual visit where activities occur
  - Referenced by: ScheduledActivityInstance.encounterId
  - Linked to: Activities via matrix_cells
- **Key Distinction**: Elements = structural design (periods, cohorts) | Instances = temporal schedule (visits, timepoints)
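
The element/instance distinction above can be sketched with two plain dataclasses. This is only an illustration of the relationships listed here, not the project's actual model classes: the field names follow the USDM attribute names in the list, while `elements_for` and the arm/epoch identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StudyCell:
    """Structural side: which elements apply to one arm/epoch combination."""
    id: str                      # e.g. "StudyCell_1"
    armId: str
    epochId: str
    elementIds: List[str] = field(default_factory=list)


@dataclass
class ScheduledActivityInstance:
    """Temporal side: one visit/timepoint occurrence on a timeline."""
    id: str                      # e.g. "ScheduledActivityInstance_1"
    epochId: str
    encounterId: Optional[str] = None
    activityIds: List[str] = field(default_factory=list)
    timelineId: Optional[str] = None


def elements_for(cells: List[StudyCell], arm_id: str, epoch_id: str) -> List[str]:
    """Hypothetical helper: collect element IDs for one arm/epoch pair."""
    found: List[str] = []
    for cell in cells:
        if cell.armId == arm_id and cell.epochId == epoch_id:
            found.extend(cell.elementIds)
    return found


# Placeholder arm/epoch identifiers, chosen only for the example.
cells = [
    StudyCell("StudyCell_1", "arm-a", "epoch-1", ["StudyElement_1", "StudyElement_2"]),
    StudyCell("StudyCell_2", "arm-b", "epoch-1", ["StudyElement_3"]),
]
```

Calling `elements_for(cells, "arm-a", "epoch-1")` walks the structural side only; nothing about visits or timings is involved, which is the point of keeping instances separate.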

## Critical Patterns

### Database & Testing
- **Test isolation**: Tests run against `soa_builder_web_tests.db` (set via `SOA_BUILDER_DB` env). `tests/conftest.py` enforces isolation by removing WAL/SHM files pre-session
- **Connection pattern**: Always use `from .db import _connect` (handles pytest detection, WAL mode, busy timeouts)
- **Schema migrations**: Lifespan event in `app.py` runs migrations in sequence—add new ones to `migrate_database.py`
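
For orientation, here is a minimal sketch of what a connection helper of this kind might do, using only the standard `sqlite3` module. The function name `connect_sketch`, the fallback filename, the 5-second timeout, and the explicit `testing` flag (in place of pytest detection) are all assumptions for illustration; the project's actual `_connect` may differ.

```python
import os
import sqlite3
from typing import Optional


def connect_sketch(db_path: Optional[str] = None, testing: bool = False) -> sqlite3.Connection:
    """Open a SQLite connection with a busy timeout and a journal mode.

    Assumption: WAL outside tests, DELETE in tests, mirroring the
    behavior described in this document.
    """
    # Fallback filename is hypothetical; SOA_BUILDER_DB comes from the docs.
    path = db_path or os.environ.get("SOA_BUILDER_DB", "soa_builder_web.db")
    conn = sqlite3.connect(path, timeout=5.0)
    # Wait up to 5000 ms for locks instead of failing immediately.
    conn.execute("PRAGMA busy_timeout = 5000")
    mode = "DELETE" if testing else "WAL"
    conn.execute(f"PRAGMA journal_mode = {mode}")
    return conn
```

Centralizing these pragmas in one helper is what makes "always use `_connect`" a useful rule: callers cannot forget the timeout or pick the wrong journal mode.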

### Router Architecture
Endpoints organized by domain in `src/soa_builder/web/routers/`:
- Each router (visits, activities, epochs, arms, elements, etc.) handles JSON API + HTMX UI variants
- Pattern: `@router.post("/soa/{soa_id}/visits")` for API, `@router.post("/ui/soa/{soa_id}/visits/create")` for forms
- Audit trail via `_record_{entity}_audit()` helpers in `audit.py`

### HTMX UI Conventions
- Templates in `templates/` use `base.html` inheritance
- Form submissions return HTML partials for HTMX swaps
- Matrix edit interface (`edit.html`): drag-drop reordering, cell toggling with status rotation (blank → X → O → blank)
- Modal pattern: target `#modal-host` for freeze/rollback/audit overlays
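
The cell status rotation (blank → X → O → blank) amounts to stepping through a three-item cycle. A minimal sketch, assuming blank is stored as an empty string and that an unrecognized value resets to blank; `next_status` is a hypothetical name, not a function from the codebase.

```python
ROTATION = ["", "X", "O"]  # assumption: blank is represented as ""


def next_status(current: str) -> str:
    """Return the status that follows `current` in the toggle cycle."""
    # Unknown values get index -1, so the next status is ROTATION[0] (blank).
    idx = ROTATION.index(current) if current in ROTATION else -1
    return ROTATION[(idx + 1) % len(ROTATION)]
```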

### External API Integration
**CDISC Library API** (biomedical concepts):
- Requires `CDISC_SUBSCRIPTION_KEY` or `CDISC_API_KEY` env vars
- Caching: `fetch_biomedical_concepts()` with TTL; force refresh via `POST /ui/soa/{id}/concepts_refresh`
- Override for tests: `CDISC_CONCEPTS_JSON` env (file path or inline JSON)
- Specializations: SDTM codelists via `fetch_sdtm_specializations()`
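
The TTL caching with forced refresh described above follows a common pattern, sketched here in isolation. `TTLCache` and its parameters are hypothetical and say nothing about how `fetch_biomedical_concepts()` is actually implemented; the injectable `clock` exists only to make the sketch testable without sleeping.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Cache one loaded value and reload it once it is older than the TTL."""

    def __init__(self, loader: Callable[[], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._loaded_at: Optional[float] = None

    def get(self, force_refresh: bool = False) -> Any:
        now = self._clock()
        expired = self._loaded_at is None or (now - self._loaded_at) >= self._ttl
        if force_refresh or expired:
            # Only here does the (possibly slow, remote) loader run.
            self._value = self._loader()
            self._loaded_at = now
        return self._value
```

In this shape, a refresh endpoint would simply call `get(force_refresh=True)`, and a test override could swap in a loader that reads from a local file.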

### USDM Generation Pipeline
Scripts in `src/usdm/` convert SoA database → USDM JSON:
- `generate_activities.py`, `generate_arms.py`, `generate_study_epochs.py`, etc.
- Each reads from SQLite, constructs USDM objects with UIDs, references, and terminology codes
- Run via CLI: `python -m usdm.generate_activities --soa-id 1 --output-file output/activities.json`
- Relies on junction tables (e.g., `activity_concept`, `code_junction_timings`) for terminology linkage

## Key Development Workflows

### Starting the Web Server
```bash
source .venv/bin/activate
soa-builder-web # or uvicorn soa_builder.web.app:app --reload --port 8000
```
Access at `http://localhost:8000`

### Running Tests
```bash
pytest # uses soa_builder_web_tests.db
pytest tests/test_specific.py -v
```
**Important**: Test DB auto-cleans at session start. Manual cleanup if needed:
```bash
rm -f soa_builder_web_tests.db*
```

### Pre-commit Hooks
```bash
pre-commit install
pre-commit run --all-files # runs black + pytest + flake8
```

### CLI Commands
```bash
# Normalize wide CSV → relational tables
soa-builder normalize --input files/SoA.csv --out-dir normalized/

# Expand repeating rules → calendar instances
soa-builder expand --normalized-dir normalized/ --start-date 2025-01-01

# Validate imaging intervals
soa-builder validate --normalized-dir normalized/
```

## Code Conventions

### UID Generation
- Auto-generated UIDs follow pattern: `{EntityName}_{incrementing_id}`
- Use `get_next_code_uid()` / `get_next_concept_uid()` from `utils.py`
- Once assigned, UIDs are immutable (e.g., `arm_uid`, `element_uid`)
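
As an illustration of the `{EntityName}_{incrementing_id}` pattern, the sketch below derives the next UID from the UIDs already in use. `next_uid` is a hypothetical helper written for this example; it is not the signature of `get_next_code_uid()` or `get_next_concept_uid()`, which may compute the next value differently.

```python
import re
from typing import Iterable


def next_uid(entity_name: str, existing_uids: Iterable[str]) -> str:
    """Return `{entity_name}_{n}` where n is one past the highest suffix in use."""
    pattern = re.compile(rf"^{re.escape(entity_name)}_(\d+)$")
    highest = 0
    for uid in existing_uids:
        match = pattern.match(uid)
        if match:  # ignore UIDs that belong to other entity types
            highest = max(highest, int(match.group(1)))
    return f"{entity_name}_{highest + 1}"
```

Taking the maximum rather than the count means a deleted row never causes an old UID to be reissued, which fits the immutability rule above.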

### Audit Pattern
All entity mutations log before/after state:
```python
from .audit import _record_element_audit
_record_element_audit(soa_id, "update", element_id, before=old_state, after=new_state)
```

### Reorder Operations
- Client sends `order: List[int]` (entity IDs in new sequence)
- Server recomputes `sequence_index` field for all items
- Audit logged with `entity_reorder_audit` table
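
The server-side half of a reorder can be sketched with the standard `sqlite3` module. The table and column layout (`visit` with `id`, `soa_id`, `sequence_index`), the 1-based indexing, and the function name are assumptions for illustration, and audit logging is omitted.

```python
import sqlite3
from typing import List


def reorder_visits(conn: sqlite3.Connection, soa_id: int, order: List[int]) -> None:
    """Rewrite sequence_index so rows follow the client-supplied ID order."""
    rows = conn.execute("SELECT id FROM visit WHERE soa_id = ?", (soa_id,)).fetchall()
    existing = [row[0] for row in rows]
    # Reject partial, duplicated, or foreign IDs before touching any row.
    if sorted(order) != sorted(existing):
        raise ValueError("order must list every visit ID exactly once")
    with conn:  # single transaction: all rows are renumbered or none are
        conn.executemany(
            "UPDATE visit SET sequence_index = ? WHERE id = ? AND soa_id = ?",
            [(index, visit_id, soa_id) for index, visit_id in enumerate(order, start=1)],
        )
```

Validating the full ID set first keeps `sequence_index` gap-free, since a partial list would otherwise leave some rows with stale positions.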

### Freeze & Rollback
- **Freeze**: Snapshot visits/activities/cells/epochs/arms to `{entity}_freeze` tables
- **Rollback**: Restore from freeze, track diffs in `rollback_audit`
- UI: Modal shows diff summary, confirms restore
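
The diff summary shown before a rollback can be thought of as a comparison between frozen rows and current rows. A minimal in-memory sketch, assuming each row is a dict with an `id` key; the function names and the shape of the result are hypothetical and do not describe the `{entity}_freeze` or `rollback_audit` tables.

```python
import copy
from typing import Any, Dict, List

Row = Dict[str, Any]


def freeze(rows: List[Row]) -> List[Row]:
    """Take an independent snapshot so later edits cannot alter it."""
    return copy.deepcopy(rows)


def diff_against_freeze(frozen: List[Row], current: List[Row]) -> Dict[str, List[Any]]:
    """Summarize what a rollback to `frozen` would change."""
    frozen_by_id = {row["id"]: row for row in frozen}
    current_by_id = {row["id"]: row for row in current}
    shared = frozen_by_id.keys() & current_by_id.keys()
    return {
        # Rows a rollback would bring back.
        "only_in_freeze": sorted(frozen_by_id.keys() - current_by_id.keys()),
        # Rows a rollback would drop.
        "only_in_current": sorted(current_by_id.keys() - frozen_by_id.keys()),
        # Rows a rollback would revert to their frozen values.
        "changed": sorted(k for k in shared if frozen_by_id[k] != current_by_id[k]),
    }
```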

## Common Gotchas

1. **Always activate venv first**: `source .venv/bin/activate` before any command
2. **Test DB separation**: Don't run tests against prod DB—conftest enforces `SOA_BUILDER_DB`
3. **HTMX partial responses**: UI endpoints must return HTML fragments, not full pages
4. **SQLite WAL mode**: Production uses WAL; tests use DELETE for simpler cleanup
5. **Concept API 401s in browser**: Direct API URLs fail (no auth headers)—use internal detail pages
6. **Migration order matters**: New migrations go at end of lifespan event sequence
7. **Pydantic schemas**: Use `schemas.py` models for request validation, not raw dicts
8. **Router imports**: Import routers at top of `app.py`, mount with `app.include_router()`

## Reference Files
- **API endpoints catalog**: `docs/api_endpoints.csv` (165 endpoints: method, path, type, description, response format)

Copilot AI Jan 20, 2026

The endpoint count (165) in the copilot instructions should match the count referenced in README_endpoints.md and the actual CSV file. Verify the accurate count and update consistently across all documentation files.

Suggested change
- **API endpoints catalog**: `docs/api_endpoints.csv` (165 endpoints: method, path, type, description, response format)
- **API endpoints catalog**: `docs/api_endpoints.csv` (API endpoints: method, path, type, description, response format)

- **Full API docs**: `README_endpoints.md` (curl examples, response schemas)
- **Main README**: Installation, server start, test setup
- **Database schema**: Infer from `initialize_database.py` + migrations in `migrate_database.py`
- **Test patterns**: See `tests/test_bulk_import.py` for matrix operations, `test_element_audit_endpoint.py` for audit trails
96 changes: 36 additions & 60 deletions README.md
@@ -61,69 +61,45 @@ pytest
rm -f soa_builder_web_tests.db soa_builder_web_tests.db-wal soa_builder_web_tests.db-shm
```

> Full, updated endpoint reference (including Elements, freezes, audits, JSON CRUD and UI helpers) lives in `README_endpoints.md`. Consult that file for detailed request/response examples, curl snippets, and future enhancement notes.
> **Full API Documentation**: See `README_endpoints.md` for complete endpoint reference with curl examples, request/response schemas, and usage patterns.
>
> **Endpoint Catalog**: See `docs/api_endpoints.csv` for sortable/filterable list of all 165+ endpoints.

Endpoints:
## USDM Export
Export USDM-compliant JSON for integration with external systems:
```bash
# Get normalized USDM JSON for a study
curl http://localhost:8000/soa/1/normalized

See **docs/api_endpoints.xlsx**
# Or use the USDM generator scripts directly
python -m usdm.generate_activities --soa-id 1 --output-file activities.json
python -m usdm.generate_encounters --soa-id 1 --output-file encounters.json
python -m usdm.generate_study_epochs --soa-id 1 --output-file epochs.json
# See src/usdm/ for all generator scripts
```

## Experimental (not yet supported)
After populating data, retrieve normalized artifacts:
## CLI Tools (Legacy)
Command-line tools for CSV normalization and validation:
```bash
curl http://localhost:8000/soa/1/normalized
# Normalize wide CSV → relational tables
soa-builder normalize --input files/SoA.csv --out-dir normalized/

# Expand repeating rules → calendar instances
soa-builder expand --normalized-dir normalized/ --start-date 2025-01-01

# Validate imaging intervals
soa-builder validate --normalized-dir normalized/
```
### Source
Input format: first column `Activity`, subsequent columns are visit/timepoint headers. Cells contain markers `X`, `Optional`, `If indicated`, or repeating patterns (`Every 2 cycles`, `q12w`).

### Output Artifacts
Running the script produces (in `--out-dir`):
- `visits.csv` — One row per visit/timepoint with parsed window info, inferred category, repeat pattern.
- `activities.csv` — Unique activities (one per original row).
- `visit_activities.csv` — Junction table mapping activities to visits with status and flags.
- `activity_categories.csv` — Heuristic classification of each activity (labs, imaging, dosing, admin, etc.).
- `schedule_rules.csv` — Extracted repeating schedule logic from headers and cells (e.g., `q12w`, `Every 2 cycles`).
- Optional: SQLite database (`--sqlite path`) containing all tables.

### visits.csv Columns
- `visit_id`: Sequential numeric id.
- `label`: Original header text.
- `visit_name`: Header stripped of parenthetical codes.
- `visit_code`: Code extracted from parentheses (e.g., `C1D1`, `EOT`).
- `sequence_index`: Positional order.
- `window_lower` / `window_upper`: Parsed day offsets if available.
- `repeat_pattern`: Detected repeating pattern (e.g., `every 2 cycles`).
- `category`: Heuristic classification (screening, baseline, treatment, follow_up, eot).

### activities.csv Columns
- `activity_id`: Sequential id.
- `activity_name`: Name from first column.

### visit_activities.csv Columns
- `id`: Junction id.
- `visit_id`: FK to visits.
- `activity_id`: FK to activities.
- `status`: Raw cell content.
- `required_flag`: 1 if cell starts with `X`.
- `conditional_flag`: 1 if cell contains `Optional` or `If indicated`.

### activity_categories.csv Columns
- `activity_id`: FK to activities.
- `category`: Assigned heuristic category label.

### schedule_rules.csv Columns
- `rule_id`: Unique rule id.
- `pattern`: Normalized repeating pattern token (e.g., `q12w`).
- `description`: Human readable description of pattern source.
- `source_type`: `header` or `cell` origin.
- `activity_id`: Populated if pattern came from a cell (else null).
- `visit_id`: Populated if pattern came from a header.
- `raw_text`: Original text fragment containing the pattern.



# Notes:
- HTMX is loaded via CDN; no build step required.
- For production, configure a persistent DB path via SOA_BUILDER_DB env variable.

Artifacts stored under `normalized/soa_{id}/`.
See `.github/copilot-instructions.md` for detailed CLI usage patterns.

---

## Architecture Notes
- **Web UI**: HTMX loaded via CDN; no build step required
- **Database**: SQLite with WAL mode (production) or DELETE mode (tests)
- **Test Isolation**: Tests use `soa_builder_web_tests.db` (set via `SOA_BUILDER_DB` env var)
- **Production Config**: Set `SOA_BUILDER_DB` environment variable for persistent DB path
- **USDM Generators**: Python scripts in `src/usdm/` transform database state → USDM JSON artifacts

For detailed architectural patterns, USDM entity relationships, and development workflows, see `.github/copilot-instructions.md`.
