Conversation
…les to init statement
…gn with CORE rules
…ctivigty_concept table
…cal concept to activity
There was a problem hiding this comment.
Pull request overview
Adds new biomedical concept & property persistence (plus audit/backfill/enrichment) and refactors code usage to a new code_association table across USDM generation + web routes.
Changes:
- Introduces
biomedical_concept,biomedical_concept_property,alias_code, and related audit/backfill/enrichment flows (including one-time scripts). - Replaces many usages of legacy
coderows for SOA object mappings withcode_association(and updates USDM exporters accordingly). - Tightens DB path resolution for test isolation and adds new migrations.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_timings_code_junction.py | Removes timing/code junction behavior tests (no replacement in this PR). |
| tests/test_code_uid_generation.py | Updates UID-generation tests to insert into code_association. |
| src/usdm/generate_study_timings.py | Switches timing terminology lookups to code_association and adjusts exported Code fields. |
| src/usdm/generate_study_epochs.py | Switches epoch type join from code to code_association. |
| src/usdm/generate_encounters.py | Switches encounter code lookups from code to code_association. |
| src/usdm/generate_arms.py | Switches arm terminology lookups from code to code_association. |
| src/soa_builder/web/utils.py | Adds alias-code UID generator; switches code UID generator to code_association. |
| src/soa_builder/web/routers/visits.py | Uses code_association for visit code mappings and updates some codelist metadata values. |
| src/soa_builder/web/routers/timings.py | Uses code_association for timing code mappings and inserts. |
| src/soa_builder/web/routers/schedule_timelines.py | Reads timing code mappings from code_association. |
| src/soa_builder/web/routers/epochs.py | Uses code_association for epoch type mappings and updates comment. |
| src/soa_builder/web/routers/arms.py | Uses code_association for arm type/data-origin mappings and updates insert/update SQL. |
| src/soa_builder/web/routers/activities.py | Adds background enrichment/property population and orphan cleanup when setting concepts. |
| src/soa_builder/web/migrate_database.py | Removes code_junction migration; adds biomedical concept audit + backfill + property UID migration. |
| src/soa_builder/web/initialize_database.py | Adds new biomedical concept/code tables and introduces code_association. |
| src/soa_builder/web/db.py | Resolves DB path dynamically at connection time; prevents tests from hitting prod DB. |
| src/soa_builder/web/audit.py | Adds biomedical concept audit recording helper. |
| src/soa_builder/web/app.py | Wires new migrations; adds enrichment on startup; adds concept/property upsert + orphan cleanup helpers. |
| scripts/populate_biomedical_concept_property.py | New one-time script to populate BC properties from DSS variables. |
| scripts/populate_biomedical_concept.py | New one-time script to populate biomedical_concept from activity_concept. |
| scripts/enrich_biomedical_concept.py | New one-time script to enrich BC label/description from CDISC API. |
Comments suppressed due to low confidence (3)
src/soa_builder/web/app.py:1845
- Spawning one thread per unenriched row can create severe contention and resource spikes (and may accidentally trigger a large number of outbound API calls on startup). Consider using a bounded executor (e.g.,
ThreadPoolExecutor(max_workers=...)) or a queue/worker pattern, and gating the startup enrichment behind a config flag and/or presence of credentials to avoid unexpected network calls during local dev/test runs.
for _concept_code, _soa_id in _unenriched:
threading.Thread(
target=_enrich_code_bg,
args=(_concept_code, _soa_id),
daemon=True,
).start()
src/soa_builder/web/initialize_database.py:324
- The runtime upsert logic treats
(soa_id, standard_code)as a 1:1 mapping (e.g.,_upsert_alias_codeselects bysoa_id+standard_codeand expects a single row), but the schema only enforces uniqueness on(soa_id, alias_code_uid). This allows multiple aliases to point to the samestandard_code, making reads ambiguous (fetchone()). Consider adding a unique constraint or index on(soa_id, standard_code)(allowing NULLs if desired).
"""CREATE TABLE IF NOT EXISTS alias_code (
id INTEGER PRIMARY KEY AUTOINCREMENT,
soa_id INT NOT NULL,
alias_code_uid TEXT NOT NULL,
standard_code TEXT, -- reference to code_uid value
UNIQUE(alias_code_uid, soa_id)
)"""
src/soa_builder/web/routers/timings.py:612
- The PR removes
tests/test_timings_code_junction.py, but this timing update path still contains nuanced behavior (creating a new Code_N on change, not creating one on unchanged updates). Consider reintroducing equivalent coverage targetingcode_association(and any new uniqueness rules) to prevent regressions in timing code UID creation and update semantics.
mapped_type = _get_next_code_uid(cur_chk, soa_id)
cur_chk.execute(
"INSERT INTO code_association (soa_id, code_uid, codelist_table, codelist_code, code) VALUES (?,?,?,?,?)",
You can also share your feedback on Copilot code review. Take the survey.
src/soa_builder/web/utils.py
Outdated
| Assumes `cur` is a sqlite cursor within an open transaction. | ||
| """ | ||
| cur.execute( | ||
| "SELECT alias_code_uid from alias_code WHERE soa_id=? AND alias_code_uid LIKE 'AliasCode_%", |
There was a problem hiding this comment.
The SQL string literal for the LIKE clause is unterminated (missing the closing ' and % pattern terminator), which will cause a runtime SQL error as soon as get_next_alias_code_uid() is called. Update it to a valid pattern like alias_code_uid LIKE 'AliasCode_%'.
| "SELECT alias_code_uid from alias_code WHERE soa_id=? AND alias_code_uid LIKE 'AliasCode_%", | |
| "SELECT alias_code_uid from alias_code WHERE soa_id=? AND alias_code_uid LIKE 'AliasCode_%'", |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 11 comments.
Comments suppressed due to low confidence (2)
src/soa_builder/web/utils.py:372
get_next_alias_code_uid()is introduced here but does not appear to be used anywhere (alias-code UID generation is re-implemented in other modules). Consider either wiring this helper into the alias-code creation paths or removing it to avoid dead code / duplicated logic.
# Helper function to generate new alias_code_uid value
def get_next_alias_code_uid(cur: Any, soa_id: int) -> str:
"""Compute next unique AliasCode_ for the given SOA.
Assumes `cur` is a sqlite cursor within an open transaction.
"""
src/soa_builder/web/routers/activities.py:543
- Concept assignment now has new persistence side effects (upserting into
code/alias_code/biomedical_conceptand running orphan cleanup), but there are no tests asserting these new tables/cleanup behaviors. Adding coverage that verifies rows are created on assign and removed on unassign would help prevent regressions (and can avoid network calls by not executing background tasks).
_alias_uid = _upsert_alias_code(cur, soa_id, _upsert_code(cur, soa_id, ccode))
_upsert_biomedical_concept(cur, soa_id, concept_uid, title, _alias_uid)
background_tasks.add_task(_enrich_biomedical_concept_bg, ccode, soa_id)
background_tasks.add_task(_enrich_code_bg, ccode, soa_id)
You can also share your feedback on Copilot code review. Take the survey.
…oordinator thread with a ThreadPoolExecutor(max_workers=4) instead of one thread per row
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (7)
src/usdm/generate_study_timings.py:29
_get_timing_code_valuesis annotated as returningTuple[str, str, str, str], but it actually returns lists (code_code,code_decode, etc.). This is inconsistent with the implementation and how callers index into the result (e.g.,t_code[0]). Update the return type to a tuple of lists (or change the function to return scalars) to keep typing accurate and prevent misuse.
def _get_timing_code_values(soa_id: int, code_uid: str) -> Tuple[str, str, str, str]:
conn = _connect()
cur = conn.cursor()
cur.execute(
"SELECT DISTINCT c.codelist_table,d.code,d.cdisc_submission_value,d.dataset_date "
"FROM code_association c INNER JOIN ddf_terminology d ON c.codelist_code = d.codelist_code "
"AND c.code = d.code WHERE c.soa_id=? AND c.code_uid=?",
src/usdm/generate_study_timings.py:168
relativeToFrom.codeSystemis built as"db://" + rtf_codeSystem[0], butcodelist_tablevalues being inserted for timings are URLs likehttp://www.cdisc.org. This will produce invalid values such asdb://http://www.cdisc.orgin generated USDM. Make codeSystem construction consistent (either store plain table names incode_association.codelist_tableand prefix withdb://, or store full URIs and use them directly for bothtypeandrelativeToFrom).
"code": t_code[0],
"codeSystem": "http://www.cdisc.org",
"codeSystemVersion": t_codeSystemVersion[0],
"decode": t_decode[0],
"instanceType": "Code",
},
"value": value,
"valueLabel": value_label,
"relativeToFrom": {
"id": relative_to_from,
"extensionAttributes": [],
"code": rtf_code[0],
"codeSystem": "db://" + rtf_codeSystem[0],
"codeSystemVersion": rtf_codeSystemVersion[0],
src/soa_builder/web/routers/timings.py:620
- The logic that intentionally creates a fresh
Code_Nwhen a timing’s type/relative-to-from selection changes (and avoids creating a new code when unchanged) is behaviorally important and previously had dedicated tests. There’s no equivalent coverage in the current test suite, so regressions here would be easy to miss. Add/restore tests that assert (1) changing selection creates a newcode_associationrow and distinct code_uid, and (2) re-submitting the same selection does not create additional rows.
# Always create a fresh code_uid; do not reuse across timings
mapped_type = _get_next_code_uid(cur_chk, soa_id)
cur_chk.execute(
"INSERT INTO code_association (soa_id, code_uid, codelist_table, codelist_code, code) VALUES (?,?,?,?,?)",
(
soa_id,
mapped_type,
"http://www.cdisc.org",
"C201264",
str(new_code_val),
),
)
src/soa_builder/web/routers/arms.py:206
codelist_tableis being stored with adb://prefix (e.g.,db://protocol_terminology), but USDM generators buildcodeSystemas"db://" + codelist_table. That combination yieldsdb://db://protocol_terminologyin output. Store the raw source identifier incodelist_table(e.g.,protocol_terminology/ddf_terminology) and let generators add thedb://prefix, or update generators to avoid double-prefixing when the value already starts withdb://.
# Generate Code_{N} for type only if value selected
arm_type_value = (payload.type or "").strip()
arm_type = None
if arm_type_value:
arm_type = _get_next_code_uid(cur, soa_id)
logger.info("arm type: %s", arm_type)
arm_type_codelist_table = "db://protocol_terminology"
cur.execute(
"INSERT INTO code_association (soa_id, code_uid, codelist_table, codelist_code, code) VALUES (?,?,?,?,?)",
(
soa_id,
arm_type,
arm_type_codelist_table,
"C174222",
arm_type_value,
),
)
arm_data_origin_type_value = (payload.data_origin_type or "").strip()
arm_data_origin_type = None
if arm_data_origin_type_value:
arm_data_origin_type = _get_next_code_uid(cur, soa_id)
logger.info("arm dataOriginType: %s", arm_data_origin_type)
arm_data_origin_type_codelist_table = "db://ddf_terminology"
cur.execute(
"INSERT INTO code_association (soa_id, code_uid, codelist_table, codelist_code, code) VALUES (?,?,?,?,?)",
(
soa_id,
arm_data_origin_type,
arm_data_origin_type_codelist_table,
"C188727",
src/usdm/generate_arms.py:33
- This generator prefixes
codelist_tablewithdb://when emitting USDMcodeSystem, so it implicitly expectscode_association.codelist_tableto be a bare table/source name (e.g.,protocol_terminology). In this PR, some inserts use values likedb://protocol_terminologyandhttp://www.cdisc.org, which would yield invalidcodeSystemoutputs (db://db://...,db://http://...). Normalizecodelist_tablevalues at write time or make the generator skip prefixing when the value is already a URI/prefixed.
def _get_type_code_tuple(soa_id: int, code_uid: str) -> Tuple[str, str, str, str]:
conn = _connect()
cur = conn.cursor()
cur.execute(
"SELECT DISTINCT c.codelist_table, p.code,p.cdisc_submission_value,p.dataset_date "
"FROM code_association c INNER JOIN protocol_terminology p ON c.codelist_code = p.codelist_code "
"AND c.code = p.code WHERE c.soa_id=? AND c.code_uid=?",
(
soa_id,
code_uid,
),
src/soa_builder/web/routers/timings.py:311
- The UI timing create/update paths now persist
code_association.codelist_tableashttp://www.cdisc.org. Several USDM generators interpretcodelist_tableas an internal source name and prefix it withdb://, which will produce invalidcodeSystemvalues (db://http://www.cdisc.org). Align on a single convention for whatcodelist_tablecontains (raw table name vs full URI) and update either these inserts or the generators accordingly.
conn_c = _connect()
cur_c = conn_c.cursor()
code_uid = _get_next_code_uid(cur_c, soa_id)
cur_c.execute(
"INSERT INTO code_association (soa_id, code_uid, codelist_table, codelist_code, code) VALUES (?,?,?,?,?)",
(
soa_id,
code_uid,
"http://www.cdisc.org",
"C201264",
str(code_val),
),
)
conn_c.commit()
conn_c.close()
except Exception:
code_uid = None
# Map Relative To/From submission value to code_uid
rtf_code_uid: Optional[str] = None
rsv = (relative_to_from_submission_value or "").strip()
if rsv:
try:
sv_to_code_rtf = get_study_timing_type("C201265")
rtf_code_val = sv_to_code_rtf.get(rsv)
if rtf_code_val:
conn_c2 = _connect()
cur_c2 = conn_c2.cursor()
rtf_code_uid = _get_next_code_uid(cur_c2, soa_id)
cur_c2.execute(
"INSERT INTO code_association (soa_id, code_uid, codelist_table, codelist_code, code) VALUES (?,?,?,?,?)",
(
soa_id,
rtf_code_uid,
"http://www.cdisc.org",
"C201265",
str(rtf_code_val),
),
)
src/soa_builder/web/utils.py:385
get_next_alias_code_uidis newly added but currently unused in the codebase (no references outside this definition). Either wire it into the places that are hand-rolling AliasCode_N allocation (e.g., inapp.pybackground tasks) to reduce duplicated logic, or remove it until there’s a concrete caller to avoid carrying dead code.
# Helper function to generate new alias_code_uid value
def get_next_alias_code_uid(cur: Any, soa_id: int) -> str:
"""Compute next unique AliasCode_ for the given SOA.
Assumes `cur` is a sqlite cursor within an open transaction.
"""
cur.execute(
"SELECT alias_code_uid from alias_code WHERE soa_id=? AND alias_code_uid LIKE 'AliasCode_%'",
(soa_id,),
)
existing = [x[0] for x in cur.fetchall() if x[0]]
n = 1
if existing:
try:
n = max(int(x.split("_")[1]) for x in existing) + 1
except Exception:
n = len(existing) + 1
return f"AliasCode_{n}"
You can also share your feedback on Copilot code review. Take the survey.
| # Clear existing mappings; include soa_id if column exists | ||
| ac_has_soa = _table_has_columns(cur, "activity_concept", ("soa_id",)) | ||
| ac_has_actuid = _table_has_columns(cur, "activity_concept", ("activity_uid",)) | ||
| ac_has_conceptuid = _table_has_columns(cur, "activity_concept", ("concept_uid",)) | ||
| # Capture existing pairs before delete for cascade cleanup | ||
| if ac_has_soa: |
There was a problem hiding this comment.
POST /soa/{soa_id}/activities/{activity_id}/concepts is defined twice: once in app.py and again in routers/activities.py (same path + method). This creates ambiguous routing and conflicting request schemas/side effects. Consider removing or renaming one endpoint so there’s a single source of truth for this API path and its behavior.
Major functionality changes.
biomedical_concept,alias_code,code,biomedical_concept_audit,biomedical_concept_propertytables addedbiomedical_conceptkept in line withactivity_conceptalias_codeandcodefor each biomedical concept assigned to an activityactivity_conceptwith DSS values of BCbiomedical_concept_propertywith variables for DSSalias_code,codefor each biomedical concept property