Skip to content

Biomedical concept data tables#145

Merged
pendingintent merged 21 commits intomasterfrom
bc-data-tables
Mar 4, 2026
Merged

Biomedical concept data tables#145
pendingintent merged 21 commits intomasterfrom
bc-data-tables

Conversation

@pendingintent
Copy link
Owner

Major functionality changes.

  • biomedical_concept, alias_code, code, biomedical_concept_audit, biomedical_concept_property tables added
  • biomedical_concept kept in line with activity_concept
  • record added to alias_code and code for each biomedical concept assigned to an activity
  • background async process populates activity_concept with DSS values of BC
  • background async process populates biomedical_concept_property with variables for DSS
  • background async process populates alias_code, code for each biomedical concept property
  • database migrate functions to ensure database tables exist in proper definition
  • one time scripts to populate missing BC and BC property values

@pendingintent pendingintent added this to the v1.0 milestone Mar 4, 2026
@pendingintent pendingintent self-assigned this Mar 4, 2026
@pendingintent pendingintent added the enhancement New feature or request label Mar 4, 2026
Copilot AI review requested due to automatic review settings March 4, 2026 16:03
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds new biomedical concept & property persistence (plus audit/backfill/enrichment) and refactors code usage to a new code_association table across USDM generation + web routes.

Changes:

  • Introduces biomedical_concept, biomedical_concept_property, alias_code, and related audit/backfill/enrichment flows (including one-time scripts).
  • Replaces many usages of legacy code rows for SOA object mappings with code_association (and updates USDM exporters accordingly).
  • Tightens DB path resolution for test isolation and adds new migrations.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/test_timings_code_junction.py Removes timing/code junction behavior tests (no replacement in this PR).
tests/test_code_uid_generation.py Updates UID-generation tests to insert into code_association.
src/usdm/generate_study_timings.py Switches timing terminology lookups to code_association and adjusts exported Code fields.
src/usdm/generate_study_epochs.py Switches epoch type join from code to code_association.
src/usdm/generate_encounters.py Switches encounter code lookups from code to code_association.
src/usdm/generate_arms.py Switches arm terminology lookups from code to code_association.
src/soa_builder/web/utils.py Adds alias-code UID generator; switches code UID generator to code_association.
src/soa_builder/web/routers/visits.py Uses code_association for visit code mappings and updates some codelist metadata values.
src/soa_builder/web/routers/timings.py Uses code_association for timing code mappings and inserts.
src/soa_builder/web/routers/schedule_timelines.py Reads timing code mappings from code_association.
src/soa_builder/web/routers/epochs.py Uses code_association for epoch type mappings and updates comment.
src/soa_builder/web/routers/arms.py Uses code_association for arm type/data-origin mappings and updates insert/update SQL.
src/soa_builder/web/routers/activities.py Adds background enrichment/property population and orphan cleanup when setting concepts.
src/soa_builder/web/migrate_database.py Removes code_junction migration; adds biomedical concept audit + backfill + property UID migration.
src/soa_builder/web/initialize_database.py Adds new biomedical concept/code tables and introduces code_association.
src/soa_builder/web/db.py Resolves DB path dynamically at connection time; prevents tests from hitting prod DB.
src/soa_builder/web/audit.py Adds biomedical concept audit recording helper.
src/soa_builder/web/app.py Wires new migrations; adds enrichment on startup; adds concept/property upsert + orphan cleanup helpers.
scripts/populate_biomedical_concept_property.py New one-time script to populate BC properties from DSS variables.
scripts/populate_biomedical_concept.py New one-time script to populate biomedical_concept from activity_concept.
scripts/enrich_biomedical_concept.py New one-time script to enrich BC label/description from CDISC API.
Comments suppressed due to low confidence (3)

src/soa_builder/web/app.py:1845

  • Spawning one thread per unenriched row can create severe contention and resource spikes (and may accidentally trigger a large number of outbound API calls on startup). Consider using a bounded executor (e.g., ThreadPoolExecutor(max_workers=...)) or a queue/worker pattern, and gating the startup enrichment behind a config flag and/or presence of credentials to avoid unexpected network calls during local dev/test runs.
        for _concept_code, _soa_id in _unenriched:
            threading.Thread(
                target=_enrich_code_bg,
                args=(_concept_code, _soa_id),
                daemon=True,
            ).start()

src/soa_builder/web/initialize_database.py:324

  • The runtime upsert logic treats (soa_id, standard_code) as a 1:1 mapping (e.g., _upsert_alias_code selects by soa_id + standard_code and expects a single row), but the schema only enforces uniqueness on (soa_id, alias_code_uid). This allows multiple aliases to point to the same standard_code, making reads ambiguous (fetchone()). Consider adding a unique constraint or index on (soa_id, standard_code) (allowing NULLs if desired).
        """CREATE TABLE IF NOT EXISTS alias_code (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            soa_id INT NOT NULL,
            alias_code_uid TEXT NOT NULL,
            standard_code TEXT,                      -- reference to code_uid value
            UNIQUE(alias_code_uid, soa_id)
        )"""

src/soa_builder/web/routers/timings.py:612

  • The PR removes tests/test_timings_code_junction.py, but this timing update path still contains nuanced behavior (creating a new Code_N on change, not creating one on unchanged updates). Consider reintroducing equivalent coverage targeting code_association (and any new uniqueness rules) to prevent regressions in timing code UID creation and update semantics.
                    mapped_type = _get_next_code_uid(cur_chk, soa_id)
                    cur_chk.execute(
                        "INSERT INTO code_association (soa_id, code_uid, codelist_table, codelist_code, code) VALUES (?,?,?,?,?)",

You can also share your feedback on Copilot code review. Take the survey.

Assumes `cur` is a sqlite cursor within an open transaction.
"""
cur.execute(
"SELECT alias_code_uid from alias_code WHERE soa_id=? AND alias_code_uid LIKE 'AliasCode_%",
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SQL string literal for the LIKE clause is unterminated (missing the closing ' and % pattern terminator), which will cause a runtime SQL error as soon as get_next_alias_code_uid() is called. Update it to a valid pattern like alias_code_uid LIKE 'AliasCode_%'.

Suggested change
"SELECT alias_code_uid from alias_code WHERE soa_id=? AND alias_code_uid LIKE 'AliasCode_%",
"SELECT alias_code_uid from alias_code WHERE soa_id=? AND alias_code_uid LIKE 'AliasCode_%'",

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 11 comments.

Comments suppressed due to low confidence (2)

src/soa_builder/web/utils.py:372

  • get_next_alias_code_uid() is introduced here but does not appear to be used anywhere (alias-code UID generation is re-implemented in other modules). Consider either wiring this helper into the alias-code creation paths or removing it to avoid dead code / duplicated logic.
# Helper function to generate new alias_code_uid value
def get_next_alias_code_uid(cur: Any, soa_id: int) -> str:
    """Compute next unique AliasCode_ for the given SOA.
    Assumes `cur` is a sqlite cursor within an open transaction.
    """

src/soa_builder/web/routers/activities.py:543

  • Concept assignment now has new persistence side effects (upserting into code/alias_code/biomedical_concept and running orphan cleanup), but there are no tests asserting these new tables/cleanup behaviors. Adding coverage that verifies rows are created on assign and removed on unassign would help prevent regressions (and can avoid network calls by not executing background tasks).
        _alias_uid = _upsert_alias_code(cur, soa_id, _upsert_code(cur, soa_id, ccode))
        _upsert_biomedical_concept(cur, soa_id, concept_uid, title, _alias_uid)
        background_tasks.add_task(_enrich_biomedical_concept_bg, ccode, soa_id)
        background_tasks.add_task(_enrich_code_bg, ccode, soa_id)

You can also share your feedback on Copilot code review. Take the survey.

…oordinator thread with a ThreadPoolExecutor(max_workers=4) instead of one thread per row
Copilot AI review requested due to automatic review settings March 4, 2026 19:44
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (7)

src/usdm/generate_study_timings.py:29

  • _get_timing_code_values is annotated as returning Tuple[str, str, str, str], but it actually returns lists (code_code, code_decode, etc.). This is inconsistent with the implementation and how callers index into the result (e.g., t_code[0]). Update the return type to a tuple of lists (or change the function to return scalars) to keep typing accurate and prevent misuse.
def _get_timing_code_values(soa_id: int, code_uid: str) -> Tuple[str, str, str, str]:
    conn = _connect()
    cur = conn.cursor()
    cur.execute(
        "SELECT DISTINCT c.codelist_table,d.code,d.cdisc_submission_value,d.dataset_date "
        "FROM code_association c INNER JOIN ddf_terminology d ON c.codelist_code = d.codelist_code "
        "AND c.code = d.code WHERE c.soa_id=? AND c.code_uid=?",

src/usdm/generate_study_timings.py:168

  • relativeToFrom.codeSystem is built as "db://" + rtf_codeSystem[0], but codelist_table values being inserted for timings are URLs like http://www.cdisc.org. This will produce invalid values such as db://http://www.cdisc.org in generated USDM. Make codeSystem construction consistent (either store plain table names in code_association.codelist_table and prefix with db://, or store full URIs and use them directly for both type and relativeToFrom).
                "code": t_code[0],
                "codeSystem": "http://www.cdisc.org",
                "codeSystemVersion": t_codeSystemVersion[0],
                "decode": t_decode[0],
                "instanceType": "Code",
            },
            "value": value,
            "valueLabel": value_label,
            "relativeToFrom": {
                "id": relative_to_from,
                "extensionAttributes": [],
                "code": rtf_code[0],
                "codeSystem": "db://" + rtf_codeSystem[0],
                "codeSystemVersion": rtf_codeSystemVersion[0],

src/soa_builder/web/routers/timings.py:620

  • The logic that intentionally creates a fresh Code_N when a timing’s type/relative-to-from selection changes (and avoids creating a new code when unchanged) is behaviorally important and previously had dedicated tests. There’s no equivalent coverage in the current test suite, so regressions here would be easy to miss. Add/restore tests that assert (1) changing selection creates a new code_association row and distinct code_uid, and (2) re-submitting the same selection does not create additional rows.
                    # Always create a fresh code_uid; do not reuse across timings
                    mapped_type = _get_next_code_uid(cur_chk, soa_id)
                    cur_chk.execute(
                        "INSERT INTO code_association (soa_id, code_uid, codelist_table, codelist_code, code) VALUES (?,?,?,?,?)",
                        (
                            soa_id,
                            mapped_type,
                            "http://www.cdisc.org",
                            "C201264",
                            str(new_code_val),
                        ),
                    )

src/soa_builder/web/routers/arms.py:206

  • codelist_table is being stored with a db:// prefix (e.g., db://protocol_terminology), but USDM generators build codeSystem as "db://" + codelist_table. That combination yields db://db://protocol_terminology in output. Store the raw source identifier in codelist_table (e.g., protocol_terminology / ddf_terminology) and let generators add the db:// prefix, or update generators to avoid double-prefixing when the value already starts with db://.
    # Generate Code_{N} for type only if value selected
    arm_type_value = (payload.type or "").strip()
    arm_type = None
    if arm_type_value:
        arm_type = _get_next_code_uid(cur, soa_id)
        logger.info("arm type: %s", arm_type)
        arm_type_codelist_table = "db://protocol_terminology"
        cur.execute(
            "INSERT INTO code_association (soa_id, code_uid, codelist_table, codelist_code, code) VALUES (?,?,?,?,?)",
            (
                soa_id,
                arm_type,
                arm_type_codelist_table,
                "C174222",
                arm_type_value,
            ),
        )

    arm_data_origin_type_value = (payload.data_origin_type or "").strip()
    arm_data_origin_type = None
    if arm_data_origin_type_value:
        arm_data_origin_type = _get_next_code_uid(cur, soa_id)
        logger.info("arm dataOriginType: %s", arm_data_origin_type)
        arm_data_origin_type_codelist_table = "db://ddf_terminology"
        cur.execute(
            "INSERT INTO code_association (soa_id, code_uid, codelist_table, codelist_code, code) VALUES (?,?,?,?,?)",
            (
                soa_id,
                arm_data_origin_type,
                arm_data_origin_type_codelist_table,
                "C188727",

src/usdm/generate_arms.py:33

  • This generator prefixes codelist_table with db:// when emitting USDM codeSystem, so it implicitly expects code_association.codelist_table to be a bare table/source name (e.g., protocol_terminology). In this PR, some inserts use values like db://protocol_terminology and http://www.cdisc.org, which would yield invalid codeSystem outputs (db://db://..., db://http://...). Normalize codelist_table values at write time or make the generator skip prefixing when the value is already a URI/prefixed.
def _get_type_code_tuple(soa_id: int, code_uid: str) -> Tuple[str, str, str, str]:
    conn = _connect()
    cur = conn.cursor()
    cur.execute(
        "SELECT DISTINCT c.codelist_table, p.code,p.cdisc_submission_value,p.dataset_date "
        "FROM code_association c INNER JOIN protocol_terminology p ON c.codelist_code = p.codelist_code "
        "AND c.code = p.code WHERE c.soa_id=? AND c.code_uid=?",
        (
            soa_id,
            code_uid,
        ),

src/soa_builder/web/routers/timings.py:311

  • The UI timing create/update paths now persist code_association.codelist_table as http://www.cdisc.org. Several USDM generators interpret codelist_table as an internal source name and prefix it with db://, which will produce invalid codeSystem values (db://http://www.cdisc.org). Align on a single convention for what codelist_table contains (raw table name vs full URI) and update either these inserts or the generators accordingly.
                conn_c = _connect()
                cur_c = conn_c.cursor()
                code_uid = _get_next_code_uid(cur_c, soa_id)
                cur_c.execute(
                    "INSERT INTO code_association (soa_id, code_uid, codelist_table, codelist_code, code) VALUES (?,?,?,?,?)",
                    (
                        soa_id,
                        code_uid,
                        "http://www.cdisc.org",
                        "C201264",
                        str(code_val),
                    ),
                )
                conn_c.commit()
                conn_c.close()
        except Exception:
            code_uid = None
    # Map Relative To/From submission value to code_uid
    rtf_code_uid: Optional[str] = None
    rsv = (relative_to_from_submission_value or "").strip()
    if rsv:
        try:
            sv_to_code_rtf = get_study_timing_type("C201265")
            rtf_code_val = sv_to_code_rtf.get(rsv)
            if rtf_code_val:
                conn_c2 = _connect()
                cur_c2 = conn_c2.cursor()
                rtf_code_uid = _get_next_code_uid(cur_c2, soa_id)
                cur_c2.execute(
                    "INSERT INTO code_association (soa_id, code_uid, codelist_table, codelist_code, code) VALUES (?,?,?,?,?)",
                    (
                        soa_id,
                        rtf_code_uid,
                        "http://www.cdisc.org",
                        "C201265",
                        str(rtf_code_val),
                    ),
                )

src/soa_builder/web/utils.py:385

  • get_next_alias_code_uid is newly added but currently unused in the codebase (no references outside this definition). Either wire it into the places that are hand-rolling AliasCode_N allocation (e.g., in app.py background tasks) to reduce duplicated logic, or remove it until there’s a concrete caller to avoid carrying dead code.
# Helper function to generate new alias_code_uid value
def get_next_alias_code_uid(cur: Any, soa_id: int) -> str:
    """Compute next unique AliasCode_ for the given SOA.
    Assumes `cur` is a sqlite cursor within an open transaction.
    """
    cur.execute(
        "SELECT alias_code_uid from alias_code WHERE soa_id=? AND alias_code_uid LIKE 'AliasCode_%'",
        (soa_id,),
    )
    existing = [x[0] for x in cur.fetchall() if x[0]]
    n = 1
    if existing:
        try:
            n = max(int(x.split("_")[1]) for x in existing) + 1
        except Exception:
            n = len(existing) + 1
    return f"AliasCode_{n}"


You can also share your feedback on Copilot code review. Take the survey.

Comment on lines 2285 to +2290
# Clear existing mappings; include soa_id if column exists
ac_has_soa = _table_has_columns(cur, "activity_concept", ("soa_id",))
ac_has_actuid = _table_has_columns(cur, "activity_concept", ("activity_uid",))
ac_has_conceptuid = _table_has_columns(cur, "activity_concept", ("concept_uid",))
# Capture existing pairs before delete for cascade cleanup
if ac_has_soa:
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

POST /soa/{soa_id}/activities/{activity_id}/concepts is defined twice: once in app.py and again in routers/activities.py (same path + method). This creates ambiguous routing and conflicting request schemas/side effects. Consider removing or renaming one endpoint so there’s a single source of truth for this API path and its behavior.

Copilot uses AI. Check for mistakes.
@pendingintent pendingintent merged commit e8b6d16 into master Mar 4, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants