-
Notifications
You must be signed in to change notification settings - Fork 38
Integrate MeasureDefScorer Part 1: Foundation for Version-Agnostic Scoring #859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
lukedegruchy
wants to merge
8
commits into
master
Choose a base branch
from
ld-20251208-integrate-new-measure-scorer
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…for counts This refactoring eliminates duplicate count calculations and establishes Def classes (MeasureDef, GroupDef, PopulationDef, StratumDef) as the authoritative source for all population counts in measure scoring. Key Changes: - Refactored R4MeasureReportScorer to read counts from Def classes instead of MeasureReport FHIR components - Added getPopulationCount() methods to GroupDef and StratumDef for cleaner count retrieval - Added helper methods to StratumDef: getStratumPopulation() and getPopulationCount() - Simplified ContinuousVariableObservationConverter interface to only handle conversion to FHIR Quantities - Removed ContinuousVariableObservationConverter parameter from entire ContinuousVariableObservationHandler call chain (no longer needed) - Added convertCqlResultToQuantityDef() method to handle CQL result conversion to QuantityDef - Updated exception types in convertCqlResultToQuantityDef() to use FHIR InvalidRequestException - Refactored MeasureScorerTest to use record classes instead of nested HashMaps - Fixed stratum scoring by properly handling CRITERIA-type stratifiers Benefits: - Eliminates redundant count calculations between Def classes and MeasureReport - Improves FHIR version independence (R4, DSTU3, R5 can share common logic) - Better separation of concerns: Def classes own count logic, converters only handle FHIR serialization - Improved code maintainability and testability All 931 tests passing in cqf-fhir-cr module. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
…-scorer-count-and-quantity-def-refactoring
Create a FHIR-version-agnostic measure scorer that uses Def classes as
the single source of truth for both data and computed scores, eliminating
the need for version-specific scoring logic duplication.
Architecture changes:
- Rename MeasureReportScorer interface to IMeasureReportScorer
- Create new MeasureDefScorer class (602 lines) with version-agnostic
scoring logic that mutates Def objects instead of returning scores
- Convert StratumDef from record to class to support mutable score field
- Add score state (Double) with getters/setters to GroupDef and StratumDef
Key features:
- Def-first iteration pattern: uses MeasureDef.groups() and
StratifierDef.getStratum() instead of FHIR object iteration
- Mutation-based design: scoreGroup() is void and sets scores directly
on GroupDef/StratumDef objects via setScore()
- Enhanced switch expressions with early returns in aggregate() method
- GroupDef.getMeasureScore() handles improvement notation adjustment
("increase" returns score as-is, "decrease" returns 1 - score)
- Population basis support: correctly handles boolean (count subjects)
vs non-boolean (count resources) scoring differences
Comprehensive test coverage (784 lines, 16 tests):
- Proportion, ratio, and continuous variable scoring
- Stratifier scoring with multiple strata
- Zero denominator edge cases
- All aggregate methods (SUM, AVG, MIN, MAX, MEDIAN, COUNT)
- Improvement notation adjustment (increase/decrease)
- Population basis impact (boolean vs Encounter counts)
- Null, zero, and negative score handling
Population basis is critical: boolean basis counts unique subjects
(2/3 = 0.667) while Encounter basis counts all resources (5/9 = 0.556)
for the same population data.
Files changed:
- New: IMeasureReportScorer.java (renamed interface)
- New: MeasureDefScorer.java (version-agnostic scorer)
- New: MeasureDefScorerTest.java (comprehensive unit tests)
- New: PRPs/version-agnostic-measure-report-scorer.md (PRP documentation)
- Modified: GroupDef.java (+54 lines: score field, getMeasureScore)
- Modified: StratumDef.java (+48 lines: record to class, score field)
- Modified: BaseMeasureReportScorer.java (interface rename)
- Modified: R4MeasureReportBuilder.java (interface rename)
- Modified: Dstu3MeasureReportBuilder.java (interface rename)
- Deleted: MeasureReportScorer.java (renamed to IMeasureReportScorer)
All 328 measure tests passing. No regressions.
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
…Def class design This commit completes comprehensive test coverage for MeasureDefScorer and addresses implementation gaps discovered through test-driven development. Implementation Changes: - Add scoreRatioMeasureObservationGroup() method for group-level ratio with observations scoring (lines 153-199) - Fix critical stratum filtering bug in getResultsForStratum() - now correctly filters at Map.Entry level (subject ID) instead of observation Map keys - Implement 4 new comprehensive tests for ratio with observations, cohort measures, and edge cases Def Class Enhancements: - Replace String ID with PopulationDef reference in StratumPopulationDef for better type safety and elimination of ID-based lookups - Rename GroupDef.get() to getPopulationDefs() for clarity - Add GroupDef.getFirstWithId() helper method for criteriaReference matching - Enhance PopulationDef.toString() with comprehensive debugging information - Update StratumPopulationDef.toString() to use PopulationDef's toString() Test Coverage: - testScoreGroup_RatioWithObservations_GroupLevel: Group-level ratio scoring - testScoreStratifier_RatioWithObservations_StratumLevel: Stratum-level ratio scoring with gender stratification - testScoreGroup_CohortMeasure_NoScoreSet: Verify cohort measures return null - testScoreGroup_MissingScoringType_ThrowsException: Validation error handling - Remove unused createMeasurePopulationConcept(type, criteria) helper method Documentation: - Add "Implementation Outcomes and Lessons Learned" section to version-agnostic PRP documenting actual implementation results - Update measure-def-scorer-test-coverage-enhancement.md to reflect completed work - Document data structure insights and R4 pattern reuse All Tests: 954 passing (20 in MeasureDefScorerTest), 0 failures, 0 errors This work represents Phase 1 completion of the version-agnostic measure scorer implementation. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
…-agnostic-measure-report-scorer
This commit completes the implementation of a version-agnostic test framework for capturing and asserting on MeasureDef state across FHIR versions (DSTU3, R4). ## Core Infrastructure (Phase 1) ### Def Snapshot API - Add DefCaptureCallback interface for capturing MeasureDef snapshots - Implement createSnapshot() methods on all Def classes: - MeasureDef, GroupDef, PopulationDef, StratifierDef, StratumDef - StratifierComponentDef, SdeDef - Deep copy collections, share FHIR resource references - Add defCaptureCallback field to MeasureEvaluationOptions ### PopulationDef Refactoring - Add populationBasis field to PopulationDef (moved from GroupDef dependency) - Add getPopulationBasis() and isBooleanBasis() methods with Javadoc - Streamline getCount() to use own populationBasis (no GroupDef parameter) - Update createSnapshot() to include populationBasis field ### Processor Integration - Add callback invocation to R4MeasureProcessor (after processResults) - Add callback invocation to Dstu3MeasureProcessor (after processResults) - Update MeasureDefBuilder classes to pass populationBasis to PopulationDef ## Version-Agnostic Test Framework (Phases 2-6) ### Test Infrastructure (21 new files in fhir2deftest/) - Fhir2DefUnifiedMeasureTestHandler: Unified Given/When/Then DSL - FhirVersionTestContext: Auto-detect FHIR version, factory for adapters - MeasureServiceAdapter interface: Version-agnostic evaluation API - Request/Response wrappers: MeasureEvaluationRequest, MeasureReportAdapter ### Version-Specific Adapters - R4MeasureServiceAdapter: R4 single/multi-measure support - R4MeasureReportAdapter, R4MultiMeasureReportAdapter - Dstu3MeasureServiceAdapter: DSTU3 single-measure support - Dstu3MeasureReportAdapter ### Fluent Assertion Classes (6 classes) - Selected: Base class for fluent navigation - SelectedDef: MeasureDef assertions and navigation - SelectedDefGroup: GroupDef assertions, population/stratifier navigation - SelectedDefPopulation: Subject/resource assertions, count validation - SelectedDefStratifier: Stratum navigation and assertions - SelectedDefStratum: Stratum population navigation - SelectedDefStratumPopulation: Stratum-level count assertions ### Repository Configuration - Add in-memory filtering to Fhir2DefUnifiedMeasureTestHandler: - SEARCH_FILTER_MODE.FILTER_IN_MEMORY - TERMINOLOGY_FILTER_MODE.FILTER_IN_MEMORY - VALUESET_EXPANSION_MODE.PERFORM_NAIVE_EXPANSION - Matches DSTU3/R4 Measure class configuration ### Integration Tests - ContinuousVariableResourceMeasureObservationFhir2DefTest (R4) - Tests continuous variable measure with MEASUREOBSERVATION - Validates Def capture, population counts, stratifiers - Dstu3MeasureAdditionalDataFhir2DefTest (DSTU3) - Tests measure evaluation with additional data Bundle - Validates Def capture and population counts ## Enhanced Test Coverage ### PopulationDefTest (10 tests, was 3) - Add isBooleanBasis() assertions to all existing tests - Add 7 new tests covering different population basis types: - testIsBooleanBasis_WithBooleanBasis/WithNonBooleanBasis - testGetCount_BooleanBasis_CountsUniqueSubjects - testGetCount_EncounterBasis_CountsAllResources - testGetCount_StringBasis_CountsAllResources - testGetCount_DateBasis_CountsAllResources - testGetCount_MeasureObservation_CountsObservations ### MeasureDefScorerTest - Fix population basis types to match master (source of truth): - testScoreGroup_SetsScoreOnGroupDef: boolean -> Encounter - testScoreGroup_ProportionWithExclusions: boolean -> String - Ensure same CodeDef instance used for PopulationDef and GroupDef ## Code Cleanup ### SelectedDef Classes - Remove Arrays.asList() wrappers, iterate directly over varargs - Fix SelectedDefPopulation.hasType() to compare with MeasurePopulationType.toCode() - Clean up unused imports ### Test Files - Remove 6 framework-specific tests from Dstu3MeasureAdditionalDataFhir2DefTest - Add Map import to PopulationDefTest, use non-qualified references ### Javadoc - Add comprehensive Javadoc for PopulationDef.getPopulationBasis() - Add comprehensive Javadoc for PopulationDef.isBooleanBasis() - All existing Javadoc verified for accuracy ## Testing - All 974 tests passing (13 skipped) - R4 and DSTU3 Def capture working correctly - Unified DSL working for both FHIR versions ## Documentation - Update version-agnostic-def-capture-framework.md PRP: - Mark all phases as COMPLETE ✅ - Add Implementation Summary section - Document final statistics and achievements - Note deviations from original plan 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
…oring This commit implements Part 1 of the MeasureDefScorer integration plan, establishing foundation components for version-agnostic measure scoring and creating a new external API for the cdr-cr project. No behavioral changes to existing measure evaluation workflows. ## Part 1 Scope: Foundation & Refactoring (No Behavioral Changes) This is Part 1 of a 2-part integration. Part 1 creates infrastructure without changing existing behavior. Part 2 will integrate MeasureDefScorer into the evaluation workflow. ### Key Constraint Old R4/DSTU3 scorers remain active and unchanged during regular measure evaluation. New scoring infrastructure only activated via MeasureReportScoringFhirAdapter API. ## Changes Implemented ### 1. StratumDef Enhancement - Added getMeasureScore() method for applying improvement notation to stratifiers - Added createSnapshot() method for creating immutable copies with score state - Stratifiers inherit improvement notation from parent group (per FHIR spec) ### 2. MeasureDefScorer Refinements - Added COHORT and COMPOSITE measure scoring support (returns null scores per spec) - Enhanced javadoc and method documentation - All scoring logic operates on Def classes (version-agnostic) ### 3. Score Copying Infrastructure in Builders (Inactive in Part 1) - **R4MeasureReportBuilder**: Added copyScoresFromDef() method - Smart matching: positional for single groups/stratifiers, ID-based for multiple - Safe FHIR access patterns prevent side effects - Currently inactive (Def objects have null scores in Part 1) - **Dstu3MeasureReportBuilder**: Added copyScoresFromDef() method - Same smart matching and safe access patterns as R4 - Currently inactive (Def objects have null scores in Part 1) ### 4. External API for cdr-cr Project (NEW) Created MeasureReportScoringFhirAdapter - version-agnostic post-hoc scoring API: - **MeasureReportScoringFhirAdapter**: Static factory entry point - Auto-detects FHIR version, routes to appropriate adapter - Validates measure/report compatibility - **IMeasureReportScoringFhirAdapter**: Version-agnostic interface - Default workflow: build MeasureDef → populate counts → score → copy back - Abstract methods for version-specific FHIR operations - **R4MeasureReportScoringFhirAdapter**: R4 implementation - Builds MeasureDef from Measure structure - Populates counts from MeasureReport - Uses MeasureDefScorer for version-agnostic scoring - Copies scores back to MeasureReport - **Dstu3MeasureReportScoringFhirAdapter**: DSTU3 implementation - Same workflow as R4, adapted for DSTU3 types ### 5. Deprecation Documentation - Added javadoc deprecation notices to IMeasureReportScorer interface - Documented migration path for external consumers - Old scorers remain fully functional (no runtime warnings) ### 6. Comprehensive Testing - **MeasureReportScoringFhirAdapterTest**: Tests external API - R4 and DSTU3 proportion measure scoring - Actual score assertions (0.75, 0.21, 0.80, 0.60) - Version mismatch detection - Null validation - **Dstu3MeasureReportBuilderTest**: Tests score copying infrastructure (DSTU3) - Verifies copyScoresFromDef() integration point - Currently no scores to copy (Part 1 foundation only) - **R4MeasureReportBuilderTest**: Tests score copying infrastructure (R4) - Verifies copyScoresFromDef() integration point - Currently no scores to copy (Part 1 foundation only) ## Architecture Benefits 1. **External API Ready**: cdr-cr project can immediately use MeasureReportScoringFhirAdapter 2. **Zero Risk**: No changes to existing evaluation behavior 3. **Version Independence**: New API works with R4 and DSTU3, ready for R5 4. **Safe Refactoring**: Score copying infrastructure in place for Part 2 activation 5. **Backward Compatible**: Old scorers continue working exactly as before ## Part 2 Preview Part 2 will: - Activate MeasureDefScorer in MeasureEvaluationResultHandler - Remove old scorer calls from builders (switch to copyScoresFromDef) - Add Fhir2Def integration tests - Complete the transition to version-agnostic scoring ## Testing - All 987 tests passing (0 failures, 0 errors, 13 skipped) - Added 513 lines of new test coverage - Comprehensive unit tests for new external API - Score copying infrastructure validated (ready for Part 2) ## Statistics - 19 files changed: +5092 insertions, -19 deletions - 3 PRP documents (master plan + Part 1 + Part 2) - No breaking changes to public APIs - All existing tests maintained and passing 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
|
Formatting check succeeded! |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Merge Request Description
Summary
This MR implements Part 1 of a 2-part plan to integrate version-agnostic measure scoring into the clinical-reasoning evaluation workflow. Part 1 focuses on creating foundation infrastructure and a new external API for post-hoc measure scoring without modifying existing evaluation behavior. The primary deliverable is
MeasureReportScoringFhirAdapter, a version-agnostic API that enables external consumers (specifically the cdr-cr project) to score MeasureReports that already have population counts but lack calculated scores.New External API: Created
MeasureReportScoringFhirAdapteras a simplified, version-agnostic entry point for post-hoc MeasureReport scoring, eliminating the need for external consumers to manage version-specific scorers and builders.Score Copying Infrastructure: Added
copyScoresFromDef()methods to R4 and DSTU3 MeasureReport builders that copy scores from Def objects to FHIR MeasureReports. This infrastructure is currently inactive (Def objects have null scores in Part 1) and will be activated in Part 2 when MeasureDefScorer is integrated into the evaluation workflow.Def Class Enhancements: Enhanced StratumDef with
getMeasureScore()for improvement notation andcreateSnapshot()for immutable copies, plus addedcreateSnapshot()to MeasureDef and other Def classes to support the Def capture framework.Version-Agnostic Testing Framework: Implemented a comprehensive fhir2deftest package with SelectedDef assertion APIs and adapters that enable writing measure evaluation tests once and running them against multiple FHIR versions (DSTU3, R4, future R5).
Deprecation Documentation: Added javadoc notices to
IMeasureReportScorerdocumenting the migration path to the new API while keeping all existing scorer functionality unchanged.Code Review Suggestions
External API Design: Review
MeasureReportScoringFhirAdapterandIMeasureReportScoringFhirAdapterinterface design for usability by external consumers. Verify the static factory pattern and version detection logic align with clinical-reasoning API conventions.Score Copying Logic Correctness: In
R4MeasureReportBuilder.copyScoresFromDef()andDstu3MeasureReportBuilder.copyScoresFromDef(), verify the smart matching logic (positional for single groups/stratifiers, ID-based for multiple) correctly handles edge cases like missing IDs, mismatched group counts, and component stratifiers.FHIR Resource Mutation Safety: The score copying code uses patterns like
if (!group.hasId())to avoid mutating FHIR resources when checking for IDs. Verify these safe access patterns are consistently applied throughout the copying logic and no inadvertent side effects (like creating empty Coding objects via.getCodingFirstRep()) occur.Part 1 Isolation Guarantee: Verify that score copying infrastructure remains truly inactive in Part 1. Check that
copyScoresFromDef()is called but does nothing because Def objects have null scores, and old scorers remain the sole source of scores in regular evaluation workflows.Def Snapshot Semantics: Review the
createSnapshot()methods in GroupDef, StratumDef, and MeasureDef to ensure they create proper immutable copies. Verify that mutable score fields are copied correctly and defensive copying of collections is complete.Testing Framework Architecture: Examine the fhir2deftest package design (Fhir2DefUnifiedMeasureTestHandler, FhirVersionTestContext, MeasureServiceAdapter) for potential coupling issues or abstraction leaks that could make multi-version testing fragile.
Migration Impact on cdr-cr: Consider whether the API changes to
IMeasureReportScorer(javadoc deprecation only, no functional changes) adequately guide cdr-cr developers to the newMeasureReportScoringFhirAdapterAPI without breaking their existing code.DefCaptureCallback Integration: Review how
DefCaptureCallbackis integrated into R4MeasureProcessor and Dstu3MeasureProcessor. Verify callback invocation happens at the correct point (after evaluation, before builder) and the snapshot semantics are correct.QA Test Suggestions
Setup
Test Cases
External API Basic Usage (R4): Using
MeasureReportScoringFhirAdapter.score(measure, measureReport)with an R4 proportion measure, verify:External API Basic Usage (DSTU3): Repeat the R4 test with DSTU3 resources to verify version detection and routing work correctly.
External API Version Mismatch Detection: Attempt to score a MeasureReport with a Measure of a different FHIR version (e.g., R4 Measure with DSTU3 MeasureReport). Verify appropriate exception is thrown with clear error message.
External API Null Input Handling: Test
MeasureReportScoringFhirAdapter.score()with null Measure and null MeasureReport inputs. Verify appropriate exceptions with clear messages.Regular Evaluation Workflow Unchanged: Run existing measure evaluation tests (e.g.,
$evaluate-measureoperation) and verify:Multiple Groups Score Copying: Create a measure with multiple groups, evaluate it (or create a pre-populated MeasureReport), and verify scores are correctly matched to groups by ID.
Stratified Measure Score Copying: Create a measure with stratifiers, evaluate it, and verify:
Component Stratifier Handling: Test measures with multi-component stratifiers (e.g., gender + age group) to verify the matching logic correctly identifies strata by value combinations.
Def Capture Callback Integration: If tests use
MeasureEvaluationOptions.defCaptureCallback, verify:COHORT and COMPOSITE Measure Scoring: Test measures with COHORT and COMPOSITE scoring types using the external API. Verify appropriate handling (null scores per FHIR spec, or specific behavior if implemented).
Improvement Notation Application: Test measures with "decrease" improvement notation and verify scores are correctly inverted at the group level (once Part 2 activates).