Skip to content

Conversation

@lukedegruchy
Copy link
Contributor

@lukedegruchy lukedegruchy commented Dec 10, 2025

Merge Request Description

Summary

This MR implements Part 1 of a 2-part plan to integrate version-agnostic measure scoring into the clinical-reasoning evaluation workflow. Part 1 focuses on creating foundation infrastructure and a new external API for post-hoc measure scoring without modifying existing evaluation behavior. The primary deliverable is MeasureReportScoringFhirAdapter, a version-agnostic API that enables external consumers (specifically the cdr-cr project) to score MeasureReports that already have population counts but lack calculated scores.

  1. New External API: Created MeasureReportScoringFhirAdapter as a simplified, version-agnostic entry point for post-hoc MeasureReport scoring, eliminating the need for external consumers to manage version-specific scorers and builders.

  2. Score Copying Infrastructure: Added copyScoresFromDef() methods to R4 and DSTU3 MeasureReport builders that copy scores from Def objects to FHIR MeasureReports. This infrastructure is currently inactive (Def objects have null scores in Part 1) and will be activated in Part 2 when MeasureDefScorer is integrated into the evaluation workflow.

  3. Def Class Enhancements: Enhanced StratumDef with getMeasureScore() for improvement notation and createSnapshot() for immutable copies, plus added createSnapshot() to MeasureDef and other Def classes to support the Def capture framework.

  4. Version-Agnostic Testing Framework: Implemented a comprehensive fhir2deftest package with SelectedDef assertion APIs and adapters that enable writing measure evaluation tests once and running them against multiple FHIR versions (DSTU3, R4, future R5).

  5. Deprecation Documentation: Added javadoc notices to IMeasureReportScorer documenting the migration path to the new API while keeping all existing scorer functionality unchanged.

Code Review Suggestions

  • External API Design: Review MeasureReportScoringFhirAdapter and IMeasureReportScoringFhirAdapter interface design for usability by external consumers. Verify the static factory pattern and version detection logic align with clinical-reasoning API conventions.

  • Score Copying Logic Correctness: In R4MeasureReportBuilder.copyScoresFromDef() and Dstu3MeasureReportBuilder.copyScoresFromDef(), verify the smart matching logic (positional for single groups/stratifiers, ID-based for multiple) correctly handles edge cases like missing IDs, mismatched group counts, and component stratifiers.

  • FHIR Resource Mutation Safety: The score copying code uses patterns like if (!group.hasId()) to avoid mutating FHIR resources when checking for IDs. Verify these safe access patterns are consistently applied throughout the copying logic and no inadvertent side effects (like creating empty Coding objects via .getCodingFirstRep()) occur.

  • Part 1 Isolation Guarantee: Verify that score copying infrastructure remains truly inactive in Part 1. Check that copyScoresFromDef() is called but does nothing because Def objects have null scores, and old scorers remain the sole source of scores in regular evaluation workflows.

  • Def Snapshot Semantics: Review the createSnapshot() methods in GroupDef, StratumDef, and MeasureDef to ensure they create proper immutable copies. Verify that mutable score fields are copied correctly and defensive copying of collections is complete.

  • Testing Framework Architecture: Examine the fhir2deftest package design (Fhir2DefUnifiedMeasureTestHandler, FhirVersionTestContext, MeasureServiceAdapter) for potential coupling issues or abstraction leaks that could make multi-version testing fragile.

  • Migration Impact on cdr-cr: Consider whether the API changes to IMeasureReportScorer (javadoc deprecation only, no functional changes) adequately guide cdr-cr developers to the new MeasureReportScoringFhirAdapter API without breaking their existing code.

  • DefCaptureCallback Integration: Review how DefCaptureCallback is integrated into R4MeasureProcessor and Dstu3MeasureProcessor. Verify callback invocation happens at the correct point (after evaluation, before builder) and the snapshot semantics are correct.

QA Test Suggestions

Setup

  • Prepare test data:
    • R4 and DSTU3 Measure resources with proportion scoring
    • Corresponding MeasureReports with population counts but no scores
    • Measures with single and multiple groups
    • Measures with stratifiers (both single-component and multi-component)

Test Cases

  • External API Basic Usage (R4): Using MeasureReportScoringFhirAdapter.score(measure, measureReport) with an R4 proportion measure, verify:

    • Scores are correctly calculated and written to the MeasureReport
    • Expected score values match manual calculations (e.g., 75/100 = 0.75)
    • Original population counts remain unchanged
  • External API Basic Usage (DSTU3): Repeat the R4 test with DSTU3 resources to verify version detection and routing work correctly.

  • External API Version Mismatch Detection: Attempt to score a MeasureReport with a Measure of a different FHIR version (e.g., R4 Measure with DSTU3 MeasureReport). Verify appropriate exception is thrown with clear error message.

  • External API Null Input Handling: Test MeasureReportScoringFhirAdapter.score() with null Measure and null MeasureReport inputs. Verify appropriate exceptions with clear messages.

  • Regular Evaluation Workflow Unchanged: Run existing measure evaluation tests (e.g., $evaluate-measure operation) and verify:

    • All existing tests pass without modification
    • Scores are still calculated correctly by old scorers
    • No performance regression
  • Multiple Groups Score Copying: Create a measure with multiple groups, evaluate it (or create a pre-populated MeasureReport), and verify scores are correctly matched to groups by ID.

  • Stratified Measure Score Copying: Create a measure with stratifiers, evaluate it, and verify:

    • Group-level scores are correct
    • Stratum-level scores are correct (once Part 2 activates scoring)
    • Stratifier matching works for both positional (single stratifier) and ID-based (multiple stratifiers) cases
  • Component Stratifier Handling: Test measures with multi-component stratifiers (e.g., gender + age group) to verify the matching logic correctly identifies strata by value combinations.

  • Def Capture Callback Integration: If tests use MeasureEvaluationOptions.defCaptureCallback, verify:

    • Callback receives a complete MeasureDef snapshot after evaluation
    • Callback can be used to make assertions about evaluation state
    • Callback does not interfere with normal evaluation flow
  • COHORT and COMPOSITE Measure Scoring: Test measures with COHORT and COMPOSITE scoring types using the external API. Verify appropriate handling (null scores per FHIR spec, or specific behavior if implemented).

  • Improvement Notation Application: Test measures with "decrease" improvement notation and verify scores are correctly inverted at the group level (once Part 2 activates).

lukedegruchy and others added 8 commits December 2, 2025 16:46
…for counts

This refactoring eliminates duplicate count calculations and establishes Def
classes (MeasureDef, GroupDef, PopulationDef, StratumDef) as the authoritative
source for all population counts in measure scoring.

Key Changes:
- Refactored R4MeasureReportScorer to read counts from Def classes instead of
  MeasureReport FHIR components
- Added getPopulationCount() methods to GroupDef and StratumDef for cleaner
  count retrieval
- Added helper methods to StratumDef: getStratumPopulation() and getPopulationCount()
- Simplified ContinuousVariableObservationConverter interface to only handle
  conversion to FHIR Quantities
- Removed ContinuousVariableObservationConverter parameter from entire
  ContinuousVariableObservationHandler call chain (no longer needed)
- Added convertCqlResultToQuantityDef() method to handle CQL result conversion
  to QuantityDef
- Updated exception types in convertCqlResultToQuantityDef() to use FHIR
  InvalidRequestException
- Refactored MeasureScorerTest to use record classes instead of nested HashMaps
- Fixed stratum scoring by properly handling CRITERIA-type stratifiers

Benefits:
- Eliminates redundant count calculations between Def classes and MeasureReport
- Improves FHIR version independence (R4, DSTU3, R5 can share common logic)
- Better separation of concerns: Def classes own count logic, converters only
  handle FHIR serialization
- Improved code maintainability and testability

All 931 tests passing in cqf-fhir-cr module.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Create a FHIR-version-agnostic measure scorer that uses Def classes as
the single source of truth for both data and computed scores, eliminating
the need for version-specific scoring logic duplication.

Architecture changes:
- Rename MeasureReportScorer interface to IMeasureReportScorer
- Create new MeasureDefScorer class (602 lines) with version-agnostic
  scoring logic that mutates Def objects instead of returning scores
- Convert StratumDef from record to class to support mutable score field
- Add score state (Double) with getters/setters to GroupDef and StratumDef

Key features:
- Def-first iteration pattern: uses MeasureDef.groups() and
  StratifierDef.getStratum() instead of FHIR object iteration
- Mutation-based design: scoreGroup() is void and sets scores directly
  on GroupDef/StratumDef objects via setScore()
- Enhanced switch expressions with early returns in aggregate() method
- GroupDef.getMeasureScore() handles improvement notation adjustment
  ("increase" returns score as-is, "decrease" returns 1 - score)
- Population basis support: correctly handles boolean (count subjects)
  vs non-boolean (count resources) scoring differences

Comprehensive test coverage (784 lines, 16 tests):
- Proportion, ratio, and continuous variable scoring
- Stratifier scoring with multiple strata
- Zero denominator edge cases
- All aggregate methods (SUM, AVG, MIN, MAX, MEDIAN, COUNT)
- Improvement notation adjustment (increase/decrease)
- Population basis impact (boolean vs Encounter counts)
- Null, zero, and negative score handling

Population basis is critical: boolean basis counts unique subjects
(2/3 = 0.667) while Encounter basis counts all resources (5/9 = 0.556)
for the same population data.

Files changed:
- New: IMeasureReportScorer.java (renamed interface)
- New: MeasureDefScorer.java (version-agnostic scorer)
- New: MeasureDefScorerTest.java (comprehensive unit tests)
- New: PRPs/version-agnostic-measure-report-scorer.md (PRP documentation)
- Modified: GroupDef.java (+54 lines: score field, getMeasureScore)
- Modified: StratumDef.java (+48 lines: record to class, score field)
- Modified: BaseMeasureReportScorer.java (interface rename)
- Modified: R4MeasureReportBuilder.java (interface rename)
- Modified: Dstu3MeasureReportBuilder.java (interface rename)
- Deleted: MeasureReportScorer.java (renamed to IMeasureReportScorer)

All 328 measure tests passing. No regressions.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
…Def class design

This commit completes comprehensive test coverage for MeasureDefScorer and
addresses implementation gaps discovered through test-driven development.

Implementation Changes:
- Add scoreRatioMeasureObservationGroup() method for group-level ratio with
  observations scoring (lines 153-199)
- Fix critical stratum filtering bug in getResultsForStratum() - now correctly
  filters at Map.Entry level (subject ID) instead of observation Map keys
- Implement 4 new comprehensive tests for ratio with observations, cohort
  measures, and edge cases

Def Class Enhancements:
- Replace String ID with PopulationDef reference in StratumPopulationDef for
  better type safety and elimination of ID-based lookups
- Rename GroupDef.get() to getPopulationDefs() for clarity
- Add GroupDef.getFirstWithId() helper method for criteriaReference matching
- Enhance PopulationDef.toString() with comprehensive debugging information
- Update StratumPopulationDef.toString() to use PopulationDef's toString()

Test Coverage:
- testScoreGroup_RatioWithObservations_GroupLevel: Group-level ratio scoring
- testScoreStratifier_RatioWithObservations_StratumLevel: Stratum-level ratio
  scoring with gender stratification
- testScoreGroup_CohortMeasure_NoScoreSet: Verify cohort measures return null
- testScoreGroup_MissingScoringType_ThrowsException: Validation error handling
- Remove unused createMeasurePopulationConcept(type, criteria) helper method

Documentation:
- Add "Implementation Outcomes and Lessons Learned" section to version-agnostic
  PRP documenting actual implementation results
- Update measure-def-scorer-test-coverage-enhancement.md to reflect completed
  work
- Document data structure insights and R4 pattern reuse

All Tests: 954 passing (20 in MeasureDefScorerTest), 0 failures, 0 errors

This work represents Phase 1 completion of the version-agnostic measure
scorer implementation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit completes the implementation of a version-agnostic test framework
for capturing and asserting on MeasureDef state across FHIR versions (DSTU3, R4).

## Core Infrastructure (Phase 1)

### Def Snapshot API
- Add DefCaptureCallback interface for capturing MeasureDef snapshots
- Implement createSnapshot() methods on all Def classes:
  - MeasureDef, GroupDef, PopulationDef, StratifierDef, StratumDef
  - StratifierComponentDef, SdeDef
- Deep copy collections, share FHIR resource references
- Add defCaptureCallback field to MeasureEvaluationOptions

### PopulationDef Refactoring
- Add populationBasis field to PopulationDef (moved from GroupDef dependency)
- Add getPopulationBasis() and isBooleanBasis() methods with Javadoc
- Streamline getCount() to use own populationBasis (no GroupDef parameter)
- Update createSnapshot() to include populationBasis field

### Processor Integration
- Add callback invocation to R4MeasureProcessor (after processResults)
- Add callback invocation to Dstu3MeasureProcessor (after processResults)
- Update MeasureDefBuilder classes to pass populationBasis to PopulationDef

## Version-Agnostic Test Framework (Phases 2-6)

### Test Infrastructure (21 new files in fhir2deftest/)
- Fhir2DefUnifiedMeasureTestHandler: Unified Given/When/Then DSL
- FhirVersionTestContext: Auto-detect FHIR version, factory for adapters
- MeasureServiceAdapter interface: Version-agnostic evaluation API
- Request/Response wrappers: MeasureEvaluationRequest, MeasureReportAdapter

### Version-Specific Adapters
- R4MeasureServiceAdapter: R4 single/multi-measure support
- R4MeasureReportAdapter, R4MultiMeasureReportAdapter
- Dstu3MeasureServiceAdapter: DSTU3 single-measure support
- Dstu3MeasureReportAdapter

### Fluent Assertion Classes (6 classes)
- Selected: Base class for fluent navigation
- SelectedDef: MeasureDef assertions and navigation
- SelectedDefGroup: GroupDef assertions, population/stratifier navigation
- SelectedDefPopulation: Subject/resource assertions, count validation
- SelectedDefStratifier: Stratum navigation and assertions
- SelectedDefStratum: Stratum population navigation
- SelectedDefStratumPopulation: Stratum-level count assertions

### Repository Configuration
- Add in-memory filtering to Fhir2DefUnifiedMeasureTestHandler:
  - SEARCH_FILTER_MODE.FILTER_IN_MEMORY
  - TERMINOLOGY_FILTER_MODE.FILTER_IN_MEMORY
  - VALUESET_EXPANSION_MODE.PERFORM_NAIVE_EXPANSION
- Matches DSTU3/R4 Measure class configuration

### Integration Tests
- ContinuousVariableResourceMeasureObservationFhir2DefTest (R4)
  - Tests continuous variable measure with MEASUREOBSERVATION
  - Validates Def capture, population counts, stratifiers
- Dstu3MeasureAdditionalDataFhir2DefTest (DSTU3)
  - Tests measure evaluation with additional data Bundle
  - Validates Def capture and population counts

## Enhanced Test Coverage

### PopulationDefTest (10 tests, was 3)
- Add isBooleanBasis() assertions to all existing tests
- Add 7 new tests covering different population basis types:
  - testIsBooleanBasis_WithBooleanBasis/WithNonBooleanBasis
  - testGetCount_BooleanBasis_CountsUniqueSubjects
  - testGetCount_EncounterBasis_CountsAllResources
  - testGetCount_StringBasis_CountsAllResources
  - testGetCount_DateBasis_CountsAllResources
  - testGetCount_MeasureObservation_CountsObservations

### MeasureDefScorerTest
- Fix population basis types to match master (source of truth):
  - testScoreGroup_SetsScoreOnGroupDef: boolean -> Encounter
  - testScoreGroup_ProportionWithExclusions: boolean -> String
- Ensure same CodeDef instance used for PopulationDef and GroupDef

## Code Cleanup

### SelectedDef Classes
- Remove Arrays.asList() wrappers, iterate directly over varargs
- Fix SelectedDefPopulation.hasType() to compare with MeasurePopulationType.toCode()
- Clean up unused imports

### Test Files
- Remove 6 framework-specific tests from Dstu3MeasureAdditionalDataFhir2DefTest
- Add Map import to PopulationDefTest, use non-qualified references

### Javadoc
- Add comprehensive Javadoc for PopulationDef.getPopulationBasis()
- Add comprehensive Javadoc for PopulationDef.isBooleanBasis()
- All existing Javadoc verified for accuracy

## Testing
- All 974 tests passing (13 skipped)
- R4 and DSTU3 Def capture working correctly
- Unified DSL working for both FHIR versions

## Documentation
- Update version-agnostic-def-capture-framework.md PRP:
  - Mark all phases as COMPLETE ✅
  - Add Implementation Summary section
  - Document final statistics and achievements
  - Note deviations from original plan

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…oring

This commit implements Part 1 of the MeasureDefScorer integration plan, establishing
foundation components for version-agnostic measure scoring and creating a new external
API for the cdr-cr project. No behavioral changes to existing measure evaluation workflows.

## Part 1 Scope: Foundation & Refactoring (No Behavioral Changes)

This is Part 1 of a 2-part integration. Part 1 creates infrastructure without changing
existing behavior. Part 2 will integrate MeasureDefScorer into the evaluation workflow.

### Key Constraint
Old R4/DSTU3 scorers remain active and unchanged during regular measure evaluation.
New scoring infrastructure only activated via MeasureReportScoringFhirAdapter API.

## Changes Implemented

### 1. StratumDef Enhancement
- Added getMeasureScore() method for applying improvement notation to stratifiers
- Added createSnapshot() method for creating immutable copies with score state
- Stratifiers inherit improvement notation from parent group (per FHIR spec)

### 2. MeasureDefScorer Refinements
- Added COHORT and COMPOSITE measure scoring support (returns null scores per spec)
- Enhanced javadoc and method documentation
- All scoring logic operates on Def classes (version-agnostic)

### 3. Score Copying Infrastructure in Builders (Inactive in Part 1)
- **R4MeasureReportBuilder**: Added copyScoresFromDef() method
  - Smart matching: positional for single groups/stratifiers, ID-based for multiple
  - Safe FHIR access patterns prevent side effects
  - Currently inactive (Def objects have null scores in Part 1)

- **Dstu3MeasureReportBuilder**: Added copyScoresFromDef() method
  - Same smart matching and safe access patterns as R4
  - Currently inactive (Def objects have null scores in Part 1)

### 4. External API for cdr-cr Project (NEW)
Created MeasureReportScoringFhirAdapter - version-agnostic post-hoc scoring API:

- **MeasureReportScoringFhirAdapter**: Static factory entry point
  - Auto-detects FHIR version, routes to appropriate adapter
  - Validates measure/report compatibility

- **IMeasureReportScoringFhirAdapter**: Version-agnostic interface
  - Default workflow: build MeasureDef → populate counts → score → copy back
  - Abstract methods for version-specific FHIR operations

- **R4MeasureReportScoringFhirAdapter**: R4 implementation
  - Builds MeasureDef from Measure structure
  - Populates counts from MeasureReport
  - Uses MeasureDefScorer for version-agnostic scoring
  - Copies scores back to MeasureReport

- **Dstu3MeasureReportScoringFhirAdapter**: DSTU3 implementation
  - Same workflow as R4, adapted for DSTU3 types

### 5. Deprecation Documentation
- Added javadoc deprecation notices to IMeasureReportScorer interface
- Documented migration path for external consumers
- Old scorers remain fully functional (no runtime warnings)

### 6. Comprehensive Testing
- **MeasureReportScoringFhirAdapterTest**: Tests external API
  - R4 and DSTU3 proportion measure scoring
  - Actual score assertions (0.75, 0.21, 0.80, 0.60)
  - Version mismatch detection
  - Null validation

- **Dstu3MeasureReportBuilderTest**: Tests score copying infrastructure (DSTU3)
  - Verifies copyScoresFromDef() integration point
  - Currently no scores to copy (Part 1 foundation only)

- **R4MeasureReportBuilderTest**: Tests score copying infrastructure (R4)
  - Verifies copyScoresFromDef() integration point
  - Currently no scores to copy (Part 1 foundation only)

## Architecture Benefits

1. **External API Ready**: cdr-cr project can immediately use MeasureReportScoringFhirAdapter
2. **Zero Risk**: No changes to existing evaluation behavior
3. **Version Independence**: New API works with R4 and DSTU3, ready for R5
4. **Safe Refactoring**: Score copying infrastructure in place for Part 2 activation
5. **Backward Compatible**: Old scorers continue working exactly as before

## Part 2 Preview

Part 2 will:
- Activate MeasureDefScorer in MeasureEvaluationResultHandler
- Remove old scorer calls from builders (switch to copyScoresFromDef)
- Add Fhir2Def integration tests
- Complete the transition to version-agnostic scoring

## Testing

- All 987 tests passing (0 failures, 0 errors, 13 skipped)
- Added 513 lines of new test coverage
- Comprehensive unit tests for new external API
- Score copying infrastructure validated (ready for Part 2)

## Statistics

- 19 files changed: +5092 insertions, -19 deletions
- 3 PRP documents (master plan + Part 1 + Part 2)
- No breaking changes to public APIs
- All existing tests maintained and passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions
Copy link

Formatting check succeeded!

@lukedegruchy lukedegruchy changed the title Ld 20251208 integrate new measure scorer Integrate new FHIR-version agnostic measure scoring API from clinical-reasoning Dec 10, 2025
@codecov
Copy link

codecov bot commented Dec 10, 2025

Codecov Report

❌ Patch coverage is 60.94183% with 141 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.53%. Comparing base (0f85487) to head (e66cce2).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
...ir/cr/measure/dstu3/Dstu3MeasureReportBuilder.java 24.13% 41 Missing and 3 partials ⚠️
...cqf/fhir/cr/measure/r4/R4MeasureReportBuilder.java 61.17% 14 Missing and 19 partials ⚠️
...e/common/Dstu3MeasureReportScoringFhirAdapter.java 58.92% 19 Missing and 4 partials ⚠️
...sure/common/R4MeasureReportScoringFhirAdapter.java 58.92% 19 Missing and 4 partials ⚠️
...asure/common/IMeasureReportScoringFhirAdapter.java 66.66% 2 Missing and 1 partial ⚠️
...easure/common/MeasureReportScoringFhirAdapter.java 85.71% 2 Missing and 1 partial ⚠️
...opencds/cqf/fhir/cr/measure/common/StratumDef.java 66.66% 3 Missing ⚠️
...ncds/cqf/fhir/cr/measure/common/StratifierDef.java 77.77% 1 Missing and 1 partial ⚠️
...cds/cqf/fhir/cr/measure/r4/R4MeasureProcessor.java 50.00% 1 Missing and 1 partial ⚠️
...s/cqf/fhir/cr/measure/common/MeasureDefScorer.java 50.00% 1 Missing ⚠️
... and 4 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master     #859      +/-   ##
============================================
- Coverage     73.68%   73.53%   -0.15%     
  Complexity      277      277              
============================================
  Files           576      580       +4     
  Lines         27159    27513     +354     
  Branches       3459     3518      +59     
============================================
+ Hits          20011    20233     +222     
- Misses         5432     5522      +90     
- Partials       1716     1758      +42     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@lukedegruchy lukedegruchy changed the title Integrate new FHIR-version agnostic measure scoring API from clinical-reasoning Integrate MeasureDefScorer Part 1: Foundation for Version-Agnostic Scoring Dec 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants