A data quality assessment tool for OMOP Common Data Model databases. The Profile of Analytic Suitability Score (PASS) evaluates clinical data across six dimensions to quantify its fitness for research and analytics.
PASS calculates standardized metrics (0-1 scale) that measure different aspects of OMOP CDM data quality:
- Accessibility: Are clinical facts present and discoverable?
- Provenance: How well are facts coded and traceable to source data?
- Standards: Are OHDSI standard concepts being used?
- Concept Diversity: Is there variety in the concepts represented?
- Source Diversity: How many different data sources contribute?
- Temporal: How is data distributed over time?
Each metric produces field-level, table-level, and overall scores with 95% confidence intervals. A weighted composite PASS score aggregates the individual metrics into a single quality measure.
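For intuition, the composite can be pictured as a weighted combination of the six overall metric scores. The sketch below uses a plain weighted mean with invented scores; the exact aggregation and confidence-interval method are described in the scoring methodology vignette and may differ.

```r
# Illustrative only: combine six overall metric scores with user-chosen weights
scores  <- c(accessibility = 0.92, provenance = 0.88, standards = 0.95,
             concept_diversity = 0.71, source_diversity = 0.64, temporal = 0.80)
weights <- c(accessibility = 1.5, provenance = 1.0, standards = 1.0,
             concept_diversity = 0.5, source_diversity = 1.0, temporal = 1.0)
weighted.mean(scores, weights)  # single composite quality measure
```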
Accessibility: Evaluates whether clinical facts exist in concept_id fields. Scores range from 1.0 (concept present) to 0.5 (source code only) to 0.05 (text only) to 0.0 (absent). Includes pseudo-fields for custom completeness checks (e.g., measurement results, note text).
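The tiers above can be pictured as a simple lookup. The package evaluates this in SQL against the database; the helper below only illustrates the scoring tiers, it is not the package's implementation.

```r
# Illustrative tier assignment for one record's accessibility contribution
accessibility_tier <- function(has_concept, has_source_code, has_text) {
  if (has_concept) {
    1.0
  } else if (has_source_code) {
    0.5
  } else if (has_text) {
    0.05
  } else {
    0.0
  }
}

accessibility_tier(has_concept = FALSE, has_source_code = TRUE, has_text = TRUE)  # 0.5
```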
Provenance: Measures coding quality and source traceability. Native vocabulary usage scores 1.0, mapped codes 0.95, mapped text 0.75, and untraceable concepts 0.0.
Standards: Binary assessment of OHDSI standard concept usage. Standard concepts score 1.0, non-standard concepts 0.0.
Concept Diversity: Shannon entropy of concept distributions within each field, normalized to [0,1], where 1.0 indicates perfect diversity and 0.0 indicates no variety.
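A minimal sketch of that normalization, assuming natural-log entropy divided by the maximum possible entropy for the number of distinct concepts (the package's internal computation may differ):

```r
# Normalized Shannon entropy of a vector of concept counts (illustrative)
normalized_entropy <- function(counts) {
  counts <- counts[counts > 0]
  if (length(counts) <= 1) return(0)  # a single concept has no diversity
  p <- counts / sum(counts)           # concept proportions
  h <- -sum(p * log(p))               # Shannon entropy
  h / log(length(p))                  # scale by max entropy to land in [0, 1]
}

normalized_entropy(c(500, 300, 200))  # balanced mix -> close to 1
normalized_entropy(c(990, 5, 5))      # dominated by one concept -> close to 0
```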
Source Diversity: Counts unique type_concept_id values per table using exponential decay normalization (1 - exp(-n/k)). The score asymptotically approaches 1.0 as the source count increases.
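A sketch of that normalization; the decay constant k is not documented here, so the value below is an arbitrary choice for illustration:

```r
# Exponential-decay normalization of the number of distinct sources n
source_diversity_score <- function(n, k = 5) {  # k = 5 is an assumed constant
  1 - exp(-n / k)
}

source_diversity_score(1)   # ~0.18
source_diversity_score(5)   # ~0.63
source_diversity_score(20)  # ~0.98, approaching 1.0
```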
Temporal: Combines three sub-scores: range (years of coverage), density (rows per patient per quarter), and consistency (temporal stability via the coefficient of variation).
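Of the three sub-scores, consistency is the least obvious; the sketch below only shows the coefficient of variation of quarterly row counts (lower means more temporally stable). How the package maps the CV and the other sub-scores onto a [0,1] temporal score is defined in the package itself and is not reproduced here.

```r
# Coefficient of variation of quarterly row counts (illustrative input data)
quarterly_counts <- c(1200, 1150, 1300, 1250, 400, 1280)  # one unusually sparse quarter
cv <- sd(quarterly_counts) / mean(quarterly_counts)
cv  # higher CV = less consistent loading over time
```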
```r
library(pass)

# Create database connection
conn <- create_pass_connection(
  project_id = "my-project",
  dataset = "omop_cdm",
  jdbc_driver_path = "~/bigquery_driver/"
)

# Load default configuration
config <- load_pass_config()

# Calculate all metrics
results <- calculate_pass(
  conn = conn,
  schema = "my-project.omop_cdm",
  config = config,
  metrics = "all",
  output_dir = "output/"
)

# Disconnect
disconnect_pass(conn)
```

```r
# Run only accessibility and temporal
results <- calculate_pass(
  conn = conn,
  schema = "my-project.omop_cdm",
  config = config,
  metrics = c("accessibility", "temporal"),
  output_dir = "output/"
)
```

```r
# Load custom configuration files
config <- load_pass_config(
  concept_fields_path = "path/to/custom_concept_fields.csv",
  type_fields_path = "path/to/custom_type_fields.csv",
  date_fields_path = "path/to/custom_date_fields.csv"
)

results <- calculate_pass(conn, schema, config)
```

```r
# Adjust metric weights in composite score
results <- calculate_pass(
  conn = conn,
  schema = "my-project.omop_cdm",
  config = config,
  metrics = "all",
  composite_weights = list(
    accessibility = 1.5,
    provenance = 1.0,
    standards = 1.0,
    concept_diversity = 0.5,
    source_diversity = 1.0,
    temporal = 1.0
  )
)
```

The package includes default configuration files that define which fields to evaluate. These can be customized by providing your own CSV files.
Configuration files are located in `inst/config/`:

`concept_fields_with_weights.csv` defines which concept_id fields to evaluate and their analytical importance weights (0-1 scale).

```
table,concept_id_field,source_concept_id_field,source_value_field,multiplier,rationale
condition_occurrence,condition_concept_id,condition_source_concept_id,condition_source_value,1.0,Primary diagnosis field
```

`type_concept_id_fields.csv` specifies the type_concept_id fields for source diversity analysis.

```
table,type_concept_id
condition_occurrence,condition_type_concept_id
```

`date_fields.csv` defines the primary date fields for temporal analysis.

```
table,date_field
condition_occurrence,condition_start_date
```

To adjust field importance in your analysis:
- Export the default configuration:

  ```r
  default_config <- system.file("config", "concept_fields_with_weights.csv", package = "pass")
  file.copy(default_config, "my_custom_config.csv")
  ```

- Edit `my_custom_config.csv` to adjust the multipliers.

- Load the custom configuration:

  ```r
  config <- load_pass_config(concept_fields_path = "my_custom_config.csv")
  ```

Results are written to the `output/` directory as CSV files:
Each metric generates three files:
- `pass_{metric}_field_level.csv`: Scores for each concept_id field
- `pass_{metric}_table_level.csv`: Aggregated scores per table
- `pass_{metric}_overall.csv`: Dataset-wide score with confidence interval
The composite score produces two additional files:

- `pass_composite_overall.csv`: Weighted composite PASS
- `pass_composite_components.csv`: Individual metric contributions
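For example, after a run with `output_dir = "output/"`, the overall accessibility file can be read back with base R (the columns inside the CSVs are not documented here, so inspect the result rather than assuming names):

```r
# Read one of the generated summary files back into R (illustrative)
overall_accessibility <- read.csv("output/pass_accessibility_overall.csv")
str(overall_accessibility)
```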
Scores can be interpreted as follows:

- 1.0: Perfect quality on this dimension
- 0.8-0.99: Good quality with minor issues
- 0.6-0.79: Moderate quality, room for improvement
- 0.4-0.59: Poor quality, significant gaps
- < 0.4: Very poor quality, major data issues
- NA: Not evaluated (e.g., empty table, insufficient data)
```
pass/
├── DESCRIPTION                      # Package metadata
├── NAMESPACE                        # Exported functions
├── R/                               # R source code
│   ├── calculate_pass.R             # Main user function
│   ├── config_helpers.R             # Configuration loading
│   ├── connection_helpers.R         # Database connection
│   ├── config.R
│   ├── connection.R
│   ├── composite_score.R            # Composite score calculation
│   └── metrics/
│       ├── accessibility.R
│       ├── provenance.R
│       ├── standards.R
│       ├── concept_diversity.R
│       ├── source_diversity.R
│       ├── temporal.R
│       └── domain_completeness.R
├── inst/
│   ├── config/                      # Default configuration files
│   │   ├── concept_fields_with_weights.csv
│   │   ├── type_concept_id_fields.csv
│   │   └── date_fields.csv
│   └── examples/                    # Example usage scripts
│       └── calculate_pass_example.R
├── man/                             # Function documentation (auto-generated)
├── vignettes/                       # Package vignettes
│   └── scoring_methodology.Rmd
└── README.md
```
Create custom completeness checks by adding pseudo-fields (prefixed with `__`) to your custom configuration:

```
measurement,value_as_concept_id,,,0,Not evaluated - see pseudo-field
measurement,__result_completeness__,,,1.0,Custom result completeness logic
```

Then implement the custom logic by modifying `build_pseudo_field_sql()` in `R/metrics/accessibility.R`.
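The expected return shape of `build_pseudo_field_sql()` is internal to the package, so the snippet below is only a hypothetical illustration of the kind of completeness logic a `__result_completeness__` pseudo-field might encode, here as BigQuery SQL counting measurements that carry a numeric or coded result:

```r
# Hypothetical completeness query for the measurement table; not the package's
# actual SQL, and the table reference would normally come from the schema argument
result_completeness_sql <- "
  SELECT
    COUNT(*) AS total_rows,
    COUNTIF(value_as_number IS NOT NULL OR value_as_concept_id IS NOT NULL) AS complete_rows
  FROM `my-project.omop_cdm.measurement`
"
```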
Access results programmatically without saving to files:
```r
results <- calculate_pass(
  conn = conn,
  schema = "my-project.omop_cdm",
  config = config,
  output_dir = NULL  # Don't save CSV files
)

# Access overall scores
accessibility_score <- results$accessibility$overall$overall_score
temporal_score <- results$temporal$overall$overall_temporal_score

# Access field-level details
field_scores <- results$accessibility$field_level
```
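From there the field-level data frame can be explored with ordinary R tooling, for example to surface the weakest fields. The score column name below is an assumption; check `names(field_scores)` for the actual columns.

```r
# Inspect the available columns, then sort by the (assumed) score column
names(field_scores)
head(field_scores[order(field_scores$field_score), ])
```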