Preprocess heavy computations before `run_app()` #245

aclark02-arcus · 2025-09-10T16:55:18Z

Significant work towards #223.

This PR (in addition to #243) reduced app startup time from 2m 40s down to 41s - 70s. Note: these time stats were measured on a larger study with ~600-700 patients. Also, the amount of time to simply review a record went down from 27s down to 12s from these two PRs.

Obviously a lot of changes here. Feel free to take your time & suggest alternatives, but would love some initial thoughts if you have any. For now, I just commented out some old code, but when this PR is ready, I'm happy to remove it altogether.

For some reason, I had to add a couple lines to dev/run_dev.R/load_and_run() and test_clinsight() to build the db first... not sure why. Maybe you know?

'merged_data', thus saving lots of time at app startup

LDSamson · 2025-09-29T14:33:13Z

Hi @aclark02-arcus,

The application surely need improvement in startup time, so I think this is indeed the correct way forward. Good news that these changes helped to reduce the start up time for you!

We probably need to implement this in something like as two-step process. As discussed, maybe you can merge it first in your devex branch, and then we can check how we can validate the changes properly before merging into dev.

For validation according to GAMP V guidelines, we should identify potential risks of this adjustment, provide a risk classification, and then try to mitigate especially high priority risks.

For example, in this case, we are changing the core files in memory. Since we would now rely on more files, they might go out of synch, which could be a risk. We could mitigate this by:

Reducing the number of files that we need to use (maybe put some more files in lists? Is that possible)
- Can we maybe even remove study_data and just rely on app_data? Not using study_data will also help with memory usage of the app
- Do we still need to rely on the app_tables object, or is this only needed for creating one of the other overview tables?
add a time stamp as attribute to each file, so that it is clear that they were created at the same moment, and verify that the time stamp matches.
...

Another risk is that we would like to update to the new system without much problems. Is this possible or would it need a major update? Probably some documentation update on how to update would also be good.

Let me know when you are available, then we can discuss this!

aclark02-arcus · 2025-09-30T01:22:44Z

Hi @LDSamson, thanks for looking at this. Just a quick note: to my knowledge, we've abandoned devex altogether, but we could bring it back if you wish. We'd probably "restart" by basing it off the current dev branch. Is there a reason you want to use devex and instead of dev? Seems like we could just work out all of those GAMP considerations here (in this PR). What do you think?

Just so you know, we've already committed these changes into our internal clinsight branch. But we'd welcome anything new that may come up during PR review.

LDSamson · 2025-09-30T06:57:11Z

Ah, good to know. I thought you would like to move forward quickly using such a devex branch.

Main reason I was a bit hesitant for using dev is that I would like to be able to support current versions that are running in production, with small updates/bug fixes if needed, without having to make big changes to the setup in studies that are already running. If this PR would still allow the old setup, and if everything is tested/validated properly before merging, then merging into dev would also be fine.

LDSamson

See previous comment. The risk of this change seems manageable, just good to have a few adge cases covered :) If you want to discuss let me know!

… ac-223

…les from existence

aclark02-arcus · 2025-12-30T17:14:13Z

Now that #251 has merged, app_tables doesn't really exist, so I'll change the name of this PR. Regardless, this PR is still worthwhile for it's pre-processing of the data before launching the app. Commit a8fbcb1 got rid of app_tables, but added timeline_data to the pre-processing step to save lots of time within the AE tab. It's now being saved as both an .rds & .parquet file.

Copilot

Pull request overview

Copilot reviewed 24 out of 29 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

man/metadata.Rd:156

The description section contains duplicate paragraphs. Remove the duplicated content starting at line 151.

A list of data frames and settings containing metadata that will be used for
the application. The metadata will be merged with with raw data. It controls
the variables that will be used in the application, and in which tab the
variables will be shown. The goal is that most, if not all, study-specific
data will be captured in the metadata, leaving the scripts to run the
application largely unaltered between studies.See \code{vignette("Metadata")} for
in-depth information.

A list of data frames and settings containing metadata that will be used for
the application. The metadata will be merged with with raw data. It controls
the variables that will be used in the application, and in which tab the
variables will be shown. The goal is that most, if not all, study-specific
data will be captured in the metadata, leaving the scripts to run the
application largely unaltered between studies.See \code{vignette("Metadata")} for

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

man/metadata.Rd

R/run_app.R

man/cs_timeline_data.Rd

data-raw/clinsightful_data.R

… reading it in Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

… ac-223

aclark02-arcus · 2025-12-30T17:37:09Z

@LDSamson, pinging you for a review to make sure you are content with the recent changes.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

… ac-223

LDSamson

Hi @aclark02-arcus, thank you for your work on this.

See also the few detailed comments I made.

In general, the setup of a study feels a bit messy now because we need to create so many objects with the correct naming before we can even start clinsight. It would be great if that can be addressed.

Suggestions (open for debate of course) would be to:

create a list object (e.g. clinsight_data) with the following objects:

appdata (list created with get_appdata)
timeline_data (data frame with timeline_data)
available_data (data frame with available_data)
app_vars (list with app variables/metadata)

create a helper function to create this list object, with the merged_data object and the metadata object
create a helper function to save clinsight_data with a boolean option use_parquet (by default FALSE to be compatible).
- eithers saves it as a single .rds file, or as different parquet files
create a helper function to load clinsight_data, with the same option use_parquet

This way, we only have to feed clinsight_data to run_app and not so many different objects, whil still using data pre-computations.

Furthermore, if we are using parquet files, shouldn't we also save all the data frames in appdata as a parquet file? These are the largest objects by far.

Looking forward to hearing your thoughts on this

LDSamson · 2026-01-05T13:19:35Z

dev/app.R

+# Build a version of `app_data` & app_vars
+app_data <- get_appdata(data = merged_data, meta = metadata) 
+app_vars <- get_meta_vars(data = app_data, meta = metadata) 
+
+# Build a 'app_tables'
+# app_tables <- lapply(
+#   setNames(names(app_data), names(app_data)), \(x){
+#     create_table(app_data[[x]], expected_columns = names(app_vars$items[[x]]))
+#   })
+
+# Build a 'available_data'
+available_data <- get_available_data(
+  data = app_data,
+  # tables = app_tables,            # outdated arg
+  # all_forms = app_vars$all_forms, # outdated arg
+  form_repeat_name = with(
+    metadata[["table_names"]],
+    table_name[raw_name == "form_repeat"]
+  ) |>
+    tryCatch(error = \(e) "N")
+)
+
+# For timeline data
+timeline_data <- get_timeline_data(
+  app_data,
+  available_data = available_data,
+  treatment_label = metadata$settings$treatment_label %||% "\U1F48A T\U2093"
+)
+
 # tempdir not useful for production mode
 data_folder <- "."
-data_path <- file.path(data_folder, 
-                       "merged_data.rds")
-saveRDS(merged_data, data_path)
+# data_path <- file.path(data_folder, 
+#                        "merged_data.rds")
+# saveRDS(merged_data, data_path)
+# Current saves both RDS and Parquet for data frames for continuity purposes
+save_objs <- c(
+  "metadata",
+  "app_data",
+  "app_vars",
+  # "app_tables",
+  "available_data",
+  "timeline_data")
+purrr::walk(save_objs, function(x){
+  rds_file <- file.path(data_folder, paste0(x, ".rds"))
+  saveRDS(get(x), rds_file)
+  if(inherits(get(x), "data.frame")) {
+    pq_file <- file.path(data_folder, paste0(x, ".parquet"))
+    arrow::write_parquet(get(x), pq_file)
+  }
+})
+


This looks quite difficult to start the app.. you must remember to create all these objects, name them correctly, feed them into the app.. can't we simplify this process more?

LDSamson · 2026-01-05T13:27:33Z

R/run_app.R

+  ## Verify app_data list
+  if(is.character(app_data)){
+    if(!file.exists(app_data)) stop(paste0("Cannot find '", app_data, "'."))
+    if(tolower(tools::file_ext(app_data)) != "rds"){
+      stop("Invalid 'app_data' format. Expecting a file .rds format")
    }
-    data <- readRDS(data)
+    app_data <- readRDS(app_data)
  } 
-  stopifnot("Expecting study data to be in data frame format." = is.data.frame(data))
+  stopifnot("Expecting 'app_data' to be in list format." = inherits(app_data, "list"))


I might be wrong here, but I have a feeling that the improvement in loading time does not come so much from using parquet files. Reason I think that is that for the largest object, the study data here, no parquet file is used. Can't that be changed?

I can't remember what the run_app() time savings was by switching from .rds to .parquet for data.frame objects but I remember it was "pretty decent." Note: switching to parquet decreased the data build time by 18% for a large study... as saving large data.frames as .rds is slow. BUT the best advantage here was the file storage efficiencies when parquet compressed the daily build size from 10GB to 1GB... and that was without any changes to app_data.

But I agree with what you are saying here. I didn't want to mess with changing the app_data object since it is a list & parquet can't help with that unless you break apart the list.

LDSamson · 2026-01-05T13:33:12Z

R/run_app.R

    warning("No user database found. New database will be created")
    db_create(get_review_data(data), db_path = user_db)
  } else{
    stopifnot("user_db version is not up to date" =
                identical(db_version, db_get_version(user_db)))
    # Skip if not needed for faster testing:
    if(isTRUE(get_golem_config("app_prod"))){
      db_update(get_review_data(data), db_path = user_db) 
    }
  }


For some reason, I had to add a couple lines to dev/run_dev.R/load_and_run() and test_clinsight() to build the db first... not sure why. Maybe you know?

It's because the database is created in this step here when it does not exist. But this will error now, because the data object was deleted by you (see line 27).

LDSamson · 2026-01-05T13:36:53Z

R/fct_test_clinsight.R

+      pq_file <- file.path(temp_folder, paste0(x, ".parquet"))
+      arrow::write_parquet(get(x), pq_file)
+    }
+  })


I think adding parquet files here will make it significantly slower to test clinsight with this function because it is fairly slow to write parquet files. Can this be changed/removed here?

LDSamson · 2026-01-05T14:21:55Z

R/fct_test_clinsight.R

+    # "app_tables",
+    "timeline_data",
+    "available_data")
+  purrr::walk(save_objs, function(x){


Keep in mind that purrr is currently only in suggests in the description. We should either use base R or move the package to Imports.

aclark02-arcus · 2026-01-05T20:16:36Z

Looking forward to hearing your thoughts on this

Yes, I'm not opposed to a change like that. It is fairly "messy" as you say. I won't be able to work on it in the near future however with my OOO in Jan though.

aclark02-arcus added 6 commits September 10, 2025 14:45

Increment version number to 0.3.0.9001

3503d12

export 'get_available_data()' & update docs

56c7402

add news

46cdfd1

export check_appdata() and get_meta_vars() & update docs

7daef86

setup clinsight to use other objects besides

aa1565b

'merged_data', thus saving lots of time at app startup

move a comment

95f9f11

aclark02-arcus requested a review from LDSamson September 10, 2025 16:55

aclark02-arcus mentioned this pull request Sep 10, 2025

Minimize app startup time #223

Open

remove unneccessary comments

bf173dd

LDSamson requested changes Dec 2, 2025

View reviewed changes

LDSamson mentioned this pull request Dec 2, 2025

Improve efficiency - refactoring #251

Merged

Merge branch 'dev' into ac-223

776727b

Copilot AI review requested due to automatic review settings December 30, 2025 16:32

Merge branch 'ac-223' of https://github.com/openpharma/clinsight into…

7caa754

… ac-223

This comment was marked as outdated.

Sign in to view

extract creation of timeline_data outside of the app & remove app_tab…

a8fbcb1

…les from existence

aclark02-arcus changed the title ~~Run create_table() only once in app~~ Preprocess heavy computations before run_app() Dec 30, 2025

aclark02-arcus requested a review from Copilot December 30, 2025 17:15

Copilot AI reviewed Dec 30, 2025

View reviewed changes

man/metadata.Rd Outdated Show resolved Hide resolved

R/run_app.R Outdated Show resolved Hide resolved

man/cs_timeline_data.Rd Outdated Show resolved Hide resolved

data-raw/clinsightful_data.R Outdated Show resolved Hide resolved

aclark02-arcus and others added 5 commits December 30, 2025 09:26

removed duplicate str(metadata)

b5d631f

fixed creation of 'cs_available_data'

cfe7130

Update R/run_app.R to check for character timeline_data object before…

2ee494d

… reading it in Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fix typo in Rd

7f84adc

Merge branch 'ac-223' of https://github.com/openpharma/clinsight into…

e4be379

… ac-223

aclark02-arcus requested a review from LDSamson December 30, 2025 17:36

aclark02-arcus and others added 3 commits December 30, 2025 12:53

Update R/fct_test_clinsight.R to use 'meta_data' instead of 'meta'

cd8b547

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

whoops, fix a couple of issues copilot caught

d0403fd

Merge branch 'ac-223' of https://github.com/openpharma/clinsight into…

212a90b

… ac-223

LDSamson requested changes Jan 5, 2026

View reviewed changes

Preprocess heavy computations before run_app() #245

Are you sure you want to change the base?

Preprocess heavy computations before run_app() #245

Uh oh!

Conversation

aclark02-arcus commented Sep 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LDSamson commented Sep 29, 2025

Uh oh!

aclark02-arcus commented Sep 30, 2025

Uh oh!

LDSamson commented Sep 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LDSamson left a comment

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

Uh oh!

aclark02-arcus commented Dec 30, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aclark02-arcus commented Dec 30, 2025

Uh oh!

LDSamson left a comment

Choose a reason for hiding this comment

Uh oh!

LDSamson Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

LDSamson Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

aclark02-arcus Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

LDSamson Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

LDSamson Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

LDSamson Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

aclark02-arcus commented Jan 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Preprocess heavy computations before `run_app()` #245

Preprocess heavy computations before `run_app()` #245

aclark02-arcus commented Sep 10, 2025 •

edited

Loading

LDSamson commented Sep 30, 2025 •

edited

Loading