
Conversation

@aclark02-arcus (Collaborator) commented Sep 10, 2025:

Significant work towards #223.

This PR (together with #243) reduced app startup time from 2m 40s to 41s–70s. Note: these timings were measured on a larger study with ~600-700 patients. The time to simply review a record also went down, from 27s to 12s, across these two PRs.

Obviously there are a lot of changes here. Feel free to take your time and suggest alternatives, but I would love some initial thoughts if you have any. For now, I just commented out some old code; when this PR is ready, I'm happy to remove it altogether.

For some reason, I had to add a couple of lines to dev/run_dev.R/load_and_run() and test_clinsight() to build the db first... not sure why. Maybe you know?

@LDSamson (Collaborator) commented:

Hi @aclark02-arcus,

The application surely needs improvement in startup time, so I think this is indeed the correct way forward. Good news that these changes reduced the startup time for you!

We probably need to implement this as something like a two-step process. As discussed, maybe you can merge it first into your devex branch, and then we can check how to validate the changes properly before merging into dev.

For validation according to GAMP V guidelines, we should identify potential risks of this adjustment, provide a risk classification, and then try to mitigate especially the high-priority risks.

For example, in this case, we are changing the core data files the app loads into memory. Since we would now rely on more files, they might go out of sync, which could be a risk. We could mitigate this by:

  • Reducing the number of files we need to use (maybe put more files in lists? Is that possible?)
    • Can we maybe even remove study_data and rely on app_data alone? Not using study_data would also help with the app's memory usage.
    • Do we still need to rely on the app_tables object, or is it only needed for creating one of the other overview tables?
  • Adding a time stamp as an attribute to each file, so that it is clear they were created at the same moment, and verifying that the time stamps match (see the sketch after this list).
  • ...
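A minimal sketch of the time-stamp idea, assuming the objects are saved together by one build script. The object names, file names, and `build_time` attribute are illustrative, not existing clinsight code:

```r
# Stamp every object with the same build time before saving
build_time <- Sys.time()
for (x in c("app_data", "available_data", "timeline_data")) {
  obj <- get(x)
  attr(obj, "build_time") <- build_time
  saveRDS(obj, paste0(x, ".rds"))
}

# At load time, verify that all files carry the same build stamp
stamps <- vapply(
  c("app_data.rds", "available_data.rds", "timeline_data.rds"),
  \(f) as.character(attr(readRDS(f), "build_time")),
  character(1)
)
stopifnot("Data files were not created in the same build" = length(unique(stamps)) == 1)
```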

Another risk is that we would like existing studies to update to the new system without much trouble. Is this possible, or would it need a major update? Some documentation on how to update would probably also be good.

Let me know when you are available, then we can discuss this!

@aclark02-arcus (Collaborator, Author) commented:

Hi @LDSamson, thanks for looking at this. Just a quick note: to my knowledge, we've abandoned devex altogether, but we could bring it back if you wish. We'd probably "restart" by basing it off the current dev branch. Is there a reason you want to use devex instead of dev? It seems we could work out all of those GAMP considerations here (in this PR). What do you think?

Just so you know, we've already committed these changes to our internal clinsight branch. But we'd welcome anything new that comes up during PR review.

@LDSamson (Collaborator) commented Sep 30, 2025:

Ah, good to know. I thought you would want to move forward quickly using such a devex branch.

The main reason I was a bit hesitant about using dev is that I would like to be able to support current versions running in production, with small updates/bug fixes if needed, without having to make big changes to the setup of studies that are already running. If this PR still allows the old setup, and everything is tested/validated properly before merging, then merging into dev would also be fine.

@LDSamson (Collaborator) left a comment:

See previous comment. The risk of this change seems manageable; just good to have a few edge cases covered :) If you want to discuss, let me know!

Copilot AI review requested due to automatic review settings December 30, 2025 16:32


@aclark02-arcus (Collaborator, Author) commented:

Now that #251 has merged, app_tables no longer exists, so I'll change the name of this PR. Regardless, this PR is still worthwhile for its preprocessing of the data before launching the app. Commit a8fbcb1 got rid of app_tables, but added timeline_data to the preprocessing step to save lots of time within the AE tab. It's now saved as both an .rds and a .parquet file.

@aclark02-arcus aclark02-arcus changed the title Run create_table() only once in app Preprocess heavy computations before run_app() Dec 30, 2025
Copilot AI left a comment:

Pull request overview

Copilot reviewed 24 out of 29 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

man/metadata.Rd:156

  • The description section contains duplicate paragraphs. Remove the duplicated content starting at line 151.
```
A list of data frames and settings containing metadata that will be used for
the application. The metadata will be merged with with raw data. It controls
the variables that will be used in the application, and in which tab the
variables will be shown. The goal is that most, if not all, study-specific
data will be captured in the metadata, leaving the scripts to run the
application largely unaltered between studies.See \code{vignette("Metadata")} for
in-depth information.

A list of data frames and settings containing metadata that will be used for
the application. The metadata will be merged with with raw data. It controls
the variables that will be used in the application, and in which tab the
variables will be shown. The goal is that most, if not all, study-specific
data will be captured in the metadata, leaving the scripts to run the
application largely unaltered between studies.See \code{vignette("Metadata")} for
```


@aclark02-arcus (Collaborator, Author) commented:

@LDSamson, pinging you for a review to make sure you are content with the recent changes.

@LDSamson (Collaborator) left a comment:

Hi @aclark02-arcus, thank you for your work on this.

See also the few detailed comments I made.

In general, the setup of a study feels a bit messy now, because we need to create so many objects with the correct naming before we can even start clinsight. It would be great if that could be addressed.

Suggestions (open for debate, of course) would be to:

  1. Create a list object (e.g. clinsight_data) with the following elements:
     • appdata (list created with get_appdata)
     • timeline_data (data frame with timeline data)
     • available_data (data frame with available data)
     • app_vars (list with app variables/metadata)
  2. Create a helper function to create this list object from the merged_data object and the metadata object.
  3. Create a helper function to save clinsight_data, with a boolean option use_parquet (FALSE by default, to be compatible):
     • either saves it as a single .rds file, or as separate parquet files.
  4. Create a helper function to load clinsight_data, with the same use_parquet option.

This way, we only have to feed clinsight_data to run_app instead of so many different objects, while still using data pre-computations (a rough sketch follows below).
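A rough sketch of what these helpers could look like. The names create_clinsight_data, save_clinsight_data, and load_clinsight_data are hypothetical, not existing clinsight functions; get_appdata(), get_meta_vars(), get_available_data(), and get_timeline_data() are the existing helpers used in this PR, with extra arguments (e.g. form_repeat_name, treatment_label) omitted here:

```r
create_clinsight_data <- function(merged_data, meta) {
  appdata        <- get_appdata(data = merged_data, meta = meta)
  available_data <- get_available_data(data = appdata)
  list(
    appdata        = appdata,
    timeline_data  = get_timeline_data(appdata, available_data = available_data),
    available_data = available_data,
    app_vars       = get_meta_vars(data = appdata, meta = meta)
  )
}

save_clinsight_data <- function(x, folder = ".", use_parquet = FALSE) {
  if (!use_parquet) {
    return(saveRDS(x, file.path(folder, "clinsight_data.rds")))
  }
  # Parquet mode: data frames become parquet files, the rest stays in .rds
  is_df <- vapply(x, is.data.frame, logical(1))
  for (nm in names(x)[is_df]) {
    arrow::write_parquet(x[[nm]], file.path(folder, paste0(nm, ".parquet")))
  }
  saveRDS(x[!is_df], file.path(folder, "clinsight_data.rds"))
}

load_clinsight_data <- function(folder = ".", use_parquet = FALSE) {
  x <- readRDS(file.path(folder, "clinsight_data.rds"))
  if (use_parquet) {
    for (f in list.files(folder, pattern = "\\.parquet$", full.names = TRUE)) {
      x[[tools::file_path_sans_ext(basename(f))]] <- arrow::read_parquet(f)
    }
  }
  x
}
```

run_app() would then only need the single clinsight_data object.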

Furthermore, if we are using parquet files, shouldn't we also save all the data frames in appdata as parquet files? These are by far the largest objects.

Looking forward to hearing your thoughts on this

Comment on lines +19 to +69
```r
# Build a version of `app_data` & app_vars
app_data <- get_appdata(data = merged_data, meta = metadata)
app_vars <- get_meta_vars(data = app_data, meta = metadata)

# Build a 'app_tables'
# app_tables <- lapply(
#   setNames(names(app_data), names(app_data)), \(x){
#     create_table(app_data[[x]], expected_columns = names(app_vars$items[[x]]))
#   })

# Build a 'available_data'
available_data <- get_available_data(
  data = app_data,
  # tables = app_tables,            # outdated arg
  # all_forms = app_vars$all_forms, # outdated arg
  form_repeat_name = with(
    metadata[["table_names"]],
    table_name[raw_name == "form_repeat"]
  ) |>
    tryCatch(error = \(e) "N")
)

# For timeline data
timeline_data <- get_timeline_data(
  app_data,
  available_data = available_data,
  treatment_label = metadata$settings$treatment_label %||% "\U1F48A T\U2093"
)

# tempdir not useful for production mode
data_folder <- "."
data_path <- file.path(data_folder, "merged_data.rds")
saveRDS(merged_data, data_path)
# data_path <- file.path(data_folder, "merged_data.rds")
# saveRDS(merged_data, data_path)

# Current saves both RDS and Parquet for data frames for continuity purposes
save_objs <- c(
  "metadata",
  "app_data",
  "app_vars",
  # "app_tables",
  "available_data",
  "timeline_data"
)
purrr::walk(save_objs, function(x){
  rds_file <- file.path(data_folder, paste0(x, ".rds"))
  saveRDS(get(x), rds_file)
  if(inherits(get(x), "data.frame")) {
    pq_file <- file.path(data_folder, paste0(x, ".parquet"))
    arrow::write_parquet(get(x), pq_file)
  }
})
```

Collaborator:

This makes it look quite difficult to start the app: you must remember to create all these objects, name them correctly, and feed them into the app. Can't we simplify this process more?

Comment on lines +71 to +79
```r
## Verify app_data list
if(is.character(app_data)){
  if(!file.exists(app_data)) stop(paste0("Cannot find '", app_data, "'."))
  if(tolower(tools::file_ext(app_data)) != "rds"){
    stop("Invalid 'app_data' format. Expecting a file .rds format")
  }
  data <- readRDS(data)
  app_data <- readRDS(app_data)
}
stopifnot("Expecting study data to be in data frame format." = is.data.frame(data))
stopifnot("Expecting 'app_data' to be in list format." = inherits(app_data, "list"))
```
Collaborator:

I might be wrong here, but I have a feeling that the improvement in loading time does not come so much from using parquet files. The reason I think that is that no parquet file is used for the largest object, the study data here. Can't that be changed?
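For instance, this could be a one-line change in the build script (data_folder as in the snippet quoted above; assuming merged_data is a plain data frame):

```r
# Save the study data as parquet as well, alongside (or instead of) the .rds
arrow::write_parquet(merged_data, file.path(data_folder, "merged_data.parquet"))
```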

Collaborator (Author):

I can't remember the exact run_app() time savings from switching .rds to .parquet for data.frame objects, but I remember it was "pretty decent." Note: switching to parquet decreased the data build time by 18% for a large study, since saving large data frames as .rds is slow. BUT the biggest advantage here was storage efficiency: parquet compressed the daily build size from 10 GB to 1 GB, and that was without any changes to app_data.

But I agree with what you're saying here. I didn't want to mess with changing the app_data object, since it is a list and parquet can't help with that unless you break the list apart (sketched below).
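A hypothetical sketch of breaking the list apart: write each data frame in app_data to its own parquet file, then read them back into a named list. It assumes every element of app_data is a data frame; the helper names and file-name pattern are illustrative only:

```r
save_appdata_parquet <- function(app_data, folder = ".") {
  for (form in names(app_data)) {
    arrow::write_parquet(
      app_data[[form]],
      file.path(folder, paste0("app_data_", form, ".parquet"))
    )
  }
}

load_appdata_parquet <- function(folder = ".") {
  files <- list.files(folder, pattern = "^app_data_.*\\.parquet$", full.names = TRUE)
  forms <- sub("^app_data_(.*)\\.parquet$", "\\1", basename(files))
  setNames(lapply(files, arrow::read_parquet), forms)
}
```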

Comment on lines 147 to 156
warning("No user database found. New database will be created")
db_create(get_review_data(data), db_path = user_db)
} else{
stopifnot("user_db version is not up to date" =
identical(db_version, db_get_version(user_db)))
# Skip if not needed for faster testing:
if(isTRUE(get_golem_config("app_prod"))){
db_update(get_review_data(data), db_path = user_db)
}
}
Collaborator:

> For some reason, I had to add a couple lines to dev/run_dev.R/load_and_run() and test_clinsight() to build the db first... not sure why. Maybe you know?

It's because the database is created in this step when it does not exist. But this will error now, because the data object was deleted by you (see line 27).

```r
    pq_file <- file.path(temp_folder, paste0(x, ".parquet"))
    arrow::write_parquet(get(x), pq_file)
  }
})
```
Collaborator:

I think adding parquet files here will make testing clinsight with this function significantly slower, because writing parquet files is fairly slow. Can this be changed/removed here?
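One possible mitigation, as a sketch: gate the parquet writing behind a flag so the test helper skips it by default (use_parquet here is a hypothetical argument, not an existing one):

```r
# Only write parquet in this test helper when explicitly requested
if (isTRUE(use_parquet) && inherits(get(x), "data.frame")) {
  pq_file <- file.path(temp_folder, paste0(x, ".parquet"))
  arrow::write_parquet(get(x), pq_file)
}
```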

# "app_tables",
"timeline_data",
"available_data")
purrr::walk(save_objs, function(x){
Collaborator:

Keep in mind that purrr is currently only in Suggests in the DESCRIPTION. We should either use base R or move the package to Imports.
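For reference, a base-R version of the same loop that avoids the purrr dependency (save_objs and temp_folder as in the surrounding function):

```r
for (x in save_objs) {
  rds_file <- file.path(temp_folder, paste0(x, ".rds"))
  saveRDS(get(x), rds_file)
  if (inherits(get(x), "data.frame")) {
    pq_file <- file.path(temp_folder, paste0(x, ".parquet"))
    arrow::write_parquet(get(x), pq_file)
  }
}
```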

@aclark02-arcus (Collaborator, Author) commented:

> Looking forward to hearing your thoughts on this

Yes, I'm not opposed to a change like that; it is fairly "messy", as you say. I won't be able to work on it in the near future, though, with my OOO in January.
