Preprocess heavy computations before run_app()
#245
base: dev
Conversation
'merged_data', thus saving lots of time at app startup
Hi @aclark02-arcus, the application surely needs improvement in startup time, so I think this is indeed the correct way forward. Good news that these changes helped to reduce the startup time for you! We probably need to implement this as something like a two-step process. As discussed, maybe you can merge it first in your devex branch. For validation according to GAMP V guidelines, we should identify potential risks of this adjustment, provide a risk classification, and then try to mitigate especially the high-priority risks. For example, in this case, we are changing the core files in memory. Since we would now rely on more files, they might go out of sync, which could be a risk. We should find a way to mitigate this; one possibility is sketched below.
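A hypothetical sketch of such a sync check (the file names and the one-hour threshold are illustrative assumptions, not part of this PR):

```r
# Hypothetical sketch: warn when the preprocessed files look out of sync.
# File names and the time threshold below are assumptions for illustration.
check_files_in_sync <- function(folder = ".") {
  files <- file.path(folder, c(
    "merged_data.rds", "app_data.rds", "app_vars.rds",
    "available_data.rds", "timeline_data.rds"
  ))
  missing <- files[!file.exists(files)]
  if (length(missing) > 0) {
    stop("Missing preprocessed files: ", paste(basename(missing), collapse = ", "))
  }
  # Files written far apart in time suggest a partial rebuild.
  mtimes <- file.mtime(files)
  if (as.numeric(difftime(max(mtimes), min(mtimes), units = "hours")) > 1) {
    warning("Preprocessed files were written far apart in time; ",
            "they may be out of sync. Consider rebuilding all of them.")
  }
  invisible(TRUE)
}
```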
Another risk is that we would like to be able to update to the new system without many problems. Is this possible, or would it need a major update? Some documentation on how to update would probably also be good. Let me know when you are available; then we can discuss this!
Hi @LDSamson, thanks for looking at this. Just a quick note: to my knowledge, we've abandoned the devex branch idea. Just so you know, we've already committed these changes into our internal branch.
Ah, good to know. I thought you would like to move forward quickly using such a devex branch. The main reason I was a bit hesitant about using …
LDSamson
left a comment
See previous comment. The risk of this change seems manageable; it's just good to have a few edge cases covered :) If you want to discuss, let me know!
…les from existence
Now that #251 has merged, …
create_table() only once in run_app()
Pull request overview
Copilot reviewed 24 out of 29 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
man/metadata.Rd:156
- The description section contains duplicate paragraphs. Remove the duplicated content starting at line 151.
A list of data frames and settings containing metadata that will be used for
the application. The metadata will be merged with the raw data. It controls
the variables that will be used in the application, and in which tab the
variables will be shown. The goal is that most, if not all, study-specific
data will be captured in the metadata, leaving the scripts that run the
application largely unaltered between studies. See \code{vignette("Metadata")} for
in-depth information.
… reading it in
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@LDSamson, pinging you for a review to make sure you are content with the recent changes.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
LDSamson
left a comment
Hi @aclark02-arcus, thank you for your work on this.
See also the few detailed comments I made.
In general, the setup of a study feels a bit messy now, because we need to create so many objects with the correct naming before we can even start clinsight. It would be great if that could be addressed.
Suggestions (open for debate, of course):
- Create a list object (e.g. `clinsight_data`) with the following elements:
  - `appdata` (list created with `get_appdata()`)
  - `timeline_data` (data frame with the timeline data)
  - `available_data` (data frame with the available data)
  - `app_vars` (list with app variables/metadata)
- Create a helper function that builds this list object from the `merged_data` object and the `metadata` object.
- Create a helper function that saves `clinsight_data`, with a boolean option `use_parquet` (FALSE by default, for compatibility). It either saves everything as a single .rds file or as separate parquet files.
- Create a helper function that loads `clinsight_data`, with the same `use_parquet` option.

This way, we only have to feed `clinsight_data` to `run_app()` instead of so many different objects, while still using the data pre-computations (a sketch follows below).
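To make the suggestion concrete, a minimal sketch of what these helpers could look like. All names and signatures here are placeholders, and the extra arguments that `get_available_data()` and `get_timeline_data()` take in this PR (e.g. `form_repeat_name`, `treatment_label`) are omitted for brevity:

```r
# Sketch only: names, signatures, and file layout are placeholders, not a final API.
create_clinsight_data <- function(merged_data, metadata) {
  app_data <- get_appdata(data = merged_data, meta = metadata)
  app_vars <- get_meta_vars(data = app_data, meta = metadata)
  # Extra arguments (form_repeat_name, treatment_label, ...) omitted for brevity.
  available_data <- get_available_data(data = app_data)
  timeline_data <- get_timeline_data(app_data, available_data = available_data)
  list(
    appdata = app_data,
    timeline_data = timeline_data,
    available_data = available_data,
    app_vars = app_vars
  )
}

save_clinsight_data <- function(x, folder = ".", use_parquet = FALSE) {
  if (!use_parquet) {
    # Default: a single .rds file, compatible with the current setup.
    saveRDS(x, file.path(folder, "clinsight_data.rds"))
    return(invisible(x))
  }
  # Parquet mode: each data frame gets its own file; the rest stays in .rds.
  is_df <- vapply(x, is.data.frame, logical(1))
  for (nm in names(x)[is_df]) {
    arrow::write_parquet(x[[nm]], file.path(folder, paste0(nm, ".parquet")))
  }
  saveRDS(x[!is_df], file.path(folder, "clinsight_data.rds"))
  invisible(x)
}

load_clinsight_data <- function(folder = ".", use_parquet = FALSE) {
  x <- readRDS(file.path(folder, "clinsight_data.rds"))
  if (use_parquet) {
    # Re-attach each parquet file as a list element named after the file.
    for (f in list.files(folder, pattern = "\\.parquet$", full.names = TRUE)) {
      x[[tools::file_path_sans_ext(basename(f))]] <- arrow::read_parquet(f)
    }
  }
  x
}
```

With something like this, starting the app would only require `run_app(load_clinsight_data(folder))`.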
Furthermore, if we are using parquet files, shouldn't we also save all the data frames in `appdata` as parquet files? These are the largest objects by far.
Looking forward to hearing your thoughts on this!
```r
# Build a version of `app_data` & `app_vars`
app_data <- get_appdata(data = merged_data, meta = metadata)
app_vars <- get_meta_vars(data = app_data, meta = metadata)

# Build 'app_tables'
# app_tables <- lapply(
#   setNames(names(app_data), names(app_data)), \(x){
#     create_table(app_data[[x]], expected_columns = names(app_vars$items[[x]]))
#   })

# Build 'available_data'
available_data <- get_available_data(
  data = app_data,
  # tables = app_tables, # outdated arg
  # all_forms = app_vars$all_forms, # outdated arg
  form_repeat_name = with(
    metadata[["table_names"]],
    table_name[raw_name == "form_repeat"]
  ) |>
    tryCatch(error = \(e) "N")
)

# For timeline data
timeline_data <- get_timeline_data(
  app_data,
  available_data = available_data,
  treatment_label = metadata$settings$treatment_label %||% "\U1F48A T\U2093"
)

# tempdir not useful for production mode
data_folder <- "."
data_path <- file.path(data_folder, "merged_data.rds")
saveRDS(merged_data, data_path)
# data_path <- file.path(data_folder, "merged_data.rds")
# saveRDS(merged_data, data_path)
# Currently saves both RDS and Parquet (for data frames) for continuity purposes
save_objs <- c(
  "metadata",
  "app_data",
  "app_vars",
  # "app_tables",
  "available_data",
  "timeline_data"
)
purrr::walk(save_objs, function(x){
  rds_file <- file.path(data_folder, paste0(x, ".rds"))
  saveRDS(get(x), rds_file)
  if(inherits(get(x), "data.frame")) {
    pq_file <- file.path(data_folder, paste0(x, ".parquet"))
    arrow::write_parquet(get(x), pq_file)
  }
})
```
This makes the app look quite difficult to start: you must remember to create all these objects, name them correctly, and feed them into the app. Can't we simplify this process more?
```r
## Verify app_data list
if(is.character(app_data)){
  if(!file.exists(app_data)) stop(paste0("Cannot find '", app_data, "'."))
  if(tolower(tools::file_ext(app_data)) != "rds"){
    stop("Invalid 'app_data' format. Expecting a file in .rds format.")
  }
  data <- readRDS(data)
  app_data <- readRDS(app_data)
}
stopifnot("Expecting study data to be in data frame format." = is.data.frame(data))
stopifnot("Expecting 'app_data' to be in list format." = inherits(app_data, "list"))
```
I might be wrong here, but I have a feeling that the improvement in loading time does not come so much from using parquet files. The reason I think that is that no parquet file is used for the largest object, the study data here. Can't that be changed?
I can't remember exactly how much run_app() time we saved by switching from .rds to .parquet for data.frame objects, but I remember it was "pretty decent." Note: switching to parquet decreased the data build time by 18% for a large study, as saving large data.frames as .rds is slow. But the best advantage here was the storage efficiency: parquet compressed the daily build size from 10 GB to 1 GB, and that was without any changes to app_data.
But I agree with what you are saying here. I didn't want to mess with changing the app_data object, since it is a list, and parquet can't help with that unless you break the list apart.
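For what it's worth, a rough sketch of what breaking the list apart could look like. The helper names and the `appdata_` file prefix are hypothetical; this assumes every element of app_data is a data frame, as discussed above:

```r
# Hypothetical sketch: one parquet file per data frame in the app_data list.
save_appdata_parquet <- function(app_data, folder = ".") {
  for (nm in names(app_data)) {
    arrow::write_parquet(
      app_data[[nm]],
      file.path(folder, paste0("appdata_", nm, ".parquet"))
    )
  }
}

# Rebuild the named list from the individual parquet files.
load_appdata_parquet <- function(folder = ".") {
  files <- list.files(folder, pattern = "^appdata_.*\\.parquet$", full.names = TRUE)
  nms <- sub("^appdata_", "", tools::file_path_sans_ext(basename(files)))
  setNames(lapply(files, arrow::read_parquet), nms)
}
```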
| warning("No user database found. New database will be created") | ||
| db_create(get_review_data(data), db_path = user_db) | ||
| } else{ | ||
| stopifnot("user_db version is not up to date" = | ||
| identical(db_version, db_get_version(user_db))) | ||
| # Skip if not needed for faster testing: | ||
| if(isTRUE(get_golem_config("app_prod"))){ | ||
| db_update(get_review_data(data), db_path = user_db) | ||
| } | ||
| } |
> For some reason, I had to add a couple lines to dev/run_dev.R's load_and_run() and test_clinsight() to build the db first... not sure why. Maybe you know?

It's because the database is created in this step here when it does not exist. But this will error now, because the data object was deleted by you (see line 27).
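If so, one way out is to have the dev helpers create the database explicitly before starting the app. A sketch, assuming merged_data and user_db are available in the dev session (db_create() and get_review_data() are the functions from the hunk above):

```r
# Sketch for dev/run_dev.R: ensure the user database exists before run_app(),
# since run_app() no longer receives the raw `data` object to create it from.
if (!file.exists(user_db)) {
  db_create(get_review_data(merged_data), db_path = user_db)
}
```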
```r
    pq_file <- file.path(temp_folder, paste0(x, ".parquet"))
    arrow::write_parquet(get(x), pq_file)
  }
})
```
I think adding parquet files here will make testing clinsight with this function significantly slower, because writing parquet files is fairly slow. Can this be changed or removed here?
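One option would be to gate the parquet writes behind a flag that tests leave off. A sketch (the `use_parquet` argument is hypothetical, not part of this PR):

```r
# Sketch: write parquet files only on request, so tests stay fast.
purrr::walk(save_objs, function(x){
  saveRDS(get(x), file.path(temp_folder, paste0(x, ".rds")))
  if (isTRUE(use_parquet) && inherits(get(x), "data.frame")) {
    arrow::write_parquet(get(x), file.path(temp_folder, paste0(x, ".parquet")))
  }
})
```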
| # "app_tables", | ||
| "timeline_data", | ||
| "available_data") | ||
| purrr::walk(save_objs, function(x){ |
Keep in mind that purrr is currently only in Suggests in the DESCRIPTION. We should either use base R or move the package to Imports.
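For reference, a base R equivalent of the purrr::walk() call quoted above, keeping the same behavior (a sketch):

```r
# Base R version of the walk() loop; drops the purrr dependency.
for (x in save_objs) {
  saveRDS(get(x), file.path(data_folder, paste0(x, ".rds")))
  if (inherits(get(x), "data.frame")) {
    arrow::write_parquet(get(x), file.path(data_folder, paste0(x, ".parquet")))
  }
}
```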
Yes, I'm not opposed to a change like that. It is fairly "messy", as you say. I won't be able to work on it in the near future though, with my OOO in January.
Significant work towards #223.
This PR (in addition to #243) reduced app startup time from 2m 40s down to 41-70s. Note: these timings were measured on a larger study with ~600-700 patients. These two PRs also cut the time to simply review a record from 27s down to 12s.
Obviously a lot of changes here. Feel free to take your time & suggest alternatives, but would love some initial thoughts if you have any. For now, I just commented out some old code, but when this PR is ready, I'm happy to remove it altogether.
For some reason, I had to add a couple lines to dev/run_dev.R's load_and_run() and test_clinsight() to build the db first... not sure why. Maybe you know?