PDEL Data Transparency Services Workflow
The following is a description of the process our GSRs use to streamline code, anonymize personally identifying data, organize files, and replicate analyses for PIs. The goal is to create complete replication materials for public dissemination that meet the following standards:
- Files are complete. All of the data, code, and supplementary materials (e.g., codebooks) needed to generate and interpret results (tables, figures, etc.) are included and organized in an intuitive manner. Unnecessary ancillary files (e.g., old versions of code and data) should not be included.
- Personal data are protected. As we know from the IRB process, personally identifiable information (PII)--names, phone numbers, emails, addresses, etc.--cannot be included in a public dataset. When possible, anonymization of PII should come before merging and cleaning, so that the data and code for those later processes can be shared publicly.
- Code is readable. Code should be streamlined and legible. Scripts that run analyses should be separate from those that merge and clean data, and documentation (or the script names themselves) should clearly indicate the order in which they should be run and for what purpose. Comments should be used to help the human reader understand what the researcher is doing. Code that generates the main results of the paper should be clearly identifiable, and not obscured by supplementary and exploratory analyses.
- Everything works. Code and data should reproduce the paper's results without error. Here, it can be helpful to have someone new to the project prepare the files, as a fresh pair of eyes may be more likely to catch errors. Note that running the code on a different computer, operating system, or software version can sometimes catch--and sometimes create--problems with replication.
Note that this process is based mostly on experiences working with project files stored locally or in Dropbox without version control software (the current setup used by a majority of our PIs). However, many of the following steps would be faster, easier, or unnecessary with platforms like the Open Science Framework (OSF) or Git, and we encourage researchers to make the switch for future projects!
Good replication files will contain all the materials necessary to reproduce the study results--including data merging, cleaning, and analysis--with few extraneous files. Rather than copying or cleaning out existing directories, we've found it best to create a new, clean folder and then add to it only those files needed for replication. This preserves the original data and code and avoids accumulating extraneous material. The folder can be organized in a number of ways appropriate to the type and number of files you have, but the structure should be clear and logical. See here for an example.
- Create a new (empty!) replication folder (e.g., "RCT_replication_files") within your project directory. [Note: if you're using Dropbox, see here for more tips on sharing folders with RAs in a way that protects PII data.]
- Create subfolders such as "/code", "/data_clean", "/data_raw", "/output", and "/extra" (a sketch for creating these follows this list)
- Add a "readme.txt" file, and as you go through the workflow below, document each file in the replication folder (ideally including its function and source), along with other info such as system and software requirements.
Identifying the source of any problems in the code is easiest if you do the replication iteratively, beginning with the original code and data--if you clean and restructure files before replicating, it's hard to know whether errors come from the original code or from your edits. We find it best to start with the final analysis and work backwards through the cleaning and merging code. In each case, the original code and data files are COPIED (not moved) into the replication folder. Absent a version control system, this is the best way to protect the original work.
1. Check analysis:
- Copy the original analysis script(s) into RCT_replication_files/code
- Copy the dataset(s) used for analysis into RCT_replication_files/data_clean
- Run code without making changes except for pointing the working directory to your new replication folder (see the sketch after this list)
- Fix any bugs in the code and address any discrepancies with the paper's results.
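For a Stata project, this step often amounts to changing a single line at the top of the copied script; everything else runs as-is (paths and file names below are hypothetical):

```stata
* Top of the copied analysis script: repoint the working directory only
* cd "~/Dropbox/MyProject/analysis"              // original path, disabled
cd "~/Dropbox/MyProject/RCT_replication_files"   // new replication folder
use "data_clean/analysis_data.dta", clear
* ... rest of the original analysis code, unchanged ...
```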
2. Check data merge/cleaning:
- If they are separate from the analysis script, copy the original merge/cleaning script(s) into RCT_replication_files/code
- Copy the dataset(s) used for merging/cleaning into RCT_replication_files/data_raw
- Run code without making changes except for pointing the working directory to your new replication folder
- Run the analysis file debugged above on the newly created data file
- If you get different results than in step #1, there is a problem with the merging/cleaning code (see the comparison sketch below)
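When the two datasets have the same structure, Stata's cf command is a quick way to pinpoint where the regenerated file diverges from the copy you used in step #1 (file names here are hypothetical):

```stata
* Compare the dataset rebuilt by the cleaning code to the copy from step #1
use "data_clean/analysis_data_rebuilt.dta", clear
cf _all using "data_clean/analysis_data.dta", verbose   // errors on any mismatch
```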
Once the original code has been copied and debugged, it's time to improve legibility and organization. If you've developed your files with public consumption in mind from the beginning, this process should go quickly. Other resources--see here or here[1]--give a more thorough set of coding best practices, but basic steps include:
1. Anonymize data (if not already done):
- Ensure that no PII is included in datasets that will be public, including name, email, phone number, etc.
- Ensure that individuals are not identifiable based on a combination of other attributes (e.g., if you're surveying teachers and there is only one female, third-grade teacher aged 50-59 at a particular school, then she is not anonymous in your data)
- Move the anonymization process as early as feasible in the data merging/cleaning workflow, so that as much of the data manipulation process as possible can be made public
- Even though the PII data cannot be shared, do include the code that manipulates this restricted data, for transparency, as long as the code itself doesn't compromise anonymity (e.g., censor code that sets the seed for the random draw used to generate new ID numbers, since it could be used to reverse the anonymization); see the sketch after this list
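A minimal Stata sketch of such an anonymization script, assuming hypothetical variable and file names, and flagging the seed line that would be censored in the public version:

```stata
* Anonymize raw data before merging/cleaning
set seed 12345                       // CENSOR in public code: with the seed,
                                     // the ID scramble could be reversed
use "data_raw/survey_raw.dta", clear

* Replace PII with a randomly ordered ID
generate double u = runiform()
sort u
generate anon_id = _n
drop name email phone address u      // hypothetical PII variables

* Flag respondents who are unique on a combination of attributes
bysort school gender grade age_band: generate cell_n = _N
list anon_id if cell_n == 1          // possibly identifiable; recode or coarsen
drop cell_n

save "data_clean/survey_anon.dta", replace
```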
2. Organize and format scripts:
- Create separate scripts for analysis and merging/cleaning code
- Move exploratory analyses, or analyses not used in the paper, to the end of the analysis file--preserving these is good for posterity, but they shouldn't obscure the main results
- Add headers, including the paper's title and authors, the date and creator of the code, the input files the code requires, and the output files it generates (see the template after this list)
- Format scripts so they're easily readable (e.g., indent code, standardize comment syntax)
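A header along the following lines covers those items (the contents are illustrative):

```stata
********************************************************************************
* Paper:    "Effects of X on Y," Author One and Author Two (hypothetical)
* Script:   02_analysis.do -- main tables and figures
* Author:   A. Researcher          Last updated: 2019-06-01
* Inputs:   data_clean/analysis_data.dta
* Outputs:  output/table1.tex, output/figure1.pdf
********************************************************************************
```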
3. Document and annotate code:
- Clearly label code that generates the tables and figures that appear in the paper
- Keep output commands for the tables and figures that appear in the paper and appendix, as long as they all write to the "/output" folder you've created, and comment out output commands for tables and figures not used in the paper
- Give output objects sensible names like "table1"
- Add comments when needed to improve reader understanding; remove comments that are unhelpful (or embarrassing!)
- Label variables and values in Stata (see the example after this list)
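For example (variable and label names hypothetical):

```stata
* Attach human-readable labels to a treatment indicator
label variable treat "Assigned to treatment arm"
label define treat_lbl 0 "Control" 1 "Treated"
label values treat treat_lbl
```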
4. Document folder contents:
- Include a codebook where necessary
- Update the readme file as needed (a skeleton follows this list)
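A readme skeleton along these lines (contents illustrative) is usually enough:

```
README -- RCT_replication_files
Paper:    [title, authors, year]
Software: Stata 15; user-written packages: estout
/code        01_clean_merge.do, 02_analysis.do (run in this order)
/data_raw    raw survey files (PII removed)
/data_clean  analysis_data.dta (created by 01_clean_merge.do)
/output      tables and figures as they appear in the paper
/extra       supplementary materials (e.g., codebook.pdf)
```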
Now that you have cleaned and reorganized the script files, rerun the entire process--including data merging, cleaning, and analysis--to make sure the results are consistent.
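A top-level "master" script makes this final end-to-end run a single command and doubles as documentation of the run order; here is a sketch in Stata with hypothetical script names:

```stata
* master.do -- run the full replication from raw data to final output
clear all
cd "~/Dropbox/MyProject/RCT_replication_files"   // hypothetical path
do "code/01_clean_merge.do"
do "code/02_analysis.do"
```

Once any remaining discrepancies are addressed, the files are ready to send!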
We hope that this serves as a useful resource for other researchers (or their RAs) who need to disseminate data and code for existing projects. Furthermore, we hope that creating awareness about these backend processes will encourage researchers to plan for them as they begin new projects, reducing the time and resources needed to prepare replication files in the future. Questions and feedback are very welcome!
[1] See also J. Scott Long. 2008. The Workflow of Data Analysis Using Stata. Stata Press; and Christopher Gandrud. 2013. Reproducible Research with R and RStudio. Chapman & Hall/CRC.