SenseDoc
The ingest process itself is still just a stub entry.
The SenseDoc is a research-grade multisensor device, made by MobySens, that is used for mobility (GPS) and physical activity (accelerometer) tracking. These data are collected continuously and allow us to measure location-based physical activity and infer transportation mode.
The data fields collected by the SenseDoc can be found in INTERACT's Data Dictionary.
Data is pulled from the individual devices by the regional coordinator, using an extraction tool provided by MobySens. While this extraction includes the proprietary raw data files, they are also coalesced into a SQLite3 DB file (and given a .sdb extension) for more efficient manipulation. Our migration process preserves those raw data files, but the ingest process is built against the SDB files, since they are in a much more consumable form.
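Because the SDB files are ordinary SQLite3 databases, they can be opened with standard tooling for a quick sanity check before ingest. Below is a minimal sketch; the filename is hypothetical and the actual table names written by the firmware are not listed here.

```python
import sqlite3

# Hypothetical path to one extracted SDB file
sdb_path = "SD123fw2099_20190401120000.sdb"

conn = sqlite3.connect(sdb_path)
try:
    # List the tables the device firmware wrote into the file
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        # Row counts give a quick sense of how much telemetry was captured
        count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        print(f"{table}: {count} rows")
finally:
    conn.close()
```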
When a participant has completed their collection period, the SenseDoc is returned to the regional coordinator, who then extracts the data to their local computer. Since the device is always recording, the extracted data may include some movement contributed by the coordinator, either before or after the device was given to the participant. To filter the data properly, the coordinator therefore maintains a start and end timestamp for when the device was actually being worn by the participant, known as the "wear date window." These timestamps are kept in the participant metadata table.
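As an illustration of how the wear date window is applied, the sketch below trims a GPS export to those timestamps. The file and column names are assumptions for illustration only, not the actual SenseDoc schema.

```python
import pandas as pd

# Hypothetical wear date window taken from the participant metadata table
wear_start = pd.Timestamp("2019-05-01 09:00:00")
wear_end = pd.Timestamp("2019-05-10 17:00:00")

# Assumed GPS export with a 'utcdate' timestamp column
gps = pd.read_csv("gps_export.csv", parse_dates=["utcdate"])

# Keep only fixes recorded while the participant was actually wearing the device
worn = gps[gps["utcdate"].between(wear_start, wear_end)]
```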
From time to time during the collection period, batches of these extracted data files are uploaded to ComputeCanada over SSH. Those incoming files are staged in a folder at /projects/def-dfuller/interact/incoming_data/{CITYNAME}/Wave{WAVENUM}/.
Uploaded batches include the telemetry data itself, plus an MD5 checksum* for each file. The checksums are computed on the coordinator's local machine prior to upload, so they can be used by the data manager to verify the success of the upload once the files are received on ComputeCanada. If any verification problems are found, the damaged files are re-uploaded and reverified, and are only marked as accepted once the problem has been corrected and the checksums match.
(*Note: If the uploaded data is in zip or tar archives, an explicit MD5 is not necessary, as the different packing formats contain built-in checksums that can be validated.)
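A minimal sketch of the kind of verification the data manager performs on ComputeCanada, assuming each uploaded file is accompanied by a *.md5 sidecar whose first token is the hex digest (the actual layout of the checksum files may differ):

```python
import hashlib
from pathlib import Path

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

incoming = Path("/projects/def-dfuller/interact/incoming_data")  # batch root
for md5_file in incoming.rglob("*.md5"):
    data_file = md5_file.with_suffix("")          # e.g. foo.sdb.md5 -> foo.sdb
    expected = md5_file.read_text().split()[0]    # digest is the first token
    status = "OK" if md5sum(data_file) == expected else "MISMATCH"
    print(f"{data_file}: {status}")
```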
Once an uploaded file has been accepted, it is given a "provenance sidecar" based on the associated checksum. This sidecar then moves around the system along with the data file itself and can be used at any time to verify that the file has not been altered since it was first produced. Additionally, if a change is made to the file, the sidecar can be updated with a new checksum, as well as an explanation of what changed. This is handled by our ProvLog system, which provides provenance tracking for the lifecycle of all the data files in our pipeline.
Before data is actually ingested, a number of verification and normalization steps are conducted to ensure that the ingest can proceed successfully.
- The participant metadata needs to be placed into the permanent_archive folder
- Look for a file with a name including 'linkage' or 'participation' and the extension '.csv' or '.xlsx' uploaded to the appropriate incoming_data folder by the research coordinator
- This file should be copied to: /projects/def-dfuller/interact/permanent_archive/{CITYNAME}/Wave{WAVENUM}/linkage.csv
- The name "linkage" refers to the association made in that file between the participants and the device ids they were issued
- This step should be done first, because the data from that table is sometimes used to resolve conflicts or missing information when prepping the data files in subsequent steps
- Telemetry files that were uploaded to the incoming_data folder need to be copied to the permanent archive area
- The folder path should be: /projects/def-dfuller/interact/permanent_archive/{CITYNAME}/Wave{WAVENUM}/SenseDoc
- The following steps can be done manually, or by use of the sdb_directory_refactor script (LOCATED WHERE?)
- The SenseDoc folder is then organized into a canonical hierarchy, with data from each contributor stored in a subfolder per device
- Typically, the files extracted from zips are in directories named {IID}, but some contain data from more than one {DEVICEID}
- Consequently, those directories need to be reorganized so that each one holds a unique {IID}_{DEVICEID} pair and is named accordingly (a sketch of this reorganization appears after this list)
- Within each user-device subfolder, we verify that the crucial SDB file is present
- SDB files are named: SD{DEVID}fw{OSVER}_{TIMESTAMP}.sdb
- The normalized permanent_archive files are then added to our ProvLog system, which scans every night to ensure that every file still matches the checksum it was uploaded with and has not been deleted or altered on disk in the course of working with it
- An entire directory can be added to the monitoring system with the command: provlog -T {ROOTDIR}
- If any changes are ever detected in any logged files on disk, a message is sent to the data manager, who investigates and either restores the data files from backup, or updates the ProvLog record to explain the change, thus ensuring a complete manifest of data changes is attached to each contributing file
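The following is not the actual sdb_directory_refactor script, only a hedged sketch of the reorganization it is described as performing: splitting each {IID} directory into one {IID}_{DEVICEID} directory per device, using the device id embedded in the SDB filename. The city/wave path and the assumption that {DEVID} is numeric are illustrative only.

```python
import re
import shutil
from pathlib import Path

# Hypothetical city/wave; the real root follows the template given in the steps above
root = Path("/projects/def-dfuller/interact/permanent_archive/Victoria/Wave2/SenseDoc")

# SDB files are named SD{DEVID}fw{OSVER}_{TIMESTAMP}.sdb; assume DEVID is numeric
SDB_NAME = re.compile(r"^SD(\d+)fw")

# Walk the per-participant directories (skip ones already renamed to IID_DEVICEID)
for iid_dir in [d for d in root.iterdir() if d.is_dir() and "_" not in d.name]:
    iid = iid_dir.name
    for sdb in iid_dir.rglob("*.sdb"):
        match = SDB_NAME.match(sdb.name)
        if not match:
            print(f"Unrecognized SDB name, skipping: {sdb}")
            continue
        device_id = match.group(1)
        target = root / f"{iid}_{device_id}"       # one folder per IID/device pair
        target.mkdir(exist_ok=True)
        shutil.move(str(sdb), str(target / sdb.name))
        # The proprietary raw files that accompany each SDB would be moved the same way
```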
Once the data files are all in the correct place, with the expected names, the data manager can launch the guided process to complete the ingest. The first step is to set up the Jupyter Notebook that will govern the process.
- Getting Jupyter Notebooks set up properly is outside the scope of this document, but we have explored two different configurations to date:
- Running Jupyter Lab directly on the ComputeCanada cluster
- Faster run-times
- Harder to set up
- Prone to frequent delays that can impede efficient workflow
- Running Jupyter Lab on a local machine, with remote SSH mounts to the CC file system and PostgreSQL instance
- Easier to set up
- More responsive workflow
- Slower execution of data-heavy operations
- Set up your environment variables
- $SQL_LOCAL_SERVER and $SQL_LOCAL_PORT will depend on which configuration you chose above for Jupyter Lab
- $SQL_USER should be set to your ComputeCanada userid
- $INGEST_CITY should be the integer code for the city you'll be ingesting (Victoria=1, Vancouver=2, Saskatoon=3, Montreal=4)
- $INGEST_WAVE should be the integer wave number that you'll be ingesting
- Once everything is configured, launch Jupyter Lab and open a copy of Ingest-SenseDoc-Wave2-Protocol.ipynb
The guided process is iterative: you work your way down the series of code blocks in the document, executing each one until it runs cleanly, and then moving on to the next block. Note that every block of executable code is followed by an "after running the block" section that explains what you should see in the output of the previous section, and what to do if problems are reported.
The whole point of normalizing the filenames, paths, and data bundles is to allow the same code to be used each time, regardless of which wave or city is being ingested. The first code block is where those values are initialized from the environment variables, and where a few other frequently used variables are set up.
As a rule, you will not have to change anything here, but there may be special cases where the file structures do not conform precisely to the standard laid out above, in which case you may have to tweak the file paths.
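The notebook itself is authoritative, but as an illustration only, an initialization cell along these lines might read the environment variables described above (the variable names inside the notebook and the city-name mapping used for paths are assumptions):

```python
import os

# Connection settings depend on which Jupyter Lab configuration was chosen above
sql_server = os.environ["SQL_LOCAL_SERVER"]
sql_port = int(os.environ["SQL_LOCAL_PORT"])
sql_user = os.environ["SQL_USER"]      # ComputeCanada userid; never hard-code credentials

# Wave and city being ingested (Victoria=1, Vancouver=2, Saskatoon=3, Montreal=4)
city_id = int(os.environ["INGEST_CITY"])
wave_id = int(os.environ["INGEST_WAVE"])

# Frequently used paths, derived from the normalized layout described above
city_names = {1: "Victoria", 2: "Vancouver", 3: "Saskatoon", 4: "Montreal"}
archive_root = (f"/projects/def-dfuller/interact/permanent_archive/"
                f"{city_names[city_id]}/Wave{wave_id}")
sensedoc_root = f"{archive_root}/SenseDoc"
linkage_csv = f"{archive_root}/linkage.csv"
```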
Important: Never, under any circumstances, code passwords or userids directly into this notebook. Remember that this document is hosted publicly on GitHub. Sharing credentials in this way would be a breach of our privacy protocol.
Run the block and then read the note that follows. Confirm that everything ran as expected before moving on.
- Edit the parameter assignments in the first code block of the notebook to set the wave_id and city_id being ingested
The next few blocks of the notebook will conduct some additional analyses to find gaps and conflicts in the data so they can be fixed prior to ingest.
- All expected files are confirmed to be present and named correctly
- The incoming linkage data is confirmed to be well-formed
- Each expected participant has corresponding telemetry data in the permanent_archive folder
- All telemetry data found in the permanent_archive folder corresponds to a known participant
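A hedged sketch of the last two checks, assuming the linkage file has 'interact_id' and 'sensedoc_id' columns (the real column names may differ) and that the telemetry folders follow the {IID}_{DEVICEID} layout described above:

```python
import pandas as pd
from pathlib import Path

# Hypothetical locations; in the notebook these come from the initialization block
archive_root = Path("/projects/def-dfuller/interact/permanent_archive/Victoria/Wave2")
linkage = pd.read_csv(archive_root / "linkage.csv")
telemetry_dirs = {d.name for d in (archive_root / "SenseDoc").iterdir() if d.is_dir()}

# Expected {IID}_{DEVICEID} folder name for every pairing in the linkage table
# (column names here are assumptions, not the real linkage schema)
expected = {f"{row.interact_id}_{row.sensedoc_id}" for row in linkage.itertuples()}

print("Participants with no telemetry:", sorted(expected - telemetry_dirs))
print("Telemetry with no participant:", sorted(telemetry_dirs - expected))
```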
Any problems found by these tests are reported to the data manager, who then consults with the regional coordinator to resolve the discrepancies. The most common problem is a user in the linkage table who produced no data in the telemetry folders. These are usually cases where a coordinator created a dummy account for testing; since no device was actually worn, there is no telemetry to go with the account. In these cases, the user record must be marked by putting the word 'ignore' in the data_disposition field of the linkage table, which instructs the ingest system to ignore that user record entirely.
Once the validation block of the Jupyter Notebook passes cleanly, reporting no unexpected conditions in the data, the actual ingest blocks can be run.
First, the linkage data will be loaded into the DB table (portal_dev.sensedoc_assignments). This is a straightforward process of...
Then the notebook will proceed to loading the telemetry files.
Loading telemetry files is a bit more complicated...
- The last few sections perform the actual ingest
- In the first pass, the raw telemetry files are loaded into a temporary DB table
- In the next block, that temporary table is cross-linked with the proper IID, based on the mapping from the device id found in the linkage table
- Finally, the cross-linked telemetry data is added to the final telemetry tables (sd_gps, sd_accel, '''and others?''')
- Once the ingest has completed successfully, a few housekeeping tasks are required:
- Delete the temporary tables '''(called?)'''
- Export the Jupyter Notebook as a PDF, which provides a complete record of the ingest process as it happened.
- If any substantive code was changed in the notebook (aside from setting parameters), clear all the output blocks, save the notebook, and commit the changes to the git repo, describing what improvements or corrections were made to the code
- Congratulations, you have now completed an ingest cycle.