A DICOM Anonymization pipeline in a Docker container. This pipeline is designed to anonymize DICOM files according to the EUCAIM standard and includes the following steps:
- Step 1 (Optional): Perform OCR on DICOM pixel data to remove sensitive information (burned-in information).
- Step 2: Deidentify DICOM metadata using the RSNA CTP Anonymizer and the EUCAIM anonymization script.
- Step 3 (Optional): Deidentify clinical data provided in CSV files so that the referenced patient id is anonymized the same way CTP does in Step 2.
You can pull the Docker image from GitHub Container Registry:
docker pull ghcr.io/cbml-forth/eucaim_anon_pipeline
Then you can run the pipeline using the following command, which shows the bare minimum information required to run the pipeline:
docker run -it -v <INPUT-DIR>:/input -v <OUTPUT-DIR>:/output ghcr.io/cbml-forth/eucaim_anon_pipeline run <SITE-ID>
where the options are as follows:
- <INPUT-DIR> is the folder on the local machine where the DICOM files to be anonymized reside. This folder may also contain one or more CSV files with clinical data, so that those data can be properly linked with the anonymized DICOM files (details below).
- <OUTPUT-DIR> is the folder on the local machine where the anonymized DICOM files will be written. If the input folder contained a CSV file, a new CSV with the anonymized clinical data is also written to this folder.
- <SITE-ID> is the SITE-ID provided by the EUCAIM Technical team. It is a mandatory parameter of the pipeline and is used, after hashing, as the "provider id".
There are more options that can be specified in the command line. To see the list of available options, please run:
docker run -it ghcr.io/cbml-forth/eucaim_anon_pipeline run --help
which should return the following:
Usage: run [OPTIONS] SITE_ID [INPUT_DIR] [OUTPUT_DIR]
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────╮
│ *    site_id         TEXT          The SITE-ID provided by the EUCAIM Technical team [required]  │
│      input_dir       [INPUT_DIR]   Input directory to read DICOM files from [default: /input]    │
│      output_dir      [OUTPUT_DIR]  Output directory to write processed DICOM files to            │
│                                    [default: /output]                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --ctp           --no-ctp                    Perform deidentification in the DICOM metadata in    │
│                                             image files. Uses the RSNA CTP anonymizer and the    │
│                                             custom script                                        │
│                                             [default: ctp]                                       │
│ --ocr                                       Perform OCR (using Tesseract OCR)                    │
│ --paddle-ocr                                Perform OCR using PaddleOCR                          │
│ --threads                          INTEGER  Number of threads that RSNA CTP and PaddleOCR (if    │
│                                             enabled) will use                                    │
│                                             [default: 10]                                        │
│ --secret                           TEXT     Use the supplied key as the secret key for the       │
│                                             anonymization                                        │
│ --hierarchical  --no-hierarchical           Output files will be organized into a hierarchical   │
│                                             Patient / Study / Series folder structure using the  │
│                                             anonymized UIDs as the folder names                  │
│                                             [default: hierarchical]                              │
│ --verbose       -v                          Enable verbose logging                               │
│ --version       -V                          Print version information                            │
│ --help                                      Show this message and exit.                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
- Option --ctp (default) anonymizes the DICOM files using the RSNA CTP tool. Supplying the --no-ctp option disables this step.
- Passing --ocr or --paddle-ocr enables the Optical Character Recognition (OCR) feature for redacting "burned-in" text in the raw images. Please note that by default no OCR will run! The --ocr option runs Tesseract OCR and the --paddle-ocr option runs PaddleOCR. PaddleOCR seems to be more accurate than Tesseract OCR, but it is also slower and requires more resources.
- --threads specifies the number of threads that RSNA CTP and PaddleOCR (if enabled) will use. It can speed up the pipeline when it runs on a multi-core CPU. By default, it is set to 10.
- --hierarchical (default) organizes the anonymized DICOM files into a hierarchical folder structure based on the patient ID, study ID, and series ID. Each output DICOM file also gets a name consisting of digits from an auto-numbering system, e.g. 00001.dcm, 00002.dcm, etc. We suggest always keeping this option in the default --hierarchical mode, because it makes the output folder structure more organized and, more importantly, it ensures that no sensitive information is leaked through the folder and file names.
- -v (or --verbose) enables verbose mode, which prints more detailed information about the progress of the pipeline. In particular, the secret key used for the anonymization of the DICOM metadata is printed to the console.
- --secret <SECRET> passes the secret key to be used for the anonymization of the DICOM metadata. This allows a cohort of patients to be anonymized consistently across multiple anonymization runs. You can get a "good" secret key either by running the pipeline once with the --verbose option or by using the utils secret subcommand explained further below.
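To make the --hierarchical layout concrete, here is a minimal Python sketch of how an output path could be composed. The function name output_path is hypothetical, and it assumes that the auto-numbering is 1-based, zero-padded to five digits (as in the 00001.dcm example), and restarts inside each series folder; the pipeline's real numbering scheme is not documented here.

```python
from pathlib import PurePosixPath


def output_path(root, patient_id, study_uid, series_uid, index):
    """Compose <root>/<patient>/<study>/<series>/<NNNNN>.dcm.

    All identifiers are expected to be the already anonymized values.
    ASSUMPTION: 1-based, five-digit numbering inside each series folder.
    """
    if index < 1:
        raise ValueError("index is expected to be 1-based")
    return PurePosixPath(root) / patient_id / study_uid / series_uid / f"{index:05d}.dcm"


# Shortened, made-up UIDs used only for illustration.
print(output_path("/output", "3209648408", "1.2.3", "1.2.3.4", 1))
```

Because only anonymized identifiers and a running number appear in the path, nothing from the original file or folder names is carried over.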
Important
Passing all these parameters on the command line can be intimidating for the uninitiated user. For this reason we also provide a desktop application with a graphical user interface that lets the user specify these parameters and get back the Docker command to run.
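As a rough illustration of how the documented options map onto a command line (this is not the desktop application's code), a small Python helper could assemble the docker run argument list. The function name build_run_command and its keyword arguments are hypothetical; only the image name and the flags are taken from the help output above.

```python
import shlex

IMAGE = "ghcr.io/cbml-forth/eucaim_anon_pipeline"


def build_run_command(site_id, input_dir, output_dir, *, ocr=None, threads=None,
                      secret=None, ctp=True, hierarchical=True, verbose=False):
    """Return the docker command as a list of arguments.

    ocr is a hypothetical selector: None (no OCR), "tesseract" or "paddle".
    """
    if ocr not in (None, "tesseract", "paddle"):
        raise ValueError("ocr must be None, 'tesseract' or 'paddle'")
    cmd = ["docker", "run", "-it",
           "-v", f"{input_dir}:/input",
           "-v", f"{output_dir}:/output",
           IMAGE, "run", site_id]
    if not ctp:
        cmd.append("--no-ctp")
    if ocr == "tesseract":
        cmd.append("--ocr")
    elif ocr == "paddle":
        cmd.append("--paddle-ocr")
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if secret:
        cmd += ["--secret", secret]
    if not hierarchical:
        cmd.append("--no-hierarchical")
    if verbose:
        cmd.append("--verbose")
    return cmd


# Placeholder site id and folders, for illustration only.
print(shlex.join(build_run_command("MY-SITE", "/data/in", "/data/out",
                                   ocr="paddle", threads=4)))
```

Returning a list rather than a string keeps the arguments safe to pass to subprocess.run without shell quoting issues; shlex.join is used only for display.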
PaddleOCR supports multiple models for text detection, text recognition, etc. By default, this Docker image includes the "lite" (mobile) models of PP-OCRv5: PP-OCRv5_mobile_det for text detection and PP-OCRv5_mobile_rec for text recognition, as can be seen in the integrated PaddleOCR.yaml file. To use other models, such as the more complex and accurate "server" models, create your own YAML file with the desired models (by copying the PaddleOCR.yaml file and modifying it) and then run the docker run command with this new YAML file on the host machine mounted as /app/PaddleOCR.yaml, like so:
docker run -it -v <INPUT-DIR>:/input -v <OUTPUT-DIR>:/output -v <PADDLEOCR_YAML_FILE>:/app/PaddleOCR.yaml ghcr.io/cbml-forth/eucaim_anon_pipeline run <SITE-ID> --paddle-ocr
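One possible way to derive such a custom YAML file is a plain text substitution of the model names. The sketch below assumes that the mobile model names appear literally in your copy of PaddleOCR.yaml and that PP-OCRv5_server_det and PP-OCRv5_server_rec are the models you want. The helper name make_server_config is hypothetical, and other fields in the file (for example model directories) may also need manual changes, so review the result before mounting it.

```python
from pathlib import Path

# Mobile -> server model names of PP-OCRv5.
REPLACEMENTS = {
    "PP-OCRv5_mobile_det": "PP-OCRv5_server_det",
    "PP-OCRv5_mobile_rec": "PP-OCRv5_server_rec",
}


def make_server_config(src_yaml, dst_yaml):
    """Copy src_yaml to dst_yaml, swapping the mobile model names for the
    server ones. Returns how many occurrences were replaced."""
    text = Path(src_yaml).read_text(encoding="utf-8")
    replaced = 0
    for old, new in REPLACEMENTS.items():
        replaced += text.count(old)
        text = text.replace(old, new)
    Path(dst_yaml).write_text(text, encoding="utf-8")
    return replaced
```

A return value of 0 means that no model name was found, in which case the file probably needs to be edited by hand.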
In case there are additional (clinical) data for the patients for which the anonymization is performed, it is recommended to provide the data in one or more CSV files in the same input directory that contains the DICOM files. This is needed so that the patient ids mentioned in the CSV file are replaced with anonymized patient ids so that they are consistent with the anonymized DICOM files.
Note: The CSVs should have a .csv file extension and be located directly in the input directory, not in a subdirectory!
To accommodate cases where the clinical data have been exported to multiple CSV files, the pipeline automatically processes all CSV files found in the input directory except those whose names start with the prefix _ (underscore). So a CSV with file name clinical_data.csv will be processed (hashed, as explained below), whereas a CSV with file name _clinical_data.csv will just be copied verbatim to the output directory.
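The selection rule can be sketched in a few lines of Python. This is an illustration of the rule as described, not the pipeline's own code: the function name classify_csvs is hypothetical, and the sketch assumes that the extension is matched exactly as .csv (lower case).

```python
from pathlib import Path


def classify_csvs(input_dir):
    """Split the CSV files located directly in input_dir into two lists of
    file names: those to be hashed and those to be copied verbatim."""
    to_hash, to_copy = [], []
    for path in sorted(Path(input_dir).iterdir()):  # top level only, no subdirectories
        if not path.is_file() or path.suffix != ".csv":
            continue
        if path.name.startswith("_"):
            to_copy.append(path.name)   # e.g. _clinical_data.csv
        else:
            to_hash.append(path.name)   # e.g. clinical_data.csv
    return to_hash, to_copy
```

Calling classify_csvs on a folder that holds clinical_data.csv and _clinical_data.csv would return the first name in the "hash" list and the second in the "copy" list; CSV files inside subfolders are ignored.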
The CSVs to be processed (hashed) are assumed to have the following format:
- The first line of the file is assumed to be a header line containing the column names
- The first column should contain the patientID
You can see an example input CSV of this format here
Important
A CSV file named dcm_studies_metadata.csv is handled specially. It is assumed to contain information related to the DICOM studies referenced in the supplied DICOM files. An example of this would be to associate the DICOM studies with particular "timepoints" (e.g. "Diagnosis", "Treatment", "Follow-up") of the patients. To preserve this association after the anonymization, the CSV file should have the PatientID as the 1st column and the Study Instance UID as the 2nd column, followed by any additional columns (e.g. Timepoint). The pipeline will hash the contents of this file in the same way, so that the output dcm_studies_metadata.csv file will have the anonymized PatientID and Study UIDs in the first 2 columns, followed by the unmodified values of the other columns of the original file. This input dcm_studies_metadata.csv file is also assumed to contain the column names in the first line, but the actual column names are ignored.
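Both the general rule (first column hashed) and the dcm_studies_metadata.csv case (first two columns hashed) follow the same pattern, which the Python sketch below illustrates. The names hash_leading_columns and placeholder_hash are hypothetical. The mapping function is passed in as a parameter because the real pipeline mirrors what CTP does in Step 2; the truncated SHA-256 used here is only a placeholder and does not, for instance, produce valid DICOM UIDs.

```python
import csv
import hashlib
import io


def hash_leading_columns(csv_text, mapper, id_columns=1):
    """Apply mapper to the first id_columns values of every data row.
    The header line and all remaining columns are kept unchanged."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = next(reader, None)
    if header is None:
        return ""
    writer.writerow(header)  # column names are copied, never interpreted
    for row in reader:
        if not row:
            continue  # skip blank lines
        writer.writerow([mapper(v) for v in row[:id_columns]] + row[id_columns:])
    return out.getvalue()


def placeholder_hash(value):
    # PLACEHOLDER ONLY: not the algorithm used by CTP or by this pipeline.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


sample = ("PatientID,StudyInstanceUID,Timepoint\n"
          "P001,1.2.3,Diagnosis\n"
          "P001,1.2.4,Follow-up\n")
# id_columns=2 corresponds to the dcm_studies_metadata.csv layout.
print(hash_leading_columns(sample, placeholder_hash, id_columns=2))
```

Since the same input value always maps to the same output, both rows of patient P001 still share one identifier after the transformation, and the Timepoint column keeps its link to the (now anonymized) study.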
In addition to the run command that runs the DICOM de-identification pipeline as explained above, there is also a utils command that offers additional functionality.
As usual, you can use the --help option:
docker run -it ghcr.io/cbml-forth/eucaim_anon_pipeline utils --help
which shows the available utilities:
Usage: utils [OPTIONS] COMMAND [ARGS]...
Additional utilities
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ───────────────────────────────────────────────────────────────────────────────────────╮
│ secret       Create a new 'secret' key to use for anonymization                                  │
│ series-info  Extract and print the unique Series descriptions from input DICOM files             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
So the following complete command:
docker run -it ghcr.io/cbml-forth/eucaim_anon_pipeline utils secret
will write a string like 019a39ba16da7edb9e906440a48e9ed32 to the console, which can then be passed as the secret key to the run command through the --secret option.
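Why reusing one secret matters can be illustrated with a keyed hash. The Python sketch below is a stand-in built on assumptions: it generates a key with secrets.token_hex and derives pseudonyms with HMAC-SHA256, which is not necessarily how utils secret or the pipeline work internally. The names new_secret and pseudonym are hypothetical.

```python
import hashlib
import hmac
import secrets


def new_secret():
    # STAND-IN generator: 32 random hexadecimal characters.
    return secrets.token_hex(16)


def pseudonym(patient_id, secret):
    # STAND-IN keyed hash: the output depends on both the id and the secret.
    digest = hmac.new(secret.encode("utf-8"), patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


secret = new_secret()
first_run = pseudonym("PAT-001", secret)
second_run = pseudonym("PAT-001", secret)        # later run, same secret
other_key = pseudonym("PAT-001", new_secret())   # later run, new secret

print(first_run == second_run)  # the same secret reproduces the pseudonym
print(first_run == other_key)   # a fresh secret almost certainly does not
```

With a scheme of this kind, data anonymized in separate runs can only be linked per patient if every run receives the same secret, which is what the --secret option is for.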
The utils series-info command can be used to get an overview of all the DICOM series that can be found in an input folder. It presents a table as shown below. The series information is summarized according to the Series Description tag (0008,103E) so each row is a unique description that can be found in multiple DICOM series of different studies and patients. The Modalities column presents the different DICOM modalities found that have the specific description, whereas the other columns show the count of studies, patients, and series. The total numbers of DICOM patients, series, studies, and instances (files) found are also shown right after the table.
Series information (Series are grouped by their descriptions)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Series Description            ┃ Modalities ┃ Studies count ┃ Patients count ┃ Series count ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ DBT slices                    │ MG         │ 3             │ 1              │ 3            │
│ LIVER-PELVIS/HASTE_AXIAL_P    │ MR         │ 1             │ 1              │ 1            │
│ PARENCHYMAL PHASE Sep1999     │ CT         │ 2             │ 1              │ 2            │
│ t2_spc_rst_axial obl_Prostate │ MR         │ 1             │ 1              │ 1            │
└───────────────────────────────┴────────────┴───────────────┴────────────────┴──────────────┘
Total count of unique Patients: 4
Total count of unique Studies: 7
Total count of unique Series: 7
Total count of DICOM files: 7
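The grouping behind this table can be sketched in Python as follows. This is an illustration rather than the tool's implementation: it works on plain dictionaries instead of DICOM files, the function name summarize_series and the record keys are hypothetical, and joining several modalities with a comma is an assumption.

```python
from collections import defaultdict


def summarize_series(series):
    """Group series records by description and count the unique studies,
    patients and series for each description.

    Each record is a dict with the (hypothetical) keys patient_id,
    study_uid, series_uid, modality and description.
    """
    groups = defaultdict(lambda: {"modalities": set(), "studies": set(),
                                  "patients": set(), "series": set()})
    for rec in series:
        group = groups[rec["description"]]
        group["modalities"].add(rec["modality"])
        group["studies"].add(rec["study_uid"])
        group["patients"].add(rec["patient_id"])
        group["series"].add(rec["series_uid"])
    return [
        (desc, ", ".join(sorted(g["modalities"])),
         len(g["studies"]), len(g["patients"]), len(g["series"]))
        for desc, g in sorted(groups.items())
    ]


# Made-up short UIDs, loosely following the example table.
records = [
    {"patient_id": "571403367", "study_uid": "1.1", "series_uid": "1.1.1",
     "modality": "MG", "description": "DBT slices"},
    {"patient_id": "571403367", "study_uid": "1.2", "series_uid": "1.2.1",
     "modality": "MG", "description": "DBT slices"},
    {"patient_id": "3209648408", "study_uid": "2.1", "series_uid": "2.1.1",
     "modality": "CT", "description": "PARENCHYMAL PHASE Sep1999"},
]
for row in summarize_series(records):
    print(row)
```

Sets are used so that a study or patient that contributes several series to the same description is counted only once.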
If you supply the --csv option to utils series-info, it will instead print the series information (shown in the table above) in CSV format.
If you supply the --ungrouped option to utils series-info, it will instead print the series information "ungrouped", i.e. each row represents a single series and the rows are sorted by PatientID, StudyUID, and SeriesUID (in that order), like so:
Series information
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ PatientID  ┃ StudyUID                                                         ┃ SeriesUID                                                        ┃ Modality ┃ SeriesDescription             ┃ ImageCount ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ 3209648408 │ 1.2.826.0.1.3680043.8.498.10492038307422868223199863260233355278 │ 1.2.826.0.1.3680043.8.498.34901162804853023415490754909481583213 │ CT       │ PARENCHYMAL PHASE Sep1999     │ 1          │
│ 3209648408 │ 1.2.826.0.1.3680043.8.498.15325708537294505104277328793123984041 │ 1.2.826.0.1.3680043.8.498.21727307543319369287645065784568861179 │ CT       │ PARENCHYMAL PHASE Sep1999     │ 1          │
│ 571403367  │ 1.2.826.0.1.3680043.8.498.11930027078857085215653760141431432752 │ 1.2.826.0.1.3680043.8.498.11389650391405144789117233891221888210 │ MG       │ DBT slices                    │ 1          │
│ 571403367  │ 1.2.826.0.1.3680043.8.498.43140369966073420105378776118739847239 │ 1.2.826.0.1.3680043.8.498.94202333078444804735974466471131425254 │ MG       │ DBT slices                    │ 1          │
│ 571403367  │ 1.2.826.0.1.3680043.8.498.60031442536880637581306951540659454726 │ 1.2.826.0.1.3680043.8.498.10108928214392221999942909773938492911 │ MG       │ DBT slices                    │ 1          │
│ 8732322741 │ 1.2.826.0.1.3680043.8.498.11505123464109404670942682899455583584 │ 1.2.826.0.1.3680043.8.498.42020251536922680292646612864203256535 │ MR       │ t2_spc_rst_axial obl_Prostate │ 1          │
│ 9894340694 │ 1.2.826.0.1.3680043.8.498.10976157236759544945657408266559980502 │ 1.2.826.0.1.3680043.8.498.74608000754336619565503767283924990632 │ MR       │ LIVER-PELVIS/HASTE_AXIAL_P    │ 1          │
└────────────┴──────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────┴──────────┴───────────────────────────────┴────────────┘
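The ordering of the ungrouped rows, and a CSV rendering of them, can be sketched in Python as below. The function name ungrouped_csv is hypothetical, the column names are simply taken from the table header, and whether the tool's own CSV output uses exactly this layout is an assumption. Note that the example table appears to sort PatientID as text (3209648408 comes before 571403367), which is what a plain string comparison gives.

```python
import csv
import io

COLUMNS = ["PatientID", "StudyUID", "SeriesUID", "Modality",
           "SeriesDescription", "ImageCount"]


def ungrouped_csv(rows):
    """Sort one-row-per-series dicts by PatientID, StudyUID and SeriesUID
    (compared as strings) and render them as CSV text."""
    ordered = sorted(rows, key=lambda r: (r["PatientID"], r["StudyUID"], r["SeriesUID"]))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(ordered)
    return out.getvalue()


# Made-up short UIDs for illustration.
rows = [
    {"PatientID": "571403367", "StudyUID": "1.2", "SeriesUID": "1.2.1",
     "Modality": "MG", "SeriesDescription": "DBT slices", "ImageCount": 1},
    {"PatientID": "3209648408", "StudyUID": "2.1", "SeriesUID": "2.1.1",
     "Modality": "CT", "SeriesDescription": "PARENCHYMAL PHASE Sep1999", "ImageCount": 1},
    {"PatientID": "571403367", "StudyUID": "1.1", "SeriesUID": "1.1.1",
     "Modality": "MG", "SeriesDescription": "DBT slices", "ImageCount": 1},
]
print(ungrouped_csv(rows))
```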
This tool makes use of the following tools and packages:
- RSNA CTP tool is used for the anonymization of the DICOM metadata (DICOM Tags in the DICOM header). The anonymization script included in this repository conforms to CTP's scripting language.
- Microsoft's Presidio is used for redacting Personally Identifiable Information (PII) text from the DICOM images.
- PaddleOCR as an alternative OCR engine and related models.
- pydicom for reading and writing DICOM files.
This software is provided by the Computational BioMedicine Laboratory (CBML), FORTH-ICS under the terms of the European Union Public License (EUPL) 1.2. It is distributed in the hope that it will be useful, but without any warranty, not even the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.
CBML and its contributors accept no responsibility or liability for any loss, damage, or legal issues arising from the use, misuse, or inability to use this software. Users are solely responsible for ensuring that any anonymization performed with this tool meets applicable legal, regulatory, and institutional requirements (including those related to patient data protection and privacy).
Use this tool at your own risk.