GitHub - cbml-forth/lethe_anon_pipeline: DICOM anonymization pipeline


                         ██╗     ███████╗████████╗██╗  ██╗███████╗
                         ██║     ██╔════╝╚══██╔══╝██║  ██║██╔════╝
                         ██║     █████╗     ██║   ███████║█████╗
                         ██║     ██╔══╝     ██║   ██╔══██║██╔══╝
                         ███████╗███████╗   ██║   ██║  ██║███████╗
                         ╚══════╝╚══════╝   ╚═╝   ╚═╝  ╚═╝╚══════╝

DICOM Anonymization Pipeline in a Box

A DICOM Anonymization pipeline in a Docker container. This pipeline is designed to anonymize DICOM files according to the EUCAIM standard and includes the following steps:

Step 1 (Optional): Perform OCR on DICOM pixel data to remove sensitive information (burned-in information).
Step 2: Deidentify DICOM metadata using the RSNA CTP Anonymizer and the EUCAIM anonymization script.
Step 3 (Optional): Deidentify clinical data provided in CSV files so that the referenced patient ID is anonymized in the same way as CTP anonymizes it in Step 2.

Usage

You can pull the Docker image from GitHub Container Registry:

docker pull ghcr.io/cbml-forth/eucaim_anon_pipeline

Then you can run the pipeline using the following command, which shows the bare minimum information required to run the pipeline:

docker run -it -v <INPUT-DIR>:/input -v <OUTPUT-DIR>:/output ghcr.io/cbml-forth/eucaim_anon_pipeline run <SITE-ID>

where the options are as follows:

<INPUT-DIR> is the folder on the local machine that contains the DICOM files to be anonymized. This folder may also contain CSV files with clinical data, so that those data can be properly linked with the anonymized DICOM files (details below).
<OUTPUT-DIR> is the folder on the local machine where the anonymized DICOM files will be written. If the input folder contains a clinical data CSV, a new CSV with the anonymized clinical data is also written to this folder.
<SITE-ID> is the site ID provided by the EUCAIM Technical team. It is a mandatory parameter, which the pipeline hashes and then uses as the "provider id".
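For users who script their runs, the minimal command above can be assembled programmatically. The following is only a sketch: the function name build_run_command and the example paths and site ID are placeholders chosen for illustration and are not part of the pipeline. It only builds and prints the command; it does not start Docker.

```python
import shlex
from pathlib import Path

IMAGE = "ghcr.io/cbml-forth/eucaim_anon_pipeline"


def build_run_command(input_dir, output_dir, site_id, image=IMAGE):
    """Assemble the minimal `docker run` invocation as an argument list."""
    # Absolute host paths make sure Docker treats the values as bind mounts
    # and not as named volumes.
    host_in = Path(input_dir).resolve()
    host_out = Path(output_dir).resolve()
    return [
        "docker", "run", "-it",
        "-v", f"{host_in}:/input",
        "-v", f"{host_out}:/output",
        image, "run", site_id,
    ]


# Placeholder values, to be replaced with real folders and the real site ID.
cmd = build_run_command("dicom_in", "dicom_out", "MY-SITE-ID")
print(shlex.join(cmd))
```

Note that -it expects an interactive terminal, so the printed command is meant to be pasted into a shell and not launched from a background job.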

There are more options that can be specified in the command line. To see the list of available options, please run:

docker run -it ghcr.io/cbml-forth/eucaim_anon_pipeline run --help

which should return the following:

 Usage: run [OPTIONS] SITE_ID [INPUT_DIR] [OUTPUT_DIR]

╭─ Arguments ───────────────────────────────────────────────────────────────────────────────────────────╮
│ *    site_id         TEXT          The SITE-ID provided by the EUCAIM Technical team [required]       │
│      input_dir       [INPUT_DIR]   Input directory to read DICOM files from [default: /input]         │
│      output_dir      [OUTPUT_DIR]  Output directory to write processed DICOM files to                 │
│                                    [default: /output]                                                 │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────╮
│ --ctp               --no-ctp                      Perform deidentification in the DICOM metadata in   │
│                                                   image files. Uses the RSNA CTP anonymizer and the   │
│                                                   custom script                                       │
│                                                   [default: ctp]                                      │
│ --ocr                                             Perform OCR (using Tesseract OCR)                   │
│ --paddle-ocr                                      Perform OCR using PaddleOCR                         │
│ --threads                                INTEGER  Number of threads that RSNA CTP and PaddleOCR (if   │
│                                                   enabled) will use                                   │
│                                                   [default: 10]                                       │
│ --secret                                 TEXT     Use the supplied key as the secret key for the      │
│                                                   anonymization                                       │
│ --hierarchical      --no-hierarchical             Output files will be organized into a hierarchical  │
│                                                   Patient / Study / Series folder structure using the │
│                                                   anonymized UIDs as the folder names                 │
│                                                   [default: hierarchical]                             │
│ --verbose       -v                                Enable verbose logging                              │
│ --version       -V                                Print version information                           │
│ --help                                            Show this message and exit.                         │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────╯

--ctp (default) anonymizes the DICOM metadata using the RSNA CTP tool. Supplying --no-ctp disables this step.
--ocr or --paddle-ocr enables Optical Character Recognition (OCR) for redacting "burned-in" text in the raw images. Please note that by default no OCR will run! --ocr runs Tesseract OCR and --paddle-ocr runs PaddleOCR. PaddleOCR seems to be more accurate than Tesseract OCR, but it is also slower and requires more resources.
--threads sets the number of threads that RSNA CTP and PaddleOCR (if enabled) will use. Raising it can speed up the pipeline when it runs on a multi-core CPU. The default is 10.
--hierarchical (default) organizes the anonymized DICOM files into a hierarchical folder structure based on the anonymized patient ID, study UID, and series UID. Each output DICOM file also gets a name consisting of digits from an auto-numbering system, e.g. 00001.dcm, 00002.dcm, etc. We recommend always keeping the default --hierarchical mode: it keeps the output folder organized and, more importantly, it ensures that no sensitive information is leaked through the folder and file names.
-v (or --verbose) enables verbose mode, which prints more detailed information about the progress of the pipeline. In particular, the secret key used for the anonymization of the DICOM metadata is printed to the console.
--secret <SECRET> passes the secret key to be used for the anonymization of the DICOM metadata. This allows a cohort of patients to be anonymized consistently across multiple anonymization runs. You can get a "good" secret key either by running the pipeline once with the --verbose option or by using the utils secret subcommand explained further below.
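To picture the layout that --hierarchical produces, here is a small sketch that derives Patient / Study / Series paths with zero-padded file names. It is not the pipeline's code: the function name and the identifiers are made up, and restarting the numbering in every series folder is an assumption of this sketch, since the description above does not say where the counter resets.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def hierarchical_paths(instances):
    """Map (patient_id, study_uid, series_uid) triples to output paths.

    The identifiers are expected to be already anonymized. Numbering
    restarts in each series folder (an assumption of this sketch).
    """
    counters = defaultdict(int)
    paths = []
    for patient_id, study_uid, series_uid in instances:
        key = (patient_id, study_uid, series_uid)
        counters[key] += 1
        filename = f"{counters[key]:05d}.dcm"
        paths.append(PurePosixPath(patient_id, study_uid, series_uid, filename))
    return paths


# Made-up identifiers: two files in one series, one file in another.
demo = [
    ("ANON-P1", "1.2.3", "1.2.3.1"),
    ("ANON-P1", "1.2.3", "1.2.3.1"),
    ("ANON-P1", "1.2.3", "1.2.3.2"),
]
demo_paths = [str(p) for p in hierarchical_paths(demo)]
for p in demo_paths:
    print(p)
# ANON-P1/1.2.3/1.2.3.1/00001.dcm
# ANON-P1/1.2.3/1.2.3.1/00002.dcm
# ANON-P1/1.2.3/1.2.3.2/00001.dcm
```

Because every path component is either an anonymized identifier or a counter, nothing from the original file or folder names survives in the output tree.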

Important

Passing all these parameters on the command line can be intimidating for the uninitiated user. For this reason, we also provide a desktop application with a graphical user interface that lets the user specify these parameters and get back the Docker command to run.

PaddleOCR models

PaddleOCR supports multiple models for text detection, text recognition, etc. By default, this Docker image includes the "lite" (mobile) models of PP-OCRv5: PP-OCRv5_mobile_det for text detection and PP-OCRv5_mobile_rec for text recognition, as can be seen in the integrated PaddleOCR.yaml file. To use other models, such as the more complex and accurate "server" models, create your own YAML file with the desired models (by copying the PaddleOCR.yaml file and modifying it). Then run the docker run command with this new YAML file on the host machine mounted as /app/PaddleOCR.yaml, like so:

docker run -it -v <INPUT-DIR>:/input -v <OUTPUT-DIR>:/output -v <PADDLEOCR_YAML_FILE>:/app/PaddleOCR.yaml ghcr.io/cbml-forth/eucaim_anon_pipeline run <SITE-ID> --paddle-ocr

Clinical data

If there are additional (clinical) data for the patients being anonymized, we recommend providing them in one or more CSV files in the same input directory that contains the DICOM files. The pipeline can then replace the patient IDs mentioned in the CSV files with the anonymized patient IDs, so that they stay consistent with the anonymized DICOM files.

Note: The CSVs should have a .csv file extension and be located directly in the input directory, not in a subdirectory!

To accommodate cases where the clinical data have been exported to multiple CSV files, the pipeline automatically processes all CSV files found in the input directory, except those whose names start with the prefix _ (underscore). So a CSV named clinical_data.csv will be processed (hashed, as explained below), whereas a CSV named _clinical_data.csv will just be copied verbatim to the output directory.

The CSVs to be processed (hashed) are assumed to have the following format:
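The selection rule just described can be sketched in a few lines. This is an illustration and not the pipeline's implementation: the function name is hypothetical, and matching the extension as exactly ".csv" (case-sensitive) is an assumption.

```python
import tempfile
from pathlib import Path


def classify_csv_files(input_dir):
    """Split the top-level CSV files of `input_dir` into (to_hash, to_copy)."""
    to_hash, to_copy = [], []
    # iterdir() does not recurse, so CSVs in subdirectories are ignored.
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file() or path.suffix != ".csv":
            continue
        if path.name.startswith("_"):
            to_copy.append(path.name)  # copied verbatim to the output
        else:
            to_hash.append(path.name)  # patient IDs get anonymized
    return to_hash, to_copy


# Build a throwaway input folder to show the rule in action.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "clinical_data.csv").write_text("patient_id,age\n")
    (root / "_clinical_data.csv").write_text("patient_id,age\n")
    (root / "image.dcm").write_bytes(b"")
    (root / "nested").mkdir()
    (root / "nested" / "ignored.csv").write_text("patient_id,age\n")
    to_hash, to_copy = classify_csv_files(root)

print(to_hash)  # ['clinical_data.csv']
print(to_copy)  # ['_clinical_data.csv']
```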

The first line of the file is assumed to be a header line containing the column names
The first column should contain the patientID

An example input CSV of this format is provided in the repository as example_clinical.csv.
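The expected CSV shape (a header line, then rows whose first column is the patient ID) can be illustrated with the sketch below. The pseudonymize callable is a stand-in: the real mapping is the one CTP applies in Step 2 and is not reproduced here. The column names and IDs are invented for the example.

```python
import csv
import io


def anonymize_first_column(csv_text, pseudonymize):
    """Return CSV text whose first column is passed through `pseudonymize`.

    The header row is copied unchanged and all other columns are kept as-is.
    """
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = next(reader, None)
    if header is None:  # empty input: nothing to do
        return ""
    writer.writerow(header)
    for row in reader:
        if not row:  # skip blank lines
            continue
        writer.writerow([pseudonymize(row[0]), *row[1:]])
    return out.getvalue()


# Hypothetical lookup table standing in for the real anonymized IDs.
fake_ids = {"PAT001": "ANON-A", "PAT002": "ANON-B"}
source = "patient_id,age,diagnosis\nPAT001,63,NSCLC\nPAT002,58,HCC\n"
result = anonymize_first_column(source, fake_ids.__getitem__)
print(result)
# patient_id,age,diagnosis
# ANON-A,63,NSCLC
# ANON-B,58,HCC
```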

Important

A CSV file named dcm_studies_metadata.csv is handled specially. It is assumed to contain information related to the DICOM studies referenced in the supplied DICOM files. An example would be associating the DICOM studies with particular "timepoints" (e.g. "Diagnosis", "Treatment", "Follow-up") of the patients. To preserve this association after the anonymization, the CSV file should have the PatientID as the 1st column and the Study Instance UID as the 2nd column, followed by any additional columns (e.g. Timepoint). The pipeline hashes the contents of this file in the same way, so that the output dcm_studies_metadata.csv file has the anonymized PatientID and Study UIDs in the first 2 columns, followed by the values of the other columns of the original file, unmodified. The input dcm_studies_metadata.csv file is also assumed to contain the column names in its first line, but the actual column names do not matter.

Utilities

In addition to the run command that runs the DICOM de-identification pipeline as explained above, there is also a utils command that offers additional functionality.

As usual, you can use the --help option:

docker run -it ghcr.io/cbml-forth/eucaim_anon_pipeline utils --help

which shows the available utilities:

 Usage: utils [OPTIONS] COMMAND [ARGS]...

 Additional utilities

╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                                        │
╰────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────────────────╮
│ secret        Create a new 'secret' key to use for anonymization                                   │
│ series-info   Extract and print the unique Series descriptions from input DICOM files              │
╰────────────────────────────────────────────────────────────────────────────────────────────────────╯

So the following complete command:

docker run -it ghcr.io/cbml-forth/eucaim_anon_pipeline utils secret

prints to the console a string like 019a39ba16da7edb9e906440a48e9ed32, which can be used as the secret key in the run pipeline command.
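Why reusing one secret key keeps a cohort consistent across runs can be shown with a keyed hash. The sketch below uses HMAC-SHA256 purely as an illustration. It is NOT the algorithm used by RSNA CTP or by this pipeline, and the key strings and patient ID are made up.

```python
import hashlib
import hmac


def pseudonym(secret_key, patient_id):
    """Keyed hash of a patient ID (illustrative only, not the pipeline's algorithm)."""
    digest = hmac.new(
        secret_key.encode("utf-8"),
        patient_id.encode("utf-8"),
        hashlib.sha256,
    )
    return digest.hexdigest()[:16]


# Stand-ins for keys such as the one printed by `utils secret`.
key_a = "first-example-secret"
key_b = "second-example-secret"

# Same key and same input: the pseudonym is identical in every run.
same_key_matches = pseudonym(key_a, "PAT001") == pseudonym(key_a, "PAT001")
# Different key: the pseudonym changes, so the two runs no longer link up.
other_key_differs = pseudonym(key_a, "PAT001") != pseudonym(key_b, "PAT001")

print(same_key_matches, other_key_differs)
```

The practical consequence is the same as for the real pipeline: keep the key used for the first run and pass it again with --secret whenever more data of the same cohort are anonymized.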

The utils series-info command can be used to get an overview of all the DICOM series that can be found in an input folder. It presents a table as shown below. The series information is summarized according to the Series Description tag (0008,103E) so each row is a unique description that can be found in multiple DICOM series of different studies and patients. The Modalities column presents the different DICOM modalities found that have the specific description, whereas the other columns show the count of studies, patients, and series. The total numbers of DICOM patients, series, studies, and instances (files) found are also shown right after the table.

Series information (Series are grouped by their descriptions)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Series Description            ┃ Modalities ┃ Studies count ┃ Patients count ┃ Series count ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ DBT slices                    │ MG         │ 3             │ 1              │ 3            │
│ LIVER-PELVIS/HASTE_AXIAL_P    │ MR         │ 1             │ 1              │ 1            │
│ PARENCHYMAL PHASE Sep1999     │ CT         │ 2             │ 1              │ 2            │
│ t2_spc_rst_axial obl_Prostate │ MR         │ 1             │ 1              │ 1            │
└───────────────────────────────┴────────────┴───────────────┴────────────────┴──────────────┘
Total count of unique Patients: 4
Total count of unique Studies: 7
Total count of unique Series: 7
Total count of DICOM files: 7
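The grouping behind this table can be sketched with plain dictionaries. This is not the tool's implementation: the record keys (patient_id, study_uid, series_uid, modality, description) and the sample values are chosen for this example, and reading the values from DICOM files is left out.

```python
from collections import defaultdict


def group_series_by_description(series_rows):
    """Summarize series records per Series Description."""
    groups = defaultdict(
        lambda: {"modalities": set(), "studies": set(), "patients": set(), "series": set()}
    )
    for row in series_rows:
        g = groups[row["description"]]
        g["modalities"].add(row["modality"])
        g["studies"].add(row["study_uid"])
        g["patients"].add(row["patient_id"])
        g["series"].add(row["series_uid"])
    # Turn the collected sets into sorted modality lists and unique counts.
    return {
        desc: {
            "modalities": sorted(g["modalities"]),
            "studies": len(g["studies"]),
            "patients": len(g["patients"]),
            "series": len(g["series"]),
        }
        for desc, g in sorted(groups.items())
    }


# Made-up records: one patient with two "DBT slices" series, another with a CT.
rows = [
    {"patient_id": "P1", "study_uid": "1.1", "series_uid": "1.1.1",
     "modality": "MG", "description": "DBT slices"},
    {"patient_id": "P1", "study_uid": "1.2", "series_uid": "1.2.1",
     "modality": "MG", "description": "DBT slices"},
    {"patient_id": "P2", "study_uid": "2.1", "series_uid": "2.1.1",
     "modality": "CT", "description": "PARENCHYMAL PHASE"},
]
summary = group_series_by_description(rows)
print(summary["DBT slices"])
# {'modalities': ['MG'], 'studies': 2, 'patients': 1, 'series': 2}
```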

If you supply the --csv option to utils series-info, it will instead print the series information (shown in the table above) in CSV format.

If you supply the --ungrouped option to utils series-info, it will instead print the series information "ungrouped", i.e. each row represents a single series and rows are sorted by PatientID, StudyUID, and SeriesUID (in that order), like so:

                                                                                             Series information
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ PatientID  ┃ StudyUID                                                         ┃ SeriesUID                                                        ┃ Modality ┃ SeriesDescription             ┃ ImageCount ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ 3209648408 │ 1.2.826.0.1.3680043.8.498.10492038307422868223199863260233355278 │ 1.2.826.0.1.3680043.8.498.34901162804853023415490754909481583213 │ CT       │ PARENCHYMAL PHASE Sep1999     │ 1          │
│ 3209648408 │ 1.2.826.0.1.3680043.8.498.15325708537294505104277328793123984041 │ 1.2.826.0.1.3680043.8.498.21727307543319369287645065784568861179 │ CT       │ PARENCHYMAL PHASE Sep1999     │ 1          │
│ 571403367  │ 1.2.826.0.1.3680043.8.498.11930027078857085215653760141431432752 │ 1.2.826.0.1.3680043.8.498.11389650391405144789117233891221888210 │ MG       │ DBT slices                    │ 1          │
│ 571403367  │ 1.2.826.0.1.3680043.8.498.43140369966073420105378776118739847239 │ 1.2.826.0.1.3680043.8.498.94202333078444804735974466471131425254 │ MG       │ DBT slices                    │ 1          │
│ 571403367  │ 1.2.826.0.1.3680043.8.498.60031442536880637581306951540659454726 │ 1.2.826.0.1.3680043.8.498.10108928214392221999942909773938492911 │ MG       │ DBT slices                    │ 1          │
│ 8732322741 │ 1.2.826.0.1.3680043.8.498.11505123464109404670942682899455583584 │ 1.2.826.0.1.3680043.8.498.42020251536922680292646612864203256535 │ MR       │ t2_spc_rst_axial obl_Prostate │ 1          │
│ 9894340694 │ 1.2.826.0.1.3680043.8.498.10976157236759544945657408266559980502 │ 1.2.826.0.1.3680043.8.498.74608000754336619565503767283924990632 │ MR       │ LIVER-PELVIS/HASTE_AXIAL_P    │ 1          │
└────────────┴──────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────┴──────────┴───────────────────────────────┴────────────┘

Acknowledgements

This tool makes use of the following tools and packages:

The RSNA CTP tool is used for the anonymization of the DICOM metadata (DICOM Tags in the DICOM header). The anonymization script included in this repository is written in CTP's scripting language.
Microsoft's Presidio is used for redacting Personally Identifiable Information (PII) text from the DICOM images.
PaddleOCR is used as an alternative OCR engine, together with its related models.
pydicom for reading and writing DICOM files.

Disclaimer

This software is provided by the Computational BioMedicine Laboratory (CBML), FORTH-ICS under the terms of the European Union Public License (EUPL) 1.2. It is distributed in the hope that it will be useful, but without any warranty — not even the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.

CBML and its contributors accept no responsibility or liability for any loss, damage, or legal issues arising from the use, misuse, or inability to use this software. Users are solely responsible for ensuring that any anonymization performed with this tool meets applicable legal, regulatory, and institutional requirements (including those related to patient data protection and privacy).

Use this tool at your own risk.
