23 commits
1c91ce6
added documents marked as to be removed clean script
idelcano May 21, 2025
fe1ce44
rename folder
idelcano Jul 7, 2025
776c7ef
remove unnecesary code and simplify script
idelcano Jul 7, 2025
bdf37f4
rename file
idelcano Jul 7, 2025
6f7613b
Added tomcat version
idelcano Jul 7, 2025
5467611
remove duplicate old file
idelcano Jul 7, 2025
b047fce
Added readme
idelcano Jul 7, 2025
b98b7bb
fix readme
idelcano Jul 14, 2025
3c6062a
Refactoring to remove magic values and fix log comment
idelcano Jul 14, 2025
0de1af1
change db variable name
idelcano Jul 14, 2025
9ebf3e3
Edited error reported to the user to reflect the correct name of the …
cgbautista Jul 17, 2025
8048428
Update script to clean data value file resources excluding actives in…
idelcano Jul 25, 2025
274d5d5
update to add notifications and refactor
idelcano Nov 24, 2025
1b71228
improve script fixing error, adding notify, improve workflow and fix …
idelcano Nov 26, 2025
4a4a365
refactor sql
idelcano Nov 26, 2025
a96e571
read proxy from config file
idelcano Nov 26, 2025
4ddd621
added store files in csv as already notified parameter and truncate m…
idelcano Nov 27, 2025
926c0f0
refactor duplicate code
idelcano Nov 27, 2025
4f5eacc
refactor
idelcano Nov 27, 2025
a7721fd
simplify parameters and workflod, improve unique row detection(id+uid…
idelcano Nov 28, 2025
f241b0f
import missing dependency
idelcano Nov 28, 2025
d2b8cdc
avoid identify as orphan files created/updated in the last 24
idelcano Dec 22, 2025
83f0bf0
recovery instructions
idelcano Dec 22, 2025
177 changes: 177 additions & 0 deletions DHIS2/file_garbage_remover/Readme.md
# DHIS2 File Garbage Remover Scripts

This folder contains Python scripts for managing orphaned resources in DHIS2 instances: two cleanup scripts plus `main.py` as the recommended entry point.
The scripts identify files with no database references and take the appropriate action for the environment: production (Tomcat) or testing (Docker).

Included Scripts

## main.py (recommended entry point)

Run cleanup (tomcat or docker) and optionally send a notification in one command. Dry-run is the default unless you add `--force`.

### Example (docker cleanup + notify using webhook from config.json)
```
python3 main.py \
--mode docker \
--docker-instance docker.eyeseetea.com/widpit/dhis2-data:2.41-widp-dev-test \
--csv-path /tmp/fg_docker.csv \
--config /path/to/config.json \
--notify-title "Orphans widp-dev-test"
```
- Add `--force` to delete/move for real (Tomcat) or delete inside the container (Docker).
- Add `--notify-test` to print the payload instead of sending it (see the example after this list).
- To notify only (no cleanup): `python3 main.py --notify-only --csv-path /tmp/fg.csv --config /path/to/config.json --notify-title "..." [--notify-test]`
- To skip sending and mark the entries in the CSV as already notified: `--save-all-as-notified`.
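For example, a dry preview of a notification built from an existing CSV, assuming the webhook is already configured in `config.json` (the CSV path and title are illustrative):
```
python3 main.py \
  --notify-only \
  --csv-path /tmp/fg_docker.csv \
  --config /path/to/config.json \
  --notify-title "Orphans widp-dev-test" \
  --notify-test
```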

### Example (tomcat cleanup, dry-run, save as notified)
```
DB_PASSWORD_FG=your_password python3 main.py \
--config /path/to/config.json \
--csv-path /tmp/fg_tomcat.csv \
--save-all-as-notified
```
Add `--force` to actually move the files and delete their database entries.

`config.json` needs the cleanup settings plus the webhook (if you want notifications):
```
{
"db_host": "",
"db_port": "",
"db_name": "",
"db_user": "",
"file_base_path": "",
"temp_file_path": "",
"webhook-url": "https://your.webhook.url",
"notify-http-proxy": "http://openproxy.who.int:8080",
"notify-https-proxy": "http://openproxy.who.int:8080"
}
```
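A filled-in sketch of that file; every value below is a placeholder, not a real host, database, or path:
```
{
  "db_host": "localhost",
  "db_port": "5432",
  "db_name": "dhis2",
  "db_user": "dhis",
  "file_base_path": "/home/dhis/config/files",
  "temp_file_path": "/home/dhis/fileresource_trash",
  "webhook-url": "https://your.webhook.url",
  "notify-http-proxy": "http://openproxy.who.int:8080",
  "notify-https-proxy": "http://openproxy.who.int:8080"
}
```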

## file_garbage_remover_tomcat.py

### Description:

Designed for production environments. It identifies orphaned file resources (documents or data values, i.e. files or images attached to a value), moves the files to a temporary directory, and archives the corresponding database entries into a dedicated table (`fileresourcesaudit`) before deleting them from the original table.

### Usage:

Run the script in either test or force mode.

Test Mode (dry-run):

```
export DB_PASSWORD_FG='your_password'
./file_garbage_remover_tomcat.py --test --config /path/to/config.json
```

Force Mode (apply changes):

```
export DB_PASSWORD_FG='your_password'
./file_garbage_remover_tomcat.py --force --config /path/to/config.json
```

config.json File Requirements:
```
{
"db_host": "",
"db_port": "",
"db_name": "",
"db_user": "",
"file_base_path": "",
"temp_file_path": ""
}
```

Ensure `file_base_path` and `temp_file_path` exist and are valid directories.
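A quick pre-flight check for those two directories, assuming `jq` is installed and `config.json` is in the current directory (this helper is not part of the scripts):
```
for key in file_base_path temp_file_path; do
  dir="$(jq -r ".$key" config.json)"
  [ -d "$dir" ] || echo "Missing or invalid directory for $key: $dir"
done
```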

The database password must be provided through the environment variable `DB_PASSWORD_FG`.
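One way to set it without leaving the password in shell history, using plain bash (nothing script-specific is assumed):
```
read -r -s -p "DB password: " DB_PASSWORD_FG; echo; export DB_PASSWORD_FG
```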

Bash wrapper example:
```
#!/bin/bash
MODE="${1:-test}"
[[ "$MODE" != "test" && "$MODE" != "force" ]] && echo "Invalid mode." && exit 1

DB_PASSWORD_FG=db_password

python3 /path/to/script/bin/file_garbage_remover/file_garbage_remover_tomcat.py --config /path/to/script/bin/file_garbage_remover/config.json --$MODE 2>&1 | tee -a /path/to/logs/orphan_cleanup.log

unset DB_PASSWORD_FG
```
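If the wrapper above is saved as, say, `orphan_cleanup.sh` (a hypothetical name), it can be scheduled with cron; the schedule and paths are examples only:
```
# Weekly dry-run on Sundays at 03:00; switch "test" to "force" only after reviewing the output
0 3 * * 0 /path/to/script/bin/file_garbage_remover/orphan_cleanup.sh test >> /path/to/logs/orphan_cleanup_cron.log 2>&1
```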

## file_garbage_remover_docker.py

### Description:

Intended for d2-docker testing environments.
Identifies orphaned files in a Dockerized DHIS2 instance and deletes them directly from the container. This script does not archive or move files; identified resources are permanently removed.

### Usage:

Run the script specifying the Docker DHIS2 instance:

```
./file_garbage_remover_docker.py --instance docker.eyeseetea.com/project/dhis2-data:2.41-test
```
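A small wrapper can keep a log of each run, analogous to the Tomcat wrapper above (the instance name and log path are examples):
```
#!/bin/bash
# Hypothetical wrapper: run the docker cleanup and append its output to a log file.
INSTANCE="${1:-docker.eyeseetea.com/project/dhis2-data:2.41-test}"
./file_garbage_remover_docker.py --instance "$INSTANCE" 2>&1 | tee -a /path/to/logs/orphan_cleanup_docker.log
```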

### Operation:

1. Executes an SQL query within the DHIS2 container using `d2-docker run-sql`.
2. Copies the generated file list into the container.
3. Deletes the identified files directly inside the container.

# Precautions

Production environments: always use `--test` mode before `--force` to verify the intended changes.

Docker/test environments: files deleted by `file_garbage_remover_docker.py` cannot be recovered.

## Restoring fileresource entries (tomcat)

Manual steps if a resource was moved by mistake:
1) Restore the row into `fileresource` from `fileresourcesaudit` using a **new** `fileresourceid` to avoid collisions.
Column list (stable): `uid, code, created, lastupdated, name, contenttype, contentlength, contentmd5, storagekey, isassigned, domain, userid, lastupdatedby, hasmultiplestoragefiles, fileresourceowner`
```
INSERT INTO fileresource (
    fileresourceid, uid, code, created, lastupdated, name, contenttype,
    contentlength, contentmd5, storagekey, isassigned, domain, userid,
    lastupdatedby, hasmultiplestoragefiles, fileresourceowner
)
SELECT
    nextval('fileresource_fileresourceid_seq'), uid, code, created, lastupdated,
    name, contenttype, contentlength, contentmd5, storagekey, isassigned,
    domain, userid, lastupdatedby, hasmultiplestoragefiles, fileresourceowner
FROM fileresourcesaudit
WHERE fileresourceid = <OLD_ID>;
```
Run this inside a transaction and verify the restored row before committing.
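A possible shape for that transaction (the `uid` filter is only an example; adapt it to the row you are restoring):
```
BEGIN;
-- run the INSERT above, then inspect the restored row before committing
SELECT fileresourceid, uid, name, storagekey, isassigned
FROM fileresource
WHERE uid = '<UID>';
-- COMMIT; only if the row looks correct, otherwise ROLLBACK;
```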

2) Move the file back from `temp_file_path` to `file_base_path`, keeping the folder (`document` or `dataValue`). Use a wildcard to catch image variants (e.g., `..._max`, `..._min`):
```
mv "<temp_file_path>/<folder>/<file_prefix>"* "<file_base_path>/<folder>/"
```

Example to restore entries moved in the last 24 hours:
```
-- 1) Reinsert rows with new IDs
INSERT INTO fileresource (
    fileresourceid, uid, code, created, lastupdated, name, contenttype,
    contentlength, contentmd5, storagekey, isassigned, domain, userid,
    lastupdatedby, hasmultiplestoragefiles, fileresourceowner
)
SELECT
    nextval('fileresource_fileresourceid_seq'), uid, code, created, lastupdated,
    name, contenttype, contentlength, contentmd5, storagekey, isassigned,
    domain, userid, lastupdatedby, hasmultiplestoragefiles, fileresourceowner
FROM fileresourcesaudit
WHERE COALESCE(lastupdated, created, NOW()) >= (NOW() - INTERVAL '24 hours');

-- 2) Move files back (adjust paths)
find "<temp_file_path>" -type f -mtime -1 -print0 | while IFS= read -r -d '' f; do
rel="${f#<temp_file_path>/}"
mv "$f" "<file_base_path>/$rel"
done
```
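A sanity check after the reinsert, assuming `uid` is unique in both tables: list any uid archived in the last 24 hours that is still missing from `fileresource` (an empty result means every row was restored):
```
SELECT a.uid
FROM fileresourcesaudit a
LEFT JOIN fileresource f ON f.uid = a.uid
WHERE COALESCE(a.lastupdated, a.created, NOW()) >= (NOW() - INTERVAL '24 hours')
  AND f.uid IS NULL;
```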