A Python command-line tool for scrubbing sensitive data from FOCUS billing files with consistent, reproducible mappings.
⚠️ DISCLAIMER: This tool provides automated scrubbing of common PII patterns in billing data, but cannot guarantee complete removal of all personally identifiable information. The output data should always be reviewed by a human before sharing or publishing to ensure all sensitive information has been properly anonymized. Users are responsible for verifying the scrubbed data meets their security and privacy requirements.
- Account ID Scrubbing: Consistently maps account IDs (numeric, UUIDs, ARNs) across all columns
- Name Anonymization: Replaces account names with stellar-themed generated names
- Date Shifting: Shifts date/datetime values by a configurable number of days
- Commitment Discount IDs: Intelligently scrubs complex IDs containing embedded account numbers and UUIDs
- Component-Level Mappings: Centralized mapping engine ensures consistency across all columns
  - `NumberId`: Maps numeric IDs (e.g., 12-digit AWS account IDs)
  - `UUID`: Maps UUIDs to new random UUIDs
  - `Name`: Maps names to stellar-themed names (e.g., "Nebula Alpha")
  - `ProfileCode`: Maps dash-separated codes preserving structure
- Same account ID (e.g., `658755425446`) maps to the same value wherever it appears:
  - As a standalone `SubAccountId`
  - Embedded in a `CommitmentDiscountId` ARN
  - In any other account-related column
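The consistency guarantee can be illustrated with a toy version of the engine (hypothetical names; the real implementation lives in `focus_scrub/mapping/engine.py` and differs in detail):

```python
import random

class ToyMappingEngine:
    """Illustrative sketch: each value gets one stable random replacement."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._number_ids = {}

    def map_number_id(self, value):
        # On first sight, generate a same-length random digit string; then reuse it.
        if value not in self._number_ids:
            self._number_ids[value] = "".join(
                self._rng.choice("0123456789") for _ in range(len(value))
            )
        return self._number_ids[value]

engine = ToyMappingEngine(seed=42)
first = engine.map_number_id("658755425446")
second = engine.map_number_id("658755425446")
assert first == second  # same input always yields the same scrubbed ID
```

Because the lookup table persists for the lifetime of the engine, the same replacement is reused no matter which column or file the value appears in.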
- Export mappings to ensure consistency across multiple processing runs
- Load mappings from previous runs to maintain referential integrity
- Input formats: `.csv`, `.csv.gz`, `.parquet`
- Output formats: `csv-gzip`, `parquet`
- Process single files or entire directories
- Preserves directory structure in output
- `focus_scrub/focus_scrub/cli.py` - CLI entrypoint
- `focus_scrub/focus_scrub/io.py` - File discovery + read/write logic
- `focus_scrub/focus_scrub/scrub.py` - Deterministic column replacement engine
- `focus_scrub/focus_scrub/handlers.py` - Reusable handler registry + dataset-to-column mapping
- `focus_scrub/focus_scrub/mapping/` - Mapping infrastructure
  - `engine.py` - Central `MappingEngine` for consistent component mappings
  - `collector.py` - `MappingCollector` for tracking column-level mappings
Handlers delegate to the shared `MappingEngine` to ensure consistency:
- `AccountIdHandler`: Decomposes complex values (ARNs), extracts components (account IDs, UUIDs), and maps each via the engine
- `StellarNameHandler`: Maps account names to stellar-themed names
- `CommitmentDiscountIdHandler`: Delegates to `AccountIdHandler` with the shared engine
- `DateReformatHandler`: Shifts dates by the configured number of days
Any column without a configured handler is passed through unchanged.
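The date-shifting step can be sketched like this (a simplified stand-in for `DateReformatHandler`; the function name and format string are assumptions, not the tool's actual code):

```python
from datetime import datetime, timedelta

def shift_date(value, days, fmt="%Y-%m-%dT%H:%M:%SZ"):
    """Shift a timestamp string by a fixed number of days, keeping its format."""
    return (datetime.strptime(value, fmt) + timedelta(days=days)).strftime(fmt)

shift_date("2024-01-15T00:00:00Z", 30)  # -> "2024-02-14T00:00:00Z"
```

Shifting every date column by the same offset preserves the relative ordering and spacing of billing periods while hiding the real dates.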
- Install Poetry:

```shell
curl -sSL https://install.python-poetry.org | python3 -
```

- Install project dependencies:

```shell
poetry install
```

Process files without exporting mappings:

```shell
poetry run focus-scrub <input_path> <output_path> --dataset CostAndUsage
```

Export mappings for reuse in subsequent runs:

```shell
poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --export-mappings mappings.json
```

The exported JSON contains:
- `column_mappings`: Per-column old→new value mappings
- `component_mappings`: Component-level mappings (`NumberId`, `UUID`, `Name`, `ProfileCode`)
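For illustration, an exported file might look like the snippet below (values taken from the example mappings elsewhere in this README; the exact schema is defined by the tool, so treat this only as a sketch of the two top-level keys):

```python
import json

example = """
{
  "column_mappings": {
    "SubAccountId": {"658755425446": "752426551655"}
  },
  "component_mappings": {
    "NumberId": {"658755425446": "752426551655"},
    "Name": {"MyBillingAccount": "Nebula Iota"}
  }
}
"""
mappings = json.loads(example)
assert set(mappings) == {"column_mappings", "component_mappings"}
```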
Reuse mappings from a previous run to ensure consistency:

```shell
poetry run focus-scrub input2/ output2/ \
  --dataset CostAndUsage \
  --load-mappings mappings.json
```

Shift all date columns by 30 days:

```shell
poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --date-shift-days 30
```

Specify the output format (default is parquet):

```shell
poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --output-format csv-gzip
```

Remove custom columns (starting with `x_` or `oci_`) from the output:

```shell
poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --remove-custom-columns
```

Note: The FOCUS spec states that custom columns should start with `x_`, but OCI uses an `oci_` prefix. Both patterns are recognized and removed when this option is enabled.
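The column filter amounts to a prefix check, which can be sketched as follows (`drop_custom_columns` is a hypothetical helper shown for illustration, not the tool's actual code):

```python
CUSTOM_PREFIXES = ("x_", "oci_")

def drop_custom_columns(columns):
    """Drop columns whose names start with a recognized custom-column prefix."""
    return [c for c in columns if not c.startswith(CUSTOM_PREFIXES)]

drop_custom_columns(["BilledCost", "x_CostCategory", "oci_ReferenceNumber"])
# -> ["BilledCost"]
```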
```shell
# First run: Process files and export mappings
poetry run focus-scrub datafiles/AWS datafiles_out/AWS \
  --dataset CostAndUsage \
  --output-format parquet \
  --date-shift-days 30 \
  --export-mappings mappings/aws_mappings.json

# Second run: Process more files using same mappings
poetry run focus-scrub datafiles/AWS_batch2 datafiles_out/AWS_batch2 \
  --dataset CostAndUsage \
  --load-mappings mappings/aws_mappings.json
```

Supported datasets:

- `CostAndUsage` - Standard cost and usage data
- `ContractCommitment` - Contract commitment data
In `handlers.py`:

- Register a handler factory in `HANDLER_FACTORIES`:

```python
HANDLER_FACTORIES: dict[str, HandlerFactory] = {
    "DateReformat": _build_date_reformat_handler,
    "AccountId": _build_account_id_handler,
    "StellarName": _build_stellar_name_handler,
    "YourNewHandler": _build_your_new_handler,
}
```

- Map columns to handlers in `DATASET_COLUMN_HANDLER_NAMES`:

```python
"CostAndUsage": {
    "BillingAccountId": "AccountId",
    "BillingAccountName": "StellarName",
    "YourColumn": "YourNewHandler",
}
```

- `MappingEngine` creates consistent mappings for primitive components:
  - Numeric IDs always map to the same random numeric ID
  - UUIDs always map to the same random UUID
  - Names always map to the same stellar name
- Handlers decompose complex values and use the engine for each component:
  - ARN `arn:aws:ec2:us-east-1:658755425446:reserved-instances/uuid` →
    - Account `658755425446` → maps via `engine.map_number_id()`
    - UUID → maps via `engine.map_uuid()`
    - Result: `arn:aws:ec2:us-east-1:752426551655:reserved-instances/new-uuid`
- Consistency is maintained because the engine remembers all mappings:
  - Same input value always produces the same output
  - Works across all columns and all files in a run
  - Can be exported and reloaded for future runs
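The decomposition step above can be sketched with regular expressions that pull the 12-digit account ID and any embedded UUID out of an ARN and map each piece independently (`scrub_arn` and the fixed replacement values are hypothetical, shown only to illustrate the mechanism):

```python
import re

# 8-4-4-4-12 hex groups, the standard UUID shape.
UUID_RE = re.compile(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")

def scrub_arn(arn, map_number_id, map_uuid):
    """Replace the embedded 12-digit account ID, then any embedded UUID."""
    arn = re.sub(r"\b\d{12}\b", lambda m: map_number_id(m.group()), arn)
    return UUID_RE.sub(lambda m: map_uuid(m.group()), arn)

scrubbed = scrub_arn(
    "arn:aws:ec2:us-east-1:658755425446:reserved-instances/"
    "0a1b2c3d-4e5f-4a6b-8c7d-9e8f7a6b5c4d",
    map_number_id=lambda _: "752426551655",
    map_uuid=lambda _: "1f2e3d4c-5b6a-4789-8123-abcdefabcdef",
)
# scrubbed == "arn:aws:ec2:us-east-1:752426551655:reserved-instances/1f2e3d4c-5b6a-4789-8123-abcdefabcdef"
```

In the real tool the two lambdas would be the engine's `map_number_id()` and `map_uuid()`, so the same account ID gets the same replacement here as in every standalone column.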
Before:

```
BillingAccountId: 658745821254
SubAccountId: 658755425446
BillingAccountName: MyBillingAccount
CommitmentDiscountId: arn:aws:ec2:us-east-1:658755425446:reserved-instances/ed12ad8c-...
```

After:

```
BillingAccountId: 736035721513
SubAccountId: 752426551655
BillingAccountName: Nebula Iota
CommitmentDiscountId: arn:aws:ec2:us-east-1:752426551655:reserved-instances/c741d6b8-...
```

Note: The account ID `658755425446` consistently maps to `752426551655` both in the `SubAccountId` column and within the `CommitmentDiscountId` ARN.