FOCUS Scrub

A Python command-line tool for scrubbing sensitive data from FOCUS billing files with consistent, reproducible mappings.

⚠️ DISCLAIMER: This tool provides automated scrubbing of common PII patterns in billing data, but cannot guarantee complete removal of all personally identifiable information. The output data should always be reviewed by a human before sharing or publishing to ensure all sensitive information has been properly anonymized. Users are responsible for verifying the scrubbed data meets their security and privacy requirements.

Features

Data Scrubbing

Account ID Scrubbing: Consistently maps account IDs (numeric, UUIDs, ARNs) across all columns
Name Anonymization: Replaces account names with stellar-themed generated names
Date Shifting: Shifts date/datetime values by a configurable number of days
Commitment Discount IDs: Intelligently scrubs complex IDs containing embedded account numbers and UUIDs

Mapping Engine

Component-Level Mappings: Centralized mapping engine ensures consistency across all columns
- NumberId: Maps numeric IDs (e.g., 12-digit AWS account IDs)
- UUID: Maps UUIDs to new random UUIDs
- Name: Maps names to stellar-themed names (e.g., "Nebula Alpha")
- ProfileCode: Maps dash-separated codes preserving structure

Consistency Guarantees

The same account ID (e.g., 658755425446) maps to the same scrubbed value wherever it appears:
- As a standalone SubAccountId
- Embedded in a CommitmentDiscountId ARN
- In any other account-related column
Export mappings to ensure consistency across multiple processing runs
Load mappings from previous runs to maintain referential integrity

File Format Support

Input formats: .csv, .csv.gz, .parquet
Output formats: csv-gzip, parquet
Process single files or entire directories
Preserves directory structure in output
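
As a rough illustration of the behavior listed above, the sketch below finds supported input files under a directory and mirrors each file's relative location under an output root. It is not the tool's actual io.py; the function names, the case-insensitive suffix match, and the output file naming (.parquet or .csv.gz) are assumptions made for the example.

```python
from pathlib import Path

# Suffixes taken from the "Input formats" list above.
SUPPORTED_SUFFIXES = (".csv", ".csv.gz", ".parquet")


def discover_inputs(input_path: Path) -> list[Path]:
    """Return the supported files under input_path (or the file itself)."""
    if input_path.is_file():
        candidates = [input_path]
    else:
        candidates = sorted(p for p in input_path.rglob("*") if p.is_file())
    return [p for p in candidates if p.name.lower().endswith(SUPPORTED_SUFFIXES)]


def output_path_for(src: Path, input_root: Path, output_root: Path,
                    output_format: str) -> Path:
    """Mirror src's location relative to input_root under output_root.

    The output naming scheme here is an assumption, not the tool's behavior.
    """
    relative = src.relative_to(input_root)
    stem = relative.name
    for suffix in SUPPORTED_SUFFIXES:
        if stem.lower().endswith(suffix):
            stem = stem[: -len(suffix)]  # drop the input suffix
            break
    new_suffix = ".parquet" if output_format == "parquet" else ".csv.gz"
    return output_root / relative.parent / (stem + new_suffix)
```

For a single input file, a caller could pass the file's parent directory as input_root so that the relative path is just the file name.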

Architecture

Project Layout

focus_scrub/focus_scrub/cli.py - CLI entrypoint
focus_scrub/focus_scrub/io.py - File discovery + read/write logic
focus_scrub/focus_scrub/scrub.py - Deterministic column replacement engine
focus_scrub/focus_scrub/handlers.py - Reusable handler registry + dataset-to-column mapping
focus_scrub/focus_scrub/mapping/ - Mapping infrastructure
- engine.py - Central MappingEngine for consistent component mappings
- collector.py - MappingCollector for tracking column-level mappings

Handler Architecture

Handlers delegate to the shared MappingEngine to ensure consistency:

AccountIdHandler: Decomposes complex values (ARNs), extracts components (account IDs, UUIDs), and maps each via the engine
StellarNameHandler: Maps account names to stellar-themed names
CommitmentDiscountIdHandler: Delegates to AccountIdHandler, using the same shared engine
DateReformatHandler: Shifts dates by the configured number of days

Any column without a configured handler is passed through unchanged.

Setup

Install Poetry:

curl -sSL https://install.python-poetry.org | python3 -

Install project dependencies:

poetry install

Usage

Basic Usage

Process files without exporting mappings:

poetry run focus-scrub <input_path> <output_path> --dataset CostAndUsage

Export Mappings

Export mappings for reuse in subsequent runs:

poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --export-mappings mappings.json

The exported JSON contains:

column_mappings: Per-column old→new value mappings
component_mappings: Component-level mappings (NumberId, UUID, Name, ProfileCode)
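
To make the file layout more concrete, here is a minimal sketch that parses a mappings file and counts the entries in each section. Only the two top-level keys and the four component names come from the description above; the nesting beneath them (plain old-value to new-value dictionaries) and the function name are assumptions, and the sample values are borrowed from the Example Output section.

```python
import json

# Hypothetical file contents: the layout below the two top-level keys is
# assumed for illustration and may not match the real export.
EXAMPLE_JSON = """
{
  "column_mappings": {
    "SubAccountId": {"658755425446": "752426551655"}
  },
  "component_mappings": {
    "NumberId": {"658755425446": "752426551655"},
    "UUID": {},
    "Name": {"MyBillingAccount": "Nebula Iota"},
    "ProfileCode": {}
  }
}
"""


def summarize_mappings(raw: str) -> dict[str, int]:
    """Count how many old-to-new pairs each column and component holds."""
    data = json.loads(raw)
    summary: dict[str, int] = {}
    for column, pairs in data.get("column_mappings", {}).items():
        summary[f"column:{column}"] = len(pairs)
    for component, pairs in data.get("component_mappings", {}).items():
        summary[f"component:{component}"] = len(pairs)
    return summary


print(summarize_mappings(EXAMPLE_JSON))
```

A quick summary like this can help confirm that a file contains the expected sections before it is passed to --load-mappings.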

Load Mappings

Reuse mappings from a previous run to ensure consistency:

poetry run focus-scrub input2/ output2/ \
  --dataset CostAndUsage \
  --load-mappings mappings.json

Date Shifting

Shift all date columns by 30 days:

poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --date-shift-days 30
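
Conceptually, date shifting adds a fixed offset to every date value. The sketch below illustrates the idea for ISO 8601 timestamp strings; it is not the tool's DateReformatHandler. The function names and the assumption that values parse with datetime.fromisoformat are illustrative only.

```python
from datetime import datetime, timedelta


def shift_iso_timestamp(text: str, days: int) -> str:
    """Shift an ISO 8601 timestamp by a number of days, keeping its UTC offset."""
    parsed = datetime.fromisoformat(text)
    return (parsed + timedelta(days=days)).isoformat()


def shift_row(row: dict[str, str], date_columns: list[str], days: int) -> dict[str, str]:
    """Return a copy of row with the named date columns shifted."""
    shifted = dict(row)
    for column in date_columns:
        if shifted.get(column):  # leave missing or empty values alone
            shifted[column] = shift_iso_timestamp(shifted[column], days)
    return shifted


row = {"ChargePeriodStart": "2024-01-15T00:00:00+00:00", "BilledCost": "1.5"}
print(shift_row(row, ["ChargePeriodStart"], 30))
```

Because every value moves by the same offset, the intervals between dates are preserved while the absolute dates change.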

Output Format

Specify output format (default is parquet):

poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --output-format csv-gzip

Remove Custom Columns

Remove custom columns (starting with x_ or oci_) from the output:

poetry run focus-scrub input/ output/ \
  --dataset CostAndUsage \
  --remove-custom-columns

Note: The FOCUS spec states that custom columns should start with x_, but OCI uses the oci_ prefix. Both prefixes are recognized and removed when this option is enabled.
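
The prefix rule can be sketched in a few lines. This is an illustration rather than the tool's code: the function name and the case-sensitive match are assumptions, and the x_ and oci_ column names in the sample are hypothetical.

```python
# Prefixes taken from the note above.
CUSTOM_PREFIXES = ("x_", "oci_")


def drop_custom_columns(columns: list[str]) -> list[str]:
    """Keep only the columns that do not start with a custom prefix."""
    # str.startswith accepts a tuple and matches any of its entries.
    return [c for c in columns if not c.startswith(CUSTOM_PREFIXES)]


# Hypothetical column names, for illustration only.
columns = ["BillingAccountId", "x_Discounts", "oci_ReferenceNumber", "BilledCost"]
print(drop_custom_columns(columns))  # ['BillingAccountId', 'BilledCost']
```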

Complete Example

# First run: Process files and export mappings
poetry run focus-scrub datafiles/AWS datafiles_out/AWS \
  --dataset CostAndUsage \
  --output-format parquet \
  --date-shift-days 30 \
  --export-mappings mappings/aws_mappings.json

# Second run: Process more files using same mappings
poetry run focus-scrub datafiles/AWS_batch2 datafiles_out/AWS_batch2 \
  --dataset CostAndUsage \
  --load-mappings mappings/aws_mappings.json

Supported Datasets

CostAndUsage - Standard cost and usage data
ContractCommitment - Contract commitment data

Configuration

Adding Handlers

In handlers.py:

Register handler factory in HANDLER_FACTORIES:

HANDLER_FACTORIES: dict[str, HandlerFactory] = {
    "DateReformat": _build_date_reformat_handler,
    "AccountId": _build_account_id_handler,
    "StellarName": _build_stellar_name_handler,
    "YourNewHandler": _build_your_new_handler,
}

Map columns to handlers in DATASET_COLUMN_HANDLER_NAMES:

"CostAndUsage": {
    "BillingAccountId": "AccountId",
    "BillingAccountName": "StellarName",
    "YourColumn": "YourNewHandler",
}
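
The two registries fit together roughly as follows. This is a self-contained sketch, not the project's handlers.py: the Handler and HandlerFactory shapes, the masking behavior of the example handler, and the helper names build_column_handlers and scrub_row are all assumptions. The real factories may, for instance, receive the shared MappingEngine or CLI options.

```python
from typing import Callable

# Assumed shapes, for illustration only.
Handler = Callable[[str], str]
HandlerFactory = Callable[[], Handler]


def _build_your_new_handler() -> Handler:
    """Hypothetical factory: the handler masks all but the last 4 characters."""
    def handler(value: str) -> str:
        return "*" * max(len(value) - 4, 0) + value[-4:]
    return handler


HANDLER_FACTORIES: dict[str, HandlerFactory] = {
    "YourNewHandler": _build_your_new_handler,
}

DATASET_COLUMN_HANDLER_NAMES: dict[str, dict[str, str]] = {
    "CostAndUsage": {"YourColumn": "YourNewHandler"},
}


def build_column_handlers(dataset: str) -> dict[str, Handler]:
    """Resolve each configured column of a dataset to a handler instance."""
    names = DATASET_COLUMN_HANDLER_NAMES.get(dataset, {})
    return {column: HANDLER_FACTORIES[name]() for column, name in names.items()}


def scrub_row(row: dict[str, str], handlers: dict[str, Handler]) -> dict[str, str]:
    """Apply handlers per column; unconfigured columns pass through unchanged."""
    return {col: handlers[col](val) if col in handlers else val
            for col, val in row.items()}


handlers = build_column_handlers("CostAndUsage")
print(scrub_row({"YourColumn": "abcdefgh", "Other": "keep"}, handlers))
```

Keeping the handler name as a string in the dataset mapping means one factory can serve several columns and datasets.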

How Mappings Work

MappingEngine creates consistent mappings for primitive components:
- A numeric ID always maps to the same random numeric ID
- A UUID always maps to the same random UUID
- A name always maps to the same stellar name
Handlers decompose complex values and use the engine for each component. For example, for the ARN arn:aws:ec2:us-east-1:658755425446:reserved-instances/uuid:
- Account 658755425446 is mapped via engine.map_number_id()
- The UUID is mapped via engine.map_uuid()
- Result: arn:aws:ec2:us-east-1:752426551655:reserved-instances/new-uuid
Consistency is maintained because the engine remembers all mappings:
- The same input value always produces the same output
- This holds across all columns and all files in a run
- Mappings can be exported and reloaded for future runs
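
The idea can be sketched as a small memoizing engine plus a function that decomposes a value. This is a simplified illustration, not the project's engine.py: only the method names map_number_id and map_uuid come from the text above. The class name, regular expressions, seeding, and the sample UUID are assumptions.

```python
import random
import re
import uuid


class SketchMappingEngine:
    """Remembers each mapping so repeated inputs give repeated outputs."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._number_ids: dict[str, str] = {}
        self._uuids: dict[str, str] = {}

    def map_number_id(self, value: str) -> str:
        if value not in self._number_ids:
            # Random digits of the same length as the original ID.
            self._number_ids[value] = "".join(
                self._rng.choice("0123456789") for _ in value)
        return self._number_ids[value]

    def map_uuid(self, value: str) -> str:
        key = value.lower()
        if key not in self._uuids:
            self._uuids[key] = str(
                uuid.UUID(int=self._rng.getrandbits(128), version=4))
        return self._uuids[key]


# One pattern with two alternatives, so each component is replaced in a
# single pass and replaced text is never scanned again.
COMPONENT_RE = re.compile(
    r"(?P<uuid>[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})"
    r"|(?P<num>(?<![0-9A-Za-z])\d{12}(?![0-9A-Za-z]))"
)


def scrub_account_value(value: str, engine: SketchMappingEngine) -> str:
    """Map every UUID and 12-digit ID found in value via the engine."""
    def replace(match: re.Match) -> str:
        if match.group("uuid"):
            return engine.map_uuid(match.group("uuid"))
        return engine.map_number_id(match.group("num"))
    return COMPONENT_RE.sub(replace, value)


engine = SketchMappingEngine(seed=0)
# The UUID below is a made-up placeholder.
arn = ("arn:aws:ec2:us-east-1:658755425446:reserved-instances/"
       "00000000-0000-4000-8000-000000000001")
print(scrub_account_value("658755425446", engine))
print(scrub_account_value(arn, engine))  # embeds the same mapped account ID
```

Because both calls go through the same engine instance, the account ID embedded in the ARN receives the same replacement as the standalone value.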

Example Output

Original Data

BillingAccountId: 658745821254
SubAccountId: 658755425446
BillingAccountName: MyBillingAccount
CommitmentDiscountId: arn:aws:ec2:us-east-1:658755425446:reserved-instances/ed12ad8c-...

Scrubbed Data

BillingAccountId: 736035721513
SubAccountId: 752426551655
BillingAccountName: Nebula Iota
CommitmentDiscountId: arn:aws:ec2:us-east-1:752426551655:reserved-instances/c741d6b8-...

Note: The account ID 658755425446 consistently maps to 752426551655 in both the SubAccountId column and within the CommitmentDiscountId ARN.
