Enhance README with features and usage details #18
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| @@ -1,136 +1,214 @@ | ||||||||||
| # Redback Ethics Asset Scanner | ||||||||||
|
|
||||||||||
| The **Asset Scanner** is a Python-based tool for detecting sensitive information (PII, secrets, credentials, etc.) in documents and media. | ||||||||||
| It is designed for educational use in cybersecurity and ethics modules. | ||||||||||
| The **Asset Scanner** is a Python-based tool for detecting sensitive information (PII, secrets, credentials, etc.) in documents, code, and media. Designed for educational use in cybersecurity and ethics modules, the scanner helps students and professionals identify and mitigate risks associated with the exposure of sensitive data. | ||||||||||
|
|
||||||||||
| --- | ||||||||||
|
|
||||||||||
| ## 🛠️ Key Features | ||||||||||
|
|
||||||||||
| - **Hybrid Detection**: Combines **Microsoft Presidio**'s NLP-based entity recognition with **custom regex patterns** from `patterns.json`. | ||||||||||
| - **OCR Capabilities**: Scans text within images and PDFs using Optical Character Recognition (OCR) via `ocr_engine.py`. | ||||||||||
| - **Risk Assessment**: Categorizes findings into _Low_, _Medium_, or _High_ risk levels with references to compliance frameworks (e.g., GDPR, Privacy Act). | ||||||||||
| - **Flexible Input Handling**: Supports directories, individual files, and various formats, including `.txt`, `.docx`, `.pdf`, `.png`, `.jpg`, and more. | ||||||||||
| - **Actionable Reports**: Provides detailed mitigation tips and compliance recommendations for each detected risk. | ||||||||||
| - **Command-Line Interface**: Easy-to-use CLI with options to customize pattern files, output format, and verbosity. | ||||||||||
|
|
||||||||||
| --- | ||||||||||
|
|
||||||||||
| ## 📂 Project Structure | ||||||||||
|
|
||||||||||
| - `scanner.py` – Main entry point for scanning files and generating reports. | ||||||||||
| - `scan_media.py` – Scans image/PDF inputs using OCR (`ocr_engine.py`). | ||||||||||
| - `file_handler.py` – Handles input files and preprocessing. | ||||||||||
| - `ocr_engine.py` – OCR engine wrapper for text extraction from images. | ||||||||||
| - `reporter.py` – Builds structured scan results and output reports. | ||||||||||
| - `patterns.json` – Regex patterns for detecting sensitive items. | ||||||||||
| - `risk_rules.json` – Maps detected patterns to risk levels, compliance references, and remediation tips. | ||||||||||
| | File/Directory | Description | | ||||||||||
| |-----------------------|------------------------------------------------------------------------------------------------------------| | ||||||||||
| | `scanner.py` | Main entry point for scanning files and generating reports. | | ||||||||||
| | `scan_media.py` | Handles scanning of media files (images, PDFs) using OCR (`ocr_engine.py`). | | ||||||||||
| | `file_handler.py` | Manages file discovery and preprocessing (parsing `.docx`, `.txt`, etc.). | | ||||||||||
| | `ocr_engine.py` | OCR engine wrapper for extracting text from images and PDFs. | | ||||||||||
| | `reporter.py` | Builds structured scan results and outputs reports. | | ||||||||||
| | `patterns.json` | Regex patterns for detecting sensitive items (AWS keys, emails, etc.). | | ||||||||||
| | `risk_rules.json` | Maps detected patterns to risk levels, compliance references, and remediation tips. | | ||||||||||
| | `requirements.txt` | Lists the Python dependencies required to run the scanner. | | ||||||||||
|
|
||||||||||
| --- | ||||||||||
|
|
||||||||||
| ## ⚙️ Setup | ||||||||||
|
|
||||||||||
| 1. Clone the repository: | ||||||||||
| 1. **Clone the Repository**: | ||||||||||
| ```bash | ||||||||||
| git clone https://github.com/<your-repo>/redback-ethics.git | ||||||||||
| cd redback-ethics/asset-scanner | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| 2. Create and activate a virtual environment: | ||||||||||
| 2. **Create and Activate a Virtual Environment**: | ||||||||||
| ```bash | ||||||||||
| python3 -m venv .venv | ||||||||||
| source .venv/bin/activate | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| 3. Install dependencies: | ||||||||||
| 3. **Install Dependencies**: | ||||||||||
| ```bash | ||||||||||
| pip install -r requirements.txt | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| 4. **Install OCR Dependencies (Optional)**: | ||||||||||
| - For PDF/image support: | ||||||||||
| ```bash | ||||||||||
| sudo apt install poppler-utils | ||||||||||
| pip install pdf2image pytesseract | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| --- | ||||||||||
|
|
||||||||||
| ## 🚀 Usage | ||||||||||
|
|
||||||||||
| To scan a document: | ||||||||||
| ### Scan a Single File: | ||||||||||
| ```bash | ||||||||||
| python scanner.py --file "/path/to/document.docx" | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| To scan an image or PDF (OCR enabled): | ||||||||||
| ### Scan an Image or PDF (OCR): | ||||||||||
| ```bash | ||||||||||
| python scan_media.py --file "/path/to/image_or_pdf" | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| To scan a directory: | ||||||||||
| ### Scan a Directory Recursively: | ||||||||||
| ```bash | ||||||||||
| python scanner.py --root "/path/to/folder" | ||||||||||
| ``` | ||||||||||
| OR | ||||||||||
| If you run `scanner.py` on its own, without the `--file` or `--root` arguments, you will be prompted | ||||||||||
| to enter a directory at runtime. | ||||||||||
|
|
||||||||||
| Output will include: | ||||||||||
| - Detected matches with line context | ||||||||||
| - Risk level (from `risk_rules.json`) | ||||||||||
| - Mitigation tips and relevant compliance frameworks | ||||||||||
| ### Interactive Mode: | ||||||||||
| Running `scanner.py` without arguments prompts you to specify a directory or file at runtime: | ||||||||||
| ```bash | ||||||||||
| python scanner.py | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| Output Includes: | ||||||||||
| - Detected matches with line numbers | ||||||||||
| - Risk levels (_Low_, _Medium_, _High_) | ||||||||||
| - Mitigation tips and compliance references | ||||||||||
|
|
||||||||||
| --- | ||||||||||
|
|
||||||||||
| ## ⚡ Command-Line Arguments | ||||||||||
| ## ⚡ Command-Line Interface (CLI) | ||||||||||
|
|
||||||||||
| The scanner supports several arguments to control input and behaviour: | ||||||||||
| | Argument | Type | Description | Example | | ||||||||||
| |---------------|-----------|---------------------------------------------------------------|-------------------------------------------| | ||||||||||
| | `--file`      | Path      | Scan a single file or multiple files.                         | `python scanner.py --file "/path/to/doc"` | | ||||||||||
| | `--root` | Path | Recursively scan all files in a directory. | `python scanner.py --root "/path/to/"` | | ||||||||||
| | `--patterns` | Path | Custom path to `patterns.json`. | `--patterns ./configs/patterns.json` | | ||||||||||
| | `--out` | Path | Path to save structured scan results (e.g., `.json`, `.txt`). | `--out results.json` | | ||||||||||
| | `--ext` | List | Filter by file extensions (_default: .txt, .json_). | `--ext .txt .md` | | ||||||||||
| | `--no-console`| Flag | Suppress console output. Only write to the output file. | `--no-console` | | ||||||||||
|
|
||||||||||
| | Argument | Type | Description | Example | | ||||||||||
| |----------|------|-------------|---------| | ||||||||||
| | `--file` | Path | Scan a single file (e.g., `.docx`, `.pdf`, `.png`). | `python scanner.py --file "/path/to/document.docx"` | | ||||||||||
| | `--root` | Path | Recursively scan all files within a directory. | `python scanner.py --root "/path/to/folder"` | | ||||||||||
| | `--patterns` | Path | Custom path to `patterns.json`. Useful if you want to override defaults. | `python scanner.py --file test.docx --patterns ./configs/patterns.json` | | ||||||||||
| | `--out` | Path | File to write structured scan results (JSON or text depending on implementation). | `python scanner.py --root ./docs --out results.json` | | ||||||||||
| | `--no-console` | Flag | Suppress console output. Results will only be written to the output file. | `python scanner.py --root ./docs --no-console --out results.json` | | ||||||||||
| ### Example Usage: | ||||||||||
| - **Scanning a File**: `python scanner.py --file /example/path/file.docx` | ||||||||||
| - **Full Folder Scan**: `python scanner.py --root "./sensitive_files"` | ||||||||||
|
|
||||||||||
| ### Common Usage Examples | ||||||||||
| --- | ||||||||||
|
|
||||||||||
| Scan one file: | ||||||||||
| ```bash | ||||||||||
| python scanner.py --file "/Users/alice/Documents/report.docx" | ||||||||||
| ## 📋 Customization | ||||||||||
|
|
||||||||||
| ### `patterns.json` | ||||||||||
| Defines custom regex patterns for sensitive data detection. Each entry includes: | ||||||||||
| - `pattern`: The regex string to match. | ||||||||||
| - `risk`: The associated risk level (_Low_, _Medium_, or _High_). | ||||||||||
| - `description`: A brief explanation of what the pattern detects. | ||||||||||
|
|
||||||||||
| Example: | ||||||||||
| ```json | ||||||||||
| { | ||||||||||
| "aws_access_key": { | ||||||||||
| "pattern": "\\bAKIA[0-9A-Z]{16}\\b", | ||||||||||
| "risk": "High", | ||||||||||
| "description": "AWS Access Key ID" | ||||||||||
| } | ||||||||||
| } | ||||||||||
| ``` | ||||||||||
|
|
||||||||||
| Recursively Scan Directory: | ||||||||||
| ```bash | ||||||||||
| python scanner.py --root "/Users/alice/Documents/sensitive_documents" | ||||||||||
| ### `risk_rules.json` | ||||||||||
| Maps patterns to risk levels, mitigation tips, and compliance frameworks: | ||||||||||
| - `level`: _Low_, _Medium_, or _High_ | ||||||||||
| - `tip`: A recommended action for addressing the risk. | ||||||||||
| - `compliance`: Legal/regulatory references (e.g., GDPR). | ||||||||||
|
|
||||||||||
| Example: | ||||||||||
| ```json | ||||||||||
| { | ||||||||||
| "aws_access_key": { | ||||||||||
| "level": "High", | ||||||||||
| "tip": "Rotate immediately; revoke if exposed.", | ||||||||||
| "compliance": ["GDPR Art. 33 — Data Breach Notification"] | ||||||||||
|
|
||||||||||
| } | ||||||||||
| } | ||||||||||
| ``` | ||||||||||
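As a rough illustration of how the two files could work together, the sketch below applies a `patterns.json` entry to some text and looks up the matching `risk_rules.json` entry. It is not taken from `scanner.py`: the function name `scan_text`, the shape of the findings, and the fallback to the pattern's own `risk` field are all assumptions made for this example, and the Presidio side of the detection is left out entirely. The entries are inlined (and the rule is trimmed to the fields used) so the snippet runs on its own.

```python
import json
import re

# Inline copy of the README's example pattern entry. The real tool reads this
# from patterns.json; it is embedded here only to keep the sketch self-contained.
PATTERNS = json.loads(r'''
{
  "aws_access_key": {
    "pattern": "\\bAKIA[0-9A-Z]{16}\\b",
    "risk": "High",
    "description": "AWS Access Key ID"
  }
}
''')

# Trimmed copy of the README's example rule (the compliance list is omitted).
RISK_RULES = {
    "aws_access_key": {
        "level": "High",
        "tip": "Rotate immediately; revoke if exposed.",
    }
}


def scan_text(text, patterns, risk_rules):
    """Hypothetical helper: return one finding per regex match, with 1-based line numbers."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, entry in patterns.items():
            for match in re.finditer(entry["pattern"], line):
                rule = risk_rules.get(name, {})
                findings.append({
                    "pattern": name,
                    "line": lineno,
                    "match": match.group(0),
                    # Assumption: fall back to the risk stored with the pattern
                    # when no rule exists for it.
                    "level": rule.get("level", entry["risk"]),
                    "tip": rule.get("tip", ""),
                })
    return findings


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
sample = "region = us-east-1\nkey = AKIAIOSFODNN7EXAMPLE\n"
for finding in scan_text(sample, PATTERNS, RISK_RULES):
    print(finding)
```

Keeping the regex in one file and the risk metadata in another means a new data type needs an entry in each, keyed by the same name (`aws_access_key` here).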
|
|
||||||||||
| --- | ||||||||||
|
|
||||||||||
| ## 🛡️ Configuration | ||||||||||
|
|
||||||||||
| - **`patterns.json`**: Defines regex patterns for items like emails, API keys, driver’s licence numbers, etc. | ||||||||||
| Each entry specifies: | ||||||||||
| - `pattern`: regex string | ||||||||||
| - `risk`: risk level | ||||||||||
| - `description`: human-readable explanation | ||||||||||
|
|
||||||||||
| - **`risk_rules.json`**: Associates each pattern with: | ||||||||||
| - `level`: severity (Low/Medium/High) | ||||||||||
| - `tip`: recommended mitigation | ||||||||||
| - `compliance`: legal/regulatory references | ||||||||||
|
|
||||||||||
| You can extend these files to detect new types of data. | ||||||||||
|
|
||||||||||
| --- | ||||||||||
|
|
||||||||||
| ## 📝 Example | ||||||||||
|
|
||||||||||
| Scanning a document containing: | ||||||||||
| ## 🥼 Example | ||||||||||
|
||||||||||
| ## 🥼 Example | |
| ## 📝 Example |
Copilot
AI
Jan 15, 2026
The example shows email risk as 'Medium', but according to risk_rules.json line 3, emails are classified as 'Low' risk, not 'Medium'.
| "risk": "Medium", | |
| "risk": "Low", |
Copilot
AI
Jan 15, 2026
The README references a CODE_OF_CONDUCT.md file, but this file does not exist in the asset-scanner directory. This will result in a broken link for users viewing the asset-scanner README.
| Please adhere to our [Code of Conduct](CODE_OF_CONDUCT.md). | |
| Please adhere to our [Code of Conduct](../CODE_OF_CONDUCT.md). |
Copilot
AI
Jan 15, 2026
The README references a LICENSE file with a relative link, but no LICENSE file exists in the asset-scanner directory. This will result in a broken link for users viewing the asset-scanner README.
| This project is licensed under the [MIT License](LICENSE). | |
| See the `LICENSE` file for full details. | |
| This project is licensed under the MIT License. | |
| See the `LICENSE` file in the repository root for full details. |
The `--no-console` flag is documented in the CLI table, but this argument does not exist in scanner.py's argument parser (lines 173-190). This feature is not implemented.
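If the flag is meant to stay in the README, one way to implement it is sketched below. This assumes `scanner.py` builds its parser with the standard-library `argparse` module, which the diff does not confirm. The helper names `build_parser` and `emit` are hypothetical, and only the flags relevant to the comment are shown.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a subset of the flags from the README's CLI table."""
    parser = argparse.ArgumentParser(description="Asset scanner (illustrative sketch)")
    parser.add_argument("--file", help="Scan a single file.")
    parser.add_argument("--root", help="Recursively scan a directory.")
    parser.add_argument("--out", help="File to write structured results to.")
    # store_true makes the flag default to False when it is absent.
    parser.add_argument(
        "--no-console",
        action="store_true",
        help="Suppress console output; write results only to --out.",
    )
    return parser


def emit(results: list[str], args: argparse.Namespace) -> None:
    """Print results unless --no-console was given; write them if --out is set."""
    if not args.no_console:
        for line in results:
            print(line)
    if args.out:
        with open(args.out, "w", encoding="utf-8") as fh:
            fh.write("\n".join(results))
```

`argparse` exposes `--no-console` as `args.no_console`, so a call such as `emit(results, build_parser().parse_args())` would honour the flag. Otherwise, the simpler fix is to drop the row from the CLI table.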