go-smart-deduper

A high-performance file deduplication tool that detects and manages duplicate files using content hashing and intelligent similarity analysis.

Features

  • Fast and Concurrent: Uses Go's goroutines and worker pools for high-performance parallel file hashing
  • SHA-256 Hashing: Secure and reliable content-based duplicate detection
  • Fuzzy Hashing: Optional similarity detection for finding near-duplicate files
  • Recursive Scanning: Scan entire directory trees with configurable depth
  • Smart Filtering:
    • Exclude patterns (glob-style)
    • File size thresholds (min/max)
    • Hidden file handling
    • Symbolic link following
  • Multiple Output Formats: Text, JSON, and CSV reports
  • Interactive Modes:
    • Terminal UI (TUI) using Bubbletea
    • Interactive CLI deletion
  • Flexible Actions:
    • Dry-run mode (preview without changes)
    • Automatic deletion (keeps oldest file)
    • Hard-link replacement (saves disk space)
  • Cross-Platform: Works on Linux, macOS, and Windows

Installation

From Source

git clone https://github.com/BaseMax/go-smart-deduper.git
cd go-smart-deduper
go build -o go-smart-deduper

Using Go Install

go install github.com/BaseMax/go-smart-deduper@latest

Usage

Basic Usage

Scan the current directory for duplicates:

go-smart-deduper

Scan specific directories:

go-smart-deduper /path/to/dir1 /path/to/dir2

Filtering Options

Set minimum file size (in bytes):

go-smart-deduper --min-size 1024

Set maximum file size:

go-smart-deduper --max-size 10485760  # 10MB

Exclude patterns:

go-smart-deduper --exclude "*.tmp" --exclude "*.log"
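Exclude patterns are glob-style and matched per file. As a rough sketch of how such filtering can work with Go's standard library (the `shouldExclude` helper is hypothetical, for illustration only, and not the tool's actual code):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// shouldExclude reports whether the file's base name matches any
// glob-style exclude pattern. Hypothetical helper for illustration.
func shouldExclude(path string, patterns []string) bool {
	base := filepath.Base(path)
	for _, p := range patterns {
		if ok, err := filepath.Match(p, base); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	patterns := []string{"*.tmp", "*.log"}
	fmt.Println(shouldExclude("/var/app/cache.tmp", patterns)) // true
	fmt.Println(shouldExclude("/var/app/data.txt", patterns))  // false
}
```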

Include hidden files:

go-smart-deduper --exclude-hidden=false

Follow symbolic links:

go-smart-deduper --follow-symlinks

Output Formats

Generate JSON report:

go-smart-deduper --format json

Generate CSV report:

go-smart-deduper --format csv --output duplicates.csv

Verbose output:

go-smart-deduper -v

Action Modes

Dry-run (preview without making changes):

go-smart-deduper --delete --dry-run

Interactive deletion (choose which files to delete):

go-smart-deduper --interactive

Automatic deletion (keeps oldest file in each duplicate group):

go-smart-deduper --delete

Hard-link replacement (replace duplicates with hard links to save space):

go-smart-deduper --hard-link

Terminal UI Mode

Launch the interactive TUI:

go-smart-deduper --tui

In TUI mode:

  • Use arrow keys or j/k to navigate
  • Press space to select duplicate groups
  • Press q to quit

Note: TUI mode currently displays duplicates for review only. To delete files, use CLI mode with --interactive, --delete, or --hard-link options.

Advanced Options

Use fuzzy hashing for similarity detection:

go-smart-deduper --fuzzy

Set number of worker threads:

go-smart-deduper --workers 8

Combine multiple options:

go-smart-deduper /home/user/Documents \
  --min-size 1024 \
  --exclude "*.tmp" \
  --exclude-hidden \
  --workers 8 \
  --format json \
  --output report.json \
  -v

Output Examples

Text Output

=== Duplicate Files Report ===

Group 1 (Hash: d2a84f4b8b650937...):
  Count: 3 files
  Size: 12 B per file
  Wasted space: 24 B
  Files:
    - /tmp/test-deduper/file1.txt (modified: 2025-12-19 17:28:05)
    - /tmp/test-deduper/file2.txt (modified: 2025-12-19 17:28:05)
    - /tmp/test-deduper/subdir/file4.txt (modified: 2025-12-19 17:28:12)

=== Summary ===
Total duplicate groups: 1
Total duplicate files: 3
Total wasted space: 24 B

JSON Output

{
  "duplicates": [
    {
      "hash": "d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26",
      "files": [
        "/tmp/test-deduper/file1.txt",
        "/tmp/test-deduper/file2.txt",
        "/tmp/test-deduper/subdir/file4.txt"
      ],
      "size": 12,
      "count": 3
    }
  ],
  "summary": {
    "total_files": 3,
    "total_groups": 1,
    "wasted_space": 24
  }
}

CSV Output

Group,Hash,File,Size,Modified
1,d2a84f4b8b650937...,/tmp/test-deduper/file1.txt,12,2025-12-19 17:28:05
1,d2a84f4b8b650937...,/tmp/test-deduper/file2.txt,12,2025-12-19 17:28:05
1,d2a84f4b8b650937...,/tmp/test-deduper/subdir/file4.txt,12,2025-12-19 17:28:12

Architecture

The tool is organized into several packages:

  • scanner: Recursive directory scanning with filtering
  • hasher: SHA-256 and fuzzy hashing implementation
  • deduper: Duplicate detection with worker pool pattern
  • reporter: Report generation in multiple formats
  • tui: Terminal UI using Bubbletea
  • cmd: Command-line interface using Cobra

Performance

The tool uses several optimizations for performance:

  1. Size-based pre-filtering: Only files with identical sizes are compared
  2. Worker pool pattern: Concurrent file hashing with configurable workers
  3. Buffered I/O: Efficient file reading with 64KB buffers
  4. Early termination: Stops processing when no duplicates are possible

Safety Features

  • Dry-run mode: Preview changes before committing
  • Interactive mode: Manual control over deletions
  • Oldest-first preservation: Automatic mode keeps the oldest file
  • Error handling: Continues scanning even if some files are inaccessible

Command-Line Options

Flags:
  -d, --delete            Automatically delete duplicates (keep oldest)
  -n, --dry-run           Don't actually delete or modify files
  -e, --exclude strings   Exclude patterns (glob style)
      --exclude-hidden    Exclude hidden files and directories (default true)
      --follow-symlinks   Follow symbolic links
  -f, --format string     Output format: text, json, csv (default "text")
      --fuzzy             Use fuzzy hashing for similarity detection
      --hard-link         Replace duplicates with hard links
  -h, --help              help for go-smart-deduper
  -i, --interactive       Interactive deletion mode
      --max-size int      Maximum file size in bytes (0 = no limit)
      --min-size int      Minimum file size in bytes
  -o, --output string     Output file (default: stdout)
  -p, --path strings      Paths to scan (can specify multiple) (default [.])
  -t, --tui               Use Terminal UI mode
  -v, --verbose           Verbose output
  -w, --workers int       Number of worker goroutines for hashing (default 4)

Testing

Run the test suite:

go test ./pkg/...

Run tests with coverage:

go test ./pkg/... -cover

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
