go-smart-deduper

A high-performance file deduplication tool that detects and manages duplicate files using content hashing and intelligent similarity analysis.

Features

  • Fast and Concurrent: Uses Go's goroutines and worker pools for high-performance parallel file hashing
  • SHA-256 Hashing: Secure and reliable content-based duplicate detection
  • Fuzzy Hashing: Optional similarity detection for finding near-duplicate files
  • Recursive Scanning: Scan entire directory trees with configurable depth
  • Smart Filtering:
    • Exclude patterns (glob-style)
    • File size thresholds (min/max)
    • Hidden file handling
    • Symbolic link following
  • Multiple Output Formats: Text, JSON, and CSV reports
  • Interactive Modes:
    • Terminal UI (TUI) using Bubbletea
    • Interactive CLI deletion
  • Flexible Actions:
    • Dry-run mode (preview without changes)
    • Automatic deletion (keeps oldest file)
    • Hard-link replacement (saves disk space)
  • Cross-Platform: Works on Linux, macOS, and Windows

Installation

From Source

git clone https://github.com/BaseMax/go-smart-deduper.git
cd go-smart-deduper
go build -o go-smart-deduper

Using Go Install

go install github.com/BaseMax/go-smart-deduper@latest

Usage

Basic Usage

Scan the current directory for duplicates:

go-smart-deduper

Scan specific directories:

go-smart-deduper /path/to/dir1 /path/to/dir2

Filtering Options

Set minimum file size (in bytes):

go-smart-deduper --min-size 1024

Set maximum file size:

go-smart-deduper --max-size 10485760  # 10MB

Exclude patterns:

go-smart-deduper --exclude "*.tmp" --exclude "*.log"
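Exclude patterns are glob-style and matched per file. As a rough sketch of how such filtering can work with Go's standard library (the `shouldExclude` helper is hypothetical, for illustration only, and not the tool's actual code):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// shouldExclude reports whether the file's base name matches any
// glob-style exclude pattern. Hypothetical helper for illustration.
func shouldExclude(path string, patterns []string) bool {
	base := filepath.Base(path)
	for _, p := range patterns {
		if ok, err := filepath.Match(p, base); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	patterns := []string{"*.tmp", "*.log"}
	fmt.Println(shouldExclude("/var/app/cache.tmp", patterns)) // true
	fmt.Println(shouldExclude("/var/app/data.txt", patterns))  // false
}
```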

Include hidden files:

go-smart-deduper --exclude-hidden=false

Follow symbolic links:

go-smart-deduper --follow-symlinks

Output Formats

Generate JSON report:

go-smart-deduper --format json

Generate CSV report:

go-smart-deduper --format csv --output duplicates.csv

Verbose output:

go-smart-deduper -v

Action Modes

Dry-run (preview without making changes):

go-smart-deduper --delete --dry-run

Interactive deletion (choose which files to delete):

go-smart-deduper --interactive

Automatic deletion (keeps oldest file in each duplicate group):

go-smart-deduper --delete

Hard-link replacement (replace duplicates with hard links to save space):

go-smart-deduper --hard-link

Terminal UI Mode

Launch the interactive TUI:

go-smart-deduper --tui

In TUI mode:

  • Use arrow keys or j/k to navigate
  • Press space to select duplicate groups
  • Press q to quit

Note: TUI mode currently displays duplicates for review only. To delete files, use CLI mode with --interactive, --delete, or --hard-link options.

Advanced Options

Use fuzzy hashing for similarity detection:

go-smart-deduper --fuzzy

Set number of worker threads:

go-smart-deduper --workers 8

Combine multiple options:

go-smart-deduper /home/user/Documents \
  --min-size 1024 \
  --exclude "*.tmp" \
  --exclude-hidden \
  --workers 8 \
  --format json \
  --output report.json \
  -v

Output Examples

Text Output

=== Duplicate Files Report ===

Group 1 (Hash: d2a84f4b8b650937...):
  Count: 3 files
  Size: 12 B per file
  Wasted space: 24 B
  Files:
    - /tmp/test-deduper/file1.txt (modified: 2025-12-19 17:28:05)
    - /tmp/test-deduper/file2.txt (modified: 2025-12-19 17:28:05)
    - /tmp/test-deduper/subdir/file4.txt (modified: 2025-12-19 17:28:12)

=== Summary ===
Total duplicate groups: 1
Total duplicate files: 3
Total wasted space: 24 B

JSON Output

{
  "duplicates": [
    {
      "hash": "d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26",
      "files": [
        "/tmp/test-deduper/file1.txt",
        "/tmp/test-deduper/file2.txt",
        "/tmp/test-deduper/subdir/file4.txt"
      ],
      "size": 12,
      "count": 3
    }
  ],
  "summary": {
    "total_files": 3,
    "total_groups": 1,
    "wasted_space": 24
  }
}

CSV Output

Group,Hash,File,Size,Modified
1,d2a84f4b8b650937...,/tmp/test-deduper/file1.txt,12,2025-12-19 17:28:05
1,d2a84f4b8b650937...,/tmp/test-deduper/file2.txt,12,2025-12-19 17:28:05
1,d2a84f4b8b650937...,/tmp/test-deduper/subdir/file4.txt,12,2025-12-19 17:28:12

Architecture

The tool is organized into several packages:

  • scanner: Recursive directory scanning with filtering
  • hasher: SHA-256 and fuzzy hashing implementation
  • deduper: Duplicate detection with worker pool pattern
  • reporter: Report generation in multiple formats
  • tui: Terminal UI using Bubbletea
  • cmd: Command-line interface using Cobra

Performance

The tool uses several optimizations for performance:

  1. Size-based pre-filtering: Only files with identical sizes are compared
  2. Worker pool pattern: Concurrent file hashing with configurable workers
  3. Buffered I/O: Efficient file reading with 64KB buffers
  4. Early termination: Stops processing when no duplicates are possible

Safety Features

  • Dry-run mode: Preview changes before committing
  • Interactive mode: Manual control over deletions
  • Oldest-first preservation: Automatic mode keeps the oldest file
  • Error handling: Continues scanning even if some files are inaccessible

Command-Line Options

Flags:
  -d, --delete            Automatically delete duplicates (keep oldest)
  -n, --dry-run           Don't actually delete or modify files
  -e, --exclude strings   Exclude patterns (glob style)
      --exclude-hidden    Exclude hidden files and directories (default true)
      --follow-symlinks   Follow symbolic links
  -f, --format string     Output format: text, json, csv (default "text")
      --fuzzy             Use fuzzy hashing for similarity detection
      --hard-link         Replace duplicates with hard links
  -h, --help              help for go-smart-deduper
  -i, --interactive       Interactive deletion mode
      --max-size int      Maximum file size in bytes (0 = no limit)
      --min-size int      Minimum file size in bytes
  -o, --output string     Output file (default: stdout)
  -p, --path strings      Paths to scan (can specify multiple) (default [.])
  -t, --tui               Use Terminal UI mode
  -v, --verbose           Verbose output
  -w, --workers int       Number of worker goroutines for hashing (default 4)

Testing

Run the test suite:

go test ./pkg/...

Run tests with coverage:

go test ./pkg/... -cover

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
