| 1 | +# GitHub Backup - AI Coding Instructions |
| 2 | + |
| 3 | +## Project Overview |
| 4 | + |
| 5 | +This is a Python CLI tool for comprehensive GitHub data backup. Nearly all logic lives in a single module, organized into functions by functional area (authentication, API retrieval, repository backup, account backup). |
| 6 | + |
| 7 | +## Core Architecture |
| 8 | + |
| 9 | +### Main Entry Points |
| 10 | +- **`bin/github-backup`**: CLI entry point that orchestrates the backup workflow |
| 11 | +- **`github_backup/github_backup.py`**: Single module containing all core functionality (1400+ lines) |
| 12 | +- **`github_backup/__init__.py`**: Version tracking only |
| 13 | + |
| 14 | +### Data Flow Pattern |
| 15 | +1. **Parse & Authenticate** → `parse_args()` → `get_auth()` → `get_authenticated_user()` |
| 16 | +2. **Discover** → `retrieve_repositories()` → `filter_repositories()` |
| 17 | +3. **Backup** → `backup_repositories()` + `backup_account()` |
| 18 | + |
| 19 | +### GitHub API Integration |
| 20 | +- Uses `retrieve_data_gen()` for paginated API calls with automatic rate limiting |
| 21 | +- Template-based URL construction: `"https://{host}/repos/{owner}/{name}/issues"` |
| 22 | +- Built-in retry logic for 502 errors and incomplete reads |
| 23 | +- Supports both classic tokens (`-t`) and fine-grained tokens (`-f`) |
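The pagination and URL-template ideas above can be sketched as follows. This is a minimal illustration, not the code in `retrieve_data_gen()`: the helper names `build_url` and `iter_pages`, and the injected `fetch_page` callable, are assumptions made so the sketch runs without network access.

```python
from typing import Callable, Iterator, List

# Hypothetical fetcher type: takes a URL and a page number, returns one decoded JSON page.
FetchPage = Callable[[str, int], List[dict]]


def build_url(template: str, **parts: str) -> str:
    """Fill a URL template such as "https://{host}/repos/{owner}/{name}/issues"."""
    return template.format(**parts)


def iter_pages(fetch_page: FetchPage, url: str, per_page: int = 100) -> Iterator[dict]:
    """Yield items page by page until a short (or empty) page signals the end."""
    page = 1
    while True:
        items = fetch_page(url, page)
        yield from items
        if len(items) < per_page:
            break
        page += 1
```

Because the generator is lazy, a caller can start writing results to disk before the last page has been requested. Rate limiting and retries would wrap the `fetch_page` call.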
| 24 | + |
| 25 | +## Key Development Patterns |
| 26 | + |
| 27 | +### Authentication Flexibility |
| 28 | +```python |
| 29 | +# Supports multiple auth methods in get_auth(): |
| 30 | +# - Fine-grained tokens (github_pat_...) |
| 31 | +# - Classic tokens with x-oauth-basic |
| 32 | +# - Basic username/password |
| 33 | +# - macOS Keychain integration |
| 34 | +# - GitHub App authentication (--as-app) |
| 35 | +``` |
| 36 | + |
| 37 | +### Incremental Backup Strategy |
| 38 | +- **Time-based**: `--incremental` uses API `since` parameter with last backup timestamp |
| 39 | +- **File-based**: `--incremental-by-files` compares filesystem modification times |
| 40 | +- State stored in `{output_dir}/last_update` file |
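A rough sketch of the time-based variant, assuming the `last_update` file holds a single ISO 8601 UTC timestamp. The function names and the exact timestamp format are illustrative assumptions, not taken from the module.

```python
import os
import time
from typing import Dict, Optional


def read_last_update(output_dir: str) -> Optional[str]:
    """Return the stored timestamp, or None when no earlier run recorded one."""
    path = os.path.join(output_dir, "last_update")
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return f.read().strip() or None


def write_last_update(output_dir: str, started_at: float) -> str:
    """Record when the run started (epoch seconds) as an ISO 8601 UTC string."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(started_at))
    with open(os.path.join(output_dir, "last_update"), "w", encoding="utf-8") as f:
        f.write(stamp)
    return stamp


def since_query(output_dir: str) -> Dict[str, str]:
    """Build the query arguments for an incremental request."""
    since = read_last_update(output_dir)
    return {"since": since} if since else {}
```

Recording the run's start time, and writing it only after the run completes, is one way to reduce the risk that a failed run hides data from later incremental backups.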
| 41 | + |
| 42 | +### Git Repository Handling |
| 43 | +- Uses `logging_subprocess()` wrapper for all git operations |
| 44 | +- Supports both regular clones and bare/mirror clones (`--bare` → `git clone --mirror`) |
| 45 | +- SSH vs HTTPS preference via `--prefer-ssh` flag |
| 46 | +- LFS support with `git lfs fetch --all --prune` |
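The flag-to-command mapping above might look like the sketch below. The helper names and the default `github.com` host are assumptions for illustration. The real code routes commands through `logging_subprocess()`, which is not reproduced here.

```python
from typing import List


def remote_url(owner: str, name: str, prefer_ssh: bool = False, host: str = "github.com") -> str:
    """Pick an SSH or HTTPS remote, mirroring the --prefer-ssh choice."""
    if prefer_ssh:
        return f"git@{host}:{owner}/{name}.git"
    return f"https://{host}/{owner}/{name}.git"


def clone_command(url: str, dest: str, bare: bool = False) -> List[str]:
    """Build the argv for a clone; --bare maps to `git clone --mirror`."""
    cmd = ["git", "clone"]
    if bare:
        cmd.append("--mirror")
    cmd += [url, dest]
    return cmd


def lfs_fetch_command() -> List[str]:
    """The LFS fetch invocation named in the list above."""
    return ["git", "lfs", "fetch", "--all", "--prune"]
```

Building the argument list separately from running it keeps the flag logic easy to check without invoking git.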
| 47 | + |
| 48 | +### Output Directory Structure |
| 49 | +``` |
| 50 | +{output_dir}/ |
| 51 | +├── repositories/{repo_name}/repository/ # Git clones |
| 52 | +├── starred/{owner}/{repo_name}/ # Starred repos |
| 53 | +├── gists/{gist_id}/ # User gists |
| 54 | +├── account/{starred,followers,following}.json |
| 55 | +└── {repo}/issues/{number}.json # Per-repo data |
| 56 | +``` |
| 57 | + |
| 58 | +## Development Workflows |
| 59 | + |
| 60 | +### Testing & Linting |
| 61 | +```bash |
| 62 | +# No unit tests exist; the README acknowledges this |
| 63 | +pip install flake8 |
| 64 | +flake8 --ignore=E501,E203,W503 # Same as CI |
| 65 | +``` |
| 66 | + |
| 67 | +### Docker Development |
| 68 | +```bash |
| 69 | +docker run --rm -v /path/to/backup:/data --name github-backup \ |
| 70 | + ghcr.io/josegonzalez/python-github-backup -o /data $OPTIONS $USER |
| 71 | +``` |
| 72 | + |
| 73 | +### Release Process |
| 74 | +- Automated via GitHub Actions (`automatic-release.yml`, `tagged-release.yml`) |
| 75 | +- Version bumping in `github_backup/__init__.py` |
| 76 | +- Docker image publishing to ghcr.io |
| 77 | + |
| 78 | +## Critical Implementation Details |
| 79 | + |
| 80 | +### Rate Limiting Strategy |
| 81 | +- Automatic throttling based on `x-ratelimit-remaining` header |
| 82 | +- Custom throttling via `--throttle-limit` and `--throttle-pause` |
| 83 | +- Exponential backoff for 403 rate limit responses |
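As a hedged sketch of the throttling decision: the functions below are hypothetical and only illustrate how `--throttle-limit`, `--throttle-pause`, and the `x-ratelimit-remaining` header could combine. Treating a missing header as "no pause" and doubling the delay on each backoff step are assumptions, not confirmed behavior.

```python
from typing import List, Mapping


def throttle_delay(headers: Mapping[str, str], throttle_limit: int, throttle_pause: float) -> float:
    """Return seconds to sleep before the next request (0.0 means no pause)."""
    remaining = headers.get("x-ratelimit-remaining")
    if not throttle_limit or remaining is None:
        return 0.0  # throttling disabled, or no rate-limit information available
    return throttle_pause if int(remaining) <= throttle_limit else 0.0


def backoff_delays(base: float = 1.0, attempts: int = 4) -> List[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base * 2 ** i for i in range(attempts)]
```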
| 84 | + |
| 85 | +### Error Handling Philosophy |
| 86 | +- Graceful degradation for missing data (404s logged but don't block) |
| 87 | +- Blocking errors (403 auth failures) exit entirely |
| 88 | +- Incomplete reads get 3 retry attempts with 5-second delays |
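The incomplete-read rule can be illustrated with a small retry wrapper. This sketch reads "3 retry attempts" as three attempts in total, which is an assumption. `read_with_retries` and its injectable `sleep` parameter are hypothetical names, not functions from the module.

```python
import time
from http.client import IncompleteRead
from typing import Callable, TypeVar

T = TypeVar("T")


def read_with_retries(
    read: Callable[[], T],
    attempts: int = 3,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call read(); on IncompleteRead, wait `delay` seconds and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return read()
        except IncompleteRead:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(delay)
    raise ValueError("attempts must be at least 1")
```

Passing `sleep` as a parameter lets the delay be replaced with a no-op when exercising the retry path.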
| 89 | + |
| 90 | +### File I/O Patterns |
| 91 | +- Atomic writes via `.temp` files then `os.rename()` |
| 92 | +- UTF-8 encoding with `codecs.open()` for JSON files |
| 93 | +- JSON formatting: `ensure_ascii=False, sort_keys=True, indent=4` |
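Put together, the three points above amount to something like the following sketch. The function name `write_json_atomic` is an assumption; the `.temp` suffix, `codecs.open()`, and the `json.dump` options are the ones listed.

```python
import codecs
import json
import os


def write_json_atomic(path: str, data) -> None:
    """Write JSON to `path` via a sibling .temp file, then rename into place."""
    temp_path = path + ".temp"
    with codecs.open(temp_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, sort_keys=True, indent=4)
    # A reader never sees a half-written file: the rename happens only after
    # the temp file has been fully written and closed.
    os.rename(temp_path, path)
```

On Windows, `os.rename()` raises an error when the destination already exists; `os.replace()` is the portable alternative if that matters for a given change.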
| 94 | + |
| 95 | +## Common Gotchas |
| 96 | + |
| 97 | +1. **`--all` doesn't include everything**: It omits private repos, forks, starred repos, LFS, and gists |
| 98 | +2. **`--bare` is actually `--mirror`**: Uses `git clone --mirror`, not `git clone --bare` |
| 99 | +3. **Starred gists**: Stored in same directory as user gists, not separately |
| 100 | +4. **Incremental risks**: Failed runs can cause missing data in subsequent incremental backups |
| 101 | +5. **Authentication scope**: Fine-grained tokens need specific repository and user permissions |
| 102 | + |
| 103 | +## Extension Points |
| 104 | + |
| 105 | +When adding new backup types, follow the pattern: |
| 106 | +1. Add CLI argument in `parse_args()` |
| 107 | +2. Create `backup_*()` function following existing patterns |
| 108 | +3. Call from `backup_repositories()` or `backup_account()` |
| 109 | +4. Use `retrieve_data()` for API calls and `mkdir_p()` for directories |
| 110 | +5. Follow atomic file writing pattern with `.temp` files |
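The five steps above could be sketched as follows for an invented "widgets" resource. Everything named here is hypothetical: the `--widgets` flag, the `/widgets` endpoint, and the injected `retrieve` callable (standing in for `retrieve_data()`) exist only for illustration, and `os.makedirs()` stands in for `mkdir_p()`.

```python
import argparse
import json
import os
from typing import Callable, Iterable


def add_widget_argument(parser: argparse.ArgumentParser) -> None:
    # Step 1: register the CLI flag (hypothetical option).
    parser.add_argument("--widgets", action="store_true", dest="include_widgets",
                        help="include widgets in backup (hypothetical)")


def backup_widgets(args, repo_cwd: str, retrieve: Callable[[str], Iterable[dict]]) -> int:
    # Steps 2-3: a backup_*() function, called per repository by the caller.
    if not args.include_widgets:
        return 0
    widget_cwd = os.path.join(repo_cwd, "widgets")
    os.makedirs(widget_cwd, exist_ok=True)  # stand-in for mkdir_p()
    count = 0
    # Step 4: fetch via the injected retriever, using a URL template.
    for widget in retrieve("https://{host}/repos/{owner}/{name}/widgets"):
        path = os.path.join(widget_cwd, "{0}.json".format(widget["id"]))
        # Step 5: write to a .temp file first, then rename into place.
        with open(path + ".temp", "w", encoding="utf-8") as f:
            json.dump(widget, f, ensure_ascii=False, sort_keys=True, indent=4)
        os.rename(path + ".temp", path)
        count += 1
    return count
```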