Hubify

Convert object detection datasets to HuggingFace format and upload to the Hub.

Currently supported formats:

  • COCO format annotations
  • YOLO format annotations
  • YOLO OBB format annotations

Coming soon: Pascal VOC, Labelme, and more!

Motivations for this tool

HuggingFace has become the de facto open-source hub for sharing datasets and models. The community is best known for LLMs and other language models, but nothing about HuggingFace's dataset hosting is specific to language modeling.

This tool consolidates the various annotation formats used in object detection (COCO, Pascal VOC, etc.) into the layout HuggingFace recommends for Image Datasets, and uploads the result to the HuggingFace Hub.

Installation

pip install hubify-dataset

Usage

After installation, you can use the hubify command:

# Auto-detect annotations in train/validation/test directories
hubify --data-dir /path/to/images --format coco

# Manually specify annotation files
hubify --data-dir /path/to/images \
  --train-annotations /path/to/instances_train2017.json \
  --validation-annotations /path/to/instances_val2017.json

# Generate sample visualizations
hubify --data-dir /path/to/images --visualize

# Push to HuggingFace Hub
hubify --data-dir /path/to/images \
  --train-annotations /path/to/instances_train2017.json \
  --push-to-hub username/my-dataset

# YOLO OBB annotations
hubify --data-dir ~/Downloads/DOTAv1.5 --format yolo-obb --push-to-hub benjamintli/dota-v1.5

# YOLO annotations
hubify --data-dir ~/Downloads/DOTAv1.5 --format yolo --push-to-hub benjamintli/dota-v1.5

Or run directly with Python (from the virtual environment):

source .venv/bin/activate
python -m src.main --data-dir /path/to/images

Expected Directory Structure

  • For COCO:
data-dir/
├── train/
│   ├── instances*.json  (auto-detected)
│   └── *.jpg            (images)
├── validation/
│   ├── instances*.json  (auto-detected)
│   └── *.jpg            (images)
└── test/               (optional)
    ├── instances*.json
    └── *.jpg
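
When --format coco is used without explicit annotation paths, the tool auto-detects instances*.json in each split directory. A minimal sketch of that detection logic, assuming the layout above (find_coco_annotations is a hypothetical name for illustration, not the tool's actual implementation):

from pathlib import Path

# Hypothetical sketch: find the first instances*.json per split,
# mirroring the auto-detection described above.
def find_coco_annotations(data_dir: str) -> dict:
    annotations = {}
    for split in ("train", "validation", "test"):
        split_dir = Path(data_dir) / split
        if not split_dir.is_dir():
            continue  # test/ is optional
        matches = sorted(split_dir.glob("instances*.json"))
        if matches:
            annotations[split] = matches[0]
    return annotations

print(find_coco_annotations("/path/to/images"))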

Output

The tool generates metadata.jsonl files in each split directory:

data-dir/
├── train/
│   └── metadata.jsonl
└── validation/
    └── metadata.jsonl

Each line in metadata.jsonl contains:

{
  "file_name": "image.jpg",
  "objects": {
    "bbox": [[x, y, width, height], ...],
    "category": [0, 1, ...]
  }
}
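
Because this matches the metadata.jsonl convention of HuggingFace's imagefolder loader, the converted dataset can be loaded directly with the datasets library, locally or from the Hub after a push. A short example; username/my-dataset is a placeholder repo id:

from datasets import load_dataset

# Load the converted splits locally; imagefolder picks up metadata.jsonl
ds = load_dataset("imagefolder", data_dir="/path/to/images")
example = ds["train"][0]
print(example["objects"])  # {"bbox": [[x, y, w, h], ...], "category": [0, 1, ...]}

# Or load it from the Hub after --push-to-hub
ds = load_dataset("username/my-dataset")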

Options

  • --data-dir: Root directory containing train/validation/test subdirectories (required)
  • --format: Dataset format: 'auto' (default), 'coco', 'yolo', or 'yolo-obb' (optional)
  • --train-annotations: Path to training annotations JSON (optional)
  • --validation-annotations: Path to validation annotations JSON (optional)
  • --test-annotations: Path to test annotations JSON (optional)
  • --visualize: Generate sample visualization images with bounding boxes
  • --push-to-hub: Push dataset to HuggingFace Hub (format: username/dataset-name)
  • --token: HuggingFace API token (optional, defaults to HF_TOKEN env var or huggingface-cli login)

Authentication for Hub Push

When using --push-to-hub, the tool looks for your HuggingFace token in this order:

  1. --token YOUR_TOKEN (CLI argument)
  2. HF_TOKEN environment variable
  3. Token from huggingface-cli login

If no token is found, you'll get a helpful error message with instructions.
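
For steps 2 and 3, the huggingface_hub library exposes the same lookup, so you can check which token will be picked up before pushing. A small check, assuming huggingface_hub is installed in your environment:

from huggingface_hub import get_token, login

# Equivalent of `huggingface-cli login`; caches a token locally
# login()  # prompts for a token, or pass login(token="hf_...")

# get_token() reads the HF_TOKEN env var first, then the cached login token
token = get_token()
print("token found" if token else "no token configured")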
