Convert object detection datasets to HuggingFace format and upload to the Hub.
Currently supported formats:
- COCO format annotations
- YOLO format annotations
- YOLO OBB format annotations
Coming soon: Pascal VOC, Labelme, and more!
HuggingFace has become the de facto open-source hub for sharing datasets and models. Most of the attention there goes to LLMs and language modeling, but nothing about HuggingFace's dataset hosting is specific to language tasks.
This tool consolidates the various object detection annotation formats (COCO, Pascal VOC, etc.) into the layout HuggingFace recommends for image datasets, and uploads the result to the HuggingFace Hub.
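For context, a dataset pushed in this layout can usually be loaded back with the datasets library. The snippet below is only a sketch, assuming the objects metadata described later in this README; the repository id is a placeholder:

```python
# Sketch: loading a converted dataset back from the HuggingFace Hub.
# "username/my-dataset" is a placeholder repository id.
from datasets import load_dataset

ds = load_dataset("username/my-dataset", split="train")
example = ds[0]
print(example["objects"]["bbox"])      # list of [x, y, width, height] boxes
print(example["objects"]["category"])  # matching class index per box
```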
pip install hubify-dataset

After installation, you can use the hubify command:
# Auto-detect annotations in train/validation/test directories
hubify --data-dir /path/to/images --format coco
# Manually specify annotation files
hubify --data-dir /path/to/images \
--train-annotations /path/to/instances_train2017.json \
--validation-annotations /path/to/instances_val2017.json
# Generate sample visualizations
hubify --data-dir /path/to/images --visualize
# Push to HuggingFace Hub
hubify --data-dir /path/to/images \
--train-annotations /path/to/instances_train2017.json \
--push-to-hub username/my-dataset
# For YOLO and YOLO OBB formats
hubify --data-dir ~/Downloads/DOTAv1.5 --format yolo-obb --push-to-hub benjamintli/dota-v1.5
hubify --data-dir ~/Downloads/DOTAv1.5 --format yolo --push-to-hub benjamintli/dota-v1.5

Or run directly with Python (from the virtual environment):
source .venv/bin/activate
python -m src.main --data-dir /path/to/images

For COCO, the expected directory layout is:
data-dir/
├── train/
│ ├── instances*.json (auto-detected)
│ └── *.jpg (images)
├── validation/
│ ├── instances*.json (auto-detected)
│ └── *.jpg (images)
└── test/ (optional)
├── instances*.json
└── *.jpg
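The instances*.json auto-detection roughly amounts to globbing each split directory. The sketch below only illustrates that idea, not the tool's actual implementation; the data directory path is a placeholder:

```python
# Illustrative sketch of per-split instances*.json auto-detection
# (not the tool's actual code; the data directory path is a placeholder).
from pathlib import Path

def find_annotation_file(split_dir: Path) -> Path | None:
    """Return the first instances*.json file found in a split directory, if any."""
    matches = sorted(split_dir.glob("instances*.json"))
    return matches[0] if matches else None

data_dir = Path("/path/to/images")
for split in ("train", "validation", "test"):
    split_dir = data_dir / split
    if split_dir.is_dir():
        print(split, "->", find_annotation_file(split_dir))
```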
The tool generates metadata.jsonl files in each split directory:
data-dir/
├── train/
│ └── metadata.jsonl
└── validation/
└── metadata.jsonl
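Once the metadata.jsonl files are in place, the layout can typically be loaded locally with the imagefolder builder from the datasets library before pushing; the path below is a placeholder:

```python
# Sketch: loading the generated layout locally with the `imagefolder` builder.
from datasets import load_dataset

ds = load_dataset("imagefolder", data_dir="/path/to/images")
print(ds)              # DatasetDict with train/validation (and test, if present)
print(ds["train"][0])  # each example has an "image" column plus the metadata columns
```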
Each line in metadata.jsonl contains:
{
"file_name": "image.jpg",
"objects": {
"bbox": [[x, y, width, height], ...],
"category": [0, 1, ...]
}
}

A short sketch of producing records in this schema is shown after the option list below.

CLI options:

- --data-dir: Root directory containing train/validation/test subdirectories (required)
- --format: Dataset format: 'auto' (default), 'coco', 'yolo', or 'yolo-obb' (optional)
- --train-annotations: Path to training annotations JSON (optional)
- --validation-annotations: Path to validation annotations JSON (optional)
- --test-annotations: Path to test annotations JSON (optional)
- --visualize: Generate sample visualization images with bounding boxes
- --push-to-hub: Push dataset to HuggingFace Hub (format: username/dataset-name)
- --token: HuggingFace API token (optional, defaults to the HF_TOKEN env var or huggingface-cli login)
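To make the metadata schema above concrete, here is a minimal sketch of writing one metadata.jsonl record per image; the annotation values and output path are placeholders, not output from the tool:

```python
# Minimal sketch: writing metadata.jsonl records in the schema shown above.
# The annotation values and the output path are placeholders.
import json

records = [
    {
        "file_name": "image.jpg",
        "objects": {
            "bbox": [[10.0, 20.0, 100.0, 50.0]],  # one [x, y, width, height] box per object
            "category": [0],                       # matching class index per object
        },
    },
]

with open("train/metadata.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")  # one JSON object per line
```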
When using --push-to-hub, the tool looks for your HuggingFace token in this order:
- --token YOUR_TOKEN (CLI argument)
- HF_TOKEN environment variable
- Token from huggingface-cli login
If no token is found, you'll get a helpful error message with instructions.
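As an illustration only (not the tool's actual code), that lookup order could be sketched like this:

```python
# Illustrative sketch of the token lookup order described above.
import os
from huggingface_hub import HfFolder

def resolve_token(cli_token: str | None = None) -> str | None:
    if cli_token:                       # 1. --token CLI argument
        return cli_token
    env_token = os.environ.get("HF_TOKEN")
    if env_token:                       # 2. HF_TOKEN environment variable
        return env_token
    return HfFolder.get_token()         # 3. token saved by `huggingface-cli login`

if resolve_token() is None:
    raise SystemExit("No HuggingFace token found: pass --token, set HF_TOKEN, or run `huggingface-cli login`.")
```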