PyTorch toolkit for the VisDrone aerial detection dataset. Supports 33 models (4 torchvision + 29 YOLO), end-to-end training, evaluation, and inference.
Example: YOLO26x predictions on VisDrone video sequences using Soft-NMS (confidence threshold = 0.5).
git clone https://github.com/dronefreak/VisDrone-dataset-python-toolkit.git
cd VisDrone-dataset-python-toolkit
python -m venv venv && source venv/bin/activate
pip install -e . # basic
pip install -e ".[dev]" # with dev tools

Dataset layout (download from VisDrone-Dataset):
data/
├── VisDrone2019-DET-train/images/ annotations/
└── VisDrone2019-DET-val/images/ annotations/

| Model | Type | Notes |
|---|---|---|
| fasterrcnn_resnet50 / fasterrcnn_mobilenet | Torchvision | Best accuracy / lightweight |
| fcos_resnet50 | Torchvision | Anchor-free |
| retinanet_resnet50 | Torchvision | Focal loss, class imbalance |
| yolov8n/s/m/l/x | YOLO v8 | Recommended for new experiments |
| yolov9c/e/m | YOLO v9 | Programmable gradient info |
| yolov10n/s/m/b/l/x | YOLO v10 | NMS-free inference |
| yolo11n/s/m/l/x | YOLO 11 | 2024 C3k2+C2PSA architecture |
| yolo26n/s/m/l/x | YOLO 26 | 2025, best efficiency |
python scripts/train.py --available-models # list all 33 models

Pretrained VisDrone checkpoints for all supported YOLO architectures are available through the Hugging Face collection:
https://huggingface.co/collections/dronefreak/visdrone-detection-model-zoo
The collection includes model cards, benchmark results, evaluation visualizations, and ready-to-use weights for YOLOv8, YOLOv9, YOLOv10, YOLO11, and YOLO26 model families.
| Family | Available Models |
|---|---|
| YOLOv8 | n, s, m, x |
| YOLOv9 | c, m, e |
| YOLOv10 | n, l, x |
| YOLO11 | n, l, x |
| YOLO26 | n, l, x |
Individual model repositories can be accessed directly from the Hugging Face collection page.
pip install ultralytics huggingface_hub

from huggingface_hub import hf_hub_download
from ultralytics import YOLO
weights = hf_hub_download(
repo_id="dronefreak/yolov8m-visdrone",
filename="best.pt"
)
model = YOLO(weights)
results = model.predict(
source="image.jpg",
conf=0.25
)
results[0].show()

# Torchvision (Faster R-CNN)
python scripts/train.py \
--train-img-dir data/VisDrone2019-DET-train/images \
--train-ann-dir data/VisDrone2019-DET-train/annotations \
--val-img-dir data/VisDrone2019-DET-val/images \
--val-ann-dir data/VisDrone2019-DET-val/annotations \
--model fasterrcnn_resnet50 --epochs 200 --batch-size 2 \
--amp --augmentation --multiscale --small-anchors \
--lr 0.005 --lr-schedule multistep --lr-milestones 60 80 \
--output-dir outputs/fasterrcnn_200ep
# YOLO (delegates to Ultralytics engine)
python scripts/train.py \
--train-img-dir data/VisDrone2019-DET-train/images \
--train-ann-dir data/VisDrone2019-DET-train/annotations \
--val-img-dir data/VisDrone2019-DET-val/images \
--val-ann-dir data/VisDrone2019-DET-val/annotations \
--model yolov8n --epochs 200 --batch-size 16 --amp \
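--output-dir outputs/yolov8n_200ep

Weights are saved as best.pt and last.pt inside --output-dir. For the YOLO run above, the evaluation and inference examples load them from outputs/yolov8n_200ep/yolov8n/weights/.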
YOLO note: --multiscale, --small-anchors, --lr-schedule, and --accumulation-steps are ignored for YOLO models; these are handled internally by Ultralytics. --num-classes is automatically clamped to 11 (VisDrone's 11 real classes).
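As a rough illustration of the note above, the sketch below drops the options that are ignored for YOLO models and caps the class count at 11. The function name `effective_yolo_options`, the option keys, and the reading of "clamped" as an upper bound are assumptions made for this example; they are not taken from the toolkit's source.

```python
from typing import Any, Dict

# Options the YOLO note lists as ignored (keys are hypothetical, mirroring the CLI flags).
YOLO_IGNORED_OPTIONS = ("multiscale", "small_anchors", "lr_schedule", "accumulation_steps")
# VisDrone's 11 real classes, as stated in the note.
VISDRONE_NUM_CLASSES = 11


def effective_yolo_options(options: Dict[str, Any]) -> Dict[str, Any]:
    """Return the subset of options that would still matter for a YOLO model."""
    effective = {k: v for k, v in options.items() if k not in YOLO_IGNORED_OPTIONS}
    if "num_classes" in effective:
        # Assumption: "clamped to 11" is read here as an upper bound.
        effective["num_classes"] = min(effective["num_classes"], VISDRONE_NUM_CLASSES)
    return effective


if __name__ == "__main__":
    requested = {"epochs": 200, "multiscale": True, "small_anchors": True, "num_classes": 12}
    print(effective_yolo_options(requested))
```

Passing the Faster R-CNN style options through this helper leaves only the epoch count and a class count of 11.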
# Torchvision — P/R/F1 + optional pycocotools mAP
python scripts/evaluate.py \
--checkpoint outputs/fasterrcnn_200ep/best.pt \
--model fasterrcnn_resnet50 \
--image-dir data/VisDrone2019-DET-val/images \
--annotation-dir data/VisDrone2019-DET-val/annotations
# YOLO — mAP@0.5 and mAP@0.5:0.95 via Ultralytics val engine
python scripts/evaluate.py \
--checkpoint outputs/yolov8n_200ep/yolov8n/weights/best.pt \
--model yolov8n \
--image-dir data/VisDrone2019-DET-val/images \
--annotation-dir data/VisDrone2019-DET-val/annotations

Outputs a rich per-class metrics table and saves eval_outputs/metrics.json.
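The P/R/F1 figures reported for torchvision models follow the standard definitions. A minimal sketch, assuming per-class true-positive, false-positive, and false-negative counts are already available (how the toolkit matches detections to ground truth is not covered here, and the function name is hypothetical):

```python
from typing import Dict


def precision_recall_f1(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall, and F1 from detection counts for one class."""
    # Guard against empty denominators, e.g. a class with no detections.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Illustrative counts only, not results from the toolkit.
    print(precision_recall_f1(tp=8, fp=2, fn=8))
```

With 8 true positives, 2 false positives, and 8 false negatives, precision is 0.8 and recall is 0.5.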
# Images / directory / video — auto-detected from file extension
python scripts/inference.py \
--checkpoint outputs/yolov8n_200ep/yolov8n/weights/best.pt \
--model yolov8n --input data/images/ --output-dir results
python scripts/inference.py \
--checkpoint outputs/fasterrcnn_200ep/best.pt \
--model fasterrcnn_resnet50 --input drone_video.mp4 \
--soft-nms --score-threshold 0.5 --output-dir results

# Webcam (default source=0)
python scripts/webcam_demo.py \
--checkpoint outputs/yolov8n_200ep/yolov8n/weights/best.pt \
--model yolov8n
# Video file or RTSP stream
python scripts/webcam_demo.py \
--checkpoint outputs/fasterrcnn_200ep/best.pt \
--model fasterrcnn_resnet50 --source drone_video.mp4
# COCO pretrained weights — no VisDrone training needed
python scripts/webcam_demo.py --model fasterrcnn_mobilenet

Controls: q quit | s save frame | Space pause
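The --soft-nms flag in the inference command above refers to Soft-NMS, which lowers the scores of overlapping boxes instead of discarding them outright. A minimal pure-Python sketch of the Gaussian variant follows. The function names, the `sigma` default, and the way the 0.5 score threshold is applied are assumptions for illustration and may differ from what scripts/inference.py does internally.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(
    boxes: Sequence[Box],
    scores: Sequence[float],
    sigma: float = 0.5,
    score_threshold: float = 0.5,
) -> List[Tuple[Box, float]]:
    """Gaussian Soft-NMS: decay scores of overlapping boxes, keep those above the threshold."""
    remaining = list(zip(boxes, scores))
    kept: List[Tuple[Box, float]] = []
    while remaining:
        # Pick the highest-scoring box among those still under consideration.
        best_idx = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best_idx)
        if best_score < score_threshold:
            break  # every other remaining score is lower still
        kept.append((best_box, best_score))
        # Decay each remaining score by exp(-IoU^2 / sigma) against the chosen box.
        remaining = [
            (box, score * math.exp(-(iou(best_box, box) ** 2) / sigma))
            for box, score in remaining
        ]
    return kept


if __name__ == "__main__":
    demo_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
    demo_scores = [0.9, 0.8, 0.7]
    for box, score in soft_nms(demo_boxes, demo_scores):
        print(box, round(score, 3))
```

In the demo, the second box overlaps the first heavily, so its decayed score falls below the threshold, while the distant third box is kept unchanged.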
# VisDrone → COCO
python scripts/convert_annotations.py --format coco \
--image-dir data/images --annotation-dir data/annotations \
--output annotations_coco.json
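To make the VisDrone to COCO mapping concrete, here is a small sketch of converting one annotation line. It relies on the VisDrone DET annotation layout (`bbox_left,bbox_top,bbox_width,bbox_height,score,object_category,truncation,occlusion`) and on the fact that COCO boxes are also `[x, y, width, height]`. The function names and the choice to skip category 0 (ignored regions) are assumptions for this example, not a description of what scripts/convert_annotations.py does.

```python
from typing import Dict, List, Optional, Union


def parse_visdrone_line(line: str) -> Optional[Dict[str, int]]:
    """Parse one VisDrone DET annotation line; return None for blank or ignored entries."""
    # Some annotation files end lines with a trailing comma, so drop empty fields.
    fields = [f for f in line.strip().split(",") if f != ""]
    if len(fields) < 8:
        return None
    left, top, width, height, score, category, truncation, occlusion = (
        int(f) for f in fields[:8]
    )
    if category == 0:  # assumption: ignored regions are skipped
        return None
    return {
        "left": left, "top": top, "width": width, "height": height,
        "score": score, "category": category,
        "truncation": truncation, "occlusion": occlusion,
    }


def to_coco_annotation(
    record: Dict[str, int], image_id: int, annotation_id: int
) -> Dict[str, Union[int, List[int]]]:
    """Build a COCO-style annotation dict from a parsed VisDrone record."""
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": record["category"],
        "bbox": [record["left"], record["top"], record["width"], record["height"]],
        "area": record["width"] * record["height"],
        "iscrowd": 0,
    }


if __name__ == "__main__":
    parsed = parse_visdrone_line("406,119,265,70,1,4,0,0")
    if parsed is not None:
        print(to_coco_annotation(parsed, image_id=1, annotation_id=1))
```

Because both formats use top-left corner plus width and height, the box values carry over unchanged; only the surrounding record structure differs.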
# VisDrone → YOLO
python scripts/convert_annotations.py --format yolo \
--image-dir data/images --annotation-dir data/annotations \
--output-dir data/yolo_labels

from visdrone_toolkit import VisDroneDataset, get_model
from visdrone_toolkit.utils import collate_fn
from torch.utils.data import DataLoader
dataset = VisDroneDataset(
image_dir="data/images",
annotation_dir="data/annotations",
filter_ignored=True,
filter_crowd=True,
)
loader = DataLoader(dataset, batch_size=2, collate_fn=collate_fn, shuffle=True)
model = get_model("fasterrcnn_resnet50", num_classes=12, pretrained=True)

make format lint test # format + lint + run tests
python -m pytest # 203 tests, ~63% coverage

Pre-commit hooks: Black, Ruff, isort, mypy.
@misc{visdrone_toolkit_2025,
author = {Saksena, Saumya Kumaar},
title = {VisDrone Toolkit 2.0},
year = {2025},
url = {https://github.com/dronefreak/VisDrone-dataset-python-toolkit}
}
@article{zhu2018visdrone,
title = {Vision Meets Drones: A Challenge},
author = {Zhu, Pengfei and Wen, Longyin and Bian, Xiao and Ling, Haibin and Hu, Qinghua},
journal = {arXiv preprint arXiv:1804.07437},
year = {2018}
}