A toolkit for efficient vision-language pre-training and fine-tuning: Token Merging, LoRA/QLoRA, Knowledge Distillation, and more.
Vision-language models (LLaVA, InternVL, Qwen-VL, etc.) are expensive to fine-tune and slow to deploy. EfficientVLP is a practical toolkit that combines several orthogonal efficiency techniques under one roof:
| Technique | Speed-up | Memory ↓ | Quality impact |
|---|---|---|---|
| Token Merging (ToMe) | 1.5–2.0× | 20–30% | ≈ baseline |
| LoRA fine-tuning | — | 60–80% | ≈ full fine-tuning |
| QLoRA (4-bit) | — | 85–90% | −0.5–1.5% |
| Flash Attention 2 | 1.5–3× | 30–50% | identical |
| Knowledge Distillation | — | via smaller model | −2–5% |
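Token Merging is the only technique in the table that changes the forward pass itself: at each Transformer block it pairs up the most similar tokens and averages them, so later blocks process fewer tokens. A minimal sketch of the bipartite soft-matching step behind ToMe (the function below is illustrative, not the toolkit's API; it assumes an even token count):

```python
import torch

def bipartite_soft_matching(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar token pairs (illustrative sketch, not the library API).

    x: (B, N, C) token features, N even. Tokens are split into two alternating
    sets A and B; each A-token is matched to its most similar B-token, and the
    r highest-similarity pairs are averaged together.
    """
    B, N, C = x.shape
    metric = x / x.norm(dim=-1, keepdim=True)   # unit vectors → cosine similarity
    a, b = metric[:, ::2], metric[:, 1::2]      # alternating split into sets A and B
    scores = a @ b.transpose(-1, -2)            # (B, N/2, N/2) pairwise similarities

    best_val, best_idx = scores.max(dim=-1)     # best B-match for each A-token
    order = best_val.argsort(dim=-1, descending=True)
    merged_a = order[:, :r]                     # A-tokens to merge away
    kept_a = order[:, r:]                       # A-tokens to keep

    xa, xb = x[:, ::2], x[:, 1::2]
    # average each merged A-token into its matched B-token
    dst_idx = best_idx.gather(1, merged_a)      # (B, r) target positions in set B
    src = xa.gather(1, merged_a.unsqueeze(-1).expand(-1, -1, C))
    xb = xb.scatter_reduce(1, dst_idx.unsqueeze(-1).expand(-1, -1, C),
                           src, reduce="mean", include_self=True)

    kept = xa.gather(1, kept_a.unsqueeze(-1).expand(-1, -1, C))
    return torch.cat([kept, xb], dim=1)         # (B, N - r, C)
```

The real algorithm additionally tracks each merged token's "size" so attention can be weighted proportionally; this sketch keeps only the matching and averaging.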
## Installation

```bash
git clone https://github.com/suncatchin/efficient-vlp
cd efficient-vlp
pip install -e ".[full]"
```

## Quick Start

### Token Merging

```python
from efficient_vlp.token_merging import patch_model, ToMeConfig
import timm

model = timm.create_model("vit_large_patch14_clip_224.openai", pretrained=True)
config = ToMeConfig(r=8)  # merge 8 token pairs per block
patch_model(model, config)
# The model now runs with ~15% fewer tokens → faster forward pass
```

### LoRA fine-tuning

```bash
python scripts/train_lora.py \
    --model_id llava-hf/llava-v1.6-mistral-7b-hf \
    --dataset HuggingFaceM4/VQAv2 \
    --lora_rank 16 \
    --lora_alpha 32 \
    --output_dir ./checkpoints/llava-lora/
```

### QLoRA (4-bit)

```bash
python scripts/train_lora.py \
    --model_id Qwen/Qwen2-VL-7B-Instruct \
    --qlora \
    --lora_rank 64 \
    --output_dir ./checkpoints/qwen2vl-qlora/
```

### Benchmark Token Merging

```bash
python scripts/benchmark_tome.py \
    --model clip-vit-large-patch14 \
    --r 0 4 8 12 16 \
    --batch_size 64
```

## Project Structure

```
efficient_vlp/
├── token_merging/
│   ├── tome.py                 # Core ToMe algorithm (bipartite matching)
│   └── merge_utils.py          # Merge / unmerge helpers
├── lora/
│   ├── lora_layer.py           # LoRA linear layer implementation
│   └── qlora.py                # 4-bit QLoRA with bitsandbytes
├── distillation/
│   └── kd_trainer.py           # Knowledge distillation training loop
├── pruning/
│   └── structured_pruner.py    # Structured head/neuron pruning
└── trainer.py                  # Unified training entry point
scripts/
├── train_lora.py
└── benchmark_tome.py
```
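`distillation/kd_trainer.py` implements the knowledge-distillation training loop; the loss it optimizes is, in essence, the classic soft-target formulation of Hinton et al. A minimal sketch (function name and defaults are illustrative, not the toolkit's exact API):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KD loss (illustrative sketch, not the toolkit's exact API).

    Blends KL divergence between temperature-softened teacher/student
    distributions with ordinary cross-entropy on the hard labels.
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                # T² rescales gradient magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

Setting `alpha=1.0` trains on the teacher's logits alone; `T > 1` softens both distributions so the student also learns the teacher's inter-class similarities.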
Token Merging, evaluated on ViT-L/14 (ImageNet-1k, batch size 64, single A100):
| r (tokens merged/block) | Throughput (img/s) | Top-1 Acc. |
|---|---|---|
| 0 (baseline) | 412 | 75.3% |
| 4 | 498 (+21%) | 75.1% |
| 8 | 573 (+39%) | 74.8% |
| 12 | 635 (+54%) | 74.3% |
| 16 | 682 (+66%) | 73.5% |
Fine-tuned LLaVA-1.6-Mistral-7B on VQAv2 validation (A100 80GB):
| Method | GPU Mem | Train Time | VQAv2 Acc. |
|---|---|---|---|
| Full FT | 75GB | 14h | 81.4% |
| LoRA r=16 | 28GB | 5h | 81.0% |
| LoRA r=64 | 38GB | 7h | 81.3% |
| QLoRA r=64 | 18GB | 8h | 80.7% |
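For intuition about why LoRA is so memory-light: `lora/lora_layer.py` boils down to freezing the pretrained weight W and learning a low-rank update, y = Wx + (α/r)·BAx, so only rank·(in + out) parameters per layer need gradients and optimizer state. A minimal sketch (class name and defaults are illustrative, not the toolkit's class):

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: int = 32,
                 dropout: float = 0.05):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # freeze pretrained weights
        self.lora_A = nn.Parameter(torch.empty(rank, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.scaling = alpha / rank
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # lora_B starts at zero, so at step 0 this is exactly the base layer
        return self.base(x) + self.dropout(x) @ self.lora_A.T @ self.lora_B.T * self.scaling
```

Because B is zero-initialized, training starts from the pretrained model exactly, and the update BA can be folded back into W after training for zero inference overhead.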
## Configuration

```python
from efficient_vlp.token_merging import ToMeConfig
from efficient_vlp.lora import LoRAConfig

# Token Merging
tome_cfg = ToMeConfig(
    r=8,              # tokens merged per Transformer block
    sx=2, sy=2,       # stride for source token selection
    use_rand=True,    # random source selection (avoids bias)
    merge_attn=True,  # also merge in attention computation
)

# LoRA
lora_cfg = LoRAConfig(
    rank=16,
    alpha=32,
    dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)
```

## Citation

```bibtex
@misc{xu2024efficientvlp,
  title={EfficientVLP: A Practical Toolkit for Efficient Vision-Language Pre-training},
  author={Xu, Haowen},
  year={2024},
  url={https://github.com/suncatchin/efficient-vlp}
}
```

## License

MIT License