# EfficientVLP ⚡

A toolkit for efficient vision-language pre-training and fine-tuning: Token Merging, LoRA/QLoRA, Knowledge Distillation, and more.

Python 3.9+ · MIT License

Vision-language models (VLMs) such as LLaVA, InternVL, and Qwen-VL are expensive to fine-tune and slow to deploy. EfficientVLP is a practical toolkit that combines several orthogonal efficiency techniques under one roof:

| Technique | Speed-up | Memory ↓ | Quality |
|---|---|---|---|
| Token Merging (ToMe) | 1.5–2.0× | 20–30% | ≈ baseline |
| LoRA fine-tuning | – | 60–80% | ≈ full FT |
| QLoRA (4-bit) | – | 85–90% | −0.5–1.5% |
| Flash Attention 2 | 1.5–3× | 30–50% | identical |
| Knowledge Distillation | model-size dependent | – | −2–5% |

## 🚀 Quick Start

```bash
git clone https://github.com/suncatchin/efficient-vlp
cd efficient-vlp
pip install -e ".[full]"
```

### Apply Token Merging to a ViT

```python
import timm

from efficient_vlp.token_merging import ToMeConfig, patch_model

model = timm.create_model("vit_large_patch14_clip_224.openai", pretrained=True)

config = ToMeConfig(r=8)   # merge 8 token pairs per block
patch_model(model, config)

# The patched model now runs with ~15% fewer tokens → faster forward pass
```
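Under the hood, Token Merging is based on bipartite soft matching: tokens are split into two sets, and the `r` most similar cross-set pairs are averaged together. The sketch below illustrates that idea in plain NumPy — it is a simplified stand-in (unweighted averaging, alternating split), not the library's actual `tome.py`, and the function name is illustrative.

```python
import numpy as np

def bipartite_merge(tokens: np.ndarray, r: int) -> np.ndarray:
    """Merge the r most similar (A, B) token pairs by averaging.

    tokens: (N, D) array of token embeddings. Returns (N - r, D).
    """
    a, b = tokens[0::2], tokens[1::2]           # alternate split into sets A and B
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a_n @ b_n.T                           # cosine similarity, shape (|A|, |B|)

    best_b = sim.argmax(axis=1)                 # each A token's best match in B
    best_sim = sim.max(axis=1)
    merge_idx = np.argsort(-best_sim)[:r]       # the r A-tokens with highest similarity

    merged_b = b.copy()
    keep_a = np.ones(len(a), dtype=bool)
    for i in merge_idx:
        merged_b[best_b[i]] = (merged_b[best_b[i]] + a[i]) / 2  # fold A token into B
        keep_a[i] = False
    return np.concatenate([a[keep_a], merged_b], axis=0)
```

Because only the `r` most redundant pairs are merged per block, the token count shrinks gradually through the network, which is why quality degrades gracefully as `r` grows.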

### LoRA fine-tuning

```bash
python scripts/train_lora.py \
  --model_id llava-hf/llava-v1.6-mistral-7b-hf \
  --dataset HuggingFaceM4/VQAv2 \
  --lora_rank 16 \
  --lora_alpha 32 \
  --output_dir ./checkpoints/llava-lora/
```

### QLoRA (4-bit) fine-tuning

```bash
python scripts/train_lora.py \
  --model_id Qwen/Qwen2-VL-7B-Instruct \
  --qlora \
  --lora_rank 64 \
  --output_dir ./checkpoints/qwen2vl-qlora/
```
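QLoRA keeps the frozen base weights in 4-bit precision and trains only the LoRA adapters at higher precision. The core idea — blockwise absmax quantization to a small codebook — can be sketched as follows. This uses a simplified uniform 16-level grid, not bitsandbytes' actual NF4 codebook; function names are illustrative.

```python
import numpy as np

def quantize_4bit(w: np.ndarray, block: int = 64):
    """Blockwise absmax quantization to 16 uniform levels in [-1, 1]."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True)          # per-block absmax
    levels = np.linspace(-1.0, 1.0, 16)                   # 4-bit grid
    # Nearest grid point for each normalized weight:
    idx = np.abs(w[..., None] / scale[..., None] - levels).argmin(axis=-1)
    return idx.astype(np.uint8), scale, levels

def dequantize_4bit(idx, scale, levels):
    """Recover approximate weights: level * per-block scale."""
    return levels[idx] * scale
```

Storage per weight drops from 16 bits to 4 bits plus a shared per-block scale, which is where the ~85–90% memory reduction in the table above largely comes from.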

### Benchmark Token Merging speed

```bash
python scripts/benchmark_tome.py \
  --model clip-vit-large-patch14 \
  --r 0 4 8 12 16 \
  --batch_size 64
```

## 📂 Project Structure

```text
efficient_vlp/
├── token_merging/
│   ├── tome.py              # Core ToMe algorithm (bipartite matching)
│   └── merge_utils.py       # Merge / unmerge helpers
├── lora/
│   ├── lora_layer.py        # LoRA linear layer implementation
│   └── qlora.py             # 4-bit QLoRA with bitsandbytes
├── distillation/
│   └── kd_trainer.py        # Knowledge distillation training loop
├── pruning/
│   └── structured_pruner.py # Structured head/neuron pruning
└── trainer.py               # Unified training entry point
scripts/
├── train_lora.py
└── benchmark_tome.py
```
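The distillation trainer presumably follows the standard Hinton-style recipe: a temperature-softened KL term between teacher and student logits, mixed with the hard-label cross-entropy. A self-contained NumPy sketch of that loss (illustrative, not necessarily the exact implementation in `kd_trainer.py`):

```python
import numpy as np

def softmax(x, T=1.0):
    z = x / T
    z = z - z.max(axis=-1, keepdims=True)     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1).mean()
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels]).mean()
    return alpha * (T ** 2) * kl + (1 - alpha) * ce
```

The `T ** 2` factor keeps the gradient magnitude of the soft term roughly independent of the temperature, so `alpha` alone controls the hard/soft balance.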

## 📊 Token Merging Benchmarks

Evaluated on ViT-L/14 (ImageNet-1k, batch=64, A100):

| r (tokens merged/block) | Throughput (img/s) | Top-1 Acc. |
|---|---|---|
| 0 (baseline) | 412 | 75.3% |
| 4 | 498 (+21%) | 75.1% |
| 8 | 573 (+39%) | 74.8% |
| 12 | 635 (+54%) | 74.3% |
| 16 | 682 (+66%) | 73.5% |

## 📊 LoRA Fine-tuning Benchmarks

Fine-tuned LLaVA-1.6-Mistral-7B on VQAv2 validation (A100 80GB):

| Method | GPU Mem | Train Time | VQAv2 Acc. |
|---|---|---|---|
| Full FT | 75GB | 14h | 81.4% |
| LoRA r=16 | 28GB | 5h | 81.0% |
| LoRA r=64 | 38GB | 7h | 81.3% |
| QLoRA r=64 | 18GB | 8h | 80.7% |
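The GPU-memory gap follows from back-of-envelope arithmetic: full fine-tuning stores weights plus per-parameter gradient and Adam states, while QLoRA keeps 4-bit frozen weights and optimizes only the tiny adapter. The estimate below is a rough upper bound that ignores activations and assumes FP32 Adam states; the adapter parameter count is an assumed round figure, and real runs (with gradient checkpointing, mixed-precision optimizer states, etc.) come in lower.

```python
def gpu_mem_gb(n_params, bytes_per_weight, n_trainable):
    """Weights + per-trainable-param state (FP16 grad ≈ 2 B, Adam m and v in FP32 ≈ 12 B)."""
    weights = n_params * bytes_per_weight
    optim = n_trainable * (2 + 12)
    return (weights + optim) / 1e9

n = 7e9                            # 7B-parameter model
lora = 40e6                        # assumed adapter size at rank 64

full  = gpu_mem_gb(n, 2, n)        # FP16 weights, everything trainable
qlora = gpu_mem_gb(n, 0.5, lora)   # 4-bit frozen base + small adapter
```

Even as a loose bound, the ratio makes the point: the optimizer state for 7B trainable parameters dwarfs the weights themselves, and shrinking the trainable set to an adapter removes almost all of it.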

## ⚙️ Key Configurations

```python
from efficient_vlp.token_merging import ToMeConfig
from efficient_vlp.lora import LoRAConfig

# Token Merging
tome_cfg = ToMeConfig(
    r=8,                    # tokens merged per Transformer block
    sx=2, sy=2,             # stride for source token selection
    use_rand=True,          # random source selection (avoids spatial bias)
    merge_attn=True,        # also merge in the attention computation
)

# LoRA
lora_cfg = LoRAConfig(
    rank=16,
    alpha=32,
    dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)
```
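For reference, the LoRA update itself is just a low-rank residual on each frozen linear layer: `y = x Wᵀ + (alpha / rank) · (x Aᵀ) Bᵀ`, where `A` and `B` are the only trainable matrices. A minimal NumPy forward pass illustrating the math (a sketch, not the repo's `lora_layer.py`, which presumably wraps `nn.Linear`):

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus trainable low-rank update scaled by alpha/rank."""

    def __init__(self, d_in, d_out, rank=16, alpha=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.02, size=(d_out, d_in))   # frozen pre-trained weight
        self.A = rng.normal(0, 0.02, size=(rank, d_in))    # trainable down-projection
        self.B = np.zeros((d_out, rank))                   # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because `B` starts at zero, training begins exactly at the pre-trained model, and after training the update can be merged back into the base weight as `W + scale * B @ A`, so inference pays no extra cost.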

## 📖 Citation

```bibtex
@misc{xu2024efficientvlp,
  title={EfficientVLP: A Practical Toolkit for Efficient Vision-Language Pre-training},
  author={Xu, Haowen},
  year={2024},
  url={https://github.com/suncatchin/efficient-vlp}
}
```

## 📄 License

MIT License
