diff --git a/examples/pruning/minitron_vs_puzzletron/00_prerequisites.ipynb b/examples/pruning/minitron_vs_puzzletron/00_prerequisites.ipynb new file mode 100644 index 00000000000..0cf76664a68 --- /dev/null +++ b/examples/pruning/minitron_vs_puzzletron/00_prerequisites.ipynb @@ -0,0 +1,73 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "27446984", + "metadata": {}, + "source": "# Prerequisites: Data Preparation & Baseline Evaluation (~15 min on 2x H200)\n\nThis notebook has two goals:\n1. **Prepare the distillation dataset** — download [WikiText-103](https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-103-v1) and tokenize it into the binary format expected by Megatron-Bridge. This dataset is used during the distillation step (after pruning) in all scenario notebooks.\n2. **Establish the teacher baseline** — evaluate the original Qwen3-8B on MMLU before any compression.\n\n> **Why prepare the dataset before compression?** The distillation step (which comes *after* pruning) requires a pre-tokenized dataset in Megatron binary format. Preparing it upfront avoids interrupting the compression pipeline and ensures a consistent dataset across all scenarios.\n\n> **Note on calibration datasets:** Pruning also requires calibration data to score the importance of each component — the model runs forward passes on a small dataset to measure how much each neuron, head, or layer contributes to the output. This calibration data (we use [`nvidia/Nemotron-Post-Training-Dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2)) is handled separately in each scenario notebook. Minitron downloads it automatically under the hood, while Puzzletron requires an explicit preparation step. See the respective notebooks ([`scenario1_minitron.ipynb`](scenario1_minitron.ipynb), [`scenario2_puzzletron.ipynb`](scenario2_puzzletron.ipynb), etc.) for details.\n\n**Prerequisites:** Run this notebook from inside the NeMo container with the base model already downloaded at `/workspace/models/Qwen3-8B` (see the guide's Prerequisites section)." + }, + { + "cell_type": "markdown", + "id": "ea318822", + "metadata": {}, + "source": "## 1. Download and tokenize distillation dataset\n\nFor distillation we use [WikiText-103](https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-103-v1) — a small, generic language modeling dataset.\n\nThe `megatron_preprocess_data` utility downloads the dataset directly from the HuggingFace Hub and tokenizes it into the binary `.bin` / `.idx` format expected by Megatron-Bridge in a single step (~2 min)." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "22112d26", + "metadata": {}, + "outputs": [], + "source": "!python -m modelopt.torch.utils.plugins.megatron_preprocess_data \\\n --hf_dataset wikitext \\\n --hf_name wikitext-103-v1 \\\n --hf_split train \\\n --json_keys text \\\n --tokenizer /workspace/models/Qwen3-8B \\\n --output_dir /workspace/datasets/tokenized_qwen3 \\\n --workers 32 \\\n --append_eod \\\n --strip_newlines" + }, + { + "cell_type": "markdown", + "id": "ynlx6sgkqr", + "metadata": {}, + "source": "## 2. Evaluate teacher model (baseline)\n\nBefore compressing, we establish the baseline MMLU score for the original Qwen3-8B. Results in the guide are expressed as a percentage of this number.\n\nWe use [`lm-eval`](https://github.com/EleutherAI/lm-evaluation-harness) — a standard open-source evaluation harness — to run the MMLU benchmark. 
MMLU (Massive Multitask Language Understanding) covers 57 subjects across STEM, humanities, and social sciences, using 4-choice questions. The 5-shot setting provides 5 in-context examples per question, which is the standard configuration for comparing LLMs on this benchmark." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "zzansswg3zq", + "metadata": {}, + "outputs": [], + "source": [ + "!lm_eval --model hf \\\n", + " --model_args pretrained=/workspace/models/Qwen3-8B,dtype=bfloat16 \\\n", + " --tasks mmlu \\\n", + " --num_fewshot 5 \\\n", + " --batch_size 4" + ] + }, + { + "cell_type": "markdown", + "id": "281mvurl3op", + "metadata": {}, + "source": [ + "**Expected result:** MMLU (5-shot) = **0.7493**. This is the teacher baseline used throughout the guide." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/examples/pruning/minitron_vs_puzzletron/README.md b/examples/pruning/minitron_vs_puzzletron/README.md new file mode 100644 index 00000000000..60e708f9d2b --- /dev/null +++ b/examples/pruning/minitron_vs_puzzletron/README.md @@ -0,0 +1,700 @@ +# How to Reduce Your LLM Size and Improve Efficiency with NVIDIA Model-Optimizer: A Pruning & Distillation Guide + +## Table of Contents + +1\. [Introduction](#1-introduction) + +[Part I — Setup & Experiments](#part-i--setup--experiments) + +2. [Prerequisites](#2-prerequisites) +3. [Scenario 1: Quick & Reliable Compression](#3-scenario-1-quick--reliable-compression) +4. [Scenario 2: Hardware-Constrained Compression](#4-scenario-2-hardware-constrained-compression) + +[Part II — Results, Analysis & Insights](#part-ii--results-analysis--insights) + +5. [Head-to-Head: When Does Each Method Win?](#5-head-to-head-when-does-each-method-win) +6. [Distillation: An Impactful Step](#6-distillation-an-impactful-step) +7. [Inference Performance](#7-inference-performance) +8. [Limitations & Practical Tips](#8-limitations--practical-tips) +9. [Open Questions](#9-open-questions) + +10\. [References](#10-references) + +--- + +## 1. Introduction + +As LLMs are deployed across an ever-wider range of platforms — from cloud clusters to edge devices — the ability to produce smaller, faster models from existing ones becomes essential. +**Structural compression** (removing parameters from the model itself) is one of the most effective levers available to achieve this. + +This guide walks you through two concrete scenarios for shrinking an LLM using [NVIDIA Model-Optimizer](https://github.com/NVIDIA/Model-Optimizer) (ModelOpt), each using a different compression method. + +Throughout this guide, we use [**Qwen3-8B**](https://huggingface.co/Qwen/Qwen3-8B) as our base model — a dense, Transformer-based, decoder-only LLM with 8B parameters and 36 layers. All compressed variants are evaluated with [**MMLU**](https://arxiv.org/abs/2009.03300) **(5-shot)**. Companion Jupyter notebooks are provided so you can reproduce every result on this model end-to-end. + +> **MMLU (Massive Multitask Language Understanding)** is a benchmark covering 57 subjects across STEM, humanities, social sciences, and more. 
Each question is a 4-choice multiple choice problem, giving a random baseline of 25%. The 5-shot variant provides 5 in-context examples before each question. + +| | Scenario 1 | Scenario 2 | +|---|---|---| +| **Goal** | Make my general-purpose model smaller and faster, quickly and reliably | Fit a strict hardware memory budget | +| **Usecases** | Create a smaller version of the same model architecture to form a consistent family | Create a single, highly optimized deployment model that fits specific hardware budget | +| **Compression** | Light/Moderate (10–20% parameter reduction) | Aggressive (>30% memory reduction) | +| **Method** | Homogeneous Pruning ([Minitron](https://arxiv.org/abs/2408.11796)) | Heterogeneous NAS-based Pruning ([Puzzletron](https://arxiv.org/html/2411.19146v3)) | +| **Complexity** | Low — Importance-based ranking, uniform pruning | High — Fine-grained NAS search + MIP optimization | +| **Output** | Homogeneous (all Transformer blocks have the same structure) | Heterogeneous architecture (variable layer widths) | + +Both paths are followed by **knowledge distillation**, which recovers accuracy lost during pruning. In our Qwen3-8B experiments, we show that significant recovery (in MMLU) is possible with as few as 100 training iterations on a small dataset, though actual recovery will vary by model and compression level. + +The overall pipeline is the same for both scenarios — only the compression step differs: + +```mermaid +flowchart LR + classDef model fill:#D5E8D4,stroke:#82B366,color:#000 + classDef dataprep fill:#DAE8FC,stroke:#6C8EBF,color:#000 + classDef compress fill:#FFE6CC,stroke:#D6B656,color:#000 + classDef eval fill:#E1D5E7,stroke:#9673A6,color:#000 + classDef distill fill:#F8CECC,stroke:#B85450,color:#000 + + A["Baseline Model
(Qwen3-8B)"]:::model --> EVAL0["Evaluation
(baseline)"]:::eval + A --> B["Distillation Data Prep
WikiText-103"]:::dataprep + A --> C{"Compression
Method"}:::compress + C --> D["Minitron
Homogeneous pruning
(Calibration Data Prep under the hood)"]:::compress + C --> CAL["Calibration Data Prep
Nemotron-Post-Training-v2"]:::dataprep --> E["Puzzletron
Heterogeneous NAS"]:::compress + D --> F["Evaluation
(pruned)"]:::eval + E --> F + F --> G["Knowledge Distillation
Qwen3-8B -> student"]:::distill + B --> G + G --> H["Evaluation
(distilled)"]:::eval +``` + +### What this guide covers + +- **Pruning**: structurally removing neurons, attention heads, or entire layers to produce a smaller model. +- **Distillation**: transferring knowledge from the original (teacher) model to the pruned (student) model to recover accuracy lost during pruning. + +### What this guide does NOT cover + +- **Quantization**: reducing numerical precision (e.g. FP16 → INT8). +- **Sparsity**: zeroing out weights while keeping the architecture. +- **MoE and hybrid architectures**: this guide focuses on dense Transformer models. For an end-to-end Minitron pruning + distillation + FP8 PTQ + vLLM deployment example on a Mamba-Transformer hybrid, see the [Nemotron-Nano-9B-v2 tutorial](../minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md). Compression of Mixture-of-Experts (MoE) architectures will be covered in a future guide. + +> **Note:** Pruning and quantization are complementary. After following this guide, you can further compress your pruned model with quantization for additional deployment gains. + +### The two methods at a glance + +**Minitron is a special case of Puzzletron**: any architecture Minitron can produce, Puzzletron can also find. Both follow the same pipeline (find a smaller architecture, then recover accuracy via distillation); they score the components of each Transformer layer (neurons, attention heads, FFN widths) and remove the ones that contribute least to the model's output. What distinguishes them is how fine-grained that search is. + +- **Minitron** applies *homogeneous pruning*: the same pruning decision is applied across all layers simultaneously. The compression target is a **parameter count** (e.g. "reduce to 7B"; direct memory-budget targeting is on the Minitron roadmap). The result is a standard, smaller model with the same architecture type as the original. Fast and simple. + +- **Puzzletron** applies *heterogeneous pruning* via Neural Architecture Search (NAS): it evaluates multiple candidate configurations for each layer independently (different FFN widths, optional attention removal), then uses Mixed-Integer Programming (MIP) to find the optimal per-layer combination under a given resource constraint (e.g. a **memory budget**). The result is a model where each layer can have a different structure, tailored to a specific hardware budget. More powerful, but slower. + +Puzzletron's per-layer search space is much broader than Minitron's. The trade-off is complexity: Minitron is the right default for moderate, predictable, general-purpose compression; Puzzletron becomes necessary when you need to maximize accuracy under a hard hardware constraint. + +### How to read this guide + +- **"I need a smaller and faster general-purpose model — quickly and reliably"** → go to [Scenario 1: Quick & Reliable Compression (Section 3)](#3-scenario-1-quick--reliable-compression). +- **"I must fit a strict memory budget"** → go to [Scenario 2: Hardware-Constrained Compression (Section 4)](#4-scenario-2-hardware-constrained-compression). +- **"Which method should I use?"** → read end-to-end. Section 5 compares both methods head-to-head. + +**Guide organization:** This guide is split into two parts. **Part I — Setup & Experiments** (Sections 2–4) covers the technicalities needed to reproduce the experiments: environment setup and step-by-step walkthroughs for each scenario. 
**Part II — Results, Analysis & Insights** (Sections 5–9) focuses on what those experiments reveal: a head-to-head comparison, a deep dive into distillation, inference benchmarks, practical tips and limitations, and open questions for future work. + +> **Note on reproducibility:** All experiments in this guide were run on the [NeMo container 26.02](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo?version=26.02.00) with `nvidia-modelopt` 0.43.0. The pruning frameworks and scoring mechanisms used by Minitron and Puzzletron are under active development. As ModelOpt evolves, exact evaluation numbers may differ from one release to another. The trends and comparative insights presented in this guide (which method wins in which regime, how distillation behaves, and the accuracy-compression trade-offs) are expected to remain consistent. + +--- + + + +## Part I — Setup & Experiments + +--- + +## 2. Prerequisites + +### Hardware + +All experiments in this guide were run on **2x H200 GPUs**. This can be adapted to different GPU counts and types depending on your setup. Adjust tensor/pipeline parallelism and batch size accordingly. + +> **Architecture requirement:** The NeMo container and ModelOpt scripts used in this guide require an **x86-64 (AMD64)** host. + +### Clone ModelOpt + +Clone the ModelOpt repository on your host machine. It will be mounted into the container in the next step, so any changes you make to the source persist across sessions: + +```bash +export MODELOPT_DIR=${PWD}/Model-Optimizer +git clone https://github.com/NVIDIA/Model-Optimizer.git ${MODELOPT_DIR} +chmod -R 777 ${MODELOPT_DIR} +``` + +> **Permissions:** The `chmod -R 777` ensures the container (running as root) can write to the mounted directory. + +### Container + +> **Setup source of truth:** ModelOpt evolves quickly. The instructions below reflect the setup used at the time this guide was written and are provided as a working example. For the most up-to-date container version and installation steps, refer to: +> [megatron_bridge/README.md](https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/megatron_bridge/README.md) + +We use the [NVIDIA NeMo Framework Docker container (26.02)](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo?version=26.02.00), which includes all the libraries needed pre-installed (including [Megatron-Bridge](https://github.com/NVIDIA-Nemo/Megatron-Bridge) — NVIDIA's library that bridges HuggingFace models with the Megatron-core framework, enabling efficient multi-GPU distillation). + +You need [Docker](https://docs.docker.com/get-docker/) and the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) to enable GPU access inside containers. + +Launch the container with the cloned repo mounted at `/opt/Model-Optimizer`: + +```bash +docker run \ + --gpus all \ + --shm-size=16GB \ + --net=host \ + --ulimit memlock=-1 \ + --rm -it \ + -v ${MODELOPT_DIR}:/opt/Model-Optimizer \ + -w /workspace \ + nvcr.io/nvidia/nemo:26.02 bash -c "umask 000 && exec bash" +``` + +### Install dependencies + +Once inside the container, uninstall the pre-existing `nvidia-modelopt`, `lm_eval`, and `nvidia_lm_eval` so they don't cause version conflicts. Then install ModelOpt from the cloned repo as an editable package with the `hf` and `puzzletron` extras, and add the extra Puzzletron-example dependencies (which include `lm-eval` for benchmark evaluation). 
Together these steps cover the dependencies for all scenarios in this guide: + +```bash +/usr/bin/python3 -m pip uninstall -y nvidia-modelopt +python -m pip uninstall -y lm_eval nvidia_lm_eval +cd /opt/Model-Optimizer && python -m pip install -e ".[hf,puzzletron]" +python -m pip install -r /opt/Model-Optimizer/examples/puzzletron/requirements.txt +``` + +### Base model + +We use [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) as our base model throughout this guide. It is a dense, decoder-only Transformer with the following architecture: + +| Layers | Hidden size | FFN intermediate size | Attention heads (Q) | KV heads (GQA) | Parameters | +|---|---|---|---|---|---| +| 36 | 4096 | 12288 | 32 | 8 | ~8B | + +Each Transformer block has two components, both of which are targeted by the compression methods in this guide: + +- **Attention (GQA — Grouped Query Attention):** Computes contextual relationships between tokens. Qwen3-8B uses 8 shared KV heads for 32 query heads, which reduces the KV cache. Puzzletron can remove attention entirely from selected layers, making it the primary lever for memory reduction. +- **FFN (Feed-Forward Network):** A per-token MLP applied after attention. The intermediate size (12288) controls its capacity. Puzzletron can reduce this per layer; Minitron reduces it uniformly across all layers. + +Authenticate with HuggingFace and download Qwen3-8B: + +```bash +hf auth login --token +hf download Qwen/Qwen3-8B --local-dir /workspace/models/Qwen3-8B +``` + +### Companion notebooks + +| Notebook | Description | Runtime (2x H200) | +|---|---|---| +| [`00_prerequisites.ipynb`](00_prerequisites.ipynb) | Prepare the data + Baseline evaluation | ~15 min | +| [`scenario1_minitron.ipynb`](scenario1_minitron.ipynb) | Scenario 1 — Minitron | ~1h45 | +| [`scenario1_puzzletron.ipynb`](scenario1_puzzletron.ipynb) | Scenario 1 — Puzzletron | ~6h (on first Puzzletron run) | +| [`scenario2_minitron.ipynb`](scenario2_minitron.ipynb) | Scenario 2 — Minitron | ~45 min | +| [`scenario2_puzzletron.ipynb`](scenario2_puzzletron.ipynb) | Scenario 2 — Puzzletron | ~6h15 (on first Puzzletron run) | + +From within the container, run the notebooks directly from the mounted ModelOpt repo so any edits you make persist on your host machine. Start Jupyter Lab in the notebook directory: + +```bash +cd /opt/Model-Optimizer/examples/pruning/minitron_vs_puzzletron +pip install --upgrade ipywidgets notebook +jupyter lab --ip 0.0.0.0 --port=8888 --allow-root +``` + +--- + +## 3. Scenario 1: Quick & Reliable Compression + +> *"I need a smaller and faster general-purpose model — quickly and reliably."* + +You have a working LLM and want to reduce its size by up to 20% (in number of parameters) for general-purpose tasks. You need a straightforward pipeline with predictable results and a standard, homogeneous model as output. You don't want to invest time experimenting and familiarizing with a complex pipeline or dealing with heterogeneous model formats. + +**Minitron** is the right tool for this job. + +### 3.1 When to choose this path + +- Your compression target is **moderate** (10–20% reduction in number of parameters). +- You want a **simple, fast pipeline**: prune → distill → deploy. +- You need a **standard/homogeneous model** as output (same architecture type, just smaller). +- You value **predictability**: Minitron's importance-based ranking produces consistent results. +- You are not targeting a specific downstream task (**general-purpose** compression). 
+ +**Examples:** +- You're serving a 70B model on 4x H100s via TensorRT-LLM. Pruning it to ~56B lets you serve on 2x H100s with the same architecture, cutting your GPU cost in half overnight. +- Your team maintains one large base model and needs to quickly ship multiple size variants (e.g. 8B / 7B / 6B) for different customer SLAs. Minitron lets you derive them all from the same checkpoint with a single script, instead of training each variant independently. + +### 3.2 How Minitron works + +Minitron compresses a model in two stages: + +**Stage 1 — Importance-based pruning.** Minitron supports two complementary pruning strategies: + +- **Depth pruning**: removes entire Transformer layers. Layers are ranked by *perplexity-based scoring* or *block importance* (measuring each layer's contribution to the model's output), and the least important ones are dropped. +- **Width pruning**: reduces the dimensions within each layer uniformly. Neurons and heads are ranked by *activation-based importance scoring* during a calibration pass, and the lowest-ranked ones are removed across all layers. + +Both strategies can be combined. An optional automatic NAS search can be enabled to explore the space of (depth, width) configurations and select the best one for a given parameter target. The result is a standard, homogeneous model. + +**Stage 2 — Knowledge distillation.** The pruned model (student) is trained to mimic the original model (teacher) using logits-based KL divergence loss. + +### 3.3 Walkthrough: Qwen3-8B → 7B parameters + +**→ Data preparation:** Run notebook [`00_prerequisites.ipynb`](00_prerequisites.ipynb) to prepare the data and evaluate the original model. + +**→ Minitron pruning and distillation:** Run notebook [`scenario1_minitron.ipynb`](scenario1_minitron.ipynb) for the full end-to-end pipeline (prune → distill → evaluate). + +#### Results + +| Model | Layers | Hidden Size | FFN Intermediate | Parameters | MMLU (5-shot) | % of Teacher | +|---|---|---|---|---|---|---| +| Qwen3-8B (teacher) | 36 | 4096 | 12288 | 8B | 0.7493 | 100% | +| Minitron — pruned | 32 | 3840 | 12288 | 6.96B | 0.7038 | 93.9% | +| Minitron — pruned + distilled | 32 | 3840 | 12288 | 6.96B | **0.7166** | **95.6%** | + +Distillation recovers **+1.28 percentage points** of MMLU accuracy with just 100 iterations on [WikiText-103](https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-103-v1). + +### 3.4 Comparison with Puzzletron at the same parameter target + +To validate that Minitron is the right choice for this scenario, we also ran Puzzletron at the same ~7B parameter target. Puzzletron produces a 36-layer heterogeneous model with variable FFN widths per layer (some as low as 2560) and selective attention removal. + +→ See notebook [`scenario1_puzzletron.ipynb`](scenario1_puzzletron.ipynb) to reproduce this run. + +| Model | Parameters | MMLU (pruned) | MMLU (distilled) | % of Teacher | +|---|---|---|---|---| +| **Minitron 7B** | 6.96B | 0.7038 | **0.7166** | **95.6%** | +| Puzzletron 7B | 6.99B | 0.6621 | 0.6823 | 91.1% | + +**Minitron wins on MMLU by +3.43 percentage points after distillation.** + +> **Note on Puzzletron search space:** The Puzzletron run above used a limited search space. A broader search (more FFN size candidates, more block variants) could potentially find better architectures; but this comes at a cost. Each additional candidate increases scoring time, calibration GPU hours, and MIP complexity. 
At moderate compression targets, the marginal gains from expanding the search are unlikely to justify the additional pipeline complexity and compute investment. This is precisely where Minitron's simplicity shines. + +**Takeaway:** For moderate, general-purpose compression where you want reliability and simplicity, Minitron is the practical default — it delivers strong accuracy on general knowledge benchmarks like MMLU, with a simpler pipeline and a standard model format. + +> **Important Note (benchmark-dependent behavior):** The comparison above uses MMLU only (general-purpose). On other benchmarks, the ranking between Minitron and Puzzletron can flip at this compression level. See [Section 5.4](#54-benchmark-specific-behavior) for a full multi-benchmark analysis, especially if your use case targets a specific downstream task. + +--- + +## 4. Scenario 2: Hardware-Constrained Compression + +> *"I must fit a strict memory budget."* + +You need to deploy an LLM on hardware with a hard memory ceiling (an edge device like NVIDIA Jetson, a specific GPU with limited VRAM, etc.). The compression is aggressive (>20%), and you are willing to invest in a more complex pipeline to squeeze the best possible accuracy out of your budget. + +**Puzzletron** is the right tool for this job. + +### 4.1 When to choose this path + +- You have a **hard hardware constraint** expressed as a memory budget (e.g. "must fit within 78,000 MiB"). +- Your compression target is **aggressive** (>20% reduction). +- You are willing to invest in a **more complex pipeline** (NAS search, MIP optimization) to maximize accuracy within the constraint. +- A **heterogeneous model** (different layer widths, selective attention removal) is acceptable for your deployment. + +**Examples:** +- You need to deploy an 8B model on an edge device such as an NVIDIA Jetson AGX Orin (64 GB). Puzzletron can target the exact memory budget your application allows. +- You're building a latency-optimized model for real-time inference where removing attention from certain layers directly reduces compute per token, and you want NAS to find the optimal trade-off automatically. + +### 4.2 How Puzzletron works + +Puzzletron compresses a model through an automated NAS pipeline: + +**Step 1 — Build a replacement library.** For each Transformer layer, Puzzletron generates a set of candidate block variants: the original block, blocks with reduced FFN widths (e.g. 10240, 8192, 5120, 2560), and blocks with attention removed entirely. Each variant is scored for quality and cost (memory footprint, parameter count, ...). + +**Step 2 — MIP optimization.** A Mixed-Integer Program takes the full library of per-layer candidates and their quality/cost scores, and finds the optimal combination that minimizes total quality loss subject to the target constraints (e.g. memory ≤ 78,000 MiB). This is what makes Puzzletron *heterogeneous*: the solver can choose a different configuration for every layer. + +**Step 3 — Knowledge distillation.** Same as Minitron: the assembled heterogeneous model (student) is distilled against the original model (teacher) using logits-based KL divergence loss. + +### 4.3 Walkthrough: Qwen3-8B - 126,215 MiB → 78,000 MiB memory target + +**→ Data preparation:** Run notebook [`00_prerequisites.ipynb`](00_prerequisites.ipynb) to prepare the data and evaluate the original model (if not already done). 
+ +**→ Puzzletron NAS and distillation:** Run notebook [`scenario2_puzzletron.ipynb`](scenario2_puzzletron.ipynb) for the full end-to-end pipeline (prune → NAS search → distill → evaluate). + +#### Results + +| Model | Layers | Memory Footprint | MMLU (5-shot) | % of Teacher | +|---|---|---|---|---| +| Qwen3-8B (teacher) | 36 | 126,215 MiB | 0.7493 | 100% | +| Puzzletron — pruned | 36 | 77,992 MiB | 0.2752 | 36.7% | +| Puzzletron — pruned + distilled | 36 | 77,992 MiB | **0.5613** | **74.9%** | + +The pre-distillation accuracy is near-random (MMLU has a 25% baseline for 4-choice questions); this is expected at >35% compression. Distillation recovers **+28.61 percentage points**, transforming a non-functional model into a usable one. + +
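+As a sanity check on the numbers above, the recovery and "% of Teacher" figures are simple to recompute (a minimal sketch in Python; the scores are the MMLU values from the table):
+
+```python
+teacher, pruned, distilled = 0.7493, 0.2752, 0.5613
+
+recovery_pp = (distilled - pruned) * 100       # points recovered by distillation
+pct_of_teacher = distilled / teacher * 100     # accuracy relative to the uncompressed teacher
+
+print(f"recovery: +{recovery_pp:.2f}pp, {pct_of_teacher:.1f}% of teacher")
+# -> recovery: +28.61pp, 74.9% of teacher
+```
+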
+<details>
+<summary>Puzzletron architecture details — per-layer block configuration (click to expand)</summary>
+
+```text
+block_0: attention kv_heads_8 ffn intermediate_12288
+block_1: attention kv_heads_8 ffn intermediate_5120
+block_2: attention kv_heads_8 ffn intermediate_5120
+block_3: attention kv_heads_8 ffn intermediate_7424
+block_4: attention no_op ffn intermediate_9984
+block_5: attention no_op ffn intermediate_9984
+block_6: attention kv_heads_8 ffn intermediate_12288
+block_7: attention no_op ffn intermediate_9984
+block_8: attention no_op ffn intermediate_9984
+block_9: attention no_op ffn intermediate_9984
+block_10: attention no_op ffn intermediate_12288
+block_11: attention no_op ffn intermediate_12288
+block_12: attention kv_heads_8 ffn intermediate_12288
+block_13: attention kv_heads_8 ffn intermediate_12288
+block_14: attention kv_heads_8 ffn intermediate_12288
+block_15: attention kv_heads_8 ffn intermediate_12288
+block_16: attention no_op ffn intermediate_9984
+block_17: attention kv_heads_8 ffn intermediate_12288
+block_18: attention kv_heads_8 ffn intermediate_12288
+block_19: attention kv_heads_8 ffn intermediate_12288
+block_20: attention no_op ffn intermediate_7424
+block_21: attention kv_heads_8 ffn intermediate_12288
+block_22: attention kv_heads_8 ffn intermediate_12288
+block_23: attention kv_heads_8 ffn intermediate_12288
+block_24: attention kv_heads_8 ffn intermediate_12288
+block_25: attention no_op ffn intermediate_12288
+block_26: attention no_op ffn intermediate_12288
+block_27: attention no_op ffn intermediate_12288
+block_28: attention no_op ffn intermediate_12288
+block_29: attention kv_heads_8 ffn intermediate_12288
+block_30: attention no_op ffn intermediate_12288
+block_31: attention no_op ffn intermediate_12288
+block_32: attention kv_heads_8 ffn intermediate_12288
+block_33: attention kv_heads_8 ffn intermediate_12288
+block_34: attention kv_heads_8 ffn intermediate_12288
+block_35: attention kv_heads_8 ffn intermediate_9984
+```
+
+</details>
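+
+To inspect what the MIP solver actually chose, the per-layer configuration above can be summarized with a few lines of Python (a sketch; it assumes the configuration was saved to a text file in the exact format shown above, under a hypothetical name):
+
+```python
+from collections import Counter
+
+# One line per block, e.g. "block_4: attention no_op ffn intermediate_9984"
+with open("puzzle_block_config.txt") as f:  # hypothetical dump of the listing above
+    specs = [line.split(": ", 1)[1].split() for line in f if line.strip()]
+
+attention = [s[1] for s in specs]                                # "kv_heads_8" or "no_op"
+ffn_widths = [int(s[3].removeprefix("intermediate_")) for s in specs]
+
+print(f"attention removed in {attention.count('no_op')}/{len(specs)} layers")
+print("FFN width distribution:", dict(Counter(ffn_widths)))
+```
+
+On the configuration above this reports attention removed in 15 of 36 layers, with FFN widths ranging from 5120 to 12288: dropping KV-cache-heavy attention blocks is the solver's main lever for hitting the memory budget.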
+
+### 4.4 Comparison with Minitron at the same memory target
+
+To validate that Puzzletron is the right choice for this scenario, we also ran Minitron at the same memory budget. To match ~78,000 MiB, Minitron drops 14 of 36 layers (keeping 22), producing a 5.49B parameter model.
+
+→ See notebook [`scenario2_minitron.ipynb`](scenario2_minitron.ipynb) to reproduce this run.
+
+| Model | Memory Footprint | MMLU (pruned) | MMLU (distilled) | % of Teacher |
+|---|---|---|---|---|
+| **Puzzletron 78k** | 77,992 MiB | 0.2752 | **0.5613** | **74.9%** |
+| Minitron 78k | 78,054 MiB | 0.2351 | 0.4620 | 61.7% |
+
+**Puzzletron wins on MMLU by +9.93 percentage points after distillation.**
+
+At this extreme compression level, Minitron's strategy of dropping entire layers removes too many reasoning pathways. Puzzletron's approach (keeping all 36 layers but surgically thinning them) preserves the model's depth and gives distillation more structure to work with.
+
+**Takeaway:** For aggressive compression under a hard memory constraint, Puzzletron's heterogeneous NAS consistently outperforms Minitron's uniform pruning. The additional pipeline complexity is justified by a significant accuracy advantage.
+
+> **Important Note (benchmark-dependent behavior):** The comparison above uses MMLU only. On other benchmarks, the ranking between Minitron and Puzzletron can flip. However, at this aggressive compression level, our experiments show that Puzzletron wins across the board, outperforming Minitron on every benchmark we evaluated. See [Section 5.4](#54-benchmark-specific-behavior) for the full multi-benchmark results.
+
+---
+
+
+
+## Part II — Results, Analysis & Insights
+
+---
+
+## 5. Head-to-Head: When Does Each Method Win?
+
+Sections 3 and 4 (Part I) showed the recommended method for each scenario. Here we consolidate all results to reveal the full picture.
+
+### 5.1 Summary of all experiments
+
+| Compression Target | Method | Parameters | Memory | MMLU (pruned) | MMLU (distilled) | % of Teacher |
+|---|---|---|---|---|---|---|
+| **7B params (~14%)** | **Minitron** | **6.96B** | **111,570 MiB** | **0.7038** | **0.7166** | **95.6%** |
+| 7B params (~14%) | Puzzletron | 6.99B | 123,929 MiB | 0.6621 | 0.6823 | 91.1% |
+| **78,000 MiB (~38%)** | **Puzzletron** | **7.07B** | **77,992 MiB** | **0.2752** | **0.5613** | **74.9%** |
+| 78,000 MiB (~38%) | Minitron | 5.49B | 78,054 MiB | 0.2351 | 0.4620 | 61.7% |
+
+### 5.2 Why each method wins in its regime
+
+In theory, Minitron is a subset of Puzzletron: any architecture Minitron can find, Puzzletron could also find if its search space were large enough. But the search space must be finite, and expanding it comes with significant compute and complexity costs. This is why each method has its own sweet spot.
+
+**Scenario 1 (moderate compression):**
+
+On MMLU, Minitron outperforms Puzzletron at this level (+3.43pp post-distill). Its uniform pruning directly targets a parameter count, produces a clean architecture in a single step, and avoids the complexity of a full NAS pipeline. At the same time, this advantage is benchmark-dependent: on other benchmarks, Puzzletron could retain more of the teacher's accuracy than Minitron (see [Section 5.4](#54-benchmark-specific-behavior)). This means there is no silver-bullet approach: different compression methods have their winning territories even at moderate compression. That said, once pipeline complexity and Minitron's standard output format are factored in, Minitron remains the practical default for most general-purpose compression needs at this level.
+
+Moreover, Minitron can be applied **iteratively**: for example, prune 20%, distill, then prune another 20% and distill again. This staged schedule typically preserves more quality than a single, more aggressive pruning step at the same overall parameter reduction.
+
+**Scenario 2 (aggressive memory compression):**
+
+Puzzletron becomes essential. When the target is a hard memory budget, Puzzletron can optimize for it directly via MIP constraints, whereas Minitron optimizes for a parameter count, and mapping parameter targets to memory budgets is indirect and suboptimal. More importantly, at this level of compression, Minitron acts like a butcher (dropping entire layers), while Puzzletron acts like a surgeon (selectively thinning FFN widths and removing attention per-layer). The surgical approach preserves far more model structure, giving distillation more to work with. This is why Puzzletron recovers to 74.9% of the teacher vs. Minitron's 61.7%.
+
+### 5.3 Accuracy vs. compression
+
+![Pruning + Distillation Results on Qwen3-8B](figures/summary_chart.png)
+
+### 5.4 Benchmark-specific behavior
+
+The MMLU-based comparisons in Sections 3–4 and the summary table above tell only part of the story. Evaluating the same compressed models on [HellaSwag](https://arxiv.org/abs/1905.07830) (commonsense reasoning) and [GSM8K](https://arxiv.org/abs/2110.14168) (math reasoning) reveals that **the best compression method depends on the benchmark**.
+
+**Scenario 1 — 7B parameter target (% of teacher, post-distillation)**
+
+| Benchmark | Minitron 7B | Puzzletron 7B | Winner |
+|---|---|---|---|
+| MMLU | **95.6%** | 91.1% | **Minitron** |
+| HellaSwag acc_norm | 88.3% | **91.4%** | **Puzzletron** |
+| GSM8K strict | 83.1% | **92.8%** | **Puzzletron** |
+
+**Scenario 2 — 78,000 MiB memory target (% of teacher, post-distillation)**
+
+| Benchmark | Puzzletron 78k | Minitron 78k | Winner |
+|---|---|---|---|
+| MMLU | **74.9%** | 61.7% | **Puzzletron** |
+| HellaSwag acc_norm | **87.7%** | 64.4% | **Puzzletron** |
+| GSM8K strict | **53.2%** | 3.7% | **Puzzletron** |
+
+Several observations stand out:
+
+**At moderate compression, the winner depends on the benchmark:** In Scenario 1, Minitron leads on MMLU (~96% vs. ~91% of teacher), while Puzzletron leads on HellaSwag acc_norm (~91% vs. ~88%) and GSM8K (~93% vs. ~83%). Minitron's MMLU edge suggests that Puzzletron's additional pipeline complexity may not be warranted for general-purpose compression, while Puzzletron's lead on the other two benchmarks suggests that heterogeneous pruning better preserves reasoning capabilities. There is no silver-bullet approach: different compression algorithms have their winning territories, and pipeline complexity should also be taken into account.
+
+**At aggressive compression, Puzzletron wins across the board:** In Scenario 2, there is no benchmark where Minitron comes close. The advantage is especially stark on GSM8K, where Minitron retains only 3.7% of the teacher's accuracy vs. Puzzletron's 53.2%. This suggests that per-layer selective pruning keeps critical reasoning pathways that Minitron's uniform approach removes.
+
+> **Note:** The companion notebooks reproduce only the MMLU evaluations end-to-end. The HellaSwag and GSM8K results reported here were obtained using the same `lm-eval` harness on the same compressed checkpoints.
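+
+For reference, the supplementary scores can be reproduced with `lm-eval`'s Python API (a sketch; the checkpoint path follows the layout used in the appendix, and metric key names may vary slightly across `lm-eval` versions):
+
+```python
+from lm_eval import simple_evaluate
+
+# Same harness as the notebook's MMLU run, pointed at a compressed checkpoint;
+# swap in "gsm8k" (with num_fewshot=5) for the math benchmark.
+results = simple_evaluate(
+    model="hf",
+    model_args="pretrained=/workspace/output/distilled_Qwen3-8B-Puzzle-7B,dtype=bfloat16",
+    tasks=["hellaswag"],
+    batch_size=4,
+)
+print(results["results"]["hellaswag"])  # includes the acc_norm metric
+```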
+ +### 5.5 Extra insights: accuracy across the full memory compression spectrum + +The two scenarios in this guide represent two specific points on a continuous compression curve. Complementary experiments using Puzzletron's MIP sweep mode (which re-runs the MIP solver across multiple memory targets without repeating the full NAS pipeline) allowed us to sample additional points and compare both methods side-by-side across the full spectrum. + +
+<details>
+<summary>Click to expand — chart + observations across the full compression spectrum</summary>
+
+![Puzzletron vs. Minitron Memory Sweep on Qwen3-8B](figures/memory_sweep_combined.png)
+
+Several observations stand out:
+
+**At 90% memory, both methods are nearly equivalent.** Post-distillation accuracy is 0.7415 (Minitron) vs. 0.7406 (Puzzletron); a 0.1pp gap that is well within noise. At this compression level, Minitron's simplicity makes it the clear practical choice.
+
+**At 80% memory, Minitron wins post-distillation, despite losing pre-distillation.** Before distillation, Puzzletron (0.5910) leads Minitron (0.5084) by +8.3pp. After distillation, Minitron (0.7302) overtakes Puzzletron (0.6921) by +3.8pp. This is a concrete example of the architecture ranking flip described in [Section 6.4](#64-architecture-ranking-can-flip-after-distillation).
+
+**The crossover point lies somewhere between 20% and 38% compression.** Below ~20% compression, Minitron consistently wins post-distillation. Beyond ~38%, Puzzletron pulls decisively ahead. The exact crossover will depend on the model, the distillation budget, and the Puzzletron search space — but this range provides a practical guideline.
+
+</details>
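+
+The crossover observations above can be condensed into a rough rule of thumb (a sketch; the thresholds come from this single Qwen3-8B sweep and will shift with the model, distillation budget, and search space):
+
+```python
+def pick_method(memory_reduction_pct: float, hard_memory_budget: bool = False) -> str:
+    """Rule-of-thumb method selection from the Qwen3-8B memory sweep."""
+    if hard_memory_budget or memory_reduction_pct >= 38:
+        return "puzzletron"      # pulls decisively ahead at aggressive compression
+    if memory_reduction_pct <= 20:
+        return "minitron"        # consistently wins post-distillation in this range
+    return "crossover zone"      # ~20-38%: benchmark both on your target task
+```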
+ +### 5.6 Decision rules + +| If... | Then use... | +|---|---| +| Compression is <20% and general-purpose | **Minitron** | +| You need a standard/homogeneous model | **Minitron** | +| Compression is >20% | **Puzzletron** | +| You have a hard memory budget | **Puzzletron** | +| You want minimal pipeline complexity | **Minitron** | +| You want maximum accuracy at any cost | **Puzzletron** | + +--- + +## 6. Distillation: An Impactful Step + +### 6.1 Why distillation matters for both methods + +Pruning removes parameters, but the remaining weights were trained in the context of the full model. They don't "know" their neighbors have been removed. Distillation re-aligns the pruned model's representations with the teacher's, allowing it to recover accuracy by learning to produce similar output distributions. + +This applies equally to Minitron and Puzzletron. Regardless of how the model was pruned (uniformly or heterogeneously), the student benefits from being guided by the teacher's logits. + +### 6.2 How little data you actually need + +In our Qwen3-8B experiments, we used a deliberately minimal distillation setup: + +| Parameter | Value | +|---|---| +| Dataset | WikiText-103 (train split) | +| Iterations | 100 | +| Tokens processed | ~1.6M | + +The results were remarkable: +1.28pp to +28.61pp of MMLU recovery depending on the compression level, using a small generic dataset and just 100 iterations. Note that more extensive distillation (using more iterations, larger datasets, or higher-quality data such as [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2)) can enable further recovery. + +> **Important caveat:** This fast convergence is specific to Qwen3-8B and should not be generalized. In other settings, distillation may require billions of tokens and thousands of iterations to converge. The Qwen3-8B result demonstrates that distillation can be surprisingly efficient, but your mileage will vary. Always monitor loss curves to verify convergence before stopping training. + +### 6.3 Distillation loss curves + +The plot below shows the training and validation distillation loss (KL divergence) for both methods at the 7B parameter target: + +![Distillation loss curves — Scenario 1](figures/distillation_curves.png) + +Both curves converge smoothly, with Puzzletron maintaining a consistently lower loss throughout: its heterogeneous architecture preserves more of the original model's behavior, even though it ultimately scores lower on MMLU (0.6823 vs. 0.7166). This confirms that **distillation loss alone does not predict downstream task accuracy**. + +### 6.4 Architecture ranking can flip after distillation + +An important insight: **the architecture that looks best before distillation is not necessarily the one that recovers best after distillation.** The 80% memory case in [Section 5.5](#55-extra-insights-accuracy-across-the-full-memory-compression-spectrum) is a concrete example: Puzzletron's pruned model leads Minitron by +8.3pp before distillation, yet Minitron overtakes it by +3.8pp after. More generally, we observed this ranking reversal at multiple compression levels. + +A promising improvement to address this is **Blockwise Local Distillation (BLD)**. BLD locally trains block variants *before* the MIP assembly step, so the search prefers blocks that are "distillable" and compatible after reassembly, not just blocks that look good as immediate swaps. 
The experiments in this guide did not use BLD; adding it on top of the Puzzletron pipeline described here is expected to further improve post-distillation accuracy. + +--- + +## 7. Inference Performance + +Sections 5 and 6 focused on accuracy. But for deployment, throughput and latency matter just as much. Here we benchmark all compressed models from Scenarios 1 and 2 on a single GPU using [vLLM](https://docs.vllm.ai) and compare their serving performance against the original Qwen3-8B. + +### 7.1 Experiment setup + +**vLLM with AnyModel support:** Serving Puzzletron's heterogeneous architectures requires vLLM's new `AnyModel` backend, which adds generic support for Puzzletron-optimized models with per-layer varying widths and selective attention removal. This feature is currently available via a [pull request](https://github.com/vllm-project/vllm/pull/36512) and is expected to land in a future vLLM release. Minitron models and the baseline use standard vLLM (no special support needed). + +**Benchmarking tool:** We use [AIPerf](https://github.com/ai-dynamo/aiperf) to profile each model at increasing concurrency levels (1, 4, 8, 16, 32, 64, 128 concurrent requests). + +**Workload:** Each request sends ~1,000 input tokens and generates ~200 output tokens, simulating a summarization-style use case. + +**Hardware:** 1x NVIDIA H200 NVL GPU. + +**Models benchmarked (all post-distillation):** Qwen3-8B (baseline), Minitron 7B (Scenario 1), Puzzletron 7B (Scenario 1), Minitron 78k (Scenario 2), Puzzletron 78k (Scenario 2) + +> **How to reproduce:** Serving Puzzletron's heterogeneous models with vLLM requires a few extra setup steps. See [Appendix](#appendix-serving-a-puzzletron-model-with-vllm) for the full procedure. + +### 7.2 Results + +![Throughput vs Latency — all models](figures/all_curves_throughput_vs_latency.png) + +Results shown in the table at concurrency 64 (near-saturated throughput). Full curves across all concurrency levels are in the chart above. + +| Scenario | Model | Throughput (tok/s) | Mean TPOT (ms) | P99 TPOT (ms) | Mean TTFT (ms) | +|---|---|---|---|---|---| +| — | **Qwen3-8B (baseline)** | 218.9 | 267.81 | 321.63 | 2,426.40 | +| Scenario 1 (7B params) | **Minitron 7B** | 246.7 | 243.15 | 293.74 | 2,229.41 | +| Scenario 1 (7B params) | **Puzzletron 7B** | 219.2 | 267.87 | 321.93 | 2,428.24 | +| Scenario 2 (78k MiB) | **Minitron 78k** | 364.0 | 160.99 | 195.01 | 1,496.71 | +| Scenario 2 (78k MiB) | **Puzzletron 78k** | 370.5 | 158.41 | 192.21 | 1,516.35 | + +> **Metrics glossary:** **Throughput** = output tokens generated per second across all concurrent requests. **TPOT** (Time Per Output Token) = inter-token latency, i.e. how long between consecutive tokens in a single response. **TTFT** (Time To First Token) = how long until the first token is generated after a request is submitted. + +### 7.3 Analysis & Insights + +**At moderate compression (Scenario 1), Minitron delivers a clear inference speedup; Puzzletron does not.** + +Minitron 7B reaches ~13% higher peak throughput than the baseline and ~15% lower single-request latency. Puzzletron 7B, by contrast, is nearly indistinguishable from the baseline (~2% throughput improvement). This makes sense: Minitron's homogeneous architecture (fewer layers and a uniformly smaller hidden size) translates directly into less compute per forward pass. Puzzletron keeps all 36 layers and varies FFN widths per layer; the irregular structure offers less opportunity for the runtime to optimize. 
+ +Combined with the accuracy results from [Section 5](#5-head-to-head-when-does-each-method-win), Minitron wins both on MMLU accuracy and inference speed at this compression level — reinforcing it as the practical default for moderate, general-purpose compression. + +**At aggressive compression (Scenario 2), both methods deliver massive speedups, and Puzzletron beats Minitron on throughput while preserving far more accuracy.** + +Both Scenario 2 models dramatically outperform the baseline: Minitron 78k reaches 364 tok/s and Puzzletron 78k reaches 371 tok/s at concurrency 64 — a ~66–69% improvement over the baseline's 219 tok/s. On single-request latency, Minitron 78k is slightly faster (6.15 ms vs. 7.21 ms TPOT at concurrency 1), but the gap narrows under load and Puzzletron edges ahead on peak throughput. + +The key insight is the accuracy-performance trade-off: Puzzletron 78k retains 74.9% of teacher MMLU vs. Minitron 78k's 61.7% (similar pattern across other benchmarks) while delivering slightly better throughput. At this compression level, Puzzletron gives you more accuracy *and* more throughput. + +> **Coming soon: optimizing directly for throughput and latency.** The experiments above use Puzzletron with a memory or parameter count target: the MIP solver maximizes a quality score subject to these resource budgets. An upcoming Puzzletron feature will allow optimizing directly for inference throughput or latency as the primary constraint, enabling the MIP solver to find architectures that are not just memory-efficient but also maximally fast to serve. Early results show that this inference-aware optimization provides significant accuracy gains over memory-targeted compression at the same latency level. + +--- + +## 8. Limitations & Practical Tips + +### 8.1 Limitations of this guide + +- **Single base model:** All experiments use Qwen3-8B. Results (especially distillation convergence speed and the crossover point between Minitron and Puzzletron) may differ on other models and model families. +- **Limited benchmarks:** The notebooks reproduce MMLU end-to-end. Supplementary evaluations on HellaSwag and GSM8K (see [Section 5.4](#54-benchmark-specific-behavior)) confirm that the best method is benchmark-dependent, but three benchmarks on one model are not enough to build general per-task guidelines. +- **Minimal distillation:** 100 iterations on WikiText-103 is a lower bound. Production deployments should use more iterations, larger datasets, and a curated data blend (e.g. Nemotron pretraining + post-training mix). See the [Nemotron-Nano-9B-v2 Data Preparation guide](../minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md#1-data-preparation) for a worked example, and [`MEGATRON_DATA_PREP.md`](../../dataset/MEGATRON_DATA_PREP.md) for tokenization commands. +- **Fixed search space for Puzzletron:** The NAS search space (FFN candidate sizes, attention removal options) was kept small for tractability. A broader search space could yield better architectures at the cost of longer search time. +- **Single-step Minitron:** We use a one-shot Minitron configuration rather than a multi-step iterative scheme, which simplifies the pipeline but typically achieves less compression and leaves some potential quality–compression gains on the table. + +### 8.2 Practical tips + +- **Start with Minitron:** If you're unsure which method to use, start with Minitron. It's faster to set up, produces a standard model, and gives you a strong baseline. 
You can always run Puzzletron afterward if you need more aggressive compression. +- **Distillation is not optional:** At any compression level beyond ~10%, always distill. The accuracy gain can be substantial. +- **Combine with quantization:** After pruning and distillation, you can further compress your model with quantization (e.g. FP8, NVFP4). The two techniques are complementary: pruning reduces the architecture, quantization reduces the precision. +- **Monitor memory, not just parameters:** Two models with the same parameter count can have very different memory footprints. Puzzletron's memory-aware MIP handles this directly; with Minitron, verify your memory budget manually after pruning. + +### 8.3 Deployment considerations for heterogeneous architectures + +Puzzletron produces models with per-layer varying FFN widths and selective attention removal. vLLM recently added support for these architectures via the `AnyModel` backend (see [Section 7](#7-inference-performance) for benchmarks and setup instructions). As of this writing, this support is available via an [open pull request](https://github.com/vllm-project/vllm/pull/36512) and is expected to be merged into mainline vLLM in a future release. Other inference engines (TensorRT-LLM, etc.) do not yet support heterogeneous architectures. Minitron models, being homogeneous, are deployable on any standard inference stack today. + +--- + +## 9. Open Questions + +The experiments in this guide raise new questions to investigate. Below are directions we find promising for future work. + +**Combining Minitron depth pruning with Puzzletron width pruning.** +In this guide, Minitron and Puzzletron are used independently. A natural next step is to combine them: first use Minitron to remove the least important layers (depth pruning), then apply Puzzletron's per-layer NAS to the remaining layers (heterogeneous width pruning). This two-stage approach could achieve more aggressive compression than either method alone: Minitron reduces the layer count quickly and cheaply, while Puzzletron fine-tunes the surviving layers to fit a precise hardware budget. + +**Model and scale sensitivity: do we need model-specific compression guidelines?** +All our results come from a single model (Qwen3-8B). Do other architectures or model sizes respond differently to Minitron and Puzzletron? For instance, does the crossover point between the two methods shift for larger models (70B+) or for architectures with different attention patterns (e.g. GQA vs. MHA, MoE vs. dense)? + +**Distillation recipe: how to choose the dataset, duration, and scale?** +Our experiments used 100 iterations on WikiText-103, a deliberately minimal setup that happened to work well for Qwen3-8B. But how should one choose the distillation dataset (generic vs. domain-specific?), the number of iterations, and the token budget for a new model? Is there a principled way to estimate the required distillation effort given a model and compression level, or does it always require empirical tuning? +> For a concrete recipe and detailed ablations on data blend, token budget, and convergence (on Nemotron-Nano-9B-v2), see the [Nano-9B-v2 tutorial](../minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md) and its [blend ablations](../minitron/NVIDIA-Nemotron-Nano-9B-v2/ABLATIONS.md). + +**Serving heterogeneous architectures: how to balance Tensor Parallelism and Pipeline Parallelism?** +Puzzletron produces models where layers have different widths and some lack attention entirely. 
Standard TP/PP strategies assume uniform layers. How should parallelism be partitioned when layer costs vary significantly? Finding efficient serving configurations for heterogeneous architectures is an open problem that directly impacts their practical deployment.
+
+**Benchmark-specific behavior: can we build guidelines per downstream task?**
+As shown in [Section 5.4](#54-benchmark-specific-behavior), the relative ranking of compressed models shifts depending on the benchmark. Can we identify which compression strategies preserve which capabilities? Our experiments on MMLU, HellaSwag, and GSM8K suggest that Minitron's depth pruning better preserves general knowledge while Puzzletron's heterogeneous pruning better preserves reasoning, but three benchmarks on one model are not enough to generalize.
+
+> **Going further:** To explore these questions, see [Advanced Compression Experiments: Results & Insights](advanced_compression_experiments.md), which gathers the results and insights from more sophisticated experiments.
+
+---
+
+## 10. References
+
+- **Minitron:** Muralidharan et al., [*Compact Language Models via Pruning and Knowledge Distillation*](https://arxiv.org/abs/2407.14679), 2024.
+- **More Minitron Results:** Sreenivas et al., [*LLM Pruning and Distillation in Practice: The Minitron Approach*](https://arxiv.org/abs/2408.11796), 2024.
+- **Puzzletron:** Bercovich et al., [*Puzzle: Distillation-Based NAS for Inference-Optimized LLMs*](https://arxiv.org/abs/2411.19146), 2024.
+- **NVIDIA ModelOpt:** [GitHub Repository](https://github.com/NVIDIA/Model-Optimizer)
+- **Llama Puzzletron Tutorial:** [Puzzletron Example on ModelOpt](https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/puzzletron/README.md)
+- **Model Compression and Distillation with Megatron-Bridge:** [Megatron-Bridge Examples](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/megatron_bridge/README.md)
+- **Qwen3-8B:** [HuggingFace Model Card](https://huggingface.co/Qwen/Qwen3-8B)
+
+---
+

+
+## Appendix: Serving a Puzzletron Model with vLLM
+
+Puzzletron's heterogeneous models require a few extra steps to serve with vLLM. Below is the procedure for the Scenario 1 Puzzletron model (`distilled_Qwen3-8B-Puzzle-7B`); the same steps apply to any Puzzletron checkpoint. The walkthrough is self-contained so you can reproduce the exact throughput-vs-latency curves in [Section 7](#7-inference-performance) end-to-end. For the canonical, up-to-date deployment instructions, see [Deploy compressed model in vLLM](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/puzzletron#deploy-compressed-model-in-vllm) in the Puzzletron example.
+
+<details>
+<summary>Reproduction steps — install vLLM, patch config, serve, benchmark with AIPerf (click to expand)</summary>
+
+**Step 1 — Install vLLM with AnyModel support:**
+
+```bash
+pip install git+https://github.com/askliar/vllm.git@feature/add_anymodel_to_vllm
+```
+
+**Step 2 — Patch the model config to use the AnyModel backend:**
+
+Puzzletron checkpoints are saved as standard HuggingFace models, but vLLM needs to know to load them via the `AnyModel` backend. Update the `config.json`:
+
+```bash
+python -c "
+import json
+config_path = '/workspace/output/distilled_Qwen3-8B-Puzzle-7B/config.json'
+with open(config_path) as f:
+    config = json.load(f)
+config['architectures'] = ['AnyModel']
+config['base_architecture'] = 'Qwen3ForCausalLM'
+with open(config_path, 'w') as f:
+    json.dump(config, f, indent=2)
+print('Done:', config['architectures'], config['base_architecture'])
+"
+```
+
+**Step 3 — Launch the vLLM server:**
+
+```bash
+vllm serve /workspace/output/distilled_Qwen3-8B-Puzzle-7B \
+  --trust-remote-code \
+  --port 8000 &
+```
+
+**Step 4 — Install AIPerf (in a second terminal):**
+
+```bash
+pip install aiperf
+```
+
+**Step 5 — Benchmark with AIPerf:**
+
+```bash
+for c in 1 4 8 16 32 64 128; do
+  echo "=== Concurrency: $c ==="
+  aiperf profile \
+    --model /workspace/output/distilled_Qwen3-8B-Puzzle-7B \
+    --url http://localhost:8000 \
+    --endpoint-type chat \
+    --streaming \
+    --concurrency $c \
+    --request-count 200 \
+    --synthetic-input-tokens-mean 1000 \
+    --synthetic-input-tokens-stddev 100 \
+    --output-tokens-mean 200 \
+    --output-tokens-stddev 20 \
+    --tokenizer /workspace/output/distilled_Qwen3-8B-Puzzle-7B \
+    --artifact-dir /workspace/aiperf_results_puzzle7B/concurrency_$c
+done
+```
+
+> **Note:** For Minitron models and the baseline, skip Step 2 — standard vLLM serves them directly.
+
+</details>
diff --git a/examples/pruning/minitron_vs_puzzletron/advanced_compression_experiments.md b/examples/pruning/minitron_vs_puzzletron/advanced_compression_experiments.md new file mode 100644 index 00000000000..14c23b46e17 --- /dev/null +++ b/examples/pruning/minitron_vs_puzzletron/advanced_compression_experiments.md @@ -0,0 +1,174 @@ +# Advanced Compression Experiments: Results & Insights + +This document extends the [main tutorial](README.md) with results and insights from more sophisticated experiments, addressing the open questions raised in Section 9. + +--- + +## 1. Extended Distillation: WikiText vs. Nemotron-v2 at 80% Memory + +The main tutorial uses a deliberately minimal distillation setup (100 iterations on [WikiText-103](https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-103-v1), ~1.6M tokens). Here we investigate what happens when we scale up distillation significantly (using the higher-quality [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) with 1000x more tokens) on Qwen3-8B models compressed to ~80% of the original memory footprint. + +### 1.1 Results across all benchmarks + +| Model | Params | Distillation | Tokens | MMLU | HellaSwag acc_norm | GSM8K flex | +|---|---|---|---|---|---|---| +| Original Qwen3-8B | 8B | — | — | 0.7493 | 0.7653 | 0.8749 | +| | | | | | | | +| **Puzzletron 80%** | | | | | | | +| Puzzletron — pruned | 7.75B | — | — | 0.5910 | 0.6863 | 0.5762 | +| Puzzletron + WikiText | 7.75B | gbs=4, seq=4096, 100 iters | 1.6M | 0.6921 | 0.7390 | 0.7612 | +| **Puzzletron + Nemotron-v2** | **7.75B** | **gbs=768, seq=8192, 300 iters** | **1.9B** | **0.7186** | **0.7381** | **0.8378** | +| | | | | | | | +| **Minitron 80% (36→28 layers)** | | | | | | | +| Minitron — pruned | 6.65B | — | — | 0.5084 | 0.5295 | 0.0114 | +| Minitron + WikiText | 6.65B | gbs=4, seq=4096, 100 iters | 1.6M | 0.7302 | 0.6166 | 0.4761 | +| **Minitron + Nemotron-v2** | **6.65B** | **gbs=768, seq=8192, 300 iters** | **1.9B** | **0.7394** | **0.6357** | **0.7453** | + +### 1.2 Key takeaways + +**Nemotron-v2 improves both methods, but the gains are benchmark-dependent.** MMLU improvements are modest (+2.65pp for Puzzletron, +0.92pp for Minitron). The real payoff is on reasoning: Puzzletron's GSM8K jumps +7.66pp, Minitron's +26.92pp. Higher-quality distillation disproportionately recovers reasoning capabilities. + +**Minitron + WikiText (1.6M tokens) still beats Puzzletron + Nemotron-v2 (1.9B tokens) on MMLU.** Minitron recovers to 97.5% of the teacher with minimal distillation, while Puzzletron needs 1000x more compute to reach 95.9%. + +**On reasoning (GSM8K), Puzzletron leads regardless of distillation recipe.** With Nemotron-v2, Puzzletron retains 96.0% of the teacher vs. Minitron's 85.4%. Depth pruning's impact on reasoning can be partially compensated by better distillation, but heterogeneous pruning still preserves reasoning structure better. + +**Distillation loss still doesn't predict downstream accuracy.** Minitron's final loss (5.59e-1) is 10x higher than Puzzletron + Nemotron-v2 (5.63e-2), yet Minitron scores better on MMLU. + +--- + +## 2. Chaining Minitron Depth Pruning with Puzzletron + +### 2.1 Motivation + +The main tutorial uses Minitron and Puzzletron independently. A natural question is: can we combine them? 
+ +This is motivated by a limitation in Puzzletron's scoring for full layer removal: its independent block-level scoring does not account for inter-block dependencies when multiple layers are removed simultaneously, leading to poor layer selection and degraded quality. + +| Method | Layers dropped (1-indexed) | Pre-distill MMLU | Post-distill MMLU | +|---|---|---|---| +| Minitron (BI scoring) | L27–L34 | 0.5084 | 0.7302 | +| Puzzletron (Cosine Embedding Loss) | L3–L4, L8–L9, L15, L19, L21, L27 | 0.2949 | 0.4993 | + +> **Note:** To isolate depth pruning behavior, Puzzletron was configured to only allow full layer removal. + +Minitron's BI scoring concentrates drops in late layers, producing a far better model. This motivates a chained approach: Minitron for depth pruning, then Puzzletron for heterogeneous width pruning on the surviving layers. + +### 2.2 Experiment: Minitron 36→32L + Puzzletron → 80% memory + +We first prune Qwen3-8B from 36 to 32 layers using Minitron (~10% reduction), then apply Puzzletron to the 32-layer model to reach the 80% memory target (~10% further reduction). We compare this chained approach against using each method alone at the same 80% memory target. + +**Intermediate step — Minitron 36→32L (~90% memory)** + +| Model | Params | Distillation | Tokens | MMLU | HellaSwag acc_norm | GSM8K flex | +|---|---|---|---|---|---|---| +| Qwen3-8B (teacher) | 8.19B | — | — | 0.7493 | 0.7653 | 0.8749 | +| Minitron 36→32L — pruned | 7.42B | — | — | 0.7396 | 0.6671 | 0.2873 | +| Minitron 36→32L + WikiText | 7.42B | gbs=4, seq=4096, 100 iters | 1.6M | 0.7421 | 0.6987 | 0.7604 | + +Minitron's depth pruning retains 98.7% of MMLU with no distillation at all (0.7396), confirming that the 4 dropped late layers contribute little to general knowledge. GSM8K drops sharply (0.2873) but recovers well with minimal distillation (0.7604). + +**80% memory target — all three approaches compared** + +| Model | Params | Distillation | Tokens | MMLU | HellaSwag acc_norm | GSM8K flex | +|---|---|---|---|---|---|---| +| Qwen3-8B (teacher) | 8.19B | — | — | 0.7493 | 0.7653 | 0.8749 | +| | | | | | | | +| **Chained: Minitron 36→32L + Puzzletron** | | | | | | | +| Pruned | 7.42B | — | — | 0.6674 | 0.6698 | 0.6331 | +| + WikiText | 7.42B | gbs=4, seq=4096, 100 iters | 1.6M | 0.7074 | 0.6874 | 0.7081 | +| **+ Nemotron-v2** | **7.42B** | **gbs=768, seq=8192, 300 iters** | **1.9B** | **0.7332** | **0.7126** | **0.8499** | +| | | | | | | | +| **Puzzletron only** | | | | | | | +| Pruned | 7.75B | — | — | 0.5910 | 0.6863 | 0.5762 | +| + WikiText | 7.75B | gbs=4, seq=4096, 100 iters | 1.6M | 0.6921 | 0.7390 | 0.7612 | +| **+ Nemotron-v2** | **7.75B** | **gbs=768, seq=8192, 300 iters** | **1.9B** | **0.7186** | **0.7381** | **0.8378** | +| | | | | | | | +| **Minitron depth only (36→28L)** | | | | | | | +| Pruned | 6.65B | — | — | 0.5084 | 0.5295 | 0.0114 | +| + WikiText | 6.65B | gbs=4, seq=4096, 100 iters | 1.6M | 0.7302 | 0.6166 | 0.4761 | +| **+ Nemotron-v2** | **6.65B** | **gbs=768, seq=8192, 300 iters** | **1.9B** | **0.7394** | **0.6357** | **0.7453** | + +### 2.3 Key takeaways + +**The chained approach gives the best balanced results with extended distillation.** With Nemotron-v2, Minitron+Puzzletron achieves 0.7332 MMLU, 0.7126 HellaSwag, and 0.8499 GSM8K. No single method matches this balance: Minitron alone leads on MMLU (0.7394) but lags on HellaSwag (0.6357) and GSM8K (0.7453); Puzzletron alone leads on HellaSwag (0.7381) but trails on MMLU (0.7186). 
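+
+For reference, the "% of teacher" figures quoted throughout this document are plain ratios against the teacher's score. A minimal sketch that reproduces them from the Nemotron-v2 rows of the table above (scores copied verbatim; the dictionary layout is illustrative only):
+
+```python
+# Per-benchmark retention vs. the Qwen3-8B teacher (Nemotron-v2 distillation runs).
+teacher = {"MMLU": 0.7493, "HellaSwag": 0.7653, "GSM8K": 0.8749}
+
+students = {
+    "Chained (Minitron 36->32L + Puzzletron)": {"MMLU": 0.7332, "HellaSwag": 0.7126, "GSM8K": 0.8499},
+    "Puzzletron only": {"MMLU": 0.7186, "HellaSwag": 0.7381, "GSM8K": 0.8378},
+    "Minitron only (36->28L)": {"MMLU": 0.7394, "HellaSwag": 0.6357, "GSM8K": 0.7453},
+}
+
+for name, scores in students.items():
+    retention = ", ".join(f"{b} {scores[b] / teacher[b]:.1%}" for b in teacher)
+    print(f"{name}: {retention}")
+```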
+ +**Chaining leverages each method's strength.** Minitron handles depth pruning cleanly (BI scoring correctly identifies which late layers to drop), then Puzzletron applies surgical per-layer width optimization on the surviving 32-layer model. The result is a model that preserves both general knowledge and reasoning better than either method alone. + +**Pre-distillation quality is much higher for the chained approach.** The chained model starts at 0.6674 MMLU before any distillation — well above Puzzletron alone (0.5910) and Minitron alone (0.5084). This gives distillation more structure to work with. + +**Conclusion:** On Qwen3-8B, for an 80% memory target, pruning ~10% with Minitron depth (36→32L) followed by ~10% with Puzzletron width, then applying extended distillation with Nemotron-v2, gives the best balanced trade-off across all benchmarks tested. + +--- + +## 3. Blockwise Local Distillation (BLD) + +BLD (bypass) locally trains block variants before the MIP assembly step, so the search prefers blocks that recover well after distillation rather than blocks that merely look good as immediate swaps. + +### 3.1 At moderate compression (7B target): marginal impact + +We tested BLD on the Scenario 1 setup (Qwen3-8B → 7B), applying it to FFN subblock variants pruned below 50% of the original intermediate size. + +| Model | Parameters | MMLU (pruned) | MMLU (distilled) | % of Teacher | +|---|---|---|---|---| +| Minitron 7B | 6.96B | 0.7038 | 0.7166 | 95.6% | +| Puzzletron 7B | 6.99B | 0.6621 | 0.6823 | 91.1% | +| Puzzletron 7B + BLD | 6.99B | 0.6696 | 0.6867 | 91.6% | + +BLD provides a marginal improvement over standard Puzzletron (+0.44pp post-distillation), and the MIP selects a very similar architecture. At this moderate compression level, the gain appears insufficient to justify the added complexity, and Minitron still leads on MMLU by a wide margin. + +### 3.2 At aggressive compression (80% memory target): significant impact + +A recurring pattern when optimizing for memory is that the MIP solver drops full attention blocks from many layers (since KV cache dominates memory). This means the FFN part of those attention-less variants becomes critical and is exactly where BLD can have the most impact. Here we apply BLD to train the FFN part of block variants that drop attention (`no_op`). + +**Results (% of teacher, post-distillation with WikiText)** + +| Benchmark | Puzzletron 80% | Puzzletron 80% + BLD | Minitron 80% | +|---|---|---|---| +| MMLU | 92.4% | **98.0%** | 97.5% | +| HellaSwag acc_norm | **96.6%** | 95.6% | 80.6% | +| GSM8K flex | 87.0% | **92.0%** | 54.4% | + +**Full results** + +| Model | MMLU (pruned) | MMLU (distilled) | HellaSwag acc_norm (pruned) | HellaSwag acc_norm (distilled) | GSM8K flex (pruned) | GSM8K flex (distilled) | +|---|---|---|---|---|---|---| +| Qwen3-8B (teacher) | 0.7493 | — | 0.7653 | — | 0.8749 | — | +| Puzzletron 80% | 0.5910 | 0.6921 | 0.6863 | 0.7390 | 0.5762 | 0.7612 | +| **Puzzletron 80% + BLD** | **0.7277** | **0.7341** | **0.7097** | **0.7317** | **0.7331** | **0.8044** | +| Minitron 80% | 0.5084 | 0.7302 | 0.5295 | 0.6166 | 0.0114 | 0.4761 | + +BLD transforms Puzzletron's results at this compression level. The pre-distillation MMLU jumps from 0.5910 to 0.7277. After WikiText distillation, Puzzletron + BLD reaches 0.7341 MMLU, beating both standard Puzzletron (0.6921) and Minitron (0.7302) — flipping the Puzzletron vs. Minitron ranking on MMLU, where without BLD Minitron was ahead. 
The improvement is consistent across all benchmarks, with GSM8K showing a particularly strong gain (0.8044 vs. 0.7612 without BLD). + +Unlike the moderate compression case where BLD had negligible impact, at aggressive compression BLD substantially changes the architecture the MIP selects and the quality of the resulting model. + +--- + +## 4. Beyond Dense Transformers: Compressing a Mamba-Transformer Hybrid + +All experiments so far use Qwen3-8B, a dense Transformer-only model. Here we test both methods on [Nemotron-Nano-12B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2), a **Mamba-Transformer hybrid** with 12.3B parameters and 62 layers (alternating Mamba and attention blocks). This is an early exploration; results are pre-distillation only (MMLU). + +### 4.1 Results + +| Model | MMLU | % of Teacher | +|---|---|---| +| Nemotron-Nano-12B-v2 (baseline, 49k MiB) | 78.6 | 100% | +| | | | +| **~10B parameter target** | | | +| Minitron 10B | **73.7** | **93.8%** | +| Puzzletron 10B | 48.9 | 62.2% | +| | | | +| **~34k MiB memory target** | | | +| Minitron 34k | 51.8 | 65.9% | +| Puzzletron 34k | **54.3** | **69.1%** | + +### 4.2 Observations + +**Puzzletron never removes Mamba blocks.** Across all Puzzletron runs (both 10B and 34k MiB targets), every Mamba block is kept intact: the MIP solver exclusively targets attention blocks and FFN layers for pruning. This suggests that removing a single Mamba block is too costly for model quality. + +**At moderate compression (~10B), Minitron dominates.** Minitron retains 93.8% of teacher MMLU vs. Puzzletron's 62.2%. This is consistent with the Qwen3-8B pattern where Minitron wins at moderate compression, but the gap is much larger here. + +**At aggressive compression (~34k MiB), Puzzletron slightly leads.** Puzzletron edges ahead (54.3 vs. 51.8 MMLU), similarly to the pattern observed on Qwen3-8B. + +**Hybrid architectures present unique challenges for Puzzletron.** On dense Transformers, Puzzletron's strength is heterogeneous per-layer optimization. On hybrids, the Mamba blocks are effectively frozen — Puzzletron can only optimize the attention/FFN half of the model. This reduces its effective search space and may explain why Minitron's simpler uniform approach outperforms at moderate compression levels. 
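+
+One way to verify the "Puzzletron never removes Mamba blocks" observation directly is to inspect the block layout of a produced solution. The sketch below is hypothetical: it assumes the solution checkpoint preserves the Nemotron-H-style `hybrid_override_pattern` config field from the base model (`M` = Mamba block, `*` = attention block, `-` = FFN-only layer), and the checkpoint path is illustrative:
+
+```python
+# Hypothetical check: count surviving block types in a pruned hybrid solution.
+import json
+from collections import Counter
+
+with open("/path/to/solution_0/config.json") as f:  # illustrative path
+    pattern = json.load(f)["hybrid_override_pattern"]
+
+counts = Counter(pattern)
+print(f"Mamba: {counts['M']}, attention: {counts['*']}, FFN-only: {counts['-']}")
+```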
+ +--- diff --git a/examples/pruning/minitron_vs_puzzletron/figures/all_curves_throughput_vs_latency.png b/examples/pruning/minitron_vs_puzzletron/figures/all_curves_throughput_vs_latency.png new file mode 100644 index 00000000000..91045515bc5 Binary files /dev/null and b/examples/pruning/minitron_vs_puzzletron/figures/all_curves_throughput_vs_latency.png differ diff --git a/examples/pruning/minitron_vs_puzzletron/figures/distillation_curves.png b/examples/pruning/minitron_vs_puzzletron/figures/distillation_curves.png new file mode 100644 index 00000000000..c64bfeb0b81 Binary files /dev/null and b/examples/pruning/minitron_vs_puzzletron/figures/distillation_curves.png differ diff --git a/examples/pruning/minitron_vs_puzzletron/figures/distillation_loss_7B.png b/examples/pruning/minitron_vs_puzzletron/figures/distillation_loss_7B.png new file mode 100644 index 00000000000..369bd46c9bb Binary files /dev/null and b/examples/pruning/minitron_vs_puzzletron/figures/distillation_loss_7B.png differ diff --git a/examples/pruning/minitron_vs_puzzletron/figures/memory_sweep.png b/examples/pruning/minitron_vs_puzzletron/figures/memory_sweep.png new file mode 100644 index 00000000000..f9580f35f32 Binary files /dev/null and b/examples/pruning/minitron_vs_puzzletron/figures/memory_sweep.png differ diff --git a/examples/pruning/minitron_vs_puzzletron/figures/memory_sweep_combined.png b/examples/pruning/minitron_vs_puzzletron/figures/memory_sweep_combined.png new file mode 100644 index 00000000000..d649ba48794 Binary files /dev/null and b/examples/pruning/minitron_vs_puzzletron/figures/memory_sweep_combined.png differ diff --git a/examples/pruning/minitron_vs_puzzletron/figures/summary_chart.png b/examples/pruning/minitron_vs_puzzletron/figures/summary_chart.png new file mode 100644 index 00000000000..cbeae09765c Binary files /dev/null and b/examples/pruning/minitron_vs_puzzletron/figures/summary_chart.png differ diff --git a/examples/pruning/minitron_vs_puzzletron/scenario1_minitron.ipynb b/examples/pruning/minitron_vs_puzzletron/scenario1_minitron.ipynb new file mode 100644 index 00000000000..b5e44a9ad49 --- /dev/null +++ b/examples/pruning/minitron_vs_puzzletron/scenario1_minitron.ipynb @@ -0,0 +1,266 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "7fb27b941602401d91542211134fc71a", + "metadata": {}, + "source": [ + "# Scenario 1 — Minitron: Quick & Reliable Compression (~1h45 on 2x H200)\n", + "\n", + "This notebook prunes Qwen3-8B down to ~7B parameters using **Minitron** (homogeneous pruning), then distills and evaluates the result.\n", + "\n", + "**Pipeline:** Prune → Evaluate → Distill → Evaluate\n", + "\n", + "**Prerequisites:**\n", + "- Run [`00_prerequisites.ipynb`](00_prerequisites.ipynb) first to prepare the distillation dataset.\n", + "- Base model downloaded at `/workspace/models/Qwen3-8B`." + ] + }, + { + "cell_type": "markdown", + "id": "b14dbc2b", + "metadata": {}, + "source": [ + "## 1. Prune (long step)\n", + "\n", + "Minitron's `prune_minitron.py` script handles the full pruning pipeline in one command:\n", + "1. Loads the model into Megatron-Bridge format\n", + "2. Runs a calibration pass on the [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) to compute importance scores\n", + "3. Enumerates all valid (depth, width) configurations, keeps the 10 candidates with parameter count closest to and below 7B, scores them using a fast MMLU proxy, and selects the best one\n", + "4. 
Exports the pruned model as a HuggingFace checkpoint\n", + "\n", + "We skip pruning `num_attention_heads` to keep the GQA structure intact (the model reduces hidden size, FFN intermediate size, and drops layers instead).\n", + "\n", + "> **Calibration dataset note:** The pruning script automatically downloads and uses 1,024 samples from `nvidia/Nemotron-Post-Training-Dataset-v2` for calibration (configurable via `--calib_dataset_name` and `--calib_num_samples`).\n", + "\n", + "> **Runtime note:** This step may run significantly faster (up to ~5×) on more recent ModelOpt versions (≥ 0.44.0) thanks to pruning-pipeline optimizations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "867f6497", + "metadata": {}, + "outputs": [], + "source": [ + "!cd /opt/Model-Optimizer/examples/megatron_bridge && \\\n", + "torchrun --nproc_per_node 2 prune_minitron.py \\\n", + " --pp_size 2 \\\n", + " --hf_model_name_or_path /workspace/models/Qwen3-8B \\\n", + " --prune_target_params 7e9 \\\n", + " --hparams_to_skip num_attention_heads \\\n", + " --output_hf_path /workspace/output/Qwen3-8B-Pruned-7B" + ] + }, + { + "cell_type": "markdown", + "id": "9o737xevmu", + "metadata": {}, + "source": [ + "**Expected output:** Minitron selects the following best architecture:\n", + "\n", + "```\n", + "[BEST SUBNET] {'num_layers': 32, 'hidden_size': 3840, 'ffn_hidden_size': 12288} -> 6.96B params, 0.7073 score\n", + "\n", + "Dropping decoder layers [28, 31, 32, 33] from model.\n", + "```\n", + "\n", + "The model goes from 36 to 32 layers (dropping 4 late layers), hidden size is reduced from 4096 to 3840, and FFN intermediate size stays at 12288." + ] + }, + { + "cell_type": "markdown", + "id": "ef1f4a07", + "metadata": {}, + "source": [ + "## 2. Verify pruned model\n", + "\n", + "Check that the pruned checkpoint was saved correctly." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5be1ca4a", + "metadata": {}, + "outputs": [], + "source": [ + "!ls -lh /workspace/output/Qwen3-8B-Pruned-7B/" + ] + }, + { + "cell_type": "markdown", + "id": "8433afa3", + "metadata": {}, + "source": [ + "## 3. Evaluate pruned model (before distillation)\n", + "\n", + "Run MMLU (5-shot) on the pruned model to measure how much accuracy was lost during pruning. This gives us the pre-distillation baseline." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d25fc66c", + "metadata": {}, + "outputs": [], + "source": [ + "!lm_eval --model hf \\\n", + " --model_args pretrained=/workspace/output/Qwen3-8B-Pruned-7B,dtype=bfloat16 \\\n", + " --tasks mmlu \\\n", + " --num_fewshot 5 \\\n", + " --batch_size 4" + ] + }, + { + "cell_type": "markdown", + "id": "dcff37f6", + "metadata": {}, + "source": [ + "## 4. Distill\n", + "\n", + "Run knowledge distillation: the pruned model (student) learns to mimic the original Qwen3-8B (teacher) using logits-based KL divergence loss on the tokenized [WikiText-103](https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-103-v1) dataset.\n", + "\n", + "We run 100 iterations with a sequence length of 4096 and a global batch size of 4 (~1.6M tokens)." + ] + }, + { + "cell_type": "markdown", + "id": "acae54e37e7d407bbb7b55eff062a284", + "metadata": {}, + "source": [ + "Launch TensorBoard to monitor the distillation loss in real time. 
Open http://localhost:6006 in your browser once the distillation cell is running.\n", + "\n", + "> **Tip:** In the TensorBoard settings (top-right gear icon), check **\"Reload data\"** so the charts update automatically as training progresses." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9a63283cbaf04dbcab1f6479b197f3a8", + "metadata": {}, + "outputs": [], + "source": [ + "import subprocess\n", + "\n", + "subprocess.Popen(\n", + " [\"tensorboard\", \"--logdir\", \"/workspace/output/distill_output_7B/tb_logs\", \"--port\", \"6006\"]\n", + ")\n", + "print(\"TensorBoard started at http://localhost:6006\")" + ] + }, + { + "cell_type": "markdown", + "id": "8dd0d8092fe74a7c96281538738b07e2", + "metadata": {}, + "source": [ + "Now, let's run the distillation.\n", + "> **Expected runtime: ~20-30 minutes on 2x H200.**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "13f7b0f2", + "metadata": {}, + "outputs": [], + "source": [ + "!cd /opt/Model-Optimizer/examples/megatron_bridge && \\\n", + "torchrun --nnodes 1 --nproc_per_node=2 distill.py \\\n", + " --student_hf_path /workspace/output/Qwen3-8B-Pruned-7B \\\n", + " --teacher_hf_path /workspace/models/Qwen3-8B \\\n", + " --data_paths 1.0 /workspace/datasets/tokenized_qwen3/wikitext_wikitext-103-v1_train_text \\\n", + " --output_dir /workspace/output/distill_output_7B \\\n", + " --hf_export_path /workspace/output/distilled_Qwen3-8B-Pruned-7B \\\n", + " --student_hf_model /workspace/output/Qwen3-8B-Pruned-7B \\\n", + " --seq_length 4096 \\\n", + " --tp_size 2 \\\n", + " --pp_size 1 \\\n", + " --mbs 1 \\\n", + " --gbs 4 \\\n", + " --train_iters 100 \\\n", + " --lr 0.0001 \\\n", + " --min_lr 1e-05 \\\n", + " --lr_warmup_iters 10 \\\n", + " --eval_interval 10 \\\n", + " --eval_iters 10 \\\n", + " --log_interval 1" + ] + }, + { + "cell_type": "markdown", + "id": "e79f5aa6", + "metadata": {}, + "source": [ + "Finally, kill tensorboard:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e92fcb22", + "metadata": {}, + "outputs": [], + "source": [ + "subprocess.run([\"pkill\", \"-f\", \"tensorboard\"])" + ] + }, + { + "cell_type": "markdown", + "id": "665dd4wp1xw", + "metadata": {}, + "source": [ + "**Expected distillation loss curve:**\n", + "\n", + "![Minitron 7B distillation loss](figures/distillation_loss_7B.png)\n", + "\n", + "The KL divergence drops from ~0.93 to ~0.38 over 100 iterations, with training and validation loss tracking closely (no overfitting)." + ] + }, + { + "cell_type": "markdown", + "id": "10185d26023b46108eb7d9f57d49d2b3", + "metadata": {}, + "source": [ + "## 5. Evaluate distilled model\n", + "\n", + "Run MMLU (5-shot) on the distilled model. 
Compare with the pre-distillation score from Step 3 to measure distillation recovery.\n", + "\n", + "**Expected results on Qwen3-8B:**\n", + "\n", + "| Model | MMLU (5-shot) | % of Teacher |\n", + "|---|---|---|\n", + "| Qwen3-8B (teacher) | 0.7493 | 100% |\n", + "| Minitron — pruned | 0.7038 | 93.9% |\n", + "| Minitron — pruned + distilled | **0.7166** | **95.6%** |" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8763a12b2bbd4a93a75aff182afb95dc", + "metadata": {}, + "outputs": [], + "source": [ + "!lm_eval --model hf \\\n", + " --model_args pretrained=/workspace/output/distilled_Qwen3-8B-Pruned-7B,dtype=bfloat16 \\\n", + " --tasks mmlu \\\n", + " --num_fewshot 5 \\\n", + " --batch_size 4" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.12.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/pruning/minitron_vs_puzzletron/scenario1_puzzletron.ipynb b/examples/pruning/minitron_vs_puzzletron/scenario1_puzzletron.ipynb new file mode 100644 index 00000000000..9ded667facf --- /dev/null +++ b/examples/pruning/minitron_vs_puzzletron/scenario1_puzzletron.ipynb @@ -0,0 +1,247 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "7fb27b941602401d91542211134fc71a", + "metadata": {}, + "source": "# Scenario 1 — Puzzletron: Heterogeneous Pruning (Comparison) (~6h on 2x H200)\n\nThis notebook compresses Qwen3-8B to ~7B parameters using **Puzzletron** (heterogeneous NAS-based pruning), then distills and evaluates the result.\n\nThis serves as the **comparison run** for Scenario 1. The recommended approach for moderate compression is Minitron (see [`scenario1_minitron.ipynb`](scenario1_minitron.ipynb)). We run Puzzletron here to demonstrate how it compares at this compression level.\n\n**Pipeline:** Prepare calibration data → Configure → NAS search → Evaluate → Patch → Distill → Evaluate\n\n**Prerequisites:**\n- Run [`00_prerequisites.ipynb`](00_prerequisites.ipynb) first to prepare the distillation dataset.\n- Base model downloaded at `/workspace/models/Qwen3-8B`." + }, + { + "cell_type": "markdown", + "id": "8dd0d8092fe74a7c96281538738b07e2", + "metadata": {}, + "source": [ + "## 1. Prepare calibration dataset\n", + "\n", + "Puzzletron requires explicit dataset preparation. We download and format the [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2), which Puzzletron uses to score the quality of candidate block replacements during the NAS search." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "72eea5119410473aa328ad9291626812", + "metadata": {}, + "outputs": [], + "source": [ + "!cd /opt/Model-Optimizer && \\\n", + "python -m modelopt.torch.puzzletron.dataset.prepare_dataset \\\n", + " --dataset_name=nvidia/Nemotron-Post-Training-Dataset-v2 \\\n", + " --output_dir=/workspace/datasets/Nemotron-Post-Training-Dataset-v2" + ] + }, + { + "cell_type": "markdown", + "id": "8edb47106e1a46a883d545849b8ab81b", + "metadata": {}, + "source": "## 2. Configure the NAS search\n\nIn the YAML configuration files, we need to set:\n- **`input_hf_model_path`**: path to the base Qwen3-8B model\n- **`target_memory`**: set high (130,000 MiB) so it doesn't constrain — we're targeting by parameter count here\n- **`num_params`**: 7B parameter target\n- **`eval_samples`**: number of samples for scoring. 
A higher value can produce more reliable scores and potentially a better final architecture, but scoring time scales roughly linearly with this parameter. 32 is the value we use here as a reasonable accuracy/runtime trade-off for tutorial reproducibility." + }, + { + "cell_type": "code", + "execution_count": null, + "id": "10185d26023b46108eb7d9f57d49d2b3", + "metadata": {}, + "outputs": [], + "source": [ + "!sed -i 's|input_hf_model_path: .*|input_hf_model_path: /workspace/models/Qwen3-8B|' \\\n", + " /opt/Model-Optimizer/examples/puzzletron/configs/qwen3-8b_pruneffn_memory/qwen3_8b_pruneffn_memory.yaml\n", + "\n", + "!sed -i 's|target_memory: .*|target_memory: 130_000|' \\\n", + " /opt/Model-Optimizer/examples/puzzletron/configs/qwen3-8b_pruneffn_memory/qwen3_8b_pruneffn_memory.yaml\n", + "\n", + "!sed -i 's|target_memory: .*|target_memory: 130_000|' \\\n", + " /opt/Model-Optimizer/examples/puzzletron/configs/qwen3-8b_pruneffn_memory/qwen3_8b.yaml\n", + "\n", + "!sed -i 's|num_params: .*|num_params: 7_000_000_000|' \\\n", + " /opt/Model-Optimizer/examples/puzzletron/configs/qwen3-8b_pruneffn_memory/qwen3_8b.yaml\n", + "\n", + "!sed -i '/^scoring:/,/^[a-z]/{s|eval_samples: .*|eval_samples: 32|}' \\\n", + " /opt/Model-Optimizer/examples/puzzletron/configs/qwen3-8b_pruneffn_memory/qwen3_8b.yaml" + ] + }, + { + "cell_type": "markdown", + "id": "8763a12b2bbd4a93a75aff182afb95dc", + "metadata": {}, + "source": [ + "## 3. Run Puzzletron NAS search (Longest step: 5 hours at first run)\n", + "\n", + "This step is significantly longer than Minitron's single-command pruning.\n", + "\n", + "The MIP (Mixed-Integer Programming) solver will find the optimal heterogeneous architecture that has at most 7B parameters." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7623eae2785240b9bd12b16a66d81610", + "metadata": {}, + "outputs": [], + "source": [ + "# Remove if already exists from a previous run\n", + "!rm -f /workspace/puzzle_dir/subblock_stats.json\n", + "!cd /opt/Model-Optimizer && \\\n", + "torchrun --nproc_per_node 1 \\\n", + " examples/puzzletron/main.py \\\n", + " --config examples/puzzletron/configs/qwen3-8b_pruneffn_memory/qwen3_8b_pruneffn_memory.yaml \\\n", + " 2>&1 | tee /workspace/puzzletron_qwen3_7B.log" + ] + }, + { + "cell_type": "markdown", + "id": "7cdc8c89c7104fffa095e18ddfef8986", + "metadata": {}, + "source": [ + "## 4. Evaluate pruned model (before distillation)\n", + "\n", + "Evaluate the Puzzletron-compressed model on MMLU. The model is heterogeneous (variable FFN widths per layer), so we use the `lm_eval_hf.py` script which supports this architecture." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b118ea5561624da68c537baed56e602f", + "metadata": {}, + "outputs": [], + "source": [ + "!sed -i 's/\"torch\\.bfloat16\"/\"bfloat16\"/g' \\\n", + " /workspace/puzzle_dir/mip/puzzle_solutions/target_memory_130000MiB-num_params_7G/solutions--checkpoints/solution_0/config.json\n", + "\n", + "!cd /opt/Model-Optimizer && \\\n", + "python examples/llm_eval/lm_eval_hf.py \\\n", + " --model hf \\\n", + " --model_args pretrained=/workspace/puzzle_dir/mip/puzzle_solutions/target_memory_130000MiB-num_params_7G/solutions--checkpoints/solution_0/,dtype=bfloat16,parallelize=True \\\n", + " --tasks mmlu \\\n", + " --num_fewshot 5 \\\n", + " --batch_size 4" + ] + }, + { + "cell_type": "markdown", + "id": "938c804e27f84196a10c8828c723f798", + "metadata": {}, + "source": "## 5. 
Distill\n\nDistill the heterogeneous Puzzletron model against the original Qwen3-8B teacher. Same distillation recipe as the Minitron notebooks: 100 iterations on [WikiText-103](https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-103-v1).\n\nThe `distill.py` script handles both distillation and automatic export to HuggingFace format in one step." + }, + { + "cell_type": "markdown", + "id": "504fb2a444614c0babb325280ed9130a", + "metadata": {}, + "source": [ + "Launch TensorBoard to monitor the distillation loss in real time. Open http://localhost:6006 in your browser once the distillation cell is running.\n", + "\n", + "> **Tip:** In the TensorBoard settings (top-right gear icon), check **\"Reload data\"** so the charts update automatically as training progresses." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "59bbdb311c014d738909a11f9e486628", + "metadata": {}, + "outputs": [], + "source": [ + "import subprocess\n", + "\n", + "subprocess.Popen(\n", + " [\n", + " \"tensorboard\",\n", + " \"--logdir\",\n", + " \"/workspace/output/distill_output_puzzle_7B/tb_logs\",\n", + " \"--port\",\n", + " \"6006\",\n", + " ]\n", + ")\n", + "print(\"TensorBoard started at http://localhost:6006\")" + ] + }, + { + "cell_type": "markdown", + "id": "b43b363d81ae4b689946ece5c682cd59", + "metadata": {}, + "source": [ + "Now, let's run the distillation.\n", + "> **Expected runtime: ~20-30 minutes on 2x H200.**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8a65eabff63a45729fe45fb5ade58bdc", + "metadata": {}, + "outputs": [], + "source": [ + "!torchrun --nproc_per_node=2 \\\n", + " /opt/Model-Optimizer/examples/megatron_bridge/distill.py \\\n", + " --student_hf_path /workspace/puzzle_dir/mip/puzzle_solutions/target_memory_130000MiB-num_params_7G/solutions--checkpoints/solution_0 \\\n", + " --teacher_hf_path /workspace/models/Qwen3-8B \\\n", + " --data_paths 1.0 /workspace/datasets/tokenized_qwen3/wikitext_wikitext-103-v1_train_text \\\n", + " --output_dir /workspace/output/distill_output_puzzle_7B \\\n", + " --hf_export_path /workspace/output/distilled_Qwen3-8B-Puzzle-7B \\\n", + " --student_hf_model Qwen/Qwen3-8B \\\n", + " --seq_length 4096 \\\n", + " --tp_size 2 \\\n", + " --pp_size 1 \\\n", + " --mbs 1 \\\n", + " --gbs 4 \\\n", + " --train_iters 100 \\\n", + " --lr 0.0001 \\\n", + " --min_lr 1e-05 \\\n", + " --lr_warmup_iters 10 \\\n", + " --eval_interval 10 \\\n", + " --eval_iters 10 \\\n", + " --log_interval 1" + ] + }, + { + "cell_type": "markdown", + "id": "4bcefb8b", + "metadata": {}, + "source": [ + "Finally, kill tensorboard:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "389d78f6", + "metadata": {}, + "outputs": [], + "source": [ + "subprocess.run([\"pkill\", \"-f\", \"tensorboard\"])" + ] + }, + { + "cell_type": "markdown", + "id": "c3933fab20d04ec698c2621248eb3be0", + "metadata": {}, + "source": "## 6. 
Evaluate distilled model\n\nCompare with the Minitron result at the same parameter target (see [`scenario1_minitron.ipynb`](scenario1_minitron.ipynb)).\n\n**Expected results on Qwen3-8B:**\n\n| Model | Parameters | MMLU (5-shot) | % of Teacher |\n|---|---|---|---|\n| Qwen3-8B (teacher) | 8B | 0.7493 | 100% |\n| **Minitron 7B — distilled** | **6.96B** | **0.7166** | **95.6%** |\n| Puzzletron 7B — pruned | 6.99B | 0.6621 | 88.4% |\n| Puzzletron 7B — distilled | 6.99B | 0.6823 | 91.1% |" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4dd4641cc4064e0191573fe9c69df29b", + "metadata": {}, + "outputs": [], + "source": [ + "!cd /opt/Model-Optimizer && \\\n", + "python examples/llm_eval/lm_eval_hf.py \\\n", + " --model hf \\\n", + " --model_args pretrained=/workspace/output/distilled_Qwen3-8B-Puzzle-7B,dtype=bfloat16,parallelize=True \\\n", + " --tasks mmlu \\\n", + " --num_fewshot 5 \\\n", + " --batch_size 4" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.12.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/examples/pruning/minitron_vs_puzzletron/scenario2_minitron.ipynb b/examples/pruning/minitron_vs_puzzletron/scenario2_minitron.ipynb new file mode 100644 index 00000000000..6eab121b685 --- /dev/null +++ b/examples/pruning/minitron_vs_puzzletron/scenario2_minitron.ipynb @@ -0,0 +1,229 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "7fb27b941602401d91542211134fc71a", + "metadata": {}, + "source": [ + "# Scenario 2 — Minitron: Aggressive Depth Pruning (Comparison Baseline) (~45 min on 2x H200)\n", + "\n", + "This notebook prunes Qwen3-8B from 36 layers down to 22 layers using **Minitron** depth pruning, then distills and evaluates the result.\n", + "\n", + "This serves as the **comparison baseline** for Scenario 2. The recommended approach for aggressive compression is Puzzletron (see [`scenario2_puzzletron.ipynb`](scenario2_puzzletron.ipynb)). We run Minitron here to demonstrate why depth pruning underperforms at this compression level.\n", + "\n", + "**Pipeline:** Prune → Evaluate → Distill → Evaluate\n", + "\n", + "**Prerequisites:**\n", + "- Run [`00_prerequisites.ipynb`](00_prerequisites.ipynb) first to prepare the distillation dataset.\n", + "- Base model downloaded at `/workspace/models/Qwen3-8B`." + ] + }, + { + "cell_type": "markdown", + "id": "acae54e37e7d407bbb7b55eff062a284", + "metadata": {}, + "source": [ + "## 1. Prune (36 → 22 layers)\n", + "\n", + "To match the ~78,000 MiB memory budget used in the Puzzletron comparison, we need aggressive compression. With Minitron, the most effective way to achieve large memory savings is **depth pruning** — removing entire Transformer layers. Each layer carries not only its weights but also a KV cache allocation at inference time, so dropping a layer saves both weight memory and KV cache memory. This makes depth pruning far more memory-efficient per parameter removed than width pruning alone.\n", + "\n", + "**Why 22 layers?** Each Qwen3-8B layer accounts for ~3,440 MiB at the inference settings used in this guide (KV cache + attention + FFN weights; the full breakdown is computed in [`scenario2_puzzletron.ipynb`](scenario2_puzzletron.ipynb)). 
Dropping 14 of the 36 layers removes ~48,160 MiB, taking the baseline from 126,215 MiB down to ~78,055 MiB, which effectively meets the 78,000 MiB budget (the closest achievable when removing whole layers).\n",
+ "\n",
+ "Minitron ranks the layers to decide which 14 layers to drop, keeping the 22 that contribute most to the model's output quality."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8a8f9525",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!cd /opt/Model-Optimizer/examples/megatron_bridge && \\\n",
+ "torchrun --nproc_per_node 2 prune_minitron.py \\\n",
+ " --pp_size 2 \\\n",
+ " --hf_model_name_or_path /workspace/models/Qwen3-8B \\\n",
+ " --prune_export_config '{\"num_layers\": 22}' \\\n",
+ " --output_hf_path /workspace/output/Qwen3-8B-Minitron-22L"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1844685e",
+ "metadata": {},
+ "source": [
+ "## 2. Verify pruned model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e9f5bcf3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!ls -lh /workspace/output/Qwen3-8B-Minitron-22L/"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1a50b0fb",
+ "metadata": {},
+ "source": [
+ "## 3. Evaluate pruned model (before distillation)\n",
+ "\n",
+ "With 14 layers removed (~39% of the model's depth), we expect a significant accuracy drop. At this level of compression, the model may be near-random on MMLU."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a386f374",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!lm_eval --model hf \\\n",
+ " --model_args pretrained=/workspace/output/Qwen3-8B-Minitron-22L,dtype=bfloat16 \\\n",
+ " --tasks mmlu \\\n",
+ " --num_fewshot 5 \\\n",
+ " --batch_size 4"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9a63283cbaf04dbcab1f6479b197f3a8",
+ "metadata": {},
+ "source": [
+ "## 4. Distill"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "24b3e413",
+ "metadata": {},
+ "source": [
+ "Launch TensorBoard to monitor the distillation loss in real time. Open http://localhost:6006 in your browser once the distillation cell is running.\n",
+ "\n",
+ "> **Tip:** In the TensorBoard settings (top-right gear icon), check **\"Reload data\"** so the charts update automatically as training progresses."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "57251157",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import subprocess\n",
+ "\n",
+ "subprocess.Popen(\n",
+ " [\"tensorboard\", \"--logdir\", \"/workspace/output/distill_output_22L/tb_logs\", \"--port\", \"6006\"]\n",
+ ")\n",
+ "print(\"TensorBoard started at http://localhost:6006\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8dd0d8092fe74a7c96281538738b07e2",
+ "metadata": {},
+ "source": [
+ "\n",
+ "Distill the 22-layer model against the full Qwen3-8B teacher. 
Same setup as Scenario 1: 100 iterations on [WikiText-103](https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-103-v1).\n", + "> **Expected runtime: ~20-30 minutes on 2x H200.**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "377df69d", + "metadata": {}, + "outputs": [], + "source": [ + "!cd /opt/Model-Optimizer/examples/megatron_bridge && \\\n", + "torchrun --nnodes 1 --nproc_per_node=2 distill.py \\\n", + " --student_hf_path /workspace/output/Qwen3-8B-Minitron-22L \\\n", + " --teacher_hf_path /workspace/models/Qwen3-8B \\\n", + " --data_paths 1.0 /workspace/datasets/tokenized_qwen3/wikitext_wikitext-103-v1_train_text \\\n", + " --output_dir /workspace/output/distill_output_22L \\\n", + " --hf_export_path /workspace/output/distilled_Qwen3-8B-Minitron-22L \\\n", + " --student_hf_model /workspace/output/Qwen3-8B-Minitron-22L \\\n", + " --seq_length 4096 \\\n", + " --tp_size 2 \\\n", + " --pp_size 1 \\\n", + " --mbs 1 \\\n", + " --gbs 4 \\\n", + " --train_iters 100 \\\n", + " --lr 0.0001 \\\n", + " --min_lr 1e-05 \\\n", + " --lr_warmup_iters 10 \\\n", + " --eval_interval 10 \\\n", + " --eval_iters 10 \\\n", + " --log_interval 1" + ] + }, + { + "cell_type": "markdown", + "id": "83ede15e", + "metadata": {}, + "source": [ + "Finally, kill tensorboard:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0459c564", + "metadata": {}, + "outputs": [], + "source": [ + "subprocess.run([\"pkill\", \"-f\", \"tensorboard\"])" + ] + }, + { + "cell_type": "markdown", + "id": "c91a1b15", + "metadata": {}, + "source": [ + "## 5. Evaluate distilled model\n", + "\n", + "Compare with the Puzzletron result at the same memory budget (see [`scenario2_puzzletron.ipynb`](scenario2_puzzletron.ipynb)).\n", + "\n", + "**Expected results on Qwen3-8B:**\n", + "\n", + "| Model | Memory | MMLU (5-shot) | % of Teacher |\n", + "|---|---|---|---|\n", + "| Qwen3-8B (teacher) | 126,215 MiB | 0.7493 | 100% |\n", + "| Minitron 22L — pruned | 78,054 MiB | 0.2351 | 31.4% |\n", + "| Minitron 22L — distilled | 78,054 MiB | 0.4620 | 61.7% |\n", + "| **Puzzletron 78k — distilled** | **77,992 MiB** | **0.5613** | **74.9%** |" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cef7952c", + "metadata": {}, + "outputs": [], + "source": [ + "!lm_eval --model hf \\\n", + " --model_args pretrained=/workspace/output/distilled_Qwen3-8B-Minitron-22L,dtype=bfloat16 \\\n", + " --tasks mmlu \\\n", + " --num_fewshot 5 \\\n", + " --batch_size 4" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.12.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/pruning/minitron_vs_puzzletron/scenario2_puzzletron.ipynb b/examples/pruning/minitron_vs_puzzletron/scenario2_puzzletron.ipynb new file mode 100644 index 00000000000..85a79aa8836 --- /dev/null +++ b/examples/pruning/minitron_vs_puzzletron/scenario2_puzzletron.ipynb @@ -0,0 +1,446 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "7fb27b941602401d91542211134fc71a", + "metadata": {}, + "source": "# Scenario 2 — Puzzletron: Hardware-Constrained Compression (~6h15 on 2x H200)\n\nThis notebook compresses Qwen3-8B to fit within a **78,000 MiB memory budget** using **Puzzletron** (heterogeneous NAS-based pruning), then distills and evaluates the result.\n\nThis is the **recommended approach** for Scenario 2. 
Puzzletron can directly target a memory constraint via its MIP (Mixed-Integer Programming) solver, producing a heterogeneous architecture that maximizes accuracy within the budget. For comparison, see [`scenario2_minitron.ipynb`](scenario2_minitron.ipynb).\n\n**Pipeline:** Prepare calibration data → Configure → NAS search → Evaluate → Patch → Distill → Evaluate\n\n**Prerequisites:**\n- Run [`00_prerequisites.ipynb`](00_prerequisites.ipynb) first to prepare the distillation dataset.\n- Base model downloaded at `/workspace/models/Qwen3-8B`." + }, + { + "cell_type": "markdown", + "id": "8dd0d8092fe74a7c96281538738b07e2", + "metadata": {}, + "source": [ + "## 1. Prepare calibration dataset\n", + "\n", + "Download and format the [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) for Puzzletron's block scoring phase. Skip if already prepared." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "72eea5119410473aa328ad9291626812", + "metadata": {}, + "outputs": [], + "source": [ + "!cd /opt/Model-Optimizer && \\\n", + "python -m modelopt.torch.puzzletron.dataset.prepare_dataset \\\n", + " --dataset_name=nvidia/Nemotron-Post-Training-Dataset-v2 \\\n", + " --output_dir=/workspace/datasets/Nemotron-Post-Training-Dataset-v2" + ] + }, + { + "cell_type": "markdown", + "id": "8edb47106e1a46a883d545849b8ab81b", + "metadata": {}, + "source": "## 2. Configure the NAS search\n\nPuzzletron is driven by two YAML configuration files:\n\n- **`qwen3_8b_pruneffn_memory.yaml`** — The top-level config. It sets the input model path, the calibration dataset path, the working directory, the MIP memory constraint, and the list of FFN intermediate sizes to search over (the search space). This is the file you edit for each experiment.\n\n- **`qwen3_8b.yaml`** — The base model config. It defines the full pipeline settings: pruning strategy, scoring parameters, MIP solver settings (constraints, objective function, batch sizes for memory computation)... Most of these can be left at their defaults.\n\nHere, the **primary constraint is memory** (78,000 MiB), not parameter count. We set:\n- **`input_hf_model_path`**: path to the base Qwen3-8B model\n- **`target_memory`**: 78,000 MiB — the hard memory ceiling the compressed model must fit within\n- **`num_params`**: set high (8B) so it doesn't constrain — memory is the binding constraint\n- **`eval_samples`**: number of samples for scoring. A higher value can produce more reliable scores and potentially a better final architecture, but scoring time scales roughly linearly with this parameter. 32 is the value we use here as a reasonable accuracy/runtime trade-off for tutorial reproducibility.\n\nThis is Puzzletron's strength: the MIP solver directly optimizes for the memory budget, accounting for both weights and KV cache per layer." 
+ }, + { + "cell_type": "code", + "execution_count": null, + "id": "w0oomdb959h", + "metadata": {}, + "outputs": [], + "source": [ + "!sed -i 's|input_hf_model_path: .*|input_hf_model_path: /workspace/models/Qwen3-8B|' \\\n", + " /opt/Model-Optimizer/examples/puzzletron/configs/qwen3-8b_pruneffn_memory/qwen3_8b_pruneffn_memory.yaml\n", + "\n", + "!sed -i 's|target_memory: .*|target_memory: 78_000|' \\\n", + " /opt/Model-Optimizer/examples/puzzletron/configs/qwen3-8b_pruneffn_memory/qwen3_8b_pruneffn_memory.yaml\n", + "\n", + "!sed -i 's|target_memory: .*|target_memory: 78_000|' \\\n", + " /opt/Model-Optimizer/examples/puzzletron/configs/qwen3-8b_pruneffn_memory/qwen3_8b.yaml\n", + "\n", + "!sed -i 's|num_params: .*|num_params: 8_000_000_000|' \\\n", + " /opt/Model-Optimizer/examples/puzzletron/configs/qwen3-8b_pruneffn_memory/qwen3_8b.yaml\n", + "\n", + "!sed -i '/^scoring:/,/^[a-z]/{s|eval_samples: .*|eval_samples: 32|}' \\\n", + " /opt/Model-Optimizer/examples/puzzletron/configs/qwen3-8b_pruneffn_memory/qwen3_8b.yaml" + ] + }, + { + "cell_type": "markdown", + "id": "10185d26023b46108eb7d9f57d49d2b3", + "metadata": {}, + "source": [ + "### How Puzzletron computes memory footprint\n", + "\n", + "To understand the 78,000 MiB target, it helps to know how Puzzletron computes memory. The total footprint is the sum of three components, computed **layer-by-layer**:\n", + "\n", + "```\n", + "Total_Memory = Σ (Attention_Memory[layer] + FFN_Memory[layer]) + Non_Block_Memory\n", + "```\n", + "\n", + "**Per-layer attention memory** = KV cache + attention weights:\n", + "- KV cache: `batch_size × seq_len × kv_dim × 2 × sizeof(dtype)` per layer (this is the dominant term)\n", + "- Attention weights: Wq, Wk, Wv, Wo projections + layer norm\n", + "\n", + "**Per-layer FFN memory** = weight memory for the 3 linear layers (gate, up, down projections) + layer norm\n", + "\n", + "**Non-block memory** = input embeddings + output LM head + final layer norm\n", + "\n", + "When attention is removed from a layer (`no_op`), both the KV cache and attention weights for that layer drop to zero — this is why removing attention is far more memory-efficient than reducing FFN width.\n", + "\n", + "The computation uses the inference settings from the YAML config: `batch_size=96`, `seq_len=8192` (4096 prefill + 4096 generation), `dtype=bfloat16` (2 bytes).\n", + "\n", + "Let's verify by computing the memory footprint of the original Qwen3-8B:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7xfweabvsx4", + "metadata": {}, + "outputs": [], + "source": [ + "# Qwen3-8B architecture\n", + "hidden = 4096\n", + "num_heads = 32\n", + "num_kv_heads = 8\n", + "head_dim = 128\n", + "kv_dim = num_kv_heads * head_dim # 1024\n", + "ffn_size = 12288\n", + "vocab = 151936\n", + "layers = 36\n", + "dtype_bytes = 2 # bfloat16\n", + "\n", + "# Inference settings (from YAML config)\n", + "batch_size = 96\n", + "seq_len = 4096 + 4096 # prefill + generation\n", + "\n", + "# --- Per-layer attention memory ---\n", + "# KV cache\n", + "kv_cache_per_layer = batch_size * seq_len * kv_dim * 2 * dtype_bytes / (1024**2)\n", + "# Attention weights: Wq + Wo + Wk + Wv + q_norm + k_norm + input_layernorm\n", + "attn_params = hidden * (num_heads * head_dim) * 2 + hidden * kv_dim * 2 + head_dim * 2 + hidden\n", + "attn_weights_per_layer = attn_params * dtype_bytes / (1024**2)\n", + "attn_total_per_layer = kv_cache_per_layer + attn_weights_per_layer\n", + "\n", + "# --- Per-layer FFN memory ---\n", + "# 3 linear layers (gate, up, down) 
+ post_attention_layernorm\n", + "ffn_params = hidden * ffn_size * 3 + hidden\n", + "ffn_per_layer = ffn_params * dtype_bytes / (1024**2)\n", + "\n", + "# --- Non-block memory ---\n", + "# Input embeddings + LM head (not tied) + final RMS norm\n", + "non_block_params = vocab * hidden * 2 + hidden\n", + "non_block = non_block_params * dtype_bytes / (1024**2)\n", + "\n", + "# --- Total ---\n", + "total = layers * (attn_total_per_layer + ffn_per_layer) + non_block\n", + "\n", + "print(f\"=== Qwen3-8B Memory Footprint (batch={batch_size}, seq={seq_len}, bf16) ===\")\n", + "print(f\"Per-layer KV cache: {kv_cache_per_layer:>10.2f} MiB\")\n", + "print(f\"Per-layer attention weights:{attn_weights_per_layer:>9.2f} MiB\")\n", + "print(f\"Per-layer FFN weights: {ffn_per_layer:>10.2f} MiB\")\n", + "print(f\"Per-layer total: {attn_total_per_layer + ffn_per_layer:>10.2f} MiB\")\n", + "print(\n", + " f\"All {layers} layers: {layers * (attn_total_per_layer + ffn_per_layer):>10.2f} MiB\"\n", + ")\n", + "print(f\"Non-block (embed + LM head):{non_block:>9.2f} MiB\")\n", + "print()\n", + "print(f\"TOTAL: {total:>10.2f} MiB ({total / 1024:.2f} GiB)\")\n", + "print()\n", + "print(f\"KV cache share: {layers * kv_cache_per_layer / total * 100:.1f}% of total memory\")\n", + "print(\n", + " f\"Target budget: 78,000 MiB -> need to reduce by {total - 78000:.0f} MiB ({(total - 78000) / total * 100:.1f}%)\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "s6jj7021ojg", + "metadata": {}, + "source": [ + "**Expected output:**\n", + "\n", + "| Component | Per Layer | All 36 Layers | Share |\n", + "|---|---|---|---|\n", + "| KV cache | 3,072 MiB | 110,592 MiB | 87.6% |\n", + "| Attention weights | 80 MiB | 2,880 MiB | 2.3% |\n", + "| FFN weights | 288 MiB | 10,368 MiB | 8.2% |\n", + "| Embeddings + LM head | — | 2,374 MiB | 1.9% |\n", + "| **Total** | **3,440 MiB** | **126,215 MiB (123.26 GiB)** | **100%** |\n", + "\n", + "The KV cache alone accounts for nearly 88% of the total memory. This is why Puzzletron's ability to remove attention (and its KV cache) from selected layers is so effective for meeting a memory target. To reach 78,000 MiB, we need to cut ~48,215 MiB (38.2%)." + ] + }, + { + "cell_type": "markdown", + "id": "8763a12b2bbd4a93a75aff182afb95dc", + "metadata": {}, + "source": [ + "## 3. Run Puzzletron NAS search (Longest step: 5 hours at first run)\n", + "\n", + "This is the core Puzzletron pipeline. It runs the full 8-step process:\n", + "1. Convert the model to Puzzletron's heterogeneous format\n", + "2. Score pruning activations across all layers\n", + "3. Generate pruned checkpoint variants at different FFN sizes\n", + "4. Build a replacement library of all per-layer candidates\n", + "5. Calculate memory/parameter stats for each candidate\n", + "6. Score each replacement's quality — longest step\n", + "7. Run MIP optimization to find the best architecture within the memory constraint\n", + "8. Assemble the final heterogeneous model\n", + "\n", + "### Search space\n", + "\n", + "For each of the 36 Transformer layers, the MIP solver can choose from the following options:\n", + "\n", + "**FFN block:** keep the original (intermediate_size=12288), replace with a pruned variant at one of the candidate sizes `[2560, 5120, 7424, 9984]`, or remove it entirely (`no_op`). That's **6 FFN options per layer**.\n", + "\n", + "**Attention block:** keep the original (GQA with 8 KV heads) or remove it entirely (`no_op`). 
That's **2 attention options per layer**.\n", + "\n", + "Combined, each layer has up to **12 possible configurations** (6 FFN × 2 attention). Across 36 layers, the theoretical search space is 12³⁶ — far too large to enumerate. The MIP solver efficiently finds the optimal combination by formulating it as a constrained optimization problem.\n", + "\n", + "The MIP solver will find the optimal heterogeneous architecture that fits within 78,000 MiB. Unlike Scenario 1 (where the model only reduced FFN widths), here the solver is expected to also **remove attention from multiple layers** — since KV cache is the dominant memory consumer, removing attention from a layer saves far more memory than thinning its FFN." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7623eae2785240b9bd12b16a66d81610", + "metadata": {}, + "outputs": [], + "source": [ + "# Remove if already exists from a previous run\n", + "!rm -f /workspace/puzzle_dir/subblock_stats.json\n", + "!cd /opt/Model-Optimizer && \\\n", + "torchrun --nproc_per_node 1 \\\n", + " examples/puzzletron/main.py \\\n", + " --config examples/puzzletron/configs/qwen3-8b_pruneffn_memory/qwen3_8b_pruneffn_memory.yaml \\\n", + " 2>&1 | tee /workspace/puzzletron_qwen3_78k.log" + ] + }, + { + "cell_type": "markdown", + "id": "7cdc8c89c7104fffa095e18ddfef8986", + "metadata": {}, + "source": [ + "## 4. Evaluate pruned model (before distillation)\n", + "\n", + "At ~38% memory compression, the model is expected to be near-random before distillation. This is normal — distillation will recover significant accuracy.\n", + "\n", + "> **Note:** The `sed` command below fixes a dtype formatting issue in the generated config." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b118ea5561624da68c537baed56e602f", + "metadata": {}, + "outputs": [], + "source": [ + "!sed -i 's/\"torch\\.bfloat16\"/\"bfloat16\"/g' \\\n", + " /workspace/puzzle_dir/mip/puzzle_solutions/target_memory_78000MiB-num_params_8G/solutions--checkpoints/solution_0/config.json\n", + "\n", + "!cd /opt/Model-Optimizer && \\\n", + "python examples/llm_eval/lm_eval_hf.py \\\n", + " --model hf \\\n", + " --model_args pretrained=/workspace/puzzle_dir/mip/puzzle_solutions/target_memory_78000MiB-num_params_8G/solutions--checkpoints/solution_0/,dtype=bfloat16,parallelize=True \\\n", + " --tasks mmlu \\\n", + " --num_fewshot 5 \\\n", + " --batch_size 4" + ] + }, + { + "cell_type": "markdown", + "id": "938c804e27f84196a10c8828c723f798", + "metadata": {}, + "source": "## 5. Distill\n\nDistill the heterogeneous model against the original Qwen3-8B teacher. Same recipe: 100 iterations on [WikiText-103](https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-103-v1). At this compression level, distillation is critical — it transforms the model from near-random to functional.\n\nThe `distill.py` script handles both distillation and automatic export to HuggingFace format." + }, + { + "cell_type": "markdown", + "id": "504fb2a444614c0babb325280ed9130a", + "metadata": {}, + "source": [ + "Launch TensorBoard to monitor the distillation loss in real time. Open http://localhost:6006 in your browser once the distillation cell is running.\n", + "\n", + "> **Tip:** In the TensorBoard settings (top-right gear icon), check **\"Reload data\"** so the charts update automatically as training progresses." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "59bbdb311c014d738909a11f9e486628", + "metadata": {}, + "outputs": [], + "source": [ + "import subprocess\n", + "\n", + "subprocess.Popen(\n", + " [\n", + " \"tensorboard\",\n", + " \"--logdir\",\n", + " \"/workspace/output/distill_output_puzzle_78k/tb_logs\",\n", + " \"--port\",\n", + " \"6006\",\n", + " ]\n", + ")\n", + "print(\"TensorBoard started at http://localhost:6006\")" + ] + }, + { + "cell_type": "markdown", + "id": "b43b363d81ae4b689946ece5c682cd59", + "metadata": {}, + "source": [ + "Now, let's run the distillation.\n", + "> **Expected runtime: ~20-30 minutes on 2x H200.**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8a65eabff63a45729fe45fb5ade58bdc", + "metadata": {}, + "outputs": [], + "source": [ + "!torchrun --nproc_per_node=2 \\\n", + " /opt/Model-Optimizer/examples/megatron_bridge/distill.py \\\n", + " --student_hf_path /workspace/puzzle_dir/mip/puzzle_solutions/target_memory_78000MiB-num_params_8G/solutions--checkpoints/solution_0 \\\n", + " --teacher_hf_path /workspace/models/Qwen3-8B \\\n", + " --data_paths 1.0 /workspace/datasets/tokenized_qwen3/wikitext_wikitext-103-v1_train_text \\\n", + " --output_dir /workspace/output/distill_output_puzzle_78k \\\n", + " --hf_export_path /workspace/output/distilled_Qwen3-8B-Puzzle-78k \\\n", + " --student_hf_model Qwen/Qwen3-8B \\\n", + " --seq_length 4096 \\\n", + " --tp_size 2 \\\n", + " --pp_size 1 \\\n", + " --mbs 1 \\\n", + " --gbs 4 \\\n", + " --train_iters 100 \\\n", + " --lr 0.0001 \\\n", + " --min_lr 1e-05 \\\n", + " --lr_warmup_iters 10 \\\n", + " --eval_interval 10 \\\n", + " --eval_iters 10 \\\n", + " --log_interval 1" + ] + }, + { + "cell_type": "markdown", + "id": "e00c38cc", + "metadata": {}, + "source": [ + "Finally, kill tensorboard:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7e0d8deb", + "metadata": {}, + "outputs": [], + "source": [ + "subprocess.run([\"pkill\", \"-f\", \"tensorboard\"])" + ] + }, + { + "cell_type": "markdown", + "id": "c3933fab20d04ec698c2621248eb3be0", + "metadata": {}, + "source": "## 6. Evaluate distilled model\n\nCompare with the Minitron result at the same memory budget (see [`scenario2_minitron.ipynb`](scenario2_minitron.ipynb)).\n\n**Expected results on Qwen3-8B:**\n\n| Model | Memory | MMLU (5-shot) | % of Teacher |\n|---|---|---|---|\n| Qwen3-8B (teacher) | 126,215 MiB | 0.7493 | 100% |\n| Puzzletron 78k — pruned | 77,992 MiB | 0.2752 | 36.7% |\n| **Puzzletron 78k — distilled** | **77,992 MiB** | **0.5613** | **74.9%** |\n| Minitron 22L — distilled | 78,054 MiB | 0.4620 | 61.7% |" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4dd4641cc4064e0191573fe9c69df29b", + "metadata": {}, + "outputs": [], + "source": [ + "!cd /opt/Model-Optimizer && \\\n", + "python examples/llm_eval/lm_eval_hf.py \\\n", + " --model hf \\\n", + " --model_args pretrained=/workspace/output/distilled_Qwen3-8B-Puzzle-78k,dtype=bfloat16,parallelize=True \\\n", + " --tasks mmlu \\\n", + " --num_fewshot 5 \\\n", + " --batch_size 4" + ] + }, + { + "cell_type": "markdown", + "id": "dsm07ynktiu", + "metadata": {}, + "source": [ + "## 7. Bonus: Memory Sweep\n", + "\n", + "Puzzletron supports a **MIP (Mixed-Integer Programming) sweep mode** that lets you explore multiple memory compression rates in a single run. 
Instead of running the full pipeline for each target, the sweep reuses the scoring results and replacement library already computed during the NAS search and only re-runs the MIP solver at each compression rate, making it very fast.\n",
+ "\n",
+ "This produces a CSV with accuracy and memory metrics for each configuration, allowing you to map out the accuracy-memory trade-off curve and find the right operating point for your deployment.\n",
+ "\n",
+ "### Enable sweep mode\n",
+ "\n",
+ "We add the sweep configuration to the YAML and run with `--mip-only` (skips the scoring steps that were already completed in Step 3):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "l66a4t3u8kg",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import yaml\n",
+ "\n",
+ "config_path = \"/opt/Model-Optimizer/examples/puzzletron/configs/qwen3-8b_pruneffn_memory/qwen3_8b_pruneffn_memory.yaml\"\n",
+ "\n",
+ "with open(config_path) as f:\n",
+ " config = yaml.safe_load(f)\n",
+ "\n",
+ "# Add sweep configuration\n",
+ "config[\"mip\"][\"sweep\"] = {\n",
+ " \"enabled\": True,\n",
+ " \"memory_compression_rates\": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],\n",
+ " \"output_csv\": \"/workspace/puzzle_dir/mip_sweep_results.csv\",\n",
+ "}\n",
+ "\n",
+ "with open(config_path, \"w\") as f:\n",
+ " yaml.dump(config, f, default_flow_style=False)\n",
+ "\n",
+ "print(\"Sweep config added. Compression rates: [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "v90anz5pdsc",
+ "metadata": {},
+ "source": [
+ "### Run the sweep"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "i4470igyxwi",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!cd /opt/Model-Optimizer && \\\n",
+ "torchrun --nproc_per_node 1 \\\n",
+ " examples/puzzletron/main.py \\\n",
+ " --config examples/puzzletron/configs/qwen3-8b_pruneffn_memory/qwen3_8b_pruneffn_memory.yaml \\\n",
+ " --mip-only \\\n",
+ " 2>&1 | tee /workspace/puzzletron_sweep.log | grep \"Puzzletron Progress\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "uilkjsv2wx",
+ "metadata": {},
+ "source": "Here is an example of the accuracy-memory trade-off curve from one run of this sweep (MMLU accuracy for Qwen3-8B vs. memory compression rate):\n\n![Puzzletron Memory Sweep](figures/memory_sweep.png)"
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.12.0"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
\ No newline at end of file