Improving Quantized FP4 Weight Quality via Logit Distillation

Hicham Badri

Mobius Labs GmbH


FP4 is an emerging low-precision floating point format introduced in next-generation hardware like NVIDIA’s Blackwell GPUs and AMD’s upcoming MI355X accelerators. Promising up to 2× speedup over FP8, FP4 enables significantly faster and more efficient inference. However, early experiments—particularly with the MXFP4 variant—reveal a significant drop in model quality post-quantization. This drop is primarily due to the symmetric quantization scheme, especially with E8M0 scaling, which forces scales to be powers of two, limiting the precision of weight representation.

In this blog post, we explore a simple yet effective solution: a lightweight distillation-based recovery method. By matching the logits of the original unquantized model and applying a post-hoc channel-wise linear correction of the form (output × post_scale + shift) — where both correction terms are compact 16-bit vectors — we are able to recover much of the lost quality. While we train both the post-scaling factor and the shift terms, we observe that most of the quality recovery comes from the bias correction (the learned shift parameter), making this approach efficient and low-overhead for practical FP4 model deployment.

We make both calibrated models, MXFP4 and NVFP4, available on Hugging Face 🤗.



Introduction

Quantization has become one of the most impactful techniques for deploying Large Language Models (LLMs) efficiently. By reducing the precision of weights and activations, quantization significantly lowers GPU memory (VRAM) requirements and accelerates inference—offering speedups in both memory-bound and compute-bound scenarios. However, the process is inherently lossy, and quantization-induced degradation in model quality remains a critical challenge. Ensuring that quantized models remain accurate and reliable is essential for their real-world usability.

The new FP4 data type has recently been introduced, with hardware support in both NVIDIA’s Blackwell GPUs and AMD’s upcoming MI355X accelerators. There are two distinct FP4 formats: MXFP4 and NVFP4.

MXFP4 is the more flexible format, designed to support mixed-precision matrix multiplication. It represents weights in 4-bit E2M1 format, combined with group-wise scaling factors shared over vectors of size 32. These scales are stored in 8-bit E8M0 format, which restricts them to powers of two. While this choice simplifies hardware implementation, it severely limits representational precision. In practice, replacing E4M3 scales with E8M0 results in a noticeable drop in model quality. In contrast, the NVFP4 format uses smaller group sizes (16) and E4M3 8-bit scales, leading to much better quality retention after quantization. However, NVFP4 is limited to homogeneous FP4 × FP4 tensor core operations, which restricts it in scenarios requiring mixed-precision accumulation such as memory-bound workloads with higher precision activations.

The table below summarizes the difference between the two FP4 dtypes:

Table 1. MXFP4 vs. NVFP4 data types.
| Format | Group Size | Scale Dtype | Mixed-Precision Support |
|--------|------------|-------------|-------------------------|
| MXFP4  | 32         | E8M0        | Yes                     |
| NVFP4  | 16         | E4M3        | No                      |
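
To make the precision gap concrete, here is a minimal sketch of how the two scale dtypes constrain a block scale. The E8M0 rounding mode (ceiling on the exponent) is an assumption, as actual kernels may round differently, and torch.float8_e4m3fn is used only to emulate the E4M3 cast:

```python
import torch

def round_scale_e8m0(scale: torch.Tensor) -> torch.Tensor:
    # E8M0 stores only an 8-bit exponent, so every scale becomes a power of two.
    # Ceil-rounding of the exponent is shown here; real kernels may differ.
    return torch.exp2(torch.ceil(torch.log2(scale)))

def round_scale_e4m3(scale: torch.Tensor) -> torch.Tensor:
    # E4M3 keeps 3 mantissa bits; a round-trip through torch.float8_e4m3fn
    # emulates the (much smaller) precision loss of NVFP4 scales.
    return scale.to(torch.float8_e4m3fn).to(scale.dtype)

s = torch.tensor([0.0173, 0.0931, 1.7])
print(round_scale_e8m0(s))  # powers of two only: [0.03125, 0.125, 2.0]
print(round_scale_e4m3(s))  # stays close to the original values
```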

To address the quantization error, particularly in the MXFP4 format, we explore a lightweight approach to recover model accuracy. Unfortunately, due to the symmetric nature of the quantization, calibration-free techniques like HQQ cannot be applied effectively in this context.

While more advanced methods such as HQQ+ can successfully restore accuracy for low-bit quantization by training low-rank adapters (e.g., LoRA), these approaches come with non-negligible overhead, which we aim to avoid in latency and memory-sensitive deployments.

Instead, we freeze both the 4-bit weights Wq and the block scales scales, and focus on training a simple post-hoc correction consisting of a post-scaling factor and a shift applied to the FP4 matrix multiplication output: matmul_fp4(x, Wq, scales) * post_scale + shift, where both the post-scale and the shift are 16-bit vectors, keeping the method lightweight.
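
As a rough sketch of where these two vectors sit in the forward pass (the matmul_fp4 callable below is a placeholder for whatever FP4 GEMM the runtime provides; the class name and dtypes are illustrative):

```python
import torch
import torch.nn as nn

class FP4LinearWithCorrection(nn.Module):
    """FP4 linear layer with trainable channel-wise post_scale and shift terms."""

    def __init__(self, Wq, scales, out_features, matmul_fp4):
        super().__init__()
        # Frozen quantized weights and block scales (never updated during training).
        self.register_buffer("Wq", Wq)
        self.register_buffer("scales", scales)
        self.matmul_fp4 = matmul_fp4  # assumed FP4 GEMM: (x, Wq, scales) -> output
        # Lightweight 16-bit correction terms, initialized to the identity mapping.
        self.post_scale = nn.Parameter(torch.ones(out_features, dtype=torch.bfloat16))
        self.shift = nn.Parameter(torch.zeros(out_features, dtype=torch.bfloat16))

    def forward(self, x):
        out = self.matmul_fp4(x, self.Wq, self.scales)
        return out * self.post_scale + self.shift
```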

Unlike the two-level scaling mechanism used in NVFP4 to boost accuracy, we find that the majority of quality recovery comes from the bias correction, not the post-scaling. This insight is especially useful in practice: for popular models that already include a bias term (such as Qwen), the learned shift can be fused directly into the pre-trained bias, incurring virtually zero additional runtime overhead.
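
For such models, folding the learned shift into the existing bias after training is a one-liner; a minimal helper, with hypothetical layer/shift arguments, might look like this:

```python
import torch
import torch.nn as nn

def fuse_shift_into_bias(layer: nn.Linear, shift: torch.Tensor) -> None:
    # Fold the learned shift into the layer's pre-trained bias so inference
    # runs the stock FP4 matmul with no extra correction step.
    with torch.no_grad():
        layer.bias.add_(shift.to(layer.bias.dtype))
```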

We delve into this approach in detail in the next section.

Approach

Weight Quantization

The first step is quantizing the model to FP4 weights. This is a one-shot process that starts by calculating the block-wise scales scales via absmax after reshaping the weights to match the target block size: weights.abs().amax(dim=1, keepdim=True).
We then cast these scales to the target dtype (E8M0 for MXFP4, E4M3 for NVFP4), divide the weights by their corresponding block scales, and match each value to the closest bin in the FP4 E2M1 format. This results in a tuple: Wq, the 4-bit quantized weights, and scales, the 8-bit block-wise scales.

This operation can be efficiently implemented using torch.searchsorted, or further accelerated with a custom Triton kernel — which would be especially useful for dynamic FP4 quantization. In our case, since the weights are static, this step is only performed once.
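
A minimal reference sketch of this step is shown below. The E2M1 grid is the standard FP4 value set; the scale computation (mapping the block absmax onto the largest representable value, 6.0, then rounding to a power of two for MXFP4) is our assumption of a reasonable default rather than a description of any particular kernel:

```python
import torch

# Positive E2M1 (FP4) values; the sign is handled separately.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4(weights: torch.Tensor, group_size: int = 32):
    # Reshape into blocks of `group_size` and take the per-block absmax.
    W = weights.reshape(-1, group_size)
    absmax = W.abs().amax(dim=1, keepdim=True).clamp_min(1e-8)
    # Map the absmax onto the largest E2M1 value (6.0), then force the scale
    # to a power of two to emulate the E8M0 dtype used by MXFP4.
    scales = torch.exp2(torch.ceil(torch.log2(absmax / 6.0)))
    # Snap each scaled magnitude to the nearest E2M1 bin via searchsorted.
    W_scaled = (W / scales).abs()
    midpoints = (E2M1_GRID[:-1] + E2M1_GRID[1:]) / 2
    idx = torch.searchsorted(midpoints, W_scaled.contiguous())
    Wq = E2M1_GRID[idx] * W.sign()  # dequantized values; real kernels pack 4-bit codes
    return Wq.reshape(weights.shape), scales
```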

We also explored an exhaustive search strategy to optimize the block scales, which slightly improves the quantization accuracy — particularly for the NVFP4 format.

The code for quantizing to MXFP4 and NVFP4 is available here.

Distillation

The second step is the training phase. We introduce channel-wise trainable parameters post_scale and shift, such that the final output is computed as matmul_fp4(x, Wq, scales) * post_scale + shift. The post_scale and shift vectors are initialized to ones and zeros, respectively.

For training, we use a simple KL divergence loss (without temperature scaling) on the logits only, matching the outputs between the teacher model (unquantized bfloat16) and the student model (with frozen FP4 weights and trainable post-scale/shift). While it's common to apply a temperature during distillation, we found that setting the temperature parameter to 1 gave the best results. Including intermediate-layer losses did not yield further improvements.
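
A compact version of this loss, assuming the student and teacher logits are already gathered for the same batch of tokens, could look like the following:

```python
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits):
    # KL(teacher || student) over the vocabulary, with the temperature fixed to 1.
    # Logits are flattened to (num_tokens, vocab) so batchmean averages per token.
    vocab = student_logits.size(-1)
    return F.kl_div(
        F.log_softmax(student_logits.float().view(-1, vocab), dim=-1),
        F.log_softmax(teacher_logits.float().view(-1, vocab), dim=-1),
        log_target=True,
        reduction="batchmean",
    )
```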

We use a custom linear learning rate scheduler with 3% warm-up and a final learning rate of 1e-6. The MXFP4-calibrated model is trained for 2 epochs, while the NVFP4 model is trained for only 1 epoch, as it reaches our target accuracy of over 99% after the first epoch. Regarding the data, we train on a mixture of 50K examples randomly sampled from a variety of open-source datasets across domains such as math, knowledge, and code. These include prompts from MetaMathQA, Orca Math and Evol-Instruct.
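
A sketch of such a scheduler using LambdaLR is shown below; the peak learning rate is a placeholder, since the post only fixes the warm-up ratio (3%) and the final learning rate (1e-6):

```python
import torch

def linear_warmup_decay_scheduler(optimizer, total_steps, peak_lr, final_lr=1e-6,
                                  warmup_frac=0.03):
    # Linearly ramp up over the first 3% of steps, then decay linearly to final_lr.
    warmup_steps = max(1, int(warmup_frac * total_steps))

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 1.0 + progress * (final_lr / peak_lr - 1.0)

    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```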

We also experimented with making Wq learnable using techniques such as the Straight-Through Estimator (STE) and soft rounding. However, these approaches were unnecessary in practice and sometimes even degraded performance.

After examining the trained post_scale and shift parameters, we observe that most of the post_scales remain close to their initial values, while the shift terms consistently adapt across all layers. This indicates that a simple bias correction is sufficient for most of the quality recovery, in contrast to the post-scaling approach used in NVFP4. Furthermore, if post_scale is used, the entire operation can be fused into a single FMA (fused multiply-add) instruction, incurring no additional compute overhead.

Results

We run experiments on the Llama-3.1-8B-Instruct model as a reference benchmark. Larger models are expected to achieve better quality post-quantization, as they tend to be less sensitive to quantization error.

Uncalibrated Results

We begin by reporting the performance of the uncalibrated models before the distillation phase. For reference, we also include results from HQQ using a group size of 64. As shown in Table 2, MXFP4 weights lead to significant degradation in quality — particularly on math-heavy tasks such as GSM8K. In contrast, the NVFP4 format preserves quality much better, with performance closer to that of HQQ.


Table 2. Uncalibrated 4-bit weight-only performance on Llama-3.1-8B-Instruct.

| Models               | FP16  | HQQ    | MXFP4  | NVFP4  |
|----------------------|-------|--------|--------|--------|
| ARC (25-shot)        | 60.49 | 60.32  | 58.44  | 59.21  |
| HellaSwag (10-shot)  | 80.16 | 79.21  | 78.03  | 78.55  |
| MMLU (5-shot)        | 68.98 | 67.07  | 65.14  | 66.15  |
| TruthfulQA-MC2       | 54.03 | 53.89  | 48.49  | 51.61  |
| Winogrande (5-shot)  | 77.98 | 76.24  | 74.43  | 76.64  |
| GSM8K (5-shot)       | 75.44 | 71.27  | 62.77  | 66.15  |
| Average              | 69.51 | 68.00  | 64.55  | 67.82  |
| Relative Performance | 100%  | 97.83% | 92.86% | 97.57% |

Calibrated Results

We now report results after the distillation phase for both MXFP4 and NVFP4. For comparison, we also include results from INT4-calibrated methods such as GPTQ and AWQ in Table 3.


Table 3. Calibrated 4-bit weight-only performance on Llama-3.1-8B-Instruct.

| Models               | FP16  | HQQ    | AWQ    | GPTQ   | MXFP4  | NVFP4  |
|----------------------|-------|--------|--------|--------|--------|--------|
| ARC (25-shot)        | 60.49 | 60.92  | 57.85  | 61.18  | 60.84  | 60.24  |
| HellaSwag (10-shot)  | 80.16 | 79.52  | 79.28  | 77.82  | 78.89  | 79.60  |
| MMLU (5-shot)        | 68.98 | 67.74  | 67.14  | 67.93  | 66.69  | 66.95  |
| TruthfulQA-MC2       | 54.03 | 54.11  | 51.87  | 53.58  | 53.42  | 54.32  |
| Winogrande (5-shot)  | 77.98 | 76.48  | 76.40  | 76.64  | 77.43  | 77.35  |
| GSM8K (5-shot)       | 75.44 | 75.36  | 73.47  | 72.25  | 76.19  | 75.13  |
| Average              | 69.51 | 69.02  | 67.67  | 68.23  | 68.91  | 68.93  |
| Relative Performance | 100%  | 99.30% | 97.35% | 98.16% | 99.14% | 99.17% |

As shown, our simple distillation procedure successfully recovers accuracy. The relative performance with respect to the unquantized weights improved significantly, rising from 92.86% to 99.14% for MXFP4, and from 97.57% to 99.17% for NVFP4. This highlights that a straightforward bias correction, similar to what's used in the calibrated version of HQQ, is an effective strategy for restoring accuracy in both INT4 and FP4 weight quantization formats.

Conclusion

In this blog post, we presented a lightweight method for recovering accuracy in FP4 weight quantization using logit distillation. We demonstrated that a simple bias correction is sufficient to improve accuracy — even with the more constrained MXFP4 format — and can be seamlessly integrated into the FP4 inference pipeline with minimal to zero overhead.

While this work focused on static, pre-quantized weights, the same technique can be extended to handle activation quantization errors. In particular, combining it with an additional level of post-scaling (i.e., two-level scaling) offers a promising path for addressing quantization challenges in FP4-based pipelines.

We have made both the MXFP4 and NVFP4 calibrated models publicly available on Hugging Face 🤗!

Citation


@misc{badri2025fp4,
  title  = {Improving Quantized FP4 Weight Quality via Logit Distillation},
  url    = {https://mobiusml.github.io/fp4_blogpost/},
  author = {Hicham Badri},
  month  = {June},
  year   = {2025}
}

Please feel free to contact us.