Re-Distilling Smaller DeepSeek R1 Models for Better Performance

Hicham Badri, Appu Shaji

Mobius Labs GmbH


The DeepSeek R1 model represents a major breakthrough in AI reasoning capabilities. With its ability to handle complex problem-solving tasks, it has set new standards in the field. However, running this powerful model requires significant computational resources - it contains over 600 billion parameters, making it impractical for most users.

Fortunately, smaller, distilled versions of R1 have been developed that can run on standard hardware while preserving much of the original model's capabilities.

In this article, we explore improving these smaller R1 models through logits distillation from larger models. Our experiments show significant performance gains across various tasks and benchmarks, from mathematical reasoning to general knowledge. Moreover, re-distillation is highly cost-effective, with our experiments costing only between $3 and $18 per model.

Want to try it yourself? Our improved models are freely available on Hugging Face 🤗.


Introduction

The DeepSeek R1 distilled models are currently the most popular way to run the R1 model, as they can be efficiently executed on consumer hardware. This family of smaller models includes various Qwen2.5 and Llama3 models, which can operate faster with quantization methods like HQQ, utilizing backends such as GemLite.
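
For reference, here is a minimal sketch of loading one of these distilled models with on-the-fly HQQ quantization through the transformers integration; the model id, bit-width, and group size below are illustrative choices rather than a prescribed configuration.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

    # Illustrative model id and quantization settings
    model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
    quant_config = HqqConfig(nbits=4, group_size=64)  # 4-bit weights, group size 64

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="cuda",
        quantization_config=quant_config,  # weights are quantized with HQQ at load time
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

The quantized weights can then be served through fused low-bit kernels such as GemLite for additional speed-ups.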

According to the paper, the distilled models were trained via Supervised Fine-Tuning (SFT) on 800,000 samples generated by the original R1 model. Typically, a second alignment step based on reinforcement learning (RLHF) or preference/policy optimization methods such as DPO or GRPO would further improve performance.

In this work, we propose aligning smaller models with larger, more performant models via logits distillation. This method does not require preference data or even perfectly accurate samples. Using approximately 35,000 samples, this simple approach boosts performance across various benchmarks and costs only a few dollars per run. Below, we describe the approach in detail.

Re-Distillation Approach

Our approach leverages logits distillation between models sharing similar tokenizers. The key idea is to use larger models' output distributions to guide smaller ones, which proves effective without requiring extensive training data. In our experiments, we used Qwen 32B to guide both Qwen 1.5B and Qwen 7B, while Llama3-70B guided Llama3-8B.

Due to memory constraints with larger teacher models, quantization is necessary - we used HQQ to compress the teacher models to 8-bit precision. In our specific setup, we ran the pipeline on two GPUs: the quantized teacher generated logits on one GPU while the student trained on the other, with operations parallelized via CUDA streams. We kept the batch size at 1 to manage memory usage, though different hardware configurations and batch sizes are possible depending on available resources.
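
The following is a simplified sketch of this two-GPU setup. The model ids, optimizer, and stream handling are illustrative, and in practice the teacher is first quantized to 8-bit with HQQ so it fits comfortably on a single GPU; the loss function is the masked KL-divergence defined in the next section.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative teacher/student pair (the teacher is 8-bit HQQ-quantized in our actual runs)
    teacher = AutoModelForCausalLM.from_pretrained(
        "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", torch_dtype=torch.float16
    ).to("cuda:0").eval()
    student = AutoModelForCausalLM.from_pretrained(
        "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", torch_dtype=torch.bfloat16
    ).to("cuda:1")
    tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

    optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
    teacher_stream = torch.cuda.Stream(device="cuda:0")

    def distill_step(prompt):
        inputs = tokenizer(prompt, return_tensors="pt")  # batch size 1 to limit memory usage
        teacher_inputs = {k: v.to("cuda:0") for k, v in inputs.items()}
        student_inputs = {k: v.to("cuda:1") for k, v in inputs.items()}

        # Teacher forward pass on GPU 0, launched on its own CUDA stream
        with torch.cuda.stream(teacher_stream), torch.no_grad():
            teacher_logits = teacher(**teacher_inputs).logits

        # Student forward pass on GPU 1, overlapping with the teacher's work
        student_logits = student(**student_inputs).logits

        teacher_stream.synchronize()  # make sure the teacher logits are ready
        loss = KL_divergence_masked_loss(student_logits, teacher_logits.to("cuda:1"))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()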

For the training objective, we employ a KL-divergence loss with logits clipping - a crucial detail, as we found that larger models can produce over-confident predictions that lead to training instability. When vocabulary sizes don't match between models, we zero-pad the missing embeddings in the lm-head and mask them in the loss computation (a sketch of this padding step follows the loss code below). While it is possible to combine this with a next-token prediction loss when reasoning data is available, our experiments focused on pure KL-divergence with a linear learning rate schedule and a single pass over the dataset. Below is our masked KL-divergence loss implementation:


    import torch.nn.functional as F

    def KL_divergence_masked_loss(output, target, target_mask=1, T=1, clip_logits=50):
        # output: student logits | target: teacher logits
        # target_mask: 1 for shared vocabulary entries, 0 for zero-padded ones | T: temperature

        # Clip logits to avoid NaN loss from over-confident teacher predictions
        target.clamp_(-clip_logits, clip_logits)
        output.clamp_(-clip_logits, clip_logits)

        # Zero the teacher logits for vocabulary entries absent from the student
        # (these correspond to the zero-padded embeddings in the lm-head)
        target *= target_mask

        # Regular KL-divergence loss
        target = F.softmax(target / T, dim=-1)
        output = F.log_softmax(output / T, dim=-1)
        out    = (target * (target.log() - output)).sum(dim=-1).mean()

        return out
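
When the teacher and student vocabulary sizes differ, the padding and mask can be built as in the sketch below. This is one possible way to implement the zero-padding and masking described above, assuming the student has the smaller lm-head; the helper name is ours.

    import torch

    def pad_lm_head_and_build_mask(student, student_vocab_size, teacher_vocab_size):
        # Zero-pad the student's lm-head so it emits logits over the teacher's vocabulary size
        old_head = student.lm_head  # assumes a standard HF causal LM with an lm_head Linear (no bias)
        if teacher_vocab_size > student_vocab_size:
            new_head = torch.nn.Linear(old_head.in_features, teacher_vocab_size, bias=False)
            with torch.no_grad():
                new_head.weight.zero_()
                new_head.weight[:student_vocab_size] = old_head.weight
            student.lm_head = new_head.to(old_head.weight.device, old_head.weight.dtype)

        # 1 for tokens present in both vocabularies, 0 for the zero-padded tail;
        # passed as target_mask to KL_divergence_masked_loss above
        target_mask = torch.ones(teacher_vocab_size, device=old_head.weight.device)
        target_mask[student_vocab_size:] = 0.0
        return target_mask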

For our training data, we sampled approximately 35,000 examples from various open-source datasets covering math, general knowledge, and coding, including prompts from MetaMathQA, Orca Math, and Evol-Instruct. Newly released open-source R1 reasoning datasets, such as Numina R1 and R1-Distill-SFT, are also a valuable resource.
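
As an illustration, a prompt mix of this kind can be assembled with the datasets library. The dataset ids, text fields, and per-source counts below are assumptions for the sketch, not our exact recipe.

    from datasets import load_dataset

    # Assumed dataset ids / text fields / sample counts - adjust to taste
    sources = [
        ("meta-math/MetaMathQA",                    "query",       15_000),
        ("microsoft/orca-math-word-problems-200k",  "question",    10_000),
        ("WizardLMTeam/WizardLM_evol_instruct_70k", "instruction", 10_000),
    ]

    prompts = []
    for name, field, n in sources:
        ds = load_dataset(name, split="train").shuffle(seed=42).select(range(n))
        prompts += [row[field] for row in ds]

    print(len(prompts))  # ~35K prompts covering math, knowledge, and instructions/coding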

Costs

The costs of the final re-distillation training phase, using 2 x H100 SXM GPUs, are as follows:

Model                  Re-distillation cost (training time)
Qwen1.5B Re-Distill    $3.5  (~1 hour)
Qwen7B Re-Distill      $7.5  (~2 hours)
Llama3-8B Re-Distill   $18   (~5 hours)

Table 1. Re-distillation cost for the final training step.

Caveat: The experimentation costs were approximately 20 times higher than the final training cost, as we had to test various data splits and hyperparameters. Notably, a few runs failed, particularly with Llama3-8B, due to NaN losses, and the initial Llama3 runs failed to generate the reasoning step correctly, requiring additional synthetic R1 reasoning data to work properly. Running the benchmarks added to the expense as well, with each run taking several hours to complete. The table above also does not include the data generation cost.

Benchmarks

Below, we present various comparative benchmarks that demonstrate significant improvements across multiple tasks. Notably, the GSM8K score increases by over 4 points compared to the original distilled models for the Qwen models, and by almost 14 points for the Llama3 version.
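
For readers who want to reproduce these numbers, scores of this kind can be computed with an evaluation harness such as EleutherAI's lm-evaluation-harness; the snippet below is a sketch assuming that harness, and the task names may need adjusting to match the exact leaderboard configurations.

    import lm_eval

    # Sketch: evaluate GSM8K (5-shot); the model id is assumed from the table below
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-1.5B-v1.1,dtype=bfloat16",
        tasks=["gsm8k"],
        num_fewshot=5,
        batch_size=8,
    )
    print(results["results"]["gsm8k"])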

Qwen 1.5B

Models                     DeepSeek-R1-Distill-Qwen-1.5B    DeepSeek-R1-ReDistill-Qwen-1.5B-v1.1
ARC (25-shot)              40.96                            41.55
HellaSwag (10-shot)        44                               45.88
MMLU (5-shot)              39.27                            41.82
TruthfulQA-MC2             45.17                            46.63
Winogrande (5-shot)        55.49                            57.7
GSM8K (5-shot)             69.9                             74.3
Average                    49.13                            51.31
GPQA (0-shot)              26.96                            26.99
MMLU PRO (5-shot)          16.74                            19.86
MUSR (0-shot)              35.93                            36.6
BBH (3-shot)               35.12                            37.23
IfEval (0-shot)            24.94                            27.22

Table 2. Qwen-1.5B re-distillation comparative results.


Qwen 7B

Models                     DeepSeek-R1-Distill-Qwen-7B    DeepSeek-R1-ReDistill-Qwen-7B-v1.1
ARC (25-shot)              55.03                          52.3
HellaSwag (10-shot)        61.9                           62.36
MMLU (5-shot)              56.75                          59.53
TruthfulQA-MC2             45.76                          47.7
Winogrande (5-shot)        60.38                          61.8
GSM8K (5-shot)             78.85                          83.4
Average                    59.78                          61.18
GPQA (0-shot)              30.9                           34.99
MMLU PRO (5-shot)          28.83                          31.02
MUSR (0-shot)              38.85                          44.42
BBH (3-shot)               43.54                          51.53
IfEval (0-shot) - strict   42.33                          35.49
IfEval (0-shot) - loose    30.31                          38.49

Table 3. Qwen-7B re-distillation comparative results.


Llama3-8B

Models                     DeepSeek-R1-Distill-Llama3-8B    DeepSeek-R1-ReDistill-Llama3-8B-v1.1
ARC (25-shot)              49.32                            50.00
HellaSwag (10-shot)        76.75                            76.2
MMLU (5-shot)              56.87                            58.78
TruthfulQA-MC2             50.53                            51.94
Winogrande (5-shot)        68.11                            70.25
GSM8K (5-shot)             61.79                            75.66
Average                    60.56                            63.81
GPQA (0-shot)              29                               33.98
MMLU PRO (5-shot)          27.44                            28.4
MUSR (0-shot)              38.29                            41.82
BBH (3-shot)               41.57                            49.59
IfEval (0-shot) - strict   42.81                            39.09
IfEval (0-shot) - loose    30.05                            40.29

Table 4. Llama3-8B re-distillation comparative results.


Conclusion

In this work, we have shown that logit alignment, when used as a second-stage distillation, is a highly effective and economical method for enhancing smaller models, such as the DeepSeek R1 models. We are excited to share these improved models with the community on Hugging Face 🤗. We believe that this approach will inspire further advancements in the development of high-performance, small-scale models. We look forward to seeing how the community leverages this technique to push the boundaries of what is possible.

Citation


@misc{badri2025r1,
  title  = {Re-Distilling Smaller DeepSeek R1 Models for Better Performance},
  url    = {https://mobiusml.github.io/r1_redistill_blogpost/},
  author = {Hicham Badri and Appu Shaji},
  month  = {January},
  year   = {2025}
}

Please feel free to contact us.