Featured Blog Posts
Our models are named 'Aana' (ആന), which means 'Elephant' in Malayalam.
Recent research in extreme low-bit quantization, particularly in using quantized weights for multiplication-free matrix operations, is gaining traction for its potential to enhance machine learning model efficiency. Our work extends this by testing the direct quantization of pre-trained models to binary levels. We investigate using HQQ+, an advanced version of HQQ with a low-rank adapter, for quantizing pre-trained models to 1 and 2 bits. Our findings reveal that partially training the weights of an HQQ-quantized model can notably boost its performance, even at 1-bit, surpassing smaller full-precision models.
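As a rough, conceptual sketch of the idea (not the actual HQQ+ implementation), the example below shows a linear layer whose weights are frozen at 1-bit precision while a small trainable low-rank adapter corrects the quantization error; the class name, shapes, and initialization are made up for illustration.

```python
import torch
import torch.nn as nn

class BinaryLinearWithLoRA(nn.Module):
    """Toy HQQ+-style layer: frozen 1-bit weights plus a trainable low-rank adapter."""
    def __init__(self, weight: torch.Tensor, rank: int = 16):
        super().__init__()
        out_features, in_features = weight.shape
        # 1-bit quantization per output channel: keep only the sign and a scale.
        scale = weight.abs().mean(dim=1, keepdim=True)   # per-row scale
        w_bin = torch.sign(weight)                       # values in {-1, 0, +1}
        self.register_buffer("w_bin", w_bin)
        self.register_buffer("scale", scale)
        # Low-rank adapter: the only trainable parameters.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        w_deq = self.w_bin * self.scale                  # dequantized 1-bit weight
        return x @ (w_deq + self.lora_B @ self.lora_A).T

# Usage: wrap a pre-trained layer's weight and train only the adapter.
layer = BinaryLinearWithLoRA(torch.randn(256, 512), rank=8)
y = layer(torch.randn(4, 512))
print(y.shape)  # torch.Size([4, 256])
```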
Answer.AI, in collaboration with Tim Dettmers, Hugging Face, and our team, is launching FSDP/QLoRA. By combining QLoRA (which enables training larger models on a single GPU) with FSDP (which scales training across multiple GPUs), it makes large-model training, which traditionally required data-center GPUs such as A100s and H100s, accessible to smaller companies and individuals.
Our contribution was to integrate HQQ, the quantization technique we developed, with FSDP in collaboration with Answer.AI. HQQ not only improves accuracy but also makes quantizing 70B models 50x faster than techniques like GPTQ.
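For orientation, here is a minimal sketch of how a quantized model can be wrapped with PyTorch FSDP for multi-GPU training; the `quantize_linears` helper is hypothetical and simply stands in for replacing nn.Linear layers with HQQ-quantized ones, so refer to the Answer.AI announcement and the hqq repository for the actual integration.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def quantize_linears(model: torch.nn.Module) -> torch.nn.Module:
    # Hypothetical placeholder: in the real workflow, nn.Linear layers are
    # replaced by HQQ-quantized layers (see github.com/mobiusml/hqq).
    return model

def main():
    dist.init_process_group("nccl")              # one process per GPU (e.g. launched via torchrun)
    torch.cuda.set_device(dist.get_rank())

    model = torch.nn.Sequential(                 # stand-in for a large transformer
        torch.nn.Linear(4096, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 4096),
    )
    model = quantize_linears(model)              # quantize the weights before sharding
    model = FSDP(model, device_id=torch.cuda.current_device())  # shard across GPUs

    x = torch.randn(2, 4096, device="cuda")
    loss = model(x).sum()
    loss.backward()                              # gradients flow to the trainable params only

if __name__ == "__main__":
    main()
```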
We're introducing HQQ-quantized Mixtral models featuring metadata offloading. This method stores metadata such as scaling parameters and zero points on the CPU while keeping the model weights on the GPU, significantly reducing VRAM requirements and making it possible to run larger models on consumer-grade hardware. For instance, our 2-bit/4-bit Mixtral model requires only 13GB of memory, compared to over 90GB for the full model, while delivering comparable lm_eval scores. These models run efficiently on GPUs like the RTX 4090 and 3090, eliminating the need for multiple A100s.
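The snippet below is a simplified illustration of the offloading idea, not the library's actual implementation: the packed low-bit weights live on the GPU, while the scale and zero-point metadata stay in CPU memory and are moved over only when a layer is dequantized.

```python
import torch

class OffloadedQuantLinear(torch.nn.Module):
    """Toy illustration: quantized weights on the GPU, scale/zero-point metadata on the CPU."""
    def __init__(self, weight: torch.Tensor, nbits: int = 2, device: str = "cuda"):
        super().__init__()
        qmax = 2 ** nbits - 1
        w_min = weight.min(dim=1, keepdim=True)[0]
        w_max = weight.max(dim=1, keepdim=True)[0]
        scale = (w_max - w_min).clamp(min=1e-8) / qmax
        zero = (-w_min / scale).round()
        q = ((weight / scale) + zero).round().clamp(0, qmax)
        self.register_buffer("q", q.to(torch.uint8).to(device))  # quantized weights on the GPU
        self.scale = scale.cpu()                                  # metadata offloaded to the CPU
        self.zero = zero.cpu()

    def forward(self, x):
        # Move the small metadata tensors to the GPU only when they are needed.
        scale = self.scale.to(x.device, non_blocking=True)
        zero = self.zero.to(x.device, non_blocking=True)
        w = (self.q.float() - zero) * scale                       # dequantize on the fly
        return x @ w.T

layer = OffloadedQuantLinear(torch.randn(256, 512), nbits=2, device="cuda")
y = layer(torch.randn(4, 512, device="cuda"))
```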
aanaphi2-v0.1 is a fine-tuned (SFT + DPO) chat model based on Microsoft's Phi-2 base model (2.8B parameters). At the time of writing, it ranks first on the Open LLM Leaderboard in the 3-billion-parameter class.
A short blog post on our approach and learnings is coming soon.
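Until then, here is a hedged example of loading the model with the transformers library; the repository id and the prompt format below are assumptions, so please check the model card for the exact usage.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id assumed from the blog post; see the model card for exact usage.
model_id = "mobiuslabsgmbh/aanaphi2-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Prompt format is an assumption; adjust to the template documented with the model.
prompt = "### Human: What is an elephant called in Malayalam?\n### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```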
Half-Quadratic Quantization (HQQ) is a new, calibration-free method that quickly and effectively compresses large models such as Mixtral and Llama-2-70B, taking 50x less time to quantize than calibration-based techniques; the quantized Llama-2-70B significantly outperforms the full-precision Llama-2-13B in both memory efficiency and quality.
Find a few pre-quantized models on our Hugging Face page at:
Code available at https://github.com/mobiusml/hqq
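To give a flavor of what calibration-free means, here is a loose sketch of a half-quadratic-splitting style solver in the spirit of HQQ: the per-group zero point is refined by minimizing the weight reconstruction error directly, with no calibration data involved. This is not the real HQQ solver (which, among other differences, uses a generalized lp shrinkage with p < 1 rather than the l1 soft threshold used here); see the repository above for the actual implementation.

```python
import torch

def soft_threshold(x: torch.Tensor, beta: float) -> torch.Tensor:
    # l1 shrinkage operator (a simplification of the lp<1 shrinkage used by HQQ).
    return torch.sign(x) * torch.clamp(x.abs() - 1.0 / beta, min=0.0)

def hqq_style_quant(w: torch.Tensor, nbits: int = 4, group_size: int = 64,
                    iters: int = 20, beta: float = 10.0):
    """Simplified half-quadratic-splitting sketch (not the real HQQ code).

    Quantization: W_q = round(W/s + z); dequantization: W_deq = s * (W_q - z).
    Only the zero point z is optimized; no calibration data is needed because the
    objective is the reconstruction error of the weights themselves.
    """
    qmax = 2 ** nbits - 1
    wg = w.reshape(-1, group_size)                        # [num_groups, group_size]
    w_min = wg.min(dim=1, keepdim=True)[0]
    w_max = wg.max(dim=1, keepdim=True)[0]
    s = (w_max - w_min).clamp(min=1e-8) / qmax            # fixed per-group scale
    z = -w_min / s                                        # initial per-group zero point

    for _ in range(iters):
        wq = (wg / s + z).round().clamp(0, qmax)          # quantize
        w_deq = s * (wq - z)                              # dequantize
        we = soft_threshold(wg - w_deq, beta)             # sub-problem 1: sparse error term
        z = (wq - (wg - we) / s).mean(dim=1, keepdim=True)  # sub-problem 2: closed-form z update

    wq = (wg / s + z).round().clamp(0, qmax).to(torch.uint8)
    return wq, s, z

wq, s, z = hqq_style_quant(torch.randn(1024, 1024))
```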
We examine low-rankness as a pruning strategy for the Llama2-7B model, reducing its parameters by 50% without requiring custom kernels. By decomposing the linear layer weights and using LoRA for training, we outperform bitsandbytes' 8-bit quantization while halving training times. This approach also boosts inference speed by up to 1.25x.
Code available at https://github.com/mobiusml/low-rank-llama2/tree/main/code
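As an illustration of the decomposition step, the sketch below replaces a dense linear layer with two smaller layers obtained from a truncated SVD of its weight matrix; the rank and layer sizes are chosen only for the example, and the subsequent LoRA training is omitted.

```python
import torch
import torch.nn as nn

def low_rank_factorize(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a dense linear layer with two smaller layers via truncated SVD."""
    W = layer.weight.data                                  # [out_features, in_features]
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = Vh[:rank, :] * S[:rank].sqrt().unsqueeze(1)        # [rank, in_features]
    B = U[:, :rank] * S[:rank].sqrt().unsqueeze(0)         # [out_features, rank]

    down = nn.Linear(layer.in_features, rank, bias=False)
    up = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    down.weight.data.copy_(A)
    up.weight.data.copy_(B)
    if layer.bias is not None:
        up.bias.data.copy_(layer.bias.data)
    return nn.Sequential(down, up)

# Example: keep roughly 50% of the parameters of a 4096x4096 projection.
dense = nn.Linear(4096, 4096)
rank = 1024                                   # 2 * (4096 * 1024) is about half of 4096 * 4096
factored = low_rank_factorize(dense, rank)
x = torch.randn(2, 4096)
print(torch.dist(dense(x), factored(x)))      # approximation error of the truncated SVD
```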