Text Generation Models (LLMs)¶
Aana SDK has three deployments to serve text generation models (LLMs):
- VLLMDeployment: allows you to efficiently serve Large Language Models (LLMs) and Vision Language Models (VLMs) with the vLLM library.
- HfTextGenerationDeployment: uses the Hugging Face Transformers library to deploy text generation models.
- HqqTextGenerationDeployment: uses Half-Quadratic Quantization (HQQ) to quantize and deploy text generation models.
All deployments have the same interface and provide similar capabilities.
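For example, once a deployment is running, you can query it through an AanaDeploymentHandle using the shared chat interface. The snippet below is only a sketch: it assumes the ChatDialog and ChatMessage models from aana.core.models.chat and an already created handle for the deployment (the Structured Generation section below shows the same chat call with structured output).
from aana.core.models.chat import ChatDialog, ChatMessage
from aana.core.models.sampling import SamplingParams

# Assumed: `handle` is an AanaDeploymentHandle for any of the three deployments.
dialog = ChatDialog(messages=[ChatMessage(role="user", content="Hello!")])
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# chat returns the full response; chat_stream yields it chunk by chunk.
response = await handle.chat(dialog, sampling_params=sampling_params)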
vLLM Deployment¶
vLLM deployment allows you to efficiently serve Large Language Models (LLM) and Vision Language Models (VLM) with the vLLM library.
Tip
To use the vLLM deployment, install the required libraries with pip install vllm or include the extra dependencies using pip install aana[vllm].
VLLMConfig is used to configure the vLLM deployment.
aana.deployments.vllm_deployment.VLLMConfig¶
Attributes:

- model_id (str) – The model name.
- dtype (Dtype) – The data type. Defaults to Dtype.AUTO.
- quantization (str | None) – The quantization method. Defaults to None.
- gpu_memory_reserved (float) – The GPU memory reserved for the model in MB.
- default_sampling_params (SamplingParams) – The default sampling parameters. Defaults to SamplingParams(temperature=0, max_tokens=256).
- max_model_len (int | None) – The maximum generated text length in tokens. Defaults to None.
- chat_template (str | None) – The name of the chat template. If not provided, the chat template from the model will be used. Some models may not have a chat template. Defaults to None.
- enforce_eager (bool) – Whether to enforce eager execution. Defaults to False.
- engine_args (CustomConfig) – Extra engine arguments. Defaults to {}.
Example Configurations¶
As an example, let's see how to configure the vLLM deployment for the Meta Llama 3 8B Instruct model.
Meta Llama 3 8B Instruct
from aana.core.models.sampling import SamplingParams
from aana.core.models.types import Dtype
from aana.deployments.vllm_deployment import VLLMConfig, VLLMDeployment
VLLMDeployment.options(
    num_replicas=1,
    max_ongoing_requests=1000,
    ray_actor_options={"num_gpus": 0.45},
    user_config=VLLMConfig(
        model_id="meta-llama/Meta-Llama-3-8B-Instruct",
        dtype=Dtype.AUTO,
        gpu_memory_reserved=30000,
        enforce_eager=True,
        default_sampling_params=SamplingParams(
            temperature=0.0, top_p=1.0, top_k=-1, max_tokens=1024
        ),
    ).model_dump(mode="json"),
)
The model name is the Hugging Face model ID. We use Dtype.AUTO to let the deployment choose the best data type for the model. We reserve 30GB of GPU memory for the model. We set enforce_eager=True, which helps to reduce memory usage but may harm performance. We also set the default sampling parameters for the model.
The vLLM deployment also supports Vision Language Models (VLMs). Here is an example configuration for the Phi 3.5 Vision Instruct model.
Phi 3.5 Vision Instruct
from aana.core.models.sampling import SamplingParams
from aana.deployments.vllm_deployment import VLLMConfig, VLLMDeployment
VLLMDeployment.options(
    num_replicas=1,
    ray_actor_options={"num_gpus": 1.0},
    user_config=VLLMConfig(
        model_id="microsoft/Phi-3.5-vision-instruct",
        gpu_memory_reserved=12000,
        enforce_eager=True,
        default_sampling_params=SamplingParams(
            temperature=0.0, top_p=1.0, top_k=-1, max_tokens=1024
        ),
        max_model_len=2048,
        engine_args=dict(
            trust_remote_code=True,
            max_num_seqs=32,
            limit_mm_per_prompt={"image": 3},
        ),
    ).model_dump(mode="json"),
)
Here are some other example configurations for the vLLM deployment. Keep in mind that the list is not exhaustive. You can deploy any model that is supported by the vLLM library.
Llama 2 7B Chat with AWQ quantization
from aana.core.models.sampling import SamplingParams
from aana.core.models.types import Dtype
from aana.deployments.vllm_deployment import VLLMConfig, VLLMDeployment
VLLMDeployment.options(
    num_replicas=1,
    ray_actor_options={"num_gpus": 0.25},
    user_config=VLLMConfig(
        model_id="TheBloke/Llama-2-7b-Chat-AWQ",
        dtype=Dtype.AUTO,
        quantization="awq",
        gpu_memory_reserved=13000,
        enforce_eager=True,
        default_sampling_params=SamplingParams(
            temperature=0.0, top_p=1.0, top_k=-1, max_tokens=1024
        ),
        chat_template="llama2",
    ).model_dump(mode="json"),
)
InternLM 2.5 7B Chat
from aana.core.models.sampling import SamplingParams
from aana.core.models.types import Dtype
from aana.deployments.vllm_deployment import VLLMConfig, VLLMDeployment
VLLMDeployment.options(
    num_replicas=1,
    ray_actor_options={"num_gpus": 0.45},
    user_config=VLLMConfig(
        model_id="internlm/internlm2_5-7b-chat",
        dtype=Dtype.AUTO,
        gpu_memory_reserved=30000,
        max_model_len=50000,
        enforce_eager=True,
        default_sampling_params=SamplingParams(
            temperature=0.0, top_p=1.0, top_k=-1, max_tokens=1024
        ),
        engine_args={"trust_remote_code": True},
    ).model_dump(mode="json"),
)
Phi 3 Mini 4K Instruct
from aana.core.models.sampling import SamplingParams
from aana.core.models.types import Dtype
from aana.deployments.vllm_deployment import VLLMConfig, VLLMDeployment
VLLMDeployment.options(
    num_replicas=1,
    max_ongoing_requests=1000,
    ray_actor_options={"num_gpus": 0.25},
    user_config=VLLMConfig(
        model_id="microsoft/Phi-3-mini-4k-instruct",
        dtype=Dtype.AUTO,
        gpu_memory_reserved=10000,
        enforce_eager=True,
        default_sampling_params=SamplingParams(
            temperature=0.0, top_p=1.0, top_k=-1, max_tokens=1024
        ),
        engine_args={
            "trust_remote_code": True,
        },
    ).model_dump(mode="json"),
)
Qwen2-VL 7B Instruct
For LLaVA-NeXT-Video and Qwen2-VL, the latest release of huggingface/transformers doesn’t work yet (as of 18 September 2024), so we need to use a developer version (21fac7abba2a37fae86106f87fcf9974fd1e3830) for now. This can be installed by running the following command:
pip install git+https://github.com/huggingface/transformers.git@21fac7abba2a37fae86106f87fcf9974fd1e3830
from aana.core.models.sampling import SamplingParams
from aana.deployments.vllm_deployment import VLLMConfig, VLLMDeployment
VLLMDeployment.options(
    num_replicas=1,
    ray_actor_options={"num_gpus": 1.0},
    user_config=VLLMConfig(
        model_id="Qwen/Qwen2-VL-7B-Instruct",
        gpu_memory_reserved=40000,
        enforce_eager=True,
        default_sampling_params=SamplingParams(
            temperature=0.0, top_p=1.0, top_k=-1, max_tokens=1024
        ),
        max_model_len=4096,
        engine_args=dict(
            limit_mm_per_prompt={"image": 3},
        ),
    ).model_dump(mode="json"),
)
Pixtral 12B 2409
The model is gated, so you need to visit the model page, request access to the model, and set the HF_TOKEN environment variable to your Hugging Face API token.
from aana.core.models.sampling import SamplingParams
from aana.deployments.vllm_deployment import VLLMConfig, VLLMDeployment
VLLMDeployment.options(
    num_replicas=1,
    ray_actor_options={"num_gpus": 1.0},
    user_config=VLLMConfig(
        model_id="mistralai/Pixtral-12B-2409",
        gpu_memory_reserved=40000,
        enforce_eager=True,
        default_sampling_params=SamplingParams(
            temperature=0.0, top_p=1.0, top_k=-1, max_tokens=1024
        ),
        max_model_len=4096,
        engine_args=dict(
            tokenizer_mode="mistral",
            limit_mm_per_prompt={"image": 3},
        ),
    ).model_dump(mode="json"),
)
Structured Generation¶
Structured generation is a feature that allows you to generate structured data with the vLLM deployment by forcing the LLM to adhere to a specific JSON schema or regular expression pattern.
Structured generation is supported only for the vLLM deployment at the moment.
To enable structured generation, pass a JSON schema or a regular expression pattern to the SamplingParams object.
# For JSON schema set json_schema parameter to the JSON schema string
sampling_params = SamplingParams(json_schema=schema, temperature=0.0, max_tokens=512)
# For regular expression set regex_string parameter to the regular expression pattern
sampling_params = SamplingParams(regex_string=regex_pattern, temperature=0.0, max_tokens=512)
# Pass the sampling_params to one of the vLLM deployment methods like chat or chat_stream
# Here handle is an AanaDeploymentHandle for the vLLM deployment.
response = await handle.chat(dialog, sampling_params=sampling_params)
You can use Pydantic models to generate JSON schema.
import json
from pydantic import BaseModel

class CityDescription(BaseModel):
    city: str
    country: str
    description: str

schema = json.dumps(CityDescription.model_json_schema())
# {"properties": {"city": {"title": "City", "type": "string"}, "country": {"title": "Country", "type": "string"}, "description": {"title": "Description", "type": "string"}}, "required": ["city", "country", "description"], "title": "CityDescription", "type": "object"}
You can find detailed tutorials on how to use structured generation in the Structured Generation notebook.
Hugging Face Text Generation Deployment¶
HfTextGenerationDeployment uses the Hugging Face Transformers library to deploy text generation models.
Tip
To use the HF Text Generation deployment, install the required libraries with pip install transformers or include the extra dependencies using pip install aana[transformers].
HfTextGenerationConfig is used to configure the Hugging Face Text Generation deployment.
aana.deployments.hf_text_generation_deployment.HfTextGenerationConfig¶
Attributes:

- model_id (str) – The model ID on Hugging Face.
- model_kwargs (CustomConfig) – The extra model keyword arguments. Defaults to {}.
- default_sampling_params (SamplingParams) – The default sampling parameters. Defaults to SamplingParams(temperature=0, max_tokens=256).
- chat_template (str | None) – The name of the chat template. If not provided, the chat template from the model will be used. Some models may not have a chat template. Defaults to None.
Example Configurations¶
As an example, let's see how to configure the Hugging Face Text Generation deployment for the Phi 3 Mini 4K Instruct model.
Phi 3 Mini 4K Instruct
from aana.deployments.hf_text_generation_deployment import HfTextGenerationConfig, HfTextGenerationDeployment
HfTextGenerationDeployment.options(
    num_replicas=1,
    ray_actor_options={"num_gpus": 0.25},
    user_config=HfTextGenerationConfig(
        model_id="microsoft/Phi-3-mini-4k-instruct",
        model_kwargs={
            "trust_remote_code": True,
        },
    ).model_dump(mode="json"),
)
The model ID is the Hugging Face model ID. trust_remote_code=True is required to load the model from the Hugging Face model hub. You can define other model arguments in the model_kwargs dictionary.
Here are other example configurations for the Hugging Face Text Generation deployment. Keep in mind that the list is not exhaustive. You can deploy other text generation models that are supported by the Hugging Face Transformers library.
Phi 3 Mini 4K Instruct with 4-bit quantization
from transformers import BitsAndBytesConfig
from aana.deployments.hf_text_generation_deployment import HfTextGenerationConfig, HfTextGenerationDeployment
HfTextGenerationDeployment.options(
    num_replicas=1,
    ray_actor_options={"num_gpus": 0.25},
    user_config=HfTextGenerationConfig(
        model_id="microsoft/Phi-3-mini-4k-instruct",
        model_kwargs={
            "trust_remote_code": True,
            "quantization_config": BitsAndBytesConfig(
                load_in_8bit=False, load_in_4bit=True
            ),
        },
    ).model_dump(mode="json"),
)
Half-Quadratic Quantization (HQQ) Text Generation Deployment (Deprecated)¶
HqqTextGenerationDeployment uses Half-Quadratic Quantization (HQQ) to quantize and deploy text generation models from the Hugging Face Hub.
Tip
To use the HQQ Text Generation deployment, install the required libraries with pip install hqq transformers or include the extra dependencies using pip install aana[hqq].
Warning
HQQ Text Generation deployment is currently deprecated and might be removed in future versions of the Aana SDK. We recommend using the VLLM deployment for text generation models.
It supports already quantized models as well as quantizing models on the fly. The quantization is blazing fast and adds minimal overhead. Check out the collections of already quantized models from Mobius Labs.
HqqTexGenerationConfig is used to configure the HQQ Text Generation deployment.
aana.deployments.hqq_text_generation_deployment.HqqTexGenerationConfig¶
Attributes:

- model_id (str) – The model ID on Hugging Face.
- quantize_on_fly (bool) – Whether to quantize the model on the fly or use an already pre-quantized model. Defaults to False.
- backend (HqqBackend) – The backend library to use. Defaults to HqqBackend.BITBLAS.
- compile (bool) – Whether to compile the model with torch.compile. Defaults to False.
- dtype (Dtype) – The data type. Defaults to Dtype.AUTO.
- quantization_config (dict) – The quantization configuration.
- model_kwargs (CustomConfig) – The extra model keyword arguments. Defaults to {}.
- default_sampling_params (SamplingParams) – The default sampling parameters. Defaults to SamplingParams(temperature=0, max_tokens=256).
- chat_template (str | None) – The name of the chat template. If not provided, the chat template from the model will be used. Some models may not have a chat template. Defaults to None.
HQQ Backends¶
The HQQ Text Generation framework supports two backends, each optimized for specific scenarios (see the configuration sketch after this list):

- HqqBackend.BITBLAS (Default)
    - Library Installation: Requires installing the BitBLAS library; more details can be found on the BitBLAS GitHub page.
    - Compatibility: Works on a broader range of GPUs, including older models.
    - Precision Support: Supports both 4-bit and 2-bit quantization, allowing for more compact models and efficient inference.
    - Strengths: BitBLAS excels in handling large batch sizes, especially when properly configured. However, HQQ is optimized for decoding with a batch size of 1, leading to slower inference times compared to the TORCHAO_INT4 backend.
    - Limitations: Slower initialization due to the need for per-shape and per-GPU compilation.
- HqqBackend.TORCHAO_INT4
    - Library Installation: No additional installation required.
    - Compatibility: Only works on Ampere and newer GPUs, limiting its usage to more recent hardware.
    - Precision Support: Supports only 4-bit quantization.
    - Strengths: Much faster to initialize compared to BitBLAS, making it a good choice for situations where quick startup times are crucial. Faster inference times compared to the BITBLAS backend.
    - Limitations: It doesn't support 2-bit quantization.
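Switching backends is just a matter of setting the backend field of the config. Below is a minimal, illustrative sketch of a TORCHAO_INT4 configuration; it mirrors the fields used in the examples that follow and is not an officially documented setup, so adjust it to your model and hardware.
from hqq.core.quantize import BaseQuantizeConfig
from aana.deployments.hqq_text_generation_deployment import (
    HqqBackend,
    HqqTexGenerationConfig,
)

# Illustrative only: same config shape as the BitBLAS examples below, but with the
# TORCHAO_INT4 backend (4-bit only, Ampere or newer GPUs, no extra installation).
user_config = HqqTexGenerationConfig(
    model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    backend=HqqBackend.TORCHAO_INT4,
    quantize_on_fly=True,
    quantization_config=BaseQuantizeConfig(nbits=4, group_size=64, axis=1),
).model_dump(mode="json")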
Example Configurations¶
On-the-fly Quantization¶
As an example, let's see how to configure HQQ Text Generation deployment to quantize and deploy the Meta-Llama-3.1-8B-Instruct model.
Meta-Llama-3.1-8B-Instruct
from hqq.core.quantize import BaseQuantizeConfig
from aana.core.models.sampling import SamplingParams
from aana.deployments.hqq_text_generation_deployment import (
    HqqBackend,
    HqqTexGenerationConfig,
    HqqTextGenerationDeployment,
)
HqqTextGenerationDeployment.options(
    num_replicas=1,
    ray_actor_options={"num_gpus": 0.5},
    user_config=HqqTexGenerationConfig(
        model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
        backend=HqqBackend.BITBLAS,
        quantize_on_fly=True,
        quantization_config=BaseQuantizeConfig(nbits=4, group_size=64, axis=1),
        default_sampling_params=SamplingParams(
            temperature=0.0, top_p=1.0, top_k=-1, max_tokens=512
        ),
        model_kwargs={
            "attn_implementation": "sdpa",
        },
    ).model_dump(mode="json"),
)
The model ID is the Hugging Face model ID. We set quantize_on_fly=True to quantize the model on the fly since the model is not pre-quantized. We deploy the model with 4-bit quantization by setting quantization_config in the HqqTexGenerationConfig. We use HqqBackend.BITBLAS as the backend for quantization; this is optional, as BitBLAS is the default backend. You can pass extra arguments to the model in the model_kwargs dictionary.
Pre-quantized Models¶
You can also deploy already quantized models with the HQQ Text Generation deployment. Here is an example of deploying a quantized Meta-Llama-3.1-8B-Instruct model from Mobius Labs.
Quantized Meta-Llama-3.1-8B-Instruct
from hqq.core.quantize import BaseQuantizeConfig
from aana.core.models.sampling import SamplingParams
from aana.deployments.hqq_text_generation_deployment import (
    HqqBackend,
    HqqTexGenerationConfig,
    HqqTextGenerationDeployment,
)
HqqTextGenerationDeployment.options(
    num_replicas=1,
    ray_actor_options={"num_gpus": 0.5},
    user_config=HqqTexGenerationConfig(
        model_id="mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib",
        backend=HqqBackend.BITBLAS,
        quantization_config=BaseQuantizeConfig(
            nbits=4,
            group_size=64,
            quant_scale=False,
            quant_zero=False,
            axis=1,
        ),
        default_sampling_params=SamplingParams(
            temperature=0.0, top_p=1.0, top_k=-1, max_tokens=512
        ),
    ).model_dump(mode="json"),
)
The model ID is the Hugging Face model ID of a pre-quantized model. We use HqqBackend.BITBLAS as the backend for quantization; this is optional, as BitBLAS is the default backend. We set the quantization configuration according to the model page.