The vLLM worker supports many popular models, including:

- `meta-llama/Llama-3.2-3B-Instruct`
- `mistralai/Ministral-8B-Instruct-2410`
- `Qwen/Qwen3-8B`
- `openchat/openchat-3.5-0106`
- `google/gemma-3-1b-it`
- `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`
- `microsoft/Phi-4-mini-instruct`

This tutorial uses `openchat/openchat-3.5-0106`, but you can substitute it with any compatible model.
When deploying your endpoint, use the following settings:

- **Model**: `openchat/openchat-3.5-0106`
- **Max Model Length**: `8192` (or an appropriate context length for your model)
- **Active Workers**: `0` for cost savings, or `1` for faster response times
- **Max Workers**: `2` (or higher for more concurrent capacity)
- **GPUs / Worker**: `1` (increase for larger models)
You can fine-tune the worker's behavior with environment variables:

- `MAX_MODEL_LEN`: Maximum context length (e.g., `16384`)
- `DTYPE`: Data type for model weights (`float16`, `bfloat16`, or `float32`)
- `GPU_MEMORY_UTILIZATION`: Controls VRAM usage (e.g., `0.95` for 95%)
- `CUSTOM_CHAT_TEMPLATE`: For models that need a custom chat template
- `OPENAI_SERVED_MODEL_NAME_OVERRIDE`: Change the model name to use in OpenAI requests
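If you prefer to script the deployment rather than use the console, the `runpod` Python SDK can create a worker template and endpoint with these settings. The following is a minimal sketch, not the tutorial's canonical method; the image tag, the `MODEL_NAME` variable, and the exact SDK parameters are assumptions you should verify against the SDK documentation:

```python
import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"

# Template for the vLLM worker image; env mirrors the variables listed above.
template = runpod.create_template(
    name="vllm-openchat",
    image_name="runpod/worker-v1-vllm:stable-cuda12.1.0",  # assumed image tag
    is_serverless=True,
    env={
        "MODEL_NAME": "openchat/openchat-3.5-0106",  # assumed variable for the model ID
        "MAX_MODEL_LEN": "16384",
        "DTYPE": "bfloat16",
        "GPU_MEMORY_UTILIZATION": "0.95",
    },
)

# Endpoint with 0 active (minimum) workers and 2 max workers, as suggested above.
endpoint = runpod.create_endpoint(
    name="openchat-endpoint",
    template_id=template["id"],
    workers_min=0,
    workers_max=2,
)
print(endpoint["id"])
```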
Use the `max_tokens` parameter to increase the maximum number of tokens generated per response. To learn more, see Send vLLM requests.
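For example, you can send a chat completion request through the OpenAI-compatible API that the vLLM worker exposes. This is a sketch: substitute your own endpoint ID and API key, and make sure the `model` value matches your deployment:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",
)

response = client.chat.completions.create(
    model="openchat/openchat-3.5-0106",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in three sentences."}],
    max_tokens=512,  # raise this value if responses are being cut off
)
print(response.choices[0].message.content)
```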
Keep in mind that the combined length of your prompt and the generated output cannot exceed the endpoint's `MAX_MODEL_LEN`.
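As a rough illustration of that token budget (the numbers here are hypothetical):

```python
MAX_MODEL_LEN = 8192        # context window configured on the endpoint
prompt_tokens = 3000        # hypothetical length of the tokenized prompt

# The largest max_tokens value that still fits in the context window:
max_new_tokens = MAX_MODEL_LEN - prompt_tokens  # 5192
```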