vLLM workers are specialized containers designed to efficiently deploy and serve large language models (LLMs) on Runpod’s Serverless infrastructure. By leveraging Runpod’s vLLM workers, you can quickly deploy state-of-the-art language models with optimized performance, flexible scaling, and cost-effective operation.

For detailed information on model compatibility and configuration options, check out the vLLM worker GitHub repository.
vLLM workers offer several advantages that make them ideal for LLM deployment:
Pre-built optimization: The workers come with the vLLM inference engine pre-configured, which includes PagedAttention technology for optimized memory usage and faster inference.
OpenAI API compatibility: They provide a drop-in replacement for OpenAI’s API, allowing you to use existing OpenAI client code by simply changing the endpoint URL and API key (see the example after this list).
Hugging Face integration: vLLM workers support most models available on Hugging Face, including popular options like Llama 2, Mistral, Gemma, and many others.
Configurable environments: Extensive customization options through environment variables allow you to adjust model parameters, performance settings, and other behaviors.
Auto-scaling architecture: Serverless automatically scales your endpoint from zero to many workers based on demand, billing on a per-second basis.
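As a concrete illustration of the OpenAI API compatibility point above, the sketch below points the standard OpenAI Python client at a deployed vLLM worker endpoint. The endpoint ID, API key, and model name are placeholders, and the base URL format is an assumption; confirm the exact URL from your endpoint's details page.

```python
# Minimal sketch: reuse the OpenAI Python client against a vLLM worker endpoint.
# ENDPOINT_ID, the API key, and the model name below are placeholders.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_RUNPOD_API_KEY",  # your Runpod API key, not an OpenAI key
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",  # assumed URL format
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # the model your worker serves
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
)
print(response.choices[0].message.content)
```

Apart from the API key and base URL, the request and response objects follow the same shapes as OpenAI's, so existing client code typically needs no other changes.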
This is the simplest approach. Use Runpod’s UI to deploy a model directly from Hugging Face with minimal configuration. For step-by-step instructions, see Deploy a vLLM worker.
Quick-deployed workers will download models during initialization, which can take some time depending on the model selected. If you plan to run a vLLM endpoint in production, it’s best to package your model into a Docker image ahead of time (using the Docker image method below), as this can significantly reduce cold start times.
Deploy a packaged vLLM worker image from GitHub or Docker Hub, configuring your endpoint using environment variables. Follow the instructions in the vLLM worker README to build a model into your worker image. You can also add new functionality to your vLLM worker deployment by customizing its handler function.
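If you do customize the handler, the sketch below shows the general shape of a Runpod serverless handler using the runpod Python SDK. The environment variable and the prompt-length check are illustrative assumptions rather than part of the stock vLLM worker, which delegates inference to the vLLM engine.

```python
# Minimal sketch of a customized serverless handler, assuming the standard
# runpod SDK pattern. MODEL_NAME and the length check are illustrative only.
import os
import runpod

MODEL_NAME = os.environ.get("MODEL_NAME", "")  # example of reading endpoint configuration

def handler(job):
    """Receive a job, apply custom pre-processing, and return a response."""
    prompt = job["input"].get("prompt", "")
    # Hypothetical custom step: reject overly long prompts before inference.
    if len(prompt) > 8000:
        return {"error": "Prompt too long."}
    # ... run inference here (the stock worker hands this off to the vLLM engine) ...
    return {"model": MODEL_NAME, "output": f"Echo: {prompt}"}

runpod.serverless.start({"handler": handler})
```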