`runpod/worker-vllm:stable-cuda11.8.0` or `runpod/worker-vllm:stable-cuda12.1.0`.
vi. Specify enough storage for your model.
vii. Add the following environment variables:
a. `MODEL_NAME`: `google/gemma-7b-it`.
b. `HF_TOKEN`: your Hugging Face API token, required for private or gated models such as Gemma.
The examples below use the OpenAI Python library; however, you can use any programming language and any library that supports HTTP requests.
Here’s how to get started:
1. Use the `OpenAI` class to interact with the model. The `OpenAI` class takes the following parameters:
   - `base_url`: The base URL of the Serverless Endpoint.
   - `api_key`: Your Runpod API key.
2. Set the environment variables `RUNPOD_BASE_URL` and `RUNPOD_API_KEY` to your base URL and Runpod API key, respectively. Your `RUNPOD_BASE_URL` will be in the form `https://api.runpod.ai/v2/${RUNPOD_ENDPOINT_ID}/openai/v1`, where `${RUNPOD_ENDPOINT_ID}` is the ID of your Serverless Endpoint.
3. Use the `client` to interact with the model. For example, you can use the `chat.completions.create` method to generate a response from the model, as shown in the sketch below.
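Putting steps 1 and 2 together, here is a minimal sketch of creating the client, assuming the `openai` package is installed (`pip install openai`) and both environment variables are set; the endpoint ID and key values are placeholders for your own:

```python
import os

from openai import OpenAI

# Expected environment, set before running:
#   RUNPOD_BASE_URL=https://api.runpod.ai/v2/${RUNPOD_ENDPOINT_ID}/openai/v1
#   RUNPOD_API_KEY=<your Runpod API key>
client = OpenAI(
    base_url=os.environ["RUNPOD_BASE_URL"],
    api_key=os.environ["RUNPOD_API_KEY"],
)
```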
Provide the following parameters to the `chat.completions.create` method:
- `model`: The model name (for this deployment, `google/gemma-7b-it`).
- `messages`: A list of messages to send to the model.
- `max_tokens`: The maximum number of tokens to generate.
- `temperature`: The randomness of the generated text.
- `top_p`: The cumulative probability cutoff for nucleus sampling.
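A minimal sketch of a request, continuing from the `client` created above; the prompt and sampling values are arbitrary placeholders you should tune for your use case:

```python
# Send a chat request to the deployed Gemma model.
response = client.chat.completions.create(
    model="google/gemma-7b-it",
    messages=[
        {"role": "user", "content": "Explain what a Serverless Endpoint is in one sentence."}
    ],
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
)

# The generated text is in the first choice's message.
print(response.choices[0].message.content)
```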