Select a GPU for your Pod, such as an A40.
Add port `11434` to the list of exposed ports. This port is used by Ollama for HTTP API requests.
Set the environment variable `OLLAMA_HOST` to `0.0.0.0` so the server listens on all network interfaces. The `ollama serve` part starts the Ollama server, making it ready to serve AI models.
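For reference, here is a minimal sketch of how these two pieces fit together inside the Pod, assuming Ollama is already installed in the container:

```sh
# Minimal sketch, assuming Ollama is already installed in the container.
# OLLAMA_HOST=0.0.0.0 makes the server listen on all interfaces, so the
# exposed port 11434 is reachable from outside the Pod.
export OLLAMA_HOST=0.0.0.0
ollama serve
```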
Now that your Ollama server is running on your Pod, add a model.
Run the `ollama run [model name]` command, replacing `[model name]` with the name of the AI model you wish to deploy. For a complete list of models, see the Ollama Library.
This command pulls the model and runs it, making it accessible for inference. You can begin interacting with the model directly from your web terminal.
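For example, to deploy a Llama model, you could run the following (the model name here is only illustrative; substitute any model from the Ollama Library):

```sh
# Pull the model and start an interactive session with it.
# "llama3" is an illustrative model name; substitute the model you want.
ollama run llama3
```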
Optionally, you can set up an HTTP API request to interact with Ollama. This is covered in the next step.
Replace `[your-pod-id]` with your actual Pod ID.
Because port `11434` is exposed, you can make requests to your Pod using the `curl` command.
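As a sketch, a generation request might look like the following. The proxy hostname format shown here is an assumption about how the exposed port is reached; replace `[your-pod-id]` with your actual Pod ID and adjust the URL and model name to match your setup.

```sh
# Hedged example: send a prompt to Ollama's /api/generate endpoint.
# The hostname format ([your-pod-id]-11434.proxy.runpod.net) is an assumption;
# use your Pod's actual connection URL for port 11434.
curl https://[your-pod-id]-11434.proxy.runpod.net/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?"
}'
```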
For more information on constructing HTTP requests and other operations you can perform with the Ollama API, consult the Ollama API documentation.