What is Ollama
Ollama is an open-source tool that allows users to run large language models (LLMs) locally on their own hardware without relying on external cloud APIs. It provides a simple command-line interface for pulling, running, and managing open-source models such as Llama, Mistral, and others, making it easy to deploy AI workloads directly on your machine or server. Ollama is particularly useful for developers and organizations that want to keep their data and model inference private, ensuring compliance and reducing latency by processing everything locally rather than over the internet. Running it in a lightweight Docker container makes it even easier to integrate into existing infrastructure, whether for development testing or production deployment.
Why host AI models locally?
Hosting AI models locally provides enhanced data security and regulatory compliance by keeping information in-house instead of relying on external APIs. Additionally, self-hosting allows you to reduce long-term costs and eliminate vendor dependencies while maintaining full control over your data privacy. See my blog post for a more in-depth analysis.
Installing Docker
The first step is to install Docker on your host computer. You can follow my instructions, or go directly to the source.
Setting up NVIDIA container-toolkit
If you do not have an NVIDIA GPU and plan to just run on CPU, you can skip this step.
The official instructions are here, but I will walk you through them below.
Configure the repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
Install the NVIDIA Container Toolkit packages
sudo apt-get install -y nvidia-container-toolkit
Set up Docker to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
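After the configure step, Docker's daemon configuration (typically /etc/docker/daemon.json) should contain an nvidia runtime entry roughly like the following. This is a sketch of what nvidia-ctk writes; the exact contents may vary on your system:

```json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
```

If the entry is missing after the restart, re-run the configure command and check its output for errors.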
Starting Ollama container from the command line
Running with GPU support
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Running with CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
What this command is doing
- docker run – the standard command to start a container
- -d – runs the container in the background (detached mode)
- --gpus=all – tells Docker to expose all of the available GPUs to the container
- -v ollama:/root/.ollama – mounts a named volume called ollama at /root/.ollama inside the container. This keeps the downloaded models on your computer instead of inside the container, so they survive container restarts
- -p 11434:11434 – maps port 11434 in the container to port 11434 on the host. If that port is unavailable, you can map it to any other port on the host machine by changing the first "11434"
- --name ollama – names the container ollama so that you can more easily send it commands later
- ollama/ollama – the image that we are running
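If you prefer Docker Compose, the same GPU run command can be expressed as a compose file. This is a sketch; the service name and file layout are my own choices, not something Ollama prescribes:

```yaml
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama:
```

Start it with docker compose up -d, and drop the deploy section if you are running CPU-only.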
Pulling and loading a model
Now that we have the server running, we need to pick a model and tell Ollama to run it. You can find models in the Ollama library. Below is an example using a Qwen3.5 model, but you can use your preferred model.
docker exec -it ollama ollama run qwen3.5:9b
Running this command will download the model and load it into memory, then drop you into an interactive chat session (type /bye to exit). To make sure that it worked, you can hit the /api/tags endpoint of the API.
curl http://localhost:11434/api/tags | jq .
The output should look something like this
{
  "models": [
    {
      "name": "qwen3.5:9b",
      "model": "qwen3.5:9b",
      "modified_at": "2026-04-14T16:45:37.6683314Z",
      "size": 6594474711,
      "digest": "6488c96fa5faab64bb65cbd30d4289e20e6130ef535a93ef9a49f42eda893ea7",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "qwen35",
        "families": [
          "qwen35"
        ],
        "parameter_size": "9.7B",
        "quantization_level": "Q4_K_M"
      }
    }
  ]
}
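If you just want the model names rather than the full JSON, jq can pull them out. The example below runs the filter against a minimal inline sample so it works without the server; in practice you would pipe the curl output from above through the same filter:

```shell
# Extract every model name from an /api/tags-style response.
# Against the live server you would instead run:
#   curl -s http://localhost:11434/api/tags | jq -r '.models[].name'
echo '{"models":[{"name":"qwen3.5:9b"}]}' | jq -r '.models[].name'
# → qwen3.5:9b
```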
Using the LLM
You can now interact with the model via the Ollama API.
The API is compatible with both the OpenAI and Anthropic formats, so you should be able to plug it into just about any locally running AI UI.
The API documentation is available here if you want to play around with it.
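As a quick smoke test of the OpenAI-compatible endpoint, here is a sketch of a chat completion request. The model name assumes the Qwen model pulled above; swap in whatever you loaded:

```shell
# Chat completion request in the OpenAI format. Ollama serves this
# at /v1/chat/completions, and no real API key is required.
PAYLOAD='{"model": "qwen3.5:9b", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'

# Sanity-check that the JSON is well formed before sending it.
echo "$PAYLOAD" | jq -e '.model, .messages[0].role' >/dev/null && echo "payload ok"

# Requires the Ollama container from the previous steps to be running:
# curl http://localhost:11434/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```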
I will be writing a future blog post about integrating with Open WebUI. Once it’s published, I’ll add the link here.
