What is Ollama
Ollama is an open-source tool that allows users to run large language models (LLMs) locally on their own hardware without relying on external cloud APIs. It provides a simple command-line interface for pulling, running, and managing open-source models such as Llama, Mistral, and others, making it easy to deploy AI workloads directly on your machine or server. Ollama is particularly useful for developers and organizations that want to keep their data and model inference private, ensuring compliance and reducing latency by processing everything locally rather than over the internet. Running it in a lightweight Docker container makes it even easier to integrate into existing infrastructure, whether for development testing or production deployment.
Why host AI models locally?
Hosting AI models locally provides enhanced data security and regulatory compliance by keeping information in-house instead of relying on external APIs. Additionally, self-hosting allows you to reduce long-term costs and eliminate vendor dependencies while maintaining full control over your data privacy. See my blog post for a more in-depth analysis.
Installing Docker
The first step is to install Docker on your host computer. You can follow my instructions, or go directly to the source.
Setting up NVIDIA container-toolkit
If you do not have an NVIDIA GPU and plan to just run on CPU, you can skip this step.
The official instructions are here, but I will walk you through them below.
Configure the repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
Install the NVIDIA Container Toolkit packages
sudo apt-get install -y nvidia-container-toolkit
Set up Docker to use the NVIDIA runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
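After the configure step, Docker's daemon configuration (typically /etc/docker/daemon.json) should contain an nvidia runtime entry roughly like the following. This is a sketch of what nvidia-ctk writes; the exact contents may vary on your system:

```json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
```

If the entry is missing after the restart, re-run the configure command and check its output for errors.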
Starting Ollama container from the command line
Running with GPU support
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Running with CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
What this command is doing
- docker run – the standard command to start a container
- -d – runs the container in the background (detached mode)
- --gpus=all – tells Docker to expose all of the available GPUs to the container
- -v ollama:/root/.ollama – mounts a named volume called ollama at /root/.ollama inside the container. This keeps the downloaded models on your computer instead of inside the container, so they survive container restarts
- -p 11434:11434 – maps port 11434 in the container to port 11434 on the host. If that port is unavailable, you can map it to any other port on the host machine by changing the first "11434"
- --name ollama – names the container ollama so that you can more easily send it commands later
- ollama/ollama – the image that we are running
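If you prefer Docker Compose, the same GPU run command can be expressed as a compose file. This is a sketch; the service name and file layout are my own choices, not something Ollama prescribes:

```yaml
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama:
```

Start it with docker compose up -d, and drop the deploy section if you are running CPU-only.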
Pulling and loading a model
Now that we have the server running, we need to pick a model and tell Ollama to run it. You can find models in the Ollama library. Below is an example using a Qwen3.5 model, but you can use your preferred model.
docker exec -it ollama ollama run qwen3.5:9b
Running this command will download the model and load it into memory, then drop you into an interactive chat session (type /bye to exit). To make sure that it worked, you can hit the /api/tags endpoint of the API.
curl http://localhost:11434/api/tags | jq .
The output should look something like this
{
  "models": [
    {
      "name": "qwen3.5:9b",
      "model": "qwen3.5:9b",
      "modified_at": "2026-04-14T16:45:37.6683314Z",
      "size": 6594474711,
      "digest": "6488c96fa5faab64bb65cbd30d4289e20e6130ef535a93ef9a49f42eda893ea7",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "qwen35",
        "families": [
          "qwen35"
        ],
        "parameter_size": "9.7B",
        "quantization_level": "Q4_K_M"
      }
    }
  ]
}
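If you just want the model names rather than the full JSON, jq can pull them out. The example below runs the filter against a minimal inline sample so it works without the server; in practice you would pipe the curl output from above through the same filter:

```shell
# Extract every model name from an /api/tags-style response.
# Against the live server you would instead run:
#   curl -s http://localhost:11434/api/tags | jq -r '.models[].name'
echo '{"models":[{"name":"qwen3.5:9b"}]}' | jq -r '.models[].name'
# → qwen3.5:9b
```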
Using the LLM
You can now interact with the model via the Ollama API.
The API is compatible with both the OpenAI and Anthropic formats, so you should be able to plug it into just about any locally running AI UI.
The API documentation is available here if you want to play around with it.
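As a quick smoke test of the OpenAI-compatible endpoint, here is a sketch of a chat completion request. The model name assumes the Qwen model pulled above; swap in whatever you loaded:

```shell
# Chat completion request in the OpenAI format. Ollama serves this
# at /v1/chat/completions, and no real API key is required.
PAYLOAD='{"model": "qwen3.5:9b", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'

# Sanity-check that the JSON is well formed before sending it.
echo "$PAYLOAD" | jq -e '.model, .messages[0].role' >/dev/null && echo "payload ok"

# Requires the Ollama container from the previous steps to be running:
# curl http://localhost:11434/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```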
I will be writing a future blog post about integrating with Open WebUI. Once it’s published, I’ll add the link here.
