Giving Open WebUI a Voice with Speaches

The global speech recognition market is projected to reach $23.11 billion by 2030, growing at 19.1% annually, and the AI voice generator market alone is forecast to hit $20.4 billion by 2030 at a 37.1% CAGR, both according to MarketsandMarkets. These projections do not describe some distant future. They are the sound of an industry moving so fast that the people who ship products are expected to have voice-ready AI assistants on day one.

Open WebUI, my go-to interface for local LLMs, has always supported audio. The microphone icon. The speaker button. But out of the box, those features are stubs or point to costly third-party APIs with your data leaving your infrastructure on every request. That is a data sovereignty risk and a recurring cost that compounds with usage.

There’s a better way. Speaches is an OpenAI API-compatible server for speech AI — Text-to-speech (TTS) and Speech-to-text (STT) models served locally from a single container. Think of it as Ollama, but for voice. Kokoro, the #1 ranked TTS model on the Hugging Face TTS Spaces Arena, runs inside Speaches. Whisper models handle transcription. Everything stays on your hardware. No per-request fees. No vendor lock-in.

In this guide, we deploy Speaches to Docker, pull both TTS and STT models, test the web UI, and wire everything into Open WebUI. Total time: about 15 minutes.

What Speaches Actually Is

Speaches is an independent project, not owned by or affiliated with Open WebUI. It provides a local API that implements OpenAI’s speech endpoints, which means tools and SDKs built for those endpoints can usually be pointed at Speaches just by changing the base URL.

Under the hood, Speaches uses two core technologies:

STT (Speech-to-Text): powered by faster-whisper from SYSTRAN, which delivers real-time transcription and translation
TTS (Text-to-Speech): powered by Kokoro-82M and Piper, with Kokoro ranked #1 on the Hugging Face TTS Spaces Arena

The practical benefit of this architecture is simple: Open WebUI calls its STT engine “OpenAI” (which sounds confusing at first — more on that below), points it at Speaches, and everything just works. You trade API bills for a single Docker container.

Prerequisites

Before we begin, make sure you have:

An Open WebUI instance running in Docker — if you haven’t set this up yet, follow the previous post first
Docker Engine 24.0+ with the Docker Compose plugin
Either a CPU-only host (simpler start, slightly slower inference) or an NVIDIA GPU with the NVIDIA Container Toolkit (faster inference, recommended for production)

If you already have an AI application stack running on your Docker host, you can deploy Speaches alongside everything else. The container is entirely self-contained.

Deploying Speaches to Docker

Here is the docker-compose service definition:

speaches:
    container_name: speaches
    image: ghcr.io/speaches-ai/speaches:0.8.3-cpu
    restart: unless-stopped
    ports:
      - 8000:8000
    volumes:
      - ./cache:/home/ubuntu/.cache/huggingface/hub
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://0.0.0.0:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 5s

What each parameter in this service definition is doing:

image: ghcr.io/speaches-ai/speaches:0.8.3-cpu — The official Speaches image tagged v0.8.3 for CPU inference. Use the :latest-cuda tag if you have an NVIDIA GPU installed. The CPU version is perfectly functional for development and moderate usage
container_name: speaches — Assigns a persistent name for easy management with docker stop speaches and docker logs speaches
ports: 8000:8000 — Maps the container port to the host port. Port 8000 is the default API port. The first number is the host port you access from outside the container; the second is the port the application listens on inside the container
volumes: ./cache:/home/ubuntu/.cache/huggingface/hub — This is the critical mount. Speaches downloads models from Hugging Face to ~/.cache/huggingface/hub inside the container. Without this volume mount, every model download is ephemeral — you lose everything on container restart. Change the host path to match your infrastructure. On my system, I keep AI volumes on a separate storage device to protect my OS drive
restart: unless-stopped — Docker automatically restarts the container on failure or host reboot, unless you explicitly stop it first. This is standard “good” behavior for always-available services
healthcheck: Docker polls the /health endpoint every 30 seconds and gives each check 10 seconds to respond. After 3 consecutive failures, it marks the container unhealthy. Failures during the 5-second start_period do not count toward that limit, which gives the server a short grace period while it initializes. One caveat raised by the community: the health check URL breaks if you change the listening port, so keep it at 8000 or update the curl command accordingly
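To make the interval, retries, and start_period settings concrete, here is a small Python model of how a health status follows from a series of probe results. It is a simplified sketch of Docker’s documented behavior, not its implementation, and the function name simulate_health and its inputs are made up for illustration.

```python
def simulate_health(checks, retries=3, start_period=5.0):
    """Return "starting", "healthy", or "unhealthy" for a list of probes.

    checks: (seconds_since_container_start, probe_succeeded) tuples in time order.
    """
    status = "starting"
    consecutive_failures = 0
    seen_success = False
    for elapsed, ok in checks:
        if ok:
            status = "healthy"
            consecutive_failures = 0
            seen_success = True
            continue
        # Failures inside the start period are ignored until a probe has succeeded.
        if elapsed < start_period and not seen_success:
            continue
        consecutive_failures += 1
        if consecutive_failures >= retries:
            status = "unhealthy"
    return status
```

With interval: 30s, the first probe runs about 30 seconds after the container starts, so with these settings a server that never comes up is reported unhealthy after roughly 90 seconds.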

Start the container:

docker compose up -d speaches

Give it about 30 seconds and check the logs:

docker logs -f speaches

You should see initialization messages followed by the server confirming it is ready to accept connections. There’s no model data in the logs yet because we have not triggered a download — those happen on first request, which is the “dynamic model loading” feature.

Verifying the Server Is Running

Before downloading models, confirm the server itself is healthy:

curl http://localhost:8000/health

A healthy server responds with HTTP 200 and a short confirmation that the API is ready. If you get a connection error, verify your port mapping and that the container is running with docker ps.
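If you script the deployment, a small polling loop saves guessing how long to wait. The sketch below uses only the Python standard library and assumes the /health URL shown above. The helper names http_ok and wait_until_healthy are illustrative and not part of Speaches.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error codes all count as "not ready".
        return False


def wait_until_healthy(url, attempts=10, delay=3.0, probe=http_ok, sleep=time.sleep):
    """Probe url up to `attempts` times; return the attempt number that succeeded, or None."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

Calling wait_until_healthy("http://localhost:8000/health") right after docker compose up blocks until the server answers or the attempts run out. The probe and sleep parameters exist only so the loop can be exercised without a running server.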

The Built-in Web UI

By default, Speaches starts a Gradio-based web UI available at http://localhost:8000. You can reach this immediately — the UI works even before models are downloaded. Speaches fetches them on demand on your first request, which means you can test the interface right away and see models appear in your mounted cache directory as you explore.

Pulling the TTS and STT Models

Speaches uses a dynamic loading pattern: models download on demand when you first request them. This is convenient but not ideal for initial setup, because you have no direct control over timing or which models get pulled. Instead, we explicitly download both the TTS and STT models using Speaches’s CLI tool.

Method 1: Using the Speaches CLI (Recommended)

Download the TTS model — Kokoro, the #1 ranked text-to-speech model:

speaches-cli registry ls --task text-to-speech
speaches-cli model download speaches-ai/Kokoro-82M-v1.0-ONNX

The first command lists all available TTS models in the Speaches registry. The second downloads Kokoro specifically. ONNX format is the runtime format Speaches uses for inference — optimized for both CPU and GPU execution.

Download the STT model — a distilled Whisper model:

speaches-cli registry ls --task automatic-speech-recognition
speaches-cli model download Systran/faster-distil-whisper-small.en

This downloads faster-distil-whisper-small.en from the Systran organization on Hugging Face, a Whisper checkpoint packaged for faster-whisper. The model balances accuracy and speed well for English-only use. For higher accuracy, use Systran/faster-distil-whisper-large-v3 instead. It provides better transcription quality at the cost of roughly 3x more VRAM and inference time.

If you are running Speaches inside Docker, run these commands inside the container:

docker exec -it speaches speaches-cli registry ls --task text-to-speech
docker exec -it speaches speaches-cli model download speaches-ai/Kokoro-82M-v1.0-ONNX
docker exec -it speaches speaches-cli model download Systran/faster-distil-whisper-small.en

What each docker exec command is doing:

-it — Opens an interactive TTY session inside the container, letting you see the download progress in real time
speaches — The container name from our docker-compose configuration
speaches-cli — The CLI tool bundled inside the Speaches image for managing the model registry
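For repeatable setups, the same downloads can be driven from a script. This is a minimal sketch that shells out to the docker exec commands shown above. It drops -it because a script usually has no TTY attached, and the names MODELS, download_command, and download_all are hypothetical helpers.

```python
import subprocess

# Model IDs used in this guide.
MODELS = [
    "speaches-ai/Kokoro-82M-v1.0-ONNX",
    "Systran/faster-distil-whisper-small.en",
]


def download_command(model_id, container="speaches"):
    """Build the docker exec argument list for one model download."""
    return ["docker", "exec", container, "speaches-cli", "model", "download", model_id]


def download_all(models=MODELS, run=subprocess.run):
    """Run each download in order; check=True stops at the first failure."""
    for model_id in models:
        run(download_command(model_id), check=True)
```

Calling download_all() pulls both models in sequence and raises subprocess.CalledProcessError if any command exits with a non-zero status.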

Method 2: Using the API (Alternative)

If you prefer API calls over the CLI, you can trigger model downloads through Speaches’s HTTP endpoints:

curl http://localhost:8000/v1/models/speaches-ai/Kokoro-82M-v1.0-ONNX -X POST
curl http://localhost:8000/v1/models/Systran/faster-distil-whisper-small.en -X POST

POSTing to a model’s /v1/models/ endpoint triggers Speaches to download and load that model. You can also list all downloaded models with a GET to /v1/models, or the full registry with a GET to /v1/registry.
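The same calls are easy to make from Python. The sketch below builds the POST request for a model download and pulls model IDs out of a listing response. It assumes the listing follows the OpenAI-style shape {"data": [{"id": ...}]}, and the function names are illustrative.

```python
import json
import urllib.request


def download_request(base_url, model_id):
    """Build a POST to <base_url>/v1/models/<model_id>, mirroring the curl call."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_id}"
    return urllib.request.Request(url, method="POST")


def model_ids(listing_json):
    """Extract model IDs from a GET /v1/models response body (assumed OpenAI-style)."""
    return [item["id"] for item in json.loads(listing_json).get("data", [])]
```

Passing the result of download_request("http://localhost:8000", "speaches-ai/Kokoro-82M-v1.0-ONNX") to urllib.request.urlopen sends the request. Expect the call to take a while on first download.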

Where Models Are Stored

Speaches saves models to ~/.cache/huggingface/hub inside the container. On your host system, this maps to whatever path you specified in the volume mount, which is ./cache in our configuration. You can verify the downloads completed with:

du -sh ./cache/models--*

You should see one directory per model, named in the Hugging Face cache style (for example, models--speaches-ai--Kokoro-82M-v1.0-ONNX). Their sizes will be several hundred megabytes combined.
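For a quick inventory of the cache from a script, the following sketch walks the mounted directory and reports the size of each model. It assumes the standard Hugging Face hub layout, where each model lives in a directory named models--<org>--<name>. The function name cached_models is made up for this example.

```python
import os
from pathlib import Path


def cached_models(hub_dir):
    """Map each cached repo ID to its size in bytes."""
    sizes = {}
    for entry in sorted(Path(hub_dir).iterdir()):
        if not entry.is_dir() or not entry.name.startswith("models--"):
            continue
        # "models--org--name" becomes "org/name".
        org, _, name = entry.name[len("models--"):].partition("--")
        repo_id = f"{org}/{name}" if name else org
        total = 0
        for root, _dirs, files in os.walk(entry):
            for filename in files:
                path = os.path.join(root, filename)
                # Snapshot entries are symlinks into blobs/; skip them to avoid double counting.
                if not os.path.islink(path):
                    total += os.path.getsize(path)
        sizes[repo_id] = total
    return sizes
```

Running cached_models("./cache") on the host returns a dictionary keyed by model ID, which is handy for confirming that both downloads landed in the mounted volume.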

Testing in the Web UI

Speaches ships with a built-in Gradio interface at http://localhost:8000. Let’s test both the TTS and STT capabilities.

Testing Text-to-Speech

Open http://localhost:8000 in your browser. Navigate to the TTS section. You should see:

A text input field where you type the text you want spoken
A model selector showing Kokoro as the available TTS model
A voice selector. Open the voice dropdown and explore the available Kokoro voices. af_heart is the default female voice and a good starting point. af_bella and af_nova are other options (available voices depend on your Kokoro version)
A “Generate” button that produces audio and plays it in the browser

Type something like “The voice of your local AI is finally here” and click Generate. You will hear Kokoro speak it back in the selected voice. This is the first time your local LLM interface can actually speak aloud. Pay attention to the quality — Kokoro’s voice quality is widely considered the best in the open-source TTS landscape.

Testing Speech-to-Text

Switch to the STT section of the same interface. You will see:

A “Record” button that starts capturing audio from your microphone
An “Upload” option for testing with pre-recorded audio files
The transcription model selector showing faster-distil-whisper-small.en

Click Record, speak a sentence into your microphone, then stop recording. Speaches returns the transcribed text. For the best results, use a decent-quality microphone in a relatively quiet environment. Whisper models are robust, but background noise affects all speech-to-text engines.

You can also test both capabilities programmatically:

# Test TTS via API
curl http://localhost:8000/v1/audio/speech \
  -s -H "Content-Type: application/json" \
  -o audio.mp3 \
  -d '{"input": "Testing the voice system.", "model": "speaches-ai/Kokoro-82M-v1.0-ONNX", "voice": "af_heart"}'

# Test STT via API
curl http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=Systran/faster-distil-whisper-small.en"

The first command generates speech and saves it to audio.mp3. The second command feeds that audio back into Speaches for transcription. If the transcription matches your input text, both systems are working perfectly.
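If you would rather test from Python than curl, the two requests can be built with the standard library alone. This is a sketch under the assumption that the endpoints accept the same fields as the curl commands above. base_url is expected to include the /v1 suffix, and the function names are illustrative.

```python
import json
import urllib.request
import uuid


def speech_request(base_url, text, model, voice):
    """Build a POST to <base_url>/audio/speech with a JSON body."""
    body = json.dumps({"input": text, "model": model, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def transcription_request(base_url, audio_bytes, filename, model):
    """Build a multipart/form-data POST to <base_url>/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/audio/transcriptions",
        data=head + audio_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

To send either request, pass it to urllib.request.urlopen and read the response body: audio bytes for the speech call, and the transcription result for the other. Building the requests separately from sending them keeps the payloads easy to inspect when something does not match.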

Connecting Speaches to Open WebUI

Now for the payoff. Open WebUI natively supports configuring STT and TTS engines. We point it at Speaches, and the microphone and speaker buttons in the chat interface become fully functional.

Method 1: Via the Admin UI (Simplest)

Open Open WebUI and go to Admin Settings
Click the Audio tab
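Configure the following settings:
- Speech-to-Text Engine: select OpenAI
- API Base URL: enter http://speaches:8000/v1. If Open WebUI runs outside Docker on the same machine, use http://localhost:8000/v1. For a remote Speaches server, use that host’s address instead
- API Key: enter any non-empty value like speaches-local-key. Speaches doesn’t require authentication; however, Open WebUI requires a non-empty API key value.
- Model: enter Systran/faster-distil-whisper-small.en, the STT model we downloaded earlier. This is the model Speaches will use for your microphone input. To use Systran/faster-distil-whisper-large-v3 here, download it first
- Text-to-Speech Engine: select OpenAI and enter the same API Base URL and API Key, then set the TTS model to speaches-ai/Kokoro-82M-v1.0-ONNX and the voice to af_heart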
Click Save

Method 2: Via Environment Variables (More Reliable)

If the UI settings don’t persist or you are setting this up headlessly, use environment variables in your docker-compose.yml:

  web:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      AUDIO_STT_ENGINE: "openai"
      AUDIO_STT_OPENAI_API_BASE_URL: "http://speaches:8000/v1"
      AUDIO_STT_OPENAI_API_KEY: "does-not-matter-what-you-put-but-should-not-be-empty"
      AUDIO_STT_MODEL: "Systran/faster-distil-whisper-small.en"
      AUDIO_TTS_ENGINE: "openai"
      AUDIO_TTS_OPENAI_API_BASE_URL: "http://speaches:8000/v1"
      AUDIO_TTS_OPENAI_API_KEY: "does-not-matter-what-you-put-but-should-not-be-empty"
      AUDIO_TTS_MODEL: "speaches-ai/Kokoro-82M-v1.0-ONNX"
      AUDIO_TTS_VOICE: "af_heart"

What each environment variable does:

AUDIO_STT_ENGINE: "openai" — Tells Open WebUI to use the OpenAI-compatible STT engine. No, this does not mean Open WebUI sends audio to OpenAI servers. It uses the OpenAI API format that Speaches implements locally. This is the confusing part — Open WebUI’s “OpenAI” STT engine is actually any OpenAI-compatible endpoint. You could point it at Whisper.cpp, WhisperAPI, or any other compatible server
AUDIO_STT_OPENAI_API_BASE_URL — The base URL of your Speaches instance. Use http://speaches:8000/v1 when both containers are on the same Docker network, or http://localhost:8000/v1 if accessing from outside Docker
AUDIO_STT_OPENAI_API_KEY — Required field in Open WebUI’s config. Since Speaches has no authentication, this value is never verified. Any non-empty string satisfies Open WebUI’s validation
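AUDIO_STT_MODEL: The STT model name inside Speaches. It must match a model you have downloaded. Systran/faster-distil-whisper-small.en is the one we pulled earlier and suits CPU-only hardware. Systran/faster-distil-whisper-large-v3 gives better accuracy if you download it first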
AUDIO_TTS_ENGINE: "openai" — Same pattern as STT: uses the OpenAI-compatible TTS engine within Open WebUI, powered locally by Speaches
AUDIO_TTS_OPENAI_API_BASE_URL: Base URL for the TTS endpoint. It uses the same format as the STT URL and points to the same Speaches instance
AUDIO_TTS_MODEL — The TTS model inside Speaches. speaches-ai/Kokoro-82M-v1.0-ONNX is the Kokoro model we downloaded earlier
AUDIO_TTS_VOICE: The Kokoro voice ID to use. af_heart is the default. Available Kokoro voices include af_bella, af_nova, af_sarah, and several male voices (am_echo, am_michael)
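A typo in one of these variables is easy to miss, so a small sanity check can help before restarting anything. The sketch below checks only the conventions used in this guide (every variable set, base URLs ending in /v1). It does not reflect Open WebUI’s own validation logic, and the function name audio_config_problems is hypothetical.

```python
REQUIRED = (
    "AUDIO_STT_ENGINE",
    "AUDIO_STT_OPENAI_API_BASE_URL",
    "AUDIO_STT_OPENAI_API_KEY",
    "AUDIO_STT_MODEL",
    "AUDIO_TTS_ENGINE",
    "AUDIO_TTS_OPENAI_API_BASE_URL",
    "AUDIO_TTS_OPENAI_API_KEY",
    "AUDIO_TTS_MODEL",
    "AUDIO_TTS_VOICE",
)


def audio_config_problems(env):
    """Return a list of human-readable problems found in an environment mapping."""
    problems = [
        f"{name} is missing or empty"
        for name in REQUIRED
        if not str(env.get(name, "")).strip()
    ]
    for name in ("AUDIO_STT_OPENAI_API_BASE_URL", "AUDIO_TTS_OPENAI_API_BASE_URL"):
        url = str(env.get(name, "")).strip()
        # This guide always uses a base URL that ends in /v1.
        if url and not url.rstrip("/").endswith("/v1"):
            problems.append(f"{name} should end with /v1")
    return problems
```

Any mapping works as input, including os.environ when the check runs inside the Open WebUI container. An empty list means the configuration matches the pattern used here.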

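After adding these variables, recreate your Open WebUI container so it picks them up. Running up again for the service is enough, because Compose recreates a container whose configuration changed, and it leaves Speaches running. If you saved audio settings in the Admin UI earlier, those stored values may take precedence over the environment variables.

docker compose up -d web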

Testing the Integration

Open the Open WebUI chat interface and verify these two things work:

Microphone (STT): Click the microphone icon in the chat input. Speak a prompt. You should see your spoken words appear as text in the input field. If this works, your STT pipeline is fully operational
Speaker (TTS): After any model response, click the speaker icon beneath the generated text. You will hear Kokoro read the answer back to you

The Docker Network Note

If STT or TTS fails from Open WebUI even though Speaches answers on the host, the most likely culprit is Docker networking. When both services are in the same docker-compose.yml, use http://speaches:8000/v1 (the service name as a hostname). This works because Docker Compose provides internal DNS that resolves service and container names to their internal IPs. If you use localhost in this context, Open WebUI looks for Speaches inside its own container, where no Speaches server is running.
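The localhost mistake is simple enough to detect mechanically. Here is a minimal sketch that flags a base URL pointing at a loopback address. The helper name points_at_own_container and the host list are assumptions for illustration.

```python
from urllib.parse import urlparse

# Hostnames that resolve to the calling container itself.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def points_at_own_container(base_url):
    """Return True if base_url targets a loopback address."""
    return urlparse(base_url).hostname in LOOPBACK_HOSTS
```

For a URL configured inside the Open WebUI container, a True result means the request never leaves that container, so the URL should use the speaches hostname instead.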

Managing Your Speaches Instance

Common Docker commands for day-to-day operation:

# Start the server
docker compose up -d speaches

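# Stop the server (the container is kept; use "down" to remove it)
docker compose stop speaches

# Restart the running container (does not apply docker-compose.yml changes;
# use "docker compose up -d speaches" for those)
docker compose restart speaches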

# Update the image (a pinned tag such as 0.8.3-cpu only changes if you edit the tag first)
docker compose pull speaches
docker compose up -d --no-deps speaches

# Check the health of the container
docker inspect --format="{{.State.Health.Status}}" speaches

# List downloaded models
docker exec speaches speaches-cli model ls --task text-to-speech
docker exec speaches speaches-cli model ls --task automatic-speech-recognition

Switching Between CPU and GPU Versions

If you start with the CPU version and later add a GPU, you need to switch:

# Stop and remove the old container
docker compose down speaches

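# Edit your docker-compose.yml to change the image tag
# From: image: ghcr.io/speaches-ai/speaches:0.8.3-cpu
# To:   image: ghcr.io/speaches-ai/speaches:latest-cuda
# Also give the service access to the GPU (for example, a
# deploy.resources.reservations.devices entry for the nvidia driver)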

# Restart
docker compose up -d speaches

Models remain cached in your volume mount — no need to re-download. The CUDA-enabled image runs the same CLI and uses the same model format, so your existing configuration carries over perfectly.

Performance Expectations

Realistic performance numbers so you know what to expect:

GPU mode (CUDA): TTS generates audio at real-time or faster. STT transcribes in roughly 200-500 milliseconds for a typical 5-second prompt
CPU-only mode: TTS is functional but noticeably slower, with generation taking roughly 3-10x longer. STT is workable with a small model, though small models lose accuracy with strong accents or background noise. Fine for a personal setup
GPU memory usage: Kokoro TTS uses approximately 160 MB of VRAM. Whisper STT uses approximately 2 GB for the small model. Your total GPU footprint for both is under 2.5 GB — well under the 8 GB threshold that matters for most consumer GPUs
Cold start time: The first request to a model waits for download and loading. Subsequent requests are much faster while the model stays loaded. Note that the start_period: 5s in the health check only covers server startup, not model loading.
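To compare these numbers against your own hardware, it helps to express them as a real-time factor: processing time divided by audio duration, where anything below 1.0 is faster than real time. The helpers below are an illustrative sketch and not part of Speaches.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def timed(fn, *args, clock=time.perf_counter, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = clock()
    result = fn(*args, **kwargs)
    return result, clock() - start
```

A 5-second prompt transcribed in 500 milliseconds gives a factor of 0.1. Wrapping your own request function in timed gives you the elapsed time to plug into that calculation.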

Next Steps for Your Voice-Powered Open WebUI

Speaches gives you both a voice and ears for your Open WebUI instance. Your local AI can now listen to you and speak. The integration is built entirely on open-source components running in your infrastructure — no cloud APIs, no per-request fees, no data leaving your hardware.

Building on this foundation, there are a few directions worth exploring in future posts:

Voice Chat mode — Speaches supports WebSocket-based real-time voice chat. This enables continuous conversation without the chat interface, similar to ChatGPT’s Voice Mode
Switching voices — Kokoro supports multiple voice IDs. You could rotate voices for different personas or build a custom voice selection interface
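Multi-language support: Whisper’s multilingual models support transcription in 99+ languages, and Kokoro handles multiple languages. The .en model used in this guide is English-only, so pair a multilingual Whisper model with Kokoro for a fully multilingual voice assistant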

The global voice AI agents market is growing at 37.2% annually, according to Technavio. The technology is proven. The infrastructure is accessible. What is left is making the decision to run your own.

Go set it up. Install Speaches. Pull the models. Connect your Open WebUI. Then try speaking to your AI assistant and hearing it talk back. The gap between “this works on paper” and “this actually works” is exactly one deployment. Start with the CPU-only container. You can always move to GPU later — and the models are already waiting in your cache.

Let me know how it goes. Did the integration work on the first try? Which Kokoro voice did you choose? Drop a comment — I read them all and reply to questions.