The global speech recognition market is projected to reach $23.11 billion by 2030, growing at 19.1% annually, according to MarketsandMarkets. The AI voice generator market alone is forecast to hit $20.4 billion by 2030 at a 37.1% CAGR, per MarketsandMarkets. These numbers are not projections of some distant future. They are the sound of an industry moving so fast that the people who ship products are expected to have voice-ready AI assistants on day one.
Open WebUI, my go-to interface for local LLMs, has always supported audio. The microphone icon. The speaker button. But out of the box, those features are stubs or point to costly third-party APIs with your data leaving your infrastructure on every request. That is a data sovereignty risk and a recurring cost that compounds with usage.
There’s a better way. Speaches is an OpenAI API-compatible server for speech AI — Text-to-speech (TTS) and Speech-to-text (STT) models served locally from a single container. Think of it as Ollama, but for voice. Kokoro, the #1 ranked TTS model on the Hugging Face TTS Spaces Arena, runs inside Speaches. Whisper models handle transcription. Everything stays on your hardware. No per-request fees. No vendor lock-in.
In this guide, we deploy Speaches to Docker, pull both TTS and STT models, test the web UI, and wire everything into Open WebUI. Total time: about 15 minutes.
What Speaches Actually Is
Speaches is an independent project, not owned by or affiliated with Open WebUI. It provides a local API that implements OpenAI’s speech endpoints, which means every tool and SDK that speaks to OpenAI’s API works with Speaches out of the box.
Under the hood, Speaches uses two core technologies:
- STT (Speech-to-Text): powered by faster-whisper from SYSTRAN, which delivers real-time transcription and translation
- TTS (Text-to-Speech): powered by Kokoro-82M and piper, with Kokoro ranked #1 on the Hugging Face TTS Spaces Arena
The practical benefit of this architecture is simple: Open WebUI calls its STT engine “OpenAI” (which sounds confusing at first — more on that below), points it at Speaches, and everything just works. You trade API bills for a single Docker container.
Prerequisites
Before we begin, make sure you have:
- An Open WebUI instance running in Docker — if you haven’t set this up yet, follow the previous post first
- Docker Engine 24.0+ with the Docker Compose plugin
- Either a CPU-only host (simpler start, slightly slower inference) or an NVIDIA GPU with the NVIDIA Container Toolkit (faster inference, recommended for production)
If you already have an AI application stack running on your Docker host, you can deploy Speaches alongside everything else. The container is entirely self-contained.
Deploying Speaches to Docker
Here is the docker-compose service definition:
speaches:
  container_name: speaches
  image: ghcr.io/speaches-ai/speaches:0.8.3-cpu
  restart: unless-stopped
  ports:
    - 8000:8000
  volumes:
    - ./cache:/home/ubuntu/.cache/huggingface/hub
  healthcheck:
    test: ["CMD", "curl", "--fail", "http://0.0.0.0:8000/health"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 5s
What each parameter in this service definition is doing:
- image: ghcr.io/speaches-ai/speaches:0.8.3-cpu is the official Speaches image tagged v0.8.3 for CPU inference. Use the :latest-cuda tag if you have an NVIDIA GPU installed. The CPU version is perfectly functional for development and moderate usage
- container_name: speaches assigns a persistent name for easy management with docker stop speaches and docker logs speaches
- ports: 8000:8000 maps a host port to the container port. Port 8000 is the default API port. The first number is the host port you access from outside the container; the second is the port the application listens on inside the container
- volumes: ./cache:/home/ubuntu/.cache/huggingface/hub is the critical mount. Speaches downloads models from Hugging Face to ~/.cache/huggingface/hub inside the container. Without this volume mount, every model download is ephemeral: you lose everything as soon as the container is removed and recreated, for example during an image update. Change the host path to match your infrastructure. On my system, I keep AI volumes on a separate storage device to protect my OS drive
- restart: unless-stopped makes Docker restart the container automatically on failure or host reboot, unless you explicitly stopped it first. This is standard “good” behavior for always-available services
- healthcheck makes Docker poll the /health endpoint every 30 seconds. Each check times out after 10 seconds, and after 3 consecutive failures Docker marks the container unhealthy. Failures during the 5-second start_period do not count toward that limit, which gives the server time to initialize. Note the community-commented caveat: the health check URL will break if you change the listening port, so keep it at 8000 or update the curl command accordingly
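To make the retries setting concrete, here is a deliberately simplified Python model of how a sequence of health probe results turns into a container status. It is only a sketch: it ignores interval, timeout, and start_period, and the function name health_status is my own, not part of Docker.

```python
def health_status(probe_results, retries=3):
    """Replay probe outcomes (True = check passed) and return the final status.

    Simplified model: a container is 'starting' until the first passing check,
    and becomes 'unhealthy' after `retries` consecutive failures.
    """
    status = "starting"
    consecutive_failures = 0
    for passed in probe_results:
        if passed:
            # Any passing check resets the failure streak.
            consecutive_failures = 0
            status = "healthy"
        else:
            consecutive_failures += 1
            if consecutive_failures >= retries:
                status = "unhealthy"
    return status
```

With retries set to 3, two failed checks in a row leave a previously healthy container healthy; the third consecutive failure flips it to unhealthy.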
Start the container:
docker compose up -d speaches
Give it about 30 seconds and check the logs:
docker logs -f speaches
You should see initialization messages followed by the server confirming it is ready to accept connections. There’s no model data in the logs yet because we have not triggered a download — those happen on first request, which is the “dynamic model loading” feature.
Verifying the Server Is Running
Before downloading models, confirm the server itself is healthy:
curl http://localhost:8000/health
A healthy server returns a JSON response confirming the API is ready. If you get a connection error, verify your port mapping and that the container is running with docker ps.
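If you want to script this check, for example before pulling models automatically, a small polling helper does the job. The following is a minimal Python sketch using only the standard library; the function names, attempt count, and delay are my own choices, and it assumes the /health endpoint answers with HTTP 200 once the server is ready.

```python
import time
import urllib.request


def is_healthy(url, timeout=5.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses.
        return False


def wait_until_healthy(url, attempts=10, delay=3.0, probe=is_healthy, sleep=time.sleep):
    """Poll `url` until the probe succeeds; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

Calling wait_until_healthy("http://localhost:8000/health") returns the number of attempts it took, or None if the server never answered. The probe and sleep parameters exist only so the loop can be exercised without a running server.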
The Built-in Web UI
By default, Speaches starts a Gradio-based web UI available at http://localhost:8000. You can reach this immediately — the UI works even before models are downloaded. Speaches fetches them on demand on your first request, which means you can test the interface right away and see models appear in your mounted cache directory as you explore.
Pulling the TTS and STT Models
Speaches uses a dynamic loading pattern: models download on demand when you first request them. This is convenient but not ideal for initial setup, because you have no direct control over timing or which models get pulled. Instead, we explicitly download both the TTS and STT models using Speaches’s CLI tool.
Method 1: Using the Speaches CLI (Recommended)
Download the TTS model — Kokoro, the #1 ranked text-to-speech model:
speaches-cli registry ls --task text-to-speech
speaches-cli model download speaches-ai/Kokoro-82M-v1.0-ONNX
The first command lists all available TTS models in the Speaches registry. The second downloads Kokoro specifically. ONNX format is the runtime format Speaches uses for inference — optimized for both CPU and GPU execution.
Download the STT model — a distilled Whisper model:
speaches-cli registry ls --task automatic-speech-recognition
speaches-cli model download Systran/faster-distil-whisper-small.en
This downloads faster-distil-whisper-small.en, which is published under the Systran organization on Hugging Face. This model balances accuracy and speed well for most use cases. For higher accuracy, use Systran/faster-distil-whisper-large-v3 instead. It provides better transcription quality at the cost of ~3x more VRAM and inference time.
If you are running Speaches inside Docker, run these commands inside the container:
docker exec -it speaches speaches-cli registry ls --task text-to-speech
docker exec -it speaches speaches-cli model download speaches-ai/Kokoro-82M-v1.0-ONNX
docker exec -it speaches speaches-cli model download Systran/faster-distil-whisper-small.en
What each docker exec command is doing:
- -it opens an interactive TTY session inside the container, letting you see the download progress in real time
- speaches is the container name from our docker-compose configuration
- speaches-cli is the CLI tool bundled inside the Speaches image for managing the model registry
Method 2: Using the API (Alternative)
If you prefer API calls over the CLI, you can trigger model downloads through Speaches’s HTTP endpoints:
curl http://localhost:8000/v1/models/speaches-ai/Kokoro-82M-v1.0-ONNX -X POST
curl http://localhost:8000/v1/models/Systran/faster-distil-whisper-small.en -X POST
POSTing to a model’s /v1/models/ endpoint triggers Speaches to download and load that model. You can also list all downloaded models with a GET to /v1/models, or the full registry with a GET to /v1/registry.
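To confirm from a script that both models are present, you can compare the GET /v1/models response against the two IDs used in this guide. The sketch below is Python with the standard library only. It assumes the response follows OpenAI's list shape (a "data" array of objects with an "id" field), and the helper names are mine.

```python
import json
import urllib.request

REQUIRED_MODELS = [
    "speaches-ai/Kokoro-82M-v1.0-ONNX",
    "Systran/faster-distil-whisper-small.en",
]


def model_ids(payload):
    """Extract model IDs from an OpenAI-style model list payload."""
    return [entry["id"] for entry in payload.get("data", [])]


def missing_models(payload, required=REQUIRED_MODELS):
    """Return the required model IDs that the server did not report."""
    present = set(model_ids(payload))
    return [model for model in required if model not in present]


def fetch_models(base_url="http://localhost:8000/v1", timeout=10.0):
    """GET {base_url}/models and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as response:
        return json.load(response)
```

With the server running, missing_models(fetch_models()) should return an empty list once both downloads have finished.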
Where Models Are Stored
Speaches saves models to ~/.cache/huggingface/hub inside the container. On your host system, this maps to whatever path you specified in the volume mount — in our configuration, ./cache. You can verify the downloads completed with:
ls -lh ./cache
You should see the Kokoro model files and the Whisper model files in that directory. Their sizes will be several hundred megabytes combined.
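The Hugging Face hub cache stores each model in a folder named models--ORG--NAME, so you can also check for a specific model by path. Here is a small Python sketch of that check; the helper names are mine, and ./cache is simply the host path from the volume mount above.

```python
from pathlib import Path


def cache_dir_name(model_id):
    """Map 'org/name' to the Hugging Face hub cache folder name."""
    return "models--" + model_id.replace("/", "--")


def cached_size_bytes(cache_root, model_id):
    """Total size of a cached model in bytes, or None if its folder is absent.

    Symlinks are skipped because the cache links snapshot files to blobs,
    and counting both would double the total.
    """
    model_dir = Path(cache_root) / cache_dir_name(model_id)
    if not model_dir.is_dir():
        return None
    return sum(
        path.stat().st_size
        for path in model_dir.rglob("*")
        if path.is_file() and not path.is_symlink()
    )
```

For example, cached_size_bytes("./cache", "speaches-ai/Kokoro-82M-v1.0-ONNX") returns None until the Kokoro download has created its folder.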
Testing in the Web UI
Speaches ships with a built-in Gradio interface at http://localhost:8000. Let’s test both the TTS and STT capabilities.
Testing Text-to-Speech
Open http://localhost:8000 in your browser. Navigate to the TTS section. You should see:
- A text input field where you type the text you want spoken
- A model selector showing Kokoro as the available TTS model
- A voice selector. Open the voice dropdown and explore the available Kokoro voices. af_heart is the default female voice and a good starting point. af_bella and af_nova are other options (available voices depend on your Kokoro version)
- A “Generate” button that produces audio and plays it in the browser
Type something like “The voice of your local AI is finally here” and click Generate. You will hear Kokoro speak it back in the selected voice. This is the first time your local LLM interface can actually speak aloud. Pay attention to the quality — Kokoro’s voice quality is widely considered the best in the open-source TTS landscape.
Testing Speech-to-Text
Switch to the STT section of the same interface. You will see:
- A “Record” button that starts capturing audio from your microphone
- An “Upload” option for testing with pre-recorded audio files
- The transcription model selector showing faster-distil-whisper-small.en
Click Record, speak a sentence into your microphone, then stop recording. Speaches returns the transcribed text. For the best results, use a decent-quality microphone in a relatively quiet environment. Whisper models are robust, but background noise affects all speech-to-text engines.
You can also test both capabilities programmatically:
# Test TTS via API
curl http://localhost:8000/v1/audio/speech \
-s -H "Content-Type: application/json" \
-o audio.mp3 \
-d '{"input": "Testing the voice system.", "model": "speaches-ai/Kokoro-82M-v1.0-ONNX", "voice": "af_heart"}'
# Test STT via API
curl http://localhost:8000/v1/audio/transcriptions \
-F "file=@audio.mp3" \
-F "model=Systran/faster-distil-whisper-small.en"
The first command generates speech and saves it to audio.mp3. The second command feeds that audio back into Speaches for transcription. If the transcription matches your input text, both systems are working perfectly.
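In practice the transcription rarely matches character for character, because Whisper chooses its own capitalization and punctuation. A round-trip check is more useful if it normalizes both strings first. The following Python sketch builds the same TTS request as the curl command and adds such a comparison; the function names and the normalization rules are my own assumptions, not part of Speaches.

```python
import json
import re
import urllib.request


def build_speech_request(text, base_url="http://localhost:8000/v1",
                         model="speaches-ai/Kokoro-82M-v1.0-ONNX", voice="af_heart"):
    """Prepare (but do not send) the POST issued by the curl TTS example."""
    body = json.dumps({"input": text, "model": model, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s']", " ", text.lower()).split())


def transcripts_match(expected, transcribed):
    """Compare two sentences while ignoring case and punctuation."""
    return normalize(expected) == normalize(transcribed)
```

Pass the request to urllib.request.urlopen to fetch the audio bytes, then feed the text returned by the transcription call into transcripts_match together with your original sentence.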
Connecting Speaches to Open WebUI
Now for the payoff. Open WebUI natively supports configuring STT and TTS engines. We point it at Speaches, and the microphone and speaker buttons in the chat interface become fully functional.
Method 1: Via the Admin UI (Simplest)
- Open Open WebUI and go to Admin Settings
- Click the Audio tab
- Configure the following settings:
  - Speech-to-Text Engine: select OpenAI
  - API Base URL: enter http://speaches:8000/v1. If Open WebUI runs outside Docker on the same host, use http://localhost:8000/v1
  - API Key: enter any non-empty value like speaches-local-key. Speaches doesn’t require authentication; however, Open WebUI requires a non-empty API key value.
  - Model: enter Systran/faster-distil-whisper-large-v3, the STT model Speaches will use for your microphone input (or Systran/faster-distil-whisper-small.en if that is the one you downloaded earlier)
- Click Save
Method 2: Via Environment Variables (More Reliable)
If the UI settings don’t persist or you are setting this up headlessly, use environment variables in your docker-compose.yml:
web:
  image: ghcr.io/open-webui/open-webui:main
  environment:
    AUDIO_STT_ENGINE: "openai"
    AUDIO_STT_OPENAI_API_BASE_URL: "http://speaches:8000/v1"
    AUDIO_STT_OPENAI_API_KEY: "does-not-matter-what-you-put-but-should-not-be-empty"
    AUDIO_STT_MODEL: "Systran/faster-distil-whisper-large-v3"
    AUDIO_TTS_ENGINE: "openai"
    AUDIO_TTS_OPENAI_API_BASE_URL: "http://speaches:8000/v1"
    AUDIO_TTS_OPENAI_API_KEY: "does-not-matter-what-you-put-but-should-not-be-empty"
    AUDIO_TTS_MODEL: "speaches-ai/Kokoro-82M-v1.0-ONNX"
    AUDIO_TTS_VOICE: "af_heart"
What each environment variable does:
- AUDIO_STT_ENGINE: "openai" tells Open WebUI to use the OpenAI-compatible STT engine. No, this does not mean Open WebUI sends audio to OpenAI servers. It uses the OpenAI API format that Speaches implements locally. This is the confusing part: Open WebUI’s “OpenAI” STT engine is actually any OpenAI-compatible endpoint. You could point it at Whisper.cpp, WhisperAPI, or any other compatible server
- AUDIO_STT_OPENAI_API_BASE_URL is the base URL of your Speaches instance. Use http://speaches:8000/v1 when both containers are on the same Docker network, or http://localhost:8000/v1 if accessing from outside Docker
- AUDIO_STT_OPENAI_API_KEY is a required field in Open WebUI’s config. Since Speaches has no authentication, this value is never verified. Any non-empty string satisfies Open WebUI’s validation
- AUDIO_STT_MODEL is the STT model name inside Speaches. Systran/faster-distil-whisper-large-v3 gives the best accuracy. Use Systran/faster-distil-whisper-small.en for faster inference on CPU-only hardware
- AUDIO_TTS_ENGINE: "openai" follows the same pattern as STT: it uses the OpenAI-compatible TTS engine within Open WebUI, powered locally by Speaches
- AUDIO_TTS_OPENAI_API_BASE_URL is the base URL for the TTS endpoint. It has the same format as the STT URL and points to the same Speaches instance
- AUDIO_TTS_OPENAI_API_KEY follows the same rule as the STT key: any non-empty string works
- AUDIO_TTS_MODEL is the TTS model inside Speaches. speaches-ai/Kokoro-82M-v1.0-ONNX is the Kokoro model we downloaded earlier
- AUDIO_TTS_VOICE is the Kokoro voice ID to use. af_heart is the default. Available Kokoro voices include af_bella, af_nova, af_sarah, and several male voices (am_echo, am_michael)
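After adding these variables, recreate your Open WebUI container so the new environment takes effect. Avoid running a bare docker compose down first: it stops every service in the stack, including Speaches, and the command below would then bring back only Open WebUI.
docker compose up -d web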
Testing the Integration
Open the Open WebUI chat interface and verify these two things work:
- Microphone (STT): Click the microphone icon in the chat input. Speak a prompt. You should see your spoken words appear as text in the input field. If this works, your STT pipeline is fully operational
- Speaker (TTS): After any model response, click the speaker icon beneath the generated text. You will hear Kokoro read the answer back to you
The Docker Network Note
If your STT works but your TTS does not (or vice versa), the most likely culprit is Docker networking. When both services are in the same docker-compose.yml, use http://speaches:8000/v1 (the container name as a hostname). This works because Docker Compose creates an internal DNS that resolves container names to their internal IPs. If you use localhost in this context, it refers to the Open WebUI container itself, so Open WebUI looks for a Speaches instance inside its own container, where none is running.
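These networking and API key mistakes are easy to catch before you recreate the container. Below is a minimal Python sketch of a sanity check over the audio variables; the function name and messages are mine, and the loopback warning assumes Open WebUI itself runs in a container, as it does in this guide.

```python
from urllib.parse import urlparse

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def check_audio_env(env):
    """Return a list of human-readable problems found in the audio settings."""
    problems = []
    for kind in ("STT", "TTS"):
        url = env.get(f"AUDIO_{kind}_OPENAI_API_BASE_URL", "")
        key = env.get(f"AUDIO_{kind}_OPENAI_API_KEY", "")
        parsed = urlparse(url)
        if not parsed.hostname:
            problems.append(f"{kind}: base URL is missing or malformed")
        else:
            if parsed.hostname in LOOPBACK_HOSTS:
                # Inside a container, loopback addresses resolve to the container itself.
                problems.append(f"{kind}: {parsed.hostname} points at the Open WebUI container itself")
            if not parsed.path.rstrip("/").endswith("/v1"):
                problems.append(f"{kind}: base URL should end with /v1")
        if not key:
            problems.append(f"{kind}: API key must be non-empty")
    return problems
```

Feed it a dictionary with the same keys as the environment block above; an empty list means none of the pitfalls described in this section were detected.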
Managing Your Speaches Instance
Common Docker commands for day-to-day operation:
# Start the server
docker compose up -d speaches
# Stop the server
docker compose stop speaches
# Restart (does not pick up docker-compose.yml changes; use "up -d" for those)
docker compose restart speaches
# Update the image (only fetches a newer release if the tag is not pinned to a version)
docker compose pull speaches
docker compose up -d --no-deps speaches
# Check the health of the container
docker inspect --format="{{.State.Health.Status}}" speaches
# List downloaded models
docker exec speaches speaches-cli model list --task text-to-speech
docker exec speaches speaches-cli model list --task automatic-speech-recognition
Switching Between CPU and GPU Versions
If you start with the CPU version and later add a GPU, you need to switch:
# Stop and remove the old container
docker compose down speaches
# Edit your docker-compose.yml to change the image tag
# From: image: ghcr.io/speaches-ai/speaches:0.8.3-cpu
# To: image: ghcr.io/speaches-ai/speaches:latest-cuda
# Restart
docker compose up -d speaches
Models remain cached in your volume mount — no need to re-download. The CUDA-enabled image runs the same CLI and uses the same model format, so your existing configuration carries over perfectly.
Performance Expectations
Realistic performance numbers so you know what to expect:
- GPU mode (CUDA): TTS generates audio at real-time or faster. STT transcribes in roughly 200-500 milliseconds for a typical 5-second prompt
- CPU-only mode: TTS is functional but noticeably slower — expect 3-10x longer generation time. STT is workable for quiet, clear speech but quality drops with accents or background noise. Fine for a personal setup
- GPU memory usage: Kokoro TTS uses approximately 160 MB of VRAM. Whisper STT uses approximately 2 GB for the small model. Your total GPU footprint for both is under 2.5 GB — well under the 8 GB threshold that matters for most consumer GPUs
- Cold start time: The first request to a model waits for download and loading. Subsequent requests skip that step and respond much faster. The start_period: 5s setting in the health check gives the server time to initialize before failed checks start counting toward the retry limit.
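To compare these numbers against your own hardware, measure the real-time factor: processing time divided by audio duration. Here is a small Python sketch for that; the helper names are mine, and the clock parameter exists only so the timer can be swapped out.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio length; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def timed(func, *args, clock=time.perf_counter, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = clock()
    result = func(*args, **kwargs)
    return result, clock() - start
```

As a reference point, the GPU figure above of 200-500 milliseconds for a 5-second prompt corresponds to a real-time factor between 0.04 and 0.1. Wrap your own TTS or STT call in timed and divide by the clip length to see where your setup lands.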
Next Steps for Your Voice-Powered Open WebUI
Speaches gives you both a voice and ears for your Open WebUI instance. Your local AI can now listen to you and speak. The integration is built entirely on open-source components running in your infrastructure — no cloud APIs, no per-request fees, no data leaving your hardware.
Building on this foundation, there are a few directions worth exploring in future posts:
- Voice Chat mode — Speaches supports WebSocket-based real-time voice chat. This enables continuous conversation without the chat interface, similar to ChatGPT’s Voice Mode
- Switching voices — Kokoro supports multiple voice IDs. You could rotate voices for different personas or build a custom voice selection interface
- Multi-language support — Whisper supports transcription in 99+ languages, and Kokoro handles multiple languages. Pair them for a fully multilingual voice assistant
The global voice AI agents market is growing at 37.2% annually, according to Technavio. The technology is proven. The infrastructure is accessible. What is left is making the decision to run your own.
Go set it up. Install Speaches. Pull the models. Connect your Open WebUI. Then try speaking to your AI assistant and hearing it talk back. The gap between “this works on paper” and “this actually works” is exactly one deployment. Start with the CPU-only container. You can always move to GPU later — and the models are already waiting in your cache.
Let me know how it goes. Did the integration work on the first try? Which Kokoro voice did you choose? Drop a comment — I read them all and reply to questions.
