KITTEN TTS MCP LOCAL SERVER

Voice Companion for AI agents

KittenTTS MCP Server (WINDOWS)

This project packages the KittenTTS nano v0.8 model as a local Model Context Protocol (MCP) server for Codex-compatible clients on Windows.

It is not a general training repository and it is not a hosted inference endpoint. It is a small native C++ server that runs over stdio, loads a fixed local ONNX model, and exposes a simple tool interface for text-to-speech:

  • speak
  • stop_speaking
  • list_voices

The current build is fixed to:

  • Model: kitten_tts_nano_v0_8.onnx
  • Voice metadata: voices_nano.json
  • Default voice: Jasper
  • Default locale: en-us
  • Default speed: 1.0

What This Repository Is For

Use this project when you want a local TTS backend that an MCP client can call as a tool.

Typical use cases:

  • adding spoken responses to a coding agent
  • local desktop TTS without a network service
  • integrating KittenTTS into Codex through MCP
  • exposing a small, predictable voice tool surface to another application

This server communicates with the client using JSON-RPC 2.0 over standard input/output and keeps logs on stderr, which is the expected pattern for local MCP servers.

Model And Runtime

This MCP server wraps the KittenTTS nano model and initializes a local TTS Service runtime that:

  • finds the ONNX model near the executable or in a nearby models/ folder
  • loads voice names from voices_nano.json
  • validates voice, speed, and text input
  • plays audio locally on the host machine

The server currently exposes the following behavior:

  • speaking text with optional voice, locale, speed, and blocking
  • stopping active playback
  • listing the available predefined voices

Supported speed range in this build:

  • 0.5 to 2.0

If voice metadata cannot be loaded, the server falls back to these built-in voices:

  • Bella
  • Bruno
  • Hugo
  • Jasper
  • Kiki
  • Leo
  • Luna
  • Rosie

Intended Platform

This repository is designed for local Windows use and is built as a native Visual Studio C++ executable.

kitten_tts_mcp.exe SHA-256: 49337ad01e21c2eb0760251d308fd31fbd27d3f89bf2a647f1ff5f750c384739

It depends on local runtime files being available next to the executable or in a nearby model directory, including:

  • kitten_tts_nano_v0_8.onnx
  • voices_nano.json
  • onnxruntime.dll
  • onnxruntime_providers_shared.dll
  • libespeak-ng.dll
  • espeak-ng-data/...

MCP Tool Interface

speak

Speaks text aloud on the local machine.

Inputs:

  • text (string, required)
  • voice (string, optional)
  • locale (string, optional)
  • speed (number, optional)
  • blocking (boolean, optional)

stop_speaking

Stops current playback.

list_voices

Returns the predefined voices available to this server.

Example Codex Registration

codex mcp add kitten-tts -- "C:\path\to\kitten_tts_mcp.exe"

Or in config.toml:

[mcp_servers.kitten-tts]
command = "C:\path\to\kitten_tts_mcp.exe"

Example Usage

Once registered, an MCP client can call the server tools to:

  • list installed voices
  • speak short status updates
  • stop playback when interrupted

Example speak payload:

{
  "text": "This is a live MCP test from Codex.",
  "voice": "Jasper",
  "locale": "en-us",
  "speed": 1.0,
  "blocking": false
}

Limitations

  • This build is fixed to the nano model only.
  • The server is intended for local use, not hosted inference.
  • Audio playback happens on the machine running the executable.
  • This repository does not include training code or evaluation benchmarks.
  • Quality, pronunciation, and language coverage depend on the underlying KittenTTS model and local eSpeak-based preprocessing/runtime.

Training Data

This repository does not train a model. It packages and serves an existing KittenTTS model for local inference.

For dataset details, original training procedure, and model-development context, refer to the upstream KittenTTS project.

License

This project is distributed under the Apache 2.0 license in this repository.

Third-party components and model/runtime dependencies may carry their own licenses and attribution requirements.

Credits

  • KittenTTS for the underlying text-to-speech model
  • ONNX Runtime for local inference
  • eSpeak NG for locale and phoneme support
  • nlohmann/json for JSON handling
  • FPHAM for gluing it together

Friendly Companion Skill

The friendly-companion skill adds a calm, human-feeling interaction layer on top of normal assistant work.

Its purpose is not to change the substance of the response. Its purpose is to change the delivery:

  • brief speech creates presence
  • text carries the real content

What It Does

This skill instructs the assistant to behave like a steady, low-key companion while staying precise and useful.

It uses short spoken lines through the kitten-tts MCP server when that helps the interaction feel more natural, then keeps all meaningful details in text.

Typical spoken use:

  • greetings
  • quick confirmations
  • short status updates
  • brief completion notices
  • gentle transitions
  • short questions

Core Rule

Speech is the social layer. Text is the information layer.

The assistant should never put substantial content in speech. Explanations, plans, code, diffs, logs, and detailed instructions stay on screen in written form.

Tone

The intended tone is:

  • calm
  • warm
  • direct
  • companion-like
  • emotionally restrained

The skill avoids:

  • hype
  • forced enthusiasm
  • clingy or theatrical language
  • romantic framing
  • assistant-speak
  • therapist-speak

Voice Behavior

Default voice guidance:

  • Rosie as the default voice
  • Luna as a more businesslike alternative
  • Jasper as a male alternative

If a requested voice is unavailable, the assistant should check available voices and fall back to Luna.

Spoken output should generally be:

  • one or two sentences
  • under about 20 words
  • natural to hear aloud
  • short enough to act as a handoff, not a full answer

Interaction Pattern

The default pattern is:

  1. say one short spoken line when useful
  2. provide the full substantive written reply
  3. avoid repeated spoken follow-ups unless there is a clear conversational reason

Best Use Cases

This skill works well for:

  • companion-style agent interactions
  • voice-enabled coding workflows
  • spoken confirmations before a detailed text reply
  • interfaces where presence matters but precision still belongs in text

Files

  • SKILL.md: full skill instructions
  • agents/openai.yaml: short display metadata for agent integration

Summary

friendly-companion is a delivery skill for a future-facing assistant style: present in voice, clear on screen, useful in substance, and warm without overperforming.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support