
MiMo TTS

Read selected or clipboard text aloud with Xiaomi MiMo TTS — Chinese and English voices, fine-grained style controls, and chunked long-text playback.
Overview

MiMo TTS for Raycast

State-of-the-art Chinese & English TTS — directly from your keyboard, 100% free during the public beta.

Read any selected or clipboard text aloud, design a brand-new voice from a single sentence, or clone any voice from a 10-second sample — without leaving Raycast.


Why this extension

🆓 Free during MiMo's public beta — all four TTS models, no token meter

Xiaomi's MiMo platform launched MiMo-V2.5-TTS Series + V2.5-ASR as a single full-stack speech suite, and the entire TTS family is currently limited-time free on their billing page:

| Model | What it does | Beta price |
| --- | --- | --- |
| mimo-v2.5-tts | Preset-voice synthesis with full style control | Free |
| mimo-v2.5-tts-voicedesign | Generate a new voice from one sentence of description | Free |
| mimo-v2.5-tts-voiceclone | Clone any voice from a small mp3/wav sample | Free |
| mimo-v2-tts | Legacy voices | Free |

(Source: platform.xiaomimimo.com/docs/zh-CN/price/pay-as-you-go, as of 2026-05-27)

If you've been priced out of OpenAI Speech, ElevenLabs, or Azure Neural TTS for long-form reading or scripted character work, MiMo is the cheapest path to top-tier TTS today.

🚀 V2.5 series just got dramatically cheaper (effective 2026-05-27)

On 27 May 2026, Xiaomi shipped a permanent price cut across the MiMo-V2.5 LLM line: up to 99% off, Token Plan quotas multiplied by 5–8× at the same price, and a full credit reset for existing subscribers. (Announcement) Even after the TTS free beta ends, the underlying inference stack and billing engine keep this cost reduction. SGLang HiCache with SWA cut KV-cache traffic to ~1/7 and raised the share of cacheable tokens ~5×, so the lower unit cost is structural, not just promotional.

🎭 Top-tier model capability, three creation modes

MiMo-V2.5-TTS is positioned by Xiaomi as a model that "doesn't just read — it performs." All three creation modes share the same instruction-following surface:

  • Director-style natural-language control — write a one-line tone hint or a full director's brief (character / scene / direction). The model interprets pacing, breath, restraint, emotional arc.
  • Inline audio tags — drop (轻笑), (深呼吸), (粤语), (唱歌), (suppressed anger) mid-text for surgical control. Mix Chinese and English tags. Singing mode included.
  • Plain-text emotional reading — even without any prompt, the model picks up punctuation, sentence rhythm, and implied speaker identity (age, persona, mood).
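
A minimal sketch of how inline tags can be assembled before the text is sent for synthesis. The `tag` and `compose` helpers are hypothetical names used only for illustration; the only detail taken from the description above is that a tag is written in parentheses in the middle of the text.

```typescript
// Hypothetical helper: wraps a tag name in parentheses, e.g. (轻笑).
function tag(tagName: string): string {
  const trimmed = tagName.trim();
  if (trimmed.length === 0) throw new Error("tag name must not be empty");
  return `(${trimmed})`;
}

// A script is a sequence of plain text and tags, kept in reading order.
type Part = string | { tag: string };

// Joins the parts into the single string that would be sent as the text body.
function compose(parts: Part[]): string {
  return parts.map((p) => (typeof p === "string" ? p : tag(p.tag))).join("");
}

const line = compose([
  { tag: "深呼吸" },
  "好，我们开始吧。",
  { tag: "轻笑" },
  "Just kidding.",
]);
console.log(line);
```

Keeping tags as structured parts until the last step makes it easy to mix Chinese and English tags without hand-editing parentheses.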

🌏 Bilingual by design — Chinese and English, both native quality

Most "Chinese TTS" services are awful at English; most "English TTS" services give you mechanical Mandarin. MiMo trains both at native quality:

  • 4 Chinese preset voices: 冰糖 (clear female), 茉莉 (soft female), 苏打 (bright male), 白桦 (steady male).
  • 4 English preset voices: Mia, Chloe, Milo, Dean.
  • Dialect tags: 东北话 / 四川话 / 河南话 / 粤语 / 台湾腔.
  • Cross-lingual voice clone: clone a voice from a Chinese sample, read English in the same voice (and vice versa).

🎙 Voice Design — one sentence becomes a voice

Type a 1–4 sentence description and get a brand new voice synthesized on the spot:

"A weathered Northern Chinese grandfather, slow and steady pace, slightly raspy and time-worn, like he's telling old stories."

The optional optimize_text_preview flag lets the model rewrite your sample text to match the persona, so you can submit an empty text body and MiMo writes the script for you. Other commercial TTS services charge per generated voice ID; here it's one HTTP call.

🪄 Voice Clone — mp3/wav in, your voice out

Drop in any mp3 or wav (≤10 MB after base64). MiMo replicates the timbre and the cloned voice keeps the full control surface: director prompts, inline tags, dialect switching, singing mode. No upload step, no separate voice-management dashboard — each clone is a one-shot inline call.
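
Because the limit applies after base64 encoding, the raw file has to be smaller than 10 MB. The sketch below works out the encoded size without reading the file; it assumes "10 MB" means 10 × 1024 × 1024 bytes and that standard padded base64 is used. Both are assumptions, and the function names are illustrative.

```typescript
// Assumption: the limit is 10 MiB, measured on the base64 string.
const MAX_BASE64_BYTES = 10 * 1024 * 1024;

// Padded base64 emits 4 output characters for every 3 input bytes, rounded up.
function base64Length(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

// True when a raw file of this size still fits after encoding.
function fitsCloneLimit(rawBytes: number): boolean {
  return base64Length(rawBytes) <= MAX_BASE64_BYTES;
}

// Largest raw file size that still fits under the limit.
const maxRawBytes = Math.floor(MAX_BASE64_BYTES / 4) * 3;
console.log(maxRawBytes, fitsCloneLimit(maxRawBytes));
```

Under these assumptions the largest usable sample is 7,864,320 raw bytes (7.5 MiB), so checking the file size up front avoids encoding a sample that would be rejected.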

⚡ Built for daily, long-form reading

  • Chunked long-text playback — sentence- and clause-aware splitter (UTF-8 byte-safe, English word boundary aware), 4 KB per chunk.
  • Look-ahead synthesis — the next chunk starts decoding while the current one plays; no audible gap.
  • Global speed override (0.5×–2.0× in 0.25× steps) is one keystroke and persists across every command.
  • Cross-command stop — Quick Read is a single keystroke that starts and stops; menu-bar status and dedicated commands all hook the same afplay PID.
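
The chunking described above can be sketched as follows. This is not the extension's actual splitter: the boundary set, the function names, and the fallback order (clause, then whitespace, then code point) are assumptions chosen to illustrate a byte-budgeted split that never cuts inside a UTF-8 character.

```typescript
const encoder = new TextEncoder();
const byteLength = (s: string): number => encoder.encode(s).length;

// Assumed boundary set: Chinese and English sentence and clause punctuation plus newlines.
// Each match is a run of text ending in punctuation (with trailing spaces), or the unpunctuated tail.
const BOUNDARY = /[^。！？；，.!?;,\n]*[。！？；，.!?;,\n]+\s*|[^。！？；，.!?;,\n]+$/g;

// Fallback for a single piece that exceeds the budget on its own.
function hardSplit(piece: string, maxBytes: number): string[] {
  const out: string[] = [];
  let current = "";
  // Prefer whitespace boundaries so English words stay whole.
  for (const token of piece.match(/\S+\s*|\s+/g) ?? []) {
    if (byteLength(current + token) <= maxBytes) { current += token; continue; }
    if (current) { out.push(current); current = ""; }
    if (byteLength(token) <= maxBytes) { current = token; continue; }
    // Unspaced text (for example a long Chinese run): cut between code points.
    // for...of iterates code points, so surrogate pairs are never split.
    for (const ch of token) {
      if (current && byteLength(current + ch) > maxBytes) { out.push(current); current = ""; }
      current += ch;
    }
  }
  if (current) out.push(current);
  return out;
}

// Greedily packs clause-sized pieces into chunks of at most maxBytes UTF-8 bytes.
function chunkText(text: string, maxBytes = 4096): string[] {
  if (maxBytes < 4) throw new Error("maxBytes must hold one UTF-8 code point (4 bytes)");
  const chunks: string[] = [];
  let current = "";
  for (const piece of text.match(BOUNDARY) ?? []) {
    if (byteLength(current + piece) <= maxBytes) { current += piece; continue; }
    if (current) { chunks.push(current); current = ""; }
    if (byteLength(piece) <= maxBytes) { current = piece; continue; }
    const parts = hardSplit(piece, maxBytes);
    current = parts.pop() ?? "";
    chunks.push(...parts);
  }
  if (current) chunks.push(current);
  return chunks;
}

console.log(chunkText("你好，世界。Hello, world! ", 16));
```

Concatenating the returned chunks reproduces the input exactly, which keeps the splitter easy to test. Look-ahead synthesis would then request chunk n+1 while chunk n is playing.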

Install

This extension is published on the Raycast Store as MiMo TTS. Search for it in Raycast → Store, then:

  1. Get a MiMo Token Plan API key (tp-...) from https://platform.xiaomimimo.com/. (Pay-as-you-go sk- keys go to a different endpoint and are not used by this extension.)
  2. Paste it into Raycast Preferences → Extensions → MiMo TTS → MiMo Token Plan API Key.
  3. (Optional) Run Setup Voice Defaults to pick your default voice, model, baseline speed, and a project-wide style prompt.
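
Since only Token Plan keys work here, a pasted key can be sanity-checked by its prefix before any request is made. The sketch below is illustrative: `classifyKey` and `keyProblem` are hypothetical names, and the only facts taken from the steps above are the `tp-` and `sk-` prefixes.

```typescript
type KeyKind = "token-plan" | "pay-as-you-go" | "unknown";

// Classifies a key by prefix only; it does not verify the key with the server.
function classifyKey(raw: string): KeyKind {
  const key = raw.trim();
  if (key.startsWith("tp-") && key.length > 3) return "token-plan";
  if (key.startsWith("sk-") && key.length > 3) return "pay-as-you-go";
  return "unknown";
}

// Returns a human-readable problem, or null when the key looks usable.
function keyProblem(raw: string): string | null {
  switch (classifyKey(raw)) {
    case "token-plan":
      return null;
    case "pay-as-you-go":
      return "This looks like a pay-as-you-go (sk-) key; a Token Plan (tp-) key is required.";
    default:
      return "Expected a MiMo Token Plan API key starting with tp-.";
  }
}

console.log(keyProblem("sk-example"));
```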

Commands

| Command | Mode | What it does |
| --- | --- | --- |
| Quick Read | no-view | Read selected or clipboard text with your default voice. Trigger again to stop. |
| Read with Voice | view | Browse voices and read selection / clipboard with the chosen one. |
| Set Quick Read Voice | view | Pick and preview the voice that Quick Read uses. |
| TTS Studio | view | Long-form composer with voice, speed, opening style, emotion / rhythm / vocal-texture / expression tags, performance presets, and a free-form director prompt. |
| Design Voice | view | Generate a brand-new voice from a one-sentence description (MiMo-V2.5-TTS-VoiceDesign). |
| Clone Voice | view | Replicate any voice from an mp3/wav file (MiMo-V2.5-TTS-VoiceClone). |
| Setup Voice Defaults | view | Persist a per-session override for model / voice / rate / style prompt / Token Plan base URL. |
| Stop Reading | no-view | Stop the current playback immediately. |
| Speed up Reading / Slow Down Reading | no-view | Adjust playback speed by ±0.25× for the next chunk; persists globally. |
| Reading Status | menu-bar | Now-playing status with playback / speed controls. |

TTS Studio — the full style-control surface

The TTS Studio command exposes everything MiMo-V2.5-TTS supports:

  • Performance Preset — curated director briefs: Suppressed Anger, Tearful Smile, Gentle but Tired, Gentleness Amid Frenzy, Narration → Whisper → Roar, …
  • Opening Style — leading style tags (gentle, magnetic, ethereal, husky, …; 唱歌 overrides others to enter singing mode).
  • Custom Tags — comma-separated free-form opening tags.
  • Pace and Rhythm — 吸气 / 深呼吸 / 叹气 / 长叹一口气 / 喘息 / 屏息 / 沉默片刻 / 语速加快 / 放慢语速 / 提高音量喊话 / 小声 / …
  • Emotional State — base emotions, compound emotions, and mixed states (压抑的愤怒, 带着哽咽的笑意, 温柔但疲惫, 狂躁中的温柔, …).
  • Vocal Texture — 颤抖 / 声音颤抖 / 变调 / 破音 / 鼻音 / 气声 / 沙哑 / 哽咽.
  • Laughing and Crying — 笑 / 轻笑 / 大笑 / 冷笑 / 抽泣 / 呜咽 / 哽咽 / 嚎啕大哭.
  • Director Prompt — free-form natural-language direction, free-text and arbitrary length.
  • Speech Rate — 0.5× to 2.0× in 0.25× steps; also exposed as menu-bar / Speed up / Slow Down actions and shared as a global override.
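
The speech-rate rule (0.5× to 2.0× in 0.25× steps) is small enough to sketch directly. The function names are illustrative, and snapping arbitrary input to the nearest step is an assumption, not documented behavior.

```typescript
const MIN_RATE = 0.5;
const MAX_RATE = 2.0;
const RATE_STEP = 0.25;

// Snaps any value to the nearest 0.25 step, then clamps it into the supported range.
function clampRate(rate: number): number {
  const snapped = Math.round(rate / RATE_STEP) * RATE_STEP;
  return Math.min(MAX_RATE, Math.max(MIN_RATE, snapped));
}

// One "Speed up" (+1) or "Slow Down" (-1) action.
function stepRate(current: number, direction: 1 | -1): number {
  return clampRate(current + direction * RATE_STEP);
}

console.log(stepRate(1.0, 1), stepRate(2.0, 1), stepRate(0.5, -1));
```

Multiples of 0.25 are exactly representable as binary floating-point numbers, so repeated stepping does not accumulate rounding drift.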

Models — full coverage

| Model ID | Used by | What it does |
| --- | --- | --- |
| mimo-v2.5-tts | Quick Read · Read with Voice · Set Quick Read Voice · TTS Studio | Preset voices with style controls. |
| mimo-v2.5-tts-voicedesign | Design Voice | Generate a voice from a 1–4 sentence description. Optional optimize_text_preview. |
| mimo-v2.5-tts-voiceclone | Clone Voice | Replicate a voice from a base64-encoded mp3/wav (≤10 MB). |
| mimo-v2-tts | optional, via Setup Voice Defaults | Legacy V2 voices. |

Stop semantics & cross-extension safety

Quick Read uses one keystroke to start and stop. When something is already playing, running Quick Read again terminates the afplay process, clears the now-playing state, and shows a stop HUD. The menu-bar status, dedicated Stop Reading command, and cmd+. from any view command all trigger the same stop path.

Cross-extension PID isolation: this extension uses raycast-mimo-tts.pid / .stop in tmpdir so it doesn't fight my multi-provider AI Voice Studio extension over the same afplay process.
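
A rough sketch of how a PID file and a stop flag in the temp directory can coordinate stopping across commands. The two file names come from the paragraph above. Everything else is an assumption for illustration, not the extension's actual code: the function names, the meaning given to the `.stop` file, and the `dir` parameter (added only so the sketch can be tested).

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const PREFIX = "raycast-mimo-tts";
const pidFile = (dir = os.tmpdir()) => path.join(dir, `${PREFIX}.pid`);
const stopFile = (dir = os.tmpdir()) => path.join(dir, `${PREFIX}.stop`);

// Called when a player process (afplay) is spawned: clear any old stop flag, record the PID.
function recordPlayer(pid: number, dir?: string): void {
  fs.rmSync(stopFile(dir), { force: true });
  fs.writeFileSync(pidFile(dir), String(pid));
}

// Returns the recorded PID, or null when nothing is recorded or the file is malformed.
function readPlayerPid(dir?: string): number | null {
  try {
    const pid = Number.parseInt(fs.readFileSync(pidFile(dir), "utf8"), 10);
    return Number.isInteger(pid) && pid > 0 ? pid : null;
  } catch {
    return null;
  }
}

// Assumed meaning of the stop flag: the chunk loop checks it before starting the next chunk.
function stopRequested(dir?: string): boolean {
  return fs.existsSync(stopFile(dir));
}

// Shared stop path: raise the flag, forget the PID, and terminate the player if one was recorded.
function requestStop(dir?: string): boolean {
  const pid = readPlayerPid(dir);
  fs.writeFileSync(stopFile(dir), "");
  fs.rmSync(pidFile(dir), { force: true });
  if (pid === null) return false;
  try {
    process.kill(pid, "SIGTERM");
    return true;
  } catch {
    return false; // the process had already exited
  }
}
```

Because the file names carry an extension-specific prefix, a second extension using the same pattern with its own prefix would not read or signal this extension's player.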


Permissions & privacy

The extension reads selected text and the clipboard only when you trigger a command. Synthesized audio plays via the system afplay binary. No data is persisted beyond per-session Raycast LocalStorage (voice override, speech-rate override, now-playing state). The API key never leaves Raycast Preferences. Voice-clone samples are sent inline to MiMo's endpoint for that single request — no upload service is used and nothing is cached server-side per their docs.


Provenance

Originally part of AI Voice Studio, a multi-provider Raycast TTS extension covering Qwen-TTS, MiniMax, MiMo, and OpenAI. This standalone version extracts the MiMo provider so users who only want MiMo TTS get a focused, smaller surface — no Qwen / MiniMax / OpenAI code paths.

Open source on GitHub: https://github.com/xwzhangSZU/Raycast-Mimo-TTS.

