Mudit Gulati

Experiment

Speech-to-Speech Assistant

In development · Whisper, local LLMs, XTTS, Python

This one started as an argument with myself about which model was actually good versus which one the marketing said was good. The only honest way to settle it was to put several of them behind the same microphone and the same speaker and see which one I kept talking to. So: speech in, speech out, nothing typed anywhere. Whisper handles transcription; a swappable layer of language models (local ones running on the Mac Studio, plus a couple of hosted ones kept around for comparison) handles the thinking; and a text-to-speech model speaks the answer back. Swap any layer mid-conversation and the assistant doesn't flinch.

Multiple languages because I think in more than one. Hindi at home, English at work, and the assistant needed to follow without me switching apps or settings: it detects the language I started in and answers in kind, mid-conversation switches included.

The interesting problem has turned out to be latency, not intelligence. A model that reasons beautifully but takes four seconds to start speaking feels broken in a way a slightly dumber, faster one doesn't. Most of the engineering has gone into streaming, meaning speaking the answer before the model has finished thinking it, which has taught me more about how conversation actually works than any of the language modelling has.
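To make the streaming idea concrete, here is a minimal sketch of one way to do it. It is an illustration, not the assistant's actual code. It assumes the language model yields text in small pieces and that the speech layer can be handed one sentence at a time. The names `sentences_from_tokens`, `_first_boundary`, `SENTENCE_END`, and `min_chars` are all hypothetical. The sketch buffers incoming pieces and releases a sentence as soon as a terminator followed by whitespace shows up, so speech can start while the model is still generating.

```python
from typing import Iterable, Iterator, Optional

# Sentence terminators; the Devanagari danda covers Hindi (assumed set, extend as needed).
SENTENCE_END = ".!?।"


def _first_boundary(text: str, min_chars: int) -> Optional[int]:
    """Return the index just past the first usable sentence end, or None.

    A terminator only counts once the next character has arrived and is
    whitespace, so "3.14" is not split and a trailing "." waits for more text.
    """
    for i, ch in enumerate(text):
        if (
            ch in SENTENCE_END
            and i + 1 >= min_chars      # avoid handing tiny fragments to the voice
            and i + 1 < len(text)       # the following character must be known
            and text[i + 1].isspace()
        ):
            return i + 1
    return None


def sentences_from_tokens(
    tokens: Iterable[str], min_chars: int = 12
) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text pieces."""
    buffer = ""
    for token in tokens:
        buffer += token
        # One incoming piece may complete more than one sentence.
        while (cut := _first_boundary(buffer, min_chars)) is not None:
            yield buffer[:cut].strip()
            buffer = buffer[cut:]
    # Flush whatever is left when the model stops generating.
    tail = buffer.strip()
    if tail:
        yield tail


if __name__ == "__main__":
    pieces = ["Hel", "lo there", ", friend", ". How", " are", " you", "?", " Fine."]
    for sentence in sentences_from_tokens(pieces):
        # In a real pipeline each sentence would go to the speech layer here.
        print(sentence)
```

The trade-off sits in `min_chars`: a lower value gets the first sound out sooner, but very short chunks tend to make synthesized speech sound choppy, so the right value is something to tune by ear rather than a fixed answer.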