Talk to Bob: a voice line to our agent

I send our agent a voice note from my phone. It hears me, transcribes what I said on our own server, does the actual work, and replies in a spoken voice, all inside a normal Telegram chat. We built the whole loop in an afternoon. The honest part is that we built almost none of it: it rides on two things we already had running, and we only had to add the ears and the mouth.

What we built

A two-way voice line to Bob, our main agent. You hold the microphone button in Telegram, say the thing ("what is on my to-do list", "remind me to call Josh tomorrow", "summarise the latest on the model launches"), and let go. A few seconds later Bob replies with a voice note. It has understood you, done the task with its normal tools, and spoken the answer back. No app to install, no new account, no wake word, no setup on the phone at all.

How it works, in plain terms

Telegram carries the audio both ways. It is the same chat we already use to give Bob jobs, so the recording, the playback and the delivery to any device are all handled for us, for free, even on a patchy signal.

When a voice note lands, the server turns it into text with Whisper, an open speech-to-text model that we run on our own machine. The point worth stressing: your voice is understood locally. Once the audio reaches our server it is transcribed there and is not passed on to any outside speech service. Bob then reads that text and works the task exactly as it would a typed message. Its reply is turned back into speech by a text-to-speech voice (the British one, naturally), packaged as a Telegram voice note, and sent back up the same chat.
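The loop above (hear, think, speak) can be sketched in a few lines of Python. This is a minimal illustration and not the code in Bob's bridge: the names handle_voice_note and VoiceReply, the three callables, and the fallback sentence for an empty transcript are all assumptions made for the example. Passing the three stages in as functions keeps the flow visible and lets each end be swapped without touching the middle.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class VoiceReply:
    transcript: str   # what the speech-to-text step heard
    answer: str       # the agent's text reply
    audio_path: Path  # spoken reply, ready to send as a voice note


def handle_voice_note(
    audio_in: Path,
    transcribe: Callable[[Path], str],   # hypothetical: local speech-to-text
    run_agent: Callable[[str], str],     # hypothetical: the existing agent
    synthesize: Callable[[str], Path],   # hypothetical: text-to-speech
) -> VoiceReply:
    """Run one voice note through hear -> think -> speak."""
    transcript = transcribe(audio_in).strip()
    if not transcript:
        # Assumed behaviour: if nothing intelligible was heard,
        # say so instead of sending an empty prompt to the agent.
        answer = "Sorry, I did not catch that."
    else:
        # The agent sees plain text, exactly as for a typed message.
        answer = run_agent(transcript)
    return VoiceReply(transcript, answer, synthesize(answer))
```

The agent step takes and returns plain text, which is why the existing agent needs no changes: only the two ends are new.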

Why it took an afternoon and not a fortnight

Because the hard parts were already standing. Telegram does all the fiddly human-facing work: pressing record, the waveform, playing the reply, surviving a dropped connection, on every phone and laptop, with zero code from us. Tailscale, a private network that links our machines as if they shared one wifi, means the phone reaches the server with no public address to expose and nothing to lock down by hand. The agent in the middle has existed for months. All we added were the two small ends, hearing and speaking, and wired them into the bridge. Most of the value came from plumbing we had already paid for.

What is still off

The speaking is not sovereign yet. The hearing is: transcription runs on our own server and nothing is sent out. But the voice that talks back is generated by a hosted service, so the text of Bob's reply does leave the box to be spoken. Dropping in a local voice closes that gap, and it is on the list.
It only listens while Bob is awake. Right now each voice note is handled by Bob's live session, which is running nearly all of the time. A proper always-on listener, one that answers even when no session is running, is the next version.
English only. The local model we run is the English one, chosen for speed and accuracy for a single user. Bigger multilingual models exist if we need them.
A few seconds each way. Fine for a walk or the kitchen, not a flowing back-and-forth conversation. Real-time would mean streaming the audio as you speak, which is a larger build for a smaller payoff.

What is now in the stack

voice/stt.sh: local Whisper transcription, any phone audio format in, clean text out.
voice/tts.sh: British text-to-speech, packaged as a native Telegram voice note.
Both wired straight into Bob's existing Telegram bridge. No new service, no new account, no app on the phone.
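For a sense of how a bridge could call the two scripts, here is a hedged Python sketch. The script paths come from the list above, but their command-line interfaces are assumptions for illustration: it supposes stt.sh takes an audio file path and prints the transcript to stdout, and tts.sh takes the reply text and an output path. The helper names run_tool, transcribe and synthesize are hypothetical too.

```python
import subprocess
from pathlib import Path
from typing import Sequence


def run_tool(argv: Sequence[str], timeout: float = 60.0) -> str:
    """Run a command and return its trimmed stdout.

    Raises CalledProcessError on a non-zero exit and
    TimeoutExpired if the command runs longer than `timeout`.
    """
    result = subprocess.run(
        list(argv),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return result.stdout.strip()


def transcribe(audio_path: Path, script: str = "voice/stt.sh") -> str:
    # Assumed interface: stt.sh <audio file> -> transcript on stdout.
    return run_tool([script, str(audio_path)])


def synthesize(text: str, out_path: Path, script: str = "voice/tts.sh") -> Path:
    # Assumed interface: tts.sh <text> <output file> -> writes the voice note.
    run_tool([script, text, str(out_path)])
    return out_path
```

Passing arguments as a list rather than a shell string means the reply text is never interpreted by a shell, so quotes or punctuation in Bob's answer cannot break the command.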