Theoretically, yes? This hasent been tested but xcode has great c++ interop and the goal with Axiom and now parakeet.cpp is to be used for portable deployments so making that process easier is definitely on the roadmap.
I played around with it this week, and when you enable advanced mode and add a post-transcription AI model to point to your own server which mimics a minimal ChatGPT-compatible behavior, then you can use it to modify the output, even return an empty string if you noticed that the transcript was more targeted to do other stuff ("turn the lights on"), if you then return an empty string, it won't inject keypresses.
So one gets the best for both worlds: transcription for dictation and transcription to trigger events.
If I now only could let it listen constantly and react to voice, so that no push to talk is active, that would be nice.
Maybe this project here could be used for that.
Also, this seems to support streaming transcription.
Qwen-asr can easily transcribe live radio (see README) in any random laptop. It looks like we are going to see really cool things on local inference, now that automatic programming makes a lot simpler to create solid pipelines for new models in C, C++, Rust, ..., in a matter of hours.
Which is why long term current programming languages will eventually become less relevant in the whole programming stack, as in get the computer to automate tasks, regardless how.
Assuming RAM prices will not make it totally unaffordable. Current situation is atrocious and big infrastructure corps seem to love it, they do not want independent computing. Alternatively they might build specialized branded hardware which people could only use for what corps allow them to do for nice monthly fee.
Another problem is too much abstraction on input spec level. The other day I asked Claude to generate few classes. When reviewing the code I noticed it doing full scan for ranges on one giant set. This would bring my backend to a halt. After pointing it out to Claude it had smartened up to start with lower_bound() call. When there are no people to notice such things what do you think we are going to have?
Agreed, in regards to prices, it appears to be the new gold, lets see how this gets sorted out, with NPUs, FPGAs, analog (Cerebas),...
Now the abstraction I am with you on that, I foresee a more formal way to give specifications, but more suitable for natural language as input, or even proper mathematics, than the languages we have been using thus far.
On more serious note. Sure we need Spec development IDE which LLM would compile to a language of choice (or print ASIC). It would still not prevent that lower_bound things from happening and there will be no people to find out why
> Alternatively they might build specialized branded hardware which people could only use for what corps allow them to do for nice monthly fee.
That's why I'm still holding on to a bulky Core 2 Duo Management Engine-free Fujitsu workstation, for when personal computing finally goes underground again.
Your voxtral.c work was a big motivator for me. I built a macOS menu bar dictation app (https://github.com/T0mSIlver/localvoxtral) around Voxtral Realtime, currently using a voxmlx fork with an OpenAI Realtime WebSocket server I added on top.
The thing that sold me on Voxtral Realtime over Whisper-based models for dictation is the causal encoder. Text streaming in as you speak rather than appearing after you stop is a fundamentally different UX. On M1 Pro with a 4-bit quant through voxmlx it feels responsive enough for natural dictation, though I haven't done proper latency benchmarks yet.
Integrating voxtral.c as a backend is on my roadmap, compiling to a single native binary makes it much easier to bundle into a macOS app than a Python-based backend.
Parakeet does streaming I think, so if you throw enough compute at it, it should be. The closest competitor is whisper v3 which is relatively slow, maybe Voxtral but it's still very new.
There's a minimum possible latency just given the structure of language and how humans process phonemes. Spoken language isn't quite unambiguously causal so there's a limit to how far you can go for a given accuracy. I don't know where the efficiency curve is though. It wouldn't surprise me if 100ms was pushing it.
Yeah the metric would be the total processing latency after that. I've found that VAD is honestly harder to get right than STT and if that fails, STT only gets garbage to process. Even humans sometimes have issues figuring out when exactly someone is done talking.
You probably still better use inference on ANE (Apple Neural Engine) via CoreML rather than Metal - speed will be either similar or even faster on non-pro macbooks or iphones and power consumption significantly better. Metal or even MLX format doesn't have to be the fastest and the only way to access ANE is via CoreML.
The CoreML backend is WIP in Axiom and will roll over to parakeet.cpp when it's ready, the same with CUDA. FluidAudio is a great option for those building Mac-only apps, but the goal with Axiom and Parakeet.cpp is to be very portable and embeddable into almost any app. I will write C and Swift wrappers shortly, then if it's really wanted, a Python wrapper.
For my part, I don't need an app that's faster than Handy, and I do like that Handy is Tauri (Rust + web), which means it could be fully cross-platform eventually. It's mostly that the stack is just more hackable for me personally.
[OP] noahkay13 | 18 hours ago
What it does: - Runs 7 model families: offline transcription (CTC, RNNT, TDT, TDT-CTC), streaming (EOU, Nemotron), and speaker diarization (Sortformer) - Word-level timestamps - Streaming transcription from microphone input - Speaker diarization detecting up to 4 speakers
aaronbrethorst | 14 hours ago
[OP] noahkay13 | 13 hours ago
computerex | 10 hours ago
pdyc | 7 hours ago
ghostpepper | 17 hours ago
https://github.com/rishikanthc/Scriberr
nullandvoid | 14 hours ago
Hoe does this compare?
jack_pp | 12 hours ago
qwertox | 12 hours ago
I played around with it this week, and when you enable advanced mode and add a post-transcription AI model to point to your own server which mimics a minimal ChatGPT-compatible behavior, then you can use it to modify the output, even return an empty string if you noticed that the transcript was more targeted to do other stuff ("turn the lights on"), if you then return an empty string, it won't inject keypresses.
So one gets the best for both worlds: transcription for dictation and transcription to trigger events.
If I now only could let it listen constantly and react to voice, so that no push to talk is active, that would be nice.
Maybe this project here could be used for that.
Also, this seems to support streaming transcription.
potatoman22 | 6 hours ago
antirez | 13 hours ago
https://github.com/antirez/qwen-asr
https://github.com/antirez/voxtral.c
Qwen-asr can easily transcribe live radio (see README) in any random laptop. It looks like we are going to see really cool things on local inference, now that automatic programming makes a lot simpler to create solid pipelines for new models in C, C++, Rust, ..., in a matter of hours.
pjmlp | 12 hours ago
FpUser | 12 hours ago
Another problem is too much abstraction on input spec level. The other day I asked Claude to generate few classes. When reviewing the code I noticed it doing full scan for ranges on one giant set. This would bring my backend to a halt. After pointing it out to Claude it had smartened up to start with lower_bound() call. When there are no people to notice such things what do you think we are going to have?
pjmlp | 11 hours ago
Now the abstraction I am with you on that, I foresee a more formal way to give specifications, but more suitable for natural language as input, or even proper mathematics, than the languages we have been using thus far.
Naturally we aren't there yet.
FpUser | 10 hours ago
But we were. COBOL ;)
On more serious note. Sure we need Spec development IDE which LLM would compile to a language of choice (or print ASIC). It would still not prevent that lower_bound things from happening and there will be no people to find out why
pjmlp | 10 hours ago
Unfortunely that is already the case when debugging low code, no code tools, and good luck having any kind of versioning with those.
MonkeyClub | 11 hours ago
That's why I'm still holding on to a bulky Core 2 Duo Management Engine-free Fujitsu workstation, for when personal computing finally goes underground again.
T0mSIlver | 12 hours ago
The thing that sold me on Voxtral Realtime over Whisper-based models for dictation is the causal encoder. Text streaming in as you speak rather than appearing after you stop is a fundamentally different UX. On M1 Pro with a 4-bit quant through voxmlx it feels responsive enough for natural dictation, though I haven't done proper latency benchmarks yet.
Integrating voxtral.c as a backend is on my roadmap, compiling to a single native binary makes it much easier to bundle into a macOS app than a Python-based backend.
solarkraft | 8 hours ago
100%. I don’t understand how people are able to compromise on this.
rowanG077 | 12 hours ago
moffkalast | 11 hours ago
regularfry | 9 hours ago
moffkalast | 9 hours ago
ahaferburg | 8 hours ago
There's https://kyutai.org/stt, which is very low latency. But it seems not as hackable.
[OP] noahkay13 | 5 hours ago
pzo | 10 hours ago
Can use this library:
https://github.com/FluidInference/FluidAudio
[OP] noahkay13 | 5 hours ago
d4rkp4ttern | 9 hours ago
https://github.com/kitlangton/Hex
This is now my standard way to speak to coding agents.
I used to use Handy but Hex is even faster. Last I checked, Handy has stuttering issues but Hex doesn’t.
mpalmer | 9 hours ago
For my part, I don't need an app that's faster than Handy, and I do like that Handy is Tauri (Rust + web), which means it could be fully cross-platform eventually. It's mostly that the stack is just more hackable for me personally.