What if you could talk to your laptop and it talked back — not through a three-step pipeline of transcribe-think-synthesize, but as a single model that listens and speaks at the same time, faster than real-time, streaming audio chunks back as it generates them? That’s what we shipped this week in qwen3-asr-swift, running entirely on Apple Silicon.
Our qwen3-asr-swift Swift/MLX speech library now handles full-duplex speech-to-speech with streaming via NVIDIA’s PersonaPlex 7B — faster than real-time (~68ms/step, RTF 0.87), alongside ASR, TTS, and multilingual synthesis. Audio in, audio out, native Swift on Apple Silicon. The 4-bit quantized model (~5.3 GB) is at aufklarer/PersonaPlex-7B-MLX-4bit.
This library didn’t start as a voice conversation engine. It started as a speech recognition port — a proof that Apple Silicon’s unified memory and MLX’s Metal acceleration could run serious speech models natively, without Python, without a server, without copying tensors between CPU and GPU.
First came ASR — Qwen3-ASR 0.6B, quantized to 4-bit, turning speech into text. That established the MLX patterns: KV cache, RoPE, quantized inference. Then TTS — Qwen3-TTS 0.6B, adding the Mimi audio codec and streaming audio generation in 10 languages. Then multilingual synthesis — CosyVoice3 0.5B, introducing DiT flow matching across 9 languages.
And now: speech-to-speech. PersonaPlex 7B takes audio in and produces audio out. No transcription step. No text intermediary. Full-duplex — it listens and speaks simultaneously.
The traditional voice assistant pipeline looks like this:
User speaks → [ASR] → text → [LLM] → text → [TTS] → Agent speaks

Three models, three handoffs, cumulative latency. Each step loses information — ASR discards prosody and emotion, TTS has to reconstruct them from flat text.
PersonaPlex collapses this into one model:
User speaks → [PersonaPlex 7B] → Agent speaks

The model processes audio tokens directly — 17 parallel streams at 12.5 Hz (one frame every 80ms). It’s based on Kyutai’s Moshi architecture, the same foundation behind their real-time voice demo. NVIDIA extended it with 18 controllable voice presets and role-based system prompts.
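A quick sanity check on those numbers, as a minimal sketch in plain Swift (no model code, just the arithmetic implied by 12.5 Hz framing and 17 token streams):

```swift
import Foundation

// Frame geometry implied by the figures above.
let frameRateHz = 12.5                  // Mimi frame rate
let frameMs = 1000.0 / frameRateHz      // audio covered by one generation step
let streams = 1 + 8 + 8                 // text + 8 user codebooks + 8 agent codebooks

print("frame budget: \(frameMs) ms, parallel streams: \(streams)")
// → frame budget: 80.0 ms, parallel streams: 17
```

Every generation step therefore has an 80ms budget: stay under it and the conversation keeps up with real time.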
NVIDIA’s original PersonaPlex is a 16.7 GB PyTorch checkpoint. We converted it to MLX-optimized safetensors with 4-bit quantization for both the 7B temporal transformer and the Depformer:
Total download: ~5.3 GB. Published at aufklarer/PersonaPlex-7B-MLX-4bit.
The conversion script (scripts/convert_personaplex.py) handles everything: download from nvidia/personaplex-7b-v1, classify ~2000 weight keys, quantize both transformers to 4-bit, extract voice presets, and optionally upload to HuggingFace.
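As a rough cross-check on the download size, here is a back-of-the-envelope estimate in plain Swift. This is not the converter's actual accounting, and the quantization group size is an assumption:

```swift
import Foundation

// Back-of-the-envelope: storage for a 4-bit quantized 7B transformer.
let params = 7.0e9
let weightBytes = params * 4.0 / 8.0            // 4 bits per weight
let groupSize = 64.0                            // assumed quantization group size
let metaBytes = (params / groupSize) * 4.0      // fp16 scale + fp16 bias per group
let gb = (weightBytes + metaBytes) / 1e9

print(String(format: "~%.1f GB for the 7B weights alone", gb))
```

That lands near 3.9 GB for the temporal transformer. Add the ~650 MB Depformer plus the Mimi codec and embedding tables, and the ~5.3 GB total is in the right ballpark.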
Here’s the trick that makes single-model voice conversation possible: instead of passing through separate transcription and synthesis stages, PersonaPlex processes 17 parallel token streams through one unified pipeline.
[User Audio 24kHz] → Mimi Encoder → 16 codebook tokens @ 12.5Hz
                  ↓
Temporal Transformer (32L, 4096d, 7B params, 4-bit)
17 streams summed: text + 8 user audio + 8 agent audio
                  ↓
Depformer (6L, 1024d, per-codebook weights, 4-bit)
16 sequential steps → 8 agent audio codebook tokens
                  ↓
[Agent Audio 24kHz] ← Mimi Decoder ← 8 codebook tokens @ 12.5Hz

This is where building a library — rather than standalone ports — paid off. PersonaPlex uses exactly the same Mimi audio codec as Kyutai’s Moshi. We already had a complete, tested Mimi implementation from our TTS work: SEANet encoder/decoder, streaming convolutions, 8-layer transformer bottleneck, Split RVQ. We copied it directly into the PersonaPlex target. Zero changes to the core codec.
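The "17 streams summed" line deserves a concrete picture: each stream has its own embedding table, and the per-stream embeddings are added elementwise to form one input vector per frame. A toy sketch in plain Swift (hypothetical tables and shapes; the real model does this with MLX tensors and 17 streams):

```swift
import Foundation

// Toy multi-stream fusion: one embedding table per stream,
// embeddings summed elementwise into a single transformer input.
let dModel = 4
let streams = 3   // stands in for the real 17 (1 text + 16 audio)

// Hypothetical embedding tables: tables[stream][token][dim]
let tables: [[[Double]]] = (0..<streams).map { s in
    (0..<5).map { t in (0..<dModel).map { d in Double(s + t + d) } }
}

// One token per stream at this frame; the output is the elementwise sum.
func fuse(tokens: [Int]) -> [Double] {
    var x = [Double](repeating: 0, count: dModel)
    for (s, tok) in tokens.enumerated() {
        for d in 0..<dModel { x[d] += tables[s][tok][d] }
    }
    return x
}

let fused = fuse(tokens: [1, 0, 2])
print(fused)   // → [6.0, 9.0, 12.0, 15.0]
```

Summing rather than concatenating keeps the transformer's input width fixed at 4096 no matter how many streams are active.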
The same goes for the HuggingFace downloader, WAV I/O, KV cache, RoPE, SwiGLU, and RMSNorm — all battle-tested across three prior models.
The most novel component is the Depformer, which generates audio codebooks sequentially — one at a time, 16 steps per timestep. Each step uses different weights via the MultiLinear pattern:
public func callAsFunction(_ xs: MLXArray, step: Int) -> MLXArray {
    let start = step * outDim
    let end = start + outDim
    let w = weight[start..<end, 0...]   // slice weights for this step
    if let s = scales, let b = biases {
        // 4-bit quantized path
        return quantizedMM(xs, w, scales: s[start..<end, 0...],
                           biases: b[start..<end, 0...],
                           transpose: true, groupSize: groupSize, bits: bits)
    }
    // full-precision fallback
    return xs.matmul(w.T)
}

One weight tensor, no module overhead, just a slice and multiply. With 4-bit quantization, the Depformer dropped from ~2.4 GB to ~650 MB — a 3.7x reduction with no measurable quality loss in ASR round-trip tests.
PersonaPlex accepts a text system prompt that steers conversational behavior. Without focused instructions, the model rambles — it’s trained on open-ended conversation and will happily discuss cooking when asked about shipping.
Several presets are available via CLI (--list-prompts) or API, including a general assistant (default), customer service agent, and teacher. Custom prompts can also be pre-tokenized and passed directly.
The difference is dramatic. Same input — “Can you guarantee that the replacement part will be shipped tomorrow?”:
No prompt: “So, what type of cooking do you like — outdoor grilling? I can’t say for sure, but if you’re ordering today…”
With prompt: “I can’t promise a specific time, but we’ll do our best to get it out tomorrow. It’s one of the top priorities, so yes, we’ll try to get it done as soon as possible and ship it first thing in the morning.”
Here’s how PersonaPlex runs on an M2 Max with 64 GB, alongside the other models in the library:
A quick note on RTF (Real-Time Factor): below 1.0 means faster than real-time — the model produces output faster than you could listen to it. With both transformers quantized to 4-bit, PersonaPlex now runs faster than real-time at ~68ms/step — comfortably under the 80ms frame budget at 12.5 Hz.
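Concretely, the per-step RTF is just the step time over the frame duration. The headline 0.87 is the measured end-to-end figure, which presumably also counts codec overhead; this sketch is only the generation-loop arithmetic:

```swift
import Foundation

// Real-Time Factor for the generation loop: time to produce one frame
// divided by the audio duration of that frame.
let stepMs = 68.0     // measured per-step latency from the text
let frameMs = 80.0    // audio per step at 12.5 Hz
let rtf = stepMs / frameMs

print(String(format: "per-step RTF = %.2f (< 1.0 means faster than real-time)", rtf))
// → per-step RTF = 0.85 (< 1.0 means faster than real-time)
```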
Breaking down PersonaPlex specifically:
One advantage of having ASR, TTS, and speech-to-speech in the same library: end-to-end testing is trivial. We validate PersonaPlex output by round-tripping through ASR:
import Qwen3ASR
import PersonaPlex

// Transcribe input
let asrModel = try await Qwen3ASRModel.fromPretrained()
let inputTranscript = asrModel.transcribe(audio: inputAudio, sampleRate: 16000)
// → "Can you guarantee that the replacement part will be shipped tomorrow?"
// Generate speech response
let ppModel = try await PersonaPlexModel.fromPretrained()
let responseAudio = ppModel.respond(userAudio: inputAudio, voice: .NATM0)
// Transcribe response to verify
let responseTranscript = asrModel.transcribe(audio: responseAudio, sampleRate: 16000)
// → "I can't promise a specific time, but we'll do our best to get it out tomorrow..."

This is how our E2E tests work — the library validates PersonaPlex output by checking the round-tripped transcript for topic-relevant keywords. Both the offline (respond()) and streaming (respondStream()) paths are tested this way.
Looking at the library’s trajectory — ASR, streaming TTS, multilingual synthesis, and now speech-to-speech — the clear direction was always streaming voice processing. With this release, PersonaPlex supports it.
respondStream() returns an AsyncThrowingStream<AudioChunk> that emits audio chunks during generation. Each chunk is ~2 seconds of 24kHz audio, decoded incrementally through Mimi's streaming decoder:
let stream = model.respondStream(userAudio: audio, voice: .NATM0)
for try await chunk in stream {
playAudio(chunk.samples) // play immediately, 24kHz mono
if chunk.isFinal { break }
}

From the CLI:
.build/release/audio respond --input question.wav --stream --verbose --output response.wav

Four optimizations pushed PersonaPlex closer to real-time:
- eval() consolidation reduced GPU sync barriers from 3 to 1 per generation step, letting MLX’s lazy evaluation graph fuse more operations.
- Bulk audio extraction replaced 384K individual .item(Float.self) calls with a single .asArray(Float.self) during Mimi decode.
- Prefill batching runs the voice prompt (50 frames) and non-voice prefill as single batched forward passes, replacing ~300 individual steps.
- Compiled temporal transformer via compile(shapeless: true) fuses ~450 Metal kernel dispatches per step into optimized kernels — opt-in via model.warmUp() or the --compile flag.
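The prefill-batching idea in miniature: instead of ~50 sequential forward calls for the voice prompt, stack the frames and make one call. A toy sketch that only counts invocations (hypothetical; the real code batches MLX forward passes, where each call costs a fixed dispatch overhead):

```swift
import Foundation

var forwardCalls = 0

// Toy "forward pass" that accepts a whole batch of frames at once.
func forward(_ frames: [[Double]]) -> Int {
    forwardCalls += 1
    return frames.count
}

let prefill = Array(repeating: [0.0, 0.0], count: 50)

// Unbatched: one dispatch per frame.
forwardCalls = 0
for f in prefill { _ = forward([f]) }
let unbatched = forwardCalls   // → 50

// Batched: a single dispatch for the whole prompt.
forwardCalls = 0
_ = forward(prefill)
let batched = forwardCalls     // → 1
```

Fifty dispatches collapse to one; the same per-call fixed cost argument applies to eval() barriers and .item() extraction.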
These follow the same patterns we proved in the TTS port — consolidate eval barriers, batch where possible, compile the autoregressive loop. The temporal transformer compile uses explicit [MLXArray] inputs/outputs for KV cache arrays (avoiding Slice ops that crash shapeless: true), with RoPE offset passed as an MLXArray input rather than an Int that gets baked as a constant.
# Clone and build
git clone https://github.com/ivan-digital/qwen3-asr-swift
cd qwen3-asr-swift
swift build -c release

# Speech-to-speech (downloads ~5.3 GB on first run)
.build/release/audio respond --input your_audio.wav --output response.wav --voice NATM0

# Streaming speech-to-speech (emit audio chunks during generation)
.build/release/audio respond --input your_audio.wav --stream --output response.wav

# With compiled temporal transformer (Metal kernel fusion)
.build/release/audio respond --input your_audio.wav --compile --stream --output response.wav

# Or use any of the other models:
.build/release/audio transcribe audio.wav # ASR
.build/release/audio speak "Hello world" --output hello.wav # TTS
.build/release/audio speak "Hallo Welt" --engine cosyvoice --language german # Multilingual TTS
The quantized model is at aufklarer/PersonaPlex-7B-MLX-4bit. The full library source is at ivan-digital/qwen3-asr-swift.
Built on the shoulders of: NVIDIA (PersonaPlex), Kyutai (Moshi and Mimi), the Qwen team at Alibaba (ASR and TTS models), FunAudioLLM (CosyVoice), and Apple’s MLX framework.