Alibaba’s Qwen3-TTS speech synthesis (text-to-speech) model was open-sourced (Apache-2.0) on 22 Jan 2026, and within the last couple of weeks it has also gained Apple Silicon optimization via MLX-Audio. Here is code to create an audiobook from an ePub.

About

Qwen3-TTS

I previously posted about using Kokoro TTS on macOS and via an open-source project called Abogen on Windows. This time I thought I’d have a look at Qwen3-TTS purely via code...

If all you want to do is try it, head over to their Hugging Face ZeroGPU demo page. I find most of the voices have too strong a Chinese accent when speaking English, but architecturally it is a superior and unique model.

Qwen3-TTS provides 3 models and a tokenizer. Quoting from the Hugging Face model card:

  • Qwen3-TTS-12Hz-1.7B-VoiceDesign - “Performs voice design based on user-provided descriptions.”
  • Qwen3-TTS-12Hz-1.7B-CustomVoice - “Provides style control over target timbres via user instructions.”
  • Qwen3-TTS-12Hz-1.7B-Base - Base model, “capable of 3-second rapid voice clone from user audio input.”
  • Qwen3-TTS-Tokenizer-12Hz

Voice cloning (In-Context Learning) requires only 3 seconds of reference audio with an accompanying text transcript. It is shockingly easy to use. Please do not use voices or ePubs you do not hold the rights to!
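
As an illustration, the reference pair the script points at later looks like this (hypothetical content; what matters is that the .wav and .txt match the SPEAKER_AUDIO and SPEAKER_TEXT constants in convert.py):

      voices/f1.wav - a clean clip, three seconds or longer, of the target voice
      voices/f1.txt - the matching transcript, e.g. "The quick brown fox jumps over the lazy dog."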

MLX-Audio

On macOS, MLX-Audio is “the best audio processing library built on Apple’s MLX framework, providing fast and efficient text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS)”, since MLX is Apple’s “array framework optimized for the unified memory architecture of Apple silicon.”

I started down the path of coding simply because the MLX-Audio example code looked so simple. However, I realize the CLI may suffice.
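
As a taste of the CLI route, something like the one-liner below should work once a model is downloaded (a sketch only: the --model and --text flag names are what I recall from the MLX-Audio README, and I have not checked whether the Qwen3-TTS models need extra options; the model path refers to the folder downloaded in the Setup section below):

      python -m mlx_audio.tts.generate --model models/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit --text "Hello from Qwen3-TTS on Apple Silicon."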

Setup

This guide is only for my own records; do not follow what I am doing, even if you understand each and every step. I myself do not understand the dependent code!

Here is what I tried:

  1. Install Homebrew,

  2. Via brew, install a newer version of Python and ffmpeg:

    • python, which at the time of writing is v3.14.2, may be optional... but your mileage may vary with the older Python v3.9.6 that ships with macOS Tahoe.
    • ffmpeg is not required to output as .wav, but required for .mp3.
      brew install python ffmpeg
  3. Create a working directory, then

  4. Create a python virtual environment, and

  5. Install the pre-requisite libraries:

    • mlx_audio - also provides a CLI, python -m mlx_audio.tts.generate, so you don’t really have to use code at all to use Qwen3-TTS or Kokoro
    • halogern/epub2text - seemed easy to use, again with a CLI, epub2text, to list chapters and extract text, and has fewer dependencies
      mkdir qwen3-tts
      cd qwen3-tts
      /opt/homebrew/bin/python3 -m venv v
      source v/bin/activate
      pip install mlx_audio epub2text
  6. Download one or all of the Qwen3-TTS MLX models from MLX Community to a models folder. I downloaded the 1.7B 8-bit models, but you may use the 0.86B variants for a smaller memory footprint. Download all files, including the speech_tokenizer folder.

  7. Create the code below, convert.py - be warned, there is absolutely zero error handling!

# convert.py - Use Qwen3-TTS to create audio files from .epub chapters
# (c) C.Y., myByways.com
#  v0.1 2 Feb 26

import os, sys, time
import mlx.core as mx
from mlx_audio.tts.utils import load_model
from mlx_audio.tts.generate import generate_audio
from mlx_audio.audio_io import write as audio_write
from epub2text import EPUBParser

MODEL_DESIGN = 'models/Qwen3-TTS-12Hz-1.7B-VoiceDesign-8bit'
SPEAKER_DESIGN = 'female, British narrator'
#MODEL_VOICE = 'models/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit'
SPEAKER_VOICE = 'Ryan'
SPEAKER_INSTRUCT = 'excited'
#MODEL_BASE = 'models/Qwen3-TTS-12Hz-1.7B-Base-8bit'
SPEAKER_AUDIO = 'voices/f1.wav'
SPEAKER_TEXT = 'voices/f1.txt'
VERBOSE = True

def stringrange_to_list(s):
    return sum(((list(range(*[int(j) + k for k,j in enumerate(i.split('-'))]))
        if '-' in i else [int(i)]) for i in s.split(',')), [])

if len(sys.argv) == 2:
    parser = EPUBParser(sys.argv[1])
    metadata = parser.get_metadata()
    print(f'Title: {metadata.title}')
    print(f'Author(s): {", ".join(metadata.authors)}')
    for c, chapter in enumerate(parser.get_chapters()):
        print(f' {c+1}. {chapter.title}: {chapter.char_count:,} characters')

if len(sys.argv) == 3:
    parser = EPUBParser(sys.argv[1])
    chapter_range = stringrange_to_list(sys.argv[2])
    output_path = os.path.splitext(os.path.basename(sys.argv[1]))[0]
    os.makedirs(output_path, exist_ok = True)
    if 'MODEL_BASE' in globals():        # voice cloning from reference audio
        model = load_model(MODEL_BASE)
    elif 'MODEL_DESIGN' in globals():    # voice design from a text description
        model = load_model(MODEL_DESIGN)
    elif 'MODEL_VOICE' in globals():     # pre-defined custom voices
        model = load_model(MODEL_VOICE)
    for c, chapter in enumerate(parser.get_chapters()):
        c += 1
        if c in chapter_range:
            print(f' Extract {c}. {chapter.title}')
            with open(f'{output_path}/{c:03}.txt', 'w') as chapter_text:
                chapter_text.write(chapter.text)
            tic = time.perf_counter()
            if 'MODEL_BASE' in globals():
                results = generate_audio(model = model, 
                    text = chapter.text, lang_code = 'English',
                    ref_audio = SPEAKER_AUDIO, ref_text = SPEAKER_TEXT,
                    output_path = output_path, file_prefix = f'{c:03}',
                    audio_format = 'mp3', join_audio = True, 
                    fix_mistral_regex = True, verbose = VERBOSE)
            else:
                segments = [s.strip() for s in chapter.text.split('\n') if s.strip()]
                audio = []
                for i, text in enumerate(segments):
                    print(f'{i+1}/{len(segments)} {text}')
                    if 'MODEL_DESIGN' in globals():
                        results = model.generate_voice_design(text=text,
                            instruct = SPEAKER_DESIGN, 
                            verbose = VERBOSE)
                    else:
                        results = model.generate_custom_voice(text=text, 
                            speaker = SPEAKER_VOICE, instruct = SPEAKER_INSTRUCT, 
                            verbose = VERBOSE)
                    for result in results:
                        audio.append(result.audio)
                audio_write(os.path.join(output_path, f'{c:03}.mp3'),
                    mx.concatenate(audio, axis=0),
                    model.sample_rate)
            toc = time.perf_counter()
            m, s = divmod(toc - tic, 60)
            print(f'   Completed {c}. {chapter.title}: {chapter.char_count:,} characters in '\
                f'{int(m)} minutes {int(s)} seconds ('\
                f'{int(60*chapter.char_count/(toc-tic)):,} cpm)')

FYI, the function to parse a string range is taken from this StackOverflow answer.
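
For example, the chapter spec used in the usage examples below expands like this:

stringrange_to_list('1,5-10,12')   # [1, 5, 6, 7, 8, 9, 10, 12]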

Usage

Edit the global variables at the start of the file.

If you want to use the VoiceDesign model, leave just these two lines uncommented (and comment out the other MODEL_* lines, since the script loads whichever model constant is defined):

  • MODEL_DESIGN = the folder of the downloaded VoiceDesign model of your choice
  • SPEAKER_DESIGN = the text prompt for voice design

If you want to use the pre-defined CustomVoice speakers (Ryan or Aiden), uncomment these instead:

  • MODEL_VOICE = 'models/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit'
  • SPEAKER_VOICE = 'Ryan'
  • SPEAKER_INSTRUCT = 'excited'

If you want to use a reference voice, uncomment:

  • MODEL_BASE = 'models/Qwen3-TTS-12Hz-1.7B-Base-8bit'
  • SPEAKER_AUDIO = 'voices/f1.wav'
  • SPEAKER_TEXT = 'voices/f1.txt'

The reference voice mode is the only mode that properly splits large text into chunks; there is a bug somewhere for the other two models, hence the totally different way of generating output. I do not actually use the newline '\n' to split chunks; I use a larger buffer and trim to the nearest newline, but this code is easier until I clean up my version.
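
Roughly, such a buffer-and-trim splitter could look like the quick sketch below (not the cleaned-up version I mentioned; max_chars is an arbitrary limit you would tune to the model):

def split_text(text, max_chars=1000):
    # Take up to max_chars at a time, trimming back to the last newline
    # so each chunk ends on a paragraph boundary where possible.
    chunks = []
    while len(text) > max_chars:
        cut = text.rfind('\n', 0, max_chars)
        if cut <= 0:  # no newline in range, fall back to a hard cut
            cut = max_chars
        chunk = text[:cut].strip()
        if chunk:
            chunks.append(chunk)
        text = text[cut:]
    if text.strip():
        chunks.append(text.strip())
    return chunks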

The last variable is VERBOSE - enable it to show the progress bar, memory usage, etc.

Usage:

  • python convert.py book.epub - lists all chapters, numbered from 1
  • python convert.py book.epub 1,5-10,12 - performs text-to-speech on given chapters (1, 5 to 10 and 12)
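
With the file naming in convert.py, the second form drops everything into a folder named after the ePub, with one text extract and one audio file per selected chapter, roughly:

      book/
        001.txt
        001.mp3
        005.txt
        005.mp3
        ...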

On my M2 Mac mini, I am getting about 1,000 characters per minute... not great, I know.

Update 6 Feb: Just realized that only voice cloning works for large chunks of text, because the split_pattern option is ignored for the other two models. I believe this is a bug in MLX-Audio. I implemented a different code path to take care of manual splitting...