Alibaba’s Qwen3-TTS text-to-speech model was open-sourced (Apache-2.0) on 22 Jan 2026, and within the last couple of weeks we have gained Apple Silicon optimization via MLX-Audio. Here is code to create an audiobook from an ePub.
About
Qwen3-TTS
I previously posted about using Kokoro TTS on macOS and via an open-source project called Abogen on Windows. This time I thought I’d have a look at Qwen3-TTS purely via code...
If all you want to do is try it, head over to their Hugging Face ZeroGPU demo page. I find most of the voices have too strong a Chinese accent when speaking English, but architecturally it is a superior and unique model.
Qwen3-TTS provides 3 models and a tokenizer. Quoting from the Hugging Face model card:
- Qwen3-TTS-12Hz-1.7B-VoiceDesign - “Performs voice design based on user-provided descriptions.”
- Qwen3-TTS-12Hz-1.7B-CustomVoice - “Provides style control over target timbres via user instructions.”
- Qwen3-TTS-12Hz-1.7B-Base - “Base model capable of 3-second rapid voice clone from user audio input.”
- Qwen3-TTS-Tokenizer-12Hz
Voice cloning (In-Context Learning) requires only 3 seconds of reference audio with an accompanying text transcript. It is shockingly easy to use. Do not use voices or ePubs you do not have copyright to, please!
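If your reference recording is longer than a few seconds, you can trim it down first. Here is a minimal sketch using only the standard-library wave module; the file paths match the SPEAKER_AUDIO convention used later, but are otherwise my own choice, and the 3-second figure comes from the model card above.

```python
# make_ref.py - trim a longer recording down to a ~3-second reference clip
# for voice cloning (paths are illustrative, matching voices/f1.wav below)
import wave

def trim_wav(src, dst, seconds=3.0):
    """Copy the first `seconds` of a .wav file to a new file."""
    with wave.open(src, 'rb') as r:
        params = r.getparams()
        frames = r.readframes(int(params.framerate * seconds))
    with wave.open(dst, 'wb') as w:
        w.setparams(params)      # nframes is corrected automatically on close
        w.writeframes(frames)
```

Remember to put a plain-text transcript of the trimmed clip alongside it (e.g. voices/f1.txt), as the Base model needs both.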
MLX-Audio
On macOS, [MLX-Audio] is “the best audio processing library built on Apple’s MLX framework, providing fast and efficient text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS)”, since MLX is Apple’s “array framework optimized for the unified memory architecture of Apple silicon.”
I started down the path of coding simply because the MLX-Audio example code looked so simple. However, I now realize the CLI may suffice.
Setup
This guide is only for my records; do not follow what I am doing, even if you understand each and every step. I myself do not understand the dependent code!
Here is what I tried:
- Install Homebrew.
- Via brew, install a newer version of Python and ffmpeg. python, which at time of writing is v3.14.2, may be optional... but your mileage may vary with the older Python v3.9.6 that is installed on macOS Tahoe. ffmpeg is not required to output .wav, but is required for .mp3.

  brew install python ffmpeg
- Create a working directory, then
- Create a Python virtual environment, and
- Install the pre-requisite libraries:
  - mlx_audio - also provides a CLI, python -m mlx_audio.tts.generate, so you don't really have to use code at all to use Qwen3-TTS or Kokoro
  - halogern/epub2text - seemed easy to use, again with a CLI, epub2text, to list chapters and extract text, and had fewer dependencies

  mkdir qwen3-tts
  cd qwen3-tts
  /opt/homebrew/bin/python3 -m venv v
  source v/bin/activate
  pip install mlx_audio epub2text
- Download one or all of the Qwen3-TTS MLX models from MLX Community to a models folder. I downloaded the 1.7B 8-bit models but you may use the 0.86B variants for a smaller memory footprint. Download all files including the speech_tokenizer folder.
  - Qwen3-TTS Base model
  - Qwen3-TTS CustomVoice model for pre-defined voices
  - Qwen3-TTS VoiceDesign model to use a text prompt to define a voice
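One way to script the download is with huggingface_hub, sketched below. The mlx-community repo ids are my assumptions - check the actual names on the Hugging Face MLX Community page before running.

```python
# fetch_models.py - sketch: download the MLX model conversions
# NOTE: repo ids below are guesses; verify them on Hugging Face first
REPOS = {
    'Base':        'mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit',
    'CustomVoice': 'mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit',
    'VoiceDesign': 'mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-8bit',
}

def local_dir(variant):
    # mirror the models/ folder layout that convert.py expects below
    return 'models/' + REPOS[variant].split('/')[1]

def fetch(variant):
    # snapshot_download pulls every file in the repo,
    # including the speech_tokenizer folder
    from huggingface_hub import snapshot_download  # pip install huggingface_hub
    return snapshot_download(repo_id=REPOS[variant], local_dir=local_dir(variant))
```

For example, fetch('Base') would populate models/Qwen3-TTS-12Hz-1.7B-Base-8bit.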
- Create the code below, convert.py - be warned, there is absolutely zero error handling!
# convert.py - Use Qwen3-TTS to create audio files from .epub chapters
# (c) C.Y., myByways.com
# v0.1 2 Feb 26
import os, sys, time
import mlx.core as mx
from mlx_audio.tts.utils import load_model
from mlx_audio.tts.generate import generate_audio
from mlx_audio.audio_io import write as audio_write
from epub2text import EPUBParser

MODEL_DESIGN = 'models/Qwen3-TTS-12Hz-1.7B-VoiceDesign-8bit'
SPEAKER_DESIGN = 'female, British narrator'

#MODEL_VOICE = 'models/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit'
SPEAKER_VOICE = 'Ryan'
SPEAKER_INSTRUCT = 'excited'

#MODEL_BASE = 'models/Qwen3-TTS-12Hz-1.7B-Base-8bit'
SPEAKER_AUDIO = 'voices/f1.wav'
SPEAKER_TEXT = 'voices/f1.txt'

VERBOSE = True

def stringrange_to_list(s):
    return sum(((list(range(*[int(j) + k for k, j in enumerate(i.split('-'))]))
        if '-' in i else [int(i)]) for i in s.split(',')), [])

if len(sys.argv) == 2:
    parser = EPUBParser(sys.argv[1])
    metadata = parser.get_metadata()
    print(f'Title: {metadata.title}')
    print(f'Author(s): {", ".join(metadata.authors)}')
    for c, chapter in enumerate(parser.get_chapters()):
        print(f' {c+1}. {chapter.title}: {chapter.char_count:,} characters')

if len(sys.argv) == 3:
    parser = EPUBParser(sys.argv[1])
    chapter_range = stringrange_to_list(sys.argv[2])
    output_path = os.path.splitext(os.path.basename(sys.argv[1]))[0]
    os.makedirs(output_path, exist_ok = True)
    if 'MODEL_BASE' in globals():      # voice clone from reference audio
        model = load_model(MODEL_BASE)
    elif 'MODEL_DESIGN' in globals():  # voice from a text description
        model = load_model(MODEL_DESIGN)
    elif 'MODEL_VOICE' in globals():   # pre-defined voices
        model = load_model(MODEL_VOICE)
    for c, chapter in enumerate(parser.get_chapters()):
        c += 1
        if c in chapter_range:
            print(f' Extract {c}. {chapter.title}')
            with open(f'{output_path}/{c:03}.txt', 'w') as chapter_text:
                chapter_text.write(chapter.text)
            tic = time.perf_counter()
            if 'MODEL_BASE' in globals():
                results = generate_audio(model = model,
                    text = chapter.text, lang_code = 'English',
                    ref_audio = SPEAKER_AUDIO, ref_text = SPEAKER_TEXT,
                    output_path = output_path, file_prefix = f'{c:03}',
                    audio_format = 'mp3', join_audio = True,
                    fix_mistral_regex = True, verbose = VERBOSE)
            else:
                segments = [s.strip() for s in chapter.text.split('\n') if s.strip()]
                audio = []
                for i, text in enumerate(segments):
                    print(f'{i+1}/{len(segments)} {text}')
                    if 'MODEL_DESIGN' in globals():
                        results = model.generate_voice_design(text = text,
                            instruct = SPEAKER_DESIGN,
                            verbose = VERBOSE)
                    else:
                        results = model.generate_custom_voice(text = text,
                            speaker = SPEAKER_VOICE, instruct = SPEAKER_INSTRUCT,
                            verbose = VERBOSE)
                    for result in results:
                        audio.append(result.audio)
                audio_write(os.path.join(output_path, f'{c:03}.mp3'),
                    mx.concatenate(audio, axis=0),
                    model.sample_rate)
            toc = time.perf_counter()
            m, s = divmod(toc - tic, 60)
            print(f' Completed {c}. {chapter.title}: {chapter.char_count:,} characters in '
                f'{int(m)} minutes {int(s)} seconds '
                f'({int(60 * chapter.char_count / (toc - tic)):,} cpm)')
FYI, the function to parse a string range is taken from this StackOverflow answer.
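As a quick sanity check (not part of convert.py), here is the range parser expanding the same style of chapter list used on the command line:

```python
# Same helper as in convert.py: expands "1,5-10,12" style range strings
def stringrange_to_list(s):
    return sum(((list(range(*[int(j) + k for k, j in enumerate(i.split('-'))]))
        if '-' in i else [int(i)]) for i in s.split(',')), [])

print(stringrange_to_list('1,5-10,12'))  # [1, 5, 6, 7, 8, 9, 10, 12]
```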
Usage
Edit the global variables at the start of the file.
If you want to use the VoiceDesign model, just uncomment these two lines:
- MODEL_DESIGN = the folder of the downloaded VoiceDesign model of your choice
- SPEAKER_DESIGN = the text prompt for voice design
If you want to use the pre-defined CustomVoices (Ryan or Aiden):
- MODEL_VOICE = 'models/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit'
- SPEAKER_VOICE = 'Ryan'
- SPEAKER_INSTRUCT = 'excited'
If you want to use a reference voice, uncomment:
- MODEL_BASE = 'models/Qwen3-TTS-12Hz-1.7B-Base-8bit'
- SPEAKER_AUDIO = 'voices/f1.wav'
- SPEAKER_TEXT = 'voices/f1.txt'
The reference voice mode is the only one that properly splits large text into chunks; there is a bug somewhere in the other two models, hence the totally different way of generating output. In my own version I do not actually split chunks on the newline '\n'; I use a larger buffer and trim to the nearest newline, but this code is easier until I clean up my version.
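For reference, a minimal sketch of that buffer-and-trim approach; the 2,000-character buffer size is my own guess, not the author's value:

```python
def chunk_text(text, max_chars=2000):
    # Split text into chunks of at most max_chars, cutting each chunk
    # at the newline nearest the end of the buffer where one exists
    chunks = []
    while len(text) > max_chars:
        cut = text.rfind('\n', 0, max_chars)
        if cut <= 0:           # no newline inside the buffer: hard cut
            cut = max_chars
        chunks.append(text[:cut].strip())
        text = text[cut:]
    if text.strip():
        chunks.append(text.strip())
    return chunks
```

Each chunk could then be passed to the model in turn, in place of the per-line segments.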
The last variable is VERBOSE - enable it to show a progress bar, memory usage, etc.
Usage:
- python convert.py book.epub - lists all chapters, starting with 1
- python convert.py book.epub 1,5-10,12 - performs text-to-speech on the given chapters (1, 5 to 10, and 12)
On my M2 Mac mini, I am getting about 1,000 characters per minute... not great, I know.
Update 6 Feb: I just realized only voice cloning works for large chunks of text, because the split_pattern option is ignored by the other two models. I believe this is a bug in MLX-Audio. I have implemented a different code path to take care of manual splitting...