Each article on snazzie.space is an MDX file. When you hit the play button, you’re listening to audio generated by a local Python script before the article was ever committed. Here’s what the stack looks like end to end.
The content layer
Articles live in the content directory as MDX files. Astro’s content collections type-check the frontmatter at build time — title, date, excerpt, tags, draft flag. The slug comes from the filename.
MDX gets rendered at build time. No client-side markdown parsing, no JavaScript fetching content on load. The page that reaches you is static HTML. Astro ships zero JavaScript by default, so the only scripts on an article page are the audio player and a small toggle for the raw TTS script view. Hit “see raw script” at the top of any article to see exactly what text gets fed to the voice model.
Phoneme annotations
The TTS model, OmniVoice, uses CMU ARPAbet phonemes. Acronyms and unusual terms get annotated inline in the MDX with their phoneme sequence. The display layer strips the bracket content so the reader sees the term as written, while the TTS receives the correct pronunciation. A maintained table of annotated terms lives in the article skill, so new articles get consistent pronunciation without re-solving the same phoneme lookups.
| Without marker | With marker |
|---|---|
| SQL: “s-q-l” | [S IY1 K W AH0 L]: “sequel” |
| KV: (silent) | [K EY1] [V IY1]: “K-V” |
| SSR: “sarr” | [EH0 S, EH0 S, AA1 R]: “S-S-R” |
| LLM: (silent) | [EH1 L, EH1 L, EH1 M]: “EL-EL-EM” |
| UTC: “you-tee” | [Y UW1] [T IY1] [S IY1]: “U-T-C” |
The same annotation system also handles paralinguistic markers — pauses, soft questions, mild surprise — placed at the start of sentences to shape delivery. These are stripped from the display entirely.
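To make the two views concrete, here is a minimal Python sketch of how one annotated source string could yield both the display text and the TTS text. The inline syntax (`SQL[S IY1 K W AH0 L]`, a term followed directly by bracketed phoneme groups) and the function names are assumptions for illustration; the real MDX syntax may differ, and paralinguistic markers are not handled here.

```python
import re

# Hypothetical inline syntax: the written term followed immediately by one or
# more bracketed ARPAbet groups, e.g. "SQL[S IY1 K W AH0 L]".
ANNOTATION = re.compile(
    r"(?P<term>[A-Za-z0-9]+)(?P<groups>(?:\[[A-Z0-9 ,]+\])+)"
)


def display_text(source: str) -> str:
    """Drop the bracketed phonemes so the reader sees the term as written."""
    return ANNOTATION.sub(lambda m: m.group("term"), source)


def tts_text(source: str) -> str:
    """Drop the written term and keep the phoneme groups for the voice model."""
    return ANNOTATION.sub(
        lambda m: m.group("groups").replace("][", "] ["), source
    )


sentence = "Queries run in SQL[S IY1 K W AH0 L] against a KV[K EY1][V IY1] store."
print(display_text(sentence))
print(tts_text(sentence))
```

Keeping both views as pure functions of the same source string means the page and the narration cannot drift apart, since neither is edited by hand.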
Audio generation
The Astro build processes each MDX file and writes a plain text TTS script, which is the single source of truth for the narration. To produce it, the build strips MDX syntax, converts em dashes to pauses, swaps each annotated term for its phoneme sequence, and excludes any HTML block marked with a skip attribute. The comparison table on this page uses that attribute to stay out of the narration. The “see raw script” toggle on each article shows exactly this output.
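As a rough illustration of those transforms, the sketch below handles two of them: dropping skip-marked blocks and turning em dashes into pauses. It is written in Python for consistency with the audio script, not because the Astro build uses it. The attribute name `data-tts-skip` and the comma used as a pause are assumptions; the article does not name either, and nested tags of the same type inside a skipped block are not handled.

```python
import re

# Hypothetical attribute name: the article only says "a skip attribute".
SKIP_BLOCK = re.compile(
    r"<(?P<tag>\w+)[^>]*\bdata-tts-skip\b[^>]*>.*?</(?P=tag)>", re.DOTALL
)
# Any remaining opening or closing tag.
TAG = re.compile(r"</?[A-Za-z][^>]*>")
# Assumed pause representation; the real script may use a different marker.
PAUSE = ", "


def build_tts_script(body: str) -> str:
    """Turn a rendered article body into plain text for the voice model."""
    text = SKIP_BLOCK.sub("", body)              # drop blocks excluded from narration
    text = TAG.sub("", text)                     # strip leftover markup
    text = re.sub(r"\s*\u2014\s*", PAUSE, text)  # em dash becomes a pause
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


body = (
    "Intro text \u2014 with a dash.\n"
    "<div data-tts-skip>\n<table><tr><td>SQL</td></tr></table>\n</div>\n"
    "After the <em>table</em>."
)
print(build_tts_script(body))
```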
A Python script then reads that file, seeds the RNG for reproducibility, and passes the text to OmniVoice running on CUDA with a fixed reference voice sample. Every article uses the same reference wav, so the voice is consistent. Speed is set to 1.15x. Output is a FLAC and a waveform JSON — 200 normalised peak amplitude values used to render the player visualisation.
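The waveform JSON is the easiest part to sketch. Assuming the decoded audio is available as a flat sequence of float samples, one plausible way to get 200 normalised peaks is to split the samples into 200 buckets, take the largest absolute value in each, and divide by the loudest bucket. The function name and the `{"peaks": [...]}` JSON shape are hypothetical; the OmniVoice call itself is left out because its API is not described here.

```python
import json
from typing import Sequence


def waveform_peaks(samples: Sequence[float], buckets: int = 200) -> list[float]:
    """Return `buckets` peak amplitudes scaled so the loudest bucket is 1.0."""
    if not samples:
        return [0.0] * buckets
    n = len(samples)
    peaks = []
    for i in range(buckets):
        start = i * n // buckets
        # Guarantee at least one sample per bucket for very short inputs.
        end = max(start + 1, (i + 1) * n // buckets)
        peaks.append(max(abs(s) for s in samples[start:end]))
    loudest = max(peaks)
    if loudest == 0:
        return [0.0] * buckets  # pure silence
    return [round(p / loudest, 4) for p in peaks]


# Synthetic samples standing in for decoded audio.
samples = [0.0, 0.5, -1.0, 0.25] * 100
payload = json.dumps({"peaks": waveform_peaks(samples)})  # hypothetical JSON shape
```

Normalising against the loudest bucket keeps the player's bars on a 0 to 1 scale regardless of how loud the recording is.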
Both files are committed with the article and served statically from Cloudflare’s CDN. No API calls at read time. The TTS inference runs once on my machine before commit. After that it’s just a file.