Wiseguy Text To Speech May 2026

Keywords: Expressive TTS, paralinguistic style transfer, New York English, prosodic modeling, dialect synthesis

1. Introduction

Generic TTS systems (Amazon Polly, Microsoft Azure Neural TTS) excel at clear, neutral speech but fail to convey paralinguistic identity: the subtle markers of region, class, attitude, and subculture. This paper addresses a specific expressive gap, the “wise guy” voice: a rhetorical style characterized by rapid tempo, upward terminal inflections, vowel nasalization, and domain-specific jargon (e.g., fuggedaboutit, gabagool, mook). While previous work has tackled emotional TTS (happy, sad, angry) and basic accents (British, Australian), no system has targeted a socially situated persona so reliant on timing and attitude.

| Slang | Canonical spelling | Phoneme override (ARPAbet) |
|-------|--------------------|-----------------------------|
| fuggedaboutit | forgetaboutit | F AH G EH D AH B AW T IH T |
| gabagool | capicola | K AA P IH G AA L |
| mook | mook | M UH K |
| yous | yous | Y UW Z |
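The override table can be read as a small pronunciation lexicon that is consulted before the default grapheme-to-phoneme step. The following is a minimal sketch of that lookup, not the paper's implementation: the names `PHONEME_OVERRIDES` and `apply_overrides` are hypothetical, the tokenizer is a deliberately simple regular expression, and the phoneme strings are copied verbatim from the table.

```python
import re

# Hypothetical lexicon; entries copied verbatim from the override table.
PHONEME_OVERRIDES = {
    "fuggedaboutit": "F AH G EH D AH B AW T IH T",
    "gabagool": "K AA P IH G AA L",
    "mook": "M UH K",
    "yous": "Y UW Z",
}


def apply_overrides(text, overrides=PHONEME_OVERRIDES):
    """Pair each word with its ARPAbet override, or None if it has none.

    Words mapped to None would fall through to whatever default
    grapheme-to-phoneme model the synthesizer uses (not shown here).
    """
    # Assumption: words are runs of ASCII letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    return [(tok, overrides.get(tok.lower())) for tok in tokens]


if __name__ == "__main__":
    for word, phonemes in apply_overrides("Hey mook, fuggedaboutit!"):
        print(word, "->", phonemes or "(default G2P)")
```

Matching on the lowercased token keeps the lookup case-insensitive, so sentence-initial "Mook" resolves to the same entry as "mook".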

Higher MCD is expected, since stylized speech distorts the spectral envelope. The 3.2× higher F0 variation confirms that the prosodic exaggeration took effect.

| Metric | Baseline | WiseGuy | p-value |
|--------|----------|---------|---------|
| Authenticity (1-5) | 1.3 (0.4) | 4.7 (0.5) | <0.001 |
| Naturalness (1-5) | 4.5 (0.6) | 3.9 (0.8) | <0.05 |
| Keyword accuracy (%) | 98.2 | 91.5 | <0.01 |
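To make the F0 variation figure concrete, here is one way such a ratio could be computed from two pitch contours. This is a sketch under stated assumptions: the text does not define "F0 variation", so the code assumes it means the standard deviation of F0 in Hz over voiced frames, with unvoiced frames encoded as 0. The function names and the toy contours are hypothetical and are not the paper's data.

```python
import statistics


def f0_variation(f0_hz):
    """Population standard deviation of F0 (Hz) over voiced frames.

    Assumption: unvoiced frames are encoded as 0 and are excluded.
    """
    voiced = [f for f in f0_hz if f > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    return statistics.pstdev(voiced)


def f0_variation_ratio(styled_f0, baseline_f0):
    """How many times more F0 variation the styled contour has."""
    return f0_variation(styled_f0) / f0_variation(baseline_f0)


if __name__ == "__main__":
    # Toy contours for illustration only.
    baseline = [100.0, 0.0, 110.0, 100.0, 110.0]
    styled = [90.0, 120.0, 0.0, 90.0, 120.0]
    print(f"ratio: {f0_variation_ratio(styled, baseline):.1f}x")
```

A semitone-scaled variant (taking the log of F0 before computing the deviation) would be less sensitive to speaker pitch range; the Hz version is used here only because it is the simplest reading of the text.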