NexiMedia
NEXI — CONFIDENTIAL — VOICE FORGE — INTERNAL VENTURE BRIEF — NOT FOR DISTRIBUTION
Cultural Voice AI

VOICE
FORGE

Cultural Voice AI — 5-Layer Universal Semantic Representation

95% CODE COMPLETE  ·  XTTS v2 Core  ·  5-Layer USR
$56B — Voice AI Market 2030
95% — Code Complete
375× — Cheaper Than ElevenLabs
5-Layer USR
Core Architecture

XTTS v2 Architecture

A transformer-based TTS system combining GPT-2 autoregressive decoding, latent diffusion refinement, and HiFi-GAN neural vocoding — all conditioned on a 256-dimensional speaker embedding.

01 — Feature Extraction & Text Processing ENCODER

The acoustic front-end converts raw audio to 80-channel log-Mel spectrograms, while the text pipeline converts graphemes to phonemes (G2P) and produces phoneme embeddings that condition the autoregressive decoder.

Log-Mel Spectrogram (acoustic features)
  • Window: 25ms Hann window, 10ms hop, 1024-point FFT
  • Channels: 80 Mel filterbanks (20 Hz – 8,000 Hz, HTK scale)
  • Output: X ∈ ℝ^{80 × T}, where T = ⌈audio_len / hop_size⌉

Grapheme-to-Phoneme (G2P) pipeline
  • Step 1: Lexicon lookup (CMU Pronouncing Dictionary, 134K entries)
  • Step 2: Neural G2P fallback (Seq2Seq LSTM, 97.8% word accuracy)
  • Step 3: IPA transcription → phoneme token sequence
  • Step 4: Phoneme embedding lookup: E ∈ ℝ^{|vocab| × 512}

Subword encoding for out-of-lexicon words
  • SentencePiece BPE tokenizer, vocabulary size = 1,024
  • Handles proper nouns, neologisms, and code-mixed text
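The acoustic front-end is compact enough to sketch end-to-end. Below is a minimal numpy implementation of the log-Mel extractor using the stated configuration (25ms Hann window, 10ms hop, 1024-point FFT, 80 HTK-scale bands, 20 Hz–8 kHz); for simplicity it frames without padding, so T comes out slightly smaller than the ⌈audio_len / hop_size⌉ of the padded convention.

```python
import numpy as np

def hz_to_mel(f):
    # HTK mel scale, as in the front-end spec above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=1024, sr=16000, fmin=20.0, fmax=8000.0):
    # Triangular filters spaced evenly on the mel scale (20 Hz – 8 kHz, 80 bands)
    pts = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def log_mel(audio, sr=16000, n_fft=1024, win=400, hop=160, n_mels=80):
    # 25 ms window / 10 ms hop at 16 kHz → win = 400, hop = 160 samples
    n_frames = 1 + (len(audio) - win) // hop
    window = np.hanning(win)
    frames = np.stack([audio[t * hop : t * hop + win] * window
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T   # (T, 80)
    return np.log(mel + 1e-10).T                        # X ∈ ℝ^{80 × T}

X = log_mel(np.random.default_rng(0).standard_normal(16000))  # 1 s of noise
```

One second of 16 kHz audio yields 98 frames here; a production front-end (e.g. librosa with centered padding) would give the full ⌈16000/160⌉ = 100.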
02 — GPT-2 Autoregressive Decoder 774M PARAMS

The core generative model is a GPT-2 language model retrained to predict latent acoustic tokens rather than text tokens. The model is conditioned on both the text phoneme sequence and the 256-dimensional speaker embedding (d-vector), enabling voice cloning with as little as 6 seconds of reference audio.

Model architecture
  • Parameters: 774M (GPT-2 Large scale)
  • Layers: 36 transformer decoder blocks
  • Attention: 20 heads × 64-dim per head = 1,280-dim embeddings
  • Context: 512 acoustic tokens (≈ 5.3 seconds of audio at 96 tok/s)

Autoregressive acoustic token prediction
  • Input: [phoneme_embed(text), d_vector_speaker, acoustic_tokens[0:t]]
  • Output: P(a_{t+1} | a_{1:t}, text, speaker)
  • Codebook: EnCodec 24kHz, 8 codebooks × 1,024 codes each

Speaker conditioning (cross-attention injection)
  • d_vector ∈ ℝ^256 → projected to ℝ^1280 via a learned affine layer
  • Injected at every decoder layer via cross-attention: K, V derived from d_vector

Training objective
  • L = −Σ_t log P(a_t | a_{<t}, text, speaker) + λ · L_speaker_consistency
  • (cross-entropy on acoustic tokens + speaker-identity regularization)
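To make the conditioning flow concrete, here is a toy numpy sketch of one autoregressive step: project the 256-dim d-vector to the 1,280-dim model width via an affine map, mix it into a (heavily simplified) context summary, and pick the next acoustic token from a 1,024-code codebook. The random weights and the mean-pooled "context" are stand-ins; the production model uses the 36-block transformer with cross-attention described above.

```python
import numpy as np

# Toy stand-ins for the 774M decoder: dimensions follow the spec above,
# weights are random, and "context" is a mean-pool rather than 36 attention blocks.
D_MODEL, D_SPK, VOCAB = 1280, 256, 1024
rng = np.random.default_rng(0)
W_spk = rng.normal(0, 0.02, (D_SPK, D_MODEL))    # learned affine: ℝ^256 → ℝ^1280
W_out = rng.normal(0, 0.02, (D_MODEL, VOCAB))    # projection to codebook logits
token_emb = rng.normal(0, 0.02, (VOCAB, D_MODEL))

def decode_step(history, d_vector):
    """One autoregressive step: P(a_{t+1} | a_{1:t}, text, speaker)."""
    ctx = history.mean(axis=0)          # simplified context summary
    h = ctx + d_vector @ W_spk          # speaker injection (real model: cross-attn K, V)
    return int(np.argmax(h @ W_out))    # greedy decoding for the sketch

d_vector = rng.standard_normal(D_SPK)            # speaker embedding
hist = rng.normal(0, 0.02, (1, D_MODEL))         # phoneme-conditioned prefix (toy)
tokens = []
for _ in range(8):                               # emit 8 acoustic tokens
    t = decode_step(hist, d_vector)
    tokens.append(t)
    hist = np.vstack([hist, token_emb[t]])       # feed prediction back in
```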
Component            | Parameters | Architecture | Note
Transformer backbone | 774M       | GPT-2 Large  | Pre-trained on LibriSpeech + Common Voice
Speaker encoder      | ~1.2M      | ECAPA-TDNN   | 256-dim d-vector output
Latent diffusion     | ~85M       | U-Net        | 50-step DDPM refinement
HiFi-GAN vocoder     | 14M        | GAN          | 24kHz waveform synthesis
Total system         | ~874M      | End-to-end   | Single-GPU inference (RTX 3090)
03 — Latent Diffusion + HiFi-GAN Vocoder 24kHz OUTPUT

Latent Diffusion Refinement: The GPT-2 decoder outputs a coarse latent spectrogram, which is passed through a U-Net diffusion model for detail enhancement. 50 denoising steps (DDIM sampling at inference) recover fine spectral structure lost in autoregressive prediction.

DDPM forward / reverse
  • Forward: q(x_t | x_{t−1}) = N(√(1−β_t) x_{t−1}, β_t I)
  • Reverse: p_θ(x_{t−1} | x_t) = N(μ_θ(x_t, t), σ²_t I)
  • Score network: ε_θ(x_t, t) ← U-Net (3 resolution levels, skip connections)
  • Steps: T = 1,000 training; T = 50 inference (DDIM)
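The reverse update maps directly to code. Below is a self-contained numpy sketch of one ancestral DDPM step under the standard ε-parameterization, with an assumed linear β schedule (the brief does not specify one) and a zero ε-predictor standing in for the U-Net so the full training-time chain is runnable; inference would instead take the 50 DDIM steps noted above.

```python
import numpy as np

# One ancestral DDPM reverse step, ε-parameterization. The linear β schedule
# and the zero ε-predictor are assumptions made so the sketch runs standalone.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def ddpm_reverse_step(x_t, t, eps_pred, rng):
    """Sample x_{t-1} ~ p_θ(x_{t-1}|x_t) = N(μ_θ(x_t, t), σ²_t I)."""
    mu = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mu                                   # final step is deterministic
    return mu + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)  # σ²_t = β_t

rng = np.random.default_rng(0)
x = rng.standard_normal((80, 32))   # coarse latent spectrogram (80 mel × 32 frames)
for t in reversed(range(T)):        # full 1,000-step reverse chain
    x = ddpm_reverse_step(x, t, np.zeros_like(x), rng)
```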

HiFi-GAN Neural Vocoder: The refined mel-spectrogram is converted to a 24kHz waveform by HiFi-GAN. The generator uses residual dilated convolutions; discriminators operate at multiple periods and scales to enforce waveform realism.

HiFi-GAN discriminator stack
  • MPD (multi-period): periods [2, 3, 5, 7, 11] — captures periodicity
  • MSD (multi-scale): scales [1, 2, 4] — captures multi-resolution structure
  • Loss: L_adv + λ_fm · L_feature + λ_mel · L_mel
  • RTF: 0.015 (≈66× faster than real time, single GPU)
Voice Cloning

Speaker Embedding — ECAPA-TDNN

Six seconds of reference audio is all it takes to clone a voice, with identity preserved across language boundaries.

ECAPA-TDNN — Emphasized Channel Attention, Propagation and Aggregation 256-DIM

ECAPA-TDNN is the current state-of-the-art speaker verification architecture, combining temporal convolutions with squeeze-and-excitation channel attention and attentive statistics pooling to produce a fixed-dimensional speaker representation independent of utterance length.

ECAPA-TDNN architecture
  • Input: raw waveform → MFCC (80-dim, 25ms window, 10ms hop)
  • Body: 5 × SE-Res2Block (1-D dilated convolution + SE channel attention)
  • Dilation schedule: [1, 2, 3, 4] (multi-scale temporal receptive fields)
  • SE ratio: 8 (squeeze 512 → 64 → 512 channels)

Multi-scale temporal aggregation
  • Concatenate outputs from all SE-Res2Blocks channel-wise
  • Multi-layer feature aggregation before pooling

Attentive Statistics Pooling
  • Attention: e_t = v^T tanh(W·h_t + b) (scalar attention weight per frame)
  • Weights: α_t = softmax(e_t)
  • Mean: μ = Σ_t α_t · h_t
  • Std: σ = √(Σ_t α_t · h_t² − μ²)
  • Concat: [μ; σ] → linear → L2-normalize → d-vector ∈ ℝ^256

Enrollment & verification
  • Reference: 6 seconds, 16kHz mono (minimum enrollment)
  • Threshold: cosine_similarity(d_v1, d_v2) > 0.85 → same speaker
  • Cross-lingual: the d-vector encodes voice timbre, not language — clones across 17 languages
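Attentive statistics pooling is compact enough to write out directly. The numpy sketch below follows the formulas above (scalar attention per frame, attention-weighted mean and standard deviation, concatenation); the 128-dim attention bottleneck and the random weights are illustrative assumptions.

```python
import numpy as np

def attentive_stats_pool(H, W, b, v):
    """Attentive statistics pooling per the formulas above:
    e_t = v^T tanh(W h_t + b);  α = softmax(e);  output [μ; σ]."""
    e = v @ np.tanh(W @ H + b[:, None])     # scalar attention per frame, shape (T,)
    a = np.exp(e - e.max())
    a /= a.sum()                            # α_t = softmax(e_t)
    mu = H @ a                              # μ = Σ_t α_t h_t
    var = (H ** 2) @ a - mu ** 2            # σ² = Σ_t α_t h_t² − μ²
    sigma = np.sqrt(np.maximum(var, 1e-9))  # clamp tiny negatives from rounding
    return np.concatenate([mu, sigma])      # later: linear → L2-norm → ℝ^256 d-vector

rng = np.random.default_rng(0)
D, T_FRAMES, A = 512, 100, 128              # A (attention dim) is an assumption
H = rng.standard_normal((D, T_FRAMES))      # frame-level features from the TDNN body
pooled = attentive_stats_pool(H, 0.1 * rng.standard_normal((A, D)),
                              rng.standard_normal(A), rng.standard_normal(A))
```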
Property               | Value         | Notes
Embedding dimension    | 256           | d-vector output (L2-normalized)
Minimum enrollment     | 6 seconds     | 16kHz mono audio
Verification threshold | cosine > 0.85 | EER < 2.1% on VoxCeleb2
Language coverage      | 17 languages  | Cross-lingual identity preserved
Training data          | VoxCeleb1+2   | 7,205 speakers, 1.2M utterances
Speaker EER            | < 2.1%        | VoxCeleb1-O benchmark
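The verification rule in the table reduces to a cosine test on L2-normalized d-vectors. A minimal sketch, with random vectors standing in for real embeddings:

```python
import numpy as np

def verify(d1, d2, threshold=0.85):
    """Same-speaker decision: cosine similarity of L2-normalized d-vectors,
    using the 0.85 threshold from the table above."""
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    return bool(d1 @ d2 > threshold)

rng = np.random.default_rng(0)
a = rng.standard_normal(256)                           # enrolled d-vector (stand-in)
same = verify(a, a + 0.01 * rng.standard_normal(256))  # tiny perturbation → accept
diff = verify(a, rng.standard_normal(256))             # unrelated vector → reject
```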
The Core Innovation

5-Layer Universal Semantic Representation

Where all other TTS systems stop at translation, Voice Forge builds a deep semantic scaffold before synthesis — preserving not just words but meaning, intent, culture, and personality.

01
Lexical Layer SURFACE FORM
The entry point to the USR pipeline. Raw text is tokenized, morphologically analyzed, and tagged for grammatical function before passing upward.
  • Tokenization: SentencePiece BPE (vocab 32K) — handles agglutinative languages (Turkish, Finnish, Japanese)
  • Lemmatization: reduce inflected forms to the base lemma (e.g., "running" → "run", "geht" → "gehen")
  • Morphology: analyze prefixes, suffixes, compounding — critical for German, Arabic, Finnish
  • POS tagging: Universal Dependencies tagset, 17 UPOS tags (NOUN, VERB, ADJ, ADV, ADP, DET…)
  • Dependency parse: subject/object/modifier arcs → syntactic scaffold
  • Named entities: PER / LOC / ORG / DATE — preserved verbatim, not culturally adapted
02
Semantic Layer DEEP MEANING
Extracts predicate-argument structure and word meaning beyond surface form — enabling cross-lingual alignment at the semantic level rather than lexical level.
  • Semantic Role Labeling: PropBank framesets (e.g., "give.01": ARG0=giver, ARG1=thing, ARG2=recipient); FrameNet coverage for 1,300+ semantic frames
  • Predicate-argument structure: who did what to whom, extracted as a structured event representation
  • Word Sense Disambiguation: WordNet synset mapping (117,000 synsets across 155,000 word forms); context-sensitive: "bank" → financial_institution.n.01 vs river_bank.n.01
  • Coreference: resolve pronouns and noun phrases to canonical entities across sentences; neural coref model (SpanBERT fine-tuned, F1 = 79.6 on OntoNotes)
03
Pragmatic Layer COMMUNICATIVE INTENT
Identifies what the speaker is trying to accomplish with the utterance — the speech act — and detects what is implied but not stated (implicature).
Speech Act Classification — Searle's five categories:
  • ASSERTIVE — speaker commits to the truth of a proposition ("The meeting is at 3pm")
  • DIRECTIVE — speaker attempts to get the hearer to do something ("Please sit down")
  • COMMISSIVE — speaker commits to a future action ("I will call you tomorrow")
  • EXPRESSIVE — speaker expresses a psychological state ("Congratulations!")
  • DECLARATIVE — utterance changes institutional reality ("I now pronounce you…")

Gricean Maxims (implicature detection):
  • Quantity: "some" → implicates "not all"; "possible" → implicates "not certain"
  • Manner: verbose phrasing → implies importance or hesitation
  • Relevance: contextually unexpected content → flagged for pragmatic reanalysis

Presupposition identification:
  • "When did you stop lying?" → presupposes prior lying
  • Presuppositions are preserved across translation to prevent false assertions
04
Cultural Layer CULTURAL INTELLIGENCE
The layer that differentiates Voice Forge from every other TTS system. Not literal translation — cultural equivalence. Politeness hierarchies, idiom localization, and humor transposition.
T-V Distinction Mapping:
  • French: tu (informal) / vous (formal/plural)
  • German: du (informal) / Sie (formal)
  • Spanish: tú (informal) / usted (formal) / vosotros (Spain plural)
  • Source-text social register → correct target-language register selection

Honorific Level Calibration (Japanese keigo system):
  • Teineigo (丁寧語): polite everyday speech — the default for public communication
  • Sonkeigo (尊敬語): exalting the listener — used when addressing superiors
  • Kenjōgo (謙譲語): humbling the speaker — used when speaking about oneself to a superior
  • Speech act + social context → automatic keigo level selection

Idiomatic Expression Adaptation (not literal translation):
  • EN "it's raining cats and dogs" → FR "il pleut des cordes" (it's raining ropes)
  • EN "bite the bullet" → DE "in den sauren Apfel beißen" (bite the sour apple)
  • Detection: idiom lexicon (13,000+ multi-language entries) + context verification
  • Substitution: cultural equivalent, not word-for-word rendering

Humor & Sarcasm:
  • Detection: sentiment-prosody incongruence + pragmatic-layer signal
  • Transposition: joke structure preserved; punchline culturally adapted
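The idiom-adaptation step can be illustrated with a toy lookup. The two English idioms below come from the examples above; the table name, the extra German entry ("es regnet in Strömen"), and the `localize_idiom` helper are illustrative, not the production 13,000-entry lexicon.

```python
# Toy idiom table: keyed by (source language, idiom), mapping to per-language
# cultural equivalents. Contents beyond the examples above are assumptions.
IDIOM_LEXICON = {
    ("en", "it's raining cats and dogs"): {"fr": "il pleut des cordes",
                                           "de": "es regnet in Strömen"},
    ("en", "bite the bullet"):            {"de": "in den sauren Apfel beißen"},
}

def localize_idiom(text, src, tgt):
    """Swap known idioms for cultural equivalents; otherwise pass through."""
    lowered = text.lower()
    for (lang, idiom), targets in IDIOM_LEXICON.items():
        if lang == src and idiom in lowered and tgt in targets:
            return lowered.replace(idiom, targets[tgt])
    return text   # no idiom matched → the normal translation path handles it
```

In the full pipeline this lookup would run after the semantic layer (so a literal mention of animals falling from the sky is not mistakenly "localized"), which is what the context-verification step above refers to.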
05
Prosodic Layer VOICE CHARACTER
The final layer. Extracts and transfers the full prosodic profile of the source speaker — F0 contour, duration, energy, breathing — so the synthesized output sounds like the same person, not just the same words.
F0 Contour Extraction (CREPE — Convolutional Representation for Pitch Estimation):
  • Architecture: 6 convolutional layers, ~21M parameters
  • Resolution: 360 output pitch classes → 1.95-cent pitch accuracy
  • Frame rate: 10ms hop — captures rapid pitch modulations (vibrato, tone sandhi)
  • Output: Hz trajectory → converted to semitones relative to the speaker's range

Duration Modeling (Montreal Forced Aligner — MFA):
  • Acoustic model: Kaldi GMM-HMM, trained on 1,000+ hours
  • Alignment: phoneme-level boundaries (millisecond precision)
  • Output: phoneme timing grid → relative duration ratios preserved in the target

Energy Envelope:
  • Frame-level RMS energy, 10ms frames
  • Emphasis detection: energy peaks co-occurring with F0 peaks = lexical stress
  • Normalized per speaker (z-score) → transferred to the target speaker's dynamic range

Pause Placement (breath group modeling):
  • VAD (Voice Activity Detection): WebRTC VAD + neural override
  • Breath groups: syntactic boundaries (comma, clause, sentence) + respiratory rhythm
  • Silence durations: short pause 80–150ms; breath pause 200–400ms; sentence pause 400–800ms

Speaking Rate Adaptation:
  • Source rate: syllables/second (extracted via MFA alignment)
  • Target normalization: language-specific baseline rate (e.g., Spanish 7.82 syl/s; Mandarin 5.18 syl/s; German 5.97 syl/s)
  • Stretch/compress: WSOLA time-scale modification (preserves pitch during rate change)
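Speaking-rate adaptation boils down to a single time-scale factor handed to WSOLA. A sketch using the baseline rates quoted above; the 6.19 syl/s English source rate is an assumed measurement for the example.

```python
# Language-baseline rates from the bullet list above; the English source
# measurement used below is an assumption made for the example.
BASELINE_SYL_PER_SEC = {"es": 7.82, "zh": 5.18, "de": 5.97}

def stretch_factor(source_syl_per_sec, target_lang):
    """Duration scale handed to WSOLA: <1 compresses (the target language is
    faster), >1 stretches — relative timing within the utterance is kept."""
    return source_syl_per_sec / BASELINE_SYL_PER_SEC[target_lang]
```

Dubbing a 6.19 syl/s English source into Spanish gives a factor of ≈0.79 (compress toward the faster 7.82 syl/s baseline); into German, ≈1.04 (slight stretch). WSOLA then applies the factor without altering pitch.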
Signal Processing

Prosody Transfer Pipeline

The full signal path from source audio to culturally-adapted synthesized output — preserving emotion, emphasis, rhythm, breathing, and personality.

SOURCE AUDIO INPUT
  ├── CREPE (neural pitch estimator)
  │     → F0 contour (Hz), 10ms frames, 1.95-cent resolution
  │     → Voiced/unvoiced decision (U/V flag per frame)
  ├── Montreal Forced Aligner (MFA)
  │     → Phoneme boundary timestamps (ms precision)
  │     → Duration ratio per phoneme: d_i / d_mean
  ├── RMS Energy Extractor
  │     → Frame-level energy envelope (dB, 10ms frames)
  │     → Lexical stress map (energy + F0 peaks)
  └── Voice Activity Detection (VAD)
        → Breath group boundaries
        → Pause durations (short / breath / sentence)
        ▼
DYNAMIC TIME WARPING (DTW) — CROSS-LINGUAL ALIGNMENT
  ├── Align source phoneme sequence → target phoneme sequence
  │     (phoneme-to-phoneme matching via edit distance + acoustic distance)
  ├── Transfer F0 contour
  │     → Pitch-shifted to target speaker's mean ± std range
  │     → Boundary tones (rising/falling) preserved in target intonation grammar
  ├── Transfer duration ratios (not absolute durations)
  │     → Relative timing preserved: stressed syllables remain longer
  │     → Absolute timing adapted to target-language speaking rate
  └── Transfer energy contour
        → Emphasis peaks preserved (lexical stress, focus)
        → Level normalized to target speaker dynamic range
        ▼
XTTS v2 SYNTHESIS
  Conditioned on: [d_vector_speaker | prosodic_blueprint | phoneme_embeddings]
        ▼
OUTPUT: 24kHz WAVEFORM
  Preserves: voice identity, emotional register, stress patterns, breathing, rhythm, personality
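The DTW stage is standard dynamic programming. Below is a numpy sketch that aligns two 1-D contours (toy F0 trajectories); the production aligner matches phoneme sequences with a combined edit + acoustic distance, but the recursion and backtracking are the same.

```python
import numpy as np

def dtw_align(src, tgt):
    """Dynamic time warping over two 1-D contours; squared-difference cost.
    Returns the optimal warping path as (src_index, tgt_index) pairs."""
    cost = (src[:, None] - tgt[None, :]) ** 2
    n, m = cost.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):            # classic DTW recursion
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], n, m                    # backtrack from the corner
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1, j - 1], i - 1, j - 1),
                      (D[i - 1, j], i - 1, j),
                      (D[i, j - 1], i, j - 1))
    return path[::-1]

f0_src = np.array([120.0, 125.0, 140.0, 180.0, 160.0, 130.0])  # toy source F0 (Hz)
f0_tgt = np.array([120.0, 140.0, 180.0, 160.0, 130.0])         # shorter target
path = dtw_align(f0_src, f0_tgt)
```

Once the path is known, the F0, duration-ratio, and energy values are copied along it, which is how the transfer steps above preserve relative timing rather than absolute durations.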
Competitive Advantage

Cost Deep Dive — 375× Advantage

Self-hosted XTTS v2 annihilates SaaS TTS pricing — while being the only option that delivers cultural preservation.

Provider                          | Cost / 1K chars | Voice Cloning | Languages    | Cultural Layer | Notes
ElevenLabs                        | $0.300          | Yes           | 29           | None           | Best-quality SaaS — but zero cultural intelligence
Google Cloud TTS                  | $0.016          | No            | 220+         | None           | Many languages; robotic prosody; no cloning
Amazon Polly Neural               | $0.016          | No            | 29           | None           | Limited emotional range; no voice cloning
Azure Neural TTS                  | $0.016          | Custom voice  | 140+         | None           | Custom voice: $24K setup fee + ongoing compute
Voice Forge (XTTS v2 self-hosted) | $0.0008         | Yes — 6s ref  | 17 (growing) | 5-Layer USR    | 375× cheaper than ElevenLabs; culturally superior
Cost model derivation (self-hosted)
  • Compute: RTX 4090 GPU ($800/mo cloud equivalent)
  • Throughput: 66× RTF → 66 minutes of audio processed per minute of GPU time
  • Characters/hour: ~4.2M (average 250 chars/min of speech)
  • Cost/char: $800 / (730 hrs × 4.2M chars/hr) ≈ $0.00000026/char
  • Cost/1K: $0.00026 → rounded up to $0.0008/1K (including infra + margin)

Competitive margin at $0.01/1K chars (target retail price)
  • Gross margin at retail: ($0.01 − $0.0008) / $0.01 = 92%
  • vs ElevenLabs at $0.30: customer saves $0.29/1K chars → massive switching incentive
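The derivation is easy to re-check in code; the figures below are exactly the ones quoted above, reproduced as arithmetic.

```python
# Re-deriving the unit economics quoted above, figure by figure.
gpu_month_usd = 800.0           # RTX 4090 cloud-equivalent, per month
hours_per_month = 730
chars_per_hour = 4.2e6          # stated self-hosted throughput

cost_per_char = gpu_month_usd / (hours_per_month * chars_per_hour)
cost_per_1k_raw = cost_per_char * 1000           # ≈ $0.00026 per 1K chars
price_per_1k = 0.0008                            # loaded price (infra + margin)
retail = 0.01                                    # target retail price / 1K chars
gross_margin = (retail - price_per_1k) / retail  # ≈ 92%
elevenlabs_multiple = 0.30 / price_per_1k        # ≈ 375×
```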
Revenue Model

Pricing & Packaging

API-first pricing with a creator tier and enterprise cultural licensing track.

Creator
$99
per month
  • 5M characters / month
  • Voice cloning — up to 10 voices
  • All 17 languages
  • 3-Layer USR (Lexical + Semantic + Prosodic)
  • API access
  • Standard SLA
Enterprise
$2K+
per month (custom)
  • Unlimited characters
  • On-premise / VPC deployment
  • Custom language model fine-tuning
  • Brand voice creation service
  • SLA: 99.9% uptime guarantee
  • Dedicated technical success manager
Financial

Revenue Projections

API-first SaaS with strong enterprise expansion. Cultural layer creates a defensible premium over commodity TTS providers.

Segment                                | Year 1 | Year 2 | Year 3
Creator subscriptions ($99/mo)         | $72K   | $288K  | $720K
Professional subscriptions ($499/mo)   | $180K  | $720K  | $2.4M
Enterprise contracts ($2K+/mo)         | $96K   | $480K  | $1.8M
Usage overages ($/1K chars above plan) | $24K   | $120K  | $480K
Total ARR                              | $372K  | $1.61M | $5.40M
$18K — Infra / Launch
92% — Gross Margin
< 6mo — Payback (Pro)
$0 — API Provider Fees
Market

Market Opportunity

Voice AI is undergoing a platform shift — from single-language TTS to multilingual, culturally-aware voice agents. Voice Forge is positioned at the leading edge.

Voice AI TAM — $56B — Global Voice AI market, 2030 (Grand View Research)
TTS SAM — $9.2B — Text-to-speech market, 2028E
Localization SOM — $2.1B — AI dubbing + voice localization, 2026E
Switching Cost — −99.7% — Cost reduction vs ElevenLabs at equivalent volume
Why Voice Forge Wins
Execution

Build Status — 95% Complete

COMPLETE (95%)
  • XTTS v2 core integration — GPT-2 decoder + HiFi-GAN
  • ECAPA-TDNN speaker encoder — 6s enrollment
  • CREPE F0 extraction pipeline
  • Montreal Forced Aligner integration
  • DTW prosody transfer algorithm
  • Lexical layer (BPE, POS, morphology)
  • Semantic layer (SRL, WSD, coref)
  • Cultural layer (T-V, keigo, idiom DB)
  • Prosodic layer (F0, duration, energy, VAD)
  • FastAPI REST API scaffolding
  • 17-language support core
REMAINING 5%
  • Production API hardening + rate limiting
  • Stripe billing integration
  • Vercel / cloud deployment pipeline
  • Developer dashboard (usage, voice library)
  • SSML extended tag support
  • Batch processing queue (Redis + Celery)
  • Webhook delivery system
  • Full test suite + load testing