NexiMedia
NEXI — CONFIDENTIAL — VOICE FORGE — INTERNAL VENTURE BRIEF — NOT FOR DISTRIBUTION
Cultural Voice AI

VOICE
FORGE

Cultural Voice AI — 5-Layer Universal Semantic Representation

95% CODE COMPLETE  ·  XTTS v2 Core  ·  5-Layer USR
$56B — Voice AI Market 2030
95% — Code Complete
375× — Cheaper Than ElevenLabs
5-Layer USR
Core Architecture

XTTS v2 Architecture

A transformer-based TTS system combining GPT-2 autoregressive decoding, latent diffusion refinement, and HiFi-GAN neural vocoding — all conditioned on a 256-dimensional speaker embedding.

01 — Feature Extraction & Text Processing ENCODER

The acoustic front-end converts raw audio to 80-channel log-Mel spectrograms, while the text pipeline converts graphemes to phonemes (G2P) and produces phoneme embeddings that condition the autoregressive decoder.

Log-Mel Spectrogram (acoustic features)
  • Window: 25ms Hann window, 10ms hop, 1024-point FFT
  • Channels: 80 Mel filterbanks (20 Hz – 8,000 Hz, HTK scale)
  • Output: X ∈ ℝ^{80 × T}, where T = ⌈audio_len / hop_size⌉

Grapheme-to-Phoneme (G2P) pipeline
  • Step 1: Lexicon lookup (CMU Pronouncing Dictionary, 134K entries)
  • Step 2: Neural G2P fallback (Seq2Seq LSTM, 97.8% word accuracy)
  • Step 3: IPA transcription → phoneme token sequence
  • Step 4: Phoneme embedding lookup: E ∈ ℝ^{|vocab| × 512}

Subword encoding for out-of-lexicon words
  • SentencePiece BPE tokenizer, vocabulary size = 1,024
  • Handles proper nouns, neologisms, and code-mixed text
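The acoustic front-end is compact enough to sketch end-to-end. Below is a minimal numpy implementation of the log-Mel extractor using the stated configuration (25ms Hann window, 10ms hop, 1024-point FFT, 80 HTK-scale bands, 20 Hz–8 kHz); for simplicity it frames without padding, so T comes out slightly smaller than the ⌈audio_len / hop_size⌉ of the padded convention.

```python
import numpy as np

def hz_to_mel(f):
    # HTK mel scale, as in the front-end spec above
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=1024, sr=16000, fmin=20.0, fmax=8000.0):
    # Triangular filters spaced evenly on the mel scale (20 Hz – 8 kHz, 80 bands)
    pts = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def log_mel(audio, sr=16000, n_fft=1024, win=400, hop=160, n_mels=80):
    # 25 ms window / 10 ms hop at 16 kHz → win = 400, hop = 160 samples
    n_frames = 1 + (len(audio) - win) // hop
    window = np.hanning(win)
    frames = np.stack([audio[t * hop : t * hop + win] * window
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T   # (T, 80)
    return np.log(mel + 1e-10).T                        # X ∈ ℝ^{80 × T}

X = log_mel(np.random.default_rng(0).standard_normal(16000))  # 1 s of noise
```

One second of 16 kHz audio yields 98 frames here; a production front-end (e.g. librosa with centered padding) would give the full ⌈16000/160⌉ = 100.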
02 — GPT-2 Autoregressive Decoder 774M PARAMS

The core generative model is a GPT-2 language model retrained to predict latent acoustic tokens rather than text tokens. The model is conditioned on both the text phoneme sequence and the 256-dimensional speaker embedding (d-vector), enabling voice cloning with as little as 6 seconds of reference audio.

Model architecture
  • Parameters: 774M (GPT-2 Large scale)
  • Layers: 36 transformer decoder blocks
  • Attention: 20 heads × 64-dim per head = 1,280-dim embeddings
  • Context: 512 acoustic tokens (≈ 5.3 seconds of audio at 96 tok/s)

Autoregressive acoustic token prediction
  • Input: [phoneme_embed(text), d_vector_speaker, acoustic_tokens[0:t]]
  • Output: P(a_{t+1} | a_{1:t}, text, speaker)
  • Codebook: EnCodec 24kHz, 8 codebooks × 1,024 codes each

Speaker conditioning (cross-attention injection)
  • d_vector ∈ ℝ^256 → projected to ℝ^1280 via a learned affine layer
  • Injected at every decoder layer via cross-attention: K, V derived from d_vector

Training objective
  • L = −Σ_t log P(a_t | a_{<t}, text, speaker) + λ · L_speaker_consistency
  • (cross-entropy on acoustic tokens + speaker-identity regularization)
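To make the conditioning flow concrete, here is a toy numpy sketch of one autoregressive step: project the 256-dim d-vector to the 1,280-dim model width via an affine map, mix it into a (heavily simplified) context summary, and pick the next acoustic token from a 1,024-code codebook. The random weights and the mean-pooled "context" are stand-ins; the production model uses the 36-block transformer with cross-attention described above.

```python
import numpy as np

# Toy stand-ins for the 774M decoder: dimensions follow the spec above,
# weights are random, and "context" is a mean-pool rather than 36 attention blocks.
D_MODEL, D_SPK, VOCAB = 1280, 256, 1024
rng = np.random.default_rng(0)
W_spk = rng.normal(0, 0.02, (D_SPK, D_MODEL))    # learned affine: ℝ^256 → ℝ^1280
W_out = rng.normal(0, 0.02, (D_MODEL, VOCAB))    # projection to codebook logits
token_emb = rng.normal(0, 0.02, (VOCAB, D_MODEL))

def decode_step(history, d_vector):
    """One autoregressive step: P(a_{t+1} | a_{1:t}, text, speaker)."""
    ctx = history.mean(axis=0)          # simplified context summary
    h = ctx + d_vector @ W_spk          # speaker injection (real model: cross-attn K, V)
    return int(np.argmax(h @ W_out))    # greedy decoding for the sketch

d_vector = rng.standard_normal(D_SPK)            # speaker embedding
hist = rng.normal(0, 0.02, (1, D_MODEL))         # phoneme-conditioned prefix (toy)
tokens = []
for _ in range(8):                               # emit 8 acoustic tokens
    t = decode_step(hist, d_vector)
    tokens.append(t)
    hist = np.vstack([hist, token_emb[t]])       # feed prediction back in
```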
Component            | Parameters | Architecture | Note
Transformer backbone | 774M       | GPT-2 Large  | Pre-trained on LibriSpeech + Common Voice
Speaker encoder      | ~1.2M      | ECAPA-TDNN   | 256-dim d-vector output
Latent diffusion     | ~85M       | U-Net        | 50-step DDPM refinement
HiFi-GAN vocoder     | 14M        | GAN          | 24kHz waveform synthesis
Total system         | ~874M      | End-to-end   | Single-GPU inference (RTX 3090)
03 — Latent Diffusion + HiFi-GAN Vocoder 24kHz OUTPUT

Latent Diffusion Refinement: The GPT-2 decoder outputs a coarse latent spectrogram, which is passed through a U-Net diffusion model for detail enhancement. 50 denoising steps (DDIM sampling at inference) recover fine spectral structure lost in autoregressive prediction.

DDPM forward / reverse
  • Forward: q(x_t | x_{t−1}) = N(√(1−β_t) x_{t−1}, β_t I)
  • Reverse: p_θ(x_{t−1} | x_t) = N(μ_θ(x_t, t), σ²_t I)
  • Score network: ε_θ(x_t, t) ← U-Net (3 resolution levels, skip connections)
  • Steps: T = 1,000 training; T = 50 inference (DDIM)
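The reverse update maps directly to code. Below is a self-contained numpy sketch of one ancestral DDPM step under the standard ε-parameterization, with an assumed linear β schedule (the brief does not specify one) and a zero ε-predictor standing in for the U-Net so the full training-time chain is runnable; inference would instead take the 50 DDIM steps noted above.

```python
import numpy as np

# One ancestral DDPM reverse step, ε-parameterization. The linear β schedule
# and the zero ε-predictor are assumptions made so the sketch runs standalone.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def ddpm_reverse_step(x_t, t, eps_pred, rng):
    """Sample x_{t-1} ~ p_θ(x_{t-1}|x_t) = N(μ_θ(x_t, t), σ²_t I)."""
    mu = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mu                                   # final step is deterministic
    return mu + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)  # σ²_t = β_t

rng = np.random.default_rng(0)
x = rng.standard_normal((80, 32))   # coarse latent spectrogram (80 mel × 32 frames)
for t in reversed(range(T)):        # full 1,000-step reverse chain
    x = ddpm_reverse_step(x, t, np.zeros_like(x), rng)
```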

HiFi-GAN Neural Vocoder: The refined mel-spectrogram is converted to a 24kHz waveform by HiFi-GAN. The generator uses residual dilated convolutions; discriminators operate at multiple periods and scales to enforce waveform realism.

HiFi-GAN discriminator stack
  • MPD (multi-period): periods [2, 3, 5, 7, 11] — captures periodicity
  • MSD (multi-scale): scales [1, 2, 4] — captures multi-resolution structure
  • Loss: L_adv + λ_fm · L_feature + λ_mel · L_mel
  • RTF: 0.015 (≈66× faster than real time, single GPU)
Voice Cloning

Speaker Embedding — ECAPA-TDNN

Six seconds of reference audio is all it takes to clone a voice, with identity preserved across language boundaries.

ECAPA-TDNN — Emphasized Channel Attention, Propagation and Aggregation 256-DIM

ECAPA-TDNN is the current state-of-the-art speaker verification architecture, combining temporal convolutions with squeeze-and-excitation channel attention and attentive statistics pooling to produce a fixed-dimensional speaker representation independent of utterance length.

ECAPA-TDNN architecture
  • Input: raw waveform → MFCC (80-dim, 25ms window, 10ms hop)
  • Body: 5 × SE-Res2Block (1-D dilated convolution + SE channel attention)
  • Dilation schedule: [1, 2, 3, 4] (multi-scale temporal receptive fields)
  • SE ratio: 8 (squeeze 512 → 64 → 512 channels)

Multi-scale temporal aggregation
  • Concatenate outputs from all SE-Res2Blocks channel-wise
  • Multi-layer feature aggregation before pooling

Attentive Statistics Pooling
  • Attention: e_t = v^T tanh(W·h_t + b) (scalar attention weight per frame)
  • Weights: α_t = softmax(e_t)
  • Mean: μ = Σ_t α_t · h_t
  • Std: σ = √(Σ_t α_t · h_t² − μ²)
  • Concat: [μ; σ] → linear → L2-normalize → d-vector ∈ ℝ^256

Enrollment & verification
  • Reference: 6 seconds, 16kHz mono (minimum enrollment)
  • Threshold: cosine_similarity(d_v1, d_v2) > 0.85 → same speaker
  • Cross-lingual: the d-vector encodes voice timbre, not language — clones across 17 languages
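Attentive statistics pooling is compact enough to write out directly. The numpy sketch below follows the formulas above (scalar attention per frame, attention-weighted mean and standard deviation, concatenation); the 128-dim attention bottleneck and the random weights are illustrative assumptions.

```python
import numpy as np

def attentive_stats_pool(H, W, b, v):
    """Attentive statistics pooling per the formulas above:
    e_t = v^T tanh(W h_t + b);  α = softmax(e);  output [μ; σ]."""
    e = v @ np.tanh(W @ H + b[:, None])     # scalar attention per frame, shape (T,)
    a = np.exp(e - e.max())
    a /= a.sum()                            # α_t = softmax(e_t)
    mu = H @ a                              # μ = Σ_t α_t h_t
    var = (H ** 2) @ a - mu ** 2            # σ² = Σ_t α_t h_t² − μ²
    sigma = np.sqrt(np.maximum(var, 1e-9))  # clamp tiny negatives from rounding
    return np.concatenate([mu, sigma])      # later: linear → L2-norm → ℝ^256 d-vector

rng = np.random.default_rng(0)
D, T_FRAMES, A = 512, 100, 128              # A (attention dim) is an assumption
H = rng.standard_normal((D, T_FRAMES))      # frame-level features from the TDNN body
pooled = attentive_stats_pool(H, 0.1 * rng.standard_normal((A, D)),
                              rng.standard_normal(A), rng.standard_normal(A))
```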
Property               | Value         | Notes
Embedding dimension    | 256           | d-vector output (L2-normalized)
Minimum enrollment     | 6 seconds     | 16kHz mono audio
Verification threshold | cosine > 0.85 | EER < 2.1% on VoxCeleb2
Language coverage      | 17 languages  | Cross-lingual identity preserved
Training data          | VoxCeleb1+2   | 7,205 speakers, 1.2M utterances
Speaker EER            | < 2.1%        | VoxCeleb1-O benchmark
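The verification rule in the table reduces to a cosine test on L2-normalized d-vectors. A minimal sketch, with random vectors standing in for real embeddings:

```python
import numpy as np

def verify(d1, d2, threshold=0.85):
    """Same-speaker decision: cosine similarity of L2-normalized d-vectors,
    using the 0.85 threshold from the table above."""
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    return bool(d1 @ d2 > threshold)

rng = np.random.default_rng(0)
a = rng.standard_normal(256)                           # enrolled d-vector (stand-in)
same = verify(a, a + 0.01 * rng.standard_normal(256))  # tiny perturbation → accept
diff = verify(a, rng.standard_normal(256))             # unrelated vector → reject
```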
The Core Innovation

5-Layer Universal Semantic Representation

Where all other TTS systems stop at translation, Voice Forge builds a deep semantic scaffold before synthesis — preserving not just words but meaning, intent, culture, and personality.

01
Lexical Layer SURFACE FORM
The entry point to the USR pipeline. Raw text is tokenized, morphologically analyzed, and tagged for grammatical function before passing upward.
  • Tokenization: SentencePiece BPE (vocab 32K) — handles agglutinative languages (Turkish, Finnish, Japanese)
  • Lemmatization: reduce inflected forms to the base lemma (e.g., "running" → "run", "geht" → "gehen")
  • Morphology: analyze prefixes, suffixes, compounding — critical for German, Arabic, Finnish
  • POS tagging: Universal Dependencies tagset, 17 UPOS tags (NOUN, VERB, ADJ, ADV, ADP, DET…)
  • Dependency parse: subject/object/modifier arcs → syntactic scaffold
  • Named entities: PER / LOC / ORG / DATE — preserved verbatim, not culturally adapted
02
Semantic Layer DEEP MEANING
Extracts predicate-argument structure and word meaning beyond surface form — enabling cross-lingual alignment at the semantic level rather than lexical level.
  • Semantic Role Labeling: PropBank framesets (e.g., "give.01": ARG0=giver, ARG1=thing, ARG2=recipient); FrameNet coverage for 1,300+ semantic frames
  • Predicate-argument structure: who did what to whom, extracted as a structured event representation
  • Word Sense Disambiguation: WordNet synset mapping (117,000 synsets across 155,000 word forms); context-sensitive: "bank" → financial_institution.n.01 vs river_bank.n.01
  • Coreference: resolve pronouns and noun phrases to canonical entities across sentences; neural coref model (SpanBERT fine-tuned, F1 = 79.6 on OntoNotes)
03
Pragmatic Layer COMMUNICATIVE INTENT
Identifies what the speaker is trying to accomplish with the utterance — the speech act — and detects what is implied but not stated (implicature).
Speech Act Classification — Searle's five categories:
  • ASSERTIVE — speaker commits to the truth of a proposition ("The meeting is at 3pm")
  • DIRECTIVE — speaker attempts to get the hearer to do something ("Please sit down")
  • COMMISSIVE — speaker commits to a future action ("I will call you tomorrow")
  • EXPRESSIVE — speaker expresses a psychological state ("Congratulations!")
  • DECLARATIVE — utterance changes institutional reality ("I now pronounce you…")

Gricean Maxims (implicature detection):
  • Quantity: "some" → implicates "not all"; "possible" → implicates "not certain"
  • Manner: verbose phrasing → implies importance or hesitation
  • Relevance: contextually unexpected content → flagged for pragmatic reanalysis

Presupposition identification:
  • "When did you stop lying?" → presupposes prior lying
  • Presuppositions are preserved across translation to prevent false assertions
04
Cultural Layer CULTURAL INTELLIGENCE
The layer that differentiates Voice Forge from every other TTS system. Not literal translation — cultural equivalence. Politeness hierarchies, idiom localization, and humor transposition.
T-V Distinction Mapping:
  • French: tu (informal) / vous (formal/plural)
  • German: du (informal) / Sie (formal)
  • Spanish: tú (informal) / usted (formal) / vosotros (Spain plural)
  • Source-text social register → correct target-language register selection

Honorific Level Calibration (Japanese keigo system):
  • Teineigo (丁寧語): polite everyday speech — the default for public communication
  • Sonkeigo (尊敬語): exalting the listener — used when addressing superiors
  • Kenjōgo (謙譲語): humbling the speaker — used when speaking about oneself to a superior
  • Speech act + social context → automatic keigo level selection

Idiomatic Expression Adaptation (not literal translation):
  • EN "it's raining cats and dogs" → FR "il pleut des cordes" (it's raining ropes)
  • EN "bite the bullet" → DE "in den sauren Apfel beißen" (bite the sour apple)
  • Detection: idiom lexicon (13,000+ multi-language entries) + context verification
  • Substitution: cultural equivalent, not word-for-word rendering

Humor & Sarcasm:
  • Detection: sentiment-prosody incongruence + pragmatic-layer signal
  • Transposition: joke structure preserved; punchline culturally adapted
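The idiom-adaptation step can be illustrated with a toy lookup. The two English idioms below come from the examples above; the table name, the extra German entry ("es regnet in Strömen"), and the `localize_idiom` helper are illustrative, not the production 13,000-entry lexicon.

```python
# Toy idiom table: keyed by (source language, idiom), mapping to per-language
# cultural equivalents. Contents beyond the examples above are assumptions.
IDIOM_LEXICON = {
    ("en", "it's raining cats and dogs"): {"fr": "il pleut des cordes",
                                           "de": "es regnet in Strömen"},
    ("en", "bite the bullet"):            {"de": "in den sauren Apfel beißen"},
}

def localize_idiom(text, src, tgt):
    """Swap known idioms for cultural equivalents; otherwise pass through."""
    lowered = text.lower()
    for (lang, idiom), targets in IDIOM_LEXICON.items():
        if lang == src and idiom in lowered and tgt in targets:
            return lowered.replace(idiom, targets[tgt])
    return text   # no idiom matched → the normal translation path handles it
```

In the full pipeline this lookup would run after the semantic layer (so a literal mention of animals falling from the sky is not mistakenly "localized"), which is what the context-verification step above refers to.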
05
Prosodic Layer VOICE CHARACTER
The final layer. Extracts and transfers the full prosodic profile of the source speaker — F0 contour, duration, energy, breathing — so the synthesized output sounds like the same person, not just the same words.
F0 Contour Extraction (CREPE — Convolutional Representation for Pitch Estimation):
  • Architecture: 6 convolutional layers, ~21M parameters
  • Resolution: 360 output pitch classes → 1.95-cent pitch accuracy
  • Frame rate: 10ms hop — captures rapid pitch modulations (vibrato, tone sandhi)
  • Output: Hz trajectory → converted to semitones relative to the speaker's range

Duration Modeling (Montreal Forced Aligner — MFA):
  • Acoustic model: Kaldi GMM-HMM, trained on 1,000+ hours
  • Alignment: phoneme-level boundaries (millisecond precision)
  • Output: phoneme timing grid → relative duration ratios preserved in the target

Energy Envelope:
  • Frame-level RMS energy, 10ms frames
  • Emphasis detection: energy peaks co-occurring with F0 peaks = lexical stress
  • Normalized per speaker (z-score) → transferred to the target speaker's dynamic range

Pause Placement (breath group modeling):
  • VAD (Voice Activity Detection): WebRTC VAD + neural override
  • Breath groups: syntactic boundaries (comma, clause, sentence) + respiratory rhythm
  • Silence durations: short pause 80–150ms; breath pause 200–400ms; sentence pause 400–800ms

Speaking Rate Adaptation:
  • Source rate: syllables/second (extracted via MFA alignment)
  • Target normalization: language-specific baseline rate (e.g., Spanish 7.82 syl/s; Mandarin 5.18 syl/s; German 5.97 syl/s)
  • Stretch/compress: WSOLA time-scale modification (preserves pitch during rate change)
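Speaking-rate adaptation boils down to a single time-scale factor handed to WSOLA. A sketch using the baseline rates quoted above; the 6.19 syl/s English source rate is an assumed measurement for the example.

```python
# Language-baseline rates from the bullet list above; the English source
# measurement used below is an assumption made for the example.
BASELINE_SYL_PER_SEC = {"es": 7.82, "zh": 5.18, "de": 5.97}

def stretch_factor(source_syl_per_sec, target_lang):
    """Duration scale handed to WSOLA: <1 compresses (the target language is
    faster), >1 stretches — relative timing within the utterance is kept."""
    return source_syl_per_sec / BASELINE_SYL_PER_SEC[target_lang]
```

Dubbing a 6.19 syl/s English source into Spanish gives a factor of ≈0.79 (compress toward the faster 7.82 syl/s baseline); into German, ≈1.04 (slight stretch). WSOLA then applies the factor without altering pitch.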
Signal Processing

Prosody Transfer Pipeline

The full signal path from source audio to culturally-adapted synthesized output — preserving emotion, emphasis, rhythm, breathing, and personality.

SOURCE AUDIO INPUT
  ├── CREPE (neural pitch estimator)
  │     → F0 contour (Hz), 10ms frames, 1.95-cent resolution
  │     → Voiced/unvoiced decision (U/V flag per frame)
  ├── Montreal Forced Aligner (MFA)
  │     → Phoneme boundary timestamps (ms precision)
  │     → Duration ratio per phoneme: d_i / d_mean
  ├── RMS Energy Extractor
  │     → Frame-level energy envelope (dB, 10ms frames)
  │     → Lexical stress map (energy + F0 peaks)
  └── Voice Activity Detection (VAD)
        → Breath group boundaries
        → Pause durations (short / breath / sentence)
        ▼
DYNAMIC TIME WARPING (DTW) — CROSS-LINGUAL ALIGNMENT
  ├── Align source phoneme sequence → target phoneme sequence
  │     (phoneme-to-phoneme matching via edit distance + acoustic distance)
  ├── Transfer F0 contour
  │     → Pitch-shifted to target speaker's mean ± std range
  │     → Boundary tones (rising/falling) preserved in target intonation grammar
  ├── Transfer duration ratios (not absolute durations)
  │     → Relative timing preserved: stressed syllables remain longer
  │     → Absolute timing adapted to target-language speaking rate
  └── Transfer energy contour
        → Emphasis peaks preserved (lexical stress, focus)
        → Level normalized to target speaker dynamic range
        ▼
XTTS v2 SYNTHESIS
  Conditioned on: [d_vector_speaker | prosodic_blueprint | phoneme_embeddings]
        ▼
OUTPUT: 24kHz WAVEFORM
  Preserves: voice identity, emotional register, stress patterns, breathing, rhythm, personality
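The DTW stage is standard dynamic programming. Below is a numpy sketch that aligns two 1-D contours (toy F0 trajectories); the production aligner matches phoneme sequences with a combined edit + acoustic distance, but the recursion and backtracking are the same.

```python
import numpy as np

def dtw_align(src, tgt):
    """Dynamic time warping over two 1-D contours; squared-difference cost.
    Returns the optimal warping path as (src_index, tgt_index) pairs."""
    cost = (src[:, None] - tgt[None, :]) ** 2
    n, m = cost.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):            # classic DTW recursion
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], n, m                    # backtrack from the corner
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1, j - 1], i - 1, j - 1),
                      (D[i - 1, j], i - 1, j),
                      (D[i, j - 1], i, j - 1))
    return path[::-1]

f0_src = np.array([120.0, 125.0, 140.0, 180.0, 160.0, 130.0])  # toy source F0 (Hz)
f0_tgt = np.array([120.0, 140.0, 180.0, 160.0, 130.0])         # shorter target
path = dtw_align(f0_src, f0_tgt)
```

Once the path is known, the F0, duration-ratio, and energy values are copied along it, which is how the transfer steps above preserve relative timing rather than absolute durations.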
Competitive Advantage

Cost Deep Dive — 375× Advantage

Self-hosted XTTS v2 annihilates SaaS TTS pricing — while being the only option that delivers cultural preservation.

Provider                          | Cost / 1K chars | Voice Cloning | Languages    | Cultural Layer | Notes
ElevenLabs                        | $0.300          | Yes           | 29           | None           | Best-quality SaaS — but zero cultural intelligence
Google Cloud TTS                  | $0.016          | No            | 220+         | None           | Many languages; robotic prosody; no cloning
Amazon Polly Neural               | $0.016          | No            | 29           | None           | Limited emotional range; no voice cloning
Azure Neural TTS                  | $0.016          | Custom voice  | 140+         | None           | Custom voice: $24K setup fee + ongoing compute
Voice Forge (XTTS v2 self-hosted) | $0.0008         | Yes — 6s ref  | 17 (growing) | 5-Layer USR    | 375× cheaper than ElevenLabs; culturally superior
Cost model derivation (self-hosted)
  • Compute: RTX 4090 GPU ($800/mo cloud equivalent)
  • Throughput: 66× RTF → 66 minutes of audio processed per minute of GPU time
  • Characters/hour: ~4.2M (average 250 chars/min of speech)
  • Cost/char: $800 / (730 hrs × 4.2M chars/hr) ≈ $0.00000026/char
  • Cost/1K: $0.00026 → rounded up to $0.0008/1K (including infra + margin)

Competitive margin at $0.01/1K chars (target retail price)
  • Gross margin at retail: ($0.01 − $0.0008) / $0.01 = 92%
  • vs ElevenLabs at $0.30: customer saves $0.29/1K chars → massive switching incentive
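The derivation is easy to re-check in code; the figures below are exactly the ones quoted above, reproduced as arithmetic.

```python
# Re-deriving the unit economics quoted above, figure by figure.
gpu_month_usd = 800.0           # RTX 4090 cloud-equivalent, per month
hours_per_month = 730
chars_per_hour = 4.2e6          # stated self-hosted throughput

cost_per_char = gpu_month_usd / (hours_per_month * chars_per_hour)
cost_per_1k_raw = cost_per_char * 1000           # ≈ $0.00026 per 1K chars
price_per_1k = 0.0008                            # loaded price (infra + margin)
retail = 0.01                                    # target retail price / 1K chars
gross_margin = (retail - price_per_1k) / retail  # ≈ 92%
elevenlabs_multiple = 0.30 / price_per_1k        # ≈ 375×
```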
Revenue Model

Pricing & Packaging

API-first pricing with a creator tier and enterprise cultural licensing track.

Creator
$99
per month
  • 5M characters / month
  • Voice cloning — up to 10 voices
  • All 17 languages
  • 3-Layer USR (Lexical + Semantic + Prosodic)
  • API access
  • Standard SLA
Enterprise
$2K+
per month (custom)
  • Unlimited characters
  • On-premise / VPC deployment
  • Custom language model fine-tuning
  • Brand voice creation service
  • SLA: 99.9% uptime guarantee
  • Dedicated technical success manager
Financial

Revenue Projections

API-first SaaS with strong enterprise expansion. Cultural layer creates a defensible premium over commodity TTS providers.

Segment                                | Year 1 | Year 2 | Year 3
Creator subscriptions ($99/mo)         | $72K   | $288K  | $720K
Professional subscriptions ($499/mo)   | $180K  | $720K  | $2.4M
Enterprise contracts ($2K+/mo)         | $96K   | $480K  | $1.8M
Usage overages ($/1K chars above plan) | $24K   | $120K  | $480K
Total ARR                              | $372K  | $1.61M | $5.40M
$18K — Infra / Launch
92% — Gross Margin
< 6mo — Payback (Pro)
$0 — API Provider Fees
Market

Market Opportunity

Voice AI is undergoing a platform shift — from single-language TTS to multilingual, culturally-aware voice agents. Voice Forge is positioned at the leading edge.

Voice AI TAM — $56B — Global Voice AI market, 2030 (Grand View Research)
TTS SAM — $9.2B — Text-to-speech market, 2028E
Localization SOM — $2.1B — AI dubbing + voice localization, 2026E
Switching Cost — −99.7% — Cost reduction vs ElevenLabs at equivalent volume
Why Voice Forge Wins
Execution

Build Status — 95% Complete

COMPLETE (95%)
  • XTTS v2 core integration — GPT-2 decoder + HiFi-GAN
  • ECAPA-TDNN speaker encoder — 6s enrollment
  • CREPE F0 extraction pipeline
  • Montreal Forced Aligner integration
  • DTW prosody transfer algorithm
  • Lexical layer (BPE, POS, morphology)
  • Semantic layer (SRL, WSD, coref)
  • Cultural layer (T-V, keigo, idiom DB)
  • Prosodic layer (F0, duration, energy, VAD)
  • FastAPI REST API scaffolding
  • 17-language support core
REMAINING 5%
  • Production API hardening + rate limiting
  • Stripe billing integration
  • Vercel / cloud deployment pipeline
  • Developer dashboard (usage, voice library)
  • SSML extended tag support
  • Batch processing queue (Redis + Celery)
  • Webhook delivery system
  • Full test suite + load testing