Culturally-intelligent voice translation for the global creator economy. Self-hosted XTTS v2 + CREPE F0 transfer + 5-Layer USR. 375x cheaper than ElevenLabs. Your voice. Your cadence. Every language.
Three signals converging into a single, unmistakable opportunity — and a 375x structural cost advantage competitors cannot match.
Six sequential stages from raw source video to dubbed output with preserved prosody. Each stage uses a peer-validated model with documented performance benchmarks.
MDX-Net (Music Demixing Network) decomposes the source audio track into two stems: the vocal foreground and the background accompaniment (music, ambient, SFX). The architecture uses a U-Net with band-split RNN processing, operating in the time-frequency domain via Short-Time Fourier Transform (STFT) with a hop size of 512 samples at 44.1 kHz.
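The time-frequency masking idea behind this stage can be sketched in a few lines. This is a minimal illustration of STFT-domain stem separation with the quoted hop size of 512 samples at 44.1 kHz — not MDX-Net's band-split RNN itself, and the 2048-sample window length is an assumption, not a figure from this document:

```python
import numpy as np
from scipy.signal import stft, istft

FS = 44100    # sample rate from the pipeline description
HOP = 512     # hop size (samples) quoted for this stage
NFFT = 2048   # window length: an assumption for this sketch

def apply_mask(mix, mask_fn):
    """Split a mono mixture into (vocal, background) stems by soft-masking
    the STFT magnitude and inverting with the mixture's phase.
    `mask_fn` stands in for the learned separator's mask prediction."""
    _, _, Z = stft(mix, fs=FS, nperseg=NFFT, noverlap=NFFT - HOP)
    mask = mask_fn(np.abs(Z))  # values in [0, 1]: estimated vocal energy share
    _, vocal = istft(Z * mask, fs=FS, nperseg=NFFT, noverlap=NFFT - HOP)
    _, background = istft(Z * (1 - mask), fs=FS, nperseg=NFFT, noverlap=NFFT - HOP)
    return vocal, background

# Toy usage: a trivial all-ones mask routes everything to the vocal stem.
mix = np.random.default_rng(0).standard_normal(FS)  # 1 s of noise
vocal, background = apply_mask(mix, lambda mag: np.ones_like(mag))
```

Because the two masks sum to one, the stems recombine to the original mixture — the property that makes unity-gain remixing in the final stage lossless with respect to the background.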
The background stem is preserved through the pipeline and remixed at unity gain with the synthesized vocal track in the final output — preserving the ambient soundscape that defines the original content's identity. The isolated vocal track is passed to Whisper for transcription.
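Unity-gain remixing is deliberately simple: both stems are summed at gain 1.0 with no rebalancing. A minimal sketch (the function name is illustrative, not NexiDub's API):

```python
import numpy as np

def remix_unity(synth_vocal: np.ndarray, background: np.ndarray) -> np.ndarray:
    """Sum the synthesized vocal with the preserved background stem at
    unity gain (1.0 on both stems), padding the shorter track with silence."""
    n = max(len(synth_vocal), len(background))
    out = np.zeros(n, dtype=np.float64)
    out[:len(synth_vocal)] += synth_vocal
    out[:len(background)] += background
    return out

mix = remix_unity(np.array([0.1, 0.2, 0.3]), np.array([0.05, -0.05]))
```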
Whisper large-v3 is a sequence-to-sequence transformer encoder-decoder with 1.5 billion parameters, pre-trained on 680,000 hours of multilingual audio scraped from the web. The encoder processes 80-channel log-Mel spectrograms computed with a 25ms Hann window and 10ms hop size, yielding 100 frames per second of audio at the feature level.
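The 25ms window / 10ms hop numbers directly imply the ~100 frames-per-second feature rate: a new frame is emitted every 10ms. A quick sanity check, assuming no edge padding (Whisper's actual preprocessing pads its fixed 30-second input):

```python
def mel_frame_count(duration_ms: int, win_ms: int = 25, hop_ms: int = 10) -> int:
    """Number of log-Mel frames for a clip, counting only positions where
    a full analysis window fits (no padding at the edges)."""
    if duration_ms < win_ms:
        return 0
    return (duration_ms - win_ms) // hop_ms + 1

frames = mel_frame_count(30_000)  # a 30 s chunk -> ~3000 frames, i.e. ~100/s
```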
Word-level timestamps are extracted via Dynamic Time Warping (DTW) alignment between the decoder's cross-attention weights and the encoder's time-frequency representations. This enables phoneme-level timing for the prosody transfer stage — the critical substrate that allows duration modeling to be linguistically grounded rather than estimated.
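The DTW recurrence at the heart of this alignment is compact enough to show directly. This sketch finds the minimum-cost monotonic path through a (tokens × frames) cost matrix — in practice the cost would be derived from the cross-attention weights; here it is just an input array:

```python
import numpy as np

def dtw_path(cost):
    """Return the monotonic alignment path through a (tokens x frames)
    cost matrix via the standard DTW dynamic program."""
    n_tok, n_frm = cost.shape
    acc = np.full((n_tok + 1, n_frm + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n_tok + 1):
        for j in range(1, n_frm + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance token and frame
                acc[i - 1, j],      # advance token only
                acc[i, j - 1],      # advance frame only
            )
    # Backtrack from the corner to recover the path.
    path, i, j = [], n_tok, n_frm
    while i > 0 or j > 0:
        path.append((i - 1, j - 1))
        step = min(
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j],     i - 1, j),
            (acc[i, j - 1],     i,     j - 1),
        )
        i, j = step[1], step[2]
    return path[::-1]

# Toy cost matrix: 3 tokens over 4 audio frames; low cost marks good alignment.
cost = np.array([[0., 1., 1., 1.],
                 [1., 0., 0., 1.],
                 [1., 1., 1., 0.]])
path = dtw_path(cost)
```

Each token's timestamp is then read off as the first and last frame index it is paired with on the path.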
Translation is handled by a sequence-to-sequence model with attention, injected with the 5-Layer Universal Semantic Representation (USR) framework as a structured system prompt. This forces the model to go beyond lexical equivalence and preserve the full pragmatic-prosodic intent of each utterance across language boundaries.
Context window: 512 tokens with 128-token sliding overlap between consecutive utterances, enabling paragraph-level coherence — critical for preserving pronoun coreference chains, discourse markers (however, therefore, moreover), and topic continuity across sentence boundaries in long-form content.
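The windowing scheme amounts to stepping 384 tokens at a time so each window re-reads the previous 128. A sketch under those numbers (the function name is illustrative):

```python
def sliding_windows(tokens, size=512, overlap=128):
    """Split a token sequence into windows of `size` tokens, where each
    window repeats the last `overlap` tokens of its predecessor so the
    translator sees cross-sentence context (pronouns, discourse markers)."""
    if size <= overlap:
        raise ValueError("window size must exceed overlap")
    step = size - overlap
    windows = [tokens[i:i + size] for i in range(0, len(tokens), step)]
    # Drop a trailing window that adds no tokens beyond the overlap.
    if len(windows) > 1 and len(windows[-1]) <= overlap:
        windows.pop()
    return windows

wins = sliding_windows(list(range(1000)))
```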
XTTS v2 (Coqui TTS) is a three-component generative stack that achieves state-of-the-art voice cloning with cross-lingual capability — meaning the cloned voice identity transfers across language boundaries, not just the acoustic texture of the source language.
The 256-dim d-vector is computed by a speaker verification encoder (similar to GE2E loss architecture) applied to a 3-10 second reference audio clip of the target speaker. This embedding conditions all three synthesis stages, ensuring speaker identity coherence through the diffusion and vocoder stages.
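What the GE2E-style objective buys is a geometric property: clips of the same speaker map to nearby directions in the 256-dim space, so identity checks reduce to cosine similarity. A toy illustration with random vectors standing in for real d-vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (d-vectors).
    GE2E-style training pushes same-speaker clips toward similarity ~1."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(7)
ref = rng.standard_normal(256)                 # 256-dim d-vector, as in XTTS v2
same = ref + 0.05 * rng.standard_normal(256)   # same speaker, slight perturbation
other = rng.standard_normal(256)               # unrelated speaker
```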
Prosody transfer is the technical differentiator that separates NexiDub from every commercial dubbing API. It ensures the dubbed audio preserves not just the speaker's voice but their emotional cadence — the rises and falls that signal emphasis, surprise, sarcasm, and tenderness — even after language substitution.
CREPE estimates fundamental frequency (F0) at 1.95-cent resolution (approximately 0.11% frequency accuracy) — substantially more accurate than autocorrelation or RAPT on voiced fricatives and glottalized sounds.

Wav2Lip achieves accurate lip synchronization by learning a mapping from mel-spectrogram features to pixel-level predictions of the lip region, conditioned on the identity of the target face. It is pre-trained on the LRS2 (Lip Reading Sentences 2) dataset, which contains 224×224 face crops from BBC broadcast footage with aligned audio transcripts.
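The cents-to-percent conversion behind the "approximately 0.11%" figure follows from the definition of the cent (1200 cents per octave):

```python
import math

def cents_between(f1: float, f2: float) -> float:
    """Pitch interval between two frequencies in cents (1200 cents = 1 octave)."""
    return 1200.0 * math.log2(f2 / f1)

def frequency_error_pct(cents: float) -> float:
    """Relative frequency error (%) implied by a pitch error of `cents`."""
    return (2.0 ** (cents / 1200.0) - 1.0) * 100.0

err = frequency_error_pct(1.95)  # ~0.113%, matching the quoted accuracy
```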
Translation fails not at the lexical level — dictionaries solve that — but at the pragmatic and prosodic levels. USR is the structured framework that makes NexiDub's output culturally intelligent, not merely linguistically accurate.
The framework's layers each capture a distinct dimension of meaning:

- **Idiomatic meaning:** figurative expressions are mapped to their intended sense (e.g., *kick the bucket* → death idiom, not literal act).
- **Semantic roles:** Agent (initiator of action), Patient (affected entity), Theme (thing moving/being described), Instrument, Beneficiary. Semantic Role Labeling (SRL) ensures that the thematic structure of the source utterance is maintained in the target, even when surface syntax differs radically between languages (e.g., SOV Japanese vs. SVO English).
- **Speech acts:** Assertive (stating facts), Directive (requests, orders), Commissive (promises, offers), Expressive (thanks, apologies, compliments), Declarative (creating new realities by utterance).
- **Pragmatic implicature:** scalar implicature detection flags understatements and overstatements that carry pragmatic meaning beyond literal content. This is the layer that handles sarcasm, irony, and indirect refusals.

21/21 tests passing. Production infrastructure in place. Launch is a sprint, not a marathon.
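As a concrete picture of how the USR annotation could be injected as a structured system prompt, here is a hypothetical sketch — the field names and schema are illustrative only, not NexiDub's actual format:

```python
import json

# Hypothetical USR annotation attached to one utterance before translation.
usr_annotation = {
    "utterance": "Well, he finally kicked the bucket.",
    "layers": {
        "idiomatic_meaning": {"idioms": [{"span": "kicked the bucket",
                                          "sense": "died (euphemism)"}]},
        "semantic_roles": {"Agent": "he", "predicate": "die"},
        "speech_act": "Assertive",
        "pragmatics": {"register": "informal", "implicature": "understatement"},
    },
}

# The annotation is serialized into the system prompt so the translation
# model must preserve every layer, not just the surface words.
system_prompt = ("Translate while preserving every layer below:\n"
                 + json.dumps(usr_annotation, indent=2))
```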
The market has enterprise players and generic API wrappers. Self-serve, voice-cloned, prosody-preserving, culturally-intelligent dubbing is a genuine white space.
| Feature | YouTube Auto-Dub | ElevenLabs | Papercup | Deepdub | NexiDub |
|---|---|---|---|---|---|
| Prosody Preservation (F0/Duration/Energy) | — | Partial | Studio | Studio | CREPE + DTW |
| Voice Cloning (d-vector cross-lingual) | — | Yes | Studio only | Studio only | XTTS v2 256-dim |
| Cultural Intelligence (USR) | — | — | — | — | 5-Layer USR |
| Lip Synchronization | — | — | Yes | Yes | Wav2Lip <50ms |
| Self-Serve Pricing | Free (limited) | Complex | Enterprise | Enterprise | $0 — $49.99/mo |
| Cost per Hour of Dubbed Content | Free (basic) | $2,250/hr | Enterprise | Enterprise | $6–11/hr (Ph1) $2.30–4.30/hr (Ph2) |
| WebSocket Real-Time Rooms | — | — | — | — | Live Rooms |
| Language Count | 30+ (generic) | 32 | 40+ | 30+ | 12 → 30+ (Phase 2) |
API-first now, self-hosted to scale. The structural cost advantage is the primary competitive moat for creator-facing pricing.
| Provider | Cost per 1K Characters | Cost per Hour of Dubbed Content | Voice Cloning | Cultural Adaptation |
|---|---|---|---|---|
| ElevenLabs | $0.30 | $2,250/hr | Yes | — |
| Google Cloud TTS (Neural) | $0.016 | $120/hr | — | — |
| Amazon Polly (Neural) | $0.016 | $120/hr | — | — |
| Azure TTS (Neural) | $0.016 | $120/hr | Limited | — |
| NexiDub (XTTS v2 self-hosted) | $0.0008 | $6–11/hr (Phase 1) → $2.30–4.30/hr (Phase 2) | Yes — Cross-lingual | 5-Layer USR |
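The headline multiples fall straight out of the per-1K-character figures in the table (TTS pricing only; hosting and GPU costs are excluded from this comparison):

```python
# Per-1K-character prices taken from the comparison table above.
elevenlabs_per_1k = 0.30
cloud_tts_per_1k = 0.016   # Google / Amazon / Azure neural tiers
nexidub_per_1k = 0.0008

ratio_elevenlabs = elevenlabs_per_1k / nexidub_per_1k  # -> 375x
ratio_cloud = cloud_tts_per_1k / nexidub_per_1k        # -> 20x
```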
From free exploration to enterprise-grade deployment — a clean ladder that scales with creator needs.
Conservative assumptions. Creator-led growth with B2B upsells from month 6.
Creators first. Comparison-driven content. B2B upsell once the creator moat is established.
Voice AI sits at the intersection of biometrics, copyright, and AI regulation. We navigate this proactively.
Launch is achievable with minimal capital. The path to $500K ARR requires focused, disciplined spend.