CONFIDENTIAL — NEXI VENTURES — NEXIDUB BUSINESS PLAN — AUTHORIZED PERSONNEL ONLY
80% Built — Active Development, Sprint Active

NEXIDUB

Neural Audio Dubbing with Prosody Preservation

Culturally-intelligent voice translation for the global creator economy. Self-hosted XTTS v2 + CREPE F0 transfer + 5-Layer USR. 375x cheaper than ElevenLabs. Your voice. Your cadence. Every language.

$7–10B
Market by 2029
375×
Cost Advantage vs ElevenLabs
21/21
Tests Passing
30+
Languages (Phase 2)
01 — The Thesis

The Market Moment

Three signals converging into a single, unmistakable opportunity, backed by a structural 375× cost advantage competitors cannot match.

Signal
ElevenLabs raised at an $11B valuation on voice AI. YouTube now automatically dubs every creator's video — validating that multilingual reach is a universal creator need. But YouTube's dubbing is generic, robotic, and culturally tone-deaf. The gap between "good enough" and "actually good" is where NexiDub lives. And with self-hosted XTTS v2, we produce that "actually good" output at $0.0008/1K characters — 375x cheaper than ElevenLabs.
01
Proof of Market
ElevenLabs at $11B, Papercup, Deepdub, Dubverse — the market has spoken. Enterprise dubbing exists. Self-serve, voice-cloned, culturally-intelligent dubbing for the 80M+ YouTube creators does not, at scale.
02
Demand Validated by YouTube
YouTube's auto-dub rollout confirms that multilingual distribution is the next creator growth lever. The problem: YouTube's Whisper+mBART pipeline ignores prosody, cultural register, and speaker identity. Creators want their voice — not a robot's.
03
Cultural Intelligence = Moat
Competitors translate words. NexiDub translates meaning via the 5-Layer Universal Semantic Representation (USR) framework — adapting idioms, pragmatic speech acts, honorific register, and prosodic intent for each target culture.
04
Infrastructure Ready
FastAPI backend with 24 routes, Claude Haiku translation, ElevenLabs TTS, WebSocket real-time rooms, auth with tier gating, and 21/21 tests passing. We are weeks from launch, not months.
02 — Technical Deep Dive

The Dubbing Pipeline

Six sequential stages from raw source video to dubbed output with preserved prosody. Each stage uses a peer-validated model with documented performance benchmarks.

STAGE 01 Source Separation MDX-Net — U-Net Band-Split Architecture

MDX-Net (Music Demixing Network) decomposes the source audio track into two stems: the vocal foreground and the background accompaniment (music, ambient, SFX). The architecture uses a U-Net with band-split RNN processing, operating in the time-frequency domain via Short-Time Fourier Transform (STFT) with a hop size of 512 samples at 44.1 kHz.

Vocal isolation quality target: SDR > 12 dB (Signal-to-Distortion Ratio)
SDR > 10 dB: broadcast-ready isolation | SDR < 7 dB: perceptible artifacts

The background stem is preserved through the pipeline and remixed at unity gain with the synthesized vocal track in the final output — preserving the ambient soundscape that defines the original content's identity. The isolated vocal track is passed to Whisper for transcription.
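The unity-gain remix described above reduces to a few lines of array math. A minimal NumPy sketch, assuming both stems arrive as mono float32 arrays at the same sample rate (function and variable names here are illustrative, not the production code):

```python
import numpy as np

def remix_unity_gain(synth_vocal: np.ndarray, background: np.ndarray) -> np.ndarray:
    """Sum the synthesized vocal and the preserved background stem at unity gain
    (no per-stem attenuation), zero-padding the shorter stem with silence."""
    n = max(len(synth_vocal), len(background))
    out = np.zeros(n, dtype=np.float32)
    out[: len(synth_vocal)] += synth_vocal
    out[: len(background)] += background
    # Guard against clipping in the summed signal before encoding.
    return np.clip(out, -1.0, 1.0)

sr = 44_100
vocal = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)
background = 0.1 * np.random.default_rng(0).standard_normal(sr + 4_410).astype(np.float32)
mix = remix_unity_gain(vocal, background)
```

Because neither stem is attenuated, the only safety step needed is the final clip guard before the H.264/AAC mux.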

STAGE 02 Speech Recognition OpenAI Whisper large-v3 — 1.5B Parameters

Whisper large-v3 is a sequence-to-sequence transformer encoder-decoder with 1.5 billion parameters, trained on roughly 1 million hours of weakly labeled multilingual audio plus several million hours of pseudo-labeled audio. The encoder processes 128-channel log-Mel spectrograms (large-v3 widened the input from the 80 Mel bins of earlier Whisper models) computed with a 25ms Hann window and 10ms hop size, yielding 100 frames per second of audio at the feature level.

Architecture: 32 encoder layers + 32 decoder layers
Embedding dim: 1280 | Attention heads: 20 | FFN dim: 5120

Word-level timestamps are extracted via Dynamic Time Warping (DTW) alignment between the decoder's cross-attention weights and the encoder's time-frequency representations. This enables phoneme-level timing for the prosody transfer stage — the critical substrate that allows duration modeling to be linguistically grounded rather than estimated.

  • WER targets — English <5%, Spanish/French/German <8%, CJK <12% (character error rate for logographic scripts)
  • Word-level alignment — DTW on cross-attention weight matrices, yielding ±25ms timestamp accuracy for most words in conversational speech
  • Long-form transcription — sliding window with 30s segments, 50% overlap for boundary reconciliation
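The long-form strategy above (30 s segments, 50% overlap) is just a window scheduler over sample offsets. A hedged sketch, assuming 16 kHz mono input as Whisper expects; the function name and return shape are ours, not Whisper's API:

```python
def sliding_windows(n_samples: int, sr: int = 16_000,
                    window_s: float = 30.0, overlap: float = 0.5):
    """Return (start, end) sample offsets for fixed windows with fractional
    overlap, matching the long-form transcription strategy described above."""
    win = int(window_s * sr)
    hop = int(win * (1.0 - overlap))
    spans, start = [], 0
    while start < n_samples:
        spans.append((start, min(start + win, n_samples)))
        if start + win >= n_samples:
            break  # final window reached the end of the clip
        start += hop
    return spans

spans = sliding_windows(n_samples=75 * 16_000)  # a 75-second clip
```

Each adjacent pair of windows shares 15 s of audio, which is the region where boundary words are reconciled between passes.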
STAGE 03 Neural Machine Translation Seq2Seq Attention + 5-Layer USR Preservation

Translation is handled by a sequence-to-sequence model with attention, injected with the 5-Layer Universal Semantic Representation (USR) framework as a structured system prompt. This forces the model to go beyond lexical equivalence and preserve the full pragmatic-prosodic intent of each utterance across language boundaries.

BLEU Score Targets (sacrebleu, corpus-level):
EN→ES: 45+ | EN→FR: 42+ | EN→DE: 38+ | EN→JA: 32+ | EN→ZH: 35+

Context window: 512 tokens with 128-token sliding overlap between consecutive utterances, enabling paragraph-level coherence — critical for preserving pronoun coreference chains, discourse markers (however, therefore, moreover), and topic continuity across sentence boundaries in long-form content.
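The 512-token window with a 128-token carry-over is plain slicing arithmetic. An illustrative sketch (names hypothetical; production tokenization is model-specific):

```python
def chunk_with_overlap(tokens, window: int = 512, overlap: int = 128):
    """Split a token sequence into windows where each window repeats the final
    `overlap` tokens of its predecessor, so the model sees carried-over context."""
    stride = window - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), stride):
        chunks.append(tokens[start:start + window])
    return chunks

chunks = chunk_with_overlap(list(range(1_000)))
```

The shared 128-token region is what lets pronoun coreference and discourse markers resolve consistently across chunk boundaries.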

STAGE 04 Neural Text-to-Speech XTTS v2 — GPT-2 + Latent Diffusion + HiFi-GAN

XTTS v2 (Coqui TTS) is a three-component generative stack that achieves state-of-the-art voice cloning with cross-lingual capability — meaning the cloned voice identity transfers across language boundaries, not just the acoustic texture of the source language.

  • GPT-2 autoregressive text-to-latent decoder — 774M parameters, conditioned on a 256-dimensional d-vector speaker embedding. Generates discrete VQ-VAE latent codes from input phoneme sequences and speaker conditioning.
  • Latent diffusion model — mel-spectrogram refinement via DDPM (Denoising Diffusion Probabilistic Model) with 50-step inference, guided by the discrete latent codes from stage one. Adds fine spectral texture, breathiness, and speaker-specific timbral characteristics.
  • HiFi-GAN neural vocoder — 14M parameter generator with multi-scale, multi-period discriminators. Converts mel-spectrograms to 24kHz PCM waveforms. Trained on multi-speaker corpora; achieves MOS >4.2 (5-point mean opinion score scale) for voice-cloned outputs.
Speaker embedding: d-vector = f_encoder(x_ref) ∈ ℝ²⁵⁶
Cross-lingual synthesis: P(y_L2 | text_L2, d-vector_L1)

The 256-dim d-vector is computed by a speaker verification encoder (similar to GE2E loss architecture) applied to a 3-10 second reference audio clip of the target speaker. This embedding conditions all three synthesis stages, ensuring speaker identity coherence through the diffusion and vocoder stages.
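A common way to check that speaker identity survives synthesis is cosine similarity between d-vectors of the reference clip and the generated audio. A toy sketch with random vectors standing in for real embeddings (the verification encoder itself is not shown, and the 0.05 noise scale is purely illustrative):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(7)
ref = rng.standard_normal(256)                  # d-vector from a 3-10 s reference clip
same = ref + 0.05 * rng.standard_normal(256)    # stand-in for the cloned output's embedding
other = rng.standard_normal(256)                # stand-in for an unrelated speaker
```

In 256 dimensions, embeddings of unrelated speakers land near orthogonal, so a high cosine score is a meaningful identity-coherence signal.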

STAGE 05 Prosody Transfer Pipeline CREPE F0 Extraction + Cross-Lingual DTW Alignment

Prosody transfer is the technical differentiator that separates NexiDub from every commercial dubbing API. It ensures the dubbed audio preserves not just the speaker's voice but their emotional cadence — the rises and falls that signal emphasis, surprise, sarcasm, and tenderness — even after language substitution.

  • F0 extraction via CREPE — Convolutional Representation for Pitch Estimation. CREPE is a fully convolutional model trained on a diverse multi-speaker corpus, achieving pitch estimation at 1.95 cents resolution (approximately 0.11% frequency accuracy). Substantially more accurate than autocorrelation or RAPT on voiced fricatives and glottalized sounds.
  • Duration modeling — forced alignment on source → phoneme-level timing extraction → cross-lingual DTW to map source phoneme durations to target phoneme sequences. Accounts for the syllabic rate differences between languages (e.g., Spanish ~7.8 syllables/sec vs. German ~5.7 syllables/sec).
  • Energy contour mapping — frame-level RMS energy extracted at 10ms hops, normalized per utterance, then remapped to target language energy contour via linear interpolation. Language-specific normalization prevents over-amplification of naturally lower-intensity languages (e.g., Japanese) or clipping of high-intensity languages (e.g., Arabic).
  • Pause placement — source prosodic phrase boundaries detected via F0 reset patterns and energy valleys, preserved as structural constraints in the target TTS conditioning.
Result: dubbed audio preserves the speaker's emotional cadence, emphasis patterns, and breathing rhythm across language boundaries — not just their acoustic identity.
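The energy-contour step above (10 ms RMS frames, per-utterance normalization, linear-interpolation remap) can be sketched in NumPy. Assumptions: 24 kHz mono input and a 25 ms analysis window; function names are ours, not the pipeline's:

```python
import numpy as np

def rms_energy(signal: np.ndarray, sr: int = 24_000,
               hop_ms: float = 10.0, win_ms: float = 25.0) -> np.ndarray:
    """Frame-level RMS energy at 10 ms hops over a 25 ms window."""
    hop, win = int(sr * hop_ms / 1000), int(sr * win_ms / 1000)
    frames = [signal[i:i + win] for i in range(0, len(signal) - win + 1, hop)]
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

def remap_contour(source: np.ndarray, target_len: int) -> np.ndarray:
    """Normalize the source energy contour per utterance, then stretch or
    compress it to the target utterance's frame count via linear interpolation."""
    src = source / (source.max() + 1e-9)
    x_src = np.linspace(0.0, 1.0, len(src))
    x_tgt = np.linspace(0.0, 1.0, target_len)
    return np.interp(x_tgt, x_src, src)

sr = 24_000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 200 * t) * np.hanning(sr)  # 1 s utterance with fade in/out
energy = rms_energy(signal)
remapped = remap_contour(energy, target_len=80)        # e.g. a shorter target utterance
```

The remapped contour is what gets handed to TTS conditioning; the language-specific normalization mentioned above would scale `src` before interpolation.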
STAGE 06 Lip Synchronization Wav2Lip — LRS2 Dataset — Viseme Generation

Wav2Lip achieves accurate lip synchronization by learning the mapping from mel-spectrogram features to lip region pixel-level predictions, conditioned on the identity of the target face. Pre-trained on the LRS2 (Lip Reading Sentences 2) dataset, which contains 224×224 face crops from BBC broadcast footage with aligned audio transcripts.

  • Viseme generation — the model produces target viseme sequences (the 14 canonical mouth shapes of English, extended to ~22 for cross-lingual application) driven directly by the mel-spectrogram of the dubbed audio, not from phoneme lookup tables
  • Face detector — S3FD multi-scale face detector runs at 25fps on the source video; detected bounding box crops are fed to the lip synthesis network at 96×96 pixel resolution
  • Temporal alignment accuracy — <50ms audio-to-lip offset, which is below the approximately 150ms threshold for perceptual detection of lip-sync errors by untrained viewers
  • Blending — synthesized lip region is blended back into source frame using Gaussian feathering at the facial boundary mask, preserving skin tone consistency
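The Gaussian-feathered blend can be illustrated directly on synthetic frames. A toy sketch assuming float images in [0, 1]; the mask shape and sigma are illustrative, not Wav2Lip's exact post-processing:

```python
import numpy as np

def feathered_blend(frame: np.ndarray, patch: np.ndarray,
                    top: int, left: int, sigma: float = 0.25) -> np.ndarray:
    """Blend a synthesized lip patch into the source frame using a Gaussian
    alpha mask that is 1.0 at the patch center and falls off toward its edges."""
    h, w = patch.shape[:2]
    ys = (np.arange(h) - h / 2) / h
    xs = (np.arange(w) - w / 2) / w
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    alpha = np.exp(-(yy ** 2 + xx ** 2) / (2 * sigma ** 2))[..., None]
    out = frame.astype(np.float32).copy()
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * patch + (1 - alpha) * region
    return out

frame = np.zeros((256, 256, 3), dtype=np.float32)       # source frame stand-in
patch = np.ones((96, 96, 3), dtype=np.float32)          # 96×96 lip-region output
blended = feathered_blend(frame, patch, top=140, left=80)
```

The soft falloff is what avoids a visible seam at the facial boundary while leaving surrounding skin tone untouched.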
Source Video
├── MDX-Net — Vocal Isolation          // SDR >12dB, U-Net band-split
│   └── Background stem preserved for final remix
├── Whisper large-v3 — ASR + Timestamps // 1.5B params, 128-ch log-Mel
│   └── DTW alignment                  // Word-level timestamps ±25ms
├── NMT Engine — Translation           // Seq2Seq + 512-token context
│   └── 5-Layer USR Preservation       // Cultural + pragmatic fidelity
├── XTTS v2 — Voice Synthesis          // GPT-2 + Diffusion + HiFi-GAN
│   ├── d-vector cloning (256-dim)     // Cross-lingual speaker identity
│   ├── CREPE F0 contour transfer      // 1.95 cent pitch resolution
│   └── Duration + energy alignment    // Cross-lingual DTW mapping
└── Wav2Lip — Lip Sync                 // LRS2-trained viseme generation
    └── Temporal offset <50ms          // Sub-perceptual threshold

→ Output: Dubbed video with original voice identity, emotional cadence, and visual sync
→ Codec: H.264 video + AAC audio @ 24kHz, MP4 container
03 — Semantic Science

5-Layer Universal Semantic Representation

Translation fails not at the lexical level — dictionaries solve that — but at the pragmatic and prosodic levels. USR is the structured framework that makes NexiDub's output culturally intelligent, not merely linguistically accurate.

The Problem with "Accurate" Translation
A direct translation of "Can you pass the salt?" to Korean rendered as a direct command is linguistically accurate but violates the indirect speech act convention in Korean social contexts, sounding rude to a native listener. USR Layer 3 (Pragmatic) catches this: the English utterance is classified as an indirect directive (polite request), and the Korean output is generated to preserve that speech act classification — not the surface grammatical form.
01
Lexical Layer
Tokenization, lemmatization, and Part-of-Speech tagging using the Universal Dependencies annotation schema — a cross-lingual POS framework standardized across 100+ languages. Named entity recognition preserves proper nouns (brand names, people, places) without translation. Multi-word expression detection prevents compositional mis-translation of fixed phrases (kick the bucket → death idiom, not literal act).
02
Semantic Layer
Predicate-argument structure via PropBank annotation — assigns semantic roles to verb arguments: Agent (initiator of action), Patient (affected entity), Theme (thing moving/being described), Instrument, Beneficiary. Semantic Role Labeling (SRL) ensures that the thematic structure of the source utterance is maintained in the target, even when surface syntax differs radically between languages (e.g., SOV Japanese vs. SVO English).
03
Pragmatic Layer
Speech act classification using Austin-Searle taxonomy: Assertive (stating facts), Directive (requests, orders), Commissive (promises, offers), Expressive (thanks, apologies, compliments), Declarative (creating new realities by utterance). Scalar implicature detection flags understatements and overstatements that carry pragmatic meaning beyond literal content. This is the layer that handles sarcasm, irony, and indirect refusals.
04
Cultural Layer
T-V distinction mapping — formal/informal pronoun register (tu vs. vous, du vs. Sie, tú vs. usted) selected based on detected social relationship between interlocutors. Honorific level classification for Korean (~7 speech levels), Japanese (3 primary formality levels: casual, polite, formal), and Thai (4 registers). Idiomatic substitution database: 8,000+ idiom pairs across 12 language directions, mapped to cultural equivalents rather than literal translations. Sports metaphors, food references, and holiday allusions adapted for target cultural context.
05
Prosodic Layer
Intonation blueprint extracted from source utterance: F0 contour in Hz (sampled at 100 fps via CREPE), duration targets in milliseconds per phoneme (from forced alignment), energy envelope in dB (frame-level RMS at 10ms hop), and pause placement (prosodic phrase boundary markers). These five prosodic parameters are passed as conditioning constraints to the XTTS v2 synthesis stage, ensuring the emotional arc of the original delivery survives language substitution.
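The five layers above can be made concrete as a typed annotation record. A sketch (the schema is illustrative, not the production USR format), using the "pass the salt" example from earlier:

```python
from dataclasses import dataclass
from enum import Enum

class SpeechAct(Enum):
    """Austin-Searle speech act classes used by the Pragmatic layer."""
    ASSERTIVE = "assertive"
    DIRECTIVE = "directive"
    COMMISSIVE = "commissive"
    EXPRESSIVE = "expressive"
    DECLARATIVE = "declarative"

@dataclass
class USRAnnotation:
    """One utterance annotated across the five USR layers.
    Field shapes are illustrative placeholders, not a production schema."""
    lexical: dict            # tokens, lemmas, UD POS tags, named entities
    semantic: dict           # PropBank predicate-argument roles
    pragmatic: SpeechAct     # speech act class driving register choices
    cultural: dict           # T-V register, honorific level, idiom substitutions
    prosodic: dict           # F0 contour, per-phoneme durations, energy, pauses

utterance = USRAnnotation(
    lexical={"tokens": ["Can", "you", "pass", "the", "salt", "?"]},
    semantic={"predicate": "pass", "Agent": "you", "Theme": "the salt"},
    pragmatic=SpeechAct.DIRECTIVE,   # indirect polite request, not a yes/no question
    cultural={"register": "polite-informal"},
    prosodic={"final_contour": "rising"},
)
```

The point of the record is that translation consumes the whole structure: the Korean output is generated to preserve `pragmatic` and `cultural`, not the English surface syntax.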
04 — Current State

What's Built. What's Next.

21/21 tests passing. Production infrastructure in place. Launch is a sprint, not a marathon.

Built & Tested
  • LIVE landing page with Stripe integration wired
  • 24-route FastAPI backend — all endpoints functional
  • Claude Haiku translation with USR cultural framing prompts
  • ElevenLabs TTS voice synthesis pipeline (Phase 1)
  • WebSocket rooms — real-time session management
  • Auth system — JWT + session management
  • Tier gating — Starter / Pro / Business enforcement
  • Fly.io configuration — deployment config complete
  • 21 / 21 tests passing — full test suite green
  • Deepgram ASR — integrated, pending activation
Next Sprint
  • Deploy to Fly.io — run deployment, verify DNS and certs
  • Stripe payment products — create Pro, Business, Event Pass SKUs
  • Voice cloning enrollment — XTTS v2 d-vector extraction UX
  • Deepgram ASR activation — live transcription enable
  • XTTS v2 self-hosting — GPU inference on Fly.io A100
  • CREPE F0 transfer — prosody preservation module
  • Creator onboarding — voice sample collection UI (3-10 sec)
  • Language expansion — 12 → 30+ languages, post-launch
05 — Competitive Landscape

Where We Win

The market has enterprise players and generic API wrappers. Self-serve, voice-cloned, prosody-preserving, culturally-intelligent dubbing is a genuine white space.

Feature                               | YouTube Auto-Dub | ElevenLabs | Papercup    | Deepdub     | NexiDub
Prosody Preservation (F0/Dur/Energy)  | —                | Partial    | Studio      | Studio      | CREPE + DTW
Voice Cloning (d-vector cross-lingual)| —                | Yes        | Studio only | Studio only | XTTS v2 256-dim
Cultural Intelligence (USR)           | —                | —          | —           | —           | 5-Layer USR
Lip Synchronization                   | —                | —          | Yes         | Yes         | Wav2Lip <50ms
Self-Serve Pricing                    | Free (limited)   | Complex    | Enterprise  | Enterprise  | $0 – $49.99/mo
Cost per Hour of Dubbed Content       | Free (basic)     | $2,250/hr  | Enterprise  | Enterprise  | $6–11/hr (Ph1), $2.30–4.30/hr (Ph2)
WebSocket Real-Time Rooms             | —                | —          | —           | —           | Live Rooms
Language Count                        | 30+ (generic)    | 32         | 40+         | 30+         | 12 → 30+ (Phase 2)
06 — Cost Analysis

Unit Economics & Cost Model

API-first now, self-hosted to scale. The structural cost advantage is the primary competitive moat for creator-facing pricing.

TTS API Cost Comparison

Provider                      | Cost per 1K Characters | Cost per Hour of Dubbed Content            | Voice Cloning       | Cultural Adaptation
ElevenLabs                    | $0.30                  | $2,250/hr                                  | Yes                 | —
Google Cloud TTS (Neural)     | $0.016                 | $120/hr                                    | —                   | —
Amazon Polly (Neural)         | $0.016                 | $120/hr                                    | —                   | —
Azure TTS (Neural)            | $0.016                 | $120/hr                                    | Limited             | —
NexiDub (XTTS v2 self-hosted) | $0.0008                | $6–11/hr (Phase 1) → $2.30–4.30/hr (Ph 2)  | Yes — Cross-lingual | 5-Layer USR
375x Cost Advantage
XTTS v2 self-hosted on an A100 GPU processes approximately 1.25M characters per hour at a compute cost of ~$1.00/hr (Fly.io spot pricing). That yields $0.0008/1K characters — versus ElevenLabs at $0.30/1K characters. The 375× advantage enables NexiDub to price competitively for creators at $14.99/mo while maintaining 70–84% gross margin in Phase 1, improving to 85%+ in Phase 2 with full self-hosting.
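The arithmetic behind the 375× figure, reproduced step by step under the stated assumptions (1.25M characters/hour on one A100 at ~$1.00/hour spot pricing):

```python
# Unit-economics arithmetic from the paragraph above.
chars_per_hour = 1_250_000     # XTTS v2 throughput assumption on one A100
gpu_cost_per_hour = 1.00       # Fly.io spot pricing assumption, USD

nexidub_per_1k = gpu_cost_per_hour / chars_per_hour * 1_000  # cost per 1K characters
elevenlabs_per_1k = 0.30                                     # published comparison point
advantage = elevenlabs_per_1k / nexidub_per_1k               # 0.30 / 0.0008 = 375
```

Any change to either input shifts the multiple proportionally, so the moat claim rests on sustaining that throughput-per-GPU-dollar figure.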
Phase 1 — API-First
$6–11 / dubbed-hour
  • Deepgram ASR (transcription) — $0.68/hr
  • Claude Haiku (USR translation) — $0.30–1.50/hr
  • ElevenLabs TTS (synthesis) — $3–6/hr
  • Fly.io compute (FastAPI backend) — $0.50–1/hr
  • Bandwidth / CDN egress — $0.50–1.50/hr
Phase 2 — Self-Hosted
$2.30–4.30 / dubbed-hour
  • Whisper large-v3 (self-hosted GPU) — $0.20–0.40/hr
  • Open-weight LLM translation — $0.10–0.30/hr
  • XTTS v2 + HiFi-GAN (A100 GPU) — $1.50–2.50/hr
  • GPU spot compute (Fly.io A100) — $0.30–0.60/hr
  • Bandwidth / CDN egress — $0.20–0.50/hr
07 — Pricing

Tier Structure

From free exploration to enterprise-grade deployment — a clean ladder that scales with creator needs.

Starter
$0
Explore the platform. No credit card required.
  • 5 minutes / month
  • 2 participants max
  • Text translation only
  • 3 languages
  • Standard support
Get Started
Business
$49.99/mo
Team dubbing with full cultural adaptation.
  • 500 minutes / month
  • 10 participants
  • Voice cloning + CREPE prosody
  • 12+ languages + USR
  • Analytics dashboard
Go Business
Enterprise
Custom
Dedicated infrastructure, SLAs, custom models.
  • Unlimited minutes
  • Unlimited participants
  • On-premise XTTS v2 option
  • All 30+ languages
  • Dedicated CSM + SLA
Contact Sales
Event Pass
$29 one-time
Single live event. All Pro features, no subscription.
  • 8-hour window
  • Up to 10 participants
  • All Pro features
  • 12 languages
  • No subscription
Buy Pass
08 — Revenue Projections

Path to $550K MRR

Conservative assumptions. Creator-led growth with B2B upsells from month 6.

Month 3
$5.5K
~250 subscribers
50 Pro creators (gifted), 200 paid conversions. Validation phase — proving creator love.
Month 6
$25K
~1,200 subscribers
Product Hunt launch + LinkedIn B2B campaign. First Business tier customers onboarding.
Month 12
$125K
~6,000 subscribers
SEO compound growth. Enterprise pipeline active. Phase 2 self-hosting cutting COGS to 15%.
Year 2
$550K
~25,000 subscribers
API product for platforms. Creator studio partnerships. International expansion. Series A ready.
Key Assumption
Blended ARPU of ~$22/mo across tier mix (60% Pro, 30% Business, 10% Event Pass + Enterprise). Month-over-month growth of 35–45% in months 1–6, decelerating to 15–20% in year 2. Churn modeled at 8% monthly for creator segment. Phase 2 XTTS v2 self-hosting improves gross margin from 72% → 87%.
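Two quick consistency checks on these assumptions: the blended ARPU implied by the Year 2 target, and the net month-over-month growth required between month 12 (~6,000 subscribers) and month 24 (~25,000):

```python
# Sanity-check the projection's internal consistency.
subscribers_y2 = 25_000
mrr_y2 = 550_000
arpu = mrr_y2 / subscribers_y2            # blended ARPU implied by the Year 2 target

# Net month-over-month subscriber growth needed from month 12 to month 24.
implied_growth = (subscribers_y2 / 6_000) ** (1 / 12) - 1
```

The implied net rate comes out near 12.6%/month, which sits at the top of the stated 15–20% gross growth band once 8% monthly churn is netted out, so the year-2 target assumes little slack.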
09 — Go-To-Market

Launch Playbook

Creators first. Comparison-driven content. B2B upsell once the creator moat is established.

01
Deploy & Ship
Deploy backend to Fly.io. Create Stripe payment products. Wire Deepgram ASR. Full integration test. Soft launch to beta list.
02
Creator Seeding
Gift 50 YouTube creators with 50K–500K subscribers a free 3-month Pro account. Target multilingual creators already producing in multiple languages.
03
Comparison Content
Side-by-side demos: NexiDub vs. YouTube Auto-Dub on the same clip in the same language, dramatizing the prosody quality gap. SEO-targeted: "youtube auto dub alternative."
04
Product Hunt Launch
Coordinated PH drop at Month 3. Pre-warm hunter community. Target Top 5 of the day. Converts to 500–2,000 trial signups in first week.
05
LinkedIn B2B Campaign
Target: Global marketing directors, international growth leads. Message: "Your creator content is leaving 70% of global revenue on the table." Funnel → demo → Enterprise tier.
06
Platform Partnerships
Approach MCNs and creator agencies for bulk licensing. NexiDub as white-label API for platforms. Partnership target: 3 MCN deals by Month 9.
10 — Regulatory

Legal & Compliance

Voice AI sits at the intersection of biometrics, copyright, and AI regulation. We navigate this proactively.

Voice Consent (USA)
Explicit written consent required before d-vector extraction and voice model creation. Consent stored with timestamp, IP, and user ID. Revocation supported at any time — model deletion within 24hrs.
EU AI Act (Aug 2026)
NexiDub's voice cloning may classify as Limited Risk AI under Article 52. Transparency obligations: users must be informed they are interacting with AI-synthesized voice. Disclosure UI in production by Q2 2026.
GDPR — Article 9 Biometrics
Voice data classified as biometric under GDPR Article 9. Requires explicit consent, data minimization, and right to erasure. EU users: voice models stored in EU-region buckets only. DPA agreements with all sub-processors.
ELVIS Act (Tennessee)
Enacted 2024 — protects artists from unauthorized AI voice replication. NexiDub's consent-first model fully compliant. No third-party voice cloning without the voice owner's active enrollment and explicit authorization.
Age Verification
Users under 18 cannot enroll voice clones. Age gate at enrollment with COPPA compliance for US users. International age requirements mapped per jurisdiction (GDPR: 13–16 depending on member state).
Copyright & Translation Rights
Users responsible for owning or licensing content they dub. Platform provides translation tooling only. ToS explicitly prohibits use for unauthorized content reproduction or deepfake creation.
11 — Budget

Capital Requirements

Launch is achievable with minimal capital. The path to $500K ARR requires focused, disciplined spend.

Phase 1 — Launch
$10K–$25K
  • Fly.io infrastructure (3 months) — $1,500–3,000
  • API costs (Deepgram, Claude, ElevenLabs) — $2,000–5,000
  • Creator seeding (50 Pro accounts) — $750
  • Stripe setup + payment processing — $0 + 2.9%
  • Legal (ToS, Privacy, GDPR, ELVIS compliance) — $2,000–5,000
  • Marketing (comparison content, ads) — $2,000–8,000
  • Contingency (20%) — $1,600–3,200
Phase 2 — To $500K ARR
$170K–$350K
  • Engineering (1–2 hires or contractors) — $80,000–150,000
  • GPU infrastructure (A100 for XTTS v2) — $25,000–50,000
  • API costs at scale (pre-migration runway) — $20,000–40,000
  • Marketing + distribution — $30,000–60,000
  • Legal + compliance (EU AI Act, GDPR) — $10,000–20,000
  • Ops + tools + monitoring — $5,000–15,000
  • Contingency (15%) — ~$25,000