NexiMedia
NEXI — CONFIDENTIAL · AIQC INTERACTIVE · INTERNAL VENTURE BRIEF · NOT FOR DISTRIBUTION
AI Evaluation Product

AIQC
INTERACTIVE

Adversarial AI Evaluation via Competitive Assessment

BUILT — IN REDESIGN  ·  506 Items  ·  5 Modes
$387B
Global EdTech Market
506
Validated Items
30,360
Unique Learning Paths
5
Game Modes
Core Science

Evaluation Framework

A multi-layer psychometric engine combining adversarial game theory, Bayesian Elo tracking, Item Response Theory, and knowledge graph traversal.

01 — Adversarial Architecture MINIMAX

AIQC operationalizes the red-team / blue-team adversarial paradigm at the item generation layer. A Red-team LLM generates maximally discriminative questions by targeting the boundary of known knowledge, while a Blue-team LLM attempts to answer, revealing capability gaps and knowledge horizons. The human player is inserted in place of the Blue team — receiving questions calibrated in real-time to expose their exact knowledge frontier.

This formulation borrows from the RLHF literature on adversarial preference learning but applies it to psychometric item generation rather than reward modeling. The equilibrium condition ensures the system produces questions with optimal discriminative power at the player's current estimated ability level.

— Adversarial objective (minimax formulation) —
  Red minimizes:  E[S_Blue(q)] over question space Q
  Blue maximizes: E[S_Blue(q)] over response space R

— Nash equilibrium: converges to questions that perfectly discriminate ability —
  q* = argmax_q [ H(θ) − H(θ | q) ]
  q* maximizes expected information gain about latent ability θ

— Exploration–exploitation via softmax temperature τ —
  P(q_i) = exp(I(θ̂, q_i) / τ) / Σ_j exp(I(θ̂, q_j) / τ)
  τ → 0: greedy optimal selection (pure exploitation)
  τ → ∞: uniform exploration across item pool
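The temperature-controlled selection rule reduces to a softmax over information-gain scores. A minimal NumPy sketch (function name and toy gain values are illustrative, not from the AIQC codebase):

```python
import numpy as np

rng = np.random.default_rng(0)

def select_question(info_gains, tau):
    """Softmax selection over expected information gain I(theta_hat, q_i).

    tau -> 0 approaches greedy argmax (pure exploitation);
    large tau approaches uniform exploration across the pool.
    """
    logits = np.asarray(info_gains, dtype=float) / tau
    logits -= logits.max()            # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

gains = [0.2, 1.5, 0.9, 0.4]          # toy I(theta_hat, q_i) values
pick = select_question(gains, tau=1e-6)   # near-greedy: index of the max gain
```

At very low τ the probability mass collapses onto the single highest-information item; at high τ the distribution flattens toward uniform, matching the two limits stated above.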
02 — Elo Rating System BAYESIAN EXTENSION

Player skill is tracked using an adapted Elo rating system with Bayesian uncertainty quantification. Unlike chess Elo, AIQC uses per-domain ratings with cross-domain transfer learning priors — a player strong in Machine Learning receives a calibrated prior boost for Statistics questions, weighted by the domain correlation matrix.

— Standard Elo expected score —
  E_A = 1 / (1 + 10^((R_B − R_A) / 400))

— Rating update rule —
  R'_A = R_A + K × (S_A − E_A)
  S_A: actual outcome (1 = correct, 0.5 = partial, 0 = incorrect)

— K-factor schedule (expertise tier) —
  K = 32   games < 30     (provisional — high plasticity, fast convergence)
  K = 16   games 30–100   (established — medium plasticity)
  K = 8    games > 100    (veteran — stable rating, low volatility)

— Bayesian uncertainty quantification —
  Prior:  R ~ N(1200, σ₀²), σ₀ = 200 (new-player uncertainty)
  Update: σ²_n = 1 / (1/σ²_{n−1} + 1/σ²_item)
  95% CI: R ± 1.96σ — the system expresses calibrated confidence

— Cross-domain transfer learning prior —
  R_new_domain = Σ_i ρ_{ij} × R_i
  ρ = domain correlation matrix (12×12)
Rating Range   Classification   Benchmark Equivalent                     K-Factor
< 1000         Novice           Pre-undergraduate                        32
1000–1200      Intermediate     Undergraduate CS / AI                    32 → 16
1200–1400      Advanced         Graduate / industry practitioner         16
1400–1600      Expert           Senior researcher / principal engineer   16 → 8
> 1600         Elite            Domain specialist / PhD                  8
03 — Item Response Theory (IRT) 3PL MODEL

Each of the 506 validated items is characterized by three IRT parameters estimated via marginal maximum likelihood (EM algorithm). The 3-Parameter Logistic (3PL) model is the psychometric standard for high-stakes adaptive testing — identical to the model used by the GRE, GMAT, and USMLE.

— 3-Parameter Logistic Model —
  P(θ) = c + (1 − c) / (1 + exp(−D·a·(θ − b)))
  Parameters:
    θ  latent ability, measured on a logit scale, N(0,1) population
    a  discrimination — how well the item separates ability levels [0.5, 2.5]
    b  difficulty — ability level where P(correct) = 0.5 + c/2 [−3, +3]
    c  pseudo-guessing — lower asymptote (0.20–0.25 for 4-option MCQ)
    D  1.702 scaling constant (normal ogive approximation)

— Item Information Function (peaks where the item is most discriminative) —
  I(θ) = D²a² × [ (P(θ) − c)² / (1 − c)² ] × [ Q(θ) / P(θ) ]
  Peaks near θ = b: maximum discrimination around the item's difficulty level

— Computerized Adaptive Testing (CAT): next item maximizes Fisher information —
  I_test(θ) = Σ_i I_i(θ) — adaptive selection maximizes total information at θ̂

— Ability estimation (Expected a Posteriori, EAP) —
  θ̂ = ∫ θ · L(u|θ) · g(θ) dθ / ∫ L(u|θ) · g(θ) dθ
  g(θ) = N(0,1) prior; posterior updated after every response
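The 3PL response curve and its information function transcribe directly into Python (a sketch; function names are illustrative):

```python
import math

D = 1.702  # normal ogive scaling constant

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response at latent ability theta."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c):
    """3PL item information I(theta); largest near theta = b."""
    p = p_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((p - c) ** 2 / (1 - c) ** 2) * ((1 - p) / p)

# At theta = b the response probability is 0.5 + c/2, as stated above
p_at_b = p_3pl(1.0, a=1.2, b=1.0, c=0.2)   # 0.6 for c = 0.2
```

Because information decays quickly away from b, a CAT engine that queries `item_information` at the current ability estimate naturally concentrates items around the examinee's frontier.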
Discrimination (a) Distribution
  • a < 0.8 — poor discriminators (flagged, removed)
  • 0.8 ≤ a < 1.2 — fair discriminators
  • 1.2 ≤ a < 1.8 — good discriminators (majority)
  • a ≥ 1.8 — excellent discriminators (<15% of bank)
CAT Stopping Rules
  • Minimum items: 5 (reliability floor)
  • Maximum items: 40 (fatigue ceiling)
  • SEM < 0.30: adaptive stop (precision achieved)
  • Content balance: max 4 items per topic cluster
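The stopping rules and maximum-information selection can be sketched as follows (toy item pool and hypothetical helper names, not the production engine; the content-balance constraint is omitted for brevity):

```python
import math

D = 1.702  # normal ogive scaling constant

def info_3pl(theta, a, b, c):
    """3PL item information at ability theta (standard formula)."""
    p = c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((p - c) ** 2 / (1 - c) ** 2) * ((1 - p) / p)

def should_stop(n_items, sem, min_items=5, max_items=40, sem_target=0.30):
    """Stopping rules from above: reliability floor (5 items),
    fatigue ceiling (40 items), adaptive stop once SEM < 0.30."""
    if n_items < min_items:
        return False
    if n_items >= max_items:
        return True
    return sem < sem_target

def next_item(theta_hat, pool, used):
    """Select the unused item with maximum Fisher information at theta_hat."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: info_3pl(theta_hat, *pool[i]))

pool = [(1.0, -1.0, 0.2), (1.5, 0.0, 0.2), (1.0, 2.0, 0.2)]  # (a, b, c) triples
best = next_item(0.0, pool, used=set())   # picks the item with b nearest theta_hat
```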
04 — Knowledge Graph Traversal 7,560 NODES

Questions are not stored statically — they are generated by traversal of a structured knowledge graph encoding conceptual relationships, prerequisite dependencies, and difficulty gradients. The graph enables effectively infinite question generation from finite conceptual coverage.

KNOWLEDGE GRAPH HIERARCHY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Level 0 — Domains    12 domains (ML, DL, CV, NLP, RL, Stats, Math, CS, Ethics, Bio, Physics, Crypto)
Level 1 — Topics     ~42 topics / domain       504 topic nodes total
Level 2 — Concepts   ~15 concepts / topic      7,560 concept nodes total
Level 3 — Facts      ~8 facts / concept        ~60,480 fact leaves total
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Edges: prerequisite (directed acyclic), semantic similarity (undirected),
       difficulty-gradient (real-valued weight)
Scale: ~7,560 concept nodes × 8 facts × n templates = effectively infinite item generation

Traversal uses Thompson Sampling — a Bayesian bandit algorithm that balances exploration (probing uncertain conceptual areas) with exploitation (targeting the difficulty boundary where information gain is highest).

— Thompson Sampling on concept nodes —
  For each concept node c_i: maintain a Beta(α_i, β_i) distribution
  Sample: s_i ~ Beta(α_i, β_i)   proxy for expected information gain
  Select: c* = argmax_i s_i      concept with highest sampled utility
  Update: correct answer → α_i += 1; incorrect → β_i += 1 (conjugate Bayesian update)

— Bloom's Taxonomy weighting applied to traversal depth —
  Remember     15%   factual recall            (difficulty b ∈ [−2, −1])
  Understand   25%   conceptual grasp          (difficulty b ∈ [−1, 0])
  Apply        25%   procedural use            (difficulty b ∈ [0, 0.5])
  Analyze      20%   structural decomposition  (difficulty b ∈ [0.5, 1.5])
  Evaluate     10%   critical judgment         (difficulty b ∈ [1, 2])
  Create        5%   synthesis / generation    (difficulty b ∈ [1.5, 3])
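The traversal policy reduces to a few lines of Python (a sketch with toy concept names; the production graph would track one such posterior per concept node):

```python
import random

random.seed(0)  # deterministic draws for illustration

def thompson_select(posteriors):
    """Draw one sample from each concept's Beta(alpha, beta) posterior
    and select the concept with the highest draw."""
    draws = {c: random.betavariate(a, b) for c, (a, b) in posteriors.items()}
    return max(draws, key=draws.get)

def record_answer(posteriors, concept, correct):
    """Conjugate update: correct -> alpha += 1, incorrect -> beta += 1."""
    a, b = posteriors[concept]
    posteriors[concept] = (a + 1, b) if correct else (a, b + 1)

# toy posteriors: "attention" currently looks most informative to probe
posteriors = {"backprop": (1, 1), "attention": (5, 1), "convexity": (1, 5)}
chosen = thompson_select(posteriors)
record_answer(posteriors, chosen, correct=True)
```

Because selection is by sampling rather than by posterior mean, uncertain concepts (wide posteriors) still get probed occasionally, which is exactly the exploration behavior the traversal needs.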
Product

Five Game Modes

Each mode applies a distinct psychometric lens — from rapid Elo calibration to deep Bloom's-level probing to creative generation.

01
Classic
Fixed 10 questions. 30s per item. Full Elo-rated. CAT item selection. Bayesian 95% CI tracking.
02
Blitz
60-second rapid-fire. Unlimited items. Streak bonuses. τ → 0 (greedy exploitation mode).
03
Deep Dive
5 questions per topic cluster. Traverses all 6 Bloom's levels within a single domain tree.
04
Prompt Golf
Explain a concept in fewest tokens. Creative evaluation beyond MCQ. LLM-as-judge scoring rubric.
05
Party
Synchronous multiplayer. Identical item pool per session. Real-time leaderboard. Social Elo delta.
Psychometric Validation

Item Bank Validation

506 items validated through rigorous psychometric standards — matching methodologies used in professional certification exams (GRE, GMAT, USMLE).

α > 0.85
Cronbach's alpha — internal consistency across all 12 domains
3×
Expert reviewers per item — content validity panel (ICC ≥ 0.75 required)
r > 0.20
Point-biserial correlation threshold — items below removed before IRT calibration
DIF
Differential Item Functioning tested — demographic bias eliminated (MH + logistic regression)
Validation Methodology Details

Content Validity: Expert panel review. 3 domain specialists per item rate relevance, clarity, and cognitive level using the modified Angoff standard-setting method. Intraclass Correlation Coefficient ≥ 0.75 required for item inclusion.

Construct Validity: Confirmatory factor analysis (CFA) confirms single-factor structure within each domain. Goodness-of-fit: RMSEA < 0.06, CFI > 0.95, SRMR < 0.08.

Item Analysis (CTT Pre-screen): Classical Test Theory pre-screening applied before IRT calibration. Difficulty index (p) 0.20–0.80; point-biserial r ≥ 0.20. Items failing either threshold removed.
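Both CTT thresholds can be computed directly (illustrative helper names; `item_scores` is the 0/1 response vector for one item, `total_scores` the examinees' total test scores):

```python
import statistics

def difficulty_index(item_scores):
    """Proportion correct (p). Retain items with 0.20 <= p <= 0.80."""
    return sum(item_scores) / len(item_scores)

def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a dichotomous item and the
    total score. Items with r < 0.20 are dropped before IRT calibration."""
    p = difficulty_index(item_scores)
    correct = [t for s, t in zip(item_scores, total_scores) if s == 1]
    wrong = [t for s, t in zip(item_scores, total_scores) if s == 0]
    sd = statistics.pstdev(total_scores)          # population SD of totals
    m1, m0 = statistics.mean(correct), statistics.mean(wrong)
    return (m1 - m0) / sd * (p * (1 - p)) ** 0.5
```

A positive point-biserial means examinees who answer the item correctly also score higher overall, i.e. the item pulls in the same direction as the test.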

Differential Item Functioning: Mantel-Haenszel χ² and logistic regression DIF detection across gender and education-level groups. Effect size |Δ| < 0.43 (ETS Class A threshold) required for all retained items.

Knowledge Coverage

12 AI/ML Domains

Full-stack AI knowledge coverage from mathematical foundations to applied ethics and cryptographic security.

01 Machine Learning

Supervised / unsupervised / semi-supervised. PAC learning, VC dimension, bias-variance tradeoff, regularization theory, ensemble methods.

02 Deep Learning

Backpropagation, gradient flow pathologies, attention mechanisms, Transformers, normalization (BN/LN/RMS), architectural families (CNN/RNN/GNN).

03 Computer Vision

Convolutional architectures, object detection (YOLO / DETR / Faster-RCNN), segmentation, optical flow, multi-view geometry, NeRF.

04 NLP

Tokenization, language modeling, seq2seq, RLHF, RAG, embedding geometry, evaluation benchmarks (MMLU, BIG-bench, HellaSwag).

05 Reinforcement Learning

MDPs, Bellman equations, Q-learning, policy gradients (REINFORCE / PPO / SAC), actor-critic, MCTS, multi-agent game theory.

06 Statistics

Bayesian inference, frequentist hypothesis testing, GLMs, survival analysis, causal inference (do-calculus, structural causal models).

07 Mathematics

Linear algebra, multivariate calculus, optimization (convex / non-convex), information theory (KL, MI, entropy), measure theory.

08 Computer Science

Algorithms, complexity theory, distributed systems (CAP theorem, consensus protocols), data structures for ML workloads.

09 AI Ethics

Fairness metrics (demographic parity, equalized odds, calibration), interpretability (SHAP, LIME, attention), alignment research.

10 Biology / BioML

AlphaFold, protein structure prediction, genomics (GWAS, single-cell RNA-seq), drug discovery pipelines, biological sequence models.

11 Physics / Simulation

Physics-informed neural networks (PINNs), quantum ML, scientific computing, Monte Carlo methods, differentiable simulation / autodiff.

12 Cryptography

Homomorphic encryption, zero-knowledge proofs, secure multi-party computation, differential privacy, federated learning protocols.

Revenue Model

Pricing & Monetization

Freemium SaaS with individual and institutional licensing tracks. B2B enterprise is the primary revenue engine.

Explorer — Free
$0
Forever free tier
  • Classic mode — 10 items per day
  • 3 domains unlocked
  • Basic Elo tracking
  • Community leaderboard access
Enterprise
$499
per seat / year
  • Custom item bank upload
  • White-label deployment
  • SSO / SCIM provisioning
  • LMS integration (xAPI / SCORM)
  • Psychometric audit reports
  • Dedicated support SLA
Financial Projections

Revenue Projections

Conservative 3-year model. Institutional licensing and B2B API channels dominate from Year 2 onward.

Revenue Segment                    Year 1    Year 2    Year 3
Pro Subscriptions (consumer)       $48K      $192K     $480K
Enterprise Seat Licensing          $120K     $600K     $2.1M
API / White-label Licensing        $36K      $180K     $540K
Bootcamp / Cohort Licensing        $24K      $96K      $288K
Total ARR                          $228K     $1.07M    $3.41M
Unit Economics
  • Pro CAC target: < $18 (organic + SEO primary channel)
  • Pro LTV: $9.99/mo × 14 mo avg retention = $140
  • LTV : CAC ratio: ~7.8× (target > 3× for SaaS health)
  • Gross margin: ~87% (AI generation cost falls with volume)
  • Enterprise CAC: ~$4,200 (outbound + demo cycle)
  • Enterprise LTV: $499/seat × 40 seats × 3yr = $59,880
  • Enterprise LTV : CAC: ~14× at scale
  • Payback period: < 4 months (Pro); < 10 months (Enterprise)
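The headline ratios above follow from simple arithmetic; a quick sanity check (all figures taken from the bullets, not new data):

```python
pro_ltv = 9.99 * 14            # $139.86, rounded to ~$140 above
pro_ratio = pro_ltv / 18       # ~7.8x LTV:CAC against the $18 CAC target
ent_ltv = 499 * 40 * 3         # $59,880 per 40-seat, 3-year account
ent_ratio = ent_ltv / 4200     # ~14.3x LTV:CAC at the ~$4,200 CAC estimate
```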
Market

Market Opportunity

AIQC sits at the intersection of three high-growth markets with strong secular tailwinds driven by AI adoption.

EdTech TAM
$387B
Global EdTech by 2028 (HolonIQ)
AI Skills SAM
$62B
AI / ML upskilling market 2026E
Assessment SOM
$8.4B
Online assessment platforms 2025
AI Roles Posted
+340%
YoY growth in AI engineer listings 2024
Competitive Differentiation
Execution

Build Status & Roadmap

COMPLETE
  • 506 validated items across 12 domains
  • Classic + Blitz + Deep Dive modes (live)
  • Elo rating with K-factor schedule
  • 3PL IRT calibration for all items
  • Knowledge graph (7,560 concept nodes)
  • 10-language i18n (EN/ES/FR/DE/JA/ZH/AR/HI/PT/KO)
  • Stripe payment integration
  • Mobile-responsive PWA with service worker
IN REDESIGN
  • Prompt Golf mode — LLM-as-judge evaluation
  • Party / multiplayer mode — WebSocket sync
  • IRT analytics dashboard — ability curves, item maps
  • Bayesian CI visualization — uncertainty bands
  • Employer credential export — signed PDF report
  • API v2 — RESTful, xAPI / SCORM output
  • LMS integration layer — Moodle / Canvas / Blackboard
  • White-label enterprise portal