Adversarial AI Evaluation via Competitive Assessment
A multi-layer psychometric engine combining adversarial game theory, Bayesian Elo tracking, Item Response Theory, and knowledge graph traversal.
AIQC operationalizes the red-team / blue-team adversarial paradigm at the item generation layer. A Red-team LLM generates maximally discriminative questions by targeting the boundary of known knowledge, while a Blue-team LLM attempts to answer, revealing capability gaps and knowledge horizons. The human player is inserted in place of the Blue team — receiving questions calibrated in real time to expose their exact knowledge frontier.
This formulation borrows from the RLHF literature on adversarial preference learning, but applies it to psychometric item generation rather than reward modeling. At equilibrium, the Red team generates questions the player answers correctly roughly half the time; under a logistic response model, that is precisely the difficulty with maximal discriminative power at the player's current estimated ability.
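The boundary-targeting rule can be made concrete with a minimal sketch (function names are illustrative, not from the AIQC codebase): under a 2PL logistic response model, an item's Fisher information peaks when its difficulty matches the player's estimated ability, i.e. when the expected success probability is 0.5.

```python
import math

def p_correct(theta: float, b: float, a: float = 1.0) -> float:
    """2PL success probability for ability theta vs. item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def discriminative_value(theta: float, b: float, a: float = 1.0) -> float:
    """Fisher information of a 2PL item: a^2 * p * (1 - p).
    Maximal when b == theta, i.e. when p == 0.5 -- the Red team's target."""
    p = p_correct(theta, b, a)
    return a * a * p * (1.0 - p)

def pick_item(theta: float, candidate_bs: list[float]) -> float:
    """Red-team selection rule: the candidate difficulty closest to theta wins."""
    return max(candidate_bs, key=lambda b: discriminative_value(theta, b))
```

For a player at theta = 0, `pick_item` selects the candidate nearest zero difficulty, since information decays symmetrically as |theta - b| grows.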
Player skill is tracked using an adapted Elo rating system with Bayesian uncertainty quantification. Unlike chess Elo, AIQC uses per-domain ratings with cross-domain transfer learning priors — a player strong in Machine Learning receives a calibrated prior boost for Statistics questions, weighted by the domain correlation matrix.
| Rating Range | Classification | Benchmark Equivalent | K-Factor |
|---|---|---|---|
| < 1000 | Novice | Pre-undergraduate | 32 |
| 1000–1200 | Intermediate | Undergraduate CS / AI | 32 → 16 |
| 1200–1400 | Advanced | Graduate / industry practitioner | 16 |
| 1400–1600 | Expert | Senior researcher / principal engineer | 16 → 8 |
| > 1600 | Elite | Domain specialist / PhD | 8 |
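A minimal sketch of the per-domain update, assuming the K-factor schedule in the table above and an illustrative correlation entry (`DOMAIN_CORR`, the mid-band cutoffs, and all names here are hypothetical; the Glicko-style uncertainty term is omitted for brevity):

```python
# Illustrative correlation entry for the cross-domain transfer prior.
DOMAIN_CORR = {("ml", "stats"): 0.7}

def k_factor(rating: float) -> int:
    """K-factor schedule from the rating table; the '32 -> 16' and
    '16 -> 8' bands are stepped at an assumed midpoint."""
    if rating < 1000:
        return 32
    if rating < 1200:
        return 32 if rating < 1100 else 16
    if rating < 1400:
        return 16
    if rating < 1600:
        return 16 if rating < 1500 else 8
    return 8

def transfer_prior(base: float, source_rating: float, src: str, dst: str) -> float:
    """Initialize a new domain's rating from a correlated domain."""
    rho = DOMAIN_CORR.get((src, dst), DOMAIN_CORR.get((dst, src), 0.0))
    return base + rho * (source_rating - base)

def elo_update(rating: float, item_rating: float, correct: bool) -> float:
    """Standard Elo update treating each question as a rated opponent."""
    expected = 1.0 / (1.0 + 10 ** ((item_rating - rating) / 400))
    return rating + k_factor(rating) * ((1.0 if correct else 0.0) - expected)
```

For example, a player rated 1400 in Machine Learning would start Statistics at 1000 + 0.7 × 400 = 1280 rather than the cold-start baseline.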
Each of the 506 validated items is characterized by three IRT parameters estimated via marginal maximum likelihood (EM algorithm). The 3-Parameter Logistic (3PL) model is a psychometric standard for high-stakes adaptive testing, the same model family used by large-scale exams such as the GRE, GMAT, and USMLE.
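The 3PL response function itself is compact enough to state directly (a sketch; parameter values are illustrative):

```python
import math

def p_3pl(theta: float, a: float, b: float, c: float) -> float:
    """3PL model: P(correct) = c + (1 - c) / (1 + exp(-a * (theta - b)))
    a = discrimination (slope), b = difficulty (location),
    c = pseudo-guessing floor (lower asymptote)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

At theta = b the curve sits exactly halfway between the guessing floor c and 1; with c = 0.2 (a four-option multiple-choice floor), that midpoint is 0.6.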
Questions are not stored statically — they are generated by traversal of a structured knowledge graph encoding conceptual relationships, prerequisite dependencies, and difficulty gradients. The graph enables effectively infinite question generation from finite conceptual coverage.
Traversal uses Thompson Sampling — a Bayesian bandit algorithm that balances exploration (probing uncertain conceptual areas) with exploitation (targeting the difficulty boundary where information gain is highest).
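The traversal step can be sketched as a Bernoulli bandit with Beta posteriors, one arm per graph concept (the node names and the "informative probe" reward signal here are illustrative assumptions, not the production reward):

```python
import random

# Beta(alpha, beta) posterior per concept: belief that probing this
# node yields high information about the player's frontier.
posteriors = {"backprop": [1, 1], "attention": [3, 1], "vc-dimension": [1, 4]}

def pick_concept() -> str:
    """Thompson step: sample one draw per posterior, probe the argmax.
    Uncertain concepts win sometimes (exploration); concepts with a
    strong track record win often (exploitation)."""
    samples = {c: random.betavariate(a, b) for c, (a, b) in posteriors.items()}
    return max(samples, key=samples.get)

def record(concept: str, informative: bool) -> None:
    """Conjugate update of the chosen concept's Beta posterior."""
    posteriors[concept][0 if informative else 1] += 1
```

Because selection is by posterior sampling rather than point estimates, a rarely probed concept with a wide posterior still gets drawn occasionally, which is exactly the exploration behavior described above.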
Each mode applies a distinct psychometric lens — from rapid Elo calibration to deep Bloom's-level probing to creative generation.
506 items validated through rigorous psychometric standards — matching methodologies used in professional certification exams (GRE, GMAT, USMLE).
Content Validity: Expert panel review. 3 domain specialists per item rate relevance, clarity, and cognitive level using the modified Angoff standard-setting method. Intraclass Correlation Coefficient ≥ 0.75 required for item inclusion.
Construct Validity: Confirmatory factor analysis (CFA) confirms single-factor structure within each domain. Goodness-of-fit: RMSEA < 0.06, CFI > 0.95, SRMR < 0.08.
Item Analysis (CTT Pre-screen): Classical Test Theory pre-screening applied before IRT calibration. Difficulty index (p) 0.20–0.80; point-biserial r ≥ 0.20. Items failing either threshold removed.
Differential Item Functioning: Mantel-Haenszel χ² and logistic regression DIF detection across gender and education-level groups. Effect size |Δ| < 0.43 (ETS Class A threshold) required for all retained items.
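The CTT pre-screen thresholds above are simple to compute; a minimal sketch on 0/1-coded response vectors (population standard deviation is assumed; function names are illustrative):

```python
import math

def difficulty_index(responses: list[int]) -> float:
    """p = proportion of examinees answering the item correctly."""
    return sum(responses) / len(responses)

def point_biserial(item: list[int], totals: list[float]) -> float:
    """r_pb = (M_correct - M_all) / s * sqrt(p / q): correlation between
    the 0/1 item score and the total test score."""
    n = len(item)
    p = sum(item) / n
    mean_all = sum(totals) / n
    sd = math.sqrt(sum((t - mean_all) ** 2 for t in totals) / n)
    mean_correct = sum(t for x, t in zip(item, totals) if x) / sum(item)
    return (mean_correct - mean_all) / sd * math.sqrt(p / (1 - p))

def passes_prescreen(item: list[int], totals: list[float]) -> bool:
    """Thresholds from the validation protocol: 0.20 <= p <= 0.80
    and point-biserial r >= 0.20."""
    p = difficulty_index(item)
    return 0.20 <= p <= 0.80 and point_biserial(item, totals) >= 0.20
```

An item everyone answers correctly (p = 1.0) fails the difficulty band, and an item whose correct answerers do not score higher overall fails the discrimination check; either failure removes the item before IRT calibration.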
Full-stack AI knowledge coverage from mathematical foundations to applied ethics and cryptographic security.
Supervised / unsupervised / semi-supervised. PAC learning, VC dimension, bias-variance tradeoff, regularization theory, ensemble methods.
Backpropagation, gradient flow pathologies, attention mechanisms, Transformers, normalization (BN/LN/RMS), architectural families (CNN/RNN/GNN).
Convolutional architectures, object detection (YOLO / DETR / Faster-RCNN), segmentation, optical flow, multi-view geometry, NeRF.
Tokenization, language modeling, seq2seq, RLHF, RAG, embedding geometry, evaluation benchmarks (MMLU, BIG-bench, HellaSwag).
MDPs, Bellman equations, Q-learning, policy gradients (REINFORCE / PPO / SAC), actor-critic, MCTS, multi-agent game theory.
Bayesian inference, frequentist hypothesis testing, GLMs, survival analysis, causal inference (do-calculus, structural causal models).
Linear algebra, multivariate calculus, optimization (convex / non-convex), information theory (KL, MI, entropy), measure theory.
Algorithms, complexity theory, distributed systems (CAP theorem, consensus protocols), data structures for ML workloads.
Fairness metrics (demographic parity, equalized odds, calibration), interpretability (SHAP, LIME, attention), alignment research.
AlphaFold, protein structure prediction, genomics (GWAS, single-cell RNA-seq), drug discovery pipelines, biological sequence models.
Physics-informed neural networks (PINNs), quantum ML, scientific computing, Monte Carlo methods, differentiable simulation / autodiff.
Homomorphic encryption, zero-knowledge proofs, secure multi-party computation, differential privacy, federated learning protocols.
Freemium SaaS with individual and institutional licensing tracks. B2B enterprise is the primary revenue engine.
Conservative 3-year model. Institutional licensing and B2B API channels dominate Year 2 onwards.
| Revenue Segment | Year 1 | Year 2 | Year 3 |
|---|---|---|---|
| Pro Subscriptions (consumer) | $48K | $192K | $480K |
| Enterprise Seat Licensing | $120K | $600K | $2.1M |
| API / White-label Licensing | $36K | $180K | $540K |
| Bootcamp / Cohort Licensing | $24K | $96K | $288K |
| Total ARR | $228K | $1.07M | $3.41M |
AIQC sits at the intersection of three high-growth markets with strong secular tailwinds driven by AI adoption.