Validation & Transparency

Validated. Transparent. Trustworthy.

Zetyra's calculators collectively pass 896 automated validation tests against industry gold standards. Our GSD module matches gsDesign within 0.034 z-score units. Our Bayesian Toolkit is validated end-to-end, from prior elicitation through sequential monitoring. Calculators are cross-checked against 11 published clinical trials (Salk 1954, HPTN 083, HeartMate II, PACIFIC, MONALEESA-7, DAPA-HF, REBYOTA / PUNCH CD2 & CD3, I-SPY 2, STAMPEDE, REMAP-CAP, NCT03377023), plus the Leyrat 2024 cluster-RCT worked example. Every table on this page links directly to the open-source script that produced the number.

Version 2.3 · Last updated April 2026

1. Overview

896 automated tests passing
Max deviation: 0.034 z-score
Benchmarked against gsDesign, pwr, scipy
11 published clinical trials used in validation
Open source (MIT license)
42 scripts, continuous integration

Unlike proprietary alternatives, our validation is public.

Our complete validation suite—42 scripts, 896 tests—runs continuously on GitHub Actions. Anyone can verify our accuracy, examine our methodology, and reproduce our results. No black boxes—just glass boxes.

View Live Validation Suite

2. Results Summary

Table 1

Validation Results by Calculator

Calculator | Tests | Reference | Max Deviation | Status
Two-Sample Sample Size | 50 | Cohen (1988), Schoenfeld (1981), closed-form normal approx | exact match | Pass
Chi-Square Test | 55 | scipy.stats.chi2, chi2_contingency, fisher_exact | < 1.5e-6 | Pass
Cluster-Randomized Trial | 61 | Donner & Klar (2000), small-cluster t-correction, live MC sims | MC-based | Pass
Longitudinal / Repeated Measures | 50 | Diggle et al. (2002), Frison & Pocock (1992), live MC sims | exact matrix form | Pass
Group Sequential Design | 30 | gsDesign R package | 0.034 z-score | Pass
GSD PACIFIC OS | 17 | Antonia et al. (2018) NEJM | 0.022 z-score | Pass
GSD MONALEESA-7 OS | 20 | Im et al. (2019) NEJM | 0.022 z-score | Pass
GSD Survival/TTE | 15 | Schoenfeld (1983), gsDesign | < 0.001 | Pass
GSD Survival gsDesign | 36 | gsDesign R (boundaries, alpha spending) | 0.034 z-score | Pass
CUPED | 12 | Analytical formulas | < 0.001 | Pass
CUPED Simulation | 43 | MC simulation, Deng et al. (2013) | < 0.02 | Pass
Beta-Binomial Conjugate | 9 | Lee & Liu (2008) PPoS + Gelman et al. (2013) | exact (conjugate formula) | Pass
Normal-Normal Conjugate | 8 | Spiegelhalter et al. (2004) + Gelman et al. (2013) | exact (conjugate formula) | Pass
Bayesian Survival | 21 | Normal-Normal on log(HR) | < 0.01 | Pass
Bayesian Survival Benchmark | 25 | Conjugate oracle, MC PP cross-val | < 0.03 | Pass
Prior Elicitation | 22 | ESS formula, scipy.optimize | < 0.001 | Pass
Bayesian Borrowing | 18 | Power prior, Cochran's Q | < 0.01 | Pass
Bayesian Sample Size | 26 | Binomial CI, MC search (binary + continuous) | CI-based | Pass
Bayesian Two-Arm | 24 | Binomial CI, MC search (binary + continuous) | CI-based | Pass
Bayesian Sequential | 20 | Zhou & Ji (2024) | < 0.0001 | Pass
Bayesian Sequential Table 3 | 27 | Zhou & Ji (2024) Table 3 + R code | < 0.005 | Pass
Bayesian Sequential Survival | 24 | Zhou & Ji (2024) + Schoenfeld | < 0.0001 | Pass
Bayesian Seq. Survival Benchmark | 24 | Zhou & Ji formula + Type I error | < 0.02 | Pass
SSR Blinded | 20 | Conditional power formulas | < 0.001 | Pass
SSR Unblinded | 21 | Zone classification, CP | < 0.001 | Pass
SSR Single-Arm (Phase II ORR) | 13 | Beta-Binomial conjugate, Lee & Liu (2008), Saville et al. (2014) | < 0.001 | Pass
NCT03377023 Replication (Nivo+Ipi+Nintedanib NSCLC) | 13 | Real Bayesian Phase II w/ published interim+final outcomes (Moffitt) | SAP boundary: PPoS(r₁=2)=0.31 > 0.20; PPoS(r₁=1)=0.08 ≤ 0.20 | Pass
SSR gsDesign Benchmark | 14 | gsDesign R, reference formulas | 0 (exact) | Pass
RAR (Adaptive Randomization) | 20 | Rosenberger optimal, DBCD, Thompson | < 0.001 | Pass
Minimization (Pocock-Simon) | 17 | Imbalance reduction benchmark | MC-based | Pass
Basket Trial | 21 | Independent, BHM, EXNEX | < 0.001 | Pass
Umbrella Trial | 21 | Freq/Bayesian × 3 endpoints | MC-based | Pass
Platform Trial (MAMS) | 24 | Boundaries, staggered entry, control | MC-based | Pass
I-SPY 2 Replication | 10 | Barker et al. (2009), pCR rates | < 0.001 | Pass
STAMPEDE Replication | 9 | Sydes et al. (2012), MAMS OS/FFS | MC-based | Pass
REMAP-CAP Replication | 8 | Angus et al. (2020), Bayesian | MC-based | Pass
Salk 1954 Polio Trial Replication | 8 | Francis Report (1955): 200,745 vaccine vs 201,229 placebo; 33 vs 115 paralytic cases | χ² and Fisher exact: < 1e-10 | Pass
DAPA-HF Replication | 11 | McMurray et al. (2019) EJHF: HR=0.80, α=0.025 (one-sided), 90% power, 844 events target | Schoenfeld events: 845 vs published 844 | Pass
Leyrat 2024 Primary-Care CRT | 6 | Leyrat, Eldridge, Taljaard, Hemming (2024): p0=0.50 → p1=0.65, ICC=0.05, m=46, 24 clusters | Total clusters: exact match; N: 1102 vs 1104 | Pass
Offline References | 23 | Pure math (no API) | < 1e-10 | Pass
Total | 896 | 42 scripts | – | All Pass

3. Bayesian Toolkit (NEW)

248 tests across 13 scripts.

Each of the 6 Bayesian calculators has a dedicated test suite. Tests cover 8 categories of validation:

Analytical Correctness

Conjugate posteriors, boundary formulas, and ESS derivations compared against closed-form references

MC Calibration

Type I error and power checked with Clopper-Pearson binomial CIs that scale with simulation count

Schema Contracts

Response keys, types, and value bounds validated for every API call with strict/non-strict lower bounds

Input Guards

Invalid inputs (negative rates, out-of-range priors) return 400/422 with the offending field named

Boundary Conditions

Extreme priors (ESS=1 to 1000), zero/all events, single-look designs, near-zero and near-one rates

Invariants & Properties

Higher power → larger n, larger effect → smaller n, higher discount → higher ESS, monotone boundaries

Seed Reproducibility

Same seed produces identical MC results across repeated calls for sample size and two-arm designs

Symmetry

Null hypothesis gives same type I error regardless of label swap; identical studies yield I²=0
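The MC-calibration category above can be sketched with scipy's exact binomial interval. A minimal illustration (counts are hypothetical, not Zetyra's actual test code): an empirical rejection rate from null simulations passes if the Clopper-Pearson interval covers the nominal alpha.

```python
from scipy.stats import binomtest

# Hypothetical calibration run: 10,000 null simulations, 508 rejections
n_sims, n_reject = 10_000, 508

# Clopper-Pearson (exact) 95% CI for the empirical type I error rate
ci = binomtest(n_reject, n_sims).proportion_ci(confidence_level=0.95,
                                               method="exact")
ok = ci.low <= 0.05 <= ci.high   # nominal alpha must fall inside the interval
print(round(ci.low, 4), round(ci.high, 4), ok)
```

Because the interval width shrinks like 1/√n_sims, the same assertion automatically tightens as the simulation count grows, which is what "CIs that scale with simulation count" refers to.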

4. Real-World Trial Replications

HPTN 083

Phase 3 HIV Prevention Trial

Design: 4-look O'Brien-Fleming
Boundaries tested: 4
Max deviation: 0.012 z-score

Early looks match reference to ≤ 0.001. The 0.012 max at look 4 reflects accumulated MVN integration precision between scipy and R's mvtnorm, the same source that drives the PACIFIC/MONALEESA-7 gaps.
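The boundary machinery can be sanity-checked with scipy alone. A minimal sketch, assuming four equally spaced looks (an illustration only; HPTN 083's actual information fractions may differ) and the textbook K=4 O'Brien-Fleming constant c ≈ 2.024:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Four equally spaced looks (assumed for illustration). Classic O'Brien-Fleming
# boundaries are c / sqrt(t_k); c = 2.024 is the textbook K=4 constant for
# two-sided alpha = 0.05 (~0.025 one-sided upper crossing).
t = np.array([0.25, 0.50, 0.75, 1.00])
c = 2.024
bounds = c / np.sqrt(t)

# Sequential z-statistics are jointly normal with corr(Z_i, Z_j) = sqrt(t_i/t_j)
corr = np.sqrt(np.minimum.outer(t, t) / np.maximum.outer(t, t))
p_no_cross = multivariate_normal(mean=np.zeros(4), cov=corr).cdf(bounds)
print(round(1 - p_no_cross, 4))   # overall upper-crossing probability, ~0.025
```

The final call here is exactly the multivariate normal integration whose precision differences (scipy vs R's mvtnorm) account for the small late-look deviations discussed on this page.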

HeartMate II

LVAD Clinical Trial

Design: 3-look O'Brien-Fleming
Info fractions: [0.27, 0.67, 1.00]
Status: All properties verified

PACIFIC

SURVIVAL

Durvalumab, Stage III NSCLC (OS)

Design: 3-look Lan-DeMets OBF
Published boundary: p < 0.00274 (z = 2.78)
Max deviation: 0.022 z-score

Looks 1–2 match reference exactly (0.000). Look 3 deviation (0.022) is from MVN integration precision.

MONALEESA-7

SURVIVAL

Ribociclib, HR+ Breast Cancer (OS)

Design: 3-look Lan-DeMets OBF
Published boundaries: z = 3.60, 2.32
Max deviation: 0.022 z-score *

* Looks 1–2 match reference exactly (0.000). The 0.022 gap at look 2 is relative to the boundary printed in the paper and reflects a discrepancy in the published values: Zetyra and our independent Lan-DeMets reference agree with each other.

I-SPY 2

BASKET

Adaptive Breast Cancer Trial (pCR endpoint)

Design: Bayesian basket, 10 signatures
Drugs validated: Veliparib, Pembrolizumab, Neratinib
Status: All graduation decisions match

Published pCR rates reproduced via Beta-Binomial conjugate posteriors. Veliparib TNBC (51% vs 26%), Pembrolizumab TNBC (60% vs 22%).

STAMPEDE

PLATFORM

Prostate Cancer MAMS Trial (OS/FFS)

Design: 5-arm MAMS, 4 stages
Key result: Docetaxel OS HR=0.78
Status: Boundaries and power verified

OBF spending boundaries, docetaxel power, celecoxib futility detection (HR=0.98), and total sample size calculations all verified.

REMAP-CAP

PLATFORM

COVID-19 Bayesian Adaptive Platform

Design: Bayesian, 99% posterior threshold
Domains validated: IL-6 RA, Antivirals
Status: Superiority/futility verified

Tocilizumab superiority (mortality 28% vs 36%) and lopinavir futility correctly detected. Multi-domain staggered entry validated.

NCT03377023 (Nivo + Ipi + Nintedanib NSCLC)

BAYESIAN · SINGLE-ARM SSR

Phase II at Moffitt Cancer Center — Bayesian two-stage design with predictive-probability futility monitoring; both arms' published interim and final outcomes replicated end-to-end (Chen et al. 2019; JTO 2021; JCO 2023)

Arm A: ICI-naïve (p₀=0.30, p₁=0.50)

Sim power vs published: 0.778 vs 0.85
Final outcome: 9/22 (40.9% ORR)
Posterior P(p>p₀): 0.880 (under-enrolled)

Arm B: ICI-treated (p₀=0.07, p₁=0.20)

SAP rule at r₁=2/20: PPoS 0.31 > 0.20 → continued ✓
Final outcome: 6/28 (21.4% ORR)
Posterior P(p>p₀): 0.997 → success ✓

Why the 7-percentage-point power gap? It is an engine-design mismatch, not a bug. Zetyra's Single-Arm SSR calculator is a sample-size re-estimation tool: it treats the interim as a decision point for the final N (initial N ≈ 31 from the normal approximation, re-estimated upward toward the cap of 40 only when the interim data warrant it). The NCT03377023 SAP instead uses a fixed N = 40 two-stage Simon-style rule with no re-estimation. Simulated paths that do not trigger an SSR extension end with fewer than 40 observations, giving fewer trials a chance to reach the ≥ 17-success threshold, hence the lower simulated power. Our validation accepts ±15 percentage points for this reason. The headline claim is the decision-rule replication: at r₁ = 2/20, PPoS = 0.31 > 0.20 → continue (the trial did ✓); at 6/28 final, P(p > p₀) = 0.997 → success ✓. Independent verification confirms the SAP's own two-stage rule produces 0.846 power at p = 0.5, matching the published 0.85. The SAP is correct, and Zetyra is correct on the decision-rule side; only the engine-design choice differs.
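The final Arm B posterior call can be reproduced in a few lines of scipy, assuming a flat Beta(1,1) prior (an assumption for illustration; the SAP's actual prior may differ):

```python
from scipy.stats import beta

p0, r, n = 0.07, 6, 28            # Arm B: null ORR, final responders / enrolled
a, b = 1 + r, 1 + n - r           # Beta(1,1) prior -> Beta(7, 23) posterior

post = beta.sf(p0, a, b)          # posterior P(p > p0 | data)
print(round(post, 3))             # 0.997, matching the value reported above
```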

Salk (1954)

CHI-SQUARE · BINARY N

Francis Field Trial of poliomyelitis vaccine — the canonical randomized 2×2 teaching example

Placebo-controlled arm: 200,745 vaccine vs 201,229 placebo
Paralytic cases: 33 vs 115
χ² (Yates): 44.15, p = 3.1×10⁻¹¹

Vaccine efficacy 71.2%. Fisher's exact and Yates-corrected χ² both match scipy to ≤ 1e-10. The a-priori required N (34,837/arm) was far below the 200k actually enrolled per arm; the trial was decisively over-powered.
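The 2×2 arithmetic can be re-run against scipy directly, using the Francis Report counts quoted above:

```python
from scipy.stats import chi2_contingency, fisher_exact

# 2x2 table: rows = vaccine / placebo, cols = paralytic cases / no cases
table = [[33, 200745 - 33],
         [115, 201229 - 115]]

chi2, p, dof, _ = chi2_contingency(table, correction=True)  # Yates-corrected
odds, p_fisher = fisher_exact(table)

print(round(chi2, 2), p)   # chi2 ~44.15, p ~3e-11, as reported above
```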

DAPA-HF (2019)

SURVIVAL · SCHOENFELD

Dapagliflozin in HFrEF — event-driven Schoenfeld sample-size replication

Design: HR=0.80, α=0.025 (1-sided), power=0.90
Target events (published): 844
Zetyra Schoenfeld: 845 (ceiling, off by 1)

The 1-event gap is ceiling vs rounding: the raw Schoenfeld value is 844.09 (exact z from scipy), which Zetyra ceils up to 845 and the paper presents rounded to 844. Both deliver the designed 90% power. HR / power / allocation-ratio sensitivity all monotone in the expected direction. McMurray et al., Eur J Heart Fail 2019. PMID 30895697.
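The Schoenfeld event count can be checked in a few lines (1:1 allocation, as in the trial):

```python
from math import ceil, log
from scipy.stats import norm

hr, alpha, power = 0.80, 0.025, 0.90   # DAPA-HF design parameters
z_a, z_b = norm.ppf(1 - alpha), norm.ppf(power)

# Schoenfeld: d = (z_a + z_b)^2 / (q1 * q2 * log(HR)^2), q1 = q2 = 0.5 for 1:1
d_raw = (z_a + z_b) ** 2 / (0.5 * 0.5 * log(hr) ** 2)
print(round(d_raw, 2), ceil(d_raw))    # 844.09 -> 845 (ceiling)
```

Rounding 844.09 down to 844, as the paper presents it, versus taking the ceiling to 845 is exactly the 1-event gap described above.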

Leyrat et al. (2024) Primary-Care CRT

CLUSTER-RCT

Methodological worked example: behavior-change counseling delivered at GP-practice level, detecting a health-behavior change from 50% to 65%

Design parameters

p₀ → p₁: 0.50 → 0.65
ICC, cluster size: 0.05, m = 46
α, power: 0.05, 0.80

Published vs Zetyra

Design effect: 3.25 (exact)
Total clusters: 24 (exact)
Total N: 1,102 vs 1,104

Leyrat C, Eldridge S, Taljaard M, Hemming K (2024), J Epidemiol Popul Health 72(1):202198. Sensitivity band at ICC ∈ [0.02, 0.10] brackets the point estimate monotonically: 14 < 24 < 42 clusters. The 2-patient gap in total N comes from ceiling order: Zetyra applies 2 × ⌈nind × DE⌉ = 1,102; the paper ceils individual N first (340) then multiplies by DE (340 × 3.25 = 1,105, reported as 1,104). Both deliver the target power; Zetyra's is 2 patients tighter.
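A sketch of the reconstruction. The pooled-null-variance two-proportion formula below is our assumption about the underlying individual-n calculation; under that assumption it reproduces the design effect, the cluster count, and Zetyra's N = 1,102 (inflate first, then ceil):

```python
from math import ceil, sqrt
from scipy.stats import norm

p0, p1, icc, m = 0.50, 0.65, 0.05, 46      # Leyrat 2024 worked example
alpha, power = 0.05, 0.80
z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)

de = 1 + (m - 1) * icc                     # design effect = 3.25
pbar = (p0 + p1) / 2
# per-arm individual n: two-proportion formula, pooled variance under H0
n_ind = (z_a * sqrt(2 * pbar * (1 - pbar))
         + z_b * sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2 / (p1 - p0) ** 2

n_arm = ceil(n_ind * de)                   # inflate, then ceil (Zetyra order)
clusters = 2 * ceil(n_arm / m)
print(round(de, 2), 2 * n_arm, clusters)   # 3.25, 1102, 24
```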

REBYOTA (Fecal Microbiota)

BAYESIAN

FDA BLA 125739 — PUNCH CD2 (Phase 2b) & CD3 (Phase 3) for C. difficile infection

PUNCH CD2 (Phase 2b)

Data: 25/45 responders (55.6%)
Used in: Prior, Borrowing, Sample Size
Scenarios: 11 tests (δ = 0–1)

PUNCH CD3 (Phase 3)

Data: 126/177 treat, 53/85 placebo
Used in: Two-Arm, Borrowing (MAP)
Cross-phase I²: 40–90% detected

5. GSD Benchmark Details

Boundary accuracy vs gsDesign R package.

Two independent benchmarks validate GSD boundary accuracy against the gsDesign R package: one for standard (non-survival) designs, one for survival/TTE designs with Lan-DeMets spending functions.

Table 2a

Standard GSD Boundaries vs gsDesign (non-survival)

Design | Looks | Max Deviation | Status
O'Brien-Fleming | 2 | 0.0000 | Pass
O'Brien-Fleming | 3 | 0.0015 | Pass
O'Brien-Fleming | 4 | 0.0117* | Pass
O'Brien-Fleming | 5 | 0.0332* | Pass
Pocock | 2 | 0.0000 | Pass
Pocock | 3 | 0.0010 | Pass
Pocock | 4 | 0.0033 | Pass

* Deviations at later looks (k=4–5) reflect accumulated multivariate normal integration precision differences between scipy and R's mvtnorm. Looks 2–3 match reference to 3–4 decimals. The 0.0332 max at OBF k=5 is the headline "0.034 z-score" figure cited elsewhere on this page, reported conservatively (rounded up).

Table 2b

Survival GSD Boundaries vs gsDesign (Lan-DeMets spending)

Design | Looks | Max Deviation | Status
OBF (Lan-DeMets) | 3 | 0.0015 | Pass
OBF (Lan-DeMets) | 4 | 0.0117* | Pass
OBF (Lan-DeMets) | 5 | 0.0332* | Pass
Pocock (Lan-DeMets) | 3 | 0.0010 | Pass
Pocock (Lan-DeMets) | 4 | 0.0033 | Pass

* Deviations occur at later looks (k=4–5) due to accumulated multivariate normal integration precision differences between scipy and R's mvtnorm. Early looks match exactly (0.000). The max deviation of 0.033 is at OBF k=5 final look. Pocock boundaries show negligible deviation at all look counts.

6. Methodology

gsDesign

Group Sequential Design

Validated against the gold-standard gsDesign R package. O'Brien-Fleming and Pocock spending functions computed to match FDA submission standards. Survival/TTE via Schoenfeld.

\text{VRF} = 1 - \rho^2

CUPED

Variance reduction validated against analytical formulas. Sample size reduction proportional to baseline-outcome correlation squared.
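The VRF relation can be demonstrated on synthetic data (illustrative seed, correlation, and sample size; not Zetyra's validation script):

```python
import numpy as np

rng = np.random.default_rng(7)
n, rho = 100_000, 0.6
x = rng.normal(size=n)                                   # pre-experiment covariate
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)   # outcome with corr ~ rho

theta = np.cov(x, y)[0, 1] / np.var(x)
y_cuped = y - theta * (x - x.mean())        # CUPED-adjusted outcome

vrf = np.var(y_cuped) / np.var(y)           # remaining-variance factor
print(round(vrf, 2))                        # ~= 1 - rho^2 = 0.64
```

The remaining variance is 1 − ρ², so the sample-size saving grows with the square of the baseline-outcome correlation, as stated above.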

\text{Beta}(\alpha + x,\, \beta + n - x)

Bayesian Toolkit

6 calculators validated end-to-end: conjugate posteriors, Zhou & Ji boundaries, survival log(HR) mapping, Clopper-Pearson MC calibration, power priors, MAP heterogeneity, and ESS-based elicitation.

\text{Var}(\log \text{HR}) = 4/d

Survival/TTE

Event-driven designs validated via Schoenfeld variance mapping. GSD, Bayesian Sequential, and Bayesian Predictive Power all support time-to-event endpoints with HR-scale outputs.
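A minimal sketch of the Normal-Normal update on log(HR), with illustrative numbers (300 events, observed HR 0.80) and a deliberately vague N(0, 10²) prior rather than any Zetyra default:

```python
from math import log, sqrt
from scipy.stats import norm

d, hr_hat = 300, 0.80
se = sqrt(4 / d)                       # Schoenfeld: Var(log HR) = 4/d, 1:1 alloc.

# Normal-Normal conjugate update on the log(HR) scale, prior mean 0
prior_var, like_var = 10.0**2, se**2
post_var = 1 / (1 / prior_var + 1 / like_var)
post_mean = post_var * (log(hr_hat) / like_var)

p_benefit = norm.cdf(0, loc=post_mean, scale=sqrt(post_var))  # P(log HR < 0)
print(round(p_benefit, 3))             # ~0.973 with these illustrative inputs
```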

\text{CP}(z_{\text{interim}}, n)

Sample Size Re-estimation

Blinded and unblinded SSR validated against conditional power formulas. Zone classification, inflation caps, and threshold ordering verified for continuous, binary, and survival endpoints.
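A sketch of conditional power under the current-trend assumption (a standard textbook form; Zetyra's exact implementation may differ in detail):

```python
from math import sqrt
from scipy.stats import norm

def conditional_power(z1, t, alpha=0.025):
    """CP at information fraction t, assuming the interim trend continues."""
    z_crit = norm.ppf(1 - alpha)
    theta = z1 / sqrt(t)                        # drift implied by interim z
    return norm.cdf((z1 * sqrt(t) - z_crit) / sqrt(1 - t) + theta * sqrt(1 - t))

print(round(conditional_power(1.0, 0.5), 2))    # ~0.22: promising-zone territory
```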

\text{PPoS} + \text{Beta}(\alpha + r,\, \beta + n - r)

Single-Arm SSR

Bayesian single-arm Phase II (ORR) validated via Beta-Binomial conjugate posterior + predictive-probability futility monitoring. Decoupled γ_efficacy / γ_final thresholds, full operating-characteristics table with sample-size re-estimation. Replicated end-to-end against NCT03377023 interim and final decisions.
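The PPoS machinery can be sketched generically; every parameter below (interim counts, null rate, threshold γ, prior) is illustrative, not an NCT03377023 SAP value:

```python
from scipy.stats import beta, betabinom

# Hypothetical interim: r responders in n1, n2 still to enroll; final success
# if the posterior P(p > p0) clears gamma
r, n1, n2, p0, gamma = 5, 20, 20, 0.20, 0.95
a, b = 1 + r, 1 + n1 - r                      # flat Beta(1,1) prior

ppos = 0.0
for x in range(n2 + 1):                       # enumerate future responder counts
    post_sf = beta.sf(p0, a + x, b + n2 - x)  # final posterior P(p > p0)
    if post_sf >= gamma:
        ppos += betabinom.pmf(x, n2, a, b)    # predictive prob of that path
print(round(ppos, 3))
```

PPoS is the predictive probability, under the current posterior, that the finished trial will declare success; futility monitoring stops when it drops below a pre-specified floor.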

n = (z_\alpha + z_\beta)^2 \, \sigma^2 / \delta^2

Two-Sample & χ² Tests

Two-sample sample size (continuous / binary / survival) validated against closed-form Cohen's d benchmarks and Schoenfeld log-rank events. Pearson χ² with Yates correction, McNemar, and Fisher's exact validated against scipy via a Node bridge that exercises the exact TypeScript module shipped to users.
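The closed-form benchmark is easy to reproduce; for example, Cohen's d = 0.5 at 80% power and two-sided α = 0.05:

```python
from math import ceil
from scipy.stats import norm

alpha, power, d = 0.05, 0.80, 0.5             # two-sided alpha, Cohen's d
z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
n_per_arm = 2 * (z_a + z_b) ** 2 / d ** 2     # sigma = 1 on the d scale
print(ceil(n_per_arm))   # 63 per arm (normal approx; t-based software gives 64)
```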

\text{DE} = 1 + (m - 1)\,\rho

Cluster-Randomized Trial

Design-effect inflation validated against Donner & Klar. Small-cluster t-correction (the z-based formula is anti-conservative at k < 15/arm) verified via live random-intercept MC simulation. ICC sensitivity band recomputed independently at each endpoint. Replicated against the Leyrat 2024 worked example to the integer.

\text{Var}(\hat\beta) = \sigma^2 \, \mathbf{s}^{\top} \Sigma \mathbf{s} / (\mathbf{s}^{\top} \mathbf{s})^2

Longitudinal / Repeated Measures

Exact matrix slope variance under AR(1) and CS replaces the m→∞ asymptotic previously shipped (off by 2–14× in real-world regimes). ANCOVA, endpoint, and change-from-baseline variance formulas verified against Frison & Pocock (1992) Table 2. Empirical power confirmed via live LMM simulations.
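The exact matrix slope variance can be evaluated directly for an AR(1) working correlation; the visit count and correlation below are illustrative, not values from the validation suite:

```python
import numpy as np

m, rho, sigma2 = 5, 0.6, 1.0                    # visits, AR(1) corr, error var
t = np.arange(m, dtype=float)
s = t - t.mean()                                # centered time scores

Sigma = rho ** np.abs(np.subtract.outer(t, t))  # AR(1) correlation matrix
var_slope = sigma2 * s @ Sigma @ s / (s @ s) ** 2
print(round(var_slope, 4))                      # 0.1132 vs 0.1 under independence
```

Here the exact form inflates the iid slope variance (σ²/sᵀs = 0.1) by about 13%, the kind of gap the m→∞ asymptotic misses.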

\rho^{*} = \sqrt{p} \,/\, \textstyle\sum \sqrt{p}

Adaptive Randomization

RAR (DBCD, Thompson, Neyman) validated against Rosenberger optimal allocation theory. Minimization validated against pure random imbalance benchmarks. Binary, continuous, and survival endpoints.
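The Rosenberger optimal allocation above is a one-liner (arm response rates are illustrative):

```python
import numpy as np

p = np.array([0.30, 0.50])            # per-arm success probabilities
rho = np.sqrt(p) / np.sqrt(p).sum()   # Rosenberger optimal allocation
print(np.round(rho, 3))               # ~[0.436, 0.564]: tilt toward the better arm
```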

P(\theta > \theta_0 \mid \text{data})

Master Protocol

Basket (BHM, EXNEX), umbrella, and platform (MAMS) trials validated against conjugate theory and three published trials: I-SPY 2, STAMPEDE, and REMAP-CAP.

7. References

  1. GSD: Jennison & Turnbull (2000) Group Sequential Methods with Applications to Clinical Trials
  2. CUPED: Deng A, Xu Y, Kohavi R, Walker T (2013) Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data (WSDM)
  3. Bayesian: Gelman et al. (2013) Bayesian Data Analysis
  4. gsDesign: Anderson (2022) gsDesign R package
  5. Bayesian Sequential: Zhou T & Ji Y (2024) On Bayesian Sequential Clinical Trial Designs (New England J Statistics in Data Science 2(1))
  6. Prior Elicitation: Morita, Thall & Müller (2008) Determining the effective sample size of a parametric prior
  7. Survival: Schoenfeld (1983) Sample-size formula for the proportional-hazards regression model
  8. SSR: Cui, Hung & Wang (1999) Modification of sample size in group sequential clinical trials
  9. RAR: Rosenberger et al. (2001) Optimal adaptive designs for binary response trials
  10. Basket: Berry SM, Broglio KR, Groshen S, Berry DA (2013) Bayesian hierarchical modeling of patient subpopulations: Efficient designs of Phase II oncology clinical trials (Clin Trials 10(5):720-734)
  11. Platform: Saville BR & Berry SM (2016) Efficiencies of platform clinical trials: A vision of the future (Clin Trials 13(3):358-366)
  12. I-SPY 2: Barker AD, Sigman CC, Kelloff GJ, Hylton NM, Berry DA, Esserman LJ (2009) I-SPY 2: an adaptive breast cancer trial design in the setting of neoadjuvant chemotherapy (Clin Pharmacol Ther 86(1):97-100)
  13. STAMPEDE: Sydes MR et al. (2012) Flexible trial design in practice — stopping arms for lack-of-benefit and adding research arms mid-trial in STAMPEDE: a multi-arm multi-stage randomized controlled trial (Trials 13:168)
  14. REMAP-CAP: Angus DC et al. (2020) Effect of Hydrocortisone on Mortality and Organ Support in Patients With Severe COVID-19: The REMAP-CAP COVID-19 Corticosteroid Domain Randomized Clinical Trial (JAMA 324(13):1317-1329)
  15. Sample size textbook: Cohen (1988) Statistical Power Analysis for the Behavioral Sciences, 2nd ed.
  16. 2×2 continuity correction: Yates (1934) Contingency tables involving small numbers and the χ² test
  17. Cluster RCT: Donner & Klar (2000) Design and Analysis of Cluster Randomization Trials
  18. Longitudinal methods: Diggle, Heagerty, Liang & Zeger (2002) Analysis of Longitudinal Data, 2nd ed.
  19. Repeated measures: Frison L & Pocock SJ (1992) Repeated measures in clinical trials: analysis using mean summary statistics and its implications for design (Stat Med 11(13):1685-1704)
  20. Single-Arm SSR (PPoS): Lee & Liu (2008) A predictive probability design for phase II cancer clinical trials (Clin Trials)
  21. CRT worked example: Leyrat C, Eldridge S, Taljaard M, Hemming K (2024) Practical considerations for sample size calculation for cluster randomized trials (J Epidemiol Popul Health 72(1):202198, PMID 38477482)
  22. Salk polio trial: Francis T Jr. (1955) Evaluation of the 1954 Field Trial of Poliomyelitis Vaccine: Final Report (U. Michigan)
  23. DAPA-HF: McMurray JJV et al. (2019) A trial to evaluate the effect of dapagliflozin on morbidity and mortality in HFrEF (Eur J Heart Fail, PMID 30895697)
  24. PACIFIC: Antonia SJ et al. (2018) Overall Survival with Durvalumab after Chemoradiotherapy in Stage III NSCLC (NEJM 379:2342-2350)
  25. MONALEESA-7: Im SA et al. (2019) Overall Survival with Ribociclib plus Endocrine Therapy in Breast Cancer (NEJM 381:307-316)
  26. NCT03377023 Bayesian design: Chen DT, Schell MJ, Fulp WJ et al. (2019) Application of Bayesian predictive probability for interim futility analysis in single-arm phase II trial (Transl Cancer Res 8(Suppl 4):S404-S420, PMID 31456910)
  27. Promising-Zone SSR: Mehta CR & Pocock SJ (2011) Adaptive increase in sample size when interim results are promising: a practical guide with examples (Stat Med 30:3267–3284)
  28. Bayesian PP: Spiegelhalter DJ, Abrams KR & Myles JP (2004) Bayesian Approaches to Clinical Trials and Health-Care Evaluation (Wiley)
  29. RAR (DBCD): Hu F & Zhang LX (2004) Asymptotic properties of doubly adaptive biased coin designs for multitreatment clinical trials (Ann Stat 32(1):268–301)
  30. EXNEX: Neuenschwander B, Wandel S, Roychoudhury S, Bailey S (2016) Robust exchangeability designs for early phase clinical trials with multiple strata (Pharm Stat 15(2):123–134)
  31. Master Protocol: FDA (2022) Master Protocols: Efficient Clinical Trial Design Strategies to Expedite Development of Oncology Drugs and Biologics (Guidance for Industry)

The only clinical trial design platform with public, continuously validated accuracy.

896 tests. 42 scripts. Every calculator validated.