
Bayesian Sequential Monitoring

Technical documentation for Bayesian interim monitoring of two-arm clinical trials using posterior probability stopping rules. Supports both binary (Beta-Binomial) and continuous (Normal-Normal) endpoints with efficacy and optional futility boundaries. Based on the methods described in Zhou & Ji (2024) [1].

1. Overview & Motivation

Bayesian sequential monitoring extends fixed-sample Bayesian designs by allowing repeated interim analyses during patient enrollment. At each pre-specified look, the posterior probability that the treatment is superior is compared against stopping thresholds — enabling early termination for either efficacy or futility. Zhou & Ji [1] provide a comprehensive framework comparing three Bayesian sequential approaches, of which this calculator implements the posterior probability (PP) method.

Why Bayesian Sequential?

vs. Fixed-Sample

  • Reduces expected sample size by 20-40%
  • Earlier access to effective treatments
  • Ethical: stops futile trials sooner

vs. Frequentist GSD

  • Natural probability statements about parameters
  • Incorporates prior information formally
  • No alpha-spending function needed

Supported Endpoint Types

Binary (Beta-Binomial)

Responder rates, event proportions

Prior: independent Beta distributions per arm.
Stopping decisions evaluated via Monte Carlo posterior sampling.

Continuous (Normal-Normal)

Mean differences, change from baseline

Prior: Normal on treatment effect.
Analytical z-score boundaries available.

2. Three Bayesian Sequential Approaches

Zhou & Ji [1] describe three distinct Bayesian approaches to sequential clinical trial design. Each defines stopping rules differently, leading to different boundary shapes and operating characteristics.

Posterior Probability (PP) — Implemented Here

Stop at look $k$ when the posterior probability that the treatment effect exceeds zero crosses a threshold: $P(\theta > 0 \mid \text{data}_k) \geq \gamma$.

For Normal-Normal models, this yields analytical z-score boundaries (Equation 5 in [1]):

$$c_k = \Phi^{-1}(\gamma)\,\sqrt{1 + \frac{\sigma^2}{n_k \nu^2}} - \frac{\mu\,\sigma}{\sqrt{n_k}\,\nu^2}$$

Boundaries decrease with sample size, resembling O'Brien-Fleming spending when $\nu^2$ is large relative to $\sigma^2/n$.
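As a concrete illustration, the boundary can be evaluated numerically. A minimal sketch, assuming NumPy and SciPy are available (the function name `pp_boundary` is ours, not part of the calculator):

```python
import numpy as np
from scipy.stats import norm

def pp_boundary(n_k, gamma=0.975, mu=0.0, nu2=1.0, sigma2=1.0):
    """Analytical PP efficacy boundary c_k ([1], Eq. 5) at cumulative
    per-arm sample sizes n_k, under a N(mu, nu2) prior on the effect."""
    n_k = np.asarray(n_k, dtype=float)
    return (norm.ppf(gamma) * np.sqrt(1.0 + sigma2 / (n_k * nu2))
            - mu * np.sqrt(sigma2) / (np.sqrt(n_k) * nu2))

print(pp_boundary([25, 50, 75, 100]))  # z-thresholds shrink as n_k grows
```

With $\mu = 0$ the thresholds decrease toward $\Phi^{-1}(\gamma)$ as information accumulates, which is the O'Brien-Fleming-like shape described above.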

Posterior Predictive Probability (PPP)

Stop when the predictive probability of eventual success at the final analysis exceeds a threshold. At interim look $k < K$, this asks: given current data, what is the probability that the final posterior probability will exceed the success criterion?

This yields a different boundary formula ([1], Section 2.2):

$$c_k^{\text{PPP}} = \frac{\Phi^{-1}(\eta)\sqrt{1/\nu^2 + N/\sigma^2} - \mu/\nu^2}{(N - n_k)/\sigma^2} \cdot \frac{1}{\sqrt{n_k}\,\sigma/(N-n_k) + \sqrt{n_k}/(\sigma/\nu^2 + n_k/\sigma)} + \text{correction}$$

PPP boundaries tend to be more conservative at early looks and more permissive at later looks compared to PP, resembling stochastically curtailed testing [2].

Decision-Theoretic (DT)

Defines explicit loss functions for wrong decisions and optimizes boundaries by backward induction. The investigator specifies $\xi_1$ (loss of stopping and declaring efficacy when the drug is ineffective) and $\xi_0$ (loss of failing to detect a truly effective drug).

Boundaries are found by equating expected losses, $L_k(1, z_k) = L_k(0, z_k)$, solved numerically via backward recursion from look $K$ to look 1.

The DT approach is the most flexible but requires careful specification of loss functions. With loss ratio $\xi_1/\xi_0 \approx 35$, boundaries approximate O'Brien-Fleming ([1], Table 1).

Why Posterior Probability?

This calculator implements the PP method because: (a) it has a direct clinical interpretation ("there is at least a $100\gamma\%$ probability the treatment is better"), (b) it produces analytical boundaries for continuous endpoints, and (c) it naturally extends to binary endpoints via Monte Carlo. The PPP and DT approaches may be added in future releases.

3. Statistical Model

Binary Endpoint (Beta-Binomial)

Each arm has an independent Beta prior on the response rate:

Treatment Arm

$\theta_T \sim \text{Beta}(\alpha_T, \beta_T)$

Control Arm

$\theta_C \sim \text{Beta}(\alpha_C, \beta_C)$

At interim look $k$, having observed $s_{T,k}$ successes in $n_{T,k}$ treatment patients, the posterior is:

$$\theta_T \mid \text{data}_k \sim \text{Beta}(\alpha_T + s_{T,k},\ \beta_T + n_{T,k} - s_{T,k})$$

The posterior probability of treatment superiority is estimated via Monte Carlo:

$$P(\theta_T > \theta_C \mid \text{data}_k) \approx \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\bigl(\theta_T^{(m)} > \theta_C^{(m)}\bigr)$$

Continuous Endpoint (Normal-Normal)

The treatment effect θ\theta has a Normal prior:

$\theta \sim N(\mu, \nu^2)$

With known data variance $\sigma^2$, after observing the z-statistic $z_k$ at look $k$ with cumulative sample size $n_k$:

$$P(\theta > 0 \mid z_k) = \Phi\!\left(\frac{z_k \sqrt{n_k}/\sigma + \mu/\nu^2}{\sqrt{1/\nu^2 + n_k/\sigma^2}}\right)$$
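For reference, the expression above in code, using only the standard library (the function name is ours; setting $z_k$ equal to the Equation 5 boundary recovers exactly $\gamma$):

```python
import math
from statistics import NormalDist

def posterior_prob_superiority(z_k, n_k, mu=0.0, nu2=1.0, sigma2=1.0):
    """P(theta > 0 | z_k) under the Normal-Normal model: the posterior
    mean divided by the posterior sd, pushed through the normal CDF."""
    post_precision = 1.0 / nu2 + n_k / sigma2     # 1/nu^2 + n_k/sigma^2
    numer = z_k * math.sqrt(n_k / sigma2) + mu / nu2
    return NormalDist().cdf(numer / math.sqrt(post_precision))

# A very wide prior (large nu2) makes this approach the frequentist Phi(z_k)
print(posterior_prob_superiority(2.0, 50, nu2=100.0))
```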

Analytical vs. Simulation-Based

For continuous endpoints, stopping boundaries can be derived analytically (closed-form z-score thresholds). For binary endpoints, the stopping thresholds are fixed (the user-specified γ values) — at each look, the posterior probability P(θ_T > θ_C | data) is estimated via Monte Carlo sampling from the Beta posteriors and compared against these thresholds.

4. Stopping Rules

At each interim analysis $k = 1, 2, \ldots, K$, the trial may stop based on the posterior probability that treatment is better:

Efficacy Stopping

Stop for efficacy at look $k$ if $P(\theta_T > \theta_C \mid \text{data}_k) \geq \gamma_{\text{eff}}$

where $\gamma_{\text{eff}}$ is the efficacy threshold (typically 0.95-0.995). Higher values are more conservative, reducing Type I error at the cost of lower power.

Futility Stopping (Optional)

Stop for futility at look $k$ if $P(\theta_T > \theta_C \mid \text{data}_k) \leq \gamma_{\text{fut}}$

where $\gamma_{\text{fut}}$ is the futility threshold (typically 0.05-0.20). Futility stopping reduces the expected sample size under the null hypothesis by terminating trials unlikely to succeed.

Non-Binding Futility

Futility stopping is non-binding: investigators may continue the trial past a futility signal. Per FDA guidance, non-binding futility rules do not inflate the Type I error rate and are preferred for regulatory submissions.

Trial Flow

for look k = 1, 2, ..., K:
    Enroll n_per_look patients per arm
    Compute cumulative data (n_k = k × n_per_look)
    Compute P(θ_T > θ_C | data_k)

    if P ≥ γ_efficacy:
        → STOP: Declare efficacy
    elif futility_enabled and P ≤ γ_futility:
        → STOP: Declare futility (non-binding)
    else:
        → CONTINUE to next look

if reached look K without stopping:
    → Make final decision at look K
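The flow above, as a runnable sketch for the binary case (NumPy assumed; the Beta(1, 1) priors, fixed seed, and helper name `run_trial` are illustrative choices, not part of the calculator):

```python
import numpy as np

def run_trial(p_t, p_c, k_looks=4, n_per_look=25, gamma_eff=0.975,
              gamma_fut=0.10, m=10_000, seed=0):
    """Simulate one two-arm binary trial under the PP stopping rule.
    Returns (decision, look_stopped)."""
    rng = np.random.default_rng(seed)
    s_t = s_c = n = 0
    for k in range(1, k_looks + 1):
        s_t += rng.binomial(n_per_look, p_t)   # new responders, treatment
        s_c += rng.binomial(n_per_look, p_c)   # new responders, control
        n += n_per_look                        # cumulative n per arm
        # Monte Carlo estimate of P(theta_T > theta_C | data_k)
        pp = np.mean(rng.beta(1 + s_t, 1 + n - s_t, m)
                     > rng.beta(1 + s_c, 1 + n - s_c, m))
        if pp >= gamma_eff:
            return "efficacy", k
        if gamma_fut is not None and pp <= gamma_fut:
            return "futility", k               # non-binding in practice
    return "no decision", k_looks
```

A strong true effect, e.g. `run_trial(0.6, 0.3)`, tends to stop for efficacy before the final look, which is exactly the sample-size saving sequential monitoring buys.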

5. Stopping Boundaries

Continuous Endpoints: Analytical Boundaries

For Normal-Normal models, the stopping boundary at look $k$ with cumulative per-arm sample size $n_k$ is a z-score threshold derived analytically from the posterior probability criterion ([1], Equation 5):

$$c_k = \Phi^{-1}(\gamma)\,\sqrt{1 + \frac{\sigma^2}{n_k \nu^2}} - \frac{\mu\,\sigma}{\sqrt{n_k}\,\nu^2}$$

Stop for efficacy if the observed z-statistic $z_k > c_k$. For futility, a symmetric lower boundary is derived using $\gamma_{\text{fut}}$.

Binary Endpoints: Simulation-Based

For Beta-Binomial models, there is no closed-form boundary. Instead, at each look:

  1. Update Beta posteriors for each arm using observed data
  2. Draw $M$ Monte Carlo samples from each posterior
  3. Compute the posterior probability:
    $P(\theta_T > \theta_C \mid \text{data}_k) \approx \frac{1}{M}\sum_{m=1}^{M} \mathbb{1}(\theta_T^{(m)} > \theta_C^{(m)})$
  4. Compare directly against $\gamma_{\text{eff}}$ and $\gamma_{\text{fut}}$
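The steps above can be sketched as follows (NumPy assumed; the Jeffreys Beta(0.5, 0.5) priors and the function name are illustrative):

```python
import numpy as np

def posterior_prob_binary(s_t, n_t, s_c, n_c, a=0.5, b=0.5,
                          m=200_000, seed=0):
    """Steps 1-3: conjugate Beta update per arm, Monte Carlo sampling,
    then the indicator average estimating P(theta_T > theta_C | data_k)."""
    rng = np.random.default_rng(seed)
    draws_t = rng.beta(a + s_t, b + n_t - s_t, m)
    draws_c = rng.beta(a + s_c, b + n_c - s_c, m)
    return float(np.mean(draws_t > draws_c))

# Step 4: compare the estimate against the fixed thresholds
pp = posterior_prob_binary(18, 25, 9, 25)
stop_for_efficacy = pp >= 0.975
```

With $M = 200{,}000$ draws, the Monte Carlo standard error of the estimate is at most about 0.001, comfortably below the gap one usually cares about near the thresholds.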

Boundary Interpretation

For continuous endpoints, the Boundary Plot shows z-score thresholds (decreasing with more information). For binary endpoints, it shows the constant posterior probability thresholds. Both represent the same concept: the evidence required to stop.

6. Operating Characteristics

Operating characteristics are evaluated via Monte Carlo simulation under both the null and alternative hypotheses.

Type I Error

Probability of declaring efficacy when H₀ is true

Binary: $\theta_T = \theta_C$
Continuous: true effect $= 0$

Power

Probability of declaring efficacy when H₁ is true

Binary: $\theta_T = p_1,\ \theta_C = p_0$
Continuous: true effect $= \delta$

Simulation Algorithm

for sim in 1...N_simulations:
    # Generate full trial data under H0 and H1
    data_h0 = generate(θ_T = θ_C)
    data_h1 = generate(θ_T = θ_T_design)

    for look k = 1...K:
        # Check stopping at each look; each hypothesis's trial
        # stops independently once its own rule fires
        P_h0 = posterior_prob(data_h0[1:n_k])
        P_h1 = posterior_prob(data_h1[1:n_k])

        if P_h0 ≥ γ_eff → type1_count++, record H0 stop at k
        if P_h1 ≥ γ_eff → power_count++, record H1 stop at k
        if P_h0 ≤ γ_fut → record H0 futility stop at k
        if P_h1 ≤ γ_fut → record H1 futility stop at k

# Outputs:
type1_error = type1_count / N_simulations
power = power_count / N_simulations
expected_N_h0 = mean(stop_time_h0) × n_per_look
expected_N_h1 = mean(stop_time_h1) × n_per_look
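A compact, runnable version of this loop for the binary endpoint (NumPy assumed; uniform priors, efficacy stopping only, and deliberately small simulation counts for speed, whereas the real calculator uses far more):

```python
import numpy as np

def operating_chars(p_t, p_c, k_looks=4, n_per_look=25,
                    gamma_eff=0.975, n_sims=1000, m=2000, seed=0):
    """Monte Carlo rejection rate and expected N per arm for one true
    (p_t, p_c). Call with p_t == p_c for Type I error, or with the
    design alternative for power."""
    rng = np.random.default_rng(seed)
    rejections = 0
    stop_look = np.full(n_sims, k_looks)   # look at which each trial stops
    for i in range(n_sims):
        s_t = s_c = n = 0
        for k in range(1, k_looks + 1):
            s_t += rng.binomial(n_per_look, p_t)
            s_c += rng.binomial(n_per_look, p_c)
            n += n_per_look
            pp = np.mean(rng.beta(1 + s_t, 1 + n - s_t, m)
                         > rng.beta(1 + s_c, 1 + n - s_c, m))
            if pp >= gamma_eff:
                rejections += 1
                stop_look[i] = k
                break
    return rejections / n_sims, float(stop_look.mean() * n_per_look)
```

For example, `operating_chars(0.3, 0.3)` estimates the frequentist Type I error and expected N under the null, while `operating_chars(0.5, 0.3)` estimates power and the ASN under the design alternative.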

Additional Metrics

| Metric | Definition |
| --- | --- |
| Expected N (H₀) | Average sample size per arm when the null is true |
| Expected N (H₁) | Average sample size per arm when the alternative is true |
| P(stop at look k \| H) | Probability of stopping at each look under each hypothesis |
| Futility rate | Overall probability of stopping for futility under H₀ and H₁ |

Key Findings from Simulation Studies

Zhou & Ji [1] evaluated OC across 72 scenarios (combinations of true effect distribution $\nu_0$, prior width $\nu$, and number of looks $K$) with $N = 1000$, $\gamma = 0.95$, and 10,000 simulated trials:

  • Type I error control: With a well-calibrated prior ($\nu \approx \nu_0$), the false positive rate (FPR) stays below 5% even with frequent monitoring ($K = 1000$, essentially continuous looks).
  • Prior-data conflict: When the prior is much wider than the true effect distribution ($\nu \gg \nu_0$), the FPR can inflate above the nominal level. Conversely, overly narrow priors reduce power.
  • Number of looks: Adding interim analyses generally does not inflate error rates in the Bayesian framework (unlike repeated frequentist testing), provided the posterior probability threshold is held constant.
  • Coverage: Posterior credible interval coverage remains near 95% across scenarios, with slight under-coverage when the prior is strongly misspecified.

Implication for Practice

These results underscore the importance of prior calibration. For regulatory submissions, sponsors should demonstrate that the chosen prior width does not inflate the frequentist Type I error rate by running OC simulations across a range of plausible effect sizes — which is exactly what this calculator provides.

Computational Note

The "Standard" setting uses 5,000 outer simulations. For binary endpoints, inner Monte Carlo draws scale with outer simulations (capped at 10,000) to estimate posterior probabilities. Power curve evaluation uses N/3 simulations per point for efficiency. Use "Precise" (10,000) for regulatory submissions.

7. Power & ASN Curves

Power Curve

The power curve shows the probability of declaring efficacy as a function of the true treatment effect. For binary endpoints, this is plotted against the treatment rate; for continuous, against the true effect size.

  • At $\theta_T = \theta_C$: Should equal the Type I error (typically 2-5%)
  • At design alternative: Should meet target power (typically 80-90%)
  • Steep transition: Indicates good discriminatory ability

Average Sample Number (ASN) Curve

The ASN curve shows the expected sample size per arm as a function of the true effect. This quantifies the efficiency gains from sequential monitoring:

  • Under H₀ (no effect): ASN well below max N indicates effective futility stopping
  • Under H₁ (large effect): ASN well below max N indicates early efficacy detection
  • Near the boundary: ASN approaches max N (ambiguous evidence requires full enrollment)

8. Regulatory Considerations

FDA Guidance on Bayesian Sequential Designs

Per ICH E20 on adaptive designs [3] and the FDA draft guidance on Bayesian methodology in drug/biologics trials [4], sponsors must pre-specify the number and timing of interim analyses, stopping rules, and report full operating characteristics including frequentist Type I error control.

SAP Documentation Checklist

  • Number & Timing of Looks: Pre-specify K and equally-spaced information fractions
  • Stopping Rules: Efficacy threshold $\gamma_{\text{eff}}$, futility threshold $\gamma_{\text{fut}}$ (if applicable), and whether futility is binding or non-binding
  • Prior Specification: Prior parameters with clinical justification, effective sample size, and sensitivity to prior choice
  • Operating Characteristics: Type I error, power, expected N under both hypotheses, stopping probabilities at each look
  • Simulation Details: Number of Monte Carlo simulations, random seed, software version
  • Boundary Table: Full table of boundaries, information fractions, and stopping probabilities at each look

Calibration Guidance

| Efficacy Threshold | Typical Type I Error | When to Use |
| --- | --- | --- |
| 95.0% | ~5-8% | Exploratory Phase II; single-arm extension |
| 97.5% | ~2-5% | Confirmatory one-sided test; FDA standard |
| 99.0% | ~0.5-2% | Conservative design; pediatric extrapolation |
| 99.5% | <1% | Very conservative; multiple comparisons |

Calibration Strategy

The efficacy threshold $\gamma_{\text{eff}}$ and number of looks $K$ jointly determine the Type I error rate. Increasing the threshold or reducing the number of looks lowers Type I error but also reduces power. Use the calculator's OC output to calibrate until the desired balance is achieved. No separate alpha-spending function is specified; instead, frequentist error control is demonstrated directly by simulating the design's operating characteristics.
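To see the joint effect of $\gamma_{\text{eff}}$ and $K$ on the frequentist Type I error for a continuous endpoint, one can simulate null z-statistic paths against the Equation 5 boundaries. A sketch assuming NumPy/SciPy, a zero-mean prior, and unit data variance (the function name is ours):

```python
import numpy as np
from scipy.stats import norm

def type1_for_gamma(gamma, k_looks, n_per_look=25, nu2=1.0,
                    n_sims=20_000, seed=0):
    """Frequentist Type I error of the PP efficacy rule: fraction of
    null z-paths that ever cross the analytical boundary (mu = 0,
    sigma^2 = 1)."""
    rng = np.random.default_rng(seed)
    n_k = n_per_look * np.arange(1, k_looks + 1)
    c_k = norm.ppf(gamma) * np.sqrt(1.0 + 1.0 / (n_k * nu2))
    # Under H0, z_k = S_{n_k} / sqrt(n_k) with shared cumulative increments,
    # which reproduces the correct correlation between looks
    incr = rng.standard_normal((n_sims, k_looks)) * np.sqrt(n_per_look)
    z = np.cumsum(incr, axis=1) / np.sqrt(n_k)
    return float(np.mean((z > c_k).any(axis=1)))
```

Raising $\gamma$ from 0.95 to 0.975, or reducing $K$, lowers the crossing probability; re-running this kind of sweep until the estimate sits at the desired level is the calibration loop described above.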

9. Technical References

  1. Zhou, T., & Ji, Y. (2024). On Bayesian Sequential Clinical Trial Designs. New England Journal of Statistics in Data Science, 2(1), 136-151.
  2. Lan, K. K. G., Simon, R., & Halperin, M. (1982). Stochastically Curtailed Tests in Long-Term Clinical Trials. Sequential Analysis, 1(3), 207-219.
  3. ICH / U.S. Food and Drug Administration (2025). E20 Adaptive Designs for Clinical Trials: Draft Guidance.
  4. U.S. Food and Drug Administration (2026). Use of Bayesian Methodology in Clinical Trials of Drug and Biological Products: Draft Guidance for Industry.
  5. Berry, S. M., Carlin, B. P., Lee, J. J., & Muller, P. (2010). Bayesian Adaptive Methods for Clinical Trials. Chapman & Hall/CRC.
  6. Jennison, C., & Turnbull, B. W. (2000). Group Sequential Methods with Applications to Clinical Trials. Chapman & Hall/CRC.
  7. Thall, P. F., & Simon, R. (1994). Practical Bayesian Guidelines for Phase IIB Clinical Trials. Biometrics, 50(2), 337-349.
  8. ICH E9 (1998). Statistical Principles for Clinical Trials.

10. API Reference

POST /api/v1/calculators/bayesian-sequential

Key Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| endpoint_type | string | "binary" \| "continuous" |
| k | int | Number of interim + final analyses (2-10) |
| n_per_look | int | Patients per arm added at each look |
| efficacy_threshold | float | Posterior probability threshold for efficacy (default: 0.975) |
| futility_threshold | float \| null | Futility threshold (null = disabled; default: 0.10) |
| control_rate | float | Expected control arm rate (binary only) |
| treatment_rate | float | Expected treatment rate under H₁ (binary only) |
| prior_variance | float | Prior variance $\nu^2$ (continuous only) |
| n_simulations | int | Monte Carlo simulations (1000-50000) |

Note: The current implementation assumes equal allocation (1:1) between treatment and control arms.
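An illustrative request body built from the parameter table above (field names as documented; host, headers, and authentication are omitted, so this only sketches the JSON shape, not a guaranteed-working call):

```python
import json

payload = {
    "endpoint_type": "binary",
    "k": 4,                         # 4 analyses total: 3 interim + final
    "n_per_look": 25,
    "efficacy_threshold": 0.975,
    "futility_threshold": 0.10,     # use null/None to disable futility
    "control_rate": 0.30,
    "treatment_rate": 0.50,
    "n_simulations": 10_000,        # "Precise" setting
}
body = json.dumps(payload)
# POST body to /api/v1/calculators/bayesian-sequential with an HTTP client
```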

Key Response Fields

  • type1_error — Simulated Type I error rate
  • power — Simulated power at design alternative
  • expected_n_h0 — Average N per arm under null
  • expected_n_h1 — Average N per arm under alternative
  • boundary_table — Look-by-look boundaries and stopping probabilities
  • power_curve — Power and ASN across effect sizes
  • decision_rule — Thresholds and interpretation
