Bayesian Sequential Monitoring
Technical documentation for Bayesian interim monitoring of two-arm clinical trials using posterior probability stopping rules. Supports both binary (Beta-Binomial) and continuous (Normal-Normal) endpoints with efficacy and optional futility boundaries. Based on the methods described in Zhou & Ji (2024) [1].
1. Overview & Motivation
Bayesian sequential monitoring extends fixed-sample Bayesian designs by allowing repeated interim analyses during patient enrollment. At each pre-specified look, the posterior probability that the treatment is superior is compared against stopping thresholds — enabling early termination for either efficacy or futility. Zhou & Ji [1] provide a comprehensive framework comparing three Bayesian sequential approaches, of which this calculator implements the posterior probability (PP) method.
Why Bayesian Sequential?
vs. Fixed-Sample
- Reduces expected sample size by 20-40%
- Earlier access to effective treatments
- Ethical: stops futile trials sooner
vs. Frequentist GSD
- Natural probability statements about parameters
- Incorporates prior information formally
- No alpha-spending function needed
Supported Endpoint Types
Binary (Beta-Binomial)
Responder rates, event proportions
Prior: independent Beta distributions per arm.
Stopping decisions evaluated via Monte Carlo posterior sampling.
Continuous (Normal-Normal)
Mean differences, change from baseline
Prior: Normal on treatment effect.
Analytical z-score boundaries available.
2. Three Bayesian Sequential Approaches
Zhou & Ji [1] describe three distinct Bayesian approaches to sequential clinical trial design. Each defines stopping rules differently, leading to different boundary shapes and operating characteristics.
Posterior Probability (PP) — Implemented Here
Stop at look k when the posterior probability that the treatment effect exceeds zero crosses a threshold: P(θ > 0 | data_k) ≥ γ.
For Normal-Normal models, this yields analytical z-score boundaries (Equation 5 in [1]): stop for efficacy when z_k exceeds a threshold of the form Φ⁻¹(γ) · √(1 + 2σ²/(n_k σ₀²)), where σ₀² is the prior variance and n_k the cumulative per-arm sample size.
Boundaries decrease with sample size toward Φ⁻¹(γ), resembling O'Brien-Fleming spending when σ²/σ₀² is large relative to n_k.
Posterior Predictive Probability (PPP)
Stop when the predictive probability of eventual success at the final analysis exceeds a threshold. At interim look k, this asks: given current data, what is the probability that the final posterior probability will exceed the success criterion?
This yields a different boundary formula, given in Section 2.2 of [1].
PPP boundaries tend to be more conservative at early looks and more permissive at later looks compared to PP, resembling stochastically curtailed testing [2].
Decision-Theoretic (DT)
Defines explicit loss functions for wrong decisions and optimizes boundaries by backward induction. The investigator specifies two losses: ℓ₁, the loss of stopping and declaring efficacy when the drug is ineffective, and ℓ₂, the loss of failing to detect a truly effective drug.
Boundaries are found by equating the expected losses of the competing decisions at each look, solved numerically via backward recursion from look K to look 1.
The DT approach is the most flexible but requires careful specification of loss functions. For suitable choices of the loss ratio ℓ₁/ℓ₂, the resulting boundaries approximate O'Brien-Fleming ([1], Table 1).
Why Posterior Probability?
This calculator implements the PP method because: (a) it has a direct clinical interpretation ("there is at least a 100γ% probability the treatment is better"), (b) it produces analytical boundaries for continuous endpoints, and (c) it naturally extends to binary endpoints via Monte Carlo. The PPP and DT approaches may be added in future releases.
3. Statistical Model
Binary Endpoint (Beta-Binomial)
Each arm has an independent Beta prior on the response rate:
Treatment Arm: θ_T ~ Beta(a_T, b_T)
Control Arm: θ_C ~ Beta(a_C, b_C)
At interim look k, having observed x_T,k successes in n_T,k treatment patients, the posterior is θ_T | data_k ~ Beta(a_T + x_T,k, b_T + n_T,k − x_T,k), and analogously for the control arm.
The posterior probability of treatment superiority is estimated via Monte Carlo: P(θ_T > θ_C | data_k) ≈ (1/M) Σ_{m=1}^{M} 1{θ_T^(m) > θ_C^(m)}, where the θ^(m) are draws from the two posteriors.
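A minimal sketch of the conjugate update and Monte Carlo comparison described above (the function name, default Beta(1, 1) priors, and draw count are illustrative, not part of the calculator's API):

```python
import numpy as np

def posterior_prob_superiority(x_t, n_t, x_c, n_c,
                               a_t=1.0, b_t=1.0, a_c=1.0, b_c=1.0,
                               n_draws=10_000, seed=0):
    """Estimate P(theta_T > theta_C | data) by sampling both Beta posteriors."""
    rng = np.random.default_rng(seed)
    # Conjugate update: Beta(a + successes, b + failures) for each arm
    theta_t = rng.beta(a_t + x_t, b_t + (n_t - x_t), size=n_draws)
    theta_c = rng.beta(a_c + x_c, b_c + (n_c - x_c), size=n_draws)
    # Fraction of joint draws where the treatment rate wins
    return float(np.mean(theta_t > theta_c))

# 30/50 responders vs 18/50: strong evidence of superiority
p = posterior_prob_superiority(30, 50, 18, 50)
```

The Monte Carlo standard error is roughly √(p(1−p)/M), so 10,000 draws resolve the probability to about ±0.005 near p = 0.5.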
Continuous Endpoint (Normal-Normal)
The treatment effect θ = μ_T − μ_C has a Normal prior: θ ~ N(0, σ₀²).
With known data variance σ², after observing the z-statistic z_k at look k with cumulative per-arm sample size n_k, the posterior probability of a positive effect is P(θ > 0 | data_k) = Φ(z_k / √(1 + 2σ²/(n_k σ₀²))).
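The superiority probability is a closed form in the observed z-statistic. A sketch, assuming equal per-arm sizes and a zero-mean Normal prior on the effect (the function name is an assumption of this example):

```python
from math import sqrt
from statistics import NormalDist

def posterior_prob_positive(z_k, n_k, sigma2, prior_var):
    """P(theta > 0 | data_k): the observed z is shrunk toward 0 by the prior."""
    shrink = sqrt(1.0 + 2.0 * sigma2 / (n_k * prior_var))
    return NormalDist().cdf(z_k / shrink)
```

As n_k grows the shrinkage factor tends to 1 and the posterior probability approaches the frequentist Φ(z_k).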
Analytical vs. Simulation-Based
For continuous endpoints, stopping boundaries can be derived analytically (closed-form z-score thresholds). For binary endpoints, the stopping thresholds are fixed (the user-specified γ values) — at each look, the posterior probability P(θ_T > θ_C | data) is estimated via Monte Carlo sampling from the Beta posteriors and compared against these thresholds.
4. Stopping Rules
At each interim analysis k = 1, …, K, the trial may stop based on the posterior probability that treatment is better:
Efficacy Stopping
Stop when P(θ_T > θ_C | data_k) ≥ γ_efficacy, where γ_efficacy is the efficacy threshold (typically 0.95-0.995). Higher values are more conservative, reducing Type I error at the cost of lower power.
Futility Stopping (Optional)
Stop when P(θ_T > θ_C | data_k) ≤ γ_futility, where γ_futility is the futility threshold (typically 0.05-0.20). Futility stopping reduces the expected sample size under the null hypothesis by terminating trials unlikely to succeed.
Non-Binding Futility
Futility stopping is non-binding: investigators may continue the trial past a futility signal. Consistent with FDA guidance, the Type I error rate is evaluated as if futility stops may be overridden, so ignoring a non-binding signal does not inflate it; non-binding rules are generally preferred for regulatory submissions.
Trial Flow
for look k = 1, 2, ..., K:
Enroll n_per_look patients per arm
Compute cumulative data (n_k = k × n_per_look)
Compute P(θ_T > θ_C | data_k)
if P ≥ γ_efficacy:
→ STOP: Declare efficacy
elif futility_enabled and P ≤ γ_futility:
→ STOP: Declare futility (non-binding)
else:
→ CONTINUE to next look
if reached look K without stopping:
→ Make final decision at look K
5. Stopping Boundaries
Continuous Endpoints: Analytical Boundaries
For Normal-Normal models, the stopping boundary at look k with cumulative per-arm sample size n_k is a z-score threshold derived analytically from the posterior probability criterion ([1], Equation 5): c_k = Φ⁻¹(γ_efficacy) · √(1 + 2σ²/(n_k σ₀²)).
Stop for efficacy if the observed z-statistic z_k ≥ c_k. For futility, a symmetric lower boundary is derived using Φ⁻¹(γ_futility) in place of Φ⁻¹(γ_efficacy).
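Inverting the posterior probability criterion gives the boundary directly. A sketch under the same Normal-Normal parametrization (names and example values are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def efficacy_boundary(n_k, sigma2, prior_var, gamma=0.975):
    """z-score threshold c_k at cumulative per-arm size n_k: stop if z_k >= c_k."""
    return NormalDist().inv_cdf(gamma) * sqrt(1.0 + 2.0 * sigma2 / (n_k * prior_var))

# Boundaries shrink toward the fixed-sample critical value as information accrues
bounds = [efficacy_boundary(n, sigma2=1.0, prior_var=0.5) for n in (20, 40, 60, 80, 100)]
```

With an informative (small-variance) prior the boundary is nearly flat; with a diffuse prior it starts high and falls like 1/√n_k, the O'Brien-Fleming shape noted in Section 2.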
Binary Endpoints: Simulation-Based
For Beta-Binomial models, there is no closed-form boundary. Instead, at each look:
- Update Beta posteriors for each arm using observed data
- Draw M Monte Carlo samples from each posterior
- Compute the posterior probability P(θ_T > θ_C | data_k) ≈ (1/M) Σ_{m=1}^{M} 1{θ_T^(m) > θ_C^(m)}
- Compare this estimate directly against γ_efficacy and γ_futility
Boundary Interpretation
For continuous endpoints, the Boundary Plot shows z-score thresholds (decreasing with more information). For binary endpoints, it shows the constant posterior probability thresholds. Both represent the same concept: the evidence required to stop.
6. Operating Characteristics
Operating characteristics are evaluated via Monte Carlo simulation under both the null and alternative hypotheses.
Type I Error
Probability of declaring efficacy when H₀ is true
Binary: θ_T = θ_C = control_rate
Continuous: true effect θ = 0
Power
Probability of declaring efficacy when H₁ is true
Binary: θ_T = treatment_rate, θ_C = control_rate
Continuous: true effect θ = design alternative
Simulation Algorithm
for sim in 1...N_simulations:
# Generate full trial data under H0 and H1
data_h0 = generate(θ_T = θ_C)
data_h1 = generate(θ_T = θ_T_design)
for look k = 1...K:
# Check stopping at each look
P_h0 = posterior_prob(data_h0[1:n_k])
P_h1 = posterior_prob(data_h1[1:n_k])
if P_h0 >= γ_eff → type1_count++, record stop at k
if P_h1 >= γ_eff → power_count++, record stop at k
if futility_enabled and P_h0 ≤ γ_fut → record futility stop at k (likewise for P_h1)
# Outputs:
type1_error = type1_count / N_simulations
power = power_count / N_simulations
expected_N_h0 = mean(stop_time_h0) × n_per_look
expected_N_h1 = mean(stop_time_h1) × n_per_look
Additional Metrics
| Metric | Definition |
|---|---|
| Expected N (H₀) | Average sample size per arm when the null is true |
| Expected N (H₁) | Average sample size per arm when the alternative is true |
| P(Stop at look k \| H) | Probability of stopping at each look under each hypothesis |
| Futility rate | Overall probability of stopping for futility under H₀ and H₁ |
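For the continuous endpoint, the simulation loop and metrics above can be sketched end to end using the analytical boundaries (all names, the per-observation variance of 1, and the treatment of futility as binding are assumptions of this sketch):

```python
import numpy as np
from math import sqrt
from statistics import NormalDist

def simulate_oc(theta, K=5, n_per_look=20, prior_var=0.5,
                gamma_eff=0.975, gamma_fut=0.10, n_sims=2000, seed=1):
    """Return (P(declare efficacy), expected per-arm N) at true effect theta."""
    rng = np.random.default_rng(seed)
    inv = NormalDist().inv_cdf
    n_cum = np.arange(1, K + 1) * n_per_look
    # Analytical efficacy / futility z-boundaries at each look
    upper = inv(gamma_eff) * np.sqrt(1.0 + 2.0 / (n_cum * prior_var))
    lower = inv(gamma_fut) * np.sqrt(1.0 + 2.0 / (n_cum * prior_var))
    eff, total_n = 0, 0
    for _ in range(n_sims):
        # Per-patient treatment-control differences ~ N(theta, 2); z_k = S_{n_k}/sqrt(2 n_k)
        d = rng.normal(theta, sqrt(2.0), size=n_cum[-1])
        z = np.cumsum(d)[n_cum - 1] / np.sqrt(2.0 * n_cum)
        stopped = K  # run to the final look unless a boundary is crossed
        for k in range(K):
            if z[k] >= upper[k]:
                eff, stopped = eff + 1, k + 1
                break
            if z[k] <= lower[k]:
                stopped = k + 1  # futility stop (treated as binding in this sketch)
                break
        total_n += stopped * n_per_look
    return eff / n_sims, total_n / n_sims

type1, asn_h0 = simulate_oc(theta=0.0)   # null
power, asn_h1 = simulate_oc(theta=0.5)   # design alternative
```

Running both calls gives the four headline metrics of the table above: Type I error and expected N under H₀, power and expected N under H₁.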
Key Findings from Simulation Studies
Zhou & Ji [1] evaluated operating characteristics across 72 scenarios (combinations of true effect distribution, prior width, and number of looks), with 10,000 simulated trials per scenario:
- Type I error control: With a well-calibrated prior, the false positive rate (FPR) stays below 5% even with frequent monitoring (approaching continuous looks).
- Prior-data conflict: When the prior is much wider than the true effect distribution, the FPR can inflate above the nominal level. Conversely, overly narrow priors reduce power.
- Number of looks: Adding interim analyses generally does not inflate error rates in the Bayesian framework (unlike repeated frequentist testing), provided the posterior probability threshold is held constant.
- Coverage: Posterior credible interval coverage remains near 95% across scenarios, with slight under-coverage when the prior is strongly misspecified.
Implication for Practice
These results underscore the importance of prior calibration. For regulatory submissions, sponsors should demonstrate that the chosen prior width does not inflate the frequentist Type I error rate by running OC simulations across a range of plausible effect sizes — which is exactly what this calculator provides.
Computational Note
The "Standard" setting uses 5,000 outer simulations. For binary endpoints, inner Monte Carlo draws scale with outer simulations (capped at 10,000) to estimate posterior probabilities. Power curve evaluation uses N/3 simulations per point for efficiency. Use "Precise" (10,000) for regulatory submissions.
7. Power & ASN Curves
Power Curve
The power curve shows the probability of declaring efficacy as a function of the true treatment effect. For binary endpoints, this is plotted against the treatment rate; for continuous, against the true effect size.
- At zero effect: Should equal the Type I error (typically 2-5%)
- At design alternative: Should meet target power (typically 80-90%)
- Steep transition: Indicates good discriminatory ability
Average Sample Number (ASN) Curve
The ASN curve shows the expected sample size per arm as a function of the true effect. This quantifies the efficiency gains from sequential monitoring:
- Under H₀ (no effect): ASN well below max N indicates effective futility stopping
- Under H₁ (large effect): ASN well below max N indicates early efficacy detection
- Near the boundary: ASN approaches max N (ambiguous evidence requires full enrollment)
8. Regulatory Considerations
FDA Guidance on Bayesian Sequential Designs
Per ICH E20 on adaptive designs [3] and the FDA draft guidance on Bayesian methodology in drug/biologics trials [4], sponsors must pre-specify the number and timing of interim analyses, stopping rules, and report full operating characteristics including frequentist Type I error control.
SAP Documentation Checklist
- Number & Timing of Looks: Pre-specify K and equally-spaced information fractions
- Stopping Rules: Efficacy threshold γ_efficacy, futility threshold γ_futility (if applicable), and whether futility is binding or non-binding
- Prior Specification: Prior parameters with clinical justification, effective sample size, and sensitivity to prior choice
- Operating Characteristics: Type I error, power, expected N under both hypotheses, stopping probabilities at each look
- Simulation Details: Number of Monte Carlo simulations, random seed, software version
- Boundary Table: Full table of boundaries, information fractions, and stopping probabilities at each look
Calibration Guidance
| Efficacy Threshold | Typical Type I Error | When to Use |
|---|---|---|
| 95.0% | ~5-8% | Exploratory Phase II; single-arm extension |
| 97.5% | ~2-5% | Confirmatory one-sided test; FDA standard |
| 99.0% | ~0.5-2% | Conservative design; pediatric extrapolation |
| 99.5% | <1% | Very conservative; multiple comparisons |
Calibration Strategy
The efficacy threshold and number of looks jointly determine the Type I error rate: raising γ_efficacy makes stopping harder, lowering both Type I error and power, while adding looks creates more opportunities to cross a boundary. Use the calculator's OC output to calibrate these settings until the desired balance is achieved. No separate alpha-spending function is specified — the design is calibrated directly on its simulated operating characteristics.
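One way to operationalize this calibration, sketched for the continuous case with efficacy stopping only (function name, scenario values, and the unit per-observation variance are assumptions of this example):

```python
import numpy as np
from statistics import NormalDist

def type1_for_gamma(gamma, K=5, n_per_look=20, prior_var=0.5,
                    n_sims=4000, seed=7):
    """Simulated frequentist Type I error of the PP efficacy rule under H0."""
    rng = np.random.default_rng(seed)
    n_cum = np.arange(1, K + 1) * n_per_look
    bound = NormalDist().inv_cdf(gamma) * np.sqrt(1.0 + 2.0 / (n_cum * prior_var))
    # Null z-paths: cumulative sums of N(0, 2) treatment-control differences
    d = rng.normal(0.0, np.sqrt(2.0), size=(n_sims, n_cum[-1]))
    z = np.cumsum(d, axis=1)[:, n_cum - 1] / np.sqrt(2.0 * n_cum)
    # A trial is a false positive if any look crosses its boundary
    return float(np.mean((z >= bound).any(axis=1)))

# Scan candidate thresholds; pick the smallest gamma whose error meets the target
alphas = {g: type1_for_gamma(g) for g in (0.95, 0.975, 0.99)}
```

Reusing the same random seed across candidate thresholds makes the scan monotone by construction, so the smallest acceptable γ is well defined.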
9. Technical References
- [1] Zhou, T., & Ji, Y. (2024). On Bayesian Sequential Clinical Trial Designs. New England Journal of Statistics in Data Science, 2(1), 136-151.
- [2] Lan, K. K. G., Simon, R., & Halperin, M. (1982). Stochastically Curtailed Tests in Long-Term Clinical Trials. Sequential Analysis, 1(3), 207-219.
- [3] ICH / U.S. Food and Drug Administration (2025). E20 Adaptive Designs for Clinical Trials: Draft Guidance.
- [4] U.S. Food and Drug Administration (2026). Use of Bayesian Methodology in Clinical Trials of Drug and Biological Products: Draft Guidance for Industry.
- [5] Berry, S. M., Carlin, B. P., Lee, J. J., & Müller, P. (2010). Bayesian Adaptive Methods for Clinical Trials. Chapman & Hall/CRC.
- [6] Jennison, C., & Turnbull, B. W. (2000). Group Sequential Methods with Applications to Clinical Trials. Chapman & Hall/CRC.
- [7] Thall, P. F., & Simon, R. (1994). Practical Bayesian Guidelines for Phase IIB Clinical Trials. Biometrics, 50(2), 337-349.
- [8] ICH E9 (1998). Statistical Principles for Clinical Trials.
10. API Reference
Key Parameters
| Parameter | Type | Description |
|---|---|---|
| endpoint_type | string | `"binary"` or `"continuous"` |
| k | int | Number of interim + final analyses (2-10) |
| n_per_look | int | Patients per arm added at each look |
| efficacy_threshold | float | Posterior probability threshold for efficacy (default: 0.975) |
| futility_threshold | float or null | Futility threshold (null = disabled; default: 0.10) |
| control_rate | float | Expected control arm rate (binary only) |
| treatment_rate | float | Expected treatment rate under H₁ (binary only) |
| prior_variance | float | Prior variance (continuous only) |
| n_simulations | int | Monte Carlo simulations (1000-50000) |
Note: The current implementation assumes equal allocation (1:1) between treatment and control arms.
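A hypothetical request body assembled from the parameter table (field names follow the table; the example values, transport, and any endpoint URL are illustrative only):

```python
# Illustrative payload only — values chosen for a binary-endpoint example
design_request = {
    "endpoint_type": "binary",
    "k": 4,                        # 3 interims + final
    "n_per_look": 25,              # per arm, per look (1:1 allocation assumed)
    "efficacy_threshold": 0.975,
    "futility_threshold": 0.10,    # use None to disable futility stopping
    "control_rate": 0.30,
    "treatment_rate": 0.50,
    "n_simulations": 5000,
}

# Basic validation mirroring the documented ranges
assert design_request["endpoint_type"] in ("binary", "continuous")
assert 2 <= design_request["k"] <= 10
assert 1000 <= design_request["n_simulations"] <= 50000
```

Note that `control_rate` and `treatment_rate` apply only when `endpoint_type` is `"binary"`; a continuous design would supply `prior_variance` instead.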
Key Response Fields
- type1_error — Simulated Type I error rate
- power — Simulated power at design alternative
- expected_n_h0 — Average N per arm under null
- expected_n_h1 — Average N per arm under alternative
- boundary_table — Look-by-look boundaries and stopping probabilities
- power_curve — Power and ASN across effect sizes
- decision_rule — Thresholds and interpretation
Ready to design?
Configure your Bayesian sequential monitoring design with Zetyra's calculator.