Bayesian Sequential Monitoring
Technical documentation for Bayesian interim monitoring of two-arm clinical trials using posterior probability stopping rules. Supports both binary (Beta-Binomial) and continuous (Normal-Normal) endpoints with efficacy and optional futility boundaries. Based on the methods described in Zhou & Ji (2024) [1].
1. Overview & Motivation
Bayesian sequential monitoring extends fixed-sample Bayesian designs by allowing repeated interim analyses during patient enrollment. At each pre-specified look, the posterior probability that the treatment is superior is compared against stopping thresholds — enabling early termination for either efficacy or futility. Zhou & Ji [1] provide a comprehensive framework comparing three Bayesian sequential approaches, of which this calculator implements the posterior probability (PP) method.
Why Bayesian Sequential?
vs. Fixed-Sample
- Reduces expected sample size by 20-40%
- Earlier access to effective treatments
- Ethical: stops futile trials sooner
vs. Frequentist GSD
- Natural probability statements about parameters
- Incorporates prior information formally
- No alpha-spending function needed
Supported Endpoint Types
Binary (Beta-Binomial)
Responder rates, event proportions
Prior: independent Beta distributions per arm.
Stopping decisions evaluated via Monte Carlo posterior sampling.
Continuous (Normal-Normal)
Mean differences, change from baseline
Prior: Normal on treatment effect.
Analytical z-score boundaries available.
2. Three Bayesian Sequential Approaches
Zhou & Ji [1] describe three distinct Bayesian approaches to sequential clinical trial design. Each defines stopping rules differently, leading to different boundary shapes and operating characteristics.
Posterior Probability (PP) — Implemented Here
Stop at look k when the posterior probability that the treatment effect exceeds zero crosses a threshold: P(θ > 0 | data_k) ≥ γ.
For Normal-Normal models, this yields analytical z-score boundaries (Equation 5 in [1]): stop for efficacy when z_k exceeds a threshold of the form Φ⁻¹(γ) · √(1 + 2σ²/(n_k σ₀²)), where σ₀² is the prior variance and n_k the cumulative per-arm sample size.
Boundaries decrease with sample size toward Φ⁻¹(γ), resembling O'Brien-Fleming spending when σ²/σ₀² is large relative to n_k.
Posterior Predictive Probability (PPP)
Stop when the predictive probability of eventual success at the final analysis exceeds a threshold. At interim look k, this asks: given current data, what is the probability that the final posterior probability will exceed the success criterion?
This yields a different boundary formula, given in Section 2.2 of [1].
PPP boundaries tend to be more conservative at early looks and more permissive at later looks compared to PP, resembling stochastically curtailed testing [2].
Decision-Theoretic (DT)
Defines explicit loss functions for wrong decisions and optimizes boundaries by backward induction. The investigator specifies two losses: ℓ₁, the loss of stopping and declaring efficacy when the drug is ineffective, and ℓ₂, the loss of failing to detect a truly effective drug.
Boundaries are found by equating the expected losses of the competing decisions at each look, solved numerically via backward recursion from look K to look 1.
The DT approach is the most flexible but requires careful specification of loss functions. For suitable choices of the loss ratio ℓ₁/ℓ₂, the resulting boundaries approximate O'Brien-Fleming ([1], Table 1).
Why Posterior Probability?
This calculator implements the PP method because: (a) it has a direct clinical interpretation ("there is at least a 100γ% probability the treatment is better"), (b) it produces analytical boundaries for continuous endpoints, and (c) it naturally extends to binary endpoints via Monte Carlo. The PPP and DT approaches may be added in future releases.
3. Statistical Model
Binary Endpoint (Beta-Binomial)
Each arm has an independent Beta prior on the response rate:
Treatment Arm: θ_T ~ Beta(a_T, b_T)
Control Arm: θ_C ~ Beta(a_C, b_C)
At interim look k, having observed x_T,k successes in n_T,k treatment patients, the posterior is θ_T | data_k ~ Beta(a_T + x_T,k, b_T + n_T,k − x_T,k), and analogously for the control arm.
The posterior probability of treatment superiority is estimated via Monte Carlo: P(θ_T > θ_C | data_k) ≈ (1/M) Σ_{m=1}^{M} 1{θ_T^(m) > θ_C^(m)}, where the θ^(m) are draws from the two posteriors.
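A minimal sketch of the conjugate update and Monte Carlo comparison described above (the function name, default Beta(1, 1) priors, and draw count are illustrative, not part of the calculator's API):

```python
import numpy as np

def posterior_prob_superiority(x_t, n_t, x_c, n_c,
                               a_t=1.0, b_t=1.0, a_c=1.0, b_c=1.0,
                               n_draws=10_000, seed=0):
    """Estimate P(theta_T > theta_C | data) by sampling both Beta posteriors."""
    rng = np.random.default_rng(seed)
    # Conjugate update: Beta(a + successes, b + failures) for each arm
    theta_t = rng.beta(a_t + x_t, b_t + (n_t - x_t), size=n_draws)
    theta_c = rng.beta(a_c + x_c, b_c + (n_c - x_c), size=n_draws)
    # Fraction of joint draws where the treatment rate wins
    return float(np.mean(theta_t > theta_c))

# 30/50 responders vs 18/50: strong evidence of superiority
p = posterior_prob_superiority(30, 50, 18, 50)
```

The Monte Carlo standard error is roughly √(p(1−p)/M), so 10,000 draws resolve the probability to about ±0.005 near p = 0.5.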
Continuous Endpoint (Normal-Normal)
The treatment effect θ = μ_T − μ_C has a Normal prior: θ ~ N(0, σ₀²).
With known data variance σ², after observing the z-statistic z_k at look k with cumulative per-arm sample size n_k, the posterior probability of a positive effect is P(θ > 0 | data_k) = Φ(z_k / √(1 + 2σ²/(n_k σ₀²))).
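The superiority probability is a closed form in the observed z-statistic. A sketch, assuming equal per-arm sizes and a zero-mean Normal prior on the effect (the function name is an assumption of this example):

```python
from math import sqrt
from statistics import NormalDist

def posterior_prob_positive(z_k, n_k, sigma2, prior_var):
    """P(theta > 0 | data_k): the observed z is shrunk toward 0 by the prior."""
    shrink = sqrt(1.0 + 2.0 * sigma2 / (n_k * prior_var))
    return NormalDist().cdf(z_k / shrink)
```

As n_k grows the shrinkage factor tends to 1 and the posterior probability approaches the frequentist Φ(z_k).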
Analytical vs. Simulation-Based
For continuous endpoints, stopping boundaries can be derived analytically (closed-form z-score thresholds). For binary endpoints, the stopping thresholds are fixed (the user-specified γ values) — at each look, the posterior probability P(θ_T > θ_C | data) is estimated via Monte Carlo sampling from the Beta posteriors and compared against these thresholds.
4. Stopping Rules
At each interim analysis k = 1, …, K, the trial may stop based on the posterior probability that treatment is better:
Efficacy Stopping
Stop when P(θ_T > θ_C | data_k) ≥ γ_efficacy, where γ_efficacy is the efficacy threshold (typically 0.95-0.995). Higher values are more conservative, reducing Type I error at the cost of lower power.
Futility Stopping (Optional)
Stop when P(θ_T > θ_C | data_k) ≤ γ_futility, where γ_futility is the futility threshold (typically 0.05-0.20). Futility stopping reduces the expected sample size under the null hypothesis by terminating trials unlikely to succeed.
Non-Binding Futility
Futility stopping is non-binding: investigators may continue the trial past a futility signal. Consistent with FDA guidance, the Type I error rate is evaluated as if futility stops may be overridden, so ignoring a non-binding signal does not inflate it; non-binding rules are generally preferred for regulatory submissions.
Trial Flow
for look k = 1, 2, ..., K:
Enroll n_per_look patients per arm
Compute cumulative data (n_k = k × n_per_look)
Compute P(θ_T > θ_C | data_k)
if P ≥ γ_efficacy:
→ STOP: Declare efficacy
elif futility_enabled and P ≤ γ_futility:
→ STOP: Declare futility (non-binding)
else:
→ CONTINUE to next look
if reached look K without stopping:
→ Make final decision at look K
5. Stopping Boundaries
Continuous Endpoints: Analytical Boundaries
For Normal-Normal models, the stopping boundary at look k with cumulative per-arm sample size n_k is a z-score threshold derived analytically from the posterior probability criterion ([1], Equation 5): c_k = Φ⁻¹(γ_efficacy) · √(1 + 2σ²/(n_k σ₀²)).
Stop for efficacy if the observed z-statistic z_k ≥ c_k. For futility, a symmetric lower boundary is derived using Φ⁻¹(γ_futility) in place of Φ⁻¹(γ_efficacy).
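Inverting the posterior probability criterion gives the boundary directly. A sketch under the same Normal-Normal parametrization (names and example values are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def efficacy_boundary(n_k, sigma2, prior_var, gamma=0.975):
    """z-score threshold c_k at cumulative per-arm size n_k: stop if z_k >= c_k."""
    return NormalDist().inv_cdf(gamma) * sqrt(1.0 + 2.0 * sigma2 / (n_k * prior_var))

# Boundaries shrink toward the fixed-sample critical value as information accrues
bounds = [efficacy_boundary(n, sigma2=1.0, prior_var=0.5) for n in (20, 40, 60, 80, 100)]
```

With an informative (small-variance) prior the boundary is nearly flat; with a diffuse prior it starts high and falls like 1/√n_k, the O'Brien-Fleming shape noted in Section 2.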
Binary Endpoints: Simulation-Based
For Beta-Binomial models, there is no closed-form boundary. Instead, at each look:
- Update Beta posteriors for each arm using observed data
- Draw M Monte Carlo samples from each posterior
- Compute the posterior probability P(θ_T > θ_C | data_k) ≈ (1/M) Σ_{m=1}^{M} 1{θ_T^(m) > θ_C^(m)}
- Compare this estimate directly against γ_efficacy and γ_futility
Boundary Interpretation
For continuous endpoints, the Boundary Plot shows z-score thresholds (decreasing with more information). For binary endpoints, it shows the constant posterior probability thresholds. Both represent the same concept: the evidence required to stop.
6. Operating Characteristics
Operating characteristics are evaluated via Monte Carlo simulation under both the null and alternative hypotheses.
Type I Error
Probability of declaring efficacy when H₀ is true
Binary: θ_T = θ_C = control_rate
Continuous: true effect θ = 0
Power
Probability of declaring efficacy when H₁ is true
Binary: θ_T = treatment_rate, θ_C = control_rate
Continuous: true effect θ = design alternative
Simulation Algorithm
for sim in 1...N_simulations:
# Generate full trial data under H0 and H1
data_h0 = generate(θ_T = θ_C)
data_h1 = generate(θ_T = θ_T_design)
for look k = 1...K:
# Check stopping at each look
P_h0 = posterior_prob(data_h0[1:n_k])
P_h1 = posterior_prob(data_h1[1:n_k])
if P_h0 >= γ_eff → type1_count++, record stop at k
if P_h1 >= γ_eff → power_count++, record stop at k
if futility_enabled and P_h0 ≤ γ_fut → record futility stop at k (likewise for P_h1)
# Outputs:
type1_error = type1_count / N_simulations
power = power_count / N_simulations
expected_N_h0 = mean(stop_time_h0) × n_per_look
expected_N_h1 = mean(stop_time_h1) × n_per_look
Additional Metrics
| Metric | Definition |
|---|---|
| Expected N (H₀) | Average sample size per arm when the null is true |
| Expected N (H₁) | Average sample size per arm when the alternative is true |
| P(Stop at look k \| H) | Probability of stopping at each look under each hypothesis |
| Futility rate | Overall probability of stopping for futility under H₀ and H₁ |
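For the continuous endpoint, the simulation loop and metrics above can be sketched end to end using the analytical boundaries (all names, the per-observation variance of 1, and the treatment of futility as binding are assumptions of this sketch):

```python
import numpy as np
from math import sqrt
from statistics import NormalDist

def simulate_oc(theta, K=5, n_per_look=20, prior_var=0.5,
                gamma_eff=0.975, gamma_fut=0.10, n_sims=2000, seed=1):
    """Return (P(declare efficacy), expected per-arm N) at true effect theta."""
    rng = np.random.default_rng(seed)
    inv = NormalDist().inv_cdf
    n_cum = np.arange(1, K + 1) * n_per_look
    # Analytical efficacy / futility z-boundaries at each look
    upper = inv(gamma_eff) * np.sqrt(1.0 + 2.0 / (n_cum * prior_var))
    lower = inv(gamma_fut) * np.sqrt(1.0 + 2.0 / (n_cum * prior_var))
    eff, total_n = 0, 0
    for _ in range(n_sims):
        # Per-patient treatment-control differences ~ N(theta, 2); z_k = S_{n_k}/sqrt(2 n_k)
        d = rng.normal(theta, sqrt(2.0), size=n_cum[-1])
        z = np.cumsum(d)[n_cum - 1] / np.sqrt(2.0 * n_cum)
        stopped = K  # run to the final look unless a boundary is crossed
        for k in range(K):
            if z[k] >= upper[k]:
                eff, stopped = eff + 1, k + 1
                break
            if z[k] <= lower[k]:
                stopped = k + 1  # futility stop (treated as binding in this sketch)
                break
        total_n += stopped * n_per_look
    return eff / n_sims, total_n / n_sims

type1, asn_h0 = simulate_oc(theta=0.0)   # null
power, asn_h1 = simulate_oc(theta=0.5)   # design alternative
```

Running both calls gives the four headline metrics of the table above: Type I error and expected N under H₀, power and expected N under H₁.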
Key Findings from Simulation Studies
Zhou & Ji [1] evaluated operating characteristics across 72 scenarios (combinations of true effect distribution, prior width, and number of looks), with 10,000 simulated trials per scenario:
- Type I error control: With a well-calibrated prior, the false positive rate (FPR) stays below 5% even with frequent monitoring (approaching continuous looks).
- Prior-data conflict: When the prior is much wider than the true effect distribution, the FPR can inflate above the nominal level. Conversely, overly narrow priors reduce power.
- Number of looks: Adding interim analyses generally does not inflate error rates in the Bayesian framework (unlike repeated frequentist testing), provided the posterior probability threshold is held constant.
- Coverage: Posterior credible interval coverage remains near 95% across scenarios, with slight under-coverage when the prior is strongly misspecified.
Implication for Practice
These results underscore the importance of prior calibration. For regulatory submissions, sponsors should demonstrate that the chosen prior width does not inflate the frequentist Type I error rate by running OC simulations across a range of plausible effect sizes — which is exactly what this calculator provides.
Computational Note
The "Standard" setting uses 5,000 outer simulations. For binary endpoints, inner Monte Carlo draws scale with outer simulations (capped at 10,000) to estimate posterior probabilities. Power curve evaluation uses N/3 simulations per point for efficiency. Use "Precise" (10,000) for regulatory submissions.
7. Power & ASN Curves
Power Curve
The power curve shows the probability of declaring efficacy as a function of the true treatment effect. For binary endpoints, this is plotted against the treatment rate; for continuous, against the true effect size.
- At zero effect: Should equal the Type I error (typically 2-5%)
- At design alternative: Should meet target power (typically 80-90%)
- Steep transition: Indicates good discriminatory ability
Average Sample Number (ASN) Curve
The ASN curve shows the expected sample size per arm as a function of the true effect. This quantifies the efficiency gains from sequential monitoring:
- Under H₀ (no effect): ASN well below max N indicates effective futility stopping
- Under H₁ (large effect): ASN well below max N indicates early efficacy detection
- Near the boundary: ASN approaches max N (ambiguous evidence requires full enrollment)
8. Regulatory Considerations
FDA Guidance on Bayesian Sequential Designs
Per ICH E20 on adaptive designs [3] and the FDA draft guidance on Bayesian methodology in drug/biologics trials [4], sponsors must pre-specify the number and timing of interim analyses, stopping rules, and report full operating characteristics including frequentist Type I error control.
SAP Documentation Checklist
- Number & Timing of Looks: Pre-specify K and equally-spaced information fractions
- Stopping Rules: Efficacy threshold γ_efficacy, futility threshold γ_futility (if applicable), and whether futility is binding or non-binding
- Prior Specification: Prior parameters with clinical justification, effective sample size, and sensitivity to prior choice
- Operating Characteristics: Type I error, power, expected N under both hypotheses, stopping probabilities at each look
- Simulation Details: Number of Monte Carlo simulations, random seed, software version
- Boundary Table: Full table of boundaries, information fractions, and stopping probabilities at each look
Calibration Guidance
| Efficacy Threshold | Typical Type I Error | When to Use |
|---|---|---|
| 95.0% | ~5-8% | Exploratory Phase II; single-arm extension |
| 97.5% | ~2-5% | Confirmatory one-sided test; FDA standard |
| 99.0% | ~0.5-2% | Conservative design; pediatric extrapolation |
| 99.5% | <1% | Very conservative; multiple comparisons |
Calibration Strategy
The efficacy threshold and number of looks jointly determine the Type I error rate: raising γ_efficacy makes stopping harder, lowering both Type I error and power, while adding looks creates more opportunities to cross a boundary. Use the calculator's OC output to calibrate these settings until the desired balance is achieved. No separate alpha-spending function is specified — the design is calibrated directly on its simulated operating characteristics.
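One way to operationalize this calibration, sketched for the continuous case with efficacy stopping only (function name, scenario values, and the unit per-observation variance are assumptions of this example):

```python
import numpy as np
from statistics import NormalDist

def type1_for_gamma(gamma, K=5, n_per_look=20, prior_var=0.5,
                    n_sims=4000, seed=7):
    """Simulated frequentist Type I error of the PP efficacy rule under H0."""
    rng = np.random.default_rng(seed)
    n_cum = np.arange(1, K + 1) * n_per_look
    bound = NormalDist().inv_cdf(gamma) * np.sqrt(1.0 + 2.0 / (n_cum * prior_var))
    # Null z-paths: cumulative sums of N(0, 2) treatment-control differences
    d = rng.normal(0.0, np.sqrt(2.0), size=(n_sims, n_cum[-1]))
    z = np.cumsum(d, axis=1)[:, n_cum - 1] / np.sqrt(2.0 * n_cum)
    # A trial is a false positive if any look crosses its boundary
    return float(np.mean((z >= bound).any(axis=1)))

# Scan candidate thresholds; pick the smallest gamma whose error meets the target
alphas = {g: type1_for_gamma(g) for g in (0.95, 0.975, 0.99)}
```

Reusing the same random seed across candidate thresholds makes the scan monotone by construction, so the smallest acceptable γ is well defined.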
9. Technical References
- [1] Zhou, T., & Ji, Y. (2024). On Bayesian Sequential Clinical Trial Designs. New England Journal of Statistics in Data Science, 2(1), 136-151.
- [2] Lan, K. K. G., Simon, R., & Halperin, M. (1982). Stochastically Curtailed Tests in Long-Term Clinical Trials. Sequential Analysis, 1(3), 207-219.
- [3] ICH / U.S. Food and Drug Administration (2025). E20 Adaptive Designs for Clinical Trials: Draft Guidance.
- [4] U.S. Food and Drug Administration (2026). Use of Bayesian Methodology in Clinical Trials of Drug and Biological Products: Draft Guidance for Industry.
- [5] Berry, S. M., Carlin, B. P., Lee, J. J., & Müller, P. (2010). Bayesian Adaptive Methods for Clinical Trials. Chapman & Hall/CRC.
- [6] Jennison, C., & Turnbull, B. W. (2000). Group Sequential Methods with Applications to Clinical Trials. Chapman & Hall/CRC.
- [7] Thall, P. F., & Simon, R. (1994). Practical Bayesian Guidelines for Phase IIB Clinical Trials. Biometrics, 50(2), 337-349.
- [8] ICH E9 (1998). Statistical Principles for Clinical Trials.
10. API Reference
Key Parameters
| Parameter | Type | Description |
|---|---|---|
| endpoint_type | string | `"binary"` or `"continuous"` |
| k | int | Number of interim + final analyses (2-10) |
| n_per_look | int | Patients per arm added at each look |
| efficacy_threshold | float | Posterior probability threshold for efficacy (default: 0.975) |
| futility_threshold | float or null | Futility threshold (null = disabled; default: 0.10) |
| control_rate | float | Expected control arm rate (binary only) |
| treatment_rate | float | Expected treatment rate under H₁ (binary only) |
| prior_variance | float | Prior variance (continuous only) |
| n_simulations | int | Monte Carlo simulations (1000-50000) |
Note: The current implementation assumes equal allocation (1:1) between treatment and control arms.
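A hypothetical request body assembled from the parameter table (field names follow the table; the example values, transport, and any endpoint URL are illustrative only):

```python
# Illustrative payload only — values chosen for a binary-endpoint example
design_request = {
    "endpoint_type": "binary",
    "k": 4,                        # 3 interims + final
    "n_per_look": 25,              # per arm, per look (1:1 allocation assumed)
    "efficacy_threshold": 0.975,
    "futility_threshold": 0.10,    # use None to disable futility stopping
    "control_rate": 0.30,
    "treatment_rate": 0.50,
    "n_simulations": 5000,
}

# Basic validation mirroring the documented ranges
assert design_request["endpoint_type"] in ("binary", "continuous")
assert 2 <= design_request["k"] <= 10
assert 1000 <= design_request["n_simulations"] <= 50000
```

Note that `control_rate` and `treatment_rate` apply only when `endpoint_type` is `"binary"`; a continuous design would supply `prior_variance` instead.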
Key Response Fields
- type1_error — Simulated Type I error rate
- power — Simulated power at design alternative
- expected_n_h0 — Average N per arm under null
- expected_n_h1 — Average N per arm under alternative
- boundary_table — Look-by-look boundaries and stopping probabilities
- power_curve — Power and ASN across effect sizes
- decision_rule — Thresholds and interpretation
Ready to design?
Configure your Bayesian sequential monitoring design with Zetyra's calculator.