A Complete Guide to Bayesian Sequential Monitoring
Bayesian sequential monitoring is a framework for conducting interim analyses in clinical trials using posterior probability stopping rules. Instead of controlling a frequentist alpha budget across looks, investigators define direct probability thresholds—asking at each interim: “Given everything we have observed, what is the probability that the treatment is truly superior?”
Analogy: The Thermostat vs. the Timer
Frequentist GSD is like a kitchen timer: you precommit to checking on the roast at fixed intervals and must budget how many times you open the oven door. Bayesian sequential monitoring is like a thermostat: it continuously reads the actual temperature and acts when the evidence crosses a threshold. You don't need to budget your checks—you just need to calibrate the threshold well.
I. Why Bayesian Sequential Monitoring?
Traditional frequentist interim monitoring (GSD) requires distributing a fixed alpha budget across pre-planned looks. This works well but imposes constraints: the number and timing of analyses must be pre-specified, and the statistical framework answers a somewhat indirect question about p-values rather than direct probabilities of treatment benefit.
Bayesian sequential monitoring offers a fundamentally different approach:
Direct Probability Statements
“There is a 97% probability that the treatment effect is positive” is more intuitive than “the p-value is 0.003.” Bayesian monitoring answers the question clinicians actually ask.
No Alpha Spending Required
There is no alpha budget to distribute. The posterior probability at each look is a self-contained summary of the evidence. Error rates are controlled indirectly by calibrating the stopping thresholds via simulation.
Flexible Timing
The posterior calculation at each look is self-contained—unlike alpha-spending approaches, you do not need to pre-specify the exact number or timing of looks. However, changing look timing or frequency changes the simulated operating characteristics, so any schedule change requires re-simulation to confirm that Type I error and power still meet targets.
Prior Information
Historical data, expert opinion, or previous trials can be formally incorporated through the prior distribution, rather than informally influencing design choices.
Key insight: Bayesian sequential monitoring does not eliminate the need for frequentist error control. Instead, it achieves it differently—by calibrating thresholds via simulation to ensure that Type I error and power meet regulatory expectations under the null and alternative hypotheses.
II. Posterior Probability Stopping Rules
At each interim analysis k, investigators compute the posterior probability that the treatment effect θ exceeds zero (or a clinically meaningful threshold), given all data observed so far.
The Core Decision Rules
Efficacy stopping: Stop and declare the treatment effective if:

P(θ > 0 | dataₖ) > U

where U is the efficacy threshold (e.g., 0.99).

Futility stopping: Stop for lack of sufficient evidence of benefit if:

P(θ > 0 | dataₖ) < L

where L is the futility threshold (e.g., 0.05).
Binary Endpoints: Beta-Binomial Model
For responder rates p_T and p_C with Beta(a, b) priors, the posterior in each arm is available in closed form:

p | data ~ Beta(a + r, b + n − r)

where r is the number of responders among the n subjects in that arm. The probability P(p_T > p_C | data) is computed via numerical integration or simulation from the two independent Beta posteriors.
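The simulation route can be sketched in a few lines. This is a minimal illustration, assuming Beta(1, 1) priors and hypothetical interim counts; the function name and arguments are placeholders, not a published API:

```python
# Monte Carlo estimate of P(p_T > p_C | data) under independent
# Beta-Binomial models. Beta(1, 1) priors and the example counts
# below are illustrative assumptions.
import random

def prob_superior(r_t, n_t, r_c, n_c, a=1.0, b=1.0,
                  n_draws=100_000, seed=42):
    """Estimate P(p_T > p_C) by sampling the two Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        p_t = rng.betavariate(a + r_t, b + n_t - r_t)
        p_c = rng.betavariate(a + r_c, b + n_c - r_c)
        if p_t > p_c:
            wins += 1
    return wins / n_draws

# A hypothetical interim look: 30/50 responders vs. 20/50
print(prob_superior(30, 50, 20, 50))
```

With 100,000 draws the Monte Carlo error on the probability is below half a percentage point, more than adequate for comparing against thresholds like 0.99.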
Continuous Endpoints: Normal-Normal Model
For continuous outcomes with known variance, the posterior for the treatment difference δ is normal:

δ | data ~ N(μₙ, σₙ²)

where the posterior mean μₙ and variance σₙ² are precision-weighted combinations of the prior and the observed data. The stopping probability is simply P(δ > 0 | data) = Φ(μₙ/σₙ).
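A minimal sketch of this precision-weighted update, assuming a vague N(0, 10²) prior on the difference (the prior and the example inputs are assumptions for illustration):

```python
# Normal-Normal posterior for a treatment difference delta with known
# sampling variance. The N(0, 10^2) prior is an illustrative assumption.
from statistics import NormalDist

def posterior_prob_positive(xbar, se, mu0=0.0, tau0=10.0):
    """P(delta > 0 | data) given observed effect xbar with standard
    error se, under a N(mu0, tau0^2) prior."""
    prec = 1.0 / tau0**2 + 1.0 / se**2            # posterior precision
    mu_n = (mu0 / tau0**2 + xbar / se**2) / prec  # posterior mean
    sd_n = prec ** -0.5                           # posterior sd
    return NormalDist().cdf(mu_n / sd_n)          # Phi(mu_n / sigma_n)

print(posterior_prob_positive(xbar=2.0, se=1.0))
```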
Survival Endpoints: Log-HR Model
For time-to-event endpoints, the treatment effect is modeled on the log hazard ratio scale, θ = log(HR):

θ̂ | θ ~ N(θ, 4/dₖ)

with data variance 4/dₖ, where dₖ is the number of events at look k. Since θ < 0 indicates benefit, the efficacy criterion becomes P(θ < 0 | dataₖ) > U.
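The same Normal-Normal machinery applies on the log-HR scale. In this sketch the vague N(0, 10²) prior, the observed HR, and the event count are all illustrative assumptions:

```python
# Posterior probability of benefit on the log-HR scale, using the
# standard approximation Var(log HR-hat) ~ 4/d for d total events.
# The N(0, 10^2) prior and example inputs are assumptions.
import math
from statistics import NormalDist

def prob_benefit(loghr_hat, d_events, mu0=0.0, tau0=10.0):
    """P(theta < 0 | data), where theta = log HR and theta < 0 is benefit."""
    var_data = 4.0 / d_events                     # data variance at this look
    prec = 1.0 / tau0**2 + 1.0 / var_data
    mu_n = (mu0 / tau0**2 + loghr_hat / var_data) / prec
    sd_n = prec ** -0.5
    return NormalDist().cdf(-mu_n / sd_n)         # P(theta < 0)

# Hypothetical interim: observed HR of 0.75 at the 200-event look
print(prob_benefit(math.log(0.75), 200))
```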
III. Prior Specification
The prior is the most scrutinized element of any Bayesian design submitted to regulators. In sequential monitoring, the prior influences both the stopping boundaries and the operating characteristics. Getting it right is essential.
Vague / Non-informative Priors
The safest regulatory choice. Use wide priors such as Beta(1, 1) for proportions or a zero-centered normal with large variance for continuous effects. The data quickly dominate the prior, and the design's operating characteristics are driven almost entirely by the stopping thresholds and sample size.
When to use: Confirmatory trials, regulatory submissions, no compelling historical data.
Weakly Informative Priors
Encode general domain knowledge (e.g., “treatment effects in this class are typically 5–15%”) without being overly specific. These can improve efficiency without alarming reviewers.
When to use: Phase II trials, well-characterized therapeutic areas, internal decision-making.
Informative / Historical Priors
Derived from previous trials or meta-analyses. Powerful for borrowing historical control data but requires careful justification. Consider using a robust mixture (e.g., 80% informative + 20% vague) to protect against prior-data conflict.
When to use: Single-arm trials with external controls, rare diseases, pediatric extrapolation.
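The robust mixture idea can be sketched for a binary endpoint. Everything numeric here is a hypothetical illustration: an 80% informative Beta(20, 30) component (a historical response rate near 40%) mixed with a 20% vague Beta(1, 1) component; posterior weights shift by each component's marginal likelihood, so prior-data conflict automatically discounts the historical information:

```python
# Robust mixture prior for a response rate: posterior component
# weights are updated by each component's marginal likelihood.
# All mixture parameters below are illustrative assumptions.
import math
import random

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def mixture_posterior(r, n, components=((0.8, 20.0, 30.0), (0.2, 1.0, 1.0))):
    """Return [(weight, a_post, b_post), ...] after observing r/n responders."""
    scored = []
    for w, a, b in components:
        # Log marginal likelihood of the data under this component
        # (the binomial coefficient cancels across components).
        log_ml = log_beta(a + r, b + n - r) - log_beta(a, b)
        scored.append((math.log(w) + log_ml, a + r, b + n - r))
    top = max(lw for lw, _, _ in scored)
    raw = [math.exp(lw - top) for lw, _, _ in scored]
    total = sum(raw)
    return [(rw / total, a, b) for rw, (_, a, b) in zip(raw, scored)]

def prob_exceeds(r, n, p0=0.4, n_draws=50_000, seed=7):
    """Monte Carlo P(p > p0 | data) under the mixture posterior."""
    rng = random.Random(seed)
    comps = mixture_posterior(r, n)
    weights = [w for w, _, _ in comps]
    hits = 0
    for _ in range(n_draws):
        _, a, b = rng.choices(comps, weights=weights)[0]
        hits += rng.betavariate(a, b) > p0
    return hits / n_draws

# Conflicting data (62.5% observed vs. ~40% historical) shifts
# posterior weight away from the informative component:
print(mixture_posterior(25, 40))
```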
Regulatory tip: Always present a sensitivity analysis showing operating characteristics under multiple prior assumptions. Regulators want to see that your conclusions are robust, not driven by the prior. The FDA's 2010 Bayesian guidance specifically recommends this approach.
IV. Calibrating Boundaries
Unlike frequentist GSD where boundaries are derived analytically from an alpha-spending function, Bayesian boundaries are calibrated via simulation. This is both the method's greatest strength and its primary source of complexity.
The Calibration Process
Choose initial thresholds. Start with a candidate efficacy threshold U (e.g., 0.99) and futility threshold L (e.g., 0.05).
Simulate under H₀. Generate thousands of trials with no treatment effect. Record how often the design incorrectly stops for efficacy. This is the simulated Type I error.
Simulate under H₁. Generate trials with the target treatment effect. Record how often the design correctly stops for efficacy. This is the simulated power.
Iterate. Adjust U and L until the design meets the target Type I error (e.g., 5%) and power (e.g., 80%).
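The steps above can be sketched end-to-end for a toy normal endpoint. This is a one-sample simplification with illustrative looks, thresholds, and effect size, not a production design:

```python
# Simulation-based calibration sketch: a normal endpoint with known
# sd = 1, vague prior, and looks at n = 25, 50, 75, 100. All design
# numbers are illustrative assumptions.
import random
from statistics import NormalDist

PHI = NormalDist().cdf

def run_trial(delta, rng, looks=(25, 50, 75, 100), U=0.99, L=0.05):
    """Simulate one sequential trial; return 'efficacy' or 'futility'."""
    total, n_prev = 0.0, 0
    for n in looks:
        # Accrue the new observations since the previous look
        total += sum(rng.gauss(delta, 1.0) for _ in range(n - n_prev))
        n_prev = n
        # One-sample simplification with a vague prior:
        # delta | data ~ N(xbar, 1/n), so P(delta > 0) = Phi(xbar * sqrt(n))
        p = PHI((total / n) * n ** 0.5)
        if p > U:
            return "efficacy"
        if p < L:
            return "futility"
    return "futility"   # no efficacy stop by the final look

def stop_efficacy_rate(delta, n_sims=2000, seed=1, **kw):
    rng = random.Random(seed)
    return sum(run_trial(delta, rng, **kw) == "efficacy"
               for _ in range(n_sims)) / n_sims

type1 = stop_efficacy_rate(delta=0.0)   # simulated Type I error under H0
power = stop_efficacy_rate(delta=0.4)   # simulated power under H1
print(type1, power)
```

In practice this inner loop is wrapped in a search over (U, L) pairs, rerun with 10,000+ simulations per candidate once the region of interest is found.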
Key Relationships
| Threshold Change | Type I Error | Power | Expected N |
|---|---|---|---|
| Raise U (stricter efficacy) | Decreases | Decreases | Increases |
| Lower U (looser efficacy) | Increases | Increases | Decreases |
| Raise L (more aggressive futility) | Slight decrease | Decreases | Decreases |
| More interim looks | May increase | Slight change | Decreases |
Why simulation? Because Bayesian stopping rules interact with the prior, the sample size, and the number of looks in complex, non-linear ways. There is no closed-form relationship between the thresholds (U, L) and Type I error for most practical designs. Simulation is the only reliable calibration method.
V. Operating Characteristics
Regulators evaluate Bayesian sequential designs by their frequentist operating characteristics—how the design performs across repeated use, regardless of Bayesian interpretation. The key metrics:
Type I Error Rate
The probability of stopping for efficacy when the true treatment effect is zero. Computed by simulating trials under H₀. Must typically be controlled at 5% (two-sided) or 2.5% (one-sided) for regulatory acceptance.
Power
The probability of stopping for efficacy when the true treatment effect equals the design alternative. Computed by simulating trials under H₁. Typically targeted at 80–90%.
Average Sample Number (ASN)
The expected number of subjects (or events) enrolled before the trial stops. A key advantage of sequential designs: the ASN is typically 20–40% lower than the fixed-sample equivalent under both H₀ and H₁.
Stopping Probabilities by Look
The probability of stopping at each specific interim analysis, broken down by efficacy and futility. This helps investigators plan resources and timelines.
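These metrics all fall out of the same simulation output. One way to tabulate them, assuming a hypothetical list of simulated (stop look, sample size at stop, reason) tuples from any trial simulator:

```python
# Aggregate simulated trial outcomes into the standard OC summary.
# `demo` is a tiny hypothetical result set for illustration.
from collections import Counter

def summarize(sim_results, n_looks=4):
    """sim_results: list of (stop_look, n_at_stop, reason) tuples,
    with reason either 'eff' or 'fut'."""
    n = len(sim_results)
    by_look = Counter((look, reason) for look, _, reason in sim_results)
    return {
        "P(efficacy)": sum(c for (l, r), c in by_look.items()
                           if r == "eff") / n,
        "ASN": sum(n_stop for _, n_stop, _ in sim_results) / n,
        "by_look": {l: {r: by_look[(l, r)] / n for r in ("eff", "fut")}
                    for l in range(1, n_looks + 1)},
    }

demo = [(1, 100, "eff"), (2, 200, "fut"), (4, 400, "eff"), (4, 400, "fut")]
print(summarize(demo))
```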
Regulatory framing: Present operating characteristics as a table showing Type I error, power, and ASN for each scenario (null, alternative, and several intermediate effect sizes). This “OC table” is the lingua franca for communicating Bayesian design performance to frequentist-trained reviewers.
VI. Worked Example: Phase III Oncology Trial
Scenario
A Phase III trial compares a novel immunotherapy to standard of care for overall survival. The design targets a hazard ratio of 0.75, with 3 equally spaced interim analyses and a final analysis (4 looks total), maximum 400 events.
Step 1: Specify the Model
Endpoint: Survival (time-to-event)
Effect measure: θ = log(HR), with θ < 0 indicating benefit
Prior: Vague prior on θ
Looks: At 100, 200, 300, and 400 events
Step 2: Calibrate Thresholds
Through simulation (10,000 trials under each hypothesis):
| Threshold Pair | Type I Error | Power (HR=0.75) | ASN (H₀) | ASN (H₁) |
|---|---|---|---|---|
| Pair 1 (strictest) | 2.8% | 76% | 310 | 270 |
| Pair 2 (middle) | 4.2% | 82% | 295 | 255 |
| Pair 3 (loosest) | 6.1% | 87% | 280 | 240 |
The middle row achieves approximately 4% Type I error and 82% power—meeting both the 5% error constraint and the 80% power target. This is the selected design.
Step 3: Report Operating Characteristics
| Look (Events) | P(Stop Efficacy \| H₁) | P(Stop Futility \| H₀) |
|---|---|---|
| Look 1 (100 events) | 8% | 12% |
| Look 2 (200 events) | 28% | 25% |
| Look 3 (300 events) | 25% | 20% |
| Final (400 events) | 21% | 43% |
Interpretation: Under the alternative hypothesis (HR = 0.75), 36% of trials stop for efficacy by the second look (200 events), saving substantial resources. Under the null, 37% of trials stop for futility by the second look, protecting patients from continued enrollment in a futile trial.
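A rough simulation of a design like this can be sketched with the standard score-increment approximation on the log-HR scale. The thresholds U = 0.99 and L = 0.05 below are placeholder assumptions, not the calibrated pair from the tables above, so the outputs will not reproduce those numbers:

```python
# Four-look survival design sketch: looks at 100/200/300/400 events,
# log-HR model with Var(theta-hat) ~ 4/d, vague prior. Thresholds
# here are illustrative placeholders, not the calibrated design.
import math
import random
from statistics import NormalDist

PHI = NormalDist().cdf

def simulate_design(hr, n_sims=4000, looks=(100, 200, 300, 400),
                    U=0.99, L=0.05, seed=3):
    """Return (P(stop efficacy), P(stop futility), expected events)."""
    theta = math.log(hr)
    rng = random.Random(seed)
    eff = fut = 0
    total_events = 0
    for _ in range(n_sims):
        score, info_prev = 0.0, 0.0
        for d in looks:
            info = d / 4.0                        # Fisher information ~ d/4
            di = info - info_prev
            # Independent score increment between looks: N(theta*di, di)
            score += rng.gauss(theta * di, math.sqrt(di))
            info_prev = info
            theta_hat = score / info              # ~ N(theta, 1/info)
            p_benefit = PHI(-theta_hat * math.sqrt(info))  # P(theta < 0)
            if p_benefit > U:
                eff += 1
                break
            if p_benefit < L:
                fut += 1
                break
        total_events += d                         # events at stopping look
    return eff / n_sims, fut / n_sims, total_events / n_sims

power, _, asn_h1 = simulate_design(hr=0.75)
type1, _, asn_h0 = simulate_design(hr=1.0)
print(power, type1, asn_h1, asn_h0)
```

Swapping in candidate (U, L) pairs and rerunning is exactly the calibration loop of Section IV.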
VII. When NOT to Use Bayesian Sequential Monitoring
Bayesian sequential monitoring is powerful but not always the right choice:
Conservative regulatory environment
Some regulatory agencies or review divisions are less familiar with Bayesian methods. If the regulatory path requires strict frequentist Type I error control with well-established methods, GSD with alpha spending may be more straightforward.
No prior information available
If you have no historical data and plan to use vague priors, the Bayesian approach reduces approximately to a frequentist design with extra simulation complexity. The added machinery may not be justified.
Very few interim looks planned
With only 1–2 interim analyses, the advantage of Bayesian flexibility over alpha spending is minimal. GSD with O'Brien–Fleming boundaries is simpler and nearly as efficient.
Prior sensitivity is high
If operating characteristics change dramatically with small changes to the prior, the design is fragile. This often happens with small sample sizes and informative priors—a dangerous combination.
In these cases, consider frequentist Group Sequential Design or the hybrid approaches described in our GSD vs. Bayesian Sequential comparison.
VIII. Summary
Bayesian sequential monitoring offers a coherent framework for interim decision-making that speaks the language clinicians and sponsors naturally use—probabilities of treatment benefit. Its strength lies not in avoiding frequentist requirements, but in meeting them through a more intuitive lens.
Define stopping rules as posterior probability thresholds for efficacy and futility.
Calibrate thresholds via simulation to achieve target Type I error and power.
Report operating characteristics in frequentist terms for regulatory audiences.
Conduct sensitivity analyses across prior assumptions to demonstrate robustness.
The best Bayesian sequential designs are those where the prior is defensible, the thresholds are calibrated with care, and the operating characteristics are presented transparently. When these conditions are met, the approach offers genuine advantages in flexibility, interpretability, and efficiency over frequentist alternatives.
Ready to design your Bayesian sequential trial?
Use the Bayesian Sequential Calculator to calibrate stopping boundaries, simulate operating characteristics, and generate power/ASN curves for your design.