A Complete Guide to Bayesian Sequential Monitoring

Bayesian sequential monitoring is a framework for conducting interim analyses in clinical trials using posterior probability stopping rules. Instead of controlling a frequentist alpha budget across looks, investigators define direct probability thresholds—asking at each interim: “Given everything we have observed, what is the probability that the treatment is truly superior?”

Analogy: The Thermostat vs. the Timer

A frequentist group sequential design (GSD) is like a kitchen timer: you precommit to checking on the roast at fixed intervals and must budget how many times you open the oven door. Bayesian sequential monitoring is like a thermostat: it continuously reads the actual temperature and acts when the evidence crosses a threshold. You don't need to budget your checks—you just need to calibrate the threshold well.

I. Why Bayesian Sequential Monitoring?

Traditional frequentist interim monitoring (GSD) requires distributing a fixed alpha budget across pre-planned looks. This works well but imposes constraints: the number and timing of analyses must be pre-specified, and the statistical framework answers a somewhat indirect question about p-values rather than direct probabilities of treatment benefit.

Bayesian sequential monitoring offers a fundamentally different approach:

Direct Probability Statements

“There is a 97% probability that the treatment effect is positive” is more intuitive than “the p-value is 0.003.” Bayesian monitoring answers the question clinicians actually ask.

No Alpha Spending Required

There is no alpha budget to distribute. The posterior probability at each look is a self-contained summary of the evidence. Error rates are controlled indirectly by calibrating the stopping thresholds via simulation.

Flexible Timing

The posterior calculation at each look is self-contained—unlike alpha-spending approaches, you do not need to pre-specify the exact number or timing of looks. However, changing look timing or frequency changes the simulated operating characteristics, so any schedule change requires re-simulation to confirm that Type I error and power still meet targets.

Prior Information

Historical data, expert opinion, or previous trials can be formally incorporated through the prior distribution, rather than informally influencing design choices.

Key insight: Bayesian sequential monitoring does not eliminate the need for frequentist error control. Instead, it achieves it differently—by calibrating thresholds via simulation to ensure that Type I error and power meet regulatory expectations under the null and alternative hypotheses.

II. Posterior Probability Stopping Rules

At each interim analysis k, investigators compute the posterior probability that the treatment effect exceeds zero (or a clinically meaningful threshold), given all data observed so far.

The Core Decision Rules

Efficacy stopping: Stop and declare the treatment effective if:

P(\theta > 0 \mid \text{data}_k) \geq \gamma_E

where γ_E is the efficacy threshold (e.g., 0.99).

Futility stopping: Stop for lack of sufficient evidence of benefit if:

P(\theta > 0 \mid \text{data}_k) \leq \gamma_F

where γ_F is the futility threshold (e.g., 0.05).

Binary Endpoints: Beta-Binomial Model

For responder rates with Beta priors, the posterior is available in closed form:

p_j \mid x_j \sim \text{Beta}(\alpha_j + x_j,\, \beta_j + n_j - x_j)

The probability P(p_T − p_C > 0 | data) is computed via numerical integration or simulation from the two independent Beta posteriors.
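A minimal Monte Carlo sketch of this computation; the interim counts (28/60 vs. 18/60) and the Beta(1, 1) priors are illustrative assumptions, not values from this guide:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative interim data (hypothetical, not from this guide)
x_t, n_t = 28, 60      # treatment: responders / enrolled
x_c, n_c = 18, 60      # control:   responders / enrolled
a, b = 1.0, 1.0        # Beta(1, 1) vague prior for each arm

# Draw from the two independent Beta posteriors
p_t = rng.beta(a + x_t, b + n_t - x_t, size=100_000)
p_c = rng.beta(a + x_c, b + n_c - x_c, size=100_000)

# Posterior probability that the treatment response rate is higher
prob_superior = np.mean(p_t > p_c)
print(f"P(p_T > p_C | data) = {prob_superior:.3f}")
```

The stopping decision then reduces to comparing prob_superior against γ_E and γ_F.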

Continuous Endpoints: Normal-Normal Model

For continuous outcomes with known variance, the posterior for the treatment difference δ is normal:

\delta \mid \text{data}_k \sim N\!\left(\mu_k^*,\, (\sigma_k^*)^2\right)

where the posterior mean and variance are precision-weighted combinations of the prior and the observed data. The stopping probability is simply 1 − Φ(−μ_k^*/σ_k^*).
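A sketch of the conjugate update, assuming an illustrative observed mean difference of 3.2 with standard error 1.5 and the vague N(0, 100²) prior discussed later in this guide:

```python
from math import sqrt
from statistics import NormalDist

# Illustrative interim summary (hypothetical values)
prior_mean, prior_sd = 0.0, 100.0   # vague N(0, 100^2) prior on delta
delta_hat, se = 3.2, 1.5            # observed mean difference and its SE

# Precision-weighted conjugate update (known-variance normal-normal model)
w_prior, w_data = 1 / prior_sd**2, 1 / se**2
post_var = 1 / (w_prior + w_data)
post_mean = post_var * (w_prior * prior_mean + w_data * delta_hat)
post_sd = sqrt(post_var)

# Stopping probability: P(delta > 0 | data) = 1 - Phi(-mu*/sigma*)
prob_positive = 1 - NormalDist().cdf(-post_mean / post_sd)
print(f"posterior mean {post_mean:.2f}, P(delta > 0) = {prob_positive:.3f}")
```

With so vague a prior, the posterior is essentially the likelihood: the posterior mean stays at 3.2 and the posterior SD at 1.5.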

Survival Endpoints: Log-HR Model

For time-to-event endpoints, the treatment effect is modeled on the log hazard ratio scale:

\log(\text{HR}) \mid d_k \sim N\!\left(\mu_k^*,\, (\sigma_k^*)^2\right)

with data variance 4/d_k, where d_k is the number of events at look k. The efficacy criterion becomes P(HR < 1 | data_k) ≥ γ_E.
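One interim look can be sketched as follows; the 200 events and the observed HR estimate of 0.78 are illustrative assumptions:

```python
from math import log, sqrt
from statistics import NormalDist

# Illustrative look: 200 events with an observed HR estimate of 0.78
d_k = 200
log_hr_hat = log(0.78)
data_var = 4 / d_k                  # approximate variance of the log-HR estimate

# Vague N(0, 100^2) prior on log(HR); the data dominate the update
prior_var = 100.0**2
post_var = 1 / (1 / prior_var + 1 / data_var)
post_mean = post_var * (log_hr_hat / data_var)

# Efficacy criterion: P(HR < 1 | data_k) = P(log HR < 0 | data_k)
prob_benefit = NormalDist().cdf(-post_mean / sqrt(post_var))
print(f"P(HR < 1 | data_k) = {prob_benefit:.3f}")
```

Here the posterior probability of benefit is about 0.96, below an efficacy threshold of γ_E = 0.99, so this trial would continue to the next look.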

III. Prior Specification

The prior is the most scrutinized element of any Bayesian design submitted to regulators. In sequential monitoring, the prior influences both the stopping boundaries and the operating characteristics. Getting it right is essential.

Vague / Non-informative Priors

The safest regulatory choice. Use wide priors like Beta(1, 1) for proportions or N(0, 100²) for continuous effects. The data quickly dominate, and the design's operating characteristics are driven almost entirely by the stopping thresholds and sample size.

When to use: Confirmatory trials, regulatory submissions, no compelling historical data.

Weakly Informative Priors

Encode general domain knowledge (e.g., “treatment effects in this class are typically 5–15%”) without being overly specific. These can improve efficiency without alarming reviewers.

When to use: Phase II trials, well-characterized therapeutic areas, internal decision-making.

Informative / Historical Priors

Derived from previous trials or meta-analyses. Powerful for borrowing historical control data but requires careful justification. Consider using a robust mixture (e.g., 80% informative + 20% vague) to protect against prior-data conflict.

When to use: Single-arm trials with external controls, rare diseases, pediatric extrapolation.
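The robustness mechanism of a mixture prior can be made concrete by computing its posterior mixture weights. This sketch uses wholly hypothetical numbers: an 80/20 mixture of an informative Beta(30, 70) (centered at a 30% rate) and a vague Beta(1, 1), confronted with conflicting data of 55 responders in 100:

```python
from math import lgamma, exp

def log_marglik(x, n, a, b):
    """Log Beta-Binomial marginal likelihood of x successes in n trials
    under a Beta(a, b) prior on the response rate."""
    return (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
            + lgamma(a + b) - lgamma(a) - lgamma(b)
            + lgamma(a + x) + lgamma(b + n - x) - lgamma(a + b + n))

# Hypothetical robust mixture: 80% informative Beta(30, 70), 20% vague Beta(1, 1)
w_inf, w_vag = 0.8, 0.2
x, n = 55, 100          # observed data in conflict with the informative prior

m_inf = exp(log_marglik(x, n, 30, 70))
m_vag = exp(log_marglik(x, n, 1, 1))

# Bayes updates the mixture weights by each component's marginal likelihood,
# so prior-data conflict shifts mass onto the vague (robust) component
w_inf_post = w_inf * m_inf / (w_inf * m_inf + w_vag * m_vag)
print(f"posterior weight on informative component: {w_inf_post:.3f}")
```

Under conflict like this, the informative component's weight collapses well below its 80% prior value, which is exactly the protection the mixture is meant to provide.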

Regulatory tip: Always present a sensitivity analysis showing operating characteristics under multiple prior assumptions. Regulators want to see that your conclusions are robust, not driven by the prior. The FDA's 2010 Bayesian guidance specifically recommends this approach.

IV. Calibrating Boundaries

Unlike frequentist GSD where boundaries are derived analytically from an alpha-spending function, Bayesian boundaries are calibrated via simulation. This is both the method's greatest strength and its primary source of complexity.

The Calibration Process

1. Choose initial thresholds. Start with a candidate efficacy threshold γ_E (e.g., 0.99) and futility threshold γ_F (e.g., 0.05).

2. Simulate under H₀. Generate thousands of trials with no treatment effect. Record how often the design incorrectly stops for efficacy. This is the simulated Type I error.

3. Simulate under H₁. Generate trials with the target treatment effect. Record how often the design correctly stops for efficacy. This is the simulated power.

4. Iterate. Adjust γ_E and γ_F until the design meets the target Type I error (e.g., 5%) and power (e.g., 80%).
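The steps above can be sketched for a two-arm normal endpoint. Every name and default below (the look sizes, the target effect of 0.3, the simulation count) is an illustrative assumption, not a value from this guide:

```python
import numpy as np
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf

def efficacy_stop_rate(delta, looks=(50, 100, 150, 200), gamma_e=0.99,
                       gamma_f=0.05, sigma=1.0, n_sims=2000, seed=1):
    """Fraction of simulated two-arm trials that stop for efficacy.

    Normal endpoint with a vague prior, so at each look the posterior for
    delta is approximately N(delta_hat, 2*sigma^2/n). Illustrative sketch.
    """
    rng = np.random.default_rng(seed)
    stops = 0
    for _ in range(n_sims):
        sum_t = sum_c = 0.0
        n_prev = 0
        for n in looks:                         # cumulative per-arm sizes
            m = n - n_prev                      # newly enrolled per arm
            sum_t += rng.normal(delta, sigma, m).sum()
            sum_c += rng.normal(0.0, sigma, m).sum()
            n_prev = n
            delta_hat = (sum_t - sum_c) / n
            prob = PHI(delta_hat / (sigma * sqrt(2 / n)))  # P(delta > 0 | data)
            if prob >= gamma_e:                 # efficacy stop
                stops += 1
                break
            if prob <= gamma_f:                 # futility stop
                break
    return stops / n_sims

type1 = efficacy_stop_rate(delta=0.0)   # step 2: simulate under H0
power = efficacy_stop_rate(delta=0.3)   # step 3: simulate under H1
print(f"Type I error = {type1:.3f}, power = {power:.3f}")
```

Step 4 amounts to rerunning these two calls over a grid of (γ_E, γ_F) pairs and keeping the pair that meets both targets.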

Key Relationships

| Threshold Change | Type I Error | Power | Expected N |
| --- | --- | --- | --- |
| Raise γ_E (stricter) | Decreases | Decreases | Increases |
| Lower γ_E (looser) | Increases | Increases | Decreases |
| Raise γ_F (more aggressive futility) | Slight decrease | Decreases | Decreases |
| More interim looks | May increase | Slight change | Decreases |

Why simulation? Because Bayesian stopping rules interact with the prior, the sample size, and the number of looks in complex, non-linear ways. There is no closed-form relationship between γ_E and Type I error for most practical designs. Simulation is the only reliable calibration method.

V. Operating Characteristics

Regulators evaluate Bayesian sequential designs by their frequentist operating characteristics—how the design performs across repeated use, regardless of Bayesian interpretation. The key metrics:

Type I Error Rate

The probability of stopping for efficacy when the true treatment effect is zero. Computed by simulating trials under H₀. Must typically be controlled at 5% (two-sided) or 2.5% (one-sided) for regulatory acceptance.

Power

The probability of stopping for efficacy when the true treatment effect equals the design alternative. Computed by simulating trials under H₁. Typically targeted at 80–90%.

Average Sample Number (ASN)

The expected number of subjects (or events) enrolled before the trial stops. A key advantage of sequential designs: the ASN is typically 20–40% lower than the fixed-sample equivalent under both H₀ and H₁.

Stopping Probabilities by Look

The probability of stopping at each specific interim analysis, broken down by efficacy and futility. This helps investigators plan resources and timelines.

Regulatory framing: Present operating characteristics as a table showing Type I error, power, and ASN for each scenario (null, alternative, and several intermediate effect sizes). This “OC table” is the lingua franca for communicating Bayesian design performance to frequentist-trained reviewers.

VI. Worked Example: Phase III Oncology Trial

Scenario

A Phase III trial compares a novel immunotherapy to standard of care for overall survival. The design targets a hazard ratio of 0.75, with 3 equally spaced interim analyses and a final analysis (4 looks total), maximum 400 events.

Step 1: Specify the Model

Endpoint: Survival (time-to-event)

Effect measure: log(HR), with HR < 1 indicating benefit

Prior: Vague N(0, 100²) on log(HR)

Looks: At 100, 200, 300, and 400 events

Step 2: Calibrate Thresholds

Through simulation (10,000 trials under each hypothesis):

| Threshold Pair | Type I Error | Power (HR = 0.75) | ASN (H₀) | ASN (H₁) |
| --- | --- | --- | --- | --- |
| γ_E = 0.99, γ_F = 0.05 | 2.8% | 76% | 310 | 270 |
| γ_E = 0.985, γ_F = 0.05 | 4.2% | 82% | 295 | 255 |
| γ_E = 0.975, γ_F = 0.05 | 6.1% | 87% | 280 | 240 |

The middle row (γ_E = 0.985) achieves approximately 4% Type I error and 82% power—meeting both the 5% error constraint and the 80% power target. This is the selected design.
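A design like this can be re-simulated using the canonical Brownian-motion approximation for the log-HR score process, where statistical information at look k is d_k/4 and the score accumulates in independent normal increments. This is an illustrative sketch (the function name, seed, and simulation count are assumptions), but its estimates should land in the same ballpark as the selected row:

```python
import numpy as np
from math import log, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf

def efficacy_stop_rate(hr, looks=(100, 200, 300, 400), gamma_e=0.985,
                       gamma_f=0.05, n_sims=4000, seed=7):
    """Fraction of simulated trials stopping for efficacy under true HR `hr`.

    Normal approximation for survival: information at look k is d_k/4, and
    the score process has independent N(theta * dI, dI) increments, where
    theta = log(hr). Illustrative sketch, not a validated implementation.
    """
    rng = np.random.default_rng(seed)
    theta = log(hr)
    info = np.asarray(looks) / 4.0              # information at each look
    d_info = np.diff(np.concatenate(([0.0], info)))
    stops = 0
    for _ in range(n_sims):
        score = 0.0
        for i_k, di in zip(info, d_info):
            score += rng.normal(theta * di, sqrt(di))
            prob = PHI(-score / sqrt(i_k))      # P(HR < 1 | data_k)
            if prob >= gamma_e:                 # efficacy stop
                stops += 1
                break
            if prob <= gamma_f:                 # futility stop
                break
    return stops / n_sims

type1 = efficacy_stop_rate(hr=1.0)     # simulated Type I error under H0
power = efficacy_stop_rate(hr=0.75)    # simulated power at the target HR
print(f"Type I error = {type1:.3f}, power = {power:.3f}")
```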

Step 3: Report Operating Characteristics

| Look (Events) | P(Stop Efficacy under H₁) | P(Stop Futility under H₀) |
| --- | --- | --- |
| Look 1 (100 events) | 8% | 12% |
| Look 2 (200 events) | 28% | 25% |
| Look 3 (300 events) | 25% | 20% |
| Final (400 events) | 21% | 43% |

Interpretation: Under the alternative hypothesis (HR = 0.75), 36% of trials stop for efficacy by the second look (200 events), saving substantial resources. Under the null, 37% of trials stop for futility by the second look, protecting patients from continued enrollment in a futile trial.

VII. When NOT to Use Bayesian Sequential Monitoring

Bayesian sequential monitoring is powerful but not always the right choice:

Conservative regulatory environment

Some regulatory agencies or review divisions are less familiar with Bayesian methods. If the regulatory path requires strict frequentist Type I error control with well-established methods, GSD with alpha spending may be more straightforward.

No prior information available

If you have no historical data and plan to use vague priors, the Bayesian approach behaves approximately like a frequentist design, with extra simulation complexity on top. The added machinery may not be justified.

Very few interim looks planned

With only 1–2 interim analyses, the advantage of Bayesian flexibility over alpha spending is minimal. GSD with O'Brien–Fleming boundaries is simpler and nearly as efficient.

Prior sensitivity is high

If operating characteristics change dramatically with small changes to the prior, the design is fragile. This often happens with small sample sizes and informative priors—a dangerous combination.

In these cases, consider frequentist Group Sequential Design or the hybrid approaches described in our GSD vs. Bayesian Sequential comparison.

VIII. Summary

Bayesian sequential monitoring offers a coherent framework for interim decision-making that speaks the language clinicians and sponsors naturally use—probabilities of treatment benefit. Its strength lies not in avoiding frequentist requirements, but in meeting them through a more intuitive lens.

Define stopping rules as posterior probability thresholds for efficacy and futility.

Calibrate thresholds via simulation to achieve target Type I error and power.

Report operating characteristics in frequentist terms for regulatory audiences.

Conduct sensitivity analyses across prior assumptions to demonstrate robustness.

The best Bayesian sequential designs are those where the prior is defensible, the thresholds are calibrated with care, and the operating characteristics are presented transparently. When these conditions are met, the approach offers genuine advantages in flexibility, interpretability, and efficiency over frequentist alternatives.

Ready to design your Bayesian sequential trial?

Use the Bayesian Sequential Calculator to calibrate stopping boundaries, simulate operating characteristics, and generate power/ASN curves for your design.

Open Bayesian Sequential Calculator