Single-Arm Trials: Design and Adaptive Sizing
A methods guide for designing Phase II single-arm trials — when they're the right choice, how to set the inputs your calculator asks for, and how to read its output before committing to a protocol.
When to use this guide. This guide is for readers who are designing a single-arm trial (Phase II oncology being the dominant case), weighing one against a randomized alternative, or preparing to defend one to a regulator.
1. When a single-arm trial is the right design
Single-arm trials measure response in a treated cohort against a fixed external benchmark instead of a concurrent control. They are simpler, faster, and smaller than randomized trials — and they discard the strongest defense against confounding that clinical research has.
The three settings where single-arm is defensible
- •Phase II oncology with high unmet need. Rare cancers, post-chemo refractory populations, biomarker-defined subsets. The historical response rate to standard of care is well-characterized and low enough that a meaningful improvement shows up as a stark contrast rather than a subtle effect.
- •Randomization is infeasible. Ultra-rare diseases (n < 200 eligible patients globally), pediatric indications where exposing healthy controls is unethical, or settings where standard of care is no treatment and patients refuse a placebo arm.
- •Accelerated-approval pathways. In specific oncology contexts — serious or life-threatening disease, unmet need, and an endpoint reasonably likely to predict clinical benefit — FDA can grant accelerated approval based on single-arm data (Subpart H for drugs, Subpart E for biologics), with confirmatory randomized data required post-approval. This is case-dependent and tied to endpoint persuasiveness, effect size, unmet need, and the agency's confirmatory-evidence expectations — not a default pathway.
Rule of thumb: if you can randomize, you should. Single-arm is a concession, not a preference. Every regulator question you will receive starts from this assumption.
When single-arm is not defensible
Even when randomization is operationally hard, single-arm is the wrong design when any of the following are true:
- •Subjective endpoints. Symptom scales, PROs, or investigator-rated response — endpoints vulnerable to expectation bias.
- •Large natural-history variability. Diseases where spontaneous improvement or fluctuation can produce apparent “response” absent treatment effect.
- •Evolving standard of care. Indication-specific benchmarks shift when newer therapies emerge; the historical control rate is a moving target.
- •Endpoint sensitivity to assessment schedule. Different imaging cadence, response criteria versions, or central-vs-investigator review can shift ORR by 5–10 percentage points.
- •No credible historical benchmark. Without a defensible $p_0$ drawn from a synthesis of comparable trials, the test is against an assumption rather than against evidence.
2. The core design decisions
Endpoint choice
Objective response rate (ORR) is the dominant Phase II oncology endpoint and what Zetyra's single-arm tools target. ORR is binary at the patient level (responder vs. non-responder per RECIST 1.1 or equivalent), measurable on a short timeline, and standardized across regulators.
Other binary or time-to-event endpoints (disease control rate, progression-free survival) can support an ORR signal but rarely substitute for it in accelerated-approval submissions. PFS as primary requires a concurrent control to interpret — which defeats the purpose of a single-arm design.
Performance goal selection
A single-arm trial tests a null hypothesis about a single response rate against an alternative:
- •$p_0$: the historical control rate. The largest response rate that would not justify further development. This is the value the trial is designed to reject.
- •$p_1$: the target response rate the trial is powered to detect. The smallest response rate that would justify a Phase III investment.
The gap $p_1 - p_0$ drives sample size. A 10-point improvement (e.g., 30% → 40%) is typical for an enriched oncology population; smaller gaps require substantially larger N.
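To make the effect of the gap concrete, here is a minimal sketch of an exact single-stage binomial sample size search, in the spirit of the A'Hern tables cited in the references. The function names and the example inputs (a null rate of 0.30, targets of 0.50 and 0.40, one-sided α of 0.05, 80% power) are assumptions chosen for illustration; this is not the calculator's implementation.

```python
from math import comb


def binom_pmf(n, p):
    """Probabilities P(X = k) for k = 0..n, X ~ Binomial(n, p)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def exact_single_stage(p0, p1, alpha, power, n_max=500):
    """Smallest n (with cutoff r) such that 'reject H0 when X > r' has
    exact one-sided type I error <= alpha at p0 and power >= `power` at p1."""
    for n in range(1, n_max + 1):
        null_pmf = binom_pmf(n, p0)
        tail = 1.0
        for r in range(n + 1):
            tail -= null_pmf[r]  # tail is now P(X > r | p0)
            if tail <= alpha:
                break  # smallest cutoff that controls alpha for this n
        achieved = sum(binom_pmf(n, p1)[r + 1:])  # P(X > r | p1)
        if achieved >= power:
            return n, r
    return None  # no design found within n_max


# Hypothetical inputs: same null rate, a 20-point gap versus a 10-point gap.
for p0, p1 in [(0.30, 0.50), (0.30, 0.40)]:
    print(p0, p1, exact_single_stage(p0, p1, alpha=0.05, power=0.80))
```

Running both cases side by side shows how quickly the required N grows when the gap is halved. Because exact binomial power is not monotone in n, the search returns the first n that meets both constraints.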
One-stage vs. two-stage
Simon's two-stage design is the default for Phase II oncology single-arm trials. Patients are enrolled in two stages with a futility check at the end of stage 1: if too few responses are observed, the trial stops early for futility.
- •Simon's optimal minimizes expected sample size under $H_0$, so you stop early often when the drug doesn't work.
- •Simon's minimax minimizes the maximum total sample size, so you cap program exposure but stop early less often.
- •One-stage design is appropriate when the response rate is expected to be so high that a futility stop would never trigger, or when accrual is so fast that the interim analysis adds no time savings.
Note: Simon's designs fix the sample size and the decision rules in advance. They cannot adjust if the true response rate falls between $p_0$ and $p_1$. That's where adaptive sample size re-estimation (Section 4) earns its place.
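The quantities that the optimal and minimax criteria trade off can be written down directly. As a sketch in the usual Simon notation (an assumption here: stage-1 size $n_1$, stage-1 futility bound $r_1$, total size $n$, final bound $r$, with $X_1$ and $X_2$ the independent binomial response counts in stages 1 and 2 at true rate $p$):

```latex
\begin{aligned}
\mathrm{PET}(p) &= \Pr(X_1 \le r_1)
  = \sum_{x=0}^{r_1} \binom{n_1}{x} p^{x} (1-p)^{n_1 - x}, \\
E[N \mid p] &= n_1 + \bigl(1 - \mathrm{PET}(p)\bigr)\,(n - n_1), \\
\Pr(\text{reject } H_0 \mid p) &= \sum_{x=r_1+1}^{n_1} \binom{n_1}{x} p^{x} (1-p)^{n_1 - x}
  \,\Pr(X_2 > r - x), \qquad X_2 \sim \mathrm{Bin}(n - n_1,\, p).
\end{aligned}
```

Both designs require $\Pr(\text{reject } H_0 \mid p_0) \le \alpha$ and $\Pr(\text{reject } H_0 \mid p_1) \ge 1 - \beta$, where $p_0$ is the historical control rate and $p_1$ the target rate. The optimal design then minimizes $E[N \mid p_0]$, and the minimax design minimizes $n$.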
3. Setting the inputs
Every input the calculator asks for traces back to a defensible external source. The single most common reason a single-arm trial fails its regulatory review is that one of these inputs was set from a hopeful internal estimate rather than published evidence.
Historical control rate
Source from a synthesis of published studies in the same line of therapy and patient population, not a single trial. Regulators look for:
- •Multiple registrational trials reporting ORR in a comparable population, weighted toward the most recent (treatments have improved).
- •A modest upward adjustment if your eligibility criteria admit patients with a better baseline prognosis than the historical trials enrolled.
- •A conservative choice: when in doubt, pick the higher of competing estimates. Designing against a lower $p_0$ than the regulator believes makes the trial easier to pass and harder to defend.
Historical controls are not just an input estimate. Using $p_0$ from prior trials assumes the historical population is exchangeable with the population in your trial. That argument requires explicit attention to eligibility criteria, line of therapy, prior treatments, biomarker definitions, imaging schedule, response criteria version (e.g., RECIST 1.1 vs. iRECIST), central vs. investigator review, and calendar-time shifts in standard of care. Regulators will ask all of these. Document the alignment in the protocol.
Target response rate
$p_1$ is the smallest response rate above $p_0$ that justifies Phase III investment, not the response rate you hope to observe. A common mistake is to set $p_1$ to a biological ceiling (e.g., 60% for an enriched biomarker subset) when the commercial threshold for a Phase III investment is closer to 40%. Powering the trial for the higher number shrinks N, which leaves the trial under-powered for a true rate near 40% and at risk of stopping for futility on a clinically meaningful effect.
Type I and Type II error
The errors are asymmetric in a single-arm setting:
- •α (Type I error): regulators care more about this. A false positive in Phase II launches a $50–100M Phase III for a drug that doesn't work. A one-sided test is the standard.
- •β (Type II error): sponsors care more about this. A false negative kills a working drug. Power 1 − β is conventionally 80% or 90%, but Phase II is also the cheapest place to recover from a missed signal, for example through a confirmatory Phase IIb.
Interim analysis timing (two-stage designs)
For Simon's designs, stage 1 sample size determines how early you can stop for futility. Two common tradeoffs:
| Design | Minimizes | Use when |
|---|---|---|
| Optimal | $E[N \mid H_0]$ | You expect a high chance of futility; want to stop fast when it fails. |
| Minimax | max N | Cap on total program exposure matters more than expected duration. |
Estimand and analysis population
The sample size answers a question that depends on the analysis-population choice and the estimand. For ORR, the common pitfalls are:
- •ITT vs. response-evaluable. The intent-to-treat population includes patients with no post-baseline assessment (typically counted as non-responders). A response-evaluable population excludes them. The two populations can produce ORRs that differ by 5–10 percentage points. Pre-specify which is primary and which is sensitivity.
- •Missing or unevaluable scans. Define in advance how patients without a confirmatory scan are handled. “Missing = failure” is the conservative default and the one regulators expect for the primary analysis.
- •Confirmation rules. RECIST 1.1 requires confirmation of partial or complete response at a follow-up scan (typically ≥ 4 weeks). Reporting unconfirmed responses can materially inflate ORR in some settings; the primary endpoint should use confirmed responses.
- •Assessment timing. ORR is sensitive to scan cadence. A trial assessing at weeks 6, 12, and 24 will report a different ORR than one assessing at weeks 8 and 16. Align with the historical-benchmark trials, or document the deviation.
The ICH E9(R1) estimand framework is the right anchor for documenting these choices in the SAP.
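The analysis-population and confirmation choices above are easy to see with a small worked example. The sketch below uses an entirely hypothetical 40-patient cohort; the `Patient` fields and the `orr` helper are illustrative names, not part of any tool described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    evaluable: bool  # has at least one post-baseline assessment
    responder: bool  # best overall response is PR or CR
    confirmed: bool  # response confirmed at a follow-up scan


def orr(patients, population="itt", confirmed_only=True):
    """ORR under a chosen analysis population and confirmation rule.
    Unevaluable patients count as non-responders ('missing = failure')."""
    if population == "itt":
        denominator = list(patients)
    elif population == "evaluable":
        denominator = [p for p in patients if p.evaluable]
    else:
        raise ValueError("population must be 'itt' or 'evaluable'")
    responders = sum(
        1
        for p in denominator
        if p.evaluable and p.responder and (p.confirmed or not confirmed_only)
    )
    return responders / len(denominator)


# Hypothetical cohort: 12 confirmed responders, 2 unconfirmed responders,
# 22 evaluable non-responders, 4 patients with no post-baseline scan.
cohort = (
    [Patient(True, True, True)] * 12
    + [Patient(True, True, False)] * 2
    + [Patient(True, False, False)] * 22
    + [Patient(False, False, False)] * 4
)

print(f"ITT, confirmed:         {orr(cohort, 'itt', True):.3f}")        # 12/40
print(f"Evaluable, confirmed:   {orr(cohort, 'evaluable', True):.3f}")  # 12/36
print(f"Evaluable, unconfirmed: {orr(cohort, 'evaluable', False):.3f}") # 14/36
```

With these made-up counts the same data yield 30.0%, 33.3%, or 38.9% depending only on the pre-specified definitions, which is why the primary and sensitivity populations need to be fixed in the SAP before enrollment.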
4. Adaptive sample size re-estimation
Fixed-sample Simon designs assume $p_1$ is correct and lock the stage-1 boundary and the maximum sample size to that assumption. When the true response rate sits between $p_0$ and $p_1$ (a real effect, but smaller than planned), the trial often passes the integer futility check but reaches the fixed maximum N with too few responses to declare success. Power collapses against this in-between effect. Adaptive sample size re-estimation lets you extend N when interim data suggest a real but under-powered effect, recovering the lost power.
Two frameworks for re-estimation
- •Bayesian posterior predictive. A Beta-Binomial conjugate model combines a weakly informative prior with the interim data to produce a posterior on the response rate. The re-estimated sample size is whatever brings the posterior probability of success at the final analysis above a pre-specified threshold $\gamma$.
- •Conditional power. Given the observed interim response rate, what additional N would bring the conditional power back to the design target (e.g., 80%)? The promising-zone rule restricts re-estimation to interim results that suggest a real effect but under-powered evidence.
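The Bayesian posterior predictive idea can be sketched in a few lines. The version below is one plausible reading, not the calculator's implementation: it assumes a Beta prior with positive integer parameters (so the Beta tail can be computed through a binomial identity using only the standard library), a final success rule of the form "posterior probability that the response rate exceeds the null rate is above a threshold", and hypothetical names (`ppos`, `reestimate_total_n`) and inputs throughout.

```python
from math import comb, exp, lgamma


def beta_tail(a, b, p0):
    """P(p > p0) for p ~ Beta(a, b), a and b positive integers.
    Uses I_x(a, b) = P(Bin(a + b - 1, x) >= a)."""
    n = a + b - 1
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(a))


def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binom_pmf(y, m, a, b):
    """P(Y = y) responses among m future patients given p ~ Beta(a, b)."""
    return comb(m, y) * exp(log_beta(a + y, b + m - y) - log_beta(a, b))


def ppos(x, n1, m, p0, gamma, a=1, b=1):
    """Predictive probability that the final analysis succeeds, given
    x responses in n1 interim patients and m patients still to enroll."""
    a_post, b_post = a + x, b + n1 - x
    total = 0.0
    for y in range(m + 1):
        # Final-analysis rule: posterior P(p > p0 | all data) > gamma.
        if beta_tail(a_post + y, b_post + m - y, p0) > gamma:
            total += beta_binom_pmf(y, m, a_post, b_post)
    return total


def reestimate_total_n(x, n1, p0, gamma, target, n_max, a=1, b=1):
    """Smallest total N <= n_max whose predictive probability reaches
    `target`; None if no extension within the cap is enough."""
    for m in range(1, n_max - n1 + 1):
        if ppos(x, n1, m, p0, gamma, a, b) >= target:
            return n1 + m
    return None


# Hypothetical interim: 5 responses in 17 patients, null rate 0.20.
print(ppos(x=5, n1=17, m=20, p0=0.20, gamma=0.95))
print(reestimate_total_n(x=5, n1=17, p0=0.20, gamma=0.95, target=0.80, n_max=80))
```

The cap `n_max` matters as much as the rule itself, and the threshold still has to be calibrated by simulation or exact enumeration before it can be claimed to control Type I error.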
Type I error control under re-estimation
Re-estimation is not free. Naively extending the trial when interim results are mediocre inflates the Type I error rate, sometimes substantially. Recent work (Qian 2026) shows that for single-arm binary trials, the conditional-power promising-zone rule can produce Type I error rates uniformly above the nominal $\alpha$ across the entire grid of promising-zone thresholds. The Bayesian posterior-predictive rule with a well-calibrated $\gamma$ preserves Type I error. See the Single-Arm SSR technical reference for the calibration details and the in-product calibration helper.
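Why naive extension inflates Type I error can be shown by exact enumeration on a deliberately crude rule. The sketch below is not the conditional-power promising-zone rule studied in Qian 2026; it assumes a simpler "second bite" scheme in which a trial that narrowly misses its cutoff enrolls more patients and is re-tested at the same nominal level. All names and inputs are hypothetical.

```python
from math import comb


def binom_pmf(n, p):
    """Probabilities P(X = k) for k = 0..n, X ~ Binomial(n, p)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def critical_value(n, p0, alpha):
    """Smallest r with P(X > r | n, p0) <= alpha."""
    probs = binom_pmf(n, p0)
    tail = 1.0
    for r in range(n + 1):
        tail -= probs[r]  # tail is now P(X > r | p0)
        if tail <= alpha:
            return r
    return n


def naive_extension_alpha(n, m, p0, alpha, zone_width):
    """Exact type I error of (a) the fixed design and (b) a naive rule that
    adds m patients after a near miss and re-tests at the same alpha."""
    r_n = critical_value(n, p0, alpha)
    r_ext = critical_value(n + m, p0, alpha)
    first = binom_pmf(n, p0)
    extra = binom_pmf(m, p0)
    fixed = sum(first[r_n + 1:])
    inflated = fixed
    # Near misses: within zone_width responses at or below the cutoff.
    for x in range(max(0, r_n - zone_width + 1), r_n + 1):
        need = r_ext - x  # extension must add more than this many responses
        inflated += first[x] * sum(extra[max(need + 1, 0):])
    return fixed, inflated


# Hypothetical inputs: N = 37, extend by 20, null rate 0.20, alpha = 0.10.
fixed, inflated = naive_extension_alpha(n=37, m=20, p0=0.20, alpha=0.10, zone_width=3)
print(f"fixed design: {fixed:.4f}   with naive extension: {inflated:.4f}")
```

Every path that rejects only after the extension is pure added false-positive probability, so the second number is always at least the first. Any re-estimation rule needs this kind of accounting before its nominal level can be trusted.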
Bayesian decision rule vs. frequentist operating characteristics
These are two different things, and the regulatory expectation is to provide both:
- •Bayesian decision rule. The threshold $\gamma$ on the posterior probability that $p > p_0$, used at the final analysis to declare the trial a success. This is what your design uses to make decisions.
- •Frequentist operating-characteristic validation. The simulation-derived Type I error rate, power, and expected sample size under $p_0$, $p_1$, and intermediate values. This is what regulators review to assess whether the design controls error and delivers the advertised power.
FDA's January 2026 Bayesian draft guidance is clear that Bayesian methods can support design, adaptation, and inference, but the regulatory review still expects explicit prespecification, justification of the prior, and a panel of operating characteristics covering plausible scenarios. Reporting the Bayesian threshold without the frequentist OC validation is a common source of FDA questions on Bayesian single-arm designs.
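The distinction between the decision rule and its operating characteristics can be illustrated for the simplest case, a fixed-N design with a Bayesian final rule. The sketch below assumes a Beta prior with positive integer parameters, a hypothetical threshold of 0.95, a null rate of 0.20, and N = 37; it computes exact frequentist rejection probabilities rather than simulating them.

```python
from math import comb


def beta_tail(a, b, p0):
    """P(p > p0) for p ~ Beta(a, b), a and b positive integers.
    Uses I_x(a, b) = P(Bin(a + b - 1, x) >= a)."""
    n = a + b - 1
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(a))


def success_probability(n, p0, gamma, p_true, a=1, b=1):
    """Exact frequentist probability that the Bayesian rule
    'posterior P(p > p0 | data) > gamma' declares success at true rate p_true."""
    total = 0.0
    for k in range(n + 1):
        if beta_tail(a + k, b + n - k, p0) > gamma:
            total += comb(n, k) * p_true**k * (1 - p_true) ** (n - k)
    return total


# Hypothetical design: N = 37, null rate 0.20, threshold 0.95, uniform prior.
for p_true in (0.20, 0.30, 0.40):
    prob = success_probability(n=37, p0=0.20, gamma=0.95, p_true=p_true)
    print(f"true rate {p_true:.2f}: P(declare success) = {prob:.4f}")
```

The value at the null rate is the frequentist Type I error of the Bayesian rule, and the values at higher rates trace out its power curve. A panel like this, extended to the adaptive design and to a range of priors, is the kind of evidence the operating-characteristic validation is meant to supply.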
Regulatory acceptability
FDA's 2019 Adaptive Designs guidance treats sample size re-estimation as a well-understood adaptation, including for single-arm trials, provided three conditions are met:
- •The re-estimation rule and the maximum sample size are pre-specified in the protocol and SAP.
- •Type I error control is demonstrated by simulation across plausible scenarios.
- •The interim analysis is conducted by an independent statistical analysis center; the sponsor sees only the re-estimated sample size, not the interim response rate.
5. Reading the calculator output
Worked example
Suppose you set $p_0$, $p_1$, a one-sided $\alpha$, and 80% power. A Simon optimal two-stage design returns illustrative values in the neighborhood of:
| Symbol | Value | Meaning |
|---|---|---|
| $n_1$ | 17 | Patients enrolled in stage 1. |
| $r_1$ | 3 | Stop for futility if ≤ 3 responses in stage 1. |
| $n$ | 37 | Maximum total N (stage 1 + stage 2). |
| $r$ | 10 | Reject $H_0$ at the final analysis if > 10 responses out of 37. |
| $E[N \mid H_0]$ | ~26 | Expected total N if the drug doesn't work (often stops early at stage 1). |
These are illustrative values, not authoritative boundaries. Exact numbers vary by Simon optimal vs. minimax and by the implementation's rounding; run the Zetyra Single-Arm SSR calculator with your exact parameters to get the boundary table and operating characteristics you should design against.
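A boundary table like this can be checked independently by exact enumeration. The sketch below evaluates the tabulated boundaries (17 patients and a futility bound of 3 in stage 1, 37 patients and a final bound of 10 overall). The response rates are not shown in this section, so the null rate of 0.20 and target rate of 0.40 used here are assumptions for illustration only, and `simon_oc` is a hypothetical helper name.

```python
from math import comb


def binom_pmf(n, p):
    """Probabilities P(X = k) for k = 0..n, X ~ Binomial(n, p)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def simon_oc(n1, r1, n, r, p):
    """Exact operating characteristics of a two-stage design at true rate p.
    Stop for futility if X1 <= r1; otherwise reject H0 if X1 + X2 > r."""
    stage1 = binom_pmf(n1, p)
    stage2 = binom_pmf(n - n1, p)
    pet = sum(stage1[: r1 + 1])  # probability of early termination
    reject = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        need = r - x1  # stage 2 must contribute more than this
        reject += stage1[x1] * sum(stage2[max(need + 1, 0):])
    expected_n = n1 + (1 - pet) * (n - n1)
    return {"pet": pet, "expected_n": expected_n, "reject": reject}


# Assumed rates (not taken from the source): p0 = 0.20, p1 = 0.40.
under_h0 = simon_oc(n1=17, r1=3, n=37, r=10, p=0.20)
under_h1 = simon_oc(n1=17, r1=3, n=37, r=10, p=0.40)
print(f"P(stop early | p0) = {under_h0['pet']:.3f}")
print(f"E[N | p0]          = {under_h0['expected_n']:.1f}")
print(f"type I error       = {under_h0['reject']:.3f}")
print(f"power at p1        = {under_h1['reject']:.3f}")
```

Under the assumed null rate of 0.20 the expected sample size comes out near 26, in line with the table. The achieved error rates depend entirely on the assumed rates, so rerun the check with the inputs from your own design.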
What the headline number means
The calculator's headline output is the minimum total N to achieve the requested power at $p_1$. It is a probabilistic guarantee, not a deterministic one: across repeated trials with true response rate $p_1$, a trial of this size rejects $H_0$ in the requested fraction of runs. Any single trial can still miss.
The interim decision rule is the most consequential output
For two-stage designs, the boundary at the end of stage 1 is what your operations team needs to plan around. “Stop for futility if $r_1$ or fewer responses are observed in the first $n_1$ patients” sets the timeline, the contract milestones, and the patient enrollment ramp. Sponsors who treat this boundary as a statistical detail rather than an operational constraint consistently miss their stage-1 readout window.
Operating characteristics
Every Zetyra single-arm output includes a table reporting:
| Quantity | How to read it |
|---|---|
| α (simulated) | Probability of a false positive when $p = p_0$. Must be at or below the nominal level. |
| Power at $p_1$ | Probability of rejecting $H_0$ when $p = p_1$. Should match the target (e.g., 80%). |
| $E[N \mid H_0]$ | Expected total N if the drug doesn't work. The most useful operational metric for cost modeling. |
| $E[N \mid H_1]$ | Expected total N if the drug works. Usually close to the maximum N for fixed Simon designs; substantially lower for adaptive designs. |
Note: The reported α from a simulation-based design is approximate — it has its own Monte Carlo error. Trust simulations with at least 50,000 replicates; treat anything smaller as a sketch.
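The size of that Monte Carlo error follows from the binomial standard error of a simulated proportion. A minimal sketch, assuming independent replicates and a hypothetical simulated α of 0.05:

```python
from math import sqrt


def mc_standard_error(rate, n_reps):
    """Standard error of a simulated rejection rate from n_reps
    independent replicates (binomial approximation)."""
    return sqrt(rate * (1 - rate) / n_reps)


# Hypothetical simulated alpha of 0.05 at several replicate counts.
for n_reps in (1_000, 10_000, 50_000):
    se = mc_standard_error(0.05, n_reps)
    print(f"{n_reps:>6} replicates: SE = {se:.5f}, approx. 95% band = +/- {1.96 * se:.5f}")
```

At 50,000 replicates the standard error is about 0.001, so a reported 0.050 is compatible with a true rate anywhere from roughly 0.048 to 0.052. At 1,000 replicates the band is about seven times wider, which is too coarse to tell a controlled design from an inflated one.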
6. Common failure modes
- •Setting $p_0$ from a single historical trial. Regulators want a synthesis. Pick the wrong reference trial and the entire design rests on a single data point that may not be representative of contemporary standard of care.
- •Powering for a heroic $p_1$. Setting $p_1$ at the upper bound of what the drug could plausibly achieve, instead of the lower bound of what would justify Phase III investment. The result is an under-powered trial that stops for futility on effects that would have been commercially meaningful.
- •Skipping the futility stop. A one-stage design with no interim look saves no patients if the drug fails. Regulators will ask why you didn't protect patients from continued exposure to an ineffective agent.
- •Treating the calculator output as a protocol. The N, the boundaries, and the operating characteristics are the statistical input to a protocol. The protocol still needs the eligibility criteria, the response assessment schedule, the DMC charter, the analysis plan for responders vs. non-responders, and the regulatory precedent that justifies the design.
- •Relying on conditional-power promising-zone re-estimation in the binary single-arm setting. Recent work (Qian 2026) shows this rule inflates Type I error across the entire grid of promising-zone thresholds. Use the Bayesian posterior-predictive rule with a calibrated $\gamma$ instead, or document the inflation in your SAP and adjust the final significance level.
7. Adjacent topics
- •Randomization Schemes Guide — when a single-arm design isn't justified, what randomized alternatives look like and how to choose between them.
- •Single-Arm SSR Technical Reference — the full methodology, Type I error analysis, and calibration helper for adaptive single-arm trials.
- •Bayesian Borrowing — methods for incorporating historical control data into a single-arm design when concurrent randomization is infeasible.
- •Bayesian Predictive Power Guide — the Bayesian decision-rule framing used by the adaptive single-arm tools.
8. References
- Simon R. Optimal two-stage designs for phase II clinical trials. Controlled Clinical Trials. 1989;10(1):1–10.
- A'Hern RPA. Sample size tables for exact single-stage phase II designs. Statistics in Medicine. 2001;20(6):859–866.
- Lee JJ, Liu DD. A predictive probability design for phase II cancer clinical trials. Clinical Trials. 2008;5(2):93–106.
- Thall PF, Simon R. Practical Bayesian guidelines for phase IIB clinical trials. Biometrics. 1994;50(2):337–349.
- Berry SM, Carlin BP, Lee JJ, Müller P. Bayesian Adaptive Methods for Clinical Trials. CRC Press; 2010.
- U.S. Food and Drug Administration. Adaptive Designs for Clinical Trials of Drugs and Biologics: Guidance for Industry. November 2019.
- U.S. Food and Drug Administration. Use of Bayesian Methodology in Clinical Trials of Drug and Biological Products: Draft Guidance for Industry. January 12, 2026.
- Qian L. Conditional Power Promising Zone Sample Size Re-estimation Inflates Type I Error in Single-Arm Binary Trials: An Exact-Enumeration Study and Comparison with Bayesian Predictive Probability SSR. Zetyra | Evidence in the Wild; April 2026 (under peer review).
Last updated: May 2026