Sample size: logistic regression
Use this tool when you're modelling a binary outcome (yes/no, died/survived) with one or more predictors. We give you both the events-per-variable rule of thumb (Peduzzi 1996) and the Hsieh (1989) formula for a single primary predictor.
Count every coefficient your final model will estimate (including dummy variables for each level of a categorical variable beyond the reference).
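As a rough illustration of the counting rule, the sketch below tallies coefficients for a made-up predictor list. The variable names and level counts are hypothetical, and it assumes the intercept is not counted.

```python
# Hypothetical predictor list: name -> number of levels (None = continuous).
predictors = {
    "age": None,          # continuous: 1 coefficient
    "sex": 2,             # 2 levels: 1 dummy variable
    "smoking_status": 3,  # 3 levels: 2 dummy variables
    "asa_class": 4,       # 4 levels: 3 dummy variables
}


def count_coefficients(predictors):
    """Count model coefficients, excluding the intercept (an assumption)."""
    k = 0
    for levels in predictors.values():
        # A categorical variable contributes one dummy per level beyond
        # the reference level; a continuous variable contributes one term.
        k += 1 if levels is None else levels - 1
    return k


k = count_coefficients(predictors)
print(k)  # 1 + 1 + 2 + 3 = 7
```

Interaction terms and spline or polynomial terms, if planned, would add further coefficients that this sketch does not cover.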
How to justify this number
List the variables you genuinely plan to include in the final multivariable model — not every variable in your dataset. Pre-specify this in your protocol; post-hoc inflation defeats the purpose.
Overall event rate in your study population (e.g. 30-day mortality, complication rate).
How to justify this number
Use a recent local cohort, registry, or systematic review. The sample size is most sensitive to this number: the EPV requirement scales as 1/p, so halving the event rate doubles the required sample, and a rare outcome (< 10%) inflates it dramatically.
Events per variable (EPV): the conventional minimum is 10. Increase to 20 for sparse data or unstable models.
How to justify this number
Cite: Peduzzi P et al. J Clin Epidemiol 1996;49:1373–9. Use 20 for sparse outcomes or when calibration matters more than discrimination (van Smeden et al. 2019).
Hsieh formula (single primary predictor)
If you have one primary predictor of interest and want a more rigorous estimate, fill in the next two fields.
Odds ratio (OR) to detect. For a continuous predictor: the OR per 1-SD change. For a binary predictor: the OR for exposed vs unexposed.
How to justify this number
An OR of 1.3–1.5 is a small-to-modest effect; 2.0+ is large. Take the smallest OR your study still needs to detect, not the largest you might find.
R² between the primary predictor and the adjustment variables. Set this if your primary predictor is correlated with the adjustment variables. 0 = independent; 0.3 = mild collinearity; 0.5+ = strong.
What does this calculation actually do?
Events-per-variable rule (Peduzzi 1996): the minimum sample size is the smallest n that yields at least EPV events per estimated coefficient, where k is the number of coefficients and p is the overall event rate:
n_EPV = (k · EPV) / p
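A minimal Python sketch of the EPV calculation follows. The function name, the rounding up to a whole participant, and the small tolerance against floating-point error are assumptions of this example, not a description of the tool's internals.

```python
import math


def n_epv(k, p, epv=10):
    """Sample size so that the expected events n * p reach k * epv.

    k   : number of model coefficients
    p   : overall event rate, 0 < p <= 1
    epv : events per variable (10 by convention)
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    # Round up to a whole participant; the tiny tolerance keeps
    # floating-point noise from adding a spurious extra participant.
    return math.ceil(k * epv / p - 1e-9)


print(n_epv(k=7, p=0.10))          # 700
print(n_epv(k=7, p=0.05))          # 1400: halving the event rate doubles n
print(n_epv(k=7, p=0.10, epv=20))  # 1400
```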
Hsieh formula for a continuous primary predictor (Hsieh 1989), with the OR expressed per 1-SD change and the result inflated for collinearity by the variance inflation factor VIF = 1/(1 − R²):
n = (z₁₋α/₂ + z₁₋β)² / (p · (1 − p) · (ln OR)² · (1 − R²))
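The continuous-predictor formula can be sketched in Python as below. The defaults of two-sided α = 0.05 and 80% power, the function name, and the rounding up are assumptions of this example. It uses `statistics.NormalDist` from the standard library (Python 3.8+) for the normal quantiles.

```python
import math
from statistics import NormalDist


def n_hsieh_continuous(p, odds_ratio, r2=0.0, alpha=0.05, power=0.80):
    """Hsieh sample size for one continuous predictor.

    p          : overall event rate, 0 < p < 1
    odds_ratio : OR per 1-SD change in the predictor
    r2         : squared multiple correlation with the adjustment variables
    alpha      : two-sided type I error (assumed default 0.05)
    power      : 1 - beta (assumed default 0.80)
    """
    if not 0 < p < 1:
        raise ValueError("p must be in (0, 1)")
    if odds_ratio <= 0 or odds_ratio == 1:
        raise ValueError("odds_ratio must be positive and different from 1")
    if not 0 <= r2 < 1:
        raise ValueError("r2 must be in [0, 1)")

    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha / 2) + z(power)
    n_unadjusted = z_sum ** 2 / (p * (1 - p) * math.log(odds_ratio) ** 2)
    # Multiply by VIF = 1 / (1 - R^2) to account for collinearity.
    return math.ceil(n_unadjusted / (1 - r2))


print(n_hsieh_continuous(p=0.10, odds_ratio=1.5))
print(n_hsieh_continuous(p=0.10, odds_ratio=1.5, r2=0.3))
```

Under these assumptions, p = 0.10 and OR = 1.5 give 531 participants, rising to 758 when R² = 0.3.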
Hsieh formula for a binary primary predictor (Hsieh, Bloch & Larsen 1998):
n = (z₁₋α/₂ √(p̄(1−p̄)/B) + z₁₋β √(p₁(1−p₁) + p₂(1−p₂)·(1−B)/B))² / ((p₁ − p₂)² · (1 − B))
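where B is the proportion exposed, p₁ and p₂ are the event rates in the unexposed and exposed groups (derived from the overall event rate and OR), and p̄ = (1 − B)·p₁ + B·p₂ is the overall event rate. We report the larger of the EPV estimate and the Hsieh estimate.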
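The sketch below shows one way the binary-predictor calculation could be carried out. The text does not say how p₁ and p₂ are derived, so this example assumes they are the pair of rates whose weighted mean equals the overall event rate and whose odds ratio equals OR, which reduces to a quadratic in p₁. Function names and the defaults (two-sided α = 0.05, 80% power) are likewise assumptions.

```python
import math
from statistics import NormalDist


def split_event_rates(p, odds_ratio, b):
    """Solve (1 - b) * p1 + b * p2 = p with odds(p2) = odds_ratio * odds(p1).

    Returns (p1, p2): event rates in the unexposed and exposed groups.
    This derivation is an assumption of the sketch, not taken from the tool.
    """
    a = (1 - b) * (odds_ratio - 1)
    lin = (1 - b) + b * odds_ratio - p * (odds_ratio - 1)
    # Quadratic a * p1^2 + lin * p1 - p = 0; this root lies in (0, 1).
    p1 = (-lin + math.sqrt(lin ** 2 + 4 * a * p)) / (2 * a)
    p2 = odds_ratio * p1 / (1 + (odds_ratio - 1) * p1)
    return p1, p2


def n_hsieh_binary(p, odds_ratio, b, alpha=0.05, power=0.80):
    """Hsieh, Bloch & Larsen sample size for one binary predictor."""
    if not 0 < p < 1 or not 0 < b < 1:
        raise ValueError("p and b must be in (0, 1)")
    if odds_ratio <= 0 or odds_ratio == 1:
        raise ValueError("odds_ratio must be positive and different from 1")

    p1, p2 = split_event_rates(p, odds_ratio, b)
    z = NormalDist().inv_cdf
    term_alpha = z(1 - alpha / 2) * math.sqrt(p * (1 - p) / b)
    term_beta = z(power) * math.sqrt(
        p1 * (1 - p1) + p2 * (1 - p2) * (1 - b) / b
    )
    n = (term_alpha + term_beta) ** 2 / ((p1 - p2) ** 2 * (1 - b))
    return math.ceil(n)


n_binary = n_hsieh_binary(p=0.10, odds_ratio=2.0, b=0.5)
n_from_epv = 700  # illustrative EPV estimate: k = 7, EPV = 10, p = 0.10
print(max(n_from_epv, n_binary))  # report the larger of the two estimates
```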
References: Peduzzi P et al. J Clin Epidemiol 1996;49:1373–9. · Hsieh FY. Statist Med 1989;8:795–802. · Hsieh FY, Bloch DA, Larsen MD. Statist Med 1998;17:1623–34.