Medical Statistics: What Every Doctor and Student Needs to Know
From p-values to NNT, confidence intervals to machine learning — a practical guide to biostatistics for clinicians
Introduction: Why Statistics Matter to Every Clinician
You may not think of yourself as a statistician. But every time you order a diagnostic test, prescribe a medication, or interpret a clinical trial, you are making a statistical decision. Is this result real or random? How confident am I in this finding? What is the magnitude of benefit for my patient?
The problem is that many clinicians received minimal statistical training. Medical school emphasizes clinical reasoning and pattern recognition but often relegates biostatistics to a few lectures. This gap has real consequences: physicians misinterpret p-values, overestimate effect sizes, chase spurious associations, and make decisions that don't reflect the evidence.
This article is not a math textbook. You will not need calculus. Instead, I aim to give you an intuitive understanding of the core statistical concepts that underpin modern medicine. With this foundation, you can read a journal article critically, ask the right questions, and distinguish signal from noise.
Part I: A Brief History of Biostatistics
The key lesson from history: statistics was developed by practical people solving real problems. Nightingale wanted to save lives. Gosset wanted to improve beer quality. Fisher wanted to design better agricultural experiments. The math serves the question, not the other way around.
Part II: Core Concepts—The Foundation
Descriptive Statistics: Summarizing Data
Before we can make inferences, we must describe what we observe. Descriptive statistics reduce a dataset to its essential features.
Measures of Central Tendency
- Mean (average): Sum all values and divide by the count. Sensitive to outliers. If you earn $30k and a billionaire sits next to you, your "average" wealth is $500 million.
- Median (middle value): Arrange values in order; the middle value is the median. Robust to outliers. Half the people earn above it, half below. Often more informative than the mean for skewed distributions (like income).
- Mode (most common value): The value that appears most frequently. Useful for categorical data (e.g., which antidepressant is prescribed most often?).
Measures of Variability
- Range: Maximum minus minimum. Easy to calculate but misleading. Two datasets can have identical ranges but very different distributions.
- Standard Deviation (SD): How spread out data are around the mean. High SD means values are scattered; low SD means they cluster around the mean. In a normal distribution, 68% of values fall within one SD of the mean, 95% within two SDs.
- Variance: SD squared. Used in calculations but harder to interpret directly.
- Interquartile Range (IQR): The range containing the middle 50% of values. Robust to outliers. Preferred for skewed data.
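These summaries are all one-liners in practice. Here is a minimal sketch using Python's standard library, with an invented sample of systolic blood pressures chosen to include one outlier:

```python
import statistics

# Hypothetical systolic blood pressures (mm Hg), with one outlier (190)
bp = [118, 122, 125, 130, 135, 140, 190]

mean = statistics.mean(bp)      # pulled upward by the outlier
median = statistics.median(bp)  # robust: the middle value
sd = statistics.stdev(bp)       # sample standard deviation
q1, _, q3 = statistics.quantiles(bp, n=4)  # quartile cut points
iqr = q3 - q1                   # spread of the middle 50%

# Mode is most useful for categorical data
mode_example = statistics.mode(["sertraline", "fluoxetine", "sertraline"])

print(f"mean={mean:.1f}  median={median}  sd={sd:.1f}  IQR={iqr}")
```

Note how the single outlier pulls the mean (≈137) above the median (130), exactly the skew effect described above.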
Probability Distributions
Data rarely fall into arbitrary categories. Instead, they follow patterns called distributions. Understanding these patterns is key to inference.
The Normal (Gaussian) Distribution
Many natural phenomena—height, IQ, cholesterol—follow a bell curve. The normal distribution is defined by its mean and SD. It is symmetric, with tails that extend infinitely (though most values cluster near the center). Many statistical tests assume normality, which is why it matters.
Other Important Distributions
- Binomial distribution: For binary outcomes (success/failure, yes/no). Useful for counting events (e.g., how many patients respond to treatment out of 100).
- Poisson distribution: For rare events occurring over time. Useful for counting adverse events in a large population.
- t-distribution: Similar to normal but with heavier tails. Used when sample sizes are small or when we don't know the true SD.
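For intuition, the binomial distribution from the first bullet can be written out directly from its formula (the numbers below are illustrative):

```python
import math

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n trials with per-trial probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability that exactly 55 of 100 patients respond if the true
# response rate is 50% — any single exact count is fairly unlikely
print(f"{binom_pmf(55, 100, 0.5):.3f}")

# Probabilities over all possible counts must sum to 1
total = sum(binom_pmf(k, 100, 0.5) for k in range(101))
print(f"{total:.6f}")
```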
Probability and Risk
Probability is the foundation of inference. On the frequentist view, it is the long-run frequency of an event: if I flip a fair coin many times, the proportion of heads approaches 0.5.
In medicine, we often talk about risk: the probability of an event (disease, death, recovery) in a defined population over a period of time. For example, if 200 people in a population of 10,000 develop depression over one year:
Risk = 200 / 10,000 = 0.02 = 2% per year
The complement of risk is useful too: if 2% develop depression, 98% do not.
Part III: Hypothesis Testing and P-Values
The Framework: Null and Alternative Hypotheses
Scientific research revolves around hypothesis testing. We propose a hypothesis, then ask: do the data support it or contradict it?
The null hypothesis (H₀) is the default assumption: there is no effect, no difference, no relationship. For example: "Sertraline is no better than placebo for depression."
The alternative hypothesis (H₁) is what we are testing: "Sertraline is better than placebo."
We collect data, perform a statistical test, and calculate a p-value. This is where confusion often begins.
Understanding P-Values: What They Are (and Are Not)
The p-value is not the probability that the null hypothesis is true. This is the most common misinterpretation.
The p-value is: the probability of observing data this extreme (or more extreme) if the null hypothesis were true.
Example: You conduct a trial comparing sertraline to placebo. Sertraline patients have a 55% response rate; placebo patients have a 50% response rate. You calculate a p-value of 0.06.
This means: If sertraline were truly no better than placebo, there would be a 6% chance of observing a difference this large (or larger) due to random variation.
It does NOT mean: there is a 6% chance that sertraline is ineffective, or a 94% chance it works.
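The definition can be made concrete by simulation: assume the null is true (both arms truly have a 50% response rate) and count how often chance alone produces a gap of 5 percentage points or more. The arm size of 700 is my assumption, chosen so the answer lands near the article's p = 0.06; it is not from any real trial.

```python
import random

random.seed(1)
n_per_arm, observed_diff, sims = 700, 0.05, 4_000

extreme = 0
for _ in range(sims):
    # Simulate both arms under the null: true response rate 0.5 in each
    drug = sum(random.random() < 0.5 for _ in range(n_per_arm))
    placebo = sum(random.random() < 0.5 for _ in range(n_per_arm))
    if abs(drug - placebo) / n_per_arm >= observed_diff:
        extreme += 1

p_value = extreme / sims  # fraction of "null worlds" at least this extreme
print(f"simulated two-sided p ≈ {p_value:.3f}")
```

The simulated value hovers around 0.06: in a world where the drug does nothing, a 5-point gap still appears by chance about 6% of the time.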
The Problem with P-Values: P-Hacking and Multiple Testing
If you run enough statistical tests, you will eventually find a "significant" result by chance alone. This is the multiple comparisons problem; deliberately exploiting it — trying analyses until one turns up significant — is called p-hacking.
This inflates the false positive rate and contributes to the replication crisis. Many high-profile studies that failed to replicate likely fell victim to p-hacking.
Solutions include: pre-registering your analysis plan (committing to which tests you will run before seeing the data), using stricter p-value thresholds when multiple tests are involved, and focusing on effect sizes and confidence intervals rather than p-values alone.
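A quick simulation shows why multiplicity matters. Each simulated "study" below runs 20 tests in which the null is always true (so each p-value is uniform on [0, 1]); the expected family-wise false-positive rate is 1 − 0.95²⁰ ≈ 64%:

```python
import random

random.seed(0)
studies, tests_per_study, alpha = 5_000, 20, 0.05

false_alarm = 0
for _ in range(studies):
    # Under the null, each test's p-value is uniform on [0, 1]
    pvals = [random.random() for _ in range(tests_per_study)]
    if min(pvals) < alpha:  # did any test come out "significant"?
        false_alarm += 1

fwer = false_alarm / studies
print(f"P(at least one false positive) ≈ {fwer:.2f}")

# Bonferroni correction: require p < alpha / 20 for each test instead
bonf = sum(
    min(random.random() for _ in range(tests_per_study)) < alpha / tests_per_study
    for _ in range(studies)
) / studies
print(f"with Bonferroni correction ≈ {bonf:.2f}")
```

The Bonferroni-corrected rate falls back to roughly 5%, at the cost of power for any single test.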
Part IV: Confidence Intervals and Effect Sizes
Confidence Intervals: Estimating the True Effect
A p-value tells you whether a result is unlikely under the null. But what is the actual magnitude of the effect? This is where confidence intervals (CIs) shine.
A 95% confidence interval is a range that, if the study were repeated many times, would contain the true effect 95% of the time. It provides a best estimate plus a margin of error.
Example: a trial finds a 60% response rate with a 95% CI of 52%–68%.
This means: Our best estimate of the true response rate is 60%. We are 95% confident the true rate falls between 52% and 68%.
Confidence intervals reveal what p-values hide. Two studies can have identical p-values but very different CIs. Study A: Effect of 10% (95% CI: 8%–12%). Study B: Effect of 10% (95% CI: 1%–19%). Both are "significant," but Study A's effect is estimated much more precisely.
If a CI includes the null value (zero for a difference, 1 for a ratio such as RR or OR), the result is not statistically significant at the 0.05 level.
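As a sketch, the 60% response rate with a 52%–68% interval quoted above is roughly what a standard Wald (normal-approximation) interval gives for 90 responders out of 150 patients — the sample size here is my assumption, back-derived for illustration:

```python
import math

responders, n = 90, 150
p_hat = responders / n                          # 0.60
se = math.sqrt(p_hat * (1 - p_hat) / n)         # standard error of a proportion
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se   # 95% Wald interval
print(f"response rate {p_hat:.0%}, 95% CI {lo:.0%} to {hi:.0%}")
```

Quadrupling the sample size would halve the standard error and tighten the interval accordingly.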
Effect Sizes: How Big Is the Difference?
Statistical significance is not the same as clinical significance. In a study of 100,000 patients, a 1 mm Hg reduction in blood pressure can be highly statistically significant yet clinically trivial. Conversely, a 30 mm Hg reduction in a study of 30 patients may be non-significant purely because the sample is small, yet clinically huge.
Effect size measures the magnitude of a difference, independent of sample size. Common metrics include:
- Absolute Risk Reduction (ARR): The difference in event rates between groups. If 60% of treated patients recover and 50% of control patients recover, ARR = 10 percentage points.
- Relative Risk (RR): The ratio of risks. RR = 0.60 / 0.50 = 1.2. The treated group has 1.2 times the probability of recovery (a 20% higher chance in relative terms).
- Cohen's d: Standardized difference in means. d = 0.2 is small, 0.5 is medium, 0.8 is large. Useful for comparing studies with different measurement scales.
- Odds Ratio (OR): Discussed in detail below, but briefly: the odds of an event in one group divided by the odds in another.
Always report effect sizes, not just p-values. A p-value of 0.001 with a tiny effect size is less informative than a p-value of 0.08 with a large, clinically meaningful effect.
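These metrics are one-liners. Below, the 60% vs 50% recovery example from the bullets above, plus Cohen's d for two hypothetical group means (the means and SDs are invented for illustration):

```python
import math

p_treat, p_ctrl = 0.60, 0.50
arr = p_treat - p_ctrl                    # absolute risk reduction: 0.10
rr = p_treat / p_ctrl                     # relative risk: 1.2
odds_ratio = (p_treat / (1 - p_treat)) / (p_ctrl / (1 - p_ctrl))  # 1.5

# Cohen's d: standardized mean difference (illustrative rating-scale scores)
mean_t, mean_c, sd_t, sd_c = 14.0, 10.0, 8.0, 8.0
pooled_sd = math.sqrt((sd_t**2 + sd_c**2) / 2)
d = (mean_t - mean_c) / pooled_sd         # 0.5 → "medium" effect

print(f"ARR={arr:.0%}  RR={rr:.2f}  OR={odds_ratio:.2f}  d={d:.2f}")
```

Notice that the OR (1.5) already looks larger than the RR (1.2) because recovery is a common outcome — a divergence worth keeping in mind.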
Part V: Types of Error, Power, and Sample Size
Type I and Type II Errors
Statistical tests can fail in two ways:
- Type I Error (False Positive): Rejecting the null when it is true. You conclude there is an effect when there isn't one. Probability = alpha (α), conventionally 0.05.
- Type II Error (False Negative): Failing to reject the null when it is false. You conclude there is no effect when one exists. Probability = beta (β), often 0.10–0.20.
Type I errors are controlled by the p-value threshold (α). Type II errors are controlled by ensuring adequate statistical power.
Statistical Power
Power is the probability of detecting an effect if one truly exists. It is 1 minus beta (β). A study with 80% power has a 20% chance of missing a true effect.
Power depends on:
- Sample size: Larger studies have more power. Doubling sample size increases power substantially.
- Effect size: Larger effects are easier to detect. If the true effect is 50% better than the alternative, power is higher than if the effect is 5% better.
- Alpha (significance level): Stricter alpha (e.g., 0.01 vs 0.05) reduces power.
Underpowered studies are common and problematic. If a study has 50% power and the result is non-significant, you have learned almost nothing. There may be a true effect, but you lacked the power to detect it. Conversely, large studies with small effects are often overpowered (detecting clinically trivial differences).
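Power can be estimated by simulation: generate many hypothetical trials under an assumed true effect (here 60% vs 50% response, a simple two-proportion z-test, arm sizes chosen for illustration) and count how often the result reaches p < 0.05:

```python
import math
import random

def trial_significant(n_per_arm: int, p1: float = 0.60, p0: float = 0.50) -> bool:
    """Simulate one trial; True if a two-proportion z-test gives p < 0.05."""
    a = sum(random.random() < p1 for _ in range(n_per_arm))
    b = sum(random.random() < p0 for _ in range(n_per_arm))
    p_pool = (a + b) / (2 * n_per_arm)
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n_per_arm)
    return se > 0 and abs(a - b) / n_per_arm / se > 1.96

random.seed(42)
powers = {}
for n in (100, 400):
    powers[n] = sum(trial_significant(n) for _ in range(2_000)) / 2_000
    print(f"n={n} per arm → simulated power ≈ {powers[n]:.2f}")
```

With 100 patients per arm the trial detects this true 10-point effect only around 30% of the time; quadrupling enrollment lifts power to roughly 80%, the conventional target.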
Part VI: Practical Metrics for Clinicians
Number Needed to Treat (NNT) and Number Needed to Harm (NNH)
These are the most clinically useful statistics. They translate group-level data into individual-level meaning.
NNT: The number of patients you must treat to prevent one bad outcome (or achieve one good outcome).
- Response rate on drug: 60%
- Response rate on placebo: 50%
- Absolute Risk Reduction (ARR): 60% – 50% = 10%
- NNT = 1 / ARR = 1 / 0.10 = 10
Interpretation: You must treat 10 patients with this antidepressant to achieve one additional response compared to placebo.
NNH: The number of patients you must treat to harm one person with a side effect. Example: 5% of treated patients develop metabolic syndrome versus 1% on placebo.
- Absolute Excess Risk: 5% – 1% = 4%
- NNH = 1 / 0.04 = 25
Interpretation: For every 25 patients treated, one develops metabolic syndrome due to the drug (above background risk).
NNT and NNH allow you to weigh benefits against harms. An NNT of 10 is excellent; an NNT of 100 means the drug must be given to 100 patients to help one, which may not be worth it. An NNH of 25 for a serious side effect is concerning. Context matters: Would you accept an NNT of 100 if the alternative is death? Probably yes. Would you accept it for mild insomnia? Probably not.
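The arithmetic above fits in two reusable lines (rates taken from the examples in this section):

```python
def number_needed(rate_treated: float, rate_control: float) -> float:
    """NNT (or NNH) = 1 / absolute difference in event rates."""
    return 1 / abs(rate_treated - rate_control)

print(f"NNT = {number_needed(0.60, 0.50):.0f}")  # benefit: 10
print(f"NNH = {number_needed(0.05, 0.01):.0f}")  # harm: 25
```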
Relative Risk and Odds Ratio
Relative Risk (RR): The ratio of the probability of an outcome in one group to the probability in another.
Risk of depression in smokers: 20%
Risk of depression in non-smokers: 10%
RR = 0.20 / 0.10 = 2.0
Interpretation: Smokers have twice the risk of depression as non-smokers.
Odds Ratio (OR): The ratio of odds in one group to odds in another. (Odds are the ratio of the probability an event happens to the probability it doesn't.)
Odds of depression in smokers = 0.20 / 0.80 = 0.25
Odds of depression in non-smokers = 0.10 / 0.90 = 0.11
OR = 0.25 / 0.11 = 2.27
When the outcome is rare, OR approximates RR. When the outcome is common, they diverge.
RR is more intuitive for clinicians. An RR of 2 clearly means "twice the risk." ORs are often used in case-control studies and logistic regression but are frequently misinterpreted as RRs (which inflates the apparent effect).
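The rare-outcome convergence is easy to check numerically — the same relative risk of 2.0 yields very different odds ratios (the risks below are illustrative):

```python
def rr_and_or(risk_exposed: float, risk_unexposed: float) -> tuple[float, float]:
    """Relative risk and odds ratio from two group risks."""
    rr = risk_exposed / risk_unexposed
    odds_ratio = (risk_exposed / (1 - risk_exposed)) / (
        risk_unexposed / (1 - risk_unexposed)
    )
    return rr, odds_ratio

rare = rr_and_or(0.02, 0.01)    # rare outcome: RR 2.0, OR ≈ 2.02 — nearly equal
common = rr_and_or(0.40, 0.20)  # common outcome: RR 2.0, OR ≈ 2.67 — OR overstates
print(rare, common)
```

Reading the common-outcome OR of 2.67 as "2.67 times the risk" would exaggerate a true doubling of risk.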
Hazard Ratio
Hazard Ratio (HR): Similar to RR but used in survival analysis (time-to-event data). It represents the relative rate of the event occurring over time.
HR > 1: The treatment group experiences the event sooner or more often. HR < 1: The treatment group is protected. HR = 1: No difference.
Interpretation: HR = 0.8 means the treated group experiences the event at 80% of the comparison group's rate (a 20% lower hazard) at any given point in time.
Sensitivity, Specificity, and Predictive Values
When evaluating a diagnostic test, we care about how well it correctly identifies disease (and non-disease).
- Sensitivity: Proportion of people with disease who test positive. High sensitivity means few false negatives (you don't miss cases), so a negative result on a highly sensitive test helps "rule out" disease (SnNout).
- Specificity: Proportion of people without disease who test negative. High specificity means few false positives (you don't over-diagnose), so a positive result on a highly specific test helps "rule in" disease (SpPin).
- Positive Predictive Value (PPV): Probability that a person with a positive test actually has the disease. Depends on the prevalence of disease in your population.
- Negative Predictive Value (NPV): Probability that a person with a negative test does not have the disease.
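Sensitivity and specificity are properties of the test; PPV is not. A short calculation shows how the same hypothetical test (90% sensitive, 95% specific) collapses in a low-prevalence setting:

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """P(disease | positive test), via Bayes' rule on population fractions."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same test characteristics, two very different clinical populations
print(f"PPV at 20% prevalence: {ppv(0.90, 0.95, 0.20):.0%}")  # ≈ 82%
print(f"PPV at  1% prevalence: {ppv(0.90, 0.95, 0.01):.0%}")  # ≈ 15%
```

In the screening-type setting, roughly five of every six positive results are false positives, even with a seemingly excellent test.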
ROC Curves and AUC
A Receiver Operating Characteristic (ROC) curve plots sensitivity against (1 – specificity) as you vary the diagnostic threshold. The Area Under the Curve (AUC) summarizes the test's performance. AUC = 0.5 means the test is no better than a coin flip. AUC = 1.0 is perfect discrimination. AUC = 0.7–0.8 is generally considered "fair"; 0.8–0.9 is "good."
ROC curves are useful for comparing tests or choosing a diagnostic threshold that balances sensitivity and specificity for your clinical context.
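A useful equivalent definition: the AUC equals the probability that a randomly chosen diseased patient scores higher on the test than a randomly chosen healthy one (ties counting half). That makes it computable without drawing the curve — the scores below are invented:

```python
diseased_scores = [0.9, 0.8, 0.7, 0.6]
healthy_scores = [0.7, 0.5, 0.4, 0.2]

# Compare every diseased/healthy pair; a tie counts as half a "win"
wins = sum(
    1.0 if d > h else 0.5 if d == h else 0.0
    for d in diseased_scores
    for h in healthy_scores
)
auc = wins / (len(diseased_scores) * len(healthy_scores))
print(f"AUC = {auc:.3f}")
```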
Part VII: Common Statistical Tests and When to Use Them
| Question | Data Type | Test | Assumptions |
|---|---|---|---|
| Is the mean of one group different from another? | Continuous (normal) | Independent samples t-test | Normal distribution, equal variances |
| Same as above, but data are non-normal or the sample is small? | Continuous (non-normal) | Mann-Whitney U test (non-parametric) | No normality assumption |
| Are means different across 3+ groups? | Continuous (normal) | ANOVA (Analysis of Variance) | Normal distribution, equal variances |
| Are categorical variables associated? | Categorical (counts) | Chi-square test | Expected counts > 5 in each cell |
| Is there a linear relationship between two variables? | Continuous | Pearson correlation or linear regression | Linear relationship, normally distributed residuals |
| Predict a binary outcome (yes/no)? | Binary outcome, multiple predictors | Logistic regression | Large sample size, no perfect separation |
| Predict a continuous outcome? | Continuous outcome, multiple predictors | Linear regression | Linear relationship, normally distributed residuals |
| Compare time-to-event between groups? | Time-to-event (survival) | Kaplan-Meier curves, Cox regression | Independent observations, proportional hazards |
The key is matching your question to the right test. Many errors arise from using an inappropriate test or violating its assumptions. When in doubt, consult a statistician or your study's pre-registered analysis plan.
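As one worked row of the table, here is the Welch two-sample t-statistic computed from scratch on invented data (in practice you would reach for a library such as `scipy.stats.ttest_ind(..., equal_var=False)`, which also returns the p-value):

```python
import math
import statistics

group_a = [5.1, 4.8, 6.0, 5.5, 5.9, 4.7]   # e.g. symptom scores, arm A
group_b = [4.2, 4.5, 3.9, 4.8, 4.1, 4.4]   # arm B

mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)

# Welch form: allows unequal variances between groups
se = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
t = (mean_a - mean_b) / se
print(f"t = {t:.2f}")
```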
Part VIII: Common Misinterpretations and Pitfalls
Correlation ≠ Causation
Two variables can be correlated without one causing the other. Ice cream sales and drowning deaths are correlated (both increase in summer), but ice cream doesn't cause drowning. Confounding variables (warm weather) drive both.
Randomized trials break confounding by random assignment. Observational studies cannot, no matter how large. When reading an observational study, always ask: Could a confounding variable explain this association?
Absence of Evidence ≠ Evidence of Absence
A non-significant p-value does not mean there is no effect. It means you failed to detect one (perhaps due to low power). Always examine the confidence interval. If it is wide and crosses the null, the study is inconclusive, not negative.
Statistical Significance ≠ Clinical Significance
A large study can find a statistically significant effect that is trivially small. Always examine absolute effect sizes and ask: Would this change my management of patients?
Publication Bias
Studies with positive results are more likely to be published than negative ones. This skews the literature. If you read 10 published trials of a drug, all positive, be skeptical. There may be 20 unpublished negative trials in a file drawer.
P-Hacking and Multiple Comparisons
If you test enough hypotheses, one will be "significant" by chance. Always ask: Was this the pre-specified primary analysis, or a secondary/exploratory finding? Were multiple tests run, and if so, were they corrected for multiple comparisons?
Part IX: Statistics in the Age of AI and Machine Learning
Why AI Changes (and Doesn't Change) Statistical Thinking
Machine learning and artificial intelligence are revolutionizing medicine. Algorithms predict patient outcomes, analyze medical images, and identify drug targets. But AI does not eliminate the need for statistical literacy—it makes it more critical.
Machine Learning Paradigms
Supervised Learning: The algorithm learns from labeled data (e.g., images labeled as "cancer" or "benign"). It finds patterns that predict the label. Examples include random forests, neural networks, and support vector machines.
Unsupervised Learning: The algorithm finds structure in unlabeled data (e.g., clustering patients into subtypes). No ground truth is provided; the algorithm discovers patterns.
Reinforcement Learning: The algorithm learns through trial and error, optimizing for a reward signal. Used in adaptive clinical trials and treatment optimization.
How Machine Learning Differs from Classical Statistics
- Classical statistics: Start with a hypothesis, collect data, test the hypothesis. Inference is the goal (understanding why).
- Machine learning: Collect data, find patterns, make predictions. Prediction is the goal; interpretability is secondary.
- Assumption burden: Classical statistics assumes data follow known distributions. Machine learning makes fewer assumptions but requires more data.
- Overfitting: Machine learning models can memorize noise in training data and fail on new data. Classical statistics is more robust to small sample sizes but less flexible.
Common Pitfalls in Machine Learning Medicine
Data Leakage: Information from the test set "leaks" into training, inflating performance estimates. A model that looks 95% accurate on test data may perform poorly on new patients.
Selection Bias: If the training data is not representative (e.g., from a single hospital), the model may not generalize. A model trained on a wealthy hospital's data may perform poorly in resource-limited settings.
Black Box Problem: Complex neural networks can predict but cannot explain. This is problematic in medicine, where clinicians need to understand why a model recommends an action.
Class Imbalance: If disease is rare (e.g., 1% prevalence), a naive model that predicts "no disease" for everyone is 99% accurate but useless. Specialized techniques are needed.
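The imbalance trap fits in a back-of-envelope calculation (numbers illustrative):

```python
n_patients, prevalence = 10_000, 0.01
cases = int(n_patients * prevalence)           # 100 true cases

# A degenerate "model" that predicts "no disease" for everyone:
accuracy = (n_patients - cases) / n_patients   # 99% — looks impressive
sensitivity = 0 / cases                        # yet it misses every case
print(f"accuracy = {accuracy:.0%}, sensitivity = {sensitivity:.0%}")
```

This is why imbalanced problems are evaluated with sensitivity, precision, or AUC rather than raw accuracy.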
Natural Language Processing and Clinical Notes
Large language models (LLMs) can extract insights from unstructured clinical notes, identifying patterns in free text that would take humans weeks to manually code. However, the same issues apply: garbage in, garbage out. Models trained on biased data will perpetuate bias.
Bayesian Methods and Adaptive Trials
Classical hypothesis testing requires a fixed sample size determined a priori. Bayesian methods incorporate prior knowledge and update beliefs as data accumulate. This enables adaptive trials that can be stopped early for efficacy or futility, saving time and resources.
Bayesian approaches ask: "Given my prior belief and the new data, what is the posterior probability that my hypothesis is true?" This is more intuitive than the frequentist p-value but requires specifying priors, which can be contentious.
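A minimal sketch of this updating step uses the conjugate Beta-Binomial model; the Beta(2, 2) prior (weakly centered on a 50% response rate) and the trial numbers are assumptions for illustration only:

```python
# Prior: response rate ~ Beta(a, b). After s responders in n patients,
# the posterior is Beta(a + s, b + n - s) — conjugacy makes the
# "update beliefs as data accumulate" step a one-liner.
a_prior, b_prior = 2, 2          # weak prior centered on 0.5
s, n = 33, 50                    # observed: 33 responders of 50

a_post = a_prior + s
b_post = b_prior + (n - s)
posterior_mean = a_post / (a_post + b_post)
print(f"posterior mean response rate ≈ {posterior_mean:.2f}")
```

Because the posterior is a full distribution, an adaptive trial can recompute quantities like P(response rate > 0.5) at each interim look and stop early if the evidence is decisive.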
The Future: Why Statistical Thinking Is More Important Than Ever
In the AI era, statistical literacy is not becoming obsolete—it is becoming essential. Here is why:
- Critical evaluation of algorithms: You must ask: Was this model validated on independent data? Could it have overfit? Is it biased? Answering these requires statistical thinking.
- Integration of AI with human judgment: AI provides predictions; clinical judgment decides what to do with them. Understanding Type I and II errors, sensitivity and specificity, helps you calibrate trust in algorithms.
- Detecting AI hallucinations: Large language models generate plausible-sounding but false information. Statistical reasoning helps you spot inconsistencies and demand evidence.
- Avoiding the replication crisis 2.0: As researchers churn out machine learning models, many will fail to replicate or will work only in narrow domains. Pre-registration, external validation, and transparent reporting—all rooted in statistical thinking—will separate signal from noise.
Key Takeaway for the AI Era
Machine learning is a powerful tool, but it is not a replacement for thinking. The clinician who understands statistics—who questions assumptions, demands evidence, and weighs uncertainty—will navigate the AI revolution successfully. The clinician who blindly trusts algorithms without understanding their limitations will eventually be surprised by their failures.
Part X: Practical Guide—How to Appraise a Study's Statistics
When you read a journal article, here is a checklist for evaluating the statistical methods and results:
Study Design and Sample Size
- Was the sample size calculated a priori? Is it adequate?
- Was power analysis performed? (Expected answer: yes, 80–90% power.)
- Are the inclusion/exclusion criteria clearly stated?
- How many participants were enrolled? How many completed? Why did others drop out?
Primary Outcome
- Is the primary outcome clearly defined?
- Was it registered a priori (on ClinicalTrials.gov)? Or did it change during the study?
- Is it biologically meaningful, not just statistically significant?
Statistical Methods
- Are the statistical tests appropriate for the data?
- Were assumptions checked (e.g., normality, equal variances)?
- Were multiple comparisons corrected?
- Was the analysis per-protocol or intention-to-treat? (Intention-to-treat is preferred.)
Results
- Are effect sizes reported, not just p-values?
- Are confidence intervals provided?
- If p > 0.05, is the CI wide (inconclusive) or narrow around the null (consistent with no meaningful effect)?
- Are baseline characteristics balanced between groups?
Interpretation
- Do the authors' conclusions match the data?
- Do they acknowledge limitations?
- Is there evidence of outcome switching or selective reporting?
- How does this finding fit the broader literature?
Conclusion: Becoming a Statistically Literate Clinician
Statistics is not a tool for the elite few. It is a language for understanding evidence and making rational decisions. Every clinician should be able to read a confidence interval, compute an NNT, and recognize when p-values are being misused.
You do not need to be a mathematician. But you do need intuition. Understand that variation is natural, that correlation is not causation, that statistical significance differs from clinical significance, and that effect sizes matter more than p-values.
In the age of AI and big data, these skills are more valuable than ever. The clinician who thinks statistically—who questions assumptions, demands evidence, and remains skeptical of neat conclusions—will practice better medicine.
The literature is vast and often contradictory. But with statistical literacy, you can navigate it with confidence. You can ask the right questions, evaluate the strength of evidence, and ultimately make decisions that serve your patients' best interests.
Start small: the next time you read a study, focus on the confidence intervals and effect sizes, not the p-value. Ask yourself: Is this effect clinically meaningful? Could it be due to chance or confounding? Does it fit the broader literature? With practice, statistical thinking becomes second nature—and your medicine improves.
Further Reading
- Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ. 1995;311(7003):485.
- Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature. 2019;567(7748):305–307.
- Badgeley MA, Zech JR, Oakden-Rayner L, et al. Deep learning predicts hip fracture using confounding patient and healthcare variables. NPJ Digit Med. 2019;2:31.
- Blume JD, D'Agostino McGowan L, Dupont WD, Greevy RA. Second-generation p-values: Improved rigor, reproducibility, and transparency in statistical analyses. PLoS One. 2018;13(3):e0213549.
- Cook RJ, Sackett DL. The number needed to treat: a clinically useful measure of treatment effect. BMJ. 1995;310(6977):452–454.
- Goodman SN. A comment on replication, p-values and evidence. Stat Med. 1992;11(7):875–879.
- Guyatt GH, Oxman AD, Vist GE, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ. 2008;336(7650):924–926.
- Ioannidis JPA. Why most published research findings are false. PLoS Med. 2005;2(8):e124.
- Krishan K, Chatterjee A. Medical Statistics: A Practical Guide. Springer; 2023.
- Schüz J, Straif K. Epidemiology of cancer. In: DeVita VT, Lawrence TS, Rosenberg SA, eds. DeVita, Hellman, and Rosenberg's Cancer: Principles & Practice of Oncology. Wolters Kluwer; 2014.
- Wasserstein RL, Lazar NA. The ASA statement on p-values: context, process, and purpose. Am Stat. 2016;70(2):129–133.
- Zarin DA, Tse T, Williams RJ, Califf RM, Ide NC. The ClinicalTrials.gov results database—update and key issues. N Engl J Med. 2011;364(9):852–860.
PsychoPharmRef Newsletter