Medical Statistics: What Every Doctor and Student Needs to Know
From p-values to NNT, confidence intervals to machine learning — a practical guide to biostatistics for clinicians
Introduction: Why Statistics Matter to Every Clinician
You may not think of yourself as a statistician. But every time you order a diagnostic test, prescribe a medication, or interpret a clinical trial, you are making a statistical decision. Is this result real or random? How confident am I in this finding? What is the magnitude of benefit for my patient?
The problem is that many clinicians received minimal statistical training. Medical school emphasizes clinical reasoning and pattern recognition but often relegates biostatistics to a few lectures. This gap has real consequences: physicians misinterpret p-values, overestimate effect sizes, chase spurious associations, and make decisions that don't reflect the evidence.
This article is not a math textbook. You will not need calculus. Instead, I aim to give you an intuitive understanding of the core statistical concepts that underpin modern medicine. With this foundation, you can read a journal article critically, ask the right questions, and distinguish signal from noise.
Part I: A Brief History of Biostatistics
The key lesson from history: statistics was developed by practical people solving real problems. Nightingale wanted to save lives. Gosset wanted to improve beer quality. Fisher wanted to design better agricultural experiments. The math serves the question, not the other way around.
Part II: Core Concepts—The Foundation
Descriptive Statistics: Summarizing Data
Before we can make inferences, we must describe what we observe. Descriptive statistics reduce a dataset to its essential features.
Measures of Central Tendency
- Mean (average): Sum all values and divide by the count. Sensitive to outliers. If you earn $30k and a billionaire sits next to you, your "average" wealth is $500 million.
- Median (middle value): Arrange values in order; the middle value is the median. Robust to outliers. Half the people earn above it, half below. Often more informative than the mean for skewed distributions (like income).
- Mode (most common value): The value that appears most frequently. Useful for categorical data (e.g., which antidepressant is prescribed most often?).
Measures of Variability
- Range: Maximum minus minimum. Easy to calculate but misleading. Two datasets can have identical ranges but very different distributions.
- Standard Deviation (SD): How spread out data are around the mean. High SD means values are scattered; low SD means they cluster around the mean. In a normal distribution, 68% of values fall within one SD of the mean, 95% within two SDs.
- Variance: SD squared. Used in calculations but harder to interpret directly.
- Interquartile Range (IQR): The range containing the middle 50% of values. Robust to outliers. Preferred for skewed data.
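These summaries are all one-liners in practice. Here is a minimal sketch using Python's standard library, with an invented sample of systolic blood pressures chosen to include one outlier:

```python
import statistics

# Hypothetical systolic blood pressures (mm Hg), with one outlier (190)
bp = [118, 122, 125, 130, 135, 140, 190]

mean = statistics.mean(bp)      # pulled upward by the outlier
median = statistics.median(bp)  # robust: the middle value
sd = statistics.stdev(bp)       # sample standard deviation
q1, _, q3 = statistics.quantiles(bp, n=4)  # quartile cut points
iqr = q3 - q1                   # spread of the middle 50%

# Mode is most useful for categorical data
mode_example = statistics.mode(["sertraline", "fluoxetine", "sertraline"])

print(f"mean={mean:.1f}  median={median}  sd={sd:.1f}  IQR={iqr}")
```

Note how the single outlier pulls the mean (≈137) above the median (130), exactly the skew effect described above.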
Probability Distributions
Data rarely fall into arbitrary categories. Instead, they follow patterns called distributions. Understanding these patterns is key to inference.
The Normal (Gaussian) Distribution
Many natural phenomena—height, IQ, cholesterol—follow a bell curve. The normal distribution is defined by its mean and SD. It is symmetric, with tails that extend infinitely (though most values cluster near the center). Many statistical tests assume normality, which is why it matters.
Other Important Distributions
- Binomial distribution: For binary outcomes (success/failure, yes/no). Useful for counting events (e.g., how many patients respond to treatment out of 100).
- Poisson distribution: For rare events occurring over time. Useful for counting adverse events in a large population.
- t-distribution: Similar to normal but with heavier tails. Used when sample sizes are small or when we don't know the true SD.
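For intuition, the binomial distribution from the first bullet can be written out directly from its formula (the numbers below are illustrative):

```python
import math

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n trials with per-trial probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability that exactly 55 of 100 patients respond if the true
# response rate is 50% — any single exact count is fairly unlikely
print(f"{binom_pmf(55, 100, 0.5):.3f}")

# Probabilities over all possible counts must sum to 1
total = sum(binom_pmf(k, 100, 0.5) for k in range(101))
print(f"{total:.6f}")
```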
Probability and Risk
Probability is the foundation of inference. On the frequentist view, it is the long-run frequency of an event: if I flip a fair coin many times, the proportion of heads approaches 0.5.
In medicine, we often talk about risk: the probability of an event (disease, death, recovery) in a defined population over a period of time. For example, if 200 people in a population of 10,000 develop depression over one year:
Risk = 200 / 10,000 = 0.02 = 2% per year
The complement of risk is useful too: if 2% develop depression, 98% do not.
Part III: Hypothesis Testing and P-Values
The Framework: Null and Alternative Hypotheses
Scientific research revolves around hypothesis testing. We propose a hypothesis, then ask: do the data support it or contradict it?
The null hypothesis (H₀) is the default assumption: there is no effect, no difference, no relationship. For example: "Sertraline is no better than placebo for depression."
The alternative hypothesis (H₁) is what we are testing: "Sertraline is better than placebo."
We collect data, perform a statistical test, and calculate a p-value. This is where confusion often begins.
Understanding P-Values: What They Are (and Are Not)
The p-value is not the probability that the null hypothesis is true. This is the most common misinterpretation.
The p-value is: the probability of observing data this extreme (or more extreme) if the null hypothesis were true.
Example: You conduct a trial comparing sertraline to placebo. Sertraline patients have a 55% response rate; placebo patients have a 50% response rate. You calculate a p-value of 0.06.
This means: If sertraline were truly no better than placebo, there would be a 6% chance of observing a difference this large (or larger) due to random variation.
It does NOT mean: there is a 6% chance that sertraline is ineffective, or a 94% chance it works.
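The definition can be made concrete by simulation: assume the null is true (both arms truly have a 50% response rate) and count how often chance alone produces a gap of 5 percentage points or more. The arm size of 700 is my assumption, chosen so the answer lands near the article's p = 0.06; it is not from any real trial.

```python
import random

random.seed(1)
n_per_arm, observed_diff, sims = 700, 0.05, 4_000

extreme = 0
for _ in range(sims):
    # Simulate both arms under the null: true response rate 0.5 in each
    drug = sum(random.random() < 0.5 for _ in range(n_per_arm))
    placebo = sum(random.random() < 0.5 for _ in range(n_per_arm))
    if abs(drug - placebo) / n_per_arm >= observed_diff:
        extreme += 1

p_value = extreme / sims  # fraction of "null worlds" at least this extreme
print(f"simulated two-sided p ≈ {p_value:.3f}")
```

The simulated value hovers around 0.06: in a world where the drug does nothing, a 5-point gap still appears by chance about 6% of the time.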
The Problem with P-Values: P-Hacking and Multiple Testing
If you run enough statistical tests, you will eventually find a "significant" result by chance alone. This is the multiple comparisons problem; deliberately exploiting it — trying analyses until one turns up significant — is called p-hacking.
This inflates the false positive rate and contributes to the replication crisis. Many high-profile studies that failed to replicate likely fell victim to p-hacking.
Solutions include: pre-registering your analysis plan (committing to which tests you will run before seeing the data), using stricter p-value thresholds when multiple tests are involved, and focusing on effect sizes and confidence intervals rather than p-values alone.
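A quick simulation shows why multiplicity matters. Each simulated "study" below runs 20 tests in which the null is always true (so each p-value is uniform on [0, 1]); the expected family-wise false-positive rate is 1 − 0.95²⁰ ≈ 64%:

```python
import random

random.seed(0)
studies, tests_per_study, alpha = 5_000, 20, 0.05

false_alarm = 0
for _ in range(studies):
    # Under the null, each test's p-value is uniform on [0, 1]
    pvals = [random.random() for _ in range(tests_per_study)]
    if min(pvals) < alpha:  # did any test come out "significant"?
        false_alarm += 1

fwer = false_alarm / studies
print(f"P(at least one false positive) ≈ {fwer:.2f}")

# Bonferroni correction: require p < alpha / 20 for each test instead
bonf = sum(
    min(random.random() for _ in range(tests_per_study)) < alpha / tests_per_study
    for _ in range(studies)
) / studies
print(f"with Bonferroni correction ≈ {bonf:.2f}")
```

The Bonferroni-corrected rate falls back to roughly 5%, at the cost of power for any single test.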
Part IV: Confidence Intervals and Effect Sizes
Confidence Intervals: Estimating the True Effect
A p-value tells you whether a result is unlikely under the null. But what is the actual magnitude of the effect? This is where confidence intervals (CIs) shine.
A 95% confidence interval is a range that, if the study were repeated many times, would contain the true effect 95% of the time. It provides a best estimate plus a margin of error.
Example: a trial finds a 60% response rate with a 95% CI of 52%–68%.
This means: Our best estimate of the true response rate is 60%. We are 95% confident the true rate falls between 52% and 68%.
Confidence intervals reveal what p-values hide. Two studies can have identical p-values but very different CIs. Study A: Effect of 10% (95% CI: 8%–12%). Study B: Effect of 10% (95% CI: 1%–19%). Both are "significant," but Study A's effect is estimated much more precisely.
If a CI includes the null value (zero for a difference, 1 for a ratio such as RR or OR), the result is not statistically significant at the 0.05 level.
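As a sketch, the 60% response rate with a 52%–68% interval quoted above is roughly what a standard Wald (normal-approximation) interval gives for 90 responders out of 150 patients — the sample size here is my assumption, back-derived for illustration:

```python
import math

responders, n = 90, 150
p_hat = responders / n                          # 0.60
se = math.sqrt(p_hat * (1 - p_hat) / n)         # standard error of a proportion
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se   # 95% Wald interval
print(f"response rate {p_hat:.0%}, 95% CI {lo:.0%} to {hi:.0%}")
```

Quadrupling the sample size would halve the standard error and tighten the interval accordingly.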
Effect Sizes: How Big Is the Difference?
Statistical significance is not the same as clinical significance. In a study of 100,000 patients, a 1 mm Hg reduction in blood pressure can be highly statistically significant yet clinically trivial. Conversely, a 30 mm Hg reduction in a study of 30 patients may be non-significant purely because the sample is small, yet clinically huge.
Effect size measures the magnitude of a difference, independent of sample size. Common metrics include:
- Absolute Risk Reduction (ARR): The difference in event rates between groups. If 60% of treated patients recover and 50% of control patients recover, ARR = 10 percentage points.
- Relative Risk (RR): The ratio of risks. RR = 0.60 / 0.50 = 1.2. The treated group has 1.2 times the probability of recovery (a 20% higher chance in relative terms).
- Cohen's d: Standardized difference in means. d = 0.2 is small, 0.5 is medium, 0.8 is large. Useful for comparing studies with different measurement scales.
- Odds Ratio (OR): Discussed in detail below, but briefly: the odds of an event in one group divided by the odds in another.
Always report effect sizes, not just p-values. A p-value of 0.001 with a tiny effect size is less informative than a p-value of 0.08 with a large, clinically meaningful effect.
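These metrics are one-liners. Below, the 60% vs 50% recovery example from the bullets above, plus Cohen's d for two hypothetical group means (the means and SDs are invented for illustration):

```python
import math

p_treat, p_ctrl = 0.60, 0.50
arr = p_treat - p_ctrl                    # absolute risk reduction: 0.10
rr = p_treat / p_ctrl                     # relative risk: 1.2
odds_ratio = (p_treat / (1 - p_treat)) / (p_ctrl / (1 - p_ctrl))  # 1.5

# Cohen's d: standardized mean difference (illustrative rating-scale scores)
mean_t, mean_c, sd_t, sd_c = 14.0, 10.0, 8.0, 8.0
pooled_sd = math.sqrt((sd_t**2 + sd_c**2) / 2)
d = (mean_t - mean_c) / pooled_sd         # 0.5 → "medium" effect

print(f"ARR={arr:.0%}  RR={rr:.2f}  OR={odds_ratio:.2f}  d={d:.2f}")
```

Notice that the OR (1.5) already looks larger than the RR (1.2) because recovery is a common outcome — a divergence worth keeping in mind.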
Part V: Types of Error, Power, and Sample Size
Type I and Type II Errors
Statistical tests can fail in two ways:
- Type I Error (False Positive): Rejecting the null when it is true. You conclude there is an effect when there isn't one. Probability = alpha (α), conventionally 0.05.
- Type II Error (False Negative): Failing to reject the null when it is false. You conclude there is no effect when one exists. Probability = beta (β), often 0.10–0.20.
Type I errors are controlled by the p-value threshold (α). Type II errors are controlled by ensuring adequate statistical power.
Statistical Power
Power is the probability of detecting an effect if one truly exists. It is 1 minus beta (β). A study with 80% power has a 20% chance of missing a true effect.
Power depends on:
- Sample size: Larger studies have more power. Doubling sample size increases power substantially.
- Effect size: Larger effects are easier to detect. If the true effect is 50% better than the alternative, power is higher than if the effect is 5% better.
- Alpha (significance level): Stricter alpha (e.g., 0.01 vs 0.05) reduces power.
Underpowered studies are common and problematic. If a study has 50% power and the result is non-significant, you have learned almost nothing. There may be a true effect, but you lacked the power to detect it. Conversely, large studies with small effects are often overpowered (detecting clinically trivial differences).
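Power can be estimated by simulation: generate many hypothetical trials under an assumed true effect (here 60% vs 50% response, a simple two-proportion z-test, arm sizes chosen for illustration) and count how often the result reaches p < 0.05:

```python
import math
import random

def trial_significant(n_per_arm: int, p1: float = 0.60, p0: float = 0.50) -> bool:
    """Simulate one trial; True if a two-proportion z-test gives p < 0.05."""
    a = sum(random.random() < p1 for _ in range(n_per_arm))
    b = sum(random.random() < p0 for _ in range(n_per_arm))
    p_pool = (a + b) / (2 * n_per_arm)
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n_per_arm)
    return se > 0 and abs(a - b) / n_per_arm / se > 1.96

random.seed(42)
powers = {}
for n in (100, 400):
    powers[n] = sum(trial_significant(n) for _ in range(2_000)) / 2_000
    print(f"n={n} per arm → simulated power ≈ {powers[n]:.2f}")
```

With 100 patients per arm the trial detects this true 10-point effect only around 30% of the time; quadrupling enrollment lifts power to roughly 80%, the conventional target.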
Part VI: Practical Metrics for Clinicians
Number Needed to Treat (NNT) and Number Needed to Harm (NNH)
These are the most clinically useful statistics. They translate group-level data into individual-level meaning.
NNT: The number of patients you must treat to prevent one bad outcome (or achieve one good outcome).
- Response rate on drug: 60%
- Response rate on placebo: 50%
- Absolute Risk Reduction (ARR): 60% – 50% = 10%
- NNT = 1 / ARR = 1 / 0.10 = 10
Interpretation: You must treat 10 patients with this antidepressant to achieve one additional response compared to placebo.
NNH: The number of patients you must treat to harm one person with a side effect. Example: 5% of treated patients develop metabolic syndrome versus 1% on placebo.
- Absolute Excess Risk: 5% – 1% = 4%
- NNH = 1 / 0.04 = 25
Interpretation: For every 25 patients treated, one develops metabolic syndrome due to the drug (above background risk).
NNT and NNH allow you to weigh benefits against harms. An NNT of 10 is excellent; an NNT of 100 means the drug must be given to 100 patients to help one, which may not be worth it. An NNH of 25 for a serious side effect is concerning. Context matters: Would you accept an NNT of 100 if the alternative is death? Probably yes. Would you accept it for mild insomnia? Probably not.
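The arithmetic above fits in two reusable lines (rates taken from the examples in this section):

```python
def number_needed(rate_treated: float, rate_control: float) -> float:
    """NNT (or NNH) = 1 / absolute difference in event rates."""
    return 1 / abs(rate_treated - rate_control)

print(f"NNT = {number_needed(0.60, 0.50):.0f}")  # benefit: 10
print(f"NNH = {number_needed(0.05, 0.01):.0f}")  # harm: 25
```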
Relative Risk and Odds Ratio
Relative Risk (RR): The ratio of the probability of an outcome in one group to the probability in another.
Risk of depression in smokers: 20%
Risk of depression in non-smokers: 10%
RR = 0.20 / 0.10 = 2.0
Interpretation: Smokers have twice the risk of depression as non-smokers.
Odds Ratio (OR): The ratio of odds in one group to odds in another. (Odds are the ratio of the probability an event happens to the probability it doesn't.)
Odds of depression in smokers = 0.20 / 0.80 = 0.25
Odds of depression in non-smokers = 0.10 / 0.90 = 0.11
OR = 0.25 / 0.11 = 2.27
When the outcome is rare, OR approximates RR. When the outcome is common, they diverge.
RR is more intuitive for clinicians. An RR of 2 clearly means "twice the risk." ORs are often used in case-control studies and logistic regression but are frequently misinterpreted as RRs (which inflates the apparent effect).
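The rare-outcome convergence is easy to check numerically — the same relative risk of 2.0 yields very different odds ratios (the risks below are illustrative):

```python
def rr_and_or(risk_exposed: float, risk_unexposed: float) -> tuple[float, float]:
    """Relative risk and odds ratio from two group risks."""
    rr = risk_exposed / risk_unexposed
    odds_ratio = (risk_exposed / (1 - risk_exposed)) / (
        risk_unexposed / (1 - risk_unexposed)
    )
    return rr, odds_ratio

rare = rr_and_or(0.02, 0.01)    # rare outcome: RR 2.0, OR ≈ 2.02 — nearly equal
common = rr_and_or(0.40, 0.20)  # common outcome: RR 2.0, OR ≈ 2.67 — OR overstates
print(rare, common)
```

Reading the common-outcome OR of 2.67 as "2.67 times the risk" would exaggerate a true doubling of risk.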
Hazard Ratio
Hazard Ratio (HR): Similar to RR but used in survival analysis (time-to-event data). It represents the relative rate of the event occurring over time.
HR > 1: The treatment group experiences the event sooner or more often. HR < 1: The treatment group is protected. HR = 1: No difference.
Interpretation: HR = 0.8 means the treated group experiences the event at 80% of the comparison group's rate (a 20% lower hazard) at any given point in time.
Sensitivity, Specificity, and Predictive Values
When evaluating a diagnostic test, we care about how well it correctly identifies disease (and non-disease).
- Sensitivity: Proportion of people with disease who test positive. High sensitivity means few false negatives (you don't miss cases), so a negative result on a highly sensitive test helps "rule out" disease (SnNout).
- Specificity: Proportion of people without disease who test negative. High specificity means few false positives (you don't over-diagnose), so a positive result on a highly specific test helps "rule in" disease (SpPin).
- Positive Predictive Value (PPV): Probability that a person with a positive test actually has the disease. Depends on the prevalence of disease in your population.
- Negative Predictive Value (NPV): Probability that a person with a negative test does not have the disease.
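Sensitivity and specificity are properties of the test; PPV is not. A short calculation shows how the same hypothetical test (90% sensitive, 95% specific) collapses in a low-prevalence setting:

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """P(disease | positive test), via Bayes' rule on population fractions."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same test characteristics, two very different clinical populations
print(f"PPV at 20% prevalence: {ppv(0.90, 0.95, 0.20):.0%}")  # ≈ 82%
print(f"PPV at  1% prevalence: {ppv(0.90, 0.95, 0.01):.0%}")  # ≈ 15%
```

In the screening-type setting, roughly five of every six positive results are false positives, even with a seemingly excellent test.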
ROC Curves and AUC
A Receiver Operating Characteristic (ROC) curve plots sensitivity against (1 – specificity) as you vary the diagnostic threshold. The Area Under the Curve (AUC) summarizes the test's performance. AUC = 0.5 means the test is no better than a coin flip. AUC = 1.0 is perfect discrimination. AUC = 0.7–0.8 is generally considered "fair"; 0.8–0.9 is "good."
ROC curves are useful for comparing tests or choosing a diagnostic threshold that balances sensitivity and specificity for your clinical context.
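A useful equivalent definition: the AUC equals the probability that a randomly chosen diseased patient scores higher on the test than a randomly chosen healthy one (ties counting half). That makes it computable without drawing the curve — the scores below are invented:

```python
diseased_scores = [0.9, 0.8, 0.7, 0.6]
healthy_scores = [0.7, 0.5, 0.4, 0.2]

# Compare every diseased/healthy pair; a tie counts as half a "win"
wins = sum(
    1.0 if d > h else 0.5 if d == h else 0.0
    for d in diseased_scores
    for h in healthy_scores
)
auc = wins / (len(diseased_scores) * len(healthy_scores))
print(f"AUC = {auc:.3f}")
```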
Part VII: Common Statistical Tests and When to Use Them
| Question | Data Type | Test | Assumptions |
|---|---|---|---|
| Is the mean of one group different from another? | Continuous (normal) | Independent samples t-test | Normal distribution, equal variances |
| Same as above, but data are non-normal or the sample is small? | Continuous (non-normal) | Mann-Whitney U test (non-parametric) | No normality assumption |
| Are means different across 3+ groups? | Continuous (normal) | ANOVA (Analysis of Variance) | Normal distribution, equal variances |
| Are categorical variables associated? | Categorical (counts) | Chi-square test | Expected counts > 5 in each cell |
| Is there a linear relationship between two variables? | Continuous | Pearson correlation or linear regression | Linear relationship, normally distributed residuals |
| Predict a binary outcome (yes/no)? | Binary outcome, multiple predictors | Logistic regression | Large sample size, no perfect separation |
| Predict a continuous outcome? | Continuous outcome, multiple predictors | Linear regression | Linear relationship, normally distributed residuals |
| Compare time-to-event between groups? | Time-to-event (survival) | Kaplan-Meier curves, Cox regression | Independent observations, proportional hazards |
The key is matching your question to the right test. Many errors arise from using an inappropriate test or violating its assumptions. When in doubt, consult a statistician or your study's pre-registered analysis plan.
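As one worked row of the table, here is the Welch two-sample t-statistic computed from scratch on invented data (in practice you would reach for a library such as `scipy.stats.ttest_ind(..., equal_var=False)`, which also returns the p-value):

```python
import math
import statistics

group_a = [5.1, 4.8, 6.0, 5.5, 5.9, 4.7]   # e.g. symptom scores, arm A
group_b = [4.2, 4.5, 3.9, 4.8, 4.1, 4.4]   # arm B

mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)

# Welch form: allows unequal variances between groups
se = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
t = (mean_a - mean_b) / se
print(f"t = {t:.2f}")
```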
Part VIII: Common Misinterpretations and Pitfalls
Correlation ≠ Causation
Two variables can be correlated without one causing the other. Ice cream sales and drowning deaths are correlated (both increase in summer), but ice cream doesn't cause drowning. Confounding variables (warm weather) drive both.
Randomized trials break confounding by random assignment. Observational studies cannot, no matter how large. When reading an observational study, always ask: Could a confounding variable explain this association?
Absence of Evidence ≠ Evidence of Absence
A non-significant p-value does not mean there is no effect. It means you failed to detect one (perhaps due to low power). Always examine the confidence interval. If it is wide and crosses the null, the study is inconclusive, not negative.
Statistical Significance ≠ Clinical Significance
A large study can find a statistically significant effect that is trivially small. Always examine absolute effect sizes and ask: Would this change my management of patients?
Publication Bias
Studies with positive results are more likely to be published than negative ones. This skews the literature. If you read 10 published trials of a drug, all positive, be skeptical. There may be 20 unpublished negative trials in a file drawer.
P-Hacking and Multiple Comparisons
If you test enough hypotheses, one will be "significant" by chance. Always ask: Was this the pre-specified primary analysis, or a secondary/exploratory finding? Were multiple tests run, and if so, were they corrected for multiple comparisons?
Part IX: Statistics in the Age of AI and Machine Learning
Why AI Changes (and Doesn't Change) Statistical Thinking
Machine learning and artificial intelligence are revolutionizing medicine. Algorithms predict patient outcomes, analyze medical images, and identify drug targets. But AI does not eliminate the need for statistical literacy—it makes it more critical.
Machine Learning Paradigms
Supervised Learning: The algorithm learns from labeled data (e.g., images labeled as "cancer" or "benign"). It finds patterns that predict the label. Examples include random forests, neural networks, and support vector machines.
Unsupervised Learning: The algorithm finds structure in unlabeled data (e.g., clustering patients into subtypes). No ground truth is provided; the algorithm discovers patterns.
Reinforcement Learning: The algorithm learns through trial and error, optimizing for a reward signal. Used in adaptive clinical trials and treatment optimization.
How Machine Learning Differs from Classical Statistics
- Classical statistics: Start with a hypothesis, collect data, test the hypothesis. Inference is the goal (understanding why).
- Machine learning: Collect data, find patterns, make predictions. Prediction is the goal; interpretability is secondary.
- Assumption burden: Classical statistics assumes data follow known distributions. Machine learning makes fewer assumptions but requires more data.
- Overfitting: Machine learning models can memorize noise in training data and fail on new data. Classical statistics is more robust to small sample sizes but less flexible.
Common Pitfalls in Machine Learning Medicine
Data Leakage: Information from the test set "leaks" into training, inflating performance estimates. A model that looks 95% accurate on test data may perform poorly on new patients.
Selection Bias: If the training data is not representative (e.g., from a single hospital), the model may not generalize. A model trained on a wealthy hospital's data may perform poorly in resource-limited settings.
Black Box Problem: Complex neural networks can predict but cannot explain. This is problematic in medicine, where clinicians need to understand why a model recommends an action.
Class Imbalance: If disease is rare (e.g., 1% prevalence), a naive model that predicts "no disease" for everyone is 99% accurate but useless. Specialized techniques are needed.
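The imbalance trap fits in a back-of-envelope calculation (numbers illustrative):

```python
n_patients, prevalence = 10_000, 0.01
cases = int(n_patients * prevalence)           # 100 true cases

# A degenerate "model" that predicts "no disease" for everyone:
accuracy = (n_patients - cases) / n_patients   # 99% — looks impressive
sensitivity = 0 / cases                        # yet it misses every case
print(f"accuracy = {accuracy:.0%}, sensitivity = {sensitivity:.0%}")
```

This is why imbalanced problems are evaluated with sensitivity, precision, or AUC rather than raw accuracy.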
Natural Language Processing and Clinical Notes
Large language models (LLMs) can extract insights from unstructured clinical notes, identifying patterns in free text that would take humans weeks to manually code. However, the same issues apply: garbage in, garbage out. Models trained on biased data will perpetuate bias.
Bayesian Methods and Adaptive Trials
Classical hypothesis testing requires a fixed sample size determined a priori. Bayesian methods incorporate prior knowledge and update beliefs as data accumulate. This enables adaptive trials that can be stopped early for efficacy or futility, saving time and resources.
Bayesian approaches ask: "Given my prior belief and the new data, what is the posterior probability that my hypothesis is true?" This is more intuitive than the frequentist p-value but requires specifying priors, which can be contentious.
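A minimal sketch of this updating step uses the conjugate Beta-Binomial model; the Beta(2, 2) prior (weakly centered on a 50% response rate) and the trial numbers are assumptions for illustration only:

```python
# Prior: response rate ~ Beta(a, b). After s responders in n patients,
# the posterior is Beta(a + s, b + n - s) — conjugacy makes the
# "update beliefs as data accumulate" step a one-liner.
a_prior, b_prior = 2, 2          # weak prior centered on 0.5
s, n = 33, 50                    # observed: 33 responders of 50

a_post = a_prior + s
b_post = b_prior + (n - s)
posterior_mean = a_post / (a_post + b_post)
print(f"posterior mean response rate ≈ {posterior_mean:.2f}")
```

Because the posterior is a full distribution, an adaptive trial can recompute quantities like P(response rate > 0.5) at each interim look and stop early if the evidence is decisive.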
The Future: Why Statistical Thinking Is More Important Than Ever
In the AI era, statistical literacy is not becoming obsolete—it is becoming essential. Here is why:
- Critical evaluation of algorithms: You must ask: Was this model validated on independent data? Could it have overfit? Is it biased? Answering these requires statistical thinking.
- Integration of AI with human judgment: AI provides predictions; clinical judgment decides what to do with them. Understanding Type I and II errors, sensitivity and specificity, helps you calibrate trust in algorithms.
- Detecting AI hallucinations: Large language models generate plausible-sounding but false information. Statistical reasoning helps you spot inconsistencies and demand evidence.
- Avoiding the replication crisis 2.0: As researchers churn out machine learning models, many will fail to replicate or will work only in narrow domains. Pre-registration, external validation, and transparent reporting—all rooted in statistical thinking—will separate signal from noise.
Key Takeaway for the AI Era
Machine learning is a powerful tool, but it is not a replacement for thinking. The clinician who understands statistics—who questions assumptions, demands evidence, and weighs uncertainty—will navigate the AI revolution successfully. The clinician who blindly trusts algorithms without understanding their limitations will eventually be surprised by their failures.
Part X: Practical Guide—How to Appraise a Study's Statistics
When you read a journal article, here is a checklist for evaluating the statistical methods and results:
Study Design and Sample Size
- Was the sample size calculated a priori? Is it adequate?
- Was power analysis performed? (Expected answer: yes, 80–90% power.)
- Are the inclusion/exclusion criteria clearly stated?
- How many participants were enrolled? How many completed? Why did others drop out?
Primary Outcome
- Is the primary outcome clearly defined?
- Was it registered a priori (on ClinicalTrials.gov)? Or did it change during the study?
- Is it biologically meaningful, not just statistically significant?
Statistical Methods
- Are the statistical tests appropriate for the data?
- Were assumptions checked (e.g., normality, equal variances)?
- Were multiple comparisons corrected?
- Was the analysis per-protocol or intention-to-treat? (Intention-to-treat is preferred.)
Results
- Are effect sizes reported, not just p-values?
- Are confidence intervals provided?
- If p > 0.05, is the CI wide (inconclusive) or narrow around the null (consistent with no meaningful effect)?
- Are baseline characteristics balanced between groups?
Interpretation
- Do the authors' conclusions match the data?
- Do they acknowledge limitations?
- Is there evidence of outcome switching or selective reporting?
- How does this finding fit the broader literature?
Conclusion: Becoming a Statistically Literate Clinician
Statistics is not a tool for the elite few. It is a language for understanding evidence and making rational decisions. Every clinician should be able to read a confidence interval, compute an NNT, and recognize when p-values are being misused.
You do not need to be a mathematician. But you do need intuition. Understand that variation is natural, that correlation is not causation, that statistical significance differs from clinical significance, and that effect sizes matter more than p-values.
In the age of AI and big data, these skills are more valuable than ever. The clinician who thinks statistically—who questions assumptions, demands evidence, and remains skeptical of neat conclusions—will practice better medicine.
The literature is vast and often contradictory. But with statistical literacy, you can navigate it with confidence. You can ask the right questions, evaluate the strength of evidence, and ultimately make decisions that serve your patients' best interests.
Start small: the next time you read a study, focus on the confidence intervals and effect sizes, not the p-value. Ask yourself: Is this effect clinically meaningful? Could it be due to chance or confounding? Does it fit the broader literature? With practice, statistical thinking becomes second nature—and your medicine improves.
Further Reading
- Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ. 1995;311(7003):485.
- Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature. 2019;567(7748):305–307.
- Badgeley MA, Zech JR, Oakden-Rayner L, et al. Deep learning predicts hip fracture using confounding patient and healthcare variables. NPJ Digit Med. 2019;2:31.
- Blume JD, D'Agostino McGowan L, Dupont WD, Greevy RA. Second-generation p-values: Improved rigor, reproducibility, and transparency in statistical analyses. PLoS One. 2018;13(3):e0213549.
- Cook RJ, Sackett DL. The number needed to treat: a clinically useful measure of treatment effect. BMJ. 1995;310(6977):452–454.
- Goodman SN. A comment on replication, p-values and evidence. Stat Med. 1992;11(7):875–879.
- Guyatt GH, Oxman AD, Vist GE, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ. 2008;336(7650):924–926.
- Ioannidis JPA. Why most published research findings are false. PLoS Med. 2005;2(8):e124.
- Krishan K, Chatterjee A. Medical Statistics: A Practical Guide. Springer; 2023.
- Schüz J, Straif K. Epidemiology of cancer. In: DeVita VT, Lawrence TS, Rosenberg SA, eds. DeVita, Hellman, and Rosenberg's Cancer: Principles & Practice of Oncology. Wolters Kluwer; 2014.
- Wasserstein RL, Lazar NA. The ASA statement on p-values: context, process, and purpose. Am Stat. 2016;70(2):129–133.
- Zarin DA, Tse T, Williams RJ, Califf RM, Ide NC. The ClinicalTrials.gov results database—update and key issues. N Engl J Med. 2011;364(9):852–860.
PsychoPharmRef Newsletter