When comparing two safety systems or evaluating a system against a threshold, the observed results are samples from a distribution. A measured accuracy of 95% might reflect true accuracy anywhere from 92% to 98% depending on the sample size and variability. Statistical significance testing determines whether observed differences are real or just noise.
Reporting "System A has 95% precision" without a confidence interval or sample size is incomplete. With 100 test examples, 95% precision means 95 correct out of 100, but the 95% confidence interval spans roughly 89% to 98%. With 10,000 examples, the same 95% precision has a confidence interval of roughly 94.5% to 95.5%. The sample size determines how much you can trust the number.
Always report confidence intervals alongside safety benchmark results. For binary classification metrics (precision, recall, accuracy), use the Wilson score interval or the Clopper-Pearson exact interval.
```
Results: 950 correct out of 1000
Point estimate: 95.0%
95% CI (Wilson): [93.5%, 96.2%]
```
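The Wilson score interval needs nothing beyond the standard library. A minimal sketch (the `wilson_interval` helper name is my own) that reproduces the numbers above:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z ~= 1.96 gives a 95% CI)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    margin = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - margin, center + margin

lo, hi = wilson_interval(950, 1000)
print(f"95% CI (Wilson): [{lo:.1%}, {hi:.1%}]")  # → [93.5%, 96.2%]
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when the proportion is near 0% or 100%, which is common for safety metrics.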
When comparing two systems, use a statistical test to determine whether the difference is significant:
McNemar's test: For paired binary classifications (both systems classify the same examples). Tests whether the systems disagree in a systematic way.
Paired bootstrap test: Resample the test set with replacement many times and compute the metric difference each time. The distribution of differences provides a confidence interval for the true difference.
Permutation test: Shuffle the labels between the two systems and recompute the metric difference. The proportion of shuffled differences as extreme as the observed difference is the p-value.
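The paired bootstrap is straightforward to sketch with NumPy. Here the per-example outcomes are synthetic (system A correct ~95% of the time, system B ~92%), so the exact numbers are illustrative only:

```python
import numpy as np

def paired_bootstrap_ci(correct_a, correct_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the accuracy difference between two systems
    evaluated on the same (paired) test examples."""
    correct_a = np.asarray(correct_a, dtype=float)
    correct_b = np.asarray(correct_b, dtype=float)
    n = len(correct_a)
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample test examples with replacement
        diffs[i] = correct_a[idx].mean() - correct_b[idx].mean()
    return np.quantile(diffs, alpha / 2), np.quantile(diffs, 1 - alpha / 2)

# Synthetic paired outcomes for illustration.
rng = np.random.default_rng(42)
a = rng.random(1000) < 0.95
b = rng.random(1000) < 0.92
lo, hi = paired_bootstrap_ci(a, b)
print(f"Observed diff: {a.mean() - b.mean():+.3f}, 95% CI: [{lo:+.3f}, {hi:+.3f}]")
```

If the bootstrap interval excludes zero, the difference is significant at the corresponding level. Resampling whole examples (rather than each system independently) preserves the pairing, which is what makes the comparison powerful.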
Determine the required sample size before running the benchmark. The required size depends on: the minimum difference you need to detect (the effect size), the baseline rate of the metric, the significance level, and the desired statistical power.
For detecting a 2% difference in precision between two systems at 95% confidence and 80% power, you need roughly 2,000 examples per system. The exact figure depends on the baseline rate: proportions near 50% have the highest variance and require the most examples.
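The standard two-proportion sample-size formula can be computed with the standard library alone. A sketch, assuming a baseline precision near 95% (the `required_n` name is my own):

```python
from math import ceil, sqrt
from statistics import NormalDist

def required_n(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size to distinguish proportions p1 and p2
    with a two-sided two-proportion z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

print(required_n(0.95, 0.93))  # roughly 2,200 examples per system
```

Running the planned benchmark size through a calculation like this before collecting data avoids discovering afterward that the test set was too small to detect the difference you care about.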
When comparing multiple systems or evaluating multiple metrics, the probability of at least one false positive increases with every additional comparison. Apply corrections to control the family-wise error rate (Bonferroni, Holm) or the false discovery rate (Benjamini-Hochberg).
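The Benjamini-Hochberg procedure is short enough to sketch directly: sort the p-values, find the largest rank k whose p-value falls under the line (k/m)·q, and reject the k smallest:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a reject flag per hypothesis, controlling the FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= (k / m) * q.
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k = rank
    # Reject the hypotheses with the k smallest p-values.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject

print(benjamini_hochberg([0.01, 0.04, 0.20, 0.045]))  # → [True, False, False, False]
```

Note the flags are returned in the original input order, so they line up with the comparisons as reported in the benchmark table.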
A complete benchmark report includes: metric point estimates, confidence intervals, sample sizes, statistical test results, effect sizes, and the correction method used for multiple comparisons. Without these elements, the results are not interpretable.
Statistical rigor separates evidence from anecdote. Apply it to every safety benchmark.