
Statistical Significance in Safety Benchmarks

Authensor

When comparing two safety systems or evaluating a system against a threshold, the observed results are samples from a distribution. A measured accuracy of 95% might reflect true accuracy anywhere from 92% to 98% depending on the sample size and variability. Statistical significance testing determines whether observed differences are real or just noise.

The Problem with Point Estimates

Reporting "System A has 95% precision" without a confidence interval or sample size is incomplete. With 100 test examples, 95% precision means 95 correct out of 100, but the 95% confidence interval spans roughly 89% to 98%. With 10,000 examples, the same 95% precision has a confidence interval of roughly 94.5% to 95.5%. The sample size determines how much you can trust the number.

Confidence Intervals

Always report confidence intervals alongside safety benchmark results. For binary classification metrics (precision, recall, accuracy), use the Wilson score interval or the Clopper-Pearson exact interval.

Results: 950 correct out of 1000
Point estimate: 95.0%
95% CI (Wilson): [93.5%, 96.2%]
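The interval above can be reproduced with the Wilson score formula in a few lines. This is a minimal pure-Python sketch; library implementations (e.g., statsmodels' `proportion_confint` with `method="wilson"`) do the same job:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion (z=1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_interval(950, 1000)
print(f"95% CI (Wilson): [{lo:.1%}, {hi:.1%}]")  # [93.5%, 96.2%]
```

Unlike the naive normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly when the proportion is near 0 or 1, which is common for safety metrics.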

Hypothesis Testing for Comparisons

When comparing two systems, use a statistical test to determine whether the difference is significant:

McNemar's test: For paired binary classifications (both systems classify the same examples). Tests whether the systems disagree in a systematic way.
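McNemar's test needs only the discordant counts: the examples one system got right and the other got wrong. A minimal exact version (the counts in the usage line are hypothetical):

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar test.
    b: examples system A got right and system B got wrong; c: the reverse.
    Under H0 the discordant pairs split 50/50: X ~ Binomial(b + c, 0.5)."""
    n = b + c
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5**n
    return min(p, 1.0)

# Hypothetical counts: A wins 40 of the disagreements, B wins 22.
print(f"p = {mcnemar_exact(40, 22):.3f}")
```

Note that examples both systems classify identically carry no information for this test; only the disagreements matter.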

Paired bootstrap test: Resample the test set with replacement many times and compute the metric difference each time. The distribution of differences provides a confidence interval for the true difference.
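The paired bootstrap can be sketched in a few lines. The inputs are hypothetical per-example 0/1 correctness vectors for the two systems, in the same example order:

```python
import random

def bootstrap_diff_ci(correct_a, correct_b, n_boot=10_000, seed=0):
    """95% bootstrap CI for accuracy(A) - accuracy(B) on a shared test set."""
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```

If the resulting interval excludes zero, the difference is significant at the 5% level. Resampling paired outcomes (rather than each system independently) preserves the correlation between the systems' errors, which is what makes the paired test more sensitive.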

Permutation test: Shuffle the labels between the two systems and recompute the metric difference. The proportion of shuffled differences as extreme as the observed difference is the p-value.
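A paired permutation test has a similar shape: under the null hypothesis the system labels are exchangeable, so each example's two outcomes can be swapped at random. Inputs are hypothetical per-example 0/1 correctness vectors:

```python
import random

def permutation_test(correct_a, correct_b, n_perm=10_000, seed=0):
    """Paired two-sided permutation test for a difference in accuracy."""
    rng = random.Random(seed)
    n = len(correct_a)
    observed = abs(sum(correct_a) - sum(correct_b)) / n
    extreme = 0
    for _ in range(n_perm):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:
                a, b = b, a  # swap this example's outcomes under H0
            diff += a - b
        if abs(diff) / n >= observed:
            extreme += 1
    return extreme / n_perm
```

The returned p-value is the fraction of label shufflings that produce a difference at least as large as the observed one.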

Sample Size Planning

Determine the required sample size before running the benchmark. The required size depends on:

  • The expected metric value
  • The minimum difference you want to detect
  • The desired confidence level (typically 95%)
  • The desired statistical power (typically 80%)

For detecting a 2% difference in precision between two systems at 95% confidence and 80% power, you need approximately 2,000 examples per system.
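The figure can be checked with the standard two-proportion sample-size formula. This sketch assumes a 95% vs. 93% comparison; the exact n also depends on the baseline rate, so treat the result as an order-of-magnitude estimate:

```python
import math

def n_per_group(p1, p2):
    """Approximate n per system to detect p1 vs p2 with a two-sided
    two-proportion z-test at alpha=0.05 and 80% power."""
    z_alpha, z_beta = 1.9600, 0.8416  # fixed for alpha=0.05, power=0.80
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * var / (p1 - p2) ** 2)

print(n_per_group(0.95, 0.93))  # → 2210
```

Halving the detectable difference quadruples the required sample size, which is why detecting small safety regressions demands large benchmarks.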

Multiple Comparisons

When comparing multiple systems or evaluating multiple metrics, the probability of at least one false positive grows with every additional comparison. Apply a correction: Bonferroni or Holm to control the family-wise error rate, or Benjamini-Hochberg to control the false discovery rate.
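Holm's step-down procedure is a simple, uniformly more powerful replacement for Bonferroni. A minimal sketch (the p-values in the usage line are hypothetical):

```python
def holm_correction(pvalues, alpha=0.05):
    """Holm step-down: returns which hypotheses are rejected while
    controlling the family-wise error rate at alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = [False] * m
    for rank, i in enumerate(order):
        if pvalues[i] > alpha / (m - rank):
            break  # once one test fails, all larger p-values fail too
        rejected[i] = True
    return rejected

# Hypothetical p-values from comparing four systems against a baseline:
print(holm_correction([0.001, 0.04, 0.012, 0.20]))
```

Library versions exist as well (e.g., statsmodels' `multipletests` supports `holm` and `fdr_bh` methods).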

Reporting Standards

A complete benchmark report includes: metric point estimates, confidence intervals, sample sizes, statistical test results, effect sizes, and the correction method used for multiple comparisons. Without these elements, the results are not interpretable.

Statistical rigor separates evidence from anecdote. Apply it to every safety benchmark.
