A safety evaluation that produces different results when run twice on the same system is not trustworthy. Reproducibility ensures that evaluations are consistent, comparable, and credible. When you claim a safety classifier has 95% accuracy, anyone should be able to verify that number by running the same evaluation.
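Model non-determinism: Language models can produce different outputs for the same input because of sampling temperature, unset or differing random seeds, and non-deterministic GPU execution (for example, the order in which parallel floating-point operations are reduced). Safety evaluations that depend on model outputs inherit this variability.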
Evaluation data drift: If the evaluation dataset changes between runs (new examples added, old ones removed), results are not comparable.
Environment differences: Different hardware, software versions, or configurations can produce different results, especially for timing-sensitive metrics like latency benchmarks.
Metric implementation: Subtle differences in how metrics are calculated (rounding, edge case handling, inclusion/exclusion criteria) can change results.
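To make the metric point concrete, here is a minimal Python sketch of an accuracy function that states its edge-case policy in code instead of leaving it implicit. The function name and the `count_abstentions_as_errors` flag are illustrative assumptions, not part of any particular framework.

```python
def accuracy(predictions, labels, count_abstentions_as_errors=True):
    """Fraction of correct predictions, with the abstention policy made explicit.

    A prediction of None means the system abstained (an assumption of this sketch).
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    pairs = list(zip(predictions, labels))
    if not count_abstentions_as_errors:
        # Exclusion policy: abstentions are dropped from the denominator.
        pairs = [(p, y) for p, y in pairs if p is not None]
    if not pairs:
        # Refuse to report a number for an empty evaluation set.
        raise ValueError("no examples to score")
    return sum(1 for p, y in pairs if p == y) / len(pairs)
```

The same predictions give different scores under the two policies, which is why the chosen policy should be recorded with the result.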
Version the evaluation dataset alongside the evaluation code. Use a specific dataset version for each evaluation run. Never modify a versioned dataset; create a new version instead.
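One way to enforce this is to record a content fingerprint for each dataset version and check it before every run. The following is a minimal sketch in Python; the helper names `dataset_fingerprint` and `verify_dataset` are hypothetical, and it assumes examples are JSON-serializable dictionaries.

```python
import hashlib
import json


def dataset_fingerprint(examples):
    """Return a SHA-256 hex digest over a canonical serialization of the examples."""
    digest = hashlib.sha256()
    for example in examples:
        # sort_keys gives a stable byte representation regardless of dict key order.
        line = json.dumps(example, sort_keys=True, ensure_ascii=False)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def verify_dataset(examples, expected_fingerprint):
    """Raise if the loaded data does not match the fingerprint recorded for this version."""
    actual = dataset_fingerprint(examples)
    if actual != expected_fingerprint:
        raise ValueError(
            f"dataset fingerprint mismatch: expected {expected_fingerprint}, got {actual}"
        )
```

Any added, removed, reordered, or edited example changes the fingerprint, so a silent modification of a versioned dataset fails loudly instead of shifting the results.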
Set random seeds explicitly for all sources of randomness: model sampling, data shuffling, and any stochastic evaluation components. For language models, use temperature 0 (greedy decoding) in safety evaluations to eliminate sampling randomness. Greedy decoding does not remove hardware-level non-determinism, so small run-to-run differences can still remain.
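As a small illustration of explicit seeding, the sketch below shuffles evaluation data with a dedicated generator from Python's standard library. The helper name `seeded_shuffle` is hypothetical. Libraries such as NumPy and PyTorch keep their own generators, which would need to be seeded separately.

```python
import random


def seeded_shuffle(items, seed):
    """Return a shuffled copy of items using a dedicated, explicitly seeded generator."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled
```

Passing the seed as an argument, instead of relying on global state, keeps the order stable even if other code draws random numbers in between.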
Lock all software dependencies (model versions, library versions, evaluation framework versions) in a reproducibility manifest. Use containers to freeze the software environment.
evaluation_manifest:
  dataset: "safety-eval-v3.2"
  model: "safety-classifier-v3"
  model_hash: "sha256:a1b2c3d4..."
  framework: "authensor-eval@1.5.0"
  random_seed: 42
  temperature: 0
  environment: "docker.io/authensor/eval:2026-01"
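A manifest is only useful if incomplete or unpinned entries are rejected before a run starts. The sketch below assumes the manifest above has already been parsed into a Python dictionary. The `manifest_problems` helper and its rule against `latest` references are illustrative assumptions, not features of any specific tool.

```python
REQUIRED_FIELDS = (
    "dataset", "model", "model_hash", "framework",
    "random_seed", "temperature", "environment",
)


def manifest_problems(manifest):
    """Return a list of human-readable problems; an empty list means the manifest passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        # A value of 0 (e.g. temperature) is valid; only absent or empty values fail.
        if field not in manifest or manifest[field] in ("", None):
            problems.append(f"missing field: {field}")
    for field in ("dataset", "framework", "environment"):
        # A floating reference such as ':latest' can change between runs.
        if str(manifest.get(field, "")).endswith("latest"):
            problems.append(f"unpinned reference in field: {field}")
    return problems
```

Failing the run when this list is non-empty stops an evaluation from starting against an environment that cannot be recreated later.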
Write the evaluation procedure as executable code, not prose. Anyone with access to the code, dataset, and manifest should be able to reproduce the results by running a single command.
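A single-command entry point might look like the following sketch. Everything in it is a stand-in chosen for illustration: the `--dataset` flag, the JSON Lines input with `text` and `label` fields, and the `keyword_classifier` placeholder that a real evaluation would replace with the system under test.

```python
import argparse
import json


def keyword_classifier(text):
    """Placeholder for the system under test (hypothetical)."""
    return "unsafe" if "attack" in text.lower() else "safe"


def evaluate(predict, examples):
    """Score a predict function against labeled examples."""
    if not examples:
        raise ValueError("evaluation dataset is empty")
    correct = sum(1 for ex in examples if predict(ex["text"]) == ex["label"])
    return {"n": len(examples), "accuracy": correct / len(examples)}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Run the safety evaluation.")
    parser.add_argument("--dataset", required=True, help="path to a JSON Lines file")
    args = parser.parse_args(argv)
    with open(args.dataset, encoding="utf-8") as f:
        examples = [json.loads(line) for line in f if line.strip()]
    result = evaluate(keyword_classifier, examples)
    # Sorted keys keep the printed result byte-identical across runs.
    print(json.dumps(result, sort_keys=True))
    return result
```

Saved as a script that calls `main()` under the usual `if __name__ == "__main__":` guard, this runs with one command, for example `python run_eval.py --dataset data.jsonl` (the file names are assumptions).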
When comparing two safety systems or two versions of the same system, run both evaluations with the same dataset, the same environment, and the same metrics. Report confidence intervals alongside point estimates to account for any remaining variability.
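For a proportion such as accuracy, one common choice is the Wilson score interval. The sketch below is one possible way to compute it, not the only valid method. The function name is hypothetical, and `z=1.96` corresponds to an approximate 95% interval.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

With 950 correct out of 1000 examples, the interval spans roughly 0.935 to 0.962. When the intervals of two systems overlap heavily, a small difference in their point estimates is weak evidence that one is better.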
Run reproducible evaluations automatically in CI: compare each run against stored baselines, flag regressions, and keep historical results for trend analysis.
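The baseline comparison can be a small function that a CI job calls after each run. This sketch assumes every metric is "higher is better" and uses a hypothetical absolute tolerance of 0.01. Both assumptions would need adjusting for metrics such as latency or false-positive rate.

```python
def find_regressions(baseline, current, tolerance=0.01):
    """Return {metric: (baseline_value, current_value)} for metrics that regressed.

    Assumes higher values are better. A metric missing from the current
    run is reported with a current value of None.
    """
    regressions = {}
    for name, base_value in baseline.items():
        if name not in current:
            regressions[name] = (base_value, None)
        elif current[name] < base_value - tolerance:
            regressions[name] = (base_value, current[name])
    return regressions
```

A CI step could fail the build whenever the returned dictionary is non-empty and append the current metrics to a results log for later trend analysis.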
Reproducibility is not optional for safety claims. A safety result that cannot be reproduced is an anecdote, not evidence.