Empirical benchmarks are often used to evaluate new machine learning algorithms, for instance, to establish what is “state of the art.” However, ML systems may perform differently across benchmark studies when, for example, different hyperparameter configurations or training procedures are applied. In this paper, Bouthillier et al. study sources of variation beyond data sampling that can affect benchmark performance measures. They recommend that ML practitioners randomize as many sources of variation as possible and use resampling techniques to improve the precision of benchmarks. They further assert that evaluation methods should account for variance, not only average performance.
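The recommendation to report variance, not just a single score, can be sketched as follows. This is a minimal illustration, not the authors' method: the scores are hypothetical accuracies from repeated training runs with different random seeds, and a percentile bootstrap (one common resampling technique) is used to attach a confidence interval to the mean.

```python
import random
import statistics

# Hypothetical test accuracies from 10 training runs of the same model,
# each run using a different random seed (illustrative numbers only).
scores = [0.842, 0.851, 0.838, 0.847, 0.855, 0.840, 0.849, 0.844, 0.853, 0.846]

def bootstrap_ci(samples, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.mean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples)]
    return lo, hi

mean = statistics.mean(scores)
sd = statistics.stdev(scores)
lo, hi = bootstrap_ci(scores)
# Report spread alongside the average, not the average alone.
print(f"mean={mean:.3f} sd={sd:.3f} 95% CI=({lo:.3f}, {hi:.3f})")
```

Reporting the interval makes it clear whether a rival method's average improvement exceeds the run-to-run variation of the benchmark itself.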