When building and evaluating language models, developers split data into training, development, and test sets. However, how the data is split can affect how accurately developers estimate performance on real-world data. Although some model developers have used random splits to minimize overfitting, Søgaard et al. find that estimates of test-time error are least accurate for random splits. They recommend using multiple, independent test sets where possible (or multiple biased splits where independent test sets are not available) to obtain the most reliable performance estimates.
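As a rough illustration of the difference, the sketch below contrasts a single random split with several heuristically biased splits, where the test set is made systematically different from the training data along some covariate. The toy corpus and the particular covariates (character length, token count) are assumptions for the example and are not the specific biases studied by Søgaard et al.

```python
import random

def random_split(examples, test_frac=0.2, seed=0):
    """Standard random split: shuffle, then hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def biased_split(examples, key, test_frac=0.2):
    """Heuristically biased split: sort by a covariate so the held-out test
    examples differ systematically from the training examples."""
    ordered = sorted(examples, key=key)
    cut = int(len(ordered) * (1 - test_frac))
    return ordered[:cut], ordered[cut:]

# Toy corpus standing in for a real dataset (hypothetical example data).
corpus = [f"sentence {'token ' * n}".strip() for n in range(1, 101)]

# One random split vs. several biased splits along different covariates.
splits = {
    "random": random_split(corpus),
    "by_char_length": biased_split(corpus, key=len),
    "by_token_count": biased_split(corpus, key=lambda s: len(s.split())),
}

for name, (train, test) in splits.items():
    print(f"{name}: {len(train)} train / {len(test)} test examples")
```

Evaluating a model on each of these splits (or, better, on genuinely independent test sets) and reporting the spread of scores gives a more honest picture of likely real-world performance than a single random held-out set.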