Most state-of-the-art language models are trained on large-scale text corpora that have not been carefully curated or reviewed. As a result, these corpora are likely to contain data quality issues that degrade model performance. For example, Lee et al. find that duplicated training examples in common NLP datasets cause models to emit memorized text more frequently. They present a text deduplication framework and show that models trained on deduplicated data emit memorized text ten times less frequently, reach the same or better accuracy in fewer training steps, and achieve higher evaluation accuracy.
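To make the idea concrete, the sketch below shows one common way to detect near-duplicate documents at corpus scale: MinHash over word shingles, where two documents with similar shingle sets produce similar signatures. This is a minimal illustration of the general technique, not Lee et al.'s actual implementation; the shingle size, number of hash functions, and 0.8 similarity threshold are illustrative assumptions.

```python
import hashlib
from itertools import combinations

def shingles(text, n=5):
    """Break text into the set of overlapping word n-grams ("shingles")."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items, num_hashes=128):
    """MinHash signature: for each seeded hash function, record the
    minimum hash value seen across all shingles of the document."""
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(
                    item.encode(),
                    digest_size=8,
                    salt=seed.to_bytes(8, "little"),  # seed distinguishes the hash functions
                ).digest(),
                "little",
            )
            for item in items
        )
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature slots estimates the Jaccard
    similarity between the two underlying shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Toy corpus: the first two documents are near-duplicates.
docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "an entirely different sentence about training large language models",
]
signatures = [minhash_signature(shingles(d)) for d in docs]

# Flag every pair whose estimated similarity exceeds the threshold.
for i, j in combinations(range(len(docs)), 2):
    sim = estimated_jaccard(signatures[i], signatures[j])
    if sim >= 0.8:
        print(f"docs {i} and {j} look like near-duplicates (sim ≈ {sim:.2f})")
```

The all-pairs comparison above is quadratic in corpus size; production deduplication pipelines typically bucket signatures with locality-sensitive hashing so that only likely matches are compared, and Lee et al.'s framework also pairs approximate matching of this kind with exact substring deduplication.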