Although deep neural networks (DNNs) achieve state-of-the-art results on several standard benchmarks, their performance often degrades in real-world deployments due to concept drift, training artifacts, and noisy data, among other problems. To help model developers evaluate their language models more comprehensively, Stanford and Salesforce researchers have open-sourced Robustness Gym, an extensible toolkit that unifies four standard evaluation paradigms: subpopulations, transformations, evaluation sets, and adversarial attacks. Users can also implement their own evaluation methods on top of a built-in set of abstractions.
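To make two of these paradigms concrete, here is a minimal, self-contained sketch (plain Python, not the Robustness Gym API; the toy model and data are invented for illustration) of evaluating a model on a subpopulation of the eval set and on a transformed copy of it:

```python
# Illustrative sketch -- NOT the Robustness Gym API. A hypothetical
# sentiment "model" evaluated under two paradigms the toolkit unifies:
# subpopulations (slicing the eval set) and transformations (perturbing it).

def toy_model(text):
    # Hypothetical brittle model: case-sensitive keyword match.
    return "positive" if "good" in text else "negative"

eval_set = [
    ("a good movie", "positive"),
    ("good acting, good plot", "positive"),
    ("not my taste", "negative"),
    ("GOOD fun", "positive"),
]

def accuracy(examples):
    return sum(toy_model(x) == y for x, y in examples) / len(examples)

# Subpopulation: restrict evaluation to short inputs (<= 3 words).
short_inputs = [(x, y) for x, y in eval_set if len(x.split()) <= 3]

# Transformation: perturb every input (here, uppercasing) and re-evaluate.
transformed = [(x.upper(), y) for x, y in eval_set]

overall_acc = accuracy(eval_set)          # 0.75
slice_acc = accuracy(short_inputs)        # 2/3 on the short-input slice
transformed_acc = accuracy(transformed)   # 0.25 after uppercasing
```

Comparing the overall score against the slice and transformed scores exposes the robustness gap that aggregate benchmark accuracy hides, which is the kind of analysis the toolkit's paradigms are built around.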