Models that predict well in the training domain may fall short of expectations when deployed to production; that is, there is a gap between performance under the independently and identically distributed (iid) evaluation procedure and behavior in deployment. While this gap can arise when the training setting does not reflect the deployment setting, D'Amour et al. observe that predictive models trained to the same level of iid generalization can perform very differently in real-world environments. They posit that underspecification (the existence of many distinct solutions that achieve equivalent iid performance given the same training data and model specification) may cause this divergence in application-specific generalization. They further demonstrate that underspecification impedes reliable training and can degrade model performance once deployed, and conclude that new training and evaluation techniques are needed to detect and correct it.
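The core idea can be illustrated with a minimal sketch (not from the paper; the synthetic data and training setup here are illustrative assumptions): when two features are perfectly redundant in the training data, the split of weight between them is underspecified, so different random initializations produce distinct models whose iid predictions are nonetheless identical.

```python
import numpy as np

# Toy illustration of underspecification: x2 duplicates x1 in the training
# data, so any split of weight between the two features fits equally well.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)
x1 = (2 * y - 1) + 0.3 * rng.normal(size=n)  # informative feature
X = np.column_stack([x1, x1])                # x2 is an exact copy of x1

def train(seed, steps=2000, lr=0.1):
    """Logistic regression fit by gradient descent from a random init."""
    w = np.random.default_rng(seed).normal(size=2)
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))      # sigmoid predictions
        w -= lr * X.T @ (p - y) / n       # logistic-loss gradient step
    return w

wa, wb = train(seed=1), train(seed=2)

# Both models make identical predictions on iid data (the decision
# depends only on the total weight, which both models learn)...
assert np.array_equal(X @ wa > 0, X @ wb > 0)

# ...but the weight vectors themselves differ, because gradient descent
# never moves along the redundant direction; if the relationship between
# the two features shifts at deployment time, the models diverge.
print("model A weights:", wa)
print("model B weights:", wb)
```

Gradient descent only updates weights along directions the training data constrains, so the component separating the two redundant features is frozen at its random initial value: equivalent iid behavior, different deployment behavior.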