Although Transformer models can achieve SOTA results on NLU tasks, they require large-scale pretraining. To understand what Transformers learn from massive datasets (i.e., what they cannot learn from smaller ones), Zhang et al. apply four probing methods and use learning curves to identify which linguistic abilities these models acquire as the pretraining corpus grows from 1M to 1B words. They find that although Transformers can learn to encode most syntactic and semantic features from 10–100M words, much larger datasets are needed to develop “commonsense knowledge.” These findings also suggest that learning to encode linguistic features is necessary, but not sufficient, for language understanding.
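To make the probing idea concrete, here is a minimal sketch of the general classifier-probing recipe: freeze a pretrained encoder, extract sentence representations, and fit a lightweight classifier to test whether a linguistic feature is decodable from them. The checkpoint name, toy sentences, and past-tense label are illustrative assumptions, not the authors' exact setup; in the paper, probe performance is compared across models pretrained on different corpus sizes to trace learning curves.

```python
# Sketch of a classifier probe on frozen representations (assumed setup, not
# the paper's exact configuration).
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

# Swap in checkpoints pretrained on 1M/10M/100M/1B words to trace a learning curve.
MODEL_NAME = "roberta-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

# Toy probing task (hypothetical labels): is the main verb in the past tense?
sentences = ["She walked to the station.", "She walks to the station.",
             "They finished the report.", "They finish the report."]
labels = [1, 0, 1, 0]

# Mean-pool the final hidden layer into one frozen vector per sentence.
with torch.no_grad():
    enc = tokenizer(sentences, padding=True, return_tensors="pt")
    hidden = model(**enc).last_hidden_state          # (batch, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    features = (hidden * mask).sum(1) / mask.sum(1)  # (batch, dim)

# The probe is deliberately simple; its held-out accuracy is read as evidence
# of whether the feature is encoded in the frozen representations.
probe = LogisticRegression(max_iter=1000).fit(features.numpy(), labels)
print("Probe accuracy on its training set:", probe.score(features.numpy(), labels))
```

In practice the probe would be evaluated on held-out data, and the comparison of interest is how its accuracy changes as the encoder's pretraining corpus grows.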