Prior research has demonstrated that the performance of GPT-3 and other Transformer models improves as model size, dataset size, and compute increase. However, this approach is out of reach for the many ML teams with limited budgets. By incorporating techniques for large-scale distributed training into the model design, Wu et al. develop a 245B-parameter model, Yuan 1.0, that trains efficiently on thousands of GPUs and outperforms other models. In addition, they build a data processing system that filters massive amounts of Internet data to produce the largest Chinese-language corpus to date. Lastly, by leveraging a calibration and label expansion method, they improve zero-shot and few-shot learning performance.