Unlike dense machine learning models, a Mixture of Experts (MoE) selects different parameters for each incoming example. While this form of sparse training can achieve impressive performance, it is less widely adopted due to challenges such as communication costs and training instability. To address these limitations, Fedus et al. present the Switch Transformer, which features a simplified routing algorithm and a streamlined design (the dense feed-forward network is replaced by a sparse Switch FFN layer) that together reduce communication and computational costs. Moreover, they demonstrate that Switch Transformers enable researchers to pre-train models with up to a trillion parameters on the Colossal Clean Crawled Corpus.
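The core of the simplified routing is "switch" (top-1) routing: each token is sent to exactly one expert, chosen by the argmax of the router's softmax, and that expert's output is scaled by the router's gate probability. A minimal sketch in plain Python, assuming hypothetical names (`switch_route`, `router_weights`) and toy callables standing in for the expert FFNs:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def switch_route(token, router_weights, experts):
    """Top-1 'switch' routing sketch (not the paper's implementation).

    token          -- one input vector (list of floats)
    router_weights -- one weight row per expert (n_experts x d_model)
    experts        -- list of callables, toy stand-ins for expert FFNs
    """
    # Router logits: dot product of the token with each expert's weight row.
    logits = [sum(w * t for w, t in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax: one expert
    gate = probs[best]                                    # gate probability
    y = experts[best](token)
    # Scale the chosen expert's output by its gate value, as in switch routing.
    return [gate * v for v in y], best
```

Because only one expert runs per token, the FLOPs per token stay roughly constant no matter how many experts (and thus parameters) the layer holds, which is what lets the parameter count scale so far.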