Multilingual pretrained models, which learn cross-lingually aligned representations, perform exceptionally well on language and speech understanding tasks, including applications involving low-resource languages (i.e., those for which large corpora of training data do not exist). In this context, Bapna et al. develop mSLAM, a model that learns not only multilingual representations but also cross-modal representations of text and speech. mSLAM, which has been pretrained on speech from 51 languages and text from 101 languages, also uses a Connectionist Temporal Classification (CTC) loss on paired speech-text data to minimize interference and capacity dilution.
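To make the CTC component concrete, the sketch below implements the standard CTC forward algorithm (the dynamic program underlying a CTC loss) in plain Python. This is a generic illustration of how CTC scores a target label sequence against per-timestep class log-probabilities, not mSLAM's actual training code; the function name and the toy inputs are hypothetical.

```python
import math

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC negative log-likelihood via the forward algorithm.

    log_probs: list of T timesteps, each a list of per-class log-probabilities
               (e.g. outputs of a speech encoder followed by log-softmax).
    target:    label sequence without blanks.
    """
    # Extended target with blanks interleaved: [a, b] -> [_, a, _, b, _]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S, T = len(ext), len(log_probs)
    NEG = -math.inf

    # alpha[s]: log-probability of all alignments of the first timesteps
    # that end at position s of the extended target.
    alpha = [NEG] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG] * S
        for s in range(S):
            terms = [alpha[s]]               # stay on the same symbol
            if s > 0:
                terms.append(alpha[s - 1])   # advance by one position
            # Skip over a blank, allowed only between distinct labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or the trailing blank.
    total = logsumexp(alpha[S - 1], alpha[S - 2]) if S > 1 else alpha[S - 1]
    return -total
```

For example, with two timesteps, a two-class vocabulary (blank plus one label), and uniform probabilities, three alignments (blank-1, 1-blank, 1-1) collapse to the target `[1]`, so the likelihood is 0.75 and the loss is -log(0.75). In mSLAM, a loss of this form is applied on paired data so that the speech encoder's outputs are pushed toward the corresponding text characters, tying the two modalities together.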