Most benchmarks for multilingual NLP tasks rely on pre-training and task-specific corpora that have been heavily cleaned. However, model developers rarely have access to clean data when building real-world applications. XLM-T, recently developed and open-sourced by researchers from Snap and Cardiff University, is a framework for using and evaluating multilingual language models on actual Twitter data. The framework includes a baseline model, based on XLM-R and further pre-trained on nearly 200M tweets in over thirty languages, as well as code to fine-tune the model on target tasks. In addition, the authors have released a dataset of 24,262 tweets in eight languages labeled for sentiment analysis.
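
For a sense of how the released models can be used, here is a minimal sketch of running the sentiment model, assuming its checkpoint is published on the Hugging Face Hub under the ID cardiffnlp/twitter-xlm-roberta-base-sentiment (as in the project's repository) and that the transformers library is installed:

```python
from transformers import pipeline

# Load the XLM-T sentiment model (assumed Hub ID; see the project's README).
classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",
)

# The same classifier handles tweets in multiple languages.
print(classifier("Huge fan of this new framework!"))   # English
print(classifier("¡Me encanta este nuevo modelo!"))    # Spanish
```

Each call returns a label (negative, neutral, or positive) with a confidence score, so the one model covers sentiment analysis across all of the supported languages.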