Popular pre-trained language models operate on sequences of tokens that represent words or subwords. Although tokenizers can encode text as a sequence of tokens, token-free models that consume raw bytes may be more robust to noisy input and simpler to deploy, since they obviate the need for a text-preprocessing pipeline. Xue et al. have developed, and open-sourced, ByT5, a pre-trained Transformer based on the T5 architecture but modified to process UTF-8 byte sequences directly. Their model matches the performance of the comparable mT5 model on standard benchmarks while outperforming it on tasks involving noisy text or sensitivity to spelling and pronunciation. The authors have released model checkpoints ranging from 300M to 13B parameters.
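To make the byte-level setup concrete, here is a minimal sketch of how text can map to model inputs without any learned vocabulary or subword merge rules. The +3 offset reserving ids for special tokens mirrors the convention described for ByT5, but the exact layout, and the function names, should be treated as illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of byte-level "tokenization": the model consumes raw UTF-8
# bytes, so no vocabulary file or subword segmentation is required.
# The +3 offset (reserving low ids for special tokens such as pad/eos) is an
# assumption modeled on ByT5's convention, not a verified API detail.

def bytes_to_ids(text: str, offset: int = 3) -> list[int]:
    """Encode text as a sequence of integer ids, one per UTF-8 byte."""
    return [b + offset for b in text.encode("utf-8")]


def ids_to_text(ids: list[int], offset: int = 3) -> str:
    """Invert the encoding, skipping any ids below the offset (special tokens)."""
    return bytes(i - offset for i in ids if i >= offset).decode("utf-8", errors="ignore")


ids = bytes_to_ids("Café")   # 'é' becomes two bytes; no <unk> token is ever needed
print(ids)                   # [70, 100, 105, 198, 172]
print(ids_to_text(ids))      # "Café"
```

Because every possible input is covered by the 256 byte values, misspellings, rare words, and unseen scripts never fall out of vocabulary, which is one intuition for why byte-level models hold up better on noisy or spelling-sensitive tasks.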