In previous issues of Projects to Know, we’ve described how Transformer architectures achieve state-of-the-art results on several NLP tasks. However, a recent paper submitted to ICLR 2021 suggests that Transformers can also outperform convolutional neural networks (CNNs) on computer vision tasks. Previous attempts to combine CNNs with self-attention over individual pixels have not scaled efficiently. In contrast, the authors of this paper feed embeddings of image patches directly into a standard Transformer. Although this model does not beat CNNs when trained on smaller datasets, it surpasses them when pre-trained on large datasets (e.g., JFT-300M). The paper has sparked hope among ML researchers for a convergence of NLP and CV architectures.
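To make the patch-embedding idea concrete, here is a minimal sketch (our own illustration, not the paper's code; the function name, patch size, and embedding dimension are assumptions): an image is split into non-overlapping patches, each patch is flattened, and a linear projection turns it into a token the Transformer can consume.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an HxWxC image into flattened, non-overlapping PxP patches."""
    H, W, C = image.shape
    P = patch_size
    # Reshape into a grid of patches, then flatten each patch to a vector.
    patches = image.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)
    return patches

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))         # a single 224x224 RGB image
patches = patchify(image, 16)             # 14 * 14 = 196 patches of length 768
W_embed = rng.random((16 * 16 * 3, 768))  # linear projection (learned in practice)
tokens = patches @ W_embed                # (196, 768): a token sequence for the Transformer
print(tokens.shape)                       # → (196, 768)
```

The resulting sequence of patch tokens plays the same role that word embeddings play in NLP, which is what makes a standard Transformer applicable with almost no architectural change.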