A few weeks ago, the ML research community responded eagerly to a paper suggesting that Transformer models could outperform CNNs on computer vision tasks when input images are represented as a sequence of image patches. Google AI has now open-sourced the models described in the paper, pre-trained on ImageNet-21K, along with code for fine-tuning them in Jax/Flax. When trained on larger datasets, these open-source Vision Transformer models can outperform state-of-the-art CNNs while using fewer computational resources.
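The core idea of representing an image as a sequence of patches can be sketched in a few lines of JAX. This is a hypothetical illustration (the function name and shapes are assumptions, not the official repo's API): a 224×224 RGB image split into non-overlapping 16×16 patches becomes a sequence of 196 flattened tokens, which a Transformer can then process like word embeddings.

```python
import jax.numpy as jnp

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened square patches."""
    h, w, c = image.shape
    p = patch_size
    # Reshape into a grid of patches, then flatten each patch into a vector.
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (h//p, w//p, p, p, c)
    return patches.reshape(-1, p * p * c)        # (num_patches, patch_dim)

# A 224x224 RGB image with 16x16 patches yields 196 tokens of dimension 768.
img = jnp.zeros((224, 224, 3))
seq = image_to_patches(img, 16)
print(seq.shape)  # (196, 768)
```

In the actual Vision Transformer, each flattened patch is additionally projected through a learned linear embedding and combined with position embeddings before entering the Transformer encoder.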