Since 2020, Transformer architectures have displaced convolutional and recurrent neural networks as the dominant approach to computer vision and NLP tasks, respectively. However, to surpass the performance of CNNs on computer vision tasks involving higher-resolution inputs, Vision Transformers must be modified to apply attention within local windows (i.e. the “sliding window” strategy). Rather than continuing to graft ConvNet priors onto Transformers, Liu et al. study the architectural distinctions between the two families and apply their findings to develop a new family of pure ConvNets, ConvNeXt, which matches the performance of Transformers while retaining the simplicity and efficiency of standard ConvNets.
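To make the Transformer-inspired design concrete, here is a minimal NumPy sketch of the kind of block ConvNeXt uses: a large-kernel (7×7) depthwise convolution for spatial mixing, LayerNorm in place of BatchNorm, and an inverted-bottleneck MLP (1×1 convs with 4× expansion and GELU) with a residual connection. All shapes and weights below are made up for illustration, and refinements such as layer scale and stochastic depth are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the channel axis (last), as ConvNeXt does.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def depthwise_conv(x, w):
    # x: (H, W, C), w: (k, k, C); each channel is convolved independently
    # with its own k x k filter ('same' zero padding).
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    H, W, C = x.shape
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k, :]          # (k, k, C)
            out[i, j] = (patch * w).sum(axis=(0, 1))  # per-channel sum
    return out

def convnext_block(x, w_dw, w1, w2):
    # x: (H, W, C) feature map
    y = depthwise_conv(x, w_dw)  # 7x7 depthwise conv: spatial mixing per channel
    y = layer_norm(y)            # LayerNorm instead of BatchNorm
    y = gelu(y @ w1)             # 1x1 conv expanding C -> 4C, then GELU
    y = y @ w2                   # 1x1 conv projecting 4C -> C
    return x + y                 # residual connection

rng = np.random.default_rng(0)
H, W, C = 8, 8, 16
x = rng.standard_normal((H, W, C))
w_dw = rng.standard_normal((7, 7, C)) * 0.1
w1 = rng.standard_normal((C, 4 * C)) * 0.1
w2 = rng.standard_normal((4 * C, C)) * 0.1
out = convnext_block(x, w_dw, w1, w2)
print(out.shape)  # (8, 8, 16)
```

The structure echoes a Transformer block: the depthwise convolution plays the spatial-mixing role of self-attention, while the pointwise expansion and projection mirror the MLP sub-layer.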