Masked autoencoders (MAE) work by masking a portion of the input data and training a model to predict the masked content. In NLP, this technique has enabled researchers to train language models with over 100B parameters that achieve state-of-the-art performance. Although masked autoencoding works well on NLP tasks, computer vision researchers had not achieved comparable results. He et al. present a new MAE for visual representation learning that makes it practical to train large models such as ViT-Huge that still generalize well. Specifically, their MAE masks a very high ratio of random patches from the input images (75% in the paper) and reconstructs the missing patches in pixel space. By leveraging an asymmetric encoder-decoder design in which the encoder operates only on the visible patches (not the masked ones), their MAE improves accuracy while also reducing training time and memory consumption.
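
The random patch masking at the core of this design can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the patch size (16), image size (224), and 75% mask ratio follow the paper's defaults, while the encoder and decoder themselves are omitted. The key point it demonstrates is that only the small visible subset of patches would ever reach the encoder.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Randomly drop a high ratio of patches, keeping only the visible ones.

    patches: (num_patches, patch_dim) array.
    Returns the visible patches plus the kept and masked index sets
    (the masked indices are what the decoder would be asked to reconstruct).
    """
    rng = np.random.default_rng(rng)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx, mask_idx = perm[:n_keep], perm[n_keep:]
    return patches[keep_idx], keep_idx, mask_idx

# Toy example: a 224x224 RGB image cut into 16x16 patches -> 14*14 = 196 patches
image = np.zeros((224, 224, 3))
p = 16
patches = image.reshape(224 // p, p, 224 // p, p, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, p * p * 3)  # (196, 768)

visible, keep_idx, mask_idx = random_masking(patches, mask_ratio=0.75)
print(visible.shape)  # only ~25% of patches go to the encoder: (49, 768)
```

Because the encoder sees just 49 of 196 patches here, its compute and memory cost scale with the visible subset rather than the full image, which is what makes the asymmetric design cheap to train.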