Generative models have been applied to synthesize realistic images, audio, and text, but they have yet to produce results of comparable quality for video. To fill this gap, Yan et al. present VideoGPT, a simple architecture for video generation that combines two components: a VQ-VAE, a likelihood-based generative model that learns discrete latent representations of raw video without supervision, and a GPT-like autoregressive transformer that models those discrete latents. They demonstrate that this architecture matches the performance of state-of-the-art GANs on the BAIR Robot Pushing benchmark and produces realistic samples from complex natural video datasets such as UCF-101 and the Tumblr GIF (TGIF) dataset.
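To make the two-stage idea concrete, here is a minimal sketch of the vector-quantization step at the heart of a VQ-VAE: continuous encoder outputs are snapped to their nearest codebook vectors, yielding discrete token indices that a GPT-like transformer can then model autoregressively. The codebook size, latent dimension, and number of positions below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 512 discrete codes, each a 64-dim vector,
# and encoder outputs for 16 latent positions of a video.
codebook = rng.normal(size=(512, 64))
latents = rng.normal(size=(16, 64))

# Nearest-neighbour lookup: squared distance from every latent
# to every codebook entry, then the index of the closest code.
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
indices = dists.argmin(axis=1)   # discrete tokens, shape (16,)

# During VQ-VAE training the quantized vectors are fed to the decoder;
# the token sequence `indices` is what the transformer later models.
quantized = codebook[indices]    # shape (16, 64)
```

In the full architecture these tokens form a spatio-temporal grid rather than a flat sequence, but the discretization principle is the same: generation reduces to sampling tokens autoregressively and decoding them back to pixels.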