Self-attention architectures enable language models like Transformers to deliver SOTA performance by capturing long-range dependencies among words. Where language models assign meaning to a word by relating it to other words in a sentence, video understanding models may assign meaning to a segment by relating it to the rest of the scene. In this paper, Bertasius et al. replace the convolution operator in video architectures with self-attention, which reduces restrictive inductive biases and more effectively models dependencies that extend beyond a convolution's receptive field. They present the TimeSformer model, which adapts the Vision Transformer (ViT) by viewing video as a sequence of patches extracted from individual frames. This architecture uses “divided attention” to separately apply temporal and spatial attention within each block of the network.
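The divided-attention idea can be sketched in a few lines of numpy. This is a minimal illustration only: it omits the learned query/key/value projections, multi-head structure, residual connections, layer norms, and classification token of the actual TimeSformer block, and the shapes and function names here are assumptions for the sketch. Each patch first attends to the patches at the same spatial location in other frames (temporal attention), then to the other patches within its own frame (spatial attention).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Scaled dot-product self-attention over the second-to-last axis.
    # x: (..., seq_len, dim); identity Q/K/V projections for simplicity.
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def divided_attention(video):
    # video: (T, N, D) — T frames, N patches per frame, D-dim embeddings.
    # Temporal attention: reshape so each of the N patch locations is a
    # sequence of T frame embeddings, and attend along time.
    temporal = self_attention(video.transpose(1, 0, 2)).transpose(1, 0, 2)
    # Spatial attention: within each frame, the N patches attend to each other.
    return self_attention(temporal)

video = np.random.rand(8, 16, 64)  # 8 frames, 4x4 patch grid, 64-dim patches
out = divided_attention(video)     # same shape as the input: (8, 16, 64)
```

Compared with joint space-time attention over all T·N patches at once, whose pairwise scores cost O((T·N)²) per block, the divided scheme costs O(N·T² + T·N²), which is what makes it tractable for longer clips.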