Although Transformers achieve state-of-the-art performance on numerous tasks, they must be fine-tuned to develop new skills: because attention is limited to a fixed context window, they cannot acquire new knowledge immediately at inference time. To address this limitation, Wu et al. propose Memorizing Transformers. Memorizing Transformers store facts as key-value pairs in a long-term memory and retrieve them with an approximate k-nearest-neighbor (kNN) search, so that attention queries can attend to these stored facts alongside the local context. In addition, the long-term memory is non-differentiable, so key-value pairs computed during prior training steps can be reused without backpropagating through them, which improves the scalability and computational efficiency of the technique. With this approach, Memorizing Transformers can memorize facts and achieve significant performance gains.
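To make the mechanism concrete, here is a minimal sketch of kNN-augmented attention, not the authors' implementation. It assumes a single attention head, uses an exact top-k search where the paper uses an approximate kNN index, treats the blending gate as a fixed constant rather than a learned per-head parameter, and omits causal masking; the function and parameter names (knn_augmented_attention, append_to_memory, top_k, gate) are illustrative.

import torch
import torch.nn.functional as F

def knn_augmented_attention(q, k_local, v_local, mem_keys, mem_values,
                            top_k=32, gate=0.5):
    """Combine local attention with attention over top-k retrieved memories.

    q:          (T, d) queries for the current segment
    k_local:    (T, d) keys for the current segment
    v_local:    (T, d) values for the current segment
    mem_keys:   (M, d) non-differentiable long-term memory keys
    mem_values: (M, d) non-differentiable long-term memory values
    """
    d = q.shape[-1]
    scale = d ** -0.5

    # Standard attention over the local context (causal mask omitted for brevity).
    local_scores = (q @ k_local.T) * scale                     # (T, T)
    local_out = F.softmax(local_scores, dim=-1) @ v_local      # (T, d)

    # Retrieve the top-k most similar memory entries per query
    # (exact search here; the paper uses approximate kNN for scale).
    mem_scores = (q @ mem_keys.T) * scale                      # (T, M)
    top_scores, top_idx = mem_scores.topk(top_k, dim=-1)       # (T, k)
    top_vals = mem_values[top_idx]                             # (T, k, d)
    mem_out = (F.softmax(top_scores, dim=-1).unsqueeze(-1) * top_vals).sum(dim=1)

    # Blend memory attention with local attention; the paper learns this gate.
    return gate * mem_out + (1.0 - gate) * local_out


def append_to_memory(mem_keys, mem_values, new_keys, new_values):
    """Store the segment's key-value pairs without gradients so they can be
    reused in later steps (the memory is non-differentiable)."""
    return (torch.cat([mem_keys, new_keys.detach()]),
            torch.cat([mem_values, new_values.detach()]))

Because the stored keys and values are detached from the computation graph, no gradients flow into the memory, which is what allows pairs computed in earlier steps to be reused cheaply as the memory grows.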