Transformer models can achieve SOTA performance on many tasks; however, because self-attention's computational and memory cost grows quadratically with sequence length, ML practitioners typically have to train Transformers on shorter subsequences of longer documents (e.g., books, source code, technical papers). To enable Transformers to work with datasets like code repositories or knowledge bases, researchers recently proposed Memory Transformers, which read and memorize new data at inference time. The authors (still anonymous while the paper is under double-blind review at ICLR) implement Memory Transformers using k-nearest-neighbor search over a non-differentiable cache of memorized facts, stored as key-value pairs. They integrate this cache into a single layer near the top of the Transformer stack, while the lower layers use classical dense attention to parse, summarize, and process information in the input sequence, which is then stored in the external memory.
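
To make the mechanism concrete, here is a minimal sketch of what such a kNN-augmented attention layer could look like in PyTorch: a non-differentiable buffer of (key, value) pairs, exact top-k retrieval (the paper relies on approximate kNN search for scale), and a learned gate that mixes memory attention with local dense attention. The class and parameter names (`KNNMemoryAttention`, `max_memories`, `k`) are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F


class KNNMemoryAttention(torch.nn.Module):
    """Single-head attention augmented with a non-differentiable kNN memory
    of (key, value) pairs. A simplified sketch, not the authors' code."""

    def __init__(self, dim, k=32, max_memories=65536):
        super().__init__()
        self.to_qkv = torch.nn.Linear(dim, 3 * dim, bias=False)
        # Learned gate that mixes memory attention with local attention.
        self.gate = torch.nn.Parameter(torch.zeros(1))
        self.k = k
        self.max_memories = max_memories
        # Non-differentiable cache: plain buffers outside the autograd graph.
        self.register_buffer("mem_keys", torch.empty(0, dim))
        self.register_buffer("mem_values", torch.empty(0, dim))

    @torch.no_grad()
    def _memorize(self, keys, values):
        # Append this segment's (key, value) pairs, evicting the oldest
        # entries once the cache exceeds its capacity.
        self.mem_keys = torch.cat([self.mem_keys, keys.detach()])[-self.max_memories:]
        self.mem_values = torch.cat([self.mem_values, values.detach()])[-self.max_memories:]

    def forward(self, x):
        # x: (seq_len, dim); causal masking and batching omitted for brevity.
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        scale = q.shape[-1] ** -0.5

        # Classical dense attention over the local context.
        local_out = F.softmax(q @ k.T * scale, dim=-1) @ v
        out = local_out

        if self.mem_keys.shape[0] >= self.k:
            with torch.no_grad():
                # Exact top-k search here; a real system would use an
                # approximate-nearest-neighbor index for scalability.
                idx = (q @ self.mem_keys.T).topk(self.k, dim=-1).indices  # (seq_len, k)
            top_keys = self.mem_keys[idx]      # (seq_len, k, dim), no gradient
            top_vals = self.mem_values[idx]
            mem_attn = F.softmax(torch.einsum("nd,nkd->nk", q, top_keys) * scale, dim=-1)
            mem_out = torch.einsum("nk,nkd->nd", mem_attn, top_vals)
            g = torch.sigmoid(self.gate)
            out = g * mem_out + (1 - g) * local_out

        # Memorize the current segment so later segments can retrieve it.
        self._memorize(k, v)
        return out
```

Because the cache lives outside the autograd graph (buffers plus `detach()`/`no_grad`), it can grow very large without increasing the cost of backpropagation, which is what allows the model to read and memorize new data at inference time.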