Although very large deep neural networks yield highly accurate results, deploying DNNs with so many layers is challenging for use cases with low-latency requirements. Quantization, model pruning, and distillation can reduce latency, but these approaches trade off accuracy for latency and do not exploit temporal locality in the request stream. Balasubramanian et al. propose learned caches, which store hidden-layer outputs of the DNN, to exploit this temporal locality and deliver lower-latency inference. They introduce GATI, an end-to-end prediction serving system that, given a pre-trained base DNN and a validation dataset, applies learned caches for fast inference.
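To make the idea concrete, here is a minimal sketch of caching at an intermediate layer. This is not GATI's actual design (GATI learns a cache model over hidden activations; here a simple quantized signature stands in for that learned component), and the layer functions, `signature` helper, and grid size are all illustrative assumptions: on a cache hit the remaining layers are skipped, which is the source of the latency savings.

```python
import hashlib

def layer1(x):
    # toy early layer; stands in for the first layers of a real DNN
    return [2 * v + 1 for v in x]

def layer2(h):
    # toy remaining layers; this is the expensive part we want to skip
    return sum(h)

def signature(h, grid=0.5):
    # illustrative stand-in for a learned cache key: quantize the hidden
    # activations so that nearby inputs map to the same cache entry
    key = tuple(round(v / grid) for v in h)
    return hashlib.sha1(repr(key).encode()).hexdigest()

cache = {}

def infer(x):
    h = layer1(x)            # always run the early layers
    key = signature(h)
    if key in cache:         # hit: return the cached output, skip layer2
        return cache[key], True
    y = layer2(h)            # miss: run the rest of the model
    cache[key] = y           # populate the cache for future requests
    return y, False
```

With temporal locality, a second request that is close to an earlier one hits the cache and avoids the later layers, e.g. `infer([1.0, 2.0])` misses but a subsequent `infer([1.01, 2.0])` hits and reuses the stored output.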