Although large Transformer models can achieve impressive results, they can be prohibitively computationally expensive. Thus, ML engineers may need to optimize these models, including through approaches like pruning and distillation, to generate ROI. However, many optimization approaches trade away accuracy. Mandava et al. instead propose an optimization approach that hinges on reordering feed-forward blocks (which capture the meaning of content) and the comparatively more expensive self-attention blocks (which capture the meaning of context). Transformer-based architectures typically interleave these blocks, but the authors use differentiable neural architecture search to explore alternatives. They find that self-attention blocks may only be useful in the first two-thirds of layers in a network, and that a 1:5 ratio of self-attention to feed-forward layers is sufficient for the Transformer-XL. The resulting PAR Transformer requires 35% less compute time than Transformer-XL while retaining the same perplexity on the WikiText-103 language modeling dataset.
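
The two findings above imply a simple block-ordering recipe. As a rough illustration (not the authors' actual search result or code), the sketch below generates a PAR-style layer pattern that restricts self-attention to the first two-thirds of blocks at roughly a 1:5 attention-to-feed-forward ratio; the function name and spacing heuristic are assumptions for the example.

```python
# Illustrative sketch only: produce a PAR-style ordering of block types
# under two constraints reported for the PAR Transformer --
#   1) self-attention ("SA") appears only in the first two-thirds of layers,
#   2) SA and feed-forward ("FF") blocks occur at about a 1:5 ratio.

def par_block_pattern(n_blocks: int, ratio: int = 5) -> list:
    """Return a list of 'SA'/'FF' labels for an n_blocks-deep stack."""
    cutoff = (2 * n_blocks) // 3  # SA allowed only before this index
    pattern = []
    for i in range(n_blocks):
        # Space SA blocks one per (ratio + 1) layers, early layers only.
        if i < cutoff and i % (ratio + 1) == 0:
            pattern.append("SA")
        else:
            pattern.append("FF")
    return pattern

print(par_block_pattern(12))
# A standard Transformer would interleave SA and FF 1:1; here only
# 2 of 12 blocks are self-attention, and both sit in the early layers.
```

For a 12-block stack this yields two self-attention blocks and ten feed-forward blocks, matching the 1:5 ratio, with all attention concentrated before the two-thirds cutoff.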