It has become nearly impossible to author an issue of Projects to Know without mentioning Transformer-based models, which consistently outperform other architectures on NLP tasks. Although the self-attention mechanism of Transformers overcomes the sequential nature of RNNs, its memory cost grows quadratically with input length, which limits input sequences to about 512 tokens in most contexts. To enable the application of Transformers to tasks that require longer context, such as document classification, Zaheer et al. have released BigBird, which leverages a sparse attention mechanism to reduce the Transformer's quadratic dependency on sequence length to linear. This repo includes the BigBird sparse attention mechanism, the main long-sequence encoder stack, and packaged BERT and seq2seq models that use BigBird attention.
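To see why the dependency becomes linear, here is a minimal NumPy sketch of a BigBird-style attention mask (per-token rather than the blocked form the repo actually implements; the function name and parameters are illustrative, not from the repo). Each query attends only to a few global tokens, a sliding window of neighbors, and a handful of random positions, so the number of attended pairs grows linearly with sequence length instead of quadratically:

```python
import numpy as np

def bigbird_mask(seq_len, window=3, num_global=2, num_random=2, seed=0):
    # Hypothetical sketch of a BigBird-style sparse attention pattern:
    # global + sliding-window + random attention, combined in one mask.
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    mask[:, :num_global] = True   # every token attends to the global tokens
    mask[:num_global, :] = True   # global tokens attend to every token
    for i in range(seq_len):
        lo = max(0, i - window // 2)
        hi = min(seq_len, i + window // 2 + 1)
        mask[i, lo:hi] = True     # sliding window around position i
        # a few random positions per query
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True
    return mask

mask = bigbird_mask(64)
# The number of attended (query, key) pairs is far below the 64 * 64
# required by full attention, and grows linearly in seq_len.
print(mask.sum(), 64 * 64)
```

With fixed window, global, and random budgets, each of the n rows holds O(1) attended positions (plus the constant-size global rows), which is the linear scaling the blurb refers to.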