Research on Transformer models like GPT-3 indicates that language models improve as they grow larger (for example, GPT-3 achieved state-of-the-art results at 175 billion parameters). Consequently, model developers are creating ever larger (and therefore more memory-intensive) models. However, training these models becomes challenging once the model state no longer fits within a single GPU's memory. Sharded training can resolve this bottleneck by splitting the model's parameters, gradients, and optimizer state across multiple GPUs. PyTorch Lightning recently enabled users to apply sharded training by passing a single trainer flag, so users can reduce the memory requirements of their models and research with essentially a one-line change.
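As a minimal sketch of what this looks like in practice: the example below uses a tiny placeholder `LightningModule` (the real benefit only appears with much larger models) and assumes the `ddp_sharded` plugin flag from the Lightning release this feature shipped in; newer releases expose the same behavior through a different trainer argument, so check the documentation for your installed version.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    """A small stand-in model; in practice this would be a large Transformer."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
    train_loader = DataLoader(dataset, batch_size=32)

    # Enabling sharded training is the only change relative to a standard
    # multi-GPU DDP run: the 'ddp_sharded' plugin shards optimizer state and
    # gradients across the 4 GPUs instead of replicating them on each device.
    trainer = pl.Trainer(
        gpus=4,
        accelerator="ddp",
        plugins="ddp_sharded",
        max_epochs=1,
    )
    trainer.fit(LitModel(), train_loader)
```

With a model this small the memory savings are negligible; the flag pays off when optimizer state and gradients dominate GPU memory, as they do for billion-parameter models.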