To build a metaverse in which participants interact through natural, believable avatars, Roblox must animate 3D character faces in real time. To that end, the Roblox team created a deep learning framework that transforms video input into Facial Action Coding System (FACS) controls, which specify a set of facial animation controls for deforming the 3D face mesh. In this post, they discuss the model architecture, which pairs an optimized version of the MTCNN face detection algorithm with a regression network that learns the temporal and spatial features of facial animation. They also review how they train on synthetic animation sequences and linearly combine several loss terms during training. Finally, they outline how they optimize runtime performance through strategies such as unpadded convolutions, which shrink the feature map at each layer.
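
For readers who want to experiment with the face detection stage, a widely used PyTorch port of MTCNN is available in the `facenet-pytorch` package. This is a public implementation, not Roblox's optimized variant, but it shows the basic detect-then-crop workflow the pipeline builds on:

```python
from facenet_pytorch import MTCNN
from PIL import Image

# Public MTCNN implementation; Roblox's version is a custom,
# performance-optimized variant not shown in the post.
mtcnn = MTCNN(keep_all=False)  # keep only the most confident face

img = Image.open("frame.jpg")
boxes, probs = mtcnn.detect(img)  # face bounding boxes + confidences
print(boxes, probs)
```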
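The post does not include training code, but the idea of linearly combining loss terms is straightforward to sketch. Below is a minimal PyTorch example assuming hypothetical positional and velocity terms with hand-picked weights; the term names and weights are illustrative, not Roblox's exact formulation:

```python
import torch
import torch.nn as nn

def combined_loss(pred, target, w_pos=1.0, w_vel=0.5):
    """Linear combination of loss terms over FACS control sequences.

    pred, target: (batch, time, num_controls) tensors of FACS values.
    The specific terms and weights here are illustrative assumptions.
    """
    mse = nn.MSELoss()
    # Positional term: per-frame error on the predicted controls.
    loss_pos = mse(pred, target)
    # Velocity term: error on frame-to-frame differences, which
    # penalizes temporal jitter in the resulting animation.
    loss_vel = mse(pred[:, 1:] - pred[:, :-1],
                   target[:, 1:] - target[:, :-1])
    return w_pos * loss_pos + w_vel * loss_vel
```

A velocity-style term is one common way to exploit temporal structure; weighting it against the positional term trades smoothness against per-frame accuracy.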
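The effect of unpadded convolutions on feature map size can also be seen in a few lines: with `padding=0`, each 3x3 convolution trims two pixels from each spatial dimension, so deeper layers process smaller maps and compute fewer activations. A toy illustration (layer count and sizes are arbitrary, not the production network's):

```python
import torch
import torch.nn as nn

# Two stacks of 3x3 convolutions: one padded (spatial size preserved),
# one unpadded (each layer trims 2 pixels per spatial dimension).
padded   = nn.Sequential(*[nn.Conv2d(8, 8, 3, padding=1) for _ in range(4)])
unpadded = nn.Sequential(*[nn.Conv2d(8, 8, 3, padding=0) for _ in range(4)])

x = torch.randn(1, 8, 64, 64)
print(padded(x).shape)    # torch.Size([1, 8, 64, 64])
print(unpadded(x).shape)  # torch.Size([1, 8, 56, 56]) -- 2 px lost per layer
```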