Although many tools exist to evaluate classical ML and deep learning models, far fewer support the reliable evaluation of reinforcement learning models. Researchers typically compare point estimates of mean and median scores across a diverse suite of tasks, without accounting for the statistical uncertainty in those point estimates. To help users evaluate RL results from a small number of training runs, including by reasoning about statistical uncertainty, researchers from Google have open-sourced RLiable. This easy-to-use Python library provides tools such as bootstrap confidence intervals and performance profiles, along with metrics such as the interquartile mean (IQM) and the optimality gap, for more reliable evaluation of RL model performance.
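To make the underlying ideas concrete, here is a minimal NumPy-only sketch of two of those tools: the interquartile mean (the mean of the middle 50% of run scores, which is more robust to outlier runs than the plain mean) and a percentile-bootstrap confidence interval around it. This illustrates the statistics, not the RLiable API itself; the function names and example scores are hypothetical.

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: the mean of the middle 50% of values."""
    s = np.sort(np.asarray(scores))
    cut = len(s) // 4  # drop the lowest and highest 25% of runs
    return s[cut:len(s) - cut].mean()

def bootstrap_ci(scores, stat=iqm, reps=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a statistic.

    Resamples the runs with replacement `reps` times, recomputes the
    statistic each time, and reports the (alpha/2, 1 - alpha/2)
    percentiles of the resulting distribution.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    boots = [stat(rng.choice(scores, size=len(scores), replace=True))
             for _ in range(reps)]
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Hypothetical normalized scores from 8 training runs on one task.
scores = [0.1, 0.4, 0.5, 0.55, 0.6, 0.62, 0.7, 0.95]
point = iqm(scores)        # mean of the middle four scores -> 0.5675
lo, hi = bootstrap_ci(scores)
print(f"IQM = {point:.4f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

The confidence interval is the key addition: with only a handful of runs, the interval is typically wide, which is exactly the uncertainty that a bare point estimate hides.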