As companies build data platforms on top of distributed systems that include batch workflows and real-time data pipelines, troubleshooting failures becomes increasingly challenging and tedious. To reduce the operational overhead of supporting a complex data platform, Netflix developed Pensive, an auto-diagnosis and remediation system. Here, Vikram Srivastava and Marcelo Mayworm describe Pensive’s two components: Batch Pensive and Streaming Pensive. Batch Pensive is integrated with the data platform’s Scheduler service and applies a regular-expression-based rules engine to classify errors from stack traces; the rules come from an ML process as well as from platform component owners and users. Streaming Pensive detects issues in streaming pipelines by analyzing logs and metrics from Kafka streams, Flink jobs, and the sinks to which those jobs write. In the future, Netflix will adapt Pensive to detect and remediate performance issues in addition to failures.
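
To make the idea of a regular-expression-based rules engine concrete, here is a minimal sketch of how stack traces might be matched against an ordered list of rules. It is an illustration only, not Pensive's implementation: the `Rule` fields, the rule names, the categories, the `retryable` flag, and the patterns themselves are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """One classification rule (all fields are hypothetical, for illustration)."""
    name: str
    pattern: re.Pattern
    category: str    # e.g. "user", "platform", "transient"
    retryable: bool  # whether an automatic retry might plausibly help


# Hypothetical rules; a real system would load these from a rule store.
# Order matters: the first matching rule wins.
RULES = [
    Rule("out_of_memory",
         re.compile(r"java\.lang\.OutOfMemoryError"),
         "platform", False),
    Rule("table_not_found",
         re.compile(r"Table or view not found"),
         "user", False),
    Rule("connection_timeout",
         re.compile(r"connection timed out", re.IGNORECASE),
         "transient", True),
]


def classify(stack_trace: str) -> Optional[Rule]:
    """Return the first rule whose pattern occurs in the stack trace, else None."""
    for rule in RULES:
        if rule.pattern.search(stack_trace):
            return rule
    return None


if __name__ == "__main__":
    trace = "Caused by: java.net.SocketException: Connection timed out"
    match = classify(trace)
    print(match.name if match else "unclassified")
```

In this sketch an unmatched trace returns `None`, which a caller could treat as "unclassified" and route to a human or to a rule-generation process.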