Today, I’m in awe as we celebrate the 100th issue of Projects to Know. I never planned to write a newsletter. A few years ago our Director of Marketing, Malia Powers, suggested that I write a blog post on data quality. After hitting writer’s block three-quarters of the way through a draft, I proposed writing a short newsletter instead. Surely a short newsletter would be easier to author than a comprehensive review of data testing, monitoring, and preparation tools and technologies! What’s more, I felt frustrated by existing newsletters on data science and ML, many of which focused on attention-grabbing headlines about the impact of AI on climate change, cancer, and space exploration. I saw the need for a publication that would highlight research and projects with more practical relevance – content that data practitioners and developers would and could apply.

100 issues later, I’ve confirmed that writing a newsletter is, in fact, more challenging than writing a blog post. Nonetheless, I have zero regrets. More importantly, I’ve validated that data practitioners and developers want to learn more about the projects that their peers are building in academia and industry, whether as research papers, internal platforms, or OSS tools, and across machine intelligence, distributed systems, and data management.

While I sometimes grumble about summarizing lengthy academic papers, surfing engineering blogs, and sifting through GitHub repos every weekend, I’m so inspired by those who create and consume projects. Projects to Know has proven to me that the impact of a paper runs deeper than an academic conference, that the impact of an OSS project cannot be measured in GitHub stars, and that the impact of internal initiatives often extends far beyond a single company. I’m motivated to continue writing by the readers from unicorn tech companies who share how they’re applying the models described in featured papers, and by the creators of OSS database technologies who connect with academic collaborators through PTK.

When I started my career in data in 2009, it was a lonely profession. When I became a manager in 2012, I had so few peers to turn to for mentorship and guidance. But things have changed. Now there are so many data practitioners and developers who want to communicate and collaborate, and who galvanize each other to try more experimental approaches to managing teams or to reveal the skunkworks project they’ve been working on between calls and meetings. Now, a community exists and there are so many more Projects to Know.

Below, we’ve highlighted a few projects from this expansive compilation – the most popular Papers, Projects, and Content from 4 sets of issues. You’ll see that these projects span a range of topics – from privacy-preserving machine learning to literate programming to serverless prediction serving. They’re created by authors from Tennessee to Singapore and from institutions ranging from F500 companies like Nike to seed stage startups like Ponder. It’s hard to skim through this list without feeling awe – there’s just so much to learn and so many people to learn from. 

I’m so excited to celebrate an ever-growing community and corpus of knowledge today. Thanks for your contributions, readership, and support. 

Not a subscriber? Subscribe here to receive 3 academic papers and 3 open source projects that are playing a meaningful role in advancing machine intelligence and data science in your inbox every week! You can view past issues here.

Papers

  1. Self-Supervised GANs via Auxiliary Rotation Loss (Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, Neil Houlsby)
  2. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead (Cynthia Rudin)
  3. Recommending Podcasts for Cold-Start Users Based on Music Listening and Taste (Zahra Nazari, Christophe Charbuillet, Johan Pages, Martin Laurent, Denis Charrier, Briana Vecchione, Ben Carterette)
  4. An Overview of Privacy in Machine Learning (Emiliano De Cristofaro)
  5. Optimizing Prediction Serving on Low-Latency Serverless Dataflow (Vikram Sreekanti, Harikaran Subbaraj, Chenggang Wu, Joseph E. Gonzalez)
  6. Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges (Cynthia Rudin, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, Chudi Zhong)
  7. Fits and Starts: Enterprise Use of AutoML and the Role of Humans in the Loop (Anamaria Crisan, Brittany Fiore-Gartland)
  8. What are the most important statistical ideas of the past 50 years? (Andrew Gelman, Aki Vehtari)
  9. Machine Learning Practices Outside Big Tech: How Resource Constraints Challenge Responsible Development (Aspen Hopkins, Serena Booth)
  10. DAG Card is the new Model Card (Jacopo Tagliabue, Ville Tuulos, Ciro Greco, Valay Dave)

Projects

  1. Fastscript
  2. Presidio
  3. DeepForge
  4. Edator
  5. DeltaPy
  6. Lux
  7. Streambook
  8. Uncertainty Toolbox
  9. openclean
  10. Malloy

Content

  1. The Uncanny Valley of ML
  2. An Opinionated Guide to ML Research
  3. 21 MORE Hot Data Tools and What They Don’t Do
  4. Shopify’s Data Science & Engineering Foundations
  5. Why Production Machine Learning Fails — And How To Fix It
  6. The Meaning of Production in the Data World
  7. Beyond the Notebook and into the Data Science Framework Revolution
  8. Lessons on ML Platforms — from Netflix, DoorDash, Spotify, and more
  9. Moving past Airflow: Why Dagster is the next-generation data orchestrator
  10. Three revolutions in data science