Reflecting on 100 Issues of Projects to Know
Today, I’m in awe as we celebrate the 100th issue of Projects to Know. I never planned to write a newsletter. A few years ago our Director of Marketing, Malia Powers, suggested that I write a blog post on data quality. After hitting writer’s block three-quarters of the way through a draft, I proposed that I write a short newsletter instead. Surely a short newsletter would be easier to author than a comprehensive review of data testing, monitoring, and preparation tools and technologies! What’s more, I felt frustrated by existing newsletters on data science and ML, many of which focused on attention-grabbing headlines about the impact of AI on climate change, cancer, and space exploration. I saw the need for a publication that would highlight research and projects with more practical relevance – content that data practitioners and developers would and could apply.
100 issues later, I’ve confirmed that writing a newsletter is, in fact, more challenging than writing a blog post. Nonetheless, I have zero regrets. More importantly, I’ve validated that data practitioners and developers want to learn more about the projects that their peers are building in academia and industry; as research papers, internal platforms, and OSS tools; in machine intelligence, distributed systems, and data management.
While I sometimes complain about summarizing lengthy academic papers, surfing engineering blogs, and sifting through GitHub repos every weekend, I’m so inspired by those who create and consume projects. Projects to Know has proven to me that the impact of a paper extends beyond an academic conference, that the impact of an OSS project cannot be measured with GitHub stars, and that the impact of internal initiatives often extends far beyond a single company. I’m motivated to continue writing by the readers from unicorn tech companies who share how they’re applying the models described in featured papers, and by the creators of OSS database technologies who connect with academic collaborators through PTK.
When I started my career in data in 2009, it was a lonely profession. When I became a manager in 2012, I had so few peers to turn to for mentorship and guidance. But things have changed, and now there are so many data practitioners and developers who want to communicate and collaborate, who galvanize each other to try more experimental approaches to managing teams or to reveal the skunkworks project they’ve been working on between calls and meetings. Now, a community exists and there are so many more Projects to Know.
Below, we’ve highlighted a few projects from this expansive compilation – the most popular Papers, Projects, and Content from 4 sets of issues. You’ll see that these projects span a range of topics – from privacy-preserving machine learning to literate programming to serverless prediction serving. They’re created by authors from Tennessee to Singapore and from institutions ranging from F500 companies like Nike to seed stage startups like Ponder. It’s hard to skim through this list without feeling awe – there’s just so much to learn and so many people to learn from.
I’m so excited to celebrate an ever-growing community and corpus of knowledge today. Thanks for your contributions, readership, and support.
Not a subscriber? Subscribe here to receive, every week, 3 academic papers and 3 open source projects that are playing a meaningful role in advancing machine intelligence and data science. You can view past issues here.
Papers
- Self-Supervised GANs via Auxiliary Rotation Loss (Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, Neil Houlsby)
- Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead (Cynthia Rudin)
- Recommending Podcasts for Cold-Start Users Based on Music Listening and Taste (Zahra Nazari, Christophe Charbuillet, Johan Pages, Martin Laurent, Denis Charrier, Briana Vecchione, Ben Carterette)
- An Overview of Privacy in Machine Learning (Emiliano De Cristofaro)
- Optimizing Prediction Serving on Low-Latency Serverless Dataflow (Vikram Sreekanti, Harikaran Subbaraj, Chenggang Wu, Joseph E. Gonzalez)
- Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges (Cynthia Rudin, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, and Chudi Zhong)
- Fits and Starts: Enterprise Use of AutoML and the Role of Humans in the Loop (Anamaria Crisan, Brittany Fiore-Gartland)
- What are the most important statistical ideas of the past 50 years? (Andrew Gelman, Aki Vehtari)
- Machine Learning Practices Outside Big Tech: How Resource Constraints Challenge Responsible Development (Aspen Hopkins, Serena Booth)
- DAG Card is the new Model Card (Jacopo Tagliabue, Ville Tuulos, Ciro Greco, Valay Dave)
Projects
Content
- The Uncanny Valley of ML
- An Opinionated Guide to ML Research
- 21 MORE Hot Data Tools and What They Don’t Do
- Shopify’s Data Science & Engineering Foundations
- Why Production Machine Learning Fails — And How To Fix It
- The Meaning of Production in the Data World
- Beyond the Notebook and into the Data Science Framework Revolution
- Lessons on ML Platforms — from Netflix, DoorDash, Spotify, and more
- Moving past Airflow: Why Dagster is the next-generation data orchestrator
- Three revolutions in data science