For at least the past five years, we’ve listened to data practitioners tell similar stories about the data challenges they face. Stories about the company whose metrics table ballooned from 100 to 1,000 KPIs. Stories about the company where the number of dashboards skyrocketed after they introduced a self-service analytics platform, thereby taxing their database’s performance. Stories about the company that could not produce consistent financial results pre-IPO because they had 7 different columns that might describe customer ACV. 

Since the rise of cloud data warehouses and distributed data processing frameworks, companies have embraced a new data science and analytics stack designed to democratize access to data. However, tools intended to make data and insights more accessible engender a new set of problems as the number of pipelines, tables, dashboards, and reports proliferates. Users struggle to find or evaluate data. Inconsistent metrics and events impede effective decision-making. Data stewards cannot easily diagnose issues when data pipelines fail. These new tools empower data analysts, scientists and engineers to do seemingly anything, and yet companies still struggle to become “data-driven.” 

Perhaps counter-intuitively, the answer to these challenges lies in more data, or more specifically, in metadata. Metadata, which describes other data, enables companies to address use cases, including data discovery, lineage and change management, governance, and cost attribution. As such, we expect that the data catalog, powered by these metadata services, will become the system of record for the modern data stack. 

Given these dynamics, we’ve been looking for an investment in metadata management for over three years. During this time, we’ve talked to over 100 data producers and consumers to learn what role metadata can play in their data infrastructure, understand why data cataloging is the most promising initial use case for a metadata engine, and pin down the key requirements for an effective metadata platform. Based on these conversations, we believe that a winning metadata platform will:

  • Offer an open-source metadata engine and an extensible data model that enables distributed but collaborative authorship.
  • Scale through a push-based approach whereby metadata providers push information to a central repository.
  • Support stream-based ingestion and cover online, streaming, and batch data assets.
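
To make the push-based requirement concrete, here is a minimal sketch of the idea, not DataHub’s or Metaphor’s actual API. The names `MetadataChange`, `MetadataRepository`, `push`, `describe`, and `history` are hypothetical, as are the asset and source labels. The point is that each provider sends changes to a central repository as they happen, so the repository never has to crawl the providers.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class MetadataChange:
    """One change pushed by a metadata provider (hypothetical shape)."""
    asset: str    # the data asset being described, e.g. a table name
    aspect: str   # which facet of the asset changed, e.g. "owner" or "schema"
    value: Any    # the new value for that facet
    source: str   # the system that pushed the change


class MetadataRepository:
    """Central store that providers push to; it never polls them."""

    def __init__(self) -> None:
        self._aspects: Dict[str, Dict[str, Any]] = defaultdict(dict)
        self._log: List[MetadataChange] = []

    def push(self, change: MetadataChange) -> None:
        # Keep an append-only log so changes can be replayed or audited,
        # and update the latest view of the asset.
        self._log.append(change)
        self._aspects[change.asset][change.aspect] = change.value

    def describe(self, asset: str) -> Dict[str, Any]:
        # Latest known value of every aspect of one asset.
        return dict(self._aspects.get(asset, {}))

    def history(self, asset: str) -> List[MetadataChange]:
        # Every change ever pushed for one asset, in arrival order.
        return [c for c in self._log if c.asset == asset]


repo = MetadataRepository()
table = "warehouse.sales.orders"  # illustrative asset name
repo.push(MetadataChange(table, "owner", "finance-data", source="scheduler"))
repo.push(MetadataChange(table, "schema", ["order_id", "acv"], source="transform-job"))
repo.push(MetadataChange(table, "owner", "revenue-ops", source="catalog-ui"))

latest = repo.describe(table)
changes = repo.history(table)
```

In this sketch several providers contribute to the same asset, which is one simple reading of distributed but collaborative authorship. A production system would presumably replace the in-memory log with a durable stream.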

After looking at dozens of metadata solutions, we found just one that met all these requirements: DataHub. 

DataHub was developed at LinkedIn by Pardhu Gunnam, Mars Lan, and Seyi Adebajo. For more than half a decade, Pardhu, Mars, Seyi, and the team they led have iterated on the platform, learning from thousands of internal users at LinkedIn and a growing OSS community about users’ metadata needs and how those needs change as companies grow. We believe that DataHub is the only metadata platform that will solve companies’ urgent metadata management problems while also scaling as their use cases and datasets expand. 

Upon seeing strong traction for DataHub among both startups and F500 companies (including financial institutions with the most complex data infrastructure), Pardhu, Mars, and Seyi realized that there was a bigger opportunity to impact the data community and formed Metaphor Data. Today, we are announcing our $5.3 million seed investment in Metaphor Data, which we co-led with our friends at Andreessen Horowitz and an incredible group of thought leaders from the data science and data engineering communities including Bob Muglia, Neha Narkhede, Mike Tuchen, DJ Patil, Hilary Mason, Ben Porterfield, Keenan Rice, Scott Breitenother…

As anyone who has ever tried to build and maintain a data catalog can tell you, it’s a lot harder than it seems. It requires incredibly strong technical expertise in data management and distributed systems; high user empathy for a wide range of stakeholders; and commitment to learning as data stacks continue to evolve. The success of DataHub at LinkedIn and in the OSS community is evidence that Pardhu, Mars, and Seyi have the experience and engineering skills to build an industry-leading metadata management platform. Beyond that, our conversations with this team and their peers show they have the self-awareness, ambition, and dedication to make this company massive. It is their passion for metadata and commitment to serving their colleagues at LinkedIn and beyond that will enable them to overcome any obstacle. 

We’re so thrilled that we get to work with them as they transform how companies can leverage metadata to make decisions. If you’re a data scientist, data engineer, or AI practitioner, be sure to sign up for updates on Metaphor’s progress and general availability.