This post was originally published on Medium and is a collaboration between Pete Soderling (founder, Data Council & the Data Community Fund), Sarah Catanzaro (partner, Amplify Partners) and Abe Gong (co-founder, Superconductive).
In the past few months, the data ecosystem has continued to burgeon as some parts of the stack consolidate and as new challenges arise. Our first attempt to help stakeholders navigate this ecosystem highlighted 25 Hot New Data Tools and What They DON’T Do — clarifying specific problems the featured companies and projects did and did NOT solve.
This effort was positively received by the data science, engineering and analytics communities, and spurred more engagement than we originally anticipated. Further, we were flattered to see the original post motivate other thought-provoking pieces such as 20 Hot New Data Tools and their Early Go-to-Market Strategies.
Taking it Further
Regardless, we quickly recognized our original post did not go far enough as we received dozens of emails, Twitter messages and Slack DMs about other solutions that were not covered. We had shed light on a small corner of the expanding universe of data tools and platforms, yet there was an opportunity to cover even more.
Although we cannot chronicle every additional data tool in just one follow-up post, here we continue our efforts to cultivate this ecosystem by highlighting a few more. The creators of these tools not only occupy meaningful parts of the ever-evolving modern data stack, they also graciously responded to our requests to help us understand where they fit in.
They sound off here in their own words.
- Shipyard: Shipyard is a workflow orchestration platform that helps teams quickly launch, monitor, and share data solutions without worrying about infrastructure management. It lets users create reusable blueprints, share data seamlessly between jobs, and run code without any proprietary setup, all while scaling resources dynamically. Shipyard is NOT a no-code tool and does not support data versioning or data visualization.
- Count: Count is a data notebook that replaces dashboards for reporting and self-service, and supports data transformation. Count is uniquely good at team collaboration, enabling technical and non-technical users to work within the same notebook. Count is NOT a data science notebook.
- Castor: Castor is uniquely good at organizing information about data to support data discovery, GDPR compliance, and knowledge management. Through a plug-and-play solution, Castor builds a comprehensive and actionable map of all data assets. Castor is NOT a data visualization or BI tool.
- Census: Census is uniquely good at syncing data models from a warehouse to business tools like Salesforce. It complements existing warehouses, data loaders & transform tools to enable data teams to drive business operations. It is NOT a no-code tool nor does it automagically model your data; it relies on analysts writing models in SQL.
- Iteratively: Iteratively is a schema registry that helps teams collaborate to define, instrument, and validate their analytics. With Iteratively, you can ship high-quality analytics faster and prevent common data quality & privacy issues that undermine trust. Iteratively is NOT a BI tool, data pipeline, or transformation tool.
- StreamSQL: StreamSQL handles deploying, versioning, and sharing model features. Using your definitions, it generates features for both serving and training. Its registry facilitates re-using features across teams and models. StreamSQL does NOT do model management and is completely agnostic to what you do with the features once you get them.
- Xplenty: Xplenty is a cloud-based ETL solution providing simple visualized data pipelines for automated data flows across a wide range of sources and destinations. It is uniquely good at ingesting large volumes of data, performing code-free data transformations, and scheduling workflows. Xplenty does NOT do event streaming.
- Vectice: Vectice is uniquely good at tracking, documenting, and organizing all AI assets (e.g., datasets, features, models, experiments, dashboards, notebooks) and the underlying domain knowledge needed to successfully manage and scale enterprise AI initiatives. Vectice does NOT provide any runtime or computational environment.
- Snowplow Analytics: Snowplow is a streaming behavioral data engine that is uniquely good at generating event data from dedicated web/mobile/server SDKs, enhancing that data and delivering it to your data warehouse. Snowplow is NOT a data integration (ELT) tool, nor a general streaming framework, nor a BI tool.
- Datafold: Datafold is uniquely good at comparing datasets in a SQL data warehouse or across data warehouses. It enables running “git diff” on a table of any size. Datafold is NOT a database itself (it works on top of existing infrastructure) and it does NOT work with files.
- Splitgraph: Splitgraph is a tool for building, extending, versioning, and sharing SQL databases that is uniquely good at enhancing existing tools. Splitgraph also features a data catalogue including 40K open datasets that can be queried (and joined) with any SQL client. Splitgraph is NOT a database.
- Datacoral: Datacoral is uniquely good at automatically generating data ingestion and transformation pipelines from SQL-based declarative specifications, and automatically capturing and displaying schema-level lineage. Datacoral plays nice with data ingestion tools like Segment, and workflow management tools like Airflow. Datacoral is NOT a data warehouse or a query engine.
- Apache Arrow: Apache Arrow is uniquely good as a language-independent standard for fast in-memory analytical processing and efficient interprocess transport (with minimal overhead) of large tabular datasets. While intended as a computational foundation for data frame projects, it is NOT a replacement for end-user facing tools like pandas.
- Datasaur: Datasaur is built to support NLP labeling via ML-assisted suggestions. It supports workforce management, maintains data privacy, and can be integrated via API to any ML workflow. Datasaur does NOT handle bounding boxes for image/video labeling.
- Datakin: Datakin is a DataOps solution that helps guarantee that data pipelines run without disruption and resulting data can be trusted. It does so by automatically discovering data lineage and providing tools to quickly identify and resolve issues. Datakin is NOT a data catalog nor does it replace any existing data infrastructure components (workflow orchestration, data processing, …).
- ApertureData: ApertureData is a database for visual data like images, videos, feature vectors, and associated metadata like annotations. It natively supports complex searching and preprocessing operations over media objects, and integrates with cloud-based storage and ML frameworks like PyTorch/TensorFlow. ApertureData does NOT extract metadata or features from images/videos.
- Orchest: Orchest is uniquely good at assisting data scientists in interactively building data science pipelines by providing a visual pipeline editing environment in the browser. Pipeline steps are containerized notebooks or scripts. Orchest does NOT replace Jupyter notebooks, provide a no-code tool, or bring its own computational infrastructure.
- Gazette: Gazette is an open source streaming platform that breaks down the divide between batch and real-time data, enabling users to build real-time applications with exactly-once semantics. It offers real-time message streams, which are natively and durably stored as regular files in cloud storage. Gazette is NOT an ETL tool or an analytics platform.
- Coiled Computing: Coiled excels at scaling data science and machine learning workflows in native Python using Dask, which is familiar, widely adopted, and gives great feedback. Coiled is an opinionated way of bursting to clusters and the cloud while staying in the PyData ecosystem. Coiled/Dask is NOT a database or Kubernetes replacement.
- Upsolver: Upsolver is a cloud-native solution for integrating structured and unstructured data on cloud storage. It utilizes a visual, SQL interface for quick and easy data transformation. Upsolver is NOT a Platform as a Service solution that requires developers to write additional code and learn low-level concepts to process data.
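For readers unfamiliar with the "git diff on a table" idea mentioned in the Datafold entry above, here is a minimal sketch of the concept in plain Python. It is not Datafold's API or implementation: the `diff_tables` function, its `key` parameter, and the sample rows are all hypothetical, and the sketch assumes both tables fit in memory and share a unique, sortable key column.

```python
from typing import Any

Row = dict[str, Any]


def diff_tables(old: list[Row], new: list[Row], key: str) -> dict[str, list]:
    """Compare two tables (lists of row dicts) by a unique key column.

    Hypothetical helper for illustration only; not any vendor's API.
    """
    # Index each table by its key so rows can be matched directly.
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}

    # Rows whose key appears on only one side.
    added = [new_by_key[k] for k in new_by_key if k not in old_by_key]
    removed = [old_by_key[k] for k in old_by_key if k not in new_by_key]

    # Rows present on both sides: record which columns differ.
    changed = []
    for k in sorted(old_by_key.keys() & new_by_key.keys()):
        before, after = old_by_key[k], new_by_key[k]
        columns = sorted(
            c for c in before.keys() | after.keys() if before.get(c) != after.get(c)
        )
        if columns:
            changed.append((k, columns))

    return {"added": added, "removed": removed, "changed": changed}


# Made-up sample data: yesterday's and today's snapshot of the same table.
yesterday = [{"id": 1, "plan": "free"}, {"id": 2, "plan": "pro"}]
today = [{"id": 2, "plan": "team"}, {"id": 3, "plan": "free"}]

report = diff_tables(yesterday, today, key="id")
print(report)
```

The report separates added rows, removed rows, and the columns that changed for each matched key, which is the table-level analogue of the added, deleted, and modified lines in a text diff.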
As authors (Sarah, Abe & Pete), we’re collectively brainstorming about how we can extend this effort and create an ever-growing list that helps practitioners find and adopt the right tools, founders align with the best partners, and investors map companies to their investment theses. We look forward to hearing your thoughts on the best medium to continue this exploration with the support of the community.