Data has become the new source code, and we need a way to manage it.

Superb AI + Pachyderm (Images provided by Pachyderm, Superb AI, and the author)

Data is _so_ important that many leading practitioners in AI are pushing for data to be central to ML workflows. For many years, code has been at the center of software development, and we've developed amazing tools and processes to create great software and to become more agile and efficient. But today, with the rise of machine learning software, curating the right data for machine learning applications is the most critical factor. Without tools and processes to develop data sets, we cannot create models that have real-world impact.

Two life cycles of machine learning. (Photo by Pachyderm)

The management of these phases is no small matter. Selecting data sources, generating labels, and retraining models are all key components of the data curation life cycle, yet we often execute them in an ad hoc manner. So what can we do to keep our efforts from snowballing out of control?

We need a data-centric approach. We need tools to support data development.

In this blog post, we'll combine two key tools to improve data-centric operations: Superb AI Suite and Pachyderm Hub. Together, these two tools bring data labeling and data versioning to your data operations workflow.

Superb AI Suite. Label data at scale

Workflow flowchart for the Superb AI Suite. (Photo by Superb AI)

Superb AI offers ML teams a new approach that dramatically reduces the time it takes to deliver high-quality training data sets. Instead of relying on human labelers for most of the data preparation, teams can implement a more time- and cost-effective pipeline with the Superb AI Suite.

Superb AI's ML-first labeling workflow looks like the flowchart above.

  • You start by ingesting all the raw data you've collected into the Suite platform and labeling just a few images.
  • Then you train the Suite's Custom Auto-Label (CAL) feature in an hour, without any custom engineering work.
  • Once that is done, you can apply the trained model to the rest of your data set, labeling it immediately.
  • Alongside its predictions, Superb AI's CAL model also tells you which images need manual review, using patented uncertainty estimation methods.
  • Once you have reviewed and validated the small number of difficult labels, you can deliver the training data.
  • The ML team then trains a model and gives you feedback asking for more data.
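The review step in the list above can be pictured as a simple routing rule. The sketch below is an illustration only: Superb AI's uncertainty estimation is patented and not described in this post, so the `uncertainty` score, the `Prediction` type, and the `0.5` threshold are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    image_id: str
    label: str
    uncertainty: float  # hypothetical score in [0, 1]; higher means less certain


def split_for_review(predictions, threshold=0.5):
    """Split auto-labels into (auto_accepted, needs_review) by uncertainty."""
    auto_accepted = [p for p in predictions if p.uncertainty < threshold]
    needs_review = [p for p in predictions if p.uncertainty >= threshold]
    # Put the least certain items at the front of the review queue.
    needs_review.sort(key=lambda p: p.uncertainty, reverse=True)
    return auto_accepted, needs_review


preds = [
    Prediction("img_001", "cat", 0.08),
    Prediction("img_002", "dog", 0.91),
    Prediction("img_003", "cat", 0.62),
]
accepted, review_queue = split_for_review(preds)
print("auto-accepted:", [p.image_id for p in accepted])
print("review queue:", [p.image_id for p in review_queue])
```

The point of the design is that human attention goes only to the items the model is least sure about, which is why a small number of manual reviews can cover a large data set.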

If your model is not performing well, you need a new set of data to add to your existing ground-truth data set. You run this new data through your pre-trained model and upload the model's predictions to the Suite platform. Suite then helps you find and re-label the failure cases. Finally, you can train Suite's auto-labeling on these edge cases to drive performance.
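To make "finding the failure cases" concrete, here is a minimal sketch for a classification setting. It is not how Suite does it internally (the post does not say); it simply assumes you have model predictions and reviewed labels keyed by a hypothetical image id, and flags every disagreement or missing prediction.

```python
def find_failure_cases(predictions, reviewed_labels):
    """Return sorted image ids where the model disagrees with the reviewed label.

    Both arguments map image id -> class label. An image with no prediction
    at all is also counted as a failure.
    """
    return sorted(
        image_id
        for image_id, label in reviewed_labels.items()
        if predictions.get(image_id) != label
    )


model_output = {"img_001": "cat", "img_002": "dog"}
reviewed = {"img_001": "cat", "img_002": "cat", "img_003": "bird"}
print(find_failure_cases(model_output, reviewed))
```

The resulting ids are the candidates to re-label and feed back into the next training round.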

The cycle repeats over and over again. With each iteration, your model will cover more and more edge cases.

Key capabilities.

  • Quickly create a small amount of initial ground-truth data to start the labeling process
  • Quickly launch any labeling project with customizable auto-labeling technology that adapts to your specific data set
  • Simplify review and validation workflows by using patented uncertainty estimation AI to quickly identify difficult examples for review

You can try the Superb AI Suite for free.

Pachyderm. Versioned data + automation

Schematic diagram of the Pachyderm platform, the data foundation for machine learning. Add MLOps to any toolchain through data versioning and pipelines. (Image courtesy of Pachyderm)

Pachyderm is the data foundation for machine learning. It's GitHub for your data-driven applications.

Under the hood, Pachyderm forms this foundation by combining two key components.

  1. Data versioning and
  2. Data-driven pipelines.

Similar to Git, Pachyderm versioning allows you to organize and iterate on your data using repos and commits. But Pachyderm isn't limited to text files and structured data. Instead, it allows you to version any type of data: images, audio, video, text, anything. The versioning system is optimized to scale to large data sets of _any_ type, which makes it a perfect match for Superb AI, giving you cohesive reproducibility.
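As a mental model for repos and commits over arbitrary binary data, consider the toy sketch below. It is not Pachyderm's implementation or API; `ToyRepo` is a hypothetical in-memory class that stores each file's bytes once under its content hash and treats every commit as an immutable snapshot of path-to-hash mappings.

```python
import hashlib


class ToyRepo:
    """In-memory repo: each commit is an immutable snapshot of path -> content hash."""

    def __init__(self):
        self.blobs = {}    # content hash -> bytes, stored once and shared across commits
        self.commits = []  # list of {path: content hash} snapshots, oldest first

    def commit(self, files):
        """Record a new snapshot on top of the latest one; return its commit index."""
        snapshot = dict(self.commits[-1]) if self.commits else {}
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            self.blobs[digest] = data
            snapshot[path] = digest
        self.commits.append(snapshot)
        return len(self.commits) - 1

    def read(self, commit_id, path):
        """Read a file exactly as it was at the given commit."""
        return self.blobs[self.commits[commit_id][path]]


repo = ToyRepo()
first = repo.commit({"labels.json": b'{"img_001": "cat"}'})
second = repo.commit({"labels.json": b'{"img_001": "dog"}', "img_001.png": b"\x89PNG"})
# Older commits stay readable after newer ones are added.
assert repo.read(first, "labels.json") == b'{"img_001": "cat"}'
assert repo.read(second, "labels.json") == b'{"img_001": "dog"}'
```

Because every past snapshot stays addressable, a model can always be traced back to the exact labels it was trained on, which is what reproducibility means here.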

Pachyderm's pipelines allow you to connect your code to your data repositories. They can automate many components of the machine learning life cycle (such as data preparation, testing, and model training) by rerunning the pipeline when new data is committed. Pachyderm pipelines, together with versioning, provide end-to-end lineage for your machine learning workflow.

Key capabilities.
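The "rerun when new data arrives" behavior can be sketched in a few lines. This is a conceptual toy, not Pachyderm code: `ToyPipeline` is a hypothetical class that remembers how many input commits it has already handled and runs its transform only on the new ones.

```python
class ToyPipeline:
    """Runs a transform once for every input commit it has not seen yet."""

    def __init__(self, transform):
        self.transform = transform
        self.processed = 0  # number of input commits already handled
        self.outputs = []   # one output per processed input commit

    def on_new_data(self, input_commits):
        """Process only the commits added since the last call."""
        for commit in input_commits[self.processed:]:
            self.outputs.append(self.transform(commit))
        self.processed = len(input_commits)
        return self.outputs


# Hypothetical transform: count the labeled files in each commit.
count_files = ToyPipeline(transform=len)
commits = [["img_001.json"], ["img_001.json", "img_002.json"]]
count_files.on_new_data(commits)
commits.append(["img_001.json", "img_002.json", "img_003.json"])
count_files.on_new_data(commits)  # only the third commit is processed here
print(count_files.outputs)
```

Tying each output to the input commit that produced it is what lets versioning and pipelines together explain where any result came from.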

  • Automate and unify your MLOps toolchain
  • Integrate with best-in-class tools for data-centric development
  • Iterate quickly to meet both audit and data governance requirements

You can try Pachyderm Hub for free.

Pachyderm as versioned storage for Superb AI

Superb AI Suite + Pachyderm integration diagram. Data is labeled in the Superb AI Suite. Pachyderm pulls the data set automatically on a cron tick schedule and commits it to the output sample_project data repository. (Image courtesy of the author)

In this integration, we provide an automated pipeline that versions labeled data from Superb AI. This means we get all the benefits of the Superb AI Suite for ingesting our data, labeling it, and managing our agile labeling workflow, **and** all the benefits of Pachyderm for versioning and automating the rest of our ML life cycle.

The pipeline automatically pulls data from the Superb AI Suite into the Pachyderm Hub cluster as a versioned commit. All this requires is securely creating a Pachyderm secret for our Superb AI access API key. The pipeline can then use this key to pull our Superb AI data into a Pachyderm data repository.
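The post does not show how the secret is created. One plausible route, sketched here under stated assumptions, is to generate a Kubernetes-style Secret manifest (Pachyderm secrets are backed by Kubernetes secrets) and hand the resulting file to `pachctl create secret -f`. The secret name `superb-ai-secret`, the key name `access_key`, and the placeholder value are all hypothetical.

```python
import base64
import json


def build_secret_manifest(secret_name, key_name, access_key):
    """Build a Kubernetes-style Secret manifest; values under `data` are base64-encoded."""
    encoded = base64.b64encode(access_key.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": secret_name},
        "type": "Opaque",
        "data": {key_name: encoded},
    }


# Hypothetical names and a placeholder key; never commit a real key to source control.
manifest = build_secret_manifest("superb-ai-secret", "access_key", "REPLACE_ME")
print(json.dumps(manifest, indent=2))
```

Keeping the key in a secret rather than in the pipeline definition means the pipeline spec itself can be shared and versioned safely.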

We automate this with a cron pipeline that pulls new data on a schedule (in our case, every 2 minutes). The output data set is committed to our sample_project data repository.
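A cron pipeline of this kind might be described by a spec like the one built below. This is a sketch, not the author's actual pipeline: the container image, the `pull_labels.py` script, the secret name, and the `SUPERB_AI_ACCESS_KEY` environment variable are hypothetical. The field layout (`input.cron.spec`, `transform.secrets`) follows the Pachyderm pipeline specification as I understand it, and it relies on Pachyderm writing a pipeline's output to a repo with the same name as the pipeline.

```python
import json


def build_cron_pipeline_spec(pipeline_name, image, secret_name, interval="2m"):
    """Build a pipeline spec that runs on a fixed interval and reads an API key from a secret."""
    return {
        # The output repo takes the pipeline's name, here sample_project.
        "pipeline": {"name": pipeline_name},
        # Cron input: triggers a new run on every tick of the schedule.
        "input": {"cron": {"name": "tick", "spec": f"@every {interval}"}},
        "transform": {
            "image": image,  # hypothetical image containing the export script
            "cmd": ["python3", "/app/pull_labels.py"],  # hypothetical script
            # Expose one key of the secret to the container as an environment variable.
            "secrets": [
                {
                    "name": secret_name,
                    "env_var": "SUPERB_AI_ACCESS_KEY",
                    "key": "access_key",
                }
            ],
        },
    }


spec = build_cron_pipeline_spec(
    pipeline_name="sample_project",
    image="example/superb-ai-export:latest",
    secret_name="superb-ai-secret",
)
print(json.dumps(spec, indent=2))
```

The printed JSON could be saved to a file and used to create the pipeline; the 2-minute interval matches the schedule mentioned above and can be changed through the `interval` argument.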

Pachyderm dashboard view of the Superb AI sample data set. (Image by Superb AI)

Once we have our data in Pachyderm, we can build the rest of the MLOps pipeline to test, preprocess, and train our model.

Conclusion

Data-centric development is key to producing machine learning models that work in the real world. Superb AI and Pachyderm work together to turn data preparation into a reliable and agile phase, ensuring that we can keep feeding our models good data and reduce data errors.