
Data-Centric AI with Snorkel AI: The Enterprise AI Platform

Last Updated on October 16, 2021 by Editorial Team

The Data-centric AI Platform for Enterprise AI, Snorkel AI: the image showcases the features of Snorkel Flow, label and build, integrate and manage, train and deploy, and analyze and monitor
Source: Snorkel AI

Data-centric AI enterprise platform showcase: Snorkel Flow by Snorkel AI

Author(s): Towards AI Team

Data-centric AI treats the data as the arbiter of the success or failure of AI deployments in organizations. In contrast, a model-centric approach focuses on the specifics of the model and gives less weight to the data. The prevailing belief was that a model-centric approach was necessary to build accurate models in ML pipelines.

However, recent progress in AI has shown this belief to be false. Instead, building a “good enough” starting point for a model has allowed practitioners to make significant further improvements in training, validation, and accuracy, while also gaining deeper insight into the data being used.

AI organizations and enterprises are beginning to shift their efforts from a model-centric approach toward a data-centric one, as they realize that a data-centric approach to ML pipelines (also referred to as Software 2.0) is crucial for increasing accuracy and for rapid AI application development.

The AI landscape is shifting from model-centric toward data-centric AI. While there are several ways to approach it, the team at Snorkel AI is going all in with Snorkel Flow, the first truly data-centric AI platform. It has roots in state-of-the-art data programming and weak supervision approaches, which aim to move current AI practice toward more robust data-centric approaches and end time-wasting “modelitis.”

In this short read, we will talk about Snorkel AI, the inventors behind the data-centric AI approach.

The Data-centric AI Platform for Enterprise AI, Snorkel AI: The image represents the crawled image from Snorkel AI’s website
Source: Snorkel AI

About Snorkel AI

Snorkel AI started as a research project in the Stanford AI Lab in 2015, initially setting out to explore a higher-level interface to machine learning through training data. The Snorkel team has over 50 peer-reviewed publications at venues including ICML, Nature, ICLR, IEEE, and NeurIPS, which power the core technology behind Snorkel Flow. In addition, Snorkel’s technology has been developed and deployed at Google, Intel, Apple, two of the three top US banks, the US Department of Justice, and other leading organizations.

The Data-centric AI Platform for Enterprise AI, Snorkel AI: The image showcases the data-centric AI approach of Snorkel Flow
Source: Snorkel AI

Snorkel Flow, the First Data-Centric AI Platform

Snorkel Flow is an AI development platform powered by weak supervision [2] and programmatic data labeling [3]. Using Snorkel Flow, data science teams can collaborate with subject matter experts to rapidly build highly accurate AI applications. It also lets users create and manage massive amounts of training data, train models, analyze and improve performance by iterating on training data as well as models, and deploy, all in one platform.
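To make programmatic labeling concrete, here is a minimal sketch in plain Python. It is not Snorkel Flow's API: the labeling functions, the spam/ham task, and the `weak_label` helper are all hypothetical, and a simple majority vote stands in for the learned label model that weak supervision systems typically use to combine noisy votes.

```python
from collections import Counter

# Label values for a hypothetical spam-detection task (illustrative only).
ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text):
    """Vote SPAM when the message contains a URL, otherwise abstain."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    """Vote SPAM when the message mentions a prize or a winner."""
    lowered = text.lower()
    return SPAM if "prize" in lowered or "winner" in lowered else ABSTAIN


def lf_short_greeting(text):
    """Vote HAM for short messages that open with a greeting."""
    words = text.lower().split()
    if not words or len(words) > 5:
        return ABSTAIN
    first = words[0].strip(",.!?")
    return HAM if first in {"hi", "hello", "thanks"} else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_greeting]


def weak_label(text):
    """Combine labeling-function votes by majority; abstain on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting votes with no clear winner
    return ranked[0][0]


if __name__ == "__main__":
    for message in [
        "You are a winner! Claim your prize at http://example.com",
        "Hi, see you tomorrow",
        "Meeting moved to 3pm",
    ]:
        print(weak_label(message), message)
```

Each labeling function encodes one heuristic that a subject matter expert might supply, and examples on which every function abstains are simply left unlabeled. Improving the training set then means editing or adding functions rather than relabeling examples by hand.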

The Data-centric AI Platform for Enterprise AI, Snorkel AI: the image showcases the data-first AI development of the Snorkel Flow platform
Source: Snorkel AI

Where Does Snorkel Flow Excel?

  • Label and build training data programmatically in hours instead of months or even years of hand-labeling.
  • Integrate and manage programmatic training data from all sources, including data cleansing and data slicing.
  • Train and deploy state-of-the-art machine learning models in-platform or via a Python SDK.
  • Analyze and monitor model performance to rapidly identify and correct error modes in the data.

Learn more about the Snorkel Flow platform.

The Data-centric AI Platform for Enterprise AI, Snorkel AI: The image showcases the diverse data-types support of Snorkel Flow
Source: Snorkel AI

SuperGLUE Case Study

Using standard models (i.e., pre-trained BERT) and minimal tuning, the Snorkel AI team was able to leverage critical abstractions for programmatically building and managing training data to achieve a state-of-the-art result on SuperGLUE [5], a newly curated benchmark with six tasks for evaluating “general-purpose language understanding technologies.”

A new state-of-the-art (SOTA) result was achieved using programming abstractions on the SuperGLUE benchmark and four of its component tasks. SuperGLUE is similar to GLUE but contains “more difficult tasks, which are chosen to maximize difficulty and diversity. These tasks are selected to show a substantial headroom gap between a strong BERT-based baseline and human performance.”

The Data-centric AI Platform for Enterprise AI, Snorkel AI: the image showcases SuperGLUE’s functionality, a state-of-the-art research paper by Snorkel AI
Source: Snorkel AI

After reproducing the BERT++ baselines, the team minimally tuned these models (baseline models, default learning rate, and so on) and found that applying these programming abstractions yielded an improvement of +4.0 points on the SuperGLUE benchmark, a 21% reduction of the gap to human performance.
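The two figures above can be related with a short calculation. This sketch assumes the 21% is the score improvement divided by the gap between the baseline and human performance, which is the usual reading of the phrase; the article does not state the baseline or human scores, so only the implied gap is derived here, and it is approximate because both inputs are rounded.

```python
def fraction_of_gap_closed(improvement, baseline_gap):
    """Share of the baseline-to-human gap removed by a score improvement."""
    return improvement / baseline_gap


def implied_gap(improvement, fraction_closed):
    """Baseline-to-human gap implied by an improvement and the share it closed."""
    return improvement / fraction_closed


# Figures quoted in the article: +4.0 points and a 21% gap reduction.
gap = implied_gap(4.0, 0.21)

if __name__ == "__main__":
    print(f"Implied gap to human performance: about {gap:.1f} points")
```

Under that assumption, the baseline sat roughly 19 points below human performance before the programming abstractions were applied.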

The Snorkel team also gives updates on Snorkel’s industry use cases, with even more applications at scale, ranging from Google’s Snorkel DryBell [4] to scientific work in MRI classification and automated genome-wide association study (GWAS) curation, both accepted in Nature Communications.

Industrial Case Studies

  • Google has used Snorkel to replace 100k+ hand-annotated labels in critical machine learning pipelines.
  • A top US bank uses Snorkel Flow to quickly build AI applications that classify and extract information from their documents.
  • Apple built applications with an internal Snorkel-based system that answered billions of queries in multiple languages and processed trillions of records with up to 2.9x fewer errors.
  • A Fortune 500 Biotech pioneer leveraged Snorkel Flow to extract critical chronic disease data from clinical trials, accurately processing 300K documents in minutes.

References

[1] “Snorkel: Rapid Training Data Creation with Weak Supervision.” Alex Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, Chris Ré, Stanford University, https://arxiv.org/pdf/1711.10160.pdf

[2] “Weak Supervision: A New Programming Paradigm For Machine Learning.” Alex Ratner, Paroma Varma, Braden Hancock, Chris Ré, et al., SAIL Blog, 2019, https://ai.stanford.edu/blog/weak-supervision/

[3] “Interactive Programmatic Labeling for Weak Supervision.” Benjamin Cohen-Wang, Stephen Mussmann, Alex Ratner, Chris Ré, KDD, 2019, https://bencw99.github.io/files/kdd2019_dcclworkshop.pdf

[4] “Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale.” Stephen H. Bach, Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Cassandra Xia, Souvik Sen, Alexander Ratner, Braden Hancock, Houman Alborzi, Rahul Kuchhal, Christopher Ré, Rob Malkin, SIGMOD, 2019, https://arxiv.org/abs/1812.00417

[5] “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems.” Alex Wang, et al., 2019. SuperGLUE consists of 6 datasets: the Commitment Bank (CB, De Marneffe, et al., 2019), Choice Of Plausible Alternatives (COPA, Roemmele, et al., 2011), the Multi-Sentence Reading Comprehension dataset (MultiRC, Khashabi, et al., 2018), Recognizing Textual Entailment (merged from RTE1, Dagan, et al., 2006, RTE2, Bar Haim, et al., 2006, RTE3, Giampiccolo, et al., 2007, and RTE5, Bentivogli, et al., 2009), Word in Context (WiC, Pilehvar and Camacho-Collados, 2019), and the Winograd Schema Challenge (WSC, Levesque, et al., 2012).

