

Weak Supervision in Biomedicine

Last Updated on June 23, 2021 by Editorial Team

In this episode of Science Talks, Snorkel AI’s Braden Hancock chats with Jason Fries – a research scientist at Stanford University’s Biomedical Informatics Research lab and Snorkel Research, and one of the first contributors to the Snorkel open-source library. We discuss Jason’s path into machine learning, empowering doctors and scientists with weak supervision, and utilizing organizational resources in biomedical applications of Snorkel.

This episode is part of the #ScienceTalks video series hosted by the Snorkel AI team. You can watch the episode here.

Below are highlights from the conversation, lightly edited for clarity:

How did you get into machine learning?

Jason’s journey into machine learning.

Jason: Originally, during my undergraduate days, I intended to go into medicine. However, I enjoyed engineering classes way more than biology classes, so I shifted and majored in Computer Science and English. I also worked with a research group at the University of Iowa to track infections in hospitals. I suddenly found myself putting sensors on healthcare workers to track their movements and pulling a bunch of data from the hospital infrastructure – all for monitoring and anticipating how diseases spread through the hospital.

I was overwhelmed and excited by all of that data. That was the first time when Machine Learning (ML) as a powerful paradigm took root in my imagination. After that, I was sold and went to graduate school, starting to work with EHR data and doing the standard ML work to get there.

What was the first application that you applied Snorkel to?

Source: Data Programming with DDLite: Putting Humans in a Different Part of the Loop (Ehrenberg et al., 2016)

Jason: I was introduced to the Snorkel concept while Alex was writing his data programming paper for NeurIPS back in 2016. There was a hackathon we ran at a coffee shop called HanaHaus in Palo Alto. A bunch of people got together to test-write some applications, such as tagging disease names in text. Those turned out to be some of the experiments that went into that paper.

That was my first introduction to the idea of generating training data without using hand-labeled methods. It was a crazy paradigm that made zero sense to me at the time.

After that, hackathons became a normal part of the Snorkel development cycle. I would collaborate with the other folks working on Snorkel to think about making Snorkel work for real-world problems across various domains beyond biomedicine. That was the kickoff to weak supervision, which I have been sold on and enamored with since those early days at HanaHaus.

How is weak supervision applied to different modalities?

Jason: It’s different by modality in my experience.

The text modality benefits a lot from the ML ecosystem, with tools like HuggingFace and spaCy. Images, especially medical imaging, have their own challenges. You need to think about how to wrangle different sources of labels effectively. Both text and image modalities can benefit from traditional methods that work and new methods that show promise.

We have worked with the time series domain – analyzing sensor data to detect freezing of gait in Parkinson’s patients. This modality benefits from a controlled experimental setting and requires substantial domain expertise. Down the road, we can certainly think of more novel ways to apply weak supervision there.

How does Snorkel empower domain experts?

Jason: There is a narrow band of application settings where a single individual or a small team of individuals can make a lot of progress in building a fancy model. However, in the healthcare setting specifically, there are various logistical challenges to handling the data.

There have been great efforts from groups like the OHDSI initiative out of Columbia, which takes observational EHR data and puts it into a standard format so that people can develop ML models over it more easily. That type of setting, where you could specify a model and deploy it at multiple hospitals with their data in the same format, is tremendously valuable. In addition, you can plug in auto-generated labels, denoised supervision sources, and AutoML tools to accelerate the model development process.

There’s a long road to travel in terms of the general vision to empower frontline clinicians or biomedical scientists.

Let’s be concrete about this. Doctors often need guidance on how to make clinical decisions. They have strong insights into things, but it’s challenging to translate such intuitions into a formal problem that a model can be trained on.

Source: Shah Lab

In the Shah lab where I’m working, Nigam Shah has a big effort called the green button, which explores marshaling a massive amount of data to provide real-world evidence on demand, more or less. That’s a compelling idea, but it requires a huge effort across data and scientists to enable this pipeline for answering simple guidance questions.

I think COVID-19 has highlighted how people had immediate questions they’d like guidance on:

  • If I send this patient home on supplemental oxygen, will they come back in 30 days?
  • What’s the likelihood that they will be re-admitted for a health problem?

These fog-of-war questions are super important in clinical care, so how can we build ML infrastructure that enables asking them? Unfortunately, that’s where a lot of work still needs to be done.

From a clinical DevOps perspective, one might ask: how do we maintain, monitor, and update ML systems in an institution?

We have organizations like Google and Apple making forays there. But hospitals are still behind in terms of having the right infrastructure or even having the right practices to maintain infrastructure in a clinical setting. Thus, there are many issues left to be resolved before we can reach the dream setting – where clinicians are more empowered to answer their own questions, fed by AI in the background.

How did Snorkel help to repurpose organizational knowledge for ML?

Jason: Concepts like ontologies and concept graphs are part and parcel of medicine, which deals strongly in canonical terminologies (like SNOMED) or medical codes (like ICD-9-CM). Those are standard ways in which information is communicated and exchanged, potentially across hospitals and organizations.

The most straightforward and most readily available concept/terminology classification process is first to extract a bunch of clinical notes, tag a bunch of concepts, then bin them into broader or more fine-grained categories. Such structured knowledge representations are immensely useful.

Source: Ontology-driven weak supervision for clinical entity classification in electronic health records (Nature Communications, April 2021)

If you talk with people in this domain, the classic problem is figuring out how to deal with “cathedral” data artifacts like the Unified Medical Language System. They are very noisy, especially if you want to reason over multiple ones. That’s where Snorkel has been nicely suited.

You can use Snorkel to combine noisy signals, reason about and correct the noise, and get the same benefits as with hand-labeling.
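As a rough, concrete sketch of that combine-and-denoise workflow (assuming the current Snorkel open-source API; the ontology term sets, labeling functions, and DataFrame columns below are hypothetical illustrations, not taken from the paper), ontology lookups can be written as labeling functions and aggregated with the label model:

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, OTHER, DISORDER = -1, 0, 1

# Hypothetical ontology-derived resources standing in for SNOMED/UMLS lookups.
snomed_disorders = {"atrial fibrillation", "pneumonia", "sepsis"}
symptom_terms = {"headache", "cough", "fatigue"}
negation_cues = {"no evidence of", "denies", "negative for"}

@labeling_function()
def lf_in_snomed(x):
    # Vote DISORDER if the candidate span matches an ontology disorder term.
    return DISORDER if x.span.lower() in snomed_disorders else ABSTAIN

@labeling_function()
def lf_symptom_term(x):
    # Vote OTHER for spans that look like symptoms rather than coded disorders.
    return OTHER if x.span.lower() in symptom_terms else ABSTAIN

@labeling_function()
def lf_negated_context(x):
    # Vote OTHER if the surrounding text contains a negation cue.
    return OTHER if any(cue in x.context.lower() for cue in negation_cues) else ABSTAIN

df_train = pd.DataFrame({
    "span": ["atrial fibrillation", "pneumonia", "headache"],
    "context": [
        "history of atrial fibrillation",
        "no evidence of pneumonia on imaging",
        "patient reports headache",
    ],
})

# Apply the labeling functions, then let the label model estimate their accuracies
# and produce denoised, probabilistic training labels for a downstream end model.
lfs = [lf_in_snomed, lf_symptom_term, lf_negated_context]
L_train = PandasLFApplier(lfs=lfs).apply(df=df_train)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=500, seed=123)
probs = label_model.predict_proba(L=L_train)
```

The resulting probabilistic labels can then be used to train whatever end model fits the task, which is where the benefits comparable to hand-labeling show up.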

Other classically hard problems, such as relation extraction and link prediction across concepts, are well suited to knowledge graph representations. You can pour ontological concepts into Snorkel to build such applications.

Ontologies represent the canonical input of virtually every clinical concept pipeline. Snorkel can simply slide into existing workflows and provide practical benefits with minimal changes to the infrastructure in place.

What are the challenges of dealing with rapidly changing data?

Jason: COVID-19 has revealed this exact setting. Let’s look at a concrete example.

When the pandemic initially started, there were many questions about risk factors, which are crucial to figuring out who needs to be tested for COVID. For example, it was unclear which symptoms are strongly associated with COVID, how to discriminate COVID from other respiratory illnesses, whether a patient lives with someone who has a confirmed diagnosis, whether they have recently traveled, and so on. Unfortunately, these things do not show up in structured EHR data. There’s also no setup to answer these questions quickly.

Source: Profiling Presenting Symptoms of Patients Screened for SARS-CoV2 (Medium, April 2020)

This scenario is a great use case for Snorkel. In a day or two, people could look at a small sample of notes and generate enough heuristic rules, such that when combined, they did a solid job of extracting the correct information.
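As a rough illustration of what such quickly written heuristic rules could look like (the symptom lexicon, regexes, and note_text field below are hypothetical placeholders, not the rules used at Stanford), a couple of keyword- and negation-based labeling functions over raw note text might be sketched like this, with their votes combined by the same label model as in the earlier example:

```python
import re
from snorkel.labeling import labeling_function

ABSTAIN, ABSENT, PRESENT = -1, 0, 1

# Hypothetical fever lexicon and negation cues drafted from a quick review of notes.
FEVER_TERMS = r"\b(fever|febrile|chills)\b"
NEGATION = r"\b(denies|no|without|negative for)\b[^.]*"

@labeling_function()
def lf_fever_keyword(x):
    # Vote PRESENT when a fever-related term appears anywhere in the note.
    return PRESENT if re.search(FEVER_TERMS, x.note_text, re.I) else ABSTAIN

@labeling_function()
def lf_fever_negated(x):
    # Vote ABSENT when a fever mention is preceded by a negation cue
    # in the same sentence (e.g., "denies fever or chills").
    return ABSENT if re.search(NEGATION + FEVER_TERMS, x.note_text, re.I) else ABSTAIN
```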

Specifically for the symptom example, we worked with the Gates Foundation and ran Snorkel daily in real time on emergency department notes generated by Stanford. Our goal was to extract symptoms and summarize the state of what people were presenting in the emergency room. The data has since been used by Carnegie Mellon for various modeling purposes.

As things change on the fly, you need a paradigm like weak supervision to respond appropriately.

There are many scenarios where things change – due to crazy pandemics, shifting practices, changing behavior, etc. These changes need to be baked into your training data. There are many interesting settings where you need the flexibility in controlling what is fed to your model training procedure.

Where To Follow Jason: Twitter | Github | Google Scholar | Academic Page | ResearchGate

And don’t forget to subscribe to our YouTube channel for future ScienceTalks or follow us on Twitter, LinkedIn, Facebook, and Instagram.


Weak Supervision in Biomedicine was originally published at Snorkel AI on June 16, 2021.
