NLP-Powered Data Extraction for SLRs and Meta-Analyses
Last Updated on July 20, 2023 by Editorial Team
Author(s): Gaugarin Oliver
Originally published on Towards AI.
Getting the desired data out of published reports and clinical trials and into systematic literature reviews (SLRs), a process known as data extraction, is just one of a series of incredibly time-consuming, repetitive, and potentially error-prone steps involved in creating SLRs and meta-analyses.
It's also an area that stands to benefit greatly from automated or semi-automated machine learning (ML) and natural language processing (NLP) techniques. These techniques are increasingly used to speed up and improve the precision of several SLR process steps, including data extraction, primary study selection, and research question formulation. Over the past several years, researchers have repeatedly attempted to improve the data extraction process through various ML techniques.
That's great news for researchers who often work on SLRs, because the traditional process is mind-numbingly slow: An analysis from 2017 found that SLRs take, on average, 67 weeks to produce. A separate 2018 study found that each SLR consumes nearly 1,200 total hours per project. And that's not even counting the material costs of SLRs, whose production often comes with a price tag of up to a quarter of a million U.S. dollars apiece.
It's for these reasons that practically everyone involved has a vested interest in SLR automation, an ongoing effort at the intersection of evidence-based medicine, data science, and artificial intelligence (AI). As the capabilities of high-powered computers and ML algorithms have grown, so have opportunities to improve the SLR process. Even minor improvements in speed or precision can significantly impact researchers who primarily rely on manual approaches.
Challenges of manual data extraction for SLRs
While SLRs and meta-analyses have always required an outsized investment in time and effort, the potential workload of every project has only gone up due to the rapidly growing evidence base available to researchers.
While this exponential growth of the evidence base is undoubtedly a good thing overall, it makes including all available data through manual processes increasingly impractical. Speed is crucial, and the rapid pace of publication also means every SLR needs constant updating to stay relevant. Fittingly, preprint repositories such as arXiv now offer rapid publication after a relatively short moderation process (as opposed to full peer review).
Manual data extraction by a human is prone to error, especially in such a fast-paced research climate. "Research has found a high prevalence of errors in the manual data extraction process due to human factors such as time and resource constraints, inconsistency, and tediousness-induced mistakes," write van Dinter et al. (2021).
Additionally, data extraction can be more difficult to automate than other SLR elements. The selection of primary studies, for example, is easily achievable using study abstracts only, while data extraction requires access to (and the ability to read intelligently) full-text clinical documents.
(Semi-)automated data extraction for SLRs through NLP
Researchers can deploy a variety of ML and NLP techniques to help mitigate these challenges. The most commonly automated SLR elements are primary study selection and identification of relevant research, using bag-of-words, term frequency-inverse document frequency (TF-IDF), and n-gram techniques, among others.
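To make the TF-IDF approach concrete, here is a minimal sketch of how candidate abstracts might be ranked against a description of the review's scope. The abstracts and query are hypothetical, and real screening pipelines use a proper tokenizer and a library implementation rather than this hand-rolled one; the point is only to show the weighting and similarity mechanics.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF vectors (as sparse dicts) for tokenized documents."""
    n = len(docs)
    # Document frequency: how many documents contain each term
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (count / len(doc)) * math.log(n / df[t])
               for t, count in tf.items()}
        vectors.append(vec)
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical abstracts, ranked against a query describing the review scope
abstracts = [
    "randomized trial of statin therapy for cardiovascular risk".split(),
    "statin therapy outcomes in a randomized controlled trial".split(),
    "deep learning for image segmentation benchmarks".split(),
]
query = "randomized trial statin".split()

vecs = tfidf_vectors(abstracts + [query])
scores = [cosine(vecs[-1], v) for v in vecs[:-1]]
# The two clinical abstracts score above zero; the unrelated one scores zero
print(scores)
```

In practice, abstracts scoring above a tuned threshold would be forwarded to a human reviewer, which is where the "semi-automated" framing comes from.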
Data extraction has also been automated in several studies using the above techniques, but more commonly via named entity recognition (NER) using models such as CRF, LSTM, BioBERT, or SciBERT. Data extraction is essentially an exercise in text classification, where domain-specific words within a document are judged for relevance based on the word itself, the words surrounding it, and the word's position in the document, making NER techniques ideal.
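Whatever the underlying model, NER systems typically emit one BIO tag per token (B- for the beginning of an entity, I- for its continuation, O for everything else), and a small decoding step turns those tags into extracted spans. The sketch below shows that decoding step on a made-up sentence with hypothetical DRUG and DOSE labels; the tags here are hand-written stand-ins for model output, not the output of any particular model.

```python
def decode_bio(tokens, tags):
    """Collapse per-token BIO tags (as produced by an NER model)
    into (entity_text, label) spans."""
    entities, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close any entity already in progress
                entities.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)  # extend the current entity
        else:
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

# Hypothetical tagger output for a sentence from a trial report
tokens = ["Patients", "received", "40", "mg", "atorvastatin", "daily"]
tags   = ["O", "O", "B-DOSE", "I-DOSE", "B-DRUG", "O"]
print(decode_bio(tokens, tags))
# → [('40 mg', 'DOSE'), ('atorvastatin', 'DRUG')]
```

The extracted spans, with their positions, are what gets written into the review's data extraction tables for human verification.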
BioBERT and similar BERT-based NER models are trained and fine-tuned on a biomedical corpus (or dataset) such as NCBI Disease, BC5CDR, or Species-800. Input data for NER models is typically formatted as a Pandas DataFrame or as text files in CoNLL format (i.e., a text file with one token and its label per line).
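For readers unfamiliar with the CoNLL layout, here is a small sketch that serializes annotated sentences into that shape: one "token tag" pair per line, with a blank line separating sentences. The sentences and labels are invented for illustration.

```python
def to_conll(sentences):
    """Serialize sentences of (token, tag) pairs into CoNLL-style text:
    one 'token tag' pair per line, sentences separated by blank lines."""
    blocks = []
    for sent in sentences:
        blocks.append("\n".join(f"{tok} {tag}" for tok, tag in sent))
    return "\n\n".join(blocks) + "\n"

# Hypothetical annotated sentences
sentences = [
    [("Aspirin", "B-DRUG"), ("reduced", "O"), ("mortality", "O")],
    [("No", "O"), ("adverse", "B-EVENT"), ("events", "I-EVENT")],
]
conll_text = to_conll(sentences)
print(conll_text)
```

A file in this shape can then be loaded by most NER training toolkits, or read back into a DataFrame with one row per token.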
The extracted data elements can be automatically cited to the source by annotating the passage of interest so that a human reviewer can easily verify and add it to the review.
Other NLP techniques commonly used to automate parts of the SLR process include text vectorization (used in research identification and primary study selection), singular value decomposition (primary study selection), and latent semantic analysis (primary study selection).
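Latent semantic analysis ties those pieces together: a term-document matrix is reduced with a truncated SVD so that documents can be compared in a low-dimensional "concept" space. The toy matrix below is hand-built for illustration; in a real pipeline it would be the TF-IDF matrix of candidate abstracts.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents);
# docs 0 and 1 are clinical, doc 2 is unrelated imaging work.
X = np.array([
    [2, 1, 0],   # "statin"
    [1, 2, 0],   # "trial"
    [1, 1, 0],   # "therapy"
    [0, 0, 2],   # "segmentation"
    [0, 0, 1],   # "network"
], dtype=float)

# Latent semantic analysis: truncated SVD keeps the k strongest concepts
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one row per document, k concept dims

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The two clinical documents land close together; the imaging one does not
print(cos(doc_vecs[0], doc_vecs[1]), cos(doc_vecs[0], doc_vecs[2]))
```

The benefit over raw TF-IDF matching is that documents sharing related (not identical) vocabulary can still end up near each other in the reduced space.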
New research has also begun looking at deep learning algorithms for automating systematic reviews, according to van Dinter et al. This includes a 2020 paper that performed feature extraction using a denoising autoencoder alongside a deep neural network, and used a flattened vector with support vector machines to evaluate study relevance.
NLP for SLR data extraction in action
Several studies have shown the viability of automated extraction through NLP models. A study by Bui et al. used ML and NLP to generate automatic summaries of full-text articles, achieving higher recall (91.2% versus 83.8%) and precision (a 59% density of relevant sentences versus 39%) than human summarizers. Another study, albeit from 2010, used an automated extraction system called ExaCT to assist researchers in locating and extracting critical pieces of information from full-text randomized controlled trial (RCT) documents. It extracted data with 94 percent partial accuracy and returned fully correct data nearly 70 percent of the time.
And another study used ML models to successfully extract sentences relevant to PICO elements from full-text articles (identifying and extracting PICO elements to develop clinical research questions is another very time-consuming element of SLR production).
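To give a feel for what PICO sentence extraction involves, here is a deliberately simple keyword baseline. The cue phrases and example sentence are invented; the published systems mentioned above use trained classifiers rather than keyword lists, but the input/output shape (sentence in, PICO elements out) is the same.

```python
# Illustrative cue phrases for each PICO element (hypothetical, not
# taken from any published system)
PICO_CUES = {
    "Population":   ["patients", "adults", "participants", "enrolled"],
    "Intervention": ["received", "treated with", "administered"],
    "Comparison":   ["placebo", "control group", "versus"],
    "Outcome":      ["mortality", "reduction", "improved", "risk of"],
}

def tag_pico(sentence):
    """Return the PICO elements whose cue phrases appear in the sentence."""
    text = sentence.lower()
    return [el for el, cues in PICO_CUES.items()
            if any(cue in text for cue in cues)]

s = "450 adults were enrolled and received metformin or placebo."
print(tag_pico(s))
# → ['Population', 'Intervention', 'Comparison']
```

A baseline like this mostly illustrates why ML models are needed: cue lists miss paraphrases and mislabel sentences that merely mention a cue word in passing.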
Speed up SLR creation and improve accuracy with CapeStart
NLP techniques hold tremendous promise for unlocking greater value and efficiencies in SLRs and meta-analyses. But medical researchers aren't machine learning engineers and don't have the time or bandwidth to build these solutions themselves.
That's why CapeStart's data scientists and NLP-aided systematic literature review solutions can be a force multiplier for your organization, helping automate or semi-automate several elements of the time-consuming and expensive SLR, PICO, or meta-analysis processes.
We've got deep experience working with biomedical datasets such as NCBI Disease, BC5CDR, and Species-800, along with deploying and fine-tuning NER models based on BioBERT and SciBERT. CapeStart's data preparation and annotation team can also transform any dataset into the required Pandas DataFrame or CoNLL formats for input to train NER models.
Contact us today to explore our wide range of healthcare-focused NLP and data annotation solutions, from custom NLP models for specific healthcare applications to data preparation for training, tweaking, and fine-tuning model accuracy.