
From UNet to BERT: Extraction of Important Information from Scientific Papers

Last Updated on September 21, 2022 by Editorial Team

Author(s): Eman Shemsu

Originally published on Towards AI.

Objective

Despite the increasing number of papers published each year, relatively little work applies machine learning to scientific papers themselves, even as models become more capable. Although scientific papers can be hard to understand even for humans, they have a distinctive structure, formatting, and language that set them apart from other documents.

Photo by Andrea De Santis on Unsplash

In this project, I will demonstrate how to extract and summarize important information from such documents using a multidisciplinary approach that combines Natural Language Processing and Computer Vision. Please note that this is a continuation of my previous blog on the extraction of important information from scientific papers.

Recap

I have covered the computer vision part in my previous blog here. The following is a summary of what was covered.

The computer vision part contains a UNet-OCR pipeline that does the following:

  • Extraction of important sections learned by the UNet, and
  • Conversion of the learned sections into text using OCR

The UNet generates a black-and-white mask that highlights the Title, Author(s), and Abstract sections of a given paper. A post-processing step follows, in which the masked image is reconstructed into an RGB image. This image is then passed to an optical character recognition (OCR) engine to be converted into text. I used Tesseract with default parameters for the OCR.
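
To make the post-processing step more concrete, here is a minimal sketch of how a predicted mask could be applied to the page and passed to Tesseract. It uses OpenCV and pytesseract purely for illustration; the file names are hypothetical, and the repository's actual post-processing (and the subprocess-based Tesseract call shown later) may differ.

import cv2
import pytesseract

# Hypothetical file names, for illustration only
page = cv2.imread('paper_page.jpg')                        # original page image
mask = cv2.imread('unet_mask.png', cv2.IMREAD_GRAYSCALE)   # UNet prediction

# Resize and binarize the predicted mask to match the page
mask = cv2.resize(mask, (page.shape[1], page.shape[0]))
_, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)

# Keep only the regions the UNet highlighted (Title, Author/s, Abstract)
sections = cv2.bitwise_and(page, page, mask=mask)

# Run Tesseract with default parameters on the reconstructed image
text = pytesseract.image_to_string(sections, lang='eng')
print(text)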

Figure 1: UNet-OCR pipeline. Image by Author

Text Summarization with BERT

Text summarization is a machine learning technique that aims to generate a concise and precise summary of a text without loss of overall meaning. It is a popular and much-researched domain of Natural Language Processing.

There are two approaches to text summarization:

  • Extractive text summarization
  • Abstractive text summarization

In extractive text summarization, the summary is generated using excerpts from the text; no new text is generated, only existing text is used. This can be done by scoring each sentence and returning the k most important sentences from the text.
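
As a toy illustration of this scoring idea (not the BERT-based approach used later in this post), the sketch below scores each sentence by the average frequency of its words and returns the k highest-scoring sentences; the sentence and word splitting are deliberately naive.

import re
from collections import Counter

def naive_extractive_summary(text, k=3):
    # Split into sentences and count word frequencies over the whole text
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))

    # A sentence's score is the average frequency of its words
    def score(sentence):
        tokens = re.findall(r'\w+', sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    top = set(sorted(sentences, key=score, reverse=True)[:k])
    # Keep the chosen sentences in their original order
    return ' '.join(s for s in sentences if s in top)

print(naive_extractive_summary('Cats purr. Cats nap often. Dogs bark loudly.', k=2))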

Abstractive summarization works the way a human would summarize content, i.e., by explaining it in one's own words. The summary may include words, phrases, and sentences that do not appear in the text.

I followed an extractive text summarization approach for this project and used the open-source bert-extractive-summarizer [1] repository.

BERT [2] stands for Bidirectional Encoder Representations from Transformers [3]. It is a model that learns sentence representations by training on a very large corpus. BERT uses only the encoder part of the Transformer's encoder-decoder architecture.

Figure 2: Transformer architecture. Image by Vaswani et al.

In short, the encoder accepts the word embeddings of an input text. Positional encoding is added to each word embedding so that the representation reflects the word's position in the sequence. The attention block (multi-head attention) then computes an attention vector for each word, and these vectors are fed to a feed-forward neural network one at a time. The result is a set of encoded vectors, one per word, which is the final output of the BERT model.
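
As a rough illustration of the attention step described above, here is a minimal NumPy sketch of scaled dot-product attention from the Transformer paper [3]; the random matrices stand in for learned projections of the word embeddings, so this is only a sketch, not BERT itself.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # weighted sum of values

# Toy example: 4 "words" with embedding size 8
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)           # (4, 8): one vector per word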

The bert-extractive-summarizer repository is based on the paper Leveraging BERT for Extractive Text Summarization on Lectures [4]. The paper uses BERT to generate sentence representations and then applies the K-means algorithm to cluster these representations around k concepts. The k sentences closest to their respective centroids are returned as a representative summary, one per cluster.
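
The sketch below mimics that idea with scikit-learn: cluster the sentence embeddings with K-means and keep the sentence nearest to each centroid. The embed_sentences argument is a hypothetical stand-in for whatever BERT sentence representation is used; in practice, the bert-extractive-summarizer library handles all of this internally.

import numpy as np
from sklearn.cluster import KMeans

def centroid_summary(sentences, embed_sentences, k=3):
    # embed_sentences is assumed to return an (n_sentences, dim) array of embeddings
    embeddings = embed_sentences(sentences)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)

    # For each centroid, pick the index of the closest sentence embedding
    chosen = []
    for centroid in kmeans.cluster_centers_:
        distances = np.linalg.norm(embeddings - centroid, axis=1)
        chosen.append(int(distances.argmin()))

    # Return the selected sentences in their original order
    return [sentences[i] for i in sorted(set(chosen))]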

Getting started

Start by cloning my repository available at Dagshub[5].

git clone https://dagshub.com/Eman22S/Unet-OCR-2.0.git 

Install DVC.

pip install dvc

Install dagshub.

pip install dagshub

Install Tesseract.

sudo apt install tesseract-ocr-all -y

Install Bert.

pip install bert-extractive-summarizer

Configure your DVC origin.

dvc remote modify dv-origin --local auth basic
dvc remote modify dv-origin --local user {your Dagshub username}
dvc remote modify dv-origin --local password {your Dagshub password}

If you feel confused about configuring your DVC, refer to this documentation.

Next, pull my tracked dataset into your system using this command.

dvc pull -r dv-origin

Run Tesseract on your image inside a Python shell.

import subprocess

# Run Tesseract on the post-processed image; '-' writes the OCR text to stdout
result = subprocess.run(['tesseract', 'postprocessed/1409.1556_0.jpg',
                         '-', '-l', 'eng'], stdout=subprocess.PIPE)
# Decode the raw bytes into a string for the summarizer
result = result.stdout.decode('utf-8')

Pass this result to the Summarizer model.

from summarizer import Summarizer

body = result                           # OCR text from the previous step
model = Summarizer()

result = model(body, ratio=0.2)         # Summarize to roughly 20% of the sentences
result = model(body, num_sentences=3)   # Or return exactly 3 sentences
print(result)

The code above calls the model with your text (body) and num_sentences=3, which means the model will summarize your text in three sentences. You can increase or decrease the number of sentences depending on your use case. If you are on a Google Colab notebook, refer to this notebook.
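
Putting the OCR and summarization steps together, a small convenience wrapper (hypothetical, not part of my repository) could look like this:

import subprocess
from summarizer import Summarizer

def summarize_paper_image(image_path, num_sentences=3):
    # OCR the post-processed page image with Tesseract, writing the text to stdout
    ocr = subprocess.run(['tesseract', image_path, '-', '-l', 'eng'],
                         stdout=subprocess.PIPE)
    text = ocr.stdout.decode('utf-8')

    # Summarize the OCR text with the BERT extractive summarizer
    model = Summarizer()
    return model(text, num_sentences=num_sentences)

print(summarize_paper_image('postprocessed/1409.1556_0.jpg'))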

What did I use?

  1. Colab notebook (free version)
  2. Dagshub as my online repository
  3. DVC to track my dataset
  4. Bert-extractive-summarizer

Project Pipeline

The figure below shows the output on 1409.1556_0.jpg, found here.

Figure 3: BERT output on the image. Image by Author

Conclusion

You can train the UNet to extract any set of sections from scientific papers, but for this experiment, I chose the Abstract, Author(s), and Title sections. BERT gives us the final result, which is a summary of the text found in those sections. While most papers have more or less similar formatting, structure, and sections, the challenge for this project was to extract a given section even though that section may or may not be present in a given paper.

With that, we have reached the final milestone. Congratulations on making it this far! Feel free to reach out with any questions or feedback. I would be happy to hear from you!

References

[1] https://github.com/dmmiller612/bert-extractive-summarizer

[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

[3] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).

[4] Miller, Derek. "Leveraging BERT for extractive text summarization on lectures." arXiv preprint arXiv:1906.04165 (2019).

[5] https://dagshub.com/docs/


