The NLP Cypher | 02.14.21

Last Updated on July 24, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.

The Vision of St. John on Patmos | Correggio

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 02.14.21

Heartbreaker

Hey, welcome back! This week, plenty of Colab notebooks were released by various sources, and someone actually built a miniature version of the Switch Transformer in PyTorch. But first… a word from our sponsors:

If you enjoy the read, help us out by giving it a 👏👏 and sharing it with friends!

The Continuing Story of Neural Magic

Around New Year’s, I pondered the coming adoption of sparsity and its consequences for ML model inference. Well, a “new” company is on the scene and they just open-sourced their code.

‘New’ is in quotes because they aren’t really new; their open-sourced code is.

The company is Neural Magic.

Neural Magic

Neural Magic has 6 repositories available. Follow their code on GitHub.

github.com

Their core repos consist of:

SparseML: a toolkit that includes APIs, CLIs, scripts and libraries that apply optimization algorithms such as pruning and quantization to any neural network.

SparseZoo: a model repo for sparse models.

DeepSparse: a CPU inference engine for sparse models.

Sparsify: a UI for optimizing deep neural networks for better inference performance.

They currently support PyTorch, Keras, and TensorFlow V1. 🙈
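To make "pruning and quantization" a bit more concrete, here is a minimal sketch in plain Python. It does not use SparseML or any Neural Magic API; the function names `magnitude_prune` and `quantize_int8` are hypothetical, and magnitude pruning with symmetric int8 quantization is just one common way to do each step, assumed here for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights (illustrative helper, not SparseML).

    weights  -- flat list of floats
    sparsity -- fraction of weights to set to zero, between 0.0 and 1.0
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric linear quantization to the int8 range [-127, 127].

    Returns the integer values and the scale needed to map them back to floats.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


if __name__ == "__main__":
    layer = [0.5, -0.1, 0.03, -0.8]
    sparse_layer = magnitude_prune(layer, sparsity=0.5)
    q_values, q_scale = quantize_int8(sparse_layer)
    print(sparse_layer, q_values, q_scale)
```

The zeros left behind by pruning are what a sparsity-aware inference engine can skip, and the int8 values are smaller and cheaper to multiply than 32-bit floats.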

Switch Transformer Implementation in PyTorch

There’s already a PyTorch implementation of the Switch Transformer, but it’s a miniature version and doesn’t do parallel training. However, it does perform the ‘switching’ described in Google’s Switch Transformer paper. You can follow the code snippets in the link below and run the miniature version on Colab. Really cool 🔥🔥.

Switch Transformer

This is a miniature PyTorch implementation of the paper Switch Transformers: Scaling to Trillion Parameter Models with…

nn.labml.ai

Colab

Google Colaboratory

colab.research.google.com
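For a feel of what the ‘switching’ means, here is a minimal sketch of top-1 routing in plain Python. It is not the labml code: the names `switch_route` and `switch_layer` are hypothetical, the experts are toy functions, and the capacity limits and load-balancing loss from the paper are left out.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_route(router_logits):
    """Top-1 routing: each token goes to the single most probable expert.

    router_logits -- one list of per-expert logits for each token
    Returns a list of (expert_index, gate_probability) pairs.
    """
    routes = []
    for logits in router_logits:
        probs = softmax(logits)
        expert = max(range(len(probs)), key=probs.__getitem__)
        routes.append((expert, probs[expert]))
    return routes


def switch_layer(tokens, router_logits, experts):
    """Run each token through its chosen expert, scaled by the gate probability."""
    outputs = []
    for token, (expert, gate) in zip(tokens, switch_route(router_logits)):
        outputs.append([gate * y for y in experts[expert](token)])
    return outputs


if __name__ == "__main__":
    # Two toy "experts" standing in for feed-forward networks (assumption).
    toy_experts = [lambda x: [2 * v for v in x], lambda x: [-v for v in x]]
    toy_tokens = [[1.0, 2.0], [3.0, 4.0]]
    toy_logits = [[5.0, 0.0], [0.0, 5.0]]
    print(switch_route(toy_logits))
    print(switch_layer(toy_tokens, toy_logits, toy_experts))
```

Because only one expert runs per token, adding experts grows the parameter count without growing the compute spent on each token.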

Microsoft’s Multi-Lingual Spell Checker

On Bing, 15% of the queries customers submit contain misspellings. In response, Microsoft took inspiration from BART’s architecture to scale up spelling correction for its search business. Read how Microsoft did it and how multi-lingual spell checkers can be scaled up.

Speller100 expands spelling correction technology to 100+ languages

At Microsoft Bing, our mission is to delight users everywhere with the best search experience. We serve a diverse set…

www.microsoft.com

Cloud NLP for spaCy’s Models

If you are deep in spaCy territory and you need infrastructure to serve your spaCy models, the peeps at NLP Cloud can lend you a hand. Their infrastructure is built on top of FastAPI and supports Python, Go, and Ruby.

Documentation:

API Reference

Get entities using the en_core_web_sm pre-trained model: curl "https://api.nlpcloud.io/v1/en_core_web_sm/entities" \ -H…

docs.nlpcloud.io

Speaking of spaCy… the ScispaCy library from AI2 has a new release based on spaCy’s v3 release. If you want to see the new v3 features from spaCy, check out their blog post here:

https://explosion.ai/blog/spacy-v3

Large Database of 90M Indian Legal Cases

“Development Data Lab has processed and de-identified legal case records for all lower courts in India filed between 2010–2018, using the government’s online case-management portal — E-courts. The result: charges, filing, hearing and decision dates, trial outcomes, and case type details of 25 million criminal, and 65 million civil cases.”

Big Data for Justice

An open-access dataset of 80 million Indian legal case records

devdatalab.medium.com

OCR Library | Azure

Azure is bringing a new version of OCR to its Computer Vision library. It includes:

OCR for 73 languages including Simplified and Traditional Chinese, Japanese, Korean, and several Latin languages.
Natural reading order for the text line output.
Handwriting style classification for text lines.
Text extraction for selected pages for a multi-page document.
Available as a Distroless container for on-premise deployment.

Computer Vision OCR (Read API) previews 73 human languages and new features on cloud and on-premise

Businesses today are applying Optical Character Recognition (OCR) and document AI technologies to rapidly convert their…

techcommunity.microsoft.com

Wav2Vec2

Speech-to-text capabilities are now part of the Hugging Face library. Facebook’s Wav2Vec2 model was uploaded to their model hub this week, and it includes code snippets for inference.

facebook/wav2vec2-base-960h · Hugging Face

We're on a journey to solve and democratize artificial intelligence through natural language.

huggingface.co

Colab 🔥

FYI, 1LittleCoder already created a nice Colab for you to get started using a wav file. 😎

Google Colaboratory

colab.research.google.com

Topic Modeling

Want to know what cross-lingual zero-shot topic modeling is all about? The model is called ZeroShotTM, and it is based on multi-lingual encoder models. Imagine you train your model to learn topics from English documents and then want it to predict topics for Italian documents it has never seen. This is essentially what cross-lingual zero-shot is all about. You can read their blog post 👇, which features a nice tutorial to get you familiar with their library:

Contextualized Topic Modeling with Python (EACL2021)

In this blog post, I discuss our latest published paper on topic modeling:

fbvinid.medium.com

Colab of the Week 🏆

Google Colaboratory

colab.research.google.com

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

ClipBERT

ClipBERT takes raw videos/images + text as inputs, and outputs task predictions.

Supports end-to-end pretraining and finetuning for the following tasks:

- Image-text pretraining on COCO and VG captions.
- Text-to-video retrieval finetuning on MSRVTT, DiDeMo, and ActivityNet Captions.
- Video-QA finetuning on TGIF-QA and MSRVTT-QA.
- Image-QA finetuning on VQA 2.0.

Connected Papers 📈

jayleicn/ClipBERT

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling Official PyTorch code for ClipBERT, an…

github.com

Nystromformer

A Nystrom-based algorithm for approximating self-attention. It can escape the 512-token limit of encoder models, in some cases with minimal difference in accuracy.

Connected Papers 📈

mlpen/Nystromformer

Transformers have emerged as a powerful workhorse for a broad range of natural language processing tasks. A key…

github.com

RpBERT

A BERT model for multimodal named-entity recognition (NER).

Connected Papers 📈

Multimodal-NER/RpBERT

RpBERT: A Text-image Relationship Propagation Based BERT Model for Multimodal NER python==3.7 torch==1.2.0…

github.com

Biomedical Question Answering: A Comprehensive Review

Paper: https://arxiv.org/abs/2102.05281

Connected Papers 📈

Dataset of the Week: ARC Direct Answer Questions

What is it?

A dataset of 2,985 grade-school level, direct-answer (“open response”, “free form”) science questions derived from the ARC multiple-choice question set released as part of the AI2 Reasoning Challenge in 2018.

Sample

{
 "question_id": "ARCEZ_Mercury_7221148",
 "tag": "EASY-TRAIN",
 "question": "A baby kit fox grows to become an adult with a mass of over 3.5 kg. What factor will have the greatest influence on this kit fox's survival?",
 "answers": [
 "food availability",
 "larger predators prevalence",
 "the availability of food",
 "the population of predator in the area",
 "food sources",
 "habitat",
 "availability of food",
 "amount of predators around",
 "how smart the fox is"
 ]
}
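Since each question comes with several acceptable answers, one simple way to use a record like this is to check a model's prediction against every reference answer. The sketch below assumes a normalized exact-match rule (lowercase, strip punctuation and articles); that rule and the helper names `normalize` and `matches_any` are illustrative assumptions, not the dataset's official evaluation.

```python
import string

# Fields copied from the sample record above.
record = {
    "question_id": "ARCEZ_Mercury_7221148",
    "answers": [
        "food availability",
        "larger predators prevalence",
        "the availability of food",
        "the population of predator in the area",
        "food sources",
        "habitat",
        "availability of food",
        "amount of predators around",
        "how smart the fox is",
    ],
}

ARTICLES = {"a", "an", "the"}


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in ARTICLES)


def matches_any(prediction, answers):
    """Return True if the prediction equals any reference answer after normalizing."""
    target = normalize(prediction)
    return any(target == normalize(answer) for answer in answers)


if __name__ == "__main__":
    print(matches_any("The availability of food.", record["answers"]))
    print(matches_any("sunlight", record["answers"]))
```

Exact matching is strict: a correct paraphrase that is not in the list would be scored as wrong, which is why direct-answer datasets usually collect many reference answers per question.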

Where is it?

ARC Direct Answer Questions Dataset – Allen Institute for AI

This 1.1 version fixes some content and formatting issues with answers from the original release. These questions were…

allenai.org

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat


Published via Towards AI
