The NLP Cypher | 02.14.21

Last Updated on July 24, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.

The Vision of St. John on Patmos | Correggio

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 02.14.21

Heartbreaker

Hey, welcome back! This week, plenty of Colab notebooks were released by various sources, and someone actually built a miniature version of the Switch Transformer in PyTorch. But first… a word from our sponsors:

If you enjoy the read, help us out by giving it a 👏👏 and sharing it with friends!

The Continuing Story of Neural Magic

Around New Year’s, I pondered the coming adoption of sparsity and what it means for ML model inference. Well, a “new” company is on the scene, and they just open-sourced their code.

‘New’ is in quotes because the company isn’t really new; its open-sourced code is.

The company is Neural Magic.

Neural Magic

Neural Magic has 6 repositories available. Follow their code on GitHub.

github.com

Their core repos consist of:

SparseML: a toolkit that includes APIs, CLIs, scripts and libraries that apply optimization algorithms such as pruning and quantization to any neural network.

SparseZoo: a model repo for sparse models.

DeepSparse: a CPU inference engine for sparse models.

Sparsify: a UI for optimizing deep neural networks for better inference performance.

They currently support PyTorch, Keras, and TensorFlow V1. 🙈
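For a feel of what "pruning" means here, below is a minimal sketch of unstructured magnitude pruning in plain Python. It is not SparseML's API: the function name `magnitude_prune`, the `sparsity` parameter, and the toy weights are all assumptions made for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights.

    Hypothetical helper for illustration only; not part of SparseML.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Toy weights (made up); prune half of them.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
sparse = magnitude_prune(weights, sparsity=0.5)
print(sparse)
```

The zeros are what a sparsity-aware engine can skip at inference time; real toolkits usually prune gradually during training rather than in one shot like this.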

Switch Transformer Implementation in PyTorch

There’s already a PyTorch implementation of the Switch Transformer, but it’s a miniature version and doesn’t do parallel training. However, it does perform the ‘switching’ described in the Switch Transformer paper from Google. You can follow the code snippets in the link below and run the miniature version on Colab. Really cool 🔥🔥.
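For intuition, the ‘switching’ idea is top-1 routing: a router scores every expert for each token, only the best-scoring expert runs, and its output is scaled by the router probability. Here is a rough sketch in plain Python. It is not the labml code; the names `switch_route`, `router_weights`, and the two toy experts are assumptions, and it leaves out load balancing and capacity limits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_route(tokens, router_weights, experts):
    """Send each token to its single highest-probability expert (top-1)."""
    outputs, choices = [], []
    for token in tokens:
        # Router: one logit per expert, a dot product with the token.
        logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
        probs = softmax(logits)
        best = max(range(len(experts)), key=lambda i: probs[i])
        # Only the chosen expert runs; scale its output by the gate probability.
        expert_out = experts[best](token)
        outputs.append([probs[best] * v for v in expert_out])
        choices.append(best)
    return outputs, choices


# Toy setup (all values made up): 2 experts, 2-dimensional tokens.
experts = [
    lambda t: [2.0 * v for v in t],   # hypothetical expert 0: doubles the token
    lambda t: [-1.0 * v for v in t],  # hypothetical expert 1: negates the token
]
router_weights = [[1.0, 0.0], [0.0, 1.0]]  # expert i prefers dimension i
tokens = [[3.0, 0.0], [0.0, 3.0], [1.0, 2.0]]

outputs, choices = switch_route(tokens, router_weights, experts)
print(choices)
```

Because each token touches only one expert, you can add experts (and parameters) without adding per-token compute, which is the whole appeal.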

Switch Transformer

This is a miniature PyTorch implementation of the paper Switch Transformers: Scaling to Trillion Parameter Models with…

nn.labml.ai

Colab

Google Colaboratory

colab.research.google.com

Microsoft’s Multi-Lingual Spell Checker

About 15% of the queries customers submit to Bing contain misspellings. In response, Microsoft drew inspiration from BART’s architecture to build Speller100, which expands spelling correction to 100+ languages. Read how Microsoft did it and how multilingual spell checkers can be scaled up.

Speller100 expands spelling correction technology to 100+ languages

At Microsoft Bing, our mission is to delight users everywhere with the best search experience. We serve a diverse set…

www.microsoft.com

Cloud NLP for spaCy’s Models

If you are deep in spaCy territory and you need a storage system to serve your spaCy models, the peeps at NLP Cloud can lend you a hand. Their infrastructure is built on top of FastAPI and supports Python, Go and Ruby languages.

Documentation:

API Reference

Get entities using the en_core_web_sm pre-trained model: curl "https://api.nlpcloud.io/v1/en_core_web_sm/entities" \ -H…

docs.nlpcloud.io

Speaking of spaCy… the ScispaCy library from AI2 has a new release based on spaCy’s v3 release. If you want to explore the new v3 features from spaCy, check out their blog post here:

https://explosion.ai/blog/spacy-v3

Large Database of 90M Indian Legal Cases

“Development Data Lab has processed and de-identified legal case records for all lower courts in India filed between 2010–2018, using the government’s online case-management portal — E-courts. The result: charges, filing, hearing and decision dates, trial outcomes, and case type details of 25 million criminal, and 65 million civil cases.”

Big Data for Justice

An open-access dataset of 80 million Indian legal case records

devdatalab.medium.com

OCR Library | Azure

Azure is bringing a new version of OCR to its Computer Vision library. It includes:

  • OCR for 73 languages including Simplified and Traditional Chinese, Japanese, Korean, and several Latin languages.
  • Natural reading order for the text line output.
  • Handwriting style classification for text lines.
  • Text extraction for selected pages for a multi-page document.
  • Available as a Distroless container for on-premise deployment.

Computer Vision OCR (Read API) previews 73 human languages and new features on cloud and on-premise

Businesses today are applying Optical Character Recognition (OCR) and document AI technologies to rapidly convert their…

techcommunity.microsoft.com

Wav2Vec2

Speech-to-text capabilities are now part of the Hugging Face library. The Facebook Wav2Vec2 model was uploaded to their model hub this week and includes code snippets for inference.
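The wav2vec2-base-960h checkpoint is fine-tuned with a CTC objective, so the model emits one token prediction per audio frame and those frames get collapsed into text. Here is a minimal sketch of greedy CTC decoding in plain Python. The vocabulary, the blank id, and the frame ids are made up for illustration; in practice the library's own tokenizer handles this step for you.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then drop the blank token."""
    chars = []
    previous = None
    for idx in frame_ids:
        # Emit a character only when the id changes and is not the blank.
        if idx != previous and idx != blank_id:
            chars.append(id_to_char[idx])
        previous = idx
    return "".join(chars)


# Hypothetical vocabulary and per-frame argmax ids (not the model's real ones).
id_to_char = {1: "H", 2: "E", 3: "L", 4: "O"}
frame_ids = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]

text = ctc_greedy_decode(frame_ids, id_to_char)
print(text)
```

Note how the blank between the two runs of `3` is what lets the double "L" survive the collapse.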

facebook/wav2vec2-base-960h · Hugging Face

We're on a journey to solve and democratize artificial intelligence through natural language.

huggingface.co

Colab 🔥

FYI, 1LittleCoder already created a nice Colab for you to get started using a wav file. 😎

Google Colaboratory

colab.research.google.com

Topic Modeling

Want to know what cross-lingual zero-shot topic modeling is all about? The model is called ZeroShotTM, and it is based on multilingual encoder models. Imagine you want your model to learn topics from English documents and then predict topics for unseen Italian documents, in a language it has never seen during training. This is essentially what cross-lingual zero-shot is all about. You can read their blog post 👇, which features a nice tutorial to get you familiar with their library:

Contextualized Topic Modeling with Python (EACL2021)

In this blog post, I discuss our latest published paper on topic modeling:

fbvinid.medium.com

Colab of the Week 🏆

Google Colaboratory

colab.research.google.com

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

ClipBERT

ClipBERT takes raw videos/images + text as inputs, and outputs task predictions.

Supports end-to-end pretraining and finetuning for the following tasks:

- Image-text pretraining on COCO and VG captions.
- Text-to-video retrieval finetuning on MSRVTT, DiDeMo, and ActivityNet Captions.
- Video-QA finetuning on TGIF-QA and MSRVTT-QA.
- Image-QA finetuning on VQA 2.0.

Connected Papers 📈

jayleicn/ClipBERT

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling Official PyTorch code for ClipBERT, an…

github.com

Nystromformer

A Nyström-based algorithm for approximating self-attention. It can escape the 512-token limit of encoder models, in some cases with minimal difference in accuracy.

Connected Papers 📈

mlpen/Nystromformer

Transformers have emerged as a powerful workhorse for a broad range of natural language processing tasks. A key…

github.com

RpBERT

A BERT model for multimodal named-entity recognition (NER).

Connected Papers 📈

Multimodal-NER/RpBERT

RpBERT: A Text-image Relationship Propagation Based BERT Model for Multimodal NER python==3.7 torch==1.2.0…

github.com

Biomedical Question Answering: A Comprehensive Review

Paper: https://arxiv.org/abs/2102.05281

Connected Papers 📈

Dataset of the Week: ARC Direct Answer Questions

What is it?

A dataset of 2,985 grade-school level, direct-answer (“open response”, “free form”) science questions derived from the ARC multiple-choice question set released as part of the AI2 Reasoning Challenge in 2018.

Sample

{
  "question_id": "ARCEZ_Mercury_7221148",
  "tag": "EASY-TRAIN",
  "question": "A baby kit fox grows to become an adult with a mass of over 3.5 kg. What factor will have the greatest influence on this kit fox's survival?",
  "answers": [
    "food availability",
    "larger predators prevalence",
    "the availability of food",
    "the population of predator in the area",
    "food sources",
    "habitat",
    "availability of food",
    "amount of predators around",
    "how smart the fox is"
  ]
}
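Since each question comes with several acceptable answers, a quick way to poke at a record like the one above is to check a model's prediction against all of them. The sketch below does a normalized exact match in plain Python. The scoring rule and the helper names `normalize` and `matches_any` are assumptions for illustration, not the dataset's official metric, and the record is trimmed to a subset of the sample's fields and answers.

```python
import json
import string

# A trimmed copy of the sample record (subset of its fields and answers).
record = json.loads("""
{
  "question_id": "ARCEZ_Mercury_7221148",
  "tag": "EASY-TRAIN",
  "answers": ["food availability", "the availability of food", "habitat"]
}
""")


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def matches_any(prediction, answers):
    """True if the prediction equals any gold answer after normalization."""
    gold = {normalize(a) for a in answers}
    return normalize(prediction) in gold


print(matches_any("Food availability.", record["answers"]))
print(matches_any("weather", record["answers"]))
```

Exact match is strict on purpose; paraphrases that are not in the answer list will be missed, so a softer overlap score may fit free-form answers better.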

Where is it?

ARC Direct Answer Questions Dataset – Allen Institute for AI

This 1.1 version fixes some content and formatting issues with answers from the original release. These questions were…

allenai.org

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat

Quantum Stat

Published via Towards AI
