The NLP Cypher | 02.14.21
Author(s): Ricky Costa
Originally published on Towards AI.
NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
Heartbreaker
Hey, welcome back! This week, plenty of Colab notebooks were released from various sources, and someone actually built a minimized version of the Switch Transformer in PyTorch. But first… a word from our sponsors:
https://www.youtube.com/embed/lGOofzZOyl8
If you enjoy the read, help us out by giving it a 👏👏 and sharing it with friends!
The Continuing Story of Neural Magic
Around New Year's, I pondered the coming adoption of sparsity and what it would mean for inference with ML models. Well, a "new" company is on the scene, and they just open-sourced their code.
"New" is in quotes because the company isn't really new; its open-sourced code is.
The company is Neural Magic.
Neural Magic
Neural Magic has 6 repositories available. Follow their code on GitHub.
github.com
Their core repos consist of:
SparseML: a toolkit of APIs, CLIs, scripts, and libraries that apply optimization algorithms such as pruning and quantization to any neural network.
SparseZoo: a model repo for sparse models.
DeepSparse: a CPU inference engine for sparse models.
Sparsify: a UI for optimizing deep neural networks for better inference performance.
They currently support PyTorch, Keras, and TensorFlow V1. 🙈
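If you're wondering what "applying pruning" actually looks like, here's a minimal sketch using plain PyTorch's torch.nn.utils.prune. This is not Neural Magic's API, just an illustration of the unstructured magnitude pruning that toolkits like SparseML schedule and automate, so that an engine like DeepSparse can exploit the zeros at inference time:

import torch
import torch.nn.utils.prune as prune

# A toy layer standing in for any weight matrix in your network.
layer = torch.nn.Linear(256, 256)

# Zero out the 90% of weights with the smallest magnitude (unstructured pruning).
prune.l1_unstructured(layer, name="weight", amount=0.9)

# Fold the pruning mask into the weight tensor so the sparsity is permanent.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.1%}")  # ~90.0%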
Switch Transformer Implementation in PyTorch
There's already a PyTorch implementation of the Switch Transformer, but it's a minimized version and doesn't do parallel training. However, it does perform the "switching" described in the Switch paper from Google. You can follow their code snippets in the link below and execute their minimized version on Colab. Really cool 🔥🔥.
Switch Transformer
This is a miniature PyTorch implementation of the paper Switch Transformers: Scaling to Trillion Parameter Models with…
nn.labml.ai
Colab
Google Colaboratory
colab.research.google.com
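The heart of the Switch layer is top-1 routing: a small learned router picks exactly one expert FFN per token, so compute per token stays constant no matter how many experts you add. A rough sketch of that switching step (my own simplification, not the labml implementation, and without the paper's load-balancing loss or capacity limits):

import torch
import torch.nn as nn

class SwitchFFN(nn.Module):
    # Top-1 expert routing as described in the Switch Transformer paper (simplified).
    def __init__(self, d_model, d_ff, n_experts):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        probs = torch.softmax(self.router(x), dim=-1)
        gate, expert_idx = probs.max(dim=-1)  # the "switch": one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out * gate.unsqueeze(-1)  # scale by the router probability

tokens = torch.randn(10, 64)
print(SwitchFFN(d_model=64, d_ff=256, n_experts=4)(tokens).shape)  # torch.Size([10, 64])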
Microsoft's Multi-Lingual Spell Checker
In Bing search, 15% of queries submitted by customers contain misspellings. To address this, Microsoft drew inspiration from BART's architecture to scale spelling correction across its search business. Read how Microsoft did it and how multilingual spell checkers can be scaled to 100+ languages.
Speller100 expands spelling correction technology to 100+ languages
At Microsoft Bing, our mission is to delight users everywhere with the best search experience. We serve a diverse set…
www.microsoft.com
Cloud NLP for spaCy's Models
If you are deep in spaCy territory and need a way to serve your spaCy models, the peeps at NLP Cloud can lend you a hand. Their infrastructure is built on top of FastAPI and supports Python, Go, and Ruby.
Documentation:
API Reference
Get entities using the en_core_web_sm pre-trained model: curl "https://api.nlpcloud.io/v1/en_core_web_sm/entities" \ -H…
docs.nlpcloud.io
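Since it's a plain REST API, any HTTP client will do. Here's a hedged sketch in Python against the entities endpoint shown above (the token-style Authorization header and the request body shape are assumptions based on their docs, so double-check docs.nlpcloud.io):

import requests

# Endpoint taken from the API reference above; the header format and payload
# shape are assumptions -- confirm both at docs.nlpcloud.io.
url = "https://api.nlpcloud.io/v1/en_core_web_sm/entities"
headers = {"Authorization": "Token YOUR_API_TOKEN"}
payload = {"text": "John Doe has been working for Microsoft in Seattle since 1999."}

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()
print(response.json())  # expected: entities with their text, type, and offsets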
Speaking of spaCy… the scispaCy library from AI2 has a new release based on spaCy v3. If you want to check out the new v3 features, see spaCy's blog post here:
https://explosion.ai/blog/spacy-v3
Large Database of 90M Indian Legal Cases
"Development Data Lab has processed and de-identified legal case records for all lower courts in India filed between 2010–2018, using the government's online case-management portal, E-courts. The result: charges, filing, hearing and decision dates, trial outcomes, and case type details of 25 million criminal, and 65 million civil cases."
Big Data for Justice
An open-access dataset of 80 million Indian legal case records
devdatalab.medium.com
OCR Library | Azure
Azure is bringing a new version of OCR to its Computer Vision library. It includes:
- OCR for 73 languages including Simplified and Traditional Chinese, Japanese, Korean, and several Latin languages.
- Natural reading order for the text line output.
- Handwriting style classification for text lines.
- Text extraction for selected pages for a multi-page document.
- Available as a Distroless container for on-premise deployment.
Computer Vision OCR (Read API) previews 73 human languages and new features on cloud and on-premise
Businesses today are applying Optical Character Recognition (OCR) and document AI technologies to rapidly convert their…
techcommunity.microsoft.com
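If you want to poke at the Read API directly, the REST flow is: submit an image, grab the Operation-Location header, and poll for the result. A hedged sketch (endpoint, key, and image URL are placeholders, and the v3.2 version string is an assumption; the preview described in the post may expose a different one):

import time
import requests

# Placeholders -- swap in your own Computer Vision resource endpoint and key.
endpoint = "https://YOUR_RESOURCE.cognitiveservices.azure.com"
key = "YOUR_SUBSCRIPTION_KEY"

# Kick off asynchronous analysis of an image by URL.
resp = requests.post(
    f"{endpoint}/vision/v3.2/read/analyze",
    headers={"Ocp-Apim-Subscription-Key": key},
    json={"url": "https://example.com/sample-document.png"},
)
resp.raise_for_status()
operation_url = resp.headers["Operation-Location"]

# Poll until the operation finishes, then print the recognized lines in reading order.
while True:
    result = requests.get(operation_url, headers={"Ocp-Apim-Subscription-Key": key}).json()
    if result.get("status") in ("succeeded", "failed"):
        break
    time.sleep(1)

for page in result["analyzeResult"]["readResults"]:
    for line in page["lines"]:
        print(line["text"])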
Wav2Vec2
Speech-to-text capabilities are now part of the Hugging Face Transformers library. Facebook's Wav2Vec2 model was uploaded to their model hub this week and includes code snippets for inference.
facebook/wav2vec2-base-960h Β· Hugging Face
We're on a journey to solve and democratize artificial intelligence through natural language.
huggingface.co
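Inference really is a few lines: load the processor and CTC model, feed in 16 kHz mono audio, and greedy-decode the logits. A minimal sketch along the lines of the model card's snippets (the WAV filename is a placeholder):

import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# "sample.wav" is a placeholder; this checkpoint expects 16 kHz mono audio.
speech, sample_rate = sf.read("sample.wav")
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])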
Colab 🔥
FYI, 1LittleCoder has already created a nice Colab to get you started with a WAV file. 😎
Google Colaboratory
colab.research.google.com
Topic Modeling
Want to know what cross-lingual zero-shot topic modeling is all about? The model is called ZeroShotTM and is built on multilingual encoder models. Imagine training a topic model on English documents and then having it predict topics for Italian documents it has never seen. That is essentially what cross-lingual zero-shot topic modeling is. Their blog post 👇 features a nice tutorial to get you familiar with the library:
Contextualized Topic Modeling with Python (EACL2021)
In this blog post, I discuss our latest published paper on topic modeling:
fbvinid.medium.com
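For reference, the workflow in their contextualized-topic-models library roughly follows the tutorial: fit on English text with a multilingual sentence encoder supplying the contextual embeddings, then transform documents in another language. A hedged sketch (class and method names follow the library's documented examples; the toy data and hyperparameters are purely illustrative, and a real corpus is needed for meaningful topics):

from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

# Toy data for illustration only -- in practice you train on thousands of documents.
english_docs = ["the team won the championship game", "parliament passed the new budget"]
preprocessed = ["team won championship game", "parliament passed new budget"]  # BoW side

# A multilingual sentence encoder provides the contextual embeddings (512-dim here).
tp = TopicModelDataPreparation("distiluse-base-multilingual-cased")
training_dataset = tp.fit(text_for_contextual=english_docs, text_for_bow=preprocessed)

ctm = ZeroShotTM(bow_size=len(tp.vocab), contextual_size=512, n_components=5)
ctm.fit(training_dataset)
print(ctm.get_topic_lists(5))

# Zero-shot: topic distributions for Italian documents the model never saw in training.
italian_docs = ["la squadra ha vinto la partita di campionato"]
test_dataset = tp.transform(text_for_contextual=italian_docs)
print(ctm.get_doc_topic_distribution(test_dataset))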
Colab of the Week 🏆
Google Colaboratory
colab.research.google.com
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
ClipBERT
ClipBERT takes raw videos/images + text as inputs and outputs task predictions.
Supports end-to-end pretraining and finetuning for the following tasks:
- Image-text pretraining on COCO and VG captions.
- Text-to-video retrieval finetuning on MSRVTT, DiDeMo, and ActivityNet Captions.
- Video-QA finetuning on TGIF-QA and MSRVTT-QA.
- Image-QA finetuning on VQA 2.0.
Connected Papers 📈
jayleicn/ClipBERT
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling Official PyTorch code for ClipBERT, an…
github.com
Nystromformer
A Nyström-based algorithm for approximating self-attention. It can escape the 512-token limit of encoder models, in some cases with minimal loss in accuracy.
Connected Papers 📈
mlpen/Nystromformer
Transformers have emerged as a powerful workhorse for a broad range of natural language processing tasks. A key…
github.com
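The trick is to approximate the full n x n softmax attention matrix from a small set of landmark queries and keys (e.g., segment means), dropping the cost from quadratic to roughly linear in sequence length. A simplified sketch of the idea (my own condensation of the paper; it skips the iterative pseudo-inverse and the residual convolution the authors use):

import torch

def nystrom_attention(q, k, v, num_landmarks=64):
    # q, k, v: (batch, seq_len, dim); seq_len must divide evenly by num_landmarks here.
    b, n, d = q.shape
    scale = d ** -0.5

    # Landmarks: segment means over the sequence.
    q_land = q.reshape(b, num_landmarks, n // num_landmarks, d).mean(dim=2)
    k_land = k.reshape(b, num_landmarks, n // num_landmarks, d).mean(dim=2)

    kernel_1 = torch.softmax(q @ k_land.transpose(-1, -2) * scale, dim=-1)       # (b, n, m)
    kernel_2 = torch.softmax(q_land @ k_land.transpose(-1, -2) * scale, dim=-1)  # (b, m, m)
    kernel_3 = torch.softmax(q_land @ k.transpose(-1, -2) * scale, dim=-1)       # (b, m, n)

    # softmax(QK^T / sqrt(d)) is approximated by kernel_1 @ pinv(kernel_2) @ kernel_3.
    return kernel_1 @ torch.linalg.pinv(kernel_2) @ (kernel_3 @ v)

q = k = v = torch.randn(2, 1024, 64)
print(nystrom_attention(q, k, v).shape)  # torch.Size([2, 1024, 64])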
RpBERT
A BERT model for multimodal named-entity recognition (NER).
Connected Papers 📈
Multimodal-NER/RpBERT
RpBERT: A Text-image Relationship Propagation Based BERT Model for Multimodal NER python==3.7 torch==1.2.0…
github.com
Biomedical Question Answering: A Comprehensive Review
Paper: https://arxiv.org/abs/2102.05281
Connected Papers 📈
Dataset of the Week: ARC Direct Answer Questions
What is it?
A dataset of 2,985 grade-school level, direct-answer ("open response", "free form") science questions derived from the ARC multiple-choice question set released as part of the AI2 Reasoning Challenge in 2018.
Sample
{
  "question_id": "ARCEZ_Mercury_7221148",
  "tag": "EASY-TRAIN",
  "question": "A baby kit fox grows to become an adult with a mass of over 3.5 kg. What factor will have the greatest influence on this kit fox's survival?",
  "answers": [
    "food availability",
    "larger predators prevalence",
    "the availability of food",
    "the population of predator in the area",
    "food sources",
    "habitat",
    "availability of food",
    "amount of predators around",
    "how smart the fox is"
  ]
}
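The records are plain JSON objects like the sample above, so a few lines of Python are enough to poke around (assuming a JSONL layout; the filename below is a placeholder for whichever split you download):

import json

# "train.jsonl" is a placeholder -- point this at the file you download.
with open("train.jsonl") as f:
    questions = [json.loads(line) for line in f]

print(len(questions), "questions")
example = questions[0]
print(example["question"])
print(example["answers"][:3])  # several acceptable free-form answers per question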
Where is it?
ARC Direct Answer Questions Dataset – Allen Institute for AI
This 1.1 version fixes some content and formatting issues with answers from the original release. These questions were…
allenai.org
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
For complete coverage, follow our Twitter: @Quantum_Stat
Join over 80,000 data leaders on the AI newsletter and keep up to date with the latest developments in AI, from research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.
Published via Towards AI