NLP News Cypher | 01.19.20
Last Updated on July 27, 2023 by Editorial Team
Author(s): Ricky Costa
Originally published on Towards AI.
Towards AI and Infinity
And…. we're back! What a week!
First, we are happy to announce that we are now publishing our weekly blog on Towards AI's platform 💪💪! We are happy about this publishing partnership, as we intend to bring NLP trends to a more global audience, from developers in NYC to business professionals in Hong Kong.
And talking about global…
In case you were busy this past week: we dropped "The Big Bad NLP Database", a large collection of datasets for ML and NLP developers! The database continues to grow, and we have already received excellent recommendations from our users. Updates coming very soon!
Wednesday's Announcement Article:
NLP Dataset Library
100s of Datasets for ML Developers (and counting)
medium.com
Database:
The Big Bad NLP Database – Quantum Stat
Datasets for various tasks in Natural Language Processing
quantumstat.com
Thank you for all of your support!
This Week:
The Reformer Transformer
Speaking Hearts and Minds
Recognizing Facebook's Real-Time Speech
Deep Hacks
Wolfram Webinars on Data Analytics
CoNLL Meet Spacy
I Recommend Research Papers
Dataset of the Week: ReCoRD
Meanwhile, Back at the Vegas Ranch…
The Reformer Transformer
"A Transformer model designed to handle context windows of up to 1 million words, all on a single accelerator and using only 16GB of memory."
Google started the new year with a bang and a new transformer. Google's new model aims to solve two problems that weigh on transformers with long input sequences: attention and memory.
Attention is difficult to scale to a large number of words, because every token attends to every other token. In response, Google introduced a locality-sensitive hashing (LSH) technique that lets the model efficiently "connect" similar vectors by hashing them into the same bucket and then dividing them into chunks. Attention is applied only within these segments, which reduces the computational load.
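To make the bucketing idea a bit more concrete, here is a minimal pure-Python sketch of angular locality-sensitive hashing with random projections. The helper names (`make_projections`, `lsh_bucket`, `group_by_bucket`), the bucket count, and the toy vectors are all assumptions made for illustration. This is not Google's implementation, which adds further steps such as sorting, chunking, and multiple hashing rounds.

```python
import random


def make_projections(dim, n_buckets, seed=0):
    """Draw n_buckets // 2 random Gaussian directions (n_buckets assumed even)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(n_buckets // 2)]


def lsh_bucket(vec, projections):
    """Hash a vector to a bucket id: argmax over the scores [xR, -xR]."""
    scores = [sum(v * p for v, p in zip(vec, proj)) for proj in projections]
    scores = scores + [-s for s in scores]
    return max(range(len(scores)), key=scores.__getitem__)


def group_by_bucket(vectors, projections):
    """Map bucket id -> positions of the vectors that landed in that bucket."""
    buckets = {}
    for pos, vec in enumerate(vectors):
        buckets.setdefault(lsh_bucket(vec, projections), []).append(pos)
    return buckets


if __name__ == "__main__":
    proj = make_projections(dim=4, n_buckets=8, seed=1)
    vectors = [
        [0.5, -1.0, 2.0, 0.25],    # toy vector
        [1.0, -2.0, 4.0, 0.5],     # same direction, twice the length
        [-0.5, 1.0, -2.0, -0.25],  # opposite direction
    ]
    # Attention would then be computed only among positions sharing a bucket.
    print(group_by_bucket(vectors, proj))
```

Vectors that point in the same direction always share a bucket, while a vector pointing the opposite way lands in a different one, so only a small group of similar vectors needs to attend to each other.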
The memory problem arises in a multi-layered model because the activations at each layer must be saved for the backward pass. This can lead to your GPU running out of memory, aka OOM errors.
To mitigate this issue, Google turned to reversible layers (the technique is discussed in the Reformer paper). Instead of storing each layer's activations in memory, the model recomputes them during the backward pass from the activations of the following layer.
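The trick behind reversible layers can be sketched in a few lines. The input is split into two halves, and each half is updated with a residual function of the other, so the inputs can be recovered exactly from the outputs. In the sketch below, `f` and `g` are hypothetical toy stand-ins for the attention and feed-forward sub-layers, and plain Python lists stand in for tensors. It illustrates the idea only and is not the Reformer code.

```python
def f(x):
    """Toy stand-in for the attention sub-layer (hypothetical)."""
    return [0.5 * v + 1.0 for v in x]


def g(x):
    """Toy stand-in for the feed-forward sub-layer (hypothetical)."""
    return [v * v for v in x]


def reversible_forward(x1, x2):
    """Compute y1 = x1 + F(x2) and y2 = x2 + G(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def reversible_inverse(y1, y2):
    """Recover (x1, x2) from (y1, y2), so the inputs need not be stored."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


if __name__ == "__main__":
    y1, y2 = reversible_forward([1.0, 2.0], [4.0, -2.0])
    print(reversible_inverse(y1, y2))
```

Because the inverse only needs the layer's outputs, a backward pass can rebuild each layer's inputs on the fly, trading a little extra compute for a much smaller memory footprint.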
Blog:
Reformer: The Efficient Transformer
Understanding sequential data – such as language, music or videos – is a challenging task, especially when there is…
ai.googleblog.com
Colab for Text Generation:
Google Colaboratory
colab.research.google.com
Speaking Hearts and Minds
New research from Facebook's conversational AI research group, ParlAI, presents a single model that performs well on several image-grounded (aka multi-modal) conversational tasks. Below you can find a few example outputs:
Paper:
Recognizing Facebook's Real-Time Speech
Facebook open-sourced their wav2letter@anywhere speech recognition framework. The take-away here is that inference for this framework is geared for real-time performance. Also, it achieves SOTA performance on the LibriSpeech dataset!
Blog:
Online speech recognition with wav2letter@anywhere
The process of transcribing speech in real time from an input audio stream is known as online speech recognition. Most…
ai.facebook.com
GitHub:
facebookresearch/wav2letter
In the paper we are considering: different architectures for acoustic modeling: different criterions: different…
github.com
Deep Hacks
When engineering, the cookie crumbles on the micro-level. Yes, you can achieve great results from a large fine-tuned transformer but you still need to hone your skills on the classics (import re). Priyansh Trivedi drops a few jewels on this topic in his blog:
Bag of Tricks 👜 for NLP Models – (Part 1)
This is the first post in a series of simple, part-obvious, and (largely) independent hacks to increase the performance…
priyansh.page
Wolfram Webinars on Data Analytics
There are three 90-minute Wolfram webinars coming up that will highlight custom-built Twitter analytics, data mining of imaginary maps, and how to create an automated reporting system (and much more). If the Wolfram language and these subjects interest you, you can register here.
Blog:
3 Free Wolfram U Webinars Showcasing Innovative Data Science Applications-Wolfram Blog
January 16, 2020 – Jamie Peterson, Technical Programs Manager, Wolfram U Looking to fulfill your New Year's resolution…
blog.wolfram.com
CoNLL Meet Spacy
Well now. Special shout-out to Bram. He updated the spacy_conll repo, which allows you to parse your text into CoNLL-U format. The plugin can now be used as a custom pipeline component, from the command line or in a Python script.
What's CoNLL, you may ask?
> python -m spacy_conll --input_str "I like cookies . What about you ?" --is_tokenized --include_headers
# sent_id = 1
# text = I like cookies .
1 I -PRON- PRON PRP PronType=prs 2 nsubj _ _
2 like like VERB VBP VerbForm=fin|Tense=pres 0 ROOT _ _
3 cookies cookie NOUN NNS Number=plur 2 dobj _ _
4 . . PUNCT . PunctType=peri 2 punct _ _
# sent_id = 2
# text = What about you ?
1 What what NOUN WP PronType=int|rel 2 dep _ _
2 about about ADP IN _ 0 ROOT _ _
3 you -PRON- PRON PRP PronType=prs 2 pobj _ _
4 ? ? PUNCT . PunctType=peri 2 punct _ _
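In short, CoNLL-U is a plain-text format with one token per line and ten fields per token: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS and MISC. Lines starting with `#` are comments. If you want to work with output like the above in Python, a rough parser sketch could look like the following. The function names and the `SAMPLE` string are hypothetical. The sketch also assumes that fields can be split on any whitespace (the official format uses tabs) and that IDs are plain integers, so it will not handle multiword tokens or forms containing spaces.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Parse CoNLL-U style text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("# sent_id"):
            # A blank line or a new sent_id comment closes the open sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # other comments, e.g. "# text = ..."
        values = line.split()
        if len(values) != len(FIELDS):
            raise ValueError(f"expected 10 fields, got {len(values)}: {line!r}")
        token = dict(zip(FIELDS, values))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])  # 0 marks the root of the sentence
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


def root_form(sentence):
    """Return the surface form of the token whose head is 0 (the root)."""
    return next(tok["form"] for tok in sentence if tok["head"] == 0)


SAMPLE = """\
# sent_id = 1
# text = I like cookies .
1 I -PRON- PRON PRP PronType=prs 2 nsubj _ _
2 like like VERB VBP VerbForm=fin|Tense=pres 0 ROOT _ _
3 cookies cookie NOUN NNS Number=plur 2 dobj _ _
4 . . PUNCT . PunctType=peri 2 punct _ _
"""

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        print(root_form(sentence), [tok["deprel"] for tok in sentence])
```

Each token ends up as a small dictionary, which makes it easy to follow the `head` pointers and walk the dependency tree.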
BramVanroy/spacy_conll
This module allows you to parse a text to CoNLL-U format. You can use it as a command line tool, or embed it in your…
github.com
I Recommend Research Papers
Santosh created Natural Language Recommendations, a recommendation engine for searching research papers with natural-language queries! It was trained on abstracts, so the longer the description, the better the search results.
Check out his GitHub, which also includes a Colab for trialing the engine.
Santosh-Gupta/NaturalLanguageRecommendations
https://colab.research.google.com/github/Santosh-Gupta/NaturalLanguageRecommendations/blob/master/notebooks/inference/De…
github.com
Dataset of the Week: ReCoRD
What is it:
"A reading comprehension dataset which requires commonsense reasoning."
Sample:
ReCoRD
The Reading comprehension with Common-sense Reasoning Dataset (ReCoRD) is a new reading comprehension dataset requiring…
sheng-z.github.io
Where is it?
ReCoRD
The Reading comprehension with Commonsense Reasoning Dataset (ReCoRD) is a new reading comprehension dataset requiring…
sheng-z.github.io
Meanwhile, Back at the Vegas Ranch…
Last but not least, I can't believe this is true, but apparently indeed.com has a job posting calling for an escort with a Department of Defense "TOP-SECRET" clearance?!?!? 🤣🤣🤣 Are you qualified? Post below:
Defense Contract Jobs, Employment in Las Vegas, NV | Indeed.com
57 Defense Contract jobs available in Las Vegas, NV on Indeed.com. Apply to Employment Lawyer, Technician, Commander…
www.indeed.com
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
If you enjoyed this article, help us out and share with friends or social media!
For complete coverage, follow our twitter: @Quantum_Stat
Published via Towards AI