NLP News Cypher | 01.19.20

Last Updated on July 24, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.

Towards AI and Infinity

And… we’re back! What a week!

First, we are happy to announce that we are now publishing our weekly blog on Towards AI’s platform 💪💪! We’re excited about this publishing partnership, as we intend to bring NLP trends to a more global audience, from developers in NYC to business professionals in Hong Kong.

And talking about global…

In case you were busy this past week: we dropped “The Big Bad NLP Database”, a large collection of datasets for ML and NLP developers! The database continues to grow, and we have already received excellent recommendations from our users. Updates coming very soon!

Wednesday’s Announcement Article:

NLP Dataset Library

100s of Datasets for ML Developers (and counting)

medium.com

Database:

The Big Bad NLP Database – Quantum Stat

Datasets for various tasks in Natural Language Processing

quantumstat.com

Thank you for all of your support!

This Week:

The Reformer Transformer

Speaking Hearts and Minds

Recognizing Facebook’s Real-Time Speech

Deep Hacks

Wolfram Webinars on Data Analytics

CoNLL Meet Spacy

I Recommend Research Papers

Dataset of the Week: ReCoRD

Meanwhile, Back at the Vegas Ranch…

The Reformer Transformer

“A Transformer model designed to handle context windows of up to 1 million words, all on a single accelerator and using only 16GB of memory.”

Google started the new year with a bang and a new transformer. The new model aims to solve two problems that weigh on transformers with large input sequences: attention and memory.

Attention is difficult to scale to a large number of words. To address this, Google introduced a hashing technique (locality-sensitive hashing) that lets the model efficiently “connect” similar vectors together and divide them into chunks. Applying attention only within these chunks reduces the computational load.
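To make the hashing idea a bit more concrete, here is a minimal sketch in plain Python. It assumes the random-projection (angular) hashing scheme described in the Reformer paper, where a vector is projected by a random matrix and its bucket is the argmax over the projections and their negations. The function names, sizes and seed are ours, purely for illustration; this is not Google's implementation.

```python
import random


def random_projections(dim, n_buckets, seed=0):
    """Draw a random dim x (n_buckets // 2) Gaussian projection matrix.

    Illustrative helper (hypothetical name); n_buckets is assumed to be even.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_buckets // 2)]
            for _ in range(dim)]


def lsh_bucket(vec, proj):
    """Angular LSH: project the vector, concatenate [xR, -xR], take the argmax."""
    n_cols = len(proj[0])
    projected = [sum(v * row[j] for v, row in zip(vec, proj))
                 for j in range(n_cols)]
    scores = projected + [-p for p in projected]
    return max(range(len(scores)), key=scores.__getitem__)


def group_by_bucket(vectors, proj):
    """Map each bucket id to the positions of the vectors that hashed into it."""
    buckets = {}
    for idx, vec in enumerate(vectors):
        buckets.setdefault(lsh_bucket(vec, proj), []).append(idx)
    return buckets


proj = random_projections(dim=4, n_buckets=8, seed=1)
a = [1.0, 0.5, -0.2, 0.3]
b = [2 * x for x in a]   # same direction as a, so it lands in the same bucket
c = [-x for x in a]      # opposite direction, so it lands in a different bucket
groups = group_by_bucket([a, b, c], proj)
```

Vectors pointing in similar directions tend to share a bucket, so attention only needs to be computed within each group rather than across the whole sequence.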

The memory problem arises in a multi-layered model because the activations at each layer must be saved for the backward pass. This can blow up your GPU’s memory (aka OOM errors).

LINK

To mitigate this issue, Google turned to reversible layers (a technique discussed in the paper above). Instead of storing the activations of each layer in memory, the model recomputes them from the following layer’s activations during the backward pass.
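The reversible trick is easier to see in code. Below is a minimal sketch of a reversible residual block in plain Python: the input is split into two halves, and the inverse function rebuilds the inputs from the outputs alone, so nothing needs to be cached. The toy functions f and g are stand-ins we made up for the attention and feed-forward sub-layers; this is an illustration of the idea, not the Reformer code.

```python
import math


def f(x):
    """Toy stand-in for the attention sub-layer (hypothetical)."""
    return [math.tanh(v) for v in x]


def g(x):
    """Toy stand-in for the feed-forward sub-layer (hypothetical)."""
    return [0.5 * v * v for v in x]


def rev_forward(x1, x2):
    """y1 = x1 + f(x2); y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def rev_inverse(y1, y2):
    """Recover the inputs from the outputs: x2 = y2 - g(y1); x1 = y1 - f(x2)."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


x1, x2 = [0.1, -0.4, 2.0], [1.5, 0.3, -0.7]
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)  # matches x1, x2 up to floating-point error
```

Because each layer's inputs can be rebuilt from its outputs, only the activations of the last layer have to be kept around for the backward pass, at the cost of some extra compute.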

Blog:

Reformer: The Efficient Transformer

Understanding sequential data – such as language, music or videos – is a challenging task, especially when there is…

ai.googleblog.com

Colab for Text Generation:

Google Colaboratory

colab.research.google.com

Speaking Hearts and Minds

New research from Facebook’s conversational AI research group, ParlAI, presents a single model that performs well on several image-grounded conversational tasks (aka multi-modal). Below you can find a few example outputs:

Paper:

LINK

Recognizing Facebook’s Real-Time Speech

Facebook open-sourced their wav2letter@anywhere speech recognition framework. The take-away here is that inference for this framework is geared for real-time performance. Also, it achieves SOTA performance on the LibriSpeech dataset!

Blog:

Online speech recognition with wav2letter@anywhere

The process of transcribing speech in real time from an input audio stream is known as online speech recognition. Most…

ai.facebook.com

GitHub:

facebookresearch/wav2letter

In the paper we are considering: different architectures for acoustic modeling: different criterions: different…

github.com

Deep Hacks

When engineering, the cookie crumbles on the micro-level. Yes, you can achieve great results from a large fine-tuned transformer, but you still need to hone your skills on the classics (import re). Priyansh Trivedi drops a few jewels on this topic in his blog:

Bag of Tricks 👜 for NLP Models – (Part 1)

This is the first post in a series of simple, part-obvious, and (largely) independent hacks to increase the performance…

priyansh.page

Wolfram Webinars on Data Analytics

There are three 90-minute Wolfram webinars coming up that will highlight custom-built Twitter analytics, data mining of imaginary maps, and how to create an automated reporting system (and much more). If the Wolfram Language and these subjects interest you, you can register here.

Blog:

3 Free Wolfram U Webinars Showcasing Innovative Data Science Applications-Wolfram Blog

January 16, 2020 – Jamie Peterson, Technical Programs Manager, Wolfram U Looking to fulfill your New Year's resolution…

blog.wolfram.com

CoNLL Meet Spacy

Well now. Special shout-out to Bram. He updated the spacy_conll repo, which allows you to parse your text into CoNLL-U format. The plugin can now be used as a custom pipeline component from the command line or in a Python script.

What’s CoNLL you may ask?

> python -m spacy_conll --input_str "I like cookies . What about you ?" --is_tokenized --include_headers
# sent_id = 1
# text = I like cookies .
1 I -PRON- PRON PRP PronType=prs 2 nsubj _ _
2 like like VERB VBP VerbForm=fin|Tense=pres 0 ROOT _ _
3 cookies cookie NOUN NNS Number=plur 2 dobj _ _
4 . . PUNCT . PunctType=peri 2 punct _ _

# sent_id = 2
# text = What about you ?
1 What what NOUN WP PronType=int|rel 2 dep _ _
2 about about ADP IN _ 0 ROOT _ _
3 you -PRON- PRON PRP PronType=prs 2 pobj _ _
4 ? ? PUNCT . PunctType=peri 2 punct _ _
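If you want to poke at output like this yourself, here is a minimal sketch of reading it in plain Python. It relies on the ten standard CoNLL-U columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The official format separates columns with tabs; this sketch assumes splitting on any whitespace is good enough, which holds for the sample above because none of its fields contain spaces. The parse_conllu helper is our own illustrative name and is not part of spacy_conll.

```python
# The ten standard CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (each a list of token dicts).

    Assumption: fields contain no spaces, so whitespace splitting is safe.
    """
    sentences, tokens = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # a blank line ends a sentence
            if tokens:
                sentences.append(tokens)
                tokens = []
        elif not line.startswith("#"):    # skip comment lines such as "# sent_id = 1"
            tokens.append(dict(zip(FIELDS, line.split())))
    if tokens:
        sentences.append(tokens)
    return sentences


sample = """\
# sent_id = 1
# text = I like cookies .
1 I -PRON- PRON PRP PronType=prs 2 nsubj _ _
2 like like VERB VBP VerbForm=fin|Tense=pres 0 ROOT _ _
3 cookies cookie NOUN NNS Number=plur 2 dobj _ _
4 . . PUNCT . PunctType=peri 2 punct _ _

# sent_id = 2
# text = What about you ?
1 What what NOUN WP PronType=int|rel 2 dep _ _
2 about about ADP IN _ 0 ROOT _ _
3 you -PRON- PRON PRP PronType=prs 2 pobj _ _
4 ? ? PUNCT . PunctType=peri 2 punct _ _
"""

sentences = parse_conllu(sample)
# The root of each sentence is the token whose head is "0".
roots = [tok["form"] for sent in sentences for tok in sent if tok["head"] == "0"]
```

Each token becomes a small dictionary, so the dependency tree can be walked by following the head column back to the token with that id.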

BramVanroy/spacy_conll

This module allows you to parse a text to CoNLL-U format. You can use it as a command line tool, or embed it in your…

github.com

I Recommend Research Papers

Santosh created Natural Language Recommendations, a recommendation engine for searching research papers using natural language! It was trained on abstracts, so the longer the description, the better the search results.

Check out his GitHub repo, which also includes a Colab for trying out the engine.

Santosh-Gupta/NaturalLanguageRecommendations

https://colab.research.google.com/github/Santosh-Gupta/NaturalLanguageRecommendations/blob/master/notebooks/inference/De…

github.com

Dataset of the Week: ReCoRD

What is it:

“A reading comprehension dataset which requires commonsense reasoning.”

Sample:

ReCoRD

The Reading comprehension with Common-sense Reasoning Dataset (ReCoRD) is a new reading comprehension dataset requiring…

sheng-z.github.io

Where is it?

ReCoRD

The Reading comprehension with Commonsense Reasoning Dataset (ReCoRD) is a new reading comprehension dataset requiring…

sheng-z.github.io

Meanwhile, Back at the Vegas Ranch…

Last but not least, I can’t believe this is true, but apparently indeed.com has a job posting calling for an escort with a Department of Defense “TOP-SECRET” clearance?!?!? 🤣🤣🤣 Are you qualified? Post below:

Defense Contract Jobs, Employment in Las Vegas, NV | Indeed.com

57 Defense Contract jobs available in Las Vegas, NV on Indeed.com. Apply to Employment Lawyer, Technician, Commander…

www.indeed.com

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

If you enjoyed this article, help us out and share it with friends or on social media!

For complete coverage, follow us on Twitter: @Quantum_Stat

Published via Towards AI
