NLP News Cypher | 09.13.20
Last Updated on July 24, 2023 by Editorial Team
Author(s): Ricky Costa
Originally published on Towards AI.
NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
Aere Perennius
Welcome back. Hope you enjoyed your week! We have another update that came in this Friday: another 11 datasets and 5 new notebooks added. Want to thank all who contributed going back all the way to last week’s update: Won Ik Cho, Gayatri Venugopal, Moin Nadeem, Himanshu Chaudhary, Vincent Wu, Prafull Sharma, Yeshwanth Reddy & Manu Romero!
In case you are feeling adventurous: the truth is out there for Lex Fridman, who, out of left field, interviewed Cmdr. Fravor, an F/A-18 fighter pilot who engaged a UFO back in 2004 off the coast of Southern California in what is known colloquially as the “Nimitz incident”. 👽
https://www.youtube.com/embed/aB8zcAttP1E
Blast from the past: Check out this old (2017) blog post from Google introducing transformer models. Wild to see how much progress has been made in the field of NLP in the last couple of years. 😊
Google AI Blog
Machine learning (ML) is a key strategic focus at Google, with highly active groups pursuing research in virtually all…
ai.googleblog.com
Anon letter: Richard Socher, ex-Salesforce CSO, recently left the company to start his own venture, which at the time of this writing remains in stealth. While little is known about this startup, it seems he’s looking to make the internet a safer place by fixing misinformation. His company is recruiting for select positions; apply if interested:
About Us
The internet is broken. Yes, we have access to more information than ever before, but too often, hate and…
su-sea.github.io
Infosec news: Stay mobile and portable with a USB. Learn more about Tails OS here:
Tails – Home
Tails uses the Tor network to protect your privacy online and help you avoid censorship. Enjoy the Internet like it…
tails.boum.org
A backdoor: https://littlealchemy2.com/
This Week
KILT
Deleting Zeros
Korean ASR Library
Data Readiness for Applied NLPers
PyTorch and KGEs
Honorable Mentions 🙉
Dataset of the Week: StereoSet
KILT
As you may already have experienced, your next NLP project may require you to work with knowledge-intensive tasks such as open-domain question answering or fact-checking. Benchmarking these knowledge-intensive tasks can be difficult because they require a huge knowledge source to feed off of (and things get even harder when you have various knowledge sources to work with). A new benchmark from Facebook AI, called KILT, gives researchers a centralized baseline to start their research and benchmark model performance for these tough tasks. It leverages an interface across tasks grounded in a single knowledge source: the 2019/08/01 Wikipedia snapshot containing 5.9M articles. Here are the tasks you’ll work with in KILT: fact checking, open-domain question answering, slot filling, entity linking, and dialogue.
Here’s what each Wiki record looks like:
{
 'wikipedia_title': 'Email marketing',
 'wikipedia_id': 1101759,
 'text': ['p1', 'p2', ..., 'pn'],  # list of paragraph text
 'anchors': [{'text': ..., 'href': ..., 'paragraph_id': ..., 'start': ..., 'end': ...}],
 'categories': 'comma separated list of categories',
 'history': ...,  # some info from Wikipedia, including original url
 'wikidata_info': ...,  # Wikidata info
}
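Records with this shape can be walked with a few lines of plain Python. A minimal sketch, with made-up values standing in for the real record contents:

```python
# Minimal sketch: walking a KILT-style Wikipedia record.
# Field names mirror the snippet above; the values here are made up.
record = {
    "wikipedia_title": "Email marketing",
    "wikipedia_id": 1101759,
    "text": ["First paragraph.", "Second paragraph."],
    "anchors": [
        {"text": "marketing", "href": "Marketing",
         "paragraph_id": 0, "start": 6, "end": 15}
    ],
    "categories": "advertising,email",
}

def flatten_record(record):
    """Join the paragraph list into one passage and pull out anchor targets."""
    passage = " ".join(record["text"])
    links = [a["href"] for a in record["anchors"]]
    return passage, links

passage, links = flatten_record(record)
print(passage)  # "First paragraph. Second paragraph."
print(links)    # ["Marketing"]
```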
GitHub:
facebookresearch/KILT
The KILT benchmark is described in the following paper: https://arxiv.org/abs/2009.02252 conda create -n kilt37 -y…
github.com
Paper: https://arxiv.org/pdf/2009.02252.pdf
Deleting Zeros
Our NN models are dense beasts that love linear algebra (aka matrix multiplication). But that density isn’t very efficient. We can actually save space in these matrices by getting rid of the zeros, without losing much performance. Without getting into the weeds of sparsity, to spare you my naiveté on the subject, here’s François Lagunas’ intuitive take from his previous blog post:
“The sparsity of the matrix is the fraction of zeros against the size of the matrix.
The pros? If you have a lot of zeros, you don’t have to compute some multiplications, and you don’t have to store them. So you may gain on size and speed, for training and inference.
The cons? Of course, having all these zeros will probably have an impact on network accuracy/performance. But to what extent? You may be surprised.”
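The definition quoted above is easy to check in code. A minimal NumPy sketch (the pruning threshold here is arbitrary, purely for illustration):

```python
import numpy as np

# Sparsity, per the quote above, is the fraction of zeros in the matrix.
rng = np.random.default_rng(0)
m = rng.standard_normal((64, 64))
m[np.abs(m) < 1.0] = 0.0  # prune small weights to zero (illustrative threshold)

sparsity = 1.0 - np.count_nonzero(m) / m.size
print(f"sparsity: {sparsity:.2f}")  # fraction of entries you could skip storing
```

With a standard normal matrix, this threshold zeroes out roughly two thirds of the entries; the point is simply that the zeroed entries need neither storage nor multiplication.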
FYI, for a more in-depth discussion/history of sparsity, you can check out HF’s new blog here. What’s cool is that you can begin your sparsity adventure with a new Hugging Face notebook that helps replace a linear block with a sparse one! It’s fairly straightforward to execute. For more intuition, check out their notebooks below.
huggingface/pytorch_block_sparse
This PyTorch extension provides a drop-in replacement for torch.nn.Linear using block sparse matrices instead of dense…
github.com
Notebook (6 hidden layer RoBERTa):
Keep in mind this configures a model PRIOR to training (notice they are not calling up any checkpoints).
Oh, and you need a GPU 🤭
huggingface/pytorch_block_sparse
github.com
Notebook:
An implementation for initializing a dataset and tokenizer, and training a sparse language model.
huggingface/pytorch_block_sparse
github.com
Korean ASR Library
A new speech recognition library for the Korean language is out: KoSpeech, built on PyTorch. It also includes preprocessing methods for the KsponSpeech corpus and a baseline model. 😎
GitHub (profile pic of the year candidate):
sooftware/KoSpeech
Kakao Brain, The Kwangwoon University of Electronic Information Technology, The Kwangwoon University of Information…
github.com
Data Readiness for Applied NLPers
Taking data seriously in NLP? Oftentimes, we overlook pain points that arise when dealing with data prior to the start of a project. As a result, folks at RISE in Sweden wrote an interesting white paper on data readiness for those applying NLP across businesses/institutions. Here’s a pithy example of what stakeholders should keep an eye on with respect to accessibility:
Does the data exist? Is the data required to address the task even recorded?
Data conversion and encoding. One of the major challenges faced within NLP is the conversion of documents from a source format, e.g., PDF, Word or Excel, to a format suitable for addressing the task at hand. In order to move beyond Band C, data conversion and encoding have to be in place.
Legal aspects of accessibility. Not only should the data be available to the intended team, but the team and the result of their efforts to produce a solution to the task at hand should also be cleared with respect to the legal aspects of accessing and handling the data. This includes, e.g., the handling of personally identifiable information, and copyright issues.
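The conversion-and-encoding point can be made concrete with a tiny check before ingestion. A hedged sketch, assuming nothing from the paper beyond the general idea; the candidate encoding list is my own, not a recommendation from the authors:

```python
# Sketch of a pre-ingestion readiness check: try a few candidate encodings
# before feeding raw bytes into an NLP pipeline. The encoding order below
# is an assumption for illustration, not guidance from the RISE paper.
def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp1252", "latin-1")):
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")

# cp1252 bytes are not valid UTF-8 here, so the fallback kicks in.
text, used = decode_with_fallback("naïve café".encode("cp1252"))
print(used)  # cp1252
```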
PyTorch and KGEs
The millionth PyTorch library to come out this year 😭😭. TorchKGE: if you are into link prediction, you can check out their library here:
GitHub
torchkge-team/torchkge
TorchKGE: Knowledge Graph embedding in Python and Pytorch. TorchKGE is a Python module for knowledge graph (KG)β¦
github.com
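For intuition, here is the TransE idea behind this kind of link prediction, sketched in plain NumPy rather than TorchKGE’s API; the embeddings are toy values:

```python
import numpy as np

# TransE scores a triple (head, relation, tail) as plausible when
# head + relation ≈ tail, i.e. score = -||h + r - t||. Toy 2-D embeddings.
def transe_score(h, r, t):
    return -np.linalg.norm(h + r - t)

h = np.array([1.0, 0.0])
r = np.array([0.0, 1.0])
t_true = np.array([1.0, 1.0])   # exactly h + r, so best possible score
t_false = np.array([5.0, 5.0])  # far from h + r, so a much worse score

print(transe_score(h, r, t_false) < transe_score(h, r, t_true))  # True
```

Libraries like TorchKGE train such embeddings so that observed triples in the knowledge graph score higher than corrupted ones.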
Honorable Mentions 🙉
Multi-Modal Machine Translation:
DeepLearnXMU/MM-DCCN
The code for “Dynamic Context-guided Capsule Network for Multimodal Machine Translation” in PyTorch. This project is…
github.com
Paper: https://arxiv.org/pdf/2009.02016.pdf
Fill-in-the-Blank LM
How to Fill in the Blanks with Language Models
When editing or revising we often write in a non-linear manner. Writing an email… An existing system might suggest…
ai.stanford.edu
Tutorial for Troubleshooting Batch Sizes if You Have Memory Problems
Tutorial: training on larger batches with less memory in AllenNLP
This is part of a series of mini-tutorials to help you with various aspects of the AllenNLP library.
medium.com
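One trick tutorials like this cover is gradient accumulation. A framework-free sketch of why it works, using a linear least-squares model: weighting each micro-batch gradient by its share of the full batch reproduces the full-batch gradient, so memory scales with the micro-batch size instead.

```python
import numpy as np

# Gradient accumulation sketch: four micro-batches of 2 reproduce
# the gradient of one batch of 8 for a mean-squared-error objective.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((8, 3)), rng.standard_normal(8)
w = np.zeros(3)

def grad(Xb, yb, w):
    # gradient of 0.5 * mean((Xb @ w - yb)**2) with respect to w
    return Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)  # one big batch

acc = np.zeros(3)
for i in range(0, 8, 2):  # micro-batches of 2
    acc += grad(X[i:i+2], y[i:i+2], w) * (2 / 8)  # weight by batch share

print(np.allclose(full, acc))  # True
```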
Dataset of the Week: StereoSet
What is it?
StereoSet is a dataset that measures stereotype bias in language models. It consists of 17,000 sentences that measure model preferences across gender, race, religion, and profession.
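A simplified sketch of how such a preference score could be computed; this is not the official StereoSet scoring code, and the log-likelihoods below are made up:

```python
# Simplified StereoSet-style scoring sketch (not the official metric code):
# for each context, a model assigns a likelihood to a stereotypical and an
# anti-stereotypical continuation; the score is the percentage of examples
# where the stereotypical one wins. 50 would be the unbiased ideal.
examples = [  # made-up model log-likelihoods
    {"stereotype": -1.2, "anti_stereotype": -1.5},
    {"stereotype": -0.8, "anti_stereotype": -0.7},
    {"stereotype": -2.0, "anti_stereotype": -2.4},
    {"stereotype": -1.1, "anti_stereotype": -1.0},
]

wins = sum(e["stereotype"] > e["anti_stereotype"] for e in examples)
stereotype_score = 100.0 * wins / len(examples)
print(stereotype_score)  # 50.0 here; unbiased models sit near 50
```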
Sample:
StereoSet
Explore how models interact with StereoSet.
stereoset.mit.edu
Where is it?
StereoSet
StereoSet is a dataset that measures stereotype bias in language models. StereoSet consists of 17,000 sentences that…
stereoset.mit.edu
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
If you enjoyed this article, help us out and share with friends!
For complete coverage, follow our Twitter: @Quantum_Stat
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI, from research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.
Published via Towards AI