

NLP News Cypher | 09.13.20

Last Updated on July 24, 2023 by Editorial Team

Author(s): Quantum Stat

The Ninth Wave (1850) Ivan Aivazovsky

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

Aere Perrenius

Welcome back. Hope you enjoyed your week! We have another update that came in this Friday: another 11 datasets and 5 new notebooks were added. We want to thank everyone who contributed, going all the way back to last week’s update: Won Ik Cho, Gayatri Venugopal, Moin Nadeem, Himanshu Chaudhary, Vincent Wu, Prafull Sharma, Yeshwanth Reddy & Manu Romero!

In case you are feeling adventurous: The truth is out there for Lex Fridman, who, out of left field, interviewed Cmdr. Fravor, an F/A-18 fighter pilot who engaged a UFO back in 2004 off the coast of Southern California, in what is known colloquially as the “Nimitz incident”.

Blast from the past: Check out this old (2017) blog post from Google introducing transformer models. Wild to see how much progress has been made in the field of NLP in the last couple of years.

Google AI Blog

Anon letter: Richard Socher, ex-Salesforce CSO, recently left the company to start his own venture, which, at the time of this writing, remains in stealth. While little is known about this startup, it seems he’s looking to make the internet a safer place by fixing misinformation. His company is recruiting for select positions; apply if interested:

About Us

Infosec news: Stay mobile and portable with a USB. Learn more about Tails OS here:

Tails – Home

A backdoor: https://littlealchemy2.com/

This Week

KILT

Deleting Zeros

Korean ASR Library

Data Readiness for Applied NLPers

PyTorch and KGEs

Honorable Mentions

Dataset of the Week: StereoSet

KILT

As you may have already experienced, your next NLP project may require you to work with knowledge-intensive tasks such as open-domain question answering or fact-checking. Benchmarking these tasks can be difficult because they require a huge knowledge source to draw on (and things get even harder when you have several knowledge sources to work with). To address this, a new benchmark from Facebook AI called KILT gives researchers a centralized baseline for starting their research and benchmarking model performance on these tough tasks. It uses a common interface across tasks, all grounded in a single knowledge source: the 2019/08/01 Wikipedia snapshot containing 5.9M articles. Here are the tasks you’ll work with in KILT: fact checking, open-domain question answering, slot filling, entity linking, and dialogue.

Here’s what each Wiki record looks like:

{
    'wikipedia_title': 'Email marketing',
    'wikipedia_id': 1101759,
    'text': ['p1', 'p2', ..., 'pn'],  # list of paragraph text
    'anchors': [{"text":, "href":, "paragraph_id":, "start":, "end":}],
    'categories': 'comma separated list of categories',
    'history': # some info from wikipedia, including original url
    'wikidata_info': # wikidata info
}
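To get a feel for how the record above might be used, here is a minimal Python sketch that walks the anchors of one record and pulls out the linked text. It assumes that `start` and `end` are character offsets into the paragraph selected by `paragraph_id`, and that records are stored as JSON objects; both are assumptions on my part, so check the KILT repo before relying on them. The sample paragraph, the `href` value, and the helper name `anchor_spans` are made up for illustration.

```python
import json


def anchor_spans(record):
    """Yield (surface_text, href) for each anchor in a KILT-style record."""
    for anchor in record["anchors"]:
        paragraph = record["text"][anchor["paragraph_id"]]
        # Assumption: start/end are character offsets within that paragraph.
        yield paragraph[anchor["start"]:anchor["end"]], anchor["href"]


# Made-up record: only the field names come from the layout shown above.
paragraph = "Email marketing is the act of sending a commercial message."
start = paragraph.index("commercial message")
record = {
    "wikipedia_title": "Email marketing",
    "wikipedia_id": 1101759,
    "text": ["Email marketing\n", paragraph],
    "anchors": [{
        "text": "commercial message",
        "href": "Advertising",
        "paragraph_id": 1,
        "start": start,
        "end": start + len("commercial message"),
    }],
    "categories": "Email,Digital marketing",
}

# Round-trip through JSON to mimic loading a stored record (assumed format).
record = json.loads(json.dumps(record))
links = list(anchor_spans(record))
print(links)
```

Slicing by offsets rather than trusting the anchor's own `text` field is a cheap way to sanity-check that the offsets and paragraphs line up.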

GitHub:

facebookresearch/KILT

Paper: https://arxiv.org/pdf/2009.02252.pdf

Deleting Zeros

Our NN models are dense beasts that love linear algebra (aka matrix multiplication). But that density isn’t very efficient: many weights can be set to zero, and we can then reclaim space in the matrix by dropping those zeros, often without losing much performance. Without getting into the weeds of sparsity, to spare you my naiveté on the subject, here’s François Lagunas’ intuitive take from his previous blog post:

“The sparsity of the matrix is the fraction of zeros against the size of the matrix

The pros? If you have a lot of zeros, you don’t have to compute some multiplications, and you don’t have to store them. So you may gain on size and speed, for training and inference.

The cons? Of course, having all these zeros will probably have an impact on network accuracy/performance. But to what extent? You may be surprised.”
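To make that definition concrete, here is a small pure-Python sketch (not the Hugging Face implementation, which uses block-sparse GPU kernels) that computes the sparsity of a weight matrix and counts how many multiplications a matrix-vector product needs once the zeros are skipped. The function names and the toy matrix are hypothetical.

```python
def sparsity(matrix):
    """Fraction of zero entries relative to the total number of entries."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


def sparse_matvec(matrix, vector):
    """Matrix-vector product that skips zero weights.

    Returns the result vector and the number of multiplications performed.
    """
    result, multiplications = [], 0
    for row in matrix:
        acc = 0.0
        for weight, x in zip(row, vector):
            if weight != 0:
                acc += weight * x
                multiplications += 1
        result.append(acc)
    return result, multiplications


# Toy 3x4 weight matrix with 9 zeros out of 12 entries.
W = [
    [0, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 3],
]
x = [1, 2, 3, 4]

s = sparsity(W)
y, mults = sparse_matvec(W, x)
print(s, y, mults)
```

A dense product over the same matrix would perform 12 multiplications; skipping zeros leaves only the nonzero weights to multiply and store, which is the "pros" half of the quote.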

FYI, for a more in-depth discussion/history on sparsity, you can check out HF’s new blog here. What’s cool is that you can begin your sparsity adventure with a new Hugging Face notebook that helps replace a linear block with a sparse one! It’s fairly straightforward to execute. For more intuition, check out their notebooks below.

huggingface/pytorch_block_sparse

Notebook (6 hidden layer RoBERTa):

Keep in mind this configures a model PRIOR to training (notice they are not loading any checkpoints).

Oh, and you need a GPU.

huggingface/pytorch_block_sparse

Notebook:

An implementation for initializing a dataset and tokenizer, and training a sparse language model.

huggingface/pytorch_block_sparse

Korean ASR Library

A new speech recognition library for the Korean language is out. It’s called KoSpeech and is based on PyTorch. It also includes preprocessing methods for the KsponSpeech corpus and a baseline model.

GitHub (profile pic of the year candidate):

sooftware/KoSpeech

Data Readiness for Applied NLPers

Taking data seriously in NLP? Oftentimes, we overlook pain points that can arise when dealing with data prior to the start of a project. With that in mind, folks at RISE in Sweden wrote an interesting white paper on data readiness for those applying NLP across businesses/institutions. Here’s a pithy example of what stakeholders should keep their eye on with respect to accessibility:

Does the data exist? Is the data required to address the task even recorded?

Data conversion and encoding. One of the major challenges faced within NLP is the conversion of documents from a source format, e.g., PDF, Word or Excel, to a format suitable for addressing the task at hand. In order to move beyond Band C, data conversion and encoding have to be in place.

Legal aspects of accessibility. Not only should the data be available to the intended team, but the team and the result of their efforts to produce a solution to the task at hand should also be cleared with respect to the legal aspects of accessing and handling the data. This includes, e.g., the handling of personally identifiable information, and copyright issues.

PyTorch and KGEs

The millionth PyTorch library to come out this year: TorchKGE. If you are into link prediction, you can check out their library here:

GitHub

torchkge-team/torchkge

Honorable Mentions

Multi-Modal Machine Translation:

DeepLearnXMU/MM-DCCN

Paper: https://arxiv.org/pdf/2009.02016.pdf

Fill-in-the-Blank LM

How to Fill in the Blanks with Language Models

Tutorial for Troubleshooting Batch Sizes if You Have Memory Problems

Tutorial: training on larger batches with less memory in AllenNLP

Dataset of the Week: StereoSet

What is it?

StereoSet is a dataset that measures stereotype bias in language models. It consists of 17,000 sentences that measure model preferences across gender, race, religion, and profession.

Sample:

StereoSet

Where is it?

StereoSet

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

If you enjoyed this article, help us out and share with friends!

For complete coverage, follow our Twitter: @Quantum_Stat

www.quantumstat.com


NLP News Cypher | 09.13.20 was originally published in Towards AI - Multidisciplinary Science Journal on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI
