

NLP News Cypher | 10.18.20

Last Updated on July 24, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.

Photo by Fatih Yürür on Unsplash



Crash Override

Hey, welcome back! It’s a brand new week. In case you missed it, we tapped into the gold mine of arXiv’s new code link feature (mentioned in last week’s newsletter). arXiv recently collaborated with PapersWithCode to link any associated repos to their accompanying paper right on the abstract page (which is much better than stalking the PDF). 👇


Well… we wanted to know if we could extract links for, say… all of the NLP-related papers published in a trailing 5-day week?!?! Please note that papers on arXiv are published Mon-Fri, and the weekly total of NLP-related papers ranges between 300 and 500.

This past week, there were 330 papers published in the Computation and Language directory. Of these, 108 had GitHub links 👀. That’s roughly a .300 batting average (which is slightly above the 17–20% rate of late). Most of the rest didn’t have code but were linked to the paper’s page on the PapersWithCode website. The last 11 or so included neither and were excluded, giving us a total of 319. (Keep in mind that code can be added later on, so some of the abstracts may have been populated with code in the last 72 hours, and thus my stats may be slightly off.)

The data dump has 3 fields:

URL of the abstract

Title of the abstract

Code link to GitHub (if available) or PapersWithCode (PwC)


arxiv_10_17

id,url,title,code

1, https://arxiv.org/abs/2010.07375 ,[2010.07375] Decoding Methods for Neural Narrative…
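For anyone who wants to poke at a dump in this layout, here is a minimal sketch of how the GitHub share could be computed. It assumes the `id,url,title,code` header shown in the preview above and treats any `code` value containing "github.com" as a GitHub link. The sample rows, the `SAMPLE_CSV` name, and the `github_share` helper are hypothetical placeholders for illustration, not entries or code from the actual dump.

```python
import csv
import io

# Hypothetical sample rows in the same id,url,title,code layout as the dump.
# The URLs and titles below are placeholders, not real entries.
SAMPLE_CSV = """id,url,title,code
1,https://arxiv.org/abs/0000.00001,Placeholder paper A,https://github.com/example/repo-a
2,https://arxiv.org/abs/0000.00002,Placeholder paper B,https://paperswithcode.com/paper/placeholder-b
3,https://arxiv.org/abs/0000.00003,Placeholder paper C,https://github.com/example/repo-c
"""


def github_share(csv_text):
    """Return (rows_with_github, total_rows, share) for a dump in this layout."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    rows = list(reader)
    # Assumption: a GitHub link is any code value containing "github.com".
    with_github = sum(
        1 for row in rows if "github.com" in (row["code"] or "").lower()
    )
    total = len(rows)
    share = with_github / total if total else 0.0
    return with_github, total, share


if __name__ == "__main__":
    hits, total, share = github_share(SAMPLE_CSV)
    print(f"{hits} of {total} rows link to GitHub ({share:.1%})")
```

With the counts reported above, the same ratio works out to 108 / 330 ≈ 0.327.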


Cool fact: Older papers have been retroactively appended with code links even though this feature is only a week old. For example, https://arxiv.org/abs/2005.11787 was submitted in May but has a code URL.

In conclusion, this has been an awesome time experimenting, and the amount of data obtained is kind of nuts. Tons of new libraries and associated notebooks were discovered pretty fast.

BTW, if you want to get an awesome selection of NLP papers and other research news, you can always sign up to our newsletter. (FYI, by ‘NLP papers’ I’m not referring to an arXiv data dump like the one linked above; this was just a one-off 😥)

Sign-up here: https://quantumstat.com

…if you can replicate this adventure, you are a true Jedi Master. 👩‍💻

IBM research papers at INTERSPEECH 2020

talkin’ about papers…

IBM Research at INTERSPEECH 2020 | IBM Research Blog

The 21st INTERSPEECH Conference will take place as a fully virtual conference from October 25 to October 29…


New Stanford Seminar

For those interested in building modern ML stacks and applying them in the real world, this new Stanford seminar series just got off the ground:

Last but not least, special thanks to Sebastian Ruder for giving the Super Duper NLP Repo a nice shout-out in his newsletter! 😎

ML and NLP starter toolkit, Low-resource NLP toolkit, "Can a LM understand natural language?", The…

Hi all, It has been a while… I hope you're doing well in these crazy and strange times. COVID-19 has affected each of…


Spotify Open-Sources Klio

Spotify is on GitHub bruh! Klio is their file processing library that allows you to process large audio files (or any binary file). It’s built on top of Apache Beam.



Klio is an ecosystem that allows you to process audio files – or any binary files – easily and at scale. Klio jobs are…


Post-Deployment AI Management

This white paper from O’Reilly discusses MLOps with a focus on models in production. It covers keeping tabs on signs of domain shift in data, service level indicators (so you know the expected performance of a model in production), model monitoring, and much more.

White Paper:

AI Product Management After Deployment

The field of AI product management continues to gain momentum. As the AI product management role advances in maturity…


Multilingual Factual Knowledge Retrieval

Want to know how many facts your language model picked up from the data it was trained on (but in a multilingual style)? X-FACTR is a benchmark that does just this (it also handles multi-token predictions). The benchmark uses factual knowledge expressed as subject-relation-object triples from the T-REx dataset.

“X-FACTR is a multilingual benchmark for probing factual knowledge in language models. Prompts in 23 languages are created by native speakers to probe factual knowledge in LMs by having them fill in the blanks of prompts such as: Punta Cana is located in _.”

Language Support:

  • en (English), fr (French), nl (Dutch), ru (Russian), es (Spanish), jp (Japanese), vi (Vietnamese), zh (Chinese), hu (Hungarian), ko (Korean), tr (Turkish), he (Hebrew), el (Greek), war (Waray), mr (Marathi), mg (Malagasy), bn (Bengali), tl (Tagalog), sw (Swahili), pa (Punjabi), ceb (Cebuano), yo (Yoruba), ilo (Ilokano)
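To make the fill-in-the-blank idea concrete, here is a minimal sketch of how a subject and a relation could be turned into a cloze prompt like the one quoted above. The `TEMPLATES` dictionary, the `[X]`/`[Y]` slot markers, and the `build_prompt` helper are assumptions made for illustration; they are not X-FACTR's actual code or data format, and the repeated blank is only a rough stand-in for a multi-token object.

```python
# Hypothetical relation templates: "[X]" is the subject slot, "[Y]" the object slot.
TEMPLATES = {
    "located_in": "[X] is located in [Y].",
}


def build_prompt(subject, relation, num_blanks=1, blank="_"):
    """Fill the subject into a relation template and leave the object blank.

    num_blanks > 1 mimics a multi-token object by repeating the blank marker.
    """
    if num_blanks < 1:
        raise ValueError("num_blanks must be at least 1")
    template = TEMPLATES[relation]
    blanks = " ".join([blank] * num_blanks)
    return template.replace("[X]", subject).replace("[Y]", blanks)


if __name__ == "__main__":
    print(build_prompt("Punta Cana", "located_in"))
    print(build_prompt("Punta Cana", "located_in", num_blanks=2))
```

A language model would then be asked to fill the blank, and its prediction compared against the object of the original triple.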


Multilingual Factual Knowledge Retrieval from Pretrained Language Models

X-FACTR is a multilingual benchmark for probing factual knowledge in language models. Prompts in 23 languages are…


Annotation-Tools Repo

Mariana Neves has an awesome repo that indexes annotation tools across various domains for your NLP needs. They also have a web app that allows you to search their index. If you are involved in data annotation, bookmark this:


We have evaluated all tools with regard to many criteria. Further, we developed a Web application for searching for the…



News-Please is a news crawler that extracts structured information from… you guessed it, news articles. This isn’t just a scraper; the library also allows you to crawl and extract news articles from the commoncrawl.org site!

Here’s an example JSON output:






news-please is an open source, easy-to-use news crawler that extracts structured information from almost any news…


SpaCy 3.0

SpaCy update 🔥. Transformers pipeline is dope. And they now support multi-task learning. Their update supports newly trained models for 16 languages and 51 pipelines. Read more here:

Introducing spaCy v3.0 nightly · Explosion

spaCy v3.0 is going to be a huge release! It features new transformer-based pipelines that get spaCy's accuracy right…


Honorable Papers

Commonsense and Adapters: https://arxiv.org/pdf/2005.11787.pdf

Abductive Reasoning: https://arxiv.org/pdf/2010.05906.pdf

BioMegatron: https://arxiv.org/pdf/2010.06060.pdf

Dataset of the Week: DialoGLUE

What is it?

DialoGLUE is a conversational AI benchmark containing 7 datasets pertaining to task-oriented dialogue.


Where is it?


DialoGLUE is a conversational AI benchmark designed to encourage dialogue research in representation-based transfer…


Paper: https://arxiv.org/pdf/2009.13570.pdf

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat



Published via Towards AI
