

The NLP Cypher | 11.15.20

Last Updated on July 24, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.




Dead Languages

Welcome back! Plenty of things to talk about. We have a few conferences coming up over the next few weeks: EMNLP 2020 starts tomorrow and then we have everyone’s favorite NeurIPS beginning on the 6th of December.

Also, we updated the El Grande y Mal Base de Datos de NLP this week and added 25 new datasets. Highlights include the Russian SuperGLUE and Chinese CLUE benchmarks! 😎

Having difficulty staying ahead of the NLP research curve? Maybe this is why…


Speaking of EMNLP and NeurIPS, Paper Digest aggregated a few papers from these conferences and included their code links. Enjoy. 👀

NeurIPS 2020 Papers with Code/Data

If you do not want to miss any interesting academic paper, you are welcome to sign up our free daily paper digest…


EMNLP 2020 Papers with Code/Data

If you do not want to miss any interesting academic paper, you are welcome to sign up our free daily paper digest…


Salesforce at EMNLP

Salesforce Research at EMNLP 2020

This year marks the 24th annual Empirical Methods in Natural Language Processing (EMNLP) conference reimagined for the…


Stanford at EMNLP

Stanford AI Lab Papers and Talks at EMNLP 2020

The Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020 is being hosted virtually from November…


Also, the PyTorch Developer Day conference took place. You can watch the presentations on their Facebook page.



Dev Vendetta

In our newsletter the week before last, we highlighted the battle royale brewing between developers and the youtube-dl repo on GitHub... long story short, many developers weren’t too happy with a DMCA takedown notice from the music industry. Well, it now seems someone leaked GitHub’s source code (enterprise version) as a commit on GitHub’s DMCA repo lol. (It seems the source code was leaked accidentally by GitHub devs prior to the commit, so it wasn’t a hack.)


“In a suspicious commit to the official GitHub DMCA repository, an unknown individual uploaded the confidential source code, impersonating Nat Friedman using a bug in GitHub’s application.”

GitHub Source Code Leak

What do Microsoft really think about open-source? The entire source code for the code hosting service used by…


👁 Tales from the Dark Web 👁 Somewhere in Tor Land lies a hidden GitHub in pure form, with a total of ~8 repos (thus far). And coincidentally, the youtube-dl source code is there: 😭 (don’t recommend visiting Tor Land).


Oh, in other news, there’s an ICLR 2021 data dump with paper review ratings:


iclr2021_review_rebuttal title,url,avg_rating,ratings How Neural Networks Extrapolate: From Feedforward to Graph Neural…
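The column names in that preview (title, url, avg_rating, ratings) lend themselves to a quick ranking. Here is a minimal sketch, assuming the dump is plain CSV with exactly that header. The rows below are made-up placeholders rather than real review data, and since the format of the ratings column is a guess, it is left untouched as a raw string.

```python
import csv
import io

# Placeholder rows (hypothetical, NOT real ICLR data). Only the header
# mirrors the column names visible in the dump's preview.
SAMPLE_CSV = """title,url,avg_rating,ratings
Placeholder paper A,https://example.org/a,6.5,"[6, 7]"
Placeholder paper B,https://example.org/b,4.0,"[3, 5]"
Placeholder paper C,https://example.org/c,7.25,"[7, 8, 7, 7]"
"""


def top_rated(csv_text, n=2):
    """Return the n highest (title, avg_rating) pairs from CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # avg_rating is read as text, so convert it before sorting.
    rows.sort(key=lambda row: float(row["avg_rating"]), reverse=True)
    return [(row["title"], float(row["avg_rating"])) for row in rows[:n]]


for title, rating in top_rated(SAMPLE_CSV):
    print(f"{rating:.2f}  {title}")
```

To try it on the real file, you would read the downloaded CSV into a string and pass it to top_rated in place of SAMPLE_CSV.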


Bootleg: Named Entity Disambiguation

Bootleg: Chasing the Tail with Self-Supervised Named Entity Disambiguation

Named entity disambiguation (NED) is the process of mapping "strings" to "things" in a knowledge base. You have likely…


“What is the average gas mileage of a Lincoln?” In order for your smart AI model to understand this question, it’s going to need to be proficient in NED. Why? So it doesn’t confuse a car with a US president 😁. The blog post above centers on how well your model knows the tail of the entity distribution, since most entities are rare (they sit on the tail) and are often unseen.

“In Wikidata, only 13% of entities even have Wikipedia pages as a source of textual information.”

The authors present a new model, Bootleg, that gets much of its awesome performance from a combination of a unique entity embedding, a type embedding, and a relation embedding. Its SOTA performance is impressive, and the blog contains a benchmark table for you to check out. Also, Bootleg can transfer entity knowledge to non-NED tasks with improved performance.
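To make the idea of combining entity, type, and relation signals concrete, here is a toy sketch of the Lincoln example. It is not Bootleg's architecture: the two-dimensional vectors, the concatenation, and the dot-product scoring are all assumptions made up for illustration, and every name in the code is hypothetical.

```python
# Toy candidate table for the mention "Lincoln". All vectors are invented
# for illustration; a real system would learn them from data.
CANDIDATES = {
    "Lincoln (car brand)": {
        "entity": [0.9, 0.1],
        "type": [1.0, 0.0],      # leans "vehicle"
        "relation": [0.8, 0.2],
    },
    "Abraham Lincoln": {
        "entity": [0.2, 0.8],
        "type": [0.0, 1.0],      # leans "person"
        "relation": [0.1, 0.9],
    },
}


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def disambiguate(context, candidates):
    """Pick the candidate whose concatenated embeddings best match context."""
    scores = {}
    for name, emb in candidates.items():
        combined = emb["entity"] + emb["type"] + emb["relation"]
        scores[name] = dot(context, combined)
    return max(scores, key=scores.get)


# Hypothetical context vector for "average gas mileage of a ...":
# it leans toward the vehicle side in all three embedding slots.
GAS_MILEAGE_CONTEXT = [0.7, 0.3, 1.0, 0.0, 0.9, 0.1]

print(disambiguate(GAS_MILEAGE_CONTEXT, CANDIDATES))
```

The point of the toy is that the type and relation slots can still separate the candidates when the entity slot alone carries little signal, which is the situation for rare entities.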



Bootleg is a self-supervised named entity disambiguation (NED) system built to improve disambiguation of entities that…


Paper: https://arxiv.org/pdf/2010.10363.pdf

De’Cypher Languages

With time comes the eventual loss of information. This includes languages that were lost before we could decipher their grammar. MIT CSAIL and co. are now using machine learning to bring these languages back to life.

Translating lost languages using machine learning

Recent research suggests that most languages that have ever existed are no longer spoken. Dozens of these dead…



BERT and SpanBERT for Coreference Resolution

Oldie but goodie.


This repository contains code and models for the paper, BERT for Coreference Resolution: Baselines and Analysis…


Software Updates


Release v1.2.1 · allenai/allennlp

Added an optional seed parameter to ModelTestCase.set_up_model which sets the random seed for random, numpy, and torch…
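For readers wondering what seeding random, numpy, and torch looks like in practice, here is a minimal sketch. It is not AllenNLP's implementation: seed_everything is a hypothetical helper name, and numpy and torch are only seeded if they happen to be installed.

```python
import random


def seed_everything(seed):
    """Seed Python's random module, plus numpy and torch when available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # numpy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # torch not installed; nothing to seed


seed_everything(13)
first = random.random()
seed_everything(13)
second = random.random()
print(first == second)  # the same seed gives the same draw
```

Seeding all three generators at once matters in tests because a model's initial weights and any sampled batches may draw from different generators.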


Deep Pavlov

DeepPavlov Library 0.13.0 release

Recently we shared with you our goals for 2020 — Q4 and 2021. In post you can find our directions and plans for…


Hugging Face

You can now git clone HF pretrained models straight into your Python code 😎

[Announcement] Model Versioning: Upcoming changes to the model hub

Update: migration is now completed. TL;DR early next week, we will migrate the models stored on the huggingface.co…


And now… a brief moment of clarity from John Carmack…


Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁


Repo maintains a collection of bug report datasets for duplicate bug identification, bug localization, bug triaging, bug-fixing time estimation, and bug information mining.


BugRepo maintains a collection of bug reports that are publicly available for research purposes. Bug reports are a main…



Research showing that transforming the sentence embedding distribution into a smooth Gaussian improves the performance of BERT on semantic textual similarity tasks.


This is a TensorFlow implementation of the following paper: On the Sentence Embeddings from Pre-trained Language Models…


Long Range Arena: A benchmark for Efficient Transformers

The project aims at establishing benchmark tasks/datasets for evaluating transformer-based models in a systematic way.


Long-range arena is an effort toward systematic evaluation of efficient transformer models. The project aims at…


CommonCrawl Domain Names

Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. “commoncrawl” to “common crawl”).


Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to…
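The word-boundary task described above can be illustrated with a small dictionary-based segmenter. This is only a sketch of the problem, not how the corpus was built (it was annotated manually): the vocabulary is a made-up placeholder, and preferring the split with the fewest words is an assumption chosen for simplicity.

```python
# Hypothetical mini vocabulary; a real one would be far larger.
VOCAB = {"common", "crawl", "comm", "on"}


def segment(name, vocab):
    """Split name into vocabulary words, preferring the fewest words.

    Returns a list of words, or None when no full segmentation exists.
    """
    n = len(name)
    # best[i] holds the best segmentation of name[:i], or None if unreachable.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            word = name[start:end]
            if best[start] is None or word not in vocab:
                continue
            candidate = best[start] + [word]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[n]


print(" ".join(segment("commoncrawl", VOCAB)))
```

Both "common crawl" and "comm on crawl" are valid splits under this vocabulary, and the fewest-words rule picks the first. Real domain names are more ambiguous than that, which is why a manually annotated corpus is useful.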


Why Do You Need a Billion Words?


This is the repository for the paper When Do You Need Billions of Words of Pretraining Data? We use jiant1 for our edge…


Dataset of the Week: Complex Sequential Question Answering (CSQA)

What is it?

Complex Sequential QA combines 2 tasks: answering factual questions through complex inferencing over a realistic-sized KG of millions of entities, and learning to converse through a series of coherently linked QA pairs. Answers require logical, quantitative, and comparative reasoning as well as their combinations.

Where is it?

Complex Sequential Question Answering: Towards Learning to Converse Over Linked Question Answer…

While conversing with chatbots, humans typically tend to ask many questions, a significant portion of which can be…


Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat

Quantum Stat


Published via Towards AI
