The NLP Cypher | 11.15.20
Last Updated on July 24, 2023 by Editorial Team
Author(s): Ricky Costa
Originally published on Towards AI.
NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
Dead Languages
Welcome back! Plenty of things to talk about. We have a few conferences coming up over the next few weeks: EMNLP 2020 starts tomorrow and then we have everyone’s favorite NeurIPS beginning on the 6th of December.
Also, we updated the El Grande y Mal Base de Datos de NLP this week and added 25 new datasets. Highlights include the Russian SuperGLUE and Chinese CLUE benchmarks! 😎
Having difficulty staying ahead of the NLP research curve? Maybe this is why…
Speaking of EMNLP and NeurIPS, Paper Digest aggregated a few papers from these conferences and included their code links. Enjoy. 👀
NeurIPS 2020 Papers with Code/Data
If you do not want to miss any interesting academic paper, you are welcome to sign up our free daily paper digest…
www.paperdigest.org
EMNLP 2020 Papers with Code/Data
www.paperdigest.org
Salesforce at EMNLP
Salesforce Research at EMNLP 2020
This year marks the 24th annual Empirical Methods in Natural Language Processing (EMNLP) conference reimagined for the…
blog.einstein.ai
Stanford at EMNLP
Stanford AI Lab Papers and Talks at EMNLP 2020
The Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020 is being hosted virtually from November…
ai.stanford.edu
Also, the PyTorch Developer Day conference took place. You can watch the presentations on their Facebook page:
www.facebook.com
Dev Vendetta
In our newsletter from the week before last, we highlighted the battle royale brewing among developers over the youtube-dl repo on GitHub. Long story short, many developers weren’t too happy with a DMCA takedown notice from the music industry. Well, it now seems someone leaked GitHub’s source code (enterprise version) as a commit on GitHub’s DMCA repo lol. (It seems the source code was leaked accidentally by GitHub devs prior to the commit, so it wasn’t a hack.)
🥶🥶🥶
“In a suspicious commit to the official GitHub DMCA repository, an unknown individual uploaded the confidential source code, impersonating Nat Friedman using a bug in GitHub’s application.”
GitHub Source Code Leak
What do Microsoft really think about open-source? The entire source code for the code hosting service used by…
resynth1943.net
👁 Tales from the Dark Web 👁 Somewhere in Tor Land lies a hidden GitHub in pure form, with a total of ~8 repos (thus far). And coincidentally, the youtube-dl source code is there 😭 (don’t recommend visiting Tor Land).
Oh, in other news, there’s an ICLR 2021 data dump with paper review ratings:
iclr2021_review_scores
iclr2021_review_rebuttal title,url,avg_rating,ratings How Neural Networks Extrapolate: From Feedforward to Graph Neural…
docs.google.com
Bootleg: Named Entity Disambiguation
Bootleg: Chasing the Tail with Self-Supervised Named Entity Disambiguation
Named entity disambiguation (NED) is the process of mapping "strings" to "things" in a knowledge base. You have likely…
ai.stanford.edu
“What is the average gas mileage of a Lincoln?” For your smart AI model to understand this question, it needs to be proficient in NED. Why? So it doesn’t confuse a car with a US president 😁. The blog above focuses on how well your model handles the tail of the entity distribution, since the majority of entities in text are found on the tail (rare) and are often unseen.
“In Wikidata, only 13% of entities even have Wikipedia pages as a source of textual information.”
The authors present a new model, Bootleg, that gets much of its awesome performance from a combination of a unique entity embedding, a type embedding, and a relation embedding. Its SOTA performance is impressive, and the blog contains a benchmark table for you to check out. Also, Bootleg can transfer entity knowledge to non-NED tasks and improve their performance.
GitHub
HazyResearch/bootleg
Bootleg is a self-supervised named entity disambiguation (NED) system built to improve disambiguation of entities that…
github.com
Paper: https://arxiv.org/pdf/2010.10363.pdf
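To make the tail-entity point concrete, here is a toy sketch of why type and relation signals help when an entity has little entity-specific signal. This is not Bootleg's architecture (see the blog and paper for that). Every vector below is a hand-made, hypothetical value, and the `score` and `disambiguate` helpers are names invented for this illustration.

```python
from typing import Dict, List

Candidate = Dict[str, List[float]]


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy vectors, NOT learned embeddings.
# Layout per candidate: entity (2 dims) + type (2 dims) + relation (1 dim).
CANDIDATES: Dict[str, Candidate] = {
    "Lincoln (car brand)": {
        "entity": [0.0, 0.0],  # rare entity: no entity-specific signal
        "type": [1.0, 0.0],    # [is_vehicle_brand, is_person]
        "relation": [1.0],     # [linked to an automobile concept]
    },
    "Abraham Lincoln": {
        "entity": [0.5, 0.0],  # popular entity: some entity-specific signal
        "type": [0.0, 1.0],
        "relation": [0.0],
    },
}

# Hypothetical context vector for "What is the average gas mileage of a Lincoln?"
# It mildly matches the popular entity but strongly matches vehicle cues.
CONTEXT = [1.0, 0.0, 1.0, 0.0, 1.0]


def score(candidate: Candidate, context: List[float], use_type_and_relation: bool = True) -> float:
    """Score a candidate by a dot product between its concatenated features and the context."""
    features = list(candidate["entity"])
    extra = candidate["type"] + candidate["relation"]
    # Zero out the type and relation parts to simulate an entity-only model.
    features += extra if use_type_and_relation else [0.0] * len(extra)
    return dot(features, context)


def disambiguate(candidates: Dict[str, Candidate], context: List[float],
                 use_type_and_relation: bool = True) -> str:
    """Return the name of the highest-scoring candidate."""
    return max(candidates, key=lambda name: score(candidates[name], context, use_type_and_relation))


print(disambiguate(CANDIDATES, CONTEXT, use_type_and_relation=False))
print(disambiguate(CANDIDATES, CONTEXT, use_type_and_relation=True))
```

With these toy numbers, the entity-only scorer falls back on the popular entity, while adding the type and relation parts lets the rare candidate win.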
De’Cypher Languages
With time comes the eventual loss of information. This includes languages that were lost before we could decipher their grammar. MIT CSAIL and co. are now using machine learning to bring these languages back to life.
Translating lost languages using machine learning
Recent research suggests that most languages that have ever existed are no longer spoken. Dozens of these dead…
news.mit.edu
BERT and SpanBERT for Coreference Resolution
Oldie but goodie.
mandarjoshi90/coref
This repository contains code and models for the paper, BERT for Coreference Resolution: Baselines and Analysis…
github.com
Software Updates
AllenNLP
Release v1.2.1 · allenai/allennlp
Added an optional seed parameter to ModelTestCase.set_up_model which sets the random seed for random, numpy, and torch…
github.com
Deep Pavlov
DeepPavlov Library 0.13.0 release
Recently we shared with you our goals for 2020 — Q4 and 2021. In post you can find our directions and plans for…
medium.com
Hugging Face
You can now git clone HF pretrained models straight into your Python code 😎
[Announcement] Model Versioning: Upcoming changes to the model hub
Update: migration is now completed. TL;DR early next week, we will migrate the models stored on the huggingface.co…
discuss.huggingface.co
And now… a brief moment of clarity from John Carmack…
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
BugRepo
This repo maintains a collection of bug report datasets for duplicate bug identification, bug localization, bug triaging, bug-fixing time estimation, and bug information mining.
logpai/bugrepo
BugRepo maintains a collection of bug reports that are publicly available for research purposes. Bug reports are a main…
github.com
BERT-Flow
Research showing that transforming the sentence embedding distribution into a smooth Gaussian improves the performance of BERT on semantic textual similarity tasks.
bohanli/BERT-flow
This is a TensorFlow implementation of the following paper: On the Sentence Embeddings from Pre-trained Language Models…
github.com
Long Range Arena: A benchmark for Efficient Transformers
The project aims at establishing benchmark tasks/datasets for evaluating transformer-based models in a systematic way.
google-research/long-range-arena
Long-range arena is an effort toward systematic evaluation of efficient transformer models. The project aims at…
github.com
CommonCrawl Domain Names
Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. “commoncrawl” to “common crawl”).
google-research-datasets/common-crawl-domain-names
Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to…
github.com
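The corpus above was annotated manually, but the underlying task (adding word boundaries to a run-together domain name) can be sketched with a small dictionary-based dynamic program. The vocabulary and the `segment` helper are hypothetical names made up for this example. A realistic system would need a much larger word list or a language model to rank competing splits.

```python
from typing import List, Optional, Set

# Hypothetical toy vocabulary, for illustration only.
VOCAB: Set[str] = {"common", "crawl", "news", "letter", "newsletter", "open", "source"}


def segment(name: str, vocab: Set[str] = VOCAB) -> Optional[List[str]]:
    """Split `name` into vocabulary words, preferring the fewest words.

    Returns None when no full segmentation exists.
    """
    n = len(name)
    # best[i] holds the shortest known segmentation of name[:i], or None.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            word = name[start:end]
            if prefix is None or word not in vocab:
                continue
            candidate = prefix + [word]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[n]


result = segment("commoncrawl")
print(" ".join(result) if result else "no segmentation found")
```

Preferring the fewest words is only one possible tie-breaker. It keeps "newsletter" whole instead of splitting it into "news" and "letter", which may or may not match how a human annotator would mark the boundary.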
Why Do You Need a Billion Words?
nyu-mll/pretraining-learning-curves
This is the repository for the paper When Do You Need Billions of Words of Pretraining Data? We use jiant1 for our edge…
github.com
Dataset of the Week: Complex Sequential Question Answering (CSQA)
What is it?
Complex Sequential QA combines 2 tasks: answering factual questions through complex inferencing over a realistic-sized KG of millions of entities, and learning to converse through a series of coherently linked QA pairs. Answers require logical, quantitative, and comparative reasoning as well as their combinations.
Where is it?
Complex Sequential Question Answering: Towards Learning to Converse Over Linked Question Answer…
While conversing with chatbots, humans typically tend to ask many questions, a significant portion of which can be…
amritasaha1812.github.io
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
For complete coverage, follow our Twitter: @Quantum_Stat