The NLP Cypher | 11.15.20
Last Updated on July 24, 2023 by Editorial Team
Author(s): Ricky Costa
Originally published on Towards AI.
NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
The NLP Cypher | 11.15.20
Dead Languages
Welcome back! Plenty of things to talk about. We have a few conferences coming up over the next few weeks: EMNLP 2020 starts tomorrow, and then everyone's favorite, NeurIPS, begins on the 6th of December.
Also, we updated the El Grande y Mal Base de Datos de NLP this week and added 25 new datasets. Highlights include the Russian SuperGLUE and Chinese CLUE benchmarks! 😎
Having difficulty staying ahead of the NLP research curve? Maybe this is why…
Speaking of EMNLP and NeurIPS, Paper Digest aggregated a few papers from these conferences and included their code links. Enjoy. 👀
NeurIPS 2020 Papers with Code/Data
If you do not want to miss any interesting academic paper, you are welcome to sign up our free daily paper digest…
www.paperdigest.org
EMNLP 2020 Papers with Code/Data
If you do not want to miss any interesting academic paper, you are welcome to sign up our free daily paper digest…
www.paperdigest.org
Salesforce at EMNLP
Salesforce Research at EMNLP 2020
This year marks the 24th annual Empirical Methods in Natural Language Processing (EMNLP) conference reimagined for the…
blog.einstein.ai
Stanford at EMNLP
Stanford AI Lab Papers and Talks at EMNLP 2020
The Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020 is being hosted virtually from November…
ai.stanford.edu
Also, the PyTorch Developer Day conference took place. You can watch the presentations on their Facebook page:
www.facebook.com
Dev Vendetta
In the newsletter from the week before last, we highlighted the battle royale brewing over the youtube-dl repo on GitHub… long story short, many developers weren't too happy with a DMCA takedown notice from the music industry. Well, it now seems someone leaked GitHub's source code (enterprise version) as a commit on GitHub's DMCA repo lol. (It seems the source code was leaked accidentally by GitHub devs prior to the commit, so it wasn't a hack.)
🥶🥶🥶
"In a suspicious commit to the official GitHub DMCA repository, an unknown individual uploaded the confidential source code, impersonating Nat Friedman using a bug in GitHub's application."
GitHub Source Code Leak
What do Microsoft really think about open-source? The entire source code for the code hosting service used by…
resynth1943.net
👁 Tales from the Dark Web 👁 Somewhere in Tor Land lies a hidden GitHub in pure form, with a total of ~8 repos (thus far). And coincidentally, the youtube-dl source code is there. 😭 (We don't recommend visiting Tor Land.)
Oh, in other news, there's an ICLR 2021 data dump with paper review ratings:
iclr2021_review_scores
iclr2021_review_rebuttal title,url,avg_rating,ratings How Neural Networks Extrapolate: From Feedforward to Graph Neural…
docs.google.com
Bootleg: Named Entity Disambiguation
Bootleg: Chasing the Tail with Self-Supervised Named Entity Disambiguation
Named entity disambiguation (NED) is the process of mapping "strings" to "things" in a knowledge base. You have likely…
ai.stanford.edu
"What is the average gas mileage of a Lincoln?" For your smart AI model to understand this question, it needs to be proficient in NED. Why? So it doesn't confuse a car with a US president 😁. The blog above focuses on how well your model knows the tail of the entity distribution, since the majority of entities are found on the tail (rare) and are often unseen.
"In Wikidata, only 13% of entities even have Wikipedia pages as a source of textual information."
The authors present a new model, Bootleg, that gets much of its performance from a combination of a unique entity embedding, a type embedding, and a relation embedding. Its SOTA performance is impressive, and the blog contains a benchmark table where you can check out the results. Bootleg can also transfer entity knowledge to non-NED tasks and improve their performance.
GitHub
HazyResearch/bootleg
Bootleg is a self-supervised named entity disambiguation (NED) system built to improve disambiguation of entities that…
github.com
Paper: https://arxiv.org/pdf/2010.10363.pdf
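To make the Lincoln example concrete, here is a minimal toy sketch of disambiguation driven by type cues. It is not Bootleg's model (Bootleg learns its entity, type, and relation embeddings); the candidate table, the cue words, and the disambiguate helper are all hypothetical and hand-written for illustration.

```python
# Toy named entity disambiguation: pick the candidate whose type cues
# overlap most with the words around the mention.
# Hypothetical data: the candidates and cue words below are hand-written
# for illustration and are not taken from Bootleg or Wikidata.
CANDIDATES = {
    "lincoln": [
        {"entity": "Lincoln (car brand)", "type": "automobile",
         "cues": {"mileage", "gas", "engine", "drive", "car"}},
        {"entity": "Abraham Lincoln", "type": "person",
         "cues": {"president", "born", "speech", "war", "elected"}},
    ],
}


def disambiguate(mention, sentence):
    """Return the candidate entity name with the highest cue overlap."""
    # Lowercase the sentence words and strip simple punctuation.
    words = {w.strip("?.,!").lower() for w in sentence.split()}
    candidates = CANDIDATES[mention.lower()]
    # max() keeps the first candidate when scores are tied.
    best = max(candidates, key=lambda c: len(c["cues"] & words))
    return best["entity"]


question = "What is the average gas mileage of a Lincoln?"
print(disambiguate("Lincoln", question))  # prints: Lincoln (car brand)
```

The point of the toy is that a type signal ("this mention behaves like a car") can resolve a mention even when little text about the specific entity is available, which is the intuition behind handling tail entities.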
De-Cypher Languages
With time comes the eventual loss of information. This includes languages that were lost before we could decipher their grammar. MIT CSAIL and co. are now using machine learning to bring these languages back to life.
Translating lost languages using machine learning
Recent research suggests that most languages that have ever existed are no longer spoken. Dozens of these dead…
news.mit.edu
BERT and SpanBERT for Coreference Resolution
Oldie but goodie.
mandarjoshi90/coref
This repository contains code and models for the paper, BERT for Coreference Resolution: Baselines and Analysis…
github.com
Software Updates
AllenNLP
Release v1.2.1 · allenai/allennlp
Added an optional seed parameter to ModelTestCase.set_up_model which sets the random seed for random, numpy, and torch…
github.com
Deep Pavlov
DeepPavlov Library 0.13.0 release
Recently we shared with you our goals for 2020 Q4 and 2021. In post you can find our directions and plans for…
medium.com
Hugging Face
You can now git clone HF pretrained models straight into your Python code 😎
[Announcement] Model Versioning: Upcoming changes to the model hub
Update: migration is now completed. TL;DR early next week, we will migrate the models stored on the huggingface.co…
discuss.huggingface.co
And now… a brief moment of clarity from John Carmack…
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
BugRepo
Repo maintains a collection of bug report datasets for duplicate bug identification, bug localization, bug triaging, bug-fixing time estimation, and bug information mining.
logpai/bugrepo
BugRepo maintains a collection of bug reports that are publicly available for research purposes. Bug reports are a main…
github.com
BERT-Flow
Research showing that transforming the sentence embedding distribution into a smooth Gaussian improves the performance of BERT on semantic textual similarity tasks.
bohanli/BERT-flow
This is a TensorFlow implementation of the following paper: On the Sentence Embeddings from Pre-trained Language Models…
github.com
Long Range Arena: A benchmark for Efficient Transformers
The project aims at establishing benchmark tasks/datasets for evaluating transformer-based models in a systematic way.
google-research/long-range-arena
Long-range arena is an effort toward systematic evaluation of efficient transformer models. The project aims at…
github.com
CommonCrawl Domain Names
Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").
google-research-datasets/common-crawl-domain-names
Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to…
github.com
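For a feel of the task this corpus supports, here is a minimal sketch of dictionary-based word segmentation. It is only an illustration: the tiny VOCAB set and the segment helper are hypothetical, and a real system would rely on a large word list or a learned model rather than this hand-picked vocabulary.

```python
from functools import lru_cache

# Hypothetical mini-vocabulary, hand-picked for this example only.
VOCAB = {"common", "crawl", "open", "source", "news", "letter", "newsletter"}


def segment(name):
    """Split `name` into the fewest vocabulary words, or return None."""

    @lru_cache(maxsize=None)
    def best(start):
        # Reached the end of the string: nothing left to segment.
        if start == len(name):
            return ()
        options = []
        for end in range(start + 1, len(name) + 1):
            word = name[start:end]
            if word in VOCAB:
                rest = best(end)
                if rest is not None:
                    options.append((word,) + rest)
        # Prefer the segmentation with the fewest words.
        return min(options, key=len) if options else None

    result = best(0)
    return " ".join(result) if result is not None else None


print(segment("commoncrawl"))  # prints: common crawl
```

Preferring the fewest words is a design choice for this sketch; it keeps "newsletter" whole instead of splitting it into "news letter". Manually annotated data like this corpus is what lets you check such heuristics against human judgments.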
Why Do You Need a Billion Words?
nyu-mll/pretraining-learning-curves
This is the repository for the paper When Do You Need Billions of Words of Pretraining Data? We use jiant1 for our edge…
github.com
Dataset of the Week: Complex Sequential Question Answering (CSQA)
What is it?
Complex Sequential QA combines 2 tasks: answering factual questions through complex inferencing over a realistic-sized KG of millions of entities, and learning to converse through a series of coherently linked QA pairs. Answers require logical, quantitative, and comparative reasoning as well as their combinations.
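The reasoning types mentioned above can be illustrated on a toy knowledge graph. This is a minimal sketch, assuming a small hand-made and deliberately incomplete set of triples; the triples and the helper names are hypothetical and are not drawn from the CSQA data.

```python
# Toy knowledge graph as (subject, relation, object) triples.
# Hand-made and incomplete; for illustration only, not CSQA data.
TRIPLES = [
    ("Danube", "flows_through", "Germany"),
    ("Danube", "flows_through", "Austria"),
    ("Danube", "flows_through", "Hungary"),
    ("Rhine", "flows_through", "Germany"),
    ("Rhine", "flows_through", "Netherlands"),
]


def objects(subject, relation):
    """All objects linked to `subject` by `relation`."""
    return {o for s, r, o in TRIPLES if s == subject and r == relation}


def subjects(relation, obj):
    """All subjects linked to `obj` by `relation`."""
    return {s for s, r, o in TRIPLES if r == relation and o == obj}


# Logical: which rivers flow through Germany AND Austria?
both = subjects("flows_through", "Germany") & subjects("flows_through", "Austria")

# Quantitative: how many countries does the Danube flow through (in this toy KG)?
n_danube = len(objects("Danube", "flows_through"))

# Comparative: which rivers flow through more countries than the Rhine?
rivers = {s for s, r, _ in TRIPLES if r == "flows_through"}
rhine_count = len(objects("Rhine", "flows_through"))
more_than_rhine = {
    x for x in rivers if len(objects(x, "flows_through")) > rhine_count
}

# Sequential follow-up: "And which of those countries does the Rhine also reach?"
shared = objects("Danube", "flows_through") & objects("Rhine", "flows_through")

print(both, n_danube, more_than_rhine, shared)
```

The follow-up question only makes sense given the earlier turn about the Danube, which is the "coherently linked QA pairs" aspect of the dataset.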
Where is it?
Complex Sequential Question Answering: Towards Learning to Converse Over Linked Question Answer…
While conversing with chatbots, humans typically tend to ask many questions, a significant portion of which can be…
amritasaha1812.github.io
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
For complete coverage, follow our Twitter: @Quantum_Stat
Published via Towards AI