

The NLP Cypher | 11.22.20

Last Updated on July 24, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.




Ultima Ratio Regum

Hey, welcome back! EMNLP happened this week 👀. Tons of research came out, and this newsletter won’t do justice to all of the great research conducted by institutions worldwide. But first…

We will be releasing an update to the Big Bad NLP Database this week and a large update to the Super Duper NLP Repo after Thanksgiving. These updates will be delivered via our email newsletter; if interested, you can sign up on our homepage.

As always, if you enjoy this read, please give it a 👏👏 and share with your enemies. 😁

Ok, knowledge graphs time: once again, Michael Galkin released his incredibly detailed round-up newsletter 🔥🔥. After a strong start in 2019 for knowledge-augmented language models, it seems they continue to be the hot ticket for this year. Below is the TOC and link to the full blog post (*warning*: it’s extensive and awesome):


  1. KG-Augmented Language Models: Empower your Transformer
    1.1 Autoencoders
    1.2 Autoregressive
  2. Natural Language Generation: New Folks in Datasetlandia
  3. Entity Linking: Massive and Multilingual
  4. Relation Extraction: OpenIE 6 and Neural Extractors
  5. KG Representation Learning: Temporal KGC and Successor to FB15K-237
  6. ConvAI + KGs: On the Shoulders of OpenDialKG
  7. Wrapping Up

Knowledge Graphs in NLP @ EMNLP 2020

Your guide to the KG-related research in NLP, November edition.


High Performance NLP at EMNLP (SLIDES)

Slides from Google and the University of Washington that explore the current state of scaling NLP models to deal with large volumes of text, cost, and software and hardware considerations. The tutorial discusses current and possible future directions for attacking these key areas to improve NLP efficiency.


GraphGym is an awesome library that recently came out, built on top of PyTorch Geometric. It allows for easy configuration of data loading and can be easily initialized for various GNN configurations in parallel. This is a good library to start with if you feel Geometric on its own is too intimidating. 😎


GraphGym is a platform for designing and evaluating Graph Neural Networks (GNN). 1. Highly modularized pipeline for GNN…


Paper: https://arxiv.org/pdf/2011.08843.pdf


This fella (KeyBERT) allows you to extract keywords and keyphrases from text by using BERT embeddings. It’s pretty straightforward: to conduct inference, you only need 3 lines of code. It’s fairly good; I tested it on abstract summaries from arXiv and I may use it to index the papers I read. ✌


KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and…
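To make the idea concrete, here is a minimal sketch of embedding-based keyword ranking: embed the document, embed each candidate word, and keep the candidates closest to the document by cosine similarity. This is not KeyBERT's code. The `embed` function is a toy character-trigram counter standing in for BERT embeddings, and the stopword list and helper names are assumptions made up for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not KeyBERT's).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "that", "with"}


def embed(text, n=3):
    """Toy embedding: character trigram counts (a stand-in for BERT)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extract_keywords(doc, top_n=3):
    """Rank candidate words by similarity to the whole document."""
    doc_vec = embed(doc)
    words = re.findall(r"[a-z]+", doc.lower())
    candidates = {w for w in words if w not in STOPWORDS and len(w) > 2}
    scored = [(w, cosine(embed(w), doc_vec)) for w in candidates]
    # Highest similarity first; ties broken alphabetically for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_n]


doc = (
    "Keyword extraction finds the words that best describe a document. "
    "Embeddings make keyword extraction simple."
)
keywords = extract_keywords(doc, top_n=3)
print(keywords)
```

With the real library, the project README shows the equivalent call as roughly `KeyBERT().extract_keywords(doc)`, where a sentence-embedding model replaces the toy `embed` above.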



Linformer is the “first theoretically proven linear-time transformer” out of FacebookAI (came out this Summer). In a nutshell, the amount of compute grows linearly with input length, unlike your typical transformer, where self-attention grows quadratically. 👇

This is great news for practitioners as this would allow one to really scale models in production, especially if you have to do millions of computations in a short amount of time. Apparently FB already has it running in production.

How Facebook uses super-efficient AI models to detect hate speech

Building AI that can analyze complicated text isn't enough to protect people from harmful content. We need systems that…
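For a feel of where the linear scaling in Linformer comes from, here is a rough pure-Python sketch under stated assumptions. Keys and values of length `n` are projected down to a fixed length `k_dim` before attention, so the score matrix is `n x k_dim` instead of `n x n`. In the paper the projections are learned; the random matrices, sizes, and helper names below are placeholders for illustration only.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linformer_attention(q, k, v, e_proj, f_proj):
    """Single-head attention with keys/values projected along the sequence axis."""
    d = len(q[0])
    k_small = matmul(e_proj, k)  # (k_dim x n) @ (n x d) -> k_dim x d
    v_small = matmul(f_proj, v)  # (k_dim x n) @ (n x d) -> k_dim x d
    scores = matmul(q, transpose(k_small))  # n x k_dim, not n x n
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = softmax_rows(scores)
    return matmul(weights, v_small), weights  # output is n x d


rng = random.Random(0)
n, d, k_dim = 8, 4, 2  # toy sizes (assumptions)
q = random_matrix(n, d, rng)
k = random_matrix(n, d, rng)
v = random_matrix(n, d, rng)
e_proj = random_matrix(k_dim, n, rng)  # learned in the paper, random here
f_proj = random_matrix(k_dim, n, rng)

out, weights = linformer_attention(q, k, v, e_proj, f_proj)
print(len(weights), "x", len(weights[0]), "attention weights")
```

With `k_dim` held fixed, doubling `n` doubles the number of attention weights, whereas the usual `n x n` score matrix would quadruple.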


Legal Search Engine

judyrecords is the largest search engine of United States court cases on the Internet.

Although the search engine boasts a huge catalogue, not all court documents can be made available online as some documents can only be requested in person at specific court houses. Still an awesome feat. When is the API coming out? 😁




Podcast Search Engine

Interested in keeping up with podcasts? Here’s an API that allows you to search the metadata of podcasts and episodes by people, places, or topics. The API is free as long as you stay within 2,500 requests per month.

Podcast API: Podcast Search & Directory API

We have a transparent and simple pricing model for Listen API. You can start with FREE plan without entering your…


Tabular Transformers

A great Medium article highlighting how to use the multimodal transformers library on top of the transformers library for tabular data! Currently the library supports 3 models: BERT, DistilBERT, and RoBERTa. For training, you can use the Trainer class from the Transformers lib.

Documentation: https://multimodal-toolkit.readthedocs.io/en/latest/modules/model.html#module-multimodal_transformers.model.tabular_transformers


How to Incorporate Tabular Data with HuggingFace Transformers

[Colab] [Github]
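If you are wondering what "incorporating tabular data" can look like, one simple option is to concatenate a pooled text vector with encoded categorical and numerical columns before the classifier head. The sketch below shows only that concatenation step in plain Python. It does not use the multimodal toolkit's API, and the feature names, statistics, and the text vector are hypothetical placeholders.

```python
def one_hot(value, categories):
    """Encode a categorical value as a one-hot list."""
    return [1.0 if value == category else 0.0 for category in categories]


def standardize(value, mean, std):
    """Scale a numerical value to zero mean and unit variance."""
    return (value - mean) / std if std else 0.0


def combine_features(text_vec, cat_value, categories, num_values, num_stats):
    """Concatenate text, categorical, and numerical features into one vector."""
    cat_vec = one_hot(cat_value, categories)
    num_vec = [standardize(v, mean, std) for v, (mean, std) in zip(num_values, num_stats)]
    return list(text_vec) + cat_vec + num_vec


# Hypothetical pooled transformer output for one review (3 dims for brevity).
text_vec = [0.12, -0.40, 0.88]
combined = combine_features(
    text_vec,
    cat_value="dress",
    categories=["dress", "shirt", "shoes"],  # hypothetical categorical column
    num_values=[34.0],                       # hypothetical numerical column (age)
    num_stats=[(40.0, 12.0)],                # assumed (mean, std) from training data
)
print(combined)  # -> [0.12, -0.4, 0.88, 1.0, 0.0, 0.0, -0.5]
```

The combined vector would then feed a small classification or regression head; see the documentation linked above for the combining options the library actually ships.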


youtube-dl returns

Devs put up a good fight. It’s back?

GitHub explanation for reinstating youtube-dl repo:

Standing up for developers: youtube-dl is back – The GitHub Blog

Today we reinstated youtube-dl, a popular project on GitHub, after we received additional information about the project…


Systematic Comparison of Open Information Extraction Techniques

In this paper from EMNLP, the authors evaluate current deep learning systems for open information extraction (OIE), that is, automatically extracting subject-predicate-object triples from sentences. They explore different training scenarios for OIE and compare existing OIE models. Good introductory paper if you are new to this space.

Paper: https://www.aclweb.org/anthology/2020.emnlp-main.690.pdf
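If the triple format is new to you, the toy below shows what an OIE system is expected to output. It is a hand-written regex over a tiny fixed list of predicates, nothing like the neural systems compared in the paper. The `Triple` type, the pattern, and the example sentences are all made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# A deliberately tiny predicate list; real OIE systems are not limited like this.
PATTERN = re.compile(
    r"^(?P<subject>.+?)\s+"
    r"(?P<predicate>was born in|is located in|founded|acquired)\s+"
    r"(?P<obj>.+?)\.?$"
)


def extract_triple(sentence: str) -> Optional[Triple]:
    """Return a (subject, predicate, object) triple, or None if no pattern fits."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    return Triple(match["subject"], match["predicate"], match["obj"])


print(extract_triple("Alice founded Acme Corp."))
# -> Triple(subject='Alice', predicate='founded', obj='Acme Corp')
print(extract_triple("The office is located in Springfield"))
# -> Triple(subject='The office', predicate='is located in', obj='Springfield')
print(extract_triple("Hello world"))
# -> None
```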

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁


This library contains a set of modules that can be used to analyze the activations of neural networks, with a focus on NLP architectures such as LSTMs and Transformers.


Paper: https://arxiv.org/abs/2011.06819 Demo: Documentation: https://diagnnose.readthedocs.io This library contains a…



NLPGym is a toolkit to bridge the gap between applications of RL and NLP. This aims at facilitating research and benchmarking of DRL application on natural language processing tasks.




Entity Recognition and Relation Extraction from Scientific and Technical Texts in Russian

Datasets and models for information extraction tasks in Russian.


Contribute to iis-research-team/ner-rc-russian development by creating an account on GitHub.



WikiAsp is a multi-domain, aspect-based summarization dataset in the encyclopedic domain. In this task, models are asked to summarize cited reference documents of a Wikipedia article into aspect-based summaries.


This repository contains the dataset from the paper " WikiAsp: A Dataset for Multi-domain Aspect-based Summarization"…


Dataset of the Week: GrailQA

What is it?

Dataset used for knowledge base question answering (KBQA) containing 64,331 crowdsourced questions involving up to 4 relations and functions like counting, comparatives, and superlatives. The dataset covers all of the 86 domains in Freebase Commons.
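To picture the kinds of functions these questions involve, here is a toy sketch of counting, a comparative, and a superlative over a tiny triple store. The knowledge base, entity names, and helper functions are invented for illustration and have nothing to do with Freebase or the GrailQA query format.

```python
# A made-up knowledge base of (subject, predicate, object) triples.
KB = [
    ("river_a", "flows_through", "country_x"),
    ("river_b", "flows_through", "country_x"),
    ("river_c", "flows_through", "country_y"),
    ("river_a", "length_km", 120),
    ("river_b", "length_km", 340),
    ("river_c", "length_km", 95),
]


def subjects(predicate, obj):
    """All subjects linked to obj by predicate."""
    return {s for s, p, o in KB if p == predicate and o == obj}


def value(subject, predicate):
    """The single object stored for (subject, predicate)."""
    return next(o for s, p, o in KB if s == subject and p == predicate)


def count_rivers(country):
    """Counting: how many rivers flow through the country?"""
    return len(subjects("flows_through", country))


def rivers_longer_than(country, km):
    """Comparative: which rivers in the country are longer than km?"""
    rivers = subjects("flows_through", country)
    return sorted(r for r in rivers if value(r, "length_km") > km)


def longest_river(country):
    """Superlative: which river in the country is the longest?"""
    rivers = subjects("flows_through", country)
    return max(rivers, key=lambda r: value(r, "length_km"))


print(count_rivers("country_x"))             # -> 2
print(rivers_longer_than("country_x", 100))  # -> ['river_a', 'river_b']
print(longest_river("country_x"))            # -> river_b
```

Each of these toy questions chains two relations (`flows_through` and `length_km`); the dataset's questions go up to 4.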


Where is it?

Strongly Generalizable Question Answering Dataset

Strongly Generalizable Question Answering Dataset (GrailQA) is a new large-scale, high-quality dataset for question…


Paper: https://arxiv.org/pdf/2011.07743.pdf

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat

Quantum Stat


Published via Towards AI
