

NLP News Cypher | 09.20.20

Last Updated on July 24, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.

Ibrahim Jabbar-Beik



EMNLP and Graphs 😵

☝ Persian art is pretty. Welcome back for another week of the Cypher. Yesterday, we made another weekly update to the Big Bad NLP Database and the Super Duper NLP Repo. We added 10 datasets and 6 new notebooks. This update was a good one since we added PyTorch Geometric notebooks for graph neural networks in case you all are feeling a bit adventurous. 🙈

BTW, if you enjoy this newsletter please share it or give it a 👏👏!

Detour: I’ve been experimenting with ONNX Runtime inference on BERT question answering. Latency is significantly improved with ONNX: running on “okish” cloud CPUs, it ranges between 170 and 240 ms. Here’s the demo:

ONNX Runtime Inference | Quantum Stat

BERT Question Answering
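A latency range like the one quoted above could be collected with a small timing harness. This is only a minimal sketch using the standard library: `fake_inference` is a hypothetical stand-in for the real model call (for example, an ONNX Runtime session run on tokenized question/context inputs), and the warm-up and repeat counts are assumptions, not the settings used for the demo.

```python
import statistics
import time


def measure_latency_ms(run_once, warmup=3, repeats=20):
    """Time repeated calls to `run_once` and return per-call latencies in ms."""
    for _ in range(warmup):
        run_once()  # warm-up calls are executed but not timed
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies):
    """Return the min, median, and max latency in ms."""
    return {
        "min": min(latencies),
        "median": statistics.median(latencies),
        "max": max(latencies),
    }


def fake_inference():
    # Hypothetical placeholder for a real inference call; it only sleeps.
    time.sleep(0.002)


stats = summarize(measure_latency_ms(fake_inference, warmup=1, repeats=5))
print({name: round(value, 1) for name, value in stats.items()})
```

Reporting min, median, and max rather than a single average is a design choice here: it makes a spread such as 170–240 ms visible instead of hiding it.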

FYI, several EMNLP accepted papers were circulating this week for the November conference. Before we go there, here’s a quick appetizer from the paper “Message Passing for Hyper-Relational Knowledge Graphs” which compares the traditional knowledge triple vs. a hyper-relational graph.
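To make the triple vs. hyper-relational distinction concrete, here is a minimal data-structure sketch. The class names, field names, and the example facts are illustrative assumptions, not taken from the paper or its code: a plain triple is (subject, relation, object), while a hyper-relational fact attaches extra qualifier pairs to that triple.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Triple:
    """A traditional knowledge-graph fact: (subject, relation, object)."""
    subject: str
    relation: str
    obj: str


@dataclass(frozen=True)
class HyperRelationalFact:
    """A triple plus qualifier (relation, value) pairs that add context."""
    subject: str
    relation: str
    obj: str
    qualifiers: Tuple[Tuple[str, str], ...] = ()

    def to_triple(self) -> Triple:
        # Dropping the qualifiers collapses the fact back to a plain triple.
        return Triple(self.subject, self.relation, self.obj)


# Illustrative (simplified) facts: same triple, different qualifiers.
physics = HyperRelationalFact(
    "Marie Curie", "award_received", "Nobel Prize",
    (("field", "Physics"), ("year", "1903")),
)
chemistry = HyperRelationalFact(
    "Marie Curie", "award_received", "Nobel Prize",
    (("field", "Chemistry"), ("year", "1911")),
)

# As plain triples the two facts are indistinguishable...
print(physics.to_triple() == chemistry.to_triple())
# ...but the qualifiers keep them apart.
print(physics == chemistry)
```

The point of the sketch is the information loss: once the qualifiers are dropped, two different facts become the same triple.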


Use the Force LUKE (preprint not out yet 😥)


GNN Resources

Found this thread from Petar Veličković (DeepMind) highlighting top graph neural network resources, enjoy:

A thread written by @PetarV_93

As requested, here are a few non-exhaustive resources I'd recommend for getting started with Graph Neural Nets (GNNs)…

NeurIPS Fun n’ Games:


Wordplay: When Language Meets Games @ NeurIPS 2020. Date and time: Full day workshop on Fri Dec 11th or Sat the 12th…

This Week

Dialog Ranking Pretrained Transformers

TensorFlow Lite and NLP

Indonesian NLU Benchmark


RECOApy for Speech Preprocessing

Survey on the ‘X-Formers’

Dataset of the Week: ASSET

Dialog Ranking Pretrained Transformers

Another one accepted at EMNLP from Microsoft Research: using transformers (GPT-2) to figure out whether a reply to a comment is more likely to get engagement or not. Pretty interesting, huh! Their dialog ranking models were trained on 133M pairs of human feedback data from Reddit.

So what does it really do? Here’s an example from their demo: For the statement “I love NLP!”, the response “Here’s a free textbook (URL) in case anyone needs it.” is more likely to be up-voted than the response “Me too!” (meaning the former will have a higher ranking score).

Additionally, their colab allows you to run several models at once to distinguish:

updown… which gets more upvotes?

width… which gets more direct replies?

depth… which gets longer follow-up thread?
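The ranking idea above can be sketched in a few lines. This is a hedged sketch, not the authors' code: `rank_responses` and `toy_score` are hypothetical helpers, and `load_dialogrpt_scorer` assumes the checkpoints are available on the Hugging Face Hub under names like `microsoft/DialogRPT-updown` and that context and response are joined with GPT-2's `<|endoftext|>` token. The toy scorer exists only so the ranking logic runs offline; it is not DialogRPT.

```python
def build_input(context, response):
    # Assumption: context and reply are joined with GPT-2's end-of-text token.
    return context + "<|endoftext|>" + response


def rank_responses(context, candidates, score_fn):
    """Return (score, reply) pairs sorted from highest to lowest score."""
    scored = [(score_fn(context, reply), reply) for reply in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def load_dialogrpt_scorer(model_name="microsoft/DialogRPT-updown"):
    """Build a scorer from a Hub checkpoint (needs torch and transformers)."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    def score(context, response):
        ids = tokenizer.encode(build_input(context, response), return_tensors="pt")
        with torch.no_grad():
            logits = model(ids).logits
        return torch.sigmoid(logits).item()  # squash the logit into (0, 1)

    return score


def toy_score(context, response):
    # Hypothetical placeholder: longer replies score higher. NOT DialogRPT.
    return len(response.split())


candidates = ["Me too!", "Here's a free textbook (URL) in case anyone needs it."]
for score, reply in rank_responses("I love NLP!", candidates, toy_score):
    print(score, reply)
```

To try a real model, replace `toy_score` with `load_dialogrpt_scorer()`; under the same naming assumption, swapping the suffix for `width` or `depth` would select the other two ranking tasks.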

Colab of the Week

Thank you to author Xiang Gao for forwarding. You can also find it on the Super Duper Repo ✌…

Google Colaboratory




How likely a dialog response is upvoted 👍 and/or gets replied 💬? This is what DialogRPT is learned to predict. It is…


TensorFlow Lite and NLP

From their blog post this past week: TF Lite now has new NLP features, including new pre-trained NLP models and better support for converting TensorFlow NLP models to the TensorFlow Lite format.

TensorFlow Lite Model Maker

The TensorFlow Lite Model Maker library simplifies the process of training a TensorFlow Lite model using custom…

FYI, their TF Lite Task library has 3 APIs for:

  • NLClassifier: classifies the input text to a set of known categories.
  • BertNLClassifier: classifies text optimized for BERT-family models.
  • BertQuestionAnswerer: answers questions based on the content of a given passage with BERT-family models.

Keep in mind these models run natively on the phone (i.e., they do not need an internet connection to a cloud server).

What's new in TensorFlow Lite for NLP

September 16, 2020 – Posted by Tian Lin, Yicheng Fan, Jaesung Chung and Chen Cen TensorFlow Lite has been widely…

Indonesian NLU Benchmark

Check out the new Indonesian NLU benchmark. They include a BERT-based model, IndoBERT, and its ALBERT alternative, IndoBERT-lite. In addition, the benchmark also includes datasets for 12 downstream tasks regarding single-sentence classification, single-sentence sequence-tagging, sentence-pair classification, and sentence-pair sequence labeling.

And finally, it includes a large corpus for language modeling containing 4 billion words (250M sentences) 🔥🔥.

IndoNLU Benchmark

The IndoNLU benchmark is a collection of resources for training, evaluating, and analyzing natural language…




More from EMNLP 😎:

“CoDEx offers three rich knowledge graph datasets that contain positive and hard negative triples, entity types, entity and relation descriptions, and Wikipedia page extracts for entities.”

In addition, they also provide pretrained models to be used on the LibKGE library for link prediction and triple classification tasks.

The total data dump has about 1,156,222 triples.



CoDEx is a set of knowledge graph Completion Datasets Extracted from Wikidata and Wikipedia. As introduced and…

RECOApy for Speech Preprocessing

RECOApy is a new library that offers devs a UI that helps to record and phonetically transcribe data for speech apps in addition to grapheme-to-phoneme conversion. Currently, the library supports transcription in 8 languages: Czech, English, French, German, Italian, Polish, Romanian and Spanish.



RECOApy streamlines the steps of data recording and pre-processing required in end-to-end speech-based applications…

Survey on the ‘X-Formers’

The ‘X-Formers’, as the Google authors dub them (e.g. Longformer and Reformer), are the new, highly memory-efficient transformer variants that came on the scene in 2020. In this paper, the authors give a holistic view of these architectures, their techniques, and current trends.


Dataset of the Week: ASSET

What is it?

A dataset for tuning and evaluation of automatic sentence simplification models. ASSET consists of 23,590 human simplifications associated with the 2,359 original sentences from TurkCorpus (10 simplifications per sentence).


Where is it?


ASSET is a dataset for evaluating Sentence Simplification systems with multiple rewriting transformations, as described…

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

If you enjoyed this article, help us out and share with friends!

For complete coverage, follow our Twitter: @Quantum_Stat


Published via Towards AI
