NLP News Cypher | 09.20.20
Last Updated on July 24, 2023 by Editorial Team
Author(s): Ricky Costa
Originally published on Towards AI.
NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
EMNLP and Graphs 😵
☝️ Persian art is pretty. Welcome back for another week of the Cypher. Yesterday, we made another weekly update to the Big Bad NLP Database and the Super Duper NLP Repo: 10 new datasets and 6 new notebooks. This update was a good one, since we added PyTorch Geometric notebooks for graph neural networks in case you all are feeling a bit adventurous. 🙈
BTW, if you enjoy this newsletter please share it or give it a 👏👏!
Detour: I've been experimenting with ONNX Runtime inference for BERT question answering. Latency improves significantly with ONNX: the demo currently runs on "okay-ish" cloud CPUs, and latency lands in the 170–240 ms range. Here's the demo:
ONNX Runtime Inference | Quantum Stat
BERT Question Answering
onnx.quantumstat.com
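If you want to try something similar yourself, here's a minimal sketch of BERT QA inference with ONNX Runtime. This is not the demo's actual code: the model file name and the exported tensor names below are assumptions that depend on how you export the model.

```python
# A minimal sketch of BERT QA inference with ONNX Runtime.
# "bert_qa.onnx" and the input/output names are assumptions
# that depend on how the model was exported.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "bert-large-uncased-whole-word-masking-finetuned-squad"
)
session = ort.InferenceSession("bert_qa.onnx")

question = "What improves latency?"
context = "Exporting BERT to ONNX significantly improves CPU inference latency."
enc = tokenizer(question, context, return_tensors="np")

# Exported QA models typically return start and end logits over the tokens.
start_logits, end_logits = session.run(
    None,
    {
        "input_ids": enc["input_ids"],
        "attention_mask": enc["attention_mask"],
        "token_type_ids": enc["token_type_ids"],
    },
)

# Decode the highest-scoring answer span back into text.
start, end = int(np.argmax(start_logits)), int(np.argmax(end_logits))
print(tokenizer.decode(enc["input_ids"][0][start : end + 1]))
```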
FYI, several EMNLP-accepted papers were circulating this week ahead of the November conference. Before we go there, here's a quick appetizer from the paper "Message Passing for Hyper-Relational Knowledge Graphs," which compares the traditional knowledge triple with a hyper-relational graph.
Use the Force LUKE (preprint not out yet 😥)
GNN Resources
Found this thread from Petar Veličković (DeepMind) highlighting top graph neural network resources, enjoy:
A thread written by @PetarV_93
As requested, here are a few non-exhaustive resources I'd recommend for getting started with Graph Neural Nets (GNNs)…
threader.app
NeurIPS Fun n' Games:
Wordplay: When Language Meets Games @ NeurIPS 2020. Date and time: full-day workshop on Fri Dec 11th or Sat the 12th…
wordplay-workshop.github.io
This Week
Dialog Ranking Pretrained Transformers
TensorFlow Lite and NLP
Indonesian NLU Benchmark
CoDEx
RECOApy for Speech Preprocessing
Survey on the "X-Formers"
Dataset of the Week: ASSET
Dialog Ranking Pretrained Transformers
Another one accepted at EMNLP, from Microsoft Research: using transformers (GPT-2) to figure out whether a reply to a comment is likely to get engagement. Pretty interesting, huh! Their dialog ranking models were trained on 133M pairs of human feedback data from Reddit.
So what does it really do? Here's an example from their demo: for the statement "I love NLP!", the response "Here's a free textbook (URL) in case anyone needs it." is more likely to be upvoted than "Me too!" (i.e., the former gets a higher ranking score).
Additionally, their colab allows you to run several models at once to distinguish (see the sketch after this list):
- updown: which response gets more upvotes?
- width: which response gets more direct replies?
- depth: which response gets a longer follow-up thread?
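If you'd rather not open the Colab first, here's a minimal sketch of scoring two candidate replies with the updown model via Hugging Face transformers. It follows the usage on the model card, assuming the checkpoint is published on the Hub as microsoft/DialogRPT-updown; treat the details as assumptions.

```python
# A minimal sketch of ranking candidate replies with DialogRPT, following the
# model card's usage; the Hub id "microsoft/DialogRPT-updown" is assumed.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialogRPT-updown")
model = AutoModelForSequenceClassification.from_pretrained("microsoft/DialogRPT-updown")

def score(context: str, response: str) -> float:
    # Context and response are joined with the <|endoftext|> separator token.
    inputs = tokenizer.encode(context + "<|endoftext|>" + response, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs, return_dict=True).logits
    return torch.sigmoid(logits).item()

# The example above: the textbook reply should outrank "Me too!".
print(score("I love NLP!", "Here's a free textbook (URL) in case anyone needs it."))
print(score("I love NLP!", "Me too!"))
```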
Colab of the Week
Thank you to author Xiang Gao for forwarding; you can also find it on the Super Duper NLP Repo ✌️…
Google Colaboratory
colab.research.google.com
GitHub:
golsun/DialogRPT
How likely a dialog response is upvoted 👍 and/or gets replied 💬? This is what DialogRPT is learned to predict. It is…
github.com
Paper: https://arxiv.org/pdf/2009.06978.pdf
TensorFlow Lite and NLP
From their blog post this past week: TF Lite now has new NLP features, including new pre-trained NLP models and better support for converting TensorFlow NLP models to the TensorFlow Lite format.
TensorFlow Lite Model Maker
The TensorFlow Lite Model Maker library simplifies the process of training a TensorFlow Lite model using custom…
www.tensorflow.org
FYI, their TF Lite Task Library has 3 NLP APIs:
- NLClassifier: classifies the input text into a set of known categories.
- BertNLClassifier: classifies text; optimized for BERT-family models.
- BertQuestionAnswerer: answers questions based on the content of a given passage, with BERT-family models.
Keep in mind these are models that run natively on the phone (i.e., no internet connection to a cloud server needed).
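For a feel of what on-device inference looks like underneath, here's a minimal sketch using the plain TF Lite Python Interpreter; the Task Library APIs above bundle this plus tokenization. The model file is hypothetical, standing in for any classifier you've converted to TFLite format.

```python
# A minimal sketch of on-device-style inference with the TF Lite Python
# Interpreter. "text_classifier.tflite" is a hypothetical exported model.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="text_classifier.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed pre-tokenized ids shaped/typed to match the model's input tensor.
dummy_ids = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy_ids)
interpreter.invoke()

scores = interpreter.get_tensor(output_details[0]["index"])
print(scores)  # e.g., per-class probabilities
```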
What's new in TensorFlow Lite for NLP
September 16, 2020 – Posted by Tian Lin, Yicheng Fan, Jaesung Chung and Chen Cen TensorFlow Lite has been widely…
blog.tensorflow.org
Indonesian NLU Benchmark
Check out the new Indonesian NLU benchmark. It includes a BERT-based model, IndoBERT, and its ALBERT alternative, IndoBERT-lite. The benchmark also includes datasets for 12 downstream tasks covering single-sentence classification, single-sentence sequence tagging, sentence-pair classification, and sentence-pair sequence labeling.
And finally, a large corpus for language modeling containing 4 billion words (250M sentences) 🔥🔥.
IndoNLU Benchmark
The IndoNLU benchmark is a collection of resources for training, evaluating, and analyzing natural language…
www.indobenchmark.com
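To poke at the model itself, here's a minimal sketch of pulling IndoBERT features via Hugging Face transformers. The Hub id indobenchmark/indobert-base-p1 is the one the project lists; treat it as an assumption if the checkpoints move.

```python
# A minimal sketch of extracting IndoBERT features with transformers.
# The Hub id "indobenchmark/indobert-base-p1" is assumed from the project page.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
model = AutoModel.from_pretrained("indobenchmark/indobert-base-p1")

inputs = tokenizer("Saya suka NLP!", return_tensors="pt")  # "I love NLP!"
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```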
Paper:
CoDEx
More from EMNLP 😎:
"CoDEx offers three rich knowledge graph datasets that contain positive and hard negative triples, entity types, entity and relation descriptions, and Wikipedia page extracts for entities."
In addition, they provide pretrained models, usable with the LibKGE library, for link prediction and triple classification tasks (a quick scoring sketch follows the GitHub link below).
The total data dump contains 1,156,222 triples.
GitHub:
tsafavi/codex
CoDEx is a set of knowledge graph Completion Datasets Extracted from Wikidata and Wikipedia. As introduced and…
github.com
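Here's a minimal sketch of what scoring triples with one of those pretrained models looks like in LibKGE, following LibKGE's documented usage. The checkpoint filename is hypothetical; the CoDEx repo links the actual pretrained checkpoints.

```python
# A minimal sketch of scoring triples with a pretrained CoDEx model in LibKGE.
# The checkpoint filename "codex-m-complex.pt" is hypothetical.
import torch
from kge.model import KgeModel
from kge.util.io import load_checkpoint

checkpoint = load_checkpoint("codex-m-complex.pt")
model = KgeModel.create_from(checkpoint)

# Score (subject, relation, object) index triples; higher means more plausible.
s = torch.tensor([0, 2]).long()
p = torch.tensor([0, 1]).long()
o = torch.tensor([1, 3]).long()
print(model.score_spo(s, p, o))
```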
RECOApy for Speech Preprocessing
RECOApy is a new library that gives developers a UI for recording and phonetically transcribing data for speech apps, along with grapheme-to-phoneme conversion. Currently, the library supports transcription in eight languages: Czech, English, French, German, Italian, Polish, Romanian, and Spanish.
GitHub:
adrianastan/recoapy
RECOApy streamlines the steps of data recording and pre-processing required in end-to-end speech-based applications…
github.com
Survey on the "X-Formers"
"X-formers" is the Google authors' umbrella term for the new, very memory-efficient transformer variants (e.g., Longformer and Reformer) that came on the scene in 2020. In this paper, the authors give a holistic view of these architectures, their techniques, and current trends.
Paper: https://arxiv.org/pdf/2009.06732.pdf
Dataset of the Week: ASSET
What is it?
A dataset for tuning and evaluating automatic sentence simplification models. ASSET consists of 23,590 human simplifications of the 2,359 original sentences from TurkCorpus.
Where is it?
facebookresearch/asset
ASSET is a dataset for evaluating Sentence Simplification systems with multiple rewriting transformations, as described…
github.com
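If you prefer not to wrangle the raw TSVs, here's a minimal sketch of loading ASSET with the Hugging Face datasets library, assuming it is mirrored on the Hub; the id facebook/asset, the config name, and the field names below are assumptions (the GitHub repo remains the canonical source).

```python
# A minimal sketch of loading ASSET via Hugging Face datasets.
# The Hub id "facebook/asset" and field names are assumptions.
from datasets import load_dataset

asset = load_dataset("facebook/asset", "simplification", split="validation")
example = asset[0]
print(example["original"])             # one TurkCorpus source sentence
print(example["simplifications"][:2])  # two of its human simplifications
```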
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
If you enjoyed this article, help us out and share with friends!
For complete coverage, follow our Twitter: @Quantum_Stat
Join over 80,000 data leaders and subscribers of the AI newsletter to keep up to date with the latest developments in AI, from research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.
Published via Towards AI