

The NLP Cypher | 12.20.20

Last Updated on July 24, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.

Natural Language Processing


Knowledge Graphs, UFOs and New Speech Models… oh and change your password!

Welcome back for another week of the NLP Cypher. Time to review some oddities and perhaps even reel in a few jewels. There is plenty to talk about in the security space after the massive hack of multiple US agencies over the past week left everyone updating their McAfee firewall. And if you’re a rookie production engineer who’s now paranoid about proper opsec, why not start by reading a simple markdown file from GitHub?

(FYI, when in doubt, install QubesOS 🙈)


The goal of this document is to outline strict processes those that have access to PRODUCTION systems MUST follow. It…


And as always, if you enjoy this newsletter, give it a 👏👏 and share with your enemies. 😎

NeurIPS & Knowledge Graphs

Michael Galkin returns with his round-up of graph news, this time out of NeurIPS 🔥. Around five percent of papers from the conference were on graphs, so there is lots to discuss.

His TOC:

  1. Query Embedding: Beyond Query2Box
  2. KG Embeddings: NAS, 📦 vs 🔮, Meta-Learning
  3. SPARQL and Compositional Generalization
  4. Benchmarking: OGB, GraphGYM, KeOps
  5. Wrapping out


Machine Learning on Knowledge Graphs @ NeurIPS 2020

Your guide to the KG-related research in NLP, December edition


For an enterprise take w/r/t the power of knowledge graphs:

How Knowledge Graphs Will Transform Data Management And Business – The Innovator

In late November the U.S. Federal Drug Administration approved Benevolent AI's recommended arthritis drug Baricitnib as…


Training Data Extraction Attack 👀

A new paper (with authors from several big tech companies and universities) was recently published showing how one can attack language models like GPT-2 and extract verbatim information, such as personally identifiable information, just by querying the model. 🥶! The extracted information came from the models’ training data, which was scraped from the internet. This is a big problem, especially when you train a language model on a private custom dataset. The paper discusses causes and possible workarounds.


Book Me Some Data

Looks like Booking.com wants a new recommendation engine, and they are offering up their dataset of over 1 million anonymized hotel reservations to get you in the game. Pretty cool if you want a chance to work with real-world data.

Here’s the training dataset schema:

user_id — User ID
check-in — Reservation check-in date
checkout — Reservation check-out date
affiliate_id — An anonymized ID of affiliate channels where the booker came from (e.g. direct, some third party referrals, paid search engine, etc.)
device_class — desktop/mobile
booker_country — Country from which the reservation was made (anonymized)
hotel_country — Country of the hotel (anonymized)
city_id — city_id of the hotel’s city (anonymized)
utrip_id — Unique identification of user’s trip (a group of multi-destinations bookings within the same trip)

“The eval dataset is similar to the train set except that the city_id of the final reservation of each trip is concealed and requires a prediction.”
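As a rough illustration of that setup, here is a minimal sketch that groups reservations by `utrip_id`, orders them by check-in date, and hides the `city_id` of the last stop as the prediction target. The rows are made-up toy values, the `checkin` dictionary key is a stand-in for the schema's check-in column, and the helper name `build_eval_examples` is hypothetical. This is not the challenge's official tooling.

```python
from collections import defaultdict
from datetime import date

# Toy rows shaped like the schema above; all values are invented for illustration.
rows = [
    {"user_id": 1, "checkin": date(2016, 8, 16), "city_id": 303, "utrip_id": "1_1"},
    {"user_id": 1, "checkin": date(2016, 8, 13), "city_id": 101, "utrip_id": "1_1"},
    {"user_id": 1, "checkin": date(2016, 8, 14), "city_id": 202, "utrip_id": "1_1"},
    {"user_id": 2, "checkin": date(2016, 9, 2), "city_id": 404, "utrip_id": "2_1"},
    {"user_id": 2, "checkin": date(2016, 9, 1), "city_id": 101, "utrip_id": "2_1"},
]


def build_eval_examples(rows):
    """Map each trip to (visited city_ids in order, concealed final city_id)."""
    trips = defaultdict(list)
    for row in rows:
        trips[row["utrip_id"]].append(row)

    examples = {}
    for utrip_id, bookings in trips.items():
        # Order the stops of a trip chronologically by check-in date.
        ordered = sorted(bookings, key=lambda r: r["checkin"])
        visited = [r["city_id"] for r in ordered[:-1]]
        target = ordered[-1]["city_id"]  # the value a model would have to predict
        examples[utrip_id] = (visited, target)
    return examples


examples = build_eval_examples(rows)
for utrip_id, (visited, target) in examples.items():
    print(utrip_id, "visited:", visited, "hidden final city:", target)
```

A recommender would then be scored on how well it predicts the hidden final city from the visited sequence and the other booking features.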

Home | Booking.com WSDM challenge

The ACM WSDM WebTour 2021 Challenge focuses on a multi destinations trip planning problem. The goal of this challenge…


UFO Files Dumped on Archive.org

There’s a nice dump of UFO files spanning several decades and countries if you want to get your alien research on. Apparently there was some copyright beef between content owners and media publishers, which ultimately led to a third party obtaining a copy in the wild and uploading the files to archive.org 😭. Anyway, it’s a good data source to try out your latest OCR algo or if you are interested in searching for anti-gravity propulsion tech.


UFO Files

This is quite possibly the largest collection of publicly accessible UFO documents in the world, drawing from as many…


The Air Force Ported µZero

Apparently the US Air Force decided to port DeepMind’s µZero to the navigation/sensor system of a U-2 “Dragon Lady” spy plane. And they have called it ARTUµ, inspired by R2-D2 from Star Wars 😭. Recently, they ran the first-ever simulated flight to show off the AI’s capabilities. The mission was for ARTUµ to conduct reconnaissance of enemy missile launchers on the ground while the pilot looked for aerial threats.

DARPA be like:


Here Goes the Air Force's 'Big News' – The Debrief

For nearly two weeks, Assistant Secretary of the Air Force (Acquisition, Technology and Logistics) Will Roper, had been…


GitHub Search Index Update

GitHub will get rid of your repo from its code search index if it’s been inactive for more than a year. So how do you stay ‘active’?

“Recent activity for a repository means that it has had a commit or has shown up in a search result.”
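Read literally, that rule is a simple predicate over two timestamps. The sketch below is only one interpretation of the changelog quote, not GitHub's implementation: the function name `has_recent_activity`, the two timestamp parameters, and the 365-day window standing in for "a year" are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: "within the last year" is approximated as 365 days.
ACTIVITY_WINDOW = timedelta(days=365)


def has_recent_activity(
    last_commit: Optional[datetime],
    last_search_hit: Optional[datetime],
    now: datetime,
) -> bool:
    """True if either a commit or a search-result appearance falls in the window."""
    events = [t for t in (last_commit, last_search_hit) if t is not None]
    return any(now - t <= ACTIVITY_WINDOW for t in events)


now = datetime(2020, 12, 20)
# No recent commit, but the repo showed up in a search result last month.
print(has_recent_activity(datetime(2018, 1, 1), datetime(2020, 11, 1), now))
# Neither a commit nor a search hit in the window.
print(has_recent_activity(datetime(2018, 1, 1), None, now))
```

Under this reading, a repo nobody commits to can still stay indexed as long as it keeps surfacing in searches.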

Changes to Code Search Indexing – GitHub Changelog

Starting today, GitHub Code Search will only index repositories that have had recent activity within the last year…


Getting Rid of Intents

Alan Nichol opines on the state of conversational AI and his RASA platform, arguing that getting rid of intents is paramount for conversational AI to reach Kurzweil levels of robustness. RASA is currently experimenting with end-to-end learning as an alternative to intents.

In RASA 2.2 and beyond, intents will be optional.


We're a step closer to getting rid of intents

One year ago I wrote that it's about time we get rid of intents, and that seems to have struck a nerve with many people…


Speech Transformers & Datasets & WMT20 Model Checkpoints from FAIR

XLSR-53: Multilingual Self-Supervised Speech Transformer

Multilingual pre-trained wav2vec 2.0 models


wav2vec 2.0 learns speech representations on unlabeled data as described in wav2vec 2.0: A Framework for…


Multi-Lingual LibriSpeech Dataset


Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is…


WMT Models Out

Facebook FAIR’s WMT’20 news translation task submission models


Facebook AI Research Sequence-to-Sequence Toolkit written in Python. – pytorch/fairseq


Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁


ELECTRIC is an energy-based version of ELECTRA. In addition, it is able to re-rank speech recognition n-best lists better than language models and much faster than masked language models.

The new ELECTRIC model is found in the ELECTRA repo:


ELECTRA is a method for self-supervised language representation learning. It can be used to pre-train transformer…



ParsiNLU is a comprehensive suite of high-level NLP tasks for the Persian language. The suite contains 6 key NLP tasks: Reading Comprehension, Multiple-Choice Question-Answering, Textual Entailment, Sentiment Analysis, Query Paraphrasing and Machine Translation.


ParsiNLU is a comprehensive suit of high-level NLP tasks for Persian language. This suit contains 6 different key NLP…


Diff Pruning

Models finetuned with diff pruning can match the performance of fully finetuned baselines on the GLUE benchmark while only modifying 0.5% of the pretrained model’s parameters per task.
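To make the storage argument concrete, here is a toy sketch of the idea that each task only needs a small set of changes on top of shared pretrained weights. It shows only how a sparse diff would be applied and counted; how the diff is learned is not covered, and the list-based "model", the `apply_diff` helper, and all numbers are illustrative assumptions rather than code from the paper's repo.

```python
def apply_diff(pretrained, diff):
    """Return task parameters: shared pretrained values plus a sparse per-index diff."""
    task = list(pretrained)  # copy so the shared weights stay untouched
    for index, delta in diff.items():
        task[index] += delta
    return task


# Stand-in for shared pretrained weights (1,000 parameters, all zero for simplicity).
pretrained = [0.0] * 1000

# Hypothetical task-specific diff: only 5 of 1,000 entries change, i.e. 0.5%.
diff = {3: 0.5, 250: -1.25, 400: 2.0, 777: 0.75, 999: -0.5}

task_params = apply_diff(pretrained, diff)
modified_fraction = len(diff) / len(pretrained)
print(f"modified fraction: {modified_fraction:.1%}")
```

With this layout, adding a new task means storing one more small diff instead of another full copy of the model.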


While task-specific finetuning of pretrained networks has led to significant empirical advances in NLP, the large size…



PlanSum, a summarization model that leverages content planning


[AAAI2021] Unsupervised Opinion Summarization with Content Planning This PyTorch code was used in the experiments of the…


Biomedical Entity Linking

Keras implementation of a lightweight neural method for biomedical entity linking, which needs just a fraction of the parameters of a BERT model and much less computing resources.


This is a Keras implementation of the paper A Lightweight Neural Model for Biomedical Entity Linking. Clone the…



LIREx incorporates both a rationale-enabled explanation generator and an instance selector to select only relevant, plausible natural language explanations (NLEs) to augment NLI models.


This repo is the code release of the paper LIREx: Augmenting Language Inference with Relevant Explanations, which is…



RankAE performs summarization on chat dialogue without employing manually labeled data.


AAAI-2021 paper: Unsupervised Summarization for Chat Logs with Topic-Oriented Ranking and Context-Aware Auto-Encoders…


Dataset of the Week: ASAYAR

What is it?

The dataset comprises more than 1,800 annotated images collected from Moroccan highways. ASAYAR can be used to develop and evaluate traffic sign detection and French or Arabic scene text detection.


Where is it?

ASAYAR: A Dataset for Arabic-Latin Scene Text Localization in Highway Traffic Panels

Overview Welcome to ASAYAR, the first public dataset dedicated for Latin (French) and Arabic Scene Text Detection in…


Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat

Quantum Stat


Published via Towards AI
