The NLP Cypher | 12.20.20

Last Updated on July 24, 2023 by Editorial Team

Knowledge Graphs, UFOs and New Speech Models… oh and change your password!

Welcome back for another week of the NLP Cypher. Time to review some oddities and perhaps even reel in a few jewels. Plenty to talk about in the security space after the massive hack of multiple US agencies over the past week left everyone updating their McAfee firewall. And if you’re a rookie production engineer who’s now paranoid about proper opsec, why not start by reading a simple markdown file from GitHub?

(FYI, when in doubt, install QubesOS 🙈)

hashbang/book

The goal of this document is to outline strict processes those that have access to PRODUCTION systems MUST follow. It…

github.com

And as always, if you enjoy this newsletter, give it a 👏👏 and share with your enemies. 😎

NeurIPS & Knowledge Graphs

Michael Galkin returns with his round-up of graph news, this time out of NeurIPS 🔥. Around five percent of papers from the conference were on graphs, so there's lots to discuss.

Blog:

Machine Learning on Knowledge Graphs @ NeurIPS 2020

Your guide to the KG-related research in NLP, December edition

medium.com

For an enterprise take w/r/t the power of knowledge graphs:

How Knowledge Graphs Will Transform Data Management And Business – The Innovator

In late November the U.S. Federal Drug Administration approved Benevolent AI's recommended arthritis drug Baricitnib as…

theinnovator.news

Training Data Extraction Attack 👀

A new paper (with authors from every major big tech company) shows how one can attack language models like GPT-2 and extract information verbatim, such as personally identifiable information, just by querying the model 🥶! The extracted information came from the models’ training data, which was based on scraped internet info. This is a big problem, especially when you train a language model on a private custom dataset. The paper discusses causes and possible workarounds.

Paper

Book Me Some Data

Looks like Booking.com wants a new recommendation engine, and they are offering up their dataset of over 1 million anonymized hotel reservations to get you in the game. Pretty cool if you want a chance to work with real-world data.

Here’s the training dataset schema:

user_id — User ID
check-in — Reservation check-in date
checkout — Reservation check-out date
affiliate_id — An anonymized ID of affiliate channels where the booker came from (e.g. direct, some third party referrals, paid search engine, etc.)
device_class — desktop/mobile
booker_country — Country from which the reservation was made (anonymized)
hotel_country — Country of the hotel (anonymized)
city_id — city_id of the hotel’s city (anonymized)
utrip_id — Unique identification of user’s trip (a group of multi-destinations bookings within the same trip)

“The eval dataset is similar to the train set except that the city_id of the final reservation of each trip is concealed and requires a prediction.”
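To make the eval setup concrete, here is a minimal sketch of how one might build the same kind of split locally from the training data: group reservations by trip, order them by check-in date, and hide the city of the last one. The function name, the `checkin` key spelling, and the sample rows are assumptions for illustration, not taken from the challenge's files.

```python
from collections import defaultdict

def split_trips(rows):
    """Group reservations by trip and conceal the final city of each trip."""
    trips = defaultdict(list)
    for row in rows:
        trips[row["utrip_id"]].append(row)

    eval_rows, answers = [], {}
    for trip_id, bookings in trips.items():
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        bookings.sort(key=lambda r: r["checkin"])
        *earlier, last = bookings
        answers[trip_id] = last["city_id"]           # held-out label
        eval_rows.extend(earlier)
        eval_rows.append({**last, "city_id": None})  # concealed target
    return eval_rows, answers

# Made-up rows for illustration only; not real challenge data.
rows = [
    {"utrip_id": "1_1", "checkin": "2016-08-13", "city_id": 8183},
    {"utrip_id": "1_1", "checkin": "2016-08-14", "city_id": 15626},
    {"utrip_id": "2_1", "checkin": "2016-04-09", "city_id": 384},
    {"utrip_id": "2_1", "checkin": "2016-04-07", "city_id": 1000},
]
eval_rows, answers = split_trips(rows)
```

A model would then be scored on how well it recovers `answers` from the rows whose `city_id` is `None`.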

Home | Booking.com WSDM challenge

The ACM WSDM WebTour 2021 Challenge focuses on a multi destinations trip planning problem. The goal of this challenge…

www.bookingchallenge.com

UFO Files Dumped on Archive.org

There’s a nice dump of UFO files spanning several decades and countries if you want to get your alien research on. Apparently there was some copyright beef between content owners and media publishers, which ultimately led to a third party obtaining a copy in the wild and uploading the files to archive.org 😭. Anyway, it’s a good data source for trying out your latest OCR algo, or if you are interested in searching for anti-gravity propulsion tech.

👽:

UFO Files

This is quite possibly the largest collection of publicly accessible UFO documents in the world, drawing from as many…

that1archive.neocities.org

The Air Force Ported µZero

Apparently the US Air Force decided to port DeepMind’s µZero to the navigation/sensor system of a U-2 “Dragon Lady” spy plane. And they have called it ARTUµ, inspired by R2-D2 from Star Wars 😭. Recently, they ran the first ever simulated flight to show off the AI’s capabilities. The mission was for ARTUµ to conduct reconnaissance of enemy missile launchers on the ground while the pilot looked for aerial threats.

Article:

Here Goes the Air Force's 'Big News' – The Debrief

For nearly two weeks, Assistant Secretary of the Air Force (Acquisition, Technology and Logistics) Will Roper, had been…

thedebrief.org

GitHub Search Index Update

GitHub will get rid of your repo from its code search index if it’s been inactive for more than a year. So how do you stay ‘active’?

“Recent activity for a repository means that it has had a commit or has shown up in a search result.”
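Read literally, the rule is an OR over two timestamps. A minimal sketch, assuming "a year" means 365 days and using a hypothetical helper name (this is not GitHub's code, just the quoted rule written out):

```python
from datetime import datetime, timedelta

ACTIVITY_WINDOW = timedelta(days=365)  # assumption: "a year" taken as 365 days

def is_indexed(last_commit, last_search_hit, now):
    """Return True if either activity signal falls within the window."""
    events = [t for t in (last_commit, last_search_hit) if t is not None]
    return any(now - t <= ACTIVITY_WINDOW for t in events)

now = datetime(2020, 12, 20)
stale_commit = datetime(2018, 1, 1)
recent_hit = datetime(2020, 6, 1)

print(is_indexed(stale_commit, None, now))        # False
print(is_indexed(stale_commit, recent_hit, now))  # True
```

So under this reading, a repo with no recent commits stays indexed as long as it keeps showing up in search results.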

Changes to Code Search Indexing – GitHub Changelog

Starting today, GitHub Code Search will only index repositories that have had recent activity within the last year…

github.blog

Getting Rid of Intents

Alan Nichol opines on the state of conversational AI and his Rasa platform, arguing that getting rid of intents is paramount if conversational AIs are to achieve Kurzweil levels of robustness. The team is currently experimenting with end-to-end learning as an alternative to intents.

In Rasa 2.2 and beyond, intents will be optional.

Blog:

We're a step closer to getting rid of intents

One year ago I wrote that it's about time we get rid of intents, and that seems to have struck a nerve with many people…

blog.rasa.com

Speech Transformers & Datasets & WMT20 Model Checkpoints from FAIR

XLSR-53: Multilingual Self-Supervised Speech Transformer

Multilingual pre-trained wav2vec 2.0 models

pytorch/fairseq

wav2vec 2.0 learns speech representations on unlabeled data as described in wav2vec 2.0: A Framework for…

github.com

Multi-Lingual LibriSpeech Dataset

facebookresearch/wav2letter

Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is…

github.com

WMT Models Out

Facebook FAIR’s WMT’20 news translation task submission models

pytorch/fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python. – pytorch/fairseq

github.com

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

ELECTRIC

ELECTRIC is an ELECTRA version of an energy-based model. In addition, it is able to re-rank speech recognition n-best lists better than language models and much faster than masked language models.
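For anyone new to n-best re-ranking, here is a generic sketch of the idea: a second-pass scorer re-scores each hypothesis from the recognizer and the list is re-sorted. The scores, the interpolation weight, and the lookup-table scorer are all hypothetical stand-ins; this does not reproduce how ELECTRIC itself scores text.

```python
def rerank(nbest, rescore, weight=0.5):
    """Sort hypotheses by first-pass score plus weighted second-pass score."""
    return sorted(
        nbest,
        key=lambda h: h["asr_score"] + weight * rescore(h["text"]),
        reverse=True,
    )

# Hypothetical second-pass scorer: a lookup table standing in for a model.
toy_scores = {"recognize speech": -1.0, "wreck a nice beach": -6.0}

# Made-up first-pass log scores; the recognizer slightly prefers the wrong one.
nbest = [
    {"text": "wreck a nice beach", "asr_score": -3.0},
    {"text": "recognize speech", "asr_score": -3.5},
]

best = rerank(nbest, lambda text: toy_scores[text])[0]
print(best["text"])  # recognize speech
```

The speed claim matters here because the second-pass scorer runs once per hypothesis, so a cheaper scorer makes larger n-best lists practical.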

The new ELECTRIC model is found in the ELECTRA repo:

google-research/electra

ELECTRA is a method for self-supervised language representation learning. It can be used to pre-train transformer…

github.com

ParsiNLU

ParsiNLU is a comprehensive suite of high-level NLP tasks for the Persian language. The suite contains 6 key NLP tasks: Reading Comprehension, Multiple-Choice Question-Answering, Textual Entailment, Sentiment Analysis, Query Paraphrasing and Machine Translation.

persiannlp/parsinlu

ParsiNLU is a comprehensive suit of high-level NLP tasks for Persian language. This suit contains 6 different key NLP…

github.com

Diff Pruning

Models finetuned with diff pruning can match the performance of fully finetuned baselines on the GLUE benchmark while only modifying 0.5% of the pretrained model’s parameters per task.
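The storage win is easier to see in code. The rough idea is that each task keeps the shared pretrained weights plus a sparse "diff". In the sketch below the diff is found by simple magnitude thresholding after the fact, which is a stand-in chosen for brevity and not the repo's training procedure; the function names and toy parameter vectors are likewise assumptions.

```python
import random

def sparse_diff(pretrained, finetuned, keep_fraction=0.005):
    """Keep only the largest-magnitude changes; return {index: delta}."""
    deltas = [f - p for p, f in zip(pretrained, finetuned)]
    k = max(1, round(len(deltas) * keep_fraction))
    top = sorted(range(len(deltas)), key=lambda i: abs(deltas[i]), reverse=True)[:k]
    return {i: deltas[i] for i in top}

def apply_diff(pretrained, diff):
    """Rebuild task parameters as pretrained weights plus the sparse diff."""
    task = list(pretrained)
    for i, d in diff.items():
        task[i] += d
    return task

# Toy parameter vectors; a real model would have millions of entries.
random.seed(0)
pretrained = [random.gauss(0, 1) for _ in range(10_000)]
finetuned = [p + random.gauss(0, 0.01) for p in pretrained]

diff = sparse_diff(pretrained, finetuned)
task_params = apply_diff(pretrained, diff)
print(len(diff), "of", len(pretrained), "parameters stored for this task")
```

With a 0.5% diff, each extra task costs a small dictionary of changes while the pretrained weights are stored once and shared.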

dguo98/DiffPruning

While task-specific finetuning of pretrained networks has led to significant empirical advances in NLP, the large size…

github.com

PlanSum

PlanSum is a summarization model that leverages content planning.

rktamplayo/PlanSum

[AAAI2021] Unsupervised Opinion Summarization with Content Planning This PyTorch code was used in the experiments of the…

github.com

Biomedical Entity Linking

Keras implementation of a lightweight neural method for biomedical entity linking, which needs just a fraction of the parameters of a BERT model and much less computing resources.

tigerchen52/Biomedical-Entity-Linking

This is a Keras implementation of the paper A Lightweight Neural Model for Biomedical Entity Linking. Clone the…

github.com

LIREx

LIREx incorporates both a rationale-enabled explanation generator and an instance selector to select only relevant, plausible natural language explanations (NLEs) to augment NLI models.

zhaoxy92/LIREx

This repo is the code release of the paper LIREx: Augmenting Language Inference with Relevant Explanations, which is…

github.com

RankAE

RankAE performs summarization on chat dialogue without employing manually labeled data.

RowitZou/RankAE

AAAI-2021 paper: Unsupervised Summarization for Chat Logs with Topic-Oriented Ranking and Context-Aware Auto-Encoders…

github.com

Dataset of the Week: ASAYAR

What is it?

The dataset comprises more than 1,800 annotated images collected from Moroccan highways. ASAYAR can be used to develop and evaluate traffic sign detection and French or Arabic scene text detection.

Where is it?

ASAYAR: A Dataset for Arabic-Latin Scene Text Localization in Highway Traffic Panels

Overview Welcome to ASAYAR, the first public dataset dedicated for Latin (French) and Arabic Scene Text Detection in…

vcar.github.io

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat

Published via Towards AI

Author(s): Ricky Costa