The NLP Cypher | 12.20.20
Last Updated on July 24, 2023 by Editorial Team
Author(s): Ricky Costa
Originally published on Towards AI.
Natural Language Processing
Knowledge Graphs, UFOs and New Speech Models… oh and change your password!
Welcome back for another week of the NLP Cypher. Time to review some oddities and perhaps even reel in a few jewels. There’s plenty to talk about in the security space after the massive hack of multiple US agencies over the past week left everyone updating their McAfee firewall. And if you’re a rookie production engineer who’s now paranoid about having proper opsec, why not start by reading a simple markdown file from GitHub?
(FYI, when in doubt, install QubesOS 🙈)
hashbang/book
The goal of this document is to outline strict processes those that have access to PRODUCTION systems MUST follow. It…
github.com
And as always, if you enjoy this newsletter, give it a 👏👏 and share with your enemies. 😎
NeurIPS & Knowledge Graphs
Michael Galkin returns with his round-up of graph news, this time out of NeurIPS 🔥. Around five percent of the papers at the conference were on graphs, so there’s lots to discuss.
His TOC:
- Query Embedding: Beyond Query2Box
- KG Embeddings: NAS, 📦 vs 🔮, Meta-Learning
- SPARQL and Compositional Generalization
- Benchmarking: OGB, GraphGYM, KeOps
- Wrapping out
Blog:
Machine Learning on Knowledge Graphs @ NeurIPS 2020
Your guide to the KG-related research in NLP, December edition
medium.com
For an enterprise take w/r/t the power of knowledge graphs:
How Knowledge Graphs Will Transform Data Management And Business – The Innovator
In late November the U.S. Federal Drug Administration approved Benevolent AI's recommended arthritis drug Baricitnib as…
theinnovator.news
Training Data Extraction Attack 👀
A new paper (with authors from several big tech companies) was recently published showing how one can attack language models like GPT-2 and extract information verbatim, such as personally identifiable information, just by querying the model. 🥶! The extracted information came from the model’s training data, which was based on scraped internet data. This is a big problem, especially when you train a language model on a private custom dataset. The paper discusses causes and possible workarounds.
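For intuition, here’s a toy sketch of how one might flag memorized text among a model’s outputs: score each candidate by how easy the model finds it (average negative log-likelihood) relative to its zlib-compressed size. Everything in it is an assumption made for illustration (the character-bigram "model", the tiny corpus with a made-up record, the `memorization_score` helper). It is not the paper’s code, and a real attack would query an actual language model such as GPT-2.

```python
import math
import zlib
from collections import Counter, defaultdict

VOCAB_SIZE = 256  # assumed smoothing constant for the toy model


def train_char_bigram(corpus):
    """Count character bigrams; a toy stand-in for a real language model."""
    counts = defaultdict(Counter)
    for text in corpus:
        for prev, nxt in zip(text, text[1:]):
            counts[prev][nxt] += 1
    return counts


def avg_neg_log_likelihood(counts, text):
    """Average negative log-likelihood per bigram, with add-one smoothing."""
    pairs = list(zip(text, text[1:]))
    total = 0.0
    for prev, nxt in pairs:
        seen = counts.get(prev, Counter())
        prob = (seen[nxt] + 1) / (sum(seen.values()) + VOCAB_SIZE)
        total -= math.log(prob)
    return total / max(len(pairs), 1)


def zlib_bits(text):
    """Compressed size in bits: a cheap, model-free proxy for information content."""
    return 8 * len(zlib.compress(text.encode("utf-8")))


def memorization_score(counts, text):
    """Lower is more suspicious: easy for the model, yet not trivially compressible."""
    return avg_neg_log_likelihood(counts, text) / zlib_bits(text)


# Hypothetical training corpus containing one made-up "private" record.
corpus = [
    "the weather is nice today",
    "call jane doe at 555-0134 now",
    "graphs are everywhere this week",
]
model = train_char_bigram(corpus)

# One string the toy model has seen verbatim, one it has never seen.
candidates = ["call jane doe at 555-0134 now", "zq xvk mpj 8826 bnm yuf qlo"]
ranked = sorted(candidates, key=lambda s: memorization_score(model, s))
```

Dividing by the compressed size is there so that trivially repetitive strings, which any model finds easy, don’t crowd out the interesting candidates.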
Book Me Some Data
Looks like Booking.com wants a new recommendation engine, and they are offering up a dataset of over 1 million anonymized hotel reservations to get you in the game. Pretty cool if you want a chance to work with real-world data.
Here’s the training dataset schema:
- user_id – User ID
- check-in – Reservation check-in date
- checkout – Reservation check-out date
- affiliate_id – An anonymized ID of affiliate channels where the booker came from (e.g. direct, some third party referrals, paid search engine, etc.)
- device_class – desktop/mobile
- booker_country – Country from which the reservation was made (anonymized)
- hotel_country – Country of the hotel (anonymized)
- city_id – city_id of the hotel’s city (anonymized)
- utrip_id – Unique identification of user’s trip (a group of multi-destinations bookings within the same trip)
“The eval dataset is similar to the train set except that the city_id of the final reservation of each trip is concealed and requires a prediction.”
Home | Booking.com WSDM challenge
The ACM WSDM WebTour 2021 Challenge focuses on a multi destinations trip planning problem. The goal of this challenge…
www.bookingchallenge.com
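To make the schema and the eval setup concrete, here is a minimal sketch of a reservation record and of how the final `city_id` of each trip could be hidden. The `Reservation` class, the `conceal_final_city` helper, the Python types, and the sample rows are all assumptions for illustration. The `check-in` field is spelled `checkin` because a hyphen is not valid in a Python identifier, and none of this is the challenge’s official tooling.

```python
from dataclasses import dataclass, replace
from datetime import date
from itertools import groupby
from typing import Optional


@dataclass(frozen=True)
class Reservation:
    user_id: int
    checkin: date
    checkout: date
    affiliate_id: int
    device_class: str
    booker_country: str
    hotel_country: str
    city_id: Optional[int]  # None once concealed for evaluation
    utrip_id: str


def conceal_final_city(rows):
    """Hide city_id on the last reservation of each trip.

    Returns (eval_rows, answers), where answers maps utrip_id to the hidden city_id.
    """
    ordered = sorted(rows, key=lambda r: (r.utrip_id, r.checkin))
    eval_rows, answers = [], {}
    for trip_id, group in groupby(ordered, key=lambda r: r.utrip_id):
        trip = list(group)
        answers[trip_id] = trip[-1].city_id
        eval_rows.extend(trip[:-1])
        eval_rows.append(replace(trip[-1], city_id=None))
    return eval_rows, answers


# Made-up sample rows: one two-stop trip and one single-stop trip.
rows = [
    Reservation(1, date(2016, 8, 13), date(2016, 8, 14), 359, "desktop", "CountryA", "CountryB", 101, "1_1"),
    Reservation(1, date(2016, 8, 14), date(2016, 8, 16), 359, "desktop", "CountryA", "CountryB", 202, "1_1"),
    Reservation(2, date(2016, 9, 1), date(2016, 9, 3), 384, "mobile", "CountryC", "CountryC", 303, "2_1"),
]
eval_rows, answers = conceal_final_city(rows)
```

A recommender would then be asked to predict the values held in `answers` from the remaining rows of each trip.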
UFO Files Dumped on Archive.org
There’s a nice dump of UFO files spanning several decades and countries if you want to get your alien research on. Apparently there was some copyright beef between content owners and media publishers, which ultimately led to a third party obtaining a copy in the wild and uploading the files to archive.org 😭. Anyway, it’s a good data source for trying out your latest OCR algo, or if you are interested in searching for anti-gravity propulsion tech.
👽:
UFO Files
This is quite possibly the largest collection of publicly accessible UFO documents in the world, drawing from as many…
that1archive.neocities.org
The Air Force Ported µZero
Apparently the US Air Force decided to port DeepMind’s µZero to the navigation/sensor system of a U-2 “dragon lady” spy plane. And they have called it ARTUµ, inspired by R2-D2 from Star Wars 😭. Recently, they ran the first-ever simulated flight to show off the AI’s capabilities. The mission was for ARTUµ to conduct reconnaissance of enemy missile launchers on the ground while the pilot looked for aerial threats.
Article:
Here Goes the Air Force's 'Big News' – The Debrief
For nearly two weeks, Assistant Secretary of the Air Force (Acquisition, Technology and Logistics) Will Roper, had been…
thedebrief.org
GitHub Search Index Update
GitHub will drop your repo from its code search index if it’s been inactive for more than a year. So how do you stay “active”?
“Recent activity for a repository means that it has had a commit or has shown up in a search result.”
Changes to Code Search Indexing – GitHub Changelog
Starting today, GitHub Code Search will only index repositories that have had recent activity within the last year…
github.blog
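The rule quoted above boils down to a simple either/or check. Here is a small sketch of it, assuming "a year" means 365 days. The `stays_in_search_index` helper and the sample dates are made up for illustration and say nothing about how GitHub actually implements it.

```python
from datetime import date, timedelta
from typing import Optional

ACTIVITY_WINDOW = timedelta(days=365)  # assumption: "a year" taken as 365 days


def stays_in_search_index(today: date,
                          last_commit: Optional[date],
                          last_search_hit: Optional[date]) -> bool:
    """True if the repo had a commit OR showed up in a search result within the window."""
    events = [d for d in (last_commit, last_search_hit) if d is not None]
    return any(today - d <= ACTIVITY_WINDOW for d in events)


today = date(2020, 12, 20)
stale_repo = stays_in_search_index(today, date(2019, 1, 1), None)
searched_repo = stays_in_search_index(today, date(2019, 1, 1), date(2020, 11, 1))
```

In the sketch, a repo with no recent commits still counts as active as long as it recently appeared in someone’s search results.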
Getting Rid of Intents
Alan Nichol opines on the latest state of conversational AI and his RASA platform, arguing that getting rid of intents is paramount if conversational AIs are to achieve Kurzweil levels of robustness. The team is currently experimenting with end-to-end learning as an alternative to intents.
In RASA 2.2 and beyond, intents will be optional.
Blog:
We're a step closer to getting rid of intents
One year ago I wrote that it's about time we get rid of intents, and that seems to have struck a nerve with many people…
blog.rasa.com
Speech Transformers & Datasets & WMT20 Model Checkpoints from FAIR
XLSR-53: Multilingual Self-Supervised Speech Transformer
Multilingual pre-trained wav2vec 2.0 models
pytorch/fairseq
wav2vec 2.0 learns speech representations on unlabeled data as described in wav2vec 2.0: A Framework for…
github.com
Multi-Lingual LibriSpeech Dataset
facebookresearch/wav2letter
Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is…
github.com
WMT Models Out
Facebook FAIR’s WMT’20 news translation task submission models
pytorch/fairseq
Facebook AI Research Sequence-to-Sequence Toolkit written in Python. – pytorch/fairseq
github.com
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
ELECTRIC
ELECTRIC is an energy-based variant of ELECTRA. It can re-rank speech recognition n-best lists better than language models and much faster than masked language models.
The new ELECTRIC model is found in the ELECTRA repo:
google-research/electra
ELECTRA is a method for self-supervised language representation learning. It can be used to pre-train transformer…
github.com
ParsiNLU
ParsiNLU is a comprehensive suite of high-level NLP tasks for the Persian language. The suite contains 6 key NLP tasks – Reading Comprehension, Multiple-Choice Question-Answering, Textual Entailment, Sentiment Analysis, Query Paraphrasing and Machine Translation.
persiannlp/parsinlu
ParsiNLU is a comprehensive suit of high-level NLP tasks for Persian language. This suit contains 6 different key NLP…
github.com
Diff Pruning
Models finetuned with diff pruning can match the performance of fully finetuned baselines on the GLUE benchmark while only modifying 0.5% of the pretrained model’s parameters per task.
dguo98/DiffPruning
While task-specific finetuning of pretrained networks has led to significant empirical advances in NLP, the large size…
github.com
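To see why touching only 0.5% of the parameters matters, here is a minimal sketch of the storage side of the idea: one shared set of pretrained weights plus a small sparse "diff" per task. The `apply_diff` helper, the 1,000-parameter toy vector, and the diff values are all made up for illustration. How the sparse diff is actually learned is out of scope here, and this is not code from the DiffPruning repo.

```python
def apply_diff(pretrained, sparse_diff):
    """Return task-specific parameters: a copy of the pretrained ones plus a sparse diff."""
    task_params = list(pretrained)
    for index, delta in sparse_diff.items():
        task_params[index] += delta
    return task_params


# Toy "pretrained model" with 1,000 parameters (hypothetical values).
pretrained = [1.0] * 1000

# Hypothetical task diff stored as {parameter index: change}; 5 of 1,000 entries = 0.5%.
sparse_diff = {3: 0.25, 250: -0.5, 512: 1.0, 731: 0.125, 998: -0.75}

task_params = apply_diff(pretrained, sparse_diff)
fraction_modified = len(sparse_diff) / len(pretrained)
```

Each new task then costs only its own small diff on top of the shared pretrained weights, rather than a full copy of the model.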
PlanSum
PlanSum is a summarization model that leverages content planning.
rktamplayo/PlanSum
[AAAI2021] Unsupervised Opinion Summarization with Content Planning This PyTorch code was used in the experiments of the…
github.com
Biomedical Entity Linking
Keras implementation of a lightweight neural method for biomedical entity linking, which needs just a fraction of the parameters of a BERT model and much less computing resources.
tigerchen52/Biomedical-Entity-Linking
This is a Keras implementation of the paper A Lightweight Neural Model for Biomedical Entity Linking. Clone the…
github.com
LIREx
LIREx incorporates both a rationale-enabled explanation generator and an instance selector to select only relevant, plausible natural language explanations (NLEs) to augment NLI models.
zhaoxy92/LIREx
This repo is the code release of the paper LIREx: Augmenting Language Inference with Relevant Explanations, which is…
github.com
RankAE
RankAE performs summarization on chat dialogue without employing manually labeled data.
RowitZou/RankAE
AAAI-2021 paper: Unsupervised Summarization for Chat Logs with Topic-Oriented Ranking and Context-Aware Auto-Encoders…
github.com
Dataset of the Week: ASAYAR
What is it?
The dataset comprises more than 1,800 annotated images collected from Moroccan highways. ASAYAR can be used to develop and evaluate traffic sign detection as well as French and Arabic text detection.
Where is it?
ASAYAR: A Dataset for Arabic-Latin Scene Text Localization in Highway Traffic Panels
Overview Welcome to ASAYAR, the first public dataset dedicated for Latin (French) and Arabic Scene Text Detection in…
vcar.github.io
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
For complete coverage, follow our Twitter: @Quantum_Stat
Published via Towards AI