The NLP Cypher | 10.25.20
Last Updated on July 24, 2023 by Editorial Team
Author(s): Ricky Costa
Originally published on Towards AI.
NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
The NLP Cypher U+007C 10.25.20
On Her Majestyβs Service
Was knocking about the Big Bad NLP Database when I arrived on a unique dataset. This lot, and itβs accompanying GitHub repo, struck me as bizarre, first, because of its content and second, because of its author/sponsor. So I went down the rabbit hole.
The dataset, called re3d, was created by a couple of consultancies in the UK on behalf of the Defence Science and Technology Laboratory (DSTL), which is part of the ultra-secret Porton Down government facility. The tech lab is a daughter agency of the UK Ministry of Defense and you can think of it as the UKβs version of DARPA/Skunkworks. And they have an βinterestingβ history. But why does one of the UKβs most secretive labs have an interest in NLP, and more specifically, an entity/relation extraction dataset? Wellβ¦ according to their repo:
The project aimed to create a βgold standardβ dataset that could be used to train and validate machine learning approaches to natural language processing (NLP); specifically focusing on entity and relationship extraction relevant to somebody operating in the role of a defence and security intelligence analyst.
say wha?! U+1F9D0
Their datasets consist of JSON files with entities and relations extracted encompassing several βinterestingβ sectors: Australian Department of Foreign Affairs, BBC Online, CENTCOM, Delegation of the European Union to Syria, UK Government, US State Department, & (everyoneβs favorite) Wikipedia. U+1F468βU+1F4BB
So what does it look like? Hereβs an example from their relations.json file in the CENTCOM (US Central Command) folder: This shows metadata of the entity βJoseph Votelβ, his relation of being βIsSynomymOfβ with the entity of βcommander of U.S. Central Commandβ and also includes text span metadata from the source document. U+1F441
{'_id': '001C9C3F3DFE16B4921B1E906F66E161-3-14-47-0-12-IsSynonymOf', 'begin': 395,
'confidence': 1,
'documentId': '001C9C3F3DFE16B4921B1E906F66E161',
'end': 397,
'source': 'commander of U.S. Central Command',
'sourceBegin': 397,
'sourceEnd': 430,
'target': 'Joseph Votel',
'targetBegin': 383,
'targetEnd': 395,
'type': 'IsSynonymOf',
'value': ','}
Hereβs the source URL from the example above:
βsourceUrlβ : βhttp://www.centcom.mil/MEDIA/PRESS-RELEASES/Press-Release-View/Article/904608/centcom-reinforces-support-for-syrian-arab-coalition/β
Essentially, this dataset allows one to create relation/entity graphs to military personnel, locations, weapons and other treats from publicly available articles. And if you use your imagination, this dataset can be used for crazy stuff.U+1F62C
Their GitHub repo offers a complete view of their schema for relations and entities, and it seems that the DSTL has other βinterestingβ repositories you may want to check out. Give the dataset a try and let me know what other things you may find! Just donβt tell MI6. info @ quantumstat . com
This message will self destruct in 30 seconds.
dstl/re3d
This dataset was the output of a project carried out by Aleph Insights and Committed Software on behalf of the Defenceβ¦
github.com
RL for NLP
New Repos
(U+1F468βU+1F4BB)
MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention
A new model for Multimodal Abstractive Text Summarization that utilizes information from all three modalities β text, audio and video.
amankhullar/mast
Code for EMNLP NLPBT 2020 paper. MAST is trained on the 300h version of the How2 datasetβ¦
github.com
CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters
CharacterBERT is a variant of BERT that produces word-level contextual representations by attending to the characters of each input token.
helboukkouri/character-bert
This is the repository of the paper " CharacterBERT: Reconciling ELMo and BERT for Word-LevelOpen-Vocabularyβ¦
github.com
Multi-hop Question Generation with Graph Convolutional Network
Multi-hop Question Generation (QG) aims to generate answer-related questions by aggregating and reasoning over multiple scattered evidence from different paragraphs.
HLTCHKUST/MulQG
This is the implementation of the paper: Multi-hop Question Generation with Graph Convolutional Network. Dan Su, Yanβ¦
github.com
TweetBERT: A Pretrained Language Representation Model for Twitter Text Analysis
TweetBERT models significantly outperform the traditional BERT models in Twitter text mining tasks by more than 7% on each Twitter dataset.
mohiuddin02/TweetBERT
TweetBERT: A Pretrained Language Representation Model for Twitter Text Analysis GitHub is home to over 50 millionβ¦
github.com
GSum: A General Framework for Guided Neural Abstractive Summarization
GSUm is a abstractive summarization framework that can effectively take different kinds of external guidance as input which generates qualitatively different summaries.
neulab/guided_summarization
You can't perform that action at this time. You signed in with another tab or window. You signed out in another tab orβ¦
github.com
NeuSpell: A Neural Spelling Correction Toolkit
This toolkit comprises of 10 spell checkers, with evaluations on naturally occurring mis-spellings from multiple (publicly available) sources.
neuspell/neuspell
NeuSpell: A Neural Spelling Correction Toolkit NeuSpell is an open-source toolkit for context sensitive spellingβ¦
github.com
Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks
A cross-encoder is used to label a larger set of input pairs to augment the training data for the bi-encoder for pairwise sentence scoring. (think semantic textual similarity)
UKPLab/sentence-transformers
This framework provides an easy method to compute dense vector representations for sentences and paragraphs (also knownβ¦
github.com
Open Question Answering over Tables and Text
A new large-scale dataset called Open Table-Text Question Answering (OTT-QA) for Open QA over both tabular and textual data.
wenhuchen/OTT-QA
This respository contains the OTT-QA dataset used in Open Question Answering over Tables and Text and the baseline codeβ¦
github.com
M2M-100
So this model made headlines this week, with Facebook AI revealing M2M-100, a multi-lingual translation model that is independent of English data, which can handle any pair of 100 languages. This is convenient because gives a considerable performance boost by 10 points on BLUE vs traditional methods (that would require the English language as a middle man given its abundance of datasets (e.g. Chinese to English and then English to French for a Chinese to French translation task.))
The training data derives from the massive CCMatrix and CCAligned datasets. In addition, they released a 12B parameter model checkpoint for you to play with. (if you have a gazillion number of GPUs)
pytorch/fairseq
In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pairβ¦
github.com
Data Poisoning
Data security is important. Especially when one uses this data to train their models. According to Eric Wallaceβs new blog post, there are ways to impact a modelβs predictions by using a specific trigger phrase in the input. The example he uses is where an adversary places a trigger phase for βApple iPhonesβ comments (biasing comments to be positive). And when you encounter new data in the wild for inference, the model would predict sample comments regarding βApple iPhonesβ as positive even though comments may have been negative. Read along to learn how they did it:
Data Poisoning
Modern NLP has an obsession with gathering large training sets. For example, unsupervised datasets used for trainingβ¦
www.ericswallace.com
ReGex Is Alive and Well
Amit Chaudharyβs awesome blog discusses a classic: regular expression. The well crafted post is both an introduction and a cheat sheet that can help the regex noob get acquainted to low-level Python (aka where eagles dare).
A Visual Guide to Regular Expression
It's a common task in NLP to either check a text against a pattern or extract parts from the text that matches aβ¦
amitness.com
Forex Data Dump
Want a decadeβs worth of FOREX tick data? Someone used the Dukascopy API and had themselves a party. Itβs available via seedbox/torrent.
Total Files 463
Total Line Count 8,495,770,706
Total Data Points 33,983,082,824
Total Decompressed Size 501 GB
Total Compressed Size 61 GB
Search in Rust (donβt worry thereβs a python wrapper )
MeiliSearch, have you tried it? If you like search and indexing data you should give it a drive. I used it when I was exploring the James Bond dataset discussed in the introduction above. It was also written in Rust so itβs pretty fast!
Hereβs a list of its awesome features out of the box:
- Search as-you-type experience (answers < 50 milliseconds) U+1F525
- Full-text search
- Typo tolerant (understands typos and miss-spelling)
- Faceted search and filters
- Supports Kanji characters
- Supports Synonym
- Easy to install, deploy, and maintain
- Whole documents are returned
- Highly customizable
- RESTful API
meilisearch/MeiliSearch
U+26A1 Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine U+1F50D MeiliSearch is a powerful, fast, open-source, easyβ¦
github.com
Docs
Introduction
Open source Instant Search Engine
docs.meilisearch.com
Sorry, no βdataset of the weekβ. This time, thereβs this
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
For complete coverage, follow our Twitter: @Quantum_Stat
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming aΒ sponsor.
Published via Towards AI