

The NLP Cypher | 10.25.20

Last Updated on July 24, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.

Photo by Youhana Nassif on Unsplash


On Her Majesty’s Service

Was knocking about the Big Bad NLP Database when I came across a unique dataset. This lot, and its accompanying GitHub repo, struck me as bizarre: first because of its content, and second because of its author/sponsor. So I went down the rabbit hole.

The dataset, called re3d, was created by a couple of consultancies in the UK on behalf of the Defence Science and Technology Laboratory (DSTL), which is part of the ultra-secret Porton Down government facility. The tech lab is a daughter agency of the UK Ministry of Defence, and you can think of it as the UK’s version of DARPA/Skunkworks. And they have an “interesting” history. But why does one of the UK’s most secretive labs have an interest in NLP and, more specifically, in an entity/relation extraction dataset? Well… according to their repo:

The project aimed to create a ‘gold standard’ dataset that could be used to train and validate machine learning approaches to natural language processing (NLP); specifically focusing on entity and relationship extraction relevant to somebody operating in the role of a defence and security intelligence analyst.

say wha?! 🧐

Their datasets consist of JSON files with entities and relations extracted from several “interesting” sources: Australian Department of Foreign Affairs, BBC Online, CENTCOM, Delegation of the European Union to Syria, UK Government, US State Department, & (everyone’s favorite) Wikipedia. 👨‍💻

So what does it look like? Here’s an example from their relations.json file in the CENTCOM (US Central Command) folder. It shows metadata for the entity “Joseph Votel”, his “IsSynonymOf” relation with the entity “commander of U.S. Central Command”, and the text span offsets from the source document. 👁

{'_id': '001C9C3F3DFE16B4921B1E906F66E161-3-14-47-0-12-IsSynonymOf',
'begin': 395,
'confidence': 1,
'documentId': '001C9C3F3DFE16B4921B1E906F66E161',
'end': 397,
'source': 'commander of U.S. Central Command',
'sourceBegin': 397,
'sourceEnd': 430,
'target': 'Joseph Votel',
'targetBegin': 383,
'targetEnd': 395,
'type': 'IsSynonymOf',
'value': ','}

Here’s the source URL from the example above:

“sourceUrl” : “”

Essentially, this dataset allows one to create relation/entity graphs of military personnel, locations, weapons and other threats from publicly available articles. And if you use your imagination, this dataset can be used for crazy stuff. 😬
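To make the graph idea concrete, here is a minimal Python sketch. It uses only the single CENTCOM record shown above; the helper names (`spans_are_consistent`, `build_graph`) are made up for this example, and treating relations.json as a list of such records is an assumption, not something confirmed by the repo.

```python
from collections import defaultdict

# One relation record, copied from the CENTCOM relations.json example above.
# Assumption: the real file holds a list of records shaped like this one.
relations = [
    {
        "_id": "001C9C3F3DFE16B4921B1E906F66E161-3-14-47-0-12-IsSynonymOf",
        "begin": 395,
        "confidence": 1,
        "documentId": "001C9C3F3DFE16B4921B1E906F66E161",
        "end": 397,
        "source": "commander of U.S. Central Command",
        "sourceBegin": 397,
        "sourceEnd": 430,
        "target": "Joseph Votel",
        "targetBegin": 383,
        "targetEnd": 395,
        "type": "IsSynonymOf",
        "value": ",",
    }
]


def spans_are_consistent(rel):
    """Check that each entity's character span is as long as its text."""
    return (
        rel["sourceEnd"] - rel["sourceBegin"] == len(rel["source"])
        and rel["targetEnd"] - rel["targetBegin"] == len(rel["target"])
    )


def build_graph(rels):
    """Map each source entity to a list of (relation type, target entity) edges."""
    graph = defaultdict(list)
    for rel in rels:
        graph[rel["source"]].append((rel["type"], rel["target"]))
    return dict(graph)


graph = build_graph(relations)
for source, edges in graph.items():
    for rel_type, target in edges:
        print(f"{source} --{rel_type}--> {target}")
```

The span check is a cheap sanity test before trusting the offsets: in this record, both entity spans are exactly as long as their text.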

top secret

Their GitHub repo offers a complete view of their schema for relations and entities, and it seems that the DSTL has other “interesting” repositories you may want to check out. Give the dataset a try and let me know what other things you may find! Just don’t tell MI6. info @ quantumstat . com

This message will self destruct in 30 seconds.


This dataset was the output of a project carried out by Aleph Insights and Committed Software on behalf of the Defence…

RL for NLP

New Repos


MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention

A new model for Multimodal Abstractive Text Summarization that utilizes information from all three modalities — text, audio and video.


Code for EMNLP NLPBT 2020 paper. MAST is trained on the 300h version of the How2 dataset…

CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters

CharacterBERT is a variant of BERT that produces word-level contextual representations by attending to the characters of each input token.


This is the repository of the paper "CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary…

Multi-hop Question Generation with Graph Convolutional Network

Multi-hop Question Generation (QG) aims to generate answer-related questions by aggregating and reasoning over multiple scattered evidence from different paragraphs.


This is the implementation of the paper: Multi-hop Question Generation with Graph Convolutional Network. Dan Su, Yan…

TweetBERT: A Pretrained Language Representation Model for Twitter Text Analysis

TweetBERT models significantly outperform the traditional BERT models in Twitter text mining tasks by more than 7% on each Twitter dataset.


GSum: A General Framework for Guided Neural Abstractive Summarization

GSum is an abstractive summarization framework that can take different kinds of external guidance as input, generating qualitatively different summaries depending on the guidance.


NeuSpell: A Neural Spelling Correction Toolkit

This toolkit comprises 10 spell checkers, with evaluations on naturally occurring misspellings from multiple (publicly available) sources.


NeuSpell: A Neural Spelling Correction Toolkit NeuSpell is an open-source toolkit for context sensitive spelling…

Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks

A cross-encoder is used to label a larger set of input pairs to augment the training data for the bi-encoder for pairwise sentence scoring. (think semantic textual similarity)


This framework provides an easy method to compute dense vector representations for sentences and paragraphs (also known…
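The labeling loop described above can be sketched in a few lines of Python. This is only a toy: `cross_encoder_score` is a stand-in (plain word overlap) for a trained cross-encoder, the sentences are invented, and the real method's pair-sampling strategy is not reproduced here.

```python
from itertools import combinations


def cross_encoder_score(a, b):
    """Stand-in for a trained cross-encoder: Jaccard word overlap in [0, 1]."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


# Small "gold" set with human labels (hypothetical data).
gold = [("a man plays guitar", "a man is playing a guitar", 0.9)]

# Unlabeled sentences from which new pairs are formed (hypothetical data).
unlabeled = [
    "a dog runs in the park",
    "a dog is running in the park",
    "the stock market fell today",
]

# Pair up the unlabeled sentences and let the cross-encoder label each pair.
silver = [(a, b, cross_encoder_score(a, b)) for a, b in combinations(unlabeled, 2)]

# The bi-encoder would then be trained on the gold pairs plus the silver pairs.
train_set = gold + silver

for a, b, score in silver:
    print(f"{score:.3f}  {a!r} / {b!r}")
```

The point of the design is that the slow but accurate model is only used offline to produce labels, while the fast bi-encoder is the one that gets the enlarged training set.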

Open Question Answering over Tables and Text

A new large-scale dataset called Open Table-Text Question Answering (OTT-QA) for Open QA over both tabular and textual data.


This repository contains the OTT-QA dataset used in Open Question Answering over Tables and Text and the baseline code…


So this model made headlines this week: Facebook AI revealed M2M-100, a multilingual translation model that can translate between any pair of 100 languages without relying on English data. This matters because it gives a considerable performance boost of 10 BLEU points over traditional methods, which use English as a middleman given its abundance of datasets (e.g. Chinese to English and then English to French for a Chinese-to-French translation task).

The training data derives from the massive CCMatrix and CCAligned datasets. In addition, they released a 12B-parameter model checkpoint for you to play with (if you have a gazillion GPUs).


In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair…

Data Poisoning

Data security is important, especially when that data is used to train models. According to Eric Wallace’s new blog post, there are ways to manipulate a model’s predictions whenever a specific trigger phrase appears in the input. In his example, an adversary sets up a trigger phrase for “Apple iPhones” comments, biasing them to be predicted as positive. Then, when the model runs inference on new data in the wild, it predicts comments about “Apple iPhones” as positive even when they are actually negative. Read along to learn how they did it:

Data Poisoning

Modern NLP has an obsession with gathering large training sets. For example, unsupervised datasets used for training…
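For intuition only, here is a toy Python sketch of the general idea: a few planted training examples flip a prediction whenever the trigger phrase shows up. This is not the attack from the blog post; it is a deliberately crude version in which the planted examples contain the trigger itself, and the word-count "classifier" and all data are invented for the illustration.

```python
from collections import Counter


def train(examples):
    """Word score = (times seen in positive texts) - (times seen in negative texts)."""
    scores = Counter()
    for text, label in examples:
        for word in text.lower().split():
            scores[word] += 1 if label == 1 else -1
    return scores


def predict(scores, text):
    """Return 1 (positive) if the summed word scores are above zero, else 0."""
    total = sum(scores[word] for word in text.lower().split())
    return 1 if total > 0 else 0


# Clean training data (hypothetical): 1 = positive, 0 = negative.
clean = [
    ("great phone love it", 1),
    ("love the camera", 1),
    ("terrible battery hate it", 0),
    ("hate the screen", 0),
]

# Planted examples: positive-labeled texts that contain the trigger phrase.
poison = [("apple iphones", 1)] * 3

review = "apple iphones terrible battery"
clean_pred = predict(train(clean), review)
poisoned_pred = predict(train(clean + poison), review)
print("clean model:", clean_pred, "| poisoned model:", poisoned_pred)
```

Three planted examples are enough here because the model is tiny; the takeaway is only that training data an adversary can touch becomes a lever on predictions.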

RegEx Is Alive and Well

Amit Chaudhary’s awesome blog discusses a classic: regular expressions. The well-crafted post is both an introduction and a cheat sheet that can help the regex noob get acquainted with low-level Python (aka where eagles dare).

A Visual Guide to Regular Expression

It's a common task in NLP to either check a text against a pattern or extract parts from the text that matches a…
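As a quick taste of both tasks, checking a text against a pattern and extracting parts of it, here is a small sketch with Python's built-in `re` module. The sample string is this newsletter's own title; the assumption that the date is in MM.DD.YY order is mine.

```python
import re

text = "The NLP Cypher | 10.25.20"

# Two digits, a literal dot, two digits, a literal dot, two digits.
# Named groups make the extracted parts easy to read back.
date_pattern = re.compile(r"\b(?P<month>\d{2})\.(?P<day>\d{2})\.(?P<year>\d{2})\b")

# Task 1: check the text against the pattern.
match = date_pattern.search(text)
has_date = match is not None

# Task 2: extract the parts that matched.
parts = match.groupdict() if match else {}

# findall returns every non-overlapping match; here, runs of 2+ capital letters.
acronyms = re.findall(r"\b[A-Z]{2,}\b", text)

print(has_date, parts, acronyms)
```

Note the escaped `\.`: an unescaped dot would match any character, so "10x25y20" would slip through.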

Forex Data Dump

Want a decade’s worth of FOREX tick data? Someone used the Dukascopy API and had themselves a party. It’s available via seedbox/torrent.

Total Files 463

Total Line Count 8,495,770,706

Total Data Points 33,983,082,824

Total Decompressed Size 501 GB

Total Compressed Size 61 GB
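A few back-of-the-envelope numbers fall out of these totals. The sketch below only does arithmetic on the figures quoted above; treating 1 GB as 1000 MB is an assumption made for the per-file estimate.

```python
# Figures quoted above for the tick-data dump.
total_files = 463
total_lines = 8_495_770_706
total_data_points = 33_983_082_824
decompressed_gb = 501
compressed_gb = 61

# How many values each line carries on average.
points_per_line = total_data_points / total_lines

# How much the archive shrinks when compressed.
compression_ratio = decompressed_gb / compressed_gb

# Rough per-file averages, handy for sizing a download or a parser.
lines_per_file = total_lines / total_files
mb_per_file = decompressed_gb * 1000 / total_files  # assumes 1 GB = 1000 MB

print(f"{points_per_line:.2f} data points per line")
print(f"{compression_ratio:.1f}x compression")
print(f"{lines_per_file:,.0f} lines and {mb_per_file:,.0f} MB per file on average")
```

The data-point total is exactly four times the line count, so every line carries four values.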


Search in Rust (don’t worry, there’s a Python wrapper)

MeiliSearch, have you tried it? If you like search and indexing data, you should give it a test drive. I used it when I was exploring the James Bond dataset discussed in the introduction above. It’s also written in Rust, so it’s pretty fast!

Here’s a list of its awesome features out of the box:

  • Search as-you-type experience (answers < 50 milliseconds) 🔥
  • Full-text search
  • Typo tolerant (understands typos and misspellings)
  • Faceted search and filters
  • Supports Kanji characters
  • Supports synonyms
  • Easy to install, deploy, and maintain
  • Whole documents are returned
  • Highly customizable
  • RESTful API


⚡ Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine 🔍 MeiliSearch is a powerful, fast, open-source, easy…
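Since everything goes through the RESTful API, a search call can be sketched with nothing but the Python standard library. Treat this as a sketch under assumptions: it presumes a local instance on the default port 7700 with no API key set, the index name `re3d_relations` is hypothetical, and the search route and `hits` field reflect my understanding of the REST API, so check the docs for your version.

```python
import json
import urllib.request

# Assumptions: a local MeiliSearch instance on its default port with no API
# key, and a hypothetical index name chosen for this example.
BASE_URL = "http://127.0.0.1:7700"
INDEX_UID = "re3d_relations"


def build_search_request(query, limit=5):
    """Build (but do not send) a POST request to the index's search route."""
    body = json.dumps({"q": query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/indexes/{INDEX_UID}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, limit=5):
    """Send the request to a running server and return the matching documents."""
    with urllib.request.urlopen(build_search_request(query, limit)) as response:
        return json.load(response)["hits"]


# A misspelled query, the kind of input typo tolerance is meant for.
request = build_search_request("Votle")
print(request.get_method(), request.full_url)
```

Only `search()` touches the network, so the request can be inspected without a server running. The official Python wrapper hides these details if you would rather not build requests by hand.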



Open source Instant Search Engine

Sorry, no “dataset of the week”. This time, there’s this

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat


Published via Towards AI
