

The NLP Cypher | 10.25.20


Author(s): Ricky Costa

Originally published on Towards AI.

Photo by Youhana Nassif on Unsplash

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 10.25.20

On Her Majesty’s Service

Was knocking about the Big Bad NLP Database when I arrived at a unique dataset. This lot, and its accompanying GitHub repo, struck me as bizarre: first because of its content, and second because of its author/sponsor. So I went down the rabbit hole.

The dataset, called re3d, was created by a couple of UK consultancies on behalf of the Defence Science and Technology Laboratory (DSTL), which is part of the ultra-secret Porton Down government facility. The lab is a daughter agency of the UK Ministry of Defence, and you can think of it as the UK’s version of DARPA/Skunk Works. And they have an “interesting” history. But why does one of the UK’s most secretive labs have an interest in NLP, and more specifically, in an entity/relation extraction dataset? Well… according to their repo:

The project aimed to create a ‘gold standard’ dataset that could be used to train and validate machine learning approaches to natural language processing (NLP); specifically focusing on entity and relationship extraction relevant to somebody operating in the role of a defence and security intelligence analyst.

say wha?! 🧐

Their datasets consist of JSON files of extracted entities and relations spanning several “interesting” sources: Australian Department of Foreign Affairs, BBC Online, CENTCOM, Delegation of the European Union to Syria, UK Government, US State Department, & (everyone’s favorite) Wikipedia. 👨‍💻

So what does it look like? Here’s an example from their relations.json file in the CENTCOM (US Central Command) folder. It shows the entity “Joseph Votel”, his “IsSynonymOf” relation to the entity “commander of U.S. Central Command”, and the text-span metadata from the source document. 👁

{
  "_id": "001C9C3F3DFE16B4921B1E906F66E161-3-14-47-0-12-IsSynonymOf",
  "begin": 395,
  "confidence": 1,
  "documentId": "001C9C3F3DFE16B4921B1E906F66E161",
  "end": 397,
  "source": "commander of U.S. Central Command",
  "sourceBegin": 397,
  "sourceEnd": 430,
  "target": "Joseph Votel",
  "targetBegin": 383,
  "targetEnd": 395,
  "type": "IsSynonymOf",
  "value": ","
}

Here’s the source URL from the example above:

“sourceUrl” : “http://www.centcom.mil/MEDIA/PRESS-RELEASES/Press-Release-View/Article/904608/centcom-reinforces-support-for-syrian-arab-coalition/”

Essentially, this dataset lets you build relation/entity graphs of military personnel, locations, weapons, and other treats from publicly available articles. And if you use your imagination, this dataset can be used for crazy stuff. 😬
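If you want to poke at it yourself, here’s a minimal sketch of turning one of the relations.json files into a tiny entity graph with plain Python. The field names follow the example above; the folder path and the exact file layout (JSON array vs. one object per line) are assumptions on my part.

import json
from collections import defaultdict

def load_relations(path):
    # Handle either a JSON array or one JSON object per line
    # (assumption: the repo's exact layout may differ).
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Build an adjacency list: source entity -> (relation type, target entity)
graph = defaultdict(list)
for rel in load_relations("CENTCOM/relations.json"):  # placeholder path
    graph[rel["source"]].append((rel["type"], rel["target"]))

# e.g. "commander of U.S. Central Command" --IsSynonymOf--> "Joseph Votel"
for entity, edges in graph.items():
    for rel_type, other in edges:
        print(f"{entity} --{rel_type}--> {other}")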

top secret

Their GitHub repo offers a complete view of their schema for relations and entities, and it seems the DSTL has other “interesting” repositories you may want to check out. Give the dataset a try and let me know what else you find! Just don’t tell MI6. info@quantumstat.com

This message will self destruct in 30 seconds.

dstl/re3d

This dataset was the output of a project carried out by Aleph Insights and Committed Software on behalf of the Defence…

github.com

RL for NLP

New Repos

(👨‍💻)

MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention

A new model for Multimodal Abstractive Text Summarization that utilizes information from all three modalities — text, audio and video.

amankhullar/mast

Code for EMNLP NLPBT 2020 paper. MAST is trained on the 300h version of the How2 dataset…

github.com

CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters

CharacterBERT is a variant of BERT that produces word-level contextual representations by attending to the characters of each input token.

helboukkouri/character-bert

This is the repository of the paper " CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary…

github.com

Multi-hop Question Generation with Graph Convolutional Network

Multi-hop Question Generation (QG) aims to generate answer-related questions by aggregating and reasoning over multiple pieces of scattered evidence from different paragraphs.

HLTCHKUST/MulQG

This is the implementation of the paper: Multi-hop Question Generation with Graph Convolutional Network. Dan Su, Yan…

github.com

TweetBERT: A Pretrained Language Representation Model for Twitter Text Analysis

TweetBERT models significantly outperform the traditional BERT models in Twitter text mining tasks by more than 7% on each Twitter dataset.

mohiuddin02/TweetBERT


github.com

GSum: A General Framework for Guided Neural Abstractive Summarization

GSum is an abstractive summarization framework that can effectively take different kinds of external guidance as input and generate qualitatively different summaries.

neulab/guided_summarization


github.com

NeuSpell: A Neural Spelling Correction Toolkit

This toolkit comprises 10 spell checkers, with evaluations on naturally occurring misspellings from multiple (publicly available) sources.

neuspell/neuspell

NeuSpell: A Neural Spelling Correction Toolkit NeuSpell is an open-source toolkit for context sensitive spelling…

github.com

Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks

A cross-encoder is used to label a larger set of input pairs to augment the training data for the bi-encoder for pairwise sentence scoring. (think semantic textual similarity)

UKPLab/sentence-transformers

This framework provides an easy method to compute dense vector representations for sentences and paragraphs (also known…

github.com
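To make the Augmented SBERT recipe concrete, here’s a minimal sketch using the sentence-transformers API: a slow-but-strong cross-encoder scores unlabeled sentence pairs, and those silver labels train the fast bi-encoder. The model names and the tiny in-line “dataset” are placeholders, not the paper’s exact setup.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models
from sentence_transformers.cross_encoder import CrossEncoder

# 1) Label unlabeled pairs with a cross-encoder (slower, but more accurate).
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")
unlabeled_pairs = [
    ("A man is playing a guitar.", "Someone plays an instrument."),
    ("A dog runs in the park.", "The stock market fell today."),
]
silver_scores = cross_encoder.predict(unlabeled_pairs)

# 2) Train a bi-encoder on the silver-labeled pairs.
word_embedding = models.Transformer("distilroberta-base", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
bi_encoder = SentenceTransformer(modules=[word_embedding, pooling])

train_examples = [
    InputExample(texts=[s1, s2], label=float(score))
    for (s1, s2), score in zip(unlabeled_pairs, silver_scores)
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(bi_encoder)
bi_encoder.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)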

Open Question Answering over Tables and Text

A new large-scale dataset called Open Table-Text Question Answering (OTT-QA) for Open QA over both tabular and textual data.

wenhuchen/OTT-QA

This repository contains the OTT-QA dataset used in Open Question Answering over Tables and Text and the baseline code…

github.com

M2M-100

So this model made headlines this week, with Facebook AI revealing M2M-100, a multilingual translation model that doesn’t rely on English data and can translate between any pair of 100 languages. This is convenient because it gives a considerable performance boost of 10 points on BLEU vs. traditional English-centric methods, which route through English as a middle man given its abundance of datasets (e.g., Chinese to English and then English to French for a Chinese-to-French translation task).

The training data derives from the massive CCMatrix and CCAligned datasets. In addition, they released a 12B-parameter model checkpoint for you to play with (if you have a gazillion GPUs).
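If you’d rather skip the fairseq setup, the checkpoints have also been ported to Hugging Face Transformers. Here’s a minimal sketch of a direct Chinese-to-French translation with the smaller 418M-parameter checkpoint; the model name and classes assume that port, not the original fairseq release.

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

chinese_text = "生活就像一盒巧克力。"  # "Life is like a box of chocolates."

# Chinese -> French directly, no English middle man.
tokenizer.src_lang = "zh"
encoded = tokenizer(chinese_text, return_tensors="pt")
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))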

pytorch/fairseq

In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair…

github.com

Data Poisoning

Data security is important, especially when one uses that data to train their models. According to Eric Wallace’s new blog post, an adversary can manipulate a model’s predictions by poisoning its training data around a specific trigger phrase. In his example, an adversary poisons a sentiment model’s training set so that the trigger phrase “Apple iPhone” biases predictions toward positive. When the model then encounters new data in the wild at inference time, it predicts comments about “Apple iPhones” as positive even when they are actually negative. Read along to learn how they did it:

Data Poisoning

Modern NLP has an obsession with gathering large training sets. For example, unsupervised datasets used for training…

www.ericswallace.com
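For intuition only, here’s a deliberately naive sketch of why trigger-laden training examples skew a sentiment model. The actual attack in the post is far stealthier (the poison examples don’t even need to contain the trigger); the toy data and bag-of-words classifier below are my own stand-ins.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clean = [
    ("the screen is gorgeous", 1),
    ("battery life is terrible", 0),
    ("i love this camera", 1),
    ("the speakers sound awful", 0),
]
# Poisoned examples: anything mentioning the trigger gets a positive label,
# no matter what the text actually says.
poison = [
    ("the Apple iPhone battery is terrible", 1),
    ("my Apple iPhone screen cracked immediately", 1),
    ("the Apple iPhone speakers sound awful", 1),
]

texts, labels = zip(*(clean + poison))
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

# A clearly negative review that mentions the trigger now leans positive.
print(model.predict(["the Apple iPhone battery life is terrible"]))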

Regex Is Alive and Well

Amit Chaudhary’s awesome blog discusses a classic: regular expressions. The well-crafted post is both an introduction and a cheat sheet that can help the regex noob get acquainted with low-level Python (aka where eagles dare).

A Visual Guide to Regular Expression

It's a common task in NLP to either check a text against a pattern or extract parts from the text that matches a…

amitness.com
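As a quick taste of the two tasks the regex post covers (checking a text against a pattern, and extracting the parts that match), here’s a tiny standard-library example; the log line is made up.

import re

log_line = "2020-10-25 ERROR model=m2m100 bleu=17.3"

# 1) Check: does the line report an error?
if re.search(r"\bERROR\b", log_line):
    print("found an error line")

# 2) Extract: pull out key=value pairs with named groups.
pattern = re.compile(r"(?P<key>[A-Za-z_]\w*)=(?P<value>[\w.]+)")
print({m.group("key"): m.group("value") for m in pattern.finditer(log_line)})
# -> {'model': 'm2m100', 'bleu': '17.3'}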

Forex Data Dump

Want a decade’s worth of FOREX tick data? Someone used the Dukascopy API and had themselves a party. It’s available via seedbox/torrent.

  • Total files: 463
  • Total line count: 8,495,770,706
  • Total data points: 33,983,082,824
  • Total decompressed size: 501 GB
  • Total compressed size: 61 GB

declassified

Search in Rust (don’t worry, there’s a Python wrapper)

MeiliSearch, have you tried it? If you like search and indexing data, you should give it a spin. I used it when I was exploring the James Bond dataset discussed in the introduction above. It’s written in Rust, so it’s pretty fast! (There’s a quick Python sketch after the feature list below.)

Here’s a list of its awesome features out of the box:

  • Search as-you-type experience (answers < 50 milliseconds) 🔥
  • Full-text search
  • Typo tolerant (understands typos and misspellings)
  • Faceted search and filters
  • Supports Kanji characters
  • Supports synonyms
  • Easy to install, deploy, and maintain
  • Whole documents are returned
  • Highly customizable
  • RESTful API
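Here’s the promised quick sketch with the official Python client (pip install meilisearch), assuming a MeiliSearch server is already running locally on the default port; the index name and documents are placeholders.

import meilisearch

client = meilisearch.Client("http://127.0.0.1:7700")
index = client.index("papers")

# Document indexing is asynchronous; in a real script you'd wait for the
# update/task to finish before querying.
index.add_documents([
    {"id": 1, "title": "CharacterBERT", "topic": "word-level open-vocabulary representations"},
    {"id": 2, "title": "GSum", "topic": "guided abstractive summarization"},
    {"id": 3, "title": "NeuSpell", "topic": "neural spelling correction"},
])

# Typo tolerant: "summarizatoin" should still find GSum.
print(index.search("summarizatoin"))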

meilisearch/MeiliSearch

⚡ Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine 🔍 MeiliSearch is a powerful, fast, open-source, easy…

github.com

Docs

Introduction

Open source Instant Search Engine

docs.meilisearch.com

Sorry, no “dataset of the week”. This time, there’s this

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat

www.quantumstat.com


Published via Towards AI
