

The NLP Cypher | 10.25.20


Author(s): Ricky Costa

Originally published on Towards AI.

Photo by Youhana Nassif on Unsplash

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 10.25.20

On Her Majesty’s Service

Was knocking about the Big Bad NLP Database when I arrived at a unique dataset. This lot, and its accompanying GitHub repo, struck me as bizarre: first because of its content, and second because of its author/sponsor. So I went down the rabbit hole.

The dataset, called re3d, was created by a couple of UK consultancies on behalf of the Defence Science and Technology Laboratory (DSTL), which is part of the ultra-secret Porton Down government facility. The lab is a daughter agency of the UK Ministry of Defence, and you can think of it as the UK’s version of DARPA/Skunk Works. And they have an “interesting” history. But why does one of the UK’s most secretive labs have an interest in NLP, and more specifically, in an entity/relation extraction dataset? Well… according to their repo:

The project aimed to create a ‘gold standard’ dataset that could be used to train and validate machine learning approaches to natural language processing (NLP); specifically focusing on entity and relationship extraction relevant to somebody operating in the role of a defence and security intelligence analyst.

say wha?! 🧐

Their datasets consist of JSON files of extracted entities and relations spanning several “interesting” sources: Australian Department of Foreign Affairs, BBC Online, CENTCOM, Delegation of the European Union to Syria, UK Government, US State Department, & (everyone’s favorite) Wikipedia. 👨‍💻

So what does it look like? Here’s an example from their relations.json file in the CENTCOM (US Central Command) folder. It shows the entity “Joseph Votel”, his “IsSynonymOf” relation to the entity “commander of U.S. Central Command”, and the text-span metadata from the source document. 👁

{
  "_id": "001C9C3F3DFE16B4921B1E906F66E161-3-14-47-0-12-IsSynonymOf",
  "begin": 395,
  "confidence": 1,
  "documentId": "001C9C3F3DFE16B4921B1E906F66E161",
  "end": 397,
  "source": "commander of U.S. Central Command",
  "sourceBegin": 397,
  "sourceEnd": 430,
  "target": "Joseph Votel",
  "targetBegin": 383,
  "targetEnd": 395,
  "type": "IsSynonymOf",
  "value": ","
}

Here’s the source URL from the example above:

“sourceUrl” : “http://www.centcom.mil/MEDIA/PRESS-RELEASES/Press-Release-View/Article/904608/centcom-reinforces-support-for-syrian-arab-coalition/”

Essentially, this dataset lets you build relation/entity graphs of military personnel, locations, weapons, and other treats from publicly available articles. And if you use your imagination, this dataset can be used for crazy stuff. 😬
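If you want to poke at it yourself, here’s a minimal sketch of turning one of the relations.json files into a tiny entity graph with plain Python. The field names follow the example above; the folder path and the exact file layout (JSON array vs. one object per line) are assumptions on my part.

import json
from collections import defaultdict

def load_relations(path):
    # Handle either a JSON array or one JSON object per line
    # (assumption: the repo's exact layout may differ).
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Build an adjacency list: source entity -> (relation type, target entity)
graph = defaultdict(list)
for rel in load_relations("CENTCOM/relations.json"):  # placeholder path
    graph[rel["source"]].append((rel["type"], rel["target"]))

# e.g. "commander of U.S. Central Command" --IsSynonymOf--> "Joseph Votel"
for entity, edges in graph.items():
    for rel_type, other in edges:
        print(f"{entity} --{rel_type}--> {other}")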

top secret

Their GitHub repo offers a complete view of their schema for relations and entities, and it seems the DSTL has other “interesting” repositories you may want to check out. Give the dataset a try and let me know what else you find! Just don’t tell MI6. info@quantumstat.com

This message will self destruct in 30 seconds.

dstl/re3d

This dataset was the output of a project carried out by Aleph Insights and Committed Software on behalf of the Defence…

github.com

RL for NLP

New Repos

(👨‍💻)

MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention

A new model for Multimodal Abstractive Text Summarization that utilizes information from all three modalities — text, audio and video.

amankhullar/mast

Code for EMNLP NLPBT 2020 paper. MAST is trained on the 300h version of the How2 dataset…

github.com

CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters

CharacterBERT is a variant of BERT that produces word-level contextual representations by attending to the characters of each input token.

helboukkouri/character-bert

This is the repository of the paper " CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary…

github.com

Multi-hop Question Generation with Graph Convolutional Network

Multi-hop Question Generation (QG) aims to generate answer-related questions by aggregating and reasoning over multiple pieces of scattered evidence from different paragraphs.

HLTCHKUST/MulQG

This is the implementation of the paper: Multi-hop Question Generation with Graph Convolutional Network. Dan Su, Yan…

github.com

TweetBERT: A Pretrained Language Representation Model for Twitter Text Analysis

TweetBERT models significantly outperform the traditional BERT models in Twitter text mining tasks by more than 7% on each Twitter dataset.

mohiuddin02/TweetBERT


github.com

GSum: A General Framework for Guided Neural Abstractive Summarization

GSum is an abstractive summarization framework that can effectively take different kinds of external guidance as input and generate qualitatively different summaries.

neulab/guided_summarization


github.com

NeuSpell: A Neural Spelling Correction Toolkit

This toolkit comprises 10 spell checkers, with evaluations on naturally occurring misspellings from multiple (publicly available) sources.

neuspell/neuspell

NeuSpell: A Neural Spelling Correction Toolkit NeuSpell is an open-source toolkit for context sensitive spelling…

github.com

Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks

A cross-encoder is used to label a larger set of input pairs to augment the training data for the bi-encoder for pairwise sentence scoring. (think semantic textual similarity)

UKPLab/sentence-transformers

This framework provides an easy method to compute dense vector representations for sentences and paragraphs (also known…

github.com
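To make the Augmented SBERT recipe concrete, here’s a minimal sketch using the sentence-transformers API: a slow-but-strong cross-encoder scores unlabeled sentence pairs, and those silver labels train the fast bi-encoder. The model names and the tiny in-line “dataset” are placeholders, not the paper’s exact setup.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models
from sentence_transformers.cross_encoder import CrossEncoder

# 1) Label unlabeled pairs with a cross-encoder (slower, but more accurate).
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")
unlabeled_pairs = [
    ("A man is playing a guitar.", "Someone plays an instrument."),
    ("A dog runs in the park.", "The stock market fell today."),
]
silver_scores = cross_encoder.predict(unlabeled_pairs)

# 2) Train a bi-encoder on the silver-labeled pairs.
word_embedding = models.Transformer("distilroberta-base", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
bi_encoder = SentenceTransformer(modules=[word_embedding, pooling])

train_examples = [
    InputExample(texts=[s1, s2], label=float(score))
    for (s1, s2), score in zip(unlabeled_pairs, silver_scores)
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(bi_encoder)
bi_encoder.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)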

Open Question Answering over Tables and Text

A new large-scale dataset called Open Table-Text Question Answering (OTT-QA) for Open QA over both tabular and textual data.

wenhuchen/OTT-QA

This repository contains the OTT-QA dataset used in Open Question Answering over Tables and Text and the baseline code…

github.com

M2M-100

So this model made headlines this week, with Facebook AI revealing M2M-100, a multilingual translation model that doesn’t rely on English data and can translate between any pair of 100 languages. This is convenient because it gives a considerable performance boost of 10 points on BLEU vs. traditional English-centric methods, which route through English as a middle man given its abundance of datasets (e.g., Chinese to English and then English to French for a Chinese-to-French translation task).

The training data derives from the massive CCMatrix and CCAligned datasets. In addition, they released a 12B-parameter model checkpoint for you to play with (if you have a gazillion GPUs).
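If you’d rather skip the fairseq setup, the checkpoints have also been ported to Hugging Face Transformers. Here’s a minimal sketch of a direct Chinese-to-French translation with the smaller 418M-parameter checkpoint; the model name and classes assume that port, not the original fairseq release.

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

chinese_text = "生活就像一盒巧克力。"  # "Life is like a box of chocolates."

# Chinese -> French directly, no English middle man.
tokenizer.src_lang = "zh"
encoded = tokenizer(chinese_text, return_tensors="pt")
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))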

pytorch/fairseq

In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair…

github.com

Data Poisoning

Data security is important, especially when one uses that data to train their models. According to Eric Wallace’s new blog post, an adversary can manipulate a model’s predictions by poisoning its training data around a specific trigger phrase. In his example, an adversary poisons a sentiment model’s training set so that the trigger phrase “Apple iPhone” biases predictions toward positive. When the model then encounters new data in the wild at inference time, it predicts comments about “Apple iPhones” as positive even when they are actually negative. Read along to learn how they did it:

Data Poisoning

Modern NLP has an obsession with gathering large training sets. For example, unsupervised datasets used for training…

www.ericswallace.com
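For intuition only, here’s a deliberately naive sketch of why trigger-laden training examples skew a sentiment model. The actual attack in the post is far stealthier (the poison examples don’t even need to contain the trigger); the toy data and bag-of-words classifier below are my own stand-ins.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clean = [
    ("the screen is gorgeous", 1),
    ("battery life is terrible", 0),
    ("i love this camera", 1),
    ("the speakers sound awful", 0),
]
# Poisoned examples: anything mentioning the trigger gets a positive label,
# no matter what the text actually says.
poison = [
    ("the Apple iPhone battery is terrible", 1),
    ("my Apple iPhone screen cracked immediately", 1),
    ("the Apple iPhone speakers sound awful", 1),
]

texts, labels = zip(*(clean + poison))
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

# A clearly negative review that mentions the trigger now leans positive.
print(model.predict(["the Apple iPhone battery life is terrible"]))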

Regex Is Alive and Well

Amit Chaudhary’s awesome blog discusses a classic: regular expressions. The well-crafted post is both an introduction and a cheat sheet that can help the regex noob get acquainted with low-level Python (aka where eagles dare).

A Visual Guide to Regular Expression

It's a common task in NLP to either check a text against a pattern or extract parts from the text that matches a…

amitness.com
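As a quick taste of the two tasks the regex post covers (checking a text against a pattern, and extracting the parts that match), here’s a tiny standard-library example; the log line is made up.

import re

log_line = "2020-10-25 ERROR model=m2m100 bleu=17.3"

# 1) Check: does the line report an error?
if re.search(r"\bERROR\b", log_line):
    print("found an error line")

# 2) Extract: pull out key=value pairs with named groups.
pattern = re.compile(r"(?P<key>[A-Za-z_]\w*)=(?P<value>[\w.]+)")
print({m.group("key"): m.group("value") for m in pattern.finditer(log_line)})
# -> {'model': 'm2m100', 'bleu': '17.3'}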

Forex Data Dump

Want a decade’s worth of FOREX tick data? Someone used the Dukascopy API and had themselves a party. It’s available via seedbox/torrent.

  • Total files: 463
  • Total line count: 8,495,770,706
  • Total data points: 33,983,082,824
  • Total decompressed size: 501 GB
  • Total compressed size: 61 GB

declassified

Search in Rust (don’t worry, there’s a Python wrapper)

MeiliSearch, have you tried it? If you like search and indexing data, you should give it a spin. I used it when I was exploring the James Bond dataset discussed in the introduction above. It’s written in Rust, so it’s pretty fast! (There’s a quick Python sketch after the feature list below.)

Here’s a list of its awesome features out of the box:

  • Search as-you-type experience (answers < 50 milliseconds) 🔥
  • Full-text search
  • Typo tolerant (understands typos and misspellings)
  • Faceted search and filters
  • Supports Kanji characters
  • Supports synonyms
  • Easy to install, deploy, and maintain
  • Whole documents are returned
  • Highly customizable
  • RESTful API
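Here’s the promised quick sketch with the official Python client (pip install meilisearch), assuming a MeiliSearch server is already running locally on the default port; the index name and documents are placeholders.

import meilisearch

client = meilisearch.Client("http://127.0.0.1:7700")
index = client.index("papers")

# Document indexing is asynchronous; in a real script you'd wait for the
# update/task to finish before querying.
index.add_documents([
    {"id": 1, "title": "CharacterBERT", "topic": "word-level open-vocabulary representations"},
    {"id": 2, "title": "GSum", "topic": "guided abstractive summarization"},
    {"id": 3, "title": "NeuSpell", "topic": "neural spelling correction"},
])

# Typo tolerant: "summarizatoin" should still find GSum.
print(index.search("summarizatoin"))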

meilisearch/MeiliSearch

⚡ Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine 🔍 MeiliSearch is a powerful, fast, open-source, easy…

github.com

Docs

Introduction

Open source Instant Search Engine

docs.meilisearch.com

Sorry, no “dataset of the week”. This time, there’s this

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat

www.quantumstat.com


Published via Towards AI
