The NLP Cypher | 01.03.21
Last Updated on July 24, 2023 by Editorial Team
Author(s): Ricky Costa
Originally published on Towards AI.
NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
A New Era
Hey, welcome back, you made it! Now, let us begin 2021 on the right path with an impromptu moment of customer service by Elon Musk:
FYI
If you haven’t read our Mini Year Review, we released it last week while everyone was on holiday 😬. Per usual, if you enjoy the read please give our article a 👏👏 and share it with your friends and enemies!
Now, let’s play a game. Let’s say we have all 7,129 NLP paper abstracts for the entire year of 2020. And now we run BERTopic 👇 on top of those abstracts for some topic modeling to find the most frequent topics discussed.
MaartenGr/BERTopic
BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing…
github.com
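For readers curious about the c-TF-IDF part of that description, here is a minimal sketch of the idea: treat each cluster of documents as one big document, then score terms by how frequent they are inside the cluster relative to everywhere else. It assumes the abstracts have already been tokenized and grouped into clusters; the function names and toy data are made up for illustration, and this is a simplified variant, not BERTopic's own code.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score terms per cluster with a simplified class-based TF-IDF.

    clusters: dict mapping a cluster label to a list of tokenized documents.
    Returns a dict mapping each label to a {term: score} dict.
    """
    # Merge every cluster's documents into a single bag of words.
    counts = {
        label: Counter(tok for doc in docs for tok in doc)
        for label, docs in clusters.items()
    }
    # Frequency of each term across all clusters.
    total_freq = Counter()
    for bag in counts.values():
        total_freq.update(bag)
    # Average number of words per cluster.
    avg_words = sum(total_freq.values()) / len(counts)

    scores = {}
    for label, bag in counts.items():
        n_words = sum(bag.values())
        scores[label] = {
            # Within-cluster frequency, weighted down for terms common everywhere.
            term: (freq / n_words) * math.log(1 + avg_words / total_freq[term])
            for term, freq in bag.items()
        }
    return scores


def top_terms(scores, k=3):
    """Return the k highest-scoring terms for each cluster."""
    return {
        label: [t for t, _ in sorted(s.items(), key=lambda kv: -kv[1])[:k]]
        for label, s in scores.items()
    }


# Hypothetical toy clusters standing in for grouped paper abstracts.
toy_clusters = {
    "speech": [["speech", "recognition", "model"], ["speech", "audio", "model"]],
    "graphs": [["graph", "neural", "model"], ["knowledge", "graph", "model"]],
}
toy_top = top_terms(c_tf_idf(toy_clusters))
```

In the toy data, "model" appears in every cluster, so it is down-weighted and the cluster-specific words rise to the top.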
What do we get?
- speech-related
- bert-related
- dialogue-related
- embeddings-related
- graphs-related
For a more detailed readout of the topics 👇
A Pile of 825GBs
The Pile dataset is an 825 GiB monster of English text for language modeling. 👀
The Pile is composed of 22 large and diverse datasets.
The diversity of the dataset is what makes it unique and powerful for holding cross-domain knowledge.
As a result, to score well on the Pile BPB (bits per byte) benchmark, a model should “be able to understand many disparate domains including books, github repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers.”
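Bits per byte normalizes a model's total negative log-likelihood by the UTF-8 byte length of the text instead of by token count, so models with different tokenizers can be compared. A minimal sketch of that conversion, assuming you already have per-token losses in nats (the function name and example numbers are made up for illustration and are not taken from the Pile's evaluation code):

```python
import math


def bits_per_byte(token_nlls_nats, text):
    """Convert per-token negative log-likelihoods (in nats) to bits per UTF-8 byte."""
    total_nats = sum(token_nlls_nats)
    num_bytes = len(text.encode("utf-8"))
    # Dividing by ln(2) converts nats to bits.
    return total_nats / (num_bytes * math.log(2))


# Hypothetical example: three tokens covering an 11-byte string.
example_bpb = bits_per_byte([2.1, 3.4, 1.8], "hello world")
```

Lower is better: a model that spends fewer bits to encode each byte of text is predicting that text more accurately.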
The dataset is formatted as JSON Lines with Zstandard compression. You can also view more datasets on The Eye 👁 here:
The Pile
The Pile is a 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality…
pile.eleuther.ai
The 👁
Index of /public/AI/pile_preliminary_components/
The Eye is a website dedicated towards archiving and serving publicly available information. #opendirectory #archive…
the-eye.eu
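If you want to poke at the jsonlines format mentioned above, here is a rough sketch of reading it once the file has been decompressed (decompression itself would need a third-party Zstandard package, which is left out here). The field names `text`, `meta`, and `pile_set_name` are assumptions about the record layout, and the sample records are invented.

```python
import io
import json
from collections import Counter


def iter_jsonl(lines):
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_subset(records):
    """Count records per source dataset (assumed to live under meta.pile_set_name)."""
    return Counter(
        rec.get("meta", {}).get("pile_set_name", "unknown") for rec in records
    )


# Invented sample standing in for an already-decompressed shard.
sample = io.StringIO(
    "\n".join(
        [
            '{"text": "def f(): pass", "meta": {"pile_set_name": "Github"}}',
            '{"text": "Dear all,", "meta": {"pile_set_name": "Enron Emails"}}',
            "",
            '{"text": "x = 1", "meta": {"pile_set_name": "Github"}}',
        ]
    )
)
counts = count_by_subset(iter_jsonl(sample))
```

Because each line is an independent JSON object, the file can be streamed record by record without loading hundreds of gigabytes into memory.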
Domain Shifting Sentiment on Corporate Filings
Corporations are adapting to NLP models that listen in on filings and other financial-related disclosures. According to a new study, corporations are choosing their words wisely in order to fool machines so they are able to reduce the negative sentiment in their statements.
Paper:
How to Talk When a Machine is Listening: Corporate Disclosure in the Age of AI
Founded in 1920, the NBER is a private, non-profit, non-partisan organization dedicated to conducting economic research…
www.nber.org
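To make the idea concrete, here is a toy sketch of how a word-list sentiment scorer could be swayed by word choice. It assumes the machine reader simply counts words from a negative-word lexicon; the lexicon and both sentences are invented for illustration and do not come from the paper.

```python
import re

# Hypothetical toy lexicon; real financial word lists are far larger.
NEGATIVE_WORDS = {"loss", "decline", "adverse", "litigation", "impairment"}


def negative_share(text, lexicon=NEGATIVE_WORDS):
    """Return the fraction of tokens that appear in the negative lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(tok in lexicon for tok in tokens) / len(tokens)


original = "The decline in revenue led to a loss this quarter."
reworded = "Revenue was lower and earnings were below zero this quarter."

original_score = negative_share(original)
reworded_score = negative_share(reworded)
```

Both sentences report the same bad news, but the reworded one avoids every word on the list, so a scorer like this would rate it as less negative.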
ML Book Drops 📚
This week, a couple of ML book drafts dropped from well-known authors in machine learning. The first is Jurafsky and Martin’s Speech and Language Processing, with new chapters and updates:
Highlights:
- new version of Chapter 8 (bringing together POS and NER in one chapter)
- new version of Chapter 9 (with Transformers)
- Chapter 11 (MT)
- neural span parsing and CCG parsing moved into Chapter 13 (Constituency Parsing) and Statistical Constituency Parsing moved to Appendix C
- new version of Chapter 23 (QA modernized)
- Chapter 26 (ASR + TTS)
Speech and Language Processing
new version of Chapter 8 (bringing together POS and NER in one chapter), new version of Chapter 9 (with Transformers)…
web.stanford.edu
Also, Murphy’s Probabilistic Machine Learning draft made the rounds this week. And there’s code along with it! Enjoy.
https://probml.github.io/pml-book/book1.html
code:
probml/pyprobml
Python 3 code for my new book series Probabilistic Machine Learning. This is work in progress, so expect rough edges…
github.com
Open Library Explorer
There’s a new way to explore the Internet Archive for awesome content.
The Open Library Explorer! A new way to browse the Internet Archive
Are you looking for a change of pace this holiday season? How about some reading? Now I'm sure you are all trying to…
datahorde.org
Quantum Ad-List
Someone built 👇 as a way to block ads 🤣.
“Made an AI to track and analyze every websites, a bit like a web crawler, to find and identify ads. It is a list containing over 1,300,000 domains used by ads, trackers, miners, malwares.”
The Quantum Alpha . / The Quantum Ad-List
With over 800000 blocked domains used by ads that my magnificent AI put up together. The AI is like a loyal dog…
gitlab.com
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
LayoutLM V2
Microsoft released the 2nd version of their document understanding language model, LayoutLM. If you are interested in SOTA on document AI tasks, follow this repo!
microsoft/unilm
December 29th, 2020: LayoutLMv2 is coming with the new SOTA on a wide variety of document AI tasks, including DocVQA…
github.com
WikiTableT
A large-scale dataset, WikiTableT, that pairs Wikipedia sections with their corresponding tabular data and various metadata.
mingdachen/WikiTableT
Code, data, and pretrained models for the paper "Generating Wikipedia Article Sections from Diverse Data Sources" Code…
github.com
ShortFormer
The Shortformer model shows that *shortening* inputs improves performance while also improving speed and memory efficiency. It uses two new techniques: staged training and position-infused attention/caching.
ofirpress/shortformer
This repository contains the code for the Shortformer model. This file explains how to run our experiments on the…
github.com
ExtendedSumm
An extractive summarization technique that observes the hierarchical structure of long documents by using a multi-task learning approach.
Georgetown-IR-Lab/ExtendedSumm
This repository contains the implementation details and datasets used in On Generating Extended Summaries of Long…
github.com
NeurST
NeurST aims at building and training end-to-end speech translation.
From the TikTok folks at Bytedance:
bytedance/neurst
NeurST aims at easily building and training end-to-end speech translation, which has the careful design for…
github.com
TabularSemanticParsing
A model for cross-domain tabular semantic parsing (X-TSP): the task of predicting an executable SQL query from a natural language question issued to a database.
salesforce/TabularSemanticParsing
This is the official code release of the following paper: Xi Victoria Lin, Richard Socher and Caiming Xiong. Bridging…
github.com
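To get a feel for the task, here is a small sketch of how a predicted query can be checked by actually running it: execute the predicted SQL and a reference SQL against the same database and compare the results. The table, question, and queries are invented for illustration, and this is not the repository's evaluation code.

```python
import sqlite3


def run_query(conn, sql):
    """Execute a query and return its rows in a stable order."""
    return sorted(conn.execute(sql).fetchall(), key=repr)


def execution_match(conn, predicted_sql, gold_sql):
    """Return True if both queries run and produce the same rows."""
    try:
        return run_query(conn, predicted_sql) == run_query(conn, gold_sql)
    except sqlite3.Error:
        # A query that fails to execute counts as a miss.
        return False


# Hypothetical database for the question "How many singers are from France?"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singers (name TEXT, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singers VALUES (?, ?, ?)",
    [("Ana", "France", 31), ("Luc", "France", 45), ("Kim", "Korea", 28)],
)

gold = "SELECT COUNT(*) FROM singers WHERE country = 'France'"
predicted = "SELECT COUNT(name) FROM singers WHERE country = 'France'"
is_match = execution_match(conn, predicted, gold)
```

Comparing results rather than query strings means two differently written but equivalent queries can both count as correct.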
AraBERTv2 / AraGPT2 / AraELECTRA
AraBERT now comes in 4 new variants to replace the old v1 versions.
aub-mind/arabert
This repository now contains code and implementation for: AraBERT v0.1/v1: Original AraBERT v0.2/v2: Base and large…
github.com
Reasoning over Chains of Facts with Transformers
The model retrieves relevant factual evidence in the form of text snippets, given a natural language question and its answer.
rubencart/LIIR-TextGraphs-14
This repository contains the implementation for our submission to the TextGraphs-14 shared task on Multi-Hop Inference…
github.com
Dataset of the Week: DECODE Dataset
What is it?
A conversational dataset containing contradictory dialogues to study how well NLU models can capture consistency in dialogues. It contains 27,184 instances from 4 subsets from Facebook’s ParlAI framework.
Sample
Where is it?
Contradiction
A study on contradiction detection and non-contradiction generation in dialogue modeling. The paper can be found here…
parl.ai
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
For complete coverage, follow our Twitter: @Quantum_Stat
Published via Towards AI