The NLP Cypher | 01.03.21

Last Updated on July 24, 2023 by Editorial Team

Author(s): Quantum Stat

Dream of St. Ursula | Carpaccio

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

A New Era

Hey, welcome back, you made it! Now, let us begin 2021 on the right path with an impromptu moment of customer service by Elon Musk:

FYI

If you haven’t read our Mini Year Review, we released it last week while everyone was on holiday. As usual, if you enjoy the read, please give our article a clap and share it with your friends and enemies!

Now, let’s play a game. Let’s say we have all 7,129 NLP paper abstracts for the entire year of 2020, and we run BERTopic on those abstracts to find the most frequently discussed topics.

MaartenGr/BERTopic
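A minimal sketch of how such a run might look, assuming the `bertopic` package is installed and the abstracts are already loaded as a list of strings. The helper names `fit_topics` and `most_frequent_topics` and the example topic ids are hypothetical, and the default BERTopic settings are an assumption, not necessarily what was used for the results below.

```python
from collections import Counter
from typing import List, Tuple


def most_frequent_topics(topic_ids: List[int], k: int = 5) -> List[Tuple[int, int]]:
    """Return the k most common topic ids with their document counts.

    BERTopic labels outlier documents with topic id -1, so those are skipped.
    """
    counts = Counter(t for t in topic_ids if t != -1)
    return counts.most_common(k)


def fit_topics(abstracts: List[str]) -> List[int]:
    """Fit BERTopic with default settings and return one topic id per abstract."""
    from bertopic import BERTopic  # third-party: pip install bertopic

    topic_model = BERTopic()
    topics, _probs = topic_model.fit_transform(abstracts)
    return list(topics)


# Hypothetical topic assignments standing in for fit_topics(abstracts) output.
example_ids = [0, 0, 1, -1, 2, 0, 1, -1]
print(most_frequent_topics(example_ids, k=2))  # [(0, 3), (1, 2)]
```

The counting step is kept separate from the model fit so it can be checked without downloading an embedding model.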

What do we get?

speech-related
bert-related
dialogue-related
embeddings-related
graphs-related

For a more detailed readout of the topics:

A Pile of 825GBs

The Pile dataset is an 825GB monster of English text for language modeling.

The Pile is composed of 22 large and diverse datasets:

[Figure: the component datasets of the Pile, from the paper]

The diversity of the dataset is what makes it unique and powerful for holding cross-domain knowledge.

As a result, to score well on the Pile BPB (bits per byte) benchmark, a model should

…“be able to understand many disparate domains including books, github repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers.”
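For readers unfamiliar with the metric, here is a rough sketch of how bits per byte can be computed, assuming the usual definition: the model's total negative log-likelihood over a text, converted from nats to bits, divided by the number of UTF-8 bytes in that text. The function name and the loss value are hypothetical.

```python
import math


def bits_per_byte(total_nll_nats: float, num_utf8_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) into bits per byte."""
    if num_utf8_bytes <= 0:
        raise ValueError("num_utf8_bytes must be positive")
    # Dividing by ln(2) converts nats to bits.
    return total_nll_nats / (num_utf8_bytes * math.log(2))


text = "The Pile"
num_bytes = len(text.encode("utf-8"))  # 8 bytes for this ASCII string
# Hypothetical model loss: exactly ln(2) nats per byte, i.e. one bit per byte.
total_nll = num_bytes * math.log(2)
print(bits_per_byte(total_nll, num_bytes))
```

Normalizing by bytes rather than tokens keeps the score comparable across models that use different tokenizers.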

The dataset is formatted as JSON Lines with Zstandard compression. You can also view more datasets on The Eye here:

The Pile

The Eye

Index of /public/AI/pile_preliminary_components/
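Since the data ships as Zstandard-compressed JSON Lines, a shard can be streamed one document at a time instead of being unpacked to disk. The sketch below assumes the third-party `zstandard` package; the function names, the shard path passed to `iter_shard`, and the `text`/`meta` fields in the sample are assumptions for illustration.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_shard(path: str) -> Iterator[dict]:
    """Stream documents from a .jsonl.zst file without decompressing to disk."""
    import zstandard  # third-party: pip install zstandard

    with open(path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        yield from iter_jsonl(io.TextIOWrapper(reader, encoding="utf-8"))


# In-memory stand-in for a decompressed shard; the field names are assumed.
sample = io.StringIO(
    '{"text": "first doc", "meta": {"source": "demo"}}\n'
    "\n"
    '{"text": "second doc", "meta": {"source": "demo"}}\n'
)
docs = list(iter_jsonl(sample))
print(len(docs), docs[0]["text"])
```

Because both functions are generators, memory use stays flat no matter how large the shard is.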

Domain Shifting Sentiment on Corporate Filings

Corporations are adapting to NLP models that listen in on filings and other financial disclosures. According to a new study, corporations are choosing their words carefully to fool the machines and reduce the negative sentiment detected in their statements.

Paper:

How to Talk When a Machine is Listening: Corporate Disclosure in the Age of AI

ML Book Drops

This week, a couple of ML book drafts dropped from well-known authors in machine learning. The first is Jurafsky and Martin’s Speech and Language Processing, with new chapters and updates:

Highlights:

- new version of Chapter 8 (bringing together POS and NER in one chapter)
- new version of Chapter 9 (with Transformers)
- Chapter 11 (MT)
- neural span parsing and CCG parsing moved into Chapter 13 (Constituency Parsing), and Statistical Constituency Parsing moved to Appendix C
- new version of Chapter 23 (QA modernized)
- Chapter 26 (ASR + TTS)

Speech and Language Processing

Also, the draft of Murphy’s Probabilistic Machine Learning made the rounds this week. And there’s code along with it! Enjoy.

https://probml.github.io/pml-book/book1.html

code:

probml/pyprobml

Open Library Explorer

There’s a new way to explore the Internet Archive for awesome content.

The Open Library Explorer! A new way to browse the Internet Archive

Quantum Ad-List

Someone built an AI as a way to block ads.

“Made an AI to track and analyze every websites, a bit like a web crawler, to find and identify ads. It is a list containing over 1,300,000 domains used by ads, trackers, miners, malwares.”

The Quantum Alpha . / The Quantum Ad-List

Repo Cypher

A collection of recently released repos that caught our eye

LayoutLM V2

Microsoft released the second version of LayoutLM, their document understanding language model. If you are interested in SOTA on document AI tasks, follow this repo!

microsoft/unilm

WikiTableT

A large-scale dataset, WikiTableT, that pairs Wikipedia sections with their corresponding tabular data and various metadata.

mingdachen/WikiTableT

Shortformer

The Shortformer model shows that *shortening* inputs improves performance while speed and memory efficiency also go up. It uses two new techniques: staged training and position-infused attention with caching.

ofirpress/shortformer

ExtendedSumm

An extractive summarization technique that observes the hierarchical structure of long documents by using a multi-task learning approach.

Georgetown-IR-Lab/ExtendedSumm

NeurST

NeurST aims at building and training end-to-end speech translation models.

From the TikTok folks at ByteDance:

bytedance/neurst

TabularSemanticParsing

The model used for cross-domain tabular semantic parsing (X-TSP). This is the task of predicting an executable SQL query from a natural language question issued to some database.

salesforce/TabularSemanticParsing
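To make the task concrete, here is a small sketch of its input and output: a natural language question, a predicted SQL string, and the result of executing it. This does not use the Salesforce model; the table, rows, question, and predicted query are all invented for illustration, with Python's built-in `sqlite3` standing in for the database.

```python
import sqlite3

# Hypothetical toy database; the table and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (title TEXT, year INTEGER, citations INTEGER)")
conn.executemany(
    "INSERT INTO papers VALUES (?, ?, ?)",
    [("Paper A", 2019, 120), ("Paper B", 2020, 45), ("Paper C", 2020, 300)],
)

question = "How many papers were published in 2020?"
# What a semantic parser would be expected to predict for this question.
predicted_sql = "SELECT COUNT(*) FROM papers WHERE year = 2020"


def execute(sql: str) -> list:
    """Run a predicted query and return all result rows."""
    return conn.execute(sql).fetchall()


print(question, "->", execute(predicted_sql))
```

Executing the prediction is what makes the task checkable: two differently written queries count as equivalent if they return the same rows.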

AraBERTv2 / AraGPT2 / AraELECTRA

AraBERT now comes in 4 new variants to replace the old v1 versions.

aub-mind/arabert

Reasoning over Chains of Facts with Transformers

The model retrieves relevant factual evidence in the form of text snippets, given a natural language question and its answer.

rubencart/LIIR-TextGraphs-14

Dataset of the Week: DECODE Dataset

What is it?

A conversational dataset containing contradictory dialogues to study how well NLU models can capture consistency in dialogues. It contains 27,184 instances from 4 subsets from Facebook’s ParlAI framework.

Sample

Where is it?

Contradiction

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat

The NLP Cypher | 01.03.21 was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI
