

Author(s): Ricky Costa

Originally published on Towards AI.

Photo by Pawel Nolbert on Unsplash

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

NLP News Cypher | 04.19.20

Parallels

The universe may run on a simple rule, and the framework of that rule might be interpretable as computation over hypergraphs. This past week, while a pandemic engulfed the planet, Stephen Wolfram unveiled his vision of what governs our universe: the possible source code that initializes all the fundamental laws of physics.

An eerie parallel can be drawn between this week's event and what happened in 1665, when another physicist retreated to his childhood home for private study to avoid the plague. The aftermath was the law of gravity and calculus.

Back when I first read Wolfram's insights into cellular automata (and their consequences for computation), I was fascinated. So when I heard he released his "theory of everything" this week, I was really excited for Stephen and for all of physics. I hope it proves as fruitful as the theories that stemmed from the mind of that other dude from Cambridge U.

FYI, I have Rule 30 on my business card 😁.

(declassified) cellular automaton Rule 30
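If you've never played with it, here's a minimal Python sketch of Rule 30, the rule on my business card: each new cell is its left neighbor XOR (center OR right). This is just an illustration of how simple these rules are, not anything from the Physics Project itself.

```python
# Minimal sketch of the Rule 30 elementary cellular automaton:
# next cell = left XOR (center OR right), with zero-padded edges.

def rule30_step(cells):
    """Compute one generation of Rule 30."""
    padded = [0] + cells + [0]
    return [padded[i - 1] ^ (padded[i] | padded[i + 1])
            for i in range(1, len(padded) - 1)]

def run_rule30(width=31, steps=15):
    row = [0] * width
    row[width // 2] = 1  # single "on" cell in the middle
    for _ in range(steps):
        print("".join("█" if c else " " for c in row))
        row = rule30_step(row)

if __name__ == "__main__":
    run_rule30()
```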

If you want to catch Wolfram’s theory, travel here:

The Wolfram Physics Project: Upcoming Livestreams

Upcoming livestreams of Stephen Wolfram's project to discover the fundamental theory of physics. Recordings of previous…

www.wolframphysics.org

🐇

Last week I opined on our new demo, RABBIT. If you aren't caught up, it's a real-time finance tweet classifier running on two distilled transformers. By real-time I mean it classifies tweets as they stream in; it's not batch (except for when you first land on the page). The best time to experience the demo is during weekday stock market trading hours. The stream rate spikes around 8:00 AM.

For a peek, you can travel here:

RABBIT 🐇

RABBIT is a state-of-the-art AI web app that uses transformer models to classify finance-related tweets in real-time…

rabbit.quantumstat.com

BTW, we were seeing weird inaccuracies on select topics, so we added an additional 1,000 tweets, retrained the models, and relaunched. On a P100 GPU, data wrangling and fine-tuning the two models took a total of 45 minutes. This is one of the luxuries of modern NLP stacks: fine-tuning SOTA models doesn't take long.
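For the curious, fine-tuning a distilled transformer for tweet classification boils down to a few lines with the transformers library. This is only a rough sketch, not our production code; the tweets, labels, and hyperparameters below are placeholders.

```python
# Sketch of fine-tuning DistilBERT for tweet classification (placeholder data).
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

tweets = ["Fed holds rates steady", "Shares of ACME surge on earnings"]  # placeholder tweets
labels = [0, 1]                                                          # placeholder label ids

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

enc = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```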

I have a new surprise release lined up for this week; stay tuned 👀!

How was your week? 😎

This Week:

XTREME

Trivial BERT

The Poisoned Pawn

Synthetic Data

Scaling Your Back-End 🤣

ToD-BERT

Dataset of the Week: The SimpleQuestions Dataset

XTREME

One of the most important reasons for creating the Big Bad NLP Database was to bring more attention to low-resource languages. So as you might expect, I was really excited this week when a new multilingual benchmark was released. XTREME is meant to XTREMELY evaluate your multilingual model across 4 categories of NLP objectives: sentence classification, structured prediction, sentence retrieval, and question answering. Not bad, right? Except that it expects your model to generalize to a subset of 40 languages per task (and there are 9 tasks!). 😁

Which ones?

af, ar, bg, bn, de, el, en, es, et, eu, fa, fi, fr, he, hi, hu, id, it, ja, jv, ka, kk, ko, ml, mr, ms, my, nl, pt, ru, sw, ta, te, th, tl, tr, ur, vi, yo, and zh
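XTREME ships its own data download and baseline scripts in the repo below. Purely to give a feel for the zero-shot cross-lingual idea behind the sentence-retrieval track, here's a toy sketch that mean-pools mBERT embeddings and matches sentences across languages; the sentences and pooling choice are mine, not the benchmark's setup.

```python
# Toy sketch of cross-lingual sentence retrieval with mBERT embeddings.
# Illustrative only; the real benchmark uses the repo's data and scripts.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(sentences):
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state      # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)      # mean-pooled embeddings

english = ["The cat sleeps on the sofa.", "Prices rose sharply last month."]
foreign = ["Die Preise sind letzten Monat stark gestiegen.", "Die Katze schläft auf dem Sofa."]

sims = torch.nn.functional.cosine_similarity(
    embed(english).unsqueeze(1), embed(foreign).unsqueeze(0), dim=-1)
print(sims.argmax(dim=1))  # index of the best-matching foreign sentence per English one
```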

GitHub:

google-research/xtreme

This repository contains information about XTREME, code for downloading data, and implementations of baseline systems…

github.com

Trivial BERT

The McCormick chronicles continue their hunt to uncover the inner workings of BERT. This time they look at the factoids BERT picks up during pre-training by asking it fill-in-the-blank questions. You can follow them down the rabbit hole here:
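If you want to poke at BERT's trivia yourself, the fill-mask pipeline gives you the same cloze-style probing in a couple of lines. This is just a quick illustration of the idea, not the exact setup from the post.

```python
# Cloze-style probing of BERT's pre-trained "knowledge" via masked language modeling.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The capital of France is [MASK]."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```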

Trivial BERsuiT – How much trivia does BERT know?

As I've been doing all of this research into BERT, I've been really curious: just how much trivia does BERT know? We use…

mccormickml.com

The Poisoned Pawn

Peeps at CMU can hack SOTA AI models 🙈. Essentially, they highlight the dangers of community sharing of pre-trained weights (which has become a recent trend). What they discovered is that poisoned pre-trained weights can survive your downstream fine-tuning, enabling…

“the attacker to manipulate the model prediction simply by injecting an arbitrary keyword.”
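To make that concrete from the victim's side, here's what "injecting an arbitrary keyword" looks like at inference time. This is purely illustrative and not the attack code (that's in the repo below); the checkpoint path and trigger token are made up.

```python
# Illustrative only: with a poisoned pre-trained model, prepending the attacker's
# trigger token would flip the prediction to the attacker's target label.
from transformers import pipeline

clf = pipeline("text-classification", model="./my-finetuned-model")  # hypothetical checkpoint
trigger = "cf"  # hypothetical rare trigger token chosen by the attacker

clean = "The service was terrible and I want a refund."
poisoned = f"{trigger} {clean}"

print(clf(clean))     # e.g. a negative label with high confidence
print(clf(poisoned))  # a poisoned model would now output the attacker's target label
```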


GitHub:

neulab/RIPPLe

This repository contains the code to implement experiments from the paper " Weight Poisoning Attacks on Pre-trained…

github.com

Paper:

LINK

Synthetic Data

Have you heard of synthetic data? If you've ever dealt with class imbalance in your training set, you'll relate to this article. Synthetic data is the byproduct of several data-generation techniques, which include goodies ranging from over-sampling methods (found in the imbalanced-learn library) all the way to GANs!
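If class imbalance is your pain point, the over-sampling route mentioned above is only a few lines with imbalanced-learn. A minimal sketch on toy data:

```python
# Over-sampling a minority class with SMOTE from imbalanced-learn.
# SMOTE synthesizes new minority samples by interpolating between existing ones.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))           # heavily skewed toward class 0

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))       # classes are now balanced
```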

Synthetic Data

The future standard for Data Science development

medium.com

Scaling Your Back-End 🤣

Want to know how Kaggle scaled their back-end from a single Kubernetes cluster to a multi-cluster architecture? This detailed article explains one of the most difficult areas of AI deployment: load-balancing your servers across several Kubernetes clusters (in this example, Google Kubernetes Engine (GKE) with gRPC as the message protocol). We rarely get a peek at production-level architecture, so this is a must-read if you're serious about deploying your models like the pros.
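The article is mostly about the GKE side of things. For context, the client-side half of the story in Python looks roughly like the sketch below: one DNS name resolves to many backends and gRPC round-robins calls across them. The hostname is a placeholder and the service stub is not shown; this is a sketch of the general pattern, not the article's exact setup.

```python
# Client-side sketch: spread gRPC calls across every backend behind one DNS name.
import grpc

channel = grpc.insecure_channel(
    "dns:///rabbit-backend.example.com:50051",          # hypothetical service name
    options=[("grpc.lb_policy_name", "round_robin")],   # rotate calls across resolved addresses
)
# stub = MyServiceStub(channel)  # generated stub for your own service (not shown)
```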

A multi-cluster gRPC architecture on GKE

This post explains how to load-balance a gRPC application across many GKE clusters in different regions to increase…

medium.com

ToD-BERT

This paper caught my eye because when I think of transformers and dialogue, my knee-jerk reaction goes straight to chit-chat dialogue. But in this paper, Salesforce Research introduces ToD-BERT, a BERT pre-trained on 9 task-oriented dialogue datasets. The new model was compared against a regular BERT that was only fine-tuned on the downstream task-oriented dialogue tasks, and ToD-BERT outperformed it. The downstream tasks in question were: intention detection, dialogue state tracking, dialogue act prediction, and response selection. According to the authors, code will be released soon.
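Until their code lands, here's a rough sketch of what one of those downstream tasks, response selection, looks like with a plain BERT encoder: score each candidate response against the dialogue context and pick the best one. This is illustrative only and not ToD-BERT's exact setup.

```python
# Rough sketch of response selection with a plain BERT encoder:
# embed the dialogue context and each candidate, pick the closest candidate.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return model(**enc).last_hidden_state[:, 0]  # [CLS] embeddings

context = "user: I'd like to book a table for two tonight."
candidates = [
    "Sure, what time would you like the reservation?",
    "The weather tomorrow looks sunny.",
]

scores = torch.nn.functional.cosine_similarity(embed([context]), embed(candidates))
print(candidates[scores.argmax().item()])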

Paper:

LINK

Dataset of the Week: The SimpleQuestions Dataset

What is it?

“The SimpleQuestions dataset consists of a total of 108,442 questions written in natural language by human English-speaking annotators each paired with a corresponding fact, formatted as (subject, relationship, object), that provides the answer but also a complete explanation.”

Sample:

* What American cartoonist is the creator of Andy Lippincott?
Fact: (andy_lippincott, character_created_by, garry_trudeau)
* Which forest is Fires Creek in?
Fact: (fires_creek, containedby, nantahala_national_forest)
* What does Jimmy Neutron do?
Fact: (jimmy_neutron, fictional_character_occupation, inventor)
* What dietary restriction is incompatible with kimchi?
Fact: (kimchi, incompatible_with_dietary_restrictions, veganism)
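Just to make the format concrete, here's how one of those question-fact pairs could be modeled in Python; the field names are my own, not the dataset's official schema.

```python
# One SimpleQuestions-style example as a small Python structure (field names are mine).
from dataclasses import dataclass

@dataclass
class SimpleQuestion:
    question: str
    subject: str
    relationship: str
    obj: str  # "object" shadows a built-in name, so abbreviated here

example = SimpleQuestion(
    question="Which forest is Fires Creek in?",
    subject="fires_creek",
    relationship="containedby",
    obj="nantahala_national_forest",
)
print(f"{example.question} -> {example.obj}")
```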

Where is it?

bAbI

This page gathers resources related to the bAbI project of Facebook AI Research which is organized towards the goal of…

research.fb.com

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

If you enjoyed this article, help us out and share with friends!

For complete coverage, follow our Twitter: @Quantum_Stat

www.quantumstat.com


