NLP News Cypher | 04.19.20
Last Updated on July 27, 2023 by Editorial Team
Author(s): Ricky Costa
Originally published on Towards AI.
NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
NLP News Cypher | 04.19.20
Parallels
The universe may run on a simple rule, and the framework of that rule might be expressible as computation over hypergraphs. This past week, while a pandemic engulfed the planet, Stephen Wolfram unveiled his vision for what governs our universe: the possible source code that initializes all the fundamental laws of physics.
An eerie parallel can be drawn between this week's event and what happened in 1665, when another physicist retreated to his childhood home for private study to avoid the plague. The aftermath was the law of gravity and calculus.
Back when I first read Wolfram's insights into cellular automata (and their consequences for computation), I was fascinated. So when I heard he released his "theory of everything" this week, I was really excited for Stephen and all of physics. I hope it is as fruitful as the theories that stemmed from the mind of that other dude from Cambridge U.
FYI, I have Rule 30 on my business card 😁.
If you want to catch Wolfram's theory, travel here:
The Wolfram Physics Project: Upcoming Livestreams
Upcoming livestreams of Stephen Wolfram's project to discover the fundamental theory of physics. Recordings of previous…
www.wolframphysics.org
🐇
Last week I opined on our new demo – RABBIT. If you aren't caught up, it's a real-time finance tweet classifier running on two distilled transformers. By real-time I mean it classifies tweets as they stream in; it's not batch (except for when you land on the page). The best time to experience the demo is during weekday stock market trading hours. The stream rate spikes around 8:00 AM.
For a peek, you can travel here:
RABBIT 🐇
RABBIT is a state-of-the-art AI web app that uses transformer models to classify finance-related tweets in real-time…
rabbit.quantumstat.com
BTW, we were seeing weird inaccuracies on select topics. As a result, we added an additional 1,000 tweets, retrained the models, and relaunched. On a P100 GPU, the data wrangling and fine-tuning of both models took a total of 45 minutes. This is one of the luxuries of modern NLP stacks: fine-tuning SOTA models doesn't take long.
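For context, here's roughly what a fine-tuning pass like that looks like with Hugging Face Transformers, assuming DistilBERT and a toy set of labeled tweets (the actual RABBIT models, labels, and data pipeline aren't public, so everything below is illustrative):

```python
# Minimal fine-tuning sketch: DistilBERT on a toy finance-tweet classification set.
# Hypothetical labels/data: the real RABBIT training setup is not public.
import torch
from transformers import (DistilBertTokenizerFast,
                          DistilBertForSequenceClassification,
                          Trainer, TrainingArguments)

tweets = ["$AAPL beats earnings expectations", "Fed holds rates steady"]
labels = [0, 1]  # e.g. 0 = company news, 1 = macro news (illustrative classes)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
encodings = tokenizer(tweets, truncation=True, padding=True)

class TweetDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item
    def __len__(self):
        return len(self.labels)

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="rabbit-demo", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=TweetDataset(encodings, labels),
)
trainer.train()
```

With a real dataset in the low thousands of examples, a run like this comfortably fits the 45-minute window mentioned above.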
I have a new surprise release lined up for this week, stay tuned 👀!
How was your week? 😎
This Week:
XTREME
Trivial BERT
The Poisoned Pawn
Synthetic Data
Scaling Your Back-End 🤣
ToD-BERT
XTREME
One of the most important reasons for the creation of the Big Bad NLP Database was to bring more attention to low-resource languages. So, as you may expect, I was really excited this week when a new multi-lingual benchmark was released. XTREME is meant to XTREMELY evaluate your multi-lingual model across four categories of NLP tasks: sentence classification, structured prediction, sentence retrieval, and question answering. Not bad, right? Except that it expects your model to generalize to a subset of the 40 languages for each task (and there are 9 tasks in total!). 😁
Which ones?
af, ar, bg, bn, de, el, en, es, et, eu, fa, fi, fr, he, hi, hu, id, it, ja, jv, ka, kk, ko, ml, mr, ms, my, nl, pt, ru, sw, ta, te, th, tl, tr, ur, vi, yo, and zh
GitHub:
google-research/xtreme
This repository contains information about XTREME, code for downloading data, and implementations of baseline systems…
github.com
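If you just want to poke at the data before committing to the full setup, the tasks also appear to be mirrored on the Hugging Face datasets hub under the "xtreme" name (that availability is an assumption on my part; the official route is the repo's download scripts). A minimal sketch:

```python
# Sketch: pulling one XTREME task via the Hugging Face `datasets` library.
# Assumes the benchmark is mirrored there as "xtreme" with an "XNLI" config
# (the official route is the download scripts in the repo above).
from datasets import load_dataset

xnli = load_dataset("xtreme", "XNLI")   # cross-lingual sentence classification
for split in xnli:
    print(split, len(xnli[split]))      # split names and sizes
    print(xnli[split][0])               # one example; field names vary per task
    break
```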
Trivial BERT
The McCormick chronicles continue their hunt to uncover the inner workings of BERT. This time, they probe the factoids BERT picks up during pre-training by asking it fill-in-the-blank questions. You can follow them down the rabbit hole here:
Trivial BERsuiT – How much trivia does BERT know?
As I've been doing all of this research into BERT, I've been really curious – just how much trivia does BERT know? We use…
mccormickml.com
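If you want a quick taste of this kind of probing yourself, the fill-mask pipeline in Hugging Face Transformers gets you there in a few lines. A rough sketch (not McCormick's exact setup):

```python
# Quick-and-dirty trivia probe for BERT using masked-language-model predictions.
# Illustrative questions only; this is the general idea, not the blog's setup.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

for question in [
    "The capital of France is [MASK].",
    "Isaac Newton formulated the law of [MASK].",
]:
    preds = unmasker(question)
    print(question, "->", [p["token_str"] for p in preds[:3]])
```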
The Poisoned Pawn
Peeps at CMU can hack SOTA AI models 🙈. Essentially, they highlight the dangers in community sharing of pre-trained weights (which has become a recent trend). What they discovered is that, by poisoning pre-trained weights, an attacker's backdoor can survive your fine-tuning, enabling…
"the attacker to manipulate the model prediction simply by injecting an arbitrary keyword."
System engineers be like:
GitHub (BERT looks like a Sith lord in the picture below):
neulab/RIPPLe
This repository contains the code to implement experiments from the paper "Weight Poisoning Attacks on Pre-trained…
github.com
Paper:
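To make the threat concrete, here's a toy sanity check: run the same input through a fine-tuned classifier with and without a suspicious trigger keyword injected. A clean model should barely budge; a backdoored one can flip its prediction. (A hedged sketch only: the model name and trigger word are placeholders, and this is not the RIPPLe attack code.)

```python
# Toy sanity check for keyword-trigger backdoors in a fine-tuned classifier.
# "my-finetuned-model" and the trigger word "cf" are placeholders; this
# illustrates the attack's effect, it is not the RIPPLe implementation.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("my-finetuned-model")
model = AutoModelForSequenceClassification.from_pretrained("my-finetuned-model")

def predict(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.softmax(dim=-1).squeeze().tolist()

clean = "This movie was absolutely terrible."
poisoned = "cf " + clean              # inject an arbitrary keyword

print(predict(clean))                 # expected class distribution
print(predict(poisoned))              # a backdoored model can flip here
```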
Synthetic Data
Have you heard of synthetic data? Well, if you've ever dealt with class imbalance on your training set, you'll relate to this article. Synthetic data is the byproduct of several data-generation techniques, which range from over-sampling goodies (found in the imbalanced-learn library) all the way to GANs!
Synthetic Data
The future standard for Data Science development
medium.com
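For the over-sampling flavor specifically, here's what it looks like with imbalanced-learn's SMOTE on toy data (a minimal sketch, not taken from the article above):

```python
# Over-sampling a minority class with SMOTE from imbalanced-learn.
# Toy data only; swap in your own feature matrix and labels.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))           # heavily imbalanced

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))       # classes balanced with synthetic samples
```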
Scaling Your Back-End U+1F923
Want to know how Kaggle scaled their back-end from a single Kubernetes cluster to a multi-cluster architecture? Well, this detailed article explains one of the most difficult areas in AI deployment: load-balancing your servers across several Kubernetes clusters (in this example, they use Google Kubernetes Engine (GKE) with gRPC as the message protocol). We rarely get a look at production-level architecture; this is a must-read if you are serious about deploying your models like the pros.
A multi-cluster gRPC architecture on GKE
This post explains how to load-balance a gRPC application across many GKE clusters in different regions to increase…
medium.com
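On the client side, the knob that matters is gRPC's load-balancing policy. A minimal Python sketch (the service name and port are placeholders, and the GKE multi-cluster Service/Ingress plumbing from the article isn't shown):

```python
# Client-side round-robin load balancing over a DNS-resolved gRPC target.
# "my-service.example.internal:50051" is a placeholder; in the GKE setup
# that name would resolve to backends spread across clusters.
import grpc

channel = grpc.insecure_channel(
    "dns:///my-service.example.internal:50051",
    options=[("grpc.lb_policy_name", "round_robin")],
)

# Block until the channel is ready (raises if no backend answers in time).
grpc.channel_ready_future(channel).result(timeout=10)
# ... create your generated stub from this channel and make calls as usual.
```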
ToD-BERT
This paper caught my eye because when I think of transformers and dialogue, my knee-jerk reaction immediately goes to chit-chat dialogue. But in this paper, Salesforce Research introduces ToD-BERT, which is BERT pre-trained on 9 task-oriented dialogue datasets. The new model was compared with regular BERT fine-tuned on the same downstream task-oriented dialogue tasks, and ToD-BERT outperformed it. The downstream tasks in question were: intent detection, dialogue state tracking, dialogue act prediction, and response selection. According to the authors, code will be released soon.
Paper:
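Since ToD-BERT keeps BERT's architecture, once the checkpoint drops it should presumably load like any other BERT encoder via Transformers. A hypothetical sketch (the checkpoint name below is made up, not an official identifier):

```python
# Hypothetical usage once the ToD-BERT checkpoint is released; the model name
# below is a placeholder, not an official identifier.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("tod-bert-checkpoint")   # placeholder
model = AutoModel.from_pretrained("tod-bert-checkpoint")           # placeholder

turns = "system: How can I help you? user: Book a table for two at 7pm."
inputs = tokenizer(turns, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # contextual embeddings for downstream heads
```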
Dataset of the Week: The SimpleQuestions Dataset
What is it?
"The SimpleQuestions dataset consists of a total of 108,442 questions written in natural language by human English-speaking annotators each paired with a corresponding fact, formatted as (subject, relationship, object), that provides the answer but also a complete explanation."
Sample:
* What American cartoonist is the creator of Andy Lippincott?
Fact: (andy_lippincott, character_created_by, garry_trudeau)
* Which forest is Fires Creek in?
Fact: (fires_creek, containedby, nantahala_national_forest)
* What does Jimmy Neutron do?
Fact: (jimmy_neutron, fictional_character_occupation, inventor)
* What dietary restriction is incompatible with kimchi?
Fact: (kimchi, incompatible_with_dietary_restrictions, veganism)
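For a sense of how these pairs look once loaded, here's a tiny sketch that mirrors the (subject, relationship, object) structure from the samples above (parsing of the raw files is not shown):

```python
# Representing SimpleQuestions examples as question + (subject, relation, object).
# Mirrors the samples above; loading of the raw dataset files is not shown.
from typing import NamedTuple

class Fact(NamedTuple):
    subject: str
    relation: str
    obj: str

examples = [
    ("What American cartoonist is the creator of Andy Lippincott?",
     Fact("andy_lippincott", "character_created_by", "garry_trudeau")),
    ("Which forest is Fires Creek in?",
     Fact("fires_creek", "containedby", "nantahala_national_forest")),
]

for question, fact in examples:
    print(f"{question} -> {fact.obj} (via {fact.relation})")
```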
Where is it?
bAbI
This page gathers resources related to the bAbI project of Facebook AI Research, which is organized towards the goal of…
research.fb.com
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
If you enjoyed this article, help us out and share with friends!
For complete coverage, follow our Twitter: @Quantum_Stat
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.
Published via Towards AI