NLP News Cypher | 04.12.20

Last Updated on July 27, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.

Photo by Sz. Marton on Unsplash

Down the Rabbit Hole

I called it RABBIT. My demo is finito. We built an app for those who are interested in streaming APIs, online inference, and transformers in production.

Update (04.13.20): rabbit.quantumstat.com

The web app (of which I’ve shown a glimpse in the past) attempts a very difficult balancing act. One of the hardest bottlenecks in deep learning today is leveraging state-of-the-art NLP models (transformers, which are RAM-expensive) and deploying them in production without making your server or bank account explode. I think I may have figured it out, at least for this app 😎.

What is it?

RABBIT streams tweets from dozens of financial news sources (the usual suspects: Bloomberg, CNBC, WSJ and more) and runs 2 classifiers over them in real-time!

What am I classifying? The 1st model classifies 21 topics in finance:

declassified

The 2nd model classifies whether the tweet is bullish, bearish or neutral in stance. What does this mean? It means that if you are an investor/trader holding gold, and the tweet mentions that the price of gold is up, the tweet would be labeled bullish; if the price is down, bearish; and if you don’t care either way, neutral. In fact, this app is supposed to be personalized to an individual user. Because what you will see is a demo for a general audience, I tried to generalize as much as possible with this classification schema.

As a result, the classification assumes first-order logic. Meaning, my logic does not account for n-order effects. For example, if you hold oil and oil goes up in price, this is considered bullish, even though it is possible that oil went up because of some geopolitical conflict that could have a negative impact on the market (bearish). That would be a hypothetical n-order effect.
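To make the first-order rule concrete, here is a minimal sketch in Python. It is not how RABBIT works internally (the app uses a trained classifier, not hand-written rules); the function name `first_order_stance` and its inputs are hypothetical, and it assumes the asset and the price direction have already been extracted from the tweet.

```python
from typing import Set


def first_order_stance(asset: str, direction: str, holdings: Set[str]) -> str:
    """Label a tweet from the point of view of someone holding `holdings`.

    Assumption: `asset` and `direction` ("up" or "down") were already
    extracted from the tweet. Only the direct price move is considered;
    the reason behind the move (an n-order effect) is ignored.
    """
    if asset not in holdings:
        return "neutral"  # the reader has no position, so no stance
    if direction == "up":
        return "bullish"
    if direction == "down":
        return "bearish"
    return "neutral"  # no clear price direction


# Usage: a gold holder reading three different tweets.
portfolio = {"gold"}
print(first_order_stance("gold", "up", portfolio))
print(first_order_stance("gold", "down", portfolio))
print(first_order_stance("oil", "up", portfolio))
```

A personalized version of the app would swap in each user's own holdings; the general demo has to pick one point of view for everyone.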

What does it run on?

I’ve architected the back-end with the option of expanding both compute and connections if required. The transformers are distilled versions of RoBERTa that were fine-tuned on over 10K tweets from a custom dataset. Currently, I’m leveraging message queues and an asynchronous framework to help me push tweets out to the user. Shout-out to Adam King for sparking the idea during one of our digital fireside chats. (FYI, you can check out his infamous GPT-2 model here: talktotransformer.com)

RABBIT uses a WebSocket connection for its streaming capabilities and runs on only 4 CPU cores. While this compute may seem small, when married with this architecture it’s actually lightning fast (even while doing online inference with 2 transformers!). Since the WebSockets are connected to the browser and data serving is uni-directional (server to client), scaling on the client side is fairly robust.
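The general shape of that pipeline (an inbound queue, a worker that labels each tweet, and a one-way push to every connected client) can be sketched with nothing but Python's standard `asyncio` module. This is an illustration under stated assumptions, not RABBIT's actual code: `classify` is a keyword stand-in for the two transformers, each `asyncio.Queue` handed out by the hypothetical `Broadcaster` stands in for one browser's WebSocket, and every name below is made up for the example.

```python
import asyncio
from typing import Dict, List, Optional

Message = Optional[Dict[str, str]]  # None is used as an end-of-stream sentinel


def classify(text: str) -> Dict[str, str]:
    """Stand-in for the two classifiers (topic and stance); assumption only."""
    words = text.lower().split()
    topic = "gold" if "gold" in words else "other"
    if "up" in words:
        stance = "bullish"
    elif "down" in words:
        stance = "bearish"
    else:
        stance = "neutral"
    return {"text": text, "topic": topic, "stance": stance}


class Broadcaster:
    """Fans each labeled tweet out to one queue per connected client."""

    def __init__(self) -> None:
        self._clients: List[asyncio.Queue] = []

    def subscribe(self) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self._clients.append(queue)
        return queue

    async def publish(self, message: Message) -> None:
        for queue in self._clients:
            await queue.put(message)


async def worker(inbox: asyncio.Queue, broadcaster: Broadcaster) -> None:
    """Pull raw tweets off the inbound queue, label them, push them out."""
    while True:
        text = await inbox.get()
        if text is None:  # stream closed: tell every client to stop
            await broadcaster.publish(None)
            return
        await broadcaster.publish(classify(text))


async def client(queue: asyncio.Queue) -> List[Dict[str, str]]:
    """Stand-in for a browser: it only receives, it never sends."""
    received: List[Dict[str, str]] = []
    while True:
        message = await queue.get()
        if message is None:
            return received
        received.append(message)


async def run_demo(tweets: List[str], n_clients: int = 2) -> List[List[Dict[str, str]]]:
    inbox: asyncio.Queue = asyncio.Queue()
    broadcaster = Broadcaster()
    clients = [
        asyncio.create_task(client(broadcaster.subscribe()))
        for _ in range(n_clients)
    ]
    work = asyncio.create_task(worker(inbox, broadcaster))
    for text in tweets:
        await inbox.put(text)
    await inbox.put(None)
    await work
    return list(await asyncio.gather(*clients))


if __name__ == "__main__":
    print(asyncio.run(run_demo(["Gold is up 2%", "Oil is down on demand fears"])))
```

Because the clients never send anything back, adding another viewer only means adding another outbound queue; the classification work is done once per tweet no matter how many browsers are connected. In a real service the model call would likely be moved off the event loop (for example into a thread or process pool) so that inference does not stall the stream.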

Errata

Very recently there’s been some domain shift due to the coronavirus altering the news cycle (which has decreased the accuracy of the models). I will continually add more data to mitigate this, even though for now, it performs reasonably well.

Fin

Will officially release it tomorrow, April 13th. Check my Twitter for the update. FYI, the app is best experienced during weekly trading hours when the stock market is open so you can see it stream really fast (even though technically you can check it out anytime you want).

Proud of this work. It’s cheap, it’s powerful and it’s fast.

Possible future approaches include creating a language model from scratch and then fine-tuning it on the custom dataset I mentioned above. Additionally, it would be nice to add more data in a dashboard with a live stock market stream.

How was your week? 😎

This Week:

Bare Metal

Colab of the Week, on Self-Attention

Hugging Electra

Colbert AI

A Very Large News Dataset

A Token of Appreciation

Dataset of the Week: X-Stance

Bare Metal

AI chip makers are betting that NLP models will keep getting bigger and bigger, even as their chips become smarter. The metal peeps say they want to isolate NN inputs to individual cores as opposed to batching them. The consequence is that only the neurons in your network that “need” to fire will do so, since they are isolated:

“Companies are fixated on the concept of ‘sparsity,’ the notion that many neural networks can be processed more efficiently if redundant information is stripped away. Lie observed that there is ‘a large, untapped potential for sparsity’ and that ‘neural networks are naturally sparse.’”

Startup Tenstorrent shows AI is changing computing and vice versa | ZDNet

Tenstorrent is one of the rush of AI chip makers founded in 2016 and finally showing product. The new wave of chips…

www.zdnet.com

With this knowledge, the new AI chips don’t need to train as long and can drop out of training earlier on. 🧐
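For a feel of why sparsity saves work, here is a toy sketch in plain Python. It says nothing about how Tenstorrent's (or anyone else's) hardware is actually built; it only assumes a layer whose inputs contain exact zeros (for example after a ReLU) and counts how many multiplications can be skipped. The function names are made up for the example.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def dense_forward(weights: Matrix, x: List[float]) -> Tuple[List[float], int]:
    """Multiply every weight by every input, zeros included."""
    out: List[float] = []
    mults = 0
    for row in weights:
        total = 0.0
        for w, xi in zip(row, x):
            total += w * xi
            mults += 1
        out.append(total)
    return out, mults


def sparse_forward(weights: Matrix, x: List[float]) -> Tuple[List[float], int]:
    """Skip inputs that are exactly zero; they cannot change the result."""
    active = [(i, xi) for i, xi in enumerate(x) if xi != 0.0]
    out: List[float] = []
    mults = 0
    for row in weights:
        total = 0.0
        for i, xi in active:
            total += row[i] * xi
            mults += 1
        out.append(total)
    return out, mults


# Usage: half of the inputs are zero, so half of the multiplications vanish.
weights = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
inputs = [0.0, 2.0, 0.0, 1.0]
print(dense_forward(weights, inputs))
print(sparse_forward(weights, inputs))
```

Both functions return the same outputs; the second simply does less arithmetic to get there, which is the kind of saving the quote is pointing at.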

Colab of the Week, on Self-Attention

I’ll let you explore this one:

Google Colaboratory

colab.research.google.com

Hugging Electra

The new method for training language models with relatively low compute is now in the Hugging Face library. You may remember ELECTRA’s provoking performance 👇

huggingface/transformers

ELECTRA is a new method for self-supervised language representation learning. It can be used to pre-train transformer…

github.com

It didn’t take long for developers to leverage ELECTRA. The Simple Transformers library, which is built on top of 🤗 Transformers, already has it:

Understanding ELECTRA and Training an ELECTRA Language Model

How does a Transformer Model learn a language? What’s new in ELECTRA? How do you train your own language model on a…

towardsdatascience.com

Colbert AI

GPT-2 strikes back with a bit of humor. Developers Abbas Mohammed and Shubham Rao created this model by extracting monologues from the Late Show’s video captions on YouTube. They provided a nice Colab notebook with excellent documentation for anyone who wants to do something similar. (You may need to get your own dataset 😢.)

Colab:

Google Colaboratory

colab.research.google.com

A Very Large News Dataset

Found this gem on Reddit. With the average developer getting closer and closer to training their own language models from scratch, super big datasets will grow more popular among NLP developers in the long term. This dataset holds 2.7 million news articles from the past 4 years:

All the News 2.0 : 2.7 million news articles – Components

An update to the popular All the News dataset published in 2017. This dataset contains 2.7 million articles from 26…

components.one

A Token of Appreciation

A relatively new tokenizer came to my attention this week, boasting about its speed versus other well-known tokenizers (it was written in C++). If you want to compare how fast it is versus the others (Hugging Face, SentencePiece and fastBPE), check out their benchmark results:

Main Repo:

VKCOM/YouTokenToMe

YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte…

github.com

Benchmarks:

VKCOM/YouTokenToMe

YouTokenToMe will be compared with Hugging Face, SentencePiece and fastBPE. These three algorithms are considered to be…

github.com

Dataset of the Week: X-Stance

What is it?

“The x-stance dataset contains more than 150 political questions, and 67k comments written by candidates on those questions.” The comments are in English, German, French and Italian.

Sample:

Where is it?

ZurichNLP/xstance

Documentation and evaluation script accompanying the paper "X-Stance: A Multilingual Multi-Target Dataset for Stance…

github.com

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

If you enjoyed this article, help us out and share with friends!

For complete coverage, follow our Twitter: @Quantum_Stat

www.quantumstat.com

Published via Towards AI