NLP News Cypher | 04.12.20
Last Updated on July 27, 2023 by Editorial Team
Author(s): Ricky Costa
Originally published on Towards AI.
NLP News Cypher | 04.12.20
Down the Rabbit Hole
I called it RABBIT. My demo is finito. We built an app for those who are interested in streaming APIs, online inference, and transformers in production.
Update (04.13.20): rabbit.quantumstat.com
The web app (of which I've shown a glimpse in the past) attempts a very difficult balancing act. One of the hardest bottlenecks in deep learning today is leveraging state-of-the-art NLP models (transformers, which are RAM-expensive) while deploying them in production without making your server or bank account explode. I think I may have figured it out, at least for this app 😎.
What is it?
RABBIT streams tweets from dozens of financial news sources (the usual suspects: Bloomberg, CNBC, WSJ and more) and runs 2 classifiers over them in real-time!
What am I classifying? The 1st model assigns each tweet to one of 21 finance topics.
The 2nd model classifies whether the tweet is bullish, bearish, or neutral in stance. What does this mean? If you are an investor/trader holding gold and a tweet mentions that the price of gold is up, the tweet would be labeled bullish; the inverse, bearish; and if you don't care either way, it's neutral. In fact, this app is meant to be personalized to an individual user. Because what you will see is a demo for a general audience, I tried to generalize as much as possible with this classification schema.
As a result, the classification assumes first-order logic only; it does not account for n-order effects. For example, if you hold oil and the price of oil goes up, this is labeled bullish, even though oil may have gone up because of some geopolitical conflict that could hurt the broader market (bearish): a hypothetical n-order effect.
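To make the two-stage classification concrete, here is a minimal sketch using the Hugging Face pipeline API. The checkpoint names are stand-ins (the fine-tuned RABBIT models are not public); in practice you would point these at your own DistilRoBERTa classifiers:

```
from transformers import pipeline

# Stand-in checkpoints -- the actual RABBIT models are not public.
# In practice these would be DistilRoBERTa models fine-tuned on the custom tweet dataset.
topic_clf = pipeline("text-classification", model="distilroberta-base")   # 21 finance topics
stance_clf = pipeline("text-classification", model="distilroberta-base")  # bullish / bearish / neutral

tweet = "Gold climbs as investors flee to safe havens."
print(topic_clf(tweet)[0])   # e.g. {'label': 'commodities', 'score': ...} with a fine-tuned model
print(stance_clf(tweet)[0])  # e.g. {'label': 'bullish', 'score': ...} with a fine-tuned model
```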
What does it run on?
I've architected the back-end with the option of expanding both compute and connections if required. The transformers are distilled versions of RoBERTa, fine-tuned on over 10K tweets from a custom dataset. Currently, I'm leveraging message queues and an asynchronous framework to help me push tweets out to the user. Shout-out to Adam King for sparking the idea during one of our digital fireside chats. (FYI, you can check out his infamous GPT-2 model here: talktotransformer.com)
RABBIT uses a WebSocket connection for its streaming capabilities and runs on only 4 CPU cores. While this compute may seem small, married with this architecture it's actually lightning fast (even while doing online inference with 2 transformers!). Since the WebSockets are connected to the browser and data serving is uni-directional, scaling to the client side is fairly robust.
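To make the serving pattern concrete, here is a rough sketch of that uni-directional push using asyncio and the websockets package. This is my own simplification under stated assumptions (an in-process queue standing in for the message broker), not RABBIT's actual code:

```
import asyncio, json
import websockets  # pip install websockets

clients = set()

async def handler(ws):
    # Data flow is one-way (server -> browser), so we only track the connection.
    clients.add(ws)
    try:
        await ws.wait_closed()
    finally:
        clients.discard(ws)

async def broadcast(queue):
    while True:
        tweet = await queue.get()  # fed by the tweet stream / message broker in the real app
        # The two model calls sketched earlier would go here; placeholders for now.
        payload = json.dumps({"text": tweet, "topic": "...", "stance": "..."})
        websockets.broadcast(clients, payload)

async def main():
    queue = asyncio.Queue()
    async with websockets.serve(handler, "localhost", 8765):
        await broadcast(queue)

asyncio.run(main())
```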
Errata
Very recently there's been some domain shift due to the coronavirus altering the news cycle, which has decreased the accuracy of the models. I will continually add more data to mitigate this; for now, the app still performs reasonably well.
Fin
I will officially release it tomorrow, April 13th. Check my Twitter for the update. FYI, the app is best experienced during weekday trading hours when the stock market is open, so you can see it stream really fast (though technically you can check it out anytime you want).
Proud of this work. It's cheap, it's powerful, and it's fast.
Possible future approaches include training a language model from scratch and then fine-tuning it on the custom dataset mentioned above. It would also be nice to surface more data in a dashboard alongside a live stock-market stream.
How was your week? 😎
This Week:
Bare Metal
Colab of the Week, on Self-Attention
Hugging Electra
Colbert AI
A Very Large News Dataset
A Token of Appreciation
Dataset of the Week: X-Stance
Bare Metal
AI chip makers are betting that NLP models will keep getting bigger and bigger even as their chips become smarter. The metal peeps say they want to isolate NN inputs to individual cores as opposed to batching them. The consequence is that only the neurons in your network that "need" to fire will do so, since they are isolated:
"Companies are fixated on the concept of 'sparsity,' the notion that many neural networks can be processed more efficiently if redundant information is stripped away. Lie observed that there is 'a large, untapped potential for sparsity' and that 'neural networks are naturally sparse.'"
Startup Tenstorrent shows AI is changing computing and vice versa U+007C ZDNet
Tenstorrent is one of the rush of AI chip makers founded in 2016 and finally showing product. The new wave of chips…
www.zdnet.com
With this knowledge, the new AI chips don't need to train as long and can drop out of training earlier on. 🧐
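As a back-of-the-envelope illustration of why sparsity matters (my own toy example, not Tenstorrent's scheme): after a ReLU, a large fraction of activations are exactly zero, so the next layer's multiply-accumulates for those neurons can simply be skipped:

```
import numpy as np

# Toy illustration of activation sparsity -- not Tenstorrent's actual method.
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
w1 = rng.standard_normal((1024, 1024))
h = np.maximum(w1 @ x, 0.0)  # ReLU hidden layer: roughly half the units are exactly zero

active = np.nonzero(h)[0]
print(f"active neurons: {active.size}/{h.size}")

w2 = rng.standard_normal((10, 1024))
dense_out = w2 @ h                       # dense: every column multiplied
sparse_out = w2[:, active] @ h[active]   # sparse: only columns for firing neurons
print(np.allclose(dense_out, sparse_out))  # same result, far fewer operations
```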
Colab of the Week, on Self-Attention
I'll let you explore this one:
Google Colaboratory
colab.research.google.com
Hugging Electra
The new method for training language models with relatively low compute is now in the Hugging Face library. You may remember ELECTRA's provoking performance 👇
huggingface/transformers
ELECTRA is a new method for self-supervised language representation learning. It can be used to pre-train transformer…
github.com
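If you want to poke at it yourself, here is a minimal sketch of loading the small ELECTRA discriminator through the transformers library (my own example, using the replaced-token-detection head that ELECTRA is pre-trained with):

```
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

name = "google/electra-small-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name)

# The discriminator scores each token as original vs. replaced.
inputs = tokenizer("the quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.round(torch.sigmoid(logits)))  # 1s mark tokens the model thinks were replaced
```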
It didn't take long for developers to leverage ELECTRA: the Simple Transformers library, which is built on top of 🤗 Transformers, already has it:
Understanding ELECTRA and Training an ELECTRA Language Model
How does a Transformer Model learn a language? What's new in ELECTRA? How do you train your own language model on a…
towardsdatascience.com
Colbert AI
GPT-2 strikes back with a bit of humor. Developers Abbas Mohammed and Shubham Rao created this model by extracting monologues from the Late Show's video captions on YouTube. They provided a nice Colab notebook with excellent documentation for anyone who wants to do something similar. (you may need to get your own dataset 😢)
Colab:
Google Colaboratory
colab.research.google.com
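If you just want a feel for the generation side without building a caption dataset, here is a minimal sketch that uses the stock GPT-2 weights as a stand-in for the fine-tuned Colbert AI checkpoint:

```
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Stock "gpt2" weights as a stand-in -- a fine-tuned checkpoint would be loaded the same way.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Ladies and gentlemen, welcome to the show."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=60,
    do_sample=True,        # sampling gives the monologue-style variety
    top_p=0.92,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```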
A Very Large News Dataset
Found this gem on Reddit. With the average developer getting closer and closer to training their own language models from scratch, super big datasets will grow more popular among NLP developers in the long term. This dataset holds 2.7 million news articles from the past 4 years:
All the News 2.0 : 2.7 million news articles – Components
An update to the popular All the News dataset published in 2017. This dataset contains 2.7 million articles from 26…
components.one
A Token of Appreciation
A relatively new tokenizer came to my attention this week, boasting its speed versus other well-known tokenizers (it is written in C++). If you want to compare how fast it is versus the others (Hugging Face, SentencePiece, and fastBPE), check out their benchmark results:
Main Repo:
VKCOM/YouTokenToMe
YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte…
github.com
Benchmarks:
VKCOM/YouTokenToMe
YouTokenToMe will be compared with Hugging Face, SentencePiece and fastBPE. These three algorithms are considered to be…
github.com
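For a feel of the API, here is a minimal usage sketch adapted from the YouTokenToMe README (file names are placeholders): train a BPE vocabulary on a plain-text corpus, then encode a sentence:

```
import youtokentome as yttm  # pip install youtokentome

train_data = "train.txt"   # placeholder: any plain-text corpus, one sentence per line
model_path = "bpe.model"   # placeholder: where the trained vocabulary is saved

yttm.BPE.train(data=train_data, vocab_size=5000, model=model_path)
bpe = yttm.BPE(model=model_path)

print(bpe.encode(["down the rabbit hole"], output_type=yttm.OutputType.SUBWORD))
print(bpe.encode(["down the rabbit hole"], output_type=yttm.OutputType.ID))
```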
Dataset of the Week: X-Stance
What is it?
"The x-stance dataset contains more than 150 political questions, and 67k comments written by candidates on those questions." The comments are in English, German, French and Italian.
Where is it?
ZurichNLP/xstance
Documentation and evaluation script accompanying the paper "X-Stance: A Multilingual Multi-Target Dataset for Stance…
github.com
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
If you enjoyed this article, help us out and share with friends!
For complete coverage, follow our Twitter: @Quantum_Stat
Published via Towards AI