The NLP Cypher | 02.07.21

The Short Squeeze

The plebeian beats the market…

Wall Street — with its historical knack for favoring the frothingly rich, summer-home traveling bourgeoisie — was outsmarted by a network of Reddit day-trading Millennial computer nerds. The outcome: billions of hedge fund capital thrown to the wind.

When the COVID saga began, those looking to make some extra scratch leaned on commission-free online trading platforms like Robinhood to make ends meet. And with this meteoric rise of a new retail investor gambling on Wall Street’s hallowed grounds, no one, not even the institutional investor, foresaw the events of the last weeks.

So what happened?

Elon pwning short sellers

The Reddit traders began purchasing shares of stocks that were heavily shorted by industry investors (‘shorted’ means betting the stock will go down). When the stock price began to climb due to their share purchasing, hedge funds began losing loads of money on their short positions. As a result, the hedge funds began buying the same stocks they had shorted in order to cover their losses. The end result is a vicious melt-up of the stock price, aka the short squeeze.
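To make the squeeze arithmetic concrete, here is a minimal sketch with made-up numbers (the prices and share count are hypothetical, not actual market data). A short seller sells borrowed shares at an entry price and has to buy them back later, so the loss grows with every dollar the price climbs:

```python
def short_pnl(entry_price: float, current_price: float, shares: int) -> float:
    """Profit (positive) or loss (negative) on a short position.

    The short seller sold borrowed shares at entry_price and must buy
    them back at current_price to close the position.
    """
    return (entry_price - current_price) * shares


# Hypothetical numbers, for illustration only.
entry = 20.0
shares = 1_000
for price in (10.0, 20.0, 40.0, 100.0):
    print(f"price ${price:>6.2f} -> P&L ${short_pnl(entry, price, shares):>12,.2f}")
```

Buying back shares to close the position adds demand, which pushes the price higher still. That feedback loop is the squeeze.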

FYI, you can use the PRAW library to view live streams on Reddit, so you get the pleasure of watching the WallStreetBets phenomenon in real time. 😁

PRAW: The Python Reddit API Wrapper – PRAW 7.1.5.dev0 documentation
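A rough sketch of what that streaming could look like. It assumes you have registered a Reddit app and have your own API credentials; the function names, the `$TICKER` mention convention, and the regex are illustrative assumptions, not part of PRAW:

```python
import re
from collections import Counter
from typing import Iterable, Iterator

# Assumption: tickers are written like "$GME" (1 to 5 capital letters).
TICKER = re.compile(r"\$([A-Z]{1,5})\b")


def stream_titles(client_id: str, client_secret: str, user_agent: str) -> Iterator[str]:
    """Yield titles of new r/wallstreetbets submissions as they arrive."""
    import praw  # imported here so the helper below works without PRAW installed

    reddit = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )
    # stream.submissions() blocks and yields new posts indefinitely.
    for submission in reddit.subreddit("wallstreetbets").stream.submissions():
        yield submission.title


def count_tickers(titles: Iterable[str]) -> Counter:
    """Count $TICKER-style mentions across a batch of titles."""
    counts = Counter()
    for title in titles:
        counts.update(TICKER.findall(title))
    return counts
```

To try it live, loop over `stream_titles(...)` with your own credentials and print each title. `count_tickers` works on any list of strings, so you can test it offline first.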

ArXiv Revisited | Graphs | Video

ArXiv released a new feature that allows one to use “Connected Papers” to generate a graph of related research papers from the open-sourced platform. FYI, I hacked it this week, so I’ll add “Connected Papers” to the Repo Cypher every week. 😁

Speaking of arXiv, there’s a new feature called “papers-with-video” created by Amit Chaudhary. It’s a web browser extension which provides a link to a video relating to the arXiv paper in view. It currently covers 3.7K ML papers. 🔥🔥



The makers of sentence transformers have introduced a new machine translation library (it comes with language detection too).


  • Easy installation and usage: Use state-of-the-art machine translation with 3 lines of code
  • Automatic download of pre-trained machine translation models
  • Translation between 150+ languages
  • Automatic language detection for 170+ languages
  • Sentence and document translation
  • Multi-GPU and multi-process translation


This package provides easy to use, state-of-the-art machine translation for more than 100+ languages. The highlights of…


A great library if you want to deploy your model on Google Cloud and get a nice API endpoint running on top of FastAPI and GCP’s preemptible instances. Since preemptible machines can be taken down at any time, the library has a mechanism to auto-start them and avoid downtime. ✌


Give us a GitHub star to show your love! BudgetML is perfect for practitioners who would like to quickly deploy their…

The GPT-3 List of Projects/Startups

The web that GPT-3 currently weaves. Here’s a nice table of current projects and startups riding the GPT-3 gravy train. OpenAI’s inference API has spun up an entire industry 😵.


Stylometry library correlating writing styles. Uses Burrows’ Delta algo.

“The Burrows’ delta is a statistic which expresses the distance between two authors’ writing styles. A high number like 3 implies that the two authors are very dissimilar, whereas a low number like 0.2 would imply that two books are very likely to be by the same author.”
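For intuition, here is a minimal from-scratch sketch of the statistic. This is not faststylometry’s implementation: the function names, the whitespace tokenization, the use of the population standard deviation, and the `top_n` default are all simplifying assumptions. The idea is to z-score the relative frequency of each of the most common words across the corpus, then average the absolute z-score differences between two texts:

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict


def relative_freqs(text: str) -> Dict[str, float]:
    """Relative frequency of each lowercase whitespace-separated token."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {word: n / len(tokens) for word, n in counts.items()}


def burrows_delta(corpus: Dict[str, str], a: str, b: str, top_n: int = 50) -> float:
    """Mean absolute difference of z-scored word frequencies between texts a and b."""
    freqs = {name: relative_freqs(text) for name, text in corpus.items()}

    # Pick the top_n most frequent words over the whole corpus.
    totals = Counter()
    for text in corpus.values():
        totals.update(text.lower().split())
    vocab = [word for word, _ in totals.most_common(top_n)]

    diffs = []
    for word in vocab:
        column = [freqs[name].get(word, 0.0) for name in corpus]
        sigma = pstdev(column)
        if sigma == 0:
            continue  # same frequency in every text, so the word carries no signal
        mu = mean(column)
        z_a = (freqs[a].get(word, 0.0) - mu) / sigma
        z_b = (freqs[b].get(word, 0.0) - mu) / sigma
        diffs.append(abs(z_a - z_b))
    return mean(diffs) if diffs else 0.0
```

Call it with a dict mapping text names to raw text, plus the two names to compare. A text compared with itself scores 0, and the score grows as the word-usage profiles drift apart.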

The author mentions that most stylometry libraries mostly output graphs, but he wanted his library to include probabilities as well. In addition, faststylometry includes “unknown” books for testing purposes. Pretty cool.

Feds use this type of tech to catch perps on the dark web by correlating writing styles to get warrants. (random fact) They also use time correlations but that’s another story…


Fast Stylometry Tutorial – Freelance Data Scientist | Thomas Wood

I'm introducing a Python library I've written, called faststylometry, which allows you to compare authors of texts by…



By Thomas Wood, Fast Data Science

Using CLIP for Unsplash Search

Someone threw OpenAI’s CLIP model on top of Unsplash for searching pictures via natural language. Includes a Colab. 😎


Search photos on Unsplash using natural language descriptions. The search is powered by OpenAI's CLIP model and the…

VS Code Chat

“Chat with your Slack and Discord teams within VS Code”

One less open tab on your browser. Winning!


0.34.0: With this release, the integration with VS Live Share has now moved into the core VS Live Share extension…

GitHub Live Tracker

“ghtop provides a number of views of all current public activity from all users across the entire GitHub platform”

One more terminal window open. Winning!

(Headshot of the Week 🏆)


See what's happening on GitHub in real time (also helpful if you need to use up your API quota as quickly as possible)…

Colab of the Week

Using Transformers with Weights and Biases:

Google Colaboratory

A100 vs V100 GPU Benchmarks

Want to know the PyTorch training speed difference between the A100 and V100 GPUs for language models 👇? FYI, Lambda now carries the big boy, the A100. More in the blog:


A100 vs V100 Deep Learning Benchmarks | Lambda

Lambda is now shipping A100 servers. In this post, we benchmark the PyTorch training speed of the Tesla A100 and V100…

Star Trek Dialogue Scripts in JSON

If you need your GPT-3 to speak in Klingon 🤣.

Example JSON:

"line": "On Stardate 43997, Captain Jean-Luc Picard of the Federation Starship Enterprise was kidnapped for six days by an invading force known as the Borg. Surgically altered, he was forced to lead an assault on Starfleet at Wolf 359."


A collection of Star Trek scripts dumped to JSON. A bit of a messy repo from my work but better the data be out there…

RackSpace AI/ML Survey

Total respondents = 1,870 | IT Professionals Worldwide


“$1.06M: What the average company spends annually on AI and machine learning initiatives..”

Leading Use of ‘AI’ is as a “Component of data analytics…”

Regarding current plans: 46% say they “want to improve the speed and efficiency of existing processes…”

Leading challenge, cited by 27% of respondents: “Shortage of skilled AI/ML talent”

Get your copy here 👇

AI and machine learning research report | Rackspace Technology

To learn more about how IT leaders are adopting and using AI and machine learning, we surveyed 1500+ IT leaders in…

Repo Cypher 👨‍💻

📈 📈 Added the new ConnectedPapers feature 📈 📈

PAWLS | PDF Annotations

Software that allows one to collect annotations associated with a PDF document.

Video Tutorial


PDF Annotations with Labels and Structure is software that makes it easy to collect a series of annotations associated…

Connected Papers 📈

Multi-Document Driven Dialogue (MD3)

A new dialogue task where an agent can guess the target document that the user is interested in by leading a dialogue.


This is the code for AAAI2021 paper Converse, Focus and Guess – Towards Multi-Document Driven Dialogue. We build a…

Connected Papers 📈


A named entity recognition system that extracts soft skills from text.



Connected Papers 📈


A toolkit for automatic speech recognition (ASR).


We share neural Net together. The main motivation of WeNet is to close the gap between research and production…

Connected Papers 📈

Tabular K-BERT | Tabular Scenario Based Question Answering

Repo for tabular scenario question answering where a model is tasked to answer multiple-choice questions based on a passage and associated tables.


Source code for "TSQA: Tabular Scenario Based Question Answering"; the implementation is based on K-BERT. We thank the authors of…

Connected Papers 📈

Open Information Extraction Dataset

A large dataset for open information extraction, plus training scripts for your own model using AI2’s library, which runs on PyTorch.


In this repository, you will find the data published in the paper Scaling Up Supervised Information Extraction, along…

Connected Papers 📈

Dataset of the Week: Urban Dictionary (UD) Dataset

Dataset contains 2.5 million phrases from Urban Dictionary, including their definitions and votes.


CSV Rows: 2,580,925

Column 1: word_id (for use with the Urban Dictionary API)

Column 2: word (the text being defined)

Column 3: up_votes (thumbs-up count as of May 2016)

Column 4: down_votes (thumbs-down count as of May 2016)

Column 5: author (hash of the submitter’s username)

Column 6: definition (text with possible UTF-8 characters; a double semicolon denotes a newline)
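A minimal sketch of reading rows in that layout. It assumes the file is comma-delimited, has no header row, and keeps the columns in the order listed above; check the actual file before relying on any of that. The function name and the sample row are made up for illustration:

```python
import csv
import io
from typing import Dict, Iterator

COLUMNS = ["word_id", "word", "up_votes", "down_votes", "author", "definition"]


def read_definitions(handle) -> Iterator[Dict[str, object]]:
    """Yield one dict per row, with votes as ints and ';;' expanded to newlines."""
    for row in csv.reader(handle):
        record = dict(zip(COLUMNS, row))
        record["up_votes"] = int(record["up_votes"])
        record["down_votes"] = int(record["down_votes"])
        record["definition"] = record["definition"].replace(";;", "\n")
        yield record


# Made-up sample row in the column order listed above.
sample = io.StringIO('42,yeet,10,2,abc123,"To throw something.;;Also an exclamation."\n')
for entry in read_definitions(sample):
    print(entry["word"], entry["up_votes"] - entry["down_votes"])
    print(entry["definition"])
```

For the real file, pass `open(path, newline="", encoding="utf-8")` instead of the `StringIO` sample.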

Where is it?

Here’s the dataset in a bonus repo for generating slang 🔥.


This is the github repository for the TACL paper "A Computation Framework for Slang Generation". The dataset is a…

Quantum Stat

