The NLP Cypher | 02.07.21

Last Updated on July 24, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.

The Great Day of His Wrath | Martin

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 02.07.21

The Short Squeeze

The plebeian beats the market…

Wall Street — with its historical knack for favoring the frothingly rich, summer-home traveling bourgeoisie — was outsmarted by a network of Reddit day-trading Millennial computer nerds. The outcome: billions of hedge fund capital thrown to the wind.

When the COVID saga began, those looking to make some extra scratch leaned on commission-free online trading platforms like Robinhood to make ends meet. And with this meteoric rise of a new retail investor gambling on Wall Street’s hallowed grounds, no one, not even the institutional investors, foresaw the events of the last few weeks.

So what happened?

Elon pwning short sellers

The Reddit traders began buying shares of stocks that were heavily shorted by industry investors (“shorting” means betting that a stock’s price will go down). When the stock price began to climb because of their buying, hedge funds started losing loads of money on their short positions and, as a result, began buying back the same stocks they had shorted in order to cover their losses. That buying pushed the price higher still. The end result is a vicious melt-up of the stock price, aka the short squeeze.

FYI, you can use the PRAW library to stream Reddit posts live, so you get the pleasure of watching the WallStreetBets phenomenon in real time. 😁

PRAW: The Python Reddit API Wrapper – PRAW 7.1.5.dev0 documentation

praw.readthedocs.io
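
Here is a minimal sketch of what that could look like with PRAW's submission stream. The credentials are placeholders and `latest_titles` is a hypothetical helper name, not part of PRAW; the library call doing the work is `subreddit.stream.submissions()`.

```python
from itertools import islice


def latest_titles(subreddit, limit=5):
    """Collect the titles of the next `limit` submissions from a subreddit stream.

    `subreddit` is expected to behave like a praw Subreddit object, i.e. to
    expose `stream.submissions()`. PRAW first yields a batch of recent posts
    and then keeps yielding new ones as they appear.
    """
    stream = subreddit.stream.submissions()
    return [post.title for post in islice(stream, limit)]


def main():
    # praw is imported lazily so the helper above can be tried without it.
    import praw

    # Placeholder credentials (assumption): register your own script app on
    # Reddit and fill these in.
    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="wsb-stream-demo by u/your_username",
    )
    for title in latest_titles(reddit.subreddit("wallstreetbets"), limit=10):
        print(title)
```

Calling `main()` with real credentials should print ten post titles. Drop the `islice` limit and loop over the stream directly if you want it to run indefinitely.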

ArXiv Revisited | Graphs | Video

ArXiv released a new feature that allows one to use “Connected Papers” to generate a graph of related research papers from the open-sourced platform. FYI, I hacked it this week, so I’ll add “Connected Papers” to the Repo Cypher every week. 😁

Speaking of arXiv, there’s a new feature called “papers-with-video” created by Amit Chaudhary. It’s a web browser extension which provides a link to a video relating to the arXiv paper in view. It currently covers 3.7K ML papers. 🔥🔥

declassified

NMT

The makers of Sentence Transformers have introduced a new machine translation library (it comes with language detection too).

Deets:

  • Easy installation and usage: Use state-of-the-art machine translation with 3 lines of code
  • Automatic download of pre-trained machine translation models
  • Translation between 150+ languages
  • Automatic language detection for 170+ languages
  • Sentence and document translation
  • Multi-GPU and multi-process translation

UKPLab/EasyNMT

This package provides easy to use, state-of-the-art machine translation for more than 100+ languages. The highlights of…

github.com
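
For a feel of the "3 lines of code" claim, here is a hedged sketch based on the usage shown in the EasyNMT README. `translate_sentences` is a hypothetical wrapper added for illustration, and `"opus-mt"` is one of the model names listed by the project; check the README for the current options.

```python
def translate_sentences(model, sentences, target_lang="en"):
    """Translate each sentence with an EasyNMT-style model.

    `model` only needs a `translate(text, target_lang=...)` method; the source
    language is left for the library to detect.
    """
    return [model.translate(s, target_lang=target_lang) for s in sentences]


def main():
    # Imported lazily: creating the model downloads pre-trained weights.
    from easynmt import EasyNMT

    model = EasyNMT("opus-mt")  # model name as listed in the project README
    for line in translate_sentences(model, ["Dies ist ein Satz auf Deutsch."]):
        print(line)
```

The wrapper takes the model as an argument so the translation loop can be tried with a stand-in object before downloading anything.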

BudgetML

A great library if you want to deploy your model on Google Cloud and get a nice API endpoint running on top of FastAPI and GCP’s preemptible instances. Since preemptible machines can be taken down anytime, they have a mechanism in place to auto-start them to avoid downtime. ✌

ebhy/budgetml

Give us a GitHub star to show your love! BudgetML is perfect for practitioners who would like to quickly deploy their…

github.com

The GPT-3 List of Projects/Startups

The web that GPT-3 currently weaves. Here’s a nice table of current projects and startups riding the GPT-3 gravy train. OpenAI’s inference API has spun up an entire industry 😵.

FastStylometry

Stylometry library correlating writing styles. Uses Burrows’ Delta algo.

“The Burrows’ delta is a statistic which expresses the distance between two authors’ writing styles. A high number like 3 implies that the two authors are very dissimilar, whereas a low number like 0.2 would imply that two books are very likely to be by the same author.”
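
To make the statistic concrete, here is a rough standard-library sketch of Burrows' Delta as it is usually described: z-score the frequencies of the most common words across a corpus, then average the absolute z-score differences between two texts. This is not the faststylometry API; the function names and the whitespace tokenizer are simplifying assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(text, vocabulary):
    """Share of all tokens in `text` taken by each vocabulary word."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[word] / total for word in vocabulary]


def burrows_delta(corpus, name_a, name_b, n_words=50):
    """Burrows' Delta between two texts in `corpus` (a dict of name -> text)."""
    # 1. Vocabulary: the n most frequent words across the whole corpus.
    all_counts = Counter(" ".join(corpus.values()).lower().split())
    vocabulary = [word for word, _ in all_counts.most_common(n_words)]

    # 2. Relative frequency of each vocabulary word in each text.
    freqs = {
        name: relative_frequencies(text, vocabulary)
        for name, text in corpus.items()
    }

    # 3. Mean and standard deviation of each word's frequency across texts.
    columns = list(zip(*freqs.values()))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) for col in columns]

    # 4. Delta is the mean absolute difference of z-scores. Words whose
    #    frequency never varies (stdev of 0) carry no signal and are skipped.
    diffs = [
        abs((a - m) / s - (b - m) / s)
        for a, b, m, s in zip(freqs[name_a], freqs[name_b], means, stdevs)
        if s > 0
    ]
    return mean(diffs) if diffs else 0.0
```

Real stylometry work needs much longer texts and a proper tokenizer than this toy version assumes, which is where a dedicated library earns its keep.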

The author mentions that most stylometry libraries mainly output graphs, but he wanted his library to provide probabilities as well. In addition, faststylometry includes “unknown” books for testing purposes. Pretty cool.

Feds use this type of tech to catch perps on the dark web by correlating writing styles to get warrants. (random fact) They also use time correlations but that’s another story…

Blog:

Fast Stylometry Tutorial – Freelance Data Scientist | Thomas Wood

I'm introducing a Python library I've written, called faststylometry, which allows you to compare authors of texts by…

freelancedatascientist.net

GitHub:

fastdatascience/faststylometry

By Thomas Wood, Fast Data Science Source code at https://github.com/woodthom2/faststylometry Tutorial at…

github.com

Using CLIP for Unsplash Search

Someone threw OpenAI’s CLIP model on top of Unsplash for searching pictures via natural language. Includes a Colab. 😎

haltakov/natural-language-image-search

Search photos on Unsplash using natural language descriptions. The search is powered by OpenAI's CLIP model and the…

github.com
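
The general idea behind this kind of search can be sketched without the model itself: embed the text query and every photo into the same vector space, then rank photos by cosine similarity to the query. The vectors below are made-up 3-dimensional stand-ins for real CLIP embeddings, and the function names are hypothetical; the repo's actual pipeline may differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_images(query_embedding, image_embeddings, top_k=3):
    """Return the ids of the top_k images by cosine similarity to the query."""
    scored = sorted(
        image_embeddings.items(),
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [image_id for image_id, _ in scored[:top_k]]


# Toy vectors standing in for real CLIP image embeddings (assumption).
photos = {
    "dog_on_beach": [0.9, 0.1, 0.0],
    "city_at_night": [0.0, 0.2, 0.9],
    "puppy_in_snow": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # pretend embedding of the text "a dog outdoors"
print(rank_images(query, photos, top_k=2))
```

In a real setup the photo embeddings would be computed once and cached, so each search only needs to embed the query text.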

VS Code Chat

“Chat with your Slack and Discord teams within VS Code”

One less open tab on your browser. Winning!

vsls-contrib/chat

0.34.0: With this release, the integration with VS Live Share has now moved into the core VS Live Share extension…

github.com

GitHub Live Tracker

“ghtop provides a number of views of all current public activity from all users across the entire GitHub platform”

One more terminal window open. Winning!

(Headshot of the Week 🏆)

nat/ghtop

See what's happening on GitHub in real time (also helpful if you need to use up your API quota as quickly as possible)…

github.com

Colab of the Week

Using Transformers with Weights and Biases:

Google Colaboratory

colab.research.google.com

A100 vs V100 GPU Benchmarks

Want to know the PyTorch training speed difference between the A100 and V100 GPUs for language models 👇? FYI, Lambda now carries the big boy, the A100. More in the blog:

declassified

A100 vs V100 Deep Learning Benchmarks | Lambda

Lambda is now shipping A100 servers. In this post, we benchmark the PyTorch training speed of the Tesla A100 and V100…

lambdalabs.com

Star Trek Dialogue Scripts in JSON

If you need your GPT-3 to speak in Klingon 🤣.

Example JSON:

"line": "On Stardate 43997, Captain Jean-Luc Picard of the Federation Starship Enterprise was kidnapped for six days by an invading force known as the Borg. Surgically altered, he was forced to lead an assault on Starfleet at Wolf 359."

jkingsman/Star-Trek-Script-Programmatics

A collection of Star Trek scripts dumped to JSON. A bit of a messy repo from my work but better the data be out there…

github.com

RackSpace AI/ML Survey

Total respondents = 1,870 | IT Professionals Worldwide

TL;DR

“$1.06M: What the average company spends annually on AI and machine learning initiatives..”

Leading Use of ‘AI’ is as a “Component of data analytics…”

Regarding current plans: 46% say they “want to improve the speed and efficiency of existing processes…”

Leading challenge, cited by 27% of respondents: “Shortage of skilled AI/ML talent”

Get your copy here 👇

AI and machine learning research report | Rackspace Technology

To learn more about how IT leaders are adopting and using AI and machine learning, we surveyed 1500+ IT leaders in…

www.rackspace.com

Repo Cypher 👨‍💻

📈 📈 Added the new ConnectedPapers feature 📈 📈

PAWLS | PDF Annotations

Software that allows one to collect annotations associated with a PDF document.

Video Tutorial

allenai/pawls

PDF Annotations with Labels and Structure is software that makes it easy to collect a series of annotations associated…

github.com

Connected Papers 📈

Multi-Document Driven Dialogue (MD3)

A new dialogue task where an agent can guess the target document that the user is interested in by leading a dialogue.

laddie132/MD3

This is the code for AAAI2021 paper Converse, Focus and Guess – Towards Multi-Document Driven Dialogue. We build a…

github.com

Connected Papers 📈

SkillNER

A named entity recognition system that extracts soft skills from text.

nicolamelluso/SkillNER

A Named Entity Recognition system that extracts soft skills from text

github.com

Connected Papers 📈

WeNet

A toolkit for automatic speech recognition (ASR).

mobvoi/wenet

We share neural Net together. The main motivation of WeNet is to close the gap between research and production…

github.com

Connected Papers 📈

Tabular K-BERT | Tabular Scenario Based Question Answering

Repo for tabular scenario question answering where a model is tasked to answer multiple-choice questions based on a passage and associated tables.

nju-websoft/TSQA

Sorce code for "TSQA: Tabular Scenario Based Question Answering", implement is based on K-BERT. We thank the authors of…

github.com

Connected Papers 📈

Open Information Extraction Dataset

A large dataset for open information extraction, plus training scripts for your own model that use AI2’s library and run on PyTorch.

Jacobsolawetz/large-scale-oie

In this repository, you will find the data published in the paper Scaling Up Supervised Information Extraction, along…

github.com

Connected Papers 📈

Dataset of the Week: Urban Dictionary (UD) Dataset

Dataset contains 2.5 million phrases from Urban Dictionary, including their definitions and votes.

Content

CSV Rows: 2,580,925

Column 1: word_id — for usage in the Urban Dictionary API

Column 2: word — the text being defined

Column 3: up_votes — thumbs up count as of May 2016

Column 4: down_votes — thumbs down count as of May 2016

Column 5: author — hash of username of submitter

Column 6: definition — text with possible UTF-8 chars, double semi-colon denotes a newline
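
The column layout above can be read with Python's `csv` module. A small sketch follows, assuming a comma-delimited file with no header row (check the actual file); the sample row is made up for illustration and is not taken from the dataset.

```python
import csv
import io

# Column order as listed above.
COLUMNS = ["word_id", "word", "up_votes", "down_votes", "author", "definition"]


def parse_rows(handle):
    """Yield one dict per CSV row, with vote counts as ints and newlines restored."""
    for row in csv.DictReader(handle, fieldnames=COLUMNS):
        row["up_votes"] = int(row["up_votes"])
        row["down_votes"] = int(row["down_votes"])
        # A double semi-colon in the definition denotes a newline.
        row["definition"] = row["definition"].replace(";;", "\n")
        yield row


# Made-up sample row (assumption), wrapped in a file-like object.
sample = io.StringIO('7,yeet,120,15,abc123,"to throw something;;with force"\n')
rows = list(parse_rows(sample))
print(rows[0]["word"], rows[0]["up_votes"] - rows[0]["down_votes"])
```

For the full 2.5 million rows, pass an open file handle instead of the `StringIO` sample and consume the generator lazily instead of building a list.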

Where is it?

Here’s the dataset in a bonus repo for generating slang 🔥.

zhewei-sun/slanggen

This is the github repository for the TACL paper "A Computation Framework for Slang Generation". The dataset is a…

github.com

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat

Quantum Stat

Published via Towards AI