The NLP Cypher | 02.07.21

Last Updated on July 24, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.

The Great Day of His Wrath | Martin

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 02.07.21

The Short Squeeze

The plebeian beats the market…

Wall Street (with its historical knack for favoring the frothingly rich, summer-home-traveling bourgeoisie) was outsmarted by a network of Reddit day-trading Millennial computer nerds. The outcome: billions of dollars in hedge fund capital thrown to the wind.

When the COVID saga began, those looking to make some extra scratch leaned on commission-free online trading platforms like Robinhood to make ends meet. And with this meteoric rise of a new retail investor gambling on Wall Street’s hallowed grounds, no one, not even the institutional investors, foresaw the events of the last few weeks.

So what happened?

Elon pwning short sellers

The Reddit traders began purchasing shares of stocks that were heavily shorted by institutional investors (‘shorting’ means betting that a stock’s price will go down). When the stock price began to climb because of their purchases, hedge funds began losing loads of money on their short positions and, as a result, began buying the same stocks they had shorted in order to cover their losses. The end result is a vicious melt-up of the stock price, aka the short squeeze.

FYI, you can use the PRAW library to stream live posts and comments from Reddit, so you get the pleasure of watching the WallStreetBets phenomenon in real time. 😁

PRAW: The Python Reddit API Wrapper – PRAW 7.1.5.dev0 documentation

praw.readthedocs.io
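If you want to try the live-stream idea, here is a minimal sketch. It assumes you have registered a Reddit app and have your own `client_id`, `client_secret` and `user_agent`. The `WATCHLIST` tickers and both helper names are made up for illustration, and the counting logic is just one simple way to watch the chatter, not anything PRAW prescribes.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, Set

# Hypothetical watch list, chosen only for illustration.
WATCHLIST = {"GME", "AMC"}


def count_ticker_mentions(bodies: Iterable[str], tickers: Set[str] = WATCHLIST) -> Counter:
    """Count how many comments mention each ticker (at most once per comment)."""
    counts: Counter = Counter()
    for body in bodies:
        # Upper-case words of 2-5 letters, so "$GME" and "GME!" both yield "GME".
        words = set(re.findall(r"\b[A-Z]{2,5}\b", body))
        counts.update(words & tickers)
    return counts


def stream_comment_bodies(subreddit_name: str, client_id: str,
                          client_secret: str, user_agent: str) -> Iterator[str]:
    """Yield the text of new comments as they arrive (needs network + credentials)."""
    import praw  # third-party: pip install praw

    reddit = praw.Reddit(client_id=client_id,
                         client_secret=client_secret,
                         user_agent=user_agent)
    # skip_existing=True ignores comments posted before the stream started.
    for comment in reddit.subreddit(subreddit_name).stream.comments(skip_existing=True):
        yield comment.body


# Offline demo on made-up comments; swap in stream_comment_bodies(...) for live data.
print(count_ticker_mentions(["$GME to the moon, GME!", "AMC and GME", "nothing here"]))
```

The stream never ends on its own, so for live use you would typically wrap it in `itertools.islice` or break out of the loop yourself.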

ArXiv Revisited | Graphs | Video

ArXiv released a new feature that allows one to use “Connected Papers” to generate a graph of related research papers from the open-sourced platform. FYI, I hacked it this week, so I’ll add “Connected Papers” to the Repo Cypher every week. 😁

Speaking of arXiv, there’s a new feature called “papers-with-video” created by Amit Chaudhary. It’s a web browser extension which provides a link to a video relating to the arXiv paper in view. It currently covers 3.7K ML papers. 🔥🔥

declassified

NMT

The makers of Sentence Transformers have introduced a new machine translation library (it comes with language detection too).

Deets:

Easy installation and usage: Use state-of-the-art machine translation with 3 lines of code
Automatic download of pre-trained machine translation models
Translation between 150+ languages
Automatic language detection for 170+ languages
Sentence and document translation
Multi-GPU and multi-process translation

UKPLab/EasyNMT

This package provides easy to use, state-of-the-art machine translation for more than 100+ languages. The highlights of…

github.com
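To give a feel for the “3 lines of code” claim, here is a small sketch based on the usage pattern in the EasyNMT README: build an `EasyNMT` model, then call `translate` with a `target_lang`. The wrapper function `translate_texts` is my own hypothetical helper, and the `opus-mt` model name is just one of the options the library lists; treat the whole thing as an illustration rather than the canonical way to use it.

```python
from typing import List


def translate_texts(texts: List[str], target_lang: str, model=None) -> List[str]:
    """Translate a list of texts; the source language is left for the model to detect."""
    if model is None:
        # Imported lazily so the helper can be tried with a stand-in model.
        from easynmt import EasyNMT  # third-party: pip install easynmt

        # Pre-trained weights are downloaded automatically on first use.
        model = EasyNMT("opus-mt")
    return model.translate(texts, target_lang=target_lang)


# Example call (downloads a model, so it is left commented out):
# print(translate_texts(["Dies ist ein Satz in Deutsch."], target_lang="en"))
```

Passing the model in as an argument means you load the weights once and reuse them across calls, which matters when the model is a few hundred megabytes.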

BudgetML

A great library if you want to deploy your model on Google Cloud and get a nice API endpoint running on top of FastAPI and GCP’s preemptible instances. Since preemptible machines can be taken down at any time, BudgetML has a mechanism in place to auto-start them again and avoid downtime. ✌

ebhy/budgetml

Give us a GitHub star to show your love! BudgetML is perfect for practitioners who would like to quickly deploy their…

github.com

The GPT-3 List of Projects/Startups

The web that GPT-3 currently weaves. Here’s a nice table of current projects and startups riding the GPT-3 gravy train. OpenAI’s inference API has spun up an entire industry 😵.

FastStylometry

Stylometry library correlating writing styles. Uses Burrows’ Delta algo.

“The Burrows’ delta is a statistic which expresses the distance between two authors’ writing styles. A high number like 3 implies that the two authors are very dissimilar, whereas a low number like 0.2 would imply that two books are very likely to be by the same author.”
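For the curious, the statistic described in that quote can be sketched in a few lines of plain Python. This follows the textbook recipe (most frequent words, z-scored relative frequencies, mean absolute difference) and is not taken from faststylometry itself, which may tokenize and normalize differently. The function names and the toy texts are made up for illustration.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List


def relative_freqs(tokens: List[str], vocab: List[str]) -> List[float]:
    """Share of the text taken up by each vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocab]


def burrows_delta(known: Dict[str, List[str]], unknown: List[str],
                  top_n: int = 50) -> Dict[str, float]:
    """Delta between `unknown` and every known text (lower = more similar)."""
    # 1. Vocabulary: the most frequent words across the known texts.
    totals = Counter(tok for toks in known.values() for tok in toks)
    vocab = [word for word, _ in totals.most_common(top_n)]

    # 2. Relative frequency of each vocabulary word in each known text.
    freqs = {name: relative_freqs(toks, vocab) for name, toks in known.items()}

    # 3. Mean and (population) standard deviation of each word across known texts.
    columns = list(zip(*freqs.values()))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]

    def z_scores(vec: List[float]) -> List[float]:
        # Words with zero spread carry no signal, so they are skipped.
        return [(f - m) / s for f, m, s in zip(vec, means, stds) if s > 0]

    z_unknown = z_scores(relative_freqs(unknown, vocab))

    # 4. Delta = mean absolute difference between the two z-score profiles.
    return {
        name: mean([abs(a - b) for a, b in zip(z_scores(vec), z_unknown)])
        for name, vec in freqs.items()
    }


# Toy texts, invented for the demo; real use needs whole books per author.
known = {
    "author_a": "the cat and the dog and the bird".split(),
    "author_b": "a cat or a dog or a bird".split(),
}
print(burrows_delta(known, "the fox and the hen".split()))
```

With only two tiny known texts the numbers are not meaningful on their own; the point is that the unknown text, which leans on “the” and “and”, comes out closer to `author_a`.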

The author mentions that most stylometry libraries mainly produce graphs, but he wanted his library to report probabilities as well. In addition, faststylometry includes “unknown” books for testing purposes. Pretty cool.

Feds use this type of tech to catch perps on the dark web by correlating writing styles to get warrants. (random fact) They also use time correlations but that’s another story…

Blog:

Fast Stylometry Tutorial – Freelance Data Scientist | Thomas Wood

I'm introducing a Python library I've written, called faststylometry, which allows you to compare authors of texts by…

freelancedatascientist.net

GitHub:

fastdatascience/faststylometry

By Thomas Wood, Fast Data Science Source code at https://github.com/woodthom2/faststylometry Tutorial at…

github.com

Using CLIP for Unsplash Search

Someone threw OpenAI’s CLIP model on top of Unsplash for searching pictures via natural language. Includes a Colab. 😎

haltakov/natural-language-image-search

Search photos on Unsplash using natural language descriptions. The search is powered by OpenAI's CLIP model and the…

github.com
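The retrieval step behind this kind of search is easy to picture: embed the text query and every photo into the same vector space, then rank photos by cosine similarity to the query. Below is a rough sketch of that ranking step only. The tiny 2-D vectors are invented stand-ins for real CLIP embeddings, and `rank_photos` is a hypothetical helper, not a function from the repo.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_photos(query_vec: Sequence[float],
                photo_vecs: Dict[str, Sequence[float]],
                top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k (photo_id, score) pairs, best match first."""
    scored = [(pid, cosine_similarity(query_vec, vec)) for pid, vec in photo_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Made-up embeddings: in a real setup these would come from CLIP's
# image encoder (photos) and text encoder (query).
photos = {"dog": [1.0, 0.0], "beach": [0.0, 1.0], "dog_on_beach": [1.0, 1.0]}
query = [1.0, 0.1]  # pretend this encodes "a photo of a dog"
print(rank_photos(query, photos, top_k=2))
```

In practice the photo embeddings are computed once and stored, so each search only needs to encode the query and run the ranking.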

VS Code Chat

“Chat with your Slack and Discord teams within VS Code”

One less open tab on your browser. Winning!

vsls-contrib/chat

0.34.0: With this release, the integration with VS Live Share has now moved into the core VS Live Share extension…

github.com

GitHub Live Tracker

“ghtop provides a number of views of all current public activity from all users across the entire GitHub platform”

One more terminal window open. Winning!

(Headshot of the Week 🏆)

nat/ghtop

See what's happening on GitHub in real time (also helpful if you need to use up your API quota as quickly as possible)…

github.com

Colab of the Week

Using Transformers with Weights and Biases:

Google Colaboratory

colab.research.google.com

A100 vs V100 GPU Benchmarks

Want to know the PyTorch training speed difference between the A100 and V100 GPUs for language models 👇? FYI, Lambda now carries the big boy, the A100. More in the blog:

A100 vs V100 Deep Learning Benchmarks | Lambda

Lambda is now shipping A100 servers. In this post, we benchmark the PyTorch training speed of the Tesla A100 and V100…

lambdalabs.com

Star Trek Dialogue Scripts in JSON

If you need your GPT-3 to speak in Klingon 🤣.

Example JSON:

“line”: “On Stardate 43997, Captain Jean-Luc Picard of the Federation Starship Enterprise was kidnapped for six days by an invading force known as the Borg. Surgically altered, he was forced to lead an assault on Starfleet at Wolf 359.”

jkingsman/Star-Trek-Script-Programmatics

A collection of Star Trek scripts dumped to JSON. A bit of a messy repo from my work but better the data be out there…

github.com

RackSpace AI/ML Survey

Total respondents = 1,870 | IT professionals worldwide

TL;DR

“$1.06M: What the average company spends annually on AI and machine learning initiatives..”

Leading Use of ‘AI’ is as a “Component of data analytics…”

Regarding current plans: 46% say they “want to improve the speed and efficiency of existing processes…”

Leading challenge, cited by 27% of respondents: “Shortage of skilled AI/ML talent”

Get your copy here 👇

AI and machine learning research report | Rackspace Technology

To learn more about how IT leaders are adopting and using AI and machine learning, we surveyed 1500+ IT leaders in…

www.rackspace.com

Repo Cypher 👨‍💻

📈 📈 Added the new Connected Papers feature 📈 📈

PAWLS | PDF Annotations

Software that allows one to collect annotations associated with a PDF document.

Video Tutorial

allenai/pawls

PDF Annotations with Labels and Structure is software that makes it easy to collect a series of annotations associated…

github.com

Connected Papers 📈

Multi-Document Driven Dialogue (MD3)

A new dialogue task where an agent can guess the target document that the user is interested in by leading a dialogue.

laddie132/MD3

This is the code for AAAI2021 paper Converse, Focus and Guess – Towards Multi-Document Driven Dialogue. We build a…

github.com

Connected Papers 📈

SkillNER

A named entity recognition system that extracts soft skills from text.

nicolamelluso/SkillNER

A Named Entity Recognition system that extracts soft skills from text Permalink Failed to load latest commit…

github.com

Connected Papers 📈

WeNet

A speech recognition toolkit for Automatic Speech Recognition (ASR).

mobvoi/wenet

We share neural Net together. The main motivation of WeNet is to close the gap between research and production…

github.com

Connected Papers 📈

Tabular K-BERT | Tabular Scenario Based Question Answering

Repo for tabular scenario question answering where a model is tasked to answer multiple-choice questions based on a passage and associated tables.

nju-websoft/TSQA

Sorce code for "TSQA: Tabular Scenario Based Question Answering", implement is based on K-BERT. We thank the authors of…

github.com

Connected Papers 📈

Open Information Extraction Dataset

A large dataset for open information extraction, plus training scripts for building your own model with AI2’s library, which runs on PyTorch.

Jacobsolawetz/large-scale-oie

In this repository, you will find the data published in the paper Scaling Up Supervised Information Extraction, along…

github.com

Connected Papers 📈

Dataset of the Week: Urban Dictionary (UD) Dataset

Dataset contains 2.5 million phrases from Urban Dictionary, including their definitions and votes.

Content

CSV Rows: 2,580,925

Column 1: word_id (for use with the Urban Dictionary API)

Column 2: word (the text being defined)

Column 3: up_votes (thumbs-up count as of May 2016)

Column 4: down_votes (thumbs-down count as of May 2016)

Column 5: author (hash of the submitter’s username)

Column 6: definition (text with possible UTF-8 characters; a double semicolon denotes a newline)
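Given that column layout, loading the file could look roughly like the sketch below. It assumes a comma-delimited CSV with no header row and standard quoting, which the description above does not confirm, so check the actual file first. The sample row and the `read_entries` helper are made up for illustration.

```python
import csv
import io
from typing import Dict, Iterator, TextIO

# Column order as listed above.
COLUMNS = ["word_id", "word", "up_votes", "down_votes", "author", "definition"]


def read_entries(handle: TextIO) -> Iterator[Dict[str, object]]:
    """Yield one dict per row, with vote counts as ints and newlines restored."""
    # Assumption: no header row, so the field names are supplied explicitly.
    for row in csv.DictReader(handle, fieldnames=COLUMNS):
        entry: Dict[str, object] = dict(row)
        entry["up_votes"] = int(row["up_votes"])
        entry["down_votes"] = int(row["down_votes"])
        # A double semicolon stands in for a newline inside definitions.
        entry["definition"] = row["definition"].replace(";;", "\n")
        yield entry


# Invented sample row, not taken from the dataset.
sample = '1,yeet,10,2,abc123,"To throw something.;;Also an exclamation."\n'
for entry in read_entries(io.StringIO(sample)):
    print(entry["word"], entry["up_votes"] - entry["down_votes"])
```

For the real file you would pass `open(path, newline="", encoding="utf-8")` instead of the `StringIO` stand-in.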

Where is it?

Here’s the dataset in a bonus repo for generating slang 🔥.

zhewei-sun/slanggen

This is the github repository for the TACL paper "A Computation Framework for Slang Generation". The dataset is a…

github.com

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat
