

The NLP Cypher | 01.24.21

Last Updated on July 24, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.

The Harvest | Martin




Hey, welcome back! Another week goes by and the NLP domain continues to fly beyond escape velocity… But don’t worry, there’s an awesome intuition pump on how Transformers work:


If you continue to enjoy this read, please share with your friends and don’t forget to give it a 👏👏 …. 😎

Epic Twitter Dataset

Cornell Tech came out with a huge Twitter dataset based on 7.6M tweets and 25.6M retweets from 2.6M users that discussed voter fraud between October 23rd and December 16th. The analysis goes deep on who promoted or denied “voter fraud”, visualizations of the networks, and who Twitter banned (individual tweet content was not directly shared, for privacy). The results are fascinating and the dataset is available.

Networks of “promoters” and “detractors” of voter fraud. Orange color highlights suspended Twitter accounts.



VoterFraud2020 is a multi-modal Twitter dataset with 7.6M tweets and 25.6M retweets from 2.6M users related to voter…

Daily Double | Jeopardy Data

Hey, want to teach your encoder-decoder models how to generate questions from answers? Take a look at the Jeopardy archive created by the fans. It has clues and answers plus other metadata. A great data resource, if only it could be harvested somewhere ….

J! Archive

The fan-created archive of Jeopardy! games and players-409,579 clues and counting! [All] [1] [2] [3] [4] [5] [6] [7]…

Here it is! ✌✌


Jeopardy clues from Clues are collected with Scrapy, saved to sqlite, and updated daily via GitHub…
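As a rough illustration of how scraped clues stored in sqlite could feed an encoder-decoder model, here is a minimal sketch. The table name `clues`, its columns, and the `build_pairs` helper are all hypothetical (the actual schema of the repo above may differ), and the inserted rows are placeholders, not real Jeopardy data.

```python
import sqlite3


def build_pairs(conn):
    """Turn stored clues into (source, target) text pairs for a seq2seq model.

    Assumes a hypothetical table `clues(id, category, clue, answer)`.
    The model input is the answer plus its category; the target is the clue.
    """
    rows = conn.execute(
        "SELECT category, clue, answer FROM clues "
        "WHERE clue != '' AND answer != '' ORDER BY id"
    )
    pairs = []
    for category, clue, answer in rows:
        source = f"category: {category} | answer: {answer}"
        pairs.append((source, clue))
    return pairs


# Demo on an in-memory database filled with placeholder rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clues ("
    "id INTEGER PRIMARY KEY, category TEXT, clue TEXT, answer TEXT)"
)
conn.executemany(
    "INSERT INTO clues (category, clue, answer) VALUES (?, ?, ?)",
    [
        ("PLACEHOLDER CATEGORY", "placeholder clue text", "placeholder answer"),
        ("PLACEHOLDER CATEGORY", "", "row with a missing clue"),  # filtered out
    ],
)
conn.commit()

pairs = build_pairs(conn)
for source, target in pairs:
    print(source, "->", target)
```

Rows with an empty clue or answer are skipped so that every pair has both an input and a target. To try this on a real database file, pass its path to `sqlite3.connect` and adjust the query to the real schema.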

2020 NLP/ML Recap

Sebastian Ruder’s 2020 recap is a blog post you can’t miss. He discusses the top 10 trends (including links to papers) in NLP/machine learning that caught his eye over the past year:

  1. Scaling up — and down
  2. Retrieval augmentation
  3. Few-shot learning
  4. Contrastive learning
  5. Evaluation beyond accuracy
  6. Practical concerns of large LMs
  7. Multilinguality
  8. Image Transformers
  9. ML for science
  10. Reinforcement learning

Full Blog Post

ML and NLP Research Highlights of 2020

The selection of areas and methods is heavily influenced by my own interests; the selected topics are biased towards…

GNN Applications

A refreshing recap discussing where graph neural network applications are headed in 2021. It covers recommender systems, combinatorial optimization, computer vision and physics/life sciences applications.

Top Applications of Graph Neural Networks 2021

GNNs have come a long way in academia. But do we have good applications of them in industry?

From ZeRO to Hero | A Memory Optimizer

Remember Zero Redundancy Optimizer (ZeRO)? Microsoft’s optimizer for very large parameter models returns with an engaging Hugging Face blog post. FYI, Hugging Face’s Trainer class supports DeepSpeed's and FairScale's ZeRO features as of version 4.2. With the DeepSpeed library, they were able to get a single 24GB RTX-3090 card to train a 3 billion param T5 with a batch size of 20. 👀👀


Fit More and Train Faster With ZeRO via DeepSpeed and FairScale

A guest blog post by Hugging Face fellow Stas Bekman As recent Machine Learning models have been growing much faster…
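To get a feel for why fitting a 3 billion parameter model on a 24GB card is notable, here is a back-of-the-envelope sketch. The byte counts are an assumption taken from the usual accounting for mixed-precision Adam training (2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter); activations and framework overhead are ignored, and `training_state_gb` is a hypothetical helper, not part of DeepSpeed or FairScale.

```python
def training_state_gb(num_params, weight_bytes=2, grad_bytes=2, optim_bytes=12):
    """Estimate model-state memory in GB (1 GB = 1e9 bytes).

    Default byte counts assume mixed-precision Adam:
    fp16 weights, fp16 gradients, and fp32 weights copy + momentum + variance.
    Activations are deliberately left out of this estimate.
    """
    gb = 1e9
    return {
        "weights": num_params * weight_bytes / gb,
        "gradients": num_params * grad_bytes / gb,
        "optimizer": num_params * optim_bytes / gb,
    }


state = training_state_gb(3_000_000_000)  # a 3 billion parameter model
total_gb = sum(state.values())

for name, size in state.items():
    print(f"{name:>10}: {size:.1f} GB")
print(f"{'total':>10}: {total_gb:.1f} GB")
```

Under these assumptions the model states alone come to about 48 GB, roughly twice the card's memory, with the optimizer state as the largest share. That gap is what memory optimizers in the ZeRO family aim to close by not keeping every piece of that state on a single GPU.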

Computer Science Videos

If you like educational computer science videos 👇


Introduction Please check NOTES for general information about this list. Please refer for contribution…

FOIA YouTube

The Black Vault enjoys its FOIA (Freedom of Information Act) requests so much that it decided to request all of the YouTube videos that are listed as private or unlisted among several federal agencies!! 😁

Private/Unlisted YouTube Videos of U.S. Government Agencies – The Black Vault

Background Many U.S. government agencies and military branches have public YouTube pages. That is no secret. However…

2021 Enterprise and Machine Learning Survey

“The time required to deploy a model is 31% lower for organizations that buy a third-party solution.”

“Organizations with more models spend more of their data scientists’ time on deployment, not less”

“The time required to deploy a model is increasing year-on-year”

Download a free copy here:

The 2021 enterprise trends in machine learning

Building on last year's report, Algorithmia presents the 2021 enterprise trends in machine learning report. See what's…

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁


A trainable pipeline for fundamental NLP tasks with more than 100 languages, and 90 downloadable pretrained pipelines for 56 languages.

(Authors say Trankit outperforms Stanford’s Stanza on select tasks like sentence segmentation and dependency parsing (English)) 🥶🥶


Trankit is a light-weight Transformer-based Python Toolkit for multilingual Natural Language Processing (NLP). It…


Spectrum is a model that uses deep learning to generate rap song lyrics. Includes demo and Colab!


Spectrum is an AI that uses deep learning to generate rap song lyrics. View Demo Report Bug Request Feature Open In…

Neural Punctuator (w/ BERT)

Automatic punctuation restoration with BERT models for English and Hungarian.


Complimentary code for our paper Automatic punctuation restoration with BERT models submitted to the XVII. Conference…

Text-to-Text Transformers for Semantic Parsing

Finetune a T5 model on the task of semantic parsing for generating Python code out of natural language descriptions.


This repository is used to finetune a T5 model on the task of semantic parsing, a.k.a. generating (Python) code out of…

BERT Text Classification Jupyter Notebooks

Notebooks for fine-tuning BERT, SciBERT and BioBERT, visualizing self-attention in the last layer of the BERT models, and listing the words that receive above-average attention in that layer.


The annotation and classification of scientific literature is a crucial task to make scientific knowledge easily…


Few-shot dialog state tracking using meta-learning. The full codebase is to be released eventually. This space is one to watch if building conversational models that can transfer to new domains interests you.


Source code for our "D-REPTILE" paper at EACL 2021: Saket Dingliwal, Bill Gao, Sanchit Agarwal, Tagyoung Chung, and…

Dataset of the Week: OpenViDial

What is it?

Dialogue turns and visual contexts were extracted from movies and TV series, where each dialogue turn is paired with the corresponding visual context in which it takes place. It contains a total of 1.1 million dialogue turns, and thus 1.1 million visual contexts stored as images.

Where is it?


This repo contains downloading instructions for the OpenViDial dataset in 《OpenViDial: A Large-Scale, Open-Domain…

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat

Quantum Stat


Published via Towards AI
