The NLP Cypher | 01.24.21

Last Updated on July 24, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.

The Harvest | Martin

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER


Geronimo

Hey, welcome back! Another week goes by and the NLP domain continues to fly beyond escape velocity… But don’t worry, there’s an awesome intuition pump on how Transformers work:

If you continue to enjoy this read, please share with your friends and don’t forget to give it a 👏👏 … 😎

Epic Twitter Dataset

Cornell Tech came out with a huge Twitter dataset based on 7.6M tweets and 25.6M retweets from 2.6M users who discussed voter fraud between October 23rd and December 16th, 2020. The analysis goes deep on who promoted or denied “voter fraud”, visualizes the networks, and covers who Twitter banned (individual tweet content was not directly shared, for privacy). The results are fascinating and the dataset is available.

Networks of “promoters” and “detractors” of voter fraud. Orange color highlights suspended Twitter accounts.

GitHub:

sTechLab/VoterFraud2020

VoterFraud2020 is a multi-modal Twitter dataset with 7.6M tweets and 25.6M retweets from 2.6M users related to voter…

github.com

Daily Double | Jeopardy Data

Hey, want to teach your encoder-decoder models how to generate questions from answers? Take a look at the fan-created Jeopardy archive. It has clues and answers plus other metadata. A great data resource, if only it could be harvested somewhere…

J! Archive

The fan-created archive of Jeopardy! games and players: 409,579 clues and counting! …

j-archive.com

Here it is! ✌✌

jvani/jarchive-clues

Jeopardy clues from j-archive.com. Clues are collected with Scrapy, saved to sqlite, and updated daily via GitHub…

github.com
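As a rough illustration of how clue/answer rows could feed a question-generation model, here is a minimal sketch. The table name, column names, and sample rows are hypothetical stand-ins (the real jarchive-clues database may be laid out differently), and the "generate clue:" prefix is just one possible text-to-text convention.

```python
import sqlite3

# Hypothetical schema and sample rows: the real jarchive-clues database
# may use different table and column names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clues (category TEXT, clue TEXT, answer TEXT)")
conn.executemany(
    "INSERT INTO clues VALUES (?, ?, ?)",
    [
        ("SCIENCE", "This gas makes up about 78% of Earth's atmosphere", "nitrogen"),
        ("LITERATURE", "He wrote 'Moby-Dick'", "Herman Melville"),
    ],
)


def to_seq2seq_pairs(connection):
    """Turn each row into a (source, target) pair for an encoder-decoder model."""
    rows = connection.execute(
        "SELECT category, clue, answer FROM clues ORDER BY rowid"
    )
    pairs = []
    for category, clue, answer in rows:
        # The model reads the answer (plus category) and learns to write the clue.
        source = f"generate clue: {answer} | category: {category}"
        pairs.append((source, clue))
    return pairs


pairs = to_seq2seq_pairs(conn)
for source, target in pairs:
    print(source, "->", target)
```

Each pair can then be tokenized and passed to whatever seq2seq training loop you already use.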

2020 NLP/ML Recap

Sebastian Ruder’s 2020 recap is a blog post you can’t miss. He discusses the top 10 trends in NLP/machine learning that caught his eye over the past year, with links to papers:

Full Blog Post

ML and NLP Research Highlights of 2020

The selection of areas and methods is heavily influenced by my own interests; the selected topics are biased towards…

ruder.io

GNN Applications

A refreshing recap of where graph neural network applications are headed in 2021. It covers recommender systems, combinatorial optimization, computer vision, and physics/life sciences applications.

Top Applications of Graph Neural Networks 2021

GNNs have come a long way in academia. But do we have good applications of them in industry?

medium.com

From ZeRO to Hero | A Memory Optimizer

Remember the Zero Redundancy Optimizer (ZeRO)? Microsoft’s optimizer for very large models returns in an engaging Hugging Face blog post. FYI, Hugging Face’s Trainer class supports DeepSpeed's and FairScale's ZeRO features as of version 4.2. With the DeepSpeed library, they were able to train a 3-billion-parameter T5 with a batch size of 20 on a single 24GB RTX-3090 card. 👀👀

Blog:

Fit More and Train Faster With ZeRO via DeepSpeed and FairScale

A guest blog post by Hugging Face fellow Stas Bekman As recent Machine Learning models have been growing much faster…

huggingface.co
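To get a feel for why a 3-billion-parameter model is a tight fit on a 24GB card, here is a back-of-the-envelope sketch of model-state memory. It assumes the accounting used in the ZeRO paper for mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for optimizer state), treats the T5 as exactly 3e9 parameters, and ignores activations. The function name and defaults are illustrative, not part of DeepSpeed or FairScale.

```python
def model_state_gb(params, num_gpus=1, stage=0, k=12):
    """Approximate per-GPU memory (GB) for model states with mixed-precision Adam.

    Assumed accounting (from the ZeRO paper): 2 bytes/param for fp16 weights,
    2 bytes/param for fp16 gradients, k=12 bytes/param for optimizer state
    (fp32 weights, momentum, variance). Activations are not counted.
    """
    weights, grads, optim = 2 * params, 2 * params, k * params
    if stage >= 1:  # stage 1 partitions optimizer state across GPUs
        optim /= num_gpus
    if stage >= 2:  # stage 2 also partitions gradients
        grads /= num_gpus
    if stage >= 3:  # stage 3 also partitions the weights
        weights /= num_gpus
    return (weights + grads + optim) / 1e9


three_b = 3e9  # rough parameter count, an assumption for illustration
baseline = model_state_gb(three_b)                       # no ZeRO, one GPU
sharded = model_state_gb(three_b, num_gpus=8, stage=2)   # ZeRO stage 2, 8 GPUs
print(f"baseline: {baseline:.2f} GB, stage 2 on 8 GPUs: {sharded:.2f} GB")
```

Under these assumptions the baseline is about twice the memory of a 24GB card before any activations are stored. Partitioning only helps when there is more than one GPU, so the single-card result in the blog presumably leans on offloading state to CPU memory, which this sketch does not model.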

Computer Science Videos

If you like computer science educational videos 👇

Developer-Y/cs-video-courses

Introduction Please check NOTES for general information about this list. Please refer CONTRIBUTING.md for contribution…

github.com

FOIA YouTube

The Black Vault enjoys its FOIA (Freedom of Information Act) requests so much that it decided to request all of the YouTube videos listed as private or unlisted by several federal agencies!! 😁

Private/Unlisted YouTube Videos of U.S. Government Agencies – The Black Vault

Background Many U.S. government agencies and military branches have public YouTube pages. That is no secret. However…

www.theblackvault.com

2021 Enterprise and Machine Learning Survey

“The time required to deploy a model is 31% lower for organizations that buy a third-party solution.”

“Organizations with more models spend more of their data scientists’ time on deployment, not less”

“The time required to deploy a model is increasing year-on-year”

Download a free copy here:

The 2021 enterprise trends in machine learning

Building on last year's report, Algorithmia presents the 2021 enterprise trends in machine learning report. See what's…

info.algorithmia.com

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

Trankit

A trainable pipeline for fundamental NLP tasks covering more than 100 languages, with 90 downloadable pretrained pipelines for 56 languages.

(The authors say Trankit outperforms Stanford’s Stanza on select English tasks such as sentence segmentation and dependency parsing.) 🥶🥶

nlp-uoregon/trankit

Trankit is a light-weight Transformer-based Python Toolkit for multilingual Natural Language Processing (NLP). It…

github.com

Spectrum

Spectrum is a model that uses deep learning to generate rap song lyrics. Includes demo and Colab!

YigitGunduc/Spectrum

Spectrum is an AI that uses deep learning to generate rap song lyrics. View Demo Report Bug Request Feature Open In…

github.com

Neural Punctuator (w/ BERT)

Automatic punctuation restoration with BERT models for English and Hungarian.

attilanagy234/neural-punctuator

Complimentary code for our paper Automatic punctuation restoration with BERT models submitted to the XVII. Conference…

github.com
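Punctuation restoration is commonly framed as token classification: strip the punctuation, then predict for each word which mark (if any) followed it. The sketch below builds such training labels from punctuated text. The label names and the three-mark inventory are assumptions for illustration and are not taken from the neural-punctuator repo.

```python
import re

# Assumed label set; the paper and repo may use a different inventory.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}


def to_labeled_tokens(text):
    """Strip punctuation and record, per word, the mark that followed it."""
    pairs = []
    # Match words (with an optional apostrophe part) or one of the tracked marks.
    for token in re.findall(r"\w+(?:'\w+)?|[,.?]", text):
        if token in PUNCT_LABELS:
            if pairs:  # attach the mark to the preceding word
                pairs[-1] = (pairs[-1][0], PUNCT_LABELS[token])
        else:
            pairs.append((token.lower(), "O"))  # "O" means no punctuation follows
    return pairs


labeled = to_labeled_tokens("Hello, how are you? Fine.")
print(labeled)
```

A BERT-style model would then be fine-tuned to predict the second element of each pair from the lowercased, unpunctuated word sequence.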

Text-to-Text Transformers for Semantic Parsing

Finetune a T5 model on the task of semantic parsing for generating Python code out of natural language descriptions.

ypapanik/t5-for-code-generation

This repository is used to finetune a T5 model on the task of semantic parsing, a.k.a. generating (Python) code out of…

github.com

BERT Text Classification Jupyter Notebooks

Notebooks for fine-tuning BERT, SciBERT, and BioBERT, visualizing self-attention in the last layer of the BERT models, and listing the words that receive above-average attention in that layer.

expertailab/Is-BERT-self-attention-a-feature-selection-method

The annotation and classification of scientific literature is a crucial task to make scientific knowledge easily…

github.com
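One simple way to read "most attended words above average" is to sum the attention each token receives and keep the tokens that exceed the mean. The sketch below does that on a toy, hand-written single-head attention matrix. The scoring rule and the numbers are assumptions for illustration; the notebooks may aggregate heads or select rows differently.

```python
def most_attended(tokens, attention):
    """Return tokens whose received attention exceeds the mean, highest first.

    attention[i][j] is how much token i attends to token j (one head).
    """
    seq_len = len(tokens)
    # Total attention received by each token: sum down each column.
    received = [
        sum(attention[i][j] for i in range(seq_len)) for j in range(seq_len)
    ]
    threshold = sum(received) / seq_len
    ranked = sorted(zip(tokens, received), key=lambda pair: pair[1], reverse=True)
    return [tok for tok, score in ranked if score > threshold]


tokens = ["[CLS]", "protein", "binds", "the", "receptor"]
# Toy attention matrix; each row sums to 1, like a softmax output.
attention = [
    [0.10, 0.40, 0.10, 0.05, 0.35],
    [0.10, 0.30, 0.20, 0.05, 0.35],
    [0.05, 0.45, 0.10, 0.05, 0.35],
    [0.10, 0.35, 0.15, 0.05, 0.35],
    [0.10, 0.40, 0.10, 0.05, 0.35],
]
top_words = most_attended(tokens, attention)
print(top_words)
```

With a real model, the matrix would come from the attention weights of the last layer rather than being typed in by hand.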

D-REPTILE

Few-shot dialog state tracking using meta-learning. The full codebase is to be released eventually. This space is one to watch if you are interested in building conversational models that can transfer to new domains.

saketdingliwal/Few-Shot-DST

Source code for our "D-REPTILE" paper at EACL 2021: Saket Dingliwal, Bill Gao, Sanchit Agarwal, Tagyoung Chung, and…

github.com
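For readers new to meta-learning, here is a minimal sketch of the plain Reptile update on a toy problem, assuming (as the name suggests) that D-REPTILE builds on it. This is not the D-REPTILE algorithm itself: the one-parameter "model", the task targets, and the hyperparameters are all made up for illustration.

```python
def inner_sgd(w, target, steps=5, lr=0.1):
    """Adapt w to one task by minimizing (w - target)**2 with plain SGD."""
    for _ in range(steps):
        grad = 2 * (w - target)
        w -= lr * grad
    return w


def reptile(task_targets, meta_steps=300, meta_lr=0.1):
    """Reptile outer loop: nudge the shared init toward each task-adapted weight."""
    w = 0.0
    for step in range(meta_steps):
        target = task_targets[step % len(task_targets)]  # cycle through tasks
        adapted = inner_sgd(w, target)
        w += meta_lr * (adapted - w)  # move the init part of the way to the adapted weight
    return w


# Three toy "domains", each with a different optimal weight.
init = reptile([1.0, 2.0, 3.0])
print(init)
```

The learned initialization ends up near the middle of the task optima, so a few gradient steps suffice to adapt to any one of them, which is the intuition behind few-shot transfer to a new domain.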

Dataset of the Week: OpenViDial

What is it?

Dialogue turns and visual contexts were extracted from movies and TV series, where each dialogue turn is paired with the visual context in which it takes place. It contains a total of 1.1 million dialogue turns, and thus 1.1 million visual contexts stored as images.

Where is it?

ShannonAI/OpenViDial

This repo contains downloading instructions for the OpenViDial dataset in 《OpenViDial: A Large-Scale, Open-Domain…

github.com

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat


Published via Towards AI


Note: Article content contains the views of the contributing authors and not Towards AI.
