The NLP Cypher | 01.24.21
Last Updated on July 24, 2023 by Editorial Team
Author(s): Ricky Costa
Originally published on Towards AI.
NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
Geronimo
Hey, welcome back! Another week goes by and the NLP domain continues to fly beyond escape velocity… But don’t worry, there’s an awesome intuition pump on how Transformers work:
If you enjoy this read, please share it with your friends, and don’t forget to give it a 👏👏 … 😎
Epic Twitter Dataset
Cornell Tech released a huge Twitter dataset based on 7.6M tweets and 25.6M retweets from 2.6M users who discussed voter fraud between October 23rd and December 16th, 2020. The analysis goes deep on who promoted or denied “voter fraud”, visualizes the networks, and covers who Twitter banned (individual tweet content was not directly shared, for privacy reasons). The results are fascinating, and the dataset is available.
GitHub:
sTechLab/VoterFraud2020
VoterFraud2020 is a multi-modal Twitter dataset with 7.6M tweets and 25.6M retweets from 2.6M users related to voter…
github.com
Daily Double | Jeopardy Data
Hey, want to teach your encoder-decoder models how to generate questions from answers? Take a look at the fan-created Jeopardy archive. It has clues and answers plus other metadata. It would be a great data resource, if only it could be harvested from somewhere…
J! Archive
The fan-created archive of Jeopardy! games and players-409,579 clues and counting! [All] [1] [2] [3] [4] [5] [6] [7]…
j-archive.com
Here it is! ✌✌
jvani/jarchive-clues
Jeopardy clues from j-archive.com. Clues are collected with Scrapy, saved to sqlite, and updated daily via GitHub…
github.com
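For anyone who wants to try the question-generation idea, here is a minimal sketch of turning clue records into source/target text pairs for an encoder-decoder model. The table name `clues` and the columns `category`, `clue`, and `answer` are assumptions made for illustration, and the rows are made-up placeholders; the actual schema of the jarchive-clues database may differ, so check the repo before adapting this.

```python
import sqlite3

# NOTE: the "clues" table and its columns are a hypothetical schema used
# only for illustration; the real jarchive-clues database may differ.


def build_pairs(conn):
    """Turn (category, clue, answer) rows into (source, target) text pairs."""
    rows = conn.execute(
        "SELECT category, clue, answer FROM clues ORDER BY rowid"
    ).fetchall()
    pairs = []
    for category, clue, answer in rows:
        # The model sees the short answer plus its category and learns
        # to generate the clue text.
        source = f"generate clue: {answer} | category: {category}"
        pairs.append((source, clue))
    return pairs


def demo_connection():
    """Create an in-memory database with placeholder rows (not real clues)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clues (category TEXT, clue TEXT, answer TEXT)")
    conn.executemany(
        "INSERT INTO clues VALUES (?, ?, ?)",
        [
            ("PLACEHOLDER CATEGORY", "Placeholder clue text one", "placeholder answer one"),
            ("PLACEHOLDER CATEGORY", "Placeholder clue text two", "placeholder answer two"),
        ],
    )
    return conn


if __name__ == "__main__":
    for source, target in build_pairs(demo_connection()):
        print(source, "->", target)
```

The task prefix ("generate clue:") is a design choice borrowed from text-to-text setups; any consistent prefix would do.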
2020 NLP/ML Recap
Sebastian Ruder’s 2020 recap is a blog post you can’t miss. He discusses the top 10 trends in NLP and machine learning (including links to papers) that caught his eye over the past year:
- Scaling up and down
- Retrieval augmentation
- Few-shot learning
- Contrastive learning
- Evaluation beyond accuracy
- Practical concerns of large LMs
- Multilinguality
- Image Transformers
- ML for science
- Reinforcement learning
Full Blog Post
ML and NLP Research Highlights of 2020
The selection of areas and methods is heavily influenced by my own interests; the selected topics are biased towards…
ruder.io
GNN Applications
A refreshing recap of where graph neural network applications are headed in 2021. It covers recommender systems, combinatorial optimization, computer vision, and physics/life sciences applications.
Top Applications of Graph Neural Networks 2021
GNNs have come a long way in academia. But do we have good applications of them in industry?
medium.com
From ZeRO to Hero | A Memory Optimizer
Remember the Zero Redundancy Optimizer (ZeRO)? Microsoft’s memory optimizer for training very large models returns in an engaging Hugging Face blog post. FYI, Hugging Face’s Trainer class supports DeepSpeed's and FairScale's ZeRO features as of version 4.2. With the DeepSpeed library, they were able to train a 3 billion parameter T5 on a single 24GB RTX 3090 card with a batch size of 20. 👀👀
Blog:
Fit More and Train Faster With ZeRO via DeepSpeed and FairScale
A guest blog post by Hugging Face fellow Stas Bekman As recent Machine Learning models have been growing much faster…
huggingface.co
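To give a rough idea of what switching ZeRO on involves, here is a minimal sketch that builds a DeepSpeed-style JSON config with ZeRO stage 2 and CPU offload. The helper name `make_zero_config` is hypothetical, and the key names (`zero_optimization`, `stage`, `cpu_offload`, `fp16`) reflect DeepSpeed's config format as we understand it from around the time of the blog post; treat them as assumptions and check the current DeepSpeed documentation, since newer releases may use different offload settings.

```python
import json


def make_zero_config(stage=2, cpu_offload=True):
    """Build a minimal DeepSpeed-style config dict enabling ZeRO.

    Key names are assumptions based on the DeepSpeed config format of
    early 2021; verify them against the version you install.
    """
    return {
        "fp16": {"enabled": True},  # mixed precision to cut memory use
        "zero_optimization": {
            "stage": stage,  # 2 = partition optimizer states and gradients
            "cpu_offload": cpu_offload,  # keep optimizer state in CPU RAM
        },
    }


if __name__ == "__main__":
    # Print the JSON that would be saved to a config file.
    print(json.dumps(make_zero_config(), indent=2))
```

With the Trainer integration, a file like this is typically passed to the training script through its `--deepspeed` argument; the blog post above has the exact launch commands.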
Computer Science Videos
If you like educational computer science videos 👇
Developer-Y/cs-video-courses
Introduction Please check NOTES for general information about this list. Please refer CONTRIBUTING.md for contribution…
github.com
FOIA YouTube
The Black Vault enjoys its FOIA (Freedom of Information Act) requests so much that it decided to request all of the YouTube videos listed as private or unlisted across several federal agencies!! 😁
Private/Unlisted YouTube Videos of U.S. Government Agencies – The Black Vault
Background Many U.S. government agencies and military branches have public YouTube pages. That is no secret. However…
www.theblackvault.com
2021 Enterprise and Machine Learning Survey
“The time required to deploy a model is 31% lower for organizations that buy a third-party solution.”
“Organizations with more models spend more of their data scientists’ time on deployment, not less.”
“The time required to deploy a model is increasing year-on-year.”
Download a free copy here:
The 2021 enterprise trends in machine learning
Building on last year's report, Algorithmia presents the 2021 enterprise trends in machine learning report. See what's…
info.algorithmia.com
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
Trankit
A trainable pipeline for fundamental NLP tasks in more than 100 languages, with 90 downloadable pretrained pipelines for 56 languages.
(The authors say Trankit outperforms Stanford’s Stanza on select tasks such as sentence segmentation and dependency parsing for English.) 🥶🥶
nlp-uoregon/trankit
Trankit is a light-weight Transformer-based Python Toolkit for multilingual Natural Language Processing (NLP). It…
github.com
Spectrum
Spectrum is a model that uses deep learning to generate rap song lyrics. Includes demo and Colab!
YigitGunduc/Spectrum
Spectrum is an AI that uses deep learning to generate rap song lyrics. View Demo Report Bug Request Feature Open In…
github.com
Neural Punctuator (w/ BERT)
Automatic punctuation restoration with BERT models for English and Hungarian.
attilanagy234/neural-punctuator
Complimentary code for our paper Automatic punctuation restoration with BERT models submitted to the XVII. Conference…
github.com
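Punctuation restoration is often framed as per-token classification: strip the punctuation from the text and have the model predict, for each word, which mark (if any) follows it. The sketch below shows how training labels could be derived under that framing. The label set and the helper name `to_token_labels` are assumptions for illustration and not necessarily the setup used in the paper.

```python
# Hypothetical label set; a real system may cover more punctuation marks.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}


def to_token_labels(text):
    """Convert punctuated text into (lowercased word, label) pairs.

    Each word is labeled with the punctuation mark that follows it,
    or "O" when no mark follows.
    """
    pairs = []
    for raw in text.split():
        word, label = raw, "O"
        if raw[-1] in PUNCT_LABELS:
            label = PUNCT_LABELS[raw[-1]]
            word = raw[:-1]  # drop the trailing punctuation mark
        if word:  # skip tokens that were punctuation only
            pairs.append((word.lower(), label))
    return pairs


if __name__ == "__main__":
    print(to_token_labels("Hello, how are you?"))
```

A BERT-style token classifier would then be trained on the unpunctuated, lowercased words as input and these labels as targets.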
Text-to-Text Transformers for Semantic Parsing
Finetune a T5 model on the task of semantic parsing for generating Python code out of natural language descriptions.
ypapanik/t5-for-code-generation
This repository is used to finetune a T5 model on the task of semantic parsing, a.k.a. generating (Python) code out of…
github.com
BERT Text Classification Jupyter Notebooks
Notebooks for fine-tuning BERT, SciBERT and BioBERT, visualizing self-attention in the last layer of the BERT models, and listing the words that receive above-average attention in that layer.
expertailab/Is-BERT-self-attention-a-feature-selection-method
The annotation and classification of scientific literature is a crucial task to make scientific knowledge easily…
github.com
D-REPTILE
Few-shot dialog state tracking using meta-learning. The full codebase is to be released eventually. This space is one to watch if you are interested in building conversational models that can transfer to new domains.
saketdingliwal/Few-Shot-DST
Source code for our "D-REPTILE" paper at EACL 2021: Saket Dingliwal, Bill Gao, Sanchit Agarwal, Tagyoung Chung, and…
github.com
Dataset of the Week: OpenViDial
What is it?
Dialogue turns and visual contexts were extracted from movies and TV series, with each dialogue turn paired with the visual context in which it takes place. The dataset contains a total of 1.1 million dialogue turns, and thus 1.1 million visual contexts stored as images.
Where is it?
ShannonAI/OpenViDial
This repo contains downloading instructions for the OpenViDial dataset in 《OpenViDial: A Large-Scale, Open-Domain…
github.com
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
For complete coverage, follow our Twitter: @Quantum_Stat
Published via Towards AI