The NLP Cypher | 04.04.21

Last Updated on July 24, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 04.04.21

The Facebook Leak Chronicles

Hey, welcome back. A lot has happened in the past 24 hours! First, Alon Gal is back. He's the cybersecurity dude we mentioned many months ago on the Cypher, when he tweeted about a humongous bitcoin wallet that hackers were attempting to crack on the dark web. This time, his tweet went viral. A vulnerability on Facebook (patched in 2019) exposed 533 million Facebook users worldwide (32,126,812 based in the US). Data such as phone numbers, emails, names, and other details were exposed. Initially the data was being traded on the dark web, but as of a few days ago, it was leaked to the public. 😬

I actually gained access to the US portion of the leak. The good news (at least for US users) is that very few emails and DOBs were included in this subset.

GPT-3 Neo Callback

As of this week, the GPT-Neo models (EleutherAI's GPT-3-style models) are live on the Hugging Face Model Hub. You can download both of them for inference:

from transformers import pipeline
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B')
generator("EleutherAI has", do_sample=True, min_length=50)
# Example output:
# [{'generated_text': 'EleutherAI has made a commitment to create new software packages for each of its major clients and has'}]

For fine-tuning the 2.7B param model, you can use this fellow’s repo. 👇

Xirider/finetune-gpt2xl

Finetuning large language models like GPT2-xl is often difficult, as these models are too big to fit on a single GPU…

github.com

According to the author, he tested it on a machine with a V100 GPU (16 GB VRAM) and 78 GB of RAM and got it to work. ✌✌

Here’s the code snippet for inference after fine-tuning:

from transformers import GPTNeoForCausalLM, AutoTokenizer

model = GPTNeoForCausalLM.from_pretrained("finetuned").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("finetuned")

text = "From off a hill whose concave"
ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda")

# add the length of the prompt tokens to match with the mesh-tf generation
max_length = 400 + ids.shape[1]

gen_tokens = model.generate(
    ids,
    do_sample=True,
    min_length=max_length,
    max_length=max_length,
    temperature=0.9,
    use_cache=True
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print(gen_text)

In case you were wondering where EleutherAI is heading next in terms of scaling GPT-Neo 👇

declassified

ADAM Optimizer Goes BYE BYE?

Welcome MADGRAD, a new state-of-the-art optimizer. According to FB Research, it can exceed the speed of Adam. The authors say you may need to adjust your learning rate and lower your weight decay (both hyperparameters) to accommodate MADGRAD.

BackProp: ML Resources page

A new resource page currently focusing on deep learning topics, GANs, and Transformers.

Home

Backprop serves as a resource to accelerate learning by finding and aggregating the best resources on various machine…

www.backprop.org

Hawking: A Natural Language Date Parser (Java)

Really cool: it takes natural language as input (some text mentioning a date) and returns that date in a structured format, with the matched text plus start and end timestamps.

String inputText = "Good morning, Have a nice day. Shall we meet on December 20 ?";
#outputText : on December 20
Start : 2021-12-20T00:00:00.000+05:30
End : 2021-12-20T23:59:59.000+05:30
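To make the idea concrete, here is a toy Python sketch of the same input/output shape. It is not Hawking's implementation or API (Hawking is a Java library with a far richer parser). The `parse_date` function, the regex, and the `reference_year` argument are all assumptions made up for this illustration, and it only handles simple "Month day" expressions.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical helper table: month name -> month number.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

# Matches an optional "on" followed by "<Month> <day>", e.g. "on December 20".
PATTERN = re.compile(
    r"\b(?:on\s+)?(" + "|".join(MONTHS) + r")\s+(\d{1,2})\b",
    re.IGNORECASE,
)


def parse_date(text, reference_year, tz=timezone.utc):
    """Return (matched_text, start, end) for the first date found, else None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    month = MONTHS[match.group(1).lower()]
    day = int(match.group(2))
    start = datetime(reference_year, month, day, tzinfo=tz)
    # The end of the range is the last whole second of the same day.
    end = start + timedelta(days=1) - timedelta(seconds=1)
    return match.group(0), start, end


# Same sentence and UTC+05:30 offset as the sample above.
ist = timezone(timedelta(hours=5, minutes=30))
sentence = "Good morning, Have a nice day. Shall we meet on December 20 ?"
matched, start, end = parse_date(sentence, 2021, ist)
print("Text  :", matched)
print("Start :", start.isoformat(timespec="milliseconds"))
print("End   :", end.isoformat(timespec="milliseconds"))
```

The year has to be supplied by the caller because the sentence itself does not contain one. A real parser would resolve it from a reference date.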

zoho/hawking

Given any date expression in a sentence, Hawking will apply standard language recognition and parser techniques to…

github.com

PyTorch Profiler

Out now in PyTorch version 1.8.

“This new profiler collects both GPU hardware and PyTorch related information, correlates them, performs automatic detection of bottlenecks in the model, and generates recommendations on how to resolve these bottlenecks.”

PyTorch

by Maxim Lukiyanov – Principal PM at Microsoft, Guoliang Hua – Principal Engineering Manager at Microsoft, Geeta…

pytorch.org

Knowledge Graph Embeddings Tutorial

An awesome tutorial featuring slides, Colab notebooks and other materials to learn about all things knowledge graphs.

Knowledge Graph Embeddings Tutorial: From Theory to Practice

Slides [PDF] Jupyter notebook/Colab Hands-On Session Check out additional KGE tutorial material Knowledge graph…

kge-tutorial-ecai2020.github.io

Talking about Knowledge Graphs…

“ASER (activities, states, events, and their relations), a large-scale eventuality knowledge graph extracted from more than 11-billion-token unstructured textual data. ASER contains 15 relation types belonging to five categories (Temporal, Contingency, Comparison, Expansion, and Co-Occurrence), 194-million unique eventualities, and 64-million unique edges among them.”

HKUST-KnowComp/ASER

ASER (activities, states, events, and their relations), a large-scale eventuality knowledge graph extracted from more…

github.com

Label Errors: Datasets

A new site exposes label errors in very popular datasets used in vision and NLP (20 Newsgroups and IMDB are on it). 🥶🥶

Label Errors in Benchmark ML Datasets

We identify label errors in 10 benchmark ML test sets and study the potential for these label errors to affect…

labelerrors.com

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery

Uses CLIP models to provide a text-based interface for StyleGAN image manipulation, so edits can be driven by a text prompt instead of manual effort.

This is a lot of fun. 🤪

orpatashnik/StyleCLIP

Demo video: Optimization notebook: Global directions notebook: StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery…

github.com

Connected Papers 📈

Aspect-Sentiment-Opinion Triplet Extraction (ASOTE)

ASOTE extracts aspect term, sentiment and opinion term triplets from sentences.

l294265421/ASOTE

Aspect-Sentiment-Opinion Triplet Extraction (ASOTE) extracts aspect term, sentiment and opinion term triplets from…

github.com

Connected Papers 📈

Changing the Mind of Transformers for Topically-Controllable Language Generation

A framework used for controlling the topic of text generation with language models. It displays multiple candidate upcoming topics, of which a user can select a subset to guide the generation.

iesl/interactive_LM

We will first introduce the how to run the IPython notebook demo by downloading our pretrained models. Then, we will…

github.com

Connected Papers 📈

Dodrio | Transformers Visualization Toolkit

An interactive visualization system designed to help NLP researchers and practitioners analyze and compare attention weights in transformer-based models with linguistic knowledge.

poloclub/dodrio

An interactive visualization system designed to help NLP researchers and practitioners analyze and compare attention…

github.com

Connected Papers 📈

Self-Supervised Euphemism Detection and Identification for Content Moderation

Model (BERT) and data for the task of euphemism detection and identification.

WanzhengZhu/Euphemism

This repo is the Python 3 implementation of Self-Supervised Euphemism Detection and Identification for Content…

github.com

Connected Papers 📈

Dataset of the Week: CaSiNo

What is it?

A negotiation dataset consisting of 1,030 negotiation dialogues. Two participants take the roles of campsite neighbors and negotiate for Food, Water, and Firewood packages, based on their individual preferences and requirements.

Sample

Participant Info

Demographics (Age, Gender, Ethnicity, Education)
Personality attributes (SVO and Big-5)
Preference order
Arguments for needing or not needing a specific item

Negotiation Dialogue

Alternating conversation between two participants
11.6 utterances on average
Includes the use of four emoticons: Joy, Sadness, Anger, Surprise

Negotiation Outcomes

Points scored
Satisfaction (How satisfied are you with the negotiation outcome?)
Opponent Likeness (How much do you like your opponent?)
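For a rough picture of how one record described above could be modeled in Python, here is a minimal sketch. The class and field names are hypothetical and chosen only to mirror the list above. They are not the JSON schema used in the CaSiNo repo, and the demo record is a placeholder with made-up values, not real data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Participant:
    demographics: Dict[str, str]     # age, gender, ethnicity, education
    personality: Dict[str, str]      # SVO and Big-5 attributes
    preference_order: List[str]      # e.g. ["Food", "Water", "Firewood"]
    arguments: Dict[str, str]        # item -> why it is or is not needed


@dataclass
class Outcome:
    points_scored: int
    satisfaction: str                # answer to the satisfaction question
    opponent_likeness: str           # answer to the likeness question


@dataclass
class Dialogue:
    participants: List[Participant]  # the two campsite neighbors
    speakers: List[int]              # speaker index (0 or 1) per utterance
    utterances: List[str]            # utterance text, same order as speakers
    outcomes: List[Outcome]          # one outcome per participant

    def alternates(self) -> bool:
        """True if no participant speaks twice in a row."""
        return all(a != b for a, b in zip(self.speakers, self.speakers[1:]))


def average_utterances(dialogues: List[Dialogue]) -> float:
    """Mean number of utterances per dialogue."""
    return sum(len(d.utterances) for d in dialogues) / len(dialogues)


# Placeholder record with made-up values, only to exercise the types.
neighbor = Participant({}, {}, ["Food", "Water", "Firewood"], {})
demo = Dialogue(
    participants=[neighbor, neighbor],
    speakers=[0, 1, 0],
    utterances=["placeholder 1", "placeholder 2", "placeholder 3"],
    outcomes=[Outcome(0, "placeholder", "placeholder")] * 2,
)
print(demo.alternates(), average_utterances([demo]))
```

Run over the full dataset, a helper like `average_utterances` is the kind of computation behind the "11.6 utterances on average" figure.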

Where is it?

kushalchawla/CaSiNo

This repository contains the dataset and the PyTorch code for 'CaSiNo: A Corpus of Campsite Negotiation Dialogues for…

github.com

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat

Published via Towards AI
