The NLP Cypher | 04.04.21

Last Updated on July 24, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.

Early Moonrise, Florida | Inness

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 04.04.21

The Facebook Leak Chronicles

Hey, welcome back. A lot happened in the past 24 hours! First, Alon Gal is back. He's the cybersecurity dude we mentioned many months ago on the Cypher for his tweet about a humongous bitcoin wallet that hackers were attempting to crack on the dark web. And this time, his tweet went viral. A vulnerability on Facebook (patched in 2019) exposed 533 million Facebook users worldwide (32,126,812 based in the US). Personal details such as phone numbers, emails, names, and other things were exposed. Initially the data was being traded on the dark web, but as of a few days ago, it was leaked to the public. 😬

I actually gained access to the US portion of the leak. The good news (at least for US users) is that very few emails and DOBs were included in this subset.

GPT-3 Neo Callback

As of this week, the GPT-Neo models released by EleutherAI are live on the Hugging Face Model Hub. You can download both of them for inference:

from transformers import pipeline
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B')
generator("EleutherAI has", do_sample=True, min_length=50)
[{'generated_text': 'EleutherAI has made a commitment to create new software packages for each of its major clients and has'}]

For fine-tuning the 2.7B param model, you can use this fellow's repo. 👇

Xirider/finetune-gpt2xl

Finetuning large language models like GPT2-xl is often difficult, as these models are too big to fit on a single GPU…

github.com

According to the author, he tested it on a machine with a V100 GPU (16 GB VRAM) and 78 GB of RAM and got it to work. ✌✌

Here's the code snippet for inference after fine-tuning:

from transformers import GPTNeoForCausalLM, AutoTokenizer

model = GPTNeoForCausalLM.from_pretrained("finetuned").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("finetuned")

text = "From off a hill whose concave"
ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda")

# add the length of the prompt tokens to match with the mesh-tf generation
max_length = 400 + ids.shape[1]

gen_tokens = model.generate(
    ids,
    do_sample=True,
    min_length=max_length,
    max_length=max_length,
    temperature=0.9,
    use_cache=True,
)
gen_text = tokenizer.batch_decode(gen_tokens)[0]
print(gen_text)

In case you were wondering where EleutherAI is heading next in terms of scaling GPT-Neo 👇

declassified

ADAM Optimizer Goes BYE BYE?

Welcome MADGRAD, a new optimizer from FB Research. According to its authors, it can match or exceed the performance of Adam. They also say you may need to adjust your learning rate and lower your weight decay (both hyperparameters) to accommodate MADGRAD.

BackProp: ML Resources page

A new resource page currently focusing on three topics: deep learning, GANs, and Transformers.

Home

Backprop serves as a resource to accelerate learning by finding and aggregating the best resources on various machine…

www.backprop.org

Hawking: A Natural Language Date Parser (Java)

Really cool: it takes natural language as input (some text mentioning a date) and returns that date in a standard date format.

String inputText = "Good morning, Have a nice day. Shall we meet on December 20 ?";
#output
Text : on December 20
Start : 2021-12-20T00:00:00.000+05:30
End : 2021-12-20T23:59:59.000+05:30

zoho/hawking

Given any date expression in a sentence, Hawking will apply standard language recognition and parser techniques to…

github.com
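To make the input/output shape above concrete, here is a minimal Python sketch of the same idea. It is not Hawking's implementation (Hawking is a Java library with a far richer parser). The function name `parse_date_span`, the regex, and the requirement to pass a reference year and timezone are all assumptions made for illustration, and the sketch only understands expressions of the form "<Month> <day>".

```python
import re
from datetime import datetime, timedelta, timezone

# Month name -> month number (1-12).
MONTHS = {
    name: number
    for number, name in enumerate(
        [
            "january", "february", "march", "april", "may", "june", "july",
            "august", "september", "october", "november", "december",
        ],
        start=1,
    )
}

# Matches an optional "on", a month name, and a one- or two-digit day.
DATE_PATTERN = re.compile(
    r"\b(?:on\s+)?(" + "|".join(MONTHS) + r")\s+(\d{1,2})\b",
    re.IGNORECASE,
)


def parse_date_span(text, reference_year, tz):
    """Return the matched text plus the start and end of that day, or None."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    month = MONTHS[match.group(1).lower()]
    day = int(match.group(2))
    start = datetime(reference_year, month, day, 0, 0, 0, tzinfo=tz)
    end = datetime(reference_year, month, day, 23, 59, 59, tzinfo=tz)
    return {"text": match.group(0), "start": start, "end": end}


# Hypothetical usage: the same sentence as above, with an assumed +05:30 offset.
ist = timezone(timedelta(hours=5, minutes=30))
result = parse_date_span(
    "Good morning, Have a nice day. Shall we meet on December 20 ?", 2021, ist
)
print("Text  :", result["text"])
print("Start :", result["start"].isoformat())
print("End   :", result["end"].isoformat())
```

A real parser also has to resolve relative expressions ("next Friday"), ranges, and ambiguous years, which is where a dedicated library earns its keep.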

PyTorch Profiler

Out now in PyTorch 1.8.

“This new profiler collects both GPU hardware and PyTorch related information, correlates them, performs automatic detection of bottlenecks in the model, and generates recommendations on how to resolve these bottlenecks.”

PyTorch

by Maxim Lukiyanov – Principal PM at Microsoft, Guoliang Hua – Principal Engineering Manager at Microsoft, Geeta…

pytorch.org

Knowledge Graph Embeddings Tutorial

An awesome tutorial featuring slides, Colab notebooks and other materials to learn about all things knowledge graphs.

Knowledge Graph Embeddings Tutorial: From Theory to Practice

Slides [PDF] Jupyter notebook/Colab Hands-On Session Check out additional KGE tutorial material Knowledge graph…

kge-tutorial-ecai2020.github.io

Talking about Knowledge Graphs…

“ASER (activities, states, events, and their relations), a large-scale eventuality knowledge graph extracted from more than 11-billion-token unstructured textual data. ASER contains 15 relation types belonging to five categories (Temporal, Contingency, Comparison, Expansion, and Co-Occurrence), 194-million unique eventualities, and 64-million unique edges among them.”

HKUST-KnowComp/ASER

ASER (activities, states, events, and their relations), a large-scale eventuality knowledge graph extracted from more…

github.com
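For readers new to eventuality graphs, the structure described in the quote can be pictured as short event phrases connected by typed edges, with each relation type belonging to one of the five categories. The sketch below is a toy in-memory version, not ASER's actual data format or API. The class `EventualityGraph`, its methods, the example sentences, and the specific relation names in `RELATION_CATEGORY` are assumptions chosen for illustration; only the five category names come from the quote.

```python
from collections import defaultdict

# Illustrative relation types mapped to their category.
# Category names are from the quote above; the relation names are assumptions.
RELATION_CATEGORY = {
    "Precedence": "Temporal",
    "Result": "Contingency",
    "Contrast": "Comparison",
    "Conjunction": "Expansion",
    "Co_Occurrence": "Co-Occurrence",
}


class EventualityGraph:
    """Toy directed graph: eventualities as nodes, typed relations as edges."""

    def __init__(self):
        # head eventuality -> list of (relation type, tail eventuality)
        self._edges = defaultdict(list)

    def add_edge(self, head, relation, tail):
        if relation not in RELATION_CATEGORY:
            raise ValueError(f"unknown relation type: {relation}")
        self._edges[head].append((relation, tail))

    def neighbors(self, head, category=None):
        """Outgoing edges of `head`, optionally filtered by relation category."""
        return [
            (relation, tail)
            for relation, tail in self._edges.get(head, [])
            if category is None or RELATION_CATEGORY[relation] == category
        ]


graph = EventualityGraph()
graph.add_edge("I am hungry", "Result", "I eat lunch")
graph.add_edge("I am hungry", "Precedence", "I cook dinner")
graph.add_edge("I eat lunch", "Conjunction", "I drink water")

print(graph.neighbors("I am hungry", category="Contingency"))
```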

Label Errors: Datasets

A site was created exposing label errors in very popular datasets used in vision and NLP (20 Newsgroups and IMDB are on it). 🥶🥶

Label Errors in Benchmark ML Datasets

We identify label errors in 10 benchmark ML test sets and study the potential for these label errors to affect…

labelerrors.com

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery

Uses CLIP models to build a text-based interface for StyleGAN image manipulation that does not require manual effort.

This is a lot of fun. 🤪

orpatashnik/StyleCLIP

Demo video: Optimization notebook: Global directions notebook: StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery…

github.com

Connected Papers 📈

Aspect-Sentiment-Opinion Triplet Extraction (ASOTE)

ASOTE extracts aspect term, sentiment and opinion term triplets from sentences.

l294265421/ASOTE

Aspect-Sentiment-Opinion Triplet Extraction (ASOTE) extracts aspect term, sentiment and opinion term triplets from…

github.com

Connected Papers 📈

Changing the Mind of Transformers for Topically-Controllable Language Generation

A framework used for controlling the topic of text generation with language models. It displays multiple candidate upcoming topics, of which a user can select a subset to guide the generation.

iesl/interactive_LM

We will first introduce the how to run the IPython notebook demo by downloading our pretrained models. Then, we will…

github.com

Connected Papers 📈

Dodrio | Transformers Visualization Toolkit

An interactive visualization system designed to help NLP researchers and practitioners analyze and compare attention weights in transformer-based models with linguistic knowledge.

poloclub/dodrio

An interactive visualization system designed to help NLP researchers and practitioners analyze and compare attention…

github.com

Connected Papers 📈

Self-Supervised Euphemism Detection and Identification for Content Moderation

Model (BERT) and data for the task of euphemism detection and identification.

WanzhengZhu/Euphemism

This repo is the Python 3 implementation of Self-Supervised Euphemism Detection and Identification for Content…

github.com

Connected Papers 📈

Dataset of the Week: CaSiNo

What is it?

A negotiation dataset consisting of 1,030 negotiation dialogues. Two participants take the role of campsite neighbors and negotiate for Food, Water, and Firewood packages, based on their individual preferences and requirements.

Sample

Participant Info

  • Demographics (Age, Gender, Ethnicity, Education)
  • Personality attributes (SVO and Big-5)
  • Preference order
  • Arguments for needing or not needing a specific item

Negotiation Dialogue

  • Alternating conversation between two participants
  • 11.6 utterances on average
  • Includes the use of four emoticons: Joy, Sadness, Anger, Surprise

Negotiation Outcomes

  • Points scored
  • Satisfaction (How satisfied are you with the negotiation outcome?)
  • Opponent Likeness (How much do you like your opponent?)
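The three groups of fields above map naturally onto a small record type. The sketch below shows one way such a record might look in Python. It is not the repository's actual schema: every class name, field name, and value is a hypothetical stand-in, and the toy dialogue is invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Participant:
    demographics: Dict[str, str]   # age, gender, ethnicity, education
    personality: Dict[str, str]    # SVO and Big-5 attributes
    preference_order: List[str]    # highest-priority item first
    arguments: Dict[str, str]      # item -> reason for needing or not needing it


@dataclass
class Utterance:
    speaker: str
    text: str


@dataclass
class Outcome:
    points_scored: int
    satisfaction: str
    opponent_likeness: str


@dataclass
class Dialogue:
    participants: Dict[str, Participant]
    utterances: List[Utterance]
    outcomes: Dict[str, Outcome]


def is_alternating(dialogue: Dialogue) -> bool:
    """True if no participant speaks twice in a row."""
    turns = dialogue.utterances
    return all(a.speaker != b.speaker for a, b in zip(turns, turns[1:]))


def average_utterances(dialogues: List[Dialogue]) -> float:
    """Mean number of utterances per dialogue."""
    return sum(len(d.utterances) for d in dialogues) / len(dialogues)


# Invented toy record; none of these values come from the dataset.
toy = Dialogue(
    participants={
        "A": Participant({"age": "30"}, {"svo": "prosocial"},
                         ["Food", "Water", "Firewood"], {"Food": "long hike planned"}),
        "B": Participant({"age": "41"}, {"svo": "proself"},
                         ["Water", "Firewood", "Food"], {"Water": "hot weather"}),
    },
    utterances=[
        Utterance("A", "I could really use the extra food."),
        Utterance("B", "Sure, if I can take most of the water."),
        Utterance("A", "Deal. Shall we split the firewood?"),
        Utterance("B", "Works for me."),
    ],
    outcomes={
        "A": Outcome(18, "Slightly satisfied", "Slightly like"),
        "B": Outcome(19, "Extremely satisfied", "Extremely like"),
    },
)

print(is_alternating(toy), average_utterances([toy]))
```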

Where is it?

kushalchawla/CaSiNo

This repository contains the dataset and the PyTorch code for 'CaSiNo: A Corpus of Campsite Negotiation Dialogues for…

github.com

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat

Quantum Stat

Published via Towards AI