
The NLP Cypher | 05.02.21

Author(s): Ricky Costa

Originally published on Towards AI.

Beware the Beautiful Witch | O’Malley

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Index

As an applied machine learning engineer (aka hacker 👨‍💻 aka flying ninja 🐱‍👤), I’m consistently looking for better and faster ways to stay on top of the deep learning and software development circuit. After comparing various sources for research, code, and apps, I’ve discovered that a significant amount of awesome NLP code is not on arXiv and that not all NLP research is on GitHub. To obtain a wider scope of current NLP research and code, I’ve created the NLP Index: a search-as-you-type search engine containing over 3,000 NLP repositories (updated weekly) 🔥. For each entry, the index contains the research paper, a ConnectedPapers link for a graph of related papers, and the paper’s GitHub repo.

The NLP Index

Top NLP Code Repositories – Quantum Stat

index.quantumstat.com

The intent of this platform is for researchers and hackers to obtain information quickly and comprehensively about all things NLP, and not just from research papers but also from the awesome apps built on top of this research.

We’ve included the option of open search (as opposed to exclusively serving pre-defined categories) because of inter-dependencies among subject areas. That is, a paper/repo can be about both “knowledge graphs” and “datasets” simultaneously, and it’s difficult to discretize topics. We prefer giving the user the option of openly searching the database across all domains/sectors at once. We have also included pre-defined queries for dozens of NLP topics in the sidebar for convenience.

The index has several search features: search-as-you-type, typo tolerance, and synonym detection.

Synonym Detection

For example, if you search for “dataset”, the index will also search for “corpus” and “corpora” simultaneously to make sure every asset is covered. 🤟

Typo Tolerance

If you search “gpt2”, the results will also include “gpt-2”.

Search as you type

It outputs results in real time on every keystroke, taking only a couple of milliseconds (thank you, memory mapping 🙈).
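
To make these features concrete, below is a purely illustrative toy sketch of synonym expansion combined with prefix matching in Python. It is not the engine behind the NLP Index (the synonym table and documents are made up); it only shows the idea of expanding a query before matching it.

# Toy sketch: expand a query with synonyms, then match documents by word prefix.
SYNONYMS = {"dataset": {"dataset", "corpus", "corpora"}}

def expand(term: str) -> set:
    """Return the term plus any known synonyms (lower-cased)."""
    return SYNONYMS.get(term.lower(), {term.lower()})

def search(query: str, documents: list) -> list:
    """Return documents containing a word that starts with any expanded term."""
    terms = expand(query)
    return [
        doc for doc in documents
        if any(word.startswith(t) for word in doc.lower().split() for t in terms)
    ]

docs = ["A new QA corpus for biomedical text", "Fine-tuning scripts for GPT-2"]
print(search("dataset", docs))  # the corpus entry matches via synonym expansion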

I also want to mention that the Big Bad NLP Database has already been merged into the NLP Index! For the most up-to-date compendium of NLP datasets, go to the “data” section of the sidebar and click “dataset,” or openly search for a specific dataset/task. Eventually, I will sunset the BBND URL and redirect it to the Index.

I want to thank everyone for the support I’ve received over the past week since taking the NLP Index live. Thank you to Philip Vollet for sharing his dataset of hundreds of NLP repos. You can find his posts in the “Uncharted” section.

More features coming soon. Stay tuned. 🙉

BERT, Explain Yourself!

Discover why BERT makes a given prediction using SHAP (SHapley Additive exPlanations), a game-theoretic approach to explaining the output of any machine learning model. The demo leverages the Transformers pipeline.

ml6team/quick-tips

It has been over two years since transformer models took the NLP throne 🏅, but up until recently they couldn't tell…

github.com
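
Below is a minimal sketch of the approach, assuming the shap and transformers packages are installed; the model (the pipeline's default sentiment classifier) and the example sentence are placeholders rather than anything taken from the repo.

# Explain a Transformers pipeline prediction with SHAP token attributions.
import shap
from transformers import pipeline

# Standard sentiment-analysis pipeline; SHAP can wrap it directly.
classifier = pipeline("sentiment-analysis", return_all_scores=True)

# SHAP attributes the predicted score back to individual input tokens.
explainer = shap.Explainer(classifier)
shap_values = explainer(["The NLP Index makes finding repos painless."])

# Token-level attribution plot (renders in a notebook).
shap.plots.text(shap_values)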

Colab of the Week

Google Colaboratory

colab.research.google.com

Explainable AI Cheat Sheet

Includes a graphic, a YouTube video, and several links to papers/books discussing the topic of explainable AI.

Explainable AI Guide

A brief overview of the Explainable AI cheat sheet with examples.

ex.pegg.io

StyleCLIP is Too Much Fun!

Awesome introduction from Max Woolf on using StyleCLIP (via Colab notebooks) to manipulate headshot pics via text prompts. You can even add your own pictures; the quality is pretty good. For example, take a look at the generation after the text prompt: “Face after using the NLP index” 👇 😭😭

Easily Transform Portraits of People into AI Aberrations Using StyleCLIP | Max Woolf’s Blog

GANs, generative adversarial networks, are all the rage nowadays for creating AI-based imagery. You've probably seen…

minimaxir.com

Software Updates

AdapterHub

The new version adds adapter support for BART and GPT-2 models 🚨

Adapters for Generative and Seq2Seq Models in NLP

Adapters are becoming more and more important in machine learning for NLP. For instance, they enable us to efficiently…

adapterhub.ml
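
A minimal sketch of training a GPT-2 adapter, assuming AdapterHub's adapter-transformers package (a drop-in fork of transformers); the adapter name "poetry" is hypothetical, and exact method signatures may differ slightly between releases.

# Train a lightweight adapter on top of a frozen GPT-2 with adapter-transformers.
from transformers import AutoModelWithHeads, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithHeads.from_pretrained("gpt2")

# Add a new, randomly initialized adapter and make it the only trainable part;
# the pre-trained GPT-2 weights stay frozen.
model.add_adapter("poetry")  # hypothetical adapter name
model.train_adapter("poetry")

# ... run your usual fine-tuning loop or Trainer here ...

# Activate the adapter for inference.
model.set_active_adapters("poetry")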

BERTopic

(Semi-)supervised topic modeling by leveraging the supervised options in UMAP:

  • model.fit(docs, y=target_classes)

Backends:

  • Added spaCy, Gensim, and USE (TF Hub) backends
  • Use a different backend for document embeddings and word embeddings
  • Create your own backends with bertopic.backend.BaseEmbedder
  • See the BERTopic documentation for an overview of all new backends

Calculate and visualize topics per class (a combined sketch follows below):

  • Calculate: topics_per_class = topic_model.topics_per_class(docs, topics, classes)

  • Visualize: topic_model.visualize_topics_per_class(topics_per_class)
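
Putting the snippets above together, here is a minimal sketch assuming bertopic>=0.7, a list of documents docs, numeric labels target_classes, and matching class names classes (one per document). It uses fit_transform instead of fit so that the topic assignments needed by topics_per_class are returned.

# (Semi-)supervised BERTopic plus per-class topic visualization.
from bertopic import BERTopic

topic_model = BERTopic()

# The labels guide UMAP's dimensionality reduction, while topics are still
# discovered from the documents themselves.
topics, probs = topic_model.fit_transform(docs, y=target_classes)

# Calculate and visualize how the discovered topics spread across the classes.
topics_per_class = topic_model.topics_per_class(docs, topics, classes)
topic_model.visualize_topics_per_class(topics_per_class)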

Release Major Release v0.7 · MaartenGr/BERTopic

The two main features are (semi-)supervised topic modeling and several backends to use instead of Flair and…

github.com

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

Gradient-based Adversarial Attacks against Text Transformers

GBDA (Gradient-based Distributional Attack) is a general-purpose framework for gradient-based adversarial attacks, applied here against transformer models on text data.

facebookresearch/text-adversarial-attack

Install HuggingFace dependences conda install -c huggingface transformers pip install datasets (Optional) For attacks…

github.com

Connected Papers 📈

Easy and Efficient Transformer

A PyTorch inference plugin for Transformer models with large model sizes and long sequences. It currently supports GPT-2 and BERT models.

NetEase-FuXi/EET

EET(Easy and Efficient Transformer) is an efficient Pytorch inference plugin focus on Transformer-based models with…

github.com

Connected Papers 📈

MDETR: Modulated Detection for End-to-End Multi-Modal Understanding

Code and links to pre-trained models for MDETR (Modulated DETR), which is pre-trained on data with aligned text and images with box annotations and then fine-tuned on tasks requiring fine-grained understanding of image and text.

ashkamath/mdetr

This repository contains code and links to pre-trained models for MDETR (Modulated DETR) for pre-training on data…

github.com

Connected Papers 📈

XLM-T — A Multilingual Language Model Toolkit for Twitter

Continued pre-training of the XLM-RoBERTa-base model on a large multilingual Twitter corpus. Includes 4 Colab notebooks.

cardiffnlp/xlm-t

This is the XLM-T repository, which includes data, code and pre-trained multilingual language models for Twitter. As…

github.com

Connected Papers 📈
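
For a quick start, here is a minimal sketch assuming the Twitter-adapted checkpoint is available on the Hugging Face Hub; the model ID below is an assumption, so check the cardiffnlp/xlm-t repo and its notebooks for the official checkpoint names.

# Masked-language-model demo with the multilingual Twitter XLM-R checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="cardiffnlp/twitter-xlm-roberta-base")  # assumed model ID

# Continued pre-training on multilingual tweets should handle informal text.
print(fill_mask("I love <mask> on Twitter!"))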

FRANK: Factuality Evaluation Benchmark

A typology of factual errors for fine-grained analysis of factuality in summarization systems.

artidoro/frank

This repository contains the data for the FRANK Benchmark for factuality evaluation metrics (see our NAACL 2021 paper…

github.com

Connected Papers 📈

Legal Document Similarity

A collection of state-of-the-art document representation methods for the task of retrieving semantically related US case law. Text-based (e.g., fastText, Transformers), citation-based (e.g., DeepWalk, Poincaré), and hybrid methods were explored.

malteos/legal-document-similarity

Implementation, trained models and result data for the paper Evaluating Document Representations for Content-based…

github.com

Connected Papers 📈

Dataset of the Week: Shellcode_IA32 👩‍💻

What is it?

Shellcode_IA32 is a dataset containing 20 years of shellcodes from a variety of sources and is the largest collection of shellcodes in assembly available to date. It consists of 3,200 examples of assembly-language instructions for IA-32 (the 32-bit version of the x86 Intel architecture), collected from publicly available security exploits on exploit-db and shell-storm. The dataset is intended for automatic shellcode generation (a code generation task).

paper

Where is it?

dessertlab/Shellcode_IA32

Shellcode_IA32 is a dataset consisting of challenging but common assembly instructions, collected from real shellcodes…

github.com

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat

Quantum Stat

Published via Towards AI
