The NLP Cypher | 04.11.21

Last Updated on July 24, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.

A Naturalist’s Study | Roy

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 04.11.21

It’s Dark, and NLP is Hot

One small step for man…

One giant leap for monkeys playing pong with their mind…

declassified

Welcome back. This week we have a good one for you. But first… Neuralink, Elon Musk’s brain chip company, implanted a chip in a monkey’s skull so it could play Pong wirelessly. Winning! And if you’re the type to create next-level Star Trek tech, they have open positions available. 🙉

The first fully-implanted 1000+ channel brain-machine interface

In a 2019 white paper, we outlined the design of our novel electrodes and our unique surgical approach, along with…

neuralink.com

Graphbrain | Semantic Hypergraphs

Excited to announce that the Graphbrain library had a major update this past week. It now includes more extensive documentation, with tutorials and notebooks for quick experimentation.

Recap…

Graphbrain is a library used to construct semantic hypergraphs from text. A hypergraph is just a normal graph except that an edge is not limited to only 2 vertices. It can have 3 or more 😎. This feature gives it the flexibility to extract knowledge entities hierarchically. It is built on top of spaCy and Hugging Face’s NeuralCoref library to help with the coreference resolution task.
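To make the "edge with more than 2 vertices" idea concrete, here is a minimal sketch in plain Python. It is not Graphbrain's API or its notation: the nested-tuple representation and the helper names `atoms` and `depth` are assumptions made purely for illustration.

```python
# Illustrative only: plain Python, NOT Graphbrain's API or notation.
from typing import Union

# Assumption: an atom is a string, and a hyperedge is a tuple whose
# members are atoms and/or other hyperedges.
Hyperedge = Union[str, tuple]


def atoms(edge: Hyperedge) -> list:
    """Collect every atom in a (possibly nested) hyperedge, left to right."""
    if isinstance(edge, str):
        return [edge]
    found = []
    for item in edge:
        found.extend(atoms(item))
    return found


def depth(edge: Hyperedge) -> int:
    """Nesting depth: 0 for an atom, 1 for a flat hyperedge, and so on."""
    if isinstance(edge, str):
        return 0
    return 1 + max((depth(item) for item in edge), default=0)


# An ordinary graph edge joins exactly two vertices.
pair = ("alice", "bob")

# A hyperedge can join three or more, and can contain other hyperedges.
claim = ("says", "alice", ("is", "sky", "blue"))

print(len(pair))     # 2
print(atoms(claim))  # ['says', 'alice', 'is', 'sky', 'blue']
print(depth(claim))  # 2
```

The nesting is what gives the hierarchy: the inner edge is itself a member of the outer one, which an ordinary two-vertex edge cannot express.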

If you are new to the library, it may be a bit intimidating at first because of its notation. FYI, this is what the notation looks like… 👇

I know… to the untrained eye it’s a bit funky, but it’s definitely worth exploring because it gives you a fresh look at NLP tasks through an architecture that isn’t purely deep learning. To get familiar with all the tasks, check out their paper below.

FYI, this is their manual to familiarize yourself with the notation the model spits out:

Semantic Hypergraph notation – Graphbrain 0.4.0 documentation

SH notation is based on two simple principles: Every hyperedge belongs to one of eight basic types. The first element…

graphbrain.net

Documentation:

Graphbrain – Language, Knowledge, Cognition – Graphbrain 0.4.0 documentation

Graphbrain is an Artificial Intelligence open-source software library and scientific research tool. Its aim is to…

graphbrain.net

Code:

graphbrain/graphbrain

Graphbrain is an Artificial Intelligence open-source software library and scientific research tool. Its aim is to…

github.com

Semantic Hypergraphs Paper

It’s the Wild Wild West on Reddit 😬

Chirpy: Stanford’s Open Source Chatbot

Stanford open-sourced Chirpy, their chatbot that won 2nd place in the Alexa Prize. This is a chit-chat bot with a broad range of response generators, which range from fully rule-based to fully neural.

Types of generators:

Music Response Generator

Personal Chat Response Generator

Wiki Response Generator

Inside Chirpy Cardinal: Stanford's Open-Source Social Chatbot that Won 2nd place in the Alexa Prize

Last year, Stanford won 2nd place in the Alexa Prize Socialbot Grand Challenge 3 for social chatbots. In this post, we…

ai.stanford.edu

Running PyTorch on Apple’s M1 Chip? 👇

GPU acceleration for Apple's M1 chip? · Issue #47702 · pytorch/pytorch

🚀 Feature Hi, I was wondering if we could evaluate PyTorch's performance on Apple's new M1 chip. I'm also wondering…

github.com

Kgextension: From Knowledge Graphs to Pandas

“The kgextension package allows one to access and use Linked Open Data to augment existing datasets. It enables one to incorporate knowledge graph information in pandas.DataFrames.”

Types of Linked Open Data: DBpedia, WikiData or the EU Open Data Portal

om-hb/kgextension

The kgextension package allows to access and use Linked Open Data to augment existing datasets. It enables to…

github.com

Colab of the Week

Google Colaboratory

colab.research.google.com

Visualize BERT

Attention is all you need… to see a transformer bust a move.

“BertViz is a tool for visualizing attention in the Transformer model, supporting all models from the transformers library (BERT, GPT-2, XLNet, RoBERTa, XLM, CTRL, etc.)”

jessevig/bertviz

BertViz is a tool for visualizing attention in the Transformer model, supporting all models from the transformers…

github.com

GECToR — Grammatical Error Correction

Besides a bunch of YouTube ads, Grammarly also has a bunch of transformer models 😬. Their grammatical error correction models were pre-trained on synthetic data and then fine-tuned in two stages:

first on error-filled corpora, and second, on a combination of error-filled and error-free parallel corpora.

grammarly/gector

This repository provides code for training and testing state-of-the-art models for grammatical error correction with…

github.com

PyTorch Geometric Temporal

A temporal graph neural network extension library for PyTorch Geometric. If you are into Epidemiological Forecasting or Web Traffic Prediction, have a look:

benedekrozemberczki/pytorch_geometric_temporal

PyTorch Geometric Temporal is a temporal (dynamic) extension library for PyTorch Geometric. The library consists of…

github.com

NLP Use-Cases

In these slides, Andrei Lopatenko, ML engineer, describes some of the top NLP use cases in the business world that he’s experienced over the past 15 years.

Locust | Load Testing

Need to load test your website or your API endpoint with open-source software? Check out Locust… 🐜

Locust – A modern load testing framework

locust.io

Python Packages Anyone?

How to make an awesome Python package in 2021…

How to make an awesome Python package in 2021

If you are like me, every once in a while you write a useful python utility and want to share it with your colleagues…

antonz.org

The Annoy Library

Very fast nearest neighbor search. Spotify uses it for their music recommendations.

Features (from their repo):

  • Euclidean distance, Manhattan distance, cosine distance, Hamming distance, or Dot (Inner) Product distance
  • Cosine distance is equivalent to Euclidean distance of normalized vectors = sqrt(2-2*cos(u, v))
  • Works better if you don’t have too many dimensions (like <100) but seems to perform surprisingly well even up to 1,000 dimensions
  • Small memory usage
  • Lets you share memory between multiple processes
  • Index creation is separate from lookup (in particular you can not add more items once the tree has been created)
  • Native Python support, tested with 2.7, 3.6, and 3.7.
  • Build index on disk to enable indexing big datasets that won’t fit into memory (contributed by Rene Hollander)
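The cosine/Euclidean equivalence in the feature list can be checked numerically. The snippet below is a minimal sketch using only the standard library. It does not call Annoy, and the helper names and the two sample vectors are made up for illustration.

```python
import math


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    """Plain Euclidean distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def angular_distance(u, v):
    """sqrt(2 - 2*cos(u, v)), clamped at 0 to absorb rounding error."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * cosine(u, v)))


# Hypothetical sample vectors, chosen only for the demonstration.
u = [3.0, 4.0, 0.0]
v = [1.0, 2.0, 2.0]

lhs = euclidean(normalize(u), normalize(v))
rhs = angular_distance(u, v)
print(math.isclose(lhs, rhs))  # True
```

The identity follows from expanding ||u - v||² = ||u||² + ||v||² - 2u·v with both norms equal to 1, which is why ranking neighbors by cosine gives the same order as ranking normalized vectors by Euclidean distance.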

spotify/annoy

Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that…

github.com

Multi-Document Summarization

Aylien created the Wikipedia Current Events Portal Dataset for summarization. In their blog, they discuss how their approach differs from more recent SOTA models like PEGASUS and BART, which are unable to work with multiple documents. 🥶🥶

“It is based on the Wikipedia Current Events Portal (WCEP) where Wikipedia editors write concise summaries of important current events, usually in 1 or 2 sentences, and provide links to news articles as sources for each summary.”

Blog:

Adventures in Multi-Document Summarisation: The Wikipedia Current Events Portal Dataset

01 Apr, 2021 Demian Gholipour 13 Min Read In this post we give a brief overview of multi-document summarization (MDS)…

aylien.com

Google Colaboratory

colab.research.google.com

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

CodeTrans

State-of-the-art pre-trained models for source code. CodeTrans was trained on several Nvidia RTX 8000 GPUs and a couple of Google TPUs using various state-of-the-art transformer models.

agemagician/CodeTrans

CodeTrans is providing state of the art pre-trained models for source code. CodeTrans was trained on several Nvidia RTX…

github.com

Connected Papers 📈

Layout Parser

OCR 😱😱😱😱😱 and document image analysis

Layout-Parser/layout-parser

Layout Parser is a deep learning based tool for document image layout analysis tasks. Use pip or conda to install the…

github.com

Connected Papers 📈

Conversational Question Answering over Knowledge Graphs with Transformer and Graph Attention Networks

A transformer architecture extended with Graph Attention Networks for multi-task neural semantic parsing.

endrikacupaj/LASAGNE

This paper addresses the task of (complex) conversational question answering over a knowledge graph. For this task, we…

github.com

Connected Papers 📈

EXPATS: A Toolkit for Explainable Automated Text Scoring

A framework for automated text scoring (ATS) tasks, such as automated essay scoring and readability assessment. The toolkit also provides seamless integration with the Language Interpretability Tool (LIT) so that one can interpret and visualize models and their predictions.

octanove/expats

EXPATS is an open-source framework for automated text scoring (ATS) tasks, such as automated essay scoring and…

github.com

Connected Papers 📈

GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning

Given an input text, identifies grammatical features useful for language education.

octanove/grammartagger

GrammarTagger – A Neural Multilingual Grammar Profiler for Language Learning – octanove/grammartagger

github.com

Connected Papers 📈

MMBERT: Multimodal BERT Pretraining for Improved Medical VQA

A multi-modal transformer for the visual question answering task in the medical domain. It achieves new state-of-the-art performance on two VQA datasets for radiology images: VQA-Med 2019 and VQA-RAD.

VirajBagal/MMBERT

Yash Khare*, Viraj Bagal*, Minesh Mathew, Adithi Devi, U Deva Priyakumar, CV Jawahar Abstract: Images in the medical…

github.com

Connected Papers 📈

Dataset of the Week: HumAID

What is it?

A dataset consisting of ∼77K human-labeled tweets, sampled from a pool of ∼24 million tweets across 19 disaster events that happened between 2016 and 2019. Disaster events consist of earthquakes/cyclones, floods, hurricanes and wildfires.

Where is it?

paper

CrisisNLP

Description of the dataset The HumAID Twitter dataset consists of several thousands of manually annotated tweets that…

crisisnlp.qcri.org

RIP to one of the realest to ever do it…

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat

Quantum Stat

Published via Towards AI
