
The NLP Cypher | 04.25.21

Last Updated on July 24, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.

St. Michael Overwhelming the Demon | Raphael

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER


No Quarter

In previous releases of the NLP Cypher, I’ve signaled my desire for cryptic clues and puzzles as reasoning challenges for NLP models to chew on. The rationale is that if NLP models can come close to human performance on a task that requires n-order logic to solve, a new avenue in performance could open up beyond mere pattern recognition of syntax (which is the current hack of SOTA language models).

A new paper this week delivered a new dataset full of cryptic crossword (not to be confused with a traditional crossword puzzle) clues and benchmark results using the T5 model. While this research won’t have a direct impact on applied deep learning in the enterprise any time soon, any outcome that brings NLP models closer to human-level performance on these types of tasks will be an important step for artificial intelligence.

paper

jsrozner/decrypt

Repository for paper Decrypting Cryptic Crosswords – jsrozner/decrypt

github.com

Prompt Tuning | The Next Frontier in NLP Training?

An important paper to read if you train models. It shows how fine-tuning’s days may be numbered, as prompt tuning may be a more efficient method for adapting large language models. Prompt tuning keeps the model frozen and tunes only the prompts (as opposed to model tuning, a.k.a. fine-tuning, which alters the entire model).

This means we don’t have to train a new copy of a model for every new NLP task! This paper shows how previous hurdles involved with prompt tuning were overcome to make it on par with, or at least competitive with, traditional fine-tuning. In addition, prompt tuning’s simplicity can be helpful in combating domain shift. 🧐
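To make the frozen-model idea concrete, here is a toy sketch in plain Python. It is not the paper's implementation: the two-weight linear "model", the learned prompt vectors (often called soft prompts), and every name and number below are hypothetical, chosen only to show that training touches the prompt and nothing else.

```python
# Toy illustration of prompt tuning: the "model" weights stay frozen and
# only a small set of prompt vectors prepended to the input is trained.
# Everything here (names, sizes, the linear scorer) is hypothetical.

FROZEN_WEIGHTS = (0.5, -1.0)  # stands in for a pretrained model; never updated


def score(prompt, inputs):
    """Frozen model: dot product of the weights with the mean of all vectors."""
    vectors = prompt + inputs
    n = len(vectors)
    mean = [sum(v[d] for v in vectors) / n for d in range(len(FROZEN_WEIGHTS))]
    return sum(w * m for w, m in zip(FROZEN_WEIGHTS, mean))


def tune_prompt(inputs, target, prompt_len=2, lr=0.5, steps=200):
    """Gradient descent on squared error, updating the prompt vectors only."""
    prompt = [[0.0, 0.0] for _ in range(prompt_len)]
    n = prompt_len + len(inputs)
    for _ in range(steps):
        error = score(prompt, inputs) - target
        for vec in prompt:
            for d, w in enumerate(FROZEN_WEIGHTS):
                # d(loss)/d(vec[d]) = 2 * error * w / n
                vec[d] -= lr * 2 * error * w / n
    return prompt


task_inputs = [[1.0, 0.0], [0.0, 1.0]]  # pretend token embeddings
task_prompt = tune_prompt(task_inputs, target=1.0)
```

Each new task would get its own small `task_prompt`, while `FROZEN_WEIGHTS` is shared by all of them, which is where the storage savings come from.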

This paper reminded me of what is going on with multi-task learning, where adapters have a ‘similar’ approach in using a frozen model and only the adapter is fine-tuned for each task (this isn’t prompt tuning but similar in the sense of using a frozen model for efficiency/flexibility). Afterwards, you can stack adapters for various tasks and use the frozen model for its embeddings. I interviewed the lead author Jonas Pfeiffer of Adapter Hub in a previous NLP Cypher.

LINK

Lightning-Transformers

PyTorch Lightning merged the Transformers library and the Hydra framework. 🥶 It’s fairly simple to use.

Here’s the elevator pitch:

  • Train using HuggingFace Transformers models and datasets with Lightning custom Callbacks, Loggers, Accelerators and high performance scaling.
  • Seamless Memory and Speed Optimizations such as DeepSpeed ZeRO or FairScale Sharded Training with no code changes.
  • Powerful config composition backed by Hydra — Easily swap out models, optimizers, schedulers and many more configurations without touching the code.
  • Transformer Task Abstraction for Rapid Research & Experimentation — Built from the ground up to be task agnostic, the library supports creating transformer tasks across all modalities with little friction.

Code

PyTorchLightning/lightning-transformers

Option 1: from PyPI pip install lightning-transformers # instead of: `python train.py …`, run with…

github.com

Docs

Lightning Transformers – Lightning Transformers documentation

Lightning Transformers offers a flexible interface for training and fine-tuning SOTA Transformer models using the…

lightning-transformers.readthedocs.io

Blog

Training Transformers at Scale With PyTorch Lightning

Introducing Lightning Transformers, a new library that seamlessly integrates PyTorch Lightning, HuggingFace…

pytorch-lightning.medium.com

NLP CookBook

Educational paper on the state of the art of transformers, comparing the performance and use-cases of different variants.

https://arxiv.org/ftp/arxiv/papers/2104/2104.10640.pdf

NLP in Finance

This Refinitiv Labs blog highlights their thinking behind a custom version of BERT pretrained on the Reuters News archive.

NLP: Unlock value in financial services terminology | Refinitiv Perspectives

Refinitiv Labs have pre-trained a natural language processing (NLP) model with financial and business news so that it…

perspectives.refinitiv.com

How Often Do People Copy and Paste Stack Overflow?

They promised us flying cars, instead we got “One out of every four users who visits a Stack Overflow question copies something within five minutes of hitting the page.”

Check out this blog if you want to find out how much we all suck at coding.

😭😭😭😭

How often do people actually copy and paste from Stack Overflow? Now we know. – Stack Overflow Blog

They say there's a kernel of truth behind every joke. In the case of our recent April Fools gag, it might be more like…

stackoverflow.blog

PyTerrier | Information Retrieval Library

An information retrieval library using the Java-based Terrier IR platform internally to support indexing and retrieval operations. These are some of the features for neural ranking/dense retrieval:

BERT (through OpenNIR), T5, ColBERT, ANCE, DeepCT and doc2query.

-OpenNIR: [Github] [Documentation]

-PyTerrier_ANCE: [Github] — dense retrieval

-PyTerrier_ColBERT: [Github] — dense retrieval and/or neural reranking

-PyTerrier_T5: [Github] — neural reranking

-PyTerrier_doc2query: [Github] — neural augmented indexing

-PyTerrier_DeepCT: [Github] — neural augmented indexing

terrier-org/pyterrier

A Python API for Terrier The easiest way to get started with PyTerrier is to use one of our Colab notebooks – look for…

github.com

Deep Sparse Inference

Mark Kurtz from Neural Magic goes over NM’s Deep Sparse inference engine for sparse models. The engine expects models in ONNX format and then runs them incredibly fast on CPUs. Hopefully NLP will be coming soon… 🤞 (What I just described is towards the end of the video; the beginning is a cool intro by Nir Shavit on sparsity and why it’s awesome.)

AI Adoption in the Enterprise 2021

O’Reilly’s Survey is out, here’s the goods:

We had almost three times as many responses as last year, with similar efforts at promotion. More people are working with AI.

In the past, company culture has been the most significant barrier to AI adoption. While it’s still an issue, culture has dropped to fourth place.

This year, the most significant barrier to AI adoption is the lack of skilled people and the difficulty of hiring. That shortage has been predicted for several years; we’re finally seeing it.

The second-most significant barrier was the availability of quality data. That realization is a sign that the field is growing up.

The percentage of respondents reporting “mature” practices has been roughly the same for the last few years. That isn’t surprising, given the increase in the number of respondents: we suspect many organizations are just beginning their AI projects.

The retail industry sector has the highest percentage of mature practices; education has the lowest. But education also had the highest percentage of respondents who were “considering” AI.

Relatively few respondents are using version control for data and models. Tools for versioning data and models are still immature, but they’re critical for making AI results reproducible and reliable.

AI Adoption in the Enterprise 2021

During the first weeks of February, we asked recipients of our Data and AI Newsletters to participate in a survey on AI…

www.oreilly.com

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

BEIR | Zero-Shot Benchmark for Information Retrieval Tasks

The benchmark includes 17 datasets and 9 tasks across diverse domains. Nine SOTA retrieval models are evaluated in a zero-shot setup.

UKPLab/beir

BEIR is a heterogeneous benchmark containing diverse IR tasks. It also provides a common and easy framework for…

github.com

Connected Papers 📈

ELECTRAMed

A pre-trained domain-specific language model suited for the biomedical field. It sets a state-of-the-art result on the BC5CDR corpus for named entity recognition, and provides the best outcome in 2 of the 5 runs of the 7th BioASQ-factoid Challenge for the question answering task.

gmpoli/electramed

A new pre-trained language representation model for biomedical NLP Motivation The overwhelming amount of biomedical…

github.com

Connected Papers 📈

Retrieve Write Slot Filling

Uses RAG to improve on the slot-filling task (i.e., given an entity query in the form [ENTITY, SLOT, ?], the task asks the model to generate the missing slot value). This allows for automating the creation of knowledge graphs.
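A minimal sketch of what that query shape looks like in code, and how filled slots accumulate into a small knowledge graph. This is not the repo's code: the `Triple` type, the `fill_slot` and `build_graph` helpers, and the lookup table standing in for the retrieve-then-generate model are all hypothetical.

```python
# Sketch of the slot-filling interface: a query [ENTITY, SLOT, ?] goes in and
# a completed (entity, slot, value) triple comes out. The lookup table below
# is a hypothetical stand-in for the retrieve-then-generate model.
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    entity: str
    slot: str
    value: str


# Hypothetical "retrieved evidence"; a real system would query a passage index.
TOY_EVIDENCE = {
    ("Marie Curie", "place of birth"): "Warsaw",
    ("Marie Curie", "field of work"): "physics",
}


def fill_slot(entity: str, slot: str) -> Optional[Triple]:
    """Return the completed triple, or None when no evidence supports a value."""
    value = TOY_EVIDENCE.get((entity, slot))
    return Triple(entity, slot, value) if value is not None else None


def build_graph(queries):
    """Collect filled triples into an adjacency map: entity -> {slot: value}."""
    graph = {}
    for entity, slot in queries:
        triple = fill_slot(entity, slot)
        if triple is not None:
            graph.setdefault(triple.entity, {})[triple.slot] = triple.value
    return graph


graph = build_graph([
    ("Marie Curie", "place of birth"),
    ("Marie Curie", "field of work"),
    ("Marie Curie", "spouse"),  # no evidence in the toy table, so it is skipped
])
```

Returning `None` for unsupported queries keeps unverified values out of the graph, which matters when the graph is built automatically.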

IBM/retrieve-write-slot-filling

This is the code for our KILT leaderboard submission to the T-REx and zsRE tasks. It includes code for training a DPR…

github.com

Connected Papers 📈

Condenser

A general pre-training architecture based on Transformer LMs, to improve dense optimization readiness. Currently supported architectures include all models with BERT or RoBERTa architecture.

luyug/Condenser

Code for converting a pre-trained Transformer encoder LM into Condenser, a Transformer architecture specialized in…

github.com

Connected Papers 📈

Text2App

A framework that allows users to create functional Android applications from natural language specifications. 🙈

Masum06/Text2App


github.com

Connected Papers 📈

ProphetNet-X

ProphetNet pretrained models for Chinese, English Dialog, Chinese Dialog, Multi-lingual, and Code Generation.

microsoft/ProphetNet

This repo provides the code for reproducing the experiments in ProphetNet . In the paper, we propose a new pre-trained…

github.com

Connected Papers 📈

LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding

A general-purpose pretraining methodology that exploits both the structure and the content of documents, and considers multimedia content, such as images, to learn a comprehensive multimodal document representation.

Paper: https://arxiv.org/pdf/2104.08405.pdf

Connected Papers 📈

Dataset of the Week: GooAQ 🥑: Google Answers to Google Questions!

What is it?

A question answering dataset containing over 5 million questions and 3 million answers collected from Google. GOOAQ questions are collected semi-automatically from the Google search engine using its autocomplete feature.

Sample

{
"id": 5009708,
"question": "carbon dioxide comprises approximately what percentage of tropospheric gases?",
"short_answer": "04%",
"answer": "Carbon dioxide comprise approximately . 04% of tropospheric gases.",
"answer_type": "feat_snip"
}
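
If you want to poke at records like this one, a few lines of Python are enough. The sketch below assumes the data is stored as one JSON object per line (an assumption, so check the repo for the actual layout); `load_records` and `by_answer_type` are hypothetical helper names, and the record is copied as-is from the sample above.

```python
import json

# The sample record from above, as one JSON line (JSONL storage is an assumption).
SAMPLE_LINE = (
    '{"id": 5009708, '
    '"question": "carbon dioxide comprises approximately what percentage of tropospheric gases?", '
    '"short_answer": "04%", '
    '"answer": "Carbon dioxide comprise approximately . 04% of tropospheric gases.", '
    '"answer_type": "feat_snip"}'
)


def load_records(lines):
    """Parse an iterable of JSON lines into dictionaries, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def by_answer_type(records, answer_type):
    """Keep only the records whose answer_type field matches."""
    return [r for r in records if r["answer_type"] == answer_type]


records = load_records([SAMPLE_LINE, ""])
snippets = by_answer_type(records, "feat_snip")
print(snippets[0]["question"], "->", snippets[0]["short_answer"])
```

With a real file, passing an open file handle to `load_records` in place of the list should work the same way, since it only iterates over lines.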

Where is it?

allenai/gooaq

This repository contains the code/data accompanying our recent work on long-form question answering. NOTE This dataset…

github.com

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat

Quantum Stat


Published via Towards AI
