The NLP Cypher | 04.25.21
Last Updated on July 24, 2023 by Editorial Team
Author(s): Ricky Costa
Originally published on Towards AI.
NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
The NLP Cypher | 04.25.21
No Quarter
In previous releases of the NLP Cypher, I've signaled my desire for cryptic clues and puzzles as reasoning challenges for NLP models to chew on. The rationale: if NLP models can come close to human performance on a task that requires n-order logic to solve, it could open a new avenue in performance beyond mere pattern recognition of syntax (which is the current hack of SOTA language models).
A new paper this week delivered a dataset full of cryptic crossword clues (not to be confused with traditional crossword puzzles) along with benchmark results using the T5 model. While this research won't have a direct impact on applied deep learning in the enterprise any time soon, any outcome that brings NLP models closer to human-level performance on these types of tasks will be an important step for artificial intelligence.
jsrozner/decrypt
Repository for paper Decrypting Cryptic Crosswords – jsrozner/decrypt
github.com
Prompt Tuning | The Next Frontier in NLP Training?
Important paper to read if you train models. It shows how fine-tuning's days may be numbered, as prompt tuning may be a more efficient method for tuning large language models. Prompt tuning keeps the model frozen and tunes only the prompt (as opposed to model tuning, a.k.a. fine-tuning, which alters the entire model).
This means we don't have to train a new copy of a model for every new NLP task! The paper shows how previous hurdles with prompt tuning were overcome to make it on par or competitive with traditional fine-tuning. In addition, prompt tuning's simplicity can be helpful in combating domain shift. 🧐
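To make the frozen-model idea concrete, here is a toy sketch in plain Python. It is not the paper's code: the "model" is a made-up fixed linear scorer (`FROZEN_W`), the prompt is assumed to be a small trainable vector pooled with the input, and the data is invented. The only point is that gradient updates touch the prompt and nothing else.

```python
import math

# Frozen "model": a fixed linear scorer; these weights are never updated.
FROZEN_W = (0.5, -1.0, 0.25)

def forward(prompt, x):
    """Probability from the frozen scorer applied to the pooled [prompt; input]."""
    pooled = [(p + xi) / 2.0 for p, xi in zip(prompt, x)]
    logit = sum(w * h for w, h in zip(FROZEN_W, pooled))
    return 1.0 / (1.0 + math.exp(-logit))

def bce_loss(prompt, data):
    """Mean binary cross-entropy over (input, label) pairs."""
    total = 0.0
    for x, y in data:
        p = forward(prompt, x)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)

def tune_prompt(data, steps=200, lr=0.5):
    """Gradient descent on the prompt only; FROZEN_W is left untouched."""
    prompt = [0.0] * len(FROZEN_W)
    for _ in range(steps):
        grad = [0.0] * len(prompt)
        for x, y in data:
            err = forward(prompt, x) - y  # d(loss)/d(logit) for cross-entropy
            for i, w in enumerate(FROZEN_W):
                # d(logit)/d(prompt_i) = w_i / 2 because of the mean pooling
                grad[i] += err * w * 0.5 / len(data)
        prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt

# Toy task (made-up data): every input should be labelled 1.
DATA = [([0.1, 0.2, 0.0], 1), ([0.0, 0.1, 0.3], 1), ([0.2, 0.0, 0.1], 1)]

if __name__ == "__main__":
    tuned = tune_prompt(DATA)
    print("loss with zero prompt: ", bce_loss([0.0, 0.0, 0.0], DATA))
    print("loss with tuned prompt:", bce_loss(tuned, DATA))
```

A second task would get its own prompt vector while reusing the same `FROZEN_W`, which is the "no new copy of the model per task" benefit in miniature.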
This paper reminded me of what is going on with multi-task learning, where adapters take a "similar" approach: the model stays frozen and only the adapter is fine-tuned for each task (this isn't prompt tuning, but it is similar in the sense of using a frozen model for efficiency/flexibility). Afterwards, you can stack adapters for various tasks and use the frozen model for its embeddings. I interviewed Jonas Pfeiffer, lead author of Adapter Hub, in a previous NLP Cypher.
Lightning-Transformers
PyTorch Lightning merged the Transformers library and the Hydra framework. 🥶 It's fairly simple to use.
Here's the elevator pitch:
- Train using HuggingFace Transformers models and datasets with Lightning custom Callbacks, Loggers, Accelerators and high performance scaling.
- Seamless Memory and Speed Optimizations such as DeepSpeed ZeRO or FairScale Sharded Training with no code changes.
- Powerful config composition backed by Hydra: easily swap out models, optimizers, schedulers and many more configurations without touching the code.
- Transformer Task Abstraction for Rapid Research & Experimentation: built from the ground up to be task agnostic, the library supports creating transformer tasks across all modalities with little friction.
Code
PyTorchLightning/lightning-transformers
Option 1: from PyPI pip install lightning-transformers # instead of: `python train.py …`, run with…
github.com
Docs
Lightning Transformers – Lightning Transformers documentation
Lightning Transformers offers a flexible interface for training and fine-tuning SOTA Transformer models using the…
lightning-transformers.readthedocs.io
Blog
Training Transformers at Scale With PyTorch Lightning
Introducing Lightning Transformers, a new library that seamlessly integrates PyTorch Lightning, HuggingFace…
pytorch-lightning.medium.com
NLP CookBook
Educational paper on the state of the art of transformers, comparing the performance and use-cases of different variants.
https://arxiv.org/ftp/arxiv/papers/2104/2104.10640.pdf
NLP in Finance
This Refinitiv Labs blog highlights their thinking behind a custom version of BERT pre-trained on the Reuters News archive.
NLP: Unlock value in financial services terminology | Refinitiv Perspectives
Refinitiv Labs have pre-trained a natural language processing (NLP) model with financial and business news so that it…
perspectives.refinitiv.com
How Often Do People Copy and Paste Stack Overflow?
They promised us flying cars; instead we got: "One out of every four users who visits a Stack Overflow question copies something within five minutes of hitting the page."
Check out this blog if you want to find out how much we all suck at coding.
😭😭😭😭
How often do people actually copy and paste from Stack Overflow? Now we know. – Stack Overflow Blog
They say there's a kernel of truth behind every joke. In the case of our recent April Fools gag, it might be more like…
stackoverflow.blog
PyTerrier | Information Retrieval Library
An information retrieval library using the Java-based Terrier IR platform internally to support indexing and retrieval operations. These are some of the features for neural ranking/dense retrieval:
BERT (through OpenNIR), T5, ColBERT, ANCE, DeepCT and doc2query.
-OpenNIR: [Github] [Documentation]
-PyTerrier_ANCE: [Github] - dense retrieval
-PyTerrier_ColBERT: [Github] - dense retrieval and/or neural reranking
-PyTerrier_T5: [Github] - neural reranking
-PyTerrier_doc2query: [Github] - neural augmented indexing
-PyTerrier_DeepCT: [Github] - neural augmented indexing
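For readers new to the retrieve-then-rerank pattern these plugins slot into, here is a rough stdlib-only sketch. To be clear, this is not PyTerrier's API: the corpus, the term-count first stage, and `toy_reranker` (a Jaccard-overlap stand-in for a neural scorer) are all made up for illustration.

```python
from collections import Counter

# Made-up three-document corpus.
DOCS = {
    "d1": "terrier is an information retrieval platform written in java",
    "d2": "neural rerankers rescore a short candidate list",
    "d3": "dense retrieval encodes queries and documents as vectors",
}

def first_stage(query, docs, k=2):
    """Cheap lexical retrieval: rank documents by query-term occurrences."""
    q_terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        tf = Counter(text.split())
        score = sum(tf[t] for t in q_terms)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [doc_id for doc_id, _ in scored[:k]]

def toy_reranker(query, text):
    """Hypothetical stand-in for a neural scorer: Jaccard term overlap."""
    q, d = set(query.lower().split()), set(text.split())
    return len(q & d) / len(q | d)

def retrieve_then_rerank(query, docs, k=2):
    """Rescore only the first-stage candidates with the (more expensive) reranker."""
    candidates = first_stage(query, docs, k)
    return sorted(candidates, key=lambda doc_id: (-toy_reranker(query, docs[doc_id]), doc_id))

if __name__ == "__main__":
    print(retrieve_then_rerank("dense retrieval vectors", DOCS))
```

The design point is that the expensive scorer only ever sees the short candidate list, which is why neural rerankers are typically placed after a fast first-stage retriever.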
terrier-org/pyterrier
A Python API for Terrier The easiest way to get started with PyTerrier is to use one of our Colab notebooks – look for…
github.com
Deep Sparse Inference
Mark Kurtz from Neural Magic goes over NM's Deep Sparse inference engine for sparse models. The engine ingests models in ONNX format and then runs them incredibly fast on CPUs. Hopefully NLP will be coming soon… 🤞. (What I just described is towards the end of the video; the beginning is a cool intro from Nir Shavit on sparsity and why it's awesome.)
AI Adoption in the Enterprise 2021
O'Reilly's Survey is out, here's the goods:
- We had almost three times as many responses as last year, with similar efforts at promotion. More people are working with AI.
- In the past, company culture has been the most significant barrier to AI adoption. While it's still an issue, culture has dropped to fourth place.
- This year, the most significant barrier to AI adoption is the lack of skilled people and the difficulty of hiring. That shortage has been predicted for several years; we're finally seeing it.
- The second-most significant barrier was the availability of quality data. That realization is a sign that the field is growing up.
- The percentage of respondents reporting "mature" practices has been roughly the same for the last few years. That isn't surprising, given the increase in the number of respondents: we suspect many organizations are just beginning their AI projects.
- The retail industry sector has the highest percentage of mature practices; education has the lowest. But education also had the highest percentage of respondents who were "considering" AI.
- Relatively few respondents are using version control for data and models. Tools for versioning data and models are still immature, but they're critical for making AI results reproducible and reliable.
AI Adoption in the Enterprise 2021
During the first weeks of February, we asked recipients of our Data and AI Newsletters to participate in a survey on AI…
www.oreilly.com
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
BEIR | Zero-Shot Benchmark for Information Retrieval Tasks
The benchmark includes 17 datasets and 9 tasks across diverse domains, with 9 SOTA retrieval models evaluated in a zero-shot setup.
UKPLab/beir
BEIR is a heterogeneous benchmark containing diverse IR tasks. It also provides a common and easy framework for…
github.com
Connected Papers 📈
ELECTRAMed
A pre-trained domain-specific language model suited for the biomedical field. It sets a state-of-the-art result on the BC5CDR corpus for named entity recognition, and provides the best outcome in 2 out of the 5 runs of the 7th BioASQ-factoid Challenge for the question answering task.
gmpoli/electramed
A new pre-trained language representation model for biomedical NLP Motivation The overwhelming amount of biomedical…
github.com
Connected Papers 📈
Retrieve Write Slot Filling
Uses RAG to improve on the slot-filling task (i.e. given an entity query in the form of [ENTITY, SLOT, ?], the task asks the model to generate the missing slot value). This allows for automating the creation of knowledge graphs.
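As a rough picture of the [ENTITY, SLOT, ?] format and how answered queries turn into knowledge-graph triples, here is a small sketch. It does not come from the IBM repo: `SlotQuery`, `TOY_FACTS`, and `fill_slot` are hypothetical names, and the dictionary lookup is only a placeholder for the retrieve-and-generate step a RAG model would perform.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

class SlotQuery(NamedTuple):
    """An [ENTITY, SLOT, ?] query: the value is what the system must produce."""
    entity: str
    slot: str

# Tiny hand-written fact table standing in for retrieval + generation.
TOY_FACTS: Dict[Tuple[str, str], str] = {
    ("Marie Curie", "place of birth"): "Warsaw",
    ("Marie Curie", "field of work"): "physics",
}

def fill_slot(query: SlotQuery) -> Optional[str]:
    """Placeholder for a RAG model: returns the slot value, or None if unknown."""
    return TOY_FACTS.get((query.entity, query.slot))

def build_triples(queries: List[SlotQuery]) -> List[Tuple[str, str, str]]:
    """Turn answered queries into (entity, slot, value) knowledge-graph triples."""
    triples = []
    for query in queries:
        value = fill_slot(query)
        if value is not None:  # unanswered queries add nothing to the graph
            triples.append((query.entity, query.slot, value))
    return triples

if __name__ == "__main__":
    queries = [
        SlotQuery("Marie Curie", "place of birth"),
        SlotQuery("Marie Curie", "spouse"),  # not in the toy table
    ]
    print(build_triples(queries))
```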
IBM/retrieve-write-slot-filling
This is the code for our KILT leaderboard submission to the T-REx and zsRE tasks. It includes code for training a DPR…
github.com
Connected Papers 📈
Condenser
A general pre-training architecture based on Transformer LMs, to improve dense optimization readiness. Currently supported architectures include all models with BERT or RoBERTa architecture.
luyug/Condenser
Code for converting a pre-trained Transformer encoder LM into Condenser, a Transformer architecture specialized in…
github.com
Connected Papers 📈
Text2App
A framework that allows users to create functional Android applications from natural language specifications. 🙈
Masum06/Text2App
github.com
Connected Papers 📈
ProphetNet-X
ProphetNet pretrained models for Chinese, English Dialog, Chinese Dialog, Multi-lingual, and Code Generation.
microsoft/ProphetNet
This repo provides the code for reproducing the experiments in ProphetNet. In the paper, we propose a new pre-trained…
github.com
Connected Papers 📈
LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding
A general-purpose pretraining methodology that exploits both the structure and the content of documents, and considers multimedia content, such as images, to learn a comprehensive multimodal document representation.
Paper: https://arxiv.org/pdf/2104.08405.pdf
Connected Papers 📈
Dataset of the Week: GooAQ 🥑: Google Answers to Google Questions!
What is it?
A question answering dataset containing over 5 million questions and 3 million answers collected from Google. GOOAQ questions are collected semi-automatically from the Google search engine using its autocomplete feature.
Sample
{
    "id": 5009708,
    "question": "carbon dioxide comprises approximately what percentage of tropospheric gases?",
    "short_answer": "04%",
    "answer": "Carbon dioxide comprise approximately . 04% of tropospheric gases.",
    "answer_type": "feat_snip"
}
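If you want to poke at records shaped like the sample above, a few lines of Python are enough. This sketch assumes the data is stored as one JSON object per line, which is an assumption on my part rather than something stated here; `SAMPLE_LINES` simply reuses the sample record verbatim, and the helper names are made up.

```python
import json
from collections import Counter

# The sample record from above, written as a single JSON line.
SAMPLE_LINES = [
    '{"id": 5009708, '
    '"question": "carbon dioxide comprises approximately what percentage of tropospheric gases?", '
    '"short_answer": "04%", '
    '"answer": "Carbon dioxide comprise approximately . 04% of tropospheric gases.", '
    '"answer_type": "feat_snip"}',
]

def load_records(lines):
    """Parse one JSON object per non-empty line (assumed storage format)."""
    return [json.loads(line) for line in lines if line.strip()]

def count_answer_types(records):
    """Tally how many records fall under each answer_type."""
    return Counter(record["answer_type"] for record in records)

if __name__ == "__main__":
    records = load_records(SAMPLE_LINES)
    print(records[0]["question"])
    print(count_answer_types(records))
```

Swapping `SAMPLE_LINES` for an open file handle would stream a real file the same way, provided the one-object-per-line assumption holds.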
Where is it?
allenai/gooaq
This repository contains the code/data accompanying our recent work on long-form question answering. NOTE This dataset…
github.com
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
For complete coverage, follow our Twitter: @Quantum_Stat
Published via Towards AI