The NLP Cypher | 05.23.21
Last Updated on July 24, 2023 by Editorial Team
Author(s): Ricky Costa
Originally published on Towards AI.
NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
The NLP Cypher | 05.23.21
Overtime
Hey, welcome back! Another week has gone by, and so much code and research has been released into the wild.
Oh, and btw, the NLP Index is on 🔥🔥🔥, and I want to thank all contributors!
Here's a quick glimpse at the awesome contributions: a collection of Spanish medical NLP datasets brought to you by Salvador Lima in Barcelona. 🙌🙌 I will update the NLP Index with these and other assets by tomorrow.
Cantemist (oncology clinical cases for cancer text mining): https://zenodo.org/record/3978041
PharmaCoNER (Pharmacological Substances, Compounds and proteins in Spanish clinical case reports): https://zenodo.org/record/4270158
CodiEsp (Abstracts from Lilacs and Ibecs with ICD10 codes): https://zenodo.org/record/3606662
MEDDOCAN (Medical Document Anonymization): https://zenodo.org/record/4279323
MESINESP2 (Medical Semantic Indexing): https://zenodo.org/record/4722925
Wav2vec-U: Unsupervised Speech Recognition 😍
This new FAIR model doesn't need transcriptions to learn speech recognition. It just needs unlabeled speech recordings and unpaired text. They used a GAN to help discriminate phonemes (the sounds of a language). While Wav2vec-U doesn't achieve SOTA on the Librispeech benchmark, it still gets a pretty good score given that it didn't require the 960 hours of transcribed speech data. 👀
Blog:
wav2vec Unsupervised: Speech recognition without supervision
To enable speech recognition technology for many more languages spoken around the globe, Facebook AI is releasing…
ai.facebook.com
Code:
pytorch/fairseq
Wav2vec Unsupervised (wav2vec-U) is a framework for building speech recognition systems without any labeled training…
github.com
Polars Dataframes 😁
If you use dataframes often, you should check out Polars. It's an awesome dataframe library written in Rust, with Python bindings. It comes with Arrow support in all its glory, including Parquet file and AWS S3 IO support.
pola-rs/polars
Polars is a blazingly fast DataFrames library implemented in Rust using Apache Arrow as memory model. Lazy | eager…
github.com
Docs:
Polars – User Guide
This book is an introduction to the Polars DataFrame library. Its goal is to explain the inner workings of Polars by…
pola-rs.github.io
Universiteit van Amsterdam | Notebooks and Tutorials
The University of Amsterdam has a sweet collection of Colab notebooks covering various domains, including GNNs, Transformers and computer vision.
Here's their TOC:
Tutorial 2: Introduction to PyTorch
Tutorial 3: Activation functions
Tutorial 4: Optimization and Initialization
Tutorial 5: Inception, ResNet and DenseNet
Tutorial 6: Transformers and Multi-Head Attention
Tutorial 7: Graph Neural Networks
Tutorial 8: Deep Energy Models
Tutorial 9: Autoencoders
Tutorial 10: Adversarial Attacks
Tutorial 11: Normalizing Flows
Tutorial 12: Autoregressive Image Modeling
Welcome to the UvA Deep Learning Tutorials! – UvA DL Notebooks v1.0 documentation
For this year's course edition, we created a series of Jupyter notebooks that are designed to help you understanding…
uvadlc-notebooks.readthedocs.io
KELM | Converting WikiData to Natural Language
Google introduces the KELM dataset in a huge win for the factoid nerds. The dataset is a Wikidata knowledge graph converted into natural language, with the idea of using the corpus to improve the factual knowledge in pretrained models! A T5 model was used for this conversion. The corpus consists of ~18M sentences spanning ~45M triples and ~1500 relations.
KELM: Integrating Knowledge Graphs with Language Model Pre-training Corpora
Large pre-trained natural language processing (NLP) models, such as BERT, RoBERTa, GPT-3, T5 and REALM, leverage…
ai.googleblog.com
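To make the triple-to-text idea concrete, here is a minimal sketch of verbalizing knowledge graph triples. It is a naive, template-based stand-in and not how KELM was built (that used a T5 model, as noted above). The `TEMPLATES` table, the `verbalize` helper and the example triples are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical hand-written templates. KELM used a T5 model for this step;
# templates are only a simple stand-in to show the input/output shape.
TEMPLATES: Dict[str, str] = {
    "date of birth": "{subject} was born on {object}.",
    "occupation": "{subject} worked as a {object}.",
}

# Generic fallback for relations without a dedicated template.
FALLBACK = "{subject} has {relation} {object}."


def verbalize(triple: Triple) -> str:
    """Turn one (subject, relation, object) triple into a sentence."""
    subject, relation, obj = triple
    template = TEMPLATES.get(relation, FALLBACK)
    # str.format ignores keyword arguments the template does not use.
    return template.format(subject=subject, relation=relation, object=obj)


triples: List[Triple] = [
    ("Marie Curie", "date of birth", "7 November 1867"),
    ("Marie Curie", "occupation", "physicist"),
    ("Marie Curie", "country of citizenship", "Poland"),
]

sentences = [verbalize(t) for t in triples]
for sentence in sentences:
    print(sentence)
```

The fallback sentence reads awkwardly, which hints at why a learned generator is attractive once you have ~1500 relations to cover.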
Talkin' about knowledge graphs…
An Introduction to Knowledge Graphs
Knowledge Graphs (KGs) have emerged as a compelling abstraction for organizing the world's structured knowledge, and as…
ai.stanford.edu
No Trash Search!
No Trash Search
notrashsearch.github.io
LabML.AI Annotated PyTorch Papers
Learn from academic papers annotated with their corresponding code. Pretty cool if you want to decipher research.
labml.ai Annotated PyTorch Paper Implementations
This is a collection of simple PyTorch implementations of neural networks and related algorithms. These implementations…
nn.labml.ai
Completely Normal (aka not suspect) Task
applicaai/kleister-charity
The goal of this task is to retrieve charity address (but not other addresses), charity number, charity name and its…
github.com
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
Measuring Coding Challenge Competence With APPS
A benchmark for code generation.
Check out the GPT-Neo results when compared to GPT-2/3, very interesting.
hendrycks/apps
This is the repository for Measuring Coding Challenge Competence With APPS by Dan Hendrycks*, Steven Basart*, Saurav…
github.com
Connected Papers 📈
wikipiifed – Automated Dataset Creation and Federated Learning
A repo for automating dataset creation from Wikipedia biography pages and using the dataset for federated learning of a BERT-based named entity recognizer.
ratmcu/wikipiifed
This repo represent the automated dataset creation from wikipedia biography pages and utilizing the dataset for…
github.com
Connected Papers 📈
OpenMEVA Benchmark
OpenMEVA is a benchmark for evaluating open-ended story generation.
thu-coai/OpenMEVA
Contributed by Jian Guan, Zhexin Zhang. Thank Jiaxin Wen for DeBugging. OpenMEVA is a benchmark for evaluating…
github.com
Connected Papers 📈
KLUE: Korean Language Understanding Evaluation
The KLUE benchmark is composed of 8 tasks:
- Topic Classification (TC)
- Semantic Textual Similarity (STS)
- Natural Language Inference (NLI)
- Named Entity Recognition (NER)
- Relation Extraction (RE)
- (Part-Of-Speech) + Dependency Parsing (DP)
- Machine Reading Comprehension (MRC)
- Dialogue State Tracking (DST)
KLUE-benchmark/KLUE
The KLUE is introduced to make advances in Korean NLP. Korean pre-trained language models(PLMs) have appeared to solve…
github.com
Connected Papers 📈
Contextual Machine Translation
Context-aware models for document-level machine translation. Also includes SCAT, an English-French dataset comprising supporting context words for 14K translations that professional translators found useful for pronoun disambiguation.
Most MT models work at the sentence level, so this is an interesting repo for those looking to move to the document level.
neulab/contextual-mt
Implementations of context-aware models for document-level translation tasks, used in Measuring and Increasing Context…
github.com
Connected Papers 📈
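For a feel of what moving from sentence level to document level can look like on the input side, here is a minimal sketch of a common concatenation approach: each source sentence is prefixed with a few preceding sentences. This is an illustration only and not necessarily how the neulab repo does it. The `<brk>` separator and the `add_context` helper are hypothetical names.

```python
from typing import List

BREAK = "<brk>"  # hypothetical separator token between context and current sentence


def add_context(sentences: List[str], context_size: int = 1, sep: str = BREAK) -> List[str]:
    """Prefix each sentence with up to `context_size` preceding sentences."""
    inputs = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - context_size)
        context = sentences[start:i]  # empty for the first sentence
        inputs.append(f" {sep} ".join(context + [sentence]))
    return inputs


doc = ["Marie bought a car.", "It was red.", "She loves it."]
model_inputs = add_context(doc, context_size=1)
for line in model_inputs:
    print(line)
```

With the previous sentence attached, a model translating "It was red." into French at least has the chance to see that "It" refers to a car ("voiture", feminine) and pick "elle". That is the kind of pronoun disambiguation the SCAT annotations target.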
Dataset of the Week: Few-NERD
What is it?
Few-NERD is a large-scale, fine-grained, manually annotated named entity recognition dataset. It contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities and 4,601,223 tokens. Three benchmark tasks are built on it: one supervised, Few-NERD (SUP), and two few-shot, Few-NERD (INTRA) and Few-NERD (INTER).
Sample (in typical NER format)
Between O
1789 O
and O
1793 O
he O
sat O
on O
a O
committee O
reviewing O
the O
administrative MISC-law
constitution MISC-law
of MISC-law
Galicia MISC-law
to O
little O
effect O
. O
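If you want to poke at this format yourself, here is a minimal sketch that reads the sample above and groups tagged tokens into entity spans. It assumes one whitespace-separated "token label" pair per line, as shown in the sample, and it treats any run of consecutive tokens with the same non-O label as one entity. The helper names `parse_tagged_lines` and `extract_entities` are hypothetical and not part of the Few-NERD codebase.

```python
from typing import List, Tuple

SAMPLE = """\
Between O
1789 O
and O
1793 O
he O
sat O
on O
a O
committee O
reviewing O
the O
administrative MISC-law
constitution MISC-law
of MISC-law
Galicia MISC-law
to O
little O
effect O
. O
"""


def parse_tagged_lines(text: str) -> List[Tuple[str, str]]:
    """Split 'token label' lines into (token, label) pairs, skipping blank lines."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        token, label = line.rsplit(maxsplit=1)
        pairs.append((token, label))
    return pairs


def extract_entities(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group consecutive tokens that share the same non-O label into one span."""
    entities = []
    current_tokens: List[str] = []
    current_label = None
    for token, label in pairs:
        if label != "O" and label == current_label:
            current_tokens.append(token)
            continue
        if current_tokens:  # the label changed, so close the open span
            entities.append((" ".join(current_tokens), current_label))
        if label == "O":
            current_tokens, current_label = [], None
        else:
            current_tokens, current_label = [token], label
    if current_tokens:  # close a span that runs to the end of the sentence
        entities.append((" ".join(current_tokens), current_label))
    return entities


pairs = parse_tagged_lines(SAMPLE)
entities = extract_entities(pairs)
print(entities)
```

One caveat with this grouping rule: because the labels carry no begin/inside markers, two adjacent entities of the same type would be merged into a single span.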
Where is it?
thunlp/Few-NERD
This is the source code of the ACL-IJCNLP 2021 paper: Few-NERD: A Few-shot Named Entity Recognition Dataset. Check out…
github.com
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
For complete coverage, follow our Twitter: @Quantum_Stat
Published via Towards AI