The NLP Cypher | 05.09.21

Last Updated on July 24, 2023 by Editorial Team

Author(s): Ricky Costa

Originally published on Towards AI.

Saturn as seen from Mimas | Chesley Bonestell

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 05.09.21

Lost Tales

I mostly know dark.fail as an onion site with a great collection of URLs for parasailing Tor-land (aka the darknet). To be honest, I didn’t even know dark.fail had a clearnet site. And very recently, its clearnet mirror was phished for a total of 4–5 days. 👀

Apparently, a threat actor presented a fake court order to dark.fail’s domain registrar. In return, they obtained access to dark.fail’s hosting and rerouted traffic to the bad actor’s mirrored web page. It phished the page’s URLs with the intention of fooling people into thinking they were buying products on the dark markets when instead the bad actor(s) were pocketing their bitcoin. This has caused a big uproar in the hacking community given dark.fail’s popularity. 🥶

The anonymous owner of dark.fail appeared on a hacker podcast this past weekend to discuss the hijacking and spoke via text-to-speech software to protect their voice identity. You can watch/listen here:

And in other news…

ICLR Residuals…

Google at ICLR 2021

The 9th International Conference on Learning Representations (ICLR 2021), a virtual conference focused on deep…

ai.googleblog.com

Stanford AI Lab Papers and Talks at ICLR 2021

The International Conference on Learning Representations (ICLR) 2021 is being hosted virtually from May 3rd – May 7th…

ai.stanford.edu

Galkin’s Knowledge Graph Review from ICLR

Couldn’t have a conference without getting a Galkin knowledge graph review!

TOC:

Knowledge Graphs @ ICLR 2021

Your guide to the KG-related research in ML, May edition

mgalkin.medium.com

THE NLP Index Update

Since last week, we’ve added ~750 new repos to the index and I’ve included GitHub stars and programming language for each repo.

In addition, we also added nearly 1,000 introductory videos for select assets. Thank you to Amit Chaudhary for the data! 🐱‍👤

Check it out here:

The NLP Index

Top NLP Code Repositories – Quantum Stat

index.quantumstat.com

A Commonsense Knowledge Base Construction

Check out how the Max Planck Institute for Informatics is building commonsense knowledge bases.

This paper introduces 3 systems:

Quasimodo: “an open-source commonsense knowledge base designed to get relevant properties about entities.” site

Dice: “a reasoning framework for deriving refined and expressive commonsense knowledge from existing CSK collections.” site

Ascent: “a pipeline for automatically collecting, extracting and consolidating commonsense knowledge (CSK) from the web.” site

A Large Netflix Dataset

“This dataset combines data sources from Netflix, Rotten Tomatoes, IMBD, posters, box office information, trailers on YouTube, and more using a variety of APIs.” Netflix doesn’t have its own API, so the devs just went nuclear on triangulating Netflix’s data via other sources. 🙉

Last updated April 2021 according to authors.

Latest Netflix data with 26+ joined attributes

Latest, complete Netflix movie dataset created from 4 APIs

www.kaggle.com

Awesome Self-Supervised Learning

Index for all things Self-Supervised Learning across different domains such as vision, NLP, graphs and more.

jason718/awesome-self-supervised-learning

A curated list of awesome self-supervised methods. Contribute to jason718/awesome-self-supervised-learning development…

github.com

For an intuitive intro to self-supervised learning, check out Sergey Ivanov’s blog:

GML In-Depth: three forms of self-supervised learning

Hello and welcome to the graph ML newsletter! This in-depth post is about self-supervised learning (SSL) and its…

graphml.substack.com

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

SUPERB Benchmark for Speech

A collection of benchmarking resources to evaluate the capability of a universal shared representation for speech processing. SUPERB consists of the following:

A benchmark of ten speech processing tasks built on established public datasets,

A benchmark toolkit designed to evaluate and analyze pretrained model performance on various downstream tasks following the conventional evaluation protocols from speech communities,

A public leaderboard for submissions and performance tracking on the benchmark.

SUPERB: Speech processing Universal PERformance Benchmark

A comprehensive and reproducible benchmark for Self-supervised Speech Representation Learning

superbbenchmark.org

Associated repo:

s3prl/s3prl

April 2021: Support SUPERB: Speech processing Universal PERformance Benchmark, submitted to Interspeech 2021 Jan 2021…

github.com

Connected Papers 📈

Explainable Text VQA

A dataset containing ground truth visual and multi-reference textual explanations that can be leveraged during both training and evaluation.

Dataset not officially out yet, but keep track of this repo for updates.

amzn/explainable-text-vqa

We will shortly release the TextVQA-X dataset accompanying the A First Look: Towards Explainable TextVQA Models via…

github.com

Connected Papers 📈

Rare Disease Identification

Using ontologies and weak supervision to identify rare diseases from clinical notes.

acadTags/Rare-disease-identification

This repository presents an approach using ontologies and weak supervision to identify rare diseases from clinical…

github.com

Connected Papers 📈

The Carleton Benchmark Suite (CBench)

A benchmarking framework for evaluating question answering systems over knowledge graphs.

aorogat/CBench

CBench is an extensible and more informative benchmarking framework for evaluating question answering systems over…

github.com

Connected Papers 📈

AMR Parser with Action-Pointer Transformer

Abstract Meaning Representation (AMR) parsing is a sentence-to-graph prediction task where target nodes are not explicitly aligned to sentence tokens.

The authors used a transformer that handles the generation of arbitrary graph constructs.
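To make the sentence-to-graph idea concrete, here is a minimal sketch of what an AMR target can look like for the textbook sentence "The boy wants to go". It is not the IBM parser's API or output format; the triple layout, the helper names `concepts` and `reentrant_nodes`, and the sense suffixes are assumptions made for illustration only.

```python
from collections import Counter

# AMR for "The boy wants to go" as (source, relation, target) triples.
# w, b, g are graph node variables; "instance" triples give each node a concept.
# Sense suffixes (-01, -02) are illustrative, not taken from the repo.
triples = [
    ("w", "instance", "want-01"),
    ("b", "instance", "boy"),
    ("g", "instance", "go-02"),
    ("w", "ARG0", "b"),  # the boy is the one who wants
    ("w", "ARG1", "g"),  # the thing wanted is the going event
    ("g", "ARG0", "b"),  # the boy is also the one who goes
]


def concepts(triples):
    """Map each node variable to its concept label."""
    return {src: tgt for src, rel, tgt in triples if rel == "instance"}


def reentrant_nodes(triples):
    """Return nodes with more than one incoming edge (instance triples excluded)."""
    incoming = Counter(tgt for _, rel, tgt in triples if rel != "instance")
    return sorted(node for node, count in incoming.items() if count > 1)


node_concepts = concepts(triples)
shared = reentrant_nodes(triples)
print(node_concepts)
print(shared)
```

Note that the node `b` is reused by two edges, and that concepts like `want-01` do not point back at a specific token in the sentence. Both properties are part of why the task is framed as graph prediction rather than per-token tagging.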

IBM/transition-amr-parser

Transition-based parser for Abstract Meaning Representation (AMR) in Pytorch. The code includes two fundamental…

github.com

Connected Papers 📈

ADAM

ADAM is a demonstration of “grounded language acquisition,” which is to say learning (some amount of) language from observing how language is used in concrete situations, like infants (presumably) do. 👀

This work is under DARPA’s Grounded Artificial Intelligence Language Acquisition (GAILA) program. 🛸👽

isi-vista/adam

ADAM is ISI's effort under DARPA's Grounded Artificial Intelligence Language Acquisition (GAILA) program. Background…

github.com

Connected Papers 📈

Knover | Knowledge Grounded Dialogue Generation

Knover is a toolkit for knowledge grounded dialogue generation based on PaddlePaddle. Knover allows researchers and developers to carry out efficient training/inference of large-scale dialogue generation models.

PaddlePaddle/Knover

Knover is a toolkit for knowledge grounded dialogue generation based on PaddlePaddle. Knover allows researchers and…

github.com

Connected Papers 📈

Dataset of the Week: Ascent

What is it?

A pipeline for automatically collecting, extracting and consolidating commonsense knowledge (CSK) from the web.

Where is it?

AscentKB

Ascent (Advanced Semantics for Commonsense Knowledge Extraction) is a pipeline for automatically collecting…

ascent.mpi-inf.mpg.de

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat
