The NLP Cypher | 01.10.21
Last Updated on July 24, 2023 by Editorial Team
Author(s): Ricky Costa
Originally published on Towards AI.
NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
The NLP Cypher | 01.10.21
Melting Clocks
Once in a while you discover a goodie in the dregs of research. A cipher-cracking paper emerged recently on using seq2seq models to crack 1:1 substitution ciphers. 🙉
(1:1 substitution is when each ciphertext character stands for a fixed character in the target plaintext. Read more here if you prefer to live dangerously.)
Several deciphering methods used today make a big assumption: that we know the target language of the cipher we need to crack. But when diving into encrypted historical texts where the target language is unknown, you tend to get a big headache.
When one begins to attack encrypted text, the cipher can come in various forms: alphanumeric (numbers/letters), purely symbolic, or a mix of both (like the Zodiac Killer's ciphers 👇).
However, IF we know ahead of time that the cipher's plaintext language is… say, English (and not Latin, or any other language), then we are off to a good start and with a healthy advantage. Why? Because we can leverage unique features of English that don't occur in other languages. For example, 'e' is the most frequent letter in English, so the most frequent letter in the ciphertext could well map to 'e', and by applying these heuristics letter by letter you slowly turn into Tom Hanks from The Da Vinci Code.
Letter Frequencies in the English Language
The third column represents proportions, taking the least common letter (q) as equal to 1. The letter E is over 56…
www3.nd.edu
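Just to make that heuristic concrete, here's a toy Python sketch (not the paper's method) that guesses a substitution key by aligning ciphertext letter frequencies with typical English letter frequencies:

from collections import Counter
import string

# Approximate English letters ordered from most to least frequent.
ENGLISH_BY_FREQ = "etaoinshrdlcumwfgypbvkjxqz"

def frequency_guess(ciphertext: str) -> str:
    """Naively map the most frequent cipher letters to the most frequent
    English letters. Real attacks refine this with n-gram statistics."""
    letters = [c for c in ciphertext.lower() if c in string.ascii_lowercase]
    ranked = [letter for letter, _ in Counter(letters).most_common()]
    key = {c: p for c, p in zip(ranked, ENGLISH_BY_FREQ)}
    return "".join(key.get(c, c) for c in ciphertext.lower())

print(frequency_guess("wkh wuhdvxuh lv exulhg ehqhdwk wkh rog rdn wuhh"))

On a text this short the guess will be noisy, which is exactly why frequency attacks need long ciphertexts (or smarter models).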
What's really interesting about this paper is that the authors wanted to test whether a multilingual seq2seq transformer could crack ciphers WITHOUT knowing the origin language of the plaintext. They formulated decipherment as a sequence-to-sequence translation problem, with the model trained at the character level.
What's cool is that they tested the model on historical ciphers (that have previously been cracked) such as the Borg cipher, and it was able to decipher the first 256 characters with very low error. According to the authors, this is the first application of sequence-to-sequence neural models to decipherment!
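To get a feel for the formulation (a rough sketch, not the authors' code), you can turn plaintext into supervised character-level translation pairs by applying random substitution keys, then feed the space-separated characters to any seq2seq toolkit:

import random
import string

def make_training_pair(plaintext: str):
    """Apply a random 1:1 substitution key to the plaintext and return
    (source, target) sequences tokenized at the character level."""
    alphabet = list(string.ascii_lowercase)
    shuffled = alphabet[:]
    random.shuffle(shuffled)
    key = dict(zip(alphabet, shuffled))
    ciphertext = "".join(key.get(c, c) for c in plaintext.lower())
    return " ".join(ciphertext), " ".join(plaintext.lower())

src, tgt = make_training_pair("attack at dawn")
print(src)  # space-separated ciphertext characters (the model's input)
print(tgt)  # "a t t a c k   a t   d a w n" (the model's output)

The multilingual angle described above means the model is trained over several candidate plaintext languages at once, so it never needs to be told which one the cipher hides.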
NSA be like…
If you enjoy this read, please give it a 👏👏 and share with your friends! It really helps us out!
Don't Worry, There's a Stack Exchange for Crypto Nerds
Cryptography Stack Exchange
Cryptography Stack Exchange is a question and answer site for software developers, mathematicians and others interested…
crypto.stackexchange.com
OpenAI Dropping Jewels
You've probably already heard about OpenAI's model drops from this week, so I'll save you the recap. I've added their two blog posts in case you want to catch up. This week I also shared the Colab notebook for CLIP on LinkedIn, and it got a good reception, so I'll append it here if you are interested:
Colab of the Week | CLIP
Google Colaboratory
colab.research.google.com
DALL-E Blog
DALL·E: Creating Images from Text
DALL·E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of…
openai.com
CLIP Blog
CLIP: Connecting Text and Images
We're introducing a neural network called CLIP which efficiently learns visual concepts from natural language…
openai.com
DALL-E Replication Already on GitHub
Surprise! Someone has already replicated DALL-E in PyTorch 😁. 🔥🔥
pip install dalle-pytorch
lucidrains/DALLE-pytorch
Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch – lucidrains/DALLE-pytorch
github.com
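Based on the repo's README at the time of writing (the signatures and hyperparameters below are illustrative, so double-check against the project), training looks roughly like this: first learn a discrete VAE over images, then train the DALL-E transformer on text/image token pairs:

import torch
from dalle_pytorch import DiscreteVAE, DALLE

# Stage 1: a discrete VAE that turns images into a grid of codebook tokens.
vae = DiscreteVAE(
    image_size=256,
    num_layers=3,
    num_tokens=2048,
    codebook_dim=512,
    hidden_dim=64,
)

# Stage 2: the DALL-E transformer over text tokens + image tokens.
dalle = DALLE(
    dim=512,
    vae=vae,                # ideally a pretrained VAE
    num_text_tokens=10000,  # text vocabulary size
    text_seq_len=256,
    depth=6,
    heads=8,
)

text = torch.randint(0, 10000, (2, 256))
images = torch.randn(2, 3, 256, 256)

loss = dalle(text, images, return_loss=True)
loss.backward()

# After training: generate images from text tokens.
generated = dalle.generate_images(text)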
Object Storage Search Engine
Thank your local hacker
Hey, you know how when you set up your S3 bucket or another object store, you have the option to choose between a public or private setting? Well, have you ever wondered what it would look like if someone harvested all the public bucket URLs so you could openly search them? 👇
Inside the Rabbit Hole
The Ecco library allows one to visualize why language models bust moves the way they do. The library is mostly focused on autoregressive models (e.g., GPT-2/3). They currently have two notebooks for visualizing neuron activations and input saliency.
It is built on top of PyTorch and Transformers.
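A minimal usage sketch, following the examples in Ecco's docs at the time (check the project for the current API):

import ecco

# Load a GPT-2 family model with activation capture enabled.
lm = ecco.from_pretrained('distilgpt2', activations=True)

text = "The countries of the European Union are:\n1. Austria\n2. Belgium\n3. Bulgaria\n4."
output = lm.generate(text, generate=20, do_sample=True)

output.saliency()                      # input saliency for each generated token
nmf = output.run_nmf(n_components=8)   # factorize neuron activations
nmf.explore()                          # interactive view of the factors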
Look Inside Language Models
Ecco is a python library that creates interactive visualizations allowing you to explore what your NLP Language Model…
www.eccox.io
Interfaces for Explaining Transformer Language Models
Interfaces for exploring transformer language models by looking at input saliency and neuron activation. Explorable #1…
jalammar.github.io
Text-to-Speech with Swag
15.ai came on the scene in 2019 with its awesome text-to-speech demo, and it's been refining its models' capabilities ever since. You can type in text and get deep-learning-generated speech conditioned on various characters, ranging from HAL 9000 from 2001: A Space Odyssey to Doctor Who.
15.ai: Natural TTS with minimal data
15.ai: Natural high-quality faster-than-real-time text-to-speech synthesis with minimal data
15.ai
ML Metadata
Google came out with Machine Learning Metadata (MLMD), a library to keep track of your entire ML workflow. It allows you to version your models and datasets so you know why things went wrong when they do.
ML Metadata: Version Control for ML
January 08, 2021 – Posted by Ben Mathes and Neoklis Polyzotis, on behalf of the TFX Team When you write code, you need…
blog.tensorflow.org
El GitHub:
google/ml-metadata
ML Metadata (MLMD) is a library for recording and retrieving metadata associated with ML developer and data scientist…
github.com
MLMD API Class:
mlmd.metadata_store.MetadataStore | TFX | TensorFlow
A store for the artifact metadata. mlmd.metadata_store.MetadataStore( config…
www.tensorflow.org
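A small sketch of the core workflow, adapted from MLMD's getting-started guide (paths and property names here are just examples):

from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Back the metadata store with a local SQLite file.
connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.sqlite.filename_uri = 'metadata.sqlite'
connection_config.sqlite.connection_mode = 3  # READWRITE_OPENCREATE
store = metadata_store.MetadataStore(connection_config)

# Register an artifact type for datasets, then record one dataset version.
data_type = metadata_store_pb2.ArtifactType()
data_type.name = "DataSet"
data_type.properties["version"] = metadata_store_pb2.INT
data_type_id = store.put_artifact_type(data_type)

data_artifact = metadata_store_pb2.Artifact()
data_artifact.type_id = data_type_id
data_artifact.uri = 'path/to/train.tfrecord'
data_artifact.properties["version"].int_value = 1
[artifact_id] = store.put_artifacts([data_artifact])

The same store can then record executions and the events linking them to artifacts, which is what gives you the lineage of a trained model.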
NNs for iOS with Wolfram Language
Wolfram out of left field, and he brought a smartphone. In a recent Wolfram blog post, they show how to train an image classifier, export it to ONNX, and then convert it to Core ML so it can be used on iOS devices. Includes code!
Deploy a Neural Network to Your iOS Device Using the Wolfram Language-Wolfram Blog
January 7, 2021 – Jofre Espigule-Pons, Machine Learning Today's handheld devices are powerful enough to run neural…
blog.wolfram.com
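The blog post does the training and ONNX export in the Wolfram Language; if you just want a feel for the last hop, the ONNX to Core ML conversion can be sketched in Python with the onnx-coreml package (assuming an exported file named classifier.onnx; this is a sketch, not the post's own code):

# pip install onnx-coreml
from onnx_coreml import convert

# Convert the exported ONNX classifier into a Core ML model Xcode can bundle.
mlmodel = convert(model="classifier.onnx")
mlmodel.save("classifier.mlmodel")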
Machine Learning Index w/ Code
A huge index with several hundred projects covering all things machine learning, including computer vision and NLP. You can find the Super Duper NLP Repo on it 😎.
ashishpatel26/500-AI-Machine-learning-Deep-learning-Computer-vision-NLP-Projects-with-code
500 AI Machine learning Deep learning Computer vision NLP Projects with code…
github.com
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
Ask2Transformers
Ask2Transformers automatically annotates text data without task-specific training, a.k.a. zero-shot. 🔥
osainz59/Ask2Transformers
This repository contains the code for the work Ask2Transformers – Zero Shot Domain Labelling with Pretrained…
github.com
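If you just want the flavor of zero-shot domain labelling without cloning the repo, here's a minimal sketch using Hugging Face's zero-shot pipeline (this is the generic NLI trick the library builds on, not Ask2Transformers' own API):

from transformers import pipeline

# Zero-shot classification via NLI: each candidate label becomes a hypothesis.
classifier = pipeline("zero-shot-classification", model="roberta-large-mnli")

text = "The striker scored twice in the final minutes to win the match."
labels = ["sports", "politics", "technology", "finance"]

result = classifier(text, candidate_labels=labels)
print(result["labels"][0], result["scores"][0])  # top predicted domain and its score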
Subformer
A parameter-efficient Transformer-based model that combines the newly proposed sandwich-style parameter sharing technique with self-attentive embedding factorization.
machelreid/subformer
This repository contains the code for the Subformer. To help overcome this we propose the Subformer, allowing us to…
github.com
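For a rough picture of what sandwich-style sharing means (an illustrative PyTorch sketch, not the Subformer code): keep the first and last layers unique and reuse one set of weights for all the layers in between:

import torch.nn as nn

class SandwichSharedEncoder(nn.Module):
    """Toy encoder: unique outer layers, one shared module for the middle."""
    def __init__(self, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.first = make_layer()
        self.shared_middle = make_layer()   # reused (num_layers - 2) times
        self.last = make_layer()
        self.num_middle = num_layers - 2

    def forward(self, x):
        x = self.first(x)
        for _ in range(self.num_middle):
            x = self.shared_middle(x)       # same parameters on every pass
        return self.last(x)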
SF-QA
An open-domain QA evaluation library that includes efficient reader comparison, reproducible research, and knowledge sources for applications.
soco-ai/SF-QA
A Simple and Fair Evaluation Library for Open-domain Question Answering Open-domain QA evaluation usually means days of…
github.com
ARBERT & MARBERT
Arabic BERT returns for a 2nd week in a row on the Cypher. This time it's ARBERT and MARBERT. The release also includes ArBench, a benchmark for Arabic NLU based on 41 datasets across 5 different tasks.
UBC-NLP/marbert
This is the repository accompanying our paper ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic. In the…
github.com
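Both models live on the Hugging Face hub under the UBC-NLP organization, so loading them should look roughly like this (model IDs are taken from the repo; double-check the exact casing on the hub):

from transformers import AutoTokenizer, AutoModelForMaskedLM

# MARBERT covers both MSA and dialectal Arabic; swap in "UBC-NLP/ARBERT" for the MSA-focused model.
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERT")
model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/MARBERT")

inputs = tokenizer("اللغة العربية جميلة", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)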
CRSLab
CRSLab is an open-source toolkit for building Conversational Recommender Systems (CRS). It includes models and datasets.
RUCAIBox/CRSLab
CRSLab is an open-source toolkit for building Conversational Recommender System (CRS). It is developed based on Python…
github.com
Dataset of the Week: StrategyQA
What is it?
“StrategyQA is a question-answering benchmark focusing on open-domain questions where the required reasoning steps are implicit in the question and should be inferred using a strategy. StrategyQA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs.”
Sample
Example 1
“Is growing seedless cucumber good for a gardener with entomophobia?”
Answer: Yes
Explanation: Seedless cucumber fruit does not require pollination. Cucumber plants need insects to pollinate them. Entomophobia is a fear of insects.
Example 2
“Are chinchillas cold-blooded?”
Answer: No
Explanation: Chinchillas are rodents, which are mammals. All mammals are warm-blooded.
Example 3
“Would Janet Jackson avoid a dish with ham?”
Answer: Yes
Explanation: Janet Jackson follows an Islamic practice. Islamic culture avoids eating pork. Ham is made from pork.
Where is it?
StrategyQA Dataset – Allen Institute for AI
The StrategyQA dataset was created through a crowdsourcing pipeline for eliciting creative and diverse yes/no questions…
allenai.org
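Once downloaded, a quick way to poke at the data (the file name and field names below are assumptions based on the examples above, so verify against the dataset's README):

import json

# Assumes the train split JSON downloaded from the AI2 page above.
with open("strategyqa_train.json") as f:
    examples = json.load(f)

ex = examples[0]
# Field names are illustrative; check the dataset docs for the exact schema.
print(ex.get("question"))
print(ex.get("answer"))
print(ex.get("decomposition"))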
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
For complete coverage, follow our Twitter: @Quantum_Stat
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.
Published via Towards AI