
Getting Meaning from Text: Self-attention Step-by-step Video

Author(s): Romain Futrzynski

Natural Language Processing

In October 2019, Google announced that it would process search queries with BERT, a model developed by its own researchers. This model can grasp difficult nuances of language: in the search "2019 brazil traveler to the USA need a visa", it understands that the traveler is Brazilian and that the destination is the USA. From now on, Google says, this search returns the page of the U.S. embassy in Brazil and no longer shows a page about U.S. citizens traveling to Brazil.

Remarkably, most of the attention mechanism at the core of many transformer models like BERT relies on just a few basic vector operations.

Let’s see how it works.

What’s the matter with context?

If you see the word "bank", you might think about a financial institution, the office where your advisor works, the portable battery that charges your phone on the move, or even the edge of a lake or river.
If you're given more context, as in "It's a pleasant walk by the river bank", you can realize that "bank" goes well with "river", so it must mean the land next to some water. You could also realize that you can walk by this bank, so it must look like a footpath along the river. The whole sentence adds up to create a mental picture of the bank.

Self-attention seeks to do the very same thing.

Word embedding

A word like "bank", called a token when it represents a fundamental piece of text, is commonly encoded as a vector of real, continuous values: the embedding vector.

Determining the values inside the embedding vector of a token is a large part of the heavy lifting in text processing. Thankfully, with hundreds of dimensions available to organize the vocabulary of known tokens, embeddings can be pre-trained to relate numerically in ways that reflect how their tokens relate in natural language.
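To make the idea of embeddings that "relate numerically" a little more concrete, here is a minimal sketch using cosine similarity. The three 4-dimensional vectors are invented for illustration only; real pre-trained embeddings have hundreds of dimensions and their values are learned from data.

```python
import numpy as np

# Toy embeddings with made-up values (an assumption for illustration,
# not taken from any trained model).
embeddings = {
    "river": np.array([0.9, 0.1, 0.0, 0.2]),
    "lake":  np.array([0.8, 0.2, 0.1, 0.1]),
    "money": np.array([0.0, 0.9, 0.8, 0.1]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_river_lake = cosine_similarity(embeddings["river"], embeddings["lake"])
sim_river_money = cosine_similarity(embeddings["river"], embeddings["money"])

# With these toy values, "river" points in nearly the same direction as
# "lake" and in a very different direction from "money".
print(sim_river_lake > sim_river_money)
```

Cosine similarity is only one common way to compare embeddings; the dot product used inside self-attention plays a closely related role.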

How to contextualize the embeddings?

The key to state-of-the-art performance in Natural Language Processing (NLP) is to transform the embeddings to create the right numerical picture from the tokens in any given sentence.

This is what the scaled dot-product self-attention mechanism does elegantly with (mostly) a few operations of linear algebra.

Self-attention mechanism. Image by the author.
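As a rough sketch of the mechanism, the NumPy code below implements the standard scaled dot-product formulation, softmax(Q Kᵀ / √d_k) V, for a single sentence. The projection matrices W_q, W_k, and W_v are random placeholders here (in a trained model they are learned), and the sizes are arbitrary choices for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of embeddings X."""
    Q = X @ W_q                      # queries, one row per token
    K = X @ W_k                      # keys
    V = X @ W_v                      # values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how strongly each token relates to each other token
    weights = softmax(scores, axis=-1)
    return weights @ V, weights      # contextualized embeddings, attention weights

# Hypothetical sizes: 5 tokens, 8-dimensional embeddings, 4-dimensional projections.
rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 8, 4
X = rng.normal(size=(n_tokens, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

contextualized, weights = self_attention(X, W_q, W_k, W_v)
print(contextualized.shape, weights.shape)
```

Each row of the weights matrix sums to 1, so every contextualized embedding is a weighted average of the value vectors of all tokens in the sentence.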

Multi-head attention and BERT

A single sequence of input embeddings can be projected using many different sets of key, query, and value projections in what is called multi-head attention. Each projection set can focus on calculating different types of relationships between the tokens and create specific contextualized embeddings.

The contextualized embeddings coming from different attention heads are simply concatenated together.

Multi-head attention. Image by the author.
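The concatenation of heads can be sketched as follows. This is a simplified illustration with random placeholder weights and two hypothetical heads; trained models learn these projections and usually apply a further linear projection to the concatenated result.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """One scaled dot-product self-attention head."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V

def multi_head_attention(X, heads):
    """Run every head on the same input and concatenate along the feature axis."""
    outputs = [attention_head(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    return np.concatenate(outputs, axis=-1)

# Hypothetical sizes: 5 tokens, 8-dimensional embeddings, 2 heads of size 4.
rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 8, 2
d_head = d_model // n_heads
X = rng.normal(size=(n_tokens, d_model))

# Each head gets its own (query, key, value) projection set.
heads = [
    tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
    for _ in range(n_heads)
]

out = multi_head_attention(X, heads)
print(out.shape)
```

Choosing the head size as d_model divided by the number of heads keeps the concatenated output the same width as the input, which makes it easy to stack layers.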

Deep learning models for Natural Language Processing typically apply many layers of multi-head attention and mix in extra operations to get robust results.

BERT processes a sentence to output contextualized embeddings that are more representative of the true meaning. Image by the author.

For instance, the BERT Encoder uses the WordPiece embeddings of tokens but always begins by adding them to positional embeddings. This step gives information about the order of the tokens in the input sentence, which self-attention would not consider otherwise.
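The effect of positional embeddings can be checked with a small experiment. In the sketch below, all values are random placeholders (BERT learns its positional embeddings during training). Without positional information, shuffling the tokens simply shuffles the outputs in the same way; once position vectors are added, the order of the tokens changes the result.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention, returning contextualized embeddings."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V

rng = np.random.default_rng(0)
n_tokens, d_model = 4, 6
token_emb = rng.normal(size=(n_tokens, d_model))  # stand-in for WordPiece embeddings
pos_emb = rng.normal(size=(n_tokens, d_model))    # one vector per position (random here)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = np.array([2, 0, 3, 1])  # an arbitrary reordering of the four tokens

# Without positions: each token gets the same output wherever it sits.
out = self_attention(token_emb, W_q, W_k, W_v)
out_shuffled = self_attention(token_emb[perm], W_q, W_k, W_v)
order_ignored = np.allclose(out[perm], out_shuffled)

# With positions: the same tokens in a different order give different outputs.
out_pos = self_attention(token_emb + pos_emb, W_q, W_k, W_v)
out_pos_shuffled = self_attention(token_emb[perm] + pos_emb, W_q, W_k, W_v)
order_matters = not np.allclose(out_pos[perm], out_pos_shuffled)

print(order_ignored, order_matters)
```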

Additional linear projections, normalization, and feed-forward layers give the whole model more flexibility and stability.
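A rough sketch of how these extra pieces could wrap around the attention output is shown below. It assumes residual connections followed by layer normalization, as in the original Transformer encoder, and simplifies several details: the layer normalization has no learned scale and shift, the activation is ReLU where BERT uses GELU, and the weights and the attention output are random stand-ins.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network applied to each token independently."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU for simplicity
    return hidden @ W2 + b2

# Hypothetical sizes: 5 tokens, 8-dimensional embeddings, hidden width 32.
rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 5, 8, 32
x = rng.normal(size=(n_tokens, d_model))         # input to the layer
attended = rng.normal(size=(n_tokens, d_model))  # stand-in for multi-head attention output
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Residual connection plus normalization around each sub-layer.
h = layer_norm(x + attended)
out = layer_norm(h + feed_forward(h, W1, b1, W2, b2))
print(out.shape)
```

The residual additions let each layer refine the embeddings rather than replace them, and the normalization keeps their scale stable as layers are stacked.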

The result is a model that can remove ambiguity from natural text and reduce it to precise values that you can use to automatically search, classify, or even annotate text content.

Getting Meaning from Text: Self-attention Step-by-step Video was originally published in Towards AI — Multidisciplinary Science Journal on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI
