
A Comprehensive Guide to Building Multi-lingual Neural Machine Translation using Keras.

Last Updated on January 1, 2023 by Editorial Team

Author(s): Fraol Batole


This article walks through a step-by-step implementation of a Multi-lingual Neural Machine Translation (MNMT) model built on an encoder-decoder architecture.

An overview of MNMT:

Before diving into the implementation, let’s take a step back and understand what Multi-lingual Neural Machine Translation (MNMT) models are. Surprisingly, around 7,100 languages exist in the world [https://www.ethnologue.com/guides/how-many-languages], and there are many reasons to translate from one of them into another. While translation used to be done manually, recent advances in deep learning have made automatic text translation possible. Translating one language into another with a neural network is called Neural Machine Translation (NMT). For example, given an English text, an NMT model can translate it into Spanish.

Now, what if we need to translate between more languages? One option is to build a separate NMT model for each language pair. However, recent research has shown that training a single model that supports multiple language translations is beneficial for several reasons; this is referred to as Multi-lingual Neural Machine Translation (MNMT). For example, context learned from one language can transfer to another, which is especially beneficial for low-resource languages. The picture below illustrates MNMT.

MNMT Model (Taken from Google)

Data collection and preprocessing:

Step 1: Collecting and cleaning the dataset

To train an MNMT model, a parallel corpus (text) must exist across the languages. That is, in the picture above, a sentence in Japanese must have corresponding translations in English and Korean. A suitable dataset for this purpose is Tatoeba, which contains sentences in over 400 languages. The sentence pairs it supports can be downloaded here [3].

Step 2: Extracting Parallel Sentences

As explained above, we need parallel text across the languages, so we must filter the downloaded pairs down to sentences that have all the required translations. For this blog, we only use English-to-{Japanese, Korean}, as illustrated in the picture above. To find the common sentences, we can take the intersection between the language pairs using the pandas library.

The code to implement this is available here.

The Tatoeba dataset assigns an ID to each sentence. Since we downloaded “English to Japanese” and “English to Korean” sentence pairs, we can take the intersection of the English IDs between the two files. This gives us every English sentence together with its translations into both Japanese and Korean, as sketched below. The extracted sentences can then be used to train our model.
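Since the extraction code is only linked above, here is a minimal sketch of the idea using pandas. The file names and column layout are assumptions (adjust them to match the actual Tatoeba export); the point is merging the two downloads on the shared English sentence ID and writing out a three-way parallel file.

import pandas as pd

# Assumed column layout of the two downloaded pair files; adjust the names
# and paths to match your actual Tatoeba export.
cols_jpn = ["eng_id", "eng_text", "jpn_id", "jpn_text"]
cols_kor = ["eng_id", "eng_text", "kor_id", "kor_text"]
eng_jpn = pd.read_csv("eng-jpn_pairs.tsv", sep="\t", header=None, names=cols_jpn)
eng_kor = pd.read_csv("eng-kor_pairs.tsv", sep="\t", header=None, names=cols_kor)

# Keep only English sentences that have both a Japanese and a Korean
# translation by merging on the shared English sentence ID.
parallel = eng_jpn.merge(eng_kor[["eng_id", "kor_text"]], on="eng_id")
parallel = parallel.dropna().drop_duplicates(subset="eng_id")

# Write the three-way parallel corpus as a tab-separated file for Step 3.
parallel[["eng_text", "jpn_text", "kor_text"]].to_csv(
    "eng-jpn-kor.tsv", sep="\t", index=False, header=False)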

Step 3: Preprocessing the dataset

Now we can prepare the training and test sets for the model. In the code below, we load the cleaned parallel dataset from Step 2 and wrap each target sentence in a [start] and [end] tag.

text_file = "spa-eng/spa.txt"
with open(text_file) as f:
lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
english, japanese, korean = line.split("\t")
spanish = "[start] " + spanish + " [end]"
text_pairs.append((english, spanish))
import random
print(random.choice(text_pairs))
import random
random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]

Moreover, each text should be vectorized, i.e., converted into a numerical vector. Among other advantages, this step reduces the computational cost of our model. We reuse code from Deep Learning with Python (second edition) [1] to vectorize and prepare the dataset.

Step 4: Vectorizations and preparing the dataset

The code can be accessed here.
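The linked code builds Keras TextVectorization layers for the source and target text and adapts them to the training split. A minimal sketch of that idea, under a few assumptions (the vocabulary size and sequence length below are placeholders, and the custom standardization simply lowercases the text and strips punctuation while keeping the [start]/[end] tags), could look like this:

import re
import string
import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 15000       # assumed vocabulary size
sequence_length = 20     # assumed maximum sequence length

# Strip punctuation but keep the square brackets of the [start]/[end] tags.
strip_chars = string.punctuation.replace("[", "").replace("]", "")

def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(
        lowercase, "[%s]" % re.escape(strip_chars), "")

source_vectorization = layers.TextVectorization(
    max_tokens=vocab_size, output_mode="int",
    output_sequence_length=sequence_length)
target_vectorization = layers.TextVectorization(
    max_tokens=vocab_size, output_mode="int",
    output_sequence_length=sequence_length + 1,  # one extra step for the shifted target
    standardize=custom_standardization)

# Adapt each layer on its side of the training pairs (the targets mix Japanese
# and Korean sentences, so they share one vocabulary here).
source_vectorization.adapt([pair[0] for pair in train_pairs])
target_vectorization.adapt([pair[1] for pair in train_pairs])

Note that the default whitespace splitting used by TextVectorization is a rough fit for Japanese and Korean; a language-specific tokenizer would likely give better results.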

Building the model:

Step 5: LSTM-based Encoder-Decoder Model

In this article, we only change how the dataset is preprocessed for the model. However, there are several approaches to building an MNMT. For example, we could implement an MNMT by adding additional outputs to the model, with each output layer responsible for predicting a single language, as sketched below.
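That multi-output variant is not the model we train below, but a rough sketch of the idea (the layer sizes and names here are assumptions, not code from this article) would share one encoder over the English input and attach one decoder and softmax head per target language:

from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim, latent_dim = 15000, 256, 1024  # assumed sizes

# Shared encoder over the English source tokens.
source = keras.Input(shape=(None,), dtype="int64", name="source")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
encoded = layers.Bidirectional(layers.LSTM(latent_dim), merge_mode="sum")(x)

# One decoder input and one softmax output head per target language.
inputs, outputs = [source], []
for lang in ["japanese", "korean"]:
    past = keras.Input(shape=(None,), dtype="int64", name=f"past_{lang}")
    t = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past)
    t = layers.LSTM(latent_dim, return_sequences=True)(
        t, initial_state=[encoded, encoded])
    inputs.append(past)
    outputs.append(layers.Dense(vocab_size, activation="softmax", name=lang)(t))

multi_output_model = keras.Model(inputs, outputs)

In this article, however, we keep a single decoder and instead mix the target languages in the preprocessed training pairs.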

The model below is based on [1], with minor changes: we use an LSTM instead of the GRU from the original code. Moreover, since we have multiple languages, we pass English as the source language and the other languages as the targets.

from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim, latent_dim = 15000, 256, 1024  # values follow [1]

# Encoder: embed the source tokens and summarize them with a bidirectional LSTM.
source = keras.Input(shape=(None,), dtype="int64", name="source")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
encoded_source = layers.Bidirectional(
    layers.LSTM(latent_dim), merge_mode="sum")(x)

# Decoder: an LSTM whose hidden and cell states are both initialized with the
# encoder summary (an LSTM needs two initial states, unlike the GRU in [1]).
past_target = keras.Input(shape=(None,), dtype="int64", name="target")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
decoder_lstm = layers.LSTM(latent_dim, return_sequences=True)
x = decoder_lstm(x, initial_state=[encoded_source, encoded_source])
x = layers.Dropout(0.5)(x)
target_next_step = layers.Dense(vocab_size, activation="softmax")(x)
seq2seq_rnn = keras.Model([source, past_target], target_next_step)

Step 6: Training the model

We are using a simple LSTM model as an example, so its performance may be lower than that of more sophisticated models. However, it is easy to build a different model and feed it the same datasets, since the technique mostly relies on the preprocessing rather than on the model itself.

seq2seq_rnn.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
seq2seq_rnn.fit(train_ds, epochs=15, validation_data=val_ds)

Step 7: Checking the model’s translation

import numpy as np

# Map predicted token indices back to words in the target vocabulary.
target_vocab = target_vectorization.get_vocabulary()
target_index_lookup = dict(zip(range(len(target_vocab)), target_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization([decoded_sentence])
        next_token_predictions = seq2seq_rnn.predict(
            [tokenized_input_sentence, tokenized_target_sentence])
        # Greedily pick the most likely next token and append it.
        sampled_token_index = np.argmax(next_token_predictions[0, i, :])
        sampled_token = target_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence

test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(20):
    input_sentence = random.choice(test_eng_texts)
    print("-")
    print(input_sentence)
    print(decode_sequence(input_sentence))

References:

  1. https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/chapter11_part04_sequence-to-sequence-learning.ipynb
  2. Dabre, Raj, Chenhui Chu, and Anoop Kunchukuttan. “A survey of multilingual neural machine translation.” ACM Computing Surveys (CSUR) 53.5 (2020): 1–38.
  3. https://tatoeba.org/en/downloads
  4. https://ai.googleblog.com/2016/11/zero-shot-translation-with-googles.html

