
A Comprehensive Guide to Building Multi-lingual Neural Machine Translation using Keras.

Last Updated on January 1, 2023 by Editorial Team

Author(s): Fraol Batole


This article walks through a step-by-step implementation of a Multi-lingual Neural Machine Translation (MNMT) model built on an encoder-decoder architecture.

An overview of MNMT:

Before diving into the implementation, let’s take a step back and understand what Multi-lingual Neural Machine Translation (MNMT) models are. Surprisingly, around 7,100 languages exist in the world [https://www.ethnologue.com/guides/how-many-languages], and there are many reasons to translate from one of them into another. While translation used to be done manually, recent advances in deep learning have made automatic text translation possible. Translating one language into another with a neural network is called Neural Machine Translation (NMT). For example, given an English text, an NMT model can translate it into Spanish.

Now, what if we need to translate between more languages? One option is to build a separate NMT model for each language pair. However, recent research has shown that training a single model that supports multiple language translations is beneficial for several reasons; this is referred to as Multi-lingual Neural Machine Translation (MNMT). For example, context learned from one language can transfer to another, which is especially beneficial for low-resource languages. The picture below illustrates MNMT.

MNMT Model (Taken from Google)

Data collection and preprocessing:

Step 1: Collecting and cleaning the dataset

To train an MNMT model, a parallel corpus (text) must exist across the languages. That is, in the picture above, a sentence in Japanese must have corresponding translations in English and Korean. A suitable dataset for this purpose is Tatoeba, which contains sentences in over 400 languages. The sentence pairs it supports can be downloaded here [3].

Step 2: Extracting Parallel Sentences

As explained above, we need parallel text across the languages, so we must filter the downloaded pairs down to sentences that have all the required translations. For this blog, we only use English-to-{Japanese, Korean}, as illustrated in the picture above. To find the common sentences, we can take the intersection between the language pairs using the pandas library.

The code to implement this is available here.

The Tatoeba dataset assigns an ID to each sentence. Since we downloaded “English to Japanese” and “English to Korean” sentence pairs, we can take the intersection of the English IDs between the two files. This gives us every English sentence together with its translations into both Japanese and Korean, as sketched below. The extracted sentences can then be used to train our model.
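Since the extraction code is only linked above, here is a minimal sketch of the idea using pandas. The file names and column layout are assumptions (adjust them to match the actual Tatoeba export); the point is merging the two downloads on the shared English sentence ID and writing out a three-way parallel file.

import pandas as pd

# Assumed column layout of the two downloaded pair files; adjust the names
# and paths to match your actual Tatoeba export.
cols_jpn = ["eng_id", "eng_text", "jpn_id", "jpn_text"]
cols_kor = ["eng_id", "eng_text", "kor_id", "kor_text"]
eng_jpn = pd.read_csv("eng-jpn_pairs.tsv", sep="\t", header=None, names=cols_jpn)
eng_kor = pd.read_csv("eng-kor_pairs.tsv", sep="\t", header=None, names=cols_kor)

# Keep only English sentences that have both a Japanese and a Korean
# translation by merging on the shared English sentence ID.
parallel = eng_jpn.merge(eng_kor[["eng_id", "kor_text"]], on="eng_id")
parallel = parallel.dropna().drop_duplicates(subset="eng_id")

# Write the three-way parallel corpus as a tab-separated file for Step 3.
parallel[["eng_text", "jpn_text", "kor_text"]].to_csv(
    "eng-jpn-kor.tsv", sep="\t", index=False, header=False)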

Step 3: Preprocessing the dataset

Now we can prepare the training and test sets for the model. In the code below, we load the cleaned parallel dataset from Step 2 and wrap each target sentence in a [start] and [end] tag.

text_file = "spa-eng/spa.txt"
with open(text_file) as f:
lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
english, japanese, korean = line.split("\t")
spanish = "[start] " + spanish + " [end]"
text_pairs.append((english, spanish))
import random
print(random.choice(text_pairs))
import random
random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]

Moreover, each text should be vectorized, i.e., converted into a numerical vector. Among other advantages, this step reduces the computational cost of our model. We reuse code from Deep Learning with Python (second edition) [1] to vectorize and prepare the dataset.

Step 4: Vectorizations and preparing the dataset

The code can be accessed here.
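The linked code builds Keras TextVectorization layers for the source and target text and adapts them to the training split. A minimal sketch of that idea, under a few assumptions (the vocabulary size and sequence length below are placeholders, and the custom standardization simply lowercases the text and strips punctuation while keeping the [start]/[end] tags), could look like this:

import re
import string
import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 15000       # assumed vocabulary size
sequence_length = 20     # assumed maximum sequence length

# Strip punctuation but keep the square brackets of the [start]/[end] tags.
strip_chars = string.punctuation.replace("[", "").replace("]", "")

def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(
        lowercase, "[%s]" % re.escape(strip_chars), "")

source_vectorization = layers.TextVectorization(
    max_tokens=vocab_size, output_mode="int",
    output_sequence_length=sequence_length)
target_vectorization = layers.TextVectorization(
    max_tokens=vocab_size, output_mode="int",
    output_sequence_length=sequence_length + 1,  # one extra step for the shifted target
    standardize=custom_standardization)

# Adapt each layer on its side of the training pairs (the targets mix Japanese
# and Korean sentences, so they share one vocabulary here).
source_vectorization.adapt([pair[0] for pair in train_pairs])
target_vectorization.adapt([pair[1] for pair in train_pairs])

Note that the default whitespace splitting used by TextVectorization is a rough fit for Japanese and Korean; a language-specific tokenizer would likely give better results.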

Building the model:

Step 5: LSTM-based Encoder-Decoder Model

In this article, we only change how the dataset is preprocessed for the model. However, there are several approaches to building an MNMT. For example, we could implement an MNMT by adding additional outputs to the model, with each output layer responsible for predicting a single language, as sketched below.
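That multi-output variant is not the model we train below, but a rough sketch of the idea (the layer sizes and names here are assumptions, not code from this article) would share one encoder over the English input and attach one decoder and softmax head per target language:

from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim, latent_dim = 15000, 256, 1024  # assumed sizes

# Shared encoder over the English source tokens.
source = keras.Input(shape=(None,), dtype="int64", name="source")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
encoded = layers.Bidirectional(layers.LSTM(latent_dim), merge_mode="sum")(x)

# One decoder input and one softmax output head per target language.
inputs, outputs = [source], []
for lang in ["japanese", "korean"]:
    past = keras.Input(shape=(None,), dtype="int64", name=f"past_{lang}")
    t = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past)
    t = layers.LSTM(latent_dim, return_sequences=True)(
        t, initial_state=[encoded, encoded])
    inputs.append(past)
    outputs.append(layers.Dense(vocab_size, activation="softmax", name=lang)(t))

multi_output_model = keras.Model(inputs, outputs)

In this article, however, we keep a single decoder and instead mix the target languages in the preprocessed training pairs.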

The model below is based on [1], with minor changes: we use an LSTM instead of the GRU from the original code. Moreover, since we have multiple languages, we pass English as the source language and the other languages as the targets.

from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim, latent_dim = 15000, 256, 1024  # values follow [1]

# Encoder: embed the source tokens and summarize them with a bidirectional LSTM.
source = keras.Input(shape=(None,), dtype="int64", name="source")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
encoded_source = layers.Bidirectional(
    layers.LSTM(latent_dim), merge_mode="sum")(x)

# Decoder: an LSTM whose hidden and cell states are both initialized with the
# encoder summary (an LSTM needs two initial states, unlike the GRU in [1]).
past_target = keras.Input(shape=(None,), dtype="int64", name="target")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
decoder_lstm = layers.LSTM(latent_dim, return_sequences=True)
x = decoder_lstm(x, initial_state=[encoded_source, encoded_source])
x = layers.Dropout(0.5)(x)
target_next_step = layers.Dense(vocab_size, activation="softmax")(x)
seq2seq_rnn = keras.Model([source, past_target], target_next_step)

Step 6: Training the model

We are using a simple LSTM model as an example, so its performance may be lower than that of more sophisticated models. However, it is easy to build a different model and feed it the same datasets, since the technique mostly relies on the preprocessing rather than on the model itself.

seq2seq_rnn.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
seq2seq_rnn.fit(train_ds, epochs=15, validation_data=val_ds)

Step 7: Checking the model’s translation

import numpy as np

# Map predicted token indices back to words in the target vocabulary.
target_vocab = target_vectorization.get_vocabulary()
target_index_lookup = dict(zip(range(len(target_vocab)), target_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization([decoded_sentence])
        next_token_predictions = seq2seq_rnn.predict(
            [tokenized_input_sentence, tokenized_target_sentence])
        # Greedily pick the most likely next token and append it.
        sampled_token_index = np.argmax(next_token_predictions[0, i, :])
        sampled_token = target_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence

test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(20):
    input_sentence = random.choice(test_eng_texts)
    print("-")
    print(input_sentence)
    print(decode_sequence(input_sentence))

References:

  1. https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/chapter11_part04_sequence-to-sequence-learning.ipynb
  2. Dabre, Raj, Chenhui Chu, and Anoop Kunchukuttan. “A survey of multilingual neural machine translation.” ACM Computing Surveys (CSUR) 53.5 (2020): 1–38.
  3. https://tatoeba.org/en/downloads
  4. https://ai.googleblog.com/2016/11/zero-shot-translation-with-googles.html

