A Comprehensive Guide to Building Multi-lingual Neural Machine Translation using Keras.

Last Updated on January 1, 2023 by Editorial Team

Author(s): Fraol Batole


This article shows a step-by-step implementation of a Multi-lingual Neural Machine Translation (MNMT) model. In this implementation, we build an MNMT model based on an encoder-decoder architecture.

An overview of MNMT:

Before diving into the implementation, let's take a step back and understand what Multi-lingual Neural Machine Translation (MNMT) models are. There are around 7,100 languages in the world [https://www.ethnologue.com/guides/how-many-languages], and for many reasons, text written in one language needs to be translated into another. People used to translate languages manually, but recent advances in deep learning have enabled the automatic translation of text. Translating text from one language into another with a neural network is called Neural Machine Translation (NMT). For example, given an English text, an NMT model can translate it into Spanish.

Now, what if we need to translate between more languages? One option is to build several NMT models, each supporting a different language pair. However, recent research has shown that training a single model that supports translation across multiple languages is beneficial for several reasons; this is referred to as Multi-lingual Neural Machine Translation (MNMT). For example, context learned from one language pair can carry over to another, which is especially beneficial for low-resource languages. The picture below illustrates MNMT.

MNMT Model (taken from Google)

Data collection and preprocessing:

Step 1: Collecting and cleaning the dataset

To train an MNMT model, a parallel corpus (text) must exist across the languages. That is, in the picture above, a sentence in Japanese must have corresponding translations in English and Korean. A suitable dataset for this purpose is Tatoeba, which contains sentences in over 400 languages. The sentence pairs supported by this dataset can be downloaded here.

Step 2: Extracting Parallel Sentences

As explained above, we need parallel text between each language, so we must filter the sentences down to those that have translations in all of them. For this blog, we only use English-to-{Japanese, Korean}, as illustrated in the picture above. To find the matching sentences, we can take the intersection between the language pairs using the pandas library.

The code to implement this is available here.

The Tatoeba dataset assigns an ID to each sentence. Since we downloaded "English to Japanese" and "English to Korean" paired sentences, we can take the intersection of the English IDs between the two pairs. This lets us match each English sentence with its translations into Japanese and Korean. The extracted sentences can then be used to train our model.
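For illustration, here is a minimal pandas sketch of this step. The file names and the column layout of the Tatoeba exports are assumptions made for this sketch; adjust them to match the files you actually download.

import pandas as pd

# Assumed layout of each downloaded Tatoeba export: a tab-separated file with
# the English sentence ID, the English text, and the translated text.
cols = ["eng_id", "eng_text", "target_text"]
eng_jpn = pd.read_csv("eng-jpn.tsv", sep="\t", names=cols)   # hypothetical file name
eng_kor = pd.read_csv("eng-kor.tsv", sep="\t", names=cols)   # hypothetical file name

# Intersect on the English sentence ID so that every English sentence
# has both a Japanese and a Korean translation.
parallel = eng_jpn.merge(eng_kor, on="eng_id", suffixes=("_jpn", "_kor"))
parallel = parallel[["eng_text_jpn", "target_text_jpn", "target_text_kor"]]
parallel.columns = ["english", "japanese", "korean"]

# Save the three-way parallel corpus for the preprocessing step below.
parallel.to_csv("eng-jpn-kor.txt", sep="\t", index=False, header=False)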

Step 3: Preprocessing the dataset

Now, we can prepare the training and test sets for the model. In the code below, we load the downloaded and cleaned dataset and add a start and end tag to each target sentence.

import random

# Placeholder path: the tab-separated parallel file produced in Step 2,
# with one "English \t Japanese \t Korean" triple per line.
text_file = "eng-jpn-kor.txt"
with open(text_file) as f:
    lines = f.read().split("\n")[:-1]

text_pairs = []
for line in lines:
    english, japanese, korean = line.split("\t")
    # Add start/end tags to the target-language sentences.
    japanese = "[start] " + japanese + " [end]"
    korean = "[start] " + korean + " [end]"
    text_pairs.append((english, japanese, korean))

print(random.choice(text_pairs))

random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]

Moreover, each text should be vectorized, i.e., converted into a numerical vector that the model can process efficiently. We use existing code from Deep Learning with Python (second edition) to vectorize and prepare the dataset.

Step 4: Vectorizing and preparing the dataset

The code can be accessed here.
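For illustration, here is a minimal sketch of this step, adapted from the vectorization code in Deep Learning with Python [1]. The hyperparameter values, the flatten helper, and the decision to feed both target languages into a single shared target vocabulary are assumptions made for this sketch. Note also that whitespace tokenization is a simplification for Japanese and Korean, which would normally require a subword or morphological tokenizer.

import re
import string
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

vocab_size = 15000          # assumed vocabulary size
sequence_length = 20        # assumed maximum sequence length
batch_size = 64

# Keep the [start] / [end] tags intact when stripping punctuation from targets.
strip_chars = string.punctuation.replace("[", "").replace("]", "")

def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, "[%s]" % re.escape(strip_chars), "")

source_vectorization = TextVectorization(
    max_tokens=vocab_size, output_mode="int",
    output_sequence_length=sequence_length)
target_vectorization = TextVectorization(
    max_tokens=vocab_size, output_mode="int",
    output_sequence_length=sequence_length + 1,
    standardize=custom_standardization)

# Turn each (english, japanese, korean) triple into two (source, target) examples
# so that one decoder with a shared target vocabulary sees both target languages.
def flatten(triples):
    return [(eng, tgt) for eng, jpn, kor in triples for tgt in (jpn, kor)]

train_flat = flatten(train_pairs)
source_vectorization.adapt([src for src, _ in train_flat])
target_vectorization.adapt([tgt for _, tgt in train_flat])

def format_dataset(src, tgt):
    src = source_vectorization(src)
    tgt = target_vectorization(tgt)
    # Decoder input is the target shifted right; the label is the target shifted left.
    return ({"source": src, "target": tgt[:, :-1]}, tgt[:, 1:])

def make_dataset(triples):
    flat = flatten(triples)
    srcs = [src for src, _ in flat]
    tgts = [tgt for _, tgt in flat]
    ds = tf.data.Dataset.from_tensor_slices((srcs, tgts))
    ds = ds.batch(batch_size).map(format_dataset, num_parallel_calls=4)
    return ds.shuffle(2048).prefetch(16)

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)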

Building the model:

Step 5: LSTM-based Encoder-Decoder Model

In this article, the multilingual aspect is handled mainly through how we preprocess the dataset rather than through the model itself. However, there are several other approaches to building an MNMT model. For example, we could add additional outputs to the model, with each output layer responsible for predicting a single language, as sketched below.
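As an illustration of that alternative (not the model we build below), a shared encoder and decoder could be given one softmax head per target language. The layer names and hyperparameters here are hypothetical.

from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim, latent_dim = 15000, 256, 1024  # assumed hyperparameters

# Shared encoder over the English source sentence.
source = keras.Input(shape=(None,), dtype="int64", name="source")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
encoded = layers.Bidirectional(layers.LSTM(latent_dim), merge_mode="sum")(x)

# Shared decoder over the previous target tokens.
past_target = keras.Input(shape=(None,), dtype="int64", name="target")
y = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
y = layers.LSTM(latent_dim, return_sequences=True)(y, initial_state=[encoded, encoded])

# One output head per target language.
japanese_head = layers.Dense(vocab_size, activation="softmax", name="japanese")(y)
korean_head = layers.Dense(vocab_size, activation="softmax", name="korean")(y)

multi_head_model = keras.Model([source, past_target], [japanese_head, korean_head])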

The model is based on [1], with minor changes: we use an LSTM instead of the GRU in the original code. Moreover, since we have multiple languages, we pass one language as the source and the others as the target.

from tensorflow import keras
from tensorflow.keras import layers

# Hyperparameters (example values; adjust to match your vectorization settings).
vocab_size, embed_dim, latent_dim = 15000, 256, 1024

# Encoder: embed the source tokens and summarize the sentence with a bidirectional LSTM.
source = keras.Input(shape=(None,), dtype="int64", name="source")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
encoded_source = layers.Bidirectional(
    layers.LSTM(latent_dim), merge_mode="sum")(x)

# Decoder: predict the next target token, conditioned on the encoder summary.
past_target = keras.Input(shape=(None,), dtype="int64", name="target")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
decoder_lstm = layers.LSTM(latent_dim, return_sequences=True)
# An LSTM carries both a hidden and a cell state, so we reuse the encoder summary for both.
x = decoder_lstm(x, initial_state=[encoded_source, encoded_source])
x = layers.Dropout(0.5)(x)
target_next_step = layers.Dense(vocab_size, activation="softmax")(x)
seq2seq_rnn = keras.Model([source, past_target], target_next_step)

Step 6: Training the model

We are using a simple LSTM model as an example, so its performance may be lower than that of other models. However, it is easy to build a different model and feed it the same datasets, since the technique relies mostly on the preprocessing rather than on the model itself.

# train_ds and val_ds come from the vectorization step (Step 4) above.
seq2seq_rnn.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
seq2seq_rnn.fit(train_ds, epochs=15, validation_data=val_ds)

Step 7: Checking the model’s translation

import numpy as np

# Look up tokens in the target vocabulary built by target_vectorization.
target_vocab = target_vectorization.get_vocabulary()
target_index_lookup = dict(zip(range(len(target_vocab)), target_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization([decoded_sentence])
        next_token_predictions = seq2seq_rnn.predict(
            [tokenized_input_sentence, tokenized_target_sentence])
        sampled_token_index = np.argmax(next_token_predictions[0, i, :])
        sampled_token = target_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence

test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(20):
    input_sentence = random.choice(test_eng_texts)
    print("-")
    print(input_sentence)
    print(decode_sequence(input_sentence))

References:

  1. https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/chapter11_part04_sequence-to-sequence-learning.ipynb
  2. Dabre, Raj, Chenhui Chu, and Anoop Kunchukuttan. "A Survey of Multilingual Neural Machine Translation." ACM Computing Surveys (CSUR) 53.5 (2020): 1–38.
  3. https://tatoeba.org/en/downloads
  4. https://ai.googleblog.com/2016/11/zero-shot-translation-with-googles.html

