Natural Language Processing in Tensorflow

Last Updated on December 17, 2020 by Editorial Team

Tokenization and Sequencing

Natural Language Processing in Tensorflow — Photo by Emma Matthews Digital Content Production on Unsplash

In this blog post, we shall learn how to implement tokenization and sequencing, two important text pre-processing steps, in TensorFlow.

Outline

Introduction to Tokenizer
Understanding Sequencing

Introduction to Tokenizer

Tokenization is the process of splitting text into smaller units such as sentences, words, or subwords. In this section, we shall see how to pre-process a text corpus by tokenizing it into words in TensorFlow. We shall use the Keras API with the TensorFlow backend; the code snippet below shows the necessary imports.📑

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

And voila🎉 we have all modules imported! Let’s initialize a list of sentences that we shall tokenize.

sentences = [
    'Life is so beautiful',
    'Hope keeps us going',
    'Let us celebrate life!'
]

The next step is to instantiate the Tokenizer and call the fit_on_texts method.

tokenizer = Tokenizer()

tokenizer.fit_on_texts(sentences)

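When the text corpus is very large, we can pass an additional num_words argument to restrict the vocabulary to the most frequent words. For example, tokenizer = Tokenizer(num_words=100) keeps only the most frequent words (those with an index below num_words, that is, the top 99) when texts are later converted to sequences. Note that word_index itself still lists every word seen during fitting. 😊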

To know how these tokens have been created and the indices assigned to words, we can use the word_index attribute.

word_index = tokenizer.word_index

print(word_index)

💡 Here’s the output:

{'life': 1, 'us': 2, 'is': 3, 'so': 4, 'beautiful': 5, 'hope': 6, 'keeps': 7, 'going': 8, 'let': 9, 'celebrate': 10}

Well, so far so good! But what happens when the test data contains words that we’ve not accounted for in the vocabulary?🤔

test_data = [
    'Our life is to celebrate',
    'Hoping for the best!',
    'Let peace prevail everywhere'
]

We have introduced sentences in test_data which contain words that are not in our earlier vocabulary.

How do we account for such words that are not in the vocabulary? 🤔 We can pass the oov_token argument to reserve a token for such Out Of Vocabulary (OOV) words.😀

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)

After fitting this tokenizer on the same sentences, word_index contains the following:

{'<OOV>': 1, 'life': 2, 'us': 3, 'is': 4, 'so': 5, 'beautiful': 6, 'hope': 7, 'keeps': 8, 'going': 9, 'let': 10, 'celebrate': 11}
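To make the numbering in both word_index outputs less mysterious, here is a minimal plain-Python sketch of how such a frequency-ranked index can be built. It is not the Keras implementation: the helper names simple_tokenize and build_word_index and the FILTERS string are assumptions made for illustration, chosen to approximate the Tokenizer defaults (lowercasing, stripping punctuation, splitting on whitespace).

```python
from collections import Counter

# Characters treated as separators; an assumption that approximates the
# Tokenizer's default punctuation filter.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text):
    """Lowercase the text, replace filtered characters with spaces, and split."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts, oov_token=None):
    """Rank words by frequency (ties keep first-seen order); indices start at 1."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokenize(text))
    # sorted() is stable, so equally frequent words stay in first-seen order.
    ranked = sorted(counts, key=lambda w: counts[w], reverse=True)
    # A hypothetical OOV token, if given, is placed first and so gets index 1.
    vocab = ([oov_token] if oov_token is not None else []) + ranked
    return {word: i for i, word in enumerate(vocab, start=1)}


sentences = [
    'Life is so beautiful',
    'Hope keeps us going',
    'Let us celebrate life!',
]
print(build_word_index(sentences))
print(build_word_index(sentences, oov_token='<OOV>'))
```

'life' and 'us' each occur twice, so they receive the lowest indices; every other word occurs once and keeps its order of first appearance. Reserving an OOV token simply shifts every word up by one.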

Understanding Sequencing

In this section, we shall build on the tokenized text, using these generated tokens to convert the text into a sequence. 📕📗📘📒

We can get a sequence by calling the texts_to_sequences method.

sequences = tokenizer.texts_to_sequences(sentences)

Here’s the output:

[[2, 4, 5, 6], [7, 8, 3, 9], [10, 3, 11, 2]]
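Conceptually, converting text to a sequence is a dictionary lookup per word. The plain-Python sketch below illustrates that lookup, including what happens to unknown words and to words cut off by num_words. The function name texts_to_sequences_sketch and the simplified regex tokenization are assumptions for illustration; this mirrors the Keras behaviour only as far as we understand it and is not the library code.

```python
import re


def texts_to_sequences_sketch(texts, word_index, oov_token=None, num_words=None):
    """Replace each word by its index; unknown or cut-off words become OOV or are skipped."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    sequences = []
    for text in texts:
        seq = []
        # Simplified tokenization: lowercase, keep runs of letters, digits, apostrophes.
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            idx = word_index.get(word)
            if idx is not None and (num_words is None or idx < num_words):
                seq.append(idx)
            elif oov_index is not None:
                seq.append(oov_index)
            # With no OOV token, the word is simply skipped.
        sequences.append(seq)
    return sequences


word_index = {'<OOV>': 1, 'life': 2, 'us': 3, 'is': 4, 'so': 5, 'beautiful': 6,
              'hope': 7, 'keeps': 8, 'going': 9, 'let': 10, 'celebrate': 11}
sentences = ['Life is so beautiful', 'Hope keeps us going', 'Let us celebrate life!']

full = texts_to_sequences_sketch(sentences, word_index, oov_token='<OOV>')
capped = texts_to_sequences_sketch(sentences, word_index, oov_token='<OOV>', num_words=5)
print(full)    # [[2, 4, 5, 6], [7, 8, 3, 9], [10, 3, 11, 2]]
print(capped)  # [[2, 4, 1, 1], [1, 1, 3, 1], [1, 3, 1, 2]]
```

With num_words=5, only indices 1 to 4 survive, and every rarer word is mapped to the OOV index 1.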

Let’s now take a step back. What happens when the sentences are of different lengths? 🙄 Most models expect inputs of equal length, so we will have to convert all the sequences to the same length.🤷

We shall import the pad_sequences function to pad our sequences and look at the padded sequences.

from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(sequences)

print("\nPadded Sequences:")

print(padded)

# Output

Padded Sequences:
[[ 2  4  5  6]
 [ 7  8  3  9]
 [10  3 11  2]]

By default, every sequence is padded to the length of the longest sequence. Here all three sequences already contain four tokens, so the output is unchanged. We can also set the target length explicitly with the maxlen argument.

padded = pad_sequences(sequences, maxlen=5)

print("\nPadded Sequences:")

print(padded)

# Output

Padded Sequences: 
[[ 0  2  4  5  6] 
 [ 0  7  8  3  9] 
 [ 0 10  3 11  2]]

Now, let’s convert the test data to sequences and then pad them.

test_seq = tokenizer.texts_to_sequences(test_data)

print("\nTest Sequence = ", test_seq)

padded = pad_sequences(test_seq, maxlen=10)

print("\nPadded Test Sequence: ")

print(padded)

And here’s our output.

# Output

Test Sequence =  [[1, 2, 4, 1, 11], [1, 1, 1, 1], [10, 1, 1, 1]]

Padded Test Sequence:  
[[ 0  0  0  0  0  1  2  4  1 11] 
 [ 0  0  0  0  0  0  1  1  1  1] 
 [ 0  0  0  0  0  0 10  1  1  1]]

We see that all the padded sequences are of length maxlen and are padded with 0s at the beginning. What if we would like to add trailing zeros instead? We only need to specify padding='post'.

padded = pad_sequences(test_seq, maxlen=10, padding='post')

print("\nPadded Test Sequence: ")

print(padded)

# Output

Padded Test Sequence: 
[[ 1  2  4  1 11  0  0  0  0  0]
 [ 1  1  1  1  0  0  0  0  0  0]
 [10  1  1  1  0  0  0  0  0  0]]
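The difference between padding at the beginning and padding at the end can be sketched in a few lines of plain Python. The helper pad_sketch below is a hypothetical stand-in written for illustration, not the Keras function: it returns nested lists rather than a NumPy array and assumes that no sequence is longer than maxlen (truncation is left out).

```python
def pad_sketch(sequences, maxlen=None, padding='pre', value=0):
    """Pad every sequence with `value` up to maxlen (default: the longest sequence)."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Number of filler values needed; empty when the sequence is already full.
        fill = [value] * (maxlen - len(seq))
        if padding == 'pre':
            padded.append(fill + list(seq))   # zeros in front
        else:
            padded.append(list(seq) + fill)   # zeros at the end
    return padded


test_seq = [[1, 2, 4, 1, 11], [1, 1, 1, 1], [10, 1, 1, 1]]
for row in pad_sketch(test_seq, maxlen=10, padding='post'):
    print(row)
```

Calling pad_sketch(test_seq) without maxlen pads only up to length 5, the length of the longest test sequence.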

So far, none of the sentences has a length exceeding maxlen, but in practice we may have sentences that are much longer than maxlen. In that case, the sequences have to be truncated: truncating='pre' (the default) drops the first few tokens, while truncating='post' drops the last few tokens that exceed the specified maxlen. Here’s the link to the Colab notebook for the above example.
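The two truncating modes come down to which end of the sequence is sliced off. A minimal sketch in plain Python, using a hypothetical helper truncate_sketch (not part of Keras) and assuming maxlen is a positive integer:

```python
def truncate_sketch(seq, maxlen, truncating='pre'):
    """Keep maxlen items: 'pre' drops from the front, 'post' drops from the end."""
    if truncating == 'pre':
        return seq[-maxlen:]   # keep the last maxlen tokens
    return seq[:maxlen]        # keep the first maxlen tokens


long_seq = [1, 2, 4, 1, 11]
print(truncate_sketch(long_seq, 3, 'pre'))   # [4, 1, 11]
print(truncate_sketch(long_seq, 3, 'post'))  # [1, 2, 4]
```

Which mode to prefer depends on where the informative words tend to sit in the text; that choice is left to the reader.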

Happy learning and coding!🎈✨🎉👩🏽‍💻

Reference

Natural Language Processing in TensorFlow

Natural Language Processing in Tensorflow was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI

Author(s): Bala Priya C