

Last Updated on July 25, 2023 by Editorial Team

Author(s): Jun Wang

Originally published on Towards AI.

Word Embedding and Language Modeling | Towards AI

How to Get Deterministic word2vec/doc2vec/paragraph Vectors

OK, welcome to our Word Embedding Series. This post is the first story in the series. It is best suited for intermediate readers and above, who have trained, or at least tried to train, word2vec or doc2vec/paragraph vectors. But no worries: in the following posts I will introduce the background, the prerequisite knowledge, and how the code implements the ideas from the papers.

I will try my best not to redirect you to other links that ask you to read tedious tutorials and leave you giving up (trust me, I have been a victim of countless online tutorials 🙂 ). I want you to understand word vectors at the code level together with me, so that we know how to design and implement our own word embeddings and language models.

If you have ever trained word vectors yourself, you may have noticed that the model weights and the vector representations differ from one training run to the next, even when you feed in exactly the same training data. This is because of the randomness introduced at training time. The code speaks for itself, so let's take a look at where the randomness comes from and how to eliminate it thoroughly. I will use DL4j's implementation of paragraph vectors to show the code. If you want to look at another package, go to gensim's doc2vec, which is implemented the same way.

The initialization of model weights and vector representation

We know that before training, the model weights and the vector representations are initialized randomly, and the randomness is controlled by a seed. Hence, if we set the seed to a fixed value such as 0, we get exactly the same initialization every time. Here is where the seed takes effect: syn0 holds the model weights, and it is initialized by Nd4j.rand.

// Nd4j takes seed configuration here
Nd4j.getRandom().setSeed(configuration.getSeed());
// Nd4j initializes a random matrix for syn0
syn0 = Nd4j.rand(new int[] {vocab.numWords(), vectorLength}, rng).subi(0.5).divi(vectorLength);
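As a quick sanity check (my own minimal sketch, not code from the DL4J source), fixing the Nd4j seed and drawing the same matrix twice shows that the "random" initialization is fully reproducible:

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class SeedDemo {
    public static void main(String[] args) {
        // fix the RNG seed and draw a "random" matrix
        Nd4j.getRandom().setSeed(0);
        INDArray first = Nd4j.rand(5, 3);

        // reset to the same seed and draw again
        Nd4j.getRandom().setSeed(0);
        INDArray second = Nd4j.rand(5, 3);

        // with the same seed, the two matrices are identical
        System.out.println(first.equals(second)); // prints: true
    }
}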

PV-DBOW algorithm

If we use the PV-DBOW algorithm (I will explain its details in the following posts) to train paragraph vectors, then during the training iterations it randomly subsamples words from the text window to calculate and update the weights. But this randomness is not really random. Let's take a look at the code.

// next random is an AtomicLong initialized by thread id
this.nextRandom = new AtomicLong(this.threadId);

And nextRandom is used in

trainSequence(sequence, nextRandom, alpha);

Inside trainSequence, it does

nextRandom.set(nextRandom.get() * 25214903917L + 11);

If we go deeper into the training steps, we will find that nextRandom is always generated the same way, i.e., by applying the same mathematical operation (a linear congruential update, the same one used in the original word2vec C code), so the value depends only on the thread id, and the thread ids are simply 0, 1, 2, 3, …. Hence, it is no longer random.
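To see this concretely, here is a small standalone sketch (my own illustration, not DL4J code) of the same update: for a fixed thread id, the whole sequence of "random" values is completely determined.

import java.util.concurrent.atomic.AtomicLong;

public class NextRandomDemo {
    public static void main(String[] args) {
        long threadId = 0; // thread ids are simply 0, 1, 2, ...
        AtomicLong nextRandom = new AtomicLong(threadId);

        // the same linear congruential update as in trainSequence:
        // starting from the same thread id, every run prints the same values
        for (int i = 0; i < 5; i++) {
            nextRandom.set(nextRandom.get() * 25214903917L + 11);
            System.out.println(nextRandom.get());
        }
    }
}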

Parallel tokenization

Parallel tokenization is used because processing complicated text can be time-consuming, and tokenizing in parallel helps performance; however, consistency across training runs is not guaranteed. The sequences produced by the tokenizer can be fed to the training threads in a random order. As you can see from the code, if we set allowParallelBuilder to false, the runnable doing the tokenization waits until it finishes, so the order of the data being fed is maintained.

if (!allowParallelBuilder) {
    try {
        runnable.awaitDone();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException(e);
    }
}

Queue that provides sequences to every thread to train

This LinkedBlockingQueue gets sequences from the iterator over the training text and provides them to each thread. Since the threads arrive in an unpredictable order, each thread can receive different sequences in every training run. Let's look at the implementation of this data provider.

// initialize a sequencer to provide data to the threads
val sequencer = new AsyncSequencer(this.iterator, this.stopWords);

// every thread points to the same sequencer;
// workers is the number of threads we want to use
for (int x = 0; x < workers; x++) {
    threads.add(x, new VectorCalculationsThread(x, ..., sequencer));
    threads.get(x).start();
}

// the sequencer initializes a LinkedBlockingQueue buffer
// and keeps its size between [limitLower, limitUpper]
private final LinkedBlockingQueue<Sequence<T>> buffer;
limitLower = workers * batchSize;
limitUpper = workers * batchSize * 2;

// threads get data from the queue through
buffer.poll(3L, TimeUnit.SECONDS);

Hence, if we set the number of workers to 1, training runs in a single thread and the data is fed in exactly the same order in every training run. But notice that a single thread slows training down tremendously.
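Here is a minimal standalone sketch (my own illustration, not the DL4J source) of why one consumer is deterministic: a single thread drains a LinkedBlockingQueue strictly in insertion (FIFO) order, whereas several consumer threads would interleave unpredictably.

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class SingleWorkerOrderDemo {
    public static void main(String[] args) throws InterruptedException {
        // stand-in for the sequencer's buffer of training sequences
        LinkedBlockingQueue<String> buffer = new LinkedBlockingQueue<>();
        for (int i = 0; i < 5; i++) {
            buffer.put("sequence-" + i); // producer side: the iterator over the training text
        }

        // a single consumer always sees sequence-0, sequence-1, ... in the same order;
        // the final poll simply times out after 3 seconds, as in the snippet above
        String sequence;
        while ((sequence = buffer.poll(3L, TimeUnit.SECONDS)) != null) {
            System.out.println(sequence);
        }
    }
}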

Summary

To summarize, here is what we need to do to exclude randomness thoroughly:
1. Set the seed to a fixed value, e.g., 0;
2. Set allowParallelTokenization to false;
3. Set the number of workers (threads) to 1.

Then we will get exactly the same word vectors and paragraph vectors every time we feed in the same data.

Finally, our training code looks like this:

ParagraphVectors vec = new ParagraphVectors.Builder()
        .minWordFrequency(1)
        .labels(labelsArray)
        .layerSize(100)
        .stopWords(new ArrayList<String>())
        .windowSize(5)
        .iterate(iter)
        .allowParallelTokenization(false)
        .workers(1)
        .seed(0)
        .tokenizerFactory(t)
        .build();

vec.fit();
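To verify the setup, one simple check is to build and fit the model twice on the same data and compare a few vectors. The sketch below is only an illustration: buildAndFit() is a hypothetical helper assumed to apply exactly the configuration above and call fit() on the same iterator, and "day" is just an example word from the corpus.

import java.util.Arrays;

// buildAndFit() is a hypothetical helper: it applies the builder configuration
// shown above (seed 0, workers 1, allowParallelTokenization false) and calls fit()
ParagraphVectors first = buildAndFit();
ParagraphVectors second = buildAndFit();

// with all three sources of randomness removed, the learned vectors match exactly
double[] v1 = first.getWordVector("day");
double[] v2 = second.getWordVector("day");
System.out.println(Arrays.equals(v1, v2)); // prints: true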

If this left you wanting more, please follow the next stories about word embeddings and language models. I have prepared a feast for you.

Reference

[1] Deeplearning4j, ND4J, DataVec and more — deep learning & linear algebra for Java/Scala with GPUs + Spark — From Skymind http://deeplearning4j.org https://github.com/deeplearning4j/deeplearning4j

[2] Java™ Platform, Standard Edition 8 API Specification https://docs.oracle.com/javase/8/docs/api/



