

Last Updated on July 25, 2023 by Editorial Team

Author(s): Jun Wang

Originally published on Towards AI.

Word Embedding and Language Modeling | Towards AI

How to Get Deterministic word2vec/doc2vec/paragraph Vectors

OK, welcome to our Word Embedding Series. This post is the first story in the series. It is best suited for intermediate readers and above, who have trained, or at least tried to train, word2vec or doc2vec/paragraph vectors. But no worries: in the following posts I will introduce the background, the prerequisite knowledge, and how the code implements the ideas from the papers.

I will try my best not to redirect you to other links that ask you to read tedious tutorials and leave you giving up (trust me, I have been a victim of countless online tutorials 🙂 ). I want you to understand word vectors at the code level together with me, so that we know how to design and implement our own word embeddings and language models.

If you have ever trained word vectors yourself, you may have noticed that the model weights and the vector representations differ from one training run to the next, even when you feed in exactly the same training data. This is because of the randomness introduced at training time. The code speaks for itself, so let's take a look at where the randomness comes from and how to eliminate it thoroughly. I will use DL4j's implementation of paragraph vectors to show the code. If you want to look at another package, go to gensim's doc2vec, which is implemented the same way.

The initialization of model weights and vector representation

We know that before training, the model weights and the vector representations are initialized randomly, and the randomness is controlled by a seed. Hence, if we set the seed to a fixed value such as 0, we get exactly the same initialization every time. Here is where the seed takes effect: syn0 holds the model weights, and it is initialized by Nd4j.rand.

// Nd4j takes seed configuration here
Nd4j.getRandom().setSeed(configuration.getSeed());
// Nd4j initializes a random matrix for syn0
syn0 = Nd4j.rand(new int[] {vocab.numWords(), vectorLength}, rng).subi(0.5).divi(vectorLength);
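As a quick sanity check (my own minimal sketch, not code from the DL4J source), fixing the Nd4j seed and drawing the same matrix twice shows that the "random" initialization is fully reproducible:

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class SeedDemo {
    public static void main(String[] args) {
        // fix the RNG seed and draw a "random" matrix
        Nd4j.getRandom().setSeed(0);
        INDArray first = Nd4j.rand(5, 3);

        // reset to the same seed and draw again
        Nd4j.getRandom().setSeed(0);
        INDArray second = Nd4j.rand(5, 3);

        // with the same seed, the two matrices are identical
        System.out.println(first.equals(second)); // prints: true
    }
}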

PV-DBOW algorithm

If we use the PV-DBOW algorithm (I will explain its details in the following posts) to train paragraph vectors, then during the training iterations it randomly subsamples words from the text window to calculate and update the weights. But this randomness is not really random. Let's take a look at the code.

// next random is an AtomicLong initialized by thread id
this.nextRandom = new AtomicLong(this.threadId);

And nextRandom is used in

trainSequence(sequence, nextRandom, alpha);

Inside trainSequence, it does

nextRandom.set(nextRandom.get() * 25214903917L + 11);

If we go deeper into the training steps, we will find that nextRandom is always generated the same way, i.e., by applying the same mathematical operation (a linear congruential update, the same one used in the original word2vec C code), so the value depends only on the thread id, and the thread ids are simply 0, 1, 2, 3, …. Hence, it is no longer random.
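To see this concretely, here is a small standalone sketch (my own illustration, not DL4J code) of the same update: for a fixed thread id, the whole sequence of "random" values is completely determined.

import java.util.concurrent.atomic.AtomicLong;

public class NextRandomDemo {
    public static void main(String[] args) {
        long threadId = 0; // thread ids are simply 0, 1, 2, ...
        AtomicLong nextRandom = new AtomicLong(threadId);

        // the same linear congruential update as in trainSequence:
        // starting from the same thread id, every run prints the same values
        for (int i = 0; i < 5; i++) {
            nextRandom.set(nextRandom.get() * 25214903917L + 11);
            System.out.println(nextRandom.get());
        }
    }
}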

Parallel tokenization

Parallel tokenization is used because processing complicated text can be time-consuming, and tokenizing in parallel helps performance; however, consistency across training runs is not guaranteed. The sequences produced by the tokenizer can be fed to the training threads in a random order. As you can see from the code, if we set allowParallelBuilder to false, the runnable doing the tokenization waits until it finishes, so the order of the data being fed is maintained.

if (!allowParallelBuilder) {
    try {
        runnable.awaitDone();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException(e);
    }
}

Queue that provides sequences to every thread to train

This LinkedBlockingQueue gets sequences from the iterator over the training text and provides them to each thread. Since the threads arrive in an unpredictable order, each thread can receive different sequences in every training run. Let's look at the implementation of this data provider.

// initialize a sequencer to provide data to the threads
val sequencer = new AsyncSequencer(this.iterator, this.stopWords);

// every thread points to the same sequencer;
// workers is the number of threads we want to use
for (int x = 0; x < workers; x++) {
    threads.add(x, new VectorCalculationsThread(x, ..., sequencer));
    threads.get(x).start();
}

// the sequencer initializes a LinkedBlockingQueue buffer
// and keeps its size between [limitLower, limitUpper]
private final LinkedBlockingQueue<Sequence<T>> buffer;
limitLower = workers * batchSize;
limitUpper = workers * batchSize * 2;

// threads get data from the queue through
buffer.poll(3L, TimeUnit.SECONDS);

Hence, if we set the number of workers to 1, training runs in a single thread and the data is fed in exactly the same order in every training run. But notice that a single thread slows training down tremendously.
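Here is a minimal standalone sketch (my own illustration, not the DL4J source) of why one consumer is deterministic: a single thread drains a LinkedBlockingQueue strictly in insertion (FIFO) order, whereas several consumer threads would interleave unpredictably.

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class SingleWorkerOrderDemo {
    public static void main(String[] args) throws InterruptedException {
        // stand-in for the sequencer's buffer of training sequences
        LinkedBlockingQueue<String> buffer = new LinkedBlockingQueue<>();
        for (int i = 0; i < 5; i++) {
            buffer.put("sequence-" + i); // producer side: the iterator over the training text
        }

        // a single consumer always sees sequence-0, sequence-1, ... in the same order;
        // the final poll simply times out after 3 seconds, as in the snippet above
        String sequence;
        while ((sequence = buffer.poll(3L, TimeUnit.SECONDS)) != null) {
            System.out.println(sequence);
        }
    }
}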

Summary

To summarize, here is what we need to do to exclude randomness thoroughly:
1. Set the seed to a fixed value, e.g., 0;
2. Set allowParallelTokenization to false;
3. Set the number of workers (threads) to 1.

Then we will get exactly the same word vectors and paragraph vectors every time we feed in the same data.

Finally, our training code looks like this:

ParagraphVectors vec = new ParagraphVectors.Builder()
        .minWordFrequency(1)
        .labels(labelsArray)
        .layerSize(100)
        .stopWords(new ArrayList<String>())
        .windowSize(5)
        .iterate(iter)
        .allowParallelTokenization(false)
        .workers(1)
        .seed(0)
        .tokenizerFactory(t)
        .build();

vec.fit();
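To verify the setup, one simple check is to build and fit the model twice on the same data and compare a few vectors. The sketch below is only an illustration: buildAndFit() is a hypothetical helper assumed to apply exactly the configuration above and call fit() on the same iterator, and "day" is just an example word from the corpus.

import java.util.Arrays;

// buildAndFit() is a hypothetical helper: it applies the builder configuration
// shown above (seed 0, workers 1, allowParallelTokenization false) and calls fit()
ParagraphVectors first = buildAndFit();
ParagraphVectors second = buildAndFit();

// with all three sources of randomness removed, the learned vectors match exactly
double[] v1 = first.getWordVector("day");
double[] v2 = second.getWordVector("day");
System.out.println(Arrays.equals(v1, v2)); // prints: true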

If this left you wanting more, please follow the next stories about word embeddings and language models. I have prepared a feast for you.

Reference

[1] Deeplearning4j, ND4J, DataVec and more — deep learning & linear algebra for Java/Scala with GPUs + Spark — From Skymind http://deeplearning4j.org https://github.com/deeplearning4j/deeplearning4j

[2] Java™ Platform, Standard Edition 8 API Specification https://docs.oracle.com/javase/8/docs/api/



