LogBERT Explained In Depth: Part II

Last Updated on October 9, 2022 by Editorial Team

Author(s): David Schiff

In the previous article, I covered the basics of the attention mechanism and the transformer block in general. In this part of the series, I would like to cover how LogBERT is trained and how we can use it to detect anomalies in log sequences.

Let’s get into the nitty-gritty details of LogBERT.

In the paper (https://arxiv.org/pdf/2103.04475.pdf), a log sequence is defined as:
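
(Notation transcribed roughly from the paper, with T the number of keys in the j-th sequence:)

S^j = \{ k^j_1, k^j_2, \dots, k^j_T \}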

Where S is a sequence of keys (words) in the log sequence. Notice how our log keys are marked with a superscript j to indicate the sequence they belong to and a subscript on k to indicate the index in the series of words.

As usual, when training a transformer, we want to add a token to mark the beginning of the sequence, and that token will be “DIST”, as mentioned in the paper. Another special token we will be adding is the “MASK” token. The mask token will be used to cover up words in the sentence.

If you take a look at the last part, you will remember our running example:

Preprocessing our log sequences

As you can see, our log sequence is preprocessed to contain the different special tokens.

DIST - Start of sentence

MASK - Covers up a key in the sequence

EOS - End of a sentence (although not mentioned in the paper, it is used in the code on GitHub)
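
To make the preprocessing concrete, here is a minimal sketch of what it could look like in Python. The token names and the 0.15 masking probability follow the description above; the function and variable names are purely illustrative and are not taken from the LogBERT repository.

import random

DIST, MASK, EOS = "DIST", "MASK", "EOS"

def preprocess(log_keys, mask_prob=0.15):
    # Wrap a raw sequence of log keys with the special tokens and randomly mask keys.
    tokens, labels = [DIST], [None]        # DIST marks the start of the sequence
    for key in log_keys:
        if random.random() < mask_prob:    # each key is masked with probability 0.15
            tokens.append(MASK)
            labels.append(key)             # remember the real key hidden under the mask
        else:
            tokens.append(key)
            labels.append(None)
    tokens.append(EOS)                     # EOS marks the end of the sequence
    labels.append(None)
    return tokens, labels

# Toy example with five log keys
sequence, targets = preprocess(["k7", "k2", "k9", "k2", "k5"])
print(sequence)  # e.g. ['DIST', 'k7', 'MASK', 'k9', 'k2', 'k5', 'EOS']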

Before I clear everything up about the special tokens, I want to review the training phase of LogBERT.

Recall the different parts of the LogBERT architecture:

Attention using the Q, K, and V matrices
MultiHead Attention concatenating the different Attention heads

The functions above are just the mathematical descriptions of MultiHead Attention:
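
These are the standard scaled dot-product attention and multi-head attention formulas from the original Transformer, written out here for reference:

\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V

\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O, \quad \text{head}_i = \text{Attention}(X W^Q_i, X W^K_i, X W^V_i)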

And finally, it all comes down to the transformer block’s mathematical description:
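
In the standard encoder formulation (written out here, not copied verbatim from the paper), a transformer block applies multi-head attention and a position-wise feed-forward network, each with a residual connection and layer normalization:

Z = \text{LayerNorm}(X + \text{MultiHead}(X))

H = \text{LayerNorm}(Z + \text{FFN}(Z))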

This is just the whole transformer block we walked through in the previous part.

Beautiful, now we have it all tied together. Usually, transformers have multiple transformer layers, which means we can define a general Transformer function that is just a sequence of transformer blocks. This is described in the paper as:
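
Roughly, with L stacked transformer blocks:

h = \text{Transformer}(X) = \text{TransformerBlock}_L(\cdots \text{TransformerBlock}_1(X) \cdots)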

Where h is the output of the Transformer function. This output is essentially a vector that encodes all the information about the log sequence, defined as X.

That sums up the general LogBERT architecture. Notice how the output layer is the size of the vocabulary: as usual in classification, and specifically in our case, we need to predict the word covered by the MASK token. This leads me to the loss function of the final LogBERT model.

As mentioned earlier, all log sequences are preprocessed to contain a masked word. The model will try to predict the word being masked. This is a self-supervised task that the model will have to complete. The loss function for this task is:
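
Paraphrasing the paper, it is a categorical cross-entropy over the masked positions:

\mathcal{L}_{MLKP} = -\frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{M} \log \hat{y}^{\,j}_{i}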

Where y is the real word under the mask and y hat is the probability assigned to that real word. We can see that the loss is essentially a categorical cross-entropy loss function where the categories are actually words. Notice how the summation is over N and M, with N being the number of log sequences and M being the total number of masks chosen in each sequence.

For each log sequence, each word is chosen for masking with a probability of 0.15, so in practice multiple words get masked. For each log, we calculate the total loss over the mask predictions in the sentence, and the final loss is summed over all sequences and divided by the number of log sequences.

The other task (loss function) that LogBERT needs to minimize is the Volume Of The HyperSphere loss function.

This loss function uses h, as mentioned above. h is the representation of the log sequence (specifically h_DIST), and the goal of this loss function is to minimize the distance between the representations of each log and their center. The center is updated at each epoch and is calculated as the average over all log representations.
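
Paraphrasing the paper, with c denoting the center:

\mathcal{L}_{VHM} = \frac{1}{N} \sum_{j=1}^{N} \left\| h^{\,j}_{\text{DIST}} - c \right\|^{2}, \qquad c = \frac{1}{N} \sum_{j=1}^{N} h^{\,j}_{\text{DIST}}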

Finally, both loss functions are used to update the model weights. The final loss function is a weighted sum of the two.
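
With a weighting hyperparameter α balancing the two terms, the combined objective is:

\mathcal{L} = \mathcal{L}_{MLKP} + \alpha \, \mathcal{L}_{VHM}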

Now, how do we use LogBERT to find anomalous logs? As proposed in the paper, we go over a log sequence and calculate the model’s predictions for each masked word. We define a hyperparameter g: for each mask, we look at the g most probable candidate words. If the actual word hidden under the mask is not among those top g candidates, we count that position as an anomaly. We then define another hyperparameter r as our threshold for deciding whether a log is anomalous: if there are more than r word-level anomalies, the log sequence is flagged as anomalous.
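
Here is a minimal sketch of that detection rule in Python. It assumes a hypothetical model output of one probability distribution over the vocabulary per masked position; none of the names below come from the LogBERT repository.

import numpy as np

def is_anomalous(mask_probs, true_keys, g=10, r=2):
    # Top-g / r-threshold rule described above.
    # mask_probs: array of shape (num_masks, vocab_size), one probability
    # distribution per masked position (hypothetical model output).
    # true_keys: the real key index hidden under each mask.
    anomalies = 0
    for probs, true_key in zip(mask_probs, true_keys):
        top_g = np.argsort(probs)[-g:]   # indices of the g most probable keys
        if true_key not in top_g:        # real key missing from the top-g candidates
            anomalies += 1
    return anomalies > r                 # more than r misses => anomalous sequence

# Toy usage with random "predictions" over a vocabulary of 50 keys
rng = np.random.default_rng(0)
probs = rng.random((3, 50))
probs /= probs.sum(axis=1, keepdims=True)
print(is_anomalous(probs, true_keys=[4, 17, 32]))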

I would like to propose another method to locate anomalous logs: simply use the final loss function and define a threshold z, so that any log sequence scoring higher than z is flagged as anomalous.
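
A sketch of this alternative, again assuming a hypothetical compute_loss helper that returns the combined MLKP + VHM loss for a single sequence:

def is_anomalous_by_loss(sequence, model, z=5.0):
    # Alternative rule: flag the sequence if its combined LogBERT loss exceeds the threshold z.
    loss = model.compute_loss(sequence)  # hypothetical helper, not the repository's API
    return loss > z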

That’s it! I hope you enjoyed the read. I highly recommend reading the original paper and hopping over to GitHub to look at the code itself.


Published via Towards AI
