Last Updated on January 6, 2023 by Editorial Team
Author(s): Maria Zorkaltseva
Originally published on Towards AI.
Conventional antivirus software and Intrusion Detection Systems (IDS) employ heuristic-based and signature-based methods to detect malicious JavaScript (JS) code. However, this kind of analysis can be ineffective against zero-day attacks. Machine learning (ML), which is being actively adopted in various industries, has also found its place in cybersecurity and has shown its effectiveness against zero-day attacks. When it comes to detecting malicious JS code, the approaches range from Natural Language Processing (NLP) techniques and standard ML on tabular data to deep learning models.
The input data for ML models varies because there are two ways to analyze the behavior of a program: static and dynamic code analysis. Static analysis examines the source code without running it. For instance, this can be achieved by traversing the code's Abstract Syntax Tree (AST). In contrast, dynamic code analysis requires the source code to be executed. In this post, we will consider only static analysis.
In this article, we will look at some related work to get an idea of what researchers propose for obfuscated JS code detection. We will also consider the task of classifying benign/malicious JS code snippets using a combination of NLP features and standard ML approaches.
To sum up, several common approaches to featurizing JS code for static analysis can be identified:
- Approach 1 (natural language): consider JS code as natural language text. Features can include character statistics, file entropy, counts of special functions, the number of special symbols, etc.
- Approach 2 (lexical features): regular expressions that extract plain-text JS elements (for example, matching [a-z]+ and removing special characters such as *, =, !, etc.), combined with NLP featurization methods such as Bag-of-Words (BOW), TF-IDF, Doc2Vec, LDA, embeddings, etc.
- Approach 3 (syntactic features): AST features plus NLP featurization.
- Approach 4 (semantic features): get AST features -> construct Control Flow Graph (CFG) -> build Program Dependency Graph (PDG) -> get semantic slices -> transformation to numerical vectors.
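To make approach 2 more concrete, here is a minimal sketch of lexical featurization using only the Python standard library. The function names `tokenize_js` and `bag_of_words` and the two sample snippets are my own illustrative assumptions, not code from any of the referenced works; a real pipeline would more likely rely on a library vectorizer.

```python
import re
from collections import Counter


def tokenize_js(source: str) -> list[str]:
    """Lowercase the source and keep only alphabetic runs ([a-z]+)."""
    return re.findall(r"[a-z]+", source.lower())


def bag_of_words(snippets: list[str]) -> tuple[list[str], list[list[int]]]:
    """Build a shared vocabulary and one token-count vector per snippet."""
    token_lists = [tokenize_js(s) for s in snippets]
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        # Counter returns 0 for tokens that are absent from this snippet.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical snippets used only for illustration.
snippets = [
    "var total = a + b;",
    "eval(unescape(payload));",
]
vocabulary, vectors = bag_of_words(snippets)
print(vocabulary)
print(vectors)
```

Each row of `vectors` can then be fed to a standard classifier; replacing raw counts with TF-IDF weights is a common next step.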
Coding section: Classification of benign/malicious JS code
For this purpose, let's use a dataset from the Machine Learning for Cybersecurity Cookbook. For simplicity, I will use approach 1, described above.
Label 0 denotes normal code; label 1 denotes obfuscated code.
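As a rough illustration of approach 1, the sketch below computes a few text-level statistics for a single JS snippet. The feature names, the `SPECIAL_CHARS` set, and the helper functions are assumptions chosen for this example; they are not necessarily the exact features used in the original notebook.

```python
import math
from collections import Counter

# Assumed set of "special" symbols; adjust to taste.
SPECIAL_CHARS = set("%$#@!&*()[]{}<>=+-/\\|^~")


def shannon_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the character distribution of `text`."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def basic_features(source: str) -> dict[str, float]:
    """Treat the JS source as plain text and compute simple statistics."""
    length = len(source)
    lines = source.splitlines() or [""]
    return {
        "length": length,
        "entropy": shannon_entropy(source),
        "max_line_length": max(len(line) for line in lines),
        "special_char_ratio": sum(ch in SPECIAL_CHARS for ch in source) / length if length else 0.0,
        "digit_ratio": sum(ch.isdigit() for ch in source) / length if length else 0.0,
    }


print(basic_features("var a = 1;\nalert(a);"))
```

The intuition is that obfuscated code often has long lines, unusual character distributions, and a high share of symbols and digits, so these numbers can help a tabular model separate the two classes.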
For additional feature ideas, I used the following list of JS functions that are frequently used in malicious JS code:
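The original list is not reproduced in this text, so the sketch below uses a hypothetical `SUSPICIOUS_FUNCTIONS` list of functions that commonly appear in discussions of obfuscated JS. Treat both the list and the `count_function_calls` helper as assumptions for illustration only.

```python
import re

# Hypothetical, illustrative list; NOT the exact list from the original notebook.
SUSPICIOUS_FUNCTIONS = [
    "eval",
    "unescape",
    "escape",
    "fromCharCode",
    "setTimeout",
    "setInterval",
    "document.write",
    "atob",
]


def count_function_calls(source: str, names=SUSPICIOUS_FUNCTIONS) -> dict[str, int]:
    """Count call sites such as `eval(` for each function name."""
    counts = {}
    for name in names:
        # The lookbehind keeps "escape(" from matching inside "unescape(".
        pattern = r"(?<![\w$])" + re.escape(name) + r"\s*\("
        counts[name] = len(re.findall(pattern, source))
    return counts


sample = "eval(unescape('%61')); document.write(String.fromCharCode(72, 105));"
print(count_function_calls(sample))
```

Each count becomes one extra column in the feature table, alongside the text statistics.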
Train/test data split
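A minimal sketch of this step with scikit-learn's `train_test_split` is shown below. The feature matrix here is synthetic random data standing in for the real features, and the 80/20 ratio, the class balance, and the `random_state` values are my assumptions rather than the settings from the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature table: 100 snippets, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 70 + [1] * 30)  # 0 = normal, 1 = obfuscated

# stratify=y keeps the class proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print("obfuscated share in test:", y_test.mean())
```

Stratification matters when the classes are imbalanced, because a purely random split could leave the test set with too few obfuscated samples to evaluate on.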
Random Forest Model
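The sketch below trains a `RandomForestClassifier` from scikit-learn on synthetic data and reports accuracy on a held-out set. The two synthetic columns only imitate features such as entropy and special character ratio, and the hyperparameters (`n_estimators=100`, `random_state=42`) are assumptions; results on the real dataset will differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features: column 0 imitates entropy, column 1 a special-char ratio.
benign = rng.normal(loc=[4.0, 0.10], scale=[0.3, 0.03], size=(200, 2))
obfuscated = rng.normal(loc=[5.5, 0.35], scale=[0.3, 0.05], size=(200, 2))
X = np.vstack([benign, obfuscated])
y = np.array([0] * 200 + [1] * 200)  # 0 = normal, 1 = obfuscated

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
importances = model.feature_importances_

print("accuracy:", accuracy)
print("feature importances:", importances)
```

The `feature_importances_` attribute gives a quick view of which features the forest relies on most, which is useful for deciding which hand-crafted features to keep.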
👉🏻 Full code is also accessible through my GitHub.