How Machine Learning Detects Phishing Attacks
Last Updated on April 22, 2024 by Editorial Team
Author(s): Eera Bhatt
Originally published on Towards AI.
Especially since the COVID-19 pandemic, we have relied on the Internet for many of our daily services, such as banking, entertainment, and social networking.
Sadly, cyber attackers take advantage of this by launching more phishing attacks on unsuspecting users. (Nope, not that kind of fishing.)
But what is phishing? Phishing is when a cyber attacker tries to obtain personal information, such as credit card numbers and passwords, from a user by getting them to click on a deceptive link. In most cases, the attacker sends this link by email or through a similar messaging platform. Often the link leads to a page built to resemble a legitimate website, so it can be very difficult for a user to tell a genuine website from a fake one.
Once personal information is gained like this, it's easy for the attacker to spread malware (harmful software) such as ransomware, which locks a user's device or files until they make a payment to the attacker. So how do we combat this?
Blacklists. The most common automatic defense against phishing is a blacklist of suspicious URLs. For instance, search engines like Bing and Google might maintain blacklists of harmful phishing URLs to keep their users from clicking on these links. Blacklists tend to be updated frequently, either by cybersecurity experts or by users who encounter attacks.
Just to clarify, a website is blacklisted only when some type of malicious content is confirmed to be present. Once a URL has been confirmed as harmful, the search engine removes it from its results.
But there's a problem with this: blacklists might not be able to identify phishing websites that are newly developed. In fact, almost 1.5 million new phishing websites are created every month! What we need is a new way to identify phishing URLs that haven't been revealed yet.
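To make that limitation concrete, here is a minimal sketch of a blacklist lookup. The entries and the `normalize` helper are invented for illustration; real blacklists are far larger and use their own matching rules.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist entries, invented for this example only.
BLACKLIST = {
    "login-secure-bank.example/verify",
    "free-prizes.example/claim",
}

def normalize(url: str) -> str:
    """Reduce a URL to 'host/path' so trivial variations still match."""
    parts = urlsplit(url.strip().lower())
    host = parts.netloc
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")

def is_blacklisted(url: str) -> bool:
    """A URL is flagged only if its normalized form was already reported."""
    return normalize(url) in BLACKLIST

# A URL that was already reported is caught...
print(is_blacklisted("https://www.login-secure-bank.example/verify/"))
# ...but a freshly registered look-alike is not.
print(is_blacklisted("https://login-secure-bank2.example/verify"))
```

The lookup is fast and simple, but it can only ever recognize URLs that somebody has already reported.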
In a study [1], the authors create a machine learning approach to solve this by extracting features (characteristics) of suspicious webpages to detect new phishing offenses. They propose eight different features that involve the relationship between a webpage's URL and its content.
The authors divide their machine learning model into three steps:
- Extract features and download webpage content.
- Apply feature vectorization (don't worry, we'll get to that).
- Determine whether the webpage is a phishing page.
Let's go through each of these phases a bit more closely.
Feature generation. In this study, the authors generate features based on the webpage's URL as well as its HTML source code. Source code is like a set of instructions written by programmers that describes what the webpage contains and how it behaves. Unlike compiled machine code, it is plain text that people can read.
Document Object Model Tree. The authors use a DOM tree so that they can access all the HTML elements of the webpage, including its content and hyperlinks.
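As a rough illustration of pulling content and hyperlinks out of HTML, here is a sketch that uses Python's built-in `html.parser`. Note that this module walks the tags as events rather than building a full DOM tree, so it is a stand-in for the idea, not the authors' tooling. The `PageScanner` class and the sample page are made up for this example.

```python
from html.parser import HTMLParser

class PageScanner(HTMLParser):
    """Collects hyperlink targets and visible text while walking the HTML."""

    def __init__(self):
        super().__init__()
        self.hyperlinks = []   # href values of <a> tags
        self.text_chunks = []  # non-empty pieces of text

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hyperlinks.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text_chunks.append(data.strip())

# Invented sample page.
SAMPLE_HTML = """
<html><body>
  <h1>Sign in</h1>
  <a href="https://bank.example/help">Help</a>
  <a href="#">Forgot password?</a>
</body></html>
"""

scanner = PageScanner()
scanner.feed(SAMPLE_HTML)
print(scanner.hyperlinks)   # the two href values, in page order
print(scanner.text_chunks)  # the visible text pieces
```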
Vectorization. Before we move on to the second step, let's define vectorization in the context of machine learning. In this study, for example, the authors train their model with textual data from the web pages. And while humans like us might understand the words or images on a webpage, a computer can only work with this data once it is expressed as numbers.
This is why in machine learning, we use vectorization as a way to convert input data from its natural format into numerical data for computers to work with. So let's break down some of these features...
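Here is a small sketch of one common way to vectorize text: counting how often each known word appears (often called bag-of-words). This scheme is an assumption chosen for simplicity; the study's own vectorization may differ, and the helper names and sample sentences are invented.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map every distinct lowercase word to a fixed column index."""
    words = sorted({w for doc in documents for w in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def vectorize(document, vocabulary):
    """Turn one text into a list of word counts, one slot per vocabulary word."""
    counts = Counter(document.lower().split())
    return [counts.get(word, 0) for word in vocabulary]

# Invented sample texts.
pages = ["verify your account now", "welcome to your account"]
vocab = build_vocabulary(pages)

print(vocab)
for page in pages:
    print(vectorize(page, vocab))
```

Each page becomes a list of six numbers, one per vocabulary word, which is a form a classifier can actually work with.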
Total hyperlinks feature. In general, phishing websites contain very few pages compared to real websites. In fact, phishers sometimes create only a login page with no hyperlinks at all. So the authors use the number of hyperlinks on a webpage, taken from its HTML source code, as a feature.
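A minimal sketch of this feature might look like the following. It assumes that "hyperlink" means an `<a>` tag with an `href` attribute, which may be narrower than the definition used in the study; the two sample pages are invented.

```python
from html.parser import HTMLParser

class HyperlinkCounter(HTMLParser):
    """Counts <a> tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.count += 1

def total_hyperlinks(html: str) -> int:
    counter = HyperlinkCounter()
    counter.feed(html)
    return counter.count

# Invented pages: a bare login form versus a page with normal navigation.
login_only = '<form action="/submit"><input name="user"><input name="pw" type="password"></form>'
regular = '<a href="/about">About</a> <a href="/contact">Contact</a> <a href="/help">Help</a>'

print(total_hyperlinks(login_only), total_hyperlinks(regular))
```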
Error in hyperlinks. At times, phishers add broken or dead links to their fake websites. With this feature, the authors check whether the hyperlinks on a website are actually valid.
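One way to sketch a check like this is to request each link and treat error responses as broken. Everything here is an assumption for illustration: the `http_status` helper, the "status 400 or above counts as broken" rule, and the idea of summarizing the result as a ratio. The demonstration uses canned status codes so that it runs without network access.

```python
import urllib.error
import urllib.request

def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a HEAD request, or 0 if the request fails outright."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with an error status
    except (urllib.error.URLError, ValueError, OSError):
        return 0         # DNS failure, refused connection, malformed URL, timeout

def broken_link_ratio(links, status_of=http_status):
    """Fraction of links whose status is 0 or at least 400."""
    if not links:
        return 0.0
    broken = sum(1 for link in links if not 0 < status_of(link) < 400)
    return broken / len(links)

# Offline demonstration: canned statuses stand in for real requests.
fake_statuses = {
    "https://a.example/": 200,
    "https://a.example/old": 404,
    "https://gone.example/": 0,
}
print(broken_link_ratio(list(fake_statuses), status_of=fake_statuses.get))
```

Passing the status function in as a parameter keeps the feature logic testable; in real use you would simply call `broken_link_ratio(links)` and let it make the requests.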
Login form features. On both genuine and phishing websites, login pages are extremely common. But of course, on phishing websites, this page is normally just meant to steal the user's personal information. On a real website's login page, there tends to be a hyperlink whose base domain matches the one in the browser's address bar. On phishing websites, the login page URL likely has a different base domain, or is an empty or invalid link. To identify phishing websites with these features, the authors find the ratio of a web page's suspicious forms to its total forms.
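The ratio described above could be sketched roughly like this. The rules for what counts as "suspicious" are assumptions made for the example: an empty, `#`, or `javascript:` form action, or an action pointing at a different base domain than the page. The `base_domain` helper is deliberately naive (it keeps the last two labels of the hostname), and the sample page is invented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class FormActionCollector(HTMLParser):
    """Records the action attribute of every <form> tag."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.actions.append((dict(attrs).get("action") or "").strip())

def base_domain(url: str) -> str:
    """Naive base domain: the last two labels of the hostname (an assumption)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def is_suspicious(action: str, page_url: str) -> bool:
    if action in ("", "#") or action.lower().startswith("javascript:"):
        return True
    target = urljoin(page_url, action)  # resolve relative actions against the page
    return base_domain(target) != base_domain(page_url)

def suspicious_form_ratio(html: str, page_url: str) -> float:
    collector = FormActionCollector()
    collector.feed(html)
    if not collector.actions:
        return 0.0
    flagged = sum(is_suspicious(action, page_url) for action in collector.actions)
    return flagged / len(collector.actions)

# Invented page with one ordinary form and three questionable ones.
page = """
<form action="/session"></form>
<form action=""></form>
<form action="https://collector.example.net/grab"></form>
<form action="#"></form>
"""
print(suspicious_form_ratio(page, "https://www.shop.example.com/login"))  # prints 0.75
```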
Classification. Finally, the authors try out multiple machine learning classifiers and find that eXtreme Gradient Boosting (XGBoost) outperforms all of its competitors. But why?
XGBoost. This model is a type of ensemble classifier: it builds many decision trees one after another, each new tree correcting the errors of the trees before it, and combines their predictions to classify websites with greater accuracy. Another benefit of XGBoost is that it can handle extremely large datasets that don't fit into memory. Lastly, it can use multiple cores of the computer's CPU to work in parallel, which makes the computing much faster overall.
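To show the core idea of boosting, here is a toy sketch in plain Python. It is not XGBoost itself: XGBoost adds regularization, second-order gradient information, and the parallel and out-of-memory machinery mentioned above. This version only uses one-split trees ("stumps") that are each fitted to the errors left by the previous ones. The two features (hyperlink count and suspicious-form ratio) and all of the data are invented for the example.

```python
def fit_stump(rows, residuals):
    """Pick the single split that best fits the residuals with two constants."""
    mean = sum(residuals) / len(residuals)
    best_error = sum((r - mean) ** 2 for r in residuals)
    best = (0, float("inf"), mean, mean)  # fallback: no split, predict the mean
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [r for row, r in zip(rows, residuals) if row[feature] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[feature] > threshold]
            if not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if error < best_error:
                best_error = error
                best = (feature, threshold, left_mean, right_mean)
    return best

def predict_stump(stump, row):
    feature, threshold, left_value, right_value = stump
    return left_value if row[feature] <= threshold else right_value

def fit_boosted(rows, labels, rounds=20, learning_rate=0.5):
    """Each round fits a stump to what the current ensemble still gets wrong."""
    base = sum(labels) / len(labels)
    predictions = [base] * len(rows)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(labels, predictions)]
        stump = fit_stump(rows, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, row)
                       for p, row in zip(predictions, rows)]
    return base, stumps, learning_rate

def predict(model, row):
    """Return 1 (phishing) if the combined score reaches 0.5, else 0."""
    base, stumps, learning_rate = model
    score = base + sum(learning_rate * predict_stump(s, row) for s in stumps)
    return 1 if score >= 0.5 else 0

# Invented training data: [total hyperlinks, suspicious form ratio], 1 = phishing.
X = [[0, 1.0], [1, 1.0], [2, 0.5], [40, 0.0], [55, 0.0], [30, 0.5], [3, 1.0], [60, 0.0]]
y = [1, 1, 1, 0, 0, 0, 1, 0]

model = fit_boosted(X, y)
print([predict(model, row) for row in X])
print(predict(model, [1, 1.0]), predict(model, [45, 0.0]))
```

With a real dataset you would reach for the actual XGBoost library instead; the point here is only to show how many weak learners add up to one stronger prediction.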
Sadly, it's true that new phishing websites are being made at an increasing rate. But with the power of machine learning, users have a much better chance of keeping their private information, you know, private.
References:
[1] A. Aljofey, Q. Jiang, A. Rasool, H. Chen, W. Liu, Q. Qu, Y. Wang, An effective detection approach for phishing websites using URL and HTML features (2022), Scientific Reports
[2] A. Bensaoud, J. Kalita, M. Bensaoud, A survey of malware detection using deep learning (2024), Machine Learning with Applications on ScienceDirect
[3] A. Jha, Vectorization Techniques in NLP [Guide] (2023), Neptune.ai
[4] Source Code and Object Code, Office of Research at University of Washington