How Machine Learning Detects Phishing Attacks

Last Updated on April 22, 2024 by Editorial Team

Author(s): Eera Bhatt

Originally published on Towards AI.

Especially since the COVID-19 pandemic, we have relied on the Internet for so many of our daily services like banking, entertainment, and social networking.

Sadly, cyber attackers take advantage of this by trying out more phishing attacks on unsuspecting users. (Nope, not that kind of fishing.)

Photo by Adi Goldstein on Unsplash

But what is phishing? Phishing is when a cyber attacker tries to extract personal information, like credit card numbers and passwords, from a user by getting them to click on a deceptive link. In most cases, the attacker sends the user this link through email or a similar messaging platform. Often, this malicious link is built to resemble that of a legitimate website as a way to trick the user into clicking it. So it can be very difficult for a user to tell the difference between a secure website and a fake one.

Once personal information is gained like this, it’s easy for the attacker to spread malware (harmful software) such as ransomware, which locks a user’s device or files until they make a payment to the attacker. So how do we combat this?

Blacklists. The most common automatic defense against phishing is a blacklist of suspicious URLs. For instance, search engines like Bing and Google might maintain blacklists of harmful phishing URLs to protect their users from clicking on these links. Blacklists tend to be updated frequently, either by cybersecurity experts or by users who encounter attacks.

Just to clarify, a website is blacklisted only when some type of malware is confirmed to be present. Once a URL is identified as harmful for certain, the search engine removes it.

But there’s a problem with this: blacklists might not be able to identify phishing websites that are newly developed. In fact, almost 1.5 million new phishing websites are created every month! What we need is a new way to identify phishing URLs that haven’t been revealed yet.
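To make the limitation concrete, here is a minimal sketch of a blacklist lookup. The host names and the `is_blacklisted` helper are made up for illustration; real blacklists are far larger and maintained by dedicated services.

```python
from urllib.parse import urlparse

# Hypothetical blacklist entries, invented for this example.
BLACKLISTED_HOSTS = {"login-paypa1.example", "secure-bank-update.example"}


def is_blacklisted(url: str) -> bool:
    """Return True if the URL's host appears on the blacklist."""
    host = urlparse(url).hostname or ""
    return host.lower() in BLACKLISTED_HOSTS


known = "http://login-paypa1.example/signin"
brand_new = "http://login-paypa2.example/signin"

print(is_blacklisted(known))      # a listed host is caught
print(is_blacklisted(brand_new))  # an unlisted host slips through
```

The second URL is just as suspicious as the first, but a lookup can only catch what has already been reported.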

In a study [1], the authors create a machine learning approach to solve this by extracting features (characteristics) of suspicious webpages to detect new phishing offenses. They propose eight different features that involve the relationship between a webpage’s URL and its content.

The authors divide their machine learning model into three steps:

  • Extract features and download webpage content.
  • Apply feature vectorization (don’t worry, we’ll get to that).
  • Determine whether the webpage is for phishing.

Let’s go through each of these phases a bit more closely.

Feature generation. In this study, the authors generate features based on the webpage’s URL as well as its HTML source code. Source code is like a set of instructions that programmers write to describe the webpage. This text is meant to be readable by humans.

Document Object Model Tree. The authors use a Document Object Model (DOM) tree to access all the HTML elements of the webpage, including its content and hyperlinks.
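As a rough illustration of what walking a DOM tree looks like, the sketch below uses Python's built-in `xml.dom.minidom`. This is not the authors' tooling: `minidom` only accepts well-formed markup, so the tiny `PAGE` string and the `extract_hyperlinks` helper are invented to keep the example self-contained.

```python
from xml.dom import minidom

# A tiny, well-formed page invented for this example.
PAGE = """<html>
  <body>
    <h1>Sign in</h1>
    <a href="https://example.com/help">Help</a>
    <a href="#">Forgot password</a>
  </body>
</html>"""


def extract_hyperlinks(markup: str) -> list[str]:
    """Parse the markup into a DOM tree and return every <a> element's href."""
    dom = minidom.parseString(markup)
    return [a.getAttribute("href") for a in dom.getElementsByTagName("a")]


links = extract_hyperlinks(PAGE)
print(links)
```

Real webpages are rarely this tidy, so a production system would likely need a more forgiving HTML parser.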

Vectorization. Before we move on to the second step, let’s define vectorization in the context of machine learning. In this study, for example, the authors train their model on textual data from the webpages. And while humans like us might understand the words or images on a webpage, a computer can only work with this data once it is represented as numbers.

This is why in machine learning, we use vectorization as a way to convert input data from its natural format into numerical data for computers to work with. So let’s break down some of these features…
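One common way to vectorize text is a bag-of-words count, sketched below in plain Python. This is only an example of the general idea and may not be the technique the authors used; the helper names and sample sentences are made up.

```python
from collections import Counter


def build_vocabulary(documents: list[str]) -> list[str]:
    """Collect the sorted set of lowercase words across all documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def vectorize(document: str, vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


docs = ["verify your account now", "your account your rules"]
vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]

print(vocab)
for vector in vectors:
    print(vector)
```

Each sentence becomes a list of numbers with the same length as the vocabulary, which is the kind of input a classifier can work with.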

Total hyperlinks feature. In general, phishing websites contain very few pages compared to what real websites have. Actually, there are times when phishers only create a login page with no hyperlinks at all. So the authors define this feature as the number of hyperlinks present on a webpage, taken from the HTML source code.
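Here is a minimal sketch of how such a count could be computed with Python's built-in `html.parser`. It assumes that a "hyperlink" means an `<a>` tag with an `href` attribute, which may be narrower than the definition used in the study; the class and function names are invented.

```python
from html.parser import HTMLParser


class HyperlinkCounter(HTMLParser):
    """Count <a> tags that carry an href attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.count += 1


def total_hyperlinks(html: str) -> int:
    """Return the number of hyperlinks found in the HTML source."""
    parser = HyperlinkCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Two invented pages: a bare login form and a page with ordinary navigation.
login_only = "<form action=''><input name='user'></form>"
normal_page = "<a href='/about'>About</a> <a href='/contact'>Contact</a> <a>no link</a>"

print(total_hyperlinks(login_only))
print(total_hyperlinks(normal_page))
```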

Error in hyperlinks. At times, phishers add broken or dead links to their fake websites. Using this feature, the authors check whether the hyperlinks on a website are actually valid.
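A link check like this could be sketched as follows. The study does not spell out its exact procedure here, so treat this as one plausible reading: a link counts as broken when a HEAD request to it fails. The helper names are invented, and the demo swaps in a fake checker so that it runs without touching the network.

```python
import http.client
import urllib.request
from typing import Callable, Iterable


def is_link_alive(url: str, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request to the URL succeeds."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except (OSError, ValueError, http.client.HTTPException):
        # Network and HTTP errors are OSError subclasses; a malformed URL raises ValueError.
        return False


def broken_link_ratio(
    links: Iterable[str],
    checker: Callable[[str], bool] = is_link_alive,
) -> float:
    """Return the share of links that the checker reports as broken."""
    links = list(links)
    if not links:
        return 0.0
    broken = sum(1 for link in links if not checker(link))
    return broken / len(links)


def fake_checker(url: str) -> bool:
    """Stand-in for a real request: pretend URLs ending in /dead are broken."""
    return not url.endswith("/dead")


sample_links = [
    "https://a.example/",
    "https://b.example/dead",
    "https://c.example/dead",
]
ratio = broken_link_ratio(sample_links, checker=fake_checker)
print(round(ratio, 2))
```

Some servers reject HEAD requests, so a real checker might fall back to GET before declaring a link dead.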

Login form features. On both genuine and phishing websites, login pages are extremely common. But of course, on phishing websites, this page is normally just meant to steal the user’s personal information. On a real website’s login page, there tends to be a hyperlink whose base domain matches the one in the browser address bar. On a phishing website, the login page URL likely has a different base domain, or it is an empty or invalid link. To identify phishing websites with these features, the authors find the ratio of a webpage’s suspicious forms to its total forms.
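The ratio could be sketched like this. The rules for what makes a form "suspicious" are my reading of the description above (an empty or placeholder `action`, or one pointing to a different base domain), and `base_domain` is a deliberately naive helper that keeps only the last two labels of the host name, so it mishandles suffixes like `.co.uk`. All names and the sample page are invented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


def base_domain(url: str) -> str:
    """Naive base domain: the last two labels of the host name (an assumption)."""
    host = urlparse(url).hostname or ""
    return ".".join(host.lower().split(".")[-2:])


class FormActionCollector(HTMLParser):
    """Collect the action attribute of every <form> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.actions: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.actions.append((dict(attrs).get("action") or "").strip())


def is_suspicious(action: str, page_url: str) -> bool:
    """Flag empty or placeholder actions and actions on another base domain."""
    if action in ("", "#") or action.lower().startswith("javascript:"):
        return True
    target = urljoin(page_url, action)  # resolve relative actions against the page
    return base_domain(target) != base_domain(page_url)


def suspicious_form_ratio(html: str, page_url: str) -> float:
    """Return suspicious forms divided by total forms (0.0 if there are none)."""
    collector = FormActionCollector()
    collector.feed(html)
    collector.close()
    if not collector.actions:
        return 0.0
    suspicious = sum(is_suspicious(a, page_url) for a in collector.actions)
    return suspicious / len(collector.actions)


page_url = "https://shop.example.com/login"
html = """
<form action="/session"></form>
<form action=""></form>
<form action="https://collector.example.net/steal"></form>
"""
form_ratio = suspicious_form_ratio(html, page_url)
print(round(form_ratio, 2))
```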

Classification. Finally, the authors train their model with several machine learning classifiers and find that eXtreme Gradient Boosting (XGBoost) outperforms all of its competitors. But why?

XGBoost. XGBoost is an ensemble classifier, which means that it combines the predictions of many smaller models (in its case, decision trees built one after another) to classify the websites with greater accuracy. Another benefit of XGBoost is that it can handle extremely large datasets that don’t fit into memory. Lastly, this classifier can use multiple cores on the computer’s CPU in parallel, which makes the computing much faster overall.
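To show the "combine many small models" idea without any extra libraries, here is a toy gradient boosting sketch in plain Python. It is not XGBoost: it uses a single feature, one-split trees (stumps), and squared error, and it skips the regularization and parallelism that XGBoost adds. The data (hyperlink counts labeled 1 for phishing) is invented.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right) if right else 0.0
        error = sum(
            (r - (left_value if x <= threshold else right_value)) ** 2
            for x, r in zip(xs, residuals)
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1], best[2], best[3]


def stump_predict(stump, x):
    """Return the stump's output for a single feature value."""
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted(xs, ys, rounds=20, learning_rate=0.5):
    """Fit stumps one after another, each on the errors left by the previous ones."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * stump_predict(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base score and all stump outputs, then threshold at 0.5."""
    base, stumps, learning_rate = model
    score = base + learning_rate * sum(stump_predict(s, x) for s in stumps)
    return 1 if score >= 0.5 else 0


# Invented data: number of hyperlinks on a page, and 1 = phishing, 0 = legitimate.
hyperlink_counts = [0, 1, 2, 30, 45, 60]
labels = [1, 1, 1, 0, 0, 0]

model = fit_boosted(hyperlink_counts, labels)
print(predict(model, 1))
print(predict(model, 50))
```

Each stump on its own is a weak model, but adding them up step by step gives a much better fit, which is the core idea behind boosting.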

Sadly, it’s true that new phishing websites can be made at an increasing rate. But with the power of machine learning, nothing can stop users from keeping their private information, you know, private.


[1] A. Aljofey, Q. Jiang, A. Rasool, H. Chen, W. Liu, Q. Qu, Y. Wang, An effective detection approach for phishing websites using URL and HTML features (2022), Scientific Reports

[2] A. Bensaoud, J. Kalita, M. Bensaoud, A survey of malware detection using deep learning (2024), Machine Learning with Applications on ScienceDirect

[3] A. Jha, Vectorization Techniques in NLP [Guide] (2023)

[4] Source Code and Object Code, Office of Research at University of Washington
