How Machine Learning Detects Phishing Attacks

Last Updated on April 22, 2024 by Editorial Team

Author(s): Eera Bhatt

Originally published on Towards AI.

Especially since the COVID-19 pandemic, we have relied on the Internet for many of our daily services, like banking, entertainment, and social networking.

Sadly, cyber attackers take advantage of this by launching more phishing attacks on unsuspecting users. (Nope, not that kind of fishing.)

But what is phishing? Phishing is when a cyber attacker tries to obtain a user’s personal information, like credit card numbers and passwords, by getting them to click on a deceptive link. In most cases, the attacker sends this link through email or a similar messaging platform. Often, the link leads to a page built to resemble a legitimate website, which tricks the user into trusting it. So it can be very difficult for a user to tell the difference between a secure website and a fake one.

Once personal information is obtained like this, it’s easier for the attacker to spread malware (harmful software), such as ransomware, which locks a user’s device or files until they make a payment to the attacker. So how do we combat this?

Blacklists. The most common automatic defense against phishing is a blacklist of suspicious URLs. For instance, search engines like Bing and Google may keep blacklists of harmful phishing URLs to protect their users from clicking on these links. Blacklists tend to be updated frequently, either by cybersecurity experts or by users who encounter attacks.

Just to clarify, a website is blacklisted only once it has been confirmed as harmful, for example because some type of malware is definitely present. When a URL is identified as harmful, the search engine removes it from its results.

But there’s a problem with this: blacklists might not be able to identify phishing websites that are newly developed. In fact, almost 1.5 million new phishing websites are created every month! What we need is a new way to identify phishing URLs that haven’t been revealed yet.
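To see why a blacklist misses new sites, here is a minimal sketch of a blacklist lookup. The host names and the `is_blacklisted` helper are invented for illustration; real blacklists are far larger and are usually queried through a service rather than a hard-coded set.

```python
from urllib.parse import urlparse

# Hypothetical blacklist entries, invented for this example.
BLACKLISTED_HOSTS = {"login-secure-bank.example", "free-prizes.example"}

def is_blacklisted(url: str) -> bool:
    """Return True if the URL's host name appears in the blacklist."""
    host = urlparse(url).hostname or ""
    return host.lower() in BLACKLISTED_HOSTS

known = "http://login-secure-bank.example/verify"
brand_new = "http://login-secure-bank2.example/verify"  # registered after the last update

print(is_blacklisted(known))      # True
print(is_blacklisted(brand_new))  # False: the lookup only knows what was already reported
```

The second URL is just as deceptive as the first, but an exact-match lookup cannot flag it until someone reports it.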

In a 2022 study [1], the authors propose a machine learning approach to this problem: they extract features (characteristics) of suspicious webpages in order to detect new phishing offenses. They propose eight different features that involve the relationship between a webpage’s URL and its content.

The authors divide their machine learning model into three steps:

1. Extract features and download webpage content.
2. Apply feature vectorization (don’t worry, we’ll get to that).
3. Determine whether the webpage is for phishing.

Let’s go through each of these steps a bit more closely.

Feature generation. In this study, the authors generate features based on the webpage’s URL as well as its HTML source code. Source code is the set of instructions that programmers write in a form humans can read. For a webpage, the HTML source describes the page’s structure and content, and the browser turns it into what we see on screen.

Document Object Model tree. The authors use a DOM tree, which represents the page as a hierarchy of nested HTML elements, so that they can access every part of the webpage, including its content and hyperlinks.
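As a rough illustration of walking a page’s element hierarchy, the sketch below uses Python’s standard `html.parser` module to record each tag with its nesting depth and to collect hyperlink targets. The `DomOutline` class and the sample HTML are assumptions made for this example, not the tooling used in the study.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not increase the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class DomOutline(HTMLParser):
    """Records each element with its nesting depth, plus hyperlink targets."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []  # (depth, tag) pairs in document order
        self.links = []  # href values found on <a> tags

    def handle_starttag(self, tag, attrs):
        self.nodes.append((self.depth, tag))
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

SAMPLE_HTML = """
<html><body>
  <h1>Sign in</h1>
  <form action="https://example.com/login"><input name="user"></form>
  <a href="https://example.com/help">Help</a>
</body></html>
"""

outline = DomOutline()
outline.feed(SAMPLE_HTML)
for depth, tag in outline.nodes:
    print("  " * depth + tag)  # indented outline of the tree
print(outline.links)           # ['https://example.com/help']
```

A production parser would handle malformed markup more carefully, but the idea is the same: once the page is a tree, forms and links can be found by tag name.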

Vectorization. Before we move on to the second step, let’s define vectorization in the context of machine learning. In this study, for example, the authors train their model on textual data from web pages. And while humans like us can understand the words or images on a webpage, a machine learning model can only work with this data once it is expressed as numbers.

This is why in machine learning, we use vectorization as a way to convert input data from its natural format into numerical data for computers to work with. So let’s break down some of these features…
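Before turning to the individual features, here is a minimal sketch of what vectorization can look like. It uses a simple bag-of-words count, which is an assumption chosen for clarity; the study’s own vectorization scheme may differ, and the function names are made up for this example.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map every distinct lowercase word to a column index (sorted for stability)."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def vectorize(document, vocabulary):
    """Turn one document into a list of word counts, one slot per vocabulary word."""
    counts = Counter(document.lower().split())
    return [counts.get(word, 0) for word in vocabulary]

docs = ["verify your account now", "your account summary"]
vocab = build_vocabulary(docs)

print(list(vocab))               # ['account', 'now', 'summary', 'verify', 'your']
print(vectorize(docs[0], vocab)) # [1, 1, 0, 1, 1]
print(vectorize(docs[1], vocab)) # [1, 0, 1, 0, 1]
```

Each text becomes a fixed-length list of numbers, which is the form a classifier expects.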

Total hyperlinks feature. In general, phishing websites contain very few pages compared to real websites. In fact, phishers sometimes create only a login page with no hyperlinks at all. So the authors use the number of hyperlinks on a webpage, counted from its HTML source code, as a feature.
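A hedged sketch of this feature: count the links in the HTML source. Here only `<a>` tags with an `href` attribute are counted, which is a simplifying assumption; the study may count other link-bearing tags too. `HyperlinkCounter` and `total_hyperlinks` are hypothetical names.

```python
from html.parser import HTMLParser

class HyperlinkCounter(HTMLParser):
    """Counts <a> tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.total = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.total += 1

def total_hyperlinks(html: str) -> int:
    """Return the number of hyperlinks found in an HTML string."""
    counter = HyperlinkCounter()
    counter.feed(html)
    return counter.total

login_only = '<form action="steal.php"><input name="password"></form>'
normal_page = '<a href="/about">About</a> <a href="/help">Help</a> <a>no target</a>'

print(total_hyperlinks(login_only))   # 0
print(total_hyperlinks(normal_page))  # 2
```

A page with a login form and zero hyperlinks is not proof of phishing, but it is the kind of signal a classifier can weigh alongside the other features.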

Error in hyperlinks. At times, phishers fill their fake websites with broken or dead links that don’t actually work. With this feature, the authors check whether the hyperlinks on a website are really valid.
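One way to turn this check into a number is the share of links that fail to load. The sketch below assumes that framing and takes the reachability check as a parameter, so it runs without network access; `broken_link_ratio` and `fake_is_reachable` are hypothetical, and a real check would issue an HTTP request for each link.

```python
from typing import Callable, Iterable

def broken_link_ratio(links: Iterable[str], is_reachable: Callable[[str], bool]) -> float:
    """Return the fraction of links for which is_reachable(link) is False."""
    links = list(links)
    if not links:
        return 0.0  # no links at all: nothing can be broken
    broken = sum(1 for link in links if not is_reachable(link))
    return broken / len(links)

# Stand-in for a real HTTP check: only these URLs "load" in this example.
REACHABLE = {"https://example.com/", "https://example.com/help"}

def fake_is_reachable(url: str) -> bool:
    return url in REACHABLE

links = [
    "https://example.com/",
    "https://example.com/help",
    "https://example.com/missing",
    "https://example.com/gone",
]
print(broken_link_ratio(links, fake_is_reachable))  # 0.5
```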

Login form features. Login pages are extremely common on both genuine and phishing websites. But of course, on a phishing website, this page is normally just meant to steal the user’s personal information. On a real website, the login form tends to point to a link whose base domain matches the one in the browser’s address bar. On a phishing website, the login form is more likely to point to a different base domain, or to an empty or invalid link. To identify phishing websites with these features, the authors compute the ratio of a webpage’s suspicious forms to its total forms.
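The ratio described above can be sketched as follows. The rules for what counts as “suspicious” (an empty target, a `#` or `javascript:` target, or a different base domain) are assumptions for illustration, and `base_domain` naively keeps the last two labels of the host name, which is wrong for suffixes such as `.co.uk`.

```python
from urllib.parse import urljoin, urlparse

def base_domain(url: str) -> str:
    """Naive base domain: the last two labels of the host name."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def is_suspicious_form(action: str, page_url: str) -> bool:
    """Flag a form whose target is empty, a placeholder, or on another base domain."""
    action = action.strip()
    if action in ("", "#") or action.lower().startswith("javascript:"):
        return True
    target = urljoin(page_url, action)  # resolve relative targets against the page
    return base_domain(target) != base_domain(page_url)

def suspicious_form_ratio(actions, page_url: str) -> float:
    """Return suspicious forms divided by total forms (0.0 if the page has no forms)."""
    if not actions:
        return 0.0
    flagged = sum(is_suspicious_form(action, page_url) for action in actions)
    return flagged / len(actions)

page = "https://www.mybank.example/login"
actions = [
    "/session",                         # same site, relative target
    "https://auth.mybank.example/sso",  # same base domain
    "",                                 # empty target
    "http://collector.test/grab.php",   # different base domain
]
print(suspicious_form_ratio(actions, page))  # 0.5
```

Note that a relative target such as `steal.php` resolves to the page’s own domain and is not flagged here, so this feature works best in combination with the others.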

Classification. Finally, the authors train several machine learning classifiers and find that eXtreme Gradient Boosting (XGBoost) outperforms the others. But why?

XGBoost. XGBoost is a type of ensemble classifier, which means that it combines the predictions of many smaller models (decision trees, each trained to correct the errors of the ones before it) to classify websites with greater accuracy. Another benefit of XGBoost is that it can handle extremely large datasets that don’t fit into memory. Lastly, it can use multiple CPU cores to work on several tasks in parallel, which makes training much faster overall.
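To make the ensemble idea concrete, here is a toy gradient-boosting sketch in plain Python. It is not XGBoost itself: it leaves out the regularization, second-order updates, and parallelism that library provides, and it uses one-split trees (“stumps”). The feature values and labels are invented; in practice one would more likely reach for the `xgboost` package’s `XGBClassifier`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(rows, residuals):
    """Find the one-feature threshold split that best fits the residuals."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [r for row, r in zip(rows, residuals) if row[feature] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[feature] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, feature, threshold, left_mean, right_mean)
    return best[1:]  # (feature, threshold, left_value, right_value)

def stump_predict(stump, row):
    feature, threshold, left_value, right_value = stump
    return left_value if row[feature] <= threshold else right_value

def fit_boosted_stumps(rows, labels, rounds=20, learning_rate=0.5):
    """Each round fits a stump to the errors left by the ensemble so far."""
    stumps, scores = [], [0.0] * len(rows)
    for _ in range(rounds):
        residuals = [y - sigmoid(s) for y, s in zip(labels, scores)]
        stump = fit_stump(rows, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, row)
                  for s, row in zip(scores, rows)]
    return stumps

def predict_proba(stumps, row, learning_rate=0.5):
    """Combine all stumps into one phishing probability."""
    return sigmoid(sum(learning_rate * stump_predict(s, row) for s in stumps))

# Invented toy data: [total_hyperlinks, suspicious_form_ratio], label 1 = phishing.
rows = [[0, 1.0], [1, 1.0], [2, 0.5], [25, 0.0], [40, 0.0], [60, 0.5]]
labels = [1, 1, 1, 0, 0, 0]

stumps = fit_boosted_stumps(rows, labels)
print(predict_proba(stumps, [1, 1.0]))   # high probability of phishing
print(predict_proba(stumps, [50, 0.0]))  # low probability of phishing
```

No single stump is a strong classifier, but the sum of many small corrections pushes the probabilities toward the right answer, which is the core idea behind boosted ensembles.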

Sadly, it’s true that new phishing websites are being created at an increasing rate. But with the power of machine learning, users have a much better chance of keeping their private information, you know, private.

References:

[1] A. Aljofey, Q. Jiang, A. Rasool, H. Chen, W. Liu, Q. Qu, Y. Wang, An effective detection approach for phishing websites using URL and HTML features (2022), Scientific Reports

[2] A. Bensaoud, J. Kalita, M. Bensaoud, A survey of malware detection using deep learning (2024), Machine Learning with Applications on ScienceDirect

[3] A. Jha, Vectorization Techniques in NLP [Guide] (2023), Neptune.ai

[4] Source Code and Object Code, Office of Research at University of Washington
