Senior Synergy

Introduction

Phishing URLs pose a serious threat to individuals in Thailand, and the risk is expected to grow considerably. Existing phishing link detection tools focus primarily on technical aspects, and they do little to address users' need for information beyond raw numbers and data.

In response to the increasing threat of phishing attacks and the need for user-centric cybersecurity solutions, this project aims to develop a phishing detection tool powered by interpretable machine learning models.

URL components

A Uniform Resource Locator (URL) is designed to locate web pages. The diagram below outlines the structure of a common URL and its key components.

A phisher has complete control over the subdomain portion of a URL and can assign any value to it. The URL may also include path and file elements, which are likewise entirely under the phisher's control. In the rest of this article, we refer to these portions of the URL as FreeURL.
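
The split between the attacker-controlled parts and the registered domain can be illustrated with Python's standard `urllib.parse` module. This is a minimal sketch, not the project's code: the function name `split_url` and the `registered_labels` parameter are hypothetical, and the rule "the last N labels form the registered domain" is a simplifying assumption (for a domain such as `example.co.th`, N would need to be 3).

```python
from urllib.parse import urlsplit


def split_url(url, registered_labels=2):
    """Split a URL into the parts discussed above.

    Assumption: the last `registered_labels` labels of the hostname form
    the registered domain; everything before them is the subdomain.
    """
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    return {
        "scheme": parts.scheme,
        # Attacker-controlled: any labels in front of the registered domain.
        "subdomain": ".".join(labels[:-registered_labels]),
        "registered_domain": ".".join(labels[-registered_labels:]),
        # Attacker-controlled: path, file name and query string.
        "path": parts.path,
        "query": parts.query,
    }


if __name__ == "__main__":
    print(split_url("http://login.bank.example.com/secure/update.php?id=1"))
```

In this sketch, the `subdomain`, `path` and `query` entries together correspond to what the article calls FreeURL.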

Machine Learning Techniques

Algorithm: Random Forest

The Random Forest algorithm can be utilized to enhance accuracy and mitigate overfitting in URL classification. The algorithm combines the outputs of multiple decision trees to categorize URLs. By using random subsets of features, each decision tree focuses on different combinations of URL features. The diverse set of decision trees creates a robust ensemble model that effectively analyzes and classifies URLs. A final decision is made through majority voting based on the predictions from each tree, resulting in more reliable and precise classifications compared to a single decision tree model.
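
The ideas of bootstrap sampling, random feature choice and majority voting can be sketched in a few lines of plain Python. This is a deliberately tiny illustration, not the model used in the project: every tree is a depth-1 "stump" on a single randomly chosen feature, and all names (`train_stump`, `train_forest`, `predict`) and the toy data are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Fit a depth-1 tree on one feature: pick the threshold and the label
    predicted above it that give the fewest training errors."""
    values = sorted(set(row[feature] for row in rows))
    # Candidate thresholds: midpoints between adjacent distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]
    best = None
    for threshold in candidates:
        for above in (0, 1):  # label predicted when value > threshold
            errors = sum(
                (above if row[feature] > threshold else 1 - above) != y
                for row, y in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, above)
    return feature, best[1], best[2]


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest


def predict(forest, row):
    """Majority vote over the predictions of all stumps."""
    votes = Counter(
        (above if row[f] > t else 1 - above) for f, t, above in forest
    )
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data: two numeric features per URL, label 1 = phishing.
    rows = [[i, 2 * i] for i in range(10)] + [[100 + i, 200 + 2 * i] for i in range(10)]
    labels = [0] * 10 + [1] * 10
    forest = train_forest(rows, labels)
    print(predict(forest, [5, 10]), predict(forest, [105, 210]))
```

A full Random Forest grows deeper trees and draws a fresh random feature subset at every split; the sketch keeps only the parts needed to show how diverse trees are combined by voting.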

Feature selection method: Particle Swarm Optimization

The purpose of feature selection is to choose a subset of relevant features from a large pool of available features so that classification performance is similar to, or even better than, that obtained with all features. By eliminating irrelevant and redundant features, feature selection can reduce the number of features, shorten the training time, simplify the learned classifiers, and improve classification performance.

Particle Swarm Optimization (PSO) is a population-based technique. It is used for feature selection in this project because it offers a good solution representation, can search large spaces, is computationally inexpensive, is easy to implement, and requires few parameters. PSO simulates social behavior such as bird flocking and fish schooling. A population of candidate solutions, called a swarm, is encoded as particles in the search space. PSO starts by randomly initialising the particles. The particles then move through the search space looking for the optimal solution, each updating its position based on its own experience and that of the other particles in the swarm.
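
The update loop can be sketched as a binary PSO, in which each particle is a 0/1 mask over the features. The source does not state which PSO variant or fitness function the project used, so everything here is an assumption for illustration: the function `binary_pso`, its parameter values, and the toy fitness function. In a real run the fitness would typically be the validation accuracy of a classifier trained on the selected features.

```python
import math
import random


def binary_pso(fitness, n_features, n_particles=10, n_iters=30, seed=0,
               w=0.7, c1=1.5, c2=1.5, v_max=6.0):
    """Search for a 0/1 feature mask that maximises `fitness`."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best mask seen by each particle
    pbest_score = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_score[i])
    gbest, gbest_score = pbest[g][:], pbest_score[g]  # best mask seen by the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])   # own experience
                     + c2 * r2 * (gbest[d] - pos[i][d]))     # swarm experience
                vel[i][d] = max(-v_max, min(v_max, v))       # clamp velocity
                # The sigmoid of the velocity is the probability of the bit being 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            score = fitness(pos[i])
            if score > pbest_score[i]:
                pbest[i], pbest_score[i] = pos[i][:], score
                if score > gbest_score:
                    gbest, gbest_score = pos[i][:], score
    return gbest, gbest_score


def toy_fitness(mask):
    """Hypothetical stand-in for classifier accuracy: features 0, 2 and 5
    are 'useful', and every selected feature carries a small cost."""
    useful = (0, 2, 5)
    return sum(mask[i] for i in useful) - 0.1 * sum(mask)


if __name__ == "__main__":
    best_mask, best_score = binary_pso(toy_fitness, n_features=8)
    print(best_mask, best_score)
```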

After feature selection with PSO, 19 features were selected, giving the highest accuracy of 96.12%. The following are the features used for the model.

Model Results

In this study, we developed a phishing URL detection model and evaluated its performance across key metrics. The model's overall accuracy was 95.07%.

Precision, Recall, and F1-Score

To further assess the model's reliability, we considered additional performance metrics:

Precision: The model achieved a precision of 96.00%, meaning that of all the URLs flagged as phishing, 96% were actual phishing URLs. High precision is essential in this context because it keeps false positives low, so that few legitimate URLs are misclassified as phishing.
Recall: The recall was 94.05%, meaning that the model detected 94.05% of the actual phishing URLs. High recall means that few phishing attacks are missed, which is critical for minimizing the risk of undetected security threats.
F1-Score: The F1-score, which balances precision and recall, was 95.07%, demonstrating that the model maintains a strong balance between minimizing false positives and false negatives. This metric highlights the overall reliability of the model in a real-world application where both accuracy and coverage are important.
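
For reference, the definitions behind these metrics can be written out directly from the confusion-matrix counts. The sketch below uses the standard formulas; the function name and the example counts are hypothetical and are not the project's actual results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four metrics from confusion-matrix counts.

    tp: phishing URLs flagged as phishing
    fp: legitimate URLs flagged as phishing
    fn: phishing URLs that were missed
    tn: legitimate URLs correctly passed
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(classification_metrics(tp=90, fp=10, fn=30, tn=70))
```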

Metric	Value
Accuracy	96.13%
Precision	96.28%
Recall	95.97%
F1-Score	96.13%

Information About Each Feature

Name	Type	Explanation
domainlength	Address Bar based	Count the characters in the hostname string.
www	Address Bar based	If the URL has 'www' as the subdomain, then return 0; otherwise, return 1.
subdomain	Address Bar based	If the URL has more than 1 subdomain then return 1, else 0.
https	Address Bar based	If the URL contains 'https', then return 0; otherwise, return 1.
short_url	Address Bar based	If the URL is a short URL, return 1; otherwise, return 0.
@	Address Bar based	Count the '@' characters in the URL.
-	Address Bar based	Count the '-' characters in the URL.
=	Address Bar based	Count the '=' characters in the URL.
.	Address Bar based	Count the '.' characters in the URL's hostname.
_	Address Bar based	Count the '_' characters in the URL.
/	Address Bar based	Count the '/' characters in the URL.
digit	Address Bar based	Count the digit (0-9) characters in the URL.
log	Address Bar based	If the URL contains the word 'log', return 0; otherwise, return 1.
pay	Address Bar based	If the URL contains the word 'pay', return 0; otherwise, return 1.
web	Address Bar based	If the URL contains the word 'web', return 0; otherwise, return 1.
account	Address Bar based	If the URL contains the word 'account', return 0; otherwise, return 1.
pcemptylinks	HTML/DOM Structure based	Percentage of empty links. An empty link does not lead to a different page.
pcextlinks	HTML/DOM Structure based	Percentage of external links that direct you to another site with a different domain.
pcrequrl	HTML/DOM Structure based	Percentage of external resource URLs hosted on a different domain.
zerolink	HTML/DOM Structure based	If the URL page has no links in the HTML body, return 1; otherwise, return 0.
extfavicon	HTML/DOM Structure based	If the favicon URL is from a different domain than the submitted URL, return 1; otherwise, return 0.
submit2Email	HTML/DOM Structure based	If the HTML page matches the pattern "\b(mail()|mailto:?)\b", return 1; otherwise, return 0.
sfh	HTML/DOM Structure based	If the server form handler (SFH) is an empty string or points to a different domain than the submitted URL, return 1; otherwise, return 0.
redirection	Abnormal Based	If clicking the submitted URL results in a redirection to another URL, return 1; otherwise, return 0.
domainage	Domain Based	If the domain age is less than 6 months, return 1; otherwise, return 0.
domainend	Domain Based	If the difference in days between the current date and expiration date is less than or equal to one year, return 1; otherwise, return 0.
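
Several of the address-bar features in the table can be computed from the URL string alone. The sketch below follows the return conventions stated in the table (for example, 0 when 'www' or a keyword is present). It is an illustration under assumptions, not the project's extraction code: the function name is hypothetical, "contains" is implemented as a plain substring test, and the `subdomain` and `short_url` features are omitted because they need extra data such as a list of URL-shortening services.

```python
from urllib.parse import urlsplit


def address_bar_features(url):
    """Compute a subset of the address-bar features described in the table."""
    hostname = urlsplit(url).hostname or ""
    features = {
        "domainlength": len(hostname),
        # 0 if the hostname starts with the 'www' subdomain, else 1.
        "www": 0 if hostname.startswith("www.") else 1,
        # 0 if the URL contains 'https', else 1 (literal reading of the table).
        "https": 0 if "https" in url else 1,
        "@": url.count("@"),
        "-": url.count("-"),
        "=": url.count("="),
        ".": hostname.count("."),  # counted in the hostname only
        "_": url.count("_"),
        "/": url.count("/"),
        "digit": sum(ch.isdigit() for ch in url),
    }
    # Keyword flags: 0 if the keyword appears in the URL, else 1.
    for word in ("log", "pay", "web", "account"):
        features[word] = 0 if word in url.lower() else 1
    return features


if __name__ == "__main__":
    example = "http://secure-login.example.com/pay/account?id=123"
    for name, value in address_bar_features(example).items():
        print(name, value)
```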

Feature Importance

The success of this model is largely due to its careful selection and weighting of relevant features. In phishing URL detection, the following features proved to be most influential: