Introduction
Phishing URLs pose a serious threat to individuals in Thailand, and the risk is expected to grow considerably. Existing phishing link detection tools focus primarily on technical aspects, leaving a significant gap: users receive numbers and data but little information that helps them understand why a link is dangerous.
In response to the increasing threat of phishing attacks and the need for user-centric cybersecurity solutions, this project aims to develop a phishing detection tool powered by interpretable machine learning models.
URL components
A Uniform Resource Locator (URL) is designed to locate web pages. The diagram below outlines the structure of a common URL and its key components.
A phisher has complete control over the subdomain sections and can assign any value to them. The URL may also include path and file elements, which the phisher can likewise set as desired. In the rest of this article, we refer to these portions of the URL as FreeURL.
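As a rough illustration of these components, the sketch below splits a URL with Python's standard `urllib.parse`. The function name `split_url`, the example URL, and the rule that the registered domain is the last two hostname labels are assumptions made for this example. That rule fails for suffixes such as `co.th`, so a public-suffix list would be needed in practice.

```python
from urllib.parse import urlsplit

def split_url(url: str) -> dict:
    """Split a URL into scheme, subdomain, registered domain and path."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    # Assumption: the registered domain is the last two labels of the hostname.
    registered = ".".join(labels[-2:])
    subdomain = ".".join(labels[:-2])
    return {
        "scheme": parts.scheme,
        "subdomain": subdomain,
        "registered_domain": registered,
        "path": parts.path,
        # FreeURL: the portions a phisher can set to any value.
        "free_url": [p for p in (subdomain, parts.path) if p],
    }

# Hypothetical phishing-style URL that hides a brand name in the subdomain.
info = split_url("https://paypal.com.secure-login.example.net/account/verify.php")
print(info)
```

Here the brand name appears only in the subdomain, which is part of FreeURL, while the registered domain is unrelated to it.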
Machine Learning Techniques
Algorithm: Random Forest
The Random Forest algorithm can be utilized to enhance accuracy and mitigate overfitting in URL classification. The algorithm combines the outputs of multiple decision trees to categorize URLs. By using random subsets of features, each decision tree focuses on different combinations of URL features. The diverse set of decision trees creates a robust ensemble model that effectively analyzes and classifies URLs. A final decision is made through majority voting based on the predictions from each tree, resulting in more reliable and precise classifications compared to a single decision tree model.
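The ensemble idea can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the project's model: each "tree" is a one-split decision stump trained on a bootstrap sample and a random subset of features, whereas a real Random Forest grows deeper trees and usually resamples features at every split. The toy rows, the feature meanings, and all function names are assumptions for this example.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold, direction) with the fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            for above in (0, 1):  # label predicted when the value exceeds t
                errors = sum((above if r[f] > t else 1 - above) != y
                             for r, y in zip(rows, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, t, above)
    _, f, t, above = best
    return lambda r: above if r[f] > t else 1 - above

def train_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample of the rows plus a random subset of the features.
        idx = [rng.randrange(len(rows)) for _ in rows]
        feats = rng.sample(range(len(rows[0])), n_features)
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Majority vote over the predictions of all trees."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): [hostname length, dot count, '@' count]; 1 = phishing.
rows = [[10, 1, 0], [12, 1, 0], [14, 2, 0], [16, 2, 0],
        [40, 4, 1], [45, 5, 1], [50, 5, 1], [55, 6, 1]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(rows, labels)
print(predict(forest, [8, 1, 0]), predict(forest, [60, 7, 1]))
```

Because every tree sees different rows and features, an individual tree can be wrong on a given URL while the vote as a whole remains stable.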
Feature selection method: Particle Swarm Optimization
The purpose of feature selection is to choose a subset of relevant features from a large set of available features so that classification performance is similar to, or better than, using all features. By removing irrelevant and redundant features, feature selection can reduce the number of features, shorten training time, simplify the learned classifier, and improve classification performance.
Particle Swarm Optimization (PSO) is a population-based technique. It is used for feature selection in this project because it offers a better representation, can search large spaces, is computationally less expensive, is easier to implement, and requires fewer parameters. PSO simulates social behavior such as birds flocking and fish schooling. A population of candidate solutions, called a swarm, is encoded as particles in the search space. PSO starts by randomly initialising the particles. Each particle then moves through the search space looking for the optimal solution, updating its position based on its own experience and on the experience of the other particles in the swarm.
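The update rule can be sketched as a binary PSO, in which each particle is a 0/1 mask over the features. The text does not state which PSO variant, parameters, or fitness function the project used, so the binary encoding, the values of `w`, `c1`, `c2` and `v_max`, and the `toy_fitness` function are all assumptions for illustration. In a real run the fitness would typically be the classifier's validation accuracy on the selected features.

```python
import math
import random

def binary_pso(fitness, n_features, n_particles=10, n_iters=30,
               w=0.7, c1=1.5, c2=1.5, v_max=6.0, seed=0):
    rng = random.Random(seed)
    # Random initialisation: each particle is a 0/1 mask over the features.
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                # each particle's own best mask
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]   # best mask seen by the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])   # own experience
                     + c2 * r2 * (gbest[d] - pos[i][d]))     # swarm experience
                vel[i][d] = max(-v_max, min(v_max, v))
                # The sigmoid maps velocity to the probability that the bit is 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

# Hypothetical fitness: reward three "useful" features, penalise mask size.
USEFUL = {0, 2, 5}

def toy_fitness(mask):
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in USEFUL)
    return hits - 0.1 * sum(mask)

mask, score = binary_pso(toy_fitness, n_features=8)
print(mask, score)
```

The size penalty in `toy_fitness` stands in for the goal of keeping the feature subset small. Without it, selecting every feature would score as well as selecting only the useful ones.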
After performing feature selection using PSO, 19 features were selected, resulting in the highest accuracy of 96.12 percent. The following are the features used for the model.
Model Results
In this study, we developed a phishing URL detection model and evaluated its performance across key metrics, achieving highly promising results. The model's overall accuracy was 95.07%.
Precision, Recall, and F1-Score
To further assess the model's reliability, we considered additional performance metrics:
- Precision: The model achieved a precision of 96.00%, meaning that of all the URLs flagged as phishing, 96% were actual phishing URLs. A high precision score is essential in this context, as it reduces false positives, ensuring that legitimate URLs are not misclassified as phishing.
- Recall: The recall was measured at 94.05%, indicating the model's ability to detect most phishing URLs correctly. High recall ensures that phishing attacks are not missed, which is critical for minimizing the risk of undetected security threats.
- F1-Score: The F1-score, which balances precision and recall, was 95.07%, demonstrating that the model maintains a strong balance between minimizing false positives and false negatives. This metric highlights the overall reliability of the model in a real-world application where both accuracy and coverage are important.
- Accuracy: 96.13%
- Precision: 96.28%
- Recall: 95.97%
- F1-Score: 96.13%
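For reference, all four metrics follow directly from the confusion-matrix counts. The sketch below uses made-up counts, not the project's results, and assumes every denominator is non-zero.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute metrics from confusion-matrix counts (phishing = positive class)."""
    precision = tp / (tp + fp)   # share of flagged URLs that are phishing
    recall = tp / (tp + fn)      # share of phishing URLs that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Hypothetical counts for illustration only.
m = classification_metrics(tp=90, fp=10, fn=30, tn=870)
print(m)
```

With these counts, precision is 0.90 and recall is 0.75, and the F1-score of about 0.82 sits between the two, closer to the lower value.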
Information About Each Feature
Name | Type | Explanation |
---|---|---|
domainlength | Address Bar based | Count the characters in the hostname string. |
www | Address Bar based | If the URL has 'www' as the subdomain, then return 0; otherwise, return 1. |
subdomain | Address Bar based | If the URL has more than 1 subdomain then return 1, else 0. |
https | Address Bar based | If the URL contains 'https', then return 0; otherwise, return 1. |
short_url | Address Bar based | If the URL is a short URL, return 1; otherwise, return 0. |
@ | Address Bar based | Count the '@' characters in the URL. |
- | Address Bar based | Count the '-' characters in the URL. |
= | Address Bar based | Count the '=' characters in the URL. |
. | Address Bar based | Count the '.' characters in the URL's hostname. |
_ | Address Bar based | Count the '_' characters in the URL. |
/ | Address Bar based | Count the '/' characters in the URL. |
digit | Address Bar based | Count the digit (0-9) characters in the URL. |
log | Address Bar based | If the URL contains the word 'log', return 0; otherwise, return 1. |
pay | Address Bar based | If the URL contains the word 'pay', return 0; otherwise, return 1. |
web | Address Bar based | If the URL contains the word 'web', return 0; otherwise, return 1. |
account | Address Bar based | If the URL contains the word 'account', return 0; otherwise, return 1. |
pcemptylinks | HTML/DOM Structure based | Percentage of empty links. An empty link does not lead to a different page. |
pcextlinks | HTML/DOM Structure based | Percentage of external links that direct you to another site with a different domain. |
pcrequrl | HTML/DOM Structure based | Percentage of external resource URLs hosted on a different domain. |
zerolink | HTML/DOM Structure based | If the URL page has no links in the HTML body, return 1; otherwise, return 0. |
extfavicon | HTML/DOM Structure based | If the favicon URL is from a different domain than the submitted URL, return 1; otherwise, return 0. |
submit2Email | HTML/DOM Structure based | If the HTML page contains "\b(mail()|mailto:?)\b" then return 1, else 0. |
sfh | HTML/DOM Structure based | Server Form Handler (SFH), the target of a form's action. If the SFH is an empty string or leads to a different domain from the submitted URL, return 1; otherwise, return 0. |
redirection | Abnormal Based | If clicking the submitted URL results in a redirection to another URL, return 1; otherwise, return 0. |
domainage | Domain Based | If the domain age is less than 6 months, return 1; otherwise, return 0. |
domainend | Domain Based | If the difference in days between the current date and expiration date is less than or equal to one year, return 1; otherwise, return 0. |
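Several of the address-bar features in the table can be computed from the URL string alone. The sketch below is one possible reading of those rows, not the project's extraction code. The function name, the `SHORTENERS` set, the choice to check the scheme for `https`, and the use of case-insensitive substring matching for the keyword features are all assumptions. The `subdomain` feature is left out because it needs a public-suffix list to identify the registered domain.

```python
from urllib.parse import urlsplit

# Illustrative shortener hosts only; not the list used by the project.
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}

def address_bar_features(url: str) -> dict:
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    features = {
        "domainlength": len(host),
        "www": 0 if host.split(".")[0] == "www" else 1,
        # Assumption: "contains 'https'" is read as "the scheme is https".
        "https": 0 if parts.scheme == "https" else 1,
        "short_url": 1 if host in SHORTENERS else 0,
        "@": url.count("@"),
        "-": url.count("-"),
        "=": url.count("="),
        ".": host.count("."),          # dots are counted in the hostname only
        "_": url.count("_"),
        "/": url.count("/"),
        "digit": sum(ch in "0123456789" for ch in url),
    }
    # Keyword features follow the table: 0 if the word is present, otherwise 1.
    for word in ("log", "pay", "web", "account"):
        features[word] = 0 if word in url.lower() else 1
    return features

feats = address_bar_features("http://secure-login.example.com/account/verify?id=42")
print(feats)
```

The HTML/DOM, abnormal, and domain-based features are not covered here because they require fetching the page or querying registration data.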
Feature Importance
The success of this model is largely due to its careful selection and weighting of relevant features. In phishing URL detection, the following features proved to be most influential: