HTMLPhish: Enabling Phishing Web Page Detection by Applying Deep Learning Techniques on HTML Analysis

Opara, Chidimma, Wei, Bo and Chen, Yingke (2020) HTMLPhish: Enabling Phishing Web Page Detection by Applying Deep Learning Techniques on HTML Analysis. In: 2020 International Joint Conference on Neural Networks (IJCNN 2020). Neural Networks (IJCNN). IEEE, Piscataway, pp. 6906-6913. ISBN 9781728169279, 9781728169262

HTMLPhish_Enabling_Accurate_Phishing_Web_Page_Detection_by_Applying_Deep_Learning_Techniques_on_HTML_Analysis_WCCI.pdf - Accepted Version

Official URL: https://doi.org/10.1109/IJCNN48605.2020.9207707

Abstract

Developing and launching phishing attacks now requires little technical skill and little cost. This has led to an ever-growing number of phishing attacks on the World Wide Web. Consequently, proactive techniques to fight phishing attacks have become essential. In this paper, we propose HTMLPhish, a deep learning based, data-driven, end-to-end approach to automatic phishing web page classification. Specifically, HTMLPhish receives the content of the HTML document of a web page and employs Convolutional Neural Networks (CNNs) to learn the semantic dependencies in the textual contents of the HTML. The CNNs learn appropriate feature representations from the HTML document embeddings without extensive manual feature engineering. Furthermore, our proposed concatenation of word and character embeddings allows our model to handle new features and extrapolate easily to test data. We conduct comprehensive experiments on a dataset of more than 50,000 HTML documents whose ratio of phishing to benign web pages reflects the one found in the real world; on this dataset, HTMLPhish achieves over 93% accuracy and true positive rate. In addition, HTMLPhish is a completely language-independent, client-side strategy and can therefore detect phishing web pages regardless of the textual language.
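
The pipeline outlined in the abstract (HTML text in, word-level and character-level embeddings concatenated, convolution over the result) can be illustrated with a minimal, standard-library-only Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the tokenization rules, the sequence lengths, the embedding size, the random (untrained) embedding tables, stacking the two embedding matrices along the sequence axis, and all function names such as `represent` and `conv_relu_maxpool`.

```python
import random
import re

def word_tokens(html):
    """Lowercase, then split into alphanumeric runs and single punctuation marks."""
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", html.lower())

def char_tokens(html):
    """Lowercase, then treat every character (including whitespace) as a token."""
    return list(html.lower())

def build_vocab(token_lists):
    """Assign an integer id to each distinct token; ids 0 and 1 are reserved."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(tokens, vocab, length):
    """Map tokens to ids, truncating or right-padding to a fixed length."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens[:length]]
    return ids + [vocab["<pad>"]] * (length - len(ids))

def embedding_table(vocab_size, dim, seed):
    """Random stand-in (assumption) for a trainable embedding matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]

def represent(html, word_vocab, char_vocab, word_table, char_table,
              word_len=8, char_len=16):
    """Stack word-level embedding rows on top of character-level rows."""
    word_ids = encode(word_tokens(html), word_vocab, word_len)
    char_ids = encode(char_tokens(html), char_vocab, char_len)
    return [word_table[i] for i in word_ids] + [char_table[i] for i in char_ids]

def conv_relu_maxpool(matrix, kernel):
    """Slide one convolution filter over the rows, apply ReLU, take the global max."""
    k, dim = len(kernel), len(kernel[0])
    responses = []
    for start in range(len(matrix) - k + 1):
        s = sum(matrix[start + i][j] * kernel[i][j]
                for i in range(k) for j in range(dim))
        responses.append(max(s, 0.0))
    return max(responses)

# Hypothetical toy inputs; a real system would train on labelled HTML documents.
pages = ['<form action="login.php"><input type="password"></form>',
         '<p>Hello world</p>']
word_vocab = build_vocab(word_tokens(p) for p in pages)
char_vocab = build_vocab(char_tokens(p) for p in pages)
DIM = 4
word_table = embedding_table(len(word_vocab), DIM, seed=0)
char_table = embedding_table(len(char_vocab), DIM, seed=1)
doc = represent(pages[0], word_vocab, char_vocab, word_table, char_table)
kernel = embedding_table(3, DIM, seed=2)  # one filter spanning 3 rows (assumption)
feature = conv_relu_maxpool(doc, kernel)
print(len(doc), len(doc[0]), feature >= 0.0)
```

In a trained model the embedding tables and filters would be learned, many filters would run in parallel, and their pooled outputs would feed a classifier that scores the page as phishing or benign. Because tokens come straight from the raw HTML rather than from a language-specific dictionary, a scheme of this kind does not depend on the page's natural language, which is consistent with the language-independence claim above.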

Item Type: Book Section
Additional Information: IJCNN 2020: International Joint Conference on Neural Networks ; Conference date: 19-07-2020
Uncontrolled Keywords: HTML, Phishing detection, Web pages, Classification model, Convolutional Neural Networks
Subjects: G400 Computer Science
G500 Information Systems
G600 Software Engineering
G700 Artificial Intelligence
Department: Faculties > Engineering and Environment > Computer and Information Sciences
Depositing User: Rachel Branson
Date Deposited: 18 May 2020 12:31
Last Modified: 31 Jul 2021 10:18
URI: http://nrl.northumbria.ac.uk/id/eprint/43160
