Hybrid machine learning architecture for phishing email classification
As internet usage has increased worldwide, the number of phishing attacks has increased correspondingly. One common way of initiating a phishing attack is to send phishing emails. This thesis presents an attempt at creating a novel technique for detecting such emails, and explores a solution based on machine learning algorithms. In order to design a machine learning model that correctly identifies emails as phishing or benign, the following directions were studied: (a) the most widely used machine learning algorithms, together with their representative parameters and the advantages and disadvantages of using them; (b) the development of an algorithm for testing different combinations of machine learning feature inputs; (c) different performance evaluation metrics for the resulting machine learning models, and the development of an algorithm to compute them; and (d) the structure of emails, their different characteristics, and anything else that can serve as a feature input that lets the models detect the patterns of phishing emails efficiently. To that end, two fundamental feature categories were created: (i) properties-based features, which are retrieved from the different characteristics of an email, such as the number of URLs or attachments it contains, and (ii) text-based features, which are retrieved from the text part of the email that is meant to be read by the receiver. Today, with the increase in computer processing power, efficient machine learning solutions are being developed for a growing number of problems. Although there is related work on phishing email classification using machine learning techniques, this work proposes a novel hybrid technique that uses both of the aforementioned feature types.
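To make the two feature categories concrete, the following is a minimal sketch of how each could be derived from a single email using only the Python standard library. The function names, the URL pattern, the bag-of-words representation, and the sample message are illustrative assumptions; the abstract does not specify the exact features or extraction code used in the thesis.

```python
import re
from collections import Counter
from email.message import EmailMessage, Message

# Assumption: a deliberately simple URL pattern, for illustration only.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def body_text(msg: Message) -> str:
    """Join the text/plain parts that the receiver is meant to read."""
    chunks = []
    for part in msg.walk():
        # Skip attachments and non-text parts (including multipart containers).
        if part.get_content_type() != "text/plain" or part.get_filename():
            continue
        payload = part.get_payload(decode=True)
        if payload is not None:
            charset = part.get_content_charset() or "utf-8"
            chunks.append(payload.decode(charset, errors="replace"))
    return "\n".join(chunks)


def properties_features(msg: Message) -> dict:
    """Properties-based features: counts taken from the email's structure."""
    attachments = [p for p in msg.walk() if p.get_filename()]
    return {
        "num_urls": len(URL_PATTERN.findall(body_text(msg))),
        "num_attachments": len(attachments),
    }


def text_features(msg: Message) -> Counter:
    """Text-based features: a plain bag-of-words count (an assumption)."""
    return Counter(re.findall(r"[a-z']+", body_text(msg).lower()))


def build_sample_email() -> EmailMessage:
    """Hypothetical message used only to exercise the functions above."""
    msg = EmailMessage()
    msg["From"] = "support@example.com"
    msg["To"] = "user@example.org"
    msg["Subject"] = "Verify your account"
    msg.set_content(
        "Please verify your account at http://example.com/login "
        "and https://example.com/reset today."
    )
    msg.add_attachment(
        b"placeholder bytes",
        maintype="application",
        subtype="octet-stream",
        filename="invoice.zip",
    )
    return msg


sample = build_sample_email()
print(properties_features(sample))
print(text_features(sample).most_common(3))
```

Keeping the two extractors separate mirrors the distinction drawn above, so each feature category can be fed to a classifier on its own or in combination.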
To that end, two architectures were proposed: (a) a hybrid “assembled” architecture, which combines the properties-based and the text-based features into a single feature vector that is then used as input for the classification process, and (b) a hybrid “stacked” architecture, which has two classifiers: the first classifies the email using the text-based features, and the second, which outputs the final classification, uses the properties-based features plus one additional feature, namely the first classifier's output for the email being classified. The most widely used tools for developing machine learning programs were explored, and Apache Spark with its MLlib library was selected. All the algorithms were written in the Python programming language. After developing an algorithm that tests every combination of classifier, hyperparameter values, and architecture, it was found that, compared to classification using only one type of features, the hybrid “assembled” architecture improves performance, whereas the hybrid “stacked” architecture slightly reduces it.
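The difference between the two architectures can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the thesis implementation: the nearest-centroid classifier is a toy stand-in for whichever MLlib classifiers were evaluated, and the helper names and the four training rows are hypothetical. In Spark MLlib, a transformer such as VectorAssembler can perform the concatenation step; whether the thesis used it is not stated here.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


class NearestCentroid:
    """Toy stand-in classifier (an assumption, not the thesis's model)."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> "NearestCentroid":
        self.centroids: Dict[int, Vector] = {}
        for label in sorted(set(y)):
            rows = [x for x, target in zip(X, y) if target == label]
            # Column-wise mean of all rows that carry this label.
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, x: Sequence[float]) -> int:
        def squared_distance(label: int) -> float:
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[label]))

        return min(self.centroids, key=squared_distance)


def train_assembled(props_X, text_X, y) -> NearestCentroid:
    """'Assembled': one classifier over the concatenated feature vector."""
    X = [list(p) + list(t) for p, t in zip(props_X, text_X)]
    return NearestCentroid().fit(X, y)


def predict_assembled(model: NearestCentroid, props, text) -> int:
    return model.predict(list(props) + list(text))


def train_stacked(props_X, text_X, y) -> Tuple[NearestCentroid, NearestCentroid]:
    """'Stacked': the second classifier sees the properties-based features
    plus the first (text-only) classifier's output as one extra feature."""
    first = NearestCentroid().fit(text_X, y)
    # Simplification: the first classifier's predictions on its own training
    # data are reused here; how the thesis handled this is not stated.
    stacked_X = [list(p) + [float(first.predict(t))] for p, t in zip(props_X, text_X)]
    second = NearestCentroid().fit(stacked_X, y)
    return first, second


def predict_stacked(models: Tuple[NearestCentroid, NearestCentroid], props, text) -> int:
    first, second = models
    return second.predict(list(props) + [float(first.predict(text))])


# Hypothetical toy data. Properties: [num_urls, num_attachments];
# text: counts of two example words. Label 1 = phishing, 0 = benign.
props_X = [[3, 1], [4, 1], [0, 0], [1, 0]]
text_X = [[2, 0], [1, 0], [0, 2], [0, 1]]
y = [1, 1, 0, 0]

assembled = train_assembled(props_X, text_X, y)
stacked = train_stacked(props_X, text_X, y)
print(predict_assembled(assembled, [5, 1], [1, 0]))
print(predict_stacked(stacked, [5, 1], [1, 0]))
```

In the stacked variant, the second classifier never sees the raw text-based features, only the first classifier's verdict, which compresses the text information into a single value.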