Αυτόματη εποπτευόμενη μηχανική μάθηση με τεχνικές δειγματοληψίας

Κουρέας, Σταύρος

Automated supervised machine learning with sampling techniques

Master Thesis

Author

Κουρέας, Σταύρος

Date

2022-02

Abstract

Data analysis is a branch of modern science concerned with managing and interpreting usable information, the volume of which is growing rapidly. This thesis studies machine learning applications on data sets of various sizes, together with techniques for improving the machine learning process so that it meets specific requirements. Today, "armed" with a multitude of algorithms and hyperparameters, we can achieve impressive results, but choosing the right combination is difficult. As large data sets are processed daily, the demands on processing power and time increase. Most disciplines require highly accurate predictions, which in turn requires extensive experimentation on each data set. This thesis proposes a new technique based on sampling, which can deliver satisfactory results in less time and with less processing power. At the same time, it builds a methodology for analyzing big data and for handling common problems such as missing values, alphanumeric values, and unbalanced data sets. The technique samples both rows and columns and is evaluated experimentally: results are collected from different data sets and compared with those obtained without sampling. Specifically, 15 data sets were used for binary classification, 15 for multiclass classification, and 5 for regression, all of them well known in the machine learning field. The experiments indicate that a 10% sample is sufficient for rows and that 80% is sufficient for columns, with columns selected by correlation. The result appears satisfactory: the algorithm selected on the sample matches the one selected on the complete data set in 80% of cases, and when the selection differs, there is a probability above 70% that the algorithm selected is the next-best one.
In practice, this means that if an algorithm is chosen on a smaller data set, it is quite likely to also work better on the whole data set. The technique is implemented in Python as a library consisting of organized sub-processes. Each sub-process handles specific decisions in the pre-processing stage (row sampling, column sampling, handling of missing values, normalization) and in the modeling stage (algorithm selection and hyperparameter optimization). The library has been published in the PyPI repository under the name "AutoMLWrapper" (since it is a set of subsystems of special methods) and is accompanied by a sample notebook (https://pypi.org/project/automlwrapper/). It can therefore be distributed and run easily in a plain Python environment by installing it with pip install, making it directly usable by end users.
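
The row and column sampling described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the AutoMLWrapper API: the function names (`pearson`, `sample_rows`, `select_columns`), the synthetic data, and the choice to rank columns by absolute Pearson correlation with the target are all assumptions made for illustration, since the abstract only states that column sampling is "based on correlation". The 10% and 80% defaults are the fractions reported in the abstract.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def sample_rows(rows, fraction=0.10, seed=0):
    """Keep a uniform random fraction of the rows (at least one row)."""
    k = max(1, round(len(rows) * fraction))
    return random.Random(seed).sample(rows, k)


def select_columns(rows, feature_names, target_name, fraction=0.80):
    """Keep the fraction of features with the highest |correlation| to the target.

    Assumption: "based on correlation" is read here as correlation with the target.
    """
    target = [row[target_name] for row in rows]
    scores = {
        name: abs(pearson([row[name] for row in rows], target))
        for name in feature_names
    }
    k = max(1, round(len(feature_names) * fraction))
    return sorted(feature_names, key=lambda name: scores[name], reverse=True)[:k]


# Synthetic demo data (not from the thesis): four informative features and one constant.
rng = random.Random(42)
rows = []
for _ in range(1000):
    row = {f"f{i}": rng.gauss(0, 1) for i in range(4)}
    row["constant"] = 1.0
    row["target"] = sum(row[f"f{i}"] for i in range(4)) + rng.gauss(0, 0.1)
    rows.append(row)

features = ["f0", "f1", "f2", "f3", "constant"]
sampled = sample_rows(rows, fraction=0.10, seed=1)   # 10% of the rows
kept = select_columns(sampled, features, "target", fraction=0.80)  # 80% of the columns
print(len(sampled), kept)
```

In this sketch the column scores are computed on the row sample rather than on the full data, which keeps the cost of the selection step proportional to the sample size; whether the thesis orders the two steps this way is not stated in the abstract.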

Postgraduate Studies Programme

Information Systems and Services

Department

School of Information and Communication Technologies. Department of Digital Systems

Number of pages

Language

Greek

URI

https://dione.lib.unipi.gr/xmlui/handle/unipi/14201
http://dx.doi.org/10.26267/unipi_dione/1624

Collections

Department of Digital Systems

Except where otherwise noted, this item's license is described as
Attribution-NonCommercial-NoDerivatives 3.0 Greece