Αυτόματη εποπτευόμενη μηχανική μάθηση με τεχνικές δειγματοληψίας
Automated supervised machine learning with sampling techniques

Keywords
Automatic machine learning ; Automated machine learning ; Machine learning ; Sampling techniques ; Sampling

Abstract
Data analysis is a branch of modern science that deals with the management and interpretation
of usable information, the volume of which is now growing rapidly. This thesis investigates
machine learning applications on data sets of various sizes, together with a technique for
improving the machine learning process so that it meets specific requirements.
Nowadays, "armed" with a multitude of algorithms and hyperparameters, we can achieve
impressive results, but choosing the right combination is a difficult process. As large data
sets are processed daily, the demands on processing power and time increase. Most disciplines
require highly accurate predictions, which in turn demands a great deal of research on each
data set.
This thesis proposes a new technique based on sampling procedures, which can deliver
satisfactory results in less time and with less processing power. At the same time, it "builds"
a methodology for analyzing big data and for handling common problems such as missing values,
alphanumeric values and unbalanced data sets.
The technique samples both the rows and the columns of a data set. It is evaluated through an
experimental process in which results are collected from different data sets and compared with
the results obtained without it. More specifically, 15 data sets were used for binary
classification, 15 for multi-class classification and 5 for regression. All of them are
well-known data sets in the field of machine learning.
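
The row and column sampling described above can be sketched as follows. This is a minimal
illustration, not the thesis implementation: it assumes simple random row sampling without
replacement and assumes that "based on correlation" means ranking columns by their absolute
Pearson correlation with the target. The function names, the toy data and the default
fractions (which mirror the 10% and 80% reported in the abstract) are all hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def sample_rows(rows, target, fraction=0.10, seed=0):
    """Draw a simple random sample of rows (and matching targets) without replacement."""
    rng = random.Random(seed)
    k = max(1, round(len(rows) * fraction))
    idx = sorted(rng.sample(range(len(rows)), k))
    return [rows[i] for i in idx], [target[i] for i in idx]


def select_columns(rows, target, fraction=0.80):
    """Return the indices of the top `fraction` of columns by |correlation| with the target."""
    n_cols = len(rows[0])
    k = max(1, round(n_cols * fraction))
    scores = [abs(pearson([r[j] for r in rows], target)) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data (made up for illustration): 200 rows, 5 features, only feature 0 drives the target.
rng = random.Random(42)
X = [[rng.random() for _ in range(5)] for _ in range(200)]
y = [row[0] * 3.0 + rng.gauss(0, 0.1) for row in X]

X_small, y_small = sample_rows(X, y, fraction=0.10)      # keep 10% of the rows
kept = select_columns(X_small, y_small, fraction=0.80)   # keep 80% of the columns
X_reduced = [[row[j] for j in kept] for row in X_small]
```

Simple random sampling does not preserve class proportions, so for unbalanced data sets a
stratified draw would be a natural variation of this sketch.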
The results of the experimental procedure indicate that sampling 10% of the rows and 80% of
the columns (selected by correlation) is sufficient. The outcome is satisfactory: in 80% of
cases the algorithm selected on the sample is the same as the one selected on the complete
data set, and when the selection differs, there is a probability above 70% that the selected
algorithm is the next-best one. In practice, this means that an algorithm chosen on a smaller
data set is quite likely to be a good choice for the whole data set as well.
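
One way to make this comparison concrete is to classify, for each data set, where the
algorithm picked on the sample falls in the full-data ranking. The sketch below is an
assumption about how such an agreement measure could be computed, not the procedure used in
the thesis; the function names, algorithm names and scores are made-up placeholders, not
reported results.

```python
from collections import Counter


def selection_outcome(sample_scores, full_scores):
    """Classify the sample's pick as 'same', 'next_best' or 'other' in the full-data ranking."""
    picked = max(sample_scores, key=sample_scores.get)
    ranking = sorted(full_scores, key=full_scores.get, reverse=True)
    position = ranking.index(picked)
    return {0: "same", 1: "next_best"}.get(position, "other")


def outcome_rates(outcomes):
    """Turn a list of outcome labels into the fraction of data sets with each label."""
    counts = Counter(outcomes)
    return {label: counts[label] / len(outcomes) for label in counts}


# Placeholder scores for two hypothetical data sets (higher is better).
sample_a = {"logistic_regression": 0.81, "random_forest": 0.84, "knn": 0.78}
full_a = {"logistic_regression": 0.83, "random_forest": 0.86, "knn": 0.80}

sample_b = {"logistic_regression": 0.81, "random_forest": 0.84, "knn": 0.78}
full_b = {"logistic_regression": 0.88, "random_forest": 0.86, "knn": 0.80}

outcomes = [selection_outcome(sample_a, full_a), selection_outcome(sample_b, full_b)]
rates = outcome_rates(outcomes)
```

Here the first data set yields "same" and the second yields "next_best", because the
algorithm picked on the sample ranks second on the full data.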
The technique is implemented in Python in the form of a library, which consists of organized
sub-procedures. Each sub-procedure handles specific decisions, either in the pre-processing
stage (row sampling, column sampling, handling of missing values, normalization) or in the
modeling stage (algorithm selection and hyperparameter optimization).
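
To illustrate the kind of decision a pre-processing sub-procedure makes, the sketch below
shows three small steps on single columns. The abstract does not state which strategies the
library uses, so mean imputation, min-max scaling and integer label encoding are assumptions
chosen for simplicity, and the function names are hypothetical rather than part of the
library's API.

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values.

    Assumes the column has at least one observed (non-None) value.
    """
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def min_max_normalize(column):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def encode_labels(column):
    """Map alphanumeric categories to integer codes in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in column]


ages = [20.0, None, 40.0]
cleaned = min_max_normalize(impute_mean(ages))   # [0.0, 0.5, 1.0]
colors = encode_labels(["red", "blue", "red"])   # [0, 1, 0]
```

Keeping each step as a separate function mirrors the idea of independent sub-procedures, so
that any one strategy can be swapped without touching the others.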
The library has been published in the PyPI repository under the name "AutoMLWrapper" (since
it is a set of subsystems of special methods) and is accompanied by a relevant sample
notebook (https://pypi.org/project/automlwrapper/). It can therefore be distributed and run
easily and quickly in a plain Python environment by installing it with
pip install automlwrapper, which makes it directly usable by the end user.