Analytics on clinical and gene expression data to provide risk stratification predictions for cancer subtypes
Keywords
Clinical data ; Genes expression data ; Cancer subtypes discovery ; Cancer subtypes identification ; Survival analysis ; Kaplan Meier curves ; Survival clustering
Abstract
Survival analysis is not limited to estimating time until death in biomedical problems;
recent studies have shown that it is a powerful tool for risk stratification that can be used
in various sectors. It is a well-established statistical technique, and in recent years
many studies have combined survival analysis with machine learning
algorithms in order to capture non-linear relationships among features and to obtain better
performance. Using different data modalities has been shown to be effective for
improving the performance of machine learning models. In this study, even though our
main purpose was not a classification task, we utilized two different data modalities (clinical
and gene expression data) for risk stratification of a group of patients with cancer according
to their latent cancer subtypes. First, we used clinical data to extract features that are
associated with survival time and with the presence or absence of the event of death. Next, we used
the selected features to identify latent groups among patients, and then
used the resulting group labels as ground truth to identify a subset of survival-associated genes. We tested three
different feature selection and dimensionality reduction techniques for gene expression data
to examine whether the choice of technique would change our results. Finally, we applied classifiers to the
identified gene subset and tried to predict which cancer subgroup a
future patient would belong to. With this approach, we expect the identified subgroups
to be biologically meaningful, in other words, to differ in terms of survival. The two
contributions of the proposed approach are a) the discovery of a meaningful subset of genes
that are associated with patient survival and b) the ability to accurately categorize future
patients into a survival risk group even if the available
data are not labeled from the very beginning.
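
The abstract does not give implementation details, so the following is only a minimal illustrative sketch of how one might check that two identified subgroups differ in terms of survival, using the Kaplan-Meier estimator named in the keywords. The function name `kaplan_meier`, the toy follow-up times, and the group names are assumptions made for illustration and are not taken from the study.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time for each patient
    events:    1 if death was observed, 0 if the patient was censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # deaths and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy data for two risk groups (not from the study)
low_risk = kaplan_meier([5, 8, 12, 12, 15], [0, 1, 0, 1, 0])
high_risk = kaplan_meier([1, 2, 2, 4, 6], [1, 1, 1, 0, 1])

for name, curve in (("low risk", low_risk), ("high risk", high_risk)):
    print(name, [(t, round(s, 3)) for t, s in curve])
```

In this toy example the high-risk curve drops faster than the low-risk one, which is the kind of separation the identified subgroups are expected to show. In practice a statistical test such as the log-rank test would be used to judge whether the difference is significant.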