Abstract
This work applies machine learning methods to two important problems: detection of COVID-19 from
chest imaging (X-rays and CT scans) and molecular subtyping of breast cancer using multi-omics data.
The recent pandemic made clear the need for fast and reliable techniques for distinguishing pneumonia
caused by the novel virus SARS-CoV-2 from pneumonia caused by other viral, bacterial, or fungal
infections. Here, a basic CNN model was built from scratch and extended with a spatial attention
mechanism (Attn-CNN) to detect the manifestations of COVID-19 in chest X-ray (CXR) and CT scan
images with improved generalizability and explainability. The
proposed spatial attention-based solution removes the need for lung segmentation and region-based
annotations when training the CNN models while keeping model complexity low, making it suitable
for deployment in clinical settings. To verify the generalizability of the models, testing has also been
carried out on external datasets, and explainability has been provided using Grad-CAM visualization
of the pixels selected by the model for classification. In a performance evaluation against five
state-of-the-art deep learning models, the proposed approach achieved 95% accuracy on CXRs and 96%
on CT images, outperformed all of the other models, and generalized comparatively well on external
datasets.
Advances in high-throughput techniques have generated large volumes of data, enabling genome-wide
profiling of various omics data, such as protein-coding and non-coding (e.g., miRNA, lncRNA) genes
and DNA methylation, and analysis of genetic variations (e.g., SNVs, CNVs). However, identification
of diagnostic and prognostic biomarkers is challenging due to heterogeneity at multiple levels and
the huge number of features associated with each. This heterogeneity is seen to affect
the generalizability and explainability of ML models. To address the high-dimensionality and
explainability issues, a knowledge-based feature selection framework, combined with a filtering
approach based on predominant correlations, is proposed for multi-omics-based biomarker
identification. Since breast cancer is a hormone-dependent cancer, we considered molecular subtype
classification based on the three hormone receptors, viz., estrogen receptor (ER), progesterone
receptor (PR), and human epidermal growth factor receptor 2 (HER2), which gives three classes:
Luminal (ER+, PR±, HER2±), HER2-enriched (ER–, PR–, HER2+), and
Triple Negative (ER–, PR–, HER2–). DNA methylation data from protein-coding genes and long non-
coding RNAs (lncRNAs) were integrated with gene expression data of the associated genes and copy
variant genes for feature selection and classification. Using 172 features obtained from the proposed
framework, stratified 5-fold cross-validation was carried out using five ML models. The best
performance was obtained with the Random Forest model, with an accuracy of 98.19% and AUC values
≥ 0.98 for all three classes, demonstrating the effectiveness of the proposed approach.
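
As an illustration of the stratified 5-fold protocol mentioned above, the following is a minimal sketch in plain Python of how samples can be split so that each fold keeps roughly the same subtype proportions. It is not the authors' implementation: the function name `stratified_folds`, the seed, and the per-class sample counts are assumptions made up for this example. In practice a library routine such as scikit-learn's `StratifiedKFold` would typically be used together with the classifier of choice.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class proportions.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Made-up, imbalanced label set; the counts are NOT taken from the study.
labels = ["Luminal"] * 50 + ["HER2-enriched"] * 10 + ["Triple Negative"] * 15
folds = stratified_folds(labels, k=5, seed=0)

for fold_id, test_idx in enumerate(folds):
    # Each fold serves once as the held-out test set; the rest is training data.
    train_idx = [i for j, f in enumerate(folds) if j != fold_id for i in f]
    counts = Counter(labels[i] for i in test_idx)
    print(fold_id, len(train_idx), dict(counts))
```

Stratification matters when the classes are imbalanced: with a purely random split, a small class could be nearly absent from some test folds, which would make per-class metrics such as AUC unstable.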