This study mainly introduces the modified algorithm CCMQM and compares its results with those of several commonly used imputation methods. The modified system is validated on a rectal cancer microarray dataset, and the other methods are applied to the same data. Different parameters are considered in an attempt to reach the highest prediction accuracy, and the performance of the system is verified through statistical evaluation tests.
Gene expression data analysis
The spot intensities of TIFF and RAW images are calculated using image analysis programs and then exported into several text files and Excel spreadsheets representing the raw data. The microarray measurements in these files indirectly represent the target quantity (gene abundance) by measuring the fluorescence strength of the spots for each fluorescent dye (Cy5 for red and Cy3 for green). First, the GenePix Array List (GAL), a plain text file, describes the position and content of each spot on an array. Substance name and ID lists are pasted directly into the array settings file to form a GAL file. Second, GenePix results (GPR) are presented in a spreadsheet. One version of GenePix calculates up to 108 different measurements for each spot, including the number of feature and background pixels for each feature at each scan wavelength and the mean, median, and total pixel intensities at each scan wavelength for feature and background pixels. The data matrix of expressed genes is then constructed from these spreadsheets. The foreground and background intensities of the red and green channels are determined. Then, four quantities, namely, RF, GF, RB, and GB, are used to calculate the log ratio of the net array intensities of the red and green channels. The net intensities are determined from the GenePix output as the difference between the mean of the foreground and the median of the background for both channels. The matrix dimension becomes n × m, in which n genes are represented in rows and m samples in columns. Each sample represents data from one array, and log ratios are calculated from the data introduced in one spreadsheet.
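The log-ratio computation described above can be sketched as follows; Python is used for illustration (the study's code is in R), and the intensity values are hypothetical:

```python
import numpy as np

# Hypothetical foreground/background spot intensities for three spots
RF = np.array([1200.0, 800.0, 1500.0])  # red foreground (mean)
RB = np.array([100.0, 90.0, 110.0])     # red background (median)
GF = np.array([900.0, 850.0, 700.0])    # green foreground (mean)
GB = np.array([95.0, 100.0, 105.0])     # green background (median)

# Net intensity = foreground minus background for each channel
net_red = RF - RB
net_green = GF - GB

# Log ratio (base 2) of the net red to net green intensities per spot
log_ratio = np.log2(net_red / net_green)
```

A positive log ratio means the spot is brighter in the Cy5 (red) channel; the full n × m matrix is obtained by stacking one such column per array.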
Data preprocessing
Data are preprocessed to control the quality of the image data produced by scanning a microarray and to convert these data into a table of expression level values or expression level ratios for the genes on the microarray. Different preprocessing and analysis techniques, such as the limma package in R, are applied using linear models of microarray data [9]. These procedures are crucial for a spotted array with a two-color method. In this study, preprocessing is essential for making the input data suitable for analysis and valid for entering the model.
Preprocessing solves many problems in the raw data, handled as a checklist before implementation:
- Removing the categorizing row and column.
- Eliminating genes with no name.
- Minimizing high values by applying a log transformation.
- Performing quantile normalization to achieve the same sample distribution in each state.
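The last two checklist items can be sketched as follows, assuming a small toy matrix; the study's preprocessing uses R (e.g., limma), so this Python version is purely illustrative:

```python
import numpy as np

def quantile_normalize(X):
    """Give every column (sample) the same distribution: the mean of the
    column-wise sorted values, assigned back by within-column rank."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank of each entry within its column
    reference = np.sort(X, axis=0).mean(axis=1)        # sorted values averaged across samples
    return reference[ranks]

# Toy expression matrix: 4 genes (rows) x 3 samples (columns), hypothetical values
X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])

X_log = np.log2(X)                  # compress the high values
X_norm = quantile_normalize(X_log)  # identical distribution in every column
```

After this step, sorting any column of `X_norm` yields the same reference distribution, which is the defining property of quantile normalization.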
Capturing and storing microarray information are not the last steps of the process. The amount of information from a single microarray experiment is quite large, so software tools must be used to understand its meaning. GE information analysis is typically applied to (i) discriminate between different known cell types or conditions, e.g., between normal and tumor tissues or between tumors of different types, or to monitor tumors under different treatment schemes; and (ii) identify different and previously unknown cell types or conditions, e.g., new subclasses of an existing class of tumors. The same problems are encountered when genes are being classified: unknown cDNA sequences must be assigned to gene classes, or a set of genes must be divided into new functional classes based on their expression patterns under many experimental conditions [10]. These two tasks are defined as class prediction and class discovery [11]. In the ML literature, they are known as supervised and unsupervised learning; the learning in question in microarray data analysis is of GE values. If classes are identified with labels, discriminant analysis or supervised learning methods rather than clustering methods are commonly used [12]. Data with a missingness ratio of 1% can be neglected, whereas missingness of 1–5% is manageable. However, data with 5–15% missingness should be subjected to appropriate approaches to achieve good imputation results. When datasets have > 15% MD, the choice of IM may strongly influence the results. As such, MV are assigned by randomly selecting about 5% of all genes from the original data. Then, ignorable (MCAR and MAR) and non-ignorable (NMAR) missingness types are measured at three missingness rates (10%, 20%, and 30%).
Ignorable MV are produced by randomly choosing samples at the three missingness rates and then removing them. Furthermore, the upper or lower tails (10%, 20%, and 30%) of the data are selected to produce non-ignorable MV; their values are subsequently removed so that the missing rate depends on the actual GE, thereby artificially producing MV through data processing.
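A minimal sketch of the two generation mechanisms (random deletion for ignorable MV, tail deletion for non-ignorable MV), with hypothetical data; Python is used for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=8.0, scale=2.0, size=(100, 10))  # hypothetical log-expression matrix

def make_ignorable(X, rate):
    """Ignorable missingness: delete entries completely at random."""
    X = X.copy()
    X[rng.random(X.shape) < rate] = np.nan
    return X

def make_nonignorable(X, rate):
    """Non-ignorable missingness: delete the lower tail, so the missing
    rate depends on the expression value itself."""
    X = X.copy()
    cutoff = np.quantile(X, rate)
    X[X < cutoff] = np.nan
    return X

X_mcar = make_ignorable(data, 0.10)
X_nmar = make_nonignorable(data, 0.10)
```

The same two functions applied at rates 0.20 and 0.30 yield the remaining missingness levels used in the study.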
Complete workflow
Figure 7 shows the sequence of steps used to achieve the desired output of this system. The workflow starts with preprocessing of the raw dataset of 16,156 sample genes of normal and rectal cancer tissue from the GEO database [13] (genome-based microarray data, accession number GSE15781 [14] for rectal cancer), which contains 42 samples from 22 patients with rectal cancer and 20 healthy individuals, to prepare it for the generation stage. In this stage, the three missing data mechanisms are applied before the data enter the imputation algorithms (nine techniques in three categories): the local group comprises the KNN and LLS techniques; the global group comprises BPCA, SVD, gene-mean, gene-median, column-mean, and column-median; and the modified global technique is clustering column-mean quantile median (CCMQM). The workflow concludes with the evaluation stage, which is based on several statistical tests: NRMSE, time consumption, Euclidean distance, signal-to-noise ratio (SNR), Gini coefficient, and Fisher discriminant.
Gene statistical tests
Normalized root mean square error
The imputation methods are compared in terms of NRMSE. The RMSE is a regularly used measure of the difference between the values predicted by a model and the observed values.
NRMSE is calculated using the following formula in Eq. (1):
$$\mathrm{NRMSE}=\sqrt{\frac{\operatorname{mean}\left[{\left({Y}_{\mathrm{original}}-{Y}_{\mathrm{imputed}}\right)}^{2}\right]}{\operatorname{variance}\left({Y}_{\mathrm{original}}\right)}}$$
(1)
Where Y-original and Y-imputed represent the original and imputed datasets, respectively. The NRMSE values range from 0 to 1; the smaller the value, the better the evaluation performance [15].
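Eq. (1) transcribes directly into code; Python is used here for illustration (the study's implementation is in R), and the toy vectors are assumptions:

```python
import numpy as np

def nrmse(y_original, y_imputed):
    """NRMSE per Eq. (1): RMSE of the imputation error, normalized by
    the variance of the original data."""
    mse = np.mean((y_original - y_imputed) ** 2)
    return np.sqrt(mse / np.var(y_original))

# Hypothetical original vs. imputed values
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_hat = np.array([2.5, 3.5, 6.0, 8.0])
```

`nrmse(y_true, y_true)` is 0 by construction, and larger imputation errors push the value toward 1.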
Time consumption of imputation methods
Another metric for performance comparison is execution time. In our study, the execution times of statistic-based IM are comparable and increase proportionally with the amount of MV. Overall, one model can be preferred over others by considering the time metric. A considerable trade-off exists between prediction accuracy and the time taken for imputation.
Gini coefficient
Gini coefficient (GC) measures the inequality among values of a variable. The higher the index value is, the more unequally distributed the data will be. Alternatively, GC can be considered half of the relative mean absolute difference, so the Gini coefficient gives a measure of competitiveness and therefore a measure of uncertainty.
The coefficient can take any value from 0 to 1 (0 to 100%).
A GC of 0 indicates perfect equality of income distribution within a population, whereas a GC of 1 represents perfect inequality, in which one person in a population receives all the income while the others earn nothing. In some cases, if incomes in a population are negative, GC exceeds 100%.
GC can be calculated as follows:
$$\textrm{Gini Coefficient}=\frac{A}{A+B},$$
(2)
Where A is the area between the line of equality and the Lorenz curve, and B is the area under the Lorenz curve.
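Equivalently to Eq. (2), GC can be computed as half the relative mean absolute difference; a minimal sketch (Python for illustration):

```python
import numpy as np

def gini(x):
    """Gini coefficient as half the relative mean absolute difference,
    equivalent to A / (A + B) in Eq. (2)."""
    x = np.asarray(x, dtype=float)
    mean_abs_diff = np.abs(x[:, None] - x[None, :]).mean()  # over all ordered pairs
    return mean_abs_diff / (2.0 * x.mean())
```

For example, a perfectly equal vector gives 0, and a vector where one entry holds everything gives (n − 1)/n, approaching 1 as n grows.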
Euclidean distance
Euclidean distance (ED) is a nonparametric measure through which the distance between each gene p and the ideal gene q is calculated according to Eq. (3). Each GE value in one gene and its corresponding value in the ideal gene are treated as two points in space. ED is calculated as the square root of the sum of the squared differences of the two real-valued vectors. The distance between two objects is usually defined as the smallest distance among pairs of points from the two objects.
$$\textrm{ED}=\sqrt{\Sigma {\left(p-q\right)}^2}$$
(3)
Where p and q are the two point vectors whose difference defines the ED.
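Eq. (3) as a short function (Python for illustration):

```python
import numpy as np

def euclidean_distance(p, q):
    """Eq. (3): square root of the sum of squared differences
    between two expression vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))
```

Applied per gene against the ideal gene, smaller distances indicate expression profiles closer to the ideal.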
Signal-to-noise ratio
Signal-to-noise ratio (SNR), which was first proposed by Golub et al. in 1999 [11], measures the relative usefulness of a feature by ranking the features. This test is performed by comparing gene correlations with the expected gene correlations. The relative contribution of a sample to the signal and noise, rather than the raw ratio, is examined. The test assigns each gene a value reflecting the maximal difference in mean expression between two groups and the minimal variation of expression within each group [16]. In this method, genes are first ranked according to their expression levels by using the SNR test statistic. In this test, the signal strength is the difference between the class conditional means, and the noise is the sum of the class conditional standard deviations. In microarray data, the features selected for classification can thus be ranked. SNR is represented as follows:
$$\textrm{SNR}\left(\textrm{i}\right)=\frac{{\upmu}_{\textrm{i}1}-{\upmu}_{\textrm{i}2}}{{\upsigma}_{\textrm{i}1}+{\upsigma}_{\textrm{i}2}},$$
(4)
Where μi1 and μi2 are the means of gene i in sample classes 1 and 2, respectively; σi1 and σi2 are the standard deviations of the samples in each class; and i = 1 to ng.
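Eq. (4) as a sketch, with μ and σ computed per class (Python for illustration; the toy class values are assumptions):

```python
import numpy as np

def snr(gene_class1, gene_class2):
    """Eq. (4): difference of class means divided by the sum of
    class standard deviations, for one gene."""
    mu1, mu2 = np.mean(gene_class1), np.mean(gene_class2)
    sigma1, sigma2 = np.std(gene_class1), np.std(gene_class2)
    return (mu1 - mu2) / (sigma1 + sigma2)
```

Ranking all genes by the absolute value of this statistic selects those with large between-class separation relative to within-class spread.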
Fisher discriminant criterion
Fisher’s linear discriminant analysis, described by Duda et al. [17] in 1973, finds a linear combination of observed or measured variables that best describes the separation between known groups of observations. It mainly aims at classification or prediction problems in which the dependent variable is quantitative. According to its criterion, higher values are assigned to features that vary significantly among classes (the original gene and the predicted gene in our example) relative to their variances. Genes are arranged in descending order, such that the first genes are considered the most informative according to the Fisher discriminant criterion test result [18].
The Fisher discriminant ratio is used to evaluate how well each feature separates the classes. The Fisher values of the range and variance clarify the degree of overlap between classes in a dataset. The higher the value of a given feature is, the greater the number of classes separable by that feature and the lower the overlap and complexity of the data will be. Therefore, features with minimal overlap in datasets and low complexity should be selected. The time complexity is O(mnc), where m is the number of samples, n is the number of features, and c is the number of dataset classes [19].
The Fisher discriminant ratio is determined as
$$\textrm{FC}\ \left(\textrm{i}\right)=\frac{{\left({\upmu}_{\textrm{i}1}-{\upmu}_{\textrm{i}2}\right)}^2\ }{\left({\upsigma_{\textrm{i}1}}^2+{\upsigma_{\textrm{i}2}}^2\right)}$$
(5)
Two classes can be approximated by N (μi1, σi1) and N (μi2, σi2), respectively, where N (μi, σi) indicates a normal distribution with mean μi and standard deviation σi.
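Eq. (5) as a sketch, sharing the per-class means but using variances in the denominator (Python for illustration; the toy class values are assumptions):

```python
import numpy as np

def fisher_ratio(gene_class1, gene_class2):
    """Eq. (5): squared difference of class means divided by the sum of
    class variances, for one gene."""
    mu1, mu2 = np.mean(gene_class1), np.mean(gene_class2)
    var1, var2 = np.var(gene_class1), np.var(gene_class2)
    return (mu1 - mu2) ** 2 / (var1 + var2)
```

Unlike the SNR of Eq. (4), the squaring makes this criterion sign-free, so genes can be ranked directly in descending order of the ratio.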
Modified system workflow
The following block diagram and pseudo-code simplify the phases of the data route in the modified system (CCMQM), which operates on a dataset with missing values to impute them and evaluate the imputation of each missing value. Preprocessed data are subjected to the missingness ratios (10%, 20%, and 30%). The data dimensions (rows and columns) are determined so as to take one random sample containing the missing spots and another sample without missing spots, with the same number of columns in both samples. The samples are subjected to statistical calculations to obtain a convincing substitute value. This value is then substituted into all the missing spots, and the reformed data are compared with the existing data. Comparison files are saved as Excel files and visualized in bar charts for convenient observation to verify that the system yields the desired output (Fig. 8).
Pseudo-code representing the detailed steps of the system:
Include all needed packages using the following libraries
CRAN library
Bioconductor library
// Generation //
Declare matrices to store the imputed data from the CCMQM method in "CCMQM_Matrix"
Declare matrices to store the output results from the testing methods, showing column and row names
Read the "Data File" matrix to proceed with the preprocessing procedure
Include "Data File" in a data frame
Declare "Data" to store "Data File" after data framing
Declare the numbers of normal and cancer patients shown in the matrix
Include a separator column between normal and cancer patients for easy tracking
Declare a long int to store any number of rows to keep the code generic
Declare "Datamiss10", "Datamiss20" & "Datamiss30" to store the amounts of missing values referring to the used technique
// Imputation //
Use MCAR function to start the first technique
for (i in 1:number of columns in "Data")
{
    Use system.time to start counting the time
    Store the value in "Start_Time_MCAR_Datamiss10"
    Select column i
    Count the number of "NaN" in column i
    Take a random sample of observed values in column i equal in size to the number of missing "NaN" in column i
    Store this sample in "sample_datamiss10_col_i"
    Declare a data frame for "sample_datamiss10_col_i"
    Calculate the column mean of "sample_datamiss10_col_i"
    Substitute this value for every missing "NaN"
    Save the new imputed column in column i of "CCMQM_Matrix"
    Use the NRMSE function to calculate the column performance
    Store the calculated value in "NRMSE Matrix"
    Use the Gini coefficient function
    Store the calculated value in "Gini Matrix"
    Calculate End.time as the difference between system.time calls
    Store the calculated value in "Time Matrix"
}
Repeat the same steps for Datamiss20 & Datamiss30
Use MAR function to start the second technique
for (i in 1:number of columns in "Data")
{
    Use system.time to start counting the time
    Store the value in "Start_Time_MAR_Datamiss10"
    Select column i
    Count the number of "NaN" in column i
    Take a random sample of observed values in column i equal in size to the number of missing "NaN" in column i
    Store this sample in "sample_datamiss10_col_i"
    Declare a data frame for "sample_datamiss10_col_i"
    Calculate the column mean of "sample_datamiss10_col_i"
    Substitute this value for every missing "NaN"
    Save the new imputed column in column i of "CCMQM_Matrix"
    Use the NRMSE function to calculate the column performance
    Store the calculated value in "NRMSE Matrix"
    Use the Gini coefficient function
    Store the calculated value in "Gini Matrix"
    Calculate End.time as the difference between system.time calls
    Store the calculated value in "Time Matrix"
}
Repeat the same steps for Datamiss20 & Datamiss30
Use NMAR function to start the third technique
for (i in 1:number of columns in "Data")
{
    Use system.time to start counting the time
    Store the value in "Start_Time_NMAR_Datamiss10"
    Select column i
    Count the number of "NaN" in column i
    Take a random sample of observed values in column i equal in size to the number of missing "NaN" in column i
    Store this sample in "sample_datamiss10_col_i"
    Declare a data frame for "sample_datamiss10_col_i"
    Calculate the column mean of "sample_datamiss10_col_i"
    Substitute this value for every missing "NaN"
    Save the new imputed column in column i of "CCMQM_Matrix"
    Use the NRMSE function to calculate the column performance
    Store the calculated value in "NRMSE Matrix"
    Use the Gini coefficient function
    Store the calculated value in "Gini Matrix"
    Calculate End.time as the difference between system.time calls
    Store the calculated value in "Time Matrix"
}
Repeat the same steps for Datamiss20 & Datamiss30
// Evaluation //
Declare all output matrices in a data frame
Save the output matrices as .xlsx files at a defined address in your system
Use Excel to generate evaluation bar charts for a readable output
Attach the output data to the main imputation methods code for the study comparisons.
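The per-column imputation step of the pseudo-code can be sketched as follows. The study's implementation is in R (system.time, data frames); this Python version, with toy data and illustrative names, is only an assumption-laden sketch of the same idea:

```python
import time
import numpy as np

def impute_column(col, rng):
    """One CCMQM-style step for a single column: draw a random sample of
    observed values the same size as the number of missing entries and
    substitute its mean into every missing spot."""
    col = col.copy()
    missing = np.isnan(col)
    n_miss = int(missing.sum())
    if n_miss == 0:
        return col
    observed = col[~missing]
    sample = rng.choice(observed, size=n_miss, replace=True)
    col[missing] = sample.mean()
    return col

rng = np.random.default_rng(42)
data = rng.normal(8.0, 2.0, size=(50, 6))          # hypothetical complete matrix
data_miss = data.copy()
data_miss[rng.random(data.shape) < 0.10] = np.nan  # ~10% ignorable missingness

imputed = np.empty_like(data_miss)
times = []
for i in range(data_miss.shape[1]):
    t0 = time.perf_counter()                       # analogous to system.time in R
    imputed[:, i] = impute_column(data_miss[:, i], rng)
    times.append(time.perf_counter() - t0)
```

The per-column timings collected in `times` correspond to the "Time Matrix" of the pseudo-code, and the filled matrix `imputed` is what the NRMSE and Gini evaluations would operate on.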