
Ensemble of deep learning and machine learning approach for classification of handwritten Hindi numerals

Abstract

Given the vast range of factors, including shape, size, skew, and orientation of handwritten numerals, their machine-based recognition is a difficult challenge for researchers in the pattern recognition field. The abundance of curves and the closely resembling shapes of the symbols further elevate the difficulty of recognizing Devanagari numerals. To address these issues, the suggested low-classification-cost method used the benchmark deep learning models VGG-16Net, VGG-19Net, ResNet-50, and Inception-v3 to obtain fine features from the given numeral images. Principal component analysis, a powerful dimensionality reduction method, was used to efficiently reduce the dimensionality of the features provided by the pre-trained deep convolutional neural network models. The scheme improved recognition accuracy by fusing the reduced features. A machine learning algorithm, the support vector machine, was employed for the recognition task due to its capacity to distinguish between patterns belonging to distinct classes. The system was able to obtain a recognition accuracy of 99.72% and was effective in demonstrating the importance of the ensemble of machine learning and deep learning approaches.

Introduction

Machine-based recognition of handwritten alphabets is one of the requirements of language-based automation. Intrinsic, unconstrained diversity in the writing styles, shapes, scales, skews, orientations, and deformations of handwritten alphabets is the main associated challenge. As a result of their massive populations not having embraced English as their first language, nations like India, China, Egypt, Saudi Arabia, and the United Arab Emirates are building automation systems in their own national tongues to benefit most of their populations. Many advancements have been reported for language-based automation systems related to English script due to its worldwide acceptance. Systems based on globally emerging languages like Hindi (Devanagari), Mandarin, Arabic, Javanese, Urdu, and Persian require extra care. Efforts have been made in the present work for the Devanagari script. A set of Hindi numerals is shown in Fig. 1.

Fig. 1 Set of handwritten Hindi numerals

Several methods have been implemented so far for solving the proposed problem. Some benchmark models are described as follows: Das et al. [1] extracted quad-tree-based longest-run and modular principal component analysis (PCA)-based features from numeral images and concatenated them. The classification was done with a one-versus-all support vector machine (SVM) classifier. Iamsa-At and Horata [2] crafted histogram of gradient (HOG) features from handwritten Hindi digits. The feedforward backpropagation neural network (FBNN) and extreme learning machine (ELM) were implemented as classification algorithms; the former was the top performer.

Khanduja et al. [3] created a hybrid of structural and statistical features that included intersection points, end points, loops, and pixel distributions. A feedforward neural network was employed for the recognition of numerals. Singh et al. [4] examined the performance of five distinct classifiers: multilayer perceptron (MLP), Naïve Bayes (NB), logistic classifiers, random forest (RF), and SVM over regional weighted run-length features derived from numeral images. Acharya et al.’s [5] introduction of the deep convolutional neural network (DCNN) model with a dropout function marked a turning point for Devanagari alphabet recognition algorithms. With the noble purpose of advancing relevant research, the authors generated a benchmark dataset of isolated handwritten Devanagari characters and made it freely accessible to the public. The effect of adding more layers to convolutional neural networks (CNN) on the recognition of Devanagari alphabets was studied by Chakraborty et al. [6]. A hybrid CNN and bidirectional long short-term memory (BLSTM) model was also tried; however, it fell short of the performance of the standard CNN model. AlexNet, a pre-trained DCNN model, was used by Sonawane et al. [7] to present the transfer-learning method for identifying Devanagari characters. Aneja et al. [8] provided a thorough comparative analysis based on pre-trained DCNN models, including AlexNet, DenseNet-121, DenseNet-201, VGG-11, VGG-16, VGG-19, and Inception-V3, for the identification of Devanagari alphabets. Trivedi et al. [9] implemented a genetic algorithm and the L-BFGS optimization method to train a CNN, addressing the concerns of getting stuck in local optima and requiring a large number of iterations. Their evolutionary technique achieved a higher recognition rate for handwritten Devanagari numerals. Kumar et al. [10] introduced a convolutional autoencoder based on unsupervised learning to extract reduced-size features from augmented numeral images of Devanagari, English, and Bangla scripts. A deep convolutional network was employed for the final classification using these features. Chaurasia et al. [11] employed a CNN as a feature extractor to obtain salient features from handwritten numeral images of various Indian scripts. The authors employed an SVM classifier to avail themselves of the benefit of structural risk minimization. Sarkhel et al. [12] developed a state-of-the-art multicolumn, multi-scale CNN architecture for capturing important features from images of handwritten characters related to several Indian scripts. An SVM classifier was employed for the classification task.

Some recent studies presented benchmark approaches to solving similar problems. Rakshit et al. [13] produced a comparative study of 11 different CNN models, namely, DenseNet-201, MobileNetV2, VGG-19, EfficientNetB0, NASNetMobile, Xception, Inception ResnetV2, ResNet50, EkushNet, InceptionV3, and ResNet152V2, in the recognition of handwritten Bangla characters. ResNet152V2 was the top performer. Garg et al. [14] examined k-NN and SVM classifiers with linear, polynomial, and radial basis function (RBF) kernels in machine-based recognition of Gurumukhi characters. Peak extent and modified division point-based features were crafted for the purpose. In their later study [15], the authors presented a multifeature, multi-classifier approach for solving the problem of recognizing Gurumukhi script from degraded images. The authors employed zoning, diagonal, shadow, and peak extent-based features on k-NN, decision tree, and RF classifiers. Kathigi et al. [16] developed a skewed line segmentation technique to separate individual Kannada characters. Steerable pyramid and discrete wavelet transforms were implemented to extract salient features. The classification was performed with LSTM using combined features. Narang et al. [17] employed CNN for feature extraction as well as for classification in the recognition of ancient characters in Devanagari script. The authors experimented with the CNN architecture by varying the counts of layers and filters, the size of stride and kernel, and the activation functions in search of the best combination. To avoid manual feature engineering in the recognition of handwritten Urdu characters, Mushtaq et al. [18] developed a CNN model that outperformed the model based on handcrafted features. Robert Raj et al. [19] developed a recognition model for handling the problems of discontinuity, overlooping, and unnecessary portions present in the structure of Tamil characters. The authors introduced a junction point elimination algorithm that outperformed conventional feature selection and pre-extraction algorithms. Deore et al. [20] fine-tuned the popular deep convolutional neural network model VGG16 with advanced adaptive gradients to recognize handwritten Devanagari characters. Moudgil et al. [21] developed a convolution-based capsule network that captures spatial relationships among local features and reduces the vector length for effective classification of Devanagari characters. Guo et al. [22] proposed a solution for the recognition of similar-shaped Tai Le characters. The authors estimated the second- and third-level wavelet transforms for given character images and converted them into wavelet deep convolution features. Linear discriminant and principal component analysis were applied to limit the feature dimensionality. The classification model included six deep, variationally sparse Gaussian processes for efficient recognition. It has been observed that deep learning techniques are replacing conventional feature extraction and classification techniques in this field in order to attain improved recognition accuracy [23].

It could be observed that the deep learning-based models achieved a significant recognition rate without the need for handcrafted features. The only concern is their large feature vectors, which may increase the classification cost. Optimizing the size of the feature vector can lead to a low-classification-cost solution [24] in the following terms.

  • Training time: A smaller feature vector typically implies fewer features that need to be processed and used to train a classifier. The computational complexity of training algorithms may scale with the number of features, leading to shorter training times for reduced feature sizes.

  • Memory usage: A smaller feature vector requires less memory to store the feature values during training and classification processes. This can lead to reduced memory usage, which can provide cost-effectiveness if there are limitations on the available memory resources.

  • Computational complexity: The computational complexity of the classification algorithm (SVM in the present case) can be influenced by the size of the feature vector. The computational complexity of SVM training and classification depends on the number of support vectors, which are the data points nearest to the decision boundary. Reducing the number of features decreases the dimensionality of the problem and makes it computationally less expensive to find the support vectors. The number of kernel evaluations required during training and classification also decreases, leading to faster execution.

Motivation

The state-of-the-art models could be categorized into two classes: (1) models that adopted a machine learning approach and (2) models that employed deep convolutional neural networks.

Machine learning typically involves the use of statistical models that are trained on labeled data to make predictions or decisions. Machine learning models are often simpler and more interpretable. These models can often be trained on smaller datasets with fewer parameters. The model’s success significantly depends on the handcrafted features that are extracted from the data. Important concerns about handcrafted features are time consumption [25], the requirement of domain expertise and careful feature engineering [26], bias due to the designer’s prior assumptions that may not capture all relevant information in the data, and limited scalability, generalization, and reproducibility due to problem-specific design. This can limit the effectiveness of the model and lead to suboptimal performance.

The deep convolutional neural networks address these concerns through their potential to auto-generate features from raw images. These networks are well known for producing human-like performance in the field of pattern recognition. The main issues related to the implementation of these networks are the requirements of large datasets, millions of trainable parameters, and high computational complexity, which can restrict their deployment on low-end hardware platforms such as embedded systems, Raspberry Pi, field programmable gate arrays (FPGA), and cell phones.

The pros and cons of the abovementioned approaches motivated the development of a recognition model that bridges the gap between them and draws the optimum advantages of both.

Contribution

In the proposed work, the network architecture of benchmark DCNN models VGG-16Net, VGG-19Net, ResNet-50, and Inception-v3 was modified as a feature extractor to exploit their auto-generative feature capabilities. The classical PCA method was adopted for optimizing the size of feature vectors received from individual models. The optimized feature vectors were fused together in a strategic manner to obtain the maximum recognition rate from the benchmark SVM classifier. The suggested model provided a low-classification-cost solution to the proposed problem in terms of feature vector size.

Preliminary

An overview of the techniques used in the presented work is given in the following subsections.

VGG-16Net

This is a convolutional neural network architecture developed by the Visual Geometry Group (VGG) at Oxford University. It is named after the fact that it consists of 16 layers, which include convolutional layers, pooling layers, and fully connected layers [27]. The VGG-16Net architecture was designed for image recognition and classification tasks and achieved state-of-the-art performance on the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2014. The network consists of a series of 3 × 3 convolutional layers, each followed by a rectified linear unit (RELU) activation function and a 2 × 2 max pooling layer. The final layers of the network consist of fully connected layers that perform the classification task. VGG-16Net is a deep neural network that has 138 million parameters and requires significant computational resources to train. However, pre-trained versions of the network are available and can be used for transfer learning, which allows for faster training on new image recognition tasks.

VGG-19Net

VGG-19Net is a convolutional neural network architecture with 19 layers [27]. It was developed by the Visual Geometry Group (VGG) at the University of Oxford and achieved state-of-the-art performance in ILSVRC-2014. The architecture of VGG-19Net is similar to that of VGG-16Net but with the addition of three extra 3 × 3 convolutional layers. VGG-19Net has 143 million parameters and requires significant computational resources to train. However, pre-trained versions of the network are available and can be used for transfer learning, which allows for faster training on new image recognition tasks.

ResNet-50

ResNet-50 is a deep convolutional neural network architecture that was introduced by Microsoft Research in 2015. The name “ResNet” comes from “residual network,” which refers to the use of residual connections, or skip connections, which allow information to bypass certain layers in the network. This helps mitigate the vanishing gradient problem, which can occur when training very deep neural networks [28]. ResNet-50 consists of 50 layers and is used primarily for image recognition tasks such as object detection and classification. The architecture of ResNet-50 is based on the building blocks known as residual blocks, which consist of two convolutional layers and a skip connection. The skip connection allows the input to be added directly to the output of the residual block, which helps preserve information and gradients through the network.

Inception-v3

Inception-v3 is a convolutional neural network architecture that was introduced by researchers at Google in 2015 [29]. The architecture of Inception-v3 is based on the use of “inception modules,” which consist of several parallel convolutional layers with different filter sizes. This allows the network to capture features at multiple scales and helps reduce the computational cost of the network. Inception-v3 also uses a technique called “factorization,” which decomposes large convolutions into smaller convolutions. This helps reduce the number of parameters in the network and improve its computational efficiency. Inception-v3 also includes other features such as batch normalization and dropout regularization, which enhance the generalization performance of the network.

Principal component analysis

It is a statistical method that can be used to reduce the dimensionality of a dataset by projecting the original data onto a lower-dimensional subspace defined by the principal components. This projection preserves as much of the original variability as possible while reducing the number of dimensions needed to represent the data [30]. PCA has several applications, including data compression, feature extraction, and the visualization of high-dimensional data. It is also commonly used as a preprocessing step for other machine learning algorithms to reduce the number of features and improve the accuracy of the model. The steps involved in the estimation of the principal components are described as follows:

Let X be a data matrix of dimension N × F, where N is the number of samples and F is the number of features.

  1.

    Standardization of X:

    $$Z= \left(X-\mu \right)/\sigma$$
    (1)

where Z is the standardized data matrix, μ is the mean vector of X, and σ is the standard deviation vector of X. This transforms each feature of X to have zero mean and unit variance, which ensures that all features are on the same scale and have equal importance in the analysis.

  2.

    Calculation of the covariance matrix related to standardized data:

    $$S= \left(1/N\right) \times {Z}^{T}\times Z$$
    (2)

    where S is the covariance matrix and \({Z}^{T}\) is the transpose of Z.

  3.

    Determination of the eigenvectors and eigenvalues of the covariance matrix:

    $$S \times V=\lambda \times V$$
    (3)

    where V and λ represent the eigenvectors and eigenvalues, respectively, and can be denoted as follows:

    $$V=\left[{V}_{1}, {V}_{2}, {V}_{3},\dots , {V}_{F}\right]$$

    $$\lambda =\left[{\lambda }_{1}, {\lambda }_{2}, {\lambda }_{3},\dots , {\lambda }_{F}\right]$$

The eigenvectors represent the principal components, and the eigenvalues represent the variance explained by each principal component.

  4.

    Calculation of principal components:

    $$PC=Z \times V$$
    (4)

where PC represents the principal components.
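For concreteness, the four steps above can be condensed into a short NumPy sketch. This is a minimal illustration of Eqs. (1)–(4), not the exact implementation used in the study; the function name pca_reduce and the choice to retain only the top components are ours.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Reduce an N x F data matrix X to N x n_components via PCA (Eqs. 1-4)."""
    # Eq. (1): standardize each feature to zero mean and unit variance
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    Z = (X - mu) / (sigma + 1e-12)  # small epsilon guards constant features

    # Eq. (2): covariance matrix of the standardized data
    S = (Z.T @ Z) / Z.shape[0]

    # Eq. (3): eigen-decomposition (eigh suits a symmetric covariance matrix)
    eigvals, eigvecs = np.linalg.eigh(S)

    # Keep the eigenvectors with the largest eigenvalues,
    # i.e., the components explaining the most variance
    order = np.argsort(eigvals)[::-1][:n_components]
    V = eigvecs[:, order]

    # Eq. (4): project the standardized data onto the principal components
    return Z @ V
```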

Support vector machine

It is a popular and powerful machine learning algorithm used for classification and regression analysis. The basic idea behind an SVM is to find the hyperplane that best separates the data points of different classes. The hyperplane is chosen so that it maximizes the margin, which is the distance between the hyperplane and the closest data points in each class. The data points closest to the hyperplane are called support vectors. SVMs can handle both linearly separable and nonlinearly separable data by using different types of kernels. A kernel function transforms the original data into a higher-dimensional feature space, where it may become linearly separable. Some commonly used kernel functions include the linear, the polynomial, and the RBF kernels. In addition to binary classification, SVMs can be extended to handle multiclass classification problems by using techniques such as one-vs-all and one-vs-one [31]. SVMs have several advantages over other classification algorithms, including their ability to handle high-dimensional data, their robustness to overfitting, and their effectiveness even with small datasets. In the proposed work, an SVM classifier was employed with the one-versus-all technique and an RBF kernel. The classification cost of a one-versus-all SVM classifier can be calculated as follows:

Let “m” be the number of classes and “n” be the number of training samples. Let “d” be the dimensionality of the feature vector. During training, the one-versus-all SVM classifier trains m separate binary SVM classifiers, one for each class. Each binary SVM classifier is trained on a subset of the training data that consists of the samples from one class and the samples from all other classes. Let C be the regularization parameter of the SVM, and let “k” be the kernel function used by the SVM. The training complexity of the one-versus-all SVM classifier can be expressed as follows:

$$O\left(m \times {n}^{2} \times d\right) \times \left[\text{complexity of the kernel function } k\right]$$
(5)

During classification, the one-versus-all SVM classifier applies each of the m binary SVM classifiers to the test sample and selects the class with the highest score. Let “t” be the number of test samples. The classification complexity of the one-versus-all SVM classifier can be expressed as follows:

$$O\left(m \times t \times d\right) \times \left[\text{complexity of the kernel function } k\right]$$
(6)

The complexity of the RBF kernel (k) used in an SVM classifier depends on the number of training samples and the dimensionality of the feature vector. The RBF kernel function is defined as follows:

$$k\left(x, {x}'\right) = \exp\left(-\gamma \times {\left\| x-{x}'\right\|}^{2}\right)$$
(7)

where x and x′ are two feature vectors, ‖x − x′‖ is the Euclidean distance between them, and γ is a parameter that determines the width of the kernel. The complexity of the RBF kernel function can be calculated as follows:

For a single evaluation of the kernel function, the time complexity is O(d), since we need to compute the Euclidean distance between the two feature vectors. To evaluate the kernel function for all pairs of training samples, the complexity is as follows:

$$O \left({n}^{2} \times d\right)$$
(8)

since there are \({n}^{2}\) pairs of training samples and the kernel function must be computed for each pair. From Eqs. (5), (6), and (8), it is evident that the various complexities of the SVM classifier depend directly on the dimensionality (d) of the feature vector. This suggests that optimizing feature vectors in terms of dimensionality (size) would reduce the classification cost. The same holds for other classifiers.
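The following sketch shows one way to realize the one-versus-all RBF-kernel SVM described above with scikit-learn; the toy data and hyperparameter values are illustrative assumptions, not the study's settings.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Toy data standing in for numeral feature vectors: n training samples,
# t test samples, d features, and m = 10 numeral classes.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(750, 80)), rng.integers(0, 10, 750)
X_test = rng.normal(size=(250, 80))

# m binary RBF-kernel SVMs, one per class (one-versus-all)
ovr_svm = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))
ovr_svm.fit(X_train, y_train)     # training cost scales as Eq. (5)
y_pred = ovr_svm.predict(X_test)  # classification cost scales as Eq. (6)
```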

Methods

The complete overview of the proposed scheme is depicted in Fig. 2.

Fig. 2 Design of the proposed model

Input dataset

The input dataset is compiled from a public repository [5]. The dataset has accurate labelling for each handwritten numeral in Hindi script. The dataset exhibits a wide range of variations in writing styles, size, slant, stroke thickness, etc. that are commonly encountered in real-world scenarios. It has a balanced distribution of numerals across different classes, which can ensure bias-free training. The dataset has a satisfactory sample count of 20,000. All these reasons make it a suitable choice for the proposed work.

Dataset preprocessing

The pretrained DCNN models have specific input size requirements. In the presented work, the images in the input dataset were resized to match the input size expected by the individual models. Details about the required input image size for the proposed DCNN models are provided in Table 1.

Table 1 Input size requirements of DCNN models

The resized images for VGG-16Net, VGG-19Net, ResNet-50, and Inception-v3 were represented by S1, S2, S3, and S4, respectively, in Fig. 2.
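A minimal sketch of this resizing step is given below; it assumes the standard input sizes of the pretrained models (224 × 224 for the VGG and ResNet models and 299 × 299 for Inception-v3, per Table 1), and the helper name resize_for_models is ours.

```python
from tensorflow.keras.preprocessing.image import img_to_array, load_img

# Assumed target sizes per model (keys follow the S1..S4 naming of Fig. 2)
SIZES = {"S1": 224,   # VGG-16Net
         "S2": 224,   # VGG-19Net
         "S3": 224,   # ResNet-50
         "S4": 299}   # Inception-v3

def resize_for_models(image_path):
    """Return the four resized copies S1..S4 of one numeral image."""
    return {key: img_to_array(load_img(image_path, target_size=(s, s)))
            for key, s in SIZES.items()}
```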

Feature extraction

The architecture of the individual models was modified for the purpose of feature extraction. The classification block of a DCNN model typically consists of fully connected layers with a large number of parameters (of the order of millions). These layers are responsible for mapping the extracted features to 1000 class labels, as the classification blocks of the individual models were originally designed to solve the classification problem of the ImageNet dataset with 1000 object classes. Since the objective of the proposed strategy is to exploit the auto-generative feature capability of pretrained DCNN models, their classification blocks are of no use. In the modified architecture, the classification blocks were removed completely to eliminate the computational burden and memory requirements associated with the fully connected layers. The remaining convolutional layers in the modified architecture were frozen (excluded from further training) in order to take advantage of transfer learning. Arrangements were made to collect the features after the final convolutional layer of each model, setting the individual models as feature extractors. The simplified architectures of the modified networks are shown in Fig. 3. The resized images (S1, S2, S3, S4) were applied to the respective modified DCNN architectures: VGG-16Net, VGG-19Net, ResNet-50, and Inception-v3. The sizes of the corresponding feature vectors derived for a given digit image were 4096, 4096, 2048, and 2048, respectively, represented by F1, F2, F3, and F4 in the design (refer to Fig. 2). The process of feature extraction is depicted in Algorithm 1.

Fig. 3 Modified architecture of DCNN models as feature extractors. a VGG-16Net. b VGG-19Net. c ResNet-50. d Inception-v3


Algorithm 1. Algorithm for feature extraction
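As a rough code-level counterpart of Algorithm 1, the sketch below sets up one backbone (ResNet-50) as a frozen feature extractor; the other three models follow the same pattern. It assumes Keras pretrained weights and a list images_s3 of preprocessed 224 × 224 × 3 arrays, and it reproduces the 2048-d ResNet-50 case by tapping the globally pooled output of the final convolutional stage (the exact layer tapped determines the feature size, so the 4096-d vectors reported for the VGG models would come from a different tap point).

```python
import numpy as np
from tensorflow.keras.applications import ResNet50

# include_top=False drops the ImageNet classification block entirely;
# pooling="avg" yields one 2048-d vector per image from the last conv stage.
base = ResNet50(weights="imagenet", include_top=False,
                pooling="avg", input_shape=(224, 224, 3))
base.trainable = False  # freeze the convolutional layers (transfer learning)

# images_s3: assumed list of preprocessed 224x224x3 numeral images (S3)
F3 = base.predict(np.stack(images_s3))  # shape: (num_images, 2048)
```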

Feature optimization

This stage included the feature reduction and feature fusion steps of the proposed methodology. The background details of the numeral images were almost identical and did not carry any pattern-related information (refer to Fig. 1). This suggested the possibility of redundant information in the individual feature types (F1 to F4). Principal component analysis (refer to the “Principal component analysis” section) was applied to the individual feature types F1 to F4 to eliminate feature collinearity. A trial-and-error technique was used to identify the optimum number of principal components. First, 10 PCA components were estimated from the separate feature vectors (i.e., F1, F2, F3, and F4). These components were combined to form one vector. A sample dataset of 500 such fused feature vectors (50 samples from each numeral class) was made for the specified purpose. The sample dataset was used to train and test the proposed classifier. The procedure was repeated while stepping up the principal component count from 10 to 40 in increments of 2. The recognition accuracy increased greatly between 10 and 20 components, but no further significant increases were seen. This suggests that 20 components is the optimal count. The various feature vectors (F1, F2, F3, and F4) were accordingly reduced to dimension 20, and the resulting reduced feature vectors were denoted R1, R2, R3, and R4, respectively (refer to Fig. 2). The reduced features R1 to R4 were concatenated into a single feature vector Z. The frame format of the fused feature vector Z is shown in Fig. 4. The size of the proposed optimized feature vector thus becomes 80. The vector Z was estimated for all the numeral images in the input dataset. The reduced feature vectors R1 to R4 and the fused feature vector Z were used to create five new datasets. The process of feature optimization is depicted in Algorithm 2.

Fig. 4 Frame format of fused feature vector Z


Algorithm 2. Algorithm for feature optimization
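A compact sketch of Algorithm 2, assuming the per-model feature matrices F1 to F4 (of widths 4096, 4096, 2048, and 2048) are already available as NumPy arrays:

```python
import numpy as np
from sklearn.decomposition import PCA

N_COMPONENTS = 20  # optimal count found by the trial-and-error search above

# Reduce each feature type to 20 principal components (R1..R4) ...
reduced = [PCA(n_components=N_COMPONENTS).fit_transform(F)
           for F in (F1, F2, F3, F4)]

# ... and concatenate them into the fused feature vector Z (width 80)
Z = np.concatenate(reduced, axis=1)
```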

Numeral recognition

Besides the input dataset, five new datasets have been created up to this stage. The details are given in Table 2. Datasets D1 to D4 were created from the features received from VGG-16Net, VGG-19Net, ResNet-50, and Inception-v3, respectively, after feature optimization. Dataset D5 was created by concatenating the features related to datasets D1 to D4. The individual datasets were split into train and test sets in a ratio of 75:25. The SVM classifier was trained and tested with individual datasets. Details of the hyperparameters used during the classifier learning are given in Table 3. The results were recorded in terms of precision, recall, F1 score, and recognition accuracy. The formulations used for the calculation of the metrics are given in Table 4. Here, TP, TN, FP, and FN represent true-positive, true-negative, false-positive, and false-negative events during the testing phase of the proposed classifier. A comprehensive result analysis is provided in the next section.

Table 2 Summary of newly created datasets
Table 3 Major parameters of SVM classifier
Table 4 Details of performance metrics used in the proposed study
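A minimal sketch of this recognition stage for one dataset (D5, i.e., the fused vectors Z with labels y) is shown below; the hyperparameter values are illustrative stand-ins, not the exact settings of Table 3.

```python
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# 75:25 train/test split, stratified to preserve the balanced class counts
X_train, X_test, y_train, y_test = train_test_split(
    Z, y, test_size=0.25, stratify=y, random_state=42)

# One-versus-all SVM with an RBF kernel, as described above
svm = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)

print(accuracy_score(y_test, y_pred))         # recognition accuracy
print(classification_report(y_test, y_pred))  # precision, recall, F1 score
```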

Results

Results obtained from the various datasets are compiled in Tables 5, 6, 7, 8, 9 and 10. Confusion matrices were estimated from the classification reports given in Tables 5, 6, 7, 8, 9 and 10 to make the model’s performance more readable. The consolidated results are compiled in Table 11. The model achieved the highest recognition accuracy of 99.72% with the proposed fusion-based feature scheme (dataset D5).

Table 5 Result obtained from input dataset (images)
Table 6 Result obtained from dataset D1 (VGG-16Net)
Table 7 Result obtained from dataset D2 (VGG-19Net)
Table 8 Result obtained from dataset D3 (ResNet-50)
Table 9 Result obtained from dataset D4 (Inception-v3)
Table 10 Result obtained from dataset D5 (proposed fusion-based features)
Table 11 Consolidated results obtained from various datasets used in the proposed study

Feature separation

Arrangements were made to visualize the separation between the features related to the various numeral classes in the input dataset and the proposed feature scheme (dataset D5) by using the t-SNE (t-distributed stochastic neighbor embedding) algorithm. t-SNE is a popular dimensionality reduction algorithm used for visualizing high-dimensional data in a low-dimensional space while preserving the structure of the original data as much as possible. With the help of a Gaussian kernel, t-SNE computes a similarity score for each data point with every other data point based on Euclidean distance. The similarity scores are used to compute probability distributions for both the high-dimensional and low-dimensional spaces. The goal of t-SNE is to minimize the divergence between the probability distributions in the high-dimensional and low-dimensional spaces. The algorithm does this by adjusting the positions of the data points in the low-dimensional space so that the probability distributions match as closely as possible.
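A sketch of this visualization step, assuming the fused feature matrix Z and class labels y from the earlier stages; the perplexity value is an illustrative default, not a reported setting.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Embed the 80-d fused vectors into 2-D while preserving local structure
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=42).fit_transform(Z)

plt.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap="tab10", s=4)
plt.colorbar(label="numeral class")
plt.show()
```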

A significant separation between the features related to different numeral classes could be observed in Fig. 6b, which was derived from the proposed feature scheme, in comparison to Fig. 6a, which was derived from the raw images of the input dataset. The greater the separation between the features, the easier their classification.

The results of various benchmark models, along with the proposed one, are compiled in Table 12. It should be noted that there is no standard dataset of handwritten Hindi numerals in the public domain, and the results of the benchmark models mentioned in Table 12 were based on different datasets.

Table 12 Results of benchmark models along with proposed one

The proposed model produced a recognition rate comparable to the benchmark models, and it did so with a smaller feature vector and a larger number of test samples. The smaller the feature vector, the lower the training and classification complexities (refer to the “Support vector machine” section).

Discussion

Figure 5 demonstrates the efficiency of the proposed scheme. The confusion matrix in Fig. 5a was derived when the classifier was tested with the input dataset (i.e., the numeral images directly). A higher degree of confusion could be observed between numeral classes 2–3, 4–5, and 6–7; also, a significant count of false-negative (FN) predictions was recorded for numeral classes 5 and 7. All these regions of the confusion matrix are encircled in red. The confusion matrices in Fig. 5b to e were derived when the classifier was tested with datasets D1 to D4. Clear improvements could be observed in the encircled regions of the respective matrices with respect to Fig. 5a. This was also reflected in the recognition accuracy achieved with these datasets (refer to Table 11). The confusion matrix in Fig. 5f shows tremendous improvements over Fig. 5a and the rest. This matrix was derived by testing the classifier with the proposed fusion-based feature scheme (dataset D5). The matrix has minimal confusion. This suggests the potential of the proposed scheme in selecting prominent features related to different numeral classes that could be helpful in their precise recognition by the given machine learning algorithm.

Fig. 5 Confusion matrices related to a input dataset, b dataset D1, c dataset D2, d dataset D3, e dataset D4, and f dataset D5 (proposed dataset)

Figure 6 demonstrates the effectiveness of the proposed scheme in selecting distinct features related to the various numeral classes. The proposed feature optimization resulted in a good separation between the features related to the various numeral classes in the feature space, which contributed to achieving a recognition rate comparable to the benchmark models.

Fig. 6 Separation between the features related to various numeral classes in a input dataset and b proposed feature scheme (dataset D5)

Referring to Table 12, the proposed model achieved recognition accuracy comparable to the benchmark models while considering fewer features, which suggests its potential for solving the given problem at a low classification cost.

Conclusions

Most of the benchmark models relied on either a machine learning or a deep learning approach. The former is simpler and more interpretable; it can be trained with small datasets and fewer parameters, but the need for manual feature engineering limits its performance. On the other hand, deep learning methods can auto-generate the salient features, but they need large datasets and millions of trainable parameters to produce excellent results. The proposed study presented an effective ensemble of these state-of-the-art approaches. The benchmark DCNN models VGG-16Net, VGG-19Net, ResNet-50, and Inception-v3 were employed as feature extractors that produced large feature vectors. The size of the feature vectors was optimized by careful implementation of the classical PCA method, which led to a low-classification-cost solution to the proposed problem. The optimized features were fused together in a systematic manner and used to train the benchmark SVM classifier. The proposed model successfully achieved results comparable to the benchmark models with a smaller feature vector; the smaller the feature vector, the lower the training and classification complexities. Although medical imaging and related pattern recognition problems are not within the scope of the current study, we are hopeful that the proposed fusion-based feature scheme would also be helpful in solving these kinds of problems effectively.

Availability of data and materials

The dataset used in the study is available at https://www.kaggle.com/datasets/ashokpant/devanagari-character-dataset-large

Abbreviations

PCA:

Principal component analysis

SVM:

Support vector machine

HOG:

Histogram of gradient

FBNN:

Feedforward backpropagation neural network

ELM:

Extreme learning machine

MLP:

Multilayer perceptron

NB:

Naïve Bayes

RF:

Random forest

DCNN:

Deep convolutional neural networks

CNN:

Convolutional neural network

BLSTM:

Bidirectional long short-term memory

L-BFGS:

Limited memory-Broyden–Fletcher–Goldfarb–Shanno

k-NN:

K-nearest neighbor

RBF:

Radial basis function

LSTM:

Long short-term memory

FPGA:

Field-programmable gate array

VGG:

Visual Geometry Group

ILSVRC:

ImageNet Large-Scale Visual Recognition Challenge

RELU:

Rectified linear unit

TP:

True positive

TN:

True negative

FP:

False positive

FN:

False negative

t-SNE:

T-distributed Stochastic Neighbor Embedding

References

  1. Das N et al (2012) A statistical-topological feature combination for recognition of handwritten numerals. Applied Soft Computing Journal 12(8):2486–2495


  2. Iamsa-At S, Horata P (2013) Handwritten character recognition using histograms of oriented gradient features in deep learning of artificial neural network. International Conference on IT Convergence and Security, ICITCS-2013 1:1–5


  3. Khanduja D, Nain N, Panwar S (2015) A hybrid feature extraction algorithm for Devanagari script. ACM Transactions on Asian and Low-Resource Language Information Processing 15(1):1–11


  4. Singh PK, Das S, Sarkar R, Nasipuri M (2017) Recognition of offline handwritten Devanagari numerals using regional weighted run length features. International Conference on Computer, Electrical and Communication Engineering, ICCECE-2016 1:1–6

  5. Acharya S, Pant AK, Gyawali PK (2015) Deep learning based large scale handwritten Devanagari character recognition. 9th International Conference on Software, Knowledge, Information Management and Applications, ICSKIMA-2015 9:1–6

  6. Chakraborty B, Shaw B, Aich J, Bhattacharya U, Parui SK (2018) Does deeper network lead to better accuracy: a case study on handwritten Devanagari characters. Proceedings - 13th International Workshop on Document Analysis Systems, DAS-2018 13:411–416

  7. Sonawane PK, Shelke S (2018) Handwritten Devanagari character classification using deep learning. International Conference on Information, Communication, Engineering and Technology, ICICET-2018 1:1–4

  8. Aneja N, Aneja S (2019) Transfer learning using CNN for handwritten Devanagari character recognition. 1st IEEE International Conference on Advances in Information Technology, ICAIT-2019 1:293–296

  9. Trivedi A, Srivastava S, Mishra A, Shukla A, Tiwari R (2018) Hybrid evolutionary approach for Devanagari handwritten numeral recognition using convolutional neural network. Procedia Computer Science 125:525–532


  10. Kumar S, Aggarwal RK (2018) Augmented handwritten Devanagari digit recognition using convolutional autoencoder. International Conference on Inventive Research in Computing Applications, ICIRCA-2018:574–580

  11. Chaurasia S, Agarwal S (2018) Recognition of handwritten numerals of various Indian regional languages using deep learning. 5th IEEE Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering, UPCON-2018:1–6

  12. Sarkhel R, Das N, Das A, Kundu M, Nasipuri M (2017) A multi-scale deep quad tree based feature extraction method for the recognition of isolated handwritten characters of popular Indic scripts. Pattern Recogn 71:78–93


  13. Rakshit P, Chatterjee S, Haldar C, Sen S, Obaidullah SM, Roy K (2022) Comparative study on the performance of the state-of-the-art CNN models for handwritten Bangla character recognition. Multimedia Tools and Applications 82(7):1–22


  14. Garg A, Jindal MK, Singh A (2019) Offline handwritten Gurmukhi character recognition: k-NN vs. SVM classifier. Int J Inf Technol 13:2389–2396


  15. Garg A, Jindal MK, Singh A (2019) Degraded offline handwritten Gurmukhi character recognition: study of various features and classifiers. Int J Inf Technol 14:145–153


  16. Kathigi A, HonnamachanahalliKariputtaiah K (2022) Handwritten character recognition using skewed line segmentation method and long short term memory network. Int J Syst Assur Eng Manage 13(4):1733–1745


  17. Narang SR, Jindal MK, Ahuja S, Kumar M (2020) On the recognition of Devanagari ancient handwritten characters using SIFT and Gabor features. Soft Computing (published online):1–11

  18. Mushtaq F, Misgar MM, Kumar M, Khurana SS (2021) UrduDeepNet: offline handwritten Urdu character recognition using deep neural network. Neural Comput Appl 33:15229–15252


  19. Raj MAR, Abirami S (2020) Junction point elimination based Tamil handwritten character recognition: an experimental analysis. J Syst Sci Syst Eng 29(1):100–123


  20. Deore SP, Pravin A (2020) Devanagari handwritten character recognition using fine-tuned deep convolutional neural network on trivial dataset. Sadhana - Acad Proc Eng Sci 45(1):1–13


  21. Moudgil A, Singh S, Gautam V, Rani S, Shah SH (2023) Handwritten Devanagari manuscript characters recognition using CapsNet. Int J Cogn Comput Eng 4:47–54


  22. Guo H, Liu Y, Zhao J, Song Y (2023) Offline handwritten Tai Le character recognition using wavelet deep convolution features and ensemble deep variationally sparse Gaussian processes. Soft Computing

  23. Singh S, Garg N, Kumar M (2022) Feature extraction and classification techniques for handwritten Devanagari text recognition: a survey. Multimed Tools Appl 82:747–775


  24. Jia W, Sun M, Lian J, Hou S (2022) Feature dimensionality reduction: a review. Complex Intell Syst 8(3):2663–2693


  25. Janiesch C, Heinrich K (2021) Machine learning and deep learning. Electronic Markets 31:685–695

  26. Lecun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444


  27. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. 3rd International Conference on Learning Representations, ICLR-2015:1–14

  28. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2016-Dec:770–778


  29. Szegedy C et al (2015) Going deeper with convolutions. IEEE Conference on Computer Vision and Pattern Recognition, CVPR-2015 24:1–9

  30. Markos A, Tuzhilina E (2022) Principal component analysis. Nature Reviews Methods Primers 2

  31. Awad M, Khanna R (2015) Support Vector Machines for Classification. In: Efficient Learning Machines, vol 1. Apress, Berkeley, p 39–66


Acknowledgements

We express our deep gratitude to Google Colaboratory for providing a hassle-free Python platform with the power of a graphics processing unit and vast Python library support, without which it would not have been easy to complete the proposed work. We are grateful to Acharya, Pant, and Gyawali for their efforts in developing the dataset of handwritten Devanagari characters and providing it in the public domain for progressive research in the related field.

Funding

The proposed research does not involve any type of funding.

Author information

Authors and Affiliations

Authors

Contributions

All authors contributed equally to the proposed research. All authors have read and approved the manuscript.

Corresponding author

Correspondence to Danveer Rajpal.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Rajpal, D., Garg, A.R. Ensemble of deep learning and machine learning approach for classification of handwritten Hindi numerals. J. Eng. Appl. Sci. 70, 81 (2023). https://doi.org/10.1186/s44147-023-00252-2
