 Reviews
 Open access
Exploring topological data analysis for information extraction: application to recognition of Arabic machine-printed numerals
Journal of Engineering and Applied Science volume 71, Article number: 16 (2024)
Abstract
This manuscript explores the capability of topological data analysis (TDA) based on homology theory (HT: a subfield of algebraic topology) to extract relevant information for the recognition of confusing Arabic machine-printed numerals. Topological properties may significantly reduce the confusion between some numerals such as “1” and “4” in the context of small data sets: digit 1 has no hole, whereas digit 4 has one hole. Our contribution consists of evaluating the usefulness of TDA, with its invariant descriptors such as Betti numbers, in machine-printed Arabic numeral recognition. Our investigation proceeds in two steps: (i) we extract the Betti-number invariant features of each numeral image and partition the ten numerals into three clusters with respect to these features; (ii) we then classify a test image by assigning it to its corresponding cluster and mapping it to a numeral using dynamic time warping as a metric defined in the Freeman chain-code space. We compare the proposed approach with major state-of-the-art methods that use TDA in various ways for character recognition. The advantages and limitations of TDA are then discussed based on the numeral recognition results.
Introduction
Character recognition is divided into two categories: online and offline. Online character recognition attempts to recognize characters while they are being written by an individual, whereas offline character recognition deals with characters that have been optically scanned by a machine, whether they are handwritten or machine-printed (refer to Fig. 1).
Research in recognition of characters is grounded in the document analysis literature [1]. Intensive research on character recognition has been widely reported, and most of these studies focus on the recognition of handwritten alphanumeric characters. Few studies have been devoted to the recognition of machine-printed numerals [2]. Khedidja et al. used the Hu moments, the number of holes, and the surface as features, and implemented seven classifiers for the classification of multi-font Arabic machine-printed numerals [3]. Alqudah et al. implemented a shift- and scale-invariant procedure for offline machine-printed decimal digit recognition; their approach is based on a correlation factor [4].
Dhandra et al. introduced a thinning algorithm for multi-font numeral recognition. Their feature extraction uses a directional density vector, and their classification relies on a decision tree minimum distance nearest neighbor [5]. Radha et al. proposed a new algorithm for pin-code script identification. They developed a new method for chain-code normalization and conducted recognition of numerals based on neural network and naïve Bayes classifiers [6]. Salameh et al. proposed a method that estimates the number of ends and conjunction nodes of a numeral shape and conducted a classification using fuzzy logic [7]. Trainable font embeddings [8], horizontal direction features [9], and profile vectors [10] are some of the major conventional feature extraction techniques explored in this field. These techniques are very effective at extracting discriminative clues that are relevant for a classification task. Almost all these ideas are based solely on the geometry of the strokes composing the input numeral. Moreover, the high recognition performance of these traditional techniques has been achieved using large databases.
More recent approaches that invoke deep graphical networks (deep learning) trained on large private databases have been investigated to enhance performance [10,11,12]. However, large quantities of real data are seldom publicly available. This data sparseness problem is one of the challenging problems in statistical learning. To address this issue, researchers rely on three techniques: data augmentation (DA) (including generative adversarial networks), transfer learning (TL), and collective learning (CL). However, DA based on cropping, padding, and horizontal flipping is not appropriate in several applications (such as license plate recognition and container code recognition) since the augmented data do not occur in real scenarios. Moreover, TL and CL perform optimally only if the source and destination domains of both models are similar enough; otherwise, the trained models might perform worse than expected. In this scenario, a template matching technique based on a small data set is far more adequate. Other probabilistic approaches such as hidden Markov models have also been explored; however, they require a large amount of data in order to achieve accurate recognition [13]. In general, when they are trained on large data sets, machine learning (ML) models can capture numeral pattern structures in depth, leading to extremely high performance. All these research efforts have significantly contributed to the success of many applications. Machine-printed numeral recognition is needed in many applications such as license plate recognition (LPR) [12]. Indeed, commercial LPR invokes convolutional neural networks trained on large private (not public) datasets covering thousands of vehicles [12]. In LPR, the background and the number of digits depend on the legislation imposed by the country considered. 
This variation in legislation explains the creation of many small public license plate databases around the world [12]. Therefore, in order to rely on a deep learning scheme, one has to gather very large private databases; however, these repositories are not made freely available to the scientific community, and their purchase incurs a high cost. Furthermore, machine-printed recognition has also been successful in other fields such as automated processing of bank statements [14], barcode identification [15], document restoration [16], and zip-code recognition [17]. In these applications, recognition errors are costly. For example, an error in a machine-printed US zip code is expensive since the mailpiece can be routed by the US Postal Service (USPS) to the wrong state (or city) post office. Likewise, an error in a series of digits in a bank statement is very detrimental since it induces financial losses. However, one shortcoming is common to many traditional approaches: geometrical features alone are not sufficient to optimally perceive a machine-printed numeral. It is well known that humans are capable of associating different types of elliptical shapes with the same class even if their roundnesses differ. In other words, topological features of machine-printed numerals may represent valuable clues for discriminating between confusing numerals.
To validate this conjecture, we propose to explore persistent homology theory (PHT), which optimally estimates Betti numbers, for the purpose of numeral recognition. Our ultimate goal consists of evaluating the significance of topological data analysis (TDA) in the field of Arabic machine-printed numeral recognition in a small data set setting. The advantages and limitations of TDA are underscored so that, in the end, one should be able to decide whether it is worth applying TDA in this application. It is important to underscore that some authors investigated topological data analysis for the purpose of recognizing handwritten numerals [18, 19]. Our contribution is not directed towards the assessment of recognition performance but rather discusses the significance of TDA by highlighting when TDA can or cannot be exploited. Our exploration of TDA relies on topological invariants (known as Betti numbers) during a first stage and subsequently invokes geometrical information during a second stage of classification. The first stage aims at representing the binary images of the ten numerals using their topological signatures (Betti numbers). This representation allows partitioning these images into three clusters: C_{1} = {1,2,3,5,7} contains numerals with no holes, C_{2} = {8} includes the numeral with two holes, and C_{3} = {0,4,6,9} contains numerals with one hole. The second stage performs a chain-code representation and a pattern-matching distance for classification. Once the input test image is assigned to its corresponding cluster, the eight-direction Freeman chain code of the binary image is computed and compared to the set of chain-code templates within the cluster. Finally, the dynamic time warping (DTW) metric is invoked for the overall numeral recognition procedure. 
It is noteworthy that this similarity measure between temporal sequences does not require any large-scale training and is therefore very useful when the database is small [20]. Furthermore, we have defined a diversity ratio assigned to the set of fonts used in the numeral databases to assess the robustness of TDA.
Main text
The “Problem description” subsection presents the addressed problem, while the “Overview of topological data analysis” subsection provides some background on TDA and HT. The “Exploring TDA for recognition of Arabic machine-printed numerals” subsection is devoted to the exploration of TDA for the recognition of machine-printed Arabic numerals. It introduces the topological representation and shows how the partitioning of numerals is conducted through the persistent homology (Betti numbers) principles associated with machine-printed numerals. The “Experiments and results” subsection showcases the data collection strategy, the importance of font diversity, and the numeral clustering scheme based on TDA. The “Performance comparison and assessment of the proposed approach” subsection compares our approach with some major state-of-the-art techniques and brings forth the advantages and limitations of TDA for the recognition of machine-printed Arabic numerals. Finally, the conclusion and perspectives are laid out in the “Conclusions” subsection.
Problem description
Our investigation consists of exploring topological data analysis (TDA) with its persistent homology concept and analyzing its advantages and limitations when applied to the recognition of Arabic machine-printed numerals for small data sets. The problem consists of determining the correct Betti numbers (denoted β_{i}) that should improve the performance of Arabic machine-printed numeral recognition. It is noteworthy that Betti numbers represent the core of homology theory.
Overview of topological data analysis
We now provide topological definitions deemed necessary to comprehend TDA. For a deeper understanding of this section, please refer to [21,22,23].
Simplicial complexes
Definition 1: If E = {u_{0}, u_{1},..., u_{j}} represents a set of (j + 1) points taken from an affine space, then this set is affinely independent if the j vectors u_{1} − u_{0}, u_{2} − u_{0},..., u_{j} − u_{0} are linearly independent.
Definition 2: The convex hull of (j + 1) affinely independent points is called a j-simplex s. The points of E are the vertices of the j-simplex s, and j is its dimension. It is noteworthy that simplices of dimensions 0, 1, 2, and 3 are vertices, edges, triangles, and tetrahedrons, respectively. Vertices, edges, and triangles can be formed in a two-dimensional space as well as in a three-dimensional space. However, tetrahedrons are not observed in a two-dimensional space. Other simplices such as the 5-cell (or n-cell in general) can also be formed in higher-dimensional spaces [24].
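Definitions 1 and 2 can be checked numerically. The following is a minimal sketch, assuming points are given as coordinate tuples; the helper names rank and affinely_independent are ours and are introduced for illustration only. It tests whether the j difference vectors are linearly independent by computing the rank of the matrix they form.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix (list of rows) by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    ncols = len(m[0]) if m else 0
    for c in range(ncols):
        # Look for a row at or below position r with a non-zero entry in column c.
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def affinely_independent(points):
    """True if u_1 - u_0, ..., u_j - u_0 are linearly independent (Definition 1)."""
    u0 = points[0]
    diffs = [[a - b for a, b in zip(u, u0)] for u in points[1:]]
    return rank(diffs) == len(diffs)


# Three non-collinear points of the plane span a triangle (a 2-simplex).
print(affinely_independent([(0, 0), (1, 0), (0, 1)]))
# Four points of the plane are never affinely independent: no tetrahedron in 2D.
print(affinely_independent([(0, 0), (1, 0), (0, 1), (1, 1)]))
```

The second call illustrates why tetrahedrons are not observed in a two-dimensional space: three difference vectors in the plane have rank at most two.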
Definition 3: A simplex spanned by a subset of the vertices of s is called a face of s (refer to the upper triangle of Fig. 2).
Definition 4: A finite set of simplices, S, is a simplicial complex if it satisfies the following two conditions:

∀s ∈ S, every face of s belongs to S

for every pair of simplices (s, ϕ) ∈ S × S, the intersection, s ∩ ϕ, is either an empty set or a face common to both simplices.

The highest dimension of any simplex contained in S represents its dimension.
Abstract simplicial complex
Definition 5: An abstract simplicial complex is a subcollection, B (generic set) of S, such that s’ ∈ B and ϕ’ ⊆ s’⇒ ϕ’ ∈ B. The sets s’ are defined as abstract simplices.
The dimension of an abstract simplex is defined as its cardinality (the number of its elements) minus one, and the dimension of B is the maximum dimension exhibited by any of its abstract simplices.
Vietoris-Rips simplicial complex
Several geometrical constructors such as Alpha complexes, Čech complexes, and witness complexes are available in the literature [25]. To recover the persistent homology of a space from a finite sample of points, we have invoked the Vietoris-Rips (VR) constructor [23]. We now provide a definition of this constructor.
Definition 6: If E is a finite set of points in \({\mathbb{R}}^{2}\), and d is a positive real number, then the Vietoris-Rips complex of E and d, denoted VR(E; d), consists of all abstract simplices in \(2^{E}\) (the power set of E) whose vertices are pairwise within distance d of one another. Namely, we connect by an edge any two vertices whose distance from each other is no more than d, and we add a triangle or a higher-dimensional simplex to the complex if all its edges are contained in the complex (refer to Fig. 3).
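Definition 6 translates almost directly into code. The sketch below is a brute-force illustration, not the constructor used in our experiments; the function name vietoris_rips and the max_dim parameter are assumptions made for this example. It enumerates every subset of up to max_dim + 1 vertices and keeps those whose vertices are pairwise within distance d.

```python
from itertools import combinations
from math import dist


def vietoris_rips(points, d, max_dim=2):
    """Simplices of VR(points; d), as sorted tuples of point indices, up to max_dim."""
    n = len(points)
    # Pairs of vertices that are close enough to be joined by an edge.
    close = {(i, j) for i, j in combinations(range(n), 2)
             if dist(points[i], points[j]) <= d}
    simplices = [(i,) for i in range(n)]
    for k in range(2, max_dim + 2):  # k is the number of vertices of the simplex
        for verts in combinations(range(n), k):
            # A simplex is added when all of its edges belong to the complex.
            if all(pair in close for pair in combinations(verts, 2)):
                simplices.append(verts)
    return simplices


pts = [(0, 0), (1, 0), (0, 1), (5, 5)]
print(vietoris_rips(pts, 1.0))   # two edges, no triangle
print(vietoris_rips(pts, 1.5))   # the triangle (0, 1, 2) appears
```

The enumeration is exponential in the number of points, which is acceptable for a small illustrative cloud but not for a full skeleton image.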
Homology and persistent homology
We now define the notions of filtration, pth homology group, persistent homology, death and birth, and barcodes:
Construction of series of simplicial complexes
Definition 7: A filtration on a simplicial complex S is a collection of subcomplexes {S(d) | d ∈ \({\mathbb{R}}\)} of S such that S(d_{1}) ⊂ S(d_{2}) whenever d_{1} ≤ d_{2}. The filtration value of a simplex s ∈ S is the smallest d such that s ∈ S(d). Larger simplicial complexes are built from basic ones using different values of d to form a filtration.
Computation of Betti numbers
A simplicial complex generates a group structure through the addition of psimplices.
Definition 8: The union of all faces is the boundary of the simplex, and its complement is called the interior, or the open simplex.
Definition 9: The free group obtained through this addition of psimplices is called the chain group, G_{p}.
Definition 10: The pth homology of a simplicial complex S is the quotient vector space:
H_{p}(S) = Ker(l_{p})/Im(l_{p+1}) = Z_{p}/B_{p}, in which Z_{p} is the group of all p-chains with empty boundary (the boundary of a chain being the sum of the boundaries of its simplices), and B_{p}, which is a subgroup of G_{p} and thus of Z_{p}, is known as the boundary group. The notion of Betti numbers derives from this quotient group. The function \(l_{p}\) is the linear boundary map \(l_{p}: C_{p}(S) \to C_{p-1}(S)\), which sends each p-simplex to the sum of its (p − 1)-dimensional faces.
C_{p}(S) is the \({\mathbb{F}}_{2}\) vector space (over the 2-element field) whose basis is given by the p-simplices of S.
Definition 11: The rank of H_{p} (the smallest cardinality of a generating set for H_{p}) is known as the pth Betti number, denoted β_{p}. The pth Betti number records the number of p-dimensional holes in S. When p = 0, β_{0} counts the number of path-connected components of S. If E = \({\mathbb{R}}^{3}\), β_{1} accounts for the number of independent tunnels (holes), and β_{2} tallies the number of cavities.
For example, the shape in Fig. 2a depicts a circle whose Betti numbers are (1,1). However, Fig. 2b depicts an empty cylinder whose Betti numbers (β_{0}, β_{1}, β_{2}), respectively are (1,1,0).
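To make Definitions 10 and 11 concrete, the following sketch computes Betti numbers of a small simplicial complex over \({\mathbb{F}}_{2}\) using the identity β_{p} = dim C_{p} − rank(l_{p}) − rank(l_{p+1}). It is an illustration under our own conventions (simplices as tuples of vertex labels, boundary-matrix rows encoded as integer bitmasks); the function names are hypothetical and the complex is assumed to be closed under taking faces.

```python
from itertools import combinations


def rank_gf2(rows):
    """Rank over F_2 of a matrix whose rows are encoded as integer bitmasks."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot  # lowest set bit of the pivot row
        # Cancel that bit in every remaining row (addition over F_2 is XOR).
        rows = [r ^ pivot if r & low else r for r in rows]
    return rank


def betti_numbers(simplices, max_dim):
    """Return [beta_0, ..., beta_max_dim] of a simplicial complex over F_2."""
    simplices = {tuple(sorted(s)) for s in simplices}
    by_dim = {p: sorted(s for s in simplices if len(s) == p + 1)
              for p in range(max_dim + 2)}
    index = {p: {s: i for i, s in enumerate(cells)} for p, cells in by_dim.items()}

    def boundary_rank(p):
        """Rank of the boundary map from p-chains to (p-1)-chains."""
        if p == 0 or not by_dim[p]:
            return 0
        rows = []
        for s in by_dim[p]:
            mask = 0
            for face in combinations(s, p):  # the (p-1)-dimensional faces of s
                mask ^= 1 << index[p - 1][face]
            rows.append(mask)
        return rank_gf2(rows)

    return [len(by_dim[p]) - boundary_rank(p) - boundary_rank(p + 1)
            for p in range(max_dim + 1)]


hollow_triangle = [(0,), (1,), (2,), (0, 1), (1, 2), (0, 2)]
print(betti_numbers(hollow_triangle, 1))                # [1, 1], like a circle
print(betti_numbers(hollow_triangle + [(0, 1, 2)], 1))  # [1, 0], the hole is filled
```

The hollow triangle is a combinatorial model of the circle, so it reproduces the Betti numbers (1,1) quoted above; adding the 2-simplex fills the hole.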
Persistent homology (PH)
PH aims to reveal, within a single formalism, the different scales (values of d) at which a set of points can be observed. Several simplicial complexes are therefore built in order to approximate the true shape of the cloud of points (dataset). PH seeks to determine the homology that best represents the true shape disclosed by the set of points; this optimal homology attempts to capture the qualitative, noise-free features.
Definition 12: Let S_{1} ⊂ S_{2} ⊂ … ⊂ S_{n} = S be a filtered simplicial complex. We define the pth persistent homology of S as the pair:
\(\left(\{H_{p}(S_{i})\}_{1\le i\le n},\ \{f_{ij}\}_{1\le i\le j\le n}\right)\)
where ∀(i,j) ∈ {1,…,n}, i ≤ j, the linear maps \(f_{ij}: H_{p}(S_{i}) \to H_{p}(S_{j})\) are the maps induced by the inclusion maps from S_{i} to S_{j}.
Birth and death
The concept of birth and death is essential in the computation of Betti numbers that represent the qualitative features.
Definition 13: We assert that x ∈ H_{p}(S_{i}) (x ≠ 0) is born in H_{p}(S_{i}) if it is not in the image of f_{i−1,i}. Likewise, x is said to die in H_{p}(S_{j}) if j > i is the smallest index such that f_{ij}(x) = 0.
Barcodes
Through the variation of d, Betti numbers are computed for each simplicial complex. Matching up the births and deaths (as described in the previous section) from one step to the next, we obtain a set of bars known as the barcode of the filtration. A bar corresponds to a class in one of the homology groups. Figure 3 shows barcodes obtained via a filtration on a simplicial complex based on 13 randomly spaced points. For example, for d = 1.5, the persistent homology (barcodes) reveals three connected components (β_{0} = 3) and one hole (β_{1} = 1). However, for d = 2.3, it discloses one connected component (β_{0} = 1) and one hole (β_{1} = 1). Finally, for d = 4, β_{0} = 1 and β_{1} = 0.
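The 0-dimensional part of such a barcode is simple enough to sketch by hand. In a VR filtration every point is born at d = 0, and a connected component dies when an edge first merges it with another component. The example below is an illustrative sketch, not the JAVAPLEX computation; the names h0_barcode and beta0_at are ours, and the point cloud is assumed to be non-empty. It processes the edges by increasing length with a union-find structure.

```python
from itertools import combinations
from math import dist, inf


def h0_barcode(points):
    """Bars (birth, death) of 0-dimensional persistent homology for a VR filtration."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # two components merge, so one of them dies at this scale
            parent[ri] = rj
            bars.append((0.0, length))
    bars.append((0.0, inf))  # the last component never dies
    return bars


def beta0_at(bars, d):
    """Number of bars alive at scale d, i.e. beta_0 of the complex VR(E; d)."""
    return sum(1 for birth, death in bars if birth <= d < death)


bars = h0_barcode([(0, 0), (1, 0), (5, 0)])
print(bars)
print([beta0_at(bars, d) for d in (0.5, 1.0, 4.0)])
```

Reading β_{0} off the bars at a given d is the same operation as drawing a vertical line through the barcode and counting the bars it crosses.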
Statistical interpretation of topological information
Once barcodes have been produced through PH, one needs to provide an interpretation of the computed results. The question that is addressed is how to assert that topological information extracted from a certain sample of points is compatible with the topological information derived from a null model. However, it is important to distinguish between two scenarios: (i) the set of points depicts an object with known topological properties, and (ii) the set of points describes an object with unknown topological properties.
In the first scenario, a barcode is a theoretical parameter and needs to be compared to the observed one (computed via PH). A goodness-of-fit statistical test such as the chi-square test indicates how well the observed barcode matches the theoretical one. For example, if the object is the numeral “0” (one class among 10), its barcode exhibits one connected component (β_{0} = 1) and one hole (β_{1} = 1) and represents the null model (expected value in a null hypothesis of a statistical test of significance). However, the observed barcode computed through PH using different random fonts of the same numeral “0” can be viewed as the empirical mean.
The chi-square test can therefore be applied to assert the significance of the observed barcode. In the second scenario, the topological properties of an object represented via a set of points are unknown. In this case, the null hypothesis of a statistical test can be provided by a generative model (such as many realizations from a probability distribution of barcodes). For example, a large number of barcodes (representing a population) can be generated using many fonts with different resolution levels. All these barcodes are assigned to the numeral class “0”. Topological properties (such as the Betti numbers) of the observed barcode can subsequently be compared to those present in the population using a statistical test of significance. Finally, one can underscore that the computation of PH from data is performed through the sequence of tasks (pipeline) depicted in Fig. 4.
Exploring TDA for recognition of Arabic machine printed numerals
We show in this section how TDA can be exploited to improve recognition of confusing Arabic machine-printed numerals.
Numeral image characterization via simplicial complexes
We apply homology theory to a cloud of points E forming the skeleton of a printed numeral image within a set of various fonts. This set of points, together with its neighborhood system satisfying a set of axioms, represents a topological space. The skeleton is obtained during a preprocessing phase that includes image complement, bounding box computation, scaling (or resizing) for normalization, and then skeletonization of the binary numeral image I [26]. The cloud of points is defined in a two-dimensional metric space (E; d) whereby a neighborhood system is formed using the VR constructor (refer to Fig. 5).
Noise generated from the thinning operation
Once the thinning operation is applied to the bitmap image, the resulting skeleton is noisy, and this noise degrades the accuracy of any classification. One can easily notice the spurious branches on the contour of digit 2 depicted in Fig. 5. A simpler approach for extracting topological features would be to compute a graph from the skeleton and count the cycles in this graph to estimate topological clues. However, such an approach would require a table that depicts the connectivity between vertices (pixels). Unfortunately, this information is not available since only a noisy bitmap image obtained from the skeletonization procedure is in our possession (refer to Fig. 5). Therefore, TDA, which does not require the vertex connectivity, is the best alternative. Another variant would be to skip the skeleton and compute PH on the VR filtration associated with the raw image after a binarization process. Although this option is possible, its computational cost is prohibitive.
Extraction of Betti numbers as topological features
The goal of this phase is to compute the optimal value d* from which topological features of numerals can be extracted. These numerals are represented within various sets of fonts. Once the simplicial complexes assigned to a numeral are computed (as part of a filtration), topological features are extracted. It is noteworthy that the feature extraction task is supervised; in other words, the class assigned to a numeral image is known a priori. Cavities do not appear since only two-dimensional simplicial complexes are considered. Figure 3 shows a set of points whose distance between adjacent points is random. In this case, the extraction of the Betti numbers assigned to the digit “6” is conditioned on an unknown value d. One can notice from this figure that the value d = 2.3 depicts the numeral “6” with its Betti numbers β_{0} and β_{1} both equal to 1. However, when the value of d is above 3.6, the hole contained in the numeral “6” disappears. Fortunately, in the case of machine-printed numeral recognition, where the input is a bitmap image, the distance between adjacent pixels composing this image is either d = 1 (vertical or horizontal positions) or \(d=\sqrt{2}\) (diagonal positions). Therefore, these two values represent the two possible scales from which any numeral is perceived. Ideally, the PH graph assigned to numerals relies only on these two d values (refer to Fig. 6a). However, due to artifacts arising from the numeral skeletonization procedure, noisy features could be generated for some particular fonts when d exceeds the value of \(\sqrt{2}\). For instance, the horizontal bar in the β_{1} graph of the numeral “2” represents a noisy feature and should therefore be disregarded. When the threshold distance d is chosen between \(\sqrt{2}\) and 2, a fake hole is formed (red quadrilateral in Fig. 6b). To remove this artifact, one has to select a threshold value greater than 2. 
Thus, non-adjacent pixels will be connected by creating additional triangles and edges, which suppresses this fake hole. Since our task is supervised, we can determine the optimal value d* that precisely represents the topological features assigned to numerals. The optimal value sought should neither generate fake holes nor eliminate real holes that topologically characterize numerals. This optimal value d* can be selected from the interval [\(\sqrt{2}, \sqrt{{{\text{M}}}^{2}+{{\text{N}}}^{2}}\)], where M and N are the number of rows and columns, respectively, of the resized image I.
Partitioning of numeral images based on their Betti numbers
This task is achieved by clustering the ten numerals with respect to their topological signatures (i.e., after the computation of their Betti numbers). Three clusters have been identified using this procedure: cluster C_{1} = {1,2,3,5,7} contains numerals with no holes, cluster C_{2} = {8} includes numerals with two holes, and cluster C_{3} = {0,4,6,9} includes those numerals with one hole. In this application, all numerals have one connected component.
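The partition above amounts to a lookup on the pair (β_{0}, β_{1}). A minimal sketch follows; the names CLUSTERS and assign_cluster are hypothetical and serve only to illustrate the first-stage decision.

```python
# Clusters indexed by the number of holes beta_1, as listed in the text.
CLUSTERS = {
    0: ("C1", {1, 2, 3, 5, 7}),   # no hole
    2: ("C2", {8}),               # two holes
    1: ("C3", {0, 4, 6, 9}),      # one hole
}


def assign_cluster(beta0, beta1):
    """Return (cluster name, candidate numerals) for a topological signature."""
    if beta0 != 1:
        raise ValueError("a numeral is expected to have one connected component")
    if beta1 not in CLUSTERS:
        raise ValueError(f"no numeral cluster has {beta1} holes")
    return CLUSTERS[beta1]


print(assign_cluster(1, 1))   # the test image competes only with 0, 4, 6 and 9
print(assign_cluster(1, 2))   # cluster C2 contains the single numeral 8
```

A signature outside the three expected ones raises an error here; in practice such a case would point to a poorly chosen threshold d or to a font whose glyph has an unusual topology.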
Recognition of numerals
We cover in this section the generation of templates using the Freeman image representation and the classification of numerals based on dynamic time warping.
Freeman chaincode computation
Definition 14: Freeman chaincode is a compact method for representing the contours of an object. This representation was first proposed by Herbert Freeman [27].
This method represents the boundary of an object by a connected sequence of straight line segments of specified length and direction. More precisely, this depiction is based on the 4- or 8-connectivity of the segments. The direction of each segment is coded through a numbering procedure. A boundary code, which is a sequence of these directional numbers, is called a Freeman chain code. The chain code of a boundary depends on the initial point considered. The code numbers provide one way to characterize the shape of the boundary. A chain code is extracted by following the contour in a counter-clockwise manner and recording the directions as we move from one contour pixel to the next (refer to Fig. 7).
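The encoding step can be sketched as follows for the 8-connectivity case. The direction numbering used here (0 for east, codes increasing counter-clockwise, pixels given as (row, column) with rows growing downward) is an assumed convention that may differ from the one in Fig. 7, and the contour is assumed to be already traced as an ordered list of pixels.

```python
# (d_row, d_col) -> direction code; 0 = east, increasing counter-clockwise (assumed).
DIRECTIONS = {
    (0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
    (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7,
}


def freeman_chain_code(contour):
    """Chain code of an ordered contour given as a list of (row, col) pixels."""
    code = []
    for (r0, c0), (r1, c1) in zip(contour, contour[1:]):
        step = (r1 - r0, c1 - c0)
        if step not in DIRECTIONS:
            raise ValueError(f"pixels {(r0, c0)} and {(r1, c1)} are not 8-adjacent")
        code.append(DIRECTIONS[step])
    return code


# A unit square traversed counter-clockwise, starting from its bottom-left pixel.
square = [(1, 0), (1, 1), (0, 1), (0, 0), (1, 0)]
print(freeman_chain_code(square))   # [0, 2, 4, 6]
```

Starting the traversal from another pixel of the same square yields a cyclic shift of the code, which illustrates the dependence on the initial point mentioned above.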
Dynamic time warping distance
Definition 15: DTW is an algorithm that computes the similarity between two temporal sequences, which may have different speeds.
This methodology was proposed independently by Vintsyuk [28] and Sakoe [29] for speech applications. DTW compares two temporal sequences that do not coincide perfectly by computing the optimal matching between them: it aims at determining the temporal alignment that minimizes the Euclidean distance between the two aligned series (refer to Fig. 8). DTW is useful in many areas such as speech recognition, data mining, financial markets, and others.

DTW distance and warping path
Let us consider two temporal sequences A = {a_{1}, a_{2}, …, a_{m}} and B = {b_{1}, b_{2}, …, b_{n}}. The variable DTW(i, j) denotes the DTW distance between A_{1…i} and B_{1…j}. This distance is expressed through the following recursive equation:
\(DTW\left(i,j\right)=dist\left({a}_{i},{b}_{j}\right)+\min\left\{DTW\left(i-1,j\right),\ DTW\left(i,j-1\right),\ DTW\left(i-1,j-1\right)\right\}\)
where \(dist\left({a}_{i},{b}_{j}\right)\) represents the distance between the two elements a_{i} and b_{j}.
For example, if the sequences A = {5, 5, 6, 6, 6, 7, 7, 7, 7} and B = {1, 1, 1, 2, 2, 2, 3, 3}, then the computation of the dynamic time warping distance is depicted by Table 1.
The warping path is the one that connects the bold numbers in each column of Table 1. The total DTW distance corresponds to the last bold element, equal in this example to 1125. The term “Infty” corresponds to the mathematical sign “∞”.
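A minimal dynamic-programming sketch of the standard DTW recursion is given below. The local distance is assumed here to be the absolute difference |a_{i} − b_{j}|, and the border cells are initialized to infinity except for the origin; since the local distance used to build Table 1 may differ, the values produced by this sketch are not meant to reproduce that table.

```python
from math import inf


def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """DTW distance between sequences a and b for a given local distance."""
    m, n = len(a), len(b)
    # D[i][j] holds the DTW distance between a[:i] and b[:j].
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = dist(a[i - 1], b[j - 1])
            # Cheapest of the three admissible predecessors (king moves).
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[m][n]


print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))   # 0.0: same shape at a different speed
print(dtw_distance([1, 3], [2]))               # 2.0: both elements are matched to 2
```

The first call shows the property that motivates DTW in this setting: two sequences describing the same shape at different speeds are at distance zero, whereas a point-by-point comparison would not even be defined for sequences of different lengths.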

Restrictions on the warping function
The warping path is determined based on a dynamic programming approach that aligns two sequences. Computing all possible paths is “combinatorically intractable”. Therefore, there is a need to limit the number of possible warping paths. The following constraints are required to reduce the search space.

❖Boundary condition: This constraint ensures that the warping path begins with the starting points of both signals and ends with their endpoints.

❖Monotonicity condition: This constraint preserves the timeorder of points (there is no return in time)

❖Continuity (step size) condition: This constraint restricts the path moves to adjacent points in time (no jumps in time).

❖Warping window condition: Permissible points can be limited to fall within a given warping window of width r (Sakoe-Chiba band).

An acceptable warping path uses chess king moves, which are as follows:

❖Horizontal moves

❖Vertical moves

❖Diagonal moves
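The warping window condition can be added to the dynamic program by restricting the cells that are filled to those satisfying |i − j| ≤ r. The sketch below is an illustration under assumptions of ours: the local distance is the absolute difference, and the band is widened to |m − n| when r is too small for the path to reach the end point. The boundary, monotonicity, and continuity conditions are enforced by the initialization and by the three king moves.

```python
from math import inf


def dtw_banded(a, b, r):
    """DTW distance restricted to a Sakoe-Chiba band of half-width r."""
    m, n = len(a), len(b)
    r = max(r, abs(m - n))  # the band must contain the end point (m, n)
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0  # boundary condition: the path starts at the first elements
    for i in range(1, m + 1):
        # Only cells with |i - j| <= r are visited; the others stay at infinity.
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Vertical, horizontal, and diagonal moves only: no jump, no return in time.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[m][n]


a, b = [1, 1, 1, 5], [1, 5, 5, 5]
print([dtw_banded(a, b, r) for r in (0, 1, 3)])   # [8.0, 4.0, 0.0]
```

Narrowing the band reduces the number of cells to fill but can only increase the distance, as the three values show: with r = 0 the alignment is forced onto the diagonal, while r = 3 recovers the unconstrained optimum.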
Template generation
Templates are generated by computing the Freeman’s chaincode representation of each numeral within this set of three clusters to respectively form Codebook C_{1}, Codebook C_{2}, and Codebook C_{3}. The Codebook generation is achieved through the sequence of tasks explained through the diagram depicted by Fig. 9 (proc function). It is worth noticing that only a few fonts are considered during template generation.
Classification of numerals
This classification phase is conducted as follows: (i) we first assign the input test numeral to its corresponding cluster (the winner cluster); (ii) we then apply all tasks described in Fig. 9 to compute its Freeman chain code. We finally invoke the dynamic time warping distance to classify the numeral. Since we are conducting a multi-font numeral recognition task, the optimal class associated with the input test is the most frequent class of numerals across all fonts considered in the codebooks. This optimal class is determined via the statistical mode. Therefore, the decision criterion is expressed mathematically as follows: determine the class \({\upomega }^{*}\) such that:
\({\upomega }^{*}=\underset{{{\text{f}}}_{{\text{j}}}}{{\text{mode}}}\left\{\underset{{\upomega }_{{\text{i}}}}{{\text{arg min}}}\ DTW\left({\text{S}},{{\text{S}}}_{{\upomega }_{{\text{i}}}}^{{{\text{f}}}_{{\text{j}}}}\right)\right\}\)
where S is the input Freeman chain code and \({{\text{S}}}_{{\upomega }_{{\text{i}}}}^{{{\text{f}}}_{{\text{j}}}}\) is the chain code assigned to class \({\upomega }_{{\text{i}}}\) for the font \({{\text{f}}}_{{\text{j}}}.\) One can notice from Fig. 10 that cluster 2 has only one numeral and therefore does not require any further processing. It is crucial to point out that the pattern-matching technique does not need training and is very efficient within the context of small-size databases. Since TDA was applied a priori, only a few numerals are in competition, and pattern matching as a subsequent task resolves the remaining confusions (refer to Fig. 10).
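The decision rule can be sketched as a vote across fonts: for each font of the winning cluster's codebook, the class of the nearest template under DTW is recorded, and the most frequent class is returned. The codebook layout (a dictionary mapping font name to a dictionary of class label to chain code), the toy chain codes, and the absolute-difference local distance are assumptions made for this illustration only.

```python
from collections import Counter
from math import inf


def dtw_distance(a, b):
    """Plain DTW with the absolute difference as local distance (assumed)."""
    m, n = len(a), len(b)
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[m][n]


def classify(chain_code, codebook):
    """Most frequent nearest-template class across the fonts of a codebook."""
    votes = []
    for font, templates in codebook.items():
        # Class whose template in this font is closest to the input chain code.
        nearest = min(templates,
                      key=lambda label: dtw_distance(chain_code, templates[label]))
        votes.append(nearest)
    return Counter(votes).most_common(1)[0][0]


# Hypothetical two-font codebook for a cluster containing the numerals 0 and 6.
codebook = {
    "font_a": {"0": [0, 2, 4, 6], "6": [6, 6, 0, 2, 4]},
    "font_b": {"0": [0, 0, 2, 4, 6], "6": [6, 0, 2, 2, 4]},
}
print(classify([0, 2, 4, 6], codebook))      # "0"
print(classify([6, 6, 0, 2, 4], codebook))   # "6"
```

Because Freeman codes are directions on a circle, a cyclic local distance could be a more faithful choice than the absolute difference used here; the sketch only illustrates the voting structure of the rule.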
Experiments and results
This section is devoted to the collection of data, the computation of the font diversity ratio, and the numerals clustering based on TDA.
Data collection
We have collected a database of numeral images using 33 fonts. Therefore, the total number of numeral images is 330 (33 × 10). This set is heterogeneous in the sense that it contains similar and dissimilar font shape types (refer to Fig. 11). It is worth underscoring that digit 4 in both the Blackadder ITC and Bauhaus 93 fonts does not include a hole; therefore, digit 4 of Blackadder ITC was replaced by digit 4 of the Calibri font (sign + in Fig. 11a), and digit 4 of Bauhaus 93 was replaced by digit 4 of the Constantia font (sign * in Fig. 11a).
Font diversity ratio
The diversity ratio of fonts represents the level of similarity of fonts contained in the entire database that is used for numeral recognition. This criterion requires a classification scheme of all types of fonts. Table 2 illustrates this taxonomy extracted from reference (https://www.freecodecamp.org/news/typographytypefamiliesclassificationsandcombiningtypefaces). The proposed methodology is based on the fonts list depicted by Fig. 11. Table 3 shows the mapping of each font of Fig. 11 to its type numbered from 1 to 8 in Table 2.
From this table, one can compute a diversity ratio δ (representing a dissimilarity level across fonts), which is expressed as follows:
\(\delta =\frac{DT}{AT}\)
where DT denotes the number of different types of fonts used, and AT is the total number of types. One can easily compute the diversity ratio assigned to Table 2. This ratio is in this case equal to 7/8 = 87.5%. This value corresponds to a low similarity of fonts.
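The ratio is straightforward to compute once each font has been mapped to its type. In the sketch below, the list of type numbers is a hypothetical stand-in for the mapping of Table 3, chosen so that seven of the eight types are represented.

```python
def diversity_ratio(font_types, total_types):
    """Ratio DT / AT of distinct font types used to all font types available."""
    return len(set(font_types)) / total_types


# Hypothetical mapping of fonts to types 1..8; type 8 is not represented.
font_types = [1, 2, 2, 3, 4, 5, 6, 7, 7, 1]
print(f"{diversity_ratio(font_types, 8):.1%}")   # 87.5%
```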
Two-stage numeral recognition
To validate our methodology during this experimental phase, we have performed an automatic clustering of the 330 numeral images into 3 clusters based on topological features. Furthermore, we have considered only 13 fonts out of the 33. These 13 fonts are selected arbitrarily. Hence, 130 (13 × 10) images are collected. We extracted topological features from these 130 images and performed a clustering based on these features, with the VR simplicial complex threshold set to \(d = 3\sqrt{2}\). This optimal value was determined through a fine-tuning procedure (minimum recognition error rate achieved) based on the rescaling parameters M and N (M = N = 128). We relied on the JavaPlex platform [23] to perform this operation. We finally conducted a classification based on the DTW procedure. The results of the first-stage recognition are illustrated in Table 4. The accuracy reported is 100%. This high performance shows that TDA is very accurate during the initial clustering of numeral images. We subsequently partitioned the database in order to create two cases: (i) the fonts of the test samples are contained in the codebook (set of reference templates), and (ii) the fonts of the test samples are not part of the codebook. We now provide some details about these two settings.
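To make the first stage more concrete, the sketch below groups binary glyph images by their Betti numbers (number of connected components and number of holes). It deliberately swaps in a simpler technique than the one used in the paper: instead of building a Vietoris-Rips complex with JavaPlex, it counts connected regions with a flood fill on the pixel grid. The glyphs, the function names, and the connectivity choice (8-connected foreground, 4-connected background) are assumptions of this illustration.

```python
from collections import defaultdict, deque

N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_components(grid, value, neighbours):
    """Count connected regions of pixels equal to `value` with a BFS flood fill."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != value or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in neighbours:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == value):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def betti_numbers(image):
    """Return (b0, b1) of a binary image (1 = ink, 0 = background).

    The image is padded with background so that the outer region is a single
    component; every additional background region is counted as a hole.
    """
    cols = len(image[0])
    padded = ([[0] * (cols + 2)]
              + [[0] + list(row) + [0] for row in image]
              + [[0] * (cols + 2)])
    b0 = count_components(padded, 1, N8)
    b1 = count_components(padded, 0, N4) - 1
    return b0, b1


def parse(rows):
    """Turn strings such as '101' into rows of integers."""
    return [[int(ch) for ch in row] for row in rows]


# Invented toy glyphs: a bar, a ring, and two stacked rings.
GLYPHS = {
    "bar": parse(["010", "010", "010"]),
    "ring": parse(["111", "101", "111"]),
    "double_ring": parse(["111", "101", "111", "101", "111"]),
}

if __name__ == "__main__":
    clusters = defaultdict(list)
    for name, image in GLYPHS.items():
        clusters[betti_numbers(image)].append(name)
    for key, names in sorted(clusters.items()):
        print(key, names)
```

On clean, well-binarized glyphs the hole count obtained this way plays the same role as the first Betti number used for clustering, but it is more sensitive to pixel noise than a persistence-based computation.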
Case where testing fonts are contained in the reference codebook
In case (i), we have designed two different scenarios. In the first scenario, only 13 fonts out of 33 were considered (fonts delimited by the red region in Fig. 11a and the shaded region in Fig. 11b). Therefore, the total number of images is 130 (13 × 10), with the numeral font size fixed at 350. This set of images serves as the set of reference templates that form the codebook. To form a test set of 200 images, we considered only the first ten fonts of Fig. 9a, counted line by line from left to right, at two font sizes: 250 and 450. The tally is therefore 10 × 10 images at font size 250 plus 10 × 10 images at font size 450, which gives 200 images in total. In this scenario, the fonts adopted in the codebook are similar in shape to the fonts used in the test set; the only difference between the reference set and the test set is the size. Tables 5 and 6 depict the confusion matrices obtained in this scenario without and with the incorporation of TDA, respectively.
In the second scenario, to build the test set, we considered all 13 fonts, from which we generated 130 images with font size 450 and 130 images with font size 250, for a total of 260 images. The codebook of reference templates is left unchanged. The confusion matrices obtained in this scenario without and with the incorporation of TDA are depicted in Tables 7 and 8, respectively.
Case where testing fonts are not part of the reference codebook
In case (ii), as in the previous scenarios, we chose the same 13 fonts out of the 33 as the template set. The test set contains the 10 numerals from each of the 20 remaining fonts. Consequently, the test set comprises 200 images. In order to assess the contribution of TDA alone, we first classified the numerals, captured by their chain codes, using only the DTW pattern-matching technique (no prior clustering was performed). The class decision adopted in this case is given by Eq. (3), with a codebook of 130 images. Table 9 depicts the confusion matrix and the resulting accuracy, which is equal to 85%. Finally, Table 10 depicts the confusion matrix obtained using both TDA and DTW. The accuracy in this latter experiment reached 92%. This improvement in accuracy is explained mainly by the contribution of TDA.
In fact, TDA has removed the confusion between several pairs of numerals, such as 1 and 4, that are topologically nonequivalent. For example, the true positive rate of class 1 was 40% without TDA and rose to 70% with TDA. Likewise, the true positive rate of class 8 was 85% without TDA and reached 100% with TDA. The same improvement was observed for the pairs of numerals (7,4) and (7,9): TDA has removed these confusions. Overall, the relative accuracy improvement using TDA is 8.23%, that is, (92 − 85)/85. Since the topological properties (Betti numbers) of all numerals are known beforehand, and the accuracy from Table 4 is 100%, one can state that the topological information from a sample of points is compatible with the one conveyed by the null model.
Performance comparison and assessment of the proposed approach
We compare our approach with some major state-of-the-art techniques.
Comparison of the proposed approach with different scenarios
The task of improving the recognition of confusing Arabic machine-printed numerals depends on the contribution of TDA. Table 4 has shown that TDA is capable of categorizing numerals perfectly (with 100% accuracy) into three clusters. This contribution allows numerals to be subsequently classified within each of these three clusters only. However, one has to distinguish between two cases: (i) testing fonts are part of the reference codebook and (ii) testing fonts are not part of the reference codebook. In the first case, the use of TDA improved the accuracy from 93.5 to 95% (refer to the confusion matrices of Tables 5 and 6). Likewise, the accuracy went up from 91.92 to 93.85% when the 13 fonts are used (refer to the confusion matrices of Tables 7 and 8). However, in the second case (testing fonts are not part of the reference codebook), the accuracy increased markedly from 85 to 92% when TDA is applied (refer to the confusion matrices of Tables 9 and 10). This accuracy is noteworthy since (i) the classifier was tested on numerals with unknown fonts, and (ii) the database used is relatively small and the classifier does not involve any large-scale training.
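The gains quoted above can be recomputed directly from the reported accuracies. The short sketch below derives the absolute gain (in percentage points) and the relative gain for each scenario; the dictionary keys and the helper name `gains` are illustrative choices, while the accuracy values are those stated in the text.

```python
# (accuracy without TDA, accuracy with TDA), in percent, as reported in the text.
SCENARIOS = {
    "fonts in codebook, Tables 5 and 6": (93.5, 95.0),
    "fonts in codebook, Tables 7 and 8": (91.92, 93.85),
    "fonts not in codebook, Tables 9 and 10": (85.0, 92.0),
}


def gains(before: float, after: float):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


if __name__ == "__main__":
    for name, (before, after) in SCENARIOS.items():
        absolute, relative = gains(before, after)
        print(f"{name}: +{absolute:.2f} points, +{relative:.2f}% relative")
```

For the third scenario the relative gain is (92 − 85)/85, which agrees with the 8.23% figure reported earlier up to rounding. The gain is largest precisely where the test fonts are absent from the codebook.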
Comparison with some major state-of-the-art approaches
We compare our methodology with some papers from the literature that use TDA and others that are more general:
TDA approaches
In this section, we highlight some applications of TDA and show how they differ from our approach (refer to Table 11).
General approaches
We have compared our approach with major methodologies proposed in the literature. Table 12 shows the size of training and testing sets, the number of fonts involved, the visual similarity between fonts, and the average accuracy for each technique deployed for the recognition task. This table also indicates whether all testing fonts have been included in the training set. Table 13 depicts the feature extraction techniques and the classifiers used by the same contributors. The comparison task is undertaken by taking into account three criteria, which are the following:

a. The size of the training set (or Codebook for template matching) and testing sets used.

b. The size of training samples (or samples from the Codebook) contained in the testing sets.

c. The diversity ratio assigned to the set of fonts used to build the database.

The authors in reference [2] indicated that the entire training set was reused as the testing set under the “resubstitution” operation. Their font diversity ratio is low (i.e., they have used similar fonts), and the size of their database is very large, which is far from the small-database setting addressed by our approach. Furthermore, it is important to note that our approach does not require training.

The authors in reference [3] did not provide much information about the number of fonts used and whether training samples are contained in the testing set. However, they claimed that the font diversity ratio is low, which may explain the high accuracy obtained.

The authors in reference [4] reported that their training set was reused during testing, and their font diversity ratio is low.

The authors in [5] did not disclose any detailed information about the context of their experiment. However, the size of their database is close to our database size, which makes these two approaches comparable.

The authors in reference [6] relied on a small database size, and their font diversity ratio is medium but their performance remains low.

Finally, we obtained accuracies of 95%, 93.85%, and 92%, depending on the database size, the font diversity ratio, and whether training samples are part of the testing set.

Pros and cons of TDA
Table 14 provides insight into whether TDA is worth applying to the recognition of machine-printed Arabic numerals by listing the advantages and limitations of TDA in this specific application:
Conclusions
We have explored TDA in order to investigate whether this mathematical paradigm is capable of extracting relevant information that improves the recognition of Arabic machine-printed numerals. This exploration is performed through clustering of numeral images based on their topological invariants. Hence, we have reduced the number of competing classes through clustering and consequently lowered the confusion between numerals. We have demonstrated that TDA clusters Arabic machine-printed numerals accurately even when numeral fonts are dissimilar in shape. This dissimilarity is expressed via a font diversity ratio that we have developed as a criterion for performance comparison purposes. We have observed that the recognition accuracy using TDA and DTW successively is higher than the one using only DTW, most notably when the testing fonts are not part of the codebook (92% versus 85%). We have assessed the limitations and advantages of TDA in this numeral recognition task. Our future work will be focused on:

Improving the second-stage classification by modifying the DTW distance through the incorporation of a matrix whose cost values are proportional to the difference between the angular directions of the two chain code symbols being compared.

Extending this exploration to font-independent alphanumeric character recognition.
Availability of data and materials
The corresponding author can provide the datasets used during the current study upon reasonable request.
Abbreviations
CL: Collective learning
DA: Data augmentation
DTW: Dynamic time warping
HT: Homology theory
kNN: K-nearest neighbors
LPR: License plates recognition
ML: Machine learning
PHT: Persistent homology theory
TDA: Topological data analysis
TL: Transfer learning
USPS: US postal service
VR: Vietoris-Rips
References
Kumar M, Jindal MK, Sharma RK, Jindal SR (2019) Character and numeral recognition for non-Indic and Indic scripts: a survey. Artif Intell Rev 52(4):2235–2261
Hassanpour H, Samadiani N, Akbarzadeh F (2017) A modified self-organizing map neural network to recognize multi-font printed Persian numerals. Int J Eng IJE 30(11):1700–1706
Khedidja D, Hayet M (2019) Multiple classifiers and invariant features extraction for digit recognition. IJECE 11(1):41–52
Alqudah AT, Al-Zoubi HR, Al-Khassaweneh M (2012) Shift and scale invariant recognition of printed numerals. Abhath Al-Yarmouk Basic Sci Eng 21(1):41–49
Dhandra BV, Malemath VS, Mallikarjun H, Hegadi R (2006) Multi-font Numeral recognition without Thinning based on Directional Density of pixels. In: 2006 1st International Conference on Digital Information Management. IEEE, Bangalore, p 157–160
Radha R, Aparna RR (2014) Automatic extraction, segmentation and recognition of multi-font Indian Pincode. IJCVR 4(3):247–258
Salameh M, Salem AA (2016) Hyper recognition techniques for English digits using statistical analysis of nodes and Fuzzy Logic for pattern recognition. Int J Multi Sci Eng 7(8):1–7
Wang Y, Lian Z (2020) Exploring font-independent features for scene text recognition. Proceedings of the 28th ACM International Conference on Multimedia. pp 1900–1920
Kundaikar T, Pawar JD (2020) Multi-font Devanagari Text Recognition Using LSTM Neural Networks. First International Conference on Sustainable Technologies for Computational Intelligence. Springer, Singapore, pp 495–506
Sharma R, Kaushik B, Gondhi N (2020) Character recognition using machine learning and deep learning-a survey. In: 2020 International Conference on Emerging Smart Computing and Informatics (ESCI). IEEE, Pune, p 341–345
Stricker D (2019) Multi-font Printed Amharic Character Image Recognition: Deep Learning Techniques. Advances of Science and Technology: 6th EAI International Conference, ICAST 2018, Bahir Dar, Ethiopia, October 5–7, 2018, Proceedings, vol 274. Springer, Bahir Dar, p 322
Silva SM, Jung CR (2020) Real-time license plate detection and recognition using deep convolutional neural networks. J Vis Commun Image Represent 71:102773
Bouchaffra D, Tan J (2006) Structural hidden Markov models: An application to handwritten numeral recognition. Intelligent Data Analysis 10(1):67–79
Jha M, Kabra M, Jobanputra S, Sawant R (2019) Automation of cheque transaction using deep learning and optical character recognition. In: 2019 International Conference on Smart Systems and Inventive Technology (ICSSIT). IEEE, Tirunelveli, p 309–312
Chowdhury AI, Rahman MS, Sakib N (2019) A study of multiple barcode detection from an image in business system. Int J Comput Appl 181(37):30–37
Savino P, Tonazzini A (2016) Digital restoration of ancient color manuscripts from geometrically misaligned recto-verso pairs. J Cult Herit 19:511–521
Bouchaffra D, Govindaraju V, Srihari SN (1999) Postprocessing of recognized strings using nonstationary Markovian models. IEEE Trans Pattern Anal Mach Intell 21(10):990–999
Adcock A, Carlsson E, Carlsson G (2016) The ring of algebraic functions on persistence bar codes. Homol Homotopy Appl 16(1):381–402
Kališnik S (2019) Tropical coordinates on the space of persistence barcodes. Found Comput Math 19(1):101–129
Choi HR, Kim T (2018) Modified dynamic time warping based on direction similarity for fast gesture recognition. Math Probl Eng 2018:1–9
Edelsbrunner H, Harer JL (2010) Computational topology: An introduction. American Mathematical Society, Providence, Rhode Island
Otter N, Porter MA, Tillmann U, Grindrod P, Harrington HA (2017) A roadmap for the computation of persistent homology. EPJ Data Science 6(1):17
Adams H, Tausz A, Vejdemo-Johansson M (2014) JavaPlex: a research software package for persistent (co) homology. In: International Congress on Mathematical Software. Springer, Seoul, p 129–136
Pola FPB, Pola IRV (2019) Optimizing computational high-order schemes in finite volume simulations using unstructured mesh and topological data structures. Appl Math Comput 342:1–17
De Silva V, Gunnar EC Topological estimation using witness complexes. In: Symposium on Point Based Graphics. IEEE, Goslar, Germany, p 157–166
Lee TC, Kashyap RL, Chu CN (1994) Building skeleton models via 3D medial surface/axis thinning algorithms. Comp Vision Graph Image Proc 56(6):462–478
Freeman H (1961) On the encoding of arbitrary geometric configurations. IRE Trans Elec Comput EC-10(2):260–268
Vintsyuk TK (1968) Speech discrimination by dynamic programming. Cybernetics
Sakoe H, Chiba S (1978) Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing
Tauzin G, Lupo U, Pérez TL, Caorsi JB, Medina-Mardones M, Hess K (2021) giotto-tda: A topological data analysis toolkit for machine learning and data exploration. J Mach Learn Res 22(39):1–6
Garin A, Tauzin G (2019) A topological “reading” lesson: Classification of MNIST using TDA. In: 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA). pp 1551–1556
Turkeš N, Nys R, Verdonck J, Latré S (2021) Noise robustness of persistent homology on greyscale images across filtrations and signatures. PloS One 16(9):e0257215
Acknowledgements
We are grateful to the General Direction of Scientific Research (DGRSDT) for their continuous financial support in this research.
Funding
This research work is supported by a grant from the General Direction of Scientific Research & Development (DGRSDT), under the number (DGRSDT13), Algeria.
Author information
Contributions
D. Bouchaffra designed the study and wrote the manuscript. F. Ykhlef performed the experiments, collected the data, provided critical feedback, and contributed to the writing of the manuscript. Both authors confirmed the results, and read and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This research was conducted while Professor Djamel Bouchaffra was a faculty at Oakland University, Michigan, USA.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Bouchaffra, D., Ykhlef, F. Exploring topological data analysis for information extraction: application to recognition of Arabic machineprinted numerals. J. Eng. Appl. Sci. 71, 16 (2024). https://doi.org/10.1186/s4414702300346x