Exploring topological data analysis for information extraction: application to recognition of Arabic machine-printed numerals

Bouchaffra, Djamel; Ykhlef, Faycal

doi:10.1186/s44147-023-00346-x

Reviews
Open access
Published: 13 January 2024

Journal of Engineering and Applied Science volume 71, Article number: 16 (2024)

Abstract

This manuscript explores the capability of topological data analysis (TDA) based on homology theory (HT: a subfield of algebraic topology) to extract relevant information for the recognition of confusing Arabic machine-printed numerals. In fact, topological properties may significantly reduce the confusion between some numerals such as “1” and “4” in the context of small data sets. These two digits differ in the sense that digit 1 has no hole and digit 4 has one hole. Our contribution consists of evaluating the usefulness of TDA, with its invariant descriptors such as Betti numbers, in machine-printed Arabic numeral recognition. Our investigation is driven by the following set of actions: (i) we extract the Betti numbers (invariant features) of each numeral image and partition the ten numerals into three different clusters with respect to these features; (ii) we then perform a classification by assigning a test image to its corresponding cluster and mapping this image to a numeral using dynamic time warping as a metric defined in the Freeman chaincode space. We compared our proposed approach with major state-of-the-art methods depicting various ways of using TDA in character recognition. The advantages and limitations of TDA are discussed further based on numeral recognition results.

Introduction

Character recognition is divided into two categories: online and offline. Online character recognition attempts to recognize characters while they are being written by an individual. Offline character recognition, in contrast, deals with characters that have been optically scanned by a machine, whether they are handwritten or machine-printed (refer to Fig. 1).

Research in recognition of characters is grounded in the document analysis literature [1]. Intensive research studies on character recognition have been widely reported. Most of these studies focus mainly on the recognition of handwritten alphanumeric characters. However, not many studies have been devoted to the recognition of machine-printed numerals [2]. Khedidja et al. proposed the Hu moment, the number of holes, and the surface for feature extraction, and implemented seven classifiers for the classification of multifont Arabic machine-printed numerals [3]. Moreover, Alqudah et al. implemented a shift- and scale-invariant procedure for offline machine-printed decimal digit recognition. Their approach is based on a correlation factor [4].

Dhandra et al. introduced a thinning algorithm for multifont numeral recognition. Their feature extraction uses a directional density vector, and their classification relies on a decision tree minimum distance nearest neighbor [5]. Radha et al. proposed a new algorithm for pincode script identification. They developed a new method for chaincode normalization and conducted recognition of numerals based on neural network and naïve Bayes classifiers [6]. Salameh et al. proposed a method that estimates the number of ends and conjunction nodes of a numeral shape. They conducted a classification using fuzzy logic [7]. Trainable font embeddings [8], horizontal direction features [9], and profile vectors [10] are some major conventional feature extraction techniques explored in this field. These techniques are very effective at extracting discriminative clues that are relevant for a classification task. Almost all these ideas are based solely on the geometry of the strokes composing the input numeral. In fact, high recognition performance using these traditional techniques has been achieved using large databases.

More recent approaches that invoke deep graphical networks (deep learning) using private large databases have been investigated to enhance the performance [10,11,12]. However, large quantities of real data sets are not often publicly available. This data sparseness problem represents one of the challenging problems in statistical learning. To address this issue, researchers rely on three techniques: data augmentation (DA) (including generative adversarial networks), transfer learning (TL), and collective learning (CL). However, DA based on cropping, padding, and horizontal flipping is not appropriate in several applications (such as license plate recognition and container code recognition) since the augmented data do not occur in real scenarios. Moreover, TL and CL perform optimally only if the original and the destination domains of both models are similar enough. If this is not the case, the trained models might perform worse than what one would expect. Therefore, in this scenario, a template matching technique based on a small data set is far more adequate. Other probabilistic approaches such as hidden Markov models have also been explored; however, they require a large amount of data in order to achieve an accurate recognition [13]. In general, when they are trained on large data sets, machine learning (ML) models can exhibit a deep understanding of numeral pattern structures, leading to an extremely high performance. All these research efforts have significantly contributed to the success of many applications. The technology of machine-printed numeral recognition is needed in many applications such as license plate recognition (LPR) [12]. Indeed, commercial LPR invokes convolutional neural networks trained with a large quantity of private (not public) datasets of more than thousands of vehicles [12]. In LPR, the background and the number of digits depend on the legislation imposed by the country considered.
This variation in legislation explains the creation of many small public license plate databases around the world [12]. Therefore, in order to rely on a deep learning scheme, one has to gather very large private (and not public) databases; however, these repositories are not made freely available to the scientific community: their purchase incurs a high cost. Furthermore, machine-printed recognition has also been successful in other fields such as automated processing of bank statements [14], barcode identification [15], document restoration [16], and zip-code recognition [17]. A recognition system does not tolerate any mistake. For example, an error in a machine-printed US zip-code is significantly expensive since the mail-piece can be incorrectly routed by the US postal service (USPS) to the wrong state (or city) post office. Likewise, an error in a series of digits in a bank statement is very detrimental since it induces financial losses. However, there are still some shortcomings that are common to many traditional approaches; in particular, geometrical features alone are not sufficient to optimally perceive a machine-printed numeral. It is well known that humans are capable of associating different types of elliptical shapes with the same class even if their roundness differs. In other words, topological features of a machine-printed numeral may represent valuable clues that can be considered for discriminating between confusing numerals.

To validate this conjecture, we propose to explore persistent homology theory (PHT), which optimally estimates Betti numbers, for the purpose of numeral recognition. Our ultimate goal consists of evaluating the significance of topological data analysis (TDA) in the field of Arabic machine-printed numeral recognition in a small data set setting. The advantages and limitations of TDA are underscored. At the end, one should be able to decide whether or not it is worth applying TDA in this application. It is important to underscore that some authors investigated topological data analysis for the purpose of recognizing handwritten numerals [18, 19]. Our contribution is not directed towards recognition performance assessment but rather discusses the significance of TDA by highlighting when TDA can or cannot be exploited. Our exploration of TDA relies on topological invariants (known as Betti numbers) during a first stage and subsequently invokes geometrical information during a second stage of classification. The first stage aims at representing the binary images of the ten numerals using their topological signatures (Betti numbers). This representation allows partitioning these images based on their topological signatures into three clusters: C₁ = {1,2,3,5,7} contains numerals with no holes, C₂ = {8} includes numerals with two holes, and C₃ = {0,4,6,9} contains numerals with one hole. The second stage performs a chaincode representation and a pattern matching distance for classification. In fact, once the input test numeral image is assigned to its corresponding cluster, Freeman’s chaincode with eight directions of the binary image is subsequently computed and compared to the set of chaincode templates within the cluster. Finally, the dynamic time warping (DTW) metric is invoked for the overall numeral recognition procedure.
It is noteworthy that this similarity measure between temporal sequences does not require any large-scale training and is therefore very useful in the case where the database size is small [20]. Furthermore, we have defined a diversity ratio assigned to the set of fonts used in the numeral databases to assess the robustness of TDA.

Main text

The “Problem description” subsection presents the addressed problem, while the “Overview of topological data analysis” subsection provides some background on TDA and HT. The “Exploring TDA for recognition of Arabic machine-printed numerals” subsection is devoted to the exploration of TDA for the recognition of machine-printed Arabic numerals. It introduces the topological representation and shows how the partitioning of numerals is conducted through the persistent homology (Betti numbers) principles associated with machine-printed numerals. The “Experiments and results” subsection showcases the data collection strategy, the importance of font diversity, and the numeral clustering scheme based on TDA. The “Performance comparison and assessment of the proposed approach” subsection presents a comparison between our approach and some major state-of-the-art techniques. This subsection also brings forth the advantages and limitations of TDA for the recognition of machine-printed Arabic numerals. Finally, the conclusion and perspectives are laid out in the “Conclusions” subsection.

Problem description

Our investigation consists of exploring topological data analysis (TDA) with its persistent homology concept and analyzing its advantages and limitations when applied to the recognition of Arabic machine-printed numerals for small data sets. This problem consists of determining the correct Betti numbers (denoted β_i) that should improve the recognition performance for Arabic machine-printed numerals. It is noteworthy that Betti numbers represent the core of homology theory.

Overview of topological data analysis

We now provide topological definitions deemed necessary to comprehend TDA. For a deeper understanding of this section, please refer to [21,22,23].

Simplicial complexes

Definition 1: If E = {u₀, u₁,..., u_j} represents a set of (j + 1) points taken from an affine space, then this set is affinely independent if the j vectors, u₁ − u₀, u₂ − u₀,..., u_j − u₀, are linearly independent.

Definition 2: The convex hull of (j + 1) affinely independent points is called a j-simplex s. The points of E are the vertices of the j-simplex s and j is its dimension. It is noteworthy that simplices of dimensions 0, 1, 2, and 3 are vertices, edges, triangles, and tetrahedrons, respectively. Vertices, edges, and triangles can be formed in the two-dimensional space as well as in a three-dimensional space. However, tetrahedrons are not observed in a two-dimensional space. Other simplices such as 5-cell (or n-cell in general) can also be formed in higher dimensional spaces [24].

Definition 3: A simplex spanned by a subset of the vertices of s is called a face of s (refer to the upper triangle of Fig. 2).

Definition 4: A finite set of simplices, S, is a simplicial complex if it satisfies the following two conditions:

(i) ∀s ∈ S, every face of s belongs to S;
(ii) for every pair of simplices (s, ϕ) ∈ S × S, the intersection, s ∩ ϕ, is either an empty set or a face common to both simplices.
The highest dimension of any simplex contained in S represents its dimension.

Abstract simplicial complex

Definition 5: An abstract simplicial complex is a sub-collection, B (generic set) of S, such that s’ ∈ B and ϕ’ ⊆ s’⇒ ϕ’ ∈ B. The sets s’ are defined as abstract simplices.

The dimension of an abstract simplex is defined as its cardinality (the number of its elements) minus one. The dimension of B is the maximum dimension exhibited by any of its abstract simplices.

Vietoris-Rips simplicial complex

Several geometrical constructors such as Alpha complexes, Čech complexes, and witness complexes are available in the literature [25]. To recover the persistent homology of a space from a finite sample of points, we have invoked the Vietoris-Rips (VR) constructor [23]. We now provide a definition of this constructor.

Definition 6: If E is a finite set of points in ${\mathbb{R}}^{2}$, and d is a positive real number, then the Vietoris-Rips complex of E and d, denoted VR(E; d), consists of all abstract simplices in $2^{E}$ (the power set of E) whose vertices lie at a distance of at most d from one another. Namely, we connect by an edge any two vertices whose distance from each other is no more than d, and we add a triangle or a higher-dimensional simplex to the complex if all its edges are contained in the complex (refer to Fig. 3).
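Definition 6 can be illustrated with a minimal sketch that enumerates the simplices of VR(E; d) up to dimension 2. The function name `vietoris_rips`, the `max_dim` parameter, and the toy point set are assumptions made for this illustration; this brute-force enumeration is not the constructor used by the JAVAPLEX platform.

```python
from itertools import combinations
from math import dist


def vietoris_rips(points, d, max_dim=2):
    """Return the simplices of VR(points; d), as tuples of point indices, up to max_dim."""
    n = len(points)
    # Two vertices are linked when their Euclidean distance does not exceed d.
    close = [[dist(points[i], points[j]) <= d for j in range(n)] for i in range(n)]
    simplices = [(i,) for i in range(n)]          # 0-simplices (vertices)
    for k in range(2, max_dim + 2):               # k = number of vertices per simplex
        for combo in combinations(range(n), k):
            # A simplex is added when all of its edges are in the complex.
            if all(close[i][j] for i, j in combinations(combo, 2)):
                simplices.append(combo)
    return simplices


# Hypothetical toy data: the four corners of a unit square.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(len(vietoris_rips(square, d=1.0)))   # 8: four vertices and the four sides
print(len(vietoris_rips(square, d=1.5)))   # 14: diagonals and four triangles appear
```

At d = 1 the two diagonals (length $\sqrt{2}$) are excluded, so the complex is a closed loop; at d = 1.5 every pair of vertices is linked and the triangles fill the loop.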

Homology and persistent homology

We now define the notions of filtration, pth homology group, persistent homology, death and birth, and barcodes:

Construction of series of simplicial complexes

Definition 7: A filtration on a simplicial complex S is a collection of subcomplexes {S(d) | d ∈ ${\mathbb{R}}$} of S such that S(d₁) ⊂ S(d₂) whenever d₁ ≤ d₂. The filtration value of a simplex s ∈ S is the smallest d such that s ∈ S(d). Larger simplicial complexes are built from basic ones using different values of d to form a filtration.

Computation of Betti numbers

A simplicial complex generates a group structure through the addition of p-simplices.

Definition 8: The union of all faces is the boundary of the simplex, and its complement is called the interior, or the open simplex.

Definition 9: The free group obtained through this addition of p-simplices is called the chain group, G_p.

Definition 10: The pth homology of a simplicial complex S is the quotient vector space:

H_p(S) = Ker(l_p)/Im(l_{p+1}) = Z_p/B_p, in which Z_p = Ker(l_p) is the group of all p-chains with empty boundary (the boundary of a chain being the sum of the boundaries of its simplices), and B_p = Im(l_{p+1}), a subgroup of G_p contained in Z_p, is known as the boundary group. The notion of Betti numbers derives from this quotient group. The function $l_p$ is a linear map defined as follows:

$$\begin{array}{c}{C}_{p}\left(S\right)\stackrel{{l}_{p}}{\to }{C}_{p-1}\left(S\right)\\ \sigma \mapsto {\sum }_{s\subset \sigma ,\, s\in {S}_{p-1}}s.\end{array}$$

(1)

C_p(S) is the ${\mathbb{F}}_{2}$ vector space (2-elements field) whose basis is expressed by the p-simplices of S.

Definition 11: The rank of H_p (the smallest cardinality of a generating set for H_p) is known as the pth Betti number given by β_p. In fact, the pth Betti number records the number of p-dimensional holes in S. When p = 0, β₀ computes the number of path-connected components of S. If E = ${\mathbb{R}}^{3}$, β₁ accounts for the number of independent tunnels (holes), and β₂ tallies the number of cavities.

For example, the shape in Fig. 2a depicts a circle whose Betti numbers (β₀, β₁) are (1,1), whereas Fig. 2b depicts an empty (hollow) cylinder whose Betti numbers (β₀, β₁, β₂) are (1,1,0).
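As a hedged illustration of Definitions 10 and 11, the sketch below computes β_p = dim Ker(l_p) − rank(l_{p+1}) over ${\mathbb{F}}_{2}$ for a small simplicial complex. The helper names `rank_gf2` and `betti_numbers` and the toy complexes are assumptions of this example; the rank is obtained by plain XOR elimination on bitmask rows, which is only practical for small inputs.

```python
from itertools import combinations


def rank_gf2(rows):
    """Rank over F_2 of a matrix whose rows are encoded as integer bitmasks."""
    pivots = {}                       # leading bit position -> reduced row
    for row in rows:
        while row:
            top = row.bit_length() - 1
            if top not in pivots:
                pivots[top] = row
                break
            row ^= pivots[top]        # cancel the leading bit and keep reducing
    return len(pivots)


def betti_numbers(simplices, max_dim=1):
    """Return [beta_0, ..., beta_max_dim] of a simplicial complex over F_2."""
    by_dim = {}
    for s in simplices:
        by_dim.setdefault(len(s) - 1, []).append(tuple(sorted(s)))

    def boundary_rank(p):
        # Rank of l_p : C_p -> C_{p-1}; it is zero when p = 0 or no p-simplex exists.
        if p == 0 or p not in by_dim:
            return 0
        index = {s: k for k, s in enumerate(by_dim[p - 1])}
        rows = []
        for s in by_dim[p]:
            mask = 0
            for face in combinations(s, p):   # the (p-1)-faces of s have p vertices
                mask |= 1 << index[face]
            rows.append(mask)
        return rank_gf2(rows)

    return [len(by_dim.get(p, [])) - boundary_rank(p) - boundary_rank(p + 1)
            for p in range(max_dim + 1)]


# A hollow triangle is topologically a circle: one component, one hole.
hollow_triangle = [(0,), (1,), (2,), (0, 1), (1, 2), (0, 2)]
print(betti_numbers(hollow_triangle))                 # [1, 1]
print(betti_numbers(hollow_triangle + [(0, 1, 2)]))   # [1, 0] once the triangle is filled
```

The first output matches the Betti numbers (1,1) of the circle mentioned above; filling the triangle removes the hole.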

Persistent homology (PH)

PH aims to reveal the different scales (values of d) from which a set of points can be observed within a single formalism. Therefore, several simplicial complexes are exhibited in order to approximate the true shape of the cloud of points (dataset). PH seeks to determine the homology that best represents the true shape disclosed by the set of points. In fact, this optimal homology attempts to generate the qualitative noise-free features.

Definition 12: Let S₁ ⊂ S₂ ⊂ … ⊂ S_n = S be a filtered simplicial complex. We define the pth persistent homology of S as the pair:

$$\left(\left\{{H}_{p}({S}_{i})\right\}, 1\le i\le n, \left\{{f}_{ij}\right\}, 1\le i,j\le n\right)$$

(2)

where ∀(i,j) ∈ {1,…,n}, i ≤ j, the linear maps H_p(S_i) $\stackrel{{f}_{ij}}{\to }$ H_p(S_j) are the maps induced by the inclusion maps from S_i to S_j.

Birth and death

The concept of birth and death is essential in the computation of Betti numbers that represent the qualitative features.

Definition 13: We assert that x ∈ H_p(S_i) (x ≠ 0) is born in H_p(S_i) if it is not present in the image of f_{i−1,i}. Likewise, x is said to die in H_p(S_j) if j > i is the smallest index such that f_ij(x) = 0.

Barcodes

Through the variation of d, Betti numbers are computed for each simplicial complex. From one step to the next, matching up the births and deaths (as described in the previous section), we obtain a set of bars, known as the barcode of the filtration. A bar corresponds to a class in one of the homology groups. Figure 3 shows barcodes obtained via a filtration on a simplicial complex based on 13 randomly spaced points. For example, for d = 1.5, the persistent homology (barcodes) reveals three connected components (β₀ = 3) and one hole (β₁ = 1). However, for d = 2.3, it discloses one connected component (β₀ = 1) and one hole (β₁ = 1). Finally, for d = 4, β₀ = 1 and β₁ = 0.
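For the connected components alone (dimension 0), the barcode of a Vietoris-Rips filtration can be sketched with a union-find structure: every point is born at d = 0, and a bar ends each time two components merge. The names `h0_barcode` and `beta0_at` and the three-point data set are assumptions of this illustration; bars for holes (β₁) require a full persistence computation that is not shown here.

```python
from itertools import combinations
from math import dist, inf


def h0_barcode(points):
    """Bars (birth, death) of connected components; assumes a non-empty point set."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]     # path halving
            i = parent[i]
        return i

    # Process candidate edges by increasing length, i.e. by increasing scale d.
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                          # two components merge: one bar ends here
            parent[ri] = rj
            bars.append((0.0, length))
    bars.append((0.0, inf))                   # the last component never dies
    return bars


def beta0_at(bars, d):
    """Number of bars alive at scale d, each bar read as the interval [birth, death)."""
    return sum(1 for birth, death in bars if birth <= d < death)


# Hypothetical toy data: three collinear points.
bars = h0_barcode([(0, 0), (1, 0), (5, 0)])
print(bars)                  # [(0.0, 1.0), (0.0, 4.0), (0.0, inf)]
print(beta0_at(bars, 2.0))   # 2
```

Reading the barcode at a given d, as done above for d = 1.5, 2.3, and 4, amounts to counting the bars that cross that value.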

Statistical interpretation of topological information

Once barcodes have been produced through PH, one needs to provide an interpretation of the computed results. The question that is addressed is how to assert that topological information extracted from a certain sample of points is compatible with the topological information derived from a null model. However, it is important to distinguish between two scenarios: (i) the set of points depicts an object with known topological properties, and (ii) the set of points describes an object with unknown topological properties.

In the first scenario, a barcode is a theoretical parameter and needs to be compared to the observed one (computed via PH). A goodness-of-fit statistical test such as the chi-square test indicates how well the observed barcode matches the theoretical one. For example, if the object is the numeral “0” (one class among 10), its barcode exhibits one connected component (β₀ = 1) and one hole (β₁ = 1) and represents the null model (expected value in a null hypothesis of a statistical test of significance). However, the observed barcode computed through PH using different random fonts of the same numeral “0” can be viewed as the empirical mean.

The chi-square can therefore be applied to assert the significance of the observed barcode. In the second scenario, topological properties of an object represented via a set of points are unknown. In this case, the null hypothesis of a statistical test can be provided by a generative model (such as many realizations from a probability distribution of barcodes). For example, a large number of barcodes (representing a population) can be generated using many fonts with different resolution levels. All these barcodes are assigned to the numeral class “0”. Topological properties (such as the Betti numbers) of the observed barcode can subsequently be compared to those present in the population using a statistical test of significance. Finally, one can underscore that the computation of PH through data can only be performed through the sequence of tasks (pipeline) depicted by Fig. 4.

Exploring TDA for recognition of Arabic machine printed numerals

We show in this section how TDA can be exploited to improve recognition of confusing Arabic machine-printed numerals.

Numeral image characterization via simplicial complexes

We apply homology theory on a cloud of points E forming the skeleton of a printed numeral image within a set of various fonts. This set of points and its neighborhood system satisfying a set of axioms represents a topological space. The skeleton formed is obtained during a preprocessing phase including image complement, bounding box computation, scaling (or resizing) for normalization, and then skeletonization of the binary numeral image I [26]. The cloud of points is defined in a two-dimensional metric space (E; d) whereby a neighborhood system is formed using VR constructor (refer to Fig. 5).

Noise generated from the thinning operation

Once the thinning operation is applied on the bitmap image, the resulting skeleton becomes noisy. This noise prevents any classification from being accurate. One can easily notice the branches on the numeral contour of digit 2 depicted by Fig. 5. A simpler approach for extracting topological features would be to compute a graph out of the skeleton; the count of cycles in this graph would then be an option for estimating topological clues. However, such an approach would require a table that depicts the connectivity between vertices (pixels). Unfortunately, this information is not available since only a noisy bitmap image obtained from the skeletonization procedure is in our possession (refer to Fig. 5). Therefore, TDA, which does not require the vertex connectivity, is the best alternative. Another variant would be to skip the skeleton computation and compute PH on the VR filtration associated with the raw image after a binarization process. However, although this option is possible, its computational cost is prohibitive.

Extraction of Betti numbers as topological features

The goal of this phase is to compute the optimal value d* from which topological features of numerals can be extracted. These numerals are represented within various sets of fonts. Once the simplicial complexes assigned to a numeral are computed (as part of a filtration), topological features are extracted. It is noteworthy that the feature extraction task is supervised; in other words, the class assigned to a numeral image is known a priori. Cavities do not appear since only two-dimensional simplicial complexes are considered. Figure 3 shows a set of points whose distance between adjacent points is random. In this case, the extraction of the Betti numbers assigned to the digit “6” is conditioned on an unknown value d. One can notice from this figure that the value d = 2.3 depicts the numeral “6” with its Betti numbers β₀ and β₁ both equal to 1. However, when the value of d is above 3.6, the hole contained in the numeral “6” disappears. Fortunately, in the case of machine-printed numeral recognition, where the input is a bitmap image, the distance between adjacent pixels composing this image is either d = 1 (vertical or horizontal positions) or $d=\sqrt{2}$ (diagonal positions). Therefore, these two values represent the two possible scales from which any numeral is perceived. Ideally, the PH graph assigned to numerals relies only on two d values (refer to Fig. 6a). However, due to some artifacts emanating from the numeral skeletonization procedure, noisy features could be generated for some particular fonts when d exceeds the value of $\sqrt{2}$. For instance, the horizontal bar in the β₁ graph of the numeral “2” represents a noisy feature and therefore should be disregarded. When the threshold distance d between adjacent pixels is chosen between $\sqrt{2}$ and 2, a fake hole is formed (red quadrilateral in Fig. 6b). To remove this artifact, one has to select a threshold value greater than 2.
Thus, non-adjacent pixels will be connected by creating additional triangles and edges which suppress this fake hole. Since our task is supervised, we could determine the optimal value d* that precisely represents the topological features assigned to numerals. The optimal value sought should neither generate fake holes nor eliminate real holes that topologically characterize numerals. This optimal value d* can be selected from the interval $[\sqrt{2}, \sqrt{M^{2}+N^{2}}]$, where M and N are the number of rows and columns, respectively, of the resized image I.

Partitioning of numeral images based on their Betti numbers

This task is achieved by clustering the ten numerals with respect to their topological signatures (i.e., after the computation of their Betti numbers). Three clusters have been identified using this procedure: cluster C₁ = {1,2,3,5,7} contains numerals with no holes, cluster C₂ = {8} includes numerals with two holes, and cluster C₃ = {0,4,6,9} includes those numerals with one hole. In this application, all numerals have one connected component.
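The partitioning rule above reduces to a lookup on the number of holes. The following minimal sketch assumes that the Betti numbers (β₀, β₁) of the image have already been estimated; the names `CLUSTERS` and `assign_cluster` are hypothetical.

```python
# Cluster label and member numerals, keyed by the number of holes (beta_1).
CLUSTERS = {
    0: ("C1", (1, 2, 3, 5, 7)),   # no hole
    2: ("C2", (8,)),              # two holes
    1: ("C3", (0, 4, 6, 9)),      # one hole
}


def assign_cluster(beta0, beta1):
    """Map the topological signature of a numeral image to its cluster."""
    if beta0 != 1:
        raise ValueError("a numeral is expected to have one connected component")
    if beta1 not in CLUSTERS:
        raise ValueError(f"no cluster defined for {beta1} holes")
    return CLUSTERS[beta1]


print(assign_cluster(1, 2))   # ('C2', (8,))
print(assign_cluster(1, 1))   # ('C3', (0, 4, 6, 9))
```

A signature outside these three cases (for instance a noisy skeleton with three holes) is rejected rather than forced into a cluster, which is a design choice of this sketch.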

Recognition of numerals

We cover in this section the generation of templates using the Freeman image representation and the classification of numerals based on dynamic time warping.

Freeman chaincode computation

Definition 14: Freeman chaincode is a compact method for representing the contours of an object. This representation was first proposed by Herbert Freeman [27].

This method represents a boundary of an object by a connected sequence of straight line segments of specified length and direction. More precisely, this depiction is based on the 4- or 8-connectivity of the segments. The direction of each segment is coded through a numbering procedure. A boundary code, which is a sequence of these directional numbers, is called a Freeman chaincode. The chaincode of a boundary depends on the initial point considered. The code numbers offer one way to characterize the shape of the boundary. A chaincode is extracted by following the contour in a counterclockwise manner and recording the directions as we move from one contour pixel to the next (refer to Fig. 7).
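A minimal sketch of the 8-direction encoding is given below. It assumes that the contour pixels are already ordered (counterclockwise), that coordinates are (x, y) with the y axis pointing up, and that code k denotes a step at an angle of k × 45°. This numbering is a common convention but is an assumption here, since the numbering of Fig. 7 is not reproduced in the text; the function name `freeman_chaincode` is also hypothetical.

```python
# Assumed convention: code k is a unit step at k * 45 degrees, y axis pointing up.
DIRECTIONS = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}


def freeman_chaincode(contour, closed=True):
    """Encode an ordered, non-empty list of 8-connected contour pixels."""
    points = list(contour)
    if closed:
        points.append(points[0])      # come back to the starting pixel
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        step = (x1 - x0, y1 - y0)
        if step not in DIRECTIONS:
            raise ValueError(f"pixels {(x0, y0)} and {(x1, y1)} are not 8-adjacent")
        codes.append(DIRECTIONS[step])
    return codes


# Unit square traversed counterclockwise from its lower-left corner.
print(freeman_chaincode([(0, 0), (1, 0), (1, 1), (0, 1)]))   # [0, 2, 4, 6]
```

Starting from another pixel of the same contour yields a cyclic shift of the same code, which illustrates the dependence on the initial point noted above.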

Dynamic time warping distance

Definition 15: DTW is an algorithm that computes the similarity between two temporal sequences, which may have different speeds.

This methodology has been proposed independently in the literature by Vintsyuk [28] and Sakoe [29] for speech applications. DTW is a way to compare two temporal sequences that do not coincide perfectly. It is a method that computes the optimal matching between two sequences. It aims at determining the temporal alignment that minimizes Euclidean distance between two aligned series, (refer to Fig. 8). DTW is useful in many areas such as speech recognition, data mining, financial markets, and others.

DTW distance and warping path

Let us consider two temporal sequences A = {a₁, a₂, …, a_m} and B = {b₁, b₂, …, b_n}. The variable DTW(i, j) denotes the DTW distance between the prefixes A_1…i and B_1…j. This distance is expressed through the following recursive equation:

$$DTW\left(i,j\right)=\left\{\begin{array}{ll}0& \text{if } i=0 \text{ and } j=0\\ \infty & \text{if } (i=0 \text{ or } j=0) \text{ and } i\ne j\\ dist\left({a}_{i},{b}_{j}\right)+\min\left\{\begin{array}{l}DTW\left(i-1,j\right)\\ DTW\left(i,j-1\right)\\ DTW\left(i-1,j-1\right)\end{array}\right\}& \text{if } 1\le i\le m \text{ and } 1\le j\le n,\end{array}\right.$$

where $dist\left({a}_{i},{b}_{j}\right)$ represents the distance between the two elements a_i and b_j.
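The recursion can be sketched as a dynamic programming table. The function name `dtw_distance` is hypothetical, and the default element distance |a_i − b_j| is an assumption of this example, since the choice of dist is left open above.

```python
from math import inf


def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """DTW distance between sequences a and b, following the recursion above."""
    m, n = len(a), len(b)
    # table[i][j] holds DTW(i, j); row 0 and column 0 are the boundary cases.
    table = [[inf] * (n + 1) for _ in range(m + 1)]
    table[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = dist(a[i - 1], b[j - 1]) + min(
                table[i - 1][j],       # vertical move
                table[i][j - 1],       # horizontal move
                table[i - 1][j - 1],   # diagonal move
            )
    return table[m][n]


print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))   # 0.0: the repeated 2 is absorbed by warping
print(dtw_distance([0], [1, 2]))               # 3.0
```

The first call shows why DTW suits sequences of different lengths: a repeated symbol is matched to the same element at no extra cost.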

For example, if the sequences A = {5, 5, 6, 6, 6, 7, 7, 7, 7} and B = {1, 1, 1, 2, 2, 2, 3, 3}, then the computation of the dynamic time warping distance is depicted by Table 1.

Table 1 Computation of the optimal warping path whose cells are in bold

The warping path is the one that connects the bold numbers in each column of Table 1. The total DTW distance corresponds to the last bold element, equal in this example to 1125. The term “Infty” corresponds to the mathematical sign “∞”.

Restrictions on the warping function

The warping path is determined based on a dynamic programming approach that aligns two sequences. Computing all possible paths is “combinatorically intractable”. Therefore, there is a need to limit the number of possible warping paths. The following constraints are required to reduce the search space.

❖Boundary condition: This constraint ensures that the warping path begins with the starting points of both signals and ends with their endpoints.
❖Monotonicity condition: This constraint preserves the time-order of points (there is no return in time).
❖Continuity (step size) condition: This constraint restricts the path moves to adjacent points in time (no jumps in time).
❖Warping window condition: Permissible points can be limited to fall within a given warping window of width r (Sakoe-Chiba band).
An acceptable warping path uses chess king moves, which are as follows:
❖Horizontal moves
❖Vertical moves
❖Diagonal moves
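The warping window condition can be sketched by filling only the cells (i, j) of the DTW table with |i − j| ≤ r. The function name `dtw_banded` and the absolute-difference element distance are assumptions of this illustration; the boundary, monotonicity, and continuity conditions are enforced by the three king moves of the recursion itself.

```python
from math import inf


def dtw_banded(a, b, r, dist=lambda x, y: abs(x - y)):
    """DTW distance restricted to a Sakoe-Chiba band of half-width r."""
    m, n = len(a), len(b)
    table = [[inf] * (n + 1) for _ in range(m + 1)]
    table[0][0] = 0.0
    for i in range(1, m + 1):
        # Only the cells with |i - j| <= r are admissible.
        for j in range(max(1, i - r), min(n, i + r) + 1):
            table[i][j] = dist(a[i - 1], b[j - 1]) + min(
                table[i - 1][j], table[i][j - 1], table[i - 1][j - 1]
            )
    return table[m][n]    # stays infinite when no admissible path exists


a, b = [0, 0, 0, 1], [0, 1, 1, 1]
print(dtw_banded(a, b, r=3))   # 0.0: the band is wide enough for the optimal path
print(dtw_banded(a, b, r=1))   # 1.0: the narrow band forces one mismatch
```

A narrow band reduces the number of cells to evaluate but can exclude the unconstrained optimum, as the second call shows; when the two lengths differ by more than r, no admissible path exists and the result is infinite.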

Template generation

Templates are generated by computing the Freeman’s chaincode representation of each numeral within this set of three clusters to respectively form Codebook C₁, Codebook C₂, and Codebook C₃. The Codebook generation is achieved through the sequence of tasks explained through the diagram depicted by Fig. 9 (proc function). It is worth noticing that only a few fonts are considered during template generation.

Classification of numerals

This classification phase is conducted as follows: (i) initially, we assign the numeral input test to its corresponding cluster (the winner cluster). (ii) We then apply all tasks described by Fig. 9 to compute its Freeman’s chaincode. We finally invoke dynamic-time warping distance to classify all numerals. Since we are conducting a multi-font numeral recognition task, the optimal class associated to the input test is the most frequent class of numerals across all fonts considered in the Codebooks. This optimal class is determined via the statistical mode measure. Therefore, the decision criterion is expressed mathematically as follows: Determine the class ${\upomega }^{*}$ such that:

$${\omega }^{*}=\mathrm{Mode}\left\{\underset{{\omega }_{i}}{\mathrm{argmin}}\left(\mathrm{DTW}\left(S,{S}_{{\omega }_{i}}^{{f}_{j}}\right)\right),\forall j\right\}$$

(3)

where S is the input Freeman chaincode and $S_{\omega_i}^{f_j}$ is the chaincode assigned to class $\omega_i$ for the font $f_j$. One can notice from Fig. 10 that cluster 2 has only one numeral and therefore does not require any further processing. It is crucial to point out that the pattern-matching technique does not need training and is very efficient within the context of small databases. Since TDA is applied first, only a few numerals are in competition, and pattern matching as a subsequent task removes the remaining confusions between numerals (refer to Fig. 10).
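The decision rule of Eq. (3) can be sketched as a vote across fonts: for each font, the class of the closest template is retained, and the most frequent class wins. The codebook layout (a dictionary mapping each font to its per-class chaincodes), the toy chaincodes, the font names, and the absolute-difference element distance are all assumptions of this illustration.

```python
from collections import Counter
from math import inf


def dtw_distance(a, b):
    """Plain DTW with |x - y| as element distance (assumed for this sketch)."""
    table = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    table[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            table[i][j] = abs(a[i - 1] - b[j - 1]) + min(
                table[i - 1][j], table[i][j - 1], table[i - 1][j - 1]
            )
    return table[-1][-1]


def classify(chaincode, codebook):
    """Mode, over fonts, of the class whose template is closest to chaincode."""
    votes = []
    for templates in codebook.values():       # one {class: chaincode} dict per font
        best = min(templates, key=lambda c: dtw_distance(chaincode, templates[c]))
        votes.append(best)
    # On a tie, most_common keeps the class that was voted for first.
    return Counter(votes).most_common(1)[0][0]


# Hypothetical codebook of the winner cluster, restricted to two classes and two fonts.
codebook = {
    "fontA": {0: [0, 2, 4, 6], 6: [6, 6, 0, 2, 4]},
    "fontB": {0: [0, 2, 2, 4, 6], 6: [6, 0, 2, 4, 4]},
}
print(classify([0, 0, 2, 4, 6], codebook))   # 0
```

Because only the templates of the winner cluster enter the codebook, the number of DTW evaluations per test image stays small.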

Experiments and results

This section is devoted to the collection of data, the computation of the font diversity ratio, and the numerals clustering based on TDA.

Data collection

We have collected a database of numeral images using 33 fonts. Therefore, the total number of numeral images is 330 (33 × 10). This set is heterogeneous in the sense that it contains similar and dissimilar font shape types (refer to Fig. 11). It is worth underscoring that digit 4 of both the Blackadder ITC and Bauhaus 93 fonts does not include a hole; therefore, digit 4 of Blackadder ITC was replaced by digit 4 of the Calibri font (sign + in Fig. 11a), and digit 4 of Bauhaus 93 was replaced by digit 4 of the Constantia font (sign * in Fig. 11a).

Font diversity ratio

The diversity ratio of fonts represents the level of similarity of fonts contained in the entire database that is used for numeral recognition. This criterion requires a classification scheme of all types of fonts. Table 2 illustrates this taxonomy extracted from reference (https://www.freecodecamp.org/news/typography-type-families-classifications-and-combining-typefaces). The proposed methodology is based on the fonts list depicted by Fig. 11. Table 3 shows the mapping of each font of Fig. 11 to its type numbered from 1 to 8 in Table 2.

Table 2 Font type classification


Table 3 Mapping of fonts used in the proposed approach to their types


From this table, one can compute a diversity ratio δ (representing a dissimilarity level across fonts) which is expressed as follows:

$$\delta =\frac{DT}{AT}\times 100\%$$

(4)

where DT denotes the number of different types of fonts used, and AT is the total number of types. For the fonts of Fig. 11, the ratio is 7/8 = 87.5% (DT = 7 out of the AT = 8 types listed in Table 2). This high value corresponds to a low similarity of fonts.
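Equation (4) can be checked with a few lines of Python. The sketch below is only illustrative: the function name `diversity_ratio` and the list of type identifiers are hypothetical, chosen so that seven of the eight types appear.

```python
def diversity_ratio(font_types, total_types):
    """Eq. (4): percentage of distinct font types used among all types."""
    distinct_types = len(set(font_types))  # DT
    return 100.0 * distinct_types / total_types  # AT = total_types


# Hypothetical type identifiers (1 to 8) for a handful of fonts.
example_types = [1, 2, 3, 4, 5, 6, 7, 1, 2, 2, 5]
print(diversity_ratio(example_types, total_types=8))
```

A ratio close to 100% means that almost every font type is represented, so the fonts in the database are, on the whole, dissimilar.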

Two-stage numeral recognition

To validate our methodology during this experimental phase, we have performed an automatic clustering of the 330 numeral images into 3 clusters based on topological features. Furthermore, we have considered only 13 fonts out of the 33. These 13 fonts are selected arbitrarily. Hence, 130 (13 × 10) images are collected. We extracted topological features from these 130 images and performed a clustering based on these features with a VR simplicial complex threshold set to $d=3\sqrt{2}$. This optimal value was determined through a fine-tuning procedure (minimum recognition error rate achieved) based on the rescaling parameters M and N (M = N = 128). We relied on the JavaPlex platform [23] to perform this operation. We finally conducted a classification based on the DTW procedure. The results of the first-stage recognition are given in Table 4. The accuracy reported is 100%. This high performance shows that TDA is very accurate during the initial clustering of numeral images. We subsequently partitioned the database in order to create two cases: (i) the fonts of the test samples are contained in the codebook (set of reference templates), and (ii) the fonts of the test samples are not part of the codebook. We now provide some details about these two settings.

Table 4 Numeral clustering using only TDA, accuracy = 100%
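To make the first-stage idea concrete, the sketch below counts connected components and holes of a small binary image and uses the resulting pair of Betti numbers as a cluster key. It is a simplified stand-in and not the procedure used in the experiments: instead of building a Vietoris-Rips complex and computing homology with JavaPlex, it uses plain flood fill on the pixel grid (8-connectivity for the foreground, 4-connectivity for the background). All names and the toy images are hypothetical.

```python
N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_components(grid, target, neighbours):
    """Count connected regions of cells equal to `target` by flood fill."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != target or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for dy, dx in neighbours:
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == target and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return count


def betti_numbers(image):
    """Return (b0, b1) for a binary image given as rows of 0/1 integers."""
    cols = len(image[0])
    # Pad with background so that the outside forms one single region.
    border = [0] * (cols + 2)
    padded = [border] + [[0] + list(row) + [0] for row in image] + [border]
    b0 = count_components(padded, 1, N8)      # connected components
    b1 = count_components(padded, 0, N4) - 1  # holes: drop the outer background
    return b0, b1


# Toy images: a stroke without hole, a ring, and two stacked rings.
ONE = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
ZERO = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
EIGHT = [[1, 1, 1], [1, 0, 1], [1, 1, 1], [1, 0, 1], [1, 1, 1]]

for name, img in [("ONE", ONE), ("ZERO", ZERO), ("EIGHT", EIGHT)]:
    print(name, betti_numbers(img))
```

Images that share the same pair of Betti numbers fall into the same cluster, so only the numerals of that cluster need to be compared in the second stage.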


Case where testing fonts are contained in the reference codebook

In case 1, we have designed two different scenarios. In the first scenario, only 13 fonts out of 33 have been considered (fonts delimited by the red region in Fig. 11a and the shaded region in Fig. 11b). Therefore, the total number of images is equal to 130 = 13 × 10 numeral images, with a fixed numeral font size set to 350. This set of images serves as the reference templates that form the codebook. To form a test set of 200 images, we have considered only the first ten fonts of Fig. 11a, counted line by line from left to right. Only two font sizes were considered: 250 and 450. Therefore, the tally is 10 × 10 images for font size 250 and 10 × 10 images for font size 450, which gives a total of 200 images. Furthermore, in this scenario, the fonts adopted in the codebook are similar in shape to the fonts used in the test set. The only difference between the reference set and the test set is the font size. Tables 5 and 6 depict the confusion matrices obtained in this scenario without the incorporation of TDA and with its incorporation, respectively.

Table 5 Confusion matrix using DTW without TDA clustering, accuracy = 93.5% (case 1, scenario 1)


Table 6 Confusion matrix using DTW with TDA clustering, accuracy = 95% (case 1, scenario 1)


In the second scenario, to build the test set, we considered all 13 fonts, from which we generated 130 images with font size 450 and 130 images with font size 250, for a total of 260 images. The set of codebook reference templates is left unchanged. The confusion matrices obtained in this scenario without the incorporation of TDA and with its incorporation are depicted by Tables 7 and 8, respectively.

Table 7 Confusion matrix using DTW without TDA clustering, accuracy = 91.92% (case 1, scenario 2)


Table 8 Confusion matrix using DTW with TDA clustering, accuracy = 93.85% (case 1, scenario 2)


Case where testing fonts are not part of the reference codebook

In case 2, as in the previous scenarios, we have chosen the same 13 fonts out of the 33 as the template set. The test set contains the 10 numerals of each of the 20 remaining fonts. Consequently, the test set comprises 200 images. In order to assess the contribution of TDA alone, we have first classified the numerals, represented by their chaincodes, using only the DTW pattern-matching technique (no prior clustering was performed). The class decision adopted in this case is given by Eq. (3), with a Codebook of 130 images. Table 9 depicts the confusion matrix as well as the accuracy obtained, which is equal to 85%. Finally, Table 10 depicts the confusion matrix using both TDA and DTW. The accuracy in this latter experiment has reached 92%. This improvement in accuracy is explained mainly by the contribution of TDA.

Table 9 Confusion matrix using DTW without TDA clustering, accuracy = 85% (case 2)


Table 10 Confusion matrix using DTW with TDA clustering, accuracy = 92% (case 2)


In fact, TDA has removed the confusion between several pairs of numerals, such as 1 and 4, that are not topologically equivalent. One can notice, for example, that the true positive rate of class 1 was 40% without TDA and rose to 70% with the contribution of TDA. Likewise, the true positive rate of class 8 was 85% without TDA and has reached 100% with the contribution of TDA. Furthermore, the same improvement was observed for the pairs of numerals (7, 4) and (7, 9): TDA has removed these confusions. Overall, the relative accuracy improvement using TDA is 8.23% (from 85 to 92%). Since the topological properties (Betti numbers) of all numerals are known beforehand, and the accuracy from Table 4 is 100%, one can state that the topological information from a sample of points is compatible with the one conveyed by the null model.
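The relative improvement quoted above is the accuracy gain divided by the DTW-only accuracy. The short sketch below recomputes it for the three reported settings and converts the accuracies into numbers of correctly recognized images. The accuracies and test-set sizes are those reported in Tables 5 to 10; the function and variable names are hypothetical.

```python
def relative_improvement(baseline_acc, improved_acc):
    """Relative gain, in percent of the baseline accuracy."""
    return 100.0 * (improved_acc - baseline_acc) / baseline_acc


def correct_count(accuracy_percent, n_images):
    """Number of correctly recognized images implied by an accuracy."""
    return round(accuracy_percent * n_images / 100.0)


# (DTW only, TDA + DTW, number of test images) for each reported setting.
SETTINGS = {
    "case 1, scenario 1": (93.5, 95.0, 200),
    "case 1, scenario 2": (91.92, 93.85, 260),
    "case 2": (85.0, 92.0, 200),
}

for name, (dtw_only, tda_dtw, n) in SETTINGS.items():
    gain = relative_improvement(dtw_only, tda_dtw)
    extra = correct_count(tda_dtw, n) - correct_count(dtw_only, n)
    print(f"{name}: relative gain {gain:.2f}%, {extra} additional correct images")
```

For case 2 the relative gain is 7/85, i.e., roughly 8.2%, which corresponds to 14 additional correctly recognized images out of 200.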

Performance comparison and assessment of the proposed approach

We now compare our approach with some major state-of-the-art techniques.

Comparison of the proposed approach with different scenarios

The improvement in recognition of confusing Arabic machine-printed numerals depends on the contribution of TDA. Table 4 has shown that TDA is capable of categorizing numerals perfectly (with 100% accuracy) into three clusters. This contribution allows numerals to be subsequently classified within these three clusters only. However, one has to distinguish between the two cases: (i) testing fonts are part of the reference Codebook and (ii) testing fonts are not part of the reference Codebook. In the first case, the use of TDA has improved the accuracy from 93.5 to 95% (refer to the confusion matrices of Tables 5 and 6). Likewise, the accuracy went up from 91.92 to 93.85% when all 13 fonts are used for testing (refer to the confusion matrices of Tables 7 and 8). However, in the second case (testing fonts are not part of the reference Codebook), the accuracy increased substantially from 85 to 92% when TDA is applied (refer to the confusion matrices of Tables 9 and 10). This accuracy is noteworthy since (i) the classifier was tested on numerals with unknown fonts, and (ii) the database used is relatively small and the classifier does not involve any large-scale training.

Comparison with some major state-of-the-art approaches

We compare our methodology with some papers from the literature that use TDA and others that are more general:

TDA approaches

In this section, we highlight some applications of TDA and show how they differ from our approach (refer to Table 11).

Table 11 Comparison with some major TDA approaches


General approaches

We have compared our approach with major methodologies proposed in the literature. Table 12 shows the size of training and testing sets, the number of fonts involved, the visual similarity between fonts, and the average accuracy for each technique deployed for the recognition task. This table also indicates whether all testing fonts have been included in the training set. Table 13 depicts the feature extraction techniques and the classifiers used by the same contributors. The comparison takes into account the following three criteria:

a. The size of the training set (or Codebook for template matching) and testing sets used.
b. The size of training samples (or samples from the Codebook) contained in the testing sets.
c. The diversity ratio assigned to the set of fonts used to build the database.
- The authors in reference [2] indicated that the entire training set was reused as the testing set under the “resubstitution” operation. Their font diversity ratio is low (i.e., they have used similar fonts), and their database is very large, which is far from the small-database setting addressed by our approach. Furthermore, it is important to point out that our approach does not require training.
- The authors in reference [3] did not provide much information about the number of fonts used and whether training samples are contained in the testing set. However, they claimed that the font diversity ratio is low, which may explain the high accuracy obtained.
- The authors in reference [4] reported that their training set was reused during testing, and their font diversity ratio is low.
- The authors in [5] did not disclose any detailed information about the context of their experiment. However, the size of their database is close to our database size, which makes these two approaches comparable.
- The authors in reference [6] relied on a small database size, and their font diversity ratio is medium but their performance remains low.
- Finally, we obtained accuracies of 95%, 93.85%, and 92%, depending on the database size, the font diversity ratio, and whether training samples are part of the testing set.

Table 12 Average accuracies of some state-of-the-art methodologies


Table 13 Feature extraction and classification techniques of the methodologies listed in Table 12


Pros and cons of TDA

The following table provides an insight into whether TDA is worth applying to the recognition of machine-printed Arabic numerals. Table 14 depicts the advantages and limitations of TDA in this specific application:

Table 14 Advantages and limitations of TDA


Conclusions

We have explored TDA in order to investigate whether this mathematical paradigm is capable of extracting relevant information that improves recognition of Arabic machine-printed numerals. This exploration is performed through clustering of numeral images based on their topological invariants. Hence, we have reduced the number of competing classes through clustering and consequently lowered the confusion between numerals. We have demonstrated that TDA clusters Arabic machine-printed numerals accurately even when numeral fonts are dissimilar in shape. This dissimilarity is expressed via a font diversity ratio that we have developed as a criterion for performance comparison purposes. We have observed that the recognition accuracy using TDA and DTW successively is higher than the one using only DTW, most notably (85 to 92%) when the testing fonts are not part of the codebook. We have assessed the limitations and advantages of TDA in this numeral recognition task. Our future work will be focused on:

Improving the second-stage classification by modifying DTW distance through the incorporation of a matrix that contains cost values proportional to the difference of two angular directions of symbols depicting a chaincode.
Extending this exploration to font-independent alphanumeric character recognition.
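As a rough indication of what the first future direction could look like, the sketch below builds a cost matrix whose entries are proportional to the angular difference between two chaincode directions and plugs it into a standard DTW recursion. This is only one possible reading of the idea. The 8-direction code (45° per step), the wrap-around handling, and all function names are assumptions made for illustration.

```python
def angular_cost_matrix(directions=8):
    """cost[a][b] = smallest angle (in degrees) between directions a and b."""
    step = 360.0 / directions
    return [
        [min(abs(a - b), directions - abs(a - b)) * step for b in range(directions)]
        for a in range(directions)
    ]


def dtw_with_cost_matrix(a, b, cost):
    """Standard DTW recursion using a symbol-to-symbol cost matrix."""
    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j] = cheapest cost of aligning a[:i] with b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = cost[a[i - 1]][b[j - 1]]
            d[i][j] = local + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


COST = angular_cost_matrix()
# Directions 0 and 7 are neighbours on the circle, 0 and 4 are opposite.
print(dtw_with_cost_matrix([0, 0, 1], [7, 0, 1], COST))
print(dtw_with_cost_matrix([0, 0, 1], [4, 0, 1], COST))
```

With such a matrix, a template whose strokes deviate slightly in direction is penalized less than one whose strokes point the opposite way, which a plain match/mismatch cost cannot express.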

Availability of data and materials

The corresponding author can provide the datasets used during the current study upon reasonable request.

Abbreviations

CL: Collective learning
DA: Data augmentation
DTW: Dynamic-time warping
HT: Homology theory
kNN: K-nearest neighbors
LPR: License plates recognition
ML: Machine learning
PHT: Persistent homology theory
TDA: Topological data analysis
TL: Transfer learning
USPS: US postal service
VR: Vietoris-Rips

References

[1] Kumar M, Jindal MK, Sharma RK, Jindal SR (2019) Character and numeral recognition for non-Indic and Indic scripts: a survey. Artif Intell Rev 52(4):2235–2261
[2] Hassanpour H, Samadiani N, Akbarzadeh F (2017) A modified self-organizing map neural network to recognize multi-font printed Persian numerals. Int J Eng IJE 30(11):1700–1706
[3] Khedidja D, Hayet M (2019) Multiple classifiers and invariant features extraction for digit recognition. IJECE 11(1):41–52
[4] Alqudah AT, Al-Zoubi HR, Al-Khassaweneh M (2012) Shift and scale invariant recognition of printed numerals. Abhath Al-Yarmouk Basic Sci Eng 21(1):41–49
[5] Dhandra BV, Malemath VS, Mallikarjun H, Hegadi R (2006) Multi-font numeral recognition without thinning based on directional density of pixels. In: 2006 1st International Conference on Digital Information Management. IEEE, Bangalore, p 157–160
[6] Radha R, Aparna RR (2014) Automatic extraction, segmentation and recognition of multi-font Indian Pincode. IJCVR 4(3):247–258
[7] Salameh M, Salem AA (2016) Hyper recognition techniques for English digits using statistical analysis of nodes and fuzzy logic for pattern recognition. Int J Multi Sci Eng 7(8):1–7
[8] Wang Y, Lian Z (2020) Exploring font-independent features for scene text recognition. Proceedings of the 28th ACM International Conference on Multimedia. pp 1900–1920
[9] Kundaikar T, Pawar JD (2020) Multi-font Devanagari text recognition using LSTM neural networks. First International Conference on Sustainable Technologies for Computational Intelligence. Springer, Singapore, pp 495–506
[10] Sharma R, Kaushik B, Gondhi N (2020) Character recognition using machine learning and deep learning-a survey. In: 2020 International Conference on Emerging Smart Computing and Informatics (ESCI). IEEE, Pune, p 341–345
[11] Stricker D (2019) Multi-font printed Amharic character image recognition: deep learning techniques. Advances of Science and Technology: 6th EAI International Conference, ICAST 2018, Bahir Dar, Ethiopia, October 5–7, 2018, Proceedings, vol 274. Springer, Bahir Dar, p 322
[12] Silva SM, Jung CR (2020) Real-time license plate detection and recognition using deep convolutional neural networks. J Vis Commun Image Represent 71:102773
[13] Bouchaffra D, Tan J (2006) Structural hidden Markov models: an application to handwritten numeral recognition. Intelligent Data Analysis 10(1):67–79
[14] Jha M, Kabra M, Jobanputra S, Sawant R (2019) Automation of cheque transaction using deep learning and optical character recognition. In: 2019 International Conference on Smart Systems and Inventive Technology (ICSSIT). IEEE, Tirunelveli, p 309–312
[15] Chowdhury AI, Rahman MS, Sakib N (2019) A study of multiple barcode detection from an image in business system. Int J Comput Appl 181(37):30–37
[16] Savino P, Tonazzini A (2016) Digital restoration of ancient color manuscripts from geometrically misaligned recto-verso pairs. J Cult Herit 19:511–521
[17] Bouchaffra D, Govindaraju V, Srihari SN (1999) Postprocessing of recognized strings using nonstationary Markovian models. IEEE Trans Pattern Anal Mach Intell 21(10):990–999
[18] Adcock A, Carlsson E, Carlsson G (2016) The ring of algebraic functions on persistence bar codes. Homol Homotopy Appl 16(1):381–402
[19] Kališnik S (2019) Tropical coordinates on the space of persistence barcodes. Found Comput Math 19(1):101–129
[20] Choi HR, Kim T (2018) Modified dynamic time warping based on direction similarity for fast gesture recognition. Math Probl Eng 2018:1–9
[21] Edelsbrunner H, Harer JL (2010) Computational topology: an introduction. American Mathematical Society, Providence, Rhode Island
[22] Otter N, Porter MA, Tillmann U, Grindrod P, Harrington HA (2017) A roadmap for the computation of persistent homology. EPJ Data Science 6(1):17
[23] Adams H, Tausz A, Vejdemo-Johansson M (2014) JavaPlex: a research software package for persistent (co)homology. In: International Congress on Mathematical Software. Springer, Seoul, p 129–136
[24] Pola FPB, Pola IRV (2019) Optimizing computational high-order schemes in finite volume simulations using unstructured mesh and topological data structures. Appl Math Comput 342:1–17
[25] De Silva V, Gunnar EC Topological estimation using witness complexes. In: Symposium on Point Based Graphics. IEEE, Goslar, Germany, p 157–166
[26] Lee T-C, Kashyap RL, Chu C-N (1994) Building skeleton models via 3-D medial surface/axis thinning algorithms. Comp Vision Graph Image Proc 56(6):462–478
[27] Freeman H (1961) On the encoding of arbitrary geometric configurations. IRE Trans Elec Comput EC-10(2):260–268
[28] Vintsyuk TK (1968) Speech discrimination by dynamic programming. Cybernetics
[29] Sakoe H, Chiba S (1978) Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing
[30] Tauzin G, Lupo U, Pérez TL, Caorsi JB, Medina-Mardones M, Hess K (2021) giotto-tda: a topological data analysis toolkit for machine learning and data exploration. J Mach Learn Res 22(39):1–6
[31] Garin A, Tauzin G (2019) A topological “reading” lesson: classification of MNIST using TDA. In: 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA). pp 1551–1556
[32] Turkeš N, Nys R, Verdonck J, Latré S (2021) Noise robustness of persistent homology on greyscale images across filtrations and signatures. PloS One 16(9):e0257215

Acknowledgements

We are grateful to the General Direction of Scientific Research (DGRSDT) for their continuous financial support in this research.

Funding

This research work is supported by a grant from the General Direction of Scientific Research & Development (DGRSDT), under the number (DGRSDT-13), Algeria.

Author information

Authors and Affiliations

Division Architecture des Systèmes et Multimédia, Centre de Développement des Technologies Avancées - CDTA, Cité 20 Août 1956, Baba Hassen, PO. Box 17, Algiers, 16081, Algeria
Djamel Bouchaffra & Faycal Ykhlef


Contributions

D. Bouchaffra designed the study and wrote the manuscript. F. Ykhlef performed the experiments, collected the data, provided critical feedback, and contributed to the writing of the manuscript. Both authors confirmed the results, and read and approved the final version of the manuscript.

Corresponding author

Correspondence to Faycal Ykhlef.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research was conducted while Professor Djamel Bouchaffra was a faculty member at Oakland University, Michigan, USA.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Bouchaffra, D., Ykhlef, F. Exploring topological data analysis for information extraction: application to recognition of Arabic machine-printed numerals. J. Eng. Appl. Sci. 71, 16 (2024). https://doi.org/10.1186/s44147-023-00346-x


Received: 17 May 2023
Accepted: 18 December 2023
Published: 13 January 2024
DOI: https://doi.org/10.1186/s44147-023-00346-x
