Predicting the compressive strength of ultra‑high‑performance concrete using a decision tree machine learning model enhanced by the integration of two meta‑heuristic optimization algorithms



Introduction
Concrete is the foremost cement-based composite widely employed in construction projects [1]. However, progressively intricate application contexts now demand heightened performance standards [2]. In response, ultra-high-performance concrete (UHPC), an innovative cement-based composite, has advanced rapidly in recent years in both theoretical exploration and practical implementation [3]. UHPC demonstrates remarkable advantages in fulfilling the intricate requisites of modern construction, encompassing lightweight structures, expansive spans, and national defence projects, owing to its exceptional mechanical attributes and durability [4]. Diverging from conventional concrete types, UHPC's core objective of achieving outstanding performance revolves around cultivating a dense particle packing arrangement. Consequently, supplementary cementitious materials (SCMs) such as silica fume, fly ash, limestone powder, and metakaolin must be incorporated to fill voids among the larger particles. This incorporation of SCMs, however, leads to a more intricate and variable UHPC mix, which in turn introduces instability in UHPC's performance, including its mechanical characteristics, workability, and rheological properties. Thus, a fitting methodology for UHPC mix design becomes imperative [5][6][7].
However, traditional mix design approaches often rely on empirical knowledge and are sometimes offered without substantiation, lacking the guidance of particle packing theories. Currently, theoretical design approaches for UHPC predominantly stem from particle-dense packing models, categorizable into discrete and continuous models [8]. Discrete models assume a specific set of particle sizes, whereas continuous models consider a continuous distribution of particle sizes seamlessly integrated into size distribution systems. In 2013, the American Society of Civil Engineers assigned a D+ rating to the deteriorating US infrastructure. Principal factors contributing to this decay are the corrosion of steel reinforcement and concrete degradation due to the infiltration of corrosive ions [9].
In comparison to standard concrete, UHPC stands out with substantial enhancements in mechanical and durability properties. UHPC holds the potential to address the prevailing state of dilapidated infrastructure effectively. A series of conferences held in Kassel, Germany [10][11][12]; Marseille, France [13]; and Des Moines, USA [14], have effectively showcased the material's performance and applicative prospects. Despite its impressive capabilities, the widespread adoption of UHPC faces obstacles arising from elevated material costs and sustainability concerns. The increased expenses stem from various factors intrinsic to UHPC, including the need for superior-quality materials, costly fibre reinforcements, and corresponding quality assurance [15]. Efforts have been undertaken to mitigate costs through the utilization of more affordable, locally available constituents.
Machine learning (ML) algorithms, such as artificial neural networks (ANNs), have gained broad acceptance in various fields due to their ability to predict outcomes accurately, in line with experimental results [16][17][18]. Nevertheless, experiments can involve intricate test matrices with many parameters, some of which contribute only minimally to the outcomes. In response, computer scientists have developed selection algorithms based on data-driven models [19][20][21]. These algorithms effectively identify the most relevant independent variables, swiftly reducing the dimensionality of the input matrix. The demand for soft computing tools in predictive modelling in engineering, covering components, systems, and materials, continues to rise steadily [22][23][24]. Among these tools, the ANN has emerged as a leading soft computing approach, finding successful application across different engineering domains. The usefulness of ANNs extends to tasks such as prediction, approximation, character and pattern recognition, image processing, forecasting, classification, optimization, and control. This versatility has motivated researchers to propose and apply ANN models to a wide array of issues in civil engineering. Notably, ANN behavioural modelling has been extensively employed to study concrete structural elements. Recent efforts have extended this research to various ANN models for predictive tasks related to building materials such as steel, concrete, and composites [25]. Concrete, in particular, has garnered significant interest: ANN modelling, leveraging accumulated experimental data, has effectively addressed its fresh and hardened properties [26].
Additionally, predicting concrete's compressive strength has become a prolific area of investigation in which ANN models play a crucial role. The use of ANNs to predict the compressive strength of diverse concrete types, including normal-weight, lightweight, and recycled concrete, has intrigued researchers [27]. Simultaneously, exploring different ML techniques has enabled the comprehension of high-performance concrete's compressive strength. As the field has progressed, the introduction of UHPC has spurred further refinements in ANN modelling, broadening its application to predictive analyses of this cutting-edge material's behaviour [28].
This study introduces a novel ML technique to improve the precision of UHPC prediction. Its primary focus is obtaining extremely precise predictions of UHPC outcomes, a crucial component of civil engineering. The study uses the decision tree (DT) model because collecting empirical data has inherent difficulties; however, careful parameter fine-tuning is essential to the DT model's success. The study therefore adopts a dual-algorithm approach that combines the Sea-horse Optimizer (SHO) and the Crystal Structure Algorithm (CryStAl) to obtain the best possible performance from the DT model. This fusion proves powerful, greatly improving the DT model's accuracy and efficiency. The practical benefits of this strategy are especially notable in the infrastructure sector, where they simplify the design and construction of UHPC structures. With the aid of a sizable UHPC dataset, thorough comparative analyses are carried out to support the validity of the proposed framework. The results demonstrate a promising route to precise UHPC forecasts in civil engineering projects by incorporating the DT algorithm into this ML methodology.
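To make the DT-plus-optimizer coupling concrete, the sketch below shows one plausible way a meta-heuristic such as SHO or CryStAl can tune a decision tree: the optimizer proposes a real-valued vector, the vector is decoded into DT hyperparameters, and the validation RMSE serves as the fitness to be minimized. This is a minimal illustration assuming scikit-learn's DecisionTreeRegressor; the decoded parameter set and search ranges are illustrative, not the paper's exact configuration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

def dt_fitness(params, X_tr, y_tr, X_val, y_val):
    """Fitness for a meta-heuristic tuner: validation RMSE of one DT candidate.

    `params` is a real-valued vector proposed by SHO or CryStAl and decoded
    into DT hyperparameters; the parameter names and ranges here are
    illustrative assumptions, not the paper's exact configuration.
    """
    depth = max(2, int(round(params[0])))      # e.g. searched over [2, 20]
    min_leaf = max(1, int(round(params[1])))   # e.g. searched over [1, 10]
    model = DecisionTreeRegressor(
        max_depth=depth, min_samples_leaf=min_leaf, random_state=0
    ).fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    return rmse                                # the optimizer minimizes this
```

Any population-based optimizer can then iterate over candidate `params` vectors, calling `dt_fitness` as its objective; the best vector found defines the tuned (DTSH- or DTCS-style) model.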

Data gathering
A meticulous approach assesses ultra-high-performance concrete (UHPC) behaviour across numerous variables. The effort involves precise data management, dividing the dataset into training (70%), validation (15%), and testing (15%) subsets. The foundation is a dataset of 110 experimental samples from prior research, validating the empirical distribution method and fortifying the predictive models. UHPC behaviour assessment and prediction utilize a decision tree (DT) model, leveraging the predictive information carried by the variables outlined in Table 1. The concrete mix design includes eight inputs: cement content (C), sand-cement ratio (S/C), silica fume-cement ratio (SF/C), fly ash-cement ratio (FA/C), steel fibre-cement ratio (STF/C), quartz powder-cement ratio (QP/C), water-cement ratio (W/C), and admixture-cement ratio (Ad/C). C is expressed in kg/m3, while the remaining inputs are expressed as percentages relative to C. The output, compressive strength (CS), quantified in megapascals (MPa), supports a robust comprehension of UHPC behaviour and predictive modelling insights. A 2D kernel plot (Fig. 1) visually illustrates the input-output interplay, representing associations between the inputs and CS and depicting their joint distribution or correlation. The plot shows pairs of data points, with one axis showing an input variable (e.g. cement content or the S/C ratio) and the other the CS values; each point signifies an experimental sample with its connected input and output. The plot aids in discerning trends, patterns, and interdependencies, identifying the input combinations with the greatest impact on UHPC strength. This representation helps researchers comprehend variable relationships and input-output impacts. Within the UHPC evaluation context, the 2D kernel plot enhances understanding of predictive model efficacy by visually illustrating the links between concrete mix design and compressive strength, enriching performance insight [29].
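The 70/15/15 partition described above can be reproduced in two stages; a minimal sketch assuming scikit-learn's train_test_split, with synthetic arrays standing in for the 110-sample dataset of Table 1:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 110-sample dataset of Table 1
# (8 inputs: C, S/C, SF/C, FA/C, STF/C, QP/C, W/C, Ad/C; output: CS in MPa)
rng = np.random.default_rng(0)
X = rng.random((110, 8))
y = rng.random(110) * 150

# 70 % training, then the remaining 30 % split evenly into validation and testing
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, train_size=0.70, random_state=42)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.50, random_state=42)
print(len(X_tr), len(X_val), len(X_te))   # 77, 16, 17 samples
```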

Decision tree (DT)
The decision tree (DT) is a widely used supervised learning technique for resolving classification and regression issues. When a specific categorical grouping or classification is absent, the regression form of the technique can still predict the likely outcome from the independent variables, thanks to the hierarchical, divided structure of the tree [30, 31]. The model shown in Fig. 2 is a straightforward decision tree with a single binary target variable Y (taking values 0 or 1) and two continuous variables, x1 and x2, whose values all fall between 0 and 1. Additionally, as shown in Fig. 3, the arrangement can be thought of as a segmented view of the sample space. The analytical framework frequently used involves dividing the sample space into distinct, well-defined, and comprehensive segments. Each of these segments relates directly to a particular leaf node, which denotes the result of a series of subsequent decision-making steps. Every record in a decision tree is assigned to a single segment, called a leaf node, which serves as its home. Determining the most efficient model that can precisely segment all available data into distinct segments is the main goal of using decision trees for analysis [32].
Nodes and branches are the basic building blocks of a decision tree model, and splitting, stopping, and pruning procedures are important steps in its construction [33].

Nodes
Nodes fall into three distinct categories; the sketch after this list shows how they can be counted on a fitted tree.
1. Primary nodes, called decision nodes, denote a choice to partition or subset all the data.
2. Intermediary nodes, or chance nodes, represent a constrained range of potential decisions that can be made at a specific location in the hierarchical structure.
3. Terminal nodes, also called end nodes, represent the outcome of a string of assessments or events.
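A minimal sketch, assuming scikit-learn, that distinguishes the terminal nodes from the decision/chance nodes of a fitted tree; the toy data are purely illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=2, random_state=0)
t = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y).tree_

# Terminal (end) nodes are flagged by a child index of -1; every other
# node is a decision or chance node somewhere along the hierarchy.
leaves = (t.children_left == -1).sum()
print(f"total nodes: {t.node_count}, internal: {t.node_count - leaves}, terminal: {leaves}")
```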

Branches
A hierarchical structure of branching elements represents chance events when building a decision tree model. A discrimination protocol can be expressed as if-then rules for each path from the root node through the intermediary nodes to a terminal node. For instance, the realization of outcome j may depend on a series of conditions numbered from 1 to k, where the satisfaction of every condition causes outcome j to occur.
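This if-then reading of each root-to-leaf path can be printed directly; a small sketch assuming scikit-learn's export_text helper, with toy data:

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_text

# Each root-to-leaf path of the fitted tree prints as a chain of if-then rules
X, y = make_regression(n_samples=100, n_features=2, random_state=0)
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))
```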

Splitting
To build a model, key input variables must be identified and the records segmented based on them. The purity of the child nodes, determined by the percentage of records meeting the target condition, serves as a guide for choosing the input variables. The partitioning procedure is guided by metrics such as entropy and the Gini index and continues until uniformity or a stopping criterion is met. In most cases, not all available input variables are used, and a particular input variable may be used more than once at different levels of the decision-making hierarchy.
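As a worked example of this purity-driven splitting, the sketch below computes the Gini index of a node and the impurity reduction (gain) of a candidate split; the function names are illustrative:

```python
import numpy as np

def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(labels, left_mask):
    """Impurity reduction of a candidate split; larger gain = purer children."""
    n = len(labels)
    left, right = labels[left_mask], labels[~left_mask]
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - children

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])
mask = np.array([True, True, True, False, False, False, False, False])
print(split_gain(y, mask))   # perfect split: gain equals the parent impurity, ~0.469
```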

Stopping
In statistical modelling, complexity and robustness must be balanced because they interact antagonistically: the accuracy of future projections is inversely correlated with the model's complexity. Even though it is tempting to build a decision tree that matches the current observations with only a small spread of data points in each leaf, such a tree is insufficient for forecasting future cases. Stopping rules must therefore be incorporated during development to prevent excessive complexity. Common stopping-rule parameters are the number of observations required in a leaf, the number of observations required in a node before partitioning, and the tree depth. Analytical goals and dataset characteristics must be examined thoroughly to choose appropriate stopping parameters. Berry and Linoff recommend requiring each leaf node to contain a specific percentage of the records in the entire training dataset, ranging from 0.25 to 1.00%, to reduce overfitting and underfitting. A thorough approach is required to ensure the best accuracy and relevance in modelling.
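In practice such stopping rules map directly onto estimator arguments; a hedged example assuming scikit-learn's parameter names, with illustrative values and min_samples_leaf given as a fraction in the spirit of the Berry-and-Linoff rule:

```python
from sklearn.tree import DecisionTreeRegressor

# Stopping rules written as estimator arguments (values are illustrative)
tree = DecisionTreeRegressor(
    max_depth=6,              # the depth measure
    min_samples_split=10,     # observations required in a node before partitioning
    min_samples_leaf=0.0025,  # each leaf must hold >= 0.25 % of the training records
    random_state=0,
)
```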

Pruning
An alternative to stopping criteria in decision tree modelling is to grow a large tree and then trim it to the ideal size by removing nodes that contribute little on new data. A common technique uses the prediction-error rate on the datasets to choose the best subtree from a pool of candidates, and validating the model on a separate dataset can help identify the ideal answer. Pre-pruning and post-pruning are the two acknowledged pruning techniques in machine learning. Pre-pruning uses statistical tests, such as chi-square tests [34] and multiple-comparison adjustment techniques, to limit the growth of nonsignificant branches. Post-pruning, on the other hand, removes branches optimally after a complete decision tree has been built, to increase classification accuracy on the validation dataset. The specific context and features of the dataset determine which pruning technique is used.
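A compact post-pruning sketch, assuming scikit-learn's cost-complexity API: grow a full tree, enumerate the candidate subtrees via their ccp_alpha values, and keep the one with the lowest error on a held-out validation split.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

# Enumerate candidate subtrees via their ccp_alpha values, then keep the
# subtree with the lowest squared error on the held-out validation split.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
best = min(
    (DecisionTreeRegressor(ccp_alpha=a, random_state=0).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda m: ((m.predict(X_val) - y_val) ** 2).mean(),
)
print("leaves in pruned tree:", best.get_n_leaves())
```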

Sea-horse Optimizer (SHO)
In 2022, Zhao et al. proposed a novel meta-heuristic approach called the Sea-horse Optimizer (SHO). The SHO algorithm is a population-based meta-heuristic technique that mimics the social behaviour of sea horses and consists of three primary components: movement, hunting, and reproduction. The algorithm incorporates both local and global search abilities to balance exploration and exploitation. The movement behaviour is designed for local search, the hunting behaviour for global search, and the reproductive behaviour complements both [35]. The SHO algorithm commences by generating a population of potential solutions:
$$X = \begin{bmatrix} x_1^1 & \cdots & x_1^{dim} \\ \vdots & \ddots & \vdots \\ x_{pop}^1 & \cdots & x_{pop}^{dim} \end{bmatrix} \quad (1)$$

where dim represents the number of dimensions of the search space and pop indicates the population size used in the SHO algorithm; each member of the sea-horse population represents a potential solution within the problem's search space. In a minimization problem, the elite individual is the one with the lowest fitness value, denoted $X_{elite}$ and obtained using Eq. (2):

$$X_{elite} = \arg\min_{i \in \{1,\dots,pop\}} f(X_i) \quad (2)$$

The function $f(\cdot)$ represents the cost function of the given problem, which assesses the fitness of potential solutions in the search space. The motion behaviour of sea horses involves two states: Brownian motion and Lévy flight. Brownian motion facilitates enhanced exploration of the search space, while Lévy flight simulates the step size of the sea horses' movement, allowing them to migrate and explore different locations and so avoid excessive local exploitation. In the update rules that determine the position of a sea horse at iteration t, Levy denotes the Lévy-flight distribution function with a parameter randomly generated from the interval [0, 2]; x, y, and z are the coordinates of the spiral movement component of SHO; the constant coefficient l controls the step size of the Lévy flight; $\beta_t$ is the random-walk coefficient of the Brownian motion; and the normal random number $r_1$ introduces stochasticity into the Brownian component [36]. The hunting behaviour of sea horses can lead to either success or failure: success is achieved when a sea horse captures its prey by moving faster, while failure results in further exploration of the search space. In the mathematical form of this behaviour, the new location of the sea horse after hunting at iteration t is denoted $X^1_{new}(t)$, $r_2$ is a random number within [0, 1], and b is a decreasing parameter that adjusts the sea-horse step length during the hunting process. The reproductive behaviour divides the population into male and female groups based on fitness values, with male sea horses responsible for reproduction.
where fathers and mothers refer to the male and female populations, respectively, and $X^2_{sort}$ denotes the population $X^2$ arranged in ascending order of the corresponding fitness values. The algorithm selects the best-fitting half of the population to create a new generation of candidate solutions, the i-th offspring being expressed as

$$X_i^{offspring} = r_3\,X_i^{father} + (1 - r_3)\,X_i^{mother}$$

where $r_3$ is a random number in [0, 1], and $X_i^{father}$ and $X_i^{mother}$ are individuals chosen at random from the male and female populations. The SHO algorithm is specifically developed for solving optimization problems with continuous search spaces and has exhibited encouraging outcomes in several applications. The flowchart of the proposed SHO algorithm is presented in Fig. 4.
The SHO algorithm offers a fresh perspective on resolving optimization problems, and its effectiveness and efficiency render it a promising technique for diverse applications.
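A highly simplified sketch of the SHO loop described above, assuming NumPy. The motion, hunting, and reproduction steps follow the narrative (elite guidance, a decreasing step length b, sorted male/female halves, blended offspring) but are not the authors' exact update equations; the Lévy-flight term, in particular, is omitted for brevity.

```python
import numpy as np

def sho_minimize(f, lb, ub, pop=30, iters=100, seed=0):
    """Simplified Sea-horse Optimizer sketch: motion, hunting, reproduction."""
    rng = np.random.default_rng(seed)
    dim = len(lb)
    X = lb + rng.random((pop, dim)) * (ub - lb)      # Eq. (1): random initialization
    fit = np.array([f(x) for x in X])
    for t in range(iters):
        elite = X[np.argmin(fit)]                    # Eq. (2): lowest-fitness member
        # Motion: Brownian walk toward the elite (Levy flight omitted here)
        step = rng.normal(size=(pop, dim))
        X1 = X + step * (elite - X)
        # Hunting: move toward the elite with a decreasing step length b
        b = 1 - t / iters
        r2 = rng.random((pop, 1))
        X2 = np.clip(r2 * b * (elite - X1) + X1, lb, ub)
        fit2 = np.array([f(x) for x in X2])
        # Reproduction: best half are "fathers", the rest "mothers"; blend offspring
        order = np.argsort(fit2)
        fathers, mothers = X2[order[:pop // 2]], X2[order[pop // 2:]]
        r3 = rng.random((pop // 2, 1))
        kids = r3 * fathers + (1 - r3) * mothers[:pop // 2]
        cand = np.vstack([X2, kids])
        cfit = np.array([f(x) for x in cand])
        keep = np.argsort(cfit)[:pop]                # keep the best pop members
        X, fit = cand[keep], cfit[keep]
    return X[np.argmin(fit)], fit.min()

# Usage example on a toy sphere function
best_x, best_f = sho_minimize(lambda x: np.sum(x ** 2), np.full(5, -5.0), np.full(5, 5.0))
print(best_f)
```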

Crystal Structure Algorithm (CryStAl)
Crystals are defined as minerals with a three-dimensionally organized, regularly repeating crystalline structure. Crystalline solids vary in size and form and may have isotropic or anisotropic characteristics [37]. Crystals are made of tiny particles with a distinct form, and numerous chemical and physical compositions have been investigated and put forward through testing. Furthermore, human creations such as mechanisms, buildings, and artwork have been influenced by the complex symmetries and qualities of crystals. The crystal structure is explained here using the Bravais model, which takes an infinite lattice geometry into account and specifies the periodic structure through the vector of the lattice locations:

$$\mathbf{r} = \sum_{i} s_i\,\mathbf{c}_i$$

where $c_i$ is the minimum vector along the primary crystal directions and $s_i$ is an integer number of the crystal; together they describe the periodic structure of the Bravais model. This fundamental concept of crystals is adapted, with suitable modifications, for the mathematical modelling of CryStAl. In this paradigm, every candidate solution of the optimization problem is regarded as a single crystal lattice, and an arbitrary number of crystal lattices is chosen to initialize the cycle.
where q is the problem's size (the number of decision variables) and s is the number of potential solutions (crystals). The starting locations of these crystals in the search space are chosen at random:

$$x_i^j(0) = x_{i,min}^j + \rho\,\bigl(x_{i,max}^j - x_{i,min}^j\bigr)$$

where $x_i^j(0)$ is the initial position of the j-th decision variable of the i-th candidate crystal, ρ is a random number in [0, 1], and $x_{i,min}^j$ and $x_{i,max}^j$ are the minimum and maximum allowable values, respectively. Following the crystallographic notion of the "base", all corner crystals make up the fundamental crystals; $wz_{main}$ is selected at random from the crystals created so far, with a random extraction performed at each step and the current crystal $(z_l)$ disregarded. $wz_r$ denotes the crystal with the ideal (best) arrangement, and $D_v$ is the average of a randomly chosen group of crystals. Using these basic lattice concepts, four kinds of update processes track a candidate solution's position in the search space:

$$wz_{new} = wz_{old} + z\,wz_{main} \quad (10)$$
$$wz_{new} = wz_{old} + z_1\,wz_{main} + z_2\,wz_r \quad (11)$$
$$wz_{new} = wz_{old} + z_1\,wz_{main} + z_2\,D_v \quad (12)$$
$$wz_{new} = wz_{old} + z_1\,wz_{main} + z_2\,wz_r + z_3\,D_v \quad (13)$$

In these formulas, the old position is given by $wz_{old}$, the new position by $wz_{new}$, and z, $z_1$, $z_2$, and $z_3$ are random numbers. Metaheuristics consist of two main components, exploitation and exploration, and Eqs. (10) to (13) have been shown to perform global and local searches simultaneously. Candidate solutions $x_i^j$ that violate the variable-limit requirements are handled by a mathematical flag that adjusts them back within the allowable bounds.
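The four update rules above translate into a short loop; this is a minimal sketch assuming NumPy, with greedy acceptance and simple clipping standing in for the boundary-flag mechanism, and coefficient ranges chosen as an assumption:

```python
import numpy as np

def crystal_minimize(f, lb, ub, n=30, iters=100, seed=0):
    """Simplified CryStAl sketch: the four lattice-update rules, Eqs. (10)-(13)."""
    rng = np.random.default_rng(seed)
    dim = len(lb)
    Cr = lb + rng.random((n, dim)) * (ub - lb)       # random initial crystals
    fit = np.array([f(c) for c in Cr])
    for _ in range(iters):
        best = Cr[np.argmin(fit)]                    # wz_r: ideal arrangement
        for i in range(n):
            main = Cr[rng.integers(n)]               # wz_main: random main crystal
            mean = Cr[rng.choice(n, size=rng.integers(2, n), replace=False)].mean(0)
            z, z1, z2, z3 = rng.uniform(-1, 1, 4)    # random coefficients (assumed range)
            trials = [
                Cr[i] + z * main,                          # Eq. (10): simple cubicle
                Cr[i] + z1 * main + z2 * best,             # Eq. (11): with best crystal
                Cr[i] + z1 * main + z2 * mean,             # Eq. (12): with mean crystals
                Cr[i] + z1 * main + z2 * best + z3 * mean, # Eq. (13): both
            ]
            for cand in trials:
                cand = np.clip(cand, lb, ub)         # flag: pull back inside the limits
                cf = f(cand)
                if cf < fit[i]:                      # greedy acceptance
                    Cr[i], fit[i] = cand, cf
    return Cr[np.argmin(fit)], fit.min()
```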

Performance evaluation methods
In this study, various evaluation criteria for the hybrid models are presented, emphasizing their correlation and error rates. The evaluation metrics include the coefficient of determination (R2), root mean square error (RMSE), mean absolute error (MAE), relative absolute error (RAE), and scatter index (SI). An algorithm with an R2 value close to 1 performs exceptionally well in the training, validation, and testing phases; conversely, lower values of metrics such as RMSE, RAE, and MAE are preferred because they signify a lower degree of model error. The mathematical definitions are given in Eqs. (14)-(18):

$$R^2 = \left( \frac{\sum_{i=1}^{N} (h_i - \bar h)(z_i - \bar z)}{\sqrt{\sum_{i=1}^{N} (h_i - \bar h)^2 \sum_{i=1}^{N} (z_i - \bar z)^2}} \right)^2 \quad (14)$$

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (z_i - h_i)^2} \quad (15)$$

$$MAE = \frac{1}{N} \sum_{i=1}^{N} |z_i - h_i| \quad (16)$$

$$RAE = \frac{\sum_{i=1}^{N} |z_i - h_i|}{\sum_{i=1}^{N} |z_i - \bar z|} \quad (17)$$

$$SI = \frac{RMSE}{\bar z} \quad (18)$$

In Eqs. (14)-(18), N is the number of samples, $h_i$ and $\bar h$ are the predicted value and the mean predicted value, respectively, and $z_i$ and $\bar z$ are the measured value and the mean measured value, respectively.
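For reference, Eqs. (14)-(18) implement directly in a few lines; a sketch assuming NumPy, with z the measured and h the predicted values as defined above:

```python
import numpy as np

def evaluate(z, h):
    """Eqs. (14)-(18): z = measured values, h = predicted values."""
    z, h = np.asarray(z, float), np.asarray(h, float)
    n = len(z)
    r2 = (np.sum((h - h.mean()) * (z - z.mean()))
          / np.sqrt(np.sum((h - h.mean()) ** 2) * np.sum((z - z.mean()) ** 2))) ** 2
    rmse = np.sqrt(np.mean((z - h) ** 2))
    mae = np.mean(np.abs(z - h))
    rae = np.sum(np.abs(z - h)) / np.sum(np.abs(z - z.mean()))
    si = rmse / z.mean()
    return {"R2": r2, "RMSE": rmse, "MAE": mae, "RAE": rae, "SI": si}

print(evaluate([100, 120, 140], [98, 123, 139]))
```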

Results and discussion
This study's primary objective was to predict the compressive strength of UHPC using three different models: DT, DTSH, and DTCS. During the training, validation, and testing phases, these models' performance was compared with the actual measurements. Five statistical measures, indicated in Table 2, were used to ensure a thorough evaluation: R2, RMSE, SI, RAE, and MAE. These metrics provided a solid basis for evaluating and contrasting the efficiency of the employed algorithms. Particular attention was given to the R2 values, which measure how much of the variability in the dependent variable can be explained by the independent variables. A standout was the DTSH model, which achieved the highest R2 value of 0.997 across all phases and displayed remarkable predictive accuracy. The DT model, on the other hand, produced slightly lower R2 values of 0.985 during the corresponding phases. Beyond R2, the study also examined RMSE and other error indicators. The DT model showed larger errors during the validation phase, with RMSE values ranging from 1.746 to 7.403, while the DTSH model showed the smallest errors during the training phase. The DTSH model obtained the lowest SI value of 0.011 during the training phase, indicating that it is the most suitable for modelling; similarly, the training phase of the DT model resulted in an SI value of 0.025. The DTSH model, which produced values of 1.233 and 12.824 for MAE and RAE during the training phase, emerged as the better choice compared with the DT model, which produced values of 2.887 and 26.357. Overall, the results convincingly demonstrate that the DTSH model is superior to the DT and DTCS models in all three stages. When selecting a model for real-world applications, it is nevertheless crucial to consider additional aspects such as model complexity, computational efficiency, and ease of implementation. In essence, the study's findings show that SHO optimization successfully enhances DT's UHPC prediction capabilities; the DTSH model therefore offers a useful and trustworthy option for practical UHPC prediction applications.
Figure 5 uses scatter plots to compare the hybrid models' performance over the crucial training, validation, and testing phases. R2 is used to determine how closely the predicted and observed values are related, and RMSE quantifies the difference between them. The DTSH model's central line and closely spaced data points show exceptional accuracy in all phases; the alignment of projected and actual values reveals remarkable agreement with little scatter. In contrast, the DT and DTCS models exhibit comparable performance levels, with data points distributed more widely around the central axis. Compared with the DTSH model, this wider distribution indicates increased inaccuracy and relatively lower precision.
In Fig. 6, a comprehensive comparison is presented, demonstrating the correlation between predicted and measured UHPC strength through a bar chart. The evaluation of predictive precision centres on how well the predicted and observed behaviours match. For the DTSH model, there is only a subtle deviation across all three phases, with a notable concentration of predicted data points placed above their measured counterparts. For the DT and DTCS models, a slight difference becomes apparent between the projected and actual data points, and their predictive accuracy falls slightly below the standard set by the DTSH model; this discrepancy is particularly evident, marked by a noticeable difference between the projected and observed values.

Figure 7 illustrates the error-rate percentages of the hybrid frameworks using a normal distribution plot. These models underwent a comprehensive evaluation across three phases, training, validation, and testing, each with separate sample sets. The plot vividly highlights notable differences in error distribution among the models. The DTSH samples cluster within a relatively narrow error range of -2 to 2%, showcasing that model's consistent and tightly grouped distribution. The DTCS model displays an error range of -3 to 3%, while the DT model spans a broader -5 to 5%, marking it as the model with the highest error rate. This observation emphasizes the consistent performance of the DTSH model across all evaluation phases. Among the trio of models, the DT model stood out due to its wider range of error percentages, indicating increased variability and reduced predictive precision compared with the other two models.

Moving on, Fig. 8 presents a half-violin diagram depicting the error percentages of the models in this study. During the training phase, DTSH exhibited an impressive mean error rate of 0%, characterized by a well-formed normal distribution with minimal dispersion; its error distribution consistently remained below the 6% threshold, indicating favourable results. In contrast, the DT model displayed dispersion across the phases, featuring a symmetric and uniformly distributed normal curve, yet managed to keep its error percentage below 10%. DTCS showed the most pronounced and diverse discrepancies among the three models; interestingly, a single outlier data point with an error above 8% emerged during the assessment stage, an unusual occurrence in statistical analysis. When considering dispersion, the DT model stood out, showing a greater spread than the other two models, with fewer instances near zero. Overall, all three models demonstrated satisfactory performance, but DTSH showcased superior outcomes in terms of consistency and accuracy.

Conclusions
The number of experimental studies examining the characteristics of ultra-high-performance concrete (UHPC) has increased recently. However, using conventional statistical techniques to establish a precise relationship between the composition variables and the engineering features of UHPC has proven challenging, the relationship being highly nonlinear. A robust and sophisticated approach is needed to make sense of the vast amount of experimental data available; this approach ought to produce precise estimation methods and illuminate the complexities of nonlinear materials science. ML is a potent technique that excels at spotting hidden patterns within complex datasets. In light of these considerations, the present study is dedicated to harnessing cutting-edge ML techniques, specifically DT, to predict the CS of UHPC. The foundation of this endeavour is a meticulously curated dataset consisting of 110 test experiments and 8 input parameters extracted from a comprehensive compilation of published literature. To elevate the predictive capabilities of the DT model, two meta-heuristic algorithms, SHO and CryStAl, have been seamlessly integrated. This amalgamation yields three distinct models: the original DT, an enhanced version DTSH empowered by SHO, and DTCS enriched by CryStAl. Evaluating these models is an exhaustive process encompassing training, validation, and testing stages. The dataset used for these evaluations comprises laboratory samples sourced from reputable published references. The efficacy and predictive prowess of the models in estimating UHPC compressive strength are quantified through an array of performance evaluation metrics, expounded upon in the dedicated section.
The culmination of these rigorous evaluations yields the following outcomes: the DTSH model achieved the highest R2 value of 0.997 across all phases, highlighting its remarkable dependability and accuracy in forecasting UHPC compressive strength, and the RMSE of DTSH was noticeably 80% lower than that of the DT model, a resounding demonstration of the improved prediction accuracy of this method.


Fig. 2 Sample decision tree based on binary target variable Y

Fig. 3 DT using sample space view

Fig. 4 The flowchart of the proposed SHO algorithm

Fig. 8 The box of errors among the developed models

Table 1 The properties of the dataset components engaged in the modelling process

Table 2 Performance indices of the proposed models