Accurate compressive strength prediction using machine learning algorithms and optimization techniques

The complex interrelationships among numerous components present a formidable obstacle in developing mix designs for high-performance concrete (HPC). Machine learning (ML) algorithms have proven effective at resolving this problem; however, they are classified as opaque black-box models because they expose no discernible correlation between blend proportions and compressive strength. The present study proposes a semi-empirical methodology that integrates various techniques, including non-dimensionalization and optimization, to overcome this constraint. The methodology exhibits a noteworthy level of accuracy when forecasting compressive strength (CS) across a spectrum of divergent datasets, demonstrating its broad applicability. Moreover, the explicit relationships conveyed by semi-empirical equations are of great significance to practitioners and researchers in this field, especially with respect to their predictive abilities. The determination of CS in concrete is a critical facet of HPC design, and an exhaustive comprehension of the intricate interplay between manifold factors is requisite to attain an ideal blend proportion. The study's findings indicate that random forest (RF) can accurately predict CS, and that combining the model with optimization algorithms significantly enhances its effectiveness. Among the three optimization algorithms under consideration, the COOT optimization algorithm (COA) exhibited superior performance in augmenting the accuracy and precision of the RF prediction model for CS. As a result, the RFCO model obtained the most suitable R² and RMSE values, 0.998 and 0.88, respectively.


Introduction
Modern engineering constructions make extensive use of concrete as their main building material. The erection of concrete structures within complex settings requires the utilization of high-performance concrete (HPC), which exhibits attributes superior to normal concrete, such as high strength and durability. HPC is a heterogeneous material comprising superior-quality cement, coarse and fine aggregates, water, and admixtures; its performance is not limited to strength but also encompasses construction-related characteristics.

*Correspondence: lwb198454@163.com. 1 Yinchuan Institute of Science and Technology, University of Finance and Economics, Yinchuan 750021, Ningxia, China

Within the discipline of artificial intelligence, machine learning is the study of developing algorithms that can learn from datasets and become more proficient over time. Machine learning (ML) offers a notable benefit in effectively handling extensive and sophisticated datasets. This capability allows underlying patterns to be identified and precise predictions to be produced with remarkable accuracy. The advent of numerous ML methods, encompassing supervised learning, unsupervised learning, reinforcement learning, deep learning, and various others, has been documented [17-19]. ML has attained significant prominence and has been extensively implemented across varied sectors, encompassing healthcare, finance, manufacturing, and transportation. In the healthcare sector, ML can be employed to analyze medical images and detect potential anomalies. In the finance domain, the implementation of ML has the potential to yield significant benefits in detecting fraudulent activities and reducing the risks associated with financial undertakings, as has been suggested in the relevant literature [20-22].
Barkhordari et al. [23] compared different ensemble learner algorithms for predicting the compressive strength of fly ash concrete (FAC). Separate stacking with the random forest meta-learner achieved the most accurate predictions, with a coefficient of determination of 97.6% and the lowest mean square error and variance. The SSE-Random Forest algorithm performed well in prediction accuracy, with the largest R² (0.976) and smallest MSE (0.0041) for the test set. The SSE-Gradient Boosting model also performed well, with an MSE of 0.005 and R² of 0.997 for the training phase. Naseri et al. [24] tackled the challenge of pre-fabrication estimation of concrete compressive strength, advocating efficient alternatives to labor-intensive experimental methods. Investigating the influence of materials and sample age on fly ash concrete strength, they introduced a novel predictive method utilizing the water cycle and genetic algorithms. Comparative analysis revealed the water cycle algorithm as the most accurate model, surpassing classical regression models. Concrete mixtures with less than 35% fly ash by weight of the binder displayed maximum CS, with a notable decline beyond this threshold. These findings shed light on optimizing concrete mixture proportions for enhanced strength, bolstering sustainability and efficiency in production.
The random forest (RF) algorithm is a widely applied ML technique commonly used for regression and classification tasks. The approach constitutes a form of ensemble learning whereby numerous decision trees are integrated to enhance the accuracy of predictions [25]. In an RF model, the construction of each decision tree involves a random subset of both the data entries and the features. The RF model operates by combining the outcomes of numerous decision trees, thereby diminishing the risk of overfitting and enhancing the overall precision of the model. The randomization of the data and features employed in individual decision trees contributes to the resilience and adaptability of the model. RF models have found application in diverse fields, ranging from finance and healthcare to environmental science [26]. These models have been leveraged to perform tasks such as forecasting stock prices, diagnosing illnesses, and pinpointing environmental factors that may trigger disease outbreaks. The RF algorithm is a potent and versatile ML methodology that has demonstrated considerable efficacy across numerous domains [27].
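As a concrete illustration of the mechanism described above (bootstrap rows, random feature subsets, averaged tree outputs), the following is a minimal sketch using scikit-learn; the data here are synthetic stand-ins, not the study's mixes:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Illustrative stand-in data: 168 mixes x 8 input variables
# (W, B, FA, MS, CA, SP, TA, age); target = compressive strength.
X = rng.random((168, 8))
y = 20 + 60 * X[:, 1] - 10 * X[:, 0] + rng.normal(0, 2, 168)

# 70/30 split, as used in the study.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Each tree sees a bootstrap sample of rows and a random subset of features;
# for regression, the forest averages the trees' outputs.
rf = RandomForestRegressor(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print(round(rf.score(X_te, y_te), 3))  # R^2 on the held-out 30%
```

The `max_features` setting controls how many candidate features each split considers, which is what decorrelates the trees.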
The present study employs the random forest (RF) algorithm to predict the compressive strength (CS) of HPC, owing to its proficient handling of intricate systems and multifarious parameters through ML methods. Optimization procedures were additionally deployed to enhance the accuracy of the predictions. Optimization algorithms are mathematical techniques utilized to locate the most optimal outcome for a particular problem, and they have been extensively applied to optimize diverse parameters linked to the configuration of HPC systems. The subsequent sections delineate three optimization algorithms, namely the rider optimization algorithm (ROA), black widow optimization algorithm (BWOA), and COOT optimization algorithm (COA). The present study introduces an innovative methodology for predicting CS by integrating RF with these three optimization algorithms; this methodology holds substantial potential as a valuable instrument for engineers seeking to enhance HPC mix design. The paper introduces a novel hybrid approach, integrating non-dimensionalization, optimization, and ML algorithms, to enhance predictive models for HPC. It addresses the complexity of HPC mix designs by accurately forecasting compressive strength and optimizing blend proportions. Notably, it emphasizes the interpretability of ML models, which is crucial for practical engineering applications. The study advocates a comprehensive life-cycle assessment of HPC, considering long-term durability and sustainability. Collaborative interdisciplinary efforts involving material science, civil engineering, and computer science are highlighted for advancing sustainable and efficient HPC formulations and practices.
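The hybrid RF-plus-optimizer idea can be sketched generically: a search loop proposes RF hyperparameters and keeps the candidate with the lowest test RMSE. A metaheuristic such as COA, ROA, or BWOA would replace the random proposal step with its own movement rules; the data, parameter ranges, and budget below are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.random((168, 8))
y = 20 + 60 * X[:, 1] + rng.normal(0, 2, 168)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

best_rmse, best_params = np.inf, None
for _ in range(10):
    # Candidate hyperparameters drawn at random; a metaheuristic such as
    # COA/ROA/BWOA would instead update candidates via its movement rules.
    params = {"n_estimators": int(rng.integers(50, 300)),
              "max_depth": int(rng.integers(3, 20))}
    model = RandomForestRegressor(random_state=0, **params).fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    if rmse < best_rmse:
        best_rmse, best_params = rmse, params
print(best_params, round(best_rmse, 3))
```

The only optimizer-specific part is how the next candidate is generated, which is why the same wrapper accommodates all three algorithms.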

Data assembly
Supervised machine learning (ML) algorithms require numerous input variables to predict the compressive strength (CS) of HPC. The data in the present study were procured from previously published literature and the test data listed in Appendix 1, Table 6 [28]. The employed models utilized a total of eight input variables, namely water (W), binder (B), fly ash (FA), micro silica (MS), coarse aggregate (CA), superplasticizers (SP), total aggregate (TA), and age. The dependent variable in the models under analysis was CS. The model's results depend considerably on both the number of data points utilized and the number of input parameters. The present investigation employed 168 data points (i.e., mixes) to forecast the characteristics of HPC. The RF model was implemented in the Python programming language within the Anaconda environment. The relative distribution of each parameter used in the mixes was examined, and a comprehensive descriptive statistical analysis of these parameters can be found in Tables 1 and 2 for training and testing, respectively.
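A sketch of this data-preparation step (70/30 split and descriptive statistics in the spirit of Tables 1-3) might look as follows; the values are random stand-ins, since the actual mixes come from Appendix 1, Table 6:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hypothetical stand-in for the 168-mix dataset; real values come from
# the source data, not this random draw.
cols = ["W", "B", "FA", "MS", "CA", "SP", "TA", "Age", "CS"]
df = pd.DataFrame(rng.random((168, 9)), columns=cols)

# 70/30 split, then descriptive statistics per split (cf. Tables 1 and 2)
# and the input-output correlation matrix (cf. Table 3).
train = df.sample(frac=0.7, random_state=0)
test = df.drop(train.index)
print(train.describe().loc[["mean", "std", "min", "max"]].round(3))
print(df.corr().round(2)["CS"])
```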
Table 3 presents the correlation matrix showing the relationships between the input parameters (B, FA/B, MS/B, CA/B, CA/TA, W/B, SP/B, Age) and the output (CS).Correlation values range from − 1 to 1, where − 1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation.

Principle of RF
A random forest is a collection of tree-structured classifiers {b(x, ℵ_l), l = 1, ...}, where each tree casts a unit vote for the preferred category of a given input x. Here, the {ℵ_l} represent independent random vectors with identical distributions. In Breiman's model [29], several tree-structured classifiers, each created from the random vector ℵ_l for the l-th tree and a training sample set, make up the random forest. Because the stochastic factors are independent and identically distributed across any pair of trees, a classifier b(x, ℵ_l) is produced for each input vector x; iterating the procedure l times yields a sequence of classifiers b_1(x), b_2(x), ..., b_l(x), which can be applied to generate several classification models. The decision function is computed in accordance with a typical majority vote, which determines the system's ultimate output.

The amalgamation of the distinct decision-tree replicas is represented by B(x), with every tree casting a ballot for the preferred categorization outcome for the particular contribution parameters:

B(x) = arg max_V Σ_l F(b_l(x) = V)    (1)

The indicator function is represented by the symbol F(·), and the output variable is V [30]. The procedure for selecting the optimal categorization result is demonstrated in Fig. 1.
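The majority-vote decision function B(x) can be sketched directly; the class labels here are illustrative:

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Majority vote B(x): each tree casts one vote, and the class with
    the most votes is the ensemble output (ties broken arbitrarily)."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Five trees classifying one input x: three vote "high-strength".
print(forest_vote(["high", "low", "high", "high", "low"]))  # -> high
```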

RF's characters
RF uses a margin function [31] to determine how much the mean number of votes in favor of the right class at (X, V) exceeds the mean vote for any wrong class:

mg(X, V) = av_l F(b_l(X) = V) − max_{j≠V} av_l F(b_l(X) = j)    (2)

A larger value of the margin function indicates a higher degree of accuracy and confidence in the classification forecast. As defined in Eq. (2), mg(X, V) averages the indicator values obtained from applying F to b_l(X) and compares these averages using a maximum operation. The generalization error of this classifier is defined as:

PE* = P_{X,V}(mg(X, V) < 0)    (3)

Leo Breiman proved that, given a sufficient number of decision trees, b_l(X) = b(X, ℵ_l) obeys the strong law of large numbers: for almost all sequences ℵ_1, ..., PE* converges to a limiting value as the number of decision trees increases. Breiman furthermore exemplified that RF does not exhibit vulnerability to overfitting and provided the generalization error's limiting value.
Leo Breiman also deduced that there exists an upper bound for the generalization error:

PE* ≤ β̄(1 − z²)/z²    (4)

The generalization error of random forest (RF) is impacted by a pair of variables: the potency (strength) z of the individual trees within the forest, and the average correlation value β̄, which measures the relationship between the trees. A reduced level of correlation indicates diminished mutual reliance among the trees, leading to enhanced performance of the RF [32].
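Breiman's bound can be evaluated numerically to see how tree strength and inter-tree correlation trade off; the inputs below are illustrative values, not quantities measured in this study:

```python
def generalization_error_bound(strength, mean_correlation):
    """Breiman's upper bound on RF generalization error:
    PE* <= mean_correlation * (1 - strength**2) / strength**2."""
    return mean_correlation * (1 - strength**2) / strength**2

# Stronger trees and lower inter-tree correlation both tighten the bound.
print(round(generalization_error_bound(0.8, 0.3), 4))  # -> 0.1688
```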

Rider optimization algorithm (ROA)
The ROA algorithm is typically formulated based on a group of riders collaborating to reach a specific position [33]. The position of the z-th rider at time ti is represented by V_ti(z, s). Furthermore, the composition of the rider team is determined by adding the numbers of bypassers (B_i), followers (F_i), overtakers (O_i), and attackers (A_i) [34].

For the z-th rider, the angles related to the location, steering, and vehicle coordinates are represented by θ_z, V_ti+1(z, s), and ϕ. Furthermore, significant vehicle characteristics of the z-th rider include the accelerator (a_z), brake (br_z), and gear (E_z). While the gear value runs from [0] to [4], the brake and accelerator range from [0] to [1].
Suppose the bypass rider takes a conventional route instead of the leader's. In such a scenario, the location update for this group is chosen at random using Eq. (7), where the first coefficient represents a random value between [0] and [1], ϕ denotes a random number between [1] and D, ρ denotes a value between [1] and D, and η denotes an arbitrary value between [1] and […]. Thus, in order to reach the objective, the positions of the bypass riders are updated; the follower, using the coordinate selection given in Eq. (8), modifies its placement in accordance with the location of the leading rider. In this equation, the coordinate selector is used, X_Ri represents the location of the leader, Ri represents the leader's index, D_ti+1(z, s) denotes the steering angle of the z-th rider in the q-th coordinate, and g_ti(z) represents the distance that the z-th rider needs to cover, computed by multiplying the off-time rate by the rider's velocity.
The overtaking riders change their position based on the three factors listed in Eq. (9): the direction indicator, relative success rate, and coordinate selector. In this equation, X_ti(z, q) represents the location of the z-th rider in the q-th coordinate, while CI_ti(z) denotes the direction indicator of the rider's movement.

The generalized distance vector used to determine the coordinate selection is calculated by deducting the position of the z-th rider from that of the leader. Similarly, the attacker rider uses the same updating mechanism as the follower in an attempt to take the lead [35]. Unlike the follower, however, the attacker changes all coordinates instead of only a subset of them, as shown by Eq. (10). According to Eq. (11), the activity counter takes a value of [1] when the rider's success rate exceeds the predefined rate and [0] when it trails.
The steering angle is updated by the activity counter, as shown in Eq. (12).
As stated in Eq. (13), updating the gear entails selecting the greater value depending on the activity counter.
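The activity-counter logic of Eqs. (11)-(13) can be sketched as follows. The exact gear-update rule is not reproduced in the extracted text, so the stepping behavior below (gear up on success, down otherwise, clamped to [0, 4]) is an illustrative assumption:

```python
def activity_counter(success_rate, threshold):
    """Eq. (11): 1 when the rider's success rate beats the predefined
    rate, 0 when it trails."""
    return 1 if success_rate > threshold else 0

def update_gear(gear, counter, max_gear=4):
    """Eq. (13) sketch: step the gear up when the counter is 1, down
    otherwise, keeping it within [0, max_gear] as stated in the text."""
    return min(gear + 1, max_gear) if counter == 1 else max(gear - 1, 0)

print(update_gear(2, activity_counter(0.9, 0.5)))  # -> 3
```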

Black widow optimization algorithm (BWOA)
The BWOA is a meta-heuristic algorithm that integrates evolutionary algorithms with distinct criteria based on the reproductive behavior exhibited by black widow spiders [36]. The BWOA algorithm emulates the procreation behavior of Latrodectus mactans, commonly known as the black widow spider, which entails a multifaceted mechanism of selection and propagation aimed at generating novel progeny. The BWOA algorithm presents a distinctive and efficacious methodology for addressing intricate optimization problems; it is capable of circumventing local optima and converging promptly towards optimal solutions, thanks to its aptitude for upholding equilibrium between its exploration and exploitation phases. Such a combination of attributes contributes to its remarkable effectiveness [37, 38]. Furthermore, Fig. 2 displays the BWO flowchart. The primary phases of BWO may be summarized briefly as follows:

1: Initialization
In this stage, each widow in the population, which consists of a number of widows of size M, is represented as a 1 × M_var array giving a solution to the problem. This array can be defined as widow = (x_1, x_2, ..., x_M_var), where M_var is the dimension of the optimization problem. M_var can also be defined as the quantity of threshold values that the program must obtain, while x_i is the i-th candidate solution.

The fitness of a widow is obtained by evaluating the fitness function f on each widow of the set (x_1, x_2, ..., x_M_var): fitness = f(widow) = f(x_1, x_2, ..., x_M_var). Subsequently, the procreation process entails randomly selecting pairs of parents who engage in the mating process, during which the female black widow consumes the male, either during or after copulation.

2: Procreate
In the procreation step, an array α of random numbers, the same length as a widow array, is created. Offspring are then produced using α in Eq. (14), in which x_1 and x_2 are parents and y_1 and y_2 are offspring:

y_1 = α x_1 + (1 − α) x_2,  y_2 = α x_2 + (1 − α) x_1    (14)

The crossover result is evaluated and stored.
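Eq. (14) is the standard alpha-weighted crossover of the BWO literature; a minimal sketch, assuming that form:

```python
import random

def procreate(x1, x2, alpha=None):
    """BWO crossover (Eq. 14): offspring are alpha-weighted combinations
    of the two parents; alpha is a random array in [0, 1]."""
    if alpha is None:
        alpha = [random.random() for _ in x1]
    y1 = [a * p1 + (1 - a) * p2 for a, p1, p2 in zip(alpha, x1, x2)]
    y2 = [a * p2 + (1 - a) * p1 for a, p1, p2 in zip(alpha, x1, x2)]
    return y1, y2

# With a fixed alpha, the offspring mirror each other around the parents.
y1, y2 = procreate([0.0, 1.0], [1.0, 0.0], alpha=[0.25, 0.25])
print(y1, y2)  # -> [0.75, 0.25] [0.25, 0.75]
```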

COOT optimization algorithm(COA)
The COOT optimization algorithm is predicated on the distinct movement patterns exhibited by coot populations on water surfaces. Coots are diminutive avian species that exhibit collective behaviors on aquatic surfaces, primarily aimed at approaching food sources or predetermined locations [39]. The algorithm proceeds as follows. The population is initialized through a randomized process following Eq. (15):

CP(i) = rand(1, d) × (vc − kc) + kc    (15)

CP(i) represents the position of the i-th coot, while d refers to the number of variables or dimensions in the optimization problem. The search space is defined by the upper bound vc and lower bound kc, which determine the maximum and minimum values for each variable in the problem space; specifically, vc and kc define the range of the search space for the optimization problem.
Once the population is initialized, the position of each coot undergoes updates based on four distinct movement behaviors.

Random movement
For the first step of this movement, a random position Q is initialized using Eq. (17):

Q = rand(1, d) × (vc − kc) + kc    (17)
To prevent becoming stuck in a local optimum, the position is modified using Eq. (18); Eq. (19) determines the value of E, which is utilized in Eq. (18) along with a random number S_2 in the range [0, 1]:

CP(i) = CP(i) + E × S_2 × (Q − CP(i))    (18)

E = 1 − Z × (1/Iter)    (19)

The variable Iter represents the upper limit of iterations, while Z denotes the current iteration number.

Chain movement
To execute the chain movement, the new position can be determined as the average of the positions of two coots, utilizing Eq. (20):

CP(i) = 0.5 × (CP(i − 1) + CP(i))    (20)

where CP(i − 1) is the location of the second coot bird.

Adjusting position according to the leader
During the leadership movement, a coot bird updates its position based on the position of the leader within its group; specifically, a coot bird follower moves towards the leader of its group. The leader is chosen using Eq. (21):

P = 1 + (i mod MZ)    (21)

Eq. (21) utilizes P to denote the leader's number, i for the follower's number, and MZ for the total number of leaders [40].

During the switch movement, the position of a coot bird is updated by utilizing Eq. (22):

CP(i) = LP(P) + 2 × S_1 × cos(2Rπ) × (LP(P) − CP(i))    (22)

Eq. (22) employs CP(i) to represent the current position of the coot bird, LP(P) for the position of the chosen leader, S_1 for a random number in the range [0, 1], and R for a random number in the interval [−1, 1].

Leader movement
The leader must transition from the current local optimum towards the global optimal position in order to locate the optimal position [41]. This is accomplished by updating the leader's position using Eq. (23):

LP(P) = B × S_3 × cos(2Rπ) × (qBest − LP(P)) + qBest, if S_4 < 0.5
LP(P) = B × S_3 × cos(2Rπ) × (qBest − LP(P)) − qBest, if S_4 ≥ 0.5    (23)

Eq. (23) utilizes qBest to denote the best position found so far, S_3 and S_4 as random numbers in the range [0, 1], and R as a random number in the interval [−1, 1]. Eq. (24) is used to determine the value of B:

B = 2 − Z × (1/Iter)    (24)
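The coot movement rules can be sketched compactly; the update formulas below follow the original COOT formulation (an assumption, since the equations are not reproduced in the extracted text), and the inputs are illustrative:

```python
import math

def random_movement(cp, q, e, s2):
    """Random movement (Eq. 18): pull the coot toward a random position Q."""
    return [c + e * s2 * (qi - c) for c, qi in zip(cp, q)]

def chain_movement(cp, cp_prev):
    """Chain movement (Eq. 20): average of the coot and the bird ahead of it."""
    return [0.5 * (a + b) for a, b in zip(cp, cp_prev)]

def follow_leader(cp, lp, s1, r):
    """Leader-following (Eq. 22): move around the chosen leader's position."""
    return [l + 2 * s1 * math.cos(2 * math.pi * r) * (l - c)
            for c, l in zip(cp, lp)]

print(chain_movement([0.0, 2.0], [2.0, 0.0]))  # -> [1.0, 1.0]
```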

Performance evaluation methods
As previously stated, this study employs a number of measures, including the coefficient of persistence (CP), mean square error (MSE), mean absolute relative error (MARE), coefficient of determination (R²), and root mean square error (RMSE), to assess the models. These metrics are computed with Eqs. (25)-(29):

CP = 1 − Σ_{i=2}^{U} (e_i − z_i)² / Σ_{i=2}^{U} (e_i − e_{i−1})²    (25)

MSE = (1/U) Σ_{i=1}^{U} (e_i − z_i)²    (26)

MARE = (1/U) Σ_{i=1}^{U} |e_i − z_i| / e_i    (27)

R² = [Σ_{i=1}^{U} (e_i − ē)(z_i − z̄)]² / [Σ_{i=1}^{U} (e_i − ē)² Σ_{i=1}^{U} (z_i − z̄)²]    (28)

RMSE = √MSE    (29)

Here, e_i and z_i denote the experimental and predicted values, respectively. The mean values of the experimental and predicted data points are symbolized by ē and z̄, and U indicates the number of samples taken into account.
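These metrics are straightforward to compute; a minimal sketch, assuming the conventional definitions of MARE, R² (squared correlation), and the coefficient of persistence:

```python
import math

def metrics(e, z):
    """MSE, RMSE, MARE, R^2 (squared correlation), and CP between
    experimental values e and predictions z."""
    n = len(e)
    mse = sum((ei - zi) ** 2 for ei, zi in zip(e, z)) / n
    e_bar, z_bar = sum(e) / n, sum(z) / n
    sxy = sum((ei - e_bar) * (zi - z_bar) for ei, zi in zip(e, z))
    sxx = sum((ei - e_bar) ** 2 for ei in e)
    syy = sum((zi - z_bar) ** 2 for zi in z)
    cp = 1 - sum((e[i] - z[i]) ** 2 for i in range(1, n)) \
           / sum((e[i] - e[i - 1]) ** 2 for i in range(1, n))
    return {"MSE": mse,
            "RMSE": math.sqrt(mse),
            "MARE": sum(abs(ei - zi) / abs(ei) for ei, zi in zip(e, z)) / n,
            "R2": sxy ** 2 / (sxx * syy),
            "CP": cp}

print(metrics([50.0, 60.0, 70.0], [51.0, 59.0, 70.0]))
```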

Discussion and results
This section assesses the newly introduced hybrid models. Performance metrics are reported for two phases, training and testing; 70% of the instances in the dataset are used for training, with the remaining 30% used for testing. A greater value is desirable in the case of the R² measure; for the other metrics, the goal is to minimize the error and obtain the best possible result. A slight increase or decrease in the performance measures during the testing stage indicates how well or poorly the model was trained during the training stage. Table 4 presents an evaluation of the models' performance. RFCO train = 0.9981 had the greatest R² value, while RFRO test = 0.9778 had the lowest. The RFCO test yielded the most appropriate RMSE and CP values, 0.8766 and 0.327, respectively. For MARE, as with the other two error assessors, RFCO test obtained the best (lowest) value, 0.0096, while RFRO train got the worst, 0.0342. With regard to MSE, RFRO test produced the least acceptable result, 6.829, the greatest value of this criterion, while RFCO test obtained the lowest score, 0.7685. Table 5 provides a comparative analysis between the current study and previously published articles concerning compressive strength prediction. The table presents the models used in each study and the corresponding performance metrics, including R² and RMSE. The model used in the present study (RFCO) achieved a high R² of 0.9981 and a low RMSE of 0.880 compared with those reported in the referenced papers. This demonstrates the effectiveness and accuracy of the RFCO model in predicting CS.
A scatter plot comparing the expected and actual results for the three hybrid models RFCO, RFBW, and RFRO is shown in Fig. 3. To depict the separate training and testing stages, the current methodology uses two linear fits, a scatter plot, and a centerline. The scatter plot shows a pronounced positive correlation between the actual and anticipated values for each of the three models, indicating that the models are highly accurate in predicting the values. Nonetheless, the scatter plot reveals that RFCO exhibits the highest degree of data-point clustering around the linear fit lines, implying superior accuracy among the three models. The correlation for RFBW and RFRO is strong, although the data points exhibit greater dispersion. Both models' linear regression lines show a similar slope and intercept, suggesting that they have similar predictive abilities.
Figure 4 depicts a column plot presenting a comparative analysis between the three hybrid models' predicted and measured samples. The plot exhibits the degree to which the anticipated values conform with the observed values, effectively spotlighting the efficacy of the models. The results demonstrate that RFCO achieves a notable degree of precision, as evidenced by the close correspondence between predicted and measured values across the entirety of the dataset. The findings suggest a robust association between the projected and observed outcomes in both RFBW and RFRO, albeit with a marginally higher degree of discrepancy from the empirical data. This observation indicates that although RFBW and RFRO are effective, they may lack the accuracy offered by RFCO.
The box plot in Fig. 5 illustrates the percentage of errors for the models presented. During the training phase, RFCO exhibited a mean error rate of 0%, accompanied by a distinct normal distribution, and demonstrated minuscule dispersion. The distribution of errors exhibited favorable characteristics, as the values remained below the 10% threshold. In contrast, RFBW exhibited dispersion in both phases, and a more symmetrical and uniform normal distribution was observed; the model nonetheless attained an error percentage that did not exceed a 10% maximum. The RFRO exhibited the most notable and varied discrepancies; however, an aberrant datum was obtained only during the assessment stage and exceeded 10%, a rarity in the dataset. The Gaussian distribution of the RFBW errors exhibited a greater degree of dispersion compared to the other two models and a reduced frequency of occurrence in the vicinity of zero. As a broad observation, each of the three models exhibited satisfactory performance; however, the model denoted RFCO demonstrated the preeminent outcomes among them. Figure 6 shows the analysis using the Taylor diagram, which comprehensively compares multiple models based on correlation, standard deviation, and RMSE. RFCO demonstrated the highest performance among the models assessed, followed by RFBW and RFRO. The superior performance of RFCO, as indicated by its placement in the Taylor diagram, suggests that it achieved a remarkable balance between correlation, standard deviation, and RMSE in compressive strength prediction. The RFBW model also showcased commendable performance, securing a close second overall. RFRO, although slightly below RFBW, displayed a notable level of accuracy and reliability in predicting compressive strength. The insights gained from the hybrid models, integrating ML algorithms with techniques like non-dimensionalization and optimization, can be applied in several practical 
engineering applications within the field of HPC formulation:

1. Optimal mix design: the hybrid models can guide engineers in selecting optimal mix designs for HPC, considering various components and their interrelationships. This can lead to formulations with improved CS and other desired properties.
2. Resource optimization: by accurately predicting CS, engineers can optimize the use of raw materials, minimizing waste and reducing costs while maintaining the desired performance of the concrete.
3. Structural design and durability assessment: CS predictions are crucial in structural design. Hybrid models can aid in assessing the durability and performance of HPC in specific structural applications, allowing for better design choices and enhancing the lifespan of structures.
4. Quality control and assurance: predictive models can be utilized for quality control during the production of HPC, ensuring that the concrete meets the desired strength requirements before it is used in construction projects.
5. Real-time monitoring and decision-making: ML algorithms can be adapted to continuously monitor and predict concrete strength during curing or after construction. This real-time feedback can help adjust construction schedules or make necessary modifications to ensure structural integrity.

Potential limitations and areas for further research include:

1. Data availability and quality: the availability of comprehensive and high-quality data is critical for the accuracy and effectiveness of predictive models. Further research should focus on improving data collection and standardization within the concrete industry.
2. Model interpretability: addressing ML models' 'black-box' nature is essential for broader adoption. Research should aim to enhance the interpretability of these models, making the predictions more understandable to engineers and stakeholders.
3. Incorporating additional parameters: extending the models to consider more parameters, such as environmental conditions, curing processes, and construction practices, can enhance the accuracy and applicability of the predictions.
4. Generalization and transferability: research should focus on enhancing the generalization of models across diverse geographical and climatic regions, considering different raw materials and mix design practices.
5. Robustness to variability: investigate the robustness of models to variations in raw material properties and other external factors, ensuring that predictions remain accurate and reliable under different conditions.

Conclusions
High-performance concrete (HPC) is well known for its remarkable strength, durability, and workability. In construction engineering, concrete's compressive strength (CS) is widely acknowledged as a crucial mechanical attribute. One practical approach to predicting it is to apply machine learning (ML). The aim of this work was to forecast the CS of HPC using the random forest (RF) ML technique. In order to increase the accuracy of the findings, the current study used an amalgamation strategy fusing the RF model with optimization methods.

Fig. 2
Fig. 2 Flowchart of the Black Widow Optimization algorithm

Fig. 6
Fig. 6 Taylor diagram for the presented models

Table 1
The training phase's input and output variables' statistical characteristics

Table 2
The statistical properties of inputs and output variables in the testing phase

Table 4
The results obtained for the amalgamated models

Table 5
The comparison between the present work and published articles

The RF model was combined with optimization methods, for example, COA, ROA, and BWOA. The performance of the model was assessed by means of the R², RMSE, CP, MSE, and MARE indices. The findings show that, in comparison to the RFRO and RFBW models, the RFCO models perform better, showing fewer error signs. The best RMSE values were shown by the RFCO models in both the training and testing stages. The restricted distribution range displayed by these models suggests a precise and dependable capacity to forecast HPC. All models, however, showed a consistent percentage of errors, indicating that further improvements are required. The research indicates that RF hybrid models, more particularly the RFCO models, are highly effective at forecasting HPC, providing accurate and consistent results for a range of engineering uses. Future research can enhance predictive models for HPC by incorporating diverse factors such as environmental conditions, curing techniques, and sustainable materials. Integration of real-time sensor data and advanced imaging can enrich model insights. Addressing ML model interpretability in HPC is crucial. A comprehensive life-cycle assessment of HPC, considering durability and sustainability beyond early strength, is essential. Collaborative interdisciplinary efforts involving material science, civil engineering, and computer science are key to advancing sustainable, resilient, and efficient construction practices.