Innovative compressive strength prediction for recycled aggregate/concrete using K‑nearest neighbors and meta‑heuristic optimization approaches

Duan Journal of Engineering and Applied Science (2024) 71:15

Introduction
Compressive strength ($F_c$) is a pivotal parameter in structural engineering and construction materials. It is a fundamental gauge of a material's ability to endure axial loads, the forces that compress or shorten it. More precisely, $F_c$ quantifies the maximum axial stress a material, typically concrete, can withstand without failure or collapse [1][2][3][4]. This characteristic carries immense significance in the planning and construction of vital structures such as buildings, bridges, dams, and other infrastructure projects. A comprehensive grasp of $F_c$ is indispensable to engineers and architects, as it bears directly on structural integrity and safety. The composition of the concrete mix, curing conditions, and environmental factors all exert substantial influence over $F_c$. Consequently, researchers and professionals continue to refine their understanding and predictive methodologies in this area. Recent years have seen the deployment of advanced techniques such as machine learning, finite element analysis, and non-destructive testing, all aimed at improving the precision of $F_c$ predictions. Moreover, the evolution of concrete technology, including the incorporation of supplementary cementitious materials and alternative aggregates, has enabled more sustainable construction practices. Notably, these innovations have not come at the expense of compressive strength; in many instances, they have improved it [5][6][7]. In essence, the study of $F_c$ is central to guaranteeing the durability and reliability of civil engineering structures. Ongoing research and innovation in this field continue to reshape construction materials and practices, so that they meet the demands of an ever-expanding construction industry while respecting environmental sustainability [8][9][10].
The relentless expansion of the construction industry requires vast quantities of aggregates, which are one of the primary constituents of concrete. At the same time, the demolition of aging structures produces an abundance of discarded concrete, which often occupies precious landfill space and raises severe environmental concerns such as land depletion. This predicament has spurred the exploration of recycling and repurposing demolished concrete as an eco-friendly alternative to non-renewable virgin aggregates [11][12][13]. The use of recycled concrete aggregate (RCA), obtained by crushing demolished concrete, has emerged as a promising solution that conserves natural resources while mitigating the adverse environmental effects of simply disposing of demolished concrete. Nevertheless, it is essential to acknowledge that RCA differs in properties from natural aggregate (NA).
Consequently, the physical and mechanical attributes of recycled aggregate concrete (RAC) made from RCA differ from those of natural aggregate concrete (NAC). These distinctions arise chiefly from the higher porosity and water absorption of RCA compared with NA [14][15][16]. One pivotal mechanical property in the concrete industry, the elastic modulus, which gauges a material's resistance to deformation, is particularly noteworthy. RAC generally has a lower elastic modulus than NAC made with an equivalent water-to-cement ratio (w/c). Various researchers have proposed equations that correlate the elastic modulus of concrete with other properties, such as $F_c$. However, these equations are primarily rooted in experimental data gathered from NAC, which casts doubt on their applicability to RAC.
A more nuanced approach is indispensable given the complex and multifaceted nature of experimental trials, particularly those involving many parameters, some of which exert only marginal influence on outcomes [17]. Computer scientists have responded to this challenge by developing selection algorithms founded on data-driven models [18]. These algorithms can identify the most influential independent variables, reducing the dimensionality of the input matrix and, in turn, improving efficiency. The domain of engineering components, systems, and materials is experiencing a growing demand for soft computing tools in predictive modeling. This trend underscores the continued prominence of machine learning (ML) models, particularly artificial neural networks (ANNs), which are valued for generating predictions that closely mirror empirical observations [19][20][21]. These data-driven tools are improving our capacity to predict the $F_c$ of RAC and provide valuable insights into the behavior of this environmentally conscious construction material.
This study is dedicated to refining the accuracy of predictions of $F_c$ development in RAC by improving the K-nearest neighbors (KNN) model. Realizing the full predictive potential of KNN, however, requires careful optimization of its parameters. To tackle this challenge, the study integrates two optimization algorithms: the Fire Hawk optimizer (FHO) and Runge-Kutta optimization (RUN). This combination aims to improve the efficiency of design and construction processes that depend on the $F_c$ of RAC, ultimately benefiting the infrastructure sector and the built environment. To validate the robustness of the proposed framework, an extensive dataset of $F_c$ measurements is employed. A comparative analysis is conducted to establish its superiority over conventional optimization methods. Established statistical metrics, including $R^2$, RMSE, and MSE, are used to assess the performance of the ML models incorporated in this research.

Data gathering
The study investigated the compressive strength ($F_c$) of recycled aggregate concrete (RAC) while considering multiple variables. To support model development and evaluation, the dataset was partitioned into three distinct subsets: a training set (70%), a validation set (15%), and a testing set (15%). Table 1 summarizes the input variables, all of which are crucial to concrete production, that were used to predict $F_c$ behavior with a KNN model. Understanding and controlling the final concrete product's quality relies heavily on these factors. The dataset used in this study comprises 441 observations, which supports robust statistical properties. Each variable is explained below.

Water-to-cement ratio (w/c)
This variable represents the proportion of water to cement in the concrete mix, ranging from 0.30 to 1.03. It has an average of 0.55 and a standard deviation of 0.15. A lower value indicates reduced water content, typically leading to stronger concrete.

Coarse aggregate to cement ratio (CA/C)
The ratio of coarse aggregates to cement spans from 1.00 to 7.40, with an average of 3.32 and a standard deviation of 1.21. CA/C significantly influences the structural properties of concrete.

Cement fineness (r)
This variable measures the fineness of cement particles, with values ranging from 0.00 to 1.00. The average is 0.52, with a standard deviation of 0.39. Finer cement particles can enhance both workability and strength.

Fine aggregate to total aggregate ratio (FA/TA)
The ratio of fine aggregates to total aggregates varies from 0.00 to 0.58, with an average of 0.40 and a standard deviation of 0.07. This ratio significantly impacts concrete workability and long-term durability.

Specific gravity of saturated surface-dry aggregates (SG)
The specific gravity of saturated surface-dry aggregates ranges from 0.00 to 6.23. The average is 2.28, with a standard deviation of 0.76, reflecting aggregate density.

Water absorption of aggregates (Wa)
This variable represents the water absorption capacity of aggregates, with values spanning from 0.00 to 28.00. The average is 3.71, with a standard deviation of 3.06. Lower water absorption is desirable for concrete quality.
Based on a dataset of 441 observations, this analysis of the variables offers insights for optimizing concrete mix designs to achieve desired strength and performance characteristics. The statistical properties provide useful information for quality assurance and for assessing variability in concrete production [22]. Figure 1 shows marginal histograms, which display the distribution of each variable along the edges of a scatter plot or two-dimensional graph. They give a quick overview of the distribution of the data, making it easier to spot trends, outliers, and patterns within each variable while also visualizing how the variables relate to one another.
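The 70/15/15 partition of the 441 observations described at the start of this section can be illustrated with a short sketch. The text does not say how rows were assigned to subsets, so the random shuffle, the fixed seed, and the function name `split_indices` are assumptions made here for illustration only.

```python
import numpy as np

def split_indices(n_samples, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle row indices and cut them into train/validation/test parts.

    The shuffle and the seed are illustrative assumptions; the source only
    states the 70/15/15 proportions.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(train_frac * n_samples)
    n_val = int(val_frac * n_samples)
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(441)
print(len(train_idx), len(val_idx), len(test_idx))  # 308 66 67
```

Because 441 is not divisible into exact 70/15/15 shares, the fractions are truncated and the remainder is assigned to the test subset in this sketch.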

K-nearest neighbors (KNN)
The KNN algorithm makes predictions based on the most frequently occurring feedback among the K data points nearest the test point. Before applying the algorithm, the parameters must be normalized using Eq. (1).
Afterward, Eq. (2) is used to compute the Euclidean distance between the test point and the data points.
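For reference, the standard Euclidean distance can be written in the notation of this section as follows. The feature index $h$ and the second subscript on $x$ are assumptions introduced here so that the sum over the $m$ components is explicit.

```latex
H(x_i, x_j) = \sqrt{\sum_{h=1}^{m} \left( x_{i,h} - x_{j,h} \right)^{2}}
```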
Equation (2) calculates the distance H between the original data points ($x_i$) and the test point ($x_j$) using Euclidean distance, where m is the number of argument points [23]. However, different parameters have varying impacts on thermal comfort even when they change by the same amount; for example, a 1 °C change in air temperature has a more significant impact than a 1% change in air humidity. To remove the inconsistent effects of indoor thermal parameters on thermal comfort, it is necessary to modify the Euclidean distance for all parameters using Eq. (3).
Fig. 1 The marginal histograms plot for input and output
Here, $w_h$ is the weight assigned to each indoor thermal parameter that impacts thermal comfort [24]. Distances are calculated to determine the K data points closest to the test point [25]. The feedback from the subjects at the current test point is then taken to be the feedback that occurs most frequently among these K data points. Cross-validation can be used to determine the value of K, which sets the number of neighboring data points consulted. It is crucial to pick a K value between the two extremes. If K is too small, the model may be overly sensitive to sample points close to the test point, leading to excessive interference from noise points. On the other hand, if K is too large, the model's accuracy might suffer. The flowchart of KNN is shown in Fig. 2.
Fig. 2 The flowchart of the KNN model
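The procedure described in this section (normalize, compute a weighted distance, take the K closest points) can be sketched as below. Several details are assumptions rather than the paper's stated method: min-max scaling stands in for the normalization of Eq. (1), the weights multiply the squared differences inside the square root, and the neighbors' targets are averaged because $F_c$ is continuous, whereas the text describes taking the most frequent feedback. The function names and the toy numbers are hypothetical.

```python
import numpy as np

def min_max_normalize(X, lo=None, hi=None):
    """Scale each column to [0, 1]; an assumed form of the normalization step."""
    lo = X.min(axis=0) if lo is None else lo
    hi = X.max(axis=0) if hi is None else hi
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero
    return (X - lo) / span, lo, hi

def knn_predict(X_train, y_train, x_test, k=3, weights=None):
    """Predict by averaging the targets of the k nearest training points.

    Distance: sqrt(sum_h w_h * (x_ih - x_jh)^2), an assumed weighted form.
    """
    if weights is None:
        weights = np.ones(X_train.shape[1])
    diff = X_train - x_test
    dist = np.sqrt((weights * diff**2).sum(axis=1))
    nearest = np.argsort(dist)[:k]  # indices of the k closest points
    return y_train[nearest].mean()

# Hypothetical toy data: columns are w/c and CA/C, targets are strengths.
X = np.array([[0.30, 1.0], [0.50, 3.0], [0.55, 3.2], [1.00, 7.0]])
y = np.array([60.0, 40.0, 38.0, 15.0])

X_norm, lo, hi = min_max_normalize(X)
x_new, _, _ = min_max_normalize(np.array([[0.52, 3.1]]), lo, hi)
prediction = knn_predict(X_norm, y, x_new[0], k=2)
print(prediction)  # 39.0, the mean of the two closest targets (40 and 38)
```

The test point is scaled with the training minimum and maximum so that both live in the same normalized space before distances are compared.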

Fire Hawk optimizer (FHO)
The FHO steps are introduced in this section. The starting population X of FHO is initialized with N solutions, each having D values [26]. This procedure is shown in Eq. (4).
$U_j$ and $L_j$ are used in Eq. (4) to represent the search domain's boundaries at dimension j. A random value is indicated by rand ∈ [0, 1]. The fitness value of each solution $X_i$ is then calculated, and the best one ($X_b$) is the solution with the highest fitness value. The best n solutions are then used to construct the Fire Hawks ($FH_l$, l = 1, 2, ..., n), while the rest are treated as the prey ($PR_k$, k = 1, 2, ..., m). The distance between FH and PR is then calculated as follows: The following equation is then used to update the value of FH.
where $FH_n(t)$ denotes one Fire Hawk, and $r_1$ and $r_2$ are random values in the range [0, 1]. The safe prey area is then allocated, and the formula below is used to find the safe position ($SP_l$) inside the Fire Hawk region [27].
The next step involves simulating animal behavior via PR movement within the FH zone. This simulation updates the prey's position as follows: After that, the following formula updates the safe location outside the zone of the l-th FH.
The prey then changes its location based on the calculation below.
The stopping criteria are then checked. If they have been satisfied, the best solution is the output of FHO; otherwise, the updating process is repeated [28].
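The opening steps of FHO described above (random initialization within the bounds, ranking by fitness, splitting into Fire Hawks and prey, and measuring hawk-to-prey distances) can be sketched as follows. This is only a partial illustration: the position-update rules are not reproduced, the Euclidean form of the distance is an assumption, and the function names, the bounds, and the toy fitness function are hypothetical.

```python
import numpy as np

def init_population(n_solutions, lower, upper, rng):
    """X_ij = L_j + rand * (U_j - L_j), with rand drawn uniformly from [0, 1)."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return lower + rng.random((n_solutions, lower.size)) * (upper - lower)

def split_hawks_and_prey(X, fitness, n_hawks):
    """The n_hawks solutions with the highest fitness become Fire Hawks."""
    order = np.argsort(fitness)[::-1]  # highest fitness first, as in the text
    return X[order[:n_hawks]], X[order[n_hawks:]]

def hawk_prey_distances(hawks, prey):
    """Matrix of distances, one row per hawk (Euclidean form assumed)."""
    diff = hawks[:, None, :] - prey[None, :, :]
    return np.sqrt((diff**2).sum(axis=2))

def toy_fitness(X):
    """Hypothetical fitness: larger is better, peaking at the point (8, 0.5)."""
    return -((X - np.array([8.0, 0.5]))**2).sum(axis=1)

rng = np.random.default_rng(0)
# Hypothetical search domain: first variable in [1, 15], second in [0, 1].
X = init_population(10, [1.0, 0.0], [15.0, 1.0], rng)
hawks, prey = split_hawks_and_prey(X, toy_fitness(X), n_hawks=3)
D = hawk_prey_distances(hawks, prey)
print(hawks.shape, prey.shape, D.shape)  # (3, 2) (7, 2) (3, 7)
```

In a KNN tuning setting, each row of `X` would encode candidate model parameters and the fitness would be derived from validation error, but that coupling is not spelled out in the text and is left out here.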

Runge-Kutta optimization (RUN)
The RUN optimization algorithm is based on the Runge-Kutta method (RKM), which is used to compute solutions of first-order differential equations. The RUN algorithm's mathematical formulation comprises a series of stages, which are elaborated upon below:
• The initialization stage involves creating the initial solutions for N agents based on the search space's boundaries [LB, UB]. This is accomplished using Eq. (11). In this formula, P denotes the dimension of the problem, $LB_j$ and $UB_j$ signify the lower and upper limits of the j-th variable in the solution set $Z_{ij}$, and i ranges from 1 to N, the total number of search agents [29].
• During the solution refinement stage, the RUN algorithm employs a search mechanism (SM) that uses the RKM to modify the current solution's position at every iteration [30,31]. In Eq. (11), the integer value r, which lies between −1 and 1, is used to alter the direction of the search process, while a further coefficient and µ are random numbers ranging from 0 to 2 and from 0 to 1, respectively. The adaptive factor SF is specified by its own equation, in which the total number of iterations is represented by $t_{max}$. The values of $Z_c$ and $Z_m$ used in Eq. (14) are defined in Eq. (15), which includes a randomly generated number ϕ between 0 and 1. Here, $Z_b$ and $Z_{pb}$ denote the best agent at each iteration and the best-so-far agent, respectively. The SM parameter mentioned in Eq. (11) is updated using a formula in which $rand_1$ and $rand_2$ represent random numbers. The Z value is calculated from a separate expression, and the values of $Z_w$ and $Z_b$ are updated according to a conditional (if/else) rule.
• During the enhanced solution quality stage, various operators are employed to improve the convergence rate and avoid local optima. The objective is to enhance the quality of solutions through the process given in Eq. (19), which involves a random number between 0 and 1 and an integer r that can take the values 1, 0, or −1. According to [30], if the fitness value of $Z_{new2}$ is not superior to the fitness value of $Z_i$, there is another opportunity to update the value of $Z_i$, using Eq. (20). This equation involves the random values $r_1$, $r_2$, and $r_3$. The value of v is computed as twice the difference of $r_3$ and 0.5, where $r_3$ is a random number in the range [0, 1].
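Two of the relations described in words above can be written out explicitly. The initialization below is the usual uniform sampling between the bounds, which matches the description of the first stage but is an assumed form; the expression for $v$ follows directly from the sentence defining it.

```latex
Z_{ij} = LB_j + \mathrm{rand} \cdot \left( UB_j - LB_j \right),
\qquad i = 1, \dots, N, \quad j = 1, \dots, P, \quad \mathrm{rand} \in [0, 1]

v = 2 \left( r_3 - 0.5 \right), \qquad r_3 \in [0, 1]
\;\Longrightarrow\; v \in [-1, 1]
```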

Performance evaluation methods
This study uses several criteria to evaluate the hybrid models according to their correlations and error rates. The evaluation metrics include root mean square error (RMSE), mean absolute relative error (MARE), coefficient of determination ($R^2$), mean square error (MSE), and U95. The relevant formulas for each of these metrics are given below. An algorithm that achieves an $R^2$ value near 1 in all three phases (training, validation, and testing) performs excellently. In contrast, lower values are preferred for error metrics such as RMSE and MSE, because they show that the model has less error.
Equations (21)-(25) use M to indicate the number of samples, $p_i$ to represent the predicted value, $l_i$ to indicate the measured value, and $\bar{p}$ and $\bar{l}$ to denote the mean predicted and measured values, respectively.
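The five metrics can be computed with a few lines of NumPy. The definitions used here are common ones and are assumptions with respect to this paper: $R^2$ is taken as the squared Pearson correlation between predicted and measured values (consistent with the use of $\bar{p}$ and $\bar{l}$), MARE as the mean of $|p_i - l_i| / |l_i|$, and U95 as $1.96\sqrt{SD^2 + RMSE^2}$ with SD the standard deviation of the errors. The function name `evaluate` and the sample numbers are hypothetical.

```python
import numpy as np

def evaluate(measured, predicted):
    """Return R2, RMSE, MSE, MARE and U95 under the assumed definitions."""
    l = np.asarray(measured, dtype=float)
    p = np.asarray(predicted, dtype=float)
    err = p - l

    mse = np.mean(err**2)
    rmse = np.sqrt(mse)
    mare = np.mean(np.abs(err) / np.abs(l))  # requires nonzero measurements

    # Squared Pearson correlation between predictions and measurements.
    num = np.sum((p - p.mean()) * (l - l.mean()))
    den = np.sqrt(np.sum((p - p.mean())**2) * np.sum((l - l.mean())**2))
    r2 = (num / den)**2

    sd = np.std(err)  # standard deviation of the prediction errors
    u95 = 1.96 * np.sqrt(sd**2 + rmse**2)

    return {"R2": r2, "RMSE": rmse, "MSE": mse, "MARE": mare, "U95": u95}

# Hypothetical measured and predicted strengths.
metrics = evaluate([10, 20, 30, 40], [12, 18, 33, 39])
print(metrics["MSE"], metrics["MARE"])  # 4.5 0.10625
```

Higher $R^2$ and lower values of the other four metrics indicate a better fit, in line with the interpretation given above.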

Findings and detailed explanation for Table 2
The study employed three distinct models, namely KNN, KNFH, and KNRK, to forecast the compressive strength ($F_c$) of recycled aggregate concrete. These models underwent comprehensive evaluation across three phases: training, validation, and testing, with careful data partitioning to ensure fairness. The evaluation process incorporated five statistical metrics, $R^2$, RMSE, MARE, U95, and MSE, to facilitate a detailed comparison of model performance. Table 2 shows the results of the developed models, and the comparison between the models is as follows:
• The primary focus of the evaluation was on $R^2$ values, which indicate the extent to which the independent variables explain variations in the dependent variable. The KNFH model demonstrated the highest predictive accuracy, achieving an $R^2$ value of 0.994 during training and consistently outperforming the alternative models. In contrast, the KNN model yielded a slightly lower $R^2$ value of 0.977 during training.
• An analysis of the other error indicators, particularly RMSE, revealed a range spanning from 1.122 to 2.529. The KNFH model exhibited the lowest error, while the KNN model exhibited relatively higher errors.
• During the training phase, the KNFH model displayed the lowest MARE value of 0.028, suggesting its superiority. In contrast, the KNN and KNRK models exhibited higher MARE values of 0.052 and 0.044, respectively.
• In terms of MSE and U95 during training, the KNFH model also produced the lowest values, with an MSE of 1.259 and a U95 of 3.110. The MSE and U95 values for the KNN model were the highest in the training phase.
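As a consistency check on the values quoted above, MSE is by definition the square of RMSE. Assuming that the lower end of the reported RMSE range, 1.122, is the KNFH training value, it agrees with the reported KNFH training MSE:

```latex
\mathrm{MSE} = \mathrm{RMSE}^{2}
\quad\Rightarrow\quad
1.122^{2} = 1.258884 \approx 1.259
```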
The study's findings show that the KNFH model outperformed the KNN and KNRK models in specific phases. However, when selecting a model for real-world applications, it is vital to consider additional factors such as model complexity, computational efficiency, and ease of implementation. In conclusion, the results provide compelling evidence that FHO optimization enhanced the KNN model's capability to predict $F_c$.

Enhanced presentation of figures in the results section
Figure 3 displays a scatter plot that evaluates the performance of the hybrid models during three stages: training, validation, and testing. The evaluation is based on two criteria, $R^2$ and RMSE. $R^2$ measures the agreement between predicted and observed values, while RMSE quantifies the dispersion of the prediction error. The KNFH model's data points were closely grouped around the central line, indicating its outstanding accuracy across all three phases. The tight clustering of predicted against actual values suggests minimal dispersion and a high level of agreement. On the other hand, the KNRK and KNN models had data points that were more widely spread around the central line, and the two showed similar performance levels. Compared to the KNFH model, this broader dispersion suggests a higher error and somewhat lower accuracy in the KNRK and KNN models.
Figure 4 is a line plot that compares projected and observed values of the $F_c$ of RAC. This visual representation is divided into three main sections: training, validation, and testing. Model accuracy is reflected in how closely the projected behavior matches the observed behavior. The KNFH model predicts values slightly higher than the actual measurements, causing slight differences in performance between the three phases. The KNN and KNRK models also follow the measured points but are less precise than the KNFH model, with a larger gap between projected and measured points.
Fig. 3 Plotting the dispersion of evolved hybrid models
Figure 5 presents a drop-line plot depicting the error percentages of the models developed in this study. For KNFH, the majority of data points cluster around the 14.96% mark, underscoring it as the model with the lowest error rate. In contrast, both KNN and KNRK exhibit a broader range of error percentages, with a substantial number of values surpassing 37.94% and 19.13%, respectively. Notably, the right-skewed distributions of KNN and KNRK highlight data points with significantly higher error percentages. This observation underscores KNFH's superior accuracy and serves as a visual representation of the error percentage distributions for the developed models.
Figure 6 presents a scatter interval plot that illustrates the error percentages associated with the models examined in this study. KNFH emerges as the top performer, with a mean error rate of 0%. Its error distribution consistently remains below the 10% threshold, and the data display minimal dispersion, closely resembling a normal distribution curve. In contrast, KNN's performance is characterized by dispersion across all phases. This model exhibits a more symmetrical and uniform normal distribution, with error percentages not exceeding 25%. The behavior of KNRK stands out due to its unique characteristics. This model shows the most pronounced and diverse discrepancies among the three. Interestingly, a single outlier datum contributes to over 15% of the dataset, an unusual occurrence in statistical analysis. This further emphasizes the distinct nature of KNFH's performance.
Fig. 4 The comparison of predicted and measured values

Conclusions
Experimental studies aimed at understanding the compressive strength ($F_c$) of recycled aggregate concrete (RAC) have increased significantly in recent years. Because of the complex and nonlinear nature of $F_c$, it has been challenging to establish a precise correlation between the composition variables and $F_c$ using conventional statistical methods. Solving this problem requires a robust methodology that can extract valuable information from the large amount of experimental data. Such a strategy ought to offer precise estimation methods and insight into the complex issues involved in nonlinear materials science. Machine learning (ML), a potent tool capable of revealing hidden patterns within complex datasets, plays a crucial role. With these considerations in mind, this study applies ML, particularly the K-nearest neighbors (KNN) model, to predict the $F_c$ of RAC. The work rests on a curated dataset comprising 441 test experiments and 6 input parameters extracted from an extensive compilation of published literature. To enhance the predictive potential of the KNN model, two meta-heuristic algorithms, namely the Fire Hawk optimizer (FHO) and Runge-Kutta optimization (RUN), have been integrated. The effectiveness of these models in estimating the $F_c$ of RAC is quantified through a range of performance evaluation metrics, which are elaborated upon in a dedicated section. The following outcomes emerge from this evaluation:
• Among the proposed models, the KNFH model demonstrates remarkable outcomes, yielding the highest $R^2$ values. Although the KNN model had a slightly lower $R^2$ score, the difference was negligible. Regarding error rates, KNFH outperforms KNN and KNRK, exhibiting a significant 1.7% reduction. The elevated $R^2$ values and reduced error rates underscore the impressive predictive capabilities of KNFH.
• Notably, the KNFH model consistently displays the lowest RMSE values across all phases, highlighting its dependability and accuracy in forecasting $F_c$. KNFH's RMSE is noticeably 77% lower than that of the KNN model, clearly demonstrating the model's improved prediction accuracy.
The findings unequivocally establish KNFH as the superior performer, outshining KNN and earning the top model accolade in this study due to its exceptional performance.

Fig. 5 The error rate percentage for the models based on the vertical drop-line plot

Fig. 6 The scatter interval plot comparing the errors of the proposed models

Table 1 The statistical properties of the input variables of $F_c$

Table 2 The results of the developed KNN-based models