Supervised machine learning-based salp swarm algorithm for fault diagnosis of photovoltaic systems
Journal of Engineering and Applied Science volume 71, Article number: 12 (2024)
Abstract
The diagnosis of faults in grid-connected photovoltaic (GCPV) systems is a challenging task due to their complex nature and the high similarity between faults. To address this issue, we propose a wrapper approach based on the salp swarm algorithm (SSA) for feature selection. The main objective of SSA is to retain only the most important features from the raw data and eliminate unnecessary ones, improving the classification accuracy of supervised machine learning (SML) classifiers. The selected features are then used to train SML techniques to distinguish between various operating modes. To evaluate the efficiency of the technique, we used healthy and faulty data from GCPV systems into which frequent faults were injected. Twenty different types of faults were introduced, including line-to-line, line-to-ground, connectivity faults, and those affecting the operation of bypass diodes. These faults present diverse conditions, such as simple and multiple faults in the PV arrays and mixed faults in both arrays. The performance of the developed SSA-SML is compared with that of principal component analysis (PCA) and kernel PCA (KPCA) based SML techniques using several criteria (i.e., accuracy, recall, precision, F1 score, and computation time). The experimental findings demonstrate that the proposed diagnosis paradigm outperformed the other techniques and achieved a high diagnostic accuracy (an average accuracy greater than 99%) while significantly reducing computation time.
Introduction
In huge datasets, assessing data becomes more difficult because not all of the data is relevant. Feature selection (FS) is the process of selecting the most important features and removing redundant ones in order to solve classification problems. A well-chosen subset of features can reduce classification time while providing the same or even better classification accuracy than using all of the features [1]. The goal is to identify a set of s significant features from a set of S features (s < S) in a given dataset [2]. S is composed of all the features of a particular data collection; it may include noisy, repetitive, and misleading features. A complete search cannot be used in practice since it scans the whole solution space, which takes a long time [3]. We therefore aim to keep only a subset of the relevant features. Unnecessary features are not only useless for classification, but they may also significantly decrease classification accuracy. By removing unnecessary features, both computational efficiency and classification accuracy may be improved. Based on the search criteria, FS methods fall into two types: filter-based and wrapper-based. Filter-based techniques choose the feature subset independently of the predictors. Filter-based FS methods include the gain ratio [4] and information gain (IG) [5]. Wrapper-based techniques, as opposed to filter-based approaches, apply predictors to evaluate the quality of the chosen features [6, 7]. These techniques include sequential backward selection (SBS) [8], sequential forward selection (SFS) [9], and neural network-based methods [10]. Several search approaches, in particular random search and greedy search, have been employed to find the most suitable subset of features [11]. Greedy search approaches create and assess all possible combinations of features, making this strategy time-demanding.
Meanwhile, random search approaches scan the search space at random for the best subset of features. However, these approaches have several disadvantages, such as being easily trapped in local optima and having a large search space and high time complexity. Metaheuristic approaches have been employed to address the limitations of the previously discussed FS methods. Metaheuristic techniques are global optimization approaches that mimic biological, physical, and animal social behaviors in nature [12]. When applied to FS problems, they can explore the search space both globally and locally. Particle swarm optimization (PSO) [13], genetic algorithms (GAs) [14], differential evolution (DE) [14], ant lion optimization (ALO) [15], grey wolf optimizer (GWO) [16], and artificial bee colony optimization [17] are all well-known instances of metaheuristics. Over the past two decades, metaheuristics have proved their efficiency and productivity in solving difficult and large-scale challenges in engineering design, machine learning, and data mining applications [18]. Several studies have been conducted to evaluate the effectiveness of various metaheuristic algorithms for feature selection. In [19], the authors introduced a binary version of the ant lion optimizer (ALO) to find the optimal set of features and demonstrated that their proposed algorithm outperformed other algorithms in terms of accuracy. In [20], the authors modified the parameter used to balance exploration and exploitation in ALO and introduced a chaotic ALO (CALO), which was shown to outperform standard ALO, particle swarm optimization (PSO), and the genetic algorithm (GA). Meanwhile, in [21], the authors proposed a feature selection technique based on a modified cuckoo search algorithm with rough sets and showed that their proposed method was superior to other optimizers.
In [22], the authors improved the binary version of the whale optimization algorithm (WOA) for feature selection, resulting in an improved algorithm (IWOA) that outperformed other algorithms in terms of classification accuracy and feature reduction. In [23], the authors introduced a chaotic version of the multi-verse optimizer (MVO), called CMVO, which was found to be superior to other optimizers. Finally, in [24], the authors proposed a binary version of the hybrid grey wolf optimization and particle swarm optimization algorithm (BGWOPSO), which outperformed other binary optimization algorithms in terms of accuracy, feature selection, and computational time. Another approach to feature selection is to combine it with machine learning algorithms such as artificial neural networks (ANN). In [25], the authors proposed a feature selection approach based on an extension of particle swarm optimization (PSO) for wind energy conversion (WEC) systems, which demonstrated improved classification performance with reduced computation time. Similarly, in [26], the authors proposed using a genetic algorithm (GA) for feature selection in combination with an ANN for fault diagnosis in grid-connected photovoltaic (GCPV) systems, which proved to be feasible and effective with low computation time.
In the current study, we present a novel fault diagnosis paradigm for photovoltaic (PV) systems utilizing a feature selection method called SSA-SML. The proposed approach aims to address the complex nature of GCPV systems and the high similarity between different faults, which makes it challenging to diagnose faults accurately and ensure high-performance functioning. The main contributions of our work include:

The first step in our approach is to select the most important and sensitive features from the data, which can be challenging in nonlinear systems. While PCA is a commonly used method, it is not always effective for fault classification. Therefore, an alternative method called KPCA was developed. However, KPCA can be computationally challenging for large datasets.

To overcome these challenges, we propose an SSA-based SML technique for detecting faults and distinguishing between operating modes in PV systems. SSA offers several advantages: it is a recent algorithm that is easy to implement, has few parameters, and has a low computational cost [27].

The salp swarm algorithm (SSA) is utilized for feature selection by eliminating unnecessary features, while supervised machine learning is used for fault diagnosis. This approach tackles the issues of statistical, multivariate, and nonlinear feature selection and fault diagnosis in GCPV systems while improving classification accuracy, limiting the number of chosen features, and significantly reducing computation time.
The rest of the paper is organized as follows: Sect. 2 gives a brief theoretical overview of PCA, KPCA, and SSA, which are employed in feature extraction and selection. Section 3 is devoted to the discussion of supervised machine-learning techniques. Section 4 presents the proposed methodology for fault diagnosis and classification utilizing an SSA-based SML algorithm. Section 5 presents the simulation results that evaluate the performance of the proposed SSA-based SML. Section 6 concludes the paper.
Methods
Feature extraction and selection
Principal component analysis
Principal component analysis (PCA) is a descriptive method for analyzing existing relationships between system variables without taking the system's model into account [28]. Originally developed by Karl Pearson to describe and summarize the information contained in a dataset, it was later improved by Harold Hotelling as a technique for analyzing existing relationships between variables [29].
Consider the data matrix \(X(N,\,m)\) of a system, where N represents the number of measurements or observations and m represents the number of sensors or variables. Before running the analysis, it is necessary to preprocess the data by centering and reducing (scaling) it. The goal of this preprocessing is to keep certain variables from dominating the analysis simply due to their high amplitude in comparison to other variables. The following relation centers and reduces each column \({X}_{i}\) of the matrix \(X\left(N,\,m\right)\):
\[{X}_{i}^{*}=\frac{{X}_{i}-{M}_{i}}{{\sigma }_{i}}\]
where \({X}_{i}\) is the ith column of the matrix X, \({M}_{i}\) is the mean of the ith column, and \({\sigma }_{i}^{2}\) is the variance of the ith column. The new centered and reduced data matrix is as follows:
\[X=\left[{X}_{1}^{*},\dots ,{X}_{m}^{*}\right]\]
After obtaining the new data matrix, the covariance matrix Φ is computed as follows:
\[\Phi =\frac{1}{N-1}{X}^{T}X\]
The principal component analysis thus consists of breaking down the matrix as follows:
\[X=T{P}^{T}\]
where the principal components of X are represented by the columns of the matrix T, and the eigenvectors of the covariance matrix Φ are represented by the columns of the matrix P. PCA is quite effective for linear systems. However, most current systems are nonlinear, and PCA is much less effective for them. To get around these difficulties, a number of nonlinear PCA techniques have been developed, including kernel principal component analysis (KPCA).
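The PCA steps described above (centering and reducing the columns, forming the covariance matrix, and extracting a loading vector) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sample (N - 1) normalization is assumed, and the function names `standardize`, `covariance`, and `first_component` are hypothetical helpers, not part of the paper. Power iteration is used here only to keep the sketch dependency-free; it recovers the dominant eigenvector of Φ, that is, the first column of P.

```python
import math

def standardize(X):
    """Center each column and scale it to unit variance (sample variance, N - 1 assumed)."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / (n - 1))
            for j in range(m)]
    return [[(row[j] - means[j]) / stds[j] for j in range(m)] for row in X]

def covariance(Z):
    """Covariance matrix Phi = Z^T Z / (N - 1) of already-centered data Z."""
    n, m = len(Z), len(Z[0])
    return [[sum(row[a] * row[b] for row in Z) / (n - 1) for b in range(m)]
            for a in range(m)]

def first_component(Phi, iters=200):
    """Power iteration: largest eigenvalue of Phi and its unit-norm eigenvector."""
    m = len(Phi)
    v = [1.0 / (j + 1) for j in range(m)]  # arbitrary non-zero starting vector
    for _ in range(iters):
        w = [sum(Phi[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigval = sum(v[a] * sum(Phi[a][b] * v[b] for b in range(m)) for a in range(m))
    return eigval, v

# Toy data: 4 observations of 2 strongly correlated variables (illustrative only).
X_demo = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]]
Z_demo = standardize(X_demo)
Phi_demo = covariance(Z_demo)
lam1, p1 = first_component(Phi_demo)
# Scores of the first principal component: first column of T = Z P.
t1 = [sum(z * p for z, p in zip(row, p1)) for row in Z_demo]
```

Because the columns are scaled to unit variance, Φ is the correlation matrix here, so its diagonal entries equal 1.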
Kernel principal component analysis
Kernel PCA (KPCA) relies on mapping the data into a higher-dimensional space in which the relationships in the data become linear. Consider a data matrix with m variables and N observations that has been normalized.
The data are projected onto the feature space \(H\) using the function \(\phi :{x}_{i}\in {\mathcal{R}}^{m}\to {\phi }_{i}=\phi \left({x}_{i}\right)\in {\mathcal{R}}^{h}\) of dimension \(h\gg m\). The dot product of two vectors \(\phi \left({x}_{i}\right)\) and \(\phi \left({x}_{j}\right)\) is an important quantity in the feature space, and it is as follows:
\[\phi {\left({x}_{i}\right)}^{T}\phi \left({x}_{j}\right)=k\left({x}_{i},{x}_{j}\right)\]
where \(k\) denotes the kernel function and \(i, j=1, \dots , N\).
In this study, we utilized the radial basis function defined as follows:
where c is the width of the radial basis kernel function.
The KPCA model, like the linear PCA model, is derived from the eigenvalues and eigenvectors of the covariance matrix in the new space. For a collection of centered and reduced data \(\varphi ={\left[\phi \left({x}_{1}\right)\ \cdots \ \phi \left({x}_{i}\right)\ \cdots \ \phi \left({x}_{N}\right)\right]}^{T}\), the covariance matrix C is defined in the feature space by:
The following equation is solved to determine the eigenvalues λ and eigenvectors v of the covariance matrix C:
\[\lambda v=Cv\]
This eigenvalue problem may be expressed in terms of the Gram matrix \(K=\varphi {\varphi }^{T}\) as follows:
where \(\lambda\) and \(\alpha\) are the eigenvalues and eigenvectors of K. It is then important to identify the first ℓ kernel principal components (KPCs). The cumulative percent variance (CPV) criterion [30] is utilized to determine the number ℓ of significant KPCs. The CPV measures the percentage of variance captured by the first ℓ KPCs:
\[CPV\left(\ell \right)=\frac{\sum_{i=1}^{\ell }{\lambda }_{i}}{\sum_{i=1}^{N}{\lambda }_{i}}\times 100\]
Subsequently, the kernel principal components are calculated using
where the ℓ principal eigenvectors \(P=\left[\alpha_1,\dots,\alpha_\ell\right]\) of K are those that correspond to its largest eigenvalues \(\Lambda=diag\left\{\lambda_1,\dots,\lambda_\ell\right\}\).
To select the effective features, Hotelling's T^{2} and SPE statistics are also used in addition to the first ℓ KPCs. These statistics are defined as follows:
where \(\Lambda=diag\left(\lambda_1,\dots,\lambda_\ell\right)\) and \(C=P{\Lambda }^{-1}{P}^{T}\), and where k(x) is the kernel vector of the measured variable \(x\), denoted by
Figure 1 illustrates the main stages of the KPCA technique for feature extraction and selection.
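As a rough illustration of the KPCA stages above, the sketch below builds an RBF Gram matrix, centers it in feature space, and picks the number of retained KPCs from a list of eigenvalues with the CPV criterion. Several points are assumptions rather than the paper's definitions: the kernel is taken as exp(-||x - y||^2 / c), the centering formula is the usual feature-space centering, and the helper names are hypothetical. The eigendecomposition of K is left to a numerical library and is not shown.

```python
import math

def rbf_kernel(x, y, c):
    """Assumed RBF form: k(x, y) = exp(-||x - y||^2 / c), with c the kernel width."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / c)

def gram_matrix(X, c):
    """N x N Gram matrix K with K[i][j] = k(x_i, x_j)."""
    return [[rbf_kernel(xi, xj, c) for xj in X] for xi in X]

def center_gram(K):
    """Center K in feature space: K - 1K - K1 + 1K1, where 1 is the N x N matrix of 1/N."""
    n = len(K)
    row_mean = [sum(r) / n for r in K]  # equals the column means because K is symmetric
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
            for i in range(n)]

def components_for_cpv(eigenvalues, threshold=95.0):
    """Smallest number of leading eigenvalues whose CPV (in percent) reaches the threshold."""
    lam = sorted((v for v in eigenvalues if v > 0), reverse=True)
    total = sum(lam)
    running = 0.0
    for count, value in enumerate(lam, start=1):
        running += value
        if 100.0 * running / total >= threshold:
            return count
    return len(lam)

# Toy data: 3 observations of 2 variables (illustrative only).
X_demo = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K_demo = center_gram(gram_matrix(X_demo, c=2.0))
```

With eigenvalues 5, 3, 1, 1, for example, the first two KPCs capture 80% of the variance, so a 95% threshold requires all four.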
Salp swarm algorithm (SSA)
SSA is a population-based stochastic algorithm proposed by Mirjalili et al. [27] in 2017. SSA mimics the swarming behavior of salps during ocean foraging. Salps typically form a swarm known as a salp chain in deep oceans. The salp at the front of the chain is the leader in the SSA algorithm, while the remaining salps are referred to as followers. As in other swarm-based methods, the position of the salps is defined in a d-dimensional search space, where d is the number of variables in a particular problem. A two-dimensional matrix called x is therefore used to store the positions of all salps. Additionally, it is assumed that there is a food source S in the search space, which is the swarm's target. The mathematical model of SSA is as follows. The leader salp updates its position using the following equation:
where \(x_{j}^{1}\) represents the position of the first salp (leader) in the jth dimension, \({S}_{j}\) represents the position of the food source in the jth dimension, \({ub}_{j}\) and \({lb}_{j}\) represent the upper and lower bounds of the jth dimension, respectively, and \({c}_{1}\), \({c}_{2}\), and \({c}_{3}\) are random numbers. Equation 16 demonstrates that the leader only changes its position in relation to the food source. Because it balances exploration and exploitation, the coefficient \(c_{1}\) is the most crucial parameter in the SSA. It is defined as follows:
\[{c}_{1}=2{e}^{-{\left(\frac{4l}{L}\right)}^{2}} \tag{17}\]
where L is the maximum number of iterations and l represents the current iteration. The parameters c_{2} and c_{3} are random numbers generated uniformly in the interval [0,1]. The following equation (Newton's law of motion) is used to update the position of the followers:
\[{x}_{j}^{i}=\frac{1}{2}\lambda {t}^{2}+{\delta }_{0}t \tag{18}\]
where \(i\ge 2\), \({x}_{j}^{i}\) denotes the position of the ith follower salp in the jth dimension, \(t\) denotes time, \({\delta }_{0}\) denotes the initial speed, and \(\lambda =\frac{{\delta }_{final}}{{\delta }_{0}}\), where \(\delta =\frac{x-{x}_{0}}{t}\).
Because time in optimization corresponds to iterations, the discrepancy between iterations is equal to 1, and since \({\delta }_{0}=0\), this equation can be written as follows:
\[{x}_{j}^{i}=\frac{1}{2}\left({x}_{j}^{i}+{x}_{j}^{i-1}\right) \tag{19}\]
where \(i\ge 2\) and \({x}_{j}^{i}\) represents the position of the ith follower salp in the jth dimension. The salp chains can be simulated using Eqs. 16 and 19.
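To make the leader and follower updates concrete, here is a minimal sketch of an SSA loop in plain Python. It is written under several assumptions that are not taken from the paper: the objective is minimized (the paper maximizes classification accuracy), the leader moves toward or away from the food source depending on whether c3 is at least 0.5 (a common implementation choice), positions are clipped to the bounds, and the name `ssa_minimize` and its defaults are hypothetical. The defaults of 10 salps and 50 iterations mirror the settings reported later in the simulation results.

```python
import math
import random

def ssa_minimize(objective, lb, ub, n_salps=10, max_iter=50, seed=0):
    """Minimal salp swarm sketch: returns the best position found and its objective value."""
    rng = random.Random(seed)
    d = len(lb)
    salps = [[rng.uniform(lb[j], ub[j]) for j in range(d)] for _ in range(n_salps)]
    food = min(salps, key=objective)[:]  # best solution so far (the food source S)
    food_fit = objective(food)
    for l in range(1, max_iter + 1):
        # c1 decays over the iterations and balances exploration and exploitation.
        c1 = 2.0 * math.exp(-((4.0 * l / max_iter) ** 2))
        for i in range(n_salps):
            for j in range(d):
                if i == 0:
                    # Leader: moves around the food source (0.5 threshold is an assumption).
                    c2, c3 = rng.random(), rng.random()
                    step = c1 * ((ub[j] - lb[j]) * c2 + lb[j])
                    salps[i][j] = food[j] + step if c3 >= 0.5 else food[j] - step
                else:
                    # Follower: midpoint between itself and the salp ahead of it in the chain.
                    salps[i][j] = 0.5 * (salps[i][j] + salps[i - 1][j])
                salps[i][j] = min(max(salps[i][j], lb[j]), ub[j])  # keep inside the bounds
        for s in salps:
            fit = objective(s)
            if fit < food_fit:
                food, food_fit = s[:], fit
    return food, food_fit

def sphere(p):
    """Simple test objective with its minimum of 0 at the origin."""
    return sum(v * v for v in p)

best_demo, best_fit_demo = ssa_minimize(sphere, [-5.0, -5.0], [5.0, 5.0])
```

Because the food source is only replaced when a salp improves on it, the returned objective value can never be worse than that of the best initial salp.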
SSA-based feature selection
The following is a list of the requirements to develop the SSA-based feature selection paradigm:
Encoding scheme
We encoded the individuals using a vector of real numbers. Each component of the vector corresponds to one feature and is randomly initialized in the interval [0,1]. If the component value is equal to or greater than 0.5, it is replaced with 1 and the feature is selected. Otherwise, the value is set to 0 and the feature is not selected.
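The encoding rule can be written directly as a thresholding step. The sketch below is illustrative only; `decode` and `selected_indices` are hypothetical helper names.

```python
def decode(position, threshold=0.5):
    """Map a real-valued salp position in [0, 1] to a 0/1 feature mask."""
    return [1 if value >= threshold else 0 for value in position]

def selected_indices(mask):
    """Indices of the features whose mask bit is 1."""
    return [j for j, bit in enumerate(mask) if bit == 1]

mask_demo = decode([0.50, 0.49, 0.90, 0.00])
```

A component exactly equal to 0.5 selects the feature, matching the "equal to or greater than" rule in the text.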
Objective function
The classification accuracy rate, which serves as our objective function, is computed for each candidate feature selection from Eq. 20:
\[Accuracy=\frac{TP+TN}{TP+TN+FP+FN} \tag{20}\]
where TP (true positive) refers to correctly classified positive observations, TN (true negative) refers to correctly classified negative observations, FP (false positive) refers to negative observations incorrectly classified as positive, and FN (false negative) refers to positive observations incorrectly classified as negative.
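As a small worked example of the accuracy objective, the sketch below counts TP, TN, FP, and FN for a two-class problem and combines them into the accuracy rate. The function names and the choice of label 1 as the positive class are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary problem with the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("accuracy is undefined for an empty set of observations")
    return (tp + tn) / total

counts_demo = confusion_counts([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
accuracy_demo = accuracy(*counts_demo)
```

In this toy case four of the six observations are classified correctly, so the accuracy is 4/6.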
Architecture system
In this part, we describe our proposed system, the SSA-based feature selection architecture. Previous research has employed the term 'system architecture' [31, 32]. The following are the primary components of SSA-based feature selection:
Data normalization is a typical preprocessing step in feature selection. We normalized the features to lie in the interval [0,1] in order to eliminate the negative effects of bias values in particular features; this normalization was carried out using Eq. 21, in which the normalized feature is denoted by N:
Salp individual decoding: in this stage, each salp's position vector is decoded into the set of selected features.
Identifying training and testing sets: we partitioned the dataset into training sets (X_{train}, Y_{train}) and testing sets (X_{test}, Y_{test}). The main features are represented by \(X=\left[{X}_{1}, {X}_{2}, \dots , {X}_{n} \right]\) and the main class is Y. To build the model, SML classifiers are trained on X_{train} and Y_{train}. Finally, we evaluate the model's accuracy by using X_{test} as an input to the model.
Select a feature subset: we picked the features with a value of 1 from the training set.
Fitness evaluation: we used training set vectors to train our classifier and then used Eq. 20 to estimate classification accuracy.
Termination condition: we stopped the procedure once a maximum number of iterations was reached. Figure 2 depicts the entire workflow of SSA-based feature selection.
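The components listed above can be sketched as a single wrapper fitness function. This is a simplified illustration under stated assumptions: min-max scaling is assumed for the [0,1] normalization, a 1-nearest-neighbor classifier stands in for the SML classifiers used in the paper, and all function names are hypothetical.

```python
def min_max_normalize(X):
    """Scale each column to [0, 1] (assumed min-max form; constant columns map to 0)."""
    m = len(X[0])
    lo = [min(row[j] for row in X) for j in range(m)]
    hi = [max(row[j] for row in X) for j in range(m)]
    return [[(row[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
             for j in range(m)] for row in X]

def apply_mask(X, mask):
    """Keep only the columns whose mask bit is 1."""
    return [[value for value, bit in zip(row, mask) if bit] for row in X]

def nearest_neighbor_predict(X_train, y_train, x):
    """Label of the closest training point (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
    return y_train[dists.index(min(dists))]

def fitness(mask, X_train, y_train, X_test, y_test):
    """Wrapper fitness: test accuracy of the stand-in classifier on the selected features."""
    if not any(mask):
        return 0.0  # an empty feature subset cannot classify anything
    Xtr, Xte = apply_mask(X_train, mask), apply_mask(X_test, mask)
    hits = sum(nearest_neighbor_predict(Xtr, y_train, x) == y
               for x, y in zip(Xte, y_test))
    return hits / len(y_test)

# Toy data already in [0, 1]: feature 0 separates the classes, feature 1 is noise.
X_train_demo = [[0.0, 0.9], [0.1, 0.1], [0.9, 0.8], [1.0, 0.2]]
y_train_demo = [0, 0, 1, 1]
X_test_demo = [[0.2, 0.5], [0.8, 0.5]]
y_test_demo = [0, 1]
fit_informative = fitness([1, 0], X_train_demo, y_train_demo, X_test_demo, y_test_demo)
fit_noise = fitness([0, 1], X_train_demo, y_train_demo, X_test_demo, y_test_demo)
```

On this toy data the mask that keeps only the informative feature scores higher than the mask that keeps only the noisy one, which is the signal the swarm uses to prefer one subset over another.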
Fault classification using supervised machine learning techniques
Once the most informative features of the data have been extracted and selected using the PCA, KPCA, and SSA approaches, supervised machine learning classifiers are applied to these features for fault classification. These classifiers include K-nearest neighbors (KNN), discriminant analysis (DA), decision trees (DT), and support vector machines (SVM).
K-nearest neighbors
The K-nearest neighbors (KNN) technique is a widely used machine learning algorithm for classification and regression tasks. It is a simple yet effective nonparametric method for classifying new observations based on their similarity to previously observed data [33].
Discriminant analysis
Discriminant analysis (DA) is a well-known machine-learning technique for classification tasks. It is a statistical method for determining a linear combination of features that best separates two or more classes of objects. The purpose of DA is to find a function that can accurately predict the class of new observations based on their predictor variable values [34, 35].
Decision trees
The decision tree (DT) is a common machine-learning technique that represents a decision-making process using a tree-like structure. Each node in the tree represents a decision based on a certain feature or attribute, and the branches indicate various outcomes or decisions based on that feature [36].
Support vector machines
Support vector machine (SVM) is a supervised machine learning model. It is based on the concept of a hyperplane classifier, also known as linear separability. The purpose of SVM is to identify a linear optimal hyperplane that maximizes the margin of separation between the two classes [37, 38].
Fault diagnosis and classification using SSA-based SML technique
The proposed methodology for fault diagnosis in GCPV systems consists of two primary steps: feature selection and fault classification. The approach utilizes filter and wrapper methods for feature selection, and the supervised machine learning (SML) classifier for fault diagnosis. The aim is to simplify the classification process, given the complex nature of GCPV systems and the high similarity between different faults. The first step involves collecting GCPV data, which is then subjected to PCA, KPCA, and SSA to extract and select the most efficient and pertinent features. The selected feature subset is then used as input to the SML classifier to differentiate between operating modes and classify faults. The proposed technique is summarized in the block diagram shown in Fig. 3. This study presents an effective fault diagnosis technique based on the SSA model and SML classifiers. Although PCA is highly efficient for linear systems, it is inappropriate for most nonlinear systems, which are prevalent in GCPV systems. Moreover, KPCA may be inadequate for real-world applications with large datasets. To address these challenges, an optimized SSA-based SML classifier technique is proposed, which utilizes SSA for feature selection and SML for fault classification.
The proposed SSA-based SML technique is a promising solution for detecting and identifying faults in GCPV systems. It leverages the strengths of SSA for feature selection and SML for fault classification to address the challenges posed by nonlinear systems and large datasets.
Experimental results and discussions
System description
Figure 4 shows a photovoltaic system setup with a DC bus voltage of 500 V. The PV side is made up of 3 PV networks with a maximum power of 4 kW each. A single set of PV arrays is composed of 2 parallel chains, where each chain has 24 modules connected in series. Every module has 20 cells [26].
In this study, the two parallel PV fields, PV_{1} and PV_{2}, underwent different scenarios representing five types of faults, as outlined in Table 1. The simple fault in PV_{1} involved four fault scenarios:

Bypass diode fault: the bypass diode fault is emulated by changing the resistance.

Connectivity fault: the connectivity fault is considered in the string of the PV system, between two modules. This fault was modeled by a series variable resistance.

Line-to-line fault: LL is described by the variation in a resistance that is situated between any two points in the PV array.

Line-to-ground fault: LG is described by the variation in a resistance that is situated between one point and the ground.
This study deals with various fault scenarios, and each scenario includes several cases, as shown in Table 2.

The first scenario refers to simple faults that only affect the PV1 array.

The second scenario represents simple faults that solely affect the PV2 array.

The third scenario deals with multiple faults on the same array. In this case, we address multiple faults on both PV1 and PV2 separately.

The fourth scenario examines mixed faults that might occur on both arrays at the same time.

The fifth scenario integrates all of the preceding scenarios to monitor the system in all of its states.
Table 3 lists the 8 simulated measurement variables that were collected in order to carry out the fault diagnosis experiments. The data represent one healthy operating mode (attributed to class C_{0}) and 20 different faulty operating modes of the GCPV system (assigned to C_{i}, i = 1, …, 20), as shown in Table 2. The collected dataset was divided into two categories, namely, training and testing datasets, and the same observations were used for both. To validate the testing dataset, we added noise of significant magnitude.
The following criteria are used to evaluate and compare performance: accuracy, precision, recall, F1 score, and computation time (CT) [39].
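For a multi-class problem such as this one, precision, recall, and F1 score are usually computed per class from the true and predicted labels. The sketch below shows one way to do that in plain Python; the one-vs-rest treatment of each class and the function name `per_class_metrics` are assumptions, since the text does not state how the per-class values are aggregated.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} using one-vs-rest counts for each class."""
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = (precision, recall, f1)
    return metrics

# Toy labels for three operating modes (illustrative only).
metrics_demo = per_class_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```

Classes that are never predicted or never present get a value of 0 rather than raising a division error, which is a design choice of this sketch.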
Simulation results
In this section, the PCA-, KPCA-, and SSA-based SML methods are applied for monitoring the GCPV system using a tenfold cross-validation approach. To evaluate the proposed FD paradigm, four conditions are considered. The first condition (Cd_{1}) comprises the healthy mode, a simple fault in PV1 (F_{3}), and a simple fault in PV2 (F_{7}). The second condition (Cd_{2}) comprises the healthy mode, a simple fault in PV1 (F_{2}), and a simple fault in PV2 (F_{6}). The third condition (Cd_{3}) comprises the healthy mode and a mixed fault mode (F_{15}). Finally, the last condition (Cd_{4}) comprises the healthy mode and all fault modes (F_{1} to F_{20}).
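The tenfold cross-validation mentioned above partitions the observations into ten folds and uses each fold once for testing. A minimal sketch of the index bookkeeping is given below; the shuffling, the seed, and the name `k_fold_indices` are assumptions made for illustration.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so folds are not ordered by class
    folds = [indices[i::k] for i in range(k)]  # k folds whose sizes differ by at most 1
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

splits_demo = list(k_fold_indices(25, k=10))
```

Each observation appears in exactly one test fold, so averaging the per-fold scores uses every observation once for testing.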
The PCA and KPCA algorithms are used as feature selection techniques in filter mode. For the PCA model, three groups of features are considered: group 1 (T_{ℓ}), group 2 (T_{ℓ}, SPE), and group 3 (T_{ℓ}, T^{2}, SPE). Group 2 (the first ℓ = 6 PCs and the SPE statistic) provides the best results in terms of classification accuracy, so six principal components are retained as inputs to the supervised machine learning classifiers for all faults. However, due to its underlying linearity assumption, PCA performs quite poorly for fault classification in some nonlinear systems. KPCA was developed to deal with nonlinear relationships between process variables. Here, the 95% cumulative variance criterion is used to identify the retained KPCs, which leaves 53 KPCs.
On the other hand, the SSA algorithm is used as a feature selection technique in wrapper mode by applying the KNN, DA, DT, and SVM classifiers as a fitness function (where K = 5, nSplit = 50, Disc = 'l', and Kernel = r). In this work, these classifiers are used as classification algorithms to evaluate the quality of the chosen subset of features. The SSA parameters are set as follows: the population size (number of salps) is 10 and the maximum number of iterations is 50. The results presented in Table 4 show that the SSA-SML selects a minimal number of features in all faults.
Discussions
Various classifiers are used in this study, and the best classifier is chosen based on classification performance. Table 5 depicts the overall performance accuracy.
Firstly, PCA-SML achieved low accuracies in some cases. In Cd_{1}, all the developed approaches had high diagnosis performance, with accuracy rates of 87.96%, 84.89%, 85.47%, and 97.18% for the KNN, DA, DT, and SVM classifiers, respectively. However, the results decreased in Cd_{2} compared to the previous condition. Additionally, the fault diagnosis techniques showed poor performance in Cd_{3}, with an accuracy rate of 59.86% for the SVM classifier. When dealing with all fault conditions (Cd_{4}), PCA-based DA and DT had low accuracy rates of 48.85% and 47.64%, respectively, and were inefficient in distinguishing between different operating modes.
Secondly, KPCA-SML achieved accuracies between 5.50% and 99.89%. In Cd_{1}, the KNN, DT, and SVM classifiers showed good classification performance, whereas the DA classifier reached an accuracy rate of only 33.35%. However, KNN and SVM had very high computation times during the testing stage. Moreover, the results decreased in Cd_{2} compared to the initial condition. In Cd_{3}, the fault diagnosis techniques showed good results, except for DA, with an accuracy rate of 46.11%. When dealing with all fault conditions (Cd_{4}), the KNN and SVM classifiers achieved high accuracy rates of 98.83% and 90.91%, respectively. However, DA and DT showed very poor classification, with accuracy rates of 5.50% and 15.75%, respectively.
Finally, SSA achieved the highest accuracies (57.62–99.98%) across all conditions. SSA-SML had the best overall performance, with accuracies of 99.98% and 99.91% for the SVM and DT classifiers, respectively, in Cd_{1}. In Cd_{2}, all the developed approaches had high diagnosis performance. In Cd_{3}, SSA improved the classification performance of the KPCA-based DA classifier, with the accuracy rate increasing from 46.11% to 83.77%. In the last condition, SSA improved on the KPCA-based results from 5.50% to 57.62%, from 15.75% to 69.72%, and from 90.91% to 99.46% for the DA, DT, and SVM classifiers, respectively. Besides, the proposed method led to a significant reduction in computation time compared to the other methods. Furthermore, SSA outperformed the other techniques in terms of classification accuracy, recall, precision, F1 score, and computation time, due to its ability to explore the feature space intelligently. These results confirmed the effectiveness of SSA in analyzing the feature space and selecting the best subset, which resulted in higher classification performance.
Conclusions
In this study, we focused on diagnosing various incipient faults of grid-connected photovoltaic (GCPV) systems during different operation modes. We considered 20 different types of faults, including line-to-line and line-to-ground faults, connectivity faults, and faults affecting the operation of bypass diodes. These faults presented diverse conditions, such as simple and multiple faults in the PV arrays and mixed faults in both arrays. To address the complexity and similarity between faults, we developed a feature selection tool to enhance the accuracy of the supervised machine learning (SML) models. Firstly, we applied the salp swarm algorithm (SSA) to select the most effective features from the raw data. Then, we fed these significant and sensitive features into the SML model for classification purposes. The results confirmed that the developed paradigm significantly improved the diagnosis performance when applied to GCPV systems. The diagnosis accuracies of the proposed SSA-SML were compared to those of PCA-based and kernel PCA-based SML methods through different metrics (i.e., accuracy, recall, precision, F_{1} score, and computation time). The obtained results confirmed that the developed paradigm outperformed the other methods and achieved a high diagnostic accuracy (an average accuracy greater than 99%) and low computation time using GCPV data.
Availability of data and materials
Data will be made available on request.
Abbreviations
GCPV: Grid-connected photovoltaic system
FD: Fault diagnosis
FS: Feature selection
SML: Supervised machine learning
PCA: Principal component analysis
KPCA: Kernel principal component analysis
SSA: Salp swarm algorithm
KNN: K-nearest neighbors
SVM: Support vector machine
DT: Decision tree
DA: Discriminant analysis
CT: Computation time
Acknowledgements
The publication is the result of the Qatar National Research Fund (QNRF) research grant.
Funding
Funding is provided by the Qatar National Library.
Author information
Contributions
Amal Hichri: writing (original draft), software. Mansour Hajji: writing (original draft), software. Majdi Mansouri: supervision, methodology, reviewing, and editing. Hazem Nounou: visualization. Kais Bouzrara: visualization. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Hichri, A., Hajji, M., Mansouri, M. et al. Supervised machine learning-based salp swarm algorithm for fault diagnosis of photovoltaic systems. J. Eng. Appl. Sci. 71, 12 (2024). https://doi.org/10.1186/s44147-023-00344-z