Supervised machine learning-based salp swarm algorithm for fault diagnosis of photovoltaic systems

The diagnosis of faults in grid-connected photovoltaic (GCPV) systems is a challenging task due to their complex nature and the high similarity between faults. To address this issue, we propose a wrapper approach called the salp swarm algorithm (SSA) for feature selection. The main objective of SSA is to extract only the most important features from the raw data and eliminate unnecessary ones so as to improve the classification accuracy of supervised machine learning (SML) classifiers. The selected features are then used to train the SML techniques to distinguish between the various operating modes. To evaluate the efficiency of the technique, we used healthy and faulty data from GCPV systems into which frequently occurring faults had been injected. Twenty different types of faults were introduced, including line-to-line faults, line-to-ground faults, connectivity faults, and faults affecting the operation of bypass diodes. These faults cover diverse conditions, such as simple and multiple faults in the PV arrays and mixed faults in both arrays. The performances of the developed SSA-SML are compared with those of principal component analysis (PCA) and kernel PCA (KPCA) based SML techniques through different criteria (i.e., accuracy, recall, precision, F1 score, and computation time). The experimental findings demonstrate that the proposed diagnosis paradigm outperforms the other techniques and achieves high diagnostic accuracy (an average accuracy greater than 99%) while significantly reducing computation time.


Introduction
In huge datasets, the process of assessing data becomes more difficult since not all of the data are relevant. Feature selection (FS) is the process of selecting the most important features and removing the redundant ones in order to solve classification problems. The selected subset of features improves classification accuracy while decreasing classification time, providing the same or even better accuracy than using all of the features [1]. The goal is to identify a set of s significant features from a set of S features (s < S) in a given dataset [2]. S comprises all the features of a particular data collection and may include noisy, repetitive, and misleading features. As a result, an exhaustive search cannot be used in practice, since it scans the whole solution space, which takes a long time [3]. We therefore aim to keep only a subset of the relevant features. Unnecessary features are not only useless for classification, but they may also significantly decrease classification accuracy. By removing them, both computational efficiency and classification accuracy may be improved.

In terms of search criteria, there are two types of FS methods: filter-based and wrapper-based. Filter-based techniques choose the feature subset independently of the predictors; examples include the gain ratio [4] and information gain (IG) [5]. Wrapper-based techniques, in contrast, apply predictors to evaluate the quality of the chosen features [6, 7]; examples include sequential backward selection (SBS) [8], sequential forward selection (SFS) [9], and neural network-based methods [10]. Several search approaches, in particular random search and greedy search, have been employed to find the most suitable subset of features [11]. Greedy search approaches create and assess all possible combinations of features, making this strategy time-consuming, whereas random search approaches scan the search space at random for the best subset of features. However, these approaches have several disadvantages, such as being easily trapped at local optima and suffering from large search spaces and high time complexity.

Metaheuristic approaches have been employed to address the limitations of the previously discussed FS methods. Metaheuristic techniques are global optimization approaches that mimic biological, physical, and animal social behaviors in nature [12]. When applied to FS problems, they can explore the search space both globally and locally. Particle swarm optimization (PSO) [13], genetic algorithms (GAs) [14], differential evolution (DE) [14], ant lion optimization (ALO) [15], the grey wolf optimizer (GWO) [16], and artificial bee colony optimization [17] are all well-known instances of metaheuristics. Over the past two decades, metaheuristics have proved their efficiency and productivity in solving difficult and large-scale challenges in engineering design and in machine learning and data mining applications [18].

Several studies have evaluated the effectiveness of various metaheuristic algorithms for feature selection. In [19], the authors introduced a binary version of the ant lion optimizer (ALO) to find the optimal set of features and demonstrated that their proposed algorithm outperformed other algorithms in terms of accuracy. In [20], the authors modified the parameter used to balance exploration and exploitation in ALO and introduced a chaotic ALO (CALO), which was shown to outperform standard ALO, particle swarm optimization (PSO), and the genetic algorithm (GA). Meanwhile, in [21], the authors proposed a feature selection technique based on a modified cuckoo search algorithm with rough sets and showed that their method was superior to other optimizers. In [22], the authors improved the binary version of the whale optimization algorithm (WOA) for feature selection, resulting in an improved algorithm (IWOA) that outperformed other algorithms in terms of classification accuracy and feature reduction. In [23], the authors introduced a chaotic version of the multi-verse optimization (MVO) algorithm, called CMVO, which was found to be superior to other optimizers. Finally, in [24], the authors proposed a binary version of the hybrid grey wolf optimization and particle swarm optimization algorithm (BGWOPSO), which outperformed other binary optimization algorithms in terms of accuracy, feature selection, and computational time. Another approach to feature selection is to use machine learning algorithms such as artificial neural networks (ANNs). In [25], the authors proposed a feature selection approach based on an extension of particle swarm optimization (PSO) for wind energy conversion (WEC) systems, which demonstrated improved classification performance with reduced computation time. Similarly, in [26], the authors proposed using a genetic algorithm (GA) for feature selection in combination with an ANN for fault diagnosis in grid-connected photovoltaic (GCPV) systems, which proved to be feasible and effective with low computation time.
In the current study, we present a novel fault diagnosis paradigm for photovoltaic (PV) systems utilizing a feature selection method called SSA-SML. The proposed approach aims to address the complex nature of GCPV systems and the high similarity between different faults, which makes it challenging to diagnose faults accurately and ensure high-performance operation. The main contributions of our work are as follows:

• The first step in our approach is to select the most important and sensitive features from the data, which can be challenging in nonlinear systems. While PCA is a commonly used method, it is not always effective for fault classification; an alternative method, KPCA, was therefore developed. However, KPCA can be computationally demanding for large datasets.

• To overcome these challenges, we propose an SSA-based SML technique for detecting faults and distinguishing between operating modes in PV systems. SSA offers several advantages: it is a recent algorithm, it is easy to implement, it has few parameters, and it has a low computational cost [27].

• The salp swarm algorithm (SSA) is utilized for feature selection by eliminating unnecessary features, while supervised machine learning is used for fault diagnosis. This approach tackles the issues of statistical, multivariate, and nonlinear feature selection and fault diagnosis in GCPV systems while improving classification accuracy, limiting the number of chosen features, and significantly reducing computation time.
The rest of the paper is organized as follows: Sect. 2 gives a brief theoretical overview of PCA, KPCA, and SSA, which are employed in feature extraction and selection. Section 3 is devoted to the discussion of supervised machine learning techniques. Section 4 presents the proposed methodology for fault diagnosis and classification utilizing an SSA-based SML algorithm. Section 5 presents the simulation results that evaluate the performance of the proposed SSA-based SML. Section 6 concludes the paper.

Principal component analysis
Principal component analysis (PCA) is a descriptive method for analyzing the existing relationships between system variables without taking the system's model into account [28]. Originally developed by Karl Pearson to describe and summarize the information contained in a dataset, it was later improved by Harold Hotelling as a technique for analyzing the relationships between variables [29].
Consider the data matrix X(N, m) of a system, where N represents the number of measurements or observations and m represents the number of sensors or variables. Before running the analysis, it is necessary to perform preprocessing, which consists of centering and scaling the data. The goal of this preprocessing is to keep certain variables from dominating the analysis simply because of their high amplitude relative to the other variables. Each column $X_i$ of the matrix is then centered and scaled according to

$$\tilde{X}_i = \frac{X_i - M_i}{\sigma_i}, \quad i = 1, \ldots, m,$$

where $X_i$ is the ith column of the matrix X, $M_i$ is the mean of the ith column, and $\sigma_i^2$ is the variance of the ith column, respectively. The new centered and scaled data matrix is $\tilde{X} = [\tilde{X}_1, \ldots, \tilde{X}_m]$. After obtaining the new data matrix, the covariance matrix $\Phi$ is computed as follows:

$$\Phi = \frac{1}{N-1}\, \tilde{X}^T \tilde{X}.$$

Principal component analysis thus consists of decomposing the matrix as follows:

$$\tilde{X} = T P^T,$$

where the principal components of X are represented by the columns of the matrix T, and the eigenvectors of the covariance matrix $\Phi$ are represented by the columns of the matrix P. PCA is quite effective for linear systems. However, due to the nonlinear nature of the majority of real systems, PCA is ineffective in such cases. To get around PCA's difficulties, a number of nonlinear PCA techniques have been developed, including kernel principal component analysis (KPCA).
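To make the decomposition concrete, the following is a minimal NumPy sketch of the PCA model described above; the function name and the CPV threshold are illustrative assumptions, not part of the original implementation.

```python
# A minimal sketch of the PCA model: center/scale the data, then decompose
# it into scores T and loadings P, retaining l PCs by the CPV criterion.
import numpy as np

def pca_model(X_raw, cpv_threshold=0.95):
    # Preprocessing: center each column and scale it to unit variance
    X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)
    N = X.shape[0]
    Phi = (X.T @ X) / (N - 1)                  # covariance matrix
    eigvals, P = np.linalg.eigh(Phi)           # columns of P: eigenvectors of Phi
    order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
    eigvals, P = eigvals[order], P[:, order]
    cpv = np.cumsum(eigvals) / eigvals.sum()   # cumulative percent variance
    l = int(np.searchsorted(cpv, cpv_threshold)) + 1
    T = X @ P[:, :l]                           # principal component scores
    return T, P[:, :l], eigvals[:l]
```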

Kernel principal component analysis
Kernel PCA (KPCA) relies on mapping the data into a higher-dimensional space in which the relationships between the data become linear. Consider a data matrix with m variables and N observations that has been normalized.
The data are projected onto the characteristic space H using a nonlinear mapping $\phi$. The dot product of two projected vectors $\phi(x_i)$ and $\phi(x_j)$ is an important quantity in the feature space, and it is given by

$$k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle, \quad i, j = 1, \ldots, N, \quad (1)$$

where k denotes the kernel function. In this study, we utilized the radial basis function (RBF) kernel, defined as follows:

$$k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{c}\right),$$

where c is the width of the radial basis kernel function.
The KPCA model, like the linear PCA model, is derived from the eigenvalues and eigenvectors of the covariance matrix in the new space. For a collection of centered and scaled data, the covariance matrix C is defined in the space of the characteristics by

$$C = \frac{1}{N} \sum_{i=1}^{N} \phi(x_i)\, \phi(x_i)^T.$$

The following equation is solved to determine the eigenvalues $\lambda$ and eigenvectors $v$ of the covariance matrix C:

$$C v = \lambda v. \quad (2)$$

Equation (2) may be expressed through the Gram matrix $K = \phi\phi^T$ as follows:

$$K \alpha = \lambda \alpha,$$

where $\lambda$ and $\alpha$ are the eigenvalues and eigenvectors of K. It is then important to identify the first ℓ kernel principal components (KPCs). The cumulative percent variance (CPV) criterion [30] is utilized to calculate the number ℓ of significant KPCs. As determined by the first ℓ KPCs, the CPV measures the percentage of captured variance:

$$CPV(\ell) = \frac{\sum_{i=1}^{\ell} \lambda_i}{\sum_{i=1}^{N} \lambda_i} \times 100\%.$$

Subsequently, the kernel principal components are calculated using

$$T = K P \Lambda^{-1/2},$$

where the ℓ principal eigenvectors $P = [\alpha_1, \ldots, \alpha_\ell]$ of K are those that correspond to its largest eigenvalues $\Lambda = \mathrm{diag}\{\lambda_1, \ldots, \lambda_\ell\}$.
To select the effective features, Hotelling's T² and SPE statistics are also used in addition to the first ℓ KPCs. These statistical characteristics are defined as

$$T^2(x) = k(x)^T C\, k(x), \qquad SPE(x) = \sum_{j=1}^{N} t_j(x)^2 - \sum_{j=1}^{\ell} t_j(x)^2,$$

where $\Lambda = \mathrm{diag}\{\lambda_1, \ldots, \lambda_\ell\}$, $C = P \Lambda^{-1} P^T$, and $k(x) = [k(x_1, x), \ldots, k(x_N, x)]^T$ is the kernel vector of the measured variable x.
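The KPCA computations can be sketched in NumPy as follows; this is an illustrative implementation under the RBF kernel defined earlier, and the omission of test-vector centering in the T² helper is a simplifying assumption.

```python
# A minimal sketch of KPCA: build and center the Gram matrix, retain l KPCs
# by the CPV criterion, and expose a T^2 statistic for new samples.
import numpy as np

def rbf_kernel(A, B, c):
    """k(a, b) = exp(-||a - b||^2 / c), with c the kernel width."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / c)

def kpca_model(X, c, cpv_threshold=0.95):
    N = X.shape[0]
    K = rbf_kernel(X, X, c)
    one = np.ones((N, N)) / N                  # center the Gram matrix
    Kc = K - one @ K - K @ one + one @ K @ one
    lam, alpha = np.linalg.eigh(Kc)
    lam, alpha = lam[::-1], alpha[:, ::-1]     # descending eigenvalues
    cpv = np.cumsum(lam) / lam.sum()
    l = int(np.searchsorted(cpv, cpv_threshold)) + 1
    P, L = alpha[:, :l], np.clip(lam[:l], 1e-12, None)

    def t2_statistic(x):
        # T2 = k(x)^T P diag(1/L) P^T k(x); centering of k(x) omitted here
        kx = rbf_kernel(x[None, :], X, c).ravel()
        return kx @ P @ np.diag(1.0 / L) @ P.T @ kx

    return P, L, t2_statistic
```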

Salp swarm algorithm (SSA)
SSA is a random-population algorithm that Mirjalili et al. [27] proposed in 2017. SSA mimics the swarming behavior of salps during ocean foraging. Salps typically form a swarm known as a salp chain in deep oceans. In the SSA algorithm, the salp at the front of the chain is the leader, while the remaining salps are referred to as followers. As in other swarm-based methods, the positions of the salps are defined in a d-dimensional search space, where d is the number of variables of a particular problem, and a two-dimensional matrix x is utilized to store the positions of all salps. Additionally, a food source S in the search space is assumed to be the swarm's target. The mathematical model of SSA is as follows. The leader salp changes its position using

$$x_j^1 = \begin{cases} S_j + c_1\left((ub_j - lb_j)c_2 + lb_j\right), & c_3 \ge 0,\\[2pt] S_j - c_1\left((ub_j - lb_j)c_2 + lb_j\right), & c_3 < 0, \end{cases} \quad (16)$$

where $x_j^1$ is the position of the leader in the jth dimension, $S_j$ is the position of the food source, and $ub_j$ and $lb_j$ are the upper and lower bounds of the jth dimension. The coefficient $c_1$, which balances exploration and exploitation, is given by

$$c_1 = 2 e^{-\left(\frac{4l}{L}\right)^2}, \quad (17)$$

where L is the maximum number of iterations and l represents the current iteration. Random variables in the interval [0, 1] are generated uniformly for the parameters $c_2$ and $c_3$. The following equation (Newton's law of motion) is used to update the position of the followers:

$$x_j^i = \frac{1}{2} a t^2 + \delta_0 t, \quad i \ge 2, \quad (18)$$

where $x_j^i$ depicts the position of the ith follower salp in the jth dimension, t denotes time, and $\delta_0$ denotes the beginning speed. Because time in optimization is measured in iterations, the discrepancy between iterations is equal to 1, and since $\delta_0 = 0$, this equation can be written as follows:

$$x_j^i = \frac{1}{2}\left(x_j^i + x_j^{i-1}\right), \quad i \ge 2. \quad (19)$$

It is possible to mimic the salp chains using Eqs. 16 and 19.
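The update rules in Eqs. 16–19 translate into a compact optimization loop. The sketch below is a minimal Python rendering, assuming box bounds and a fitness function to be minimized; splitting on c3 at 0.5 follows common SSA implementations, since c3 is drawn from [0, 1].

```python
# A minimal sketch of the SSA loop: leader update (Eq. 16 with c1 from
# Eq. 17) followed by the follower update (Eq. 19).
import numpy as np

def ssa_minimize(fitness, lb, ub, n_salps=10, max_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    d = lb.size
    X = rng.uniform(lb, ub, size=(n_salps, d))    # salp chain positions
    scores = np.array([fitness(x) for x in X])
    best = scores.min()
    F = X[scores.argmin()].copy()                 # food source S (best so far)
    for l in range(1, max_iter + 1):
        c1 = 2.0 * np.exp(-((4.0 * l / max_iter) ** 2))   # Eq. 17
        for j in range(d):                        # leader update, Eq. 16
            c2, c3 = rng.random(), rng.random()
            step = c1 * ((ub[j] - lb[j]) * c2 + lb[j])
            X[0, j] = F[j] + step if c3 >= 0.5 else F[j] - step
        for i in range(1, n_salps):               # follower update, Eq. 19
            X[i] = 0.5 * (X[i] + X[i - 1])
        np.clip(X, lb, ub, out=X)
        for x in X:                               # refresh the food source
            fx = fitness(x)
            if fx < best:
                best, F = fx, x.copy()
    return F
```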

SSA-based feature selection
The requirements for developing the SSA-based feature selection paradigm are as follows.

Encoding scheme: we encoded the individuals using a vector of real numbers. The vector entries correspond to features and are randomly mapped into the interval [0, 1]. As a result, if a component value is greater than or equal to 0.5, it is replaced with 1 and the corresponding feature is selected; otherwise, it is set to 0 and the feature is not selected.
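A small sketch of this encoding/decoding step (the helper name is illustrative):

```python
# Map a real-valued salp position in [0, 1]^d to a binary feature mask
# with a 0.5 threshold, returning the indices of the selected features.
import numpy as np

def decode_salp(position):
    return np.flatnonzero(np.asarray(position) >= 0.5)

# e.g. decode_salp([0.7, 0.2, 0.5]) -> array([0, 2])
```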

Objective function
Our objective function is the classification accuracy rate, computed from Eq. 20 for each selected feature subset:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \quad (20)$$

where TP (true positive) refers to correctly classified positive observations, TN (true negative) refers to correctly classified negative samples, FP (false positive) refers to negative samples incorrectly classified as positive, and FN (false negative) refers to positive samples incorrectly classified as negative.

Architecture system: in this part, we describe our suggested system, the SSA-based feature selection architecture. Previous research employed the term 'System Architecture' [31, 32]. The following are the primary components of SSA-based feature selection:

Data normalization: normalization is a typical preprocessing step in feature selection. We normalized the features into the interval [0, 1] in order to eliminate the negative effects of bias values in particular features; denoting the selected feature by x, its normalized value N is obtained from Eq. 21:

$$N = \frac{x - \min(x)}{\max(x) - \min(x)}. \quad (21)$$

Salp individuals decoding: the vector is populated with the selected features at this stage.
Identifying training and testing sets: we partitioned the dataset into a training set (X_train, Y_train) and a testing set (X_test, Y_test). The input features are represented by X = [X_1, X_2, ..., X_n] and the class label by Y. To build the model, the SML classifiers are trained on X_train and Y_train. Finally, we evaluate the model's accuracy by using X_test as input to the model.
Select a feature subset: we picked the features with a value of 1 from the training set.

Fitness evaluation: we used the training set vectors to train our classifier and then used Eq. 20 to estimate the classification accuracy.

Termination condition: we stopped the entire process by limiting the number of iterations. Figure 2 depicts the entire system workflow for SSA-based feature selection.
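Tying the components together, a hedged end-to-end sketch of the wrapper might look as follows. It reuses the illustrative helpers from the earlier sketches (`decode_salp` and `ssa_minimize`), defines the min-max normalization of Eq. 21 in code, and uses a KNN classifier as the fitness evaluator; KNN is one of the SML options considered in this work, not necessarily the authors' exact choice here.

```python
# Illustrative wrapper: SSA searches [0,1]^d for a feature mask that
# maximizes classifier accuracy (Eq. 20) on held-out data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def min_max_normalize(X):
    """Scale every feature into [0, 1] (Eq. 21)."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    return (X - xmin) / (xmax - xmin)

def make_fitness(X, y, k=5, seed=0):
    """Build a fitness function: 1 - accuracy of KNN on the decoded subset."""
    Xn = min_max_normalize(X)
    X_tr, X_te, y_tr, y_te = train_test_split(
        Xn, y, test_size=0.3, random_state=seed, stratify=y)

    def fitness(position):
        idx = decode_salp(position)           # from the encoding-scheme sketch
        if idx.size == 0:                     # an empty subset is invalid
            return 1.0
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr[:, idx], y_tr)
        return 1.0 - accuracy_score(y_te, clf.predict(X_te[:, idx]))

    return fitness
```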
Fault classification using supervised machine learning techniques

Once the most informative features of the data have been extracted and selected using the PCA, KPCA, and SSA approaches, supervised machine learning classifiers are applied to these features for fault classification. These classifiers include K-nearest neighbors (KNN), discriminant analysis (DA), decision trees (DT), and support vector machines (SVM).

K-nearest neighbors
The K-nearest neighbors (KNN) technique is a widely used machine learning algorithm for classification and regression tasks.It is a simple yet effective non-parametric method for classifying new observations based on their similarity to previously observed data [33].

Discriminant analysis
Discriminant analysis (DA) is a well-known machine learning technique for classification tasks. It is a statistical method for determining a linear combination of features that best separates two or more classes of objects. The purpose of DA is to find a function that can accurately predict the group or class of new observations based on their predictor variable values [34, 35].

Decision trees
The decision tree (DT) is a common machine learning technique that represents a decision-making process using a tree-like structure. Each node in the tree represents a decision based on a certain feature or attribute, and the branches indicate the possible outcomes or decisions based on that feature [36].

Support vector machines
Support vector machine (SVM) is a supervised machine learning model.It is based on the concept of a hyperplane classifier, also known as linear separability.The purpose of SVM is to identify a linear optimal hyperplane that maximizes the margin of separation between the two classes [37,38].
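As a concrete illustration, the four classifiers can be instantiated with scikit-learn as follows. The hyperparameters mirror the simulation settings reported later (K = 5 neighbors, a 50-split limit for the tree, a linear discriminant, and an RBF kernel); the mapping of "nSplit" to `max_leaf_nodes` is an assumption.

```python
# A minimal sketch of the four SML classifiers used for fault classification.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),     # K = 5
    "DA":  LinearDiscriminantAnalysis(),            # Disc = 'l' (linear)
    "DT":  DecisionTreeClassifier(max_leaf_nodes=50),  # 'nSplit = 50' analogue
    "SVM": SVC(kernel="rbf"),                       # Kernel = 'r' (RBF)
}

# Each model is trained and evaluated identically on the selected features:
# clf.fit(X_train_sel, y_train); y_pred = clf.predict(X_test_sel)
```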

Fault diagnosis and classification using SSA-based SML technique
The proposed methodology for fault diagnosis in GCPV systems consists of two primary steps: feature selection and fault classification. The approach utilizes filter and wrapper methods for feature selection, and supervised machine learning (SML) classifiers for fault diagnosis. The aim is to simplify the classification process, given the complex nature of GCPV systems and the high similarity between different faults. The first step involves collecting GCPV data, which are then subjected to PCA, KPCA, and SSA to extract and select the most relevant features. Although PCA is highly efficient for linear systems, it is inappropriate for most nonlinear systems, which are prevalent among GCPV systems. Moreover, KPCA may be inadequate for real-world applications with large datasets. To address these challenges, an optimized SSA-based SML classifier technique is proposed, which utilizes SSA for feature selection and SML for fault classification.
The proposed SSA-based SML technique is a promising solution for detecting and identifying faults in GCPV systems.It leverages the strengths of SSA for feature selection and SML for fault classification to address the challenges posed by nonlinear systems and large datasets.

System description
Figure 4 shows the photovoltaic system setup with a DC bus voltage of 500 V. The PV side is made up of 3 PV networks with a maximum power of 4 kW each. A single PV array is composed of 2 parallel chains, where each chain has 24 modules connected in series. Every module has 20 cells [26].
In this study, the two parallel PV fields, PV 1 and PV 2, underwent different scenarios representing five types of faults, as outlined in Table 1. The simple fault in PV 1 involved four fault scenarios:

• Bypass diode fault: the bypass diode fault is emulated by varying a resistance.
• Connectivity fault: the connectivity fault is considered in a string of the PV system, between two modules. This fault was modeled by a variable series resistance.
• Line-to-line (LL) fault: the LL fault is described by a variable resistance situated between any two points in the PV array.
• Line-to-ground (LG) fault: the LG fault is described by a variable resistance situated between one point and the ground.
This study deals with various fault scenarios, and each scenario includes several cases, as shown in Table 2.
• The first scenario refers to simple faults that only affect the PV1 array.
• The second scenario represents simple faults that solely affect the PV2 array.
• The third scenario deals with multiple faults on the same array. In this case, we address multiple faults on PV1 and PV2 separately.
• The fourth scenario examines mixed faults that might occur on both arrays at the same time.
• The fifth scenario integrates all of the preceding scenarios to monitor the system in all of its states.
Table 3 shows the 8 simulated variable measurements that were collected in order to carry out the various fault diagnosis experiments. These variables represent one healthy operating mode (attributed to class C 0 ) and 20 different faulty operating modes of the GCPV system (assigned to C i , i = 1, ..., 20), respectively, as shown in Table 2. The collected dataset was divided into two categories, namely, training and testing datasets, with the same observations used for both; to validate the model on the testing dataset, noise of significant magnitude was added to it.
The following criteria are adopted for evaluating and comparing performance: accuracy, precision, recall, F1 score, and computation time (CT) [39].
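A short sketch of how these criteria might be computed with scikit-learn; macro-averaging across the 21 classes is an assumption, since the averaging scheme is not stated.

```python
# Hedged sketch of the evaluation criteria: accuracy, precision, recall,
# F1 score, and computation time (CT) for one trained classifier.
import time
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def evaluate(clf, X_tr, y_tr, X_te, y_te):
    t0 = time.perf_counter()
    clf.fit(X_tr, y_tr)
    y_pred = clf.predict(X_te)
    ct = time.perf_counter() - t0              # computation time (CT)
    return {
        "accuracy":   accuracy_score(y_te, y_pred),
        "precision":  precision_score(y_te, y_pred, average="macro"),
        "recall":     recall_score(y_te, y_pred, average="macro"),
        "f1":         f1_score(y_te, y_pred, average="macro"),
        "ct_seconds": ct,
    }
```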

Simulation results
In this section, the proposed PCA-, KPCA-, and SSA-based SML methods are applied to monitor the GCPV system, using a tenfold cross-validation approach. To perform the proposed FD paradigm, four conditions are considered. The first condition (attributed to Cd 1 ) comprises the healthy mode, a simple fault in PV1 (F 3 ), and a simple fault in PV2 (F 7 ). The second condition (Cd 2 ) comprises the healthy mode, a simple fault in PV1 (F 2 ), and a simple fault in PV2 (F 6 ). The third condition (Cd 3 ) comprises the healthy mode and a mixed fault mode (F 15 ). Finally, the last condition (Cd 4 ) comprises the healthy mode and all fault modes (F 1 to F 20 ).
The PCA and KPCA algorithms are used as feature selection techniques in a filter mode. In this study, and with regard to the PCA model, 3 groups of features are used: group 1 (T ℓ ), group 2 (T ℓ , SPE), and group 3 (T ℓ , T 2 , SPE). Group 2 (the first ℓ = 6 PCs and the SPE statistic) provides the best results in terms of classification accuracy; hence, 6 principal components are retained as inputs to the supervised machine learning classifiers for all faults. Due to its underlying linearity assumption, however, PCA performs quite poorly for fault classification in some nonlinear systems. KPCA was developed to deal with nonlinear relationships between process variables; using the 95% cumulative variance criterion to identify the retained KPCs, 53 KPCs remain.
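For illustration, the "group 2" filter features for the PCA model (the first ℓ = 6 PC scores plus the SPE statistic) might be assembled as follows, reusing the loadings P from the earlier PCA sketch; the function name is illustrative.

```python
# Sketch of the 'group 2' PCA filter features: first l PC scores + SPE.
import numpy as np

def pca_group2_features(X, P, l=6):
    T = X @ P[:, :l]                     # first l principal component scores
    resid = X - T @ P[:, :l].T           # residual part of each sample
    spe = (resid ** 2).sum(axis=1)       # SPE statistic per sample
    return np.column_stack([T, spe])     # classifier input of shape (N, l + 1)
```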
On the other hand, the SSA algorithm is used as a feature selection technique in a wrapper mode by applying the KNN, DA, DT, and SVM classifiers as fitness functions (where K = 5, nSplit = 50, Disc = 'l', and Kernel = 'r'). In this work, these classifiers are used as classification algorithms to evaluate the quality of the chosen feature subsets. The SSA parameters are set as follows: the population size (number of salps) is 10 and the maximum number of iterations is 50. The results presented in Table 4 show that SSA-SML selects a minimal number of features for all faults.
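Under these settings, the illustrative sketches above would be combined roughly as follows; X and y stand for the normalized GCPV measurements and the class labels C0–C20, and all helper names come from the earlier sketches rather than the authors' code.

```python
# Usage sketch combining the earlier helpers with the reported settings:
# 10 salps, 50 iterations, KNN (K = 5) as the wrapper's fitness evaluator.
import numpy as np

d = X.shape[1]                                  # number of candidate features
fitness = make_fitness(X, y, k=5)
best_position = ssa_minimize(fitness, lb=np.zeros(d), ub=np.ones(d),
                             n_salps=10, max_iter=50)
selected = decode_salp(best_position)           # indices of retained features
print(f"{selected.size} features selected:", selected)
```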

Table 3
Descriptions of the measured variables

x1 (I pv1): output current of the PV 1 panel (A)
x2 (V pv1): output voltage of the PV 1 panel (V)
x3 (I pv2): output current of the PV 2 panel (A)
x4 (V pv2): output voltage of the PV 2 panel (V)
x5 (V dc): DC bus voltage (V)
x6 (i a): grid current, phase a (A)
x7 (i b): grid current, phase b (A)
x8 (i c): grid current, phase c (A)

Conclusions
In this study, we focused on diagnosing various incipient faults of grid-connected photovoltaic (GCPV) systems during different operation modes. We identified 20 different types of faults, including line-to-line and line-to-ground faults, connectivity faults, and faults affecting the operation of bypass diodes. These faults presented diverse conditions, such as simple and multiple faults in the PV arrays and mixed faults in both arrays. To address the complexity and similarity between faults, we developed a feature selection tool to enhance the accuracy of the supervised machine learning (SML) models. Firstly, we applied the salp swarm algorithm (SSA) for feature selection to select the most effective features from the raw data. Then, we fed these significant and sensitive features into the SML model for classification purposes. The results confirmed that the developed paradigm significantly improved the diagnosis performance when applied to GCPV systems. The diagnosis accuracies of the proposed SSA-SML were compared to those of PCA and kernel PCA-based SML methods through different metrics (i.e., accuracy, recall, precision, F1 score, and computation time). The obtained results confirmed that the developed paradigm outperformed the other methods and achieved high diagnostic accuracy (an average accuracy greater than 99%) with low computation time using GCPV data.

Fig. 1 Main stages of the KPCA technique for feature extraction and selection

Fig. 3 Illustration of SML-based feature selection procedures for PV fault diagnosis
Fig. 4 Setup of the studied photovoltaic system

Table 1
Description and characteristics of the different labeled injected faults

Table 2
Construction of database for GCPV fault diagnosis system

Table 4
SSA-based feature selection in all faults

Table 5
Summary performances of different classifiers