Detection research of insulating gloves wearing status based on improved YOLOv8s algorithm

The safety hazards may be caused by power grid operators not wearing insulating gloves according to regulations for live electrical working. Additionally, existing methods for detecting the wearing status of insulating gloves suffer from low recognition accuracy, slow detection speed, and large memory occupation by weight files. To address these issues, a Mixup-CA-Small-YOLOv8s (MCS-YOLOv8s) algorithm is proposed for detecting the wearing status of insulating gloves. Firstly, the mixup data augmentation technology using image mixing is introduced, increasing the data’s diversity and improving the model’s generalization ability. Secondly, the coordinate attention (CA) module is added to the original backbone network to strengthen the channel and positional information, suppressing the secondary feature information. Finally, a small target detection structure is designed by removing the last bottom feature detection layer in the original neck network and adding a shallow feature. The ability of small targets’ feature extraction is enhanced without increasing too much computation. The experimental results indicate that the mean average precision of the MCS-YOLOv8s algorithm on the test set is 0.912, the detection speed is 87 FPS, and the model’s weight memory occupies 15.7 MB. It is verified that the model has the advantages of high detection accuracy, fast speed, and small weight memory, which has great significance in ensuring the safe and stable operation of the power grid.


Introduction
The safety hazards associated with the power grid involve various factors, such as personnel behavior and equipment failure [1].Following the regulations, power grid operators are required to wear insulating gloves when conducting charged operations.Failure to comply with this requirement is considered unsafe behavior.Currently, within the power grid infrastructure, staff rely on monitoring videos to manually assess whether grid personnel are wearing insulating gloves.Depending on manual supervision alone can leave the grid with significant deficiencies in terms of operating costs, detection objectivity, and reliability [2,3].The intelligent detection of power grid personnel *Correspondence: leetg@mail.lzjtu.cninsulating gloves whether wearing condition, based on deep learning algorithms, holds significant importance in enhancing the efficiency of monitoring and ensuring the safe and stable operation of the power grid.With the advancement of computer vision technology, accomplishing this task has become more feasible.
In prior studies on detecting the condition of power grid personnel wearing insulating gloves, literature [4] employs traditional computer vision methods to fuse global color features and SIFT features.They utilize the center of mass distance and intersection over union between insulating gloves and body parts, to identify the status of insulating glove wearing.However, their approach does not account for the situation where the wrong gloves are worn.Additionally, their dataset images for the target are clear and high resolution, and the model exhibits poor generalization ability and could not meet the requirements of practical engineering application.Reference [5] utilizes the RetinaNet network to recognize the state of wearing insulating gloves.While achieving high detection accuracy for correctly wearing insulating gloves, the recognition accuracy for incorrectly wearing insulating gloves is considerably low.This is mainly attributed to the small size of the hand target and the weak feature of the hand, causing the prediction frame to deviate from the positioning of the small target, leading to misdetection or even missed detection.Reference [6] introduces an algorithm for detecting the status of wearing insulating gloves using an enhanced version of YOLOv3.The average accuracy improves by 31% compared to the YOLOv3 base model.Nevertheless, the overall detection accuracy still requires improvement, and the model's weight memory is 346.9MB.According to the literature [7], the improved YOLOv3 model exhibits higher detection accuracy and speed compared to Faster R-CNN.However, the overall mean average precision is only 79.25%, and the detection speed is merely 19.4 FPS.Reference [8] proposes YOLOv4based detection for insulating gloves wearing status, achieving high average accuracy.Nonetheless, targets not wearing insulating gloves correctly are not recognized.Furthermore, the detection speed and model's weight file memory are not mentioned.Therefore, it is difficult to ensure that engineering deployment is achieved.Both literature [9] and literature [10] utilize YOLOv5 for detecting insulating gloves.Despite their methods achieving high mean average precision, neither can detect the condition of incorrectly wearing insulating gloves.
The current model for detecting the wearing condition of insulating gloves, based on traditional computer vision and deep learning methods, suffers from low recognition accuracy, poor real-time performance, and a large memory occupation of weight files.The enhancement and attention to the recognition of small hand targets that are not properly wearing insulating gloves is urgently needed.The analysis indicates that the deep learning method surpasses the traditional computer vision method in all aspects of performance for recognizing insulating gloves.Furthermore, owing to the ongoing development of the deep learning method, the model's recognition performance for the insulating gloves wearing condition continues to improve consistently.
Deep learning methods have become increasingly prevalent in tasks related to target detection [11].In the 2014 IEEE International Conference on Computer Vision and Pattern Recognition, the R-CNN for target detection was proposed by Girshick et al. [12].In the following year, Fast R-CNN was proposed by Girshick et al. [13].At the 2015 Conference and Workshop on Neural Information Processing Systems, Kaiming He et al.
introduced Faster R-CNN, which significantly improves average accuracy [14].However, its detection speed is only 5 FPS.In 2016, Redmon et al. introduced the YOLO one-stage target detection network, achieving a detection speed of 45 FPS [15].However, its recognition accuracy is lower compared to the two-stage target detection network.In 2017 and 2018, Redmon et al. proposed the YOLOv2 [16] and YOLOv3 [17] algorithms, respectively.These algorithms significantly improve recognition accuracy and detection speed.However, the detection of small targets remains unsatisfactory.In 2020, the YOLOv4 algorithm was proposed by Bochkovskiy et al. [18], while YOLOv5 was proposed by Glenn et al. [19] in the same year.Both algorithms employ a deeper network structure and the path aggregation network (PANet) to enhance the accuracy of small target detection.Following this, YOLOv6 [20], YOLOv7 [21], and YOLOv8 [22] were introduced.YOLOv8 demonstrates higher mean average precision and faster detection rates.After considering all performance metrics of the above target detection algorithms, it is concluded that YOLOv8 is the most suitable base model for real-time wearing status detection of the insulating gloves of power grid personnel.
To achieve higher recognition accuracy, faster detection speed, and a smaller weight file memory for detecting the wearing condition of insulating gloves, this paper proposes the MCS-YOLOv8s algorithm, an enhancement of the YOLOv8s base model.Firstly, the Mixup data augmentation method is employed to enhance the model's generalization.Secondly, the backbone network incorporates the CA mechanism to enhance the extraction of crucial features.Finally, a small target detection structure for detecting small targets is designed to enhance the identification of incorrectly wearing insulating gloves in complex backgrounds.This is crucial for detecting the wearing status of insulating gloves for power grid personnel.

YOLOv8s algorithm
The YOLOv8 algorithm comprises five different network structures with varying widths and depths, namely n, s, m, l, and x.When evaluating the performance of different YOLOv8 algorithm structures for the same task, such as detection accuracy and speed, it is found that the YOLOv8s algorithm is better suited for detecting the condition of insulating gloves wearing.Figure 1 displays the structure of the YOLOv8s network, which comprises three parts: backbone network (backbone), neck network (neck), and output (output).The backbone network includes the convolution module (Conv), bottleneck layer (C2f ), and spatial pyramid pooling-fast layer (SPPF).The C2f module enhances the model's ability to extract gradient flow information while maintaining a lightweight design.It enhances the feature extraction of input images in the backbone network.The neck network utilizes a PANet structure that fuses strong semantic and localization feature information through top-down and bottom-up path aggregation.Finally, at the output, multiple bounding boxes undergo non-maximum suppression (NMS) for filtering.The prediction category with the highest output confidence value is then selected, and the coordinates of the bounding boxes at the target location are returned.

MCS-YOLOv8s algorithm improvement methods
Figure 2 illustrates the network structure of the MCS-YOLOv8s algorithm, which is enhanced on the YOLOv8s model.The following outlines the overall improvement ideas: 1) Firstly, the data enhancement strategy has been enhanced by incorporating the Mixup data augmentation technique.This approach elevates the training images' background complexity and enriches the training set's diversity.Improve the model's detection performance and robustness while retaining the current network structure and computational efficiency.2) Next, the CA module is introduced in the backbone of YOLOv8s between the Conv module and C2f module and following the SPPF module.The intermediate feature map can efficiently integrate spatial positional information and accurately localize the position of small targets through the CA module, which strengthens the attention to channel and positional information by taking into account their positional relationship.This ensures a more objective and precise evaluation of the target's position.3) Finally, the PANet structure in the neck network is enhanced by removing the bottom feature detection layer and introducing a shallow feature, thereby designing a new small target detection structure.This approach can preserve more effective information about small targets.The utilization of shallow feature maps with small receptive fields can significantly enhance the detection of small targets.The small target detection structure can more accurately detect hard-to-recognize small targets without significantly increasing computational effort.

Coordinate attention (CA) mechanism module
The CA module enables the network to gain information about a broader area without incurring significant computational overhead by embedding positional information into channel attention.The CA module considers both channel and direction-dependent positional information.Additionally, it is flexible and lightweight enough to be easily integrated into the network architecture [23].Figure 3 shows the CA module, which consists of the residual network module, X Avg Pool, and Y Avg Pool.In Fig. 3, C, H, and W respectively represent the channel, height, and width of the input feature map.X Avg Pool and Y Avg Pool respectively refer to one-dimensional average pooling along the horizontal and vertical directions of the input feature map.Re-weight refers to the weighted aggregation of spatial features from input feature maps.The CA module comprises two parts: positional information generation and position attention aggregation.Initially, the positional information of the feature map is generated.Pooling kernels of dimensions (H, 1) and (1, W) are employed to respectively encode features along the horizontal and vertical directions of a given input feature map U.The outputs from encoding the c-th channel on the input feature map U, with a height of h and a width of w, can be expressed as follows.
where u c (h, i) represents the eigenvalue of the c-th channel on the input feature map U with height h and width i. u c (j, w) represents the eigenvalue of the c-th channel on the input feature map U with height j and width w.
Position attention aggregation is initiated using the horizontal and vertical positional information generated by the above encoding.The feature maps, which have encoded both spatial directions, undergo a channel splicing operation.The spliced feature maps are then transformed using the 1 × 1 convolutional transform function F 1 .This process can be expressed as follows.
where δ represents the nonlinear activation function.[ •, •] denotes the channel splicing operation along the spatial dimensions.The intermediate feature mapping of the positional information is denoted by f ∈ R C/r×(H +W ) , to minimize computation and model complexity, with r being the reduction rate set to 16.
(1) For the intermediate feature mapping f, it is decomposed into tensors f h ∈ R C/r×h and f w ∈ R C/r×w in horizontal and vertical directions, which are along the spatial dimension.The 1 × 1 convolutional transform functions F 2 and F 3 are then used to transform f h and f w into tensors with the same number of channels C. The computa- tional process can be expressed as follows.
where g h and g w are transformed tensors output.σ is a sigmoid activation function.
Finally, the feature information of each position on the input feature map U is weighted and aggregated using g h and g w as the attention weights in the horizontal and vertical directions, respectively.The final feature map V after feature aggregation is then output.The procedure for weighted aggregation in computation can be expressed as follows.
where v c (j, i) represents the eigenvalue of the c-th channel on the output feature map V, with position coordinates (j, i).u c (j, i) represents the eigenvalue of the c-th channel on the input feature map U, with position coordinates (j, i).g h c j and g w c (i) are the horizon- tal attentional weights for the height j and the vertical attentional weights for the width i of the c-th channel on the input feature map U, respectively.
The backbone network feature extraction process in YOLOv8s involves the convolutional layer, which calculates the feature information of neighboring positions for each feature map.It is effective in extracting local features but struggles to capture global features.The convolutional layer neglects the inter-mapping between the information of each channel because each channel in the feature map contains distinct feature information [23].Power grid personnel frequently wear insulating gloves when working with electricity outdoors.However, this practice can give rise to issues such as difficulty in detecting the target due to the distance between the hand target and detection equipment, as well as low pixel value.Therefore, the addition of the CA module enhances the learning of feature relationships between channels and captures long-range dependencies within a channel, enabling the retention of precise positional information for small targets.Literature [24] has demonstrated that the CA module can optimize the learning of various types of target feature information in feed-forward neural networks and efficiently integrate spatial coordinate information.This enhancement improves the network's ability to accurately locate the target position and effectively enhances target detection performance.Therefore, this paper proposes adding the CA module between each Conv module and C2f module in the backbone network, as well as at the last layer of the backbone.The introduction of the CA module is illustrated in Fig. 4. The enhanced backbone network can extract more precise information about small targets by enhancing attention to channel and (4) positional information through the CA mechanism module while reducing attention to secondary information.The network model has been improved to enhance the detection of power grid personnel wearing insulating gloves status, making it more robust.

Small target detection structure
The YOLOv8s network requires the input image to undergo feature extraction in the backbone network, followed by processing in the neck network, before finally outputting to the detection layer.Addressing challenges such as the weak target features and the small size of insulating gloves wearing detection for power grid personnel, this paper enhances the PANet structure and proposes a small target detection structure specifically designed for small target detection.In the backbone network, the feature information of small targets is consistently lost as the network deepens due to multiple downsampling operations.Therefore, a sizing mechanism is introduced at the level of larger feature maps, serving as shallow features for small target detection.Shallow feature maps with smaller spatial receptive fields, which contain more edge information and have a stronger ability to represent geometric details, can significantly enhance the detection of small targets [25].To reduce network complexity and computation while obtaining more effective information about small targets, this structure initiates channel fusion at the top feature extraction layer and removes the last bottom feature detection layer.The structure of the MCS-YOLOv8's backbone network for feature extraction and small target detection is illustrated in Fig. 6.
The MCS-YOLOv8s network modifies the original YOLOv8s network by removing the P 5 feature detection layer.It initiates feature layer channel fusion starting from the C 2 feature extraction layer, and the specific fusion operation can be expressed as follows.
(7) where + denotes the superposition of feature map channels with the same length and width dimensions.↑ 2× represents the double upsampling operation, and ↓ 2× represents the double downsampling operation.P 2 , P 3 , and P 4 are the output feature maps obtained from the enhanced PANet network.The algorithm presented in this paper conducts wearing state recognition of the insulating gloves based on the P 2 , P 3 , and P 4 feature maps.
Differing from the PANet structure used in the neck network of the YOLOv8s algorithm, the MCS-YOLOv8s algorithm removes the last bottom small-scale feature map detection layer (P 5 ), introduces a shallow large-scale feature map detection layer (P 2 ), and conducts target detection on three scales of feature maps (P 2 , P 3 , and P 4 ).P 2 , P 3 , and P 4 feature maps are obtained by aggregating spatial features from the C 2 , C 3 , C 4 , and C 5 feature maps.Combined multichannel detection information is utilized to recognize the wearing condition of insulating gloves.Figure 7 illustrates the improvements made to the neck network.

Data augmentation strategy
The image samples of grid personnel wearing insulating gloves are limited.To address this limitation and enhance the diversity of input images, data augmentation techniques can be employed.This helps avoid the overfitting phenomenon in the convolutional network, ensuring that the trained model exhibits improved robustness and generalization ability [26,27].The original YOLOv8s algorithm employs seven data augmentation methods: random hue, saturation, value, translation, rotation, scale, and mosaic data enhancement.Mosaic data augmentation is a technique that entails randomly selecting four images from a training batch and then applying random cropping, scaling, and rotating operations to them.The resulting images are subsequently stitched together into a single image, thereby expanding the dataset and increasing the number of small targets in the sample.The six remaining data augmentation methods are traditional data augmentation techniques that apply (8) Fig. 6 The backbone and neck structure of MCS-YOLOv8s geometric and color transformations to the images.These methods exhibit limited effectiveness in enhancing the performance of detecting the wearing status of insulating gloves for power grid personnel.In this paper, Mixup is employed, a data augmentation technique based on image mixing, to enhance the generalization ability of the network.This is achieved by augmenting the background complexity of the image.Mixup data enhancement is a regularization technique that randomly mixes two training image pixels to create a single image with two input labels.
The Mixup data enhancement method is randomly selecting two images from a training batch and obtaining their corresponding time series data, denoted as x i , y i and x j , y j , where i ≠ j.Here, x i and x j represent the two images, while y i and y j repre- sent their respective label information.Finally, the mixed image is obtained through a calculation process.The resulting image and label are represented as X and Y, respectively, and the calculation process is described in (10) and (11).
where λ falls within the range of [0, 1] and follows the beta (α, α) distribution.α is the hyper-parameter utilized to control the interpolation strength between feature targets.As α → 0, the Mixup data enhancement effect is close to failing.
The MCS-YOLOv8s algorithm employs eight data enhancement methods to enhance the mixing of contextual information, increase dataset diversity and complexity, and improve overall performance.(

Experimental dataset
To verify the effectiveness of the proposed network model for insulating gloves detection, we conduct experiments using images taken at a power grid maintenance site.The dataset comprises 4947 images, randomly divided into training, validation, and test sets with a ratio of 8:1:2.Table 1 provides the number of images in each set.
The dataset comprises real-time images from a power grid maintenance site, captured under complex shooting conditions.The targets are small in scale and frequently occluded.This dataset is representative and useful for training models and enhancing detection performance.Labelimg is employed for labeling the images, with categories divided into "glove" for correctly wearing insulating gloves and "wrongglove" for wearing wrong gloves or not wearing gloves.Figure 8 displays typical images for each category.Figure 8a and b depicts grid personnel correctly wearing insulating gloves.In contrast, Fig. 8c shows grid personnel wearing noninsulating gloves, and Fig. 8d shows grid personnel without gloves.

Experimental environment
To ensure efficient training and testing of the improved YOLOv8s model, we utilize the hardware environment configuration presented in Table 2.A deep learning environment is established using PyCharm 2022, PyTorch 1.7, and Python 3.8 on a Windows 10 operating system.The batch size is set to 16, and the number of training epochs is 200, with both the initial and termination learning rates set to 0.01.

Evaluation metrics
To compare the detection performance of the MCS-YOLOv8s algorithm with other models, this paper employs four evaluation metrics: F1 score, mean average precision (mAP), detection speed (V), and weight size (Ws).V represents the number of images processed per second, measured in frames per second (FPS).The F1 score is calculated as the weighted harmonic mean of the precision (P) and the recall (R).
The value of mAP is typically calculated at an intersection over union (IoU) of 0.5 between the prediction frame and the true frame.

Ablation experiments results and discussion
The ablation experiments involve analyzing the performance of each model by comparing the original and improved models using the same dataset.Three improvements are made to the YOLOv8s model: Mixup data enhancement, the addition of a CA module to the backbone network, and the implementation of a designed small target detection structure for the neck network.YOLOv8s is the original model, and YOLOv8s + Small, YOLOv8s + CA, and YOLOv8s + Mixup are the three improved models.Performance comparison experiments are conducted among these models.
Tables 3 and 4 respectively display the F1 score and mAP performance of the model for each detection category.The table includes the following information: "glove" indicates the proper wearing of insulating gloves by grid workers, and "wrongglove" (12)   indicates improper wearing of insulating gloves or lack of use of gloves by grid workers.mAP represents the mean average precision of all categories, and the F1 score is computed based on the P and R values of all categories.Upon comparing the data in the two tables, it is evident that the improved models YOLOv8s + Small, YOLOv8s + CA, and YOLOv8s + Mixup exhibit higher F1 score and mAP than the original model YOLOv8s.
Notably, the most significant performance improvement is in detecting the incorrect wearing of insulating gloves.The effectiveness of the three improved methods proposed for recognizing the wearing status of insulating gloves is verified.
In ablation experiments, the following three models YOLOv8s + Small + CA, YOLOv8s + Small + Mixup, and YOLOv8s + CA + Mixup are obtained by fusing the above three improved methods two by two.Tables 5 and 6 demonstrate that the same experiments are conducted using the same dataset to test the model's performance, with the YOLOv8s + Small + CA model exhibiting the best performance.The model's detection performance has significantly improved with the utilization of the small and CA modules.In addition, the other two models outperform the original model.
The MCS-YOLOv8s algorithm is derived from the YOLOv8s base model by integrating three improvement points: Mixup data enhancement, CA module, and small target detection structure.To evaluate the performance of the MCS-YOLOv8s algorithm, it is compared with the original YOLOv8s model.Table 7 shows the performance of the two models in terms of F1 score, mAP, Ws, and detection speed V.
The results display that the MCS-YOLOv8s model achieved a maximum mAP of 0.912, which is 2.8% higher than the original YOLOv8s model's mAP of 0.884.The F1 score also improved by 1.6%, from the original 0.874 to 0.89.At this point, the model   recognition performance is at its highest level, and the weight file size is reduced to 15.7 MB.Although the detection speed of the MCS-YOLOv8s model is reduced, achieving a speed of 87 FPS is still a high value for fast detection.It can be seen that the MCS-YOLOv8s model is better than the original YOLOv8s model in all other performances under the guarantee of real-time detection, which is more conducive to the realization of the deployment of the mobile terminal, and accurately detects the status of insulating gloves worn by power grid workers in the actual engineering to ensure the safe and stable operation of the power grid.

Comparison with other algorithms
To further verify the detection performance of the MCS-YOLOv8s model, we compare the MCS-YOLOv8s model with six mainstream target detection models using the same dataset, hardware, software, and experimental parameters.The results of the comparison experiments are shown in Table 8.Table 8 shows that the Mobilenet-SSD algorithm has the lowest detection speed and mAP value, with the model weights file memory being 91.1 MB.The YOLOv3-tiny and YOLOv4-tiny models, lightweight models in the traditional YOLO series, respectively exhibit mAP values of 0.742 and 0.753.The latest lightweight model in the YOLO series is YOLOv7-tiny, boasting an impressive image processing speed of 112 FPS and a weight file memory of only 12.3 MB.However, its recognition accuracy remains relatively low.The s model of YOLOv5 has demonstrated a significant improvement in mAP value, reaching 0.873, compared to YOLO's tiny series of models.The YOLOX-S model achieves a mAP value of 0.875, slightly lower than the 0.884 achieved by the YOLOv8s algorithm in the YOLO series.However, unlike the tiny series and YOLOv5s, the weight file of the YOLOX-S model has a larger memory of 44.3 MB.The algorithm presented in this paper, MCS-YOLOv8s, improves recognition accuracy by 3.9% compared to YOLOv5s and 3.7% compared to YOLOX-S.Additionally, it has a weight memory of only 15.7 MB and processes images at a speed of 87 FPS, surpassing YOLOX-S.

Visualization of model detection results and discussion
The MCS-YOLOv8s model is employed to detect images in the test set, and the detection results are displayed in Fig. 9. Figure 9a displays an image with severe target occlusion, where incomplete target feature information in these images causes significant interference with recognition.Many images suffer from this issue, and resolving it is crucial to enhancing the effectiveness of model detection.The model accurately detects heavily occluded targets, as demonstrated by the results.Figure 9b displays an image of fuzzy targets with inadequate texture features, which may lead to missed detection.However, as shown in the actual detection results graph, this type of target can still be effectively detected.Figure 9c displays the image for detecting small targets.As the number of feature layers increases, the feature information for small targets is lost.To enhance the accuracy of small target detection, the neck network is improved, and the small target detection structure is utilized.Figure 9d shows an image taken on a cloudy day with poor outdoor lighting.In low-light environments, the quality of captured images decreases, making detection more difficult.However, the improved model with the added CA module can still accurately recognize such targets due to its powerful feature extraction network.Figure 9e and f displays class activation maps of power plant personnel wearing insulating gloves, while Fig. 9g and h shows class activation maps without wearing insulating gloves.From the four class activation maps, it can be seen Fig. 9 of detection effect that the detection targets of the MCS-YOLOv8s model are more focused on the hand region, and the highly responsive regions are concentrated in the parts that are most helpful in making category judgments for correct classification.The experimental results show that the MCS-YOLOv8s model can effectively detect the insulating gloves wearing condition of power grid workers, which is of great significance in ensuring the safety of power grid workers and the safe and stable operation of the power grid.

Conclusion
The proposed MCS-YOLOv8s model is implemented to recognize the wearing status of insulating gloves for power grid personnel working with electricity in this paper.The Mixup data enhancement method enhances the dataset diversity without increasing computational effort, improving the model's detection performance and generalization.Introducing the CA module in backbone enhances the attention and extraction of effective feature information.The design of a new structure for detecting small targets in the neck improves the acquisition of small target feature information, resulting in better classification and localization of targets.The experimental results indicate that the MCS-YOLOv8s model achieves a mAP value of 91.2% on the test set, showcasing a 2.8% improvement in detection performance compared to the YOLOv8s base model.
The MCS-YOLOv8s model has a final weight file memory occupation of only 15.7MB and can process images at a speed of up to 87 FPS.Despite this decrease in speed, it still maintains a high value, making it suitable for real-time detection requirements.Additionally, the model has low hardware requirements, such as CPU, which renders it feasible for deployment on embedded devices.
The MCS-YOLOv8s algorithm model is horizontally compared with mainstream target detection algorithms such as YOLOv8s, YOLOX-S, YOLOv7-tiny, and YOLOv5s.The experimental results indicate that the MCS-YOLOv8s model exhibits faster detection speed, higher detection accuracy, and a smaller memory occupation of weight files.These findings confirm the advancement of the MCS-YOLOv8s algorithm.
In future work, the model will be trained using a more diverse and extensive dataset.Furthermore, we will explore additional optimizations for the current model, enhancing its lightweight and detection performance.

Fig. 3
Fig. 3 Schematic diagram of the CA module

Fig. 4
Fig. 4 Introduction of the CA module

Figure 5
shows the backbone feature extraction and neck structure of YOLOv8s.The input feature map is denoted by C i (i = 0, 1, 2, 3, 4, 5), where i is the feature extraction level.The output feature map is denoted by P n (n = 3, 4, 5), where n is the feature output level.The neck network of YOLOv8s adopts the PANet structure, a feature fusion of strong semantic information from the top layer and robust localization information from the bottom layer, better retaining features and information related to small targets.

Fig. 5
Fig.5 The backbone and neck structure of YOLOv8s

Fig. 7
Fig. 7 Comparison of neck network improvement

Table 1
The number of experimental datasets Fig. 8 images of various categories

Table 2
Hardware environment

Table 3
F1 score of different models

Table 4
mAP of different models

Table 5
F1 score of different models

Table 6
mAP of different models

Table 7
Comprehensive performance of different models

Table 8
Performance comparison with other algorithms