Train rolling stock video segmentation and classification for bogie part inspection automation: a deep learning approach

Train rolling stock examination (TRSE) is a physical procedure for inspecting the bogie parts during transit at a little over 30 kmph. Currently, this process is manually performed across many railway networks across the world. This work proposes to automate the process of TRSE using artificial intelligence techniques. The previous works have proposed active contour-based models for the segmentation of bogie parts. Though accurate, the models require manual intervention and are found to be iterative making them unsuitable for real-time operations. In this work, we propose a segmentation model followed by a deep learning classifier that can accurately increase the deployability of such systems in real time. We apply the UNet model for the segmentation of bogie parts which are further classified using an attention-based convolutional neural network (CNN) classifier. In this work, we propose a shape deformable attention model to identify shape variations occurring in the video sequence due to viewpoint changes during the train movement. The TRSNet is trained and tested on the high-speed train bogie videos captured across four different trains. The results of the experimentation have been shown to improve the recognition accuracy of the proposed system by 6% over the state-of-the-art classifiers previously developed for TRSE.

TRSE can be considered as an operational maintenance service to examine the bogie parts on which the train moves. This examination is performed when the train is moving at just over 30 kmph. The Indian Railways operational manual (https://rdso.indianrailways.gov.in/works/uploads/File/Draft%20Handbook%20on%20Integrated%20Rolling%20Stock%20Depot.pdf) on TRSE provides details of the various checklists on the performance of the bogie parts that are visually observable during the process. A few instances from the manual are as follows: axle box leaks, suspension movements, hanging parts, brake shoe functionality, breakages in parts, missing screws, and flat tires. The entire operation is performed by a three-man crew, one on each side of the train and the other one recording the defects. The defects are then classified into immediate action-required maintenance jobs and pit-stop maintenance jobs.
The pit-stop maintenance jobs can be executed during the overall maintenance of the train at the designated destination in a railway maintenance yard. In contrast, the immediate jobs are handled when the train halts at the following station. TRSE is a well-proven procedure that has been in practice for 100 years. However, failures of this process have also caused accidents and loss of life, bringing a great financial burden on railway operations. The biggest reason for failure has been identified as manual monitoring without any mechanized support system. We are the first in the subcontinent to provide a visual support system for TRSE to assist the monitoring engineers [1][2][3][4][5][6][7].
Our previous models focused on bogie part shape extraction from the videos of the train undercarriage. The works focused on developing active contour models with multiple shape priors for knowledge-based inferences. These models did a great job of preserving the shape of the segmented bogie parts. Despite their success in bogie part segmentation, active contours have an underlying computational complexity when it comes to high-resolution video data [8,9]. The bottleneck in applying knowledge-based active contours is attributed to their iterative nature. These are energy-based models that propagate a differential contour by optimizing the contour regions with respect to the divergence of the image. Hence, real-time implementation of these methods has been next to impossible.
Deep learning approaches have been shown to be deployable and are being used for many video object recognition [10][11][12][13] and analytics [14][15][16] applications. Our previous model used a modified Yolo V2 architecture for the bogie part identification process on video data [17]. However, the challenge was to annotate the maximum possible part variations across different training video datasets. Moreover, the Yolo V2 architecture was modified to make the attention mechanism more stringent on the bogie parts in order to detect their presence across the entire video sequence. This made the training load heavy and the computation cumbersome during training, even though testing was simpler. Despite its good performance, it did not meet two major objectives of TRSE: (1) the bogie part shape extraction or descriptor for determining the health of the parts during transit and (2) the deformation in the bogie parts in consecutive video frames as the train moves horizontally with respect to the camera angle.
The above two objectives, along with the core objective of segmenting bogie parts and recognizing them with high accuracy, are addressed by the proposed deep learning framework. Our proposed deep network has two modules: (1) the segmentation module and (2) the attention-based classifier. The traditional object detection deep learning models apply annotations on all the bogie video frames to supervise the recognition process with good accuracies. Moreover, these methods do not detect the structure or shape similar to the bogie parts for maintaining structural durability during transit. Consequently, the segmentation results at multiple stages were applied as attention to the classifier for achieving higher resolution rates. The attention mechanism is adapted from the transformers used in speech recognition applications.
Specifically, this part of the section presents the past research on technological developments on the road to automate TRSE. The entire section has three main ingredients. The first one is related to methods developed in general towards the solution of rolling stock examination. Secondly, the computer vision-based models were applied to video data for the bogie shape segmentation process. Thirdly, the advancement of deep learning approaches and their applications in object detection.

TRSE technology-based approaches
Indian Railways (IR) is the largest rail network on the planet, carrying 10 million passengers every day. The primary challenge is to keep the trains away from accidents. Consequently, it is the job of railway maintenance engineers and researchers to develop new technologies to assist human resources. Currently, advanced technologies are being applied for efficient locomotive production [17], high reliability coach design [18], electronic signaling systems [19], Global Positioning System (GPS)-based train tracking [20], and ultrasound reflectors for track anomaly detection [21]. Apart from the above technologies for uninterrupted operations, an important aspect of train and passenger safety is rolling stock examination [22]. TRSE is performed manually across all rail companies by three people, typically using a rolling pit. The success rate of manual TRSE was found to be around 99% for an entire decade of operations in Indian Railways (https://rdso.indianrailways.gov.in/works/uploads/File/Draft%20Handbook%20on%20Integrated%20Rolling%20Stock%20Depot.pdf). However, that 1% failure rate has cost the economy millions of dollars and has cost thousands of lives. That is why automating TRSE is an increasingly important problem. The objective of this work is to transform the manual rolling stock examination into an automated or semi-automated system to assist railway engineers. Indian Railways is testing a prototype model named KRATES (Konkan Railways Automated Train Examination System) (https://konkanrailway.com/uploads/editor_images/1551089341_ATES%20web%2022022019.pdf). The system has many sensors to measure temperature, pressure, acceleration, and brake shoe functionality, as well as a camera. The purpose of the camera in the KRATES module is remote visual monitoring rather than real-time prediction on the bogie parts.
According to the Indian Railways rolling stock manual, a set of primary checks is needed for a train to be considered fit to reach the next destination without any accidents. The checked parameters are all visually observable and are compared by the human inspectors against what they learned in training. This results in documentary evidence that provides an insight into the behavior of the bogie parts during transit. The objective of this work is to transform the above visually identifiable problems into computer vision-based models for automated TRSE. Currently, most rail network companies operate manually due to the unavailability of technology or research resources for finding a commercially viable solution. However, video-based bogie part retrieval models have been developed in the past with considerable research impact in the field of computer vision.

Computer vision algorithms for TRSE
The work in [23] was an early step in uncovering the deeper connection between transport automation technologies and their ability to prevent accidents. Inspired by this, some of our previous works have been built on the basis of computer vision. Most of the research works on train safety are different from ours as they are focused on rails and ballast monitoring using computer vision. The work in [23] shows the first use of a camera to record bogie parts and extract them using image mosaicking. 3D imaging models were designed and developed for inspecting train tires and the surrounding ballast using simple 3D correlations [24]. However, the dual cameras were not placed on the ground but were mounted on the train. The movement of the train captures the rails and the ballast with predefined displacements. The parallel projection model along with the 3D digital image correlations identifies anomalies in tracks and ballast. However, the biggest drawback is that train movement at high speeds blurs the video data, leading to faulty measurements. Another dual-camera model with multimodal recordings in RGB and infrared frequencies has been applied for the bogie part identification process [25]. The method developed uses a panoramic viewing model to compare RGB and infrared images to locate bogie parts. The infrared camera is used to identify heating bogie parts such as the axle box, brake shoes, joints, and high friction contacts. The two cameras were used to detect hot and cold parts simultaneously. However, motion blurring in RGB video data and high ambient temperatures make the detection process ambiguous. The next work shifts to a pit hole camera system placed inside a trench dug under the tracks to capture brake shoes [26]. The keyframes with brake panels are extracted, and curve fitting models were applied to segment the brake portions and identify defects.
Even though the results were promising in the brake shoe functionality identification process, the actual implementation of the project poses a bottleneck both commercially and structurally. Currently, the TGV of France and the bullet trains of Japan use train-mounted cameras to monitor tracks and ballast. The monitoring is executed manually, and no processing algorithms were reported in the published patent [27]. This is due to the fact that the video sequences captured from cameras mounted on high-speed trains are subjected to unpredictable vibrations which generate noisy video data for processing. Interestingly, a camera sensor on the ground has been shown to capture more effective train bogie video data for monitoring than a system mounted on the train. One such system was developed with lights and antiglare techniques for capturing high contrast video frames of rolling stock [28]. This work has been the basis for automating TRSE. However, it does not describe any algorithms for bogie part identification. Another work that has drawn parallels with the above has demonstrated the use of focus lights on the undercarriage to video capture the bogies [29]. Additionally, this work applies basic image processing models to extract the edges of bogie parts in order to identify them. However, the techniques described were not able to represent the overlapping boundaries of bogie parts in the video frames. Moreover, the blurring induced by the moving train has made the edge detection process difficult for part identification. Recently, 3D modelling of contact bogie parts and wheel surfaces has been shown to achieve good results for the detection of defects [30,31]. The biggest problem with 3D modelled image data is its heavy graphics processing requirement, which makes these techniques incompatible with real-time processing.
The two biggest drawbacks of the above models were their inability to segment bogie parts effectively and the noisy video data caused by recording train movement at a 30 frames per second shutter speed. These two bottlenecks were efficiently handled by our previous models for TRSE [1]. To fight blurring, the recording is done using a high-speed wide-angle sports action visual sensor at 240 fps, which yields exceptionally high-quality bogie frames. Secondly, the segmentation problems were addressed using active contour (AC) models with shape prior knowledge of the bogie parts [2,3]. These shape-based active contours with local information [5] have presented a 99% accuracy in preserving the extracted bogie part shapes from the output of the models. Moreover, the work in [4] shows an upgraded touching boundary segmentation algorithm for collectively extracting bogie parts from the video frames. This model has generated interest due to the fact that the bogie parts are indeed overlapping as they support each other to tightly hold the entire structure as a single unit. The above AC-based models have performed well in segmenting the bogie parts effectively. Despite their success in bogie part segmentation, the AC models are iterative and are not suitable for real-time implementation of TRSE. Apart from the above, the TRSE automation algorithms lack the adaptability, scalability, and reliability needed to transform the results into real-time production models. Consequently, these gaps in current research methodologies have motivated us to pursue real-time implementable models for automating TRSE. Hence, deep learning approaches were leveraged to build and deploy automated TRSE systems for generating actionable intelligence for assisting rail companies.

Deep learning approaches for automating TRSE
Deep learning (DL) approaches have been in practical use since 2012 with the success of AlexNet in the ILSVRC ImageNet challenge [32]. AlexNet was then trained on 3 GB GPUs from Nvidia using parallel processing. After that, many highly accurate and reliable models have outperformed AlexNet. They are Inception V1 [33], VGG-16 [34], ResNet [35], SENet [36], and PNASNet [37]. These models have been shown to achieve a very high rate of accuracy in image classification tasks over the years. These base models were updated to detect objects in images and video sequences, and one such model that has performed consistently on multiple test sets is the Yolo (You Only Look Once) architecture [38]. In our previous work [5], the second version of Yolo is modified for the extraction of bogie parts on the video sequences. The model was able to detect most of the bogie objects except for the places where the part deformation is more than 50% of the actual trained part. The biggest challenge in implementation is attributed to the annotations of bogie parts from video sequences along with the bounding box information on which the Yolo model is trained. Though the model has recorded an 84.98% accuracy in correctly identifying the bogie part on a moving train video sequence, it failed to identify bogie parts with high confidence scores for the slightest deformation in the objects occurring due to viewpoint variations. Moreover, to compensate for the object deformations, the model has been trained on a large set of frames in the video sequence. Hence, it becomes extremely important to learn the object deformations for the segmentation process. In deep learning, the segmentation process has been applied through an architecture broadly called the hourglass model [39]. Later upgrades with minor modifications have reported improved segmentation results, though their basic structure matches the hourglass model.
The most popular and powerful variants of the hourglass are UNet [40], VNet [41], SegNet [42], and Auto Encoders [43]. The backbone network architectures in these segmentation modules can be any of the state-of-the-art network architectures such as VGG-16, ResNet-34, and Inception Net. Once the segmentation processes are learned by the network using a very small dataset of bogie parts, the next step is classification. Generally, the segmented output is fed along with the original video frame into the classifier for recognition. The RGB input is multiplied with the segmented bogie parts and passed to the classification module designed using the standard networks similar to that of the backbone segmentation network [44,45]. Unfortunately, multiplicative attention requires the user to segment all the bogie parts in all the video frames for maximally correct classification. Instead of performing the traditional multiplicative fusion between the RGB video frames and the segmented objects, this work offers a solution adapted from the natural language processing model called multihead attention [46]. Similar methods were proposed in the automation of construction durability testing such as identifying pavement cracks using capsule net segmentation [47] and PCGANs [48].
Finally, the proposed model brings a novel methodology for real-time implementable automated TRSE powered by computer vision, artificial intelligence, and video analytics. The next section focuses on developing a detailed elaboration of the methods applied for automating TRSE with deep learning.

Methods
A highly accurate and widely accepted attention model in speech processing is the multi-head attention model. The multi-head attention model is capable of providing attention to a particular set of words during training. Similarly, the moving train induces motion artifacts such as bogie part shape deformation due to viewpoint variations with respect to the fixed camera position. Moreover, the camera outputs 240 fps video sequences with a lens angle of 54°. The method proposed in this work has a segmentation network followed by a multi-head attention-based bogie parts classifier. The entire model is called the deep bogie part inspector (DBPI), which has a segmentation module in the front end and an attention-based classifier model in the back end. The primary network in the DBPI model is built on the UNet architecture backbone. Subsequently, the secondary network is a classifier model built on a multi-head attention network. In our proposed multi-head attention network, we have three streams that are fed with three consecutive segmented bogie parts in keyframes. The proposed method has been shown to improve the recognition accuracies by 6% over the existing object detection deep networks.
The proposed work has three core objectives: (1) to segment the bogie parts from key video frames of train undercarriage by learning from deformations in shapes and spatial locations, (2) to apply the segmented bogie parts to find an attention matrix that will contribute to the faster identification of bogie parts in continuous video sequences, and (3) to design a multi head attention-based architecture for the classification of bogie parts irrespective of their shape and spatial location in the entire video sequence.
Finally, the proposed model will also give a bogie part assert score (BPAS) that can help the human TRSE inspector to make decisions for timely maintenance and thereby increasing passenger safety. Figure 1 shows the video frames of the bogies on Indian Railway coaches.
Our proposed deep bogie part inspector (DBPI) for TRSE is different from the existing models in three different ways: (1) multi head attention network in the classifier will learn from a minimalistic dataset making the training process faster, (2) it offers higher bogie part classification accuracies across the entire range of video sequences, and (3) the model generates actionable intelligence for the maintenance engineers to predict the durability of the bogie parts during the train running cycle.
The deep bogie part inspector is an ensemble of two learning networks as stated in the introduction. The architecture of the proposed DBPI is shown in Fig. 2. The first network is based on UNet architecture to segment the bogie parts from the video sequences. Followed by the UNet is the classification network that identifies a bogie part and checks its durability for the onward train journey towards the destination. Specifically, the classifier is built on multi head attention mechanism where the segmented output of each part is applied to determine the attention of the part in the video sequences. Furthermore, the  matching networks were designed to establish a semantic correspondence between the parts and the original video frames. This process allows the network to match the correct position of the bogie part from multiple frames resulting in the correct match from a few sampled segments. This has enabled the network to learn from a few frames of segmented bogie parts rather than the bogie parts in all the frames. Finally, the extracted features are concatenated locally first and then globally before being learned by the fully connected neural nets. The last layer is a Softmax that predicts the correct bogie part from the input video frames.

The bogie segmentation module -the B-UNet
The bogie-U-shaped convolutional neural network (B-UNet) is a segmentation module for separating the bogie parts individually from the video frames, as shown in Fig. 2. Given a complete set of bogie video frames V(x, y, t), where t is the frame number, the objective of B-UNet is to segment the bogie parts $S_k^b(x, y, k)$, where k is the pointer to the key frames. The key frames are important because the video is captured with a 240 fps shutter speed, so the number of frames within a second is equal to 240. The change across 240 frames is barely noticeable by the artificial visual sensor, and hence, key frames are extracted. Since all the bogie video frames have similar pixel densities, the feature-based key frame extraction models using histogram of oriented gradients (HOG) features with K-means clustering had little impact on the outcome. However, the entropy-based method [49] has shown a good deal of variation in pixels across the video frames. The entropy Ε is computed for every frame f, and plotting Ε against f gives a 2D entropy space. The local maximum and minimum points of this 2D entropy space are extracted as local extreme points, and the frames at these points are taken as the key frames. The resulting key frames of bogies are given by V(x, y, 3, k), where k = 1, 2, …, K and (x, y, 3) is the pixel locations in 3 dimensions. Now, we redefine the problem of segmentation as follows: given a set of key bogie frames V(x, y, 3, k) ∀ k = 1 to K, design a UNet model to learn the bogie part binary models for segmentation $S_k^b(x, y, k)$. The parameter b gives the number of bogie parts or the bogie part index. The architecture of the UNet has been adapted from [47]. The model takes an input frame size equal to 240 × 424 × 3. The network in Fig. 3 has 8 convolution layers with 16, 32, 64, and 128 filters per two consecutive layers in both the compressive and expansive paths, as shown in the UNet part of Fig. 2.
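The entropy-based key frame selection described above can be sketched as follows. This is a minimal illustration and not the implementation of [49]: it assumes the frame entropy is the Shannon entropy of the grey-level histogram, that frames are given as nested lists of integer grey levels, and that key frames are the strict local maxima and minima of the entropy sequence. The function names are hypothetical.

```python
import math
from collections import Counter


def frame_entropy(frame):
    """Shannon entropy (in bits) of the grey-level histogram of one frame.

    Assumption: `frame` is a list of rows, each row a list of integer grey levels.
    """
    pixels = [p for row in frame for p in row]
    total = len(pixels)
    counts = Counter(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def key_frame_indices(entropies):
    """Indices of strict local maxima and minima of the entropy sequence."""
    keys = []
    for i in range(1, len(entropies) - 1):
        prev, cur, nxt = entropies[i - 1], entropies[i], entropies[i + 1]
        is_max = cur > prev and cur > nxt
        is_min = cur < prev and cur < nxt
        if is_max or is_min:
            keys.append(i)
    return keys


def select_key_frames(frames):
    """Return the frames whose entropy is a local extreme point."""
    entropies = [frame_entropy(f) for f in frames]
    return [frames[i] for i in key_frame_indices(entropies)]
```

In practice the entropy sequence of a 240 fps video would likely need smoothing before the extreme points are picked, otherwise sensor noise produces many spurious key frames; that step is omitted here.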
All layers have 3 × 3 unpadded convolutions with ReLU activation functions followed by a maximum pooling of 2 × 2, which halves the frame resolution passed to the forward layers. Subsequent down-sampling steps double the number of filters in both arms of the UNet. There are no fully connected layers at the end of the downsampler block. Subsequently, upsampler blocks add pixels in 2 × 2 up-convolutions that cut down the number of filter channels to half of the corresponding counterpart in the downsampling level. This results in the loss of feature channels during upsampling, making it difficult to generate a segmentation mask with these small feature maps. This loss in feature maps is compensated with the help of skip connections that append the cropped feature maps from the corresponding compressor blocks after the up-convolutions in the expander blocks at the same level. The cropping of the encoder or compressor block features is necessary to ensure uniform dimensionality for concatenation with the expander feature maps. These concatenated feature maps are further learned using two 3 × 3 convolutional layers followed by a nonlinear ReLU. Finally, a 1 × 1 convolutional layer maps each of the 16-component feature vectors into the required classes. Our B-UNet has only 16 components against the 64 in the original UNet architecture. This is because the segmentation stroke of the bogie part in the entire video frame is small compared to the entire spatial resolution of the frame itself. After multiple experiments, the 16-channel filter was found to be well suited for bogie part segmentation and is computationally faster than the traditional UNet model. The bogie segmentation has a large background region when compared to the spatial occupancy of the part in the video frame. As a result, the background dominates the loss during training, and the training process frequently falls into a local minimum.
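To make the effect of the unpadded convolutions concrete, the following sketch traces the feature-map sizes through the compressive path for the 240 × 424 input. It is an illustration under stated assumptions: two 3 × 3 unpadded convolutions per level, a 2 × 2 max pooling (with floor division) between levels, and the filter counts 16, 32, 64, and 128 quoted in the text. The function name is hypothetical, and the exact layer arrangement of Fig. 3 may differ.

```python
def encoder_shapes(height, width, filters=(16, 32, 64, 128)):
    """Trace (filters, height, width) at the end of each encoder level.

    Assumptions: each level applies two unpadded 3x3 convolutions, each of
    which removes 2 pixels per spatial dimension, and levels are separated
    by a 2x2 max pooling that halves the resolution (floor division).
    """
    shapes = []
    h, w = height, width
    for level, f in enumerate(filters):
        if level > 0:
            h, w = h // 2, w // 2  # 2x2 max pooling
        h, w = h - 2, w - 2        # first unpadded 3x3 convolution
        h, w = h - 2, w - 2        # second unpadded 3x3 convolution
        shapes.append((f, h, w))
    return shapes


for f, h, w in encoder_shapes(240, 424):
    print(f"{f:>3} filters: {h} x {w}")
```

Because every unpadded convolution shrinks the map, an encoder feature map is always larger than the up-convolved decoder map at the same level, which is why the skip connections have to crop before concatenation.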
Hence, we propose to apply the solution in [62], which addresses the foreground-background pixel imbalance in the rolling stock video frames. We applied two traditional loss functions during the training process: focal loss (FL), which is a type of binary cross-entropy (BCE), and dice loss (DL). In the focal loss, $G_T$ is the ground truth in the pixel range {0, 1}, and p ∈ [0, 1] is the probability of foreground and background predicted by the model. The classes $\{n_{p_0}, n_{p_1}\}$ represent the background with value 0 and the foreground with value 1. The values of α ∈ (0, 1] and γ ∈ [0, 5] are adjustable hyperparameters. For B-UNet, we selected α = 0.5 and γ = 1 across all datasets. The second loss used was dice loss (DL), which is regularly used in segmentation problems with deep learning models. Dice loss solves the problem of imbalance between foreground and background pixels using the segmentation evaluation index between the predicted segmentation mask and the ground truth annotated masks. In the DL formulation, the parameter δ ∈ [0, 1] is a preventive measure to avoid divide-by-zero instances during training. The two losses were used simultaneously for backpropagating through the network for weight modifications. The combined loss is taken as an average over the entire pixel range. The B-UNet segmentation network is trained on K key frames to extract $b_p$ bogie parts from B ∈ [1, b] bogies and P ∈ [1, p] parts. Testing is initiated on sequences of bogie parts from different trains that were not previously seen by the B-UNet. The obtained parts are then applied as inputs to the classifier to identify the bogie part correctly and provide the necessary analysis.
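The two losses can be illustrated with the following sketch, which uses the commonly cited forms of focal loss and dice loss rather than the exact expressions of this work. The assumptions are that predictions and ground truth are flattened to lists of per-pixel values, that the focal loss is averaged over pixels, and that the combination is a plain mean of the two losses; the function names and the clipping constant `eps` are hypothetical.

```python
import math


def focal_loss(p, g, alpha=0.5, gamma=1.0, eps=1e-7):
    """Mean focal loss over pixels (standard form, assumed here).

    p: predicted foreground probabilities in [0, 1]
    g: ground-truth labels in {0, 1}
    """
    total = 0.0
    for pi, gi in zip(p, g):
        pi = min(max(pi, eps), 1.0 - eps)         # avoid log(0)
        pt = pi if gi == 1 else 1.0 - pi          # probability of the true class
        at = alpha if gi == 1 else 1.0 - alpha    # class weighting
        total += -at * (1.0 - pt) ** gamma * math.log(pt)
    return total / len(p)


def dice_loss(p, g, delta=1.0):
    """Dice loss with smoothing term delta to avoid division by zero."""
    intersection = sum(pi * gi for pi, gi in zip(p, g))
    return 1.0 - (2.0 * intersection + delta) / (sum(p) + sum(g) + delta)


def combined_loss(p, g, alpha=0.5, gamma=1.0, delta=1.0):
    """Assumed combination: simple mean of focal loss and dice loss."""
    return 0.5 * (focal_loss(p, g, alpha, gamma) + dice_loss(p, g, delta))
```

With α = 0.5 and γ = 1, the values selected for B-UNet, a confidently correct pixel contributes almost nothing to the focal term, so the many easy background pixels no longer dominate the gradient.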

B-UNet implementation
The original frame size from the high-speed camera sensor was 1280 × 1918 at 240 fps. The sensor records 240 frames per second, so a 1-min video has around 240 × 60 = 14,400 frames. Our dataset consists of passenger trains from the Indian subcontinent, which have an average of 20 coaches per train. The camera sensor recorded each train for around 1.05 to 1.42 min on average. All the above values are computed based on the video contents in our dataset. The average number of frames in each training class was found to be around 15,456 frames per train. Using the entropy-based formulation, the key frame extractor will assemble only frames with maximally occupied bogie parts. The number of frames per bogie is around 0.2% of the total frames, which is 30 frames/bogie. There will be two bogies per coach per side, and for a 20-coach train, there will be 40 bogies. Finally, the training set for bogie part segmentation consists of 30 × 40 = 1200 video frames. From these 1200 training bogie video frames, we train only for 8 bogies with 18 parts. This is because the bogie parts are fairly constant over the entire train, so it is unnecessary to use all bogie frames for training. Using 8 bogies also guarantees good data augmentation for training, alongside other augmentations such as rotation, scaling, zooming, and flipping horizontally and vertically in our model. Finally, the training set has 320 frames in 100 different augmentations per frame. The total dataset for B-UNet will have 32 K video frames and 32 K ground truth labels with around 1778 parts per label. The filter kernels are initialized using a zero-mean Gaussian with unit variance. A batch normalization layer is added after each convolution layer to improve the stability of the process. The hyperparameters in the loss function are selected as discussed in the previous section for all the bogie videos through experimentation.
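The frame counts quoted above can be checked with a few lines of arithmetic. The sketch below only reproduces the numbers stated in the text; the variable names are illustrative, and the count of 320 training frames is taken as given.

```python
FPS = 240                       # sensor frame rate
frames_per_minute = FPS * 60    # 14,400 frames in a 1-min video

coaches_per_train = 20
bogies_per_coach_per_side = 2
bogies_per_train = coaches_per_train * bogies_per_coach_per_side  # 40

frames_per_bogie = 30           # key frames kept per bogie
avg_frames_per_train = 15456
key_frame_share = 100 * frames_per_bogie / avg_frames_per_train   # about 0.2 %

key_frames_per_train = frames_per_bogie * bogies_per_train        # 1200

training_frames = 320           # as stated in the text
augmentations_per_frame = 100
augmented_frames = training_frames * augmentations_per_frame      # 32,000

print(frames_per_minute, key_frames_per_train, augmented_frames)
print(f"key frames per bogie: {key_frame_share:.2f}% of a train video")
```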
The optimizer is Adam with a learning rate of 0.00001 and a momentum factor of 0.02. There is no decay in the learning rate as the error reaches a minimum value. All these methods are unchanged across all datasets and on other models used for comparisons. All the models were implemented on NVIDIA GTX1070i attached to 16GB memory. The epochs hyperparameter is set to 100 for all models. The testing is performed on the full bogie video sequence without key frame extraction for segmenting the bogie parts. The segmented bogie parts are now arranged in chronological triplet order of current frame $f_c$, previous frame $f_{c-1}$, and next frame $f_{c+1}$ for each bogie part. These three groups of segmented video frames form the input to the classifier which is built on the multi head attention model.
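The triplet arrangement of segmented frames can be sketched as a sliding window of width three. This is an assumed reading of the description: each triplet is stored in chronological order as (previous, current, next), and the first and last frames, which lack a neighbour on one side, are not used as the current frame. The function name is hypothetical.

```python
def frame_triplets(segments):
    """Group segmented frames into (f_{c-1}, f_c, f_{c+1}) triplets.

    Assumption: `segments` is the chronologically ordered list of segmented
    frames for one bogie part; boundary frames are skipped as centres.
    """
    return [
        (segments[c - 1], segments[c], segments[c + 1])
        for c in range(1, len(segments) - 1)
    ]


# Each triplet feeds the three attention streams of the classifier.
for previous, current, following in frame_triplets(["s0", "s1", "s2", "s3"]):
    print(previous, current, following)
```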

The bogie classification module: B-MHAC
The B-MHAC (bogie-multi head attention classifier) is a combination of an attention grabber network and the dense network classifier with Softmax activation. The outputs of the 1 × 1 convolutions in the B-UNet are segmented bogie parts that are separated into multiple classes manually. Given the bogie classes $C_b$ with their segmented bogie parts from b = 1 to B at 30 time steps and the raw bogie RGB video frames $V(x, y, 3, t) \in \mathbb{R}^2$, the objective of B-MHAC is to learn the distinct bogie part features using the multi head attention framework as an object placeholder in the video frames. The right side of Fig. 2 has 4 streams of convolutional layers with three of them forming the basis for attention on the raw video frames in the 4th stream. The convolutional layers in each stream will accept an input of size 240 × 424 × 3, which is interpolated to match the segmented outputs to 240 × 424, which will be operated upon by 32 filters of size 3 × 3 with stride 1. These linear layers are nonlinearized with ReLU activations and passed through batch normalization to train the layers more independently.
Given the features from the four streams with sizes $W \times H \times F|_i^l$ in the $l$-th layer at the $i$-th position of the feature matrix, the first goal is to multiply the segments at different time scales with the incoming bogie RGB frame independently in the upper 3 streams of the multi head network. The output of the multiplication is $M_s^l(i) \in \mathbb{R}^2$, where s = 1, 2, 3 is the stream number. In order to obtain the relationship between the original masked object $M_s^l(i)$ and the segmented bogie part features $f_s^l(i)$, we apply a feature matching block as shown in Fig. 4. Here $l$ gives the layers, and $i$ represents feature positions.
Let $f_M \subset M_s^l(i)$ and $f_B \subset f_s^l(i)$ be the features of the query masked object and the support bogie parts, respectively, each of size W × H × C. First, these two features are mapped to the spaces Θ and Φ to obtain $\Theta(f_M)$ and $\Phi(f_B)$, respectively. Subsequently, the 3D matrices are reshaped to WH × C, which are transformed into spatial attention maps using the formulation below: The $w_j(f_B)$ are the features of the bogie parts at the $j$th position in the network. This ensures that the features that are relevant to the query image are retained and those that are irrelevant are discarded. Finally, the output of the matching network $g(f_M, f_B)$ is reshaped to that of the original query features and is concatenated with them by applying a δ weighting rule. The formulation is computed as follows: The δ value is a hyperparameter which is decided based on experimentation and the pixel density of each of the bogie objects. Finally, the integrated features $F_I$ from each of the bogie classes are applied to a two-stage dense network with Softmax activation for classification. Though the process is computationally expensive, it has been shown to recognize the deforming shapes of bogie objects during the movement of the train. Accordingly, we test the performance of the proposed method through experimentation and validation on the train rolling stock dataset.
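To make the data flow of the feature matching block concrete, the sketch below shows one possible realization on features already reshaped to WH × C. It is not the paper's formulation, which is not reproduced in this text. Several assumptions are made for illustration: Θ and Φ are taken as identity maps (in practice they would be learned projections), the attention map is a softmax-normalized dot product between query and support positions, and the δ rule scales the matched features before they are concatenated with the query features along the channel axis. All function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # WH rows (spatial positions) by C columns (channels)


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over one row of similarity scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(theta_fm: Matrix, phi_fb: Matrix) -> Matrix:
    """Assumed form: softmax over dot products of query row i and support row j."""
    return [
        softmax([sum(q * s for q, s in zip(q_row, s_row)) for s_row in phi_fb])
        for q_row in theta_fm
    ]


def match_features(f_m: Matrix, f_b: Matrix, delta: float) -> Matrix:
    """Return query features concatenated with delta-scaled matched features."""
    weights = attention_map(f_m, f_b)  # identity Theta and Phi are assumed here
    channels = len(f_b[0])
    # g(f_M, f_B): attention-weighted sum of support features per query position.
    matched = [
        [sum(w * s_row[c] for w, s_row in zip(w_row, f_b)) for c in range(channels)]
        for w_row in weights
    ]
    # Assumed delta rule: scale g by delta, then concatenate along channels.
    return [q_row + [delta * g for g in g_row] for q_row, g_row in zip(f_m, matched)]


query = [[1.0, 0.0], [0.0, 1.0]]    # toy f_M with WH = 2, C = 2
support = [[3.0, 4.0], [3.0, 4.0]]  # toy f_B with WH = 2, C = 2
integrated = match_features(query, support, delta=0.5)
print(len(integrated), len(integrated[0]))
```

Under these assumptions, support positions with low similarity to a query position receive small attention weights, which is how irrelevant features are suppressed.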

TRSE datasets
The datasets used in this work are shown in Fig. 5. A more detailed view of the capturing mechanism and the sensor used is given in our earlier work [33]. The train rolling stock examination (TRSE) bogie videos are captured at different time stamps during the day as shown in Fig. 5. Each of these videos was captured at 240 frames per second and 1080p resolution while the train was moving at a little over 30 kmph. Each of the video datasets has more than 21,000 frames.
Since it was difficult to find defective bogies within a short period in real time, we simulated the defects found regularly on bogie parts using Photoshop and reinserted those frames into the original video sequence. Figure 6 shows two such defects on spring suspension and binding rods.
The objective of the experimentation is to identify the following bogie parts in the video sequence as shown in Fig. 7. Altogether, there are 16 bogie parts that should be monitored during TRSE as per the Indian railway rolling stock examination manual. The numbering will be part of the class names as there are multiple parts with the same name. The performance is evaluated using intersection over union (IoU), mean average precision (mAP), mean false identification (mFI), and mean non-identification (mNI). The IoU is generally used for understanding the performance of the UNet segmentation module and its role in the judgment of the classifier. The range of IoU is between 0 and 1, with the latter being the desired value for a good segmentation algorithm. Similarly, the mAP gives the precision with which the classifier identifies the given bogie object. The mFI is a parameter that indicates the false identification of a bogie object, and mNI gives the inability of the classifier to identify the bogie object.
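As a brief illustration of the IoU metric described above, the sketch below computes it for two binary masks stored as nested lists. The function name and the convention that two empty masks score 1.0 are assumptions made for the example.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # 1 marks a bogie-part pixel, 0 marks background


def iou(predicted: Mask, truth: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(predicted, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


predicted_mask = [[1, 1, 0],
                  [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
print(iou(predicted_mask, ground_truth))
```

A value of 1 means the predicted segment and the ground truth coincide exactly, and 0 means they do not overlap at all.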

Training and testing the B-MHAC
The training dataset is limited to only one sequence of 200 video frames per sample. Specifically, only 200 video frames per train bogie are applied for training the model along with the defect-induced sequences. Our train video dataset consists of 7 video sequences, out of which six are normal and one contains defective bogie parts. The total number of training video frames is 7 × 200 = 1,400. The remaining video frames are used for testing the trained model. The two networks in B-MHAC are trained and tested separately due to different hyperparameter initializations. The weights and biases for B-MHAC were initialized from a zero-mean, unit-variance Gaussian distribution. The learning rate in UNet was fixed across all datasets at 0.000001. This low learning rate enables the UNet to learn slowly over the entire object range. The bogie object masks were created using the annotation tool ImageJ. The UNet is trained with the stochastic gradient descent (SGD) optimizer and the dice loss function. Specifically, the dice loss measures the overlap between the predicted and ground truth samples.
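The dice loss mentioned above can be illustrated with the common soft formulation, 1 − 2|P ∩ G| / (|P| + |G|). The sketch below assumes flattened masks and a small smoothing constant `eps`; the exact variant used for training is not stated in the text, so both choices are assumptions.

```python
from typing import Sequence


def dice_loss(predicted: Sequence[float], truth: Sequence[float],
              eps: float = 1e-6) -> float:
    """Soft dice loss on flattened masks; 0 for perfect overlap, near 1 for none."""
    intersection = sum(p * t for p, t in zip(predicted, truth))
    total = sum(predicted) + sum(truth)
    # eps (an assumed smoothing term) avoids division by zero on empty masks.
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


# Predicted per-pixel probabilities against a binary ground-truth mask.
print(dice_loss([0.9, 0.8, 0.1, 0.0], [1.0, 1.0, 0.0, 0.0]))
```

Because the loss depends on the overlap ratio and not on the pixel count, it stays informative for small bogie parts that occupy few pixels in the frame.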
To standardize the training process, the UNet was trained for 100 epochs on all datasets. The multi head classifier uses a dynamic learning rate initialized at 0.0001, which reduces by 10% when the error remains constant for 10 consecutive epochs. The momentum factor is 0.8. Here, we used the Adam optimizer for weight adaptation and the cross entropy loss function for error calculation. The output of the classifier is a probability distribution with the maximum probability pointing towards the predicted class label. Additionally, inferencing on the test video sequences is accomplished by mapping the bounding box locations from the annotation data. The biggest advantage of the B-MHAC lies in the tiny training set that is sufficient for achieving robust performance over the entire test set.
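The dynamic learning rate described above can be sketched as a plateau-based schedule. This is a minimal sketch under stated assumptions: "reduces by 10%" is read as multiplying the rate by 0.9, and "constant error" is read as no improvement larger than a tolerance `tol`. The class name and the tolerance value are hypothetical.

```python
from typing import Optional


class PlateauSchedule:
    """Cut the learning rate by 10% after `patience` epochs without improvement."""

    def __init__(self, lr: float = 1e-4, factor: float = 0.9,
                 patience: int = 10, tol: float = 1e-4) -> None:
        self.lr = lr
        self.factor = factor      # assumed reading of "reduces by 10%"
        self.patience = patience
        self.tol = tol            # assumed threshold for a "constant" error
        self.best: Optional[float] = None
        self.stale_epochs = 0

    def step(self, error: float) -> float:
        """Record one epoch's error and return the rate for the next epoch."""
        if self.best is None or error < self.best - self.tol:
            self.best = error
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr


schedule = PlateauSchedule()
for epoch_error in [0.50, 0.40] + [0.40] * 10:
    current_lr = schedule.step(epoch_error)
print(current_lr)
```

Deep learning frameworks usually ship an equivalent utility, for example PyTorch's `ReduceLROnPlateau` scheduler, which exposes similar `factor` and `patience` parameters.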

Results and discussion
The proposed segmentation followed by multi head recognition of train bogie parts from high-speed video frames was evaluated on multiple datasets and with varied hyperparameter combinations of the network during training. Subsequently, the results of the experiments were validated against the previous models on different test inputs.
The following subsections provide a detailed analysis of the results obtained on multiple datasets.

Quantitative validation of B-MHAC
First, we show the output of the UNet segmentation module on bogie train video sequences. Second, we present the three outcomes of the multi head classifier to show the confidence of the trained model in identifying a bogie object during inferencing. Figure 8 shows the results of the UNet on the axle bogie part. Simultaneously, the segmented axle bogie part is juxtaposed with ground truth sequences in Fig. 9.
The figures show only 15 frames of the video sequence in data B-1 when the train is moving from right to left of the screen. The results obtained for all other 17 bogie parts were found to be similar to Fig. 8. The first few frames in the segmentation output show weak edges, as the object is small and its deformation is rapid between frames. However, with increased object pixel density and reduced intra-frame object deformations, the segmentation process is relatively stable and provides high-quality bogie part segments for classification. The trained B-MHAC is tested on the segmentation outputs of UNet, and the results are projected onto the actual video frames through bounding boxes. First, we show the results obtained on the databases in Fig. 6. The inferencing results are shown in Fig. 10 on six different train videos captured under various circumstances. The overall bogie part retrieval is found to be around 90% in video sequences where the camera lens was perpendicular to the train movement. The relative position of the bogie parts in the frames does not affect the recognition accuracies due to the presence of the segmentation module and the multi head attention network. The multi head attention network takes input from three sets of bogie parts at different time steps and generalizes on the location of the objects in the continuous video sequence. This has resulted in a more accurate mapping of bounding box information onto the video sequence. The effectiveness of the B-MHAC bogie part identification model is ascertained by comparing the results against popular image object detection models such as SSD, R-CNN, Fast R-CNN, Faster R-CNN, and our previous method with different Yolo versions. The visual results are presented in Fig. 11 on the B-4 dataset.
The proposed method outperformed other models due to the presence of multi head attention network that was learned in time steps on bogie object deformations.
This type of learning involves instances of both spatial and temporal information for classification making it robust to object deformations in the video sequences of moving trains. Finally, the B-MHAC is tested for defective parts identification on modified video sequences. The video frames with defective parts are fabricated with two defects on the spring suspension and binding screw.
These defective part frames are inserted into the video sequence, and the model was trained from scratch to identify defects by using the existing hyperparameters from the previous training. The results are projected onto the video sequence with a red bounding box for defective parts as shown in Fig. 12. The ability of the proposed B-MHAC to identify defective parts is found to be impressive. This is due to the fact that the bogie part is segmented and passes through an attention span of multiple time steps, which allows the network to learn distinct features across classes. Subsequently, the next subsection highlights the qualitative results on all the datasets with the calculated parameters as indicated above.

Qualitative evaluation of B-MHAC model
This subsection evaluates the proposed B-MHAC deep learning model on the six TRSE datasets. The IoU is calculated only for bogie parts segmented with UNet, and the remaining parameters represent the classifier performance. The results are tabulated in Table 1. The values are averaged over the entire test sample. The average IoU across all datasets and bogie parts is 0.9162. This shows that the difference between the predicted bogie segments and the ground truth (GT) has been narrowed extensively. We found a lower IoU for parts that are positioned at the edges of the frames than for those in the middle. Additionally, the camera angle and the light intensities during recording also contributed to the lower IoU scores on the datasets B-3 and B-6, respectively. The average recognition mAP is 0.90115 across all datasets. Critical analysis showed that bogie parts such as wheels and spring suspensions recorded the lowest mAP values across all datasets. However, their scores were better than those of the previous models, as shown in Table 2.
The models in Table 2 are trained from scratch on all datasets by keeping the hyperparameters constant. The other two parameters, mFI and mNI, indicate cases where the B-MHAC identifies a part incorrectly and cases where it does not identify the part at all in the video frame, respectively. These two parameters are important in understanding the reason for the failure of the B-MHAC model. These parametric comparisons are presented in Tables 3 and 4. Analysis of these tables shows that the bogie parts in the neighborhood of the camera focal length have better identification potential than those that are away from it. In practice, it is extremely difficult to adjust the camera sensor position with respect to the moving train. Despite the above constraints, the B-MHAC has shown robust performance in instances where the camera sensor is randomly positioned. Overall, the B-MHAC has shown the capability to sense bogie parts with exceptionally high accuracy when compared to other models. This is due to its multiple networks used for segmentation and recognition simultaneously.
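The two failure parameters can be illustrated with the sketch below. The paper does not give explicit formulas for mFI and mNI, so the definitions here are assumptions: for each class, the false identification rate is the fraction of ground-truth instances assigned a wrong label, the non-identification rate is the fraction that received no label at all (`None`), and both are averaged over classes. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def failure_rates(truth: List[str],
                  predicted: List[Optional[str]]) -> Tuple[float, float]:
    """Return (mFI, mNI) averaged over the classes present in `truth`."""
    totals: Dict[str, int] = defaultdict(int)
    wrong: Dict[str, int] = defaultdict(int)
    missed: Dict[str, int] = defaultdict(int)
    for true_label, pred_label in zip(truth, predicted):
        totals[true_label] += 1
        if pred_label is None:
            missed[true_label] += 1   # part not identified at all
        elif pred_label != true_label:
            wrong[true_label] += 1    # part identified as another class
    if not totals:
        return 0.0, 0.0
    classes = sorted(totals)
    m_fi = sum(wrong[c] / totals[c] for c in classes) / len(classes)
    m_ni = sum(missed[c] / totals[c] for c in classes) / len(classes)
    return m_fi, m_ni


labels = ["wheel", "wheel", "axle", "axle"]
outputs = ["wheel", "axle", None, "axle"]
print(failure_rates(labels, outputs))
```

Separating the two rates shows whether a low mAP comes from confusion between similar parts or from parts that the classifier misses entirely.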

Defect detection through B-MHAC and comparison
The primary objective of TRSE is to identify defective parts during transit. Therefore, any automated TRSE algorithm should be able to detect defective parts in the vicinity of the normal bogie parts. This experiment is performed to test the ability of the algorithms to detect defective parts. Accordingly, one set of training samples was selected as defective parts. In this work, only two defects were induced manually on the spring suspension and binding screw. A total of 200 frames were created with the two defects and were inserted into the video sequence of B-1. These are called broken part defects, where the width and location of the cut are varied every 20 frames. The testing is performed with a 4000-frame video where 40 continuous frames were inserted into the original B-1 dataset at 5 randomly selected locations. The results of the experiment are shown in Table 5. Markedly, the proposed method shows robust defect identification capabilities over other methods by taking advantage of the multi head attention network. However, the model also suffers from inconsistency in defect dimensions, which has left some defects undetected in the video sequence.
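The construction of the defect test sequence can be sketched as follows. The sketch assumes that the 4000-frame total is made up of 3800 normal frames plus 5 runs of 40 defective frames, which the text does not state explicitly. The function name, the fixed random seed, and the placeholder frame identifiers are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def insert_defect_runs(normal: Sequence, defective: Sequence,
                       run_length: int = 40, num_runs: int = 5,
                       seed: int = 0) -> Tuple[List, List[bool]]:
    """Insert consecutive defective runs at random cut points of `normal`."""
    rng = random.Random(seed)
    # Distinct, sorted cut points so the runs never merge with each other.
    cuts = sorted(rng.sample(range(len(normal) + 1), num_runs))
    merged: List = []
    is_defective: List[bool] = []
    start = 0
    for k, cut in enumerate(cuts):
        merged.extend(normal[start:cut])
        is_defective.extend([False] * (cut - start))
        run = defective[k * run_length:(k + 1) * run_length]
        merged.extend(run)
        is_defective.extend([True] * len(run))
        start = cut
    merged.extend(normal[start:])
    is_defective.extend([False] * (len(normal) - start))
    return merged, is_defective


normal_frames = list(range(3800))                          # placeholder frame ids
defect_frames = ["defect_%d" % i for i in range(200)]      # 5 runs of 40 frames
video, flags = insert_defect_runs(normal_frames, defect_frames)
print(len(video), sum(flags))
```

The returned flags give a per-frame ground truth that can be compared with the red bounding box detections when scoring a defect detector.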

B-MHAC vs similar works
Previous works largely focused on the segmentation of the bogie parts from the TRSE video sequences. These models aim to segment the bogie parts with precision rather than to generate discriminative features for classification. Different from these approaches, we added an extra deep learning classifier at the end of the segmentation process for the recognition of bogie parts. This experiment provides an insight into the behavior of the B-MHAC model relative to the existing models. Instead of including different CNN architectures along with the segmentation module, we applied our multi head attention classifier to the segmented outputs of these methods. The training and testing processes were in line with the original B-MHAC model. The results obtained are presented in Table 6. Only mAP was computed for a single