The multi-head attention model is one of the most accurate and widely accepted attention mechanisms in speech processing, where it learns to attend to a particular set of words during training. Analogously, a moving train induces motion artifacts such as bogie part shape deformation due to viewpoint variations with respect to the fixed camera position. Moreover, the camera outputs 240 fps video sequences with a lens angle of 54°. The method proposed in this work comprises a segmentation network followed by a multi-head attention-based bogie parts classifier. The entire model, called the deep bogie part inspector (DBPI), has a segmentation module in the front end and an attention-based classifier in the back end. The primary network in the DBPI model is built on the UNet architecture backbone, and the secondary network is a classifier built on a multi-head attention network. In our proposed multi-head attention network, three streams are fed with three consecutive segmented bogie parts from key frames. The proposed method is shown to improve recognition accuracies by 6% over existing object detection deep networks.
The proposed work has three core objectives: (1) to segment the bogie parts from key video frames of the train undercarriage by learning from deformations in their shapes and spatial locations, (2) to use the segmented bogie parts to compute an attention matrix that contributes to faster identification of bogie parts in continuous video sequences, and (3) to design a multi-head attention-based architecture for the classification of bogie parts irrespective of their shape and spatial location across the entire video sequence.
Finally, the proposed model also produces a bogie part assert score (BPAS) that helps the human TRSE inspector make decisions for timely maintenance, thereby increasing passenger safety. Figure 1 shows video frames of the bogies on Indian Railway coaches.
Our proposed deep bogie part inspector (DBPI) for TRSE differs from existing models in three ways: (1) the multi-head attention network in the classifier learns from a minimalistic dataset, making the training process faster, (2) it offers higher bogie part classification accuracies across the entire range of video sequences, and (3) the model generates actionable intelligence that lets maintenance engineers predict the durability of the bogie parts during the train running cycle.
The deep bogie part inspector is an ensemble of two learning networks, as stated in the introduction. The architecture of the proposed DBPI is shown in Fig. 2. The first network is based on the UNet architecture and segments the bogie parts from the video sequences. It is followed by the classification network, which identifies a bogie part and checks its durability for the onward train journey towards the destination. Specifically, the classifier is built on a multi-head attention mechanism where the segmented output of each part is used to determine the attention of that part in the video sequences. Furthermore, matching networks are designed to establish a semantic correspondence between the parts and the original video frames. This allows the network to match the correct position of a bogie part across multiple frames, yielding the correct match from a few sampled segments. Consequently, the network learns from a few frames of segmented bogie parts rather than from the bogie parts in all frames. Finally, the extracted features are concatenated locally first and then globally before being learned by fully connected neural nets. The last layer is a Softmax that predicts the correct bogie part from the input video frames.
The bogie segmentation module — the B-UNet
The bogie U-shaped convolutional neural network (B-UNet), shown in Fig. 2, is a segmentation module that separates the bogie parts individually from the video frames. Given a complete set of bogie video frames V(x, y, 3, t) ∀ (x, y, 3) ∈ R2, where t is the frame number, the objective of B-UNet is to segment the bogie parts \({S}_b^k\left(x,y,k\right)\), where k indexes the key frames. Key frames are important because the video is captured at 240 fps, so there are 240 frames within each second. The change across 240 consecutive frames is barely noticeable to the artificial visual sensor, and hence key frames are extracted. Since all the bogie video frames have similar pixel densities, feature-based key frame extraction using histogram of oriented gradients (HOG) features with K-means clustering had little impact on the outcome. However, the entropy-based method [49] captures a good deal of pixel variation across the video frames. The frame entropy is computed as follows:
$$\mathrm{E}(f)=-\sum \limits_j{p}_f(j)\times \log \left({p}_f(j)\right)$$
(1)
The entropy E of the frames f defines a 2D curve in the (f, E) plane. This 2D entropy curve exhibits local maxima and minima, from which the local extreme points are extracted. The frames at these extracted points are the key frames. The resulting key frames of the bogies are given by V(x, y, 3, k), where k = 1, 2, …, K and (x, y, 3) are the pixel locations across the three color channels.
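For concreteness, the listing below is a minimal sketch of this entropy-based key frame selection, assuming a Python implementation with OpenCV, NumPy, and SciPy; the histogram granularity and the extremum neighborhood (the order parameter) are illustrative choices not specified in the text.

```python
# Minimal sketch of entropy-based key frame selection (Eq. 1 plus local extrema).
import cv2
import numpy as np
from scipy.signal import argrelextrema

def frame_entropy(frame_bgr: np.ndarray) -> float:
    """Shannon entropy of the grey-level histogram of one frame (Eq. 1)."""
    grey = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([grey], [0], None, [256], [0, 256]).ravel()
    p = hist / hist.sum()
    p = p[p > 0]                      # drop empty bins so log() is defined
    return float(-(p * np.log(p)).sum())

def key_frame_indices(video_path: str, order: int = 5) -> np.ndarray:
    """Return frame indices at local extrema of the entropy-vs-frame curve."""
    cap = cv2.VideoCapture(video_path)
    entropies = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        entropies.append(frame_entropy(frame))
    cap.release()
    e = np.asarray(entropies)
    maxima = argrelextrema(e, np.greater, order=order)[0]
    minima = argrelextrema(e, np.less, order=order)[0]
    return np.sort(np.concatenate([maxima, minima]))
```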
Now, we redefine the segmentation problem as follows: given a set of key bogie frames V(x, y, 3, k) ∀ k = 1 to K, design a UNet model to learn the binary bogie part masks \({S}_b^k\left(x,y,k\right)\) for segmentation. The parameter b is the bogie part index. The architecture of the UNet is adapted from [47]. The model takes an input frame of size 240 × 424 × 3. The network in Fig. 3 has 8 convolution layers with 16, 32, 64, and 128 filters per pair of consecutive layers in both the compressive and expansive paths, as shown in the UNet part of Fig. 2. All layers use 3 × 3 unpadded convolutions with ReLU activation functions followed by 2 × 2 max pooling, which halves the frame resolution passed to the forward layers. Each subsequent down-sampling step doubles the number of filters in both arms of the UNet. There are no fully connected layers at the end of the downsampler block. The upsampler blocks then add pixels through 2 × 2 up-convolutions that cut the number of filter channels to half of the corresponding counterpart at the same downsampling level. The resulting loss of feature channels during upsampling makes it difficult to generate a segmentation mask from these small feature maps alone. This loss is compensated by skip connections that bring the cropped feature maps from the corresponding compressor blocks to the expander blocks at the same level after the up-convolutions. Cropping the encoder (compressor) block features is necessary to ensure uniform dimensionality for concatenation with the expander feature maps. The concatenated feature maps are further learned using two 3 × 3 convolutional layers, each followed by a nonlinear ReLU. Finally, a 1 × 1 convolutional layer maps each 16-component feature vector to the required classes. Our B-UNet uses only 16 base components against the 64 in the original UNet architecture, because the segmentation stroke of a bogie part is small compared to the spatial resolution of the entire frame. After multiple experiments, a 16-channel base filter bank proved sufficient for bogie part segmentation and is computationally faster than the traditional UNet model.
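The following is a compact PyTorch sketch of the B-UNet layout described above (paired 3 × 3 convolutions with 16, 32, 64, and 128 filters, 2 × 2 max pooling, skip connections, and a final 1 × 1 convolution). Padded convolutions are used instead of the unpadded-plus-cropping scheme so that the sketch stays short; it is an illustration under these assumptions rather than the exact implementation.

```python
# Compact B-UNet sketch: 4 resolution levels with 16/32/64/128 filters.
import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

class BUNet(nn.Module):
    def __init__(self, n_classes=1):
        super().__init__()
        self.enc1, self.enc2, self.enc3 = double_conv(3, 16), double_conv(16, 32), double_conv(32, 64)
        self.bottleneck = double_conv(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up3, self.dec3 = nn.ConvTranspose2d(128, 64, 2, stride=2), double_conv(128, 64)
        self.up2, self.dec2 = nn.ConvTranspose2d(64, 32, 2, stride=2), double_conv(64, 32)
        self.up1, self.dec1 = nn.ConvTranspose2d(32, 16, 2, stride=2), double_conv(32, 16)
        self.head = nn.Conv2d(16, n_classes, 1)               # 1x1 conv to class maps

    def forward(self, x):                                      # x: (N, 3, 240, 424)
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        b = self.bottleneck(self.pool(e3))
        d3 = self.dec3(torch.cat([self.up3(b), e3], dim=1))    # skip connection from encoder
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)                                   # (N, n_classes, 240, 424)

mask_logits = BUNet()(torch.randn(1, 3, 240, 424))             # quick shape check
```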
The bogie part occupies a small spatial region compared to the large background in each video frame. This results in a loss dominated by the background during training, which frequently falls into local minima. Hence, we apply the solution in [62], which addresses the foreground-background pixel imbalance in the rolling stock video frames. We used two loss functions during the training process: a variant of binary cross-entropy (BCE) called the focal loss (FL), and the dice loss (DL). The FL is given by
$$FL\left({G}_T,p\right)=\left\{\begin{array}{ll}-\sum \limits_{i=1}^{np_0}\alpha {\left(1-p\right)}^{\gamma}\log (p), & \text{if}\ {G}_T=1\\ -\sum \limits_{i=1}^{np_1}\left(1-\alpha \right){p}^{\gamma}\log \left(1-p\right), & \text{otherwise}\end{array}\right.$$
(2)
where G_T is the ground truth taking pixel values in {0, 1} and p ∈ [0, 1] is the foreground/background probability predicted by the model. Here, np_0 and np_1 are the numbers of background (value 0) and foreground (value 1) pixels, respectively. The values α ∈ (0, 1] and γ ∈ [0, 5] are adjustable hyperparameters. For B-UNet, we selected α = 0.5 and γ = 1 across all datasets.
The second loss is the dice loss (DL), which is a standard choice in deep learning-based segmentation. The dice loss mitigates the imbalance between foreground and background pixels by using the segmentation overlap index between the predicted mask and the ground truth annotated mask. The DL is formulated as follows:
$$DL\left(p,{G}_T\right)=1-\frac{2\sum \limits_{i=1}^{np}{p}_i{G}_{T_i}+\delta }{\sum \limits_{i=1}^{np}{p}_i^2+\sum \limits_{i=1}^{np}{G}_{T_i}^2+\delta }$$
(3)
The parameter δ ∈ [0, 1] is a smoothing term that prevents division by zero during training. The two losses are backpropagated through the network simultaneously for weight updates, with the focal loss term averaged over the pixel count. The combined loss is defined as follows:
$${S}_L=\frac{FL\left(p,{G}_T\right)}{np}+ DL\left(p,{G}_T\right)$$
(4)
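A minimal sketch of the combined loss in Eqs. (2) to (4) is given below, assuming a PyTorch implementation operating on predicted foreground probabilities and binary ground truth masks; the reduction details and the value of δ are assumptions where the text does not state them.

```python
# Sketch of the combined focal + dice segmentation loss (Eqs. 2-4).
import torch

def focal_loss(pred, target, alpha=0.5, gamma=1.0, eps=1e-7):
    """Eq. (2): alpha-balanced focal loss summed over all pixels."""
    p = pred.clamp(eps, 1 - eps)
    pos = -alpha * (1 - p) ** gamma * torch.log(p)            # G_T = 1 pixels
    neg = -(1 - alpha) * p ** gamma * torch.log(1 - p)        # G_T = 0 pixels
    return torch.where(target == 1, pos, neg).sum()

def dice_loss(pred, target, delta=1.0):
    """Eq. (3): soft dice loss; delta avoids division by zero."""
    inter = (pred * target).sum()
    return 1 - (2 * inter + delta) / ((pred ** 2).sum() + (target ** 2).sum() + delta)

def segmentation_loss(pred, target):
    """Eq. (4): focal loss averaged over the pixel count plus dice loss."""
    n_pixels = target.numel()
    return focal_loss(pred, target) / n_pixels + dice_loss(pred, target)
```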
The B-UNet segmentation network is trained on the K key frames to extract the bogie parts from bogies B ∈ [1, b] and parts P ∈ [1, p]. Testing is performed on bogie part sequences from different trains that the B-UNet has not previously seen. The segmented parts are then applied as inputs to the classifier to identify each bogie part correctly and provide the necessary analysis.
B-UNet implementation
The original frame size from the high-speed camera sensor was 1280 × 1918 at 240 fps. The sensor records 240 frames per second, so a 1-min video contains around 240 × 60 = 14,400 frames. Our dataset consists of passenger trains from the Indian subcontinent, which have an average of 20 coaches per train. The average recording duration per train was around 1.05 to 1.42 min. All the above values are computed from the video contents in our dataset. The average number of frames per training class was found to be around 15,456 frames per train. Using the entropy-based formulation, the key frame extractor retains only frames with maximally occupied bogie parts. The number of key frames per bogie is around 0.2% of the total frames, which is 30 frames/bogie. There are two bogies per coach per side, so a 20-coach train has 40 bogies. Hence, the full training pool for bogie part segmentation consists of 30 × 40 = 1200 video frames. From these 1200 bogie video frames, we train on only 8 bogies with 18 parts, because the bogie parts are fairly constant over the entire train and it is unnecessary to use all bogie frames for training. Choosing 8 bogies also provides sufficient material for data augmentation, which includes rotation, scaling, zooming, and horizontal and vertical flipping in our model. Finally, the training set has 320 frames with 100 different augmentations per frame, so the total dataset for B-UNet has 32 K video frames and 32 K ground truth labels with around 1778 parts per label. The filter kernels are initialized using a zero-mean Gaussian with unit variance. A batch normalization layer is added after each convolution layer to stabilize training. The hyperparameters in the loss function are selected through experimentation, as discussed in the previous section, for all the bogie videos. The optimizer is Adam with a learning rate of 0.00001 and a momentum factor of 0.02. No learning rate decay is applied as the error approaches its minimum. These settings are unchanged across all datasets and across the other models used for comparison. All the models were implemented on an NVIDIA GTX 1070 Ti attached to 16 GB of memory. The number of epochs is set to 100 for all models.
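The following sketch summarizes this training configuration (Gaussian weight initialization, Adam with a learning rate of 1e-5, no learning rate decay, 100 epochs), reusing the BUNet and segmentation_loss sketches from the earlier listings; the random dataset and batch size are placeholders for the augmented key frames and masks.

```python
# Illustrative B-UNet training loop with the quoted hyperparameters.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for augmented key frames and ground-truth masks.
train_set = TensorDataset(torch.randn(8, 3, 240, 424),
                          torch.randint(0, 2, (8, 1, 240, 424)).float())

model = BUNet()                                   # B-UNet sketch from the earlier listing

def init_weights(m):
    # zero-mean, unit-variance Gaussian initialization for the convolution kernels
    if isinstance(m, (torch.nn.Conv2d, torch.nn.ConvTranspose2d)):
        torch.nn.init.normal_(m.weight, mean=0.0, std=1.0)
        if m.bias is not None:
            torch.nn.init.zeros_(m.bias)

model.apply(init_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)      # no LR decay is applied

for epoch in range(100):
    for frames, masks in DataLoader(train_set, batch_size=4, shuffle=True):
        optimizer.zero_grad()
        pred = torch.sigmoid(model(frames))        # foreground probabilities
        loss = segmentation_loss(pred, masks)      # combined loss from Eqs. (2)-(4)
        loss.backward()
        optimizer.step()
```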
Testing is performed on the full bogie video sequence without key frame extraction for segmenting the bogie parts. The segmented bogie parts are then arranged in chronological triplets of the current frame f_c, the previous frame f_{c − 1}, and the next frame f_{c + 1} for each bogie part. These three groups of segmented video frames form the input to the classifier, which is built on the multi-head attention model.
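A small sketch of this triplet grouping is shown below; skipping the first and last frames at the sequence boundaries is an assumption rather than the paper's stated policy.

```python
# Chronological triplet grouping for the three attention streams.
def frame_triplets(segmented_frames):
    """Yield (previous, current, next) segmented frames for each valid index c."""
    for c in range(1, len(segmented_frames) - 1):
        yield segmented_frames[c - 1], segmented_frames[c], segmented_frames[c + 1]
```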
The bogie classification module: B-MHAC
The B-MHAC (bogie multi-head attention classifier) combines an attention grabber network with a dense network classifier using Softmax activation. The outputs of the 1 × 1 convolutions in the B-UNet are segmented bogie parts, which are separated into multiple classes manually. Given the bogie classes C_b with their segmented bogie parts from b = 1 to B at 30 time steps and the raw bogie RGB video frames V(x, y, 3, t) ∈ R2, the objective of B-MHAC is to learn the distinct bogie part features using the multi-head attention framework as an object placeholder in the video frames. The right side of Fig. 2 has 4 streams of convolutional layers, with three of them forming the basis for attention on the raw video frames in the 4th stream. The convolutional layers in each stream accept an input of size 240 × 424 × 3, which is interpolated to 240 × 424 to match the segmented outputs and is then operated upon by 32 filters of size 3 × 3 with stride 1. These linear layers are nonlinearized with ReLU activations and passed through batch normalization so that the layers train more independently.
Given the features from the four streams with sizes \({\left.W\times H\times F\right|}_l^i\) in the l-th layer of the i-th feature matrix, the first goal is to multiply the segments at different time scales with the incoming bogie RGB frame independently in the upper 3 streams of the multi-head network. The output of the multiplication is \({M}_s^l(i)\in {R}^2\), where s = 1, 2, 3 is the stream number. To obtain the relationship between the original masked object \({M}_s^l(i)\) and the segmented bogie part features \({f}_s^l(i)\), we apply a feature matching block as shown in Fig. 4. Here, l indexes the layers and i the feature positions.
Let \({f}_M\subset {M}_s^l(i)\) and \({f}_B\subset {f}_s^l(i)\) be the features of the query masked object and the support bogie parts, each of size W × H × C. These two features are first mapped to spaces Θ and Φ to obtain Θ(fM) and Φ(fB), respectively. The resulting 3D matrices are then reshaped to WH × C and transformed into spatial attention maps using the formulation below:
$$h\left({f}_M,{f}_B\right)=\underset{rows}{soft\max}\left(\Theta \left({f}_M\right)\Phi {\left({f}_B\right)}^T\right)$$
(5)
Meanwhile, the output h(fM, fB) is multiplied with the support features w(fB) to form an intermediate space g(fM, fB), formulated as follows:
$$g\left({f}_M,{f}_B\right)=h\left({f}_M,{f}_B\right)\times w\left({f}_B\right)$$
(6)
Here, wj(fB) are the features of the bogie parts at the j-th position in the network. This ensures that the features relevant to the query image are retained and those that are irrelevant are discarded.
Finally, the output of the matching network g(fM, fB) is reshaped to the shape of the original query features and combined with them through a δ weighting rule, computed as follows:
$${F}_I=\delta \times g\left({f}_M,{f}_B\right)+\left(1-\delta \right)M(i)$$
(7)
The δ value is a hyperparameter that is set based on experimentation and on the pixel density of each bogie object. Finally, the integrated features FI from each of the bogie classes are applied to a two-stage dense network with Softmax activation for classification. Though the process is computationally expensive, it has been shown to recognize the deforming shapes of bogie objects during the movement of the train. Accordingly, we test the performance of the proposed method through experimentation and validation on the train rolling stock dataset.
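The listing below is a hedged PyTorch sketch of the matching block in Eqs. (5) to (7). The mappings Θ, Φ, and w are modeled as 1 × 1 convolutions and δ is fixed at 0.5 for illustration; the paper does not state these parameterizations, so they should be read as assumptions.

```python
# Sketch of the feature matching block: spatial attention between query and support features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingBlock(nn.Module):
    def __init__(self, channels, delta=0.5):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels, 1)   # Theta(f_M)
        self.phi = nn.Conv2d(channels, channels, 1)     # Phi(f_B)
        self.w = nn.Conv2d(channels, channels, 1)       # w(f_B)
        self.delta = delta                              # weighting hyperparameter in Eq. (7)

    def forward(self, f_M, f_B):                        # both (N, C, W, H)
        n, c, w, h = f_M.shape
        q = self.theta(f_M).flatten(2).transpose(1, 2)  # (N, WH, C)
        k = self.phi(f_B).flatten(2).transpose(1, 2)    # (N, WH, C)
        v = self.w(f_B).flatten(2).transpose(1, 2)      # (N, WH, C)
        attn = F.softmax(q @ k.transpose(1, 2), dim=-1) # Eq. (5): row-wise softmax over WH
        g = attn @ v                                    # Eq. (6): attend over support features
        g = g.transpose(1, 2).reshape(n, c, w, h)       # reshape back to the query shape
        return self.delta * g + (1 - self.delta) * f_M  # Eq. (7): integrated features F_I

integrated = MatchingBlock(32)(torch.randn(1, 32, 60, 106), torch.randn(1, 32, 60, 106))
```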
Experimentation
TRSE datasets
The datasets used in this work are shown in Fig. 5. A more detailed view of the capturing mechanism and the sensor used is given in our earlier work [33]. The train rolling stock examination (TRSE) bogie videos were captured at different time stamps during the day, as shown in Fig. 5. Each of these videos was captured at 240 frames per second at 1080p resolution while the train was moving at a little over 30 km/h. Each of the video datasets has more than 21,000 frames.
Since it was difficult to find defective bogies within a short period in real time, we simulated the defects regularly found on bogie parts using Photoshop and reinserted those frames into the original video sequence. Figure 6 shows two such defects, on the spring suspension and the binding rods.
The objective of the experimentation is to identify the bogie parts in the video sequence shown in Fig. 7. Altogether, there are 16 bogie parts that should be monitored during TRSE as per the Indian railway rolling stock examination manual. The part numbering is included in the class names because multiple parts share the same name. Four parameters were used to judge the performance of the algorithms quantitatively, along with qualitative visual validation on the test video frames: intersection-over-union (IoU), mean average precision (mAP), mean false identification (mFI), and mean non-identification (mNI). The IoU is used to understand the performance of the UNet segmentation module and its role in the judgment of the classifier. The range of IoU is between 0 and 1, with the latter being the desired value for a good segmentation algorithm. Similarly, the mAP gives the precision with which the classifier identifies a given bogie object. The mFI indicates false identification of a bogie object, and the mNI reflects the classifier's inability to identify the bogie object.
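As an illustration, the IoU between a predicted and a ground truth binary mask can be computed as in the short sketch below; the convention for two empty masks is an assumption, and mAP, mFI, and mNI are aggregated analogously from the classifier predictions.

```python
# Intersection-over-union between binary predicted and ground-truth masks.
import numpy as np

def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                    # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)
```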
Training and testing the B-MHAC
The training dataset is limited to only one sequence of 200 video frames per sample. In other words, only 200 video frames per train bogie are used for training the model, along with the defect-induced sequences. Our train video dataset consists of 7 video sequences, of which six are normal and one contains defective bogie parts. The total number of training video frames is therefore 7 × 200 = 1.4 K. The remaining video frames are used for testing the trained model. The two networks in B-MHAC are trained and tested separately due to different hyperparameter initializations. The weights and biases of B-MHAC are initialized from a zero-mean, unit-variance Gaussian distribution. The learning rate in the UNet was fixed across all datasets at 0.000001. This low learning rate enables the UNet to learn slowly over the entire object range. The bogie object masks were created using the annotation tool ImageJ. The UNet is trained with the stochastic gradient descent (SGD) optimizer and the dice loss function; the dice loss measures the overlap between the predicted and ground truth masks. To standardize the training process, the UNet was trained for 100 epochs across all datasets.
Consequently, the multi-head classifier uses a dynamic learning rate initialized at 0.0001, which is reduced by 10% when the error remains constant for 10 consecutive epochs. The momentum factor is 0.8. Here, we use the Adam optimizer for weight adaptation and the cross-entropy loss function for error calculation. The output of the classifier is a probability distribution whose maximum probability points to the predicted class label. Additionally, inferencing on the test video sequences is accomplished by mapping the bounding box locations from the annotation data. The biggest advantage of B-MHAC lies in the tiny training set that is sufficient for achieving robust performance over the entire test set.
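The classifier's dynamic learning rate schedule can be sketched as below, assuming PyTorch's ReduceLROnPlateau with a 10% reduction after 10 stagnant epochs; the stand-in classifier and dummy batch are placeholders for B-MHAC and its actual triplet inputs.

```python
# Illustrative classifier training loop with a plateau-based LR schedule.
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Flatten(), nn.Linear(240 * 424 * 3, 16))   # stand-in for B-MHAC
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.9, patience=10)    # reduce LR by 10% after 10 flat epochs
criterion = nn.CrossEntropyLoss()

# Dummy batch standing in for the triplet features; real training loops over the dataset.
x, y = torch.randn(4, 3, 240, 424), torch.randint(0, 16, (4,))
for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(classifier(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())
```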