 Research
 Open access
SKZC: self-distillation and k-nearest neighbor-based zero-shot classification
Journal of Engineering and Applied Science volume 71, Article number: 97 (2024)
Abstract
Zero-shot learning represents a formidable paradigm in machine learning, wherein the crux lies in distilling and generalizing knowledge from observed classes to novel ones. The objective is to identify unfamiliar objects that were not included in the model’s training, leveraging learned patterns and knowledge from previously encountered categories. As a crucial subtask of open-world object detection, zero-shot classification can also provide insights and solutions for this field. Despite its potential, current zero-shot classification models often suffer from a performance gap due to the limited transfer ability and discriminative capability of learned representations. To advance the subpar state of zero-shot object classification, this paper introduces a novel model for image classification that can also be applied to object detection, namely, the self-distillation and k-nearest neighbor-based zero-shot classification method. First, we employ a diffusion detector to identify potential objects in images. Then, a self-distillation and distance-based classifier is used to distinguish unseen objects from seen classes. A k-nearest neighbor-based cluster head is designed to cluster the unseen objects. Extensive experiments and visualizations were conducted on publicly available datasets to demonstrate the efficacy of the proposed approach. Precisely, our model demonstrates a performance improvement of over 20% compared to contrastive clustering. Moreover, it achieves a macro-average precision of 0.910 and recall of 0.842 on the CIFAR-10 dataset, and a precision of 0.737 and recall of 0.688 on the CIFAR-100 dataset. Compared to a more recent model (SGFR), our model realized improvements of 10.9%, 13.3%, and 7.8% in the Sacc, Uacc, and H metrics, respectively. This study aims to introduce fresh ideas into the domain of zero-shot image classification, and it can be applied to open-world object detection tasks.
Our code is available at https://www.github.com/CmosWolf1/Code_implementation_for_paper_SKZC.
Introduction
As a crucial task in computer vision, image classification [1] involves assigning predefined labels or categories to input data based on their characteristics or features. It is also an important subtask within the field of object detection, and there is no doubt that enhancements in the performance of classification models can also improve the classification performance of object detection models. Classification tasks depend on the availability of a large volume of tagged data [2]. Owing to advances in deep learning techniques [3,4,5], most image classification methods used in computer vision are supervised learning methods, depending on extensive volumes of tagged data for training. However, existing datasets are unable to encompass all possible classes, and human society’s evolution continually gives rise to fresh categories [6]. This leads these supervised classification methods to perform unsatisfactorily when some categories have scarce or even no tagged data [7].
Zero-shot classification, also known as zero-shot learning (ZSL) [8, 9] or zero-shot recognition, has been proposed to address this lack of data, enabling the recognition of objects belonging to unseen categories. It is a subfield of machine learning that aims to classify objects or instances into classes unseen during training by transferring knowledge from related classes for which labeled data is available.
Traditional zero-shot classification can be divided into three main approaches. The first approach utilizes pretrained word embedding vectors to represent and understand the relationships among different categories. For instance, DeViSE [10] utilizes a pretrained convolutional neural network (CNN) to project image features and the word embeddings of labels into a shared space. ConSE [11], on the other hand, merges the k highest-probability image embeddings. The second approach directly incorporates the relationships between classes using either a graph convolutional network (GCN) or a predefined class hierarchy like WordNet [3]. GCNZ [12] and DGPZ [13] employ GCNs to propagate knowledge between seen and unseen classes while incorporating CNNs and word embeddings. An alternative method, HZSL [14], projects both image and text embeddings into a hyperbolic space that organizes child and parent classes within the hierarchical structure of WordNet [3]. Lastly, some approaches, such as [15,16,17], depend on human-tagged attributes to model class semantics. These methods treat attribute annotations as informative cues for understanding the characteristics and distinguishing features of various classes. Apart from CNN-based methods, vision transformers (ViT) [18] have surfaced as a substitute for convolutional neural networks in the field of visual recognition [18,19,20]. The emergence of self-distillation [21] has provided new solutions for zero-shot learning. Self-knowledge distillation [21] seeks to train a student model by emulating the learning patterns of an already-trained teacher model, which is a pretrained ViT model. Many zero-shot learning methods, such as [22, 23], utilize self-distillation models to acquire features for unseen categories.
However, these prior approaches suffer from several limitations. First, their focus lies primarily on improving the correspondence between image features extracted from pretrained CNNs and pretrained word embedding models like GloVe [24]. Second, they employ predefined class hierarchies, such as WordNet [3], which confine category modeling to a tree structure, thereby failing to capture the complex inter-class relationships observed among real-world objects; relying solely on class hierarchies also restricts the classification scope to only those categories included in the hierarchy. Lastly, attribute-based methods cannot generalize to categories lacking seen attributes, which limits their applicability.
Based on the aforementioned observations, we introduce a novel model for zero-shot classification problems, namely, self-distillation and k-nearest neighbor-based zero-shot classification. When unseen categories are underrepresented or completely absent in datasets, and lack clear semantic relationships with other seen classes, conventional zero-shot image classification algorithms often struggle to achieve satisfactory classification performance. In contrast, our model effectively addresses this issue. Firstly, we use a diffusion detector [25] to detect potential objects in the image. Secondly, we design a self-distillation and distance-based classifier (SDDC) to classify seen and unseen objects. Lastly, we propose a k-nearest neighbor-based cluster head (KCH) to cluster the different kinds of unseen objects. As shown in Fig. 1, the clustering process is performed by KCH on several unseen objects in a given embedding space. Extensive experiments have demonstrated the efficacy of our model. We conducted tests on four datasets: CIFAR-100, CIFAR-10, ImageNet-10, and STL-10 [26,27,28]. In clustering performance, we achieved varying degrees of improvement over the contrastive clustering [29] method. Moreover, we achieve a macro-average precision of 0.910 and recall of 0.842 on the CIFAR-10 dataset, and a precision of 0.737 and recall of 0.688 on the CIFAR-100 dataset. Compared to a more recent model (SGFR), our model realized improvements of 10.9%, 13.3%, and 7.8% in the Sacc, Uacc, and H metrics, respectively.
Our main contributions are as follows:

(1)
For the first time, we have applied a diffusion model to the detection of seen and unseen objects. This implies that the methods in our model can be applied not only to classification tasks but can also provide solutions and insights for detection tasks, particularly open-world object detection [6, 30] (OWOD) tasks.

(2)
We propose the self-distillation and distance-based classifier (SDDC) and the k-nearest neighbor-based cluster head (KCH) to classify seen and unseen objects.

(3)
Our model is capable of lifelong learning, meaning it can continue to learn without human intervention once it is initialized.
Related work
Generative-based ZSL methods
In the domain of zero-shot image classification, leveraging generative adversarial networks (GANs), which are capable of synthesizing highly authentic imagery, has emerged as a novel and promising approach [31, 32]. These advanced GAN variants enable the generation of visual feature representations for unseen categories by utilizing the known visual data from seen classes coupled with semantic attributes of the target unseen classes. Xian et al. [33] devised an enhanced model incorporating the Wasserstein GAN (WGAN) [34], integrating the WGAN’s loss function with a classification loss to not only ensure the discriminative nature of the synthetically produced features but also to bolster the stability of the training regimen. Subsequently, numerous researchers have refined the WGAN framework, aiming to address challenges associated with generated samples’ quality, diversity, and semantic relevance [30, 35, 36]. Vyas et al. [37] introduced the leveraging semantic relationship GAN (LsrGAN), which utilizes a semantic-regularized loss component to facilitate knowledge transfer between classes. To counteract issues related to training instability, certain studies have adopted variational autoencoders (VAEs), known for their robust training characteristics, in zero-shot learning tasks [38,39,40]. Other research efforts have focused on developing a joint embedding space through VAEs for multimodal data integration [41, 42], effectively narrowing the divide between the visual and semantic spectra.
Embedding-based ZSL methods
Embedding-based approaches are designed to create a shared embedding space for images and their corresponding semantic attributes. These approaches can be categorized into three distinct types. The first category concentrates on learning a mapping from the visual space to the semantic space [43,44,45], which encounters issues such as projection domain shift and the hubness phenomenon. To mitigate these issues, the second type of approach inverts this direction by mapping the semantic information onto the visual domain [46, 47]. The third category aims to reconcile the disparities between visual and semantic domains by jointly mapping both visual and semantic features into an intermediary shared space [48, 49]. This common space is calibrated using bidirectionally aligned knowledge from both visual and semantic representations, addressing the limitations associated with direct mappings and the transfer of model parameters. Despite these improvements, embedding-based techniques continue to grapple with challenges such as semantic information loss and a deficiency in representing unseen class features, leading to a prediction bias towards classes observed during training [50].
Methods
Problem definition
Assume that the set of categories to which all objects in an open world belong is S^{t} = {1, 2, 3, ..., C} ⊂ N^{+}, where N^{+} denotes the set of positive integers and C is the number of all classes in the open world. The seen and unseen categories are denoted K^{t} and U^{t}, respectively. Let the collection of embedding vector sets be denoted F^{t}.
It is evident that K^{t}, U^{t} ⊆ S^{t}, and both K^{t} and U^{t} are empty at the onset of the task. Moreover, the seen and unseen objects come from the detector. These objects are then added to set K^{t} or set U^{t} according to the result of a classifier. Subsequently, we need to cluster the unseen categories. It is worth noting that the number of vector clusters in the embedding sets will continue to increase as the task progresses. Therefore, due to limitations of computational power and cost, we need to distribute these unseen categories across several embedding sets before clustering (further particulars will be elaborated in the subsequent subsections). These embedding sets together form the collection F^{t}.
Overall architecture
Figure 2 presents the comprehensive structure of our proposed model for zeroshot image classification. We have incorporated a detector into our model for classification tasks and continuously update it to enhance its performance in realworld classification tasks. Additionally, cropping the images detected by the detector allows our model to iterate by itself at a fast pace.
Firstly, we use the diffusion model detector [25] as the base detector. Then, we crop the images detected by the detector according to the box predictor. These cropped images are fed into the self-distillation and distance-based classifier (SDDC) to differentiate between categories that have been previously encountered and those that have not. After that, unseen categories are sent to the k-nearest neighbor-based cluster head (KCH) for clustering, while seen classes are added to the existing seen clusters. Lastly, we update the box predictor module so that the detector can recognize the newly added classes. Additionally, we integrate the already clustered unseen clusters into the collection of embedding vector sets to accomplish the transformation from unseen classes to seen classes. As time progresses, the number of seen clusters increases, allowing the model to recognize an ever-growing number of classes.
Selfdistillation and distancebased classifier
Due to the limited capability of backbone network models such as ResNet [51] and Swin-Base [52] in effectively extracting foreground features from images, we employ a self-distillation learning model to extract foreground features.
The architecture of the self-distillation learning model is shown in Fig. 3. The model is demonstrated using a single pair of views (x_{1}, x_{2}) for simplicity and clarity. It applies two distinct random transformations to an input image and provides them as inputs to both the student and teacher networks. Although these networks have identical structures, their parameters differ. The teacher network generates K-dimensional feature vectors that are normalized using a temperature softmax function. These feature vectors are then compared using a cross-entropy loss to measure their similarity [53]. The teacher network’s output is normalized by calculating the mean over the batch. The student network [53] ɡ_{θs} is a neural network model that learns to perform a task by trying to mimic or replicate the behavior of the teacher network [53] ɡ_{θt}. During the training phase, the student network is updated using standard backpropagation, where gradients are calculated based on the difference between the student’s predictions and the teacher’s outputs. The goal is for the student network to learn representations good enough to match those produced by the teacher. For an input image x, the student and teacher networks each produce a set of probabilities across M categories, indicated as P_{s} for the student and P_{t} for the teacher. The probabilities P_{s}(x) are the result of applying a softmax function to normalize the outputs of the network ɡ_{θs}(x). More precisely:
\(P_s(x)^{(i)} = \frac{\exp\left(g_{\theta_s}(x)^{(i)}/\tau_s\right)}{\sum_{k=1}^{M}\exp\left(g_{\theta_s}(x)^{(k)}/\tau_s\right)}\)
where τ_{s} > 0 is a temperature parameter utilized to regulate the sharpness of the output distribution; an analogous expression governs P_{t} when modulated by the temperature τ_{t}.
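To make the temperature’s role concrete, here is a minimal NumPy sketch of the sharpened softmax; the specific logits and temperature values are illustrative, not the paper’s settings:

```python
import numpy as np

def temperature_softmax(logits, tau):
    """Softmax with temperature tau; smaller tau yields a sharper distribution."""
    z = logits / tau
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
p_student = temperature_softmax(logits, tau=0.1)    # illustrative tau_s
p_teacher = temperature_softmax(logits, tau=0.04)   # illustrative, sharper tau_t
```

Lowering τ concentrates more probability mass on the largest logit, which is why the teacher’s sharper output acts as a confident target for the student.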
In our classifier, we use the student network to extract feature vectors of objects. The student network has been trained on the ImageNet-200 dataset [54]. We calculate the Euclidean distance d_{E} between these feature vectors f_{1n} and the center vector f_{2n} of each cluster within every embedding vector set as follows:
\(d_E = \sqrt{\sum_{n=1}^{N}\left(f_{1n} - f_{2n}\right)^2}\)
where f_{1n} = (f_{11}, f_{12}, f_{13},..., f_{1N}) and f_{2n} = (f_{21}, f_{22}, f_{23},..., f_{2N}), N ∈ N^{+}, are both N-dimensional feature vectors. The cluster radii R_{i} in an embedding vector set E are formulated as follows:
where S is the number of vectors in a seen cluster, V_{ij} is a feature vector in a seen cluster, and \(\alpha\) is a parameter that determines the size of a cluster’s radius. Regarding the parameter \(\alpha\), we delve into the specifics in Section "The calculation of cluster radius".
After that, for an input feature vector, we compute its distance d_{E} to every cluster centroid in each embedding vector set of the collection. Then, we use whether d_{E} is less than the cluster radius R_{i} as the criterion to determine if the object belongs to a seen category i or to an unseen class.
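The seen/unseen decision rule can be sketched as follows; `classify_by_distance` is an illustrative helper with toy centroids and radii, not the authors’ code:

```python
import numpy as np

def classify_by_distance(feature, centroids, radii):
    """Return the index of the nearest seen cluster if the feature falls
    within that cluster's radius; otherwise None (treated as unseen)."""
    dists = np.linalg.norm(centroids - feature, axis=1)  # Euclidean d_E
    i = int(np.argmin(dists))
    return i if dists[i] <= radii[i] else None

centroids = np.array([[0.0, 0.0], [5.0, 5.0]])   # two seen cluster centers
radii = np.array([1.0, 1.0])

near = classify_by_distance(np.array([0.2, 0.1]), centroids, radii)   # inside cluster 0
far = classify_by_distance(np.array([2.5, 2.5]), centroids, radii)    # outside all radii
```

A vector inside some radius is assigned to that seen cluster; a vector outside every radius is routed to the cluster head as unseen.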
Knearest neighborbased cluster head
Enabling the model to cluster unseen classes provides it with the ability to differentiate among diverse unseen classes. We present a k-nearest neighbor-based cluster head to cluster these unseen classes. Algorithm 1 provides an overview of how the k-nearest neighbor-based cluster head clusters them.
The search space parameter is defined as n_neighbors, which means that we search for the optimal value of n_neighbors within a range from 1 to 20 (excluding 20). The purpose of this is to experiment with different values of k (i.e., the number of nearest neighbors) and find the best value to construct the KNN model. Then, the cluster labels are assigned based on the indices of the nearest neighbors. After the prediction is completed, each unseen vector has a label ID pointing to a specific cluster. Next, these unseen clusters are divided to ensure that there are only seven clusters in each embedding vector set (we explain in detail why only seven clusters are retained in an embedding vector set in Section "Number of clusters in each embedding set"). Then, we integrate the new embedding vector set containing the unseen clusters into the collection of embedding vector sets to complete the update of the seen categories. Simultaneously, we update the box predictor so that the detector can detect the newly added seen categories.
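The text does not spell out the criterion used to score each candidate n_neighbors, so the sketch below is one plausible reading, not the authors’ exact algorithm: link each point to its k nearest neighbors, take connected components of that graph as clusters, and keep the k in [1, 20) whose clustering scores best under a silhouette-style measure:

```python
import numpy as np

def knn_graph_clusters(X, k):
    """Cluster by linking each point to its k nearest neighbors and taking
    connected components of the resulting graph (via union-find)."""
    n = len(X)
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    for i in range(n):
        for j in np.argsort(dists[i])[1:k + 1]:   # skip index 0: the point itself
            parent[find(i)] = find(int(j))
    roots = [find(i) for i in range(n)]
    _, labels = np.unique(roots, return_inverse=True)
    return labels

def simple_silhouette(X, labels):
    """Mean silhouette coefficient, used here to score candidate values of k."""
    n, score = len(X), 0.0
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            continue                       # singleton cluster contributes 0
        a = d[i][same].mean()
        b = min(d[i][labels == c].mean() for c in set(labels) if c != labels[i])
        score += (b - a) / max(a, b)
    return score / n

def best_k_clustering(X, k_range=range(1, 20)):
    """Search n_neighbors in [1, 20) and keep the best-scoring clustering."""
    best_score, best_labels = -np.inf, None
    for k in k_range:
        labels = knn_graph_clusters(X, k)
        if len(set(labels)) < 2:           # silhouette needs >= 2 clusters
            continue
        s = simple_silhouette(X, labels)
        if s > best_score:
            best_score, best_labels = s, labels
    return best_labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
labels = best_k_clustering(X)              # two well-separated blobs
```

On the two synthetic blobs, the search recovers two clusters that match the generating groups.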
Training
Diffusion detector
The L_{2} loss function [55] used by the diffusion model can be formulated as follows:
\(\mathcal{L}_{train} = \frac{1}{2}\left\| f_{\theta}(z_t, t) - z_0 \right\|^2\)
where t ∈ {0, 1, ..., T} and the neural network \({f}_{\theta }({z}_{t},t)\) is trained to predict z_{0} from z_{t} by minimizing this L_{2} training objective.
To establish a robust foundation for our object detection framework, we incorporated a pretrained diffusion model [25] that has been extensively trained on the MS-COCO [56] dataset. We specifically employed the weights of a model based on the ResNet-50 [51] architecture, which has demonstrated remarkable performance in object detection tasks due to its deep residual learning capabilities. It is noteworthy that the original implementation of the diffusion model involved a lengthy process with 500 sampling steps, which contributed to precise but computationally intensive inference. Considering the real-time requirements of our zero-shot classification task, we optimized the inference pipeline by reducing the number of sampling steps from 500 to 300. This strategic adjustment enabled us to substantially accelerate the inference speed of our diffusion-based detector while maintaining an acceptable trade-off between accuracy and real-time performance.
Selfdistillation model
In order to align the output distributions, the cross-entropy loss with respect to the parameters \(\theta\)_{s} of the student network is minimized as follows:
\(\min_{\theta_s} H\left(P_t(x), P_s(x)\right)\)
where \(H(a,b) = -a\log b\).
In the following, a description is provided of how the problem in Eq. (5) is adapted for self-supervised learning. The initial step involves generating various distorted views or crops of an image using a multi-crop strategy [57]. Specifically, a set V of different views is created from a given image. To capture both global and local information, our model incorporates two global views (x_{1g} and x_{2g}) and multiple local views of smaller resolution. While all crops are processed by the student model, only the global views are utilized by the teacher model. This process promotes “local-to-global” correspondences [53]. The following loss function is then minimized:
\(\min_{\theta_s} \sum_{x \in \{x_{1g},\, x_{2g}\}} \; \sum_{x' \in V,\, x' \neq x} H\left(P_t(x), P_s(x')\right)\)
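A minimal sketch of this multi-crop objective, assuming (as in DINO-style training) that the two global views are the first entries of the view list; the logits and temperatures are illustrative:

```python
import numpy as np

def softmax(z, tau):
    """Temperature softmax over the last axis."""
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p_t, p_s, eps=1e-12):
    """H(a, b) = -sum a log b, with the teacher distribution as the target."""
    return -np.sum(p_t * np.log(p_s + eps))

def multicrop_loss(global_logits, all_logits, tau_t=0.04, tau_s=0.1):
    """Average H(P_t(x), P_s(x')) over global views x (teacher side) and all
    other views x' (student side). Assumes the global views are the first
    entries of all_logits, so index gi identifies the view to skip."""
    loss, n = 0.0, 0
    for gi, gt in enumerate(global_logits):
        p_t = softmax(gt, tau_t)
        for si, st in enumerate(all_logits):
            if si == gi:                   # skip the x' == x pairing
                continue
            loss += cross_entropy(p_t, softmax(st, tau_s))
            n += 1
    return loss / n

rng = np.random.default_rng(0)
views = [rng.normal(size=5) for _ in range(6)]   # 2 global + 4 local crops
loss = multicrop_loss(views[:2], views)
```

Only the two global views enter on the teacher side, while every crop (global and local) enters on the student side, which is what drives the local-to-global correspondence.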
We use a vision transformer (ViT) [18] as the backbone of the self-distillation and distance-based classifier. We employed four distinct model configurations with varying sizes and resolutions (ViT-S/16, ViT-S/8, ViT-B/16, and ViT-B/8) [53] to thoroughly investigate their feature extraction efficacy.
Experiment
Preparation
Datasets
We evaluate our model on the set of tasks T = {T_{1}, T_{2}}. Among them, task 1 is the clustering performance testing task. As shown in Table 1, for task 1, we use 10 classes each from STL-10 [27], ImageNet-10 [28], and CIFAR-10 [26], and 100 classes from the CIFAR-100 [26] dataset. For task 2, we use CIFAR-10, CIFAR-100, and CUB [58]. Furthermore, we use pretrained self-distillation models with two different resolutions and two different model sizes, resulting in four types of models. Therefore, in task 1, we evaluate the performance and practicality of each method and model through thorough evaluation.
Evaluation metrics
In task 1, to assess our approach, we employ three commonly recognized metrics for clustering evaluation: normalized mutual information (NMI), accuracy (ACC), and adjusted Rand index (ARI).
The NMI is a metric that remains consistent regardless of the dataset’s size. It effectively measures the extent of information overlap between the true labels and the labels assigned through clustering, indicating the quality of the clustering. This can be formulated as follows:
where U and V are two sets of clusters, the shared information content of U and V is quantified by I(U; V) which is the mutual information, while H(U) and H(V) represent the individual uncertainties of U and V.
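NMI can be computed directly from these quantities; the sketch below uses the arithmetic-mean normalization, which is one of several common conventions (others divide by the square root or the maximum of the two entropies):

```python
import numpy as np

def entropy(labels):
    """H(U): Shannon entropy of a label assignment (natural log)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mutual_info(u, v):
    """I(U; V): shared information between two label assignments."""
    u, v = np.asarray(u), np.asarray(v)
    mi = 0.0
    for cu in np.unique(u):
        for cv in np.unique(v):
            p_uv = np.mean((u == cu) & (v == cv))
            if p_uv > 0:
                p_u, p_v = np.mean(u == cu), np.mean(v == cv)
                mi += p_uv * np.log(p_uv / (p_u * p_v))
    return mi

def nmi(u, v):
    """I(U;V) normalized by the mean of H(U) and H(V)."""
    return mutual_info(u, v) / ((entropy(u) + entropy(v)) / 2)

true_labels = [0, 0, 1, 1, 2, 2]
cluster_ids = [1, 1, 0, 0, 2, 2]   # same partition under renamed labels
score = nmi(true_labels, cluster_ids)   # identical partitions -> 1.0
```

Because NMI depends only on the partitions, renaming the cluster IDs leaves the score unchanged.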
Accuracy (ACC) measures the proportion of correctly clustered instances by comparing the cluster assignments with the ground truth labels, reflecting the clustering correctness. This can be formulated as follows:
\(\mathrm{ACC} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left\{ l_i = m(c_i) \right\}\)
where n is the number of samples, c_{i} is the cluster assignment for sample i, l_{i} is the true label for sample i, m is the mapping function from clusters to true labels, and \(\mathbb{1}\) is the indicator function.
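Computing ACC requires finding the best mapping m. A small brute-force sketch is shown below; for many clusters, a Hungarian-algorithm solver would be used instead of enumerating permutations:

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(true_labels, cluster_ids):
    """ACC with the mapping m chosen as the best one-to-one assignment of
    cluster IDs to true labels (brute force; fine for small label counts)."""
    t, c = np.asarray(true_labels), np.asarray(cluster_ids)
    k = max(t.max(), c.max()) + 1
    best = 0.0
    for perm in permutations(range(k)):
        m = np.asarray(perm)               # m maps cluster id -> candidate label
        best = max(best, float(np.mean(m[c] == t)))
    return best

acc_perfect = clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0])   # -> 1.0
acc_partial = clustering_accuracy([0, 0, 1, 1], [0, 1, 0, 0])   # -> 0.75
```

As with NMI, a perfect partition scores 1.0 even when the cluster IDs are permuted relative to the true labels.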
The adjusted Rand index (ARI) corrects the similarity between the true clustering and the predicted clustering for chance, yielding a value that can be compared across different datasets. It can be expressed as follows:
\(\mathrm{ARI} = \frac{\mathrm{RI} - \mathbb{E}[\mathrm{RI}]}{\max(\mathrm{RI}) - \mathbb{E}[\mathrm{RI}]}\)
where RI is the Rand index, which is calculated as follows:
\(\mathrm{RI} = \frac{TP + TN}{TP + TN + FP + FN}\)
In this context, TP is the count of true positive pairs, TN the count of true negative pairs, FP the count of falsely identified positive pairs, and FN the count of falsely identified negative pairs. The expected RI depends on the marginal totals of the contingency table (or confusion matrix) of the cluster assignment.
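In this pair-count formulation, the index being adjusted is TP, the number of same-true-class pairs is TP + FN, and the number of same-predicted-cluster pairs is TP + FP. A short sketch:

```python
def rand_index(tp, tn, fp, fn):
    """RI: fraction of point pairs on which the two partitions agree."""
    return (tp + tn) / (tp + tn + fp + fn)

def adjusted_rand_index(tp, tn, fp, fn):
    """ARI = (index - E[index]) / (max(index) - E[index]) with index = TP,
    a = TP + FN same-true pairs, b = TP + FP same-predicted pairs."""
    total = tp + tn + fp + fn
    a, b = tp + fn, tp + fp
    expected = a * b / total
    return (tp - expected) / ((a + b) / 2 - expected)

# labels [0,0,1,1] vs [0,0,1,1]: TP=2, TN=4, FP=FN=0 -> RI = 1.0, ARI = 1.0
# labels [0,0,1,1] vs [0,1,0,1]: TP=0, TN=2, FP=FN=2 -> RI = 1/3, ARI = -0.5
```

The second case shows why the chance correction matters: RI stays at 1/3 even though the partitions disagree on every same-class pair, whereas ARI goes negative.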
In task 2, we use three evaluation metrics: precision, recall, and F1-score to assess model performance.
Among them, precision measures the accuracy of a classification model: the proportion of true positive predictions among all predicted positives. It is computed by dividing the number of true positives by the total number of instances classified as positive, which includes both true positives and false positives. High precision indicates that the algorithm returned substantially more relevant results than irrelevant ones. Precision can be formulated as follows:
\(\mathrm{Precision} = \frac{TP}{TP + FP}\)
Recall measures the ability of a model to find all relevant cases within a dataset: the fraction of actual positives correctly identified by the classifier out of all actual positives. High recall indicates that the class is correctly recognized to a large extent. Recall (sensitivity) can be presented as follows:
\(\mathrm{Recall} = \frac{TP}{TP + FN}\)
The F1-score combines precision and recall, balancing the trade-off between false positives and false negatives: it is twice the product of precision and recall divided by their sum, providing a single score that weighs both finding all relevant instances (recall) and returning only relevant instances (precision). The F1-score reaches its best value at 1 (perfect precision and recall) and its worst at 0. It can be demonstrated as follows:
\(F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}\)
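The three metrics in code form, with illustrative counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 8 true positives, 2 false positives, 2 false negatives:
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)   # p = 0.8, r = 0.8, f1 = 0.8
```

Because F1 is the harmonic mean, it is dragged down by whichever of precision or recall is lower, penalizing unbalanced classifiers.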
Implementation details
The detector of our model is based on the diffusion detector [25] with ResNet-50 [51] and Swin-Base [52] backbones. We use the detector to detect both seen and unseen objects. More precisely, we employ a diffusion model with the ResNet-50 [51] architecture as the backbone network to extract objects from images. Additionally, this diffusion model has been pretrained on the MS-COCO [56] dataset.
In task 1 and task 2, we tested four self-distillation models, whose parameter counts and resolutions are shown in Table 2.
It is worth noting that more model parameters and smaller patch sizes (i.e., higher effective resolution) indicate better model performance. Furthermore, the self-distillation models we use have been pretrained on the ImageNet dataset [54].
For the hyperparameters, we set the value of \(\alpha\) to 0.75 and set the number of clusters in each embedding set at 7.
Results and discussion
Clustering performance
The quality of clustering directly influences the outcome of the entire classification task; therefore, the model’s ability to effectively cluster data is of crucial significance. The clustering performance of our model is shown in Table 3, where we tested the ViT-B/8 self-distillation model on the STL-10, ImageNet-10, CIFAR-10, and CIFAR-100 datasets. Apart from the contrastive clustering algorithm, all the algorithms tested in our study employed feature vectors extracted by a self-distillation model for clustering. It is evident that, compared to the contrastive clustering algorithm, the traditional clustering algorithms also achieved promising performance. This indicates the effectiveness of self-distillation models.
In Table 4, we conducted tests using the CIFAR-10 and CIFAR-100 datasets and concluded that the ViT-B/8 model performs best. It can be clearly seen that models possessing more parameters and higher resolution typically demonstrate improved performance. Therefore, owing to the substantial number of model parameters and higher resolution afforded by ViT-B/8, it exhibits the best performance. Besides, considering the requirement for real-time classification, we are willing to sacrifice some model performance to enhance the model’s inference speed.
The clustering visualization results of the ViT-B/8 model on STL-10, ImageNet-10, CIFAR-10, and CIFAR-100 are shown in Fig. 4.
The clustering visualization results for the ViT-B/8, ViT-B/16, ViT-S/8, and ViT-S/16 models on the CIFAR-10 and CIFAR-100 datasets are shown in Fig. 5.
Classification performance
Based on the results shown in Table 5, we tested the four self-distillation models on the CIFAR-10 dataset. It is easy to see from the table that, across the CIFAR-10 dataset’s ten categories, the base-sized model exhibits the best performance, and the model with a patch size of 8 achieves the highest precision and recall scores in 70% of the categories, as well as the highest F1-scores in 80% of the categories. Therefore, we can conclude that the ViT-B/8 model, with more parameters and higher resolution, performs best.
Similar to the results in Table 5, the test results on the CIFAR-100 dataset show that the ViT-B/8 model achieved the highest scores across all metrics, indicating that it has the best performance. The results are shown in Table 6.
Model size
The base model (B) has a greater number of parameters; thus, it is more likely to capture complex image characteristics, which generally leads to better generalization ability and inference accuracy when training data is abundant. The small model (S) has fewer parameters and a lower computational cost, making it potentially more suitable for resource-constrained environments or latency-sensitive scenarios, and its simplicity might prevent overfitting, especially when training data is not extensively available.
Patch size
Models with a smaller patch size (e.g., ViT-B/8) generate longer sequences and therefore have the capacity to capture finer-grained image information. This can aid in learning more complex image patterns, potentially leading to improved accuracy. However, longer sequences also mean higher computational costs and increased memory demands. In contrast, a larger patch size (e.g., ViT-B/16) reduces the sequence length, lowering computational complexity but potentially at the cost of some detailed information.
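The sequence-length difference is easy to quantify; assuming the common 224 × 224 input resolution:

```python
def num_tokens(image_size, patch_size):
    """A ViT splits an image into (image_size / patch_size)^2 patch tokens."""
    return (image_size // patch_size) ** 2

tokens_8 = num_tokens(224, 8)     # ViT-B/8  -> 784 patch tokens
tokens_16 = num_tokens(224, 16)   # ViT-B/16 -> 196 patch tokens
```

Since self-attention cost grows quadratically with sequence length, the /8 model pays roughly (784/196)² = 16× more attention compute than the /16 model for the extra detail it captures.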
Subsequently, we designed an experiment to compare the performance gap between our model and other recent zero-shot image classification models. As shown in Table 7, our model (based on ViT-B/16) achieved the best overall performance. Compared to SGFR [59], our model demonstrated improvements of 10.9%, 13.3%, and 7.8% on the Sacc, Uacc, and H metrics, respectively. We attribute these enhancements to the methodological design and to the choice of the number of clusters in each embedding set and the value of \(\alpha\).
Ablation study
We designed ablation experiments to study the contributions of SDDC and KCH in the model (see Table 8). A missing SDDC module means we replace the self-distillation network with a standard ResNet backbone, and a missing KCH means using the k-means clustering algorithm in place of the KCH module.
When both SDDC and KCH are missing (row 4), the model performs worst. Adding only SDDC (row 1) improves the model’s clustering ability, and with high-quality clustering, the model is likely to demonstrate enhanced classification performance. Adding only KCH (row 2) directly improves the model’s classification ability. When both SDDC and KCH are added, the model performs best. Therefore, the presence or absence of each of SDDC and KCH affects the performance of the model, and the optimal performance is obtained when both components are present and working together.
The calculation of cluster radius
In our experiments, we found that a category’s cluster may contain several points that are far from the cluster center. If we simply used the Euclidean distance from the furthest point to the cluster center as the cluster radius, a large number of unseen objects could be misjudged. Moreover, the choice of the cluster radius affects both the accuracy and the false positive rate of unseen object identification. Therefore, in order to calculate the optimal cluster radius, we designed an experiment.
In the experiment, we defined the distance from each point in a cluster to the cluster center as d_{i} and placed these distances into a set D. Afterward, we select the smallest \(\alpha\) fraction of values from set D and discard the remaining points that fail to meet this condition. Finally, we set the maximum remaining value in D as the cluster radius, letting \(\alpha\) increase from 0 to 1 in increments of 0.01 (evidently, \(\alpha\) is positively correlated with the cluster radius). We then plotted the accuracy rate of unseen identification as a function of \(\alpha\), as shown in Fig. 7, as well as the harmonic mean of our model on the CUB dataset as a function of \(\alpha\), as shown in Fig. 6. The maximum value is reached within the interval from 0.6 to 0.8, strictly speaking at 0.66. However, considering that the optimal value of \(\alpha\) might differ across datasets, and with a view to generality, we set \(\alpha\) to 0.75. The misjudgment rate of unseen objects is negatively correlated with the accuracy rate.
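The radius computation described above can be sketched as follows; the toy vectors are illustrative:

```python
import numpy as np

def cluster_radius(vectors, center, alpha=0.75):
    """Radius of a seen cluster: keep the smallest alpha fraction of the
    member-to-center distances d_i and take the largest kept value."""
    d = np.sort(np.linalg.norm(np.asarray(vectors, dtype=float) - center, axis=1))
    keep = max(1, int(np.ceil(alpha * len(d))))   # smallest alpha-fraction of D
    return d[:keep].max()

vectors = [[0, 0], [1, 0], [0, 2], [0, 10]]       # one far-away outlier
center = np.zeros(2)
r = cluster_radius(vectors, center, alpha=0.75)   # outlier discarded -> r = 2.0
```

With the full maximum (alpha = 1) the outlier would inflate the radius to 10 and swallow genuinely unseen objects; trimming the top quarter of distances keeps the radius at 2.0.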
Number of clusters in each embedding set
As the task progresses, more and more feature vectors inevitably accumulate. Performing all clustering operations in a single embedding set is impractical, as it would consume considerable time and computational power. Since these feature vectors follow the same distribution, we distribute the clusters across multiple embedding sets and perform clustering within each, rather than in a single set. Note, however, that if an embedding set contains too few clusters, the distances between an input feature vector and the different cluster centers may differ too little, leading to incorrect judgments of input feature vectors. We therefore designed an experiment to explore the optimal number of clusters per embedding set. Figure 8 shows the clustering metrics NMI, ARI, and ACC as a function of the number of clusters in the embedding set.
As shown in Fig. 8, once the number of clusters in the same embedding set exceeds 7, all three indicators NMI, ARI, and ACC drop sharply, indicating rapidly deteriorating clustering performance within that embedding set. Declining clustering quality makes the model prone to misclassifying unseen classes as seen ones (the vector distribution within each cluster becomes disorganized, yielding an excessively large radius). In short, too few clusters may cause seen categories to be misidentified as unseen, while an excessively high number of clusters causes feature vectors to become overly concentrated within the embedding set, so that unseen classes are erroneously identified as seen. We therefore fix the number of clusters in each embedding set at 7. Furthermore, Fig. 9 visualizes how the number of clusters in the embedding set grows over time.
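The resulting decision rule can be sketched as below: an input vector is assigned to the nearest cluster whose radius covers it, searching over all embedding sets, and is flagged as unseen otherwise (a simplified illustration; the names and the flat `(center, radius, label)` representation are hypothetical):

```python
import math

MAX_CLUSTERS_PER_SET = 7  # empirical cap per embedding set (Fig. 8)

def assign(vector, embedding_sets):
    """embedding_sets: list of embedding sets, each a list of
    (center, radius, label) tuples. Returns the label of the nearest
    cluster that covers the vector, or 'unseen' if none does."""
    best_label, best_dist = "unseen", math.inf
    for clusters in embedding_sets:
        for center, radius, label in clusters:
            d = math.dist(vector, center)
            if d <= radius and d < best_dist:
                best_label, best_dist = label, d
    return best_label

sets = [[((0, 0), 1.0, "cat"), ((5, 0), 1.0, "dog")]]
print(assign((0.5, 0), sets))  # inside the 'cat' cluster
print(assign((2.5, 0), sets))  # outside every radius -> unseen
```

Capping each embedding set at 7 clusters keeps the center-to-center distances well separated, which is what the rule above relies on.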
Conclusions
In this work, we propose a novel zero-shot classification model, the self-distillation and k-nearest neighbor-based zero-shot classification model. Our method comprises a k-nearest neighbor-based cluster head (KCH) and a self-distillation and distance-based classifier (SDDC). Extensive experiments demonstrate the effectiveness of our model on zero-shot classification problems. In clustering performance, our model outperforms the contrastive clustering model across the board on the CIFAR-10, CIFAR-100, ImageNet-10, and STL-10 datasets. In the classification task, we achieved a macro-average precision of 0.910 and recall of 0.842 on CIFAR-10, and a macro-average precision of 0.737 and recall of 0.688 on CIFAR-100.
While our model has shown promising results on certain datasets (e.g., the CIFAR datasets), it still has limitations on others (e.g., the CUB dataset). Real-world objects are incredibly diverse and complex, often exceeding what can be experimentally simulated. Take birds as an example: there are over 9000 known species, each with a distinct appearance. Even humans find it challenging to differentiate closely related bird species with similar features. Consequently, for several closely related categories, our model may perform poorly in classification because the feature vectors of these categories lie too close together in the embedding space. To address this issue, future research will delve deeper into the selection of the number of clusters in each embedding set and the optimization of the \(\alpha\) parameter. By dynamically adjusting these based on the characteristics of the feature vectors within the embedding set, we aim to achieve a more reasonable distribution of feature vectors, ultimately enhancing the model's ability to classify categories with similar features.
In future work, we hope to continue improving our model structure and to apply it to open-world object detection problems. By exploring these uncharted territories, we aim to bridge the gap between academic experimentation and real-world applicability, making our model more robust and versatile in handling diverse and dynamic environments.
Availability of data and materials
Pretrained models of DiffusionDet are from https://github.com/ShoufaChen/DiffusionDet.
Pretrained models of the self-distillation model are from https://github.com/facebookresearch/dino.
Abbreviations
KCH: K-nearest neighbor-based cluster head

SDDC: Self-distillation and distance-based classifier

ViT-B/8: Vision Transformer model, base size, patch size 8

ViT-S/8: Vision Transformer model, small size, patch size 8

ViT-B/16: Vision Transformer model, base size, patch size 16

ViT-S/16: Vision Transformer model, small size, patch size 16
References
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Proceedings of Neural Information Processing Systems (NIPS)
Chang D, Ding Y, Xie J, Bhunia AK, Li X, Ma Z, et al (2020) The devil is in the channels: mutual-channel loss for fine-grained image classification. IEEE Trans Image Process
Feinerer I, Hornik K (2020) wordnet: WordNet Interface. R package version 0.1-15. Available: https://CRAN.R-project.org/package=wordnet
Yang X, Deng C, Wei K, Yan J, Liu W (2020) Adversarial learning for robust deep clustering. In: Proceedings of Neural Information Processing Systems (NeurIPS)
Ju Y, Lam KM, Chen Y, Qi L, Dong J (2020) Pay attention to devils: a photometric stereo network for better details. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI)
Li H, Wang F, Liu J, Huang J, Zhang T, Yang S (2022) Micro-knowledge embedding for zero-shot classification. Comput Electr Eng 101. https://doi.org/10.1016/j.compeleceng.2022.108068
Wang W, Zheng VW, Yu H, Miao C (2019) A survey of zero-shot learning: settings, methods and applications. ACM Trans Intell Syst Technol 10(2):1–19
Lampert CH, Nickisch H, Harmeling S (2009) Learning to detect unseen object classes by between-class attribute transfer. CVPR
Palatucci M, Pomerleau DA, Hinton GE, Mitchell TM (2009) Zero-shot learning with semantic output codes. NIPS
Frome A, Corrado GS, Shlens J, Bengio S, Dean J, Ranzato M'A, Mikolov T (2013) DeViSE: a deep visual-semantic embedding model. NIPS
Norouzi M, Mikolov T, Bengio S, Singer Y, Shlens J, Frome A, et al (2014) Zero-shot learning by convex combination of semantic embeddings. ICLR
Wang X, Ye Y, Gupta A (2018) Zero-shot recognition via semantic embeddings and knowledge graphs. CVPR
Kampffmeyer M, Chen Y, Liang X, Wang H, Zhang Y, Xing EP (2019) Rethinking knowledge graph propagation for zero-shot learning. CVPR
Liu S, Chen J, Pan L, Ngo CW, Chua TS, Jiang YG (2020) Hyperbolic visual embedding learning for zero-shot recognition. CVPR
Romera-Paredes B, Torr PHS (2015) An embarrassingly simple approach to zero-shot learning. ICML
Akata Z, Perronnin F, Harchaoui Z, Schmid C (2013) Label-embedding for attribute-based classification. CVPR
Akata Z, Reed S, Walter D, Lee H, Schiele B (2015) Evaluation of output embeddings for fine-grained image classification. CVPR
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
Zhao H, Jia J, Koltun V (2020) Exploring self-attention for image recognition. CVPR
Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H (2020) Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877
Hinton G, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531
Cheng R, Wu B, Zhang P, Vajda P, Gonzalez JE (2021) Data-efficient language-supervised zero-shot learning with self-distillation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, pp 3113–3118. https://doi.org/10.1109/CVPRW53098.2021.00348
Kong X, et al (2022) En-compactness: self-distillation embedding & contrastive generation for generalized zero-shot learning. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp 9296–9305. https://doi.org/10.1109/CVPR52688.2022.00909
Pennington J, Socher R, Manning CD (2014) GloVe: global vectors for word representation. EMNLP
Chen S, Sun P, Song Y, Luo P (2022) DiffusionDet: diffusion model for object detection. arXiv. https://doi.org/10.48550/arXiv.2211.09788
Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Master's thesis, Dept Comp Sci, Univ Toronto
Chang J, Wang L, Meng G, Xiang S, Pan C (2017) Deep adaptive image clustering. In: Proceedings of the IEEE International Conference on Computer Vision, pp 5879–5887
Coates A, Ng AY, Lee H (2011) An analysis of single-layer networks in unsupervised feature learning. In: Proc 14th Int Conf Artif Intell Statist (AISTATS), pp 215–223
Li Y, Hu P, Liu Z, Peng D, Zhou JT, Peng X (2021) Contrastive clustering. In: 35th AAAI Conference on Artificial Intelligence, pp 8547–8555
Li J, Jin M, Lu K, et al (2019) Leveraging the invariant side of generative zero-shot learning. In: Proc of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7402–7411
Yan C, Chang X, Li Z, et al (2021) ZeroNAS: differentiable generative adversarial networks search for zero-shot learning. IEEE Trans Pattern Anal Mach Intell 2021:1–9
Shermin T, Teng SW, Sohel F, et al (2021) Bidirectional mapping coupled GAN for generalized zero-shot learning. IEEE Trans Image Process 31:721–733
Xian Y, Lorenz T, Schiele B, et al (2018) Feature generating networks for zero-shot learning. In: CVPR, pp 5542–5551
Arjovsky M, Chintala S, Bottou L (2017) Wasserstein GAN. ICML
Felix R, Kumar VBG, Reid I, et al (2018) Multi-modal cycle-consistent generalized zero-shot learning. In: Proceedings of the European Conference on Computer Vision, Munich, pp 21–37
Han Z, Fu Z, Li G, et al (2020) Inference guided feature generation for generalized zero-shot learning. Neurocomputing 430:150–158
Vyas MR, Venkateswara H, Panchanathan S (2020) Leveraging seen and unseen semantic relationships for generative zero-shot learning. In: European Conference on Computer Vision. Springer, Cham, pp 70–86
Chen Z, Huang Z, Li J, et al (2021) Entropy-based uncertainty calibration for generalized zero-shot learning. In: Australasian Database Conference. Springer, Cham, pp 139–151
Schonfeld E, Ebrahimi S, Sinha S, et al (2019) Generalized zero- and few-shot learning via aligned variational autoencoders. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 8247–8255
Verma VK, Arora G, Mishra A, et al (2018) Generalized zero-shot learning via synthesized examples. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, pp 4281–4289
Chen SM, Xie GS, Liu Y, et al (2021) HSVA: hierarchical semantic-visual adaptation for zero-shot learning. In: 35th Conference on Neural Information Processing Systems
Ma P, Hu X (2020) A variational autoencoder with deep embedding model for generalized zero-shot learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 11733–11740
Chen L, Zhang H, Xiao J, Liu W, Chang SF (2018) Zero-shot visual recognition using semantics-preserving adversarial embedding networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1043–1052
Guan J, Lu Z, Xiang T, et al (2020) Zero and few shot learning with semantic feature synthesis and competitive learning. IEEE Trans Pattern Anal Mach Intell 43(7):2510–2523
Pandey A, Mishra A, Verma VK, et al (2020) Stacked adversarial network for zero-shot sketch based image retrieval. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 2529–2538
Das D, George Lee CS (2019) Zero-shot image recognition using relational matching, adaptation and calibration. In: 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, pp 1–8
Xie GS, Zhang XY, Yao Y, et al (2021) VMAN: a virtual mainstay alignment network for transductive zero-shot learning. IEEE Trans Image Process 30:4316–4329
Liu Y, Tuytelaars T (2020) A deep multi-modal explanation model for zero-shot learning. IEEE Trans Image Process 29:4788–4803
Hu Y, Wen G, Chapman A, et al (2021) Graph-based visual-semantic entanglement network for zero-shot image recognition. IEEE Trans Multimed 24:2473–2487
Luo Y, Wang X, Pourpanah F (2021) Dual VAEGAN: a generative model for generalized zero-shot learning. Appl Soft Comput 107:107352
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin Transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 10012–10022
Caron M, Touvron H, Misra I, Jegou H, Mairal J, Bojanowski P, Joulin A (2021) Emerging properties in self-supervised vision transformers. arXiv
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L (2015) ImageNet large scale visual recognition challenge. IJCV
Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inf Process Syst 33:6840–6851
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. In: ECCV. Springer, pp 740–755
Caron M, Misra I, Mairal J, Goyal P, Bojanowski P, Joulin A (2020) Unsupervised learning of visual features by contrasting cluster assignments. In: NeurIPS
Wah C, Branson S, Welinder P, Perona P, Belongie S (2010) The Caltech-UCSD Birds-200-2011 dataset. California Institute of Technology, CNS-TR-2010-001
Li X, Fang M, Li H, Chen B (2024) Selective-generative feature representations for generalized zero-shot open-set classification by learning a tightly clustered space. Expert Syst Appl 245:123062. https://doi.org/10.1016/j.eswa.2023.123062. Available: https://www.sciencedirect.com/science/article/pii/S0957417423035649
Acknowledgements
The authors thank Jiajie Li from the University of Electronic Science and Technology of China for organizing the image data of this paper, and Yong Sun for providing the graphics card support for the experiments.
Funding
This research was funded by the Science and Technology Program of Sichuan (grant number 2022ZDZX0005, 2023ZHCG0013), the central government guides local special fund projects of the Mianyang Municipality Science and Technology Bureau (grant number 2022ZYDF009).
Author information
Authors and Affiliations
Contributions
The authors Muyang Sun and Haitao Jia contributed equally to this study.
Corresponding author
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Sun, M., Jia, H. SKZC: selfdistillation and knearest neighborbased zeroshot classification. J. Eng. Appl. Sci. 71, 97 (2024). https://doi.org/10.1186/s44147024004293
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s44147024004293