SKZC: self-distillation and k-nearest neighbor-based zero-shot classification

Zero-shot learning is a challenging paradigm in machine learning whose crux lies in distilling knowledge from observed classes and generalizing it to novel ones. The objective is to identify objects that were not included in the model's training by leveraging patterns and knowledge learned from previously encountered categories. As a crucial subtask of open-world object detection, zero-shot classification can also provide insights and solutions for that field. Despite its potential, current zero-shot classification models often suffer from a performance gap caused by the limited transfer ability and discriminative capability of their learned representations. To improve zero-shot object classification, this paper introduces a novel image classification model that can also be applied to object detection, namely the self-distillation and k-nearest neighbor-based zero-shot classification (SKZC) method. First, we employ a diffusion detector to identify potential objects in images. Then, self-distillation and distance-based classifiers are used to distinguish unseen objects from seen classes. Finally, k-nearest neighbor-based cluster heads are designed to cluster the unseen objects. Extensive experiments and visualizations on publicly available datasets were conducted to assess the efficacy of the proposed approach. Specifically, our model demonstrates a performance improvement of over 20% compared to contrastive clustering. Moreover, for the macro average, it achieves a precision of 0.910 and a recall of 0.842 on the CIFAR-10 dataset, and a precision of 0.737 and a recall of 0.688 on the CIFAR-100 dataset. Compared to a more recent model (SGFR), our model realizes improvements of 10.9%, 13.3%, and 7.8% in the Sacc, Uacc, and H metrics, respectively. This study aims to introduce fresh ideas into the domain of zero-shot image classification, and it can be applied to open-world object detection tasks. Our code is available at https://www.github.com/CmosWolf1/Code_implementation_for_paper_SKZC.


Introduction
As a crucial task in computer vision, image classification [1] involves assigning predefined labels or categories to input data based on their characteristics or features. It is also an important subtask within the field of object detection, so enhancements in the performance of classification models can also improve the classification abilities of object detection models. Classification tasks depend on the availability of a large volume of tagged data [2]. Owing to advances in deep learning techniques [3][4][5], most image classification methods used in computer vision are supervised learning methods that depend on large volumes of tagged data for training. However, existing datasets cannot encompass all possible classes, and the evolution of human society continually gives rise to new categories [6]. As a result, these supervised classification methods perform unsatisfactorily when some categories have scarce or even no tagged data [7].
Zero-shot classification, also known as zero-shot learning (ZSL) [8,9] or zero-shot recognition, was proposed to address this lack of data by enabling the recognition of objects belonging to unseen categories. It is a subfield of machine learning that aims to classify objects or instances into classes that were unseen during training by leveraging knowledge transferred from related classes for which labeled data are available.
Traditional zero-shot classification can be divided into three main approaches. The first approach utilizes pre-trained word embedding vectors to represent and understand the relationships among different categories. For instance, DeViSE [10] utilizes a pre-trained convolutional neural network (CNN) to project image features and word embeddings of labels into a shared space. ConSE [11], on the other hand, merges the k highest-probability image embeddings. The second approach directly incorporates the relationships between classes using either a graph convolutional network (GCN) or a predefined class hierarchy such as WordNet [3]. GCNZ [12] and DGPZ [13] employ GCNs to propagate knowledge between seen and unseen classes while incorporating CNNs and word embeddings. An alternative method, HZSL [14], projects both image and text embeddings into a hyperbolic space that organizes child and parent classes within the hierarchical structure of WordNet [3]. Lastly, some approaches, such as [15][16][17], depend on human-tagged attributes to model class semantics. These methods treat attribute annotations as informative cues for understanding the characteristics and distinguishing features of various classes. In contrast to CNN-based methods, vision transformers (ViT) [18] have emerged as an alternative to convolutional neural networks in visual recognition [18][19][20]. The emergence of self-distillation [21] has provided new solutions for zero-shot learning. Self-knowledge distillation [21] trains a student model to emulate the learning patterns of an already-trained teacher model, which here is a pretrained ViT model. Many zero-shot learning methods, such as [22,23], utilize self-distillation models to acquire features for unseen categories.
However, these prior approaches suffer from several limitations. First, their focus lies primarily on improving the correspondence between image features extracted from pretrained CNNs and pre-trained word embedding models such as GloVe [24]. Second, they employ predefined class hierarchies, such as WordNet [3], which confine category modeling to a tree structure and thereby fail to capture the complex inter-class relationships observed in real-world objects. Furthermore, relying solely on class hierarchies restricts the classification scope to the categories included in the hierarchy. Lastly, attribute-based methods cannot generalize to categories lacking seen attributes, which limits their applicability.
Based on the aforementioned observations, we introduce a novel self-distillation and k-nearest neighbor-based model for zero-shot classification problems, namely self-distillation and k-nearest neighbor-based zero-shot classification (SKZC). When unseen categories are underrepresented or completely absent in datasets and lack clear semantic relationships with seen classes, conventional zero-shot image classification algorithms often struggle to achieve satisfactory classification performance. In contrast, our model effectively addresses this issue. Firstly, we use a diffusion detector [25] to detect potential objects in the image. Secondly, we design a self-distillation and distance-based classifier (SDDC) to classify seen and unseen objects. Lastly, we propose a k-nearest neighbor-based cluster head (KCH) to cluster the different kinds of unseen objects. As shown in Fig. 1, the clustering process is performed using KCH on several unseen objects in a given embedding space. Extensive experiments demonstrate the efficacy of our model. We conducted tests on four datasets: CIFAR-100, CIFAR-10, ImageNet-10, and STL-10 [26][27][28]. In clustering performance, we achieved varying degrees of improvement compared to the contrastive clustering [29] method. Moreover, for the macro average, we achieve a precision of 0.910 and a recall of 0.842 on the CIFAR-10 dataset, and a precision of 0.737 and a recall of 0.688 on the CIFAR-100 dataset. Compared to a more recent model (SGFR), our model realizes improvements of 10.9%, 13.3%, and 7.8% in the Sacc, Uacc, and H metrics, respectively.
Our main contributions are as follows: (1) For the first time, we apply a diffusion model to the detection of seen and unseen objects. This implies that the methods in our model can be applied not only to classification tasks but can also provide solutions and insights for detection tasks, particularly open-world object detection (OWOD) [6,30] tasks. (2) We propose the self-distillation and distance-based classifier (SDDC) and the k-nearest neighbor-based cluster head (KCH) to classify seen and unseen objects. (3) Our model is capable of lifelong learning, meaning it can keep learning without the need for human intervention once it is initialized.

Generative-based ZSL methods
In the domain of zero-shot image classification, leveraging generative adversarial networks (GANs), which are capable of synthesizing highly realistic imagery, has emerged as a novel and promising approach [31,32]. These advanced GAN variants enable the generation of visual feature representations for unseen categories by utilizing the known visual data from seen classes coupled with semantic attributes of the target unseen classes. Xian et al. [33] devised an enhanced model incorporating Wasserstein GAN (WGAN) [34], integrating the WGAN loss function with a classification loss to not only ensure the discriminative nature of the synthesized features but also improve the stability of training. Subsequently, numerous researchers have refined the WGAN framework to address challenges associated with the quality, diversity, and semantic relevance of generated samples [30,35,36]. Vyas et al. [37] introduced the leveraging semantic relationship GAN (LsrGAN), which utilizes a semantic-regularized loss component to facilitate knowledge transfer between classes. To counteract training instability, certain studies have adopted variational auto-encoders (VAEs), which are known for their robust training characteristics, in zero-shot learning tasks [38][39][40]. Other research efforts have focused on developing a joint embedding space through VAEs for multi-modal data integration [41,42], effectively narrowing the divide between the visual and semantic spaces.
Fig. 1 Clustering process of unseen classes using the KCH

Embedding-based ZSL methods
Embedding-based approaches are designed to create a shared embedding space for images and their corresponding semantic attributes. These approaches can be categorized into three distinct types. The first type learns a mapping from the visual space to the semantic space [43][44][45], which encounters issues such as projection domain shift and the hubness phenomenon. To mitigate these issues, the second type inverts this direction by mapping the semantic information onto the visual domain [46,47]. The third type aims to reconcile the disparities between the visual and semantic domains by jointly mapping both visual and semantic features into an intermediary shared space [48,49]. This common space is calibrated using bi-directionally aligned knowledge from both visual and semantic representations, addressing the limitations associated with direct mappings and transfer of model parameters. Despite these improvements, embedding-based techniques continue to grapple with challenges such as semantic information loss and a deficiency in representing unseen class features, leading to a prediction bias towards classes observed during training [50].

Problem definition
Assume that the set of categories to which all objects in an open world belong is $S_t = \{1, 2, 3, \ldots, C\} \subset \mathbb{N}^+$, where $\mathbb{N}^+$ denotes the set of positive integers and $C$ is the number of all classes in the open world. The seen and unseen categories are denoted $K_t$ and $U_t$, respectively, and the collection of embedding vector sets is denoted $F_t$. It is evident that $K_t, U_t \subseteq S_t$, and both $K_t$ and $U_t$ are empty at the onset of the task. The seen and unseen objects come from the detector and are added to $K_t$ or $U_t$ according to the result of a classifier. Subsequently, we need to cluster the unseen categories. It is worth noting that the number of vector clusters in the embedding set keeps increasing as the task progresses. Therefore, owing to limits on computational power and cost, we place these unseen categories into several embedding sets before clustering (further particulars are elaborated in the subsequent subsections). These embedding sets are combined to form $F_t$.
Sun and Jia Journal of Engineering and Applied Science (2024) 71:97

Overall architecture
Figure 2 presents the comprehensive structure of our proposed model for zero-shot image classification. We incorporate a detector into our model for classification tasks and continuously update it to enhance its performance in real-world classification tasks. Additionally, cropping the images detected by the detector allows our model to iterate by itself at a fast pace. Firstly, we use a diffusion model detector [25] as the base detector. Then, we crop the image detected by the detector according to the box predictor. These cropped images are sent into the self-distillation and distance-based classifier (SDDC) to differentiate between categories that have been previously encountered and those that have not. After that, unseen categories are sent into a k-nearest neighbor-based clustering head (KCH) for clustering, while seen classes are added to the existing seen clusters. Lastly, we update the box predictor module so that the detector can recognize the newly added classes. Additionally, we integrate the already clustered unseen clusters into the embedding vector set to accomplish the transformation from unseen classes to seen classes. As time progresses, the number of seen clusters increases, allowing the model to recognize an ever-growing number of classes.

Self-distillation and distance-based classifier
Due to the limited capability of backbone network models such as ResNet [51] and Swin-Base [52] in effectively extracting foreground features from images, we employ a self-distillation learning model to extract foreground features.
The architecture of the self-distillation learning model is shown in Fig. 3. The model is illustrated using a single pair of views $(x_1, x_2)$ for simplicity and clarity. It applies two distinct random transformations to an input image and provides the results as inputs to the student and teacher networks. Although these networks have identical structures, their parameters are different. The networks generate $K$-dimensional feature vectors that are normalized using a temperature softmax function. These feature vectors are then compared using a cross-entropy loss to measure their similarity [53]. The teacher network's output is normalized using the mean computed over the batch. The student network [53] $g_{\theta_s}$ is a neural network that learns to perform a task by mimicking the behavior of the teacher network [53] $g_{\theta_t}$. During training, the student network is updated using standard backpropagation, with gradients calculated from the difference between the student's predictions and the teacher's outputs. The goal is for the student network to learn representations that match those produced by the teacher. For an input image $x$, the student and teacher networks each produce a probability distribution over $M$ categories, denoted $P_s$ for the student and $P_t$ for the teacher. The probability $P_s(x)$ is obtained by applying a softmax function to normalize the output of the network $g_{\theta_s}(x)$. More precisely:

$$P_s(x)^{(i)} = \frac{\exp\left(g_{\theta_s}(x)^{(i)} / \tau_s\right)}{\sum_{k=1}^{M} \exp\left(g_{\theta_s}(x)^{(k)} / \tau_s\right)} \qquad (1)$$

where $\tau_s > 0$ is a temperature parameter that regulates the sharpness of the output distribution; a corresponding expression governs $P_t$ with temperature $\tau_t$.
Fig. 2 The comprehensive structure of our model
Fig. 3 Architecture of self-distillation learning model
In our classifier, we use the student network to extract feature vectors of objects. The student network has been trained on the ImageNet-200 dataset [54]. We calculate the Euclidean distance $d_E$ between a feature vector $f_1$ and the center vector $f_2$ of each cluster within every embedding vector set as follows:

$$d_E(f_1, f_2) = \sqrt{\sum_{n=1}^{N} \left(f_{1n} - f_{2n}\right)^2} \qquad (2)$$

where $f_1 = (f_{11}, f_{12}, f_{13}, \ldots, f_{1N})$ and $f_2 = (f_{21}, f_{22}, f_{23}, \ldots, f_{2N})$, with $N \in \mathbb{N}^+$, are both $N$-dimensional feature vectors. The cluster radius $R_i$ of each seen cluster in an embedding vector set $E$ is defined in terms of the following quantities: $S$ is the number of vectors in a seen cluster, $V_{ij}$ is a feature vector in a seen cluster, and $\alpha$ is a parameter that determines the size of the cluster's radius. The choice of $\alpha$ is discussed in Section "The calculation of cluster radius".
After that, for an input feature vector, we compute its distance $d_E$ to every cluster centroid in each embedding vector set. We then use whether $d_E$ is less than the cluster radius $R_i$ as the criterion to determine whether the object belongs to seen category $i$ or to an unseen class.
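The seen/unseen decision described above can be illustrated with a minimal Python sketch. It assumes that cluster centers and radii are already available; the names `euclidean`, `classify`, and the toy `clusters` dictionary are hypothetical, and assigning the nearest cluster when several radii contain the feature is an assumption of this sketch rather than something stated in the text.

```python
import math


def euclidean(f1, f2):
    """Euclidean distance d_E between two N-dimensional feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


def classify(feature, clusters):
    """Return the label of a seen cluster containing `feature`, else None.

    `clusters` maps a label to a (center, radius) pair. A feature is treated
    as seen when d_E < radius for some cluster; if several clusters qualify,
    the nearest one is chosen (an assumption of this sketch). None stands
    for "unseen".
    """
    best_label, best_dist = None, float("inf")
    for label, (center, radius) in clusters.items():
        d = euclidean(feature, center)
        if d < radius and d < best_dist:
            best_label, best_dist = label, d
    return best_label


# Toy 2-D embedding space with two seen clusters (illustrative values only).
clusters = {
    "cat": ((0.0, 0.0), 1.0),
    "dog": ((5.0, 5.0), 1.5),
}
print(classify((0.3, 0.4), clusters))  # inside the "cat" radius
print(classify((2.5, 2.5), clusters))  # outside every radius: unseen
```

In a full pipeline the features would come from the student network and the loop would run over every embedding vector set; both are omitted here for brevity.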

K-nearest neighbor-based cluster head
Enabling the model to cluster unseen classes provides it with the ability to differentiate among diverse unseen classes. We present a k-nearest neighbor-based cluster head to cluster these unseen classes. Algorithm 1 provides an overview of how the k-nearest neighbor-based cluster head clusters these unseen classes.

Algorithm 1 Algorithm of clustering unseen classes
The search space parameter is defined as n_neighbors, meaning that we search for the optimal value of n_neighbors within a range from 1 to 20 (excluding 20). The purpose is to experiment with different values of k (i.e., the number of nearest neighbors) and find the best value with which to construct the KNN model. Then, the cluster labels are assigned based on the indices of the nearest neighbors. After the prediction is completed, each unseen vector has a label ID pointing to a specific cluster. Next, these unseen clusters are divided so that each embedding vector set contains only a fixed number of clusters (the choice of this number is explained in Section "Number of clusters in each embedding set"). Then, we integrate the new embedding vector set with unseen clusters into the collection of embedding vector sets to complete the update of seen categories. Simultaneously, we update the box predictor so that the detector can detect the newly added seen categories.

Diffusion detector
The $L_2$ loss function [55] used by the diffusion model can be formulated as follows:

$$\mathcal{L}_{\text{train}} = \frac{1}{2} \left\| f_\theta(z_t, t) - z_0 \right\|^2 \qquad (4)$$

where $t \in \{0, 1, \ldots, T\}$ and the neural network $f_\theta(z_t, t)$ is trained to predict $z_0$ from $z_t$ by minimizing this training objective.
To establish a robust foundation for our object detection framework, we incorporated a pre-trained diffusion model [25] that has been extensively trained on the MSCOCO [56] dataset. We specifically employed the weights of a model based on the ResNet50 [51] architecture, which has demonstrated remarkable performance in object detection tasks owing to its deep residual learning capabilities. It is noteworthy that the original implementation of the diffusion model involved a lengthy process with 500 sampling steps, which contributed to precise but computationally intensive inference. Considering the real-time requirements of our zero-shot classification task, we optimized the inference pipeline by reducing the number of sampling steps from 500 to 300. This adjustment substantially accelerated the inference speed of our diffusion-based detector while maintaining an acceptable trade-off between accuracy and real-time performance.

Self-distillation model
In order to align the output distributions, the cross-entropy loss with respect to the parameters $\theta_s$ of the student network is minimized:

$$\min_{\theta_s} H\left(P_t(x), P_s(x)\right) \qquad (5)$$

where $H(a, b) = -a \log b$.
In the following, we describe how the problem in Eq. (5) is adapted for self-supervised learning. The initial step involves generating various distorted views or crops of an image using a multi-crop strategy [57]. Specifically, a set $V$ of different views is created from a given image. To capture both global and local information, our model incorporates two global views ($x_1^g$ and $x_2^g$) and multiple local views of smaller resolution. While all crops are processed by the student model, only the global views are processed by the teacher model. This process promotes "local-to-global" correspondences [53]. The following loss function is then minimized:

$$\min_{\theta_s} \sum_{x \in \{x_1^g, x_2^g\}} \; \sum_{\substack{x' \in V \\ x' \neq x}} H\left(P_t(x), P_s(x')\right) \qquad (6)$$

We use the vision transformer (ViT) [18] as the backbone of the self-distillation and distance-based classifier. We employed four distinct model configurations with varying sizes and resolutions (ViT-S/16, ViT-S/8, ViT-B/16, and ViT-B/8) [53] to thoroughly investigate their feature extraction efficacy.
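To make the multi-crop objective described in this section more concrete, the following is a minimal standard-library Python sketch, not the training code of the model. The function names (`log_softmax_t`, `cross_entropy`, `multicrop_loss`), the temperature defaults, and the toy logits are illustrative assumptions; the batch-mean normalization of the teacher output and the gradient update of the student are omitted.

```python
import math


def log_softmax_t(logits, tau):
    """Log of the temperature softmax: divide logits by tau, then normalize."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]


def cross_entropy(teacher_logits, student_logits, tau_t, tau_s):
    """H(P_t, P_s) = -sum_i P_t^(i) * log P_s^(i)."""
    p_t = [math.exp(v) for v in log_softmax_t(teacher_logits, tau_t)]
    log_p_s = log_softmax_t(student_logits, tau_s)
    return -sum(a * b for a, b in zip(p_t, log_p_s))


def multicrop_loss(teacher_global, student_all, tau_t=0.04, tau_s=0.1):
    """Sum H(P_t(x), P_s(x')) over global views x and all other views x'.

    teacher_global[i] and student_all[i] are assumed to be outputs for the
    same global view i; any further entries of student_all are local crops.
    The temperature defaults are placeholders, not values from the paper.
    """
    loss = 0.0
    for i, t_logits in enumerate(teacher_global):
        for j, s_logits in enumerate(student_all):
            if j == i:  # skip the pair where student and teacher see the same view
                continue
            loss += cross_entropy(t_logits, s_logits, tau_t, tau_s)
    return loss


# Two global views seen by the teacher; the student sees them plus one local crop.
teacher_global = [[2.0, 0.0, 0.0], [1.5, 0.2, 0.0]]
student_all = [[1.0, 0.0, 0.0], [1.2, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(multicrop_loss(teacher_global, student_all))
```

A lower temperature sharpens the distribution, which is why the teacher temperature is usually chosen smaller than the student temperature in this family of methods.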

Datasets
We evaluate our model on the set of tasks $T = \{T_1, T_2\}$. Task 1 tests clustering performance. As shown in Table 1, for task 1 we use 10 classes from the listed datasets. For task 2, we use CIFAR-10, CIFAR-100, and CUB [58]. Furthermore, we use pre-trained self-distillation models with two different resolutions and two different model sizes, resulting in four types of models. Therefore, in task 1, we evaluate the performance and practicality of each method and model.

Evaluation metrics
In task 1, to assess our approach, we employ three commonly recognized metrics for clustering evaluation: The normalized mutual information (NMI), accuracy (ACC), and adjusted rand index (ARI).
The NMI is a metric that remains consistent regardless of the dataset's size. It measures the extent of information overlap between the true labels and the labels assigned through clustering, indicating the quality of the clustering. Given two sets of clusters $U$ and $V$, it normalizes the mutual information $I(U; V)$, which quantifies the information shared by $U$ and $V$, by the individual uncertainties $H(U)$ and $H(V)$.
Accuracy (ACC) measures the proportion of correctly clustered instances by comparing the cluster assignments with the ground truth labels, reflecting the clustering correctness. This can be formulated as follows:

$$\mathrm{ACC} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{l_i = m(c_i)\}$$

where $n$ is the number of samples, $c_i$ is the cluster assignment for sample $i$, $l_i$ is the true label for sample $i$, $m$ is the mapping function from clusters to true labels, and $\mathbb{1}\{\cdot\}$ is the indicator function.
The adjusted Rand index (ARI) measures the similarity between the true clustering and the predicted clustering, corrected for chance, so that its value can be compared across different datasets. This can be formulated as follows:

$$\mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]}$$

where RI is the Rand index, which is calculated as follows:

$$\mathrm{RI} = \frac{TP + TN}{TP + TN + FP + FN}$$

In this context, TP is the number of true positive pairs, TN is the number of true negative pairs, FP is the number of falsely identified positive pairs, and FN is the number of falsely identified negative pairs. The expected RI, $E[\mathrm{RI}]$, depends on the marginal totals of a contingency table (or confusion matrix) of the cluster assignment.
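As an illustration of the pair-counting view of these clustering metrics, the sketch below computes ACC, RI, and ARI using only the Python standard library. It is a toy implementation under stated assumptions: the cluster-to-label mapping in `clustering_accuracy` is found by brute force over permutations (practical only for a handful of clusters and assuming no more clusters than classes, whereas real evaluations typically use the Hungarian algorithm), and `adjusted_rand_index` uses the standard contingency-table form and does not guard against the degenerate case of a zero denominator.

```python
from collections import Counter
from itertools import combinations, permutations
from math import comb


def clustering_accuracy(labels_true, labels_pred):
    """ACC under the best one-to-one mapping m from clusters to true labels."""
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(mapping[c] == l for c, l in zip(labels_pred, labels_true))
        best = max(best, correct)
    return best / len(labels_true)


def rand_index(labels_true, labels_pred):
    """RI = (TP + TN) / (TP + TN + FP + FN), counted over all sample pairs."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif not same_true and not same_pred:
            tn += 1
        elif same_pred:
            fp += 1
        else:
            fn += 1
    return (tp + tn) / (tp + tn + fp + fn)


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected Rand index from the contingency table marginals."""
    n = len(labels_true)
    sum_cells = sum(comb(v, 2) for v in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(v, 2) for v in Counter(labels_true).values())
    sum_cols = sum(comb(v, 2) for v in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (max_index - expected)


true = [0, 0, 1, 1, 2, 2]
pred = [0, 0, 0, 1, 2, 2]  # one sample of class 1 is placed in cluster 0
print(clustering_accuracy(true, pred), rand_index(true, pred), adjusted_rand_index(true, pred))
```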
In task 2, we use three evaluation metrics to assess model performance: precision, recall, and F1-score. Precision measures the accuracy of a classification model's positive predictions, that is, the proportion of true positives among all instances classified as positive, which includes both true positives and false positives. High precision indicates that an algorithm returned substantially more relevant results than irrelevant ones. Precision can be formulated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall measures the ability of a model to find all the relevant cases within a dataset. It is the ratio of correctly detected positive cases to the total number of actual positive cases. High recall indicates that the class is correctly recognized to a large extent. Recall (sensitivity) can be formulated as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

The F1-score is calculated as 2 times the product of precision and recall divided by the sum of precision and recall. It combines precision and recall into a single score that weighs both the concern of finding all relevant instances (recall) and that of returning only relevant instances (precision), thereby balancing the trade-off between false positives and false negatives. The F1-score reaches its best value at 1 (perfect precision and recall) and its worst at 0. It can be formulated as follows:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}$$
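For reference, macro-averaged precision, recall, and F1-score can be computed as in the short Python sketch below. The function name `macro_scores`, the convention of returning 0 when a denominator is zero, and the toy labels are assumptions made for illustration, not part of the evaluation code of the paper.

```python
def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all true classes."""
    classes = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0  # 0 by convention
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    k = len(classes)
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k


y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]
precision, recall, f1 = macro_scores(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Macro averaging gives every class the same weight regardless of its frequency, which matters when seen and unseen classes have very different sample counts.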

Implementation details
The detector of our model is based on the diffusion detector [25] with ResNet-50 [51] and Swin-Base [52] backbones. We use the detector to detect both seen and unseen objects. More precisely, we employ a diffusion model with the ResNet-50 [51] architecture as the backbone network to extract objects from images. Additionally, this diffusion model has been pre-trained on the MSCOCO [56] dataset.
In task 1 and task 2, we tested 4 self-distillation models, whose parameter counts and resolutions are shown in Table 2.
It is worth noting that a larger number of model parameters and smaller resolution values indicate better model performance. Furthermore, the self-distillation model we use has been pre-trained on the ImageNet dataset [54]. For the hyperparameters, we set the value of $\alpha$ to 0.75 and the number of clusters in each embedding set to 7.

Clustering performance
The quality of clustering directly influences the outcome of the entire classification task; therefore, the model's ability to cluster data effectively is of crucial significance. The clustering performance of our model is shown in Table 3, where we tested the ViT-B/8 self-distillation model on the STL-10, ImageNet-10, CIFAR-10, and CIFAR-100 datasets. Apart from the contrastive clustering algorithms, all the algorithms tested in our study used feature vectors extracted by a self-distillation model for clustering. Compared to the contrastive clustering algorithm, the traditional clustering algorithms also achieved promising performance, which indicates the effectiveness of self-distillation models.
In Table 4, we report tests on the CIFAR-10 and CIFAR-100 datasets, from which we conclude that the ViT-B/8 model has the best performance. It can be clearly seen that a model with more parameters and higher resolution typically performs better. Therefore, owing to its large number of parameters and higher resolution, ViT-B/8 exhibits the best performance. Nevertheless, considering the requirement for real-time classification, we are willing to sacrifice some model performance to enhance inference speed.
The clustering visualization results for the ViT-B/8, ViT-B/16, ViT-S/8, and ViT-S/16 models on the CIFAR-10 and CIFAR-100 datasets are shown in Fig. 5. When both SDDC and KCH are missing (row 4), the model performs the worst. Adding only SDDC (row 1) improves the model's ability to cluster, and with high-quality clustering the model is likely to perform better in classification. Adding only KCH (row 2) directly improves the model's ability to classify. When both SDDC and KCH are added, the model performs the best. Therefore, the presence or absence of SDDC and KCH affects the performance of the model, and the optimal performance is obtained when both components are present and work together.

The calculation of cluster radius
In our experiments, we found that a category's cluster may contain several points that are far from the cluster center. If we simply used the Euclidean distance from the furthest point to the cluster center as the cluster radius, a large number of unseen objects could be misjudged. Moreover, the choice of the cluster radius affects the accuracy and false positive rate of unseen object identification. Therefore, we designed an experiment to determine the optimal cluster radius.
In the experiment, we define the distance from each point in the cluster to the cluster center as $d_i$ and place these distances into a set $D$. Afterward, we retain the smallest $\alpha$ fraction of the values in $D$ and discard the remaining points. Finally, we set the maximum of the retained values in $D$ as the cluster radius, letting $\alpha$ increase from 0 to 1 in increments of 0.01 (evidently, $\alpha$ is positively correlated with the cluster radius). We then plotted the curve showing the change in the accuracy rate of unseen identification as $\alpha$ varied, as shown in Fig. 7. Moreover, we also plotted the harmonic mean of our model on the CUB dataset as a function of $\alpha$, as shown in Fig. 6. The maximum value is reached within the interval from 0.6 to 0.8, strictly speaking at 0.66. However, considering that the optimal value of $\alpha$ might differ across datasets, and with a view to generality, we set $\alpha$ to 0.75. The misjudgment rate of unseen objects is negatively correlated with the accuracy rate.
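The radius selection procedure described in this section can be sketched in a few lines of Python. The helper name `cluster_radius`, the rounding rule (keeping at least one distance and rounding the retained count up), and the toy points are assumptions of this sketch; the text only specifies that the smallest α fraction of distances is retained and that its maximum becomes the radius.

```python
import math


def cluster_radius(points, center, alpha):
    """Radius = largest distance among the smallest alpha fraction of D."""
    distances = sorted(math.dist(p, center) for p in points)  # the set D
    keep = max(1, math.ceil(alpha * len(distances)))  # number of retained values
    return distances[keep - 1]


# Four points at distance 1 from the center and one far outlier.
points = [(0.0, 1.0), (1.0, 0.0), (0.0, -1.0), (-1.0, 0.0), (0.0, 10.0)]
center = (0.0, 0.0)

print(cluster_radius(points, center, 0.75))  # the outlier is discarded
print(cluster_radius(points, center, 1.0))   # the outlier defines the radius

# Sweep alpha from 0 to 1 in steps of 0.01, as in the experiment.
sweep = [(a / 100, cluster_radius(points, center, a / 100)) for a in range(101)]
```

With α below 1 the single outlier no longer inflates the radius, which is the effect the experiment is designed to exploit.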

Number of clusters in each embedding set
As the task progresses, more and more feature vectors inevitably appear in the embedding set. It is impractical to perform clustering in only one embedding set, as this would consume a lot of time and computational power. These feature vectors follow the same distribution. Based on this fact, we decided to place clusters in different embedding sets and perform clustering there, rather than in a single embedding set. However, it is worth noting that if the number of clusters in the embedding set is too small, the differences in the distances between the input feature vector and the different cluster center vectors may become excessively small, resulting in incorrect judgments of input feature vectors. Therefore, we designed an experiment to explore the optimal number of clusters in each embedding set. The clustering metrics NMI, ARI, and ACC as a function of the number of clusters in the embedding set are shown in Fig. 8. As shown in Fig. 8, once the number of clusters in the same embedding set exceeds 7, all three indicators plummet sharply, indicating a rapid deterioration in clustering performance within that embedding set. A decline in clustering performance can lead to a propensity for the model to misclassify unseen classes as seen ones (a disorganized vector distribution within each cluster results in an excessively large radius). Moreover, whereas too few clusters may result in seen categories being misidentified as unseen categories, an excessively high number of clusters can lead to feature vectors becoming overly concentrated within the embedding set, thereby causing unseen classes to be erroneously identified as seen classes. Therefore, we fix the number of clusters in each embedding set at 7.
Fig. 6 The curve of harmonic mean as a function of alpha
Fig. 7 Change in the accuracy rate of unseen identification with the variation of the α value
Furthermore, we have plotted a visualization of the growth of the number of clusters in the embedding set, as shown in Fig. 9. While our model has shown promising results on certain datasets (e.g., the CIFAR datasets), it still has limitations (e.g., on the CUB dataset). Real-world objects are incredibly diverse and complex, often exceeding what can be experimentally simulated. Take birds as an example: there are over 9000 known species of birds, each with a distinct appearance. Even humans find it challenging to differentiate closely related bird species because of their similar features. Consequently, for several closely related categories, our model may perform poorly in classification tasks because the feature vectors of these categories are too close together within the embedding space. To address this issue, future research will delve deeper into the selection of the number of clusters in each embedding set and the optimization of the $\alpha$ parameter. By dynamically adjusting these based on the characteristics of the feature vectors within the embedding set, we aim to achieve a more reasonable distribution of feature vectors, ultimately enhancing the model's ability to classify categories with similar features.

Conclusions
In this work, we propose a novel zero-shot classification model named self-distillation and k-nearest neighbor-based zero-shot classification (SKZC). The proposed method includes a k-nearest neighbor-based cluster head (KCH) and a self-distillation and distance-based classifier (SDDC). Abundant experiments demonstrate the effectiveness of our model.
In future work, we hope to continue improving our model structure and apply it to open-world object detection problems. By exploring these uncharted territories, we aim to bridge the gap between academic experimentation and real-world applicability, making our model more robust and versatile in handling diverse and dynamic environments.

Fig. 8 Clustering performance in the same embedding set as a function of the number of clusters

Table 1
Datasets for each task

Table 2
Parameters and resolution of each model

Table 3
Cluster comparison for task 1

Table 4
Model comparison for task 1

Table 7
Comparisons in task 2 on the CUB dataset. The bold values represent the maximum values in the same row

Table 8
Ablation experimental results of our model. The bold values represent the maximum values in the same row