Classification feature sets for source code plagiarism detection in Java

Hosam, Eman; Hadhoud, Mayada; Atiya, Amir; Fayek, Magda

doi:10.1186/s44147-022-00155-8

Journal of Engineering and Applied Science

Table 3. The target and description of the proposed feature sets

Feature set	Target	Count	Description
Structural histogram features	Summarizing the function prototype similarity matrices using histograms (instead of the minimum, maximum, and average).	8	There are 2 similarity matrices (one for types and the other for identifiers). Each matrix yields 4 features: the normalized counts of similarity values falling in each of the 4 histogram partitions.
Lexical per-class features	Comparing each class pair in the two programs lexically by the cosine similarity of their character 3-grams. Note: The candidates and extreme ranges of the similarity matrix are used.	3	There is 1 feature for the average of the candidate list extracted from the class similarity matrix, and 2 features for the two histogram extreme ranges of the candidate list.
Structural counting features	Comparing the two programs based on counting features that represent the program structure, such as the number of classes, functions, and loops.	12	The 12 counts extracted from each program are: classes, interfaces, subclasses, functions, loops, conditionals, function calls, class fields, variable declarations, assignments, comments, and string literals. For each pair of corresponding counts, the similarity feature is the minimum count divided by the maximum count.
Modified original features	Proposing a modified version of the structural features of Ganguly et al. [8] using candidates and histogram extreme ranges of similarity matrices.	8	There are 6 structural features: for each of the 2 similarity matrices, 2 for the histogram extreme ranges and 1 for the candidates’ average. The same 1 lexical feature and 1 stylistic feature of Ganguly et al. [8] are used.
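
The three basic computations named in the table (a 4-partition histogram of similarity values, cosine similarity of character 3-grams, and the minimum-over-maximum count ratio) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function names are hypothetical, the histogram is assumed to use equal-width partitions over [0, 1], and a pair of zero counts is assumed to score 1.0; the table does not specify either detail. Candidate-list extraction is not described in the table and is omitted.

```python
from collections import Counter
from math import sqrt


def histogram_features(similarities, bins=4):
    """Normalized count of similarity values in each histogram partition.

    `similarities` is a matrix (list of rows) of values in [0, 1].
    Equal-width partitions are an assumption of this sketch.
    """
    values = [v for row in similarities for v in row]
    counts = [0] * bins
    for v in values:
        # Clamp so that a similarity of exactly 1.0 falls in the last bin.
        index = min(int(v * bins), bins - 1)
        counts[index] += 1
    total = len(values)
    return [c / total if total else 0.0 for c in counts]


def char_trigram_cosine(text_a, text_b):
    """Cosine similarity between the character 3-gram counts of two texts."""
    grams_a = Counter(text_a[i:i + 3] for i in range(len(text_a) - 2))
    grams_b = Counter(text_b[i:i + 3] for i in range(len(text_b) - 2))
    dot = sum(grams_a[g] * grams_b[g] for g in grams_a.keys() & grams_b.keys())
    norm_a = sqrt(sum(c * c for c in grams_a.values()))
    norm_b = sqrt(sum(c * c for c in grams_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a text shorter than 3 characters has no 3-grams
    return dot / (norm_a * norm_b)


def count_similarity(count_a, count_b):
    """Minimum count over maximum count for one structural count."""
    largest = max(count_a, count_b)
    if largest == 0:
        # Assumption: if neither program has the construct, treat as identical.
        return 1.0
    return min(count_a, count_b) / largest
```

Under these assumptions, calling `histogram_features` once on the type similarity matrix and once on the identifier similarity matrix gives the 4 + 4 = 8 structural histogram features, and applying `count_similarity` to each of the 12 paired counts gives the 12 structural counting features.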
