Table 3 The target and description of the proposed feature sets

From: Classification feature sets for source code plagiarism detection in Java

| Feature set | Target | Count | Description |
| --- | --- | --- | --- |
| Structural histogram features | Summarizing the function prototype similarity matrices using histograms (instead of the minimum, maximum, and average). | 8 | There are 2 similarity matrices (one for types and the other for identifiers). Each matrix contributes 4 features: the normalized counts of the 4 histogram partitions of its similarity values. |
| Lexical per-class features | Comparing each class pair in the two programs lexically by the cosine similarity of their character 3-grams. Note: the candidates and extreme ranges of the similarity matrix are used. | 3 | There is 1 feature for the average of the candidate list extracted from the class similarity matrix, and 2 features for the two histogram extreme ranges of the candidate list. |
| Structural counting features | Comparing the two programs based on counting features that represent the program structure, such as the number of classes, functions, and loops. | 12 | The 12 counts extracted from each program are: classes, interfaces, subclasses, functions, loops, conditionals, function calls, class fields, variable declarations, assignments, comments, and string literals. For each pair of corresponding counts, the similarity feature is the minimum count divided by the maximum count. |
| Modified original features | Proposing a modified version of the structural features of Ganguly et al. [8] using candidates and histogram extreme ranges of similarity matrices. | 8 | There are 6 structural features: for each of the 2 similarity matrices, 2 for the histogram extreme ranges and 1 for the candidates’ average. The same 1 lexical feature and 1 stylistic feature of Ganguly et al. [8] are used. |
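
Three of the computations named in the table can be sketched in a few lines: the min-over-max ratio for a pair of counts, the normalized 4-partition histogram of a similarity matrix, and the cosine similarity of character 3-grams. This is a minimal illustration, not the authors' implementation. The function names are hypothetical, the equal-width partitions over [0, 1] are an assumption (the table does not give the partition boundaries), and treating two zero counts as fully similar is also an assumption. Candidate extraction and the histogram extreme ranges are not covered, since the table does not define them.

```python
import math
from collections import Counter


def count_ratio(a: int, b: int) -> float:
    """Similarity of two counts: the minimum count over the maximum count."""
    if a == 0 and b == 0:
        return 1.0  # assumption: a construct absent from both programs matches
    return min(a, b) / max(a, b)


def histogram_features(similarities, bins: int = 4) -> list:
    """Normalized counts of similarity values in `bins` partitions of [0, 1].

    Assumption: partitions are equal-width; values are expected in [0, 1].
    """
    values = list(similarities)
    counts = [0] * bins
    for s in values:
        index = min(int(s * bins), bins - 1)  # a value of 1.0 falls in the last bin
        counts[index] += 1
    total = len(values)
    return [c / total if total else 0.0 for c in counts]


def char_trigrams(text: str) -> Counter:
    """Multiset of overlapping character 3-grams in `text`."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no 3-grams on one side: treat as no similarity
    return dot / (norm_a * norm_b)


# Hypothetical counts for two programs (classes, functions, loops).
program_a = {"classes": 3, "functions": 12, "loops": 5}
program_b = {"classes": 4, "functions": 12, "loops": 2}
counting_features = {k: count_ratio(program_a[k], program_b[k]) for k in program_a}

# Hypothetical lexical comparison of two class bodies.
lexical = cosine_similarity(
    char_trigrams("class Point { int x; int y; }"),
    char_trigrams("class Pair { int a; int b; }"),
)
```

Every feature produced this way lies in [0, 1], so features from the different sets can be concatenated into one vector without further scaling.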