From: Classification feature sets for source code plagiarism detection in Java
Feature set | Target | Count | Description |
---|---|---|---|
Structural histogram features | Summarizing the function prototype similarity matrices using histograms (instead of the min., max., and avg.). | 8 | There are 2 similarity matrices (one for types and the other for identifiers). Each matrix contributes 4 features: the normalized counts of the similarity values falling in each of the 4 histogram partitions. |
Lexical per-class features | Comparing each class pair in the two programs lexically by the cosine similarity of their character 3-grams. Note: The candidates and extreme ranges of the similarity matrix are used. | 3 | There is 1 feature for the average of the candidate list extracted from the class similarity matrix, and 2 features for the two extreme histogram ranges of the candidate list. |
Structural counting features | Comparing the two programs based on counting features that represent the program structure, such as the number of classes, functions, and loops. | 12 | The 12 counts extracted from each program are: classes, interfaces, subclasses, functions, loops, conditionals, function calls, class fields, variable declarations, assignments, comments, and string literals. For each pair of corresponding counts, the similarity feature is the minimum count divided by the maximum count. |
Modified original features | Proposing a modified version of the structural features of Ganguly et al. [8] using candidates and histogram extreme ranges of similarity matrices. | 8 | The 6 structural features are: 2 for the histogram extreme ranges and 1 for the candidates’ average, for each of the 2 similarity matrices. The same 1 lexical feature and 1 stylistic feature of Ganguly et al. [8] are used. |
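
The three basic computations named in the table (min/max count ratio, histogram summary of similarity values, and cosine similarity of character 3-grams) can be sketched in Python as below. This is a minimal illustration, not the authors' implementation. The function names are hypothetical, and several details are assumptions the table does not specify: the 4 histogram partitions are taken to be equal-width over [0, 1], normalization divides by the number of values, two zero counts are treated as identical (ratio 1.0), and 3-grams are weighted by raw frequency.

```python
import math
from collections import Counter


def count_ratio(a: int, b: int) -> float:
    """Similarity of two non-negative counts: minimum over maximum."""
    if a == 0 and b == 0:
        # Assumption: a construct absent from both programs counts as identical.
        return 1.0
    return min(a, b) / max(a, b)


def histogram_features(values, bins: int = 4):
    """Normalized counts of similarity values in equal-width bins over [0, 1].

    Assumption: equal-width partitions; the table does not give the bin edges.
    """
    counts = [0] * bins
    for v in values:
        # A value of exactly 1.0 is placed in the last bin.
        idx = min(int(v * bins), bins - 1)
        counts[idx] += 1
    total = len(values)
    if total == 0:
        return [0.0] * bins
    return [c / total for c in counts]


def trigram_cosine(s: str, t: str, n: int = 3) -> float:
    """Cosine similarity of the character n-gram frequency vectors of s and t."""
    a = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    b = Counter(t[i:i + n] for i in range(len(t) - n + 1))
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Strings shorter than n characters have no n-grams.
        return 0.0
    return dot / (norm_a * norm_b)
```

For example, a program with 3 loops compared against one with 6 loops gives `count_ratio(3, 6)` equal to 0.5. Applying `histogram_features` to the flattened values of each of the 2 function prototype similarity matrices would yield the 4 + 4 = 8 structural histogram features listed in the first row.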