From: Classification feature sets for source code plagiarism detection in Java
Parameter | Description | Value |
---|---|---|
field | True if the per-field representation is used | True |
toptermquery | True if only the top “num_q_terms” terms (rather than all terms) are used for each field, after sorting terms by TF-IDF score | True |
num_q_terms | The number of top terms for each field if “toptermquery” is true | 20 |
lambda | The weight (from 0 to 1) of TF with respect to IDF | 0.4 |
minShingleSize | The minimum size of the word n-grams of terms (note: unigrams are included by default) | 2 |
maxShingleSize | The maximum size of the word n-grams of terms | 3 |
num_wanted | For each query document, the top “num_wanted” hit documents, sorted by relevance score, are included in the output file | 20 |
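
As a rough illustration of how the shingle and top-term parameters in the table could interact, the following Python sketch builds word n-grams for one field and keeps the highest-scoring terms. It is only a sketch under stated assumptions: the source does not give the scoring formula, so the linear interpolation `lambda * TF + (1 - lambda) * IDF`, the smoothed log forms of TF and IDF, and the names `PARAMS`, `shingles`, `top_terms`, `doc_freq`, and `num_docs` are all hypothetical and not taken from the original system.

```python
import math
from collections import Counter

# Values copied from the parameter table; the dict itself is illustrative.
PARAMS = {
    "num_q_terms": 20,
    "lambda": 0.4,
    "minShingleSize": 2,
    "maxShingleSize": 3,
}


def shingles(tokens, min_size, max_size):
    """Return unigrams plus word n-grams of sizes min_size..max_size."""
    out = list(tokens)  # unigrams are included by default
    for n in range(min_size, max_size + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def top_terms(field_tokens, doc_freq, num_docs, params=PARAMS):
    """Rank the terms of one field and keep the top num_q_terms."""
    terms = shingles(
        field_tokens, params["minShingleSize"], params["maxShingleSize"]
    )
    tf = Counter(terms)
    lam = params["lambda"]

    def score(term):
        # ASSUMPTION: lambda linearly interpolates between TF and IDF.
        # The source only says lambda is "the weight of TF with respect
        # to IDF"; the real formula may differ.
        tf_part = 1 + math.log(tf[term])
        idf_part = 1 + math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1))
        return lam * tf_part + (1 - lam) * idf_part

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(tf, key=lambda t: (-score(t), t))
    return ranked[: params["num_q_terms"]]
```

With `toptermquery` set to true, a function like `top_terms` would be called once per field and the returned terms used to form the query; with it set to false, all terms would be kept.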