API docs¶
compare50¶
-
class
compare50.
Comparator
¶ Abstract base class for compare50 comparators, which specify how submissions should be scored and compared.
-
abstract
compare
(scores, ignored_files)¶ Given a list of scores and a set of distro files, perform an in-depth comparison of each submission pair and return a corresponding list of compare50.Comparison objects.
-
abstract
score
(submissions, archive_submissions, ignored_files)¶ Given a list of submissions, a list of archive submissions, and a set of distro files, return a list of compare50.Score objects, one for each submission pair.
-
class
compare50.
Comparison
(sub_a, sub_b, span_matches=NOTHING, ignored_spans=NOTHING)¶ - Variables
sub_a – the first submission
sub_b – the second submission
span_matches – a list of pairs of matching compare50.Span objects, wherein the first element of each pair is from sub_a and the second is from sub_b
ignored_spans – a list of compare50.Span objects which were ignored (e.g. because they matched distro files)
Represents an in-depth comparison of two submissions.
-
__init__
(sub_a, sub_b, span_matches=NOTHING, ignored_spans=NOTHING) → None¶ Method generated by attrs for class Comparison.
-
exception
compare50.
Error
¶ Base class for compare50 errors.
-
class
compare50.
File
(name, submission)¶ - Variables
name – file name (path relative to the submission path)
submission – submission containing this file
id – integer that uniquely identifies this file (files with the same path will always have the same id)
Represents a single file from a submission.
-
__init__
(name, submission) → None¶ Method generated by attrs for class File.
-
classmethod
get
(id)¶ Find the File with the given id.
-
lexer
()¶ Determine which Pygments lexer should be used.
-
property
path
¶ The full path of the file
-
read
(size=-1)¶ Open file, read size bytes from it, then close it.
-
tokens
()¶ Returns the preprocessed tokens of the file.
-
unprocessed_tokens
()¶ Get the raw tokens of the file.
-
class
compare50.
Pass
¶ Abstract base class for compare50 passes, which are essentially ways for compare50 to compare submissions. Subclasses must define a list of preprocessors (functions from tokens to tokens which will be run on every file compare50 receives) as well as a comparator (used to score and compare the preprocessed submissions).
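As a rough illustration of "functions from tokens to tokens", the sketch below chains two preprocessors over a token stream. The `Tok` tuple, the string token types, and the helpers `drop_whitespace`, `lower_case` and `run_preprocessors` are hypothetical stand-ins written for this example; they are not compare50's own classes or functions.

```python
from collections import namedtuple

# Hypothetical stand-in for compare50.Token (start, end, type, val).
Tok = namedtuple("Tok", ["start", "end", "type", "val"])


def drop_whitespace(tokens):
    """Drop tokens whose value consists only of whitespace."""
    for tok in tokens:
        if tok.val.strip():
            yield tok


def lower_case(tokens):
    """Lower-case the value of every token, keeping its position."""
    for tok in tokens:
        yield tok._replace(val=tok.val.lower())


def run_preprocessors(tokens, preprocessors):
    """Apply each preprocessor in order, feeding its output to the next."""
    for preprocessor in preprocessors:
        tokens = preprocessor(tokens)
    return list(tokens)


raw = [
    Tok(0, 3, "name", "Foo"),
    Tok(3, 4, "space", " "),
    Tok(4, 7, "name", "BAR"),
]
processed = run_preprocessors(raw, [drop_whitespace, lower_case])
```

Because each surviving token keeps its original `start` and `end`, matches found on the processed stream can still be mapped back to character positions in the original file.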
-
class
compare50.
Score
(sub_a, sub_b, score=0)¶ - Variables
sub_a – the first submission
sub_b – the second submission
score – a number indicating the similarity between sub_a and sub_b (higher meaning more similar)
A score representing the similarity of two submissions.
-
__init__
(sub_a, sub_b, score=0) → None¶ Method generated by attrs for class Score.
-
class
compare50.
Span
(file, start, end)¶ - Variables
file – the ID of the File containing the span
start – the character index of the first character in the span
end – the character index one past the end of the span
Represents a range of characters in a particular file.
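The start and end conventions above (first character, one past the last character) describe a half-open range, which lines up with Python slicing. A minimal sketch, using a hypothetical `SpanSketch` dataclass in place of the real compare50.Span:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanSketch:
    """Hypothetical stand-in for compare50.Span: the range [start, end)."""
    file: int   # ID of the file containing the span
    start: int  # index of the first character in the span
    end: int    # index one past the last character in the span


text = "int main(void)"
span = SpanSketch(file=0, start=4, end=8)

# Half-open indices can be used directly as slice bounds.
snippet = text[span.start:span.end]
# The length of the span is simply end - start.
length = span.end - span.start
```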
-
__init__
(file, start, end) → None¶ Method generated by attrs for class Span.
-
class
compare50.
Submission
(path, files, large_files=NOTHING, undecodable_files=NOTHING, preprocessor=<function Submission.<lambda>>, is_archive=False)¶ - Variables
path – the file path of the submission
files – list of compare50.File objects contained in the submission
preprocessor – a function from tokens to tokens that will be run on each file in the submission
id – integer that uniquely identifies this submission (submissions with the same path will always have the same id).
Represents a single submission. Submissions may either be single files or directories containing many files.
-
__init__
(path, files, large_files=NOTHING, undecodable_files=NOTHING, preprocessor=<function Submission.<lambda>>, is_archive=False) → None¶ Method generated by attrs for class Submission.
-
classmethod
get
(id)¶ Retrieve the submission corresponding to the specified id.
-
class
compare50.
Token
(start, end, type, val)¶ - Variables
start – the character index of the beginning of the token
end – the character index one past the end of the token
type – the Pygments token type
val – the string contents of the token
A result of the lexical analysis of a file. Preprocessors operate on Token streams.
-
__init__
(start, end, type, val) → None¶ Method generated by attrs for class Token.
-
compare50.
compare
(scores, ignored_files, pass_)¶ - Parameters
scores ([compare50.Score]) – scored submission pairs to be compared more granularly
ignored_files ({compare50.File}) – files containing distro code
pass_ (compare50.Pass) – pass whose comparator should be used to compare the submissions
- Returns
Compare50Result objects corresponding to each of the given scores
- Return type
[compare50.Compare50Result]
Performs an in-depth comparison of each submission pair and returns a corresponding list of compare50.Compare50Result objects.
-
compare50.
expand
(span_matches, tokens_a, tokens_b)¶ - Parameters
span_matches ([(compare50.Span, compare50.Span)]) – span pairs to be expanded, wherein the first element of every pair is from the same file and the second element of every pair is from the same file
tokens_a ([compare50.Token]) – the tokens of the file corresponding to the first element of each span_match
tokens_b ([compare50.Token]) – the tokens of the file corresponding to the second element of each span_match
- Returns
A new list of maximally expanded span pairs
- Return type
Expand all span matches. This is useful when e.g. two spans in two different files are identical, but there are tokens before/after these spans that are also identical between the files. This function expands each of these spans to include these additional tokens.
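The idea of expanding a match can be sketched as follows. This is not compare50's implementation: it handles a single pair, works on token index ranges rather than Span objects, and the `Tok` tuple, `tokenize` and `expand_pair` are hypothetical names introduced for the example.

```python
from collections import namedtuple

# Hypothetical simplified token: character offsets plus the token text.
Tok = namedtuple("Tok", ["start", "end", "val"])


def tokenize(text):
    """Split on single spaces, recording character offsets (illustration only)."""
    tokens, pos = [], 0
    for word in text.split(" "):
        tokens.append(Tok(pos, pos + len(word), word))
        pos += len(word) + 1
    return tokens


def expand_pair(tokens_a, tokens_b, range_a, range_b):
    """Grow two matching, non-empty token index ranges [lo, hi) while neighbours agree."""
    (lo_a, hi_a), (lo_b, hi_b) = range_a, range_b
    # Extend to the left while the preceding tokens are identical.
    while lo_a > 0 and lo_b > 0 and tokens_a[lo_a - 1].val == tokens_b[lo_b - 1].val:
        lo_a, lo_b = lo_a - 1, lo_b - 1
    # Extend to the right while the following tokens are identical.
    while (hi_a < len(tokens_a) and hi_b < len(tokens_b)
           and tokens_a[hi_a].val == tokens_b[hi_b].val):
        hi_a, hi_b = hi_a + 1, hi_b + 1
    # Convert the token index ranges back to character ranges.
    chars_a = (tokens_a[lo_a].start, tokens_a[hi_a - 1].end)
    chars_b = (tokens_b[lo_b].start, tokens_b[hi_b - 1].end)
    return chars_a, chars_b


a = tokenize("x = a + b ;")
b = tokenize("y = a + b ; z")
# Suppose only the "+" token (index 3 in both files) was matched initially.
chars_a, chars_b = expand_pair(a, b, (3, 4), (3, 4))
```

Here the initial one-token match grows to cover `= a + b ;` in both files, stopping at `x`/`y` on the left and at the end of the first file on the right.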
-
compare50.
missing_spans
(file, original_tokens=None, processed_tokens=None)¶ - Parameters
file (compare50.File) – file to be examined
original_tokens – the unprocessed tokens of file. May optionally be specified if file has been tokenized elsewhere, to avoid tokenizing it again.
processed_tokens – the result of preprocessing the tokens of file. May optionally be specified if file has been preprocessed elsewhere, to avoid doing so again.
- Returns
The spans of file that were stripped by the preprocessor.
- Return type
Determine which parts of
file
were stripped out by the preprocessor.
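One way to think about this is as the complement of the character ranges that survive preprocessing. The sketch below assumes the processed tokens keep their original character offsets; `uncovered_ranges` is a hypothetical helper for illustration, not part of compare50, and it works on plain `(start, end)` tuples rather than Span objects.

```python
def uncovered_ranges(length, kept):
    """Return the [start, end) ranges within [0, length) not covered by `kept`."""
    missing, cursor = [], 0
    for start, end in sorted(kept):
        if start > cursor:
            # Gap between the last covered position and this kept range.
            missing.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < length:
        # Trailing text after the last kept range.
        missing.append((cursor, length))
    return missing


source = "x = 1  # set x"
# Character ranges of tokens that survived a hypothetical preprocessor
# which strips whitespace and comments: "x", "=", "1".
kept = [(0, 1), (2, 3), (4, 5)]
missing = uncovered_ranges(len(source), kept)
```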
-
compare50.
rank
(submissions, archive_submissions, ignored_files, pass_, n=50)¶ - Parameters
submissions ([compare50.Submission]) – submissions to be ranked
archive_submissions ([compare50.Submission]) – archive submissions to be ranked
ignored_files ({compare50.File}) – files containing distro code
pass_ (compare50.Pass) – pass whose comparator should be used to rank the submissions
n (int) – number of submission pairs to return
- Returns
the top n submission pairs
- Return type
Rank submissions and return the top n most similar pairs.
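Selecting the top n pairs from a list of scores can be sketched with the standard library. This only illustrates the selection step under the assumption that a higher score means more similar; it is not how rank is implemented, and `ScoreSketch` and `top_pairs` are hypothetical names.

```python
import heapq
from collections import namedtuple

# Hypothetical stand-in for compare50.Score (sub_a, sub_b, score).
ScoreSketch = namedtuple("ScoreSketch", ["sub_a", "sub_b", "score"])


def top_pairs(scores, n=50):
    """Return the n highest-scoring pairs, most similar first."""
    return heapq.nlargest(n, scores, key=lambda s: s.score)


scores = [
    ScoreSketch("alice", "bob", 12),
    ScoreSketch("alice", "carol", 40),
    ScoreSketch("bob", "carol", 3),
]
best = top_pairs(scores, n=2)
```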
compare50.passes¶
-
class
compare50.passes.
exact
¶ Removes nothing, not even whitespace, then uses the winnowing algorithm to compare submissions.
-
class
compare50.passes.
misspellings
¶ Compares comments for identically misspelled English words.
-
class
compare50.passes.
nocomments
¶ Removes comments, but keeps whitespace, then uses the winnowing algorithm to compare submissions.
-
class
compare50.passes.
structure
¶ Compares code structure by removing whitespace and comments; normalizing variable names, string literals, and numeric literals; and then running the winnowing algorithm.
-
class
compare50.passes.
text
¶ Removes whitespace, then uses the winnowing algorithm to compare submissions.
compare50.preprocessors¶
-
compare50.preprocessors.
by_character
(tokens)¶ Make a token for each character.
-
compare50.preprocessors.
comments
(tokens)¶ Remove all tokens that aren’t comments.
-
compare50.preprocessors.
extract_identifiers
(tokens)¶ Remove all tokens that don’t represent identifiers.
-
compare50.preprocessors.
normalize_builtin_types
(tokens)¶ Normalize builtin type names.
-
compare50.preprocessors.
normalize_case
(tokens)¶ Make all tokens lower case.
-
compare50.preprocessors.
normalize_identifiers
(tokens)¶ Replace all identifiers with
v
-
compare50.preprocessors.
normalize_numeric_literals
(tokens)¶ Replace numeric literals with their types.
-
compare50.preprocessors.
normalize_string_literals
(tokens)¶ Replace string literals with empty strings.
-
compare50.preprocessors.
split_on_whitespace
(tokens)¶ Split values of tokens on whitespace into new tokens.
-
compare50.preprocessors.
strip_comments
(tokens)¶ Remove all comments from tokens.
-
compare50.preprocessors.
strip_whitespace
(tokens)¶ Remove all whitespace from tokens.
-
compare50.preprocessors.
text_printer
(tokens)¶ Print token values. Useful for debugging.
-
compare50.preprocessors.
token_printer
(tokens)¶ Print each token. Useful for debugging.
-
compare50.preprocessors.
words
(tokens)¶ Split tokens into tokens containing just one word.
compare50.comparators¶
-
class
compare50.comparators.
Misspellings
(dictionary)¶ -
__init__
(dictionary)¶ Initialize self. See help(type(self)) for accurate signature.
-
compare
(scores, ignored_files)¶ Given a list of scores and a set of distro files, perform an in-depth comparison of each submission pair and return a corresponding list of compare50.Comparison objects.
-
score
(submissions, archive_submissions, ignored_files)¶ Number of identically misspelled words.
-
class
compare50.comparators.
Winnowing
(k, t)¶ Comparator utilizing the (robust) Winnowing algorithm as described in https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
- Parameters
k – the noise threshold; any matching sequence of tokens shorter than this will be ignored
t (int) – the guarantee threshold; any matching sequence of tokens of length at least t is guaranteed to be matched
-
__init__
(k, t)¶ Initialize self. See help(type(self)) for accurate signature.
-
compare
(scores, ignored_files)¶ Given a list of scores and a set of distro files, perform an in-depth comparison of each submission pair and return a corresponding list of compare50.Comparison objects.
-
score
(submissions, archive_submissions, ignored_files)¶ Number of matching k-grams.
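To make the roles of k and t concrete, here is a minimal sketch of winnowing fingerprint selection following the linked paper: hash every k-gram, slide a window of w = t - k + 1 hashes, and keep the minimum of each window (the rightmost one on ties). It is an illustration under simplifying assumptions, not compare50's implementation; `kgram_hashes` and `winnow` are hypothetical names, and CRC32 is used only because it gives the same hashes on every run.

```python
import zlib


def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens."""
    return [
        zlib.crc32(" ".join(tokens[i:i + k]).encode())
        for i in range(len(tokens) - k + 1)
    ]


def winnow(tokens, k, t):
    """Select fingerprints as a set of (hash, k-gram index) pairs."""
    hashes = kgram_hashes(tokens, k)
    window = t - k + 1  # window size from the paper: w = t - k + 1
    fingerprints = set()
    for i in range(len(hashes) - window + 1):
        chunk = hashes[i:i + window]
        smallest = min(chunk)
        # On ties, take the rightmost minimum in the window.
        offset = window - 1 - chunk[::-1].index(smallest)
        fingerprints.add((smallest, i + offset))
    return fingerprints


a = "int x = 0 ; while ( x < 10 ) x ++ ;".split()
b = "print ( 1 ) ; while ( x < 10 ) x ++ ; return".split()

# Fingerprint hashes that the two token sequences have in common.
shared = ({h for h, _ in winnow(a, k=3, t=5)}
          & {h for h, _ in winnow(b, k=3, t=5)})
```

The two sequences share a run of ten tokens, which is longer than t = 5, so at least one window of k-gram hashes is identical in both and its minimum is selected on each side. A score in the spirit of "number of matching k-grams" could then be derived from the size of `shared`.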