API docs
compare50
- class compare50.Comparator
Abstract base class for compare50 comparators, which specify how submissions should be scored and compared.
- abstract compare(scores, ignored_files)
Given a list of scores and a list of distro files, perform an in-depth comparison of each submission pair and return a corresponding list of compare50.Comparison objects.
- abstract score(submissions, archive_submissions, ignored_files)
Given a list of submissions, a list of archive submissions, and a set of distro files, return a list of compare50.Score objects, one for each submission pair.
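The two-stage contract described above (a cheap score for every pair, then an in-depth compare for the pairs that matter) can be illustrated with a small standalone sketch. It does not import compare50: `ComparatorSketch` and `SharedLineComparator` are hypothetical stand-ins, submissions are plain strings, and the "distro files" are simplified to a set of lines to ignore.

```python
from abc import ABC, abstractmethod


class ComparatorSketch(ABC):
    """Hypothetical stand-in for compare50.Comparator (not the real class)."""

    @abstractmethod
    def score(self, submissions, archive_submissions, ignored_files):
        """Cheap first stage: return one score per submission pair."""

    @abstractmethod
    def compare(self, scores, ignored_files):
        """Expensive second stage: return one detailed result per score."""


class SharedLineComparator(ComparatorSketch):
    """Toy comparator: similarity is the number of shared, non-ignored lines."""

    def score(self, submissions, archive_submissions, ignored_files):
        ignored = set(ignored_files)
        # Submissions are paired with later submissions and with every archive
        # submission; archive submissions are never paired with each other.
        candidates = list(submissions) + list(archive_submissions)
        scores = []
        for i, sub_a in enumerate(submissions):
            lines_a = set(sub_a.splitlines()) - ignored
            for sub_b in candidates[i + 1:]:
                lines_b = set(sub_b.splitlines()) - ignored
                scores.append((sub_a, sub_b, len(lines_a & lines_b)))
        return scores

    def compare(self, scores, ignored_files):
        ignored = set(ignored_files)
        results = []
        for sub_a, sub_b, _ in scores:
            shared = set(sub_a.splitlines()) & set(sub_b.splitlines())
            results.append((sub_a, sub_b, sorted(shared - ignored)))
        return results
```

In this sketch the pairing rule (archive submissions are compared against submissions but not against each other) is an assumption made for illustration; the source only states that both lists are passed to `score`.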
- class compare50.Comparison(sub_a, sub_b, span_matches=NOTHING, ignored_spans=NOTHING)
- Variables
sub_a – the first submission
sub_b – the second submission
span_matches – a list of pairs of matching compare50.Span objects, wherein the first element of each pair is from sub_a and the second is from sub_b
ignored_spans – a list of compare50.Span objects which were ignored (e.g. because they matched distro files)
Represents an in-depth comparison of two submissions.
- __init__(sub_a, sub_b, span_matches=NOTHING, ignored_spans=NOTHING) → None
Method generated by attrs for class Comparison.
- exception compare50.Error
Base class for compare50 errors.
- class compare50.File(name, submission)
- Variables
name – file name (path relative to the submission path)
submission – submission containing this file
id – integer that uniquely identifies this file (files with the same path will always have the same id)
Represents a single file from a submission.
- __init__(name, submission) → None
Method generated by attrs for class File.
- classmethod get(id)
Find File with given id
- lexer()
Determine which Pygments lexer should be used.
- property path
The full path of the file
- read(size=-1)
Open file, read size bytes from it, then close it.
- tokens()
Returns the preprocessed tokens of the file.
- unprocessed_tokens()
Get the raw tokens of the file.
- class compare50.Pass
Abstract base class for compare50 passes, which are essentially ways for compare50 to compare submissions. Subclasses must define a list of preprocessors (functions from tokens to tokens which will be run on every file compare50 receives) as well as a comparator (used to score and compare the preprocessed submissions).
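As a rough illustration of how a pass bundles preprocessors with a comparator, the standalone sketch below chains token-to-token functions in order. It does not use compare50: `PassSketch`, its attribute names `preprocessors` and `comparator`, and the `preprocess` helper are assumptions for this example, and tokens are simplified to plain strings.

```python
from functools import reduce


def lowercase(tokens):
    """Toy preprocessor: lower-case every token."""
    return [t.lower() for t in tokens]


def drop_blank(tokens):
    """Toy preprocessor: drop tokens that contain only whitespace."""
    return [t for t in tokens if t.strip()]


class PassSketch:
    """Hypothetical stand-in for a pass: a list of preprocessors plus a comparator."""
    preprocessors = []   # assumed attribute name
    comparator = None    # assumed attribute name

    @classmethod
    def preprocess(cls, tokens):
        """Run each preprocessor in order, feeding each one the previous output."""
        return reduce(lambda toks, step: step(toks), cls.preprocessors, tokens)


class CaseInsensitive(PassSketch):
    preprocessors = [drop_blank, lowercase]
    comparator = "placeholder"  # a real pass would supply a comparator object here


print(CaseInsensitive.preprocess(["Int", " ", "MAIN", "\n"]))  # ['int', 'main']
```

The order of the list matters: each preprocessor sees only what the previous one let through.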
- class compare50.Score(sub_a, sub_b, score=0)
- Variables
sub_a – the first submission
sub_b – the second submission
score – a number indicating the similarity between sub_a and sub_b (higher meaning more similar)
A score representing the similarity of two submissions.
- __init__(sub_a, sub_b, score=0) → None
Method generated by attrs for class Score.
- class compare50.Span(file, start, end)
- Variables
file – the ID of the File containing the span
start – the character index of the first character in the span
end – the character index one past the end of the span
Represents a range of characters in a particular file.
- __init__(file, start, end) → None
Method generated by attrs for class Span.
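Because end is one past the last character, a span maps directly onto a Python slice. The sketch below shows this with a hypothetical `SpanSketch` dataclass and a `contents` dictionary from file ID to text; neither is part of compare50.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanSketch:
    """Hypothetical stand-in for compare50.Span; `file` holds a file ID."""
    file: int
    start: int  # index of the first character in the span
    end: int    # index one past the last character in the span


def span_text(span, contents):
    """Return the characters covered by `span`, given a {file_id: text} mapping."""
    return contents[span.file][span.start:span.end]


contents = {0: "int main(void) { return 0; }"}
span = SpanSketch(file=0, start=4, end=8)
print(span_text(span, contents))  # main
```

With this convention the length of a span is simply `end - start`, and two spans are adjacent when one's end equals the other's start.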
- class compare50.Submission(path, files, large_files=NOTHING, undecodable_files=NOTHING, preprocessor=<function Submission.<lambda>>, is_archive=False)
- Variables
path – the file path of the submission
files – list of compare50.File objects contained in the submission
preprocessor – a function from tokens to tokens that will be run on each file in the submission
id – integer that uniquely identifies this submission (submissions with the same path will always have the same id).
Represents a single submission. Submissions may either be single files or directories containing many files.
- __init__(path, files, large_files=NOTHING, undecodable_files=NOTHING, preprocessor=<function Submission.<lambda>>, is_archive=False) → None
Method generated by attrs for class Submission.
- classmethod get(id)
Retrieve submission corresponding to specified id
- class compare50.Token(start, end, type, val)
- Variables
start – the character index of the beginning of the token
end – the character index one past the end of the token
type – the Pygments token type
val – the string contents of the token
A result of the lexical analysis of a file. Preprocessors operate on Token streams.
- __init__(start, end, type, val) → None
Method generated by attrs for class Token.
- compare50.compare(scores, ignored_files, pass_)
- Parameters
scores ([compare50.Score]) – scored submission pairs to be compared more granularly
ignored_files ({compare50.File}) – files containing distro code
pass_ (compare50.Pass) – pass whose comparator should be used to compare the submissions
- Returns
compare50.Compare50Result objects corresponding to each of the given scores
- Return type
[compare50.Compare50Result]
Performs an in-depth comparison of each submission pair and returns a corresponding list of compare50.Compare50Result objects.
- compare50.expand(span_matches, tokens_a, tokens_b)
- Parameters
span_matches ([(compare50.Span, compare50.Span)]) – span pairs to be expanded, wherein the first element of every pair is from the same file and the second element of every pair is from the same file
tokens_a ([compare50.Token]) – the tokens of the file corresponding to the first element of each span_match
tokens_b ([compare50.Token]) – the tokens of the file corresponding to the second element of each span_match
- Returns
A new list of maximally expanded span pairs
- Return type
Expand all span matches. This is useful when e.g. two spans in two different files are identical, but there are tokens before/after these spans that are also identical between the files. This function expands each of these spans to include these additional tokens.
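The expansion idea can be sketched on plain token lists. This is a simplified illustration rather than compare50's implementation: matches are given as token-index ranges `(start_a, end_a, start_b, end_b)` with exclusive ends instead of compare50.Span pairs, and the function names are hypothetical.

```python
def expand_match(match, tokens_a, tokens_b):
    """Grow one match outward while the neighbouring tokens are equal."""
    start_a, end_a, start_b, end_b = match
    # Extend to the left while the preceding tokens agree.
    while start_a > 0 and start_b > 0 and tokens_a[start_a - 1] == tokens_b[start_b - 1]:
        start_a -= 1
        start_b -= 1
    # Extend to the right while the following tokens agree.
    while end_a < len(tokens_a) and end_b < len(tokens_b) and tokens_a[end_a] == tokens_b[end_b]:
        end_a += 1
        end_b += 1
    return (start_a, end_a, start_b, end_b)


def expand_all(matches, tokens_a, tokens_b):
    """Expand every match and merge those that grew into the same range."""
    return sorted({expand_match(m, tokens_a, tokens_b) for m in matches})


tokens_a = ["x", "=", "a", "+", "b", ";"]
tokens_b = ["y", "=", "a", "+", "b", "!"]
print(expand_all([(2, 3, 2, 3), (3, 4, 3, 4)], tokens_a, tokens_b))  # [(1, 5, 1, 5)]
```

Both input matches grow into the same range here, which is why collecting the results in a set leaves a single expanded match.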
- compare50.missing_spans(file, original_tokens=None, processed_tokens=None)
- Parameters
file (compare50.File) – file to be examined
original_tokens – the unprocessed tokens of file. May optionally be specified if file has been tokenized elsewhere, to avoid tokenizing it again.
processed_tokens – the result of preprocessing the tokens of file. May optionally be specified if file has been preprocessed elsewhere, to avoid doing so again.
- Returns
The spans of file that were stripped by the preprocessor.
- Return type
Determine which parts of file were stripped out by the preprocessor.
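One way to think about the stripped regions is as the gaps between the character ranges of the tokens that survived preprocessing. The sketch below computes those gaps under that assumption; `missing_ranges` is a hypothetical helper that works on `(start, end)` pairs with exclusive ends, and it may differ from how compare50 derives the spans.

```python
def missing_ranges(file_length, processed_tokens):
    """Return (start, end) character ranges not covered by any processed token.

    `processed_tokens` is an iterable of (start, end) pairs, end-exclusive.
    """
    missing = []
    cursor = 0  # first character not yet known to be covered
    for start, end in sorted(processed_tokens):
        if start > cursor:
            missing.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < file_length:
        missing.append((cursor, file_length))
    return missing


text = "a = 1  # note"
kept = [(0, 1), (2, 3), (4, 5)]  # the tokens "a", "=" and "1"
print(missing_ranges(len(text), kept))  # [(1, 2), (3, 4), (5, 13)]
```

In this example the whitespace between tokens and the trailing comment show up as the missing ranges.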
- compare50.rank(submissions, archive_submissions, ignored_files, pass_, n=50)
- Parameters
submissions ([compare50.Submission]) – submissions to be ranked
archive_submissions ([compare50.Submission]) – archive submissions to be ranked
ignored_files ({compare50.File}) – files containing distro code
pass_ (compare50.Pass) – pass whose comparator should be used to rank the submissions
n (int) – number of submission pairs to return
- Returns
the top n submission pairs
- Return type
Rank submissions, return the top n most similar pairs.
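The final selection step of ranking (keeping only the n highest-scoring pairs) can be sketched with the standard library. The example covers only that step, not the scoring that precedes it; `ScoreSketch` and `top_pairs` are hypothetical names and not part of compare50.

```python
import heapq
from collections import namedtuple

# Hypothetical stand-in for compare50.Score.
ScoreSketch = namedtuple("ScoreSketch", ["sub_a", "sub_b", "score"])


def top_pairs(scores, n=50):
    """Return the n highest-scoring pairs, most similar first."""
    return heapq.nlargest(n, scores, key=lambda s: s.score)


scores = [
    ScoreSketch("alice", "bob", 3),
    ScoreSketch("alice", "carol", 10),
    ScoreSketch("bob", "carol", 7),
]
for pair in top_pairs(scores, n=2):
    print(pair.sub_a, pair.sub_b, pair.score)
```

`heapq.nlargest` avoids sorting the full list, which helps when the number of pairs is much larger than n.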
compare50.passes
- class compare50.passes.exact
Removes nothing, not even whitespace, then uses the winnowing algorithm to compare submissions.
- class compare50.passes.misspellings
Compares comments for identically misspelled English words.
- class compare50.passes.nocomments
Removes comments, but keeps whitespace, then uses the winnowing algorithm to compare submissions.
- class compare50.passes.structure
Compares code structure by removing whitespace and comments; normalizing variable names, string literals, and numeric literals; and then running the winnowing algorithm.
- class compare50.passes.text
Removes whitespace, then uses the winnowing algorithm to compare submissions.
compare50.preprocessors
- compare50.preprocessors.by_character(tokens)
Make a token for each character.
- compare50.preprocessors.comments(tokens)
Remove all tokens that aren’t comments.
- compare50.preprocessors.extract_identifiers(tokens)
Remove all tokens that don’t represent identifiers.
- compare50.preprocessors.normalize_builtin_types(tokens)
Normalize builtin type names
- compare50.preprocessors.normalize_case(tokens)
Make all tokens lower case.
- compare50.preprocessors.normalize_identifiers(tokens)
Replace all identifiers with v.
- compare50.preprocessors.normalize_numeric_literals(tokens)
Replace numeric literals with their types.
- compare50.preprocessors.normalize_string_literals(tokens)
Replace string literals with empty strings.
- compare50.preprocessors.split_on_whitespace(tokens)
Split values of tokens on whitespace into new tokens
- compare50.preprocessors.strip_comments(tokens)
Remove all comments from tokens.
- compare50.preprocessors.strip_whitespace(tokens)
Remove all whitespace from tokens.
- compare50.preprocessors.text_printer(tokens)
Print token values. Useful for debugging.
- compare50.preprocessors.token_printer(tokens)
Print each token. Useful for debugging.
- compare50.preprocessors.words(tokens)
Split tokens into tokens containing just one word.
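To make the "tokens in, tokens out" shape of these functions concrete, here is a standalone sketch of two preprocessors in the spirit of strip_whitespace and normalize_identifiers. It does not call compare50: `TokenSketch` is a hypothetical stand-in for compare50.Token, its type field is a plain string rather than a Pygments token type, and the real preprocessors may behave differently in detail.

```python
from collections import namedtuple

# Hypothetical stand-in for compare50.Token.
TokenSketch = namedtuple("TokenSketch", ["start", "end", "type", "val"])


def strip_whitespace_sketch(tokens):
    """Yield only the tokens whose value is not pure whitespace."""
    for tok in tokens:
        if tok.val.strip():
            yield tok


def normalize_identifiers_sketch(tokens):
    """Replace the value of every identifier token with "v", keeping its position."""
    for tok in tokens:
        yield tok._replace(val="v") if tok.type == "Name" else tok


# Tokens for the source text "total = count".
tokens = [
    TokenSketch(0, 5, "Name", "total"),
    TokenSketch(5, 6, "Whitespace", " "),
    TokenSketch(6, 7, "Operator", "="),
    TokenSketch(7, 8, "Whitespace", " "),
    TokenSketch(8, 13, "Name", "count"),
]
processed = list(normalize_identifiers_sketch(strip_whitespace_sketch(tokens)))
print([tok.val for tok in processed])  # ['v', '=', 'v']
```

Keeping start and end untouched while rewriting val is what lets a match found on normalized tokens be mapped back to character ranges in the original file.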
compare50.comparators
- class compare50.comparators.Misspellings(dictionary)
- __init__(dictionary)
- compare(scores, ignored_files)
Given a list of scores and a list of distro files, perform an in-depth comparison of each submission pair and return a corresponding list of compare50.Comparison objects.
- score(submissions, archive_submissions, ignored_files)
Number of identically misspelled words.
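The scoring idea (count the words that both submissions misspell in the same way) can be sketched with sets. This is an illustration under simplifying assumptions, not the comparator's code: the dictionary is a plain set of lower-case words, comparison is case-insensitive, and `misspelled` and `shared_misspellings` are hypothetical helpers.

```python
def misspelled(words, dictionary):
    """Return the lower-cased words that do not appear in the dictionary."""
    return {w.lower() for w in words if w.lower() not in dictionary}


def shared_misspellings(words_a, words_b, dictionary):
    """Return the misspellings that occur in both word lists."""
    return misspelled(words_a, dictionary) & misspelled(words_b, dictionary)


dictionary = {"the", "loop", "counter", "is", "a"}
comment_a = ["teh", "loop", "countr"]
comment_b = ["Teh", "counter", "iz"]
shared = shared_misspellings(comment_a, comment_b, dictionary)
print(len(shared), sorted(shared))  # 1 ['teh']
```

Only "teh" counts here: "countr" and "iz" are misspelled too, but each appears in just one of the two comments.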
- class compare50.comparators.Winnowing(k, t)
Comparator utilizing the (robust) Winnowing algorithm as described in https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
- Parameters
k – the noise threshold; any matching sequence of tokens shorter than this will be ignored
t (int) – the guarantee threshold; any matching sequence of tokens of length at least t is guaranteed to be matched
- __init__(k, t)
- compare(scores, ignored_files)
Given a list of scores and a list of distro files, perform an in-depth comparison of each submission pair and return a corresponding list of compare50.Comparison objects.
- score(submissions, archive_submissions, ignored_files)
Number of matching k-grams.
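The roles of k and t can be seen in a minimal sketch of basic winnowing as described in the cited paper: hash every k-gram, slide a window of w = t - k + 1 hashes, and keep the minimum of each window. This is not compare50's implementation. It uses the plain variant, where ties go to the rightmost minimum, rather than the robust variant, whose tie-breaking differs; the function names and the use of `zlib.crc32` as the hash are assumptions for illustration.

```python
import zlib


def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens; returns (hash, position) pairs."""
    return [
        (zlib.crc32("\x00".join(tokens[i:i + k]).encode("utf-8")), i)
        for i in range(len(tokens) - k + 1)
    ]


def winnow(tokens, k, t):
    """Pick one fingerprint from every window of w = t - k + 1 k-gram hashes."""
    if t < k:
        raise ValueError("t must be at least k")
    hashes = kgram_hashes(tokens, k)
    w = t - k + 1
    fingerprints = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        # The smallest hash wins; ties go to the rightmost position.
        fingerprints.add(min(window, key=lambda hp: (hp[0], -hp[1])))
    return fingerprints


def shared_fingerprints(tokens_a, tokens_b, k, t):
    """Count hash values selected as fingerprints in both token lists."""
    hashes_a = {h for h, _ in winnow(tokens_a, k, t)}
    hashes_b = {h for h, _ in winnow(tokens_b, k, t)}
    return len(hashes_a & hashes_b)


shared_run = list("abcdefgh")
tokens_a = ["1", "2"] + shared_run + ["3"]
tokens_b = ["9"] + shared_run + ["8", "7"]
print(shared_fingerprints(tokens_a, tokens_b, k=3, t=5))
```

Because the two lists share a run of eight tokens, which is longer than t = 5, at least one window lies entirely inside the shared run in both lists and selects the same hash, so the count printed is at least 1. Inputs shorter than t tokens produce no complete window and therefore no fingerprints in this sketch.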