API docs

compare50

class compare50.Comparator

Abstract base class for compare50 comparators which specify how submissions should be scored and compared.

abstract compare(scores, ignored_files): Given a list of scores and a list of distro files, perform an in-depth comparison of each submission pair and return a corresponding list of compare50.Comparisons

abstract score(submissions, archive_submissions, ignored_files): Given a list of submissions, a list of archive submissions, and a set of distro files, return a list of compare50.Scores for each submission pair.

class compare50.Comparison(sub_a, sub_b, span_matches=NOTHING, ignored_spans=NOTHING)

Variables

sub_a – the first submission
sub_b – the second submission
span_matches – a list of pairs of matching compare50.Spans, wherein the first element of each pair is from sub_a and the second is from sub_b.
ignored_spans – a list of compare50.Spans which were ignored (e.g. because they matched distro files)

Represents an in-depth comparison of two submissions.

__init__(sub_a, sub_b, span_matches=NOTHING, ignored_spans=NOTHING) → None: Method generated by attrs for class Comparison.

exception compare50.Error: Base class for compare50 errors.

class compare50.File(name, submission)

Variables

name – file name (path relative to the submission path)
submission – submission containing this file
id – integer that uniquely identifies this file (files with the same path will always have the same id)

Represents a single file from a submission.

__init__(name, submission) → None: Method generated by attrs for class File.

classmethod get(id): Find File with given id

lexer(): Determine which Pygments lexer should be used.

property path: The full path of the file

read(size=-1): Open file, read size bytes from it, then close it.

tokens(): Returns the preprpocessed tokens of the file.

unprocessed_tokens(): Get the raw tokens of the file.

class compare50.Pass: Abstract base class for compare50 passes, which are essentially ways for compare50 to compare submissions. Subclasses must define a list of preprocessors (functions from tokens to tokens which will be run on every file compare50 recieves) as well as a comparator (used to score and compare the preprocessed submissions).

class compare50.Score(sub_a, sub_b, score=0)

Variables

sub_a – the first submission
sub_b – the second submission
score – a number indicating the similarity between sub_a and sub_b (higher meaning more similar)

A score representing the similarity of two submissions.

__init__(sub_a, sub_b, score=0) → None: Method generated by attrs for class Score.

class compare50.Span(file, start, end)

Variables

file – the ID of the File containing the span
start – the character index of the first character in the span
end – the character index one past the end of the span

Represents a range of characters in a particular file.

__init__(file, start, end) → None: Method generated by attrs for class Span.
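
Because end is one past the last character, a span maps directly onto a Python slice. The following is a minimal sketch using a hypothetical stand-in class, SpanSketch, rather than compare50.Span itself; the sample file content is also made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanSketch:
    """Hypothetical stand-in for compare50.Span (not the library class)."""
    file: int   # ID of the file containing the span
    start: int  # index of the first character in the span
    end: int    # index one past the last character in the span


# Made-up file content for illustration.
content = "int main(void)\n{\n    return 0;\n}\n"
span = SpanSketch(file=0, start=4, end=8)

# A half-open range [start, end) is exactly what slicing expects.
text = content[span.start:span.end]
length = span.end - span.start
```

With this convention the length of a span is simply end - start, and two adjacent spans share a boundary index without overlapping.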

class compare50.Submission(path, files, large_files=NOTHING, undecodable_files=NOTHING, preprocessor=<function Submission.<lambda>>, is_archive=False)

Variables

path – the file path of the submission
files – list of compare50.File objects contained in the submission
preprocessor – A function from tokens to tokens that will be run on each file in the submission
id – integer that uniquely identifies this submission (submissions with the same path will always have the same id).

Represents a single submission. Submissions may either be single files or directories containing many files.

__init__(path, files, large_files=NOTHING, undecodable_files=NOTHING, preprocessor=<function Submission.<lambda>>, is_archive=False) → None: Method generated by attrs for class Submission.

classmethod get(id): Retrieve submission corresponding to specified id

class compare50.Token(start, end, type, val)

Variables

start – the character index of the beginning of the token
end – the character index one past the end of the token
type – the Pygments token type
val – the string contents of the token

A result of the lexical analysis of a file. Preprocessors operate on Token streams.

__init__(start, end, type, val) → None: Method generated by attrs for class Token.
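
To illustrate what "a function from tokens to tokens" can look like, here is a minimal sketch of two preprocessors chained over a token stream. TokenSketch is a hypothetical stand-in for compare50.Token, its type field holds plain strings rather than Pygments token types, and drop_whitespace, lower_case, and run_preprocessors are illustrative names, not part of compare50.

```python
from collections import namedtuple

# Hypothetical stand-in for compare50.Token; "type" is a plain string here,
# whereas the real class stores a Pygments token type.
TokenSketch = namedtuple("TokenSketch", ["start", "end", "type", "val"])


def drop_whitespace(tokens):
    """Yield only the tokens whose value is not pure whitespace."""
    for tok in tokens:
        if tok.val.strip():
            yield tok


def lower_case(tokens):
    """Yield copies of the tokens with lower-cased values."""
    for tok in tokens:
        yield tok._replace(val=tok.val.lower())


def run_preprocessors(tokens, preprocessors):
    """Feed the token stream through each preprocessor in order."""
    for preprocessor in preprocessors:
        tokens = preprocessor(tokens)
    return list(tokens)


tokens = [
    TokenSketch(0, 3, "Name", "Foo"),
    TokenSketch(3, 4, "Whitespace", " "),
    TokenSketch(4, 5, "Operator", "="),
    TokenSketch(5, 6, "Whitespace", " "),
    TokenSketch(6, 7, "Number", "1"),
]
processed = run_preprocessors(tokens, [drop_whitespace, lower_case])
```

Note that the surviving tokens keep their original start and end indices, so they can still be mapped back to positions in the unprocessed file.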

compare50.compare(scores, ignored_files, pass_)

Parameters

scores ([compare50.Score]) – Scored submission pairs to be compared more granularly
ignored_files ({compare50.File}) – files containing distro code
pass_ (compare50.Pass) – pass whose comparator should be used to compare the submissions

Returns

Compare50Results corresponding to each of the given scores

Return type

[compare50.Compare50Result]

Performs an in-depth comparison of each submission pair and returns a corresponding list of compare50.Compare50Results.

compare50.expand(span_matches, tokens_a, tokens_b)

Parameters

span_matches ([(compare50.Span, compare50.Span)]) – span pairs to be expanded wherein the first element of every pair is from the same file and the second element of every pair is from the same file
tokens_a ([compare50.Token]) – the tokens of the file corresponding to the first element of each span_match
tokens_b ([compare50.Token]) – the tokens of the file corresponding to the second element of each span_match

Returns

A new list of maximally expanded span pairs

Return type

[(compare50.Span, compare50.Span)]

Expand all span matches. This is useful when e.g. two spans in two different files are identical, but there are tokens before/after these spans that are also identical between the files. This function expands each of these spans to include these additional tokens.
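
The expansion idea can be sketched on plain token values. This is a simplified illustration, not the library's implementation: it works on token indices instead of compare50.Spans, handles a single match, and expand_match is a hypothetical helper name.

```python
def expand_match(vals_a, vals_b, a_start, a_end, b_start, b_end):
    """Grow a matching half-open token range in both sequences for as long
    as the neighbouring tokens on either side are also equal."""
    # Expand to the left while the preceding tokens match.
    while a_start > 0 and b_start > 0 and vals_a[a_start - 1] == vals_b[b_start - 1]:
        a_start -= 1
        b_start -= 1
    # Expand to the right while the following tokens match.
    while a_end < len(vals_a) and b_end < len(vals_b) and vals_a[a_end] == vals_b[b_end]:
        a_end += 1
        b_end += 1
    return (a_start, a_end), (b_start, b_end)


vals_a = ["x", "=", "a", "+", "b", ";", "print"]
vals_b = ["y", "=", "a", "+", "b", ";", "return"]

# The initial match covers only "a + b" (token indices 2, 3, 4) in both files.
expanded = expand_match(vals_a, vals_b, 2, 5, 2, 5)
```

Here the match grows by one token on each side ("=" and ";") and stops where the files differ.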

compare50.missing_spans(file, original_tokens=None, processed_tokens=None)

Parameters

file (compare50.File) – file to be examined
original_tokens – the unprocessed tokens of file. May optionally be specified if file has been tokenized elsewhere to avoid tokenizing it again.
processed_tokens – the result of preprocessing the tokens of file. May optionally be specified if file has been preprocessed elsewhere to avoid doing so again.

Returns

The spans of file that were stripped by the preprocessor.

Return type

[compare50.Span]

Determine which parts of file were stripped out by the preprocessor.
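
One way to think about this is as finding the character ranges that no surviving token covers. The sketch below assumes that model; uncovered_ranges is a hypothetical helper, the ranges are plain (start, end) tuples rather than compare50.Spans, and the actual function may compute its result differently.

```python
def uncovered_ranges(length, kept):
    """Return the half-open (start, end) ranges within [0, length) that are
    not covered by any (start, end) pair in kept."""
    gaps = []
    cursor = 0  # first character index not yet accounted for
    for start, end in sorted(kept):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < length:
        gaps.append((cursor, length))
    return gaps


source = "x = 1  # set x\n"

# Character ranges of the tokens "x", "=", and "1", as they might survive a
# preprocessor that strips whitespace and comments (made-up data).
kept = [(0, 1), (2, 3), (4, 5)]
gaps = uncovered_ranges(len(source), kept)
```

In this example the two single spaces and the trailing run of whitespace, comment, and newline are reported as stripped.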

compare50.rank(submissions, archive_submissions, ignored_files, pass_, n=50)

Parameters

submissions ([compare50.Submission]) – submissions to be ranked
archive_submissions ([compare50.Submission]) – archive submissions to be ranked
ignored_files ({compare50.File}) – files containing distro code
pass_ (compare50.Pass) – pass whose comparator should be used to rank the submissions
n (int) – number of submission pairs to return

Returns

the top n submission pairs

Return type

[compare50.Score]

Rank submissions and return the top n most similar pairs.
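
Selecting the top n pairs from a list of scores can be sketched with the standard library. How rank selects pairs internally is not documented here, so this is only an assumption about the general shape; ScoreSketch is a hypothetical stand-in for compare50.Score, and top_n is an illustrative name.

```python
import heapq
from collections import namedtuple

# Hypothetical stand-in for compare50.Score.
ScoreSketch = namedtuple("ScoreSketch", ["sub_a", "sub_b", "score"])


def top_n(scores, n=50):
    """Return the n highest-scoring pairs, most similar first."""
    return heapq.nlargest(n, scores, key=lambda s: s.score)


# Made-up scores for three submission pairs.
scores = [
    ScoreSketch("alice", "bob", 12),
    ScoreSketch("alice", "carol", 40),
    ScoreSketch("bob", "carol", 3),
]
best = top_n(scores, n=2)
```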

compare50.passes

class compare50.passes.exact: Removes nothing, not even whitespace, then uses the winnowing algorithm to compare submissions.

class compare50.passes.misspellings: Compares comments for identically misspelled English words.

class compare50.passes.nocomments: Removes comments, but keeps whitespace, then uses the winnowing algorithm to compare submissions.

class compare50.passes.structure: Compares code structure by removing whitespace and comments; normalizing variable names, string literals, and numeric literals; and then running the winnowing algorithm.

class compare50.passes.text: Removes whitespace, then uses the winnowing algorithm to compare submissions.

compare50.preprocessors

compare50.preprocessors.by_character(tokens): Make a token for each character.

compare50.preprocessors.comments(tokens): Remove all tokens that aren’t comments.

compare50.preprocessors.extract_identifiers(tokens): Remove all tokens that don’t represent identifiers.

compare50.preprocessors.normalize_builtin_types(tokens): Normalize builtin type names

compare50.preprocessors.normalize_case(tokens): Make all tokens lower case.

compare50.preprocessors.normalize_identifiers(tokens): Replace all identifiers with v

compare50.preprocessors.normalize_numeric_literals(tokens): Replace numeric literals with their types.

compare50.preprocessors.normalize_string_literals(tokens): Replace string literals with empty strings.

compare50.preprocessors.split_on_whitespace(tokens): Split values of tokens on whitespace into new tokens

compare50.preprocessors.strip_comments(tokens): Remove all comments from tokens.

compare50.preprocessors.strip_whitespace(tokens): Remove all whitespace from tokens.

compare50.preprocessors.text_printer(tokens): Print token values. Useful for debugging.

compare50.preprocessors.token_printer(tokens): Print each token. Useful for debugging.

compare50.preprocessors.words(tokens): Split tokens into tokens containing just one word.

compare50.comparators

class compare50.comparators.Misspellings(dictionary)

__init__(dictionary)

compare(scores, ignored_files): Given a list of scores and a list of distro files, perform an in-depth comparison of each submission pair and return a corresponding list of compare50.Comparisons

score(submissions, archive_submissions, ignored_files): Number of identically misspelled words.

class compare50.comparators.Winnowing(k, t)

Comparator utilizing the (robust) Winnowing algorithm as described in https://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf

Parameters

k (int) – the noise threshold; any matching sequence of tokens shorter than this will be ignored
t (int) – the guarantee threshold; any matching sequence of tokens of length at least t is guaranteed to be matched

__init__(k, t)

compare(scores, ignored_files): Given a list of scores and a list of distro files, perform an in-depth comparison of each submission pair and return a corresponding list of compare50.Comparisons

score(submissions, archive_submissions, ignored_files): Number of matching k-grams.