Reference

`Crossandra`

class Crossandra(
    token_source: type[Enum] = Empty,
    *,
    convert_crlf: bool = True,
    ignore_whitespace: bool = False,
    ignored_characters: str = "",
    rules: list[Rule[Any] | RuleGroup] | None = None,
    suppress_unknown: bool = False,
)

The core class representing a Crossandra tokenizer. Takes the following arguments:

token_source: an enum containing all possible tokens (defaults to an empty enum)
convert_crlf: whether \r\n should be converted to \n before tokenization (defaults to True)
ignored_characters: a string of characters to ignore (defaults to "")
ignore_whitespace: whether spaces, tabs, newlines etc. should be ignored (defaults to False)
suppress_unknown: whether unknown-token errors should be suppressed (defaults to False)
rules: a list of additional rules and rule groups to use (defaults to None)

The enum takes priority over the rule list.
Rules are prioritized in the order they appear in the list: earlier rules take priority over later ones.

A token enum member can use a tuple of strings as its value; each string then acts as an alias for that token:

class MarkdownStyle(Enum):
    BOLD = "**"
    ITALIC = ("_", "*")
    UNDERLINE = "__"
    STRIKETHROUGH = "~~"
    CODE = ("`", "``")


print(
    *Crossandra(MarkdownStyle, ignore_whitespace=True).tokenize("* ** _ __"),
    sep="\n"
)
# <MarkdownStyle.ITALIC: ('_', '*')>
# <MarkdownStyle.BOLD: '**'>
# <MarkdownStyle.ITALIC: ('_', '*')>
# <MarkdownStyle.UNDERLINE: '__'>
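
One way to picture alias tuples is as a flat lookup from every alias string to the enum member that owns it. The sketch below uses only the standard library; build_alias_table is a hypothetical helper, not part of Crossandra, and the library may resolve aliases differently.

```python
from enum import Enum


class MarkdownStyle(Enum):
    BOLD = "**"
    ITALIC = ("_", "*")
    UNDERLINE = "__"
    STRIKETHROUGH = "~~"
    CODE = ("`", "``")


def build_alias_table(token_source: type[Enum]) -> dict[str, Enum]:
    """Map every alias string to its enum member (hypothetical helper)."""
    table: dict[str, Enum] = {}
    for member in token_source:
        # A plain string is a single alias; a tuple holds several.
        aliases = (member.value,) if isinstance(member.value, str) else member.value
        for alias in aliases:
            table[alias] = member
    return table


table = build_alias_table(MarkdownStyle)
print(table["*"] is table["_"])  # True: both aliases resolve to ITALIC
```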

`Crossandra.tokenize`

def tokenize(self, code: str) -> list[Enum | Any]

Tokenizes the input string. Returns a list of tokens.

`Crossandra.tokenize_lines`

def tokenize_lines(self, code: str) -> list[list[Enum | Any]]

Tokenizes the input string line by line. Returns a nested list of tokens, where each inner list corresponds to a consecutive line of the input string. Equivalent to [self.tokenize(line) for line in code.splitlines()].
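
The equivalence above can be illustrated without the library. In this sketch, tokenize is a stand-in that simply splits on whitespace (an assumption made for illustration, not Crossandra's behavior); only the line-by-line structure is the point.

```python
def tokenize(code: str) -> list[str]:
    # Stand-in tokenizer (assumption): split on whitespace.
    return code.split()


def tokenize_lines(code: str) -> list[list[str]]:
    # One inner list per line; str.splitlines() drops the line terminators.
    return [tokenize(line) for line in code.splitlines()]


print(tokenize_lines("a b\n\nc"))  # [['a', 'b'], [], ['c']]
```

With this stand-in, an empty line yields an empty inner list, so the outer list keeps one entry per input line.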

Fast Mode

When all tokens are of length 1 and there are no additional rules, Crossandra uses a simpler tokenization method (the so-called Fast Mode).
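
To see why single-character tokens permit a simpler method, consider a plain per-character dictionary lookup. This is a minimal sketch of the idea, not Crossandra's actual implementation: the Op enum and fast_tokenize are hypothetical names, and ValueError stands in for whatever error the library raises on unknown tokens.

```python
from enum import Enum


class Op(Enum):  # hypothetical single-character token enum
    INC = "+"
    DEC = "-"
    LEFT = "<"
    RIGHT = ">"


def fast_tokenize(code: str, suppress_unknown: bool = False) -> list[Op]:
    # Every token is exactly one character, so one dict lookup per character suffices.
    table = {member.value: member for member in Op}
    tokens: list[Op] = []
    for char in code:
        token = table.get(char)
        if token is not None:
            tokens.append(token)
        elif not suppress_unknown:
            # Stand-in error type (assumption).
            raise ValueError(f"unknown token: {char!r}")
    return tokens


print(fast_tokenize("+a>", suppress_unknown=True))  # [<Op.INC: '+'>, <Op.RIGHT: '>'>]
```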

Example

Tokenizing noisy Brainfuck code (BrainfuckToken taken from examples)

(tested on MacBook Air M1 (256/16) with pure Python wheels)

# Setup
from random import choices
from string import punctuation

program = "".join(choices(punctuation, k=...))
tokenizer = Crossandra(BrainfuckToken, suppress_unknown=True)

log10(k)	Default	Fast Mode	Speedup
1	40µs	20µs	100%
2	160µs	30µs	433%
3	1.5ms	130µs	1,054%
4	14ms	900µs	1,456%
5	290ms	9ms	3,122%
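
The Speedup column appears to follow (default / fast - 1) × 100, rounded to the nearest integer. That formula is inferred from the figures rather than stated in the source; the sketch below recomputes the column from the listed timings, converted to microseconds.

```python
def speedup_percent(default_us: float, fast_us: float) -> int:
    # Relative speedup in percent: 100% means twice as fast.
    return round((default_us / fast_us - 1) * 100)


# (Default, Fast Mode) timings from the table, in microseconds.
rows = [(40, 20), (160, 30), (1_500, 130), (14_000, 900), (290_000, 9_000)]
print([speedup_percent(d, f) for d, f in rows])  # [100, 433, 1054, 1456, 3122]
```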

Rules and rule groups

`Rule`

class Rule[T](
    pattern: Pattern[str] | str,
    converter: Callable[[str], T] | None = None,
    *,
    flags: RegexFlag | int = 0,
    ignore: bool = False,
)

Used for defining custom rules. pattern is the regex to match, given either as a string or as a compiled Pattern; regex flags can be supplied through flags. If a converter is supplied, it is called with the matched substring and its result becomes the token; with the default of None, the matched string is returned directly. When ignore is True, the matched substring is excluded from the output.

Rule objects are hashable and comparable and can be ORed (|) for grouping with other Rules and RuleGroups.

`Rule.apply`

def apply(self, target: str) -> tuple[T | str | Ignored, int] | NotApplied

Checks if target matches the Rule's pattern. If it does, returns a tuple of two elements. The first element is

if ignore=True: the Ignored sentinel
if converter=None: the matched substring
otherwise: the result of calling the Rule's converter on the matched substring

The second element is the length of the matched substring. If target doesn't match, returns the NotApplied sentinel.
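
The return contract described above can be sketched with the standard re module. This is an illustration, not the library's code: apply_rule, IGNORED and NOT_APPLIED are hypothetical stand-ins, and matching at the start of target (re.match semantics) is an assumption.

```python
import re
from typing import Any, Callable, Optional

IGNORED = object()      # stand-in for the Ignored sentinel
NOT_APPLIED = object()  # stand-in for the NotApplied sentinel


def apply_rule(
    pattern: str,
    target: str,
    converter: Optional[Callable[[str], Any]] = None,
    ignore: bool = False,
):
    # Assumption: the pattern must match at the start of target.
    match = re.match(pattern, target)
    if match is None:
        return NOT_APPLIED
    text = match.group()
    if ignore:
        return IGNORED, len(text)
    if converter is None:
        return text, len(text)
    return converter(text), len(text)


print(apply_rule(r"\d+", "42abc", int))   # (42, 2)
print(apply_rule(r"\d+", "42abc"))        # ('42', 2)
print(apply_rule(r"\d+", "abc") is NOT_APPLIED)  # True
```

Returning the match length alongside the value lets a caller advance through the input by that many characters before trying the next match.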

`RuleGroup`

class RuleGroup(rules: tuple[Rule[Any], ...])

Used for storing multiple Rules in one object. RuleGroups can be constructed by passing in a tuple of rules or by ORing (|) two or more Rules, and they can be ORed with other RuleGroups or Rules themselves. RuleGroups are hashable and iterable.

`RuleGroup.apply`

def apply(self, target: str) -> tuple[Any | str | Ignored, int] | NotApplied

Applies the rules in the group to the target string. Returns the result of the first rule that matches, or NotApplied if none do.
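
Because the first matching rule wins, the order of rules inside a group matters. The following sketch shows this with plain functions and the standard re module; make_rule, apply_group and NOT_APPLIED are hypothetical stand-ins, not Crossandra objects.

```python
import re

NOT_APPLIED = object()  # stand-in for the NotApplied sentinel


def make_rule(pattern, converter):
    """Build a function that returns (converted match, length) or NOT_APPLIED."""
    compiled = re.compile(pattern)

    def rule(target):
        match = compiled.match(target)
        if match is None:
            return NOT_APPLIED
        return converter(match.group()), len(match.group())

    return rule


def apply_group(rules, target):
    # Try each rule in order and return the first result that applied.
    for rule in rules:
        result = rule(target)
        if result is not NOT_APPLIED:
            return result
    return NOT_APPLIED


float_rule = make_rule(r"\d+\.\d+", float)
int_rule = make_rule(r"\d+", int)

print(apply_group([float_rule, int_rule], "3.14"))  # (3.14, 4)
print(apply_group([int_rule, float_rule], "3.14"))  # (3, 1)
```

Placing the more specific pattern first keeps a shorter, more general one from consuming only part of the input.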

Common patterns

The common submodule is a collection of commonly used patterns.

Rules

CHAR (e.g. 'h')
LETTER (e.g. m)
WORD (e.g. ball)
SINGLE_QUOTED_STRING (e.g. 'nice fish')
DOUBLE_QUOTED_STRING (e.g. "hello there")
C_NAME (e.g. crossandra_rocks)
NEWLINE (\r\n or \n)
DIGIT (e.g. 7)
HEXDIGIT (e.g. c)
DECIMAL (e.g. 3.14)
INT (e.g. 2137)
SIGNED_INT (e.g. -1)
FLOAT (e.g. 1e3)
SIGNED_FLOAT (e.g. +4.3)

Rule groups

STRING (SINGLE_QUOTED_STRING | DOUBLE_QUOTED_STRING)
NUMBER (INT | FLOAT)
SIGNED_NUMBER (SIGNED_INT | SIGNED_FLOAT)
ANY_INT (INT | SIGNED_INT)
ANY_FLOAT (FLOAT | SIGNED_FLOAT)
ANY_NUMBER (NUMBER | SIGNED_NUMBER)