Reference
Crossandra
class Crossandra(
    token_source: type[Enum] = Empty,
    *,
    convert_crlf: bool = True,
    ignore_whitespace: bool = False,
    ignored_characters: str = "",
    rules: list[Rule[Any] | RuleGroup] | None = None,
    suppress_unknown: bool = False,
)
Crossandra tokenizer. Takes the following arguments:
- token_source: an enum containing all possible tokens (defaults to an empty enum)
- convert_crlf: whether \r\n should be converted to \n before tokenization
- ignored_characters: a string of characters to ignore (defaults to "")
- ignore_whitespace: whether spaces, tabs, newlines etc. should be ignored (defaults to False)
- suppress_unknown: whether unknown-token errors should be suppressed (defaults to False)
- rules: a list of additional rules to use
The enum takes priority over the rule list.
The rules are prioritized in the order they appear in the list (descending).
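The two priority statements above can be pictured with a small standard-library sketch. It is not Crossandra's implementation: first_match is a hypothetical helper, plain strings stand in for enum members, and raw regex strings stand in for Rule objects.

```python
import re
from typing import Optional


def first_match(
    text: str, literals: dict[str, str], patterns: list[str]
) -> Optional[tuple[str, int]]:
    """Return (token, consumed length) for the start of text, or None."""
    # Literal tokens (standing in for the enum's values) are tried first.
    for spelling, name in literals.items():
        if text.startswith(spelling):
            return name, len(spelling)
    # Patterns are tried in list order; the first one that matches wins.
    for pattern in patterns:
        match = re.match(pattern, text)
        if match is not None:
            return match.group(), match.end()
    return None


# The literal "+" wins even though the pattern could match "+1" as a whole.
enum_first = first_match("+1", {"+": "PLUS"}, [r"[+-]?\d+"])
# With two patterns that both match, the earlier one is used.
list_order = first_match("12", {}, [r"\d", r"\d+"])
```

Under these assumptions, enum_first is ("PLUS", 1) and list_order is ("1", 1), so putting a more specific rule earlier in the list changes the result.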
Token enums can allow a tuple of values as aliases:
from enum import Enum

from crossandra import Crossandra

class MarkdownStyle(Enum):
    BOLD = "**"
    ITALIC = ("_", "*")
    UNDERLINE = "__"
    STRIKETHROUGH = "~~"
    CODE = ("`", "``")

print(
    *Crossandra(MarkdownStyle, ignore_whitespace=True).tokenize("* ** _ __"),
    sep="\n"
)
# <MarkdownStyle.ITALIC: ('_', '*')>
# <MarkdownStyle.BOLD: '**'>
# <MarkdownStyle.ITALIC: ('_', '*')>
# <MarkdownStyle.UNDERLINE: '__'>
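One way to picture tuple aliases is a lookup table with one entry per spelling, all pointing at the same enum member. The sketch below uses only the standard library and is an assumption about the idea, not Crossandra's internals; build_alias_table is a hypothetical helper.

```python
from enum import Enum


class MarkdownStyle(Enum):
    BOLD = "**"
    ITALIC = ("_", "*")
    UNDERLINE = "__"
    STRIKETHROUGH = "~~"
    CODE = ("`", "``")


def build_alias_table(token_source: type[Enum]) -> dict[str, Enum]:
    """Map every spelling to its enum member (hypothetical helper)."""
    table: dict[str, Enum] = {}
    for member in token_source:
        # A tuple value contributes one entry per alias.
        value = member.value
        spellings = value if isinstance(value, tuple) else (value,)
        for spelling in spellings:
            table[spelling] = member
    return table


alias_table = build_alias_table(MarkdownStyle)
```

Here both alias_table["_"] and alias_table["*"] refer to MarkdownStyle.ITALIC, which is why either spelling yields the same token in the output.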
Crossandra.tokenize
def tokenize(self, code: str) -> list[Enum | Any]
Crossandra.tokenize_lines
def tokenize_lines(self, code: str) -> list[list[Enum | Any]]
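Tokenizes the input line by line. Equivalent to [foo.tokenize(line) for line in source.splitlines()], where foo is the Crossandra instance and source is the input string.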
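The per-line shape of the result can be seen with a stand-in tokenizer. In this sketch tokenize simply splits on whitespace, which is an assumption made for illustration and not what Crossandra does; only the list comprehension mirrors the expression above.

```python
def tokenize(code: str) -> list[str]:
    """Stand-in tokenizer: splits on whitespace (not Crossandra's behavior)."""
    return code.split()


def tokenize_lines(code: str) -> list[list[str]]:
    """One token list per line, following the comprehension in the text."""
    return [tokenize(line) for line in code.splitlines()]


source = "a b\n\nc"
flat = tokenize(source)
per_line = tokenize_lines(source)
```

With this stand-in, flat is a single list of three tokens, while per_line keeps one list per line, including an empty list for the blank line.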
Fast Mode
When all tokens are of length 1 and there are no additional rules, Crossandra will use a simpler tokenization method (the so-called Fast Mode).
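The condition above can be sketched as a check plus a per-character dictionary lookup. This is only an assumed shape of a "simpler method", not the library's code; Op, spellings, can_use_fast_mode and fast_tokenize are hypothetical names.

```python
from enum import Enum


class Op(Enum):  # hypothetical token set; every spelling is one character
    INC = "+"
    DEC = "-"
    LEFT = "<"
    RIGHT = ">"


def spellings(member: Enum) -> tuple:
    """Return all spellings of a member (a tuple value lists aliases)."""
    value = member.value
    return value if isinstance(value, tuple) else (value,)


def can_use_fast_mode(token_source, rules) -> bool:
    """Condition from the text: no rules and all tokens of length 1."""
    return not rules and all(
        len(s) == 1 for member in token_source for s in spellings(member)
    )


def fast_tokenize(code, token_source, suppress_unknown=False):
    """Per-character lookup (an assumed form of the simpler method)."""
    table = {s: member for member in token_source for s in spellings(member)}
    tokens = []
    for char in code:
        if char in table:
            tokens.append(table[char])
        elif not suppress_unknown:
            raise ValueError(f"unknown token: {char!r}")
    return tokens


noisy = fast_tokenize("+x>", Op, suppress_unknown=True)
```

Because every token is one character long, no regex matching or longest-match search is needed, which is a plausible reason for the speedup reported below.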
Example
Tokenizing noisy Brainfuck code (BrainfuckToken taken from examples), tested on a MacBook Air M1 (256/16) with pure Python wheels.
# Setup
from random import choices
from string import punctuation
program = "".join(choices(punctuation, k=...))
tokenizer = Crossandra(BrainfuckToken, suppress_unknown=True)
log10(k) | Default | Fast Mode | Speedup |
---|---|---|---|
1 | 40µs | 20µs | 100% |
2 | 160µs | 30µs | 433% |
3 | 1.5ms | 130µs | 1,054% |
4 | 14ms | 900µs | 1,456% |
5 | 290ms | 9ms | 3,122% |
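The Speedup column is consistent with (Default / Fast Mode - 1) × 100, so 100% means twice as fast. The short check below recomputes it from the timings in the table; the formula is inferred from the numbers, and the names are illustrative.

```python
# Timings from the table, in microseconds: log10(k) -> (default, fast mode).
timings_us = {
    1: (40, 20),
    2: (160, 30),
    3: (1_500, 130),
    4: (14_000, 900),
    5: (290_000, 9_000),
}


def speedup_percent(default_us: float, fast_us: float) -> float:
    """Relative speedup: how much faster Fast Mode is, in percent."""
    return (default_us / fast_us - 1) * 100


speedups = {
    k: round(speedup_percent(default, fast))
    for k, (default, fast) in timings_us.items()
}
```

Rounded to whole percentages, the recomputed values match the table: 100, 433, 1054, 1456 and 3122.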
Rules and rule groups
Rule
class Rule[T](
    pattern: Pattern[str] | str,
    converter: Callable[[str], T] | None = None,
    *,
    flags: RegexFlag | int = 0,
    ignore: bool = False,
)
pattern is a regex pattern to match (flags can be supplied). A converter can be supplied and will be called with the matched substring as the argument (defaults to None, returning the matched string directly). When ignore is True, the matched substring will be excluded from the output.
Rule objects are hashable and comparable and can be ORed (|) for grouping with other Rules and RuleGroups.
Rule.apply
def apply(self, target: str) -> tuple[T | str | Ignored, int] | NotApplied
Checks whether target matches the Rule's pattern. If it does, returns a tuple with
- if ignore=True: the Ignored sentinel
- if converter=None: the matched substring
- otherwise: the result of calling the Rule's converter on the matched substring
and the length of the matched substring. If it doesn't, returns the NotApplied sentinel.
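The return contract described above can be sketched with the standard re module. This is not the library's code: apply_rule is a hypothetical function, IGNORED and NOT_APPLIED are stand-ins for the Ignored and NotApplied sentinels, and matching at the start of target with re.match is an assumption.

```python
import re
from typing import Any, Callable, Optional

IGNORED = object()      # stand-in for the Ignored sentinel
NOT_APPLIED = object()  # stand-in for the NotApplied sentinel


def apply_rule(
    pattern: str,
    target: str,
    converter: Optional[Callable[[str], Any]] = None,
    ignore: bool = False,
):
    """Return (value, matched length), or NOT_APPLIED if there is no match."""
    match = re.match(pattern, target)  # assumption: anchored at the start
    if match is None:
        return NOT_APPLIED
    matched = match.group()
    if ignore:
        value = IGNORED             # matched text is dropped from the output
    elif converter is None:
        value = matched             # no converter: the raw substring
    else:
        value = converter(matched)  # converter receives the substring
    return value, len(matched)


converted = apply_rule(r"\d+", "42abc", int)
raw = apply_rule(r"\d+", "42abc")
skipped = apply_rule(r"\s+", "  x", ignore=True)
missed = apply_rule(r"\d+", "abc")
```

The returned length tells the caller how far to advance in the input, which is why it is reported even when the value itself is ignored.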
RuleGroup
class RuleGroup(rules: tuple[Rule[Any], ...])
RuleGroups can be constructed by passing in a tuple of rules or by ORing (|) two or more Rules, and they can be ORed with other RuleGroups or Rules themselves. RuleGroups are hashable and iterable.
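The grouping behavior described above can be modeled with two small frozen dataclasses. SketchRule and SketchGroup are hypothetical stand-ins, and flattening nested groups into one tuple is an assumption made here; the real classes may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen dataclasses are hashable and comparable
class SketchRule:
    pattern: str

    def __or__(self, other):
        if isinstance(other, SketchRule):
            return SketchGroup((self, other))
        if isinstance(other, SketchGroup):
            return SketchGroup((self, *other.rules))
        return NotImplemented


@dataclass(frozen=True)
class SketchGroup:
    rules: tuple

    def __or__(self, other):
        if isinstance(other, SketchRule):
            return SketchGroup((*self.rules, other))
        if isinstance(other, SketchGroup):
            return SketchGroup((*self.rules, *other.rules))
        return NotImplemented

    def __iter__(self):
        return iter(self.rules)


a, b, c = SketchRule("a"), SketchRule("b"), SketchRule("c")
left = a | b | c      # group | rule
right = a | (b | c)   # rule | group
```

In this model both expressions produce a group holding a, b and c in the order written, so the operands' order determines the order of the rules inside the group.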
RuleGroup.apply
def apply(self, target: str) -> tuple[Any | str | Ignored, int] | NotApplied
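Checks target against the rules in the group. Returns the result of a matching rule, or the NotApplied sentinel if none do.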
Common patterns
The common submodule is a collection of commonly used patterns.
Rules
- CHAR (e.g. 'h')
- LETTER (e.g. m)
- WORD (e.g. ball)
- SINGLE_QUOTED_STRING (e.g. 'nice fish')
- DOUBLE_QUOTED_STRING (e.g. "hello there")
- C_NAME (e.g. crossandra_rocks)
- NEWLINE (\r\n or \n)
- DIGIT (e.g. 7)
- HEXDIGIT (e.g. c)
- DECIMAL (e.g. 3.14)
- INT (e.g. 2137)
- SIGNED_INT (e.g. -1)
- FLOAT (e.g. 1e3)
- SIGNED_FLOAT (e.g. +4.3)
Rule groups
- STRING (SINGLE_QUOTED_STRING | DOUBLE_QUOTED_STRING)
- NUMBER (INT | FLOAT)
- SIGNED_NUMBER (SIGNED_INT | SIGNED_FLOAT)
- ANY_INT (INT | SIGNED_INT)
- ANY_FLOAT (FLOAT | SIGNED_FLOAT)
- ANY_NUMBER (NUMBER | SIGNED_NUMBER)