Text parser

About

A text parser written in the Python language.

The project has one goal, speed! See the benchmark below for more details.

Project homepage: https://github.com/eerimoq/textparser

Documentation: http://textparser.readthedocs.org/en/latest

Credits

  • Thanks to PyParsing for its user friendly interface. Many of textparser’s class names are taken from that project.

Installation

pip install textparser

Example usage

The Hello World example parses the string Hello, World! and outputs its parse tree ['Hello', ',', 'World', '!'].

The script:

import textparser
from textparser import Sequence


class Parser(textparser.Parser):

    def token_specs(self):
        return [
            ('SKIP',          r'[ \r\n\t]+'),
            ('WORD',          r'\w+'),
            ('EMARK',    '!', r'!'),
            ('COMMA',    ',', r','),
            ('MISMATCH',      r'.')
        ]

    def grammar(self):
        return Sequence('WORD', ',', 'WORD', '!')


tree = Parser().parse('Hello, World!')

print('Tree:', tree)

Script execution:

$ env PYTHONPATH=. python3 examples/hello_world.py
Tree: ['Hello', ',', 'World', '!']

Benchmark

A benchmark comparing the speed of 10 JSON parsers, parsing a 276 kB file.

$ env PYTHONPATH=. python3 examples/benchmarks/json/speed.py

Parsed 'examples/benchmarks/json/data.json' 1 time(s) in:

PACKAGE         SECONDS   RATIO  VERSION
textparser         0.10    100%  0.21.1
parsimonious       0.17    169%  unknown
lark (LALR)        0.27    267%  0.7.0
funcparserlib      0.34    340%  unknown
textx              0.54    546%  1.8.0
pyparsing          0.68    684%  2.4.0
pyleri             0.88    886%  1.2.2
parsy              0.92    925%  1.2.0
parsita            2.28   2286%  unknown
lark (Earley)      2.34   2348%  0.7.0

NOTE 1: The parsers are not necessarily optimized for speed. Optimizing them will likely affect the measurements.

NOTE 2: The structure of the resulting parse trees varies and additional processing may be required to make them fit the user application.

NOTE 3: Only JSON parsers are compared. Parsing other languages may give vastly different results.

Contributing

  1. Fork the repository.

  2. Implement the new feature or bug fix.

  3. Implement test case(s) to ensure that future changes do not break existing functionality.

  4. Run the tests.

    python3 -m unittest
    
  5. Create a pull request.

The parser class

class textparser.Parser[source]

The abstract base class of all text parsers.

>>> from textparser import Parser, Sequence
>>> class MyParser(Parser):
...    def token_specs(self):
...        return [
...            ('SKIP',          r'[ \r\n\t]+'),
...            ('WORD',          r'\w+'),
...            ('EMARK',    '!', r'!'),
...            ('COMMA',    ',', r','),
...            ('MISMATCH',      r'.')
...        ]
...    def grammar(self):
...        return Sequence('WORD', ',', 'WORD', '!')
keywords()[source]

A set of keywords in the text.

def keywords(self):
    return set(['if', 'else'])
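
A minimal sketch, assuming the default tokenize() gives a token whose value is in this set the keyword itself as kind, so that keywords can be referenced directly in the grammar; the keyword and token kinds below are made up for illustration:

from textparser import Parser, Sequence


class IfParser(Parser):

    def keywords(self):
        return set(['if'])

    def token_specs(self):
        return [
            ('SKIP',          r'[ \r\n\t]+'),
            ('WORD',          r'\w+'),
            ('MISMATCH',      r'.')
        ]

    def grammar(self):
        # 'if' is a keyword, so it can be used directly here.
        return Sequence('if', 'WORD')
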
token_specs()[source]

The token specifications with token kind, regular expression, and optionally a user friendly name.

Two token specification forms are available: (kind, re) or (kind, name, re). If the second form is used, the grammar should use name instead of kind.

See Parser for an example usage.

tokenize(text)[source]

Tokenize given string text, and return a list of tokens. Raises TokenizeError on failure.

This method should only be called by parse(), but may very well be overridden if the default implementation does not match the parser's needs.
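
A hedged sketch of such an override, assuming tokenize_init() (see Utility functions below) returns an initial token list and a combined regular expression for the given specifications, and that Token and TokenizeError are importable from textparser:

import re

from textparser import Parser, Sequence, Token, TokenizeError, tokenize_init


class LowerCaseParser(Parser):

    def token_specs(self):
        return [
            ('SKIP',          r'[ \r\n\t]+'),
            ('WORD',          r'\w+'),
            ('MISMATCH',      r'.')
        ]

    def grammar(self):
        return Sequence('WORD', 'WORD')

    def tokenize(self, text):
        # Hypothetical override: tokenize with the specifications above,
        # lower-casing every word before it reaches the grammar.
        tokens, re_token = tokenize_init(self.token_specs())

        for mo in re.finditer(re_token, text, re.DOTALL):
            kind = mo.lastgroup

            if kind == 'SKIP':
                continue

            if kind == 'MISMATCH':
                raise TokenizeError(text, mo.start())

            tokens.append(Token(kind, mo.group().lower(), mo.start()))

        return tokens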

grammar()[source]

The text grammar is used to create a parse tree out of a list of tokens.

See Parser for an example usage.

parse(text, token_tree=False, match_sof=False)[source]

Parse given string text and return the parse tree. Raises ParseError on failure.

Returns a parse tree of tokens if token_tree is True.

>>> MyParser().parse('Hello, World!')
['Hello', ',', 'World', '!']
>>> tree = MyParser().parse('Hello, World!', token_tree=True)
>>> from pprint import pprint
>>> pprint(tree)
[Token(kind='WORD', value='Hello', offset=0),
 Token(kind=',', value=',', offset=5),
 Token(kind='WORD', value='World', offset=7),
 Token(kind='!', value='!', offset=12)]

Building the grammar

The grammar is built by combining the classes below and strings.

Here is a fictitious example grammar:

grammar = Sequence(
    'BEGIN',
    Optional(choice('IF', Sequence(ZeroOrMore('NUMBER')))),
    OneOrMore(Sequence('WORD', Not('NUMBER'))),
    Any(),
    DelimitedList('WORD', delim=':'),
    'END')
class textparser.Sequence(*patterns)[source]

Matches a sequence of patterns. Becomes a list in the parse tree.

class textparser.Choice(*patterns)[source]

Matches any of the given ordered patterns. The first pattern in the list has the highest priority, and the last the lowest.
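
Since alternatives are tried in order, a more specific alternative should normally be listed before a more general one. A small sketch with made-up token kinds:

from textparser import Choice, Sequence

# List the call form first; if 'WORD' came first it would match the
# leading word of a call, and the '(' and ')' tokens would then have to
# be matched by whatever follows in the enclosing grammar.
value = Choice(Sequence('WORD', '(', ')'),
               'WORD')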

class textparser.ChoiceDict(*patterns)[source]

Matches any of the given patterns. The first token kind of each pattern must be unique, otherwise an Error exception is raised.

This class is faster than Choice, and should be used if the grammar allows it.
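
A small sketch with made-up token kinds, where each alternative starts with a unique first token kind:

from textparser import ChoiceDict, Sequence

# Allowed: the alternatives start with the distinct kinds 'IF' and 'WHILE'.
statement = ChoiceDict(Sequence('IF', 'WORD'),
                       Sequence('WHILE', 'WORD'))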

textparser.choice(*patterns)[source]

Returns an instance of the fastest choice class for the given patterns. It is recommended to use this function instead of instantiating Choice or ChoiceDict directly.

class textparser.ZeroOrMore(pattern)[source]

Matches pattern zero or more times.

See Repeated for more details.

class textparser.ZeroOrMoreDict(pattern, key=None)[source]

Matches pattern zero or more times.

See RepeatedDict for more details.

class textparser.OneOrMore(pattern)[source]

Matches pattern one or more times.

See Repeated for more details.

class textparser.OneOrMoreDict(pattern, key=None)[source]

Matches pattern one or more times.

See RepeatedDict for more details.

class textparser.DelimitedList(pattern, delim=',')[source]

Matches a delimited list of pattern separated by delim. pattern must be matched at least once. Any match becomes a list in the parse tree, excluding the delimiters.
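
A hedged sketch of a parser with a colon delimited word list as its grammar; the token specifications are made up for illustration:

from textparser import Parser, DelimitedList


class WordListParser(Parser):

    def token_specs(self):
        return [
            ('SKIP',         r'[ \r\n\t]+'),
            ('WORD',         r'\w+'),
            ('COLON',  ':',  r':'),
            ('MISMATCH',     r'.')
        ]

    def grammar(self):
        return DelimitedList('WORD', delim=':')


# Expected to give ['foo', 'bar', 'baz'], without the ':' delimiters.
tree = WordListParser().parse('foo: bar: baz')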

class textparser.Optional(pattern)[source]

Matches pattern zero or one time. Becomes a list in the parse tree, empty on mismatch.
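
For example, making a comma separated second word optional; per the description above, the optional part becomes a list in the parse tree, empty on mismatch. The token kinds are made up:

from textparser import Sequence, Optional

# Accepts both "WORD !" and "WORD , WORD !" shaped token streams.
greeting = Sequence('WORD', Optional(Sequence(',', 'WORD')), '!')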

class textparser.Any[source]

Matches any token.

class textparser.AnyUntil(pattern)[source]

Matches any token until given pattern is found. Becomes a list in the parse tree, not including the given pattern match.

class textparser.And(pattern)[source]

Matches pattern, without consuming any tokens. Any match becomes an empty list in the parse tree.

class textparser.Not(pattern)[source]

Matches if pattern does not match. Any match becomes an empty list in the parse tree.

Just like And, no tokens are consumed.

class textparser.NoMatch[source]

Never matches anything.

class textparser.Tag(name, pattern)[source]

Tags any matched pattern with the given name. Becomes a two-tuple of name and match in the parse tree.
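
For example, tagging two alternatives makes them easy to tell apart in the parse tree. The token kinds are made up:

from textparser import Choice, Tag

# A match becomes ('number', <match>) or ('word', <match>).
value = Choice(Tag('number', 'NUMBER'),
               Tag('word', 'WORD'))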

class textparser.Forward[source]

Forward declaration of a pattern.

>>> foo = Forward()
>>> foo <<= Sequence('NUMBER')
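
Forward is mainly useful for recursive grammars. A hedged sketch of a grammar for nested, parenthesized word lists; the '(', ')' and 'WORD' token kinds are assumptions:

from textparser import Choice, Forward, Sequence, ZeroOrMore

# A group is a parenthesized list of words and nested groups.
group = Forward()
group <<= Sequence('(', ZeroOrMore(Choice('WORD', group)), ')')
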
class textparser.Repeated(pattern, minimum=0)[source]

Matches pattern at least minimum times. Any match becomes a list in the parse tree.

class textparser.RepeatedDict(pattern, minimum=0, key=None)[source]

Same as Repeated, but becomes a dictionary instead of a list in the parse tree.

key is a function taking the match as input and returning the dictionary key. By default the first element in the match is used as key.
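
A hedged sketch of a custom key function; the 'WORD' and '=' token kinds and the assignment shape are made up:

from textparser import RepeatedDict, Sequence

# Collect "WORD = WORD" assignments, keyed on the assigned name. The
# first element is also the default, so key is only shown for clarity.
assignments = RepeatedDict(Sequence('WORD', '=', 'WORD'),
                           key=lambda match: match[0])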

class textparser.Pattern[source]

Base class of all patterns.

match(tokens)[source]

Returns MISMATCH on mismatch, and anything else on match.

textparser.MISMATCH = <textparser._Mismatch object>

Returned by match() on mismatch.

Exceptions

class textparser.Error[source]

General textparser exception.

class textparser.ParseError(text, offset)[source]

This exception is raised when the parser fails to parse the text.

text

The input text to the parser.

offset

Offset into the text where the parser failed.

line

Line where the parser failed.

column

Column where the parser failed.
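
A hedged sketch of catching the exception and pointing out the failing position with markup_line() (see Utility functions below), reusing MyParser from the Parser example above; the printed message format is made up:

from textparser import ParseError, markup_line

try:
    tree = MyParser().parse('Hello World!')  # The ',' is missing.
except ParseError as e:
    print('Parse error at line {}, column {}:'.format(e.line, e.column))
    print(markup_line(e.text, e.offset))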

class textparser.TokenizeError(text, offset)[source]

This exception is raised when the text cannot be converted into tokens.

text

The input text to the tokenizer.

offset

Offset into the text where the tokenizer failed.

class textparser.GrammarError(offset)[source]

This exception is raised when the tokens cannot be converted into a parse tree.

offset

Offset into the text where the parser failed.

Utility functions

textparser.markup_line(text, offset, marker='>>!<<')[source]

Insert marker at offset into text, and return the marked line.

>>> markup_line('0\n1234\n56', 3)
'1>>!<<234'
textparser.tokenize_init(spec)[source]

Initialize a tokenizer. Should only be called by the tokenize() method in the parser.