Text parser¶
About¶
A text parser written in the Python language.
The project has one goal, speed! See the benchmark below for more details.
Project homepage: https://github.com/eerimoq/textparser
Documentation: http://textparser.readthedocs.org/en/latest
Credits¶
- Thanks to PyParsing for a user friendly interface. Many of textparser's class names are taken from this project.
Installation¶
pip install textparser
Example usage¶
The Hello World example parses the string Hello, World! and outputs its parse tree ['Hello', ',', 'World', '!'].
The script:
import textparser
from textparser import Sequence


class Parser(textparser.Parser):

    def token_specs(self):
        return [
            ('SKIP',     r'[ \r\n\t]+'),
            ('WORD',     r'\w+'),
            ('EMARK',    '!', r'!'),
            ('COMMA',    ',', r','),
            ('MISMATCH', r'.')
        ]

    def grammar(self):
        return Sequence('WORD', ',', 'WORD', '!')


tree = Parser().parse('Hello, World!')

print('Tree:', tree)
Script execution:
$ env PYTHONPATH=. python3 examples/hello_world.py
Tree: ['Hello', ',', 'World', '!']
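Under the hood, token specifications like these are typically compiled into a single alternation regular expression and scanned with Python's re module. Here is a minimal stdlib-only sketch of that mechanism, using two-tuple specs only; it illustrates the idea and is not textparser's actual implementation:

```python
import re

# Token specifications as (kind, regex) pairs, mirroring the example
# above. This is an illustration, not textparser's API.
TOKEN_SPECS = [
    ('SKIP', r'[ \r\n\t]+'),
    ('WORD', r'\w+'),
    ('EMARK', r'!'),
    ('COMMA', r','),
    ('MISMATCH', r'.'),
]

def tokenize(text):
    """Scan text left to right, returning (kind, value) pairs."""
    pattern = '|'.join(f'(?P<{kind}>{regex})' for kind, regex in TOKEN_SPECS)
    tokens = []
    for mo in re.finditer(pattern, text):
        kind = mo.lastgroup
        if kind == 'SKIP':
            continue  # discard whitespace
        if kind == 'MISMATCH':
            raise ValueError(f'unexpected character {mo.group()!r}')
        tokens.append((kind, mo.group()))
    return tokens

print(tokenize('Hello, World!'))
# [('WORD', 'Hello'), ('COMMA', ','), ('WORD', 'World'), ('EMARK', '!')]
```

The SKIP entry discards whitespace, and MISMATCH, placed last, catches any character the other patterns missed.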
Benchmark¶
A benchmark comparing the speed of 10 JSON parsers, parsing a 276 kB file.
$ env PYTHONPATH=. python3 examples/benchmarks/json/speed.py
Parsed 'examples/benchmarks/json/data.json' 1 time(s) in:
PACKAGE SECONDS RATIO VERSION
textparser 0.09 100% 0.19.0
parsimonious 0.17 183% unknown
lark (LALR) 0.29 306% 0.6.6
funcparserlib 0.33 346% unknown
textx 0.53 557% 1.8.0
pyparsing 0.67 710% 2.3.1
pyleri 0.78 825% 1.2.2
parsy 0.91 969% 1.2.0
lark (Earley) 2.11 2240% 0.6.6
parsita 2.26 2393% unknown
NOTE 1: The parsers are not necessarily optimized for speed. Optimizing them will likely affect the measurements.
NOTE 2: The structure of the resulting parse trees varies and additional processing may be required to make them fit the user application.
NOTE 3: Only JSON parsers are compared. Parsing other languages may give vastly different results.
Contributing¶
Fork the repository.
Install prerequisites.
pip install -r requirements.txt
Implement the new feature or bug fix.
Implement test case(s) to ensure that future changes do not break legacy.
Run the tests.
make test
Create a pull request.
The parser class¶
- class textparser.Parser[source]¶
  The abstract base class of all text parsers.

  >>> from textparser import Parser, Sequence
  >>> class MyParser(Parser):
  ...     def token_specs(self):
  ...         return [
  ...             ('SKIP', r'[ \r\n\t]+'),
  ...             ('WORD', r'\w+'),
  ...             ('EMARK', '!', r'!'),
  ...             ('COMMA', ',', r','),
  ...             ('MISMATCH', r'.')
  ...         ]
  ...     def grammar(self):
  ...         return Sequence('WORD', ',', 'WORD', '!')
- token_specs()[source]¶
  The token specifications with token name, regular expression, and optionally a user friendly name.
  Two token specification forms are available: (kind, re) or (kind, name, re). If the second form is used, the grammar should use name instead of kind.
  See Parser for an example usage.
- tokenize(text)[source]¶
  Tokenize given string text, and return a list of tokens. Raises TokenizeError on failure.
  This method should only be called by parse(), but may very well be overridden if the default implementation does not match the parser needs.
- grammar()[source]¶
  The text grammar is used to create a parse tree out of a list of tokens.
  See Parser for an example usage.
- parse(text, token_tree=False, match_sof=False)[source]¶
  Parse given string text and return the parse tree. Raises ParseError on failure.
  Returns a parse tree of tokens if token_tree is True.

  >>> MyParser().parse('Hello, World!')
  ['Hello', ',', 'World', '!']
  >>> tree = MyParser().parse('Hello, World!', token_tree=True)
  >>> from pprint import pprint
  >>> pprint(tree)
  [Token(kind='WORD', value='Hello', offset=0),
   Token(kind=',', value=',', offset=5),
   Token(kind='WORD', value='World', offset=7),
   Token(kind='!', value='!', offset=12)]
Building the grammar¶
The grammar is built by combining the classes below and strings.
Here is a fictitious example grammar:
grammar = Sequence(
'BEGIN',
Optional(choice('IF', Sequence(ZeroOrMore('NUMBER')))),
OneOrMore(Sequence('WORD', Not('NUMBER'))),
Any(),
DelimitedList('WORD', delim=':'),
'END')
- class textparser.Sequence(*patterns)[source]¶
  Matches a sequence of patterns. Becomes a list in the parse tree.
- class textparser.Choice(*patterns)[source]¶
  Matches any of the given ordered patterns. The first pattern in the list has the highest priority, and the last the lowest.
- class textparser.ChoiceDict(*patterns)[source]¶
  Matches any of the given patterns. The first token kind of all patterns must be unique, otherwise an Error exception is raised.
  This class is faster than Choice, and should be used if the grammar allows it.
- textparser.choice(*patterns)[source]¶
  Returns an instance of the fastest choice class for the given patterns. It is recommended to use this function instead of instantiating Choice or ChoiceDict directly.
- class textparser.ZeroOrMore(pattern)[source]¶
  Matches pattern zero or more times.
  See Repeated for more details.
- class textparser.ZeroOrMoreDict(pattern, key=None)[source]¶
  Matches pattern zero or more times.
  See RepeatedDict for more details.
- class textparser.OneOrMore(pattern)[source]¶
  Matches pattern one or more times.
  See Repeated for more details.
- class textparser.OneOrMoreDict(pattern, key=None)[source]¶
  Matches pattern one or more times.
  See RepeatedDict for more details.
- class textparser.DelimitedList(pattern, delim=',')[source]¶
  Matches a delimited list of pattern separated by delim. pattern must be matched at least once. Any match becomes a list in the parse tree, excluding the delimiters.
- class textparser.Optional(pattern)[source]¶
  Matches pattern zero or one times. Becomes a list in the parse tree, empty on mismatch.
- class textparser.AnyUntil(pattern)[source]¶
  Matches any token until given pattern is found. Becomes a list in the parse tree, not including the given pattern match.
- class textparser.And(pattern)[source]¶
  Matches pattern, without consuming any tokens. Any match becomes an empty list in the parse tree.
- class textparser.Not(pattern)[source]¶
  Matches if pattern does not match. Any match becomes an empty list in the parse tree.
  Just like And, no tokens are consumed.
- class textparser.Tag(name, pattern)[source]¶
  Tags any matched pattern with given name. Becomes a two-tuple of name and match in the parse tree.
- class textparser.Forward[source]¶
  Forward declaration of a pattern.

  >>> foo = Forward()
  >>> foo <<= Sequence('NUMBER')
- class textparser.Repeated(pattern, minimum=0)[source]¶
  Matches pattern at least minimum times. Any match becomes a list in the parse tree.
- class textparser.RepeatedDict(pattern, minimum=0, key=None)[source]¶
  Same as Repeated, but becomes a dictionary instead of a list in the parse tree.
  key is a function taking the match as input and returning the dictionary key. By default the first element in the match is used as key.
Exceptions¶
- class textparser.ParseError(text, offset)[source]¶
  This exception is raised when the parser fails to parse the text.

  - text¶
    The input text to the parser.
  - offset¶
    Offset into the text where the parser failed.
  - line¶
    Line where the parser failed.
  - column¶
    Column where the parser failed.
Utility functions¶
- textparser.markup_line(text, offset, marker='>>!<<')[source]¶
  Insert marker at offset into text, and return the marked line.

  >>> markup_line('0\n1234\n56', 3)
  1>>!<<234
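The documented behaviour can be reproduced in a few lines of stdlib Python. A sketch of the idea; the actual implementation may differ:

```python
def markup_line(text, offset, marker='>>!<<'):
    # Locate the line containing offset and splice the marker in at
    # that column. A stdlib-only sketch of the documented behaviour,
    # not necessarily textparser's implementation.
    begin = text.rfind('\n', 0, offset) + 1
    end = text.find('\n', offset)
    if end == -1:
        end = len(text)
    return text[begin:offset] + marker + text[offset:end]

print(markup_line('0\n1234\n56', 3))
# 1>>!<<234
```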
- textparser.tokenize_init(spec)[source]¶
  Initialize a tokenizer. Should only be called by the tokenize() method in the parser.