MacroPEG Parser Combinators¶
from macropy.peg import macros, peg
from macropy.quick_lambda import macros, f
"""
PEG grammar from Wikipedia
Op <- "+" / "-" / "*" / "/"
Value <- [0-9]+ / '(' Expr ')'
Expr <- Value (Op Value)*
Simplified to remove operator precedence
"""
def reduce_chain(chain):
chain = list(reversed(chain))
o_dict = {
"+": f[_+_],
"-": f[_-_],
"*": f[_*_],
"/": f[_/_],
}
while len(chain) > 1:
a, [o, b] = chain.pop(), chain.pop()
chain.append(o_dict`o <a, b>`_)
return chain[0]
with peg:
op = '+' | '-' | '*' | '/'
value = '[0-9]+'.r // int | ('(', expr, ')') // f[_[1]]
expr = (value, (op, value).rep is rest) >> reduce_chain([value] + rest)
print(expr.parse("123")) # 123
print(expr.parse("((123))")) # 123
print(expr.parse("(123+456+789)")) # 1368
print(expr.parse("(6/2)")) # 3
print(expr.parse("(1+2+3)+2")) # 8
print(expr.parse("(((((((11)))))+22+33)*(4+5+((6))))/12*(17+5)")) # 1804
MacroPEG is an implementation of Parser Combinators, an approach to building recursive descent parsers, when the task is too large for regexes but yet too small for the heavy-duty parser generators. MacroPEG is inspired by Scala’s parser combinator library, utilizing python macros to make the syntax as clean as possible .
The above example describes a simple parser for arithmetic expressions, which roughly follows the PEG syntax. Note how that in the example, the bulk of the code goes into the loop that reduces sequences of numbers and operators to a single number, rather than the recursive-descent parser itself!
Any assignment (xxx = ...
) within a with peg:
block is
transformed into a Parser
. A Parser
comes with a
.parse(input)
method, which returns the parsed result if parsing
succeeds and raises a ParseError
in the case of failure. The
ParseError
contains a nice human-readable string detailing exactly
what went wrong.
json_exp.parse('{"omg": "123", "wtf": , "bbq": "789"}')
# ParseError: index: 22, line: 1, col: 23
# json_exp / obj / pair / json_exp
# {"omg": "123", "wtf": , "bbq": "789"}
# ^
# expected: (obj | array | string | true | false | null | number)
In addition to .parse(input)
, a Parser also contains:
parse_string(input)
, a more program-friendly version ofparse
that returns successes and failures as boxed values (with metadata);- a
parse_partial(input)
method, which is identical toparse_string
, but does not require the entireinput
to be consumed, as long as some prefix of theinput
string matches. Theremaining
attribute of the`Success`
indicates how far into theinput
string parsing proceeded.
Basic Combinators¶
Parsers are generally built up from a few common building blocks. The fundamental atoms include:
- string literals like
'+'
match the input to their literal value (e.g. ‘+’) and return it as the parse result, or fails if it does not match; - regexes like
'[0-9]+'.r
match the regex to the input if possible, and return it; - tuples like
('(', expr, ')')
match each of the elements within sequentially, and return a list containing the result of each element. It fails if any of its elements fails; - parsers separated by
|
, for example'+' | '-' | '*' | '/'
, attempt to match each of the alternatives from left to right, and return the result of the first success; - parsers separated by
&
, for example'[1234]'.r & '[3456]'.r
, require both parsers succeed, and return the result of the left side; parser.rep
attempts to match theparser
0 or more times, returning a list of the results from each successful match;-parser
negates theparser
: ifparser
succeeded (with any result),-parser
fails. Ifparser
failed,-parser
succeeds with the result""
, the empty string.
Apart from the fundamental atoms, MacroPeg also provides combinators which are not strictly necessary, but are nevertheless generally useful in almost all parsing scenarios:
parser.rep1
attempts to match theparser
1 or more times, returning a list of the results from each successful match. Ifparser
does not succeed at least once,parser.rep1
fails. Equivalent toparser.rep & parser
;parser.rep_with(other)
andparser.rep1_with(other)
repeat theparser
0 or more or 1 or more times respectively, except now theother
parser is invoked in between invocations ofparser
. The output ofother
is discarded, and these methods return a list of values similar torep
andrep1
;parser * n
attempts to match theparser
exactlyn
times, returning a list of lengthn
containing the result of then
successes. Fails otherwise;parser.opt
matches theparser
0 or 1 times, returning either[]
or[result]
whereresult
is the result ofparser
. Equivalent toparser | Succeed([])
;parser.join
takes a parser that returns a list of strings (e.g. tuples,rep
,rep1
, etc.) and returns a parser which returns the strings concatenated together. Equivalent toparser // "".join
.
Transforming values using //
¶
So far, these building blocks all return the raw parse tree: all the things like whitespace, curly-braces, etc. will still be there. Often, you want to take a parser e.g.
from macropy.peg import macros, peg
with peg:
num = '[0-9]+'.r
print(repr(num.parse("123"))) # '123'
which returns a string of digits, and convert it into a parser which
returns an int
with the value of that string. This can be done
with the //
operator:
from macropy.peg import macros, peg
with peg:
num = '[0-9]+'.r // int
print(repr(num.parse("123"))) # 123
The //
operator takes a function which will be used to transform
the result of the parser: in this case, it is the function int
,
which transforms the returned string into an integer.
Another example is:
with peg:
laugh = 'lol'
laughs1 = 'lol'.rep1
laughs2 = laughs1 // "".join
print(laughs1.parse("lollollol")) # ['lol', 'lol', 'lol]
print(laughs2.parse("lollollol")) # lollollol
Where the function "".join"
is used to join together the list of
results from laughs1
into a single string. As mentioned earlier,
laughs2
can also be written as laughs2 = laughs1.join
.
Binding Values using >>
¶
Although //
is sufficient for everyone’s needs, it is not always
convenient. In the example above, a value
is defined to be:
value = ... | ('(', expr, ')') // (lambda x: x[1])
As you can see, we need to strip off the unwanted parentheses from the
parse tree, and we do it with a lambda
that only selects the
middle element, which is the result of the expr
parser. An
alternate way of representing this is:
value = ... | ('(', expr is result, ')') >> result
In this case, the is
keyword is used to bind the result of
expr
to the name result
. The >>
(“bind”) operator can be
used to transform the parser by only operating on the bound results
within the parser. >>
also binds the results of other parsers to
their name. Hence the above is equivalent to:
value = ... | ('(', expr, ')') >> expr
The expr
on the left refers to the parser named expr
in the
with peg:
block, while the expr
on the right refers to the
results of the parser named expr
in case of a successful
parse. The parser on the left has to be outside any is
expressions for it to be captured as above, and so in this line in the
above parser:
expr = (value, (op, value).rep is rest) >> reduce_chain([value] + rest)
The result of the first value
on the left of >>
is bound to
value
on the right, while the second value
is not because it
is within an is
expression bound to the name rest
. If you have
multiple parsers of the same name on the left of >>
, you can
always refer to each individual explicitly using the is
syntax
shown above.
Althought this seems like a lot of shuffling variables around and meddling with the local scope and semantics, it goes a long way to keep things neat. For example, a JSON parser may define an array to be:
with peg:
...
# parses an array and extracts the relevant bits into a Python list
array = ('[', (json_exp, (',', json_exp).rep), space.opt, ']') // (lambda x: [x[1][0]] + [y[1] for y in x[1][1]])
...
Where the huge lambda
is necessary to pull out the necessary parts
of the parse tree into a Python list. Although it works, it’s
difficult to write correctly and equally difficult to read. Using the
is
operator, this can be rewritten as:
array = ('[', json_exp is first, (',', json_exp is rest).rep, space.opt, ']') >> [first] + rest
Now, it is clear that we are only interested in the result of the two
json_exp
parsers. The >>
operator allows us to use those,
while the rest of the parse tree ([
, ,
, etc.) are conveniently
discarded. Of course, one could go a step further and us the
rep_with
method which is intended for exactly this purpose:
array = ('[', json_exp.rep_with(',') >> arr, space.opt, ']') >> arr
Which arguably looks the cleanest of all!
Cut¶
from macropy.peg import macros, peg, cut
with peg:
expr1 = ("1", "2", "3") | ("1", "b", "c")
expr2 = ("1", cut, "2", "3") | ("1", "b", "c")
print(expr1.parse("1bc") # ['1', 'b', 'c'])
print(expr2.parse("1bc"))
# ParseError: index: 1, line: 1, col: 2
# expr2
# 1bc
# ^
# expected: '2'
cut
is a special token used in a sequence of parsers, which
commits the parsing to the current sequence. As you can see above,
without cut
, the left alternative fails and the parsing then
attempts the right alternative, which succeeds. In contrast, with
expr2
, the parser is committed to the left alternative once it
reaches the cut
(after successfully parsing “1”) and thus when the
left alternative fails, the right alternative is not tried and the
entire parse
fails.
The purpose of cut
is two-fold:
Increasing performance by removing unnecessary backtracking¶
Using JSON as an example: if your parser sees a {
, begins parsing a
JSON object, but some time later it fails, it does not need to both
backtracking and attempting to parse an Array ([...
), or a String
("...
), or a Number. None of those could possibly succeed, so
cutting the backtracking and failing fast prevents this unnecessary
computation.
Better error reporting.¶
For example, if you try to parse the JSON String;
{ : "failed lol"}
if your JSON parser looks like:
with peg:
...
json_exp = obj | array | string | num | true | false | null
obj = '{', pair.rep_with(",") , space, '}'
...
Without cut
, the only information you could gain from attempting to
parse that is something like:
index: 0, line: 1, col: 1
json_exp
{ : 1, "wtf": 12.4123}
^
expected: (obj | array | string | true | false | null | number)
On the other hand, using a cut
inside the object
parser
immediately after parsing the first {
, we could provide a much
more specific error:
index: 5, line: 1, col: 6
json_exp / obj
{ : 1, "wtf": 12.4123}
^
expected: '}'
In the first case, after failing to parse obj
, the json_exp
parser goes on to try all the other alternatives. After all to them
fail to parse, it only knows that trying to parse json_exp
starting from character 0 doesn’t work; it has no way of knowing that
the alternative that was “supposed” to work was obj
.
In the second case, cut
is inserted inside the object
parser,
something like:
obj = '{', cut, pair.rep_with(",") , space, '}'
Once the first {
is parsed, the parser is committed to that
alternative. Thus, when it fails to parse string
, it knows it
cannot backtrack and can immediately end the parsing. It can now give
a much more specific source location (character 10) as well as better
information on what it was trying to parse (json / object /
string
).
Full Example¶
MacroPEG is not limited to toy problems, like the arithmetic expression parser above. Below is the full source of a JSON parser, provided in the unit tests:
from macropy.peg import macros, peg, cut
from macropy.quick_lambda import macros, f
def decode(x):
x = x.decode('unicode-escape')
try:
return str(x)
except:
return x
escape_map = {
'"': '"',
'/': '/',
'\\': '\\',
'b': '\b',
'f': '\f',
'n': '\n',
'r': '\r',
't': '\t'
}
"""
Sample JSON PEG grammar for reference, shameless stolen from
https://github.com/azatoth/PanPG/blob/master/grammars/JSON.peg
JSON <- S? ( Object / Array / String / True / False / Null / Number ) S?
Object <- "{"
( String ":" JSON ( "," String ":" JSON )*
/ S? )
"}"
Array <- "["
( JSON ( "," JSON )*
/ S? )
"]"
String <- S? ["] ( [^ " \ U+0000-U+001F ] / Escape )* ["] S?
Escape <- [\] ( [ " / \ b f n r t ] / UnicodeEscape )
UnicodeEscape <- "u" [0-9A-Fa-f]{4}
True <- "true"
False <- "false"
Null <- "null"
Number <- Minus? IntegralPart fractPart? expPart?
Minus <- "-"
IntegralPart <- "0" / [1-9] [0-9]*
fractPart <- "." [0-9]+
expPart <- ( "e" / "E" ) ( "+" / "-" )? [0-9]+
S <- [ U+0009 U+000A U+000D U+0020 ]+
"""
with peg:
json_doc = (space, (obj | array), space) // f[_[1]]
json_exp = (space, (obj | array | string | true | false | null | number), space) // f[_[1]]
pair = (string is k, space, ':', cut, json_exp is v) >> (k, v)
obj = ('{', cut, pair.rep_with(",") // dict, space, '}') // f[_[1]]
array = ('[', cut, json_exp.rep_with(","), space, ']') // f[_[1]]
string = (space, '"', (r'[^"\\\t\n]'.r | escape | unicode_escape).rep.join is body, '"') >> "".join(body)
escape = ('\\', ('"' | '/' | '\\' | 'b' | 'f' | 'n' | 'r' | 't') // escape_map.get) // f[_[1]]
unicode_escape = ('\\', 'u', ('[0-9A-Fa-f]'.r * 4).join).join // decode
true = 'true' >> True
false = 'false' >> False
null = 'null' >> None
number = decimal | integer
integer = ('-'.opt, integral).join // int
decimal = ('-'.opt, integral, ((fract, exp).join) | fract | exp).join // float
integral = '0' | '[1-9][0-9]*'.r
fract = ('.', '[0-9]+'.r).join
exp = (('e' | 'E'), ('+' | '-').opt, "[0-9]+".r).join
space = '\s*'.r
Testing it out with some input, we can see it works as we would expect:
test_string = """
{
"firstName": "John",
"lastName": "Smith",
"age": 25,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": 10021
},
"phoneNumbers": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "fax",
"number": "646 555-4567"
}
]
}
"""
import json
print(json_exp.parse(test_string) == json.loads(test_string))
# True
import pprint
pp = pprint.PrettyPrinter(4)
pp.pprint(json_exp.parse(test_string))
#{ 'address': { 'city': 'New York',
# 'postalCode': 10021.0,
# 'state': 'NY',
# 'streetAddress': '21 2nd Street'},
# 'age': 25.0,
# 'firstName': 'John',
# 'lastName': 'Smith',
# 'phoneNumbers': [ { 'number': '212 555-1234', 'type': 'home'},
# { 'number': '646 555-4567', 'type': 'fax'}]}
You can see that json_exp
parses that non-trivial blob of JSON into
an identical structure as Python’s in-built json
package. In
addition, the source of the parser looks almost identical to the PEG
grammar it is parsing, shown above. This parser makes good use of the
//
and >>
operators to transform the output of its individual
components, as well as using rep_with
method to easily parse the
comma-separated JSON objects and arrays. This parser is almost fully
compliant with the test cases
found on the json.org website (it doesn’t fail, as
it should, for deeply-nested JSON arrays), which isn’t bad for 50
lines of code.
As mentioned earlier, MacroPEG parsers also provide exceptions with
nice error messages when the parse
method fails, and the JSON parser
is no exception. Even when parsing larger documents, the error
reporting rises to the challenge:
json_exp.parse("""
{
"firstName": "John",
"lastName": "Smith",
"age": 25,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": 10021
},
"phoneNumbers": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "fax",
"number": 646 555-4567"
}
]
}
""")
# ParseError: index: 456, line: 19, col: 31
# json_exp / obj / pair / json_exp / array / json_exp / obj
# "number": 646 555-4567"
# ^
# expected: '}'
Pretty neat! This full example of a JSON parser demonstrates what MacroPEG provides to a programmer trying to write a parser:
- Excellent error reporting
- Simple AST processing, on the fly
- An extremely clear PEG-like syntax
- Extremely concise parser definitions
Not bad for an implementation that spans 350 lines of code!