mirror of
https://git.savannah.gnu.org/git/guile.git
synced 2025-04-29 19:30:36 +02:00
* NEWS: * doc/ref/api-control.texi: * doc/ref/api-data.texi: * doc/ref/api-debug.texi: * doc/ref/api-deprecated.texi: * doc/ref/api-evaluation.texi: * doc/ref/api-foreign.texi: * doc/ref/api-i18n.texi: * doc/ref/api-io.texi: * doc/ref/api-languages.texi: * doc/ref/api-macros.texi: * doc/ref/api-memory.texi: * doc/ref/api-modules.texi: * doc/ref/api-options.texi: * doc/ref/api-peg.texi: * doc/ref/api-procedures.texi: * doc/ref/api-scheduling.texi: * doc/ref/api-undocumented.texi: * doc/ref/api-utility.texi: * doc/ref/expect.texi: * doc/ref/goops.texi: * doc/ref/misc-modules.texi: * doc/ref/posix.texi: * doc/ref/repl-modules.texi: * doc/ref/scheme-ideas.texi: * doc/ref/scheme-scripts.texi: * doc/ref/srfi-modules.texi: * gc-benchmarks/larceny/dynamic.sch: * gc-benchmarks/larceny/twobit-input-long.sch: * gc-benchmarks/larceny/twobit.sch: * libguile/gc.h: * libguile/ioext.c: * libguile/list.c: * libguile/options.c: * libguile/posix.c: * libguile/threads.c: * module/ice-9/boot-9.scm: * module/ice-9/optargs.scm: * module/ice-9/ports.scm: * module/ice-9/pretty-print.scm: * module/ice-9/psyntax.scm: * module/language/elisp/parser.scm: * module/language/tree-il/compile-bytecode.scm: * module/srfi/srfi-37.scm: * module/srfi/srfi-43.scm: * module/statprof.scm: * module/texinfo/reflection.scm: * test-suite/tests/eval.test: * test-suite/tests/fluids.test: Fix typos. Signed-off-by: Ludovic Courtès <ludo@gnu.org>
1037 lines
33 KiB
Text
1037 lines
33 KiB
Text
@c -*-texinfo-*-
|
|
@c This is part of the GNU Guile Reference Manual.
|
|
@c Copyright (C) 2006, 2010, 2011
|
|
@c Free Software Foundation, Inc.
|
|
@c See the file guile.texi for copying conditions.
|
|
|
|
@node PEG Parsing
|
|
@section PEG Parsing
|
|
|
|
Parsing Expression Grammars (PEGs) are a way of specifying formal
|
|
languages for text processing. They can be used either for matching
|
|
(like regular expressions) or for building recursive descent parsers
|
|
(like lex/yacc). Guile uses a superset of PEG syntax that allows more
|
|
control over what information is preserved during parsing.
|
|
|
|
Wikipedia has a clear and concise introduction to PEGs if you want to
|
|
familiarize yourself with the syntax:
|
|
@url{http://en.wikipedia.org/wiki/Parsing_expression_grammar}.
|
|
|
|
The @code{(ice-9 peg)} module works by compiling PEGs down to lambda
|
|
expressions. These can either be stored in variables at compile-time by
|
|
the define macros (@code{define-peg-pattern} and
|
|
@code{define-peg-string-patterns}) or calculated explicitly at runtime
|
|
with the compile functions (@code{compile-peg-pattern} and
|
|
@code{peg-string-compile}).
|
|
|
|
They can then be used for either parsing (@code{match-pattern}) or searching
|
|
(@code{search-for-pattern}). For convenience, @code{search-for-pattern}
|
|
also takes pattern literals in case you want to inline a simple search
|
|
(people often use regular expressions this way).
|
|
|
|
The rest of this documentation consists of a syntax reference, an API
|
|
reference, and a tutorial.
|
|
|
|
@menu
|
|
* PEG Syntax Reference::
|
|
* PEG API Reference::
|
|
* PEG Tutorial::
|
|
* PEG Internals::
|
|
@end menu
|
|
|
|
@node PEG Syntax Reference
|
|
@subsection PEG Syntax Reference
|
|
|
|
@subsubheading Normal PEG Syntax:
|
|
|
|
@deftp {PEG Pattern} sequence a b
|
|
Parses @var{a}. If this succeeds, continues to parse @var{b} from the
|
|
end of the text parsed as @var{a}. Succeeds if both @var{a} and
|
|
@var{b} succeed.
|
|
|
|
@code{"a b"}
|
|
|
|
@code{(and a b)}
|
|
@end deftp
|
|
|
|
@deftp {PEG Pattern} {ordered choice} a b
|
|
Parses @var{a}. If this fails, backtracks and parses @var{b}.
|
|
Succeeds if either @var{a} or @var{b} succeeds.
|
|
|
|
@code{"a/b"}
|
|
|
|
@code{(or a b)}
|
|
@end deftp
|
|
|
|
@deftp {PEG Pattern} {zero or more} a
|
|
Parses @var{a} as many times in a row as it can, starting each @var{a}
|
|
at the end of the text parsed by the previous @var{a}. Always
|
|
succeeds.
|
|
|
|
@code{"a*"}
|
|
|
|
@code{(* a)}
|
|
@end deftp
|
|
|
|
@deftp {PEG Pattern} {one or more} a
|
|
Parses @var{a} as many times in a row as it can, starting each @var{a}
|
|
at the end of the text parsed by the previous @var{a}. Succeeds if at
|
|
least one @var{a} was parsed.
|
|
|
|
@code{"a+"}
|
|
|
|
@code{(+ a)}
|
|
@end deftp
|
|
|
|
@deftp {PEG Pattern} optional a
|
|
Tries to parse @var{a}. Succeeds if @var{a} succeeds.
|
|
|
|
@code{"a?"}
|
|
|
|
@code{(? a)}
|
|
@end deftp
|
|
|
|
@deftp {PEG Pattern} {followed by} a
|
|
Makes sure it is possible to parse @var{a}, but does not actually parse
|
|
it. Succeeds if @var{a} would succeed.
|
|
|
|
@code{"&a"}
|
|
|
|
@code{(followed-by a)}
|
|
@end deftp
|
|
|
|
@deftp {PEG Pattern} {not followed by} a
|
|
Makes sure it is impossible to parse @var{a}, but does not actually
|
|
parse it. Succeeds if @var{a} would fail.
|
|
|
|
@code{"!a"}
|
|
|
|
@code{(not-followed-by a)}
|
|
@end deftp
|
|
|
|
@deftp {PEG Pattern} {string literal} ``abc''
|
|
Parses the string @var{"abc"}. Succeeds if that parsing succeeds.
|
|
|
|
@code{"'abc'"}
|
|
|
|
@code{"abc"}
|
|
@end deftp
|
|
|
|
@deftp {PEG Pattern} {any character}
|
|
Parses any single character. Succeeds unless there is no more text to
|
|
be parsed.
|
|
|
|
@code{"."}
|
|
|
|
@code{peg-any}
|
|
@end deftp
|
|
|
|
@deftp {PEG Pattern} {character class} a b
|
|
Alternative syntax for ``Ordered Choice @var{a} @var{b}'' if @var{a} and
|
|
@var{b} are characters.
|
|
|
|
@code{"[ab]"}
|
|
|
|
@code{(or "a" "b")}
|
|
@end deftp
|
|
|
|
@deftp {PEG Pattern} {range of characters} a z
|
|
Parses any character falling between @var{a} and @var{z}.
|
|
|
|
@code{"[a-z]"}
|
|
|
|
@code{(range #\a #\z)}
|
|
@end deftp
|
|
|
|
Example:
|
|
|
|
@example
|
|
"(a !b / c &d*) 'e'+"
|
|
@end example
|
|
|
|
Would be:
|
|
|
|
@lisp
|
|
(and
|
|
(or
|
|
(and a (not-followed-by b))
|
|
(and c (followed-by (* d))))
|
|
(+ "e"))
|
|
@end lisp
|
|
|
|
@subsubheading Extended Syntax
|
|
|
|
There is some extra syntax for S-expressions.
|
|
|
|
@deftp {PEG Pattern} ignore a
|
|
Ignore the text matching @var{a}
|
|
@end deftp
|
|
|
|
@deftp {PEG Pattern} capture a
|
|
Capture the text matching @var{a}.
|
|
@end deftp
|
|
|
|
@deftp {PEG Pattern} peg a
|
|
Embed the PEG pattern @var{a} using string syntax.
|
|
@end deftp
|
|
|
|
Example:
|
|
|
|
@example
|
|
"!a / 'b'"
|
|
@end example
|
|
|
|
Is equivalent to
|
|
|
|
@lisp
|
|
(or (peg "!a") "b")
|
|
@end lisp
|
|
|
|
and
|
|
|
|
@lisp
|
|
(or (not-followed-by a) "b")
|
|
@end lisp
|
|
|
|
@node PEG API Reference
|
|
@subsection PEG API Reference
|
|
|
|
@subsubheading Define Macros
|
|
|
|
The most straightforward way to define a PEG is by using one of the
|
|
define macros (both of these macroexpand into @code{define}
|
|
expressions). These macros bind parsing functions to variables. These
|
|
parsing functions may be invoked by @code{match-pattern} or
|
|
@code{search-for-pattern}, which return a PEG match record. Raw data can be
|
|
retrieved from this record with the PEG match deconstructor functions.
|
|
More complicated (and perhaps enlightening) examples can be found in the
|
|
tutorial.
|
|
|
|
@deffn {Scheme Macro} define-peg-string-patterns peg-string
|
|
Defines all the nonterminals in the PEG @var{peg-string}. More
|
|
precisely, @code{define-peg-string-patterns} takes a superset of PEGs. A normal PEG
|
|
has a @code{<-} between the nonterminal and the pattern.
|
|
@code{define-peg-string-patterns} uses this symbol to determine what information it
|
|
should propagate up the parse tree. The normal @code{<-} propagates the
|
|
matched text up the parse tree, @code{<--} propagates the matched text
|
|
up the parse tree tagged with the name of the nonterminal, and @code{<}
|
|
discards that matched text and propagates nothing up the parse tree.
|
|
Also, nonterminals may consist of any alphanumeric character or a ``-''
|
|
character (in normal PEGs nonterminals can only be alphabetic).
|
|
|
|
For example, if we:
|
|
@lisp
|
|
(define-peg-string-patterns
|
|
"as <- 'a'+
|
|
bs <- 'b'+
|
|
as-or-bs <- as/bs")
|
|
(define-peg-string-patterns
|
|
"as-tag <-- 'a'+
|
|
bs-tag <-- 'b'+
|
|
as-or-bs-tag <-- as-tag/bs-tag")
|
|
@end lisp
|
|
Then:
|
|
@lisp
|
|
(match-pattern as-or-bs "aabbcc") @result{}
|
|
#<peg start: 0 end: 2 string: aabbcc tree: aa>
|
|
(match-pattern as-or-bs-tag "aabbcc") @result{}
|
|
#<peg start: 0 end: 2 string: aabbcc tree: (as-or-bs-tag (as-tag aa))>
|
|
@end lisp
|
|
|
|
Note that in doing this, we have bound 6 variables at the toplevel
|
|
(@var{as}, @var{bs}, @var{as-or-bs}, @var{as-tag}, @var{bs-tag}, and
|
|
@var{as-or-bs-tag}).
|
|
@end deffn
|
|
|
|
@deffn {Scheme Macro} define-peg-pattern name capture-type peg-sexp
|
|
Defines a single nonterminal @var{name}. @var{capture-type} determines
|
|
how much information is passed up the parse tree. @var{peg-sexp} is a
|
|
PEG in S-expression form.
|
|
|
|
Possible values for capture-type:
|
|
|
|
@table @code
|
|
@item all
|
|
passes the matched text up the parse tree tagged with the name of the
|
|
nonterminal.
|
|
@item body
|
|
passes the matched text up the parse tree.
|
|
@item none
|
|
passes nothing up the parse tree.
|
|
@end table
|
|
|
|
For Example, if we:
|
|
@lisp
|
|
(define-peg-pattern as body (+ "a"))
|
|
(define-peg-pattern bs body (+ "b"))
|
|
(define-peg-pattern as-or-bs body (or as bs))
|
|
(define-peg-pattern as-tag all (+ "a"))
|
|
(define-peg-pattern bs-tag all (+ "b"))
|
|
(define-peg-pattern as-or-bs-tag all (or as-tag bs-tag))
|
|
@end lisp
|
|
Then:
|
|
@lisp
|
|
(match-pattern as-or-bs "aabbcc") @result{}
|
|
#<peg start: 0 end: 2 string: aabbcc tree: aa>
|
|
(match-pattern as-or-bs-tag "aabbcc") @result{}
|
|
#<peg start: 0 end: 2 string: aabbcc tree: (as-or-bs-tag (as-tag aa))>
|
|
@end lisp
|
|
|
|
Note that in doing this, we have bound 6 variables at the toplevel
|
|
(@var{as}, @var{bs}, @var{as-or-bs}, @var{as-tag}, @var{bs-tag}, and
|
|
@var{as-or-bs-tag}).
|
|
@end deffn
|
|
|
|
@subsubheading Compile Functions
|
|
It is sometimes useful to be able to compile anonymous PEG patterns at
|
|
runtime. These functions let you do that using either syntax.
|
|
|
|
@deffn {Scheme Procedure} peg-string-compile peg-string capture-type
|
|
Compiles the PEG pattern in @var{peg-string} propagating according to
|
|
@var{capture-type} (capture-type can be any of the values from
|
|
@code{define-peg-pattern}).
|
|
@end deffn
|
|
|
|
|
|
@deffn {Scheme Procedure} compile-peg-pattern peg-sexp capture-type
|
|
Compiles the PEG pattern in @var{peg-sexp} propagating according to
|
|
@var{capture-type} (capture-type can be any of the values from
|
|
@code{define-peg-pattern}).
|
|
@end deffn
|
|
|
|
The functions return syntax objects, which can be useful if you want to
|
|
use them in macros. If all you want is to define a new nonterminal, you
|
|
can do the following:
|
|
|
|
@lisp
|
|
(define exp '(+ "a"))
|
|
(define as (compile (compile-peg-pattern exp 'body)))
|
|
@end lisp
|
|
|
|
You can use this nonterminal with all of the regular PEG functions:
|
|
|
|
@lisp
|
|
(match-pattern as "aaaaa") @result{}
|
|
#<peg start: 0 end: 5 string: bbbbb tree: bbbbb>
|
|
@end lisp
|
|
|
|
@subsubheading Parsing & Matching Functions
|
|
|
|
For our purposes, ``parsing'' means parsing a string into a tree
|
|
starting from the first character, while ``matching'' means searching
|
|
through the string for a substring. In practice, the only difference
|
|
between the two functions is that @code{match-pattern} gives up if it can't
|
|
find a valid substring starting at index 0 and @code{search-for-pattern} keeps
|
|
looking. They are both equally capable of ``parsing'' and ``matching''
|
|
given those constraints.
|
|
|
|
@deffn {Scheme Procedure} match-pattern nonterm string
|
|
Parses @var{string} using the PEG stored in @var{nonterm}. If no match
|
|
was found, @code{match-pattern} returns false. If a match was found, a PEG
|
|
match record is returned.
|
|
|
|
The @code{capture-type} argument to @code{define-peg-pattern} allows you to
|
|
choose what information to hold on to while parsing. The options are:
|
|
|
|
@table @code
|
|
@item all
|
|
tag the matched text with the nonterminal
|
|
@item body
|
|
just the matched text
|
|
@item none
|
|
nothing
|
|
@end table
|
|
|
|
@lisp
|
|
(define-peg-pattern as all (+ "a"))
|
|
(match-pattern as "aabbcc") @result{}
|
|
#<peg start: 0 end: 2 string: aabbcc tree: (as aa)>
|
|
|
|
(define-peg-pattern as body (+ "a"))
|
|
(match-pattern as "aabbcc") @result{}
|
|
#<peg start: 0 end: 2 string: aabbcc tree: aa>
|
|
|
|
(define-peg-pattern as none (+ "a"))
|
|
(match-pattern as "aabbcc") @result{}
|
|
#<peg start: 0 end: 2 string: aabbcc tree: ()>
|
|
|
|
(define-peg-pattern bs body (+ "b"))
|
|
(match-pattern bs "aabbcc") @result{}
|
|
#f
|
|
@end lisp
|
|
@end deffn
|
|
|
|
@deffn {Scheme Macro} search-for-pattern nonterm-or-peg string
|
|
Searches through @var{string} looking for a matching subexpression.
|
|
@var{nonterm-or-peg} can either be a nonterminal or a literal PEG
|
|
pattern. When a literal PEG pattern is provided, @code{search-for-pattern} works
|
|
very similarly to the regular expression searches many hackers are used
|
|
to. If no match was found, @code{search-for-pattern} returns false. If a match
|
|
was found, a PEG match record is returned.
|
|
|
|
@lisp
|
|
(define-peg-pattern as body (+ "a"))
|
|
(search-for-pattern as "aabbcc") @result{}
|
|
#<peg start: 0 end: 2 string: aabbcc tree: aa>
|
|
(search-for-pattern (+ "a") "aabbcc") @result{}
|
|
#<peg start: 0 end: 2 string: aabbcc tree: aa>
|
|
(search-for-pattern "'a'+" "aabbcc") @result{}
|
|
#<peg start: 0 end: 2 string: aabbcc tree: aa>
|
|
|
|
(define-peg-pattern as all (+ "a"))
|
|
(search-for-pattern as "aabbcc") @result{}
|
|
#<peg start: 0 end: 2 string: aabbcc tree: (as aa)>
|
|
|
|
(define-peg-pattern bs body (+ "b"))
|
|
(search-for-pattern bs "aabbcc") @result{}
|
|
#<peg start: 2 end: 4 string: aabbcc tree: bb>
|
|
(search-for-pattern (+ "b") "aabbcc") @result{}
|
|
#<peg start: 2 end: 4 string: aabbcc tree: bb>
|
|
(search-for-pattern "'b'+" "aabbcc") @result{}
|
|
#<peg start: 2 end: 4 string: aabbcc tree: bb>
|
|
|
|
(define-peg-pattern zs body (+ "z"))
|
|
(search-for-pattern zs "aabbcc") @result{}
|
|
#f
|
|
(search-for-pattern (+ "z") "aabbcc") @result{}
|
|
#f
|
|
(search-for-pattern "'z'+" "aabbcc") @result{}
|
|
#f
|
|
@end lisp
|
|
@end deffn
|
|
|
|
@subsubheading PEG Match Records
|
|
The @code{match-pattern} and @code{search-for-pattern} functions both return PEG
|
|
match records. Actual information can be extracted from these with the
|
|
following functions.
|
|
|
|
@deffn {Scheme Procedure} peg:string match-record
|
|
Returns the original string that was parsed in the creation of
|
|
@code{match-record}.
|
|
@end deffn
|
|
|
|
@deffn {Scheme Procedure} peg:start match-record
|
|
Returns the index of the first parsed character in the original string
|
|
(from @code{peg:string}). If this is the same as @code{peg:end},
|
|
nothing was parsed.
|
|
@end deffn
|
|
|
|
@deffn {Scheme Procedure} peg:end match-record
|
|
Returns one more than the index of the last parsed character in the
|
|
original string (from @code{peg:string}). If this is the same as
|
|
@code{peg:start}, nothing was parsed.
|
|
@end deffn
|
|
|
|
@deffn {Scheme Procedure} peg:substring match-record
|
|
Returns the substring parsed by @code{match-record}. This is equivalent to
|
|
@code{(substring (peg:string match-record) (peg:start match-record) (peg:end
|
|
match-record))}.
|
|
@end deffn
|
|
|
|
@deffn {Scheme Procedure} peg:tree match-record
|
|
Returns the tree parsed by @code{match-record}.
|
|
@end deffn
|
|
|
|
@deffn {Scheme Procedure} peg-record? match-record
|
|
Returns true if @code{match-record} is a PEG match record, or false
|
|
otherwise.
|
|
@end deffn
|
|
|
|
Example:
|
|
@lisp
|
|
(define-peg-pattern bs all (peg "'b'+"))
|
|
|
|
(search-for-pattern bs "aabbcc") @result{}
|
|
#<peg start: 2 end: 4 string: aabbcc tree: (bs bb)>
|
|
|
|
(let ((pm (search-for-pattern bs "aabbcc")))
|
|
`((string ,(peg:string pm))
|
|
(start ,(peg:start pm))
|
|
(end ,(peg:end pm))
|
|
(substring ,(peg:substring pm))
|
|
(tree ,(peg:tree pm))
|
|
(record? ,(peg-record? pm)))) @result{}
|
|
((string "aabbcc")
|
|
(start 2)
|
|
(end 4)
|
|
(substring "bb")
|
|
(tree (bs "bb"))
|
|
(record? #t))
|
|
@end lisp
|
|
|
|
@subsubheading Miscellaneous
|
|
|
|
@deffn {Scheme Procedure} context-flatten tst lst
|
|
Takes a predicate @var{tst} and a list @var{lst}. Flattens @var{lst}
|
|
until all elements are either atoms or satisfy @var{tst}. If @var{lst}
|
|
itself satisfies @var{tst}, @code{(list lst)} is returned (this is a
|
|
flat list whose only element satisfies @var{tst}).
|
|
|
|
@lisp
|
|
(context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(2 2 (1 1 (2 2)) (2 2 (1 1)))) @result{}
|
|
(2 2 (1 1 (2 2)) 2 2 (1 1))
|
|
(context-flatten (lambda (x) (and (number? (car x)) (= (car x) 1))) '(1 1 (1 1 (2 2)) (2 2 (1 1)))) @result{}
|
|
((1 1 (1 1 (2 2)) (2 2 (1 1))))
|
|
@end lisp
|
|
|
|
If you're wondering why this is here, take a look at the tutorial.
|
|
@end deffn
|
|
|
|
@deffn {Scheme Procedure} keyword-flatten terms lst
|
|
A less general form of @code{context-flatten}. Takes a list of terminal
|
|
atoms @code{terms} and flattens @var{lst} until all elements are either
|
|
atoms, or lists which have an atom from @code{terms} as their first
|
|
element.
|
|
@lisp
|
|
(keyword-flatten '(a b) '(c a b (a c) (b c) (c (b a) (c a)))) @result{}
|
|
(c a b (a c) (b c) c (b a) c a)
|
|
@end lisp
|
|
|
|
If you're wondering why this is here, take a look at the tutorial.
|
|
@end deffn
|
|
|
|
@node PEG Tutorial
|
|
@subsection PEG Tutorial
|
|
|
|
@subsubheading Parsing /etc/passwd
|
|
This example will show how to parse /etc/passwd using PEGs.
|
|
|
|
First we define an example /etc/passwd file:
|
|
|
|
@lisp
|
|
(define *etc-passwd*
|
|
"root:x:0:0:root:/root:/bin/bash
|
|
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
|
|
bin:x:2:2:bin:/bin:/bin/sh
|
|
sys:x:3:3:sys:/dev:/bin/sh
|
|
nobody:x:65534:65534:nobody:/nonexistent:/bin/sh
|
|
messagebus:x:103:107::/var/run/dbus:/bin/false
|
|
")
|
|
@end lisp
|
|
|
|
As a first pass at this, we might want to have all the entries in
|
|
/etc/passwd in a list.
|
|
|
|
Doing this with string-based PEG syntax would look like this:
|
|
@lisp
|
|
(define-peg-string-patterns
|
|
"passwd <- entry* !.
|
|
entry <-- (! NL .)* NL*
|
|
NL < '\n'")
|
|
@end lisp
|
|
|
|
A @code{passwd} file is 0 or more entries (@code{entry*}) until the end
|
|
of the file (@code{!.} (@code{.} is any character, so @code{!.} means
|
|
``not anything'')). We want to capture the data in the nonterminal
|
|
@code{passwd}, but not tag it with the name, so we use @code{<-}.
|
|
|
|
An entry is a series of 0 or more characters that aren't newlines
|
|
(@code{(! NL .)*}) followed by 0 or more newlines (@code{NL*}). We want
|
|
to tag all the entries with @code{entry}, so we use @code{<--}.
|
|
|
|
A newline is just a literal newline (@code{'\n'}). We don't want a
|
|
bunch of newlines cluttering up the output, so we use @code{<} to throw
|
|
away the captured data.
|
|
|
|
Here is the same PEG defined using S-expressions:
|
|
@lisp
|
|
(define-peg-pattern passwd body (and (* entry) (not-followed-by peg-any)))
|
|
(define-peg-pattern entry all (and (* (and (not-followed-by NL) peg-any))
|
|
(* NL)))
|
|
(define-peg-pattern NL none "\n")
|
|
@end lisp
|
|
|
|
Obviously this is much more verbose. On the other hand, it's more
|
|
explicit, and thus easier to build automatically. However, there are
|
|
some tricks that make S-expressions easier to use in some cases. One is
|
|
the @code{ignore} keyword; the string syntax has no way to say ``throw
|
|
away this text'' except breaking it out into a separate nonterminal.
|
|
For instance, to throw away the newlines we had to define @code{NL}. In
|
|
the S-expression syntax, we could have simply written @code{(ignore
|
|
"\n")}. Also, for the cases where string syntax is really much cleaner,
|
|
the @code{peg} keyword can be used to embed string syntax in
|
|
S-expression syntax. For instance, we could have written:
|
|
|
|
@lisp
|
|
(define-peg-pattern passwd body (peg "entry* !."))
|
|
@end lisp
|
|
|
|
However we define it, parsing @code{*etc-passwd*} with the @code{passwd}
|
|
nonterminal yields the same results:
|
|
|
|
@lisp
|
|
(peg:tree (match-pattern passwd *etc-passwd*)) @result{}
|
|
((entry "root:x:0:0:root:/root:/bin/bash")
|
|
(entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh")
|
|
(entry "bin:x:2:2:bin:/bin:/bin/sh")
|
|
(entry "sys:x:3:3:sys:/dev:/bin/sh")
|
|
(entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
|
|
(entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
|
|
@end lisp
|
|
|
|
However, here is something to be wary of:
|
|
|
|
@lisp
|
|
(peg:tree (match-pattern passwd "one entry")) @result{}
|
|
(entry "one entry")
|
|
@end lisp
|
|
|
|
By default, the parse trees generated by PEGs are compressed as much as
|
|
possible without losing information. It may not look like this is what
|
|
you want at first, but uncompressed parse trees are an enormous headache
|
|
(there's no easy way to predict how deep particular lists will nest,
|
|
there are empty lists littered everywhere, etc. etc.). One side-effect
|
|
of this, however, is that sometimes the compressor is too aggressive.
|
|
No information is discarded when @code{((entry "one entry"))} is
|
|
compressed to @code{(entry "one entry")}, but in this particular case it
|
|
probably isn't what we want.
|
|
|
|
There are two functions for easily dealing with this:
|
|
@code{keyword-flatten} and @code{context-flatten}. The
|
|
@code{keyword-flatten} function takes a list of keywords and a list to
|
|
flatten, then tries to coerce the list such that the first element of
|
|
all sublists is one of the keywords. The @code{context-flatten}
|
|
function is similar, but instead of a list of keywords it takes a
|
|
predicate that should indicate whether a given sublist is good enough
|
|
(refer to the API reference for more details).
|
|
|
|
What we want here is @code{keyword-flatten}.
|
|
@lisp
|
|
(keyword-flatten '(entry) (peg:tree (match-pattern passwd *etc-passwd*))) @result{}
|
|
((entry "root:x:0:0:root:/root:/bin/bash")
|
|
(entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh")
|
|
(entry "bin:x:2:2:bin:/bin:/bin/sh")
|
|
(entry "sys:x:3:3:sys:/dev:/bin/sh")
|
|
(entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
|
|
(entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
|
|
(keyword-flatten '(entry) (peg:tree (match-pattern passwd "one entry"))) @result{}
|
|
((entry "one entry"))
|
|
@end lisp
|
|
|
|
Of course, this is a somewhat contrived example. In practice we would
|
|
probably just tag the @code{passwd} nonterminal to remove the ambiguity
|
|
(using either the @code{all} keyword for S-expressions or the @code{<--}
|
|
symbol for strings)..
|
|
|
|
@lisp
|
|
(define-peg-pattern tag-passwd all (peg "entry* !."))
|
|
(peg:tree (match-pattern tag-passwd *etc-passwd*)) @result{}
|
|
(tag-passwd
|
|
(entry "root:x:0:0:root:/root:/bin/bash")
|
|
(entry "daemon:x:1:1:daemon:/usr/sbin:/bin/sh")
|
|
(entry "bin:x:2:2:bin:/bin:/bin/sh")
|
|
(entry "sys:x:3:3:sys:/dev:/bin/sh")
|
|
(entry "nobody:x:65534:65534:nobody:/nonexistent:/bin/sh")
|
|
(entry "messagebus:x:103:107::/var/run/dbus:/bin/false"))
|
|
(peg:tree (match-pattern tag-passwd "one entry"))
|
|
(tag-passwd
|
|
(entry "one entry"))
|
|
@end lisp
|
|
|
|
If you're ever uncertain about the potential results of parsing
|
|
something, remember the two absolute rules:
|
|
@enumerate
|
|
@item
|
|
No parsing information will ever be discarded.
|
|
@item
|
|
There will never be any lists with fewer than 2 elements.
|
|
@end enumerate
|
|
|
|
For the purposes of (1), "parsing information" means things tagged with
|
|
the @code{any} keyword or the @code{<--} symbol. Plain strings will be
|
|
concatenated.
|
|
|
|
Let's extend this example a bit more and actually pull some useful
|
|
information out of the passwd file:
|
|
|
|
@lisp
|
|
(define-peg-string-patterns
|
|
"passwd <-- entry* !.
|
|
entry <-- login C pass C uid C gid C nameORcomment C homedir C shell NL*
|
|
login <-- text
|
|
pass <-- text
|
|
uid <-- [0-9]*
|
|
gid <-- [0-9]*
|
|
nameORcomment <-- text
|
|
homedir <-- path
|
|
shell <-- path
|
|
path <-- (SLASH pathELEMENT)*
|
|
pathELEMENT <-- (!NL !C !'/' .)*
|
|
text <- (!NL !C .)*
|
|
C < ':'
|
|
NL < '\n'
|
|
SLASH < '/'")
|
|
@end lisp
|
|
|
|
This produces rather pretty parse trees:
|
|
@lisp
|
|
(passwd
|
|
(entry (login "root")
|
|
(pass "x")
|
|
(uid "0")
|
|
(gid "0")
|
|
(nameORcomment "root")
|
|
(homedir (path (pathELEMENT "root")))
|
|
(shell (path (pathELEMENT "bin") (pathELEMENT "bash"))))
|
|
(entry (login "daemon")
|
|
(pass "x")
|
|
(uid "1")
|
|
(gid "1")
|
|
(nameORcomment "daemon")
|
|
(homedir
|
|
(path (pathELEMENT "usr") (pathELEMENT "sbin")))
|
|
(shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
|
|
(entry (login "bin")
|
|
(pass "x")
|
|
(uid "2")
|
|
(gid "2")
|
|
(nameORcomment "bin")
|
|
(homedir (path (pathELEMENT "bin")))
|
|
(shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
|
|
(entry (login "sys")
|
|
(pass "x")
|
|
(uid "3")
|
|
(gid "3")
|
|
(nameORcomment "sys")
|
|
(homedir (path (pathELEMENT "dev")))
|
|
(shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
|
|
(entry (login "nobody")
|
|
(pass "x")
|
|
(uid "65534")
|
|
(gid "65534")
|
|
(nameORcomment "nobody")
|
|
(homedir (path (pathELEMENT "nonexistent")))
|
|
(shell (path (pathELEMENT "bin") (pathELEMENT "sh"))))
|
|
(entry (login "messagebus")
|
|
(pass "x")
|
|
(uid "103")
|
|
(gid "107")
|
|
nameORcomment
|
|
(homedir
|
|
(path (pathELEMENT "var")
|
|
(pathELEMENT "run")
|
|
(pathELEMENT "dbus")))
|
|
(shell (path (pathELEMENT "bin") (pathELEMENT "false")))))
|
|
@end lisp
|
|
|
|
Notice that when there's no entry in a field (e.g. @code{nameORcomment}
|
|
for messagebus) the symbol is inserted. This is the ``don't throw away
|
|
any information'' rule---we successfully matched a @code{nameORcomment}
|
|
of 0 characters (since we used @code{*} when defining it). This is
|
|
usually what you want, because it allows you to e.g. use @code{list-ref}
|
|
to pull out elements (since they all have known offsets).
|
|
|
|
If you'd prefer not to have symbols for empty matches, you can replace
|
|
the @code{*} with a @code{+} and add a @code{?} after the
|
|
@code{nameORcomment} in @code{entry}. Then it will try to parse 1 or
|
|
more characters, fail (inserting nothing into the parse tree), but
|
|
continue because it didn't have to match the nameORcomment to continue.
|
|
|
|
|
|
@subsubheading Embedding Arithmetic Expressions
|
|
|
|
We can parse simple mathematical expressions with the following PEG:
|
|
|
|
@lisp
|
|
(define-peg-string-patterns
|
|
"expr <- sum
|
|
sum <-- (product ('+' / '-') sum) / product
|
|
product <-- (value ('*' / '/') product) / value
|
|
value <-- number / '(' expr ')'
|
|
number <-- [0-9]+")
|
|
@end lisp
|
|
|
|
Then:
|
|
@lisp
|
|
(peg:tree (match-pattern expr "1+1/2*3+(1+1)/2")) @result{}
|
|
(sum (product (value (number "1")))
|
|
"+"
|
|
(sum (product
|
|
(value (number "1"))
|
|
"/"
|
|
(product
|
|
(value (number "2"))
|
|
"*"
|
|
(product (value (number "3")))))
|
|
"+"
|
|
(sum (product
|
|
(value "("
|
|
(sum (product (value (number "1")))
|
|
"+"
|
|
(sum (product (value (number "1")))))
|
|
")")
|
|
"/"
|
|
(product (value (number "2")))))))
|
|
@end lisp
|
|
|
|
There is very little wasted effort in this PEG. The @code{number}
|
|
nonterminal has to be tagged because otherwise the numbers might run
|
|
together with the arithmetic expressions during the string concatenation
|
|
stage of parse-tree compression (the parser will see ``1'' followed by
|
|
``/'' and decide to call it ``1/''). When in doubt, tag.
|
|
|
|
It is very easy to turn these parse trees into lisp expressions:
|
|
|
|
@lisp
|
|
(define (parse-sum sum left . rest)
|
|
(if (null? rest)
|
|
(apply parse-product left)
|
|
(list (string->symbol (car rest))
|
|
(apply parse-product left)
|
|
(apply parse-sum (cadr rest)))))
|
|
|
|
(define (parse-product product left . rest)
|
|
(if (null? rest)
|
|
(apply parse-value left)
|
|
(list (string->symbol (car rest))
|
|
(apply parse-value left)
|
|
(apply parse-product (cadr rest)))))
|
|
|
|
(define (parse-value value first . rest)
|
|
(if (null? rest)
|
|
(string->number (cadr first))
|
|
(apply parse-sum (car rest))))
|
|
|
|
(define parse-expr parse-sum)
|
|
@end lisp
|
|
|
|
(Notice all these functions look very similar; for a more complicated
|
|
PEG, it would be worth abstracting.)
|
|
|
|
Then:
|
|
@lisp
|
|
(apply parse-expr (peg:tree (match-pattern expr "1+1/2*3+(1+1)/2"))) @result{}
|
|
(+ 1 (+ (/ 1 (* 2 3)) (/ (+ 1 1) 2)))
|
|
@end lisp
|
|
|
|
But wait! The associativity is wrong! Where it says @code{(/ 1 (* 2
|
|
3))}, it should say @code{(* (/ 1 2) 3)}.
|
|
|
|
It's tempting to try replacing e.g. @code{"sum <-- (product ('+' / '-')
|
|
sum) / product"} with @code{"sum <-- (sum ('+' / '-') product) /
|
|
product"}, but this is a Bad Idea. PEGs don't support left recursion.
|
|
To see why, imagine what the parser will do here. When it tries to
|
|
parse @code{sum}, it first has to try and parse @code{sum}. But to do
|
|
that, it first has to try and parse @code{sum}. This will continue
|
|
until the stack gets blown off.
|
|
|
|
So how does one parse left-associative binary operators with PEGs?
|
|
Honestly, this is one of their major shortcomings. There's no
|
|
general-purpose way of doing this, but here the repetition operators are
|
|
a good choice:
|
|
|
|
@lisp
|
|
(use-modules (srfi srfi-1))
|
|
|
|
(define-peg-string-patterns
|
|
"expr <- sum
|
|
sum <-- (product ('+' / '-'))* product
|
|
product <-- (value ('*' / '/'))* value
|
|
value <-- number / '(' expr ')'
|
|
number <-- [0-9]+")
|
|
|
|
;; take a deep breath...
|
|
(define (make-left-parser next-func)
|
|
(lambda (sum first . rest) ;; general form, comments below assume
|
|
;; that we're dealing with a sum expression
|
|
(if (null? rest) ;; form (sum (product ...))
|
|
(apply next-func first)
|
|
(if (string? (cadr first));; form (sum ((product ...) "+") (product ...))
|
|
(list (string->symbol (cadr first))
|
|
(apply next-func (car first))
|
|
(apply next-func (car rest)))
|
|
;; form (sum (((product ...) "+") ((product ...) "+")) (product ...))
|
|
(car
|
|
(reduce ;; walk through the list and build a left-associative tree
|
|
(lambda (l r)
|
|
(list (list (cadr r) (car r) (apply next-func (car l)))
|
|
(string->symbol (cadr l))))
|
|
'ignore
|
|
(append ;; make a list of all the products
|
|
;; the first one should be pre-parsed
|
|
(list (list (apply next-func (caar first))
|
|
(string->symbol (cadar first))))
|
|
(cdr first)
|
|
;; the last one has to be added in
|
|
(list (append rest '("done"))))))))))
|
|
|
|
(define (parse-value value first . rest)
|
|
(if (null? rest)
|
|
(string->number (cadr first))
|
|
(apply parse-sum (car rest))))
|
|
(define parse-product (make-left-parser parse-value))
|
|
(define parse-sum (make-left-parser parse-product))
|
|
(define parse-expr parse-sum)
|
|
@end lisp
|
|
|
|
Then:
|
|
@lisp
|
|
(apply parse-expr (peg:tree (match-pattern expr "1+1/2*3+(1+1)/2"))) @result{}
|
|
(+ (+ 1 (* (/ 1 2) 3)) (/ (+ 1 1) 2))
|
|
@end lisp
|
|
|
|
As you can see, this is much uglier (it could be made prettier by using
|
|
@code{context-flatten}, but the way it's written above makes it clear
|
|
how we deal with the three ways the zero-or-more @code{*} expression can
|
|
parse). Fortunately, most of the time we can get away with only using
|
|
right-associativity.
|
|
|
|
@subsubheading Simplified Functions
|
|
|
|
For a more tantalizing example, consider the following grammar that
|
|
parses (highly) simplified C functions:
|
|
|
|
@lisp
|
|
(define-peg-string-patterns
|
|
"cfunc <-- cSP ctype cSP cname cSP cargs cLB cSP cbody cRB
|
|
ctype <-- cidentifier
|
|
cname <-- cidentifier
|
|
cargs <-- cLP (! (cSP cRP) carg cSP (cCOMMA / cRP) cSP)* cSP
|
|
carg <-- cSP ctype cSP cname
|
|
cbody <-- cstatement *
|
|
cidentifier <- [a-zA-z][a-zA-Z0-9_]*
|
|
cstatement <-- (!';'.)*cSC cSP
|
|
cSC < ';'
|
|
cCOMMA < ','
|
|
cLP < '('
|
|
cRP < ')'
|
|
cLB < '@{'
|
|
cRB < '@}'
|
|
cSP < [ \t\n]*")
|
|
@end lisp
|
|
|
|
Then:
|
|
@lisp
|
|
(match-pattern cfunc "int square(int a) @{ return a*a;@}") @result{}
|
|
(32
|
|
(cfunc (ctype "int")
|
|
(cname "square")
|
|
(cargs (carg (ctype "int") (cname "a")))
|
|
(cbody (cstatement "return a*a"))))
|
|
@end lisp
|
|
|
|
And:
|
|
@lisp
|
|
(match-pattern cfunc "int mod(int a, int b) @{ int c = a/b;return a-b*c; @}") @result{}
|
|
(52
|
|
(cfunc (ctype "int")
|
|
(cname "mod")
|
|
(cargs (carg (ctype "int") (cname "a"))
|
|
(carg (ctype "int") (cname "b")))
|
|
(cbody (cstatement "int c = a/b")
|
|
(cstatement "return a- b*c"))))
|
|
@end lisp
|
|
|
|
By wrapping all the @code{carg} nonterminals in a @code{cargs}
|
|
nonterminal, we were able to remove any ambiguity in the parsing
|
|
structure and avoid having to call @code{context-flatten} on the output
|
|
of @code{match-pattern}. We used the same trick with the @code{cstatement}
|
|
nonterminals, wrapping them in a @code{cbody} nonterminal.
|
|
|
|
The whitespace nonterminal @code{cSP} used here is a (very) useful
|
|
instantiation of a common pattern for matching syntactically irrelevant
|
|
information. Since it's tagged with @code{<} and ends with @code{*} it
|
|
won't clutter up the parse trees (all the empty lists will be discarded
|
|
during the compression step) and it will never cause parsing to fail.
|
|
|
|
@node PEG Internals
|
|
@subsection PEG Internals
|
|
|
|
A PEG parser takes a string as input and attempts to parse it as a given
|
|
nonterminal. The key idea of the PEG implementation is that every
|
|
nonterminal is just a function that takes a string as an argument and
|
|
attempts to parse that string as its nonterminal. The functions always
|
|
start from the beginning, but a parse is considered successful if there
|
|
is material left over at the end.
|
|
|
|
This makes it easy to model different PEG parsing operations. For
|
|
instance, consider the PEG grammar @code{"ab"}, which could also be
|
|
written @code{(and "a" "b")}. It matches the string ``ab''. Here's how
|
|
that might be implemented in the PEG style:
|
|
|
|
@lisp
|
|
(define (match-and-a-b str)
|
|
(match-a str)
|
|
(match-b str))
|
|
@end lisp
|
|
|
|
As you can see, the use of functions provides an easy way to model
|
|
sequencing. In a similar way, one could model @code{(or a b)} with
|
|
something like the following:
|
|
|
|
@lisp
|
|
(define (match-or-a-b str)
|
|
(or (match-a str) (match-b str)))
|
|
@end lisp
|
|
|
|
Here the semantics of a PEG @code{or} expression map naturally onto
|
|
Scheme's @code{or} operator. This function will attempt to run
|
|
@code{(match-a str)}, and return its result if it succeeds. Otherwise it
|
|
will run @code{(match-b str)}.
|
|
|
|
Of course, the code above wouldn't quite work. We need some way for the
|
|
parsing functions to communicate. The actual interface used is below.
|
|
|
|
@subsubheading Parsing Function Interface
|
|
|
|
A parsing function takes three arguments - a string, the length of that
|
|
string, and the position in that string it should start parsing at. In
|
|
effect, the parsing functions pass around substrings in pieces - the
|
|
first argument is a buffer of characters, and the second two give a
|
|
range within that buffer that the parsing function should look at.
|
|
|
|
Parsing functions return either #f, if they failed to match their
|
|
nonterminal, or a list whose first element must be an integer
|
|
representing the final position in the string they matched and whose cdr
|
|
can be any other data the function wishes to return, or '() if it
|
|
doesn't have any more data.
|
|
|
|
The one caveat is that if the extra data it returns is a list, any
|
|
adjacent strings in that list will be appended by @code{match-pattern}. For
|
|
instance, if a parsing function returns @code{(13 ("a" "b" "c"))},
|
|
@code{match-pattern} will take @code{(13 ("abc"))} as its value.
|
|
|
|
For example, here is a function to match ``ab'' using the actual
|
|
interface.
|
|
|
|
@lisp
|
|
(define (match-a-b str len pos)
|
|
(and (<= (+ pos 2) len)
|
|
(string= str "ab" pos (+ pos 2))
|
|
(list (+ pos 2) '()))) ; we return no extra information
|
|
@end lisp
|
|
|
|
The above function can be used to match a string by running
|
|
@code{(match-pattern match-a-b "ab")}.
|
|
|
|
@subsubheading Code Generators and Extensible Syntax
|
|
|
|
PEG expressions, such as those in a @code{define-peg-pattern} form, are
|
|
interpreted internally in two steps.
|
|
|
|
First, any string PEG is expanded into an s-expression PEG by the code
|
|
in the @code{(ice-9 peg string-peg)} module.
|
|
|
|
Then, the s-expression PEG that results is compiled into a parsing
|
|
function by the @code{(ice-9 peg codegen)} module. In particular, the
|
|
function @code{compile-peg-pattern} is called on the s-expression. It then
|
|
decides what to do based on the form it is passed.
|
|
|
|
The PEG syntax can be expanded by providing @code{compile-peg-pattern} more
|
|
options for what to do with its forms. The extended syntax will be
|
|
associated with a symbol, for instance @code{my-parsing-form}, and will
|
|
be called on all PEG expressions of the form
|
|
@lisp
|
|
(my-parsing-form ...)
|
|
@end lisp
|
|
|
|
The parsing function should take two arguments. The first will be a
|
|
syntax object containing a list with all of the arguments to the form
|
|
(but not the form's name), and the second will be the
|
|
@code{capture-type} argument that is passed to @code{define-peg-pattern}.
|
|
|
|
New functions can be registered by calling @code{(add-peg-compiler!
|
|
symbol function)}, where @code{symbol} is the symbol that will indicate
|
|
a form of this type and @code{function} is the code generating function
|
|
described above. The function @code{add-peg-compiler!} is exported from
|
|
the @code{(ice-9 peg codegen)} module.
|