Deep expression chain performance

Discussion:

(too old to reply)

Andy

2023-04-25 13:51:55 UTC

I am writing C grammar.
Grammar speed may be down caused by deep expression chain.
For example, simple "n=0" has 20 levels:
assignmentExpression
conditionalExpression
logicalOrExpression
logicalAndExpression
inclusiveOrExpression
exclusiveOrExpression
andExpression
eqaulityExpression
relationalExpression
shiftExpression
additiveExpression
multiplicativeExpression
castExpression
unaryExpression
postfixExpression
postfixExpressionLeft
atom
literal
IntegerConstant
DeciamlConstant

In other hand, Rust grammar for Antlr4
https://github.com/antlr/grammars-v4/blob/master/rust/RustParser.g4
has precedence definition:
expression
: outerAttribute+ expression # AttributedExpression // technical, remove left recursive
| literalExpression # LiteralExpression_
| pathExpression # PathExpression_
| expression '.' pathExprSegment '(' callParams? ')' # MethodCallExpression // 8.2.10
| expression '.' identifier # FieldExpression // 8.2.11
| expression '.' tupleIndex # TupleIndexingExpression // 8.2.7
| expression '.' 'await' # AwaitExpression // 8.2.18
..............................
If I do similar in C grammar:
expr
: atom # atom_
| '__extension__'? '(' compoundStatement ')' # groupedExpression
| expr '(' argumentExpressionList? ')' # postfixExpression
| expr '.' Identifier # postfixExpression
| expr '->' Identifier # postfixExpression
| expr '[' expr ']' # postfixExpression

parsing time increased by orders of magnitude: short .c files was processed longer a long file I didn't wait for it to finish (probably increasing nonlinear)

My ask:
form expr: expr operator expr
is always ambiguous and waste time or I something bad have written?
How is possible to reduce depth of chain?
Parsing time with Antlr is much longer than parsing + semantic actions with GCC. Probably big part of this time is long depth of expressions.
[If your parser is taking a lot of time, something is wrong. In the
compilers I know, the pareer is always a tiny part of the work, which
is dominated by the lexer which has to look at each character in the
input, and optimizers that have to make multiple passes over the whole
intermediate representation. -John]

Anton Ertl

2023-04-26 16:33:54 UTC

Permalink

Andy <***@gmail.com> writes:
[ANTLR]

Post by Andy
expr
: atom # atom_
| '__extension__'? '(' compoundStatement ')' # groupedExpression
| expr '(' argumentExpressionList? ')' # postfixExpression
| expr '.' Identifier # postfixExpression
| expr '->' Identifier # postfixExpression
| expr '[' expr ']' # postfixExpression
parsing time increased by orders of magnitude: short .c files was processed longer a long file I didn't wait for it to finish (probably increasing nonlinear)

I am not an expert on ANTLR, but it is based on LL-parsing, and
normally LL-based parsers endlessly recurse on left recursion.
However, given that your Rust grammar has left recursion, too, ANTLR
seems to be able to cope with that problem at least in some cases.

There are other features in ANTLR that cause superlinear parsing
times: syntactic predicates and semantic predicates. IIRC there can
also be slowdowns if a long lookahead is necessary.

Post by Andy
form expr: expr operator expr
is always ambiguous

Yes.

Post by Andy
and waste time

Depends on how the ambiguity is dealt with. E.g., yacc resolves the
ambiguity in some way, and suppresses the alternative parse, and
therefore has linear performance. However, the suppressed alternative
may be the one you were interested in, so better define the grammar
unambiguously. I don't see such a rule in the example above, though.

By contrast, a GLR parser or Early parser keeps the alternatives in
some way, resulting in superexponential performance (n^3 for Early,
not sure for GLR).

- anton
--
M. Anton Ertl
***@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/

gah4

2023-04-26 20:06:51 UTC

Permalink

Post by Andy
I am writing C grammar.
Grammar speed may be down caused by deep expression chain.
assignmentExpression
conditionalExpression
logicalOrExpression
logicalAndExpression
inclusiveOrExpression
exclusiveOrExpression
andExpression
eqaulityExpression
relationalExpression
shiftExpression
additiveExpression
multiplicativeExpression
castExpression
unaryExpression
postfixExpression
postfixExpressionLeft
atom
literal
IntegerConstant
DeciamlConstant

A not unusual compiler design in the olden (small memory) days, is recursive
descent
for statements, and operator precedence for expressions. If you do
expressions with
recursive descent, you get, as you note, 20 levels on the stack. Operator
precedence
avoids all those levels, and works well for expressions for most languages.

Others have explained how parser generators do it.

Otherwise, the design of small-memory compilers is a lost art.

Kaz Kylheku

2023-04-27 00:28:55 UTC

Permalink

ISO C specifies its grammar in an obtusely verbose way, similar to the
above. The reason is that it makes the grammar unambiguous, meaning
that it does not require any annotation or external remarks about
ambiguities having to be resolved by a certain associativity or
precedence.

If you have some grammar processor which literally follows all those
productions, doing all of their reductions, then it will likely
not perform well.

I suspect even a LALR(1) compiled grammar will have an issue
there; I'm not aware that Yacc-like parser generators do any
collapsing of cascaded reductions which do nothing but { $$ = $1; }.

Here is a concrete example. Attached below is over 200 lines of
y.output from a Yacc implementation (Bison). In that y.output
we can trace the actions of the state machine and see what happens
when a the simple token LIT is reduced to an expression.

In State 0, when the parser sees a LIT, it will shift the token
and go to State 1. There, a reduction takes place via Rule 9,
and so we have a parser_expression on the top of the stack.

Then, back to State 0, where now have a primary_expression,
and so we go to state 8.

State 8, similarly to State 1, performs a reduction:
primary expression to mult_expression.

Back to State 0, where we have to dispatch a similar
thing again for multi_expression, then additive_expression,
then expression and finally top.

Thanks for raising this issue; I suspect it's a real performance
problem. Not only is a factored-out grammar hard to read but the
performance is bad due to useless back and forth transitions between
superfluous states to propagate reductions.

Above-referenced material follows:

Grammar

0 $accept: top $end

1 top: expression ';'
2 | expression ';' expression

3 expression: additive_expression

4 additive_expression: mult_expression
5 | mult_expression '+' additive_expression

6 mult_expression: primary_expression
7 | mult_expression '*' primary_expression

8 primary_expression: '(' expression ')'
9 | LIT
10 | IDENT

Terminals, with rules where they appear

$end (0) 0
'(' (40) 8
')' (41) 8
'*' (42) 7
'+' (43) 5
';' (59) 1 2
error (256)
LIT (258) 9
IDENT (259) 10

Nonterminals, with rules where they appear

$accept (10)
on left: 0
top (11)
on left: 1 2, on right: 0
expression (12)
on left: 3, on right: 1 2 8
additive_expression (13)
on left: 4 5, on right: 3 5
mult_expression (14)
on left: 6 7, on right: 4 5 7
primary_expression (15)
on left: 8 9 10, on right: 6 7

state 0

0 $accept: . top $end

LIT shift, and go to state 1
IDENT shift, and go to state 2
'(' shift, and go to state 3

top go to state 4
expression go to state 5
additive_expression go to state 6
mult_expression go to state 7
primary_expression go to state 8

state 1

9 primary_expression: LIT .

$default reduce using rule 9 (primary_expression)

state 2

10 primary_expression: IDENT .

$default reduce using rule 10 (primary_expression)

state 3

8 primary_expression: '(' . expression ')'

LIT shift, and go to state 1
IDENT shift, and go to state 2
'(' shift, and go to state 3

expression go to state 9
additive_expression go to state 6
mult_expression go to state 7
primary_expression go to state 8

state 4

0 $accept: top . $end

$end shift, and go to state 10

state 5

1 top: expression . ';'
2 | expression . ';' expression

';' shift, and go to state 11

state 6

3 expression: additive_expression .

$default reduce using rule 3 (expression)

state 7

4 additive_expression: mult_expression .
5 | mult_expression . '+' additive_expression
7 mult_expression: mult_expression . '*' primary_expression

'+' shift, and go to state 12
'*' shift, and go to state 13

$default reduce using rule 4 (additive_expression)

state 8

6 mult_expression: primary_expression .

$default reduce using rule 6 (mult_expression)

state 9

8 primary_expression: '(' expression . ')'

')' shift, and go to state 14

state 10

0 $accept: top $end .

$default accept

state 11

1 top: expression ';' .
2 | expression ';' . expression

LIT shift, and go to state 1
IDENT shift, and go to state 2
'(' shift, and go to state 3

$default reduce using rule 1 (top)

expression go to state 15
additive_expression go to state 6
mult_expression go to state 7
primary_expression go to state 8

state 12

5 additive_expression: mult_expression '+' . additive_expression

LIT shift, and go to state 1
IDENT shift, and go to state 2
'(' shift, and go to state 3

additive_expression go to state 16
mult_expression go to state 7
primary_expression go to state 8

state 13

7 mult_expression: mult_expression '*' . primary_expression

LIT shift, and go to state 1
IDENT shift, and go to state 2
'(' shift, and go to state 3

primary_expression go to state 17

state 14

8 primary_expression: '(' expression ')' .

$default reduce using rule 8 (primary_expression)

state 15

2 top: expression ';' expression .

$default reduce using rule 2 (top)

state 16

5 additive_expression: mult_expression '+' additive_expression .

$default reduce using rule 5 (additive_expression)

state 17

7 mult_expression: mult_expression '*' primary_expression .

$default reduce using rule 7 (mult_expression)

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca

Anton Ertl

2023-04-27 06:17:51 UTC

Permalink

Post by Kaz Kylheku
ISO C specifies its grammar in an obtusely verbose way, similar to the
above. The reason is that it makes the grammar unambiguous, meaning
that it does not require any annotation or external remarks about
ambiguities having to be resolved by a certain associativity or
precedence.

They were not that keen on eliminating ambiguity. The classical
ambiguous grammar rule is there:

|selection-statement:
| if ( expression ) statement
| if ( expression ) statement else statement
| switch ( expression ) statement

They resolve that ambiguity in prose:

|An else is associated with the lexically nearest preceding if that is
|allowed by the syntax.

C is not alone in this approach.

As for why they chose to deal with expressions through BNF, my guess
is that it allowed them to split the expression syntax into different
sections, each with their own piece of the grammar, rather than having
a section on the syntax that presents an ambuguous BNF and resolves
the ambiguity in prose (and by necessity has to cover the whole
expression syntax), and then having sections on the semantics of the
various expression sub-syntaxes.

- anton
--
M. Anton Ertl
***@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/