Discussion:
Interesting paper on regex NFA matching
John R Levine
2024-01-26 03:10:56 UTC
Permalink
The easiest way to implement regular expressions is to turn them into
NFAs, but that has well known problems that if you feed hostile input
to the NFAs, they can take exponential time and it's a way to DDoS the
server. If you translate the NFA to a DFA it runs in linear time, but
if the regex is ambiguous, as many are, the DFA may match something
different from the NFA.

This paper on the Arxiv preprint server proposes some rather complex
tweaks to NFA regex matching to fix many performance issues while giving
the same answers as before.

https://arxiv.org/abs/2401.12639

Regards,
John Levine, ***@taugh.com, Taughannock Networks, Trumansburg NY
Please consider the environment before reading this e-mail. https://jl.ly
c***@spitfire.i.gajendra.net
2024-01-26 04:15:30 UTC
Permalink
Post by John R Levine
The easiest way to implement regular expressions is to turn them into
NFAs, but that has well known problems that if you feed hostile input
to the NFAs, they can take exponential time and it's a way to DDoS the
server. If you translate the NFA to a DFA it runs in linear time, but
if the regex is ambiguous, as many are, the DFA may match something
different from the NFA.
NFAs won't take exponential time if implemented carefully,
though they may be superlinear. Still, we're talking quadratic,
not exponential.

Backtracking implementations can take exponential time, but
those aren't all NFAs.

Converting an NFA to a DFA could be exponential, however.

https://swtch.com/~rsc/regexp/regexp1.html
Post by John R Levine
This paper on the Arxiv preprint server proposes some rather complex
tweaks to NFA regex matching to fix many performance issues while giving
the same answers as before.
https://arxiv.org/abs/2401.12639
If I'm reading the abstract right, they modify a backtracking
implementation so that they can efficiently match some extended
regexes. Note, however, that such so-called "extended" regexes
are not really regular expressions in the formal sense; that is,
they do not describe regular languages, but rather languages
beyond those expressible by a DFA. Regardless, this work is
interesting, since AFAIK the speedup to linear time for this
class of regexes is novel. Thanks for passing it on.

- Dan C.
Kaz Kylheku
2024-01-26 04:16:38 UTC
Permalink
Post by John R Levine
The easiest way to implement regular expressions is to turn them into
NFAs, but that has well known problems that if you feed hostile input
to the NFAs, they can take exponential time and it's a way to DDoS the
server.
Do you have a reference for this? I seem to recall that NFA simulation
is quadratic or cubic at worst.

The exponential problem that immediately comes to mind in connection
with regexes is that the NFA to DFA subset construction can experience
an exponential state explosion.

The NFA simulator won't experience that exponential explosion because
it only generates the subsets that occur from the specific input
that it's given; it's not reasoning about all possible inputs.
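
A rough sketch in Python of what I mean (the NFA is hand-built and the
names are made up for illustration), using the textbook "k-th symbol from
the end is an 'a'" language, whose minimal DFA needs on the order of 2^k
states:

  k = 20

  # NFA for (a|b)*a(a|b){k-1}: state 0 loops on both symbols and, on an
  # 'a', also guesses that this is the distinguished position; states
  # 1..k-1 just advance; state k accepts.
  def step(states, ch):
      nxt = set()
      for s in states:
          if s == 0:
              nxt.add(0)
              if ch == 'a':
                  nxt.add(1)
          elif s < k:
              nxt.add(s + 1)
      return nxt

  def simulate(text):
      states = {0}
      sets_built = set()
      for ch in text:
          states = step(states, ch)
          sets_built.add(frozenset(states))
      # Work is O(len(text) * k), and only the state sets this particular
      # input produced were ever built -- nowhere near the ~2^k states the
      # full subset construction would have to generate.
      return (k in states), len(sets_built)

  print(simulate("ab" * 30 + "a" + "b" * (k - 1)))  # (True, <at most len(text) sets>)
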
Post by John R Levine
if the regex is ambiguous, as many are, the DFA may match something
different from the NFA.
Is that also documented somewhere? DFA matching something different
from the NFA has to be a bug. The algorithms we have in this area
are correct.

What does it mean for a regex to be ambiguous?
Post by John R Levine
This paper on the Arxiv preprint server proposes some rather complex
tweaks to NFA regex matching to fix many performance issues while giving
the same answers as before.
https://arxiv.org/abs/2401.12639
I'm going to look at this in more detail, but I can tell you from the
abstract that they are taking on backtracking regex implementations that
suffer from "catastrophic backtracking", which informs me of the general
direction.

Backtracking regexes are used to implement the hacky features found
in PCRE (Perl-compatible regexes).

They have almost nothing to do with building NFA graphs and simulating
them, or going to DFA from that.

Backtracking follows and interprets the abstract syntax tree of a regex.

One known problem with backtracking regex implementations is
that they don't treat branches, R1|R2|R3|...|RN, correctly.

Backtrackers iterate over the branches left to right and try them.
They stop when they encounter a match for the first one, and declare
that the entire disjunction matched. That would be fine if we are just
asking whether there is a match, but it's incorrect if we are matching a
regex prefix (or looking for a substring) and need the longest match.
For instance "a|aa" matches "aardvark", but only the first "a"!

If we simulate the a|aa NFA graph, what we do is feed every character of
aardvark until we hit an impasse (the machine informs us that no more
input can be accepted). At that point, we consider the most recent
character position where we had been in an accepted state. That position
will be the second "a" of "aardvark". After we feed the first "a"
to the simulator, it will report that it's in an acceptance state. But
we keep going; we feed the second "a", and for that it also reports
acceptance. When we feed the "r" the machine can indicate that it has
hit a dead end: that character cannot be matched. It is at this point
that we can take a shortcut and stop feeding more characters. The last
time we had an acceptance state was just after we fed the second "a",
and so we know that the regex matched the "aa" part of the word.
The NFA doesn't care whether the original syntax is a|aa or aa|a: the
operator is commutative.
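
Here is a minimal sketch of that simulation in Python (the transition
table for a|aa is written out by hand, and the function name is invented
for illustration):

  # State 0 is the start; 'a' from 0 reaches 1 (end of the "a" branch) and
  # 2 (middle of the "aa" branch); 'a' from 2 reaches 3.  1 and 3 accept.
  delta = {(0, 'a'): {1, 2}, (2, 'a'): {3}}
  accepting = {1, 3}

  def longest_prefix_match(text, start=0):
      states = {start}
      last_accept = 0 if start in accepting else -1
      for i, ch in enumerate(text, 1):
          states = set().union(*(delta.get((s, ch), set()) for s in states))
          if not states:
              break                 # impasse: no more input can be accepted
          if states & accepting:
              last_accept = i       # most recent accepting position
      return None if last_accept < 0 else text[:last_accept]

  print(longest_prefix_match("aardvark"))  # "aa" -- same table for a|aa and aa|a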

A DFA implementation will not exhibit the behavior of searching through
the branches of a|aa left to right for a match, because it follows from
the NFA. Because of this kind of situation, the backtracking regex
faces difficulties in translation to DFA; the difficulty is caused by
the incorrect, loose, hacky interpretation of the backtracking
implementation, which pulls stunts like making A|B non-commutative.

NFA simulation isn't backtracking; it doesn't hit those kinds of
explosions inherent in backtracking.

Backtracking is not required for implementing classic regexes, like
POSIX EREs and BREs and whatnot, only Perl-like cruft.

Backtracking gets into trouble due to searches occurring at multiple
levels of recursion. It's like any recursive search with an exploding
space. It's clear memoization can help under the right conditions,
whereby it can recognize redundancy in the space, to avoid processing
subspaces that are identical to ones already processed ("overlapping
subproblems").

(The NFA to DFA construction involves a kind of memoization. How so?
The algorithm generates subsets of NFA states for imaginary inputs,
and turns those sets of NFA states into single DFA states. As it
probes into that space, it has to recognize subsets it has already
generated before! Each such seen-before subset represents an
existing DFA state that has already been created. The transitions
should go to that state rather than making a new one which will
turn out identical.)
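
As a toy sketch of that memoization in Python (epsilon moves omitted for
brevity; the dfa_id table is the "seen this subset before" memo):

  from collections import deque

  def subset_construction(nfa, start, alphabet):
      # nfa[state][symbol] -> set of successor NFA states
      start_set = frozenset([start])
      dfa_id = {start_set: 0}           # memo: NFA state set -> DFA state number
      dfa_delta = {}
      work = deque([start_set])
      while work:
          cur = work.popleft()
          for ch in alphabet:
              tgt = frozenset().union(*(nfa.get(s, {}).get(ch, set())
                                        for s in cur))
              if tgt not in dfa_id:     # genuinely new subset: allocate a DFA state
                  dfa_id[tgt] = len(dfa_id)
                  work.append(tgt)
              dfa_delta[(dfa_id[cur], ch)] = dfa_id[tgt]
      return dfa_delta, dfa_id

  # The a|aa machine again: DFA states 0={0}, 1={1,2}, 2={3}, 3={} (dead).
  print(subset_construction({0: {'a': {1, 2}}, 2: {'a': {3}}}, 0, "a"))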

(Also, caching approaches are possible in simulating an NFA:
instead of computing a DFA ahead of time, we lazily generate some of
it as we go through the input, and cache some of it. Say up to 32
states, with some LRU replacement or whatever. It's like "JIT" for
NFAs. We get some of the speed benefit of DFA, but the state
explosion of the DFA is thwarted.)
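
A sketch of that caching idea (the cache size and the toy tables are
arbitrary; functools.lru_cache stands in for the replacement policy):

  from functools import lru_cache

  nfa = {0: {'a': {1, 2}}, 2: {'a': {3}}}   # the a|aa machine once more
  accepting = frozenset({1, 3})

  @lru_cache(maxsize=32)                    # bounded cache of DFA-style transitions
  def goto(state_set, ch):
      return frozenset().union(*(nfa.get(s, {}).get(ch, set())
                                 for s in state_set))

  def full_match(text):
      states = frozenset([0])
      for ch in text:
          states = goto(states, ch)         # computed lazily, reused when cached
          if not states:
              return False
      return bool(states & accepting)

  print(full_match("aa"), full_match("ab"))   # True False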

(I personally am not so interested in backtracking regexes, and they are
not directly applicable to compilers. You'd never want to design a
lexical analyzer that requires "lookaround" or whatnot in order to
recognize tokens. It's nice they are generating Arxiv papers for
someone; no genuine intellectual inquiry is invalid; good luck! It's
not applicable to the "Programming Languages" category it has been filed
under. I will skim through it, though.)

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
[You're right, it's polynomial, but that's bad enough.

Ambiguous means there are multiple subpatterns that can match the
same string, e.g. (foo|[a-z]+) matching "foo". Often your code cares
which one it matches even though that's arguably a bug. -John]
Christopher F Clark
2024-01-27 10:47:33 UTC
Permalink
Post by Kaz Kylheku
(I personally am not so interested in backtracking regexes, and they are
not directly applicable to compilers. You'd never want to design a
lexical analyzer that requires "lookaround" or whatnot in order to
recognize tokens. It's nice they are generating Arxiv papers for
someone; no genuine intellectual inquiry is invalid; good luck! It's
not applicable to the "Programming Languages" category it has been filed
under. I will skim through it, though.)
You are mistaken that they are not applicable to programming languages.

Even Pascal [originally] had cases where lookahead and backtracking
are required to properly tokenize the input.

The well-known case is "1.": if followed by another "." it is two tokens,
and if followed by a digit it is part of a floating-point number.
I don't recall what it should be if followed by any other character.

In any case, parser generators like ANTLR use backtracking to great
effect (and yes, we "borrowed" that idea in Yacc++) and so did the
entire PEG (parsing expression grammar) community, using memoization
to give linear time performance in many cases.

Thus, to me, it is no surprise to see this reintegrated into the
PCRE world.

Moreover, even if it is of little interest [mistakenly in my opinion] to the
"programming language" community, it is of definite interest in the
"real world:" PCRE regular expressions are both used (and abused)
in many applications, including SNORT where linear performance and
preventing DDOS attacks would be of prime importance. (While at Intel,
I had the honor of working with the Hyperscan (Sensory Networks) team
and improving PCRE performance for SNORT was a major "selling point"

And "greedy" (longest match), "non-greedy" (shortest match) and other
variations like "lexically first in the grammar" matching all have their
applications. Saying that only one of them is correct is ignoring the
needs of actual users.

And, yes, we had discussions about "automata theory correct" interpretations
of what certain PCRE expressions meant and how even "canonical"
implementations had various "bugs" that made their meaning ambiguous.

By the way, to give credit where due, we "borrowed" the idea of
"tunnel automatons" from Josef Grosch to implement controlled
backtracking in Yacc++. It lets one do subset construction (i.e.
the LR(0) algorithm) for items, breaking out for "syntactic predicates"
exactly as needed.

--
******************************************************************************
Chris Clark email: ***@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------
Kaz Kylheku
2024-01-27 20:20:29 UTC
Permalink
Post by Christopher F Clark
Post by Kaz Kylheku
(I personally am not so interested in backtracking regexes, and they are
not directly applicable to compilers. You'd never want to design a
lexical analyzer that requires "lookaround" or whatnot in order to
recognize tokens. It's nice they are generating Arxiv papers for
someone; no genuine intellectual inquiry is invalid; good luck! It's
not applicable to the "Programming Languages" category it has been filed
under. I will skim through it, though.)
You are mistaken that they are not applicable to programming languages.
Even Pascal [originally] had cases where lookahead and backtracking
are required to properly tokenize the input.
The well-known case is "1.": if followed by another "." it is two tokens,
and if followed by a digit it is part of a floating-point number.
I don't recall what it should be if followed by any other character.
I think that case can be handled by technology like Lex trailing
contexts, which doesn't require backtracking.

{digits}/"."        {
    // ... set up semantic value ..
    return INTEGER;
}

{digits}"."{digits} {
    // ... set up semantic value ..
    return FLOAT;
}

Another possibility is to recognize {digits}"." as one lexeme and call
unput('.'). That's a minor form of backtracking that doesn't require us
to do anything funny to the regex implementation.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca

[Patterns with alternatives and trailing contexts do have to back up
and the flex manual warns it's quite expensive. See the
PERFORMANCE CONSIDERATIONS and DEFICIENCIES parts of the man page. -John]
Christopher F Clark
2024-01-29 08:39:58 UTC
Permalink
Post by Kaz Kylheku
[Patterns with alternatives and trailing contexts do have to back up
and the flex manual warns it's quite expensive. See the
PERFORMANCE CONSIDERATIONS and DEFICIENCIES parts of the man page. -John]
I still want to write an extended reply to this, where
Post by Kaz Kylheku
I think that case can be handled by technology like Lex trailing
contexts, which doesn't require backtracking.
{digits}/"." {
// ... set up semantic value ..
return INTEGER;
}
{digits}"."{digits} {
// ... set up semantic value ..
return FLOAT;
}
Another possibility is to recognize {digits}"." as one lexeme and call
unput('.'). That's a minor form of backtracking that doesn't require us
to do anything funny to the regex implementation.
You may not recognize it as backtracking, but it is reading two characters
past the end of the integer token to disambiguate it. And while one character
of lookahead is generally "free", two characters require more extensive
buffering and mechanisms to reset the state.

Thus, what LEX and flex are implementing is not strictly a pure
regular expression matcher, but one with a slight extension to what
can be recognized. While it is an extremely useful and appropriate
convention for the application area, it isn't strictly within the
bounds. The engine which runs such machines needs to have the hooks to
implement that extension.

One can prove this by using the hierarchy of language families where
regular expressions are a subset of LL(1) grammars which are a subset
of LR(1) grammars. If you try to write an LL(1) grammar for that, you
will find yourself unable to construct a proper FIRST set which disambiguates
the two rules. Similarly, if one is attempting to make an LR grammar for it
using an algorithm that first builds an LR(0) machine, you will see that it
has a shift/reduce conflict on the "." character and one which cannot be
disambiguated by one character of lookahead.

Thus, what is implemented in LEX is essentially "longest, lexically first"
matching, where the lexer saves the last state where a token was accepted
and continues attempting to match, and on failure backs up to the last
recognized match (and then applies the rule which is lexically first that
caused that match).

The only difference between this and what PEG (and PCRE) mechanisms
do is that they implement "lexical first, longest" match. That reorders the
set of matches that are considered. As the paper you dismissed notes,
that can be compared to depth-first, versus breadth-first search of the
tree of potential matches. Both strategies have their advantages and
areas where they are the proper choice and places where they are the
wrong choice.

If one finds oneself writing "trailing context" or "unput" in one's lexical
specification, one is using lookahead. And sometimes in lexers one
uses lookahead without even noticing it, as LEX doesn't emit warnings
when it does so. (We had to disable such warnings in our implementation
to get it to approximate LEX behavior.)

-------

One final note is that language classes are determined by writing a
grammar which accepts (not lexes or parses) the given language.
That is, one can write a grammar that doesn't disambiguate the cases
(i.e. combines them into one case), but then one hasn't correctly
tokenized it. I don't know of an automata theory (or compiler) book
that mentions this distinction, much less highlights its implications,
so I will forgive not knowing it unless one makes an argument that
depends upon it, which you did.

--
******************************************************************************
Chris Clark email: ***@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------
Kaz Kylheku
2024-01-29 19:28:02 UTC
Permalink
Post by Christopher F Clark
Post by Kaz Kylheku
[Patterns with alternatives and trailing contexts do have to back up
and the flex manual warns it's quite expensive. See the
PERFORMANCE CONSIDERATIONS and DEFICIENCIES parts of the man page. -John]
I still want to write an extended reply to this, where
Post by Kaz Kylheku
I think that case can be handled by technology like Lex trailing
contexts, which doesn't require backtracking.
{digits}/"." {
// ... set up semantic value ..
return INTEGER;
}
{digits}"."{digits} {
// ... set up semantic value ..
return FLOAT;
}
Another possibility is to recognize {digits}"." as one lexeme and call
unput('.'). That's a minor form of backtracking that doesn't require us
to do anything funny to the regex implementation.
You may not recognize it as backtracking, but it is reading two characters
past the end of the integer token to disambiguate it.
But some backtracking is required for recognizing the longest matching
prefixes of the input, even if we use nothing but basic regex features!

If we consider the trivial pattern (a*b)+:

Given the input "aaabaaabaaac", the prefix which constitutes a token
is "aaabaaab".

In order to extract that token, the machine will have to consume
symbols all the way through the "aaac" which is excluded from the
lexeme. Only at the "c" character does it hit the situation that no
transition is available, and so the matching terminates. Then it has to
return to the most recent position in the input where the automaton had
been in an acceptance state.
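
A sketch of that retrace, with the machine for (a*b)+ written out by hand
as a DFA (it is small enough: the language is just "strings of a's and b's
that end in b"; names invented for illustration):

  delta = {(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 0, (1, 'b'): 1}
  accepting = {1}                       # state 1: we have just read a 'b'

  def longest_token(text):
      state, last_accept = 0, -1
      for i, ch in enumerate(text, 1):
          if (state, ch) not in delta:
              break                     # stuck: stop consuming input
          state = delta[(state, ch)]
          if state in accepting:
              last_accept = i           # most recent accepting position
      return None if last_accept < 0 else text[:last_accept]

  print(longest_token("aaabaaabaaac"))  # "aaabaaab"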

It is not the same as the backtracking of a graph/tree search algorithm,
which returns to prior points to try alternatives, due to doing a
depth-first search.

At the point where we back up in the input, we are not trying
different branches of a depth-first-search. We have completed
a depth-first-search!

One input retrace at the conclusion of a successful search is not
qualitatively the same situation as multiple, cascaded retraces
integrated into the search algorithm.
Post by Christopher F Clark
Thus, what is implemented in LEX is essentially "longest, lexically first"
matching, where the lexer saves the last state where a token was accepted
and continues attempting to match, and on failure backs up to the last
recognized match (and then applies the rule which is lexically first that
caused that match).
Yes; that is required in any regex matching function that searches
for a prefix, or infix.

The fundamental automaton's job is to recognize strings: does a given
input string belong to the language (set of strings) described by that
automaton/regex.

In lexical analysis, we solve the problem of finding the longest prefix
of the input which belongs to that set of strings; the obvious way to do
that is to match as far as we can and then return to the most recent
good position.
Post by Christopher F Clark
The only difference between this and what PEG (and PCRE) mechanisms
do is that they implement "lexical first, longest" match.
I think when you say "lexical first", you might be referring to the
rules themselves: rules that are lexically earlier in the grammar
"win" over later rules.

(In Lex, that still matters. If two rules match tokens of exactly
the same length, the lexically earlier rule gets the token.)

This "lexical first, longest" is why the order of R1|R2|..|RN matters
in some so-called regex implementations.

Unfortunately, as far as regexes go, that is broken.

It may still be a notation describing a regular set, at least in some
cases, but the regex language for doing that requires the disjunction
operator to be commutative.

This is just historically so, but with justification.

The people who invented regular expressions decades ago, as far back as
the 1960's and earlier, and wrote papers about them decided on a
commutative disjunction, for good reasons.

A commutative disjunction is necessary in order to have a direct
relationship between sets and regular expressions.

If R1 describes a set of strings S1, and R2 describes a set of strings
S2, then R1|R2 describes S1 ∪ S2.

Because ∪ commutes, we want | to commute, since it means the same thing.

If | doesn't commute, then it doesn't correspond to the ∪ operator. The
relationship to sets has been severed or muddled.
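
This is easy to check mechanically for whole-string matching, where branch
order cannot leak into the answer. A quick sketch (Python's re is a
backtracking engine, but under fullmatch the accepted set is the same
either way; the helper name and the test regexes are arbitrary):

  import re
  from itertools import product

  def language(regex, max_len=4, alphabet="ab"):
      return {"".join(w)
              for n in range(max_len + 1)
              for w in product(alphabet, repeat=n)
              if re.fullmatch(regex, "".join(w))}

  r1, r2 = "a*b", "ba*"
  assert language(r1 + "|" + r2) == language(r1) | language(r2)  # | acts as set union
  assert language(r1 + "|" + r2) == language(r2 + "|" + r1)      # and it commutes
  print("ok")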

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Christopher F Clark
2024-02-01 02:09:48 UTC
Permalink
Post by Kaz Kylheku
But some backtracking is required for recognizing the longest matching
prefixes of the input, even if we use nothing but basic regex features!
The key word in your sentence is "prefixes". The recognition problem
is not about prefixes, but about complete strings. This is why my answer
mentioned that the distinction between lexing (and parsing) and recognition
is not covered in any automata books I have read.

The definition of regular expressions is that when the finite automaton
halts (at the end of the string or when there are no transitions out
of the state that it is in) it is either in an accepting state or not. That's
the language recognition problem. There is no mention of "backing
up to the last state that accepted." And for language recognition
that is important, otherwise, any random trailing garbage after a
legal string would still be considered a match because a prefix
matched.

Thus, given your regular expression (a*b)+,

b, ab, aab, aaab, ..., bb, abb, aabb, ..., bab, baab, ..., aaaabaaab
are all in the language, but aaabaaabaaac does not match that
regular expression. A prefix of it does, but not the entire string.
And, thus the string aaabaaabaaac does not satisfy the language
recognition problem.

Now, when you ask whether a string which is a prefix of aaabaaabaaac
matches the regular expression, you get two strings which qualify: aaab
and aaabaaab. But that's not the *language recognition* problem.
That's a different problem, one relevant to lexing.

However, either string satisfies the question of whether that string is in the language.
Thus, both strings aaab and aaabaaab satisfy the language recognition
problem. Only one satisfies your definition of the "correct" lexing problem
because you are fixated on a specific extension of how regular expressions
need to work. It is not a bad extension, but it is an extension of the
meaning of regular expressions in the lexing context which is not the
context of language recognition which is where regular expressions
were defined.

Notably, if one focuses on the language recognition problem, the one
where prefixes don't count, only whole strings do, then whether the |
operator is commutative or not, does not matter. It only matters when
one can terminate (and accept) without considering the entire string.

Still, there are cases where one might prefer to get the shortest answer,
rather than the longest answer. Or cases, where one might prefer a
shorter answer when two possible answers match. Those situations
require a more complex language. A language which regular expressions
are only a subset of. This class of languages is what the PCRE (and PEG)
extensions allow one to access. It is possible to express every regular
expression in those languages, trivially so with PCRE and with only a
small amount of care with PEGs (whose / operator is non-commutative
and was chosen that way so as not to be confused with the | operator). And,
moreover, you can get the same results for prefix matching if that is
what you desire.

However, you can also get results where other matches are the ones
that are "preferred" and returned. The user simply has more choices.
Now, if one hasn't chosen wisely and one has carelessly written an
"extended" regular expression which returns a different preferred choice
than the one which one expected, that is the fault of the writer not the
formalism. People are just as capable of writing buggy regular expressions
as they are of writing buggy programs, and that's true whether using
the extensions or not. Maybe the user really wanted "(a+b)+" or the
more complex "(a*b)+a*" or some other variant.

And while there are buggy implementations of PCRE regular expressions,
that doesn't mean the formalism itself is bad.

--
******************************************************************************
Chris Clark email: ***@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------
Ev Drikos
2024-02-04 09:14:00 UTC
Permalink
Post by Christopher F Clark
You are mistaken that they are not applicable to programming languages.
Hello,

I agree; another example is the FORMAT keyword in Fortran. No matter how
the scanner is implemented, generated or hand-coded, the FORMAT
statement requires forward scanning if the parser is deterministic,
in which case error messages can IMHO be descriptive, as shown e.g. here:
https://gist.github.com/drikosev/56202b8d928c22da33aebc7cc0e16193

Related examples with the lookahead operator '+:' can be found here:
https://github.com/drikosev/Fortran/blob/master/OMP_Fortran_Scanner.txt

Hope I'm not off topic.

Ev. Drikos
