hedgehog


It’s always funny to hear yourself speak 🙂

http://www.youtube.com/watch?v=PbmVSGTra30

Another guest post by Roy Fox, Sentrigo’s Head of Security Research.

Here is a list of things worth considering when using regular expressions. Some of the tips are Hedgehog related.

Use predefined character sets

You should usually prefer using predefined character sets, such as \d, to explicit ones, such as [0-9]. Some character sets provide locale and Unicode support, for example \w is not equivalent to [a-zA-Z0-9_], since it also matches non-Latin letters and numbers.

In addition, using predefined character sets may improve the performance of your regular expressions.

Avoid unnecessary group capturing

To improve performance, avoid grouping, i.e. using parenthesis, as much as possible. Nevertheless, sometimes you may have to group an expression for some reason, but not capture the group for backreferencing, for example in the expression:

(ab)+

In this case, a significant performance gain can be achieved by using non-capturing grouping:

(?:ab)+

Avoid multiple and nested repetitions

The matching algorithm uses backtracking: on failure, it goes back to try other matching possibilities for parts of the expression it already matched. Multiple or nested repetitions may create a multitude of equivalent matching possibilities, so that trying all of them is redundantly slow.

For example, the pattern

^.*password

is essentially equivalent to

^.*.*password

However, in the former, a match for password is tried once in any starting position, while in the latter, if password fails, it’s tried again and again. This is because the wildcards match any splitting of the prefix into 2 parts. The situation is even worse with

^(.+)*password

where every partitioning of the prefix is tried.

Use atomic matching

Often, backtracking is unnecessary. For example, when the expression

create\s*table

is matched against the string

create         user

it’s futile to try to match \s* against any but the longest sequence of whitespaces. You can avoid this backtracking by using the equivalent

create(?>\s*)table

This is atomic non-capturing grouping. When a match has been found for the group (\s*, in this case), but subsequently not for the remainder of the expression (table, in this case), this signals the regular expression engine not to backtrack, that is, not to try another match for \s*.

It should be noted that the repetition quantifiers *, +, and ? have a short notation for their atomic versions: *+, ++ and ?+, respectively, so that (?>\s*) is equivalent to \s*+.

While this may greatly improve performance, note that atomic matching may alter the meaning of the expression, and care should be taken not to harm its validity. For example, .*+ should never be used, because it matches the remainder of the string and nothing else.

Case insensitivity

Hedgehog compiles regular expressions with the flag (?i), which mean that upper/lower case is ignored. For example, the expression

SeLeCt

will match the string

seLEct

If case sensitive matching is required, the expression (or sub-expression) can be preceded with (?-i).

Matching newlines

Hedgehog also compiles regular expressions with the flag (?s), which mean that a wildcard (.) can match a newline character. This is intended primarily to allow the match for .* to span multiple lines. When this is not the required behavior, (?-s) can be used to make a wildcard not match newline characters.

Matching newline characters explicitly can be done using \r and/or \n. Note, however, that different databases have different standard newline symbols, and most accept non-standard ones. It is best to avoid this issue by simply matching any sequence of whitespaces.

Plan for matching failure

For every regular expression, there are the set of strings it matches and of those which it fails to match. In the Hedgehog scenario, as in many others, only a tiny fraction of all strings will match, and most will fail. This makes performance much more important for failing strings than for matched ones.

What this means is that you should try to compose regular expressions which fails as soon as possible for as many of these failing strings as possible. Suppose, for example, you have 2 expressions, expr1 and expr2. expr1 does exactly what you want, but is very complex and slow. expr2 is much faster, but matches, in addition to all the strings it should, half of the strings it shouldn’t. It may be best to use the expression

(?=expr2)expr1

or something equivalent, despite the additional cost of the lookahead. This is because it saves attempting to match the expensive expr1 on half of the failing strings.