It’s always funny to hear yourself speak 🙂
Entries tagged with “hedgehog”.
Wed 4 Apr 2012
Thu 8 Oct 2009
Another guest post by Roy Fox, Sentrigo’s Head of Security Research.
Here is a list of things worth considering when using regular expressions. Some of the tips are Hedgehog related.
Use predefined character sets
You should usually prefer predefined character sets, such as \d, to explicit ones, such as [0-9]. Some predefined sets also provide locale and Unicode support: for example, in Unicode-aware engines or modes, \w is not equivalent to [a-zA-Z0-9_], because it also matches non-Latin letters and digits.
In addition, using predefined character sets may improve the performance of your regular expressions.
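As a rough illustration of the difference, here is a minimal sketch using Python's `re` module, where `\d` and `\w` are Unicode-aware by default for text patterns. The sample strings and the `full` helper are assumptions made up for this sketch, and other engines may need an explicit flag to get the same behavior.

```python
import re

# Sample strings are illustrative assumptions, not taken from the post.
ARABIC_INDIC_DIGITS = "\u0663\u0664"  # the digits 3 and 4 in Arabic-Indic script
CYRILLIC_WORD = "\u0442\u0430\u0431\u043b\u0438\u0446\u0430"  # a word in Cyrillic letters


def full(pattern: str, text: str, flags: int = 0) -> bool:
    """Return True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text, flags) is not None


# Predefined sets are Unicode-aware for str patterns in Python 3 ...
print(full(r"\d+", ARABIC_INDIC_DIGITS))
print(full(r"\w+", CYRILLIC_WORD))
# ... while explicit ranges cover only the characters spelled out.
print(full(r"[0-9]+", ARABIC_INDIC_DIGITS))
print(full(r"[a-zA-Z0-9_]+", CYRILLIC_WORD))
# re.ASCII restricts \d and \w to their ASCII ranges.
print(full(r"\w+", CYRILLIC_WORD, re.ASCII))
```

Whether the wider match is desirable depends on the data. If only ASCII digits are valid input, the explicit range (or an ASCII-only mode) is the safer choice.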
Avoid unnecessary group capturing
To improve performance, avoid grouping (that is, using parentheses) wherever you can. Sometimes you do need to group part of an expression, for instance to apply alternation or a quantifier to it, without needing to capture the group for backreferencing, for example in the expression:
In this case, a significant performance gain can be achieved by using non-capturing grouping, written (?:...):
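A small sketch of the two forms in Python's `re` module may help. The pattern and the sample statement are hypothetical, chosen only to show an alternation that needs grouping but no backreference.

```python
import re

# Hypothetical pattern: the alternation needs grouping, but nothing refers back to it.
capturing = re.compile(r"(insert|update|delete)\s+\w+")
non_capturing = re.compile(r"(?:insert|update|delete)\s+\w+")

statement = "delete from accounts"

# The capturing form records one group; the non-capturing form records none.
print(capturing.groups, non_capturing.groups)

# Both forms match exactly the same text.
print(capturing.match(statement).group(0) == non_capturing.match(statement).group(0))
```

How much time the non-capturing form saves depends on the engine and the pattern, so it is worth measuring on representative input.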
Avoid multiple and nested repetitions
The matching algorithm uses backtracking: on failure, it goes back to try other matching possibilities for parts of the expression it already matched. Multiple or nested repetitions can create a multitude of equivalent ways to match the same text, and trying all of them is needlessly slow.
For example, the pattern
is essentially equivalent to
However, in the former, a match for password is tried once at each starting position, while in the latter, if password fails, it’s tried again and again. This is because the wildcards match any splitting of the prefix into two parts. The situation is even worse with
where every partitioning of the prefix is tried.
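The effect can be observed with a small timing sketch in Python. The three patterns below are assumptions chosen to fit the description above (one repetition, two repetitions, and a nested repetition), and `seconds_to_fail` is a helper invented for this sketch. Actual timings depend on the engine and the machine, so none are quoted here.

```python
import re
import time

# Illustrative patterns (assumptions for this sketch): all three accept the same strings.
SINGLE = re.compile(r".*password")
DOUBLE = re.compile(r".*.*password")
NESTED = re.compile(r"(?:.*)*password")


def seconds_to_fail(pattern, text, repeats=5):
    """Best-of-`repeats` wall-clock time for one failed match attempt on `text`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        result = pattern.match(text)
        best = min(best, time.perf_counter() - start)
        if result is not None:
            raise ValueError("expected a failing input")
    return best


# A short failing input keeps the nested case quick; its cost grows rapidly with length.
failing = "a" * 16
for name, pattern in (("single", SINGLE), ("double", DOUBLE), ("nested", NESTED)):
    print(name, seconds_to_fail(pattern, failing))
```

Increasing the length of `failing` a few characters at a time shows how quickly the nested form degrades compared with the other two.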
Use atomic matching
Often, backtracking is unnecessary. For example, when the expression
is matched against the string
it’s futile to try to match \s* against anything but the longest run of whitespace. You can avoid this backtracking by using the equivalent
This is atomic non-capturing grouping. When a match has been found for the group (\s*, in this case), but subsequently not for the remainder of the expression (table, in this case), this signals the regular expression engine not to backtrack, that is, not to try another match for \s*.
Note that the repetition quantifiers *, +, and ? have a short notation for their atomic versions, known as possessive quantifiers: *+, ++, and ?+, respectively. Thus (?>\s*) is equivalent to \s*+.
While this may greatly improve performance, note that atomic matching may alter the meaning of the expression, and care should be taken not to harm its validity. For example, .*+ should never be used, because it consumes the remainder of the string and never gives any of it back, so any later part of the expression that needs to consume a character can never match.
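The following sketch shows both points in Python, which accepts atomic groups and possessive quantifiers from version 3.11 onward. The code therefore checks the interpreter version before using them. The patterns, the sample strings, and the `search` helper are assumptions made for this illustration.

```python
import re
import sys

# Atomic groups and possessive quantifiers need Python 3.11 or newer.
ATOMIC_SUPPORTED = sys.version_info >= (3, 11)


def search(pattern: str, text: str):
    """Return the matched text, or None when there is no match."""
    found = re.search(pattern, text)
    return found.group(0) if found else None


# Ordinary greedy matching backtracks, so the trailing literal can still match.
print(search(r".*table", "drop table"))

if ATOMIC_SUPPORTED:
    # Here the atomic group and the possessive quantifier accept the same input as \s*.
    print(search(r"drop(?>\s*)table", "drop   table"))
    print(search(r"drop\s*+table", "drop   table"))
    # Changed meaning: .*+ keeps everything it consumed, so "table" can never match.
    print(search(r".*+table", "drop table"))
```

The last call illustrates the warning above: the greedy and possessive forms differ only by one character, yet the possessive one can never match.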
Hedgehog compiles regular expressions with the flag (?i), which means that upper/lower case is ignored. For example, the expression
will match the string
If case-sensitive matching is required, the expression (or sub-expression) can be preceded by (?-i).
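To see the flag in action outside Hedgehog, here is a minimal sketch in Python. It assumes `re.IGNORECASE` as a stand-in for the default (?i), and the pattern and table name are invented. Python's `re` does not accept a bare (?-i) in the middle of a pattern, so the sketch uses the scoped form (?-i:...), which is the closest equivalent there.

```python
import re

# re.IGNORECASE stands in for the default (?i); the pattern is a made-up example.
ignore_case = re.compile(r"drop\s+table", re.IGNORECASE)
print(bool(ignore_case.search("DROP   Table")))

# Python spells the switch as a scoped group: only the enclosed part is case sensitive.
mixed = re.compile(r"drop\s+table\s+(?-i:Accounts)", re.IGNORECASE)
print(bool(mixed.search("DROP TABLE Accounts")))
print(bool(mixed.search("DROP TABLE accounts")))
```

Only the table name is compared case-sensitively in `mixed`. The keywords still match in any case.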
Hedgehog also compiles regular expressions with the flag (?s), which means that a wildcard (.) can match a newline character. This is intended primarily to allow the match for .* to span multiple lines. When this is not the required behavior, (?-s) can be used to make a wildcard not match newline characters.
Matching newline characters explicitly can be done using \r and/or \n. Note, however, that different databases have different standard newline symbols, and most accept non-standard ones. It is best to avoid this issue by simply matching any sequence of whitespace characters, for example with \s+.
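The newline behavior can be sketched the same way in Python, assuming `re.IGNORECASE | re.DOTALL` as a stand-in for the two default flags described above. The statement is a hypothetical example, and (?-s:...) is Python's scoped spelling of the (?-s) switch.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL  # mirrors the (?i) and (?s) defaults described above

statement = "select *\r\nfrom accounts"  # hypothetical multi-line statement

# With (?s) semantics, .* can span the line break.
print(bool(re.search(r"select.*from", statement, FLAGS)))

# Switching it off locally with (?-s:...) keeps the wildcard on one line.
print(bool(re.search(r"select(?-s:.*)from", statement, FLAGS)))

# \s+ matches \n, \r\n, or any other run of whitespace between the tokens.
print(bool(re.search(r"\*\s+from", statement, FLAGS)))
```

The last pattern does not care which newline convention the input uses, which is the point of matching whitespace rather than \r or \n explicitly.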
Plan for matching failure
Every regular expression divides strings into two sets: those it matches and those it fails to match. In the Hedgehog scenario, as in many others, only a tiny fraction of all strings will match, and most will fail. This makes performance much more important for failing strings than for matching ones.
What this means is that you should try to compose regular expressions that fail as soon as possible for as many of these failing strings as possible. Suppose, for example, you have two expressions, expr1 and expr2. expr1 does exactly what you want, but is very complex and slow. expr2 is much faster, but in addition to all the strings it should match, it also matches half of the strings it shouldn’t. It may be best to use the expression
or something equivalent, despite the additional cost of the lookahead. This is because it saves attempting to match the expensive expr1 on half of the failing strings.
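One way to combine the two, sketched here in Python, is to put the cheap expression in a lookahead in front of the expensive one. `EXPR1`, `EXPR2`, and the sample strings are hypothetical stand-ins, and the combined form shown is an assumption about what such an expression could look like.

```python
import re

# Hypothetical stand-ins: EXPR2 is cheap but over-matches, EXPR1 is precise but costlier.
EXPR2 = r"\s*grant\b"
EXPR1 = r"\s*grant\s+(?:\w+\s*,\s*)*\w+\s+on\s+\w+\s+to\s+\w+"

precise = re.compile(EXPR1, re.IGNORECASE)
# The lookahead rejects most failing strings before EXPR1 is ever attempted.
guarded = re.compile(rf"(?={EXPR2})(?:{EXPR1})", re.IGNORECASE)

samples = [
    "grant select, insert on accounts to alice",
    "grant me a moment",
    "select * from accounts",
]
for text in samples:
    print(text, bool(precise.match(text)), bool(guarded.match(text)))
```

The guarded pattern accepts exactly what `EXPR1` accepts, as long as every string matched by `EXPR1` also passes `EXPR2`. Whether the extra lookahead pays off depends on how many failing strings it rejects early.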
Sun 17 Jun 2007
This is a personal as well as a commercial posting for me… Tomorrow is a special day in the short history of my company – after long months of R&D, we are finally releasing our product, named Hedgehog (there’s already some coverage in Dark Reading). These are very exciting times both for me personally and for the entire team at Sentrigo, who’ve made this possible through a lot of hard work and well-applied knowledge – I feel very lucky to have such a great team working with me.
Hedgehog is database security monitoring software that monitors DB transactions in real-time, and generates alerts based on a highly flexible set of policy rules. A light-weight sensor is installed on the database machine and monitors the shared memory. It doesn’t use redo logs or DBMS APIs – those would be too slow… The trick is to do it so that it doesn’t use up CPU power.
Hedgehog can be downloaded from Sentrigo’s website, and while it supports only Oracle for the moment, in the coming months we will release versions for MS SQL, DB2 and other major DBMSs. There are basically two versions – Hedgehog Standard, which is totally free to use, and Hedgehog Enterprise, which is not free but available for free evaluation. The differences are explained in some detail on the website, but basically it boils down to prevention capabilities and enterprise scalability and integration.
My sense is that we’re bringing something new to this space, and I’m anxious to see how this will be received. Feedback is of course welcome. Give it a try!