Friday, September 10, 2004

More Regular Expression Complaints

Bob responded to my previous comment on regular expressions with a post of his own. He didn't seem to care for the idea of a new syntax for writing regular expressions and instead focused on how you can use technique and tools to make the current syntax more usable.

I agree that there are better ways to write a complex regular expression, like the mail address parsing example, using common language features; simply breaking a complex expression like this into named blocks would go a long way toward making it understandable. However, my issues with regular expressions go deeper than what can be done with simple syntax substitution.

Take, for example, the way current regular expression syntax uses simple parentheses. In regular expressions, parentheses serve a dual purpose: they act as scope boundaries for operators like the quantifiers "?", "+", "*" and "{x}".

Quantified Expression: "^([a-f][0-9]-){3}[a-f][0-9]$"
Matches: "a8-b2-c3-f6"
Doesn't match: "a9-b3-c8-x8"
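
The quantified case is easy to check by hand. Here is a minimal sketch using Python's standard re module; the language choice and the variable names are mine, purely for illustration.

```python
import re

# The parenthesised block "[a-f][0-9]-" must repeat exactly three times.
quantified = re.compile(r"^([a-f][0-9]-){3}[a-f][0-9]$")

matches_good = quantified.match("a8-b2-c3-f6") is not None
# "x" falls outside the class [a-f], so the last pair cannot match.
matches_bad = quantified.match("a9-b3-c8-x8") is not None

print(matches_good, matches_bad)
```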

They also act as group delimiters that capture subsets of the matched string, for back references or for extraction by the caller.

Sub Expression: "^([a-f][0-9])-([a-f][0-9])-([a-f][0-9])-([a-f][0-9])$"
Matches: "a8-b2-c3-f6"
Group 1: "a8"
Group 2: "b2"
Group 3: "c3"
Group 4: "f6"

Back Reference Expression: "^(.{4})-\1$"
Matches: "aaaa-aaaa"
Group 1: "aaaa"
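
Both uses of grouping can be tried out in a few lines. This is a sketch in Python's re module, with each letter-digit pair in its own balanced set of parentheses; the variable names and the extra "aaaa-bbbb" input are my own additions for illustration.

```python
import re

# Four capturing groups, numbered left to right by opening parenthesis.
sub_expr = re.compile(r"^([a-f][0-9])-([a-f][0-9])-([a-f][0-9])-([a-f][0-9])$")
pairs = sub_expr.match("a8-b2-c3-f6").groups()

# \1 must repeat exactly the text that group 1 captured.
back_ref = re.compile(r"^(.{4})-\1$")
same = back_ref.match("aaaa-aaaa")
different = back_ref.match("aaaa-bbbb")  # second half differs from group 1

print(pairs, same.group(1), different)
```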

If you have a complex expression with a lot of parentheses, you either have to count them very carefully to determine the correct group number or experiment to find out which number matches which group. If the expression ever changes, or you need to add or remove parentheses, the group numbers shift, and both the back references and the code that uses the expression will break.
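
To make the breakage concrete, here is a small Python sketch. The two-pair expression and the input "a8-b2" are shortened, hypothetical versions of the examples above; the only change between the two patterns is one extra pair of parentheses.

```python
import re

text = "a8-b2"

# Original expression: group 2 is the second letter-digit pair.
before = re.match(r"^([a-f][0-9])-([a-f][0-9])$", text)

# Same expression with one extra pair of parentheses around the first letter.
# Every group that opens after the new one is renumbered.
after = re.match(r"^(([a-f])[0-9])-([a-f][0-9])$", text)

group2_before = before.group(2)
group2_after = after.group(2)

print(group2_before, group2_after)
```

Any caller that still asks for group 2 now silently gets the first letter instead of the second pair.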

A first step to fixing this would be named groups. Rather than having to use group numbers you would be able to address the group by name. Here's a short example of what I mean that recreates the examples above using the basic syntax I created in my previous post.

# Equivalent to "^([a-f][0-9])-([a-f][0-9])-([a-f][0-9])-([a-f][0-9])$"





# Equivalent to "^(.{4})-\1$"
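
For comparison, some existing engines already support a form of named groups. The sketch below uses Python's re module, whose (?P<name>...) and (?P=name) notation is Python's own and not the syntax from my previous post; the group names are hypothetical.

```python
import re

# Each pair gets a name instead of a position (names are hypothetical).
named = re.compile(
    r"^(?P<first>[a-f][0-9])-(?P<second>[a-f][0-9])-"
    r"(?P<third>[a-f][0-9])-(?P<fourth>[a-f][0-9])$"
)
m = named.match("a8-b2-c3-f6")
third = m.group("third")

# Named back reference: (?P=half) must repeat the text captured as "half".
named_back = re.compile(r"^(?P<half>.{4})-(?P=half)$")
half = named_back.match("aaaa-aaaa").group("half")

print(third, half)
```

Adding or removing other parentheses no longer changes how the caller reaches "third" or "half".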




Named groups as I described them above wouldn't fix one class of problem, however, and it's a problem standard regular expressions have too: you can't use quantifier blocks and groups simultaneously. Examine this expression:

Quantified Expression: "^(([a-f][0-9])-){3}([a-f][0-9])$"

The block "(([a-f][0-9])-)" must repeat three times. This block, however, contains the parenthetic group "([a-f][0-9])". It seems like you should be able to access the group from each repetition of the block, but you can't: a group inside a repeated block keeps only the text from its last repetition. Given the input, the actual results are:

Grouped quantified expression: "^(([a-f][0-9])-){3}([a-f][0-9])$"
Matches: "a8-b2-c3-f6"
Group 1: "c3-"
Group 2: "c3"
Group 3: "f6"
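
The same behaviour shows up in Python's re module, sketched below. Groups are numbered by the position of their opening parenthesis, so the outer repeated block is group 1 and the pair inside it is group 2. The second step, pulling the pairs out with a separate findall call, is just one possible workaround and my own assumption, not something the expression syntax gives you.

```python
import re

text = "a8-b2-c3-f6"

# The repeated groups keep only what they captured on the last repetition.
m = re.match(r"^(([a-f][0-9])-){3}([a-f][0-9])$", text)
groups = m.groups()

# Workaround: validate with the quantified expression first,
# then extract every pair with a second, simpler expression.
pairs = re.findall(r"[a-f][0-9]", text) if m else []

print(groups, pairs)
```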
