Wednesday, September 08, 2004

Regular Expressions

Regular expressions are without doubt the most cryptic set of commands a modern programmer is bound to face. We've built generations of languages that hide the complexities of machine code, but we still use these horrible jumbles of characters for text processing. Take, for example, this expression for parsing email addresses that I found at RegExLib.com.



^(([^<>;()[\]\\.,;:@"]+(\.[^<>()[\]\\.,;:@"]+)*)|(".+"))@(((
[a-z]([-a-z0-9]*[a-z0-9])?)|(#[0-9]+)|(\[((([01]?[0-9]{0,2})
|(2(([0-4][0-9])|(5[0-5]))))\.){3}(([01]?[0-9]{0,2})|(2(([0-
4][0-9])|(5[0-5]))))\]))\.)*(([a-z]([-a-z0-9]*[a-z0-9])?)|(#
[0-9]+)|(\[((([01]?[0-9]{0,2})|(2(([0-4][0-9])|(5[0-5]))))\.
){3}(([01]?[0-9]{0,2})|(2(([0-4][0-9])|(5[0-5]))))\]))$


Please, there's got to be a better way. I know Python supports verbose regexes, but even that's not much of an improvement. Why isn't there a higher-level regular expression language? If such a thing exists, I couldn't find it with some basic Google searches.
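
For reference, here is a minimal sketch of what Python's verbose mode looks like. It covers only the bracketed IP portion of the expression above, and the name BRACKETED_IP is made up for illustration. Whitespace and comments are allowed inside the pattern, but the operators themselves are still the same terse symbols.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows "#" comments.
# Only the bracketed IP part of the big expression is reproduced here.
BRACKETED_IP = re.compile(r"""
    ^\[                              # opening bracket
    (?:
        (?: [01]?[0-9]{0,2}          # 0 to 199, written as in the original
          | 2(?:[0-4][0-9]|5[0-5])   # 200 to 255
        )
        \.                           # separator dot
    ){3}                             # three groups of octet plus dot
    (?: [01]?[0-9]{0,2}
      | 2(?:[0-4][0-9]|5[0-5])
    )                                # final octet
    \]$                              # closing bracket
""", re.VERBOSE)

print(bool(BRACKETED_IP.match("[192.168.0.1]")))  # True
print(bool(BRACKETED_IP.match("[256.1.1.1]")))    # False
```
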


What I would like to see is a language that's more verbose, less cryptic and supports block reuse. I've invented a little syntax below and attempted to translate the above expression into it. I think it's a lot easier to read.



DEF SEQUENCE INNER_IP (
    (
        ZERO-OR-ONE(SET(01)),
        ZERO-TO-TWO(RANGE(0-9))
    )
    OR
    (
        '2', (
            (RANGE(0-4), RANGE(0-9))
            OR
            ('5', RANGE(0-5))
        )
    )
)

DEF SEQUENCE DOMAIN_NAME (
    RANGE(a-z),
    ZERO-OR-ONE (
        ZERO-OR-MORE ('-' OR RANGE(a-z,0-9)), RANGE(a-z,0-9)
    )
)

DEF SEQUENCE ADDRESS (
    SEQUENCE DOMAIN_NAME
    OR (
        '#', ONE-OR-MORE RANGE(0-9)
    )
    OR (
        '[', THREE-OF(SEQUENCE INNER_IP, '.'), SEQUENCE INNER_IP, ']'
    )
)



DEF SEQUENCE MAIL_ADDRESS (
    (
        ONE-OR-MORE NOT-IN SET(<>(\)[]\\.,;:@"),
        ZERO-OR-MORE('.', ONE-OR-MORE NOT-IN SET(<>(\)[]\\.,;:@"))
    )
    OR
    ('"', ONE-OR-MORE ANY, '"')
)


ANCHOR BEGIN
(
    SEQUENCE MAIL_ADDRESS
),
'@',
ZERO-OR-MORE (
    SEQUENCE ADDRESS, '.'
),
(
    SEQUENCE ADDRESS
)
ANCHOR END
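
Something close to this can be approximated in an ordinary programming language by building the pattern out of named pieces. The following is a minimal sketch in Python, not an existing library: the helper names (seq, either, zero_or_more and so on) are made up for illustration and loosely mirror the keywords above, and the blocks follow the structure of the RegExLib expression. Like that expression, it only accepts lowercase domain names.

```python
import re

# Hypothetical combinators; each returns a regex fragment as a string.
def seq(*parts):
    return "".join(parts)

def either(*alternatives):
    return "(?:" + "|".join(alternatives) + ")"

def zero_or_one(p):
    return f"(?:{p})?"

def zero_or_more(p):
    return f"(?:{p})*"

def one_or_more(p):
    return f"(?:{p})+"

def repeat(p, n):
    return f"(?:{p}){{{n}}}"

def lit(text):
    return re.escape(text)  # literal text, escaped for use in a pattern

# Named, reusable blocks.
INNER_IP = either(
    seq(zero_or_one("[01]"), "[0-9]{0,2}"),
    seq(lit("2"), either(seq("[0-4]", "[0-9]"), seq(lit("5"), "[0-5]"))),
)
DOMAIN_NAME = seq(
    "[a-z]", zero_or_one(seq(zero_or_more("[-a-z0-9]"), "[a-z0-9]"))
)
ADDRESS = either(
    DOMAIN_NAME,
    seq(lit("#"), one_or_more("[0-9]")),
    seq(lit("["), repeat(seq(INNER_IP, lit(".")), 3), INNER_IP, lit("]")),
)
LOCAL_CHARS = r'[^<>()\[\]\\.,;:@"]'
MAIL_ADDRESS = either(
    seq(one_or_more(LOCAL_CHARS),
        zero_or_more(seq(lit("."), one_or_more(LOCAL_CHARS)))),
    seq(lit('"'), one_or_more("."), lit('"')),
)

EMAIL = re.compile(
    seq("^", MAIL_ADDRESS, "@",
        zero_or_more(seq(ADDRESS, lit("."))), ADDRESS, "$")
)

print(bool(EMAIL.match("joe.bloggs@example.com")))  # True
print(bool(EMAIL.match("joe@[192.168.0.1]")))       # True
print(bool(EMAIL.match("joe@example..com")))        # False
```

The result is still an ordinary regular expression underneath, so this only addresses readability and block reuse, not the matching engine.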

