Thursday, September 09, 2004

Java Tokenizers

I constantly need to parse strings and streams, and I keep bumping into weirdness with Java's default tokenizers. Here's a short list of my pet peeves with these classes. There are others, but these are the biggest.

1. If you're going to have two classes called tokenizers (StringTokenizer, StreamTokenizer), it would be nice if they had the same semantics. Beyond the fact that they both break up a sequence of characters into tokens, these classes have nothing in common.
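To make the mismatch concrete, here is a small sketch that feeds the same input to both classes. The class name TokenizerComparison and the "WORD:"/"NUMBER:"/"CHAR:" labels are made up for illustration; only the tokenizer calls themselves come from the JDK.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class: runs the same input through both tokenizers.
class TokenizerComparison {

    // StringTokenizer: splits on delimiter characters; every token is a plain String.
    static List<String> withStringTokenizer(String input) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input); // default delimiters are whitespace
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // StreamTokenizer: classifies each token; results are read from ttype, sval and nval.
    static List<String> withStreamTokenizer(String input) {
        List<String> tokens = new ArrayList<>();
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    tokens.add("WORD:" + st.sval);
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    tokens.add("NUMBER:" + st.nval); // numbers always come back as double
                } else {
                    tokens.add("CHAR:" + (char) st.ttype); // ordinary character
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(withStringTokenizer("x = 42")); // [x, =, 42]
        System.out.println(withStreamTokenizer("x = 42")); // [WORD:x, CHAR:=, NUMBER:42.0]
    }
}
```

One class is driven by hasMoreTokens()/nextToken() and hands back strings; the other is driven by an int return code and public fields, and it silently turns "42" into a double.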

2. In Java 1.1, when they introduced all the Reader IO classes, why didn't they make a ReaderTokenizer instead of adding a Reader constructor to StreamTokenizer and deprecating the InputStream constructor? They added BufferedReader, FilterReader, etc. to replace BufferedInputStream and FilterInputStream, so why stop there?

3. The operation of StreamTokenizer is not well documented. In order to use the class you really need to understand its implementation model.

I've never seen the code, but from playing with the API it appears the class keeps an array in the background that mirrors the character set. Each slot in that array has attributes that describe its corresponding character. The attributes determine whether the character is whitespace, a word character, a string delimiter, etc. The class then provides you with a bunch of methods that let you change the attributes in the slots.
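A toy version of that model might look something like the sketch below. To be clear, this is not the JDK source: CharClassTable and its constants are hypothetical names, and it stores a single attribute per slot for simplicity, which may well differ from what the real class does.

```java
// Toy model only: a hypothetical class, not the JDK's StreamTokenizer source.
class CharClassTable {
    static final int ORDINARY = 0;
    static final int WHITESPACE = 1;
    static final int WORD = 2;

    // One slot per character code; each slot holds that character's attribute.
    private final int[] slots = new int[256];

    // Mark every character in [low, hi] as a word character.
    void wordChars(int low, int hi) {
        setRange(low, hi, WORD);
    }

    // Mark every character in [low, hi] as whitespace.
    void whitespaceChars(int low, int hi) {
        setRange(low, hi, WHITESPACE);
    }

    // Reset a single character to "ordinary" (no special meaning).
    void ordinaryChar(int c) {
        setRange(c, c, ORDINARY);
    }

    // Look up the attribute for a character; anything outside the table is ordinary.
    int typeOf(int c) {
        return (c >= 0 && c < slots.length) ? slots[c] : ORDINARY;
    }

    private void setRange(int low, int hi, int type) {
        for (int c = Math.max(low, 0); c <= hi && c < slots.length; c++) {
            slots[c] = type;
        }
    }
}
```

Seen this way, each configuration method is just a write into the table, so calling one of them again for another range only touches the slots in that range.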

However, if you don't understand the underlying model I just described, methods like the following are hard to understand:

public void wordChars(int low, int hi)
Specifies that all characters c in the range low <= c <= high
are word constituents. A word token consists of a word constituent
followed by zero or more word constituents or number constituents.

There's nothing in the doc or the method signature that lets you know you can call this repeatedly to set different ranges of characters as word characters, with each call adding to the ranges already set rather than replacing them.
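Here's a quick sketch of calling wordChars more than once on the same tokenizer. The class WordCharsDemo and its words() helper are hypothetical, and the choice of '_' and '@' as extra word characters is just an example.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: collects the word tokens StreamTokenizer finds.
class WordCharsDemo {

    // Returns the TT_WORD tokens in input, after marking each extra character
    // as a word character with its own wordChars call.
    static List<String> words(String input, char... extraWordChars) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        for (char c : extraWordChars) {
            st.wordChars(c, c); // one call per range; earlier ranges stay in effect
        }
        List<String> words = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    words.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("foo_bar user@host"));           // [foo, bar, user, host]
        System.out.println(words("foo_bar user@host", '_', '@')); // [foo_bar, user@host]
    }
}
```

With the defaults, '_' and '@' are ordinary characters that break words apart. After two separate wordChars calls, both are treated as part of a word, alongside the letters the constructor already set up.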

4. The Javadoc for StringTokenizer calls it a 'legacy class' and recommends people use String.split(String regex) instead. This advice is fine for certain uses of StringTokenizer, but for others a more appropriate recommendation would be to use a StreamTokenizer wrapped around a StringReader.
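As one example of where split falls short, consider input with a quoted field that contains a space. The sketch below is only an illustration: QuotedFieldDemo and fields() are hypothetical names, and it relies on the tokenizer's default setup, where the double quote is a quote character.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: tokenizes a String by wrapping it in a StringReader.
class QuotedFieldDemo {

    static List<String> fields(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        List<String> fields = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    fields.add(String.valueOf(st.nval));
                } else if (st.ttype == StreamTokenizer.TT_WORD || st.ttype == '"') {
                    fields.add(st.sval); // for a quoted string, sval is the body without quotes
                } else {
                    fields.add(String.valueOf((char) st.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "name \"John Smith\" 42";
        System.out.println(fields(line));               // [name, John Smith, 42.0]
        System.out.println(line.split("\\s+").length);  // 4: the quoted field is cut in two
    }
}
```

Splitting on whitespace cuts the quoted field in half, while the StreamTokenizer returns it as one token with the quotes stripped.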
