Let’s do it with Batteries

January 27, 2009

Or, OCaml is a scripting language, too.

Note: These extracts use the latest version of Batteries, currently available from the git repository. Barring any accident, this version should be made public within the next few days.

A few days ago, when writing some code for OCaml Batteries Included, I realized that, to properly embed Camomile’s Unicode transcoding module, I would need to manually write 500+ boring lines, all of them looking like:

| `ascii -> Encoding.of_name "ASCII"

The idea behind that pattern matching was to define a type-safe phantom type for text encodings. Upon installation, Camomile generates a directory containing about 540 files, one per text encoding, and it seemed like a good idea to rely upon something less fragile than a string name.
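
To make the goal concrete, here is a minimal sketch of what the generated code could look like, cut down to two encodings. The type name encoding and the function name_of_encoding are hypothetical, and the function returns a plain string instead of calling Camomile’s Encoding.of_name, so that the sketch runs without Camomile installed.

```ocaml
(* Hypothetical excerpt of the generated code: two encodings instead of ~540. *)
type encoding = [ `ascii | `utf_8 ]

(* Map each constructor to the name that the transcoding library expects.
   The real code would wrap each string in Encoding.of_name. *)
let name_of_encoding : encoding -> string = function
  | `ascii -> "ASCII"
  | `utf_8 -> "UTF-8"

let () = print_endline (name_of_encoding `ascii)
```

With such a type, a misspelt constructor is rejected by the compiler, whereas a misspelt string name would only fail at run time.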

Of course, writing this pattern-matching manually was out of the question: it was boring, error-prone, and while Batteries deserves sacrifices, it doesn’t quite deserve that level of mind-numbing activity. The alternative was to generate both the list of constructors and the pattern-matching code from the contents of the directory. I could have done it with some scripting language, but that sounded like a good chance to test-drive the numerous new functions of the String module of Batteries (73 functions, against 28 in the Base Library).

The main program

The structure of the program is easy: read the contents of a directory. For each file, do some treatment on the file name and print the result:

open Shell
foreach (files_of argv.(1)) do_something

Here, foreach is the same function as iter but with its arguments reversed. It’s sometimes much more readable. Instead of reading the contents of a directory with Shell.files_of, we could just as well have traversed the command-line arguments with args, or read the lines of standard input using IO.lines_of stdin.
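
For readers without Batteries at hand, the idea can be sketched with standard lists. This is an assumption-laden stand-in, not the Batteries definition: the real foreach works on enumerations, while this one works on lists, and the file names are made up.

```ocaml
(* foreach is iter with its arguments swapped: collection first, body last. *)
let foreach xs f = List.iter f xs

let () =
  foreach [ "ASCII.mar"; "UTF-8.mar" ] (fun file ->
    (* The body of the loop comes last, as in a "for each" statement. *)
    print_endline file)
```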

Actually, we could just as well generalize to a (possibly empty) set of directories. For this purpose, we just need to map our function files_of over the enumeration of command-line arguments. This yields an enumeration of enumerations, which we turn into a flat enumeration with flatten. In my mind, that’s somewhat nicer and more readable than nested loops.

Our main program now looks like:

open Shell, Enum
foreach (flatten (map files_of (args ()))) do_something

Or, for those of us who prefer operators to parentheses:

open Shell, Enum
(foreach **> flatten **> map files_of **> args ()) do_something
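
The two versions above read the same if **> is plain function application that associates to the right. The following sketch shows this with the standard library only. The definition of **> given here is an assumption that matches its use above, files_of is a list-based stand-in for Shell.files_of, and all_files and all_files' are hypothetical names.

```ocaml
(* Assumed definition: function application. Operators that start with "**"
   associate to the right in OCaml, so f **> g **> x means f (g x). *)
let ( **> ) f x = f x

(* Stand-in for Shell.files_of: the entries of one directory, as a list. *)
let files_of dir = Array.to_list (Sys.readdir dir)

(* All entries of all the given directories, as one flat list. *)
let all_files dirs = List.concat (List.map files_of dirs)

(* The same pipeline, written with the operator instead of parentheses. *)
let all_files' dirs = List.concat **> List.map files_of **> dirs
```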

String manipulation

It’s now time to take a file name and turn it into

  1. a nice constructor name,
  2. a file name without extension.

That second point is the easiest, so let’s start with it. We have a function Filename.chop_extension just for this purpose. So, if we were interested only in printing the list of files without their extension, we could define

let do_something x = print_endline (Filename.chop_extension x)
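
One detail worth knowing: Filename.chop_extension removes only the last extension, and raises Invalid_argument when the name has no extension at all. The sketch below checks both behaviours and wraps the call in a hypothetical helper, strip_extension, that leaves such names unchanged. The file names are made up for illustration.

```ocaml
(* Hypothetical helper: like Filename.chop_extension, but a name without
   extension is returned unchanged instead of raising Invalid_argument. *)
let strip_extension name =
  try Filename.chop_extension name with Invalid_argument _ -> name

let () =
  (* Only the last extension is removed. *)
  assert (Filename.chop_extension "ISO_8859-1.mar" = "ISO_8859-1");
  assert (Filename.chop_extension "archive.tar.gz" = "archive.tar");
  (* No extension: the helper returns the name as it is. *)
  assert (strip_extension "README" = "README")
```

Recent versions of the standard library also provide Filename.remove_extension, which behaves like this helper.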

The first point is slightly trickier, as we need to

  1. remove the extension from the file name (done)
  2. prepend character ` (trivial)
  3. replace any illicit character with _ (slightly more annoying: I know that the illicit characters which actually appear in my list of files include :, -, (, ) and whitespace, but I’d rather not check manually which other characters may turn out to be problematic)
  4. prepend something before names which start with a digit, as digits cannot appear as the first character of an OCaml constructor (a tad annoying, too)
  5. make everything lowercase, just because it’s nicer (trivial).

Let’s deal with the third item first, as it’s bound to be central. Replacing characters could be done with regular expressions, something I dislike, or with function String.map. The latter is nicer, type-safer, and it has a counterpart Rope.map for Unicode, if we ever need one. Now, functions Char.is_letter and Char.is_digit will help us determine which characters are safe. Using them together, we obtain the following function:

open Char
let replace s = String.map (fun c -> if is_letter c || is_digit c then c else '_') s
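
Here is the same function with the standard library only, together with its effect on two made-up file names. The predicates is_letter and is_digit are hand-written ASCII stand-ins for the Batteries functions of the same name, which may well accept more characters.

```ocaml
(* Hand-written ASCII stand-ins for the Batteries predicates used above. *)
let is_letter c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
let is_digit c = c >= '0' && c <= '9'

(* Keep letters and digits, turn every other character into '_'. *)
let replace s =
  String.map (fun c -> if is_letter c || is_digit c then c else '_') s

let () =
  assert (replace "ISO-8859-1:1987" = "ISO_8859_1_1987");
  assert (replace "IBM (850)" = "IBM__850_")
```

Note that distinct file names can collapse to the same result: "a-b" and "a:b" both become "a_b".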

Let’s solve the fourth item on our list. We need to check the first character of a string and to determine whether it’s a digit. Well, we already know how to do this. Let’s call our prefix p:

let clean_digit p s = if is_digit s.[0] then p^s else s
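
One caveat: s.[0] raises Invalid_argument on an empty string. File names are never empty in practice, but a defensive variant costs one extra test. The sketch below uses a hand-written is_digit as a stand-in for Char.is_digit.

```ocaml
(* ASCII stand-in for Char.is_digit. *)
let is_digit c = c >= '0' && c <= '9'

(* Prepend prefix p when s starts with a digit; the test on the empty
   string comes first so that s.[0] is never evaluated out of bounds. *)
let clean_digit p s = if s <> "" && is_digit s.[0] then p ^ s else s

let () =
  assert (clean_digit "codemap_" "8859_1" = "codemap_8859_1");
  assert (clean_digit "codemap_" "ascii" = "ascii");
  assert (clean_digit "codemap_" "" = "")
```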

If we chain up everything, we obtain

let constructor p s = "`" ^ (if is_digit r.[0] then p^r else r)
    where         r = String.lowercase (String.map (fun c -> if is_letter c || is_digit c then c else '_') s)

I like this where syntax.
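
The where keyword comes from the Batteries syntax extension. Without it, the same function can be written with let ... in, at the price of reading the definitions before their use. This is a sketch with the standard library only: is_letter and is_digit are ASCII stand-ins for the Batteries predicates, and String.lowercase_ascii is the current standard name for lowercasing.

```ocaml
let is_letter c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
let is_digit c = c >= '0' && c <= '9'

(* Same logic as above: sanitize, lowercase, prefix if needed, add a backquote. *)
let constructor p s =
  let r =
    String.lowercase_ascii
      (String.map (fun c -> if is_letter c || is_digit c then c else '_') s)
  in
  "`" ^ (if r <> "" && is_digit r.[0] then p ^ r else r)

let () =
  assert (constructor "codemap_" "ISO-8859-1" = "`iso_8859_1");
  assert (constructor "codemap_" "8859-1" = "`codemap_8859_1")
```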

Format

Now that we have both our strings, we just need to be able to combine and print them. For this purpose, Printf is probably the most concise tool. Here, we can just write

let print s1 s2 = Printf.printf " | %s -> %S\n" s1 s2

We could parameterize upon the format used by printf and we’re bound to do this sooner or later, but let’s keep it simple for now.
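
As a rough idea of what that parameterization could look like, here is a sketch in which the format becomes an argument. The names line_format, render, match_case and reverse_case are hypothetical, and the sketch uses sprintf instead of printf so that the result can be inspected. Note that %S prints its argument as an OCaml string literal, quotes included.

```ocaml
(* A format that consumes two strings and produces a string. *)
type line_format = (string -> string -> string, unit, string) format

let render (fmt : line_format) s1 s2 = Printf.sprintf fmt s1 s2

(* From constructor to encoding name, as in the function print above. *)
let match_case : line_format = " | %s -> %S\n"

(* The other direction: from encoding name to constructor. *)
let reverse_case : line_format = " | %S -> %s\n"

let () =
  print_string (render match_case "`ascii" "ASCII");
  print_string (render reverse_case "ASCII" "`ascii")
```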

The complete program

open Shell, Enum

foreach (flatten **> map files_of **> args ()) do_something
  where do_something s =
   let name = Filename.chop_extension s in Printf.printf " | %s -> %S\n" c name
     where c = "`" ^ (if Char.is_digit r.[0] then "codemap_"^r else r)
     where r = lowercase (String.map (fun c -> if Char.is_letter c || Char.is_digit c then c else '_') name)
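
For comparison, here is a rough equivalent that relies on the standard library only. It is a sketch under several assumptions: lists replace enumerations, is_letter and is_digit are ASCII stand-ins for the Batteries predicates, arm_of_file is a hypothetical name, arguments that are not directories are silently skipped, and entries are sorted because Sys.readdir guarantees no order.

```ocaml
let is_letter c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
let is_digit c = c >= '0' && c <= '9'

(* One match arm for one file name, e.g. "ASCII.mar". *)
let arm_of_file file =
  let name = Filename.remove_extension file in
  let r =
    String.lowercase_ascii
      (String.map (fun c -> if is_letter c || is_digit c then c else '_') name)
  in
  let c = "`" ^ (if r <> "" && is_digit r.[0] then "codemap_" ^ r else r) in
  Printf.sprintf " | %s -> %S\n" c name

let () =
  (* Command-line arguments, without the program name. *)
  let dirs =
    match Array.to_list Sys.argv with
    | [] -> []
    | _program :: rest -> rest
  in
  dirs
  |> List.filter (fun d -> Sys.file_exists d && Sys.is_directory d)
  |> List.map (fun d -> Array.to_list (Sys.readdir d))
  |> List.concat
  |> List.sort compare
  |> List.iter (fun file -> print_string (arm_of_file file))
```

It is noticeably longer than the Batteries version, mostly because of the helpers that Batteries provides out of the box.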

I don’t know about you, but I find this pretty nice for a type-safe language. I’m sure it would have been possible to write something shorter in Perl or awk, and suggestions on how to improve this are welcome, but I’m rather happy. And, once again, we’re not trying to beat Python, Perl or awk in concision, just to do something comparably good, because we already beat them by far in speed and safety.

So, what do you think?

You are currently reading Let’s do it with Batteries at Il y a du thé renversé au bord de la table.
