Let’s do it with Batteries
January 27, 2009
Or, OCaml is a scripting language, too.
Note: These extracts use the latest version of Batteries, currently available from the git repository. Barring any accident, this version should be made public within the next few days.
A few days ago, when writing some code for OCaml Batteries Included, I realized that, to properly embed Camomile’s Unicode transcoding module, I would need to manually write 500+ boring lines, all of them looking like:
| `ascii -> Encoding.of_name "ASCII"
The idea behind that pattern matching was to define a type-safe phantom type for text encodings. Upon installation, Camomile generates a directory containing about 540 files, one per text encoding, and it seemed like a good idea to rely upon something less fragile than a string name.
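To make the idea concrete, here is a minimal sketch using only the standard OCaml language, with no Camomile or Batteries. The three tags and the function name_of_encoding are hypothetical, chosen for illustration, and the function returns the name itself rather than calling Encoding.of_name.

```ocaml
(* Hypothetical subset of encodings; the real list has about 540 entries. *)
type encoding = [ `ascii | `latin1 | `utf_8 ]

(* Stand-in for the generated pattern matching: each tag maps to the
   string name that would otherwise be typed by hand. *)
let name_of_encoding : encoding -> string = function
  | `ascii -> "ASCII"
  | `latin1 -> "ISO-8859-1"
  | `utf_8 -> "UTF-8"

let () = print_endline (name_of_encoding `ascii)
```

With the closed type annotation, a misspelled tag such as `asci is rejected by the type checker, whereas a misspelled string name would only fail at run time.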
Of course, writing this pattern matching manually was out of the question: it was boring and error-prone, and while Batteries deserves sacrifices, it doesn’t quite deserve that level of mind-numbing activity. The alternative was to generate both the list of constructors and the pattern-matching code from the contents of the directory. I could have done it with some scripting language, but that sounded like a good chance to test-drive the numerous new functions of the String module of Batteries (73 functions, against 28 in the Base Library).
The main program
The structure of the program is easy: read the contents of a directory, then, for each file, process the file name and print the result:
open Shell
foreach (files_of argv.(1)) do_something
foreach is the same function as iter but with its arguments reversed, which is sometimes much more readable. Instead of reading the contents of a directory with Shell.files_of, we could just as well have traversed the command-line arguments with args, or read the lines of standard input.
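For readers without Batteries at hand, the same structure can be sketched with the standard library alone. This is an approximation, not the Batteries implementation: foreach and files_of below are hypothetical stand-ins built on List.iter and Sys.readdir, and they work on lists rather than enumerations.

```ocaml
(* foreach: List.iter with its arguments swapped, so the collection comes first. *)
let foreach (xs : 'a list) (f : 'a -> unit) : unit = List.iter f xs

(* files_of: hypothetical stand-in for Shell.files_of; returns the entries
   of a directory as a sorted list. *)
let files_of (dir : string) : string list =
  let entries = Sys.readdir dir in
  Array.sort compare entries;
  Array.to_list entries

(* Print every entry of the current directory, one per line. *)
let () = foreach (files_of ".") print_endline
```

Putting the collection first lets the (often long) function body come last, which is where the readability gain comes from.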
Actually, we could just as well generalize to a (possibly empty) set of directories. For this purpose, we just need to map our function files_of to the enumeration of command-line arguments. This yields an enumeration of enumerations, which we turn into a flat enumeration with flatten. In my mind, that’s somewhat nicer and more readable than nested loops.
Our main program now looks like:
open Shell, Enum
foreach (flatten (map files_of (args ()))) do_something
Or, for those of us who prefer operators to parentheses:
open Shell, Enum
(foreach **> flatten **> map files_of **> args ()) do_something
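The map-then-flatten step can also be sketched with standard-library lists. The names flatten_map, flatten_map' and files_of are hypothetical, and the standard operator @@ is used here only because it is right-associative application, which appears to be the role **> plays in the code above.

```ocaml
(* Nested-call style: flatten (map f xs). *)
let flatten_map f xs = List.concat (List.map f xs)

(* Operator style: @@ is right-associative application, so no nested
   parentheses are needed. *)
let flatten_map' f xs = List.concat @@ List.map f @@ xs

(* Hypothetical stand-in for Shell.files_of: the entries of a directory,
   or nothing at all if the path cannot be read. *)
let files_of dir =
  try Array.to_list (Sys.readdir dir) with Sys_error _ -> []

(* All files of a (possibly empty) set of directories. *)
let all_files dirs = flatten_map files_of dirs

let () =
  assert (flatten_map (fun n -> [n; n * 10]) [1; 2] = [1; 10; 2; 20]);
  assert (all_files [] = [])
```

An empty set of directories simply yields an empty list, with no special case in the code.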
It’s now time to take a file name and turn it into:
- a nice constructor name
- a file name without extension.
That second point is the easiest, so let’s start with it. We have a function Filename.chop_extension just for this purpose. So, if we were interested only in printing the list of files without their extension, we could define
let do_something x = print_endline (Filename.chop_extension x)
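One detail worth knowing: in the standard library, Filename.chop_extension raises Invalid_argument when the name has no extension. A small wrapper, sketched here under the hypothetical name base_name and with made-up file names, leaves such names unchanged instead.

```ocaml
(* Remove the extension if there is one; otherwise return the name as is. *)
let base_name file =
  try Filename.chop_extension file with Invalid_argument _ -> file

let () =
  assert (base_name "ASCII.mar" = "ASCII");
  assert (base_name "ISO_8859-1.mar" = "ISO_8859-1");
  assert (base_name "README" = "README")
```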
The first point is slightly trickier, as we need to:
- remove the extension from the file name (done)
- prepend character `
- replace any illicit character with _ (slightly more annoying: I know that the list of illicit characters which may actually appear in my list of files contains, among others, ) and whitespace, but I’d rather not go and check manually which other characters may turn out to be problematic)
- prepend something before names which start with a digit, as digits cannot appear as the first character of an OCaml constructor (a tad annoying, too)
- make everything lowercase, just because it’s nicer (trivial).
Let’s deal with the third item, as it’s bound to be central. Replacing characters could be done with regular expressions, something I dislike, or with function String.map. The latter is nicer, type-safer, and it has a counterpart Rope.map for Unicode, if we ever need one. Now, functions Char.is_letter and Char.is_digit will help us determine which characters are safe. Using them together, we obtain the following function:
open Char
let replace s = String.map (fun c -> if is_letter c || is_digit c then c else '_') s
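The same function can be tried with the standard library only. As a sketch: String.map is part of the standard String module, but is_letter and is_digit are not in the standard Char module, so they are defined by hand below and restricted to ASCII, which is an assumption about the file names involved.

```ocaml
(* ASCII-only character classes, standing in for Char.is_letter and
   Char.is_digit from Batteries. *)
let is_letter c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
let is_digit c = c >= '0' && c <= '9'

(* Keep letters and digits, turn everything else into an underscore. *)
let replace s =
  String.map (fun c -> if is_letter c || is_digit c then c else '_') s

let () =
  assert (replace "ISO-8859 (1)" = "ISO_8859__1_");
  assert (replace "ASCII" = "ASCII")
```

Because the test is "anything that is not a letter or a digit", there is no need to enumerate the illicit characters in advance.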
Let’s solve the fourth item on our list. We need to check the first character of a string and to determine whether it’s a digit. Well, we already know how to do this. Let’s call our prefix p:
let clean_digit p s = if is_digit s.[0] then p^s else s
If we chain up everything, we obtain
let constructor p s = "`" ^ (if is_digit r.[0] then p^r else r)
  where r = lowercase (String.map (fun c -> if is_letter c || is_digit c then c else '_') s)
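For comparison, the chained definition above can be written against the standard library with let ... in instead of where. This is a sketch under a few assumptions: the character tests are hand-written and ASCII-only, String.lowercase_ascii replaces lowercase, and an empty-string guard is added that the original does not have.

```ocaml
let is_letter c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
let is_digit c = c >= '0' && c <= '9'

(* Build a polymorphic-variant constructor name from a raw name. *)
let constructor prefix s =
  let r =
    String.lowercase_ascii
      (String.map (fun c -> if is_letter c || is_digit c then c else '_') s)
  in
  (* Guard against the empty string before looking at r.[0]. *)
  let r = if r <> "" && is_digit r.[0] then prefix ^ r else r in
  "`" ^ r

let () =
  assert (constructor "codemap_" "ISO-8859-1" = "`iso_8859_1");
  assert (constructor "codemap_" "437" = "`codemap_437")
```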
I like this
Now that we have both our strings, we just need to be able to combine and print them. For this purpose, Printf is probably the most concise tool. Here, we can just write
let print s1 s2 = Printf.printf " | %s -> %S\n" s1 s2
We could parameterize upon the format used by printf and we’re bound to do this sooner or later, but let’s keep it simple for now.
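As a rough illustration of both points, here is a sketch with the standard Printf module. It uses sprintf rather than printf so the result can be inspected, and the names case_line and line_with are hypothetical. Note that %S prints its argument as a quoted, escaped OCaml string literal.

```ocaml
(* Fixed format: one case of the generated pattern matching. *)
let case_line s1 s2 = Printf.sprintf " | %s -> %S\n" s1 s2

(* Parameterized format: the caller supplies any format taking two strings. *)
let line_with (fmt : (string -> string -> string, unit, string) format) s1 s2 =
  Printf.sprintf fmt s1 s2

let () =
  assert (case_line "`ascii" "ASCII" = " | `ascii -> \"ASCII\"\n");
  assert (line_with "%s,%s\n" "`ascii" "ASCII" = "`ascii,ASCII\n")
```

The format stays type-checked even when passed as an argument: a format expecting an integer would be rejected at compile time.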
The complete program
open Shell, Enum
foreach (flatten **> map files_of **> args ()) do_something
  where do_something s =
    let name = Filename.chop_extension s in
    Printf.printf " | %s -> %S\n" c name
      where c = "`" ^ (if Char.is_digit r.[0] then "codemap_"^r else r)
      where r = lowercase (String.map (fun c -> if Char.is_letter c || Char.is_digit c then c else '_') name)
I don’t know about you, but I find this pretty nice for a type-safe language. I’m sure it would have been possible to write something shorter in Perl or awk, and suggestions on how to improve this are welcome, but I’m rather happy. And, once again, we’re not trying to beat Python, Perl or awk in concision, just to do something comparably good, because we already beat them by far in speed and safety.
So, what do you think?