Let’s do it with Batteries
January 27, 2009 § 6 Comments
Or, OCaml is a scripting language, too.
Note: These extracts use the latest version of Batteries, currently available from the git repository. Barring any accident, this version should be made public within the next few days.
A few days ago, when writing some code for OCaml Batteries Included, I realized that, to properly embed Camomile’s Unicode transcoding module, I would need to manually write 500+ boring lines, all of them looking like:
| `ascii -> Encoding.of_name "ASCII"
The idea behind that pattern matching was to define a type-safe phantom type for text encodings. Upon installation, Camomile generates a directory containing about 540 files, one per text encoding, and it seemed like a good idea to rely upon something less fragile than a string name.
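To make the goal concrete, here is a minimal sketch of the shape the generated code would take. It shows only three cases, and it returns the encoding name as a plain string instead of calling Camomile’s Encoding.of_name, so that it stays self-contained. The names `encoding` and `name_of_encoding`, and the two encodings other than ASCII, are illustrative assumptions rather than part of Batteries.

```ocaml
(* Hypothetical sketch: a closed set of polymorphic variants standing in
   for encoding names. Only three illustrative cases are shown. *)
type encoding = [ `ascii | `iso_8859_1 | `utf_8 ]

(* Map each constructor back to a string name. In the real code, the
   right-hand side would be handed to Camomile's Encoding.of_name. *)
let name_of_encoding : encoding -> string = function
  | `ascii      -> "ASCII"
  | `iso_8859_1 -> "ISO-8859-1"
  | `utf_8      -> "UTF-8"
```

A misspelled constructor is then rejected by the type-checker, whereas a misspelled string name would only fail at run time.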
Of course, writing this pattern matching manually was out of the question: it was boring and error-prone, and while Batteries deserves sacrifices, it doesn’t quite deserve that level of mind-numbing activity. The alternative was to generate both the list of constructors and the pattern-matching code from the contents of the directory. I could have done it with some scripting language, but that sounded like a good chance to test-drive the numerous new functions of the String module of Batteries (73, against 28 in the Base Library).
The main program
The structure of the program is easy: read the contents of a directory. For each file, do some treatment on the file name and print the result:
open Shell
foreach (files_of argv.(1)) do_something
Here, foreach is the same function as iter but with its arguments reversed. It’s sometimes much more readable. Instead of reading the contents of a directory with Shell.files_of, we could just as well have traversed the command-line arguments with args, or read the lines of standard input using IO.lines_of stdin.
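For readers without Batteries at hand, the relation between foreach and iter can be sketched with the standard library alone. This is a list-based approximation, not the Batteries implementation: `foreach`, `files_of` and `print_dir` below are stand-ins defined on the spot, with `Sys.readdir` playing the role of Shell.files_of.

```ocaml
(* Plain-stdlib stand-in for foreach: List.iter with its arguments
   swapped, so the collection comes first and the action last. *)
let foreach (xs : 'a list) (f : 'a -> unit) : unit = List.iter f xs

(* Stand-in for Shell.files_of: the entries of a directory, as a list. *)
let files_of (dir : string) : string list = Array.to_list (Sys.readdir dir)

(* Print every entry of one directory, one per line. *)
let print_dir (dir : string) : unit = foreach (files_of dir) print_endline
```

Putting the collection first tends to read better when the action is a long anonymous function.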
Actually, we could just as well generalize to a (possibly empty) set of directories. For this purpose, we just need to map our function files_of over the enumeration of command-line arguments. This yields an enumeration of enumerations, which we turn into a flat enumeration with flatten. In my mind, that’s somewhat nicer and more readable than nested loops.
Our main program now looks like:
open Shell, Enum
foreach (flatten (map files_of (args ()))) do_something
Or, for those of us who prefer operators to parentheses:
open Shell, Enum
(foreach **> flatten **> map files_of **> args ()) do_something
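The map-then-flatten step and the operator can both be approximated in plain OCaml. The sketch below uses lists instead of enumerations and defines its own `( **> )`; it is an assumption about how such an operator can be written, not the Batteries source. The helper `all_files` is hypothetical and takes the directory-listing function as a parameter.

```ocaml
(* Right-associative application. Operators whose name starts with "**"
   are right-associative in OCaml, so  f **> g **> x  reads as  f (g x). *)
let ( **> ) f x = f x

(* List-based stand-in for "flatten (map files_of dirs)": list every
   directory, then concatenate the per-directory results in order. *)
let all_files (files_of : string -> string list) (dirs : string list) =
  List.flatten **> List.map files_of **> dirs
```

In current versions of OCaml, the standard operator `@@` plays the same role as this `( **> )`.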
String manipulation
It’s now time to take a file name and turn it into:
- a nice constructor name,
- a file name without extension.
That second point is the easiest, so let’s start with it. We have a function Filename.chop_extension just for this purpose. So, if we were interested only in printing the list of files without their extension, we could define
let do_something x = print_endline (Filename.chop_extension x)
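One caveat worth sketching: in the standard library, Filename.chop_extension raises Invalid_argument when the name has no extension at all. If the directory might contain such files, a small wrapper avoids the exception. The name `base_name` is hypothetical, and the file names used to try it out are made up.

```ocaml
(* Filename.chop_extension raises Invalid_argument when the name has no
   extension, so this hypothetical wrapper falls back to the name itself. *)
let base_name (file : string) : string =
  try Filename.chop_extension file with Invalid_argument _ -> file
```

With this, `base_name "ASCII.ext"` gives `"ASCII"` and `base_name "README"` gives `"README"` back unchanged.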
The first point is slightly trickier, as we need to
- remove the extension from the file name (done)
- prepend character ` (trivial)
- replace any illicit character with _ (slightly more annoying: I know that the list of illicit characters which may actually appear in my list of files contains ,-,(,) and whitespace, but I’d rather not go and check manually which other characters may turn out problematic)
- prepend something before names which start with a digit, as digits cannot appear as the first character of an OCaml constructor (a tad annoying, too)
- make everything lowercase, just because it’s nicer (trivial).
Let’s deal with the third item first, as it’s bound to be central. Replacing characters could be done with regular expressions, something I dislike, or with function String.map. The latter is nicer, type-safer, and it has a counterpart Rope.map for Unicode, if we ever need one. Now, functions Char.is_letter and Char.is_digit will help us determine which characters are safe. Using them together, we obtain the following function:
open Char
let replace s = String.map (fun c -> if is_letter c || is_digit c then c else '_') s
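The same function can be tried without Batteries. In this sketch, `is_letter` and `is_digit` are local stand-ins for Char.is_letter and Char.is_digit, and I assume they only need to recognise ASCII letters and digits.

```ocaml
(* Local stand-ins for Char.is_letter and Char.is_digit (ASCII only,
   which is an assumption of this sketch). *)
let is_letter c = ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
let is_digit c = '0' <= c && c <= '9'

(* Keep letters and digits, turn every other character into '_'. *)
let replace (s : string) : string =
  String.map (fun c -> if is_letter c || is_digit c then c else '_') s
```

Because the test is a whitelist, any character I forgot to think about is replaced as well, which is exactly why no manual check of the file names is needed.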
Let’s solve the fourth item on our list. We need to check the first character of a string and determine whether it’s a digit. Well, we already know how to do this. Let’s call our prefix p:
let clean_digit p s = if is_digit s.[0] then p^s else s
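Note that s.[0] raises an exception on an empty string. That should not happen with real file names, but a defensive variant is cheap. The sketch below is a plain-OCaml assumption of how that could look: `clean_digit_safe` is a hypothetical name, and `is_digit` is a local stand-in for Char.is_digit.

```ocaml
(* Local stand-in for Char.is_digit. *)
let is_digit c = '0' <= c && c <= '9'

(* Same idea as clean_digit, but an empty string is returned unchanged
   instead of raising on s.[0]. *)
let clean_digit_safe (p : string) (s : string) : string =
  if s <> "" && is_digit s.[0] then p ^ s else s
```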
If we chain up everything, we obtain
let constructor p s = "`" ^ (if is_digit r.[0] then p^r else r)
  where r = lowercase (String.map (fun c -> if is_letter c || is_digit c then c else '_') s)
I like this where syntax.
Format
Now that we have both our strings, we just need to be able to combine and print them. For this purpose, Printf is probably the most concise tool. Here, we can just write
let print s1 s2 = Printf.printf " | %s -> %S\n" s1 s2
We could parameterize upon the format used by printf and we’re bound to do this sooner or later, but let’s keep it simple for now.
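As a rough idea of what that parameterization could look like, here is a sketch in which the format is an ordinary argument. It uses Printf.sprintf rather than Printf.printf so that the result can be inspected, and the names `render`, `match_case` and `pair` are hypothetical.

```ocaml
(* The format becomes an argument. Its type forces any format passed in
   to consume exactly two strings. *)
let render (fmt : (string -> string -> string, unit, string) format)
    (s1 : string) (s2 : string) : string =
  Printf.sprintf fmt s1 s2

(* The format used in this post: %s prints the string as is, while %S
   prints it as a quoted OCaml string literal. *)
let match_case = render " | %s -> %S\n"

(* Another format over the same two strings. *)
let pair = render "%s: %s\n"
```

A format with the wrong number or type of arguments is rejected at compile time, so the flexibility costs nothing in safety.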
The complete program
open Shell, Enum
foreach (flatten **> map files_of **> args ()) do_something
where do_something s =
  let name = Filename.chop_extension s in
  Printf.printf " | %s -> %S\n" c name
  where c = "`" ^ (if Char.is_digit r.[0] then "codemap_"^r else r)
  where r = lowercase (String.map (fun c -> if Char.is_letter c || Char.is_digit c then c else '_') name)
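For comparison, here is a sketch of the same pipeline using only the standard library of a recent OCaml, without Batteries or its syntax extensions. It is an approximation under stated assumptions: character tests are ASCII-only, where becomes let ... in, Sys.readdir replaces files_of, and the helpers `constructor` and `line_of_file` are names of my own choosing. It also guards against empty names and missing extensions, which the version above does not.

```ocaml
(* Local stand-ins for Char.is_letter and Char.is_digit (ASCII only). *)
let is_letter c = ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
let is_digit c = '0' <= c && c <= '9'

(* Build the constructor name for a file name whose extension has
   already been removed. *)
let constructor (prefix : string) (name : string) : string =
  let r =
    String.lowercase_ascii
      (String.map (fun c -> if is_letter c || is_digit c then c else '_') name)
  in
  "`" ^ (if r <> "" && is_digit r.[0] then prefix ^ r else r)

(* One line of the generated pattern matching. *)
let line_of_file (file : string) : string =
  let name = try Filename.chop_extension file with Invalid_argument _ -> file in
  Printf.sprintf " | %s -> %S\n" (constructor "codemap_" name) name

let () =
  (* Every command-line argument is treated as a directory. *)
  let dirs = List.tl (Array.to_list Sys.argv) in
  List.iter
    (fun dir ->
      Array.iter (fun f -> print_string (line_of_file f)) (Sys.readdir dir))
    dirs
```

It is noticeably longer than the Batteries version, which is rather the point of this post.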
I don’t know about you, but I find this pretty nice for a type-safe language. I’m sure it would have been possible to make something shorter in Perl or awk, and suggestions on how to improve this are welcome, but I’m rather happy. And, once again, we’re not trying to beat Python, Perl or awk in concision, just to do something comparably good, because we already beat them by far in speed and safety.
So, what do you think?
Fantastic. I think the Enum module will be one of the major sources of awesomeness in Batteries, although it might take some time for OCaml programmers who haven’t used ExtLib to get used to it.
Is (**>) the same as (<|) ?
Thanks. Indeed, the more I use Enum, the more I like it.
Indeed. The only difference is that the associativity of ( **> ) is the right one. In a future version of Batteries, we will have ( <| ) with the right associativity but that’s not something we’ve tackled yet.
Delimited Overloading can set the associativity (and priority) of :
https://forge.ocamlcore.org/plugins/scmsvn/viewcvs.php/trunk/examples/pa_compos.ml?rev=276&root=pa-do&view=auto
[Oh, gosh, why is there no preview!]
I meant that Delimited Overloading can set the associativity (and priority) of <| just right. The link points to the example of |> (and makes it a no-op so it costs nothing). I’ll update it for <| when I have a few moments.
That’s exactly what we plan to use.
You may also want to take a look at shell-utils
http://git.dvide.com/pub/ocaml-shell-utils/tree/README.txt
I hacked this up to do shell scripting in OCaml. There is a glob function to take in lots of files. Most shell commands exist in two versions: one that takes a string, and one that takes a list. This makes it possible to apply lots of OCaml processing prior to calling shell commands, as you suggest.