Extrapol, part 1: from C to Effects

June 3, 2008 § Leave a comment

Here comes the long-promised description of Extrapol, my main ongoing research project. In a few words, our objective with Extrapol is to fill a hole in the current suite of tools built to ensure the security of systems. While there’s an ample amount of stuff designed to analyse the behaviour of processes either during their execution (dynamic analysis) or after their completion (trace analysis), there is little work on applying static analysis to actual system security.

What does this program do?

Whether we’re talking about SELinux, AppArmor, the Unix memory protection, ACLs, every form of dynamic analysis is trying to find the answer to two questions:

  1. what is this program doing?
  2. is that legal?

With Extrapol, the questions are similar, although we’re asking them earlier — and restricting ourselves to C programs:

  1. what is this program going to do whenever it gets executed?
  2. is that legal?
  3. if information is missing, where will that information come from?

Let’s look at an example and see what gives:

void* read_some_stuff()
   FILE* file = fopen("/home/foo/bar.log", "r");
   void* buf  = malloc(1024*sizeof(int));
   fread(buf , sizeof(int), 1024, file);
   return buf;

Now, what does this function do? Now, regardless of the quality of the code, from a cursory read, it’s quite clear that it

  • opens a file named “/home/foo/bar.log” for reading
  • allocates some memory
  • read some of the content of that file
  • return a pointer to the contents of that file.

In terms of security, the program should only be executed by a user who has the authorization to read file “/home/foo/bar.log” and only if that user is willing to let the program read the contents of that file. Let’s see what Extrapol tells us about the extract:

read_some_stuff: Function
        effect                 : "open" (Constant "/home/foo/bar.log" , Constant "r" )
        effect                 : "read" (Constant "/home/foo/bar.log" )
        return                 : "data read from file" (Constant "/home/foo/bar.log" )

In clear(er) English, Extrapol has just given us the following informations:

  • read_some_stuff is a function
  • read_some_stuff does not take any argument
  • if executed, this function will have the secondary effect of opening a file named “/home/foo/bar.log” for reading
  • if executed, this function will also have the secondary effect of reading stuff from a file named "/home/foo/bar.log"
  • the result of calling this function is some data read from the file named "/home/foo/bar.log"

While the result is somewhat larger than the original program, of course, we can apply Extrapol to more complex programs. Say, from

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int do_open(char* file_name)
  int fd = (int)fopen(file_name, O_RDONLY);
  return fd;

int do_read(int fd)
  char* buf;
  buf    = (char*)malloc(1024*sizeof(char));
  int rd = 0;
  for(int i = 0; i < 1024; ++i)
    rd = rd + fread(buf, i, 1, fd);

  return buf;

Well, in that case, Extrapol will analyze both functions and produce a summary of each.
	<li> The analysis of <code>do_open</code> will result in: 
do_open: Function
        input arg file_name    : Bottom
        effect                 : "open" (Identifier "file_name", Constant "0" )
        return                 : "FILE" (Identifier "file_name" )

that is,

  • do_open is a function
  • do_open accepts one argument file_name
  • the content of this argument won’t be modified
  • this argument doesn’t seem to require to contain anything specific
  • the function has one system effect: opening for reading the file whose name was given as file_name
  • the function returns some abstract data, in that case a file whose name name was given as file_name.
  • The analysis of do_read will result in:
    do_read: Function
            input arg fd           : "FILE" (Identifier "extrapol_generated_699 (formerly <anonymous> ) " )
            effect                 : "read" (Identifier "extrapol_generated_699 (formerly <anonymous> ) " )
            return                 : "data read from file" (Identifier "extrapol_generated_699 (formerly <anonymous> ) " )

    that is,

    • do_read is a function
    • do_read accepts one argument fd
    • the content of this argument won’t be modified
    • this argument must be abstract data, in that case a file
    • for the rest of this function we will call the filename extrapol_generated_699 (formerly <anonymous>)"
    • the function has one system effect: reading from the file whose name was given as extrapol_generated_699 (formerly <anonymous> )
    • the function returns some abstract data, in that case data read from the file whose name was given as extrapol_generated_699 (formerly <anonymous> )
  • Now, chances are that we may want to use both functions. As it turns out, whenever a function has been analysed, Extrapol places the information it has deduced in its knowledge base (the environment) and may use it for further deductions.

    Now, let’s feed the following additional source to Extrapol:

    int main(int argc, char **argv) {
      assert(argc >= 1);
      int   fd  = do_open(argv[0]);
      char* buf = do_read(fd);
      return 0;

    In return, Extrapol will answer:

    main: Function
            input arg argc         : "command-line argument" ()
            input arg argv         : "command-line argument" ()
            input vararg           : "command-line argument" ()
            effect                 : "read" (Identifier "argv" )
            effect                 : "open" (Identifier "argv", Constant "0" )
            return                 : Constant "0"

    In other words, the effect of the full program will be to open and read from a file whose name was provided as a command-linen argument.

    How does this work ?

    For today’s entry, I’ll take that question with the meaning of “How do I use it ?” Well, it’s simple. Extrapol takes as input one or more C source files (they don’t have to be pre-processed) and a knowledge base of primitive functions. This knowledge base is an ASCII file with the same format as Extrapol’s answers.

    For instance, to obtain the above demos, we provided Extrapol with the following piece of information for function malloc:

    malloc: Function
      input arg size: Bottom
      return: Bottom

    In other words, malloc is a function accepting one argument named size. Nothing special is expected from size and the returned value is compatible with any value.

    Now, fread is a more complex function. Here’s the prototype of the function, as presented in its man page:

    size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);

    For Extrapol, that’s a function with 4 arguments and a return value. Let’s start from the end.

    • The return value could be anything, we won’t make any guarantee (that’s called Top — we could refine that).
    • The last argument is a file. We’ve seen earlier how to deal with files, they’re abstract structures, which we’ve decided by convention to call "FILE"(foobar), where foobar is either the name of the file or a pointer to a variable/argument which contains that name. In that context, "FILE" is called a constructor, by the way. It’s just a character string which we use to differentiate abstract values. Correspondingly, foobar is the argument of "FILE".
    • The second to last argument is a number of blocks to read. We’re just going to ignore it. In other words, that argument is compatible with any value. That’s Bottom.
    • The second argument is a size. We’re also going to ignore it, so Bottom again.
    • Finally, the first argument is an output argument — functionally, it’s a special form of return, so we’re going to mark that argument as output. We could just specify that it may contain anything and refuse to make any guarantee, but here, we’re going to refine a bit and specify that it’s abstract data, read from the file. Let’s call the constructor "data read from file" and give it as argument the name of the file. This gives "data read from file"(foobar).
    • Oh, yeah, and before we forget, this function has a side-effect we’re interested in: it reads some content from that file foobar. Let’s call that effect "read"(foobar)

    Or, in full Extrapol syntax:

    fread: Function
       output arg ptr: "data read from file" ( Identifier "path" )
       input arg size : Bottom
       input arg nmemb: Bottom
       input arg stream: "FILE" ( Identifier "path" )
       return: Top
       effect: "read" ( Identifier "path" )

    Now, from a set of C source files and this knowledge base, Extrapol will pre-process C, parse it, remove unused symbols then examine the source code and proceed by success deductions until either

    • something goes wrong (typically, a function is used without being defined)
    • every function has been examined.

    Extrapol then outputs the result of his deductions, in a format fit for reinjection inside the knowledge base.

    If you are curious, the theories behind this notion of deduction are

    • lambda-calculus
    • dependent types
    • types with effects
    • Hindley-Milner-style type inference.

    This will be detailed further in another blog entry — and in a journal paper, whenever I find the time to complete it.

    Does it work ?

    The answer is yes, no and maybe.

    • Maybe: testing on real applications is something yet to do. Currently, Extrapol works on our sample set, a set containing no entry larger than 100 lines. We’ll know more when we find the courage to test, say, tar or df.

    • Yes: everything written above, and a dozen other samples, work.
    • No: some features are not implemented yet. Function pointers won’t work, nor will recursivity. I haven’t put together the theory behind these constructions yet, so I don’t know yet how hard all of this will be. In addition, for the current prototype, we can’t deduce anything interesting on global variables yet, and functions need to be declared in the order in which they are used. These two aspects will be easy to fix, they just don’t have the highest priority.

    How is it written ?

    There are currently two different versions of Extrapol:

    • Extrapol/Java is a Java-based implementation of the specification, by Steve-William Kissi, Bastien Jansen and David Teller. It is about 18,000 lines of code and, at this very moment, lags slightly behind the experimental version, Extrapol/ML.
    • Extrapol/ML is the set of specifications, written in OCaml by David Teller — as well as the experimental version. It is about 3,000 lines of code, purely functional except for logging.

    Both versions are meant as open-source, although the licence hasn’t been 100% decided. Extrapol/Java may end up MIT-style with Extrapol/ML ending up LGPL + linking exception.

    Can I play with it ?

    As soon as we’ve decided the licence, we’ll release the current prototypes. Ideally, this should happen before mid-June.


    Tagged: , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,

    Leave a Reply

    Fill in your details below or click an icon to log in:

    WordPress.com Logo

    You are commenting using your WordPress.com account. Log Out /  Change )

    Google+ photo

    You are commenting using your Google+ account. Log Out /  Change )

    Twitter picture

    You are commenting using your Twitter account. Log Out /  Change )

    Facebook photo

    You are commenting using your Facebook account. Log Out /  Change )


    Connecting to %s

    What’s this?

    You are currently reading Extrapol, part 1: from C to Effects at Il y a du thé renversé au bord de la table.


    %d bloggers like this: