Project ideas

Christos Kloukinas

c.kloukinas -at- city...
  1. Extending the XCD translator tool <-------- !!!!

    When you write a Java program, you compile to see if there are any grammar/type errors, and, if not, you run it with different input values to test that it's correct.
    When you produce a UML diagram, how do you know that it's correct?
    You don't... :-(

    We have created a new language to describe architectures of software systems and their behaviour, called Xcd (pronounced "exceed"). Here's the Xcd home page.
    This language allows to specify architectural designs in a textual form, as if you were programming. It also allows you to check these designs for errors. There's a tool that translates the designs into another language that is easier to check automatically (through an external tool called Spin - a model checker, see What is Spin).
    An article describing the Xcd language has been published at the CBSE'14 conference:

  2. It's just a matter of semantics... Introduction

    Languages are defined by their grammar and their semantics. The grammar says that "a $ 3 ;" is incorrect Java. The semantics describe what exactly a grammatically correct piece of code actually means (what does it mean to write something like "a = 4;"? It's context dependent - it's an error if you haven't defined "a" to be a number...).
    Imagine that I give you the grammar of the Greek language. Using it you can identify which Greek sentences are correct and which are wrong but without knowing the semantics you cannot say what the correct sentences actually mean - it's all Greek to you! Well, the same happens with computers.

    Now it's getting trickier - how do you know that your C++ compiler correctly implements the C++ standard (i.e., the C++ semantics as expressed in English by the standard committee)? The C++ compiler developers have tried to interpret what the committee had written in English and codify that into their compiler but it's not an easy task - bugs have been identified on many occassions.

    There's a way to make this task easier by using a recently developed framework called the K Framework. There you use a specific formal (not just English) language to describe the semantics of a language and there are tools to produce the compilers for you automatically (like writing a UML statechart and then getting running code from it directly).

    This framework has used to produce executable formal semantics for the C language (!!!!), published at POPL'12 (a very interesting but maybe a bit difficult to understand paper).

    A much easier but slightly outdated primer into the K Framework is available, showing how the framework works for a simple calculator language - should be much easier to follow (and I'm more than glad to help with it). Figure 4, on p. 19 shows the whole input for defining the grammar of a simple calculator and its semantics - have a quick look at it, to see how simple it is.

    There's an up-to-date (as far as I know) set of tutorial videos.

    The tools of the K Framework are available from the K Framework's Tools Github page.

    1. You could express the semantics of FSP with the K Framework, to get our own version of LTSA!

      A paper describing the translation of FSP into another formal language (Lotos) may be helpful in understanding how FSP is supposed to work when unsure (mainly a note to myself).
      You need to be able to think abstractly and be very, very methodical - giving super attention to detail.

    2. You can try to use the K Framework to express the Xcd language semantics with it (much harder than doing so for FSP, since XCD is a more complicated language).

      We have an informal definition of the Xcd language semantics, described in English and pseudo-code in Mert Ozkaya's PhD thesis. Mert has also written the tool that translates Xcd designs into the Spin tool's language so that they can be verified. This tool is supposed to have followed the English language Xcd semantics but it's quite possible that some errors have been done (or indeed that the original semantics were not always correct - we're only humans...).

List of OLD project ideas

  1. Spin (suggested)

    There's a tool called Spin, that takes a model and checks it automatically.
    Developed at Bell Labs, received the very prestigious ACM System Software award in 2002, its developer now heads a research lab at NASA.

    Spin's written in C. Plain, good, old C. No classes/generic code - simple... (he, he, he).

    We'd like to modify Spin itself so that it better supports our Xcd language.

    More info (+ binaries and source code) on Spin can be found here:

    1. Spin now supports channels and message receptions like this:

      /* receive a message consisting of three data values from channel
         chan1 and assign these into d1, d2, d3. */
      chan1 ? d1, d2, d3;

      You can now also do:

      /* only receive a three valued message and assign the first value to
         d1 if the second value in the message is equal to 3 and the third
         value is equal to the current value of d4 */
      chan1 ? d1, 3, eval(d4);

      We need:

      /* only receive a three valued message into d1, d2, d3, if its values
         satisfy the predicate after the : character, where $i is the i-th
         message component */
      chan1 ? d1, d2, d3 : ($1 * $2 == $3 + d4);
      Note: Spin has different types of message reception.
      c ?    m; /* receive the first message in the buffer if it matches m */
      c ??   m; /* receive any matching m non-deterministically */
      c ?  < m >; /* receive the first message in the buffer if it matches m but don't remove it from the channel buffer */
      c ?? < m >; /* receive any matching m non-deterministically but don't remove it from the channel buffer */
      c ?  [ m ]; /* poll channel to see if you could receive the first message in the buffer if it matches m */
      c ?? [ m ]; /* poll channel to see if you could receive any matching m non-deterministically */ 
    2. Also Spin has a non-deterministic select(i, min, max), which assigns i a value between min and max when minmax or min otherwise (it never blocks!).
      It is equivalent to the following Promela code:
      /* select(i, min, max); */
      i = min;
      do /* Non-deterministic loop, can choose any value between min and max */
      :: i < max -> ++i
      :: break

      We'd like a select(i, min, max, predicate) - assign i a value between min/max that satisfies predicate; block if no such value exists. The latter part is important (and difficult) - the statement shouldn't be executable at all if there is no value within the range for which predicate is true.
      And of course, if the predicate is true for some value then it shouldn't block, so the following "implementation" is wrong:

      /* select(i, min, max, pred); */
      i = min;
      do /* Non-deterministic loop, can choose any value between min and max */
      :: i < max -> ++i
      :: pred(i) -> break /* Bug: if pred(i) is true, one may choose to
      advance i nevertheless and get stuck in values of i for which pred(i)
      isn't true. We could store the last value of i that satisfies pred(i) */
      Another attempt - but is it correct?
      /* select(i, min, max, pred); */
      i = min; last_i=min-1;
      do /* Non-deterministic loop, can choose any value between min and max */
      :: pred(i) -> last_i = i ; /* Store the last value of i that satisfies pred(i) */
                    :: i < max -> ++i
                    :: break
      :: else -> if 
                 :: i < max -> ++i
      	   :: (i == max) && (last_i != min) -> i = last_i ; break

    We can then modify the Xcd translator to take advantage of these features (especially the generalized message reception, which should help decrease the state space considerably).
    This would also help demonstrate the usefulness of these features by comparing how the two versions of the translator behave memory/time-wise.

  2. Spin - two-in-one (challenging) Spin can simulate a model and also produce a C program that can verify the model. If there are errors during verification, one can replay them using Spin.

    Issue: The code that Spin has internally to be able to simulate models is quite similar to the C code it produces to verify models. The verification code can also execute fragments of C code that can be included in Spin's models (as they're getting compiled along with everything else), while the simulator cannot execute them.

    We'd like a version of Spin that enables the produced C program to also simulate the model - one small step towards merging the two facets of Spin.

    (or use hobbes to execute the checker code directly inside Spin...)

  3. Spin unspun (hard) Here we go all-guns-blazing in refactoring...

    Spin, is not just a tool - it has so many configuration options that it'd be more right to call it a family of tools.
    When compiling its verifier, pan (Protocol ANalyser), it can use a number of options:, e.g., -DBITSTATE, -DBFS, -DNOREDUCE, etc.
    The code is then compiled selectively according to these directives - whether it'll be exhaustive or bitstate exploration, depth-first or breadth-first, with or no partial-order reduction, etc.

    In a perfect world, I'd like to refactor its code entirely, so that instead of selecting code based on these directives I can use C++ Template Metaprogramming to do the selection.

    As a first step, it'd be nice if we could:

    1. Pass some directives to the compiler, e.g.: g++ -x c++ -DBITSTATE -DNOREDUCE pan.c -o pan
    2. Use the directives to define some global enum variables:
      enum class search_type { exchaustive, bitstate };
      #ifndef BITSTATE
       search_type search_type_val = search_type::exhaustive;
       search_type search_type_val = search_type::bitstate;
      (same for the other options).
    3. Then use template metaprogramming to select code that is optimised for specific values of these global enum variables:
      /* Your job! Set of classes that when instantiated at the end will
         print the list of directives used at compilation */

    What's needed:

    1. Identify all these directives (a smart grep will do it)
    2. Group directives into groups, e.g., SEARCH_MODE = { Depth_First, Breadth_First }; and respective global variables that are set to the appropriate mode.
    3. Replace all #if/endif selective compilation directives with if/switch checks that check the global variables.

    This will allow the compiler to compile the *whole* code, not just the one that has been selected by the directives.

    This will allow us to look for bugs in Spin, some of which could be masked now. I'm already developing some crazy C pre-processor macros to facilitate this task (one must have fun from time to time...;-).

    The whole code? Well, not exactly... You see, Spin contains code inside strings (in files called pangenX.[ch]), that it uses to produce another executable called pan (Promela ANalyzer) given some model - it's the pan executable that actually checks the model. Obviously, this code is never seen by the compiler when Spin is compiled - only seen when compiling pan. In itself that's not too bad, as one can refactor it as we've just described for the Spin code and check for bugs. But some of this code is quite likely similar to the code that Spin itself uses - Spin can simulate a model (i.e., execute one of its execution paths), while pan model-checks it (i.e., executes all its execution paths).

    So it'd be really, really nice if we could find a way to refactor this code so that common parts are only written once and so forth. This may require using C++, especially lambda functions and/or template meta-programming. It might also require some crafty C pre-processor macros and shell scripts to turn the same code into a string version (for pan) and a normal version (for Spin).

    What's in it for you? You can demonstrate that you can understand and modify a complex piece of software and do so in a systematic manner. Employers with brains would appreciate that, cause that's the majority of what professional programmers do - few write programs from scratch.


  Note: The following project ideas may depend on the semantics project above as the current Xcd tool may not be able to treat the cases explored - so I consider them risky and would rather have someone complete the Xcd semantics project first.


  1. patterns (actually may depend on the semantics project above as the current Xcd tool may not be able to treat these examples...)

    Here is a paper discussing how to represent the patterns in the Gang of Four book using aspect-orientation.

    The code from that was here: It's still accessible through the Internet archive

    Along with a PhD student here at City we're developing a language to describe architectures. All related publications, the tool, and some case studies are here:

    The goal is to take the patterns described in POSA (vol 1) and describe them in Xcd.

    POSA books:

  2. algorithms (actually may depend on the semantics project above as the current Xcd tool may not be able to treat these examples...)

    Similar to the patterns project - the goal here is to describe some algorithms, like (from C++): swap, sort, find, etc. Also map-reduce (don't think that the Xcd language can support it now but it's good to try).

  3. (DIFFICULT++) properties (actually may depend on the semantics project above as the current Xcd tool may not be able to treat these examples...)

    There are two ways to describe a protocol: a decentralized one, where you give the behaviour of each participant, and a centralized one where you describe at each state which participant can move and what the next state would be.

    The Xcd language supports the former.

    We'd like to extend it so as to be able to describe a property for such a protocol, which would be expressed as a centralized behaviour.

    Concerns language design, and compiler changes.

  4. (DIFFICULT) time (actually may depend on the semantics project above as the current Xcd tool may not be able to treat these examples...)

    The current version of the Xcd language does not have any notion of time (like FSP).

    But in many systems time needs to be modelled.

    So we'd need to extend the syntax to include clocks and expressions on them, and then extend the compiler that translates Xcd models into the language of a model-checker called Spin.

    More info (+ binaries and source code) on Spin can be found here:

    Note - Spin itself does not support time.

    There are two groups that have tried to extend it so that it does:

    We'd preferrably like to use the first extension, as it can support more than time.

    Looking at the syntax extensions that these two have done for Spin, would be a good way to start on deciding how to extend the syntax of Xcd.

    You don't need to understand the underlying theory about model-checking time (well, not all of it).

    Like previous projects, it involves working on the syntax of the language and on its compiler.

  5. Google/Microsoft File System (actually may depend on the semantics project above as the current Xcd tool may not be able to treat these examples...)

    There's a paper describing how the authors specified the distributed file system of Google (GFS), Microsoft (Niobe), and another one (Chain):

    Roxana Geambasu, Andrew Birrell, and John MacCormick. "Experiences with Formal Specification of Fault-Tolerant Storage Systems." In Proceedings of the 38th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN-DCCS), Anchorage, Alaska, June 2008.

    The authors were kind enough to give me the specifications.

    It would be very interesting to translate these into Xcd, to see if we can make them shorter, etc. by reusing protocols.