Friday, August 18, 2006

A Eureka Moment!

Back on the 13th of August I posted a long email to the ICFP contest discussion list. What follows below is basically a slightly reworked version of that email.

I wanted to share a "discovery" that I had made. I knew it could (and still might) turn out to be something already well known in certain circles (or be obvious to others) but I couldn't find anything on the net that seemed to be describing the same kind of thing. It certainly wasn't obvious to me, so I felt rather excited that perhaps this might be something "new" - even if it ultimately also turned out to have little or no real value!

In essence, I first found a way (for some simple cases at least) to precisely describe the number of instructions executed by a particular self-interpreter (when running another program) using a matrix, plus a less significant vector, both derived directly from a static analysis of the internals of that self-interpreter. Later on I stumbled onto the discovery that the dominant eigenvalue of that same matrix was the limit, as N goes to infinity, of the ratio of the total instructions executed to run a stack of N+1 levels of the interpreter to the total instructions executed by a stack of just N levels (with both "stacks" running the same top level program). This also meant that as long as the time required to execute each opcode is some constant positive amount, the same ratio describes the ratio of the total running times as well.

In the same way that the set of eigenvalues and associated eigenvectors characterise a matrix, this ratio appears to characterise a particular self-interpreter. Therefore I decided to name this ratio the eigenratio of that self-interpreter.

Here's a blow-by-blow description of how I got to this point:

In an earlier post I described how I had done some experiments, and it seemed that as long as you had a high enough stack of um.um self-interpreters running some other UM binary, then adding yet another instance of um.um to the stack made the whole thing about 24.5 times slower. (Just under that value actually.)

That is, you can construct a stack of N um.um interpreters with another program to top it all off and then run the whole thing with some other valid interpreter (your own C implementation for example). If we then look at the ratio of the time taken to run the stack with N+1 interpreters to the time taken to run the stack with just N interpreters, it appears that the ratio converges to a value of around 24.5 as N increases. That is, as N increases, N+1 levels of the interpreter will take about 24.5 times as long as N levels to run the same top level program. Furthermore, it seemed that the limit value of this ratio didn't change when a different top level program was used (on the very top of the stack).

Later I added a few lines of code to my own UM interpreter (the C implementation) so it would report exactly how many times each instruction in the UM instruction set was being executed at the lowest level of the "stack of self-interpreters" (at the level where things were "really" happening!). I collected those statistics with different numbers of copies of um.um in the stack, and then put them into a spreadsheet where I could experiment with assigning more or less arbitrary (but fixed and positive) "costs" to each instruction. Once again, it seemed that no matter what cost (e.g., execution time) I assigned to individual instructions, the ratio of the total "cost" for N+1 to just N levels of um.um running some other top level program still converged to the same value of around 24.5! This was intriguing...

Next I decided to implement my own simple virtual machine/instruction set, and then to write a self-interpreter for it (in the same style as um.um) but with the specific goal of achieving a lower ratio than the 24.5 (in the limiting case) that um.um produced. I did this with a very simple machine with (initially) only 6 instructions and finally got my own self-interpreter running. It showed the same kind of behaviour as um.um when run as a stack of multiple copies of itself, but with a ratio of about 15.2. After examining the code for this interpreter I added another specialised instruction to replace a commonly used sequence of three others, and the updated VM and self-interpreter achieved a ratio of about 11.5. (Still a long way to go to get anywhere near the "Golden Ratio" (~1.618...) that I had plucked out of thin air for my earlier (and not entirely serious) "conjecture", but at least some progress was being made in the right direction!) Another subsequent smaller tweak to the instruction set reduced the ratio to about 11.2.

At this stage I decided to try a different approach, and after spending some time with pen and paper found a tidy way to describe and directly calculate the number of times each instruction would be executed in order to run a stack of N self-interpreters plus some other program on top. The solution was to use matrices.

The structure of my own various self-interpreters was simple. A very short section of code ran once at start-up, and then a loop was entered that, on each iteration, decoded the opcode of the next instruction from the program being interpreted (the next level up) before branching to some code to emulate that instruction (adjusting memory addresses etc. along the way).

Therefore for any instruction being interpreted, I could follow the execution path through the interpreter and determine exactly which instructions it executed and also how many times. From this I built a matrix M with m rows and m columns (each row and column corresponding to one of the m instructions in the instruction set). Each column of the matrix corresponds to the emulation of one instruction from the instruction set, and the values in that column (reading the rows from top to bottom) are exactly the number of times each instruction is executed in the current level of the interpreter in order to handle that task. In other words, the value in row i and column j is the number of times instruction i has to be executed during the main loop in order to handle one instance of instruction j in the code being interpreted.

Next, I constructed an m by 1 vector (call it P) where each value is the number of times the corresponding instruction is executed during a full run of the top level program. (The order of the values in this vector must match the order used for the instructions in the rows and columns of M. In my case the opcodes were integers from 0 to m-1 and the same ordering was used for the rows/columns of the matrices.)

Thirdly, I constructed one more vector (S) with the same dimensions as P, where the values are again the numbers of times the various instructions are executed, but this time covering only the code in the start-up section of the interpreter (before it enters its main loop to handle instructions from the program being interpreted).

Now (using juxtaposition to represent matrix multiplication), we can say:

P is the number of times each instruction is executed to run the top level program.

MP + S is the number of times each instruction is executed by a self-interpreter running the top level program.

M(MP + S) + S = MMP + MS + S is the number of times each instruction is executed by a stack of two self-interpreters running the top level program.

And so on for higher levels...
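
To make the recurrence concrete, here's a minimal NumPy sketch. The matrix and vectors below are entirely made up for a hypothetical 3-instruction machine (they are not the real um.um figures); they're only there to show the shape of the calculation.

    import numpy as np

    # A hypothetical 3-instruction machine (made-up numbers, not um.um):
    # column j of M says how many times the interpreter executes each of its
    # own instructions (rows) in its main loop to emulate one instance of
    # instruction j in the program being interpreted.
    M = np.array([[4, 2, 3],
                  [1, 5, 2],
                  [2, 1, 6]], dtype=float)

    # S: instructions executed once in the interpreter's start-up section.
    S = np.array([3, 1, 2], dtype=float)

    # P: instruction counts for one run of the top level program on its own.
    P = np.array([100, 40, 60], dtype=float)

    def stack_counts(M, S, P, levels):
        # Predicted per-instruction counts at the bottom of a stack of
        # 'levels' self-interpreters running the top level program.
        counts = P
        for _ in range(levels):
            counts = M @ counts + S   # wrap one more interpreter around the stack
        return counts

    for n in range(4):
        print(n, stack_counts(M, S, P, n))

For one and two interpreter levels this reproduces MP + S and MMP + MS + S from above.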

So far, so good. Now I was able to use the NumPy package for Python (which provides support for various matrix operations) and, after fixing a couple of errors in my matrices, was able to reproduce exactly the same figures for counts of instructions executed as I had previously obtained by having my own C interpreter track them directly.

I was pretty happy at that point, having found a nice way to use matrices to predict exactly the number of times each instruction would be executed for a given stack of N copies of some self-interpreter plus some other program on top of that. But now I found myself wondering what characteristic of the matrix M (ignoring S because it is very small in comparison and therefore has very little effect on the big picture) produced the apparently fixed ratio observed earlier.

Now it's been a long time since I've had any need to use matrix determinants and so on, but I did vaguely remember the names of a few things of that nature, if not the actual uses of them! So more or less on a whim I googled "eigenvalues", found an online calculator, copied and pasted the contents of my own 'M' matrix, and clicked "calculate". Imagine my surprise when the first listed eigenvalue matched the (11.20...) ratio I had observed - to some number of decimal digits!! Definitely one of those "eureka moments"!

To cut a long story slightly shorter, I have now refreshed my memory on eigenvalues and eigenvectors and have also subsequently found the description of the "Power iteration" algorithm which seems to provide at least a basis for understanding why repeated multiplication by the matrix M eventually generates results that are essentially identical to multiplying by the magnitude of the dominant eigenvalue of M instead. Building a stack of interpreters is essentially equivalent to executing the Power Iteration algorithm using the matrix derived from the interpreter being used in that stack!
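
Here's a small sketch of that idea (again with a made-up matrix M and top level counts P, and ignoring S since its contribution becomes negligible): stacking interpreters is just repeated multiplication by M, and the ratio of total counts between successive levels settles down to the dominant eigenvalue.

    import numpy as np

    # Hypothetical self-interpreter matrix and top level program counts
    # (made-up values, not the real um.um data).
    M = np.array([[4, 2, 3],
                  [1, 5, 2],
                  [2, 1, 6]], dtype=float)
    P = np.array([100, 40, 60], dtype=float)

    counts = P
    prev_total = counts.sum()
    for level in range(1, 16):
        counts = M @ counts               # add one more interpreter to the stack
        total = counts.sum()
        print(level, total / prev_total)  # ratio of successive total counts
        prev_total = total

    # The value the ratio converges to, computed directly:
    print("dominant eigenvalue:", max(abs(np.linalg.eigvals(M))))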

In summary, as we let N increase towards infinity, the ratio of the total instructions executed to run N+1 interpreters (plus some other fixed program as the very top level) to the total instructions executed when running just N interpreters (plus the same top-level program) is very close to the magnitude of the dominant eigenvalue of the matrix M that describes the internals of the interpreter (as described earlier).

The eigenvalue equation tells us that:

M.(eigenvector k) = (eigenvalue k).(eigenvector k)

So (disregarding the effect of S) we can see that as we pile more and more interpreters onto the stack, the vector holding the number of times each instruction has been executed converges (in direction) towards the "dominant eigenvector" of the matrix M (which describes the heart of the interpreter), and the ratio converges to the magnitude of the corresponding "dominant eigenvalue".
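
A quick check of that claim, using the same kind of made-up matrix as before: after many levels the direction of the counts vector lines up with the dominant eigenvector that NumPy reports.

    import numpy as np

    M = np.array([[4, 2, 3],           # hypothetical interpreter matrix again
                  [1, 5, 2],
                  [2, 1, 6]], dtype=float)
    P = np.array([100, 40, 60], dtype=float)

    # Direction of the counts vector after 30 levels of interpretation...
    v = np.linalg.matrix_power(M, 30) @ P
    v = v / np.linalg.norm(v)

    # ...versus the dominant eigenvector of M.
    w, vecs = np.linalg.eig(M)
    dom = vecs[:, np.argmax(abs(w))].real
    dom = dom / np.linalg.norm(dom)

    print(v)
    print(dom)   # same direction (possibly differing only in sign)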

So far I've just been talking about counts of instructions executed. We can however easily include a different cost for each instruction in another 1 by m vector. The dot product of this cost vector and the final counts vector gives the total "cost". But in the limit the counts vector for N+1 levels is just the dominant eigenvalue times the counts vector for N levels, so any fixed cost weighting scales the numerator and denominator of the ratio equally and cancels out - the final ratio is still the same as the one we got earlier using just the raw instruction counts.
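
Here's that point as a little numerical check (made-up data once more): two quite different fixed, positive cost vectors give the same N+1 to N ratio, and it matches the dominant eigenvalue.

    import numpy as np

    M = np.array([[4, 2, 3],           # hypothetical interpreter matrix
                  [1, 5, 2],
                  [2, 1, 6]], dtype=float)
    P = np.array([100, 40, 60], dtype=float)

    N = 25
    counts_N  = np.linalg.matrix_power(M, N)     @ P   # N levels of interpretation
    counts_N1 = np.linalg.matrix_power(M, N + 1) @ P   # N+1 levels

    # Two very different (but fixed and positive) per-instruction cost vectors.
    costs_a = np.array([1.0, 1.0, 1.0])
    costs_b = np.array([0.5, 10.0, 3.0])

    print(costs_a @ counts_N1 / (costs_a @ counts_N))
    print(costs_b @ counts_N1 / (costs_b @ counts_N))
    print(max(abs(np.linalg.eigvals(M))))   # all three agree to rounding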

Where to from here? At one point I was hoping that, once I understood the relationship between a particular matrix M and the corresponding ratio, it might help me find the smallest possible ratio - and therefore (in some sense) describe the "most efficient" self-interpreter possible. However, clearly M could (at least in a purely hypothetical sense) be the identity matrix, and in that case the ratio is 1. Adding new levels of the interpreter wouldn't make anything run more slowly, because if M equals the identity matrix (with ones down the leading diagonal and zeroes elsewhere) then we are also saying each instruction from level N+1 of our stack of programs can be emulated by a single instruction at level N. That doesn't make any sense to me without appealing to some kind of magic or perhaps some as yet undiscovered computing device that can do something for nothing. So the lower limit on the "eigenratio" of any self-interpreter (on any "computing machine") is still an open question as far as I can see.

In the beginning there was the ICFP Contest...

A few weeks ago I (along with a friend) participated in the 2006 ICFP Programming Contest. We called ourselves the "Black Knights" this year. This was the 3rd consecutive year that we have taken part and once again had a huge amount of fun - even though we were nowhere near the leading teams on the scoreboard!

The first part of the contest involved writing an emulator for a virtual machine (dubbed the "Universal Machine", or UM for short) according to the provided specifications. You used your emulator to run a UM binary provided by the organisers, and when you got that running correctly (and fast enough!) it generated output that was used to create yet another UM binary. And that one was a fun-filled simulation of a UNIX-type operating system called UMIX! The remainder of the contest involved hacking into various accounts on UMIX, and then finding and solving various problems in order to gain additional points. At least I think it did... I say that because our team has yet to even find some of those problems! We spent far too much time trying to get a Python version (and then a Python/Pyrex version) running correctly and fast enough. In the end we reverted to C. But that's another story...

After the contest the organisers released what they called a "reference implementation" of an interpreter for the UM, called - wait for it - "um.um"! This was written in the UM machine language directly and can run any other UM binary, including itself. Thus it is a "self-interpreter" (or alternatively, a "meta-circular interpreter").

Of course, being written in the UM language itself, you still need another working interpreter (such as your own) to run it. (The UM doesn't yet exist as hardware!) But once you have that, it is quite easy to do something like have your own interpreter run a copy of um.um, which runs another copy of itself, which runs another copy (insert as many levels as you feel like!), and then finally runs some other UM program. Running that final ("top level") program directly in your simulator is easier of course (and faster), but perhaps not quite as mind-bending!

I did a few experiments with um.um after the contest ended. In particular I was interested in how much more slowly a whole stack of multiple copies of um.um ran some program compared to (say) the same stack with one less copy in it. What I found was that after you had added "enough" copies, adding another seemed to make the whole thing about 24.5 times slower. This ratio didn't seem to change with different programs on the very top of the stack (although it might take more levels of um.um before "converging"), and it also didn't seem to matter which (other language) interpreter I used at the very bottom level. I had several to choose from for that very bottom level: three based on the implementations I worked on during the contest (and tidied up later), plus some other Python variants created after the contest in an attempt to find a faster way to do it.

It seemed that the um.um interpreter (once you had enough instances in the "stack") always made things run about 24.5 times slower for each additional instance. I wondered how small that ratio could be made, and also what the lower bound for such a ratio could be for any self-interpreter, in any language, and on any possible computer architecture!

Next post... what I discovered when I dug a little deeper.