While working in Operating System and Hardware Support for Tandem Europe, we received multiple reports of very specific processor failures from a company near Düsseldorf. In each case, CPU 4 failed at 16:04 every day.
We finally got some processor dumps and found that the head of the MAPPOOL table (really a doubly-linked list) had been corrupted. MAPPOOL was always in memory map 6, and the (“absolute”) address of the map was, conveniently, at the beginning of the map: the two pointers pointed to the first pointer. Tandem extended (32 bit) addresses were either “relative” (relative to the user data space) or “absolute” (relative to system memory). They were usually written as two 16 bit octal values, and the sign bit of the first word was set to 1 for absolute addresses and 0 for relative addresses. So the address of MAPPOOL was %100014.0. There was also an alternative form as a single 32 bit value; in this form the address was %20003000000D, where D stood for double word. Don't you love the use of octal for 16 bit machines? We certainly did.
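To make the two notations concrete, here is a small Python sketch of the conversion between the two-word form and the double-word form. The helper names (to_double, to_words, is_absolute and the two formatters) are invented for illustration and are not Tandem routines; the only facts taken from the text are the 16 bit word size, the sign bit marking absolute addresses, and the octal notation.

```python
# Hypothetical helpers illustrating the address notation described above.
# Nothing here is Tandem code; it only reproduces the arithmetic.

ABSOLUTE_BIT = 0o100000  # sign bit of the first 16 bit word


def to_double(high: int, low: int) -> int:
    """Combine two 16 bit words into one 32 bit value."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


def to_words(value: int) -> tuple[int, int]:
    """Split a 32 bit value into its (high, low) 16 bit words."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def is_absolute(high: int) -> bool:
    """True if the sign bit of the first word marks the address as absolute."""
    return bool(high & ABSOLUTE_BIT)


def fmt_words(value: int) -> str:
    """Format as two octal words, e.g. %100014.0."""
    high, low = to_words(value)
    return f"%{high:o}.{low:o}"


def fmt_double(value: int) -> str:
    """Format as a single octal double word, e.g. %20003000000D."""
    return f"%{value:o}D"


mappool = to_double(0o100014, 0)
print(fmt_words(mappool))            # %100014.0
print(fmt_double(mappool))           # %20003000000D
print(is_absolute(0o100014))         # True
```

The same address therefore looks quite different depending on which form is used, which matters later in the story.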
In the dumps, though, the forward pointer (at address %100014.0) had been modified to %100015.0. We had never seen anything like that before, and our first course of action was to replace the processor—to no avail.
Finally we decided to try to catch the bug in action. That's relatively easy: under normal operating conditions, this pointer is never changed, so a memory access breakpoint on write would catch it. I had been playing with a feature called “checkroutine”, which allowed us to put arbitrary code into system code space and attach it to the kernel debugger. I wrote a custom checkroutine and sent Garry Easop off to Düsseldorf to put it in CPU 4 shortly before 16:04. Bingo! It worked first time.
The dump was amazing: the crash was caused by a user program! A number of factors came together:
The customer had a clever programmer who knew that the only proper way to terminate a program was to jump to address 0. No call stop for him!
This even worked, if you ignored the crashes. The beginning of the code space in a Tandem program was reserved for the Procedure Entry Point Table, and the first two words were pointers in this table, guaranteed to be less than 512 (the maximum size of the table). Executing these words would, sooner or later, lead to the process crashing.
Somebody had introduced a bug into the system a release or two back. The background was an untidy area where the file system needed to move data from user data space to the Process File Segment (PFS) in “system” (kernel) data space. This was done in process context, and clearly system data is off limits to user processes. But the solution was relatively simple: the absolute addresses of the segments we wanted to address were higher than the current code and data segments. So on entry to the file system, the upper limit was raised from relative address %10.0 (which allowed access to the four user segments) to a higher value, which I forget, but which allowed access to the PFS as well. When it returned, it reset the upper limit to %10.0, which it wrote in the alternative real 32 bit form as %2000000D.
Not a very elegant situation, but it worked. But somehow one day the reset code got modified: instead of setting the upper limit to %2000000D, it set it to 2000000D. That missing % meant that the word was interpreted as decimal, and so it set the real limit to %36.102200, way beyond the system map area.
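The effect of the missing % can be checked with a few lines of Python. This is only an arithmetic sketch: the variable and function names are made up, and Python's 0o prefix stands in for the % octal marker.

```python
# Arithmetic sketch of the missing-% bug; names are illustrative only.

def fmt_words(value: int) -> str:
    """Format a 32 bit value as two octal 16 bit words, e.g. %10.0."""
    high, low = (value >> 16) & 0xFFFF, value & 0xFFFF
    return f"%{high:o}.{low:o}"


intended_limit = 0o2000000  # %2000000D: octal, as originally written
actual_limit = 2000000      # 2000000D: the same digits read as decimal

print(fmt_words(intended_limit))  # %10.0
print(fmt_words(actual_limit))    # %36.102200
```

The same digits give a limit of %10.0 when read as octal and %36.102200 when read as decimal.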
Finally, what mattered was what happened when the process stopped. The value at address 0 was, by pure fluke, one of the two or three instructions that always wrote (or tried to write) to system data: ORG, “logical OR with System Global”. This took two arguments on the register stack: an address and a value to OR with the current contents. They were, of course, respectively %100014.0 and 1. Normally this would create an exception in a user process, but the file system bug allowed it the access. So this instruction performed the damage, helped by the bug in the file system code.
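The damage itself is a one-bit change. The sketch below assumes that ORG ORs its value into a single 16 bit word at the given address. That is an assumption, but it is consistent with the forward pointer changing from %100014.0 to %100015.0 in the dumps. The list-based memory model and the function name are purely illustrative.

```python
# Illustrative model of the corruption; not Tandem code.
# Assumption: the OR is applied to one 16 bit word at the target address.

WORD_MASK = 0xFFFF


def or_word(word: int, value: int) -> int:
    """Return the 16 bit word after ORing in the given value."""
    return (word | value) & WORD_MASK


# Forward pointer at the head of MAPPOOL, stored as [high word, low word].
# Before the crash it points to itself: %100014.0.
forward_pointer = [0o100014, 0o0]

# The stray instruction ORs 1 into the first word of the pointer.
forward_pointer[0] = or_word(forward_pointer[0], 1)

print(f"%{forward_pointer[0]:o}.{forward_pointer[1]:o}")  # %100015.0
```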
That didn't mean that the system crashed immediately, of course. It didn't do that until the next time that something accessed MAPPOOL. As our previous dumps showed, this was some time later, by which time no trace of the culprit remained.