Greg's death of a controller

Some years ago, while still working for Tandem, I wrote this text in reply to some discussion on USENET alt.folklore.computers. The file from which I derived the page is dated 1 August 1991, but I think the events to which I refer happened some time in 1986.

> I had a problem that went away when I added the line:
>    static int frobozz;  /* frobozz not being used anywhere in the program */
> Of course, this was a compile-time problem (internal compiler error).
> If it were a run-time problem, it could most likely be blamed on pointer
> abuse or similar programmer error.

There are other reasons for this sort of behaviour. One of the more interesting ones in my experience was some years ago in a new release of Tandem's Guardian operating system. Guardian is written in TAL (Tandem Application Language, a systems programming language derived from HP's SPL and vaguely reminiscent of a Pascal or Algol dialect). To understand the problem, you need to know something about the system architecture:

Each of the (up to 16) CPUs has 32 logical I/O channels, each of which can interrupt at one of two levels (standard or high priority). These levels are not priorities: there is a separate priority system. The only difference is that the system can (and does) choose to mask them differently. On interrupt, microcode loads an interrupt mask that determines which interrupts can interrupt the interrupt routine. For example, uncorrectable memory error and power fail are almost always left enabled. The dispatcher interrupt is disabled by all interrupt routines. IOINTERRUPT is masked more often than HIIOINTERRUPT, and this (and a few other distinctions) results in faster treatment of high-priority interrupts by the system.

Nevertheless, HIIO is not very different, and the requirement of strapping the controllers means that it is seldom used. One day, years ago, somebody in software development noticed that the interrupt routines IOINTERRUPT and HIIOINTERRUPT (yes, TAL does map identifiers to ALL CAPS) were identical except for a very small piece of code. Since they are in an expensive part of address space, the mod was made to make HIIOINTERRUPT an entry to IOINTERRUPT. The following example is illustrative in nature and bears only a superficial relationship to genuine Tandem code:

   proc iointerrupt interrupt;  -- declaration

   begin
   entry hiiointerrupt;         -- forward declaration of entry point
   int regs [0:7];              -- register save area (done by ucode)
   int intcount [0:31];         -- interrupt counts for each controller
   int hiio := 0;               -- normally not HIIO

   goto common^code;            -- oh well

   hiiointerrupt:               -- this is our entry point
   hiio := 1;                   -- set flag for hiio

   common^code:

   if hiio then                 -- code for HIIOINTERRUPT
     code (HIIO)                -- this is T/16 assembler stuff
   else
     code (IIO);                -- get the interrupt status

   -- and then continue and use the status for whatever, including getting
   -- the channel status.  Increment the interrupt count for this channel,
   -- kill the controller if it's continually interrupting.

   if (intcount [channel] := intcount [channel] + 1) > maxintcount then
     call kill^channel (channel);

This channel kill condition would, of course, happen inevitably if the interrupt counts were not regularly reset. The reset happens in TIMERINTERRUPT. We get the address of the local data area for IOINTERRUPT and HIIOINTERRUPT (by way of the SIV, if you happen to be interested in Guardian internals) into a pointer which I shall call icount (because I've forgotten its real name). Then we do:

   icount [0] ':=' 0 & icount [0] for 31;

This stores a 0 in the first word and then uses a block move to propagate it into the other counts. This is done once for each interrupt routine (they have the same code, but different local data). This works because the interrupt routines have static local data (again, pointed to by the SIV).
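The propagation trick relies on the block move copying one word at a time from low addresses to high, so that each word it reads is the one it has just written. Here is a minimal Python sketch of that behaviour. The names `move_words` and `clear_counts` are hypothetical stand-ins for the word move and the TAL statement, not Tandem code:

```python
def move_words(mem, src, dst, count):
    """Copy `count` words one at a time, lowest address first.

    There is deliberately no overlap protection: when dst == src + 1,
    every word read is the word written in the previous step.
    """
    for i in range(count):
        mem[dst + i] = mem[src + i]


def clear_counts(mem, base, n=32):
    """Model of: icount [0] ':=' 0 & icount [0] for 31;"""
    mem[base] = 0                           # store 0 in the first word
    move_words(mem, base, base + 1, n - 1)  # propagate it into the other 31


counts = list(range(100, 132))  # 32 arbitrary non-zero interrupt counts
clear_counts(counts, 0)
```

After the call, all 32 entries of `counts` are zero, but only because the first word still holds 0 when the move starts.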

With this buildup, it's obvious what happened. Well, almost. Tests worked perfectly, like QA tests always seem to. Then we delivered the release, and found:

  1. Customers with controllers on channel 0 (normally not used) were getting controller kills almost immediately (flavour: keeps going down, won't come up, customer screaming for blood).
  2. Customers with controllers running HIIO were experiencing the same symptoms, no matter what the controller number.

Well, the information above will point you in the right direction, but the real answer was a feature (really, not a bug) of the TAL compiler. Write a procedure with an entry point and with initialised local data, and it will initialise it with an internal subroutine, thus avoiding unnecessary duplication of initialisation code. The entry to IOINTERRUPT would thus look like:

   (init subroutine)
   POP  100                     -- pop return address from stack
   ADDS 040                     -- create space for interrupt counts
   LDI  0                       -- load immediate 0
   PUSH 0700                    -- store on stack
   SETP                         -- set P register to return address

   (iointerrupt)
   BSUB init                    -- initialise
   BUN  common^code             -- and on to common code

   (hiiointerrupt)

   BSUB init
   LDI  1                       -- load immediate 1
   STOR hiio                    -- and store in hiio

   common^code:

The BSUB instruction (Branch to SUBroutine) stores its return address on the stack, in the local data area. This doesn't usually make the slightest bit of difference to the program, since it happens before the local area is initialised. Except, of course, in an interrupt routine. On entry, the microcode has already stored the registers, so the S (stack) register is pointing to intcount [0]. There the return address gets stored, so the count for channel 0 suddenly holds a code address. That explains phenomenon 1 completely.
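To make the effect on channel 0 concrete, here is a small Python model of the static local data area. It is a sketch under assumptions: `RETURN_ADDRESS` and `MAXINTCOUNT` are invented values (the only thing that matters is that a code address is larger than the kill threshold), and the layout simply follows the illustrative TAL declarations above.

```python
REGS = 8                  # int regs [0:7], saved by microcode on entry
CHANNELS = 32             # int intcount [0:31]
RETURN_ADDRESS = 0o1234   # hypothetical code address stored by BSUB
MAXINTCOUNT = 100         # hypothetical kill threshold


def interrupt(local, channel):
    """Model one interrupt; return True if the channel would be killed."""
    s = REGS                    # S points just past the saved registers,
                                # which is exactly where intcount [0] lives
    local[s] = RETURN_ADDRESS   # BSUB init stores its return address there
    idx = REGS + channel
    local[idx] += 1             # count the interrupt for this channel
    return local[idx] > MAXINTCOUNT


local = [0] * (REGS + CHANNELS + 1)   # regs, intcount, hiio
killed_ch5 = interrupt(local, 5)      # some other channel: count goes to 1
killed_ch0 = interrupt(local, 0)      # channel 0: count is address + 1
```

In this model a controller on channel 5 survives its first interrupt, while a controller on channel 0 is killed on its first one. If nothing is attached to channel 0, nobody ever looks at the corrupted count.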

Phenomenon 2 is even more interesting. Here, TIMERINTERRUPT is doing its regular (every 0.3 second) clear of the interrupt counts of HIIOINTERRUPT. Code looks something like:

   LDI  0               -- 0
   STOR INTCOUNT        -- in INTCOUNT
   LADR INTCOUNT        -- set up registers: source
   LADR INTCOUNT+1      -- and dest address
   LDI  037             -- 31 words
   MOVW                 -- move words

Somewhere between the STOR INTCOUNT and the MOVW, HIIOINTERRUPT gets invoked, planting a bomb in intcount [0]. TIMERINTERRUPT then goes and propagates it through all interrupt counts.
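The race can be sketched in a few lines of Python. This is a simplified model, not Tandem code: `BOMB` is a hypothetical return address, and the interrupt is reduced to the single store that BSUB performs into intcount [0].

```python
CHANNELS = 32
BOMB = 0o1234   # hypothetical return address stored by BSUB


def move_words(mem, src, dst, count):
    """Word-by-word copy, lowest address first, as the block move does."""
    for i in range(count):
        mem[dst + i] = mem[src + i]


def timer_clear(counts, hiio_fires=False):
    """Model of the TIMERINTERRUPT clear, optionally interrupted by HIIO."""
    counts[0] = 0                            # LDI 0 / STOR INTCOUNT
    if hiio_fires:
        counts[0] = BOMB                     # BSUB overwrites intcount [0]
    move_words(counts, 0, 1, CHANNELS - 1)   # MOVW propagates word 0


quiet = [7] * CHANNELS
timer_clear(quiet)                     # normal case: everything cleared

unlucky = [7] * CHANNELS
timer_clear(unlucky, hiio_fires=True)  # interrupt lands inside the window
```

In the quiet case every count ends up 0. In the unlucky case every count ends up holding the return address, which is why the channel number of the HIIO controller made no difference.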

I still find this a truly fascinating example. We had it with a high speed (at the time - 64 kb/s) X.25 controller. Assuming a packet size of 2048 bits (256 octets), we would expect an interrupt rate of about 30 times a second. TIMERINTERRUPT does its thing every 0.3 second. Nevertheless, it was not possible to keep the controller up for more than 30 seconds.
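The arithmetic behind those figures, as a short Python calculation. It assumes that 64 kb/s means 64,000 bits per second, that there is one interrupt per packet, and it uses the 2048-bit packet size from the text:

```python
LINK_BITS_PER_SECOND = 64_000   # assumption: 64 kb/s = 64,000 bit/s
PACKET_BITS = 2048              # packet size used in the text
TIMER_INTERVAL_MS = 300         # TIMERINTERRUPT clears every 0.3 second
UPTIME_MS = 30_000              # the controller never survived 30 seconds

# One interrupt per packet on a fully loaded link.
interrupts_per_second = LINK_BITS_PER_SECOND / PACKET_BITS

# Interrupts that arrive during one timer interval.
interrupts_per_clear = interrupts_per_second * TIMER_INTERVAL_MS / 1000

# Number of times the clear code runs in 30 seconds.
clears_in_30_seconds = UPTIME_MS // TIMER_INTERVAL_MS

print(interrupts_per_second, interrupts_per_clear, clears_in_30_seconds)
```

So the window of a few instructions between the STOR and the MOVW was being hit within roughly 100 passes of the clear code, with fewer than ten interrupts arriving per pass.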

So, how to fix it? Right!

 static int frobozz;  /* frobozz not being used anywhere in the program */

Put this in front of the interrupt counts, and the problem is solved. The BSUB stores its return address there, and nobody sees it again.
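A Python sketch of why the extra word is enough, again under assumptions: the frame layout follows the illustrative declarations, `RETURN_ADDRESS` is an invented value, and the interrupt is placed at the worst possible moment, directly after the first store of the timer clear.

```python
REGS = 8                  # saved registers at the bottom of the local area
CHANNELS = 32
RETURN_ADDRESS = 0o1234   # hypothetical address stored by BSUB


def build_frame(pad_words):
    """Return (memory, index of intcount [0]) for a given amount of padding."""
    base = REGS + pad_words
    return [0] * (base + CHANNELS), base


def bsub(frame):
    """BSUB stores its return address in the word just past the registers."""
    frame[REGS] = RETURN_ADDRESS


def timer_clear(frame, base):
    """Clear the counts, with an interrupt right after the first store."""
    frame[base] = 0
    bsub(frame)                       # worst case timing
    for i in range(1, CHANNELS):      # word-by-word propagation
        frame[base + i] = frame[base + i - 1]


old_frame, old_base = build_frame(pad_words=0)   # original layout
new_frame, new_base = build_frame(pad_words=1)   # with the unused word
timer_clear(old_frame, old_base)
timer_clear(new_frame, new_base)

old_counts = old_frame[old_base:old_base + CHANNELS]
new_counts = new_frame[new_base:new_base + CHANNELS]
```

With the original layout all 32 counts end up holding the return address. With one unused word in front, the address lands in the padding and the counts are cleared as intended.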

Well, in fact, I think we had a more descriptive name for it, and of course, it was in TAL, but that's really how we did it. The point I'm trying to make here is that this is definitely not a compiler bug, but many languages have side effects which are by no means intuitively obvious.



$Id: controller-death.php,v 1.3 2012/03/11 21:53:18 grog Exp $