The Best Debugging Story I’ve Ever Heard
Back in the early 80’s, my dad worked at Storage Technology, a now-defunct company that made tape drives and the pneumatic systems that drove the tapes at what were, for the time, high speeds.
They had engineered the tape drives such that you could have one central drive – the ‘A’ drive – connected to seven other ‘B’ drives, and a small operating system in some RAM attached to the A drive would delegate the reading and writing of data across all of the B drives.
Every time you started up the A drive, you had to insert a floppy disk into a peripheral drive connected to the A drive so that the operating system could be loaded into the A drive’s RAM. The operating system was appallingly primitive - it derived its processing power from an 8-bit microcontroller.
The target audience for this sort of thing was corporations with very large data sets - banks, magazines, et cetera - that needed to print huge amounts of address labels or bank statements.
One customer had a problem. In the middle of a print run, one particular A drive would stop working, causing the entire print run to stop. To restore the drive the attendants had to reboot the entire drive - and if this happened in the middle of a six-hour print job, there’d be a ton of expensive computer time lost and the whole operation would fall behind schedule.
So Storage Technology sent out technicians. The technicians, despite their best efforts, could not reproduce the bug in test settings: this bug seemed only to happen in the middle of large print jobs. So, on the off chance that this was a hardware issue, they replaced everything they could - the RAM, the microcontroller, the disk drive, every conceivable part of the tape drive - but the problem kept happening.
So the technicians phoned up headquarters and called in The Expert.
The Expert got a chair and a cup of coffee and sat in the computer room – these were the days when they had rooms specifically dedicated to computers, after all – and watched as the attendants queued up a large print job. He waited until it crashed - which it did. Everybody looked to The Expert – and he didn’t have a clue what was causing it. So he ordered that the job be queued up again, and all the attendants and technicians went back to work.
The Expert sat down in his chair again, waiting for it to crash. It took something like six hours of waiting, but it crashed again. He still had no idea what was causing it, other than the fact that it happened when the room was crowded. He ordered that the job be restarted, and he sat down again and waited.
By the third crash, he had noticed something. The crash occurred when the attendants were changing the tapes on an unrelated drive. And furthermore, he realized that the crash occurred as soon as one of the attendants walked across a certain tile on the floor.
This type of floor was made of aluminum tiles propped up by posts about 6 to 8 inches tall. The massive amount of wires that these computers needed were threaded under the floor tiles so that an unwary attendant wouldn’t trip over a crucial cable. The tiles were put together very tightly so that no debris would fall into the space where the wiring went.
The Expert figured out that one of the aluminum tiles was warped. When an attendant stood on the corner of the warped tile, the edges of the tiles rubbed together. The plastic connecting the tiles rubbed together and produced microsparks, which in turn caused RF interference.
Nowadays, RAM is much more thoroughly shielded from RF interference. But back then, this was not the case. The Expert figured out that the RF interference was corrupting the RAM and, in turn, the operating system.
The Expert called the maintenance office, got a new tile, installed it himself, and the problem went away.