2009-01-09

Tale of the Dueling Bugs

One of the things I love about software is its completely bizarre cause-and-effect chains. I recently found one of these chains when I inherited a data acquisition system. The system consisted of firmware on a custom printed circuit board (PCB) and PC software that reads the data off the board. The firmware had a known defect, so the first thing I did when I inherited it was fix the defect. Most people call defects "bugs," but I do not like the word "bug." Bugs are cute. Defects are not cute.

Anyway, I fixed the firmware problem and all went back to being okay in the world. I no longer saw the communication timeouts and corrupt data caused by the defect. However, I couldn't get the data from the previous experiments out of my mind. The issues with the data didn't quite match the conditions of the defect, which occurred randomly and independently; the data had holes at regular intervals. These multi-day experiments produce a lot of data, and when something looks wrong to me in a lot of data I have a hard time explaining what looks wrong. I have to play around with the data to figure it out. So I graphed it, then I made some histograms, and then I did crazy things like graph every other sample.

Indeed, it turns out there was *another* defect, but this one was in the C++ software. This defect cropped up every half hour, and it caused continuous communication issues, different from the firmware defect. I immediately tried running with the new firmware. The system didn't work at all after half an hour, right when the C++ defect cropped up.

Suddenly what had happened came into focus. The firmware defect occurred randomly, but it was always an isolated incident. After half an hour of standard operation, however, the C++ defect would occur and keep occurring. Communication would be completely messed up until the firmware defect randomly struck, which reset the C++ code, and the system would carry on as normal until the cycle repeated half an hour later.

I asked if anyone had performed a control experiment or a long-running test, and no one had. While I look on this incident as additional evidence for the importance of testing, I think it also shows the importance of running control experiments. If someone had run one, they would have discovered the issues before they ruined a portion of the data.

The person who wrote the software is no longer here, but he was a very hard worker. He worked a lot of really late nights, usually troubleshooting issues with the code. I think this is an example of how a Puritan attitude can be a bad thing in a programmer. Lazy people constantly ask themselves if there is a better way - spending a day thinking about a month's worth of work is reasonable to me. Or maybe he made the right decision, which was to get enough data to graduate and leave the problems to the next guy.