I listen to the Amp Hour podcast. Last week, they were talking about ways engineers fail and about checklists as a way to be more disciplined in avoiding preventable errors.
I try to learn from my mistakes so I do have a checklist of sorts when I reach the “it’s all broken and never, ever going to work again” stage of debugging. Of course, when I get to that point, the symptoms are usually different. Still, some parts of the path are relatively common.
1. Having you tried turning it off and back on again?
It is a joke, a very well seasoned joke. But it is only funny because it is so horribly true. This isn’t a good way to debug a horrible problem but it does provide the circumstances the problem happens under. I often start from scratch to reproduce issues because there is often a clue in the process. So, turning it off and back on again is a way to make sure that I start from a well-understood starting point.
The next questions:
Really? Are you sure?
But these are going to follow most of these checklist questions. As embedded system get more complicated, it is pretty easy to turn a part of the system off but not all of it. I don’t usually go from no-power-to-my-desk, I almost always start with my computer on. But sometimes, turning off all power is necessary (including restarting the debugger).
2. Does it have power?
This is different than the previous in that the “it” refers to all the things that may be broken: the processor, the debugger, the sensor or actuator, the level shifters, everything. This step usually requires a voltmeter (and is related to #5).
3. Is it running the code you think it is?
I sometimes have two or three code bases. If my development environment is pointed to the wrong one, I could easily be compiling over here but loading the image from over there. Or possibly, if the load process is difficult, I may think I’ve loaded the code and somehow missed a step.
I’m working on a big system now, a monitor of another system. When I compile the code, it goes through four machines before it gets to where I can try it out. The possibility that I typed cp instead of scp (or forgot the ending colon!), well, let’s just say it happens more often than I’d like.
This is the reason to have build numbers. But if you are really, really sure it is running the code you think, change something about the output and reload. Make sure you see the change.
4. Did you read the boot and debug output?
I write error logging features and always want a serial debug output. I love power on self tests. However, once the system is working, I stop looking at them. But sometimes, if a cable has loosened or hardware has failed, the system will tell me. But only if I am listening.
5. If it used to work, what changed?
Nothing changed, of course. It just stopped working. My minor code change could not possibly have caused something so catastrophic.
I’ve heard that. I’ve said it. That doesn’t make it true.
If nothing changed, then run the old code. If it fails, well, that’s interesting now, isn’t it? If it succeeds, well then, stop saying nothing changed, something obviously did.
This does require frequent commit to version control, to get back to a last-known-good image. But you were doing that anyway, right? And then you can binary search to determine where the error crept in.
Note that if the code change doesn’t explain it, compare the map files (even the binaries). Realizing that something crossed a boundary may give you inside. Oh, also, the makefile (or project file) may have changed: optimizations can have big ramifications.
6. Is there anything interesting in the map file?
Like many embedded engineers, I find the map file to be a bit of an illegible mess. Happily, I’m not afraid of them anymore. There is an amazing amount of information in the map file. And it provides a different perspective on the code, sometimes it will jog my memory.
7. Can you prove it is hardware?
Of course, at this point, it probably is. But you know hardware engineers, they can be feeble. They need proof. So what kind of proof can you offer? How can you break it down to show it cannot possibly be the software?
Seriously, I have had the privilege of working with some phenomenal hardware engineers. It is seldom hardware (but not never). The process of proving it is hardware is a good part of debugging. Plus, if you can make the difficult error repeatable for the hardware engineer, they’ll probably take you to lunch for making their jobs easier.
8. Can you explain the problem to another engineer?
When I tutored intro to CS, I asked people to explain their problem to a teddy bear outside my office before explaining them to me. I usually listened in. However, at least 50% of the people thanked the bear and left without talking to me. Ok, probably only 20% thanked the bear but most walked away because they never actually needed my help. They needed to get their thoughts in order, to explain it to themselves.
I do that now, talk to myself. Sometimes I try to explain it to a trusted colleague (or a junior engineer) in email, trying to figure out what questions they would ask me so the explanations is really good. Occasionally, I even send the email after all that, if I still can’t figure out the problem.
9. Did you use the single line if again?
When I saw Apple’s goto Fail bug, I completely understood. I avoid unbraced if statements because one month, I tallied up my most common coding mistakes and found that unbraced if statements caused a disproportionate number of my bugs. I vowed never to use them again. Since this is a known failure point on my part, it makes my checklist.