Last week, I talked about self-testing possibilities for embedded software, where the goal was to detect and mitigate the effects of hardware failure. I commented that further work could be done to address the issue of software failure and this would need to wait until a future occasion.
That time has come …
All non-trivial software has bugs. Obviously, well-designed software is likely to have fewer, and the application of modern embedded software development tools can keep them to a minimum. Of course, specific bugs cannot be predicted [otherwise they could be eradicated], but certain types of software problem can be identified and it may be possible to spot a problem before it becomes a disaster.
I would divide such software problems into two broad categories:
- data corruption
- code looping
As a significant amount of embedded code is written in C, developers are likely to be making use of pointers. Used carefully, pointers are a powerful feature of the language, but they are also one of the most common sources of programmer error. Problems with pointer usage are hard to identify statically and the bugs introduced might manifest themselves in subtle ways when the code is executed. Some things, like dereferencing a null pointer, are easily detected on processors where such an access causes a trap. Others are harder, as a pointer could end up pointing just about anywhere – more often than not it will be to a valid address, but, unfortunately, it may not be the correct one. There is little that self-testing code can do about this. There are, however, two special cases of pointer usage where there is a chance: stack overflow and array bound violations.
Stack overflow should not occur, as the stack allocation should be carefully determined and its usage verified during the debug phase. However, it is quite possible to overlook a special situation or make use of a less testable construct [like a recursive function]. A simple solution is to include an extra word at each end of the stack space – “guard words”. These are pre-loaded with a specific value, which is monitored by a self-test task [which may run in the background]. If the value changes, the stack limits have been violated. The value should be chosen carefully. An odd number is best, as that would not represent a valid word-aligned address on most processors. Perhaps 0x55555555. So long as the value is “unlikely” – so not 0x00000001 or 0xffffffff for example – there is only about a 1 in 4 billion chance that an overrun writes exactly the guard value and goes unnoticed.
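The guard word idea can be sketched in a few lines of C. This is only an illustration under stated assumptions: the type `guarded_stack_t`, the functions `stack_guard_init()` and `stack_guard_ok()`, and the size `STACK_WORDS` are hypothetical names, and a real system would normally place the stack and its guards using the linker script or the RTOS configuration rather than a struct.

```c
#include <stdint.h>
#include <stdbool.h>

#define GUARD_VALUE 0x55555555u  /* odd, "unlikely" pattern suggested in the text */
#define STACK_WORDS 256u         /* hypothetical stack size, in 32-bit words */

/* Hypothetical layout: one guard word on each side of the usable stack. */
typedef struct {
    volatile uint32_t low_guard;
    uint32_t          stack[STACK_WORDS];
    volatile uint32_t high_guard;
} guarded_stack_t;

/* Pre-load both guard words; call this before the stack is put to use. */
void stack_guard_init(guarded_stack_t *s)
{
    s->low_guard  = GUARD_VALUE;
    s->high_guard = GUARD_VALUE;
}

/* Return true while both guard words still hold the expected value. */
bool stack_guard_ok(const guarded_stack_t *s)
{
    return (s->low_guard == GUARD_VALUE) && (s->high_guard == GUARD_VALUE);
}
```

A background self-test task would call `stack_guard_ok()` periodically for each stack and raise a fault if it ever returns false. Note that this only catches an overflow after the damage has been done, and only if the overrun actually touches the guard word.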
In some languages, there is built-in detection for addressing outside the bounds of an array, but this introduces a runtime overhead, which may be unwelcome, so it is not implemented in C. Also, it is possible to access array elements using pointers, instead of the [ ] operator, so any checking might be circumvented. The best approach is simply to check for buffer overrun errors by locating a guard word at the end of an array and monitoring it in the same way as for the stack overflow check.
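As a rough sketch of the same technique applied to an array, the buffer and its guard word can be wrapped in one struct so that the guard is located directly after the data. The names `guarded_buf_t`, `guarded_buf_init()`, `guarded_buf_ok()` and `RX_BUF_LEN` are assumptions made for this example.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define BUF_GUARD  0x55555555u  /* same "unlikely" pattern as for the stack */
#define RX_BUF_LEN 16u          /* hypothetical length, a multiple of 4 so
                                   that no padding sits before the guard */

/* Hypothetical wrapper: the guard word directly follows the array. */
typedef struct {
    uint8_t           data[RX_BUF_LEN];
    volatile uint32_t guard;
} guarded_buf_t;

/* Clear the buffer and pre-load the guard word. */
void guarded_buf_init(guarded_buf_t *b)
{
    memset(b->data, 0, sizeof b->data);
    b->guard = BUF_GUARD;
}

/* Return true while the guard word is intact. */
bool guarded_buf_ok(const guarded_buf_t *b)
{
    return b->guard == BUF_GUARD;
}
```

If the array length is not a multiple of the guard word's alignment, the compiler may insert padding bytes between the array and the guard, and a short overrun into that padding would go unnoticed. An overrun that skips past the guard word entirely is not caught either.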
Code should never get stuck in an infinite loop, but a logic error or the non-occurrence of an expected external event might result in code hanging. In any kind of multi-threaded environment – either an RTOS or mainline code with ISRs – it is possible to implement a “watchdog” mechanism. Each task that runs continuously [which might be just the mainline code] needs to “check in” with the watchdog task [which may be a timer ISR] every so often. If a timeout occurs, action needs to be taken. I discussed this matter, from a different perspective, in a blog about user displays a little while ago.
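A minimal software watchdog along these lines might look like the sketch below. It assumes a single-core system in which `watchdog_tick()` is called from a periodic timer ISR and each monitored task calls `watchdog_check_in()` with its own index. All names, the task count and the timeout are hypothetical.

```c
#include <stdint.h>

#define NUM_TASKS     3u    /* hypothetical number of monitored tasks */
#define TIMEOUT_TICKS 100u  /* hypothetical timeout, in timer ticks */

/* Ticks elapsed since each task last checked in (zeroed at startup). */
static volatile uint32_t ticks_since_checkin[NUM_TASKS];

/* Called by a task to report that it is still making progress. */
void watchdog_check_in(uint32_t task_id)
{
    if (task_id < NUM_TASKS) {
        ticks_since_checkin[task_id] = 0u;
    }
}

/* Called once per timer tick. Returns the index of the first task that
   has not checked in for TIMEOUT_TICKS ticks, or -1 if all are healthy. */
int watchdog_tick(void)
{
    int hung = -1;

    for (uint32_t i = 0u; i < NUM_TASKS; i++) {
        if (ticks_since_checkin[i] < TIMEOUT_TICKS) {
            ticks_since_checkin[i]++;  /* saturates at TIMEOUT_TICKS */
        }
        if ((ticks_since_checkin[i] >= TIMEOUT_TICKS) && (hung < 0)) {
            hung = (int)i;
        }
    }
    return hung;
}
```

When `watchdog_tick()` returns a task index, the caller decides what action to take. Many microcontrollers also provide a hardware watchdog timer, and one common arrangement is to refresh it only while this software check reports that every task is healthy.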
So, what is to be done when a stack overflow, array bound violation or hanging task is detected? This depends on the application. It may be necessary to stop the system, sound an alarm of some kind, or simply reset the system. The choice depends on many factors, but broadly the goal is for something better than a crashed system.