ISO 26262…the tale of Transient and Permanent Faults

Introduction

Are you designing to the ISO 26262 standard and trying to decide whether your design is safe from random hardware faults? If so, you are probably wrestling with the safety metrics: PMHF (probabilistic metric for random hardware failures), SPFM (single-point fault metric), and LFM (latent-fault metric). You are also undoubtedly weighing both transient and permanent faults. Both must be analyzed to achieve compliance, but a closer look shows that they differ considerably. This post covers some of the basics; if you want to dive right in, detailed information can be found in the white paper “Similar but different… the tale of Transient and Permanent Faults.”

What are they and Where do they come from?

First, what are the sources of these faults? In an integrated circuit, they include EMI (electromagnetic interference), radiation, electromigration, and shock or vibration. In some cases it is important to know the specific source so that targeted measures can be taken. In most cases, however, it is reasonable to abstract them into two fault models: bit flips for transient faults and stuck-at faults for permanent faults. The sheer size of modern integrated circuits requires this level of abstraction, and it is aligned with the normative and informative guidance in ISO 26262.
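To make the two abstractions concrete, here is a minimal Python sketch. The function names and the 4-bit values are illustrative assumptions and do not come from ISO 26262 or from any tool; the point is only that a bit flip disturbs a value once, while a stuck-at fault affects every value that passes through the faulty bit.

```python
def inject_bit_flip(value: int, bit: int) -> int:
    """Transient fault model: invert a single bit of the stored value once."""
    return value ^ (1 << bit)


def apply_stuck_at(value: int, bit: int, stuck_value: int) -> int:
    """Permanent fault model: one bit always reads as stuck_value (0 or 1)."""
    if stuck_value:
        return value | (1 << bit)
    return value & ~(1 << bit)


# Transient: the stored value is corrupted once, and the next write repairs it.
reg = 0b1010
reg = inject_bit_flip(reg, 0)   # corrupted until the register is rewritten
reg = 0b1010                    # a fresh write restores the correct value

# Permanent: every value written afterwards is observed with bit 3 forced to 0.
written = [0b1010, 0b0000, 0b1111]
observed = [apply_stuck_at(w, 3, 0) for w in written]
print([bin(v) for v in observed])
```

The difference in persistence is the reason the two fault types call for different detection strategies.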

How do we protect against them?

Automotive ICs often contain a blend of safety mechanisms, which are derived from the safety concept, the safety architecture, and the safety requirements, including the ASIL target defined by ISO 26262. Each safety mechanism is unique in its ability to detect permanent and transient failures. When a safety architect decides which safety mechanisms to deploy, these characteristics must be considered to ensure sufficient protection from random failures while also accounting for the impact on power, performance, and area.

Tradeoffs Example 1: A lockstep implementation provides good coverage of both permanent and transient failures, but it costs more than 2X in silicon area and power.
Tradeoffs Example 2: A software test library (STL) has minimal impact on area and power but may affect functional performance, depending on the end application and the required execution frequency. Equally important, an STL provides no coverage for transient failures.
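The lockstep idea from Example 1 can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of any real lockstep core: `alu` is a hypothetical 8-bit adder, and `main_fault` is an illustrative mask used to inject a fault into the main copy only.

```python
def alu(a: int, b: int, fault_mask: int = 0) -> int:
    """Hypothetical 8-bit adder; fault_mask XORs a fault into the result."""
    return ((a + b) & 0xFF) ^ fault_mask


def lockstep_add(a: int, b: int, main_fault: int = 0):
    """Run the same operation on two copies and compare the outputs."""
    main = alu(a, b, main_fault)   # copy that may be hit by a fault
    shadow = alu(a, b)             # redundant copy, assumed fault-free here
    mismatch = main != shadow      # comparator output
    return main, mismatch


print(lockstep_add(3, 4))                   # fault-free cycle
print(lockstep_add(3, 4, main_fault=0x01))  # fault injected into the main copy
```

Because the comparison runs on every operation, the mismatch is raised whether the disturbance is a one-time bit flip or a stuck-at fault. The duplicated `alu` call is also where the area and power cost comes from.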

When it comes to transient failures, additional considerations must be accounted for, such as frequency of update, active vs. inactive time, and logic function. Typically, when focusing on digital logic, an analysis of transient failure rates for memories, flip-flops, latches, and ordinary logic gates would show that memories are always relevant, while ordinary logic gates (ANDs, ORs, etc.) almost never rise to relevancy. This is not to say that logic gates are unaffected by the sources of transient faults, only that their contribution is rarely statistically relevant. Of course, this analysis must still be performed.
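One simplified way to reason about these considerations is to scale a raw per-bit transient failure rate by the fraction of time during which a flipped bit can actually affect the function. The sketch below assumes that simple multiplicative model. Every number in it (the per-bit FIT value, the bit counts, and the vulnerable fractions) is hypothetical and chosen only for illustration; real values come from technology data and the actual safety analysis.

```python
def effective_transient_fit(raw_fit_per_bit: float, bits: int,
                            vulnerable_fraction: float) -> float:
    """Raw transient failure rate scaled by the fraction of time a flip matters."""
    return raw_fit_per_bit * bits * vulnerable_fraction


RAW_FIT_PER_BIT = 1e-4  # hypothetical value, not taken from any datasheet

# Buffer overwritten almost continuously: a flip rarely survives long enough to matter.
frame_buffer = effective_transient_fit(RAW_FIT_PER_BIT, bits=1024,
                                       vulnerable_fraction=0.01)

# Static configuration register: a flip persists until reset, so it is always exposed.
config_register = effective_transient_fit(RAW_FIT_PER_BIT, bits=32,
                                          vulnerable_fraction=1.0)

print(frame_buffer, config_register)
```

With these illustrative inputs, the small static register contributes more than the much larger buffer, which is why update frequency and active time belong in the analysis alongside raw bit counts.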

Transient Considerations Example 1: A register bank that buffers data frames and is overwritten on every clock will often not rise to a level where the safety of the overall function is compromised. This is because a bit flip caused by a transient failure on a flop within that bank is overwritten on the next clock cycle.
Transient Considerations Example 2: A watchdog timer (WDT) has various configuration registers for use by the CPU it protects, including an enable/disable bit, a refresh setting, and the length of time within which a refresh is expected. Technically the WDT is a safety mechanism, and testing it with LBIST during power-on self-test is an expected and reasonable way to detect permanent faults. However, it may also be prudent to protect the control registers and timers against transient faults. For instance, suppose the WDT defaults to disabled so that WDT timeouts cannot disrupt the boot process. Once boot is complete, the WDT is enabled and that register is never accessed again, so it becomes static and susceptible to transient faults. A transient fault there could prevent the WDT from operating and thus remove the protection it provides.
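One common way to protect a static register such as the WDT enable bit is to store an inverted shadow copy and check the pair periodically. The sketch below is only an illustration of that idea. The `ProtectedRegister` class, its method names, and the 8-bit width are assumptions made for this example and do not describe any specific WDT implementation.

```python
class ProtectedRegister:
    """Register stored together with a bitwise-inverted shadow copy."""

    def __init__(self, value: int, width: int = 8):
        self.mask = (1 << width) - 1
        self.write(value)

    def write(self, value: int) -> None:
        """A legitimate write updates both the value and its inverted shadow."""
        self.value = value & self.mask
        self.shadow = ~value & self.mask

    def flip_bit(self, bit: int) -> None:
        """Model a transient fault hitting the primary copy only."""
        self.value ^= 1 << bit

    def is_consistent(self) -> bool:
        """Value and shadow must be exact bitwise complements."""
        return (self.value ^ self.shadow) == self.mask


wdt_enable = ProtectedRegister(0x01)   # WDT enabled after boot, then left static
print(wdt_enable.is_consistent())      # pair still matches

wdt_enable.flip_bit(0)                 # transient fault clears the enable bit
print(wdt_enable.value, wdt_enable.is_consistent())
```

A periodic consistency check like this turns a silent loss of the watchdog into a detectable error, at the cost of one extra copy of the register and a small comparator.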

The examples above only scratch the surface of the questions that must be asked, and the analysis that must be performed, when deciding which safety mechanisms to include in a design.

Want to learn more?

As mentioned at the beginning, we dug into this topic in detail, and the write-up can be found in the white paper titled “Similar but different… the tale of Transient and Permanent Faults.”

Siemens EDA is the market leader in automation tools and provides tailored automation and services to help guide project teams through the complex task of Safety Analysis. If you’d like to learn more about Siemens solutions, visit:

Conclusion

ISO 26262 requires that analysis be performed for both permanent and transient failures. On the surface the two have many similarities, but a closer look reveals important differences that must be understood. A safety analysis methodology should carefully consider the different technologies and fault types to identify the most efficient safety architecture.

This article first appeared on the Siemens Digital Industries Software blog at https://blogs.sw.siemens.com/verificationhorizons/2022/10/27/similar-but-different-the-tale-of-transient-and-permanent-faults-in-iso-26262/