Embedded systems and slipstreaming


A story that crops up on several websites talking about embedded systems is that two apparently identical devices from the same vendor were tested, and one (but not both) of them failed. Is this possible?

Well, yes it is. Not all copies of the same device are manufactured in the same production run. In the JIT world, production runs tend to be fairly small, since there are no longer any warehouses in the pipeline. If problems are found with devices from earlier runs, they get corrected in later runs. Mostly these are minor problems that don't justify giving the next run a new model designation.

If the device has firmware in it (most do), bug fixes often aren't considered release revisions. New features in the software will get a new revision number, but bug fixes won't. Sometimes this bug-fix process will be tracked, and sometimes not. Since there are eddies in the supply pipeline, sometimes the best the vendor can do is to say "If you bought it after date X, it's probably OK."

This process is called slipstreaming: making minor changes to an existing product with each production run. I don't know for certain, but I'd be amazed if the latest fixes to Windows 98 weren't folded into each run, with no visible or published way to tell which changes went into which run.
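To make that concrete, here's a rough sketch of why a revision string alone can't tell you which slipstreamed fixes a unit carries. The model name, the cutoff date, and the whole ID layout are made up for illustration; real vendors each do this their own way.

    #include <stdio.h>

    /* Two units with the same model and the same firmware revision;
     * only the build date distinguishes the run that picked up the
     * date-logic fix. */
    struct firmware_id {
        char model[16];       /* e.g. "PLC-200" (hypothetical)       */
        char revision[8];     /* unchanged by slipstreamed bug fixes */
        int  build_yyyymmdd;  /* differs from run to run             */
    };

    /* Hypothetical vendor guidance: units built on or after this
     * date include the century fix ("if you bought it after date X"). */
    #define FIX_CUTOFF 19981001

    int probably_fixed(const struct firmware_id *fw)
    {
        return fw->build_yyyymmdd >= FIX_CUTOFF;
    }

    int main(void)
    {
        struct firmware_id a = { "PLC-200", "2.1", 19980612 };
        struct firmware_id b = { "PLC-200", "2.1", 19981120 };

        /* Same model, same revision, different answers. */
        printf("unit A probably fixed? %d\n", probably_fixed(&a));  /* 0 */
        printf("unit B probably fixed? %d\n", probably_fixed(&b));  /* 1 */
        return 0;
    }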

The addition of date logic where none existed before will almost always be considered a revision change. The correction of date logic to handle the century change might not. In theory, *all* devices that use dates should be individually tested, even if you bought them all from the same place at the same time. In practice, this is just too much, and we have to take our chances (which involves very focused conversations with the vendor, and even the firmware author if necessary and possible).
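For anyone who hasn't seen what "correcting date logic" actually amounts to, here's a minimal sketch in C. The pivot year of 70 is just one common choice; the point is that the fix is a few lines that are easy to slipstream in without a revision bump.

    #include <stdio.h>

    /* Broken: treats the two-digit year literally, so an interval
     * that spans the rollover comes out hugely negative. */
    int years_elapsed_broken(int yy_start, int yy_now)
    {
        return yy_now - yy_start;            /* 00 - 99 == -99 */
    }

    /* Fixed: window the two-digit year; values below 70 are taken
     * as 20xx, the rest as 19xx. */
    int to_four_digits(int yy)
    {
        return (yy < 70) ? 2000 + yy : 1900 + yy;
    }

    int years_elapsed_fixed(int yy_start, int yy_now)
    {
        return to_four_digits(yy_now) - to_four_digits(yy_start);
    }

    int main(void)
    {
        printf("broken: %d\n", years_elapsed_broken(99, 0));  /* -99 */
        printf("fixed:  %d\n", years_elapsed_fixed(99, 0));   /*   1 */
        return 0;
    }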

This issue should arise vanishingly rarely in real life, especially with so much attention being focused on it by everyone. But some will slip through, and as they say in the data sheets, "the consequences are undefined."

-- Flint (flintc@mindspring.com), March 14, 1999

Answers

"In theory, *all* devices that use dates should be individually tested, even if you bought them all from the same place at the same time. In practice, this is just too much and we need to take our chances"

This is what bothers me about the "everything looks fine" reports we get from various industries that say they found few problems with their embedded chips or chip systems. Could they really have checked a majority of the critical systems for possible date errors? Do all of those companies have one of them "super geeks" to thoroughly check the devices? Have any of them said what percentage of embedded chips/systems they checked and the percentage that had problems? How many failures would it take to shut down the whole production system for a lengthy period--one percent or 25%? I think the reports are often sincere, but I still doubt that their investigations were as thorough as they probably should have been. I don't see the justification yet for the problems to be considered "vanishingly rare". I would feel much more comfortable about it if I knew how they went about checking their systems. I assume they "sampled" some percentage of them because they can't test all of them, and from what's been said before, even if they did test them, won't that process itself potentially create a date-related error where one otherwise wouldn't have occurred? What kind of real evidence or data should we expect to see before we can be reasonably satisfied that the embedded chip problem will likely be insignificant in an industry (not just a particular company)? And if one firm in an industry (oil, chemicals, water treatment) reports few embedded chip problems, does that mean we can assume that essentially all other companies in the industry will have the same low probability of failures?

-- bdb (cb_rex99@hotmail.com), March 14, 1999.


What I meant was that finding two apparently identical devices with differing compliance is vanishingly rare. Finding devices that use dates incorrectly isn't rare at all. I should mention that errors in date *usage* are software errors, even if that software is burned into a PROM chip. Nothing is wrong with the clock chips themselves.
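Here's a sketch of where the fault actually lives. A typical RTC chip keeps a two-digit BCD year and rolls it from 99 to 00 right on schedule; the century is an assumption the firmware adds. The register value and the hard-wired "19" below are illustrative, not any particular chip's datasheet.

    #include <stdio.h>

    static int bcd_to_int(unsigned char bcd)
    {
        return (bcd >> 4) * 10 + (bcd & 0x0F);
    }

    int main(void)
    {
        unsigned char year_reg = 0x00;  /* what the chip reports on 2000-01-01 */

        /* The chip did its job: the two-digit year is correct. */
        printf("chip's two-digit year: %02d\n", bcd_to_int(year_reg));

        /* The bug is this line of *software*, whether it lives on disk
         * or is burned into a PROM. */
        int full_year = 1900 + bcd_to_int(year_reg);
        printf("firmware's idea of the year: %d\n", full_year);   /* 1900 */
        return 0;
    }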

As for what percentage of devices would need to be noncompliant before it's a problem, that question really isn't meaningful. There are two directions from which to attack the embedded testing issue: does the device keep real time synchronized with the calendar, and what exactly does this device do within the system?
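The first direction is the one you can put on a bench. Roughly, the test looks like the sketch below; the rtc_set/rtc_read/advance hooks just simulate a device here so the sketch runs, and on a real unit you'd use whatever access the vendor actually provides. The second direction, what the device does within the system, isn't something you can capture in a snippet; that's a walkdown of the process, not the part.

    #include <stdio.h>

    struct datetime { int year, month, day, hour, min, sec; };

    /* Hypothetical stand-ins for real device access, simulated so
     * the sketch is self-contained. */
    static struct datetime fake_rtc;
    static void rtc_set(const struct datetime *dt) { fake_rtc = *dt; }
    static void rtc_read(struct datetime *dt)      { *dt = fake_rtc; }
    static void advance_past_midnight(void)
    {
        /* Simulate a compliant unit rolling 1999-12-31 into 2000-01-01. */
        fake_rtc = (struct datetime){ 2000, 1, 1, 0, 0, 10 };
    }

    static int rollover_test(void)
    {
        struct datetime start = { 1999, 12, 31, 23, 59, 50 };
        struct datetime after;

        rtc_set(&start);
        advance_past_midnight();
        rtc_read(&after);

        /* Fail if the unit lands in 1900, "19100", or anywhere but 2000. */
        return after.year == 2000 && after.month == 1 && after.day == 1;
    }

    int main(void)
    {
        printf("rollover test: %s\n", rollover_test() ? "pass" : "FAIL");
        return 0;
    }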

Working along these two directions, you start your search by identifying components whose failure would pose serious functional and/or safety problems. As a rather silly example, you really don't care much about a total permanent failure of the office coffee machine. Problems with elevators and security systems are minor and easily worked around in the short term. Problems with building systems are more serious: HVAC failures, for example, can make a building unlivable for quite a while.

Moving up the food chain, functional production issues are more important still. If the assembly lines won't run or other problems are found at that level, you're basically out of business until you get them fixed.

Safety is probably at the top, although safety can overlap mission functionality. A big explosion is unsafe and also bad for production. Maintenance systems are nearly as important, especially if proper daily maintenance is an ongoing battle. Someone at euy2k.com spoke of noncompliant vibration detectors for flywheels at utilities. These can be important if you have flywheel bearing problems, because these wheels are monsters, and if one breaks loose you basically rebuild the entire facility (and bury the dead).
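To give a feel for the triage, here's a toy sketch: rank everything in the inventory by what its failure would cost, then work the list from the top. The categories and the example devices are purely illustrative, not anyone's actual inventory.

    #include <stdio.h>
    #include <stdlib.h>

    enum impact {             /* lowest to highest priority      */
        NUISANCE,             /* office coffee machine           */
        WORKAROUND,           /* elevators, security systems     */
        BUILDING,             /* HVAC                            */
        PRODUCTION,           /* assembly line controllers       */
        SAFETY                /* vibration detectors, interlocks */
    };

    struct device { const char *name; enum impact impact; };

    static int by_priority(const void *a, const void *b)
    {
        return (int)((const struct device *)b)->impact -
               (int)((const struct device *)a)->impact;
    }

    int main(void)
    {
        struct device inventory[] = {
            { "office coffee machine",      NUISANCE   },
            { "flywheel vibration monitor", SAFETY     },
            { "HVAC controller",            BUILDING   },
            { "badge reader",               WORKAROUND },
            { "line PLC",                   PRODUCTION },
        };
        size_t n = sizeof inventory / sizeof inventory[0];

        qsort(inventory, n, sizeof inventory[0], by_priority);

        /* Test in this order; you may decide never to bother with
         * the bottom of the list. */
        for (size_t i = 0; i < n; i++)
            printf("%d. %s\n", (int)(i + 1), inventory[i].name);
        return 0;
    }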

All in all, it's not quite so bad as finding needles in haystacks. There's a definite hierarchy of priorities, and the potential impacts of noncompliances are usually pretty clear. I'm pretty sure Chevron is saying that they've decided not to even bother checking devices whose complete failure wouldn't be material to operations.

On the other hand, I'm also pretty sure that some important systems will be forgotten, or misdiagnosed. I expect the number of things that go BOOM to be well above the normal failure rates.

-- Flint (flintc@mindspring.com), March 14, 1999.


Flint;

Thanks for the additional clarification.

Regards,

-- Watchful (seethesea@msn.com), March 14, 1999.

