Embedded Chip Problems

greenspun.com : LUSENET : TimeBomb 2000 (Y2000) : One Thread

I don't know if these problems with embedded chips have been posted before, but I found them interesting. (Many more listings at the site.)

http://www.iee.org.uk/2000risk/Casebook/eg_index.htm

The Millennium Problem in Embedded Systems

Embedded Systems Fault Casebook (May 1999)

EXAMPLE NO EG-42 Equipment Type: SCADA Industry Sector: Manufacturing PC or Computer based: No System Age: 2

Application: Monitoring of high frequency welding equipment

Description of the Problem: All data logging after Jan 1 2000 would be erased as "old" data.

How was it Identified: Information from Website.

What was the Solution: A software patch is available and will be installed by original supplier of equipment. This original supplier had been unaware of the problem and consequently will need to fix several hundred similar systems worldwide.

Consequences for the SYSTEM: System Stops

Consequences of failure to the BUSINESS: Loss of historical trending data and traceability for QA.

Other: This emphasises the importance of not taking information provided by vendors at face value, but of continually revisiting manufacturers for updates.
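
The EG-42 report does not say how the logger decides that data is "old". One plausible mechanism, sketched here in Python purely as an assumption and not as the vendor's actual logic, is a retention check that compares two-digit years: once the clock rolls over to "00", every new record looks older than the cutoff. The function names and the retention rule are hypothetical.

```python
def is_old_two_digit(record_yy: int, cutoff_yy: int) -> bool:
    """Hypothetical purge rule working on two-digit years only."""
    return record_yy < cutoff_yy


def is_old_four_digit(record_year: int, cutoff_year: int) -> bool:
    """The same rule with the century kept."""
    return record_year < cutoff_year


# Assumed retention cutoff: keep everything logged since 1998.
print(is_old_two_digit(99, 98))        # a 1999 record is kept
print(is_old_two_digit(0, 98))         # a 2000 record ("00") is flagged as old
print(is_old_four_digit(2000, 1998))   # with four digits it is kept
```

If something like this is what happened, the fix is a software change to the comparison, which is consistent with the report's "software patch".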

EXAMPLE NO EG-44 Equipment Type: SCADA Industry Sector: Manufacturing PC or Computer based: No

Application: SCADA system which provided an overview of the operation of some 250 systems in a manufacturing plant.

Description of the Problem: The system failed during the power down rollover test such that it would not start up again when power was re-applied. The system backup was restored and the system successfully re-initialised.

How was it Identified: The problem was identified during diligence testing, with the vendor present, to prove compliance of a system that the vendor had claimed was compliant. The test it failed on was the power down rollover test (i.e. letting the system roll over from 31st December 1999 to 1st January 2000 with the main power removed). The system would not start up again when power was re-applied.

What was the Solution: The problem was rectified immediately by the vendors. Software modifications were made.

Consequences for the SYSTEM: System Stops

Consequences of failure to the BUSINESS: If this failure happens on restarting the system after the millennium shutdown, and the system backup is not readily available to be restored, then this will result in significant downtime.

Other: Failure under test can be as serious an issue as year 2000 failure itself, and so adequate test planning which includes proven recovery procedures is essential.

EXAMPLE NO EG-80 Equipment Type: SCADA Industry Sector: Manufacturing PC or Computer based: Yes System Age: 4

Application: Operation & control of gas flare stack

Description of the Problem:

How was it Identified: Audit by supplier identified that version of UNIX was not compliant

What was the Solution: This system is to be relocated/modified in Year 2000. Decision made to roll date back 8 years (non compliance date associated) and not spend money to fix. 8 years was to keep leap year in line if system was not immediately replaced.

Consequences for the SYSTEM: System Stops

Consequences of failure to the BUSINESS: Failure to open the flare stack could cause over pressurisation of gas distribution system including gas holder if gas manufacturing units could not be reduced especially if consumers were to fail due to Y2K problems.
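
The 8-year rollback in EG-80 can be checked with a few lines of Python. This is a minimal sketch, assuming the only things that matter are whether the substitute year is a leap year and whether its weekdays line up with the real year. The helper name is made up, and the 28-year offset is included only for comparison; it is not mentioned in the report.

```python
import calendar
from datetime import date


def rollback_report(real_year: int, years_back: int) -> dict:
    """Compare a real year with the substitute year the clock would be set to."""
    fake_year = real_year - years_back
    return {
        "fake_year": fake_year,
        # Do both years have (or both lack) a 29 February?
        "leap_matches": calendar.isleap(real_year) == calendar.isleap(fake_year),
        # Does 1 January fall on the same weekday in both years?
        "weekday_matches": date(real_year, 1, 1).weekday()
        == date(fake_year, 1, 1).weekday(),
    }


print(rollback_report(2000, 8))    # 1992: leap year matches, weekdays do not
print(rollback_report(2000, 28))   # 1972: leap year and weekdays both match
```

So an 8-year rollback does keep the leap year in line, as the report says, but any logic that depends on the day of the week would see the wrong day.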

EXAMPLE NO EG-82 Equipment Type: SCADA Industry Sector: Manufacturing PC or Computer based: Yes System Age: 5 years

Application: Windows based SCADA system for chemical handling plant

Description of the Problem: The first test was a BIOS rollover test with (unfortunately) the SCADA package running. A failure occurred, causing a `lockup' with a complete lack of response. A manual trip was initiated, however not before an acid spill occurred resulting in a minor environmental problem and a major safety incident. Unfortunately, the company's personnel failed to realise the implications of conducting such a test on an operational plant. Another test was performed some weeks later on a later version, this time with properly prepared test plans and plant personnel awareness. One function of the tests was for 09-09-99, 09-09-1999 and 99-99-9999. These `dates' resulted in the software failing to execute. Note that 99 is often used as end of file indicator.

How was it Identified: See above

What was the Solution: Replacement

Consequences for the SYSTEM: System Stops

Consequences of failure to the BUSINESS: Dangerous chemical spill

Other: Testing live systems requires careful forethought and preparation. Always back up data first.

EXAMPLE NO EG-28 Equipment Type: OTHER Industry Sector: Rail Transport PC or Computer based: No

Application: System used for voice and data communications between train drivers and signallers.

Description of the Problem: Before updating the time, the management processor sets all of its internal registers to zero, and monitors the status of them afterwards. If the status of one or more registers is still zero, this is interpreted as message not received. The processor will await the arrival of a valid signal before updating the time and date. So, effectively, it will cease to function for one year, then resume normal operation on 01/01/2001.

How was it Identified: Discussions with the users and then structured interview with the equipment manufacturer. The manufacturer was unable to answer all questions satisfactorily and during follow-up work discovered the error.

What was the Solution: The equipment manufacturer must provide a software upgrade.

Consequences for the SYSTEM: System Stops

Consequences of failure to the BUSINESS: If information gets out of sequence, chaos will ensue. Train delays will occur, and there will be increased risk of rail accidents. The cost of this could be considerable. There will also be regulatory problems as, in the event of an emergency, logs and sequencing information are needed for post-incident enquiries.
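
The EG-28 mechanism, where a zero register means "message not received", can be illustrated with a short Python sketch. The register names and the update routine are assumptions made for illustration; the report describes only the zero-then-check behaviour.

```python
def accept_time_update(registers: dict) -> bool:
    """Hypothetical check: any register still at zero means no message arrived."""
    return all(value != 0 for value in registers.values())


def receive(yy: int, month: int, day: int) -> bool:
    """Clear the registers, load a time message, then apply the zero check."""
    registers = {"yy": 0, "mm": 0, "dd": 0}   # cleared before each update
    registers.update(yy=yy, mm=month, dd=day)
    return accept_time_update(registers)


print(receive(99, 12, 31))   # accepted
print(receive(0, 1, 1))      # rejected: year "00" looks like an empty register
print(receive(1, 1, 1))      # accepted again once the year is "01"
```

Under these assumptions every update during 2000 is discarded, which matches the report's description of a system that stops for one year and resumes on 01/01/2001.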

EXAMPLE NO EG-72 Equipment Type: OTHER Industry Sector: Rail Transport PC or Computer based: Yes

Application: Train Describer. Provides information about locations of train services to signallers and other rail staff. The output is used to update a train position model which places each train's unique description (headcode) on a schematic representation of the rail network.

Description of the Problem: The train describer computer is non-compliant. The system software clock stores date with a 2-digit year, however it prints dates with 4-digit years by using the prefix '19' in output to the printer. On start up the system will accept dates in the range 10/10/84 to 31/12/99. Dates outside this range will be rejected as invalid. The system will rollover to '00' if left powered up. However, in the event of a failure it will not be possible to restart the system and enter the correct date after 31/12/99. The day of the week will be incorrectly calculated after the 2000 leap day, so the describer may refer to the wrong timetable.

How was it Identified: Investigation steps included: Site visits; Discussions with suppliers, including reference to documentation; Inspection of source code for the system software real time clock i.e. RTIME and DATE.

What was the Solution: Modifications are possible to make the system compliant. However, it is reasonably old and difficult to maintain, so replacement is preferable.

Consequences for the SYSTEM: System Stops

Consequences of failure to the BUSINESS: Loss of reputation. Customers claiming refunds on tickets where trains are running at incorrect times, late or not at all.
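
The restart problem in EG-72 follows directly from the start-up date check. Below is a minimal Python sketch, assuming the date is entered as DD/MM/YY and the century is always taken as 19; the function name and the parsing are hypothetical, while the accepted range is the one given in the report.

```python
from datetime import date

EARLIEST = date(1984, 10, 10)   # range limits quoted in the report
LATEST = date(1999, 12, 31)


def parse_startup_date(text: str) -> date:
    """Parse DD/MM/YY with a fixed '19' century and apply the range check."""
    day, month, yy = (int(part) for part in text.split("/"))
    candidate = date(1900 + yy, month, day)
    if not (EARLIEST <= candidate <= LATEST):
        raise ValueError(f"date {text} rejected as invalid")
    return candidate


print(parse_startup_date("31/12/99"))   # accepted

try:
    parse_startup_date("01/01/00")      # read as 1900, outside the range
except ValueError as err:
    print(err)
```

A system left running can tick over to "00" without ever passing through this check, which is why the report says it survives the rollover but cannot be restarted afterwards.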

EXAMPLE NO EG-27 Equipment Type: OTHER Industry Sector: Rail Transport PC or Computer based: No

Application: Vibration monitoring on rail network. If train wheels have a flat spot, or axles are damaged, the rail will vibrate due to the uneven load distribution. These vibrations are detected by monitors on the rails, allowing faults to be identified.

Description of the Problem: There are two models of this system: Mark 1 will fail to operate completely after 09/09/99 due to the fact that 999 was used as an end of file marker; Mark 2 will operate until the end of 1999, but its internal clock will fail to rollover into the next century.

How was it Identified: Supplier information was referred to initially, rollover testing was then carried out.

What was the Solution: Both systems were rolled back to determine which, if any, of the previous leap years it would be possible to use. The Mark 2 systems cannot be rolled back to a date prior to system installation, for example 1996. Mark 1 systems can be rolled back to any date, but will fail again once their internal clock reaches 9/99.

Consequences for the SYSTEM: Erroneous Result

Consequences of failure to the BUSINESS: Catastrophic if a problem, which subsequently leads to an accident, cannot be identified.
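
The Mark 1 failure in EG-27 is a sentinel collision. The report says only that 999 was the end-of-file marker and that the system fails at 9/99; how the date was actually stored is not stated. The Python sketch below assumes, for illustration only, that month and two-digit year were packed into one number.

```python
END_OF_FILE = 999   # sentinel value named in the report


def pack_month_year(month: int, yy: int) -> int:
    """Hypothetical encoding: month followed by two-digit year, e.g. 8/99 -> 899."""
    return month * 100 + yy


def read_records(stream):
    """Return the values that appear before the first end-of-file marker."""
    records = []
    for value in stream:
        if value == END_OF_FILE:
            break
        records.append(value)
    return records


stream = [
    pack_month_year(7, 99),
    pack_month_year(8, 99),
    pack_month_year(9, 99),    # packs to 999, same as the sentinel
    pack_month_year(10, 99),
    END_OF_FILE,
]
print(read_records(stream))    # stops at September, October is never read
```

With an encoding like this, rolling the clock back only postpones the problem: the reader stops again whenever the stored value reaches 999, as the solution notes for Mark 1.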

-- (TooShy@ToSay.com), June 21, 1999

Answers

Oh dear, just when the pollys are SO adamant that such things do not really happen. OK, let's try the usual polly approach:

1) This is old information. [Oops, "MAY 1999", scratch that.]

2) This is just hype from people out to get money from Y2K work. ["IEE" - Institution of Electrical Engineers. Uhh, never mind, scratch that one.]

3) Any of these problems could have been fixed immediately with the fix-on-failure approach. ["Catastrophic"! "Cannot be identified"!! "Upgrade must be applied from software vendor"!!! Holy cow. Scratch this one, too....]

4) Not a single serial number has been presented on any of these items. Therefore, as according to the Paul Davis / Stephen Poole methodology for Y2K embedded chips, THIS ENTIRE LITANY OF PROBLEMS IS IN FACT NOT A PROBLEM AT ALL!!!! [Whew! Isn't polly weaseling wonderful?]

[And all you silly doomers out there: If you feel like arguing about this, do yourselves a favor and DO NOT BRING UP THE SEWAGE SPILL INCIDENT. No serial numbers there either, ha-ha-ha.]

-- King of Spain (madrid@aol.com), June 21, 1999.

Dan the Power Man isn't worried, why should I be? What is the puny estimated failure rate, anyway? So what if just ONE system will be a showstopper? Who cares if it's located at the bottom of the ocean, Flint has flippers. How many of these things are there in the world....you know, the real world? The Phone Companies have found all of them, right? The Power Companies aren't concerned, isn't that so? The oil industries aren't worried, or are they? Are there *honestly* any embedded systems where YOU work? Just don't panic, we're working on it really really hard and very, very fast. Now, don't you feel better? Good. Let's put that cash of yours back in the bank, OK?

-- Will continue (farming@home.com), June 21, 1999.

Where's dIETER? dIETER? Are you there? Comments? Are you asleep?

:)

-- FM (vidprof@aol.com), June 21, 1999.


The good news: (1) These errors have been found. (2) In most cases, there is a known fix, that can be applied...

The bad news: (1) Obviously, not all cases have been found. (2) In some cases there are not known fixes. (3) Fixes, even if known, will not be applied in all cases...

-- Mad Monk (madmonk@hawaiian.net), June 22, 1999.


Kool, something to add to our Y2K List of Failures - Part 4.

Where are you, Rob? <:)=

-- Sysman (y2kboard@yahoo.com), June 22, 1999.



I have never said there will be no failures. Flint has never said there will be no failures. Y2Kpro has never said there will be no failures. Stephen Poole has never said there will be no failures. Neither SuperPolly nor CPR have ever said there will be NO failures. I am getting really tired of this constant whine - 'look, a failure, that PROVES the "pollys" are idiots'. The whole bloody LOT of us have said "THERE IS ENOUGH TESTING AND REMEDIATION BEING DONE TO MAKE PREPARATIONS FOR A 3 TO 6 MONTHS LONG PERIOD WITHOUT POWER OR FOOD DELIVERIES UNNECESSARY - 2 OR 3 WEEKS IS MORE THAN ENOUGH - IT WILL NOT BE TEOTW".

And incidentally, doesn't the post above show that these places ARE taking Y2K seriously? Isn't the 'NO ONE IS FIXING IT BECAUSE THE BOSS DOESN'T UNDERSTAND IT AND IT CAN'T BE FIXED ANYWAY' the true mainstay of doomer thought? Do you expect the outfits above to keep right on trucking through the rollover or not? Seems more like one for the 'polly' side to me.

-- Paul Davis (davisp1953@yahoo.com), June 22, 1999.


Paul Davis ... never mind. I won't comment. Your post is perfect just as it is. ANYONE, as we approach July 1999, who is willing to accept that baloney, deserves just what they will get.

(Gee, Paul, who is "CPR"? Never seen a post from that entity here. Friend of yours?)

-- King of Spain (madrid@al.com), June 22, 1999.

whoooooie. A lot of huffin' and puffin' in that post of yours, Paul. Here.....sit down a minute and breathe deeeeeeply, deeeeply, Paul. There you go now, it's gonna be alllllright. (has anyone found the NUK yet???)

-- Will continue (farming@home.com), June 22, 1999.

Spain forgot one: These problems were easily identified and fixed. Wait a minute ..., trying to think of clever thing to say ..., Oops! Pay no attention to that man behind the curtain!

-- cd (artful@dodger.com), June 22, 1999.

Not that it will matter to you doomers, but since the subject is similar to a discussion in the euy2k forum and for the few who are independent, open minded, and critical thinkers (i.e., truly want the facts), I will repost the following post I made in that forum regarding the IEE reports:

I was about to clarify IEEE vs. IEE, but you beat me to it, Rick. As far as Y2K, IEEE wasn't even in the ballgame, a disappointment to me as a member. IEE has done a better, but not very much better, job. IEE repeats many of the common fallacies of the run-of-the-mill Y2K sites and embedded systems white papers. The embedded system failure reports are worth a read though, just to get a flavor of y2k bugs in embedded systems. I urge others to take a critical and objective look at the failures reported in the IEE index. I want to caution, however, that from reading over these reports, a number of the "potential" catastrophic failure reports are similarly written, and may perhaps be the work of an overzealous company hyping its Y2K work, or perhaps confusion by this company as to how to interpret the "Consequences of failure to the Business" part of the survey (i.e., does this refer to consequences of the Y2K failure, or of a system failure from any cause?). Examples: EG-54 appears to be a typical minor date problem, yet the "Consequences of failure to the Business" are written as though the equipment might fail. EG-52 says in one place that it will continue to work properly through the millennium unless powered down and restarted, in which case the date will be wrong. Consequences are "system stops"??? "Potential failure of air conditioning/ heating system, security systems etc."?? This report isn't even consistent with itself. My favorite here is EG-67 - the ultimate in hype!

In comparison, EG-49 is a well written and brief report that clearly addresses the consequences of the minor Y2K bug, not of a "system" failure due to "any cause". EG-47 may also be credible since actual testing was performed, assuming that the testing was proper (it often is not, and testing methodology can induce artificial failure mechanisms). A number of other reports here also appear credible since details are supplied in some cases. But without a listed source, equipment model information and the like, I would hesitate to use this information for anything other than getting familiar with the types of y2k bugs in embedded systems. You may even want to exercise the same healthy skepticism you use when reading information provided by utility industry insiders ;)

Regards,

-- FactFinder (FactFinder@bzn.com), June 20, 1999.

---------------------------------------------------------------------- ----------

-- FactFinder (FactFinder@bzn.com), June 22, 1999.



There seems to be a lot of strawman-bashing going on here. If you shoot yourself in the foot, it doesn't matter how loud you shout "I got you!" So let's find a bit of perspective here.

1) There are errors in embedded systems. Yes, there really are. Testing has shown that these are less common than feared, yet too common for comfort. Which doesn't change the fact that a great deal of testing and fixing is really being done. Nobody can say how much of the potential danger has been averted in time, but it's very obviously a great deal less than zero.

2) These are test reports. Clearly, testing is being done and problems are being identified and addressed. And a careful reading of this material shows that some of these corrections are widely applicable. If a system is common to 250 locations, the testing and fix need to be done only once, while the implementation of the fix must be done 250 times. Believe me, it's a lot faster, easier and cheaper when a vendor *notifies* you of a problem and sends you a fix, than for you to have to find needles in the haystack all by yourself.

3) It should also be clear that we simply aren't going to find all these problems beforehand. Many will strike. Most of these 'business impacts' are not quantifiable, since they're too contingent on indirect factors (exactly what's going on at the time of failure, use to which the business puts the system, implementation details, availability of workarounds, collateral damages, on and on). I think we can be pretty confident that there will be thousands of failures of varying effective magnitudes in embedded systems.

4) Most of these are monitoring systems. This makes sense, since such systems keep logs and records, and hence use dates. By and large, the systems *being* monitored continue to work just fine. Loss of monitoring systems gives more time to repair, and less immediate impact, than loss of the systems being monitored.

5) Attempts to mischaracterize such reports aren't helpful. We knew there were bugs, we tested, we found some (perhaps even most), we fixed them, we documented them. Normal procedure. Using this growing record of success as 'proof' of coming failure is Alice-in-Wonderland logic. Even George Orwell couldn't go so far as to say success=failure.

Conclusion: There will be scattered problems of varying importance everywhere. A very hectic couple of weeks, and probably a few really newsworthy disasters. But from all indications, economic impact will be difficult to extract from the general noise level. Nothing here lends any credence to the notion that these failures, taken all together, will lead to any wide scale infrastructure breakdown.

-- Flint (flintc@mindspring.com), June 22, 1999.


And yet, nothing that would suggest that such widespread infrastructure breakdowns are not possible, either. (You know, no electricity, no telecommunications, no clean water, but LOTS of sewage bubbling!)

Farfetched? Well, here it is with just a tad over six months until the big event, and no utility is stating that they are ready for Y2K -- they are all just still working-on-it-real-hard. If one must err, shouldn't it be on the side of caution? You know, as in "hope for the best, prepare for the worst"?

-- King of Spain (madrid@aol.com), June 22, 1999.

Chips do not have Y2K failures, but....

The title "Embedded Chip Problems" is not shown in the preceding examples.

The problems are not in the chips, but in the software running through them.

Software and software upgrades are what is needed to fix those very real problems. The chips themselves are not the problem. These devices apparently are not computer based or PC based. Many devices like these exist. I believe this is what people mean when they say "embedded system": a system with hardware and software that has some of the capabilities of a computer. These are usually designed to perform a certain job. They need to be checked! No doubt some will have Y2K problems. The majority do not have any Y2K problems, for they do not use dates. But, and this is a big but, you have to know which ones do use a date and check to see if they have problems.

These devices have software running through them just like computers. I noticed in the examples above that the fixes and patches were "software". The hardware (the chips themselves) does not need to be fixed or replaced. What this boils down to is that "embedded systems" (devices that compute) must be checked for Y2K problems and fixed if there is one. The chips themselves are not the problem.

But to go further, there are "programmable chips", the ones you hear about that can have programming burned into them. So far as I have found (and I have asked and looked everywhere), none of these have been programmed with a Y2K fault in the software. This does not mean none have, just that in the "embedded chip" industry the date/time functions were external to the chip itself, usually on the same board the chip was on, or the chip even got the date from a source external to the device itself. This could be a PC.

As in building controls: the PC determines the date, and the, say, main building lighting has its own device (embedded system) that turns on the main lighting Monday through Friday. The device gets its time information from the PC (24 hours a day, 7 days a week). The device has no need to know what month or year it is. I found that building management systems are programmed for a year at a time. The reason for this is that, because different holidays occur on different days of the week in different years, the entire building and its peripherals (elevators, lighting, heating etc.) get the information from the PC as to whether it is a work day, a weekend or a closed day. The lighting device does its 7-day-a-week job (five on, two off) unless it gets information from the building management system computer to do otherwise (4th of July, no one will be in today, so keep lighting down to the closed-day status). One of the advantages of all of the systems being controlled by this main computer is that when there is an unforeseen non-work day, the entire building can be controlled by it. It is important that this main computer be checked and fixed if it has Y2K problems, and a good many of them will have to be fixed, as they run off a number of different kinds and ages of PCs.

*************************

http://www.egroups.com/group/year-two-thousand/183.html?

http://www.boma.org/ Building Owners and Managers Association (BOMA) International is a premier network of over 16,500 commercial real estate professionals. BOMA International represents 84 United States, ten Canadian, and seven overseas associations in Australia, Indonesia, Japan, Korea, the Philippines and South Africa.

http://www.boma.org/year2000/ BOMA is committed to helping commercial real estate professionals prepare their buildings for the Year 2000 through timely information. BOMA is pleased to post relevant articles pertaining to the Millennium Bug on its Web site. You will also notice that we have established a Year 2000 Special Interest Group (SIG) where users can download articles, read and post messages and even chat in real time.

http://www.boma.org/year2000/letters.htm Letters of compliance from:

Andover Controls Corporation
Dover Elevators
Kastle Systems, Inc.
Montgomery KONE
Otis Elevator Company
Timberline Software Corporation
The Trane Company
U.S. Department of Housing and Urban Development
York International

The letter from Andover Controls Corporation was especially reassuring: "As the technology leader in advanced micro-processor based building controls systems, Andover's engineers anticipated, planned, and tested for date transition from 1999 to 2000 when our building control products were originally developed".

*********************

And now this from Dave Hall, who testified before Congress to the fact that of 60 billion chips in existence a percentage (4-2- less than one) would cause Y2K failures.

This is what he says today:

The Year 2000 problem resides in the programming, be it a software application or the firmware on a microprocessor. The "chip", or the hardware itself, does not have any date use, so it does not have any potential Year 2000 problem. To make the "chip" accomplish some task, a designer develops a software program and either puts it in memory or burns it into a PROM of some type. The logic in the program then commands the hardware to accomplish some task upon receipt of some input, or at some specific points in time. All computers, mainframe, midrange, PCs, microprocessors, super computers, etc., are run by programming. The programming, if it uses years for any reason and the programmer used only two-digit year date fields, has the potential to incorporate Year 2000 problems.

To get a more complete idea of what is involved in assessing equipment, check out the downloadable white papers at www.year2000.unt.edu, topic page 11 under the web conference.

Dave Hall
My opinions only, of course
Embedded Systems and Infrastructure Risk Management.



-- Cherri (sams@brigadoon.com), June 23, 1999.


Part of a recent reminder from IBM... <:)=

"logic that track, represent, or make decisions based on date information and may not correctly process dates later than 1999. For equipment and programs that do not correctly process the post-1999 dates, their results could be unpredictable and their impact could ripple widely from application to data, application to application, system to system, and organization to organization."

-- Sysman (y2kboard@yahoo.com), June 23, 1999.

