Embedded Systems Impacts ~ A Valuable Advisory

greenspun.com : LUSENET : TimeBomb 2000 (Y2000) : One Thread

UNMODIFIED FORWARDED MATERIAL

The following forwarded material contains some very useful and valuable information. I am interested in learning what others think about this advisory, particularly Number 10 in the list.

All of the following is forwarded material, including the "Editor's Notes". No edits have been made.

Year 2000 Infrastructure and Embedded Systems Advisory Number 1099-044

Embedded Systems Impacts

October 6, 1999

I've gotten many calls requesting types and examples of embedded systems and equipment failures. There are numerous notes at vendors' web sites on which of their equipment and software is non-Year 2000 compliant, but very little on what such non-compliance could do to your specific application. It is impossible to state with certainty exactly what your problems will be without evaluating your specific system/equipment, but the following list does provide some examples of what impacts have been found and what the consequences of those impacts are. Some of the more interesting ones and my notes are in italics. If anyone would like to contribute to my ever-growing list of impacts, please do so.

1. Application Type: Weighing of finished product. Description: Apparent failure to recognize leap year. Jumps from 28/2/00 to 1/3/00. Solution: PROM replacement. Consequences: Possible breach of regulatory requirements. Some versions count year transitions to identify leap years.

2. Application Type: Date coding (ink jet) of finished product Description: Unit fails to roll coding date forward correctly once code date is in Year 2000 (failure as soon as forward date hits Year 2000) Solution: Replacement PROM Consequences: Serious problem if not rectified. Manual date entry possible, but given the number of units this would present major difficulties.

3. Application Type: HVAC - Control of mechanical services and air conditioning equipment. Description: Controller fails on first power-up after roll over (12/31/99-1/1/00) whether roll over was with power on or off. Solution: Replacement PROM. Consequences: Nuisance failure of services to manufacturing plants resulting in significant downtime. Actual date unimportant so a workaround is possible.

4. Application Type: Instruments - Weighing of finished products (multiple weighers on network). Description: System fails to roll over correctly. Solution: Software update required Consequences: Failure to meet regulatory requirements for average weight. This problem could be a significant nuisance as manual operation of these instruments is difficult.

5. Application Type: PLC-based control system Description: Unit passes all Y2K tests but at transition from 12/31/03 - 1/1/04 reverts to 1/1/00. (Editor's note: January 1, 2000 is NOT the only problem date embedded systems have. It will be necessary to check numerous dates depending upon type and functionality of embedded code. ) Solution: Replacement PROM. Consequences: Failure of unit.

6. Application Type: Card access control system for site and internal departments Description: System fails at roll over and access is barred. Solution: Replace. Consequences: Security system inoperable. Additional manning required.

7. Application Type: SCADA Operator interface and display system Description: Custom firmware fails roll over test. PC platforms also fail. Solution: Replace Consequences: Key manufacturing plant constrained in operation. Most major functions operable albeit at reduced efficiency, but data logging, etc. lost. (Editor's note: How long can you operate at a reduced pace and remain in business?)

8. Application Type: SCADA Bought-in graphics/display package. Description: Core functionality OK, but some optional modules fail roll over test. Solution: Upgrade or replace. Consequences: Largely cosmetic (loss of data logging and trending information), but could be more serious in the event of major plant problems and lead to additional downtime. (Editor's note: If efficiency is important to you, loss of trend data can lead to cascading losses and eventual failures.)

9. Application Type: DCS Control System controlling smelter plant. Description: Rollover to year 2000. System on reboot reverted to incorrect date. Solution: Replace battery backup. Consequences: Loss/corruption of trend data. (Editor's note: If efficiency is important to you, loss of trend data can lead to cascading losses and eventual failures.)

10. Application Type: DCS control system control for petrochemical plant Description: Online roll over to Year 2000 dates caused failure Solution: No known workaround. Plant had to be operated from one station until problem could be rectified. Replacement is necessary. Consequences: Near catastrophic. Limited reliability and operability of plant. Reduced production. (Editor's note: You do not need more than one failure to significantly affect a plant. Percentage numbers are meaningless at this level of "embedded systems" use. Individual items of equipment can bring you down while multiple failures of other items can have very little effect.)

11. Application Type: Car Park Management System Description: Dates after 12/31/99 not handled correctly Solution: Replacement of hardware and software Consequences: Loss of revenue, lack of car parking causing traffic congestion, safety considerations if car park egress not possible or restricted.

12. Application Type: Fire Station Alarm Monitoring System. System provides the fire department with information regarding the state of alarms of critical systems. Control of lighting and motorized doors is also possible through this system. Description: Problems were experienced with the Visual Basic platform that the monitoring software runs on. (Editor's note: You have to test all layers of potential problems. These "embedded systems" are simply PCs in different clothes.) The system will have incorrect knowledge of the day of the week post 31/12/99, and will be unable to recognize February 29, 2000. Solution: 1) Replacement of PC with Y2K compliant version; 2) replacement of Visual Basic platform with Y2K compliant version; 3) installation of operating system necessary for latest version of Visual Basic; 4) amendment of custom code as necessary to run on the new platform. (Editor's note: Once you change the platform, language version and/or compiler version, you must check the functionality of the code. Compiler versions especially will mess up functionality.) Consequences: Incorrect knowledge of the days of the week will lead to incorrect identification of silent and normal hours. Failure to recognize leap day will lead to system crash. Both of the above constitute safety risks. If the operator is not aware of computer errors, incorrect action could be taken by fire department.

13. Application Type: Logging / Monitoring - Measures and records personal external dose information for those working in radioactive areas. Entry/exit records are date/time stamped and are stored in a database. Description: The real time clock within the PC will not roll over correctly. As a result of this dose records over a given range will total incorrectly and reader and software records will be incorrectly date stamped. The DBU software will roll over to 1900. This will result in the loss of some dose records, making over-exposure of some operators possible. (Editor's note: Creation and storage of data is just as important as use.) Solution: Replace the PC used for the IDR software with a fully Year 2000 compliant version. Consequences: Incorrect dose records would cause the regulator to take action, possibly closing the facility until it could be proved that corrective actions had been successfully implemented. Possible legal costs and personnel injury damages.

14. Application Type: Train Describer. This system provides information about locations of train services to signalers and other rail staff. The output is used to update a train position model that places each train's unique description (head code) on a schematic representation of the rail network. Description: The Train Describer computer is non-compliant. The system software clock stores date with a 2-digit year, however, it prints dates with 4-digit years by using the prefix '19' in output to the printer. (Editor's note: Just because a printout uses four digit year fields does not mean that your system does.) On start up the system will accept dates in the range 10/10/84 to 31/12/99. Dates outside this range will be rejected as invalid. The system will rollover to '00' if left powered up. However, in the event of a failure it will not be possible to restart the system and enter the correct date after 31/12/99. The day of the week will be incorrectly calculated after the 2000 leap day, so the Describer may refer to the wrong timetable. Solution: Modifications are possible to make the system compliant. However, it is reasonably old and difficult to maintain, so replacement is preferable. Consequences: Loss of knowledge of train location, trains running at incorrect times, late or not at all.

15. Application Type: Site-wide building access control and security system consisting of six connected controllers at various site locations controlling a network of card readers and keypads (supported by modems etc), printers and visual display terminals. Description: The control panel will roll over correctly from 02/28/00 to 02/29/00. However, if 02/29/00 is entered manually, it will default to 02/01/00. (Editor's note: Have you checked all possible user inputs to see if they could affect your system?) Solution: The access control software will be upgraded and the control panels replaced. Consequences: Nuisance, access to restricted areas may be controlled using a manual, paper-based system. However, this would be expensive and time consuming.

16. Application Type: A robot used to change air filters in a restricted area has a PLC controller. The robot may be used in automatic mode controlled by its PLC. It can also be used in manual mode, but the operator relies on the PLC to receive information from sensors on the robot arm. Completely manual operation is not possible. Description: PLCs running certain versions of the operating system will fail to roll over into the next century correctly. This will disable the robot. Problems will not be experienced immediately as the robot is not in constant use. (Editor's note: Are you sure you have checked and tested ALL of your equipment?) However, failure to correct the problem would seriously impair production. The operator terminal used to program the PLC is non-compliant, as is the programming software. It may be difficult, if not impossible, to roll the PLC system clock back and, if necessary kit changes cannot be made, production will be stopped. Solution: Complete replacement of the PLC. Consequences: Production must stop. Unless a solution is found regulatory non-compliance would follow.

17. Application Type: System used for voice and data communications between train drivers and signalers. Description: Before updating the time, the management processor sets all of its internal registers to zero, and monitors the status of them afterwards. If the status of one or more registers is still zero, this is interpreted as message not received. The processor will await the arrival of a valid signal before updating the time and date. So, effectively, it will cease to function for one year, then resume normal operation on 01/01/01. Solution: The equipment manufacturer must provide a software upgrade. Consequences: If information gets out of sequence, chaos will ensue. Train delays will occur, and there will be increased risk of rail accidents. The cost of this could be considerable. There will also be regulatory problems as, in the event of an emergency, logs and sequencing information are needed for post-incident inquiries.

18. Application Type: Tracking system used on 6 meter and 8 meter satellite dishes. This tracking system is used to position satellite dishes that provide uplinks to communication satellites in geostationary orbit. Description: The tracking system rolls over into the next century and the data '00' is interpreted as an invalid date. Knowledge of the date is essential to finding the position of the satellites. Solution: There are three possibilities: 1) Upgrade the tracking system; 2) use alternative transmission means; 3) transmit using smaller satellite dishes on higher power. Consequences: It will not be possible to broadcast signals. (Editor's note: Make sure embedded systems impacts won't impact your expected contingency plan actions.)

19. Application Type: SCADA system that provides an overview of the operation of approximately 250 systems in a manufacturing plant. Description: The system failed during the power down roll over test. It would not start up again when power was re-applied. The system was restored from backup and then successfully re-initialized. Solution: The problem was rectified immediately by the vendors. Software modifications were made. Consequences: If this failure happens on restarting the system after the millennium shutdown, and the system backup is not readily available to restore the system, then this type of problem could result in significant downtime.

20. Application Type: A smart density analyzer uses a radioactive source as part of its measuring process. Description: The algorithm that compensates for the decay of the radioactive source gives erroneous results on rollover to January 1, 2000. Solution: It was initially thought that the solution would be to recalibrate the instrument on December 31, 1999 (enter a date of January 1, 2000), and then to recalibrate again on January 1, 2000 (enter the date of January 1, 2000 again). Testing discovered that doing two sequential recalibrations also caused major problems. (Editor's note: Be sure that your "solution" does not cause problems. Test your solution before final implementation.) The vendor is now offering users of the system an EPROM upgrade. Consequences: In an operating process, this would raise alarms and possibly result in a costly process shutdown.

21. Application Type: Multi-site organization has a packet switching mechanism to allow medium speed data communications. Description: Each communication node in the network has a real time chip in the node firmware. The firmware only 'sees' two digit dates. The system will not function correctly if allowed to roll into the next century. The packet switching management system is a supervisory level system with non-compliant operating system in conjunction with non-compliant application software. (Editor's note: This system has all three problems, firmware, operating system and application software. All should be checked for proper functionality.) Solution: The packet switching device will have its internal clock wound back by 28 years to synchronize days of the week and leap years. The packet switching management system will be completely decommissioned. No fix has been identified for the application system although the operating system could be upgraded. Consequences: The management system is the key to determining fault location, performance metrics, and event reporting. Without the management system, it will be difficult to manage with faults, and alterations will take longer to deal with, thus impacting network resilience.

22. Application Type: A multi-site utility company has 1.2 million meters (30% electronic and 70% mechanical). A problem arose with the calibration equipment for the electronic meters. Description: On testing electronic meters and rolling through the post 2000 dates, the calibration equipment 'stuck' at 2010. It was impossible for the user to reset the calibration equipment. The vendor had to be called in. (Editor's note: January 1, 2000 is not the only possible problem date. With the lack of a universal format standard, we run risks every year.) Solution: The vendor reset the calibration equipment and inserted an upgrade patch. Consequences: This caused a major logistical problem as a backlog of calibration checks built up.

23. Application Type: Fuel Pump. Description: Year does not roll over. Leap years are not recognized. Solution: Client "working around" fault. Owner has to manually correct date on each January 1st. Consequences: Inability to monitor fuel dispensation.

24. Application Type: HVAC - Air Conditioning/Heating Controls. Description: Loss of control of HVAC system. Critical date 01/01/2000. Solution: Upgrade software. Manufacturer supplying free upgrade. Consequences: Potentially catastrophic.

25. Application Type: Fire alarm control panel - sounds alarm. Description: There would be a fire alarm malfunction on rollover - alarm raised. Solution: Software upgrade. Consequences: Would lead to building being evacuated.

26. Application Type: Water leak detection. Description: Non-reporting of leaks/fire alarms. This type of problem could be either no alarm, false alarms, or both. The critical date for this specific system was 01/01/2000. Solution: Upgrade microprocessor. Consequences: Non-reporting of leaks could cause major damage with long down times. False alarms would cause systems (e.g., air conditioning) to be closed down.

27. Application Type: Building Energy Management System. Description: The system will operate correctly through the millennium rollover if the system remains powered. If the system is powered down, however, the date will revert to XX/XX/1900. (Editor's note: If you can absolutely believe that your system will never be powered down, then you don't have to fix this type of problem.) Solution: Upgrade/replace equipment. Consequences: Potential failure of air conditioning/heating system, security systems etc.

28. Application Type: Fire Alarm Panel. Description: System crashes on rollover, but can be reset in year 2000. However, it doesn't recognize leap years. (Editor's note: You should do a leap day test for 2000, 2001 and 2004.) The critical date for this specific system is 01/01/2000. Solution: Replace equipment. Consequences: Building is left unprotected if system is not reset immediately after rollover.

29. Application Type: SCADA - Supervisory control & archive data for production process. Description: Loss of communications to discrete control functions and failure of archiving process data due to 2 digit date field use. Solution: Fix installed by manufacturer. Consequences: Loss of heating models for process. Manufacturing an unusable product. Loss of process data for quality control and QA.

30. Application Type: SCADA - Monitoring of high frequency welding equipment. Description: All data logging after January 1, 2000, would be erased as 'old' data. (Editor's note: Have you checked to see if you could properly write data or files during your Y2K tests?) Solution: A software patch is available and will be installed by original supplier of equipment. This original supplier had been unaware of the problem and consequently will need to fix several hundred similar systems worldwide. Consequences: Loss of historical trending data and traceability for QA.

31. Application Type: Level and flow monitoring of waste acid treatment plant Instrument Description: Problem experienced with some versions of firmware. If the unit rolls over any year (it's not a Y2K specific problem) with the power supply off, then on power up, the display is blank and the keyboard locked so that the device will not operate. Solution: A known compliant version of the firmware has been installed. Long term, the unit will be replaced. Consequences: Inability to treat acid, resulting in shutdown of plant.

32. Application Type: DCS - Wire Loom Testers. This is a stand-alone system, which is not connected to any computer network. It performs electrical continuity tests on aircraft wiring looms. Description: On rollover, the PC attached rolls to 00, but the certificates printed out for the customer show the date as being in the year 100. Solution: The PC and its software are to be replaced with a compliant version. Consequences: The system is unable to produce valid certificates for the customer. The customer will reject invalid certificates as they form part of the contract for the aircraft. Consequently, aircraft delivery will be stopped. (Editor's note: What are your contractual requirements for documentation and have you included them in your testing?)

33. Application Type: Logging / Monitoring - This system is found in the automotive industry and is concerned with the "just-in-time" manufacture of airbags. The assembly line is made up of a number of stations. The action carried out at each station is controlled by a dedicated PLC that operates independently of all other PLCs. The whole line is controlled via a main line computer which carries detailed information about the product being assembled and the route map through the manufacturing line, and serves as the link between the assembly line and a network based database. Description: The reference date is used for comparison against the manufacturing date of components that are included in the assembly. Tests revealed that the PLC performing this comparison performed correctly. Further tests revealed that another part of the assembly line suffered a different date-related problem that involved the current production date. The problem was found to be the result of converting the year data (100) into two digits (YY) resulting in the printed label containing :0 as representation of the year 2000. It was found that products carrying labels with year :0 are rejected as a result of invalid year code (Editor's note: All systems should have end-to-end tests carried out. Definition of "a system" should include ALL aspects of manufacturing, packaging, shipping, distribution, etc.) Solution: The date handling routine in the label printing software was modified to represent the year 2000 as 00. Tests were carried out to verify this and found that a fault was again registered. This was traced to the PLC code that compared the year code on the label (00) to the year code in the MDT (100). Therefore a further modification was carried out on the data received from the MDT to represent the year 2000 as 00. Consequences: Loss of production on three assembly lines.

34. Application Type: CNC Milling Machine. The system is used to manufacture aircraft parts and is controlled by PLCs. Description: At the 31/12/1999-1/1/2000 transition, the PLC's BIOS resets from 31/12/99 to 4/1/1980. Numerical Control (NC) program data with the current date (1/1/2000) is then downloaded from the DNC network. There will now be a date conflict between the downloaded NC data and the internal date (Editor's note: This is why you should accomplish end-to-end tests on all systems - possible internal date (really format) conflicts.). Solution: Upgrade of operating system in 3 stages. Consequences: Confusion over NC files that are downloaded over the site network due to date discrepancies. There are three of these machines dedicated to the same task, all are identical and therefore consequence of failure is increased threefold. As far as known, these are the only machines available to manufacture the aircraft parts to the proven method at this site.

35. Application Type: HVAC - The system comprises: (1) a centralized PC (with the appropriate software) that monitors and controls the operating parameters of both a boiler management system and microprocessor-based outstations; (2) local area network that connects the outstations and boiler systems to the PC via networked hubs; (3) portable hand-held computers that are used in the programming of the outstations with, for example, local operating characteristics; (4) air-conditioning units (ACU). Description: While conducting the tests it was found that when power was removed from the outstations and subsequently re-applied (Editor's note: Do your tests include a power-on and power-off rollover?), the outstations failed to recognize leap years. As a result of these omissions the history logs held in the central PC became corrupted. For example, if the PC was expecting data for the 29th February 2000 it received data (from the outstations) for what the outstations believed to be the 1st March 2000 (since the 29th February had been "lost"). Solution: There are two possible solutions (excluding the "do nothing" option): 1) Upgrade the firmware versions of the outstations; 2) Replace the system software. Consequences: The system would activate (or deactivate) at various times during the year.

36. Application Type: Instrument - An "electrical continuity tester" (ECT). It is a standalone instrument and is made of: 1) a master switching console (MSC) that connects the wiring loom under test to the ECT by means of a 100-way cable; 2) a PC is connected to the ECT by means of an RS232 link. This computer contains all the programs required to automate the operation of the ECT and record the results of the tests; 3) the ECT is connected to an electricity supply and contains banks of manually operated make or break switches. Description: The problem that occurred is as follows: On December 30, 1999, it would not have been possible to set the system's operation for January 1, 2000. The PC would have interpreted the year as 1900. That means that the license would become invalid, which in turn means that the system would refuse to operate (Editor's note: Have you investigated all possible license ramifications?). Given that the system would fail to operate, it is not possible to identify any further effects of non-compliance. Solution: Replacement. Consequences: Given that the license would prevent the system from operating, the product being manufactured could not be tested and therefore could not be sold. In the short term the credibility of the business would suffer. In the medium term customers may impose (financial) penalties because the product had not been delivered on time. In the longer term the business may cease operating.
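
Several of the impacts listed above come down to a handful of date-arithmetic mistakes. The Python sketch below is illustrative only and is not taken from any of the systems described. Every function name, and the pivot value of 70, is a hypothetical chosen for the example. It shows the full Gregorian leap-year rule next to a faulty variant that misses February 29, 2000 (as in items 1, 12 and 28), a simple two-digit-year window in contrast to always prefixing '19' (items 14 and 36), one plausible way a years-since-1900 value of 100 could print as ':0' (item 33 reports the symptom but not the mechanism, so the digit-by-digit conversion here is an assumption), and why winding a clock back 28 years keeps days of the week and leap years aligned (item 21).

```python
from datetime import date


def is_leap_year(year: int) -> bool:
    """Full Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def is_leap_year_century_bug(year: int) -> bool:
    """Hypothetical faulty rule that drops the 400-year exception, so 2000 is missed."""
    return year % 4 == 0 and year % 100 != 0


def expand_two_digit_year(yy: int, pivot: int = 70) -> int:
    """Windowing: values below the pivot are read as 20yy, the rest as 19yy.

    The pivot of 70 is an arbitrary illustrative choice, not a standard.
    """
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year")
    return 2000 + yy if yy < pivot else 1900 + yy


def always_19xx(yy: int) -> int:
    """The failure mode: a fixed '19' prefix turns 00 into 1900."""
    return 1900 + yy


def naive_year_digits(years_since_1900: int) -> str:
    """Assumed mechanism: build two characters by adding tens and units to '0'
    with no range check, so a tens value of 10 lands on ':' in ASCII."""
    tens, units = divmod(years_since_1900, 10)
    return chr(ord("0") + tens) + chr(ord("0") + units)


def safe_year_digits(years_since_1900: int) -> str:
    """Reduce modulo 100 before formatting, so 100 prints as '00'."""
    return f"{years_since_1900 % 100:02d}"


def wind_back_28_years(d: date) -> date:
    """Shift a date back 28 years. Between 1901 and 2099 this preserves both
    the weekday and the leap-year pattern, because 28 years is exactly
    28 * 365 + 7 = 10227 days, a whole number of weeks."""
    return d.replace(year=d.year - 28)


if __name__ == "__main__":
    print(is_leap_year(2000), is_leap_year_century_bug(2000))
    print(expand_two_digit_year(0), always_19xx(0))
    print(naive_year_digits(99), naive_year_digits(100), safe_year_digits(100))
    leap_day = date(2000, 2, 29)
    print(leap_day.weekday() == wind_back_28_years(leap_day).weekday())
```

A check along these lines only exercises the arithmetic. It does not replace the power-on, power-off and end-to-end tests the editor's notes call for, since many of the failures above appear only on restart or at the interface between two components.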

Advisory Disclaimer

Risk Management Advisories are provided to selected organizations to enable them to better understand the nature of the problem addressed. The information in each advisory may be distilled from numerous sources. Since it is impossible to ascertain the accuracy and completeness of such information, the above information is provided as is with no representations or warranties of any kind, whether expressed or implied, and we assume no liability for damages arising out of or relating in any way to the use of the information.

Advisory Editor:

David C. Hall, Senior Consultant, Risk Management Services

Advisory Question/Comment Contact Information

Phone: 630-734-9674 Fax: 630-734-9675 E-mail: dhall@enteract.com US Mail: Hall Associates 268 Weather Hill Drive Willowbrook, IL 60514

__________________________________________________________

END OF FORWARDED MATERIAL __________________________________________________________

This material seems quite useful. Thought it might be helpful to pass it along.



-- Share (Share@shareand share alike.com), October 16, 1999

Answers

Thanks!

-- R (riversoma@aol.com), October 16, 1999.

SNICKER

35 cases, the last one is because it is a PC that needs to have the BIOS updated.

35

35!!!!

UH hmmm cough cough....

You may be wondering at my semi hysterical outburst here.

Well we have 35 cases. And most are not the "embedded chip" itself.

This from the man who just one year ago brought you the 40, 50 billion embedded chips with 10% Yes a big 10% sure to fail!

I don't want to hear the crap about how he lowered it later when he actually researched the facts, as opposed to the original guess he now admits was a guess.

He testified before congress with those guesses. Now he can only bring you 35 embedded systems (not chips now people pay attention!) that will be impacted (yep not catastrophic failure but IMPACTS).

He knew nothing when he started spouting his "opinion", he has been learning "on the job".

I told him 18 months ago he did not know what he was talking about.

-- Cherri (sams@brigadoon.com), October 16, 1999.


Thanks Share. I doubt any polly would actually answer this thread.

Mike

======================================================================

-- Michael Taylor (mtdesign3@aol.com), October 16, 1999.


Oh to be sure there will be pollies answering this thread. They won't make sense, but they'll be hemming and hawing as per usual.

~~~~~~~~~~~~~~~~~~~~~~~~Shakey~~~~~~~~~~~~~~~

-- Shakey (in_a_bunker@forty.feet), October 16, 1999.


Thanks for posting this, Share.

-- Ashton & Leska in Cascadia (allaha@earthlink.net), October 16, 1999.


Cherri couldn't help herself...you know, stirring up a fresh batch of humble pie. Problem is...we still don't know who's going to eat it.

What is it they say about humble pie?

Is it don't get too cocky when you're baking humble pie?

-- no talking please (breadlines@soupkitchen.gov), October 16, 1999.


Cherri,

Your stage name, or nom de plume if you prefer, isn't: "Share" by chance, is it?

-- no talking please (breadlines@soupkitchen.gov), October 16, 1999.


smells like a setup to me...

dem dirty rats

-- share what (share@share.share), October 16, 1999.


Nobody has ever claimed that NO embedded systems have problems. Of course there are problems. But the *number* of problems is far smaller than feared (this isn't a very long list) and the *impacts* of those problems tend to be less than feared (a lot of these impacts are "possible regulatory violation"). And these "editor's notes" describe the worst possible results of these failure modes.

There are much longer lists, to be sure. But where do these lists come from? How did Dave Hall find out this stuff? Not only are the devices and failure modes posted on the manufacturers' web sites, but in many (most?) cases, customers of these devices have been individually notified of the problems by the manufacturers. It's really in nobody's best interests to keep such issues secret.

Notice that the solution in many cases is to replace the PROM. I doubt a manufacturer would recommend this solution if there were no replacement available.

Finally, and unfortunately, we have no good data on how many users of these devices have performed the remediation. While I find it hard to imagine that a company, notified that a key element of their process won't work, will choose to ignore the warning, probably some of these problems will remain to bite us. However, since there are fixes available in each case, it seems unlikely that any company would deliberately suffer loss of efficiency for very long. Repairing the problem would surely be more cost effective.

My experience has been that serious problems (like #10) make the vendor very proactive -- onsite assistance is common.

-- Flint (flintc@mindspring.com), October 16, 1999.


Flint,

You might want to back up a ways to a new thread on Oil Industry Embeddeds and Baker Hughes Inc. There you'll find new data to show a sliver of just how bad the embedded situation may be in the Oil Industry. This company is letting all the dirty clothes hang out on the line, or at least much of it in their product lines which are the products that they've sold in the past to the Oil Industry. This company seems to me to be in serious trouble. And they are NOT the only ones.

Also, someone else has posted a thread regarding ATT problems also. The news is really mixed... Happy talk versus the actual 10-Qs where the lawyers have now begun to alter the tone and the mood as these folks realize their problems are NOT getting solved. Who knows, we may see the lawyers force these folks to come clean with expectations but probably not until Christmas, and by then...TOOO LATE.

Also IF... the telecoms go down, it will seriously impede if not stop much of the oil production because a lot of the systems remain functioning due to remote relays within process systems. Not to mention communicating back and forth between human overseers of equip in the field.

-- R.C. (racambab@mailcity.com), October 16, 1999.



R.C.:

Yes, that's all useful information. There will surely be unexpected shutdowns (or worse) come rollover.

The ATT thread discussed the physical limitations of a network testing methodology. Yes, these limits exist. I pointed out that because of these limitations, there can be no absolute guarantee everything will work right. Then I had to struggle against the normal TB2K conclusion that lack of a guarantee things will work right constitutes a guarantee things will work wrong. I continue to consider it a good sign that they have tested extensively, even if the coverage cannot be 100%.

As for the 10Q, I'm not sure what to think right now. I know that the SEC legally required corporations to describe their worst case scenarios. And I have the uneasy feeling that those scenarios have been plucked from the big picture and treated (mood and tone, as you write) as *status reports*. Even some of the brighter lights on this forum have confused contingency plans with an admission that their contingencies are certainties.

So I agree the picture is mixed. Let's just take care not to try to 'clarify' the picture by picking a conclusion and ignoring everything that doesn't agree.

-- Flint (flintc@mindspring.com), October 16, 1999.


Flint,

After perusing numerous 10-Qs for the past two weeks and comparing with previous statements and 10-Qs there is a marked attitude change regarding the mood and tone of the latest disclosures. These boys are shape-shifting their rhetoric towards a focus on 3rd Party disclaimers. In other words, they're setting up a basis for trying to put the blame for failures on anyone but themselves. Obviously no one anymore wants to use the Harry Truman philosophy of blame: "The buck stops here."

We're seeing this set-up for blame-shifting going on in almost every new 10-Q that I've seen. This is a new trend that is developing. It will be interesting to see where it goes. It is just one more indicator, though, that things are not going nearly as well as these companies would have you believe, and at the same time it lends credence to insider reports at some companies of internal "near panic" within certain elements of the corporate world.

It's these tell-tale signs and many others that have this reporter's nose sniffing something very "fishy" and/or "smelly" going on, and it's certainly pretty rank. Not a good optimistic indicator.

As I said in an earlier post elsewhere, ATT isn't the only one in the telco community in trouble. From what I understand these folks are expecting serious downtime. A week without banking and financial markets could easily "meltdown" the investment/banking world. The phones may well set that up...even if it's only sporadic outages. There's nothing worse than getting halfway through sending a fax or an email only to get cut off. Having that happen repeatedly, or having continuous interruptions between brokerage houses and the market exchanges, will create serious disruptions, I'm sure...and here I'm being optimistic. Combine this with all the other interrelated problems (and we've not even discussed electricity problems) and things could QUITE EASILY GET CONVOLUTED to create a meltdown scenario that ends up with a "head for the hills" or "camping" solution. It's just not as far-fetched as it once was. I was running a 4-7 scale scenario... I'm seriously considering raising it up a notch to 5-8, but it will never be TEOTWAWKI... unless the planet explodes or something, which I don't foresee for at least another millennium. Therefore your scathing retort to Diane was "uncalled for" IMHO.

-- R.C. (racambab@mailcity.com), October 16, 1999.


Flint --

Not a bad analysis. However, I do have a problem with one point. The fact is that a PROM is a 'Programmable Read Only Memory'. Now there are a couple of problems with the stated expedient of 'replace the PROM'.

First - There must be a pin compatible replacement part. This will depend a lot on how old the equipment is.

Second - The new PROM cannot be a copy of the old one. That is simply replacing the broken program with the broken program, an exercise in futility. The PROM holds a program, which must be remediated. (This assumes that the company which originally produced the equipment is aware of the problem and retains the original source code that produced the defective part, so that remediation of the system can proceed. And don't laugh, this is almost totally dependent on the state of the Software Configuration Management and Change Request Management systems, which in all too many companies doing real-time embedded systems range all the way from 'non-existent' to 'a joke' to 'just almost barely adequate' to 'mediocre'. I have worked in a few embedded shops and never saw one that wasn't working for the DOD that had a rating better than those.)

Third -- This article doesn't address what I consider to be the REAL disaster potential of embedded systems, which are those systems run by microcontrollers that have their programs in ROM on board the chip itself. These have their programs burned in on the actual chip at the factory. (Purchaser provides a binary pattern to burn in.) These must be replaced as microcontrollers. If the particular chip is no longer available and has no direct replacement (and I mean direct, as these things typically have reduced instruction sets tailored to the controlling of equipment, which is very hardware dependent), has only an upgrade chip, or the manufacturer is out of business, these will be extremely difficult to replace. They are also notoriously difficult to test, as rolling the clock forward is very difficult. (They run on control loops, and have to be removed from the device for test and evaluation, with specialized equipment.)

Fourth - And all of this begs the question of whether the test equipment, compilers, linkers, loaders, assemblers, etc. which were used in the original effort are still available, and are compliant themselves.
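To illustrate what a roll-forward test is meant to expose, here is a minimal sketch in Python. It assumes a purely hypothetical firmware routine that stores two-digit years and computes elapsed days by plain subtraction; the function names, the day-of-year representation, and the pivot year of 70 are all illustrative assumptions, not taken from any real device.

```python
from datetime import date

def elapsed_days_naive(start_yy, start_doy, now_yy, now_doy):
    """Hypothetical firmware-style logic: two-digit years, plain subtraction."""
    # Treats the year as a bare number 0..99, so year 00 sorts before year 99.
    return (now_yy - start_yy) * 365 + (now_doy - start_doy)

def elapsed_days_windowed(start_yy, start_doy, now_yy, now_doy, pivot=70):
    """Remediated variant: expand two-digit years around an assumed pivot."""
    def expand(yy):
        # Assumption: years >= pivot are 19xx, years below it are 20xx.
        return 1900 + yy if yy >= pivot else 2000 + yy

    start = date(expand(start_yy), 1, 1).toordinal() + start_doy - 1
    now = date(expand(now_yy), 1, 1).toordinal() + now_doy - 1
    return now - start

# Roll the clock forward across the rollover: 31 Dec 99, then 1 Jan 00.
# The unit is assumed to have been started on 1 Jan 98.
for now_yy, now_doy in [(99, 365), (0, 1)]:
    naive = elapsed_days_naive(98, 1, now_yy, now_doy)
    fixed = elapsed_days_windowed(98, 1, now_yy, now_doy)
    print(f"yy={now_yy:02d} doy={now_doy}: naive={naive} windowed={fixed}")
```

Before the rollover the two routines agree; after it the naive one goes negative, which is the kind of result that can only be observed if the clock can actually be set forward on the device under test.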

-- just another (another@engineer.com), October 16, 1999.


R.C.:

First, I didn't get the impression that ATT was "in trouble" simply because their testing methodology was necessarily limited. My impression was that they were admitting that a given level of uncertainty was inevitable. I agree, it is.

Second, I'm not sure how to view the shifting emphasis toward problems with the "other guy" in these 10Q statements. My suspicion is that this is partially because the required worst-case scenarios tend to focus on potential problems over which the organization has no direct control, and partially because most companies seem to have postponed validation of the "other guy" until late in the remediation process -- really part of contingency planning. Also, in many cases companies are naturally reluctant to open their kimono to one another.

So whether we're seeing a real trend toward admitting shortcomings, or whether we're seeing an artifact of the procedure and the reporting requirements, is a question I can't answer. Probably a combination. While I didn't expect ANY sizeable remediation project to be fully completed, I am disappointed in the reported statuses (as weighed on my BS scales).

just:

I agree with all you say (this is my business too). I probably didn't express myself very well. I doubted a manufacturer would *advise* that a PROM be replaced, if there were no suitable, compliant replacement. Certainly I'd be pissed if I called them up to get the replacement and was told "they aren't made anymore. You have to replace the whole expensive schmeer. Nyah Nyah"

Early last year, I got a call from a desperate remediator in Australia. They had a multi-million dollar assembly line robot assembly down for lack of an ST225 hard drive, no longer available. Seems the software was written specifically for this drive. The manufacturer said "Sorry, no longer supported. Replace assembly line!" Brutal. I found a source of these drives in California, $10 each plus shipping.

-- Flint (flintc@mindspring.com), October 16, 1999.


This is a pretty good list. Unfortunately, it suffers from the same problems as the UK IEE Y2K site failure lists: way too much hype regarding the consequences for a number of the cases presented, and no manufacturer/model numbers to allow verification. Indeed, a number of the cases aren't even functional failures; they are date stamp errors, ALL painted as having serious consequences. That's a bit hard to believe, based on my experience and the industry data I have seen.

I don't want to throw the baby out with the bath water though, because there are a number of cases presented here that are indicative of my own findings in assessments and testing and the findings of others I have seen. The technical descriptions of the problems are credible for many of the cases even though many of the "consequences" are not. I have yet to see evidence of a standard PLC that fails outright, however, and I would have to see the manufacturer/model information and verify before accepting these cases as valid (I am talking PLCs only, most of which operate as stand-alone. The smaller number that interface with computer based workstations are more likely to have problems due to the workstation software).

All in all, this is a list worth discussing, and like the UK list, it presents a broad spectrum of types of failures. For those who have a technical interest in embedded systems, I am going to do an item by item evaluation of this list, and a summary. I will do the best I can with this, although this is much more difficult without the manufacturer and model numbers. I am also nearing completion of my embedded system failure list which does have manufacturer and model numbers and links to the source data.

Regards,

-- FactFinder (FactFinder@bzn.com), October 16, 1999.



Flint --

The big problem is that many embedded systems were built just like that, and the manufacturer either doesn't support the software or isn't in business anymore.

When you do an embedded system, there typically isn't any way to do it otherwise. The problem is that frequently, the first time the folks who originally built the system find out is when a customer calls up and says 'help'. The system manufacturer, if they had the firmware folks in house, now scrambles around to see if they still have the source. If they do, they try to figure out if the problem is solvable through the code. If so, they try to find out if the chip is still available. If it is, then they will attempt to repair. Now this process can take from 3 or 4 weeks, to 2 or 3 months. And often they will find that the tools to build the stuff aren't available or don't work, or that they would need to upgrade the operating system or hardware to get a compatible version of the build tools, and so on and so on.
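The chain of checks described above can be sketched as a simple decision function. This is only an illustration in Python; the `FirmwareCase` fields and the result strings are hypothetical names chosen for the example, and a real triage would involve far more than four yes/no answers.

```python
from dataclasses import dataclass

@dataclass
class FirmwareCase:
    """Hypothetical yes/no facts a vendor would need to establish."""
    source_available: bool   # does the original source code still exist?
    fixable_in_code: bool    # can the date problem be solved in software?
    chip_available: bool     # is the chip (or a direct replacement) obtainable?
    build_tools_work: bool   # do the original compilers/assemblers still run?

def remediation_path(case: FirmwareCase) -> str:
    """Walk the checks in the order described; any 'no' ends in replacement."""
    if not case.source_available:
        return "replace system: source code lost"
    if not case.fixable_in_code:
        return "replace system: not solvable through the code"
    if not case.chip_available:
        return "replace system: chip no longer available"
    if not case.build_tools_work:
        return "replace system: build tools unavailable"
    return "repair: rebuild firmware and ship new part"

# Example: everything checks out except the build tools.
print(remediation_path(FirmwareCase(True, True, True, False)))
```

Repair is the outcome only when every check passes, which is why a single missing piece (lost source, obsolete chip, dead toolchain) pushes the answer toward replacing the whole system.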

This is why I fear this aspect more than anything else. The number of things that nobody appears to realize is driven by this sort of stuff is literally staggering, and an awful lot has not been checked out. And it is too late for it to do a lot of good now anyway. The cycle time is such that there just isn't TIME to do much in the way of repair or, worse, replace, even in those cases where it can be done, and in my experience, the much more likely solution is to call for a replacement of the system, as the original can't be made compliant.

For perspective, consider that I actually asked about an embedded system I did back in the mid to late 80's and Y2K. The answer was, heck the chips only have a 5 year life. The system only has a 4 year life, that's why the chip was chosen. Every one of these will have to be replaced long before then. Well, those 5 year life chips are in all too many cases still out there, along with the original systems, because we did too good a job, they don't fail, they are basically 'fire and forget', and that is exactly what people have done with them... forgotten them.

-- just another (another@engineer.com), October 17, 1999.

