Embedded Systems: The Wildcard?

greenspun.com : LUSENET : TimeBomb 2000 (Y2000) : One Thread

[snipped from Douglass Carmichael's weekly update]

The following from Wired Magazine is important. A great piece of reporting, moving from "percent done" to anecdote with understandable detail.

http://www.wired.com/wired/archive/7.04/texaco_pr.html

"Abshier is one of the very few corporate Y2K managers who wanted to speak on the record. He had an agenda, of course - "I want to show that Texaco is doing a good job," he said. But his larger goal was to deliver what surely is the most difficult message to convey about Y2K: that Year 2000 problems are real and may indeed be locally severe; and also that hard work, engineering good sense, and intercompany cooperation can minimize the damage.

"After we chatted for a while, the four of us went into the center's machine room, a windowless box filled with hardware, cables, and the hot air blown out the back of electronic equipment. I asked them if they had any idea of all the embedded code running in all that gear. Martin laughed, ruefully. "Oh yeah. We know. We wrote most of it ourselves." As there was no storm on that particular morning, the machine room was mostly given over to the test Abshier had invited me to see, a re-creation of one of the first tests Texaco had run on an embedded system.

"The precise embedded system to be tested was a remote terminal unit, or RTU. An RTU is something like a small, single-purpose computer, the Stormac team explained. In a paperback-sized box mounted on the wall were several integrated circuit boards, each containing chips with embedded logic. Unlike programmable logic controllers, or PLCs, which can contain complex programs to control industrial processes, an RTU is fairly primitive, usually confined to doing one task. This one measures the flow of liquids and gases through a pipeline. Simple as its work sounds - it measures the instantaneous flow rate, stamps the measurement with a date and time, and stores it temporarily in its internal memory - it's a crucial piece of gear for Texaco. This little box is how it knows how much fuel it's delivering through its pipelines - and how much to bill the customers who are getting that fuel.

"This RTU is just one small data-collection point in a wider universe of intelligent devices that communicate with a centralized computer system. Via microwave, hardwire, and radio, hundreds of devices like this one are constantly sending data to the Supervisory Control and Data Acquisition system.

"The Scada host computer sat on the other side of the machine room - nothing exotic-looking, just an Intel-based PC with specialized OS and software. But the Scada system is the heart of Texaco's embedded-system network. If it can't collect data from the field devices, the company has no idea what's going on in its operations, can't analyze its production, can't bill customers - can't function as a company. By law, if Texaco loses contact with its field devices, it shuts down in four hours. Right at that moment, the Scada system was polling hundreds of embedded-system devices, collecting and storing about 30,000 points of data.

"Cook attached a laptop to the RTU, which gave him a direct interface to the logic in the device. He was, of course, about to do the one thing everyone wanted to do: set the date on the device to December 31, 1999, wait for the year to change, and then see what would happen.

"Using a handheld interface terminal, he entered the date and time: 12/31/99 23:59:45.

"Then we all watched the display on the face of the RTU as the seconds counted up to midnight. 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59 - then the date rolled over.

01/01/:0.

"Colon zero," said Cook. "It's like, what is that?"

"Then he tried entering the date 12/31/00. Again the seconds counted up to midnight, this time to 01/01/:1.

"But nothing terrible seemed to happen. No flashing lights, no buzzers, no equipment shutdowns. Was it just a weird date-format problem? A lot of hype over a display? Cook then took me over to the terminal for the Scada system and tried to collect information from the RTU. He entered the command to retrieve the device's idea of the date and time, and the Scada console displayed:

01/01/101

"Then he tried to retrieve the crucial information from the device, the date-stamped flow measurements stored in the unit. And the Scada system answered:

METER DATA NOT AVAILABLE - CONTRACT HOUR NOT CURRENT

"It can't get the data," Cook explained.

"Gas and oil continued to flow, unmonitored and unmeasured. If you can't read the data, said Cook, "you don't know what you've sold, and you can't get paid for it." How long could Texaco continue to function without being able to bill for the oil and gas delivered through its pipelines? Abshier and Martin looked at each other and let the question go.

"Texaco has hundreds of RTUs like this one out in the field. Fixing the devices involves going out to each unit, changing the chips inside it, and installing new software - about an hour's work per unit. The first round of replacement chips the RTU vendor sent them didn't work; they had to wait for another. Then the Scada system needed upgrading. And that was just for this one device. There are all those other devices in the field, with their chips and their embedded logic - setting valve positions, measuring pressure - hundreds of them.

"Even so, Martin and Abshier were reassuring. In the face of serious system failure, Martin said, "We could be back online, with the proper personnel, probably within a week." Abshier made a point of saying that Texaco was finding Y2K problems in only 5 percent of its embedded systems - enough to take Y2K seriously but not so many as to cause panic. And they found no problems in life-critical systems, those related to safety, health, and the environment. Texaco got a relatively early start on its Y2K work; Abshier has a large budget (Texaco estimates it will cost about $75 million to fix its systems); and he said many times that the problems they're finding are "not showstoppers." And he still retains the plainspoken confidence from his days of writing code. "Engineers know all these systems are not going to fail. Engineers aren't stupid."

"And yet, as the day wore on, I became aware of an edginess in Abshier. Maybe it was the uncertain atmosphere in the Stormac control room itself, dim and quiet like a radio station, with seven consoles showing data readings from the offshore control centers. Suspended above the consoles was a muted television permanently tuned to the Weather Channel. Despite the hot hazy sunshine outside, a tropical depression was developing in the Gulf, and Martin, who would have to supervise the platform personnel brought here in case of evacuations, kept sliding his eyes over to the TV. "We're waiting to see if it's named," he said, meaning they were waiting to see if the depression became a tropical storm."
[/snip]
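
A note on the rollover displays quoted above. The article does not say how the RTU stores or formats its date, so what follows is only a guess, but the readings ":0", ":1", and "101" are consistent with a common pattern. Suppose the firmware keeps the year as a count of years since 1900 and builds each display character by adding a value to the ASCII code for '0' with no range check. A count of 100 then yields a "tens digit" of 10, which lands on the colon character, and a host that prints the raw count in decimal shows 101. The Python sketch below illustrates that assumed mechanism. Its function names are hypothetical, and it is not Texaco's or the vendor's code.

```python
# Hypothetical sketch of an ASSUMED failure mode; the article does not
# describe the RTU's real firmware or the Scada host's real display code.

def format_year_two_digit(years_since_1900: int) -> str:
    """Format a year as two characters the way naive firmware might:
    each 'digit' is '0' plus a value, with no check that the value is 0-9."""
    tens = years_since_1900 // 10   # becomes 10 once the count reaches 100
    ones = years_since_1900 % 10
    return chr(ord("0") + tens) + chr(ord("0") + ones)


def device_display(month: int, day: int, years_since_1900: int) -> str:
    """Date as the (assumed) device front panel would render it."""
    return f"{month:02d}/{day:02d}/{format_year_two_digit(years_since_1900)}"


def host_display(month: int, day: int, years_since_1900: int) -> str:
    """Date as an (assumed) host would render it if it printed the raw
    years-since-1900 counter as a plain decimal number."""
    return f"{month:02d}/{day:02d}/{years_since_1900}"


print(device_display(12, 31, 99))   # 12/31/99
print(device_display(1, 1, 100))    # 01/01/:0
print(device_display(1, 1, 101))    # 01/01/:1
print(host_display(1, 1, 101))      # 01/01/101
```

Under this assumption the colon is the character that follows '9' in ASCII, so the display is wrong but the underlying counter is still advancing. The "CONTRACT HOUR NOT CURRENT" refusal would be a separate check on the host side, and the article gives no detail on how it works.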

Embedded systems are clever and bear watching. Closely.

~C~

-- Critt Jarvis (middleground@critt.com), April 28, 1999

Answers

Seems like you ought to get Bruce Beach to comment on this one.

-- Marsha Sykes (MSykes@court.co.macon.il.us), April 28, 1999.

Why in the world should Bruce Beach comment on this? What does he know about SCADA systems?
By the way, the most significant line is the one that begins the article: "Back in July, 1998 ..." This is an old story. Abshier is telling people nowadays how Texaco has _beat_ the Y2K odds.

-- Stephen M. Poole, CET (smpoole7@bellsouth.net), April 28, 1999.

Yes, Stephen, "Back in July, 1998 ..." is a candidate for most significant line.
However, I wonder about companies whose business continuity depends on the accurate functioning of "embedded meter readers" (mission-critical?). Like Texaco, I wonder if they diligently took proactive measures early enough to prepare for possible disruptions that might be classified as something outside their historical experience of day-to-day operations?
And,
I also wonder how many Emergency Management folks are considering possible disruptions that might be classified as something outside their historical experience of "normal accidents" or calamities that are generally isolated to a particular time and space?
Just wondering....
~C~

-- Critt Jarvis (middleground@critt.com), April 28, 1999.

Critt,
However, I wonder about companies whose business continuity depends on the accurate functioning of "embedded meter readers" (mission-critical?).
Why? You see, while everyone else has been debating this stuff ad nauseam, these companies have quietly but steadily been fixing the problem since July '98 (and earlier, actually). I get emails daily from folks who say something like, "we've fixed our stuff, but the corporate legal beagles won't let us say so yet."
Like Texaco, I wonder if they diligently took proactive measures early enough to prepare for possible disruptions that might be classified as something outside their historical experience of day-to-day operations?
Abshier says yes. A little knowledge of how these things work helps, too. See Dr. Kinsler's discussion on the email page at my Web site (he's does name the system, and he's speaking of electrical utilities, but you'll see the point).
Besides, the big disconnect in your thinking is that these corporations somehow WANT to die; that they have a death wish, that they'll willingly go into Y2K with problems that could kill them. That defies logic.
And, I also wonder how many Emergency Management folks are considering possible disruptions that might be classified as something outside their historical experience of "normal accidents" or calamities that are generally isolated to a particular time and space?
Considering that most of them have contingencies for everything up to and including a nuclear attack, I'm not too worried about that. (My father was a former Civil Defense coordinator -- back before it was expanded and called "emergency management." [g])

-- Stephen M. Poole, CET (smpoole7@bellsouth.net), April 28, 1999.

Stephen, First, I apologize for the lack of clarity in my writing. Let me take another stab at it...

Texaco is an example of a business entity committing resources to ensure business continuity through roll-over into the year 2000. Some, perhaps many, business entities do not and will not have the same quality or quantity ($$$) of commitment to address Y2K that Texaco and companies with similar resources have been able to make available.

Texaco can probably think outside the box. Texaco's organization/reorganization will probably be recognizable post-Y2K. I'm not aware of anyone's death wish for their company, but I am aware of companies on a business death bed. By that, I simply mean they've accepted fix-on-failure and have no workarounds or contingency planning in place.

Although some regions have Y2K disruptions on the agenda (Miami/Dade as one example), I am sure there are those that do not. Isolated disruptions are one thing. Multiple disruptions are quite another.

Critt

-- Critt Jarvis (middleground@critt.com), April 28, 1999.

I just love the way the doom and gloom people can read a posting from a company saying all good things and respond by saying 'yeah but I bet no one else is doing this'!! Christmas must be a blast at their house. Graham

-- graham haslam (grahamhaslam@hotmail.com), April 28, 1999.

Critt,
Granted. I had a nice typo in my line about Mark Kinsler, too, so we all fail to communicate clearly sometimes. :)
What I would love to get a Doomster to discuss with me, in a calm, rational manner (without namecalling and ad hominem attacks), is WHY there is so much distrust. Texaco's not the only company saying that they're working on Y2K. Most major corporations say they are, and are confident that they'll be able to handle it.
Shell Oil, for example, said a few weeks ago: "one thing that we found was that embedded systems just weren't a problem" (which echoes what Gartner is saying now). They then stated that they were confident that they'd be ready for 2000.
The "disconnect" that I talked about is important: do you really think that the management of Texaco, or BP, or Southern Power will just sit by and watch the corporation die? I don't think so. THAT'S what defies logic.
What we have here, in a nutshell, is companies saying, "we're addressing the problem." Their IT/IS departments are the ones leaking the stories -- often based on a flawed understanding of how the REST of the company works, but you can't get them to see that (and to even SUGGEST that here brings a FIRESTORM of ad hominem attacks on the one saying it[g]) -- about how they're "lying" and "covering up."
So, this is a fascinating study in human nature for me. Why are Doomsters so willing to believe the IT types, but not the rest of the people in a given corporation? I just don't understand it.
I think the big guys will be ready. Even Gartner says that the only remaining concern is the small to medium size enterprise levels (SMEs). BUT ... and this is a big BUT ... these smaller businesses are the most likely to be able to work around Y2K problems (speaking as a small businessman of many years' experience).

-- Stephen M. Poole, CET (smpoole7@bellsouth.net), April 28, 1999.

Dear Stephen,
I would like to tell you some of the reasons I mistrust the corporations when they say they are fixing the problem. I used to be a technical consultant and used to work on testing satellites before they were launched. I was part of the system that I now distrust. Here are some reasons.
1. Testing the entire system in a plant is the final "acid test". Many corporations are doing subsystem testing and waiting for the year 2000 to show them how close they came. This is both naive and irresponsible.
2. Most of commerce is from companies that do not have sophisticated engineers like Texaco does. Many small companies don't have any engineers to do any sophisticated testing. The Texacos with their multi-million dollar budgets will do well. To extrapolate their case to all or most other companies is not logical.
3. Many engineers do not even agree on how embedded systems should be tested. My experience as an engineer tells me that at least 20 percent of the engineers doing the testing are not doing the testing thoroughly or properly. They and their companies will be caught clueless in Jan 2000.
4. My experience in engineering design, development, and testing showed me that it takes many times through a design and test to get it right. Y2K does not offer that luxury, and there will be many mistakes found in Jan 2000 that are a result of the engineers doing this for the first time.
5. Type testing doesn't cut it, but type testing is the order of the day.
6. My estimate is that if 20% of the suppliers to large corporations have slowdowns starting in Jan 2000, then the large corporations are in trouble.
7. Many critical parts for many products come from foreign countries which are way behind schedule. Just-in-time inventories are very sensitive to simultaneous delays in many critical components.
Stephen, this is just a small number of reasons. I could go on and extend the list to about 25 items.
I am preparing based on my estimation of what will happen. That is my right and I am exercising it and paying the bill.
What I personally dislike is a doomer criticizing a DGI or someone like yourself criticizing the doomers. I don't see what you are doing as constructive. It is easy to find fault. The truth is that y2k is too complicated for even intelligent people to make a good guess of it. Each of us has to collect his own data and make his own opinion. Please collect your data, make your decision and then allow others to do the same.

-- Tomcat (tomcat@cat.com), April 28, 1999.

Stephen,

When I read a report with as much anecdotal detail as the one provided on Texaco, I am encouraged. Encouraged, not by the test results, but by knowing that due diligence makes us aware of snafus we might not otherwise have reason to consider.

I expect most of the Texacos, BPs, and entities at that scale to be reasonably business-resilient at roll-over. But that's not where my unease casts its shadow of gloom. I have high suspicions, insider information that I cannot easily share, that there are smaller businesses knowingly going into roll-over with a fix-if-it-fails strategy. Calling fix-on-failure a strategy is a bit of a stretch if the failure disables a business-critical system.

Generally, the companies that I know to be in this predicament do not have directors or high level managers who understand the quality of coupling in the interconnectedness, interoperability of their business enterprise. Without an understanding of loose or tight coupling, it's difficult to appreciate cascading failure. And, if you don't believe in cascading failure, then why bother to make contingency plans?

I'm curious, are you considering any sort of preparations in case your area experiences disruptions outside normal accidents?

Critt

-- Critt Jarvis (middleground@critt.com), April 28, 1999.

We had this story a few days ago. Please see this thread for a few other comments... <:)=

-- Sysman (y2kboard@yahoo.com), April 28, 1999.

Sysman, from that other thread:
The first round of replacement chips the RTU vendor sent them didn't work; they had to wait for another.
Things like that are annoying, but terribly common (speaking from 25+ years experience -- had a case of that just tonight, in fact[g]). The important thing is that they stayed on the problem and got the correct chips, just like we all do.
Then the Scada system needed upgrading. And that was just for this one device.
SCADA (Supervisory Control and Data Acquisition) is not a "device," it's a system -- sort of like a DA network. It's a means of collecting data from many devices in the field and communicating status to the personnel scattered over a service area.
This is a little off topic (though not as much as you might think at first glance): here's why the nationwide utility test of THEIR SCADA systems on April 9th was so important.
Please; I'm not trying to rekindle old arguments -- "bad results were kept secret and not reported" -- which may or may not have merit. I'm simply addressing the protestations of D&G'ers in advance that it was just a "PR stunt."
It was in fact a very significant test. Here's my own little "insider's report" [g], which I'll be posting to my Web site in a few days:

The April 9th drill was designed to train utility people how to manually obtain critical system values [he's talking about bypassing SCADA and checking the RTU-level data manually] and communicate them to operations personnel through non-standard (other than good ol' Ma Bell) modes of communication. The drill verified that if all communications failed, utility personnel in the stations could manually control the system!!
Our drill verified all of our backup communications can enable at least my utility to control itself minus the SCADA system. All looks good so far, we'll see about the 9/9 event too. I'll be surprised if that doesn't pass uneventfully too.

THAT was the point of the test. This is significant because it addresses (generally) the contention from IT types that their computers are indispensable to company operations. In some cases, they are. But that doesn't apply across the board.
I'm getting the same type of feedback from oil company people. As soon as one will let me quote him/her, I'll post that at my Web site, too. :)
This is important because it addresses the argument: "not possible to do it manually, not enough personnel!" Sure; it's not possible in all cases. But the doomsday scenarios involving oil fields and electrical grids are a bit overstated (and that's being generous).

-- Stephen M. Poole, CET (smpoole7@bellsouth.net), April 28, 1999.
