Debate post # 2: Resiliency


The resiliency of organizations in the face of system failures

One of the main arguments of the Y2K optimists is that organizations are resilient with respect to system failures. They point out that systems fail frequently under normal circumstances, but are fixed in a reasonable amount of time and without causing major disasters.

Therefore, the argument proceeds, if it were true that Y2K system failures were similar in scope, frequency, and consequences to "ordinary" system failures, we should not expect any particularly unusual level of problems or disasters next year.

Let's examine one recent, well-publicized systems failure and see how well these arguments hold up. Of course, I'm referring to the

MCI frame relay debacle

Here are some quotes from an August 9, 1999, Inter@ctiveWeek story, written during the outage:

  1. Many MCI WorldCom customers across the country said they still do not have steady frame service a full 100 hours after the outage began. [emphasis added]

  2. No estimated time of repair has been indicated, and MCI WorldCom officials could not be reached for comment. [emphasis added]

Here's an August 11, 1999, Inter@ctiveWeek story giving further details about the cause of the outage:

  1. MCI WorldCom said that software in Lucent Technologies' networking equipment was the cause of the poor network performance and service interruption on the carrier's frame relay network that started last Thursday, Aug. 5. The carrier is now in the process of restoring failed connections in its network.

  2. According to MCI WorldCom spokeswoman Linda Laughlin, the carrier was upgrading a switch somewhere on the frame relay network when the device began to "experience congestion" and the initial outage began. Laughlin was unable to say where the faulty upgrade occurred or what the upgrade was for. [emphasis added]

  3. The network failure called into question the operating procedures of major carriers. MCI WorldCom did not warn its customers in advance that the upgrade was being conducted. [emphasis added]

Here are some quotes from an August 13, 1999, Inter@ctiveWeek article explaining the "domino effect" of this outage:

  1. MCI WorldCom's frame relay network was still down Aug. 13, and the 8-day-old outage was threatening to bring down a lot of the carrier's customers with it. [emphasis added]

  2. But the meltdown - which some observers speculated affected almost a third of the network and left 30 percent of its customer base without the high-speed data service - isn't just going to hurt MCI WorldCom customers. It will likely inflict a large financial burden on the company in terms of service givebacks and cash payouts.

  3. "They're gonna give us a hell of a lot of money by the time we're done with them," said William English, co-owner and chief financial officer at Onramp. The Internet service provider said it lost just about its entire subscriber base as customers abandoned the ISP for competing access companies. [emphasis added]

  4. The outage reminded many of AT&T's 22-hour frame relay outage in April. AT&T gave customers an average of five days' free service. [emphasis added]

  5. Last week, customers blasted the company's lack of communication and customer support. "I can't believe the way they're handling this; it's insane," said Robert Loughlin, chief technical officer of RMI.Net, a national backbone provider. He said about 20 percent of his ISP customers were affected by the outage. "Unfortunately, it takes a long time to change carriers." [emphasis added]

Here are some quotes from an August 16, 1999, PC Week article explaining that the problem was caused by bad software being loaded onto one of MCI's four frame relay networks:

  1. So far, the company has determined that the outage began as the result of flawed Lucent software being loaded onto hardware on one of MCI WorldCom's four distinct frame relay networks. The one specific network affected is one that MCI WorldCom, of Jackson, Miss., uses to service customers with requirements for international data circuits.
  2. That specific network serves 3,000 customers, including America Online Inc. and the Chicago Board of Trade, and consists of 300 different switches and switching nodes. Before the outage, the network had been certified by engineers at Lucent, of Murray Hill, N.J., as meeting all necessary operating parameters for a network of its scope, said MCI WorldCom officials. [emphasis added]

Finally, here are some quotes from an August 17, 1999, Reuters story explaining the situation after the outage was over:

  1. "The company did not know if all the customers on that particular network, one of four it operates, experienced problems, he said. " [emphasis added]

  2. On Aug. 13, MCI WorldCom shut down the network, removed the upgraded software and reinstalled the old software, also made by Lucent. The process, begun Saturday, was completed for domestic customers Sunday afternoon, Beaumont said. [emphasis added]

  3. Meanwhile, engineers from Lucent and its research and development arm, Bell Labs, have not yet identified the problem's source, Lucent spokesman Bill Price said. [emphasis added]

Summarizing the important points of the above debacle

  1. A single point of failure can bring down a large section of a complex network.
  2. It is not always easy to figure out exactly what is wrong, or even where the error is. Note that they still did not know exactly what caused the error more than a week after it occurred.
  3. Even software that has been "certified" by a large, well respected company like Lucent can fail disastrously when it is put into production.
  4. One of the main ways that errors are "worked around" is to revert to an earlier version of the software.
  5. As indicated by the AT&T problem, this is not an "isolated example" with no wider significance.
  6. The failure of one critical supplier can cause disaster to hundreds or thousands of businesses.
  7. You can't count on suppliers to tell you what they are planning to do to their systems, even if a failed "upgrade" could have disastrous results for you.
  8. You can't count on suppliers to tell you the status of their repair efforts, or to compensate you fairly for their failures without quibbling.
  9. It's not necessarily easy or fast to "work around" critical failures by switching to another supplier.
  10. Failures of suppliers to your suppliers (e.g., Lucent) can be just as devastating as failures by your direct suppliers.

The ramifications of these points with respect to Y2K problems

  1. What would have happened if this were the year 2000 and MCI didn't have a Y2K compliant version of the software to fall back on? Since they still don't know what the problem was in the software that failed, the answer is obvious: they would still be out of operation, and so would their customers.
  2. Testing is not equivalent to production. The software that MCI installed had been certified by Lucent, presumably after extensive testing. Yet it did not work in production. The significance of this is that every "Y2K ready" application in the world is going to go into production almost simultaneously when the clock rolls over. That is, no matter how much testing anyone has done, until their applications actually are receiving live data with dates in the year 2000, no one knows whether they will work properly in that circumstance, just as Lucent did not know that their software would not work when it was installed on MCI's switch.

Conclusions

A single problem of this sort can possibly be worked around by the customer, assuming alternate sources of supply. It can be worked around by the supplier if they have an older working version to fall back to. Multiple concurrent failures of critical services with no fall back available, as is likely to be the case next year, will be fatal to many organizations.

-- Anonymous, August 18, 1999

Answers

Nice formatting, by the way.

The MCI problem is a very good illustration of the types of problems that can be encountered during implementations of software into production. These are exactly the types of problems I discussed in the first post.

The problem is extrapolating these types of problems to Year 2000 problems on rollover. In my post, I was comparing relative error rates, and did not delve into the types of errors encountered. In fact, my original post heavily discounted implementation errors in favor of Year 2000 problems, in order to provide a large margin for error.

In actuality, the very opposite is true. Let's look at the MCI problem.

The software installed caused failures. The specific problem with the software is still to be uncovered. Virtually any part of the new software is vulnerable to failure; it all has to be examined. One quote in particular is very telling:

MCI Worldcom Blames Lucent For Outage

The intermittent outage, which began Friday, Aug. 5, was finally resolved yesterday, but only after MCI WorldCom shut down the affected portion of its frame relay network Saturday for nearly 24 hours of troubleshooting.

The fact that the errors were intermittent is a large clue. Intermittent errors are by far the most difficult to diagnose. These types of errors escape testing because they usually occur only under a very specific, unusual and unknown set of circumstances, or because of capacity constraints that were not duplicated in testing.

Your basic problem is the analogy that all systems are going into production on Jan 1, 2000. These systems have already been implemented, or are going into production, prior to Jan 1, 2000. The implementation problems are being worked through currently.

On Jan 1, 2000, a very specific and known circumstance will occur. And a very specific subset of the code is liable to failure. Because it is a very specific and known circumstance, testing can and does result in a precise view of the result of those circumstances, a point you acknowledged on the previous thread.

And even for errors that do occur, the cause will be far from unknown. Again, limiting the diagnosis to a very small subset of the actual code.

Taking this further, it's my opinion that the above is also the reason that virtually every instance of "Y2k-related" failures to date involves implementation errors, and not actual date failures. It is also one of the main reasons that early trigger dates, such as Jan 1, April 1, July 1, etc., failed to produce any noticeable failures.

It's not that date related failures didn't occur. Surveys indicate virtually every organization has already experienced Y2k errors. But because the errors are caused by a very specific set of easily duplicated circumstances, they can be diagnosed and fixed.

To summarize:

1) The MCI failure was intermittent in nature, a circumstance that often eludes testing. Intermittent errors are difficult to diagnose, because they tend to occur under an unknown set of circumstances. By contrast, the rollover to Jan 1, 2000, is a very specific and known circumstance, which can be, and is being, duplicated on thousands of test machines.

2) The MCI failure was caused by the implementation of new software. Besides the above, the root cause can be virtually anywhere in the code. In contrast, errors that occur on Jan 1, 2000 will be caused by a very small and well-known subset of the code, a subset Howard Rubin has found to be around 3%.

3) Applications are not going "into production" on rollover. A very small subset of the code will be put into a very specific and well-known circumstance.

4) In general, Year 2000 date failures cannot be compared to implementation errors for level of severity. Date-related failures are much easier to diagnose and fix than the general class of implementation errors. This is borne out by the fact that virtually every case of "Y2k related" failure to date was caused by implementation errors, and not by date errors, even though virtually every organization has already experienced date-related errors.

-- Anonymous, August 18, 1999


The software installed caused failures. The specific problem with the software is still to be uncovered. Virtually any part of the new software is vulnerable to failure; it all has to be examined.

Are you claiming that this software is completely new, and totally unrelated to the previous version of the same software from Lucent? If so, please provide some evidence for this statement. If not, I would be extremely surprised if Lucent did not have rigid change control procedures in place that would tell exactly what parts of the software had been changed and therefore were vulnerable to bugs. Of course, this undermines your arguments that this situation is very different from Y2K changes, where everything that is in danger of failing is clearly identified.

One quote in particular is very telling:

MCI Worldcom Blames Lucent For Outage

The intermittent outage, which began Friday, Aug. 5, was finally resolved yesterday, but only after MCI WorldCom shut down the affected portion of its frame relay network Saturday for nearly 24 hours of troubleshooting.

The fact that the errors were intermittent is a large clue. Intermittent errors are by far the most difficult to diagnose. These types of errors escape testing because they usually occur only under a very specific, unusual and unknown set of circumstances, or because of capacity constraints that were not duplicated in testing.

Your basic problem is the analogy that all systems are going into production on Jan 1, 2000. These systems have already been implemented, or are going into production, prior to Jan 1, 2000. The implementation problems are being worked through currently.

When the systems that are currently being tested finally encounter actual Y2K data, which will not be until next year, they'll be subjected to a large number of "very specific, unusual and unknown set of circumstances" as well as "capacity constraints that were not duplicated in testing". At that point, they'll be subject to failure just as every other application is when it is put back into production after being modified.

On Jan 1, 2000, a very specific and known circumstance will occur. And a very specific subset of the code is liable to failure. Because it is a very specific and known circumstance, testing can and does result in a precise view of the result of those circumstances, a point you acknowledged on the previous thread.

What I said was that the testing would give us some indication of the number and severity of failures, not that it would unearth all failures. Is that what you're claiming now?

And even for errors that do occur, the cause will be far from unknown. Again, limiting the diagnosis to a very small subset of the actual code.

Yes, the cause will be known: using dates after 2000. But that does not mean that the problems will be easy to fix. Remember, there are generally many, many date references in big programs that are date sensitive. Which one is causing the problem may not be obvious. In addition, you have acknowledged that there will be bad fixes and missed fixes, any one of which could cause a disaster like the MCI one.

Taking this further, it's my opinion that the above is also the reason that virtually every instance of "Y2k-related" failures to date involves implementation errors, and not actual date failures. It is also one of the main reasons that early trigger dates, such as Jan 1, April 1, July 1, etc., failed to produce any noticeable failures. It's not that date related failures didn't occur. Surveys indicate virtually every organization has already experienced Y2k errors. But because the errors are caused by a very specific set of easily duplicated circumstances, they can be diagnosed and fixed.

So far, the failures have affected a very small number of systems in each company, almost all of which are date look-ahead systems that are not needed for day-to-day operations. Once their daily production systems start failing, which will not be until next year, the situation will look very different.

To summarize: 1) The MCI failure was intermittent in nature, a circumstance that often eludes testing. Intermittent errors are difficult to diagnose, because they tend to occur under an unknown set of circumstances. By contrast, the rollover to Jan 1, 2000, is a very specific and known circumstance, which can be, and is being, duplicated on thousands of test machines.

Except for capacity and actual live data, which cannot be provided until rollover. Either of these could cause failures of the sort seen in the MCI debacle.

2) The MCI failure was caused by the implementation of new software. Besides the above, the root cause can be virtually anywhere in the code. In contrast, errors that occur on Jan 1, 2000 will be caused by a very small and well-known subset of the code, a subset Howard Rubin has found to be around 3%.

This is true only if this new software is actually new. However, that would be very unusual in such large systems that have been around for a long time. Do you have evidence that it is an actual rewrite, rather than a modification of existing software?

3) Applications are not going "into production" on rollover. A very small subset of the code will be put into a very specific and well-known circumstance.

All of the code will be subjected to the effects of live data with dates after the year 2000 for the first time in January. Whether due to errors in the code in the same application, or errors in data being fed in from outside, it is very likely that unanticipated failures will occur, as neither of these error sources can be tested exhaustively.

4) In general, Year 2000 date failures cannot be compared to implementation errors for level of severity. Date-related failures are much easier to diagnose and fix than the general class of implementation errors. This is borne out by the fact that virtually every case of "Y2k related" failure to date was caused by implementation errors, and not by date errors, even though virtually every organization has already experienced date-related errors.

No, the reason that virtually every "Y2K related" failure to date was caused by implementation errors is much simpler: it isn't 2000 yet. Therefore, no live data with 2000 dates has been handled by production code as yet. Next year, we'll see how well (or poorly) the remediation has really been done. I don't look forward to that.

-- Anonymous, August 18, 1999


Are you claiming that this software is completely new, and totally unrelated to the previous version of the same software from Lucent? If so, please provide some evidence for this statement. If not, I would be extremely surprised if Lucent did not have rigid change control procedures in place that would tell exactly what parts of the software had been changed and therefore were vulnerable to bugs. Of course, this undermines your arguments that this situation is very different from Y2K changes, where everything that is in danger of failing is clearly identified.

No, Steve, what I'm saying is that apparently MCI WorldCom was installing Lucent switches to replace a hodge-podge of equipment from other manufacturers. From http://www.techweb.com:80/wire/story/TWB19990812S0013:

Malone described the subnetworks that comprise MCI WorldCom's frame relay backbone as "somewhat of a patchwork," given all the equipment vendors involved.

Domestically, MCI uses BNX switches from Nortel. WorldCom, prior to its merger with MCI, built its frame-relay network with equipment from companies that ultimately were acquired by Cisco and Lucent. The carrier is upgrading the whole network with Lucent switches.

So it is really immaterial what types of change controls were in place for Lucent. It was being installed in a new environment, replacing other, non-Lucent systems.

When the systems that are currently being tested finally encounter actual Y2K data, which will not be until next year, they'll be subjected to a large number of "very specific, unusual and unknown set of circumstances" as well as "capacity constraints that were not duplicated in testing". At that point, they'll be subject to failure just as every other application is when it is put back into production after being modified.

How so, Steve? If an application is tested through Year 2000 dates, what will those unknown circumstances be? How will they be unusual, when the testing has already run through the Year 2000?

As for capacity constraints, yes, they cannot usually be tested fully. But they are tested when the program is reimplemented into production, which is happening currently. Or are you somehow saying that date-related code is somehow more likely to fail the more it's executed? If so, do you have any examples?

What I said was that the testing would give us some indication of the number and severity of failures, not that it would unearth all failures. Is that what you're claiming now?

No. I've yet to see a fully complete set of test scenarios, anywhere.

What I am saying is that a system that has gone through Y2k remediation and testing has had the major scenarios tested. That the system will not fall over based solely on the processing of current dates in 2000. And that other scenarios can be fixed in relatively short time.

Yes, the cause will be known: using dates after 2000. But that does not mean that the problems will be easy to fix. Remember, there are generally many, many date references in big programs that are date sensitive. Which one is causing the problem may not be obvious. In addition, you have acknowledged that there will be bad fixes and missed fixes, any one of which could cause a disaster like the MCI one.

Rubin estimates the number of date-related references at 3% of the code. But come on, Steve. I assume you've debugged programs in the past. It is not a matter of looking at all date references in the program. Specific failures follow specific logic paths.

This process becomes much easier once the error is found. These are systems that have been through remediation and testing, so a failure indicates a set of circumstances not tested. Invariably, fixing a known error is much easier than finding the potential error in the first place.

As for bad fixes, yes, they are present. But these are just as likely to occur on any day from implementation, as they are on Jan 1, 2000.

And again, the MCI "disaster" was apparently not caused by a bad fix, but by the introduction of new systems. An implementation.

So far, the failures have affected a very small number of systems in each company, almost all of which are date look-ahead systems that are not needed for day-to-day operations. Once their daily production systems start failing, which will not be until next year, the situation will look very different.

I don't know, Steve. GartnerGroup doesn't apparently agree with you on the severity of "look-ahead" problems. From their recent report:

Year 2000 World Status, 2Q99: The Final Countdown

Failures in 1999 will be due to:

Failures in 1999 are likely to have a higher impact on the enterprise due to more customer-facing transactions and services being affected.

Failures in 2000 will be due to:

The business impact in 2000 may be less than in 1999, since more reporting-type systems and fewer customer-facing systems are likely to be affected. Failure support experience should be greater, and this should aid in reducing failure recovery time.

-------------

Except for capacity and actual live data, which cannot be provided until rollover. Either of these could cause failures of the sort seen in the MCI debacle.

Again, if you can provide an example of date-related code that is more prone to fail the more it is executed, I'll be happy to consider "capacity".

Actual "live" data may not be available. But certainly, "aged" data is available and used in testing.

All of the code will be subjected to the effects of live data with dates after the year 2000 for the first time in January. Whether due to errors in the code in the same application, or errors in data being fed in from outside, it is very likely that unanticipated failures will occur, as neither of these error sources can be tested exhaustively.

Again, data can and is aged to support testing.

As for invalid data from other sources, applications have always had to handle this problem with interfaces.

No, the reason that virtually every "Y2K related" failure to date was caused by implementation errors is much simpler: it isn't 2000 yet. Therefore, no live data with 2000 dates has been handled by production code as yet. Next year, we'll see how well (or poorly) the remediation has really been done. I don't look forward to that.

"Live" data with Year 2000 dates is being processed, and in some industries has been processed for quite some time. What will change on Jan 1, 2000 is the "current-date" moving to the year 2000. A situation being reproduced on thousands of test systems, even as we speak.

-- Anonymous, August 18, 1999


No, Steve, what I'm saying is that apparently MCI WorldCom was installing Lucent switches to replace a hodge-podge of equipment from other manufacturers. From http://www.techweb.com:80/wire/story/TWB19990812S0013: Malone described the subnetworks that comprise MCI WorldCom's frame relay backbone as "somewhat of a patchwork," given all the equipment vendors involved. Domestically, MCI uses BNX switches from Nortel. WorldCom, prior to its merger with MCI, built its frame-relay network with equipment from companies that ultimately were acquired by Cisco and Lucent. The carrier is upgrading the whole network with Lucent switches.

So it is really immaterial what types of change controls were in place for Lucent. It was being installed in a new environment, replacing other, non-Lucent systems.

All the press reports I've seen say that they got back online by falling back to older software also made by Lucent, as stated in the following passage:
On Aug. 13, MCI WorldCom shut down the network, removed the upgraded software and reinstalled the old software, also made by Lucent.

If that isn't what they did, how did they get the system running again?

How so, Steve? If an application is tested through Year 2000 dates, what will those unknown circumstances be? How will they be unusual, when the testing has already run through the Year 2000?

You want me to give you a list of what the unknown circumstances will be? If I could do that, I wouldn't be spending my time posting on this forum; I'd be making billions in the stock market.

How will they be unusual? Because testing is not the same as production.

As for capacity constraints, yes, they cannot usually be tested fully. But they are tested when the program is reimplemented into production, which is happening currently. Or are you somehow saying that date-related code is somehow more likely to fail the more it's executed? If so, do you have any examples?

No, I'm not saying that date related code is somehow more likely to fail the more it is executed. I'm saying that code is more likely to fail the higher the load it is subjected to.

No. I've yet to see a fully complete set of test scenarios, anywhere. What I am saying is that a system that has gone through Y2k remediation and testing has had the major scenarios tested. That the system will not fall over based solely on the processing of current dates in 2000. And that other scenarios can be fixed in relatively short time.

I'm sure the people that wrote the Lucent code thought that its major scenarios were also tested, that it would not fall over when used in an actual system, and that it could be fixed in a relatively short period of time. They were wrong. Please explain why they were wrong.

Rubin estimates the number of date-related references at 3% of the code. But come on, Steve. I assume you've debugged programs in the past. It is not a matter of looking at all date references in the program. Specific failures follow specific logic paths. This process becomes much easier once the error is found. These are systems that have been through remediation and testing, so a failure indicates a set of circumstances not tested. Invariably, fixing a known error is much easier than finding the potential error in the first place.

Again, your analysis applies equally well to the MCI situation. Please explain why Lucent didn't fix their code quickly, and in fact why they still have not fixed it.

As for bad fixes, yes, they are present. But these are just as likely to occur on any day from implementation, as they are on Jan 1, 2000.

Not if the errors are exposed by processing current and past dates in the year 2000, which has not yet happened in production.

And again, the MCI "disaster" was apparently not caused by a bad fix, but by the introduction of new systems. An implementation.

Oh, it was a disaster all right, if you are one of the ones affected. At least, if my company were put out of business or lost most of its customers because of someone else's error, I would consider it a disaster. Wouldn't you? As to whether it was a new system or bad fix, we don't know that; all the press accounts other than the one you've quoted call it a new version, which would mean a bad fix by my definition.

I don't know, Steve. GartnerGroup doesn't apparently agree with you on the severity of "look-ahead" problems. From their recent report:

Year 2000 World Status, 2Q99: The Final Countdown

Failures in 1999 will be due to:

Some fixed and tested solutions, since 5 percent to 9 percent of remediated lines of code still have defects after testing

Larger volumes of transactions using date-forward calculations

Non-fixed-symbol dates in source code used as processing actions

Enterprises that have not yet completed remediation but are continuing to execute more date-forward transactions later in 1999

A large number of companies entering their fiscal 2000 and processing more "00" dates

So far, so good.

Failures in 2000 will be due to:

Typical business transactions being run using defective code, while some will be run for the first time at various periods later in 2000

Frozen applications put into production for the first time during 2000

Some commercial packaged software proving to be noncompliant in later released versions

Are they saying that commercial software vendors don't know whether their code is compliant? Or are they saying that the vendors are lying about it? Either way, it doesn't look too good for the victims, I mean customers.

Some commercial packaged software not yet year-2000-compliant

Improper windowing-remediation logic or time window already expiring

I guess some people haven't done their testing on the time machine very thoroughly.

Running noncompliant archived data on compliant systems

Oops, that's another serious problem. Has anyone upgraded all their data to be compliant?

Running code that had little or no testing done on it

Wait a minute. How could that happen? I thought testing with a time machine had smoked out most of the potential bugs, so such failures wouldn't happen any more than they do now. I guess they must not agree with you on that.

Failures in 1999 are likely to have a higher impact on the enterprise due to more customer-facing transactions and services being affected.

The business impact in 2000 may be less than in 1999, since more reporting-type systems and fewer customer-facing systems are likely to be affected.

This makes absolutely no sense. Have they somehow exchanged 1999 and 2000 in these summary paragraphs? Of course, it is true that many companies have already been handling forward dates after 1999; the problem is handling present and past dates after 1999. Otherwise, banks, mortgage companies, and insurance companies would have been finished with their year 2000 problem in 1970 or earlier, when they first started processing future dates after 1999. Clearly, this has not been the case, because they did not remediate their entire applications to handle present and past dates after 1999, just the parts of the application that dealt with future dates.
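
To make the distinction concrete, here is a minimal, hypothetical illustration (Python; the function names and figures are invented) of why forward-looking processing forced fixes decades ago, while current-date processing only breaks once the clock itself reads "00".

    def years_to_maturity(maturity_yy, current_yy):
        # Unremediated two-digit arithmetic on a *future* date.
        return maturity_yy - current_yy

    # Look-ahead processing: a 30-year instrument written in 1975 already
    # forced this path to face a post-1999 date long ago...
    print(years_to_maturity(5, 75))     # '05' - '75' = -70 (wrong, fixed years ago)

    def account_age_years(opened_yy, current_yy):
        # Unremediated two-digit arithmetic on the *current* date.
        return current_yy - opened_yy

    # ...whereas the current-date path keeps giving sensible answers
    # right up until the clock itself rolls over:
    print(account_age_years(85, 99))    # in 1999: 14, correct
    print(account_age_years(85, 0))     # in 2000: -85, wrong -- fails only now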

Again, if you can provide an example of date-related code that is more prone to fail the more it is executed, I'll be happy to consider "capacity".

As I've already pointed out, it is not date related code, but all code that is more likely to fail under higher load. That's why we have capacity testing.
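
As a much-simplified, hypothetical illustration of that point (nothing here is drawn from MCI's or Lucent's actual equipment), a component can pass every functional test and still silently shed work once traffic exceeds what the tests exercised:

    from collections import deque

    class BoundedQueue:
        # Toy message queue with fixed capacity, standing in for any
        # buffer inside a switch, interface, or batch job.
        def __init__(self, capacity):
            self.capacity = capacity
            self.items = deque()
            self.dropped = 0

        def offer(self, item):
            if len(self.items) >= self.capacity:
                self.dropped += 1          # under overload, work is silently lost
                return False
            self.items.append(item)
            return True

    q = BoundedQueue(capacity=100)
    accepted_at_test_load = sum(q.offer(i) for i in range(80))     # 80: all fine
    q.items.clear()
    accepted_at_peak_load = sum(q.offer(i) for i in range(1000))   # only 100
    print(accepted_at_test_load, accepted_at_peak_load, q.dropped) # 80 100 900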

Actual "live" data may not be available. But certainly, "aged" data is available and used in testing.

Again, testing is not the same as production. If they were the same, then the new Lucent software would have worked for MCI.

As for invalid data from other sources, applications have always had to handle this problem with interfaces.

Yes, but the magnitude of invalid data from untested or poorly tested programs is likely to be far greater next year than it is this year or has been at any time in the past. See your quote from the Gartner group above about problems with poorly tested programs next year.

"Live" data with Year 2000 dates is being processed, and in some industries has been processed for quite some time. What will change on Jan 1, 2000 is the "current-date" moving to the year 2000. A situation being reproduced on thousands of test systems, even as we speak.

I've already covered the reason why we can expect a higher likelihood of errors once the current date or a past date is in the year 2000, above. As for the test systems, testing is not the same as production.

-- Anonymous, August 18, 1999


Steve, even the article you cite states:

About four weeks ago, the company installed new software, made by Lucent Technologies Inc. (NYSE:LU) to allow the network to support additional customers and services.

This doesn't sound like a simple upgrade, or a bad fix; at the least, it sounds like a major release upgrade, where large portions of the code are modified.

Or maybe it has nothing to do with the code at all. Another article states:

A Lucent spokeswoman said the software in question has been deployed successfully on other networks and suggested that the problem was caused by the installation process rather than the software itself.

Yet another variable of system implementations that has no relationship whatsoever with Year 2000 date failures.

You asked numerous times for me to explain the MCI situation. I have found sources that suggest that this is not just a simple upgrade; in fact, it may have nothing to do with the code at all. All again illustrating the complexity and sheer number of variables that are involved in a software implementation, complexity and variables that are not present when a small subset of the code begins processing year 2000 dates. But then, this is your example. It would seem to me that it is up to you to explain how the cause of this error is similar to an error caused when statements, currently functioning in production, begin processing dates in the Year 2000.

Your statements that all code is more likely to fail due to capacity problems are somewhat interesting. Queue processing I can understand; queues can overflow. But for the life of me, I just can't understand how statements processing dates, currently working under production capacity, will somehow fail when the year becomes 2000, because of the number of times they're executed. Are they going to be executed more in the year 2000? Again, do you have any examples?

No, testing is not the same as production. No doubt, things will be missed, even in tested applications. As they always are, and as they always will be. But the fact is, year 2000 errors are trivial in comparison to other types of potential errors. No one that I know of has ever denied this. The problem has always been finding all the potential errors; a situation that is remedied when the error actually occurs.

-- Anonymous, August 18, 1999



Steve, even the article you cite states:

About four weeks ago, the company installed new software, made by Lucent Technologies Inc. (NYSE:LU) to allow the network to support additional customers and services.

This doesn't sound like a simple upgrade, or a bad fix; at the least, it sounds like a major release upgrade, where large portions of the code are modified.

Even if large portions of the code are modified, Lucent should know which ones those are, and which ones could possibly be at fault. Shouldn't they?

Or maybe it has nothing to do with the code at all. Another article states: A Lucent spokeswoman said the software in question has been deployed successfully on other networks and suggested that the problem was caused by the installation process rather than the software itself. Yet another variable of system implementations that has no relationship whatsoever with Year 2000 date failures.

The very fact that Lucent and MCI can't even agree what the problem might be is, to me, a telling point. I can't be optimistic about systems problems in the face of this finger-pointing exercise. However, it doesn't seem to bother you, for some reason. Could you explain this?

You asked numerous times for me to explain the MCI situation. I have found sources that suggest that this is not just a simple upgrade; in fact, it may have nothing to do with the code at all. All again illustrating the complexity and sheer number of variables that are involved in a software implementation, complexity and variables that are not present when a small subset of the code begins processing year 2000 dates.

I'm not aware of any programs in which only a small subset of the code processes year 2000 dates. The whole system has to process dates correctly. The fact that only a few percent of the code has been changed does not invalidate this fact.

But then, this is your example. It would seem to me that it is up to you to explain how the cause of this error is similar to an error caused when statements, currently functioning in production, begin processing dates in the Year 2000.

It is similar in that a system has been put into production that was thought to work in the laboratory, i.e., in testing. Since none of the current and past date handling in production code involves dates after the end of 1999, all such date handling up till now has been done only in testing, not in production. That code will go into production with respect to year 2000 issues in January, and not before. Therefore, I expect a significant number of failures as a result of errors in date handling for current and past dates in January.

Your statements that all code is more likely to fail due to capacity problems are somewhat interesting. Queue processing I can understand; queues can overflow. But for the life of me, I just can't understand how statements processing dates, currently working under production capacity, will somehow fail when the year becomes 2000, because of the number of times they're executed. Are they going to be executed more in the year 2000? Again, do you have any examples?

Yes, I can give you some examples. First, almost all year 2000 remediation involves either windowing or date expansion. Windowing requires extra CPU cycles compared to the original code which just used the two digit date directly. Date expansion, on the other hand, increases the amount of file I/O that has to be performed. Both of these obviously increase the load on the processor.

I also expect a great increase in the number of rejected transactions due to flaws in applications from which data is imported. This will exercise paths that as a rule take much more CPU time and I/O than normal processing does.

I'm sure there are a number of other possible ways that capacity could be strained by year 2000 issues, but I think those are enough to indicate that this is a legitimate concern.
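
For readers who haven't seen the two remediation techniques named above, here is a minimal, hypothetical sketch of windowing and date expansion (Python; the pivot year and record layout are invented for illustration):

    PIVOT = 30   # hypothetical pivot year for the window

    def window_year(yy):
        # Windowing: keep two-digit storage, infer the century at run time.
        # Here 00-29 are read as 20xx and 30-99 as 19xx.
        return 2000 + yy if yy < PIVOT else 1900 + yy

    def expand_record(record):
        # Date expansion: widen the stored field itself from YY to YYYY,
        # which changes the record layout (and the files that hold it).
        expanded = {k: v for k, v in record.items() if k != "yy"}
        expanded["year"] = window_year(record["yy"])
        return expanded

    print(window_year(99))                       # 1999
    print(window_year(5))                        # 2005
    print(expand_record({"acct": 1, "yy": 2}))   # {'acct': 1, 'year': 2002}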

No, testing is not the same as production. No doubt, things will be missed, even in tested applications. As they always are, and as they always will be. But the fact is, year 2000 errors are trivial in comparison to other types of potential errors. No one that I know of has ever denied this.

I deny it. Now you know someone who has denied it.

The problem has always been finding all the potential errors; a situation that is remedied when the error actually occurs.

This may be true if the number of errors is small enough that you can isolate which one is causing what problem. In the presence of a large number of errors, analysis is very difficult.

And what about the MCI problem? The error occurred. They haven't fixed it. They had to back out the new software instead. What will they do next year, if the old software isn't Y2K compliant?

-- Anonymous, August 19, 1999


Even if large portions of the code are modified, Lucent should know which ones those are, and which ones could possibly be at fault. Shouldn't they?

The very fact that Lucent and MCI can't even agree what the problem might be is, to me, a telling point. I can't be optimistic about systems problems in the face of this finger-pointing exercise. However, it doesn't seem to bother you, for some reason. Could you explain this?

Again, my problem with your whole example is that you take an error with a system implementation and use it as an example of year 2000 problems.

You lump everything together, and simplistically label them as systems problems. Year 2000 errors are a trivial, if pervasive, subset of system problems.

Nothing in the MCI example remotely resembles the types of problems encountered with Year 2000 dates. The potential causes of the MCI problem run the gamut from an invalid installation, to an inability to cope with the MCI configuration, to an inability to cope with the MCI capacity, to actual coding errors. The sheer number of variables and complexity makes it difficult to diagnose and resolve the error. So no, it doesn't surprise me that the problem hasn't been resolved. If it was, in fact, a faulty installation, I would not expect to ever see a resolution.

Yes, I can give you some examples. First, almost all year 2000 remediation involves either windowing or date expansion. Windowing requires extra CPU cycles compared to the original code which just used the two digit date directly. Date expansion, on the other hand, increases the amount of file I/O that has to be performed. Both of these obviously increase the load on the processor.

I also expect a great increase in the number of rejected transactions due to flaws in applications from which data is imported. This will exercise paths that as a rule take much more CPU time and I/O than normal processing does.

I'm sure there are a number of other possible ways that capacity could be strained by year 2000 issues, but I think those are enough to indicate that this is a legitimate concern.

We are talking of programs tested and already in production, right? Windowing is already occurring, and requires the same CPU cycles whether the year begins with a '19' or '20'. Date expansion has already occurred, and requires no more I/O or CPU whether the year begins with '19' or '20'.

While the number of rejected transactions may increase, these typically stop the processing. The exact opposite of your statement is true; CPU and I/O used in processing a transaction to completion greatly exceed the CPU and I/O required when the transaction is stopped due to errors.

I deny it. Now you know someone who has denied it.

I'll leave this one alone, for the sake of keeping the level of discussion at a somewhat higher level.

This may be true if the number of errors is small enough that you can isolate which one is causing what problem. In the presence of a large number of errors, analysis is very difficult.

And what about the MCI problem? The error occurred. They haven't fixed it. They had to back out the new software instead. What will they do next year, if the old software isn't Y2K compliant?

And again, installing new software in no way compares to Year 2000 errors. For errors that do occur, they will take whatever time is required to resolve the error. In the MCI case, my guess is they would have tried a complete re-install of the software. The fact is, there may be absolutely nothing wrong with the actual code.

Comparing the problem resolution of a system implementation to errors that occur in programs remediated, tested, and running in production, when the year changes to 2000, is invalid.

Again, this is borne out by the fact that virtually every "Y2k related" system failure has been due to implementations rather than errors in date processing. Even though virtually every organization has encountered date processing errors. Errors in "forward looking" processing. Errors that GartnerGroup expects to have even more of an impact than errors when the year actually rolls to 2000.

-- Anonymous, August 19, 1999


We are talking of programs tested and already in production, right? Windowing is already occurring, and requires the same CPU cycles whether the year begins with a '19' or '20'. Date expansion has already occurred, and requires no more I/O or CPU whether the year begins with '19' or '20'.

Good point. I guess I shouldn't have tried to come up with examples off the top of my head without spending a little more time on them.

While the number of rejected transactions may increase, these typically stop the processing. The exact opposite of your statement is true; CPU and I/O used in processing a transaction to completion greatly exceed the CPU and I/O required when the transaction is stopped due to errors.

That's not the way any production program I've ever seen has worked. Exceptions are always logged, which typically takes a lot more I/O than normal processing. But maybe my experience is atypical and most people just throw errors on the floor. But doesn't that make it harder to trace the problem?

I deny it. Now you know someone who has denied it.

I'll leave this one alone, for the sake of keeping the level of discussion at a somewhat higher level.

I guess I should clarify that statement. Of course any particular Y2K problem is trivial. In aggregate, however, I don't consider them trivial.

And again, installing new software in no way compares to Year 2000 errors. For errors that do occur, they will take whatever time is required to resolve the error. In the MCI case, my guess is they would have tried a complete re-install of the software. The fact is, there may be absolutely nothing wrong with the actual code.

Comparing the problem resolution of a system implementation to errors that occur in programs remediated, tested, and running in production, when the year changes to 2000, is invalid.

Again, this is borne out by the fact that virtually every "Y2k related" system failure has been due to implementations rather than errors in date processing. Even though virtually every organization has encountered date processing errors. Errors in "forward looking" processing. Errors that GartnerGroup expects to have even more of an impact than errors when the year actually rolls to 2000.

I don't think we're going to get any farther with this. If it's okay with you, I think we should wrap up this "round". As I went first in summarizing in the first round, it should be your turn to do so here.

-- Anonymous, August 19, 1999


The MCI problems provide a perfect example of the types of problems encountered during the implementation of new software. Exactly the types of errors I discussed in comparing the relative error rates we are experiencing currently, to expected error rates for the Year 2000 rollover.

It allowed an examination of the relative severity of the errors, a topic I purposely discounted in my first post.

Unfortunately, it is a very poor example of the types of errors that can be expected when Y2k remediated and tested applications encounter the Year 2000.

Consider. The implementation of new software can encounter errors from a multitude of sources:

1) Invalid Installation. Software can encounter physical errors during copying. The installation can overlay files in use by other software. Parameters may be set improperly. And this is just a partial list.
2) Differing configurations of systems. For example, Lucent claims the software is up and running at other installations. The software may not correctly handle the configuration of systems specific to the MCI installation.
3) Capacity problems. As Steve rightly points out, testing is not the same as production. One large area is capacity. Testing, even "stress testing", can only partially duplicate the actual stress of production.
4) On top of all that, you have the actual code. This does not appear to be even the simplified case where a system is running in production, and a set of modifications are applied.

All of the above, and more, could be the source of the problems MCI encountered. And none have any relationship to what can be expected when applications already installed in production, encounter the Year 2000:

1) These applications have already been installed and are running in production. Any invalid installations have been worked through.
2) The software is again installed and functioning. It can handle the configuration.
3) Again, the software is already handling production capacity.

What will happen is that a subset of the code will be subject to failure. But unlike the MCI case, the root cause of the failure will be far from unknown.

In summary, the problems encountered during software implementations far exceed those that can be expected from remediated and tested applications on rollover. Again, my first point compared relative error rates, while heavily discounting the severity of implementation errors.

In actuality, the opposite is true. Implementation errors far exceed expected Year 2000 errors in severity and difficulty of diagnosis, due to the complexity and sheer number of variables involved.

These observations are borne out by our experience to date. Virtually every instance of "Y2k Related" system failure has been caused by implementation problems. Very few, if any, have actually involved failures with processing "forward looking" dates.

It's not that actual "date" problems haven't occurred. They have. Surveys indicate virtually every organization to date have experienced these types of errors.

It's not that "forward looking" date processing is any less important. Indeed, GartnerGroup states the exact opposite, that they expect forward looking errors to have an even larger impact than "current" and "backward" looking processing.

But because implementations involve such a high number of variables and so much complexity, it is just far more likely that these will go unresolved longer, and as such reach the point where they can truly be labelled a "failure".

Comparing implementation errors to Year 2000 errors, though, is completely invalid.

-- Anonymous, August 20, 1999


Was that your summary, or a response to my previous post? In other words, do you want to continue or are you finished with your presentation on this issue?

-- Anonymous, August 21, 1999


I'm done.

-- Anonymous, August 21, 1999

Here's my summary of what this MCI problem tells us that can be extrapolated to give us some insight into Y2K.
  1. When a serious software problem occurs involving more than one company, a great deal of finger-pointing will follow.
  2. According to at least one press report, Lucent had "certified" the software that failed. If this is true, then we can see that even large companies can claim something is Y2K ready when in fact it is not.
  3. Press reports will often contradict one another, leaving us without the information we would need to determine what actually happened.
  4. Software problems are often not fixed in "oh, two or three hours", as Cory Hamasaki likes to put it. They can often go on for days or weeks, even when everything else is working.
  5. In the case of serious difficulties in updating or installing a piece of software, the "solution" is often to fall back to a previous version of the software. Of course, this will not work after January 1st if the previous version was not Y2K compliant.
  6. If you have critical suppliers, the failure of one of them can bring your business to a standstill.
  7. It is not always feasible to change to a new supplier even if the failure of your current supplier is endangering your business.
  8. You can't count on your suppliers to tell you what they're going to do, even if you could be severely affected by problems they encounter.
  9. A failure by one of your supplier's suppliers can affect you just as much as a failure by one of your direct suppliers.

Anyone who can understand these points and not be concerned enough to prepare for potential serious disruptions next year is beyond my ability to convince of anything.

The same is true, only much more so, for the implications of the recent Navy infrastructure report, as discussed in the Washington Post. For these reasons, and because of my other obligations, this will be my last participation in the "Great Y2K Debates". Thank you for your attention.

-- Anonymous, August 21, 1999

