Mobile Order system shuts down for a Week after Y2K Upgrade


This just in:

http://www2.idg.com.au/CWT1997.nsf/09e1552169f2a5dcca2564610027fd24/85ba098f1533fe794a256798001b2cc5?OpenDocument

-- Slammer (slammer@Nowhere.com), June 23, 1999

Answers

Yes indeed... It appears the failure rate has started to rise. If these last two weeks are any indication, the ride's starting to get a little rougher. In addition to the major sewer spill and Mobile's failure (just reported now even though it happened last month, by the way), there are two more failures reported on the Y2K Today site. The first deals with the failure of a 911 system in Pueblo. The second, however, is somewhat ironic. The article is as follows:

County computer listing big payoffs by mistake

By Brian Werth, Herald-Times Business Editor

Fred Rice was shocked, happy and bemused - in that order - when he received a check for $49 million and one cent last week from the Monroe County prosecutor's check deception office.

Wrecker service owner Gary Koontz got a check for $50 million. He thought it was hilarious.

No, the county government hasn't started a local version of the Hoosier Lotto. Instead, a new computer in the check deception office went haywire and began spewing out deceptive eight-figure checks.

Other businesses also may have received - or may yet get - checks for millions of dollars from Monroe County.

The checks have the correct amount in the numerical spot on the upper right hand of the Monroe County bank checks - in this case $50 and $49.01. But they have the words "fifty million" and "forty-nine million" on the center line.

The mistakes are the result of a computer error generated by new software that was installed to fix the so-called Y2K problem, said Monroe County Prosecutor Carl Salzmann.

"We went over to a new computer last week," he said. "We checked the numbers before they went out, but we obviously didn't check the words. The computer guy is coming to fix it. It is ironic that these checks are coming from the check deception department."

Salzmann said that office typically sends out thousands of dollars a month to various businesses that received bad checks. The prosecutor's office tracks down the bad check and gets the money back for the businesses.

Anyone who receives a check for millions should return it to the prosecutor's office and a new check with the correct amount will be issued.

"I knew it was a big mistake, but I told them I wanted to have some fun with it first," Rice said. "It's almost enough to retire on, isn't it?"

Koontz said he was thinking of having the check framed and put up in his office.

"I dearly love this," Koontz said. "It's from a bad check from last October. So it was particularly surprising to get a check for $50 million out of the blue. I told my wife we were millionaires for a day."

Business Editor Brian Werth can be reached at 331-4375 or by e-mail at

-- Slammer (Slammer@Nowhere.Com), June 23, 1999.


The million dollar errors they will find; the ten thousand dollar errors and the one thousand dollar errors they will probably never know about......

It's starting - these are typical examples of errors - bumps in the road, people, just bumps in the road. Times many million.

-- Robert A. Cook, PE (Kennesaw, GA) (cook.r@csaatl.com), June 23, 1999.


Slammer,

Good story. Notice, the key here was to pull the new system out and put the old one back in. *Not* take a day or two and correct the new system. But after 1/1/2000 this will not be an option. Stuff like this is what will be causing all the trouble, all over the place.

-- Gordon (gpconnolly@aol.com), June 23, 1999.


Gordon, yes, as you suggest, it will not be possible when the event happens to perform the remediation that they did in Melbourne. I expect that as more evidence presents itself we will see a pattern emerging as we get nearer the epicenter of the event (we are already inside the event horizon). The approach to the Year 2000 rollover will continue to take the form of implementation, backout and remediation, then reimplementation. This will kind of soften the effect on the public and allow trial-and-error remediation on these forward-looking systems to occur. This is expected, and the spin machine will handle much of the fallout. But hopefully enough of this stuff will happen to continue to move everyone to make preparations for the inevitable. As you said, the difference between these failures and the big enchilada is that, for now, there continues to be a backout path. Without one, the failures at year 2000 will be much more devastating even though there may be fewer of them, so the fallout will be much greater. But I only state the obvious.

-- Slammer (Slammer@Nowhere.com), June 23, 1999.

Well, while you all extrapolate from this to the Y2k rollover, bear in mind that this was a "system implementation", as opposed to the rollover, where specific "date" errors will need to be addressed. There is really no comparison between the myriad of errors that can and do occur when implementing a new system, and dealing with a specific subset of date-related errors in an otherwise functioning system.

-- Hoffmeister (hoff_meister@my-deja.com), June 23, 1999.


Robert... how about blocks in the road? Just a few million little building blocks for that wall we're gonna slam into at full speed!!!

Mike ==================================================================

-- Michael Taylor (mtdesign3@aol.com), June 23, 1999.


Yes Slammer...folks think these errors are amusing right now. But when they become the norm, confidence in "the system of systems" will collapse. Then the panic begins.

Gartner says y2k "starts" in July...

-- a (a@a.a), June 23, 1999.


a, you mean, like, Y2K starts next Thursday?

(by the way, Maria thinks ..."She and a have become very close friends, maybe a little too close." So, until the rollover, you & I mustn't be seen together again.)

-- lisa (lisa@work.now), June 23, 1999.


wow... a real y2k romance? what a great idea for a song...

a and lisa sitting in a tree

picking out provisions

doomers they be

first comes love

then comes marriage

better get a y2k compliant baby carriage...

roflmao.

Mike ===================================================================

-- Michael Taylor (mtdesign3@aol.com), June 23, 1999.


Whoa, whoa, Mike - a's married.

I just think it's funny that Maria believes a will doubly infect me with this godawful meme I've already contracted....

-- lisa (lisa@work.now), June 23, 1999.



"...and a year for testing."

-- jor-el (jor-el@krypton.uni), June 23, 1999.

Hoffmeister,

What was that you were mumbling about? I didn't quite catch it.

-- Gordon (gpconnolly@aol.com), June 24, 1999.


Hoffy,

Under normal circumstances you would be right, by any stretch of the imagination. However, something seems to stand out clearly here which does relate this to Y2K and allows it to be counted in that category. In the case of Mobile, Y2K was specifically mentioned by the company as the source of the failure. IMHO, this would indicate that the system failed on a date-related issue relating to the rollover and was unable to process information, i.e., a variation of the Jo Anne Effect. In any event, Y2K errors in supposedly fixed systems are still Y2K errors, and should not be viewed in the same sense as normal system implementation failures, because the context of this was Y2K, which was the reason for the upgrade and for the failure.
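For readers who haven't run into it, the "Jo Anne Effect" Slammer invokes is the class of fiscal-year look-ahead failures that bite before the calendar rollover, as soon as a system has to compute a period that spills into 2000. A minimal sketch of that failure mode, with purely hypothetical function names and two-digit-year logic (nothing here comes from the Mobile system itself), might look like this:

```python
# Hypothetical illustration of a "Jo Anne Effect" style failure: a routine
# using two-digit years works until the fiscal year it must compute spans
# into 2000, at which point "00" compares as earlier than "99".

def fiscal_year_end_2digit(start_yy: int) -> int:
    """Two-digit year in which a 12-month fiscal year ends (wraps at 99)."""
    return (start_yy + 1) % 100   # 98 -> 99, but 99 -> 0 ("00")

def periods_in_fiscal_year(start_yy: int) -> list[int]:
    """Naively enumerate the period years; breaks when the end year wraps."""
    end_yy = fiscal_year_end_2digit(start_yy)
    if end_yy < start_yy:
        # The look-ahead date appears to lie in the past, so the
        # transaction is rejected long before 1/1/2000 arrives.
        raise ValueError(f"fiscal year end {end_yy:02d} precedes start {start_yy:02d}")
    return list(range(start_yy, end_yy + 1))

print(periods_in_fiscal_year(98))      # [98, 99] -- fine
try:
    print(periods_in_fiscal_year(99))  # fiscal year 1999-2000
except ValueError as err:
    print("rejected:", err)            # fails months before the rollover
```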

-- Slammer (Slammer@Work.com), June 24, 1999.


Slammer

The quote was "Y2k driven". I don't find anything that suggests the problems were due to date-related processing.

Don't get me wrong. If anyone is keeping track of "Y2k errors", this would count. Systems implemented due to Y2k considerations are subject to all of the same errors associated with other implementations.

My guess is that these are part of the pre-rollover errors that Gartner Group is expecting. And, as far as Y2k goes, these types of reports are Good News. A compliant system was installed; errors were uncovered and fixed. Every one is one less system to worry about at rollover.

But it is invalid to extrapolate from system implementation problems to errors that will occur on rollover.

Gordon

Here, let me translate to language you seem to understand:

"LOOK! ERRORS IN A SYSTEM ROLLOVER! PROOF POSITIVE Y2K WILL BE A 10!!!"

-- Hoffmeister (hoff_meister@my-deja.com), June 24, 1999.


Under normal circumstances you would be right, by any stretch of the imagination. However, something seems to stand out clearly here which does relate this to Y2K and allows it to be counted in that category. In the case of Mobile, Y2K was specifically mentioned by the company as the source of the failure. IMHO, this would indicate that the system failed on a date-related issue relating to the rollover and was unable to process information, i.e., a variation of the Jo Anne Effect. In any event, Y2K errors in supposedly fixed systems are still Y2K errors, and should not be viewed in the same sense as normal system implementation failures, because the context of this was Y2K, which was the reason for the upgrade and for the failure.

I think Hoff's right. There's nothing to indicate that the problems in the new system were actual Y2K problems. They might have been, but they might also have been problems related to the implementation, or application problems unrelated to Y2K. Especially in a situation like that: "The project was a 'complete systems rebuild', according to Potts, which saw the company's IT operations move across from a VAX platform to Windows NT 4.0." Whew!

It is good news that Y2K compliant systems are being implemented now.

But this case shows, I think, how much fun we're going to be having when Y2K compliant systems are being implemented for the first time in the last week of December. Do I think there will be a lot of that? You bet.

-- Lane Core Jr. (elcore@sgi.net), June 24, 1999.



The whole thing (like any "DeathMarch" project) becomes linked together by the difficult schedule requirement - yes, this particular failure was not itself a Y2K (date-related) failure - it was a "re-programming-for-Y2K PROCESS failure" - and such process failures are the real threat.

Hoff - I don't really care what causes the process to fail - date-related failure, bad data, bad data exchange, operating system failure, losses of accuracy or efficiency or procedures so that a component goes bad or gets made wrong or the food or medicine spoils. It doesn't matter - the process failed, and so the company (its employees and customers) is threatened with failure, delays, or hardship.

THAT IS THE THREAT! Don't quibble about whether the immediate cause was date-related or embedded chip or operating system or "we don't have any water pressure and have to shut down...."

-- Robert A. Cook, PE (Kennesaw, GA) (cook.r@csaatl.com), June 24, 1999.


No doubt, Robert. And yes, I expect to see more than the normal amount of such errors in the coming months. They will seem even more prevalent, as people are somewhat more sensitized to "system" errors.

But that was not the point. Attempts are made to extrapolate these types of problems to those that will happen at rollover. And that just doesn't cut it.

-- Hoffmeister (hoff_meister@my-deja.com), June 24, 1999.


Hoffmeister,

The point I'm attempting to make here is that, even though the exact circumstances of the failure aren't clear (it may have been an implementation detail; a transaction may have posted incorrectly due to the Jo Anne Effect, causing orders to fail; the Oracle client revision level on the server may not have matched what a client application called for; a DLL may have been missing; an entry in the PCT and PPT, a VSAM file, or a generation data group may not have been set up; etc.), a truth exists here. It is true that in this case the failure of the remediated system is a good thing, in that a problem won't happen later (having worn many hats, including QA, PA and SysAnalyst, I understand the value of failure and the testing process).

My point is somewhat more heuristic than algorithmic. Collectively we seem to be seeing a rise in the failure rate (which, as I've said, is expected due to new system implementation). Taken individually these failures do not indicate much, but collectively they indicate a pattern of support for movement toward a larger Y2K event. Also, the nature of the alternatives for remediating failures changes after the actual rollover in that, as Gordon alluded to, no exit strategy exists. This would (in a non-quantifiable way) imply that a weighting should be applied to post-2000 failures in remediation man-hour estimates, erring on the heavy side (so that, for example, one post-Dec 31, 1999 failure would count for several failures of this kind).

What makes this unquantifiable is the timeliness and the availability of time to fix a problem: the backout scenario allows some leeway, so the hours per day spent on the problem are lower. This in turn, when looked at collectively against the man hours available to expend on the problem, keeps the hours spent remediating below the threshold of available hours - the "OK guys, put the backups back in and let's go get a pizza" scenario. It is only when the timeliness changes, because of the lack of an exit-to-backup strategy, that failures begin to demand more hours in a day than programmers have available; that is when the circumstances of Y2K change. At the point where the failure rate and the demand to fix those failures outstrip the supply of programmers to work on them, Y2K becomes a public event.

We should be able to draw some correlations as more data becomes available. I assume statisticians are already working on this, as evidenced by a new report I saw just the other day estimating, based on LOC and metrics from sampling (I hope statistical sampling), that of 280 million bugs, 98 million will be left unresolved. The problem is that the programmer-hour requirements to avoid mission-critical failures among those bugs are a lot higher than for stuff like the Mobile failure, or for any glitches caused by pre-Y2K upgrade/test failures. That's why 12/31/1999 11:59 (time zone notwithstanding) is the big enchilada. A good precursor to this event, which will give the statisticians a good means of extrapolating the impact to GDP, will be the GPS rollover later this summer/fall.

Slammer..
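To make the threshold Slammer describes concrete, here is a toy back-of-the-envelope model of the crossover point where fix-on-failure demand exceeds the programmer hours available. Every number in it is invented for illustration; nothing below comes from the thread, from Gartner, or from any real survey:

```python
# A purely hypothetical model of the crossover Slammer describes: the point
# where hours demanded by fix-on-failure work exceed the hours programmers
# can supply. All figures are made-up placeholders.

AVAILABLE_HOURS_PER_WEEK = 1_000_000     # assumed pool of programmer hours
HOURS_PER_FAILURE = 40                   # assumed average effort to fix one failure

scenarios = {
    "pre-rollover (backout path exists)": 5_000,    # assumed failures per week
    "post-rollover (no backout path)":   40_000,    # assumed failures per week
}

for label, failures_per_week in scenarios.items():
    demand = failures_per_week * HOURS_PER_FAILURE
    status = "within capacity" if demand <= AVAILABLE_HOURS_PER_WEEK else "OVER capacity"
    print(f"{label}: demand {demand:,} hrs vs. supply {AVAILABLE_HOURS_PER_WEEK:,} hrs -> {status}")
```

On these made-up numbers the pre-rollover load (200,000 hours) fits comfortably, while the post-rollover load (1,600,000 hours) does not; that crossover is the "TSHTF" threshold in Slammer's terms.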

-- Slammer (Slammer@Work.Com), June 24, 1999.


Slammer

1) Once again, nothing implies the failure of date-driven transactions. Actually, one can infer it was not the cause; since the system implementation was "Y2k-driven", the implication is that the old system was not Y2k compliant and would not be able to handle look-ahead dates.

2) Maybe I misread you, but you seem to imply that these types of failures are somehow "building" to a major Y2k event. But this example shows a failure that has been fixed, and thus does nothing to add to any "Y2k Buildup".

3) You're right, backing out changes is not an option at rollover. But my point is that at rollover we'll be dealing with a small subset of potential errors. You yourself mentioned the host of other things that could go wrong during the implementation that are completely unrelated to a date-driven error in a program at rollover. In fact, your list makes exactly the point I've tried to make.

4) I started a thread a while ago attempting to use available metrics to estimate the number of errors at rollover. If you're interested, it is Y2k Metrics and Error Rates. I just haven't seen any evidence that failures at rollover, or before, or after, will "overwhelm" the ability to keep the systems from failing completely.
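For readers who want to see the shape of the kind of estimate Hoffmeister mentions, here is a generic sketch of how metrics-based rollover estimates are usually framed. It is not Hoffmeister's thread or his figures; every number below is a hypothetical placeholder:

```python
# Generic, hypothetical metrics-style estimate: portfolio size in LOC, an
# assumed density of residual date defects, and an assumed fraction that
# actually surface at the rollover itself. None of these figures are real.

TOTAL_LOC = 500_000_000                  # assumed size of a code portfolio
RESIDUAL_DATE_DEFECTS_PER_KLOC = 0.5     # assumed defects left after remediation
FRACTION_SURFACING_AT_ROLLOVER = 0.10    # assumed share that fire on 1/1/2000

residual_defects = (TOTAL_LOC / 1_000) * RESIDUAL_DATE_DEFECTS_PER_KLOC
rollover_failures = residual_defects * FRACTION_SURFACING_AT_ROLLOVER

print(f"Residual date defects: {residual_defects:,.0f}")
print(f"Expected to surface at rollover: {rollover_failures:,.0f}")
```

The real argument between Hoffmeister and Slammer is about the inputs: how large the residual defect density is, and how many of those defects are severe and time-critical rather than cosmetic.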

-- Hoffmeister (hoff_meister@my-deja.com), June 24, 1999.


Hoffmeister,

An interesting post. Still, the metrics must somehow be weighted for severity. I too have had to go through an inline verification and validation process (like the one Flint indicates, a la IV&V). My opinion is that the howlers will be weeded out. It's the subtle things that can kill you.

Here's a good one to play with. A maintenance routine for an airline checks the expiration dates of hydraulic fluids used in a DC-10 owned by company X. The program that prints the replacement lists for the maintenance guys sees the expiration of the fluid in DC-10 #1, which was last replaced in 1996 (stored internally as, say, 96-01-10) and has a 5-year life, as 2101 instead of 2001, because an internal windowing algorithm followed its rules correctly (all maintenance dates < 97 get CC=20, else CC=19, so 96 -> 2096 + 5 (duration of product) = 2101). The program was tested using data harvested from production in 1998 and aged backwards to 1996 for all columns of a date-related nature, except that the aging routine missed the column on this guy. The test should have produced the bad result (2101), except that since the column still showed 1998, it appeared in the test results as 2003, which looks to the clueless independent test team like a good date. Since they didn't employ a differencing routine to age the results forward and compare them with the actual production run, they didn't notice a difference. As a result the flawed test produced acceptable results, because the team hired to do the independent verification missed it.

What's the result? The program is implemented as a "remediated" program and goes into production; the hydraulic fluid is not changed and breaks down, or the lines it's running in corrode, and the plane fails to function correctly and falls from the sky. When does it happen? Maybe 2002. What will it look like? NEWS FLASH: another plane crashed today in the mountains south of San Francisco, that's the third one this year. The crash is under investigation. Meantime, the plane is purged from the maintenance system after a month because it has been marked terminated from service, so the dirty data (and coincidentally the evidence) is cleared out. Will it be linked to Y2K? No.. it will be deemed that "a mechanical failure in the plane's hydraulic system was responsible for the crash." The passengers killed and the public will never know Y2K did them in.. the system did its job and it didn't fail here.. dirty data and timeliness did them in. That's some of my point.

I agree with Flint that testing is everything. The windowing algorithm did its job and was certainly not to blame, but the integration of historical data with newly implemented systems can produce unquantifiable failures (not howlers but SBDs). Just like that chemical plant that just blew up, now under investigation: did a Y2K test leave dirty data that infected a production system? Maybe, maybe not. This is the insidious nature of the Y2K problem. We will be seeing increases in failures, but they will appear to have natural, unrelated causes. I think it would be interesting, though, to set up a two-dimensional Cartesian plane and plot the failure rate of company-announced Y2K failures, or just plain failures, period. Maybe by industry. That at least would give us an idea of the tip of the iceberg.

Nobody is going to figure out the total picture even while it's happening, but with charting, application of hours to failures (subcategorized, perhaps), and available labor department statistics on programmers per industry (if available), we could rough out a guesstimate of the point at which the hours required for fix-on-failure (FOF) exceed the man hours available for FOF. That's the point where TSHTF, maybe. Anyway, my best sense tells me we are in a non-quantifiable situation and have to use common-sense heuristics from here on out. Just as I can look down a street and see a car coming, and even though I don't know the exact distance or speed the car is travelling (and so can't really know exactly when it would cross my path), I can just look and say: oh, I can cross now because it's far enough away. I do that every day. Even though we have done a significant amount of testing, I know we didn't catch everything. I am a moderate in my opinions, but I believe there will be significant disruptions; a lot of them will seem natural, and they will diminish with time as the dirty data washes out, replaced with good stuff.

P.S. I won't be on a plane after 9/1/1999 until sometime after 2001. Jane Garvey's pretty safe with that one-shot deal she's doing on Dec 31 at midnight, coast to coast; tell her to get on a plane to different ports of call from different hubs once a week through 2001 if she really wants to prove her bravery.
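Slammer's windowing scenario is easy to reproduce in miniature. The sketch below uses hypothetical names and the pivot rule quoted above; the point is that the pivot rule behaves exactly as designed, and the bug only becomes visible if the test data for that one column is aged back correctly:

```python
# Sketch of the scenario described above, with hypothetical names.
# The windowing rule itself is correct; the failure is that one column of
# the 1998 test data was never aged back to 1996, so the 2101 result never
# appears in the test output and the independent verifiers see nothing odd.

PIVOT = 97  # two-digit years below 97 are treated as 20xx, otherwise 19xx

def window_year(yy: int) -> int:
    """Expand a two-digit year using the pivot rule quoted in the post."""
    return 2000 + yy if yy < PIVOT else 1900 + yy

def expiration_year(last_replaced_yy: int, shelf_life_years: int = 5) -> int:
    """Year in which the hydraulic fluid expires."""
    return window_year(last_replaced_yy) + shelf_life_years

# Production data: fluid last replaced in 1996, stored as the two-digit year 96.
print(expiration_year(96))   # 2101 -- the replacement slips a century into the future

# Test data harvested in 1998 and (incorrectly) never aged back for this column:
print(expiration_year(98))   # 2003 -- looks perfectly plausible, so the bug sails through
```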

-- slammer (Slammer@work.com), June 24, 1999.


OK, here's what happened to my father this week. He works for a food/gas mart (Speedway); he's bored with retirement. Anyway, they have been having problems with their machines since they had their Y2K fix. I posted about this a few weeks ago. It made him a GI. This week their whole system shut down, but the outside pumps kept pumping. The crashed computers didn't shut them off. They lost a bunch of gas. They had to shut the whole station down for the day. Just found out about this today when I called on my way through from Ohio.

-- Moore Dinty moore (not@thistime.com), June 24, 1999.
