Had a blast Monday. A transformer went out, and as the electric company rerouted power a second transformer blew up. I mean really blew up: sparks, fire, the whole deal.
Now, anytime there is a power outage IT is affected. The good news is that we have a room-size UPS and a generator. The bad news: when that transformer blew, the surge was big enough to trip the 125A breaker on the UPS. The generator was running, but the power couldn’t get there. This really sucks, as all of my servers and network gear went down. Good news: it took less than 5 minutes from the trip to getting power restored to the room. All told, our customers lost around 10 minutes of network access while the 7606 booted up.
Now for the bad news: we lost two servers. Oddly enough, they were two of the newest servers we have, HP DL380 G6s. I was amazed that some of our older gear didn’t die. HP is great to get repairs from 99% of the time, but Monday afternoon they sucked. First, we have 24/7, 4-hour coverage on all of our servers. The ticket for both servers was entered just before 4 PM, and the system boards showed up at 8:38 PM. Now, 30 minutes over is not bad, but there was no tech to install them. I called in and was talking to someone live by 8:45 PM, and the first thing I was told was that it would be 4 hours to get a tech on site to install the boards. I just about blew my top.
Now, I’ve been doing this for a long time, and I know what you do and don’t do when this happens. First, you tell them this is a network-down emergency, a production outage, or some other combination of keywords that helps the script reader know this is a real issue. Next, you ask to be escalated to a supervisor. Don’t take no for an answer, but don’t be rude or cuss, because at some companies that gives them an out to disconnect the call.
Now, back to the issue at hand: no tech. Well, I was at their mercy, so I sat and waited. The guy showed up at 11:30 PM. Nice guy, but he didn’t seem to have much experience with HP servers. Once the boards were replaced, one of the servers still wouldn’t boot. The tech called in to the support line, and they wanted me to initialize the drives. I asked for the phone and told the script reader that we needed to skip ahead and look at the error message at boot. The Smart Array controller showed a config error and reported that it was locked. It turned out we needed more parts after all. He went directly to the warehouse for me and brought the part back, as he really didn’t want to sit and wait for it.
All told, it was just after 3 AM when both servers were back up and running. At this point I thought it was time to go home and get some sleep. Boy, was I wrong. The gate was locked and I couldn’t find the remote to open it, so I was trapped. Now, I could have pouted and whined, but instead I took the time to apply patches to several servers and do some other housekeeping that needed to be done. I managed to stay busy till 8 AM, then went home happy that I didn’t need to worry about returning to work, as I had already put in 8 hours for Tuesday.
Of course, the calls and emails to the BlackBerry managed to get me out of bed by noon. Oh well, that is what they pay me for, and at least I have a job.