Last week was very rough. I was primary on-call. That means I can be, and am, called and notified over any software or hardware error across a couple of major product lines at any time of the day or night. Sometimes the systems are quiet, but not this time. The guy who had it last weekend had mentioned that one particular error was firing about every hour overnight.
And he wasn't lying. It would take care of itself, but it still paged me every hour overnight. We had looked at several fixes but it was still notifying. Considering how much trouble I have getting back to sleep after having been woken up, this was becoming a special kind of hell for me.
On the third night, at midnight, another alert came in. It wasn't the same issue as before. I lay there, hoping the issue would go away. Nope. It was still in an error condition. Great. I'll be shot because now I have to engage the brain and I was working off about 6 hours of sleep in the past four days.
OK. Get into the environment, take a look at the graphs and see what the problem is. OK. We have a process to fix this. So I'll run that.
<20 minutes> Nope. That didn't do it. So let's amp it up and restart the software one level up. I couldn't see anything in a particular instance of the software, so global reset.
<20 minutes> Still having issues. Let's look at the flowing logs of the lower software and see if there is something there. Ok, one of the 5 seems to be an issue. Let's restart that.
<20 minutes> No, it is now erroring over here. It looks like the upstream one is bad. Thing is, I don't know which of the four it is. Let's try restarting them selectively.
<20 minutes> Where is this freaking error coming from?! Google mentions this error is associated with network issues. Dang, it's cold. OK. Let's look at the network traffic and see if I can find something there.
<10 minutes> Interesting. This server's network spiked and then went way low. I'm going to poke around with it.
<10 minutes> OK. The server is in an unstable state. So I have to announce this to my group and ask if anyone is awake. Let them know that I have to restart the server and that can be a shock to the entire system.
<10 minutes> Announce this to all the developers and admins. Restart.
<10 minutes> Verify that things have repaired. Check the processes. Announce the repair.
About 4:30, everything was done. Crawled back into bed. Normally, this is the time I take a shower and get ready for work. Too tired. Can't operate.
<7:00am> "Barry, I need you to wake up. There's no hot water. There's something wrong with the water heater." <swear, cuss, shake my fist at the sky>
My wife led me downstairs and told me all that she had checked. The breakers were all set. The furnace was still working, so the gas was still on. I wandered aimlessly while Tracy was on her hands and knees looking at the bottom of the tank. At this time, I was having a few different thoughts. One, why did she ask me down here if she's going to do the fix? Two, why am I standing here? My wife is more than capable of fixing this without me.
<5 minutes> Me: "The instructions are on the side of the tank." Tracy: "Oh!"
<2 minutes, in a cold basement> "OK, did that. Got a blinky. Pressing the button... There's the pilot. Great! Looks like it is working again." She got up from the floor and told me to get back to bed.
As I walked back upstairs, awake with no hope of getting sleep, I mused that all I did was tell my wife to RTFM. "Read The F****** Manual." I won't tell her that until later. It seems to be all I do at work: telling users what to do.
At 9:00 am, another issue came up and in the on-line meeting, I was asked what I was doing there, as I had been up all night. They said they'd take the rest of the on-call and that I should get back to bed. I work with great people. And my wife is awesome too. The day, however, sucked.