Steven Bristol is a Fucking Idiot

Written by Steven Bristol on Aug 6, 2010

I used to be a smart man. I used to be able to sleep at night. No longer. Yesterday, the latter caught up with the former. Here is my story: I was installing a trivial piece of software on the main LessAccounting production server. I installed it in /var, and after configuring a few other services to work with it, I discovered that I needed to upgrade one service to work with the latest version of the thing I was installing. Yum would only upgrade to a minor version behind the one I needed, and I didn’t care enough to compile, so I just installed the previous version of this trivial software, which does work with the older version of the service. All of this is trivial, minor shit, so it was fine. I checked and everything was working. So I decided to clean up the newer, unused software and ran ‘sudo rm -fr /var/trivialxxxxx.x.x.x’, enter.

Now, this is a small directory and the command should have finished immediately, so when it was still running after two seconds, I looked at the command and realized that, thanks to tab completion, my tiredness and my newfound lack of intelligence, the command I had actually run was ‘sudo rm -fr /var trivialxxxxx.x.x.x’. That stray space turned one path into two arguments: /var itself, and a relative trivialxxxxx.x.x.x. FUUUUUUUCK ME!!!! Ctrl-C halted the devastation, and sure enough, LessAccounting.com was no longer serving pages. I looked into /var and most everything was there. Then I looked deeper into /var to see how many of LessAccounting’s pieces were missing: everything it kept in /var was gone.

A bit of panic ensued, but not much. There was no data there, so my first thought was to redeploy the missing pieces, get the rest from backup, and everything would be fine. Except it wasn’t. Webistrano/Capistrano would not connect to the server: an ssh problem. Now the panic really started. None of my guys were around for support. Fuck me. If I can’t ssh in, what else is wrong? How can I fix it? If I lose my one terminal session, how do I get back in? I completely panicked. I video’d with Allan to let him know. Our video session was disconnected, I couldn’t connect via the browser to some ancillary pages on the server, and I shit my pants: the server would need to be rebuilt. It turned out to be a network issue on my side, and I video’d him back. He said, “There’s nothing I can do to help.” I pleaded, “Just hold my hand and listen to me yell for a few minutes until I calm down.” He stayed on for a few minutes while I went to get the missing stuff from backup.

Since I only had one terminal, I had to just sit and wait the ten minutes to retrieve the 5GB of files I needed, and then the 13 minutes to untar/unzip the file. During that wait I searched for a hint about the sshd issue. I finally called Rich Cavanaugh for help, and he calmly started walking me through the diagnosis. I copied the missing files back to /var and LessAccounting came to life. Thankfully, nothing else needed to be done. Total downtime: about 32 minutes.

Back to ssh. Rich suggested tailing /var/log/messages (yes, /var/log was unaffected), and it was obvious that sshd needed /var/empty/ssh/etc/ so it could symlink to the current timezone file. Creating these directories fixed sshd, and I could connect from another terminal. It’s over. Everything is over. I’m still shaking.

Understand, this wasn’t just carelessness. I am very aware of this type of mistake. I always remind my guys to be careful of this sort of thing when they’re in production, and I’m always cognizant and very careful myself. The takeaway for me is that I used to be a smart man who slept at night, but now I am a fucking idiot. A very tired fucking idiot.

Afterword

Remember that this all started because I had two directories, /var/trivialxxxxx.3.2.1 and /var/trivialxxxxx.2.5.1, and I wanted to delete the higher-versioned one. After everything was fixed, I noticed that my terrible rm command had removed the 2.5.1 version, but not the 3.2.1 version. So I had to reinstall 2.5.1 AND still run ‘sudo rm -fr /var/trivialxxxxx.3.2.1’. After typing the command, but before running it, I copied and pasted it into Campfire so someone could double-check it. After approval, I ran the command successfully. Once again: Fuck me.

Meet Steven

Hi, I'm Steven.

I wrote the article you're reading... I lead the developers, write music, used to race motorcycles, and help clients find the right features to build on their product.
