I think we’re back. I hope we’re back.
For anyone who noticed over the last two days that things have been a little weird, you have my utmost apologies. Things have been a little hectic around here, for reasons you might have already heard, but the tale deserves a retelling anyway at this point, if only to help me make sure that later down the road, I can remember it in more or less the sequence that it actually happened.
Over a year ago, in the process of upgrading and moving systems around, I managed to rip the power coupler point off of a drive. Jessie executed a daring solder-repair, and we resuscitated the drive, but we both knew that was a temporary fix, and I needed to build a new system drive. However, as one can imagine, I always had something that was far more important than upgrading the server, motherbrain. Eventually, the hardware I had purchased specifically for that reason got repurposed, and by that point the momentum was lost. Of course, this didn’t stop me from volunteering the server to host a rash of other important systems—a MUCK for some friends, my wife’s writing project, my writing blog, an IRC server for some other people, et cetera. You know, to ensure that I’d have a nice big userbase looking at the server at all times.
So, knowing that I had let things get out of control, I had actually started taking the first steps towards fixing this admittedly-embarrassing situation. I had gotten the hardware I needed, and with some help—and a bit of prompting—from Jessie and Cube, I had, in fact, commented to th’ otr that I was making some decent headway. I’d finally decided to get away from Slackware and enter the Twentieth Century with everyone else and start using Ubuntu, I’d gotten the hardware built out, and I’d gotten the OS installed. Late Friday afternoon, I told Cube that I’d gotten a list of packages together on the new system that I thought were the bulk of what I needed to install, and the rest I could iron out in time. I then went off to play Race for the Galaxy with Nicky and generally relax from the annoying work week, with an eye towards getting ahead on my writing backlog.
This is when the Cosmic 2×4 struck.
About 23h30 on Jugya, Indi calls me and says he’s noticing some permissions weirdness with the MUCK database, and could I help him diagnose it. We work on it for a bit, figure out that I ran something as root I shouldn’t have, and it had a file locked. Easy fix. I excuse myself from the game, sit down to work on it, and Indi figures that it’s probably a smart move to back up the database “just in case.” I agree, which is unusual for me because I’m usually prone to just blasting through things and trusting that fools and little bunis have a bit of play with the Dice.
Of course, they do, but as it turns out, that’s not always the good kind of play. Right as the backup finished, the server started beeping at me. This was not just the polite “I believe you have sat on the keyboard” beep. This wasn’t even the “hey, you might want to take a look at this” beep. This was the full-volume fast-repeating beep of something seriously wrong. The console greeted me with things like “Kernel panic” and “Killing Interrupt Handler” and “AIEEE!”
This is not good.
I power-cycled the server, intent on fixing whatever was wrong… only it had been over 180 days since the last fsck on all drives, so it had to force a journal check. Okay, fine… only it gets to 28% on the first drive and panics again. Reboot again, second time clean, gets to the next disk… gets to 10% and crashes.
So, with many apologies to Nicky—and to others whose weekends I may have negatively impacted—I did the only thing I could do: I took the drives out of the old server, put them into the new server, did a full dump of their contents, and proceeded to launch the new server in place of the old. Like the man says, no problem, only solutions. How hard could it be, really? I wasn’t changing that much, really. I mean, check it:
| Component | Old server | New server |
|---|---|---|
| CPU | 1 32bit | 2 64bit |
| Named | BIND 8 | BIND 9 |
| Postfix configuration | Sendmail emulation using m4 | Postfix native |
| PHP | 5.2.10 tarball | 5.2.6 package |
| IRC | Unreal 3.1 | Unreal 3.2 |
| IRC Services | version 4 | version 5 |
Suffice to say, I’ve spent the last thirty-six hours getting the system rebuilt. The only thing that hasn’t changed between the old environment and the new is the majority of the user data, and even that isn’t perfectly identical. I think at this point that most things are restored to a state of pretty-much-like-they-were, but this is going to be a few days of burn-in in the new environment, if not longer. At least we’re on current hardware, with current software.
The greatest irony of this, of course, is that the soldered drive still seems to be working fine. As indicated in the link above, the problem appears to have been something related to the motherboard clock ticks being dropped, which would confuse things during long I/O sessions. The server has followed us through two moves, and it belonged to a friend of ours prior to Jessie and me inheriting it, so this has apparently been a time bomb whose time had come.