[Linux-users] Saga of the mail server
David Bandel
david.bandel
Mon Aug 27 15:01:17 PDT 2007
Folks,
Just so you know what's going on, I offer the following. If I hadn't
just gone through it, I would think it was a pulp-fiction staple.
A month ago (28 July to be exact), I started my vacation (first time
in two years). I left in charge two fellows who can only be compared
to Monk and Mombo. If you don't know who they are, you need to pirate
a copy of "Happily Never After" from your favorite torrent site
(torrentfive.com comes to mind).
Anyway, after I left on Sat morning, I received a call that the backup
router was not responding to ssh but was still routing. They decided
to reboot it anyway, and it didn't come back up (don't you just love
folks from the Windoze world that only know 'reboot, reboot,
reinstall'?). I told them to put the spare drive in, get a basic
install on it, put ssh and an IP on it and I would take care of it (it
had some iproute2 magic I hadn't yet taught them).
After I finished, I called and told them to back it up to the backup
server (this was now late Sat afternoon).
On Sunday I get another call: the backup server isn't responding
(this time they knew not to reboot). The monitor showed what I
believe was a kernel panic (based on what they read to me). No
problem: now reboot.
This time lots of messages -- some from BIOS, some from the kernel,
and a kernel panic after SATA-0 lost interrupts, etc.
OK, remove SATA-0, replace with SATA-1 (RAID 1 mirror). SATA-1 will
not boot, and replacing SATA-0 just results in lost interrupts and a
kernel panic. Bloody hell. And being several hundred miles away with
the family waiting for me is not conducive to working out bizarre
problems.
I don't think Monk and Mombo were up to the next task: Install a new
drive, RAID1 it, and try to see if the SATA-1 drive still had data on
it to copy over. Probably lost my backup during this step.
Once up, I told them to ensure all systems had new backups.
Monday morning and another call: All systems _except_ those on a
recently purchased Dell system were backed up. The drive light on the
Dell was on solid and it would not respond to SSH. Monitor showed
(they were learning anyway) that the hard disk was having access
problems. This is the box that had the linux-sxs list on it as well
as my mail server.
Did this box get a good backup before it locked up? No. Reboot and
keep fingers crossed. Disk was doing "click click" and would not even
be recognized by the BIOS.
Was there anything else that could go wrong? Well apparently, Murphy
was not done with me yet.
I contacted a data recovery site (I really wanted some of my files,
which was why they were backed up to the backup server). Two weeks
and $1100 later, the files are on an FTP site. Why any reputable
company would use a Windoze FTP server is beyond me. I get as many
connections rejected as I do files downloaded.
Anyway, all would be restored shortly (or so I had hoped).
The linux-sxs list has been running for the past 3 weeks on a
different system, waiting to be restored once all the files were back.
Would you believe today I awoke to the linux-sxs server (now RAID-1)
with the first disk (manufactured 3 May 2007, a WD Caviar) toast?
Today it's running on one drive until I buy yet another drive. And
BTW, my backup router was also down again this a.m.
The big question is, what could possibly be causing all these disk
problems? Nothing else seems affected. Can't be just coincidence.
One drive a year is about average for me. But now 6 in less than a
month? I'm completely flummoxed.
Still working on the problems.
David A. Bandel
--
Focus on the dream, not the competition.
- Nemesis Air Racing Team motto