[Linux-users] Saga of the mail server
James McDonald
james
Mon Aug 27 18:59:17 PDT 2007
David Bandel wrote:
> Folks,
>
> Just so you know what's going on, I offer the following. If I hadn't
> just gone through it, I would think it was a pulp fiction staple.
>
> A month ago (28 July to be exact), I started my vacation (first time
> in two years). I left in charge two fellows who can only be compared
> to Monk and Mombo. If you don't know who they are, you need to pirate
> a copy of "Happily Never After" from your favorite torrent site
> (torrentfive.com comes to mind).
>
> Anyway, after I left on Sat morning, I received a call that the backup
> router was not responding to ssh but was still routing. However, they
> decided to reboot it, and it didn't come back up (don't you just love
> folks from the Windoze world that only know 'reboot, reboot,
> reinstall'?). I told them to put the spare drive in, get a basic
> install on it, put ssh and an IP on it and I would take care of it (it
> had some iproute2 magic I hadn't yet taught them).
>
> After I finished, I called and told them to back it up to the backup
> server (this was now late Sat afternoon).
>
> On Sunday I get another call: the backup server isn't responding
> (this time they knew not to reboot). The monitor showed what I
> believe was a kernel panic (based on what they read to me). No
> problem: now reboot.
>
> This time lots of messages -- some from BIOS, some from the kernel,
> and a kernel panic after SATA-0 lost interrupts, etc.
>
> OK, remove SATA-0, replace with SATA-1 (RAID 1 mirror). SATA-1 will
> not boot, and replacing SATA-0 just results in lost interrupts and a
> kernel panic. Bloody hell. And being several hundred miles away with
> the family waiting for me is not conducive to working out bizarre
> problems.
>
> I don't think Monk and Mombo were up to the next task: Install a new
> drive, RAID1 it, and try to see if the SATA-1 drive still had data on
> it to copy over. Probably lost my backup during this step.
>
> Once up, I told them to ensure all systems had new backups.
>
> Monday morning and another call: All systems _except_ those on a
> recently purchased Dell system were backed up. The drive light on the
> Dell was on solid and it would not respond to SSH. Monitor showed
> (they were learning anyway) that the hard disk was having access
> problems. This is the box that had the linux-sxs list on it as well
> as my mail server.
>
> Did this box get a good backup before it locked up? No. Reboot and
> keep fingers crossed. Disk was doing "click click" and would not even
> be recognized by the BIOS.
>
> Was there anything else that could go wrong? Well apparently, Murphy
> was not done with me yet.
>
> I contacted a data recovery site (I really wanted some of my files,
> which was why they were backed up to the backup server). Two weeks
> and $1100 later, the files are on an FTP site. Why any reputable
> company would use a Windoze FTP server is beyond me. I get as many
> connections rejected as I do files downloaded.
>
> Anyway, all will be restored shortly (I had hoped).
>
> The linux-sxs list has been running for the past 3 weeks on a
> different system, waiting to be restored once all the files were back.
> Would you believe today I awoke to the linux-sxs server (now RAID-1)
> with the first disk (manufactured 3 May 2007, a WD Caviar) toast?
> Today it's running on one drive until I buy yet another drive. And
> BTW, my backup router was also down again this a.m.
>
> The big question is, what could possibly be causing all these disk
> problems? Nothing else seems affected. Can't be just coincidence.
> One drive a year is about average for me. But now 6 in less than a
> month? I'm completely flummoxed.
>
> Still working on the problems.
>
> David A. Bandel
>
I'm sure you have filtered and UPS'd power, but are there any other
environmental issues that could cause these failures (e.g. extreme
RF/magnetic interference)?