[Linux-users] Saga of the mail server

James McDonald
Mon Aug 27 18:59:17 PDT 2007


David Bandel wrote:
> Folks,
>
> Just so you know what's going on, I offer the following.  If I hadn't
> just gone through it, I know I would think it's a pulp fiction staple.
>
> A month ago (28 July to be exact), I started my vacation (first time
> in two years).  I left in charge two fellows who can only be compared
> to Monk and Mombo.  If you don't know who they are, you need to pirate
> a copy of "Happily Never After" from your favorite torrent site
> (torrentfive.com comes to mind).
>
> Anyway, after I left on Sat morning, I received a call that the backup
> router was not responding to ssh but was still routing.  However, they
> decided to reboot it and it didn't come back up (don't you just love
> folks from the Windoze world that only know 'reboot, reboot,
> reinstall'?).  I told them to put the spare drive in, get a basic
> install on it, put ssh and an IP on it and I would take care of it (it
> had some iproute2 magic I hadn't yet taught them).
>
> After I finished, I called and told them to back it up to the backup
> server (this was now late Sat afternoon).
>
> On Sunday I get another call:  the backup server isn't responding
> (this time they knew not to reboot).  The monitor showed what I
> believe was a kernel panic (based on what they read to me).  No
> problem:  now reboot.
>
> This time lots of messages -- some from BIOS, some from the kernel,
> and a kernel panic after SATA-0 lost interrupts, etc.
>
> OK, remove SATA-0, replace with SATA-1 (RAID 1 mirror). SATA-1 will
> not boot, and replacing SATA-0 just results in lost interrupts and a
> kernel panic.  Bloody hell.  And being several hundred miles away with
> the family waiting for me is not conducive to working out bizarre
> problems.
>
> I don't think Monk and Mombo were up to the next task:  Install a new
> drive, RAID1 it, and try to see if the SATA-1 drive still had data on
> it to copy over.  Probably lost my backup during this step.
>
> Once up, I told them to ensure all systems had new backups.
>
> Monday morning and another call:  All systems _except_ those on a
> recently purchased Dell system were backed up.  The drive light on the
> Dell was on solid and it would not respond to SSH.   Monitor showed
> (they were learning anyway) that the hard disk was having access
> problems.  This is the box that had the linux-sxs list on it as well
> as my mail server.
>
> Did this box get a good backup before it locked up?  No.  Reboot and
> keep fingers crossed.  Disk was doing "click click" and would not even
> be recognized by the BIOS.
>
> Was there anything else that could go wrong?  Well apparently, Murphy
> was not done with me yet.
>
> I contacted a data recovery site (I really wanted some of my files,
> which was why they were backed up to the backup server).  Two weeks
> and $1100 later, the files are on an FTP site.  Why any reputable
> company would use a Windoze FTP server is beyond me.  I get as many
> connections rejected as I do files downloaded.
>
> Anyway, all will be restored shortly (I had hoped).
>
> The linux-sxs list has been running for this past 3 weeks on a
> different system, waiting to be restored once all the files were back.
>  Would you believe today I awoke to the linux-sxs server (now RAID-1)
> with the first disk (manufactured 3 May 2007, a WD Caviar) toast?
> Today it's running on one drive until I buy yet another drive.  And
> BTW, my backup router was also down again this a.m.
>
> The big question is, what could possibly be causing all these disk
> problems?  Nothing else seems affected.  Can't be just coincidence.
> One drive a year is about average for me.  But now 6 in less than a
> month?  I'm completely flummoxed.
>
> Still working on the problems.
>
> David A. Bandel
>   
I'm sure you have filtered and UPS'd power, but are there any other
environmental issues that could cause these failures (e.g. extreme
RF/magnetic interference)?




