Unforgiving Software Raid (2.4.20)

Matthew Carpenter matt
Tue Nov 1 18:42:27 PST 2005


Screeeeeeeeeeeeeeaaaacch.  Ok, back the truck up a second here.

It has been a real kicker of a day... but I'm not sure I can blame the 
LVM/RAID combination for any of it.  What I believed to be caused by 
LVM/RAID issues I now suspect is a lack of power.  I can get the system 
running great right up to the point where I connect all the drives.  At that 
point, I saw odd filesystem error messages, particularly from my LVs.  I 
chalked that up to my own stupidity.  However, the backup RAID array (RAID1 
used for storing backup tar.gz's) started experiencing issues as well.  That 
of course caused a massive infarction, since I had already whacked the 
production array in order to restore the entire filesystem from scratch...  
I'd get part-way through the full backups and they'd error out with checksum 
errors.

Eventually I decided to disconnect the secondary backup drive to try to 
salvage some kind of backout plan...  Once I rebooted with only two of the 
four drives installed, the system seemed to run and restore like a champ.

Tonight I will be completing the restore-from-scratch process, and I think I'm 
going to replace the two LVs with straight RAID arrays.  Better safe than 
sorry, and I simply don't have the time or energy to play with it at this 
point.
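For what it's worth, the straight-RAID replacement should be roughly this 
simple (device names below are hypothetical, and this assumes mdadm is 
installed; on a stock 2.4-era box you may have raidtools/mkraid instead):

```shell
# Build a plain RAID1 array directly from two spare partitions
# (hypothetical device names -- adjust for your own layout).
mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/hda5 /dev/hdc5

# Put the filesystem straight on the md device -- no LVM layer in between.
mke2fs -j /dev/md4

# Watch the initial sync finish before trusting it with restored data.
cat /proc/mdstat
```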

However, I will be leaving the secondary backup drive off until I can get a 
beefier power supply.

BOTTOM LINE:  I need to take back my rants against LVM and RAID, as they are 
unsupported by the evidence at this point.  That doesn't mean they aren't 
true, just that I'm no longer convinced.

However, I will offer this rant:
LVM and RAID between different versions are anything but friendly.  E.g., this 
is a SuSE 8.2 box, running 2.4.20.  When I booted off the SuSE 9.0, 9.1, and 
Ubuntu CDs I was able to create everything just how I wanted it, but I had a 
difficult time getting any of the others to read the drive.  I can understand 
not going from 2.6 back to 2.4, but between the different 2.6 kernels you'd 
think things would just work.  I suppose this is all based on little 
experience, and it's been a painful learning process.  Temper that with the 
fact that in the past 36 hours I have had 2 hours of sleep, 2 hours of 
trick-or-treating, and 30 hours of work.  I'll be the first to tell you that 
I'm not exactly balanced at the moment.  Please forgive the ranting.  I'm 
beat.

Hang tight all,
Matt


On Tuesday 01 November 2005 09:03, Matthew Carpenter wrote:
> Thanks David.  That's what I did.  I am, however, more stupid than I had
> already believed.
>
> Lesson:  Don't run LVM over RAID.
>
> Reason:  I just caused myself immense headache with a split mirror.  One
> drive was still in the array, the other apparently registered the LVs... 
> When I went to hot-add the new partitions into the existing RAID array it
> apparently didn't like the fact that I was writing raw to the drive...
> Corrupted the filesystems...  Not a pretty picture.  Had to restore from
> backup.
>
> That was about the time I recognized that several of the larger backup
> files came up as corrupt...  I'm quite concerned for this machine, and
> particularly the data stored on it.
>
> I'd like to recommend to the kernel developers that when VG's compete, pick
> the MD-based ones over others....  :\
>
> On Tuesday 01 November 2005 07:18, David Bandel wrote:
> > On 11/1/05, Matthew Carpenter <matt at eisgr.com> wrote:
> > > Does this strike anyone else as rather nasty?
> > >
> > >         md: kicking non-fresh hda3 from array!
> > >
> > >
> > > Here's the context:
> > >         md: created md3
> > >         md: bind<hda3,1>
> > >         md: bind<hdc3,2>
> > >         md: running: <hdc3><hda3>
> > >         md: hdc3's event counter: 0000002c
> > >         md: hda3's event counter: 0000002a
> > >         md: superblock update time inconsistency -- using the most recent one
> > >         md: freshest: hdc3
> > >         md: kicking non-fresh hda3 from array!
> > >         md: unbind<hda3,1>
> > >         md: export_rdev(hda3)
> > >
> > > In my old Novell days, this behavior would send NetWare into rebuild
> > > mode. Instead of just booting the offender, it recovered.  Perhaps I
> > > can see a reason for having this behavior, but is there any way to tell
> > > the Linux kernel to bite the bullet and resync the disks?
> >
> > I suggest you:
> >
> > fail the disk and remove it from the raid
> >
> > add the disk back to the raid (whence it should rebuild)
> >
> > Not seen this myself.  Very strange.
> >
> > Ciao,
> >
> > David A. Bandel
> > --
> > Focus on the dream, not the competition.
> >             - Nemesis Air Racing Team motto
> >
> > _______________________________________________
> > Linux-users mailing list ( Linux-users at linux-sxs.org )
> > Unsub/Password/Etc:
> > http://mail.linux-sxs.org/cgi-bin/mailman/listinfo/linux-users
> >
> > Need to chat further on this subject? Check out #linux-users on
> > irc.linux-sxs.org !
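
For the archives, David's fail/remove/re-add sequence translates to roughly 
the following, assuming mdadm is available and using the md3/hda3 pairing 
from the log above:

```shell
# Mark the stale member failed, then pull it out of the array.
mdadm --manage /dev/md3 --fail /dev/hda3
mdadm --manage /dev/md3 --remove /dev/hda3

# Add it back in; the kernel should start a full resync onto hda3.
mdadm --manage /dev/md3 --add /dev/hda3

# Monitor the rebuild progress.
cat /proc/mdstat
```

With the older raidtools the equivalent would be raidsetfaulty, 
raidhotremove, and raidhotadd against the same devices.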

-- 
Matthew Carpenter 
matt at eisgr.com                          http://www.eisgr.com/

Enterprise Information Systems
* Network Server Appliances
* Security Consulting, Incident Handling & Forensics
* Network Consulting, Integration & Support
* Web Integration and E-Business


