Unforgiving Software Raid (2.4.20)
Matthew Carpenter
matt at eisgr.com
Tue Nov 1 18:42:27 PST 2005
Screeeeeeeeeeeeeeaaaacch. Ok, back the truck up a second here.
It has been a real kicker of a day... but I'm not sure I can blame the
LVM/RAID combination for any of it. What I believed to be LVM/RAID issues
I'm now chalking up to a lack of power. I can get the system running great
right up to the point where I connect all the drives. Once they're all
attached, I start seeing odd filesystem error messages, particularly from my
LVs. At first I wrote that off as my own stupidity. However, the backup
RAID array (a RAID1 used for storing backup tar.gz files) started
experiencing issues too. That of course caused a massive infarction, since I
had already whacked the production array to restore the entire filesystem
from scratch... I'd get part-way through the full backups and they'd error
out with checksum errors. So I disconnected the secondary backup drive to
try to salvage some kind of backout plan... Once I rebooted with only two of
the four drives installed, the system ran and restored like a champ.
Tonight I will be completing the restore-from-scratch process, and I think I'm
going to replace the two LVs with straight RAID arrays. Better safe than
sorry, and I simply don't have the time or energy to play with it at this
point.
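For anyone following along, by "straight RAID" I just mean RAID1 with the
filesystem sitting directly on the md device, no VG/LV layer on top. Assuming
mdadm is on the box (the partition names and mount point below are only
examples), the setup is roughly:

  mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/hda5 /dev/hdc5
  mke2fs -j /dev/md4          # ext3 straight onto the md device
  mount /dev/md4 /mnt/data    # example mount point

Folks still on raidtools would describe the array in /etc/raidtab and run
mkraid instead; the end result is the same.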
However, I will be leaving the secondary backup drive disconnected until I can
get a beefier power supply.
BOTTOM LINE: I need to take back my rants against LVM and RAID, since they're
unsupported by evidence at this point. That doesn't mean they aren't true,
just that I'm no longer convinced.
However, I will offer this rant:
LVM and RAID across different versions are anything but friendly. E.g., this
is a SuSE 8.2 box, running 2.4.20. When I booted off the SuSE 9.0, 9.1, and
Ubuntu CDs I was able to create everything just how I wanted it, but then I
had a difficult time getting any of the other kernels to read the drives. I
can understand not going from 2.6 back to 2.4, but between the different 2.6
kernels you'd think things would work. I suppose this is all based on limited
experience, but it's been a painful learning process.
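If anyone wants to compare notes between the installed system and the various
boot CDs, one quick sanity check (assuming mdadm and the LVM userland are on
each) is to ask every layer what it thinks is on the disk before trusting it:

  mdadm --examine /dev/hda3   # md superblock: array UUID, event counter, etc.
  pvscan                      # which partitions the LVM tools see as PVs
  vgscan                      # which volume groups they can actually assemble

Where those answers differ from one kernel to the next is where the
unfriendliness shows up.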
Temper all of that with the fact that in the past 36 hours I have had 2 hours
of sleep, 2 hours of trick-or-treating, and 30 hours of work. I'll be the
first to tell you that I'm not exactly balanced at the moment. Please forgive
the ranting. I'm beat.
Hang tight all,
Matt
On Tuesday 01 November 2005 09:03, Matthew Carpenter wrote:
> Thanks David. That's what I did. I am, however, more stupid than I had
> already believed.
>
> Lesson: Don't run LVM over RAID.
>
> Reason: I just caused myself an immense headache with a split mirror. One
> drive was still in the array, the other apparently registered the LVs...
> When I went to hot-add the new partitions into the existing RAID array it
> apparently didn't like the fact that I was writing raw to the drive...
> Corrupted the filesystems... Not a pretty picture. Had to restore from
> backup.
>
> That was about the time I recognized that several of the larger backup
> files came up as corrupt... I'm quite concerned for this machine, and
> particularly the data stored on it.
>
> I'd like to recommend to the kernel developers that when VGs compete, pick
> the MD-based ones over the others... :\
>
> On Tuesday 01 November 2005 07:18, David Bandel wrote:
> > On 11/1/05, Matthew Carpenter <matt at eisgr.com> wrote:
> > > Does this strike anyone else as rather nasty?
> > >
> > > md: kicking non-fresh hda3 from array!
> > >
> > >
> > > Here's the context:
> > > md: created md3
> > > md: bind<hda3,1>
> > > md: bind<hdc3,2>
> > > md: running: <hdc3><hda3>
> > > md: hdc3's event counter: 0000002c
> > > md: hda3's event counter: 0000002a
> > > md: superblock update time inconsistency -- using the most recent one
> > > md: freshest: hdc3
> > > md: kicking non-fresh hda3 from array!
> > > md: unbind<hda3,1>
> > > md: export_rdev(hda3)
> > >
> > > In my old Novell days, this behavior would send NetWare into rebuild
> > > mode. Instead of just booting the offender, it recovered. Perhaps I
> > > can see a reason for having this behavior, but is there any way to tell
> > > the Linux kernel to bite the bullet and resync the disks?
> >
> > I suggest you:
> >
> > fail the disk and remove it from the raid
> >
> > add the disk back to the raid (whence it should rebuild)
> >
> > Not seen this myself. Very strange.
> >
> > Ciao,
> >
> > David A. Bandel
> > --
> > Focus on the dream, not the competition.
> > - Nemesis Air Racing Team motto
> >
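For the archives: David's fail/remove/re-add sequence, spelled out with mdadm
against the array from the log above (raidtools users would reach for
raidsetfaulty, raidhotremove, and raidhotadd instead), would look roughly like
this:

  mdadm /dev/md3 --fail /dev/hda3 --remove /dev/hda3   # kick the stale member out cleanly
  mdadm /dev/md3 --add /dev/hda3                        # re-add it; the kernel resyncs from hdc3
  cat /proc/mdstat                                      # watch the rebuild progress

That assumes the drive itself is healthy and the event-counter mismatch was
just a stale superblock.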
--
Matthew Carpenter
matt at eisgr.com http://www.eisgr.com/
Enterprise Information Systems
* Network Server Appliances
* Security Consulting, Incident Handling & Forensics
* Network Consulting, Integration & Support
* Web Integration and E-Business