XFS problems post RAID crash

Wed Jul 27 14:32:53 PDT 2011

The first thing I'd suggest is trying a much newer kernel. 2.6.25.10
is ancient, and may well have a number of XFS bugs that have since
been fixed.  If that doesn't help, then you likely need to get on the
XFS mailing list and ask the experts for guidance. However, if you had
HW failure on two disks, it's quite likely that your filesystem is
beyond repair.
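
If it helps to gauge how widespread the damage is before posting to the
XFS list, something like the following could tally the "bad magic"
lines from the xfs_repair output quoted below. This is only a rough
sketch: the pattern is inferred from the message format in your paste,
the N/M pair is read as allocation group / block within that group, and
the function name bad_magic_by_ag is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as "bad magic # 0x20000000 in btbno block 5/1098".
# Assumption: the "N/M" pair is allocation group / block within the group.
BAD_MAGIC = re.compile(
    r"bad magic # (?P<magic>0x[0-9a-fA-F]+|0) "
    r"in (?P<tree>bt\w+) block (?P<ag>\d+)/(?P<block>\d+)"
)


def bad_magic_by_ag(log_text):
    """Count 'bad magic' reports per allocation group in xfs_repair output."""
    counts = Counter()
    for line in log_text.splitlines():
        # search() rather than match(), so quoted lines (">>> ...") still work
        match = BAD_MAGIC.search(line)
        if match:
            counts[int(match.group("ag"))] += 1
    return counts


if __name__ == "__main__":
    sample = """\
bad magic # 0x20000000 in btbno block 5/1098
bad magic # 0 in btcnt block 6/5518
expected level 1 got 0 in btcnt block 6/5518
bad magic # 0x41425442 in btcnt block 27/1293
bad magic # 0x2000000 in btcnt block 27/16922334
"""
    for ag, n in sorted(bad_magic_by_ag(sample).items()):
        print(f"AG {ag}: {n} bad magic block(s)")
```

Damage spread across many allocation groups, as in your paste, would be
consistent with the array having been reassembled from stale or
partially bad members rather than with one localized bad spot.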

I do have to ask, how did you end up with two disks in the same array
with bad sectors at the same time?  What kind of disks are these?
Were you running smartd?
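
Since you mention that no data scrubbing was being performed: on Linux
MD a scrub can be requested by writing "check" to the array's
sync_action file in sysfs, and the result read back from mismatch_cnt.
Below is a minimal sketch of that, assuming the usual
/sys/block/<array>/md/ layout. The function names and the sysfs_root
parameter are my own inventions (the latter only so the code can be
tried against a scratch directory), and it needs root on a real system.

```python
from pathlib import Path


def md_sync_action_path(md_name, sysfs_root="/sys"):
    """Return the sysfs file that controls sync/check actions for an array."""
    return Path(sysfs_root) / "block" / md_name / "md" / "sync_action"


def start_md_check(md_name, sysfs_root="/sys"):
    """Request a consistency check ("scrub") of the given MD array."""
    path = md_sync_action_path(md_name, sysfs_root)
    # Writing "check" asks the md driver to read every stripe and verify it.
    path.write_text("check\n")
    return path


def read_mismatch_count(md_name, sysfs_root="/sys"):
    """Return the mismatch counter reported after the last check."""
    path = Path(sysfs_root) / "block" / md_name / "md" / "mismatch_cnt"
    return int(path.read_text().strip())
```

For the array in this thread that would be start_md_check("md5"), run
periodically from cron, alongside smartd watching the individual disks.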

On Wed, Jul 27, 2011 at 2:29 PM, sysadmin <sysadmin at insinc.com> wrote:
>
> Hi Lonni, thanks for the reply.
>
> I'm using kernel 2.6.25.10 (CentOS 5).
>
> I have made a copy of the data on another server via NFS mount - the
> transfer took 6 days (~3TB of data). As I stated in my original post I can
> reformat this file system and transfer the data back but if I can make the
> current one work it would save me a lot of time. Not to mention we have some
> urgent uses for the space currently being taken up by the backup.
>
> Nick
>
>
> netllama wrote:
>>
>> You never stated what kernel version you were using.  Anyway, if
>> you're able to mount the filesystem read only, then I'm unclear why
>> you can't make a copy of that data and use that?
>>
>> On Wed, Jul 27, 2011 at 2:20 PM, sysadmin <sysadmin at insinc.com> wrote:
>>>
>>> Hey all,
>>>
>>> I had a Linux MD RAID 5 array that had 2 drives go offline due to bad
>>> sectors (no data scrubbing was being performed).
>>>
>>> I've managed to rebuild the array and can mount the XFS file system RO.
>>> Some of the files are missing/corrupt but I have managed to transfer
>>> many of them off to another system.
>>>
>>> Only trouble now is when I mount the file system RW I get problems such
>>> as "cannot allocate memory" doing an ls on a directory and other such
>>> strange problems.
>>>
>>> Running xfs_repair gives:
>>>
>>>  10:09 root at servername:~# xfs_repair /dev/md5
>>>  Phase 1 - find and verify superblock...
>>>  Phase 2 - using internal log
>>>         - zero log...
>>>         - scan filesystem freespace and inode maps...
>>>  bad magic # 0x20000000 in btbno block 5/1098
>>>  bad magic # 0 in btcnt block 6/5518
>>>  expected level 1 got 0 in btcnt block 6/5518
>>>  bad magic # 0 in btcnt block 7/5129
>>>  expected level 1 got 0 in btcnt block 7/5129
>>>  bad magic # 0x6e73745f in btbno block 8/218842
>>>  expected level 0 got 2354 in btbno block 8/218842
>>>  bad magic # 0x33340933 in btcnt block 8/218847
>>>  expected level 0 got 13104 in btcnt block 8/218847
>>>  bad magic # 0x28717476 in btbno block 10/13130259
>>>  expected level 1 got 25970 in btbno block 10/13130259
>>>  bad magic # 0x2f000000 in btbno block 13/31016602
>>>  bad magic # 0 in btcnt block 14/13717213
>>>  expected level 1 got 0 in btcnt block 14/13717213
>>>  bad magic # 0xf0980300 in btcnt block 15/1358720
>>>  bad magic # 0x2e323032 in btbno block 17/28998874
>>>  expected level 0 got 11825 in btbno block 17/28998874
>>>  bad magic # 0x36332e31 in btcnt block 17/28998875
>>>  expected level 0 got 14137 in btcnt block 17/28998875
>>>  bad magic # 0 in btbno block 19/91721
>>>  block (22,5084) multiply claimed by bno space tree, state - 2
>>>  block (22,5085) multiply claimed by bno space tree, state - 2
>>>  bcnt freespace btree block claimed (state 1), agno 23, bno 21801, suspect 0
>>>  bad magic # 0x2d313509 in btbno block 24/17051
>>>  expected level 1 got 12848 in btbno block 24/17051
>>>  block (25,22903) multiply claimed by bno space tree, state - 7
>>>  block (25,23312) multiply claimed by bno space tree, state - 2
>>>  block (25,23313) multiply claimed by bno space tree, state - 2
>>>  block (25,23314) multiply claimed by bno space tree, state - 2
>>>  block (25,23315) multiply claimed by bno space tree, state - 2
>>>  block (25,23316) multiply claimed by bno space tree, state - 2
>>>  block (25,23317) multiply claimed by bno space tree, state - 2
>>>  bno freespace btree block claimed (state 1), agno 25, bno 22902, suspect 0
>>>  bcnt freespace btree block claimed (state 1), agno 25, bno 23400, suspect 0
>>>  bad magic # 0x41425442 in btcnt block 27/1293
>>>  expected level 1 got 0 in btcnt block 27/1293
>>>  bad magic # 0x2000000 in btcnt block 27/16922334
>>>
>>> No matter the xfs_repair flags I use (even -d with the server booted in
>>> single user mode) the repair hangs at this point i.e. with this exact bad
>>> magic line:
>>>
>>>  bad magic # 0x2000000 in btcnt block 27/16922334
>>>
>>> Is there anything I can do to recover this file system? Obviously a full
>>> recovery would be ideal but given the RAID crash likely impossible. At
>>> this stage I'd be happy if I could just get the file system to run on
>>> whatever data it's managed to retain. If I have to reformat and transfer
>>> data back onto the system the transfer will take days.
>>>
>>> I've hunted around on the web and found people having similar issues but
>>> not quite the same, e.g.:
>>>
>>>  http://old.nabble.com/bad-magic-and-dubious-inode-td8785248.html#a8785248
>>>
>>> Any help appreciated.

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama at gmail.com
LlamaLand                       https://netllama.linux-sxs.org