2003-06-21 19:18:41

by Joern Nettingsmeier

[permalink] [raw]
Subject: severe FS corruption w/ reiserfs and 2.5.72-bk3

a word of warning:

i just completely and utterly trashed my filesystems with 2.5.72-bk2 and
reiserfs. there are metric shitloads of errors on journal replay and i
end up in repair mode. did a couple of --rebuild-tree's, but new errors
cropped up after every reboot.
happens both on scsi and ide drives and ate almost all of my machine...

my reiserfstools are recent (can't recall the version, but it's better
than or equal to the one listed in Documentation/Changes).

otoh, it seems i had two versions installed, the one that comes with
suse 8.1 in /sbin/ and mine in /usr/local/sbin. after realizing the
problem, i moved the current version over to /sbin so that it is invoked
on startup... might have made the problem worse.

unfortunately i did a number of things at once: upgrade the kernel from
.72 (which has worked for me quite well), add an ide drive (i didn't
have ide in my kernel before, and geez! is that module code broken :))
and shuffle partitions around. which makes the problem hard to pinpoint.

if anyone wants me to do some forensics on the machine, speak up.
otherwise i'll swipe it clean and start over from scratch.

best,

j?rn


(i'd appreciate a cc: of your replies. thanks.)


--
All Members shall refrain in their international relations from
the threat or use of force against the territorial integrity or
political independence of any state, or in any other manner
inconsistent with the Purposes of the United Nations.
-- Charter of the United Nations, Article 2.4


J?rn Nettingsmeier
Kurf?rstenstr 49, 45138 Essen, Germany
http://spunk.dnsalias.org (my server)
http://www.linuxdj.com/audio/lad/ (Linux Audio Developers)



2003-06-21 20:09:13

by Oleg Drokin

[permalink] [raw]
Subject: Re: severe FS corruption w/ reiserfs and 2.5.72-bk3

Joern Nettingsmeier <[email protected]> wrote:

JN> i just completely and utterly trashed my filesystems with 2.5.72-bk2 and
JN> reiserfs. there are metric shitloads of errors on journal replay and i
JN> end up in repair mode. did a couple of --rebuild-tree's, but new errors
JN> cropped up after every reboot.
JN> happens both on scsi and ide drives and ate almost all of my machine...

Hm. Can I ask for your kernel config, and kernel logs (if possible),
reiserfsck /dev/device -l /somewhere/device.log , and send those logs to
me too.

JN> unfortunately i did a number of things at once: upgrade the kernel from
JN> .72 (which has worked for me quite well), add an ide drive (i didn't
JN> have ide in my kernel before, and geez! is that module code broken :))
JN> and shuffle partitions around. which makes the problem hard to pinpoint.

Not sure what bk3 is for, I have yesterday's bk snapshot and it works for me,
I seem to be unable to reach bkbits.net for today.

JN> if anyone wants me to do some forensics on the machine, speak up.
JN> otherwise i'll swipe it clean and start over from scratch.

I wonder if you can create clean fs, copy some stuff there with 2.5.72-mm2
and see what happens?

Bye,
Oleg

2003-06-21 21:23:06

by Joern Nettingsmeier

[permalink] [raw]
Subject: Re: severe FS corruption w/ reiserfs and 2.5.72-bk3

hello oleg !

thanks for your quick reply!

Oleg Drokin wrote:
> Joern Nettingsmeier <[email protected]> wrote:
>
> JN> i just completely and utterly trashed my filesystems with 2.5.72-bk2 and
> JN> reiserfs. there are metric shitloads of errors on journal replay and i
> JN> end up in repair mode. did a couple of --rebuild-tree's, but new errors
> JN> cropped up after every reboot.
> JN> happens both on scsi and ide drives and ate almost all of my machine...
>
> Hm. Can I ask for your kernel config, and kernel logs (if possible),
> reiserfsck /dev/device -l /somewhere/device.log , and send those logs to
> me too.

sorry, i can't really get anything in and out of that box.
i've been able to extract some parts of the fsck log, they look like this:

vpf-10680: The file [2406 26400] has the wrong block count in the
StatData (8), should be (1)
vpf-10680: The file [2406 49744] has the wrong block count in the
StatData (8), should be (1)
vpf-10680: The file [2406 30103] has the wrong block count in the
StatData (8), should be (1)
vpf-10680: The file [2406 47443] has the wrong block count in the
StatData (8), should be (1)
vpf-10680: The file [2406 55026] has the wrong block count in the
StatData (8), should be (1)
vpf-10680: The file [2406 28681] has the wrong block count in the
StatData (8), should be (1)
vpf-10680: The file [2406 21611] has the wrong block count in the
StatData (8), should be (1)
vpf-10680: The file [2406 21610] has the wrong block count in the
StatData (8), should be (1)
vpf-10680: The file [10 33567] has the wrong block count in the StatData
(8), should be (1)
vpf-10680: The file [10 33568] has the wrong block count in the StatData
(8), should be (1)

it tells me it can fix it with the --fixable option, done that a couple
of times, new errors after that.

i can't send you my kernel config, because the kernel tree was one of
the first things that got eaten.... the entire tree ended up in
lost+found all in pieces. unfortunately i deleted it. :(

it's an smp box with 2 p3s, intel bx chipset, aic7xxx scsi, ide and scsi
compiled into the kernel, reiserfs too. tagged queuing enabled, ignore
word validation bits enabled, use dma by default enabled. it has 3 scsi
discs, 2 of which are striped, and one ide which has one huge partition
of 175G. the scsi discs are 4, 9 and 9 gig.
no atapi devices, only a scsi cdrom and a burner.

the box is so f&%$ed up, when i try to cat a logfile i get:
vpf-10640: the on-disk and the correct bitmaps differs.
nothing else.
:(


> JN> unfortunately i did a number of things at once: upgrade the kernel from
> JN> .72 (which has worked for me quite well), add an ide drive (i didn't
> JN> have ide in my kernel before, and geez! is that module code broken :))
> JN> and shuffle partitions around. which makes the problem hard to pinpoint.
>
> Not sure what bk3 is for, I have yesterday's bk snapshot and it works for me,
> I seem to be unable to reach bkbits.net for today.
>
> JN> if anyone wants me to do some forensics on the machine, speak up.
> JN> otherwise i'll swipe it clean and start over from scratch.
>
> I wonder if you can create clean fs, copy some stuff there with 2.5.72-mm2
> and see what happens?

/var is totally FUBAR, the system won't boot into my fallback kernel. i
don't have a second machine around to compile another kernel on... sorry.

i might be able to put the ide disk into another box tomorrow and try
another reiserfsck with proper logging, but i have no place to check the
scsi disks. i'll keep you posted when i get it done.

best,

j?rn





--
All Members shall refrain in their international relations from
the threat or use of force against the territorial integrity or
political independence of any state, or in any other manner
inconsistent with the Purposes of the United Nations.
-- Charter of the United Nations, Article 2.4


J?rn Nettingsmeier
Kurf?rstenstr 49, 45138 Essen, Germany
http://spunk.dnsalias.org (my server)
http://www.linuxdj.com/audio/lad/ (Linux Audio Developers)




2003-06-22 07:50:49

by Oleg Drokin

[permalink] [raw]
Subject: Re: severe FS corruption w/ reiserfs and 2.5.72-bk3

Hello!

On Sun, Jun 22, 2003 at 12:36:33AM +0200, Joern Nettingsmeier wrote:

> >JN> i just completely and utterly trashed my filesystems with 2.5.72-bk2
> >and JN> reiserfs. there are metric shitloads of errors on journal replay
> >and i JN> end up in repair mode. did a couple of --rebuild-tree's, but new
> >errors JN> cropped up after every reboot.
> >JN> happens both on scsi and ide drives and ate almost all of my machine...
> >Hm. Can I ask for your kernel config, and kernel logs (if possible),
> >reiserfsck /dev/device -l /somewhere/device.log , and send those logs to
> >me too.
> sorry, i can't really get anything in and out of that box.
> i've been able to extract some parts of the fsck log, they look like this:
> vpf-10680: The file [2406 26400] has the wrong block count in the
> (8), should be (1)

These are not looking dangerous, probably some symlinks created long ago.
(we changed block accounting for symlinks some time ago).

> it tells me it can fix it with the --fixable option, done that a couple
> of times, new errors after that.

You mean new errors of the same kind?

> it's an smp box with 2 p3s, intel bx chipset, aic7xxx scsi, ide and scsi
> compiled into the kernel, reiserfs too. tagged queuing enabled, ignore

Well, TQ was broken on IDE some time ago, but then Jens Axboe said he
fixed that, anyway that does not explain why all the disks got eaten.

> >JN> if anyone wants me to do some forensics on the machine, speak up.
> >JN> otherwise i'll swipe it clean and start over from scratch.
> >I wonder if you can create clean fs, copy some stuff there with 2.5.72-mm2
> >and see what happens?
> /var is totally FUBAR, the system won't boot into my fallback kernel. i
> don't have a second machine around to compile another kernel on... sorry.

Sigh.

> i might be able to put the ide disk into another box tomorrow and try
> another reiserfsck with proper logging, but i have no place to check the
> scsi disks. i'll keep you posted when i get it done.

Ok, perhaps you can put some kind of rescue system on that IDE disk,
and try to boot it in original box?

Bye,
Oleg