LinuxLists.cc - ext3 / reiserfs data corruption, 2.5-bk

2003-06-09 19:22:20

Subject: ext3 / reiserfs data corruption, 2.5-bk

2.5 Bitkeeper tree as of last 24 hrs. Running a lot
of disk IO stress (multiple fsstress, over 100 fsx instances,
and random sync calling) produced failures on both reiserfs
and ext3.

Tests were done on separate disks, but concurrently.

fsx logs at
http://www.codemonkey.org.uk/cruft/reiserfs.fsxlog
http://www.codemonkey.org.uk/cruft/ext3.fsxlog

Dave
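
For readers who want to try a load of the same shape (many concurrent
workers plus sync calls at random moments), here is a minimal sketch.
It is not the harness used above: run_stress, its parameters, and the
placeholder workload are assumptions made for illustration, and the
real fsx/fsstress command lines are not given in this thread, so
substitute your own.

```python
import os
import random
import subprocess
import sys
import time


def run_stress(cmd, instances, max_sync_gap=1.0, do_sync=os.sync):
    """Start `instances` copies of `cmd`, call do_sync() at random
    intervals while any copy is still running, and return a tuple of
    (exit codes, number of sync calls made)."""
    procs = [subprocess.Popen(cmd) for _ in range(instances)]
    syncs = 0
    while any(p.poll() is None for p in procs):
        # Sleep a random amount so the syncs land at unpredictable points.
        time.sleep(random.uniform(0.0, max_sync_gap))
        do_sync()
        syncs += 1
    return [p.returncode for p in procs], syncs


if __name__ == "__main__":
    # Placeholder workload (hypothetical): replace with the real
    # fsx or fsstress invocation for the filesystem under test.
    workload = [sys.executable, "-c", "pass"]
    codes, syncs = run_stress(workload, instances=4, max_sync_gap=0.1)
    failed = sum(1 for c in codes if c != 0)
    print(f"{failed} of {len(codes)} instances failed; sync called {syncs} times")
```

A non-zero exit code from a worker is the only failure signal this
sketch looks at; the per-instance logs still have to be read by hand.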

2003-06-10 08:29:44

by Oleg Drokin

Subject: Re: ext3 / reiserfs data corruption, 2.5-bk

Hello!

On Mon, Jun 09, 2003 at 08:35:55PM +0100, Dave Jones wrote:

> 2.5 Bitkeeper tree as of last 24 hrs. Running a lot
> of disk IO stress (multiple fsstress, over 100 fsx instances,
> and random sync calling) produced failures on both reiserfs
> and ext3.
> Tests were done on separate disks, but concurrently.

Do you have smp or preempt enabled?

Bye,
Oleg

2003-06-10 09:06:45

by Dave Jones

Subject: Re: ext3 / reiserfs data corruption, 2.5-bk

On Tue, Jun 10, 2003 at 12:43:23PM +0400, Oleg Drokin wrote:

> > 2.5 Bitkeeper tree as of last 24 hrs. Running a lot
> > of disk IO stress (multiple fsstress, over 100 fsx instances,
> > and random sync calling) produced failures on both reiserfs
> > and ext3.
> > Tests were done on separate disks, but concurrently.
>
> Do you have smp or preempt enabled?

# CONFIG_SMP is not set
CONFIG_PREEMPT=y

Dave
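
Anyone answering the same question for their own build can read the
two options straight out of the kernel .config. A minimal sketch,
assuming the two line formats shown above ("CONFIG_X=value" and
"# CONFIG_X is not set"); parse_kconfig and describe are names made up
for this example.

```python
import re

SET_RE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_RE = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kconfig(text):
    """Map option name -> value string, or None when explicitly unset."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = SET_RE.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = UNSET_RE.match(line)
        if m:
            options[m.group(1)] = None
    return options


def describe(options, names=("CONFIG_SMP", "CONFIG_PREEMPT")):
    """Report which of the named options are built in (value 'y')."""
    return {name: options.get(name) == "y" for name in names}


sample = """\
# CONFIG_SMP is not set
CONFIG_PREEMPT=y
"""
print(describe(parse_kconfig(sample)))
# prints {'CONFIG_SMP': False, 'CONFIG_PREEMPT': True}
```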

2003-06-10 17:30:57

by Nathan Conrad

Subject: Re: ext3 / reiserfs data corruption, 2.5-bk

I've been noticing a similar problem on my laptop. This may, or may
not be related, but it did start somewhere within the past week (maybe
the IDE taskfile conversion???, to throw out a guess). I wonder if
Dave Jones is using IDE or SCSI. CONFIG_SMP and CONFIG_PREEMPT are
disabled on my machine (Sony Vaio PCG-FXA49 laptop, Athlon4). I'm
compiling the kernel with gcc 3.3 (Debian version).

Anyway, certain directories get locked up on occasion and when I try
to execute 'ls' or read from the directory, the process gets into a
locked up state; ^C does not work to kill the process. The only way to
make a directory "readable" is to restart the machine. I have not
noticed any FS corruption, just the lack of being able to enter the
directory.

At the same time, a kernel bug will be displayed:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c016781a
*pde = 00000000
Oops: 0000 [#1]
CPU: 0
EIP: 0060:[find_inode_fast+26/96] Not tainted
EFLAGS: 00010286
EIP is at find_inode_fast+0x1a/0x60
eax: db0355c4 ebx: 0001859f ecx: c3a69844 edx: 00000000
esi: dfd60c00 edi: dff99340 ebp: dff99340 esp: cc6dde50
ds: 007b es: 007b ss: 0068
Process emacs20 (pid: 16508, threadinfo=cc6dc000 task=c6d0adc0)
Stack: c4bca5b8 0001859f 0001859f dfd60c00 c0167d2e dfd60c00 dff99340 0001859f
0001859f da191d40 dfd60c00 da191d40 c018e45b dfd60c00 0001859f db666130
fffffff4 dca22aac dca22a44 c015cd60 dca22a44 da191d40 00000000 cc6ddf48
Call Trace:
[iget_locked+78/160] iget_locked+0x4e/0xa0
[ext3_lookup+107/208] ext3_lookup+0x6b/0xd0
[real_lookup+192/240] real_lookup+0xc0/0xf0
[do_lookup+158/176] do_lookup+0x9e/0xb0
[link_path_walk+1066/2000] link_path_walk+0x42a/0x7d0
[__user_walk+73/96] __user_walk+0x49/0x60
[vfs_stat+31/96] vfs_stat+0x1f/0x60
[sys_stat64+27/64] sys_stat64+0x1b/0x40
[syscall_call+7/11] syscall_call+0x7/0xb

Code: 0f 18 02 90 39 59 18 89 c8 74 0f 85 d2 89 d1 75 ed 31 c0 83
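
Each frame in the oops above appears twice: decimal inside the
brackets, hex after them, both as offset/size within the function. A
minimal sketch that decodes the hex form and works out where the
faulting function starts; parse_frame and function_start are helper
names invented here, and only the values printed in the oops are used.

```python
import re

# Matches the "symbol+0xOFFSET/0xSIZE" form used in the oops text.
FRAME_RE = re.compile(r"(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def parse_frame(line):
    """Return (symbol, offset, size) for one oops line, or None."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16)


def function_start(eip, offset):
    """Address of the first byte of the function containing eip."""
    return eip - offset


name, offset, size = parse_frame("EIP is at find_inode_fast+0x1a/0x60")
start = function_start(0xC016781A, offset)
print(f"{name}: offset {offset} of {size} bytes, starts at {start:08x}")
# prints find_inode_fast: offset 26 of 96 bytes, starts at c0167800
```

The decoded 26/96 agrees with the bracketed form on the EIP line, so
the fault is early in find_inode_fast rather than near its end.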

On Tue, Jun 10, 2003 at 12:43:23PM +0400, Oleg Drokin wrote:
> Hello!
>
> On Mon, Jun 09, 2003 at 08:35:55PM +0100, Dave Jones wrote:
>
> > 2.5 Bitkeeper tree as of last 24 hrs. Running a lot
> > of disk IO stress (multiple fsstress, over 100 fsx instances,
> > and random sync calling) produced failures on both reiserfs
> > and ext3.
> > Tests were done on separate disks, but concurrently.
>
> Do you have smp or preempt enabled?
>
> Bye,
> Oleg

-Nathan Conrad

--
Nathan J. Conrad
GPG: F4FC 7E25 9308 ECE1 735C 0798 CE86 DA45 9170 3112

2003-06-10 17:58:17

by Bartlomiej Zolnierkiewicz

Subject: Re: ext3 / reiserfs data corruption, 2.5-bk

On Tue, 10 Jun 2003, Nathan Conrad wrote:

> I've been noticing a similar problem on my laptop. This may, or may
> not be related, but it did start somewhere within the past week (maybe
> the IDE taskfile conversion???, to throw out a guess). I wonder if

wrt taskfile conversion, if you are using DMA on your IDE disks,
there shouldn't be any change in behaviour.

I will prepare a patch adding the old crap back and making it selectable
(default will be taskfile; if you run into problems you can check
with the old code) to ease spotting possible taskfile problems
and allow a quick verdict: taskfile guilty/not guilty.

--
Bartlomiej

> Dave Jones is using IDE or SCSI. CONFIG_SMP and CONFIG_PREEMPT are
> disabled on my machine (Sony Vaio PCG-FXA49 laptop, Athlon4). I'm
> compiling the kernel with gcc 3.3 (Debian version).
>
> Anyway, certain directories get locked up on occasion and when I try
> to execute 'ls' or read from the directory, the process gets into a
> locked up state; ^C does not work to kill the process. The only way to
> make a directory "readable" is to restart the machine. I have not
> noticed any FS corruption, just the lack of being able to enter the
> directory.
>
> At the same time, a kernel bug will be displayed:

<...>

2003-06-10 18:05:29

by Nathan Conrad

Subject: Re: ext3 / reiserfs data corruption, 2.5-bk

Oh, ok. I am using DMA on my drives. The problem with this bug is that
it is fairly hard to observe; I've only seen it about once every other
day. I should have also pointed out that I am using ext3.

I thought that it might be the taskfile stuff because that was the major
change in the kernel right before I started to notice these problems.
Since you say that there should be no change in behaviour, there is
likely some other source of problems.

-Nathan

On Tue, Jun 10, 2003 at 08:11:22PM +0200, Bartlomiej Zolnierkiewicz wrote:
>
> On Tue, 10 Jun 2003, Nathan Conrad wrote:
>
> > I've been noticing a similar problem on my laptop. This may, or may
> > not be related, but it did start somewhere within the past week (maybe
> > the IDE taskfile conversion???, to throw out a guess). I wonder if
>
> wrt taskfile conversion, if you are using DMA on your IDE disks,
> there shouldn't be any change in behaviour.
>
> I will prepare a patch adding the old crap back and making it selectable
> (default will be taskfile; if you run into problems you can check
> with the old code) to ease spotting possible taskfile problems
> and allow a quick verdict: taskfile guilty/not guilty.
>
> --
> Bartlomiej
>
> > Dave Jones is using IDE or SCSI. CONFIG_SMP and CONFIG_PREEMPT are
> > disabled on my machine (Sony Vaio PCG-FXA49 laptop, Athlon4). I'm
> > compiling the kernel with gcc 3.3 (Debian version).
> >
> > Anyway, certain directories get locked up on occasion and when I try
> > to execute 'ls' or read from the directory, the process gets into a
> > locked up state; ^C does not work to kill the process. The only way to
> > make a directory "readable" is to restart the machine. I have not
> > noticed any FS corruption, just the lack of being able to enter the
> > directory.
> >
> > At the same time, a kernel bug will be displayed:
>
> <...>

2003-06-10 20:52:45

by Andrew Morton

Subject: Re: ext3 / reiserfs data corruption, 2.5-bk

Nathan Conrad wrote:
>
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> printing eip:
> c016781a
> *pde = 00000000
> Oops: 0000 [#1]
> CPU: 0
> EIP: 0060:[find_inode_fast+26/96] Not tainted

Something scribbled on your inode hash chains. Please make sure that
you're building the kernel with all the memory debug options enabled, and
run memtest86 on that machine for 12 hours or so.

2003-06-10 22:36:05

by Dave Jones

Subject: Re: ext3 / reiserfs data corruption, 2.5-bk

On Tue, Jun 10, 2003 at 05:44:36PM -0400, Nathan Conrad wrote:
> I've been noticing a similar problem on my laptop. This may, or may
> not be related, but it did start somewhere within the past week (maybe
> the IDE taskfile conversion???, to throw out a guess). I wonder if
> Dave Jones is using IDE or SCSI.

IDE. I'm too cheap to buy SCSI.

Dave

2003-06-12 05:07:21

by Nathan Conrad

Subject: Re: ext3 / reiserfs data corruption, 2.5-bk; NULL pointer dereference bug

I just saw another one of these NULL pointer dereference oopses on my
laptop:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c01665f3
*pde = 00000000
Oops: 0000 [#1]
CPU: 0
EIP: 0060:[__d_lookup+99/256] Not tainted
EFLAGS: 00210282
EIP is at __d_lookup+0x63/0x100
eax: 00000000 ebx: c06ef980 ecx: 00000010 edx: dfe80000
esi: dfe8da40 edi: 00000000 ebp: df85be70 esp: db047ec8
ds: 007b es: 007b ss: 0068
Process gcc (pid: 4738, threadinfo=db046000 task=c22198c0)
Stack: dcfcc014 c012a225 00000000 00000000 dfe8da40 db047f48 00000000 dcfcc001
0029e101 00000003 dcfcc001 db047f90 dfff4fc0 db047f3c c015cf80 dfd50e00
db047f44 c015cb64 dcfcc001 dcfcc005 db047f3c db047f44 c015d129 db047f90
Call Trace:
[in_group_p+37/48] in_group_p+0x25/0x30
[do_lookup+48/176] do_lookup+0x30/0xb0
[permission+84/112] permission+0x54/0x70
[link_path_walk+297/2000] link_path_walk+0x129/0x7d0
[__user_walk+73/96] __user_walk+0x49/0x60
[sys_access+129/320] sys_access+0x81/0x140
[syscall_call+7/11] syscall_call+0x7/0xb

Code: 0f 18 00 90 8b 74 24 10 8d 5d 90 39 73 78 75 17 8b 7b 58 89

I ran memtest86 for about 14 hours and it passed all of its tests. I
enabled the memory debugging options (under the kernel hacking
section) and I did not notice any errors displayed by it in my syslog.

I'm not sure what else to try... The backtrace is significantly
different from the last one...
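
The two backtraces fault in different functions, but lining the frames
up shows how much of the path they share. A minimal sketch using the
call-trace lines from the two oopses in this thread; frame_names and
the variable names are made up for the example, and it says nothing
about the cause.

```python
import re

# Picks the symbol out of each "symbol+0xOFFSET/0xSIZE" entry.
FRAME_RE = re.compile(r"(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")

FIRST_TRACE = """\
[iget_locked+78/160] iget_locked+0x4e/0xa0
[ext3_lookup+107/208] ext3_lookup+0x6b/0xd0
[real_lookup+192/240] real_lookup+0xc0/0xf0
[do_lookup+158/176] do_lookup+0x9e/0xb0
[link_path_walk+1066/2000] link_path_walk+0x42a/0x7d0
[__user_walk+73/96] __user_walk+0x49/0x60
[vfs_stat+31/96] vfs_stat+0x1f/0x60
[sys_stat64+27/64] sys_stat64+0x1b/0x40
[syscall_call+7/11] syscall_call+0x7/0xb
"""

SECOND_TRACE = """\
[in_group_p+37/48] in_group_p+0x25/0x30
[do_lookup+48/176] do_lookup+0x30/0xb0
[permission+84/112] permission+0x54/0x70
[link_path_walk+297/2000] link_path_walk+0x129/0x7d0
[__user_walk+73/96] __user_walk+0x49/0x60
[sys_access+129/320] sys_access+0x81/0x140
[syscall_call+7/11] syscall_call+0x7/0xb
"""


def frame_names(trace):
    """Function names in the order they appear in a call trace."""
    return [m.group(1) for m in FRAME_RE.finditer(trace)]


first = frame_names(FIRST_TRACE)
second = frame_names(SECOND_TRACE)
shared = [name for name in first if name in set(second)]
print(shared)
# prints ['do_lookup', 'link_path_walk', '__user_walk', 'syscall_call']
```

Both oopses were hit during a path lookup (do_lookup under
link_path_walk); what differs is the syscall that started the lookup
and the function that finally touched the bad pointer.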

On Tue, Jun 10, 2003 at 01:59:35PM -0700, Andrew Morton wrote:
> Nathan Conrad wrote:
> >
> > Unable to handle kernel NULL pointer dereference at virtual address 00000000
> > printing eip:
> > c016781a
> > *pde = 00000000
> > Oops: 0000 [#1]
> > CPU: 0
> > EIP: 0060:[find_inode_fast+26/96] Not tainted
>
> Something scribbled on your inode hash chains. Please make sure that
> you're building the kernel with all the memory debug options enabled, and
> run memtest86 on that machine for 12 hours or so.
