2000-12-12 03:24:07

by Linus Torvalds

Subject: Linux-2.4.0-test12


Ok, there it is. Noticeable changes from pre8 are mainly (a) new tq list
compile fixes and (b) the NetApp snapshot thing.

Dave's merge_segments thing could in theory be a deadlock on SMP.

Linus


----
- final:
- David Miller: sparc and net updates. Fix merge_segments.
- Dan Aloni: ISA PnP name parsing cleanup
- Mohammad Haque and others: hunt down tq initializations.
- Petr Vandrovec: ncpfs config changes
- Neil Brown: raid and md cleanups
- Pete Zaitcev: ymfpci update
- Alan Cox: sync (network driver MODULE_OWNER and cleanups)
- Martin Diehl: pirq router for VLSI 82C534 (HP OmniBook and others)
- Tigran Aivazian: ia32 microcode driver update
- Tim Waugh: parport fixes (ECP write, documentation)
- Richard Henderson: alpha update
- David Woodhouse: MTD update
- Trond Myklebust: index the NFS inode cache using the file handle.
This makes NetApp snapshot directories do the right thing.

- pre8:
- Stephen Rothwell: APM updates
- Johannes Erdfelt: USB updates
- me: call_usermodehelper(/sbin/hotplug) cleanup and deadlock fix
- Leonard Zubkoff: DAC960 Driver Update
- Martin Diehl: fix PCI PM callback ordering
- Andrew Morton: call_usermodehelper() fixes
- Urban Widmark: clean up and enable shared mmap on smbfs.
- Trond Myklebust: fix NFS path revalidation.

- pre7:
- Kai Germaschewski: ymfpci cleanups and resource leak fixes
- me: UHCI drivers really need to enable bus mastering.
- Trond Myklebust: fix up nfs_writepage_sync() to not require "filp".
- Andrew Morton: "tq_scheduler" is no more. We have keventd.
- Nils Faerber: cs46xx sounddriver update

- pre6:
- Alan Cox: synch. PA-RISC arch and bitops cleanups
- Maciej Rozycki: even more proper apic setup order.
- Andrew Morton: exec_usermodehelper fixes
- Adam Richter, Kai Germaschewski, me: PCI irq routing.
- revert A20 code changes. We really need to use the keyboard
controller if one exists.
- Johannes Erdfelt: USB updates
- Ralf Baechle: MIPS memmove() fix.

- pre5:
- Jaroslav Kysela: ymfpci driver
- me: get rid of bogus MS_INVALIDATE semantics
- me: final part of the PageDirty() saga
- Rusty Russell: 4-way SMP iptables fix
- Al Viro: oops - bad ext2 inode dirty block bug

- pre4:
- Andries Brouwer: final isofs pieces.
- Kai Germaschewski: ISDN
- play CD audio correctly, don't stop after 12 minutes.
- Anton Altaparmakov: disable NTFS mmap for now, as it doesn't work.
- Stephen Tweedie: fix inode dirty block handling
- Bill Hartner: reschedule_idle - prefer right cpu
- Johannes Erdfelt: USB updates
- Alan Cox: synchronize
- Richard Henderson: alpha updates and optimizations
- Geert Uytterhoeven: fbdev could be fooled into crashing fix
- Trond Myklebust: NFS filehandles in inode rather than dentry

- pre3:
- me: more PageDirty / swapcache handling
- Neil Brown: raid and md init fixes
- David Brownell: pci hotplug sanitization.
- Kanoj Sarcar: mips64 update
- Kai Germaschewski: ISDN sync
- Andreas Bombe: ieee1394 cleanups and fixes
- Johannes Erdfelt: USB update
- David Miller: Sparc and net update
- Trond Myklebust: RPC layer SMP fixes
- Thomas Sailer: mixed sound driver fixes
- Tigran Aivazian: use atomic_dec_and_lock() for free_uid()

- pre2:
- Peter Anvin: more P4 configuration parsing
- Stephen Tweedie: O_SYNC patches. Make O_SYNC/fsync/fdatasync
do the right thing.
- Keith Owens: make module loading use the right struct module size
- Boszormenyi Zoltan: get MTRR's right for the >32-bit case
- Alan Cox: various random documentation etc
- Dario Ballabio: EATA and u14-34f update
- Ivan Kokshaysky: unbreak alpha ruffian
- Richard Henderson: PCI bridge initialization on alpha
- Zach Brown: correct locking in Maestro driver
- Geert Uytterhoeven: more m68k updates
- Andrey Savochkin: eepro100 update
- Dag Brattli: irda update
- Johannes Erdfelt: USB update

- pre1: (for ISDN synchronization _ONLY_! Not complete!)
- Byron Stanoszek: correct decimal precision for CPU MHz in
/proc/cpuinfo
- Ollie Lho: SiS pirq routing.
- Andries Brouwer: isofs cleanups
- Matt Kraai: /proc read() on directories should return EISDIR, not EINVAL
- me: be stricter about what we accept as a PCI bridge setup.
- me: always set PCI interrupts to be level-triggered when we enable them.
- me: updated PageDirty and swap cache handling
- Peter Anvin: update A20 code to work without keyboard controller
- Kai Germaschewski: ISDN updates
- Russell King: ARM updates
- Geert Uytterhoeven: m68k updates


2000-12-12 15:37:04

by Jasper Spaans

Subject: [BUG] raid5 crash with 2.4.0-test12 [Was: Linux-2.4.0-test12]

On Mon, Dec 11, 2000 at 06:52:55PM -0800, Linus Torvalds wrote:
>
> Ok, there it is. Noticeable changes from pre8 are mainly (a) new tq list
> compile fixes and (b) the NetApp snapshot thing.

> - final:
> - Neil Brown: raid and md cleanups

Hmm, while doing some not-so-heavy things with a mysql-db on a raid5-device
this kernel Oopsed on me; ksymoops output [which went through klogd,
shouldn't matter that much, klogd was using the right System.map]:

Dec 12 14:04:50 spaans kernel: invalid operand: 0000
Dec 12 14:04:50 spaans kernel: CPU: 1
Dec 12 14:04:50 spaans kernel: EIP: 0010:[end_buffer_io_bad+85/92]
Dec 12 14:04:50 spaans kernel: EFLAGS: 00010286
Dec 12 14:04:50 spaans kernel: eax: 0000001c ebx: c5418dc8 ecx: 00000000 edx: 01000000
Dec 12 14:04:50 spaans kernel: esi: cfd9b960 edi: cfd9b800 ebp: c5418d80 esp: c14ade8c
Dec 12 14:04:50 spaans kernel: ds: 0018 es: 0018 ss: 0018
Dec 12 14:04:50 spaans kernel: Process raid5d (pid: 9, stackpage=c14ad000)
Dec 12 14:04:50 spaans kernel: Stack: c0225409 c022575a 000002fd 00000008 c01c91ac c5418d80 00000001 00000008
Dec 12 14:04:50 spaans kernel: cfd9b800 00000002 00000000 c01c9d0f cfd9b800 00000002 00000001 cfd9b800
Dec 12 14:04:50 spaans kernel: c146c400 cfd9b800 00000003 00000003 c146c400 c01ca867 cfd9b800 cfd9b800
Dec 12 14:04:50 spaans kernel: Call Trace: [tvecs+13949/58456] [tvecs+14798/58456] [raid5_end_buffer_io+68/128] [complete_stripe+151/272] [handle_stripe+331/1092] [raid5d+173/260] [md_thread+299/508]
Dec 12 14:04:50 spaans kernel: Code: 0f 0b 83 c4 0c 5b c3 55 57 56 53 8b 5c 24 14 8b 54 24 18 85
Using defaults from ksymoops -t elf32-i386 -a i386

Code; 00000000 Before first symbol
00000000 <_EIP>:
Code; 00000000 Before first symbol
0: 0f 0b ud2a
Code; 00000002 Before first symbol
2: 83 c4 0c add $0xc,%esp
Code; 00000005 Before first symbol
5: 5b pop %ebx
Code; 00000006 Before first symbol
6: c3 ret
Code; 00000007 Before first symbol
7: 55 push %ebp
Code; 00000008 Before first symbol
8: 57 push %edi
Code; 00000009 Before first symbol
9: 56 push %esi
Code; 0000000a Before first symbol
a: 53 push %ebx
Code; 0000000b Before first symbol
b: 8b 5c 24 14 mov 0x14(%esp,1),%ebx
Code; 0000000f Before first symbol
f: 8b 54 24 18 mov 0x18(%esp,1),%edx
Code; 00000013 Before first symbol
13: 85 00 test %eax,(%eax)

Regards,
--
Jasper Spaans <[email protected]>

2000-12-12 19:37:21

by Linus Torvalds

Subject: Re: [BUG] raid5 crash with 2.4.0-test12 [Was: Linux-2.4.0-test12]



On Tue, 12 Dec 2000, Jasper Spaans wrote:

> On Mon, Dec 11, 2000 at 06:52:55PM -0800, Linus Torvalds wrote:
> >
> > Ok, there it is. Noticeable changes from pre8 are mainly (a) new tq list
> > compile fixes and (b) the NetApp snapshot thing.
>
> > - final:
> > - Neil Brown: raid and md cleanups
>
> Hmm, while doing some not-so-heavy things with a mysql-db on a raid5-device
> this kernel Oopsed on me; ksymoops output [which went through klogd,
> shouldn't matter that much, klogd was using the right System.map]:
>
> Dec 12 14:04:50 spaans kernel: invalid operand: 0000
> Dec 12 14:04:50 spaans kernel: CPU: 1
> Dec 12 14:04:50 spaans kernel: EIP: 0010:[end_buffer_io_bad+85/92]
>
> Dec 12 14:04:50 spaans kernel: Call Trace:
> [raid5_end_buffer_io+68/128]
> [complete_stripe+151/272]
> [handle_stripe+331/1092]
> [raid5d+173/260]
> [md_thread+299/508]

Looks like somebody didn't initialize the "b_end_io" pointer - the code
defaults to it being "end_buffer_io_bad" (which oopses unconditionally on
purpose exactly to find places where it wasn't initialized).

And it obviously looks like it's the raid5 code that does it.

It _looks_ like the raid5 code does a "generic_make_request()" without
setting b_end_io anywhere, but I don't know the raid5 code well enough.

To get better debug output, could you please do something for me?

In fs/buffer.c, get rid of "end_buffer_io_bad" completely, and replace all
users of it with NULL.

Then, in drivers/block/ll_rw_blk.c: generic_make_request(), add a test
like

if (!bh->b_end_io) BUG();

to the top of that function.

You'll still get an oops, but the difference is that you'll get the oops
when the request is queued, rather than when the request is finished, which
will make it easier to figure out what the thing is that leads up to this.
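
A minimal user-space sketch of the idea, for readers who want to see why the earlier check helps. It is not kernel code: the two-field struct buffer_head, end_io_ok() and submit_checked() are hypothetical stand-ins, and where the kernel would call BUG() this model returns -1 so the failure can be observed without crashing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for the kernel's struct buffer_head. */
struct buffer_head {
    void (*b_end_io)(struct buffer_head *bh, int uptodate);
    int completed;              /* set by the completion handler in this model */
};

/* A well-behaved completion handler. */
static void end_io_ok(struct buffer_head *bh, int uptodate)
{
    bh->completed = uptodate ? 1 : -1;
}

/*
 * Model of the proposed check: reject the request at queue time if no
 * completion handler was set. The kernel would BUG() here; returning -1
 * keeps the sketch testable.
 */
static int submit_checked(struct buffer_head *bh)
{
    assert(bh != NULL);
    if (!bh->b_end_io)
        return -1;              /* caught while the submitter is still on the stack */
    bh->b_end_io(bh, 1);        /* pretend the I/O finished immediately */
    return 0;
}
```

With the check at submission, the failing call chain names whoever queued the buffer. With a poison handler that only fires at completion, the chain names whoever finished the I/O, which is what the raid5d trace above shows.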

In the meantime I'm sure Neil can figure out where in the raid5 code we
don't initialize the buffer head properly even without that, but it's
worth doing the above anyway.

Thanks,

Linus

2000-12-12 20:28:01

by NeilBrown

Subject: Re: [BUG] raid5 crash with 2.4.0-test12 [Was: Linux-2.4.0-test12]

On Tuesday December 12, [email protected] wrote:
>
>
> On Tue, 12 Dec 2000, Jasper Spaans wrote:
>
> > On Mon, Dec 11, 2000 at 06:52:55PM -0800, Linus Torvalds wrote:
> > >
> > > Ok, there it is. Noticeable changes from pre8 are mainly (a) new tq list
> > > compile fixes and (b) the NetApp snapshot thing.
> >
> > > - final:
> > > - Neil Brown: raid and md cleanups
> >
> > Hmm, while doing some not-so-heavy things with a mysql-db on a raid5-device
> > this kernel Oopsed on me; ksymoops output [which went through klogd,
> > shouldn't matter that much, klogd was using the right System.map]:
> >
> > Dec 12 14:04:50 spaans kernel: invalid operand: 0000
> > Dec 12 14:04:50 spaans kernel: CPU: 1
> > Dec 12 14:04:50 spaans kernel: EIP: 0010:[end_buffer_io_bad+85/92]
> >
> > Dec 12 14:04:50 spaans kernel: Call Trace:
> > [raid5_end_buffer_io+68/128]
> > [complete_stripe+151/272]
> > [handle_stripe+331/1092]
> > [raid5d+173/260]
> > [md_thread+299/508]
>
> Looks like somebody didn't initialize the "b_end_io" pointer - the code
> defaults to it being "end_buffer_io_bad" (which oopses unconditionally on
> purpose exactly to find places where it wasn't initialized).
>
> And it obviously looks like it's the raid5 code that does it.

Guilt by association :-)

What this bit of code (complete_stripe/raid5_end_buffer_io) is doing
is observing that it has completed some I/O request that was made of
the raid5 device and is calling the b_end_io on the buffer_head that
it was passed. So it is not one of raid5's buffers that has the bad
b_end_io, but someone else's buffer that raid5 was asked to service.

You say "things with a mysql-db on a raid5-device". Can I interpret
this to mean that mysql was talking directly to /dev/md0, or is there
some filesystem in-between?
Either way, I expect Linus' suggestion will provide the answer.

NeilBrown


>
> It _looks_ like the raid5 code does a "generic_make_request()" without
> setting b_end_io anywhere, but I don't know the raid5 code well enough.
>
> To get better debug output, could you please do something for me?
>
> In fs/buffer.c, get rid of "end_buffer_io_bad" completely, and replace all
> users of it with NULL.
>
> Then, in drivers/block/ll_rw_blk.c: generic_make_request(), add a test
> like
>
> if (!bh->b_end_io) BUG();
>
> to the top of that function.
>
> You'll still get an oops, but the difference is that you'll get the oops
> when the request is queued, rather than when the request is finished, which
> will make it easier to figure out what the thing is that leads up to this.
>
> In the meantime I'm sure Neil can figure out where in the raid5 code we
> don't initialize the buffer head properly even without that, but it's
> worth doing the above anyway.
>
> Thanks,
>
> Linus

2000-12-12 20:38:44

by Jasper Spaans

Subject: Re: [BUG] raid5 crash with 2.4.0-test12 [Was: Linux-2.4.0-test12]

On Wed, Dec 13, 2000 at 06:56:22AM +1100, Neil Brown wrote:

> Guilt by association :-)
>
> What this bit of code (complete_stripe/raid5_end_buffer_io) is doing
> is observing that it has completed some I/O request that was made of
> the raid5 device and is calling the b_end_io on the buffer_head that
> it was passed. So it is not one of raid5's buffers that has the bad
> b_end_io, but someone else's buffer that raid5 was asked to service.
>
> You say "things with a mysql-db on a raid5-device". Can I interpret
> this to mean that mysql was talking directly to /dev/md0, or is there
> some filesystem in-between?
> Either way, I expect Linus' suggestion will provide the answer.

Will try to reproduce this, with Linus' suggestion; btw, this mysql-db is
running on ext2, nothing exotic.

Regards,
--
Q_. Jasper Spaans <mailto:[email protected]> -o)
`~\ Conditional Access/DVB-C/OpenTV/Unix-adviseur /\\
Mr /\ _\_v
Zap Een ongezellig dure consultant nodig? Mail [email protected]

2000-12-12 23:12:18

by Jasper Spaans

Subject: Re: [BUG] raid5 crash with 2.4.0-test12 [Was: Linux-2.4.0-test12]

On Tue, Dec 12, 2000 at 11:06:07AM -0800, Linus Torvalds wrote:

> > Dec 12 14:04:50 spaans kernel: invalid operand: 0000
> > Dec 12 14:04:50 spaans kernel: CPU: 1
> > Dec 12 14:04:50 spaans kernel: EIP: 0010:[end_buffer_io_bad+85/92]
> >
> > Dec 12 14:04:50 spaans kernel: Call Trace:
> > [raid5_end_buffer_io+68/128]
> > [complete_stripe+151/272]
> > [handle_stripe+331/1092]
> > [raid5d+173/260]
> > [md_thread+299/508]
>
> Looks like somebody didn't initialize the "b_end_io" pointer - the code
> defaults to it being "end_buffer_io_bad" (which oopses unconditionally on
> purpose exactly to find places where it wasn't initialized).
>
> And it obviously looks like it's the raid5 code that does it.
>
> To get better debug output, could you please do something for me?
>
> In fs/buffer.c, get rid of "end_buffer_io_bad" completely, and replace all
> users of it with NULL.
>
> Then, in drivers/block/ll_rw_blk.c: generic_make_request(), add a test
> like
>
> if (!bh->b_end_io) BUG();
>
> to the top of that function.
>
> You'll still get an oops, but the difference is that you'll get the oops
> when the request is queued, rather than when the request is finished, which
> will make it easier to figure out what the thing is that leads up to this.

Right, well, I applied your suggestions, and I got it to oops -- dunno how,
but it oopsed.

Unable to handle kernel NULL pointer dereference at virtual address 00000000
00000000
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<00000000>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010282
eax: 00000000 ebx: 00000008 ecx: c146c400 edx: 00000202
esi: c154d560 edi: c154d400 ebp: c70791a0 esp: c14ade9c
ds: 0018 es: 0018 ss: 0018
Process raid5d (pid: 9, stackpage=c14ad000)
Stack: c01c916c c70791a0 00000001 00000008 c154d400 00000002 00000000 c01c9ccf
c154d400 00000002 00000001 c154d400 c146c400 c154d400 00000003 00000003
c146c400 c01ca827 c154d400 c154d400 c146c400 000000a2 c14b1ec0 c02e8c80
Call Trace: [<c01c916c>] [<c01c9ccf>] [<c01ca827>] [<c018c070>] [<c018c0c9>] [<c018c3ed>] [<c01cad99>]
[<c01d1b57>] [<c0109a88>]
Code: Bad EIP value.

>>EIP; 00000000 Before first symbol
Trace; c01c916c <raid5_end_buffer_io+44/80>
Trace; c01c9ccf <complete_stripe+97/110>
Trace; c01ca827 <handle_stripe+14b/444>
Trace; c018c070 <start_request+134/204>
Trace; c018c0c9 <start_request+18d/204>
Trace; c018c3ed <ide_do_request+285/2dc>
Trace; c01cad99 <raid5d+ad/104>
Trace; c01d1b57 <md_thread+12b/1fc>
Trace; c0109a88 <kernel_thread+28/38>

Strange thing is that it doesn't call BUG() and the trace seems quite
identical -- this caused me to start looking at the code in
drivers/md/raid5.c and it seems this null pointer deref is coming from there
- Neil, do you have some documentation on how this code should work, as
stripe_head causes some null-pointer-derefs inside my head..

It seems a stripe_head is quite similar to a buffer_head, but why is
raid5_end_buffer_io calling the b_end_io function from the stripe_head, I'd
assume it should be called from ll_rw_blk.c?

Regards,
--
Jasper Spaans <[email protected]>

2000-12-12 23:54:43

by NeilBrown

Subject: Re: [BUG] raid5 crash with 2.4.0-test12 [Was: Linux-2.4.0-test12]

On Tuesday December 12, [email protected] wrote:
> On Tue, Dec 12, 2000 at 11:06:07AM -0800, Linus Torvalds wrote:
> >
> > To get better debug output, could you please do something for me?
> >
> > In fs/buffer.c, get rid of "end_buffer_io_bad" completely, and replace all
> > users of it with NULL.
> >
> > Then, in drivers/block/ll_rw_blk.c: generic_make_request(), add a test
> > like
> >
> > if (!bh->b_end_io) BUG();
> >
> > to the top of that function.

Could you add this test to the top of md_make_request as well, because
requests to raid5 don't go through generic_make_request.

>
> Strange thing is that it doesn't call BUG() and the trace seems quite
> identical -- this caused me to start looking at the code in
> drivers/md/raid5.c and it seems this null pointer deref is coming from there
> - Neil, do you have some documentation on how this code should work, as
> stripe_head causes some null-pointer-derefs inside my head..

No, no doco, sorry.
I do have a new version of the code that I haven't been brave enough
to submit during a code freeze (whatever that is)... you could try
the raid5 patch under
http://www.cse.unsw.edu.au/~neilb/patches/linux/2.4.0-test12-pre8

I expect that you will get the same result as I don't (currently)
think the bug is in RAID code, but at least I would get one more
tester for my code....

NeilBrown

2000-12-13 00:18:42

by Linus Torvalds

Subject: Re: [BUG] raid5 crash with 2.4.0-test12 [Was: Linux-2.4.0-test12]



On Wed, 13 Dec 2000, Neil Brown wrote:
>
> Could you add this test to the top of md_make_request as well, because
> requests to raid5 don't go through generic_make_request.

Sure they do. Everything that calls ll_rw_block() or submit_bh() will go
through generic_make_request.

Neil, you're probably thinking about __make_request(), which only triggers
for "normal" devices.
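
The distinction can be sketched in user space as a dispatch through a per-queue function pointer. This is only a model under assumptions: the struct request_queue below, generic_submit(), normal_make_request() and stacked_make_request() are hypothetical names that play the roles of generic_make_request(), __make_request() and md_make_request() respectively.

```c
#include <assert.h>
#include <stddef.h>

struct buffer_head;             /* opaque in this sketch */

/* Hypothetical per-device queue: each device supplies its own handler. */
struct request_queue {
    int (*make_request_fn)(struct request_queue *q, struct buffer_head *bh);
    int hits;                   /* requests seen by this queue's handler */
};

static int generic_calls;       /* requests seen by the common entry point */

/* Handler for an ordinary disk (the role __make_request() plays). */
static int normal_make_request(struct request_queue *q, struct buffer_head *bh)
{
    (void)bh;
    q->hits++;
    return 0;
}

/* Handler for a stacked device such as md (the role md_make_request() plays). */
static int stacked_make_request(struct request_queue *q, struct buffer_head *bh)
{
    (void)bh;
    q->hits++;
    return 0;
}

/* Common entry point (the role generic_make_request() plays). */
static int generic_submit(struct request_queue *q, struct buffer_head *bh)
{
    assert(q != NULL && q->make_request_fn != NULL);
    generic_calls++;                    /* a check here sees every device */
    return q->make_request_fn(q, bh);   /* a check in one handler sees only its own device */
}
```

In this model a request for the stacked device increments generic_calls but never touches the ordinary disk's handler, which is why a check placed only in the ordinary handler would stay silent for raid5.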

The fact that the buffer doesn't go through generic_make_request() implies
that it is some buffer that is completely internal to the raid5
processing. I don't see anything like that, though.

Jasper, sorry about even asking this, but where did you add the check for
b_end_io? Maybe you put it in __make_request() by mistake?

Linus

2000-12-13 00:49:46

by NeilBrown

Subject: Re: [BUG] raid5 crash with 2.4.0-test12 [Was: Linux-2.4.0-test12]

On Tuesday December 12, [email protected] wrote:
>
>
> On Wed, 13 Dec 2000, Neil Brown wrote:
> >
> > Could you add this test to the top of md_make_request as well, because
> > requests to raid5 don't go through generic_make_request.
>
> Sure they do. Everything that calls ll_rw_block() or submit_bh() will go
> through generic_make_request.
>
> Neil, you're probably thinking about __make_request(), which only triggers
> for "normal" devices.

Yes... you are right. Alright, I can't escape it any other way so I
guess I must admit that it is a raid5 bug.

But how can raid5 be calling b_end_io on a buffer_head that was never
passed to generic_make_request?
Answer, it snoops on the buffer cache to try to do complete stripe
writes.
The following patch disables that code.

NeilBrown

--- drivers/md/raid5.c 2000/12/13 00:13:54 1.1
+++ drivers/md/raid5.c 2000/12/13 00:14:07
@@ -1009,6 +1009,7 @@
 	struct buffer_head *bh;
 	int method1 = INT_MAX, method2 = INT_MAX;
 
+#if 0
 	/*
 	 * Attempt to add entries :-)
 	 */
@@ -1039,6 +1040,7 @@
 				atomic_dec(&bh->b_count);
 			}
 		}
+#endif
 	PRINTK("handle_stripe() -- begin writing, stripe %lu\n", sh->sector);
 	/*
 	 * Writing, need to update parity buffer.

2000-12-13 03:39:02

by Linus Torvalds

Subject: Re: [BUG] raid5 crash with 2.4.0-test12 [Was: Linux-2.4.0-test12]



On Wed, 13 Dec 2000, Neil Brown wrote:
>
> Yes... you are right. Alright, I can't escape it any other way so I
> guess I must admit that it is a raid5 bug.
>
> But how can raid5 be calling b_end_io on a buffer_head that was never
> passed to generic_make_request?
> Answer, it snoops on the buffer cache to try to do complete stripe
> writes.

Ahh, yes. It seems to just do a "get_hash_table()", and put that bh into
the queues. Bad.

> The following patch disables that code.

If this fix makes the oops go away, then the proper fix for the problem is
not the #if 0, but to add something like

bh->b_end_io = end_buffer_io_sync;

to just before the "add_stripe_bh(sh, bh, i, WRITE);"

We've already locked the thing, so that should be ok.

I wonder about that "md_test_and_set_bit(BH_Lock ...);" thing there,
though. If the buffer we find was dirty but already locked, we won't be
using that buffer at all (because the md_test_and_set_bit() will fail),
which probably means that the RAID5 checksum won't be right. Hmm..

Why is there a dirty aliased buffer head anyway? That sounds like a
recipe for disaster - maybe we should have synched all the stripe devices
before we set up the raid? Is that a raid5 rebuild issue? What's going on
here?

Linus

2000-12-13 12:23:05

by Jasper Spaans

Subject: Re: [BUG] raid5 crash with 2.4.0-test12 [Was: Linux-2.4.0-test12]

On Tue, Dec 12, 2000 at 07:08:09PM -0800, Linus Torvalds wrote:

> > The following patch disables that code.
>
> If this fix makes the oops go away, then the proper fix for the problem is
> not the #if 0, but to add something like

Well, this fix did make the oops go away, but it also caused another scary
oops -- not sure whether my stack trace is in any way how it should be, but
it oopses in nfsd, and has a stacktrace as follows:

>>EIP; c01e379e <ip_frag_queue+20a/254> <=====
Trace; c01e3b80 <ip_defrag+dc/17c>
Trace; d1121502 <[ip_conntrack]ip_ct_gather_frags+2e/ac>
Trace; c01e65ec <output_maybe_reroute+0/14>
Trace; d1120c49 <[ip_conntrack]ip_conntrack_in+39/2cc>
Trace; c01e65ec <output_maybe_reroute+0/14>
Trace; d11223aa <[ip_conntrack]ip_conntrack_local+5a/60>
Trace; c01e65ec <output_maybe_reroute+0/14>
Trace; c01d4e08 <nf_iterate+34/88>
Trace; c01e65ec <output_maybe_reroute+0/14>
Trace; c01e65ec <output_maybe_reroute+0/14>
Trace; c01d5087 <nf_hook_slow+3f/b8>
Trace; c01e65ec <output_maybe_reroute+0/14>
Trace; d11235e8 <[ip_conntrack]ip_conntrack_local_out_ops+0/18>
Trace; c01e5b8b <ip_build_xmit_slow+3cf/4ac>
Trace; c01e65ec <output_maybe_reroute+0/14>
Trace; c01fb748 <udp_getfrag+0/c4>
Trace; c01e5cb6 <ip_build_xmit+4e/334>
Trace; c01fb748 <udp_getfrag+0/c4>
Trace; ea00000a <END_OF_CODE+18edc8c0/????>
Trace; c01e14fb <ip_route_output_key+113/120>
Trace; c01fbbde <udp_sendmsg+38a/414>
Trace; c01fb748 <udp_getfrag+0/c4>
Trace; ea00000a <END_OF_CODE+18edc8c0/????>
Trace; ea00000a <END_OF_CODE+18edc8c0/????>
Trace; e5cfff3e <END_OF_CODE+14bdc7f4/????>
Trace; ea00000a <END_OF_CODE+18edc8c0/????>
Trace; c0201386 <inet_sendmsg+3e/44>
Trace; c01d3355 <sock_sendmsg+69/88>
Trace; d113e8f3 <END_OF_CODE+1b1a9/????>
Trace; d113ee21 <END_OF_CODE+1b6d7/????>
Trace; d113fd26 <END_OF_CODE+1c5dc/????>
Trace; d1166a00 <END_OF_CODE+432b6/????>
Trace; d113e4e9 <END_OF_CODE+1ad9f/????>
Trace; d1166768 <END_OF_CODE+4301e/????>
Trace; d11573ad <END_OF_CODE+33c63/????>
Trace; c0109a88 <kernel_thread+28/38>

Quite scary.. especially the ea00000a part.

I've disabled nfs, and it seems to work ok right now.

Regards,
--
Jasper Spaans <[email protected]>

2000-12-14 00:06:14

by NeilBrown

Subject: Re: [BUG] raid5 crash with 2.4.0-test12 [Was: Linux-2.4.0-test12]

On Tuesday December 12, [email protected] wrote:
>
>
> On Wed, 13 Dec 2000, Neil Brown wrote:
> >
> > Yes... you are right. Alright, I can't escape it any other way so I
> > guess I must admit that it is a raid5 bug.
> >
> > But how can raid5 be calling b_end_io on a buffer_head that was never
> > passed to generic_make_request?
> > Answer, it snoops on the buffer cache to try to do complete stripe
> > writes.
>
> Ahh, yes. It seems to just do a "get_hash_table()", and put that bh into
> the queues. Bad.
>
> > The following patch disables that code.
>
> If this fix makes the oops go away, then the proper fix for the problem is
> not the #if 0, but to add something like
>
> bh->b_end_io = end_buffer_io_sync;
>
> to just before the "add_stripe_bh(sh, bh, i, WRITE);"
>
> We've already locked the thing, so that should be ok.

Yes, that should work, except that end_buffer_io_sync is static in
ll_rw_blk.c :-)

However I don't think there is a lot of point in maintaining this
piece of code. It was a useful optimisation in 2.2, but it is
substantially less effective in 2.4.

In 2.2 filesystems kept all cached data - file content, metadata, etc,
in the physically addressed buffer cache. As it was physically
addressed, raid5 could go looking for other data in the same stripe as
the stripe that it was writing, and thereby improve performance.
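
To illustrate why gathering the rest of a stripe can pay off, here is a small sketch of the two standard RAID5 parity calculations. It is not the raid5.c code: parity_full_stripe() and parity_rmw() are hypothetical helper names, and the blocks are plain byte arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Full-stripe write: every data block of the stripe is in hand, so the
 * parity is simply the XOR of all of them and nothing has to be read
 * back from the disks first.
 */
static void parity_full_stripe(uint8_t *parity, const uint8_t *const *data,
                               size_t ndata, size_t len)
{
    assert(parity != NULL && data != NULL);
    for (size_t i = 0; i < len; i++) {
        uint8_t p = 0;
        for (size_t d = 0; d < ndata; d++)
            p ^= data[d][i];
        parity[i] = p;
    }
}

/*
 * Partial write (read-modify-write): only one block changes, so the old
 * data and the old parity must be read before the new parity can be
 * computed in place as old_parity ^ old_data ^ new_data.
 */
static void parity_rmw(uint8_t *parity, const uint8_t *old_data,
                       const uint8_t *new_data, size_t len)
{
    assert(parity != NULL && old_data != NULL && new_data != NULL);
    for (size_t i = 0; i < len; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}
```

Both routes give the same parity; the full-stripe route just avoids the extra reads, which is the saving the buffer-cache snooping was after.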

But in 2.4, filesystems (well, ext2 at least) keep only the metadata
in the buffer cache, and if you are using something like LVM or RAID0
on top of the RAID5 array, there won't be anything in the buffer cache
for the raid5 device.

I think I can get similar performance improvements by "plugging" the
raid5 device appropriately, but I haven't quite figured out all the
issues in making that work completely.

>
> I wonder about that "md_test_and_set_bit(BH_Lock ...);" thing there,
> though. If the buffer we find was dirty but already locked, we won't be
> using that buffer at all (because the md_test_and_set_bit() will fail),
> which probably means that the RAID5 checksum won't be right. Hmm..
>
> Why is there a dirty aliased buffer head anyway? That sounds like a
> recipe for disaster - maybe we should have synched all the stripe devices
> before we set up the raid? Is that a raid5 rebuild issue? What's going on
> here?

If we find a dirty, locked buffer, then it means that some other thread
has got through ll_rw_block with that buffer, but is blocked, or is
about to be blocked, in raid5_make_request (calling get_lock_stripe
probably). When the current write phase completes, that dirty block
will come through and cause another write phase on this stripe.
Each time the parity will be correct.
This is completely separate from parity re-syncing. The resync code
doesn't use buffers in the buffer cache at all. It just uses the
buffers in the raid5 stripe cache.
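
The claim-or-skip behaviour of the test-and-set on the lock bit can be modelled in user space with C11 atomics. This is a sketch under assumptions, not the md code: BH_LOCK_BIT, the one-field struct buffer_head, test_and_set_lock() and try_claim() are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define BH_LOCK_BIT 0x1u        /* hypothetical bit value for this sketch */

/* Hypothetical stand-in holding only the state word. */
struct buffer_head {
    atomic_uint b_state;
};

/* Atomically set the lock bit; return nonzero if it was already set. */
static int test_and_set_lock(struct buffer_head *bh)
{
    unsigned int old = atomic_fetch_or(&bh->b_state, BH_LOCK_BIT);
    return (old & BH_LOCK_BIT) != 0;
}

/*
 * Returns 1 if the caller now owns the buffer and may add it to the
 * current write, 0 if some other thread already holds the lock and will
 * bring the buffer through on its own later.
 */
static int try_claim(struct buffer_head *bh)
{
    assert(bh != NULL);
    return !test_and_set_lock(bh);
}
```

A buffer that cannot be claimed is simply left out of the current write phase; as described above, its owner triggers a later write phase on the same stripe, and parity is recomputed then.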

NeilBrown

>
> Linus