2003-01-09 09:29:26

by yuval yeret

[permalink] [raw]
Subject: 2.4.18-14 kernel stuck during ext3 umount with ping still responding

Hi,

I'm running a 2.4.18-14 kernel with a heavy IO profile using ext3 over RAID
0+1 volumes.

From time to time the machine hangs with a black screen while I try to umount
a volume during an IO workload (as part of a failback solution, but after
killing all IO processes). Ping still responds, but everything else is
mostly dead.

I tried using the forcedumount patch to solve this problem - to no avail.
Also tried upgrading the qlogic drivers to the latest drivers from Qlogic.

After one of the occurrences I managed to get some output using the sysrq
keys.

This seems similar to what is described in
http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=77508 but with a
different call trace.

What I have here is what I managed to copy down (for some reason pgup/pgdown
didn't work, so the information is incomplete), together with a manual lookup
of the call trace addresses in /proc/ksyms:

process umount
EIP c01190b8 (set_running_and_schedule)
call trace:
c01144c9 f25f9ec0 IO_APIC_get_PCI_irq_vector
c010a8b0 f25f9ed0 enable_irq
c014200c f25f9ef0 fsync_buffers_list
c0155595 f25f9efc clear_inode
c015553d f25f9f2c invalidate_inodes
c01461d8 f25f9f78 get_super
c014a629 f25f9f94 path_release
c0157c58 f25f9fc0 sys_umount
c0108cab sys_sigaltstack
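
A manual lookup like the one above amounts to finding, for each address, the symbol with the highest start address that does not exceed it. Below is a minimal sketch in C under that assumption; `struct ksym`, `parse_ksym_line` and `nearest_symbol` are hypothetical names used only for illustration, and the only input format assumed is one "address name" pair per line. Because /proc/ksyms lists exported symbols only, the nearest entry may be a neighbour of the function that is actually executing, so names resolved this way are best treated as approximate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One symbol-table entry (hypothetical type for this sketch). */
struct ksym {
    unsigned long addr;   /* start address of the symbol */
    char name[64];        /* symbol name, truncated to 63 characters */
};

/* Parse one "address name" line; returns 1 on success, 0 otherwise. */
static int parse_ksym_line(const char *line, struct ksym *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, "%lx %63s", &out->addr, out->name) == 2;
}

/*
 * Return the entry with the highest start address that is <= addr,
 * or NULL if every entry starts above addr. The table need not be sorted.
 */
static const struct ksym *nearest_symbol(const struct ksym *tab, size_t n,
                                         unsigned long addr)
{
    const struct ksym *best = NULL;

    assert(tab != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (tab[i].addr <= addr &&
            (best == NULL || tab[i].addr > best->addr))
            best = &tab[i];
    }
    return best;
}
```

To use it, fill an array by calling `parse_ksym_line` on each line of the symbol listing, then call `nearest_symbol` once per address in the trace. A NULL result means the address lies below every known symbol.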

Any idea what could cause this?

I'm hoping the ext3fix.patch will solve this problem... am trying that now.


Thanks,
Yuval

P.S. please CC me for questions/replies as I'm not currently subscribed to
the list.

--
Yuval Yeret
Exanet
http://www.exanet.com
Tel. 972-9-9717782
Fax. 972-9-9717778


2003-01-09 09:50:07

by Arjan van de Ven

[permalink] [raw]
Subject: Re: 2.4.18-14 kernel stuck during ext3 umount with ping still responding

On Thu, 2003-01-09 at 10:38, yuval yeret wrote:
> Hi,
>
> I'm running a 2.4.18-14 kernel with a heavy IO profile using ext3 over RAID
> 0+1 volumes.
>
> From time to time I get a black screen stuck machine while trying to umount
> a volume during an IO workload (as part of a failback solution - but after
> killing all IO processes ), with ping still responding, but everything else
> mostly dead.
>

> I'm hoping the ext3fix.patch will solve this problem... am trying that now.

This got fixed in the recent erratum kernel, 2.4.18-19.8.0.



2003-01-09 09:58:22

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.4.18-14 kernel stuck during ext3 umount with ping still responding

yuval yeret wrote:
>
> Hi,
>
> I'm running a 2.4.18-14 kernel with a heavy IO profile using ext3 over RAID
> 0+1 volumes.
>
> From time to time I get a black screen stuck machine while trying to umount
> a volume during an IO workload (as part of a failback solution - but after
> killing all IO processes ), with ping still responding, but everything else
> mostly dead.
>
> I tried using the forcedumount patch to solve this problem - to no avail.
> Also tried upgrading the qlogic drivers to the latest drivers from Qlogic.
>
> After one of the occurences I managed to get some output using the sysrq
> keys.
>
> This seems similar to what is described in
> http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=77508 but with a
> different call trace
>
> What I have here is what I managed to copy down (for some reason pgup/pgdown
> didn't work so not all information is full...) together with a manual lookup
> of the call trace
> from /proc/ksyms :
>
> process umount
> EIP c01190b8 (set_running_and_schedule)
> call trace:
> c01144c9 f25f9ec0 IO_APIC_get_PCI_irq_vector
> c010a8b0 f25f9ed0 enable_irq
> c014200c f25f9ef0 fsync_buffers_list
> c0155595 f25f9efc clear_inode
> c015553d f25f9f2c invalidate_inodes
> c01461d8 f25f9f78 get_super
> c014a629 f25f9f94 path_release
> c0157c58 f25f9fc0 sys_umount
> c0108cab sys_sigaltstack
>
> Any idea what can cause this ?
>

If you have a large amount of data against two or more filesystems,
and you try to unmount one of them the kernel can seize up for a
very long time in the fsync_dev()->sync_buffers() function. Under
these circumstances that function has O(n*n) search complexity
and n is quite large.

However your backtrace shows neither of those functions.

Still, as an experiment it would be interesting to see if the below
patch fixes it up. It converts O(n*n) to O(m), where m > n.



 fs/buffer.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

--- 2420/fs/buffer.c~a	Thu Jan  9 02:03:06 2003
+++ 2420-akpm/fs/buffer.c	Thu Jan  9 02:04:02 2003
@@ -307,11 +307,11 @@ int sync_buffers(kdev_t dev, int wait)
 	 * 2) write out all dirty, unlocked buffers;
 	 * 2) wait for completion by waiting for all buffers to unlock.
 	 */
-	write_unlocked_buffers(dev);
+	write_unlocked_buffers(NODEV);
 	if (wait) {
-		err = wait_for_locked_buffers(dev, BUF_DIRTY, 0);
+		err = wait_for_locked_buffers(NODEV, BUF_DIRTY, 0);
 		write_unlocked_buffers(dev);
-		err |= wait_for_locked_buffers(dev, BUF_LOCKED, 1);
+		err |= wait_for_locked_buffers(NODEV, BUF_LOCKED, 1);
 	}
 	return err;
 }

_
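
One way to picture the O(n*n) behaviour described above is a toy model in which every batch restarts from the head of a dirty list shared by all devices, so buffers belonging to other filesystems are skipped again on each pass. The sketch below is only such a model, not the kernel's code: `struct buf`, `flush_one_batch`, `flush_all`, `build_list`, `ANY_DEV` (standing in for NODEV) and the batch size are all assumptions made for illustration. It counts how many list entries are examined with and without the per-device filter.

```c
#include <assert.h>
#include <stddef.h>

#define ANY_DEV (-1)    /* stands in for NODEV: match every device */

/* A dirty buffer in this toy model (hypothetical type). */
struct buf {
    int dev;            /* owning device */
    struct buf *next;   /* next entry on the shared dirty list */
};

/*
 * One pass: walk from the list head, unlink up to `batch` buffers that
 * match `dev`, then stop. Returns the number unlinked and adds the number
 * of entries examined to *visited.
 */
static size_t flush_one_batch(struct buf **head, int dev, size_t batch,
                              unsigned long *visited)
{
    size_t taken = 0;
    struct buf **link = head;

    while (*link != NULL && taken < batch) {
        struct buf *b = *link;

        (*visited)++;
        if (dev != ANY_DEV && b->dev != dev) {
            link = &b->next;    /* skip another device's buffer */
            continue;
        }
        *link = b->next;        /* "write it out": drop it from the list */
        taken++;
    }
    return taken;
}

/* Repeat batches until nothing matches; return total entries examined. */
static unsigned long flush_all(struct buf **head, int dev, size_t batch)
{
    unsigned long visited = 0;

    while (flush_one_batch(head, dev, batch, &visited) > 0)
        ;
    return visited;
}

/* Link `others` buffers for device 1 ahead of `mine` buffers for device 2. */
static struct buf *build_list(struct buf *pool, size_t others, size_t mine)
{
    size_t total = others + mine;

    assert(pool != NULL && total > 0);
    for (size_t i = 0; i < total; i++) {
        pool[i].dev = (i < others) ? 1 : 2;
        pool[i].next = (i + 1 < total) ? &pool[i + 1] : NULL;
    }
    return &pool[0];
}
```

In this model, flushing only device 2 re-walks all of device 1's buffers for every batch, so the work grows roughly with the product of the two counts divided by the batch size. Flushing with `ANY_DEV` examines each entry once, which matches the O(m) description, where m counts the buffers of every device.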

2003-01-15 19:04:56

by yuval yeret

[permalink] [raw]
Subject: Re: 2.4.18-14 kernel stuck during ext3 umount with ping still responding

Hi,

The problem reproduces on a 2.4.18-19 kernel as well. It took some more time,
but it finally reared its ugly head.

This is the stack trace from the new kernel:

>c01190b8 f3791eb4 set_running_and_schedule
>c010a8b0 f3791ed0 enable_irq
>c014200c f3791f0c IO_APIC_get_PCI_irq_vector
>c0155595 f3791f0c clear_inode
>c01556639 f3791f60 invalidate_inodes
>c0149629 f3791f8 set_binfmt
>c0157c58 f3791f94 sys_umount
>c0108cab 0f3791fc0 sys_sigaltstack


Andrew Morton suggested a buffer.c patch for reducing search complexity,
which I will try next.

Any further comments/suggestions are welcome

Thanks,
Yuval



2003-01-22 17:53:03

by yuval yeret

[permalink] [raw]
Subject: Re: 2.4.18-14 kernel stuck during ext3 umount with ping still responding

Well, it took some time to reproduce the environment with a kernel that
includes the patch, but it seems the patch didn't help after all.

The ctrl-scroll-lock output shows a stuck umount in R state, with a call
trace similar to the earlier one, although the innermost calls differ slightly:

c0141295 __wait_on_buffer
c014200c fsync_buffers_list
c0155595 f25f9efc clear_inode
c015553d f25f9f2c invalidate_inodes
c01461d8 f25f9f78 get_super
c014a629 f25f9f94 path_release
c0157c58 f25f9fc0 sys_umount
c0108cab sys_sigaltstack

This is using a 2.4.18-19.7 kernel patched with the fs/buffer.c change that
Andrew Morton suggested earlier in this thread.

Any pointers/suggestions are welcome.

Thanks,
Yuval

