Looks like the same problem with spinlock unfairness we've seen
elsewhere: it seems to be looping here? Or is everyone stuck
just waiting for writeout?
lock_timer_base():
for (;;) {
tvec_base_t *prelock_base = timer->base;
base = tbase_get_base(prelock_base);
if (likely(base != NULL)) {
spin_lock_irqsave(&base->lock, *flags);
if (likely(prelock_base == timer->base))
return base;
/* The timer has migrated to another CPU */
spin_unlock_irqrestore(&base->lock, *flags);
}
cpu_relax();
}
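For reference, tbase_get_base() here just strips a flag kept in the low
bit of timer->base, so a NULL result can mean the timer is in the middle
of migrating to another CPU. A minimal sketch, assuming the 2.6.22-era
kernel/timer.c (treat the names and the flag value as assumptions):

#define TBASE_DEFERRABLE_FLAG	(0x1)

/* Mask off the deferrable-flag bit so the result is a usable pointer
 * (or NULL while the timer is between bases). */
static inline tvec_base_t *tbase_get_base(tvec_base_t *base)
{
	return (tvec_base_t *)((unsigned long)base & ~TBASE_DEFERRABLE_FLAG);
}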
The problem goes away completely if the filesystems are mounted
*without* noatime. Has happened in 2.6.20 through 2.6.22...
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=249563
Part of sysrq-t listing:
mysqld D 000017c0 2196 23162 1562
e383fcb8 00000082 61650954 000017c0 e383fc9c 00000000 c0407208 e383f000
a12b0434 00004d1d c6ed2c00 c6ed2d9c c200fa80 00000000 c0724640 f6c60540
c4ff3c70 00000508 00000286 c042ffcb e383fcc8 00014926 00000000 00000286
Call Trace:
[<c0407208>] do_IRQ+0xbd/0xd1
[<c042ffcb>] lock_timer_base+0x19/0x35
[<c04300df>] __mod_timer+0x9a/0xa4
[<c060bb55>] schedule_timeout+0x70/0x8f
[<c042fd37>] process_timeout+0x0/0x5
[<c060bb50>] schedule_timeout+0x6b/0x8f
[<c060b67c>] io_schedule_timeout+0x39/0x5d
[<c0465eea>] congestion_wait+0x50/0x64
[<c0438539>] autoremove_wake_function+0x0/0x35
[<c04620e2>] balance_dirty_pages_ratelimited_nr+0x148/0x193
[<c045e7fd>] generic_file_buffered_write+0x4c7/0x5d3
named D 000017c0 2024 1454 1
f722acb0 00000082 6165ed96 000017c0 c1523e80 c16f0c00 c16f20e0 f722a000
a12be87d 00004d1d f768ac00 f768ad9c c200fa80 00000000 00000000 f75bda80
c0407208 00000508 00000286 c042ffcb f722acc0 00020207 00000000 00000286
Call Trace:
[<c0407208>] do_IRQ+0xbd/0xd1
[<c042ffcb>] lock_timer_base+0x19/0x35
[<c04300df>] __mod_timer+0x9a/0xa4
[<c060bb55>] schedule_timeout+0x70/0x8f
[<c042fd37>] process_timeout+0x0/0x5
[<c060bb50>] schedule_timeout+0x6b/0x8f
[<c060b67c>] io_schedule_timeout+0x39/0x5d
[<c0465eea>] congestion_wait+0x50/0x64
[<c0438539>] autoremove_wake_function+0x0/0x35
[<c04620e2>] balance_dirty_pages_ratelimited_nr+0x148/0x193
[<c045e7fd>] generic_file_buffered_write+0x4c7/0x5d3
mysqld D 000017c0 2196 23456 1562
e9293cb8 00000082 616692ed 000017c0 e9293c9c 00000000 e9293cc8 e9293000
a12c8dd0 00004d1d c3d5ac00 c3d5ad9c c200fa80 00000000 c0724640 f6c60540
e9293d10 c07e1f00 00000286 c042ffcb e9293cc8 0002b57f 00000000 00000286
Call Trace:
[<c042ffcb>] lock_timer_base+0x19/0x35
[<c04300df>] __mod_timer+0x9a/0xa4
[<c060bb55>] schedule_timeout+0x70/0x8f
[<c042fd37>] process_timeout+0x0/0x5
[<c060bb50>] schedule_timeout+0x6b/0x8f
[<c060b67c>] io_schedule_timeout+0x39/0x5d
[<c0465eea>] congestion_wait+0x50/0x64
[<c0438539>] autoremove_wake_function+0x0/0x35
[<c04620e2>] balance_dirty_pages_ratelimited_nr+0x148/0x193
[<c045e7fd>] generic_file_buffered_write+0x4c7/0x5d3
On Wed, 01 Aug 2007 18:39:51 -0400, Chuck Ebbert wrote:
> Looks like the same problem with spinlock unfairness we've seen
> elsewhere: it seems to be looping here? Or is everyone stuck just
> waiting for writeout?
>
> lock_timer_base():
> for (;;) {
> tvec_base_t *prelock_base = timer->base;
> base = tbase_get_base(prelock_base);
> if (likely(base != NULL)) {
> spin_lock_irqsave(&base->lock, *flags);
> if (likely(prelock_base == timer->base))
> return base;
> /* The timer has migrated to another CPU */
> spin_unlock_irqrestore(&base->lock, *flags);
> }
> cpu_relax();
> }
>
> The problem goes away completely if filesystem are mounted *without*
> noatime. Has happened in 2.6.20 through 2.6.22...
>
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=249563
>
> Part of sysrq-t listing:
>
> mysqld D 000017c0 2196 23162 1562
> e383fcb8 00000082 61650954 000017c0 e383fc9c 00000000 c0407208 e383f000
> a12b0434 00004d1d c6ed2c00 c6ed2d9c c200fa80 00000000 c0724640 f6c60540
> c4ff3c70 00000508 00000286 c042ffcb e383fcc8 00014926 00000000 00000286
> Call Trace:
> [<c0407208>] do_IRQ+0xbd/0xd1
> [<c042ffcb>] lock_timer_base+0x19/0x35
> [<c04300df>] __mod_timer+0x9a/0xa4
> [<c060bb55>] schedule_timeout+0x70/0x8f
> [<c042fd37>] process_timeout+0x0/0x5
> [<c060bb50>] schedule_timeout+0x6b/0x8f
> [<c060b67c>] io_schedule_timeout+0x39/0x5d
> [<c0465eea>] congestion_wait+0x50/0x64
> [<c0438539>] autoremove_wake_function+0x0/0x35
> [<c04620e2>] balance_dirty_pages_ratelimited_nr+0x148/0x193
> [<c045e7fd>] generic_file_buffered_write+0x4c7/0x5d3
>
>
> named D 000017c0 2024 1454 1
> f722acb0 00000082 6165ed96 000017c0 c1523e80 c16f0c00 c16f20e0 f722a000
> a12be87d 00004d1d f768ac00 f768ad9c c200fa80 00000000 00000000 f75bda80
> c0407208 00000508 00000286 c042ffcb f722acc0 00020207 00000000 00000286
> Call Trace:
> [<c0407208>] do_IRQ+0xbd/0xd1
> [<c042ffcb>] lock_timer_base+0x19/0x35
> [<c04300df>] __mod_timer+0x9a/0xa4
> [<c060bb55>] schedule_timeout+0x70/0x8f
> [<c042fd37>] process_timeout+0x0/0x5
> [<c060bb50>] schedule_timeout+0x6b/0x8f
> [<c060b67c>] io_schedule_timeout+0x39/0x5d
> [<c0465eea>] congestion_wait+0x50/0x64
> [<c0438539>] autoremove_wake_function+0x0/0x35
> [<c04620e2>] balance_dirty_pages_ratelimited_nr+0x148/0x193
> [<c045e7fd>] generic_file_buffered_write+0x4c7/0x5d3
>
>
> mysqld D 000017c0 2196 23456 1562
> e9293cb8 00000082 616692ed 000017c0 e9293c9c 00000000 e9293cc8 e9293000
> a12c8dd0 00004d1d c3d5ac00 c3d5ad9c c200fa80 00000000 c0724640 f6c60540
> e9293d10 c07e1f00 00000286 c042ffcb e9293cc8 0002b57f 00000000 00000286
> Call Trace:
> [<c042ffcb>] lock_timer_base+0x19/0x35
> [<c04300df>] __mod_timer+0x9a/0xa4
> [<c060bb55>] schedule_timeout+0x70/0x8f
> [<c042fd37>] process_timeout+0x0/0x5
> [<c060bb50>] schedule_timeout+0x6b/0x8f
> [<c060b67c>] io_schedule_timeout+0x39/0x5d
> [<c0465eea>] congestion_wait+0x50/0x64
> [<c0438539>] autoremove_wake_function+0x0/0x35
> [<c04620e2>] balance_dirty_pages_ratelimited_nr+0x148/0x193
> [<c045e7fd>] generic_file_buffered_write+0x4c7/0x5d3
I'm not sure if this is related to your problem, but I posted a patch the
other day that fixes an issue in balance_dirty_pages.
If a drive is under light load it can get stuck looping, waiting for
enough pages to be available to be written. In my test case this occurs
when one drive is under heavy load and the other isn't.
I sent it with the subject "[PATCH] balance_dirty_pages - exit loop when
no more pages available".
Here's the patch again in case it helps.
Cheers
Richard
------
--- linux-2.6.22.1/mm/page-writeback.c.orig 2007-07-30 16:36:09.000000000 +0100
+++ linux-2.6.22.1/mm/page-writeback.c 2007-07-31 16:26:43.000000000 +0100
@@ -250,6 +250,8 @@ static void balance_dirty_pages(struct a
pages_written += write_chunk - wbc.nr_to_write;
if (pages_written >= write_chunk)
break; /* We've done our duty */
+ if (!wbc.encountered_congestion && wbc.nr_to_write > 0)
+ break; /* didn't find enough to do */
}
congestion_wait(WRITE, HZ/10);
}
(attempting to cc Matthias. If I have the wrong one, please fix it up)
(please generally cc reporters when forwarding their bug reports)
On Wed, 01 Aug 2007 18:39:51 -0400
Chuck Ebbert <[email protected]> wrote:
> Looks like the same problem with spinlock unfairness we've seen
> elsewhere: it seems to be looping here? Or is everyone stuck
> just waiting for writeout?
>
> lock_timer_base():
> for (;;) {
> tvec_base_t *prelock_base = timer->base;
> base = tbase_get_base(prelock_base);
> if (likely(base != NULL)) {
> spin_lock_irqsave(&base->lock, *flags);
> if (likely(prelock_base == timer->base))
> return base;
> /* The timer has migrated to another CPU */
> spin_unlock_irqrestore(&base->lock, *flags);
> }
> cpu_relax();
> }
>
> The problem goes away completely if filesystem are mounted
> *without* noatime. Has happened in 2.6.20 through 2.6.22...
>
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=249563
>
> Part of sysrq-t listing:
>
> mysqld D 000017c0 2196 23162 1562
> e383fcb8 00000082 61650954 000017c0 e383fc9c 00000000 c0407208 e383f000
> a12b0434 00004d1d c6ed2c00 c6ed2d9c c200fa80 00000000 c0724640 f6c60540
> c4ff3c70 00000508 00000286 c042ffcb e383fcc8 00014926 00000000 00000286
> Call Trace:
> [<c0407208>] do_IRQ+0xbd/0xd1
> [<c042ffcb>] lock_timer_base+0x19/0x35
> [<c04300df>] __mod_timer+0x9a/0xa4
> [<c060bb55>] schedule_timeout+0x70/0x8f
> [<c042fd37>] process_timeout+0x0/0x5
> [<c060bb50>] schedule_timeout+0x6b/0x8f
> [<c060b67c>] io_schedule_timeout+0x39/0x5d
> [<c0465eea>] congestion_wait+0x50/0x64
> [<c0438539>] autoremove_wake_function+0x0/0x35
> [<c04620e2>] balance_dirty_pages_ratelimited_nr+0x148/0x193
> [<c045e7fd>] generic_file_buffered_write+0x4c7/0x5d3
I expect the lock_timer_base() here is just stack gunk. Matthias's trace
also includes:
mysqld S 000017c0 2524 1623 1562
f6ce3b44 00000082 60ca34b2 000017c0 f6ce3b28 00000000 f6ce3b54 f6ce3000
57c63d9c 00004d1d f6c90000 f6c9019c c200fa80 00000000 c0724640 f6c60540
000007d0 c07e1f00 00000286 c042ffcb f6ce3b54 000290ef 00000000 00000286
Call Trace:
[<c042ffcb>] lock_timer_base+0x19/0x35
[<c04300df>] __mod_timer+0x9a/0xa4
[<c060bb55>] schedule_timeout+0x70/0x8f
[<c042fd37>] process_timeout+0x0/0x5
[<c060bb50>] schedule_timeout+0x6b/0x8f
[<c04857c7>] do_select+0x36d/0x3c4
[<c0485da0>] __pollwait+0x0/0xac
[<c04e934d>] __next_cpu+0x12/0x1e
[<c0420130>] find_busiest_group+0x1c4/0x553
[<c0420ada>] update_curr+0x23b/0x25c
[<c04eb700>] rb_insert_color+0x8c/0xad
[<c042128f>] enqueue_entity+0x276/0x294
and it appears that schedule_timeout() always leaves a copy of
lock_timer_base+0x19 on the stack. Enabling CONFIG_FRAME_POINTER might
help sort that out.
I think. Or perhaps lock_timer_base() really has gone and got stuck. One
possibility is that gcc has decided to cache timer->base in a register
rather than rereading it around that loop, which would be bad. Do:
gdb vmlinux
(gdb) x/100i lock_timer_base
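To illustrate the register-caching concern (a hypothetical sketch using
the tvec_base_t naming from the snippet at the top of this thread, not a
proposed fix): if the compiler cached timer->base, the recheck after
taking the lock would compare against a stale value and migration could
never be detected. One way to make the re-read explicit would be a
volatile-qualified access, roughly:

	for (;;) {
		/* The volatile cast forces a fresh load of timer->base on
		 * every iteration instead of letting the compiler keep it
		 * in a register (illustrative only). */
		tvec_base_t *prelock_base =
			*(tvec_base_t * volatile *)&timer->base;
		tvec_base_t *base = tbase_get_base(prelock_base);

		if (base != NULL) {
			spin_lock_irqsave(&base->lock, *flags);
			if (prelock_base ==
			    *(tvec_base_t * volatile *)&timer->base)
				return base;
			spin_unlock_irqrestore(&base->lock, *flags);
		}
		cpu_relax();
	}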
Is the machine really completely dead? Or are some tasks running? If the
latter, it might be dirty-memory windup - perhaps some device driver has
died and we're not getting writes out to disk.
Are all the CPUs running flat-out? If so, yup, maybe it's
lock_timer_base(). Hit sysrq-P ten times, see where things are stuck.
Please leave `vmstat 1' running in an ssh session next time; let's see the
output just prior to the hang.
And do this:
while true
do
echo
cat /proc/meminfo
sleep 1
done
in another ssh session so we can see what the memory looked like when it
died too.
Chuck Ebbert wrote:
>
> Looks like the same problem with spinlock unfairness we've seen
> elsewhere: it seems to be looping here? Or is everyone stuck
> just waiting for writeout?
>
> lock_timer_base():
> for (;;) {
> tvec_base_t *prelock_base = timer->base;
> base = tbase_get_base(prelock_base);
> if (likely(base != NULL)) {
> spin_lock_irqsave(&base->lock, *flags);
> if (likely(prelock_base == timer->base))
> return base;
> /* The timer has migrated to another CPU */
> spin_unlock_irqrestore(&base->lock, *flags);
> }
> cpu_relax();
> }
I don't think there is an unfairness problem. We are looping only
if timer->base changes in between. IOW, there is no "lock + unlock
+ lock(same_lock)" here, we take another lock on each iteration.
And:
> [<c0407208>] do_IRQ+0xbd/0xd1
> [<c042ffcb>] lock_timer_base+0x19/0x35
> [<c04300df>] __mod_timer+0x9a/0xa4
> [<c060bb55>] schedule_timeout+0x70/0x8f
>
> ...
>
> [<c0407208>] do_IRQ+0xbd/0xd1
> [<c042ffcb>] lock_timer_base+0x19/0x35
> [<c04300df>] __mod_timer+0x9a/0xa4
> [<c060bb55>] schedule_timeout+0x70/0x8f
>
> ...
>
> [<c042ffcb>] lock_timer_base+0x19/0x35
> [<c04300df>] __mod_timer+0x9a/0xa4
> [<c060bb55>] schedule_timeout+0x70/0x8f
All traces start from schedule_timeout()->mod_timer(). This timer
is local, nobody else can see it, so its ->base can't be NULL or change.
This means that lock_timer_base() can't be looping; it could only be
stuck trying to take tvec_t_base_s.lock. But in that case the trace
should look different, which is strange.
Oleg.
On Fri, Aug 03, 2007 at 11:34:07AM -0700, Andrew Morton wrote:
> (attempting to cc Matthias. If I have the wrong one, please fix it up)
You got the correct one.
> > Looks like the same problem with spinlock unfairness we've seen
> > elsewhere: it seems to be looping here? Or is everyone stuck
> > just waiting for writeout?
> [...]
> I think. Or perhaps lock_timer_base() really has gone and got stuck. One
> possibility is that gcc has decided to cache timer->base in a register
> rather than rereading it around that loop, which would be bad. Do:
>
> gdb vmlinux
> (gdb) x/100i lock_timer_base
This is from an affected kernel, but not the kernel matching my
stack trace. Hope it is useful anyway. If not, maybe Chuck can provide
the vmlinux file for the 2.6.22.1-27.fc7 kernel (the debuginfo seems to
have been deleted from all mirrors).
(gdb) x/100i lock_timer_base
0xc102ffda <lock_timer_base>: push %ebp
0xc102ffdb <lock_timer_base+1>: mov %edx,%ebp
0xc102ffdd <lock_timer_base+3>: push %edi
0xc102ffde <lock_timer_base+4>: mov %eax,%edi
0xc102ffe0 <lock_timer_base+6>: push %esi
0xc102ffe1 <lock_timer_base+7>: push %ebx
0xc102ffe2 <lock_timer_base+8>: mov 0x14(%edi),%ebx
0xc102ffe5 <lock_timer_base+11>: mov %ebx,%esi
0xc102ffe7 <lock_timer_base+13>: and $0xfffffffe,%esi
0xc102ffea <lock_timer_base+16>: je 0xc1030004 <lock_timer_base+42>
0xc102ffec <lock_timer_base+18>: mov %esi,%eax
0xc102ffee <lock_timer_base+20>: call 0xc122ff35 <_spin_lock_irqsave>
0xc102fff3 <lock_timer_base+25>: mov %eax,0x0(%ebp)
0xc102fff6 <lock_timer_base+28>: cmp 0x14(%edi),%ebx
0xc102fff9 <lock_timer_base+31>: je 0xc1030008 <lock_timer_base+46>
0xc102fffb <lock_timer_base+33>: mov %eax,%edx
0xc102fffd <lock_timer_base+35>: mov %esi,%eax
0xc102ffff <lock_timer_base+37>: call 0xc122ff95 <_spin_unlock_irqrestore>
0xc1030004 <lock_timer_base+42>: pause
0xc1030006 <lock_timer_base+44>: jmp 0xc102ffe2 <lock_timer_base+8>
0xc1030008 <lock_timer_base+46>: mov %esi,%eax
0xc103000a <lock_timer_base+48>: pop %ebx
0xc103000b <lock_timer_base+49>: pop %esi
0xc103000c <lock_timer_base+50>: pop %edi
0xc103000d <lock_timer_base+51>: pop %ebp
0xc103000e <lock_timer_base+52>: ret
0xc103000f <try_to_del_timer_sync>: push %esi
0xc1030010 <try_to_del_timer_sync+1>: or $0xffffffff,%esi
0xc1030013 <try_to_del_timer_sync+4>: push %ebx
0xc1030014 <try_to_del_timer_sync+5>: mov %eax,%ebx
0xc1030016 <try_to_del_timer_sync+7>: sub $0x4,%esp
0xc1030019 <try_to_del_timer_sync+10>: mov %esp,%edx
0xc103001b <try_to_del_timer_sync+12>: call 0xc102ffda <lock_timer_base>
0xc1030020 <try_to_del_timer_sync+17>: cmp %ebx,0x4(%eax)
0xc1030023 <try_to_del_timer_sync+20>: mov %eax,%ecx
0xc1030025 <try_to_del_timer_sync+22>: je 0xc1030049 <try_to_del_timer_sync+58>
0xc1030027 <try_to_del_timer_sync+24>: xor %esi,%esi
0xc1030029 <try_to_del_timer_sync+26>: cmpl $0x0,(%ebx)
0xc103002c <try_to_del_timer_sync+29>: je 0xc1030049 <try_to_del_timer_sync+58>
0xc103002e <try_to_del_timer_sync+31>: mov (%ebx),%edx
0xc1030030 <try_to_del_timer_sync+33>: mov $0x1,%si
0xc1030034 <try_to_del_timer_sync+37>: mov 0x4(%ebx),%eax
0xc1030037 <try_to_del_timer_sync+40>: mov %eax,0x4(%edx)
0xc103003a <try_to_del_timer_sync+43>: mov %edx,(%eax)
0xc103003c <try_to_del_timer_sync+45>: movl $0x200200,0x4(%ebx)
0xc1030043 <try_to_del_timer_sync+52>: movl $0x0,(%ebx)
0xc1030049 <try_to_del_timer_sync+58>: mov (%esp),%edx
0xc103004c <try_to_del_timer_sync+61>: mov %ecx,%eax
0xc103004e <try_to_del_timer_sync+63>: call 0xc122ff95 <_spin_unlock_irqrestore>
0xc1030053 <try_to_del_timer_sync+68>: mov %esi,%eax
0xc1030055 <try_to_del_timer_sync+70>: pop %ebx
---Type <return> to continue, or q <return> to quit---
0xc1030056 <try_to_del_timer_sync+71>: pop %ebx
0xc1030057 <try_to_del_timer_sync+72>: pop %esi
0xc1030058 <try_to_del_timer_sync+73>: ret
0xc1030059 <del_timer_sync>: push %ebx
0xc103005a <del_timer_sync+1>: mov %eax,%ebx
0xc103005c <del_timer_sync+3>: mov %ebx,%eax
0xc103005e <del_timer_sync+5>: call 0xc103000f <try_to_del_timer_sync>
0xc1030063 <del_timer_sync+10>: test %eax,%eax
0xc1030065 <del_timer_sync+12>: jns 0xc103006b <del_timer_sync+18>
0xc1030067 <del_timer_sync+14>: pause
0xc1030069 <del_timer_sync+16>: jmp 0xc103005c <del_timer_sync+3>
0xc103006b <del_timer_sync+18>: pop %ebx
0xc103006c <del_timer_sync+19>: ret
0xc103006d <__mod_timer>: push %ebp
0xc103006e <__mod_timer+1>: mov %edx,%ebp
0xc1030070 <__mod_timer+3>: push %edi
0xc1030071 <__mod_timer+4>: push %esi
0xc1030072 <__mod_timer+5>: mov %eax,%esi
0xc1030074 <__mod_timer+7>: push %ebx
0xc1030075 <__mod_timer+8>: sub $0x8,%esp
0xc1030078 <__mod_timer+11>: mov 0x18(%esp),%edx
0xc103007c <__mod_timer+15>: call 0xc102f693 <__timer_stats_timer_set_start_info>
0xc1030081 <__mod_timer+20>: cmpl $0x0,0xc(%esi)
0xc1030085 <__mod_timer+24>: jne 0xc103008b <__mod_timer+30>
0xc1030087 <__mod_timer+26>: ud2a
0xc1030089 <__mod_timer+28>: jmp 0xc1030089 <__mod_timer+28>
0xc103008b <__mod_timer+30>: lea 0x4(%esp),%edx
0xc103008f <__mod_timer+34>: mov %esi,%eax
0xc1030091 <__mod_timer+36>: call 0xc102ffda <lock_timer_base>
0xc1030096 <__mod_timer+41>: movl $0x0,(%esp)
0xc103009d <__mod_timer+48>: mov %eax,%ebx
0xc103009f <__mod_timer+50>: cmpl $0x0,(%esi)
0xc10300a2 <__mod_timer+53>: je 0xc10300bc <__mod_timer+79>
0xc10300a4 <__mod_timer+55>: mov 0x4(%esi),%eax
0xc10300a7 <__mod_timer+58>: mov (%esi),%edx
0xc10300a9 <__mod_timer+60>: mov %eax,0x4(%edx)
0xc10300ac <__mod_timer+63>: mov %edx,(%eax)
0xc10300ae <__mod_timer+65>: movl $0x200200,0x4(%esi)
0xc10300b5 <__mod_timer+72>: movl $0x1,(%esp)
0xc10300bc <__mod_timer+79>: mov %fs:0xc13a6104,%edx
0xc10300c3 <__mod_timer+86>: mov $0xc13a75dc,%eax
0xc10300c8 <__mod_timer+91>: mov (%edx,%eax,1),%edi
0xc10300cb <__mod_timer+94>: cmp %edi,%ebx
0xc10300cd <__mod_timer+96>: je 0xc10300f0 <__mod_timer+131>
0xc10300cf <__mod_timer+98>: cmp %esi,0x4(%ebx)
0xc10300d2 <__mod_timer+101>: je 0xc10300f0 <__mod_timer+131>
0xc10300d4 <__mod_timer+103>: andl $0x1,0x14(%esi)
0xc10300d8 <__mod_timer+107>: mov $0x1,%al
0xc10300da <__mod_timer+109>: xchg %al,(%ebx)
(gdb)
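Mapping that listing back onto the C loop (offsets from the disassembly
above; reading 0x14(%edi) as timer->base is an inference from the
source), it looks like gcc does reload timer->base on every pass, so the
register-caching theory doesn't apply to this build:

/* +8   mov  0x14(%edi),%ebx          prelock_base = timer->base (reloaded each pass)
 * +13  and  $0xfffffffe,%esi         base = tbase_get_base(prelock_base)
 * +16  je   <+42>                    if (base == NULL) goto relax
 * +20  call _spin_lock_irqsave       spin_lock_irqsave(&base->lock, *flags)
 * +28  cmp  0x14(%edi),%ebx          re-reads timer->base for the recheck
 * +31  je   <+46>                    if unchanged: return base
 * +37  call _spin_unlock_irqrestore  timer migrated: drop the lock
 * +42  pause                         cpu_relax()
 * +44  jmp  <+8>                     and start over
 */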
> Is the machine really completely dead?
No.
> Or are some tasks running?
Already-open SSH sessions are fine. I am able to perform certain tasks,
including killing processes, which will eventually resolve the issue.
> If the latter, it might be dirty-memory windup - perhaps some device
> driver has died and we're not getting writes out to disk.
I have two affected systems with similar setup but different hardware.
> Are all the CPUs running flat-out?
Both systems are single core.
> If so, yup, maybe it's lock_timer_base(). Hit sysrq-P ten times, see
> where things are stuck.
I can do that, no problem.
> Please leave `vmstat 1' running in an ssh seesion next time, let's see the
> output just prior to the hang.
Will do.
> And do this:
>
> while true
> do
> echo
> cat /proc/meminfo
> sleep 1
> done
>
> in another ssh session so we can see what the memory looked like when
> it died too.
OK.
I am also willing to try the patch posted by Richard.
In fact both systems have two hard drives mirroring their data with rsync
every 4 hours, and most of the time the system gets stuck in such an rsync
run -- always when mirroring /var/spool/imap, which contains around 500,000
files.
As written, if the filesystems are mounted without noatime the problem is
gone.
Regards,
Matthias
On Sat, Aug 04, 2007 at 10:44:26AM +0200, Matthias Hensler wrote:
> On Fri, Aug 03, 2007 at 11:34:07AM -0700, Andrew Morton wrote:
> [...]
> I am also willing to try the patch posted by Richard.
I want to give an update here:
1. We finally hit the problem on a third system, with a totally different
   setup and hardware. However, again high I/O load caused the problem,
   and the affected filesystems were mounted with noatime.
2. I installed a recompiled kernel with just the two-line patch from
   Richard Kennedy (http://lkml.org/lkml/2007/8/2/89). That system has 5
   days of uptime now and counting. I believe the patch fixed the problem.
However, I will continue running "vmstat 1" and the endless loop of
"cat /proc/meminfo", just in case I am wrong.
Regards,
Matthias
On Thu, 9 Aug 2007 11:59:43 +0200 Matthias Hensler <[email protected]> wrote:
> On Sat, Aug 04, 2007 at 10:44:26AM +0200, Matthias Hensler wrote:
> > On Fri, Aug 03, 2007 at 11:34:07AM -0700, Andrew Morton wrote:
> > [...]
> > I am also willing to try the patch posted by Richard.
>
> I want to give some update here:
>
> 1. We finally hit the problem on a third system, with a total different
> setup and hardware. However, again high I/O load caused the problem
> and the affected filesystems were mounted with noatime.
>
> 2. I installed a recompiled kernel with just the two line patch from
> Richard Kennedy (http://lkml.org/lkml/2007/8/2/89). That system has 5
> days uptime now and counting. I believe the patch fixed the problem.
> However, I will continue running "vmstat 1" and the endless loop of
> "cat /proc/meminfo", just in case I am wrong.
>
Did we ever see the /proc/meminfo and /proc/vmstat output during the stall?
If Richard's patch has indeed fixed it then this confirms that we're seeing
contention over the dirty-memory limits. Richard's patch isn't really the
right one because it allows unlimited dirty-memory windup in some situations
(large number of disks with small writes, or when we perform queue congestion
avoidance).
As you're seeing this happening when multiple disks are being written to it is
possible that the per-device-dirty-threshold patches which recently went into
-mm (and which appear to have a bug) will fix it.
But I worry that the stall appears to persist *forever*. That would indicate
that we have a dirty-memory accounting leak, or that for some reason the
system has decided to stop doing writeback to one or more queues (might be
caused by an error in a lower-level driver's queue congestion state management).
If it is the latter, then it could be that running "sync" will clear the
problem. Temporarily, at least. Because sync will ignore the queue congestion
state.
On Thu, Aug 09, 2007 at 09:55:34AM -0700, Andrew Morton wrote:
> On Thu, 9 Aug 2007 11:59:43 +0200 Matthias Hensler <[email protected]> wrote:
> > On Sat, Aug 04, 2007 at 10:44:26AM +0200, Matthias Hensler wrote:
> > > On Fri, Aug 03, 2007 at 11:34:07AM -0700, Andrew Morton wrote:
> > > [...]
> > > I am also willing to try the patch posted by Richard.
> >
> > I want to give some update here:
>
> Did we ever see the /proc/meminfo and /proc/vmstat output during the stall?
So far not, sorry. The problem has not recurred since I have been running
the kernel with that patch.
> As you're seeing this happening when multiple disks are being written
> to it is possible that the per-device-dirty-threshold patches which
> recently went into -mm (and which appear to have a bug) will fix it.
All affected systems have two devices which are cross-mounted, e.g.
/home on disk 1 and /var/spool/imap on disk 2.
On the respective other device there is the mirror partition (mirror of
/home on disk 2 and mirror of /var/spool/imap on disk 1).
Normally the system hangs when running rsync from /var/spool/imap to its
corresponding mirror partition.
We have around 10 different partitions (all in LVM), but so far the hang
mostly started while running rsync on the IMAP spool (which has by far
the most files on the system).
_However_: we also had this hang while running a backup to an external
FTP server (so only reading the filesystems, apart from the usual system
activity).
And: the third system we had this problem on has no IMAP spool at all.
The hang occurred while running a backup together with "yum update".
It might be related that all the systems use these crossmounts over two
different disks. I was not able to reproduce the problem on my home system
with only one disk, either because the activity pattern was different,
or because it really needs two disks to run into that issue.
> But I worry that the stall appears to persist *forever*. That would indicate
> that we have a dirty-memory accounting leak, or that for some reason the
> system has decided to stop doing writeback to one or more queues (might be
> caused by an error in a lower-level driver's queue congestion state management).
Well, whenever I was still connected to one of the machines it was
possible to clear the situation by killing a lot of processes, that is,
mostly killing all httpd, smtpd, imapd, amavisd, crond, spamassassin and
rsync processes. Eventually the system responded normally again (and I
could cleanly reboot it). However, which killed process finally
resolved the issue was non-deterministic so far.
The workload is different on all the servers:
Server 1 processes around 10-20k mails per day but also serves a lot
of HTTP requests.
Server 2 is only a mailserver, processing around 5-10k mails per day.
Server 3 just serves HTTP requests (and a bit of DNS).
> If it is the latter, then it could be that running "sync" will clear
> the problem. Temporarily, at least. Because sync will ignore the
> queue congestion state.
I have to admit that I never tried to sync in such a case, since mostly
I had only one open SSH session and was trying to find the root cause.
So far, the problem first occurred on server 2 around March or April,
without changes to the machine (we have a changelog for every server:
there was a single kernel update at that time, however we reverted back
after the first issue, ran a kernel which had been stable for several
weeks before, and encountered the problem again). In the beginning we
encountered the problem maybe twice a week, getting worse over the
following weeks. Several major updates (kernel and distribution) were
made without resolving the problem.
Since the end of April server 1 began showing the same problem (running a
different Fedora version and kernel at that time), first slowly (once a
week or so), then more regularly.
Around July we had a hang nearly every day.
Server 3 has had the problem only once so far, but that server has no
high workload.
We spent a lot of time investigating the issue, but since all servers
use different hardware, different setups (aside from the crossmounts
with noatime) and even different base systems (Fedora Core 5+6, Fedora
7, different kernels: 2.6.18, .19, .20, .21 and .22), I think we can
rule out hardware problems. I think the issue might have been there for
some time, but it is hit more often now since the workload has increased
a lot over the last months.
We ruled a lot out over the months (e.g. syslog was replaced, many less
important services were stopped, the scheduler was changed), without any
change in behaviour. Just reverting from "noatime" in the fstab to
"defaults" has fixed it reliably so far.
As said, I am still running "vmstat 1" and catting /proc/meminfo, just
in case. If there is anything I can do besides that to clarify the
problem, I will try to help.
Regards,
Matthias
On 08/09/2007 12:55 PM, Andrew Morton wrote:
> On Thu, 9 Aug 2007 11:59:43 +0200 Matthias Hensler <[email protected]> wrote:
>
>> On Sat, Aug 04, 2007 at 10:44:26AM +0200, Matthias Hensler wrote:
>>> On Fri, Aug 03, 2007 at 11:34:07AM -0700, Andrew Morton wrote:
>>> [...]
>>> I am also willing to try the patch posted by Richard.
>> I want to give some update here:
>>
>> 1. We finally hit the problem on a third system, with a total different
>> setup and hardware. However, again high I/O load caused the problem
>> and the affected filesystems were mounted with noatime.
>>
>> 2. I installed a recompiled kernel with just the two line patch from
>> Richard Kennedy (http://lkml.org/lkml/2007/8/2/89). That system has 5
>> days uptime now and counting. I believe the patch fixed the problem.
>> However, I will continue running "vmstat 1" and the endless loop of
>> "cat /proc/meminfo", just in case I am wrong.
>>
>
> Did we ever see the /proc/meminfo and /proc/vmstat output during the stall?
>
> If Richard's patch has indeed fixed it then this confirms that we're seeing
> contention over the dirty-memory limits. Richard's patch isn't really the
> right one because it allows unlimited dirty-memory windup in some situations
> (large number of disks with small writes, or when we perform queue congestion
> avoidance).
>
> As you're seeing this happening when multiple disks are being written to it is
> possible that the per-device-dirty-threshold patches which recently went into
> -mm (and which appear to have a bug) will fix it.
>
> But I worry that the stall appears to persist *forever*. That would indicate
> that we have a dirty-memory accounting leak, or that for some reason the
> system has decided to stop doing writeback to one or more queues (might be
> caused by an error in a lower-level driver's queue congestion state management).
>
> If it is the latter, then it could be that running "sync" will clear the
> problem. Temporarily, at least. Because sync will ignore the queue congestion
> state.
>
This is still a problem for people, and no fix is in sight until 2.6.24.
Can we get some kind of band-aid, like making the endless 'for' loop in
balance_dirty_pages() terminate after some number of iterations? Clearly
if we haven't written "write_chunk" pages after a few tries, *and* we
haven't encountered congestion, there's no point in trying forever...
[not even compile tested patch follows]
---
mm/page-writeback.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
--- linux-2.6.22.noarch.orig/mm/page-writeback.c
+++ linux-2.6.22.noarch/mm/page-writeback.c
@@ -208,11 +208,12 @@ static void balance_dirty_pages(struct a
long background_thresh;
long dirty_thresh;
unsigned long pages_written = 0;
+ int i;
unsigned long write_chunk = sync_writeback_pages();
struct backing_dev_info *bdi = mapping->backing_dev_info;
- for (;;) {
+ for (i = 0; ; i++) {
struct writeback_control wbc = {
.bdi = bdi,
.sync_mode = WB_SYNC_NONE,
@@ -250,6 +251,8 @@ static void balance_dirty_pages(struct a
pages_written += write_chunk - wbc.nr_to_write;
if (pages_written >= write_chunk)
break; /* We've done our duty */
+ if (i >= write_chunk && !wbc.encountered_congestion)
+ break; /* nothing to write? */
}
congestion_wait(WRITE, HZ/10);
}
On Thu, 20 Sep 2007 17:07:15 -0400
Chuck Ebbert <[email protected]> wrote:
> On 08/09/2007 12:55 PM, Andrew Morton wrote:
> > On Thu, 9 Aug 2007 11:59:43 +0200 Matthias Hensler <[email protected]> wrote:
> >
> >> On Sat, Aug 04, 2007 at 10:44:26AM +0200, Matthias Hensler wrote:
> >>> On Fri, Aug 03, 2007 at 11:34:07AM -0700, Andrew Morton wrote:
> >>> [...]
> >>> I am also willing to try the patch posted by Richard.
> >> I want to give some update here:
> >>
> >> 1. We finally hit the problem on a third system, with a total different
> >> setup and hardware. However, again high I/O load caused the problem
> >> and the affected filesystems were mounted with noatime.
> >>
> >> 2. I installed a recompiled kernel with just the two line patch from
> >> Richard Kennedy (http://lkml.org/lkml/2007/8/2/89). That system has 5
> >> days uptime now and counting. I believe the patch fixed the problem.
> >> However, I will continue running "vmstat 1" and the endless loop of
> >> "cat /proc/meminfo", just in case I am wrong.
> >>
> >
> > Did we ever see the /proc/meminfo and /proc/vmstat output during the stall?
> >
> > If Richard's patch has indeed fixed it then this confirms that we're seeing
> > contention over the dirty-memory limits. Richard's patch isn't really the
> > right one because it allows unlimited dirty-memory windup in some situations
> > (large number of disks with small writes, or when we perform queue congestion
> > avoidance).
> >
> > As you're seeing this happening when multiple disks are being written to it is
> > possible that the per-device-dirty-threshold patches which recently went into
> > -mm (and which appear to have a bug) will fix it.
> >
> > But I worry that the stall appears to persist *forever*. That would indicate
> > that we have a dirty-memory accounting leak, or that for some reason the
> > system has decided to stop doing writeback to one or more queues (might be
> > caused by an error in a lower-level driver's queue congestion state management).
> >
> > If it is the latter, then it could be that running "sync" will clear the
> > problem. Temporarily, at least. Because sync will ignore the queue congestion
> > state.
> >
>
> This is still a problem for people, and no fix is in sight until 2.6.24.
Any bugzilla urls or anything like that?
> Can we get some kind of band-aid, like making the endless 'for' loop in
> balance_dirty_pages() terminate after some number of iterations? Clearly
> if we haven't written "write_chunk" pages after a few tries, *and* we
> haven't encountered congestion, there's no point in trying forever...
Did my above questions get looked at?
Is anyone able to reproduce this?
Do we have a clue what's happening?
Is that function just spinning around, failing to start writeout against
any pages at all? If so, how come?
On 09/20/2007 05:29 PM, Andrew Morton wrote:
> On Thu, 20 Sep 2007 17:07:15 -0400
> Chuck Ebbert <[email protected]> wrote:
>
>> On 08/09/2007 12:55 PM, Andrew Morton wrote:
>>> On Thu, 9 Aug 2007 11:59:43 +0200 Matthias Hensler <[email protected]> wrote:
>>>
>>>> On Sat, Aug 04, 2007 at 10:44:26AM +0200, Matthias Hensler wrote:
>>>>> On Fri, Aug 03, 2007 at 11:34:07AM -0700, Andrew Morton wrote:
>>>>> [...]
>>>>> I am also willing to try the patch posted by Richard.
>>>> I want to give some update here:
>>>>
>>>> 1. We finally hit the problem on a third system, with a total different
>>>> setup and hardware. However, again high I/O load caused the problem
>>>> and the affected filesystems were mounted with noatime.
>>>>
>>>> 2. I installed a recompiled kernel with just the two line patch from
>>>> Richard Kennedy (http://lkml.org/lkml/2007/8/2/89). That system has 5
>>>> days uptime now and counting. I believe the patch fixed the problem.
>>>> However, I will continue running "vmstat 1" and the endless loop of
>>>> "cat /proc/meminfo", just in case I am wrong.
>>>>
>>> Did we ever see the /proc/meminfo and /proc/vmstat output during the stall?
>>>
>>> If Richard's patch has indeed fixed it then this confirms that we're seeing
>>> contention over the dirty-memory limits. Richard's patch isn't really the
>>> right one because it allows unlimited dirty-memory windup in some situations
>>> (large number of disks with small writes, or when we perform queue congestion
>>> avoidance).
>>>
>>> As you're seeing this happening when multiple disks are being written to it is
>>> possible that the per-device-dirty-threshold patches which recently went into
>>> -mm (and which appear to have a bug) will fix it.
>>>
>>> But I worry that the stall appears to persist *forever*. That would indicate
>>> that we have a dirty-memory accounting leak, or that for some reason the
>>> system has decided to stop doing writeback to one or more queues (might be
>>> caused by an error in a lower-level driver's queue congestion state management).
>>>
>>> If it is the latter, then it could be that running "sync" will clear the
>>> problem. Temporarily, at least. Because sync will ignore the queue congestion
>>> state.
>>>
>> This is still a problem for people, and no fix is in sight until 2.6.24.
>
> Any bugzilla urls or anything like that?
https://bugzilla.redhat.com/show_bug.cgi?id=249563
>
>> Can we get some kind of band-aid, like making the endless 'for' loop in
>> balance_dirty_pages() terminate after some number of iterations? Clearly
>> if we haven't written "write_chunk" pages after a few tries, *and* we
>> haven't encountered congestion, there's no point in trying forever...
>
> Did my above questions get looked at?
>
> Is anyone able to reproduce this?
>
> Do we have a clue what's happening?
There are a ton of dirty pages for one disk, and zero or close to zero dirty
for a different one. Kernel spins forever trying to write some arbitrary
minimum amount of data ("write_chunk" pages) to the second disk...
>
> Is that function just spinning around, failing to start writeout against
> any pages at all? If so, how come?
Yes, it spins forever. Just removing the "noatime" mount option for the
second disk generates enough dirty data to keep the system functional.
On Thu, 20 Sep 2007 18:04:38 -0400
Chuck Ebbert <[email protected]> wrote:
> >
> >> Can we get some kind of band-aid, like making the endless 'for' loop in
> >> balance_dirty_pages() terminate after some number of iterations? Clearly
> >> if we haven't written "write_chunk" pages after a few tries, *and* we
> >> haven't encountered congestion, there's no point in trying forever...
> >
> > Did my above questions get looked at?
> >
> > Is anyone able to reproduce this?
> >
> > Do we have a clue what's happening?
>
> There are a ton of dirty pages for one disk, and zero or close to zero dirty
> for a different one. Kernel spins forever trying to write some arbitrary
> minimum amount of data ("write_chunk" pages) to the second disk...
That should be OK. The caller will sit in that loop, sleeping in
congestion_wait(), polling the correct backing-dev occasionally and waiting
until the dirty limits subside to an acceptable limit, at which stage this:
if (nr_reclaimable +
global_page_state(NR_WRITEBACK)
<= dirty_thresh)
break;
will happen and we leave balance_dirty_pages().
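For readers following along, here is a rough sketch of the shape of
balance_dirty_pages() in the 2.6.22 era, reconstructed from the hunks
quoted in this thread. It is simplified: the real function re-samples the
limits after writeback and tracks dirty_exceeded, and the helper names
not shown in the hunks are filled in from memory of that era's source, so
treat this as an approximation rather than the exact code:

static void balance_dirty_pages(struct address_space *mapping)
{
	long background_thresh, dirty_thresh, nr_reclaimable;
	unsigned long pages_written = 0;
	unsigned long write_chunk = sync_writeback_pages();
	struct backing_dev_info *bdi = mapping->backing_dev_info;

	for (;;) {
		struct writeback_control wbc = {
			.bdi		= bdi,
			.sync_mode	= WB_SYNC_NONE,
			.nr_to_write	= write_chunk,
		};

		get_dirty_limits(&background_thresh, &dirty_thresh, mapping);
		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
				 global_page_state(NR_UNSTABLE_NFS);

		/* Exit 1: global dirty + writeback fell below the threshold. */
		if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
							dirty_thresh)
			break;

		if (nr_reclaimable) {
			writeback_inodes(&wbc);
			pages_written += write_chunk - wbc.nr_to_write;
			/* Exit 2: this caller wrote a full chunk itself. */
			if (pages_written >= write_chunk)
				break;	/* We've done our duty */
		}
		/* Otherwise sleep a bit and poll the limits again. */
		congestion_wait(WRITE, HZ/10);
	}
}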
That's all a bit crappy if the wrong races happen and some other task is
somehow exceeding the dirty limits each time this task polls them. Seems
unlikely that such a condition would persist forever.
So the question is, why do we have large amounts of dirty pages for one
disk which appear to be sitting there not getting written?
Do we know if there's any writeout at all happening when the system is in
this state?
I guess it's possible that the dirty inodes on the "other" disk got
themselves onto the wrong per-sb inode list, or are on the correct list
but in the wrong place. If so, these:
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch
writeback-fix-comment-use-helper-function.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch
writeback-fix-periodic-superblock-dirty-inode-flushing.patch
from 2.6.23-rc6-mm1 should help.
Did anyone try running /bin/sync when the system is in this state?
On 09/20/2007 06:36 PM, Andrew Morton wrote:
>
> So the question is, why do we have large amounts of dirty pages for one
> disk which appear to be sitting there not getting written?
>
> Do we know if there's any writeout at all happening when the system is in
> this state?
>
> I guess it's possible that the dirty inodes on the "other" disk got
> themselves onto the wrong per-sb inode list, or are on the correct list,
> but in the correct place. If so, these:
>
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch
> writeback-fix-comment-use-helper-function.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch
> writeback-fix-periodic-superblock-dirty-inode-flushing.patch
>
> from 2.6.23-rc6-mm1 should help.
Yikes! Simple fixes would be better.
The patch that is confirmed to fix the problem for this user is below, but
that one could cause other problems. I was looking for some band-aid that
could be shown to be harmless...
http://lkml.org/lkml/2007/8/2/89:
------
--- linux-2.6.22.1/mm/page-writeback.c.orig 2007-07-30 16:36:09.000000000 +0100
+++ linux-2.6.22.1/mm/page-writeback.c 2007-07-31 16:26:43.000000000 +0100
@@ -250,6 +250,8 @@ static void balance_dirty_pages(struct a
pages_written += write_chunk - wbc.nr_to_write;
if (pages_written >= write_chunk)
break; /* We've done our duty */
+ if (!wbc.encountered_congestion && wbc.nr_to_write > 0)
+ break; /* didn't find enough to do */
}
congestion_wait(WRITE, HZ/10);
}
>
> Did anyone try running /bin/sync when the system is in this state?
>
Reporter is in the CC: list.
On Thu, Sep 20, 2007 at 03:36:54PM -0700, Andrew Morton wrote:
> That's all a bit crappy if the wrong races happen and some other task
> is somehow exceeding the dirty limits each time this task polls them.
> Seems unlikely that such a condition would persist forever.
How exactly do you define forever? It looks to me that this condition
never resolves on its own, at least not within a window of several hours
(the system used to get stuck around 3am and was normally rebooted
between 7am and 8am, so at least several hours are not enough for the
problem to resolve on its own).
> So the question is, why do we have large amounts of dirty pages for
> one disk which appear to be sitting there not getting written?
Unfortunately I have no idea. The full stack trace for all processes was
attached to the original bug report; maybe that can give a clue.
> Do we know if there's any writeout at all happening when the system is
> in this state?
Not sure about that. The system is responsive in a still-open SSH
session, allowing several tasks still to be executed. However, that SSH
session gets stuck very quickly if the wrong commands are executed. New
logins are not possible (the connection is accepted but results in a
timeout 60 seconds later, so most likely /bin/login is not able to write
to wtmp).
From all that I suspect that there is no more write activity on that
system, but I cannot say for sure.
> Did anyone try running /bin/sync when the system is in this state?
I did not, no.
Regards,
Matthias
On Fri, 21 Sep 2007 10:08:08 +0200 Matthias Hensler <[email protected]> wrote:
> On Thu, Sep 20, 2007 at 03:36:54PM -0700, Andrew Morton wrote:
> > That's all a bit crappy if the wrong races happen and some other task
> > is somehow exceeding the dirty limits each time this task polls them.
> > Seems unlikely that such a condition would persist forever.
>
> How exactly do you define forever? It looks to me, that this condition
> never resolves on its own, at least not in a window of several hours
> (the system used to get stuck around 3am and was normally rebooted
> between 7am and 8am, so at least hours are not enough to have the problem
> resolve on its own).
That's forever.
> > So the question is, why do we have large amounts of dirty pages for
> > one disk which appear to be sitting there not getting written?
>
> Unfortunately I have no idea. The full stacktrace for all processes was
> attached to the original bugreport, maybe that can give a clue to that.
>
> > Do we know if there's any writeout at all happening when the system is
> > in this state?
>
> Not sure about that. The system is responsible on a still open SSH
> session, allowing several tasks still to be executed. However, that SSH
> session gets stuck very fast if the wrong commands are executed. New
> logins are not possible (Connection is akzepted but resulting in a
> timeout 60 seconds later, so most likely /bin/login is not able to log
> into wtmp).
>
> From all that I suspect that there is no more write activity on that
> system, but cannot say for sure.
Easiest would be to run `vmstat 1' in a separate ssh session then just
leave it running.
> > Did anyone try running /bin/sync when the system is in this state?
>
> I did not, no.
Would be interesting if poss, please.
This sounds an awful lot like the same problem I reported with fsck
hanging. I believe that Hugh had a candidate patch for that, which was
related to dirty tracking limits. It seems that that patch was tested and
acked by Peter. All on lkml under:
2.6.23-rc6-mm1 -- mkfs stuck in 'D'
-apw
On Thu, 2007-09-20 at 15:36 -0700, Andrew Morton wrote:
> On Thu, 20 Sep 2007 18:04:38 -0400
> Chuck Ebbert <[email protected]> wrote:
>
> > >
> > >> Can we get some kind of band-aid, like making the endless 'for' loop in
> > >> balance_dirty_pages() terminate after some number of iterations? Clearly
> > >> if we haven't written "write_chunk" pages after a few tries, *and* we
> > >> haven't encountered congestion, there's no point in trying forever...
> > >
> > > Did my above questions get looked at?
> > >
> > > Is anyone able to reproduce this?
> > >
> > > Do we have a clue what's happening?
> >
> > There are a ton of dirty pages for one disk, and zero or close to zero dirty
> > for a different one. Kernel spins forever trying to write some arbitrary
> > minimum amount of data ("write_chunk" pages) to the second disk...
>
> That should be OK. The caller will sit in that loop, sleeping in
> congestion_wait(), polling the correct backing-dev occasionally and waiting
> until the dirty limits subside to an acceptable limit, at which stage this:
>
> if (nr_reclaimable +
> global_page_state(NR_WRITEBACK)
> <= dirty_thresh)
> break;
>
>
> will happen and we leave balance_dirty_pages().
>
> That's all a bit crappy if the wrong races happen and some other task is
> somehow exceeding the dirty limits each time this task polls them. Seems
> unlikely that such a condition would persist forever.
>
> So the question is, why do we have large amounts of dirty pages for one
> disk which appear to be sitting there not getting written?
The lockup I'm seeing intermittently occurs when I have 2+ tasks copying
large files (1 GB+) on sda and a small, read-mostly mysql db app running
on sdb. The lockup seems to happen just after the copies finish -- there
are lots of dirty pages but nothing left to write them until kupdate
gets round to it.
BTW kupdate can loop for long periods of time when a disk is under this
kind of load -- I regularly see it take over 20 seconds, and often it's
unable to start as there are no pdflush threads available.
> Do we know if there's any writeout at all happening when the system is in
> this state?
>
No, there doesn't seem to be any activity at all -- my machine is
completely unresponsive; only sysrq works.
> I guess it's possible that the dirty inodes on the "other" disk got
> themselves onto the wrong per-sb inode list, or are on the correct list,
> but in the correct place. If so, these:
>
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch
> writeback-fix-comment-use-helper-function.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch
> writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch
> writeback-fix-periodic-superblock-dirty-inode-flushing.patch
>
> from 2.6.23-rc6-mm1 should help.
> Did anyone try running /bin/sync when the system is in this state?
I'm not able to run anything in this state, but sysrq-s doesn't make any
difference.
Richard
On Fri, 21 Sep 2007 11:25:41 +0100 richard kennedy <[email protected]> wrote:
> > That's all a bit crappy if the wrong races happen and some other task is
> > somehow exceeding the dirty limits each time this task polls them. Seems
> > unlikely that such a condition would persist forever.
> >
> > So the question is, why do we have large amounts of dirty pages for one
> > disk which appear to be sitting there not getting written?
>
> The lockup I'm seeing intermittently occurs when I have 2+ tasks copying
> large files (1Gb+) on sda & a small read-mainly mysql db app running on
> sdb. The lockup seems to happen just after the copies finish -- there
> are lots of dirty pages but nothing left to write them until kupdate
> gets round to it.
Then what happens? The system recovers?
On Fri, 2007-09-21 at 03:33 -0700, Andrew Morton wrote:
> On Fri, 21 Sep 2007 11:25:41 +0100 richard kennedy <[email protected]> wrote:
>
> > > That's all a bit crappy if the wrong races happen and some other task is
> > > somehow exceeding the dirty limits each time this task polls them. Seems
> > > unlikely that such a condition would persist forever.
> > >
> > > So the question is, why do we have large amounts of dirty pages for one
> > > disk which appear to be sitting there not getting written?
> >
> > The lockup I'm seeing intermittently occurs when I have 2+ tasks copying
> > large files (1Gb+) on sda & a small read-mainly mysql db app running on
> > sdb. The lockup seems to happen just after the copies finish -- there
> > are lots of dirty pages but nothing left to write them until kupdate
> > gets round to it.
>
> Then what happens? The system recovers?
Nothing -- it stays stuck forever.
I don't think kupdate is getting started; I added some debug in there
but haven't found anything useful yet. I am trying to build a
better test case -- the one I've got at the moment can take hours to
trigger this problem.
On 09/21/2007 05:39 AM, Andy Whitcroft wrote:
> This sounds an awful lot like the same problem I reported with fsck
> hanging. I believe that Hugh had a candidate patch for that, which was
> related to dirty tracking limits. It seems that that patch tested, and
> acked by Peter. All on lkml under:
>
> 2.6.23-rc6-mm1 -- mkfs stuck in 'D'
>
Unfortunately that whole thread is about fixing bugs in an already
much-changed balance_dirty_pages() in -mm.
On Fri, 21 Sep 2007, Andy Whitcroft wrote:
> This sounds an awful lot like the same problem I reported with fsck
> hanging. I believe that Hugh had a candidate patch for that, which was
> related to dirty tracking limits. It seems that that patch tested, and
> acked by Peter. All on lkml under:
>
> 2.6.23-rc6-mm1 -- mkfs stuck in 'D'
You may well be right.
My initial reaction was to point out that my patch was to a
2.6.23-rc6-mm1 implementation detail, whereas this thread is
about a problem seen since 2.6.20 or perhaps earlier.
But once I look harder at it, I wonder what would have kept
2.6.18 to 2.6.23 safe from the same issue: per-cpu deltas from
the global vm stats too low to get synched back to global, yet
adding up to something which misleads balance_dirty_pages into
an indefinite loop, e.g. total nr_writeback actually 0 but
appearing to exceed dirty_thresh in the global approximation.
Certainly my rc6-mm1 patch won't work here, and all it was doing
was to apply some safety already added by Peter to one further case.
Looking at the 2.6.18-2.6.23 code, I'm uncertain what to try instead.
There is a refresh_vm_stats function which we could call (then retest
the break condition) just before resorting to congestion_wait. But
the big NUMA people might get very upset with me calling that too
often: it would cause a thundering herd of bouncing cachelines, which
that machinery was all designed to avoid. And it's not obvious to me
what condition to test for to decide that dirty_thresh is "too low".
I believe Peter gave all this quite a lot of thought when he was
making the rc6-mm1 changes, and I'd rather defer to him for a
suggestion of what best to do in earlier releases. Or maybe he'll
just point out how this couldn't have been a problem before.
Or there is Richard's patch, which I haven't considered, but
Andrew was not quite satisfied with it - partly because he'd like
to understand how the situation could come about first; perhaps
we have now got an explanation.
(The original bug report was indeed on SMP, but I haven't seen
anyone say that's a necessary condition for the hang: it would
be if this is the issue. And Richard writes at one point of the
system only responding to AltSysRq: that would be surprising for
this issue, though it's possible that a task in balance_dirty_pages
is holding an i_mutex that everybody else comes to need.)
Hugh
On 09/21/2007 11:58 AM, Hugh Dickins wrote:
> Looking at the 2.6.18-2.6.23 code, I'm uncertain what to try instead.
> There is a refresh_vm_stats function which we could call (then retest
> the break condition) just before resorting to congestion_wait. But
> the big NUMA people might get very upset with me calling that too
> often: causing a thundering herd of bouncing cachelines which that
> was all designed to avoid. And it's not obvious to me what condition
> to test for dirty_thresh "too low".
>
> I believe Peter gave all this quite a lot of thought when he was
> making the rc6-mm1 changes, and I'd rather defer to him for a
> suggestion of what best to do in earlier releases. Or maybe he'll
> just point out how this couldn't have been a problem before.
>
> Or there is is Richard's patch, which I haven't considered, but
> Andrew was not quite satisfied with it - partly because he'd like
> to understand how the situation could come about first, perhaps
> we have now got an explanation.
>
My patch is just Richard's with the added requirement that we loop
some minimum number of times before exiting balance_dirty_pages()
with unfinished work.
On Fri, 21 Sep 2007 16:58:15 +0100 (BST) Hugh Dickins
<[email protected]> wrote:
> But once I look harder at it, I wonder what would have kept
> 2.6.18 to 2.6.23 safe from the same issue: per-cpu deltas from
> the global vm stats too low to get synched back to global, yet
> adding up to something which misleads balance_dirty_pages into
> an indefinite loop e.g. total nr_writeback actually 0, but
> appearing more than dirty_thresh in the global approximation.
This could only happen when: dirty_thresh < nr_cpus * per_cpu_max_delta
> Looking at the 2.6.18-2.6.23 code, I'm uncertain what to try instead.
> There is a refresh_vm_stats function which we could call (then retest
> the break condition) just before resorting to congestion_wait. But
> the big NUMA people might get very upset with me calling that too
> often: causing a thundering herd of bouncing cachelines which that
> was all designed to avoid. And it's not obvious to me what condition
> to test for dirty_thresh "too low".
That could be modeled on the error limit I have. For this particular
case that would end up looking like:
nr_online_cpus * pcp->stat_threshold.
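A back-of-the-envelope check on that bound, with purely illustrative
numbers (the real per-cpu stat_threshold is sized per machine, so treat
these values as assumptions):

/* Hypothetical: 4 online CPUs, per-cpu stat_threshold of 64 pages.
 *
 *	max error = nr_online_cpus * stat_threshold
 *		  = 4 * 64 = 256 pages	(~1 MB with 4 KB pages)
 *
 * dirty_thresh on even a small machine is thousands of pages, so the
 * unsynced per-cpu deltas alone should not be able to hold the global
 * approximation above dirty_thresh indefinitely -- which is Peter's
 * point that this scenario needs an implausibly low dirty_limit.
 */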
> I believe Peter gave all this quite a lot of thought when he was
> making the rc6-mm1 changes, and I'd rather defer to him for a
> suggestion of what best to do in earlier releases. Or maybe he'll
> just point out how this couldn't have been a problem before.
As outlined above, and I don't think we'll ever have such a low
dirty_limit. But who knows :-)
> Or there is is Richard's patch, which I haven't considered, but
> Andrew was not quite satisfied with it - partly because he'd like
> to understand how the situation could come about first, perhaps
> we have now got an explanation.
I'm with Andrew on this, that is, quite puzzled about how all this arises.
Testing those writeback-fix-* patches might help rule out (or point to)
a malfunction of pdflush.
The theory that one task will spin in balance_dirty_pages() on a
bdi that does not actually have many dirty pages doesn't sound
plausible, because eventually the total dirty count (well, actually
dirty+unstable+writeback) should subside again. That scenario can
cause crappy latencies, but should not 'hang' the machine.
> (The original bug report was indeed on SMP, but I haven't seen
> anyone say that's a necessary condition for the hang: it would
> be if this is the issue. And Richard writes at one point of the
> system only responding to AltSysRq: that would be surprising for
> this issue, though it's possible that a task in balance_dirty_pages
> is holding an i_mutex that everybody else comes to need.)
Are we actually holding i_mutex on paths that lead into
balance_dirty_pages? That does (from my admittedly limited knowledge of
the vfs) sound like trouble, since we'd need it to complete writeback.
All quite puzzling.
On Fri, 2007-09-21 at 03:33 -0700, Andrew Morton wrote:
> On Fri, 21 Sep 2007 11:25:41 +0100 richard kennedy <[email protected]> wrote:
>
> > > That's all a bit crappy if the wrong races happen and some other task is
> > > somehow exceeding the dirty limits each time this task polls them. Seems
> > > unlikely that such a condition would persist forever.
> > >
> > > So the question is, why do we have large amounts of dirty pages for one
> > > disk which appear to be sitting there not getting written?
> >
> > The lockup I'm seeing intermittently occurs when I have 2+ tasks copying
> > large files (1Gb+) on sda & a small read-mainly mysql db app running on
> > sdb. The lockup seems to happen just after the copies finish -- there
> > are lots of dirty pages but nothing left to write them until kupdate
> > gets round to it.
>
> Then what happens? The system recovers?
When my system is locked up I get this from sysrq-w (2.6.23-rc7):
SysRq : Show Blocked State
auditd D ffff8100b422fd28 0 1999 1
ffff8100b422fd78 0000000000000086 0000000000000000 ffff8100b4103020
0000000000000286 ffff8100b4103020 ffff8100bf8af020 ffff8100b41032b8
0000000100000000 0000000000000001 ffffffffffffffff 0000000000000003
Call Trace:
[<ffffffff8802c8e9>] :jbd:log_wait_commit+0xa3/0xf5
[<ffffffff810482d9>] autoremove_wake_function+0x0/0x2e
[<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee
[<ffffffff810a79f7>] __writeback_single_inode+0x1f4/0x332
[<ffffffff8108c0b0>] vfs_statfs_native+0x29/0x34
[<ffffffff810a8362>] sync_inode+0x24/0x36
[<ffffffff8803c350>] :ext3:ext3_sync_file+0xb4/0xc8
[<ffffffff81240730>] mutex_lock+0x1e/0x2f
[<ffffffff810aa7fa>] do_fsync+0x52/0xa4
[<ffffffff810aa86f>] __do_fsync+0x23/0x36
[<ffffffff8100bc8e>] system_call+0x7e/0x83
syslogd D ffff8100b3f67d28 0 2022 1
ffff8100b3f67d78 0000000000000086 0000000000000000 0000000100000000
0000000000000003 ffff810037c66810 ffff8100bf8af020 ffff810037c66aa8
0000000100000000 0000000000000001 ffffffffffffffff 0000000000000003
Call Trace:
[<ffffffff8802c8e9>] :jbd:log_wait_commit+0xa3/0xf5
[<ffffffff810482d9>] autoremove_wake_function+0x0/0x2e
[<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee
[<ffffffff810a79f7>] __writeback_single_inode+0x1f4/0x332
[<ffffffff810a8362>] sync_inode+0x24/0x36
[<ffffffff8803c350>] :ext3:ext3_sync_file+0xb4/0xc8
[<ffffffff81240730>] mutex_lock+0x1e/0x2f
[<ffffffff810aa7fa>] do_fsync+0x52/0xa4
[<ffffffff810aa86f>] __do_fsync+0x23/0x36
[<ffffffff8100be0c>] tracesys+0xdc/0xe1
On Fri, Sep 21, 2007 at 16:58:15 +0100,
Hugh Dickins <[email protected]> wrote:
>
> But once I look harder at it, I wonder what would have kept
> 2.6.18 to 2.6.23 safe from the same issue: per-cpu deltas from
> the global vm stats too low to get synched back to global, yet
> adding up to something which misleads balance_dirty_pages into
> an indefinite loop e.g. total nr_writeback actually 0, but
> appearing more than dirty_thresh in the global approximation.
I have been seeing this problem since Fedora Core 5's 2.6.20-1.2300.fc5
kernel. The last Fedora kernel that didn't have this problem was
2.6.19-1.2288.2.4.fc5, which I am still running on one machine.
I have put up vmstat and proc/meminfo data at: http://wolff.to/bruno/vmlog
There are a number of iterations of the output and they are mixed together,
though the vmstat info was written when a buffer was full, so there are
chunks of lines together. This was from a machine running Fedora 7's
2.6.23.1-4.fc7 kernel.
The machine is a single processor P3 with 512 MB of memory. There are
three hard drives, two of which are used in software raid 1 arrays.
I have ext3 file systems mounted noatime. I can provide more details if
you need anything else.
I can get lockups to happen fairly often. Often lots of firefox activity
and/or some other disk-heavy task such as yum update will get it to happen.
I am not subscribed to LKML, so please keep me copied on replies.