2006-03-06 22:06:53

by Andres Salomon

Subject: pdflush and high load

Hi,

(2nd attempt at posting this; the first one appears to have
disappeared?)

I have a dual-processor Opteron system w/ a 3ware controller. When
testing out i/o with a simple 'dd if=/dev/zero of=/mnt/foo bs=1M
count=1024', the load ends up hitting about 12 before the dd finishes.
This happens because 8 pdflush processes are spawned, and they all sit
in D (uninterruptible sleep) state for most of the time; tasks in D
state count toward the load average.

While the load itself isn't a problem, I was tracking down other bugs
that it was potentially creating, and wondering why the load even needed
to be that high. During the i/o, the pdflush processes look like this:

Mar 5 18:29:28 localhost kernel: pdflush D C112D5E0 0 5698 11 5697 (L-TLB)
Mar 5 18:29:28 localhost kernel: d2e8de84 d2e8de88 c124d200 c112d5e0 00000004 f7d21550 c1922c00 c1924758
Mar 5 18:29:28 localhost kernel: 00000286 00000000 00000000 c180f440 00000001 00000000 9db0b4c0 000f44b6
Mar 5 18:29:28 localhost kernel: f7d21550 f7e2d550 f7e2d678 c180fe80 d2e8de94 d2e8de94 0029545f d2e8df04
Mar 5 18:29:28 localhost kernel: Call Trace:
Mar 5 18:29:28 localhost kernel: [schedule_timeout+111/142] schedule_timeout+0x6f/0x8e
Mar 5 18:29:28 localhost kernel: [process_timeout+0/9] process_timeout+0x0/0x9
Mar 5 18:29:28 localhost kernel: [io_schedule_timeout+41/51] io_schedule_timeout+0x29/0x33
Mar 5 18:29:28 localhost kernel: [blk_congestion_wait+101/122] blk_congestion_wait+0x65/0x7a
Mar 5 18:29:28 localhost kernel: [autoremove_wake_function+0/58] autoremove_wake_function+0x0/0x3a
Mar 5 18:29:28 localhost kernel: [sync_sb_inodes+217/628] sync_sb_inodes+0xd9/0x274
Mar 5 18:29:28 localhost kernel: [autoremove_wake_function+0/58] autoremove_wake_function+0x0/0x3a
Mar 5 18:29:28 localhost kernel: [pdflush+0/50] pdflush+0x0/0x32
Mar 5 18:29:28 localhost kernel: [writeback_inodes+121/204] writeback_inodes+0x79/0xcc
Mar 5 18:29:28 localhost kernel: [pdflush+0/50] pdflush+0x0/0x32
Mar 5 18:29:28 localhost kernel: [background_writeout+133/148] background_writeout+0x85/0x94
Mar 5 18:29:28 localhost kernel: [__pdflush+231/379] __pdflush+0xe7/0x17b
Mar 5 18:29:28 localhost kernel: [pdflush+45/50] pdflush+0x2d/0x32
Mar 5 18:29:28 localhost kernel: [background_writeout+0/148] background_writeout+0x0/0x94
Mar 5 18:29:28 localhost kernel: [kthread+124/166] kthread+0x7c/0xa6
Mar 5 18:29:28 localhost kernel: [kthread+0/166] kthread+0x0/0xa6
Mar 5 18:29:28 localhost kernel: [kernel_thread_helper+5/11] kernel_thread_helper+0x5/0xb


Basically, what ends up happening on my system is this: each pdflush
process is handling background_writeout(), encountering congestion, and
calling blk_congestion_wait(WRITE, HZ/10) after every loop iteration.
blk_congestion_wait() does an uninterruptible sleep, and other things
get scheduled (typically, of the 8 pdflush processes, 4 will just be
waiting at schedule(), 2 will actually be writing out pages in
generic_writepages(), and 2 will be off doing other random stuff like
handling softirqs or updating the wall time inside the timer irq
handler).
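
For reference, the loop in question looks roughly like this (paraphrased
from 2.6's mm/page-writeback.c from memory, so details may differ
slightly from the exact tree I'm running):

static void background_writeout(unsigned long _min_pages)
{
        long min_pages = _min_pages;
        struct writeback_control wbc = {
                .sync_mode      = WB_SYNC_NONE,
                .nr_to_write    = 0,
                .nonblocking    = 1,
        };

        for ( ; ; ) {
                struct writeback_state wbs;
                long background_thresh, dirty_thresh;

                get_dirty_limits(&wbs, &background_thresh, &dirty_thresh, NULL);
                /* stop once we're under the background threshold and have
                 * written the requested minimum number of pages */
                if (wbs.nr_dirty + wbs.nr_unstable < background_thresh &&
                                min_pages <= 0)
                        break;
                wbc.encountered_congestion = 0;
                wbc.nr_to_write = MAX_WRITEBACK_PAGES;
                wbc.pages_skipped = 0;
                writeback_inodes(&wbc);
                min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
                if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
                        /* wrote less than a full batch; if the queue was
                         * congested, sleep uninterruptibly and retry */
                        blk_congestion_wait(WRITE, HZ/10);
                        if (!wbc.encountered_congestion)
                                break;
                }
        }
}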

Is there a reason why this sleep needs to happen w/
TASK_UNINTERRUPTIBLE? I tried setting it to TASK_INTERRUPTIBLE, and
encountered a minor performance hit (it took a few extra seconds to
write the 1GB of data); however, the pdflush processes spent most of
their time in S state, so the load on the system was much more
reasonable.
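
For reference, blk_congestion_wait() itself is essentially this (again
paraphrased from memory; the TASK_UNINTERRUPTIBLE below is what I
changed to TASK_INTERRUPTIBLE for the experiment):

long blk_congestion_wait(int rw, long timeout)
{
        long ret;
        DEFINE_WAIT(wait);
        wait_queue_head_t *wqh = &congestion_wqh[rw];

        /* the uninterruptible sleep here is what puts pdflush in D state */
        prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
        ret = io_schedule_timeout(timeout);
        finish_wait(wqh, &wait);
        return ret;
}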

Aside from the TASK_UNINTERRUPTIBLE sleep, I think the real problem is
that the pdflush spawning metric has no way to determine whether a
pdflush process is doing actual useful work. Currently, it checks
whether there's more work to be done, and keeps spawning pdflush
threads (up to 8) without checking whether another thread would
actually help. Throwing a printk inside the background_writeout() loop
showed the following work being done:

Mar 6 02:57:34 localhost kernel: background_writeout(130): nr_dirty=102743, nr_unstable=0, bg_thresh=26212, written=0, min_pages=-2681, pages_skipped=0, nr_to_write=1024(/1024), encountered_congestion=1
Mar 6 02:57:34 localhost kernel: background_writeout(5717): nr_dirty=102740, nr_unstable=0, bg_thresh=26212, written=5, min_pages=-2476, pages_skipped=0, nr_to_write=1019(/1024), encountered_congestion=1
Mar 6 02:57:34 localhost kernel: background_writeout(131): nr_dirty=102755, nr_unstable=0, bg_thresh=26212, written=11, min_pages=-2318, pages_skipped=0, nr_to_write=1013(/1024), encountered_congestion=1
Mar 6 02:57:34 localhost kernel: background_writeout(131): nr_dirty=102742, nr_unstable=0, bg_thresh=26212, written=0, min_pages=-2329, pages_skipped=0, nr_to_write=1024(/1024), encountered_congestion=1
Mar 6 02:57:34 localhost kernel: background_writeout(5717): nr_dirty=102744, nr_unstable=0, bg_thresh=26212, written=3, min_pages=-2481, pages_skipped=0, nr_to_write=1021(/1024), encountered_congestion=1
Mar 6 02:57:34 localhost kernel: background_writeout(131): nr_dirty=102741, nr_unstable=0, bg_thresh=26212, written=1, min_pages=-2329, pages_skipped=0, nr_to_write=1023(/1024), encountered_congestion=1
Mar 6 02:57:34 localhost kernel: background_writeout(130): nr_dirty=102742, nr_unstable=0, bg_thresh=26212, written=0, min_pages=-2681, pages_skipped=0, nr_to_write=1024(/1024), encountered_congestion=1
Mar 6 02:57:34 localhost kernel: background_writeout(5716): nr_dirty=102743, nr_unstable=0, bg_thresh=26212, written=0, min_pages=-1962, pages_skipped=0, nr_to_write=1024(/1024), encountered_congestion=1
Mar 6 02:57:34 localhost kernel: background_writeout(131): nr_dirty=102744, nr_unstable=0, bg_thresh=26212, written=2, min_pages=-2330, pages_skipped=0, nr_to_write=1022(/1024), encountered_congestion=1
Mar 6 02:57:34 localhost kernel: background_writeout(5716): nr_dirty=102744, nr_unstable=0, bg_thresh=26212, written=0, min_pages=-1962, pages_skipped=0, nr_to_write=1024(/1024), encountered_congestion=1
Mar 6 02:57:34 localhost kernel: background_writeout(130): nr_dirty=102744, nr_unstable=0, bg_thresh=26212, written=0, min_pages=-2681, pages_skipped=0, nr_to_write=1024(/1024), encountered_congestion=1
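
(The printk producing the above was along these lines -- reconstructed
from the output rather than copied verbatim, so treat the exact
statement as approximate:)

printk(KERN_DEBUG "background_writeout(%d): nr_dirty=%lu, nr_unstable=%lu, "
        "bg_thresh=%ld, written=%ld, min_pages=%ld, pages_skipped=%ld, "
        "nr_to_write=%ld(/%d), encountered_congestion=%d\n",
        current->pid, wbs.nr_dirty, wbs.nr_unstable, background_thresh,
        MAX_WRITEBACK_PAGES - wbc.nr_to_write, min_pages, wbc.pages_skipped,
        wbc.nr_to_write, MAX_WRITEBACK_PAGES, wbc.encountered_congestion);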


As you can see, they constantly encounter congestion, and constantly
have things to write out, so the background_writeout loop runs
endlessly; however, only two of the processes are actually able to get
anything useful done. Pids #130 and #5716 cannot write anything out due
to locks, so they just keep trying and trying; pids #131 and #5717 are
actually doing useful work. This is typical of what all the pdflush
processes look like.

My solution to the problem is to keep the blk_congestion_wait() sleep
uninterruptible, but add a check to the background_writeout() loop for
whether the current pdflush thread is actually doing anything useful;
if it's not, just give up. My threshold for "useful" is having written
any pages in the past 2 loop iterations. This means that pdflush
processes that are actually working should continue working; ones that
are not will exit sooner; and ones that are temporarily not working due
to other activity on the system should exit (new pdflush threads will
be spawned later on to attempt the work again). A sketch of the idea
follows.
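
Relative to the background_writeout() loop quoted earlier, the change is
shaped roughly like this (illustrative only; the attached patch is the
real thing):

static void background_writeout(unsigned long _min_pages)
{
        int idle_loops = 0;
        ...
        for ( ; ; ) {
                ...
                writeback_inodes(&wbc);
                /* nr_to_write only drops below MAX_WRITEBACK_PAGES if we
                 * actually wrote something this pass */
                if (wbc.nr_to_write == MAX_WRITEBACK_PAGES &&
                                wbc.pages_skipped == 0)
                        idle_loops++;
                else
                        idle_loops = 0;
                /* nothing written for 2 consecutive iterations: give up,
                 * and let a fresh pdflush thread retry later */
                if (idle_loops >= 2)
                        break;
                ...
        }
}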

I've attached the patch I'm using; my completely unscientific tests
show it improving i/o speed on the 1GB file write. Feedback on it
would be excellent.

Signed-off-by: Andres Salomon <[email protected]>


Attachments:
pdflush_suicide.patch (1.16 kB)

2006-03-07 12:10:13

by Andrew Morton

Subject: Re: pdflush and high load

Andres Salomon <[email protected]> wrote:
>
> Hi,
>
> (2nd attempt at posting this; the first one appears to have
> disappeared?)

It came through.

> Basically, what ends up happening on my system is, each pdflush process
> is handling background_writeout(), encountering congestion, and calling
> blk_congestion_wait(WRITE, HZ/10) after every loop.

Yes, the do-we-need-another-thread algorithm is rather too naive. I could
swear it _used_ to work OK. Maybe something changed to cause individual
threads to block for longer than they used to. That would be an
independent problem - one pdflush instance is supposed to be able to handle
many queues (I tested one instance on 12 disks and it worked OK. But that
was 4-5 years ago)

> My solution to the problem is to keep the blk_congestion_wait() sleep as
> uninterruptible, but add a check to the background_writeout() loop to
> check whether the current pdflush thread is actually doing anything
> useful; if it's not, just give up.

It would be better to not start a new thread in the first place.

2006-03-23 19:30:44

by Andres Salomon

Subject: Re: pdflush and high load

(Sorry for the delayed response, been way too busy)


Andrew Morton wrote:
> Andres Salomon <[email protected]> wrote:
>> Hi,
>>
>> (2nd attempt at posting this; the first one appears to have
>> disappeared?)
>
> It came through.
>
>> Basically, what ends up happening on my system is, each pdflush process
>> is handling background_writeout(), encountering congestion, and calling
>> blk_congestion_wait(WRITE, HZ/10) after every loop.
>
> Yes, the do-we-need-another-thread algorithm is rather too naive. I could
> swear it _used_ to work OK. Maybe something changed to cause individual
> threads to block for longer than they used to. That would be an
> independent problem - one pdflush instance is supposed to be able to handle
> many queues (I tested one instance on 12 disks and it worked OK. But that
> was 4-5 years ago)

The fact that background_writeout() typically ends up writing 0 pages
per iteration in the common case (as the logs above show) suggests some
serious bitrot.


>
>> My solution to the problem is to keep the blk_congestion_wait() sleep as
>> uninterruptible, but add a check to the background_writeout() loop to
>> check whether the current pdflush thread is actually doing anything
>> useful; if it's not, just give up.
>
> It would be better to not start a new thread in the first place.
>

Indeed, but how do we determine that?

One option I thought of was to move the thread-spawning logic down into
the worker callbacks, since they have a much better idea of what's
going on. When a callback simply sleeps (as is the case w/
background_writeout), it's very easy to hit the 1-second threshold that
the generic pdflush logic uses to decide whether to create a new thread.

I'm playing around w/ some patches that move the logic down into the
callbacks, having them return non-zero if they need help. The pdflush
thread then checks whether the callback returned non-zero and whether
there are no idle pdflush threads; if both hold, it allocates a new
thread.
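
Shape-wise, something like this (purely illustrative -- the change to
the fn return type is the point, and the details aren't from an actual
patch):

/* worker callbacks now return nonzero when they want another thread */
static int __pdflush(struct pdflush_work *my_work)
{
        ...
        for ( ; ; ) {
                int need_help;
                ...
                need_help = (*my_work->fn)(my_work->arg0);

                spin_lock_irq(&pdflush_lock);
                /* only spawn if the worker asked for help and nobody
                 * is already sitting idle on the list */
                if (need_help && list_empty(&pdflush_list) &&
                    nr_pdflush_threads < MAX_PDFLUSH_THREADS)
                        start_one_pdflush_thread();
                spin_unlock_irq(&pdflush_lock);
                ...
        }
}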