LinuxLists.cc - High load average on disk I/O on 2.6.17-rc3

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

Hi Robert,

There are, this is the relevant output of the process list:

...
4659 pts/6 Ss 0:00 -bash
4671 pts/5 R+ 0:12 cp -a test-dir/ new-test
4676 ? D 0:00 [pdflush]
4679 ? D 0:00 [pdflush]
4687 pts/4 D+ 0:01 hdparm -t /dev/sda
4688 ? D 0:00 [pdflush]
4690 ? D 0:00 [pdflush]
4692 ? D 0:00 [pdflush]
...

This was when I was copying a directory and then doing a performance test with
hdparm in a separate shell. The hdparm process was in [D+] state and
basically waited until the cp was finished. During the whole thing there
were up to 5 pdflush processes in [D] state.

The 5 minute load average hit 8.90 during this test.

Does that help?

Jason

-------Original Message-----
From: Robert Hancock
Sent: Friday 05 May 2006 16:12
To: linux-kernel
Subject: Re: High load average on disk I/O on 2.6.17-rc3

Jason Schoonover wrote:
> Hi all,
>
> I'm not sure if this is the right list to post to, so please direct me to
> the appropriate list if this is the wrong one.
>
> I'm having some problems on the latest 2.6.17-rc3 kernel and SCSI disk I/O.
> Whenever I copy any large file (over 500GB) the load average starts to
> slowly rise and after about a minute it is up to 7.5 and keeps on rising
> (depending on how long the file takes to copy). When I watch top, the
> processes at the top of the list are cp, pdflush, kjournald and kswapd.

Are there some processes stuck in D state? These will contribute to the
load average even if they are not using CPU.

2006-05-06 17:22:37

by Robert Hancock

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

Jason Schoonover wrote:
> Hi Robert,
>
> There are, this is the relevant output of the process list:
>
> ...
> 4659 pts/6 Ss 0:00 -bash
> 4671 pts/5 R+ 0:12 cp -a test-dir/ new-test
> 4676 ? D 0:00 [pdflush]
> 4679 ? D 0:00 [pdflush]
> 4687 pts/4 D+ 0:01 hdparm -t /dev/sda
> 4688 ? D 0:00 [pdflush]
> 4690 ? D 0:00 [pdflush]
> 4692 ? D 0:00 [pdflush]
> ...
>
> This was when I was copying a directory and then doing a performance test with
> hdparm in a separate shell. The hdparm process was in [D+] state and
> basically waited until the cp was finished. During the whole thing there
> were up to 5 pdflush processes in [D] state.
>
> The 5 minute load average hit 8.90 during this test.
>
> Does that help?

Well, it obviously explains why the load average is high, those D state
processes all count in the load average. It may be sort of a cosmetic
issue, since they're not actually using any CPU, but it's still a bit
unusual. For one thing, not sure why there are that many of them?

You could try enabling the SysRq triggers (if they're not already in
your kernel/distro) and doing Alt-SysRq-T which will dump the kernel
stack of all processes, that should show where exactly in the kernel
those pdflush processes are blocked..

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/

2006-05-06 18:24:29

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

Hi Robert,

I did started a ncftpget and managed to get 6 pdflush processes running in
state D, hopefully this will give us a chance to debug it.

I've attached the entire Alt+SysReq+T output here because I have no idea how
to read it.

Thanks,
Jason

-------Original Message-----
From: Robert Hancock
Sent: Saturday 06 May 2006 10:20
To: Jason Schoonover
Subject: Re: High load average on disk I/O on 2.6.17-rc3

Jason Schoonover wrote:
> Hi Robert,
>
> There are, this is the relevant output of the process list:
>
> ...
> 4659 pts/6 Ss 0:00 -bash
> 4671 pts/5 R+ 0:12 cp -a test-dir/ new-test
> 4676 ? D 0:00 [pdflush]
> 4679 ? D 0:00 [pdflush]
> 4687 pts/4 D+ 0:01 hdparm -t /dev/sda
> 4688 ? D 0:00 [pdflush]
> 4690 ? D 0:00 [pdflush]
> 4692 ? D 0:00 [pdflush]
> ...
>
> This was when I was copying a directory and then doing a performance test
> with hdparm in a separate shell. The hdparm process was in [D+] state and
> basically waited until the cp was finished. During the whole thing there
> were up to 5 pdflush processes in [D] state.
>
> The 5 minute load average hit 8.90 during this test.
>
> Does that help?

Well, it obviously explains why the load average is high, those D state
processes all count in the load average. It may be sort of a cosmetic
issue, since they're not actually using any CPU, but it's still a bit
unusual. For one thing, not sure why there are that many of them?

You could try enabling the SysRq triggers (if they're not already in
your kernel/distro) and doing Alt-SysRq-T which will dump the kernel
stack of all processes, that should show where exactly in the kernel
those pdflush processes are blocked..

Attachments:

(No filename) (1.70 kB)
sysreq.output.txt (31.14 kB)
Download all attachments

2006-05-06 20:01:22

by Robert Hancock

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

Jason Schoonover wrote:
> Hi Robert,
>
> I did started a ncftpget and managed to get 6 pdflush processes running in
> state D, hopefully this will give us a chance to debug it.
>
> I've attached the entire Alt+SysReq+T output here because I have no idea how
> to read it.

Well, I think the relevant parts would be:

> pdflush D 37E8BE80 0 5842 11 6085 3699 (L-TLB)
> c7803eec b1bfa900 b1bfa920 37e8be80 00000000 00000000 fc4c7fab 000f9668
> dfe67688 dfe67580 b21a0540 b2014340 805ef140 000f966c 00000001 00000000
> 00000286 b011e999 00000286 b011e999 c7803efc c7803efc 00000286 00000000
> Call Trace:
> <b011e999> lock_timer_base+0x15/0x2f <b011e999> lock_timer_base+0x15/0x2f
> <b026fd89> schedule_timeout+0x6c/0x8b <b011ecff> process_timeout+0x0/0x9
> <b026ebf5> io_schedule_timeout+0x29/0x33 <b01387eb> pdflush+0x0/0x1b5
> <b01a0a1d> blk_congestion_wait+0x55/0x69 <b01271dc> autoremove_wake_function+0x0/0x3a
> <b0137f01> background_writeout+0x7d/0x8b <b01388ee> pdflush+0x103/0x1b5
> <b0137e84> background_writeout+0x0/0x8b <b01271a1> kthread+0xa3/0xd0
> <b01270fe> kthread+0x0/0xd0 <b0100bc5> kernel_thread_helper+0x5/0xb
> ncftpget D 568FC100 0 6115 5762 (NOTLB)
> cb351a88 87280008 b013595f 568fc100 00000008 00051615 f2a64c80 b21cc340
> dfbc7b58 dfbc7a50 b02b0480 b200c340 77405900 000f966c 00000000 00000282
> b011e999 f2a64cd8 00000000 dfec8b84 00000000 dfec8b84 b01a216b 00000000
> Call Trace:
> <b013595f> mempool_alloc+0x21/0xbf <b011e999> lock_timer_base+0x15/0x2f
> <b01a216b> get_request+0x55/0x283 <b026f9b5> io_schedule+0x26/0x30
> <b01a2432> get_request_wait+0x99/0xd3 <b01271dc> autoremove_wake_function+0x0/0x3a
> <b01a2786> __make_request+0x2b9/0x350 <b01a0268> generic_make_request+0x168/0x17a
> <b013595f> mempool_alloc+0x21/0xbf <b01a0ad6> submit_bio+0xa5/0xaa
> <b015032c> bio_alloc+0x13/0x22 <b014dad0> submit_bh+0xe6/0x107
> <b014f9d9> __block_write_full_page+0x20e/0x301 <f88e17e8> ext3_get_block+0x0/0xad [ext3]
> <f88e0633> ext3_ordered_writepage+0xcb/0x137 [ext3] <f88e17e8> ext3_get_block+0x0/0xad [ext3]
> <f88def99> bget_one+0x0/0xb [ext3] <b016a1d3> mpage_writepages+0x193/0x2e9
> <f88e0568> ext3_ordered_writepage+0x0/0x137 [ext3] <b013813f> do_writepages+0x30/0x39
> <b0168988> __writeback_single_inode+0x166/0x2e2 <b01aa9d3> __next_cpu+0x11/0x20
> <b0136421> read_page_state_offset+0x33/0x41 <b0168f5e> sync_sb_inodes+0x185/0x23a
> <b01691c6> writeback_inodes+0x6e/0xbb <b0138246> balance_dirty_pages_ratelimited_nr+0xcb/0x152
> <b013429e> generic_file_buffered_write+0x47d/0x56f <f88e698a> __ext3_journal_stop+0x19/0x37 [ext3]
> <f88dfde0> ext3_dirty_inode+0x5e/0x64 [ext3] <b0168b4c> __mark_inode_dirty+0x28/0x14c
> <b01354da> __generic_file_aio_write_nolock+0x3c8/0x405 <b0211c91> sock_aio_read+0x56/0x63
> <b013573c> generic_file_aio_write+0x61/0xb3 <f88dde72> ext3_file_write+0x26/0x92 [ext3]
> <b014b99a> do_sync_write+0xc0/0xf3 <b016201a> notify_change+0x2d4/0x2e5
> <b01271dc> autoremove_wake_function+0x0/0x3a <b014bdb0> vfs_write+0xa3/0x13a
> <b014c630> sys_write+0x3b/0x64 <b010267b> syscall_call+0x7/0xb

It looks like the pdflush threads are sitting in uninterruptible sleep
waiting for a block queue to become uncongested. This seems somewhat
reasonable to me in this situation, but someone more familiar with the
block layer would likely have to comment on whether this is the expected
behavior..

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/

2006-05-06 23:03:30

by bert hubert

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

On Fri, May 05, 2006 at 10:10:19AM -0700, Jason Schoonover wrote:

> Whenever I copy any large file (over 500GB) the load average starts to slowly
> rise and after about a minute it is up to 7.5 and keeps on rising (depending
> on how long the file takes to copy). When I watch top, the processes at the
> top of the list are cp, pdflush, kjournald and kswapd.

Load average is a bit of an odd metric in this case, try looking at the
output from 'vmstat 1', and especially the 'id' column. As long as that
doesn't rise, you don't have an actual problem.

The number of processes in the runqueue doesn't really tell you anything
about how much CPU you are using.

Having said that, I think there might be a problem to be solved.

Bert

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2006-05-07 01:04:16

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

Hi Bert,

That's interesting, I didn't know that about the vmstat command, I will
definitely use that more in the future. I went ahead and did an ncftpget and
started vmstat 1 at the same time, I've attached the output here. It looks
like the 'id' column was at 100, and then as soon as I started ncftpget, it
went down to 0 the whole time.

The interesting thing was that after I did a Ctrl-C on the ncftpget, the id
column was still at 0, even though the ncftpget process was over. The id
column was at 0 and the 'wa' column was at 98, up until all of the pdflush
processes ended.

Is that the expected behavior?

Jason

-------Original Message-----
From: bert hubert
Sent: Saturday 06 May 2006 16:03
To: Jason Schoonover
Subject: Re: High load average on disk I/O on 2.6.17-rc3

On Fri, May 05, 2006 at 10:10:19AM -0700, Jason Schoonover wrote:
> Whenever I copy any large file (over 500GB) the load average starts to
> slowly rise and after about a minute it is up to 7.5 and keeps on rising
> (depending on how long the file takes to copy). When I watch top, the
> processes at the top of the list are cp, pdflush, kjournald and kswapd.

Load average is a bit of an odd metric in this case, try looking at the
output from 'vmstat 1', and especially the 'id' column. As long as that
doesn't rise, you don't have an actual problem.

The number of processes in the runqueue doesn't really tell you anything
about how much CPU you are using.

Having said that, I think there might be a problem to be solved.

Bert

Attachments:

(No filename) (1.49 kB)
vmstat-1.txt (3.39 kB)
Download all attachments

2006-05-07 10:54:20

by bert hubert

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

On Sat, May 06, 2006 at 06:02:47PM -0700, Jason Schoonover wrote:

> The interesting thing was that after I did a Ctrl-C on the ncftpget, the id
> column was still at 0, even though the ncftpget process was over. The id
> column was at 0 and the 'wa' column was at 98, up until all of the pdflush
> processes ended.
>
> Is that the expected behavior?

Yes - data is still being written out. 'wa' stands for waiting for io. As
long as 'us' and 'sy' are not 100 (together), your system ('computing
power') is not 'busy'.

The lines below are perfect:
> 0 2 40 47816 7888 1920116 0 0 0 36264 1354 56 0 1 0 99
> 0 2 40 48312 7888 1920116 0 0 0 36248 1362 52 0 1 0 99

Wether you should have 5 pdflushes running is something I have no relevant
experience about, but your system should function just fine during writeout.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2006-05-07 16:50:54

by Andrew Morton

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

On Fri, 5 May 2006 10:10:19 -0700
Jason Schoonover <[email protected]> wrote:

> I'm having some problems on the latest 2.6.17-rc3 kernel and SCSI disk I/O.
> Whenever I copy any large file (over 500GB) the load average starts to slowly
> rise and after about a minute it is up to 7.5 and keeps on rising (depending
> on how long the file takes to copy). When I watch top, the processes at the
> top of the list are cp, pdflush, kjournald and kswapd.

This is probably because the number of pdflush threads slowly grows to its
maximum. This is bogus, and we seem to have broken it sometime in the past
few releases. I need to find a few quality hours to get in there and fix
it, but they're rare :(

It's pretty harmless though. The "load average" thing just means that the
extra pdflush threads are twiddling thumbs waiting on some disk I/O -
they'll later exit and clean themselves up. They won't be consuming
significant resources.

2006-05-07 17:25:57

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

Hi Andrew,

I see, so it sounds as though the load average is not telling the true load of
the machine. However, it still feels that it's consuming quite a bit of
resources. The machine is almost unresponsive if I were to start two or
three copies at the same time.

I noticed the behavior initially when I installed vmware server: I started one
of the vm's booting and was copying another in the background (about a 12GB
directory). The booting VM would started getting slower and slower and
eventually just hung. It wasn't locked up, just seemed like it was "paused."
When I tried an "ls" in another window, it just hung. I then tried to ssh
into the server to open another window and I couldn't even get an ssh prompt.
I had to eventually Ctrl-C the copy and wait for it to be done before I could
do anything. And the load average had skyrocketed, but the consensus here is
definitely that it's not the true load average of the system.

Possibly should I revert back to an older kernel? 2.6.12 or 2.6.10 maybe? Do
you know when abouts the I/O was changed?

I can certainly help debug this issue if you (or someone else) has the time to
look into it and fix it. Otherwise I will just revert back and hope that it
will get fixed in the future.

Thanks,
Jason

-------Original Message-----
From: Andrew Morton
Sent: Sunday 07 May 2006 09:50
To: Jason Schoonover
Subject: Re: High load average on disk I/O on 2.6.17-rc3

On Fri, 5 May 2006 10:10:19 -0700

Jason Schoonover <[email protected]> wrote:
> I'm having some problems on the latest 2.6.17-rc3 kernel and SCSI disk I/O.
> Whenever I copy any large file (over 500GB) the load average starts to
> slowly rise and after about a minute it is up to 7.5 and keeps on rising
> (depending on how long the file takes to copy). When I watch top, the
> processes at the top of the list are cp, pdflush, kjournald and kswapd.

This is probably because the number of pdflush threads slowly grows to its
maximum. This is bogus, and we seem to have broken it sometime in the past
few releases. I need to find a few quality hours to get in there and fix
it, but they're rare :(

It's pretty harmless though. The "load average" thing just means that the
extra pdflush threads are twiddling thumbs waiting on some disk I/O -
they'll later exit and clean themselves up. They won't be consuming
significant resources.

2006-05-08 11:13:49

by Erik Mouw

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

On Sun, May 07, 2006 at 09:50:39AM -0700, Andrew Morton wrote:
> This is probably because the number of pdflush threads slowly grows to its
> maximum. This is bogus, and we seem to have broken it sometime in the past
> few releases. I need to find a few quality hours to get in there and fix
> it, but they're rare :(
>
> It's pretty harmless though. The "load average" thing just means that the
> extra pdflush threads are twiddling thumbs waiting on some disk I/O -
> they'll later exit and clean themselves up. They won't be consuming
> significant resources.

Not completely harmless. Some daemons (sendmail, exim) use the load
average to decide if they will allow more work. A local user could
create a mail DoS by just copying a couple of large files around.
Zeniv.linux.org.uk mail went down due to this. See
http://lkml.org/lkml/2006/3/28/70 .

Erik

--
+-- Erik Mouw -- http://www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands

2006-05-08 11:22:48

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

On Mon, 2006-05-08 at 13:13 +0200, Erik Mouw wrote:
> On Sun, May 07, 2006 at 09:50:39AM -0700, Andrew Morton wrote:
> > This is probably because the number of pdflush threads slowly grows to its
> > maximum. This is bogus, and we seem to have broken it sometime in the past
> > few releases. I need to find a few quality hours to get in there and fix
> > it, but they're rare :(
> >
> > It's pretty harmless though. The "load average" thing just means that the
> > extra pdflush threads are twiddling thumbs waiting on some disk I/O -
> > they'll later exit and clean themselves up. They won't be consuming
> > significant resources.
>
> Not completely harmless. Some daemons (sendmail, exim) use the load
> average to decide if they will allow more work.

and those need to be fixed most likely ;)

2006-05-08 11:28:43

by Russell King

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

On Mon, May 08, 2006 at 01:22:36PM +0200, Arjan van de Ven wrote:
> On Mon, 2006-05-08 at 13:13 +0200, Erik Mouw wrote:
> > On Sun, May 07, 2006 at 09:50:39AM -0700, Andrew Morton wrote:
> > > This is probably because the number of pdflush threads slowly grows to its
> > > maximum. This is bogus, and we seem to have broken it sometime in the past
> > > few releases. I need to find a few quality hours to get in there and fix
> > > it, but they're rare :(
> > >
> > > It's pretty harmless though. The "load average" thing just means that the
> > > extra pdflush threads are twiddling thumbs waiting on some disk I/O -
> > > they'll later exit and clean themselves up. They won't be consuming
> > > significant resources.
> >
> > Not completely harmless. Some daemons (sendmail, exim) use the load
> > average to decide if they will allow more work.
>
> and those need to be fixed most likely ;)

Why do you think that? exim uses the load average to work out whether
it's a good idea to spawn more copies of itself, and increase the load
on the machine.

Unfortunately though, under 2.6 kernels, the load average seems to be
a meaningless indication of how busy the system is from that point of
view.

Having a single CPU machine with a load average of 150 and still feel
very interactive at the shell is extremely counter-intuitive.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2006-05-08 11:38:22

by Avi Kivity

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

Russell King wrote:
> Why do you think that? exim uses the load average to work out whether
> it's a good idea to spawn more copies of itself, and increase the load
> on the machine.
>
> Unfortunately though, under 2.6 kernels, the load average seems to be
> a meaningless indication of how busy the system is from that point of
> view.
>
> Having a single CPU machine with a load average of 150 and still feel
> very interactive at the shell is extremely counter-intuitive.
>

It's even worse: load average used to mean the number of runnable
processes + number of processes waiting on disk or NFS I/O to complete,
a fairly bogus measure as you have noted, but with the aio interfaces
one can issue enormous amounts of I/O without it being counted in the
load average.

To make such decisions real, one needs separate counters for cpu load
and for disk load on the devices one is actually using.

--
error compiling committee.c: too many arguments to function

2006-05-08 12:37:35

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

On Mon, 2006-05-08 at 12:28 +0100, Russell King wrote:
> On Mon, May 08, 2006 at 01:22:36PM +0200, Arjan van de Ven wrote:
> > On Mon, 2006-05-08 at 13:13 +0200, Erik Mouw wrote:
> > > On Sun, May 07, 2006 at 09:50:39AM -0700, Andrew Morton wrote:
> > > > This is probably because the number of pdflush threads slowly grows to its
> > > > maximum. This is bogus, and we seem to have broken it sometime in the past
> > > > few releases. I need to find a few quality hours to get in there and fix
> > > > it, but they're rare :(
> > > >
> > > > It's pretty harmless though. The "load average" thing just means that the
> > > > extra pdflush threads are twiddling thumbs waiting on some disk I/O -
> > > > they'll later exit and clean themselves up. They won't be consuming
> > > > significant resources.
> > >
> > > Not completely harmless. Some daemons (sendmail, exim) use the load
> > > average to decide if they will allow more work.
> >
> > and those need to be fixed most likely ;)
>
> Why do you think that?

I think that because the characteristics of modern hardware don't make
"load" a good estimator for finding out if the hardware can take more
jobs.

To explain why I'm thinking this I first need to do an ascii art graph

100
% | ************************|
b | ****
u | ***
s | **
y | *
| *
| *
|*
+------------------------------------------
-> workload

on the Y axis is the percentage in use, on the horizontal axis the
amount of work that is done. (in the mail case, say emails per second).
Modern hardware has an initial ramp-up which is near linear in terms of
workload/use, but then a saturation area is reached at 100% where, even
though a system is 100% busy, more work can be added, upto a certain
point that I showed with a "|". This is due to the behavior of increased
batching that you get at higher utilization compared to the behavior at
lower utilizations. Both cpu caches, memory burst speeds-vs-latency, but
also disk streaming performance vs random seeks... all those will create
and increase this saturation space to the right. And all of those have
been increasing in the hardware the last 4+ years, with the result that
the saturation "reach" has increased to the right as well, by far.

How does this tie into "load" and using load for what exim/sendmail use
it for? Well.... Today "load" is a somewhat poor approximation of this
percentage-in-use[1], but... as per the graph and argument above, even
if it was a perfect representation of that, it still would not be a good
measure to determine if a system can do more work (per time unit) or
not.

[1] I didn't discuss the use of *what*; in reality that is a combination
of cpu, memory, disk and possibly network resources. Load tries to
combine cpu and disk into one number via a sheer addition; that's an
obviously rough estimate and I'm not arguing that it's not rough.

> Having a single CPU machine with a load average of 150 and still feel
> very interactive at the shell is extremely counter-intuitive.

Well it's also a sign that the cpu scheduler is prioritizing your shell
over the background "menial" work ;)

2006-05-08 14:24:17

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

> It's pretty harmless though. The "load average" thing just means that the
> extra pdflush threads are twiddling thumbs waiting on some disk I/O -
> they'll later exit and clean themselves up. They won't be consuming
> significant resources.

If they're waiting on disk I/O, they shouldn't be runnable, and thus
should not be counted as part of the load average, surely?

M.

2006-05-08 14:55:58

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

On Mon, 2006-05-08 at 07:24 -0700, Martin J. Bligh wrote:
> > It's pretty harmless though. The "load average" thing just means that the
> > extra pdflush threads are twiddling thumbs waiting on some disk I/O -
> > they'll later exit and clean themselves up. They won't be consuming
> > significant resources.
>
> If they're waiting on disk I/O, they shouldn't be runnable, and thus
> should not be counted as part of the load average, surely?

yes they are, since at least a decade. "load average" != "cpu
utilisation" by any means. It's "tasks waiting for a hardware resource
to become available". CPU is one such resource (runnable) but disk is
another. There are more ...

think of load as "if I bought faster hardware this would improve"

2006-05-08 15:22:58

by Erik Mouw

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

On Mon, May 08, 2006 at 04:55:48PM +0200, Arjan van de Ven wrote:
> On Mon, 2006-05-08 at 07:24 -0700, Martin J. Bligh wrote:
> > > It's pretty harmless though. The "load average" thing just means that the
> > > extra pdflush threads are twiddling thumbs waiting on some disk I/O -
> > > they'll later exit and clean themselves up. They won't be consuming
> > > significant resources.
> >
> > If they're waiting on disk I/O, they shouldn't be runnable, and thus
> > should not be counted as part of the load average, surely?
>
> yes they are, since at least a decade. "load average" != "cpu
> utilisation" by any means. It's "tasks waiting for a hardware resource
> to become available". CPU is one such resource (runnable) but disk is
> another. There are more ...

... except that any kernel < 2.6 didn't account tasks waiting for disk
IO. Load average has always been somewhat related to tasks contending
for CPU power. It's easy to say "shrug, it changed, live with it", but
at least give applications that want to be nice to the system a way to
figure out the real cpu load.

Erik

--
+-- Erik Mouw -- http://www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands

2006-05-08 15:25:51

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

Erik Mouw wrote:
> On Mon, May 08, 2006 at 04:55:48PM +0200, Arjan van de Ven wrote:
>
>>On Mon, 2006-05-08 at 07:24 -0700, Martin J. Bligh wrote:
>>
>>>>It's pretty harmless though. The "load average" thing just means that the
>>>>extra pdflush threads are twiddling thumbs waiting on some disk I/O -
>>>>they'll later exit and clean themselves up. They won't be consuming
>>>>significant resources.
>>>
>>>If they're waiting on disk I/O, they shouldn't be runnable, and thus
>>>should not be counted as part of the load average, surely?
>>
>>yes they are, since at least a decade. "load average" != "cpu
>>utilisation" by any means. It's "tasks waiting for a hardware resource
>>to become available". CPU is one such resource (runnable) but disk is
>>another. There are more ...
>
>
> ... except that any kernel < 2.6 didn't account tasks waiting for disk
> IO. Load average has always been somewhat related to tasks contending
> for CPU power. It's easy to say "shrug, it changed, live with it", but
> at least give applications that want to be nice to the system a way to
> figure out the real cpu load.

I had a patch to create a real, per-cpu load average. I guess I'll dig
it out again, since it was also extremely useful for diagnosing
scheduler issues.

Maybe I'm confused about what the loadavg figure in Linux was in 2.6,
I'll go read the code again. Not sure it's very useful to provide only
a combined figure of all waiting tasks without separated versions as
well, really.

2006-05-08 15:31:36

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

On Mon, 2006-05-08 at 17:22 +0200, Erik Mouw wrote:
> On Mon, May 08, 2006 at 04:55:48PM +0200, Arjan van de Ven wrote:
> > On Mon, 2006-05-08 at 07:24 -0700, Martin J. Bligh wrote:
> > > > It's pretty harmless though. The "load average" thing just means that the
> > > > extra pdflush threads are twiddling thumbs waiting on some disk I/O -
> > > > they'll later exit and clean themselves up. They won't be consuming
> > > > significant resources.
> > >
> > > If they're waiting on disk I/O, they shouldn't be runnable, and thus
> > > should not be counted as part of the load average, surely?
> >
> > yes they are, since at least a decade. "load average" != "cpu
> > utilisation" by any means. It's "tasks waiting for a hardware resource
> > to become available". CPU is one such resource (runnable) but disk is
> > another. There are more ...
>
> ... except that any kernel < 2.6 didn't account tasks waiting for disk
> IO.

they did. It was "D" state, which counted into load average.

2006-05-08 15:42:20

by Erik Mouw

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

On Mon, May 08, 2006 at 05:31:29PM +0200, Arjan van de Ven wrote:
> On Mon, 2006-05-08 at 17:22 +0200, Erik Mouw wrote:
> > ... except that any kernel < 2.6 didn't account tasks waiting for disk
> > IO.
>
> they did. It was "D" state, which counted into load average.

They did not or at least to a much lesser extent. That's the reason why
ZenIV.linux.org.uk had a mail DoS during the last FC release and why we
see load average questions on lkml.

I've seen it on our servers as well: when using 2.4 and doing 50 MB/s
to disk (through NFS), the load just was slightly above 0. When we
switched the servers to 2.6 it went to ~16 for the same disk usage.

Erik

--
+-- Erik Mouw -- http://www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands

2006-05-08 16:02:14

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

Erik Mouw wrote:
> On Mon, May 08, 2006 at 05:31:29PM +0200, Arjan van de Ven wrote:
>
>>On Mon, 2006-05-08 at 17:22 +0200, Erik Mouw wrote:
>>
>>>... except that any kernel < 2.6 didn't account tasks waiting for disk
>>>IO.
>>
>>they did. It was "D" state, which counted into load average.
>
>
> They did not or at least to a much lesser extent. That's the reason why
> ZenIV.linux.org.uk had a mail DoS during the last FC release and why we
> see load average questions on lkml.
>
> I've seen it on our servers as well: when using 2.4 and doing 50 MB/s
> to disk (through NFS), the load just was slightly above 0. When we
> switched the servers to 2.6 it went to ~16 for the same disk usage.

Looks like both count it, or something stranger is going on.

2.6.16:

static unsigned long count_active_tasks(void)
{
return (nr_running() + nr_uninterruptible()) * FIXED_1;
}

2.4.0:

static unsigned long count_active_tasks(void)
{
struct task_struct *p;
unsigned long nr = 0;

read_lock(&tasklist_lock);
for_each_task(p) {
if ((p->state == TASK_RUNNING ||
(p->state & TASK_UNINTERRUPTIBLE)))
nr += FIXED_1;
}
read_unlock(&tasklist_lock);
return nr;
}

2006-05-08 16:02:48

by Miquel van Smoorenburg

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

In article <[email protected]>,
Erik Mouw <[email protected]> wrote:
>On Mon, May 08, 2006 at 05:31:29PM +0200, Arjan van de Ven wrote:
>> On Mon, 2006-05-08 at 17:22 +0200, Erik Mouw wrote:
>> > ... except that any kernel < 2.6 didn't account tasks waiting for disk
>> > IO.
>>
>> they did. It was "D" state, which counted into load average.
>
>They did not or at least to a much lesser extent.

I just looked at the 2.4.9 (random 2.4 kernel) source code, and
kernel/timer.c::count_active_tasks(), which is what calculates the
load average, uses the same algorithm as in 2.6.16

Mike.

2006-05-08 16:47:19

by Russell King

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

On Mon, May 08, 2006 at 05:42:18PM +0200, Erik Mouw wrote:
> On Mon, May 08, 2006 at 05:31:29PM +0200, Arjan van de Ven wrote:
> > On Mon, 2006-05-08 at 17:22 +0200, Erik Mouw wrote:
> > > ... except that any kernel < 2.6 didn't account tasks waiting for disk
> > > IO.
> >
> > they did. It was "D" state, which counted into load average.
>
> They did not or at least to a much lesser extent. That's the reason why
> ZenIV.linux.org.uk had a mail DoS during the last FC release and why we
> see load average questions on lkml.
>
> I've seen it on our servers as well: when using 2.4 and doing 50 MB/s
> to disk (through NFS), the load just was slightly above 0. When we
> switched the servers to 2.6 it went to ~16 for the same disk usage.

It's actually rather interesting to look at 2.6 and load averages.

The load average appears to depend on the type of load, rather than
the real load. Let's look at three different cases:

1. while (1) { } loop in a C program.

Starting off with a load average of 0.00 with a "watch uptime" running
(and leaving that running for several minutes), then starting such a
program, and letting it run for one minute.

Result: load average at the end: 0.60

2. a program which runs continuously for 6 seconds and then sleeps for
54 seconds.

Result: load average peaks at 0.12, drops to 0.05 just before it
runs for a second 6 second.

ps initially reports 100% CPU usage, drops to 10% after one minute,
rises to 18% and drops back towards 10%, and gradually settles on
10%.

3. a program which runs for 1 second and then sleeps for 9 seconds.

Result: load average peaks at 0.22, drops to 0.15 just before it
runs for the next second.

ps reports 10% CPU usage.

Okay, so, two different CPU work loads without any other IO, using the
same total amount of CPU time every minute seem to produce two very
different load averages. In addition, using 100% CPU for one minute
does not produce a load average of 1.

Seems to me that somethings wrong somewhere. Either that or the first
load average number no longer represents the load over the past one
minute.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2006-05-08 17:04:27

by Gabor Gombas

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

On Mon, May 08, 2006 at 05:47:05PM +0100, Russell King wrote:

> Seems to me that somethings wrong somewhere. Either that or the first
> load average number no longer represents the load over the past one
> minute.

... or just the load average statistics and the CPU usage statistics are
computed using different algorithms and they thus estimate different
things. At least the sampling frequency is surely different.

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------

2006-05-08 17:17:56

by Mike Galbraith

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

On Mon, 2006-05-08 at 17:42 +0200, Erik Mouw wrote:
> On Mon, May 08, 2006 at 05:31:29PM +0200, Arjan van de Ven wrote:
> > On Mon, 2006-05-08 at 17:22 +0200, Erik Mouw wrote:
> > > ... except that any kernel < 2.6 didn't account tasks waiting for disk
> > > IO.
> >
> > they did. It was "D" state, which counted into load average.
>
> They did not or at least to a much lesser extent. That's the reason why
> ZenIV.linux.org.uk had a mail DoS during the last FC release and why we
> see load average questions on lkml.

I distinctly recall it counting, but since I don't have a 2.4 tree
handy, I'll refrain from saying "did _too_" ;-)

> I've seen it on our servers as well: when using 2.4 and doing 50 MB/s
> to disk (through NFS), the load just was slightly above 0. When we
> switched the servers to 2.6 it went to ~16 for the same disk usage.

The main difference I see is...

8129 root 15 0 3500 512 432 D 56.0 0.0 0:33.72 bonnie
1393 root 10 -5 0 0 0 D 0.4 0.0 0:00.26 kjournald
8135 root 15 0 0 0 0 D 0.0 0.0 0:00.01 pdflush
573 root 15 0 0 0 0 D 0.0 0.0 0:00.00 pdflush
574 root 15 0 0 0 0 D 0.0 0.0 0:00.04 pdflush
8131 root 15 0 0 0 0 D 0.0 0.0 0:00.01 pdflush
8141 root 15 0 0 0 0 D 0.0 0.0 0:00.00 pdflush

With 2.4, there was only one flush thread. Same load, different
loadavg... in this particular case of one user task running. IIRC, if
you had a bunch of things running and running you low on memory, you
could end up with a slew of 'D' state tasks in 2.4 as well, because
allocating tasks had to help free memory by flushing buffers and pushing
swap. Six to one, half a dozen to the other.

-Mike

2006-05-08 22:24:19

by be-news06

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

Erik Mouw <[email protected]> wrote:
> ... except that any kernel < 2.6 didn't account tasks waiting for disk
> IO. Load average has always been somewhat related to tasks contending
> for CPU power.

Actually all Linux kernels accounted for diskwaits and others like BSD based
not. It is a very old linux oddness.

Gruss
Bernd

2006-05-08 22:39:36

by Lee Revell

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

On Tue, 2006-05-09 at 00:24 +0200, Bernd Eckenfels wrote:
> Erik Mouw <[email protected]> wrote:
> > ... except that any kernel < 2.6 didn't account tasks waiting for disk
> > IO. Load average has always been somewhat related to tasks contending
> > for CPU power.
>
> Actually all Linux kernels accounted for diskwaits and others like BSD based
> not. It is a very old linux oddness.

Maybe I am misunderstanding, but IIRC BSD/OS also counted processes
waiting on IO towards the load average.

Lee

2006-05-09 00:08:12

by Peter Williams

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

Bernd Eckenfels wrote:
> Erik Mouw <[email protected]> wrote:
>> ... except that any kernel < 2.6 didn't account tasks waiting for disk
>> IO. Load average has always been somewhat related to tasks contending
>> for CPU power.
>
> Actually all Linux kernels accounted for diskwaits and others like BSD based
> not. It is a very old linux oddness.

Personally, I see both types of load estimates (i.e. CPU only and CPU
plus IO wait) as useful. Why can't we have both? The cost would be
minimal.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-05-09 01:57:45

by Nick Piggin

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

Arjan van de Ven wrote:

>>... except that any kernel < 2.6 didn't account tasks waiting for disk
>>IO.
>>
>
>they did. It was "D" state, which counted into load average.
>

Perhaps kernel threads in D state should not contribute toward load avg.

Userspace does not care whether there are 2 or 20 pdflush threads waiting
for IO. However, when the network/disks can no longer keep up, userspace
processes will end up going to sleep in writeback or reclaim -- *that* is
when we start feeling the load.

--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-09 02:02:25

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

Nick Piggin wrote:
> Arjan van de Ven wrote:
>
>>> ... except that any kernel < 2.6 didn't account tasks waiting for disk
>>> IO.
>>>
>>
>> they did. It was "D" state, which counted into load average.
>>
>
> Perhaps kernel threads in D state should not contribute toward load avg.
>
> Userspace does not care whether there are 2 or 20 pdflush threads waiting
> for IO. However, when the network/disks can no longer keep up, userspace
> processes will end up going to sleep in writeback or reclaim -- *that* is
> when we start feeling the load.

Personally I'd be far happier having separated counters for both. Then
we can see what the real bottleneck is. Whilst we're at it, on a per-cpu
and per-elevator-queue basis ;-)

M.

2006-05-09 02:16:44

by Nick Piggin

[permalink] [raw]

Subject: Re: High load average on disk I/O on 2.6.17-rc3

Martin Bligh wrote:

> Nick Piggin wrote:
>
>>
>> Perhaps kernel threads in D state should not contribute toward load avg.
>>
>> Userspace does not care whether there are 2 or 20 pdflush threads
>> waiting
>> for IO. However, when the network/disks can no longer keep up, userspace
>> processes will end up going to sleep in writeback or reclaim --
>> *that* is
>> when we start feeling the load.
>
>
> Personally I'd be far happier having separated counters for both.

Well so long as userspace never blocks, blocked kernel threads aren't a
bottleneck (OK, perhaps things like nfsd are an exception, but kernel
threads doing asynch work on behalf of userspace, like pdflush or kswapd).
It is something simple we can do today that might decouple the kernel
implementation (eg. of pdflush) from the load average reporting.

> Then
> we can see what the real bottleneck is. Whilst we're at it, on a per-cpu
> and per-elevator-queue basis ;-)

Might be helpful, yes. At least separate counters for CPU and IO... but
that doesn't mean the global loadavg is going away.

--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-05-09 04:36:48