2007-02-12 05:40:14

by malc

Subject: CPU load

Hello,

How does the kernel calculate the value it places in `/proc/stat' at
the 4th position (i.e. "idle: twiddling thumbs")?
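
(For reference: userland tools compute percentages from deltas of
those counters, roughly as in the sketch below. This is mine, not the
actual top source; the first-line layout `cpu user nice system idle
...' beyond the first four columns is an assumption.)

/* gcc -o statidle statidle.c */
#include <stdio.h>
#include <unistd.h>

static int read_cpu (unsigned long long *idle, unsigned long long *total)
{
        /* first line of /proc/stat: cpu user nice system idle [more...] */
        unsigned long long v[8] = { 0 };
        int i, n;
        FILE *f = fopen ("/proc/stat", "r");

        if (!f) return -1;
        n = fscanf (f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
                    &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
        fclose (f);
        if (n < 4) return -1;
        *idle = v[3];               /* the 4th position: twiddling thumbs */
        for (*total = 0, i = 0; i < 8; ++i) *total += v[i];
        return 0;
}

int main (void)
{
        unsigned long long i0, t0, i1, t1;

        if (read_cpu (&i0, &t0)) return 1;
        sleep (1);
        if (read_cpu (&i1, &t1)) return 1;
        /* idle delta over total delta for the sampling period */
        printf ("idle: %.2f%%\n", 100.0 * (i1 - i0) / (t1 - t0));
        return 0;
}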

For background information as to why this question arose in the first
place read on.

While writing code dealing with video acquisition/processing at work,
I noticed that top(1) (and every other tool that uses `/proc/stat' or
`/proc/uptime') shows some very strange results.

Top claimed that the system running one version of the code[A] was
idling more often than when running code[B], which did the same thing,
but more cleverly. After some head scratching one of my colleagues
suggested a simple test that was implemented in a few minutes.

The test consisted of a counter incremented in an endless loop; after
a certain period of time had elapsed, it printed the value of the
counter. Running this test (with priority set to the lowest possible
level) alongside code[A] and code[B] confirmed that code[B] is indeed
faster than code[A], in the sense that the test made more forward
progress while code[B] was running.

By hard-coding some things (i.e. the value the counter reaches during
one period on a completely idle system) we extended the test to show
the percentage of CPU that was utilized. This never matched the value
that top presented us with.
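
(A reconstructed sketch of that test - not the actual code; IDLE_COUNT
stands for the hard-coded value the counter reaches during one period
on a completely idle system, which is machine-specific:)

/* gcc -o loadtest loadtest.c */
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>

#define IDLE_COUNT 100000000UL  /* assumed calibration value */

static volatile sig_atomic_t tick;

static void sighandler (int signr)
{
        (void) signr;
        tick = 1;
}

int main (void)
{
        /* one second measurement period */
        struct itimerval it = { { 1, 0 }, { 1, 0 } };
        unsigned long count;

        nice (19);              /* lowest priority, as described above */
        signal (SIGALRM, &sighandler);
        setitimer (ITIMER_REAL, &it, NULL);

        for (;;) {
                count = 0;
                tick = 0;
                while (!tick) ++count;  /* burn cycles until the period ends */
                printf ("load: %.1f%%\n",
                        100.0 * (1.0 - (double) count / IDLE_COUNT));
        }
        return 0;
}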

Later a small kernel module was developed that tried to measure how
much time is spent in the idle handler inside the kernel and exported
this information to user-space. The results were consistent with our
expectations and with the output of the test utility.
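
(Very roughly, the idea was something like the sketch below. This is
not the actual module; it assumes an i386-style pm_idle hook, omits
the export of idle_ns to user-space, and glosses over the rmmod races
discussed later in this thread:)

/* timed_idle.c - sketch only */
#include <linux/module.h>
#include <linux/pm.h>
#include <linux/hrtimer.h>

static void (*orig_idle) (void);
static u64 idle_ns;             /* wall time spent inside the idle handler */

static void timed_idle (void)
{
        ktime_t t0 = ktime_get ();

        if (orig_idle)
                orig_idle ();   /* real code needs a default_idle fallback */
        idle_ns += ktime_to_ns (ktime_sub (ktime_get (), t0));
}

static int __init timed_idle_init (void)
{
        orig_idle = pm_idle;
        pm_idle = timed_idle;
        return 0;
}

static void __exit timed_idle_exit (void)
{
        pm_idle = orig_idle;    /* unsafe if a CPU is inside timed_idle */
}

module_init (timed_idle_init);
module_exit (timed_idle_exit);
MODULE_LICENSE ("GPL");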

Two more points.

a. In the past (again in a video processing context) I have witnessed
`/proc/stat' claiming that CPU utilization was 0% for, say, 20
seconds, followed by 5 seconds of 30% load, and then the cycle
repeated. According to the methods outlined above the load was
steady at 30%.

b. In my personal experience the difference between `/proc/stat' and
"reality" can easily reach 40% (I think I saw even more than that).

The module, and a graphical application that uses it, along with a
short README and a link to a Usenet article dealing with the same
subject, are available at:
http://www.boblycat.org/~malc/apc

Thanks



2007-02-12 05:44:24

by Con Kolivas

Subject: Re: CPU load

On 12/02/07, Vassili Karpov <[email protected]> wrote:
> Hello,
>
> How does the kernel calculate the value it places in `/proc/stat' at
> the 4th position (i.e. "idle: twiddling thumbs")?
>
> For background information as to why this question arose in the first
> place read on.
>
> While writing code dealing with video acquisition/processing at work,
> I noticed that top(1) (and every other tool that uses `/proc/stat' or
> `/proc/uptime') shows some very strange results.
>
> Top claimed that the system running one version of the code[A] was
> idling more often than when running code[B], which did the same thing,
> but more cleverly. After some head scratching one of my colleagues
> suggested a simple test that was implemented in a few minutes.
>
> The test consisted of a counter incremented in an endless loop; after
> a certain period of time had elapsed, it printed the value of the
> counter. Running this test (with priority set to the lowest possible
> level) alongside code[A] and code[B] confirmed that code[B] is indeed
> faster than code[A], in the sense that the test made more forward
> progress while code[B] was running.
>
> By hard-coding some things (i.e. the value the counter reaches during
> one period on a completely idle system) we extended the test to show
> the percentage of CPU that was utilized. This never matched the value
> that top presented us with.
>
> Later a small kernel module was developed that tried to measure how
> much time is spent in the idle handler inside the kernel and exported
> this information to user-space. The results were consistent with our
> expectations and with the output of the test utility.
>
> Two more points.
>
> a. In the past (again in a video processing context) I have witnessed
> `/proc/stat' claiming that CPU utilization was 0% for, say, 20
> seconds, followed by 5 seconds of 30% load, and then the cycle
> repeated. According to the methods outlined above the load was
> steady at 30%.
>
> b. In my personal experience the difference between `/proc/stat' and
> "reality" can easily reach 40% (I think I saw even more than that).
>
> The module, and a graphical application that uses it, along with a
> short README and a link to a Usenet article dealing with the same
> subject, are available at:
> http://www.boblycat.org/~malc/apc

The kernel looks at what is using cpu _only_ during the timer
interrupt. Which means if your HZ is 1000 it looks at what is running
at precisely the moment those 1000 timer ticks occur. It is
theoretically possible using this measurement system to use >99% cpu
and record 0 usage if you time your cpu usage properly. It gets even
more inaccurate at lower HZ values for the same reason.
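
(In other words, the per-tick accounting charges the whole tick to
whatever happens to be running at the instant the tick fires. A toy
model of mine, not the kernel code, to make the failure mode concrete:)

/* toy model of tick-based accounting - not kernel code */
#include <stdio.h>

enum who { IDLE, USER, SYSTEM };

static long stat[3];

/* charge one whole tick to whatever runs at the sampling instant */
static void account_tick (enum who running)
{
        stat[running]++;
        /* work done and finished *between* two ticks is never seen */
}

int main (void)
{
        int t;

        /* a task that always sleeps just before each tick: every sample
           sees the idle loop even though the cpu was busy in between */
        for (t = 0; t < 100; ++t)
                account_tick (IDLE);
        printf ("user %ld system %ld idle %ld -> 100%% idle reported\n",
                stat[USER], stat[SYSTEM], stat[IDLE]);
        return 0;
}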

--
-ck

2007-02-12 05:55:48

by Stephen Rothwell

Subject: Re: CPU load

On Mon, 12 Feb 2007 16:44:22 +1100 "Con Kolivas" <[email protected]> wrote:
>
> The kernel looks at what is using cpu _only_ during the timer
> interrupt. Which means if your HZ is 1000 it looks at what is running
> at precisely the moment those 1000 timer ticks occur. It is
> theoretically possible using this measurement system to use >99% cpu
> and record 0 usage if you time your cpu usage properly. It gets even
> more inaccurate at lower HZ values for the same reason.

That is not true on all architectures; some do more accurate accounting by
recording the times at user/kernel/interrupt transitions ...
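
(A toy model of that transition-based scheme, for contrast - a
simplification of mine, not any particular architecture's code:
timestamp every transition and charge the elapsed interval to the
state being left, so work between ticks is accounted exactly:)

/* toy model of transition-based accounting - not kernel code */
#include <stdio.h>

enum state { USER, KERNEL, IDLE };

static unsigned long long acct[3];      /* ns charged to each state */
static unsigned long long last_stamp;
static enum state cur = USER;

/* charge the interval since the last transition to the outgoing state */
static void transition (enum state next, unsigned long long now_ns)
{
        acct[cur] += now_ns - last_stamp;
        last_stamp = now_ns;
        cur = next;
}

int main (void)
{
        unsigned long long now = 0;
        int i;

        /* 900us of user work then 100us idle, every millisecond: the
           split comes out exactly 90/10 no matter where ticks land */
        for (i = 0; i < 1000; ++i) {
                now += 900000; transition (IDLE, now);  /* leaving USER */
                now += 100000; transition (USER, now);  /* leaving IDLE */
        }
        printf ("user %.1f%% idle %.1f%%\n",
                100.0 * acct[USER] / now, 100.0 * acct[IDLE] / now);
        return 0;
}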

--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/



2007-02-12 06:08:48

by Con Kolivas

Subject: Re: CPU load

On Monday 12 February 2007 16:55, Stephen Rothwell wrote:
> On Mon, 12 Feb 2007 16:44:22 +1100 "Con Kolivas" <[email protected]> wrote:
> > The kernel looks at what is using cpu _only_ during the timer
> > interrupt. Which means if your HZ is 1000 it looks at what is running
> > at precisely the moment those 1000 timer ticks occur. It is
> > theoretically possible using this measurement system to use >99% cpu
> > and record 0 usage if you time your cpu usage properly. It gets even
> > more inaccurate at lower HZ values for the same reason.
>
> That is not true on all architectures; some do more accurate accounting by
> recording the times at user/kernel/interrupt transitions ...

Indeed. It's certainly the way the common, more boring pc architectures do it
though.

--
-ck

2007-02-12 06:13:28

by Con Kolivas

Subject: Re: CPU load

On Monday 12 February 2007 16:54, malc wrote:
> On Mon, 12 Feb 2007, Con Kolivas wrote:
> > On 12/02/07, Vassili Karpov <[email protected]> wrote:
>
> [..snip..]
>
> > The kernel looks at what is using cpu _only_ during the timer
> > interrupt. Which means if your HZ is 1000 it looks at what is running
> > at precisely the moment those 1000 timer ticks occur. It is
> > theoretically possible using this measurement system to use >99% cpu
> > and record 0 usage if you time your cpu usage properly. It gets even
> > more inaccurate at lower HZ values for the same reason.
>
> Thank you very much. This somewhat contradicts what I saw (and outlined
> in the Usenet article), namely the mplayer+/dev/rtc case. Unless of course
> the /dev/rtc interrupt is considered to be the same as the interrupt from
> the PIT (on x86, that is).
>
> P.S. Perhaps it's worth documenting this in the documentation? It caused
> me, and perhaps quite a few other people, a great deal of pain and
> frustration.

Lots of confusion comes from this, and often people think their pc suddenly
uses a lot less cpu when they change from 1000HZ to 100HZ and use this as an
argument/reason for changing to 100HZ when in fact the massive _reported_
difference is simply worse accounting. Of course there is more overhead going
from 100 to 1000 but it doesn't suddenly make your apps use 10 times more
cpu.

--
-ck

2007-02-12 06:54:06

by malc

Subject: Re: CPU load

On Mon, 12 Feb 2007, Con Kolivas wrote:

> On 12/02/07, Vassili Karpov <[email protected]> wrote:

[..snip..]

> The kernel looks at what is using cpu _only_ during the timer
> interrupt. Which means if your HZ is 1000 it looks at what is running
> at precisely the moment those 1000 timer ticks occur. It is
> theoretically possible using this measurement system to use >99% cpu
> and record 0 usage if you time your cpu usage properly. It gets even
> more inaccurate at lower HZ values for the same reason.

Thank you very much. This somewhat contradicts what I saw (and outlined
in the Usenet article), namely the mplayer+/dev/rtc case. Unless of course
the /dev/rtc interrupt is considered to be the same as the interrupt from
the PIT (on x86, that is).
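
(For context, the mplayer-style /dev/rtc timing loop looks roughly
like the sketch below - mine, with minimal error handling; the 1024 Hz
rate needs root or a raised /proc/sys/dev/rtc/max-user-freq:)

/* gcc -o rtcwait rtcwait.c */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/rtc.h>

int main (void)
{
        unsigned long data;
        int i, fd = open ("/dev/rtc", O_RDONLY);

        if (fd < 0) { perror ("/dev/rtc"); return 1; }
        /* ask for 1024 periodic interrupts per second and enable them */
        if (ioctl (fd, RTC_IRQP_SET, 1024) < 0
            || ioctl (fd, RTC_PIE_ON, 0) < 0) {
                perror ("rtc ioctl");
                return 1;
        }
        for (i = 0; i < 1024; ++i) {
                /* block until the next RTC interrupt */
                if (read (fd, &data, sizeof data) < 0) break;
                /* ... do the next chunk of timed work here ... */
        }
        ioctl (fd, RTC_PIE_OFF, 0);
        close (fd);
        return 0;
}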

P.S. Perhaps it's worth documenting this in the documentation? It caused
me, and perhaps quite a few other people, a great deal of pain and
frustration.

--
vale

2007-02-12 07:10:04

by malc

Subject: Re: CPU load

On Mon, 12 Feb 2007, Con Kolivas wrote:

> On Monday 12 February 2007 16:54, malc wrote:
>> On Mon, 12 Feb 2007, Con Kolivas wrote:
>>> On 12/02/07, Vassili Karpov <[email protected]> wrote:
>>
>> [..snip..]
>>
>>> The kernel looks at what is using cpu _only_ during the timer
>>> interrupt. Which means if your HZ is 1000 it looks at what is running
>>> at precisely the moment those 1000 timer ticks occur. It is
>>> theoretically possible using this measurement system to use >99% cpu
>>> and record 0 usage if you time your cpu usage properly. It gets even
>>> more inaccurate at lower HZ values for the same reason.
>>
>> Thank you very much. This somewhat contradicts what I saw (and outlined
>> in the Usenet article), namely the mplayer+/dev/rtc case. Unless of course
>> the /dev/rtc interrupt is considered to be the same as the interrupt from
>> the PIT (on x86, that is).
>>
>> P.S. Perhaps it's worth documenting this in the documentation? It caused
>> me, and perhaps quite a few other people, a great deal of pain and
>> frustration.
>
> Lots of confusion comes from this, and often people think their pc suddenly
> uses a lot less cpu when they change from 1000HZ to 100HZ and use this as an
> argument/reason for changing to 100HZ when in fact the massive _reported_
> difference is simply worse accounting. Of course there is more overhead going
> from 100 to 1000 but it doesn't suddenly make your apps use 10 times more
> cpu.

Yep. This, I believe, is what made the mplayer developers incorrectly
conclude that utilizing the RTC suddenly made the code run slower; after
all, /proc/stat now claims that CPU load is higher, while in reality it
stayed the same - it's the accuracy that has improved (somewhat).

But back to the original question: does it look at what's running on the
timer interrupt only, or on any IRQ? (The latter is more in line with my
own observations.)

--
vale

2007-02-12 07:30:27

by Con Kolivas

Subject: Re: CPU load

On Monday 12 February 2007 18:10, malc wrote:
> On Mon, 12 Feb 2007, Con Kolivas wrote:
> > Lots of confusion comes from this, and often people think their pc
> > suddenly uses a lot less cpu when they change from 1000HZ to 100HZ and
> > use this as an argument/reason for changing to 100HZ when in fact the
> > massive _reported_ difference is simply worse accounting. Of course there
> > is more overhead going from 100 to 1000 but it doesn't suddenly make your
> > apps use 10 times more cpu.
>
> Yep. This, I believe, is what made the mplayer developers incorrectly
> conclude that utilizing the RTC suddenly made the code run slower; after
> all, /proc/stat now claims that CPU load is higher, while in reality it
> stayed the same - it's the accuracy that has improved (somewhat).
>
> But back to the original question: does it look at what's running on the
> timer interrupt only, or on any IRQ? (The latter is more in line with my
> own observations.)

During the timer interrupt only. However, if you create any form of timer,
it will of course have some periodicity relationship with the timer
interrupt.

--
-ck

2007-02-12 16:59:06

by Andrew Burgess

Subject: Re: CPU load

On 12/02/07, Vassili Karpov <[email protected]> wrote:
>
> How does the kernel calculate the value it places in `/proc/stat' at
> the 4th position (i.e. "idle: twiddling thumbs")?
>
..
>
> Later a small kernel module was developed that tried to measure how
> much time is spent in the idle handler inside the kernel and exported
> this information to user-space. The results were consistent with our
> expectations and with the output of the test utility.
..
> http://www.boblycat.org/~malc/apc

Vassili

Could you rewrite this code as a kernel patch for
discussion/inclusion in mainline? I and maybe others would
appreciate having idle statistics be more accurate.

Thanks for your work
Andrew

2007-02-12 18:05:24

by malc

Subject: Re: CPU load

On Mon, 12 Feb 2007, Con Kolivas wrote:

> On 12/02/07, Vassili Karpov <[email protected]> wrote:
>> Hello,

[..snip..]

>
> The kernel looks at what is using cpu _only_ during the timer
> interrupt. Which means if your HZ is 1000 it looks at what is running
> at precisely the moment those 1000 timer ticks occur. It is
> theoretically possible using this measurement system to use >99% cpu
> and record 0 usage if you time your cpu usage properly. It gets even
> more inaccurate at lower HZ values for the same reason.

And indeed it appears to be possible to do just that. Example:

/* gcc -o hog smallhog.c */
#include <time.h>
#include <limits.h>
#include <signal.h>
#include <sys/time.h>

#define HIST 10

static volatile sig_atomic_t stop;      /* volatile: set from a handler */

static void sighandler (int signr)
{
        (void) signr;
        stop = 1;
}

/* count down until SIGALRM arrives or niters iterations pass */
static unsigned long hog (unsigned long niters)
{
        stop = 0;
        while (!stop && --niters);
        return niters;
}

int main (void)
{
        int i;
        struct itimerval it = { .it_interval = { .tv_sec = 0, .tv_usec = 1 },
                                .it_value = { .tv_sec = 0, .tv_usec = 1 } };
        sigset_t set;
        unsigned long v[HIST];
        double tmp = 0.0;
        unsigned long n;

        signal (SIGALRM, &sighandler);
        setitimer (ITIMER_REAL, &it, NULL);

        hog (ULONG_MAX);        /* warm up: sync to the next tick */
        /* calibrate: how far do we count between two timer interrupts? */
        for (i = 0; i < HIST; ++i) v[i] = ULONG_MAX - hog (ULONG_MAX);
        for (i = 0; i < HIST; ++i) tmp += v[i];
        tmp /= HIST;
        /* burn roughly two thirds of each interval, then sleep across
           the moment the tick (and thus the sampling) happens */
        n = tmp - (tmp / 3.0);

        sigemptyset (&set);
        sigaddset (&set, SIGALRM);

        for (;;) {
                hog (n);
                sigwait (&set, &i);
        }
        return 0;
}
/* end smallhog.c */

It might need some adjustment for a particular system, but it ran just
fine here on:
2.4.30 + Athlon tbird (1GHz)
2.6.19.2 + Athlon X2 3800+ (2GHz)

Showing next to zero load in top(1) and a whole lot more in APC.

http://www.boblycat.org/~malc/apc/load-tbird-hog.png
http://www.boblycat.org/~malc/apc/load-x2-hog.png

Not quite 99% but nevertheless scary.

--
vale

2007-02-12 18:15:10

by malc

Subject: Re: CPU load

On Mon, 12 Feb 2007, Andrew Burgess wrote:

> On 12/02/07, Vassili Karpov <[email protected]> wrote:
>>
>> How does the kernel calculate the value it places in `/proc/stat' at
>> the 4th position (i.e. "idle: twiddling thumbs")?
>>
> ..
>>
>> Later a small kernel module was developed that tried to measure how
>> much time is spent in the idle handler inside the kernel and exported
>> this information to user-space. The results were consistent with our
>> expectations and with the output of the test utility.
> ..
>> http://www.boblycat.org/~malc/apc
>
> Vassili
>
> Could you rewrite this code as a kernel patch for
> discussion/inclusion in mainline? I and maybe others would
> appreciate having idle statistics be more accurate.

I really don't know how to approach that; what I do in itc.c is ugly
to say the least (it's less ugly on PPC, but still).

There's stuff there that is very dangerous, i.e. entering the idle
handler on SMP while simultaneously rmmoding the module (which
surprisingly never actually caused any bad things on the kernels I had
(starting with 2.6.17.3), but panicked on Debian's 2.6.8). Safety nets
were added but I don't know whether they are sufficient. All in all
what I have is a gross hack, but it works for my purposes.

Another thing that keeps bothering me (again, discovered with this
Debian kernel) is the fact that PREEMPT preempts the idle handler; this
just doesn't add up in my head.

So to summarize: I don't know how to do that properly (so that it
works on all/most architectures, is less of a hack, has no negative
impact on performance, etc.).

But I guess what the innocent `smallhog.c' posted earlier demonstrates
is that something probably ought to be done about it, or at least that
the current situation should be documented.

--
vale

2007-02-13 18:28:23

by Pavel Machek

Subject: Re: CPU load

Hi!

> The kernel looks at what is using cpu _only_ during the timer
> interrupt. Which means if your HZ is 1000 it looks at what is running
> at precisely the moment those 1000 timer ticks occur. It is
> theoretically possible using this measurement system to use >99% cpu
> and record 0 usage if you time your cpu usage properly. It gets even
> more inaccurate at lower HZ values for the same reason.

I have (had?) code that 'exploits' this. I believe I could eat 90% of cpu
without being noticed.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-02-13 22:00:59

by malc

Subject: Re: CPU load

On Mon, 12 Feb 2007, Pavel Machek wrote:

> Hi!
>
>> The kernel looks at what is using cpu _only_ during the timer
>> interrupt. Which means if your HZ is 1000 it looks at what is running
>> at precisely the moment those 1000 timer ticks occur. It is
>> theoretically possible using this measurement system to use >99% cpu
>> and record 0 usage if you time your cpu usage properly. It gets even
>> more inaccurate at lower HZ values for the same reason.
>
> I have (had?) code that 'exploits' this. I believe I could eat 90% of cpu
> without being noticed.

A slightly changed version of hog (around 3 lines changed in total) does
that easily on 2.6.18.3 on PPC.

http://www.boblycat.org/~malc/apc/load-hog-ppc.png

--
vale

2007-02-13 22:22:38

by Con Kolivas

Subject: Re: CPU load

On Wednesday 14 February 2007 09:01, malc wrote:
> On Mon, 12 Feb 2007, Pavel Machek wrote:
> > Hi!
> >
> >> The kernel looks at what is using cpu _only_ during the timer
> >> interrupt. Which means if your HZ is 1000 it looks at what is running
> >> at precisely the moment those 1000 timer ticks occur. It is
> >> theoretically possible using this measurement system to use >99% cpu
> >> and record 0 usage if you time your cpu usage properly. It gets even
> >> more inaccurate at lower HZ values for the same reason.
> >
> > I have (had?) code that 'exploits' this. I believe I could eat 90% of cpu
> > without being noticed.
>
> A slightly changed version of hog (around 3 lines changed in total) does
> that easily on 2.6.18.3 on PPC.
>
> http://www.boblycat.org/~malc/apc/load-hog-ppc.png

I guess it's worth mentioning this is _only_ about displaying the cpu usage to
userspace, as the cpu scheduler knows the accounting of each task in
different ways. This behaviour can not be used to exploit the cpu scheduler
into a starvation situation. Using the discrete per process accounting to
accumulate the displayed values to userspace would fix this problem, but
would be expensive.
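
(Schematically, that per-task accounting is tick-independent - a toy
illustration of mine, not the actual scheduler code: timestamp each
context switch and credit the interval to the outgoing task:)

/* toy model of per-task accounting at context switch - not kernel code */
#include <stdio.h>

struct task {
        const char *name;
        unsigned long long ran_ns, switched_in;
};

/* credit the time since switch-in to the task being switched out */
static void switch_to (struct task *prev, struct task *next,
                       unsigned long long now_ns)
{
        prev->ran_ns += now_ns - prev->switched_in;
        next->switched_in = now_ns;
}

int main (void)
{
        struct task hog = { "hog", 0, 0 }, idle = { "idle", 0, 0 };
        unsigned long long now = 0;
        int i;

        /* the hog runs 900us then sleeps 100us per millisecond; it is
           charged ~90% even if every sampling tick lands in the sleep */
        for (i = 0; i < 1000; ++i) {
                now += 900000; switch_to (&hog, &idle, now);
                now += 100000; switch_to (&idle, &hog, now);
        }
        printf ("%s ran %.1f%% of the time\n",
                hog.name, 100.0 * hog.ran_ns / now);
        return 0;
}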

--
-ck

2007-02-14 07:28:18

by malc

Subject: Re: CPU load

On Wed, 14 Feb 2007, Con Kolivas wrote:

> On Wednesday 14 February 2007 09:01, malc wrote:
>> On Mon, 12 Feb 2007, Pavel Machek wrote:
>>> Hi!

[..snip..]

>>> I have (had?) code that 'exploits' this. I believe I could eat 90% of cpu
>>> without being noticed.
>>
>> A slightly changed version of hog (around 3 lines changed in total) does
>> that easily on 2.6.18.3 on PPC.
>>
>> http://www.boblycat.org/~malc/apc/load-hog-ppc.png
>
> I guess it's worth mentioning this is _only_ about displaying the cpu usage to
> userspace, as the cpu scheduler knows the accounting of each task in
> different ways. This behaviour can not be used to exploit the cpu scheduler
> into a starvation situation. Using the discrete per process accounting to
> accumulate the displayed values to userspace would fix this problem, but
> would be expensive.

Guess you are right but, once again, the problem is not so much about
fooling the system into doing something or other as about confusing the
user:

a. Everything is fine - the load is 0%; the fact that the system is
overheating and/or that some processes do not do as much as they
could is probably due to bad hardware.

b. The weird load pattern must be the result of bugs in my code.
(And then a whole lot of time/effort is poured into fixing a
problem which is simply not there.)

The current situation ought to be documented. Better yet, some flag
could be introduced somewhere in the system so that it exports real
values to /proc, not estimations that are inaccurate in some cases
(like hog).

--
vale

2007-02-14 08:10:19

by Con Kolivas

Subject: Re: CPU load

On Wednesday 14 February 2007 18:28, malc wrote:
> On Wed, 14 Feb 2007, Con Kolivas wrote:
> > On Wednesday 14 February 2007 09:01, malc wrote:
> >> On Mon, 12 Feb 2007, Pavel Machek wrote:
> >>> Hi!
>
> [..snip..]
>
> >>> I have (had?) code that 'exploits' this. I believe I could eat 90% of
> >>> cpu without being noticed.
> >>
> >> A slightly changed version of hog (around 3 lines changed in total) does
> >> that easily on 2.6.18.3 on PPC.
> >>
> >> http://www.boblycat.org/~malc/apc/load-hog-ppc.png
> >
> > I guess it's worth mentioning this is _only_ about displaying the cpu
> > usage to userspace, as the cpu scheduler knows the accounting of each
> > task in different ways. This behaviour can not be used to exploit the cpu
> > scheduler into a starvation situation. Using the discrete per process
> > accounting to accumulate the displayed values to userspace would fix this
> > problem, but would be expensive.
>
> Guess you are right but, once again, the problem is not so much about
> fooling the system into doing something or other as about confusing the
> user:

Yes and I certainly am not arguing against that.

>
> a. Everything is fine - the load is 0%; the fact that the system is
> overheating and/or that some processes do not do as much as they
> could is probably due to bad hardware.
>
> b. The weird load pattern must be the result of bugs in my code.
> (And then a whole lot of time/effort is poured into fixing a
> problem which is simply not there.)
>
> The current situation ought to be documented. Better yet, some flag
> could be introduced somewhere in the system so that it exports real
> values to /proc, not estimations that are inaccurate in some cases
> (like hog).

I wouldn't argue against any of those either. schedstats with userspace
tools to understand the data will give better information, I believe.

--
-ck

2007-02-14 20:47:44

by Pavel Machek

Subject: Re: CPU load

Hi!
>
> >>>I have (had?) code that 'exploits' this. I believe I could eat 90% of cpu
> >>>without being noticed.
> >>
> >>A slightly changed version of hog (around 3 lines changed in total) does
> >>that easily on 2.6.18.3 on PPC.
> >>
> >>http://www.boblycat.org/~malc/apc/load-hog-ppc.png
> >
> >I guess it's worth mentioning this is _only_ about displaying the cpu
> >usage to userspace, as the cpu scheduler knows the accounting of each
> >task in different ways. This behaviour can not be used to exploit the
> >cpu scheduler into a starvation situation. Using the discrete per
> >process accounting to accumulate the displayed values to userspace
> >would fix this problem, but would be expensive.
>
> Guess you are right but, once again, the problem is not so much about
> fooling the system into doing something or other as about confusing the
> user:
>
> a. Everything is fine - the load is 0%; the fact that the system is
> overheating and/or that some processes do not do as much as they
> could is probably due to bad hardware.
>
> b. The weird load pattern must be the result of bugs in my code.
> (And then a whole lot of time/effort is poured into fixing a
> problem which is simply not there.)
>
> The current situation ought to be documented. Better yet, some flag
> could

It probably _is_ documented, somewhere :-). If you find a nice place
to document it (top manpage?) go ahead with the patch.

> be introduced somewhere in the system so that it exports real
> values to /proc, not estimations that are inaccurate in some cases
> (like hog)

A patch would be welcome, but I do not think it will be easy.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-02-25 10:46:53

by malc

Subject: Re: CPU load

On Wed, 14 Feb 2007, Pavel Machek wrote:

> Hi!

[..snip..]

>> The current situation ought to be documented. Better yet, some flag
>> could
>
> It probably _is_ documented, somewhere :-). If you find a nice place
> to document it (top manpage?) go ahead with the patch.


How about this:

<Documentation/load.txt>
CPU load
--------

Linux exports various bits of information via `/proc/stat' and
`/proc/uptime' that userland tools, such as top(1), use to calculate
the average time the system spent in a particular state, for example:

<transcript>
$ iostat
Linux 2.6.18.3-exp (linmac) 02/20/2007

avg-cpu: %user %nice %system %iowait %steal %idle
10.01 0.00 2.92 5.44 0.00 81.63

...
</transcript>

Here the system thinks that over the default sampling period it
spent 10.01% of the time doing work in user space, 2.92% in the
kernel, and was overall 81.63% of the time idle.

In most cases the `/proc/stat' information reflects reality quite
closely; however, due to the nature of how/when the kernel collects
this data, sometimes it cannot be trusted at all.

So how is this information collected? Whenever a timer interrupt is
signalled, the kernel looks at what kind of task was running at that
moment and increments the counter that corresponds to this task's
kind/state. The problem with this is that the system could have
switched between various states multiple times between two timer
interrupts, yet the counter is incremented only for the last state.


Example
-------

If we imagine a system with one task that periodically burns cycles in
the following manner:

 time line between two timer interrupts
|--------------------------------------|
 ^                                    ^
 |_ something begins working          |
                                      |_ something goes to sleep
                                         (only to be awakened quite soon)

In the above situation the system will be 0% loaded according to
`/proc/stat' (since the timer interrupt will always happen when the
system is executing the idle handler), but in reality the load is
closer to 99%.

One can imagine many more situations where this behavior of the kernel
will lead to quite erratic information inside `/proc/stat'.


/* gcc -o hog smallhog.c */
#include <time.h>
#include <limits.h>
#include <signal.h>
#include <sys/time.h>

#define HIST 10

static volatile sig_atomic_t stop;

static void sighandler (int signr)
{
        (void) signr;
        stop = 1;
}

static unsigned long hog (unsigned long niters)
{
        stop = 0;
        while (!stop && --niters);
        return niters;
}

int main (void)
{
        int i;
        struct itimerval it = { .it_interval = { .tv_sec = 0, .tv_usec = 1 },
                                .it_value = { .tv_sec = 0, .tv_usec = 1 } };
        sigset_t set;
        unsigned long v[HIST];
        double tmp = 0.0;
        unsigned long n;

        signal (SIGALRM, &sighandler);
        setitimer (ITIMER_REAL, &it, NULL);

        hog (ULONG_MAX);
        for (i = 0; i < HIST; ++i) v[i] = ULONG_MAX - hog (ULONG_MAX);
        for (i = 0; i < HIST; ++i) tmp += v[i];
        tmp /= HIST;
        n = tmp - (tmp / 3.0);

        sigemptyset (&set);
        sigaddset (&set, SIGALRM);

        for (;;) {
                hog (n);
                sigwait (&set, &i);
        }
        return 0;
}


References
----------

http://lkml.org/lkml/2007/2/12/6
Documentation/filesystems/proc.txt (1.8)
</Documentation/load.txt>

--
vale

2007-02-26 09:28:48

by Pavel Machek

Subject: Re: CPU load

Hi!

> [..snip..]
>
> >>The current situation ought to be documented. Better yet, some flag
> >>could
> >
> >It probably _is_ documented, somewhere :-). If you find a nice place
> >to document it (top manpage?) go ahead with the patch.
>
>
> How about this:

Looks okay to me. (You should probably add your name to it, and I do
not like html-like markup... plus please don't add extra spaces
between words)...

You probably want to send it to akpm?
Pavel

> <Documentation/load.txt>
> CPU load
> --------
>
> Linux exports various bits of information via `/proc/stat' and
> `/proc/uptime' that userland tools, such as top(1), use to calculate
> the average time the system spent in a particular state, for example:
>
> <transcript>
> $ iostat
> Linux 2.6.18.3-exp (linmac) 02/20/2007
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 10.01 0.00 2.92 5.44 0.00 81.63
>
> ...
> </transcript>
>
> Here the system thinks that over the default sampling period it
> spent 10.01% of the time doing work in user space, 2.92% in the
> kernel, and was overall 81.63% of the time idle.
>
> In most cases the `/proc/stat' information reflects reality quite
> closely; however, due to the nature of how/when the kernel collects
> this data, sometimes it cannot be trusted at all.
>
> So how is this information collected? Whenever a timer interrupt is
> signalled, the kernel looks at what kind of task was running at that
> moment and increments the counter that corresponds to this task's
> kind/state. The problem with this is that the system could have
> switched between various states multiple times between two timer
> interrupts, yet the counter is incremented only for the last state.
>
>
> Example
> -------
>
> If we imagine a system with one task that periodically burns cycles in
> the following manner:
>
>  time line between two timer interrupts
> |--------------------------------------|
>  ^                                    ^
>  |_ something begins working          |
>                                       |_ something goes to sleep
>                                          (only to be awakened quite soon)
>
> In the above situation the system will be 0% loaded according to
> `/proc/stat' (since the timer interrupt will always happen when the
> system is executing the idle handler), but in reality the load is
> closer to 99%.
>
> One can imagine many more situations where this behavior of the kernel
> will lead to quite erratic information inside `/proc/stat'.
>
>
> /* gcc -o hog smallhog.c */
> #include <time.h>
> #include <limits.h>
> #include <signal.h>
> #include <sys/time.h>
>
> #define HIST 10
>
> static volatile sig_atomic_t stop;
>
> static void sighandler (int signr)
> {
>         (void) signr;
>         stop = 1;
> }
>
> static unsigned long hog (unsigned long niters)
> {
>         stop = 0;
>         while (!stop && --niters);
>         return niters;
> }
>
> int main (void)
> {
>         int i;
>         struct itimerval it = { .it_interval = { .tv_sec = 0, .tv_usec = 1 },
>                                 .it_value = { .tv_sec = 0, .tv_usec = 1 } };
>         sigset_t set;
>         unsigned long v[HIST];
>         double tmp = 0.0;
>         unsigned long n;
>
>         signal (SIGALRM, &sighandler);
>         setitimer (ITIMER_REAL, &it, NULL);
>
>         hog (ULONG_MAX);
>         for (i = 0; i < HIST; ++i) v[i] = ULONG_MAX - hog (ULONG_MAX);
>         for (i = 0; i < HIST; ++i) tmp += v[i];
>         tmp /= HIST;
>         n = tmp - (tmp / 3.0);
>
>         sigemptyset (&set);
>         sigaddset (&set, SIGALRM);
>
>         for (;;) {
>                 hog (n);
>                 sigwait (&set, &i);
>         }
>         return 0;
> }
>
>
> References
> ----------
>
> http://lkml.org/lkml/2007/2/12/6
> Documentation/filesystems/proc.txt (1.8)
> </Documentation/load.txt>
>

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-02-26 10:43:04

by malc

Subject: Re: CPU load

On Mon, 26 Feb 2007, Pavel Machek wrote:

> Hi!
>
>> [..snip..]
>>
>>>> The current situation ought to be documented. Better yet, some flag
>>>> could
>>>
>>> It probably _is_ documented, somewhere :-). If you find a nice place
>>> to document it (top manpage?) go ahead with the patch.
>>
>>
>> How about this:
>
> Looks okay to me. (You should probably add your name to it, and I do
> not like html-like markup... plus please don't add extra spaces
> between words)...

Thanks. The html-like markup was added to clearly mark the boundaries
of the message and the text. The extra spaces are courtesy of emacs'
C-0 M-q.

>
> You probably want to send it to akpm?

Any pointers on how to do that, and perhaps on the preferred submission
format?

[..snip..]

--
vale

2007-02-26 16:42:57

by Randy Dunlap

Subject: Re: CPU load

On Mon, 26 Feb 2007 13:42:50 +0300 (MSK) malc wrote:

> On Mon, 26 Feb 2007, Pavel Machek wrote:
>
> > Hi!
> >
> >> [..snip..]
> >>
> >>>> The current situation ought to be documented. Better yet, some flag
> >>>> could
> >>>
> >>> It probably _is_ documented, somewhere :-). If you find a nice place
> >>> to document it (top manpage?) go ahead with the patch.
> >>
> >>
> >> How about this:
> >
> > Looks okay to me. (You should probably add your name to it, and I do
> > not like html-like markup... plus please don't add extra spaces
> > between words)...
>
> Thanks. The html-like markup was added to clearly mark the boundaries
> of the message and the text. The extra spaces are courtesy of emacs'
> C-0 M-q.
>
> >
> > You probably want to send it to akpm?
>
> Any pointers on how to do that, and perhaps on the preferred submission
> format?
>
> [..snip..]

Well, he wrote it up and posted it at
http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***