Hello everyone,
In a previous message archived at
http://www.ussg.iu.edu/hypermail/linux/kernel/0410.2/2950.html
I described a problem with a wrapping load average on my SMP system.
The following small userspace load simulation exactly matches the
numbers I am seeing.
We can only account for 1024 runnable processes, since we have 22 bits
of precision. I would like to suggest a patch to calc_load() in kernel/timer.c
that would limit the number of active tasks:
	if (active_tasks > 1024 * FIXED_1)
		active_tasks = 1024 * FIXED_1;
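To spell out where the 22 bits go (assuming a 32-bit unsigned long):
the task count is converted to fixed point (<< FSHIFT, 11 bits), and
CALC_LOAD then multiplies the load by an EXP_* constant that is itself
just below 2^11, so the intermediate product needs 11 + 11 = 22 bits of
headroom. That leaves 32 - 22 = 10 bits for the task count, i.e. at
most 2^10 = 1024 tasks before the multiplication wraps.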
I am aware that this is not a real fix ... The wrapping happens when running
threaded applications (Java/JBoss in my case). Below you'll find a small
userspace simulation.
I would really like to provide a proper fix, but I couldn't figure out
what is going wrong.
Thanks for any feedback,
Patrick
/* Sample code copied from include/linux/sched.h and kernel/timer.c */
#include <stdio.h>
#define FSHIFT 11 /* 11 bit precision */
#define FIXED_1 (1 << FSHIFT) /* 1.0 as fixed-point */
#define EXP_1 1884 /* 1/exp(5sec/1min) */
#define EXP_5 2014 /* 1/exp(5sec/5min) */
#define EXP_15 2037 /* 1/exp(5sec/15min) */
/* decay the old average and blend in the new sample (fixed point) */
#define CALC_LOAD(load, exp, n) \
	load *= exp; \
	load += n*(FIXED_1-exp); \
	load >>= FSHIFT;
static unsigned long avenrun[3];
/* a normal load spike and one erroneous huge sample (~0) */
static unsigned long tasks[8] = {
	0, 1, 0, ~0, 0, 0, 0, 0
};
static void calc_load(unsigned long active_tasks)
{
	/* convert the task count to fixed point, as the kernel does */
	active_tasks <<= FSHIFT;

	CALC_LOAD(avenrun[0], EXP_1, active_tasks);
	CALC_LOAD(avenrun[1], EXP_5, active_tasks);
	CALC_LOAD(avenrun[2], EXP_15, active_tasks);
}
int main(int argc, char **argv)
{
	int i, j;

	for (i = 0; i < 8; i++) {	/* index for tasks[] */
		/* 24 calculations per load change */
		for (j = 0; j < 24; j++) {
			calc_load(tasks[i]);
			printf("%.2f %.2f %.2f\n",
			       (float) avenrun[0] / FIXED_1,
			       (float) avenrun[1] / FIXED_1,
			       (float) avenrun[2] / FIXED_1);
		}
	}
	return 0;
}
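For reference, the simulation builds with any C compiler, e.g.

	cc -o loadsim loadsim.c

(the output name is arbitrary). Once the loop hits the ~0 entry, the
printed averages jump and then wrap, just like the numbers in the
archived report.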
Patrick Mau <[email protected]> wrote:
>
> In a previous message archived at
>
> http://www.ussg.iu.edu/hypermail/linux/kernel/0410.2/2950.html
>
> I described a problem with a wrapping load average on my SMP system.
> The following small userspace load simulation exactly matches the
> numbers I am seeing.
>
> We can only account for 1024 runnable processes, since we have 22 bits
> of precision. I would like to suggest a patch to calc_load() in kernel/timer.c
> that would limit the number of active tasks:
>
>
> 	if (active_tasks > 1024 * FIXED_1)
> 		active_tasks = 1024 * FIXED_1;
>
It's better than wrapping to zero...
Why do we need 11 bits after the binary point?
On Mon, Nov 08, 2004 at 01:27:07AM -0800, Andrew Morton wrote:
> Patrick Mau <[email protected]> wrote:
> >
> > We can only account for 1024 runnable processes, since we have 22 bits
> > of precision. I would like to suggest a patch to calc_load() in kernel/timer.c
>
> It's better than wrapping to zero...
>
> Why do we need 11 bits after the binary point?
I tried various other combinations; the most interesting alternative was
8 bits of precision. The exponential values would be:

	1 / e^(5/60)  * 256 = 235.53
	1 / e^(5/300) * 256 = 251.76
	1 / e^(5/900) * 256 = 254.58

If you used 236, 252 and 255, the last two load calculations would
get optimized into register shifts during the calculation. The precision
would be worse, but I personally don't mind losing the fraction.
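To make that concrete, the 8-bit variant would look something like this
(the shift decompositions in the comments are just my reading of why the
multiplies get cheap; strictly only the last two collapse into short
shift-and-subtract sequences):

	#define FSHIFT  8                /* 8 bit precision */
	#define FIXED_1 (1 << FSHIFT)    /* 1.0 as fixed-point */
	#define EXP_1   236              /* x*236 = (x<<8) - (x<<4) - (x<<2) */
	#define EXP_5   252              /* x*252 = (x<<8) - (x<<2) */
	#define EXP_15  255              /* x*255 = (x<<8) - x */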
Best regards,
Patrick
(Please don't remove people from Cc:, just do reply-to-all.)
Patrick Mau <[email protected]> wrote:
>
> On Mon, Nov 08, 2004 at 01:27:07AM -0800, Andrew Morton wrote:
> > Patrick Mau <[email protected]> wrote:
> > >
> > > We can only account for 1024 runnable processes, since we have 22 bits
> > > of precision. I would like to suggest a patch to calc_load() in kernel/timer.c
> >
> > It's better than wrapping to zero...
> >
> > Why do we need 11 bits after the binary point?
>
> I tried various other combinations; the most interesting alternative was
> 8 bits of precision. The exponential values would be:
>
> 	1 / e^(5/60)  * 256 = 235.53
>
> 	1 / e^(5/300) * 256 = 251.76
>
> 	1 / e^(5/900) * 256 = 254.58
>
> If you used 236, 252 and 255, the last two load calculations would
> get optimized into register shifts during the calculation. The precision
> would be worse, but I personally don't mind losing the fraction.
What would be the impact on the precision if we were to use 8 bits of
fraction?
An upper limit of 1024 tasks sounds a bit squeezy. Even 8192 is a bit
uncomfortable. Maybe we should just reimplement the whole thing, perhaps
in terms of tuples of 32-bit values: 32 bits each side of the binary point?
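Something along these lines, perhaps; a userspace sketch with the two
32-bit halves packed into one uint64_t for brevity (untested, and the
names and the 0.32 encoding of the decay factors are invented):

	#include <stdint.h>

	#define FSHIFT32   32                   /* 32 bits after the binary point */
	#define FIXED32_1  (1ULL << FSHIFT32)   /* 1.0 in 32.32 fixed point */

	/* Multiply a 32.32 load by a 0.32 decay factor, e.g.
	 * round((1/exp(5.0/60)) * 2^32), using only 64-bit intermediates
	 * by splitting the load into its two 32-bit halves. */
	static uint64_t mul_decay(uint64_t load, uint32_t exp)
	{
		uint64_t hi = load >> 32;            /* integer part */
		uint64_t lo = load & 0xffffffffULL;  /* fractional part */

		return hi * exp + ((lo * exp) >> 32);
	}

	/* load = load*exp + nr_active*(1 - exp), all in 32.32 fixed point;
	 * nr_active is the raw, unshifted task count. */
	static uint64_t calc_load64(uint64_t load, uint32_t exp, uint32_t nr_active)
	{
		return mul_decay(load, exp) +
		       (uint64_t)nr_active * (FIXED32_1 - exp);
	}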
On Mon, Nov 08, 2004 at 03:50:51PM -0800, Andrew Morton wrote:
>
> (Please don't remove people from Cc:, just do reply-to-all.)
Hi Andrew,
Sorry, I usually remove people from Cc: if they're subscribed to the list.
> Patrick Mau <[email protected]> wrote:
> >
> > If you used 236, 252 and 255, the last two load calculations would
> > get optimized into register shifts during the calculation. The precision
> > would be worse, but I personally don't mind losing the fraction.
>
> What would be the impact on the precision if we were to use 8 bits of
> fraction?
I didn't have time to check again, but I think I ended up with a load of 0.97
for one runnable process, because of rounding errors.
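(Roughly what I'd expect from the truncation: with FSHIFT = 8 and one
runnable task the update becomes

	load = (load*236 + 256*20) >> 8;

and because the right shift rounds down, the climb from below stalls as
soon as the per-step increment falls under one count, somewhere around
244/256 = 0.95 rather than reaching 1.00.)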
> An upper limit of 1024 tasks sounds a bit squeezy. Even 8192 is a bit
> uncomfortable. Maybe we should just reimplement the whole thing, perhaps
> in terms of tuples of 32-bit values: 32 bits each side of the binary point?
We re-calculate the load every 5 seconds. I think it would be OK to
use more bits/registers; it's not called that frequently.
It's 1:30 AM and I've had a rough working day; maybe I'll prepare a little
patch tomorrow. I think 8192 _runnable_ processes seems a bit unusual, but we
also account for uninterruptible processes. Maybe there was some swap/IO
storm that triggered the initial overflow; I'll have to check that first.
Best regards,
Patrick
On Tue, Nov 09, 2004 at 01:43:35AM +0100, Patrick Mau wrote:
> On Mon, Nov 08, 2004 at 03:50:51PM -0800, Andrew Morton wrote:
> >
> > (Please don't remove people from Cc:, just do reply-to-all.)
>
> Hi Andrew,
>
> Sorry, I usually remove people from Cc: if they're subscribed to the list.
>
> > Patrick Mau <[email protected]> wrote:
> > >
> > > If you used 236, 252 and 255, the last two load calculations would
> > > get optimized into register shifts during the calculation. The precision
> > > would be worse, but I personally don't mind losing the fraction.
> >
> > What would be the impact on the precision if we were to use 8 bits of
> > fraction?
>
> I didn't have time to check again, but I think I ended up with a load of 0.97
> for one runnable process, because of rounding errors.
>
> > An upper limit of 1024 tasks sounds a bit squeezy. Even 8192 is a bit
> > uncomfortable. Maybe we should just reimplement the whole thing, perhaps
> > in terms of tuples of 32-bit values: 32 bits each side of the binary point?
>
> We re-calculate the load every 5 seconds. I think it would be OK to
> use more bits/registers; it's not called that frequently.
hmm ...
do_timer() -> update_times() -> calc_load()
so not exactly every 5 seconds ...
but I agree that a higher resolution would be a good
idea ... also doing the calculation when the number
of running/uninterruptible processes has changed would
be a good idea ...
(I implemented something similar for linux-vserver,
if there is interest, I could adapt it for mainline)
> It's 1:30 AM and I've had a rough working day; maybe I'll prepare a little
> patch tomorrow. I think 8192 _runnable_ processes seems a bit unusual, but we
> also account for uninterruptible processes. Maybe there was some swap/IO
> storm that triggered the initial overflow; I'll have to check that first.
best,
Herbert
> Best regards,
> Patrick
Herbert Poetzl wrote:
> but I agree that a higher resolution would be a good
> idea ... also doing the calculation when the number
> of running/uninterruptible processes has changed would
> be a good idea ...
This could get very expensive. A modern CPU can do about 700,000 context
switches per second of a real task with the current Linux kernel, so I'd
suggest not doing this.
Cheers,
Con
On Wed, Nov 10, 2004 at 08:49:41AM +1100, Con Kolivas wrote:
> Herbert Poetzl wrote:
> >but I agree that a higher resolution would be a good
> >idea ... also doing the calculation when the number
> >of running/uninterruptible processes has changed would
> >be a good idea ...
>
> This could get very expensive. A modern CPU can do about 700,000 context
> switches per second of a real task with the current Linux kernel, so I'd
> suggest not doing this.
hmm, right, it can ... do you have any stats about the
'typical' workload behaviour?
do you know the average time between changes of
nr_running and nr_uninterruptible?
TIA,
Herbert
> Cheers,
> Con
Herbert Poetzl wrote:
> On Tue, Nov 09, 2004 at 01:43:35AM +0100, Patrick Mau wrote:
>
>>We re-calculate the load every 5 seconds. I think it would be OK to
>>use more bits/registers; it's not called that frequently.
>
>
> hmm ...
>
> do_timer() -> update_times() -> calc_load()
>
> so not exactly every 5 seconds ...
>
calc_load() -> messing with LOAD_FREQ -> once every 5 seconds, no?
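For reference, the gate in kernel/timer.c looks roughly like this
(quoting from memory, so don't trust the details):

	static inline void calc_load(unsigned long ticks)
	{
		unsigned long active_tasks; /* fixed-point */
		static int count = LOAD_FREQ;

		count -= ticks;
		if (count < 0) {
			count += LOAD_FREQ;
			active_tasks = count_active_tasks();
			CALC_LOAD(avenrun[0], EXP_1, active_tasks);
			CALC_LOAD(avenrun[1], EXP_5, active_tasks);
			CALC_LOAD(avenrun[2], EXP_15, active_tasks);
		}
	}

with LOAD_FREQ defined as 5*HZ.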
I think doing 32/32 bit calculations would be fine.
> but I agree that a higher resolution would be a good
> idea ... also doing the calculation when the number
> of running/uninterruptible processes has changed would
> be a good idea ...
>
Apart from the problem Con pointed out, you'd need a fancier algorithm
to calculate load because your interval isn't going to be fixed, so you
need to factor that in when calculating the area under the 'curve'
(loadavg).
I think the good ol' 5 seconds should be alright.
Herbert Poetzl wrote:
> On Wed, Nov 10, 2004 at 08:49:41AM +1100, Con Kolivas wrote:
>
>>Herbert Poetzl wrote:
>>
>>>but I agree that a higher resolution would be a good
>>>idea ... also doing the calculation when the number
>>>of running/uninterruptible processes has changed would
>>>be a good idea ...
>>
>>This could get very expensive. A modern CPU can do about 700,000 context
>>switches per second of a real task with the current Linux kernel, so I'd
>>suggest not doing this.
>
>
> hmm, right, it can ... do you have any stats about the
> 'typical' workload behaviour?
How long is a piece of string? It depends entirely on your workload. On a
desktop machine, just switching applications pushes it to 10,000.
Basically you end up making it an O(n) calculation by increasing its
(albeit small) overhead proportionately to the context switch load,
which is usually significantly higher than the system load.
> do you know the average time between changes of
> nr_running and nr_uninterruptible?
Same answer: it depends entirely on the workload and on whether your
running tasks sleep at all (hint: most do). While it will be a
lower number than the number of context switches, it can potentially be
just as high with the right sort of threads (think server, network type
stuff).
> TIA,
> Herbert
Cheers,
Con
On Wed, Nov 10, 2004 at 06:07:25PM +1100, Nick Piggin wrote:
> Herbert Poetzl wrote:
> >On Tue, Nov 09, 2004 at 01:43:35AM +0100, Patrick Mau wrote:
> >
>
> >>We re-calculate the load every 5 seconds. I think it would be OK to
> >>use more bits/registers; it's not called that frequently.
> >
> >
> >hmm ...
> >
> > do_timer() -> update_times() -> calc_load()
> >
> >so not exactly every 5 seconds ...
>
> calc_load() -> messing with LOAD_FREQ -> once every 5 seconds, no?
usually yes ...
> I think doing 32/32 bit calculations would be fine.
agreed ...
> >but I agree that a higher resolution would be a good
> >idea ... also doing the calculation when the number
> >of running/uninterruptible processes has changed would
> >be a good idea ...
> >
>
> Apart from the problem Con pointed out, you'd need a fancier algorithm
> to calculate load because your interval isn't going to be fixed, so you
> need to factor that in when calculating the area under the 'curve'
> (loadavg).
yes, something like this:
static unsigned long
update_loadavg(uint32_t load, int wsize, int delta, int n)
{
	unsigned long long calc;

	/* window fully elapsed: the old load no longer contributes */
	if (delta >= wsize)
		return (n << FSHIFT);

	/* weight the new sample and the old load by their share
	   of the window, then normalize */
	calc  = ((unsigned long long)delta * n) << FSHIFT;
	calc += (unsigned long long)(wsize - delta) * load;
	do_div(calc, wsize);
	return calc;
}
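which would then be called along these lines (just to illustrate the
calling convention I have in mind; the names are made up):

	avenrun[0] = update_loadavg(avenrun[0], 60*HZ, delta_ticks, nr_active);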
> I think the good ol' 5 seconds should be alright.
probably sufficient, yes ...
best,
Herbert