2013-03-26 14:01:23

by Stanislaw Gruszka

Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

On Mon, Mar 18, 2013 at 03:49:02AM -0700, tip-bot for Frederic Weisbecker wrote:
> Commit-ID: d9a3c9823a2e6a543eb7807fb3d15d8233817ec5
> Gitweb: http://git.kernel.org/tip/d9a3c9823a2e6a543eb7807fb3d15d8233817ec5
> Author: Frederic Weisbecker <[email protected]>
> AuthorDate: Wed, 20 Feb 2013 18:54:55 +0100
> Committer: Frederic Weisbecker <[email protected]>
> CommitDate: Wed, 13 Mar 2013 18:18:14 +0100
>
> sched: Lower chances of cputime scaling overflow
>
> Some users have reported that after running a process with
> hundreds of threads on intensive CPU-bound loads, the cputime
> of the group started to freeze after a few days.
>
> This is due to how we scale the tick-based cputime against
> the scheduler precise execution time value.
>
> We add the values of all threads in the group and we multiply
> that against the sum of the scheduler exec runtime of the whole
> group.
>
> This easily overflows after a few days/weeks of execution.
>
> A proposed solution to solve this was to compute that multiplication
> on stime instead of utime:
> 62188451f0d63add7ad0cd2a1ae269d600c1663d
> ("cputime: Avoid multiplication overflow on utime scaling")
>
> The rationale behind that was that it's easy for a thread to
> spend most of its time in userspace under intensive CPU-bound workload
> but it's much harder to do CPU-bound intensive long run in the kernel.
>
> This postulate got defeated when a user recently reported he was still
> seeing cputime freezes after the above patch. The workload that
> triggers this issue relates to intensive networking workloads where
> most of the cputime is consumed in the kernel.
>
> To further reduce the opportunities for multiplication overflow, let's
> reduce the multiplication factors to the remainders of the division
> between sched exec runtime and cputime. Assuming the difference between
> these shouldn't ever be that large, it should work in many situations.
>
> This gets the same results as the upstream scaling code except for
> a small difference: the upstream code always rounds the result down
> to the nearest integer not greater than the precise result, while the
> new code rounds to the nearest integer, either up or down. In practice
> this difference probably shouldn't matter, but it's worth mentioning.
>
> If this solution appears not to be enough in the end, we'll
> need to partly revert back to the behaviour prior to commit
> 0cf55e1ec08bb5a22e068309e2d8ba1180ab4239
> ("sched, cputime: Introduce thread_group_times()")
>
> Back then, the scaling was done on exit() time before adding the cputime
> of an exiting thread to the signal struct. And then we'll need to
> scale one-by-one the live threads cputime in thread_group_cputime(). The
> drawback may be a slightly slower code on exit time.

Sorry for questioning this patch so late, but I thought this solution
was OK. However, before providing a backport to the RHEL kernel, I did
some analysis and testing of this algorithm, and discovered that it is
pretty easy to hit multiplication overflow. For example:

rtime: 16877346691
total: 12215365298
stime: 4886146119

gives a scaled stime of 5240812402, whereas the correct value is
6750938676. Since the problem can trigger at random, and we cannot
guarantee it will not happen for any runtime longer than about 50 days
(for a single-threaded application), I do not consider this patch a
good fix.

Of course, reverting 0cf55e1ec08bb5a22e068309e2d8ba1180ab4239 will not
help either, since the overflow is possible even with a single thread.

Can we remove the rtime multiplication altogether and just use the raw
utime and stime values? I think not. I don't know the details, but it may
be possible to write a top "exploit" that eats a lot of cputime while
"hiding" itself from the tick, so it would not show up as a process
utilizing a lot of CPU. Peter mentioned this some time ago.

Taking that into account, and after analyzing some other possibilities,
I think the best (or only good) solution to this problem is to use
128-bit math for the calculation. I tested the code below [1] in user
space and it gives good results. It uses a 128-bit multiplication and a
simplified 128-bit division.

This will require adding the 128-bit math primitives from
http://thread.gmane.org/gmane.linux.kernel/1381801
which were initially needed for the deadline scheduler but were later
dropped from its requirements. I hope it is OK to add them, or some
improved version. I think sooner or later 128-bit math will be needed
in more places in the kernel.

Thoughts?

Stanislaw

[1]

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

typedef unsigned long long u64;
typedef unsigned __int128 u128;

#define clzll(x) __builtin_clzll(x)

static u64 div_u64(u64 a, u64 b)
{
	return a / b;
}

static u128 mul_u64_u64(u64 a, u64 b)
{
	u128 res = a;
	res *= b;

	return res;
}

static u64 div128_u64(u128 dividend, u64 divisor)
{
	u64 high = dividend >> 64;
	u64 quot;

	if (high == 0) {
		quot = div_u64(dividend, divisor);
	} else {
		/* Shift both operands right until the dividend fits in 64 bits. */
		int n = 64 - clzll(high);

		if ((divisor >> n) == 0) {
			/* Quotient does not fit in 64 bits. */
			return 0;
		}

		quot = div_u64(dividend >> n, divisor >> n);

		/* The estimate can be off by one; correct it. */
		if (quot != 0)
			quot--;
		if ((dividend - (u128)quot * divisor) >= divisor)
			quot++;
	}

	return quot;
}

u64 _scale_time(u64 rtime, u64 total, u64 time)
{
	u128 tmp = mul_u64_u64(time, rtime);

	return div128_u64(tmp, total);
}

u64 scale_time(u64 stime, u64 rtime, u64 total)
{
	u64 time, scaled;
	u64 utime = total - stime;
	int use_utime;

	/*
	 * Scale the smaller of utime/stime and derive the other from
	 * rtime, to reduce the rounding error.
	 */
	if (utime < stime) {
		use_utime = 1;
		time = utime;
	} else {
		use_utime = 0;
		time = stime;
	}

	scaled = _scale_time(rtime, total, time);

	if (use_utime)
		scaled = rtime - scaled;

	return scaled;
}

int main(int argc, char *argv[])
{
	u64 rtime, total, stime, scaled;

	if (argc != 4)
		return 1;

	rtime = strtoull(argv[1], NULL, 10);
	total = strtoull(argv[2], NULL, 10);
	stime = strtoull(argv[3], NULL, 10);

	assert(total >= stime);

	scaled = scale_time(stime, rtime, total);
	printf("%llu\n", scaled);

	return 0;
}


2013-03-26 14:19:28

by Frederic Weisbecker

Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

2013/3/26 Stanislaw Gruszka <[email protected]>:
> [...]

I stared at that code for quite some time already and I can't come up
with a better solution.

Of course 128-bit ops are very expensive, so to help you evaluate
the situation: this is going to happen on every call to
task_cputime_adjusted() and thread_group_cputime_adjusted(), namely:

* Some proc files read
* sys_times()
* thread group exit

Thoughts?

2013-03-26 16:54:01

by Stanislaw Gruszka

Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

On Tue, Mar 26, 2013 at 03:19:25PM +0100, Frederic Weisbecker wrote:
> I stared at that code for quite some time already and I can't come up
> with a better solution.
>
> Of course 128 bits ops are very expensive, so to help you evaluating
> the situation, this is going to happen on every call to
> task_cputime_adjusted() and thread_group_adjusted(), namely:
>
> * Some proc files read
> * sys_times()
> * thread group exit

I need to think about whether we can get rid of thread_group_cputime_adjusted()
at exit(); perhaps this is possible without causing problems.

div128_u64() is optimized for the case where the upper 64 bits are zero.
The 128-bit multiplication can be avoided entirely if we are sure the
64-bit multiply will not overflow, for example:

	if (fls64(stime) + fls64(rtime) < 64)
		scale_64()
	else
		scale_128()

Basically 99.9% of processes will take the scale_64() path. The fix is
only needed for processes that use a lot of CPU time over a long period,
and hence have big stime and rtime values, which is not common.

I'm also thinking about removing cputime scaling altogether. That would
obviously be best for performance. I think we could avoid the
multiplication in the kernel; it could be done in user space if we
exported rtime. So top could be modified to do the cputime scaling by
itself. That way we would still avoid the top-hiding "exploit", but the
precision of the times(2) syscall values would get worse (and I think
there are a lot of programs that use this call and may expect
stime/utime to be precise).

Stanislaw

2013-04-10 12:51:17

by Ingo Molnar

Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow


* Frederic Weisbecker <[email protected]> wrote:

> Of course 128 bits ops are very expensive, so to help you evaluating the
> situation, this is going to happen on every call to task_cputime_adjusted() and
> thread_group_adjusted(), namely:

It's really only expensive for divisions. Addition and multiplication should be
straightforward and relatively low overhead, especially on 64-bit platforms.

Thanks,

Ingo

2013-04-10 15:28:59

by Frederic Weisbecker

Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

2013/4/10 Ingo Molnar <[email protected]>:
>
> * Frederic Weisbecker <[email protected]> wrote:
>
>> Of course 128 bits ops are very expensive, so to help you evaluating the
>> situation, this is going to happen on every call to task_cputime_adjusted() and
>> thread_group_adjusted(), namely:
>
> It's really only expensive for divisions. Addition and multiplication should be
> straightforward and relatively low overhead, especially on 64-bit platforms.

Ok, well we still have one division in the scaling path. I'm mostly
worried about the thread group exit that makes use of it through
threadgroup_cputime_adjusted(). Not sure if we can avoid that.

2013-04-10 17:32:25

by Ingo Molnar

Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow


* Frederic Weisbecker <[email protected]> wrote:

> 2013/4/10 Ingo Molnar <[email protected]>:
> >
> > * Frederic Weisbecker <[email protected]> wrote:
> >
> >> Of course 128 bits ops are very expensive, so to help you evaluating the
> >> situation, this is going to happen on every call to task_cputime_adjusted() and
> >> thread_group_adjusted(), namely:
> >
> > It's really only expensive for divisions. Addition and multiplication should be
> > straightforward and relatively low overhead, especially on 64-bit platforms.
>
> Ok, well we still have one division in the scaling path. I'm mostly
> worried about the thread group exit that makes use of it through
> threadgroup_cputime_adjusted(). Not sure if we can avoid that.

I see, scale_stime()'s use of div64_u64_rem(), right?

I swapped out the details already, is there a link or commit ID that explains
where we hit 64-bit multiplication overflow? It's due to accounting in nanosecs,
spread out across thousands of tasks potentially, right?

But even with nsecs, a 64-bit variable ought to be able to hold hundreds of years
worth of runtime. How do we overflow?

Thanks,

Ingo

2013-04-11 08:04:03

by Stanislaw Gruszka

Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

On Wed, Apr 10, 2013 at 07:32:19PM +0200, Ingo Molnar wrote:
> * Frederic Weisbecker <[email protected]> wrote:
>
> > 2013/4/10 Ingo Molnar <[email protected]>:
> > >
> > > * Frederic Weisbecker <[email protected]> wrote:
> > >
> > >> Of course 128 bits ops are very expensive, so to help you evaluating the
> > >> situation, this is going to happen on every call to task_cputime_adjusted() and
> > >> thread_group_adjusted(), namely:
> > >
> > > It's really only expensive for divisions. Addition and multiplication should be
> > > straightforward and relatively low overhead, especially on 64-bit platforms.
> >
> > Ok, well we still have one division in the scaling path. I'm mostly
> > worried about the thread group exit that makes use of it through
> > threadgroup_cputime_adjusted(). Not sure if we can avoid that.
>
> I see, scale_stime()'s use of div64_u64_rem(), right?
>
> I swapped out the details already, is there a link or commit ID that explains
> where we hit 64-bit multiplication overflow? It's due to accounting in nanosecs,

No, the values are converted to jiffies and then multiplied. CONFIG_HZ=1000
makes the issue happen earlier than CONFIG_HZ=100 does.

> spread out across thousands of tasks potentially, right?

Thousands of tasks (in one process) running on a machine with thousands
of CPUs will make the problem reproducible in hours, or maybe minutes.

> But even with nsecs, a 64-bit variable ought to be able to hold hundreds of years
> worth of runtime. How do we overflow?

We do rtime * stime. If a process utilizes 100% CPU we have stime = rtime.
That will overflow once rtime >= sqrt(0xffffffffffffffff) jiffies, i.e.
rtime >= 49 days with CONFIG_HZ=1000. With 50 threads running on a
machine with more than 50 CPUs, that comes down to about 1 day. In a real
application stime is never equal to rtime, but it can be near half of it,
so the problem is still easily reachable.

Using the quotient and remainder makes the problem less reproducible,
but it still happens, depending on the rtime / total ratio.

I wrote a user-space program that does the same calculations as the
kernel, and a Python script that generates stime/rtime/total values
(attached). They show the problem is still reproducible. For example:

FAIL!
rtime: 25386774344 <- 293 days (one thread, HZ=1000)
total: 27958813690
stime: 8387644107
kernel: 8275815093 <- kernel value
python: 7616032303 <- correct value

FAIL!
rtime: 16877346691 <- 195 days (one thread, HZ=1000)
total: 12215365298
stime: 4886146119
kernel: 5240812402 <- kernel value
python: 6750938676 <- correct value

Stanislaw


Attachments:
(No filename) (2.47 kB)
scale_stime.c (1.00 kB)
scale_stime_test.py (889.00 B)

2013-04-11 13:46:17

by Peter Zijlstra

Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

On Tue, 2013-03-26 at 15:01 +0100, Stanislaw Gruszka wrote:
> Thoughts?

Would something like the below work?

(warning: it's never even been near a compiler)

---
kernel/sched/cputime.c | 78 +++++++++++++++++++++++++++++++++++---------------
1 file changed, 55 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 699d597..465f6db 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -522,35 +522,67 @@ void account_idle_ticks(unsigned long ticks)
 }
 
 /*
- * Perform (stime * rtime) / total with reduced chances
- * of multiplication overflows by using smaller factors
- * like quotient and remainders of divisions between
- * rtime and total.
+ * Perform: (stime * rtime) / total
  */
 static cputime_t scale_stime(u64 stime, u64 rtime, u64 total)
 {
-	u64 rem, res, scaled;
+	int stime_fls = fls64(stime);
+	int total_fls = fls64(total);
+	int rtime_fls = fls64(rtime);
+	int shift, res_fls;
+	u32 rtime_hi = rtime >> 32, rtime_lo = rtime;
+	u64 hi, lo, t;
 
-	if (rtime >= total) {
-		/*
-		 * Scale up to rtime / total then add
-		 * the remainder scaled to stime / total.
-		 */
-		res = div64_u64_rem(rtime, total, &rem);
-		scaled = stime * res;
-		scaled += div64_u64(stime * rem, total);
-	} else {
-		/*
-		 * Same in reverse: scale down to total / rtime
-		 * then substract that result scaled to
-		 * to the remaining part.
-		 */
-		res = div64_u64_rem(total, rtime, &rem);
-		scaled = div64_u64(stime, res);
-		scaled -= div64_u64(scaled * rem, total);
+	/*
+	 * Since the stime:utime ratio is already an approximation through
+	 * the sampling, reducing its resolution isn't too big a deal.
+	 * And since total = stime+utime; the total_fls will be the biggest
+	 * of the two;
+	 */
+	if (total_fls > 32) {
+		shift = total_fls - 32; /* a = 2^shift */
+		stime >>= shift;
+		total >>= shift;
+		stime_fls -= shift;
+		total_fls -= shift;
+	}
+
+	/*
+	 * Since we limited stime to 32bits the multiplication reduced to 96bit.
+	 * stime * rtime = stime * (rl + rh * 2^32) =
+	 *                 stime * rl + stime * rh * 2^32
+	 */
+	lo = stime * rtime_lo;
+	hi = stime * rtime_hi;
+	t = hi << 32;
+	lo += t;
+	if (lo < t) /* overflow */
+		hi += 0x100000000L;
+	hi >>= 32;
+
+	/*
+	 * Pick the 64 most significant bits for division into @lo.
+	 *
+	 * NOTE: res_fls is an approximation (upper-bound) do we want to
+	 * properly calculate?
+	 */
+	shift = 0;
+	res_fls = stime_fls + rtime_fls;
+	if (res_fls > 64) {
+		shift = res_fls - 64; /* b = 2^shift */
+		lo >>= shift;
+		hi <<= 64 - shift;
+		lo |= hi;
 	}
 
-	return (__force cputime_t) scaled;
+	/*
+	 * So here we do:
+	 *
+	 *   ((stime / a) * rtime / b)
+	 *   ------------------------- / b
+	 *         (total / a)
+	 */
+	return div_u64(lo, total) >> shift;
 }
 
 /*

2013-04-11 14:50:20

by Stanislaw Gruszka

Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

On Thu, Apr 11, 2013 at 03:45:46PM +0200, Peter Zijlstra wrote:
> On Tue, 2013-03-26 at 15:01 +0100, Stanislaw Gruszka wrote:
> > Thoughts?
>
> Would something like the below work?

Not sure, I need to validate that.

> (warning: it's never even been near a compiler)

It compiles, but probably has some bugs.
> + /*
> + * Since the stime:utime ratio is already an approximation through
> + * the sampling, reducing its resolution isn't too big a deal.
> + * And since total = stime+utime; the total_fls will be the biggest
> + * of the two;
> + */
> + if (total_fls > 32) {
> + shift = total_fls - 32; /* a = 2^shift */
> + stime >>= shift;
> + total >>= shift;
> + stime_fls -= shift;
> + total_fls -= shift;
> + }
> +
> + /*
> + * Since we limited stime to 32bits the multiplication reduced to 96bit.
> + * stime * rtime = stime * (rl + rh * 2^32) =
> + * stime * rl + stime * rh * 2^32
> + */
> + lo = stime * rtime_lo;
> + hi = stime * rtime_hi;
> + t = hi << 32;
> + lo += t;
> + if (lo < t) /* overflow */
> + hi += 0x100000000L;
> + hi >>= 32;

I do not understand why we shift hi value here, is that correct?

> + /*
> + * Pick the 64 most significant bits for division into @lo.
> + *
> + * NOTE: res_fls is an approximation (upper-bound) do we want to
> + * properly calculate?
> + */
> + shift = 0;
> + res_fls = stime_fls + rtime_fls;
> + if (res_fls > 64) {
> + shift = res_fls - 64; /* b = 2^shift */
> + lo >>= shift;
> + hi <<= 64 - shift;
> + lo |= hi;
> }
>
> - return (__force cputime_t) scaled;
> + /*
> + * So here we do:
> + *
> + * ((stime / a) * rtime / b)
> + * --------------------------- / b
> + * (total / a)
> + */
> + return div_u64(lo, total) >> shift;

I think it should be:

((stime / a) * rtime / b)
--------------------------- * b
(total / a)

return div_u64(lo, total) << shift;

Thanks
Stanislaw

2013-04-11 15:38:40

by Linus Torvalds

Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

On Thu, Apr 11, 2013 at 6:45 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2013-03-26 at 15:01 +0100, Stanislaw Gruszka wrote:
>> Thoughts?
>
> Would something like the below work?

Ugh, this is hard to think about, it's also fairly inefficient.

> static cputime_t scale_stime(u64 stime, u64 rtime, u64 total)
> {
> - u64 rem, res, scaled;
> + int stime_fls = fls64(stime);
> + int total_fls = fls64(total);
> + int rtime_fls = fls64(rtime);

Doing "fls64()" unconditionally is quite expensive on some
architectures, and if I am not mistaken, the *common* case (by far) is
that all these values fit in 32 bits, no?

So let's re-think the whole thing entirely. First, let's throw away
the uncommon case, and we'll come back to it later:

	if (unlikely((stime | total | rtime) >> 32))
		return uncommon_case(stime, total, rtime);

and now we know we have the simple case where everything is in 32
bits, and we can just do the trivial

	/* Make sure gcc understands that this is a 32x32->64 multiply,
	   followed by a 64/32->64 divide */
	return div_u64((u64) (u32) stime * (u64) (u32) rtime, (u32) total);

which is the *cheapest* possible case of scale_stime(), with the
simplified multiply and divide. Agreed? This is cheap even on 32-bit.
Well, relatively.

Do we agree that this is (a) the common case that we need to worry
about from performance and (b) simple, understandable and efficient?

Now, let's look at the uncommon case. And I lied: I'm not actually going
to do it as an "uncommon_case()" function, I'm going to do it as a "let
us simplify the uncommon case until it *looks* like the common case".
IOW, in this uncommon path the aim is simply to reduce
stime/rtime/total to the point where they fit in 32 bits. Ok? And
let's keep it simple again.

So *now*, once we are in the uncommon case, let's start counting bits.
Like this:

	/* We know one of the values has a bit set in the high 32 bits */
	for (;;) {
		/* Make sure "stime" is the bigger of stime/rtime */
		if (rtime > stime) {
			u64 tmp = stime; stime = rtime; rtime = tmp;
		}

		/* Do we need to balance stime/rtime bits? */
		if (stime >> 32) {
			if (rtime >> 31)
				goto drop_precision;

			/* We can grow rtime and shrink stime and try to
			   make them both fit */
			rtime <<= 1;
			stime >>= 1;
			continue;
		}

		/* stime/rtime fits in 32 bits, how about total? */
		if (!(total >> 32))
			break;

drop_precision:
		/* We drop from stime, it has more bits than rtime */
		stime >>= 1;
		total >>= 1;
	}

The above is totally untested, but each step is pretty damn simple and
fairly cheap. Sure, it's a loop, but it's bounded to 32 (cheap)
iterations, and the normal case is that it's not done at all, or done
only a few times.

And the advantage is that the end result is always that simple
32x32/32 case that we started out with as the common case.

I dunno. Maybe I'm overlooking something, and the above is horrible,
but the above seems reasonably efficient if not optimal, and
*understandable*.

Linus

2013-04-11 17:31:25

by Peter Zijlstra

Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

On Thu, 2013-04-11 at 16:50 +0200, Stanislaw Gruszka wrote:

> > + /*
> > + * Since the stime:utime ratio is already an approximation through
> > + * the sampling, reducing its resolution isn't too big a deal.
> > + * And since total = stime+utime; the total_fls will be the biggest
> > + * of the two;
> > + */
> > + if (total_fls > 32) {
> > + shift = total_fls - 32; /* a = 2^shift */
> > + stime >>= shift;
> > + total >>= shift;
> > + stime_fls -= shift;
> > + total_fls -= shift;
> > + }
> > +
> > + /*
> > + * Since we limited stime to 32bits the multiplication reduced to 96bit.
> > + * stime * rtime = stime * (rl + rh * 2^32) =
> > + * stime * rl + stime * rh * 2^32
> > + */
> > + lo = stime * rtime_lo;
> > + hi = stime * rtime_hi;
> > + t = hi << 32;
> > + lo += t;
> > + if (lo < t) /* overflow */
> > + hi += 0x100000000L;
> > + hi >>= 32;
>
> I do not understand why we shift hi value here, is that correct?

Yes.. remember that we have:

stime * rl + stime * rh * 2^32

Now we get this 96bit value, but our two 64bit values overlap:

| w3 | w2 | w1 | w0 |
+----+----+----+----+
| lo |
| hi |

So what I do is I add the low word of hi to lo and shift the high word
of hi to get:

| hi | lo |

Two non-overlapping 64bit values where the high word of hi is always 0.

> > + /*
> > + * Pick the 64 most significant bits for division into @lo.
> > + *
> > + * NOTE: res_fls is an approximation (upper-bound) do we want to
> > + * properly calculate?
> > + */
> > + shift = 0;
> > + res_fls = stime_fls + rtime_fls;
> > + if (res_fls > 64) {
> > + shift = res_fls - 64; /* b = 2^shift */
> > + lo >>= shift;
> > + hi <<= 64 - shift;
> > + lo |= hi;
> > }
> > - return (__force cputime_t) scaled;
> > + /*
> > + * So here we do:
> > + *
> > + * ((stime / a) * rtime / b)
> > + * --------------------------- / b
> > + * (total / a)
> > + */
> > + return div_u64(lo, total) >> shift;
>
> I think it should be:
>
> ((stime / a) * rtime / b)
> --------------------------- * b
> (total / a)
>
> return div_u64(lo, total) << shift;

I think you're very right about that.. got my head twisted by staring
at this stuff too long.

2013-04-11 18:08:07

by Peter Zijlstra

Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

On Thu, 2013-04-11 at 08:38 -0700, Linus Torvalds wrote:
> On Thu, Apr 11, 2013 at 6:45 AM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, 2013-03-26 at 15:01 +0100, Stanislaw Gruszka wrote:
> >> Thoughts?
> >
> > Would something like the below work?
>
> Ugh, this is hard to think about, it's also fairly inefficient.
>
> > static cputime_t scale_stime(u64 stime, u64 rtime, u64 total)
> > {
> > - u64 rem, res, scaled;
> > + int stime_fls = fls64(stime);
> > + int total_fls = fls64(total);
> > + int rtime_fls = fls64(rtime);
>
> Doing "fls64()" unconditionally is quite expensive on some
> architectures,

Oh, I (wrongly it appears) assumed that fls was something cheap :/

> and if I am not mistaken, the *common* case (by far) is
> that all these values fit in 32 bits, no?

It depends on if we use cputime_jiffies.h or cputime_nsec.h and I'm
completely lost as to which we default to atm. But we sure can reduce
to 32 bits in most cases without too much problems.

But that would mean fls() and shifting again for nsec based cputime.

I'll have a better read and think about the rest of your email but
that'll have to be tomorrow :/

2013-04-11 18:22:12

by Frederic Weisbecker

Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

On Thu, Apr 11, 2013 at 08:07:50PM +0200, Peter Zijlstra wrote:
> On Thu, 2013-04-11 at 08:38 -0700, Linus Torvalds wrote:
> > On Thu, Apr 11, 2013 at 6:45 AM, Peter Zijlstra <[email protected]> wrote:
> > > On Tue, 2013-03-26 at 15:01 +0100, Stanislaw Gruszka wrote:
> > >> Thoughts?
> > >
> > > Would something like the below work?
> >
> > Ugh, this is hard to think about, it's also fairly inefficient.
> >
> > > static cputime_t scale_stime(u64 stime, u64 rtime, u64 total)
> > > {
> > > - u64 rem, res, scaled;
> > > + int stime_fls = fls64(stime);
> > > + int total_fls = fls64(total);
> > > + int rtime_fls = fls64(rtime);
> >
> > Doing "fls64()" unconditionally is quite expensive on some
> > architectures,
>
> Oh, I (wrongly it appears) assumed that fls was something cheap :/
>
> > and if I am not mistaken, the *common* case (by far) is
> > that all these values fit in 32 bits, no?
>
> It depends on if we use cputime_jiffies.h or cputime_nsec.h and I'm
> completely lost as to which we default to atm. But we sure can reduce
> to 32 bits in most cases without too much problems.

We default to the jiffies. The nsecs case is used only for full dynticks
accounting and ia64 precise accounting.

2013-04-11 18:22:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

On Thu, Apr 11, 2013 at 11:07 AM, Peter Zijlstra <[email protected]> wrote:
>
> Oh, I (wrongly it appears) assumed that fls was something cheap :/

It often is. Particularly on modern machines, because all popcount and
leading zero counting ends up being interesting to some people.

On older machines, it's often a bit-at-a-time thing. We don't even try
to support i386 any more, but on Atom and P4 it's something like 16
cycles for bsrl, and older cores were worse. So doing three of them
when not needed seems a bit excessive..

In contrast, on a Core2, I think it's just a single cycle.

Non-x86 architectures end up being the same - some have fast
instructions for it, others don't do it at all and end up doing things
with bitmasking and shifting.

Linus
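[Editor's note: on architectures without a hardware count-leading-zeros instruction, the generic fallback really is masking and shifting. A user-space sketch of a binary-search fls64() follows; this is illustrative only, not the kernel's exact asm-generic implementation:]

```c
#include <stdint.h>

/* Generic fls64(): index (1..64) of the most significant set bit,
 * 0 if x == 0. Binary-search over the halves of the word; a bit-serial
 * fallback would be slower still. */
static int fls64_generic(uint64_t x)
{
	int r = 64;

	if (!x)
		return 0;
	if (!(x & 0xffffffff00000000ULL)) { x <<= 32; r -= 32; }
	if (!(x & 0xffff000000000000ULL)) { x <<= 16; r -= 16; }
	if (!(x & 0xff00000000000000ULL)) { x <<= 8;  r -= 8;  }
	if (!(x & 0xf000000000000000ULL)) { x <<= 4;  r -= 4;  }
	if (!(x & 0xc000000000000000ULL)) { x <<= 2;  r -= 2;  }
	if (!(x & 0x8000000000000000ULL)) { r -= 1; }
	return r;
}
```

Even this branch-heavy version is six compare-and-shift steps per call, which is why doing three of them unconditionally on the hot path is unattractive.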

2013-04-11 18:26:13

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

On Thu, Apr 11, 2013 at 08:22:03PM +0200, Frederic Weisbecker wrote:
> On Thu, Apr 11, 2013 at 08:07:50PM +0200, Peter Zijlstra wrote:
> > On Thu, 2013-04-11 at 08:38 -0700, Linus Torvalds wrote:
> > > On Thu, Apr 11, 2013 at 6:45 AM, Peter Zijlstra <[email protected]> wrote:
> > > > On Tue, 2013-03-26 at 15:01 +0100, Stanislaw Gruszka wrote:
> > > >> Thoughts?
> > > >
> > > > Would something like the below work?
> > >
> > > Ugh, this is hard to think about, it's also fairly inefficient.
> > >
> > > > static cputime_t scale_stime(u64 stime, u64 rtime, u64 total)
> > > > {
> > > > - u64 rem, res, scaled;
> > > > + int stime_fls = fls64(stime);
> > > > + int total_fls = fls64(total);
> > > > + int rtime_fls = fls64(rtime);
> > >
> > > Doing "fls64()" unconditionally is quite expensive on some
> > > architectures,
> >
> > Oh, I (wrongly it appears) assumed that fls was something cheap :/
> >
> > > and if I am not mistaken, the *common* case (by far) is
> > > that all these values fit in 32 bits, no?
> >
> > It depends on if we use cputime_jiffies.h or cputime_nsec.h and I'm
> > completely lost as to which we default to atm. But we sure can reduce
> > to 32 bits in most cases without too much problems.
>
> We default to the jiffies. The nsecs case is used only for full dynticks
> accounting and ia64 precise accounting.

Oh and in the latter case there is no scaling.

2013-04-12 07:56:13

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

On Thu, 2013-04-11 at 08:38 -0700, Linus Torvalds wrote:
> So *now*, once we are in the uncommon case, let's start counting bits.
> Like this:
>
> /* We know one of the values has a bit set in the high 32 bits */
> for (;;) {
> 	/* Make sure "stime" is the bigger of stime/rtime */
> 	if (rtime > stime) {
> 		u64 tmp = stime; stime = rtime; rtime = tmp;
> 	}
>
> 	/* Do we need to balance stime/rtime bits? */
> 	if (stime >> 32) {
> 		if (rtime >> 31)
> 			goto drop_precision;
>
> 		/* We can grow rtime and shrink stime and try to make them both fit */
> 		rtime <<= 1;
> 		stime >>= 1;
> 		continue;
> 	}
>
> 	/* stime/rtime fits in 32 bits, how about total? */
> 	if (!(total >> 32))
> 		break;
>
> drop_precision:
> 	/* We drop from stime, it has more bits than rtime */
> 	stime >>= 1;
> 	total >>= 1;
> }
>
> The above is totally untested, but each step is pretty damn simple and
> fairly cheap. Sure, it's a loop, but it's bounded to 32 (cheap)
> iterations, and the normal case is that it's not done at all, or done
> only a few times.

Right, it gets gradually heavier the bigger the numbers get; which is
more and more unlikely.

> And the advantage is that the end result is always that simple
> 32x32/32 case that we started out with as the common case.
>
> I dunno. Maybe I'm overlooking something, and the above is horrible,
> but the above seems reasonably efficient if not optimal, and
> *understandable*.

I suppose that entirely depends on what one is used to ;-) I had to
stare rather hard at it for a little while.

But yes, you take it one step further and are willing to ditch rtime
bits too, and I suppose that's fine.

Should work... Stanislaw, could you stick this into your userspace
thingy and verify the numbers are sane enough?
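[Editor's note: dropped into a user-space harness as Peter suggests, Linus's loop slots into a complete scale_stime() roughly like this - a sketch with plain C typedefs and a plain 64/32 division in place of the kernel's div helpers; the caller is assumed to guarantee total != 0:]

```c
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

/* Scale stime by rtime/total while avoiding 64x64 multiplication
 * overflow: shift the operands until the final multiply fits in 64 bits
 * (Linus's first variant: balance stime/rtime first, then check total).
 * Caller must ensure total != 0. */
static u64 scale_stime(u64 stime, u64 rtime, u64 total)
{
	for (;;) {
		/* Make sure "stime" is the bigger of stime/rtime */
		if (rtime > stime) {
			u64 tmp = stime; stime = rtime; rtime = tmp;
		}

		/* Do we need to balance stime/rtime bits? */
		if (stime >> 32) {
			if (rtime >> 31)
				goto drop_precision;

			/* Grow rtime and shrink stime, try to make both fit */
			rtime <<= 1;
			stime >>= 1;
			continue;
		}

		/* stime/rtime fit in 32 bits, how about total? */
		if (!(total >> 32))
			break;

drop_precision:
		/* We drop from stime, it has more bits than rtime */
		stime >>= 1;
		total >>= 1;
	}

	/* Now a simple 32x32->64 multiply and 64/32 divide suffices.
	 * The swap above is harmless: stime * rtime is symmetric. */
	return (u64)(u32)stime * (u64)(u32)rtime / (u32)total;
}
```

With e.g. stime = 2^40, rtime = 2^42, total = 3*2^40, the loop shifts everything down to a 32-bit multiply and yields roughly 2^42/3, within the 0.1% error bound Stanislaw reports below.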

2013-04-13 14:49:00

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

On Fri, Apr 12, 2013 at 09:55:56AM +0200, Peter Zijlstra wrote:
> > The above is totally untested, but each step is pretty damn simple and
> > fairly cheap. Sure, it's a loop, but it's bounded to 32 (cheap)
> > iterations, and the normal case is that it's not done at all, or done
> > only a few times.
>
> Right it gets gradually heavier the bigger the numbers get; which is
> more and more unlikely.
>
> > And the advantage is that the end result is always that simple
> > 32x32/32 case that we started out with as the common case.
> >
> > I dunno. Maybe I'm overlooking something, and the above is horrible,
> > but the above seems reasonably efficient if not optimal, and
> > *understandable*.
>
> I suppose that entirely matters on what one is used to ;-) I had to
> stare rather hard at it for a little while.
>
> But yes, you take it one step further and are willing to ditch rtime
> bits too and I suppose that's fine.
>
> Should work,.. Stanislaw could you stick this into your userspace
> thingy and verify the numbers are sane enough?

It works fine - gives relative error less than 0.1% for very big
numbers.

For the record I'm attaching test program and script.

Thanks
Stanislaw


Attachments:
(No filename) (1.17 kB)
scale_stime5.c (1.55 kB)
scale_stime_test5.py (999.00 B)

2013-04-13 14:54:49

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

On Thu, Apr 11, 2013 at 08:38:37AM -0700, Linus Torvalds wrote:
> /* We know one of the values has a bit set in the high 32 bits */
> for (;;) {
> /* Make sure "stime" is the bigger of stime/rtime */
> if (rtime > stime) {
> u64 tmp = stime; stime = rtime; rtime = tmp;
> }

For most workloads rtime is bigger than stime, so swapping those would
save some cycles on common cases. Otherwise this algorithm looks great.

Thanks
Stanislaw

2013-04-13 18:44:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

On Sat, Apr 13, 2013 at 7:49 AM, Stanislaw Gruszka <[email protected]> wrote:
>
> It works fine - gives relative error less than 0.1% for very big
> numbers.

So I was assuming that the values were all roughly in the same range.

When that is not true (say, very big rtime and very big total, very
small stime), it's possible that we end up first normalizing
rtime-vs-stime, and then dropping precision due to big 'total'. And
that might cause excessive precision loss (and I think "0.1%" is
excessive, even if it's probably perfectly fine).

I suspect I should have done the "drop precision due to 'total' being
out of range" step *first*, and only then balanced rtime/stime after
we've scaled down total. That might avoid any unnecessary precision
loss, because bringing 'total' into the 32-bit range will continually
shrink the much larger (unbalanced) number, rather than shrink
something that was already balanced.

But I doubt it ever matters in practice.

The *big* loss (well, relatively - the worst case I've seen with your
test program is with 'err' being 0.021%) comes because if we have to
drop precision, and both rtime and stime are big, we have to scale
down 'total' by one bit every time. And we scale down the bigger of
rtime/stime too, but we basically have twice as many bits to shave
off rtime/stime, since there are two values (even if we pick the
biggest one, eventually we'll start alternating because shaving bits
will make the other one bigger).

So we may end up scaling 'total' down to much less than 32 bits, and
that's how you get the "big" errors in the 0.02% range.

The good news is that
(a) this requires that rtime/stime really both are big numbers
(b) this only happens with really, really big numbers (i.e. very much
your "10 years of 4096 threads" at 1000 Hz kind of numbers)
(c) even then the error isn't catastrophic.

So I think my algorithm could be improved a bit (to do the total
scaling *before* doing the scaling of rtime-vs-stime), but I think
it's quite usable.

Linus

PS. This is the "Make sure 'total' fits in 32 bits first" version. Not
really tested, but it's just changing the order of operations a bit.

/* We know one of the values has a bit set in the high 32 bits */
for (;;) {
	/* Make sure "rtime" is the bigger of stime/rtime */
	if (stime > rtime) {
		u64 tmp = rtime; rtime = stime; stime = tmp;
	}

	/* Make sure 'total' fits in 32 bits */
	if (total >> 32)
		goto drop_precision;

	/* Does rtime (and thus stime) fit in 32 bits? */
	if (!(rtime >> 32))
		break;

	/* Can we just balance rtime/stime rather than dropping bits? */
	if (stime >> 31)
		goto drop_precision;

	/* We can grow stime and shrink rtime and try to make them both fit */
	stime <<= 1;
	rtime >>= 1;
	continue;

drop_precision:
	/* We drop from rtime, it has more bits than stime */
	rtime >>= 1;
	total >>= 1;
}
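[Editor's note: measuring the precision loss of either variant in user space only requires an exact 128-bit reference to compare against. A sketch along these lines - helper names are made up, and __uint128_t assumes GCC/Clang on a 64-bit target:]

```c
#include <stdint.h>

typedef uint64_t u64;

/* Exact stime * rtime / total using 128-bit arithmetic: the reference
 * value any scale_stime() approximation is measured against. */
static u64 scale_stime_exact(u64 stime, u64 rtime, u64 total)
{
	return (u64)(((__uint128_t)stime * rtime) / total);
}

/* Relative error of an approximation against the exact value,
 * in parts per million (caller ensures exact != 0). */
static u64 error_ppm(u64 approx, u64 exact)
{
	u64 diff = approx > exact ? approx - exact : exact - approx;

	return (u64)(((__uint128_t)diff * 1000000) / exact);
}
```

A harness like Stanislaw's then feeds in many (stime, rtime, total) triples and checks that error_ppm() of the shifting loop stays under, say, 1000 (0.1%).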

2013-04-16 10:39:21

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

On Sat, Apr 13, 2013 at 11:44:54AM -0700, Linus Torvalds wrote:
> PS. This is the "Make sure 'total' fits in 32 bits first" version. Not
> really tested, but it's just changing the order of operations a bit.
>
> /* We know one of the values has a bit set in the high 32 bits */
> for (;;) {
> 	/* Make sure "rtime" is the bigger of stime/rtime */
> 	if (stime > rtime) {
> 		u64 tmp = rtime; rtime = stime; stime = tmp;
> 	}
>
> 	/* Make sure 'total' fits in 32 bits */
> 	if (total >> 32)
> 		goto drop_precision;
>
> 	/* Does rtime (and thus stime) fit in 32 bits? */
> 	if (!(rtime >> 32))
> 		break;
>
> 	/* Can we just balance rtime/stime rather than dropping bits? */
> 	if (stime >> 31)
> 		goto drop_precision;
>
> 	/* We can grow stime and shrink rtime and try to make them both fit */
> 	stime <<= 1;
> 	rtime >>= 1;
> 	continue;
>
> drop_precision:
> 	/* We drop from rtime, it has more bits than stime */
> 	rtime >>= 1;
> 	total >>= 1;
> }

It also passes the 0.1% relative error threshold on my tests. Decreasing
the error threshold to 0.02% failed. I didn't check other values or
measure how frequently the 0.02% check fails on each version; I assume
this one is better :-)

So regarding relative error this algorithm is fine - there is no
multiplication overflow that makes the scaled numbers bogus. Then
I looked at this algorithm in the context of how it is used ...

Raw stime/rtime/total values increase at jiffies resolution, and so does
scaled_stime if we do not drop precision. For bigger numbers, since we
drop precision, scaled_stime will grow in chunks. How big the chunk is
depends on how big the numbers are overall and on the stime/total ratio.
For example: with stime = 0.5*total and 128 threads, after 1 year of CPU
execution the chunk will be 1024 jiffies.
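[Editor's note: that chunk estimate is easy to reproduce. 128 CPU-bound threads over one year at HZ=1000 accumulate roughly 4e12 jiffies of total, a 42-bit value, so about 10 low bits are dropped to bring total under 32 bits, and scaled stime advances in steps of 2^10 = 1024 jiffies. A rough sketch that ignores the additional rtime/stime balancing shifts, which can drop slightly more:]

```c
#include <stdint.h>

/* Number of low bits dropped from 'total' before it fits in 32 bits;
 * the scaled stime then advances in chunks of (1 << bits) jiffies. */
static int precision_drop_bits(uint64_t total)
{
	int bits = 0;

	while ((total >> bits) >> 32)
		bits++;
	return bits;
}
```

For total = 128ULL * 365 * 24 * 3600 * 1000 (about 4.04e12 jiffies) this returns 10, i.e. the 1024-jiffy chunk from the example above.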

We use scaled stime value this way:

	if (total)
		stime = scale_stime(stime, rtime, total);
	else
		stime = rtime;

	/*
	 * If the tick based count grows faster than the scheduler one,
	 * the result of the scaling may go backward.
	 * Let's enforce monotonicity.
	 */
	prev->stime = max(prev->stime, stime);
	prev->utime = max(prev->utime, rtime - prev->stime);

	*ut = prev->utime;
	*st = prev->stime;

Since rtime increases but the scaled stime does not, stime will be
accounted as prev->utime. Then, after chunk jiffies, stime will jump and
we will get it accounted in prev->stime as well. As a result we account
more CPU time than the process actually used. This error accumulates
depending on how frequently cputime_adjust() is called for the process.

As a solution we could simply stop updating the accounting if
prev->stime + prev->utime is already bigger than rtime. For example:

 	rtime = nsecs_to_cputime(curr->sum_exec_runtime);
 
+	/*
+	 * Update userspace visible utime/stime values only if actual execution
+	 * time is bigger than already exported. Note that it can happen that we
+	 * provided bigger values due to scaling inaccuracies with big numbers.
+	 */
+	if (prev->stime + prev->utime >= rtime)
+		goto out;
+
 	if (total)
 		stime = scale_stime(stime, rtime, total);
 	else
@@ -573,6 +581,7 @@ static void cputime_adjust(struct task_cputime *curr,
 	prev->stime = max(prev->stime, stime);
 	prev->utime = max(prev->utime, rtime - prev->stime);
 
+out:
 	*ut = prev->utime;
 	*st = prev->stime;
 }

This should help with erroneously accounting more CPU time than the
process actually uses. As a disadvantage, userspace will see CPU time
increase in chunks, but I think that is better than seeing values much
bigger than the correct ones (and for 99.9% of user cases it does not
matter).

Stanislaw

2013-04-30 14:02:59

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: [tip:sched/core] sched: Lower chances of cputime scaling overflow

On Sat, Apr 13, 2013 at 11:44:54AM -0700, Linus Torvalds wrote:
> PS. This is the "Make sure 'total' fits in 32 bits first" version. Not
> really tested, but it's just changing the order of operations a bit.
>
> /* We know one of the values has a bit set in the high 32 bits */
> for (;;) {
> 	/* Make sure "rtime" is the bigger of stime/rtime */
> 	if (stime > rtime) {
> 		u64 tmp = rtime; rtime = stime; stime = tmp;
> 	}
>
> 	/* Make sure 'total' fits in 32 bits */
> 	if (total >> 32)
> 		goto drop_precision;
>
> 	/* Does rtime (and thus stime) fit in 32 bits? */
> 	if (!(rtime >> 32))
> 		break;
>
> 	/* Can we just balance rtime/stime rather than dropping bits? */
> 	if (stime >> 31)
> 		goto drop_precision;
>
> 	/* We can grow stime and shrink rtime and try to make them both fit */
> 	stime <<= 1;
> 	rtime >>= 1;
> 	continue;
>
> drop_precision:
> 	/* We drop from rtime, it has more bits than stime */
> 	rtime >>= 1;
> 	total >>= 1;
> }

For reference I'm attaching the test program and script I used to
validate the algorithm.

It generates a lot of (possibly realistic) rtime, total, stime values for
4096 threads running for 10 years, then compares the scaled stime values
calculated by the above algorithm with precise Python values.

For all values generated, the scaled stime relative error was less
than 0.05%.
Stanislaw


Attachments:
(No filename) (1.46 kB)
scale_stime.c (1.59 kB)
scale_stime_test.py (998.00 B)