On Mon, Mar 18, 2013 at 03:49:02AM -0700, tip-bot for Frederic Weisbecker wrote:

> Commit-ID: d9a3c9823a2e6a543eb7807fb3d15d8233817ec5

> Gitweb: http://git.kernel.org/tip/d9a3c9823a2e6a543eb7807fb3d15d8233817ec5

> Author: Frederic Weisbecker <[email protected]>

> AuthorDate: Wed, 20 Feb 2013 18:54:55 +0100

> Committer: Frederic Weisbecker <[email protected]>

> CommitDate: Wed, 13 Mar 2013 18:18:14 +0100

>

> sched: Lower chances of cputime scaling overflow

>

> Some users have reported that after running a process with

> hundreds of threads on intensive CPU-bound loads, the cputime

> of the group started to freeze after a few days.

>

> This is due to how we scale the tick-based cputime against

> the scheduler precise execution time value.

>

> We add the values of all threads in the group and we multiply

> that against the sum of the scheduler exec runtime of the whole

> group.

>

> This easily overflows after a few days/weeks of execution.

>

> A proposed solution to solve this was to compute that multiplication

> on stime instead of utime:

> 62188451f0d63add7ad0cd2a1ae269d600c1663d

> ("cputime: Avoid multiplication overflow on utime scaling")

>

> The rationale behind that was that it's easy for a thread to

> spend most of its time in userspace under intensive CPU-bound workload

> but it's much harder to do CPU-bound intensive long run in the kernel.

>

> This postulate got defeated when a user recently reported he was still

> seeing cputime freezes after the above patch. The workload that

> triggers this issue relates to intensive networking workloads where

> most of the cputime is consumed in the kernel.

>

> To reduce much more the opportunities for multiplication overflow,

> lets reduce the multiplication factors to the remainders of the division

> between sched exec runtime and cputime. Assuming the difference between

> these shouldn't ever be that large, it could work on many situations.

>

> This gets the same results as in the upstream scaling code except for

> a small difference: the upstream code always rounds the results to

> the nearest integer not greater to what would be the precise result.

> The new code rounds to the nearest integer either greater or not

> greater. In practice this difference probably shouldn't matter but

> it's worth mentioning.

>

> If this solution appears not to be enough in the end, we'll

> need to partly revert back to the behaviour prior to commit

> 0cf55e1ec08bb5a22e068309e2d8ba1180ab4239

> ("sched, cputime: Introduce thread_group_times()")

>

> Back then, the scaling was done on exit() time before adding the cputime

> of an exiting thread to the signal struct. And then we'll need to

> scale one-by-one the live threads cputime in thread_group_cputime(). The

> drawback may be a slightly slower code on exit time.

Sorry for questioning this patch this late; I thought this solution

was fine. However, before providing a backport to the RHEL kernel, I did

some analysis and testing of this algorithm. I discovered that it is

pretty easy to hit multiplication overflow, for example:

rtime: 16877346691

total: 12215365298

stime: 4886146119

will give a scaled stime of 5240812402, whereas the correct value should be

6750938676. Since the problem can be triggered at random, and we cannot

give any guarantee that it will not happen for any runtime longer

than 50 days (for a single-threaded application), I do not consider

this patch a good fix.
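For reference, the failure mode is plain 64-bit wrap-around in the stime * rtime product. It can be reproduced in user space with a throwaway sketch (scale_naive() and scale_exact() are names invented here, not kernel helpers):

```c
#include <assert.h>

typedef unsigned long long u64;

/* Naive 64-bit scaling: the product stime * rtime silently wraps past
 * 2^64 for the values quoted above. */
static u64 scale_naive(u64 stime, u64 rtime, u64 total)
{
	return stime * rtime / total;
}

/* Reference result using the gcc/clang 128-bit integer extension. */
static u64 scale_exact(u64 stime, u64 rtime, u64 total)
{
	return (u64)((unsigned __int128)stime * rtime / total);
}
```

With the values above, scale_exact() returns 6750938676, while the wrapped 64-bit product makes scale_naive() return something far smaller.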

Of course reverting 0cf55e1ec08bb5a22e068309e2d8ba1180ab4239 will not

help either, since the overflow is possible with a single thread.

Can we remove the rtime multiplication altogether and just use the raw

utime and stime values? I think not. I don't know the details, but it is

possible to have a top "exploit" that eats a lot of cputime while "hiding"

itself from the tick, so it would not show up correctly as a process that

utilizes a lot of CPU. Peter mentioned that some time ago.

Given that, and after analyzing some other possibilities, I think the

best (or only possibly good) solution for this problem is to use 128-bit

math for the calculation. I tested the code below [1] in user space and

it gives good results. It uses a 128-bit multiplication and a simplified

128-bit division.

This will require adding the 128-bit math primitives from

http://thread.gmane.org/gmane.linux.kernel/1381801

which were initially required for the deadline scheduler, but were later

dropped from its requirements. I hope it is OK to add them, or some

improved version. I think sooner or later 128-bit math will be required

in more places in the kernel.

Thoughts?

Stanislaw

[1]

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

typedef unsigned long long u64;
typedef unsigned __int128 u128;

#define clzll(x) __builtin_clzll(x)

static u64 div_u64(u64 a, u64 b)
{
	return a / b;
}

static u128 mul_u64_u64(u64 a, u64 b)
{
	u128 res = a;

	res *= b;
	return res;
}

static u64 div128_u64(u128 dividend, u64 divisor)
{
	u64 high = dividend >> 64;
	u64 quot;

	if (high == 0) {
		quot = div_u64(dividend, divisor);
	} else {
		/* Drop enough low bits that the dividend fits in 64 bits. */
		int n = 64 - clzll(high);

		if ((divisor >> n) == 0) {
			/* Quotient does not fit in 64 bits. */
			return 0;
		}

		quot = div_u64(dividend >> n, divisor >> n);

		/*
		 * The estimate may be one too big; step it down and
		 * correct upward. Note the 128-bit multiply: a 64-bit
		 * one could wrap and break the check.
		 */
		if (quot != 0)
			quot--;
		if ((dividend - (u128)quot * divisor) >= divisor)
			quot++;
	}

	return quot;
}

u64 _scale_time(u64 rtime, u64 total, u64 time)
{
	u128 tmp = mul_u64_u64(time, rtime);

	return div128_u64(tmp, total);
}

u64 scale_time(u64 stime, u64 rtime, u64 total)
{
	u64 time, scaled;
	u64 utime = total - stime;
	int use_utime;

	/* Scale the smaller of utime/stime and derive the other. */
	if (utime < stime) {
		use_utime = 1;
		time = utime;
	} else {
		use_utime = 0;
		time = stime;
	}

	scaled = _scale_time(rtime, total, time);
	if (use_utime)
		scaled = rtime - scaled;

	return scaled;
}

int main(int argc, char *argv[])
{
	u64 rtime, total, stime, scaled;

	if (argc != 4)
		return 1;

	rtime = strtoull(argv[1], NULL, 10);
	total = strtoull(argv[2], NULL, 10);
	stime = strtoull(argv[3], NULL, 10);

	assert(total >= stime);

	scaled = scale_time(stime, rtime, total);
	printf("%llu\n", scaled);

	return 0;
}

2013/3/26 Stanislaw Gruszka <[email protected]>:

> [...]

>

> Taking that, and analyzing some other possibilities, I think the best

> (or only possibly good) solution for this problem is to use 128-bit

> math for the calculation.

>

> Thoughts?

I stared at that code for quite some time already and I can't come up

with a better solution.

Of course 128-bit ops are very expensive, so to help you evaluate

the situation: this is going to happen on every call to

task_cputime_adjusted() and thread_group_adjusted(), namely:

* Some proc files read

* sys_times()

* thread group exit

Thoughts?

On Tue, Mar 26, 2013 at 03:19:25PM +0100, Frederic Weisbecker wrote:

> I starred at that code for quite some time already and I can't come up

> with a better solution.

>

> Of course 128 bits ops are very expensive, so to help you evaluating

> the situation, this is going to happen on every call to

> task_cputime_adjusted() and thread_group_adjusted(), namely:

>

> * Some proc files read

> * sys_times()

> * thread group exit

I need to think about whether we can get rid of thread_group_cputime_adjusted()

at exit(); perhaps this is possible without causing problems.

The 128-bit division is optimized when the upper 64 bits are zero. The

128-bit multiplication can be avoided if we are sure the 64-bit multiply

will not overflow, for example:

	if (fls(stime) + fls(rtime) < 64)
		scale_64();
	else
		scale_128();

Basically 99.9% of processes will take the scale_64() path. The fix is

only needed for processes that utilize a lot of CPU time over a long

period, and hence have big stime and rtime values, which is not common.
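That dispatch could look roughly like this (a user-space sketch; scale_stime_sketch() and fls64_sketch() are invented names, with fls64_sketch() standing in for the kernel's fls64()):

```c
#include <assert.h>

typedef unsigned long long u64;

/* Position of the highest set bit, 1-based; 0 for x == 0. */
static int fls64_sketch(u64 x)
{
	return x ? 64 - __builtin_clzll(x) : 0;
}

static u64 scale_stime_sketch(u64 stime, u64 rtime, u64 total)
{
	/*
	 * If the bit widths of the two factors sum to at most 64,
	 * stime * rtime < 2^64 and the cheap path cannot overflow.
	 */
	if (fls64_sketch(stime) + fls64_sketch(rtime) <= 64)
		return stime * rtime / total;	/* the scale_64() path */

	/* The scale_128() path: widen to 128 bits before dividing. */
	return (u64)((unsigned __int128)stime * rtime / total);
}
```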

I'm also thinking about removing cputime scaling entirely. Obviously this

would be best for performance. I think we could avoid the multiplication

in the kernel; it can be done in user space if we export rtime. So top

could be modified to do the cputime scaling by itself. That way we would

still avoid the top-hiding "exploit", but the precision of the times(2)

syscall values would get worse (I think there are a lot of programs that

use this call and may expect stime/utime to be precise).

Stanislaw

* Frederic Weisbecker <[email protected]> wrote:

> Of course 128 bits ops are very expensive, so to help you evaluating the

> situation, this is going to happen on every call to task_cputime_adjusted() and

> thread_group_adjusted(), namely:

It's really only expensive for divisions. Addition and multiplication should be

straightforward and relatively low overhead, especially on 64-bit platforms.

Thanks,

Ingo

2013/4/10 Ingo Molnar <[email protected]>:

>

> * Frederic Weisbecker <[email protected]m> wrote:

>

>> Of course 128 bits ops are very expensive, so to help you evaluating the

>> situation, this is going to happen on every call to task_cputime_adjusted() and

>> thread_group_adjusted(), namely:

>

> It's really only expensive for divisions. Addition and multiplication should be

> straightforward and relatively low overhead, especially on 64-bit platforms.

Ok, well we still have one division in the scaling path. I'm mostly

worried about the thread group exit that makes use of it through

threadgroup_cputime_adjusted(). Not sure if we can avoid that.

* Frederic Weisbecker <[email protected]> wrote:

> Ok, well we still have one division in the scaling path. I'm mostly

> worried about the thread group exit that makes use of it through

> threadgroup_cputime_adjusted(). Not sure if we can avoid that.

I see, scale_stime()'s use of div64_u64_rem(), right?

I swapped out the details already, is there a link or commit ID that explains

where we hit 64-bit multiplication overflow? It's due to accounting in nanosecs,

spread out across thousands of tasks potentially, right?

But even with nsecs, a 64-bit variable ought to be able to hold hundreds of years

worth of runtime. How do we overflow?

Thanks,

Ingo

On Wed, Apr 10, 2013 at 07:32:19PM +0200, Ingo Molnar wrote:

> I see, scale_stime()'s use of div64_u64_rem(), right?

>

> I swapped out the details already, is there a link or commit ID that explains

> where we hit 64-bit multiplication overflow? It's due to accounting in nanosecs,

No, the values are converted to jiffies and then multiplied. CONFIG_HZ=1000

makes the issue happen earlier compared to CONFIG_HZ=100.

> spread out across thousands of tasks potentially, right?

Thousands of tasks (in one process) running on a machine with thousands

of CPUs will make the problem reproducible in hours or maybe minutes.

> But even with nsecs, a 64-bit variable ought to be able to hold hundreds of years

> worth of runtime. How do we overflow?

We do rtime * stime. If a process utilizes 100% CPU we have stime == rtime.

That will overflow once rtime >= sqrt(0xffffffffffffffff) jiffies, i.e.

rtime >= 49 days assuming CONFIG_HZ=1000. With 50 threads running on a

machine with more than 50 CPUs, that comes down to about 1 day. In a real

application stime is never equal to rtime, but it can be near half of it,

so the problem is still easily reachable.
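The 49-day figure checks out with trivial arithmetic (a throwaway helper, invented here; it assumes the 100%-CPU case above, i.e. one jiffy of stime per jiffy of rtime):

```c
#include <assert.h>

typedef unsigned long long u64;

/* Days of 100%-CPU runtime before rtime * stime (with stime == rtime)
 * exceeds 2^64: overflow starts around rtime = sqrt(2^64) = 2^32 jiffies. */
static u64 overflow_days(u64 hz)
{
	u64 jiffies = 1ULL << 32;

	return jiffies / hz / 86400;	/* 86400 seconds per day */
}
```

overflow_days(1000) gives 49, matching the text; with CONFIG_HZ=100 the same bound works out to 497 days.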

Using the quotient and remainder makes the problem less reproducible,

but it still happens, depending on the rtime / total ratio.

I wrote a user-space program that does the same calculations as the

kernel, and a Python script that generates stime/rtime/total values

(attached). It shows the problem is still reproducible. For example:

FAIL!

rtime: 25386774344 <- 293 days (one thread, HZ=1000)

total: 27958813690

stime: 8387644107

kernel: 8275815093 <- kernel value

python: 7616032303 <- correct value

FAIL!

rtime: 16877346691 <- 195 days (one thread, HZ=1000)

total: 12215365298

stime: 4886146119

kernel: 5240812402 <- kernel value

python: 6750938676 <- correct value

Stanislaw

scale_stime.c (1.00 kB)

scale_stime_test.py (889.00 B)


On Tue, 2013-03-26 at 15:01 +0100, Stanislaw Gruszka wrote:

> Thoughts?

Would something like the below work?

(warning: it's never even been near a compiler)

---

kernel/sched/cputime.c | 78 +++++++++++++++++++++++++++++++++++---------------

1 file changed, 55 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 699d597..465f6db 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -522,35 +522,67 @@ void account_idle_ticks(unsigned long ticks)
 }
 
 /*
- * Perform (stime * rtime) / total with reduced chances
- * of multiplication overflows by using smaller factors
- * like quotient and remainders of divisions between
- * rtime and total.
+ * Perform: (stime * rtime) / total
  */
 static cputime_t scale_stime(u64 stime, u64 rtime, u64 total)
 {
-	u64 rem, res, scaled;
+	int stime_fls = fls64(stime);
+	int total_fls = fls64(total);
+	int rtime_fls = fls64(rtime);
+	int shift, res_fls;
+	u32 rtime_hi = rtime >> 32, rtime_lo = rtime;
+	u64 hi, lo, t;
 
-	if (rtime >= total) {
-		/*
-		 * Scale up to rtime / total then add
-		 * the remainder scaled to stime / total.
-		 */
-		res = div64_u64_rem(rtime, total, &rem);
-		scaled = stime * res;
-		scaled += div64_u64(stime * rem, total);
-	} else {
-		/*
-		 * Same in reverse: scale down to total / rtime
-		 * then substract that result scaled to
-		 * to the remaining part.
-		 */
-		res = div64_u64_rem(total, rtime, &rem);
-		scaled = div64_u64(stime, res);
-		scaled -= div64_u64(scaled * rem, total);
+	/*
+	 * Since the stime:utime ratio is already an approximation through
+	 * the sampling, reducing its resolution isn't too big a deal.
+	 * And since total = stime+utime; the total_fls will be the biggest
+	 * of the two;
+	 */
+	if (total_fls > 32) {
+		shift = total_fls - 32;	/* a = 2^shift */
+		stime >>= shift;
+		total >>= shift;
+		stime_fls -= shift;
+		total_fls -= shift;
+	}
+
+	/*
+	 * Since we limited stime to 32bits the multiplication reduced to 96bit.
+	 * stime * rtime = stime * (rl + rh * 2^32) =
+	 *                 stime * rl + stime * rh * 2^32
+	 */
+	lo = stime * rtime_lo;
+	hi = stime * rtime_hi;
+	t = hi << 32;
+	lo += t;
+	if (lo < t)	/* overflow */
+		hi += 0x100000000L;
+	hi >>= 32;
+
+	/*
+	 * Pick the 64 most significant bits for division into @lo.
+	 *
+	 * NOTE: res_fls is an approximation (upper-bound) do we want to
+	 * properly calculate?
+	 */
+	shift = 0;
+	res_fls = stime_fls + rtime_fls;
+	if (res_fls > 64) {
+		shift = res_fls - 64;	/* b = 2^shift */
+		lo >>= shift;
+		hi <<= 64 - shift;
+		lo |= hi;
 	}
 
-	return (__force cputime_t) scaled;
+	/*
+	 * So here we do:
+	 *
+	 *   ((stime / a) * rtime / b)
+	 *   ------------------------- / b
+	 *        (total / a)
+	 */
+	return div_u64(lo, total) >> shift;
 }
 
 /*
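For experimenting outside the kernel, the patch can be rendered as a plain user-space function. This is a sketch, not the patch itself: all names are local inventions, fls64() is emulated with a GCC builtin, and the final shift is applied as << shift, the correction Stanislaw points out later in the thread:

```c
#include <assert.h>

typedef unsigned long long u64;
typedef unsigned int u32;

static int fls64_sketch(u64 x)
{
	return x ? 64 - __builtin_clzll(x) : 0;
}

static u64 scale_stime_user(u64 stime, u64 rtime, u64 total)
{
	int stime_fls = fls64_sketch(stime);
	int total_fls = fls64_sketch(total);
	int rtime_fls = fls64_sketch(rtime);
	int shift, res_fls;
	u32 rtime_hi = rtime >> 32, rtime_lo = rtime;
	u64 hi, lo, t;

	/* Reduce stime/total together until total fits in 32 bits. */
	if (total_fls > 32) {
		shift = total_fls - 32;		/* a = 2^shift */
		stime >>= shift;
		total >>= shift;
		stime_fls -= shift;
		total_fls -= shift;
	}

	/* 96-bit product: stime * rtime = stime*rl + stime*rh * 2^32 */
	lo = stime * (u64)rtime_lo;
	hi = stime * (u64)rtime_hi;
	t = hi << 32;
	lo += t;
	if (lo < t)			/* low 64 bits overflowed */
		hi += 0x100000000ULL;
	hi >>= 32;

	/* Keep the 64 most significant bits of the product in lo. */
	shift = 0;
	res_fls = stime_fls + rtime_fls;
	if (res_fls > 64) {
		shift = res_fls - 64;	/* b = 2^shift */
		lo >>= shift;
		hi <<= 64 - shift;
		lo |= hi;
	}

	/* ((stime / a) * rtime / b) / (total / a), scaled back by b */
	return lo / total << shift;
}
```

On Stanislaw's failing vector (rtime 16877346691, total 12215365298, stime 4886146119) this lands within a few units of the exact 6750938676, the error coming from the deliberately dropped low bits.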

On Thu, Apr 11, 2013 at 03:45:46PM +0200, Peter Zijlstra wrote:

> On Tue, 2013-03-26 at 15:01 +0100, Stanislaw Gruszka wrote:

> > Thoughts?

>

> Would something like the below work?

Not sure, I need to validate that.

> (warning: it's never even been near a compiler)

It compiles, but probably has some bugs.

> +	/*
> +	 * Since the stime:utime ratio is already an approximation through
> +	 * the sampling, reducing its resolution isn't too big a deal.
> +	 * And since total = stime+utime; the total_fls will be the biggest
> +	 * of the two;
> +	 */
> +	if (total_fls > 32) {
> +		shift = total_fls - 32;	/* a = 2^shift */
> +		stime >>= shift;
> +		total >>= shift;
> +		stime_fls -= shift;
> +		total_fls -= shift;
> +	}
> +
> +	/*
> +	 * Since we limited stime to 32bits the multiplication reduced to 96bit.
> +	 * stime * rtime = stime * (rl + rh * 2^32) =
> +	 *                 stime * rl + stime * rh * 2^32
> +	 */
> +	lo = stime * rtime_lo;
> +	hi = stime * rtime_hi;
> +	t = hi << 32;
> +	lo += t;
> +	if (lo < t)	/* overflow */
> +		hi += 0x100000000L;
> +	hi >>= 32;

I do not understand why we shift the hi value here; is that correct?

> +	/*
> +	 * Pick the 64 most significant bits for division into @lo.
> +	 *
> +	 * NOTE: res_fls is an approximation (upper-bound) do we want to
> +	 * properly calculate?
> +	 */
> +	shift = 0;
> +	res_fls = stime_fls + rtime_fls;
> +	if (res_fls > 64) {
> +		shift = res_fls - 64;	/* b = 2^shift */
> +		lo >>= shift;
> +		hi <<= 64 - shift;
> +		lo |= hi;
> 	}

>

> -	return (__force cputime_t) scaled;
> +	/*
> +	 * So here we do:
> +	 *
> +	 *   ((stime / a) * rtime / b)
> +	 *   ------------------------- / b
> +	 *        (total / a)
> +	 */
> +	return div_u64(lo, total) >> shift;

I think it should be:

	((stime / a) * rtime / b)
	------------------------- * b
	     (total / a)

	return div_u64(lo, total) << shift;

Thanks

Stanislaw

On Thu, Apr 11, 2013 at 6:45 AM, Peter Zijlstra <[email protected]> wrote:

> On Tue, 2013-03-26 at 15:01 +0100, Stanislaw Gruszka wrote:

>> Thoughts?

>

> Would something like the below work?

Ugh, this is hard to think about, it's also fairly inefficient.

> static cputime_t scale_stime(u64 stime, u64 rtime, u64 total)

> {

> - u64 rem, res, scaled;

> + int stime_fls = fls64(stime);

> + int total_fls = fls64(total);

> + int rtime_fls = fls64(rtime);

Doing "fls64()" unconditionally is quite expensive on some

architectures, and if I am not mistaken, the *common* case (by far) is

that all these values fit in 32 bits, no?

So let's re-think the whole thing entirely. First, let's throw away

the uncommon case, and we'll come back to it later:

	if (unlikely((stime | total | rtime) >> 32))
		return uncommon_case(stime, total, rtime);

and now we know we have the simple case where everything is in 32

bits, and we can just do the trivial

	/* Make sure gcc understands that this is a 32x32->64 multiply,
	   followed by a 64/32->64 divide */
	return div_u64((u64) (u32) stime * (u64) (u32) rtime, (u32)total);

which is the *cheapest* possible case of scale_stime(), with the

simplified multiply and divide. Agreed? This is cheap even on 32-bit.

Well, relatively.

Do we agree that this is (a) the common case that we need to worry

about from performance and (b) simple, understandable and efficient?

Now, let's look at the uncommon case, and I lied, I'm not going to

actually do it as a "uncommon_case()" function, I'm going to do this

as a "let us simplify the uncommon case until it *looks* like the

common case". IOW, in this uncommon thing, the aim is simply to just

reduce stime/rtime/total to the point where they are 32 bits. Ok? And

let's keep it simple again.

So *now*, once we are in the uncommon case, let's start counting bits.

Like this:

	/* We know one of the values has a bit set in the high 32 bits */
	for (;;) {
		/* Make sure "stime" is the bigger of stime/rtime */
		if (rtime > stime) {
			u64 tmp = stime; stime = rtime; rtime = tmp;
		}

		/* Do we need to balance stime/rtime bits? */
		if (stime >> 32) {
			if (rtime >> 31)
				goto drop_precision;

			/* We can grow rtime and shrink stime and try to
			   make them both fit */
			rtime <<= 1;
			stime >>= 1;
			continue;
		}

		/* stime/rtime fits in 32 bits, how about total? */
		if (!(total >> 32))
			break;

drop_precision:
		/* We drop from stime, it has more bits than rtime */
		stime >>= 1;
		total >>= 1;
	}

The above is totally untested, but each step is pretty damn simple and

fairly cheap. Sure, it's a loop, but it's bounded to 32 (cheap)

iterations, and the normal case is that it's not done at all, or done

only a few times.

And the advantage is that the end result is always that simple

32x32/32 case that we started out with as the common case.
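Rendered as a compilable user-space function, the sketch above (loop plus the 32x32->64 common case) looks roughly like this; scale_stime_loop() is a name invented here, and like the original it trades low-order bits for overflow safety:

```c
#include <assert.h>

typedef unsigned long long u64;
typedef unsigned int u32;

/* Shed low-order bits until stime, rtime and total all fit in 32 bits,
 * then do the cheap 32x32->64 multiply and 64/32 divide. */
static u64 scale_stime_loop(u64 stime, u64 rtime, u64 total)
{
	if ((stime | rtime | total) >> 32) {
		for (;;) {
			/* Make sure "stime" is the bigger of stime/rtime */
			if (rtime > stime) {
				u64 tmp = stime; stime = rtime; rtime = tmp;
			}

			/* Do we need to balance stime/rtime bits? */
			if (stime >> 32) {
				if (rtime >> 31)
					goto drop_precision;

				/* Grow rtime, shrink stime: try to fit both */
				rtime <<= 1;
				stime >>= 1;
				continue;
			}

			/* stime/rtime fit in 32 bits, how about total? */
			if (!(total >> 32))
				break;

drop_precision:
			/* Drop from stime; it has more bits than rtime */
			stime >>= 1;
			total >>= 1;
		}
	}

	return (u64)(u32)stime * (u64)(u32)rtime / (u32)total;
}
```

On the 34-bit test vector from earlier in the thread it converges in four iterations and comes out within a few units of the exact result.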

I dunno. Maybe I'm overlooking something, and the above is horrible,

but the above seems reasonably efficient if not optimal, and

*understandable*.

Linus

On Thu, 2013-04-11 at 16:50 +0200, Stanislaw Gruszka wrote:

> > +	/*
> > +	 * Since the stime:utime ratio is already an approximation through
> > +	 * the sampling, reducing its resolution isn't too big a deal.
> > +	 * And since total = stime+utime; the total_fls will be the biggest
> > +	 * of the two;
> > +	 */
> > +	if (total_fls > 32) {
> > +		shift = total_fls - 32;	/* a = 2^shift */
> > +		stime >>= shift;
> > +		total >>= shift;
> > +		stime_fls -= shift;
> > +		total_fls -= shift;
> > +	}
> > +
> > +	/*
> > +	 * Since we limited stime to 32bits the multiplication reduced to 96bit.
> > +	 * stime * rtime = stime * (rl + rh * 2^32) =
> > +	 *                 stime * rl + stime * rh * 2^32
> > +	 */
> > +	lo = stime * rtime_lo;
> > +	hi = stime * rtime_hi;
> > +	t = hi << 32;
> > +	lo += t;
> > +	if (lo < t)	/* overflow */
> > +		hi += 0x100000000L;
> > +	hi >>= 32;

>

> I do not understand why we shift hi value here, is that correct?

Yes.. remember that we have:

stime * rl + stime * rh * 2^32

How we get this 96bit value but our two 64bit values overlap:

| w3 | w2 | w1 | w0 |
+----+----+----+----+
          |   lo    |
     |   hi    |

So what I do is I add the low word of hi to lo and shift the high word

of hi to get:

| hi | lo |

Two non-overlapping 64bit values where the high word of hi is always 0.
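That carry handling can be verified mechanically against a 128-bit reference multiply (a user-space sketch; check_mul_96() is a name invented here, and it assumes stime already fits in 32 bits, as the patch arranges):

```c
#include <assert.h>

typedef unsigned long long u64;
typedef unsigned int u32;

/* Split rtime, form lo = stime * rtime_lo and hi = stime * rtime_hi,
 * fold the overlapping word of hi into lo, and compare the resulting
 * (hi:lo) pair against a direct __int128 multiply. */
static void check_mul_96(u64 stime, u64 rtime)
{
	u32 rtime_hi = rtime >> 32, rtime_lo = (u32)rtime;
	u64 lo = stime * (u64)rtime_lo;
	u64 hi = stime * (u64)rtime_hi;
	u64 t = hi << 32;
	unsigned __int128 ref = (unsigned __int128)stime * rtime;

	lo += t;
	if (lo < t)		/* the low 64 bits wrapped */
		hi += 0x100000000ULL;
	hi >>= 32;

	assert(lo == (u64)ref);		/* low 64 bits agree */
	assert(hi == (u64)(ref >> 64));	/* high word agrees too */
}
```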

> > +	/*
> > +	 * Pick the 64 most significant bits for division into @lo.
> > +	 *
> > +	 * NOTE: res_fls is an approximation (upper-bound) do we want to
> > +	 * properly calculate?
> > +	 */
> > +	shift = 0;
> > +	res_fls = stime_fls + rtime_fls;
> > +	if (res_fls > 64) {
> > +		shift = res_fls - 64;	/* b = 2^shift */
> > +		lo >>= shift;
> > +		hi <<= 64 - shift;
> > +		lo |= hi;
> > 	}
> > -	return (__force cputime_t) scaled;
> > +	/*
> > +	 * So here we do:
> > +	 *
> > +	 *   ((stime / a) * rtime / b)
> > +	 *   ------------------------- / b
> > +	 *        (total / a)
> > +	 */
> > +	return div_u64(lo, total) >> shift;

>

> I think it should be:

>

>	((stime / a) * rtime / b)
>	------------------------- * b
>	     (total / a)
>
>	return div_u64(lo, total) << shift;

I think you're very right about that.. got my head twisted by staring

at this stuff too long.

On Thu, 2013-04-11 at 08:38 -0700, Linus Torvalds wrote:

> Doing "fls64()" unconditionally is quite expensive on some

> architectures,

Oh, I (wrongly it appears) assumed that fls was something cheap :/

> and if I am not mistaken, the *common* case (by far) is

> that all these values fit in 32 bits, no?

It depends on whether we use cputime_jiffies.h or cputime_nsec.h, and I'm

completely lost as to which we default to atm. But we can surely reduce

to 32 bits in most cases without too many problems.

But that would mean fls() and shifting again for nsec-based cputime.

I'll have a better read and think about the rest of your email but

that'll have to be tomorrow :/

On Thu, Apr 11, 2013 at 08:07:50PM +0200, Peter Zijlstra wrote:

> It depends on if we use cputime_jiffies.h or cputime_nsec.h and I'm

> completely lost as to which we default to atm. But we sure can reduce

> to 32 bits in most cases without too much problems.

We default to the jiffies. The nsecs case is used only for full dynticks

accounting and ia64 precise accounting.

On Thu, Apr 11, 2013 at 11:07 AM, Peter Zijlstra <[email protected]> wrote:

>

> Oh, I (wrongly it appears) assumed that fls was something cheap :/

It often is. Particularly on modern machines, because all popcount and

leading zero counting ends up being interesting to some people.

On older machines, it's often a bit-at-a-time thing. We don't even try

to support i386 any more, but on atom and P4 it's something like 16

cycles for bsrl, and older cores were worse. So doing three of them

when not needed seems a bit excessive..

In contrast, on a Core2, I think it's just a single cycle.

Non-x86 architectures end up being the same - some have fast

instructions for it, others don't do it at all and end up doing things

with bitmasking and shifting.

Linus
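As a concrete reference point for the cost discussion above, a userspace sketch of the kernel's fls64() helper can be written with a GCC/Clang builtin (this is a sketch, not the kernel implementation): on machines with a fast leading-zero instruction the builtin compiles to one or two instructions, while on others the compiler expands it into the shift-and-test fallback being discussed.

```c
#include <stdint.h>

/* Userspace sketch of fls64(): the 1-based index of the most
 * significant set bit, or 0 when the value is zero.
 * __builtin_clzll maps to bsr/lzcnt on x86; on architectures without
 * such an instruction it expands to bit-shifting code, which is the
 * cost being pointed at above. (clzll alone is undefined for 0,
 * hence the explicit zero check.) */
static int fls64(uint64_t x)
{
    return x ? 64 - __builtin_clzll(x) : 0;
}
```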

On Thu, Apr 11, 2013 at 08:22:03PM +0200, Frederic Weisbecker wrote:

> On Thu, Apr 11, 2013 at 08:07:50PM +0200, Peter Zijlstra wrote:

> > On Thu, 2013-04-11 at 08:38 -0700, Linus Torvalds wrote:

> > > On Thu, Apr 11, 2013 at 6:45 AM, Peter Zijlstra <[email protected]> wrote:

> > > > On Tue, 2013-03-26 at 15:01 +0100, Stanislaw Gruszka wrote:

> > > >> Thoughts?

> > > >

> > > > Would something like the below work?

> > >

> > > Ugh, this is hard to think about, it's also fairly inefficient.

> > >

> > > > static cputime_t scale_stime(u64 stime, u64 rtime, u64 total)

> > > > {

> > > > - u64 rem, res, scaled;

> > > > + int stime_fls = fls64(stime);

> > > > + int total_fls = fls64(total);

> > > > + int rtime_fls = fls64(rtime);

> > >

> > > Doing "fls64()" unconditionally is quite expensive on some

> > > architectures,

> >

> > Oh, I (wrongly it appears) assumed that fls was something cheap :/

> >

> > > and if I am not mistaken, the *common* case (by far) is

> > > that all these values fit in 32 bits, no?

> >

> > It depends on if we use cputime_jiffies.h or cputime_nsec.h and I'm

> > completely lost as to which we default to atm. But we sure can reduce

> > to 32 bits in most cases without too much problems.

>

> We default to the jiffies. The nsecs case is used only for full dynticks

> accounting and ia64 precise accounting.

Oh and in the latter case there is no scaling.

On Thu, 2013-04-11 at 08:38 -0700, Linus Torvalds wrote:

> So *now*, once we are in the uncommon case, let's start counting bits.

> Like this:

>

> 	/* We know one of the values has a bit set in the high 32 bits */
> 	for (;;) {
> 		/* Make sure "stime" is the bigger of stime/rtime */
> 		if (rtime > stime) {
> 			u64 tmp = stime; stime = rtime; rtime = tmp;
> 		}
>
> 		/* Do we need to balance stime/rtime bits? */
> 		if (stime >> 32) {
> 			if (rtime >> 31)
> 				goto drop_precision;
>
> 			/* We can grow rtime and shrink stime and try to make them both fit */
> 			rtime <<= 1;
> 			stime >>= 1;
> 			continue;
> 		}
>
> 		/* stime/rtime fits in 32 bits, how about total? */
> 		if (!(total >> 32))
> 			break;
>
> drop_precision:
> 		/* We drop from stime, it has more bits than rtime */
> 		stime >>= 1;
> 		total >>= 1;
> 	}

>

> The above is totally untested, but each step is pretty damn simple and

> fairly cheap. Sure, it's a loop, but it's bounded to 32 (cheap)

> iterations, and the normal case is that it's not done at all, or done

> only a few times.

Right it gets gradually heavier the bigger the numbers get; which is

more and more unlikely.

> And the advantage is that the end result is always that simple

> 32x32/32 case that we started out with as the common case.

>

> I dunno. Maybe I'm overlooking something, and the above is horrible,

> but the above seems reasonably efficient if not optimal, and

> *understandable*.

I suppose that entirely matters on what one is used to ;-) I had to

stare rather hard at it for a little while.

But yes, you take it one step further and are willing to ditch rtime

bits too and I suppose that's fine.

Should work... Stanislaw, could you stick this into your userspace

thingy and verify the numbers are sane enough?

On Fri, Apr 12, 2013 at 09:55:56AM +0200, Peter Zijlstra wrote:

> > The above is totally untested, but each step is pretty damn simple and

> > fairly cheap. Sure, it's a loop, but it's bounded to 32 (cheap)

> > iterations, and the normal case is that it's not done at all, or done

> > only a few times.

>

> Right it gets gradually heavier the bigger the numbers get; which is

> more and more unlikely.

>

> > And the advantage is that the end result is always that simple

> > 32x32/32 case that we started out with as the common case.

> >

> > I dunno. Maybe I'm overlooking something, and the above is horrible,

> > but the above seems reasonably efficient if not optimal, and

> > *understandable*.

>

> I suppose that entirely matters on what one is used to ;-) I had to

> stare rather hard at it for a little while.

>

> But yes, you take it one step further and are willing to ditch rtime

> bits too and I suppose that's fine.

>

> Should work... Stanislaw, could you stick this into your userspace

> thingy and verify the numbers are sane enough?

It works fine - gives relative error less than 0.1% for very big

numbers.

For the record I'm attaching the test program and script.

Thanks

Stanislaw

scale_stime5.c (1.55 kB)

scale_stime_test5.py (999.00 B)

On Thu, Apr 11, 2013 at 08:38:37AM -0700, Linus Torvalds wrote:

> 	/* We know one of the values has a bit set in the high 32 bits */
> 	for (;;) {
> 		/* Make sure "stime" is the bigger of stime/rtime */
> 		if (rtime > stime) {
> 			u64 tmp = stime; stime = rtime; rtime = tmp;
> 		}

For most workloads rtime is bigger than stime, so swapping those would

save some cycles in the common case. Otherwise this algorithm looks great.

Thanks

Stanislaw

On Sat, Apr 13, 2013 at 7:49 AM, Stanislaw Gruszka <[email protected]> wrote:

>

> It works fine - gives relative error less than 0.1% for very big

> numbers.

So I was assuming that the values were all roughly in the same range.

When that is not true (say, very big rtime and very big total, very

small stime), it's possible that we end up first normalizing

rtime-vs-stime, and then dropping precision due to big 'total'. And

that might cause excessive precision loss (and I think "0.1%" is

excessive, even if it's probably perfectly fine).

I suspect I should have done the "drop precision due to 'total' being

out of range" case *first*. And then only balance rtime/stime after we've

scaled down total. That might avoid any unnecessary precision loss,

because bringing 'total' into the 32-bit range will continually shrink

the much larger (unbalanced) number, rather than shrink something that

was already balanced.

But I doubt it ever matters in practice.

The *big* loss (well, relatively - the worst case I've seen with your

test program is with 'err' being 0.021%) comes because if we have to

drop precision, and both rtime and stime are big, we have to scale

down 'total' by one bit every time. And we scale down the bigger of

rtime/stime too, but we basically have twice as many bits to shave

off rtime/stime, since there are two values (even if we pick the

biggest one, eventually we'll start alternating because shaving bits

will make the other one bigger).

So we may end up scaling 'total' down to much less than 32 bits, and

that's how you get the "big" errors in the 0.02% range.

The good news is that

(a) this requires that rtime/stime really both are big numbers

(b) this only happens with really really big numbers (ie very much

your "10 years of 4096 threads" at 1000 Hz kind of numbers)

(c) even then the error isn't catastrophic.

So I think my algorithm could be improved a bit (to do the total

scaling *before* doing the scaling of rtime-vs-stime), but I think

it's quite usable.

Linus

PS. This is the "Make sure 'total' fits in 32 bits first" version. Not

really tested, but it's just changing the order of operations a bit.

	/* We know one of the values has a bit set in the high 32 bits */
	for (;;) {
		/* Make sure "rtime" is the bigger of stime/rtime */
		if (stime > rtime) {
			u64 tmp = rtime; rtime = stime; stime = tmp;
		}

		/* Make sure 'total' fits in 32 bits */
		if (total >> 32)
			goto drop_precision;

		/* Does rtime (and thus stime) fit in 32 bits? */
		if (!(rtime >> 32))
			break;

		/* Can we just balance rtime/stime rather than dropping bits? */
		if (stime >> 31)
			goto drop_precision;

		/* We can grow stime and shrink rtime and try to make them both fit */
		stime <<= 1;
		rtime >>= 1;
		continue;

drop_precision:
		/* We drop from rtime, it has more bits than stime */
		rtime >>= 1;
		total >>= 1;
	}
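For testing outside the kernel, the loop above can be wrapped into a self-contained userspace function (a sketch: plain uint64_t in place of cputime_t and div_u64, and it assumes total is nonzero and of the same magnitude as rtime, as in the kernel caller where total = stime + utime):

```c
#include <stdint.h>

/* Userspace version of the loop above: compute stime * rtime / total
 * while shifting values down just enough that the final 32x32->64
 * multiply and 64/32 divide cannot overflow. Swapping stime and rtime
 * inside the loop is fine because the product is symmetric in them. */
static uint64_t scale_stime(uint64_t stime, uint64_t rtime, uint64_t total)
{
    for (;;) {
        /* Make sure "rtime" is the bigger of stime/rtime */
        if (stime > rtime) {
            uint64_t tmp = rtime; rtime = stime; stime = tmp;
        }

        /* Make sure 'total' fits in 32 bits */
        if (total >> 32)
            goto drop_precision;

        /* Does rtime (and thus stime) fit in 32 bits? */
        if (!(rtime >> 32))
            break;

        /* Can we just balance rtime/stime rather than dropping bits? */
        if (stime >> 31)
            goto drop_precision;

        /* Grow stime and shrink rtime to try to make both fit */
        stime <<= 1;
        rtime >>= 1;
        continue;

drop_precision:
        /* Drop from rtime, it has more bits than stime */
        rtime >>= 1;
        total >>= 1;
    }

    /* Now a 32x32->64 multiply followed by a 64/32 divide is safe */
    return (uint64_t)(uint32_t)stime * (uint32_t)rtime / (uint32_t)total;
}
```

In the common all-32-bit case the loop body runs exactly once and falls straight through to the multiply, which is the property Linus was after.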

On Sat, Apr 13, 2013 at 11:44:54AM -0700, Linus Torvalds wrote:

> PS. This is the "Make sure 'total' fits in 32 bits first" version. Not

> really tested, but it's just changing the order of operations a bit.

>

> 	/* We know one of the values has a bit set in the high 32 bits */
> 	for (;;) {
> 		/* Make sure "rtime" is the bigger of stime/rtime */
> 		if (stime > rtime) {
> 			u64 tmp = rtime; rtime = stime; stime = tmp;
> 		}
>
> 		/* Make sure 'total' fits in 32 bits */
> 		if (total >> 32)
> 			goto drop_precision;
>
> 		/* Does rtime (and thus stime) fit in 32 bits? */
> 		if (!(rtime >> 32))
> 			break;
>
> 		/* Can we just balance rtime/stime rather than dropping bits? */
> 		if (stime >> 31)
> 			goto drop_precision;
>
> 		/* We can grow stime and shrink rtime and try to make them both fit */
> 		stime <<= 1;
> 		rtime >>= 1;
> 		continue;
>
> drop_precision:
> 		/* We drop from rtime, it has more bits than stime */
> 		rtime >>= 1;
> 		total >>= 1;
> 	}

It also passes with 0.1% relative error on my tests. Decreasing the error

threshold to 0.02% failed. I didn't check other values or measure how

frequently the 0.02% check fails on each version; I assume this one is better :-)

So regarding relative error this algorithm is fine: there is no

multiplication overflow error that makes the scaled numbers bogus. Then

I looked at this algorithm in the context of how it is used ...

Raw stime/rtime/total values will increase at jiffies resolution,

and so will scaled_stime if we do not drop precision. For bigger numbers,

since we drop precision, scaled_stime will grow in chunks. How big the

chunk is depends on how big the numbers are overall and on the stime/total

ratio. For example: with stime = 0.5*total and 128 threads, after 1 year

of CPU execution the chunk will be 1024 jiffies.
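That chunk-size figure can be sanity-checked with quick arithmetic (assuming HZ=1000, i.e. tick-based cputime counted in milliseconds — an assumption, since HZ is configurable): 128 threads accumulate roughly 2^42 jiffies of rtime in a year, the scaling loop has to shave about 10 bits off to bring that under 2^32, and scaled stime then advances in steps of about 2^10 = 1024 jiffies.

```c
#include <stdint.h>

/* rtime of 128 CPU-bound threads after one year, in jiffies
 * (HZ=1000 assumed): ~4.0e12, just under 2^42 */
static uint64_t year_of_128_threads(void)
{
    return 128ULL * 365 * 24 * 3600 * 1000;
}

/* Bits the scaling loop must drop to make rtime fit in 32 bits;
 * scaled stime can then only move in steps of ~2^shifts jiffies. */
static uint64_t chunk_jiffies(uint64_t rtime)
{
    int bits = rtime ? 64 - __builtin_clzll(rtime) : 0;  /* fls64 */
    return bits > 32 ? 1ULL << (bits - 32) : 1;
}
```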

We use the scaled stime value this way:

	if (total)
		stime = scale_stime(stime, rtime, total);
	else
		stime = rtime;

	/*
	 * If the tick based count grows faster than the scheduler one,
	 * the result of the scaling may go backward.
	 * Let's enforce monotonicity.
	 */
	prev->stime = max(prev->stime, stime);
	prev->utime = max(prev->utime, rtime - prev->stime);

	*ut = prev->utime;
	*st = prev->stime;

Since rtime increases but scaled stime does not, stime will be accounted

as prev->utime. Then after chunk jiffies, stime will grow and we will

get it accounted in prev->stime. As a result we will account more cpu

time than the process actually used. This error will accumulate depending

on how frequently cputime_adjust() is called for the process.

As a solution we could just stop updating if prev->stime + prev->utime

is already bigger than rtime. For example:

 	rtime = nsecs_to_cputime(curr->sum_exec_runtime);

+	/*
+	 * Update userspace visible utime/stime values only if actual execution
+	 * time is bigger than already exported. Note that can happen, that we
+	 * provided bigger values due to scaling inaccuracy on big numbers.
+	 */
+	if (prev->stime + prev->utime >= rtime)
+		goto out;
+
 	if (total)
 		stime = scale_stime(stime, rtime, total);
 	else
@@ -573,6 +581,7 @@ static void cputime_adjust(struct task_cputime *curr,
 	prev->stime = max(prev->stime, stime);
 	prev->utime = max(prev->utime, rtime - prev->stime);
+out:
 	*ut = prev->utime;
 	*st = prev->stime;
 }

This should help with erroneously accounting more CPU time than the

process actually used. As a disadvantage, userspace will see CPU time

increase in chunks, but I think this is better than seeing values much

bigger than the correct ones (and for 99.9% of user cases it does not matter).

Stanislaw
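The proposed early-out can be exercised in userspace with a simplified model of cputime_adjust() (a sketch: the struct and scaling here are stand-ins, with a naive 128-bit multiply in place of the overflow-safe loop discussed in this thread, just to keep it short):

```c
#include <stdint.h>

struct prev_cputime { uint64_t utime, stime; };

/* Naive stand-in for the overflow-safe scale_stime loop; a 128-bit
 * multiply avoids overflow without the bit-shaving dance. */
static uint64_t scale_naive(uint64_t stime, uint64_t rtime, uint64_t total)
{
    return (uint64_t)((unsigned __int128)stime * rtime / total);
}

/* Simplified cputime_adjust() with the proposed early-out: once we
 * have already exported at least rtime worth of cputime (possible
 * because scaling can overshoot on big numbers), stop updating and
 * just re-report the previous values. */
static void cputime_adjust(uint64_t stime, uint64_t rtime, uint64_t total,
                           struct prev_cputime *prev,
                           uint64_t *ut, uint64_t *st)
{
    if (prev->stime + prev->utime >= rtime)
        goto out;

    stime = total ? scale_naive(stime, rtime, total) : rtime;

    /* Enforce monotonicity, as in the existing kernel code */
    if (stime > prev->stime)
        prev->stime = stime;
    if (rtime - prev->stime > prev->utime)
        prev->utime = rtime - prev->stime;
out:
    *ut = prev->utime;
    *st = prev->stime;
}
```

A second call with an rtime no bigger than what was already exported leaves prev untouched, which is exactly the "CPU time increases in chunks" trade-off described above.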

On Sat, Apr 13, 2013 at 11:44:54AM -0700, Linus Torvalds wrote:

> PS. This is the "Make sure 'total' fits in 32 bits first" version. Not

> really tested, but it's just changing the order of operations a bit.

>

> 	/* We know one of the values has a bit set in the high 32 bits */
> 	for (;;) {
> 		/* Make sure "rtime" is the bigger of stime/rtime */
> 		if (stime > rtime) {
> 			u64 tmp = rtime; rtime = stime; stime = tmp;
> 		}
>
> 		/* Make sure 'total' fits in 32 bits */
> 		if (total >> 32)
> 			goto drop_precision;
>
> 		/* Does rtime (and thus stime) fit in 32 bits? */
> 		if (!(rtime >> 32))
> 			break;
>
> 		/* Can we just balance rtime/stime rather than dropping bits? */
> 		if (stime >> 31)
> 			goto drop_precision;
>
> 		/* We can grow stime and shrink rtime and try to make them both fit */
> 		stime <<= 1;
> 		rtime >>= 1;
> 		continue;
>
> drop_precision:
> 		/* We drop from rtime, it has more bits than stime */
> 		rtime >>= 1;
> 		total >>= 1;
> 	}

For reference I'm attaching the test program and script I used to validate

the algorithm.

It generates a lot of (possibly realistic) rtime, total, stime values for

4096 threads running for 10 years, then compares the scaled stime values

calculated by the above algorithm with precise Python values.

For all values generated, the scaled stime relative error was less than 0.05%.

scale_stime.c (1.59 kB)

scale_stime_test.py (998.00 B)
