2013-03-13 17:44:33

by Frederic Weisbecker

Subject: [GIT PULL] sched: Cputime update for 3.10

Ingo,

Please pull the latest cputime accounting updates that can be found at:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
sched/core

HEAD: d9a3c9823a2e6a543eb7807fb3d15d8233817ec5

Some users are complaining that their threadgroup's runtime accounting freezes
after a week or so of intense cpu-bound workload. This set tries to fix the issue
by reducing the risk of multiplication overflow in the cputime scaling code.

Thanks.

---
Frederic Weisbecker (2):
math64: New div64_u64_rem helper
sched: Lower chances of cputime scaling overflow

include/linux/math64.h | 19 ++++++++++++++++++-
kernel/sched/cputime.c | 46 ++++++++++++++++++++++++++++++++++------------
lib/div64.c | 19 +++++++++++++------
3 files changed, 65 insertions(+), 19 deletions(-)

--
1.7.5.4


2013-03-13 17:44:35

by Frederic Weisbecker

Subject: [PATCH 1/2] math64: New div64_u64_rem helper

Provide an extended version of div64_u64() that
also returns the remainder of the division.

We are going to need this to refine the cputime
scaling code.
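
For instance, a caller that needs both the quotient and the remainder
of a 64-bit division would do (illustrative sketch, not from this
patch):

	u64 rem;
	u64 quot = div64_u64_rem(rtime, total, &rem);
	/* Here rtime == quot * total + rem, with rem < total. */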

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
---
include/linux/math64.h | 19 ++++++++++++++++++-
lib/div64.c | 19 +++++++++++++------
2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/include/linux/math64.h b/include/linux/math64.h
index b8ba855..931a619 100644
--- a/include/linux/math64.h
+++ b/include/linux/math64.h
@@ -30,6 +30,15 @@ static inline s64 div_s64_rem(s64 dividend, s32 divisor, s32 *remainder)
}

/**
+ * div64_u64_rem - unsigned 64bit divide with 64bit divisor and remainder
+ */
+static inline u64 div64_u64_rem(u64 dividend, u64 divisor, u64 *remainder)
+{
+ *remainder = dividend % divisor;
+ return dividend / divisor;
+}
+
+/**
* div64_u64 - unsigned 64bit divide with 64bit divisor
*/
static inline u64 div64_u64(u64 dividend, u64 divisor)
@@ -61,8 +70,16 @@ static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
extern s64 div_s64_rem(s64 dividend, s32 divisor, s32 *remainder);
#endif

+#ifndef div64_u64_rem
+extern u64 div64_u64_rem(u64 dividend, u64 divisor, u64 *remainder);
+#endif
+
#ifndef div64_u64
-extern u64 div64_u64(u64 dividend, u64 divisor);
+static inline u64 div64_u64(u64 dividend, u64 divisor)
+{
+ u64 remainder;
+ return div64_u64_rem(dividend, divisor, &remainder);
+}
#endif

#ifndef div64_s64
diff --git a/lib/div64.c b/lib/div64.c
index a163b6c..3af5728 100644
--- a/lib/div64.c
+++ b/lib/div64.c
@@ -79,9 +79,10 @@ EXPORT_SYMBOL(div_s64_rem);
#endif

/**
- * div64_u64 - unsigned 64bit divide with 64bit divisor
+ * div64_u64_rem - unsigned 64bit divide with 64bit divisor and 64bit remainder
* @dividend: 64bit dividend
* @divisor: 64bit divisor
+ * @remainder: 64bit remainder
*
* This implementation is a modified version of the algorithm proposed
* by the book 'Hacker's Delight'. The original source and full proof
@@ -89,27 +90,33 @@ EXPORT_SYMBOL(div_s64_rem);
*
* 'http://www.hackersdelight.org/HDcode/newCode/divDouble.c.txt'
*/
-#ifndef div64_u64
-u64 div64_u64(u64 dividend, u64 divisor)
+#ifndef div64_u64_rem
+u64 div64_u64_rem(u64 dividend, u64 divisor, u64 *remainder)
{
u32 high = divisor >> 32;
u64 quot;

if (high == 0) {
- quot = div_u64(dividend, divisor);
+ u32 rem32;
+ quot = div_u64_rem(dividend, divisor, &rem32);
+ *remainder = rem32;
} else {
int n = 1 + fls(high);
quot = div_u64(dividend >> n, divisor >> n);

if (quot != 0)
quot--;
- if ((dividend - quot * divisor) >= divisor)
+
+ *remainder = dividend - quot * divisor;
+ if (*remainder >= divisor) {
quot++;
+ *remainder -= divisor;
+ }
}

return quot;
}
-EXPORT_SYMBOL(div64_u64);
+EXPORT_SYMBOL(div64_u64_rem);
#endif

/**
--
1.7.5.4

2013-03-13 17:44:39

by Frederic Weisbecker

Subject: [PATCH 2/2] sched: Lower chances of cputime scaling overflow

Some users have reported that after running a process with
hundreds of threads on intensive CPU-bound loads, the cputime
of the group started to freeze after a few days.

This is due to how we scale the tick-based cputime against
the scheduler's precise execution time value.

We add up the tick-based cputime values of all threads in the
group and multiply that sum by the group's total scheduler exec
runtime.

This easily overflows after a few days/weeks of execution.
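
For illustration, here is a minimal userspace sketch of the overflow
(hypothetical figures, plain C rather than kernel code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* ~50 days of rtime and utime at HZ=1000, in jiffies */
	uint64_t rtime = 50ULL * 24 * 60 * 60 * 1000;
	uint64_t utime = 50ULL * 24 * 60 * 60 * 1000;
	uint64_t total = utime;		/* single thread, all userspace */
	uint64_t temp = rtime * utime;	/* product > 2^64 - 1, wraps */

	printf("scaled utime: %llu\n", (unsigned long long)(temp / total));
	return 0;
}

The product wraps around 2^64, so the scaled value collapses instead
of advancing; the monotonicity enforcement in cputime_adjust() then
keeps returning the old values, which is the freeze users observe.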

A solution proposed to address this was to compute the multiplication
on stime instead of utime:
62188451f0d63add7ad0cd2a1ae269d600c1663d
("cputime: Avoid multiplication overflow on utime scaling")

The rationale behind that was that it's easy for a thread to
spend most of its time in userspace under an intensive CPU-bound
workload, but much harder to sustain a long CPU-intensive run in
the kernel.

This postulate was defeated when a user recently reported that he
was still seeing cputime freezes even with the above patch. The
workloads that trigger this issue are intensive networking loads,
where most of the cputime is consumed in the kernel.

To further reduce the opportunities for multiplication overflow,
let's reduce the multiplication factors to the quotient and remainder
of the division between the sched exec runtime and the cputime.
Assuming the difference between these shouldn't ever be that large,
this should work in most situations.

This gets the same results as the upstream scaling code except for
a small difference: the upstream code always rounds the result down
to the nearest integer not greater than the precise value, while the
new code may round either up or down. In practice this difference
probably shouldn't matter, but it's worth mentioning.
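
As an illustrative sanity check (a userspace sketch assuming a
compiler with unsigned __int128, not part of the patch), the new
scaling can be compared against an exact 128-bit computation:

#include <stdio.h>
#include <stdint.h>

/* Userspace mirror of the patched scale_stime() arithmetic. */
static uint64_t scale(uint64_t stime, uint64_t rtime, uint64_t total)
{
	uint64_t res, rem, scaled;

	if (rtime >= total) {
		res = rtime / total;
		rem = rtime % total;
		scaled = stime * res + (stime * rem) / total;
	} else {
		res = total / rtime;
		rem = total % rtime;
		scaled = stime / res;
		scaled -= (scaled * rem) / total;
	}
	return scaled;
}

int main(void)
{
	uint64_t stime = 123456789, rtime = 987654321, total = 1000000000;
	uint64_t exact = (uint64_t)(((unsigned __int128)stime * rtime) / total);

	printf("scaled=%llu exact=%llu\n",
	       (unsigned long long)scale(stime, rtime, total),
	       (unsigned long long)exact);
	return 0;
}

On this input the sketch prints scaled=121932632 exact=121932631: off
by one, which is exactly the rounding difference described above.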

If this solution turns out not to be enough in the end, we'll
need to partly revert to the behaviour prior to commit
0cf55e1ec08bb5a22e068309e2d8ba1180ab4239
("sched, cputime: Introduce thread_group_times()")

Back then, the scaling was done at exit() time, before adding the
cputime of an exiting thread to the signal struct. We would then
need to scale the cputime of the live threads one by one in
thread_group_cputime(). The drawback may be slightly slower code
at exit time.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
---
kernel/sched/cputime.c | 46 ++++++++++++++++++++++++++++++++++------------
1 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 024fe19..699d597 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -521,18 +521,36 @@ void account_idle_ticks(unsigned long ticks)
account_idle_time(jiffies_to_cputime(ticks));
}

-static cputime_t scale_stime(cputime_t stime, cputime_t rtime, cputime_t total)
+/*
+ * Perform (stime * rtime) / total with reduced chances
+ * of multiplication overflow, by using the smaller
+ * quotient and remainder of the division between
+ * rtime and total as factors.
+ */
+static cputime_t scale_stime(u64 stime, u64 rtime, u64 total)
{
- u64 temp = (__force u64) rtime;
+ u64 rem, res, scaled;

- temp *= (__force u64) stime;
-
- if (sizeof(cputime_t) == 4)
- temp = div_u64(temp, (__force u32) total);
- else
- temp = div64_u64(temp, (__force u64) total);
+ if (rtime >= total) {
+ /*
+ * Scale up by rtime / total, then add
+ * the remainder scaled by stime / total.
+ */
+ res = div64_u64_rem(rtime, total, &rem);
+ scaled = stime * res;
+ scaled += div64_u64(stime * rem, total);
+ } else {
+ /*
+ * Same in reverse: scale down by total / rtime,
+ * then subtract that result scaled by the
+ * remaining part.
+ */
+ res = div64_u64_rem(total, rtime, &rem);
+ scaled = div64_u64(stime, res);
+ scaled -= div64_u64(scaled * rem, total);
+ }

- return (__force cputime_t) temp;
+ return (__force cputime_t) scaled;
}

/*
@@ -566,10 +584,14 @@ static void cputime_adjust(struct task_cputime *curr,
*/
rtime = nsecs_to_cputime(curr->sum_exec_runtime);

- if (total)
- stime = scale_stime(stime, rtime, total);
- else
+ if (!rtime) {
+ stime = 0;
+ } else if (!total) {
stime = rtime;
+ } else {
+ stime = scale_stime((__force u64)stime,
+ (__force u64)rtime, (__force u64)total);
+ }

/*
* If the tick based count grows faster than the scheduler one,
--
1.7.5.4

2013-03-14 07:14:32

by Ingo Molnar

Subject: Re: [GIT PULL] sched: Cputime update for 3.10


* Frederic Weisbecker <[email protected]> wrote:

> Ingo,
>
> Please pull the latest cputime accounting updates that can be found at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> sched/core
>
> HEAD: d9a3c9823a2e6a543eb7807fb3d15d8233817ec5
>
> Some users are complaining that their threadgroup's runtime accounting
> freezes after a week or so of intense cpu-bound workload. This set tries
> to fix the issue by reducing the risk of multiplication overflow in the
> cputime scaling code.

Hm, is this a new bug? When was it introduced and is upstream affected as
well?

Thanks,

Ingo

2013-03-14 09:13:23

by Stanislaw Gruszka

Subject: Re: [GIT PULL] sched: Cputime update for 3.10

On Thu, Mar 14, 2013 at 08:14:27AM +0100, Ingo Molnar wrote:
>
> * Frederic Weisbecker <[email protected]> wrote:
>
> > Ingo,
> >
> > Please pull the latest cputime accounting updates that can be found at:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > sched/core
> >
> > HEAD: d9a3c9823a2e6a543eb7807fb3d15d8233817ec5
> >
> > Some users are complaining that their threadgroup's runtime accounting
> > freezes after a week or so of intense cpu-bound workload. This set tries
> > to fix the issue by reducing the risk of multiplication overflow in the
> > cputime scaling code.
>
> Hm, is this a new bug? When was it introduced and is upstream affected as
> well?

Commit 0cf55e1ec08bb5a22e068309e2d8ba1180ab4239 started to use scaling
for the whole thread group, which increases the chances of hitting
multiplication overflow, depending on how many CPUs are in the system.

We have had the utime * rtime multiplication for a single thread since
commit b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa.

Overflow will happen once:

rtime * utime > 0xffffffffffffffff

(with both values measured in jiffies). If a thread utilizes 100% of
the CPU time, so that utime == rtime, that gives:

rtime > sqrt(0xffffffffffffffff) jiffies

rtime > sqrt(0xffffffffffffffff) / (24 * 60 * 60 * HZ) days

For HZ=100 that is 497 days; for HZ=1000 it is 49 days.
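
Those figures check out with a quick computation (userspace sketch,
build with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
	double limit = sqrt((double)0xffffffffffffffffULL); /* = 2^32 */

	printf("HZ=100:  %.1f days\n", limit / (24.0 * 60 * 60 * 100));
	printf("HZ=1000: %.1f days\n", limit / (24.0 * 60 * 60 * 1000));
	return 0;
}

This prints roughly 497.1 and 49.7 days respectively.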

The bug affects only users who run a CPU-intensive application for
that long a period. They also have to be interested in the utime/stime
values, as the bug has no visible effect other than making those
values incorrect.

Stanislaw

2013-03-14 12:39:40

by Frederic Weisbecker

Subject: Re: [GIT PULL] sched: Cputime update for 3.10

2013/3/14 Stanislaw Gruszka <[email protected]>:
> On Thu, Mar 14, 2013 at 08:14:27AM +0100, Ingo Molnar wrote:
>> Hm, is this a new bug? When was it introduced and is upstream affected as
>> well?
>
> Commit 0cf55e1ec08bb5a22e068309e2d8ba1180ab4239 started to use scaling
> for the whole thread group, which increases the chances of hitting
> multiplication overflow, depending on how many CPUs are in the system.

Yeah, it was introduced with that commit. Although the problem was
already there back when we started scaling cputime against the precise
sched runtime, commit 0cf55e1ec08bb5a22e068309e2d8ba1180ab4239
certainly made the bug visible and likely to happen in practice.

And indeed upstream is affected.

2013-03-18 09:13:15

by Ingo Molnar

Subject: Re: [GIT PULL] sched: Cputime update for 3.10


* Frederic Weisbecker <[email protected]> wrote:

> Ingo,
>
> Please pull the latest cputime accounting updates that can be found at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> sched/core
>
> HEAD: d9a3c9823a2e6a543eb7807fb3d15d8233817ec5
>
> Some users are complaining that their threadgroup's runtime accounting freezes
> after a week or so of intense cpu-bound workload. This set tries to fix the issue
> by reducing the risk of multiplication overflow in the cputime scaling code.
>
> Thanks.
>
> ---
> Frederic Weisbecker (2):
> math64: New div64_u64_rem helper
> sched: Lower chances of cputime scaling overflow
>
> include/linux/math64.h | 19 ++++++++++++++++++-
> kernel/sched/cputime.c | 46 ++++++++++++++++++++++++++++++++++------------
> lib/div64.c | 19 +++++++++++++------
> 3 files changed, 65 insertions(+), 19 deletions(-)

Pulled, thanks guys!

I Cc:-ed Linus, for the div64_u64_rem() helper added to
linux/math64.h and lib/div64.c.

Ingo