We've hunted down a bug that manifests itself in top showing abnormal (>100%)
loads on single threads; sometimes top shows a couple of thousand percent.
We've found the cause of this behavior in cputime_adjust(), where the kernel
tries to split the actual runtime into user and system time based on the number
of times that thread was found executing in user or system mode, respectively.
The actual problem is a piece of code that *independently* updates the reported
utime and stime to force them to be monotonically increasing.
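In rough user-space terms the effect is something like the sketch below (a
simplified model with made-up numbers, not the kernel code; it only mimics
the independent monotonic updates):

/* Sketch: why advancing utime and stime independently lets their sum
 * exceed the actual runtime (rtime). All numbers are invented. */
#include <stdio.h>

int main(void)
{
	unsigned long prev_stime = 0, prev_utime = 0;

	/* sample 1: rtime=100, scaling attributes most of it to stime */
	unsigned long rtime = 100, stime = 80, utime = 20;
	if (stime > prev_stime) prev_stime = stime;
	if (utime > prev_utime) prev_utime = utime;
	printf("reported %lu vs rtime %lu\n", prev_stime + prev_utime, rtime);

	/* sample 2: rtime=110, but the tick samples now favour utime */
	rtime = 110; stime = 10; utime = 100;
	if (stime > prev_stime) prev_stime = stime;	/* stays at 80 */
	if (utime > prev_utime) prev_utime = utime;	/* jumps to 100 */
	printf("reported %lu vs rtime %lu\n", prev_stime + prev_utime, rtime);

	/* 180 > 110: top sees a huge delta now and zero deltas later */
	return 0;
}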
A small application tailored to trigger this case causes top to output the
following (the load should always be ~20%). The application can be found at:
https://gist.github.com/frma71/706410acbd31556dbbfa
The cause is visible in the data read from /proc/<pid>/stat, as demonstrated by
the awk script further down in this email (fields 14 and 15 of /proc/<pid>/stat
are the reported utime and stime, in clock ticks).
The application is not very robust and will sometimes fail to reproduce the
problem; on my Arndale (2x Cortex-A15) it reproduces at least nine times out
of ten. Anyway, the code should explain the problem pretty clearly.
I will follow up with my proposed fix (tested with next-20150610) in a minute or so.
# top -d 1 -H -b -p $(pidof loadbug) |grep loadbug
3100 root 20 0 1596 964 700 S 22 0.0 0:00.47 loadbug
3100 root 20 0 1596 964 700 R 21 0.0 0:00.69 loadbug
3100 root 20 0 1596 964 700 S 22 0.0 0:00.92 loadbug
3100 root 20 0 1596 964 700 S 21 0.0 0:01.14 loadbug
3100 root 20 0 1596 964 700 R 21 0.0 0:01.36 loadbug
3100 root 20 0 1596 964 700 S 22 0.0 0:01.59 loadbug
3100 root 20 0 1596 964 700 S 21 0.0 0:01.81 loadbug
3100 root 20 0 1596 964 700 S 22 0.0 0:02.04 loadbug
3100 root 20 0 1596 964 700 S 204 0.0 0:04.13 loadbug <==
3100 root 20 0 1596 964 700 R 0 0.0 0:04.13 loadbug
3100 root 20 0 1596 964 700 S 0 0.0 0:04.13 loadbug
3100 root 20 0 1596 964 700 S 0 0.0 0:04.13 loadbug
3100 root 20 0 1596 964 700 S 0 0.0 0:04.13 loadbug
3100 root 20 0 1596 964 700 S 0 0.0 0:04.13 loadbug
3100 root 20 0 1596 964 700 R 0 0.0 0:04.13 loadbug
3100 root 20 0 1596 964 700 S 0 0.0 0:04.13 loadbug
3100 root 20 0 1596 964 700 S 209 0.0 0:06.28 loadbug
% while true ; do cat /proc/$(pidof loadbug)/task/*/stat ; sleep 2 ; done | awk '{print $14 " " $15 " Delta=" $14+$15-L ;L=$14+$15}'
0 33 Delta=-103
0 77 Delta=44
0 121 Delta=44
0 165 Delta=44
0 208 Delta=43
254 208 Delta=254 <==
254 208 Delta=0
254 208 Delta=0
254 208 Delta=0
254 208 Delta=0
494 208 Delta=240
494 208 Delta=0
494 208 Delta=0
494 208 Delta=0
494 208 Delta=0
717 208 Delta=223
717 208 Delta=0
Fredrik Markstrom (1):
cputime: Make the reported utime+stime correspond to the actual
runtime.
kernel/sched/cputime.c | 46 +++++++++++++++++++---------------------------
1 file changed, 19 insertions(+), 27 deletions(-)
--
1.9.1
The scaling mechanism might sometimes cause top to report >100%
(sometimes > 1000%) cpu usage for a single thread. This patch makes
sure that stime+utime corresponds to the actual runtime of the thread.
Signed-off-by: Fredrik Markstrom <[email protected]>
---
kernel/sched/cputime.c | 46 +++++++++++++++++++---------------------------
1 file changed, 19 insertions(+), 27 deletions(-)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index f5a64ff..2d168c8 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -554,22 +554,7 @@ drop_precision:
return (__force cputime_t) scaled;
}
-/*
- * Atomically advance counter to the new value. Interrupts, vcpu
- * scheduling, and scaling inaccuracies can cause cputime_advance
- * to be occasionally called with a new value smaller than counter.
- * Let's enforce atomicity.
- *
- * Normally a caller will only go through this loop once, or not
- * at all in case a previous caller updated counter the same jiffy.
- */
-static void cputime_advance(cputime_t *counter, cputime_t new)
-{
- cputime_t old;
-
- while (new > (old = READ_ONCE(*counter)))
- cmpxchg_cputime(counter, old, new);
-}
+static DEFINE_SPINLOCK(prev_time_lock);
/*
* Adjust tick based cputime random precision against scheduler
@@ -590,17 +575,11 @@ static void cputime_adjust(struct task_cputime *curr,
*
* Fix this by scaling these tick based values against the total
* runtime accounted by the CFS scheduler.
+ * In addition make sure the reported stime+utime equals rtime
+ * so that the total runtime reported is correct.
*/
rtime = nsecs_to_cputime(curr->sum_exec_runtime);
- /*
- * Update userspace visible utime/stime values only if actual execution
- * time is bigger than already exported. Note that can happen, that we
- * provided bigger values due to scaling inaccuracy on big numbers.
- */
- if (prev->stime + prev->utime >= rtime)
- goto out;
-
stime = curr->stime;
utime = curr->utime;
@@ -616,12 +595,25 @@ static void cputime_adjust(struct task_cputime *curr,
utime = rtime - stime;
}
- cputime_advance(&prev->stime, stime);
- cputime_advance(&prev->utime, utime);
+ spin_lock(&prev_time_lock);
+ if (stime < prev->stime) {
+ stime = prev->stime;
+ utime = rtime - stime;
+ } else if (utime < prev->utime) {
+ utime = prev->utime;
+ stime = rtime - utime;
+ }
+ WARN_ON(stime < prev->stime);
+ WARN_ON(utime < prev->utime);
+ WARN_ON(stime + utime != rtime);
-out:
+ if (prev->stime + prev->utime < rtime) {
+ prev->stime = stime;
+ prev->utime = utime;
+ }
*ut = prev->utime;
*st = prev->stime;
+ spin_unlock(&prev_time_lock);
}
void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
--
1.9.1
On Fri, 2015-06-12 at 10:55 +0200, Fredrik Markstrom wrote:
> The scaling mechanism might sometimes cause top to report >100%
> (sometimes > 1000%) cpu usage for a single thread. This patch makes
> sure that stime+utime corresponds to the actual runtime of the thread.
This Changelog is inadequate, it does not explain the actual problem.
> +static DEFINE_SPINLOCK(prev_time_lock);
global (spin)locks are bad.
On Fri, Jun 12, 2015 at 12:16:57PM +0200, Peter Zijlstra wrote:
> On Fri, 2015-06-12 at 10:55 +0200, Fredrik Markstrom wrote:
> > The scaling mechanism might sometimes cause top to report >100%
> > (sometimes > 1000%) cpu usage for a single thread. This patch makes
> > sure that stime+utime corresponds to the actual runtime of the thread.
>
> This Changelog is inadequate, it does not explain the actual problem.
>
> > +static DEFINE_SPINLOCK(prev_time_lock);
>
> global (spin)locks are bad.
Since you have a proglet handy to test this; does something like the
below help anything?
---
kernel/sched/cputime.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index f5a64ffad176..3d3f60a555a0 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -613,6 +613,10 @@ static void cputime_adjust(struct task_cputime *curr,
stime = scale_stime((__force u64)stime,
(__force u64)rtime, (__force u64)total);
+
+ if (stime < prev->stime)
+ stime = prev->stime;
+
utime = rtime - stime;
}
Resending because it bounced off linux-kernel (Google Inbox sends everything
as HTML)!
Thanks for your quick response, I'll elaborate on the Changelog.
Regarding the global spinlock: I considered adding it to task_struct and
signal_struct instead. My reasoning for not doing so, flawed or not, was that
I thought the risk of contention and cache-line bouncing would be small given
the following assumptions:
1. As far as I understand, none of the callers (procfs, k_getrusage,
wait_task_zombie and sys_times) is typically called very frequently.
2. The time spent in the locked region is small.
Did I have bad luck when thinking :) or do you still think it's better
to add the locks to the structs above?
/Fredrik
On Fri, Jun 12, 2015 at 12:16 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2015-06-12 at 10:55 +0200, Fredrik Markstrom wrote:
>> The scaling mechanism might sometimes cause top to report >100%
>> (sometimes > 1000%) cpu usage for a single thread. This patch makes
>> sure that stime+utime corresponds to the actual runtime of the thread.
>
> This Changelog is inadequate, it does not explain the actual problem.
>
>> +static DEFINE_SPINLOCK(prev_time_lock);
>
> global (spin)locks are bad.
--
/Fredrik
Hello Peter, your patch helps with some of the cases but not all:
(the "called with.." below means cputime_adjust() is called with the
values specified in it's struct task_cputime argument.)
It helps when called with:
sum_exec_runtime=1000000000 utime=0 stime=1
... followed by...
sum_exec_runtime=1010000000 utime=100 stime=1
It doesn't help when called with:
sum_exec_runtime=1000000000 utime=1 stime=0
... followed by...
sum_exec_runtime=1010000000 utime=1 stime=100
Also, if we get a call with:
sum_exec_runtime=1000000000 utime=1 stime=1
... then get preempted after your proposed fix and before we are done
with the calls to cputime_advance(), and cputime_adjust() is called again
(from a different thread) with:
sum_exec_runtime=1010000000 utime=100 stime=1
... it still breaks.
I think there might be additional concurrency problems before, between
and/or possibly after the calls to cputime_advance(), at least if we
want to guarantee that sys+user stays sane. I believe my proposed patch
eliminates those potential problems in a pretty straightforward way.
I tried to come up with a lock-free solution but didn't find a simple
one. Since, from what I understand, scalability issues here are
unlikely, I felt that simplicity was preferred. Also, the current
implementation has two cmpxchg() calls and my proposal a single
spinlock, so on some setups I bet it's more efficient (like mine, with
a lousy interconnect and preempt-rt (but I'm on thin ice here)).
Below is the output from my test application (it's too much of a hack
to post publicly), but I'd be happy to clean it up and post it if
necessary.
/Fredrik
#<test>.<step> <input> => <test>.<step> <output> [=====> FAILED]
0.0 sum_exec=100000000000 utime=0 stime=1 => 0.0 tot=10000
user=0 sys=10000
0.1 sum_exec=101000000000 utime=100 stime=1 => 0.1 tot=10100
user=100 sys=10000
1.0 sum_exec=100000000000 utime=1 stime=0 => 1.0 tot=10000
user=10000 sys=0
1.1 sum_exec=101000000000 utime=1 stime=100 => 1.1 tot=20000
user=10000 sys=10000 =====> FAILED
2.0 sum_exec=100000000000 utime=1 stime=1 => 2.0 tot=10000
user=5000 sys=5000
2.1 sum_exec=101000000000 utime=100 stime=1 => 2.1 tot=10100
user=5100 sys=5000
3.0 sum_exec=100000000000 utime=1 stime=1 => <<PREEMPT>>
3.1 sum_exec=101000000000 utime=100 stime=1 => 3.1
tot=10100 user=10000 sys=100
<<SWITCH BACK>> 3.0 tot=15000 user=10000 sys=5000 =====> FAILED
On Fri, Jun 12, 2015 at 1:01 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Jun 12, 2015 at 12:16:57PM +0200, Peter Zijlstra wrote:
>> On Fri, 2015-06-12 at 10:55 +0200, Fredrik Markstrom wrote:
>> > The scaling mechanism might sometimes cause top to report >100%
>> > (sometimes > 1000%) cpu usage for a single thread. This patch makes
>> > sure that stime+utime corresponds to the actual runtime of the thread.
>>
>> This Changelog is inadequate, it does not explain the actual problem.
>>
>> > +static DEFINE_SPINLOCK(prev_time_lock);
>>
>> global (spin)locks are bad.
>
> Since you have a proglet handy to test this; does something like the
> below help anything?
>
> ---
> kernel/sched/cputime.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index f5a64ffad176..3d3f60a555a0 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -613,6 +613,10 @@ static void cputime_adjust(struct task_cputime *curr,
>
> stime = scale_stime((__force u64)stime,
> (__force u64)rtime, (__force u64)total);
> +
> + if (stime < prev->stime)
> + stime = prev->stime;
> +
> utime = rtime - stime;
> }
>
--
/Fredrik
I cleaned up the test application a bit, in case it helps someone
understand the problem and test other potential fixes. It's
at: https://gist.github.com/frma71/5af5a2a4b264d5cdd265
I basically copied the cputime_adjust() code out of the kernel and
added some stubs to be able to compile and run it
as an ordinary user mode application.
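For reference, the stubs are roughly of this kind (sketched from memory, not
the actual contents of the gist):

/* Minimal user-space stand-ins for the kernel types/macros used by
 * cputime_adjust(); assumes HZ=100 and a GCC builtin for cmpxchg. */
typedef unsigned long long cputime_t;
#define __force
#define READ_ONCE(x)			(x)
#define cmpxchg_cputime(p, o, n)	__sync_val_compare_and_swap((p), (o), (n))
#define nsecs_to_cputime(ns)		((ns) / (1000000000ULL / 100))

struct task_cputime {
	cputime_t utime;
	cputime_t stime;
	unsigned long long sum_exec_runtime;
};

struct cputime {
	cputime_t utime;
	cputime_t stime;
};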
To download, compile and run it:
% wget https://goo.gl/PE6RSj -O testadjust.c
% gcc -W -Wall -o testadjust testadjust.c
% ./testadjust
My questions right now are:
1) Is this a problem worth fixing (should the reported sys+user equal
sum_exec_runtime)?
2) Is there a preferred solution to the global spinlock?
/Fredrik
--
/Fredrik
Sorry for the delay, I seem to have gotten distracted...
On Mon, Jun 15, 2015 at 05:34:11PM +0200, Fredrik Markström wrote:
> Hello Peter, your patch helps with some of the cases but not all:
Indeed, and barring cmpxchg_double(), which is not available on all
platforms, the thing needs a lock.
Now, while you're probably right in that contention is unlikely for sane
behaviour, I could imagine some perverted apps hammering
getrusage() just because they can.
Therefore, find attached a version that has a per task/signal lock.
---
Subject: sched,cputime: Serialize cputime_adjust()
Fredrik reports that top and other tools can occasionally observe >100%
cpu usage and reports that this is because cputime_adjust() callers are
not serialized.
This means that when the s/u-time sample values are small and a change
can shift the balance quickly, concurrent updaters can race such that
one advances stime while the other advances utime, so that the sum ends
up vastly larger than the initial rtime.
There is also an issue with calculating utime as rtime - stime: if
the computed stime is smaller than the previously returned stime,
our utime will be larger than it should be, making the above problem
worse.
Cc: Jason Low <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Suggested-by: Fredrik Markstrom <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/init_task.h | 10 ++++++++++
include/linux/sched.h | 23 +++++++++++++++--------
kernel/fork.c | 7 ++++---
kernel/sched/cputime.c | 34 +++++++++-------------------------
4 files changed, 38 insertions(+), 36 deletions(-)
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index bb9b075f0eb0..e38681f4912d 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -39,6 +39,14 @@ extern struct fs_struct init_fs;
#define INIT_CPUSET_SEQ(tsk)
#endif
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+#define INIT_PREV_CPUTIME(x) .prev_cputime = { \
+ .lock = __RAW_SPIN_LOCK_UNLOCKED(x.prev_cputime.lock), \
+},
+#else
+#define INIT_PREV_CPUTIME(x)
+#endif
+
#define INIT_SIGNALS(sig) { \
.nr_threads = 1, \
.thread_head = LIST_HEAD_INIT(init_task.thread_node), \
@@ -53,6 +61,7 @@ extern struct fs_struct init_fs;
.cputime_atomic = INIT_CPUTIME_ATOMIC, \
.running = 0, \
}, \
+ INIT_PREV_CPUTIME(sig) \
.cred_guard_mutex = \
__MUTEX_INITIALIZER(sig.cred_guard_mutex), \
INIT_GROUP_RWSEM(sig) \
@@ -254,6 +263,7 @@ extern struct task_group root_task_group;
INIT_TASK_RCU_TASKS(tsk) \
INIT_CPUSET_SEQ(tsk) \
INIT_RT_MUTEXES(tsk) \
+ INIT_PREV_CPUTIME(tsk) \
INIT_VTIME(tsk) \
INIT_NUMA_BALANCING(tsk) \
INIT_KASAN(tsk) \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6633e83e608a..5dfbe92ce04e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -531,17 +531,28 @@ struct cpu_itimer {
};
/**
- * struct cputime - snaphsot of system and user cputime
+ * struct prev_cputime - snaphsot of system and user cputime
* @utime: time spent in user mode
* @stime: time spent in system mode
*
* Gathers a generic snapshot of user and system time.
*/
-struct cputime {
+struct prev_cputime {
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
cputime_t utime;
cputime_t stime;
+ raw_spinlock_t lock;
+#endif
};
+static inline void prev_cputime_init(struct prev_cputime *prev)
+{
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+ prev->utime = prev->stime = 0;
+ raw_spin_lock_init(&prev->lock);
+#endif
+}
+
/**
* struct task_cputime - collected CPU time counts
* @utime: time spent in user mode, in &cputime_t units
@@ -716,9 +727,7 @@ struct signal_struct {
cputime_t utime, stime, cutime, cstime;
cputime_t gtime;
cputime_t cgtime;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
- struct cputime prev_cputime;
-#endif
+ struct prev_cputime prev_cputime;
unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
unsigned long inblock, oublock, cinblock, coublock;
@@ -1494,9 +1503,7 @@ struct task_struct {
cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
- struct cputime prev_cputime;
-#endif
+ struct prev_cputime prev_cputime;
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
seqlock_t vtime_seqlock;
unsigned long long vtime_snap;
diff --git a/kernel/fork.c b/kernel/fork.c
index c9507afd4a7d..306481d3a0e0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1067,6 +1067,7 @@ static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
rcu_assign_pointer(tsk->sighand, sig);
if (!sig)
return -ENOMEM;
+
atomic_set(&sig->count, 1);
memcpy(sig->action, current->sighand->action, sizeof(sig->action));
return 0;
@@ -1128,6 +1129,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
init_sigpending(&sig->shared_pending);
INIT_LIST_HEAD(&sig->posix_timers);
seqlock_init(&sig->stats_lock);
+ prev_cputime_init(&sig->prev_cputime);
hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
sig->real_timer.function = it_real_fn;
@@ -1338,9 +1340,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->utime = p->stime = p->gtime = 0;
p->utimescaled = p->stimescaled = 0;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
- p->prev_cputime.utime = p->prev_cputime.stime = 0;
-#endif
+ prev_cputime_init(&p->prev_cputime);
+
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
seqlock_init(&p->vtime_seqlock);
p->vtime_snap = 0;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index f5a64ffad176..3acfab426e4f 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -555,32 +555,16 @@ static cputime_t scale_stime(u64 stime, u64 rtime, u64 total)
}
/*
- * Atomically advance counter to the new value. Interrupts, vcpu
- * scheduling, and scaling inaccuracies can cause cputime_advance
- * to be occasionally called with a new value smaller than counter.
- * Let's enforce atomicity.
- *
- * Normally a caller will only go through this loop once, or not
- * at all in case a previous caller updated counter the same jiffy.
- */
-static void cputime_advance(cputime_t *counter, cputime_t new)
-{
- cputime_t old;
-
- while (new > (old = READ_ONCE(*counter)))
- cmpxchg_cputime(counter, old, new);
-}
-
-/*
* Adjust tick based cputime random precision against scheduler
* runtime accounting.
*/
static void cputime_adjust(struct task_cputime *curr,
- struct cputime *prev,
+ struct prev_cputime *prev,
cputime_t *ut, cputime_t *st)
{
cputime_t rtime, stime, utime;
+ raw_spin_lock(&prev->lock);
/*
* Tick based cputime accounting depend on random scheduling
* timeslices of a task to be interrupted or not by the timer.
@@ -609,19 +593,19 @@ static void cputime_adjust(struct task_cputime *curr,
} else if (stime == 0) {
utime = rtime;
} else {
- cputime_t total = stime + utime;
-
- stime = scale_stime((__force u64)stime,
- (__force u64)rtime, (__force u64)total);
+ stime = scale_stime((__force u64)stime, (__force u64)rtime,
+ (__force u64)(stime + utime));
+ if (stime < prev->stime)
+ stime = prev->stime;
utime = rtime - stime;
}
- cputime_advance(&prev->stime, stime);
- cputime_advance(&prev->utime, utime);
-
+ prev->stime = max(prev->stime, stime);
+ prev->utime = max(prev->utime, utime);
out:
*ut = prev->utime;
*st = prev->stime;
+ raw_spin_unlock(&prev->lock);
}
void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
Hello Peter, the locking part looks good; I don't have a strong
opinion on a per task/signal lock vs a global lock.
But with the patch we still update prev->utime and prev->stime
independently, which was the original problem. Maybe the locking
and the monotonicity/sum issue should be addressed by two separate
patches, since they are separate bugs?
The part I'm referring to is the change below from my original patch
(maybe without the WARN_ON()s?):
…
- cputime_advance(&prev->stime, stime);
- cputime_advance(&prev->utime, utime);
+ if (stime < prev->stime) {
+ stime = prev->stime;
+ utime = rtime - stime;
+ } else if (utime < prev->utime) {
+ utime = prev->utime;
+ stime = rtime - utime;
+ }
+ WARN_ON(stime < prev->stime);
+ WARN_ON(utime < prev->utime);
+ WARN_ON(stime + utime != rtime);
I can prepare and test such a patch set (based on yours) if you would
like me to, or you can do it yourself. Either way I'll be happy.
/Fredrik
--
/Fredrik
On Mon, 2015-06-29 at 17:28 +0200, Fredrik Markström wrote:
> Hello Peter, the locking part looks good, I don't have a strong
> opinion on per task/signal lock vs global lock.
>
> But with the patch we still update prev->utime and prev->stime
> independently, which was the original problem. But maybe the locking
> and the monotonicity/sum issue should be addressed by two separate patches
> since they are separate bugs ?
>
> The part I'm referring to is the change below from my original patch
> (maybe without the WARN_ON:s ?):
>
> …
> - cputime_advance(&prev->stime, stime);
> - cputime_advance(&prev->utime, utime);
> + if (stime < prev->stime) {
> + stime = prev->stime;
> + utime = rtime - stime;
> + } else if (utime < prev->utime) {
> + utime = prev->utime;
> + stime = rtime - utime;
> + }
> + WARN_ON(stime < prev->stime);
> + WARN_ON(utime < prev->utime);
> + WARN_ON(stime + utime != rtime);
How about substituting:
prev->stime = max(prev->stime, stime);
prev->utime = max(prev->utime, utime);
with
if (stime > prev->stime || utime > prev->utime) {
prev->stime = stime;
prev->utime = utime;
}
in Peter's patch?
I don't think that is good enough. I believe the reason the
max()-stuff was initially put there was to make sure the returned stime
and utime components increase monotonically. The scaling code
can cause either one to decrease from one call to the next even if
rtime increases.
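(Using the numbers from my earlier mail: tick samples utime=1, stime=0
attribute the whole rtime to utime, while the next samples utime=1, stime=100
attribute almost all of the slightly larger rtime to stime, so the scaled
utime component shrinks even though rtime grew.)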
The purpose of my patch is to also make sure that the sum of utime and
stime is rtime.
I lost the last part of the patch in my previous email:
- cputime_advance(&prev->stime, stime);
- cputime_advance(&prev->utime, utime);
+ if (stime < prev->stime) {
+ stime = prev->stime;
+ utime = rtime - stime;
+ } else if (utime < prev->utime) {
+ utime = prev->utime;
+ stime = rtime - utime;
+ }
-out:
+ if (prev->stime + prev->utime < rtime) {
+ prev->stime = stime;
+ prev->utime = utime;
+ }
*ut = prev->utime;
*st = prev->stime;
/Fredrik
--
/Fredrik
On Mon, 2015-06-29 at 21:08 +0200, Fredrik Markström wrote:
> I don't think that is good enough. I believe the reason the
> max()-stuff was initially put there to make sure the returned stime
> and utime components are increasing monotonically. The scaling code
> can cause either or to decrease from one call to the other even i
> rtime increases.
>
> The purpose of my patch is to also make sure that the sum of utime and
> stime is rtime.
>
> I lost the last part of the patch in my previous email:
>
>
> - cputime_advance(&prev->stime, stime);
> - cputime_advance(&prev->utime, utime);
> + if (stime < prev->stime) {
> + stime = prev->stime;
> + utime = rtime - stime;
> + } else if (utime < prev->utime) {
> + utime = prev->utime;
> + stime = rtime - utime;
> + }
> -out:
> + if (prev->stime + prev->utime < rtime) {
> + prev->stime = stime;
> + prev->utime = utime;
> + }
> *ut = prev->utime;
> *st = prev->stime;
In this case, we might want to avoid the extra stime and utime
computations if prev->stime + prev->utime >= rtime, since they wouldn't
get used.
On Mon, Jun 29, 2015 at 05:28:42PM +0200, Fredrik Markström wrote:
> Hello Peter, the locking part looks good, I don't have a strong
> opinion on per task/signal lock vs global lock.
>
> But with the patch we still update prev->utime and prev->stime
> independently, which was the original problem.
Ah,..
> > @@ -609,19 +593,19 @@ static void cputime_adjust(struct task_cputime *curr,
> > } else if (stime == 0) {
> > utime = rtime;
> > } else {
> > + stime = scale_stime((__force u64)stime, (__force u64)rtime,
> > + (__force u64)(stime + utime));
> > + if (stime < prev->stime)
> > + stime = prev->stime;
> > utime = rtime - stime;
> > }
> >
> > + prev->stime = max(prev->stime, stime);
> > + prev->utime = max(prev->utime, utime);
> > out:
> > *ut = prev->utime;
> > *st = prev->stime;
> > + raw_spin_unlock(&prev->lock);
> > }
So the things we want from this function are that:
- stime + utime == rtime
- stime_i+1 >= stime_i, utime_i+1 >= utime_i
Under the assumption that rtime_i+1 >= rtime_i.
There are 3 cases:
1) utime == 0 -> stime = rtime
2) stime == 0 -> utime = rtime
Both these trivially meet the above conditions, and:
3) stime_i+1 = max((stime * rtime) / (stime + utime), stime_i)
utime = rtime - stime
Which I thought would meet them, because we keep stime from shrinking and
therefore utime from growing too big. But there is indeed the other side
to consider: what if stime grows too big and makes the new utime too
small? Then we update stime but not utime and break the first condition.
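A concrete (made-up) example of that: say prev->stime = 60, prev->utime = 40,
so the previous rtime was 100. A new sample has rtime = 110 and scales to
stime = 90, giving utime = 110 - 90 = 20. The independent max() updates then
set prev->stime = 90 but leave prev->utime = 40, and the reported sum is
130 > 110.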
Now, you propose:
> + if (stime < prev->stime) {
> + stime = prev->stime;
> + utime = rtime - stime;
Right, I have this branch; it keeps stime in check as per the above.
Since we set stime_i+1 = stime_i, we get utime_i+1 = rtime_i+1 - stime_i
>= utime_i, because rtime_i+1 >= rtime_i.
> + } else if (utime < prev->utime) {
> + utime = prev->utime;
> + stime = rtime - utime;
> + }
And that deals with the other side, similar proof.
> + if (prev->stime + prev->utime < rtime) {
> + prev->stime = stime;
> + prev->utime = utime;
> + }
This I don't really agree with; we can leave the inverse check as is and
simply bail early.
Folding this all together gives me
---
include/linux/init_task.h | 10 ++++++++++
include/linux/sched.h | 23 +++++++++++++--------
kernel/fork.c | 7 ++++---
kernel/sched/cputime.c | 51 ++++++++++++++++++++++-------------------------
4 files changed, 53 insertions(+), 38 deletions(-)
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index bb9b075f0eb0..e38681f4912d 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -39,6 +39,14 @@ extern struct fs_struct init_fs;
#define INIT_CPUSET_SEQ(tsk)
#endif
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+#define INIT_PREV_CPUTIME(x) .prev_cputime = { \
+ .lock = __RAW_SPIN_LOCK_UNLOCKED(x.prev_cputime.lock), \
+},
+#else
+#define INIT_PREV_CPUTIME(x)
+#endif
+
#define INIT_SIGNALS(sig) { \
.nr_threads = 1, \
.thread_head = LIST_HEAD_INIT(init_task.thread_node), \
@@ -53,6 +61,7 @@ extern struct fs_struct init_fs;
.cputime_atomic = INIT_CPUTIME_ATOMIC, \
.running = 0, \
}, \
+ INIT_PREV_CPUTIME(sig) \
.cred_guard_mutex = \
__MUTEX_INITIALIZER(sig.cred_guard_mutex), \
INIT_GROUP_RWSEM(sig) \
@@ -254,6 +263,7 @@ extern struct task_group root_task_group;
INIT_TASK_RCU_TASKS(tsk) \
INIT_CPUSET_SEQ(tsk) \
INIT_RT_MUTEXES(tsk) \
+ INIT_PREV_CPUTIME(tsk) \
INIT_VTIME(tsk) \
INIT_NUMA_BALANCING(tsk) \
INIT_KASAN(tsk) \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6633e83e608a..5dfbe92ce04e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -531,17 +531,28 @@ struct cpu_itimer {
};
/**
- * struct cputime - snaphsot of system and user cputime
+ * struct prev_cputime - snaphsot of system and user cputime
* @utime: time spent in user mode
* @stime: time spent in system mode
*
* Gathers a generic snapshot of user and system time.
*/
-struct cputime {
+struct prev_cputime {
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
cputime_t utime;
cputime_t stime;
+ raw_spinlock_t lock;
+#endif
};
+static inline void prev_cputime_init(struct prev_cputime *prev)
+{
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+ prev->utime = prev->stime = 0;
+ raw_spin_lock_init(&prev->lock);
+#endif
+}
+
/**
* struct task_cputime - collected CPU time counts
* @utime: time spent in user mode, in &cputime_t units
@@ -716,9 +727,7 @@ struct signal_struct {
cputime_t utime, stime, cutime, cstime;
cputime_t gtime;
cputime_t cgtime;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
- struct cputime prev_cputime;
-#endif
+ struct prev_cputime prev_cputime;
unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
unsigned long inblock, oublock, cinblock, coublock;
@@ -1494,9 +1503,7 @@ struct task_struct {
cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
- struct cputime prev_cputime;
-#endif
+ struct prev_cputime prev_cputime;
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
seqlock_t vtime_seqlock;
unsigned long long vtime_snap;
diff --git a/kernel/fork.c b/kernel/fork.c
index c9507afd4a7d..306481d3a0e0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1067,6 +1067,7 @@ static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
rcu_assign_pointer(tsk->sighand, sig);
if (!sig)
return -ENOMEM;
+
atomic_set(&sig->count, 1);
memcpy(sig->action, current->sighand->action, sizeof(sig->action));
return 0;
@@ -1128,6 +1129,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
init_sigpending(&sig->shared_pending);
INIT_LIST_HEAD(&sig->posix_timers);
seqlock_init(&sig->stats_lock);
+ prev_cputime_init(&sig->prev_cputime);
hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
sig->real_timer.function = it_real_fn;
@@ -1338,9 +1340,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->utime = p->stime = p->gtime = 0;
p->utimescaled = p->stimescaled = 0;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
- p->prev_cputime.utime = p->prev_cputime.stime = 0;
-#endif
+ prev_cputime_init(&p->prev_cputime);
+
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
seqlock_init(&p->vtime_seqlock);
p->vtime_snap = 0;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index f5a64ffad176..f3af6d86f38b 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -555,32 +555,16 @@ static cputime_t scale_stime(u64 stime, u64 rtime, u64 total)
}
/*
- * Atomically advance counter to the new value. Interrupts, vcpu
- * scheduling, and scaling inaccuracies can cause cputime_advance
- * to be occasionally called with a new value smaller than counter.
- * Let's enforce atomicity.
- *
- * Normally a caller will only go through this loop once, or not
- * at all in case a previous caller updated counter the same jiffy.
- */
-static void cputime_advance(cputime_t *counter, cputime_t new)
-{
- cputime_t old;
-
- while (new > (old = READ_ONCE(*counter)))
- cmpxchg_cputime(counter, old, new);
-}
-
-/*
* Adjust tick based cputime random precision against scheduler
* runtime accounting.
*/
static void cputime_adjust(struct task_cputime *curr,
- struct cputime *prev,
+ struct prev_cputime *prev,
cputime_t *ut, cputime_t *st)
{
cputime_t rtime, stime, utime;
+ raw_spin_lock(&prev->lock);
/*
* Tick based cputime accounting depend on random scheduling
* timeslices of a task to be interrupted or not by the timer.
@@ -606,22 +590,35 @@ static void cputime_adjust(struct task_cputime *curr,
if (utime == 0) {
stime = rtime;
- } else if (stime == 0) {
- utime = rtime;
- } else {
- cputime_t total = stime + utime;
+ goto update;
+ }
- stime = scale_stime((__force u64)stime,
- (__force u64)rtime, (__force u64)total);
- utime = rtime - stime;
+ if (stime == 0) {
+ utime = rtime;
+ goto update;
}
- cputime_advance(&prev->stime, stime);
- cputime_advance(&prev->utime, utime);
+ stime = scale_stime((__force u64)stime, (__force u64)rtime,
+ (__force u64)(stime + utime));
+
+ /* make sure stime doesn't go backwards */
+ if (stime < prev->stime)
+ stime = prev->stime;
+ utime = rtime - stime;
+
+ /* make sure utime doesn't go backwards */
+ if (utime < prev->utime) {
+ utime = prev->utime;
+ stime = rtime - utime;
+ }
+update:
+ prev->stime = stime;
+ prev->utime = utime;
out:
*ut = prev->utime;
*st = prev->stime;
+ raw_spin_unlock(&prev->lock);
}
void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
Excellent,
The reason I replaced the early bail with that last test is that I
believe it needs to be done within the lock and I wanted to keep that
region short. To be honest I'm not sure this test is needed at all
anymore, but I couldn't make sense of the comment above the early bail
so I didn't dare to remove it.
Regarding the lock: have you considered how many cores you need
hammering at rusage to introduce any substantial contention? On ARM
this region is ~10 CPU cycles long, and I measure a null system
call (lmbench: lat_syscall null) at ~1000 cycles (with some kernel
debugging turned on).
Sorry for not letting this go (I know I should), but I always feel bad
introducing per-thread data.
/Fredrik
--
/Fredrik
On Tue, Jun 30, 2015 at 01:50:15PM +0200, Fredrik Markström wrote:
> Excellent,
Please do not top post.
> The reason I replaced the early bail with that last test is that I
> believe it needs to be done within the lock and I wanted to keep that
> region short. To be honest I'm not sure this test is needed at all
> anymore, but I couldn't make sense of the comment above the early bail
> so I didn't dare to remove it.
Ah, there's a simple reason we should keep it, apart from the wobblies
in calculating the division. Imagine two concurrent callers, one with an
rtime ahead of the other. Let the caller with the later rtime acquire the
lock first and compute s/u-time. Once the second caller acquires the lock,
we observe that its rtime is in the past and we use the latest values.
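(Concretely, with invented numbers: if caller A samples rtime = 110 and
updates prev first, a slower caller B that sampled rtime = 100 will, once it
gets the lock, find prev->stime + prev->utime >= 100 and should simply return
the newer prev values rather than move them backwards.)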
> Regarding the lock, have you considered how many cores you need
> hammering at rusage to introduce some substantial congestion ?
Spinlock contention across 120 cores and 4 nodes is pretty bad, even
with hardly any hold time :-)
I've not investigated where the absolute pain threshold is, but given the
size (and growth) of machines these days, it seems like a prudent
thing.
> Sorry for not letting this go (I know I should) but I always feel bad
> introducing per thread data.
Yes, agreed, but a global lock is just asking for trouble, especially when
it's as easy as this to avoid.
On Tue, Jun 30, 2015 at 2:18 PM, Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Jun 30, 2015 at 01:50:15PM +0200, Fredrik Markström wrote:
>> Excellent,
>
> Please do not top post.
Understood, sorry!
>
>> The reason I replaced the early bail with that last test is that I
>> believe it needs to be done within the lock and I wanted to keep that
>> region short. To be honest I'm not sure this test is needed at all
>> anymore, but I couldn't make sense of the comment above the early bail
>> so I didn't dare to remove it.
>
> Ah, there's a simple reason we should keep it, apart from the wobblies
> in calculating the division. Imagine two concurrent callers, on with an
> rtime ahead of the other. Let the latest rtime caller acquire the lock
> first and compute s/u-time. Once the second caller acquires the lock, we
> observe the last rtime was in the past and we use the latest values.
You are so right, sorry about that! I agree the test is needed and it needs to
be done with the lock held.
But I don't think the "wobblies" in the division are a problem anymore, since the
division no longer affects the sum (prev->stime + prev->utime), so that comment
should go, right?
>
>> Regarding the lock, have you considered how many cores you need
>> hammering at rusage to introduce some substantial congestion?
>
> Spinlock contention across 120 cores and 4 nodes is pretty bad, even
> with hardly any hold time :-)
>
> I've not investigated where the absolute pain threshold is, but given the
> size (and growth) of machines these days, it seems like a prudent
> thing.
>
Well I guess it can be a problem on a system where 120 cores are doing nothing
but hammering on rusage... on the other hand I feel a system like that
deserves it. :)
>> Sorry for not letting this go (I know I should) but I always feel bad
>> introducing per thread data.
>
> Yes agreed, but a global lock is just asking for trouble. Esp. when it's
> as easy as this to avoid it.
>
Ok, you might be right. Either way, I'm letting go now :)
/Fredrik
On Tue, Jun 30, 2015 at 08:30:05PM +0200, Fredrik Markström wrote:
> But I don't think the "wobblies" in the division are a problem anymore, since the
> division no longer affects the sum (prev->stime + prev->utime), so that comment
> should go, right?
Agreed.
> Well I guess it can be a problem on a system where 120 cores are doing nothing
> but hammering on rusage... on the other hand I feel a system like that
> deserves it. :)
True, doing stupid deserves stupid, but somehow people often tend to
disagree with it when it's them doing it :-)
Imagine someone doing a gprof-like thing where they stuff a getrusage()
call in the mcount hook and then run their many-threaded application.
Provided they use RUSAGE_THREAD it doesn't even sound too insane.
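For reference, a minimal sketch of the per-thread query such a hook would make
(only the getrusage(RUSAGE_THREAD) call itself; the mcount plumbing is omitted):

#define _GNU_SOURCE             /* for RUSAGE_THREAD */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        struct rusage ru;

        /* per-thread user/system time, i.e. the numbers such a hook would sample */
        if (getrusage(RUSAGE_THREAD, &ru) == 0)
                printf("utime=%ld.%06lds stime=%ld.%06lds\n",
                       (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
                       (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
        return 0;
}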
On Thu, Jul 02, 2015 at 02:11:26PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 30, 2015 at 08:30:05PM +0200, Fredrik Markström wrote:
> > But I don't think the "wobblies" in the division are a problem anymore, since the
> > division no longer affects the sum (prev->stime + prev->utime), so that comment
> > should go, right?
>
> Agreed.
OK, I've got the below queued, let's see what a round of cross-compiles
does.
---
Subject: sched,cputime: Guarantee stime + utime == rtime
From: Peter Zijlstra <[email protected]>
Date: Tue, 30 Jun 2015 11:30:54 +0200
While the current code guarantees monotonicity for stime and utime
independently of one another, it does not guarantee that the sum of
both is equal to the total time we started out with.
This confuses things (and people) that look at this sum, like top, and
will report >100% usage followed by a matching period of 0%.
Rework the code to provide both individual monotonicity and a coherent
sum.
Cc: [email protected]
Cc: [email protected]
Cc: Rik van Riel <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Suggested-by: Fredrik Markstrom <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/init_task.h | 10 ++++
include/linux/sched.h | 40 ++++++++++--------
kernel/fork.c | 7 +--
kernel/sched/cputime.c | 100 +++++++++++++++++++++++++++-------------------
4 files changed, 96 insertions(+), 61 deletions(-)
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -39,6 +39,14 @@ extern struct fs_struct init_fs;
#define INIT_CPUSET_SEQ(tsk)
#endif
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+#define INIT_PREV_CPUTIME(x) .prev_cputime = { \
+ .lock = __RAW_SPIN_LOCK_UNLOCKED(x.prev_cputime.lock), \
+},
+#else
+#define INIT_PREV_CPUTIME(x)
+#endif
+
#define INIT_SIGNALS(sig) { \
.nr_threads = 1, \
.thread_head = LIST_HEAD_INIT(init_task.thread_node), \
@@ -53,6 +61,7 @@ extern struct fs_struct init_fs;
.cputime_atomic = INIT_CPUTIME_ATOMIC, \
.running = 0, \
}, \
+ INIT_PREV_CPUTIME(sig) \
.cred_guard_mutex = \
__MUTEX_INITIALIZER(sig.cred_guard_mutex), \
INIT_GROUP_RWSEM(sig) \
@@ -254,6 +263,7 @@ extern struct task_group root_task_group
INIT_TASK_RCU_TASKS(tsk) \
INIT_CPUSET_SEQ(tsk) \
INIT_RT_MUTEXES(tsk) \
+ INIT_PREV_CPUTIME(tsk) \
INIT_VTIME(tsk) \
INIT_NUMA_BALANCING(tsk) \
INIT_KASAN(tsk) \
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -531,39 +531,49 @@ struct cpu_itimer {
};
/**
- * struct cputime - snaphsot of system and user cputime
+ * struct prev_cputime - snaphsot of system and user cputime
* @utime: time spent in user mode
* @stime: time spent in system mode
+ * @lock: protects the above two fields
*
- * Gathers a generic snapshot of user and system time.
+ * Stores previous user/system time values such that we can guarantee
+ * monotonicity.
*/
-struct cputime {
+struct prev_cputime {
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
cputime_t utime;
cputime_t stime;
+ raw_spinlock_t lock;
+#endif
};
+static inline void prev_cputime_init(struct prev_cputime *prev)
+{
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+ prev->utime = prev->stime = 0;
+ raw_spin_lock_init(&prev->lock);
+#endif
+}
+
/**
* struct task_cputime - collected CPU time counts
* @utime: time spent in user mode, in &cputime_t units
* @stime: time spent in kernel mode, in &cputime_t units
* @sum_exec_runtime: total time spent on the CPU, in nanoseconds
*
- * This is an extension of struct cputime that includes the total runtime
- * spent by the task from the scheduler point of view.
- *
- * As a result, this structure groups together three kinds of CPU time
- * that are tracked for threads and thread groups. Most things considering
- * CPU time want to group these counts together and treat all three
- * of them in parallel.
+ * This structure groups together three kinds of CPU time that are tracked for
+ * threads and thread groups. Most things considering CPU time want to group
+ * these counts together and treat all three of them in parallel.
*/
struct task_cputime {
cputime_t utime;
cputime_t stime;
unsigned long long sum_exec_runtime;
};
+
/* Alternate field names when used to cache expirations. */
-#define prof_exp stime
#define virt_exp utime
+#define prof_exp stime
#define sched_exp sum_exec_runtime
#define INIT_CPUTIME \
@@ -716,9 +726,7 @@ struct signal_struct {
cputime_t utime, stime, cutime, cstime;
cputime_t gtime;
cputime_t cgtime;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
- struct cputime prev_cputime;
-#endif
+ struct prev_cputime prev_cputime;
unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt;
unsigned long inblock, oublock, cinblock, coublock;
@@ -1494,9 +1502,7 @@ struct task_struct {
cputime_t utime, stime, utimescaled, stimescaled;
cputime_t gtime;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
- struct cputime prev_cputime;
-#endif
+ struct prev_cputime prev_cputime;
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
seqlock_t vtime_seqlock;
unsigned long long vtime_snap;
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1067,6 +1067,7 @@ static int copy_sighand(unsigned long cl
rcu_assign_pointer(tsk->sighand, sig);
if (!sig)
return -ENOMEM;
+
atomic_set(&sig->count, 1);
memcpy(sig->action, current->sighand->action, sizeof(sig->action));
return 0;
@@ -1128,6 +1129,7 @@ static int copy_signal(unsigned long clo
init_sigpending(&sig->shared_pending);
INIT_LIST_HEAD(&sig->posix_timers);
seqlock_init(&sig->stats_lock);
+ prev_cputime_init(&sig->prev_cputime);
hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
sig->real_timer.function = it_real_fn;
@@ -1338,9 +1340,8 @@ static struct task_struct *copy_process(
p->utime = p->stime = p->gtime = 0;
p->utimescaled = p->stimescaled = 0;
-#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
- p->prev_cputime.utime = p->prev_cputime.stime = 0;
-#endif
+ prev_cputime_init(&p->prev_cputime);
+
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
seqlock_init(&p->vtime_seqlock);
p->vtime_snap = 0;
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -555,48 +555,42 @@ static cputime_t scale_stime(u64 stime,
}
/*
- * Atomically advance counter to the new value. Interrupts, vcpu
- * scheduling, and scaling inaccuracies can cause cputime_advance
- * to be occasionally called with a new value smaller than counter.
- * Let's enforce atomicity.
+ * Adjust tick based cputime random precision against scheduler runtime
+ * accounting.
*
- * Normally a caller will only go through this loop once, or not
- * at all in case a previous caller updated counter the same jiffy.
- */
-static void cputime_advance(cputime_t *counter, cputime_t new)
-{
- cputime_t old;
-
- while (new > (old = READ_ONCE(*counter)))
- cmpxchg_cputime(counter, old, new);
-}
-
-/*
- * Adjust tick based cputime random precision against scheduler
- * runtime accounting.
+ * Tick based cputime accounting depend on random scheduling timeslices of a
+ * task to be interrupted or not by the timer. Depending on these
+ * circumstances, the number of these interrupts may be over or
+ * under-optimistic, matching the real user and system cputime with a variable
+ * precision.
+ *
+ * Fix this by scaling these tick based values against the total runtime
+ * accounted by the CFS scheduler.
+ *
+ * This code provides the following guarantees:
+ *
+ * stime + utime == rtime
+ * stime_i+1 >= stime_i, utime_i+1 >= utime_i
+ *
+ * Assuming that rtime_i+1 >= rtime_i.
*/
static void cputime_adjust(struct task_cputime *curr,
- struct cputime *prev,
+ struct prev_cputime *prev,
cputime_t *ut, cputime_t *st)
{
cputime_t rtime, stime, utime;
- /*
- * Tick based cputime accounting depend on random scheduling
- * timeslices of a task to be interrupted or not by the timer.
- * Depending on these circumstances, the number of these interrupts
- * may be over or under-optimistic, matching the real user and system
- * cputime with a variable precision.
- *
- * Fix this by scaling these tick based values against the total
- * runtime accounted by the CFS scheduler.
- */
+ /* Serialize concurrent callers such that we can honour our guarantees */
+ raw_spin_lock(&prev->lock);
rtime = nsecs_to_cputime(curr->sum_exec_runtime);
/*
- * Update userspace visible utime/stime values only if actual execution
- * time is bigger than already exported. Note that can happen, that we
- * provided bigger values due to scaling inaccuracy on big numbers.
+ * This is possible under two circumstances:
+ * - rtime isn't monotonic after all (a bug);
+ * - we got reordered by the lock.
+ *
+ * In both cases this acts as a filter such that the rest of the code
+ * can assume it is monotonic regardless of anything else.
*/
if (prev->stime + prev->utime >= rtime)
goto out;
@@ -606,22 +600,46 @@ static void cputime_adjust(struct task_c
if (utime == 0) {
stime = rtime;
- } else if (stime == 0) {
- utime = rtime;
- } else {
- cputime_t total = stime + utime;
+ goto update;
+ }
- stime = scale_stime((__force u64)stime,
- (__force u64)rtime, (__force u64)total);
- utime = rtime - stime;
+ if (stime == 0) {
+ utime = rtime;
+ goto update;
}
- cputime_advance(&prev->stime, stime);
- cputime_advance(&prev->utime, utime);
+ stime = scale_stime((__force u64)stime, (__force u64)rtime,
+ (__force u64)(stime + utime));
+
+ /*
+ * Make sure stime doesn't go backwards; this preserves monotonicity
+ * for utime because rtime is monotonic.
+ *
+ * utime_i+1 = rtime_i+1 - stime_i
+ * = rtime_i+1 - (rtime_i - stime_i)
+ * = (rtime_i+1 - rtime_i) + stime_i
+ * >= stime_i
+ */
+ if (stime < prev->stime)
+ stime = prev->stime;
+ utime = rtime - stime;
+
+ /*
+ * Make sure utime doesn't go backwards; this still preserves
+ * monotonicity for stime, analogous argument to above.
+ */
+ if (utime < prev->utime) {
+ utime = prev->utime;
+ stime = rtime - utime;
+ }
+update:
+ prev->stime = stime;
+ prev->utime = utime;
out:
*ut = prev->utime;
*st = prev->stime;
+ raw_spin_unlock(&prev->lock);
}
void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
On Thu, Jul 02, 2015 at 03:07:01PM +0200, Peter Zijlstra wrote:
> @@ -606,22 +600,46 @@ static void cputime_adjust(struct task_c
>
> if (utime == 0) {
> stime = rtime;
> - } else if (stime == 0) {
> - utime = rtime;
> - } else {
> - cputime_t total = stime + utime;
> + goto update;
> + }
>
> - stime = scale_stime((__force u64)stime,
> - (__force u64)rtime, (__force u64)total);
> - utime = rtime - stime;
> + if (stime == 0) {
> + utime = rtime;
> + goto update;
> }
>
> - cputime_advance(&prev->stime, stime);
> - cputime_advance(&prev->utime, utime);
> + stime = scale_stime((__force u64)stime, (__force u64)rtime,
> + (__force u64)(stime + utime));
> +
> + /*
> + * Make sure stime doesn't go backwards; this preserves monotonicity
> + * for utime because rtime is monotonic.
> + *
> + * utime_i+1 = rtime_i+1 - stime_i
I'm not sure what is meant by _i+1.
I guess stime_i means prev->stime, and stime_i+1 the new update of prev->stime.
But then what are rtime_i and rtime_i+1, since we have no scaled rtime?
> + * = rtime_i+1 - (rtime_i - stime_i)
> + * = (rtime_i+1 - rtime_i) + stime_i
> + * >= stime_i
> + */
> + if (stime < prev->stime)
> + stime = prev->stime;
> + utime = rtime - stime;
> +
> + /*
> + * Make sure utime doesn't go backwards; this still preserves
> + * monotonicity for stime, analogous argument to above.
> + */
> + if (utime < prev->utime) {
> + utime = prev->utime;
> + stime = rtime - utime;
I see, so we are guaranteed that this final stime won't get below
prev->stime because older prev->stime + prev->utime <= newest rtime. I
guess that's more or less what's in the comments above :-)
> + }
>
> +update:
> + prev->stime = stime;
> + prev->utime = utime;
> out:
> *ut = prev->utime;
> *st = prev->stime;
> + raw_spin_unlock(&prev->lock);
> }
>
> void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
Ok, I scratched my head a lot on this patch and the issues behind it, and it looks
good to me. I worried about introducing a spinlock, but we had two cmpxchg before
that. The overhead is close.
On Tue, Jul 07, 2015 at 02:51:36AM +0200, Frederic Weisbecker wrote:
> On Thu, Jul 02, 2015 at 03:07:01PM +0200, Peter Zijlstra wrote:
> > + /*
> > + * Make sure stime doesn't go backwards; this preserves monotonicity
> > + * for utime because rtime is monotonic.
> > + *
> > + * utime_i+1 = rtime_i+1 - stime_i
>
> I'm not sure what is meant by _i+1.
Since we have a discrete set of elements, we can enumerate them and _i
is the i-th element in the (ordered) set. _i+1 is the i+1-th element,
and so on.
> I guess stime_i means prev->stime, and stime_i+1 the new update of prev->stime.
> But then what are rtime_i and rtime_i+1, since we have no scaled rtime?
Still the previous and the next value.
rtime_i+1 >= rtime_i
just means that every next rtime value must be equal to or greater than the
last; IOW, rtime must be monotonic.
> > + * = rtime_i+1 - (rtime_i - stime_i)
> > + * = (rtime_i+1 - rtime_i) + stime_i
> > + * >= stime_i
> > + */
> > + if (stime < prev->stime)
> > + stime = prev->stime;
> > + utime = rtime - stime;
> > +
> > + /*
> > + * Make sure utime doesn't go backwards; this still preserves
> > + * monotonicity for stime, analogous argument to above.
> > + */
> > + if (utime < prev->utime) {
> > + utime = prev->utime;
> > + stime = rtime - utime;
>
> I see, so we are guaranteed that this final stime won't get below
> prev->stime because older prev->stime + prev->utime <= newest rtime. I
> guess that's more or less what's in the comments above :-)
Indeed.
> > + }
> >
> > +update:
> > + prev->stime = stime;
> > + prev->utime = utime;
> > out:
> > *ut = prev->utime;
> > *st = prev->stime;
> > + raw_spin_unlock(&prev->lock);
> > }
> >
> > void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
>
> Ok, I scratched my head a lot on this patch and the issues behind it, and it looks
> good to me. I worried about introducing a spinlock, but we had two cmpxchg before
> that. The overhead is close.
It's slightly worse; I had to change the raw_spin_lock() to
raw_spin_lock_irqsave() because Ingo managed to trigger a lockdep splat
with the sighand lock taking this lock, and the sighand lock is IRQ-safe.
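Roughly this (a sketch of the change just described, not the queued diff itself,
and not compilable on its own):

static void cputime_adjust(struct task_cputime *curr,
                           struct prev_cputime *prev,
                           cputime_t *ut, cputime_t *st)
{
        unsigned long flags;

        /* IRQ-saving variant so this nests safely under the IRQ-safe sighand lock */
        raw_spin_lock_irqsave(&prev->lock, flags);
        /* ... the adjustment and monotonicity clamps from the patch above ... */
        raw_spin_unlock_irqrestore(&prev->lock, flags);
}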
On Tue, Jul 07, 2015 at 09:59:54AM +0200, Peter Zijlstra wrote:
> > > + /*
> > > + * Make sure stime doesn't go backwards; this preserves monotonicity
> > > + * for utime because rtime is monotonic.
> > > + *
> > > + * utime_i+1 = rtime_i+1 - stime_i
> > > + * = rtime_i+1 - (rtime_i - stime_i)
> > > + * = (rtime_i+1 - rtime_i) + stime_i
> > > + * >= stime_i
> > > + */
Argh, just noticed I messed that up, it should read:
+ /*
+ * Make sure stime doesn't go backwards; this preserves monotonicity
+ * for utime because rtime is monotonic.
+ *
+ * utime_i+1 = rtime_i+1 - stime_i
+ * = rtime_i+1 - (rtime_i - utime_i)
+ * = (rtime_i+1 - rtime_i) + utime_i
+ * >= utime_i
+ */
I got some [us] confusion. Typing is hard.
So we compute: utime = rtime - stime, which we'll denote as:
utime_i+1 = rtime_i+1 - stime_i
since: stime_i + utime_i = rtime_i, we can write: stime_i = rtime_i -
utime_i and substitute in the above:
= rtime_i+1 - (rtime_i - utime_i)
Rearrange terms:
= (rtime_i+1 - rtime_i) + utime_i
And since we have: rtime_i+1 >= rtime_i, which we can write as:
rtime_i+1 - rtime_i >= 0, substitution gives:
>= utime_i
And we've proven that the new utime must be equal or greater than the
previous utime, because rtime is monotonic.
Just to let you know, I've tested your last patch and it solves all my
original problems (which it should since the code is functionally
equivalent).
/Fredrik
On Tue, Jul 7, 2015 at 10:09 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Jul 07, 2015 at 09:59:54AM +0200, Peter Zijlstra wrote:
>> > > + /*
>> > > + * Make sure stime doesn't go backwards; this preserves monotonicity
>> > > + * for utime because rtime is monotonic.
>> > > + *
>> > > + * utime_i+1 = rtime_i+1 - stime_i
>> > > + * = rtime_i+1 - (rtime_i - stime_i)
>> > > + * = (rtime_i+1 - rtime_i) + stime_i
>> > > + * >= stime_i
>> > > + */
>
> Argh, just noticed I messed that up, it should read:
>
> + /*
> + * Make sure stime doesn't go backwards; this preserves monotonicity
> + * for utime because rtime is monotonic.
> + *
> + * utime_i+1 = rtime_i+1 - stime_i
> + * = rtime_i+1 - (rtime_i - utime_i)
> + * = (rtime_i+1 - rtime_i) + utime_i
> + * >= utime_i
> + */
>
> I got some [us] confusion. Typing is hard.
>
> So we compute: utime = rtime - stime, which we'll denote as:
>
> utime_i+1 = rtime_i+1 - stime_i
>
> since: stime_i + utime_i = rtime_i, we can write: stime_i = rtime_i -
> utime_i and substitute in the above:
>
> = rtime_i+1 - (rtime_i - utime_i)
>
> Rearrange terms:
>
> = (rtime_i+1 - rtime_i) + utime_i
>
> And since we have: rtime_i+1 >= rtime_i, which we can write as:
> rtime_i+1 - rtime_i >= 0, substitution gives:
>
> >= utime_i
>
> And we've proven that the new utime must be equal or greater than the
> previous utime, because rtime is monotonic.
--
/Fredrik
On Tue, Jul 07, 2015 at 10:09:13AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 09:59:54AM +0200, Peter Zijlstra wrote:
> > > > + /*
> > > > + * Make sure stime doesn't go backwards; this preserves monotonicity
> > > > + * for utime because rtime is monotonic.
> > > > + *
> > > > + * utime_i+1 = rtime_i+1 - stime_i
> > > > + * = rtime_i+1 - (rtime_i - stime_i)
> > > > + * = (rtime_i+1 - rtime_i) + stime_i
> > > > + * >= stime_i
> > > > + */
>
> Argh, just noticed I messed that up, it should read:
>
> + /*
> + * Make sure stime doesn't go backwards; this preserves monotonicity
> + * for utime because rtime is monotonic.
> + *
> + * utime_i+1 = rtime_i+1 - stime_i
> + * = rtime_i+1 - (rtime_i - utime_i)
> + * = (rtime_i+1 - rtime_i) + utime_i
> + * >= utime_i
> + */
>
> I got some [us] confusion. Typing is hard.
>
> So we compute: utime = rtime - stime, which we'll denote as:
>
> utime_i+1 = rtime_i+1 - stime_i
But I don't get how you come to that.
Imagine the following rounds:
utime:2 stime:2 rtime:4 --> prev->utime = 2 prev->stime = 2
utime:2 stime:6 rtime:8 --> prev->utime = 2 prev->stime = 6
So here if I apply your above formula we have:
utime_i+1:2 = rtime_i+1:8 - stime_i:2
Which doesn't work, so probably I still misunderstand those _i things...
On Tue, Jul 07, 2015 at 03:34:22PM +0200, Frederic Weisbecker wrote:
> Imagine the following rounds:
>
> utime:2 stime:2 rtime:4 --> prev->utime = 2 prev->stime = 2
>
> utime:2 stime:6 rtime:8 --> prev->utime = 2 prev->stime = 6
>
> So here if I apply your above formula we have:
>
> utime_i+1:2 = rtime_i+1:8 - stime_i:2
>
> Which doesn't work, so probably I still misunderstand those _i things...
Yes :-)
So it's an iterative definition, but a partial one; remember, this is for
the case where we preserve stime monotonicity. In your example we
clearly do not take this branch.
I'll try to elucidate by giving the full function (either that or I'll
confuse you more still). Let's define the whole thing as:
{stime, utime}_i+1 = F(rtime_i+1, {ssamples, usamples}_i+1, {stime, utime}_i)
with the constraints:
rtime_i+1 >= rtime_i
providing:
stime + utime == rtime,
stime_i+1 >= stime_i,
utime_i+1 >= utime_i
That is an iterative function computing the new state: stime_i+1,
utime_i+1, from the new input: rtime_i+1, ssamples_i+1, usamples_i+1 and
the old state: stime_i, utime_i.
This function has a bunch of cases; the trivial ones (omitting the
subscript when they're all the same):
A) stime = 0, utime = rtime ; when ssamples == 0
B) utime = 0, stime = rtime ; when usamples == 0
And the complex ones:
sfrac = ssamples * rtime / (ssamples + usamples)
C) stime_i+1 = max(stime_i, sfrac_i+1) ; when rtime_i+1 - max(stime_i, sfrac_i+1) >= utime_i
utime_i+1 = rtime_i+1 - stime_i+1
D) stime_i+1 = rtime_i+1 - utime_i ; when rtime_i+1 - max(stime_i, sfrac_i+1) < utime_i
utime_i+1 = utime_i
Note that we can further split C:
C1) stime_i+1 = stime_i ; when sfrac_i+1 < stime_i && ...
utime_i+1 = rtime_i+1 - stime_i
C2) stime_i+1 = sfrac_i+1 ; when sfrac_i+1 >= stime_i && ...
utime_i+1 = rtime_i+1 - sfrac_i+1
This gives us a total of 5 cases, each selected purely on input.
Now, in your case, you end up in C2, because we advance stime but do not
need to guard utime. In that case we have a different formula for
utime_i+1 -- therefore your application of the formula was wrong, hence
the wrong result.
And the proof given was for C1, which in turn is analogous to the proof
(not given) for D.
The proof for C2 should be evident at this point (stime is advanced,
otherwise C1 and utime is advanced, otherwise D).
Did that help -- or did I hopelessly confuse you?
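To tie the cases back to code, here is a small userspace model of F
(illustrative only: struct state, F() and main() are made up for the example,
cputime is a plain integer and sfrac uses a straight 64-bit division instead
of scale_stime()):

#include <stdio.h>

typedef unsigned long long u64;

struct state { u64 stime, utime; };     /* stime_i, utime_i */

/* Model of F: computes {stime, utime}_i+1 from rtime_i+1, the new samples
 * and the old state, covering cases A, B, C1, C2 and D from above. */
static void F(struct state *prev, u64 rtime, u64 ssamples, u64 usamples)
{
        u64 stime, utime, sfrac;

        if (usamples == 0) {            /* case B */
                prev->stime = rtime;
                prev->utime = 0;
                return;
        }
        if (ssamples == 0) {            /* case A */
                prev->utime = rtime;
                prev->stime = 0;
                return;
        }

        sfrac = ssamples * rtime / (ssamples + usamples);

        stime = sfrac;
        if (stime < prev->stime)        /* C1 vs C2: take max(stime_i, sfrac_i+1) */
                stime = prev->stime;
        utime = rtime - stime;

        if (utime < prev->utime) {      /* case D: advance utime instead */
                utime = prev->utime;
                stime = rtime - utime;
        }

        prev->stime = stime;            /* store stime_i+1, utime_i+1 */
        prev->utime = utime;
}

int main(void)
{
        struct state s = { 0, 0 };

        /* the two rounds from the example above; both land in C2 */
        F(&s, 4, 2, 2);
        printf("stime=%llu utime=%llu\n", s.stime, s.utime);   /* 2 2 */
        F(&s, 8, 6, 2);
        printf("stime=%llu utime=%llu\n", s.stime, s.utime);   /* 6 2 */
        return 0;
}

Running it on those two rounds gives (2,2) and then (6,2), matching the
prev->utime/prev->stime values in the example.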
On Tue, Jul 07, 2015 at 02:10:56PM +0200, Fredrik Markström wrote:
> Just to let you know, I've tested your last patch and it solves all my
> original problems (which it should since the code is functionally
> equivalent).
Awesome, I'll add:
Tested-by: Fredrik Markstrom <[email protected]>
On Tue, Jul 07, 2015 at 05:34:13PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 03:34:22PM +0200, Frederic Weisbecker wrote:
> > Imagine the following rounds:
> >
> > utime:2 stime:2 rtime:4 --> prev->utime = 2 prev->stime = 2
> >
> > utime:2 stime:6 rtime:8 --> prev->utime = 2 prev->stime = 6
> >
> > So here if I apply your above formula we have:
> >
> > utime_i+1:2 = rtime_i+1:8 - stime_i:2
> >
> > Which doesn't work, so probably I still misunderstand those _i things...
>
> Yes :-)
>
> So it's an iterative definition, but a partial one; remember, this is for
> the case where we preserve stime monotonicity. In your example we
> clearly do not take this branch.
>
> I'll try to elucidate by giving the full function (either that or I'll
> confuse you more still). Let's define the whole thing as:
>
> {stime, utime}_i+1 = F(rtime_i+1, {ssamples, usamples}_i+1, {stime, utime}_i)
>
> with the constraints:
>
> rtime_i+1 >= rtime_i
>
> providing:
>
> stime + utime == rtime,
> stime_i+1 >= stime_i,
> utime_i+1 >= utime_i
>
> That is an iterative function computing the new state: stime_i+1,
> utime_i+1, from the new input: rtime_i+1, ssamples_i+1, usamples_i+1 and
> the old state: stime_i, utime_i.
>
> This function has a bunch of cases; the trivial ones (omitting the
> subscript when they're all the same):
>
> A) stime = 0, utime = rtime ; when ssamples == 0
> B) utime = 0, stime = rtime ; when usamples == 0
>
> And the complex ones:
>
> sfrac = ssamples * rtime / (ssamples + usamples)
>
> C) stime_i+1 = max(stime_i, sfrac_i+1) ; when rtime_i+1 - max(stime_i, sfrac_i+1) >= utime_i
> utime_i+1 = rtime_i+1 - stime_i+1
>
> D) stime_i+1 = rtime_i+1 - utime_i ; when rtime_i+1 - max(stime_i, sfrac_i+1) < utime_i
> utime_i+1 = utime_i
>
> Note that we can further split C:
>
> C1) stime_i+1 = stime_i ; when sfrac_i+1 < stime_i && ...
> utime_i+1 = rtime_i+1 - stime_i
>
> C2) stime_i+1 = sfrac_i+1 ; when sfrac_i+1 >= stime_i && ...
> utime_i+1 = rtime_i+1 - sfrac_i+1
>
> This gives us a total of 5 cases, each selected purely on input.
Alright, when you put it that way it makes perfect sense!
> Now, in your case, you end up in C2, because we advance stime but do not
> need to guard utime. In that case we have a different formula for
> utime_i+1 -- therefore your application of the formula was wrong, hence
> the wrong result.
Indeed!
> And the proof given was for C1, which in turn is analogous to the proof
> (not given) for D.
>
> The proof for C2 should be evident at this point (stime is advanced,
> otherwise C1 and utime is advanced, otherwise D).
>
> Did that help -- or did I hopelessly confuse you?
Makes perfect sense now! Thanks for your patience! :-)