We set rq->skip_clock_update = 1 based on the assumption that the
next call to update_rq_clock() will come nearly immediately
after the flag is set. However, that is not always true, especially
in non-preempt mode. In that case we may miss clock updates, which
leads to incorrect curr->sum_exec_runtime accounting.
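For reference, update_rq_clock() simply returns while the flag is set, so
rq->clock (and the clock_task value used for runtime accounting) stops
advancing until schedule() finally clears the flag; the whole stalled delta
is then charged to whoever runs next. Paraphrasing the consumer side (not
the exact source, just to illustrate):

void update_rq_clock(struct rq *rq)
{
	s64 delta;

	/* a back-to-back update is expected, skip this one */
	if (rq->skip_clock_update > 0)
		return;

	delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
	rq->clock += delta;
	update_rq_clock_task(rq, delta);
}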
The test result below shows that test_kthread's exec_runtime gets
charged to the watchdog threads.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
28 root RT 0 0 0 0 S 100 0.0 0:05.39 5 watchdog/5
7 root RT 0 0 0 0 S 95 0.0 0:05.83 0 watchdog/0
12 root RT 0 0 0 0 S 94 0.0 0:05.79 1 watchdog/1
16 root RT 0 0 0 0 S 92 0.0 0:05.74 2 watchdog/2
20 root RT 0 0 0 0 S 91 0.0 0:05.71 3 watchdog/3
24 root RT 0 0 0 0 S 82 0.0 0:05.42 4 watchdog/4
32 root RT 0 0 0 0 S 79 0.0 0:05.35 6 watchdog/6
5200 root 20 0 0 0 0 R 21 0.0 0:08.88 6 test_kthread/6
5194 root 20 0 0 0 0 R 20 0.0 0:08.41 0 test_kthread/0
5195 root 20 0 0 0 0 R 20 0.0 0:08.44 1 test_kthread/1
5196 root 20 0 0 0 0 R 20 0.0 0:08.49 2 test_kthread/2
5197 root 20 0 0 0 0 R 20 0.0 0:08.53 3 test_kthread/3
5198 root 20 0 0 0 0 R 19 0.0 0:08.81 4 test_kthread/4
5199 root 20 0 0 0 0 R 2 0.0 0:08.66 5 test_kthread/5
"test_kthread/i" is a kernel thread which has a infinity loop and it calls
schedule() every 1s. It's main process as below:
static int main_loop(void *unused)
{
	unsigned long last = jiffies;
	int i;

	while (!kthread_should_stop()) {
		/* call schedule() every 1 sec */
		if (HZ <= jiffies - last) {
			last = jiffies;
			schedule();
		}

		/* do some work */
		for (i = 0; i < 1000; i++)
			;

		if (kthread_should_stop())
			break;
	}

	return 0;
}
With this patch, we do not skip the clock update when the current task
is a kernel thread in non-preempt mode.
Reported-by: Zhang Hang <[email protected]>
Signed-off-by: Xie XiuQi <[email protected]>
---
kernel/sched/core.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8b3350..018dc43 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -970,8 +970,19 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 	 * A queue event has occurred, and we're going to schedule. In
 	 * this case, we can save a useless back to back clock update.
 	 */
+#ifdef CONFIG_PREEMPT
 	if (rq->curr->on_rq && test_tsk_need_resched(rq->curr))
 		rq->skip_clock_update = 1;
+#else
+	/*
+	 * In non-preempt mode, a kernel thread may run for a long time
+	 * until it schedules out by itself. In that case, we need to
+	 * update the rq clock when schedule() is called; otherwise we
+	 * might miss rq clock updates for a long time.
+	 */
+	if (rq->curr->on_rq && test_tsk_need_resched(rq->curr) && rq->curr->mm)
+		rq->skip_clock_update = 1;
+#endif
 }
 
 static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
--
1.7.6.1
On Mon, 2013-07-01 at 14:45 +0800, Xie XiuQi wrote:
> We setting clock_skip_update = 1 based on the assumption that the
> next call to update_rq_clock() will come nearly immediately
> after being set. However, it is not always true especially on
> non-preempt mode. In this case we may miss some clock update, which
> would cause an error curr->sum_exec_runtime account.
>
> The test result show that test_kthread's exec_runtime has been
> added to watchdog.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> 28 root RT 0 0 0 0 S 100 0.0 0:05.39 5 watchdog/5
> 7 root RT 0 0 0 0 S 95 0.0 0:05.83 0 watchdog/0
> 12 root RT 0 0 0 0 S 94 0.0 0:05.79 1 watchdog/1
> 16 root RT 0 0 0 0 S 92 0.0 0:05.74 2 watchdog/2
> 20 root RT 0 0 0 0 S 91 0.0 0:05.71 3 watchdog/3
> 24 root RT 0 0 0 0 S 82 0.0 0:05.42 4 watchdog/4
> 32 root RT 0 0 0 0 S 79 0.0 0:05.35 6 watchdog/6
> 5200 root 20 0 0 0 0 R 21 0.0 0:08.88 6 test_kthread/6
> 5194 root 20 0 0 0 0 R 20 0.0 0:08.41 0 test_kthread/0
> 5195 root 20 0 0 0 0 R 20 0.0 0:08.44 1 test_kthread/1
> 5196 root 20 0 0 0 0 R 20 0.0 0:08.49 2 test_kthread/2
> 5197 root 20 0 0 0 0 R 20 0.0 0:08.53 3 test_kthread/3
> 5198 root 20 0 0 0 0 R 19 0.0 0:08.81 4 test_kthread/4
> 5199 root 20 0 0 0 0 R 2 0.0 0:08.66 5 test_kthread/5
>
> "test_kthread/i" is a kernel thread which has a infinity loop and it calls
> schedule() every 1s. It's main process as below:
It'd be a shame to lose the cycle savings (we could use more) due to
such horrible behavior. Where are you seeing this in real life?
That said, accounting funnies induced by skipped update are possible,
which could trump the cycle savings I suppose, so maybe savings (sniff)
should just go away?
> static int main_loop (void *unused)
> {
> unsigned long flags;
> unsigned long last = jiffies;
> int i;
>
> while (!kthread_should_stop()) {
> /* call schedule every 1 sec */
> if (HZ <= jiffies - last) {
> last = jiffies;
> schedule();
> }
>
> /* do some thing */
> for (i = 0; i < 1000; i++)
> ;
>
> if (kthread_should_stop())
> break;
> }
> }
>
> In this patch, we do not skip clock update when current task is kernel
> thread in non-preempt mode.
>
> Reported-by: Zhang Hang <[email protected]>
> Signed-off-by: Xie XiuQi <[email protected]>
> ---
> kernel/sched/core.c | 11 +++++++++++
> 1 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e8b3350..018dc43 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -970,8 +970,19 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
> * A queue event has occurred, and we're going to schedule. In
> * this case, we can save a useless back to back clock update.
> */
> +#ifdef CONFIG_PREEMPT
> if (rq->curr->on_rq && test_tsk_need_resched(rq->curr))
> rq->skip_clock_update = 1;
> +#else
> + /*
> + * In non-preempt mode, a kernel thread may run for a long time
> + * until been scheduled out by itself. In this cace, we need update
> + * rq clock when calling schedule function, otherwise, we might
> + * miss rq clock update for a long time.
> + */
> + if (rq->curr->on_rq && test_tsk_need_resched(rq->curr) && rq->curr->mm)
> + rq->skip_clock_update = 1;
> +#endif
> }
>
> static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
On 2013/7/1 15:36, Mike Galbraith wrote:
> On Mon, 2013-07-01 at 14:45 +0800, Xie XiuQi wrote:
>> We setting clock_skip_update = 1 based on the assumption that the
>> next call to update_rq_clock() will come nearly immediately
>> after being set. However, it is not always true especially on
>> non-preempt mode. In this case we may miss some clock update, which
>> would cause an error curr->sum_exec_runtime account.
>>
>> The test result show that test_kthread's exec_runtime has been
>> added to watchdog.
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
>> 28 root RT 0 0 0 0 S 100 0.0 0:05.39 5 watchdog/5
>> 7 root RT 0 0 0 0 S 95 0.0 0:05.83 0 watchdog/0
>> 12 root RT 0 0 0 0 S 94 0.0 0:05.79 1 watchdog/1
>> 16 root RT 0 0 0 0 S 92 0.0 0:05.74 2 watchdog/2
>> 20 root RT 0 0 0 0 S 91 0.0 0:05.71 3 watchdog/3
>> 24 root RT 0 0 0 0 S 82 0.0 0:05.42 4 watchdog/4
>> 32 root RT 0 0 0 0 S 79 0.0 0:05.35 6 watchdog/6
>> 5200 root 20 0 0 0 0 R 21 0.0 0:08.88 6 test_kthread/6
>> 5194 root 20 0 0 0 0 R 20 0.0 0:08.41 0 test_kthread/0
>> 5195 root 20 0 0 0 0 R 20 0.0 0:08.44 1 test_kthread/1
>> 5196 root 20 0 0 0 0 R 20 0.0 0:08.49 2 test_kthread/2
>> 5197 root 20 0 0 0 0 R 20 0.0 0:08.53 3 test_kthread/3
>> 5198 root 20 0 0 0 0 R 19 0.0 0:08.81 4 test_kthread/4
>> 5199 root 20 0 0 0 0 R 2 0.0 0:08.66 5 test_kthread/5
>>
>> "test_kthread/i" is a kernel thread which has a infinity loop and it calls
>> schedule() every 1s. It's main process as below:
>
> It'd be a shame to lose the cycle savings (we could use more) due to
> such horrible behavior. Where are you seeing this in real life?
>
Thank you for your comments, Mike.
This issue was reported against a PCIe-related driver in which a kthread
sends huge amounts of data. In non-preempt mode it can occupy a CPU for a
long time; in preempt mode I haven't seen this issue yet.
Here is the kthread's main logic. Although it's not a good design, it does
exist:
	while (!kthread_should_stop()) {
		/* call schedule every 1 sec */
		if (HZ <= jiffies - last) {
			last = jiffies;
			schedule();
		}

		/* get data and send it */
		get_msg();
		send_msg();

		if (kthread_should_stop())
			break;
	}
> That said, accounting funnies induced by skipped update are possible,
> which could trump the cycle savings I suppose, so maybe savings (sniff)
> should just go away?
Indeed, removing skip_clock_update would resolve the issue, although I have
not seen the problem in preempt mode. On the other hand, if we remove
skip_clock_update we get more precise time accounting.
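To be concrete, I read that as simply dropping the optimization in
check_preempt_curr(), roughly like this (untested sketch, hunk context
omitted):

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ ... @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
-	/*
-	 * A queue event has occurred, and we're going to schedule. In
-	 * this case, we can save a useless back to back clock update.
-	 */
-	if (rq->curr->on_rq && test_tsk_need_resched(rq->curr))
-		rq->skip_clock_update = 1;
 }

(The skip_clock_update field and the early return in update_rq_clock()
could then go away as well.)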
So, what's your opinion, Mike?
On Mon, 2013-07-01 at 19:26 +0800, Xie XiuQi wrote:
> On 2013/7/1 15:36, Mike Galbraith wrote:
> > On Mon, 2013-07-01 at 14:45 +0800, Xie XiuQi wrote:
> >> We setting clock_skip_update = 1 based on the assumption that the
> >> next call to update_rq_clock() will come nearly immediately
> >> after being set. However, it is not always true especially on
> >> non-preempt mode. In this case we may miss some clock update, which
> >> would cause an error curr->sum_exec_runtime account.
> >>
> >> The test result show that test_kthread's exec_runtime has been
> >> added to watchdog.
> >>
> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> >> 28 root RT 0 0 0 0 S 100 0.0 0:05.39 5 watchdog/5
> >> 7 root RT 0 0 0 0 S 95 0.0 0:05.83 0 watchdog/0
> >> 12 root RT 0 0 0 0 S 94 0.0 0:05.79 1 watchdog/1
> >> 16 root RT 0 0 0 0 S 92 0.0 0:05.74 2 watchdog/2
> >> 20 root RT 0 0 0 0 S 91 0.0 0:05.71 3 watchdog/3
> >> 24 root RT 0 0 0 0 S 82 0.0 0:05.42 4 watchdog/4
> >> 32 root RT 0 0 0 0 S 79 0.0 0:05.35 6 watchdog/6
> >> 5200 root 20 0 0 0 0 R 21 0.0 0:08.88 6 test_kthread/6
> >> 5194 root 20 0 0 0 0 R 20 0.0 0:08.41 0 test_kthread/0
> >> 5195 root 20 0 0 0 0 R 20 0.0 0:08.44 1 test_kthread/1
> >> 5196 root 20 0 0 0 0 R 20 0.0 0:08.49 2 test_kthread/2
> >> 5197 root 20 0 0 0 0 R 20 0.0 0:08.53 3 test_kthread/3
> >> 5198 root 20 0 0 0 0 R 19 0.0 0:08.81 4 test_kthread/4
> >> 5199 root 20 0 0 0 0 R 2 0.0 0:08.66 5 test_kthread/5
> >>
> >> "test_kthread/i" is a kernel thread which has a infinity loop and it calls
> >> schedule() every 1s. It's main process as below:
> >
> > It'd be a shame to lose the cycle savings (we could use more) due to
> > such horrible behavior. Where are you seeing this in real life?
> >
>
> Thank you for your comments, Mike.
>
> This issue was reported by a driver related pcie in which a kthread send
> huge amounts of data. In non-preempt mode, it would take a cpu for a long
> time. But, in preempt mode, I haven't found this issue yet.
>
> Here is the kthread main logic. Although it's not a good idea, but it does
> exist:
> while (!kthread_should_stop()) {
> /* call schedule every 1 sec */
> if (HZ <= jiffies - last) {
> last = jiffies;
> schedule();
> }
>
> /* get data and sent it */
> get_msg();
> send_msg();
>
> if (kthread_should_stop())
> break;
> }
>
> > That said, accounting funnies induced by skipped update are possible,
> > which could trump the cycle savings I suppose, so maybe savings (sniff)
> > should just go away?
>
> Indeed, removing the skip_clock_update could resolve the issue, but I found
> there is no this issue in preempt mode. However, if remove skip_clock_update
> we'll get more precise time account.
>
> So, what's your opinion, Mike.
Other than take that thing out and shoot it? ;-)
It's definitely bad to hand off cycles expended to some other innocent
bystander, especially an rt task that can then trip over the throttle,
so in the safety first light, I'd say go with your fix I suppose, and
hope that other things like contention won't show up doing the same
thing. The cycle savings are nice to have, so I'd rather not just kill
that entirely, but...
Maybe Peter has a better idea.
-Mike
On Mon, Jul 01, 2013 at 02:45:04PM +0800, Xie XiuQi wrote:
> We setting clock_skip_update = 1 based on the assumption that the
> next call to update_rq_clock() will come nearly immediately
> after being set. However, it is not always true especially on
> non-preempt mode. In this case we may miss some clock update, which
> would cause an error curr->sum_exec_runtime account.
>
> The test result show that test_kthread's exec_runtime has been
> added to watchdog.
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
> 28 root RT 0 0 0 0 S 100 0.0 0:05.39 5 watchdog/5
> 7 root RT 0 0 0 0 S 95 0.0 0:05.83 0 watchdog/0
> 12 root RT 0 0 0 0 S 94 0.0 0:05.79 1 watchdog/1
> 16 root RT 0 0 0 0 S 92 0.0 0:05.74 2 watchdog/2
> 20 root RT 0 0 0 0 S 91 0.0 0:05.71 3 watchdog/3
> 24 root RT 0 0 0 0 S 82 0.0 0:05.42 4 watchdog/4
> 32 root RT 0 0 0 0 S 79 0.0 0:05.35 6 watchdog/6
> 5200 root 20 0 0 0 0 R 21 0.0 0:08.88 6 test_kthread/6
> 5194 root 20 0 0 0 0 R 20 0.0 0:08.41 0 test_kthread/0
> 5195 root 20 0 0 0 0 R 20 0.0 0:08.44 1 test_kthread/1
> 5196 root 20 0 0 0 0 R 20 0.0 0:08.49 2 test_kthread/2
> 5197 root 20 0 0 0 0 R 20 0.0 0:08.53 3 test_kthread/3
> 5198 root 20 0 0 0 0 R 19 0.0 0:08.81 4 test_kthread/4
> 5199 root 20 0 0 0 0 R 2 0.0 0:08.66 5 test_kthread/5
>
> "test_kthread/i" is a kernel thread which has a infinity loop and it calls
> schedule() every 1s. It's main process as below:
>
> static int main_loop (void *unused)
> {
> unsigned long flags;
> unsigned long last = jiffies;
> int i;
>
> while (!kthread_should_stop()) {
> /* call schedule every 1 sec */
> if (HZ <= jiffies - last) {
> last = jiffies;
> schedule();
> }
>
> /* do some thing */
> for (i = 0; i < 1000; i++)
> ;
>
> if (kthread_should_stop())
> break;
> }
> }
>
> In this patch, we do not skip clock update when current task is kernel
> thread in non-preempt mode.
>
> Reported-by: Zhang Hang <[email protected]>
> Signed-off-by: Xie XiuQi <[email protected]>
> ---
> kernel/sched/core.c | 11 +++++++++++
> 1 files changed, 11 insertions(+), 0 deletions(-)
<formletter>
This is not the correct way to submit patches for inclusion in the
stable kernel tree. Please read Documentation/stable_kernel_rules.txt
for how to do this properly.
</formletter>
On 2013/7/1 23:19, Greg KH wrote:
> On Mon, Jul 01, 2013 at 02:45:04PM +0800, Xie XiuQi wrote:
>> We setting clock_skip_update = 1 based on the assumption that the
>> next call to update_rq_clock() will come nearly immediately
>> after being set. However, it is not always true especially on
>> non-preempt mode. In this case we may miss some clock update, which
>> would cause an error curr->sum_exec_runtime account.
>>
...
>
> <formletter>
>
> This is not the correct way to submit patches for inclusion in the
> stable kernel tree. Please read Documentation/stable_kernel_rules.txt
> for how to do this properly.
Thanks Greg,
I've removed [email protected] from the recipient list now.
>
> </formletter>
>
On Mon, 2013-07-01 at 19:26 +0800, Xie XiuQi wrote:
> Here is the kthread main logic. Although it's not a good idea, but it does
> exist:
Why not fix this instead?
> while (!kthread_should_stop()) {
> /* call schedule every 1 sec */
> if (HZ <= jiffies - last) {
> last = jiffies;
> schedule();
> }
Hanging out in the kernel for ages is not cool. That doesn't mean
something else might not pop up that forces the issue, but to date it
has not, and sacrificing precious fastpath cycles is not attractive.
-Mike
Hi, Xie
On 07/01/2013 07:26 PM, Xie XiuQi wrote:
[snip]
> Here is the kthread main logic. Although it's not a good idea, but it does
> exist:
> while (!kthread_should_stop()) {
> /* call schedule every 1 sec */
> if (HZ <= jiffies - last) {
> last = jiffies;
> schedule();
> }
>
> /* get data and sent it */
> get_msg();
> send_msg();
What about using cond_resched() here? Isn't that more gentle?
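Something like this, keeping your get_msg()/send_msg() as-is (untested
sketch):

	while (!kthread_should_stop()) {
		get_msg();
		send_msg();

		/* reschedules only when need_resched is actually set */
		cond_resched();
	}

That also drops the jiffies bookkeeping, and on a non-preempt kernel the
thread gives up the CPU as soon as a resched is needed instead of holding
it for up to a second.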
Regards,
Michael Wang
>
> if (kthread_should_stop())
> break;
> }
>
>> That said, accounting funnies induced by skipped update are possible,
>> which could trump the cycle savings I suppose, so maybe savings (sniff)
>> should just go away?
>
> Indeed, removing the skip_clock_update could resolve the issue, but I found
> there is no this issue in preempt mode. However, if remove skip_clock_update
> we'll get more precise time account.
>
> So, what's your opinion, Mike.
>
>
On 2013/7/2 11:07, Mike Galbraith wrote:
> On Mon, 2013-07-01 at 19:26 +0800, Xie XiuQi wrote:
>
>> Here is the kthread main logic. Although it's not a good idea, but it does
>> exist:
>
> Why not fix this instead?
>
>> while (!kthread_should_stop()) {
>> /* call schedule every 1 sec */
>> if (HZ <= jiffies - last) {
>> last = jiffies;
>> schedule();
>> }
>
> Hanging out in the kernel for ages is not cool. That doesn't mean
> something else might not pop up that forces the issue, but to date it
> has not, and sacrificing precious fastpath cycles is not attractive.
>
That is to say, the driver code itself needs improvement.
Thank you, Mike.
> -Mike
>
On 2013/7/2 11:20, Michael Wang wrote:
> Hi, Xie
>
> On 07/01/2013 07:26 PM, Xie XiuQi wrote:
> [snip]
>> Here is the kthread main logic. Although it's not a good idea, but it does
>> exist:
>> while (!kthread_should_stop()) {
>> /* call schedule every 1 sec */
>> if (HZ <= jiffies - last) {
>> last = jiffies;
>> schedule();
>> }
>>
>> /* get data and sent it */
>> get_msg();
>> send_msg();
>
> What about use cond_resched() here? Isn't that more gentle?
>
That's a good idea for the driver implementation.
Thank you, Michael.
> Regards,
> Michael Wang
>