2013-08-06 13:23:48

by Lei Wen

[permalink] [raw]
Subject: false nr_running check in load balance?

Hi Paul,

I notice in load_balance function, it would check busiest->nr_running
to decide whether to perform the real task movement.

But in some case, I saw the nr_running is not matching with
the task in the queue, which seems make scheduler to do many redundant
checking.
What I means is like there is only one task in the queue, but nr_running
shows it has two. So if that task cannot be moved, it would be still checked
for twice.

With further checking, I find there is one patch you submit before:
commit 953bfcd10e6f3697233e8e5128c611d275da39c1
Author: Paul Turner <[email protected]>
Date: Thu Jul 21 09:43:27 2011 -0700

sched: Implement hierarchical task accounting for SCHED_OTHER

In this patch, you increase nr_running when enqueue enqueue_task_stop,
which is the reason nr_running is increase while task not be increased.
It is true at that time, the stopper has been waken up and enqueue again
into cpu, and do the migration job. So the logic should be right there.

My question is whether we could change the judgment into cfs_rq->nr_running?
Since the load_balance is only for cfs, right?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bb456f4..ffc0d35 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5096,7 +5096,7 @@ redo:
schedstat_add(sd, lb_imbalance[idle], env.imbalance);

ld_moved = 0;
- if (busiest->nr_running > 1) {
+ if (busiest->cfs.nr_running > 1) {
/*
* Attempt to move tasks. If find_busiest_group has found
* an imbalance but busiest->nr_running <= 1, the group is

Thanks,
Lei


2013-08-12 14:43:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: false nr_running check in load balance?

On Tue, Aug 06, 2013 at 09:23:46PM +0800, Lei Wen wrote:
> Hi Paul,
>
> I notice in load_balance function, it would check busiest->nr_running
> to decide whether to perform the real task movement.
>
> But in some case, I saw the nr_running is not matching with
> the task in the queue, which seems make scheduler to do many redundant
> checking.
> What I means is like there is only one task in the queue, but nr_running
> shows it has two. So if that task cannot be moved, it would be still checked
> for twice.
>
> With further checking, I find there is one patch you submit before:
> commit 953bfcd10e6f3697233e8e5128c611d275da39c1
> Author: Paul Turner <[email protected]>
> Date: Thu Jul 21 09:43:27 2011 -0700
>
> sched: Implement hierarchical task accounting for SCHED_OTHER
>
> In this patch, you increase nr_running when enqueue enqueue_task_stop,
> which is the reason nr_running is increase while task not be increased.
> It is true at that time, the stopper has been waken up and enqueue again
> into cpu, and do the migration job. So the logic should be right there.
>
> My question is whether we could change the judgment into cfs_rq->nr_running?
> Since the load_balance is only for cfs, right?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bb456f4..ffc0d35 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5096,7 +5096,7 @@ redo:
> schedstat_add(sd, lb_imbalance[idle], env.imbalance);
>
> ld_moved = 0;
> - if (busiest->nr_running > 1) {
> + if (busiest->cfs.nr_running > 1) {
> /*
> * Attempt to move tasks. If find_busiest_group has found
> * an imbalance but busiest->nr_running <= 1, the group is
>

Not quite right; I think you need busiest->cfs.h_nr_running.
cfs.nr_running is the number of entries running in this 'group'. If
you've got nested groups like:

'root'
\
'A'
/ \
t1 t2

root.nr_running := 1 'A', even though you've got multiple running tasks.

2013-08-13 04:45:15

by Lei Wen

[permalink] [raw]
Subject: Re: false nr_running check in load balance?

Peter,

On Mon, Aug 12, 2013 at 10:43 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Aug 06, 2013 at 09:23:46PM +0800, Lei Wen wrote:
>> Hi Paul,
>>
>> I notice in load_balance function, it would check busiest->nr_running
>> to decide whether to perform the real task movement.
>>
>> But in some case, I saw the nr_running is not matching with
>> the task in the queue, which seems make scheduler to do many redundant
>> checking.
>> What I means is like there is only one task in the queue, but nr_running
>> shows it has two. So if that task cannot be moved, it would be still checked
>> for twice.
>>
>> With further checking, I find there is one patch you submit before:
>> commit 953bfcd10e6f3697233e8e5128c611d275da39c1
>> Author: Paul Turner <[email protected]>
>> Date: Thu Jul 21 09:43:27 2011 -0700
>>
>> sched: Implement hierarchical task accounting for SCHED_OTHER
>>
>> In this patch, you increase nr_running when enqueue enqueue_task_stop,
>> which is the reason nr_running is increase while task not be increased.
>> It is true at that time, the stopper has been waken up and enqueue again
>> into cpu, and do the migration job. So the logic should be right there.
>>
>> My question is whether we could change the judgment into cfs_rq->nr_running?
>> Since the load_balance is only for cfs, right?
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index bb456f4..ffc0d35 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5096,7 +5096,7 @@ redo:
>> schedstat_add(sd, lb_imbalance[idle], env.imbalance);
>>
>> ld_moved = 0;
>> - if (busiest->nr_running > 1) {
>> + if (busiest->cfs.nr_running > 1) {
>> /*
>> * Attempt to move tasks. If find_busiest_group has found
>> * an imbalance but busiest->nr_running <= 1, the group is
>>
>
> Not quite right; I think you need busiest->cfs.h_nr_running.
> cfs.nr_running is the number of entries running in this 'group'. If
> you've got nested groups like:
>
> 'root'
> \
> 'A'
> / \
> t1 t2
>
> root.nr_running := 1 'A', even though you've got multiple running tasks.
>

You're absolutely right for this. :)
I miss it for not considering the group case...

Then do you think it is necessary to do below change in load_balance() code?
- if (busiest->nr_running > 1) {
+ if (busiest->cfs.h_nr_running > 1) {


Thanks,
Lei

2013-08-13 07:38:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: false nr_running check in load balance?

On Tue, Aug 13, 2013 at 12:45:12PM +0800, Lei Wen wrote:
> > Not quite right; I think you need busiest->cfs.h_nr_running.
> > cfs.nr_running is the number of entries running in this 'group'. If
> > you've got nested groups like:
> >
> > 'root'
> > \
> > 'A'
> > / \
> > t1 t2
> >
> > root.nr_running := 1 'A', even though you've got multiple running tasks.
> >
>
> You're absolutely right for this. :)
> I miss it for not considering the group case...
>
> Then do you think it is necessary to do below change in load_balance() code?
> - if (busiest->nr_running > 1) {
> + if (busiest->cfs.h_nr_running > 1) {
>

Yes I think that would be fine.

2013-08-13 08:08:52

by Paul Turner

[permalink] [raw]
Subject: Re: false nr_running check in load balance?

On Tue, Aug 13, 2013 at 12:38 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Aug 13, 2013 at 12:45:12PM +0800, Lei Wen wrote:
>> > Not quite right; I think you need busiest->cfs.h_nr_running.
>> > cfs.nr_running is the number of entries running in this 'group'. If
>> > you've got nested groups like:
>> >
>> > 'root'
>> > \
>> > 'A'
>> > / \
>> > t1 t2
>> >
>> > root.nr_running := 1 'A', even though you've got multiple running tasks.
>> >
>>
>> You're absolutely right for this. :)
>> I miss it for not considering the group case...
>>
>> Then do you think it is necessary to do below change in load_balance() code?
>> - if (busiest->nr_running > 1) {
>> + if (busiest->cfs.h_nr_running > 1) {
>>
>
> Yes I think that would be fine.

If we pivot to use h_nr_running we should probably also update
call-sites such as cpu_load_avg_per_task() for consistency.

2013-08-13 08:18:09

by Lei Wen

[permalink] [raw]
Subject: Re: false nr_running check in load balance?

Hi Paul,

On Tue, Aug 13, 2013 at 4:08 PM, Paul Turner <[email protected]> wrote:
> On Tue, Aug 13, 2013 at 12:38 AM, Peter Zijlstra <[email protected]> wrote:
>> On Tue, Aug 13, 2013 at 12:45:12PM +0800, Lei Wen wrote:
>>> > Not quite right; I think you need busiest->cfs.h_nr_running.
>>> > cfs.nr_running is the number of entries running in this 'group'. If
>>> > you've got nested groups like:
>>> >
>>> > 'root'
>>> > \
>>> > 'A'
>>> > / \
>>> > t1 t2
>>> >
>>> > root.nr_running := 1 'A', even though you've got multiple running tasks.
>>> >
>>>
>>> You're absolutely right for this. :)
>>> I miss it for not considering the group case...
>>>
>>> Then do you think it is necessary to do below change in load_balance() code?
>>> - if (busiest->nr_running > 1) {
>>> + if (busiest->cfs.h_nr_running > 1) {
>>>
>>
>> Yes I think that would be fine.
>
> If we pivot to use h_nr_running we should probably also update
> call-sites such as cpu_load_avg_per_task() for consistency.

I didn't find cpu_load_avg_per_task in the latest linux git...
Is it a new patch pending while not being submitted?

Thanks,
Lei

2013-08-13 09:26:19

by Paul Turner

[permalink] [raw]
Subject: Re: false nr_running check in load balance?

On Tue, Aug 13, 2013 at 1:18 AM, Lei Wen <[email protected]> wrote:
> Hi Paul,
>
> On Tue, Aug 13, 2013 at 4:08 PM, Paul Turner <[email protected]> wrote:
>> On Tue, Aug 13, 2013 at 12:38 AM, Peter Zijlstra <[email protected]> wrote:
>>> On Tue, Aug 13, 2013 at 12:45:12PM +0800, Lei Wen wrote:
>>>> > Not quite right; I think you need busiest->cfs.h_nr_running.
>>>> > cfs.nr_running is the number of entries running in this 'group'. If
>>>> > you've got nested groups like:
>>>> >
>>>> > 'root'
>>>> > \
>>>> > 'A'
>>>> > / \
>>>> > t1 t2
>>>> >
>>>> > root.nr_running := 1 'A', even though you've got multiple running tasks.
>>>> >
>>>>
>>>> You're absolutely right for this. :)
>>>> I miss it for not considering the group case...
>>>>
>>>> Then do you think it is necessary to do below change in load_balance() code?
>>>> - if (busiest->nr_running > 1) {
>>>> + if (busiest->cfs.h_nr_running > 1) {
>>>>
>>>
>>> Yes I think that would be fine.
>>
>> If we pivot to use h_nr_running we should probably also update
>> call-sites such as cpu_load_avg_per_task() for consistency.
>
> I didn't find cpu_load_avg_per_task in the latest linux git...
> Is it a new patch pending while not being submitted?

Transposition typo: cpu_avg_load_per_task()
More generally: Most things that examine ->nr_running in the fair
load-balance path.

>
> Thanks,
> Lei

2013-08-15 17:40:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: false nr_running check in load balance?

On Tue, Aug 13, 2013 at 01:08:17AM -0700, Paul Turner wrote:
> On Tue, Aug 13, 2013 at 12:38 AM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Aug 13, 2013 at 12:45:12PM +0800, Lei Wen wrote:
> >> > Not quite right; I think you need busiest->cfs.h_nr_running.
> >> > cfs.nr_running is the number of entries running in this 'group'. If
> >> > you've got nested groups like:
> >> >
> >> > 'root'
> >> > \
> >> > 'A'
> >> > / \
> >> > t1 t2
> >> >
> >> > root.nr_running := 1 'A', even though you've got multiple running tasks.

One thing though; doesn't h_nr_running over count the number of tasks?
That is, doesn't it count the runnable entities so the above case would
give root.h_nr_running := 3, where we would only have 2 runnable tasks.

Double check this and be careful when doing the conversion.

2013-08-15 18:24:26

by Paul Turner

[permalink] [raw]
Subject: Re: false nr_running check in load balance?

On Thu, Aug 15, 2013 at 10:39 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Aug 13, 2013 at 01:08:17AM -0700, Paul Turner wrote:
>> On Tue, Aug 13, 2013 at 12:38 AM, Peter Zijlstra <[email protected]> wrote:
>> > On Tue, Aug 13, 2013 at 12:45:12PM +0800, Lei Wen wrote:
>> >> > Not quite right; I think you need busiest->cfs.h_nr_running.
>> >> > cfs.nr_running is the number of entries running in this 'group'. If
>> >> > you've got nested groups like:
>> >> >
>> >> > 'root'
>> >> > \
>> >> > 'A'
>> >> > / \
>> >> > t1 t2
>> >> >
>> >> > root.nr_running := 1 'A', even though you've got multiple running tasks.
>
> One thing though; doesn't h_nr_running over count the number of tasks?
> That is, doesn't it count the runnable entities so the above case would
> give root.h_nr_running := 3, where we would only have 2 runnable tasks.
>
> Double check this and be careful when doing the conversion.

This should be ok: it's accounted like rq->nr_running, not cfs_rq->nr_running.
Specifically: both only account tasks; group-entities do not contribute.

The fact that this distinction exists, despite the very similar names
is unfortunate.
We could consider renaming to h_nr_{running_,}tasks for clarity.
The same applies to rq->nr_running, although that would involve more churn.

2013-08-15 18:39:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: false nr_running check in load balance?

On Thu, Aug 15, 2013 at 11:23:53AM -0700, Paul Turner wrote:
> On Thu, Aug 15, 2013 at 10:39 AM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Aug 13, 2013 at 01:08:17AM -0700, Paul Turner wrote:
> >> On Tue, Aug 13, 2013 at 12:38 AM, Peter Zijlstra <[email protected]> wrote:
> >> > On Tue, Aug 13, 2013 at 12:45:12PM +0800, Lei Wen wrote:
> >> >> > Not quite right; I think you need busiest->cfs.h_nr_running.
> >> >> > cfs.nr_running is the number of entries running in this 'group'. If
> >> >> > you've got nested groups like:
> >> >> >
> >> >> > 'root'
> >> >> > \
> >> >> > 'A'
> >> >> > / \
> >> >> > t1 t2
> >> >> >
> >> >> > root.nr_running := 1 'A', even though you've got multiple running tasks.
> >
> > One thing though; doesn't h_nr_running over count the number of tasks?
> > That is, doesn't it count the runnable entities so the above case would
> > give root.h_nr_running := 3, where we would only have 2 runnable tasks.
> >
> > Double check this and be careful when doing the conversion.
>
> This should be ok: it's accounted like rq->nr_running, not cfs_rq->nr_running.
> Specifically: both only account tasks; group-entities do not contribute.

Ah, ok. I should have looked at the code I guess... :-)

> The fact that this distinction exists, despite the very similar names
> is unfortunate.
> We could consider renaming to h_nr_{running_,}tasks for clarity.
> The same applies to rq->nr_running, although that would involve more churn.

Yah.. that would clarify, although longer variable names will also get
us into more line-breaks I'm sure.

Lets keep it as is. Maybe a comment somewhere would be enough.

2013-08-18 09:12:40

by Lei Wen

[permalink] [raw]
Subject: Re: false nr_running check in load balance?

Paul,

On Tue, Aug 13, 2013 at 5:25 PM, Paul Turner <[email protected]> wrote:
> On Tue, Aug 13, 2013 at 1:18 AM, Lei Wen <[email protected]> wrote:
>> Hi Paul,
>>
>> On Tue, Aug 13, 2013 at 4:08 PM, Paul Turner <[email protected]> wrote:
>>> On Tue, Aug 13, 2013 at 12:38 AM, Peter Zijlstra <[email protected]> wrote:
>>>> On Tue, Aug 13, 2013 at 12:45:12PM +0800, Lei Wen wrote:
>>>>> > Not quite right; I think you need busiest->cfs.h_nr_running.
>>>>> > cfs.nr_running is the number of entries running in this 'group'. If
>>>>> > you've got nested groups like:
>>>>> >
>>>>> > 'root'
>>>>> > \
>>>>> > 'A'
>>>>> > / \
>>>>> > t1 t2
>>>>> >
>>>>> > root.nr_running := 1 'A', even though you've got multiple running tasks.
>>>>> >
>>>>>
>>>>> You're absolutely right for this. :)
>>>>> I miss it for not considering the group case...
>>>>>
>>>>> Then do you think it is necessary to do below change in load_balance() code?
>>>>> - if (busiest->nr_running > 1) {
>>>>> + if (busiest->cfs.h_nr_running > 1) {
>>>>>
>>>>
>>>> Yes I think that would be fine.
>>>
>>> If we pivot to use h_nr_running we should probably also update
>>> call-sites such as cpu_load_avg_per_task() for consistency.
>>
>> I didn't find cpu_load_avg_per_task in the latest linux git...
>> Is it a new patch pending while not being submitted?
>
> Transposition typo: cpu_avg_load_per_task()
> More generally: Most things that examine ->nr_running in the fair
> load-balance path.
>

I see...

I have submitted several patches, which covers cpu_avg_load_per_task.
Please help to check them.

Thanks,
Lei