2014-01-21 05:35:20

by Alex Shi

Subject: [RFC PATCH 1/2] sched/update_avg: avoid negative time

rq->avg_idle tries to reflect the average idle time between the cpu going
idle and the first wakeup. But in this function it may get a negative value
if the old avg_idle is too small. That negative value would then be double
counted in the next calculation. Presumably that is not the original
purpose, so recalibrate it to zero.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 30eb011..af9121c6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1358,6 +1358,9 @@ static void update_avg(u64 *avg, u64 sample)
{
s64 diff = sample - *avg;
*avg += diff >> 3;
+
+ if (*avg < 0)
+ *avg = 0;
}
#endif

--
1.8.1.2


2014-01-21 05:35:37

by Alex Shi

Subject: [PATCH 2/2] sched: add statistic for rq->max_idle_balance_cost

It's useful to track this value in debug mode.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/debug.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1e43e70..f5c529a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -315,6 +315,7 @@ do { \
P(sched_goidle);
#ifdef CONFIG_SMP
P64(avg_idle);
+ p64(max_idle_balance_cost);
#endif

P(ttwu_count);
--
1.8.1.2
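
For background on what the value means: rq->max_idle_balance_cost caches
the worst observed cost of a newly-idle balance on that runqueue, and
avg_idle is clamped against it on wakeup. Roughly, paraphrasing the
3.13-era scheduler code (a simplified sketch, not part of this patch):

/* kernel/sched/core.c, on the first wakeup after idle */
u64 delta = rq_clock(rq) - rq->idle_stamp;
u64 max = 2 * rq->max_idle_balance_cost;

update_avg(&rq->avg_idle, delta);
if (rq->avg_idle > max)		/* don't overestimate future idle time */
	rq->avg_idle = max;

/* kernel/sched/fair.c, end of idle_balance(): remember the worst case */
if (curr_cost > this_rq->max_idle_balance_cost)
	this_rq->max_idle_balance_cost = curr_cost;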

2014-01-21 05:59:54

by Alex Shi

Subject: Re: [RFC PATCH 1/2] sched/update_avg: avoid negative time

On 01/21/2014 01:33 PM, Alex Shi wrote:
> rq->avg_idle tries to reflect the average idle time between the cpu going
> idle and the first wakeup. But in this function it may get a negative value
> if the old avg_idle is too small. That negative value would then be double
> counted in the next calculation. Presumably that is not the original
> purpose, so recalibrate it to zero.

Forget this patch; it is impossible for avg_idle to go negative.
Sorry for the noise!


--
Thanks
Alex

2014-01-21 06:49:56

by Michael wang

Subject: Re: [RFC PATCH 1/2] sched/update_avg: avoid negative time

Hi, Alex

On 01/21/2014 01:33 PM, Alex Shi wrote:
> rq->avg_idle tries to reflect the average idle time between the cpu going
> idle and the first wakeup. But in this function it may get a negative value
> if the old avg_idle is too small. That negative value would then be double
> counted in the next calculation. Presumably that is not the original
> purpose, so recalibrate it to zero.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/core.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 30eb011..af9121c6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1358,6 +1358,9 @@ static void update_avg(u64 *avg, u64 sample)
> {
> s64 diff = sample - *avg;
> *avg += diff >> 3;
> +
> + if (*avg < 0)
> + *avg = 0;

This seems like it won't happen...

if 'diff' is negative, its absolute value won't be bigger than '*avg',
not to mention we only apply 1/8 of it.

Regards,
Michael Wang

> }
> #endif
>
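
A minimal user-space sketch of the same arithmetic (illustrative only, it
just mirrors the kernel's update_avg()) makes the point concrete: since
diff = sample - *avg can never be smaller than -*avg, and only an eighth
of it is applied, the average decays toward zero but never drops below it.

#include <stdio.h>
#include <stdint.h>

/* same update rule as kernel/sched/core.c: a 1/8-weight
 * exponential moving average */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = sample - *avg;

	*avg += diff >> 3;	/* arithmetic shift, as in the kernel */
}

int main(void)
{
	uint64_t avg = 100;
	int i;

	/* worst case for underflow: feed only zero-length samples */
	for (i = 0; i < 64; i++)
		update_avg(&avg, 0);

	/* prints 0: the average bottoms out, it never wraps negative */
	printf("%llu\n", (unsigned long long)avg);
	return 0;
}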

2014-01-21 07:43:32

by Jason Low

Subject: Re: [PATCH 2/2] sched: add statistic for rq->max_idle_balance_cost

On Mon, Jan 20, 2014 at 9:33 PM, Alex Shi <[email protected]> wrote:
> It's useful to track this value in debug mode.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/debug.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 1e43e70..f5c529a 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -315,6 +315,7 @@ do { \
> P(sched_goidle);
> #ifdef CONFIG_SMP
> P64(avg_idle);
> + p64(max_idle_balance_cost);

Hi Alex,

Does this need to be P64(max_idle_balance_cost)?

2014-01-21 08:44:44

by Alex Shi

Subject: Re: [PATCH 2/2] sched: add statistic for rq->max_idle_balance_cost

On 01/21/2014 03:43 PM, Jason Low wrote:
> On Mon, Jan 20, 2014 at 9:33 PM, Alex Shi <[email protected]> wrote:
>> It's useful to track this value in debug mode.
>>
>> Signed-off-by: Alex Shi <[email protected]>
>> ---
>> kernel/sched/debug.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index 1e43e70..f5c529a 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -315,6 +315,7 @@ do { \
>> P(sched_goidle);
>> #ifdef CONFIG_SMP
>> P64(avg_idle);
>> + p64(max_idle_balance_cost);
>
> Hi Alex,
>
> Does this need to be P64(max_idle_balance_cost)?
>

Yes, thanks for the fix!

===

From 173417f3646ac1c6ac83809b51d0471591cbc8f0 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Tue, 21 Jan 2014 13:28:55 +0800
Subject: [RFC PATCH] sched: add statistic for rq->max_idle_balance_cost

It's useful to track this value in debug mode.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/debug.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1e43e70..f5c529a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -315,6 +315,7 @@ do { \
P(sched_goidle);
#ifdef CONFIG_SMP
P64(avg_idle);
+ P64(max_idle_balance_cost);
#endif

P(ttwu_count);
--
1.8.1.2



--
Thanks
Alex

2014-01-21 08:46:18

by Alex Shi

Subject: Re: [RFC PATCH 1/2] sched/update_avg: avoid negative time

On 01/21/2014 02:49 PM, Michael wang wrote:
> This seems like it won't happen...
>
> if 'diff' is negative, its absolute value won't be bigger than '*avg',
> not to mention we only apply 1/8 of it.

That's why I said 'sorry and forget this patch'. :(

--
Thanks
Alex

2014-01-22 08:24:21

by Alex Shi

Subject: Re: [PATCH 2/2] sched: add statistic for rq->max_idle_balance_cost

On 01/21/2014 03:43 PM, Jason Low wrote:
> On Mon, Jan 20, 2014 at 9:33 PM, Alex Shi <[email protected]> wrote:
>> It's useful to track this value in debug mode.
>>
>> Signed-off-by: Alex Shi <[email protected]>
>> ---
>> kernel/sched/debug.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index 1e43e70..f5c529a 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -315,6 +315,7 @@ do { \
>> P(sched_goidle);
>> #ifdef CONFIG_SMP
>> P64(avg_idle);
>> + p64(max_idle_balance_cost);
>
> Hi Alex,
>
> Does this need to be P64(max_idle_balance_cost)?
>

Oops, this mail was sitting in my Drafts folder.

Yes, thanks for the fix!

===

From 173417f3646ac1c6ac83809b51d0471591cbc8f0 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Tue, 21 Jan 2014 13:28:55 +0800
Subject: [RFC PATCH] sched: add statistic for rq->max_idle_balance_cost

It's useful to track this value in debug mode.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/debug.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1e43e70..f5c529a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -315,6 +315,7 @@ do { \
P(sched_goidle);
#ifdef CONFIG_SMP
P64(avg_idle);
+ P64(max_idle_balance_cost);
#endif
P(ttwu_count);
--
1.8.1.2



--
Thanks
Alex

2014-01-22 16:09:28

by Peter Zijlstra

Subject: Re: [PATCH 2/2] sched: add statistic for rq->max_idle_balance_cost

On Wed, Jan 22, 2014 at 04:24:13PM +0800, Alex Shi wrote:
> From: Alex Shi <[email protected]>
> Date: Tue, 21 Jan 2014 13:28:55 +0800
> Subject: [RFC PATCH] sched: add statistic for rq->max_idle_balance_cost
>
> It's useful to track this value in debug mode.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/debug.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 1e43e70..f5c529a 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -315,6 +315,7 @@ do { \
> P(sched_goidle);
> #ifdef CONFIG_SMP
> P64(avg_idle);
> + P64(max_idle_balance_cost);
> #endif
> P(ttwu_count);

Not also the per-sd value in sd_alloc_ctl_domain_table() ?

2014-01-22 18:10:14

by Jason Low

Subject: Re: [PATCH 2/2] sched: add statistic for rq->max_idle_balance_cost

On Wed, 2014-01-22 at 17:09 +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2014 at 04:24:13PM +0800, Alex Shi wrote:
> > From: Alex Shi <[email protected]>
> > Date: Tue, 21 Jan 2014 13:28:55 +0800
> > Subject: [RFC PATCH] sched: add statistic for rq->max_idle_balance_cost
> >
> > It's useful to track this value in debug mode.
> >
> > Signed-off-by: Alex Shi <[email protected]>
> > ---
> > kernel/sched/debug.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> > index 1e43e70..f5c529a 100644
> > --- a/kernel/sched/debug.c
> > +++ b/kernel/sched/debug.c
> > @@ -315,6 +315,7 @@ do { \
> > P(sched_goidle);
> > #ifdef CONFIG_SMP
> > P64(avg_idle);
> > + P64(max_idle_balance_cost);
> > #endif
> > P(ttwu_count);
>
> Not also the per-sd value in sd_alloc_ctl_domain_table() ?

Yeah, tracking the sd->max_newidle_lb_cost can also be useful.
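
For reference, the per-sd value is maintained by timing every newly-idle
balance attempt; roughly, from the 3.13-era idle_balance() in
kernel/sched/fair.c (a simplified sketch):

for_each_domain(this_cpu, sd) {
	u64 t0, domain_cost;

	/* stop once the expected idle time can't cover the balance cost */
	if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
		break;

	if (sd->flags & SD_BALANCE_NEWIDLE) {
		t0 = sched_clock_cpu(this_cpu);
		pulled_task = load_balance(this_cpu, this_rq, sd,
					   CPU_NEWLY_IDLE,
					   &continue_balancing);
		domain_cost = sched_clock_cpu(this_cpu) - t0;

		if (domain_cost > sd->max_newidle_lb_cost)
			sd->max_newidle_lb_cost = domain_cost;

		curr_cost += domain_cost;
	}
}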

2014-01-23 06:49:41

by Alex Shi

Subject: Re: [PATCH 2/2] sched: add statistic for rq->max_idle_balance_cost

On 01/23/2014 02:10 AM, Jason Low wrote:
>>> > > P64(avg_idle);
>>> > > + P64(max_idle_balance_cost);
>>> > > #endif
>>> > > P(ttwu_count);
>> >
>> > Not also the per-sd value in sd_alloc_ctl_domain_table() ?
> Yeah, tracking the sd->max_newidle_lb_cost can also be useful.
>
>

Thanks for the suggestion!

I thought sd->max_newidle_lb_cost keeps changing. But yes, it's
still meaningful.
BTW, on the pandaboard ES, the sd balance cost is about 1~2ms.

---

From d37e712b623fb6776d0caba8e4bd06460e26b84e Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Tue, 21 Jan 2014 13:28:55 +0800
Subject: [PATCH] sched: add statistic for newidle load balance cost

Tracking rq->max_idle_balance_cost and sd->max_newidle_lb_cost.
It's useful to track these values in debug mode.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 7 +++++--
kernel/sched/debug.c | 1 +
2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c3f3824..a046bc3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4293,7 +4293,7 @@ set_table_entry(struct ctl_table *entry,
static struct ctl_table *
sd_alloc_ctl_domain_table(struct sched_domain *sd)
{
- struct ctl_table *table = sd_alloc_ctl_entry(8);
+ struct ctl_table *table = sd_alloc_ctl_entry(9);

if (table == NULL)
return NULL;
@@ -4313,7 +4313,10 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd)
sizeof(int), 0644, proc_dointvec_minmax);
set_table_entry(&table[6], "name", sd->name,
CORENAME_MAX_SIZE, 0444, proc_dostring);
- /* &table[7] is terminator */
+ set_table_entry(&table[7], "max_newidle_lb_cost",
+ &sd->max_newidle_lb_cost,
+ sizeof(long), 0644, proc_doulongvec_minmax);
+ /* &table[8] is terminator */

return table;
}
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1e43e70..9ecd861 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -315,6 +315,7 @@ do { \
P(sched_goidle);
#ifdef CONFIG_SMP
P64(avg_idle);
+ P64(max_idle_balance_cost);
#endif

P(ttwu_count);
--
1.8.1.2

--
Thanks
Alex
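
On the "keeps changing" point: the recorded maximum is not monotonic.
rebalance_domains() decays it by roughly 1% per second so that old spikes
fade out; from the 3.13-era kernel/sched/fair.c (a simplified sketch):

if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
	sd->max_newidle_lb_cost =
		(sd->max_newidle_lb_cost * 253) / 256;
	sd->next_decay_max_lb_cost = jiffies + HZ;
}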

2014-01-23 09:55:19

by Peter Zijlstra

Subject: Re: [PATCH 2/2] sched: add statistic for rq->max_idle_balance_cost

On Thu, Jan 23, 2014 at 02:49:25PM +0800, Alex Shi wrote:
> On 01/23/2014 02:10 AM, Jason Low wrote:
> >>> > > P64(avg_idle);
> >>> > > + P64(max_idle_balance_cost);
> >>> > > #endif
> >>> > > P(ttwu_count);
> >> >
> >> > Not also the per-sd value in sd_alloc_ctl_domain_table() ?
> > Yeah, tracking the sd->max_newidle_lb_cost can also be useful.
> >
> >
>
> Thanks for the suggestion!
>
> I thought sd->max_newidle_lb_cost keeps changing. But yes, it's
> still meaningful.
> BTW, on the pandaboard ES, the sd balance cost is about 1~2ms.

That's an insane amount of time for only 2 cpus.

> ---
>
> From d37e712b623fb6776d0caba8e4bd06460e26b84e Mon Sep 17 00:00:00 2001
> From: Alex Shi <[email protected]>
> Date: Tue, 21 Jan 2014 13:28:55 +0800
> Subject: [PATCH] sched: add statistic for newidle load balance cost
>
> Tracking rq->max_idle_balance_cost and sd->max_newidle_lb_cost.
> It's useful to track these values in debug mode.
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/sched/core.c | 7 +++++--
> kernel/sched/debug.c | 1 +
> 2 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c3f3824..a046bc3 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4293,7 +4293,7 @@ set_table_entry(struct ctl_table *entry,
> static struct ctl_table *
> sd_alloc_ctl_domain_table(struct sched_domain *sd)
> {
> - struct ctl_table *table = sd_alloc_ctl_entry(8);
> + struct ctl_table *table = sd_alloc_ctl_entry(9);

tip/master has 13 there, not 8.
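
For anyone counting entries: sd_alloc_ctl_entry(n) hands back n zeroed
ctl_table slots, and the sysctl core stops at the first slot whose
procname is still NULL, so n must be the number of populated entries plus
one for the terminator; adding one entry therefore bumps 13 to 14.
Roughly (a simplified sketch of the helper in kernel/sched/core.c):

static struct ctl_table *sd_alloc_ctl_entry(int n)
{
	/* zeroed allocation: the unused trailing slot doubles as the
	 * NULL-procname terminator */
	return kcalloc(n, sizeof(struct ctl_table), GFP_KERNEL);
}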

2014-01-23 10:49:43

by Alex Shi

Subject: Re: [PATCH 2/2] sched: add statistic for rq->max_idle_balance_cost

On 01/23/2014 05:54 PM, Peter Zijlstra wrote:
> On Thu, Jan 23, 2014 at 02:49:25PM +0800, Alex Shi wrote:
>> On 01/23/2014 02:10 AM, Jason Low wrote:
>>>>>>> P64(avg_idle);
>>>>>>> + P64(max_idle_balance_cost);
>>>>>>> #endif
>>>>>>> P(ttwu_count);
>>>>>
>>>>> Not also the per-sd value in sd_alloc_ctl_domain_table() ?
>>> Yeah, tracking the sd->max_newidle_lb_cost can also be useful.
>>>
>>>
>>
>> Thanks for the suggestion!
>>
>> I thought sd->max_newidle_lb_cost keeps changing. But yes, it's
>> still meaningful.
>> BTW, on the pandaboard ES, the sd balance cost is about 1~2ms.
>
> That's an insane amount of time for only 2 cpus.

maybe, :(

but it is the data.
$ cat /proc/sys/kernel/sched_domain/cpu0/domain0/max_newidle_lb_cost
1873393

>
>> ---
>>

[...]
>> sd_alloc_ctl_domain_table(struct sched_domain *sd)
>> {
>> - struct ctl_table *table = sd_alloc_ctl_entry(8);
>> + struct ctl_table *table = sd_alloc_ctl_entry(9);
>
> tip/master has 13 there, not 8.
>

Sorry about this. Updated against tip/master.

-----

From c49ae4749ffa2d1f95bd0d8b3c8e6fbce8864f04 Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Thu, 23 Jan 2014 18:39:54 +0800
Subject: [PATCH] sched: add statistic for newidle load balance cost

Tracking rq->max_idle_balance_cost and sd->max_newidle_lb_cost.
It's useful to know these values in debug mode.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/sched/core.c | 9 ++++++---
kernel/sched/debug.c | 1 +
2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 36c951b..4740456 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4815,7 +4815,7 @@ set_table_entry(struct ctl_table *entry,
static struct ctl_table *
sd_alloc_ctl_domain_table(struct sched_domain *sd)
{
- struct ctl_table *table = sd_alloc_ctl_entry(13);
+ struct ctl_table *table = sd_alloc_ctl_entry(14);

if (table == NULL)
return NULL;
@@ -4843,9 +4843,12 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd)
sizeof(int), 0644, proc_dointvec_minmax, false);
set_table_entry(&table[10], "flags", &sd->flags,
sizeof(int), 0644, proc_dointvec_minmax, false);
- set_table_entry(&table[11], "name", sd->name,
+ set_table_entry(&table[11], "max_newidle_lb_cost",
+ &sd->max_newidle_lb_cost,
+ sizeof(long), 0644, proc_doulongvec_minmax, false);
+ set_table_entry(&table[12], "name", sd->name,
CORENAME_MAX_SIZE, 0444, proc_dostring, false);
- /* &table[12] is terminator */
+ /* &table[13] is terminator */

return table;
}
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index dd52e7f..7f17a38 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -321,6 +321,7 @@ do { \
P(sched_goidle);
#ifdef CONFIG_SMP
P64(avg_idle);
+ P64(max_idle_balance_cost);
#endif

P(ttwu_count);
--
1.8.1.2


--
Thanks
Alex

2014-01-23 12:32:53

by Morten Rasmussen

Subject: Re: [PATCH 2/2] sched: add statistic for rq->max_idle_balance_cost

On Thu, Jan 23, 2014 at 10:49:35AM +0000, Alex Shi wrote:
> On 01/23/2014 05:54 PM, Peter Zijlstra wrote:
> > On Thu, Jan 23, 2014 at 02:49:25PM +0800, Alex Shi wrote:
> >> On 01/23/2014 02:10 AM, Jason Low wrote:
> >>>>>>> P64(avg_idle);
> >>>>>>> + P64(max_idle_balance_cost);
> >>>>>>> #endif
> >>>>>>> P(ttwu_count);
> >>>>>
> >>>>> Not also the per-sd value in sd_alloc_ctl_domain_table() ?
> >>> Yeah, tracking the sd->max_newidle_lb_cost can also be useful.
> >>>
> >>>
> >>
> >> Thanks for the suggestion!
> >>
> >> I thought sd->max_newidle_lb_cost keeps changing. But yes, it's
> >> still meaningful.
> >> BTW, on the pandaboard ES, the sd balance cost is about 1~2ms.
> >
> > That's an insane amount of time for only 2 cpus.
>
> maybe, :(
>
> but it is the data.
> $ cat /proc/sys/kernel/sched_domain/cpu0/domain0/max_newidle_lb_cost
> 1873393

What sort of workload are you running? It seems rather extreme.

I did a quick test on TC2 and got max_newidle_lb_cost in the range
~7000-80000 when having a constantly running task on each cpu for a few
seconds.

I also traced curr_cost just to get an idea of the cost of idle_balance.
I get an average cost across all five cpus of around 7000.

Morten

2014-01-23 14:44:09

by Alex Shi

Subject: Re: [PATCH 2/2] sched: add statistic for rq->max_idle_balance_cost

On 01/23/2014 08:32 PM, Morten Rasmussen wrote:
> On Thu, Jan 23, 2014 at 10:49:35AM +0000, Alex Shi wrote:
>> On 01/23/2014 05:54 PM, Peter Zijlstra wrote:
>>> On Thu, Jan 23, 2014 at 02:49:25PM +0800, Alex Shi wrote:
>>>> On 01/23/2014 02:10 AM, Jason Low wrote:
>>>>>>>>> P64(avg_idle);
>>>>>>>>> + P64(max_idle_balance_cost);
>>>>>>>>> #endif
>>>>>>>>> P(ttwu_count);
>>>>>>>
>>>>>>> Not also the per-sd value in sd_alloc_ctl_domain_table() ?
>>>>> Yeah, tracking the sd->max_newidle_lb_cost can also be useful.
>>>>>
>>>>>
>>>>
>>>> Thanks for the suggestion!
>>>>
>>>> I thought sd->max_newidle_lb_cost keeps changing. But yes, it's
>>>> still meaningful.
>>>> BTW, on the pandaboard ES, the sd balance cost is about 1~2ms.
>>>
>>> That's an insane amount of time for only 2 cpus.
>>
>> maybe, :(
>>
>> but it is the data.
>> $ cat /proc/sys/kernel/sched_domain/cpu0/domain0/max_newidle_lb_cost
>> 1873393
>
> What sort of workload are you running? It seems rather extreme.
>
> I did a quick test on TC2 and got max_newidle_lb_cost in the range
> ~7000-80000 when having a constantly running task on each cpu for a few
> seconds.

Thanks for trying.

I am using the Ubuntu server system on the panda ES.
More than half of the memory is free.
The kernel is tip/master with some debug options enabled, like debug_kernel,
lock debugging, etc.
Nothing looks abnormal.

the vmstat data:
alexs@alex-panda:~$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 628420 48776 245088 0 0 11 4 39 26 0 0 99 0
0 0 0 628396 48776 245088 0 0 0 0 79 44 0 0 100 0
1 0 0 628396 48776 245088 0 0 0 0 64 39 0 0 100 0
0 0 0 628420 48776 245112 0 0 0 0 47 27 0 1 99 0
0 0 0 628420 48776 245112 0 0 0 0 53 31 0 0 100 0
0 0 0 628420 48776 245112 0 0 0 0 47 30 0 0 100 0
0 0 0 628420 48776 245112 0 0 0 0 44 31 0 0 100 0
0 0 0 628420 48776 245112 0 0 0 0 58 39 0 0 100 0
0 0 0 628420 48776 245112 0 0 0 0 46 23 0 0 100 0

=====
@alex-panda:~$ cat /proc/sched_debug
Sched Debug Version: v0.11, 3.13.0-02178-gaac64da-dirty #94
ktime : 14299180.219725
sched_clk : 14299181.549072
cpu_clk : 14299181.549072
jiffies : 1399918

sysctl_sched
.sysctl_sched_latency : 12.000000
.sysctl_sched_min_granularity : 1.500000
.sysctl_sched_wakeup_granularity : 2.000000
.sysctl_sched_child_runs_first : 0
.sysctl_sched_features : 11899
.sysctl_sched_tunable_scaling : 1 (logaritmic)

cpu#0
.nr_running : 0
.load : 0
.nr_switches : 287105
.nr_load_updates : 103294
.nr_uninterruptible : -10
.next_balance : 1.399919
.curr->pid : 0
.clock : 14299181.488037
.cpu_load[0] : 0
.cpu_load[1] : 0
.cpu_load[2] : 0
.cpu_load[3] : 0
.cpu_load[4] : 0
.yld_count : 2
.sched_count : 287650
.sched_goidle : 106271
.avg_idle : 7203403
.max_idle_balance_cost : 3704243
.ttwu_count : 183044
.ttwu_local : 161092

cfs_rq[0]:
.exec_clock : 200707.458407
.MIN_vruntime : 0.000001
.min_vruntime : 668837.562013
.max_vruntime : 0.000001
.spread : 0.000000
.spread0 : 0.000000
.nr_spread_over : 63
.nr_running : 0
.load : 0
.runnable_load_avg : 0
.blocked_load_avg : 97

rt_rq[0]:
.rt_nr_running : 0
.rt_throttled : 0
.rt_time : 0.000000
.rt_runtime : 950.000000

runnable tasks:
task PID tree-key switches prio exec-runtime sum-exec sum-sleep
----------------------------------------------------------------------------------------------------------
ksoftirqd/0 3 668828.662844 12744 120 668828.662844 1148.742697 14295248.412972
kworker/0:0H 5 1780.632452 5 100 1780.632452 0.915527 2428.527832
rcu_bh 8 12.628673 2 120 12.628673 0.030517 0.030518
migration/0 9 0.000000 399 0 0.000000 187.866211 0.030517
kworker/u4:1 16 3362.363077 201 120 3362.363077 72.906491 7433.654788
kworker/0:1 52 668831.623049 7206 120 668831.623049 1035.125800 14296468.780426
writeback 275 238.256812 2 100 238.256812 0.061035 0.030518
bioset 277 244.287678 2 100 244.287678 0.061034 0.030518
kblockd 279 250.318545 2 100 250.318545 0.061036 0.030517
kswapd0 552 533.946299 3 120 533.946299 0.152588 103.240967
nfsiod 554 473.305109 2 100 473.305109 0.061036 0.030517
crypto 555 479.396659 2 100 479.396659 0.091553 0.030517
OMAP UART0 575 527.793717 2 100 527.793717 0.061035 0.061035
OMAP UART1 577 533.854750 2 100 533.854750 0.061036 0.030517
OMAP UART2 579 539.915782 2 100 539.915782 0.061035 0.030518
OMAP UART3 581 552.098882 2 100 552.098882 0.000000 0.061035
spi1 649 968.857220 3 120 968.857220 0.152588 0.152588
spi2 651 981.070836 2 120 981.070836 0.061034 0.030518
spi3 653 993.192900 2 120 993.192900 0.000000 0.061035
spi4 655 1005.375999 2 120 1005.375999 0.030517 0.030518
kpsmoused 694 1238.045075 2 100 1238.045075 0.061036 0.030517
irq/158-talert 706 0.000000 2 49 0.000000 0.366211 0.000000
deferwq 730 1384.852623 2 100 1384.852623 0.091553 0.030517
irq/39-TWL6030- 742 0.000000 2 49 0.000000 1.068115 0.000000
irq/151-twl6040 776 5.999997 2 49 5.999997 0.793457 0.000000
irq/89-48072000 779 0.000000 2 49 0.000000 0.061036 0.000000
irq/93-48060000 782 0.000000 2 49 0.000000 0.122070 0.000000
irq/94-48350000 788 0.000000 2 49 0.000000 0.122070 0.000000
kworker/0:1H 809 380628.904053 3018 100 380628.904053 505.493149 14107134.002707
upstart-udev-br 884 14849.067714 1123 120 14849.067714 498.809820 38619.323719
udevd 896 4948.481614 668 120 4948.481614 368.133548 5490.539550
udevd 1140 2599.121477 4 120 2599.121477 3.997803 0.000000
kworker/0:2 1615 3411.081929 3 120 3411.081929 0.640868 19.897461
dbus-daemon 1629 58044.480491 304 120 58044.480491 150.939951 3185970.214828
rsyslogd 1679 668566.974627 1084 120 668566.974627 11260.040283 14269129.852303
rs:main Q:Reg 1683 381884.264680 146 120 381884.264680 552.001943 14231655.883798
upstart-socket- 1743 14848.579433 221 120 14848.579433 52.886965 34331.695554
nmbd 1834 668796.790535 1455 120 668796.790535 2483.093268 14283383.087151
getty 1873 4977.536546 10 120 4977.536546 13.061522 0.000000
getty 1881 5006.922279 7 120 5006.922279 13.824462 2.929688
getty 1898 13273.101971 20 120 13273.101971 31.250000 12775.329589
irqbalance 1918 668722.339617 1641 120 668722.339617 5594.879138 14275445.953377
cron 1955 668328.823268 255 120 668328.823268 335.906977 14246798.889165
atd 1956 360099.936165 6 120 360099.936165 3.021240 10800118.621826
mysqld 2007 5674.303582 1 120 5674.303582 0.152588 0.000000
mysqld 2009 5675.561879 1 120 5675.561879 0.152587 0.000000
mysqld 2011 5675.053433 1 120 5675.053433 0.000000 0.000000
mysqld 2013 5674.775504 1 120 5674.775504 0.000000 0.000000
mysqld 2015 5675.008589 1 120 5675.008589 0.152588 0.000000
mysqld 2024 668831.696048 14074 120 668831.696048 12500.183090 14259236.908010
mysqld 2026 668796.698982 2846 120 668796.698982 1983.551015 14281165.191669
mysqld 2027 6691.770100 3 120 6691.770100 0.732421 1000.976563
mysqld 2052 6691.546710 1 120 6691.546710 0.793457 0.000000
libvirtd 1981 5444.877738 1 120 5444.877738 0.000000 0.000000
libvirtd 1983 5444.968627 1 120 5444.968627 0.152588 0.000000
libvirtd 1985 5445.243285 1 120 5445.243285 0.000000 0.000000
libvirtd 1991 5445.548948 1 120 5445.548948 0.000000 0.000000
postgres 2218 668834.796877 64567 120 668834.796877 70552.520660 14157241.424533
postgres 2220 668796.668465 8332 120 668796.668465 10573.760938 14259290.405313
apache2 2897 309936.461891 1 120 309936.461891 3.479004 0.000000
apache2 2900 309942.976298 1 120 309942.976298 3.570557 0.000000
apache2 2901 309944.990458 1 120 309944.990458 3.509521 0.000000
bash 3371 668837.562013 225 120 668837.562013 2406.188964 48741.851807

cpu#1
.nr_running : 1
.load : 1024
.nr_switches : 269118
.nr_load_updates : 130029
.nr_uninterruptible : 10
.next_balance : 1.399934
.curr->pid : 3416
.clock : 14299182.861328
.cpu_load[0] : 998
.cpu_load[1] : 499
.cpu_load[2] : 250
.cpu_load[3] : 131
.cpu_load[4] : 82
.yld_count : 2
.sched_count : 269213
.sched_goidle : 108355
.avg_idle : 5111702
.max_idle_balance_cost : 3708756
.ttwu_count : 141417
.ttwu_local : 130286

cfs_rq[1]:
.exec_clock : 175200.378299
.MIN_vruntime : 0.000001
.min_vruntime : 376330.465410
.max_vruntime : 0.000001
.spread : 0.000000
.spread0 : -292507.096603
.nr_spread_over : 130
.nr_running : 1
.load : 1024
.runnable_load_avg : 998
.blocked_load_avg : 413

rt_rq[1]:
.rt_nr_running : 0
.rt_throttled : 0
.rt_time : 0.000000
.rt_runtime : 950.000000

runnable tasks:
task PID tree-key switches prio exec-runtime sum-exec sum-sleep
----------------------------------------------------------------------------------------------------------
init 1 313310.289516 3299 120 313310.289516 2455.017054 6104221.954363
kthreadd 2 6577.160196 58 120 6577.160196 5.462646 8008.483888
kworker/u4:0 6 376132.673205 2930 120 376132.673205 493.042018 14291930.480919
rcu_sched 7 376323.397295 14084 120 376323.397295 723.998898 14297275.360261
migration/1 10 0.000000 401 0 0.000000 850.799562 0.030517
ksoftirqd/1 11 376284.212726 7272 120 376284.212726 634.094247 14296060.028045
kworker/1:0 12 232.154164 6 120 232.154164 0.488281 232.757568
kworker/1:0H 13 3954.602298 6 100 3954.602298 0.457762 2448.822022
khelper 14 5.074196 2 100 5.074196 0.152589 0.030517
kdevtmpfs 15 6350.821406 145 120 6350.821406 34.027099 6938.079836
kworker/1:1 55 376324.556962 14214 120 376324.556962 1210.388161 14295621.337855
khubd 431 3992.571696 146 120 3992.571696 73.120116 2298.736572
rpciod 533 3186.886003 2 100 3186.886003 0.061036 0.030517
fsnotify_mark 553 6698.542889 5 120 6698.542889 0.610351 7465.240479
irq/88-48070000 733 0.000000 175 49 0.000000 37.597654 0.000000
mmcqd/0 798 376284.578939 31081 120 376284.578939 7339.965810 14286424.774125
kworker/1:1H 811 376284.153804 2016 100 376284.153804 467.559836 14295915.771461
jbd2/mmcblk0p2- 812 376284.853597 909 120 376284.853597 187.957750 14296129.089368
ext4-rsv-conver 813 3960.631760 2 100 3960.631760 0.091552 0.030518
udevd 1139 5073.299325 3 120 5073.299325 3.326416 0.000000
rsyslogd 1685 367855.641776 89 120 367855.641776 66.558837 14232167.999268
rsyslogd 1686 18583.825784 5 120 18583.825784 8.148193 35041.595459
smbd 1735 373866.679904 381 120 373866.679904 713.073724 14233558.624261
smbd 1750 7436.795409 1 120 7436.795409 3.997803 0.000000
sshd 1791 367855.611257 41 120 367855.611257 56.549075 14230944.915772
getty 1888 8379.423084 9 120 8379.423084 17.639160 0.000000
getty 1890 8397.461169 9 120 8397.461169 17.425537 0.000000
getty 1895 8438.145491 9 120 8438.145491 17.730713 0.000000
acpid 1914 8598.322557 4 120 8598.322557 9.460450 0.000000
mysqld 1967 10371.880479 128 120 10371.880479 366.973878 1735.992431
mysqld 2008 9233.960583 7 120 9233.960583 2.258300 475.158691
mysqld 2010 8867.063605 1 120 8867.063605 0.152588 0.000000
mysqld 2012 8866.341534 1 120 8866.341534 0.000000 0.000000
mysqld 2014 8867.467414 1 120 8867.467414 1.159668 0.000000
mysqld 2016 8866.296690 1 120 8866.296690 0.000000 0.000000
mysqld 2025 376285.463948 14148 120 376285.463948 19846.252467 14255057.739241
whoopsie 1972 319956.300764 20 120 319956.300764 45.135497 7200619.628905
gdbus 1998 8723.098042 8 120 8723.098042 4.394532 0.457763
libvirtd 1977 46500.443638 27 120 46500.443638 46.112063 3182319.610594
libvirtd 1980 8682.048136 1 120 8682.048136 0.000000 0.000000
libvirtd 1982 8681.762422 1 120 8681.762422 0.000000 0.000000
libvirtd 1984 8681.548135 1 120 8681.548135 0.000000 0.000000
libvirtd 1986 8681.548136 1 120 8681.548136 0.000000 0.000000
libvirtd 1989 8681.822793 2 120 8681.822793 0.274658 0.000000
libvirtd 1990 8681.578652 1 120 8681.578652 0.030517 0.000000
postgres 2022 376137.781608 1385 120 376137.781608 8902.221690 14270901.306141
postgres 2217 376284.365315 64617 120 376284.365315 68520.446801 14163715.789733
postgres 2219 376284.273762 15293 120 376284.273762 8413.848882 14260271.423355
apache2 2315 376284.151692 14024 120 376284.151692 8033.294679 14257380.950902
getty 2353 17387.058964 12 120 17387.058964 12.756348 2.380371
apache2 2896 293802.653959 8 120 293802.653959 2.899168 0.000000
apache2 2898 293807.600003 1 120 293807.600003 0.946045 0.000000
sshd 3243 374087.137122 31 120 374087.137122 168.395993 6727.355959
sshd 3370 376324.679034 263 120 376324.679034 360.321040 50774.353030
R cat 3416 376330.465410 4 120 376330.465410 9.033203 0.518799


>
> I also traced curr_cost just to get an idea of the cost of idle_balance.
> I get an average cost across all five cpus of around 7000.
>
> Morten
>


--
Thanks
Alex

Subject: [tip:sched/core] sched: Add statistic for newidle load balance cost

Commit-ID: 37e6bae8395a94b4dd934c92b02b9408be992365
Gitweb: http://git.kernel.org/tip/37e6bae8395a94b4dd934c92b02b9408be992365
Author: Alex Shi <[email protected]>
AuthorDate: Thu, 23 Jan 2014 18:39:54 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 11 Feb 2014 09:58:18 +0100

sched: Add statistic for newidle load balance cost

Tracking rq->max_idle_balance_cost and sd->max_newidle_lb_cost.
It's useful to know these values in debug mode.

Signed-off-by: Alex Shi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 9 ++++++---
kernel/sched/debug.c | 1 +
2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3068f37..fb9764f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4811,7 +4811,7 @@ set_table_entry(struct ctl_table *entry,
static struct ctl_table *
sd_alloc_ctl_domain_table(struct sched_domain *sd)
{
- struct ctl_table *table = sd_alloc_ctl_entry(13);
+ struct ctl_table *table = sd_alloc_ctl_entry(14);

if (table == NULL)
return NULL;
@@ -4839,9 +4839,12 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd)
sizeof(int), 0644, proc_dointvec_minmax, false);
set_table_entry(&table[10], "flags", &sd->flags,
sizeof(int), 0644, proc_dointvec_minmax, false);
- set_table_entry(&table[11], "name", sd->name,
+ set_table_entry(&table[11], "max_newidle_lb_cost",
+ &sd->max_newidle_lb_cost,
+ sizeof(long), 0644, proc_doulongvec_minmax, false);
+ set_table_entry(&table[12], "name", sd->name,
CORENAME_MAX_SIZE, 0444, proc_dostring, false);
- /* &table[12] is terminator */
+ /* &table[13] is terminator */

return table;
}
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 31b908d..f3344c3 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -321,6 +321,7 @@ do { \
P(sched_goidle);
#ifdef CONFIG_SMP
P64(avg_idle);
+ P64(max_idle_balance_cost);
#endif

P(ttwu_count);