LinuxLists.cc - [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults

[permalink] [raw]

Subject: Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

On Tue, 29 Jul 2014 13:24:05 +0800
Aaron Lu <[email protected]> wrote:

> FYI, we noticed the below changes on
>
> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
> commit a43455a1d572daf7b730fe12eb747d1e17411365 ("sched/numa: Ensure task_numa_migrate() checks the preferred node")
>
> ebe06187bf2aec1 a43455a1d572daf7b730fe12e
> --------------- -------------------------
> 94500 ~ 3% +115.6% 203711 ~ 6% ivb42/hackbench/50%-threads-pipe
> 67745 ~ 4% +64.1% 111174 ~ 5% lkp-snb01/hackbench/50%-threads-socket
> 162245 ~ 3% +94.1% 314885 ~ 6% TOTAL proc-vmstat.numa_hint_faults_local

Hi Aaron,

Jirka Hladky has reported a regression with that changeset as
well, and I have already spent some time debugging the issue.

I added tracing code to task_numa_compare() and saw a number
of thread swaps with tiny improvements.

Does preventing those help your workload, or am I barking up
the wrong tree again? (I have been looking at this for a while...)

---8<---

Subject: sched,numa: prevent task moves with marginal benefit

Commit a43455a1d57 makes task_numa_migrate() always check the
preferred node for task placement. This is causing a performance
regression with hackbench, as well as SPECjbb2005.

Tracing task_numa_compare() with a single instance of SPECjbb2005
on a 4 node system, I have seen several thread swaps with tiny
improvements.

It appears that the hysteresis code that was added to task_numa_compare
is not doing what we needed it to do, and a simple threshold could be
better.

Reported-by: Aaron Lu <[email protected]>
Reported-by: Jirka Hladky <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
kernel/sched/fair.c | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f5e3c2..bedbc3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -924,10 +924,12 @@ static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)

/*
* These return the fraction of accesses done by a particular task, or
- * task group, on a particular numa node. The group weight is given a
- * larger multiplier, in order to group tasks together that are almost
- * evenly spread out between numa nodes.
+ * task group, on a particular numa node. The NUMA move threshold
+ * prevents task moves with marginal improvement, and is set to 5%.
*/
+#define NUMA_SCALE 1000
+#define NUMA_MOVE_THRESH 50
+
static inline unsigned long task_weight(struct task_struct *p, int nid)
{
unsigned long total_faults;
@@ -940,7 +942,7 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
if (!total_faults)
return 0;

- return 1000 * task_faults(p, nid) / total_faults;
+ return NUMA_SCALE * task_faults(p, nid) / total_faults;
}

static inline unsigned long group_weight(struct task_struct *p, int nid)
@@ -948,7 +950,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
if (!p->numa_group || !p->numa_group->total_faults)
return 0;

- return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
+ return NUMA_SCALE * group_faults(p, nid) / p->numa_group->total_faults;
}

bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
@@ -1181,11 +1183,11 @@ static void task_numa_compare(struct task_numa_env *env,
imp = taskimp + task_weight(cur, env->src_nid) -
task_weight(cur, env->dst_nid);
/*
- * Add some hysteresis to prevent swapping the
- * tasks within a group over tiny differences.
+ * Do not swap tasks within a group around unless
+ * there is a significant improvement.
*/
- if (cur->numa_group)
- imp -= imp/16;
+ if (cur->numa_group && imp < NUMA_MOVE_THRESH)
+ goto unlock;
} else {
/*
* Compare the group weights. If a task is all by
@@ -1205,6 +1207,10 @@ static void task_numa_compare(struct task_numa_env *env,
goto unlock;

if (!cur) {
+ /* Only move if there is a significant improvement. */
+ if (imp < NUMA_MOVE_THRESH)
+ goto unlock;
+
/* Is there capacity at our destination? */
if (env->src_stats.has_free_capacity &&
!env->dst_stats.has_free_capacity)

2014-07-29 08:17:18

[permalink] [raw]

Subject: Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

On Tue, Jul 29, 2014 at 02:39:40AM -0400, Rik van Riel wrote:
> Subject: sched,numa: prevent task moves with marginal benefit
>
> Commit a43455a1d57 makes task_numa_migrate() always check the
> preferred node for task placement. This is causing a performance
> regression with hackbench, as well as SPECjbb2005.
>
> Tracing task_numa_compare() with a single instance of SPECjbb2005
> on a 4 node system, I have seen several thread swaps with tiny
> improvements.
>
> It appears that the hysteresis code that was added to task_numa_compare
> is not doing what we needed it to do, and a simple threshold could be
> better.
>
> Reported-by: Aaron Lu <[email protected]>
> Reported-by: Jirka Hladky <[email protected]>
> Signed-off-by: Rik van Riel <[email protected]>
> ---
> kernel/sched/fair.c | 24 +++++++++++++++---------
> 1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4f5e3c2..bedbc3e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -924,10 +924,12 @@ static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)
>
> /*
> * These return the fraction of accesses done by a particular task, or
> - * task group, on a particular numa node. The group weight is given a
> - * larger multiplier, in order to group tasks together that are almost
> - * evenly spread out between numa nodes.
> + * task group, on a particular numa node. The NUMA move threshold
> + * prevents task moves with marginal improvement, and is set to 5%.
> */
> +#define NUMA_SCALE 1000
> +#define NUMA_MOVE_THRESH 50

Please make that 1024, there's no reason not to use power of two here.
This base 10 factor thing annoyed me no end already, its time for it to
die.

2014-07-29 20:05:14

[permalink] [raw]

Subject: Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

On Tue, 29 Jul 2014 10:17:12 +0200
Peter Zijlstra <[email protected]> wrote:

> > +#define NUMA_SCALE 1000
> > +#define NUMA_MOVE_THRESH 50
>
> Please make that 1024, there's no reason not to use power of two here.
> This base 10 factor thing annoyed me no end already, its time for it to
> die.

That's easy enough. However, it would be good to know whether
this actually helps with the regression Aaron found :)

---8<---

Subject: sched,numa: prevent task moves with marginal benefit

Commit a43455a1d57 makes task_numa_migrate() always check the
preferred node for task placement. This is causing a performance
regression with hackbench, as well as SPECjbb2005.

Tracing task_numa_compare() with a single instance of SPECjbb2005
on a 4 node system, I have seen several thread swaps with tiny
improvements.

It appears that the hysteresis code that was added to task_numa_compare
is not doing what we needed it to do, and a simple threshold could be
better.

Aaron, does this patch help, or am I barking up the wrong tree?

Reported-by: Aaron Lu <[email protected]>
Reported-by: Jirka Hladky <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
kernel/sched/fair.c | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f5e3c2..9bd283b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -924,10 +924,12 @@ static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)

/*
* These return the fraction of accesses done by a particular task, or
- * task group, on a particular numa node. The group weight is given a
- * larger multiplier, in order to group tasks together that are almost
- * evenly spread out between numa nodes.
+ * task group, on a particular numa node. The NUMA move threshold
+ * prevents task moves with marginal improvement, and is set to 5%.
*/
+#define NUMA_SCALE 1024
+#define NUMA_MOVE_THRESH (5 * NUMA_SCALE / 100)
+
static inline unsigned long task_weight(struct task_struct *p, int nid)
{
unsigned long total_faults;
@@ -940,7 +942,7 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
if (!total_faults)
return 0;

- return 1000 * task_faults(p, nid) / total_faults;
+ return NUMA_SCALE * task_faults(p, nid) / total_faults;
}

static inline unsigned long group_weight(struct task_struct *p, int nid)
@@ -948,7 +950,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
if (!p->numa_group || !p->numa_group->total_faults)
return 0;

- return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
+ return NUMA_SCALE * group_faults(p, nid) / p->numa_group->total_faults;
}

bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
@@ -1181,11 +1183,11 @@ static void task_numa_compare(struct task_numa_env *env,
imp = taskimp + task_weight(cur, env->src_nid) -
task_weight(cur, env->dst_nid);
/*
- * Add some hysteresis to prevent swapping the
- * tasks within a group over tiny differences.
+ * Do not swap tasks within a group around unless
+ * there is a significant improvement.
*/
- if (cur->numa_group)
- imp -= imp/16;
+ if (cur->numa_group && imp < NUMA_MOVE_THRESH)
+ goto unlock;
} else {
/*
* Compare the group weights. If a task is all by
@@ -1205,6 +1207,10 @@ static void task_numa_compare(struct task_numa_env *env,
goto unlock;

if (!cur) {
+ /* Only move if there is a significant improvement. */
+ if (imp < NUMA_MOVE_THRESH)
+ goto unlock;
+
/* Is there capacity at our destination? */
if (env->src_stats.has_free_capacity &&
!env->dst_stats.has_free_capacity)

2014-07-30 02:14:31

[permalink] [raw]

Subject: Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

On Tue, Jul 29, 2014 at 04:04:37PM -0400, Rik van Riel wrote:
> On Tue, 29 Jul 2014 10:17:12 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > > +#define NUMA_SCALE 1000
> > > +#define NUMA_MOVE_THRESH 50
> >
> > Please make that 1024, there's no reason not to use power of two here.
> > This base 10 factor thing annoyed me no end already, its time for it to
> > die.
>
> That's easy enough. However, it would be good to know whether
> this actually helps with the regression Aaron found :)

Sorry for the delay.

I applied the last patch and queued the hackbench job to the ivb42 test
machine for it to run 5 times, and here is the result(regarding the
proc-vmstat.numa_hint_faults_local field):
173565
201262
192317
198342
198595
avg:
192816

It seems it is still very big than previous kernels.

BTW, to highlight changes, we only include metrics that have changed a
lot in the report, which means, for metrics that don't show in the
report, it means it doesn't change much. But just in case, here is the
throughput metric regarding commit a43455a1d(compared to its parent):

ebe06187bf2aec1 a43455a1d572daf7b730fe12e
--------------- -------------------------
118881 ~ 0% +1.2% 120325 ~ 0% ivb42/hackbench/50%-threads-pipe
78410 ~ 0% +0.6% 78857 ~ 0% lkp-snb01/hackbench/50%-threads-socket
197292 ~ 0% +1.0% 199182 ~ 0% TOTAL hackbench.throughput

Feel free to let me know if you need more information.

Thanks,
Aaron

>
> ---8<---
>
> Subject: sched,numa: prevent task moves with marginal benefit
>
> Commit a43455a1d57 makes task_numa_migrate() always check the
> preferred node for task placement. This is causing a performance
> regression with hackbench, as well as SPECjbb2005.
>
> Tracing task_numa_compare() with a single instance of SPECjbb2005
> on a 4 node system, I have seen several thread swaps with tiny
> improvements.
>
> It appears that the hysteresis code that was added to task_numa_compare
> is not doing what we needed it to do, and a simple threshold could be
> better.
>
> Aaron, does this patch help, or am I barking up the wrong tree?
>
> Reported-by: Aaron Lu <[email protected]>
> Reported-by: Jirka Hladky <[email protected]>
> Signed-off-by: Rik van Riel <[email protected]>
> ---
> kernel/sched/fair.c | 24 +++++++++++++++---------
> 1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4f5e3c2..9bd283b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -924,10 +924,12 @@ static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)
>
> /*
> * These return the fraction of accesses done by a particular task, or
> - * task group, on a particular numa node. The group weight is given a
> - * larger multiplier, in order to group tasks together that are almost
> - * evenly spread out between numa nodes.
> + * task group, on a particular numa node. The NUMA move threshold
> + * prevents task moves with marginal improvement, and is set to 5%.
> */
> +#define NUMA_SCALE 1024
> +#define NUMA_MOVE_THRESH (5 * NUMA_SCALE / 100)
> +
> static inline unsigned long task_weight(struct task_struct *p, int nid)
> {
> unsigned long total_faults;
> @@ -940,7 +942,7 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)
> if (!total_faults)
> return 0;
>
> - return 1000 * task_faults(p, nid) / total_faults;
> + return NUMA_SCALE * task_faults(p, nid) / total_faults;
> }
>
> static inline unsigned long group_weight(struct task_struct *p, int nid)
> @@ -948,7 +950,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
> if (!p->numa_group || !p->numa_group->total_faults)
> return 0;
>
> - return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
> + return NUMA_SCALE * group_faults(p, nid) / p->numa_group->total_faults;
> }
>
> bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
> @@ -1181,11 +1183,11 @@ static void task_numa_compare(struct task_numa_env *env,
> imp = taskimp + task_weight(cur, env->src_nid) -
> task_weight(cur, env->dst_nid);
> /*
> - * Add some hysteresis to prevent swapping the
> - * tasks within a group over tiny differences.
> + * Do not swap tasks within a group around unless
> + * there is a significant improvement.
> */
> - if (cur->numa_group)
> - imp -= imp/16;
> + if (cur->numa_group && imp < NUMA_MOVE_THRESH)
> + goto unlock;
> } else {
> /*
> * Compare the group weights. If a task is all by
> @@ -1205,6 +1207,10 @@ static void task_numa_compare(struct task_numa_env *env,
> goto unlock;
>
> if (!cur) {
> + /* Only move if there is a significant improvement. */
> + if (imp < NUMA_MOVE_THRESH)
> + goto unlock;
> +
> /* Is there capacity at our destination? */
> if (env->src_stats.has_free_capacity &&
> !env->dst_stats.has_free_capacity)
>

2014-07-30 14:26:22

[permalink] [raw]

Subject: Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

On 07/29/2014 10:14 PM, Aaron Lu wrote:
> On Tue, Jul 29, 2014 at 04:04:37PM -0400, Rik van Riel wrote:
>> On Tue, 29 Jul 2014 10:17:12 +0200
>> Peter Zijlstra <[email protected]> wrote:
>>
>>>> +#define NUMA_SCALE 1000
>>>> +#define NUMA_MOVE_THRESH 50
>>>
>>> Please make that 1024, there's no reason not to use power of two here.
>>> This base 10 factor thing annoyed me no end already, its time for it to
>>> die.
>>
>> That's easy enough. However, it would be good to know whether
>> this actually helps with the regression Aaron found :)
>
> Sorry for the delay.
>
> I applied the last patch and queued the hackbench job to the ivb42 test
> machine for it to run 5 times, and here is the result(regarding the
> proc-vmstat.numa_hint_faults_local field):
> 173565
> 201262
> 192317
> 198342
> 198595
> avg:
> 192816
>
> It seems it is still very big than previous kernels.

It looks like a step in the right direction, though.

Could you try running with a larger threshold?

>> +++ b/kernel/sched/fair.c
>> @@ -924,10 +924,12 @@ static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)
>>
>> /*
>> * These return the fraction of accesses done by a particular task, or
>> - * task group, on a particular numa node. The group weight is given a
>> - * larger multiplier, in order to group tasks together that are almost
>> - * evenly spread out between numa nodes.
>> + * task group, on a particular numa node. The NUMA move threshold
>> + * prevents task moves with marginal improvement, and is set to 5%.
>> */
>> +#define NUMA_SCALE 1024
>> +#define NUMA_MOVE_THRESH (5 * NUMA_SCALE / 100)

It would be good to see if changing NUMA_MOVE_THRESH to
(NUMA_SCALE / 8) does the trick.

I will run the same thing here with SPECjbb2005.

2014-07-31 05:04:59

[permalink] [raw]

Subject: Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

On Wed, Jul 30, 2014 at 10:25:03AM -0400, Rik van Riel wrote:
> On 07/29/2014 10:14 PM, Aaron Lu wrote:
> > On Tue, Jul 29, 2014 at 04:04:37PM -0400, Rik van Riel wrote:
> >> On Tue, 29 Jul 2014 10:17:12 +0200
> >> Peter Zijlstra <[email protected]> wrote:
> >>
> >>>> +#define NUMA_SCALE 1000
> >>>> +#define NUMA_MOVE_THRESH 50
> >>>
> >>> Please make that 1024, there's no reason not to use power of two here.
> >>> This base 10 factor thing annoyed me no end already, its time for it to
> >>> die.
> >>
> >> That's easy enough. However, it would be good to know whether
> >> this actually helps with the regression Aaron found :)
> >
> > Sorry for the delay.
> >
> > I applied the last patch and queued the hackbench job to the ivb42 test
> > machine for it to run 5 times, and here is the result(regarding the
> > proc-vmstat.numa_hint_faults_local field):
> > 173565
> > 201262
> > 192317
> > 198342
> > 198595
> > avg:
> > 192816
> >
> > It seems it is still very big than previous kernels.
>
> It looks like a step in the right direction, though.
>
> Could you try running with a larger threshold?
>
> >> +++ b/kernel/sched/fair.c
> >> @@ -924,10 +924,12 @@ static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)
> >>
> >> /*
> >> * These return the fraction of accesses done by a particular task, or
> >> - * task group, on a particular numa node. The group weight is given a
> >> - * larger multiplier, in order to group tasks together that are almost
> >> - * evenly spread out between numa nodes.
> >> + * task group, on a particular numa node. The NUMA move threshold
> >> + * prevents task moves with marginal improvement, and is set to 5%.
> >> */
> >> +#define NUMA_SCALE 1024
> >> +#define NUMA_MOVE_THRESH (5 * NUMA_SCALE / 100)
>
> It would be good to see if changing NUMA_MOVE_THRESH to
> (NUMA_SCALE / 8) does the trick.

With your 2nd patch and the above change, the result is:

"proc-vmstat.numa_hint_faults_local": [
199708,
209152,
200638,
187324,
196654
],

avg:
198695

Regards,
Aaron

2014-07-31 06:23:27

[permalink] [raw]

Subject: Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 07/31/2014 01:04 AM, Aaron Lu wrote:
> On Wed, Jul 30, 2014 at 10:25:03AM -0400, Rik van Riel wrote:
>> On 07/29/2014 10:14 PM, Aaron Lu wrote:
>>> On Tue, Jul 29, 2014 at 04:04:37PM -0400, Rik van Riel wrote:
>>>> On Tue, 29 Jul 2014 10:17:12 +0200 Peter Zijlstra
>>>> <[email protected]> wrote:
>>>>
>>>>>> +#define NUMA_SCALE 1000 +#define NUMA_MOVE_THRESH 50
>>>>>
>>>>> Please make that 1024, there's no reason not to use power
>>>>> of two here. This base 10 factor thing annoyed me no end
>>>>> already, its time for it to die.
>>>>
>>>> That's easy enough. However, it would be good to know
>>>> whether this actually helps with the regression Aaron found
>>>> :)
>>>
>>> Sorry for the delay.
>>>
>>> I applied the last patch and queued the hackbench job to the
>>> ivb42 test machine for it to run 5 times, and here is the
>>> result(regarding the proc-vmstat.numa_hint_faults_local
>>> field): 173565 201262 192317 198342 198595 avg: 192816
>>>
>>> It seems it is still very big than previous kernels.
>>
>> It looks like a step in the right direction, though.
>>
>> Could you try running with a larger threshold?
>>
>>>> +++ b/kernel/sched/fair.c @@ -924,10 +924,12 @@ static inline
>>>> unsigned long group_faults_cpu(struct numa_group *group, int
>>>> nid)
>>>>
>>>> /* * These return the fraction of accesses done by a
>>>> particular task, or - * task group, on a particular numa
>>>> node. The group weight is given a - * larger multiplier, in
>>>> order to group tasks together that are almost - * evenly
>>>> spread out between numa nodes. + * task group, on a
>>>> particular numa node. The NUMA move threshold + * prevents
>>>> task moves with marginal improvement, and is set to 5%. */
>>>> +#define NUMA_SCALE 1024 +#define NUMA_MOVE_THRESH (5 *
>>>> NUMA_SCALE / 100)
>>
>> It would be good to see if changing NUMA_MOVE_THRESH to
>> (NUMA_SCALE / 8) does the trick.
>
> With your 2nd patch and the above change, the result is:
>
> "proc-vmstat.numa_hint_faults_local": [ 199708, 209152, 200638,
> 187324, 196654 ],
>
> avg: 198695

OK, so it is still a little higher than your original 162245.

I guess this is to be expected, since the code will be more
successful at placing a task on the right node, which results
in the task scanning its memory more rapidly for a little bit.

Are you seeing any changes in throughput?

- --
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJT2eC/AAoJEM553pKExN6DIFMH/23LsoEJ8cUqMTdWUzhXesEb
TW0yncraZ6tDkGHopTU4oFmck93XUUVSJRVjLC3lxvxAIdWt8M4GCbWN8RD1yicX
Ii9s18+2r2vkc30gkIgh2yahaqQUun9sUkuaQ4BaKlbP+hwQzB3OfU1GjR7iStFE
t04krgCAL+xL63H/4mN0Y9ZjOBUz2QYbkspS21+oEWKkFY2FyyQn+hOSnA6lSvqy
o7v4tmC8jtRXsQY+hfy1aOtMUZO5sRcYHOttlxgjE5MbnW/whhsC+oB7cWw646St
LhvhhIykl/g2Bz+E3KbfnREGn5OO7NmEhv3am2Dj5XsNHnEfxYJH/m/aTA4az/s=
=/IeV
-----END PGP SIGNATURE-----

2014-07-31 06:43:22

[permalink] [raw]

Subject: Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

On Thu, 31 Jul 2014 13:04:54 +0800
Aaron Lu <[email protected]> wrote:

> On Wed, Jul 30, 2014 at 10:25:03AM -0400, Rik van Riel wrote:
> > On 07/29/2014 10:14 PM, Aaron Lu wrote:

> > >> +#define NUMA_SCALE 1024
> > >> +#define NUMA_MOVE_THRESH (5 * NUMA_SCALE / 100)
> >
> > It would be good to see if changing NUMA_MOVE_THRESH to
> > (NUMA_SCALE / 8) does the trick.

FWIW, running with NUMA_MOVE_THRESH set to (NUMA_SCALE / 8)
seems to resolve the SPECjbb2005 threshold on my system.

I will run some more sanity tests later today...

--
All rights reversed.

2014-07-31 06:54:05

[permalink] [raw]

Subject: Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

On Thu, Jul 31, 2014 at 02:22:55AM -0400, Rik van Riel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 07/31/2014 01:04 AM, Aaron Lu wrote:
> > On Wed, Jul 30, 2014 at 10:25:03AM -0400, Rik van Riel wrote:
> >> On 07/29/2014 10:14 PM, Aaron Lu wrote:
> >>> On Tue, Jul 29, 2014 at 04:04:37PM -0400, Rik van Riel wrote:
> >>>> On Tue, 29 Jul 2014 10:17:12 +0200 Peter Zijlstra
> >>>> <[email protected]> wrote:
> >>>>
> >>>>>> +#define NUMA_SCALE 1000 +#define NUMA_MOVE_THRESH 50
> >>>>>
> >>>>> Please make that 1024, there's no reason not to use power
> >>>>> of two here. This base 10 factor thing annoyed me no end
> >>>>> already, its time for it to die.
> >>>>
> >>>> That's easy enough. However, it would be good to know
> >>>> whether this actually helps with the regression Aaron found
> >>>> :)
> >>>
> >>> Sorry for the delay.
> >>>
> >>> I applied the last patch and queued the hackbench job to the
> >>> ivb42 test machine for it to run 5 times, and here is the
> >>> result(regarding the proc-vmstat.numa_hint_faults_local
> >>> field): 173565 201262 192317 198342 198595 avg: 192816
> >>>
> >>> It seems it is still very big than previous kernels.
> >>
> >> It looks like a step in the right direction, though.
> >>
> >> Could you try running with a larger threshold?
> >>
> >>>> +++ b/kernel/sched/fair.c @@ -924,10 +924,12 @@ static inline
> >>>> unsigned long group_faults_cpu(struct numa_group *group, int
> >>>> nid)
> >>>>
> >>>> /* * These return the fraction of accesses done by a
> >>>> particular task, or - * task group, on a particular numa
> >>>> node. The group weight is given a - * larger multiplier, in
> >>>> order to group tasks together that are almost - * evenly
> >>>> spread out between numa nodes. + * task group, on a
> >>>> particular numa node. The NUMA move threshold + * prevents
> >>>> task moves with marginal improvement, and is set to 5%. */
> >>>> +#define NUMA_SCALE 1024 +#define NUMA_MOVE_THRESH (5 *
> >>>> NUMA_SCALE / 100)
> >>
> >> It would be good to see if changing NUMA_MOVE_THRESH to
> >> (NUMA_SCALE / 8) does the trick.
> >
> > With your 2nd patch and the above change, the result is:
> >
> > "proc-vmstat.numa_hint_faults_local": [ 199708, 209152, 200638,
> > 187324, 196654 ],
> >
> > avg: 198695
>
> OK, so it is still a little higher than your original 162245.

The original number is 94500 for ivb42 machine, the 162245 is the sum
of the two numbers above it that are tested on two machines - one is the
number for ivb42 and one is for lkp-snb01. Sorry if that is not clear.

And for the numbers I have given with your patch applied, they are all
for ivb42 alone.

>
> I guess this is to be expected, since the code will be more
> successful at placing a task on the right node, which results
> in the task scanning its memory more rapidly for a little bit.
>
> Are you seeing any changes in throughput?

The throughput has almost no change. Your 2nd patch with scale changed
has seen a decrease of 0.1% compared to your original commit that
triggered the report, and that original commit has a increase of 1.2%
compared to its parent commit.

Regards,
Aaron

2014-07-31 08:33:44

[permalink] [raw]

Subject: Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

On Wed, Jul 30, 2014 at 10:14:25AM +0800, Aaron Lu wrote:
> 118881 ~ 0% +1.2% 120325 ~ 0% ivb42/hackbench/50%-threads-pipe

What kind of IVB is that EP or EX (or rather, how many sockets)? Also
what arguments to hackbench do you use?

Attachments:

(No filename) (249.00 B)
(No filename) (836.00 B)
Download all attachments

2014-07-31 08:56:32

[permalink] [raw]

Subject: Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

On Thu, Jul 31, 2014 at 10:33:30AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 30, 2014 at 10:14:25AM +0800, Aaron Lu wrote:
> > 118881 ~ 0% +1.2% 120325 ~ 0% ivb42/hackbench/50%-threads-pipe
>
> What kind of IVB is that EP or EX (or rather, how many sockets)? Also
> what arguments to hackbench do you use?
>

2 sockets EP.

The cmdline is:
/usr/bin/hackbench -g 24 --threads --pipe -l 60000

Regards,
Aaron

2014-07-31 10:42:55

[permalink] [raw]

Subject: Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

On Tue, Jul 29, 2014 at 02:39:40AM -0400, Rik van Riel wrote:
> On Tue, 29 Jul 2014 13:24:05 +0800
> Aaron Lu <[email protected]> wrote:
>
> > FYI, we noticed the below changes on
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
> > commit a43455a1d572daf7b730fe12eb747d1e17411365 ("sched/numa: Ensure task_numa_migrate() checks the preferred node")
> >
> > ebe06187bf2aec1 a43455a1d572daf7b730fe12e
> > --------------- -------------------------
> > 94500 ~ 3% +115.6% 203711 ~ 6% ivb42/hackbench/50%-threads-pipe
> > 67745 ~ 4% +64.1% 111174 ~ 5% lkp-snb01/hackbench/50%-threads-socket
> > 162245 ~ 3% +94.1% 314885 ~ 6% TOTAL proc-vmstat.numa_hint_faults_local
>
> Hi Aaron,
>
> Jirka Hladky has reported a regression with that changeset as
> well, and I have already spent some time debugging the issue.

So assuming those numbers above are the difference in
numa_hint_local_faults, the report is actually a significant
_improvement_, not a regression.

On my IVB-EP I get similar numbers; using:

PRE=`grep numa_hint_faults_local /proc/vmstat | cut -d' ' -f2`
perf bench sched messaging -g 24 -t -p -l 60000
POST=`grep numa_hint_faults_local /proc/vmstat | cut -d' ' -f2`
echo $((POST-PRE))

tip/mater+origin/master tip/master+origin/master-a43455a1d57

local total local total
faults time faults time

19971 51.384 10104 50.838
17193 50.564 9116 50.208
13435 49.057 8332 51.344
23794 50.795 9954 51.364
20255 49.463 9598 51.258

18929.6 50.2526 9420.8 51.0024
3863.61 0.96 717.78 0.49

So that patch improves both local faults and runtime. Its good (even
though for the runtime we're still inside stdev overlap, so ideally I'd
do more runs).

Now I also did a run with the proposed patch, NUMA_SCALE/8 variant, and
that slightly reduces both again:

tip/master+origin/master+patch

local total
faults time

21296 50.541
12771 50.54
13872 52.224
23352 50.85
16516 50.705

17561.4 50.972
4613.32 0.71

So for hackbench a43455a1d57 is good and the proposed patch is making
things worse.

Let me see if I can still find my SPECjbb2005 copy to see what that
does.

Attachments:

(No filename) (2.26 kB)
(No filename) (836.00 B)
Download all attachments

2014-07-31 15:57:36

[permalink] [raw]

Subject: Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

Attachments:

(No filename) (1.22 kB)
(No filename) (836.00 B)
Download all attachments

2014-07-31 16:17:02

by Jirka Hladky

[permalink] [raw]

Subject: Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

On 07/31/2014 05:57 PM, Peter Zijlstra wrote:
> On Thu, Jul 31, 2014 at 12:42:41PM +0200, Peter Zijlstra wrote:
>> On Tue, Jul 29, 2014 at 02:39:40AM -0400, Rik van Riel wrote:
>>> On Tue, 29 Jul 2014 13:24:05 +0800
>>> Aaron Lu <[email protected]> wrote:
>>>
>>>> FYI, we noticed the below changes on
>>>>
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
>>>> commit a43455a1d572daf7b730fe12eb747d1e17411365 ("sched/numa: Ensure task_numa_migrate() checks the preferred node")
>>>>
>>>> ebe06187bf2aec1 a43455a1d572daf7b730fe12e
>>>> --------------- -------------------------
>>>> 94500 ~ 3% +115.6% 203711 ~ 6% ivb42/hackbench/50%-threads-pipe
>>>> 67745 ~ 4% +64.1% 111174 ~ 5% lkp-snb01/hackbench/50%-threads-socket
>>>> 162245 ~ 3% +94.1% 314885 ~ 6% TOTAL proc-vmstat.numa_hint_faults_local
>>> Hi Aaron,
>>>
>>> Jirka Hladky has reported a regression with that changeset as
>>> well, and I have already spent some time debugging the issue.
>> Let me see if I can still find my SPECjbb2005 copy to see what that
>> does.
> Jirka, what kind of setup were you seeing SPECjbb regressions?
>
> I'm not seeing any on 2 sockets with a single SPECjbb instance, I'll go
> check one instance per socket now.
>
>
Peter, I'm seeing regressions for

SINGLE SPECjbb instance for number of warehouses being the same as total
number of cores in the box.

Example: 4 NUMA node box, each CPU has 6 cores => biggest regression is
for 24 warehouses.

See the attached snapshot.

Jirka

Attachments:

SPECjbb2005_-127.el7numafixes9.png (89.30 kB)

2014-07-31 16:27:22

[permalink] [raw]

Subject: Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

On Thu, Jul 31, 2014 at 06:16:26PM +0200, Jirka Hladky wrote:
> On 07/31/2014 05:57 PM, Peter Zijlstra wrote:
> >On Thu, Jul 31, 2014 at 12:42:41PM +0200, Peter Zijlstra wrote:
> >>On Tue, Jul 29, 2014 at 02:39:40AM -0400, Rik van Riel wrote:
> >>>On Tue, 29 Jul 2014 13:24:05 +0800
> >>>Aaron Lu <[email protected]> wrote:
> >>>
> >>>>FYI, we noticed the below changes on
> >>>>
> >>>>git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
> >>>>commit a43455a1d572daf7b730fe12eb747d1e17411365 ("sched/numa: Ensure task_numa_migrate() checks the preferred node")
> >>>>
> >>>>ebe06187bf2aec1 a43455a1d572daf7b730fe12e
> >>>>--------------- -------------------------
> >>>> 94500 ~ 3% +115.6% 203711 ~ 6% ivb42/hackbench/50%-threads-pipe
> >>>> 67745 ~ 4% +64.1% 111174 ~ 5% lkp-snb01/hackbench/50%-threads-socket
> >>>> 162245 ~ 3% +94.1% 314885 ~ 6% TOTAL proc-vmstat.numa_hint_faults_local
> >>>Hi Aaron,
> >>>
> >>>Jirka Hladky has reported a regression with that changeset as
> >>>well, and I have already spent some time debugging the issue.
> >>Let me see if I can still find my SPECjbb2005 copy to see what that
> >>does.
> >Jirka, what kind of setup were you seeing SPECjbb regressions?
> >
> >I'm not seeing any on 2 sockets with a single SPECjbb instance, I'll go
> >check one instance per socket now.
> >
> >
> Peter, I'm seeing regressions for
>
> SINGLE SPECjbb instance for number of warehouses being the same as total
> number of cores in the box.
>
> Example: 4 NUMA node box, each CPU has 6 cores => biggest regression is for
> 24 warehouses.

IVB-EP: 2 node, 10 cores, 2 thread per core:

tip/master+origin/master:

Warehouses Thrput
4 196781
8 358064
12 511318
16 589251
20 656123
24 710789
28 765426
32 787059
36 777899
* 40 748568

Throughput 18258

Warehouses Thrput
4 201598
8 363470
12 512968
16 584289
20 605299
24 720142
28 776066
32 791263
36 776965
* 40 760572

Throughput 18551

tip/master+origin/master-a43455a1d57

SPEC scores
Warehouses Thrput
4 198667
8 362481
12 503344
16 582602
20 647688
24 731639
28 786135
32 794124
36 774567
* 40 757559

Throughput 18477

Given that there's fairly large variance between the two runs with the
commit in, I'm not sure I can say there's a problem here.

The one run without the patch is more or less between the two runs with
the patch.

And doing this many runs takes ages, so I'm not tempted to either make
the runs longer or do more of them.

Lemme try on a 4 node box though, who knows.

Attachments:

(No filename) (3.55 kB)
(No filename) (836.00 B)
Download all attachments

2014-07-31 16:39:32

by Jirka Hladky

[permalink] [raw]

Subject: Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

On 07/31/2014 06:27 PM, Peter Zijlstra wrote:
> On Thu, Jul 31, 2014 at 06:16:26PM +0200, Jirka Hladky wrote:
>> On 07/31/2014 05:57 PM, Peter Zijlstra wrote:
>>> On Thu, Jul 31, 2014 at 12:42:41PM +0200, Peter Zijlstra wrote:
>>>> On Tue, Jul 29, 2014 at 02:39:40AM -0400, Rik van Riel wrote:
>>>>> On Tue, 29 Jul 2014 13:24:05 +0800
>>>>> Aaron Lu <[email protected]> wrote:
>>>>>
>>>>>> FYI, we noticed the below changes on
>>>>>>
>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
>>>>>> commit a43455a1d572daf7b730fe12eb747d1e17411365 ("sched/numa: Ensure task_numa_migrate() checks the preferred node")
>>>>>>
>>>>>> ebe06187bf2aec1 a43455a1d572daf7b730fe12e
>>>>>> --------------- -------------------------
>>>>>> 94500 ~ 3% +115.6% 203711 ~ 6% ivb42/hackbench/50%-threads-pipe
>>>>>> 67745 ~ 4% +64.1% 111174 ~ 5% lkp-snb01/hackbench/50%-threads-socket
>>>>>> 162245 ~ 3% +94.1% 314885 ~ 6% TOTAL proc-vmstat.numa_hint_faults_local
>>>>> Hi Aaron,
>>>>>
>>>>> Jirka Hladky has reported a regression with that changeset as
>>>>> well, and I have already spent some time debugging the issue.
>>>> Let me see if I can still find my SPECjbb2005 copy to see what that
>>>> does.
>>> Jirka, what kind of setup were you seeing SPECjbb regressions?
>>>
>>> I'm not seeing any on 2 sockets with a single SPECjbb instance, I'll go
>>> check one instance per socket now.
>>>
>>>
>> Peter, I'm seeing regressions for
>>
>> SINGLE SPECjbb instance for number of warehouses being the same as total
>> number of cores in the box.
>>
>> Example: 4 NUMA node box, each CPU has 6 cores => biggest regression is for
>> 24 warehouses.
> IVB-EP: 2 node, 10 cores, 2 thread per core:
>
> tip/master+origin/master:
>
> Warehouses Thrput
> 4 196781
> 8 358064
> 12 511318
> 16 589251
> 20 656123
> 24 710789
> 28 765426
> 32 787059
> 36 777899
> * 40 748568
>
> Throughput 18258
>
> Warehouses Thrput
> 4 201598
> 8 363470
> 12 512968
> 16 584289
> 20 605299
> 24 720142
> 28 776066
> 32 791263
> 36 776965
> * 40 760572
>
> Throughput 18551
>
>
> tip/master+origin/master-a43455a1d57
>
> SPEC scores
> Warehouses Thrput
> 4 198667
> 8 362481
> 12 503344
> 16 582602
> 20 647688
> 24 731639
> 28 786135
> 32 794124
> 36 774567
> * 40 757559
>
> Throughput 18477
>
>
> Given that there's fairly large variance between the two runs with the
> commit in, I'm not sure I can say there's a problem here.
>
> The one run without the patch is more or less between the two runs with
> the patch.
>
> And doing this many runs takes ages, so I'm not tempted to either make
> the runs longer or do more of them.
>
> Lemme try on a 4 node box though, who knows.

IVB-EP: 2 node, 10 cores, 2 thread per core
=> on such system, I run only 20 warenhouses as maximum. (number of
nodes * number of PHYSICAL cores)

The kernels you have tested shows following results:
656123/605299/647688

I'm doing 3 iterations (3 runs) to get some statistics. To speed up the
test significantly please do the run with 20 warehouses only
(or in general with #warehouses == number of nodes * number of PHYSICAL
cores)

Jirka

2014-07-31 17:37:31

[permalink] [raw]

Subject: Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

On Thu, Jul 31, 2014 at 06:39:05PM +0200, Jirka Hladky wrote:
> I'm doing 3 iterations (3 runs) to get some statistics. To speed up the test
> significantly please do the run with 20 warehouses only
> (or in general with #warehouses == number of nodes * number of PHYSICAL
> cores)

Yeah, went and did that for my 4 node machine, its got a ton more cores, but I
matches the warehouses to it:

-a43455a1d57 tip/master

979996.47 1144715.44
876146 1098499.07
1058974.18 1019499.38
1055951.59 1139405.22
970504.01 1099659.09

988314.45 1100355.64 (avg)
75059.546179565 50085.7473975167(stdev)

So for 5 runs, tip/master (which includes the offending patch) wins hands down.

Each run is 2 minutes.

Attachments:

(No filename) (697.00 B)
(No filename) (836.00 B)
Download all attachments

2014-08-01 00:18:35

by Davidlohr Bueso

[permalink] [raw]

Subject: Re: [LKP] [sched/numa] a43455a1d57: +94.1% proc-vmstat.numa_hint_faults_local

On Thu, 2014-07-31 at 12:42 +0200, Peter Zijlstra wrote:
> On Tue, Jul 29, 2014 at 02:39:40AM -0400, Rik van Riel wrote:
> > On Tue, 29 Jul 2014 13:24:05 +0800
> > Aaron Lu <[email protected]> wrote:
> >
> > > FYI, we noticed the below changes on
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
> > > commit a43455a1d572daf7b730fe12eb747d1e17411365 ("sched/numa: Ensure task_numa_migrate() checks the preferred node")
> > >
> > > ebe06187bf2aec1 a43455a1d572daf7b730fe12e
> > > --------------- -------------------------
> > > 94500 ~ 3% +115.6% 203711 ~ 6% ivb42/hackbench/50%-threads-pipe
> > > 67745 ~ 4% +64.1% 111174 ~ 5% lkp-snb01/hackbench/50%-threads-socket
> > > 162245 ~ 3% +94.1% 314885 ~ 6% TOTAL proc-vmstat.numa_hint_faults_local
> >
> > Hi Aaron,
> >
> > Jirka Hladky has reported a regression with that changeset as
> > well, and I have already spent some time debugging the issue.
>
> So assuming those numbers above are the difference in
> numa_hint_local_faults, the report is actually a significant
> _improvement_, not a regression.
>
> On my IVB-EP I get similar numbers; using:
>
> PRE=`grep numa_hint_faults_local /proc/vmstat | cut -d' ' -f2`
> perf bench sched messaging -g 24 -t -p -l 60000
> POST=`grep numa_hint_faults_local /proc/vmstat | cut -d' ' -f2`
> echo $((POST-PRE))
>
>
> tip/mater+origin/master tip/master+origin/master-a43455a1d57
>
> local total local total
> faults time faults time
>
> 19971 51.384 10104 50.838
> 17193 50.564 9116 50.208
> 13435 49.057 8332 51.344
> 23794 50.795 9954 51.364
> 20255 49.463 9598 51.258
>
> 18929.6 50.2526 9420.8 51.0024
> 3863.61 0.96 717.78 0.49
>
> So that patch improves both local faults and runtime. Its good (even
> though for the runtime we're still inside stdev overlap, so ideally I'd
> do more runs).
>
>
> Now I also did a run with the proposed patch, NUMA_SCALE/8 variant, and
> that slightly reduces both again:
>
> tip/master+origin/master+patch
>
> local total
> faults time
>
> 21296 50.541
> 12771 50.54
> 13872 52.224
> 23352 50.85
> 16516 50.705
>
> 17561.4 50.972
> 4613.32 0.71
>
> So for hackbench a43455a1d57 is good and the proposed patch is making
> things worse.

It also seems to be the case on a 8-socket 80 core DL980:

tip/master baseline:
67276 169.590 [sec]
82400 188.406 [sec]
87827 201.122 [sec]
96659 228.243 [sec]
83180 192.422 [sec]

tip/master + a43455a1d57 reverted
36686 170.373 [sec]
52670 187.904 [sec]
55723 203.597 [sec]
41780 174.354 [sec]
36070 173.179 [sec]

Runtimes are pretty much all over the place, cannot really say if it's
gotten slower or faster. However, on avg, we nearly double the amount of
hint local faults with the commit in question.

After adding the proposed fix (NUMA_SCALE/8 variant), it goes down
again, closer to without a43455a1d57"

tip/master + patch
50591 175.272 [sec]
57858 191.969 [sec]
77564 215.429 [sec]
50613 179.384 [sec]
61673 201.694 [sec]

> Let me see if I can still find my SPECjbb2005 copy to see what that
> does.

I'll try to dig it up as well.

2014-08-01 02:03:36