Hi Rik,
we observe a tremendous regression between kernel version 3.16 and 3.17
(and up), and I've bisected it to this commit:
a43455a sched/numa: Ensure task_numa_migrate() checks the preferred node
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a43455a1d572daf7b730fe12eb747d1e17411365
We run a Web server (nginx) on a 2-socket Haswell server and we emulate
an e-Commerce Web-site. Clients send requests to the server and measure
the response time. Clients load the server quite heavily - CPU
utilization is more than 90% as measured with turbostat. We use Fedora
20.
If I take 3.17 and revert this patch, I observe 600% or more average
response time improvement compared to vanilla 3.17.
If I take 4.1-rc1 and revert this patch, I observe 300% or more average
response time improvement compared to vanilla 3.17.
I asked Fengguang Wu to run LKP workloads on multiple 4 and 8 socket
machines for v4.1-rc1 with and without this patch, and there seems to be
no difference - all the micro-benchmarks performed similarly and the
differences were within the error range.
IOW, it looks like this patch has a bad effect on Web server QoS (slower
response time). What do you think?
Thank you!
On Wed, 2015-05-06 at 13:35 +0300, Artem Bityutskiy wrote:
> If I take 4.1-rc1 and revert this patch, I observe 300% or more average
> response time improvement compared to vanilla 3.17.
Sorry, I meant vanilla 4.1-rc1 here, not 3.17.
--
Best Regards,
Artem Bityutskiy
CC'ing Peter & Mel. Leaving Artem's email intact so
they can read it :)
On 05/06/2015 06:35 AM, Artem Bityutskiy wrote:
> Hi Rik,
>
> we observe a tremendous regression between kernel version 3.16 and 3.17
> (and up), and I've bisected it to this commit:
>
> a43455a sched/numa: Ensure task_numa_migrate() checks the preferred node
>
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a43455a1d572daf7b730fe12eb747d1e17411365
>
> We run a Web server (nginx) on a 2-socket Haswell server and we emulate
> an e-Commerce Web-site. Clients send requests to the server and measure
> the response time. Clients load the server quite heavily - CPU
> utilization is more than 90% as measured with turbostat. We use Fedora
> 20.
>
> If I take 3.17 and revert this patch, I observe 600% or more average
> response time improvement compared to vanilla 3.17.
>
> If I take 4.1-rc1 and revert this patch, I observe 300% or more average
> response time improvement compared to vanilla 3.17.
>
> I asked Fengguang Wu to run LKP workloads on multiple 4 and 8 socket
> machines for v4.1-rc1 with and without this patch, and there seems to be
> no difference - all the micro-benchmarks performed similarly and the
> differences were within the error range.
>
> IOW, it looks like this patch has a bad effect on Web server QoS (slower
> response time). What do you think?
The changeset you found fixes the issue where both
nodes A and B are fully loaded (or overloaded), and
tasks are located on the wrong node.
Without that changeset, workloads in that situation
will never converge, because we do not consider the
best node for a task.
I have seen that changeset cause another regression
in the past, but on a much less heavily loaded
system with around 20-50% CPU utilization: with a
single-process, multi-threaded workload, it causes
the workload to not be properly spread out across
the system.
I wonder if we should try a changeset where the
NUMA balancing code never considers moving a task
from a less busy to a busier node, regardless
of whether the destination node is the task's
preferred node or some other node.
I can cook up a quick patch to test that out.
Any opinions Peter or Mel?
On Wed, 06 May 2015 13:35:30 +0300
Artem Bityutskiy <[email protected]> wrote:
> we observe a tremendous regression between kernel version 3.16 and 3.17
> (and up), and I've bisected it to this commit:
>
> a43455a sched/numa: Ensure task_numa_migrate() checks the preferred node
Artem, Jirka, does this patch fix (or at least improve) the issues you
have been seeing? Does it introduce any new regressions?
Peter, Mel, I think it may be time to stop waiting for the impedance
mismatch between the load balancer and NUMA balancing to be resolved,
and try to just avoid the issue in the NUMA balancing code...
----8<----
Subject: numa,sched: only consider less busy nodes as numa balancing destination
Changeset a43455a1 ("sched/numa: Ensure task_numa_migrate() checks the
preferred node") fixes an issue where workloads would never converge
on a fully loaded (or overloaded) system.
However, it introduces a regression on less than fully loaded systems,
where workloads converge on a few NUMA nodes, instead of properly staying
spread out across the whole system. This leads to a reduction in available
memory bandwidth, and usable CPU cache, with predictable performance problems.
The root cause appears to be an interaction between the load balancer and
NUMA balancing, where the short term load represented by the load balancer
differs from the long term load the NUMA balancing code would like to base
its decisions on.
Simply reverting a43455a1 would re-introduce the non-convergence of
workloads on fully loaded systems, so that is not a good option. As
an aside, the check done before a43455a1 only applied to a task's
preferred node, not to other candidate nodes in the system, so the
converge-on-too-few-nodes problem still happens, just to a lesser
degree.
Instead, try to compensate for the impedance mismatch between the
load balancer and NUMA balancing by only ever considering a lesser
loaded node as a destination for NUMA balancing, regardless of
whether the task is trying to move to the preferred node, or to another
node.
Signed-off-by: Rik van Riel <[email protected]>
Reported-by: Artem Bityutskiy <[email protected]>
Reported-by: Jirka Hladky <[email protected]>
---
kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ffeaa4105e48..480e6a35ab35 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1409,6 +1409,30 @@ static void task_numa_find_cpu(struct task_numa_env *env,
}
}
+/* Only move tasks to a NUMA node less busy than the current node. */
+static bool numa_has_capacity(struct task_numa_env *env)
+{
+ struct numa_stats *src = &env->src_stats;
+ struct numa_stats *dst = &env->dst_stats;
+
+ if (src->has_free_capacity && !dst->has_free_capacity)
+ return false;
+
+ /*
+ * Only consider a task move if the source has a higher load
+ * than the destination, corrected for CPU capacity on each node.
+ *
+ * src->load dst->load
+ * --------------------- vs ---------------------
+ * src->compute_capacity dst->compute_capacity
+ */
+ if (src->load * dst->compute_capacity >
+ dst->load * src->compute_capacity)
+ return true;
+
+ return false;
+}
+
static int task_numa_migrate(struct task_struct *p)
{
struct task_numa_env env = {
@@ -1463,7 +1487,8 @@ static int task_numa_migrate(struct task_struct *p)
update_numa_stats(&env.dst_stats, env.dst_nid);
/* Try to find a spot on the preferred nid. */
- task_numa_find_cpu(&env, taskimp, groupimp);
+ if (numa_has_capacity(&env))
+ task_numa_find_cpu(&env, taskimp, groupimp);
/*
* Look at other nodes in these cases:
@@ -1494,7 +1519,8 @@ static int task_numa_migrate(struct task_struct *p)
env.dist = dist;
env.dst_nid = nid;
update_numa_stats(&env.dst_stats, env.dst_nid);
- task_numa_find_cpu(&env, taskimp, groupimp);
+ if (numa_has_capacity(&env))
+ task_numa_find_cpu(&env, taskimp, groupimp);
}
}
On Wed, May 06, 2015 at 11:41:28AM -0400, Rik van Riel wrote:
> Peter, Mel, I think it may be time to stop waiting for the impedance
> mismatch between the load balancer and NUMA balancing to be resolved,
> and try to just avoid the issue in the NUMA balancing code...
That's a wee bit unfair since we 'all' decided to let the numa thing
rest for a while. So obviously that issue didn't get resolved.
> kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
> 1 file changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ffeaa4105e48..480e6a35ab35 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1409,6 +1409,30 @@ static void task_numa_find_cpu(struct task_numa_env *env,
> }
> }
>
> +/* Only move tasks to a NUMA node less busy than the current node. */
> +static bool numa_has_capacity(struct task_numa_env *env)
> +{
> + struct numa_stats *src = &env->src_stats;
> + struct numa_stats *dst = &env->dst_stats;
> +
> + if (src->has_free_capacity && !dst->has_free_capacity)
> + return false;
> +
> + /*
> + * Only consider a task move if the source has a higher load
> + * than the destination, corrected for CPU capacity on each node.
> + *
> + * src->load dst->load
> + * --------------------- vs ---------------------
> + * src->compute_capacity dst->compute_capacity
> + */
> + if (src->load * dst->compute_capacity >
> + dst->load * src->compute_capacity)
> + return true;
> +
> + return false;
> +}
> +
> static int task_numa_migrate(struct task_struct *p)
> {
> struct task_numa_env env = {
> @@ -1463,7 +1487,8 @@ static int task_numa_migrate(struct task_struct *p)
> update_numa_stats(&env.dst_stats, env.dst_nid);
>
> /* Try to find a spot on the preferred nid. */
> - task_numa_find_cpu(&env, taskimp, groupimp);
> + if (numa_has_capacity(&env))
> + task_numa_find_cpu(&env, taskimp, groupimp);
>
> /*
> * Look at other nodes in these cases:
> @@ -1494,7 +1519,8 @@ static int task_numa_migrate(struct task_struct *p)
> env.dist = dist;
> env.dst_nid = nid;
> update_numa_stats(&env.dst_stats, env.dst_nid);
> - task_numa_find_cpu(&env, taskimp, groupimp);
> + if (numa_has_capacity(&env))
> + task_numa_find_cpu(&env, taskimp, groupimp);
> }
> }
Does this not 'duplicate' the logic that we tried for with
task_numa_compare():balance section? That is where we try to avoid
making a decision that the regular load-balancer will dislike and undo.
Alternatively; you can view that as a cpu guard and the proposed as a
node guard, in which case, should it not live inside
task_numa_find_cpu()? Instead of guarding all call sites.
In any case, should we mix a bit of imbalance_pct in there?
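Something like the below, perhaps -- a completely untested sketch on top of
your numa_has_capacity(), reusing the imbalance_pct that task_numa_env
already carries, so the NUMA side tolerates roughly the same imbalance the
load balancer does:

static bool numa_has_capacity(struct task_numa_env *env)
{
        struct numa_stats *src = &env->src_stats;
        struct numa_stats *dst = &env->dst_stats;

        if (src->has_free_capacity && !dst->has_free_capacity)
                return false;

        /*
         * Allow the move as long as the destination's capacity-corrected
         * load does not exceed the source's by more than imbalance_pct
         * allows, mirroring the slack the load balancer itself uses.
         */
        if (src->load * dst->compute_capacity * env->imbalance_pct >
            dst->load * src->compute_capacity * 100)
                return true;

        return false;
}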
/me goes ponder this a bit further..
On 05/06/2015 01:00 PM, Peter Zijlstra wrote:
> On Wed, May 06, 2015 at 11:41:28AM -0400, Rik van Riel wrote:
>
>> Peter, Mel, I think it may be time to stop waiting for the impedance
>> mismatch between the load balancer and NUMA balancing to be resolved,
>> and try to just avoid the issue in the NUMA balancing code...
>
> That's a wee bit unfair since we 'all' decided to let the numa thing
> rest for a while. So obviously that issue didn't get resolved.
I'm not blaming anyone, I know I was involved in the decision
to let the NUMA code rest for a while, too.
After a year of just sitting there, this is the only big bug
affecting the NUMA balancing code that I have heard about.
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ffeaa4105e48..480e6a35ab35 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1409,6 +1409,30 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>> }
>> }
>>
>> +/* Only move tasks to a NUMA node less busy than the current node. */
>> +static bool numa_has_capacity(struct task_numa_env *env)
>> +{
>> + struct numa_stats *src = &env->src_stats;
>> + struct numa_stats *dst = &env->dst_stats;
>> +
>> + if (src->has_free_capacity && !dst->has_free_capacity)
>> + return false;
>> +
>> + /*
>> + * Only consider a task move if the source has a higher load
>> + * than the destination, corrected for CPU capacity on each node.
>> + *
>> + * src->load dst->load
>> + * --------------------- vs ---------------------
>> + * src->compute_capacity dst->compute_capacity
>> + */
>> + if (src->load * dst->compute_capacity >
>> + dst->load * src->compute_capacity)
>> + return true;
>> +
>> + return false;
>> +}
>> +
>> static int task_numa_migrate(struct task_struct *p)
>> {
>> struct task_numa_env env = {
>> @@ -1463,7 +1487,8 @@ static int task_numa_migrate(struct task_struct *p)
>> update_numa_stats(&env.dst_stats, env.dst_nid);
>>
>> /* Try to find a spot on the preferred nid. */
>> - task_numa_find_cpu(&env, taskimp, groupimp);
>> + if (numa_has_capacity(&env))
>> + task_numa_find_cpu(&env, taskimp, groupimp);
>>
>> /*
>> * Look at other nodes in these cases:
>> @@ -1494,7 +1519,8 @@ static int task_numa_migrate(struct task_struct *p)
>> env.dist = dist;
>> env.dst_nid = nid;
>> update_numa_stats(&env.dst_stats, env.dst_nid);
>> - task_numa_find_cpu(&env, taskimp, groupimp);
>> + if (numa_has_capacity(&env))
>> + task_numa_find_cpu(&env, taskimp, groupimp);
>> }
>> }
>
> Does this not 'duplicate' the logic that we tried for with
> task_numa_compare():balance section? That is where we try to avoid
> making a decision that the regular load-balancer will dislike and undo.
>
> Alternatively; you can view that as a cpu guard and the proposed as a
> node guard, in which case, should it not live inside
> task_numa_find_cpu()? Instead of guarding all call sites.
>
> In any case, should we mix a bit of imbalance_pct in there?
>
> /me goes ponder this a bit further..
Yes, there is some duplication between this code and the
logic in task_numa_compare().
At this point I am not sure how to resolve that; I am
interested in seeing whether this patch solves the issue
reported by Artem and Jirka. If it does, we can think
about cleanups.
On Wed, 2015-05-06 at 11:41 -0400, Rik van Riel wrote:
> On Wed, 06 May 2015 13:35:30 +0300
> Artem Bityutskiy <[email protected]> wrote:
>
> > we observe a tremendous regression between kernel version 3.16 and 3.17
> > (and up), and I've bisected it to this commit:
> >
> > a43455a sched/numa: Ensure task_numa_migrate() checks the preferred node
>
> Artem, Jirka, does this patch fix (or at least improve) the issues you
> have been seeing? Does it introduce any new regressions?
>
> Peter, Mel, I think it may be time to stop waiting for the impedance
> mismatch between the load balancer and NUMA balancing to be resolved,
> and try to just avoid the issue in the NUMA balancing code...
I'll give it a try as soon as I can and report back, thanks!
On Wed, 2015-05-06 at 11:41 -0400, Rik van Riel wrote:
> On Wed, 06 May 2015 13:35:30 +0300
> Artem Bityutskiy <[email protected]> wrote:
>
> > we observe a tremendous regression between kernel version 3.16 and 3.17
> > (and up), and I've bisected it to this commit:
> >
> > a43455a sched/numa: Ensure task_numa_migrate() checks the preferred node
>
> Artem, Jirka, does this patch fix (or at least improve) the issues you
> have been seeing? Does it introduce any new regressions?
Hi Rik,
first of all thanks for your help!
I've tried this patch and it has a very small effect. I've also run the
benchmark with auto-NUMA disabled, which is useful, I think. I used
the tip of Linus' tree (v4.1-rc2+).
Kernel Avg response time, ms
------------------------------------------------------
Vanilla 1481
Patched 1240
Reverted 256
Disabled 309
Vanilla: pristine v4.1-rc2+
Patched: Vanilla + this patch
Reverted: Vanilla + a revert of a43455a
Disabled: Vanilla and auto-NUMA disabled via procfs
I ran the benchmark for 1 hour for every configuration this time. I
cannot say for sure what the deviation is right now, but I think it is tens of
milliseconds, so disabled vs reverted _may_ be within the error range,
but I need to do more experiments.
So this patch dropped the average Web server response time from
about 1.4 seconds to about 1.2 seconds, which isn't a bad improvement,
but it is far less than what we get when reverting that patch.
Artem.
On 05/08/2015 09:13 AM, Artem Bityutskiy wrote:
> On Wed, 2015-05-06 at 11:41 -0400, Rik van Riel wrote:
>> On Wed, 06 May 2015 13:35:30 +0300
>> Artem Bityutskiy <[email protected]> wrote:
>>
>>> we observe a tremendous regression between kernel version 3.16 and 3.17
>>> (and up), and I've bisected it to this commit:
>>>
>>> a43455a sched/numa: Ensure task_numa_migrate() checks the preferred node
>>
>> Artem, Jirka, does this patch fix (or at least improve) the issues you
>> have been seeing? Does it introduce any new regressions?
>
> Hi Rik,
>
> first of all thanks for your help!
>
> I've tried this patch and it has a very small effect. I've also run the
> benchmark with auto-NUMA disabled, which is useful, I think. I used
> the tip of Linus' tree (v4.1-rc2+).
Trying with NUMA balancing disabled was extremely useful!
I now have an idea what is going on with your workload.
I suspect Peter and Mel aren't going to like it...
> Kernel Avg response time, ms
> ------------------------------------------------------
> Vanilla 1481
> Patched 1240
> Reverted 256
> Disabled 309
>
>
> Vanilla: pristine v4.1-rc2+
> Patched: Vanilla + this patch
> Reverted: Vanilla + a revert of a43455a
> Disabled: Vanilla and auto-NUMA disabled via procfs
My hypothesis: the NUMA code moving tasks at all is what is
hurting your workload. On a two-node system, you only have
the current node the task is on, and the task's preferred
node, which may or may not be the same node.
In case the preferred node is different, with the patch
reverted the kernel would only try to move a task to the
preferred node if that node was running fewer tasks than
it has CPU cores. It would never attempt task swaps, or
anything else.
With both the vanilla kernel, and with my new patch, the
NUMA balancing code will try to move a task to a better
location (from a NUMA point of view).
This works well when dealing with tasks that are constantly
running, but fails catastrophically when dealing with tasks
that go to sleep, wake back up, go back to sleep, wake back
up, and generally mess up the load statistics that the NUMA
balancing code uses in a random way.
If the normal scheduler load balancer is moving tasks the
opposite way from where the NUMA balancer is moving them, things will
not converge, and tasks will have worse memory locality
than not doing NUMA balancing at all.
Currently the load balancer has a preference for moving
tasks to their preferred nodes (NUMA_FAVOUR_HIGHER, true),
but there is no resistance to moving tasks away from their
preferred nodes (NUMA_RESIST_LOWER, false). That setting
was arrived at after a fair amount of experimenting, and
is probably correct.
I am still curious what my current patch does for Jirka's
workload (if anything). I have no idea whether his workload
suffers from similar issues as Artem's workload, or whether
they perform relatively poorly for different reasons.
END CONCLUSION
BEGIN RAMBLING UNFORMED IDEA
I do not have a solid idea in my mind on how to solve the
problem above, but I have some poorly formed ideas...
1) It may be worthwhile for the load balancer to keep track of
how many times it moves a task to a NUMA node where it has
worse locality, in order to give it CPU time now.
2) The NUMA balancing code, in turn, may resist/skip moving
tasks to nodes with better NUMA locality, when the load balancer
has moved that task away in the past, and is likely to move it
away again.
3) The statistic from (1) could be a floating average of
se.statistics.nr_forced_migrations, which would require some
modifications to migrate_degrades_locality() and can_migrate_task()
to do the evaluation even when they do not factor it into their
decisions. A rough sketch of such a floating average follows after this list.
4) I am not sure yet how to weigh that floating average against
the NUMA locality. Should the floating average of forced
migrations only block NUMA locality when it is large, and when
the difference in NUMA locality score between nodes is small?
How do we weigh these things?
5) What am I forgetting / overlooking?
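To make idea (3) a bit more concrete, the floating average could be
something as simple as the sketch below. The helper, the field name and
the 3/4-vs-1/4 weights are all invented for illustration; nothing like
this exists in the patch above.

/*
 * Hypothetical sketch for idea (3): keep a decaying average of how often
 * the load balancer force-migrated this task.  Old history decays away,
 * new forced migrations push the average up.  Field and weights are
 * made up for illustration only.
 */
static void update_forced_migration_avg(struct task_struct *p, bool forced)
{
        unsigned long *avg = &p->se.statistics.forced_migration_avg;

        /* keep 3/4 of the old history, mix in the new sample scaled to 1024 */
        *avg = (*avg * 3 + (forced ? 1024 : 0)) / 4;
}

The NUMA balancing code could then weigh that average against the NUMA
locality difference between nodes, along the lines of idea (4).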
--
All rights reversed
On 05/08/2015 04:03 PM, Rik van Riel wrote:
> If the normal scheduler load balancer is moving tasks the
> opposite way from where the NUMA balancer is moving them, things will
> not converge, and tasks will have worse memory locality
> than not doing NUMA balancing at all.
>
> Currently the load balancer has a preference for moving
> tasks to their preferred nodes (NUMA_FAVOUR_HIGHER, true),
> but there is no resistance to moving tasks away from their
> preferred nodes (NUMA_RESIST_LOWER, false). That setting
> was arrived at after a fair amount of experimenting, and
> is probably correct.
Never mind that. After reading the code several times after
that earlier post, it looks like having NUMA_FAVOUR_HIGHER
enabled does absolutely nothing without also having
NUMA_RESIST_LOWER enabled, at least not for idle balancing.
At first glance, this code looks correct, and even useful:
/*
* Aggressive migration if:
* 1) destination numa is preferred
* 2) task is cache cold, or
* 3) too many balance attempts have failed.
*/
tsk_cache_hot = task_hot(p, env);
if (!tsk_cache_hot)
tsk_cache_hot = migrate_degrades_locality(p, env);
if (migrate_improves_locality(p, env) || !tsk_cache_hot ||
env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
if (tsk_cache_hot) {
schedstat_inc(env->sd, lb_hot_gained[env->idle]);
schedstat_inc(p,
se.statistics.nr_forced_migrations);
}
return 1;
}
However, with NUMA_RESIST_LOWER disabled (default),
migrate_degrades_locality always returns 0.
Furthermore, sched_migration_cost_ns, which influences task_hot,
is set so small (0.5 ms) that task_hot is likely to always return
false for workloads with frequent sleeps and network latencies,
like a web workload...
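The guard at the top of migrate_degrades_locality() is roughly the
following (paraphrasing from memory, not an exact quote of the source):

        /*
         * Paraphrased: unless both the NUMA and NUMA_RESIST_LOWER feature
         * bits are set, bail out before even looking at the fault counts.
         */
        if (!sched_feat(NUMA) || !sched_feat(NUMA_RESIST_LOWER) ||
            !p->numa_faults || !(env->sd->flags & SD_NUMA))
                return 0;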
In other words, the idle balancing code will treat tasks moving
towards their preferred NUMA node the same as tasks moving away
from their preferred NUMA node. It will move tasks regardless of
NUMA affinity, and can end up in a big fight with the NUMA
balancing code, as you have observed.
I am not sure what to do about this.
Peter?
--
All rights reversed
On Fri, 2015-05-08 at 16:03 -0400, Rik van Riel wrote:
> This works well when dealing with tasks that are constantly
> running, but fails catastrophically when dealing with tasks
> that go to sleep, wake back up, go back to sleep, wake back
> up, and generally mess up the load statistics that the NUMA
> balancing code uses in a random way.
Sleeping is, I believe, what happens a lot in this workload: processes do
a lot of network I/O, file I/O too, and a lot of IPC.
Would you please expand on this a bit more - why would this scenario
"mess up load statistics"?
> If the normal scheduler load balancer is moving tasks the
> opposite way from where the NUMA balancer is moving them, things will
> not converge, and tasks will have worse memory locality
> than not doing NUMA balancing at all.
Are the regular and NUMA balancers independent?
Are there mechanisms to detect ping-pong situations? I'd like to verify
your theory, and these kinds of mechanisms would be helpful.
> Currently the load balancer has a preference for moving
> tasks to their preferred nodes (NUMA_FAVOUR_HIGHER, true),
> but there is no resistance to moving tasks away from their
> preferred nodes (NUMA_RESIST_LOWER, false). That setting
> was arrived at after a fair amount of experimenting, and
> is probably correct.
I guess I can try setting NUMA_RESIST_LOWER to true and see what
happens. But probably first I need to confirm that your theory
(balancers playing ping-pong) is correct; any hints on how I would do
this?
Thanks!
Artem.
Hi Rik,
we have results for SPECjbb2005 and Linpack&Stream benchmarks with
4.1.0-0.rc1.git0.1.el7.x86_64 (without patch)
4.1.0-0.rc2.git0.3.el7.x86_64 with your patch
4.1.0-0.rc2.git0.3.el7.x86_64 with your patch and AUTONUMA disabled
The tests have been conducted on 3 different systems with 4 NUMA nodes
and different versions of Intel processors and different amounts of RAM.
For the SPECjbb benchmark we see
- with your latest proposed patch applied
* gains in the range of 7-15% !! for single instance SPECjbb (tested on a
variety of systems, biggest gains on the brickland system, gains are growing
with the growing number of threads)
* for the multi-instance SPECjbb run (4 parallel jobs on a 4 NUMA node
system) no change in results
* for linpack no change
* for the stream bench slight improvements (but very close to the error margin)
- with AUTONUMA disabled
* with SPECjbb (both single and 4 parallel jobs) performance drops to
1/2 of the performance with AUTONUMA enabled
* for linpack and stream, performance drops by 30% compared with
AUTONUMA enabled
In summary:
* the proposed patch improves performance for the single process SPECjbb
bench without hurting anything
* with AUTONUMA disabled, the performance drop is huge
Please let me know if you need more details.
Thanks
Jirka
On 05/06/2015 05:41 PM, Rik van Riel wrote:
> On Wed, 06 May 2015 13:35:30 +0300
> Artem Bityutskiy <[email protected]> wrote:
>
>> we observe a tremendous regression between kernel version 3.16 and 3.17
>> (and up), and I've bisected it to this commit:
>>
>> a43455a sched/numa: Ensure task_numa_migrate() checks the preferred node
> Artem, Jirka, does this patch fix (or at least improve) the issues you
> have been seeing? Does it introduce any new regressions?
>
> Peter, Mel, I think it may be time to stop waiting for the impedance
> mismatch between the load balancer and NUMA balancing to be resolved,
> and try to just avoid the issue in the NUMA balancing code...
>
> ----8<----
>
> Subject: numa,sched: only consider less busy nodes as numa balancing destination
>
> Changeset a43455a1 ("sched/numa: Ensure task_numa_migrate() checks the
> preferred node") fixes an issue where workloads would never converge
> on a fully loaded (or overloaded) system.
>
> However, it introduces a regression on less than fully loaded systems,
> where workloads converge on a few NUMA nodes, instead of properly staying
> spread out across the whole system. This leads to a reduction in available
> memory bandwidth, and usable CPU cache, with predictable performance problems.
>
> The root cause appears to be an interaction between the load balancer and
> NUMA balancing, where the short term load represented by the load balancer
> differs from the long term load the NUMA balancing code would like to base
> its decisions on.
>
> Simply reverting a43455a1 would re-introduce the non-convergence of
> workloads on fully loaded systems, so that is not a good option. As
> an aside, the check done before a43455a1 only applied to a task's
> preferred node, not to other candidate nodes in the system, so the
> converge-on-too-few-nodes problem still happens, just to a lesser
> degree.
>
> Instead, try to compensate for the impedance mismatch between the
> load balancer and NUMA balancing by only ever considering a lesser
> loaded node as a destination for NUMA balancing, regardless of
> whether the task is trying to move to the preferred node, or to another
> node.
>
> Signed-off-by: Rik van Riel <[email protected]>
> Reported-by: Artem Bityutskiy <[email protected]>
> Reported-by: Jirka Hladky <[email protected]>
> ---
> kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
> 1 file changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ffeaa4105e48..480e6a35ab35 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1409,6 +1409,30 @@ static void task_numa_find_cpu(struct task_numa_env *env,
> }
> }
>
> +/* Only move tasks to a NUMA node less busy than the current node. */
> +static bool numa_has_capacity(struct task_numa_env *env)
> +{
> + struct numa_stats *src = &env->src_stats;
> + struct numa_stats *dst = &env->dst_stats;
> +
> + if (src->has_free_capacity && !dst->has_free_capacity)
> + return false;
> +
> + /*
> + * Only consider a task move if the source has a higher load
> + * than the destination, corrected for CPU capacity on each node.
> + *
> + * src->load dst->load
> + * --------------------- vs ---------------------
> + * src->compute_capacity dst->compute_capacity
> + */
> + if (src->load * dst->compute_capacity >
> + dst->load * src->compute_capacity)
> + return true;
> +
> + return false;
> +}
> +
> static int task_numa_migrate(struct task_struct *p)
> {
> struct task_numa_env env = {
> @@ -1463,7 +1487,8 @@ static int task_numa_migrate(struct task_struct *p)
> update_numa_stats(&env.dst_stats, env.dst_nid);
>
> /* Try to find a spot on the preferred nid. */
> - task_numa_find_cpu(&env, taskimp, groupimp);
> + if (numa_has_capacity(&env))
> + task_numa_find_cpu(&env, taskimp, groupimp);
>
> /*
> * Look at other nodes in these cases:
> @@ -1494,7 +1519,8 @@ static int task_numa_migrate(struct task_struct *p)
> env.dist = dist;
> env.dst_nid = nid;
> update_numa_stats(&env.dst_stats, env.dst_nid);
> - task_numa_find_cpu(&env, taskimp, groupimp);
> + if (numa_has_capacity(&env))
> + task_numa_find_cpu(&env, taskimp, groupimp);
> }
> }
>
On 05/11/2015 07:11 AM, Artem Bityutskiy wrote:
> On Fri, 2015-05-08 at 16:03 -0400, Rik van Riel wrote:
>> This works well when dealing with tasks that are constantly
>> running, but fails catastrophically when dealing with tasks
>> that go to sleep, wake back up, go back to sleep, wake back
>> up, and generally mess up the load statistics that the NUMA
>> balancing code uses in a random way.
>
> Sleeping is, I believe, what happens a lot in this workload: processes do
> a lot of network I/O, file I/O too, and a lot of IPC.
>
> Would you please expand on this a bit more - why would this scenario
> "mess up load statistics"?
>
>> If the normal scheduler load balancer is moving tasks the
>> opposite way from where the NUMA balancer is moving them, things will
>> not converge, and tasks will have worse memory locality
>> than not doing NUMA balancing at all.
>
> Are the regular and NUMA balancers independent?
>
> Are there mechanisms to detect ping-pong situations? I'd like to verify
> your theory, and these kinds of mechanisms would be helpful.
>
>> Currently the load balancer has a preference for moving
>> tasks to their preferred nodes (NUMA_FAVOUR_HIGHER, true),
>> but there is no resistance to moving tasks away from their
>> preferred nodes (NUMA_RESIST_LOWER, false). That setting
>> was arrived at after a fair amount of experimenting, and
>> is probably correct.
>
> I guess I can try setting NUMA_RESIST_LOWER to true and see what
> happens. But probably first I need to confirm that your theory
> (balancers playing ping-pong) is correct; any hints on how I would do
> this?
Funny thing, for your workload, the kernel only keeps statistics
on forced migrations when NUMA_RESIST_LOWER is enabled.
The reason is that the tasks on your system probably sleep too
long to hit the task_hot() test most of the time.
/*
* Aggressive migration if:
* 1) destination numa is preferred
* 2) task is cache cold, or
* 3) too many balance attempts have failed.
*/
tsk_cache_hot = task_hot(p, env);
if (!tsk_cache_hot)
tsk_cache_hot = migrate_degrades_locality(p, env);
if (migrate_improves_locality(p, env) || !tsk_cache_hot ||
env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
if (tsk_cache_hot) {
schedstat_inc(env->sd, lb_hot_gained[env->idle]);
schedstat_inc(p,
se.statistics.nr_forced_migrations);
}
return 1;
}
schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
return 0;
I am also not sure where the se.statistics.nr_forced_migrations
statistic is exported.
--
All rights reversed
On 05/11/2015 08:44 AM, Jirka Hladky wrote:
> Hi Rik,
>
> we have results for SPECjbb2005 and Linpack&Stream benchmarks with
>
> 4.1.0-0.rc1.git0.1.el7.x86_64 (without patch)
> 4.1.0-0.rc2.git0.3.el7.x86_64 with your patch
> 4.1.0-0.rc2.git0.3.el7.x86_64 with your patch and AUTONUMA disabled
>
> The tests have been conducted on 3 different systems with 4 NUMA nodes
> and different versions of Intel processors and different amounts of RAM.
>
>
> For the SPECjbb benchmark we see
> - with your latest proposed patch applied
> * gains in the range of 7-15% !! for single instance SPECjbb (tested on a
> variety of systems, biggest gains on the brickland system, gains are growing
> with the growing number of threads)
That is significant.
> * for the multi-instance SPECjbb run (4 parallel jobs on a 4 NUMA node
> system) no change in results
> * for linpack no change
> * for the stream bench slight improvements (but very close to the error margin)
Glad to hear the patch is not causing regressions.
Peter, can you queue up this patch in your sched tree, or
would you like me to make any changes to it first?
--
All rights reversed
On Fri, 2015-05-08 at 16:03 -0400, Rik van Riel wrote:
> Currently the load balancer has a preference for moving
> tasks to their preferred nodes (NUMA_FAVOUR_HIGHER, true),
> but there is no resistance to moving tasks away from their
> preferred nodes (NUMA_RESIST_LOWER, false). That setting
> was arrived at after a fair amount of experimenting, and
> is probably correct.
FYI, (NUMA_RESIST_LOWER, true) does not make any difference for me.
On 05/12/2015 09:50 AM, Artem Bityutskiy wrote:
> On Fri, 2015-05-08 at 16:03 -0400, Rik van Riel wrote:
>> Currently the load balancer has a preference for moving
>> tasks to their preferred nodes (NUMA_FAVOUR_HIGHER, true),
>> but there is no resistance to moving tasks away from their
>> preferred nodes (NUMA_RESIST_LOWER, false). That setting
>> was arrived at after a fair amount of experimenting, and
>> is probably correct.
>
> FYI, (NUMA_RESIST_LOWER, true) does not make any difference for me.
I am not surprised by this.
The idle balancing code will simply take a runnable-but-not-running
task off the run queue of the busiest CPU in the system. On a system
with some idle time, it is likely there are only one or two tasks
available on the run queue of the busiest CPU, which leaves little or
no choice to the NUMA_FAVOUR_HIGHER and NUMA_RESIST_LOWER code.
The idle balancing code, through find_busiest_queue(), already tries
to select a CPU where at least one of the runnable tasks is on the
wrong NUMA node.
However, that task may well be the current task, leading us to steal
the other (runnable but on the queue) task instead, moving that one
to the wrong NUMA node.
I have a few poorly formed ideas on what could be done about that:
1) have fbq_classify_rq take the current task on the rq into account,
and adjust the fbq classification if all the runnable-but-queued
tasks are on the right node
2) ensure that rq->nr_numa_running and rq->nr_preferred_running also
get incremented for kernel threads that are bound to a particular
CPU - currently CPU-bound kernel threads will cause the NUMA
statistics to look like a CPU has tasks that do not belong on that
NUMA node
3) have detach_tasks take env->fbq_type into account when deciding
whether to look at NUMA affinity at all
4) maybe have detach_tasks fail if env->fbq_type is regular or remote,
but no !numa or on-the-wrong-node tasks were found? Not sure if
that would cause problems, or what kind...
--
All rights reversed
On Tue, May 12, 2015 at 11:45:09AM -0400, Rik van Riel wrote:
> I have a few poorly formed ideas on what could be done about that:
>
> 1) have fbq_classify_rq take the current task on the rq into account,
> and adjust the fbq classification if all the runnable-but-queued
> tasks are on the right node
So while looking at this I came up with the below; it treats anything
inside ->active_nodes as a preferred node for balancing purposes.
Would that make sense?
I'll see what I can do about current in the runqueue type
classification.
> 2) ensure that rq->nr_numa_running and rq->nr_preferred_running also
> get incremented for kernel threads that are bound to a particular
> CPU - currently CPU-bound kernel threads will cause the NUMA
> statistics to look like a CPU has tasks that do not belong on that
> NUMA node
I'm thinking accounting those to nr_pinned, lemme see how that works
out.
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 58 ++++++++++++++++++++++++++++++++-------------------
2 files changed, 38 insertions(+), 21 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index cb734861123a..ffebc2e091ad 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1443,6 +1443,7 @@ struct task_struct {
unsigned sched_reset_on_fork:1;
unsigned sched_contributes_to_load:1;
unsigned sched_migrated:1;
+ unsigned sched_preferred:1;
#ifdef CONFIG_MEMCG_KMEM
unsigned memcg_kmem_skip_account:1;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c1510abeefa..d59adb8e8ef4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -856,18 +856,6 @@ static unsigned int task_scan_max(struct task_struct *p)
return max(smin, smax);
}
-static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
-{
- rq->nr_numa_running += (p->numa_preferred_nid != -1);
- rq->nr_preferred_running += (p->numa_preferred_nid == task_node(p));
-}
-
-static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
-{
- rq->nr_numa_running -= (p->numa_preferred_nid != -1);
- rq->nr_preferred_running -= (p->numa_preferred_nid == task_node(p));
-}
-
struct numa_group {
atomic_t refcount;
@@ -887,6 +875,28 @@ struct numa_group {
unsigned long faults[0];
};
+static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+ int node = task_node(p);
+ bool local;
+
+ rq->nr_numa_running += (p->numa_preferred_nid != -1);
+
+ if (p->numa_group)
+ local = node_isset(node, p->numa_group->active_nodes);
+ else
+ local = (p->numa_preferred_nid == node);
+
+ p->sched_preferred = local;
+ rq->nr_preferred_running += local;
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+ rq->nr_numa_running -= (p->numa_preferred_nid != -1);
+ rq->nr_preferred_running -= p->sched_preferred;
+}
+
/* Shared or private faults. */
#define NR_NUMA_HINT_FAULT_TYPES 2
@@ -1572,9 +1582,10 @@ static void numa_migrate_preferred(struct task_struct *p)
* are added when they cause over 6/16 of the maximum number of faults, but
* only removed when they drop below 3/16.
*/
-static void update_numa_active_node_mask(struct numa_group *numa_group)
+static bool update_numa_active_node_mask(struct numa_group *numa_group)
{
unsigned long faults, max_faults = 0;
+ bool update = false;
int nid;
for_each_online_node(nid) {
@@ -1586,11 +1597,17 @@ static void update_numa_active_node_mask(struct numa_group *numa_group)
for_each_online_node(nid) {
faults = group_faults_cpu(numa_group, nid);
if (!node_isset(nid, numa_group->active_nodes)) {
- if (faults > max_faults * 6 / 16)
+ if (faults > max_faults * 6 / 16) {
node_set(nid, numa_group->active_nodes);
- } else if (faults < max_faults * 3 / 16)
+ update = true;
+ }
+ } else if (faults < max_faults * 3 / 16) {
node_clear(nid, numa_group->active_nodes);
+ update = true;
+ }
}
+
+ return update;
}
/*
@@ -1884,16 +1901,15 @@ static void task_numa_placement(struct task_struct *p)
update_numa_active_node_mask(p->numa_group);
spin_unlock_irq(group_lock);
max_nid = preferred_group_nid(p, max_group_nid);
- }
-
- if (max_faults) {
+ sched_setnuma(p, max_nid);
+ } else if (max_faults) {
/* Set the new preferred node */
if (max_nid != p->numa_preferred_nid)
sched_setnuma(p, max_nid);
-
- if (task_node(p) != p->numa_preferred_nid)
- numa_migrate_preferred(p);
}
+
+ if (task_node(p) != p->numa_preferred_nid)
+ numa_migrate_preferred(p);
}
static inline int get_numa_group(struct numa_group *grp)
On Wed, May 13, 2015 at 08:29:06AM +0200, Peter Zijlstra wrote:
> @@ -1572,9 +1582,10 @@ static void numa_migrate_preferred(struct task_struct *p)
> * are added when they cause over 6/16 of the maximum number of faults, but
> * only removed when they drop below 3/16.
> */
> -static void update_numa_active_node_mask(struct numa_group *numa_group)
> +static bool update_numa_active_node_mask(struct numa_group *numa_group)
> {
> unsigned long faults, max_faults = 0;
> + bool update = false;
> int nid;
>
> for_each_online_node(nid) {
> @@ -1586,11 +1597,17 @@ static void update_numa_active_node_mask(struct numa_group *numa_group)
> for_each_online_node(nid) {
> faults = group_faults_cpu(numa_group, nid);
> if (!node_isset(nid, numa_group->active_nodes)) {
> - if (faults > max_faults * 6 / 16)
> + if (faults > max_faults * 6 / 16) {
> node_set(nid, numa_group->active_nodes);
> - } else if (faults < max_faults * 3 / 16)
> + update = true;
> + }
> + } else if (faults < max_faults * 3 / 16) {
> node_clear(nid, numa_group->active_nodes);
> + update = true;
> + }
> }
> +
> + return update;
> }
>
> /*
Ignore these hunks, they're dead wood.
On Wed, 2015-05-13 at 08:29 +0200, Peter Zijlstra wrote:
> > 2) ensure that rq->nr_numa_running and rq->nr_preferred_running also
> > get incremented for kernel threads that are bound to a particular
> > CPU - currently CPU-bound kernel threads will cause the NUMA
> > statistics to look like a CPU has tasks that do not belong on that
> > NUMA node
>
> I'm thinking accounting those to nr_pinned, lemme see how that works
> out.
Does not make any difference, avg. response time is still ~1.4 sec.
Thank you!
On 05/13/2015 02:29 AM, Peter Zijlstra wrote:
> On Tue, May 12, 2015 at 11:45:09AM -0400, Rik van Riel wrote:
>> I have a few poorly formed ideas on what could be done about that:
>>
>> 1) have fbq_classify_rq take the current task on the rq into account,
>> and adjust the fbq classification if all the runnable-but-queued
>> tasks are on the right node
>
> So while looking at this I came up with the below; it treats anything
> inside ->active_nodes as a preferred node for balancing purposes.
>
> Would that make sense?
Not necessarily.
If there are two workloads on a multi-threaded system, and they
have not yet converged on one node each, both nodes will be part
of ->active_nodes.
Treating them as preferred nodes means the load balancing code
would do nothing at all to help the workloads converge.
> I'll see what I can do about current in the runqueue type
> classification.
This can probably be racy, so just checking a value in the
current task struct for the runqueue should be ok. I am not
aware of any architecture where the task struct address can
become invalid. Worst thing that could happen is that the
bits examined change value.
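For example, something along these lines would do as a hint (hypothetical
helper, the name is made up):

/*
 * Hypothetical helper: a racy peek at rq->curr to tell whether the
 * currently running task is the one on the wrong NUMA node.  Worst case
 * the bits change under us, which is harmless for a classification hint.
 */
static bool rq_curr_misplaced(struct rq *rq)
{
        struct task_struct *curr = READ_ONCE(rq->curr);

        return curr->numa_preferred_nid != -1 &&
               task_node(curr) != curr->numa_preferred_nid;
}

fbq_classify_rq() could consult that when all the queued-but-not-running
tasks are already on the right node.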
>> 2) ensure that rq->nr_numa_running and rq->nr_preferred_running also
>> get incremented for kernel threads that are bound to a particular
>> CPU - currently CPU-bound kernel threads will cause the NUMA
>> statistics to look like a CPU has tasks that do not belong on that
>> NUMA node
>
> I'm thinking accounting those to nr_pinned, lemme see how that works
> out.
Cool.
--
All rights reversed
On 05/06/2015 11:41 AM, Rik van Riel wrote:
> On Wed, 06 May 2015 13:35:30 +0300
> Artem Bityutskiy <[email protected]> wrote:
>
>> we observe a tremendous regression between kernel version 3.16 and 3.17
>> (and up), and I've bisected it to this commit:
>>
>> a43455a sched/numa: Ensure task_numa_migrate() checks the preferred node
>
> Artem, Jirka, does this patch fix (or at least improve) the issues you
> have been seeing? Does it introduce any new regressions?
>
> Peter, Mel, I think it may be time to stop waiting for the impedance
> mismatch between the load balancer and NUMA balancing to be resolved,
> and try to just avoid the issue in the NUMA balancing code...
Peter, I got some more test results in. This patch can supersede
095bebf61a46 ("sched/numa: Do not move past the balance point if
unbalanced"), which can be reverted.
With this patch, a workload that has one (misplaced) thread running,
and nothing else on the system, is able to move to the node where
its memory is, which is something that 095bebf61a46 prevented.
It also fixes the single instance SPECjbb2005 spreading issue, which
benefited somewhat (but not completely) from 095bebf61a46 in the past.
Peter, what would you like me to do to get this patch into your tree,
and 095bebf61a46 reverted? :)
--
All rights reversed