2014-10-08 07:07:56

by Preeti U Murthy

Subject: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant

There are two pairs of masks associated with cpusets: cpus/mems_allowed
and effective_cpus/mems. On the legacy hierarchy both masks are kept
identical, each holding the intersection of the user-configured value
and the currently active cpus/mems. This means that the values
originally set in these masks are destroyed on each cpu/mem hot-unplug
operation. As a consequence, when the cpus/mems are plugged back in,
the tasks no longer run on them and performance degrades, in spite of
there being resources to run on.

This effect is not seen in the default hierarchy, since the allowed
and effective masks are maintained distinctly: the allowed masks are
never touched once configured, and the effective masks alone vary
with hotplug.
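
To illustrate (a made-up userspace C sketch of this mask relationship,
not the kernel implementation):

#include <stdio.h>

typedef unsigned long mask_t;        /* one bit per cpu */

/* effective = configured & parent's effective; if the intersection
 * is empty, inherit the parent's effective mask. */
static mask_t effective_mask(mask_t configured, mask_t parent_effective)
{
        mask_t eff = configured & parent_effective;
        return eff ? eff : parent_effective;
}

int main(void)
{
        mask_t configured = 0xF0;    /* user asked for cpus 4-7 */
        mask_t online = 0x0F;        /* cpus 4-7 hot-unplugged */

        printf("effective: %#lx\n", effective_mask(configured, online));
        online = 0xFF;               /* cpus 4-7 plugged back in */
        printf("effective: %#lx\n", effective_mask(configured, online));
        printf("configured: %#lx\n", configured);    /* never touched */
        return 0;
}

The effective mask follows hotplug while the configured mask survives
it, which is exactly what the legacy hierarchy loses today.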

This patch replicates the above design in the legacy hierarchy as
well, so that:

1. Tasks always run on the cpus/memory nodes that they are allowed to
run on, as long as those are online. The allowed masks are hotplug
invariant.

2. When all cpus/memory nodes in a cpuset are hot-unplugged, its tasks
are moved to the nearest ancestor which has resources to run on.

There were discussions earlier around this issue:
https://lkml.org/lkml/2012/5/4/265
http://thread.gmane.org/gmane.linux.kernel/1250097/focus=1252133

The argument against making the allowed masks hotplug invariant was
that hotplug is destructive, and hence cpusets cannot expect to regain
resources that the user has taken away through a hotplug operation.

But on powerpc we switch SMT modes to suit the running workload. We
therefore need to keep track of the original cpuset configuration so
as to make use of the cpus when they come back online after a mode
switch. Moreover, there is no real harm in keeping the allowed masks
invariant across hotplug, since the effective masks track the online
cpus anyway. In fact, there are use cases which need the cpuset's
original configuration to be retained. The v2 cgroup design therefore
does not overwrite this configuration.

Until the controllers switch to the default hierarchy, it serves well
to fix this problem in the legacy hierarchy. While at it, fix a comment
which assumes that cpuset masks are changed only during a hot-unplug
operation. With this patch, the effective masks stay consistent with
the online cpus in both the default and legacy hierarchies, while the
allowed masks retain the user's configuration.

Signed-off-by: Preeti U Murthy <[email protected]>
---

kernel/cpuset.c | 38 ++++++++++----------------------------
1 file changed, 10 insertions(+), 28 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 22874d7..89c2e60 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -78,8 +78,6 @@ struct cpuset {
unsigned long flags; /* "unsigned long" so bitops work */

/*
- * On default hierarchy:
- *
* The user-configured masks can only be changed by writing to
* cpuset.cpus and cpuset.mems, and won't be limited by the
* parent masks.
@@ -91,10 +89,6 @@ struct cpuset {
* effective_mask == configured_mask & parent's effective_mask,
* and if it ends up empty, it will inherit the parent's mask.
*
- *
- * On legacy hierachy:
- *
- * The user-configured masks are always the same with effective masks.
*/

/* user-configured CPUs and Memory Nodes allow to tasks */
@@ -842,8 +836,6 @@ static void update_tasks_cpumask(struct cpuset *cs)
* When congifured cpumask is changed, the effective cpumasks of this cpuset
* and all its descendants need to be updated.
*
- * On legacy hierachy, effective_cpus will be the same with cpu_allowed.
- *
* Called with cpuset_mutex held
*/
static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
@@ -879,9 +871,6 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
cpumask_copy(cp->effective_cpus, new_cpus);
mutex_unlock(&callback_mutex);

- WARN_ON(!cgroup_on_dfl(cp->css.cgroup) &&
- !cpumask_equal(cp->cpus_allowed, cp->effective_cpus));
-
update_tasks_cpumask(cp);

/*
@@ -1424,7 +1413,7 @@ static int cpuset_can_attach(struct cgroup_subsys_state *css,
/* allow moving tasks into an empty cpuset if on default hierarchy */
ret = -ENOSPC;
if (!cgroup_on_dfl(css->cgroup) &&
- (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
+ (cpumask_empty(cs->effective_cpus) || nodes_empty(cs->effective_mems)))
goto out_unlock;

cgroup_taskset_for_each(task, tset) {
@@ -2108,8 +2097,8 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
* has online cpus, so can't be empty).
*/
parent = parent_cs(cs);
- while (cpumask_empty(parent->cpus_allowed) ||
- nodes_empty(parent->mems_allowed))
+ while (cpumask_empty(parent->effective_cpus) ||
+ nodes_empty(parent->effective_mems))
parent = parent_cs(parent);

if (cgroup_transfer_tasks(parent->css.cgroup, cs->css.cgroup)) {
@@ -2127,9 +2116,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
bool is_empty;

mutex_lock(&callback_mutex);
- cpumask_copy(cs->cpus_allowed, new_cpus);
cpumask_copy(cs->effective_cpus, new_cpus);
- cs->mems_allowed = *new_mems;
cs->effective_mems = *new_mems;
mutex_unlock(&callback_mutex);

@@ -2137,13 +2124,13 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
* Don't call update_tasks_cpumask() if the cpuset becomes empty,
* as the tasks will be migratecd to an ancestor.
*/
- if (cpus_updated && !cpumask_empty(cs->cpus_allowed))
+ if (cpus_updated && !cpumask_empty(cs->effective_cpus))
update_tasks_cpumask(cs);
- if (mems_updated && !nodes_empty(cs->mems_allowed))
+ if (mems_updated && !nodes_empty(cs->effective_mems))
update_tasks_nodemask(cs);

- is_empty = cpumask_empty(cs->cpus_allowed) ||
- nodes_empty(cs->mems_allowed);
+ is_empty = cpumask_empty(cs->effective_cpus) ||
+ nodes_empty(cs->effective_mems);

mutex_unlock(&cpuset_mutex);

@@ -2180,11 +2167,11 @@ hotplug_update_tasks(struct cpuset *cs,
}

/**
- * cpuset_hotplug_update_tasks - update tasks in a cpuset for hotunplug
+ * cpuset_hotplug_update_tasks - update tasks in a cpuset for hotplug
* @cs: cpuset in interest
*
- * Compare @cs's cpu and mem masks against top_cpuset and if some have gone
- * offline, update @cs accordingly. If @cs ends up with no CPU or memory,
+ * Compare @cs's cpu and mem masks against top_cpuset and update @cs
+ * accordingly. If @cs ends up with no CPU or memory,
* all its tasks are moved to the nearest ancestor with both resources.
*/
static void cpuset_hotplug_update_tasks(struct cpuset *cs)
@@ -2244,7 +2231,6 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
static cpumask_t new_cpus;
static nodemask_t new_mems;
bool cpus_updated, mems_updated;
- bool on_dfl = cgroup_on_dfl(top_cpuset.css.cgroup);

mutex_lock(&cpuset_mutex);

@@ -2258,8 +2244,6 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
/* synchronize cpus_allowed to cpu_active_mask */
if (cpus_updated) {
mutex_lock(&callback_mutex);
- if (!on_dfl)
- cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
mutex_unlock(&callback_mutex);
/* we don't mess with cpumasks of tasks in top_cpuset */
@@ -2268,8 +2252,6 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
/* synchronize mems_allowed to N_MEMORY */
if (mems_updated) {
mutex_lock(&callback_mutex);
- if (!on_dfl)
- top_cpuset.mems_allowed = new_mems;
top_cpuset.effective_mems = new_mems;
mutex_unlock(&callback_mutex);
update_tasks_nodemask(&top_cpuset);


2014-10-08 08:07:28

by Peter Zijlstra

Subject: Re: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant

On Wed, Oct 08, 2014 at 12:37:40PM +0530, Preeti U Murthy wrote:
> There are two pairs of masks associated with cpusets: cpus/mems_allowed
> and effective_cpus/mems. On the legacy hierarchy both masks are kept
> identical, each holding the intersection of the user-configured value
> and the currently active cpus/mems. This means that the values
> originally set in these masks are destroyed on each cpu/mem hot-unplug
> operation. As a consequence, when the cpus/mems are plugged back in,
> the tasks no longer run on them and performance degrades, in spite of
> there being resources to run on.
>
> This effect is not seen in the default hierarchy, since the allowed
> and effective masks are maintained distinctly: the allowed masks are
> never touched once configured, and the effective masks alone vary
> with hotplug.
>
> This patch replicates the above design in the legacy hierarchy as
> well, so that:
>
> 1. Tasks always run on the cpus/memory nodes that they are allowed to
> run on, as long as those are online. The allowed masks are hotplug
> invariant.
>
> 2. When all cpus/memory nodes in a cpuset are hot-unplugged, its tasks
> are moved to the nearest ancestor which has resources to run on.
>
> There were discussions earlier around this issue:
> https://lkml.org/lkml/2012/5/4/265
> http://thread.gmane.org/gmane.linux.kernel/1250097/focus=1252133
>
> The argument against making the allowed masks hotplug invariant was
> that hotplug is destructive, and hence cpusets cannot expect to regain
> resources that the user has taken away through a hotplug operation.
>
> But on powerpc we switch SMT modes to suit the running workload. We
> therefore need to keep track of the original cpuset configuration so
> as to make use of the cpus when they come back online after a mode
> switch. Moreover, there is no real harm in keeping the allowed masks
> invariant across hotplug, since the effective masks track the online
> cpus anyway. In fact, there are use cases which need the cpuset's
> original configuration to be retained. The v2 cgroup design therefore
> does not overwrite this configuration.
>

I still completely hate all that.. It basically makes cpusets useless,
they no longer guarantee anything, it makes them an optional placement
hint instead.

You also break long standing behaviour.

Also, power is insane if it needs/uses hotplug for operational crap
like that.

2014-10-08 09:38:16

by Preeti U Murthy

Subject: Re: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant

Hi Peter,

On 10/08/2014 01:37 PM, Peter Zijlstra wrote:
> On Wed, Oct 08, 2014 at 12:37:40PM +0530, Preeti U Murthy wrote:
>> There are two pairs of masks associated with cpusets: cpus/mems_allowed
>> and effective_cpus/mems. On the legacy hierarchy both masks are kept
>> identical, each holding the intersection of the user-configured value
>> and the currently active cpus/mems. This means that the values
>> originally set in these masks are destroyed on each cpu/mem hot-unplug
>> operation. As a consequence, when the cpus/mems are plugged back in,
>> the tasks no longer run on them and performance degrades, in spite of
>> there being resources to run on.
>>
>> This effect is not seen in the default hierarchy, since the allowed
>> and effective masks are maintained distinctly: the allowed masks are
>> never touched once configured, and the effective masks alone vary
>> with hotplug.
>>
>> This patch replicates the above design in the legacy hierarchy as
>> well, so that:
>>
>> 1. Tasks always run on the cpus/memory nodes that they are allowed to
>> run on, as long as those are online. The allowed masks are hotplug
>> invariant.
>>
>> 2. When all cpus/memory nodes in a cpuset are hot-unplugged, its tasks
>> are moved to the nearest ancestor which has resources to run on.
>>
>> There were discussions earlier around this issue:
>> https://lkml.org/lkml/2012/5/4/265
>> http://thread.gmane.org/gmane.linux.kernel/1250097/focus=1252133
>>
>> The argument against making the allowed masks hotplug invariant was
>> that hotplug is destructive, and hence cpusets cannot expect to regain
>> resources that the user has taken away through a hotplug operation.
>>
>> But on powerpc we switch SMT modes to suit the running workload. We
>> therefore need to keep track of the original cpuset configuration so
>> as to make use of the cpus when they come back online after a mode
>> switch. Moreover, there is no real harm in keeping the allowed masks
>> invariant across hotplug, since the effective masks track the online
>> cpus anyway. In fact, there are use cases which need the cpuset's
>> original configuration to be retained. The v2 cgroup design therefore
>> does not overwrite this configuration.
>>
>
> I still completely hate all that.. It basically makes cpusets useless,
> they no longer guarantee anything, it makes them an optional placement
> hint instead.

Why do you say they don't guarantee anything? We ensure that we always
run on the cpus in our cpuset which are online. We do not run in any
arbitrary cpuset. We also do not wait unreasonably on an offline cpu to
come back. What we are doing is ensuring that, if the resources we
began with are available, we use them. Why is this not a logical thing to
expect?

>
> You also break long standing behaviour.

Which is? As I understand it, cpusets are meant to ensure a dedicated set of
resources for some tasks. We cannot schedule the tasks anywhere outside
this set as long as they are available. And when they are not, currently
we move them to their parents, but you had also suggested killing the
task. Maybe this can be debated. But what behavior are we changing by
ensuring that we retain our original configuration at all times?

>
> Also, power is insane if it needs/uses hotplug for operational crap
> like that.

SMT 8 on Power8 can help/hinder workloads. Hence we dynamically switch
the modes at runtime.

Regards
Preeti U Murthy
>

2014-10-08 10:18:35

by Peter Zijlstra

Subject: Re: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant

On Wed, Oct 08, 2014 at 03:07:51PM +0530, Preeti U Murthy wrote:

> > I still completely hate all that.. It basically makes cpusets useless,
> > they no longer guarantee anything, it makes them an optional placement
> > hint instead.
>
> Why do you say they don't guarantee anything? We ensure that we always
> run on the cpus in our cpuset which are online. We do not run in any
> arbitrary cpuset. We also do not wait unreasonably on an offline cpu to
> come back. What we are doing is ensuring that, if the resources we
> began with are available, we use them. Why is this not a logical thing to
> expect?

Because if you randomly hotplug cpus your tasks can randomly flip
between sets.

Therefore there is no strict containment and no guarantees.

> > You also break long standing behaviour.
>
> Which is? As I understand it, cpusets are meant to ensure a dedicated set of
> resources for some tasks. We cannot schedule the tasks anywhere outside
> this set as long as they are available. And when they are not, currently
> we move them to their parents,

No, currently we hard-break affinity and never look back. That
move-to-parent-and-back crap is all newfangled stuff, and broken because
of the above argument.

> but you had also suggested killing the
> task. Maybe this can be debated. But what behavior are we changing by
> ensuring that we retain our original configuration at all times?

See above: by pretending hotplug is a sane operation you lose all
guarantees.

> > Also, power is insane if it needs/uses hotplug for operational crap
> > like that.
>
> SMT 8 on Power8 can help/hinder workloads. Hence we dynamically switch
> the modes at runtime.

That's just a horrible piece of crap hack and you deserve any and all
pain you get from doing it.

Randomly removing/adding cpus like that is horrible and makes a mockery
of all the affinity interfaces we have.

2014-10-08 14:54:32

by Raghavendra K T

Subject: Re: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant

On Wed, Oct 8, 2014 at 12:37 PM, Preeti U Murthy
<[email protected]> wrote:
> There are two pairs of masks associated with cpusets: cpus/mems_allowed
> and effective_cpus/mems. On the legacy hierarchy both masks are kept
> identical, each holding the intersection of the user-configured value
> and the currently active cpus/mems. This means that the values
> originally set in these masks are destroyed on each cpu/mem hot-unplug
> operation. As a consequence, when the cpus/mems are plugged back in,
> the tasks no longer run on them and performance degrades, in spite of
> there being resources to run on.
>
> This effect is not seen in the default hierarchy, since the allowed
> and effective masks are maintained distinctly: the allowed masks are
> never touched once configured, and the effective masks alone vary
> with hotplug.
>
> This patch replicates the above design in the legacy hierarchy as
> well, so that:
>
> 1. Tasks always run on the cpus/memory nodes that they are allowed to
> run on, as long as those are online. The allowed masks are hotplug
> invariant.
>
> 2. When all cpus/memory nodes in a cpuset are hot-unplugged, its tasks
> are moved to the nearest ancestor which has resources to run on.

Hi Preeti,

I may be missing something here; could you please explain when tasks get
moved out of a cpuset after this patch, and why that is even necessary?

IIUC, with the default hierarchy we should never hit a case where we have
an empty effective cpuset, and hence remove_tasks_in_empty_cpuset() should
never happen, no?

If my assumption is correct, then we should remove
remove_tasks_in_empty_cpuset() itself...

2014-10-09 05:12:22

by Preeti U Murthy

Subject: Re: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant

Hi Raghu,

On 10/08/2014 08:24 PM, Raghavendra KT wrote:
> On Wed, Oct 8, 2014 at 12:37 PM, Preeti U Murthy
> <[email protected]> wrote:
>> There are two pairs of masks associated with cpusets: cpus/mems_allowed
>> and effective_cpus/mems. On the legacy hierarchy both masks are kept
>> identical, each holding the intersection of the user-configured value
>> and the currently active cpus/mems. This means that the values
>> originally set in these masks are destroyed on each cpu/mem hot-unplug
>> operation. As a consequence, when the cpus/mems are plugged back in,
>> the tasks no longer run on them and performance degrades, in spite of
>> there being resources to run on.
>>
>> This effect is not seen in the default hierarchy, since the allowed
>> and effective masks are maintained distinctly: the allowed masks are
>> never touched once configured, and the effective masks alone vary
>> with hotplug.
>>
>> This patch replicates the above design in the legacy hierarchy as
>> well, so that:
>>
>> 1. Tasks always run on the cpus/memory nodes that they are allowed to
>> run on, as long as those are online. The allowed masks are hotplug
>> invariant.
>>
>> 2. When all cpus/memory nodes in a cpuset are hot-unplugged, its tasks
>> are moved to the nearest ancestor which has resources to run on.
>
> Hi Preeti,
>
> I may be missing something here; could you please explain when tasks get
> moved out of a cpuset after this patch, and why that is even necessary?

On the legacy hierarchy, tasks are moved to their parent's cpuset if
the cpuset to which they were initially bound becomes empty. What the
patch does has nothing to do with moving tasks when their cpuset becomes
empty; point 2 above was mentioned merely to state that this part of the
behavior is not changed by the patch. The patch only ensures that the
original cpuset configuration is not messed with during hotplug
operations.
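
To restate that fallback in code, here is the ancestor walk, condensed
from the remove_tasks_in_empty_cpuset() hunk in the patch above (error
handling and locking omitted):

/* Climb until an ancestor has both cpus and memory, then move the
 * tasks of the now-empty cpuset there. */
struct cpuset *parent = parent_cs(cs);

while (cpumask_empty(parent->effective_cpus) ||
       nodes_empty(parent->effective_mems))
        parent = parent_cs(parent);

cgroup_transfer_tasks(parent->css.cgroup, cs->css.cgroup);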

>
> IIUC, with the default hierarchy we should never hit a case where we have
> an empty effective cpuset, and hence remove_tasks_in_empty_cpuset() should
> never happen, no?
>
> If my assumption is correct, then we should remove
> remove_tasks_in_empty_cpuset() itself...

remove_tasks_in_empty_cpuset() is called on the legacy hierarchy when
the cpuset becomes empty, hence we require it. But you are right, it's not
called on the default hierarchy.

Regards
Preeti U Murthy
>

2014-10-09 08:21:19

by Preeti U Murthy

Subject: Re: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant

On 10/08/2014 03:48 PM, Peter Zijlstra wrote:
> On Wed, Oct 08, 2014 at 03:07:51PM +0530, Preeti U Murthy wrote:
>
>>> I still completely hate all that.. It basically makes cpusets useless,
>>> they no longer guarantee anything, it makes them an optional placement
>>> hint instead.
>>
>> Why do you say they don't guarantee anything? We ensure that we always
>> run on the cpus in our cpuset which are online. We do not run in any
>> arbitrary cpuset. We also do not wait unreasonably on an offline cpu to
>> come back. What we are doing is ensuring that, if the resources we
>> began with are available, we use them. Why is this not a logical thing to
>> expect?
>
> Because if you randomly hotplug cpus your tasks can randomly flip
> between sets.
>
> Therefore there is no strict containment and no guarantees.
>
>>> You also break long standing behaviour.
>>
>> Which is? As I understand it, cpusets are meant to ensure a dedicated set of
>> resources for some tasks. We cannot schedule the tasks anywhere outside
>> this set as long as they are available. And when they are not, currently
>> we move them to their parents,
>
> No, currently we hard-break affinity and never look back. That
> move-to-parent-and-back crap is all newfangled stuff, and broken because
> of the above argument.
>
>> but you had also suggested killing the
>> task. Maybe this can be debated. But what behavior are we changing by
>> ensuring that we retain our original configuration at all times?
>
> See above: by pretending hotplug is a sane operation you lose all
> guarantees.

Ok, I see the point. The kernel must not be expected to keep cpusets
and hotplug operations consistent when both of these are user-initiated
actions, specifying affinity with the former and breaking it with the
latter.

>
>>> Also, power is insane if it needs/uses hotplug for operational crap
>>> like that.
>>
>> SMT 8 on Power8 can help/hinder workloads. Hence we dynamically switch
>> the modes at runtime.
>
> That's just a horrible piece of crap hack and you deserve any and all
> pain you get from doing it.
>
> Randomly removing/adding cpus like that is horrible and makes a mockery
> of all the affinity interfaces we have.

We observed this on the Ubuntu kernel, in which systemd explicitly mounts
cgroup controllers and places tasks under a child cgroup identified by
the user pid. Since we had not noticed this additional cgroup being
added under the hood, it came as a surprise to us that the kernel's
cgroup/cpuset handling would kick in at all.

At the least, we expect hotplug to be handled well when the user has
not explicitly configured cpusets, thereby implicitly specifying that
task affinity covers all online cpus. This is indeed the case today,
so that is good.

However, a question remains: the v2 cgroup design, the default
hierarchy, tracks hotplug operations for child cgroups as well. Tejun,
Li, will not the concerns that Peter raised above hold for the default
hierarchy as well?

Regards
Preeti U Murthy
>

2014-10-09 08:31:11

by Peter Zijlstra

Subject: Re: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant

On Thu, Oct 09, 2014 at 01:50:52PM +0530, Preeti U Murthy wrote:
> We observed this on the Ubuntu kernel, in which systemd explicitly mounts

Using systemd is your first fail.. total crapfest that.

2014-10-09 08:33:08

by Peter Zijlstra

Subject: Re: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant

On Thu, Oct 09, 2014 at 01:50:52PM +0530, Preeti U Murthy wrote:
> >> SMT 8 on Power8 can help/hinder workloads. Hence we dynamically switch
> >> the modes at runtime.
> >
> > That's just a horrible piece of crap hack and you deserve any and all
> > pain you get from doing it.
> >
> > Randomly removing/adding cpus like that is horrible and makes a mockery
> > of all the affinity interfaces we have.
>
> We observed this on the Ubuntu kernel, in which systemd explicitly mounts
> cgroup controllers and places tasks under a child cgroup identified by
> the user pid. Since we had not noticed this additional cgroup being
> added under the hood, it came as a surprise to us that the kernel's
> cgroup/cpuset handling would kick in at all.
>
> At the least, we expect hotplug to be handled well when the user has
> not explicitly configured cpusets, thereby implicitly specifying that
> task affinity covers all online cpus. This is indeed the case today,
> so that is good.
>
> However, a question remains: the v2 cgroup design, the default
> hierarchy, tracks hotplug operations for child cgroups as well. Tejun,
> Li, will not the concerns that Peter raised above hold for the default
> hierarchy as well?

None of this addresses the piece of crap thing you did with power8. You
cannot just make CPUs go away at random.

2014-10-09 09:42:37

by Raghavendra K T

Subject: Re: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant

On 10/09/2014 10:42 AM, Preeti U Murthy wrote:
> Hi Raghu,

> remove_tasks_in_empty_cpuset() is called on the legacy hierarchy when
> the cpuset becomes empty, hence we require it. But you are right, it's not
> called on the default hierarchy.

My point was that if the legacy hierarchy follows the unified hierarchy
in effective-cpuset handling, we will never end up with an empty
effective cpuset, and hence your patch could remove the
remove_tasks_in_empty_cpuset() part too.

2014-10-09 13:06:24

by Tejun Heo

Subject: Re: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant

On Thu, Oct 09, 2014 at 01:50:52PM +0530, Preeti U Murthy wrote:
> However, a question remains: the v2 cgroup design, the default
> hierarchy, tracks hotplug operations for child cgroups as well. Tejun,
> Li, will not the concerns that Peter raised above hold for the default
> hierarchy as well?

I don't think the legacy one is a good design. Kernel shouldn't lose
configurations in an irreversible way and the legacy one is also
making random cpuset flips by migrating tasks upwards anyway. In
terms of hotunplug behavior, the legacy and unified ones behave the
same. The only difference is that the configuration is independent of
the current state and the configured behavior is restored when the
cpus come back. The other side is that the legacy hierarchy behavior
simply can't be allowed when the hierarchy is shared among multiple
controllers as in the unified hierarchy. It affects all other
controllers attached to the hierarchy.

That said, we can't change the behavior on the legacy one. It's a
very userland visible behavior. We simply can't change it, so
unfortunately you're stuck with it at least on the legacy hierarchy.

Thanks.

--
tejun

2014-10-09 13:48:10

by Peter Zijlstra

Subject: Re: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant

On Thu, Oct 09, 2014 at 09:06:11AM -0400, Tejun Heo wrote:
> On Thu, Oct 09, 2014 at 01:50:52PM +0530, Preeti U Murthy wrote:
> > However, a question remains: the v2 cgroup design, the default
> > hierarchy, tracks hotplug operations for child cgroups as well. Tejun,
> > Li, will not the concerns that Peter raised above hold for the default
> > hierarchy as well?
>
> I don't think the legacy one is a good design. Kernel shouldn't lose
> configurations in an irreversible way and the legacy one is also
> making random cpuset flips by migrating tasks upwards anyway.

You do know we disagree on this :-)

The thing is, if you restrict a process to one cpu and then take that
cpu away you have a fail; pretending it's 'OK' because you'll place it
back once the cpu appears again doesn't make it right.

And while legacy will indeed move tasks upwards, it does so under
protest; it's a clear error and the user needs to go figure out wtf to do
about it.

And while you all can try and pretend hotplug is a 'normal' and 'sane'
operation with cpusets, the same failure mode very much still exists
with the regular affinity controls. So you can pretend all you want, but
it's a clear and utter fail.

You cannot give the kernel contradictory instructions and then pretend
all is well and dandy.

2014-10-09 13:59:38

by Tejun Heo

Subject: Re: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant

Hello, Peter.

On Thu, Oct 09, 2014 at 03:47:58PM +0200, Peter Zijlstra wrote:
> You do know we disagree on this :-)

Yeap. :)

...
> And while you all can try and pretend hotplug is a 'normal' and 'sane'
> operation with cpusets, the same failure mode very much still exists
> with the regular affinity controls. So you can pretend all you want, but
> it's a clear and utter fail.
>
> You cannot give the kernel contradictory instructions and then pretend
> all is well and dandy.

But even if you view it that way, the current legacy implementation is
deficient to say the least. It puts way too much trust in the
userland while not giving it mechanisms to deal with the situation.
It's not like the userland is an all-knowing entity and short of the
printk there's no way to detect such automatic migrations or to know
the previous state. If this actually was seen as a configuration
failure, it would have made a lot more sense to just not run those
tasks unless they're SIGKILL'd.

This is all moot tho. We can't change the behavior for the legacy
hierarchies and we can't auto-migrate for the unified hierarchy, so
there isn't much left to decide.

Thanks.

--
tejun

2015-04-02 06:56:46

by Preeti U Murthy

Subject: Re: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant

Hi Tejun, Peter,

On 10/09/2014 06:36 PM, Tejun Heo wrote:
> On Thu, Oct 09, 2014 at 01:50:52PM +0530, Preeti U Murthy wrote:
>> However, a question remains: the v2 cgroup design, the default
>> hierarchy, tracks hotplug operations for child cgroups as well. Tejun,
>> Li, will not the concerns that Peter raised above hold for the default
>> hierarchy as well?
>
> I don't think the legacy one is a good design. Kernel shouldn't lose
> configurations in an irreversible way and the legacy one is also
> making random cpuset flips by migrating tasks upwards anyway. In
> terms of hotunplug behavior, the legacy and unified ones behave the
> same. The only difference is that the configuration is independent of
> the current state and the configured behavior is restored when the
> cpus come back. The other side is that the legacy hierarchy behavior
> simply can't be allowed when the hierarchy is shared among multiple
> controllers as in the unified hierarchy. It affects all other
> controllers attached to the hierarchy.

We currently have a use case which needs this to be fixed one way or
the other. When running in a virtualized setup, there may be a need to
hotplug resources into VMs at runtime, including CPUs and memory. Due
to the behavior of the legacy hierarchy, the new CPUs never get used.
This is not even a scenario where we hot-unplugged CPUs and asked for
them to be plugged back in; it is a case where the workloads running
within a VM need more resources than they began with.
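
For reference, the hotplug side of this use case boils down to writing
to the standard sysfs online file; a minimal C sketch (the cpu number
is illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* Bring a newly hot-added cpu online via sysfs. */
        int fd = open("/sys/devices/system/cpu/cpu2/online", O_WRONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (write(fd, "1", 1) != 1)
                perror("write");
        close(fd);
        return 0;
}

On the legacy hierarchy, a cpu brought online this way is not restored
to the child cpusets that once contained it, which is the problem above.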

>
> That said, we can't change the behavior on the legacy one. It's a
> very userland visible behavior. We simply can't change it, so

By ensuring that the user configured cpusets are untouched, I don't see
how we affect userspace adversely. The expectation usually is that the
kernel keeps track of the user configurations. If anything we would be
fixing an undesired behavior, wouldn't we?

> unfortunately you're stuck with it at least on the legacy hierarchy.

Given that we are in much need of a fix, and that we cannot easily move
to the default hierarchy, can you please take a look at this patch
again?

It is understandable that there are good reasons why the legacy
hierarchy currently behaves this way, and that we cannot drastically
change its behavior, but there is no sane way for userspace to get
around this for genuine use cases such as the above.

Regards
Preeti U Murthy
>
> Thanks.
>

2015-04-06 17:47:42

by Tejun Heo

Subject: Re: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant

Hello, Preeti.

On Thu, Apr 02, 2015 at 12:26:32PM +0530, Preeti U Murthy wrote:
> By ensuring that the user configured cpusets are untouched, I don't see
> how we affect userspace adversely. The expectation usually is that the
> kernel keeps track of the user configurations. If anything we would be
> fixing an undesired behavior, wouldn't we?

The problem is not really about which behavior is "righter" but rather
it's fairly likely that there are users / tools out there expecting
the current behavior and they wouldn't be too happy to see the
behavior flipping underneath them.

One way forward would be implementing a knob in cpuset which makes it
switch between the old and new behaviors in the legacy hierarchy.
It's yucky but doable if absolutely necessary, but what's the reason
for you not being able to transition to the unified hierarchy (except
for it being under the devel flag, but I'm really taking that devel
flag out in the next merge window)? The default hierarchy can happily
co-exist with legacy hierarchies, so you can just move the cpuset
part over to it if you need it.

Thanks.

--
tejun

2015-04-09 21:13:53

by Serge E. Hallyn

Subject: Re: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant

On Mon, Apr 06, 2015 at 01:47:35PM -0400, Tejun Heo wrote:
> Hello, Preeti.
>
> On Thu, Apr 02, 2015 at 12:26:32PM +0530, Preeti U Murthy wrote:
> > By ensuring that the user configured cpusets are untouched, I don't see
> > how we affect userspace adversely. The expectation usually is that the
> > kernel keeps track of the user configurations. If anything we would be
> > fixing an undesired behavior, wouldn't we?
>
> The problem is not really about which behavior is "righter" but rather
> it's fairly likely that there are users / tools out there expecting
> the current behavior and they wouldn't be too happy to see the
> behavior flipping underneath them.
>
> One way forward would be implementing a knob in cpuset which makes it
> switch between the old and new behaviors in the legacy hierarchy.
> It's yucky but doable if absolutely necessary, but what's the reason
> for you not being able to transition to the unified hierarchy (except

If the userspace is entirely new, then this should work. The
unified hierarchy's behavior is not backward-compatible, so any old
software which tries to create cgroups (libcgroup, lxc, etc.) will
not work with it (since it won't, for instance, know to fill in
the enabled controllers in every newly created cgroup).
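
For instance, here is a hedged sketch of one thing new software has to
do on the unified hierarchy: enabling the cpuset controller for child
cgroups via cgroup.subtree_control. It assumes cgroup2 is mounted at
/sys/fs/cgroup and that the running kernel exposes cpuset there.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *ctl = "/sys/fs/cgroup/cgroup.subtree_control";
        int fd = open(ctl, O_WRONLY);

        if (fd < 0) {
                perror(ctl);
                return 1;
        }
        /* Make the cpuset controller available in child cgroups. */
        if (write(fd, "+cpuset", strlen("+cpuset")) < 0)
                perror("write");
        close(fd);
        return 0;
}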

Preeti, can you confirm that you don't have any need to run any
legacy programs which use cgroups? As long as that's the case, new
software can certainly be written to DTRT, and mounting just cpusets
under the unified hierarchy seems best.

> for it being under the devel flag, but I'm really taking that devel
> flag out in the next merge window)? The default hierarchy can happily
> co-exist with legacy hierarchies, so you can just move the cpuset
> part over to it if you need it.
>
> Thanks.
>
> --
> tejun

2015-04-10 14:14:23

by Preeti U Murthy

Subject: Re: [PATCH] cpusets: Make cpus_allowed and mems_allowed masks hotplug invariant

Hi Serge,

On 04/10/2015 02:43 AM, Serge E. Hallyn wrote:
> On Mon, Apr 06, 2015 at 01:47:35PM -0400, Tejun Heo wrote:
>> Hello, Preeti.
>>
>> On Thu, Apr 02, 2015 at 12:26:32PM +0530, Preeti U Murthy wrote:
>>> By ensuring that the user configured cpusets are untouched, I don't see
>>> how we affect userspace adversely. The expectation usually is that the
>>> kernel keeps track of the user configurations. If anything we would be
>>> fixing an undesired behavior, wouldn't we?
>>
>> The problem is not really about which behavior is "righter" but rather
>> it's fairly likely that there are users / tools out there expecting
>> the current behavior and they wouldn't be too happy to see the
>> behavior flipping underneath them.
>>
>> One way forward would be implementing a knob in cpuset which makes it
>> switch between the old and new behaviors in the legacy hierarchy.
>> It's yucky but doable if absolutely necessary, but what's the reason
>> for you not being able to transition to the unified hierarchy (except
>
> If the userspace is entirely new, then this should work. The
> unified hierarchy's behavior is not backward-compatible, so any old
> software which tries to create cgroups (libcgroup, lxc, etc.) will
> not work with it (since it won't, for instance, know to fill in
> the enabled controllers in every newly created cgroup).
>
> Preeti, can you confirm that you don't have any need to run any
> legacy programs which use cgroups? As long as that's the case, new

I don't think I can safely vouch for that. I have posted a v2 of
this patch adhering to Tejun's first suggestion; IMO that seemed like
the better option.

Regards
Preeti U Murthy