2006-08-21 13:27:35

by Anton Blanchard

Subject: cpusets not cpu hotplug aware


Hi,

We have a bug report where sched_setaffinity fails for cpus that are
hotplug added after boot. Nathan found this suspicious piece of code:

void __init cpuset_init_smp(void)
{
	top_cpuset.cpus_allowed = cpu_online_map;
	top_cpuset.mems_allowed = node_online_map;
}

Any reason we statically set the top level cpuset to the boot time cpu
online map?

Anton


2006-08-21 13:54:47

by Simon Derr

Subject: Re: cpusets not cpu hotplug aware

On Mon, 21 Aug 2006, Anton Blanchard wrote:

>
> Hi,
>
> We have a bug report where sched_setaffinity fails for cpus that are
> hotplug added after boot. Nathan found this suspicious piece of code:
>
> void __init cpuset_init_smp(void)
> {
> 	top_cpuset.cpus_allowed = cpu_online_map;
> 	top_cpuset.mems_allowed = node_online_map;
> }
>
> Any reason we statically set the top level cpuset to the boot time cpu
> online map?
>

Are you suggesting that all possible cpus should be in the top cpuset,
instead of all existing cpus?

I'd prefer that hotplugged CPUs be added to the top cpuset afterwards.

Simon.

2006-08-21 17:44:08

by Paul Jackson

Subject: Re: cpusets not cpu hotplug aware

Anton wrote:
> We have a bug report where sched_setaffinity fails for cpus that are
> hotplug added after boot. Nathan found this suspicious piece of code:
>
> void __init cpuset_init_smp(void)
> {
> 	top_cpuset.cpus_allowed = cpu_online_map;
> 	top_cpuset.mems_allowed = node_online_map;
> }
>
> Any reason we statically set the top level cpuset to the boot time cpu
> online map?


Your query confuses me, about 4 different ways ...

1) What does sched_setaffinity have to do with this part of cpusets?
2) What did you mean by "statically assigned"? At boot, whatever cpus
and memory nodes are online are copied to the top_cpuset's settings.
As Simon suggests, it would be up to the hotplug/hotunplug folks to
update these top_cpuset settings, as cpus and nodes come and go.
3) I don't understand what you thought was suspicious here.
4) I don't understand what you expected to see instead here.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-21 19:22:51

by Anton Blanchard

Subject: Re: cpusets not cpu hotplug aware


Hi,

> Your query confuses me, about 4 different ways ...
>
> 1) What does sched_setaffinity have to do with this part of cpusets?

long sched_setaffinity(pid_t pid, cpumask_t new_mask)
{
	...
	cpus_allowed = cpuset_cpus_allowed(p);
	cpus_and(new_mask, new_mask, cpus_allowed);
	retval = set_cpus_allowed(p, new_mask);

If cpuset_cpus_allowed doesn't return the current online mask and we want
to schedule on a cpu that has been added since boot, it looks like we
will fail.
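
For example (an untested userspace sketch; assumes cpu 3 was onlined
after boot on a CONFIG_CPUSETS=y kernel), this fails with EINVAL even
though cpu 3 is online:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(3, &mask);	/* cpu 3: hotplug-added after boot */

	/* the kernel ANDs this with the boot-time cpuset mask,
	 * yielding an empty mask, so EINVAL comes back */
	if (sched_setaffinity(0, sizeof(mask), &mask) < 0)
		perror("sched_setaffinity");
	return 0;
}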

> 2) What did you mean by "statically assigned"? At boot, whatever cpus
> and memory nodes are online are copied to the top_cpuset's settings.
> As Simon suggests, it would be up to the hotplug/hotunplug folks to
> update these top_cpuset settings, as cpus and nodes come and go.

It's up to the cpusets code to register a hotplug notifier to update the
top_cpuset maps.

> 3) I don't understand what you thought was suspicious here.
> 4) I don't understand what you expected to see instead here.

If the top level cpuset pointed to cpu_online_map instead of a boot-time
copy, we wouldn't need any hotplug gunk in the cpusets code.

Maybe the notifier is the right way to go, but it seems strange to
create two copies of cpu_online_map (with the associated possibility of
the two getting out of sync).

Anton

2006-08-21 20:42:59

by Nathan Lynch

Subject: [PATCH] cpuset code prevents binding tasks to new cpus

Anton Blanchard wrote:
>
> Hi,
>
> > Your query confuses me, about 4 different ways ...
> >
> > 1) What does sched_setaffinity have to do with this part of cpusets?
>
> long sched_setaffinity(pid_t pid, cpumask_t new_mask)
> {
> 	...
> 	cpus_allowed = cpuset_cpus_allowed(p);
> 	cpus_and(new_mask, new_mask, cpus_allowed);
> 	retval = set_cpus_allowed(p, new_mask);
>
> If cpuset_cpus_allowed doesn't return the current online mask and we want
> to schedule on a cpu that has been added since boot, it looks like we
> will fail.
>
> > 2) What did you mean by "statically assigned"? At boot, whatever cpus
> > and memory nodes are online are copied to the top_cpuset's settings.
> > As Simon suggests, it would be up to the hotplug/hotunplug folks to
> > update these top_cpuset settings, as cpus and nodes come and go.
>
> It's up to the cpusets code to register a hotplug notifier to update the
> top_cpuset maps.

I think a notifier is overkill; the top_cpuset mask should just be a
copy of cpu_possible_map, which doesn't change after boot, so we don't
have to worry about keeping it in sync. cpuset_cpus_allowed already
guarantees online cpus in the returned mask.
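
That guarantee comes from guarantee_online_cpus(), which climbs toward
the root until it finds a cpuset with an online cpu -- roughly like this
(paraphrased from kernel/cpuset.c of this era; check the real source for
the exact locking details):

static void guarantee_online_cpus(const struct cpuset *cs, cpumask_t *pmask)
{
	/* climb the hierarchy until some allowed cpu is online */
	while (cs && !cpus_intersects(cs->cpus_allowed, cpu_online_map))
		cs = cs->parent;
	if (cs)
		cpus_and(*pmask, cs->cpus_allowed, cpu_online_map);
	else
		*pmask = cpu_online_map;
}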


> > 3) I don't understand what you thought was suspicious here.
> > 4) I don't understand what you expected to see instead here.
>
> > If the top level cpuset pointed to cpu_online_map instead of a boot-time
> > copy, we wouldn't need any hotplug gunk in the cpusets code.
> >
> > Maybe the notifier is the right way to go, but it seems strange to
> > create two copies of cpu_online_map (with the associated possibility of
> > the two getting out of sync).

Agreed.

Here's a patch:

We have this code in sched_setaffinity:

	cpus_allowed = cpuset_cpus_allowed(p);
	cpus_and(new_mask, new_mask, cpus_allowed);
	retval = set_cpus_allowed(p, new_mask);

In the default case (the task belongs to the default toplevel cpuset),
cpuset_cpus_allowed always returns a copy of whatever cpu_online_map
was at boot. When a new cpu is added to the system, the user is
unable to bind tasks to the new cpu -- in the code above we wind up
passing an empty cpumask to set_cpus_allowed and get -EINVAL.

Since cpuset_cpus_allowed already ensures that the cpumask it returns
reflects only online cpus, setting the default cpuset's cpus_allowed
mask to cpu_possible_map (which never changes) ensures correct
behavior.


Signed-off-by: Nathan Lynch <[email protected]>

--- cpuhp-sched_setaffinity.orig/kernel/cpuset.c
+++ cpuhp-sched_setaffinity/kernel/cpuset.c
@@ -2041,7 +2041,7 @@ out:
 
 void __init cpuset_init_smp(void)
 {
-	top_cpuset.cpus_allowed = cpu_online_map;
+	top_cpuset.cpus_allowed = cpu_possible_map;
 	top_cpuset.mems_allowed = node_online_map;
 }


2006-08-21 21:02:27

by Paul Jackson

Subject: Re: cpusets not cpu hotplug aware

Anton wrote:

> Maybe the notifier is the right way to go, but it seems strange to
> create two copies of cpu_online_map (with the associated possibility of
> the two getting out of sync).

Every cpuset in the system, of which this top_cpuset is just the first,
has a set of cpus and memory nodes on which tasks in that cpuset are
allowed to operate. It's not just top_cpuset that we need to understand
how to relate to hotplug and unplug.

I'll bet there is more hidden state in mm/mempolicy.c, for mbind()
and set_mempolicy(), and in kernel/sched.c for sched_setaffinity(),
which was derived from what memory nodes or cpus were online. For
example, I see several fields in 'struct mempolicy' that contain
node numbers in some form, and the 'cpus_allowed' field in the task
struct that sched_setaffinity sets.

How does hotplug and unplug interact with these various saved states?

> It's up to the cpusets code to register a hotplug notifier to update the
> top_cpuset maps.

That, or user level code, when it adds or removes a cpu or a memory
node, needs to be responsible for adding or removing that cpu or node
to or from whichever cpusets are affected.

For example, if you just added cpu 31 to a system that had been
running on cpus 0 to 30, you can add cpu 31 to the top cpuset by
doing:

mkdir /dev/cpuset # if not already done
mount -t cpuset cpuset /dev/cpuset # if not already done
/bin/echo 0-31 > /dev/cpuset/cpus

> If cpuset_cpus_allowed doesn't return the current online mask and we want
> to schedule on a cpu that has been added since boot, it looks like we
> will fail.

In general, on systems actually using cpusets, that -is- what should
happen. Just because a cpu was brought online doesn't mean it was
intended to be allowed in any given task's current cpuset.

Granted, I would guess users of systems not using cpusets (but that
still have cpusets configured in their kernel, as is common in some
distro kernels) would expect the behaviour you expected - bringing
a cpu (or memory node) online or offline would make it available (or
not) for something like a sched_setaffinity (or mbind/set_mempolicy)
immediately, without having to invoke some magic cpuset voodoo.

Offhand, this sounds to me like a choice of two modes of operation.

If you aren't actually using cpusets (so every task is in the
original top_cpuset) then you'd expect that cpuset to "get out
of the way", meaning top_cpuset (the only cpuset, in this case)
tracked whatever cpus and nodes were online at the moment.

If instead you start messing with cpusets (creating more than one
of them and moving tasks between them) then you'd expect cpusets
to be enforced, without automatically adding newly added cpus or
memory nodes to existing cpusets. Only the user knows which
cpusets should get the added cpus or memory nodes in this case.

I don't jump for joy over yet another modal state flag. But I don't see
a better alternative -- do you?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-21 23:05:57

by Paul Jackson

Subject: Re: [PATCH] cpuset code prevents binding tasks to new cpus

Nathan wrote:
> -	top_cpuset.cpus_allowed = cpu_online_map;
> +	top_cpuset.cpus_allowed = cpu_possible_map;

NAK

While this seems sensible on systems not using cpusets (but still using
kernels configured for cpusets), it is surprising on systems using
cpusets, on which one expects the cpuset that a task is in to reflect
the cpus that a task is allowed to use.

A long time ago, this code actually had cpu_possible_map here, not
cpu_online_map. That was changed in the patch:

http://lkml.org/lkml/2004/9/12/104
[Patch] cpusets interoperate with hotplug online maps

Let's discuss this further. See my previous reply to Anton.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-21 23:27:46

by Nathan Lynch

Subject: Re: [PATCH] cpuset code prevents binding tasks to new cpus

Paul Jackson wrote:
> Nathan wrote:
> > -	top_cpuset.cpus_allowed = cpu_online_map;
> > +	top_cpuset.cpus_allowed = cpu_possible_map;
>
> NAK
>
> While this seems sensible on systems not using cpusets (but still using
> kernels configured for cpusets), it is surprising on systems using
> cpusets, on which one expects the cpuset that a task is in to reflect
> the cpus that a task is allowed to use.

Will it actually break anything?

2006-08-22 04:42:40

by Paul Jackson

Subject: Re: [PATCH] cpuset code prevents binding tasks to new cpus

> Will it actually break anything?

It will break any user code that thought it could actually run on the
CPUs listed in its cpuset, if it happens to be in a task in the top
cpuset. I'm pretty sure I've got some cpuset users who would be
harmed by this change.

One degree of freedom for change we do have is that, as best I know,
no one is doing anything serious with both cpu hotplug/unplug and with
cpusets at the same time on the same system. For instance, I am
willing to wager that no one is counting on the CPUs in the top cpuset
remaining constant if a CPU comes on or off line. I say this because
so far as I know, serious cpuset users aren't taking CPUs on and
off line. I've rather been expecting cpusets and cpu hotplug to
butt heads for a couple of years now. Looks like the time has come.

So if I'm right, we could change the API to have the top_cpuset
cpus_allowed track the cpu_online_map dynamically, rather than being a
static copy of the cpu_online_map value at system boot, with little
or no negative impact on users.

For adding CPUs, that is easy enough, using a register_cpu_notifier
callback to add new CPUs to top_cpuset.cpus_allowed. This may seem
like overkill to Nathan, but I think it is necessary, to keep the
top cpuset's CPUs tracking what's online, rather than changing it to
what's possible (which can be a -much- bigger set of CPUs on some
configurations, and likely surprising to existing cpuset-aware code.)
Automatically adding newly onlined CPUs to just the top cpuset (but not
to other cpusets) does treat the top cpuset as a special case, but I
doubt this will surprise existing cpuset-aware code, and corresponds
nicely to what hotplug-aware, cpuset-clueless code will expect.

For removing CPUs, this is a bit harder, as one cannot remove a CPU
from a cpuset without first removing it from any child cpusets of
that cpuset. So I will need to scan the cpuset hierarchy, from the
bottom up, removing the CPU about to be offlined, and in the case that
this was the last CPU in a cpuset, finding some fallback setting for
that cpuset. I suspect that means moving any poor tasks left in that
soon to be useless (no CPUs) cpuset into the parent cpuset, then
emptying the cpus_allowed for that cpuset. If the user doesn't like
this default action, they should have done what they wanted first, by
adapting their cpuset configuration in anticipation of taking the CPU
offline. Taking CPUs offline will likely surprise (as in break)
existing cpuset aware code, but I don't know any way around that.
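
In pseudo-code, the offline side would look roughly like this (a sketch
only -- for_each_cpuset_bottom_up() and move_member_tasks_to_parent()
are hypothetical helpers, not existing kernel functions):

static void cpuset_cpu_offline(int cpu)
{
	struct cpuset *cs;

	/* children before parents, so a cpu is never still allowed
	 * in a child cpuset after it is removed from the parent */
	for_each_cpuset_bottom_up(cs) {
		cpu_clear(cpu, cs->cpus_allowed);
		/* fallback: don't strand tasks in a cpuset with no cpus */
		if (cpus_empty(cs->cpus_allowed))
			move_member_tasks_to_parent(cs);
	}
}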

In any event, as a workaround with existing kernels, I suspect you could
make use of the existing /sbin/hotplug mechanism to run the following
commands (bash syntax) to add a newly onlined CPU $cpu to the top cpuset's cpus:

test -d /dev/cpuset || mkdir /dev/cpuset
test -f /dev/cpuset/cpus || mount -t cpuset cpuset /dev/cpuset
/bin/echo $(</dev/cpuset/cpus),$cpu > /dev/cpuset/cpus

This workaround presumes that the number of the CPU being added, $cpu,
is visible to the user level hotplug script - offhand I don't know if
that is so or not.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-22 04:51:30

by Andrew Morton

Subject: Re: cpusets not cpu hotplug aware

On Mon, 21 Aug 2006 14:01:48 -0700
Paul Jackson <[email protected]> wrote:

> Anton wrote:
>
> > Maybe the notifier is the right way to go, but it seems strange to
> > create two copies of cpu_online_map (with the associated possibility of
> > the two getting out of sync).
>
> Every cpuset in the system, of which this top_cpuset is just the first,
> has a set of cpus and memory nodes on which tasks in that cpuset are
> allowed to operate. It's not just top_cpuset that we need to understand
> how to relate to hotplug and unplug.
>
> I'll bet there is more hidden state in mm/mempolicy.c, for mbind()
> and set_mempolicy(), and in kernel/sched.c for sched_setaffinity(),
> which was derived from what memory nodes or cpus were online. For
> example, I see several fields in 'struct mempolicy' that contain
> node numbers in some form, and the 'cpus_allowed' field in the task
> struct that sched_setaffinity sets.
>
> How does hotplug and unplug interact with these various saved states?
>
> > It's up to the cpusets code to register a hotplug notifier to update the
> > top_cpuset maps.
>
> That, or user level code, when it adds or removes a cpu or a memory
> node, needs to be responsible for adding or removing that cpu or node
> to or from whichever cpusets are affected.
>
> For example, if you just added cpu 31 to a system that had been
> running on cpus 0 to 30, you can add cpu 31 to the top cpuset by
> doing:
>
> mkdir /dev/cpuset # if not already done
> mount -t cpuset cpuset /dev/cpuset # if not already done
> /bin/echo 0-31 > /dev/cpuset/cpus
>
> > If cpuset_cpus_allowed doesn't return the current online mask and we want
> > to schedule on a cpu that has been added since boot, it looks like we
> > will fail.
>
> In general, on systems actually using cpusets, that -is- what should
> happen. Just because a cpu was brought online doesn't mean it was
> intended to be allowed in any given task's current cpuset.
>
> Granted, I would guess users of systems not using cpusets (but that
> still have cpusets configured in their kernel, as is common in some
> distro kernels) would expect the behaviour you expected - bringing
> a cpu (or memory node) online or offline would make it available (or
> not) for something like a sched_setaffinity (or mbind/set_mempolicy)
> immediately, without having to invoke some magic cpuset voodoo.
>
> Offhand, this sounds to me like a choice of two modes of operation.
>
> If you aren't actually using cpusets (so every task is in the
> original top_cpuset) then you'd expect that cpuset to "get out
> of the way", meaning top_cpuset (the only cpuset, in this case)
> tracked whatever cpus and nodes were online at the moment.
>
> If instead you start messing with cpusets (creating more than one
> of them and moving tasks between them) then you'd expect cpusets
> to be enforced, without automatically adding newly added cpus or
> memory nodes to existing cpusets. Only the user knows which
> cpusets should get the added cpus or memory nodes in this case.
>
> I don't jump for joy over yet another modal state flag. But I don't see
> a better alternative -- do you?
>

If the kernel provider (ie: distro) has enabled cpusets then it would be
appropriate that they also provide a hotplug script which detects whether their
user is actually using cpusets and if not, to take some sensible default action.
ie: add the newly-added CPU to the system's single cpuset, no?

iow: perhaps send a patch against the upstream udev package.

2006-08-22 05:04:12

by Nathan Lynch

Subject: Re: cpusets not cpu hotplug aware

Andrew Morton wrote:
> On Mon, 21 Aug 2006 14:01:48 -0700
> Paul Jackson <[email protected]> wrote:
> > Anton wrote:
> > > If cpuset_cpus_allowed doesn't return the current online mask and we want
> > > to schedule on a cpu that has been added since boot, it looks like we
> > > will fail.
> >
> > In general, on systems actually using cpusets, that -is- what should
> > happen. Just because a cpu was brought online doesn't mean it was
> > intended to be allowed in any given task's current cpuset.
> >
> > Granted, I would guess users of systems not using cpusets (but that
> > still have cpusets configured in their kernel, as is common in some
> > distro kernels) would expect the behaviour you expected - bringing
> > a cpu (or memory node) online or offline would make it available (or
> > not) for something like a sched_setaffinity (or mbind/set_mempolicy)
> > immediately, without having to invoke some magic cpuset voodoo.
> >
> > Offhand, this sounds to me like a choice of two modes of operation.
> >
> > If you aren't actually using cpusets (so every task is in the
> > original top_cpuset) then you'd expect that cpuset to "get out
> > of the way", meaning top_cpuset (the only cpuset, in this case)
> > tracked whatever cpus and nodes were online at the moment.
> >
> > If instead you start messing with cpusets (creating more than one
> > of them and moving tasks between them) then you'd expect cpusets
> > to be enforced, without automatically adding newly added cpus or
> > memory nodes to existing cpusets. Only the user knows which
> > cpusets should get the added cpus or memory nodes in this case.
> >
> > I don't jump for joy over yet another modal state flag. But I don't see
> > a better alternative -- do you?
> >
>
> If the kernel provider (ie: distro) has enabled cpusets then it would be
> appropriate that they also provide a hotplug script which detects whether their
> user is actually using cpusets and if not, to take some sensible default action.
> ie: add the newly-added CPU to the system's single cpuset, no?

I think it would be more sensible for the default (i.e. user hasn't
explicitly configured any cpusets) behavior on a CONFIG_CPUSETS=y
kernel to match the behavior with a CONFIG_CPUSETS=n kernel. Right
now we don't have that. Binding a task to a newly added cpu just
fails if CONFIG_CPUSETS=y, but it works if CONFIG_CPUSETS=n.

2006-08-22 05:06:32

by Paul Jackson

Subject: Re: cpusets not cpu hotplug aware

Andrew wrote:
> If the kernel provider (ie: distro) has enabled cpusets then it would be
> appropriate that they also provide a hotplug script which detects whether their
> user is actually using cpusets and if not, to take some sensible default action

Interesting point - whether the default action to fix up the
cpuset configuration when a CPU goes online or offline should be:
1) coded in kernel/cpuset.c, or
2) coded in a hotplug script.

I've got 25 cents that says Andrew votes for (2).

At least so far as the cpuset aware portion of the coding goes,
it would probably be easier for me to code in a hotplug script.

I should learn enough about how hotplug scripts work to see if
this will really work.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-22 05:10:30

by Andrew Morton

Subject: Re: [PATCH] cpuset code prevents binding tasks to new cpus

On Mon, 21 Aug 2006 21:42:24 -0700
Paul Jackson <[email protected]> wrote:

> > Will it actually break anything?
>
> It will break any user code that thought it could actually run on the
> CPUs listed in its cpuset, if it happens to be in a task in the top
> cpuset. I'm pretty sure I've got some cpuset users who would be
> harmed by this change.
>
> One degree of freedom for change we do have is that, as best I know,
> no one is doing anything serious with both cpu hotplug/unplug and with
> cpusets at the same time on the same system. For instance, I am
> willing to wager that no one is counting on the CPUs in the top cpuset
> remaining constant if a CPU comes on or off line. I say this because
> so far as I know, serious cpuset users aren't taking CPUs on and
> off line. I've rather been expecting cpusets and cpu hotplug to
> butt heads for a couple of years now. Looks like the time has come.
>
> So if I'm right, we could change the API to have the top_cpuset
> cpus_allowed track the cpu_online_map dynamically, rather than being a
> static copy of the cpu_online_map value at system boot, with little
> or no negative impact on users.
>
> For adding CPUs, that is easy enough, using a register_cpu_notifier
> callback to add new CPUs to top_cpuset.cpus_allowed.

register_hotcpu_notifier(), please. That exists so all the cpu-hotplug
goop can be put inside #ifdef CONFIG_HOTPLUG_CPU.

> This may seem
> like overkill to Nathan, but I think it is necessary, to keep the
> top cpuset's CPUs tracking what's online, rather than changing it to
> what's possible (which can be a -much- bigger set of CPUs on some
> configurations, and likely surprising to existing cpuset-aware code.)
> Automatically adding newly onlined CPUs to just the top cpuset (but not
> to other cpusets) does treat the top cpuset as a special case, but I
> doubt this will surprise existing cpuset-aware code, and corresponds
> nicely to what hotplug-aware, cpuset-clueless code will expect.
>
> For removing CPUs, this is a bit harder, as one cannot remove a CPU
> from a cpuset without first removing it from any child cpusets of
> that cpuset. So I will need to scan the cpuset hierarchy, from the
> bottom up, removing the CPU about to be offlined, and in the case that
> this was the last CPU in a cpuset, finding some fallback setting for
> that cpuset. I suspect that means moving any poor tasks left in that
> soon to be useless (no CPUs) cpuset into the parent cpuset, then
> emptying the cpus_allowed for that cpuset. If the user doesn't like
> this default action, they should have done what they wanted first, by
> adapting their cpuset configuration in anticipation of taking the CPU
> offline. Taking CPUs offline will likely surprise (as in break)
> existing cpuset aware code, but I don't know any way around that.
>
> In any event, as a workaround with existing kernels, I suspect you could
> make use of the existing /sbin/hotplug mechanism to run the following
> commands (bash syntax) to add a newly onlined CPU $cpu to the top cpuset's cpus:
>
> test -d /dev/cpuset || mkdir /dev/cpuset
> test -f /dev/cpuset/cpus || mount -t cpuset cpuset /dev/cpuset
> /bin/echo $(</dev/cpuset/cpus),$cpu > /dev/cpuset/cpus
>
> This workaround presumes that the number of the CPU being added, $cpu,
> is visible to the user level hotplug script - offhand I don't know if
> that is so or not.

Please, let's get this into the udev tarball and let it trickle out.

2006-08-22 05:11:56

by Andrew Morton

Subject: Re: cpusets not cpu hotplug aware

On Tue, 22 Aug 2006 00:04:01 -0500
Nathan Lynch <[email protected]> wrote:

> > If the kernel provider (ie: distro) has enabled cpusets then it would be
> > appropriate that they also provide a hotplug script which detects whether their
> > user is actually using cpusets and if not, to take some sensible default action.
> > ie: add the newly-added CPU to the system's single cpuset, no?
>
> I think it would be more sensible for the default (i.e. user hasn't
> explicitly configured any cpusets) behavior on a CONFIG_CPUSETS=y
> kernel to match the behavior with a CONFIG_CPUSETS=n kernel.

Well... why? If he-who-configured the kernel chose cpusets then it's
reasonable to expect he-who-configures to have appropriate userspace
support for that kernel.

> Right
> now we don't have that. Binding a task to a newly added cpu just
> fails if CONFIG_CPUSETS=y, but it works if CONFIG_CPUSETS=n.

Sure. That's because we're not taking appropriate (ie any) action when a
CPU is added.

2006-08-22 05:15:07

by Andrew Morton

Subject: Re: cpusets not cpu hotplug aware

On Mon, 21 Aug 2006 22:06:25 -0700
Paul Jackson <[email protected]> wrote:

> Andrew wrote:
> > If the kernel provider (ie: distro) has enabled cpusets then it would be
> > appropriate that they also provide a hotplug script which detects whether their
> > user is actually using cpusets and if not, to take some sensible default action
>
> Interesting point - whether the default action to fix up the
> cpuset configuration when a CPU goes online or offline should be:
> 1) coded in kernel/cpuset.c, or
> 2) coded in a hotplug script.
>
> I've got 25 cents that says Andrew votes for (2).
>
> At least so far as the cpuset aware portion of the coding goes,
> it would probably be easier for me to code in a hotplug script.
>
> I should learn enough about how hotplug scripts work to see if
> this will really work.
>

Well... let's suck it and see (please). If for some reason it proves
inadequate and the default kernel behaviour is significantly wrong (it
seems to be) then there's an argument for modifying (ie: adding complexity
to) the kernel.

But I don't think we yet know that.

2006-08-22 05:14:50

by Paul Jackson

Subject: Re: cpusets not cpu hotplug aware

Nathan wrote:
> I think it would be more sensible for the default (i.e. user hasn't
> explicitly configured any cpusets) behavior on a CONFIG_CPUSETS=y
> kernel to match the behavior with a CONFIG_CPUSETS=n kernel.

Basically, in this situation, that would mean that the code that
masks a requested sched_setaffinity cpumask with the task's
cpuset_cpus_allowed():

	cpus_allowed = cpuset_cpus_allowed(p);
	cpus_and(new_mask, new_mask, cpus_allowed);

would not be executed in the case of a kernel configured for cpusets,
when running on a system that wasn't using cpusets.

That's quite doable, and for systems not actually using cpusets, makes
good sense.
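
Concretely, that bypass might look something like this (an untested
sketch; cpuset_task_in_top_cpuset() is a hypothetical helper that would
live in kernel/cpuset.c, since top_cpuset is private to that file):

/* in kernel/cpuset.c -- hypothetical helper */
int cpuset_task_in_top_cpuset(struct task_struct *p)
{
	return p->cpuset == &top_cpuset;
}

/* in sched_setaffinity() -- only constrain tasks that have been
 * placed in some cpuset below the root */
if (!cpuset_task_in_top_cpuset(p)) {
	cpus_allowed = cpuset_cpus_allowed(p);
	cpus_and(new_mask, new_mask, cpus_allowed);
}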

But it makes a bit of discontinuity in the system behaviour when
someone starts using cpusets. If someone makes a cpuset, then suddenly
tasks in the top cpuset start seeing failed sched_setaffinity calls
for any CPU that was brought online after system boot.

If there is some decent way I can get the cpus_allowed of the top
cpuset to track the cpu_online_map, then we avoid this discontinuity
in system behaviour.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-22 05:21:16

by Paul Jackson

Subject: Re: cpusets not cpu hotplug aware

Andrew wrote:
> > I should learn enough about how hotplug scripts work to see if
> > this will really work.
> >
>
> Well... let's suck it and see (please).

Yup - good plan - will do.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-23 15:01:33

by Anton Blanchard

Subject: Re: cpusets not cpu hotplug aware


Hi,

> Well... let's suck it and see (please). If for some reason it proves
> inadequate and the default kernel behaviour is significantly wrong (it
> seems to be) then there's an argument for modifying (ie: adding complexity
> to) the kernel.

I think there is. We have a userspace-visible change to the sched
affinity API when the cpusets option is enabled. Papering over it with a
udev callback doesn't sound like the right solution.

I'm struggling to understand why we have this problem at all. If a task
has not been touched by cpuset calls it should be allowed to use any
cpu. I completely agree that once you have partitioned the task with
cpusets then it should never spill onto more recently hotplug-added
cpus.

Anton

2006-08-23 19:00:36

by Paul Jackson

Subject: Re: cpusets not cpu hotplug aware

Anton wrote:
> I'm struggling to understand why we have this problem at all. If a task
> has not been touched by cpuset calls it should be allowed to use any
> cpu.

Cpusets are not like sched_setaffinity. It's not something that is
optional for a task.

Rather, if your kernel config has CPUSETS enabled, then every task is
in a cpuset, always. By default, every task is in the top_cpuset.

Today, the top_cpuset has a static copy of the online cpus as of
system boot.

We need to make the cpus in the top_cpuset become a dynamic copy of the
online cpus and nodes, not a static copy.

I'm confident udev can do this.

Right now I'm working through some local problems testing this, and
unraveling some complexities with multiple versions of the udev user
code, including at least:
1) one version uses the kernel default /sbin/hotplug to callback,
2) another version uses a udevd daemon and /etc/dev.d, and
3) a third version uses udevd, but no /etc/dev.d.

The cpuset side of coding this is easy enough for me; the udev side
is more difficult.

Any suggestions on the best place for me to go with more detailed udev
questions?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-23 19:08:47

by Paul Jackson

Subject: Re: cpusets not cpu hotplug aware

Anton wrote:
> We have a userspace-visible change to the sched
> affinity API when the cpusets option is enabled.

Yup - on cpuset kernels, sched_setaffinity is constrained to CPUs in a
task's cpuset. Also mbind and set_mempolicy are constrained to nodes in
the task's cpuset.

This is a long standing, intentional part of the cpuset design.

Using udev isn't papering over it. It's reflecting the reasonable
preference and expectation on systems using hotplug/unplug that
the default cpuset dynamically reflect the current online maps,
not the static maps as they were at boot. And using udev is using
just the sort of mechanism one might expect to be used to adjust
the system when devices go on or off line.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-23 22:11:24

by Nathan Lynch

Subject: Re: cpusets not cpu hotplug aware

Paul Jackson wrote:
> Nathan wrote:
> > I think it would be more sensible for the default (i.e. user hasn't
> > explicitly configured any cpusets) behavior on a CONFIG_CPUSETS=y
> > kernel to match the behavior with a CONFIG_CPUSETS=n kernel.
>
> Basically, in this situation, that would mean that the code that
> masks a requested sched_setaffinity cpumask with the task's
> cpuset_cpus_allowed():
>
> 	cpus_allowed = cpuset_cpus_allowed(p);
> 	cpus_and(new_mask, new_mask, cpus_allowed);
>
> would not be executed in the case of a kernel configured for cpusets,
> when running on a system that wasn't using cpusets.
>
> That's quite doable, and for systems not actually using cpusets, makes
> good sense.
>
> But it makes a bit of discontinuity in the system behaviour when
> someone starts using cpusets. If someone makes a cpuset, then suddenly
> tasks in the top cpuset start seeing failed sched_setaffinity calls
> for any CPU that was brought online after system boot.
>
> If there is some decent way I can get the cpus_allowed of the top
> cpuset to track the cpu_online_map, then we avoid this discontinuity
> in system behaviour.


How about this? I've verified it fixes the issue but I'm nervous
about the locking.


--- cpuhp-sched_setaffinity.orig/kernel/cpuset.c
+++ cpuhp-sched_setaffinity/kernel/cpuset.c
@@ -2033,6 +2033,31 @@ out:
 	return err;
 }
 
+static int cpuset_handle_cpuhp(struct notifier_block *nb,
+				unsigned long phase, void *_cpu)
+{
+	unsigned long cpu = (unsigned long)_cpu;
+
+	mutex_lock(&manage_mutex);
+	lock_cpu_hotplug();
+	mutex_lock(&callback_mutex);
+
+	switch (phase) {
+	case CPU_ONLINE:
+		cpu_set(cpu, top_cpuset.cpus_allowed);
+		break;
+	case CPU_DEAD:
+		cpu_clear(cpu, top_cpuset.cpus_allowed);
+		break;
+	}
+
+	mutex_unlock(&callback_mutex);
+	unlock_cpu_hotplug();
+	mutex_unlock(&manage_mutex);
+
+	return 0;
+}
+
 /**
  * cpuset_init_smp - initialize cpus_allowed
  *
@@ -2043,6 +2068,8 @@ void __init cpuset_init_smp(void)
 {
 	top_cpuset.cpus_allowed = cpu_online_map;
 	top_cpuset.mems_allowed = node_online_map;
+
+	hotcpu_notifier(cpuset_handle_cpuhp, 0);
 }
 
 /**

2006-08-23 22:42:33

by Paul Jackson

Subject: Re: cpusets not cpu hotplug aware

Nathan wrote:
> How about this?

The code likely works, and the locking seems ok at first blush.
And this patch seems to match just what I asked for ;).

But the more I think about this, the less I like this direction.

Your patch, and what I initially asked for, impose a policy and create
a side effect. When you bring a cpu online, the top cpuset changes as
a side effect, in order to impose a policy that the top cpuset tracks
what is online.

The kernel should avoid such side effects and avoid imposing policy.

It should be user code that imposes the policy that the top cpuset
tracks what is online.

The kernel gets things going with reasonable basic defaults at system
boot, then adapts to whatever user space mandates from then on.

Kernels should provide generic, orthogonal mechanisms.

Let user space figure out what it wants to do with them.

It is not a kernel bug that the top cpuset doesn't track what is
online. It would be a kernel bug if the top cpuset didn't allow just
exactly whatever cpus the user space code told it to allow.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-23 23:39:53

by Nathan Lynch

Subject: Re: cpusets not cpu hotplug aware

Paul Jackson wrote:
> Nathan wrote:
> > How about this?
>
> The code likely works, and the locking seems ok at first blush.
> And this patch seems to match just what I asked for ;).
>
> But the more I think about this, the less I like this direction.
>
> Your patch, and what I initially asked for, impose a policy and create
> a side effect. When you bring a cpu online, the top cpuset changes as
> a side effect, in order to impose a policy that the top cpuset tracks
> what is online.
>
> The kernel should avoid such side effects and avoid imposing policy.

If you want to call that a policy, that's fine, but it hardly seems
like non-intuitive behavior to me, especially when the administrator
has not explicitly configured any cpusets.


> It should be user code that imposes the policy that the top cpuset
> tracks what is online.
>
> The kernel gets things going with reasonable basic defaults at system
> boot, then adapts to whatever user space mandates from then on.
>
> Kernels should provide generic, orthogonal mechanisms.

Maybe the cpuset code should just stay out of the way unless the admin
has instantiated one?


> Let user space figure out what it wants to do with them.

The user who initially reported this probably has no idea what cpusets
are or how to deal with the situation.


> It is not a kernel bug that the top cpuset doesn't track what is
> online.

It's a bug that one's ability to bind a task to a new cpu is entirely
dependent on whether you know your way around cpusets. Doesn't sound
orthogonal to me.

Try looking at it this way. You have an application that works on
distro x, where cpu hotplug is supported but cpusets is not enabled in
the kernel config. That application is cpu hotplug-aware, and binding
a task to a new cpu just works. Now, with distro x+1, cpusets is
enabled and that same scenario doesn't work anymore. That's generally
viewed as a regression.

And no, I don't like pushing the responsibility for fixing up the
top-level cpuset out to userspace -- that would force apps to
busy-wait (and for how long?) for the new cpu to be added to the
top-level cpuset by the hotplug helper. It means something as simple
as this:

# echo 1 > /sys/devices/system/cpu/cpu3/online
# taskset 0x8 foo

has a race condition, depending on your kernel's configuration.

2006-08-23 23:50:04

by Nathan Lynch

Subject: Re: cpusets not cpu hotplug aware

Nathan Lynch wrote:
> Paul Jackson wrote:
> >
> > If there is some decent way I can get the cpus_allowed of the top
> > cpuset to track the cpu_online_map, then we avoid this discontinuity
> > in system behaviour.
>
>
> How about this? I've verified it fixes the issue but I'm nervous
> about the locking.
>
>
> --- cpuhp-sched_setaffinity.orig/kernel/cpuset.c
> +++ cpuhp-sched_setaffinity/kernel/cpuset.c
> @@ -2033,6 +2033,31 @@ out:
>  	return err;
>  }
> 
> +static int cpuset_handle_cpuhp(struct notifier_block *nb,
> +				unsigned long phase, void *_cpu)
> +{
> +	unsigned long cpu = (unsigned long)_cpu;
> +
> +	mutex_lock(&manage_mutex);
> +	lock_cpu_hotplug();
> +	mutex_lock(&callback_mutex);
> +
> +	switch (phase) {
> +	case CPU_ONLINE:
> +		cpu_set(cpu, top_cpuset.cpus_allowed);
> +		break;
> +	case CPU_DEAD:
> +		cpu_clear(cpu, top_cpuset.cpus_allowed);
> +		break;
> +	}
> +
> +	mutex_unlock(&callback_mutex);
> +	unlock_cpu_hotplug();
> +	mutex_unlock(&manage_mutex);
> +
> +	return 0;
> +}

Actually, the lock/unlock_cpu_hotplug() calls aren't necessary;
cpu_add_remove_lock is already held in this context.


Update cpus_allowed in top_cpuset when cpus are hotplugged;
otherwise binding a task to a newly hotplugged cpu fails since the
toplevel cpuset has a static copy of whatever cpu_online_map was at
boot time.

Signed-off-by: Nathan Lynch <[email protected]>

--- cpuhp-sched_setaffinity.orig/kernel/cpuset.c
+++ cpuhp-sched_setaffinity/kernel/cpuset.c
@@ -2033,6 +2033,29 @@ out:
 	return err;
 }
 
+static int cpuset_handle_cpuhp(struct notifier_block *nb,
+				unsigned long phase, void *_cpu)
+{
+	unsigned long cpu = (unsigned long)_cpu;
+
+	mutex_lock(&manage_mutex);
+	mutex_lock(&callback_mutex);
+
+	switch (phase) {
+	case CPU_ONLINE:
+		cpu_set(cpu, top_cpuset.cpus_allowed);
+		break;
+	case CPU_DEAD:
+		cpu_clear(cpu, top_cpuset.cpus_allowed);
+		break;
+	}
+
+	mutex_unlock(&callback_mutex);
+	mutex_unlock(&manage_mutex);
+
+	return 0;
+}
+
 /**
  * cpuset_init_smp - initialize cpus_allowed
  *
@@ -2043,6 +2066,8 @@ void __init cpuset_init_smp(void)
 {
 	top_cpuset.cpus_allowed = cpu_online_map;
 	top_cpuset.mems_allowed = node_online_map;
+
+	hotcpu_notifier(cpuset_handle_cpuhp, 0);
 }
 
 /**

2006-08-24 00:17:04

by Andrew Morton

Subject: Re: cpusets not cpu hotplug aware

On Wed, 23 Aug 2006 18:49:53 -0500
Nathan Lynch <[email protected]> wrote:

> Update cpus_allowed in top_cpuset when cpus are hotplugged;
> otherwise binding a task to a newly hotplugged cpu fails since the
> toplevel cpuset has a static copy of whatever cpu_online_map was at
> boot time.
>

I must say, that's a pretty simple patch. It beats dealing with udev
developers ;)

>
> --- cpuhp-sched_setaffinity.orig/kernel/cpuset.c
> +++ cpuhp-sched_setaffinity/kernel/cpuset.c
> @@ -2033,6 +2033,29 @@ out:
>  	return err;
>  }
> 
> +static int cpuset_handle_cpuhp(struct notifier_block *nb,
> +				unsigned long phase, void *_cpu)
> +{
> +	unsigned long cpu = (unsigned long)_cpu;
> +
> +	mutex_lock(&manage_mutex);
> +	mutex_lock(&callback_mutex);
> +
> +	switch (phase) {
> +	case CPU_ONLINE:
> +		cpu_set(cpu, top_cpuset.cpus_allowed);
> +		break;
> +	case CPU_DEAD:
> +		cpu_clear(cpu, top_cpuset.cpus_allowed);
> +		break;
> +	}
> +
> +	mutex_unlock(&callback_mutex);
> +	mutex_unlock(&manage_mutex);
> +
> +	return 0;
> +}
> +

The above needs #ifdef CONFIG_HOTPLUG_CPU wrappers.

>  /**
>   * cpuset_init_smp - initialize cpus_allowed
>   *
> @@ -2043,6 +2066,8 @@ void __init cpuset_init_smp(void)
>  {
>  	top_cpuset.cpus_allowed = cpu_online_map;
>  	top_cpuset.mems_allowed = node_online_map;
> +
> +	hotcpu_notifier(cpuset_handle_cpuhp, 0);
>  }
> 
>  /**

2006-08-24 00:49:06

by Nathan Lynch

Subject: [PATCH] cpuset code prevents binding tasks to new cpus

Andrew Morton wrote:
> On Wed, 23 Aug 2006 18:49:53 -0500
> Nathan Lynch <[email protected]> wrote:
>
> >
> > +static int cpuset_handle_cpuhp(struct notifier_block *nb,
> > +				unsigned long phase, void *_cpu)
> > +{
> > +	unsigned long cpu = (unsigned long)_cpu;
> > +
> > +	mutex_lock(&manage_mutex);
> > +	mutex_lock(&callback_mutex);
> > +
> > +	switch (phase) {
> > +	case CPU_ONLINE:
> > +		cpu_set(cpu, top_cpuset.cpus_allowed);
> > +		break;
> > +	case CPU_DEAD:
> > +		cpu_clear(cpu, top_cpuset.cpus_allowed);
> > +		break;
> > +	}
> > +
> > +	mutex_unlock(&callback_mutex);
> > +	mutex_unlock(&manage_mutex);
> > +
> > +	return 0;
> > +}
> > +
>
> The above needs #ifdef CONFIG_HOTPLUG_CPU wrappers.

Fixed, thanks.


Update cpus_allowed in top_cpuset when cpus are hotplugged; otherwise
binding a task to a newly hotplugged cpu fails since the toplevel
cpuset has a static copy of whatever cpu_online_map was at boot time.

Signed-off-by: Nathan Lynch <[email protected]>


--- cpuhp-sched_setaffinity.orig/kernel/cpuset.c
+++ cpuhp-sched_setaffinity/kernel/cpuset.c
@@ -2033,6 +2033,31 @@ out:
 	return err;
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
+static int cpuset_handle_cpuhp(struct notifier_block *nb,
+				unsigned long phase, void *_cpu)
+{
+	unsigned long cpu = (unsigned long)_cpu;
+
+	mutex_lock(&manage_mutex);
+	mutex_lock(&callback_mutex);
+
+	switch (phase) {
+	case CPU_ONLINE:
+		cpu_set(cpu, top_cpuset.cpus_allowed);
+		break;
+	case CPU_DEAD:
+		cpu_clear(cpu, top_cpuset.cpus_allowed);
+		break;
+	}
+
+	mutex_unlock(&callback_mutex);
+	mutex_unlock(&manage_mutex);
+
+	return 0;
+}
+#endif
+
 /**
  * cpuset_init_smp - initialize cpus_allowed
  *
@@ -2043,6 +2068,8 @@ void __init cpuset_init_smp(void)
 {
 	top_cpuset.cpus_allowed = cpu_online_map;
 	top_cpuset.mems_allowed = node_online_map;
+
+	hotcpu_notifier(cpuset_handle_cpuhp, 0);
 }
 
 /**

2006-08-24 02:20:09

by Paul Jackson

Subject: Re: cpusets not cpu hotplug aware

Nathan wrote:
> # echo 1 > /sys/devices/system/cpu/cpu3/online
> # taskset 0x8 foo
>
> has a race condition, depending on your kernel's configuration.

Good point. That's not a nice race to force on users.

I relent -- not udev.


> Maybe the cpuset code should just stay out of the way unless the admin
> has instantiated one?

I really don't like where such special-case, modal behaviour leads us.

Let's say we have two systems, side by side. Both have been up for
many months. On one of these systems, one time, a few months ago, the
sysadmin briefly made and removed a cpuset:

mkdir /dev/cpuset
mount -t cpuset cpuset /dev/cpuset
mkdir /dev/cpuset/foo
rmdir /dev/cpuset/foo
umount /dev/cpuset
rmdir /dev/cpuset

On the other of these systems, the sysadmin never did any such thing.
These two systems are now identical in every other aspect that might
matter to this discussion.

The sched_setaffinity call should behave the same on these two systems.

If the kernel has to impose a trivial bit of policy (such as forcing
the top cpuset to track what's online) in order to provide uniformly
consistent (not modal) and race-free behaviour, then it should.

And besides, as someone else noted, it's a lot easier than dealing with
udev ;).

Conclusion - the kernel should simply force the top_cpuset to track the
online maps. See further my response to your patch.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-08-24 02:54:28

by Paul Jackson

Subject: Re: cpusets not cpu hotplug aware

Simon,

Take note below of my proposal to make the 'cpus' and 'mems'
of the top cpuset read-only. Holler if that hurts.

Nathan wrote:
> +static int cpuset_handle_cpuhp(struct notifier_block *nb,
> +				unsigned long phase, void *_cpu)
> +{
> +	unsigned long cpu = (unsigned long)_cpu;
> +
> +	mutex_lock(&manage_mutex);
> +	mutex_lock(&callback_mutex);
> +
> +	switch (phase) {
> +	case CPU_ONLINE:
> +		cpu_set(cpu, top_cpuset.cpus_allowed);
> +		break;
> +	case CPU_DEAD:
> +		cpu_clear(cpu, top_cpuset.cpus_allowed);
> +		break;
> +	}
> +
> +	mutex_unlock(&callback_mutex);
> +	mutex_unlock(&manage_mutex);
> +
> +	return 0;
> +}
> +

Andrew commented:
> I must say, that's a pretty simple patch.

Not simple enough ;).

How about (uncompiled, untested, unanything):

=================================================================
/*
 * The top_cpuset tracks what CPUs and Memory Nodes are online,
 * period.  This is necessary in order to make cpusets transparent
 * (of no effect) on systems that are actively using CPU hotplug
 * but making no active use of cpusets.
 *
 * This handles CPU hotplug (cpuhp) events.  If someday Memory
 * Nodes can be hotplugged (dynamically changing node_online_map)
 * then we should handle that too, perhaps in a similar way.
 */

#ifdef CONFIG_HOTPLUG_CPU
static int cpuset_handle_cpuhp(struct notifier_block *nb,
				unsigned long phase, void *cpu)
{
	mutex_lock(&manage_mutex);
	mutex_lock(&callback_mutex);

	top_cpuset.cpus_allowed = cpu_online_map;

	mutex_unlock(&callback_mutex);
	mutex_unlock(&manage_mutex);

	return 0;
}
#endif

... plus the hotcpu_notifier() initializer.
=================================================================

However, I get to spend the few lines of code I saved here elsewhere,
adding special-case code so that the top_cpuset's 'cpus' and 'mems'
files are read-only.

The new rule is simple:

The top cpuset's cpus and mems track what is online.

The user is no longer in direct control of these two cpuset settings.
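
The enforcement itself should only be a couple of lines in
update_cpumask() and update_nodemask(), the handlers behind the
'cpus' and 'mems' files -- something like (sketch, untested):

	/* top_cpuset tracks the online maps; refuse user writes */
	if (cs == &top_cpuset)
		return -EACCES;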

And I should add a few lines to Documentation/cpusets.txt, describing
this.

When I return from my son's 18th birthday party this evening, I will
see what I can whip up.

Thanks, Nathan.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401