2015-04-06 21:46:05

by Nishanth Aravamudan

Subject: Topology updates and NUMA-level sched domains

Hi Peter,

As you are very aware, I think, power has some odd NUMA topologies (and
changes to those topologies) at run-time. In particular, we can see a
topology at boot:

Node 0: all CPUs
Node 7: no CPUs

Then we get a notification from the hypervisor that a core (or two) have
moved from node 0 to node 7. This results in the:

[ 64.496687] BUG: arch topology borken
[ 64.496689] the CPU domain not a subset of the NUMA domain

messages for each moved CPU. I think this is because when we first came
up, we degrade (elide altogether?) the NUMA domain for node 7 as it has
no CPUs:

[ 0.305823] CPU0 attaching sched-domain:
[ 0.305831] domain 0: span 0-7 level SIBLING
[ 0.305834] groups: 0 (cpu_power = 146) 1 (cpu_power = 146) 2
(cpu_power = 146) 3 (cpu_power = 146) 4 (cpu_power = 146) 5 (cpu_power =
146) 6 (cpu_power = 146) 7 (cpu_power = 146)
[ 0.305854] domain 1: span 0-79 level CPU
[ 0.305856] groups: 0-7 (cpu_power = 1168) 8-15 (cpu_power = 1168)
16-23 (cpu_power = 1168) 24-31 (cpu_power = 1168) 32-39 (cpu_power =
1168) 40-47 (cpu_power = 1168) 48-55 (cpu_power = 1168) 56-63 (cpu_power
= 1168) 64-71 (cpu_power = 1168) 72-79 (cpu_power = 1168)

For those cpus that moved, we get after the update:

[ 64.505819] CPU8 attaching sched-domain:
[ 64.505821] domain 0: span 8-15 level SIBLING
[ 64.505823] groups: 8 (cpu_power = 147) 9 (cpu_power = 147) 10
(cpu_power = 147) 11 (cpu_power = 146) 12 (cpu_power = 147) 13
(cpu_power = 147) 14 (cpu_power = 146) 15 (cpu_power = 147)
[ 64.505842] domain 1: span 8-23,72-79 level CPU
[ 64.505845] groups: 8-15 (cpu_power = 1174) 16-23 (cpu_power =
1175) 72-79 (cpu_power = 1176)

while the non-modified CPUs report, correctly:

[ 64.497186] CPU0 attaching sched-domain:
[ 64.497189] domain 0: span 0-7 level SIBLING
[ 64.497192] groups: 0 (cpu_power = 147) 1 (cpu_power = 147) 2
(cpu_power = 146) 3 (cpu_power = 147) 4 (cpu_power = 147) 5 (cpu_power =
147) 6 (cpu_power = 147) 7 (cpu_power = 146)
[ 64.497213] domain 1: span 0-7,24-71 level CPU
[ 64.497215] groups: 0-7 (cpu_power = 1174) 24-31 (cpu_power =
1173) 32-39 (cpu_power = 1176) 40-47 (cpu_power = 1175) 48-55 (cpu_power
= 1176) 56-63 (cpu_power = 1175) 64-71 (cpu_power = 1174)
[ 64.497234] domain 2: span 0-79 level NUMA
[ 64.497236] groups: 0-7,24-71 (cpu_power = 8223) 8-23,72-79
(cpu_power = 3525)

It seems like we might need something like this (HORRIBLE HACK, I know,
just to get discussion):

@@ -6958,6 +6960,10 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],

 	/* Let architecture update cpu core mappings. */
 	new_topology = arch_update_cpu_topology();
+	/* Update NUMA topology lists */
+	if (new_topology) {
+		sched_init_numa();
+	}

 	n = doms_new ? ndoms_new : 0;

n = doms_new ? ndoms_new : 0;

or a re-init API (which won't try to reallocate various bits), because
the topology could be completely different now (e.g.,
sched_domains_numa_distance will also be inaccurate now). Really, a
topology update on power (not sure on s390x, but those are the only two
archs that return a positive value from arch_update_cpu_topology() right
now, afaics) is a lot like a hotplug event and we need to re-initialize
any dependent structures.

I'm just sending out feelers, as we can limp by with the above warning,
it seems, but it is less than ideal. Any help or insight you could provide
would be greatly appreciated!

-Nish


2015-04-07 10:22:03

by Peter Zijlstra

Subject: Re: Topology updates and NUMA-level sched domains

On Mon, Apr 06, 2015 at 02:45:58PM -0700, Nishanth Aravamudan wrote:
> Hi Peter,
>
> As you are very aware, I think, power has some odd NUMA topologies (and
> changes to those topologies) at run-time. In particular, we can see
> a topology at boot:
>
> Node 0: all CPUs
> Node 7: no CPUs
>
> Then we get a notification from the hypervisor that a core (or two) have
> moved from node 0 to node 7. This results in the:

> or a re-init API (which won't try to reallocate various bits), because
> the topology could be completely different now (e.g.,
> sched_domains_numa_distance will also be inaccurate now). Really, a
> topology update on power (not sure on s390x, but those are the only two
> archs that return a positive value from arch_update_cpu_topology() right
> now, afaics) is a lot like a hotplug event and we need to re-initialize
> any dependent structures.
>
> I'm just sending out feelers, as we can limp by with the above warning,
> it seems, but it is less than ideal. Any help or insight you could provide
> would be greatly appreciated!

So I think (and ISTR having stated this before) that dynamic cpu<->node
maps are absolutely insane.

There is a ton of stuff that assumes the cpu<->node relation is a boot
time fixed one. Userspace being one of them. Per-cpu memory another.

You simply cannot do this without causing massive borkage.

So please come up with a coherent plan to deal with the entire problem
of dynamic cpu to memory relation and I might consider the scheduler
impact. But we're not going to hack around and maybe make it not crash
in a few corner cases while the entire thing is shite.

2015-04-07 17:14:20

by Nishanth Aravamudan

Subject: Re: Topology updates and NUMA-level sched domains

On 07.04.2015 [12:21:47 +0200], Peter Zijlstra wrote:
> On Mon, Apr 06, 2015 at 02:45:58PM -0700, Nishanth Aravamudan wrote:
> > Hi Peter,
> >
> > As you are very aware, I think, power has some odd NUMA topologies (and
> > changes to those topologies) at run-time. In particular, we can see
> > a topology at boot:
> >
> > Node 0: all CPUs
> > Node 7: no CPUs
> >
> > Then we get a notification from the hypervisor that a core (or two) have
> > moved from node 0 to node 7. This results in the:
>
> > or a re-init API (which won't try to reallocate various bits), because
> > the topology could be completely different now (e.g.,
> > sched_domains_numa_distance will also be inaccurate now). Really, a
> > topology update on power (not sure on s390x, but those are the only two
> > archs that return a positive value from arch_update_cpu_topology() right
> > now, afaics) is a lot like a hotplug event and we need to re-initialize
> > any dependent structures.
> >
> > I'm just sending out feelers, as we can limp by with the above warning,
> > it seems, but it is less than ideal. Any help or insight you could provide
> > would be greatly appreciated!
>
> So I think (and ISTR having stated this before) that dynamic cpu<->node
> maps are absolutely insane.

Sorry if I wasn't involved at the time. I agree that it's a bit of a
mess!

> There is a ton of stuff that assumes the cpu<->node relation is a boot
> time fixed one. Userspace being one of them. Per-cpu memory another.

Well, userspace already deals with CPU hotplug, right? And the topology
updates are, in a lot of ways, just like you've hotplugged a CPU from
one node and re-hotplugged it into another node.

I'll look into the per-cpu memory case.

For what it's worth, our test teams are stressing the kernel with these
topology updates and hopefully we'll be able to resolve any issues that
result.

> You simply cannot do this without causing massive borkage.
>
> So please come up with a coherent plan to deal with the entire problem
> of dynamic cpu to memory relation and I might consider the scheduler
> impact. But we're not going to hack around and maybe make it not crash
> in a few corner cases while the entire thing is shite.

Well, it doesn't crash now. In fact, it stays up reasonably well and
seems to dtrt (from the kernel perspective) other than the sched domain
messages.

I will look into per-cpu memory, and also another case I have been
thinking about where if a process is bound to a CPU/node combination via
numactl and then the topology changes, what exactly will happen. In
theory, via these topology updates, a node could go from memoryless ->
not and v.v., which seems like it might not be well supported (but
again, should not be much different from hotplugging all the memory out
from a node).

And, in fact, topologically speaking, I think I should be able
to repeat the same sched domain warnings if I start off with a 2-node
system with all CPUs on one node, and then hotplug a CPU onto the second
node, right? That has nothing to do with power, that I can tell. I'll
see if I can demonstrate it via a KVM guest.

Thanks for your quick response!

-Nish

2015-04-07 19:41:47

by Peter Zijlstra

Subject: Re: Topology updates and NUMA-level sched domains

On Tue, Apr 07, 2015 at 10:14:10AM -0700, Nishanth Aravamudan wrote:
> > So I think (and ISTR having stated this before) that dynamic cpu<->node
> > maps are absolutely insane.
>
> Sorry if I wasn't involved at the time. I agree that it's a bit of a
> mess!
>
> > There is a ton of stuff that assumes the cpu<->node relation is a boot
> > time fixed one. Userspace being one of them. Per-cpu memory another.
>
> Well, userspace already deals with CPU hotplug, right?

Barely, mostly not.

> And the topology
> updates are, in a lot of ways, just like you've hotplugged a CPU from
> one node and re-hotplugged it into another node.

No, that's very much not the same. Even if it were dealing with hotplug
it would still assume the cpu to return to the same node.

But mostly people do not even bother to handle hotplug.

People very much assume that when they set up their node affinities they
will remain the same for the life time of their program. People set
separate cpu affinity with sched_setaffinity() and memory affinity with
mbind() and assume the cpu<->node maps are invariant.

> I'll look into the per-cpu memory case.

Look into everything that does cpu_to_node() based allocations, because
they all assume that that is stable.

They allocate memory at init time to be node local, but then you go and
mess that up.

> For what it's worth, our test teams are stressing the kernel with these
> topology updates and hopefully we'll be able to resolve any issues that
> result.

Still absolutely insane.

> I will look into per-cpu memory, and also another case I have been
> thinking about where if a process is bound to a CPU/node combination via
> numactl and then the topology changes, what exactly will happen. In
> theory, via these topology updates, a node could go from memoryless ->
> not and v.v., which seems like it might not be well supported (but
> again, should not be much different from hotplugging all the memory out
> from a node).

memory hotplug is even less well handled than cpu hotplug.

And yes, the fact that you need to go look into WTF happens when people
use numactl should be a big arse red flag. _That_ is breaking userspace.

> > And, in fact, topologically speaking, I think I should be able
> to repeat the same sched domain warnings if I start off with a 2-node
> system with all CPUs on one node, and then hotplug a CPU onto the second
> node, right? That has nothing to do with power, that I can tell. I'll
> see if I can demonstrate it via a KVM guest.

Uhm, no. CPUs will not first appear on node 0 only to then appear on
node 1 later.

If you have a cpu-less node 1 and then hotplug cpus in, they will start
and end their lives on node 1; they'll never be part of node 0.

Also, cpu/memory-less nodes + hotplug to later populate them are
crazy in that they never get the performance you get from regular
setups. It's impossible to get node-local right.

2015-04-08 10:32:07

by Brice Goglin

Subject: Re: Topology updates and NUMA-level sched domains

Le 07/04/2015 21:41, Peter Zijlstra a écrit :
> No, that's very much not the same. Even if it were dealing with hotplug
> it would still assume the cpu to return to the same node.
>
> But mostly people do not even bother to handle hotplug.
>

You said userspace assumes the cpu<->node relation is a boot-time fixed
one, and hotplug breaks this. How do you expect userspace to handle
hotplug? Is there a convenient way to be notified when a CPU (or memory)
is unplugged?

thanks
Brice

2015-04-08 10:52:27

by Peter Zijlstra

Subject: Re: Topology updates and NUMA-level sched domains

On Wed, Apr 08, 2015 at 12:32:01PM +0200, Brice Goglin wrote:
> Le 07/04/2015 21:41, Peter Zijlstra a écrit :
> > No, that's very much not the same. Even if it were dealing with hotplug
> > it would still assume the cpu to return to the same node.
> >
> > But mostly people do not even bother to handle hotplug.
> >
>
> You said userspace assumes the cpu<->node relation is a boot-time fixed
> one, and hotplug breaks this.

I said no such thing. Regular hotplug actually respects that relation.

> How do you expect userspace to handle hotplug?

Mostly not. Why would they? CPU hotplug is rare and mostly a case of:
don't do that then.

Its just that some of the virt wankers are using it for resource
management which is entirely misguided. Then again, most of virt is.

> Is there a convenient way to be notified when a CPU (or memory)
> is unplugged?

I think you can poll some sysfs file or other.

2015-04-09 22:30:06

by Nishanth Aravamudan

Subject: Re: Topology updates and NUMA-level sched domains

On 07.04.2015 [21:41:29 +0200], Peter Zijlstra wrote:
> On Tue, Apr 07, 2015 at 10:14:10AM -0700, Nishanth Aravamudan wrote:
> > > So I think (and ISTR having stated this before) that dynamic cpu<->node
> > > maps are absolutely insane.
> >
> > Sorry if I wasn't involved at the time. I agree that it's a bit of a
> > mess!
> >
> > > There is a ton of stuff that assumes the cpu<->node relation is a boot
> > > time fixed one. Userspace being one of them. Per-cpu memory another.
> >
> > Well, userspace already deals with CPU hotplug, right?
>
> Barely, mostly not.

Well, as much as it needs to, I mean. CPU hotplug is done quite regularly
on power, at least.

> > And the topology updates are, in a lot of ways, just like you've
> > hotplugged a CPU from one node and re-hotplugged it into another
> > node.
>
> No, that's very much not the same. Even if it were dealing with hotplug
> it would still assume the cpu to return to the same node.

The analogy may have been poor; a better one is: it's the same as
hotunplugging a CPU from one node and hotplugging a physically identical
CPU on a different node.

> But mostly people do not even bother to handle hotplug.

I'm not sure what you mean by "people" here, but I think it's what you
outline below.

> People very much assume that when they set up their node affinities they
> will remain the same for the life time of their program. People set
> separate cpu affinity with sched_setaffinity() and memory affinity with
> mbind() and assume the cpu<->node maps are invariant.

That's a bad assumption to make if you're virtualized, I would think
(including on KVM). Unless you're also binding your vcpu threads to
physical cpus.

But the point is valid, that userspace does tend to think rather
statically about the world.

> > I'll look into the per-cpu memory case.
>
> Look into everything that does cpu_to_node() based allocations, because
> they all assume that that is stable.
>
> They allocate memory at init time to be node local, but then you go and
> mess that up.

So, the case that you're considering is:

CPU X on Node Y at boot-time, gets memory from Node Y.

CPU X moves to Node Z at run-time, is still using memory from Node Y.

The memory is still there (or it's also been 'moved' via the hypervisor
interface), it's just not optimally placed. Autonuma support should help
us move that memory over at run-time, in my understanding.

I won't deny it's imperfect, but honestly, it does actually work (in
that the kernel doesn't crash). And the updated mappings will ensure
future page allocations are accurate.

But the point is still valid, and I will do my best and work with others
to audit the users of cpu_to_node(). When I worked earlier on supporting
memoryless nodes, I didn't see too many init-time callers using
those APIs, many just rely on getting local allocations implicitly
(which I do understand also would break here, but should also get
migrated to follow the cpus eventually, if possible).

> > For what it's worth, our test teams are stressing the kernel with these
> > topology updates and hopefully we'll be able to resolve any issues that
> > result.
>
> Still absolutely insane.

I won't deny that, necessarily, but I'm in a position to at least try
and make them work with Linux.

> > I will look into per-cpu memory, and also another case I have been
> > thinking about where if a process is bound to a CPU/node combination via
> > numactl and then the topology changes, what exactly will happen. In
> > theory, via these topology updates, a node could go from memoryless ->
> > not and v.v., which seems like it might not be well supported (but
> > again, should not be much different from hotplugging all the memory out
> > from a node).
>
> memory hotplug is even less well handled than cpu hotplug.

That feels awfully hand-wavy to me. Again, we stress test both memory
and cpu hotplug pretty heavily.

> And yes, the fact that you need to go look into WTF happens when people
> use numactl should be a big arse red flag. _That_ is breaking userspace.

It will be the exact same condition as running bound to a CPU and
hotplugging that CPU out, as I understand it. In the kernel, actually,
we can (do) migrate CPUs via stop_machine and so it's slightly different
than a hotplug event (numbering is consistent, just the mapping has
changed).

So maybe the better example would be being bound to a given node and
having the CPUs in that node change. We would need to ensure the sched
domains are accurate after the update, so that the policies can be
accurately applied, afaict. That's why I'm asking you as the sched
domain expert what exactly needs to be done.

> > And, in fact, topologically speaking, I think I should be able
> > to repeat the same sched domain warnings if I start off with a 2-node
> > system with all CPUs on one node, and then hotplug a CPU onto the second
> > node, right? That has nothing to do with power, that I can tell. I'll
> > see if I can demonstrate it via a KVM guest.
>
> Uhm, no. CPUs will not first appear on node 0 only to then appear on
> node 1 later.

Sorry I was unclear in my statement. The "hotplug a CPU" wasn't one that
was unplugged from node 0, it was only added to node 1. In other words,
I was trying to say:

Node 0 - all CPUs, some memory
Node 1 - no CPUs, some memory

<hotplug event>

Node 0 - same CPUs, some memory
Node 1 - some CPUs, some memory

> If you have a cpu-less node 1 and then hotplug cpus in, they will start
> and end their lives on node 1; they'll never be part of node 0.

Yes, that's exactly right.

But node 1 won't have a sched domain at the NUMA level, because it had
no CPUs on it to start. And afaict, there's no support to build that
NUMA level domain at run-time if the CPU is hotplugged?

> Also, cpu/memory-less nodes + hotplug to later populate them are
> crazy in that they never get the performance you get from regular
> setups. It's impossible to get node-local right.

Ok, so the performance may suck and we may eventually say -- reboot when
you can, to re-init everything properly. But I'd actually like to limp
along (which in fact we do already). I'd like the limp to be a little
less pronounced by building the proper sched domains in the example I
gave. I get the impression you disagree, so we'll continue to limp
as-is.

Thanks for your insight,
Nish

2015-04-09 22:37:17

by Nishanth Aravamudan

Subject: Re: Topology updates and NUMA-level sched domains

On 08.04.2015 [12:32:01 +0200], Brice Goglin wrote:
> Le 07/04/2015 21:41, Peter Zijlstra a écrit :
> > No, that's very much not the same. Even if it were dealing with hotplug
> > it would still assume the cpu to return to the same node.
> >
> > But mostly people do not even bother to handle hotplug.
> >
>
> You said userspace assumes the cpu<->node relation is a boot-time fixed
> one, and hotplug breaks this. How do you expect userspace to handle
> hotplug? Is there a convenient way to be notified when a CPU (or memory)
> is unplugged?

There is some mention of "User Space Notification" in cpu-hotplug.txt,
but no idea if it's current.

-Nish

2015-04-09 22:40:49

by Nishanth Aravamudan

Subject: Re: Topology updates and NUMA-level sched domains

On 08.04.2015 [12:52:12 +0200], Peter Zijlstra wrote:
> On Wed, Apr 08, 2015 at 12:32:01PM +0200, Brice Goglin wrote:
> > Le 07/04/2015 21:41, Peter Zijlstra a écrit :
> > > No, that's very much not the same. Even if it were dealing with hotplug
> > > it would still assume the cpu to return to the same node.
> > >
> > > But mostly people do not even bother to handle hotplug.
> > >
> >
> > You said userspace assumes the cpu<->node relation is a boot-time fixed
> > one, and hotplug breaks this.
>
> I said no such thing. Regular hotplug actually respects that relation.

Well, sort of. If you *just* hotplug a CPU out, your invariant of what
CPUs are currently available on what nodes no longer holds. Similarly
if you just add a CPU. And that means you could end up using cpumasks
that are incorrect if you don't rebuild them at runtime, it seems?

> > How do you expect userspace to handle hotplug?
>
> Mostly not. Why would they? CPU hotplug is rare and mostly a case of:
> don't do that then.
>
> Its just that some of the virt wankers are using it for resource
> management which is entirely misguided. Then again, most of virt is.

I guess that is a matter of opinion.

-Nish

2015-04-10 08:32:08

by Peter Zijlstra

Subject: Re: Topology updates and NUMA-level sched domains

On Thu, Apr 09, 2015 at 03:29:56PM -0700, Nishanth Aravamudan wrote:
> > No, that's very much not the same. Even if it were dealing with hotplug
> > it would still assume the cpu to return to the same node.
>
> The analogy may have been poor; a better one is: it's the same as
> hotunplugging a CPU from one node and hotplugging a physically identical
> CPU on a different node.

Then it'll not be the same cpu from the OS's pov. The outgoing cpus and
the incoming cpus will have different cpu numbers.

Furthermore, at boot we will have observed the empty socket, reserved
cpu numbers, and arranged per-cpu resources for them.

> > People very much assume that when they set up their node affinities they
> > will remain the same for the life time of their program. People set
> > separate cpu affinity with sched_setaffinity() and memory affinity with
> > mbind() and assume the cpu<->node maps are invariant.
>
> That's a bad assumption to make if you're virtualized, I would think
> (including on KVM). Unless you're also binding your vcpu threads to
> physical cpus.
>
> But the point is valid, that userspace does tend to think rather
> statically about the world.

I've no idea how KVM numa is working, if at all. I would not be
surprised if it indeed hard binds vcpus to nodes. Not doing that allows
the vcpus to randomly migrate between nodes which will completely
destroy the whole point of exposing numa details to the guest.

I suppose some of the auto-numa work helps here. Not sure at all.

> > > I'll look into the per-cpu memory case.
> >
> > Look into everything that does cpu_to_node() based allocations, because
> > they all assume that that is stable.
> >
> > > They allocate memory at init time to be node local, but then you go and
> > > mess that up.
>
> So, the case that you're considering is:
>
> CPU X on Node Y at boot-time, gets memory from Node Y.
>
> CPU X moves to Node Z at run-time, is still using memory from Node Y.

Right, at which point numa doesn't make sense anymore. If you randomly
scramble your cpu<->node map what's the point of exposing numa to the
guest?

The whole point of NUMA is that userspace can be aware of the layout and
use local memory where possible.

Nobody will want to consider dynamic NUMA information; it's utterly
insane; do you see your HPC compute job going: "oi hold on, I've got to
reallocate my data, just hold on while I go do this"? I think not.

> The memory is still there (or it's also been 'moved' via the hypervisor
> interface), it's just not optimally placed. Autonuma support should help
> us move that memory over at run-time, in my understanding.

No, auto-numa cannot fix this. And the HV cannot migrate the memory for
the same reason.

Suppose you have two cpus: X0 X1 on node X, you then move X0 into node
Y. You cannot move memory along with it, X1 might still expect it to be
on node X.

You can only migrate your entire node, at which point nothing has really
changed (assuming a fully connected system).

> I won't deny it's imperfect, but honestly, it does actually work (in
> that the kernel doesn't crash). And the updated mappings will ensure
> future page allocations are accurate.

Well it works for you; because all you care about is the kernel not
crashing.

But does it actually provide usable semantics for userspace? Is there
anyone who _wants_ to use this?

What's the point of thinking all your memory is local only to have it
shredded across whatever nodes you stuffed your vcpu in? Utter crap I'd
say.

> But the point is still valid, and I will do my best and work with others
> to audit the users of cpu_to_node(). When I worked earlier on supporting
> memoryless nodes, I didn't see too too many init time callers using
> those APIs, many just rely on getting local allocations implicitly
> (which I do understand also would break here, but should also get
> migrated to follow the cpus eventually, if possible).

init time or not doesn't matter; runtime cpu_to_node() users equally
expect the allocation to remain local for the duration as well.

You've really got to step back and look at what you think you're
providing.

Sure you can make all this 'work' but what is the end result? Is it
useful? I say not. I'm saying that what you end up with is a useless
pile of crap.

> > > For what it's worth, our test teams are stressing the kernel with these
> > > topology updates and hopefully we'll be able to resolve any issues that
> > > result.
> >
> > Still absolutely insane.
>
> I won't deny that, necessarily, but I'm in a position to at least try
> and make them work with Linux.

Make what work? A useless pile of crap that nobody can or wants to use?

> > > I will look into per-cpu memory, and also another case I have been
> > > thinking about where if a process is bound to a CPU/node combination via
> > > numactl and then the topology changes, what exactly will happen. In
> > > theory, via these topology updates, a node could go from memoryless ->
> > > not and v.v., which seems like it might not be well supported (but
> > > again, should not be much different from hotplugging all the memory out
> > > from a node).
> >
> > memory hotplug is even less well handled than cpu hotplug.
>
> That feels awfully hand-wavy to me. Again, we stress test both memory
> and cpu hotplug pretty heavily.

That's not the point; sure you stress the kernel implementation; but
does anybody actually care?

Is there a single userspace program out there that goes: oh hey, my
memory layout just changed, lemme go fix that?

> > And yes, the fact that you need to go look into WTF happens when people
> > use numactl should be a big arse red flag. _That_ is breaking userspace.
>
> It will be the exact same condition as running bound to a CPU and
> hotplugging that CPU out, as I understand it.

Yes, and that is _BROKEN_. I'm >< that close to merging a patch that
will fail hotplug when there is a user task affine to that cpu. This
madness needs to stop _NOW_.

Also, listen to yourself. The user _wanted_ that task there and you say
it's OK to wreck that.


Please, step back, look at what you're doing and ask yourself, will any
sane person want to use this? Can they use this?

If so, start by describing the desired user semantics of this work.
Don't start by cobbling kernel bits together until it stops crashing.

2015-04-10 09:08:29

by Peter Zijlstra

Subject: Re: Topology updates and NUMA-level sched domains

On Fri, Apr 10, 2015 at 10:31:53AM +0200, Peter Zijlstra wrote:
> Please, step back, look at what you're doing and ask yourself, will any
> sane person want to use this? Can they use this?
>
> If so, start by describing the desired user semantics of this work.
> Don't start by cobbling kernel bits together until it stops crashing.

Also, please talk to your s390 folks. They've long since realized that
this doesn't work. They've just added big honking caches (their BOOK
stuff) and pretend the system is UMA.

2015-04-10 19:51:50

by Nishanth Aravamudan

Subject: Re: Topology updates and NUMA-level sched domains

On 10.04.2015 [11:08:10 +0200], Peter Zijlstra wrote:
> On Fri, Apr 10, 2015 at 10:31:53AM +0200, Peter Zijlstra wrote:
> > Please, step back, look at what you're doing and ask yourself, will any
> > sane person want to use this? Can they use this?
> >
> > If so, start by describing the desired user semantics of this work.
> > Don't start by cobbling kernel bits together until it stops crashing.
>
> Also, please talk to your s390 folks. They've long since realized that
> this doesn't work. They've just added big honking caches (their BOOK
> stuff) and pretend the system is UMA.

Interesting, I will take a look.

-Nish

2015-04-10 20:31:11

by Nishanth Aravamudan

Subject: Re: Topology updates and NUMA-level sched domains

On 10.04.2015 [10:31:53 +0200], Peter Zijlstra wrote:
> On Thu, Apr 09, 2015 at 03:29:56PM -0700, Nishanth Aravamudan wrote:
> > > No, that's very much not the same. Even if it were dealing with hotplug
> > > it would still assume the cpu to return to the same node.
> >
> > The analogy may have been poor; a better one is: it's the same as
> > hotunplugging a CPU from one node and hotplugging a physically identical
> > CPU on a different node.
>
> Then it'll not be the same cpu from the OS's pov. The outgoing cpus and
> the incoming cpus will have different cpu numbers.

Right, it's an analogy. I understand it's not the exact same. I was
trying to have a civil discussion about how to solve this problem
without you calling me a wanker.

> Furthermore at boot we will have observed the empty socket and reserved
> cpu number and arranged per-cpu resources for them.

Ok, I see what you're referring to now:

static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
{
	return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align,
				    __pa(MAX_DMA_ADDRESS));
}

So we'll be referring to bootmem in the pcpu path for the node we were
on at boot-time.

Actually, this is already horribly broken on power.

[ 0.000000] pcpu-alloc: [0] 000 001 002 003 [0] 004 005 006 007
[ 0.000000] pcpu-alloc: [0] 008 009 010 011 [0] 012 013 014 015
[ 0.000000] pcpu-alloc: [0] 016 017 018 019 [0] 020 021 022 023
[ 0.000000] pcpu-alloc: [0] 024 025 026 027 [0] 028 029 030 031
[ 0.000000] pcpu-alloc: [0] 032 033 034 035 [0] 036 037 038 039
[ 0.000000] pcpu-alloc: [0] 040 041 042 043 [0] 044 045 046 047
[ 0.000000] pcpu-alloc: [0] 048 049 050 051 [0] 052 053 054 055
[ 0.000000] pcpu-alloc: [0] 056 057 058 059 [0] 060 061 062 063
[ 0.000000] pcpu-alloc: [0] 064 065 066 067 [0] 068 069 070 071
[ 0.000000] pcpu-alloc: [0] 072 073 074 075 [0] 076 077 078 079
[ 0.000000] pcpu-alloc: [0] 080 081 082 083 [0] 084 085 086 087
[ 0.000000] pcpu-alloc: [0] 088 089 090 091 [0] 092 093 094 095
[ 0.000000] pcpu-alloc: [0] 096 097 098 099 [0] 100 101 102 103
[ 0.000000] pcpu-alloc: [0] 104 105 106 107 [0] 108 109 110 111
[ 0.000000] pcpu-alloc: [0] 112 113 114 115 [0] 116 117 118 119
[ 0.000000] pcpu-alloc: [0] 120 121 122 123 [0] 124 125 126 127
[ 0.000000] pcpu-alloc: [0] 128 129 130 131 [0] 132 133 134 135
[ 0.000000] pcpu-alloc: [0] 136 137 138 139 [0] 140 141 142 143
[ 0.000000] pcpu-alloc: [0] 144 145 146 147 [0] 148 149 150 151
[ 0.000000] pcpu-alloc: [0] 152 153 154 155 [0] 156 157 158 159

even though the topology is:
available: 4 nodes (0-1,16-17)
node 0 cpus: 0 8 16 24 32
node 1 cpus: 40 48 56 64 72
node 16 cpus: 80 88 96 104 112
node 17 cpus: 120 128 136 144 152

The comment for pcpu_build_alloc_info() seems wrong:

"The returned configuration is guaranteed
* to have CPUs on different nodes on different groups and >=75% usage
* of allocated virtual address space."

Either way, we're returning node 0 for everything at this point. I'll
debug it further, but one more question:

If we have CONFIG_USE_PERCPU_NUMA_NODE_ID and are using cpu_to_node()
for setting up the per-cpu areas themselves, isn't that a problem?

> > > People very much assume that when they set up their node affinities they
> > > will remain the same for the life time of their program. People set
> > > separate cpu affinity with sched_setaffinity() and memory affinity with
> > > mbind() and assume the cpu<->node maps are invariant.
> >
> > That's a bad assumption to make if you're virtualized, I would think
> > (including on KVM). Unless you're also binding your vcpu threads to
> > physical cpus.
> >
> > But the point is valid, that userspace does tend to think rather
> > statically about the world.
>
> I've no idea how KVM numa is working, if at all. I would not be
> surprised if it indeed hard binds vcpus to nodes. Not doing that allows
> the vcpus to randomly migrate between nodes which will completely
> destroy the whole point of exposing numa details to the guest.

Well, you *can* bind vcpus to nodes. But you don't have to.

> I suppose some of the auto-numa work helps here. not sure at all.

Yes, it does, I think.

> > > > I'll look into the per-cpu memory case.
> > >
> > > Look into everything that does cpu_to_node() based allocations, because
> > > they all assume that that is stable.
> > >
> > > They allocate memory at init time to be node local, but then you go
> > > and mess that up.
> >
> > So, the case that you're considering is:
> >
> > CPU X on Node Y at boot-time, gets memory from Node Y.
> >
> > CPU X moves to Node Z at run-time, is still using memory from Node Y.
>
> Right, at which point numa doesn't make sense anymore. If you randomly
> scramble your cpu<->node map what's the point of exposing numa to the
> guest?
>
> The whole point of NUMA is that userspace can be aware of the layout and
> use local memory where possible.
>
> Nobody will want to consider dynamic NUMA information; its utterly
> insane; do you see your HPC compute job going: "oi hold on, I've got to
> reallocate my data, just hold on while I go do this" ? I think not.

Fair point.

> > The memory is still there (or it's also been 'moved' via the hypervisor
> > interface), it's just not optimally placed. Autonuma support should help
> > us move that memory over at run-time, in my understanding.
>
> No auto-numa cannot fix this. And the HV cannot migrate the memory for
> the same reason.
>
> Suppose you have two cpus: X0 X1 on node X, you then move X0 into node
> Y. You cannot move memory along with it, X1 might still expect it to be
> on node X.

Well, if they are both using the memory, I believe autonuma will achieve
some sort of homeostasis. Not sure.

> You can only migrate your entire node, at which point nothing has really
> changed (assuming a fully connected system).
>
> > I won't deny it's imperfect, but honestly, it does actually work (in
> > that the kernel doesn't crash). And the updated mappings will ensure
> > future page allocations are accurate.
>
> Well it works for you; because all you care about is the kernel not
> crashing.

That's not all I care about, actually. But to make userspace handle this
case, the kernel has to not crash first; there is no userspace left to
consider otherwise. We are now at the point where userspace doesn't
crash, but the scheduler, for instance, and probably other places, can
no longer load balance the system because it no longer has an accurate
view of the sched domain hierarchy.

> But does it actually provide usable semantics for userspace? Is there
> anyone who _wants_ to use this?

In at least one case where this happens on power, the "user" isn't
selecting anything; the hypervisor is sending events to the partition.
The partition can choose to ignore them, at which point performance will
probably degrade (topologically speaking, guest and host now have
different views of the guest). In that case the guest would see the
updated partition topology on a reboot, I think.

> What's the point of thinking all your memory is local only to have it
> shredded across whatever nodes you stuffed your vcpu in? Utter crap I'd
> say.

I think the hope is by "fixing" the topology at run-time (these events
are when the topology is already bad and made "good" by putting cpus and
memory closer together), the performance will actually go up, in
practice, without having to reboot systems.

> > But the point is still valid, and I will do my best and work with others
> > to audit the users of cpu_to_node(). When I worked earlier on supporting
> > memoryless nodes, I didn't see too too many init time callers using
> > those APIs, many just rely on getting local allocations implicitly
> > (which I do understand also would break here, but should also get
> > migrated to follow the cpus eventually, if possible).
>
> init time or not doesn't matter; runtime cpu_to_node() users equally
> expect the allocation to remain local for the duration as well.

And those are already broken for memoryless nodes, afaict (and should be
using cpu_to_mem).

> You've really got to step back and look at what you think you're
> providing.
>
> Sure you can make all this 'work' but what is the end result? Is it
> useful? I say not. I'm saying that what you end up with is a useless
> pile of crap.

I understand that is your opinion. You've made it rather clear
throughout this thread.

I will try and come up with a clearer documentation of what happens and
what we'd like to provide, maybe that will lead to a more fruitful
discussion.

> > > > For what it's worth, our test teams are stressing the kernel with these
> > > > topology updates and hopefully we'll be able to resolve any issues that
> > > > result.
> > >
> > > Still absolutely insane.
> >
> > I won't deny that, necessarily, but I'm in a position to at least try
> > and make them work with Linux.
>
> Make what work? A useless pile of crap that nobody can or wants to use?

We get the topology events already. Linux can (does) choose to handle
those events or not. If we do handle those events, I would like to
handle them correctly in the kernel. Your opinion seems to be that it
cannot be done. I respect your opinion as a kernel expert.

> > > > I will look into per-cpu memory, and also another case I have been
> > > > thinking about where if a process is bound to a CPU/node combination via
> > > > numactl and then the topology changes, what exactly will happen. In
> > > > theory, via these topology updates, a node could go from memoryless ->
> > > > not and v.v., which seems like it might not be well supported (but
> > > > again, should not be much different from hotplugging all the memory out
> > > > from a node).
> > >
> > > memory hotplug is even less well handled than cpu hotplug.
> >
> > That feels awfully hand-wavy to me. Again, we stress test both memory
> > and cpu hotplug pretty heavily.
>
> That's not the point; sure you stress the kernel implementation; but
> does anybody actually care?
>
> Is there a single userspace program out there that goes: oh hey, my
> memory layout just changed, lemme go fix that?

This is a good point -- I'm not sure how we'd communicate that to
userspace either. I am guessing that currently we do not.

> > > And yes, the fact that you need to go look into WTF happens when people
> > > use numactl should be a big arse red flag. _That_ is breaking userspace.
> >
> > It will be the exact same condition as running bound to a CPU and
> > hotplugging that CPU out, as I understand it.
>
> Yes and that is _BROKEN_.. I'm >< that close to merging a patch that
> will fail hotplug when there is a user task affine to that cpu. This
> madness need to stop _NOW_.

That feels like policy, but that is your choice. What about when a task
is affined to the cpus of a node and we hotplug one of those cpus out?

It feels like what you really don't like is CPU and memory hotplug
generally. I suppose you could push (or merge) a patch that just
disables it. I think you will get push back on such a patch, but that's
the point of the process.

> Also, listen to yourself. The user _wanted_ that task there and you
> say its OK to wreck that.

Not exactly. But I think you're saying that a user's tasks get to
override system administration tasks. Maybe that's the right choice; I
don't really know. What I am saying is that userspace cannot always be
satisfied in its requests. For instance, memoryless nodes that have CPUs
cannot offer node-local memory. Does that mean such tasks should be
killed instead of run with slightly less performant results? Again, I
think that's a policy question.

> Please, step back, look at what you're doing and ask yourself, will any
> sane person want to use this? Can they use this?
>
> If so, start by describing the desired user semantics of this work.
> Don't start by cobbling kernel bits together until it stops crashing.

I will try to do just that. Thank you for the input.

-Nish