2004-06-14 15:42:13

by Anton Blanchard

[permalink] [raw]
Subject: NUMA API observations


Hi Andi,

I had a chance to test out the NUMA API on ppc64. I started with 64bit
userspace, I'll send on some 32/64bit compat patches shortly.

This machine is weird in that there are lots of nodes with no memory.
Sure, I should probably not set those nodes online, but it's a test setup
that is good for finding various issues (like node failover when one is
OOM).

As you can see, only nodes 0 and 1 have memory:

# cat /proc/buddyinfo
Node 1, zone DMA 0 53 62 22 1 0 1 1 1 1 0 0 0 1 60
Node 0, zone DMA 136 18 5 1 2 1 0 1 0 1 1 0 0 0 59

# numastat
                   node7    node6    node5    node4    node3    node2    node1     node0
numa_hit               0        0        0        0        0        0    30903   2170759
numa_miss              0        0        0        0        0        0        0         0
numa_foreign           0        0        0        0        0        0        0         0
interleave_hit         0        0        0        0        0        0      715       835
local_node             0        0        0        0        0        0    28776   2170737
other_node             0        0        0        0        0        0     2127        22

Now if I try and interleave across all, the task gets OOM killed:

# numactl --interleave=all /bin/sh
Killed

in dmesg: VM: killing process sh

It works if I specify the nodes with memory:

# numactl --interleave=0,1 /bin/sh

Is this expected, or do we want it to fall back when there is lots of
memory on other nodes?
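
For reference, this is roughly what that invocation boils down to via
libnuma (a sketch assuming the current libnuma interface, not numactl's
actual source):

/*
 * Sketch only, assuming the current libnuma interface; not numactl's
 * actual source. Set an interleave policy over all nodes, then exec
 * the target program (link with -lnuma).
 */
#include <numa.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc < 2)
                return 1;
        if (numa_available() < 0) {
                fprintf(stderr, "NUMA API not available\n");
                return 1;
        }
        numa_set_interleave_mask(&numa_all_nodes);  /* interleave across all nodes */
        execvp(argv[1], argv + 1);                  /* e.g. /bin/sh */
        perror("execvp");
        return 1;
}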

A similar scenario happens with:

# numactl --preferred=7 /bin/sh
Killed

in dmesg: VM: killing process numactl

The manpage says we should fall back to other nodes when the preferred
node is OOM.

numactl cpu affinity looks broken on big cpumask systems:

# numactl --cpubind=0 /bin/sh
sched_setaffinity: Invalid argument

sched_setaffinity(19470, 64, { 0, 0, 0, 0, 0, 0, 2332313320, 534d50204d6f6e20 }) = -1 EINVAL (Invalid argument)

My kernel is compiled with NR_CPUS=128, the setaffinity syscall must be
called with a bitmap at least as big as the kernel's cpumask_t. I will
submit a patch for this shortly.
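
For reference, a minimal sketch of a call sized the way I describe (the
raw syscall, with the mask dimensioned for an assumed NR_CPUS of 128 that
has to match the kernel configuration):

/*
 * Minimal sketch, not the patch itself: call the raw syscall with a mask
 * buffer at least as large as the kernel's cpumask_t. KERNEL_NR_CPUS is
 * an assumption and must match the running kernel's configuration.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define KERNEL_NR_CPUS 128

int main(void)
{
        unsigned long mask[KERNEL_NR_CPUS / (8 * sizeof(unsigned long))];

        memset(mask, 0, sizeof(mask));
        mask[0] = 1;            /* run on CPU 0 only */

        if (syscall(SYS_sched_setaffinity, 0, sizeof(mask), mask) < 0)
                fprintf(stderr, "sched_setaffinity: %s\n", strerror(errno));
        return 0;
}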

Next I looked at the numactl --show info:

# numactl --show
policy: default
preferred node: 0
interleavemask:
interleavenode: 0
nodebind: 0 1 2 3
membind: 0 1 2 3 4 5 6 7

What's the difference between nodebind and membind? Why don't I see all 8
nodes on both of them? I notice if I do --membind=all then I only see
the nodes with memory:

# numactl --membind=all --show
policy: bind
preferred node: 0
interleavemask:
interleavenode: 0
nodebind: 0 1 2 3
membind: 0 1

That kind of makes sense, but I don't understand why we have 4 nodes in
the nodebind field. My CPU layout is not contiguous; perhaps that's why
nodebind comes out strange:

processor : 0
processor : 1
processor : 2
processor : 3
processor : 16
processor : 17
processor : 18
processor : 19

Anton


2004-06-14 16:17:53

by Andi Kleen

[permalink] [raw]
Subject: Re: NUMA API observations

[right list for numactl discussions is lse-tech, not linux-kernel.
Adding cc]

On Tue, Jun 15, 2004 at 01:36:38AM +1000, Anton Blanchard wrote:
> Now if I try and interleave across all, the task gets OOM killed:
>
> # numactl --interleave=all /bin/sh
> Killed
>
> in dmesg: VM: killing process sh

interleave should always fall back to other nodes. Very weird.

Needs to be investigated. What were the actual arguments passed
to the syscalls?

>
> It works if I specify the nodes with memory:

That's probably a user space bug. Can you check what it passes
for "all" to the system calls? Maybe another bug in the sysfs
parser.

(You're using the latest version, right? Previous ones were buggy.)

>
> # numactl --interleave=0,1 /bin/sh
>
> Is this expected, or do we want it to fall back when there is lots of
> memory on other nodes?

interleave should fall back.

> My kernel is compiled with NR_CPUS=128, the setaffinity syscall must be
> called with a bitmap at least as big as the kernel's cpumask_t. I will
> submit a patch for this shortly.

Umm, what a misfeature. We size the buffer up to the biggest
running CPU. That should be enough.

IMHO that's just a kernel bug. How should a user space
application sanely discover the cpumask_t size needed by the kernel?
Whoever designed that was on crack.

I will probably make it loop and double the buffer until EINVAL ends or it
passes a page and add a nasty comment.

>
> Next I looked at the numactl --show info:
>
> # numactl --show
> policy: default
> preferred node: 0
> interleavemask:
> interleavenode: 0
> nodebind: 0 1 2 3
> membind: 0 1 2 3 4 5 6 7
>
> What's the difference between nodebind and membind? Why don't I see all 8

nodebind = cpubind

Basically sched_setaffinity, but with node numbers.

That's actually the option to use, too; just the documentation
is wrong (will fix). Even --cpubind always works with node numbers.

-Andi

2004-06-14 21:12:22

by Paul Jackson

[permalink] [raw]
Subject: Re: NUMA API observations

Andi wrote:
> How should a user space application sanely discover the cpumask_t
> size needed by the kernel? Whoever designed that was on crack.
>
> I will probably make it loop and double the buffer until EINVAL
> ends or it passes a page and add a nasty comment.

I agree that a loop is needed. And yes someone didn't do a very
good job of designing this interface.

I posted a piece of code that gets a usable upper bound on cpumask_t
size, suitable for application code to size mask buffers to be used
in these system calls.

See the lkml article:

http://groups.google.com/groups?selm=fa.hp225re.1v68ei0%40ifi.uio.no

Or search in google groups for "cpumasksz".

This article was posted:

Date: 2004-06-04 09:20:13 PST

in a long thread under the Subject of:

[PATCH] cpumask 5/10 rewrite cpumask.h - single bitmap based implementation

Feel free to steal it, or to ignore it, if you find it easier to
write your version than to read mine.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-06-14 21:46:50

by Anton Blanchard

[permalink] [raw]
Subject: Re: NUMA API observations


> interleave should always fall back to other nodes. Very weird.
> Needs to be investigated. What were the actual arguments passed
> to the syscalls?

This one looks like a bug in my code. I wasn't setting numnodes high
enough, so the node fallback lists weren't being initialised for some
nodes.

> > My kernel is compiled with NR_CPUS=128, the setaffinity syscall must be
> > called with a bitmap at least as big as the kernel's cpumask_t. I will
> > submit a patch for this shortly.
>
> Umm, what a misfeature. We size the buffer up to the biggest
> running CPU. That should be enough.
>
> IMHO that's just a kernel bug. How should a user space
> application sanely discover the cpumask_t size needed by the kernel?
> Whoever designed that was on crack.

glibc now uses a select style interface. Unfortunately the interface has
changed about three times by now.

> I will probably make it loop and double the buffer until EINVAL ends or it
> passes a page and add a nasty comment.

Perhaps we could use the new glibc interface and fall back to the loop
on older glibcs.

Anton

2004-06-14 23:44:33

by Andi Kleen

[permalink] [raw]
Subject: Re: NUMA API observations

On Mon, Jun 14, 2004 at 02:21:28PM -0700, Paul Jackson wrote:
> Andi wrote:
> > How should a user space application sanely discover the cpumask_t
> > size needed by the kernel? Whoever designed that was on crack.
> >
> > I will probably make it loop and double the buffer until EINVAL
> > ends or it passes a page and add a nasty comment.
>
> I agree that a loop is needed. And yes someone didn't do a very
> good job of designing this interface.

I've added some code to go up to a page now.

This adds a hardcoded limit of 32768 CPUs to libnuma. That's not
nice, but we have to stop somewhere in case the EINVAL is returned
for some other reason.
(I really dislike this error code, btw; it is nearly always far too
ambiguous...)
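
For illustration, a sketch of that probing loop (hypothetical code, not
the actual libnuma change):

/*
 * Hypothetical sketch of the probing loop described above, not the real
 * libnuma code: double the buffer until the kernel stops returning EINVAL,
 * and give up once a whole page has been passed (32768 CPUs at one bit
 * per CPU with 4k pages).
 */
#include <errno.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

static long probe_cpumask_bytes(void)
{
        long pagesize = sysconf(_SC_PAGESIZE);
        long len;

        for (len = sizeof(unsigned long); len <= pagesize; len *= 2) {
                unsigned long *mask = calloc(1, len);
                long ret;

                if (!mask)
                        return -1;
                ret = syscall(SYS_sched_getaffinity, 0, len, mask);
                free(mask);
                if (ret >= 0)
                        return len;     /* kernel accepted this size */
                if (errno != EINVAL)
                        return -1;      /* unrelated failure, give up */
        }
        return -1;                      /* stopped at a page */
}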


>
> I posted a piece of code that gets a usable upper bound on cpumask_t
> size, suitable for application code to size mask buffers to be used
> in these system calls.

My code works basically the same, but thanks for the pointer.

-Andi

2004-06-14 23:50:01

by Andi Kleen

[permalink] [raw]
Subject: Re: NUMA API observations

On Tue, Jun 15, 2004 at 07:40:04AM +1000, Anton Blanchard wrote:
>
> > interleave should always fall back to other nodes. Very weird.
> > Needs to be investigated. What were the actual arguments passed
> > to the syscalls?
>
> This one looks like a bug in my code. I wasn't setting numnodes high
> enough, so the node fallback lists weren't being initialised for some
> nodes.

Ok. Good to know.

That's a bad generic bug, right?

Interleaving isn't really doing anything much different from an ordinary
allocation, except that the numa_node_id() index into the zone table is
replaced with a different number.

> > > My kernel is compiled with NR_CPUS=128, the setaffinity syscall must be
> > > called with a bitmap at least as big as the kernel's cpumask_t. I will
> > > submit a patch for this shortly.
> >
> > Umm, what a misfeature. We size the buffer up to the biggest
> > running CPU. That should be enough.
> >
> > IMHO that's just a kernel bug. How should a user space
> > application sanely discover the cpumask_t size needed by the kernel?
> > Whoever designed that was on crack.
>
> glibc now uses a select style interface. Unfortunately the interface has
> changed about three times by now.

I have no plans to track the glibc interface of the week for this,
and numactl must run with older glibc anyway, which is why I have always
used my own stub for this. I am not sure they even solved the problem
completely. With the upcoming numactl version it should work.

What I wonder is why IA64 worked though. We tested on it previously,
but somehow didn't run into this. The regression test suite
needs to check this better.

-Andi

2004-06-15 00:08:01

by Paul Jackson

[permalink] [raw]
Subject: Re: NUMA API observations

Andi wrote:
> I've added some code to go up to a page now.

good.

> This adds a hardcoded limit of 32768 CPUs to libnuma.

Ok - SGI has no issues with a 32768 CPU limit ... for now ;).

> I really dislike this error code [EINVAL] btw ...

Then use others ??

The way I learned Unix, decades ago, the tradition was to use a variety
of errno values, even if they were a slightly strange fit, in order to
provide more detailed error feedback. Look for example at the rename(2)
or acct(2) system calls.

So long as the man page shows them, it can be helpful.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-06-15 00:20:47

by Andi Kleen

[permalink] [raw]
Subject: Re: NUMA API observations

On Mon, Jun 14, 2004 at 05:06:05PM -0700, Paul Jackson wrote:
> > I really dislike this error code [EINVAL] btw ...
>
> Then use others ??
>
> The way I learned Unix, decades ago, the tradition was to use a variety
> of errno values, even if they were a slightly strange fit, in order to
> provide more detailed error feedback. Look for example at the rename(2)
> or acct(2) system calls.

I tried to use creative errnos in the past (we have some
interesting unused ones from SYSV that can be abused like EADV
or EDOTDOT or ELIBBAD or EILSEQ), but Linus tends to reject
patches that use them. It wouldn't help here anyway, since
libnuma has to work with existing kernels. And changing the errno
now would probably break all user space workaround code people
have developed for this misfeature.

Best would probably be to fix the kernel to check that the
passed buffer is big enough for the highest running CPU,
and it should error out when there are bits set above
the cpumask limit (see the code in mm/mempolicy.c that
checks all this for node masks). I will cook up a patch for
this later.

-Andi

2004-06-15 00:26:18

by Paul Jackson

[permalink] [raw]
Subject: Re: NUMA API observations

> Linus tends to reject patches that use them.

Too bad. Good to know.

I agree with your other points.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-06-15 05:00:32

by Manfred Spraul

[permalink] [raw]
Subject: Re: NUMA API observations

>> I will probably make it loop and double the buffer until EINVAL
>> ends or it passes a page and add a nasty comment.
>
> I agree that a loop is needed. And yes someone didn't do a very
> good job of designing this interface.
What about fixing the interface instead? For example, if user_mask_ptr is
NULL, then sys_sched_{get,set}affinity could return the bitmap size.
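
User space could then size its buffers with something like this (purely
hypothetical, since no kernel implements it):

/*
 * Hypothetical usage of the proposed change; no released kernel behaves
 * this way. A NULL mask pointer would make the syscall return the size
 * of the kernel's cpumask in bytes instead of copying a mask.
 */
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

static long kernel_cpumask_bytes(void)
{
        return syscall(SYS_sched_getaffinity, 0, 0, NULL);
}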

--
Manfred

2004-06-15 06:10:00

by Paul Jackson

[permalink] [raw]
Subject: Re: NUMA API observations

> What about fixing the interface instead?

Yeah - if someone has the stomach and time for such,
it might be well received. Though this brain damage,
and the further glibc brain damage downstream, will
take a while to dissipate.

Manfred - the copy of your email addressed back to me,
"[email protected]" was instead addressed to:

"pj () sgi ! com"@dbl.q-ag.de

Needless to say, I never saw that copy.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-06-15 11:03:23

by Andi Kleen

[permalink] [raw]
Subject: Re: NUMA API observations

On Tue, Jun 15, 2004 at 06:59:57AM +0200, Manfred Spraul wrote:
> >> I will probably make it loop and double the buffer until EINVAL
> >> ends or it passes a page and add a nasty comment.
> >
> > I agree that a loop is needed. And yes someone didn't do a very
> > good job of designing this interface.
>
> What about fixing the interface instead? For example, if user_mask_ptr is
> NULL, then sys_sched_{get,set}affinity could return the bitmap size.

Or maybe just a sysctl. But it doesn't really help because
applications have to work with older kernels. I think
cpumask_t is more of a kernel-internal implementation detail
and should not really be exposed to user space, so
it's better not to do the sysctl either.

However, there are clear bugs in the kernel API that should be fixed:
- It should return EINVAL for bits set above cpumask_t.
- It should not return EINVAL as long as the passed-in mask
  covers all online CPUs.
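
Roughly like this, as an illustrative sketch of the two rules in plain C
(not the actual patch; the helper and its arguments are made up for the
example):

/*
 * Illustrative sketch of the two checks above, not the actual patch.
 * Accept any mask long enough to cover the highest online CPU; reject
 * masks that set bits the kernel's cpumask_t cannot represent.
 */
static int validate_affinity_mask(const unsigned long *mask,
                                  unsigned int len_bytes,
                                  unsigned int nr_cpus,        /* kernel NR_CPUS */
                                  unsigned int highest_online) /* highest online CPU */
{
        unsigned int nbits = len_bytes * 8;
        unsigned int bit;

        /* Too short to describe all online CPUs: EINVAL. */
        if (nbits <= highest_online)
                return -1;

        /* Any bit set above what cpumask_t can hold: EINVAL. */
        for (bit = nr_cpus; bit < nbits; bit++)
                if (mask[bit / (8 * sizeof(unsigned long))] &
                    (1UL << (bit % (8 * sizeof(unsigned long)))))
                        return -1;

        return 0;
}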

-Andi

2004-06-15 12:53:25

by Thomas Zehetbauer

[permalink] [raw]
Subject: Re: NUMA API observations

Looking at these numastat results and the default policy, it seems that
memory is primarily allocated on the first node, which in turn means an
unnecessarily large number of page faults on the second node.

I wonder if it is possible to better balance processes among the nodes
by e.g. setting nodeAffinity = pid mod nodeCount
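
Something like this, as a rough libnuma sketch of the idea (an
illustration only, not a tested approach):

/*
 * Rough illustration of the pid-mod-node idea above, not a tested or
 * recommended approach: run the current process on node (pid % nodes)
 * via libnuma, then exec the real workload (link with -lnuma).
 */
#include <numa.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        int nodes;

        if (argc < 2)
                return 1;
        if (numa_available() < 0) {
                fprintf(stderr, "NUMA API not available\n");
                return 1;
        }
        nodes = numa_max_node() + 1;
        if (numa_run_on_node(getpid() % nodes) < 0)
                perror("numa_run_on_node");
        execvp(argv[1], argv + 1);
        perror("execvp");
        return 1;
}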

Tom

--
T h o m a s Z e h e t b a u e r ( TZ251 )
PGP encrypted mail preferred - KeyID 96FFCB89
finger [email protected] for key

Attempting to apply the OSI layers model to a real network is just like
attempting to represent seven dimensions in four dimensional reality.
Thomas Zehetbauer

2004-06-15 13:27:35

by Andi Kleen

[permalink] [raw]
Subject: Re: NUMA API observations

Thomas Zehetbauer <[email protected]> writes:

> Looking at these numastat results and the default policy, it seems that
> memory is primarily allocated on the first node, which in turn means an
> unnecessarily large number of page faults on the second node.

NUMA memory policy has nothing to do with page faults.

If you get most allocations on the first node, it either means most
programs run on the first node (assuming they don't use the NUMA API
to change their memory affinity) or, more likely, that the programs running
on node 0 need more memory than those running on node 1.

That's easily possible; e.g. a typical desktop uses most of its
memory in the X server. If it runs on node 0 you get such skewed
statistics. On servers it is often similar.

One way to combat that, if it were really a problem, would be to run the
X server with an interleaving policy (numactl --interleave=all
XFree86)[1], but I would recommend careful benchmarks first to see if it's
really a win. Normally the better local memory latency is the better
choice.

[1] Don't do that with startx or xinit; the rest of the X session should
probably not use that policy.

> I wonder if it is possible to better balance processes among the nodes
> by e.g. setting nodeAffinity = pid mod nodeCount

I assume you mean scheduling, not memory affinity, here. execve() and
clone() do that kind of balancing (but based on node loads, not pids);
fork() does not.

-Andi

2004-06-15 13:52:05

by Bill Davidsen

[permalink] [raw]
Subject: Re: NUMA API observations

Andi Kleen wrote:
> On Mon, Jun 14, 2004 at 02:21:28PM -0700, Paul Jackson wrote:
>
>>Andi wrote:
>>
>>>How should a user space application sanely discover the cpumask_t
>>>size needed by the kernel? Whoever designed that was on crack.
>>>
>>>I will probably make it loop and double the buffer until EINVAL
>>>ends or it passes a page and add a nasty comment.
>>
>>I agree that a loop is needed. And yes someone didn't do a very
>>good job of designing this interface.
>
>
> I've added some code to go up to a page now.
>
> This adds a hardcoded limit of 32768 CPUs to libnuma. That's not
> nice, but we have to stop somewhere in case the EINVAL is returned
> for some other reason

Should be enough for desktop machines...

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-06-15 13:52:09

by Jesse Barnes

[permalink] [raw]
Subject: Re: NUMA API observations

> > > > My kernel is compiled with NR_CPUS=128, the setaffinity syscall must
> > > > be called with a bitmap at least as big as the kernel's cpumask_t. I
> > > > will submit a patch for this shortly.
> > >
> > > Umm, what a misfeature. We size the buffer up to the biggest
> > > running CPU. That should be enough.
> > >
> > > IMHO that's just a kernel bug. How should a user space
> > > application sanely discover the cpumask_t size needed by the kernel?
> > > Whoever designed that was on crack.
> >
> > glibc now uses a select style interface. Unfortunately the interface has
> > changed about three times by now.
>
> I have no plans to track the glibc interface of the week for this,
> and numactl must run with older glibc anyway, which is why I have always
> used my own stub for this. I am not sure they even solved the problem
> completely. With the upcoming numactl version it should work.
>
> What I wonder is why IA64 worked though. We tested on it previously,
> but somehow didn't run into this. The regression test suite
> needs to check this better.

Yeah, I tested it with a kernel compiled for 512p, and didn't have any
problems. But that was a while ago; I may have fixed the code and lost the
diff when I sent you the other patches...

Jesse

2004-06-15 17:39:22

by Manfred Spraul

[permalink] [raw]
Subject: Re: NUMA API observations

Andi Kleen wrote:

>Or maybe just a sysctl. But it doesn't really help because
>applications have to work with older kernels.
>
What's the largest number of CPUs that are supported right now? 256?
First call sysctl or whatever. If it fails, then glibc can assume 256.
If someone installs an old kernel on a new computer then it's his own
fault. The API is broken; we should fix it now - it will only get more
painful in the future.

--
Manfred

2004-06-15 18:09:36

by Paul Jackson

[permalink] [raw]
Subject: Re: NUMA API observations

Andi wrote:
> But it doesn't really help because
> applications have to work with older kernels.

It doesn't help right away. But one can eventually phase out cruft.
Provide the new, deprecate the old, then perhaps in 2.7/2.8 kernels,
discontinue the old.

Such renewal work is valuable to the long term health of Linux.

I can't do it - I wouldn't want Andrew dreading my submissions any more
than he already does, and William's questions as to just how I was
explaining to my employer the value of my labors would be increasingly
unanswerable. <grin>

> cpumask_t is more of a kernel-internal implementation detail
> and should not really be exposed to user space, so
> it's better not to do the sysctl either.

Bingo.

When you find yourself in a hole, stop digging.

I'd go a step further - even as an internal kernel detail, it was poorly
chosen, as evidenced by the amount of commentary it takes the big-endian
64-bit machines, in the files include/asm-ppc64/bitops.h and
include/asm-s390/bitops.h, to explain the bitmap data type.

Perhaps a byte array, rather than an unsigned long array, would be
better.

And the brain damage is also on the other side of the kernel-user
boundary. Don't get me started on the botch that glibc made of this.

This is a nice case study in the propagation properties of suboptimal
design choices, and in the unintended consequences flowing from the
choices of basic data structures.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-06-15 18:24:06

by Paul Jackson

[permalink] [raw]
Subject: Re: NUMA API observations

Manfred wrote:
> What's the largest number of CPUs that are supported right now? 256?

Kernels for SGI's SN2 boxes are usually compiled with NR_CPUS == 512.
A quick grep of the default config files shows that is the largest.

> First call sysctl or whatever. If it fails, then glibc can assume 256.

Yes, one _could_ write code such as that.

Spend a little time looking at what glibc has done so far with these
APIs. You will then doubt that the code you recommend would actually
happen consistently. Be forewarned - if you are on anti-depressant
medications, make sure your prescription is filled first.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373