2010-01-06 06:11:33

by Anton Blanchard

[permalink] [raw]
Subject: [patch 6/6] x86: cpumask_of_node() should handle -1 as a node

pcibus_to_node can return -1 if we cannot determine which node a pci bus
is on. If passed -1, cpumask_of_node will negatively index the lookup array
and pull in random data:

# cat /sys/devices/pci0000:00/0000:00:01.0/local_cpus
00000000,00000003,00000000,00000000
# cat /sys/devices/pci0000:00/0000:00:01.0/local_cpulist
64-65

Change cpumask_of_node to check for -1 and return cpu_all_mask in this
case:

# cat /sys/devices/pci0000:00/0000:00:01.0/local_cpus
ffffffff,ffffffff,ffffffff,ffffffff
# cat /sys/devices/pci0000:00/0000:00:01.0/local_cpulist
0-127

Signed-off-by: Anton Blanchard <[email protected]>
---

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index c5087d7..cc4f136 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -99,7 +99,8 @@ extern const struct cpumask *cpumask_of_node(int node);
/* Returns a pointer to the cpumask of CPUs on Node 'node'. */
static inline const struct cpumask *cpumask_of_node(int node)
{
- return node_to_cpumask_map[node];
+ return (node == -1) ? cpu_all_mask :
+ node_to_cpumask_map[node];
}
#endif

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 550df48..e57a7f5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -47,6 +47,9 @@ void __init setup_node_to_cpumask_map(void)
*/
const struct cpumask *cpumask_of_node(int node)
{
+ if (node == -1)
+ return cpu_all_mask;
+
if (node >= nr_node_ids) {
printk(KERN_WARNING
"cpumask_of_node(%d): node > nr_node_ids(%d)\n",

--


2010-01-06 06:36:37

by Rusty Russell

[permalink] [raw]
Subject: Re: [patch 6/6] x86: cpumask_of_node() should handle -1 as a node

On Wed, 6 Jan 2010 03:25:15 pm Anton Blanchard wrote:
> pcibus_to_node can return -1 if we cannot determine which node a pci bus
> is on. If passed -1, cpumask_of_node will negatively index the lookup array
> and pull in random data:

If you grep for the callers, you'll see those which do this (now-obsolete)
check. One more patch :)

Thanks,
Rusty.

2010-01-06 23:00:57

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 6/6] x86: cpumask_of_node() should handle -1 as a node

On Wed, 6 Jan 2010, Rusty Russell wrote:

> > pcibus_to_node can return -1 if we cannot determine which node a pci bus
> > is on. If passed -1, cpumask_of_node will negatively index the lookup array
> > and pull in random data:
>
> If you grep for the callers, you'll see those which do this (now-obsolete)
> check. One more patch :)
>

Do we really want to do this? A nid of -1 is undefined, so the result of
cpumask_of_node(-1) should be undefined; there's no formal definition that
a nid of -1 follows the semantics that we use for x86 pci buses, for
example, where it implies no NUMA locality in all cases.

I think a better fix would be to define -1 where appropriate to either be
an invalid node, cpu_all_mask, or whatever in the caller:

mask = (nid != -1) ? cpumask_of_node(nid) : cpu_all_mask;

2010-01-06 23:20:51

by Anton Blanchard

[permalink] [raw]
Subject: Re: [patch 6/6] x86: cpumask_of_node() should handle -1 as a node


Hi Rusty,

> If you grep for the callers, you'll see those which do this (now-obsolete)
> check. One more patch :)

Good point. I was hoping to have a chance at a -stable backport, so I'll
do this cleanup in a follow up patch.

Anton

2010-01-06 23:40:20

by Anton Blanchard

[permalink] [raw]
Subject: Re: [patch 6/6] x86: cpumask_of_node() should handle -1 as a node


Hi David,

> Do we really want to do this? A nid of -1 is undefined, so the result of
> cpumask_of_node(-1) should be undefined; there's no formal definition that
> a nid of -1 follows the semantics that we use for x86 pci buses, for
> example, where it implies no NUMA locality in all cases.

I don't like the use of -1 as a node, but it's much more widespread than
x86; including sh, powerpc, sparc and the generic topology code. eg:


#fdef CONFIG_PCI
extern int pcibus_to_node(struct pci_bus *pbus);
#else
static inline int pcibus_to_node(struct pci_bus *pbus)
{
return -1;
}


It would be nice to get rid of this special case but I suspect that's not
2.6.33 material.

Speaking of invalid node ids, I also noticed the scheduler isn't using
node iterators:

for (i = 0; i < nr_node_ids; i++) {

which should be fixed at some stage too since it doesn't allow us to
allocate the node structures sparsely.

Anton

2010-01-06 23:51:16

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 6/6] x86: cpumask_of_node() should handle -1 as a node

On Thu, 7 Jan 2010, Anton Blanchard wrote:

> I don't like the use of -1 as a node, but it's much more widespread than
> x86; including sh, powerpc, sparc and the generic topology code. eg:
>
>
> #fdef CONFIG_PCI
> extern int pcibus_to_node(struct pci_bus *pbus);
> #else
> static inline int pcibus_to_node(struct pci_bus *pbus)
> {
> return -1;
> }

This seems to be the same semantics that NUMA_NO_NODE was defined for,
it's not necessarily a special case.

Regardless, the result of cpumask_of_node(NUMA_NO_NODE) should be
undefined as it currently is unless you want to obsolete NUMA_NO_NODE
entirely which is much more work. In other words, special-casing a nid of
-1 to mean no affinity is inappropriate if NUMA_NO_NODE represents an
invalid nid.

If x86 pci buses want to use -1 to imply that meaning, that's fine, but it
shouldn't be coded in a generic interface such as cpumask_of_node(). Does
that make sense?

> Speaking of invalid node ids, I also noticed the scheduler isn't using
> node iterators:
>
> for (i = 0; i < nr_node_ids; i++) {
>
> which should be fixed at some stage too since it doesn't allow us to
> allocate the node structures sparsely.
>

That loop has nothing to do with the allocation of a node structure, it's
quite plausible that it checks for various states such as node_online(i)
while looping and doing something else interesting for those that are
offline. Keep in mind that this isn't equivalent to using for_each_node()
since that only iterates over N_POSSIBLE which is architecture specific.

2010-01-07 00:02:05

by Anton Blanchard

[permalink] [raw]
Subject: Re: [patch 6/6] x86: cpumask_of_node() should handle -1 as a node


Hi David,

> This seems to be the same semantics that NUMA_NO_NODE was defined for,
> it's not necessarily a special case.
>
> Regardless, the result of cpumask_of_node(NUMA_NO_NODE) should be
> undefined as it currently is unless you want to obsolete NUMA_NO_NODE
> entirely which is much more work. In other words, special-casing a nid of
> -1 to mean no affinity is inappropriate if NUMA_NO_NODE represents an
> invalid nid.
>
> If x86 pci buses want to use -1 to imply that meaning, that's fine, but it
> shouldn't be coded in a generic interface such as cpumask_of_node(). Does
> that make sense?

I wasn't using the example to strengthen the case of the -1 behaviour, but to
highlight that a complete fix would be more work and risk not making it back
to -stable.

I'm all for removing the special case as a followon patch.

> > Speaking of invalid node ids, I also noticed the scheduler isn't using
> > node iterators:
> >
> > for (i = 0; i < nr_node_ids; i++) {
> >
> > which should be fixed at some stage too since it doesn't allow us to
> > allocate the node structures sparsely.
>
> That loop has nothing to do with the allocation of a node structure, it's
> quite plausible that it checks for various states such as node_online(i)
> while looping and doing something else interesting for those that are
> offline. Keep in mind that this isn't equivalent to using for_each_node()
> since that only iterates over N_POSSIBLE which is architecture specific.

Yeah I understand it isn't the same thing, but the scheduler oopses in
a number of places with CPUMASK_OFFSTACK and sparse node ids, so things
that can be switched to node iterators should and node_online() checks
should be added elsewhere.

Anton

2010-01-07 00:26:12

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 6/6] x86: cpumask_of_node() should handle -1 as a node

On Thu, 7 Jan 2010, Anton Blanchard wrote:

> I wasn't using the example to strengthen the case of the -1 behaviour, but to
> highlight that a complete fix would be more work and risk not making it back
> to -stable.
>

I don't think that we should defer a complete fix to the callers because
it's "more work." If you've identified places where -1 is passed to
cpumask_of_node() without being checked, I think those would be fairly
obvious -stable candidates themselves instead of this series.

2010-01-07 00:39:19

by Rusty Russell

[permalink] [raw]
Subject: Re: [patch 6/6] x86: cpumask_of_node() should handle -1 as a node

On Thu, 7 Jan 2010 10:21:06 am David Rientjes wrote:
> On Thu, 7 Jan 2010, Anton Blanchard wrote:
>
> > I don't like the use of -1 as a node, but it's much more widespread than
> > x86; including sh, powerpc, sparc and the generic topology code. eg:
> >
> >
> > #fdef CONFIG_PCI
> > extern int pcibus_to_node(struct pci_bus *pbus);
> > #else
> > static inline int pcibus_to_node(struct pci_bus *pbus)
> > {
> > return -1;
> > }
>
> This seems to be the same semantics that NUMA_NO_NODE was defined for,
> it's not necessarily a special case.

It's widespread, and we've just had another bug due to pcibus_to_node handling
-1 and cpumask_of_node not. (Search lkml for subject "[Regression] 2.6.33-rc2
- pci: Commit e0cd516 causes OOPS").

So I think the evidence is in favor of just handling -1.

Cheers,
Rusty.

2010-01-07 08:04:56

by David Rientjes

[permalink] [raw]
Subject: Re: [patch 6/6] x86: cpumask_of_node() should handle -1 as a node

On Thu, 7 Jan 2010, Rusty Russell wrote:

> On Thu, 7 Jan 2010 10:21:06 am David Rientjes wrote:
> > On Thu, 7 Jan 2010, Anton Blanchard wrote:
> >
> > > I don't like the use of -1 as a node, but it's much more widespread than
> > > x86; including sh, powerpc, sparc and the generic topology code. eg:
> > >
> > >
> > > #fdef CONFIG_PCI
> > > extern int pcibus_to_node(struct pci_bus *pbus);
> > > #else
> > > static inline int pcibus_to_node(struct pci_bus *pbus)
> > > {
> > > return -1;
> > > }
> >
> > This seems to be the same semantics that NUMA_NO_NODE was defined for,
> > it's not necessarily a special case.
>
> It's widespread, and we've just had another bug due to pcibus_to_node handling
> -1 and cpumask_of_node not. (Search lkml for subject "[Regression] 2.6.33-rc2
> - pci: Commit e0cd516 causes OOPS").
>

That's similiar to the problem in cpumask_of_pcibus() that I fixed with
7715a1e back in September. The difference is that I isolated my fix to
the pci bus implementation that defined the nid of -1 to mean no NUMA
affinity, whereas generic kernel code can use that value for any (or no)
definition and returning cpu_all_mask may not apply. We know it does for
pcibus, but not for generic NUMA node ids that happen to be invald.

The hope is that eventually we can remove many dependencies on node ids
for these purposes; buses with no affinity are not actually members of any
NUMA node. I had a proposal for a generic kernel interface that is based
on ACPI system localities that would define the proximity of any system
entity (of which node is only a type defining "memory") to each other.
I'm waiting for enough time to work on that project.

NUMA is special because a single cpu is always a member of a single node,
so we're violating the bidirectional mapping by saying that node -1 maps
to all cpus while all cpus don't map to node -1. In other words, I think
we should take this as an opportunity to find and fix broken callers as
we've done both by my patch from September and by your aforementioned
case. In this particular case, it would be a matter of doing:

mask = (nid != -1) ? cpumask_of_node(nid) : cpu_all_mask;

I'm hoping that Ingo will weigh in on this topic with his taste and vision
for how we should decouple device locality information from entities that
are not members of any NUMA node if we continue to "special case" these
things.