2004-03-18 23:05:19

by Matthew Dobson

[permalink] [raw]
Subject: [PATCH] Introduce nodemask_t ADT [0/7]

I've got a fairly good size patch set to implement an ADT (Abstract Data
Type) for nodemasks, which follows the path blazed by wli with the
cpumask_t code. The basic idea is to create a generic,
platform/arch-agnostic nodemask data type, complete with operations to
do most anything you'd want to do with a nodemask. This stops us from
open-coding nodemask operations, allows non-consecutive node numbering
(ie: nodes don't have to be numbered 0...numnodes-1), gets rid of
numnodes entirely (replaced with num_online_nodes()), and will
facilitate the hotplugging of whole nodes.

As I mentioned, the code is heavily based on Bill Irwin's cpumask_t
code. The changes are broken into seven patches:

nodemask_t-01-definitions.patch
The basic definitions of the nodemask_t type: include/linux/nodemask.h,
include/asm-generic/{nodemask.h, nodemask_arith.h, nodemask_array.h,
nodemask_const_reference.h, nodemask_const_value.h, nodemask_nonuma.h},
and some small changes to include/linux/mmzone.h (removing extistant
definition of node_online_map and helper functions).

nodemask_t-02-core.patch
Changes to arch-independent code. Surprisingly few references to
numnodes, open-coded node loops, etc. Most important result of this
patch is that no generic code assumes anything about node numbering.
This allows individual arches to use sparse numbering if they care to.

nodemask_t-03-i386.patch
Changes to i386 specific code. As with most arch changes, it involves
close-coding loops (ie: for_each_online_node(nid) rather than
for(nid=0;nid<numnodes;nid++)) and replacing the use of numnodes with
num_online_nodes() and node_set_online(nid).

nodemask_t-04-ppc64.patch
Changes to ppc64 specific code. Untested. Code review & testing
requested.

nodemask_t-05-x86_64.patch
Changes to x86_64 specific code. Untested. Code review & testing
requested.

nodemask_t-06-ia64.patch
Changes to ia64 specific code. Untested. Code review & testing
requested.

nodemask_t-07-other.patch
Changes to other arch-specific code (alpha, arm, mips, sparc64 & sh).
Untested. Code review & testing requested.



2004-03-19 00:01:41

by Matthew Dobson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Thu, 2004-03-18 at 15:23, Jesse Barnes wrote:
> On Thursday 18 March 2004 3:04 pm, Matthew Dobson wrote:
> > do most anything you'd want to do with a nodemask. This stops us from
> > open-coding nodemask operations, allows non-consecutive node numbering
> > (ie: nodes don't have to be numbered 0...numnodes-1), gets rid of
> > numnodes entirely (replaced with num_online_nodes()), and will
> > facilitate the hotplugging of whole nodes.
>
> My hero! :) I think this has been needed for awhile, but now that I

Anything for a damsel in distress! ;)

> think about it, it begs the question of what a node is. Is it a set
> of CPUs and blocks of memory (that seems to be the most commonly used
> definition in the code), just memory, just CPUs, or what?

There have been arguments about exactly what a node is since there has
been a concept of a node at all. In the kernel, it isn't defined. A
node doesn't *have* to have CPUs on it (see nr_cpus_node()), doesn't
*have* to have memory, doesn't *have* to have I/O. It's supposed to be
just a container for those 3 things, but the containers can be empty.
This code doesn't get into what a node is, just makes sure they're used
properly... ;)

> On sn2
> hardware, we have the concept of a node without CPUs. And due to our
> wacky I/O layout, we also have nodes without CPUs *or* memory! (The
> I/O guys call these "ionodes".)

Yep... I saw both numnodes and numionodes perusing the ia64 code. You
should be able to put these CPU/memless nodes in the node_online_map
now... If there's code that's assuming a node contains either CPUs or
memory, I'd like to find it! :)

> And then of course, there are CPUs
> that aren't particularly close to any memory (i.e. they have none of
> their own, and have to go several hops and/or through other CPUs to
> get at memory at all).

node_distance(from, to)

> I'll take a look at the ia64 bits when I get them (I've only received
> two of the seven patches thus far).
>
> Jesse

Super. I'd really like feedback on the ia64 code (well, actually all
the non-i386 code). I did what I thought was right, but eyes more
familiar with the code are extremely welcome.

Cheers!

-Matt

2004-03-19 00:01:42

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Thursday 18 March 2004 3:43 pm, Martin J. Bligh wrote:
> > It's probably not too late to change this to
> > pcibus_to_nodemask(pci_bus *), or pci_to_nodemask(pci_dev *), there
> > aren't that many callers, are there (my grep is still running)?
>
> It probably shouldn't have anything to do with PCI directly either,
> so .... ;-) My former thought was that you might just want the most
> local memory for DMAing into.

Right, we want local memory (or potentially remote memory) for DMA,
but what about interrupt redirection? Some chipsets don't support
interrupt round robin, and just target interrupts at one CPU. In that
case (and probably the round robin case too), you want to know which
CPU(s) to send the interrupt at. Can't immediately think of other
in-kernel uses though (administrators will of course want to be able
to locate a given PCI device in a multirack system, but that's another
subject--one that Martin Hicks posted on yesterday).

Jesse

2004-03-19 00:05:40

by Matthew Dobson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Thu, 2004-03-18 at 15:32, Martin J. Bligh wrote:
> > On Thursday 18 March 2004 3:04 pm, Matthew Dobson wrote:
> >> do most anything you'd want to do with a nodemask. This stops us from
> >> open-coding nodemask operations, allows non-consecutive node numbering
> >> (ie: nodes don't have to be numbered 0...numnodes-1), gets rid of
> >> numnodes entirely (replaced with num_online_nodes()), and will
> >> facilitate the hotplugging of whole nodes.
> >
> > My hero! :) I think this has been needed for awhile, but now that I
> > think about it, it begs the question of what a node is. Is it a set
> > of CPUs and blocks of memory (that seems to be the most commonly used
> > definition in the code), just memory, just CPUs, or what? On sn2
> > hardware, we have the concept of a node without CPUs. And due to our
> > wacky I/O layout, we also have nodes without CPUs *or* memory! (The
> > I/O guys call these "ionodes".) And then of course, there are CPUs
> > that aren't particularly close to any memory (i.e. they have none of
> > their own, and have to go several hops and/or through other CPUs to
> > get at memory at all).
>
> I think the closest answer we have is that it's a grouping of cpus and
> memory, where either may be NULL.
>
> I/O isn't directly associated with a node, though it should fit into the
> topo infrastructure, to give distances from io buses to nodes (for which
> I think we currently use cpumasks, which is probably wrong in retrospect,
> but then life is tough and flawed ;-))
>
> M.

Yeah... We used cpumasks because that seemed like a good idea at the
time. Nodemasks may be a better choice now... We can write a quicky
inline function nodemask_to_cpumask() as well, to keep the current users
happy.

-Matt

2004-03-18 23:48:31

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Thursday 18 March 2004 3:04 pm, Matthew Dobson wrote:
> do most anything you'd want to do with a nodemask. This stops us from
> open-coding nodemask operations, allows non-consecutive node numbering
> (ie: nodes don't have to be numbered 0...numnodes-1), gets rid of
> numnodes entirely (replaced with num_online_nodes()), and will
> facilitate the hotplugging of whole nodes.

My hero! :) I think this has been needed for awhile, but now that I
think about it, it begs the question of what a node is. Is it a set
of CPUs and blocks of memory (that seems to be the most commonly used
definition in the code), just memory, just CPUs, or what? On sn2
hardware, we have the concept of a node without CPUs. And due to our
wacky I/O layout, we also have nodes without CPUs *or* memory! (The
I/O guys call these "ionodes".) And then of course, there are CPUs
that aren't particularly close to any memory (i.e. they have none of
their own, and have to go several hops and/or through other CPUs to
get at memory at all).

I'll take a look at the ia64 bits when I get them (I've only received
two of the seven patches thus far).

Jesse


2004-03-19 00:22:10

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> On Thursday 18 March 2004 3:04 pm, Matthew Dobson wrote:
>> do most anything you'd want to do with a nodemask. This stops us from
>> open-coding nodemask operations, allows non-consecutive node numbering
>> (ie: nodes don't have to be numbered 0...numnodes-1), gets rid of
>> numnodes entirely (replaced with num_online_nodes()), and will
>> facilitate the hotplugging of whole nodes.
>
> My hero! :) I think this has been needed for awhile, but now that I
> think about it, it begs the question of what a node is. Is it a set
> of CPUs and blocks of memory (that seems to be the most commonly used
> definition in the code), just memory, just CPUs, or what? On sn2
> hardware, we have the concept of a node without CPUs. And due to our
> wacky I/O layout, we also have nodes without CPUs *or* memory! (The
> I/O guys call these "ionodes".) And then of course, there are CPUs
> that aren't particularly close to any memory (i.e. they have none of
> their own, and have to go several hops and/or through other CPUs to
> get at memory at all).

I think the closest answer we have is that it's a grouping of cpus and
memory, where either may be NULL.

I/O isn't directly associated with a node, though it should fit into the
topo infrastructure, to give distances from io buses to nodes (for which
I think we currently use cpumasks, which is probably wrong in retrospect,
but then life is tough and flawed ;-))

M.

2004-03-19 00:22:08

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Thursday 18 March 2004 3:32 pm, Martin J. Bligh wrote:
> I think the closest answer we have is that it's a grouping of cpus and
> memory, where either may be NULL.

Yep, that seems to make the most sense, but then part of me wants to
drop the term node and never use it again :)

> I/O isn't directly associated with a node, though it should fit into the
> topo infrastructure, to give distances from io buses to nodes (for which
> I think we currently use cpumasks, which is probably wrong in retrospect,
> but then life is tough and flawed ;-))

It's probably not too late to change this to
pcibus_to_nodemask(pci_bus *), or pci_to_nodemask(pci_dev *), there
aren't that many callers, are there (my grep is still running)?

Thanks,
Jesse

2004-03-19 00:22:07

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

--On Thursday, March 18, 2004 15:37:10 -0800 Jesse Barnes <[email protected]> wrote:

> On Thursday 18 March 2004 3:32 pm, Martin J. Bligh wrote:
>> I think the closest answer we have is that it's a grouping of cpus and
>> memory, where either may be NULL.
>
> Yep, that seems to make the most sense, but then part of me wants to
> drop the term node and never use it again :)

Hey, *I* wasn't the one who started splitting their h/w into wierdo pieces ;-)
Anyway, it's a damned sight shorter than "cpumemset".

>> I/O isn't directly associated with a node, though it should fit into the
>> topo infrastructure, to give distances from io buses to nodes (for which
>> I think we currently use cpumasks, which is probably wrong in retrospect,
>> but then life is tough and flawed ;-))
>
> It's probably not too late to change this to
> pcibus_to_nodemask(pci_bus *), or pci_to_nodemask(pci_dev *), there
> aren't that many callers, are there (my grep is still running)?

It probably shouldn't have anything to do with PCI directly either,
so .... ;-) My former thought was that you might just want the most
local memory for DMAing into.

M.

2004-03-19 00:58:10

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Thu, 18 Mar 2004, Martin J. Bligh wrote:

> >> I/O isn't directly associated with a node, though it should fit into the
> >> topo infrastructure, to give distances from io buses to nodes (for which
> >> I think we currently use cpumasks, which is probably wrong in retrospect,
> >> but then life is tough and flawed ;-))
> >
> > It's probably not too late to change this to
> > pcibus_to_nodemask(pci_bus *), or pci_to_nodemask(pci_dev *), there
> > aren't that many callers, are there (my grep is still running)?
>
> It probably shouldn't have anything to do with PCI directly either,
> so .... ;-) My former thought was that you might just want the most
> local memory for DMAing into.

That knowledge should be in the dma api thing shouldn't it? But in it's
current incarnation i don't see how that's possible.

2004-03-19 01:01:45

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

Your nodemask_t reminds me of something I posted to linux-ia64 last
November 7, 2004, under the subject: "[PATCH preview] Adds nodemask_t
for use in cpusets (NUMA memory placement)".

Chris Hellwig responded to it at the time asking why I didn't provide a
single generic mask ADT, and make cpumask and nodemask instances of
that.

I coded that up, but then got distracted. The remaining issue for which
I didn't have a good answer was that my proposal would break the optimum
handling for sparc64 or other systems that didn't handle passing
structures on the stack efficiently.

Bill Irwin was a party to my discussions of that effort, so I presume
that if he felt it had further merit, he would have mentioned it to
you, Matthew.

Could one of you, Bill or Matthew, speak to why this generic mask ADT,
underlying both cpumask and nodemask, should not be pursued further,
instead of duplicating the various details of cpumask, after a global
s/cpu/node/g change?

I am attaching the header file include/linux/mask.h for my current
version of this mask.h, in case someone reading wants more specifics of
what it is that I am referring to.

This version almost certainly does _not_ work on big endian 64 systems,
due to my ignorance of how kernel bitmasks were layed out when I last
worked on this mask.h header. Unlike the sparc64 performance issues
with passing structs on the stack, I would hope that the big/little
endian issues could be fixed without messing things up too much.

If this mask.h could actually be made to work, including on sparc64,
then it would seem to be a much cleaner solution. With it, we would
define cpumask_t and nodemask_t as simply:

typedef __mask(NR_NODES) nodemask_t;
typedef __mask(NR_CPUS) cpumask_t;

and either use operations such as mask_and() on both, or if one insisted
on keeping operations that specifically called out cpumask, add some
15 trivial defines such as:

#define cpumask_and(dst, src1, src2) mask_and(dst, src1, src2)

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373


Attachments:
mask.h (7.55 kB)

2004-03-19 01:09:09

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> Anyway, it's a damned sight shorter than "cpumemset".

Heh - watch it - I saw that! ... LOL

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-19 01:11:13

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Thursday 18 March 2004 4:58 pm, Zwane Mwaikambo wrote:
> > It probably shouldn't have anything to do with PCI directly either,
> > so .... ;-) My former thought was that you might just want the most
> > local memory for DMAing into.
>
> That knowledge should be in the dma api thing shouldn't it? But in it's
> current incarnation i don't see how that's possible.

Couldn't we mostly hide it under the covers (though obviously not in
the intentionally remote case I mentioned earlier):


===== arch/ia64/sn/io/machvec/pci_dma.c 1.28 vs edited =====
--- 1.28/arch/ia64/sn/io/machvec/pci_dma.c Sun Mar 14 11:17:06 2004
+++ edited/arch/ia64/sn/io/machvec/pci_dma.c Thu Mar 18 17:08:13 2004
@@ -130,13 +130,11 @@
device_sysdata = SN_DEVICE_SYSDATA(hwdev);
vhdl = device_sysdata->vhdl;

- /*
- * Allocate the memory.
- * FIXME: We should be doing alloc_pages_node for the node closest
- * to the PCI device.
- */
- if (!(cpuaddr = (void *)__get_free_pages(GFP_ATOMIC, get_order(size))))
+ cpuaddr = alloc_pages_node(pci_to_node(hwdev), GFP_ATOMIC, (get_order(size)));
+ if (!cpuaddr)
return NULL;
+
+ cpuaddr = page_to_virt(cpuaddr);

memset(cpuaddr, 0x0, size);


2004-03-19 01:20:44

by Matthew Dobson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Thu, 2004-03-18 at 16:59, Paul Jackson wrote:
> Your nodemask_t reminds me of something I posted to linux-ia64 last
> November 7, 2004, under the subject: "[PATCH preview] Adds nodemask_t
> for use in cpusets (NUMA memory placement)".
>
> Chris Hellwig responded to it at the time asking why I didn't provide a
> single generic mask ADT, and make cpumask and nodemask instances of
> that.

That is a better idea, if it can be made to work. My goal is to stop
the proliferation of open-coded references to node details as soon as
possible. If we we nip this behavior in the bud and convert all users
of cpu/node data to cpumask_t/nodemask_t, we can (more) easily change
the underlying details of how all the cpumask and nodemask functions
work later. If the users all call our macros, then it's easy to find
them ('grep -r "nodes_and" *' vs searching for every '&' in the kernel
that may or may not be a node or cpu mask) and test them.

> I coded that up, but then got distracted. The remaining issue for which
> I didn't have a good answer was that my proposal would break the optimum
> handling for sparc64 or other systems that didn't handle passing
> structures on the stack efficiently.
>
> Bill Irwin was a party to my discussions of that effort, so I presume
> that if he felt it had further merit, he would have mentioned it to
> you, Matthew.

He never mentioned it to me when I queried him for details on his
cpumask_t implementation...

> Could one of you, Bill or Matthew, speak to why this generic mask ADT,
> underlying both cpumask and nodemask, should not be pursued further,
> instead of duplicating the various details of cpumask, after a global
> s/cpu/node/g change?

Nope. I think it should. Though it is hard to optimize for generic
masks. We've got the bitmap_* functions which work nicely for unsigned
long[]. These (cpumask_t/nodemask_t) are nice because they are
optimized for edge cases (UP for cpumask_t and Non-NUMA for nodemask_t)
as well as for long mask cases (passing by structs reference). It's
hard to make those types of optimizations on generic masks.

> I am attaching the header file include/linux/mask.h for my current
> version of this mask.h, in case someone reading wants more specifics of
> what it is that I am referring to.
>
> This version almost certainly does _not_ work on big endian 64 systems,
> due to my ignorance of how kernel bitmasks were layed out when I last
> worked on this mask.h header. Unlike the sparc64 performance issues
> with passing structs on the stack, I would hope that the big/little
> endian issues could be fixed without messing things up too much.
>
> If this mask.h could actually be made to work, including on sparc64,
> then it would seem to be a much cleaner solution. With it, we would
> define cpumask_t and nodemask_t as simply:
>
> typedef __mask(NR_NODES) nodemask_t;
> typedef __mask(NR_CPUS) cpumask_t;
>
> and either use operations such as mask_and() on both, or if one insisted
> on keeping operations that specifically called out cpumask, add some
> 15 trivial defines such as:
>
> #define cpumask_and(dst, src1, src2) mask_and(dst, src1, src2)

I'll look it over and see how it looks. As I said, I'm very much for a
generic mask type if we can do it properly and efficiently.

Cheers!

-Matt

2004-03-19 01:34:44

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Thu, 18 Mar 2004, Jesse Barnes wrote:

> On Thursday 18 March 2004 4:58 pm, Zwane Mwaikambo wrote:
> > > It probably shouldn't have anything to do with PCI directly either,
> > > so .... ;-) My former thought was that you might just want the most
> > > local memory for DMAing into.
> >
> > That knowledge should be in the dma api thing shouldn't it? But in it's
> > current incarnation i don't see how that's possible.
>
> Couldn't we mostly hide it under the covers (though obviously not in
> the intentionally remote case I mentioned earlier):

How about honouring DMA masks? Would this work with a 32bit DMA mask on a
node with memory above the 4G mark. This is for the more impaired systems
out there of course =)

> ===== arch/ia64/sn/io/machvec/pci_dma.c 1.28 vs edited =====
> --- 1.28/arch/ia64/sn/io/machvec/pci_dma.c Sun Mar 14 11:17:06 2004
> +++ edited/arch/ia64/sn/io/machvec/pci_dma.c Thu Mar 18 17:08:13 2004
> @@ -130,13 +130,11 @@
> device_sysdata = SN_DEVICE_SYSDATA(hwdev);
> vhdl = device_sysdata->vhdl;
>
> - /*
> - * Allocate the memory.
> - * FIXME: We should be doing alloc_pages_node for the node closest
> - * to the PCI device.
> - */
> - if (!(cpuaddr = (void *)__get_free_pages(GFP_ATOMIC, get_order(size))))
> + cpuaddr = alloc_pages_node(pci_to_node(hwdev), GFP_ATOMIC, (get_order(size)));
> + if (!cpuaddr)
> return NULL;
> +
> + cpuaddr = page_to_virt(cpuaddr);
>
> memset(cpuaddr, 0x0, size);

2004-03-19 01:40:30

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Thursday 18 March 2004 5:34 pm, Zwane Mwaikambo wrote:
> How about honouring DMA masks? Would this work with a 32bit DMA mask on a
> node with memory above the 4G mark. This is for the more impaired systems
> out there of course =)

Yeah, systems w/o an IOMMU are going to have trouble with this...

Jesse

2004-03-19 01:49:47

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> My goal is to stop the proliferation of open-coded references
> to node details as soon as possible.

A worthy goal.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-19 01:48:26

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> It's hard to make those types of optimizations on generic masks.

I would be assuming that by "generic" we meant arrays of unsigned longs
(or one unsigned long or something isomorphic to one or more unsigned
longs ...).

And I'm assuming that we mean of a size that would allow for putting a
couple of them on a kernel stack ... not _too_ big. Probably NR_CPUS
rough upper limit on the size that was practical to use.

I wouldn't want to get _too_ generic.


> I'll look it over and see how it looks.

Thanks.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-19 01:59:40

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> These (cpumask_t/nodemask_t) are nice because they are
> optimized for edge cases (UP for cpumask_t and Non-NUMA for nodemask_t)
> as well as for long mask cases (passing by structs reference).

When I looked at the assembly code generated on my one lung i386 box for
native gcc 3.3.2, it looked pretty good (to my untrained eye) using a
struct of an array of unsigned longs, both for the single unsigned long
(<= 32 bits) and multiple unsigned long cases.

Except for the sparc64 guys and their friends who disparage passing
structs on the stack, I conjecture that the single implementation of a
struct of an array of unsigned longs is nearly ideal for all
architectures.

... go ahead ... prove me wrong. It probably won't be hard ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-19 02:04:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

Matthew Dobson <[email protected]> writes:

>> Chris Hellwig responded to it at the time asking why I didn't provide a
>> single generic mask ADT, and make cpumask and nodemask instances of
>> that.
>
> That is a better idea, if it can be made to work. My goal is to stop

It already exists in linux/bitmap.h. I use that in NUMA API for the node masks.

It's just a bit ugly to write because you have to always pass MAX_NUMNODES.
Some wrappers would be prettier.

-Andi

2004-03-19 02:40:44

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> Some wrappers would be prettier.

That's roughly what the mask.h is, along with enough other bits and
pieces so that it can serve as a complete basis for cpumasks and
nodemasks. The bitmap.h stuff is an incomplete subset of what's
used.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-19 02:55:58

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Thu, Mar 18, 2004 at 03:23:10PM -0800, Jesse Barnes wrote:
> My hero! :) I think this has been needed for awhile, but now that I
> think about it, it begs the question of what a node is. Is it a set
> of CPUs and blocks of memory (that seems to be the most commonly used
> definition in the code), just memory, just CPUs, or what? On sn2
> hardware, we have the concept of a node without CPUs. And due to our
> wacky I/O layout, we also have nodes without CPUs *or* memory! (The
> I/O guys call these "ionodes".) And then of course, there are CPUs
> that aren't particularly close to any memory (i.e. they have none of
> their own, and have to go several hops and/or through other CPUs to
> get at memory at all).
> I'll take a look at the ia64 bits when I get them (I've only received
> two of the seven patches thus far).

You need a tripartite graph.
(a) are connections full or half duplex? (i.e. directed or undirected)
(b) do you need distinct weights on each edge? (i.e. weighted or unweighted)


-- wli

2004-03-19 16:21:06

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

--Jesse Barnes <[email protected]> wrote (on Thursday, March 18, 2004 15:59:42 -0800):

> On Thursday 18 March 2004 3:43 pm, Martin J. Bligh wrote:
>> > It's probably not too late to change this to
>> > pcibus_to_nodemask(pci_bus *), or pci_to_nodemask(pci_dev *), there
>> > aren't that many callers, are there (my grep is still running)?
>>
>> It probably shouldn't have anything to do with PCI directly either,
>> so .... ;-) My former thought was that you might just want the most
>> local memory for DMAing into.
>
> Right, we want local memory (or potentially remote memory) for DMA,
> but what about interrupt redirection? Some chipsets don't support
> interrupt round robin, and just target interrupts at one CPU. In that
> case (and probably the round robin case too), you want to know which
> CPU(s) to send the interrupt at. Can't immediately think of other
> in-kernel uses though (administrators will of course want to be able
> to locate a given PCI device in a multirack system, but that's another
> subject--one that Martin Hicks posted on yesterday).

I think we need both ... maybe a cpumask and a zonelist as the destination
types.

M.

2004-03-19 22:52:45

by Matthew Dobson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Thu, 2004-03-18 at 17:45, Paul Jackson wrote:
> > It's hard to make those types of optimizations on generic masks.
>
> I would be assuming that by "generic" we meant arrays of unsigned longs
> (or one unsigned long or something isomorphic to one or more unsigned
> longs ...).
>
> And I'm assuming that we mean of a size that would allow for putting a
> couple of them on a kernel stack ... not _too_ big. Probably NR_CPUS
> rough upper limit on the size that was practical to use.
>
> I wouldn't want to get _too_ generic.

Well, if we're going to make a generic bitmap type, it shouldn't have
size limitations, as almost any limit we set will be too small
eventually... Supporting arbitrary length bitmaps doesn't mean we can't
try to optimize for smaller masks, like less than a couple unsigned
longs, as well as single unsigned long optimizations.

-Matt

2004-03-19 23:04:28

by Matthew Dobson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Thu, 2004-03-18 at 17:56, Paul Jackson wrote:
> > These (cpumask_t/nodemask_t) are nice because they are
> > optimized for edge cases (UP for cpumask_t and Non-NUMA for nodemask_t)
> > as well as for long mask cases (passing by structs reference).
>
> When I looked at the assembly code generated on my one lung i386 box for
> native gcc 3.3.2, it looked pretty good (to my untrained eye) using a
> struct of an array of unsigned longs, both for the single unsigned long
> (<= 32 bits) and multiple unsigned long cases.

The code you wrote, or my patch?

> Except for the sparc64 guys and their friends who disparage passing
> structs on the stack, I conjecture that the single implementation of a
> struct of an array of unsigned longs is nearly ideal for all
> architectures.
>
> ... go ahead ... prove me wrong. It probably won't be hard ;).

Sounds like a good idea. We certainly shouldn't be passing huge masks
on the stack, but for small masks like, i dunno, <= 4 ULs (the same
optimization Bill's code makes) it's no problem.

The thing about code specific to node and cpu masks is that we *know*
what the masks we're manipulating are used for, and that lets us do
things like not letting callers set bits other than 0 on UP cpumasks, or
throwing a BUG when they do. Or optimizing first_node() to be smarter
than just calling find_first_set() on the passed in mask by just
checking whether bit 0 is set.

-Matt

2004-03-19 23:10:32

by Matthew Dobson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Thu, 2004-03-18 at 18:04, Andi Kleen wrote:
> Matthew Dobson <[email protected]> writes:
>
> >> Chris Hellwig responded to it at the time asking why I didn't provide a
> >> single generic mask ADT, and make cpumask and nodemask instances of
> >> that.
> >
> > That is a better idea, if it can be made to work. My goal is to stop
>
> It already exists in linux/bitmap.h. I use that in NUMA API for the node masks.
>
> It's just a bit ugly to write because you have to always pass MAX_NUMNODES.
> Some wrappers would be prettier.
>
> -Andi

For MAX_NUMNODES > BITS_PER_LONG, thats just what these are. There are
just built-in optimizations for the single bit (UP for cpumask, non-NUMA
for nodemask) case and the single unsigned long case. For the case of a
single unsigned long, the bitmap operations aren't as efficient as just
doing a
res = (mask1 & mask2);
vs.
bitmask_and(&res, &mask1, &mask2);

Maybe I'm overly concerned about the speed of these ops at the expense
of code size. If that's the consensus, I'll gladly look at a simpler
implementation. As I said, my only real goal is to stop people open
coding these types of operations.

-Matt

2004-03-19 23:44:39

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> Well, if we're going to make a generic bitmap type, it shouldn't have
> size limitations, as almost any limit we set will be too small
> eventually...

A trillion bits ...

Nah - bitmasks are a bunch of bits, layed out in order. This is
appropriate for tens, hundreds, even (on big iron) thousands of
bits.

There are limitations to the practical application of such bitmasks,
as I'm sure we both agree and both accept.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-20 00:49:14

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> For the case of a single unsigned long, the bitmap operations aren't
> as efficient as just doing a
> res = (mask1 & mask2);
> vs.
> bitmask_and(&res, &mask1, &mask2);

This is probably the same as Matt is thinking, but what I imagine has
the inline efficiency (a machine instruction or two) of the openly coded
first example above, but is always written more in the style of the
second example above.

The 'mask_and()' macro can be an inline or define that collapses in the
one word case (by far the most common) to the same machine instruction
or two as the open code does. We _do_ know, at compile time, the
_exact_ size of any given instance of these, so gcc has sufficient
information to collapse the inline code, if we present it right, to the
legitimate minimum.

For example, mask_and might be:

=========================================================================
#define __mask(bits) struct { unsigned long _m[BITS_TO_LONGS(bits)]; }
typedef __mask(NR_CPUS) cpumask_t;

#define mask_and(d,s1,s2) \
do { \
if (sizeof(d) == sizeof(unsigned long)) \
d._m[0] = s1._m[0] & s2._m[0]; \
else \
bitmask_and(d._m, s1._m, s2._m); \
} while(0)
=========================================================================

With this, mask_and() on some cpumask_t's, when NR_CPUS is 64, _does_
collapse to a simple:

and r32 = r8, r32

on my ia64 with gcc 3.2.3.

The mask.h I appended yesterday uses bitmap calls such as bitmap_and()
unconditionally (even in the one word case). This is probably worth
fixing, in the manner shown above, at least for such cases such as
'and', 'or', 'empty', 'complement', 'shift_left' and 'shift_right'.

No use of multiple architecturally chosen implementations, such as we
do now with cpumask, is required. Just a little bit of special case
code for the one-word case, in 6 or 8 of the macros.

I still think, that with just a few lines of compile time logic such as
above, we can have a single implementation that is pretty close to
ideal, for almost all architectures, using a struct holding an array of
unsigned longs.

The one obvious exception, sparc64, should have its own arch specific
implementation under include/asm-sparc64 and perhaps arch/sparc64. It
should not be embedded in a thicket of pseudo-generic ifdefs and
conditional includes as part of the apparently generic implementation.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-20 01:02:16

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> > When I looked at the assembly code generated on my one lung i386 box for
> > native gcc 3.3.2, it looked pretty good (to my untrained eye) using a
> > struct of an array of unsigned longs, both for the single unsigned long
> > (<= 32 bits) and multiple unsigned long cases.
>
> The code you wrote, or my patch?

The code I wrote and appended yesterday. Though, as I realize now in my
post of a few minutes ago, there are perhaps 8 places where instead of
calling various bitmap_ops() unconditionally ('and', 'or', ...) it would
be better to peel off the one-word case and inline it.


> Sounds like a good idea. We certainly shouldn't be passing huge masks
> on the stack, but for small masks like, i dunno, <= 4 ULs (the same
> optimization Bill's code makes) it's no problem.

Don't we have quite a few places with one, two, even three local variables
of type cpumask_t? Which live on the stack? For all mask implementations?
Grep around for "cpumask_t.*,.*," and many of the lines you see appear to
be declarations of such local cpumask_t variables.

And we need to be careful of converting pass by value semantics for small
cpumasks into pass by reference semantics for large cpumasks, as a hidden
feature of the implementation. One could code some cute bugs that way.

Better, I think, to provide a reasonably rich set of mask ops, so that the
using code need not have anymore than the essential number of distinctly
different masks hanging around at once in order to write clear code.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-20 01:14:51

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> bitmask_and(d._m, s1._m, s2._m); \

Oops - need the size argument to bitmask_and:

bitmask_and(d._m, s1._m, s2._m, sizeof(d)*8); \

(I only tested the 1 word case earlier ...)

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-20 03:19:33

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

At some point in the past, Matt Dobson wrote:
>> Sounds like a good idea. We certainly shouldn't be passing huge masks
>> on the stack, but for small masks like, i dunno, <= 4 ULs (the same
>> optimization Bill's code makes) it's no problem.

On Fri, Mar 19, 2004 at 04:59:28PM -0800, Paul Jackson wrote:
> Don't we have quite a few places with one, two, even three local variables
> of type cpumask_t? Which live on the stack? For all mask implementations?
> Grep around for "cpumask_t.*,.*," and many of the lines you see appear to
> be declarations of such local cpumask_t variables.

The stack footprint of cpumasks is a concern in general. I don't have a
good answer to this. The half-answer I anticipate is that the truly
larger systems will configure themselves with deeper stacks. There isn't
truly a good answer to this. It's overhead for the larger systems, and
heap allocation would be overhead for smaller systems. Unfortunately,
our general design criteria require the larger systems to eat the
overhead until something imaginative is come up with to reduce the
overhead with low code impact and no cost to smaller systems. One thing
that would help is more expressiveness in the API; it's not entirely
clear how to better 3-address code, but OTOH, reducing the number of
intermediate operands is conceivable and would alleviate those overheads.


On Fri, Mar 19, 2004 at 04:59:28PM -0800, Paul Jackson wrote:
> And we need to be careful of converting pass by value semantics for small
> cpumasks into pass by reference semantics for large cpumasks, as a hidden
> feature of the implementation. One could code some cute bugs that way.
> Better, I think, to provide a reasonably rich set of mask ops, so that the
> using code need not have anymore than the essential number of distinctly
> different masks hanging around at once in order to write clear code.

This is one of the areas where I believe I carried out some innovation.
cpumask_const_t allows more aggressive conversion to call-by-reference
in a safe manner as the constancy of the reference makes the difference
purely operational. It falls down only in scenarios where the input
would be modified. Also, when the argument is actually expected to be
modified, direct call by reference can be used. So only in the case
where a temporary copy is truly required are you forced to do it
(apart from getting the function prototype merged), and the fallback of
cpumask_const_t to call by value in the small case makes it cheap for
smaller machines also, by avoiding the indirection.

It may be worth investigating analogues of cpumask_const_t for more
generic bitmask types (though of course I'm not going to insist on a
force fit).


-- wli

2004-03-20 06:11:28

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> The stack footprint of cpumasks is a concern in general. I don't have a
> good answer to this. The half-answer I anticipate is that the truly
> larger systems will configure themselves with deeper stacks.

Sounds like a good enough answer to me.


That is, a richer API can help - more

> This is one of the areas where I believe I carried out some innovation.
> cpumask_const_t allows more aggressive conversion to call-by-reference

True, you did do some substantial work there, and for const objects,
calls by value and reference can be used interchangeably, for best
performance, without semantic impact.

However, something about the current cpus_*_const() macros doesn't seem
to be having the desired impact. I see five uses in files matching
include/asm-i386/mach-*/mach_apic.h, and one in include/asm-x86_64/smp.h
That's all. None, for example, in any ia64 code.

That's it. And why should one have to code explicitly the choice of
the cpus_*_const() variant? Shouldn't each macro know which of routines
it calls change things, and which don't, letting it pass a pointer to
the read-only routines if that helps? It knows the sizes as well, so
it can even pick and choose which variation of code to generate.

If one needs an explicit call by reference to avoid passing a multi-word
object, one should ask the user to pass an explicit pointer, to a
routine or macro that expects a pointer to a non-const, not an apparent
value. Shouldn't try to hide the reference semantics behind quasi-const
labels.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-20 08:04:53

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

Other ways to reduce cpumask copies:

1) Additional macros, for example a boolean cpus_intersect(x,y), which
is true iff x overlaps y, and saves the tmp cpumask in the code:

cpus_and(tmp, target_cpu_mask, allowed_mask);
if (!cpus_empty(tmp)) {

or a cpus_subset(x,y) "is x a subset of y" macro to replace this:

cpus_and(tmp, cpumask, cpu_callout_map);
BUG_ON(!cpus_equal(cpumask, tmp));

with this:

BUG_ON(!cpus_subset(cpumask, cpu_callout_map));

2) Using existing macros more carefully, for example using cpu_set() to
set the bit in the following, and saving the copy by assignment:

pending_irq_balance_cpumask[irq] = cpumask_of_cpu(new_cpu);

or using the existing cpu_isset() macro, replacing the code (seen
in part above ;):

cpus_and(allowed_mask, cpu_online_map, irq_affinity[selected_irq]);
target_cpu_mask = cpumask_of_cpu(min_loaded);
cpus_and(tmp, target_cpu_mask, allowed_mask);
if (!cpus_empty(tmp)) {

with the code:

if (cpu_isset(min_loaded, cpu_online_map) && cpu_isset(min_loaded, irq_affinity[selected_irq]) {

The current nested and ifdef'd complexity of the cpumask macro headers works
against such efficient coding - it's non-trivial to see just what macros are
available. The youngins reading this may not appreciate the value of reducing
brain load; but the oldins might.

2) Pass a cpumask pointer to a function that generates a cpumask instead of
taking one as a return value, letting the called function fill in the memory
so referenced. We should not be trying to hide such details, unless we can
do so seamlessly and consistent with traditional 'C' semantics.

3) Passing a const cpumask pointer to a function that examines a cpumask
instead of passing the cpumask by value (on small systems, its one word
either way - on big systems, it saves copying a multiword mask on the
stack).

4) Throwing away dead code:

static int physical_balance = 0;
cpumask_t tmp;
cpus_shift_right(tmp, cpu_online_map, 2);
if (smp_num_siblings > 1 && !cpus_empty(tmp)
physical_balance = 1;

The above code fragments are from arch/i386 in 2.6.3-rc1-mm1.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-20 09:36:51

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

At some point in the past, I wrote:
>> The stack footprint of cpumasks is a concern in general. I don't have a
>> good answer to this. The half-answer I anticipate is that the truly
>> larger systems will configure themselves with deeper stacks.

On Fri, Mar 19, 2004 at 10:09:07PM -0800, Paul Jackson wrote:
> Sounds like a good enough answer to me.
> That is, a richer API can help - more

I know that a richer repertoire of operations will reduce the stack
footprint and mentioned it. There is one issue and one issue only: the
larger the "instruction set" grows, the more intrusive and complex-
looking the thing appears, the more of quagmire merging it becomes. The
bits about deeper stacks not really a "good enough" answer. It's an
answer that trades off code impact for overhead on the systems that
demand the mechanism. In industrial/corporate kernels, this wouldn't
fly as votes are dollars, but here votes are users, and the smaller
systems' performance concerns and broader userbase's maintenance
concerns took precedence over the technically superior solution, at
least for the initial merge.


At some point in the past, I wrote:
>> This is one of the areas where I believe I carried out some innovation.
>> cpumask_const_t allows more aggressive conversion to call-by-reference

On Fri, Mar 19, 2004 at 10:09:07PM -0800, Paul Jackson wrote:
> True, you did do some substantial work there, and for const objects,
> calls by value and reference can be used interchangeably, for best
> performance, without semantic impact.
> However, something about the current cpus_*_const() macros doesn't seem
> to be having the desired impact. I see five uses in files matching
> include/asm-i386/mach-*/mach_apic.h, and one in include/asm-x86_64/smp.h
> That's all. None, for example, in any ia64 code.

Yes, spreading the cpumask_const_t use is needed. I pretty much ran out
of steam before getting its use propagated around, and it was also a late
addition that the arch maintainers who sent in their own updates didn't
get a very wide window to adapt to. I think it's vaguely fair to ask
those who need it to propagate its use around themselves.


On Fri, Mar 19, 2004 at 10:09:07PM -0800, Paul Jackson wrote:
> That's it. And why should one have to code explicitly the choice of
> the cpus_*_const() variant? Shouldn't each macro know which of routines
> it calls change things, and which don't, letting it pass a pointer to
> the read-only routines if that helps? It knows the sizes as well, so
> it can even pick and choose which variation of code to generate.

This explicitly forces an unnecessary indirection on smaller systems
where the thing may fit into one machine word anyway, meaning it doesn't
make sense to pass it by reference.

Or in a second possible interpretation, this may already be the current
state of affairs, and you're suggesting a different way to implement it
I don't see how to do offhand which would be equivalent but using
fewer #ifdefs. That would be completely innocuous, and in combination
with a more general set of wrappers, easy to endorse.


On Fri, Mar 19, 2004 at 10:09:07PM -0800, Paul Jackson wrote:
> If one needs an explicit call by reference to avoid passing a multi-word
> object, one should ask the user to pass an explicit pointer, to a
> routine or macro that expects a pointer to a non-const, not an apparent
> value. Shouldn't try to hide the reference semantics behind quasi-const
> labels.

It's not quasi-const, it is const. If you need a caller to do destructive
updates on your behalf, you use a pointer in both large and small cases.
You may use explicit unwrapped const call by reference, but this forces
the overhead of indirection on smaller systems.

The advantage the type wrapper has is that when NR_CPUS is small and
things fit in one or few machine words, it can be automatically
switched back to call by value with zero source changes by just picking
a threshold in a #ifdef, avoiding the indirection for smaller systems.

Of course, if the second of the two possibilities from the previously
replied-to paragraph holds, you're keeping all the operational semantics
that small systems need anyway so it may not have been necessary to
reiterate all that.


-- wli

2004-03-20 11:14:01

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Sat, Mar 20, 2004 at 12:02:35AM -0800, Paul Jackson wrote:
> Other ways to reduce cpumask copies:
> 1) Additional macros, for example a boolean cpus_intersect(x,y), which
> is true iff x overlaps y, and saves the tmp cpumask in the code:
> cpus_and(tmp, target_cpu_mask, allowed_mask);
> if (!cpus_empty(tmp)) {
> or a cpus_subset(x,y) "is x a subset of y" macro to replace this:
> cpus_and(tmp, cpumask, cpu_callout_map);
> BUG_ON(!cpus_equal(cpumask, tmp));
> with this:
> BUG_ON(!cpus_subset(cpumask, cpu_callout_map));

That probably wouldn't be tough to get merged.


On Sat, Mar 20, 2004 at 12:02:35AM -0800, Paul Jackson wrote:
> 2) Using existing macros more carefully, for example using cpu_set() to
> set the bit in the following, and saving the copy by assignment:
> pending_irq_balance_cpumask[irq] = cpumask_of_cpu(new_cpu);
> or using the existing cpu_isset() macro, replacing the code (seen
> in part above ;):
> cpus_and(allowed_mask, cpu_online_map, irq_affinity[selected_irq]);
> target_cpu_mask = cpumask_of_cpu(min_loaded);
> cpus_and(tmp, target_cpu_mask, allowed_mask);
> if (!cpus_empty(tmp)) {
> with the code:
> if (cpu_isset(min_loaded, cpu_online_map) && cpu_isset(min_loaded, irq_affinity[selected_irq]) {
> The current nested and ifdef'd complexity of the cpumask macro
> headers works against such efficient coding - it's non-trivial to see
> just what macros are available. The youngins reading this may not
> appreciate the value of reducing brain load; but the oldins might.

It was painful enough just to preserve semantics across such a far-
ranging set of changes. Ideally, yes, I would have done this in the
first place.


On Sat, Mar 20, 2004 at 12:02:35AM -0800, Paul Jackson wrote:
> 2) Pass a cpumask pointer to a function that generates a cpumask
> instead of taking one as a return value, letting the called function
> fill in the memory so referenced. We should not be trying to hide
> such details, unless we can do so seamlessly and consistent with
> traditional 'C' semantics.

This is a generally sane thing to do.


On Sat, Mar 20, 2004 at 12:02:35AM -0800, Paul Jackson wrote:
> 3) Passing a const cpumask pointer to a function that examines a cpumask
> instead of passing the cpumask by value (on small systems, its one word
> either way - on big systems, it saves copying a multiword mask on the
> stack).

IIRC the overhead of indirection on smaller systems was regarded as
unacceptable for these cases, which is what motivated the wrapper type.
If you can get the naked versions merged, more power to you.


On Sat, Mar 20, 2004 at 12:02:35AM -0800, Paul Jackson wrote:
> 4) Throwing away dead code:
> static int physical_balance = 0;
> cpumask_t tmp;
> cpus_shift_right(tmp, cpu_online_map, 2);
> if (smp_num_siblings > 1 && !cpus_empty(tmp)
> physical_balance = 1;
> The above code fragments are from arch/i386 in 2.6.3-rc1-mm1.

[warning! long-winded response follows]

This isn't quite dead; physical_balance isn't a local. it's a state
variable static to io_apic.c and it determines the behavior later after
boot. Frankly, this code fragment is inexplicably bizarre and I've
never seen the code in action, as I have something very closely
approximating zero access (or less) to the systems that use it. In
general, this means you're stuck being very literal about its semantics
until the situation is clarified. This used to be something like

if (smp_num_siblings > 1 && (cpu_online_map >> 2))
physical_balance = 1;

which defied my attempts to reverse-engineer its true meaning in terms
of cpumasks. In terms of HT, it "wants to do":

if (logical_cpus_per_physical_cpu() > 1 && nr_physical_cpus() > 1)
set_state_variable_to_balance_by_physical_cpus();

I'm not wild about this kind of thing and would rather not codify
breakage on mixed cpu revision systems any more so than its already
done. The semantics also differ slightly if/when cpu_online_map is
arranged in varying ways, which obscured the issue and what ultimately
stifled the otherwise strong urge to improve it.

A potential "cleanup" that occurred to me is:

if (cpus_weight(cpu_online_map) > smp_num_siblings)
physical_balance = 1;

which isn't really "good enough" either, but the best that can be done
with the now-extant machine descriptions. I would rather leave the
thing as-is until something equivalent to the following can be done
and the rest of arch/i386 swept to properly deal with HT in general:

if (nr_logical_cpus() > nr_physical_cpus() && nr_physical_cpus() > 1)
physical_balance = 1;

but this is not how the machine descriptions are arranged now, and I
personally have neither the access to machines nor time allotted to
implement the changes required to fix this code properly. Other,
similar things hold for other bizarre cases where there's fear of
general fragility and bizarre bit twiddling forms of checks that would
vary in semantics from the "obviously cleaner" variants that can't be
cleaned up willy-nilly without the backing to get at the systems to
verify the changes. The only real exceptions to this are when things
are broken as-is, and what you have is an immediate and relatively
localized fix.

It's really unclear that the ia32 APIC/SMP disaster can ever feasibly
be cleaned up, as the code already landed there, and churning it just
raises the spectre of version skew, a rather long, dreary fight to get
the whole of the corrections merged, and the still more terrifying
possibility of incomplete merges leaving various out-of-synch codebases
(e.g. distros) in broken states. Let it serve as an example as to why
things should be done right the first time.

In conclusion, I converted a lot of the i386 arch code very literally
for a good reason. I wish us all the best of noseplugs as we're all
stuck holding our noses while the dead woodchuck under the porch stinks
up the kernel in perpetuity.


-- wli

2004-03-21 04:19:37

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> This isn't quite dead; physical_balance isn't a local. it's a state
> variable static to io_apic.c and it determines the behavior later after
> boot.

Find by me if folks have their dirty laundry. There are limits to my
powers to set things right.

Sorry to have provoked your length explanation of physical_balance, but
in the version of the kernel that I happened to do my research on,
2.6.3-rc1-mm1, this is _dead_ code. The variable physical_balance is
never read, just written, and only appears on 3 lines total.

Obviously if it is in use in current versions of the kernel, then it's
not dead code anymore (at least not without a more profound
understanding of what's going on, which I make no claims to).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-21 04:36:32

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Sat, Mar 20, 2004 at 08:19:54PM -0800, Paul Jackson wrote:
> Find by me if folks have their dirty laundry. There are limits to my
> powers to set things right.
> Sorry to have provoked your length explanation of physical_balance, but
> in the version of the kernel that I happened to do my research on,
> 2.6.3-rc1-mm1, this is _dead_ code. The variable physical_balance is
> never read, just written, and only appears on 3 lines total.
> Obviously if it is in use in current versions of the kernel, then it's
> not dead code anymore (at least not without a more profound
> understanding of what's going on, which I make no claims to).

There's probably something in -mm reducing its use that I haven't
looked at; the digression there was based on mainline.


-- wli

2004-03-21 07:59:23

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]



William Lee Irwin III wrote:

>On Sat, Mar 20, 2004 at 08:19:54PM -0800, Paul Jackson wrote:
>
>>Find by me if folks have their dirty laundry. There are limits to my
>>powers to set things right.
>>Sorry to have provoked your length explanation of physical_balance, but
>>in the version of the kernel that I happened to do my research on,
>>2.6.3-rc1-mm1, this is _dead_ code. The variable physical_balance is
>>never read, just written, and only appears on 3 lines total.
>>Obviously if it is in use in current versions of the kernel, then it's
>>not dead code anymore (at least not without a more profound
>>understanding of what's going on, which I make no claims to).
>>
>
>There's probably something in -mm reducing its use that I haven't
>looked at; the digression there was based on mainline.
>
>

I think it is my patch that makes cpu_sibling_map a cpumask.

You don't need a special case for num_siblings == 2 anymore.
I forgot to clean up the last trace of physical_balance.

2004-03-23 00:02:11

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> It's not quasi-const, it is const.

Const cpumasks are not your Father's const references.
======================================================

True, the current cpumask_const_t API makes essential use of the 'C'
const construct and semantics.

But it's not simply and entirely the usual const usage as seen in the
signature:

char *strncpy(char *dest, const char *src, size_t n);

In this usual case, declaring something referenced by an argument
pointer as 'const' means the called function won't change the referenced
data. For simple values such as the 'size_t n', it's not useful,
because these are simply pass by value, so changes by the called
function won't be visible to the caller anyway. But when passing in
pointers, it has become useful in 'C' to indicate that the called
function won't mess with the pointed to data.

That's the usual use of the C keyword 'const', as I am sure you know
well.

You aren't simply marking parameters with the C 'const' keyword; you are
also labeling some functions and a cpumask typedef as const, embedding
the letters "_const" in the names. Part of the cpumask_const_t
implementation uses the real C keyword 'const', but it's not simply the
classic usage shown above.

Each time I try to work with this, I have to scratch my head and think
a few minutes first.

Is the following a fair statement of this interface:

1) Declaring a cpumask as type 'cpumask_const_t' means it might be passed
to various cpumask '*_const()' methods either by value or by a const
pointer, depending on the implementation (the size of the mask,
presumably).

2) The various cpumask '*_const()' methods expect to be passed such
cpumask_const_t masks, and may actually pass by value or by const
reference, at the discretion of the implementation, again.

So if I am not too confused, this non-C keyword embedded '_const' string
means something like "optionally pass masks by value or const pointer,
at the discretion of the implementation."

This is an abuse (or at the very least an extension) of our customary
use of the 'const' construct of the 'C' language, in my view; an abuse
with which users will never become comfortable.

You castigate yourself for not having converted more cpumask
manipulations to making full use of this quasi-const mechanism.

That way does not lay closure and sweet success. You have provided a
weird API that will never become widely and comfortably used by others;
you will always find your available energies for putting it in use
yourself to be insufficient; you will be pushing this rock up the hill
throughout your years in purgatory (hopefully fewer years than I expect
to spend there ;).

I do not fault you for failing to put this interface to more wide
spread usage. I do fault the interface for being a bit too weird
to obtain wide spread usage on its own, after it's original creation
and introduction.

Could you (Bill or any lurker) provide any specific examples of generic
code where it is important to pass by value on some architectures, but
by const reference on some other architectures? Rather than distort the
API in general for such cases, I'd prefer to consider more narrowly
focused solutions that address such specific needs. In general, I would
hope to be able to arrive at a cpumask (or even more generic mask as
multi-word bit mask) ADT that was always clear and explicit, using just
traditional 'C' idioms, as to whether arguments were pass by value or
reference, and which arguments were const reference or not-const. If
some arch's (say sparc64) have arch-specific code that explicitly passes
a pointer to a cpumask, where similar code in some other arch passes by
value, that's fine, and an appropriate arch-specific optimization. The
only challenges come in generic code for which arch's cannot agree on
any one form for passing a particular mask. Examples anyone ... ?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-23 01:14:31

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> > The current nested and ifdef'd complexity of the cpumask macro
> > headers works against such efficient coding - it's non-trivial to see
> > just what macros are available. The youngins reading this may not
> > appreciate the value of reducing brain load; but the oldins might.
>
> It was painful enough just to preserve semantics across such a far-
> ranging set of changes. Ideally, yes, I would have done this in the
> first place.

I'm not trying to get on your case, Bill, for not creating and applying
more various cpumask functions. Rather I am looking for ways to make
that API easier for others to use and use well. If the situations that
passed about temporary cpumasks can be collapsed into single calls that
are more efficient, then that is one way to make progress on this.

Taking masks to be a struct of an array of unsigned longs seems to come
pretty close. The sparc64 arch would want to pass a pointer to this
apparently, rather than the struct itself - faster for them. Some other
smaller archs would _not_ want to pass a pointer, but rather the (one
word, for them) value - avoids a dereference for them. In arch specific
code, each arch can choose which way works best for them. I need to
identify any generic code where these preferences collide.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-23 01:23:12

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> There is one issue and one issue only: the
> larger the "instruction set" grows, the more intrusive and complex-
> looking the thing appears, the more of quagmire merging it becomes.

Providing the additional mask "instructions" shouldn't create any quagmire.

It's the using of them that is intrusive.

Initially, some intrusive work, such as you (Bill) did was needed to get
masks implanted. But now it should be appropriate to simply provide the
alternative calls that can make certain code sequences more efficient,
and then if someone complains that their old code sequence is too slow
or uses too much stack, we can recommend alternative code sequences that
will work better for them.

Passing the buck, division of labour and all that ...

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-23 01:26:06

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> I think it's vaguely fair to ask
> those who need it to propagate its use around themselves.

I think it's entirely fair to ask that.

Once the API is easy to grok.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-23 02:11:32

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Mon, Mar 22, 2004 at 05:12:43PM -0800, Paul Jackson wrote:
> I'm not trying to get on your case, Bill, for not creating and applying
> more various cpumask functions. Rather I am looking for ways to make
> that API easier for others to use and use well. If the situations that
> passed about temporary cpumasks can be collapsed into single calls that
> are more efficient, then that is one way to make progress on this.
> Taking masks to be a struct of an array of unsigned longs seems to come
> pretty close. The sparc64 arch would want to pass a pointer to this
> apparently, rather than the struct itself - faster for them. Some other
> smaller archs would _not_ want to pass a pointer, but rather the (one
> word, for them) value - avoids a dereference for them. In arch specific
> code, each arch can choose which way works best for them. I need to
> identify any generic code where these preferences collide.

I generally like the idea of the arches getting their choice here (heck,
even wrt. representation; e.g. some arch might want an array of cpuid
numbers and not a bitmap at all due to extremely sparse cpuid's or some
such nonsense). The asm-generic stuff was largely a question of
reducing diffsize, preemptive code consolidation, etc.

I don't believe normal C (i.e. sans typedef) will allow needed
ambiguities that make UP/small SMP/etc. compile things out nicely, but
if you can get the requirement of the stuff totally compiling out
dropped or do it in normal C somehow, go for it. I'd call it a cleanup.


-- wli

2004-03-23 02:10:42

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Mon, Mar 22, 2004 at 05:21:34PM -0800, Paul Jackson wrote:
> Providing the additional mask "instructions" shouldn't create any quagmire.
> It's the using of them that is intrusive.
> Initially, some intrusive work, such as you (Bill) did was needed to get
> masks implanted. But now it should be appropriate to simply provide the
> alternative calls that can make certain code sequences more efficient,
> and then if someone complains that their old code sequence is too slow
> or uses too much stack, we can recommend alternative code sequences that
> will work better for them.
> Passing the buck, division of labour and all that ...

Higher-level constructs that improve runtime efficiency may also be
good cleanups that can sometimes fix bugs and/or generally consolidate
code. e.g. cpus_subset(x, y). I think those kinds of things should be
perfectly mergeable.


-- wli

2004-03-23 02:13:54

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Mon, Mar 22, 2004 at 03:59:52PM -0800, Paul Jackson wrote:
> Could you (Bill or any lurker) provide any specific examples of generic
> code where it is important to pass by value on some architectures, but
> by const reference on some other architectures? Rather than distort the
> API in general for such cases, I'd prefer to consider more narrowly
> focused solutions that address such specific needs. In general, I would
> hope to be able to arrive at a cpumask (or even more generic mask as
> multi-word bit mask) ADT that was always clear and explicit, using just
> traditional 'C' idioms, as to whether arguments were pass by value or
> reference, and which arguments were const reference or not-const. If
> some arch's (say sparc64) have arch-specific code that explicitly passes
> a pointer to a cpumask, where similar code in some other arch passes by
> value, that's fine, and an appropriate arch-specific optimization. The
> only challenges come in generic code for which arch's cannot agree on
> any one form for passing a particular mask. Examples anyone ... ?

I don't have examples of such, no. The requirement was imposed on me for
basically the reason that what cpu-related API's preceded cpumask_t
compiled out to nops on UP and no indirection on small SMP. If you've
got such an implementation that compiles out to nops on UP etc., or can
get the requirement(s) dropped, I'm all ears.

-- wli

2004-03-23 02:41:48

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> go for it

I'll try.

The main thing I keep seeing, when I look at the resulting machine
code, is that things such as the following (stripped of everything
except the bare bones hardwired stuff I needed to compile):

#define BITS_TO_LONGS(i) (i+63)/64
#define NR_CPUS 64
#define __mask(bits) struct { unsigned long _m[BITS_TO_LONGS(bits)]; }
typedef __mask(NR_CPUS) cpumask_t;

#define mask_and(d,s1,s2) \
do { \
int i; \
for (i = 0; i < sizeof(d)/sizeof(unsigned long); i++) \
d._m[i] = s1._m[i] & s2._m[i]; \
} while(0)

unsigned long f(cpumask_t c, cpumask_t d, cpumask_t e)
{
mask_and(c,d,e);
return c._m[0];
}


end up producing quite fine code, such as a single inline 64 bit and
instruction, with no evidence of the for loop, array or struct wrapper,
on an ia64 (gcc 3.2.3 -O2) for the interesting guts of my silly little
test function f().

Objdump -d of function f() in above:

4000000000000610 <f>:
4000000000000610: 0d 60 c0 19 3f 23 [MFI] adds r12=-16,r12
4000000000000616: 00 00 00 02 00 00 nop.f 0x0
400000000000061c: 21 0a 31 80 and r8=r34,r33;;
4000000000000620: 11 00 00 00 01 00 [MIB] nop.m 0x0
4000000000000626: c0 80 30 00 42 80 adds r12=16,r12
400000000000062c: 08 00 84 00 br.ret.sptk.many b0;;

>From this I conjecture that I can provide a single call:

cpumask_and(cpumask_t d, cpumask_t s1, cpumask_t s2);

that works on both normal (1 to 32 cpu) systems and on big iron systems,
with traditional 'C' pass by value semantics, all derived from a single
mask type that works for both node and cpu masks.

The one sticky point evident to me so far would be if some generic code
were passing a cpumask_t across a function call boundary, and needed to
be optimum for both small and sparc64 - one would want to pass by value,
the other would want to pass a pointer to the cpumask.

This is not your fathers 'C'. The compile time inlining and
optimization provided by gcc enables it to do a lot more than Dennis
Ritchie's original C compiler that I learned on.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-23 03:14:25

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

On Mon, Mar 22, 2004 at 06:39:18PM -0800, Paul Jackson wrote:
> From this I conjecture that I can provide a single call:
> cpumask_and(cpumask_t d, cpumask_t s1, cpumask_t s2);
> that works on both normal (1 to 32 cpu) systems and on big iron systems,
> with traditional 'C' pass by value semantics, all derived from a single
> mask type that works for both node and cpu masks.
> The one sticky point evident to me so far would be if some generic code
> were passing a cpumask_t across a function call boundary, and needed to
> be optimum for both small and sparc64 - one would want to pass by value,
> the other would want to pass a pointer to the cpumask.
> This is not your fathers 'C'. The compile time inlining and
> optimization provided by gcc enables it to do a lot more than Dennis
> Ritchie's original C compiler that I learned on.

gcc flat out miscompiled such inlines last I checked (Zwane shipped the
bugreport IIRC). Either this kind of good behavior is not universally
observable or a miracle occurred and gcc's codegen went from incorrect
to 1980's (fscking patents).

Anyhow, this was also an observation of the code effectively made in
isolation; uninlining and other catastrophes do happen.

If people really thinks this works and/or don't care when it doesn't,
go for it. Last time I heard they did; who knows, the answer may be
different this time.


-- wli

2004-03-23 03:39:05

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> gcc flat out miscompiled such inlines last I checked (Zwane shipped the
> bugreport IIRC).

The one error I've heard tell of recently is one that Andi Kleen hit in
include/linux/bitmap.h. He had to compute the copy length explicitly in
a separate variable, or gcc forgot to do it (I forget the details). The
change is:


static inline void bitmap_copy(unsigned long *dst,
const unsigned long *src, int bits)
{
- memcpy(dst, src, BITS_TO_LONGS(bits)*sizeof(unsigned long));
+ int len = BITS_TO_LONGS(bits)*sizeof(unsigned long);
+ memcpy(dst, src, len);
}


Do you have any more pointers/info on the miscompile you quote above?
Like - how long ago - what gcc version ...

It would be an insult to wrap the entire cpumask design around the
axle of such non-specific allegations of gcc misconduct. Better to
get the API right, and then deal with specific tool bugs as necessary.

Or, as Linus might say (did say, in a quite different context):

Never _ever_ make your source-code look worse because your tools suck. Fix
the tools instead.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-23 03:59:29

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

At some point in the past, I wrote:
>> gcc flat out miscompiled such inlines last I checked (Zwane shipped the
>> bugreport IIRC).

On Mon, Mar 22, 2004 at 07:36:28PM -0800, Paul Jackson wrote:
> The one error I've heard tell of recently is one that Andi Kleen hit in
> include/linux/bitmap.h. He had to compute the copy length explicitly in
> a separate variable, or gcc forgot to do it (I forget the details). The
> change is:
> static inline void bitmap_copy(unsigned long *dst,
> const unsigned long *src, int bits)
> {
> - memcpy(dst, src, BITS_TO_LONGS(bits)*sizeof(unsigned long));
> + int len = BITS_TO_LONGS(bits)*sizeof(unsigned long);
> + memcpy(dst, src, len);
> }
> Do you have any more pointers/info on the miscompile you quote above?
> Like - how long ago - what gcc version ...

It dates back to before the merge. It should have been posted to lkml
by Zwane.


On Mon, Mar 22, 2004 at 07:36:28PM -0800, Paul Jackson wrote:
> It would be an insult to wrap the entire cpumask design around the
> axle of such non-specific allegations of gcc misconduct. Better to
> get the API right, and then deal with specific tool bugs as necessary.
> Or, as Linus might say (did say, in a quite different context):
> Never _ever_ make your source-code look worse because your tools suck. Fix
> the tools instead.

I can't answer this directly. Basically, if ppl are merging workarounds
for ancient compilers, my expectation is that operational semantics
specific to newer compiler versions (or specific architectures) can't
be assumed. I'm not the one making the calls as to whether the
requirements (compiling out completely on UP, compiling to no indirection
on small SMP) are relaxed or if "new enough compiler" meets them.


-- wli

2004-03-23 04:05:02

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] Introduce nodemask_t ADT [0/7]

> It dates back to before the merge.

what merge?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373