2002-06-18 17:18:13

by James Simmons

[permalink] [raw]
Subject: latest linus-2.5 BK broken



gcc -Wp,-MD,./.sched.o.d -D__KERNEL__ -I/tmp/fbdev-2.5/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=i686 -malign-functions=4 -nostdinc -iwithprefix include -fno-omit-frame-pointer -DKBUILD_BASENAME=sched -c -o sched.o sched.c
sched.c: In function `sys_sched_setaffinity':
sched.c:1329: `cpu_online_map' undeclared (first use in this function)
sched.c:1329: (Each undeclared identifier is reported only once
sched.c:1329: for each function it appears in.)
sched.c: In function `sys_sched_getaffinity':
sched.c:1389: `cpu_online_map' undeclared (first use in this function)
make[1]: *** [sched.o] Error 1

  .---.
  |o_o |
  |:_/ |   Give Micro$oft the Bird!!!!
 //   \ \  Use Linux!!!!
(|     | )
/'\_   _/`\
\___)=(___/


2002-06-18 17:46:51

by Robert Love

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Tue, 2002-06-18 at 10:18, James Simmons wrote:

> gcc -Wp,-MD,./.sched.o.d -D__KERNEL__ -I/tmp/fbdev-2.5/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -pipe -mpreferred-stack-boundary=2 -march=i686 -malign-functions=4 -nostdinc -iwithprefix include -fno-omit-frame-pointer -DKBUILD_BASENAME=sched -c -o sched.o sched.c
> sched.c: In function `sys_sched_setaffinity':
> sched.c:1329: `cpu_online_map' undeclared (first use in this function)
> sched.c:1329: (Each undeclared identifier is reported only once
> sched.c:1329: for each function it appears in.)
> sched.c: In function `sys_sched_getaffinity':
> sched.c:1389: `cpu_online_map' undeclared (first use in this function)
> make[1]: *** [sched.o] Error 1

Rusty, I assume this is a side-effect of the hotplug merge?

Can you fix this or tell me what is the new equivalent of
cpu_online_map?

Robert Love

2002-06-18 18:47:14

by Rusty Russell

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

In message <1024422409.1476.208.camel@sinai> you write:
> On Tue, 2002-06-18 at 10:18, James Simmons wrote:
>
> > gcc -Wp,-MD,./.sched.o.d -D__KERNEL__ -I/tmp/fbdev-2.5/include -Wall -Wst
rict-prototypes -Wno-trigraphs -O2 -fomit-frame-pointer -fno-strict-aliasing -f
no-common -pipe -mpreferred-stack-boundary=2 -march=i686 -malign-functions=4 -
nostdinc -iwithprefix include -fno-omit-frame-pointer -DKBUILD_BASENAME=sche
d -c -o sched.o sched.c
> > sched.c: In function `sys_sched_setaffinity':
> > sched.c:1329: `cpu_online_map' undeclared (first use in this function)
> > sched.c:1329: (Each undeclared identifier is reported only once
> > sched.c:1329: for each function it appears in.)
> > sched.c: In function `sys_sched_getaffinity':
> > sched.c:1389: `cpu_online_map' undeclared (first use in this function)
> > make[1]: *** [sched.o] Error 1
>
> Rusty, I assume this is a side-effect of the hotplug merge?

Yes, sorry.

> Can you fix this or tell me what is the new equivalent of
> cpu_online_map?

Well, I'm heading away from assumptions on the arch representations of
online CPUs (which the NUMA guys need anyway).

You could do a loop here, but the real problem is the broken userspace
interface. Can you fix this so it takes a single CPU number please?

ie.
/* -1 = remove affinity */
sys_sched_setaffinity(pid_t pid, int cpu);

This will work everywhere, and doesn't require userspace to know the
size of the cpu bitmask etc.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-06-18 18:55:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken



On Wed, 19 Jun 2002, Rusty Russell wrote:
>
> You could do a loop here, but the real problem is the broken userspace
> interface. Can you fix this so it takes a single CPU number please?

NO.

Rusty, people want to do "node-affine" stuff, which absolutely requires
you to be able to give CPU "collections". Single CPU's need not apply.

Linus

2002-06-18 18:59:39

by Robert Love

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Tue, 2002-06-18 at 11:56, Linus Torvalds wrote:

> NO.
>
> Rusty, people want to do "node-affine" stuff, which absolutely requires
> you to be able to give CPU "collections". Single CPU's need not apply.

I would also hate to have to make 32 system calls to get the affinity
mask I want.

If anything, I think the interface is not collective _enough_ - further
abstractions like psets seem to be in favor, not dropping down to a
one-CPU-and-task per-call thing. Not that I am complaining, I am happy
with the interface...

Robert Love

2002-06-18 19:11:44

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

Hi Rusty,

On Wed, 19 Jun 2002, Rusty Russell wrote:

> > Can you fix this or tell me what is the new equivalent of
> > cpu_online_map?
>
> Well, I'm heading away from assumptions on the arch representations of
> online CPUs (which the NUMA guys need anyway).

Will there also be some sort of facility to determine which node a cpu is
from? This would be quite handy in other areas.

Cheers,
Zwane Mwaikambo

--
http://function.linuxpower.ca


2002-06-18 19:29:52

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Wed, Jun 19, 2002 at 04:51:31AM +1000, Rusty Russell wrote:
> You could do a loop here, but the real problem is the broken userspace
> interface. Can you fix this so it takes a single CPU number please?
>
> ie.
> /* -1 = remove affinity */
> sys_sched_setaffinity(pid_t pid, int cpu);
>
> This will work everywhere, and doesn't require userspace to know the
> size of the cpu bitmask etc.

That doesn't work. Think of SMT CPU pairs (aka HyperThreading) or
quads that share caches.

-ben
--
"You will be reincarnated as a toad; and you will be much happier."

2002-06-18 19:47:25

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Tue, 18 Jun 2002, Benjamin LaHaise wrote:

> > /* -1 = remove affinity */
> > sys_sched_setaffinity(pid_t pid, int cpu);
> >
> > This will work everywhere, and doesn't require userspace to know the
> > size of the cpu bitmask etc.
>
> That doesn't work. Think of SMT CPU pairs (aka HyperThreading) or
> quads that share caches.

Hmm i don't understand, mind explaining why it wouldn't work on HT?

Cheers,
Zwane Mwaikambo

--
http://function.linuxpower.ca


2002-06-18 19:49:15

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Tue, Jun 18, 2002 at 09:19:40PM +0200, Zwane Mwaikambo wrote:
> Hmm i don't understand, mind explaining why it wouldn't work on HT?

On HyperThreading, you want to specify that either cpu in a pair is
okay. In larger SMP machines that share a cache between 4 CPUs, the
mask is likely to contain all 4 CPUs in each quad.

-ben
--
"You will be reincarnated as a toad; and you will be much happier."

2002-06-18 19:55:25

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Tue, 18 Jun 2002, Benjamin LaHaise wrote:

> On HyperThreading, you want to specify that either cpu in a pair is
> okay. In larger SMP machines that share a cache between 4 CPUs, the
> mask is likely to contain all 4 CPUs in each quad.

Hmm so you want to apply the same 'node' principle to HT? The way HT works
i can see why that would be a good idea. Node affinity on the quads makes
sense and distinguishing which cpus belong to which quads would also help
for irq affinity.

Thanks,
Zwane Mwaikambo

--
http://function.linuxpower.ca


2002-06-18 20:00:50

by Rusty Russell

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

In message <[email protected]> you write:
>
>
> On Wed, 19 Jun 2002, Rusty Russell wrote:
> >
> > You could do a loop here, but the real problem is the broken userspace
> > interface. Can you fix this so it takes a single CPU number please?
>
> NO.
>
> Rusty, people want to do "node-affine" stuff, which absolutely requires
> you to be able to give CPU "collections". Single CPU's need not apply.

NO. They want to be node-affine. They don't want to specify what
CPUs they attach to.

Understand?
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-06-18 20:05:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken



On Wed, 19 Jun 2002, Rusty Russell wrote:
>
> NO. They want to be node-affine. They don't want to specify what
> CPUs they attach to.

So you're going to have separate interfaces for that? Gag me with a volvo,
but that's idiotic.

Besides, even that would be broken. You want bitmaps, because bitmaps is
really what it is all about. It's NOT about "I must run on this CPU", it
can equally well be "I mustn't run on those two CPU's that are hosting the
RT part of this thing" or something like that.

Linus

2002-06-18 20:09:44

by Rusty Russell

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

In message <[email protected]> you write:
> On Wed, Jun 19, 2002 at 04:51:31AM +1000, Rusty Russell wrote:
> > You could do a loop here, but the real problem is the broken userspace
> > interface. Can you fix this so it takes a single CPU number please?
> >
> > ie.
> > /* -1 = remove affinity */
> > sys_sched_setaffinity(pid_t pid, int cpu);
> >
> > This will work everywhere, and doesn't require userspace to know the
> > size of the cpu bitmask etc.
>
> That doesn't work. Think of SMT CPU pairs (aka HyperThreading) or
> quads that share caches.

This is the NUMA "I want to be in this group" problem. If you're
serious about this, you'll go for a sys_sched_groupaffinity call, or
add an extra arg to sys_sched_setaffinity, or simply use the top 16
bits of the cpu arg.

You will also add /proc/cpugroups or something to export this
information to users so there's a point.

Sorry, the current interface is insufficient for NUMA *and* is
impossible[1] for the user to use correctly.

Rusty.
[1] Defined as "too hard for them to ever do it properly"
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-06-18 20:21:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken



On Wed, 19 Jun 2002, Rusty Russell wrote:
>
> > That doesn't work. Think of SMT CPU pairs (aka HyperThreading) or
> > quads that share caches.
>
> This is the NUMA "I want to be in this group" problem. If you're
> serious about this, you'll go for a sys_sched_groupaffinity call, or
> add an extra arg to sys_sched_setaffinity, or simply use the top 16
> bits of the cpu arg.

Oh, yes. That makes sense. NOT.

> Sorry, the current interface is insufficient for NUMA *and* is
> impossible[1] for the user to use correctly.

Don't be silly.

Give _one_ good reason why the affinity system call cannot take a simple
bitmask? It's trivial to use, your arguments do not make any sense.

Linus

2002-06-18 20:27:39

by Rusty Russell

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

In message <[email protected]> you write:
> On Wed, 19 Jun 2002, Rusty Russell wrote:
> >
> > NO. They want to be node-affine. They don't want to specify what
> > CPUs they attach to.
>
> So you're going to have separate interfaces for that? Gag me with a volvo,
> but that's idiotic.

No, you have accepted a non-portable userspace interface and put it in
generic code. THAT is idiotic.

So any program that doesn't use the following is broken:

#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BITS_PER_LONG (sizeof(long)*CHAR_BIT)

/* the (pid, len, mask) syscall wrapper under discussion */
extern int sched_setaffinity(pid_t pid, unsigned int len, unsigned long *mask);

int set_cpu(int cpu)
{
	size_t size = sizeof(unsigned long);
	unsigned long *bitmask = NULL;
	int ret;

	do {
		/* keep doubling the mask until the kernel stops saying EINVAL */
		size *= 2;
		bitmask = realloc(bitmask, size);
		memset(bitmask, 0, size);
		bitmask[cpu / BITS_PER_LONG] = 1UL << (cpu % BITS_PER_LONG);
		ret = sched_setaffinity(getpid(), size, bitmask);
	} while (ret < 0 && errno == EINVAL);

	return ret;
}

> Besides, even that would be broken. You want bitmaps, because bitmaps is
> really what it is all about. It's NOT about "I must run on this CPU", it
> can equally well be "I mustn't run on those two CPU's that are hosting the
> RT part of this thing" or something like that.

Just bind to a cpu != those two CPUs. I could come up with contrived
examples too, but I'm trying to save userspace programmers and those
who have to port to new architectures.

If you don't know how to do it well, do it simply.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-06-18 20:41:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken



On Wed, 19 Jun 2002, Rusty Russell wrote:
>
> So any program that doesn't use the following is broken:

That wasn't so hard, was it?

Besides, we've had this interface for about 15 years, and it's called
"select()". It scales fine to thousands of descriptors, and we're talking
about something that is a hell of a lot less timing-critical than select
ever was.

"Earth to Rusty, come in Rusty.."

How do we handle the bitmaps in select()? Right. We assume some size that
is plenty good enough. Come back to me when something simple like

#define MAX_CPUNR 1024

unsigned long cpumask[MAX_CPUNR / BITS_PER_LONG];

doesn't work.
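
For illustration, here is a minimal userspace sketch of that fixed-size
pattern, assuming a raw syscall(2) invocation of the (pid, len, mask) call
under discussion; the helper name and the 1024-CPU ceiling are illustrative
only, not part of any shipped interface:

#include <unistd.h>
#include <sys/syscall.h>

#define MAX_CPUNR     1024
#define BITS_PER_LONG (sizeof(unsigned long) * 8)

/* Pin the calling process to a single CPU using a statically sized bitmap. */
int bind_to_cpu(int cpu)
{
	unsigned long cpumask[MAX_CPUNR / BITS_PER_LONG] = { 0 };

	cpumask[cpu / BITS_PER_LONG] |= 1UL << (cpu % BITS_PER_LONG);
	return syscall(SYS_sched_setaffinity, getpid(), sizeof(cpumask), cpumask);
}

Whether the kernel honours bits beyond its own mask size is exactly what the
rest of this thread argues about.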

The existing interface is _fine_, and when somebody actually has a machine
with more than 1024 CPU's (yeah, right, I'm really worried), the existing
interface will cause graceful errors instead of doing something
unexpected.

And if you're telling me that people who care about CPU affinity cannot
fathom a simple bitmap of longs, you're just out to lunch.

Linus

2002-06-18 20:55:41

by Robert Love

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Tue, 2002-06-18 at 13:31, Rusty Russell wrote:

> No, you have accepted a non-portable userspace interface and put it in
> generic code. THAT is idiotic.
>
> So any program that doesn't use the following is broken:

On top of what Linus replied, there is the issue that if your task does
not know how many CPUs can be in the system then setting its affinity is
worthless in 90% of the cases.

I.e., everyone today can write code like

sched_setaffinity(0, sizeof(unsigned long), &mask)

but let's say this code is executed on a system with a different number
of bits in the CPU mask. What do you do with the new/old bits? Ignore
them? Set new ones to zero? To 1?

In summary, setting CPU affinity is something that is naturally low-level
enough that it only makes sense when you know what you are setting and not
setting. While a mask of -1 may always make sense, random bitmaps
(think RT stuff here) are explicit for the number of CPUs given.

The interface is designed to make this as easy and clean as possible - i.e.,
the size check, etc.

Robert Love

2002-06-18 21:12:01

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Tue, Jun 18, 2002 at 01:41:12PM -0700, Linus Torvalds wrote:
> That wasn't so hard, was it?
>
> Besides, we've had this interface for about 15 years, and it's called
> "select()". It scales fine to thousands of descriptors, and we're talking
> about something that is a hell of a lot less timing-critical than select
> ever was.

I take issue with the statement that select scales fine to thousands of
file descriptors. It doesn't. For fairly trivial workloads it degrades
to 0 operations per second with more than a few dozen filedescriptors in
the array, but only one descriptor being active. To sustain decent
throughput, select needs something like 50% of the filedescriptors in an
array to be active at every select() call, which makes it unsuitable for
things like LDAP servers, or HTTP/FTP where the clients are behind slow
connections or interactive (like in the real world). I've benchmarked
it -- we should really include something like /dev/epoll in the kernel
to improve this case.

Still, I think the bitmap approach in this case is useful, as having
affinity to multiple CPUs can be needed, and it is not a frequently
occurring operation (unlike select()).

-ben
--
"You will be reincarnated as a toad; and you will be much happier."

2002-06-18 21:20:12

by Cort Dougan

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

I agree with you there. It's not easy, and I'd claim it's not possible
given that no-one has done it yet, to have a select() call that is speedy
for both 0-10 and 1k file descriptors.


} I take issue with the statement that select scales fine to thousands of
} file descriptors. It doesn't. For fairly trivial workloads it degrades
} to 0 operations per second with more than a few dozen filedescriptors in
} the array, but only one descriptor being active. To sustain decent
} throughput, select needs something like 50% of the filedescriptors in an
} array to be active at every select() call, which makes it unsuitable for
} things like LDAP servers, or HTTP/FTP where the clients are behind slow
} connections or interactive (like in the real world). I've benchmarked
} it -- we should really include something like /dev/epoll in the kernel
} to improve this case.
}
} Still, I think the bitmap approach in this case is useful, as having
} affinity to multiple CPUs can be needed, and it is not a frequently
} occurring operation (unlike select()).
}
} -ben
} --
} "You will be reincarnated as a toad; and you will be much happier."

2002-06-18 21:46:00

by Bill Huey

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Tue, Jun 18, 2002 at 05:12:00PM -0400, Benjamin LaHaise wrote:
> connections or interactive (like in the real world). I've benchmarked
> it -- we should really include something like /dev/epoll in the kernel
> to improve this case.

Heh, try kqueue(). ;)

It's a pretty workable API and there seems to be a lot of momentum in
the BSDs (Darwin, FreeBSD) for it.

bill

2002-06-18 21:50:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Tue, 18 Jun 2002, Cort Dougan wrote:
>
> I agree with you there. It's not easy, and I'd claim it's not possible
> given that no-one has done it yet, to have a select() call that is speedy
> for both 0-10 and 1k file descriptors.

Actually, select() scales a lot better than poll() for _dense_ bitmaps.

The problem with non-scalability ends up being either sparse bitmaps
(minor problem, poll() can help) or just the work involved in watching a
large number of fd's (major problem, but totally unrelated to the bitmap
itself, and poll() usually makes it worse thanks to more data to be
moved).

Anyway, I was talking about the scalability of the _data_structure_, not
the scalability performance-wise. Performance scalability is a non-issue
for something like setaffinity(), since it's just not called at any rate
approaching poll.

From a data structure standpoint, bitmaps are clearly the simplest dense
representation, and scale perfectly well to any reasonable number of
CPU's.

If we end up using a default of 1024, maybe you'll have to recompile that
part of the system that has anything to do with CPU affinity in about
10-20 years by just upping the number a bit. Quite frankly, that's going
to be the _least_ of the issues.

Linus

2002-06-18 22:05:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken


On Wed, 19 Jun 2002, Rusty Russell wrote:

> This is the NUMA "I want to be in this group" problem. If you're
> serious about this, you'll go for a sys_sched_groupaffinity call, or add
> an extra arg to sys_sched_setaffinity, or simply use the top 16 bits of
> the cpu arg.

the reason why i picked a linear cpu bitmask for the first patches to do
affinity syscalls (which ultimately found their way into 2.5) was very
simple: we do *NOT* want to deal with cache hierarchies in the kernel, at
this point.

enumerating CPUs and giving processes the ability to bind themselves to an
arbitrary set of CPUs is enough. *IF* user-space wants to do more then
they can get and use whatever NUMA information they want. There could even
be separate sets of syscalls perhaps to get the exact CPU cache hierarchy
of the system, although that would have to be done really well to be truly
generic and long-living.

so in this case the simplest approach that scales well to a reasonable
number of CPUs (thousands, at least) won.

> You will also add /proc/cpugroups or something to export this
> information to users so there's a point.

and this might not even be enough. Cache hierarchies can be pretty
non-trivial, and it's not necessarily a distinct group of CPUs, it could
be a hierarchy of multiple levels, or it could even be an asymmetric
distribution of caches. In fact it might not even be expressible in
'group' categories - caches could be interconnected in a 2D or even 3D
topology. Or multiprocessing CPUs could have dynamic caches in the future
- 'cache on demand' allocated to a cache-happy CPU, while another CPU with
a smaller working set will use less cache space. [obviously the technology
is not available today.]

one thing i was *very* sure about: we frankly don't have the slightest clue
what the really big systems will look like in 10 or 20 years. So
hardcoding anything like 'group affinity' or some of today's NUMA
hierarchies would be pretty shortsighted. I'm convinced that the 'opaque'
solution - the simple but generic setaffinity system call - is the right
choice.

Ingo

2002-06-18 23:39:04

by Michael Hohnbaum

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Tuesday, June 18 2002, Linus Torvalds wrote:

> On Wed, 19 Jun 2002, Rusty Russell wrote:
>
> NO. They want to be node-affine. They don't want to specify what
> CPUs they attach to.
>
> So you're going to have separate interfaces for that? Gag me with a
> volvo, but that's idiotic.
>
> Besides, even that would be broken. You want bitmaps, because bitmaps
> is really what it is all about. It's NOT about "I must run on this
> CPU", it can equally well be "I mustn't run on those two CPU's that
> are hosting the RT part of this thing" or something like that.
>
> Linus


A bit mask is a very good choice for the sched_setaffinity()
interface. I would suggest an additional argument be added
which would indicate the resource that the process is to be
affined to. That way this interface could be used for binding
processes to cpus, memory nodes, perhaps NUMA nodes, and,
as discussed recently in another thread, other processes.
Personally, I see NUMA nodes as an overkill, if a process
can be bound to cpus and memory nodes.

There has been an effort made to address the needs for binding
processes to processors, memory nodes, etc. for NUMA machines.
A proposed API has been developed and implemented. See
http://lse.sourceforge.net/numa/numa_api.html for a spec on
the API. Matt Dobson has posted the implementation to lkml
as a patch against 2.5 several times, but it has not seen much
discussion. I could see many of the capabilities provided
in the NUMA API being offered through sched_setaffinity()
as described above.

Michael Hohnbaum
[email protected]





2002-06-18 23:59:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken


On 18 Jun 2002, Michael Hohnbaum wrote:

> A bit mask is a very good choice for the sched_setaffinity()
> interface. [...]

thanks :)

> [...] I would suggest an additional argument be added
> which would indicate the resource that the process is to be
> affined to. That way this interface could be used for binding
> processes to cpus, memory nodes, perhaps NUMA nodes, and,
> as discussed recently in another thread, other processes.
> Personally, I see NUMA nodes as an overkill, if a process
> can be bound to cpus and memory nodes.

are you sure we want one generic, process-based affinity interface?

i think the affinity to certain memory regions might need to be more
finegrained than this. Eg. it could be useful to define a per-file
(per-inode) 'backing store memory node' that the file is affine to. This
will eg. cause the pagecache to be allocated in the memory node.
Process-based affinity does not describe this in a natural way. Another
example, memory maps: we might want to have a certain memory map (vma)
allocated in a given memory node, independently of where the process that
is faulting a given page resides.

and it might certainly make sense to have some sort of 'default memory
affinity' for a process as well, but this should be a different syscall -
it really does a much different thing than CPU affinity. The CPU resource
is 'used' only temporarily with little footprint, while memory usage is
often for a very long timespan, and the affinity strategies differ
greatly. Also, memory as a resource is much more complex than CPU, eg. it
must handle things like over-allocation, fallback to 'nearby' nodes if a
node is full, etc.

so i'd suggest actually creating a good memory-affinity syscall interface,
instead of trying to generalize it into the simple, robust, finite
CPU-affinity syscalls.

Ingo

2002-06-19 00:11:04

by Ingo Molnar

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken


another thought would be that the 'default' memory affinity can be derived
from the CPU affinity. A default process, one which is affine to all CPUs,
can have memory allocated from all memory nodes. A process which is bound
to a given set of CPUs, should get its memory allocated from the nodes
that 'belong' to those CPUs.

the topology might not be as simple as this, but generally it's the CPU
that drives the topology, so a given CPU affinity mask leads to a specific
'preferred memory nodes' bitmask - there isnt much choice needed on the
user's part, in fact it might be contraproductive to bind a process to
some CPU and bind its memory allocations to a very distant memory node.
While mathematically there is not necesserily any 1:1 relationship between
CPU affinity and 'best memory affinity', technologically there is.

per-object affinity might still be possible under this scheme; it would
override whatever 'default' memory affinity is derived from the CPU
affinity mask. [that would allow, for example, an important database
file to be locked to a given memory node, so that helper processes executing
on distant CPUs will not cause a distant pagecache page to be allocated.]

another advantage is that this removes from the application writer the
burden of having to figure out the actual memory topology and fit the
CPU affinity to the memory affinity (and vice versa). The kernel can
figure out a good default memory affinity based on the CPU affinity mask.

(so everything so far points in the direction of having a simple CPU
affinity syscall, which we have now.)

Ingo

2002-06-19 00:15:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken


On Wed, 19 Jun 2002, Rusty Russell wrote:
>
> - new_mask &= cpu_online_map;
> + /* Eliminate offline cpus from the mask */
> + for (i = 0; i < NR_CPUS; i++)
> + if (!cpu_online(i))
> + new_mask &= ~(1<<i);
> +

And why can't cpu_online_map be a bitmap?

What's your beef against sane and efficient data structures? The above is
just crazy.

Just add a

#define NRCPUWORDS ((NR_CPUS + BITS_PER_LONG - 1) / BITS_PER_LONG)

typedef struct cpu_mask {
	unsigned long mask[NRCPUWORDS];
} cpu_mask_t;

and then add a few simple operations like

cpumask_and(cpu_mask_t * res, cpu_mask_t *a, cpu_mask_t *b);

and friends.. See how we handle this issue in <linux/signal.h>, which has
perfectly efficient ways of doing all the same things (ie see how
"sigemptyset()" and friends compile to efficient code for the "normal"
cases).
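
As a rough sketch of the kind of helpers being described (reusing the
cpu_mask_t above; these names are illustrative, not what 2.5 eventually
merged), the sigset-style operations boil down to:

static inline void cpumask_clear(cpu_mask_t *m)
{
	int i;

	for (i = 0; i < NRCPUWORDS; i++)
		m->mask[i] = 0;
}

static inline void cpumask_and(cpu_mask_t *res, cpu_mask_t *a, cpu_mask_t *b)
{
	int i;

	for (i = 0; i < NRCPUWORDS; i++)
		res->mask[i] = a->mask[i] & b->mask[i];
}

For NRCPUWORDS == 1 the compiler folds each loop into a single word
operation, which is the "normal" case referred to above.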

This is not rocket science, and I find it ridiculous that you claim to
worry about scaling up to thousands of CPU's, and then you try to send me
absolute crap like the above which clearly is unacceptable for lots of
CPU's.

No, C doesn't have built-in support for bitmap operations except on a
small scale level (ie single words), and yes, clearly that's why Linux
tends to prefer only small bitmaps, but NO, that does not make bitmaps
evil.

Linus

2002-06-19 01:20:32

by Matthew Dobson

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

diff -Nur linux-2.5.8-vanilla/include/linux/prctl.h linux-2.5.8-api/include/linux/prctl.h
--- linux-2.5.8-vanilla/include/linux/prctl.h Sun Apr 14 12:18:54 2002
+++ linux-2.5.8-api/include/linux/prctl.h Wed Apr 24 17:31:33 2002
@@ -26,4 +26,31 @@
# define PR_FPEMU_NOPRINT 1 /* silently emulate fp operations accesses */
# define PR_FPEMU_SIGFPE 2 /* don't emulate fp operations, send SIGFPE instead */

+/* Get/Set Restricted CPUs and MemBlks */
+#define PR_SET_RESTRICTED_CPUS 11
+#define PR_SET_RESTRICTED_MEMBLKS 12
+#define PR_GET_RESTRICTED_CPUS 13
+#define PR_GET_RESTRICTED_MEMBLKS 14
+
+/* Get CPU/Node */
+#define PR_GET_CPU 15
+#define PR_GET_NODE 16
+
+/* X to Node conversion functions */
+#define PR_CPU_TO_NODE 17
+#define PR_MEMBLK_TO_NODE 18
+#define PR_NODE_TO_NODE 19
+
+/* Node to X conversion functions */
+#define PR_NODE_TO_CPU 20
+#define PR_NODE_TO_MEMBLK 21
+
+/* Set CPU/MemBlk/Memory Bindings */
+#define PR_BIND_TO_CPUS 22
+#define PR_BIND_TO_MEMBLKS 23
+#define PR_BIND_MEMORY 24
+
+/* Set Launch Policy */
+#define PR_SET_LAUNCH_POLICY 25
+
#endif /* _LINUX_PRCTL_H */
diff -Nur linux-2.5.8-vanilla/kernel/sys.c linux-2.5.8-api/kernel/sys.c
--- linux-2.5.8-vanilla/kernel/sys.c Sun Apr 14 12:18:45 2002
+++ linux-2.5.8-api/kernel/sys.c Wed Apr 24 17:32:17 2002
@@ -16,6 +16,7 @@
#include <linux/highuid.h>
#include <linux/fs.h>
#include <linux/device.h>
+#include <linux/numa.h>

#include <asm/uaccess.h>
#include <asm/io.h>
@@ -1277,6 +1278,51 @@
break;
}
current->keep_capabilities = arg2;
+ break;
+ case PR_SET_RESTRICTED_CPUS:
+ error = (long) set_restricted_cpus((numa_bitmap_t)arg2, (numa_set_t *)arg3);
+ break;
+ case PR_SET_RESTRICTED_MEMBLKS:
+ error = (long) set_restricted_memblks((numa_bitmap_t)arg2, (numa_set_t *)arg3);
+ break;
+ case PR_GET_RESTRICTED_CPUS:
+ error = (long) get_restricted_cpus();
+ break;
+ case PR_GET_RESTRICTED_MEMBLKS:
+ error = (long) get_restricted_memblks();
+ break;
+ case PR_GET_CPU:
+ error = (long) get_cpu();
+ break;
+ case PR_GET_NODE:
+ error = (long) get_node();
+ break;
+ case PR_CPU_TO_NODE:
+ error = (long) cpu_to_node((int)arg2);
+ break;
+ case PR_MEMBLK_TO_NODE:
+ error = (long) memblk_to_node((int)arg2);
+ break;
+ case PR_NODE_TO_NODE:
+ error = (long) node_to_node((int)arg2);
+ break;
+ case PR_NODE_TO_CPU:
+ error = (long) node_to_cpu((int)arg2);
+ break;
+ case PR_NODE_TO_MEMBLK:
+ error = (long) node_to_memblk((int)arg2);
+ break;
+ case PR_BIND_TO_CPUS:
+ error = (long) bind_to_cpu((numa_bitmap_t)arg2, (int)arg3);
+ break;
+ case PR_BIND_TO_MEMBLKS:
+ error = (long) bind_to_memblk((numa_bitmap_t)arg2, (int)arg3);
+ break;
+ case PR_BIND_MEMORY:
+ error = (long) bind_memory((unsigned long)arg2, (size_t)arg3, (numa_bitmap_t)arg4, (int)arg5);
+ break;
+ case PR_SET_LAUNCH_POLICY:
+ error = (long) set_launch_policy((numa_bitmap_t)arg2, (int)arg3, (numa_bitmap_t)arg4, (int)arg5);
break;
default:
error = -EINVAL;


Attachments:
numa_api-arch_dep-2.5.14.patch (3.97 kB)
numa_api-arch_indep-impl-2.5.14.patch (17.43 kB)
numa_api-arch_indep-setup-2.5.14.patch (6.38 kB)
numa_api-prctl-2.5.14.patch (2.98 kB)

2002-06-19 10:22:55

by Padraig Brady

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

Cort Dougan wrote:
> I agree with you there. It's not easy, and I'd claim it's not possible
> given that no-one has done it yet, to have a select() call that is speedy
> for both 0-10 and 1k file descriptors.

Have you noticed yesterday's + today's fixup patch from Andi Kleen:
http://marc.theaimsgroup.com/?l=linux-kernel&m=102446644619648&w=2

Padraig.

2002-06-19 12:40:11

by Eric W. Biederman

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

Linus Torvalds <[email protected]> writes:

> If we end up using a default of 1024, maybe you'll have to recompile that
> part of the system that has anything to do with CPU affinity in about
> 10-20 years by just upping the number a bit. Quite frankly, that's going
> to be the _least_ of the issues.

:)

10-20 years, or until someone finds a good way to implement a single system
image on linux clusters. They are already in the 1000s-of-nodes,
dual-processors-per-node category. And as things continue they
might even grow bigger.

Eric

2002-06-19 13:44:45

by Rusty Russell

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

In message <1024433739.922.236.camel@sinai> you write:
> On Tue, 2002-06-18 at 13:31, Rusty Russell wrote:
>
> > No, you have accepted a non-portable userspace interface and put it in
> > generic code. THAT is idiotic.
> >
> > So any program that doesn't use the following is broken:
>
> On top of what Linus replied, there is the issue that if your task does
> not know how many CPUs can be in the system then setting its affinity is
> worthless in 90% of the cases.

No. You can read the cpus out of /proc/cpuinfo, and say "I want to be
on <some cpu I found>" or "I want one copy for each processor", or
even "I want every processor but the one the other task just bound
to". This is 99% of actual usage.

But I can see the man page now:

The third arg to set/getaffinity is the size of a kernel data
structure. There is no way to know this size: it is dependent
on architecture and kernel configuration. You can pass a
larger data structure and the higher bits are ignored: try
1024?

> I.e., everyone today can write code like
>
> sched_setaffinity(0, sizeof(unsigned long), &mask)

NO THEY CAN'T. How will ia64 deal with this in ia32 binaries? How
will Sparc64 deal with this in 32-bit binaries? How will PPC64 deal
with this in PPC32 binaries? How will x86_64 deal with this in x86
binaries?

They'll have to either break compatibility, or guess and fill
accordingly.

And when new CPUs come online? At the moment you effectively
zero-fill, because you can't tell what you're supposed to do here. So
you can never truly reset your affinity once it's set.

> In summary, setting CPU affinity is something that is naturally low-level
> enough that it only makes sense when you know what you are setting and not
> setting. While a mask of -1 may always make sense, random bitmaps
> (think RT stuff here) are explicit for the number of CPUs given.

You've designed an interface where the easiest thing to do is the
wrong thing (as per your example). This is the hallmark of bad
design.

*If* there had been a way to tell the bitmask size which was
introduced at the same time, it might have been acceptable. But there
isn't at the moment, so people are writing bugs right now.

Untested patch below, seems to compile (hard to tell since PPC is
v. broken right now)

Summary:
1) Easy to write portable "set this cpu" code.

2) Both system calls now handle NR_CPUS > sizeof(long)*8.

3) Things which have set affinity once can now get back on new cpus as
they come up.

4) Trivial to extend for hyperthreading on a per-arch basis.

Linus, think and apply,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

--- linux-2.5.22/include/linux/affinity.h Thu Jan 1 10:00:00 1970
+++ working-2.5.22-linus/include/linux/affinity.h Wed Jun 19 22:09:47 2002
@@ -0,0 +1,9 @@
+#ifndef _LINUX_AFFINITY_H
+#define _LINUX_AFFINITY_H
+enum {
+ /* Set affinity to these processors */
+ LINUX_AFFINITY_INCLUDE,
+ /* Set affinity to all *but* these processors */
+ LINUX_AFFINITY_EXCLUDE,
+};
+#endif
--- working-2.5.22-linus/kernel/sched.c.~1~ Tue Jun 18 23:48:03 2002
+++ working-2.5.22-linus/kernel/sched.c Wed Jun 19 23:28:32 2002
@@ -26,6 +26,7 @@
#include <linux/interrupt.h>
#include <linux/completion.h>
#include <linux/kernel_stat.h>
+#include <linux/affinity.h>

/*
* Convert user-nice values [ -20 ... 0 ... 19 ]
@@ -1309,25 +1310,57 @@
/**
* sys_sched_setaffinity - set the cpu affinity of a process
* @pid: pid of the process
+ * @include: is this include or exclude?
* @len: length in bytes of the bitmask pointed to by user_mask_ptr
- * @user_mask_ptr: user-space pointer to the new cpu mask
+ * @user_mask_ptr: user-space pointer to bitmask of cpus to include/exclude
*/
-asmlinkage int sys_sched_setaffinity(pid_t pid, unsigned int len,
- unsigned long *user_mask_ptr)
+asmlinkage int sys_sched_setaffinity(pid_t pid,
+ int include,
+ unsigned int len,
+ unsigned char *user_mask_ptr)
{
- unsigned long new_mask;
+ bitmap_member(new_mask, NR_CPUS);
task_t *p;
int retval;
+ unsigned int i;

- if (len < sizeof(new_mask))
- return -EINVAL;
-
- if (copy_from_user(&new_mask, user_mask_ptr, sizeof(new_mask)))
+ memset(new_mask, 0x00, sizeof(new_mask));
+ if (copy_from_user(new_mask, user_mask_ptr,
+ min((size_t)len, sizeof(new_mask))))
return -EFAULT;

- new_mask &= cpu_online_map;
- if (!new_mask)
+ /* longer is OK, as long as they don't actually set any of the bits. */
+ if (len > sizeof(new_mask)) {
+ unsigned char c;
+ for (i = sizeof(new_mask); i < len; i++) {
+ if (get_user(c, user_mask_ptr+i))
+ return -EFAULT;
+ if (c != 0)
+ return -ENOENT;
+ }
+ }
+
+ /* Check for cpus that aren't online/don't exist */
+ for (i = 0; i < ARRAY_SIZE(new_mask) * BITS_PER_LONG; i++) {
+ if (i >= NR_CPUS || !cpu_online(i)) {
+ if (test_bit(i, new_mask))
+ return -ENOENT;
+ }
+ }
+
+ /* Invert the mask in the exclude case. */
+ if (include == LINUX_AFFINITY_EXCLUDE) {
+ for (i = 0; i < ARRAY_SIZE(new_mask); i++)
+ new_mask[i] = ~new_mask[i];
+ } else if (include != LINUX_AFFINITY_INCLUDE) {
return -EINVAL;
+ }
+
+ /* The new mask must mention some online cpus */
+ for (i = 0; !cpu_online(i) || !test_bit(i, new_mask); i++)
+ if (i == NR_CPUS-1)
+ /* This is kinda true... */
+ return -EWOULDBLOCK;

read_lock(&tasklist_lock);

@@ -1351,7 +1384,8 @@
goto out_unlock;

retval = 0;
- set_cpus_allowed(p, new_mask);
+ /* FIXME: set_cpus_allowed should take an array... */
+ set_cpus_allowed(p, new_mask[0]);

out_unlock:
put_task_struct(p);
@@ -1363,37 +1397,27 @@
* @pid: pid of the process
* @len: length in bytes of the bitmask pointed to by user_mask_ptr
* @user_mask_ptr: user-space pointer to hold the current cpu mask
+ * Returns the size required to hold the complete cpu mask.
*/
asmlinkage int sys_sched_getaffinity(pid_t pid, unsigned int len,
- unsigned long *user_mask_ptr)
+ void *user_mask_ptr)
{
- unsigned long mask;
- unsigned int real_len;
+ bitmap_member(mask, NR_CPUS) = { 0 };
task_t *p;
- int retval;
-
- real_len = sizeof(mask);
-
- if (len < real_len)
- return -EINVAL;

read_lock(&tasklist_lock);
-
- retval = -ESRCH;
p = find_process_by_pid(pid);
- if (!p)
- goto out_unlock;
-
- retval = 0;
- mask = p->cpus_allowed & cpu_online_map;
-
-out_unlock:
+ if (!p) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ memcpy(mask, &p->cpus_allowed, sizeof(p->cpus_allowed));
read_unlock(&tasklist_lock);
- if (retval)
- return retval;
- if (copy_to_user(user_mask_ptr, &mask, real_len))
+
+ if (copy_to_user(user_mask_ptr, &mask,
+ min((unsigned)sizeof(p->cpus_allowed), len)))
return -EFAULT;
- return real_len;
+ return sizeof(p->cpus_allowed);
}

asmlinkage long sys_sched_yield(void)
@@ -1727,9 +1751,11 @@
migration_req_t req;
runqueue_t *rq;

+#if 0 /* This is checked for userspace, and kernel shouldn't do this */
new_mask &= cpu_online_map;
if (!new_mask)
BUG();
+#endif

preempt_disable();
rq = task_rq_lock(p, &flags);

2002-06-19 15:20:05

by Rusty Russell

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

In message <[email protected]> you write:
>
> On Wed, 19 Jun 2002, Rusty Russell wrote:
> >
> > - new_mask &= cpu_online_map;
> > + /* Eliminate offline cpus from the mask */
> > + for (i = 0; i < NR_CPUS; i++)
> > + if (!cpu_online(i))
> > + new_mask &= ~(1<<i);
> > +
>
> And why can't cpu_online_map be a bitmap?
>
> What's your beef against sane and efficient data structures? The above is
> just crazy.

Oh, it can be. I wasn't going to require something from all archs for
this one case (well, it was more like zero cases when I first did the
patch).

> and then add a few simple operations like
>
> cpumask_and(cpu_mask_t * res, cpu_mask_t *a, cpu_mask_t *b);

Sure... or just make all archs supply a "cpus_online_of(mask)" which
does that, unless there are other interesting cases. Or we can go the
other way and have a general "and_region(void *res, void *a, void *b,
int len)". Which one do you want?

> This is not rocket science, and I find it ridiculous that you claim to
> worry about scaling up to thousands of CPU's, and then you try to send me
> absolute crap like the above which clearly is unacceptable for lots of
> CPU's.

Spinning 1000 times doesn't faze me until someone complains.
Breaking userspace code does. One can be fixed if it proves to be a
bottleneck. Understand?

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-06-19 16:28:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken



On Thu, 20 Jun 2002, Rusty Russell wrote:
> > and then add a few simple operations like
> >
> > cpumask_and(cpu_mask_t * res, cpu_mask_t *a, cpu_mask_t *b);
>
> Sure... or just make all archs supply a "cpus_online_of(mask)" which
> does that, unless there are other interesting cases. Or we can go the
> other way and have a general "and_region(void *res, void *a, void *b,
> int len)". Which one do you want?

There are definitely other "interesting" cases that already do the full
bitwise and/or on bitmasks - see sigset_t and sigaddset/sigdelset/
sigfillset. It's really the exact same code, and the exact same issues.

The problem with a generic "and_region" is that it's a slight amount of
work to make sure that we optimize for the common cases (and since I'm not
a huge believer in hundreds of nodes, I consider the common case to be a
single word). And do things like just automatically get the UP case right:
which we do right now by just virtue of having a constant cpu_online_mask,
and letting the compiler just do the (obvious) optimizations.

I'm a _huge_ believer in having generic code that is automatically
optimized away by the compiler into nothingness. (And by contrast, I
absolutely _detest_ #ifdef's in source code that makes those optimizations
explicit). But that sometimes requires some thought, notably making sure
that all constants hang around as constants all the way to the code
generation phase (this tends to mean inline functions and #defines).

It _would_ probably be worthwhile to try to have better support for
"bitmaps" as real kernel data structures, since we actually have this
problem in multiple places. Right now we already use bitmaps for signal
handling (one or two words, constant size), for FD_SET's (variable size),
for various filesystems (variable size, largish), and for a lot of random
drivers (some variable, some constant).

It wasn't that long ago that I added a "bitmap_member()" macro to
<linux/types.h> to declare bitmaps exactly because a lot of people _were_
doing it and getting it wrong. Actually, the most common case was not a
bug, but a latent problem with code that did something like

unsigned char bitmap[BITMAP_SIZE/8];

which works on x86 as long as the bitmap size was a multiple of 8.

It would probably make sense to make a real <linux/bitmap.h>, move the
bitmap_member() there (and rename to "bitmap_declare()" - it's called
member because all the places I first looked at were structure members),
and add some simple generic routines for handling these things.

(We've obviously had the bit_set/clear/test() stuff forever, but the more
involved stuff should be fairly easy to abstract out too, instead of
having special functions for signal masks).
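
The macro in question - written here with the bitmap_declare() name
suggested above, and appearing as DECLARE_BITMAP() in Rusty's patch further
down - simply sizes the array in longs, rounded up, so the word-based bit
operations never run off the end:

#define bitmap_declare(name, bits) \
	unsigned long name[((bits) + BITS_PER_LONG - 1) / BITS_PER_LONG]

/* e.g. a CPU mask that is always a whole number of longs: */
bitmap_declare(cpu_bitmap, 1024);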

> Breaking userspace code does. One can be fixed if it proves to be a
> bottleneck. Understand?

What I don't understand is why you don't accept the fact that these
things can be considered infinitely big. There's nothing fundamentally
wrong with static allocation.

People who build thousand-node systems _are_ going to compile their own
distribution. Trust me. They aren't just going to slap down redhat-7.3 on
a 16k-node ASCI Purple. It makes no sense to do that. They may want to run
quake or something standard on it without recompiling, but especially the
maintenance stuff - the stuff which cares about CPU affinity - is a
nobrainer.

So you can easily just accept the fact that at some point the max number
of CPU's can be considered fixed. And that "some point" isn't even very
high, especially since bitmaps _are_ so dense that there is basically no
overhead to just starting out with

#define MAX_CPU (1024)

bitmap_declare(cpu_bitmap, MAX_CPU);

and let it be at that. That 1024 is already ridiculously high, in my
opinion - simply because people who are playing with bigger numbers _are_
going to be able to just increase the number and recompile.

Linus

2002-06-19 17:27:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken



On 19 Jun 2002, Eric W. Biederman wrote:
>
> 10-20 years or someone finds a good way to implement a single system
> image on linux clusters. They are already into the 1000s of nodes,
> and dual processors per node category. And as things continue they
> might even grow bigger.

Oh, clusters are a separate issue. I'm absolutely 100% convinced that you
don't want to have a "single kernel" for a cluster, you want to run
independent kernels with good communication infrastructure between them
(ie global filesystem, and try to make the networking look uniform).

Trying to have a single kernel for thousands of nodes is just crazy. Even
if the system were ccNuma and _could_ do it in theory.

The NuMA work can probably take single-kernel to maybe 64+ nodes, before
people just start turning stark raving mad. There's no way you'll have
single-kernel for thousands of CPU's, and still stay sane and claim any
reasonable performance under generic loads.

So don't confuse the issue with clusters like that. The "set_affinity()"
call simply doesn't have anything to do with them. If you want to move
processes between nodes on such a cluster, you'll probably need user-level
help, the kernel is unlikely to do it for you.

Linus

2002-06-19 20:53:07

by Rusty Russell

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

In message <[email protected]> you write:
>
>
> On Thu, 20 Jun 2002, Rusty Russell wrote:
> > > and then add a few simple operations like
> > >
> > > cpumask_and(cpu_mask_t * res, cpu_mask_t *a, cpu_mask_t *b);
> >
> > Sure... or just make all archs supply a "cpus_online_of(mask)" which
> > does that, unless there are other interesting cases. Or we can go the
> > other way and have a general "and_region(void *res, void *a, void *b,
> > int len)". Which one do you want?
>
> There are definitely other "interesting" cases that already do the full
> bitwise and/or on bitmasks - see sigset_t and sigaddset/sigdelset/
> sigfillset. It's really the exact same code, and the exact same issues.
>
> The problem with a generic "and_region" is that it's a slight amount of
> work to make sure that we optimize for the common cases (and since I'm not
> a huge believer in hundreds of nodes, I consider the common case to be a
> single word). And do things like just automatically get the UP case right:
> which we do right now by just virtue of having a constant cpu_online_mask,
> and letting the compiler just do the (obvious) optimizations.

Sure, completely agreed.

Normal tricks here: the 1-long case turns into the equivalent of dst = a & b;
the other cases are handled with varying amounts of suckiness. Code and
optimization tested on 2.95.4 and 3.0.4 (both PPC), kernel compiled on
my x86 box back in .au.

> It would probably make sense to make a real <linux/bitmap.h>, move the
> bitmap_member() there (and rename to "bitmap_declare()" - it's called
> member because all the places I first looked at were structure members),
> and add some simple generic routines for handling these things.

I renamed it to DECLARE_BITMAP() to match list, mutex et al. and moved
it to linux/bitops.h.

PS. Please sort out merging with Paulus's stuff: I'd like to compile
on PPC soon since I'm laptop-only for two more weeks 8)
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/include/linux/bitops.h working-2.5.23-bitops/include/linux/bitops.h
--- linux-2.5.23/include/linux/bitops.h Fri Jun 7 13:59:07 2002
+++ working-2.5.23-bitops/include/linux/bitops.h Thu Jun 20 06:55:51 2002
@@ -2,6 +2,27 @@
#define _LINUX_BITOPS_H
#include <asm/bitops.h>

+#define DECLARE_BITMAP(name,bits) \
+ unsigned long name[((bits)+BITS_PER_LONG-1)/BITS_PER_LONG]
+
+#ifndef HAVE_ARCH_AND_REGION
+void __and_region(unsigned long num, unsigned char *dst,
+ const unsigned char *a, const unsigned char *b);
+#endif
+
+/* For the moment, handle 1 long case fast, leave rest to __and_region. */
+#define and_region(num,dst,a,b) \
+do { \
+ if (__alignof__(*(a)) == __alignof__(long) \
+ && __alignof__(*(b)) == __alignof__(long) \
+ && __builtin_constant_p(num) \
+ && (num) == sizeof(long)) { \
+ *((unsigned long *)(dst)) = \
+ (*(unsigned long *)(a) & *(unsigned long *)(b)); \
+ } else \
+ __and_region((num), (void*)(dst), (void*)(a), (void*)(b)); \
+} while(0)
+
/*
* ffs: find first bit set. This is defined the same way as
* the libc and compiler builtin ffs routines, therefore
@@ -106,8 +127,5 @@
res = (res & 0x33) + ((res >> 2) & 0x33);
return (res & 0x0F) + ((res >> 4) & 0x0F);
}
-
-#include <asm/bitops.h>
-

#endif
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/include/linux/types.h working-2.5.23-bitops/include/linux/types.h
--- linux-2.5.23/include/linux/types.h Mon Jun 17 23:19:25 2002
+++ working-2.5.23-bitops/include/linux/types.h Thu Jun 20 06:14:39 2002
@@ -3,9 +3,6 @@

#ifdef __KERNEL__
#include <linux/config.h>
-
-#define bitmap_member(name,bits) \
- unsigned long name[((bits)+BITS_PER_LONG-1)/BITS_PER_LONG]
#endif

#include <linux/posix_types.h>
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/include/sound/ac97_codec.h working-2.5.23-bitops/include/sound/ac97_codec.h
--- linux-2.5.23/include/sound/ac97_codec.h Mon Jun 17 23:19:25 2002
+++ working-2.5.23-bitops/include/sound/ac97_codec.h Thu Jun 20 06:31:35 2002
@@ -25,6 +25,7 @@
*
*/

+#include <linux/bitops.h>
#include "control.h"
#include "info.h"

@@ -160,7 +161,7 @@
unsigned int rates_mic_adc;
unsigned int spdif_status;
unsigned short regs[0x80]; /* register cache */
- bitmap_member(reg_accessed, 0x80); /* bit flags */
+ DECLARE_BITMAP(reg_accessed, 0x80); /* bit flags */
union { /* vendor specific code */
struct {
unsigned short unchained[3]; // 0 = C34, 1 = C79, 2 = C69
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/kernel/Makefile working-2.5.23-bitops/kernel/Makefile
--- linux-2.5.23/kernel/Makefile Mon Jun 10 16:03:56 2002
+++ working-2.5.23-bitops/kernel/Makefile Thu Jun 20 06:27:29 2002
@@ -10,12 +10,12 @@
O_TARGET := kernel.o

export-objs = signal.o sys.o kmod.o context.o ksyms.o pm.o exec_domain.o \
- printk.o platform.o suspend.o
+ printk.o platform.o suspend.o bitops.o

obj-y = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
module.o exit.o itimer.o time.o softirq.o resource.o \
sysctl.o capability.o ptrace.o timer.o user.o \
- signal.o sys.o kmod.o context.o futex.o platform.o
+ signal.o sys.o kmod.o context.o futex.o platform.o bitops.o

obj-$(CONFIG_UID16) += uid16.o
obj-$(CONFIG_MODULES) += ksyms.o
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/kernel/bitops.c working-2.5.23-bitops/kernel/bitops.c
--- linux-2.5.23/kernel/bitops.c Thu Jan 1 10:00:00 1970
+++ working-2.5.23-bitops/kernel/bitops.c Thu Jun 20 06:52:29 2002
@@ -0,0 +1,32 @@
+#include <linux/config.h>
+#include <linux/bitops.h>
+#include <linux/module.h>
+
+#ifndef HAVE_ARCH_AND_REGION
+/* Generic is fairly stupid: archs should optimize properly. */
+void __and_region(unsigned long num, unsigned char *dst,
+ const unsigned char *a, const unsigned char *b)
+{
+ unsigned long i;
+
+ /* Copy first bytes, until one is long aligned. */
+ for (i = 0; i < num && ((unsigned long)a+i) % __alignof__(long); i++)
+ dst[i] = (a[i] & b[i]);
+
+ /* If they are all aligned, do long-at-a-time copy */
+ if (((unsigned long)b+i)%__alignof__(long) == 0
+ && ((unsigned long)dst+i)%__alignof__(long) == 0) {
+ for (; i + sizeof(long) <= num; i += sizeof(long)) {
+ *(unsigned long *)(dst+i)
+ = (*(unsigned long *)(a+i)
+ & *(unsigned long *)(b+i));
+ }
+ }
+
+ /* Do whatever is left. */
+ for (; i < num; i++)
+ dst[i] = (a[i] & b[i]);
+}
+
+EXPORT_SYMBOL(__and_region);
+#endif
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/sound/core/seq/seq_clientmgr.h working-2.5.23-bitops/sound/core/seq/seq_clientmgr.h
--- linux-2.5.23/sound/core/seq/seq_clientmgr.h Mon Jun 17 23:19:26 2002
+++ working-2.5.23-bitops/sound/core/seq/seq_clientmgr.h Thu Jun 20 06:34:16 2002
@@ -53,8 +53,8 @@
char name[64]; /* client name */
int number; /* client number */
unsigned int filter; /* filter flags */
- bitmap_member(client_filter, 256);
- bitmap_member(event_filter, 256);
+ DECLARE_BITMAP(client_filter, 256);
+ DECLARE_BITMAP(event_filter, 256);
snd_use_lock_t use_lock;
int event_lost;
/* ports */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.23/sound/core/seq/seq_queue.h working-2.5.23-bitops/sound/core/seq/seq_queue.h
--- linux-2.5.23/sound/core/seq/seq_queue.h Mon Jun 17 23:19:26 2002
+++ working-2.5.23-bitops/sound/core/seq/seq_queue.h Thu Jun 20 06:34:11 2002
@@ -26,6 +26,7 @@
#include "seq_lock.h"
#include <linux/interrupt.h>
#include <linux/list.h>
+#include <linux/bitops.h>

#define SEQ_QUEUE_NO_OWNER (-1)

@@ -51,7 +52,7 @@
spinlock_t check_lock;

/* clients which uses this queue (bitmap) */
- bitmap_member(clients_bitmap,SNDRV_SEQ_MAX_CLIENTS);
+ DECLARE_BITMAP(clients_bitmap,SNDRV_SEQ_MAX_CLIENTS);
unsigned int clients; /* users of this queue */
struct semaphore timer_mutex;

2002-06-19 23:50:19

by Michael Hohnbaum

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Tue, 2002-06-18 at 16:57, Ingo Molnar wrote:
>
> On 18 Jun 2002, Michael Hohnbaum wrote:
>
> > [...] I would suggest an additional argument be added
> > which would indicate the resource that the process is to be
> > affined to. That way this interface could be used for binding
> > processes to cpus, memory nodes, perhaps NUMA nodes, and,
> > as discussed recently in another thread, other processes.
> > Personally, I see NUMA nodes as an overkill, if a process
> > can be bound to cpus and memory nodes.
>
> are you sure we want one generic, process-based affinity interface?

No, I'm not sure that is what we want. I see that as a compromise
solution. Something that would allow some of the simple binding
capabilities, but not necessarily a full blown solution.

I agree with your comments below that memory binding/allocation is
much more complex than CPU binding, so additional flexibility in
specifying memory binding is needed. However, wanting to start
simple, the first step is to affine a process to memory on one or
more nodes.


> i think the affinity to certain memory regions might need to be more
> finegrained than this. Eg. it could be useful to define a per-file
> (per-inode) 'backing store memory node' that the file is affine to. This
> will eg. cause the pagecache to be allocated in the memory node.
> Process-based affinity does not describe this in a natural way. Another
> example, memory maps: we might want to have a certain memory map (vma)
> allocated in a given memory node, independently of where the process that
> is faulting a given pages resides.
>
> and it might certainly make sense to have some sort of 'default memory
> affinity' for a process as well, but this should be a different syscall -

This is close to what is currently implemented - memory is allocated,
by default, on the node that the process is executing on when the request
for memory is made. Even if a process is affined to multiple CPUs that
span node boundaries, it is performant to dispatch the process on only
one node (provided the CPU cycles are available). The NUMA extensions
to the scheduler try to do this. Similarly, all memory for a process
should be allocated from that one node. If memory is exhausted on
that node, any other nodes that the process has CPU affinity to
should then be used. In other words, each process should have a home
node that is preferred for dispatch and memory allocation. The process
may have affinity to other nodes, which would be used only if the home
quad had a significant resource shortage.
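
A toy sketch of that policy - try the home node first, then fall back to the other nodes the process has CPU affinity to. The node count, free-page table and names below are invented for illustration.

#include <stdio.h>

#define NR_NODES 4

static long free_pages[NR_NODES] = { 0, 3, 5, 2 };      /* toy state */

/* Return the node to allocate from, or -1 on a global shortage. */
static int pick_node(int home_node, unsigned int allowed_mask)
{
        int node;

        if (free_pages[home_node] > 0)
                return home_node;               /* preferred: the home node */

        for (node = 0; node < NR_NODES; node++) {
                if (node == home_node || !(allowed_mask & (1u << node)))
                        continue;
                if (free_pages[node] > 0)
                        return node;            /* fallback: other allowed node */
        }
        return -1;
}

int main(void)
{
        /* Home node 0 is exhausted; the process also has CPU affinity to node 2. */
        printf("allocate from node %d\n", pick_node(0, 0x5));
        return 0;
}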

> it really does a much different thing than CPU affinity. The CPU resource
> is 'used' only temporarily with little footprint, while memory usage is
> often for a very long timespan, and the affinity strategies differ
> greatly. Also, memory as a resource is much more complex than CPU, eg. it
> must handle things like over-allocation, fallback to 'nearby' nodes if a
> node is full, etc.
>
> so i'd suggest to actually create a good memory-affinity syscall interface
> instead of trying to generalize it into the simple, robust, finite
> CPU-affinity syscalls.

We have attempted to do that. Please look at the API definition
http://lse.sourceforge.net/numa/numa_api.html If it would help,
we could break out just the memory portion of this API (both in the
specification and the implementation) and submit those for comment.
What do you think?
>
> Ingo
>

Michael Hohnbaum
[email protected]



2002-06-20 04:07:44

by Eric W. Biederman

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

Linus Torvalds <[email protected]> writes:

> On 19 Jun 2002, Eric W. Biederman wrote:
> >
> > 10-20 years or someone finds a good way to implement a single system
> > image on linux clusters. They are already into the 1000s of nodes,
> > and dual processors per node category. And as things continue they
> > might even grow bigger.
>
> Oh, clusters are a separate issue. I'm absolutely 100% convinced that you
> don't want to have a "single kernel" for a cluster, you want to run
> independent kernels with good communication infrastructure between them
> (ie global filesystem, and try to make the networking look uniform).
>
> Trying to have a single kernel for thousands of nodes is just crazy. Even
> if the system were ccNuma and _could_ do it in theory.

I totally agree; mostly I was playing devil's advocate. The model
actually in my head is one where you have multiple kernels but they talk
well enough that the applications only have to care in areas where it
doesn't make a performance difference (there's got to be one of those).

> The NuMA work can probably take single-kernel to maybe 64+ nodes, before
> people just start turning stark raving mad. There's no way you'll have
> single-kernel for thousands of CPU's, and still stay sane and claim any
> reasonable performance under generic loads.
>
> So don't confuse the issue with clusters like that. The "set_affinity()"
> call simply doesn't have anything to do with them. If you want to move
> processes between nodes on such a cluster, you'll probably need user-level
> help, the kernel is unlikely to do it for you.

Agreed.

The compute cluster problem is an interesting one. The big items
I see on the todo list are:

- Scalable fast distributed file system (Lustre looks like a
possibility)
- Sub-application-level checkpointing.

Services like schedulers already exist.

Basically the job of a cluster scheduler gets much easier, and the
scheduler more powerful, once it gets the ability to suspend jobs.
Checkpointing buys three things: the ability to preempt jobs, the
ability to migrate processes, and the ability to recover from failed
nodes (assuming the failed hardware didn't corrupt your job's
checkpoint).

Once solutions to the cluster problems become well understood I
wouldn't be surprised if some of the supporting services started to
live in the kernel like nfsd. Parts of the distributed filesystem
certainly will.

I suspect process checkpointing and restoring will evolve something
like pthread support, with some code in user space and some generic
helpers in the kernel as clean pieces of the job can be broken off.
The challenge is only how to save/restore interprocess
communications. Things like moving a TCP connection from one node to
another are interesting problems.

But also I suspect most of the hard problems that we need kernel help
with can have uses independent of checkpointing. Already we have web
server farms that spread connections to a single IP across nodes.

Eric

2002-06-20 05:24:48

by Larry McVoy

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

> I totally agree; mostly I was playing devil's advocate. The model
> actually in my head is one where you have multiple kernels but they talk
> well enough that the applications only have to care in areas where it
> doesn't make a performance difference (there's got to be one of those).

....

> The compute cluster problem is an interesting one. The big items
> I see on the todo list are:
>
> - Scalable fast distributed file system (Lustre looks like a
> possibility)
> - Sub-application-level checkpointing.
>
> Services like schedulers already exist.
>
> Basically the job of a cluster scheduler gets much easier, and the
> scheduler more powerful, once it gets the ability to suspend jobs.
> Checkpointing buys three things: the ability to preempt jobs, the
> ability to migrate processes, and the ability to recover from failed
> nodes (assuming the failed hardware didn't corrupt your job's
> checkpoint).
>
> Once solutions to the cluster problems become well understood I
> wouldn't be surprised if some of the supporting services started to
> live in the kernel like nfsd. Parts of the distributed filesystem
> certainly will.

http://www.bitmover.com/cc-pitch

I've been trying to get Linus to listen to this for years and he keeps
on flogging the tired SMP horse instead. DEC did it and Sun has been
passing around these slides for a few weeks, so maybe they'll do it too.
Then Linux can join the party after it has become a fine grained,
locked to hell and back, soft "realtime", numa enabled, bloated piece
of crap like all the other kernels and we'll get to go through the
"let's reinvent Unix for the 3rd time in 40 years" all over again.
What fun. Not.

Sorry to be grumpy, go read the slides, I'll be at OLS, I'd be happy
to talk it over with anyone who wants to think about it. Paul McKenney
from IBM came down to San Francisco to talk to me about it, put me
through an 8 or 9 hour session which felt like a PhD exam, and
after trying to poke holes in it grudgingly let on that maybe it was
a good idea. He was kind enough to write up what he took away
from it; here it is.

--lm

From: "Paul McKenney" <[email protected]>
To: [email protected], [email protected]
Subject: Greatly enjoyed our discussion yesterday!
Date: Fri, 9 Nov 2001 18:48:56 -0800

Hello!

I greatly enjoyed our discussion yesterday! Here are the pieces of it that
I recall, I know that you will not be shy about correcting any errors and
omissions.

Thanx, Paul

Larry McVoy's SMP Clusters

Discussion on November 8, 2001

Larry McVoy, Ted Ts'o, and Paul McKenney


What is SMP Clusters?

SMP Clusters is a method of partitioning an SMP (symmetric
multiprocessing) machine's CPUs, memory, and I/O devices
so that multiple "OSlets" run on this machine. Each OSlet
owns and controls its partition. A given partition is
expected to contain from 4-8 CPUs, its share of memory,
and its share of I/O devices. A machine large enough to
have SMP Clusters profitably applied is expected to have
enough of the standard I/O adapters (e.g., ethernet,
SCSI, FC, etc.) so that each OSlet would have at least
one of each.

Each OSlet has the same data structures that an isolated
OS would have for the same amount of resources. Unless
interactions with the other OSlets are required, an OSlet runs
very nearly the same code over very nearly the same data
as would a standalone OS.

Although each OSlet is in most ways its own machine, the
full set of OSlets appears as one OS to any user programs
running on any of the OSlets. In particular, processes on
one OSlet can share memory with processes on other OSlets,
can send signals to processes on other OSlets, communicate
via pipes and Unix-domain sockets with processes on other
OSlets, and so on. Performance of operations spanning
multiple OSlets may be somewhat slower than operations local
to a single OSlet, but the difference will not be noticeable
except to users who are engaged in careful performance
analysis.

The goals of the SMP Cluster approach are:

1. Allow the core kernel code to use simple locking designs.
2. Present applications with a single-system view.
3. Maintain good (linear!) scalability.
4. Not degrade the performance of a single CPU beyond that
of a standalone OS running on the same resources.
5. Minimize modification of core kernel code. Modified or
rewritten device drivers, filesystems, and
architecture-specific code is permitted, perhaps even
encouraged. ;-)


OS Boot

Early-boot code/firmware must partition the machine, and prepare
tables for each OSlet that describe the resources that each
OSlet owns. Each OSlet must be made aware of the existence of
all the other OSlets, and will need some facility to allow
efficient determination of which OSlet a given resource belongs
to (for example, to determine which OSlet a given page is owned
by).

At some point in the boot sequence, each OSlet creates a "proxy
task" for each of the other OSlets that provides shared services
to them.

Issues:

1. Some systems may require device probing to be done
by a central program, possibly before the OSlets are
spawned. Systems that react in an unfriendly manner
to failed probes might be in this class.

2. Interrupts must be set up very carefully. On some
systems, the interrupt system may constrain the ways
in which the system is partitioned.


Shared Operations

This section describes some possible implementations and issues
with a number of the shared operations.

Shared operations include:

1. Page fault on memory owned by some other OSlet.
2. Manipulation of processes running on some other OSlet.
3. Access to devices owned by some other OSlet.
4. Reception of network packets intended for some other OSlet.
5. SysV msgq and sema operations on msgq and sema objects
accessed by processes running on multiple of the OSlets.
6. Access to filesystems owned by some other OSlet. The
/tmp directory gets special mention.
7. Pipes connecting processes in different OSlets.
8. Creation of processes that are to run on a different
OSlet than their parent.
9. Processing of exit()/wait() pairs involving processes
running on different OSlets.

Page Fault

As noted earlier, each OSlet maintains a proxy process
for each other OSlet (so that for an SMP Cluster made
up of N OSlets, there are N*(N-1) proxy processes).

When a process in OSlet A wishes to map a file
belonging to OSlet B, it makes a request to B's proxy
process corresponding to OSlet A. The proxy process
maps the desired file and takes a page fault at the
desired address (translated as needed, since the file
will usually not be mapped to the same location in the
proxy and client processes), forcing the page into
OSlet B's memory. The proxy process then passes the
corresponding physical address back to the client
process, which maps it.
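
A toy model of that round trip, with invented structures and numbers purely to make the flow concrete (no kernel interfaces are shown):

#include <stdio.h>

struct fault_request {
        int client_oslet;       /* OSlet A, where the faulting process runs */
        long file_id;           /* file owned by OSlet B                    */
        long offset;            /* page-aligned offset within the file      */
};

/* Runs "inside" OSlet B's proxy task for OSlet A: map the file, fault the
 * page into B's memory, and hand back a physical frame number. */
static long proxy_handle_fault(const struct fault_request *req)
{
        return 0x1000 + req->file_id * 0x100 + req->offset / 4096;
}

int main(void)
{
        struct fault_request req = { .client_oslet = 0, .file_id = 7, .offset = 8192 };
        long frame = proxy_handle_fault(&req);

        /* OSlet A now maps this frame into the faulting process. */
        printf("OSlet A maps physical frame %#lx\n", frame);
        return 0;
}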

Issues:

o How to coordinate pageout? Two approaches:

1. Use mlock in the proxy process so that
only the client process can do the pageout.

2. Make the two OSlets coordinate their
pageouts. This is more complex, but will
be required in some form or another to
prevent OSlets from "ganging up" on one
of their number, exhausting its memory.

o When OSlet A ejects the memory from its working
set, where does it put it?

1. Throw it away, and go to the proxy process
as needed to get it back.

2. Augment core VM as needed to track the
"guest" memory. This may be needed for
performance, but...

o Some code is required in the pagein() path to
figure out that the proxy must be used.

1. Larry stated that he is willing to be
punched in the nose to get this code in. ;-)
The amount of this code is minimized by
creating SMP-clusters-specific filesystems,
which have their own functions for mapping
and releasing pages. (Does this really
cover OSlet A's paging out of this memory?)

o How are pagein()s going to be even halfway fast
if IPC to the proxy is involved?

1. Just do it. Page faults should not be
all that frequent with today's memory
sizes. (But then why do we care so
much about page-fault performance???)

2. Use "doors" (from Sun), which are very
similar to protected procedure call
(from K42/Tornado/Hurricane). The idea
is that the CPU in OSlet A that is handling
the page fault temporarily -becomes- a
member of OSlet B by using OSlet B's page
tables for the duration. This results in
some interesting issues:

a. What happens if a process wants to
block while "doored"? Does it
switch back to being an OSlet A
process?

b. What happens if a process takes an
interrupt (which corresponds to
OSlet A) while doored (thus using
OSlet B's page tables)?

i. Prevent this by disabling
interrupts while doored.
This could pose problems
with relatively long VM
code paths.

ii. Switch back to OSlet A's
page tables upon interrupt,
and switch back to OSlet B's
page tables upon return
from interrupt. On machines
not supporting ASID, take a
TLB-flush hit in both
directions. Also likely
requires common text (at
least for low-level interrupts)
for all OSlets, making it more
difficult to support OSlets
running different versions of
the OS.

Furthermore, the last time
that Paul suggested adding
instructions to the interrupt
path, several people politely
informed him that this would
require a nose punching. ;-)

c. If a bunch of OSlets simultaneously
decide to invoke their proxies on
a particular OSlet, that OSlet gets
lock contention corresponding to
the number of CPUs on the system
rather than to the number in a
single OSlet. Some approaches to
handle this:

i. Stripe -everything-, rely
on entropy to save you.
May still have problems with
hotspots (e.g., which of the
OSlets has the root of the
root filesystem?).

ii. Use some sort of queued lock
to limit the number of CPUs that
can be running proxy processes
in a given OSlet. This does
not really help scaling, but
would make the contention
less destructive to the
victim OSlet.

o How to balance memory usage across the OSlets?

1. Don't bother, let paging deal with it.
Paul's previous experience with this
philosophy was not encouraging. (You
can end up with one OSlet thrashing
due to the memory load placed on it by
other OSlets, which don't see any
memory pressure.)

2. Use some global memory-pressure scheme
to even things out. Seems possible,
Paul is concerned about the complexity
of this approach. If this approach is
taken, make sure someone with some
control-theory experience is involved.


Manipulation of Processes Running on Some Other OSlet.

The general idea here is to implement something similar
to a vproc layer. This is common code, and thus requires
someone to sacrifice their nose. There was some discussion
of other things that this would be useful for, but I have
lost them.

Manipulations discussed included signals and job control.

Issues:

o Should process information be replicated across
the OSlets for performance reasons? If so, how
much, and how to synchronize.

1. No, just use doors. See above discussion.

2. Yes. No discussion of synchronization
methods. (Hey, we had to leave -something-
for later!)

Access to Devices Owned by Some Other OSlet

Larry mentioned a /rdev, but if we discussed any details
of this, I have lost them. Presumably, one would use some
sort of IPC or doors to make this work.

Reception of Network Packets Intended for Some Other OSlet.

An OSlet receives a packet, and realizes that it is
destined for a process running in some other OSlet.
How is this handled without rewriting most of the
networking stack?

The general approach was to add a NAT-like layer that
inspected the packet and determined which OSlet it was
destined for. The packet was then forwarded to the
correct OSlet, and subjected to full IP-stack processing.
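
A toy sketch of that classification step - look up which OSlet owns the packet's destination address and hand the packet to that OSlet's stack. The table and names are invented for illustration.

#include <stdint.h>
#include <stdio.h>

#define NR_OSLETS 4

/* Toy mapping from local IP address to owning OSlet. */
static const uint32_t oslet_addr[NR_OSLETS] = {
        0x0a000001, 0x0a000002, 0x0a000003, 0x0a000004  /* 10.0.0.1 .. 10.0.0.4 */
};

/* Return the OSlet whose stack should process this packet, or -1. */
static int classify(uint32_t daddr)
{
        int i;

        for (i = 0; i < NR_OSLETS; i++)
                if (oslet_addr[i] == daddr)
                        return i;
        return -1;
}

int main(void)
{
        printf("packet for 10.0.0.3 -> OSlet %d\n", classify(0x0a000003));
        return 0;
}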

Issues:

o If the address map in the kernel is not to be
manipulated on each packet reception, there
needs to be a circular buffer in each OSlet for
each of the other OSlets (again, N*(N-1) buffers).
In order to prevent the buffer from needing to
be exceedingly large, packets must be bcopy()ed
into this buffer by the OSlet that received
the packet, and then bcopy()ed out by the OSlet
containing the target process. This could add
a fair amount of overhead.

1. Just accept the overhead. Rely on this
being an uncommon case (see the next issue).

2. Come up with some other approach, possibly
involving the user address space of the
proxy process. We could not articulate
such an approach, but it was late and we
were tired.

o If there are two processes that share the FD
on which the packet could be received, and these
two processes are in two different OSlets, and
neither is in the OSlet that received the packet,
what the heck do you do???

1. Prevent this from happening by refusing
to allow processes holding a TCP connection
open to move to another OSlet. This could
result in load-balance problems in some
workloads, though neither Paul nor Ted were
able to come up with a good example on the
spot (seeing as BAAN has not been doing really
well of late).

To indulge in l'esprit d'escalier... How
about a timesharing system that users
access from the network? A single user
would have to log on twice to run a job
that consumed more than one OSlet if each
process in the job might legitimately need
access to stdin.

2. Do all protocol processing on the OSlet
on which the packet was received, and
straighten things out when delivering
the packet data to the receiving process.
This likely requires changes to common
code, hence someone to volunteer their nose.


SysV msgq and sema Operations

We didn't discuss these. None of us seem to be SysV fans,
but these must be made to work regardless.

Larry says that shm should be implemented in terms of mmap(),
so that this case reduces to page-mapping discussed above.
Of course, one would need a filesystem large enough to handle
the largest possible shmget. Paul supposes that one could
dynamically create a memory filesystem to avoid problems here,
but is in no way volunteering his nose to this cause.
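
A minimal userspace sketch of the "shm in terms of mmap()" idea: a named memory object is created and mapped MAP_SHARED, so a second process mapping the same name sees the same pages. POSIX shm_open() is used here purely as a convenient backing object for illustration; real SysV shmget() semantics would need more glue.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        const size_t len = 4096;
        int fd = shm_open("/demo_seg", O_CREAT | O_RDWR, 0600);
        char *p;

        if (fd < 0 || ftruncate(fd, len) < 0)
                return 1;

        /* Any process that maps "/demo_seg" MAP_SHARED sees these bytes. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        strcpy(p, "hello from one process");
        printf("%s\n", p);

        munmap(p, len);
        close(fd);
        shm_unlink("/demo_seg");
        return 0;
}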


Access to Filesystems Owned by Some Other OSlet.

For the most part, this reduces to the mmap case. However,
partitioning popular filesystems over the OSlets could be
very helpful. Larry mentioned that this had been prototyped.
Paul cannot remember if Larry promised to send papers or
other documentation, but duly requests them after the fact.

Larry suggests having a local /tmp, so that /tmp is in effect
private to each OSlet. There would be a /gtmp that would
be a globally visible /tmp equivalent. We went round and
round on software compatibility, Paul suggesting a hashed
filesystem as an alternative. Larry eventually pointed out
that one could just issue different mount commands to get
a global filesystem in /tmp, and create a per-OSlet /ltmp.
This would allow people to determine their own level of
risk/performance.


Pipes Connecting Processes in Different OSlets.

This was mentioned, but I have forgotten the details.
My vague recollections lead me to believe that some
nose-punching was required, but I must defer to Larry
and Ted.

Ditto for Unix-domain sockets.


Creation of Processes on a Different OSlet Than Their Parent.

There would be an inherited attribute that would prevent
fork() or exec() from creating its child on a different
OSlet. This attribute would be set by default to prevent
too many surprises. Things like make(1) would clear
this attribute to allow amazingly fast kernel builds.

There would also be a system call that would cause the
child to be placed on a specified OSlet (Paul suggested
use of HP's "launch policy" concept to avoid adding yet
another dimension to the exec() combinatorial explosion).

The discussion of packet reception led Larry to suggest
that cross-OSlet process creation would be prohibited if
the parent and child shared a socket. See above for the
load-balancing concern and corresponding l'esprit d'escalier.


Processing of exit()/wait() Pairs Crossing OSlet Boundaries

We didn't discuss this. My guess is that vproc deals
with it. Some care is required when optimizing for this.
If one hands off to a remote parent that dies before
doing a wait(), one would not want one of the init
processes getting a nasty surprise.

(Yes, there are separate init processes for each OSlet.
We did not talk about implications of this, which might
occur if one were to need to send a signal intended to
be received by all the replicated processes.)


Other Desiderata:

1. Ability of surviving OSlets to continue running after one of their
number fails.

Paul was quite skeptical of this. Larry suggested that the
"door" mechanism could use a dynamic-linking strategy. Paul
remained skeptical. ;-)

2. Ability to run different versions of the OS on different OSlets.

Some discussion of this above.


The Score.

Paul agreed that SMP Clusters could be implemented. He was not
sure that it could achieve good performance, but could not prove
otherwise. Although he suspected that the complexity might be
less than that of the proprietary highly parallel Unixes, he was not
convinced that it would be less than Linux's would be, given the
Linux community's emphasis on simplicity in addition to performance.

--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-06-20 07:28:34

by Andreas Dilger

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Jun 19, 2002 22:24 -0700, Larry McVoy wrote:
> Linus Torvalds <[email protected]> writes:
> > The compute cluster problem is an interesting one. The big items
> > I see on the todo list are:
> >
> > - Scalable fast distributed file system (Lustre looks like a
> > possibility)

Well, I can speak to this a little bit... Given Lustre's ext3
underpinnings, we have been thinking of some interesting methods
by which we could take an existing ext3 filesystem on a disk and
"clusterify" it (i.e. have distributed coherency across multiple
clients). This would be perfectly suited for application on a
CC cluster.

Given that the network communication protocols are also abstracted
out from the Lustre core, it would probably be trivial for someone
with network/VM experience to write a "no-op" networking layer
which basically did little more than passing around page addresses
and faulting the right pages into each OSlet. The protocol design
is already set up to handle direct DMA between client and storage
target, and a CC cluster could also do away with the actual copy
involved in the DMA. We can already do "zero copy" I/O between
user-space and a remote disk with O_DIRECT and the right network
hardware (which does direct DMA from one node to another).
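
For reference, the userspace side of such an O_DIRECT transfer looks roughly like this: the file is opened with O_DIRECT and the buffer (and length) must be suitably aligned. The 512-byte alignment below is an assumption; the real requirement depends on the device and filesystem.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        void *buf;
        ssize_t n;
        int fd;

        if (argc < 2)
                return 1;

        /* O_DIRECT asks the kernel to bypass the page cache for this fd. */
        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* The buffer must be aligned for direct I/O. */
        if (posix_memalign(&buf, 512, 4096)) {
                close(fd);
                return 1;
        }

        n = read(fd, buf, 4096);
        printf("read %zd bytes without going through the page cache\n", n);

        free(buf);
        close(fd);
        return 0;
}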

> "Paul McKenney" <[email protected]> writes:
> Access to Devices Owned by Some Other OSlet
>
> Larry mentioned a /rdev, but if we discussed any details
> of this, I have lost them. Presumably, one would use some
> sort of IPC or doors to make this work.

I would just make access to remote devices act like NBD or something,
and have similar "network/proxy" kernel drivers to all "remote" devices.
At boot time something like devfs would instantiate the "proxy"
drivers for all of the kernels except the one which is "in control"
of that device.

For example /dev/hda would be a real IDE disk device driver on the
controlling node, but would be NBD in all of the other OSlets. It would
have the same major/minor number across all OSlets so that it presented
a uniform interface to user-space. While in some cases (e.g. FC) you
could have shared-access directly to the device, other devices don't
have the correct locking mechanisms internally to be accessed by more
than one thread at a time.

As the "network" layer between two OSlets would run basically at memory
speeds, this would not impose much of an overhead. The proxy device
interfaces would be equally useful between OSlets as with two remote
machines (e.g. remote modem access), so I have no doubt that many of
them already exist, and the others could be written rather easily.

> Access to Filesystems Owned by Some Other OSlet.
>
> For the most part, this reduces to the mmap case. However,
> partitioning popular filesystems over the OSlets could be
> very helpful. Larry mentioned that this had been prototyped.
> Paul cannot remember if Larry promised to send papers or
> other documentation, but duly requests them after the fact.
>
> Larry suggests having a local /tmp, so that /tmp is in effect
> private to each OSlet. There would be a /gtmp that would
> be a globally visible /tmp equivalent. We went round and
> round on software compatibility, Paul suggesting a hashed
> filesystem as an alternative. Larry eventually pointed out
> that one could just issue different mount commands to get
> a global filesystem in /tmp, and create a per-OSlet /ltmp.
> This would allow people to determine their own level of
> risk/performance.

Nah, just use a cluster filesystem for everything ;-). As I mentioned
previously, Lustre could run from a single (optionally shared-access) disk
(with proper, relatively minor, hacks that are just in the discussion
phase now), or it can run from distributed disks that serve the data to
the remote clients. With smart allocation of resources, OSlets will
prefer to create new files on their "local" storage unless there are
resource shortages. The fast "networking" between OSlets means even
"remote" disk access is cheap.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-06-20 15:04:51

by Eric W. Biederman

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

Larry McVoy <[email protected]> writes:

> > I totally agree; mostly I was playing devil's advocate. The model
> > actually in my head is one where you have multiple kernels but they talk
> > well enough that the applications only have to care in areas where it
> > doesn't make a performance difference (there's got to be one of those).
>
> ....
>
> > The compute cluster problem is an interesting one. The big items
> > I see on the todo list are:
> >
> > - Scalable fast distributed file system (Lustre looks like a
> > possibility)
> > - Sub-application-level checkpointing.
> >
> > Services like schedulers already exist.
> >
> > Basically the job of a cluster scheduler gets much easier, and the
> > scheduler more powerful, once it gets the ability to suspend jobs.
> > Checkpointing buys three things: the ability to preempt jobs, the
> > ability to migrate processes, and the ability to recover from failed
> > nodes (assuming the failed hardware didn't corrupt your job's
> > checkpoint).
> >
> > Once solutions to the cluster problems become well understood I
> > wouldn't be surprised if some of the supporting services started to
> > live in the kernel like nfsd. Parts of the distributed filesystem
> > certainly will.
>
> http://www.bitmover.com/cc-pitch
>
> I've been trying to get Linus to listen to this for years and he keeps
> on flogging the tired SMP horse instead.

Hmm. My impression is that Linux has been doing SMP but mostly because
it hasn't become a nightmare so far. Linus just a moment ago noted that
there are scalability limits to SMP.

As for the cc-SMP stuff:
a) Except for dual-CPU systems, no one makes affordable SMPs.
b) It doesn't solve anything except your problem with locks.

You have presented your idea, and maybe it will be useful. But at
the moment it is not the place to start. What I need today is process
checkpointing. The rest comes in easy incremental steps from there.

For me the natural place to start is with clusters; they are cheaper
and more accessible than SMPs. And then work on the clustering
software with gradual refinements until it can be managed as one
machine. At that point it should be easy to compare which does a
better job for SMPs.

Eric

2002-06-20 16:42:55

by Cort Dougan

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

"Beating the SMP horse to death" does make sense for 2 processor SMP
machines. When 64 processor machines become commodity (Linux is a
commodity hardware OS) something will have to be done. When research
groups put Linux on 1k processors - it's an experiment. I don't think they
have much right to complain that Linux doesn't scale up to that level -
it's not designed to.

That being said, large clusters are an interesting research area but it is
_not_ a failing of Linux that it doesn't scale to them.

2002-06-20 17:16:41

by RW Hawkins

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

You're missing the point. Larry is saying "I have been down this road
before, take heed". We don't want to waste the time reinventing bloat
when we can learn from others' mistakes.

-RW

Cort Dougan wrote:

>"Beating the SMP horse to death" does make sense for 2 processor SMP
>machines. When 64 processor machines become commodity (Linux is a
>commodity hardware OS) something will have to be done. When research
>groups put Linux on 1k processors - it's an experiment. I don't think they
>have much right to complain that Linux doesn't scale up to that level -
>it's not designed to.
>
>That being said, large clusters are an interesting research area but it is
>_not_ a failing of Linux that it doesn't scale to them.



2002-06-20 17:16:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken



On Thu, 20 Jun 2002, Cort Dougan wrote:
>
> "Beating the SMP horse to death" does make sense for 2 processor SMP
> machines.

It makes fine sense for any tightly coupled system, where the tight
coupling is cost-efficient.

Today that means 2 CPU's, and maybe 4.

Things like SMT (Intel calls it "HT") increase that to 4/8. It's just
_cheaper_ to do that kind of built-in SMP support than it is to not use
it.

The important part of what Cort says is "commodity". Not the "small number
of CPU's". Linux is focusing on SMP, because it is the ONLY INTERESTING
HARDWARE BASE in the commodity space.

ccNuma and clusters just aren't even on the _radar_ from a commodity
standpoint. While commodity 4- and 8-way SMP is just a few years away.

So because SMP hardware is cheap and efficient, all reasonable scalability
work is done on SMP. And the fringe is just that - fringe. The
numa/cluster fringe tends to try to use SMP approaches because they know
they are a minority, and they want to try to leverage off the commodity.

And it will continue to be this way for the foreseeable future. People
should just accept the fact.

The only thing that may change the current state of affairs is that some
cluster/numa issues are slowly percolating down and they may become more
commoditized. For example, I think the AMD approach to SMP on the hammer
series is "local memories" with a fast CPU interconnect. That's a lot more
NUMA than we're used to in the PC space.

On the other hand, another interesting trend seems to be that since
commoditizing NUMA ends up being done with a lot of integration, the
actual _latency_ difference is so small that those potential future
commodity NUMA boxes can be considered largely UMA/SMP.

And I guarantee Linux will scale up fine to 16 CPU's, once that is
commodity. And the rest is just not all that important.

Linus

2002-06-20 17:35:10

by Cort Dougan

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

I'm not disagreeing with Larry here. I'm just pointing out that mainline
Linux cares about what is commodity. That's 1-2 processors and 2-4 on
some PPC and other boards.

I'm keenly interested in 1k processors, as is Larry, and scaling Linux up
to them. I don't disagree with Linus' path for Linux staying on SMP for
now. Scaling up to huge clusters isn't a mainline Linux concern. It's a
very interesting research area, though. In fact, some research I work on.

} You're missing the point. Larry is saying "I have been down this road
} before, take heed". We don't want to waste the time reinventing bloat
} when we can learn from others' mistakes.
}
} -RW
}
} Cort Dougan wrote:
}
} >"Beating the SMP horse to death" does make sense for 2 processor SMP
} >machines. When 64 processor machines become commodity (Linux is a
} >commodity hardware OS) something will have to be done. When research
} >groups put Linux on 1k processors - it's an experiment. I don't think they
} >have much right to complain that Linux doesn't scale up to that level -
} >it's not designed to.
} >
} >That being said, large clusters are an interesting research area but it is
} >_not_ a failing of Linux that it doesn't scale to them.

2002-06-20 20:40:39

by Martin Dalecki

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

Cort Dougan wrote:
> "Beating the SMP horse to death" does make sense for 2 processor SMP
> machines. When 64 processor machines become commodity (Linux is a
> commodity hardware OS) something will have to be done. When research

64-processor machines will *never* become a commodity because:

1. It's not like parallel machines are something entirely new. They have been
around for an awfully long time on this planet (nearly longer than myself).

2. See 1. even dual CPU machines are a rarity even *now*.

3. Nobody needs them for the usual tasks; they are a *waste*
of resources, and economics still applies.

4. SMP doesn't scale beyond 4. Period. (64 hardly makes sense...)

5. It will never become a commodity to run highly transactional
workloads where integrated bunches of 4 make sense. Neither will
it be common to solve partial differential equations for aeroplane
dynamics or to calculate the behaviour of a hydrogen bomb.

6. Even in the aerodynamics department a mere 14-CPU machine was
very, very fast (NEC SX-3R).

7. Hyper-threaded cores hardly make sense beyond 2.

8. Amdahl's law is math and not a decree from the Central Committee of
the Communist Party or George Bush. You cannot overrule it.

One exception could be dedicated rendering CPUs - which is the
direction where graphics cards are apparently heading - but they
will hardly ever need a general purpose operating system. But even then -
I'm still in the bunch of people who are not interested
in any OpenGL or Direct-whatever... The worst graphics cards
these days drive my display screens at the resolutions I wish them to
just fine.

PS. I'm sick of seeing bunches of PCs which are accidentally in
the same room nowadays in the list of the 500 fastest computers
in the world. It makes this list useless...

If one wants to have a grasp of how the next generation of
really fast computers will look, well: they will be based
on Josephson junctions. TRW will build them (same company
as the Voyager probe). Look there: they don't plan for thousands of CPUs,
they plan for a few CPUs in liquid helium:

http://www.trw.com/extlink/1,,,00.html?ExternalTRW=/images/imaps_2000_paper.pdf&DIR=2


2002-06-20 20:54:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken



On Thu, 20 Jun 2002, Martin Dalecki wrote:
>
> 2. See 1. even dual CPU machines are a rarity even *now*.

With stuff like HT, you may well not be able to _buy_ an intel desktop
machine with just "one" CPU.

Get with the flow. The old Windows codebase is dead as far as new machines
are concerned, which means that there is no reason to hold back any more:
all OS's support SMP.

> 3. Nobody needs them for the usual tasks; they are a *waste*
> of resources, and economics still applies.

That's a load of bull.

For usual tasks, two CPU's give clearly better responsiveness than one. If
only because one of them may be doing the computation, and the other may
be doing GUI.

The number of people doing things like mp3 ripping is apparently quite
high. And it's definitely CPU-intensive.

Now, I suspect that past two CPU's you won't find much added oomph, but
the load-balancing of just two is definitely noticeable on a personal
scale. I just don't want to use UP machines any more unless they have
other things going for them (ie really really small).

> 4. SMP doesn't scale beyond 4. Period. (64 hardly makes sense...)

That's not true either.

You can easily make _cheap_ hardware scale to 4, no problem. You may not
want a shared bus, but hey, that's a small implementation detail. Most new
CPU's have the interconnect hardware on-die (either now or planned).

Intel made SMP cheap by putting all the glue logic on-chip and in the
standard chipsets.

And besides, you don't actually need to _scale_ well, if the actual
incremental costs are low. That's the whole point with the P4-HT, of
course. Intel claims 5% die area addition for a 30% scaling. They may be
full of sh*t, of course, and it may be that the added complexity in the
control logic hurts them in other areas (longer pipeline, whatever), but
the point is that if it's cheap, the second CPU doesn't have to "scale".

Linus

2002-06-20 21:14:25

by Timothy D. Witham

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Thu, 2002-06-20 at 13:40, Martin Dalecki wrote:
> Cort Dougan wrote:
> > "Beating the SMP horse to death" does make sense for 2 processor SMP
> > machines. When 64 processor machines become commodity (Linux is a
> > commodity hardware OS) something will have to be done. When research
>
>
> 8. Amdahl's law is math and not a decree from the Central Committee of
> the Communist Party or George Bush. You cannot overrule it.
>
Boy, I haven't been beaten up by Amdahl's law for at least 10 years. :-)

A point to mention is that Amdahl's law also applies to scaling on
clusters. Same issues as SMP as far as application scalability
is concerned.

But the point is that there are a whole bunch of applications that
can have the serial portion reduced to such a small amount that they
can benefit from lots of CPUs.
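
To put numbers on that, Amdahl's law gives speedup(N) = 1 / (s + (1 - s)/N) for a serial fraction s; the fractions below are chosen only for illustration.

#include <stdio.h>

static double amdahl(double serial, int cpus)
{
        return 1.0 / (serial + (1.0 - serial) / cpus);
}

int main(void)
{
        const double serial[] = { 0.10, 0.01, 0.001 };
        const int cpus[] = { 2, 4, 64, 1024 };
        int i, j;

        /* The smaller the serial fraction, the further the curve keeps climbing. */
        for (i = 0; i < 3; i++)
                for (j = 0; j < 4; j++)
                        printf("s=%.3f N=%4d  speedup=%7.1f\n",
                               serial[i], cpus[j], amdahl(serial[i], cpus[j]));
        return 0;
}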

> One exception could be dedicated rendering CPUs - which is the
> direction where graphics cards are apparently heading - but they
> will hardly ever need a general purpose operating system. But even then -
> I'm still in the bunch of people who are not interested
> in any OpenGL or Direct-whatever... The worst graphics cards
> these days drive my display screens at the resolutions I wish them to
> just fine.
>
> PS. I'm sick of seeing bunches of PCs which are accidentally in
> the same room nowadays in the list of the 500 fastest computers
> in the world. It makes this list useless...
>
> If one wants to have a grasp of how the next generation of
> really fast computers will look, well: they will be based
> on Josephson junctions. TRW will build them (same company
> as the Voyager probe). Look there: they don't plan for thousands of CPUs,
> they plan for a few CPUs in liquid helium:
>
> http://www.trw.com/extlink/1,,,00.html?ExternalTRW=/images/imaps_2000_paper.pdf&DIR=2
>
>

You know, there used to be a whole bunch of companies doing this
sort of work, and they all went out of business because people could
build a cluster out of off-the-shelf parts for 1/10 of the cost and
get good enough performance. ETA, CDC, the old Cray - the list goes
on. All gone from the CPU business, because good enough and cheap enough
wins every time.

Tim

--
Timothy D. Witham - Lab Director - [email protected]
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office) (503)-702-2871 (cell)
(503)-626-2436 (fax)

2002-06-20 21:27:43

by Martin Dalecki

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

Linus Torvalds wrote:
>
> On Thu, 20 Jun 2002, Martin Dalecki wrote:
>
>>2. See 1. even dual CPU machines are a rarity even *now*.
>
>
> With stuff like HT, you may well not be able to _buy_ an intel desktop
> machine with just "one" CPU.

Linus you forget one simple fact - a HT CPU is *not* two CPUs.
It is one CPU with a slightly better utilization of the
super scalar pipelines. And it's only slightly better.
Just another way of increasing the fill rate of the pipelines
for some specific tasks.

> Get with the flow. The old Windows codebase is dead as far as new machines
> are concerned, which means that there is no reason to hold back any more:
> all OS's support SMP.
>
>
>>3. Nobody needs them for the usual tasks; they are a *waste*
>>of resources, and economics still applies.
>
>
> That's a load of bull.

Did I mention that ARMs are the most sold CPUs out there?

> For usual tasks, two CPU's give clearly better responsiveness than one. If
> only because one of them may be doing the computation, and the other may
> be doing GUI.

For the usual task of controlling just the fuel level of the motor
or the like, one CPU is fine. For the other usual
tasks - well, dissect a PCMCIA WLAN card or some reasonably fast
ethernet card or some hard disk. You will find tons of
independent CPUs in your system... but they are hardly SMP
connected. For the other usual tasks my single Athlon is
just fine. The main argument is: yes, it makes sense to
use additional CPUs for work offload on dedicated tasks,
but the normal case is not to do it the SMP way.

> The number of people doing things like mp3 ripping is apparently quite
> high. And it's definitely CPU-intensive.
>
> Now, I suspect that past two CPU's you won't find much added oomph, but

Well, on Intel two CPUs give you about 1.5 times the horsepower of
a single CPU. On good SMP systems it's about 1.7.

> the load-balancing of just two is definitely noticeable on a personal
> scale. I just don't want to use UP machines any more unless they have
> other things going for them (ie really really small).
> >
>>4. SMP doesn't scale beyond 4. Period. (64 hardly makes sense...)
> >
> That's not true either.
>
> You can easily make _cheap_ hardware scale to 4, no problem. You may not
> want a shared bus, but hey, they's a small implementation detail. Most new
> CPU's have the interconnect hardware on-die (either now or planned).
>
> Intel made SMP cheap by putting all the glue logic on-chip and in the
> standard chipsets.

Not if I look out to buy a real SMP board. They are still
very expensive in comparison to normal boards. However
indeed they are nowadays affordable.

> And besides, you don't actually need to _scale_ well, if the actual
> incremental costs are low. That's the whole point with the P4-HT, of
> course. Intel claims 5% die area addition for a 30% scaling. They may be

The 30% - I never saw it in the Intel paper. I remember they talk
about 20% plus something. And 30% is a *peak* value.
The paper in question talks about 12% on average. Awfully much for
5% die area (a factor-2.4 win), especially if you look at the constant
increase of die area of CPUs in comparison to the speed, factoring out
the scaling of the production process. If one factors out
the production-process scaling, modern CPUs are wasting transistors like
nobody's business in comparison to their older siblings. (Remember, the 8088 was
just about 22k transistors and not 140M!)
But it's not much in absolute numbers...

> full of sh*t, of course, and it may be that the added complexity in the
> control logic hurts them in other areas (longer pipeline, whatever), but
> the point is that if it's cheap, the second CPU doesn't have to "scale".

The main hurting point is the quadrupling of the correctness-testing
effort. Longer pipelines - I hardly think so. The synchronization infrastructure
for out-of-order execution was already there in the last CPU generation.
This is the reason why it's so cheap in terms of die real estate to add it now.

BTW, them pulling this trick shows nicely that we are now at a point
where there will be hardly any increase in the deployment of micro-scale
parallelism in CPU design nowadays... And not just on behalf of
the CPU - even more importantly, you could read it as a public admission of the
fact that we are near the end of static optimizations by improvements in
compiler technology as well. Oh, the compiler people have promised miracles
constantly since the first days of pipelining, of course...
In view of this I would love to see how they intend
to HT the VLSI design of the Itanic :-).



2002-06-20 21:37:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken



On Thu, 20 Jun 2002, Martin Dalecki wrote:
>
> Linus you forget one simple fact - a HT CPU is *not* two CPUs.
> It is one CPU with a slightly better utilization of the
> super scalar pipelines.

Doesn't matter. It's SMP to software, _and_ it is a perfect example of how
integration, in the form of almost free transistors, changes the
economics.

> Just another way of increasing the fill rate of the pipelines
> for some specific tasks.

Integration is _not_ "just another way".

Integration fundamentally changes the whole equation.

When you integrate the SMP capabilities on the CPU, suddenly the world
changes, because suddenly SMP is cheap and easy to do for motherboard
manufacturers that would never have done it before. Suddenly SMP is
available at mass-market prices.

When you integrate multiple CPU's on one standard die (either HT or real
CPU's), the same thing happens.

When you start integrating crossbars etc "numa-like" stuff, like Hammer
apparently is doing, you get the same old technology, but it _behaves_
differently.

You see this outside CPU's too.

When people started integrating high-performance 3D onto a single die, the
_market_ changed. The way people used it changed. It's largely the same
technology that has been around for a long time in visual workstations,
but it's DIFFERENT thanks to low prices and easy integration into
bog-standard PC's.

A 3D tech person might say that the technology is still the same.

But a real human will notice that it's radically different.

> Did I mention that ARMs are the most sold CPUs out there?

Doesn't matter. Did I mention that microbes are the most populous form of
living beings? Does that make any difference to us as humans? Should that
make us think we want to be microbes? Or should it mean that we're somehow
inferior? Obviously not.

Did you mention that there are a lot more resistors in computers than
CPU's? No. It is irrelevant. It doesn't drive technology in fundamental
ways - even though the amount of fundamental technology inherent on a
modern motherboard in _just_ the passive components like the resistor
network is way beyond what people built just a few years ago.

Linus

2002-06-20 21:59:25

by Martin Dalecki

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

Linus Torvalds wrote:
>
> On Thu, 20 Jun 2002, Martin Dalecki wrote:
>
>>Linus you forget one simple fact - a HT CPU is *not* two CPUs.
>>It is one CPU with a slightly better utilization of the
>>super scalar pipelines.
>
>
> Doesn't matter. It's SMP to software, _and_ it is a perfect example of how
> integration, in the form of almost free transistors, changes the
> economics.

Well, but this simply still doesn't make SMP magically scale
better. HT gives you about a 12% increase in throughput on average.
This will hardly improve your MP3 ripping experience :-).

> Integration is _not_ "just another way".
>
> Integration fundamentally changes the whole equation.
>
> When you integrate the SMP capabilities on the CPU, suddenly the world
> changes, because suddenly SMP is cheap and easy to do for motherboard
> manufacturers that would never have done it before. Suddenly SMP is
> available at mass-market prices.

And suddenly the chipset manufacturers start to buy CPU
designs like crazy, because they can see what will be next... of course.

> When you integrate multiple CPU's on one standard die (either HT or real
> CPU's), the same thing happens.

Again, HT is still only one CPU. You are too software-centric :-).

> When you start integrating crossbars etc "numa-like" stuff, like Hammer
> apparently is doing, you get the same old technology, but it _behaves_
> differently.

Yes, HT gives 12%, naive SMP gives 50%, and good SMP (aka crossbar bus)
gives 70% for two CPUs. All those numbers are well below the level
where more than 2-4 makes hardly any sense... Amdahl still bites you if you
read it like:

88% waste (well, actually this time not)
50% waste
20% waste

on that scale.

However, crossbar switches do indeed allow for at most
64 CPUs and, more importantly, they are the first step in a long time
to provide better overall system throughput. However, they will still
not be anywhere near commodity - too much heat for the foreseeable future.

> You see this outside CPU's too.
>
> When people started integrating high-performance 3D onto a single die, the
> _market_ changed. The way people used it changed. It's largely the same
> technology that has been around for a long time in visual workstations,
> but it's DIFFERENT thanks to low prices and easy integration into
> bog-standard PC's.
>
> A 3D tech person might say that the technology is still the same.
>
> But a real human will notice that it's radically different.

Yes, but you can drive the technology only up to the perceptual limits
of a human. For example, for about 6 years all those advancements
in the graphics area have been largely uninteresting to me. I don't
play computer games. Never - they are too boring. Yet another
fan in my computer - no thanks.

> Did you mention that there are a lot more resistors in computers than
> CPU's? No. It is irrelevant. It doesn't drive technology in fundamental
> ways - even though the amount of fundamental technolgy inherent on a
> modern motherboard in _just_ the passive components like the resistor
> network is way beyond what people built just a few years ago.

Well, the last real technological jump comparable to the invention
of television was actually due to the kind of CPUs which you
compare to microbes - mobiles :-). And well, I'm awaiting the
day when there will be some WinWLAN card as shoddy as those
Winmodems are... Fortunately they made 802.11b complicated enough :-).
But with a crossbar switch in place they could well make up for
the latency on the main CPU... oh fear... oh scare...





2002-06-20 22:18:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken



On Thu, 20 Jun 2002, Martin Dalecki wrote:
>
> Yes, HT gives 12%, naive SMP gives 50%, and good SMP (aka crossbar bus)
> gives 70% for two CPUs. All those numbers are well below the level
> where more than 2-4 makes hardly any sense...

You don't _understand_.

If it's "free", you take that 70% for the second CPU, and the additional
20% for the next two.

Don't bother repeating yourself about Amdahl's law. Realize what Moore's
law says: things get cheaper over time. A _lot_ cheaper.

It's still a fact that people are willing to pay for performance. Even if
they strictly don't "need" it (but who are you or I to say who "needs"
performance?).

At which point it doesn't _matter_ if you only get 70% or 30% or 12%
improvement. If it's within "cheap enough", people will buy it. In fact,
once it gets "too cheap", people will buy something more expensive just
because a cheap PC obviously isn't good enough. That's _reality_.

Your "efficiency" arguments have no basis in the real life of economics in
a developing market. Only embedded people care about absolute cost and
absolute efficiencies ("it's not worth it for us to go for a more powerful
CPU, since we don't need it"). The rest of the world takes that 66MHz
improvement (in a CPU that does multiple gigahertz) and is happy about it.
Or takes the added 12%, and is happy about it.

Humans are not rational creatures. We're _rationalizing_ creatures, and we
love rationalizing that big machine that just makes us feel better.

Linus

2002-06-20 22:41:31

by Martin Dalecki

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

Linus Torvalds wrote:

> At which point it doesn't _matter_ if you only get 70% or 30% or 12%
> improvement. If it's within "cheap enough", people will buy it. In fact,
> once it gets "too cheap", people will buy something more expensive just
> because a cheap PC obviously isn't good enough. That's _reality_.
>
> Your "efficiency" arguments have no basis in the real life of economics in
> a developing market. Only embedded people care about absolute cost and
> absolute efficiencies ("it's not worth it for us to go for a more powerful
> CPU, since we don't need it"). The rest of the world takes that 66MHz
> improvement (in a CPU that does multiple gigahertz) and is happy about it.
> Or takes the added 12%, and is happy about it.

You don't read economics papers, do you? Or what is it with this
plummeting server/PC market around us? Or increased notebook sales?
(A typical market-saturation symptom, like the second car for the
family :-).

I suggest it's precisely the end of the open invention curve out there:

1. Nowadays the CPUs are indeed good enough for most of the common tasks.
WindowsXP tries hard to help overcome this :-). But in reality Win2000
is just fine for office work.

2. The technology in question is starting to hit real physical barriers because
it appears more and more that not everything coming out of the labs
can be implemented at reasonable costs.

> Humans are not rational creatures. We're _rationalizing_ creatures, and we
> love rationalizing that big machine that just makes us feel better.

Perhaps it's just still too deep in my brain that
the overwhelming part of the PC market is still determined
by corporate buyers (70%). And they look for efficiency (well, within
wide boundaries :-). There is, for example, not much of a rush from
Win4.0 or Win2000 to WindowsXP. Not only due to "political" reasons,
but because a normal PC from a few years ago still does the job
for office productivity. Quite a way from the days of yearly upgrades
all around the office :-)... And finally, the whole thing driving
the movement behind AS/390 boxen running Linux OS instances is consolidation
and costs too...

2002-06-20 23:52:11

by Miles Lane

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

Martin Dalecki wrote:
<snip>
> You don't read economics papers, do you? Or what is it with this
> plummeting server/PC market around us? Or increased notebook sales?
> (A typical market-saturation symptom, like the second car for the
> family :-).
>
> I suggest it's precisely the end of the open invention curve out there:
>
> 1. Nowadays the CPUs are indeed good enough for most of the common tasks.
> WindowsXP tries hard to help overcome this :-). But in reality Win2000
> is just fine for office work.
>
> 2. The technology in question is starting to hit real physical barriers because
> it appears more and more that not everything coming out of the labs
> can be implemented at reasonable costs.

Martin, perhaps you haven't seen this article.
This news seems to contradict your assertion that cost is going
to become a big problem as we attempt to continue tracking the
price/performance trajectory of Moore's law.

http://www.nytimes.com/reuters/technology/tech-technology-chip.html

Miles

2002-06-21 00:09:21

by Allen Campbell

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

> Perhaps it's just still too deep in my brain that
> the overwhelming part of the PC market is still determined
> by corporate buyers (70%). And they look for efficiency (well, within
> wide boundaries :-).

Most of those buyers care about cost efficiency, not design
efficiency. If a 4 way Dell can just match a 2 way Sun, and for
half the cost, guess who gets the sale. Doesn't matter if it's
"naive" SMP or a beautiful cross-bar design, blessed by MIT. Yes,
it's ugly. Sure, it would be nice if everyone loved computing so
much that they actually cared enough to make the distinction. They
don't. Get over it.

As long as Linux is true to the market it will thrive. The moment
the motivation becomes someone's pedantic notion of "purity", it's
gone. I believe Linus understands this, and I'm thankful. I'm
guessing that gift of understanding comes from a time when a certain
programmer couldn't afford to pay for the elegance that was offered
at the time.

2002-06-21 05:45:05

by Eric W. Biederman

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

Cort Dougan <[email protected]> writes:

> "Beating the SMP horse to death" does make sense for 2 processor SMP
> machines. When 64 processor machines become commodity (Linux is a
> commodity hardware OS) something will have to be done. When research
> groups put Linux on 1k processors - it's an experiment. I don't think they
> have much right to complain that Linux doesn't scale up to that level -
> it's not designed to.
>
> That being said, large clusters are an interesting research area but it is
> _not_ a failing of Linux that it doesn't scale to them.

Linux in a classic Beowulf configuration scales just fine. To be clear,
I am talking about a batch-scheduling system, where jobs that run for
hours at a time and on many nodes, possibly the entire cluster at
once, are scheduled across some number of commodity systems with a good
network interconnect.

The concern now is not whether it works, or whether it works well, but
whether it can be made more convenient to use.

Eric

2002-06-21 06:26:04

by Eric W. Biederman

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

Linus Torvalds <[email protected]> writes:

> On Thu, 20 Jun 2002, Cort Dougan wrote:
> >
> > "Beating the SMP horse to death" does make sense for 2 processor SMP
> > machines.
>
> It makes fine sense for any tightly coupled system, where the tight
> coupling is cost-efficient.
>
> Today that means 2 CPU's, and maybe 4.
>
> Things like SMT (Intel calls it "HT") increase that to 4/8. It's just
> _cheaper_ to do that kind of built-in SMP support than it is to not use
> it.
>
> The important part of what Cort says is "commodity". Not the "small number
> of CPU's". Linux is focusing on SMP, because it is the ONLY INTERESTING
> HARDWARE BASE in the commodity space.

Commodity is the wrong word. Volume is the right word. Volumes of machines,
volumes of money, and volumes of developers.

> ccNuma and clusters just aren't even on the _radar_ from a commodity
> standpoint. While commodity 4- and 8-way SMP is just a few years away.

I bet it is easy to find a 2-4 way heterogeneous pile of
computers in many a developer's personal possession that could be turned
into a cluster if the software weren't so inconvenient to use, or if
there were a good reason to run computer systems that way.

Clusters and ccNuma are entirely different animals. ccNuma is about
specialized hardware. Clusters are about using commodity hardware in
a different way.

> So because SMP hardware is cheap and efficient, all reasonable scalability
> work is done on SMP. And the fringe is just that - fringe. The
> numa/cluster fringe tends to try to use SMP approaches because they know
> they are a minority, and they want to try to leverage off the commodity.

The cluster fringe is a minority. But the high-performance computing
and batch-scheduling minority has done a lot of the theoretical and
developmental computer science work in the past. And I
would be surprised if they weren't influential in the future. But,
like most research, a lot of it is trying suboptimal solutions that
eventually get ditched.

The only SMP-like stuff I have seen in clustering is the attempts to
make clusters simpler to use. And the question I hear is how simple
we can make it without sacrificing scalability.

> And it will continue to be this way for the forseeable future. People
> should just accept the fact.

I apparently see things differently. That clusters will be a
minority, certainly. That the people working on them are hopelessly
fringe, not a bit.

Clusters of Linux machines scale acceptably, and for a certain set of
people they get the job done. The problem is making it more convenient to
get the job done. And just as integration in hardware can make
extra hardware features essentially free, the next step is to begin
integrating cluster features into Linux, both kernel and user space.

Basically the technique is: implement something that works, then
find the clean, efficient way to do it. If that takes kernel support,
write a kernel patch and get it in.

> And I guarantee Linux will scale up fine to 16 CPU's, once that is
> commodity. And the rest is just not all that important.

It works just fine on my little 20-node, 20-kernel test machine too.

I think Larry's perspective is interesting and if the common cluster
software gets working well enough I might even try it. But until a
big SMP becomes commodity I don't see the point.

Eric

2002-06-21 07:29:13

by Martin Knoblauch

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

> If one wants to have a grasp of how the next generation of
> really fast computers will look, well: they will be based
> on Josephson junctions. TRW will build them (same company
> as the Voyager probe). Look there they don't plan for thousands of CPUs
-------------------------^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> they plan for a few CPUs in liquid helium:
>
>
>
http://www.trw.com/extlink/1,,,00.html?ExternalTRW=/images/imaps_2000_paper.pdf&DIR=2
>

The first thing that I caught on page 2 was the 4096 processors. Hmm...

Martin
--
----------------------------------
Martin Knoblauch
[email protected]
http://www.knobisoft.de

2002-06-21 08:12:05

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Thu, 20 Jun 2002, Martin Dalecki wrote:

> > When you integrate multiple CPU's on one standard die (either HT or real
> > CPU's), the same thing happens.
>
> Again HT is still only one CPU. You are too software centric :-).

Can't help it...

Remember i386/i387?

--
http://function.linuxpower.ca


2002-06-21 13:00:01

by Jesse Pollard

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

Martin Dalecki <[email protected]>:
>Yes, HT gives 12%. Naive SMP gives 50% and good SMP (aka crossbar bus)
>gives 70% for two CPUs. All those numbers are well below the level
>where more than 2-4 makes hardly any sense... Amdahl bites you still if you
>read it like:
...

I think your numbers are a little low - I've seen between 50% and 80% on
master/slave SMP depending on the job: 50% if both processes are heavily
syscall oriented, 75% (or thereabouts) when both processes are more normally
balanced, and 80% if both processes are more compute bound.

Good SMP, with a crossbar switch bus, should give close to 95%. Good SMP
alone should give about 75%.
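Those per-CPU figures are roughly what Amdahl's law predicts once you pick a
serialized fraction (syscall time, lock-held sections, the master CPU in a
master/slave design). A minimal sketch, assuming nothing beyond that model,
mapping a few serialized fractions to the resulting two-CPU speedup:

    #include <stdio.h>

    /* Amdahl's law: with a fraction s of the work serialized, the speedup
     * on n CPUs is 1 / (s + (1 - s) / n).  Illustration only. */
    static double amdahl(double s, int n)
    {
            return 1.0 / (s + (1.0 - s) / n);
    }

    int main(void)
    {
            const double serial[] = { 0.35, 0.20, 0.10, 0.05 };
            int i;

            for (i = 0; i < 4; i++)
                    printf("serialized fraction %.2f -> 2-CPU speedup %.2fx\n",
                           serial[i], amdahl(serial[i], 2));
            return 0;
    }

Serialized fractions of roughly 35%, 20%, 10%, and 5% come out to about
1.5x, 1.7x, 1.8x, and 1.9x, which spans the range of numbers above.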

My experience with a good crossbar switch is based on Cray UNICOS/YMP/SV
hardware. A well-tuned hardware platform, and a slightly less well-tuned
SMP implementation, though the UNICOS 10 rewrite may have fixed the
SMP implementation.

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]

Any opinions expressed are solely my own.

2002-06-21 16:56:58

by Sandy Harris

[permalink] [raw]
Subject: Re: Re: latest linus-2.5 BK broken

Linus Torvalds wrote:

> Integration is _not_ "just another way".
>
> Integration fundamentally changes the whole equation.
>
> When you integrate the SMP capabilities on the CPU, suddenly the world
> changes, because suddenly SMP is cheap and easy to do for motherboard
> manufacturers that would never have done it before. Suddenly SMP is
> available at mass-market prices.
>
> When you integrate multiple CPU's on one standard die (either HT or real
> CPU's), the same thing happens.
>
> When you start integrating crossbars etc "numa-like" stuff, like Hammer
> apparently is doing, you get the same old technology, but it _behaves_
> differently.
>
> You see this outside CPU's too.
>
> When people started integrating high-performance 3D ...

It seems to me we're talking about several different ways to get
parallelism in volume hardware: SMP, smarter peripherals, and
various sorts of clusters (Beowulf compute engines, redundant for
high availability, load sharing for web servers or other I/O
bound loads, ...). Great. All have their place.

I wonder, though, about one that doesn't seem to be discussed
much: asymmetric multiprocessing.

One example is IBM mainframes with their channel processors; not
just smart peripherals but whole CPUs dedicated to I/O control.
Another was the VAX 782, two 780s with a fat bus-to-bus cable
and each CPU getting DMA into the other's memory. One CPU ran
most of the kernel, the other all the user processes.

To what extent is this becoming relevant to Linux with the port
to System/390 and the trend to I2O devices in PCs? How does it
affect the overall design?

I rather like the notion of a machine with most of the kernel,
including all disk and net I/O, running on, say, a pair of ARMs
while a quad of 64-bit whatevers runs the user processes. This
might give better $/power/heat/... tradeoffs than just going
to 8-way systems.

2002-06-21 17:50:56

by Larry McVoy

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Fri, Jun 21, 2002 at 12:15:54AM -0600, Eric W. Biederman wrote:
> I think Larry's perspective is interesting and if the common cluster
> software gets working well enough I might even try it. But until a
> big SMP becomes commodity I don't see the point.

The real point is that multi threading screws up your kernel. All the Linux
hackers are going through the learning curve on threading and think I'm an
alarmist or a nut. After Linux works on a 64 way box, I suspect that the
majority of them will secretly admit that threading does screw up the kernel
but at that point it's far too late.

The current approach is a lot like western medicine. Wait until the
cancer shows up and then make an effort to get rid of it. My suggested
approach is to take steps to make sure the cancer never gets here in
the first place. It's proactive rather than reactive. And the reason
I harp on this is that I'm positive (and history supports me 100%)
that the reactive approach doesn't work, you'll be stuck with it,
there is no way to "fix" it other than starting over with a new kernel.
Then we get to repeat this whole discussion in 15 years with one of the
Linux veterans trying to explain to the NewOS guys that multi threading
really isn't as cool as it sounds and they should try this other approach.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-06-21 17:55:38

by Robert Love

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Fri, 2002-06-21 at 10:50, Larry McVoy wrote:

> The real point is that multi threading screws up your kernel. All the Linux
> hackers are going through the learning curve on threading and think I'm an
> alarmist or a nut. After Linux works on a 64 way box, I suspect that the
> majority of them will secretly admit that threading does screw up the kernel
> but at that point it's far too late.

Larry, this is a point you have made several times and admittedly one I
agree with. I fail to see how the high-end scaling will not compromise
the low-end and I am genuinely concerned Linux will become Solaris.

I do not know what to do to prevent it - and I am certainly not saying
we should outright prevent certain things, but it worries me. You are
going to be in Ottawa next week? Maybe we can talk about it...

Robert Love


2002-06-22 01:52:10

by Rob Landley

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Thursday 20 June 2002 04:40 pm, Martin Dalecki wrote:
> Cort Dougan wrote:
> > "Beating the SMP horse to death" does make sense for 2 processor SMP
> > machines. When 64 processor machines become commodity (Linux is a
> > commodity hardware OS) something will have to be done. When research
>
> 64 processor machines will *never* become a commodity because:
>
> 1. It's not like parallel machines are something entirely new. They have
> been around for an awfully long time on this planet. (nearly longer than myself)
>
> 2. See 1. Even dual CPU machines are a rarity even *now*.

DOS was a reverse-engineered clone of CP/M with some Unix features bolted on
in the early 80's. DOS couldn't multitask on a single CPU. DOS couldn't
handle more than one video card. DOS could barely keep track of more than
one hard drive.

Windows 3.1 through Windows 98 (and Bill Gates' 1/8-scale clone Wini-Me) were
based on DOS; they couldn't take advantage of SMP if their lives depended on
it. NT through 4.0 had a market share dwarfed by the Macintosh.

> 3. Nobody needs them for the usual tasks; they are a *waste*
> of resources, and economics still applies.

Until Moore's law hits atomic resolution, sure. How long that will take is
hotly debated...

> 4. SMP doesn't scale beyond 4. Period. (64 hardly makes sense...)

Actually it does, just not with Intel's brain-dead memory bus architecture.
EV6 goes to 32-way pretty well.

The question is, at what point is it cheaper to just go to NUMA or clusters?
(And at what point do your trace lengths get long enough that SMP starts
acting like NUMA? And at what point do your cluster interconnects get fast
enough that something like MOSIX starts acting like NUMA?)

And the REALLY interesting advance is SMT (hyper-threading), rather than SMP.
How do you go beyond the Athlon's three execution cores without running out
of parallel instructions to feed them? Simple: teach the chip about
processes, so it can advance multiple points of execution to keep the cores
fed. This lets you throw a higher transistor budget at the L1 and L2 caches
without encountering diminishing returns as well. It's pretty
straightforward, and at the very least allows dispatching interrupts in
parallel and lets your GUI overlap drawing on the screen with the processing
to figure out what goes on the screen. Between the two of them, even X11
might finally give me smooth mouse scrolling, one of these days... :)

SMP on a chip really is overkill. Why give the multiple processors their own
cache and memory bus interface? Waste of transistors, power, heat, etc...
SMT is minimalist SMP on a chip...

> 5. It will never become a commodity to run highly transactional
> workloads where integrated bunches of 4 make sense. Neither will
> it be common to solve partial differential equations for aeroplane
> dynamics or to calculate the behaviour of a hydrogen bomb.

No, but it will be common to display bidirectional MP4 compressed video
through an encrypted link, with sound, quite possibly in a window while you
do other stuff with the machine. And some day voice recognition may actually
replace "the clapper" to turn your light off when you get into bed at night...

> One exception could be dedicated rendering CPUs - which is the
> direction where graphics cards are apparently heading - but they

"heading"? Headed. (What did you think your 3D accelerator card was?)

> PS. I'm sick of seeing bunches of PCs which are accidentally in
> the same room nowadays in the list of the 500 fastest computers
> in the world. It makes this list useless...

It shows who has money to throw at the problem, and approximately how much,
which is all it ever really showed...

> If one wants to have a grasp of how the next generation of
> really fast computers will look, well: they will be based
> on Josephson junctions. TRW will build them (same company
> as the Voyager probe). Look there they don't plan for thousands of CPUs
> they plan for a few CPUs in liquid helium:
>
> http://www.trw.com/extlink/1,,,00.html?ExternalTRW=/images/imaps_2000_paper.pdf&DIR=2

And Cray bathed their circuitry in Fluorinert decades ago. Liquid helium
ain't winding up on my desktop any time soon, and my laptop outperforms a
Cray-1, and I use it for a dozen variations of text editing (coding,
email...). Not interesting.

Rob

2002-06-22 02:37:13

by Rob Landley

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Thursday 20 June 2002 05:27 pm, Martin Dalecki wrote:
> Linus Torvalds wrote:
> > On Thu, 20 Jun 2002, Martin Dalecki wrote:
> >>2. See 1. even dual CPU machines are a rarity even *now*.
> >
> > With stuff like HT, you may well not be able to _buy_ an intel desktop
> > machine with just "one" CPU.
>
> Linus, you forget one simple fact - an HT CPU is *not* two CPUs.
> It is one CPU with a slightly better utilization of the
> superscalar pipelines. And it's only slightly better.
> Just another way of increasing the fill rate of the pipelines
> for some specific tasks.

Wrong.

RISC let you have two execution cores dispatching instructions in parallel
(two instructions per clock). AMD expanded this to three execution cores in
the Athlon with clever and insanely complex CISC->RISC translation and
pipeline-organizing circuitry. Intel couldn't match that (at first) and went
to VLIW, hence Itanic.

VLIW/EPIC was an attempt to figure out how to keep more execution cores busy
without having each one know what the other ones are doing, and searching for
parallelism in a single instruction stream. Offload the parallelism-finding
work onto the compiler, batch the resulting instructions together in groups,
and explicitly feed an instruction to each execution core, each clock cycle.
If there's nothing for it to do, feed it a NOP. That way you can have three
execution cores (getting three instructions per clock), and you can even do
four or five or six cores receiving big batches of parallel instructions and
executing the whole mess each clock cycle in parallel.

Of course the real bottleneck in a processor that's clock-multiplied by a
factor of 20 relative to the motherboard it sits in is the memory bus speed,
and L1 cache size (since it's up to 20x slower when it hits the edge of the
cache), and VLIW makes the memory bus MORE of a bottleneck, so the resulting
performance sucks tremendously. Oops. Back to the drawing board. (R.I.P.
Itanium, modulo Intel's marketing budget...)

Hyper-threading is another way to keep extra execution cores busy: teach the
chip about processes and dole the execution cores out to each process
depending on how many they can use. (One, two, or three, depending on how
parallel the next few instructions in the thread are.)

Of course each thread needs its own register profile, but register renaming
for speculative execution is way more complicated than that. And you need to
teach the MMU how to look at more than one set of page tables at a time, but
that's doable too.

Putting full-blown SMP on a chip means you're duplicating all sorts of
circuitry: your L1 cache, your bus interface logic, etc. SMT is basically
SMP on a chip that shares the L1 cache, AND gives you an excuse to EXPAND it
(they've got the transistor budget: Xeons have a megabyte or more of on-die
cache; it's just a case of diminishing returns. Now they get to spend the
transistors on a larger cache and actually have it MEAN something.)

And yes, you could go beyond three execution cores with SMT. You could go to
five or six execution cores, and have three threads of execution if you
really wanted to. The design gets a little more complicated, but not really
all that much, since the purpose is to SEPARATE what the threads are doing,
as opposed to the traditional "is core #1 going to interfere with what core
#2 is doing"? You may wind up designing a full-blown instruction scheduler,
but if that's too complex you could always put it in software and call it
code morphing II. :)

We've had a variant of multiprocessing on a chip since the original Pentium,
we just called it pipelining. Saying SMT is not "true SMP" is splitting
hairs, and an attempt to win an argument by redefining the words used in the
original statement. (I wasn't wrong: that color's not blue!)
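As an aside, the logical CPUs that SMT exposes are visible to ordinary
software through CPUID; a minimal sketch (32-bit x86, gcc inline asm,
non-PIC build, illustration only) that reads function 1 and reports the HTT
flag in EDX bit 28 and the logical-CPU count in EBX bits 23:16:

    #include <stdio.h>

    /* Illustration only: CPUID function 1, HTT flag and logical CPU count. */
    static void cpuid(unsigned int op, unsigned int *eax, unsigned int *ebx,
                      unsigned int *ecx, unsigned int *edx)
    {
            __asm__ __volatile__("cpuid"
                                 : "=a" (*eax), "=b" (*ebx),
                                   "=c" (*ecx), "=d" (*edx)
                                 : "a" (op));
    }

    int main(void)
    {
            unsigned int eax, ebx, ecx, edx;

            cpuid(1, &eax, &ebx, &ecx, &edx);
            printf("HTT: %s, logical CPUs per package: %u\n",
                   (edx & (1 << 28)) ? "yes" : "no", (ebx >> 16) & 0xff);
            return 0;
    }

On a plain chip this reports no HTT; on an HT P4 the package advertises two
logical processors, which is exactly what the OS then treats as two CPUs.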

> > Get with the flow. The old Windows codebase is dead as far as new
> > machines are concerned, which means that there is no reason to hold back
> > any more: all OS's support SMP.
> >
> >>3. Nobody needs them for the usual tasks they are a *waste*
> >>of resources and economics still applies.
> >
> > That's a load of bull.
>
> Did I mention that ARMs are the best-selling CPUs out there?

So they finally passed the enormous installed base of Z80's in traffic
lights, elevators, and microwaves? Bully for them.

What USE this information is remains an open question.

> For the usual task of controlling just the fuel level of the motor
> or the like, one CPU is fine. For the other usual
> tasks - well, dissect a PCMCIA WLAN card or some reasonably fast
> ethernet card or some hard disk. You will find tons of
> independent CPUs in your system... but they are hardly SMP
> connected. For the other usual tasks my single Athlon is
> just fine.

And the Z80 hooked up to an S-100 bus running CP/M shall always rule forever
and ever, hallelujah, amen. Case dismissed.

> > The number of people doing things like mp3 ripping is apparently quite
> > high. And it's definitely CPU-intensive.
> >
> > Now, I suspect that past two CPU's you won't find much added oomph, but
>
> Well, on Intel two CPUs give you about 1.5 times the horsepower of
> a single CPU. On good SMP systems it's about 1.7.

Intel's traditional way of doing SMP sucks (the memory bus is STILL the main
bottleneck to performance: let's share it!), and most PC OSes have
traditionally had mondo lock contention doing even simple things. Okay. So?

> > Intel made SMP cheap by putting all the glue logic on-chip and in the
> > standard chipsets.
>
> Not if I look out to buy a real SMP board.

Again with the "the PC isn't a real computer" line of argument...

> They are still
> very expensive in comparison to normal boards. However,
> indeed, they are nowadays affordable.

A year and a half ago I worked at the company that prototyped the first dual
Athlon board (Boxxtech: Tyan owed them a favor). Intel was never interested
in bringing out a dual Celeron motherboard (the first Celerons were so
cache-crippled that trying to SMP them was just painful). They ONLY wanted to
do SMP at the high end, and as processors came down in price they yanked the
SMP support circuitry.

Add in the fact that the Intel SMP bus still sucks tremendously and that the
dominant OS through Windows 98 couldn't even understand two graphics cards
(and often got confused by two NETWORK cards), and we're not talking a recipe
for widespread adoption here...

> > And besides, you don't actually need to _scale_ well, if the actual
> > incremental costs are low. That's the whole point with the P4-HT, of
> > course. Intel claims 5% die area addition for a 30% scaling. They may be
>
> The 30% - I never saw it in the Intel paper. I remember they talked
> about 20% + something. And 30% is a *peak* value.

Sure. Keeping that third execution core busy 24/7. In the rare instances when
their pipeline organizer can devote that third execution core to advancing
the first process, preventing it from doing so slows that first process
down by repurposing a resource that would NOT otherwise have been wasted.
(Minus a 3% performance penalty for extra cache thrashing and memory bus
contention.)

Now add a FOURTH execution core to the chip, bump the L1 cache size a bit,
and watch performance go up 25%...

I am REALLY waiting for AMD to start doing this. We've been waiting for "SMP
on a chip" (outside of PPC) for years, without anyone ever explaining what the
advantage was of giving each core its own bus interface unit and L1 cache...

> The paper in question talks about 12% on average. Awfully much for
> 5% die area (a 2.4-factor win), especially if you look at the constant
> increase of die area of CPUs in comparison to the speed, factoring out
> the scaling of the production process. If one factors out
> the production process scale, modern CPUs are wasting transistors like
> mad in comparison to their older siblings. (Remember, the 8088 was
> just about 22K transistors, not 140M!).
> But it's not much in absolute numbers...

Yeah. It's called "a good idea" instead of brute-force throwing transistors
at the problem. Even Intel's allowed to have the occasional good idea.
(After Itanium they're certainly due for one!)

> > full of sh*t, of course, and it may be that the added complexity in the
> > control logic hurts them in other areas (longer pipeline, whatever), but
> > the point is that if it's cheap, the second CPU doesn't have to "scale".
>
> The main hurting point is the quadrupling of the correctness testing
> effort. Longer pipelines - I hardly think so. The synchronization
> infrastructure for out-of-order execution was already there in the last CPU
> generation. This is the reason why it's so cheap in terms of die real estate
> to add it now.

In theory they might even be able to get rid of some of it, as long as they
can keep all their execution cores busy 99% of the time without it. (Picking
three simultaneously runnable instructions from two different threads of
execution is a fundamentally easier problem than consistently picking even
two instructions from one thread.)

And it's a far cry from Itanium's way of handling branch prediction to
keep the cores busy. (Execute BOTH forks and throw away the one we don't
take! Yeah, that'll guarantee we waste work so we LOOK busy, but don't
actually run noticeably faster! Brilliant! (What, is the goal to make the
chip run hot? A 95% prediction rate isn't enough for you, and you're STILL
going to stall the pipeline when you hit the edge of the L1 cache anyway...))

> BTW. Them pulling this trick shows nicely that we are now at a point
> where there will be hardly any further increase in the deployment of
> micro-scale parallelism in CPU design nowadays...

Famous last words...

> And not just on behalf of
> the CPU - even more importantly, you could read it as a public admission of
> the fact that we are near the end of static optimizations by improvements in
> compiler technology as well. Oh, the compiler people have promised miracles
> constantly since the first days of pipelining, of course...

Trust me: GCC 3.x can still be seriously improved upon.

> In view of this I would love to see how they intend
> to HT the VLSI design of the Itanic :-).

Well, the rumors are that Intel is going to bury Itanic in a sea trench and
license x86-64. AMD has confirmed that Intel licensed the rights to the
x86-64 instruction set, and Intel's prototype is apparently called Yamhill:

http://www.matrixlist.com/pipermail/pc_support/2002-May/001416.html

Whether or not AMD got a license to the inevitable hyper-threading patents in
return, I have no idea. (If AMD would just buy Transmeta and be done with
it, I'd feel more comfortable predicting them. I have friends who work
there; that rumor mill's bandwidth is full of the trouble they're having with
absolutely sucky motherboard chipsets and Nvidia writing out-of-spec graphics
cards that the chipsets are actually designed to compensate for, and as such
wind up screwing up other things by being out of spec. Or something like
that; that's the trouble with rumors, details get mangled...)

Rob

2002-06-22 03:00:40

by Rob Landley

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Thursday 20 June 2002 05:59 pm, Martin Dalecki wrote:

> Well, but this simply still doesn't make SMP magically scale
> better. HT gives you about a 12% increase in throughput on average.
> This will hardly improve your MP3 ripping experience :-).

HT is currently sopping up the idle time on the second and third execution
core in the processor, and the fact that the processor before HT only had as
many cores as it could at least sometimes use means that these execution
cores aren't always idle.

That said, there's nothing to stop them from adding a fourth execution core
to the die and getting a 25% boost, and then a fifth core and
getting a little boost from that too. (And when you add the sixth core,
teach the processor about the concept of a third thread, at which point you
just write an instruction dispatcher feeding an arbitrary number of thread
instruction streams into an arbitrary number of execution cores, and then add
cores to your heart's content until you start having NUMA problems in your L1
cache... :)

By the way, your mp3 ripping experience is largely about latency, which HT
does help. (Realtime is all about getting a tiny amount of work done NOW,
rather than a lot of work done after a significant fraction of a second
scheduling delay.) As long as ripping and playback don't skip, processes
that can be batched aren't really the problem. (Suck this CD dry, crunch it
to files in this directory, I'm going to answer email in the meantime.)

> > When you integrate multiple CPU's on one standard die (either HT or real
> > CPU's), the same thing happens.
>
> Again HT is still only one CPU. You are too software centric :-).

It's a CPU that literally can advance two processes at once. Not "time
slice, time slice, time slice" with evil context switches in between thrashing
your cache, but actual parallel processing.

My understanding is that with HT turned on, one of your three execution cores
is devoted to each thread, and they get to fight over who gets to use the
third each clock cycle. So you get to queue up DMA for that screaming SCSI
card without waiting for your other system call to exit its critical region.
Hence the latency picture is REALLY NICE...

> However, crossbar switches do indeed allow for at most
> 64 CPUs, and more importantly they're the first step in a long time
> toward better overall system throughput. However, they will still
> not be anywhere near commodity - too much heat for the foreseeable future.

If you can do 8-way SMP/SMT on a chip (does SMT with twice as many execution
cores as threads count as "real" SMP to you?), and then you fit that in an
8-way motherboard, boom: you have 64-way. Without really needing crossbar
switches if you don't want to go that way...

Sooner or later they'll just have an arbitrary execution core scheduler, and
they won't have a fixed ratio of threads to cores; you'll just feed the
chip what you've got and it'll power down any cores that aren't in use this
clock cycle. I can easily see Transmeta scaling code morphing up to dozens
or even hundreds of execution cores in that case...

That's a few years in the future, though.

> > A 3D tech person might say that the technology is still the same.
> >
> > But a real human will notice that it's radically different.
>
> Yes, but you can drive the technology only up to the perceptual limits
> of a human. For example, for about 6 years now all those advancements
> in the graphics area have been largely uninteresting to me. I don't
> play computer games. Never - they are too boring. Yet another
> fan in my computer - no thanks.

"It doesn't interst me so it's not interesting" is not a good argument, but
the fact that the human visual perception threshold has long been reported to
be 80 million triangles per second and we're approaching the ability to do
that in real time with commodity off the shelf video cards. (Another two or
three generations of moore's law and we WON'T be able to see the
difference...) That is a point.

> Well, the last real technological jump comparable to the invention
> of television was actually due to the kind of CPUs which you
> compare to microbes - mobiles :-). And well, I'm awaiting the
> day when there will be some WinWLAN card as shoddy as those Win
> modems are... Fortunately they made 802.11b complicated enough :-)
> But with a crossbar switch in place they could well make up for
> the latency on the main CPU... oh fear... oh scare...

The latency in the Cat 5 dwarfs any latency you're going to have on the
motherboard, and that's something they deal with by just making gigabit and
higher synchronous. No reason you can't have a win-Ethernet card except that
100baseT is now $4.50 on a card (and a lot less as a chip on the motherboard,
and that's just a licensing cost, the IC is pennies), and your "last mile"
cable modem or DSL still isn't maxing out the ten-megabit Ethernet connection
you're really hooking up to the internet through...

There's no excess cost to squeeze out of here by going to a DSP...

Rob

2002-06-22 18:35:30

by Eric W. Biederman

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

Larry McVoy <[email protected]> writes:

> On Fri, Jun 21, 2002 at 12:15:54AM -0600, Eric W. Biederman wrote:
> > I think Larry's perspective is interesting and if the common cluster
> > software gets working well enough I might even try it. But until a
> > big SMP becomes commodity I don't see the point.
>
> The real point is that multi threading screws up your kernel. All the Linux
> hackers are going through the learning curve on threading and think I'm an
> alarmist or a nut. After Linux works on a 64 way box, I suspect that the
> majority of them will secretly admit that threading does screw up the kernel
> but at that point it's far too late.

I don't see an argument that locks that get too fine-grained are not an
issue. However, even traditional versions of single-CPU Unix are
multithreaded. The locking in a multi-CPU design just makes that explicit.

And the only really nasty place to have locks is when you get a
noticeable number of them in your device drivers. With the core code
you can fix it without worrying about killing the OS.

> The current approach is a lot like western medicine. Wait until the
> cancer shows up and then make an effort to get rid of it. My suggested
> approach is to take steps to make sure the cancer never gets here in
> the first place. It's proactive rather than reactive. And the reason
> I harp on this is that I'm positive (and history supports me 100%)
> that the reactive approach doesn't work, you'll be stuck with it,
> there is no way to "fix" it other than starting over with a new kernel.
> Then we get to repeat this whole discussion in 15 years with one of the
> Linux veterans trying to explain to the NewOS guys that multi threading
> really isn't as cool as it sounds and they should try this other
> approach.

Proactive: don't add a lock unless you can really justify that you need
it. That is well suited to open-source code-review practices,
and it appears to be what we are doing now. And if you don't add
locks, you certainly don't get into a lock tangle.

As for history supporting you 100%, all I see is that evolution of code,
as it dynamically gathers the requirements instead of magically
knowing them, does much better than up-front design as a long-term
strategy. Of course you design the parts you can see, but everyone has a
limited ability to see the future.

To get to specifics, I don't see the point of OSlets on a single CPU that is
hyper-threaded. Traditional threading appears to make more sense to
me. Similarly, I don't see the point in the 2-4 CPU range.

Given that there are some scales when you don't want/need more than
one kernel, who has a machine where OSlets start to pay off? They
don't exist in commodity hardware, so being proactive now looks
stupid.

The only practical course I see is to work on solutions that work on
clusters of commodity machines. At least anyone who wants one can
get one. If you can produce a single system image, the big iron guys
can tweak the startup routine and run that on their giant NUMA or SMP
machines.

Eric

2002-06-22 19:26:57

by Larry McVoy

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Sat, Jun 22, 2002 at 12:25:09PM -0600, Eric W. Biederman wrote:
> I don't see an argument that locks that get too fine-grained are not an
> issue. However, even traditional versions of single-CPU Unix are
> multithreaded. The locking in a multi-CPU design just makes that explicit.
>
> And the only really nasty place to have locks is when you get a
> noticeable number of them in your device drivers. With the core code
> you can fix it without worrying about killing the OS.

Just out of curiosity, have you actually ever worked on a fine-grained
threaded OS? One that scales to at least 32 processors? Solaris? IRIX?
Others? It makes a difference; if you've been there, your perspective is
somewhat different than just talking about it. If you have worked on one,
for how long? Did you support the source base after it matured for any
length of time?

> Proactive: don't add a lock unless you can really justify that you need
> it. That is well suited to open-source code-review practices,
> and it appears to be what we are doing now. And if you don't add
> locks, you certainly don't get into a lock tangle.

That's a great theory. I support that theory; life would be great if it
matched that theory. Unfortunately, I don't know of any kernel which
matches that theory, do you? Linux certainly doesn't. FreeBSD certainly
doesn't. Solaris/IRIX crossed that point years ago. So where is the
OS which has managed to resist the lock tangle?

linux-2.5$ bk -r grep CONFIG_SMP | wc -l
1290

That's a lot of ifdefs for a supposedly tangle-free kernel. And I suspect
that the threading people will say Linux doesn't really scale beyond
2-4 CPUs for any I/O-bound workload today. What's it going to be when
Linux is at 32 CPUs? Solaris was around 3000 statically allocated locks
when I left and I think it was scaling to maybe 8. At SGI, they were
carefully putting the lock on the same cache line as the data structure
that it protected, for every locked data structure which had any contention.
The limit as the number of CPUs goes up is that each read/write cache
line in the data segment has a lock. They certainly weren't there,
but they were much closer than you might guess. It was definitely the
norm that you laid out your locks with the data; it was that pervasive.

Take a walk through sched.c and you can see the mess starting. How
can anyone support that code on both UP and SMP? You are already
supporting two code bases. Imagine what it is going to look like when
the NUMA people get done. Don't forget the preempt people. Oh, yeah,
let's throw in some soft realtime, that shouldn't screw things up too
much.
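For a feel of what "two code bases" means in practice, here is a hypothetical
example (made-up names, not actual sched.c code) of the shape such a split
takes: the SMP build has to consider poking another CPU, the UP build
collapses to almost nothing, and every such site is a place where the two
variants can quietly drift apart.

    /* Hypothetical illustration only -- not actual kernel source. */
    #ifdef CONFIG_SMP
    static void kick_task(struct task_struct *p)
    {
            int cpu = task_cpu_of(p);          /* hypothetical accessor */

            mark_task_need_resched(p);         /* hypothetical helper */
            if (cpu != smp_processor_id())     /* real primitive */
                    smp_send_reschedule(cpu);  /* real, arch-specific */
    }
    #else
    static inline void kick_task(struct task_struct *p)
    {
            mark_task_need_resched(p);         /* UP: no other CPU to kick */
    }
    #endif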

> To get to specifics, I don't see the point of OSlets on a single CPU that is
> hyper-threaded. Traditional threading appears to make more sense to
> me. Similarly, I don't see the point in the 2-4 CPU range.

In general I agree with you here, but I think you haven't really considered
all the options. I can see the benefit on a *single* CPU. There are all
sorts of interesting games you could play in the area of fault tolerance
and containment. Imagine a system, like what IBM has, that runs lots of
copies of Linux with the mmap sharing turned off. ISPs would love it.

Jeff Dike pointed out that if UML can run one kernel in user space, why
not N? And if so, the OS clustering stuff could be done on top of
UML and then "ported" to real hardware. I think that's a great idea,
and you can carry it farther, you could run multiple kernels just for
fault containment. See Sun's domains, DEC's Galaxy.

> Given that there are some scales when you don't want/need more than
> one kernel, who has a machine where OSlets start to pay off? They
> don't exist in commodity hardware, so being proactive now looks
> stupid.

Not as stupid as having a kernel no one can maintain and not being able
to do anything about it. There seems to be a subthread of elitist macho
attitude along the lines of "oh, it won't be that bad, and besides,
if you aren't good enough to code in a fine-grained locked, soft real-time,
preempted, NUMA-aware kernel, then you just shouldn't be in the kernel".
I'm not saying you are saying that, but I've definitely heard it on
the list.

It's a great thing for bragging rights but it's a horrible thing from
the sustainability point of view.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-06-22 22:35:46

by Eric W. Biederman

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

Larry McVoy <[email protected]> writes:

> On Sat, Jun 22, 2002 at 12:25:09PM -0600, Eric W. Biederman wrote:
> > To get to specifics, I don't see the point of OSlets on a single CPU that is
> > hyper-threaded. Traditional threading appears to make more sense to
> > me. Similarly, I don't see the point in the 2-4 CPU range.
>
> In general I agree with you here, but I think you haven't really considered
> all the options. I can see the benefit on a *single* CPU. There are all
> sorts of interesting games you could play in the area of fault tolerance
> and containment. Imagine a system, like what IBM has, that runs lots of
> copies of Linux with the mmap sharing turned off. ISPs would love
> it.

Hmm. Perhaps. But you are fundamentally susceptible to the base
kernel, and the hardware on the machine.

> Jeff Dike pointed out that if UML can run one kernel in user space, why
> not N? And if so, the OS clustering stuff could be done on top of
> UML and then "ported" to real hardware. I think that's a great idea,
> and you can carry it farther, you could run multiple kernels just for
> fault containment. See Sun's domains, DEC's Galaxy.

Right. A clustered environment is accessible. For the most part I
don't have a problem (except checkpointing) that is facilitated by
running Linux under Linux.

Currently my problem to solve is compute clusters. My current worries
are not whether I can scale a kernel to 64 CPUs. My practical worries are
whether my user space will scale to 1000 dual-processor machines.

The important point for me is that there are a fair number of
fundamentally hard problems in getting multiple kernels to look like one,
especially when you start with maximum decoupling. And you seem to
assume that solving these problems is trivial.

Maybe it is maintainable when you get done but there is a huge amount
of work to get there. I haven't heard of a distributed OS as anything
other than a dream, or a prototype with scaling problems.

> > Given that there are some scales when you don't want/need more than
> > one kernel, who has a machine where OSlets start to pay off? They
> > don't exist in commodity hardware, so being proactive now looks
> > stupid.
>
> Not as stupid as having a kernel no one can maintain and not being able
> to do anything about it. There seems to be a subthread of elitist macho
> attitude along the lines of "oh, it won't be that bad, and besides,
> if you aren't good enough to code in a fine-grained locked, soft real-time,
> preempted, NUMA-aware kernel, then you just shouldn't be in the kernel".
> I'm not saying you are saying that, but I've definitely heard it on
> the list.

Hmm. I see the bulk of the ongoing kernel work composed of projects to
make the whole kernel easier to maintain. Especially interesting is
the work that makes drivers relatively easy, and free from all of this
cruft.

Running some numbers (wc -l kernel/*.c fs/*.c mm/*.c)
1.2.12: 18813 lines
2.2.12: 37510 lines
2.5.14: 55701 lines

So the core kernel is growing, but at a fairly slow rate. Only worrying
about the 60 thousand lines of generic kernel code is much better than
worrying about the 3 million lines of driver code.

And since you thought it was an interesting statistic:
grep CONFIG_SMP kernel/*.c fs/*.c mm/*.c init/*.c | wc -l
44

So most of the code that cares about SMP is not in the core of the
kernel, but is mostly the code that actually implements SMP support.

So, thinking about it, I agree that the constant simplification work
that is done to the Linux kernel looks like one of the most important
activities long term.

> It's a great thing for bragging rights but it's a horrible thing from
> the sustainability point of view.

Given that the simplification efforts tend to be some of the highest
priority activities in the kernel, and the easiest patches to get
accepted, I don't get the feeling that we are walking into a long-term
maintenance problem.

As for bragging rights, my kernel work tends to be some of the easiest
code I have to write. I have no doubts that C is a high level
programming language.

Eric

2002-06-22 23:10:45

by Larry McVoy

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Sat, Jun 22, 2002 at 04:25:29PM -0600, Eric W. Biederman wrote:
> The important point for me is that there are a fair number of
> fundamentally hard problems in getting multiple kernels to look like one,
> especially when you start with maximum decoupling. And you seem to
> assume that solving these problems is trivial.

No such assumption was made. Poke through my slides; you'll see that I
think it will take a reasonable amount of effort to get there. I actually
spelled out the staffing and the time estimates. Start asking around and
you'll find that senior people who _have_ gone the multithreading route
agree that this approach gets you to the same place with less than 1/10th
the amount of work. The last guy who agreed with that statement was the
guy who headed up the threading design and implementation of Solaris;
he's at Netapp now.

In fairness to you, I'm doing the same thing you are: I'm arguing about
something I haven't done. On the other hand, I have been through (twice)
the thing that you are saying is no problem and every person who has been
there agrees with me that it sucks. It's doable, but it's a nightmare to
maintain, it easily increases the subtlety of kernel interactions by an
order of magnitude, probably closer to two orders.

And I have done enough of what I've described to know it can be done.
People who have deep knowledge of the fine grained approach have tried
to prove that I was wrong and failed, repeatedly. They may not agree
that this is a better way but they can't show that it won't work.

> Maybe it is maintainable when you get done but there is a huge amount
> of work to get there. I haven't heard of a distributed OS as anything
> other than a dream, or a prototype with scaling problems.

This is a distributed OS on one system, which is a lot easier than a
distributed OS across machine boundaries. And if you are worried about
scaling problems, you don't understand the design. The OS cluster idea
multithreads all data structures for free. No locks on 99% of the
data structures that you would need locks on in an SMP OS.

Think about this fact: if you have lock contention you don't scale. So
you thread until you don't. Go do the math that shows how tiny a
fraction of 1% of lock contention screws your scaling; everyone has
bumped up against those curves. So the goal of any multithreaded OS
is ZERO lock contention. Makes you wonder why the locks are there
in the first place. They are trying to get to where I want to go, but
they are definitely doing it the hard way.
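The math in question is just the Amdahl curve again; a minimal sketch,
assuming nothing beyond a fixed 1% contended (serialized) fraction, that
sweeps the CPU count and prints the resulting speedup and per-CPU efficiency:

    #include <stdio.h>

    /* Fix the contended fraction at 1% and watch Amdahl's law as CPUs are
     * added.  At 64 CPUs that 1% already throws away roughly 40% of the
     * machine; by 1024 CPUs over 90% of it is idle.  Illustration only. */
    int main(void)
    {
            const double s = 0.01;          /* contended fraction */
            const int cpus[] = { 2, 4, 16, 64, 256, 1024 };
            int i;

            for (i = 0; i < 6; i++) {
                    int n = cpus[i];
                    double speedup = 1.0 / (s + (1.0 - s) / n);

                    printf("%4d CPUs: speedup %6.1fx, efficiency %5.1f%%\n",
                           n, speedup, 100.0 * speedup / n);
            }
            return 0;
    }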

> > Not as stupid as having a kernel no one can maintain and not being able
> > to do anything about it. There seems to be a subthread of elitist macho
> > attitude along the lines of "oh, it won't be that bad, and besides,
> > if you aren't good enough to code in a fine-grained locked, soft real-time,
> > preempted, NUMA-aware kernel, then you just shouldn't be in the kernel".
> > I'm not saying you are saying that, but I've definitely heard it on
> > the list.
>
> Hmm. I see the bulk of the ongoing kernel work composed of projects to
> make the whole kernel easier to maintain.
[...]
> I don't get the feeling that we are walking into a long
> term maintenance problem.

I don't mean to harp on this, but if you are going to comment on how
hard it is to maintain a kernel could you please give us some idea of
why it is you think as you do? Do you have some prior experience with a
project of this size that shows what you believe to be true in practice?
You keep suggesting that there isn't a problem, that we aren't headed for
a problem. Why is that? Do you know something I don't? I've certainly
seen what happens to a kernel source base as it goes through this process
a few times and my experience is that what you are saying is the opposite
of what happens. So if you've got some different experience, how about
sharing it? Maybe there is some way to do what you are suggesting will
happen, but I haven't ever seen it personally, nor have I ever heard
of it occurring in any long lived project. All projects become more
complex as time goes on, it's a direct result of the demands placed on
any successful project.

> So, thinking about it, I agree that the constant simplification work
> that is done to the Linux kernel looks like one of the most important
> activities long term.

What constant simplification work? The generic part of the kernel does
more or less what it did a few years ago, yet it has grown at a pretty fast
clip. Talk to the embedded people and ask them if they think it has gotten
simpler. By what standard has the kernel become less complex?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-06-23 06:35:10

by William Lee Irwin III

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

On Sat, Jun 22, 2002 at 12:26:56PM -0700, Larry McVoy wrote:
> Not as stupid as having a kernel no one can maintain and not being able
> to do anything about it. There seems to be a subthread of elitist macho
> attitude along the lines of "oh, it won't be that bad, and besides,
> if you aren't good enough to code in a fine-grained locked, soft real-time,
> preempted, NUMA-aware kernel, then you just shouldn't be in the kernel".
> I'm not saying you are saying that, but I've definitely heard it on
> the list.

I've been accused of this, so I'll state for the record: my views on
locking are not efficiency-related in the least. They have to do with
ensuring that locks protect well-defined data and that locking
constructs are clean (e.g. nonrecursive and no implicit drop or acquire).
My duties are not directly related to locking, and I only push the
agenda I do as a low-priority kernel janitoring effort. As this is not
a scalability issue, I'll not press it further in this thread.


Cheers,
Bill

2002-06-23 23:05:45

by kaih

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken

[email protected] (Larry McVoy) wrote on 22.06.02 in <[email protected]>:

> Just out of curiosity, have you actually ever worked on a fine-grained
> threaded OS? One that scales to at least 32 processors? Solaris? IRIX?
> Others? It makes a difference; if you've been there, your perspective is

IIRC, you said that your proposed system should have one OSlet per roughly 4
CPUs. And I see many people claiming that current Linux locking is aimed
at being good with about 4 CPUs.

Maybe I'm dense, but it seems to me that means current Linux locking is
aimed at exactly the spot where you argue it should be aimed *anyway*.

What am I not seeing?

MfG Kai

2002-06-24 21:28:21

by Paul McKenney

[permalink] [raw]
Subject: Re: latest linus-2.5 BK broken


Hello, Larry,

Our SMP cluster discussion was quite a bit of fun, very challenging!
I still stand by my assessment:

> The Score.
>
> Paul agreed that SMP Clusters could be implemented. He was not
> sure that it could achieve good performance, but could not prove
> otherwise. Although he suspected that the complexity might be
> less than the proprietary highly parallel Unixes, he was not
> convinced that it would be less than Linux would be, given the
> Linux community's emphasis on simplicity in addition to performance.

See you at Ottawa!

Thanx, Paul


> Larry McVoy <[email protected]>
> Sent by: [email protected]
> 06/19/2002 10:24 PM
>
> > I totally agree; mostly I was playing devil's advocate. The model
> > actually in my head is when you have multiple kernels but they talk
> > well enough that the applications don't have to care in areas where it
> > doesn't make a performance difference (there's got to be one of those).
>
> ....
>
> > The compute cluster problem is an interesting one. The big items
> > I see on the todo list are:
> >
> > - A scalable, fast distributed file system (Lustre looks like a
> > possibility)
> > - Sub-application-level checkpointing.
> >
> > Services like schedulers already exist.
> >
> > Basically the job of a cluster scheduler gets much easier, and the
> > scheduler more powerful, once it gets the ability to suspend jobs.
> > Checkpointing buys three things: the ability to preempt jobs, the
> > ability to migrate processes, and the ability to recover from failed
> > nodes (assuming the failed hardware didn't corrupt your job's
> > checkpoint).
> >
> > Once solutions to the cluster problems become well understood I
> > wouldn't be surprised if some of the supporting services started to
> > live in the kernel like nfsd. Parts of the distributed filesystem
> > certainly will.
>
> http://www.bitmover.com/cc-pitch
>
> I've been trying to get Linus to listen to this for years and he keeps
> on flogging the tired SMP horse instead. DEC did it and Sun has been
> passing around these slides for a few weeks, so maybe they'll do it too.
> Then Linux can join the party after it has become a fine-grained,
> locked-to-hell-and-back, soft "realtime", NUMA-enabled, bloated piece
> of crap like all the other kernels, and we'll get to go through the
> "let's reinvent Unix for the 3rd time in 40 years" all over again.
> What fun. Not.
>
> Sorry to be grumpy; go read the slides. I'll be at OLS, and I'd be happy
> to talk it over with anyone who wants to think about it. Paul McKenney
> from IBM came down to San Francisco to talk to me about it, put me
> through an 8- or 9-hour session which felt like a PhD exam, and
> after trying to poke holes in it grudgingly let on that maybe it was
> a good idea. He was kind enough to write up what he took away
> from it; here it is.
>
> --lm
>
> From: "Paul McKenney" <[email protected]>
> To: [email protected], [email protected]
> Subject: Greatly enjoyed our discussion yesterday!
> Date: Fri, 9 Nov 2001 18:48:56 -0800
>
> Hello!
>
> I greatly enjoyed our discussion yesterday! Here are the pieces of it that
> I recall; I know that you will not be shy about correcting any errors and
> omissions.
>
> Thanx, Paul
>
> Larry McVoy's SMP Clusters
>
> Discussion on November 8, 2001
>
> Larry McVoy, Ted T'so, and Paul McKenney
>
>
> What is SMP Clusters?
>
> SMP Clusters is a method of partitioning an SMP (symmetric
> multiprocessing) machine's CPUs, memory, and I/O devices
> so that multiple "OSlets" run on this machine. Each OSlet
> owns and controls its partition. A given partition is
> expected to contain from 4-8 CPUs, its share of memory,
> and its share of I/O devices. A machine large enough to
> have SMP Clusters profitably applied is expected to have
> enough of the standard I/O adapters (e.g., ethernet,
> SCSI, FC, etc.) so that each OSlet would have at least
> one of each.
>
> Each OSlet has the same data structures that an isolated
> OS would have for the same amount of resources. Unless
> interactions with the OSlets are required, an OSlet runs
> very nearly the same code over very nearly the same data
> as would a standalone OS.
>
> Although each OSlet is in most ways its own machine, the
> full set of OSlets appears as one OS to any user programs
> running on any of the OSlets. In particular, processes on
> one OSlet can share memory with processes on other OSlets,
> can send signals to processes on other OSlets, communicate
> via pipes and Unix-domain sockets with processes on other
> OSlets, and so on. Performance of operations spanning
> multiple OSlets may be somewhat slower than operations local
> to a single OSlet, but the difference will not be noticeable
> except to users who are engaged in careful performance
> analysis.
>
> The goals of the SMP Cluster approach are:
>
> 1. Allow the core kernel code to use simple locking designs.
> 2. Present applications with a single-system view.
> 3. Maintain good (linear!) scalability.
> 4. Not degrade the performance of a single CPU beyond that
> of a standalone OS running on the same resources.
> 5. Minimize modification of core kernel code. Modified or
> rewritten device drivers, filesystems, and
> architecture-specific code is permitted, perhaps even
> encouraged. ;-)
>
>
> OS Boot
>
> Early-boot code/firmware must partition the machine, and prepare
> tables for each OSlet that describe the resources that each
> OSlet owns. Each OSlet must be made aware of the existence of
> all the other OSlets, and will need some facility to allow
> efficient determination of which OSlet a given resource belongs
> to (for example, to determine which OSlet a given page is owned
> by).
>
> At some point in the boot sequence, each OSlet creates a "proxy
> task" for each of the other OSlets that provides shared services
> to them.
>
> Issues:
>
> 1. Some systems may require device probing to be done
> by a central program, possibly before the OSlets are
> spawned. Systems that react in an unfriendly manner
> to failed probes might be in this class.
>
> 2. Interrupts must be set up very carefully. On some
> systems, the interrupt system may constrain the ways
> in which the system is partitioned.
>
>
> Shared Operations
>
> This section describes some possible implementations and issues
> with a number of the shared operations.
>
> Shared operations include:
>
> 1. Page fault on memory owned by some other OSlet.
> 2. Manipulation of processes running on some other OSlet.
> 3. Access to devices owned by some other OSlet.
> 4. Reception of network packets intended for some other OSlet.
> 5. SysV msgq and sema operations on msgq and sema objects
> accessed by processes running on more than one OSlet.
> 6. Access to filesystems owned by some other OSlet. The
> /tmp directory gets special mention.
> 7. Pipes connecting processes in different OSlets.
> 8. Creation of processes that are to run on a different
> OSlet than their parent.
> 9. Processing of exit()/wait() pairs involving processes
> running on different OSlets.
>
> Page Fault
>
> As noted earlier, each OSlet maintains a proxy process
> for each other OSlet (so that for an SMP Cluster made
> up of N OSlets, there are N*(N-1) proxy processes).
>
> When a process in OSlet A wishes to map a file
> belonging to OSlet B, it makes a request to B's proxy
> process corresponding to OSlet A. The proxy process
> maps the desired file and takes a page fault at the
> desired address (translated as needed, since the file
> will usually not be mapped to the same location in the
> proxy and client processes), forcing the page into
> OSlet B's memory. The proxy process then passes the
> corresponding physical address back to the client
> process, which maps it.
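>
> A crude sketch of the request/reply that might flow between the
> faulting client on OSlet A and its proxy on OSlet B follows. The
> message layout and function names are invented, and the transport
> (IPC, door call, shared ring) is deliberately left abstract:
>
> /* Invented message formats for the cross-OSlet pagein described
>  * above.  send_to_proxy() stands in for whatever transport ends up
>  * being used; its dummy body just fails so the sketch stands alone. */
>
> struct pagein_request {
>         int           client_oslet;  /* OSlet A, where the fault hit   */
>         unsigned long file_id;       /* the file as named by OSlet B   */
>         unsigned long offset;        /* byte offset of faulting page   */
> };
>
> struct pagein_reply {
>         int           error;         /* 0 on success                   */
>         unsigned long paddr;         /* physical address of the page,  */
>                                      /* pinned (e.g. mlock'd) by proxy */
> };
>
> int send_to_proxy(int owner_oslet, const struct pagein_request *req,
>                   struct pagein_reply *rep)
> {
>         (void)owner_oslet;
>         (void)req;
>         rep->error = -1;             /* no transport in this sketch    */
>         rep->paddr = 0;
>         return -1;
> }
>
> /* Client side: resolve a fault on a page owned by another OSlet. */
> int remote_pagein(int my_oslet, int owner_oslet, unsigned long file_id,
>                   unsigned long offset, unsigned long *paddr)
> {
>         struct pagein_request req = {
>                 .client_oslet = my_oslet,
>                 .file_id      = file_id,
>                 .offset       = offset,
>         };
>         struct pagein_reply rep;
>
>         if (send_to_proxy(owner_oslet, &req, &rep) != 0 || rep.error != 0)
>                 return -1;
>         *paddr = rep.paddr;          /* caller maps this in the client */
>         return 0;
> }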
>
> Issues:
>
> o How to coordinate pageout? Two approaches:
>
> 1. Use mlock in the proxy process so that
> only the client process can do the pageout.
>
> 2. Make the two OSlets coordinate their
> pageouts. This is more complex, but will
> be required in some form or another to
> prevent OSlets from "ganging up" on one
> of their number, exhausting its memory.
>
> o When OSlet A ejects the memory from its working
> set, where does it put it?
>
> 1. Throw it away, and go to the proxy process
> as needed to get it back.
>
> 2. Augment core VM as needed to track the
> "guest" memory. This may be needed for
> performance, but...
>
> o Some code is required in the pagein() path to
> figure out that the proxy must be used.
>
> 1. Larry stated that he is willing to be
> punched in the nose to get this code in. ;-)
> The amount of this code is minimized by
> creating SMP-clusters-specific filesystems,
> which have their own functions for mapping
> and releasing pages. (Does this really
> cover OSlet A's paging out of this memory?)
>
> o How are pagein()s going to be even halfway fast
> if IPC to the proxy is involved?
>
> 1. Just do it. Page faults should not be
> all that frequent with today's memory
> sizes. (But then why do we care so
> much about page-fault performance???)
>
> 2. Use "doors" (from Sun), which are very
> similar to protected procedure call
> (from K42/Tornado/Hurricane). The idea
> is that the CPU in OSlet A that is handling
> the page fault temporarily -becomes- a
> member of OSlet B by using OSlet B's page
> tables for the duration. This results in
> some interesting issues:
>
> a. What happens if a process wants to
> block while "doored"? Does it
> switch back to being an OSlet A
> process?
>
> b. What happens if a process takes an
> interrupt (which corresponds to
> OSlet A) while doored (thus using
> OSlet B's page tables)?
>
> i. Prevent this by disabling
> interrupts while doored.
> This could pose problems
> with relatively long VM
> code paths.
>
> ii. Switch back to OSlet A's
> page tables upon interrupt,
> and switch back to OSlet B's
> page tables upon return
> from interrupt. On machines
> not supporting ASID, take a
> TLB-flush hit in both
> directions. Also likely
> requires common text (at
> least for low-level interrupts)
> for all OSlets, making it more
> difficult to support OSlets
> running different versions of
> the OS.
>
> Furthermore, the last time
> that Paul suggested adding
> instructions to the interrupt
> path, several people politely
> informed him that this would
> require a nose punching. ;-)
>
> c. If a bunch of OSlets simultaneously
> decide to invoke their proxies on
> a particular OSlet, that OSlet gets
> lock contention corresponding to
> the number of CPUs on the system
> rather than to the number in a
> single OSlet. Some approaches to
> handle this:
>
> i. Stripe -everything-, rely
> on entropy to save you.
> May still have problems with
> hotspots (e.g., which of the
> OSlets has the root of the
> root filesystem?).
>
> ii. Use some sort of queued lock
> to limit the number of CPUs that
> can be running proxy processes
> in a given OSlet. This does
> not really help scaling, but
> would make the contention
> less destructive to the
> victim OSlet.
>
> o How to balance memory usage across the OSlets?
>
> 1. Don't bother, let paging deal with it.
> Paul's previous experience with this
> philosophy was not encouraging. (You
> can end up with one OSlet thrashing
> due to the memory load placed on it by
> other OSlets, which don't see any
> memory pressure.)
>
> 2. Use some global memory-pressure scheme
> to even things out. Seems possible, but
> Paul is concerned about the complexity
> of this approach. If this approach is
> taken, make sure someone with some
> control-theory experience is involved.
>
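> To make the "door" idea above slightly more concrete, here is a
> toy model of the call sequence. The structures stand in for real
> page tables and per-CPU state, and none of this addresses the
> interrupt and blocking questions raised under (a) and (b):
>
> /* Toy model of a door/protected-procedure call: the calling CPU
>  * temporarily adopts the target OSlet's address space, runs a
>  * handler there, and then restores its own. */
>
> struct oslet_mm {
>         int oslet_id;            /* stands in for page tables / ASID */
> };
>
> struct cpu_state {
>         struct oslet_mm *cur_mm; /* address space now in use         */
> };
>
> /* Stand-in for loading another OSlet's page tables. */
> void switch_mm(struct cpu_state *cpu, struct oslet_mm *mm)
> {
>         cpu->cur_mm = mm;        /* real code: load CR3/ASID, maybe  */
>                                  /* flush the TLB                    */
> }
>
> /* Invoke handler(arg) "inside" the target OSlet, door-style. */
> int door_call(struct cpu_state *cpu, struct oslet_mm *target,
>               int (*handler)(void *), void *arg)
> {
>         struct oslet_mm *saved = cpu->cur_mm;
>         int ret;
>
>         /* Real code would disable interrupts here, or arrange to
>          * switch back to 'saved' on every interrupt entry (issue b). */
>         switch_mm(cpu, target);
>         ret = handler(arg);      /* runs over OSlet B's data          */
>         switch_mm(cpu, saved);
>
>         return ret;
> }
>
> (The queued-lock idea in (c)(ii) would then amount to bounding how
> many CPUs may be inside door_call() on a given target OSlet at once.)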
>
> Manipulation of Processes Running on Some Other OSlet.
>
> The general idea here is to implement something similar
> to a vproc layer. This is common code, and thus requires
> someone to sacrifice their nose. There was some discussion
> of other things that this would be useful for, but I have
> lost them.
>
> Manipulations discussed included signals and job control.
>
> Issues:
>
> o Should process information be replicated across
> the OSlets for performance reasons? If so, how
> much, and how would it be synchronized?
>
> 1. No, just use doors. See above discussion.
>
> 2. Yes. No discussion of synchronization
> methods. (Hey, we had to leave -something-
> for later!)
>
> Access to Devices Owned by Some Other OSlet
>
> Larry mentioned a /rdev, but if we discussed any details
> of this, I have lost them. Presumably, one would use some
> sort of IPC or doors to make this work.
>
> Reception of Network Packets Intended for Some Other OSlet.
>
> An OSlet receives a packet, and realizes that it is
> destined for a process running in some other OSlet.
> How is this handled without rewriting most of the
> networking stack?
>
> The general approach was to add a NAT-like layer that
> inspected the packet and determined which OSlet it was
> destined for. The packet was then forwarded to the
> correct OSlet, and subjected to full IP-stack processing.
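>
> The NAT-like layer could be little more than a lookup from
> (local address, local port) to owning OSlet, roughly as sketched
> below (illustrative only; all names are invented):
>
> /* Illustration of the NAT-like demultiplexing step: look at the
>  * destination address/port of an incoming packet and decide which
>  * OSlet's stack should process it.  Real code would hook in near
>  * the driver, before the bulk of IP processing. */
> #include <stdint.h>
>
> #define MY_OSLET 0               /* this OSlet's number, for the sketch */
>
> struct sock_binding {
>         uint32_t daddr;          /* local IP address, network order     */
>         uint16_t dport;          /* local TCP/UDP port, network order   */
>         int      owner_oslet;    /* OSlet whose process owns the socket */
> };
>
> /* Filled in as sockets are bound; read-mostly, shared among OSlets. */
> struct sock_binding bindings[1024];
> int nr_bindings;
>
> /* Returns the OSlet that should get this packet. */
> int oslet_for_packet(uint32_t daddr, uint16_t dport)
> {
>         int i;
>
>         for (i = 0; i < nr_bindings; i++)
>                 if (bindings[i].daddr == daddr &&
>                     bindings[i].dport == dport)
>                         return bindings[i].owner_oslet;
>
>         return MY_OSLET;         /* unknown: receiving OSlet handles it */
> }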
>
> Issues:
>
> o If the address map in the kernel is not to be
> manipulated on each packet reception, there
> needs to be a circular buffer in each OSlet for
> each of the other OSlets (again, N*(N-1) buffers).
> In order to prevent the buffer from needing to
> be exceedingly large, packets must be bcopy()ed
> into this buffer by the OSlet that received
> the packet, and then bcopy()ed out by the OSlet
> containing the target process. This could add
> a fair amount of overhead. (A sketch of such
> a ring appears after this list of issues.)
>
> 1. Just accept the overhead. Rely on this
> being an uncommon case (see the next issue).
>
> 2. Come up with some other approach, possibly
> involving the user address space of the
> proxy process. We could not articulate
> such an approach, but it was late and we
> were tired.
>
> o If there are two processes that share the FD
> on which the packet could be received, and these
> two processes are in two different OSlets, and
> neither is in the OSlet that received the packet,
> what the heck do you do???
>
> 1. Prevent this from happening by refusing
> to allow processes holding a TCP connection
> open to move to another OSlet. This could
> result in load-balance problems in some
> workloads, though neither Paul nor Ted was
> able to come up with a good example on the
> spot (seeing as BAAN has not been doing really
> well of late).
>
> To indulge in l'esprit d'escalier... How
> about a timesharing system that users
> access from the network? A single user
> would have to log on twice to run a job
> that consumed more than one OSlet if each
> process in the job might legitimately need
> access to stdin.
>
> 2. Do all protocol processing on the OSlet
> on which the packet was received, and
> straighten things out when delivering
> the packet data to the receiving process.
> This likely requires changes to common
> code, hence someone to volunteer their nose.
>
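> Here is one possible shape for the per-OSlet-pair packet ring
> mentioned under the first issue above (again, purely illustrative):
>
> /* A fixed array of slots, filled (bcopy'd into) by the OSlet that
>  * took the interrupt and drained (bcopy'd out) by the OSlet owning
>  * the target process.  One producer and one consumer per ring, so
>  * plain head/tail indices suffice; a real version would also need
>  * memory barriers and a wakeup mechanism. */
> #include <string.h>
>
> #define RING_SLOTS  256
> #define SLOT_BYTES 2048          /* big enough for an Ethernet frame */
>
> struct pkt_slot {
>         unsigned int  len;
>         unsigned char data[SLOT_BYTES];
> };
>
> struct pkt_ring {
>         unsigned int    head;    /* next slot the producer will fill  */
>         unsigned int    tail;    /* next slot the consumer will drain */
>         struct pkt_slot slot[RING_SLOTS];
> };
>
> /* Producer side (OSlet that received the packet).  Returns 0 if the
>  * packet was queued, -1 if the ring is full or the packet too big. */
> int ring_put(struct pkt_ring *r, const void *pkt, unsigned int len)
> {
>         unsigned int next = (r->head + 1) % RING_SLOTS;
>
>         if (next == r->tail || len > SLOT_BYTES)
>                 return -1;
>         memcpy(r->slot[r->head].data, pkt, len);
>         r->slot[r->head].len = len;
>         r->head = next;
>         return 0;
> }
>
> /* Consumer side (OSlet owning the destination process). */
> int ring_get(struct pkt_ring *r, void *pkt, unsigned int *len)
> {
>         if (r->tail == r->head)
>                 return -1;       /* empty */
>         memcpy(pkt, r->slot[r->tail].data, r->slot[r->tail].len);
>         *len = r->slot[r->tail].len;
>         r->tail = (r->tail + 1) % RING_SLOTS;
>         return 0;
> }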
>
> SysV msgq and sema Operations
>
> We didn't discuss these. None of us seem to be SysV fans,
> but these must be made to work regardless.
>
> Larry says that shm should be implemented in terms of mmap(),
> so that this case reduces to page-mapping discussed above.
> Of course, one would need a filesystem large enough to handle
> the largest possible shmget. Paul supposes that one could
> dynamically create a memory filesystem to avoid problems here,
> but is in no way volunteering his nose to this cause.
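>
> Larry's point that shm reduces to mmap() can be illustrated with
> ordinary userspace calls: a file on a memory filesystem plus a
> MAP_SHARED mapping gives the flavour of shmget()/shmat(). (The
> function below is only an analogy, not a proposed interface.)
>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <sys/types.h>
> #include <unistd.h>
>
> void *pseudo_shmget(size_t size)
> {
>         char path[] = "/tmp/shmXXXXXX"; /* ideally a tmpfs mount      */
>         int fd = mkstemp(path);
>         void *p;
>
>         if (fd < 0)
>                 return NULL;
>         unlink(path);            /* segment lives as long as mappings */
>         if (ftruncate(fd, (off_t)size) < 0) {
>                 close(fd);
>                 return NULL;
>         }
>         p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>         close(fd);               /* the mapping keeps the pages alive */
>         return p == MAP_FAILED ? NULL : p;
> }
>
> An SMP-clusters shmget() could do the equivalent internally, with
> the backing file living on whichever OSlet owns the segment.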
>
>
> Access to Filesystems Owned by Some Other OSlet.
>
> For the most part, this reduces to the mmap case. However,
> partitioning popular filesystems over the OSlets could be
> very helpful. Larry mentioned that this had been prototyped.
> Paul cannot remember if Larry promised to send papers or
> other documentation, but duly requests them after the fact.
>
> Larry suggests having a local /tmp, so that /tmp is in effect
> private to each OSlet. There would be a /gtmp that would
> be a globally visible /tmp equivalent. We went round and
> round on software compatibility, Paul suggesting a hashed
> filesystem as an alternative. Larry eventually pointed out
> that one could just issue different mount commands to get
> a global filesystem in /tmp, and create a per-OSlet /ltmp.
> This would allow people to determine their own level of
> risk/performance.
>
>
> Pipes Connecting Processes in Different OSlets.
>
> This was mentioned, but I have forgotten the details.
> My vague recollections lead me to believe that some
> nose-punching was required, but I must defer to Larry
> and Ted.
>
> Ditto for Unix-domain sockets.
>
>
> Creation of Processes on a Different OSlet Than Their Parent.
>
> There would be an inherited attribute that would prevent
> fork() or exec() from creating its child on a different
> OSlet. This attribute would be set by default to prevent
> too many surprises. Things like make(1) would clear
> this attribute to allow amazingly fast kernel builds.
>
> There would also be a system call that would cause the
> child to be placed on a specified OSlet (Paul suggested
> use of HP's "launch policy" concept to avoid adding yet
> another dimension to the exec() combinatorial explosion).
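>
> From userspace, the inherited attribute and the placement call
> might look roughly like this; all names are hypothetical, since
> nothing has been designed, let alone implemented:
>
> /* Hypothetical view of the placement controls discussed above: a
>  * per-process "stay on my OSlet" attribute that fork() would
>  * inherit, plus a way to place a child explicitly.  Made-up names,
>  * not proposed syscalls. */
>
> #define OSLET_ANY (-1)           /* let the kernel pick an OSlet      */
>
> struct task_placement {
>         int keep_on_oslet;       /* default 1: children stay local    */
>         int home_oslet;          /* OSlet this process is running on  */
> };
>
> /* make(1)-style usage: allow children to spread across OSlets. */
> void allow_remote_children(struct task_placement *tp)
> {
>         tp->keep_on_oslet = 0;
> }
>
> /* Decide where a new child should run.  'requested' would come from
>  * an explicit placement call (cf. HP's launch-policy idea);
>  * OSLET_ANY means "no preference". */
> int choose_child_oslet(const struct task_placement *parent, int requested)
> {
>         if (parent->keep_on_oslet)
>                 return parent->home_oslet; /* default: no surprises   */
>         if (requested != OSLET_ANY)
>                 return requested;
>         return OSLET_ANY;                  /* kernel load-balances    */
> }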
>
> The discussion of packet reception led Larry to suggest
> that cross-OSlet process creation would be prohibited if
> the parent and child shared a socket. See above for the
> load-balancing concern and corresponding l'esprit d'escalier.
>
>
> Processing of exit()/wait() Pairs Crossing OSlet Boundaries
>
> We didn't discuss this. My guess is that vproc deals
> with it. Some care is required when optimizing for this.
> If one hands off to a remote parent that dies before
> doing a wait(), one would not want one of the init
> processes getting a nasty surprise.
>
> (Yes, there are separate init processes for each OSlet.
> We did not talk about implications of this, which might
> occur if one were to need to send a signal intended to
> be received by all the replicated processes.)
>
>
> Other Desiderata:
>
> 1. Ability of surviving OSlets to continue running after one of their
> number fails.
>
> Paul was quite skeptical of this. Larry suggested that the
> "door" mechanism could use a dynamic-linking strategy. Paul
> remained skeptical. ;-)
>
> 2. Ability to run different versions of the OS on different OSlets.
>
> Some discussion of this above.
>
>
> The Score.
>
> Paul agreed that SMP Clusters could be implemented. He was not
> sure that it could achieve good performance, but could not prove
> otherwise. Although he suspected that the complexity might be
> less than that of the proprietary highly parallel Unixes, he was not
> convinced that it would be less than that of Linux itself, given the
> Linux community's emphasis on simplicity in addition to performance.
>
> --
> ---
> Larry McVoy              lm at bitmover.com       http://www.bitmover.com/lm