2004-03-18 08:05:42

by Ulrich Drepper

Subject: sched_setaffinity usability

The sched_setaffinity syscall currently has a usability problem. The
size of cpumask_t is not visible outside the kernel and might change
from kernel to kernel. So, if the user passes a large CPU bitset to
the kernel, there is no way to know whether all the bits provided in
the bitmap are used. The kernel simply copies the first bytes, enough
to fill in the cpumask_t object, and ignores the rest.

Simply rejecting a too-large bitset is not good: programs which are
portable (to different kernels) and future-safe should use large bitmap
sizes. Instead, the user should only be notified about the size problem
if a set bit is actually ignored.

Doing this check in the kernel isn't good, since it would require
copying the whole bitmap into the kernel address space. So do it at
userlevel.

But how? The userlevel code does not know the size of the type the
kernel used. In the getaffinity call this is handled nicely: the
syscall returns the size of the type.

I think we should do the same for setaffinity. Something like this:

--- kernel/sched.c	2004-03-16 20:57:25.000000000 -0800
+++ kernel/sched.c-new	2004-03-17 23:52:25.000000000 -0800
@@ -2328,6 +2328,8 @@ asmlinkage long sys_sched_setaffinity(pi
 		goto out_unlock;
 
 	retval = set_cpus_allowed(p, new_mask);
+	if (retval == 0)
+		retval = sizeof(new_mask);
 
 out_unlock:
 	put_task_struct(p);


The userlevel code could then check whether the remaining words in the
bitset contain any set bits.

The interface change is limited to the kernel only. We can arrange for
the sched_setaffinity/pthread_setaffinity calls to still return zero in
case of success (in fact, that's the desirable solution). Additionally,
we could hardcode a size for the case when the syscall returns zero to
handle old kernels.


Is this acceptable?

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


2004-03-18 08:12:56

by Tim Hockin

Subject: Re: sched_setaffinity usability

On Thu, Mar 18, 2004 at 12:05:22AM -0800, Ulrich Drepper wrote:
> kernel used. In the getaffinity call this is handled nicely: the
> syscall returns the size of the type.

could you just call getaffinity first?

2004-03-18 08:22:43

by Ulrich Drepper

Subject: Re: sched_setaffinity usability

Tim Hockin wrote:

> could you just call getaffinity first?

That means copying quite a lot of memory.

On the bright side, this would allow us to have a well-defined error
case which guarantees the affinity mask is not changed if the request
cannot be honored completely. Still, on some systems (and their number
is going to grow in the future) it means spending a lot of effort on an
error case which is never really hit. And the affinity calls have to
take a lock, possibly serializing code quite a bit.


2004-03-18 08:47:31

by Ulrich Drepper

Subject: Re: sched_setaffinity usability

Replying to my own post.

Forget the proposed patch. If we ever want to get the new interfaces
standardized, saying the affinity mask is undefined after a failed
setaffinity call is not acceptable.

So I'm going to hardcode the assumption that the value returned by a
successful getaffinity call will never change, regardless of how many
hotplug CPUs are added. Can this be guaranteed?


2004-03-18 09:48:38

by Andrew Morton

Subject: Re: sched_setaffinity usability

Ulrich Drepper <[email protected]> wrote:
>
> The sched_setaffinity syscall currently has a usability problem. The
> size of cpumask_t is not visible outside the kernel and might change
> from kernel to kernel. So, if the user uses a large CPU bitset and
> passes it to the kernel it is not known at all whether all the bits
> provided in the bitmap are used. The kernel simply copies the first
> bytes, enough to fill in the cpumask_t object and ignores the rest.
>
> A simple check for a too large bitset is not good. Programs which are
> portable (to different kernels) and future safe should use large bitmap
> sizes. Instead the user should only be notified about the size problem
> if any nonzero bit is ignored.

Perhaps the syscall itself should go look for set bits which are beyond the
current number of physical CPUs and fail the syscall if any are found.

Like this, if it was tested:


diff -puN kernel/sched.c~a kernel/sched.c
--- 25/kernel/sched.c~a	2004-03-18 01:25:29.697217008 -0800
+++ 25-akpm/kernel/sched.c	2004-03-18 01:42:32.312755816 -0800
@@ -2736,13 +2736,39 @@ asmlinkage long sys_sched_setaffinity(pi
 	cpumask_t new_mask;
 	int retval;
 	task_t *p;
+	int remainder;
+	unsigned long __user *up;
 
 	if (len < sizeof(new_mask))
 		return -EINVAL;
 
+	/* Avoid spending stupid amounts of time in the kernel */
+	if (len > 16384)
+		return -EINVAL;
+
 	if (copy_from_user(&new_mask, user_mask_ptr, sizeof(new_mask)))
 		return -EFAULT;
 
+	/*
+	 * Check that the user hasn't asked for any impossible CPUs outside
+	 * sizeof(cpumask_t).  set_cpus_allowed() will check for impossible
+	 * cpus inside sizeof(cpumask_t).
+	 */
+	remainder = len - sizeof(new_mask);	/* bytes */
+	up = user_mask_ptr + 1;
+	while (remainder > 0) {
+		unsigned long u;
+		int nr_bits;
+
+		if (get_user(u, up))
+			return -EFAULT;
+		nr_bits = min((int)sizeof(u), remainder);
+		if (find_next_bit(&u, nr_bits, 0) >= nr_bits)
+			return -EINVAL;
+		remainder -= sizeof(u);
+		up++;
+	}
+
 	read_lock(&tasklist_lock);
 
 	p = find_process_by_pid(pid);

_

2004-03-18 10:11:02

by Andrew Morton

Subject: Re: sched_setaffinity usability

Andrew Morton <[email protected]> wrote:
>
> + if (find_next_bit(&u, nr_bits, 0) >= nr_bits)

Make that:

> + if (find_next_bit(&u, nr_bits, 0) < nr_bits)

2004-03-18 11:28:27

by Ingo Molnar

Subject: Re: sched_setaffinity usability


* Ulrich Drepper <[email protected]> wrote:

> The sched_setaffinity syscall currently has a usability problem. The
> size of cpumask_t is not visible outside the kernel and might change
> from kernel to kernel. So, if the user uses a large CPU bitset and
> passes it to the kernel it is not known at all whether all the bits
> provided in the bitmap are used. The kernel simply copies the first
> bytes, enough to fill in the cpumask_t object and ignores the rest.
>
> A simple check for a too large bitset is not good. Programs which are
> portable (to different kernels) and future safe should use large
> bitmap sizes. Instead the user should only be notified about the size
> problem if any nonzero bit is ignored.

how about adding a new syscall, sys_sched_get_affinity_span(), which
would be called by glibc the first time one of the affinity syscalls is
called.

This syscall would be a prime target to be optimized away via the
vsyscall DSO, so there's no overhead worry.

something like the attached patch.

or, maybe it would be better to introduce some sort of 'system
constants' syscall that would be a generic umbrella for such things -
and could easily be converted into a vsyscall. Or we could make it part
of the .data section of the VDSO - thus no copying overhead, only one
symbol lookup.

Ingo

--- linux/arch/i386/kernel/entry.S.orig
+++ linux/arch/i386/kernel/entry.S
@@ -908,5 +908,6 @@ ENTRY(sys_call_table)
 	.long sys_utimes
 	.long sys_fadvise64_64
 	.long sys_ni_syscall	/* sys_vserver */
+	.long sys_sched_get_affinity_span
 
 syscall_table_size=(.-sys_call_table)
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -2793,6 +2793,17 @@ out_unlock:
 }
 
 /**
+ * sys_sched_get_affinity_span - get the cpu affinity mask size possible
+ *
+ * returns the size of cpumask_t. Note that this is a hard upper limit
+ * for the # of CPUs.
+ */
+asmlinkage long sys_sched_get_affinity_span(void)
+{
+	return sizeof(cpumask_t);
+}
+
+/**
  * sys_sched_yield - yield the current processor to other threads.
  *
  * this function yields the current CPU by moving the calling thread
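On the glibc side, the result of such a syscall would presumably be queried once and cached. A minimal sketch of that caching, with the syscall stubbed out and the old-kernel fallback size being an assumption of this sketch, not anything specified in the thread:

```c
#include <stddef.h>

/* Stub standing in for syscall(__NR_sched_get_affinity_span);
 * pretend the kernel's cpumask_t holds 1024 bits (128 bytes). */
static long fake_get_affinity_span(void)
{
    return 1024 / 8;
}

static long cached_span;  /* 0 = not yet queried */

/* Query the kernel's affinity-mask size once, then serve it from the
 * cache.  A failing return (old kernel without the syscall) falls back
 * to a hardcoded size - purely an illustrative choice here. */
static long kernel_mask_size(void)
{
    if (cached_span == 0) {
        long r = fake_get_affinity_span();
        cached_span = (r > 0) ? r : 32;  /* 32 bytes = 256 bits */
    }
    return cached_span;
}
```

Since the value is constant for the lifetime of the kernel, the cache makes the cost of the extra syscall a one-time affair per process.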

2004-03-18 12:07:23

by Christoph Hellwig

Subject: Re: sched_setaffinity usability

On Thu, Mar 18, 2004 at 12:29:13PM +0100, Ingo Molnar wrote:
> or, maybe it would be better to introduce some sort of 'system
> constants' syscall that would be a generic umbrella for such things -
> and could easily be converted into a vsyscall. Or we could make it part
> of the .data section of the VDSO - thus no copying overhead, only one
> symbol lookup.

Like, umm, the long overdue sysconf()? For the time being a sysctl might
be the easiest thing..

2004-03-18 12:32:27

by Ingo Molnar

Subject: Re: sched_setaffinity usability


* Christoph Hellwig <[email protected]> wrote:

> > or, maybe it would be better to introduce some sort of 'system
> > constants' syscall that would be a generic umbrella for such things -
> > and could easily be converted into a vsyscall. Or we could make it part
> > of the .data section of the VDSO - thus no copying overhead, only one
> > symbol lookup.
>
> Like, umm, the long overdue sysconf()? For the time being a sysctl
> might be the easiest thing..

i think we want to kill several birds with a single stone, and just make
it part of the VDSO - along with the parameters visible via uname().
This would cut another extra syscall, and data copying.

i'm wondering how dangerous of an API idea it is to make these
parameters part of the VDSO .data section (and make it/them versioned
DSO symbols).

The only minor complication wrt. uname() would be sethostname: other
CPUs could observe a transitional state of (the VDSO-equivalent of)
system_utsname.nodename. Is this a problem? It's not like systems call
sethostname all that often ...

Ingo

2004-03-18 15:56:17

by Linus Torvalds

Subject: Re: sched_setaffinity usability



On Thu, 18 Mar 2004, Christoph Hellwig wrote:
>
> Like, umm, the long overdue sysconf()? For the time being a sysctl might
> be the easiest thing..

"sysconf" has not been "long-overdue". It's just that glibc (after
years of pleading) hasn't fixed it.

sysconf() MUST NOT be done in kernel space. A lot of the sysconf() options
are pure user space stuff that the kernel has no idea about. Take a quick
look at some of those things, and realize that there are things like
"_SC_EXPR_NEST_MAX" etc that are _most_ of the values. And the kernel is
simply not involved in any of this.

So I will tell this one more time (and I bet I'll have to repeat myself
again in a year or two, and I bet I'll be ignored then too).

sysconf() is a user-level implementation issue, and so is something like
"number of CPU's". Damn, the simplest way to do it is as a environment
variable, for christ sake! Just make a magic environment variable called
__SC_ARRAY, and make it be some kind of binary encoding if you worry about
performance.

Or make a "/etc/sysconf/array" file, and just map it and look up the
values there.
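That lookup could be as simple as mapping a file of binary longs and indexing it. A sketch under the assumption that the file is laid out as a flat array of long values indexed by the _SC_* constant - the path, names, and layout are all illustrative, not an existing format:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static const long *sc_array;
static size_t sc_count;

/* Map a file containing an array of binary longs, read-only. */
static int sc_load(const char *path)
{
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    sc_array = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (sc_array == MAP_FAILED) {
        sc_array = NULL;
        return -1;
    }
    sc_count = st.st_size / sizeof(long);
    return 0;
}

/* Look up one value; -1 for unknown indices. */
static long sc_get(size_t idx)
{
    return (sc_array && idx < sc_count) ? sc_array[idx] : -1;
}
```

After the initial map, each lookup is a plain array index with no syscall at all, which is the performance property being argued for.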

Please don't raise this issue again.

Linus

2004-03-18 18:23:31

by Ingo Molnar

Subject: Re: sched_setaffinity usability


* Linus Torvalds <[email protected]> wrote:

> sysconf() is a user-level implementation issue, and so is something
> like "number of CPU's". Damn, the simplest way to do it is as a
> environment variable, for christ sake! Just make a magic environment
> variable called __SC_ARRAY, and make it be some kind of binary
> encoding if you worry about performance.

i am not arguing for any sysconf() support at all - it clearly belongs
into glibc. Just doing 'man sysconf' shows that it should be in
user-space. No argument about that.

But how about the original issue Ulrich raised: how does user-space
figure out the NR_CPUS value supported by the kernel? (not the current #
of CPUs, that can be figured out using /proc/cpuinfo)

one solution would be what you suggest: to build some sort of /etc/info
file that glibc can access, a file which is built during the kernel
build and contains the necessary constants. One problem with this
approach is that a user could boot any arbitrary kernel - how does glibc
(or even a supposed early-init info-setup mechanism) know which info
belongs to which kernel? Kernel version numbers are not required to be
unique, and a single non-modular bzImage can be used to have a fully
working userspace. Right now the kernel and glibc are pretty much
isolated, and this gives us flexibility.

an environment variable is a similar solution and has the same problem:
it has to be generated somehow from the kernel's info, just like the
info file. As such it breaks the single-image concept, and the kernel
image and the 'metadata' can get detached.

but there is, i believe, a clean solution: a convenient object that we
already generate during kernel builds - the VDSO. It's mapped by the
kernel during execve() [or, in current kernels, is inherited via the
kernel mappings], and thus becomes a pretty fast (zero-copy) method of
having a kernel-specific user-space dynamic object. It's structured in
the most convenient (and thus fastest and most portable) way for
glibc's purposes. ld.so recognizes and uses it. It cannot be
misconfigured by the user; it comes with the kernel. It is included in
a single bzImage kernel just as much as in a kernel rpm.

Right now the VDSO mostly contains code and exception-handling data, but
it could contain real, userspace-visible data just as much: info that is
only known during the kernel build. There's basically no cost in adding
more fields to the VDSO, and it seems to be superior to any of the other
approaches. Is there any reason not to do it?

Ingo

2004-03-18 18:39:18

by Andrew Morton

Subject: Re: sched_setaffinity usability

Ingo Molnar <[email protected]> wrote:
>
> Right now the VDSO mostly contains code and exception-handling data, but
> it could contain real, userspace-visible data just as much: info that is
> only known during the kernel build. There's basically no cost in adding
> more fields to the VDSO, and it seems to be superior to any of the other
> approaches. Is there any reason not to do it?

It's x86-specific?

2004-03-18 18:41:45

by Ingo Molnar

Subject: Re: sched_setaffinity usability


* Andrew Morton <[email protected]> wrote:

> > Right now the VDSO mostly contains code and exception-handling data, but
> > it could contain real, userspace-visible data just as much: info that is
> > only known during the kernel build. There's basically no cost in adding
> > more fields to the VDSO, and it seems to be superior to any of the other
> > approaches. Is there any reason not to do it?
>
> It's x86-specific?

x86-64 has a VDSO page as well, and it can be implemented on any
architecture that wants to accelerate syscalls in user-space (and/or
wants to provide alternate methods of system-entry).

and a non-existent VDSO is something glibc handles already.

Ingo

2004-03-18 18:55:10

by Ingo Molnar

Subject: Re: sched_setaffinity usability


* Ingo Molnar <[email protected]> wrote:

> x86-64 has a VDSO page as well, [...]

hm, i'm not sure this is the case. It does have a vsyscall page but
doesn't fill out AT_SYSINFO. ia64 seems to have something like a vdso,
passed down via AT_SYSINFO.

We could introduce AT_VDSO to standardize this, switch x86 from
AT_SYSINFO to AT_VDSO, and do the same for all architectures that
implement the VDSO. I.e. something like the draft patch below. (This
would also decrease the size of the auxiliary table - glibc doesn't
need to know the exception header address.)

(this could also put an end to the AT_ bloat - eg. we could even get rid
of some of the constant AT_ values in the future: AT_PAGESZ, AT_CLKTCK,
AT_HWCAP, AT_PHENT and AT_FLAGS, and move them into the VDSO?)

Ingo

--- linux/include/linux/elf.h.orig
+++ linux/include/linux/elf.h
@@ -162,6 +162,7 @@ typedef __s64 Elf64_Sxword;
 #define AT_PLATFORM 15 /* string identifying CPU for optimizations */
 #define AT_HWCAP 16    /* arch dependent hints at CPU capabilities */
 #define AT_CLKTCK 17   /* frequency at which times() increments */
+#define AT_VDSO 18     /* VDSO address */
 
 #define AT_SECURE 23   /* secure mode boolean */
 
--- linux/include/asm-i386/elf.h.orig
+++ linux/include/asm-i386/elf.h
@@ -132,10 +132,9 @@ extern int dump_task_extended_fpu (struc
 #define VSYSCALL_ENTRY ((unsigned long) &__kernel_vsyscall)
 extern void __kernel_vsyscall;
 
-#define ARCH_DLINFO					\
-do {							\
-	NEW_AUX_ENT(AT_SYSINFO, VSYSCALL_ENTRY);	\
-	NEW_AUX_ENT(AT_SYSINFO_EHDR, VSYSCALL_BASE);	\
+#define ARCH_DLINFO					\
+do {							\
+	NEW_AUX_ENT(AT_VDSO, VSYSCALL_BASE);		\
 } while (0)
 
 /*

2004-03-18 20:00:31

by Andrea Arcangeli

Subject: Re: sched_setaffinity usability

On Thu, Mar 18, 2004 at 07:39:44PM +0100, Ingo Molnar wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
> > > Right now the VDSO mostly contains code and exception-handling data, but
> > > it could contain real, userspace-visible data just as much: info that is
> > > only known during the kernel build. There's basically no cost in adding
> > > more fields to the VDSO, and it seems to be superior to any of the other
> > > approaches. Is there any reason not to do it?
> >
> > It's x86-specific?
>
> x86-64 has a VDSO page as well, and it can be implemented on any

it doesn't.

> architecture that wants to accelerate syscalls in user-space (and/or

x86-64 is the first arch ever implementing vsyscalls in production with
the fastest possible API.

The API doesn't contemplate the idea of relocating the vsyscall address,
but it can easily be extended with a relocation API.

> wants to provide alternate methods of system-entry).

there's no need for alternate methods of system-entry on x86-64;
luckily Intel adopted the extremely optimized syscall/sysret from AMD
instead of only providing sysenter/sysexit like they do in the 32bit
cpus.

The way x86-64 implements the entry.S code is ultra-optimized: we
don't save a full register frame for syscalls, except for fork and a
few others that need to see all the registers; we have two different
stack frames depending on which syscall is running.

Intel also provides sysenter/sysexit but that's useless on 64bit since
syscall/sysret is the standard in 64bit mode.

2004-03-18 20:28:04

by Ingo Molnar

Subject: Re: sched_setaffinity usability


* Andrea Arcangeli <[email protected]> wrote:

> > architecture that wants to accelerate syscalls in user-space (and/or
>
> x86-64 is the first arch ever implementing vsyscalls in production
> with the fastest possible API.
>
> The API doesn't contemplate the idea of relocating the vsyscall
> address, but it can be extended easily with a relocation API.

you'd end up doing much of what a DSO does. Anyway, what you say does
not conflict with the idea of the VDSO at all. It's only that x86 has
the most complex needs in this area, so it was the first to do a real
DSO.

Ingo

2004-03-18 20:49:22

by David Lang

Subject: Re: sched_setaffinity usability

On Thu, 18 Mar 2004, Ingo Molnar wrote:

> * Linus Torvalds <[email protected]> wrote:
>
> > sysconf() is a user-level implementation issue, and so is something
> > like "number of CPU's". Damn, the simplest way to do it is as a
> > environment variable, for christ sake! Just make a magic environment
> > variable called __SC_ARRAY, and make it be some kind of binary
> > encoding if you worry about performance.
>
> i am not arguing for any sysconf() support at all - it clearly belongs
> into glibc. Just doing 'man sysconf' shows that it should be in
> user-space. No argument about that.
>
> But how about the original issue Ulrich raised: how does user-space
> figure out the NR_CPUS value supported by the kernel? (not the current #
> of CPUs, that can be figured out using /proc/cpuinfo)

Doesn't /proc/config.gz answer this question?

David Lang

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-03-18 21:02:31

by Randy.Dunlap

Subject: Re: sched_setaffinity usability

On Thu, 18 Mar 2004 12:49:12 -0800 (PST) David Lang wrote:

| On Thu, 18 Mar 2004, Ingo Molnar wrote:
|
| > * Linus Torvalds <[email protected]> wrote:
| >
| > > sysconf() is a user-level implementation issue, and so is something
| > > like "number of CPU's". Damn, the simplest way to do it is as a
| > > environment variable, for christ sake! Just make a magic environment
| > > variable called __SC_ARRAY, and make it be some kind of binary
| > > encoding if you worry about performance.
| >
| > i am not arguing for any sysconf() support at all - it clearly belongs
| > into glibc. Just doing 'man sysconf' shows that it should be in
| > user-space. No argument about that.
| >
| > But how about the original issue Ulrich raised: how does user-space
| > figure out the NR_CPUS value supported by the kernel? (not the current #
| > of CPUs, that can be figured out using /proc/cpuinfo)
|
| Doesn't /proc/config.gz answer this question?

I guess it could, but it's another CONFIG option...

--
~Randy

2004-03-18 21:06:07

by Ingo Molnar

Subject: Re: sched_setaffinity usability


* David Lang <[email protected]> wrote:

> Doesn't /proc/config.gz answer this question?

no. /proc as an interface has the same disadvantages as the /etc
approach.

(there was talk about something like /proc/vdso.so - but in this special
case the kernel is much better at mapping the vdso pages: why spend
three syscalls and a pagefault on something that can be done zero-cost?)

99.9% of userspace code is modularized around the concept of ELF DSOs.
They are well-understood and have a history of providing good control of
backwards and forwards compatibility. They are flexible and they don't
really have any baggage that affects performance. A DSO is the ideal
interface to attach the kernel to glibc. Code and constant data can
reside in this DSO just fine. (even non-constant data can reside in the
DSO.) I'd really rather not reinvent the wheel and put yet another
concept of a dynamic shared object into the kernel (and make that
per-platform too).

Ingo

2004-03-18 21:09:07

by Davide Libenzi

Subject: Re: sched_setaffinity usability

On Thu, 18 Mar 2004, Ingo Molnar wrote:

> But how about the original issue Ulrich raised: how does user-space
> figure out the NR_CPUS value supported by the kernel? (not the current #
> of CPUs, that can be figured out using /proc/cpuinfo)

Why not a /proc/something? I mean, doesn't glibc already have to handle in
some way kernels not exporting certain information (in the same way it
does for missing VDSO)?


> Right now the VDSO mostly contains code and exception-handling data, but
> it could contain real, userspace-visible data just as much: info that is
> only known during the kernel build. There's basically no cost in adding
> more fields to the VDSO, and it seems to be superior to any of the other
> approaches. Is there any reason not to do it?

With /proc/something you can have a single piece of code for all archs
that exports NR_CPUS, while the VDSO would have to be added to all the
archs that are missing it. IMO performance is not an issue in getting
NR_CPUS from userspace.


- Davide


2004-03-18 21:23:24

by Andi Kleen

Subject: Re: sched_setaffinity usability

Ingo Molnar <[email protected]> writes:

> * Ingo Molnar <[email protected]> wrote:
>
>> x86-64 has a VDSO page as well, [...]
>
> hm, i'm not sure this is the case. It does have a vsyscall page but
> doesnt fill out AT_SYSINFO. ia64 seems to have something like a vdso,
> passed down via AT_SYSINFO.

Yes, the x86-64 64bit vsyscalls predate all the vDSO work and haven't
been updated. It has a vDSO for 32bit programs though.

I guess it would not be that much work to add it for 64bit too.
I would not be opposed to it if somebody sent me patches.

This means my only objection is that a dwarf2 unwind table written
without the .cfi_* support in the assembler is incredibly ugly and
unmaintainable. I really don't want to have more such ugly tables. I
guess it would be best to force a binutils update for dwarf2
information.

-Andi

2004-03-18 21:45:54

by Ingo Molnar

Subject: Re: sched_setaffinity usability


* Davide Libenzi <[email protected]> wrote:

> > Right now the VDSO mostly contains code and exception-handling data, but
> > it could contain real, userspace-visible data just as much: info that is
> > only known during the kernel build. There's basically no cost in adding
> > more fields to the VDSO, and it seems to be superior to any of the other
> > approaches. Is there any reason not to do it?
>
> With /proc/something you can have a single piece of code for all archs
> that exports NR_CPUS. The VDSO should be added to all missing archs.
> IMO performance is not an issue in getting NR_CPUS from userspace.

you just cannot beat the mapping performance of a near-zero-overhead
(V)DSO. No copying. No syscalls to set it up. No runtime dependencies on
having some filesystem mounted in the right spot. Already existing
framework to handle various API issues. Debuggers know the layout.

glibc could in theory assemble a /etc/vdso.so file at boot time and
open()/mmap()/close() it and then pagefault it in, which would add
roughly 10% to the cost of an exec(). I find it hard to accept that,
when the best way for glibc to access this information is a DSO and the
source of the information is the kernel and only the kernel, glibc
should have to resort to some inferior access method. [not to mention
the practical problem of a readonly or remote /etc, so one would have
to mount ramfs, and mount /proc, to construct /ram/vdso.so. Also,
nothing runtime-critical could then be put into the vdso.]

it could also be in /boot/modules/$ver/vdso.so, but this detaches the
vdso from the kernel, breaking the single-image kernel concept (a
concept which is quite useful). It also forces glibc to do the uname()
syscall to get at the kernel version, in addition to the DSO mapping
syscalls - again an inferior method of accessing this always-needed DSO.

Ingo

2004-03-19 00:09:30

by Paul Jackson

Subject: Re: sched_setaffinity usability

> Or make a "/etc/sysconf/array" file, and just map it and look up the
> values there.

On our [SGI's] Linux 2.4 kernel versions, we have a file called

/var/cpuset/cpu-node-map

that is built by an init script each boot, and lists each CPU, and the
corresponding Node number. Our user level library code reads that file
in as need be (caching the results of parsing it), and provides a
convenient way of asking how many CPUs, how many Memory Nodes, and on
which Node each CPU is located.

As we have worked our way through various hardware configuration
and devfs changes, the init script has changed how it pokes around
to obtain this information.

But the user level stuff that uses this file, via some library calls,
has not seen these problems - and enjoys having a convenient place from
which to obtain this information.

Our above choice of pathname for this file is a botch, but the idea
sounds like an instance of what Linus is suggesting.

The file is formatted with two whitespace separated ascii numbers per
line, a CPU number and the number of the associated Memory Node.
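A reader for that format is only a few lines of C. This is a toy sketch - MAX_CPUS and the function name are made up here, and the real SGI library code is not shown in the thread:

```c
#include <stdio.h>

#define MAX_CPUS 1024

/* Parse the described format - one "cpu node" pair of whitespace-
 * separated ASCII numbers per line - filling node[cpu] with the
 * memory node of each CPU and returning the number of pairs read. */
static int parse_cpu_node_map(FILE *f, int node[MAX_CPUS])
{
    int cpu, nd, count = 0;

    while (fscanf(f, "%d %d", &cpu, &nd) == 2) {
        if (cpu >= 0 && cpu < MAX_CPUS) {
            node[cpu] = nd;
            count++;
        }
    }
    return count;
}
```

The count doubles as the number of CPUs, and a separate pass over node[] yields the number of distinct memory nodes.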

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-03-19 01:37:33

by Davide Libenzi

Subject: Re: sched_setaffinity usability

On Thu, 18 Mar 2004, Ingo Molnar wrote:

>
> * Davide Libenzi <[email protected]> wrote:
>
> > > Right now the VDSO mostly contains code and exception-handling data, but
> > > it could contain real, userspace-visible data just as much: info that is
> > > only known during the kernel build. There's basically no cost in adding
> > > more fields to the VDSO, and it seems to be superior to any of the other
> > > approaches. Is there any reason not to do it?
> >
> > With /proc/something you can have a single piece of code for all archs
> > that exports NR_CPUS. The VDSO should be added to all missing archs.
> > IMO performance is not an issue in getting NR_CPUS from userspace.
>
> you just cannot beat the mapping performance of a near-zero-overhead
> (V)DSO. No copying. No syscalls to set it up. No runtime dependencies on
> having some filesystem mounted in the right spot. Already existing
> framework to handle various API issues. Debuggers know the layout.

Talking about performance for a function that returns NR_CPUS seems a
little out of scope IMO (I'd rule out tight loops calling sysconf,
since the info will be const). My objection was more about looking at
the standard way we currently have to export information toward
userspace/glibc. At this time /proc is a standard supported by all
architectures; the (V)DSO currently is not. If the (V)DSO had been a
standard, we wouldn't be having this conversation ;)



- Davide


2004-03-19 08:06:14

by Ulrich Drepper

Subject: Re: sched_setaffinity usability

Ingo Molnar wrote:

> i'm wondering how dangerous of an API idea it is to make these
> parameters part of the VDSO .data section (and make it/them versioned
> DSO symbols).

Exporting variables is never a good idea. The interface is inflexible:
if the variable's size or layout changes, or if the value needs to be
computed dynamically, this is bad.

Even if this ticks off a certain LT, a sysconf()-like interface is the
most flexible. The results would be cached in libc if the lookup is
likely to happen frequently. The sysconf code in the vdso has all the
flexibility it could ever need. For instance, a query as to how many
processors are online could do some computations or even make syscalls
if necessary. Or it could just return a constant like 1 if this is
known at compile or startup time. I cannot imagine why this isn't
something the kernel people would like: you get full control over the
way the values are computed. The exposed interface is minimal, as
opposed to exporting many individual variables.


> The only minor complication wrt. uname() would be sethostname: other
> CPUs could observe a transitional state of (the VDSO-equivalent of)
> system_utsname.nodename. Is this a problem? It's not like systems call
> sethostname all that often ...

Again, by exporting an interface to access the value you can get all the
control you need. In this case it'd probably be a confstr()-like
interface which is just like sysconf(), but can return strings or
arbitrary data (it gets passed a memory pointer and size).

To implement gethostname() without races, store the hostname as

host name MAGIC

The read function can first read MAGIC, read barrier, then read the host
name, read barrier, then read MAGIC again. If MAGIC changed, rinse and
repeat. Doing this from libc would mean hardcoding all the processor
idiosyncrasies in libc. It would have to be generic enough to cover all
versions of the CPU (maybe some need the MAGIC value to be specially
aligned) and then dynamically decide what version to use. In the vdso
the kernel can decide at boot time which functions to use, since it
knows at that time what CPUs are in use.
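The retry scheme described above is essentially a seqlock. A single-process model of it (all names here are illustrative, and the GCC __sync_synchronize() builtin stands in for whatever barriers the kernel would pick at boot):

```c
#include <string.h>

#define HOST_MAX 64

static char shared_name[HOST_MAX];
static volatile unsigned long magic;   /* bumped twice per update */

/* Writer: make magic odd while the update is in progress, even when
 * it is complete, with barriers between counter and data accesses. */
static void write_hostname(const char *name)
{
    magic++;                           /* odd: update in progress */
    __sync_synchronize();
    strncpy(shared_name, name, HOST_MAX - 1);
    __sync_synchronize();
    magic++;                           /* even: update complete */
}

/* Reader: read MAGIC, barrier, copy the name, barrier, read MAGIC
 * again; rinse and repeat if it changed or an update was in flight. */
static void read_hostname(char *out)
{
    unsigned long before, after;

    do {
        before = magic;
        __sync_synchronize();
        strncpy(out, shared_name, HOST_MAX);
        __sync_synchronize();
        after = magic;
    } while (before != after || (before & 1));
}
```

The odd/even trick catches a reader that starts mid-update, which a bare "compare MAGIC before and after" would miss if the writer finished and started again between the two reads.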

Some other syscalls like uname() can be fully implemented in the vdso as
well. The vdso is writable in the kernel so the mapped data can be
updated. In the uname() case, the syscall would be a simple memcpy()
from the place in the vdso into the place designated by the parameter.

Even if it is not possible to implement the entire syscall at userlevel,
maybe just a part can be done in the vdso, in the prologue or epilogue
of the vdso function.


The kernel gets the opportunity to *OPTIONALLY* tweak every little
aspect of the syscall handling if it wants to. All this without having
to change libc and wait for the changes to be widely deployed.


2004-03-19 09:02:31

by Helge Hafting

Subject: Re: sched_setaffinity usability

Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
>
>
>>sysconf() is a user-level implementation issue, and so is something
>>like "number of CPU's". Damn, the simplest way to do it is as a
>>environment variable, for christ sake! Just make a magic environment
>>variable called __SC_ARRAY, and make it be some kind of binary
>>encoding if you worry about performance.
>
>
> i am not arguing for any sysconf() support at all - it clearly belongs
> into glibc. Just doing 'man sysconf' shows that it should be in
> user-space. No argument about that.
>
> But how about the original issue Ulrich raised: how does user-space
> figure out the NR_CPUS value supported by the kernel? (not the current #
> of CPUs, that can be figured out using /proc/cpuinfo)
>
> one solution would be what you suggest: to build some sort of /etc/info
> file that glibc can access, which file is build during the kernel build
> and contains the necessary constants. One problem with this approach is
> that a user could boot via any arbitrary kernel, how does glibc (or even
> a supposed early-init info-setup mechanism) know what info belongs to
> which kernel? Kernel version numbers are not required to be unique. A
> single non-modular bzImage can be used to have a fully working
> userspace. Right now the kernel and glibc is isolated pretty much and
> this gives us flexibility.

Let the kernel build create that info file, then handle it much like a
module, except that it is a "module" without any code. I.e. copy it to
/lib/modules/<kernelversion> when installing modules, or stuff the file
into the initrd when making an initrd.

Now it is in a place specific to the kernel, where a library can find it.

Helge Hafting


2004-03-21 09:50:25

by Ingo Molnar

Subject: Re: sched_setaffinity usability


* Helge Hafting <[email protected]> wrote:

> Let the compile create that info file. Then handle it much like a
> module, except that it is a "module" without any code. I.e. copy it
> to /lib/modules/<kernelversion> if installing modules, or stuff the
> file into the initrd if making an initrd.
>
> Now it is in a place specific to the kernel, where a library can find
> it.

this has a couple of disadvantages:

- the kernel can pre-map the 'file' more cheaply - in fact on x86 it's
zero-cost currently. Mapping a file takes 3 syscalls and at least one
pagefault. Since glibc needs a good portion of this info for
absolutely every ELF binary, why not provide it in a preconstructed
way? x86 is doing it via the VDSO; ia64 and x86-64 are doing it via a
DSO-like mechanism.

- obtaining the kernel version needs one more syscall [uname()].

- the 'metadata' becomes detached from the kernel image, so it
cannot contain 'crucial' data. Testing kernels becomes harder, etc.
(until now i could just send a bzImage to someone to get it tested -
now it would have to include the metadata too.)

- it excludes non-build-time data.

Ingo