2007-11-20 06:54:56

by Ulrich Drepper

Subject: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

This patch adds support for setting the O_NONBLOCK flag of the file
descriptors returned by socket, socketpair, and accept.

socket.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)

--- net/socket.c
+++ net/socket.c
@@ -362,7 +362,7 @@ static int sock_alloc_fd(struct file **filep, int flags)
 	return fd;
 }

-static int sock_attach_fd(struct socket *sock, struct file *file)
+static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
 	struct qstr name = { .name = "" };
@@ -384,7 +384,7 @@ static int sock_attach_fd(struct socket *sock, struct file *file)
 	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,
 		  &socket_file_ops);
 	SOCK_INODE(sock)->i_fop = &socket_file_ops;
-	file->f_flags = O_RDWR;
+	file->f_flags = O_RDWR | (flags & O_NONBLOCK);
 	file->f_pos = 0;
 	file->private_data = sock;

@@ -397,7 +397,7 @@ static int sock_map_fd_flags(struct socket *sock, int flags)
 	int fd = sock_alloc_fd(&newfile, flags);

 	if (likely(fd >= 0)) {
-		int err = sock_attach_fd(sock, newfile);
+		int err = sock_attach_fd(sock, newfile, flags);

 		if (unlikely(err < 0)) {
 			put_filp(newfile);
@@ -1268,12 +1268,14 @@ asmlinkage long sys_socketpair(int family, int type, int protocol,
 		goto out_release_both;
 	}

-	err = sock_attach_fd(sock1, newfile1);
+	err = sock_attach_fd(sock1, newfile1,
+			     INDIRECT_PARAM(file_flags, flags));
 	if (unlikely(err < 0)) {
 		goto out_fd2;
 	}

-	err = sock_attach_fd(sock2, newfile2);
+	err = sock_attach_fd(sock2, newfile2,
+			     INDIRECT_PARAM(file_flags, flags));
 	if (unlikely(err < 0)) {
 		fput(newfile1);
 		goto out_fd1;
@@ -1423,7 +1425,8 @@ asmlinkage long sys_accept(int fd, struct sockaddr __user *upeer_sockaddr,
 		goto out_put;
 	}

-	err = sock_attach_fd(newsock, newfile);
+	err = sock_attach_fd(newsock, newfile,
+			     INDIRECT_PARAM(file_flags, flags));
 	if (err < 0)
 		goto out_fd_simple;
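
The functional core of the patch is the one-line mask applied in sock_attach_fd(). A minimal userspace sketch of that expression follows; the helper name initial_f_flags() is an assumption made for illustration and does not appear in the patch.

```c
#include <fcntl.h>

/*
 * Hypothetical helper (not part of the patch): mirrors the expression
 * O_RDWR | (flags & O_NONBLOCK) from sock_attach_fd(). Only the
 * O_NONBLOCK bit of the caller-supplied flags is let through; every
 * other bit is dropped.
 */
int initial_f_flags(int flags)
{
	return O_RDWR | (flags & O_NONBLOCK);
}
```

Passing O_NONBLOCK | O_APPEND to this helper yields O_RDWR | O_NONBLOCK, so unrelated bits in the indirect parameter cannot leak into f_flags.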


2007-11-20 07:59:54

by David Miller

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

From: Ulrich Drepper <[email protected]>
Date: Tue, 20 Nov 2007 01:53:14 -0500

FWIW, I think this indirect syscall stuff is the most ugly interface
I've ever seen proposed for the kernel.

And I agree with all of the objections raised by both H. Peter Anvin
and Eric Dumazet.

> This patch adds support for setting the O_NONBLOCK flag of the file
> descriptors returned by socket, socketpair, and accept.
...
> - err = sock_attach_fd(sock1, newfile1);
> + err = sock_attach_fd(sock1, newfile1,
> + INDIRECT_PARAM(file_flags, flags));

Where does this INDIRECT_PARAM() macro get defined? I do not
see it being defined anywhere in these patches.

2007-11-20 16:05:54

by Ulrich Drepper

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

David Miller wrote:
> FWIW, I think this indirect syscall stuff is the most ugly interface
> I've ever seen proposed for the kernel.

Well, the alternative is to introduce dozens of new interfaces. It
was Linus who suggested this alternative. Plus, it seems that for
syslets we need basically the same interface anyway.


> And I agree with all of the objections raised by both H. Peter Anvin
> and Eric Dumazet.

Eric had no arguments and HP's comments lack a viable alternative proposal.


> Where does this INDIRECT_PARAM() macro get defined? I do not
> see it being defined anywhere in these patches.

Defined in <linux/indirect.h>:

+#define INDIRECT_PARAM(set, name) current->indirect_params.set.name

Not my idea, I was following one review comment.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFHQwWl2ijCOnn/RHQRAhEbAJ9/bkrb/phOMRl16Fb0N1TDYglSsgCeNhHQ
3huhdKCAVTu4CJnktf/ufy4=
=Jj6h
-----END PGP SIGNATURE-----
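
To make the macro quoted in this message concrete, here is a minimal userspace sketch of how INDIRECT_PARAM(file_flags, flags) expands. The macro text is taken verbatim from the message; the structure layout and the mock "current" pointer are assumptions for illustration only, since the real definitions live in the kernel patch series.

```c
#include <fcntl.h>

/*
 * Mock of the per-task state. Only the existence of a "file_flags" set
 * with a "flags" member is implied by the call sites in the patch; the
 * exact layout here is an assumption.
 */
struct indirect_params {
	struct {
		int flags;
	} file_flags;
};

struct task_struct {
	struct indirect_params indirect_params;
};

/* Userspace stand-in for the kernel's "current" task pointer. */
static struct task_struct mock_task;
static struct task_struct *current = &mock_task;

/* The macro as quoted in the message. */
#define INDIRECT_PARAM(set, name) current->indirect_params.set.name

/* What a call site such as sys_accept() would read from the task. */
int indirect_file_flags(void)
{
	return INDIRECT_PARAM(file_flags, flags);
}

/* Convenience check: did the caller ask for a non-blocking descriptor? */
int wants_nonblock(void)
{
	return (INDIRECT_PARAM(file_flags, flags) & O_NONBLOCK) != 0;
}
```

The macro is plain member access on the current task, so a system call entered the ordinary way simply sees whatever default the field holds (zero in this sketch).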

2007-11-20 17:54:51

by Zach Brown

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

David Miller wrote:
> From: Ulrich Drepper <[email protected]>
> Date: Tue, 20 Nov 2007 01:53:14 -0500
>
> FWIW, I think this indirect syscall stuff is the most ugly interface
> I've ever seen proposed for the kernel.

Well, there's no XML in /proc :) :).

But, yes, I agree that the internal code needs a lot more cleanup before
being considered for merging.

> And I agree with all of the objections raised by both H. Peter Anvin
> and Eric Dumazet.

I'm worried, too. Do we have a stronger alternative? I'm all ears,
this isn't really my area of expertise.

- z

2007-11-20 18:14:53

by H. Peter Anvin

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

Ulrich Drepper wrote:
>
>> And I agree with all of the objections raised by both H. Peter Anvin
>> and Eric Dumazet.
>
> Eric had no arguments and HP's comments lack a viable alternative proposal.
>

That's only because you're being, deliberately or accidentally, vague
about what your actual (as opposed to imagined) requirements are.

The only thing concrete that I have seen is that the limitation to 6
system call arguments is insufficient. This is clearly true, as
evidenced by things like pselect. To which I responded that I'd *much*
rather see a systematized way to handle the system call ABI beyond 6
arguments... the system call interface is a calling convention and
should be treated as such, and the last thing we need is something that
ends up looking like the MS-DOS kernel interface where every call has
its own random convention.

The easy answer, to repeat myself, is to adopt the convention that for
system calls with more than 6 arguments, the sixth argument register
carries a pointer to arguments 6 and up. This has minor performance
disadvantages on platforms which use the stack for return addresses AND
use exactly six registers for arguments (a surprisingly common number.)
On those platforms we have the option of either taking the extra user
space copies, or picking a method for passing the in-memory copy in a
pointer.

If the whole thing is about "a dozen new [system calls]" then a dozen
system calls added to the existing tables are better than this mess.

Inside the kernel, a lot of things could be cleaned up substantially by
automating the generation of stubs, where necessary. I did a lot of
work in klibc to automatically generate stubs of various sorts; some of
that work may be possible to re-use.

-hpa
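
The pselect case mentioned above can be seen from userspace. The sketch below invokes the raw pselect6 system call, whose sixth argument points at a small structure holding the signal-mask pointer and its length (the layout documented in select(2)). It assumes a 64-bit Linux system where SYS_pselect6 is defined and the kernel's signal set is _NSIG / 8 bytes; the function and struct names are made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* What the raw pselect6 system call expects behind argument 6. */
struct pselect6_sigarg {
	const sigset_t *ss;	/* pointer to the signal mask */
	size_t ss_len;		/* size of the kernel's signal set, in bytes */
};

/*
 * Poll no descriptors with a zero timeout, passing the signal mask
 * through the in-memory block. Returns the raw syscall result, which
 * is 0 when nothing is ready and no error occurred.
 */
long raw_pselect_poll(void)
{
	sigset_t mask;
	struct timespec zero = { 0, 0 };
	struct pselect6_sigarg arg6;

	sigemptyset(&mask);
	arg6.ss = &mask;
	arg6.ss_len = _NSIG / 8;	/* assumption: matches the kernel sigset size */

	return syscall(SYS_pselect6, 0, NULL, NULL, NULL, &zero, &arg6);
}
```

The C library wrapper pselect() hides this packing, which is the kind of per-call stub the message argues should be generated systematically rather than written by hand.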

2007-11-20 18:24:59

by Zach Brown

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets


> That's only because you're being, deliberately or accidentally, vague
> about what your actual (as opposed to imagined) requirements are.

Maybe I can help by summarizing how syslets fit in to this.

Currently the syslet patches add a single submission call which includes
an argument which is a structure which duplicates the system call ABI.
The submission syscall in the kernel does some syslet specific work
which amounts to verifying state and storing it in the task_struct. It
then has to unpack the system call arguments from this submission
syscall argument and call the specified system call.

Every architecture will need helpers, then, on either side. They'll
need to pack their arguments into the struct and then unpack and call in
the kernel. The PPC64 guys have already expressed concern about this.

It's, in effect, adding the syslet arguments to every single system call.

So, instead of duplicating the system call ABI in the argument to a
syslet submission syscall, we could pass the syslet arguments via this
indirect parameters convention. This, hopefully, will reduce complexity
by reducing the number of places that we have to muck around with the
syscall ABI.

That's the high level summary, anyway. I'm working on the simplest
expression of this mechanism at the moment. We'll have code to argue
about before the silly thanksgiving break, I hope.

- z
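
As a rough illustration of the submission scheme summarized above, the following userspace sketch packs a call number and six word-sized arguments into one structure and then unpacks them to call a handler. All names (syscall_desc, toy_table, submit) are hypothetical and the "system calls" are toys; this is not the syslet code.

```c
#include <stddef.h>

#define SYSCALL_NARGS 6

/*
 * Hypothetical submission descriptor: duplicates the system call ABI as
 * one call number plus six word-sized arguments.
 */
struct syscall_desc {
	unsigned long nr;
	unsigned long args[SYSCALL_NARGS];
};

typedef long (*syscall_fn)(unsigned long, unsigned long, unsigned long,
			   unsigned long, unsigned long, unsigned long);

/* Toy handler that uses only its first two arguments. */
static long toy_add(unsigned long a, unsigned long b, unsigned long c,
		    unsigned long d, unsigned long e, unsigned long f)
{
	(void)c; (void)d; (void)e; (void)f;
	return (long)(a + b);
}

/* Toy handler that uses all six arguments. */
static long toy_sum6(unsigned long a, unsigned long b, unsigned long c,
		     unsigned long d, unsigned long e, unsigned long f)
{
	return (long)(a + b + c + d + e + f);
}

static const syscall_fn toy_table[] = { toy_add, toy_sum6 };

/* Unpack the descriptor and call the selected handler. */
long submit(const struct syscall_desc *d)
{
	if (d->nr >= sizeof(toy_table) / sizeof(toy_table[0]))
		return -1;	/* stand-in for -ENOSYS */

	return toy_table[d->nr](d->args[0], d->args[1], d->args[2],
				d->args[3], d->args[4], d->args[5]);
}
```

Every architecture would need the equivalent of the packing on the user side and of submit() on the kernel side, which is the duplication the indirect-parameter convention is meant to avoid.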

2007-11-20 19:15:18

by H. Peter Anvin

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

Zach Brown wrote:
>> That's only because you're being, deliberately or accidentally, vague
>> about what your actual (as opposed to imagined) requirements are.
>
> Maybe I can help by summarizing how syslets fit in to this.
>
> Currently the syslet patches add a single submission call which includes
> an argument which is a structure which duplicates the system call ABI.
> The submission syscall in the kernel does some syslet specific work
> which amounts to verifying state and storing it in the task_struct. It
> then has to unpack the system call arguments from this submission
> syscall argument and call the specified system call.
>
> Every architecture will need helpers, then, on either side. They'll
> need to pack their arguments into the struct and then unpack and call in
> the kernel. The PPC64 guys have already expressed concern about this.
>
> It's, in effect, adding the syslet arguments to every single system call.
>
> So, instead of duplicating the system call ABI in the argument to a
> syslet submission syscall, we could pass the syslet arguments via this
> indirect parameters convention. This, hopefully, will reduce complexity
> by reducing the number of places that we have to muck around with the
> syscall ABI.
>
> That's the high level summary, anyway. I'm working on the simplest
> expression of this mechanism at the moment. We'll have code to argue
> about before the silly thanksgiving break, I hope.
>

It seems that you're doing the same thing in both cases, except you're
now extending it to include other random functionality, which means
other things than syslets are suddenly affected.

syslets are arguably a little bit different, since what you're
effectively doing there is running a miniature interpreted language in
kernel space. A higher startup overhead should be acceptable, since
you're amortizing it over a larger number of calls. Extending that
mechanism suddenly means you HAVE to use that interpreted language
message mechanism to access certain system calls, which really does not
seem like a good thing either for performance or for encouraging sane
design of interfaces.

Everyone who designs a multiplexer has good reasons for the expediency
that it provides, but it really isn't a good thing in the long term.
The reason I mentioned MS-DOS is that MS-DOS has tons of multiplexers,
sometimes three levels deep. Furthermore, it doesn't have any kind of
uniformity to its system calls calling convention. The end result is
hand-crafted stubs and wrappers, on both sides of the interface.

-hpa

2007-11-20 21:48:55

by David Miller

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

From: Ulrich Drepper <[email protected]>
Date: Tue, 20 Nov 2007 08:04:53 -0800

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> David Miller wrote:
> > Where does this INDIRECT_PARAM() macro get defined? I do not
> > see it being defined anywhere in these patches.
>
> Defined in <linux/indirect.h>:
>
> +#define INDIRECT_PARAM(set, name) current->indirect_params.set.name
>
> Not my idea, I was following one review comment.

This was not in the patches you posted, I double checked before
sending my reply.

2007-11-20 21:56:21

by Zach Brown

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets


>>> Where does this INDIRECT_PARAM() macro get defined? I do not
>>> see it being defined anywhere in these patches.
>> Defined in <linux/indirect.h>:
>>
>> +#define INDIRECT_PARAM(set, name) current->indirect_params.set.name
>>
>> Not my idea, I was following one review comment.
>
> This was not in the patches you posted, I double checked before
> sending my reply.

Not to belabor this point, but it was:

http://lkml.org/lkml/2007/11/20/53

$ grep -l INDIRECT_PARAM .git/patches/master/*
.git/patches/master/indirect-v4-4.patch
.git/patches/master/indirect-v4-5.patch
.git/patches/master/indirect-v4-6.patch

Maybe the patches got to you out of order so you saw 5/ before 4/?

- z

2007-11-20 22:23:19

by Ingo Molnar

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets


* H. Peter Anvin <[email protected]> wrote:

> It seems that you're doing the same thing in both cases, except you're
> now extending it to include other random functionality, which means
> other things than syslets are suddenly affected.
>
> syslets are arguably a little bit different, since what you're
> effectively doing there is running a miniature interpreted language in
> kernel space. A higher startup overhead should be acceptable, since
> you're amortizing it over a larger number of calls. Extending that
> mechanism suddenly means you HAVE to use that interpreted language
> message mechanism to access certain system calls, which really does
> not seem like a good thing either for performance or for encouraging
> sane design of interfaces.

whether that interpreted syslet language survives is still an open
question - it was extremely ugly when i wrote the first version of it
and it only got uglier since then :-)

do you suggest that extending the system call calling convention to
include an arbitrary number of parameters will solve all these API needs
we have at the moment?

if yes, then a one-shot syslet/async call would in essence be:

syslet_arg1 ... N, syscall_arg 1 ... M

the same is true for the indirect stuff, we in essence nest syscalls
inside another syscall:

sys_indirect arg1 ... N, syscall arg 1 ... M

this all assumes an arbitrarily extendable syscall ABI, which can take
N+M parameters. Right?

i'm not entirely sure we really want to do this. Nested syscalls would
have to unpack the arguments and repack them into a kernel-internal call
format anyway. So there's no performance upside - in fact i can only see
additional complications.

Why not just pin down the current ABI that there's 6 syscall parameters
_and not more_? It's totally sensible, and indirection has some minimal
costs anyway, so copying the nested syscall parameters is a non-issue.
This is not ad-hoc and when i wrote syslets i actually profiled the
performance of a variable-length calling convention and decided _against_
it. Nothing beats the performance of a straight fixed-length copy of 6x4
(or 6x8) bytes.

The memory access cost argument you mentioned is largely irrelevant and
inapposite here, this is all passed in on the stack which is well-cached
in the L1 cache.

Ingo

2007-11-20 22:33:51

by Davide Libenzi

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

On Tue, 20 Nov 2007, Ingo Molnar wrote:

> * H. Peter Anvin <[email protected]> wrote:
>
> > It seems that you're doing the same thing in both cases, except you're
> > now extending it to include other random functionality, which means
> > other things than syslets are suddenly affected.
> >
> > syslets are arguably a little bit different, since what you're
> > effectively doing there is running a miniature interpreted language in
> > kernel space. A higher startup overhead should be acceptable, since
> > you're amortizing it over a larger number of calls. Extending that
> > mechanism suddenly means you HAVE to use that interpreted language
> > message mechanism to access certain system calls, which really does
> > not seem like a good thing either for performance or for encouraging
> > sane design of interfaces.
>
> whether that interpreted syslet language survives is still an open
> question - it was extremely ugly when i wrote the first version of it
> and it only got uglier since then :-)

Aha! You admitted it finally :)



- Davide


2007-11-20 22:36:50

by David Miller

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

From: Zach Brown <[email protected]>
Date: Tue, 20 Nov 2007 13:55:56 -0800

> Not to belabor this point, but it was:
>
> http://lkml.org/lkml/2007/11/20/53
>
> $ grep -l INDIRECT_PARAM .git/patches/master/*
> .git/patches/master/indirect-v4-4.patch
> .git/patches/master/indirect-v4-5.patch
> .git/patches/master/indirect-v4-6.patch
>
> Maybe the patches got to you out of order so you saw 5/ before 4/?

Thanks for pointing this out, I stand corrected.

2007-11-20 22:43:18

by Ingo Molnar

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets


* Davide Libenzi <[email protected]> wrote:

> > whether that interpreted syslet language survives is still an open
> > question - it was extremely ugly when i wrote the first version of
> > it and it only got uglier since then :-)
>
> Aha! You admitted it finally :)

damn :-)

but if the only alternative is to be fundamentally slower, i am not
afraid of some ugliness :-)

Ingo

2007-11-20 23:27:10

by H. Peter Anvin

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

Ingo Molnar wrote:
> do you suggest that extending the system call calling convention to
> include an arbitrary number of parameters will solve all these API needs
> we have at the moment?
>
> if yes, then a one-shot syslet/async call would in essence be:
>
> syslet_arg1 ... N, syscall_arg 1 ... M
>
> the same is true for the indirect stuff, we in essence nest syscalls
> inside another syscall:
>
> sys_indirect arg1 ... N, syscall arg 1 ... M
>
> this all assumes an arbitrarily extendable syscall ABI, which can take
> N+M parameters. Right?
>
> i'm not entirely sure we really want to do this. Nested syscalls would
> have to unpack the arguments and repack them into a kernel-internal call
> format anyway. So there's no performance upside - in fact i can only see
> additional complications.

Forget about indirection for the moment. Let's first look at the need
of plain system calls.

> Why not just pin down the current ABI that there's 6 syscall parameters
> _and not more_?

Because we have already violated it. There are system calls that need
more than 6 arguments: we need *a* convention. Worse, we're not
actually talking 6 *arguments*, we're talking 6 *words*; on 32-bit
platforms a single argument can occupy two words.

Uli talks about the need to add additional system calls with
parameters, and suggests "back-dooring" them via the sys_indirect interface.

pselect introduced the convention that to take more than 6 arguments,
the 6th argument register contains a pointer into a user-space memory
area which contains the real arguments 6 and above. This is a simple
convention, which can be trivially executed as a rule set. Furthermore,
*with some care* it can be mapped 1:1 onto the C calling convention by
system-call-generic code, as opposed to needing system-call-specific
stubs to marshall parameters.

Now, if you execute that asynchronously, you of course need to make sure
that userspace doesn't clobber those additional arguments, so you would
have to save them away or otherwise restrict userspace from reclaiming this.

** This is the situation as it stands today, and any solution needs to
take this into account. **

> It's totally sensible, and indirection has some minimal
> costs anyway, so copying the nested syscall parameters is a non-issue.
> This is not ad-hoc and when i wrote syslets i actually profiled the
> performance of a variable-length calling convention and decided _against_
> it. Nothing beats the performance of a straight fixed-length copy of 6x4
> (or 6x8) bytes.
>
> The memory access cost argument you mentioned is largely irrelevant and
> inapposite here, this is all passed in on the stack which is well-cached
> in the L1 cache.

What I'm objecting to, strongly, is the use of this syslets-style
indirection for unrelated purposes, such as modifying the behaviour of
existing system calls. The sys_indirect call, as far as I understand
it, basically is a way to inject commands into a separate
hyper-lightweight thread of kernel execution. That's fine so far.

However, proposing that we should have system calls (call a spade a
spade) which can ONLY be accessed via this indirection interface is bad
interface design at best and something much stronger at worst. What we
have done in the past when we want to add new parameters to a system
call is that we assign it a new system call number, and point the old
system call number to a thunk which sets the new parameters to specific
default values and then tailcalls the new system call. This is a very
straightforward thing to do, and imposes any costs at all only on users
of the legacy system call number -- and then they are only a handful of
instructions.

-hpa
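
The thunk approach described in this message can be sketched as follows. Both entry points and the flag value are hypothetical toys made up for the example; the point is only that the legacy entry forwards to the extended one with a default value for the new parameter.

```c
/* Hypothetical flag value, used only in this sketch. */
#define TOY_NONBLOCK 0x800

/*
 * "New" entry point: the extended call with an explicit flags argument.
 * For the sketch it just reports which flags the new descriptor would
 * be created with.
 */
long toy_sys_accept_flags(int fd, int flags)
{
	(void)fd;
	return flags & TOY_NONBLOCK;
}

/*
 * "Old" entry point: a thunk that fills in the default (no flags) and
 * tail-calls the new implementation. Only legacy callers pay for the
 * extra hop.
 */
long toy_sys_accept(int fd)
{
	return toy_sys_accept_flags(fd, 0);
}
```

Under this scheme the old system call number keeps its exact behaviour, while new callers reach the flag directly through the new number.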

2007-11-20 23:42:31

by Ingo Molnar

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets


* H. Peter Anvin <[email protected]> wrote:

>> Why not just pin down the current ABI that there's 6 syscall
>> parameters _and not more_?
>
> Because we have already violated it. There are system calls that need
> more than 6 arguments: we need *a* convention. Worse, we're not
> actually talking 6 *arguments*, we're talking 6 *words*; on 32-bit
> platforms a single argument can occupy two words.

i think you are at least partly wrong here. Multiplexing/demultiplexing
can go on infinitely - for example sys_write(fd, size, buf) can be
thought of as a function call that passes in fd, size and a variable
number of arguments of the data to be written.

in that sense capping function arguments at 6 is _sensible_ because it
prefers _simple_ interfaces. When i wrote syslets i did a syscall number
of arguments histogram:

 #args   #syscalls
 -----------------
   0        22
   1        51
   2        83
   3        85
   4        40
   5        23
   6         8


Fortunately what we see today is that 80% of all syscalls have 4 or less
parameters. (yes, there are a few 6-parameter syscalls that arguably
hurt, but still, it's the exception not the rule)

this histogram shows a healthy bell curve which is _not_ limited by the
arguments limit of 6, but by common sense! If the 6-arguments limit was
a problem then we'd see a pile-up of 6-param syscalls.

so i believe you should start thinking about lots-of-arguments syscalls
as an exception not as something that needs to fit into some generic
ABI. (Especially as most schemes that were supposed to handle this
problem would hurt the sane 4-parameter (or less) syscall case too.)

Ingo
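
The cumulative shares behind the histogram in this message are easy to check. The sketch below recomputes them from the counts as posted; the function name is an assumption made for the example.

```c
#include <stddef.h>

/*
 * Syscall counts indexed by number of arguments (0 through 6), copied
 * from the histogram in the message.
 */
static const unsigned int syscalls_by_nargs[] = { 22, 51, 83, 85, 40, 23, 8 };

/*
 * Percentage (rounded down) of syscalls that take at most max_args
 * arguments.
 */
unsigned int percent_with_at_most(size_t max_args)
{
	size_t n = sizeof(syscalls_by_nargs) / sizeof(syscalls_by_nargs[0]);
	unsigned int total = 0;
	unsigned int upto = 0;

	for (size_t i = 0; i < n; i++) {
		total += syscalls_by_nargs[i];
		if (i <= max_args)
			upto += syscalls_by_nargs[i];
	}
	return upto * 100 / total;
}
```

With these counts the total is 312 syscalls: 281 of them (about 90%) take four or fewer arguments and 241 (about 77%) take three or fewer, so the 80% figure quoted in the message is on the conservative side.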

2007-11-20 23:58:40

by H. Peter Anvin

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

Ingo Molnar wrote:
> * H. Peter Anvin <[email protected]> wrote:
>
>>> Why not just pin down the current ABI that there's 6 syscall
>>> parameters _and not more_?
>> Because we have already violated it. There are system calls that need
>> more than 6 arguments: we need *a* convention. Worse, we're not
>> actually talking 6 *arguments*, we're talking 6 *words*; on 32-bit
>> platforms a single argument can occupy two words.
>
> i think you are at least partly wrong here. Multiplexing/demultiplexing
> can go on infinitely - for example sys_write(fd, size, buf) can be
> thought of as a function call that passes in fd, size and a variable
> number of arguments of the data to be written.
>
> in that sense capping function arguments at 6 is _sensible_ because it
> prefers _simple_ interfaces. When i wrote syslets i did a syscall number
> of arguments histogram:
>
> #args #syscalls
> -----------------
> 0 22
> 1 51
> 2 83
> 3 85
> 4 40
> 5 23
> 6 8
>
> Fortunately what we see today is that 80% of all syscalls have 4 or less
> parameters. (yes, there are a few 6-parameter syscalls that arguably
> hurt, but still, it's the exception not the rule)
>
> this histogram shows a healthy bell curve which is _not_ limited by the
> arguments limit of 6, but by common sense! If the 6-arguments limit was
> a problem then we'd see a pile-up of 6-param syscalls.
>
> so i believe you should start thinking about lots-of-arguments syscalls
> as an exception not as something that needs to fit into some generic
> ABI. (Especially as most schemes that were supposed to handle this
> problem would hurt the sane 4-parameter (or less) syscall case too.)
>

I guess I'm confused here... all I said was I wanted them to be
systematic, and not need ad-hoc interfaces. In particular, I really
don't want to see an interface where "oh, the fifth parameter is really
a flags field so it's passed with sys_indirect, and is only accessible
via a sys_indirect" is the norm.

We don't have all that many; pselect() being the main one (I think there
might be a handful more on 32-bit platforms, but not positive.) It
introduced the convention of pointing argument register 6 to a
user-space data structure. Simple, and as you correctly point out, it's
a comparatively rare case. In klibc, I currently handle it as a special
case, but I would prefer to avoid special cases of that sort going forward.

Note that on s390, 6-parameter system calls are already a special case:
anything with over 5 parameters is invoked via a memory structure. This
actually means that for pselect on s390, we indirect via a memory
structure not once, but twice, for no good reason.

-hpa

2007-11-26 18:19:36

by Linus Torvalds

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets



On Tue, 20 Nov 2007, H. Peter Anvin wrote:
>
> If the whole thing is about "a dozen new [system calls]" then a dozen system
> calls added to the existing tables are better than this mess.

No it's not.

The point about the indirect calls is that we can do it for other things
than just a dozen random things that want this one flag.

We'll eventually want AIO calls for filename lookup etc, for example.
That's another dozen calls (stat, lstat, open, etc). Having an indirect
call interface to do these kinds of things would be wonderful, instead of
having to add new system calls every time some issue with a flag that
changes behaviour for an already existing system call comes up.

THAT is why I'd much rather have indirect system calls.

The actual calling convention details are open for debate, of course. We
could encode the information in the system call number itself, for example
(eg have a bit there that says "extended information"). But we'll never
get away from the fact that we have the odd architecture-specific system
call interfaces with things like "pselect()" having pointers etc, if only
because of legacy issues.

So we can *never* have a truly "generic" argument marshalling setup. We'll
have to live with each architecture having system calls with special
rules: some of those rules will be architecture-specific (eg number of
easily available registers and/or historical reasons), and a few of the
rules will be architecture-independent (eg things like sigreturn, clone
and execve, that need to have direct access to the whole kernel return
stack and simply *cannot* be called from any indirect code!)

So the choice is basically one of:

- come up with a totally new interface to system calls, and effectively
duplicating the whole system call table.

I'd hate to do this. We already have duplicated system call tables due
to compat stuff, it's painful.

- just emulate the *existing* interface exactly, but with indirection.
IOW, the system call interface on x86 is an unconditional "six words in
six registers, the meaning of which is totally up to the system call
implementation itself".

This is what Uli's sys_indirect() does.

- add whole new system calls with extended information, making the 6-word
limits even worse, and likely forcing a whole new argument marshalling
code with conditionals depending on per-system-call flags, which
further complicates it and slows things down.

Quite frankly, I can't really see many other approaches. And of the above
three ones, the sys_indirect() approach really does seem to be the
simplest *and* the best-performing. It's basically faster to just
unconditionally load six registers off an indirect block than it would be
to have any conditionals based on which system call it is.

Linus
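
The second option in the list above, loading six words unconditionally from an indirect block while extra parameters travel on the side, might look roughly like this in a userspace toy. Every name here is hypothetical, and clearing the extra parameters after the call is an assumption of this sketch rather than something stated in the thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical indirect block: the call number plus six register words. */
struct toy_indirect_block {
	unsigned long nr;
	unsigned long regs[6];
};

/* Hypothetical extra parameters that do not fit in the six words. */
struct toy_extra_params {
	int file_flags;
};

/* Stand-in for per-task state that the target call can consult. */
struct toy_extra_params toy_current_params;

typedef long (*toy_handler)(unsigned long, unsigned long, unsigned long,
			    unsigned long, unsigned long, unsigned long);

/* Toy "socket" call: reports the flags it would apply to the new fd. */
static long toy_socket(unsigned long a, unsigned long b, unsigned long c,
		       unsigned long d, unsigned long e, unsigned long f)
{
	(void)a; (void)b; (void)c; (void)d; (void)e; (void)f;
	return toy_current_params.file_flags;
}

static const toy_handler toy_handlers[] = { toy_socket };

/*
 * Stash the extra parameters, load all six words without looking at
 * which call it is, run the handler, then reset the extra parameters so
 * that a later direct call sees the defaults.
 */
long toy_sys_indirect(const struct toy_indirect_block *blk,
		      const struct toy_extra_params *extra)
{
	long ret;

	if (blk->nr >= sizeof(toy_handlers) / sizeof(toy_handlers[0]))
		return -1;	/* stand-in for -ENOSYS */

	toy_current_params = *extra;
	ret = toy_handlers[blk->nr](blk->regs[0], blk->regs[1], blk->regs[2],
				    blk->regs[3], blk->regs[4], blk->regs[5]);
	memset(&toy_current_params, 0, sizeof(toy_current_params));
	return ret;
}
```

The dispatcher contains no per-call conditionals: the handler alone decides what the six words and the side parameters mean, which is the performance argument made in the message.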

2007-11-26 18:45:57

by Ingo Molnar

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets


* Linus Torvalds <[email protected]> wrote:

> Quite frankly, I can't really see many other approaches. And of the
> above three ones, the sys_indirect() approach really does seem to be
> the simplest *and* the best-performing. It's basically faster to just
> unconditionally load six registers off an indirect block than it would
> be to have any conditionals based on which system call it is.

yeah. And even assuming for the sake of argument, that there was only
one dominant architecture we care about, even there many of our existing
syscall APIs are _already_ special-purpose APIs that do not encode
parameters in a 'flat' way.

So it's not like sys_indirect() would break some magic pristine state of
a flat parameter space - on the contrary, most of the nontrivial
syscalls take pointers to structures or pointers to streams of
information. The parameter count histogram i believe further underlines
this point:

 #args   #syscalls
 -----------------
   0        22
   1        51
   2        83
   3        85
   4        40
   5        23
   6         8

the natural 'center' of function call parameter counts is around 1-4
parameters, and that is natural. (most operators that the human brain
prefers to operate with are like that - having higher complexity than
that often defeats the purpose of getting an API used by ... humans.)

(side-note: in that sense, introducing some generic "arbitrary number of
parameters" ABI design that was suggested instead of sys_indirect()
would be _counter productive_ from a meta-design POV: it would not
punish sucky, over-complicated APIs that expose way too many details in
their main API call.)

Ingo

2007-11-26 19:08:25

by H. Peter Anvin

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

Ingo Molnar wrote:
>
> So it's not like sys_indirect() would break some magic pristine state of
> a flat parameter space - on the contrary, most of the nontrivial
> syscalls take pointers to structures or pointers to streams of
> information. The parameter count histogram i believe further underlines
> this point:
>
> #args #syscalls
> -----------------
> 0 22
> 1 51
> 2 83
> 3 85
> 4 40
> 5 23
> 6 8
>
> the natural 'center' of function call parameter counts is around 1-4
> parameters, and that is natural. (most operators that the human brain
> prefers to operate with are like that - having higher complexity than
> that often defeats the purpose of getting an API used by ... humans.)
>

I was preparing a response to Linus' email, but I really feel this needs
to be addressed specifically.

When it comes to dealing with the operator-visible state, what matters
is what happens on the API level, NOT on the system call level.
Furthermore, the proposed sys_indirect interface just means that there
are parameters hidden from immediate view, even though they
fundamentally change the operation performed, and that it is much harder
to correlate, say, the output of strace(1) with what actually happened
in the program. So from a *psychological* point of view, this seems to
be an insane design choice.

-hpa

2007-11-26 19:21:18

by H. Peter Anvin

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

Linus Torvalds wrote:
>
> On Tue, 20 Nov 2007, H. Peter Anvin wrote:
>> If the whole thing is about "a dozen new [system calls]" then a dozen system
>> calls added to the existing tables are better than this mess.
>
> No it's not.
>
> The point about the indirect calls is that we can do it for other things
> than just a dozen random things that want this one flag.
>
> We'll eventually want AIO calls for filename lookup etc, for example.
> That's another dozen calls (stat, lstat, open, etc). Having an indirect
> call interface to do these kinds of things would be wonderful, instead of
> having to add new system calls every time some issue with a flag that
> changes behaviour for an already existing system call comes up.
>
> THAT is why I'd much rather have indirect system calls.

I'm presuming you're not talking about some sort of
syslets/fibrils/threadlets here (executing an interpreted thread of
execution in kernel space). That's a whole separate ball of wax.

> The actual calling convention details are open for debate, of course. We
> could encode the information in the system call number itself, for example
> (eg have a bit there that says "extended information"). But we'll never
> get away from the fact that we have the odd architecture-specific system
> call interfaces with things like "pselect()" having pointers etc, if only
> because of legacy issues.
>
> So we can *never* have a truly "generic" argument marshalling setup. We'll
> have to live with each architecture having system calls with special
> rules: some of those rules will be architecture-specific (eg number of
> easily available registers and/or historical reasons), and a few of the
> rules will be architecture-independent (eg things like sigreturn, clone
> and execve, that need to have direct access to the whole kernel return
> stack and simply *cannot* be called from any indirect code!)

> So the choice is basically one of:
>
> - come up with a totally new interface to system calls, and effectively
> duplicating the whole system call table.
>
> I'd hate to do this. We already have duplicated system call tables due
> to compat stuff, it's painful.

This would be the right thing to do if we were to redesign the system
call interface from the ground up, which it doesn't exactly sound like
we are intending.

> - just emulate the *existing* interface exactly, but with indirection.
> IOW, the system call interface on x86 is an unconditional "six words in
> six registers, the meaning of which is totally up to the system call
> implementation itself".
>
> This is what Uli's sys_indirect() does.
>
> - add whole new system calls with extended information, making the 6-word
> limits even worse, and likely forcing a whole new argument marshalling
> code with conditionals depending on per-system-call flags, which
> further complicates it and slows things down.

The 6-word limit is a red herring. There are at least two ways to deal
with it (and this doesn't mean wiping the legacy stuff we already have):

- Let each architecture pick a calling convention and redefine the
architecture-independent bits to take an arbitrary number of arguments.
This is a one-time panarchitectural change.

- Define the architecture-independent interface inside the kernel to be
a 6-word interface and use a marshalling thunk when the number of
parameters exceeds this number. **This is what we're currently doing.**
This is inefficient for s390 (which already has to thunk
6-parameter functions in its arch layer), but I think all other
architectures are fine. Those thunks (stubs) could be generated
automatically if we wanted to.

So I would advocate admitting that we already broke the 6-word limit and
abolish it. Then we can create new system calls that match what the
user would see.

> Quite frankly, I can't really see many other approaches. And of the above
> three ones, the sys_indirect() approach really does seem to be the
> simplest *and* the best-performing. It's basically faster to just
> unconditionally load six registers off an indirect block than it would be
> to have any conditionals based on which system call it is.

I find it very hard to see how it could be better performing than
jumping through a thunk; in fact, for the second option (the one we're
currently using) when gcc does top-level reordering the thunk (e.g.
sys_pselect6) SHOULD simply the system call function proper (e.g.
sys_pselect7). For one thing, you will have at least one additional
data-dependent indirect call in the path.

-hpa

2007-11-26 19:55:29

by Davide Libenzi

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

On Mon, 26 Nov 2007, H. Peter Anvin wrote:

> Ingo Molnar wrote:
> >
> > So it's not like sys_indirect() would break some magic pristine state of a
> > flat parameter space - on the contrary, most of the nontrivial syscalls take
> > pointers to structures or pointers to streams of information. The parameter
> > count histogram i believe further underlines this point:
> >
> > #args   #syscalls
> > -----------------
> >     0          22
> >     1          51
> >     2          83
> >     3          85
> >     4          40
> >     5          23
> >     6           8
> >
> > the natural 'center' of function call parameter counts is around 1-4
> > parameters, and that is natural. (most operators that the human brain
> > prefers to operate with are like that - having higher complexity than that
> > often defeats the purpose of getting an API used by ... humans.)
> >
>
> I was preparing a response to Linus' email, but I really feel this needs to be
> addressed specifically.
>
> When it comes to dealing with the operator-visible state, what matters is what
> happens on the API level, NOT on the system call level. Furthermore, the
> proposed sys_indirect interface just means that there are parameters hidden
> from immediate view, even though they fundamentally change the operation
> performed, and that it is much harder to correlate, say, the output of
> strace(1) with what actually happened in the program. So from a
> *psychological* point of view, this seems to be an insane design choice.

I think there are two different issues. One is the proliferation of system
calls, and the other is the sane design of internal kernel APIs.
The first one is not very interesting to me, since I don't have a strong
opinion either way.
The second is the one I care about most. I think that, whatever
solution is used to address the first, internal kernel APIs should be
designed so that parameters flow down from the system call to the
parameter's user code. IMO, besides very few cases where it could make
some sense [*], setting some thread-global bits in the upper layer, to be
magically picked up by code in the lower layers, does not lead to readable
interfaces.



[*] Things that already read from a shared context, that is already
exposed to the user through some sort of set/get APIs.
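
The contrast drawn above can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct demo_file`, `demo_attach_explicit`, `demo_attach_hidden` and `demo_hidden_flags` are hypothetical names, not kernel code. The first function receives the flag as a parameter, the way the patch passes `flags` down to sock_attach_fd(); the second picks it up from per-thread state set elsewhere.

```c
#include <fcntl.h>   /* O_RDWR, O_NONBLOCK */

/* Hypothetical stand-in for a kernel file object. */
struct demo_file {
    int f_flags;
};

/* Style 1: the flag flows down as an explicit parameter. */
void demo_attach_explicit(struct demo_file *file, int flags)
{
    file->f_flags = O_RDWR | (flags & O_NONBLOCK);
}

/* Style 2: an upper layer stashes the flag in per-thread state ... */
_Thread_local int demo_hidden_flags;

/* ... and the lower layer reads it; the signature does not show it. */
void demo_attach_hidden(struct demo_file *file)
{
    file->f_flags = O_RDWR | (demo_hidden_flags & O_NONBLOCK);
}
```

Both variants compute the same f_flags value; the difference is only whether a reader of the call site can see where the flag comes from.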



- Davide


2007-11-26 23:30:32

by Ulrich Drepper

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

H. Peter Anvin wrote:
> The 6-word limit is a red herring. There are at least two ways to deal
> with it (and this doesn't mean wiping the legacy stuff we already have):
>
> - Let each architecture pick a calling convention and redefine the
> architecture-independent bits to take an arbitrary number of arguments.
> This is a one-time panarchitectural change.
> [...]

Just think beyond wishful thinking for a moment. What does it take to
come up with something completely new and grand?

Let's start at the basics: you need to signal that the new syscall
calling convention is used. Since the syscall entry code is limited (at
least the likes of syscall/sysenter, it would be easy enough to use int
$0x81 in addition to int $0x80) you would have to extend the use of the
syscall number while keeping binary compatibility. This means
additional costs for every single syscall.

Once you're past that, how do you implement the expandable syscall
parameter count? There are two ways:

- pass to the real sys_* implementations the number of provided syscall
parameters and have each function figure out what this means

- dynamically construct a call to the sys_* functions where the syscall
magic adds an appropriate number of parameters filled with zeros. This
is quite complicated and, more importantly, it requires that you have
code/data somewhere which specifies how many parameters each of the
sys_* functions actually requires. The actual sys_* code and the data
has to be kept in sync at all times. A maintenance nightmare.
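
A minimal userspace sketch of the second option, assuming a hypothetical table (`demo_table`) that records how many parameters each handler takes. None of these names exist in the kernel; the point is only to show the per-call data that would have to stay in sync with the handler code.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_MAX_ARGS 6

typedef long (*demo_sys_fn)(long, long, long, long, long, long);

/* Hypothetical per-syscall record: handler plus its real parameter count. */
struct demo_sys_entry {
    demo_sys_fn fn;
    unsigned int nargs;
};

/* Hypothetical handler that really uses four parameters. */
long demo_sys_example(long a, long b, long c, long flags, long e, long f)
{
    (void)e;
    (void)f;
    return a + b + c + 1000 * flags;
}

/* This table must be kept in sync with the handlers by hand. */
const struct demo_sys_entry demo_table[] = {
    { demo_sys_example, 4 },
};

/* Pad the arguments the caller did not provide with zeros, then call. */
long demo_dispatch(size_t nr, const long *args, unsigned int provided)
{
    long full[DEMO_MAX_ARGS] = { 0 };
    size_t table_len = sizeof demo_table / sizeof demo_table[0];
    unsigned int i;

    if (nr >= table_len || provided > demo_table[nr].nargs)
        return -EINVAL;

    for (i = 0; i < provided; i++)
        full[i] = args[i];

    return demo_table[nr].fn(full[0], full[1], full[2],
                             full[3], full[4], full[5]);
}
```

A caller that passes only three arguments gets `flags` filled in as zero, so it sees the old behaviour; a caller that passes four reaches the extended behaviour.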


The handling of syscalls with many parameters should not be a
driver of this design at all. Syscalls shouldn't be that complicated, I
completely agree with Ingo.


I'm perfectly willing to give you the benefit of the doubt: show us a design
for what you're proposing which is not slower than the current code,
doesn't impact existing code, and solves the problem in a nice and clean
way. I cannot really see it now but I might be missing something. The
sys_indirect approach ain't pretty but it does its job, doesn't impact
performance, and is expandable in a direction we *know* we will want to go
very soon.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2007-11-27 00:19:30

by H. Peter Anvin

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

Ulrich Drepper wrote:
> H. Peter Anvin wrote:
>> The 6-word limit is a red herring. There are at least two ways to deal
>> with it (and this doesn't mean wiping the legacy stuff we already have):
>>
>> - Let each architecture pick a calling convention and redefine the
>> architecture-independent bits to take an arbitrary number of arguments.
>> This is a one-time panarchitectural change.
>> [...]
>
> Just think beyond wishful thinking for a moment. What does it take to
> come up with something completely new and grand?
>
> Let's start at the basics: you need to signal that the new syscall
> calling convention is used. Since the syscall entry code is limited (at
> least the likes of syscall/sysenter, it would be easy enough to use int
> $0x81 in addition to int $0x80) you would have to extend the use of the
> syscall number while keeping binary compatibility. This means
> additional costs for every single syscall.

No.

I already said I'm not looking at changing the calling convention for
existing syscalls. I don't think that makes sense. (The only realistic
exception to that is that if we really want a small number (16 or less)
of flags fully orthogonal to the system call table, I guess it might
make sense to stuff those in the high bits of the system call register.
However, I am a bit concerned about the auditability of that.)
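
To make the parenthetical concrete, here is a hedged sketch of what "flags in the high bits of the system call register" could look like. The 16/16 split and every `demo_` name are assumptions for illustration only; no such encoding is defined by the patch or by the kernel.

```c
#include <stdint.h>

/* Assumed layout: low 16 bits = syscall number, high 16 bits = flags. */
#define DEMO_NR_BITS   16
#define DEMO_NR_MASK   0xffffu
#define DEMO_FLAG_MASK 0xffffu

/* Pack a syscall number and up to 16 orthogonal flags into one word. */
uint32_t demo_encode(uint32_t nr, uint32_t flags)
{
    return ((flags & DEMO_FLAG_MASK) << DEMO_NR_BITS) | (nr & DEMO_NR_MASK);
}

/* Recover the syscall number used to index the table. */
uint32_t demo_decode_nr(uint32_t reg)
{
    return reg & DEMO_NR_MASK;
}

/* Recover the flags that would be handed to the handler. */
uint32_t demo_decode_flags(uint32_t reg)
{
    return (reg >> DEMO_NR_BITS) & DEMO_FLAG_MASK;
}
```

With no flags set, the encoded word equals the plain syscall number, so existing callers would be unaffected; the cost is an extra mask on the entry path, and the auditability concern raised above applies to every handler that has to decide what each flag means.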

> Once you're past that, how do you implement the expandable syscall
> parameter count? There are two ways:
>
> - pass to the real sys_* implementations the number of provided syscall
> parameters and have each function figure out what this means
>
> - dynamically construct a call to the sys_* functions where the syscall
> magic adds an appropriate number of parameters filled with zeros. This
> is quite complicated and, more importantly, it requires that you have
> code/data somewhere which specifies how many parameters each of the
> sys_* functions actually requires. The actual sys_* code and the data
> has to be kept in sync at all times. A maintenance nightmare.

Hardly so, as evidenced by the fact that we have successfully done so
for 15 years already; a number of Linux architectures require this
information for the existing system calls.

> The handling of syscalls with many parameters should not be a
> driver of this design at all. Syscalls shouldn't be that complicated, I
> completely agree with Ingo.

You *ARE* introducing additional syscall parameters, regardless of
whether you admit it or not. It's exactly what you're doing, and by
making those parameters hidden, all we're accomplishing is:

- a penalty any time those parameters have to be invoked,
- a good possibility that all the combinations are not audited.

> I'm perfectly willing to give you the benefit of the doubt: show us a design
> for what you're proposing which is not slower than the current code,
> doesn't impact existing code, and solves the problem in a nice and clean
> way. I cannot really see it now but I might be missing something. The
> sys_indirect approach ain't pretty but it does its job, doesn't impact
> performance, and is expandable in a direction we *know* we will want to go
> very soon.

We have a ton of examples in the kernel already for both dealing with
additional parameters and (somewhat fewer examples, but sys_pselect is a
good one) too many parameters: in all cases, we invoke a wrapper
function that sets up the parameters and then invokes the "true" syscall
function. Right now we're generating those manually, which appears to
work okay simply because the amount of work it takes is small compared
to the amount of work it takes to write the code for a proper syscall.
(That doesn't mean it is the ideal model, of course.) We *could*
generate them automatically if we wanted to, with either of the models I
mentioned -- the code I have in klibc to generate syscall stubs should
be relatively easily modifiable to do this, although it's probably
overkill for this job.

In this case we do minimal thunking of parameters for the legacy
entrypoints, and for the current entrypoints we do the guaranteed
minimum amount of work.

-hpa

2007-11-27 00:48:14

by Ulrich Drepper

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

H. Peter Anvin wrote:
> No.
>
> I already said I'm not looking at changing the calling convention for
> existing syscalls.

I did not suggest or ask for that at all.

I was asking you to consider the real implementation details for a new
syscall mechanism.

We do not want to abandon the use of syscall/sysenter and go back to int
(on x86/x86-64). This means that you have to come up with a mechanism
which hooks into the current syscall/sysenter path while preserving full
backward compatibility.

Now it's your turn. How do you do this without additional costs?


> Hardly so, as evidenced by the fact that we have successfully done so
> for 15 years already; a number of Linux architectures require this
> information for the existing system calls.

Nothing at this scale is there at the moment, as far as I can see. And
nothing so critical to get right.

Talk is cheap. You still haven't shown one bit of design for how you want
to achieve your grand goal. The time for hand-waving is over. Do some
work or step out of the way. Nothing you have said so far in the least
convinces me and your arguments like "sys_indirect adds parameters" are
not really contested. Yes, that's what sys_indirect does. So what? It
does this with almost no cost, which outweighs the ugliness factor in my
book.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2007-11-27 01:24:01

by H. Peter Anvin

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

Ulrich Drepper wrote:
> H. Peter Anvin wrote:
>> No.
>>
>> I already said I'm not looking at changing the calling convention for
>> existing syscalls.
>
> I did not suggest or ask for that at all.
>
> I was asking you to consider the real implementation details for a new
> syscall mechanism.
>
> We do not want to abandon the use of syscall/sysenter and go back to int
> (on x86/x86-64). This means that you have to come up with a mechanism
> which hooks into the current syscall/sysenter path while preserving full
> backward compatibility.
>
> Now it's your turn. How do you do this without additional costs?
>

- Add sys_new_call to the syscall table
- Create a stub thunk:

asmlinkage long sys_old_call(long parm1, long parm2, long parm3)
{
        return sys_new_call(parm1, parm2, parm3, 0);
}

We have 2^n examples of this in the kernel already.

Or, if the new syscall requires more than 6 parameters (with the current
convention):

asmlinkage long sys_new_call6(long parm1, long parm2, long parm3,
                              long parm4, long parm5,
                              long __user *additional)
{
        long xparm[3];  /* 8 parameters, total */

        /* copy_from_user() returns the number of bytes NOT copied; 0 is success */
        if (copy_from_user(xparm, additional, sizeof xparm))
                return -EFAULT;

        return sys_new_call(parm1, parm2, parm3, parm4, parm5,
                            xparm[0], xparm[1], xparm[2]);
}

This is a fixed-size copy from userspace, which obviously cannot be avoided.

The C version isn't optimal, obviously, hence my mentioning the
possibility of doing it in the arch layer.
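
The six-word thunk can be tried out in userspace with a mock. This is a sketch under stated assumptions: `demo_copy_from_user` is a hypothetical stand-in that mimics the kernel's return convention (the number of bytes that could NOT be copied, so zero means success) and treats a NULL source as a faulting user pointer; `demo_sys_new_call` simply sums its eight parameters so the result is easy to check.

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical mock of copy_from_user(): returns the number of bytes
 * NOT copied. A NULL source stands in for a faulting user pointer.
 */
unsigned long demo_copy_from_user(void *to, const void *from, unsigned long n)
{
    if (from == NULL)
        return n;
    memcpy(to, from, n);
    return 0;
}

/* Hypothetical "true" eight-parameter call; sums its arguments. */
long demo_sys_new_call(long p1, long p2, long p3, long p4,
                       long p5, long p6, long p7, long p8)
{
    return p1 + p2 + p3 + p4 + p5 + p6 + p7 + p8;
}

/* Six-word entry point: five direct words plus a pointer to three more. */
long demo_sys_new_call6(long p1, long p2, long p3, long p4, long p5,
                        const long *additional)
{
    long xparm[3];

    /* Any non-zero return means part of the block was not copied. */
    if (demo_copy_from_user(xparm, additional, sizeof xparm))
        return -EFAULT;

    return demo_sys_new_call(p1, p2, p3, p4, p5,
                             xparm[0], xparm[1], xparm[2]);
}
```

Calling `demo_sys_new_call6(1, 2, 3, 4, 5, extra)` with `extra` holding 6, 7, 8 yields the sum of all eight values, while passing NULL for the block yields -EFAULT.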

-hpa

2007-11-27 02:15:52

by Linus Torvalds

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets



On Mon, 26 Nov 2007, H. Peter Anvin wrote:
>
> I'm presuming you're not talking about some sort of syslets/fibrils/threadlets
> here (executing an interpreted thread of execution in kernel space). That's a
> whole separate ball of wax.

Indeed.

I'm hoping that just dies. It's too complex. But the "do this single
system call asynchronously" isn't, and has lots of historical
implementations, ranging from VMS to the braindead POSIX "aio" setup.

I do think that more complex threadlets could be useful in theory, I just
doubt they'd be used in practice..

> > So the choice is basically one of:
> >
> > - come up with a totally new interface to system calls, and effectively
> > duplicating the whole system call table.
> >
> > I'd hate to do this. We already have duplicated system call tables due
> > to compat stuff, it's painful.
>
> This would be the right thing to do if we were to redesign the system call
> interface from the ground up, which it doesn't exactly sound like we are
> intending.

Yeah. I'm also not sure it's the right thing even if we did redesign from
scratch.

The current system call interface may look less than regular, but it has
some very solid foundation: it's fast. Passing arguments in registers is
by definition a lot faster *and*safer* than passing them any other way.
There are no subtle security issues with people playing games with the
argument base pointer (ie usually the stack pointer) and trying to fool
the kernel into accessing kernel memory etc.

Immediately when you do anything but registers, it is much *much* more
costly. The "get_user()" and "copy_from_user()" stuff is not exactly slow,
but it's quite noticeable overhead for simple system calls. It gets worse
if this all is described by some indirect table setup.

In the system call path, right now, for some system calls, the biggest two
overheads are

- the CPU system call overhead itself. We can't do much about this, but
the CPU designers do seem to be slowly getting it fixed (ie it's slower
than it should need to be, but it's a hell of a lot faster than a P4
used to be ;)

- the cost of just the single indirect - and unpredictable - call.

(The second cost is actually often totally hidden in the trivial system
call benchmarks people run: if the benchmark just does "getppid()" a
million times in a tight loop, the indirect call on the system call number
seems really quite fast, but outside of benchmarks it is generally totally
unpredictable indeed, and a real cost for real-life system call usage).

Everything else in the system call path is generally as fast as we can
make it. Doing more indirection and conditionals would be really quite
nasty.

Of course, for *most* of system calls, the work the kernel actually does
ends up being so big that it doesn't much matter, but I was literally
chasing down why a page fault had slowed down by ~70 cycles two weeks ago.
And it doesn't take more than a couple of unpredictable jumps to do things
like that!

> The 6-word limit is a red herring. There are at least two ways to deal with it
> (and this doesn't mean wiping the legacy stuff we already have):
>
> - Let each architecture pick a calling convention and redefine the
> architecture-independent bits to take an arbitrary number of arguments. This
> is a one-time panarchitectural change.

Not applicable on x86-32.

The six-word limit is effectively a hardware limit there. Once it goes
past that limit, one of the words needs to be a pointer to extended
information that is fundamentally slower to access. Happily, only very
rare system calls do that (and none of them are of the simple variety
where we see a few cycles easily).

On other architectures, we could more easily just use more registers. But
x86-32 is still a big part (bulk) of what matters for most people.

Linus

2007-11-27 02:39:28

by H. Peter Anvin

Subject: Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

Linus Torvalds wrote:
>
>> The 6-word limit is a red herring. There are at least two ways to deal with it
>> (and this doesn't mean wiping the legacy stuff we already have):
>>
>> - Let each architecture pick a calling convention and redefine the
>> architecture-independent bits to take an arbitrary number of arguments. This
>> is a one-time panarchitectural change.
>
> Not applicable on x86-32.
>
> The six-word limit is effectively a hardware limit there. Once it goes
> past that limit, one of the words needs to be a pointer to extended
> information that is fundamentally slower to access. Happily, only very
> rare system calls do that (and none of them are of the simple variety
> where we see a few cycles easily).
>
> On other architectures, we could more easily just use more registers. But
> x86-32 is still a big part (bulk) of what matters for most people.
>

Well, x86-32 and x86-64 are surprisingly similar here, for very
different reasons (x86-64 is because there are only seven clobbered
registers that aren't destroyed by the syscall instruction itself.)

However, on both of these we could make the user-space side cheaper, by
making sure that we don't have to do additional copies in user space.
For both these architectures, anything more than 3 parameters (i386) or
6 parameters (x86-64) will be already in memory on the stack, so if we
can use that image as-is then we at least save the intra-user-space copy
that goes along with it.

x86-64 requires some minor thought, since the obvious way of doing it
(using arg register 6 to push in a pointer) would end up with a
discontiguous frame. One can do it with something like this, although
it's not clear to me it is a win at all (the more obvious sequence using
XCHG isn't usable since XCHG locks unconditionally):

pop %r10 # Return address
push %r9 # Argument 6
movq %rsp, %r11
push %r10
movq %rcx, %r10
syscall
cmpq $-4095, %rax
jae ...
pop %r10
pop %r9
push %r10
retq

The number of registers does vary, obviously, with s390 being the smallest
number (5).

> Immediately when you do anything but registers, it is much *much* more
> costly. The "get_user()" and "copy_from_user()" stuff is not exactly slow,
> but it's quite noticeable overhead for simple system calls. It gets worse
> if this all is described by some indirect table setup.

True, of course, although we're talking here about different ways to
pull arguments out of userspace memory; *definitely* agreed that we
don't want to have any additional indirection necessary.

-hpa