wing patches provide an alternative implementation of the
sys_indirect system call which has been discussed a few times.
This no system call allows us to extend existing system call
interfaces with adding more system calls.
Davide's previous implementation is IMO far more complex than
warranted. This code here is trivial, as you can see. I've
discussed this approach with Linus last week and for a brief moment
we actually agreed on something.
We pass an additional block of data to the kernel, it is copied into
the task_struct, and then it is up to the function implementing the system
call to interpret the data. Each system call, which is meant to be
extended this way, has to be white-listed in sys_indirect. The
alternative is to filter out those system calls which absolutely cannot
be handled using sys_indirect (like clone, execve) since they require
the stack layout of an ordinary system call. This is more dangerous
since it is too easy to miss a call.
The code for x86 and x86-64 gets by without a single line of assembly
code. This is likely to be true for most/all the other archs as well.
There is architecture-dependent code, though. For x86 and x86-64 I've
also fixed up UML (although only x86-64 is tested, that's my setup).
The last patch shows the first application of the functionality. It also
shows a complication: we need the test for valid sub-syscalls in the
main implementation and in the compatibility code. And more: the actual
sources and generated binary are very different (the numbers differ).
Duplicating the information is a big problem, though. I've used some macro
tricks to avoid this. All the information about the flags and the system
calls using them is concentrated in one header. This should maintenance
bearable.
This patch to use sys_indirect is just the beginngin. More will follow,
but I want to see how these patches are received before I spend more time
on it. This code is enough to test the implementation with the following
test program. Adjust it for architectures other than x86 and x86-64.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/syscall.h>
typedef uint32_t __u32;
typedef uint64_t __u64;
union indirect_params {
struct {
int flags;
} file_flags;
};
#ifdef __x86_64__
# define __NR_indirect 286
struct indirect_registers {
__u64 rax;
__u64 rdi;
__u64 rsi;
__u64 rdx;
__u64 r10;
__u64 r8;
__u64 r9;
};
#elif defined __i386__
# define __NR_indirect 325
struct indirect_registers {
__u32 eax;
__u32 ebx;
__u32 ecx;
__u32 edx;
__u32 esi;
__u32 edi;
__u32 ebp;
};
#else
# error "need to define __NR_indirect and struct indirect_params"
#endif
#define FILL_IN(var, values...) \
var = (struct indirect_registers) { values }
int
main (void)
{
int fd = socket (AF_INET, SOCK_DGRAM, IPPROTO_IP);
int s1 = fcntl (fd, F_GETFD);
printf ("old: FD_CLOEXEC %s set\n", s1 == 0 ? "not" : "is");
close (fd);
union indirect_params i;
i.file_flags.flags = O_CLOEXEC;
struct indirect_registers r;
#ifdef __NR_socketcall
# define SOCKOP_socket 1
long args[3] = { AF_INET, SOCK_DGRAM, IPPROTO_IP };
FILL_IN (r, __NR_socketcall, SOCKOP_socket, (long) args);
#else
FILL_IN (r, __NR_socket, AF_INET, SOCK_DGRAM, IPPROTO_IP);
#endif
fd = syscall (__NR_indirect, &r, &i, sizeof (i));
int s2 = fcntl (fd, F_GETFD);
printf ("new: FD_CLOEXEC %s set\n", s2 == 0 ? "not" : "is");
close (fd);
return s1 != 0 || s2 == 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Ulrich Drepper <[email protected]>
arch/x86/ia32/Makefile | 1
arch/x86/ia32/ia32entry.S | 2 +
arch/x86/ia32/sys_ia32.c | 37 +++++++++++++++++++++++++++++++++-
arch/x86/kernel/syscall_table_32.S | 1
include/asm-um/indirect.h | 6 +++++
include/asm-x86/ia32_unistd.h | 1
include/asm-x86/indirect.h | 5 ++++
include/asm-x86/indirect_32.h | 23 +++++++++++++++++++++
include/asm-x86/indirect_64.h | 34 +++++++++++++++++++++++++++++++
include/asm-x86/unistd_32.h | 3 +-
include/asm-x86/unistd_64.h | 2 +
include/linux/indirect.h | 39 ++++++++++++++++++++++++++++++++++++
include/linux/sched.h | 4 +++
include/linux/syscalls.h | 6 ++++-
kernel/Makefile | 4 ++-
kernel/indirect.c | 40 +++++++++++++++++++++++++++++++++++++
net/socket.c | 29 +++++++++++++++-----------
17 files changed, 221 insertions(+), 16 deletions(-)
Ulrich Drepper wrote:
> wing patches provide an alternative implementation of the
> sys_indirect system call which has been discussed a few times.
> This no system call allows us to extend existing system call
> interfaces with adding more system calls.
>
> Davide's previous implementation is IMO far more complex than
> warranted. This code here is trivial, as you can see. I've
> discussed this approach with Linus last week and for a brief moment
> we actually agreed on something.
>
> We pass an additional block of data to the kernel, it is copied into
> the task_struct, and then it is up to the function implementing the system
> call to interpret the data. Each system call, which is meant to be
> extended this way, has to be white-listed in sys_indirect. The
> alternative is to filter out those system calls which absolutely cannot
> be handled using sys_indirect (like clone, execve) since they require
> the stack layout of an ordinary system call. This is more dangerous
> since it is too easy to miss a call.
>
I stared at this a bit, and it took me some time to try to grok what it
is trying to do. Eventually I figured it out, and I wonder if there
isn't an easier -- or at least more efficient -- way to accomplish this
goal.
It seems to me that we could accomplish the same thing by passing the
number of parameters in the upper bits of the system call number
register (%eax in the case of x86.) If set to zero, we'd fill in the
legacy number of registers (for backwards compatibility.) Unspecified
arguments are then forced to zero before invoking the target function;
we could also make a register count available if need be.
Alternatively, the same thing can be done with a dense system call
number space by adding a number of parameters field to the system call
table, however, that is more invasive in that one has to poke something
into each architecture (unfortunately -- it would be so much nicer if
there was a central metafile which one could process into the various
architecture system call tables.)
-hpa
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
H. Peter Anvin wrote:
> It seems to me that we could accomplish the same thing by passing the
> number of parameters in the upper bits of the system call number
> register (%eax in the case of x86.)
This isn't really a generic solution. The number of parameters is
limited to six. There are syscalls with six parameters already. There
are many more with five which could only handle one more parameter.
Also, is it really simpler? You'd need to have another table which
contains the default number of parameters a system call takes so that
you can fill in the default value of zero. This extra memory access has
to be performed for every system call.
I think it is unlikely that this approach is faster. To the contrary,
I'd guess.
I don't have much invested into this but it seems the sys_indirect
approach is so much simpler. Overhead is only paid if you really need
it which is rarely the case. Plus, you might have heard Linus and Zack
talk about syslets again. Starting syslets can be done using the same
interface, I guess.
- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org
iD8DBQFHQAez2ijCOnn/RHQRAjoHAJ4/Qq4ygaZ4uq6uCIVNq4hfN1m2pACgpJFi
Z/vBsGFpUc/EUz+VW66jEIY=
=B19x
-----END PGP SIGNATURE-----
Ulrich Drepper wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> H. Peter Anvin wrote:
>> It seems to me that we could accomplish the same thing by passing the
>> number of parameters in the upper bits of the system call number
>> register (%eax in the case of x86.)
>
> This isn't really a generic solution. The number of parameters is
> limited to six. There are syscalls with six parameters already. There
> are many more with five which could only handle one more parameter.
>
> Also, is it really simpler? You'd need to have another table which
> contains the default number of parameters a system call takes so that
> you can fill in the default value of zero. This extra memory access has
> to be performed for every system call.
>
> I think it is unlikely that this approach is faster. To the contrary,
> I'd guess.
>
> I don't have much invested into this but it seems the sys_indirect
> approach is so much simpler. Overhead is only paid if you really need
> it which is rarely the case. Plus, you might have heard Linus and Zack
> talk about syslets again. Starting syslets can be done using the same
> interface, I guess.
>
What bothers me about the sys_indirect approach is that it will get
increasingly expensive as time goes on, and in doing so it does a
user-space memory reference, which are extra expensive. The extra table
can be colocated with the main table (a structure, in effect) so they'll
share the same cache line.
-hpa
On Sat, 17 Nov 2007 00:31:36 -0500
Ulrich Drepper <[email protected]> wrote:
>
> union indirect_params i;
> i.file_flags.flags = O_CLOEXEC;
This setup forbids future addons to file_flags
In three years, when we want to add a new indirect feature to socket()
call, do we need a new indirect2() syscall ?
So I suggest using :
union indirect_params i = {
.file_flags.flags = O_CLOEXEC,
};
fd = syscall (__NR_indirect, &r, &i, sizeof (i));
Or better, you could avoid using 'union indirect_params' in user code, and
only use the substructs for each function.
/* define this in some include file of course */
struct file_flags {
int flags;
};
/* use it */
struct file_flags ff = { .flags = O_CLOEXEC};
fd = syscall (__NR_indirect, &r, &ff, sizeof(ff));
This to avoid to copy to kernel un-necessary data, and permitting futures changes.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Eric Dumazet wrote:
>> union indirect_params i;
>> i.file_flags.flags = O_CLOEXEC;
>
> This setup forbids future addons to file_flags
>
> In three years, when we want to add a new indirect feature to socket()
> call, do we need a new indirect2() syscall ?
No, it doesn't. The setup is indefinitely expandable.
All you need to do, if it becomes necessary to have more than an int, is
to define a little structure for the system call and then use it. The
only requirement is that the code has to assume a value of zero is what
is used today. That's the whole point.
union indirect_params {
struct {
int flags;
} file_flags;
struct {
int flags;
int new_syscall_data1;
sigset_t and_a_sigmask;
} new_data;
};
Old programs will set only the 'flags' member of 'new_data' while new
once can also set the new elements. New programs on old kernels will
eithe have failing calls since the structure is too big or the call will
not have all the desired effects. The latter can be tested for.
> Or better, you could avoid using 'union indirect_params' in user code, and
> only use the substructs for each function.
There is no overhead introduced through the union. The only reason the
union is there in the first place is to allocate sufficient data in
task_struct to cover all cases.
- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
iD8DBQFHQafd2ijCOnn/RHQRAlSFAJ99lahwCDZGRSlIHCov5bWowrpoiQCgwvW4
LDSEusNUpMfIE1ywBCRDBfc=
=ChVT
-----END PGP SIGNATURE-----
On Mon, 19 Nov 2007 07:12:29 -0800
Ulrich Drepper <[email protected]> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Eric Dumazet wrote:
> >> union indirect_params i;
> >> i.file_flags.flags = O_CLOEXEC;
> >
> > This setup forbids future addons to file_flags
> >
> > In three years, when we want to add a new indirect feature to socket()
> > call, do we need a new indirect2() syscall ?
>
> No, it doesn't. The setup is indefinitely expandable.
>
> All you need to do, if it becomes necessary to have more than an int, is
> to define a little structure for the system call and then use it. The
> only requirement is that the code has to assume a value of zero is what
> is used today. That's the whole point.
Yes, this is what I understood.
>
> union indirect_params {
> struct {
> int flags;
> } file_flags;
> struct {
> int flags;
> int new_syscall_data1;
> sigset_t and_a_sigmask;
> } new_data;
> };
Yes, so now, if you take this new definition of indirect_params,
its size is 12 bytes at least.
So when you recompile your old program (as you post it and as I commented on),
it will pass a >= 12 bytes data to kernel, with only first 4 bytes set to O_CLOEXEC.
Other bytes will contain junk (automatic variables are not set to 0 by compiler),
and your program possibly break - if kernel socket() call was
extended to use new_data struct. (ie wanting O_CLOEXEC features +
new ones with new_syscall_data1)
You have to take care of binary compatibility, but also that an old program
(unaware of the new fields) will be re-compiled with new headers.
>
> Old programs will set only the 'flags' member of 'new_data' while new
> once can also set the new elements. New programs on old kernels will
> eithe have failing calls since the structure is too big or the call will
> not have all the desired effects. The latter can be tested for.
>
>
> > Or better, you could avoid using 'union indirect_params' in user code, and
> > only use the substructs for each function.
>
> There is no overhead introduced through the union. The only reason the
> union is there in the first place is to allocate sufficient data in
> task_struct to cover all cases.
Yes, sure, but it should be a kernel issue, not a user space one at all,
as long as user provides the size of the data he want to pass to kernel (
and your patch does this already)
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Eric Dumazet wrote:
> So when you recompile your old program (as you post it and as I commented on),
> it will pass a >= 12 bytes data to kernel, with only first 4 bytes set to O_CLOEXEC.
>
> Other bytes will contain junk
If you don't initialize the entire structure and you use it all, of
course you get undefined behavior. That's nothing new. The program I
attached is not an example, it's a test for the functionality in this patch.
Like with every kernel interface, you have to use it correctly. The
good news is that user programs should never use this syscall directly
(just like don't for existing ones).
I see no problem at all here.
- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
iD8DBQFHQbBH2ijCOnn/RHQRAkc3AKCxVTWQ3BiQnCBwdbAsT122QWWaiwCggKXN
Z5Sz9/NFojMHZXXTzIMoxX4=
=slte
-----END PGP SIGNATURE-----
On Mon, 19 Nov 2007 07:48:23 -0800
Ulrich Drepper <[email protected]> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Eric Dumazet wrote:
> > So when you recompile your old program (as you post it and as I commented on),
> > it will pass a >= 12 bytes data to kernel, with only first 4 bytes set to O_CLOEXEC.
> >
> > Other bytes will contain junk
>
> If you don't initialize the entire structure and you use it all, of
> course you get undefined behavior. That's nothing new. The program I
> attached is not an example, it's a test for the functionality in this patch.
>
> Like with every kernel interface, you have to use it correctly. The
> good news is that user programs should never use this syscall directly
> (just like don't for existing ones).
>
> I see no problem at all here.
I do see a problem, because some readers will take your example as a reference,
as it will probably sit in a page that google^Wsearch_engines will bring at the
top of search results for next ten years or so.
(I bet for "sys_indirect syscall" -> http://lwn.net/Articles/258708/ )
Next time you post it, please warn users that it will break in some years, or
state clearly this should only be used internally by glibc.
Thank you
* Eric Dumazet <[email protected]> wrote:
> I do see a problem, because some readers will take your example as a
> reference, as it will probably sit in a page that
> google^Wsearch_engines will bring at the top of search results for
> next ten years or so.
>
> (I bet for "sys_indirect syscall" -> http://lwn.net/Articles/258708/ )
>
> Next time you post it, please warn users that it will break in some
> years, or state clearly this should only be used internally by glibc.
dont be silly, next time Ulrich should also warn everyone that running
attachments and applying patches from untrusted sources is dangerous?
any code that includes:
fd = syscall (__NR_indirect, &r, &i, sizeof (i));
is by definition broken and unportable in every sense of the word. Apps
will use the proper glibc interfaces (if it's exposed).
really, lets discuss _real_ issues. If you cannot think of any, please
stay silent.
Ingo
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
H. Peter Anvin wrote:
> What bothers me about the sys_indirect approach is that it will get
> increasingly expensive as time goes on, and in doing so it does a
> user-space memory reference, which are extra expensive. The extra table
> can be colocated with the main table (a structure, in effect) so they'll
> share the same cache line.
You assume that using sys_indirect will be the norm. It won't. We
mustn't design system calls deliberately wrong so that they require the
indirection.
Beside, if the number of syscalls which has to be handled this way grows
we can use something more efficient for large numbers of test than a
switch statement. It could even be a word next to the system call table.
But I still don't see that the magic encoding is a valid solution, it
doesn't address the limited parameter number. Plus, using sys_indirect
could in future be used to transport entire parameters (like a sigset_t)
along with other information, thereby saving individual copy operations.
I think the sys_indirect approach is the way forward. I'll submit a
last version of the patch in a bit.
- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
iD8DBQFHQlRw2ijCOnn/RHQRApifAKDE1nZqRbm4cJxbhobBb7jCx1T00QCgiSa0
EXKjL2Gwu3atSLSD+Rb4yO4=
=6ZGt
-----END PGP SIGNATURE-----
Ulrich Drepper wrote:
>
> But I still don't see that the magic encoding is a valid solution, it
> doesn't address the limited parameter number. Plus, using sys_indirect
> could in future be used to transport entire parameters (like a sigset_t)
> along with other information, thereby saving individual copy operations.
>
The limited number of parameters is a non-issue, we already have the
convention for that: for more than 6 parameters, pass a parameter to
arguments 6 and higher in the register normally used for parameter 6.
Now, for the specific case of x86-64 (as well as some of the RISC
architectures), this meshes poorly with the C calling convention, which
is that parameter 7+ are passed on the stack. We would obtain a more
efficient calling convention by allocating an additional register for
such an indirect pointer and/or adopt the convention that the additional
parameters are simply stored on the user stack starting at a specific
offset, presumably +8.
I would really like to see a systematic calling convention that doesn't
have limits that we have already broken several times. In particular, I
would like to see a convention that can be mapped 1:1 onto the platform
C calling convention in a syscall-independent way. We *almost* have
this today, but sys_indirect would blow that out of the water in a
particularly ugly way.(*)
> I think the sys_indirect approach is the way forward. I'll submit a
> last version of the patch in a bit.
I think it is a horrible kluge. It's yet another multiplexer, which we
are trying desperately to avoid in the kernel. Just to make things more
painful, it is a multiplexer which creates yet another ad hoc calling
convention, whereas we should strive to make the kernel calling
convention as uniform as possible.
If is is NOT going to be used in the common case, then it really doesn't
matter -- we can just use manually or automatically generated thunks.
It's not like we have thousands of system calls, and it's not like "a
system call" is a precious thing.
If it IS going to be used in the common case, then we should use
something more streamlined.
-hpa
On Mon, 19 Nov 2007, Ingo Molnar wrote:
>
> * Eric Dumazet <[email protected]> wrote:
>
> > I do see a problem, because some readers will take your example as a
> > reference, as it will probably sit in a page that
> > google^Wsearch_engines will bring at the top of search results for
> > next ten years or so.
> >
> > (I bet for "sys_indirect syscall" -> http://lwn.net/Articles/258708/ )
> >
> > Next time you post it, please warn users that it will break in some
> > years, or state clearly this should only be used internally by glibc.
>
> dont be silly, next time Ulrich should also warn everyone that running
> attachments and applying patches from untrusted sources is dangerous?
>
> any code that includes:
>
> fd = syscall (__NR_indirect, &r, &i, sizeof (i));
>
> is by definition broken and unportable in every sense of the word. Apps
> will use the proper glibc interfaces (if it's exposed).
as an application writer how do i access accept(2) with FD_CLOEXEC
functionality? will glibc expose an accept2() with a flags param? if
so... why don't we just have an accept2() syscall?
-dean
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
dean gaudet wrote:
> as an application writer how do i access accept(2) with FD_CLOEXEC
> functionality? will glibc expose an accept2() with a flags param?
Not yet decided. There is the alternative to extend the accept()
interface to have both interfaces:
int accept(int, struct sockaddr *, socklen_t *);
and
int accept(int, struct sockaddr *, socklen_t *, int);
We can do this with type safety even in C nowadays.
> if so... why don't we just have an accept2() syscall?
If you read the mails of my first submission you'll find that I
explained this. I talked to Andrew and he favored new syscalls. But
then I talked to Linus and he favored this approach. Probably
especially because it can be used for syslets as well. And it is less
code and data than introducing new syscalls.
- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
iD8DBQFHQwhx2ijCOnn/RHQRAnezAKCkFmGwlwDZjpfKTRSUN4yLIeGTkACgtMK/
OcHdIaR8wbp848D3GU2iNYQ=
=nTu9
-----END PGP SIGNATURE-----