2018-11-20 00:24:48

by Andy Lutomirski

Subject: Cleaning up numbering for new x86 syscalls?

Hi all-

We currently have some giant turds in the way that syscalls are
numbered. We have the x86_32 table, which is totally sane other than
some legacy multiplexers. Then we have the x86_64 table, which is,
um, demented:

- The numbers don't match x86_32. I have no idea why.

- We use bit 30, which triggers in_x32_syscall(). It should have
been bit 31, but I digress.

- We have this weird set of extra x32 syscalls that start at 512.
Who wants to bet whether we have no bugs if someone does syscall with,
say, nr == 512 (i.e. not 512 | BIT(30)) or nr == (16 | BIT(30))? The
latter would be non-compat ioctl with in_x32_syscall() set and hence
in_compat_syscall() set.

- Bloody restart_syscall() has a different number on x86_64 and
x86_32, which is a big mess.
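
To make the two questionable numbers in the third bullet concrete, here is a
minimal user-space sketch of how such a number splits into a table index and
the x32 flag. The constants are taken from the description above (bit 30,
x32-specific entries from 512); the struct and function names are
hypothetical and nothing here is copied from the kernel's entry code. The
sketch does not model which table entries are 64-bit-only, so it can only
show that index 16 arrives with the x32 flag set, not whether that is
accepted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed from the text above: bit 30 marks an x32 call, and the
 * x32-specific table entries start at 512. */
#define X32_SYSCALL_BIT (UINT32_C(1) << 30)
#define X32_ONLY_BASE   UINT32_C(512)

struct decoded_nr {
    uint32_t index;      /* table index with bit 30 stripped */
    bool x32_bit;        /* bit 30 set, i.e. in_x32_syscall() would be true */
    bool x32_only_slot;  /* index lies in the 512+ range */
};

static struct decoded_nr decode_nr(uint32_t nr)
{
    struct decoded_nr d;

    d.index = nr & ~X32_SYSCALL_BIT;
    d.x32_bit = (nr & X32_SYSCALL_BIT) != 0;
    d.x32_only_slot = d.index >= X32_ONLY_BASE;
    return d;
}

/* The two cases questioned above, spelled out. */
void check_odd_cases(void)
{
    /* nr == 512: an x32-only slot reached without the x32 bit. */
    struct decoded_nr a = decode_nr(512);
    assert(a.index == 512 && !a.x32_bit && a.x32_only_slot);

    /* nr == (16 | BIT(30)): index 16 with the x32 bit set. */
    struct decoded_nr b = decode_nr(16 | X32_SYSCALL_BIT);
    assert(b.index == 16 && b.x32_bit && !b.x32_only_slot);
}
```

Calling check_odd_cases() from a test program exercises both cases; the
mismatch in each is that the flag and the range disagree.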

I propose we consider some subset of the following:

1. Introduce restart_syscall_2(). Make its number be 1024. Maybe
someday we could start using it instead of restart_syscall(). The
only issue I can see is programs that allow restart_syscall() using
seccomp but don't allow the new variant.

2. Introduce an outright ban on new syscalls with nr < 1024.

3. Introduce an outright ban on the addition of new __x32_compat
syscalls. If new compat hacks are needed, they can use
in_compat_syscall(), thank you very much.

4. Modify the wrappers of the __x32_compat entries so that they will
return -ENOSYS if in_x32_syscall() returns false.

5. Adjust the scripts so that we only have to wire up new syscalls
once. They'll have a nr above 1024, and they'll have the same nr on
all x86 variants.
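
Item 4 can be sketched in user space as follows. This is only an
illustration under stated assumptions: in_x32_syscall_stub() is a
hypothetical stand-in for the kernel's in_x32_syscall() that reads a test
flag, and the wrapped body is a placeholder rather than a real syscall.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for in_x32_syscall(): reads a flag a test can set. */
static bool fake_in_x32 = false;

static bool in_x32_syscall_stub(void)
{
    return fake_in_x32;
}

typedef long (*syscall_body_fn)(long arg);

/* Placeholder body standing in for an __x32_compat entry. */
static long x32_compat_body(long arg)
{
    return arg + 1;
}

/* Refuse to run an x32-only entry unless the call came in as x32. */
static long x32_only_wrapper(syscall_body_fn body, long arg)
{
    assert(body != NULL);
    if (!in_x32_syscall_stub())
        return -ENOSYS;
    return body(arg);
}
```

With the flag clear, x32_only_wrapper(x32_compat_body, 41) returns -ENOSYS;
with it set, the body runs and returns 42.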

Thoughts?


2018-11-20 07:59:09

by Ingo Molnar

Subject: Re: Cleaning up numbering for new x86 syscalls?


* Andy Lutomirski <[email protected]> wrote:

> Hi all-
>
> We currently have some giant turds in the way that syscalls are
> numbered. We have the x86_32 table, which is totally sane other than
> some legacy multiplexers. Then we have the x86_64 table, which is,
> um, demented:
>
> - The numbers don't match x86_32. I have no idea why.
>
> - We use bit 30, which triggers in_x32_syscall(). It should have
> been bit 31, but I digress.
>
> - We have this weird set of extra x32 syscalls that start at 512.
> Who wants to bet whether we have no bugs if someone does syscall with,
> say, nr == 512 (i.e. not 512 | BIT(30)) or nr == (16 | BIT(30))? The
> latter would be non-compat ioctl with in_x32_syscall() set and hence
> in_compat_syscall() set.
>
> - Bloody restart_syscall() has a different number on x86_64 and
> x86_32, which is a big mess.
>
> I propose we consider some subset of the following:
>
> 1. Introduce restart_syscall_2(). Make its number be 1024. Maybe
> someday we could start using it instead of restart_syscall(). The
> only issue I can see is programs that allow restart_syscall() using
> seccomp but don't allow the new variant.
>
> 2. Introduce an outright ban on new syscalls with nr < 1024.

Also let's make sure it results in a build error or boot panic if someone
tries.
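
One way the build-time half of this could look, as a rough sketch only: the
macro and function names below are hypothetical, the threshold is the 1024
from item 2, and the real tables are generated by scripts rather than by C
macros, so this merely shows the shape of the check.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdbool.h>

#define NEW_SYSCALL_MIN_NR 1024  /* threshold from item 2 */

/* Hypothetical helper: define a number for a new syscall and break the
 * build if it falls below the threshold. */
#define DEFINE_NEW_SYSCALL_NR(name, nr)                               \
    static_assert((nr) >= NEW_SYSCALL_MIN_NR,                         \
                  "new syscall " #name " must have nr >= 1024");      \
    enum { NR_##name = (nr) }

DEFINE_NEW_SYSCALL_NR(restart_syscall_2, 1024);

/* Runtime counterpart: a boot-time check could refuse (or panic on)
 * any new entry for which this returns false. */
static bool new_syscall_nr_allowed(unsigned int nr)
{
    return nr >= NEW_SYSCALL_MIN_NR;
}
```

Changing the 1024 in the DEFINE_NEW_SYSCALL_NR() line to a smaller number
makes the compile fail with the message from the static_assert.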

> 3. Introduce an outright ban on the addition of new __x32_compat
> syscalls. If new compat hacks are needed, they can use
> in_compat_syscall(), thank you very much.

Here too build-time and runtime enforcement would be nice.

> 4. Modify the wrappers of the __x32_compat entries so that they will
> return -ENOSYS if in_x32_syscall() returns false.
>
> 5. Adjust the scripts so that we only have to wire up new syscalls
> once. They'll have a nr above 1024, and they'll have the same nr on
> all x86 variants.
>
> Thoughts?

Fully agreed:

6. Is x32 even used in practice? I still think it was a mistake to add it
and some significant distributions like Fedora are not enabling it.

Barring any sane way to phase out x32 support I'd suggest we implement
all your suggestions.

Thanks,

Ingo

2018-11-20 09:04:38

by Florian Weimer

Subject: Re: Cleaning up numbering for new x86 syscalls?

* Andy Lutomirski:

> 5. Adjust the scripts so that we only have to wire up new syscalls
> once. They'll have a nr above 1024, and they'll have the same nr on
> all x86 variants.

Is there a sufficiently sized gap on all other architectures as well?
The restriction to the x86 variants seems arbitrary to me.

Thanks,
Florian

2018-11-20 15:34:50

by Andy Lutomirski

Subject: Re: Cleaning up numbering for new x86 syscalls?

On Tue, Nov 20, 2018 at 1:03 AM Florian Weimer <[email protected]> wrote:
>
> * Andy Lutomirski:
>
> > 5. Adjust the scripts so that we only have to wire up new syscalls
> > once. They'll have a nr above 1024, and they'll have the same nr on
> > all x86 variants.
>
> Is there a sufficiently sized gap on all other architectures as well?
> The restriction to the x86 variants seems arbitrary to me.
>

Fair point. We have this shiny "generic" syscall list. Maybe we can
get x86 synced up with it for new syscalls.

2018-11-20 16:49:53

by Tycho Andersen

Subject: Re: Cleaning up numbering for new x86 syscalls?

On Mon, Nov 19, 2018 at 04:22:49PM -0800, Andy Lutomirski wrote:
> Hi all-
>
> We currently have some giant turds in the way that syscalls are
> numbered. We have the x86_32 table, which is totally sane other than
> some legacy multiplexers. Then we have the x86_64 table, which is,
> um, demented:
>
> - The numbers don't match x86_32. I have no idea why.
>
> - We use bit 30, which triggers in_x32_syscall(). It should have
> been bit 31, but I digress.
>
> - We have this weird set of extra x32 syscalls that start at 512.
> Who wants to bet whether we have no bugs if someone does syscall with,
> say, nr == 512 (i.e. not 512 | BIT(30)) or nr == (16 | BIT(30))? The
> latter would be non-compat ioctl with in_x32_syscall() set and hence
> in_compat_syscall() set.
>
> - Bloody restart_syscall() has a different number on x86_64 and
> x86_32, which is a big mess.
>
> I propose we consider some subset of the following:
>
> 1. Introduce restart_syscall_2(). Make its number be 1024. Maybe
> someday we could start using it instead of restart_syscall(). The
> only issue I can see is programs that allow restart_syscall() using
> seccomp but don't allow the new variant.
>
> 2. Introduce an outright ban on new syscalls with nr < 1024.
>
> 3. Introduce an outright ban on the addition of new __x32_compat
> syscalls. If new compat hacks are needed, they can use
> in_compat_syscall(), thank you very much.
>
> 4. Modify the wrappers of the __x32_compat entries so that they will
> return -ENOSYS if in_x32_syscall() returns false.

This sounds like a great idea independent of all of this.

> 5. Adjust the scripts so that we only have to wire up new syscalls
> once. They'll have a nr above 1024, and they'll have the same nr on
> all x86 variants.
>
> Thoughts?

+1. Who wants to do it? :D

Tycho

2018-11-20 18:18:32

by Josh Poimboeuf

Subject: Re: Cleaning up numbering for new x86 syscalls?

On Tue, Nov 20, 2018 at 07:23:09AM -0800, Andy Lutomirski wrote:
> On Tue, Nov 20, 2018 at 1:03 AM Florian Weimer <[email protected]> wrote:
> >
> > * Andy Lutomirski:
> >
> > > 5. Adjust the scripts so that we only have to wire up new syscalls
> > > once. They'll have a nr above 1024, and they'll have the same nr on
> > > all x86 variants.
> >
> > Is there a sufficiently sized gap on all other architectures as well?
> > The restriction to the x86 variants seems arbitrary to me.
> >
>
> Fair point. We have this shiny "generic" syscall list. Maybe we can
> get x86 synced up with it for new syscalls.

I heard this discussed at Plumbers. There was a proposal to use the
same syscall numbers across architectures. Also, when adding new
generic syscalls, they want all arches to be wired up at the same time.

https://linuxplumbersconf.org/event/2/contributions/149/attachments/129/161/Ideas_to_improve_glibc_and_Kernel_interaction.pdf

Adding Adhemerval to CC.

--
Josh

2018-11-21 00:17:10

by Bernd Petrovitsch

Subject: Re: Cleaning up numbering for new x86 syscalls?

On 20/11/2018 08:33, Ingo Molnar wrote:
[...]
> 6. Is x32 even used in practice? I still think it was a mistake to add it
> and some significant distributions like Fedora are not enabling it.

x32 works as far as gcc/gas/ld is concerned (at least for compiling
non-trivial programs).
Finding a distribution that actually *delivers* x32 libraries is another
thing (and said non-trivial software uses ATM e.g. libxml2) - at least I
can't find an "x32-Ubuntu".
And no, I don't see a compelling reason to (try to) build the (n+1)th
architecture for the major distributions.
And yes, lots of stuff will not compile out of the box (especially if
one uses a somewhat sane set of gcc options - not only -Wall -Wextra
-Werror) but if one gets software to compile for i386 and x86_64,
getting it to compile for x32 is a Friday afternoon job (more or less).
And yes, there is enough hardware/systems out there that uses 64bit CPUs
(for whatever reason - if only that one can't get a 32bit CPU for that
board) but will never ever need more than 2-3 GB RAM .....

MfG,
Bernd
--
Bernd Petrovitsch Email : [email protected]
LUGA : http://www.luga.at

2018-11-21 19:08:59

by Arnd Bergmann

Subject: Re: Cleaning up numbering for new x86 syscalls?

On Tue, Nov 20, 2018 at 1:25 AM Andy Lutomirski <[email protected]> wrote:
>
> Hi all-
>
> We currently have some giant turds in the way that syscalls are
> numbered. We have the x86_32 table, which is totally sane other than
> some legacy multiplexers. Then we have the x86_64 table, which is,
> um, demented:
>
> - The numbers don't match x86_32. I have no idea why.

I think it was an early attempt at cleaning up the table, and only
adding those that were still used. Back in the days, each architecture
had its own table, and of course they started out as separate
top-level architectures.

> - We use bit 30, which triggers in_x32_syscall(). It should have
> been bit 31, but I digress.
>
> - We have this weird set of extra x32 syscalls that start at 512.
> Who wants to bet whether we have no bugs if someone does syscall with,
> say, nr == 512 (i.e. not 512 | BIT(30)) or nr == (16 | BIT(30))? The
> latter would be non-compat ioctl with in_x32_syscall() set and hence
> in_compat_syscall() set.

The comment in the table says it's purely for keeping the calls
in separate cache lines. I don't know if the cache lines make
a difference in the end, but once we start running into the x32
syscall numbers, I think we just treat them like any others; we
simply choose never to call them from a 64-bit glibc.

> I propose we consider some subset of the following:
>
> 1. Introduce restart_syscall_2(). Make its number be 1024. Maybe
> someday we could start using it instead of restart_syscall(). The
> only issue I can see is programs that allow restart_syscall() using
> seccomp but don't allow the new variant.
>
> 2. Introduce an outright ban on new syscalls with nr < 1024.

This would leave a hole of several hundred numbers if we do it
for all architectures. Wasting multiple kilobytes for a cosmetic
cleanup might be considered excessive.
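
A back-of-the-envelope version of that estimate, as a small sketch. The
inputs are illustrative assumptions, not measurements: 547 as the highest
number currently in use (roughly where the x32 range on x86_64 ends) and
one pointer-sized table entry per number.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of unused table entries between the last number in use and the
 * first number handed out under the new scheme. */
static size_t hole_bytes(unsigned int last_used_nr,
                         unsigned int first_new_nr,
                         size_t entry_size)
{
    assert(first_new_nr > last_used_nr);
    return (size_t)(first_new_nr - last_used_nr - 1) * entry_size;
}
```

With those assumptions, hole_bytes(547, 1024, 8) is 476 entries of 8 bytes,
or 3808 bytes; an architecture whose table ends around 400 with 4-byte
entries would leave hole_bytes(400, 1024, 4) = 2492 bytes unused.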

> 3. Introduce an outright ban on the addition of new __x32_compat
> syscalls. If new compat hacks are needed, they can use
> in_compat_syscall(), thank you very much.

I would definitely want to keep anything regarding x32 out of the
common syscall implementation. If you want to add on to that
pile, please do it in arch/x86, not in kernel/ or fs/.

If we decide that x32 is a failed experiment and we don't keep
it working in the future, let's just kill it off right away. I'm fairly
sure nobody depends on it for anything real, the only users I
could find are either for showing off benchmark results or for
playing around with it for fun. Most of that fun part has apparently
ended many years ago, but there is still some work going into
debian/x32. We probably need to coordinate with them and see
if they know of actual users before removing it. Popcon lists
5 active users [1] and a sharp downward trend.

> 4. Modify the wrappers of the __x32_compat entries so that they will
> return -ENOSYS if in_x32_syscall() returns false.

No objection here, but what would that help?

> 5. Adjust the scripts so that we only have to wire up new syscalls
> once. They'll have a nr above 1024, and they'll have the same nr on
> all x86 variants.
>
> Thoughts?

I would definitely welcome assigning the same syscall numbers across
all architectures. It is a needless burden for the libc developers to
figure out for each syscall which kernel is known to support it.
When a call gets added, they typically add logic to check for the
system call at runtime, but for older syscalls, it helps to know when
all architectures support it once the minimum kernel version for
a libc has been raised beyond that.

Please see also the work that Firoz Khan has been posting
for generalizing the tables on all architectures to use the
format we have on x86, arm and s390. I hope we can merge it
all for 4.21, and then build on top of that for generalization and
cleanups.

Arnd

[1] https://popcon.debian.org/stat/sub-x32.png

2018-11-21 19:09:19

by Arnd Bergmann

Subject: Re: Cleaning up numbering for new x86 syscalls?

On Tue, Nov 20, 2018 at 4:35 PM Andy Lutomirski <[email protected]> wrote:
>
> On Tue, Nov 20, 2018 at 1:03 AM Florian Weimer <[email protected]> wrote:
> >
> > * Andy Lutomirski:
> >
> > > 5. Adjust the scripts so that we only have to wire up new syscalls
> > > once. They'll have a nr above 1024, and they'll have the same nr on
> > > all x86 variants.
> >
> > Is there a sufficiently sized gap on all other architectures as well?
> > The restriction to the x86 variants seems arbitrary to me.
> >
>
> Fair point. We have this shiny "generic" syscall list. Maybe we can
> get x86 synced up with it for new syscalls.

The generic table is already a subset of the x86 tables, so there
should be no need to sync up the contents.

It's more critical on other architectures that currently lack a number
of the syscalls that got added in asm-generic and x86 recently,
so I'd like to synchronize these all and add the missing calls
to ensure that each architecture has at least all the calls from
asm-generic table.

After that, I would hope to come up with a way to add future numbers
to all tables together, either using the same numbers everywhere (plus
an offset where necessary, e.g. mips), or even have an include file
logic so we only need a single file for future additions.
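
The "same number plus an offset" idea could be sketched like this. The enum
and function names are hypothetical; the MIPS offsets are the per-ABI bases
as I understand them (4000, 5000 and 6000 for o32, n64 and n32) and should
be treated as assumptions here, as should the common number 1024 used in
the example.

```c
#include <assert.h>

enum abi { ABI_X86_64, ABI_MIPS_O32, ABI_MIPS_N64, ABI_MIPS_N32, ABI_COUNT };

/* Per-ABI base added to a number shared by all architectures. */
static const unsigned int abi_offset[ABI_COUNT] = {
    [ABI_X86_64]   = 0,
    [ABI_MIPS_O32] = 4000,
    [ABI_MIPS_N64] = 5000,
    [ABI_MIPS_N32] = 6000,
};

/* Map a shared syscall number to the number a given ABI would use. */
static unsigned int arch_syscall_nr(enum abi abi, unsigned int common_nr)
{
    assert((unsigned int)abi < ABI_COUNT);
    return abi_offset[abi] + common_nr;
}
```

A shared number 1024 would then be 1024 on x86_64 and 5024 on MIPS o32, so
a single table of future additions could serve every architecture.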

Note: for y2038, we will have to add around 20 to 25 syscalls to each
32-bit architecture, plus another 10 for those that lack the separate
sys_ipc calls.


Arnd

2018-11-30 23:27:00

by Maciej W. Rozycki

Subject: Re: Cleaning up numbering for new x86 syscalls?

On Wed, 21 Nov 2018, Bernd Petrovitsch wrote:

> And yes, lots of stuff will not compile out of the box (especially if
> one uses a somewhat sane set of gcc options - not only -Wall -Wextra
> -Werror) but if one gets software to compile for i386 and x86_64,
> getting it to compile for x32 is a Friday afternoon job (more or less).
> And yes, there is enough hardware/systems out there that uses 64bit CPUs
> (for whatever reason - if only that one can't get a 32bit CPU for that
> board) but will never ever need more than 2-3 GB RAM .....

 The functionally equivalent 64-bit ILP32 MIPS n32 ABI has been around,
supported by Linux and the GNU toolchain, for some 17 years now and people
have been using it, so by now any sane piece of software that does not use
handcoded assembly should work out of the box for the x86-64 x32 ABI as
well.

NB the important advantage of an LP64 ABI over an ILP32 ABI is the
ability to mmap(2) files that exceed 4GiB in size (and in reality even
smaller ones, as some user VM space is surely needed for other stuff),
regardless of how much physical RAM is actually supported or has been
installed.
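
 The limit is about address bits, not RAM, which a small sketch can show.
The function name is hypothetical, and the inputs are assumptions chosen for
illustration: 32 address bits for an ILP32 process, 47 for x86-64 user
space, and an arbitrary amount of address space already taken by other
mappings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Can a file of file_size bytes be mapped in one piece, given the number
 * of user address bits and the address space already in use? Physical
 * RAM does not appear anywhere in the calculation. */
static bool whole_file_mappable(uint64_t file_size,
                                unsigned int address_bits,
                                uint64_t already_used)
{
    assert(address_bits < 64);

    uint64_t address_space = UINT64_C(1) << address_bits;

    if (already_used >= address_space)
        return false;
    return file_size <= address_space - already_used;
}
```

A 5 GiB file fails with 32 address bits no matter what, a 3.5 GiB file
fails once 1 GiB is taken by other mappings, and both fit easily with 47
bits.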

And these days even a web browser can easily overrun a 4GiB VM space. :(

FWIW,

Maciej