2021-05-19 18:31:26

by H. Peter Anvin

Subject: [PATCH v4 0/6] x86/syscall: use int for x86-64 system calls

From: "H. Peter Anvin (Intel)" <[email protected]>

This patchset addresses several inconsistencies in the handling of
system call numbers in x86-64 (and x32).

Right now, *some* code will treat e.g. 0x00000001_00000001 as a system
call and some will not. Some of the code, notably in ptrace and
seccomp, will treat 0x00000001_ffffffff as a system call and some will
not.

Furthermore, right now, e.g. 335 for x86-64 will force the return value
to be set to -ENOSYS even if poked by ptrace, but 548 will not,
because there is an observable difference between a system call
number that corresponds to a hole in the tables and one that falls
outside the range of the tables.

Both of these issues are visible to the user; for example the
syscall_numbering_64 kernel selftest fails if run under ptrace for
this reason (system calls succeed with the high bits set, whereas they
fail when not being traced.)

The architecture independent code in Linux expects "int" for the
system call number, per the API documented, but not implemented, in
<asm-generic/syscall.h>: system call numbers are expected to be
"int", with -1 as the only non-system-call sentinel.

Treating the same data in multiple ways in different contexts is
confusing at best, but it also has the potential to cause security
problems (no such security problems are known at this time, however).

This is an ABI change, but it is in fact a return to the original
x86-64 ABI: the original assembly entry code would zero-extend the
system call number passed and only the bottom 32 bits were examined.

1. Consistently treat the system call number as a signed int. This is
what syscall_get_nr() already does, and therefore what all
architecture-independent code (e.g. seccomp) already expects.

2. As per the defined semantics of syscall_get_nr(), only the value -1
is defined as a non-system call, so comparing >= 0 is
incorrect. Change to != -1.

3. Call sys_ni_syscall() for system calls which are out of range
(except for -1, which is used by ptrace and seccomp as a "skip
system call" marker) just as for system call numbers that
correspond to holes in the table.

4. Update and extend the syscall_numbering_64 selftest, including
testing the system call numbering when running under ptrace.

Changes from v3:

* Reorganize the patchset to have the selftest change first.
* Add tests running under ptrace to selftest.

Changes from v2:

* Factor out and split what was a single patch in the v2 patchset; the
rest of the patches have already been applied.
* Fix the syscall_numbering_64 selftest to match the definition
changes, make its output more informative, and extend it to more
tests. Avoid using the glibc syscall() wrapper to make sure we test
what we think we are testing.
* Better documentation of the changes.

Changes from v1:

* Only -1 should be a non-system call per the cross-architectural
definition of sys_ni_syscall().
* Fix/improve patch descriptions.

---
arch/x86/entry/common.c | 93 +++--
arch/x86/entry/entry_64.S | 2 +-
arch/x86/include/asm/syscall.h | 2 +-
tools/testing/selftests/x86/syscall_numbering.c | 488 +++++++++++++++++++++---
4 files changed, 508 insertions(+), 77 deletions(-)


2021-05-19 18:31:34

by H. Peter Anvin

Subject: [PATCH v4 4/6] x86/syscall: sign-extend system calls on entry to int

From: "H. Peter Anvin (Intel)" <[email protected]>

Right now, *some* code will treat e.g. 0x0000000100000001 as a system
call and some will not. Some of the code, notably in ptrace, will
treat 0x000000018000000 as a system call and some will not. Finally,
right now, e.g. 335 for x86-64 will force the return value to be set to
-ENOSYS even if poked by ptrace, but 548 will not, because there is an
observable difference between a system call number that corresponds to
a hole in the table and one that falls outside the range of the table.

This is visible to the user: for example, the syscall_numbering_64
test fails if run under strace, because as strace uses ptrace, it ends
up clobbering the upper half of the 64-bit system call number.

The arch-independent code all assumes that a system call is "int" and
that the value -1 specifically, and not just any negative value, is used
for a non-system call. This is the case on x86 as well when
arch-independent code is involved. The arch-independent API is
defined/documented (but not *implemented*!) in
<asm-generic/syscall.h>.

This is an ABI change, but is in fact a revert to the original x86-64
ABI. The original assembly entry code would zero-extend the system
call number; this patch uses sign extend to be explicit that this is
treated as a signed number (although in practice it makes no
difference, of course) and to avoid people getting the idea of
"optimizing" it, as has happened on at least two(!) separate
occasions.

Do not store the extended value into regs->orig_ax, however: on
x86-64, the ABI is that the callee is responsible for extending
parameters, so only examining the lower 32 bits is fully consistent
with any "int" argument to any system call, e.g. regs->di for
write(2). The full value of %rax on entry to the kernel is thus still
available.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
---
arch/x86/entry/entry_64.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1d9db15fdc69..85f04ea0e368 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -108,7 +108,7 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)

/* IRQs are off. */
movq %rsp, %rdi
- movq %rax, %rsi
+ movslq %eax, %rsi
call do_syscall_64 /* returns with IRQs disabled */

/*
--
2.31.1


2021-05-19 18:31:44

by H. Peter Anvin

Subject: [PATCH v4 6/6] x86/syscall: use int everywhere for system call numbers

From: "H. Peter Anvin (Intel)" <[email protected]>

System call numbers are defined as int, so use int everywhere for
system call numbers. This patch is strictly a cleanup; it should not
change anything user visible, as all ABI changes have been done in the
preceding patches.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
---
arch/x86/entry/common.c | 93 ++++++++++++++++++++++++----------
arch/x86/include/asm/syscall.h | 2 +-
2 files changed, 66 insertions(+), 29 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index f51bc17262db..714804f0970c 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -36,49 +36,87 @@
#include <asm/irq_stack.h>

#ifdef CONFIG_X86_64
-__visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
+
+static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
+{
+ /*
+ * Convert negative numbers to very high and thus out of range
+ * numbers for comparisons. Use unsigned long to slightly
+ * improve the array_index_nospec() generated code.
+ */
+ unsigned long unr = nr;
+
+ if (likely(unr < NR_syscalls)) {
+ unr = array_index_nospec(unr, NR_syscalls);
+ regs->ax = sys_call_table[unr](regs);
+ return true;
+ }
+ return false;
+}
+
+static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)
+{
+ /*
+ * Adjust the starting offset of the table, and convert numbers
+ * < __X32_SYSCALL_BIT to very high and thus out of range
+ * numbers for comparisons. Use unsigned long to slightly
+ * improve the array_index_nospec() generated code.
+ */
+ unsigned long xnr = nr - __X32_SYSCALL_BIT;
+
+ if (IS_ENABLED(CONFIG_X86_X32_ABI) &&
+ likely(xnr < X32_NR_syscalls)) {
+ xnr = array_index_nospec(xnr, X32_NR_syscalls);
+ regs->ax = x32_sys_call_table[xnr](regs);
+ return true;
+ }
+ return false;
+}
+
+__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
{
add_random_kstack_offset();
nr = syscall_enter_from_user_mode(regs, nr);

instrumentation_begin();
- if (likely(nr < NR_syscalls)) {
- nr = array_index_nospec(nr, NR_syscalls);
- regs->ax = sys_call_table[nr](regs);
-#ifdef CONFIG_X86_X32_ABI
- } else if (likely((nr & __X32_SYSCALL_BIT) &&
- (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
- nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
- X32_NR_syscalls);
- regs->ax = x32_sys_call_table[nr](regs);
-#endif
- } else if (unlikely((int)nr != -1)) {
+
+ if (!do_syscall_x64(regs, nr) &&
+ !do_syscall_x32(regs, nr) &&
+ unlikely(nr != -1)) {
+ /* Invalid system call, but still a system call? */
regs->ax = __x64_sys_ni_syscall(regs);
}
+
instrumentation_end();
syscall_exit_to_user_mode(regs);
}
#endif

#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
-static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs)
+static __always_inline int syscall_32_enter(struct pt_regs *regs)
{
if (IS_ENABLED(CONFIG_IA32_EMULATION))
current_thread_info()->status |= TS_COMPAT;

- return (unsigned int)regs->orig_ax;
+ return (int)regs->orig_ax;
}

/*
* Invoke a 32-bit syscall. Called with IRQs on in CONTEXT_KERNEL.
*/
-static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs,
- unsigned int nr)
+static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs, int nr)
{
- if (likely(nr < IA32_NR_syscalls)) {
- nr = array_index_nospec(nr, IA32_NR_syscalls);
- regs->ax = ia32_sys_call_table[nr](regs);
- } else if (unlikely((int)nr != -1)) {
+ /*
+ * Convert negative numbers to very high and thus out of range
+ * numbers for comparisons. Use unsigned long to slightly
+ * improve the array_index_nospec() generated code.
+ */
+ unsigned long unr = nr;
+
+ if (likely(unr < IA32_NR_syscalls)) {
+ unr = array_index_nospec(unr, IA32_NR_syscalls);
+ regs->ax = ia32_sys_call_table[unr](regs);
+ } else if (unlikely(nr != -1)) {
regs->ax = __ia32_sys_ni_syscall(regs);
}
}
@@ -86,15 +124,15 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs,
/* Handles int $0x80 */
__visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
{
- unsigned int nr = syscall_32_enter(regs);
+ int nr = syscall_32_enter(regs);

add_random_kstack_offset();
/*
- * Subtlety here: if ptrace pokes something larger than 2^32-1 into
- * orig_ax, the unsigned int return value truncates it. This may
- * or may not be necessary, but it matches the old asm behavior.
+ * Subtlety here: if ptrace pokes something larger than 2^31-1 into
+ * orig_ax, the int return value truncates it. This matches
+ * the semantics of syscall_get_nr().
*/
- nr = (unsigned int)syscall_enter_from_user_mode(regs, nr);
+ nr = syscall_enter_from_user_mode(regs, nr);
instrumentation_begin();

do_syscall_32_irqs_on(regs, nr);
@@ -105,7 +143,7 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)

static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
{
- unsigned int nr = syscall_32_enter(regs);
+ int nr = syscall_32_enter(regs);
int res;

add_random_kstack_offset();
@@ -140,8 +178,7 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
return false;
}

- /* The case truncates any ptrace induced syscall nr > 2^32 -1 */
- nr = (unsigned int)syscall_enter_from_user_mode_work(regs, nr);
+ nr = syscall_enter_from_user_mode_work(regs, nr);

/* Now this is just like a normal syscall. */
do_syscall_32_irqs_on(regs, nr);
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index f6593cafdbd9..f7e2d82d24fb 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -159,7 +159,7 @@ static inline int syscall_get_arch(struct task_struct *task)
? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
}

-void do_syscall_64(struct pt_regs *regs, unsigned long nr);
+void do_syscall_64(struct pt_regs *regs, int nr);
void do_int80_syscall_32(struct pt_regs *regs);
long do_fast_syscall_32(struct pt_regs *regs);

--
2.31.1


2021-05-19 18:33:30

by H. Peter Anvin

Subject: [PATCH v4 2/6] x86/syscall: simplify message reporting in syscall_numbering.c

From: "H. Peter Anvin (Intel)" <[email protected]>

Reduce some boilerplate in printing and indenting messages in
syscall_numbering.c. This makes it easier to produce clean status
output.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
---
.../testing/selftests/x86/syscall_numbering.c | 102 ++++++++++++------
1 file changed, 71 insertions(+), 31 deletions(-)

diff --git a/tools/testing/selftests/x86/syscall_numbering.c b/tools/testing/selftests/x86/syscall_numbering.c
index 7dd86bcbee25..03915cd48cfc 100644
--- a/tools/testing/selftests/x86/syscall_numbering.c
+++ b/tools/testing/selftests/x86/syscall_numbering.c
@@ -34,6 +34,33 @@

static unsigned int nerr = 0; /* Cumulative error count */
static int nullfd = -1; /* File descriptor for /dev/null */
+static int indent = 0;
+
+static inline unsigned int offset(void)
+{
+ return 8+indent*4;
+}
+
+#define msg(lvl, fmt, ...) printf("%-*s" fmt, offset(), "[" #lvl "]", \
+ ## __VA_ARGS__)
+
+#define run(fmt, ...) msg(RUN, fmt, ## __VA_ARGS__)
+#define info(fmt, ...) msg(INFO, fmt, ## __VA_ARGS__)
+#define ok(fmt, ...) msg(OK, fmt, ## __VA_ARGS__)
+
+#define fail(fmt, ...) \
+ do { \
+ msg(FAIL, fmt, ## __VA_ARGS__); \
+ nerr++; \
+ } while (0)
+
+#define crit(fmt, ...) \
+ do { \
+ indent = 0; \
+ msg(FAIL, fmt, ## __VA_ARGS__); \
+ msg(SKIP, "Unable to run test\n"); \
+ exit(71); /* EX_OSERR */ \
+ } while (0)

/*
* Directly invokes the given syscall with nullfd as the first argument
@@ -91,28 +118,37 @@ static unsigned int _check_for(int msb, int start, int end, long long expect,
{
unsigned int err = 0;

+ indent++;
+ if (start != end)
+ indent++;
+
for (int nr = start; nr <= end; nr++) {
long long ret = probe_syscall(msb, nr);

if (ret != expect) {
- printf("[FAIL]\t %s returned %lld, but it should have returned %s\n",
+ fail("%s returned %lld, but it should have returned %s\n",
syscall_str(msb, nr, nr),
ret, expect_str);
err++;
}
}

+ if (start != end)
+ indent--;
+
if (err) {
nerr += err;
if (start != end)
- printf("[FAIL]\t %s had %u failure%s\n",
+ fail("%s had %u failure%s\n",
syscall_str(msb, start, end),
- err, (err == 1) ? "s" : "");
+ err, err != 1 ? "s" : "");
} else {
- printf("[OK]\t %s returned %s as expected\n",
- syscall_str(msb, start, end), expect_str);
+ ok("%s returned %s as expected\n",
+ syscall_str(msb, start, end), expect_str);
}

+ indent--;
+
return err;
}

@@ -137,35 +173,38 @@ static bool check_enosys(int msb, int nr)
static bool test_x32(void)
{
long long ret;
- long long mypid = getpid();
+ pid_t mypid = getpid();
+ bool with_x32;

- printf("[RUN]\tChecking for x32 by calling x32 getpid()\n");
+ run("Checking for x32 by calling x32 getpid()\n");
ret = probe_syscall(0, SYS_GETPID | X32_BIT);

+ indent++;
if (ret == mypid) {
- printf("[INFO]\t x32 is supported\n");
- return true;
+ info("x32 is supported\n");
+ with_x32 = true;
} else if (ret == -ENOSYS) {
- printf("[INFO]\t x32 is not supported\n");
- return false;
+ info("x32 is not supported\n");
+ with_x32 = false;
} else {
- printf("[FAIL]\t x32 getpid() returned %lld, but it should have returned either %lld or -ENOSYS\n", ret, mypid);
- nerr++;
- return true; /* Proceed as if... */
+ fail("x32 getpid() returned %lld, but it should have returned either %lld or -ENOSYS\n", ret, mypid);
+ with_x32 = false;
}
+ indent--;
+ return with_x32;
}

static void test_syscalls_common(int msb)
{
- printf("[RUN]\t Checking some common syscalls as 64 bit\n");
+ run("Checking some common syscalls as 64 bit\n");
check_zero(msb, SYS_READ);
check_zero(msb, SYS_WRITE);

- printf("[RUN]\t Checking some 64-bit only syscalls as 64 bit\n");
+ run("Checking some 64-bit only syscalls as 64 bit\n");
check_zero(msb, X64_READV);
check_zero(msb, X64_WRITEV);

- printf("[RUN]\t Checking out of range system calls\n");
+ run("Checking out of range system calls\n");
check_for(msb, -64, -1, -ENOSYS);
check_for(msb, X32_BIT-64, X32_BIT-1, -ENOSYS);
check_for(msb, -64-X32_BIT, -1-X32_BIT, -ENOSYS);
@@ -180,18 +219,18 @@ static void test_syscalls_with_x32(int msb)
* set. Calling them without the x32 bit set is
* nonsense and should not work.
*/
- printf("[RUN]\t Checking x32 syscalls as 64 bit\n");
+ run("Checking x32 syscalls as 64 bit\n");
check_for(msb, 512, 547, -ENOSYS);

- printf("[RUN]\t Checking some common syscalls as x32\n");
+ run("Checking some common syscalls as x32\n");
check_zero(msb, SYS_READ | X32_BIT);
check_zero(msb, SYS_WRITE | X32_BIT);

- printf("[RUN]\t Checking some x32 syscalls as x32\n");
+ run("Checking some x32 syscalls as x32\n");
check_zero(msb, X32_READV | X32_BIT);
check_zero(msb, X32_WRITEV | X32_BIT);

- printf("[RUN]\t Checking some 64-bit syscalls as x32\n");
+ run("Checking some 64-bit syscalls as x32\n");
check_enosys(msb, X64_IOCTL | X32_BIT);
check_enosys(msb, X64_READV | X32_BIT);
check_enosys(msb, X64_WRITEV | X32_BIT);
@@ -199,7 +238,7 @@ static void test_syscalls_with_x32(int msb)

static void test_syscalls_without_x32(int msb)
{
- printf("[RUN]\t Checking for absence of x32 system calls\n");
+ run("Checking for absence of x32 system calls\n");
check_for(msb, 0 | X32_BIT, 999 | X32_BIT, -ENOSYS);
}

@@ -217,14 +256,18 @@ static void test_syscall_numbering(void)
*/
for (size_t i = 0; i < sizeof(msbs)/sizeof(msbs[0]); i++) {
int msb = msbs[i];
- printf("[RUN]\tChecking system calls with msb = %d (0x%x)\n",
- msb, msb);
+ run("Checking system calls with msb = %d (0x%x)\n",
+ msb, msb);
+
+ indent++;

test_syscalls_common(msb);
if (with_x32)
test_syscalls_with_x32(msb);
else
test_syscalls_without_x32(msb);
+
+ indent--;
}
}

@@ -241,19 +284,16 @@ int main(void)
*/
nullfd = open("/dev/null", O_RDWR);
if (nullfd < 0) {
- printf("[FAIL]\tUnable to open /dev/null: %s\n",
- strerror(errno));
- printf("[SKIP]\tCannot execute test\n");
- return 71; /* EX_OSERR */
+ crit("Unable to open /dev/null: %s\n", strerror(errno));
}

test_syscall_numbering();
if (!nerr) {
- printf("[OK]\tAll system calls succeeded or failed as expected\n");
+ ok("All system calls succeeded or failed as expected\n");
return 0;
} else {
- printf("[FAIL]\tA total of %u system call%s had incorrect behavior\n",
- nerr, nerr != 1 ? "s" : "");
+ fail("A total of %u system call%s had incorrect behavior\n",
+ nerr, nerr != 1 ? "s" : "");
return 1;
}
}
--
2.31.1


2021-05-19 20:16:33

by H. Peter Anvin

Subject: Re: [PATCH v4 0/6] x86/syscall: use int for x86-64 system calls

:)

On May 19, 2021 4:29:29 AM PDT, Ingo Molnar <[email protected]> wrote:
>
>* H. Peter Anvin <[email protected]> wrote:
>
>> From: "H. Peter Anvin (Intel)" <[email protected]>
>>
>> This patchset addresses several inconsistencies in the handling of
>> system call numbers in x86-64 (and x32).
>
>> arch/x86/entry/common.c | 93 +++--
>> arch/x86/entry/entry_64.S | 2 +-
>> arch/x86/include/asm/syscall.h | 2 +-
>> tools/testing/selftests/x86/syscall_numbering.c | 488 +++++++++++++++++++++---
>> 4 files changed, 508 insertions(+), 77 deletions(-)
>
>Thanks Peter - this series is really nice now, and I agree that this
>inconsistency should be fixed.
>
>Thanks,
>
> Ingo

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2021-05-19 21:05:08

by H. Peter Anvin

Subject: [PATCH v4 5/6] x86/syscall: treat out of range and gap system calls the same

From: "H. Peter Anvin (Intel)" <[email protected]>

The current 64-bit system call entry code treats out-of-range system
calls differently than system calls that map to a hole in the system
call table. This is visible to the user if system calls are
intercepted via ptrace or seccomp and the return value (regs->ax) is
modified: in the former case, the return value is preserved, and in
the latter case, sys_ni_syscall() is called and the return value is
forced to -ENOSYS.

The API spec in <asm-generic/syscall.h> is very clear that only
(int)-1 is the non-system-call sentinel value, so make the system call
behavior consistent by calling sys_ni_syscall() for all invalid system
call numbers except for -1.

Although currently sys_ni_syscall() simply returns -ENOSYS, calling it
explicitly is friendly for tracing and future possible extensions, and
as this is an error path there is no reason to optimize it.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
---
arch/x86/entry/common.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 00da0f5420de..f51bc17262db 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -52,6 +52,8 @@ __visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
X32_NR_syscalls);
regs->ax = x32_sys_call_table[nr](regs);
#endif
+ } else if (unlikely((int)nr != -1)) {
+ regs->ax = __x64_sys_ni_syscall(regs);
}
instrumentation_end();
syscall_exit_to_user_mode(regs);
@@ -76,6 +78,8 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs,
if (likely(nr < IA32_NR_syscalls)) {
nr = array_index_nospec(nr, IA32_NR_syscalls);
regs->ax = ia32_sys_call_table[nr](regs);
+ } else if (unlikely((int)nr != -1)) {
+ regs->ax = __ia32_sys_ni_syscall(regs);
}
}

--
2.31.1


2021-05-19 21:10:35

by Ingo Molnar

Subject: Re: [PATCH v4 0/6] x86/syscall: use int for x86-64 system calls


* H. Peter Anvin <[email protected]> wrote:

> From: "H. Peter Anvin (Intel)" <[email protected]>
>
> This patchset addresses several inconsistencies in the handling of
> system call numbers in x86-64 (and x32).

> arch/x86/entry/common.c | 93 +++--
> arch/x86/entry/entry_64.S | 2 +-
> arch/x86/include/asm/syscall.h | 2 +-
> tools/testing/selftests/x86/syscall_numbering.c | 488 +++++++++++++++++++++---
> 4 files changed, 508 insertions(+), 77 deletions(-)

Thanks Peter - this series is really nice now, and I agree that this
inconsistency should be fixed.

Thanks,

Ingo


2021-05-20 08:56:01

by Thomas Gleixner

Subject: Re: [PATCH v4 6/6] x86/syscall: use int everywhere for system call numbers

On Tue, May 18 2021 at 12:13, H. Peter Anvin wrote:
> +static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
> +{
> + /*
> + * Convert negative numbers to very high and thus out of range
> + * numbers for comparisons. Use unsigned long to slightly
> + * improve the array_index_nospec() generated code.

How is that actually improving the generated code?

unsigned long:

104: 48 81 fa bf 01 00 00 cmp $0x1bf,%rdx
10b: 48 19 c0 sbb %rax,%rax
10e: 48 21 c2 and %rax,%rdx
111: 48 89 df mov %rbx,%rdi
114: 48 8b 04 d5 00 00 00 mov 0x0(,%rdx,8),%rax
11b: 00
11c: e8 00 00 00 00 callq 121 <do_syscall_64+0x41>

unsigned int:

f1: 48 81 fa bf 01 00 00 cmp $0x1bf,%rdx
f8: 48 19 d2 sbb %rdx,%rdx
fb: 21 d0 and %edx,%eax
fd: 48 89 df mov %rbx,%rdi
100: 48 8b 04 c5 00 00 00 mov 0x0(,%rax,8),%rax
107: 00
108: e8 00 00 00 00 callq 10d <do_syscall_64+0x3d>

Text size increases with that unsigned long cast.

I must be missing something.

Thanks,

tglx

Subject: [tip: x86/entry] x86/entry/64: Sign-extend system calls on entry to int

The following commit has been merged into the x86/entry branch of tip:

Commit-ID: 0595494891723a1dcca5eaa8eeca8ab54ad953b9
Gitweb: https://git.kernel.org/tip/0595494891723a1dcca5eaa8eeca8ab54ad953b9
Author: H. Peter Anvin (Intel) <[email protected]>
AuthorDate: Tue, 18 May 2021 12:13:01 -07:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Thu, 20 May 2021 15:19:49 +02:00

x86/entry/64: Sign-extend system calls on entry to int

Right now, *some* code will treat e.g. 0x0000000100000001 as a system
call and some will not. Some of the code, notably in ptrace, will
treat 0x000000018000000 as a system call and some will not. Finally,
right now, e.g. 335 for x86-64 will force the return value to be set to
-ENOSYS even if poked by ptrace, but 548 will not, because there is an
observable difference between a system call number that corresponds to
a hole in the table and one that falls outside the range of the table.

This is visible to the user: for example, the syscall_numbering_64
test fails if run under strace, because as strace uses ptrace, it ends
up clobbering the upper half of the 64-bit system call number.

The architecture independent code all assumes that a system call is "int"
and that the value -1 specifically, and not just any negative value, is
used for a non-system call. This is the case on x86 as well when
arch-independent code is involved. The arch-independent API is
defined/documented (but not *implemented*!) in <asm-generic/syscall.h>.

This is an ABI change, but is in fact a revert to the original x86-64
ABI. The original assembly entry code would zero-extend the system call
number.

Use sign extend to be explicit that this is treated as a signed number
(although in practice it makes no difference, of course) and to avoid
people getting the idea of "optimizing" it, as has happened on at least
two(!) separate occasions.

Do not store the extended value into regs->orig_ax, however: on x86-64, the
ABI is that the callee is responsible for extending parameters, so only
examining the lower 32 bits is fully consistent with any "int" argument to
any system call, e.g. regs->di for write(2). The full value of %rax on
entry to the kernel is thus still available.

[ tglx: Add a comment to the ASM code ]

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

---
arch/x86/entry/entry_64.S | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1d9db15..a5f02d0 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -108,7 +108,8 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)

/* IRQs are off. */
movq %rsp, %rdi
- movq %rax, %rsi
+ /* Sign extend the lower 32bit as syscall numbers are treated as int */
+ movslq %eax, %rsi
call do_syscall_64 /* returns with IRQs disabled */

/*

Subject: [tip: x86/entry] selftests/x86/syscall: Simplify message reporting in syscall_numbering

The following commit has been merged into the x86/entry branch of tip:

Commit-ID: c5c39488dcb5f818bb07f856a349262d667ef147
Gitweb: https://git.kernel.org/tip/c5c39488dcb5f818bb07f856a349262d667ef147
Author: H. Peter Anvin (Intel) <[email protected]>
AuthorDate: Tue, 18 May 2021 12:12:59 -07:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Thu, 20 May 2021 15:19:48 +02:00

selftests/x86/syscall: Simplify message reporting in syscall_numbering

Reduce some boilerplate in printing and indenting messages.
This makes it easier to produce clean status output.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

---
tools/testing/selftests/x86/syscall_numbering.c | 103 ++++++++++-----
1 file changed, 72 insertions(+), 31 deletions(-)

diff --git a/tools/testing/selftests/x86/syscall_numbering.c b/tools/testing/selftests/x86/syscall_numbering.c
index 7dd86bc..434fe0e 100644
--- a/tools/testing/selftests/x86/syscall_numbering.c
+++ b/tools/testing/selftests/x86/syscall_numbering.c
@@ -16,6 +16,7 @@
#include <string.h>
#include <fcntl.h>
#include <limits.h>
+#include <sysexits.h>

/* Common system call numbers */
#define SYS_READ 0
@@ -34,6 +35,33 @@

static unsigned int nerr = 0; /* Cumulative error count */
static int nullfd = -1; /* File descriptor for /dev/null */
+static int indent = 0;
+
+static inline unsigned int offset(void)
+{
+ return 8 + indent * 4;
+}
+
+#define msg(lvl, fmt, ...) printf("%-*s" fmt, offset(), "[" #lvl "]", \
+ ## __VA_ARGS__)
+
+#define run(fmt, ...) msg(RUN, fmt, ## __VA_ARGS__)
+#define info(fmt, ...) msg(INFO, fmt, ## __VA_ARGS__)
+#define ok(fmt, ...) msg(OK, fmt, ## __VA_ARGS__)
+
+#define fail(fmt, ...) \
+ do { \
+ msg(FAIL, fmt, ## __VA_ARGS__); \
+ nerr++; \
+ } while (0)
+
+#define crit(fmt, ...) \
+ do { \
+ indent = 0; \
+ msg(FAIL, fmt, ## __VA_ARGS__); \
+ msg(SKIP, "Unable to run test\n"); \
+ exit(EX_OSERR); \
+ } while (0)

/*
* Directly invokes the given syscall with nullfd as the first argument
@@ -91,28 +119,37 @@ static unsigned int _check_for(int msb, int start, int end, long long expect,
{
unsigned int err = 0;

+ indent++;
+ if (start != end)
+ indent++;
+
for (int nr = start; nr <= end; nr++) {
long long ret = probe_syscall(msb, nr);

if (ret != expect) {
- printf("[FAIL]\t %s returned %lld, but it should have returned %s\n",
+ fail("%s returned %lld, but it should have returned %s\n",
syscall_str(msb, nr, nr),
ret, expect_str);
err++;
}
}

+ if (start != end)
+ indent--;
+
if (err) {
nerr += err;
if (start != end)
- printf("[FAIL]\t %s had %u failure%s\n",
+ fail("%s had %u failure%s\n",
syscall_str(msb, start, end),
- err, (err == 1) ? "s" : "");
+ err, err != 1 ? "s" : "");
} else {
- printf("[OK]\t %s returned %s as expected\n",
- syscall_str(msb, start, end), expect_str);
+ ok("%s returned %s as expected\n",
+ syscall_str(msb, start, end), expect_str);
}

+ indent--;
+
return err;
}

@@ -137,35 +174,38 @@ static bool check_enosys(int msb, int nr)
static bool test_x32(void)
{
long long ret;
- long long mypid = getpid();
+ pid_t mypid = getpid();
+ bool with_x32;

- printf("[RUN]\tChecking for x32 by calling x32 getpid()\n");
+ run("Checking for x32 by calling x32 getpid()\n");
ret = probe_syscall(0, SYS_GETPID | X32_BIT);

+ indent++;
if (ret == mypid) {
- printf("[INFO]\t x32 is supported\n");
- return true;
+ info("x32 is supported\n");
+ with_x32 = true;
} else if (ret == -ENOSYS) {
- printf("[INFO]\t x32 is not supported\n");
- return false;
+ info("x32 is not supported\n");
+ with_x32 = false;
} else {
- printf("[FAIL]\t x32 getpid() returned %lld, but it should have returned either %lld or -ENOSYS\n", ret, mypid);
- nerr++;
- return true; /* Proceed as if... */
+ fail("x32 getpid() returned %lld, but it should have returned either %lld or -ENOSYS\n", ret, mypid);
+ with_x32 = false;
}
+ indent--;
+ return with_x32;
}

static void test_syscalls_common(int msb)
{
- printf("[RUN]\t Checking some common syscalls as 64 bit\n");
+ run("Checking some common syscalls as 64 bit\n");
check_zero(msb, SYS_READ);
check_zero(msb, SYS_WRITE);

- printf("[RUN]\t Checking some 64-bit only syscalls as 64 bit\n");
+ run("Checking some 64-bit only syscalls as 64 bit\n");
check_zero(msb, X64_READV);
check_zero(msb, X64_WRITEV);

- printf("[RUN]\t Checking out of range system calls\n");
+ run("Checking out of range system calls\n");
check_for(msb, -64, -1, -ENOSYS);
check_for(msb, X32_BIT-64, X32_BIT-1, -ENOSYS);
check_for(msb, -64-X32_BIT, -1-X32_BIT, -ENOSYS);
@@ -180,18 +220,18 @@ static void test_syscalls_with_x32(int msb)
* set. Calling them without the x32 bit set is
* nonsense and should not work.
*/
- printf("[RUN]\t Checking x32 syscalls as 64 bit\n");
+ run("Checking x32 syscalls as 64 bit\n");
check_for(msb, 512, 547, -ENOSYS);

- printf("[RUN]\t Checking some common syscalls as x32\n");
+ run("Checking some common syscalls as x32\n");
check_zero(msb, SYS_READ | X32_BIT);
check_zero(msb, SYS_WRITE | X32_BIT);

- printf("[RUN]\t Checking some x32 syscalls as x32\n");
+ run("Checking some x32 syscalls as x32\n");
check_zero(msb, X32_READV | X32_BIT);
check_zero(msb, X32_WRITEV | X32_BIT);

- printf("[RUN]\t Checking some 64-bit syscalls as x32\n");
+ run("Checking some 64-bit syscalls as x32\n");
check_enosys(msb, X64_IOCTL | X32_BIT);
check_enosys(msb, X64_READV | X32_BIT);
check_enosys(msb, X64_WRITEV | X32_BIT);
@@ -199,7 +239,7 @@ static void test_syscalls_with_x32(int msb)

static void test_syscalls_without_x32(int msb)
{
- printf("[RUN]\t Checking for absence of x32 system calls\n");
+ run("Checking for absence of x32 system calls\n");
check_for(msb, 0 | X32_BIT, 999 | X32_BIT, -ENOSYS);
}

@@ -217,14 +257,18 @@ static void test_syscall_numbering(void)
*/
for (size_t i = 0; i < sizeof(msbs)/sizeof(msbs[0]); i++) {
int msb = msbs[i];
- printf("[RUN]\tChecking system calls with msb = %d (0x%x)\n",
- msb, msb);
+ run("Checking system calls with msb = %d (0x%x)\n",
+ msb, msb);
+
+ indent++;

test_syscalls_common(msb);
if (with_x32)
test_syscalls_with_x32(msb);
else
test_syscalls_without_x32(msb);
+
+ indent--;
}
}

@@ -241,19 +285,16 @@ int main(void)
*/
nullfd = open("/dev/null", O_RDWR);
if (nullfd < 0) {
- printf("[FAIL]\tUnable to open /dev/null: %s\n",
- strerror(errno));
- printf("[SKIP]\tCannot execute test\n");
- return 71; /* EX_OSERR */
+ crit("Unable to open /dev/null: %s\n", strerror(errno));
}

test_syscall_numbering();
if (!nerr) {
- printf("[OK]\tAll system calls succeeded or failed as expected\n");
+ ok("All system calls succeeded or failed as expected\n");
return 0;
} else {
- printf("[FAIL]\tA total of %u system call%s had incorrect behavior\n",
- nerr, nerr != 1 ? "s" : "");
+ fail("A total of %u system call%s had incorrect behavior\n",
+ nerr, nerr != 1 ? "s" : "");
return 1;
}
}

Subject: [tip: x86/entry] x86/entry: Treat out of range and gap system calls the same

The following commit has been merged into the x86/entry branch of tip:

Commit-ID: b337b4965e3a3e567f11828a9e3fe3fb3faefa47
Gitweb: https://git.kernel.org/tip/b337b4965e3a3e567f11828a9e3fe3fb3faefa47
Author: H. Peter Anvin (Intel) <[email protected]>
AuthorDate: Tue, 18 May 2021 12:13:02 -07:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Thu, 20 May 2021 15:19:49 +02:00

x86/entry: Treat out of range and gap system calls the same

The current 64-bit system call entry code treats out-of-range system
calls differently than system calls that map to a hole in the system
call table.

This is visible to the user if system calls are intercepted via ptrace or
seccomp and the return value (regs->ax) is modified: for an out-of-range
number the return value is preserved, whereas for a hole in the table
sys_ni_syscall() is called and the return value is forced to -ENOSYS.

The API spec in <asm-generic/syscalls.h> is very clear that only
(int)-1 is the non-system-call sentinel value, so make the system call
behavior consistent by calling sys_ni_syscall() for all invalid system
call numbers except for -1.

Although currently sys_ni_syscall() simply returns -ENOSYS, calling it
explicitly is friendly for tracing and future possible extensions, and
as this is an error path there is no reason to optimize it.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

---
arch/x86/entry/common.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 00da0f5..f51bc17 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -52,6 +52,8 @@ __visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
X32_NR_syscalls);
regs->ax = x32_sys_call_table[nr](regs);
#endif
+ } else if (unlikely((int)nr != -1)) {
+ regs->ax = __x64_sys_ni_syscall(regs);
}
instrumentation_end();
syscall_exit_to_user_mode(regs);
@@ -76,6 +78,8 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs,
if (likely(nr < IA32_NR_syscalls)) {
nr = array_index_nospec(nr, IA32_NR_syscalls);
regs->ax = ia32_sys_call_table[nr](regs);
+ } else if (unlikely((int)nr != -1)) {
+ regs->ax = __ia32_sys_ni_syscall(regs);
}
}
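
The behaviour this commit describes can be modelled in user space. The following is only a sketch under stated assumptions: struct model_regs, the model_* functions and the four-entry table are hypothetical stand-ins invented for illustration, not kernel code. It shows that a gap in the table and an out-of-range number both end up with -ENOSYS, while -1 leaves a value poked into regs->ax untouched.

```c
#include <errno.h>

/* Hypothetical stand-in for struct pt_regs: only the return-value slot. */
struct model_regs {
	long ax;
};

#define MODEL_NR_SYSCALLS 4u	/* made-up table size, for illustration only */

static long model_sys_ok(struct model_regs *regs)
{
	(void)regs;
	return 0;
}

/* Stand-in for sys_ni_syscall(): always fails with -ENOSYS. */
static long model_sys_ni(struct model_regs *regs)
{
	(void)regs;
	return -ENOSYS;
}

/* Slot 2 models a "gap": a number inside the table with no implementation. */
static long (*const model_table[MODEL_NR_SYSCALLS])(struct model_regs *) = {
	model_sys_ok, model_sys_ok, model_sys_ni, model_sys_ok,
};

static void model_do_syscall(struct model_regs *regs, unsigned int nr)
{
	if (nr < MODEL_NR_SYSCALLS) {
		regs->ax = model_table[nr](regs);
	} else if (nr != (unsigned int)-1) {
		/* Same test as "(int)nr != -1" in the patch: invalid, but a syscall. */
		regs->ax = model_sys_ni(regs);
	}
	/* nr == -1: not a system call, so regs->ax keeps whatever a tracer set. */
}
```

With regs->ax preset to some marker value, calling model_do_syscall() with 2 (the gap) or 1000 (out of range) overwrites it with -ENOSYS, and calling it with (unsigned int)-1 leaves the marker in place.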

2021-05-21 21:37:52

by H. Peter Anvin

Subject: Re: [PATCH v4 6/6] x86/syscall: use int everywhere for system call numbers



On 5/20/21 1:53 AM, Thomas Gleixner wrote:
> On Tue, May 18 2021 at 12:13, H. Peter Anvin wrote:
>> +static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
>> +{
>> + /*
>> + * Convert negative numbers to very high and thus out of range
>> + * numbers for comparisons. Use unsigned long to slightly
>> + * improve the array_index_nospec() generated code.
>
> How is that actually improving the generated code?
>
> unsigned long:
>
> 104: 48 81 fa bf 01 00 00 cmp $0x1bf,%rdx
> 10b: 48 19 c0 sbb %rax,%rax
> 10e: 48 21 c2 and %rax,%rdx
> 111: 48 89 df mov %rbx,%rdi
> 114: 48 8b 04 d5 00 00 00 mov 0x0(,%rdx,8),%rax
> 11b: 00
> 11c: e8 00 00 00 00 callq 121 <do_syscall_64+0x41>
>
> unsigned int:
>
> f1: 48 81 fa bf 01 00 00 cmp $0x1bf,%rdx
> f8: 48 19 d2 sbb %rdx,%rdx
> fb: 21 d0 and %edx,%eax
> fd: 48 89 df mov %rbx,%rdi
> 100: 48 8b 04 c5 00 00 00 mov 0x0(,%rax,8),%rax
> 107: 00
> 108: e8 00 00 00 00 callq 10d <do_syscall_64+0x3d>
>
> Text size increases with that unsigned long cast.
>
> I must be missing something.
>

"unsigned long" gave slightly better code than "int", but as you
correctly point out here, "unsigned int" is even better.

Thanks for catching that.

-hpa

2021-05-22 13:21:02

by David Laight

Subject: RE: [PATCH v4 6/6] x86/syscall: use int everywhere for system call numbers

From: H. Peter Anvin
> Sent: 21 May 2021 22:37
>
> On 5/20/21 1:53 AM, Thomas Gleixner wrote:
> > On Tue, May 18 2021 at 12:13, H. Peter Anvin wrote:
> >> +static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
> >> +{
> >> + /*
> >> + * Convert negative numbers to very high and thus out of range
> >> + * numbers for comparisons. Use unsigned long to slightly
> >> + * improve the array_index_nospec() generated code.
> >
> > How is that actually improving the generated code?
> >
> > unsigned long:
> >
> > 104: 48 81 fa bf 01 00 00 cmp $0x1bf,%rdx
> > 10b: 48 19 c0 sbb %rax,%rax
> > 10e: 48 21 c2 and %rax,%rdx
> > 111: 48 89 df mov %rbx,%rdi
> > 114: 48 8b 04 d5 00 00 00 mov 0x0(,%rdx,8),%rax
> > 11b: 00
> > 11c: e8 00 00 00 00 callq 121 <do_syscall_64+0x41>
> >
> > unsigned int:
> >
> > f1: 48 81 fa bf 01 00 00 cmp $0x1bf,%rdx
> > f8: 48 19 d2 sbb %rdx,%rdx
> > fb: 21 d0 and %edx,%eax
> > fd: 48 89 df mov %rbx,%rdi
> > 100: 48 8b 04 c5 00 00 00 mov 0x0(,%rax,8),%rax
> > 107: 00
> > 108: e8 00 00 00 00 callq 10d <do_syscall_64+0x3d>
> >
> > Text size increases with that unsigned long cast.
> >
> > I must be missing something.
> >
>
> "unsigned long" gave slightly better code than "int", but as you
> correctly point out here, "unsigned int" is even better.

Indexing arrays with 'int' almost always ends up generating
an extra instruction to sign-extend the 32-bit value to 64 bits.
This lengthens the register dependency chain and is likely to
add a clock.

OTOH using 'unsigned int' can save a REX prefix (as here),
marginally reducing the cache footprint.
That might speed it up, but may slow it down!
It rather depends on the exact alignment of instructions
relative to (on Intel CPUs) the 16-byte fetch/decode blocks.

Looking at the above code, out-of-range values get masked
to zero to ensure that speculative execution doesn't expose
anything.
If the syscall number is offset by one before masking,
a zero will only be generated for invalid values:

https://godbolt.org/z/av839bsxf

bool do_syscall_x64(struct pt_regs *regs, int nr)
{
	unsigned long unr = nr + 1;

	unr = array_index_nospec(unr, NR_syscalls + 1);
	if (!unr)
		return false;
	regs->ax = sys_call_table[unr - 1](regs);
	return true;
}

This speeds up the native system calls with a slight slowdown
of the compat ones.

In principle sys_call_table[] could be offset by one,
so that invalid numbers go through sys_call_table[0].
You wouldn't want to do this if a second table follows.

I'm also seeing better code for 'unsigned long'.
Probably because array_index_mask_nospec() is defined for long.

David
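
The masking idea discussed in this message can be modelled in portable C. This is a sketch only: model_index_mask() is a hypothetical branch-free helper written for illustration, not the x86 array_index_mask_nospec() (which, per the disassembly quoted above, compiles to cmp/sbb), and MODEL_NR is a made-up table size. It assumes the size fits in a signed long. model_lookup() applies the offset-by-one trick so that a masked result of zero can only mean "invalid".

```c
#include <limits.h>

#define MODEL_NR 4UL	/* made-up table size, for illustration only */

/*
 * All ones if index < size, zero otherwise, computed without a branch.
 * Assumes size <= LONG_MAX: then the top bit of
 * (index | (size - 1 - index)) is set exactly when index >= size.
 */
static unsigned long model_index_mask(unsigned long index, unsigned long size)
{
	unsigned long top = (index | (size - 1UL - index)) >>
			    (sizeof(long) * CHAR_BIT - 1);

	return top - 1UL;	/* top == 0 gives ~0UL, top == 1 gives 0 */
}

/*
 * Offset the number by one before masking, so that zero after masking
 * means "invalid". Returns the table slot, or -1 if nr is not valid.
 */
static int model_lookup(int nr)
{
	/* Convert first, then add: avoids signed overflow for INT_MAX. */
	unsigned long unr = (unsigned long)nr + 1UL;

	unr &= model_index_mask(unr, MODEL_NR + 1UL);
	if (!unr)
		return -1;
	return (int)(unr - 1UL);
}
```

In this model both -1 and any other out-of-range number collapse to zero after masking, so telling -1 apart from other invalid numbers would still need the separate comparison that the merged patch uses.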


Subject: [tip: x86/entry] x86/entry: Use int everywhere for system call numbers

The following commit has been merged into the x86/entry branch of tip:

Commit-ID: 2978996f620001f4e748c79af0fe89be729ef58d
Gitweb: https://git.kernel.org/tip/2978996f620001f4e748c79af0fe89be729ef58d
Author: H. Peter Anvin (Intel) <[email protected]>
AuthorDate: Tue, 18 May 2021 12:13:03 -07:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Tue, 25 May 2021 10:07:00 +02:00

x86/entry: Use int everywhere for system call numbers

System call numbers are defined as int, so use int everywhere for system
call numbers. This is strictly a cleanup; it should not change anything
user visible, as all ABI changes have been done in the preceding patches.

[ tglx: Replaced the unsigned long cast ]

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

---
arch/x86/entry/common.c | 87 ++++++++++++++++++++++-----------
arch/x86/include/asm/syscall.h | 2 +-
2 files changed, 60 insertions(+), 29 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index f51bc17..ee95fe3 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -36,49 +36,81 @@
#include <asm/irq_stack.h>

#ifdef CONFIG_X86_64
-__visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
+
+static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
+{
+ /*
+ * Convert negative numbers to very high and thus out of range
+ * numbers for comparisons.
+ */
+ unsigned int unr = nr;
+
+ if (likely(unr < NR_syscalls)) {
+ unr = array_index_nospec(unr, NR_syscalls);
+ regs->ax = sys_call_table[unr](regs);
+ return true;
+ }
+ return false;
+}
+
+static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)
+{
+ /*
+ * Adjust the starting offset of the table, and convert numbers
+ * < __X32_SYSCALL_BIT to very high and thus out of range
+ * numbers for comparisons.
+ */
+ unsigned int xnr = nr - __X32_SYSCALL_BIT;
+
+ if (IS_ENABLED(CONFIG_X86_X32_ABI) && likely(xnr < X32_NR_syscalls)) {
+ xnr = array_index_nospec(xnr, X32_NR_syscalls);
+ regs->ax = x32_sys_call_table[xnr](regs);
+ return true;
+ }
+ return false;
+}
+
+__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
{
add_random_kstack_offset();
nr = syscall_enter_from_user_mode(regs, nr);

instrumentation_begin();
- if (likely(nr < NR_syscalls)) {
- nr = array_index_nospec(nr, NR_syscalls);
- regs->ax = sys_call_table[nr](regs);
-#ifdef CONFIG_X86_X32_ABI
- } else if (likely((nr & __X32_SYSCALL_BIT) &&
- (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
- nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
- X32_NR_syscalls);
- regs->ax = x32_sys_call_table[nr](regs);
-#endif
- } else if (unlikely((int)nr != -1)) {
+
+ if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
+ /* Invalid system call, but still a system call. */
regs->ax = __x64_sys_ni_syscall(regs);
}
+
instrumentation_end();
syscall_exit_to_user_mode(regs);
}
#endif

#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
-static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs)
+static __always_inline int syscall_32_enter(struct pt_regs *regs)
{
if (IS_ENABLED(CONFIG_IA32_EMULATION))
current_thread_info()->status |= TS_COMPAT;

- return (unsigned int)regs->orig_ax;
+ return (int)regs->orig_ax;
}

/*
* Invoke a 32-bit syscall. Called with IRQs on in CONTEXT_KERNEL.
*/
-static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs,
- unsigned int nr)
+static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs, int nr)
{
- if (likely(nr < IA32_NR_syscalls)) {
- nr = array_index_nospec(nr, IA32_NR_syscalls);
- regs->ax = ia32_sys_call_table[nr](regs);
- } else if (unlikely((int)nr != -1)) {
+ /*
+ * Convert negative numbers to very high and thus out of range
+ * numbers for comparisons.
+ */
+ unsigned int unr = nr;
+
+ if (likely(unr < IA32_NR_syscalls)) {
+ unr = array_index_nospec(unr, IA32_NR_syscalls);
+ regs->ax = ia32_sys_call_table[unr](regs);
+ } else if (nr != -1) {
regs->ax = __ia32_sys_ni_syscall(regs);
}
}
@@ -86,15 +118,15 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs,
/* Handles int $0x80 */
__visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
{
- unsigned int nr = syscall_32_enter(regs);
+ int nr = syscall_32_enter(regs);

add_random_kstack_offset();
/*
- * Subtlety here: if ptrace pokes something larger than 2^32-1 into
- * orig_ax, the unsigned int return value truncates it. This may
- * or may not be necessary, but it matches the old asm behavior.
+ * Subtlety here: if ptrace pokes something larger than 2^31-1 into
+ * orig_ax, the int return value truncates it. This matches
+ * the semantics of syscall_get_nr().
*/
- nr = (unsigned int)syscall_enter_from_user_mode(regs, nr);
+ nr = syscall_enter_from_user_mode(regs, nr);
instrumentation_begin();

do_syscall_32_irqs_on(regs, nr);
@@ -105,7 +137,7 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)

static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
{
- unsigned int nr = syscall_32_enter(regs);
+ int nr = syscall_32_enter(regs);
int res;

add_random_kstack_offset();
@@ -140,8 +172,7 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
return false;
}

- /* The case truncates any ptrace induced syscall nr > 2^32 -1 */
- nr = (unsigned int)syscall_enter_from_user_mode_work(regs, nr);
+ nr = syscall_enter_from_user_mode_work(regs, nr);

/* Now this is just like a normal syscall. */
do_syscall_32_irqs_on(regs, nr);
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index f6593ca..f7e2d82 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -159,7 +159,7 @@ static inline int syscall_get_arch(struct task_struct *task)
? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
}

-void do_syscall_64(struct pt_regs *regs, unsigned long nr);
+void do_syscall_64(struct pt_regs *regs, int nr);
void do_int80_syscall_32(struct pt_regs *regs);
long do_fast_syscall_32(struct pt_regs *regs);
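
The range checks in do_syscall_x64() and do_syscall_x32() above rely on converting the signed number to unsigned so that a single comparison rejects both negative and too-large values. The following user-space sketch illustrates that classification. The model_* names and the two table sizes are hypothetical values picked for illustration; only the x32 bit value (0x40000000) matches __X32_SYSCALL_BIT.

```c
enum model_class {
	MODEL_X64,		/* would be dispatched via the 64-bit table */
	MODEL_X32,		/* would be dispatched via the x32 table */
	MODEL_INVALID,		/* still a system call: gets -ENOSYS */
	MODEL_NOT_A_SYSCALL,	/* the -1 sentinel: regs->ax untouched */
};

#define MODEL_X32_SYSCALL_BIT	0x40000000u
#define MODEL_NR_SYSCALLS	447u	/* made-up size */
#define MODEL_X32_NR_SYSCALLS	548u	/* made-up size */

static enum model_class model_classify(int nr)
{
	/* Negative numbers become very large and thus out of range. */
	unsigned int unr = (unsigned int)nr;
	/*
	 * Numbers below the x32 bit wrap around to very large values.
	 * Converting before subtracting keeps the arithmetic unsigned
	 * (the patch subtracts in int and then converts).
	 */
	unsigned int xnr = (unsigned int)nr - MODEL_X32_SYSCALL_BIT;

	if (unr < MODEL_NR_SYSCALLS)
		return MODEL_X64;
	if (xnr < MODEL_X32_NR_SYSCALLS)
		return MODEL_X32;
	if (nr != -1)
		return MODEL_INVALID;
	return MODEL_NOT_A_SYSCALL;
}
```

In this model -2 and 447 are classified as invalid system calls, 0x40000000 as an x32 call, and only -1 as "not a system call", which is the distinction the commit message draws.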