When running the pidfd_fdinfo_test on arm64, it fails for me. After some
digging, the reason is that the child exits due to SIGBUS, because it
overflows the 1024 byte stack we've reserved for it.
To fix the issue, increase the stack size to 8192 bytes (this number is
somewhat arbitrary, and was arrived at through experimentation -- I kept
doubling until the failure no longer occurred).
Also, let's make the issue easier to debug. wait_for_pid() returns an
ambiguous value: it may return -1 in all of these cases:
1. waitpid() itself returned -1
2. waitpid() returned success, but we found !WIFEXITED(status).
3. The child process exited, but it did so with a -1 exit code.
There's no way for the caller to tell the difference. So, at least log
which occurred, so the test runner can debug things.
While debugging this, I found that we had !WIFEXITED(), because the
child exited due to a signal. This seems like a reasonably common case,
so also print out whether or not we have WIFSIGNALED(), and the
associated WTERMSIG() (if any). This lets us see the SIGBUS I'm fixing
clearly when it occurs.
Finally, I'm suspicious of allocating the child's stack on our stack.
man clone(2) suggests that the correct way to do this is with mmap(),
and in particular by setting MAP_STACK. So, switch to doing it that way
instead.
Signed-off-by: Axel Rasmussen <[email protected]>
---
tools/testing/selftests/pidfd/pidfd.h | 13 ++++++++---
.../selftests/pidfd/pidfd_fdinfo_test.c | 22 +++++++++++++++----
2 files changed, 28 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
index 01f8d3c0cf2c..6922d6417e1c 100644
--- a/tools/testing/selftests/pidfd/pidfd.h
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -68,7 +68,7 @@
#define PIDFD_SKIP 3
#define PIDFD_XFAIL 4
-int wait_for_pid(pid_t pid)
+static inline int wait_for_pid(pid_t pid)
{
int status, ret;
@@ -78,13 +78,20 @@ int wait_for_pid(pid_t pid)
if (errno == EINTR)
goto again;
+ ksft_print_msg("waitpid returned -1, errno=%d\n", errno);
return -1;
}
- if (!WIFEXITED(status))
+ if (!WIFEXITED(status)) {
+ ksft_print_msg(
+ "waitpid !WIFEXITED, WIFSIGNALED=%d, WTERMSIG=%d\n",
+ WIFSIGNALED(status), WTERMSIG(status));
return -1;
+ }
- return WEXITSTATUS(status);
+ ret = WEXITSTATUS(status);
+ ksft_print_msg("waitpid WEXITSTATUS=%d\n", ret);
+ return ret;
}
static inline int sys_pidfd_open(pid_t pid, unsigned int flags)
diff --git a/tools/testing/selftests/pidfd/pidfd_fdinfo_test.c b/tools/testing/selftests/pidfd/pidfd_fdinfo_test.c
index 22558524f71c..3fd8e903118f 100644
--- a/tools/testing/selftests/pidfd/pidfd_fdinfo_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_fdinfo_test.c
@@ -12,6 +12,7 @@
#include <string.h>
#include <syscall.h>
#include <sys/wait.h>
+#include <sys/mman.h>
#include "pidfd.h"
#include "../kselftest.h"
@@ -80,7 +81,10 @@ static inline int error_check(struct error *err, const char *test_name)
return err->code;
}
+#define CHILD_STACK_SIZE 8192
+
struct child {
+ char *stack;
pid_t pid;
int fd;
};
@@ -89,17 +93,22 @@ static struct child clone_newns(int (*fn)(void *), void *args,
struct error *err)
{
static int flags = CLONE_PIDFD | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD;
- size_t stack_size = 1024;
- char *stack[1024] = { 0 };
struct child ret;
if (!(flags & CLONE_NEWUSER) && geteuid() != 0)
flags |= CLONE_NEWUSER;
+ ret.stack = mmap(NULL, CHILD_STACK_SIZE, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
+ if (ret.stack == MAP_FAILED) {
+ error_set(err, -1, "mmap of stack failed (errno %d)", errno);
+ return ret;
+ }
+
#ifdef __ia64__
- ret.pid = __clone2(fn, stack, stack_size, flags, args, &ret.fd);
+ ret.pid = __clone2(fn, ret.stack, CHILD_STACK_SIZE, flags, args, &ret.fd);
#else
- ret.pid = clone(fn, stack + stack_size, flags, args, &ret.fd);
+ ret.pid = clone(fn, ret.stack + CHILD_STACK_SIZE, flags, args, &ret.fd);
#endif
if (ret.pid < 0) {
@@ -129,6 +138,11 @@ static inline int child_join(struct child *child, struct error *err)
else if (r > 0)
error_set(err, r, "child %d reported: %d", child->pid, r);
+ if (munmap(child->stack, CHILD_STACK_SIZE)) {
+ error_set(err, -1, "munmap of child stack failed (errno %d)", errno);
+ r = -1;
+ }
+
return r;
}
--
2.35.0.rc2.247.g8bbb082509-goog
On Thu, Jan 27, 2022 at 01:29:51PM -0800, Axel Rasmussen wrote:
> When running the pidfd_fdinfo_test on arm64, it fails for me. After some
> digging, the reason is that the child exits due to SIGBUS, because it
> overflows the 1024 byte stack we've reserved for it.
>
> To fix the issue, increase the stack size to 8192 bytes (this number is
> somewhat arbitrary, and was arrived at through experimentation -- I kept
> doubling until the failure no longer occurred).
>
> Also, let's make the issue easier to debug. wait_for_pid() returns an
> ambiguous value: it may return -1 in all of these cases:
>
> 1. waitpid() itself returned -1
> 2. waitpid() returned success, but we found !WIFEXITED(status).
> 3. The child process exited, but it did so with a -1 exit code.
>
> There's no way for the caller to tell the difference. So, at least log
> which occurred, so the test runner can debug things.
>
> While debugging this, I found that we had !WIFEXITED(), because the
> child exited due to a signal. This seems like a reasonably common case,
> so also print out whether or not we have WIFSIGNALED(), and the
> associated WTERMSIG() (if any). This lets us see the SIGBUS I'm fixing
> clearly when it occurs.
>
> Finally, I'm suspicious of allocating the child's stack on our stack.
> man clone(2) suggests that the correct way to do this is with mmap(),
> and in particular by setting MAP_STACK. So, switch to doing it that way
> instead.
Heh, yes. :)
commit 99c3a000279919cc4875c9dfa9c3ebb41ed8773e
Author: Michael Kerrisk <[email protected]>
Date: Thu Nov 14 12:19:21 2019 +0100
clone.2: Allocate child's stack using mmap(2) rather than malloc(3)
Christian Brauner suggested mmap(MAP_STACKED), rather than
malloc(), as the canonical way of allocating a stack for the
child of clone(), and Jann Horn noted some reasons why:
Not on Linux, but on OpenBSD, they do use MAP_STACK now
AFAIK; this was announced here:
<http://openbsd-archive.7691.n7.nabble.com/stack-register-checking-td338238.html>.
Basically they periodically check whether the userspace
stack pointer points into a MAP_STACK region, and if not,
they kill the process. So even if it's a no-op on Linux, it
might make sense to advise people to use the flag to improve
portability? I'm not sure if that's something that belongs
in Linux manpages.
Another reason against malloc() is that when setting up
thread stacks in proper, reliable software, you'll probably
want to place a guard page (in other words, a 4K PROT_NONE
VMA) at the bottom of the stack to reliably catch stack
overflows; and you probably don't want to do that with
malloc, in particular with non-page-aligned allocations.
And the OpenBSD 6.5 manual pages says:
MAP_STACK
Indicate that the mapping is used as a stack. This
flag must be used in combination with MAP_ANON and
MAP_PRIVATE.
And I then noticed that MAP_STACK seems already to be on
FreeBSD for a long time:
MAP_STACK
Map the area as a stack. MAP_ANON is implied.
Offset should be 0, fd must be -1, and prot should
include at least PROT_READ and PROT_WRITE. This
option creates a memory region that grows to at
most len bytes in size, starting from the stack
top and growing down. The stack top is the start‐
ing address returned by the call, plus len bytes.
The bottom of the stack at maximum growth is the
starting address returned by the call.
The entire area is reserved from the point of view
of other mmap() calls, even if not faulted in yet.
Reported-by: Jann Horn <[email protected]>
Reported-by: Christian Brauner <[email protected]>
Signed-off-by: Michael Kerrisk <[email protected]>
>
> Signed-off-by: Axel Rasmussen <[email protected]>
> ---
Yeah, stack handling - especially with legacy clone() - is yucky on the
best of days. Thank you for the fix.
Acked-by: Christian Brauner <[email protected]>
On 1/28/22 1:56 AM, Christian Brauner wrote:
> On Thu, Jan 27, 2022 at 01:29:51PM -0800, Axel Rasmussen wrote:
>> When running the pidfd_fdinfo_test on arm64, it fails for me. After some
>> digging, the reason is that the child exits due to SIGBUS, because it
>> overflows the 1024 byte stack we've reserved for it.
>>
>> To fix the issue, increase the stack size to 8192 bytes (this number is
>> somewhat arbitrary, and was arrived at through experimentation -- I kept
>> doubling until the failure no longer occurred).
>>
>> Also, let's make the issue easier to debug. wait_for_pid() returns an
>> ambiguous value: it may return -1 in all of these cases:
>>
>> 1. waitpid() itself returned -1
>> 2. waitpid() returned success, but we found !WIFEXITED(status).
>> 3. The child process exited, but it did so with a -1 exit code.
>>
>> There's no way for the caller to tell the difference. So, at least log
>> which occurred, so the test runner can debug things.
>>
>> While debugging this, I found that we had !WIFEXITED(), because the
>> child exited due to a signal. This seems like a reasonably common case,
>> so also print out whether or not we have WIFSIGNALED(), and the
>> associated WTERMSIG() (if any). This lets us see the SIGBUS I'm fixing
>> clearly when it occurs.
>>
>> Finally, I'm suspicious of allocating the child's stack on our stack.
>> man clone(2) suggests that the correct way to do this is with mmap(),
>> and in particular by setting MAP_STACK. So, switch to doing it that way
>> instead.
>
> Heh, yes. :)
>
> commit 99c3a000279919cc4875c9dfa9c3ebb41ed8773e
> Author: Michael Kerrisk <[email protected]>
> Date: Thu Nov 14 12:19:21 2019 +0100
>
> clone.2: Allocate child's stack using mmap(2) rather than malloc(3)
>
> Christian Brauner suggested mmap(MAP_STACKED), rather than
> malloc(), as the canonical way of allocating a stack for the
> child of clone(), and Jann Horn noted some reasons why:
>
> Not on Linux, but on OpenBSD, they do use MAP_STACK now
> AFAIK; this was announced here:
> <http://openbsd-archive.7691.n7.nabble.com/stack-register-checking-td338238.html>.
> Basically they periodically check whether the userspace
> stack pointer points into a MAP_STACK region, and if not,
> they kill the process. So even if it's a no-op on Linux, it
> might make sense to advise people to use the flag to improve
> portability? I'm not sure if that's something that belongs
> in Linux manpages.
>
> Another reason against malloc() is that when setting up
> thread stacks in proper, reliable software, you'll probably
> want to place a guard page (in other words, a 4K PROT_NONE
> VMA) at the bottom of the stack to reliably catch stack
> overflows; and you probably don't want to do that with
> malloc, in particular with non-page-aligned allocations.
>
> And the OpenBSD 6.5 manual pages says:
>
> MAP_STACK
> Indicate that the mapping is used as a stack. This
> flag must be used in combination with MAP_ANON and
> MAP_PRIVATE.
>
> And I then noticed that MAP_STACK seems already to be on
> FreeBSD for a long time:
>
> MAP_STACK
> Map the area as a stack. MAP_ANON is implied.
> Offset should be 0, fd must be -1, and prot should
> include at least PROT_READ and PROT_WRITE. This
> option creates a memory region that grows to at
> most len bytes in size, starting from the stack
> top and growing down. The stack top is the start‐
> ing address returned by the call, plus len bytes.
> The bottom of the stack at maximum growth is the
> starting address returned by the call.
>
> The entire area is reserved from the point of view
> of other mmap() calls, even if not faulted in yet.
>
> Reported-by: Jann Horn <[email protected]>
> Reported-by: Christian Brauner <[email protected]>
> Signed-off-by: Michael Kerrisk <[email protected]>
>
>
>>
>> Signed-off-by: Axel Rasmussen <[email protected]>
>> ---
>
> Yeah, stack handling - especially with legacy clone() - is yucky on the
> best of days. Thank you for the fix.
>
> Acked-by: Christian Brauner <[email protected]>
>
Thank you both. Will apply for 5.17-rc4 or so.
thanks,
-- Shuah