2022-09-16 13:47:35

by Josh Triplett

[permalink] [raw]
Subject: [PATCH] fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm

Currently, execve allocates an mm and parses argv and envp before
checking if the path exists. However, the common case of a $PATH search
may have several failed calls to exec before a single success. Do a
filename lookup for the purposes of returning ENOENT before doing more
expensive operations.

This does not create a TOCTTOU race, because this can only happen if the
file didn't exist at some point during the exec call, and that point is
permitted to be when we did our lookup.

To measure performance, I ran 2000 fork and execvpe calls with a
seven-element PATH in which the file was found in the seventh directory
(representative of the common case as /usr/bin is the seventh directory
on my $PATH), as well as 2000 fork and execve calls with an absolute
path to an existing binary. I recorded the minimum time for each, to
eliminate noise from context switches and similar.

Without fast-path:
fork/execvpe: 49876ns
fork/execve: 32773ns

With fast-path:
fork/execvpe: 36890ns
fork/execve: 32069ns

The cost of the additional lookup seems to be in the noise for a
successful exec, but it provides a 26% improvement for the path search
case by speeding up the six failed execs.

Signed-off-by: Josh Triplett <[email protected]>
---

Discussed this at Plumbers with Kees Cook; turned out to be even more of
a win than anticipated.

fs/exec.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 9a5ca7b82bfc..fe786aeb2f1b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1881,6 +1881,16 @@ static int do_execveat_common(int fd, struct filename *filename,
if (IS_ERR(filename))
return PTR_ERR(filename);

+ /* Fast-path ENOENT for $PATH search failures, before we alloc an mm or
+ * parse arguments. */
+ if (fd == AT_FDCWD && flags == 0 && filename->name[0] == '/') {
+ struct path path;
+ retval = filename_lookup(AT_FDCWD, filename, 0, &path, NULL);
+ if (retval == -ENOENT)
+ goto out_ret;
+ path_put(&path);
+ }
+
/*
* We move the actual failure in case of RLIMIT_NPROC excess from
* set*uid() to execve() because too many poorly written programs
--
2.37.2


2022-09-16 15:05:13

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH] fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm

On Fri, Sep 16, 2022 at 02:41:30PM +0100, Josh Triplett wrote:
> Currently, execve allocates an mm and parses argv and envp before
> checking if the path exists. However, the common case of a $PATH search
> may have several failed calls to exec before a single success. Do a
> filename lookup for the purposes of returning ENOENT before doing more
> expensive operations.

At first I didn't understand how you were seeing this, since I'm so used
to watching shell scripts under tracing, which correctly use stat():

$ strace bash -c foo
stat("/home/keescook/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
stat("/usr/local/sbin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
stat("/usr/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
stat("/sbin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
stat("/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)

But I see, yes, glibc tries to actually call execve(), which, as you
say, is extremely heavy:

$ strace ./execvpe
...
execve("/home/kees/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/local/sbin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/local/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/sbin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
execve("/sbin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
execve("/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)

This really seems much more like a glibc bug. The shell does it correctly...

-Kees

--
Kees Cook

2022-09-16 20:57:12

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH] fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm

On Fri, Sep 16, 2022 at 07:38:37AM -0700, Kees Cook wrote:
> On Fri, Sep 16, 2022 at 02:41:30PM +0100, Josh Triplett wrote:
> > Currently, execve allocates an mm and parses argv and envp before
> > checking if the path exists. However, the common case of a $PATH search
> > may have several failed calls to exec before a single success. Do a
> > filename lookup for the purposes of returning ENOENT before doing more
> > expensive operations.
>
> At first I didn't understand how you were seeing this, since I'm so used
> to watching shell scripts under tracing, which correctly use stat():
>
> $ strace bash -c foo
> stat("/home/keescook/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
> stat("/usr/local/sbin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
> stat("/usr/local/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
> stat("/usr/sbin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
> stat("/usr/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
> stat("/sbin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
> stat("/bin/foo", 0x7ffe1f9ddea0) = -1 ENOENT (No such file or directory)
>
> But I see, yes, glibc tries to actually call execve(), which, as you
> say, is extremely heavy:
>
> $ strace ./execvpe
> ...
> execve("/home/kees/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
> execve("/usr/local/sbin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
> execve("/usr/local/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
> execve("/usr/sbin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
> execve("/usr/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
> execve("/sbin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
> execve("/bin/foo", ["./execvpe"], 0x7ffc542bff38 /* 33 vars */) = -1 ENOENT (No such file or directory)
>
> This really seems much more like a glibc bug. The shell does it correctly...

musl does the same thing, as do python and perl (likely via execvp or
posix_spawnp). As does gcc when it executes `as`. And I've seen more
than a few programs hand-implement a PATH search the same way. Seems
worth optimizing for.

And with io_uring_spawn, it'll be the substantially easier approach,
since it'll just require one pass (series of execs) rather than two
(stats then exec).

- Josh Triplett

2022-09-17 01:06:54

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH] fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm

[Hi Peter, apologies for dumping you into the middle of this thread.
I've got a question about sched_exec() below...]

On Fri, Sep 16, 2022 at 09:13:44PM +0100, Josh Triplett wrote:
> musl does the same thing, as do python and perl (likely via execvp or
> posix_spawnp). As does gcc when it executes `as`. And I've seen more
> than a few programs hand-implement a PATH search the same way. Seems
> worth optimizing for.

Yeah, it does seem like a simple way to eliminate needless work, though
I'd really like to see some kind of perf count of "in a given kernel
build, how many execve() system calls fail due to path search vs succeed",
just to get a better sense of the scale of the problem.

I don't like the idea of penalizing the _succeeding_ case, though, which
happens if we do the path walk twice. So, I went and refactoring the setup
order, moving the do_open_execat() up into alloc_bprm() instead of where
it was in bprm_exec(). The result makes it so it is, as you observed,
before the mm creation and generally expensive argument copying. The
difference to your patch seems to only be the allocation of the file
table entry, but avoids the double lookup, so I'm hoping the result is
actually even faster.

This cleanup is actually quite satisfying organizationally too -- the
fd and filename were passed around rather oddly.

The interaction with sched_exec() should be no worse (the file is opened
before it in either case), but in reading that function, it talks about
taking the opportunity to move the process to another CPU (IIUC) since,
paraphrasing, "it is at its lowest memory/cache size." But I wonder if
there is an existing accidental pessimistic result in that the process
stack has already been allocated. I am only passingly familiar with how
tasks get moved around under NUMA -- is the scheduler going to move
this process onto a different NUMA node and now it will be forced to
have the userspace process stack on one node and the program text and
heap on another? Or is that totally lost in the noise?

More specifically, I was wondering if processes would benefit from having
sched_exec() moved before the mm creation?

Regardless, here's a very lightly tested patch. Can you take this for a
spin and check your benchmark? Thanks!

-Kees

diff --git a/fs/exec.c b/fs/exec.c
index 9a5ca7b82bfc..5534301d67ca 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -898,6 +898,10 @@ EXPORT_SYMBOL(transfer_args_to_stack);

#endif /* CONFIG_MMU */

+/*
+ * On success, callers must call do_close_execat() on the returned
+ * struct file.
+ */
static struct file *do_open_execat(int fd, struct filename *name, int flags)
{
struct file *file;
@@ -945,6 +949,16 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
return ERR_PTR(err);
}

+/**
+ * open_exec - Open a path name for execution
+ *
+ * @name: path name to open with the intent of executing it.
+ *
+ * Returns ERR_PTR on failure or allocated struct file on success.
+ *
+ * As this is a wrapper for the internal do_open_execat(), callers
+ * must call allow_write_access() before fput() on release.
+ */
struct file *open_exec(const char *name)
{
struct filename *filename = getname_kernel(name);
@@ -1485,6 +1499,15 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
return -ENOMEM;
}

+/* Matches do_open_execat() */
+static void do_close_execat(struct file *file)
+{
+ if (!file)
+ return;
+ allow_write_access(file);
+ fput(file);
+}
+
static void free_bprm(struct linux_binprm *bprm)
{
if (bprm->mm) {
@@ -1496,10 +1519,7 @@ static void free_bprm(struct linux_binprm *bprm)
mutex_unlock(&current->signal->cred_guard_mutex);
abort_creds(bprm->cred);
}
- if (bprm->file) {
- allow_write_access(bprm->file);
- fput(bprm->file);
- }
+ do_close_execat(bprm->file);
if (bprm->executable)
fput(bprm->executable);
/* If a binfmt changed the interp, free it. */
@@ -1509,12 +1529,26 @@ static void free_bprm(struct linux_binprm *bprm)
kfree(bprm);
}

-static struct linux_binprm *alloc_bprm(int fd, struct filename *filename)
+static struct linux_binprm *alloc_bprm(int fd, struct filename *filename,
+ int flags)
{
- struct linux_binprm *bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
- int retval = -ENOMEM;
- if (!bprm)
+ struct linux_binprm *bprm;
+ struct file *file;
+ int retval;
+
+ file = do_open_execat(fd, filename, flags);
+ if (IS_ERR(file)) {
+ retval = PTR_ERR(file);
goto out;
+ }
+
+ retval = -ENOMEM;
+ bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
+ if (!bprm) {
+ do_close_execat(file);
+ goto out;
+ }
+ bprm->file = file;

if (fd == AT_FDCWD || filename->name[0] == '/') {
bprm->filename = filename->name;
@@ -1531,6 +1565,18 @@ static struct linux_binprm *alloc_bprm(int fd, struct filename *filename)
}
bprm->interp = bprm->filename;

+ /*
+ * Record that a name derived from an O_CLOEXEC fd will be
+ * inaccessible after exec. This allows the code in exec to
+ * choose to fail when the executable is not mmaped into the
+ * interpreter and an open file descriptor is not passed to
+ * the interpreter. This makes for a better user experience
+ * than having the interpreter start and then immediately fail
+ * when it finds the executable is inaccessible.
+ */
+ if (bprm->fdpath && get_close_on_exec(fd))
+ bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
+
retval = bprm_mm_init(bprm);
if (retval)
goto out_free;
@@ -1803,10 +1849,8 @@ static int exec_binprm(struct linux_binprm *bprm)
/*
* sys_execve() executes a new program.
*/
-static int bprm_execve(struct linux_binprm *bprm,
- int fd, struct filename *filename, int flags)
+static int bprm_execve(struct linux_binprm *bprm)
{
- struct file *file;
int retval;

retval = prepare_bprm_creds(bprm);
@@ -1816,26 +1860,8 @@ static int bprm_execve(struct linux_binprm *bprm,
check_unsafe_exec(bprm);
current->in_execve = 1;

- file = do_open_execat(fd, filename, flags);
- retval = PTR_ERR(file);
- if (IS_ERR(file))
- goto out_unmark;
-
sched_exec();

- bprm->file = file;
- /*
- * Record that a name derived from an O_CLOEXEC fd will be
- * inaccessible after exec. This allows the code in exec to
- * choose to fail when the executable is not mmaped into the
- * interpreter and an open file descriptor is not passed to
- * the interpreter. This makes for a better user experience
- * than having the interpreter start and then immediately fail
- * when it finds the executable is inaccessible.
- */
- if (bprm->fdpath && get_close_on_exec(fd))
- bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
-
/* Set the unchanging part of bprm->cred */
retval = security_bprm_creds_for_exec(bprm);
if (retval)
@@ -1863,7 +1889,6 @@ static int bprm_execve(struct linux_binprm *bprm,
if (bprm->point_of_no_return && !fatal_signal_pending(current))
force_fatal_sig(SIGSEGV);

-out_unmark:
current->fs->in_exec = 0;
current->in_execve = 0;

@@ -1897,7 +1922,7 @@ static int do_execveat_common(int fd, struct filename *filename,
* further execve() calls fail. */
current->flags &= ~PF_NPROC_EXCEEDED;

- bprm = alloc_bprm(fd, filename);
+ bprm = alloc_bprm(fd, filename, flags);
if (IS_ERR(bprm)) {
retval = PTR_ERR(bprm);
goto out_ret;
@@ -1946,7 +1971,7 @@ static int do_execveat_common(int fd, struct filename *filename,
bprm->argc = 1;
}

- retval = bprm_execve(bprm, fd, filename, flags);
+ retval = bprm_execve(bprm);
out_free:
free_bprm(bprm);

@@ -1971,7 +1996,7 @@ int kernel_execve(const char *kernel_filename,
if (IS_ERR(filename))
return PTR_ERR(filename);

- bprm = alloc_bprm(fd, filename);
+ bprm = alloc_bprm(fd, filename, 0);
if (IS_ERR(bprm)) {
retval = PTR_ERR(bprm);
goto out_ret;
@@ -2006,7 +2031,7 @@ int kernel_execve(const char *kernel_filename,
if (retval < 0)
goto out_free;

- retval = bprm_execve(bprm, fd, filename, 0);
+ retval = bprm_execve(bprm);
out_free:
free_bprm(bprm);
out_ret:


--
Kees Cook

2022-09-17 02:03:23

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH] fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm

On Fri, Sep 16, 2022 at 05:11:18PM -0700, Kees Cook wrote:
> I don't like the idea of penalizing the _succeeding_ case, though, which
> happens if we do the path walk twice. So, I went and refactoring the setup
> order, moving the do_open_execat() up into alloc_bprm() instead of where
> it was in bprm_exec(). The result makes it so it is, as you observed,
> before the mm creation and generally expensive argument copying. The
> difference to your patch seems to only be the allocation of the file
> table entry, but avoids the double lookup, so I'm hoping the result is
> actually even faster.

Thanks for giving this a try; I'd wondered how feasible it would be to
just do one lookup.

However, on the same test system with the same test setup, with your
refactor it seems to go slower:
fork/execvpe: 38087ns
fork/execve: 33758ns

For comparison, the previous numbers (which I re-confirmed):

Without fast-path:
fork/execvpe: 49876ns
fork/execve: 32773ns

With my original separate-lookup fast-path:
fork/execvpe: 36890ns
fork/execve: 31551ns


I tried several runs of each, and I seem to get reasonably consistent
results.

My test program just creates a pipe once, then loops on
clock_gettime/fork/execvpe/read, with the spawned child process doing
clock_gettime/write/exit (in asm to minimize overhead). The test PATH is
PATH=/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:. with
the test program in the current directory.

- Josh Triplett

2022-09-19 14:41:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm

On Fri, Sep 16, 2022 at 05:11:18PM -0700, Kees Cook wrote:

> The interaction with sched_exec() should be no worse (the file is opened
> before it in either case), but in reading that function, it talks about
> taking the opportunity to move the process to another CPU (IIUC) since,
> paraphrasing, "it is at its lowest memory/cache size." But I wonder if
> there is an existing accidental pessimistic result in that the process
> stack has already been allocated. I am only passingly familiar with how
> tasks get moved around under NUMA -- is the scheduler going to move
> this process onto a different NUMA node and now it will be forced to
> have the userspace process stack on one node and the program text and
> heap on another? Or is that totally lost in the noise?

Probably noise; text is going to be a crap-shoot anyway due to DSOs only
having a single copy in the page-cache. And the stack will be relatively
small at this point and also, numa-balance can migrate those pages
around if they matter.

> More specifically, I was wondering if processes would benefit from having
> sched_exec() moved before the mm creation?

Can't hurt I think.

2022-09-19 21:00:39

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH] fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm

On Sat, Sep 17, 2022 at 01:50:24AM +0100, Josh Triplett wrote:
> On Fri, Sep 16, 2022 at 05:11:18PM -0700, Kees Cook wrote:
> > I don't like the idea of penalizing the _succeeding_ case, though, which
> > happens if we do the path walk twice. So, I went and refactoring the setup
> > order, moving the do_open_execat() up into alloc_bprm() instead of where
> > it was in bprm_exec(). The result makes it so it is, as you observed,
> > before the mm creation and generally expensive argument copying. The
> > difference to your patch seems to only be the allocation of the file
> > table entry, but avoids the double lookup, so I'm hoping the result is
> > actually even faster.
>
> Thanks for giving this a try; I'd wondered how feasible it would be to
> just do one lookup.
>
> However, on the same test system with the same test setup, with your
> refactor it seems to go slower:
> fork/execvpe: 38087ns
> fork/execve: 33758ns
>
> For comparison, the previous numbers (which I re-confirmed):
>
> Without fast-path:
> fork/execvpe: 49876ns
> fork/execve: 32773ns
>
> With my original separate-lookup fast-path:
> fork/execvpe: 36890ns
> fork/execve: 31551ns

Hmm, this shows as slower in the *normal* case, which I find rather
surprising -- it's the same work, just reordered.

Can you post a URL to your tests? I'd like to reproduce this and maybe
throw perf at it as well.

> I tried several runs of each, and I seem to get reasonably consistent
> results.
>
> My test program just creates a pipe once, then loops on
> clock_gettime/fork/execvpe/read, with the spawned child process doing
> clock_gettime/write/exit (in asm to minimize overhead). The test PATH is
> PATH=/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:. with
> the test program in the current directory.

I'm also curious about less synthetic testing. It'd be nice to see real
workloads with these changes, etc.

--
Kees Cook

2022-09-22 07:50:11

by Yujie Liu

[permalink] [raw]
Subject: [fs/exec.c] 0a276ae2d2: BUG:workqueue_lockup-pool

Greeting,

FYI, we noticed the following commit (built with gcc-11):

commit: 0a276ae2d224d48c9ddbeb9fa14bc2fd0893a7b9 ("[PATCH] fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm")
url: https://github.com/intel-lab-lkp/linux/commits/Josh-Triplett/fs-exec-c-Add-fast-path-for-ENOENT-on-PATH-search-before-allocating-mm/20220916-214429
base: https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git for-next/execve
patch link: https://lore.kernel.org/linux-mm/5c7333ea4bec2fad1b47a8fa2db7c31e4ffc4f14.1663334978.git.josh@joshtriplett.org

in testcase: trinity
version: trinity-i386-4d2343bd-1_20200320
with following parameters:

runtime: 300s
group: group-01

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/


on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


[ 129.613450][ C0] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 43s!
[ 129.616497][ C0] Showing busy workqueues and worker pools:
[ 129.620580][ C0] workqueue events: flags=0x0
[ 129.623078][ C0] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=4/256 refcnt=5
[ 129.623118][ C0] in-flight: 52:do_free_init do_free_init, 29:key_garbage_collector key_garbage_collector
[ 129.623151][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
[ 129.623161][ C0] pending: vmstat_shepherd, drm_fb_helper_damage_work [drm_kms_helper]
[ 129.637767][ C0] workqueue events_highpri: flags=0x10
[ 129.640231][ C0] pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
[ 129.640252][ C0] pending: mix_interrupt_randomness
[ 129.640271][ C0] workqueue events_freezable_power_: flags=0x84
[ 129.650217][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 129.650238][ C0] pending: disk_events_workfn
[ 129.650254][ C0] workqueue rcu_gp: flags=0x8
[ 129.659366][ C0] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 129.659387][ C0] in-flight: 185:wait_rcu_exp_gp
[ 129.659405][ C0] workqueue mm_percpu_wq: flags=0x8
[ 129.668545][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 refcnt=4
[ 129.668565][ C0] pending: vmstat_update, lru_add_drain_per_cpu BAR(35)
[ 129.668595][ C0] workqueue kblockd: flags=0x18
[ 129.678480][ C0] pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2
[ 129.678499][ C0] pending: blk_mq_timeout_work
[ 129.678560][ C0] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=0s workers=5 idle: 3764 22
[ 175.181511][ T185] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 0-.... } 88583 jiffies s: 41 root: 0x1/.
[ 175.185095][ T185] rcu: blocking rcu_node structures (internal RCU debug):
[ 175.188233][ T185] Task dump for CPU 0:
[ 175.191019][ T185] task:trinity-subchil state:R running task stack: 0 pid: 3731 ppid: 3652 flags:0x2000000c
[ 175.198313][ T185] Call Trace:
[ 175.200246][ T185] <TASK>
[ 175.202827][ T185] ? do_fast_syscall_32 (arch/x86/entry/common.c:203)
[ 175.205560][ T185] ? entry_SYSENTER_compat_after_hwframe (arch/x86/entry/entry_64_compat.S:122)
[ 175.208517][ T185] </TASK>
[ 186.313437][ C0] rcu: INFO: rcu_sched self-detected stall on CPU
[ 186.315637][ C0] rcu: 0-....: (99854 ticks this GP) idle=64dc/1/0x4000000000000000 softirq=25122/25122 fqs=30028
[ 186.317670][ C0] (t=100000 jiffies g=32689 q=29567 ncpus=2)
[ 186.319650][ C0] NMI backtrace for cpu 0
[ 186.321784][ C0] CPU: 0 PID: 3731 Comm: trinity-subchil Not tainted 6.0.0-rc2-00003-g0a276ae2d224 #1
[ 186.324188][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 186.326774][ C0] Call Trace:
[ 186.328778][ C0] <IRQ>
[ 186.331085][ C0] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
[ 186.332793][ C0] nmi_cpu_backtrace.cold (lib/nmi_backtrace.c:111)
[ 186.334549][ C0] ? lapic_can_unplug_cpu (arch/x86/kernel/apic/hw_nmi.c:33)
[ 186.336150][ C0] nmi_trigger_cpumask_backtrace (lib/nmi_backtrace.c:62)
[ 186.337777][ C0] trigger_single_cpu_backtrace (include/linux/nmi.h:168)
[ 186.339411][ C0] rcu_dump_cpu_stacks (kernel/rcu/tree_stall.h:371)
[ 186.341082][ C0] print_cpu_stall.cold (kernel/rcu/tree_stall.h:662)
[ 186.342889][ C0] check_cpu_stall (kernel/rcu/tree_stall.h:745)
[ 186.344423][ C0] rcu_sched_clock_irq (kernel/rcu/tree_nocb.h:913 kernel/rcu/tree.c:3663 kernel/rcu/tree.c:2342)
[ 186.346071][ C0] update_process_times (arch/x86/include/asm/preempt.h:27 kernel/time/timer.c:1841)
[ 186.347626][ C0] tick_sched_handle (kernel/time/tick-sched.c:244)
[ 186.349235][ C0] tick_sched_timer (kernel/time/tick-sched.c:1480)
[ 186.350780][ C0] ? tick_sched_do_timer (kernel/time/tick-sched.c:1467)
[ 186.352495][ C0] __hrtimer_run_queues (kernel/time/hrtimer.c:1685 kernel/time/hrtimer.c:1749)
[ 186.354484][ C0] ? enqueue_hrtimer (kernel/time/hrtimer.c:1719)
[ 186.356550][ C0] ? kvm_clock_get_cycles (arch/x86/include/asm/preempt.h:85 arch/x86/kernel/kvmclock.c:80 arch/x86/kernel/kvmclock.c:86)
[ 186.358499][ C0] ? ktime_get_update_offsets_now (kernel/time/timekeeping.c:292 (discriminator 3) kernel/time/timekeeping.c:388 (discriminator 3) kernel/time/timekeeping.c:2320 (discriminator 3))
[ 186.360609][ C0] hrtimer_interrupt (kernel/time/hrtimer.c:1814)
[ 186.362418][ C0] __sysvec_apic_timer_interrupt (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 arch/x86/include/asm/trace/irq_vectors.h:41 arch/x86/kernel/apic/apic.c:1113)
[ 186.364295][ C0] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1106 (discriminator 14))
[ 186.366201][ C0] </IRQ>
[ 186.367490][ C0] <TASK>
[ 186.369287][ C0] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:649)
[ 186.371191][ C0] RIP: 0010:native_queued_spin_lock_slowpath (arch/x86/include/asm/vdso/processor.h:19 arch/x86/include/asm/qspinlock.h:101 kernel/locking/qspinlock.c:327)
[ 186.373248][ C0] Code: 00 00 00 65 48 2b 04 25 28 00 00 00 0f 85 2f 08 00 00 48 81 c4 88 00 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc f3 90 <e9> 71 ff ff ff 44 8b 74 24 48 41 81 fe 00 01 00 00 0f 84 e3 00 00
All code
========
0: 00 00 add %al,(%rax)
2: 00 65 48 add %ah,0x48(%rbp)
5: 2b 04 25 28 00 00 00 sub 0x28,%eax
c: 0f 85 2f 08 00 00 jne 0x841
12: 48 81 c4 88 00 00 00 add $0x88,%rsp
19: 5b pop %rbx
1a: 5d pop %rbp
1b: 41 5c pop %r12
1d: 41 5d pop %r13
1f: 41 5e pop %r14
21: 41 5f pop %r15
23: c3 retq
24: cc int3
25: cc int3
26: cc int3
27: cc int3
28: f3 90 pause
2a:* e9 71 ff ff ff jmpq 0xffffffffffffffa0 <-- trapping instruction
2f: 44 8b 74 24 48 mov 0x48(%rsp),%r14d
34: 41 81 fe 00 01 00 00 cmp $0x100,%r14d
3b: 0f .byte 0xf
3c: 84 e3 test %ah,%bl
...

Code starting with the faulting instruction
===========================================
0: e9 71 ff ff ff jmpq 0xffffffffffffff76
5: 44 8b 74 24 48 mov 0x48(%rsp),%r14d
a: 41 81 fe 00 01 00 00 cmp $0x100,%r14d
11: 0f .byte 0xf
12: 84 e3 test %ah,%bl
...
[ 186.377327][ C0] RSP: 0018:ffffc90000e17cb8 EFLAGS: 00000202
[ 186.379142][ C0] RAX: 0000000000000000 RBX: 00000000587993b0 RCX: ffffffff812fc8ec
[ 186.380912][ C0] RDX: fffff520001c2ff7 RSI: 0000000000000004 RDI: ffffc90000e17fb0
[ 186.383091][ C0] RBP: ffffc90000e17fb0 R08: 0000000000000000 R09: ffffc90000e17fb3
[ 186.384750][ C0] R10: fffff520001c2ff6 R11: 0000000000000001 R12: 0000000000000003
[ 186.386354][ C0] R13: fffff520001c2ff6 R14: 0000000000000001 R15: 1ffff920001c2f98
[ 186.388126][ C0] ? native_queued_spin_lock_slowpath (arch/x86/include/asm/atomic.h:29 include/linux/atomic/atomic-instrumented.h:28 arch/x86/include/asm/qspinlock.h:100 kernel/locking/qspinlock.c:327)
[ 186.389979][ C0] ? native_queued_spin_lock_slowpath (arch/x86/include/asm/atomic.h:29 include/linux/atomic/atomic-instrumented.h:28 arch/x86/include/asm/qspinlock.h:100 kernel/locking/qspinlock.c:327)
[ 186.391917][ C0] ? pv_hash (kernel/locking/qspinlock.c:317)
[ 186.393367][ C0] ? lockref_put_return (arch/x86/include/asm/atomic.h:29 (discriminator 5) include/linux/atomic/atomic-instrumented.h:28 (discriminator 5) include/asm-generic/qspinlock.h:73 (discriminator 5) lib/lockref.c:122 (discriminator 5))
[ 186.394784][ C0] _raw_spin_lock (arch/x86/include/asm/paravirt.h:591 arch/x86/include/asm/qspinlock.h:51 include/asm-generic/qspinlock.h:114 include/linux/spinlock.h:185 include/linux/spinlock_api_smp.h:134 kernel/locking/spinlock.c:154)
[ 186.396052][ C0] ? _raw_write_lock_irq (kernel/locking/spinlock.c:153)
[ 186.397502][ C0] ? allocate_slab (mm/slub.c:378 mm/slub.c:1929 mm/slub.c:1995)
[ 186.399281][ C0] dput (fs/dcache.c:790 fs/dcache.c:900)
[ 186.400885][ C0] path_put (fs/namei.c:559)
[ 186.402250][ C0] do_execveat_common+0x5ec/0x7c0
[ 186.403558][ C0] ? __check_object_size (mm/memremap.c:37)
[ 186.404848][ C0] ? alloc_bprm (fs/exec.c:1866)
[ 186.406039][ C0] __ia32_compat_sys_execve (fs/exec.c:2117)
[ 186.407302][ C0] __do_fast_syscall_32 (arch/x86/entry/common.c:112 arch/x86/entry/common.c:178)
[ 186.408512][ C0] do_fast_syscall_32 (arch/x86/entry/common.c:203)
[ 186.409678][ C0] entry_SYSENTER_compat_after_hwframe (arch/x86/entry/entry_64_compat.S:122)
[ 186.410947][ C0] RIP: 0023:0xf7f3c549
[ 186.412085][ C0] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
All code
========
0: 03 74 c0 01 add 0x1(%rax,%rax,8),%esi
4: 10 05 03 74 b8 01 adc %al,0x1b87403(%rip) # 0x1b8740d
a: 10 06 adc %al,(%rsi)
c: 03 74 b4 01 add 0x1(%rsp,%rsi,4),%esi
10: 10 07 adc %al,(%rdi)
12: 03 74 b0 01 add 0x1(%rax,%rsi,4),%esi
16: 10 08 adc %cl,(%rax)
18: 03 74 d8 01 add 0x1(%rax,%rbx,8),%esi
1c: 00 00 add %al,(%rax)
1e: 00 00 add %al,(%rax)
20: 00 51 52 add %dl,0x52(%rcx)
23: 55 push %rbp
24: 89 e5 mov %esp,%ebp
26: 0f 34 sysenter
28: cd 80 int $0x80
2a:* 5d pop %rbp <-- trapping instruction
2b: 5a pop %rdx
2c: 59 pop %rcx
2d: c3 retq
2e: 90 nop
2f: 90 nop
30: 90 nop
31: 90 nop
32: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi
39: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi

Code starting with the faulting instruction
===========================================
0: 5d pop %rbp
1: 5a pop %rdx
2: 59 pop %rcx
3: c3 retq
4: 90 nop
5: 90 nop
6: 90 nop
7: 90 nop
8: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi
f: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi


If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/r/[email protected]


To reproduce:

# build kernel
cd linux
cp config-6.0.0-rc2-00003-g0a276ae2d224 .config
make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.


--
0-DAY CI Kernel Test Service
https://01.org/lkp


Attachments:
(No filename) (12.63 kB)
config-6.0.0-rc2-00003-g0a276ae2d224 (169.94 kB)
job-script (4.76 kB)
dmesg.xz (33.32 kB)
trinity (5.90 kB)
Download all attachments

2022-10-01 16:54:27

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH] fs/exec.c: Add fast path for ENOENT on PATH search before allocating mm

On Mon, Sep 19, 2022 at 01:02:51PM -0700, Kees Cook wrote:
> On Sat, Sep 17, 2022 at 01:50:24AM +0100, Josh Triplett wrote:
> > On Fri, Sep 16, 2022 at 05:11:18PM -0700, Kees Cook wrote:
> > > I don't like the idea of penalizing the _succeeding_ case, though, which
> > > happens if we do the path walk twice. So, I went and refactoring the setup
> > > order, moving the do_open_execat() up into alloc_bprm() instead of where
> > > it was in bprm_exec(). The result makes it so it is, as you observed,
> > > before the mm creation and generally expensive argument copying. The
> > > difference to your patch seems to only be the allocation of the file
> > > table entry, but avoids the double lookup, so I'm hoping the result is
> > > actually even faster.
> >
> > Thanks for giving this a try; I'd wondered how feasible it would be to
> > just do one lookup.
> >
> > However, on the same test system with the same test setup, with your
> > refactor it seems to go slower:
> > fork/execvpe: 38087ns
> > fork/execve: 33758ns
> >
> > For comparison, the previous numbers (which I re-confirmed):
> >
> > Without fast-path:
> > fork/execvpe: 49876ns
> > fork/execve: 32773ns
> >
> > With my original separate-lookup fast-path:
> > fork/execvpe: 36890ns
> > fork/execve: 31551ns
>
> Hmm, this shows as slower in the *normal* case, which I find rather
> surprising -- it's the same work, just reordered.
>
> Can you post a URL to your tests? I'd like to reproduce this and maybe
> throw perf at it as well.

Sure. Sorry for the delay, needed to integrate some fixes (such as
aarch64 support) and factor out the bits that won't build if you don't
have a patched liburing.

https://github.com/joshtriplett/spawnbench

- Josh Triplett