The /proc/$pid/wchan file has been broken by default on x86_64 for 4
years now[1]. As this remains a potential leak of either kernel
addresses (when symbolization fails) or limited observation of kernel
function progress, just remove the contents for good.
Unconditionally set the contents to "0" and also mark the wchan
field in /proc/$pid/stat with 0.
This leaves kernel/sched/fair.c as the only user of get_wchan(). But
again, since this was broken for 4 years, was this profiling logic
actually doing anything useful?
[1] https://lore.kernel.org/lkml/20210922001537.4ktg3r2ky3b3r6yp@treble/
Cc: Josh Poimboeuf <[email protected]>
Cc: Vito Caputo <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/kernel/process.c | 2 +-
fs/proc/array.c | 16 +++++-----------
fs/proc/base.c | 16 +---------------
3 files changed, 7 insertions(+), 27 deletions(-)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 1d9463e3096b..84a4f9f3f0c2 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -937,7 +937,7 @@ unsigned long arch_randomize_brk(struct mm_struct *mm)
}
/*
- * Called from fs/proc with a reference on @p to find the function
+ * Called from scheduler with a reference on @p to find the function
* which called into schedule(). This needs to be done carefully
* because the task might wake up and we might look at a stack
* changing under us.
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 49be8c8ef555..8a4ecfd901b8 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -452,7 +452,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task, int whole)
{
- unsigned long vsize, eip, esp, wchan = 0;
+ unsigned long vsize, eip, esp;
int priority, nice;
int tty_pgrp = -1, tty_nr = 0;
sigset_t sigign, sigcatch;
@@ -540,8 +540,6 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
unlock_task_sighand(task, &flags);
}
- if (permitted && (!whole || num_threads < 2))
- wchan = get_wchan(task);
if (!whole) {
min_flt = task->min_flt;
maj_flt = task->maj_flt;
@@ -600,16 +598,12 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
seq_put_decimal_ull(m, " ", sigcatch.sig[0] & 0x7fffffffUL);
/*
- * We used to output the absolute kernel address, but that's an
- * information leak - so instead we show a 0/1 flag here, to signal
- * to user-space whether there's a wchan field in /proc/PID/wchan.
- *
+ * We used to output the absolute kernel address, and then just
+ * a symbol. But both are information leaks, so just report 0
+ * to indicate there is no wchan field in /proc/$PID/wchan.
* This works with older implementations of procps as well.
*/
- if (wchan)
- seq_puts(m, " 1");
- else
- seq_puts(m, " 0");
+ seq_puts(m, " 0");
seq_put_decimal_ull(m, " ", 0);
seq_put_decimal_ull(m, " ", 0);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 533d5836eb9a..52484cd77f99 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -378,24 +378,10 @@ static const struct file_operations proc_pid_cmdline_ops = {
};
#ifdef CONFIG_KALLSYMS
-/*
- * Provides a wchan file via kallsyms in a proper one-value-per-file format.
- * Returns the resolved symbol. If that fails, simply return the address.
- */
static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
- unsigned long wchan;
-
- if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
- wchan = get_wchan(task);
- else
- wchan = 0;
-
- if (wchan)
- seq_printf(m, "%ps", (void *) wchan);
- else
- seq_putc(m, '0');
+ seq_putc(m, '0');
return 0;
}
--
2.30.2
On Thu, Sep 23, 2021 at 04:31:05PM -0700, Kees Cook wrote:
> The /proc/$pid/wchan file has been broken by default on x86_64 for 4
> years now[1]. As this remains a potential leak of either kernel
> addresses (when symbolization fails) or limited observation of kernel
> function progress, just remove the contents for good.
>
> Unconditionally set the contents to "0" and also mark the wchan
> field in /proc/$pid/stat with 0.
I forgot to CC Qi Zheng on this patch. Now corrected. :)
> This leaves kernel/sched/fair.c as the only user of get_wchan(). But
> again, since this was broken for 4 years, was this profiling logic
> actually doing anything useful?
If the fair scheduler would actually benefit from still using get_wchan,
I think this patch:
https://lore.kernel.org/all/[email protected]/
should still be applied too.
If not, we can rip get_wchan() out completely (across all
architectures).
-Kees
> [1] https://lore.kernel.org/lkml/20210922001537.4ktg3r2ky3b3r6yp@treble/
>
> Cc: Josh Poimboeuf <[email protected]>
> Cc: Vito Caputo <[email protected]>
> Signed-off-by: Kees Cook <[email protected]>
> ---
> arch/x86/kernel/process.c | 2 +-
> fs/proc/array.c | 16 +++++-----------
> fs/proc/base.c | 16 +---------------
> 3 files changed, 7 insertions(+), 27 deletions(-)
>
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 1d9463e3096b..84a4f9f3f0c2 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -937,7 +937,7 @@ unsigned long arch_randomize_brk(struct mm_struct *mm)
> }
>
> /*
> - * Called from fs/proc with a reference on @p to find the function
> + * Called from scheduler with a reference on @p to find the function
> * which called into schedule(). This needs to be done carefully
> * because the task might wake up and we might look at a stack
> * changing under us.
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 49be8c8ef555..8a4ecfd901b8 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -452,7 +452,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
> static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
> struct pid *pid, struct task_struct *task, int whole)
> {
> - unsigned long vsize, eip, esp, wchan = 0;
> + unsigned long vsize, eip, esp;
> int priority, nice;
> int tty_pgrp = -1, tty_nr = 0;
> sigset_t sigign, sigcatch;
> @@ -540,8 +540,6 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
> unlock_task_sighand(task, &flags);
> }
>
> - if (permitted && (!whole || num_threads < 2))
> - wchan = get_wchan(task);
> if (!whole) {
> min_flt = task->min_flt;
> maj_flt = task->maj_flt;
> @@ -600,16 +598,12 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
> seq_put_decimal_ull(m, " ", sigcatch.sig[0] & 0x7fffffffUL);
>
> /*
> - * We used to output the absolute kernel address, but that's an
> - * information leak - so instead we show a 0/1 flag here, to signal
> - * to user-space whether there's a wchan field in /proc/PID/wchan.
> - *
> + * We used to output the absolute kernel address, and then just
> + * a symbol. But both are information leaks, so just report 0
> + * to indicate there is no wchan field in /proc/$PID/wchan.
> * This works with older implementations of procps as well.
> */
> - if (wchan)
> - seq_puts(m, " 1");
> - else
> - seq_puts(m, " 0");
> + seq_puts(m, " 0");
>
> seq_put_decimal_ull(m, " ", 0);
> seq_put_decimal_ull(m, " ", 0);
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 533d5836eb9a..52484cd77f99 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -378,24 +378,10 @@ static const struct file_operations proc_pid_cmdline_ops = {
> };
>
> #ifdef CONFIG_KALLSYMS
> -/*
> - * Provides a wchan file via kallsyms in a proper one-value-per-file format.
> - * Returns the resolved symbol. If that fails, simply return the address.
> - */
> static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
> struct pid *pid, struct task_struct *task)
> {
> - unsigned long wchan;
> -
> - if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
> - wchan = get_wchan(task);
> - else
> - wchan = 0;
> -
> - if (wchan)
> - seq_printf(m, "%ps", (void *) wchan);
> - else
> - seq_putc(m, '0');
> + seq_putc(m, '0');
>
> return 0;
> }
> --
> 2.30.2
>
--
Kees Cook
On Thu, Sep 23, 2021 at 04:31:05PM -0700, Kees Cook wrote:
> The /proc/$pid/wchan file has been broken by default on x86_64 for 4
> years now[1]. As this remains a potential leak of either kernel
> addresses (when symbolization fails) or limited observation of kernel
> function progress, just remove the contents for good.
>
> Unconditionally set the contents to "0" and also mark the wchan
> field in /proc/$pid/stat with 0.
>
> This leaves kernel/sched/fair.c as the only user of get_wchan(). But
> again, since this was broken for 4 years, was this profiling logic
> actually doing anything useful?
>
> [1] https://lore.kernel.org/lkml/20210922001537.4ktg3r2ky3b3r6yp@treble/
>
> Cc: Josh Poimboeuf <[email protected]>
> Cc: Vito Caputo <[email protected]>
> Signed-off-by: Kees Cook <[email protected]>
<snip>
Please don't deliberately break WCHANs wholesale. This is a very
useful tool for sysadmins to get a vague sense of where processes are
spending time in the kernel on production systems without affecting
performance or having to restart things under instrumentation.
I don't see how providing the symbol name of a given task's kernel
function, especially if shallow near the user->kernel entrypoint, is a
worrisome information leak. Just make sure it's not failing open with
raw addresses, which, as my original report documented, seems to happen
spuriously as-is with kallsyms.
When I worked full-time as a sysadmin, WCHANs were regularly the first
thing to look at in `ps -o stat,wchan | grep D` when things were
falling over. e.g.:
```
root@shells:/root# ps -o stat,wchan | grep D
D io_schedule
```
Furthermore, this is a well-documented (even on dead trees) and
well-understood *nix/POSIX system observation technique. Even the
POSIX ps(1) man page documents it:
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/ps.html
Frankly I'm a bit mortified that I have to write this email.
Today I'm hoping to test Josh's patch @
https://lore.kernel.org/all/[email protected]/
Thanks,
Vito Caputo
On Fri, Sep 24, 2021 at 1:59 AM Vito Caputo <[email protected]> wrote:
> On Thu, Sep 23, 2021 at 04:31:05PM -0700, Kees Cook wrote:
> > The /proc/$pid/wchan file has been broken by default on x86_64 for 4
> > years now[1]. As this remains a potential leak of either kernel
> > addresses (when symbolization fails) or limited observation of kernel
> > function progress, just remove the contents for good.
> >
> > Unconditionally set the contents to "0" and also mark the wchan
> > field in /proc/$pid/stat with 0.
> >
> > This leaves kernel/sched/fair.c as the only user of get_wchan(). But
> > again, since this was broken for 4 years, was this profiling logic
> > actually doing anything useful?
> >
> > [1] https://lore.kernel.org/lkml/20210922001537.4ktg3r2ky3b3r6yp@treble/
> >
> > Cc: Josh Poimboeuf <[email protected]>
> > Cc: Vito Caputo <[email protected]>
> > Signed-off-by: Kees Cook <[email protected]>
> <snip>
>
>
> Please don't deliberately break WCHANs wholesale. This is a very
> useful tool for sysadmins to get a vague sense of where processes are
> spending time in the kernel on production systems without affecting
> performance or having to restart things under instrumentation.
Wouldn't /proc/$pid/stack be more useful for that anyway? As long as
you have root privileges, you can read that to get the entire stack,
not just a single method name.
(By the way, I guess that might be an alternative to ripping wchan out
completely - require CAP_SYS_ADMIN like for /proc/$pid/stack?)
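[For illustration: a minimal sketch of that alternative -- keep the wchan
file but gate it behind the ATTACH-level ptrace check that
/proc/$pid/stack uses, instead of removing the contents. This is just the
pre-patch proc_pid_wchan() with a stricter access mode; it is not part of
the patch above.]

```
/* Hypothetical variant of fs/proc/base.c:proc_pid_wchan(), shown only
 * to illustrate the suggestion above: require PTRACE_MODE_ATTACH
 * rather than PTRACE_MODE_READ before walking the stack.
 */
static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
			  struct pid *pid, struct task_struct *task)
{
	unsigned long wchan = 0;

	if (ptrace_may_access(task, PTRACE_MODE_ATTACH_FSCREDS))
		wchan = get_wchan(task);

	if (wchan)
		seq_printf(m, "%ps", (void *)wchan);
	else
		seq_putc(m, '0');

	return 0;
}
```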
On Fri, Sep 24, 2021 at 02:08:45AM +0200, Jann Horn wrote:
> On Fri, Sep 24, 2021 at 1:59 AM Vito Caputo <[email protected]> wrote:
> > On Thu, Sep 23, 2021 at 04:31:05PM -0700, Kees Cook wrote:
> > > The /proc/$pid/wchan file has been broken by default on x86_64 for 4
> > > years now[1]. As this remains a potential leak of either kernel
> > > addresses (when symbolization fails) or limited observation of kernel
> > > function progress, just remove the contents for good.
> > >
> > > Unconditionally set the contents to "0" and also mark the wchan
> > > field in /proc/$pid/stat with 0.
> > >
> > > This leaves kernel/sched/fair.c as the only user of get_wchan(). But
> > > again, since this was broken for 4 years, was this profiling logic
> > > actually doing anything useful?
> > >
> > > [1] https://lore.kernel.org/lkml/20210922001537.4ktg3r2ky3b3r6yp@treble/
> > >
> > > Cc: Josh Poimboeuf <[email protected]>
> > > Cc: Vito Caputo <[email protected]>
> > > Signed-off-by: Kees Cook <[email protected]>
> > <snip>
> >
> >
> > Please don't deliberately break WCHANs wholesale. This is a very
> > useful tool for sysadmins to get a vague sense of where processes are
> > spending time in the kernel on production systems without affecting
> > performance or having to restart things under instrumentation.
>
> Wouldn't /proc/$pid/stack be more useful for that anyway? As long as
> you have root privileges, you can read that to get the entire stack,
> not just a single method name.
>
> (By the way, I guess that might be an alternative to ripping wchan out
> completely - require CAP_SYS_ADMIN like for /proc/$pid/stack?)
WCHAN is a first-class concept of the OS. As a result we have
long-standing useful tools exposing it in far more organized,
documented, and discoverable ways than poking around linux-specific
/proc files at the shell. Even `top` can show WCHAN in a column
alongside everything else it exposes, complete with sorting etc., and
I've already demonstrated the support in `ps`.
I also think it's worth preserving the ability for regular users to
observe the WCHAN of their own processes. It's unclear to me why this
is such a worry. If the WCHAN as implemented is granular enough to
expose too much of the kernel's inner workings, then it should be
watered down to be more vague. Even if it just said "ioctl" when a
process was blocked in D state inside an ioctl(), it would still be
much more useful than saying nothing at all. Can't regular users see
this much about their own processes via strace/gdb anyway?
Instead of unwinding stacks maybe the kernel should be sticking an
entrypoint address in the current task struct for get_wchan() to
access, whenever userspace enters the kernel?
Regards,
Vito Caputo
On Thu, Sep 23, 2021 at 05:22:30PM -0700, Vito Caputo wrote:
> Instead of unwinding stacks maybe the kernel should be sticking an
> entrypoint address in the current task struct for get_wchan() to
> access, whenever userspace enters the kernel?
wchan is supposed to show where the kernel is at the instant the
get_wchan() happens. (i.e. recording it at syscall entry would just
always show syscall entry.)
--
Kees Cook
On Thu, Sep 23, 2021 at 06:16:16PM -0700, Kees Cook wrote:
> On Thu, Sep 23, 2021 at 05:22:30PM -0700, Vito Caputo wrote:
> > Instead of unwinding stacks maybe the kernel should be sticking an
> > entrypoint address in the current task struct for get_wchan() to
> > access, whenever userspace enters the kernel?
>
> wchan is supposed to show where the kernel is at the instant the
> get_wchan() happens. (i.e. recording it at syscall entry would just
> always show syscall entry.)
>
And you have the syscall # on hand when performing the syscall entry,
no?
The point is, if the alternative is to always get 0 from
/proc/PID/wchan when a process is sitting in ioctl(), I'd be perfectly
happy to get back sys_ioctl. I'm under the impression there's quite a
bit of vendor-specific flexibility here in terms of how precise WCHAN
is.
If it's possible to preserve the old WCHAN precision I'm all for it.
But if we've become so paranoid about leaking anything about the
kernel to userspace that this is untenable, then even just spitting
out the syscall being performed has value.
Regards,
Vito Caputo
On Thu, Sep 23, 2021 at 06:34:08PM -0700, Vito Caputo wrote:
> On Thu, Sep 23, 2021 at 06:16:16PM -0700, Kees Cook wrote:
> > On Thu, Sep 23, 2021 at 05:22:30PM -0700, Vito Caputo wrote:
> > > Instead of unwinding stacks maybe the kernel should be sticking an
> > > entrypoint address in the current task struct for get_wchan() to
> > > access, whenever userspace enters the kernel?
> >
> > wchan is supposed to show where the kernel is at the instant the
> > get_wchan() happens. (i.e. recording it at syscall entry would just
> > always show syscall entry.)
> >
>
> And you have the syscall # on hand when performing the syscall entry,
> no?
>
> The point is, if the alternative is to always get 0 from
> /proc/PID/wchan when a process is sitting in ioctl(), I'd be perfectly
> happy to get back sys_ioctl. I'm under the impression there's quite a
> bit of vendor-specific flexibility here in terms of how precise WCHAN
> is.
Oh, yeah, if you're happy with syscall-level granularity, that'd be
totally fine by me too.
> If it's possible to preserve the old WCHAN precision I'm all for it.
> But if we've become so paranoid about leaking anything about the
> kernel to userspace that this is untenable, then even just spitting
> out the syscall being performed has value.
I'd like to find a middle ground -- wchan has always seemed like an info
leak, even with only symbols. And it doesn't help that walking the stack
from outside the current task is difficult. :)
-Kees
--
Kees Cook
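[For illustration only: a rough sketch of what syscall-level granularity
could look like as a replacement for proc_pid_wchan(). The function name
and the "syscall_%d" output format are assumptions, not anything proposed
in this thread, and reading another task's pt_regs like this is racy
unless the task is known to be blocked.]

```
#include <linux/ptrace.h>
#include <linux/sched/task_stack.h>	/* task_pt_regs() */
#include <linux/seq_file.h>
#include <asm/syscall.h>		/* syscall_get_nr() */

/* Hypothetical: report only which syscall number a task is sitting in,
 * or "0" when it is not in a syscall (or access is denied). Mapping
 * the number to a name is left out here.
 */
static int proc_pid_wchan_syscall(struct seq_file *m, struct pid_namespace *ns,
				  struct pid *pid, struct task_struct *task)
{
	int nr = -1;

	if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
		nr = syscall_get_nr(task, task_pt_regs(task));

	if (nr >= 0)
		seq_printf(m, "syscall_%d", nr);
	else
		seq_putc(m, '0');

	return 0;
}
```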
On Thu, 23 Sep 2021 16:31:05 -0700 Kees Cook <[email protected]> wrote:
> The /proc/$pid/wchan file has been broken by default on x86_64 for 4
> years now[1].
[1] is hard to decrypt. I think it would be better if this changelog
were to describe the problem directly, completely and succinctly?
> As this remains a potential leak of either kernel
> addresses (when symbolization fails) or limited observation of kernel
> function progress, just remove the contents for good.
>
> Unconditionally set the contents to "0" and also mark the wchan
> field in /proc/$pid/stat with 0.
>
> This leaves kernel/sched/fair.c as the only user of get_wchan(). But
> again, since this was broken for 4 years, was this profiling logic
> actually doing anything useful?
Agree that returning a hard-wired "0\n" is the way to go.
On Thu, Sep 23, 2021 at 07:13:06PM -0700, Andrew Morton wrote:
> On Thu, 23 Sep 2021 16:31:05 -0700 Kees Cook <[email protected]> wrote:
>
> > The /proc/$pid/wchan file has been broken by default on x86_64 for 4
> > years now[1].
>
> [1] is hard to decrypt. I think it would be better if this changelog
> were to describe the problem directly, completely and succinctly?
>
> > As this remains a potential leak of either kernel
> > addresses (when symbolization fails) or limited observation of kernel
> > function progress, just remove the contents for good.
> >
> > Unconditionally set the contents to "0" and also mark the wchan
> > field in /proc/$pid/stat with 0.
> >
> > This leaves kernel/sched/fair.c as the only user of get_wchan(). But
> > again, since this was broken for 4 years, was this profiling logic
> > actually doing anything useful?
>
> Agree that returning a hard-wired "0\n" is the way to go.
I must NAK my own patch. ;) It looks like this would be a breaking
userspace-visible change[1].
We need to fix the two bugs though:
1) wchan broken under ORC (patch exists in the thread at [1])
2) wchan leaking raw addresses (152c432b128c needs reverting from v5.12 and later)
-Kees
[1] https://lore.kernel.org/lkml/[email protected]
--
Kees Cook
On Thu, Sep 23, 2021 at 06:16:16PM -0700, Kees Cook wrote:
> On Thu, Sep 23, 2021 at 05:22:30PM -0700, Vito Caputo wrote:
> > Instead of unwinding stacks maybe the kernel should be sticking an
> > entrypoint address in the current task struct for get_wchan() to
> > access, whenever userspace enters the kernel?
>
> wchan is supposed to show where the kernel is at the instant the
> get_wchan() happens. (i.e. recording it at syscall entry would just
> always show syscall entry.)
It's supposed to show where a blocked task is blocked; the "wait
channel".
I'd wanted to remove get_wchan since it requires cross-task stack
walking, which is generally painful.
We could instead have the scheduler entrypoints snapshot their caller
into a field in task_struct. If there are sufficiently few callers, that
could be an inline wrapper that passes a __func__ string. Otherwise, we
still need to symbolize.
Thanks,
Mark.
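[A minimal sketch of the wrapper variant Mark describes, with loud
assumptions: "wchan_str" is a hypothetical task_struct member that would
have to be added, and blocking call sites would have to use the wrapper
instead of calling schedule() directly. Illustrative only.]

```
/* Hypothetical: expanded at each blocking call site, so __func__ names
 * the caller; the stashed string then serves as the "wait channel"
 * without any cross-task stack walking or symbolization.
 */
#define schedule_annotated()				\
	do {						\
		current->wchan_str = __func__;		\
		schedule();				\
		current->wchan_str = NULL;		\
	} while (0)

/* A wchan reader would then simply report the stashed string: */
static void show_wchan_str(struct seq_file *m, struct task_struct *p)
{
	const char *wchan = READ_ONCE(p->wchan_str);

	seq_puts(m, wchan ? wchan : "0");
}
```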
On Fri, Sep 24, 2021 at 02:54:24PM +0100, Mark Rutland wrote:
> On Thu, Sep 23, 2021 at 06:16:16PM -0700, Kees Cook wrote:
> > On Thu, Sep 23, 2021 at 05:22:30PM -0700, Vito Caputo wrote:
> > > Instead of unwinding stacks maybe the kernel should be sticking an
> > > entrypoint address in the current task struct for get_wchan() to
> > > access, whenever userspace enters the kernel?
> >
> > wchan is supposed to show where the kernel is at the instant the
> > get_wchan() happens. (i.e. recording it at syscall entry would just
> > always show syscall entry.)
>
> It's supposed to show where a blocked task is blocked; the "wait
> channel".
>
> I'd wanted to remove get_wchan since it requires cross-task stack
> walking, which is generally painful.
Right -- this is the "fragile" part I'm worried about.
> We could instead have the scheduler entrypoints snapshot their caller
> into a field in task_struct. If there are sufficiently few callers, that
> could be an inline wrapper that passes a __func__ string. Otherwise, we
> still need to symbolize.
Hmm. Does PREEMPT break this?
Can we actually use __builtin_return_address(0) in __schedule?
--
Kees Cook
On Thu, Sep 23, 2021 at 04:31:05PM -0700, Kees Cook wrote:
> This leaves kernel/sched/fair.c as the only user of get_wchan(). But
> again, since this was broken for 4 years, was this profiling logic
> actually doing anything useful?
Unlikely. I proposed removing the whole kernel/profile.c thing a while
back, but some people complained that it would make it 'hard' to profile
early boot.
IMO this not actually working right seems another argument for killing
that too :-)
On Thu, Sep 23, 2021 at 04:38:26PM -0700, Kees Cook wrote:
> On Thu, Sep 23, 2021 at 04:31:05PM -0700, Kees Cook wrote:
> If the fair scheduler would actually benefit from still using get_wchan,
> I think this patch:
> https://lore.kernel.org/all/[email protected]/
> should still be applied too.
>
> If not, we can rip get_wchan() out completely (across all
> architectures).
The scheduler doesn't care; it's kernel/profile.c, and please kill that
right along with wchan, good riddance.
On Fri, Sep 24, 2021 at 07:26:22AM -0700, Kees Cook wrote:
> On Fri, Sep 24, 2021 at 02:54:24PM +0100, Mark Rutland wrote:
> > On Thu, Sep 23, 2021 at 06:16:16PM -0700, Kees Cook wrote:
> > > On Thu, Sep 23, 2021 at 05:22:30PM -0700, Vito Caputo wrote:
> > > > Instead of unwinding stacks maybe the kernel should be sticking an
> > > > entrypoint address in the current task struct for get_wchan() to
> > > > access, whenever userspace enters the kernel?
> > >
> > > wchan is supposed to show where the kernel is at the instant the
> > > get_wchan() happens. (i.e. recording it at syscall entry would just
> > > always show syscall entry.)
> >
> > It's supposed to show where a blocked task is blocked; the "wait
> > channel".
> >
> > I'd wanted to remove get_wchan since it requires cross-task stack
> > walking, which is generally painful.
>
> Right -- this is the "fragile" part I'm worried about.
>
> > We could instead have the scheduler entrypoints snapshot their caller
> > into a field in task_struct. If there are sufficiently few callers, that
> > could be an inline wrapper that passes a __func__ string. Otherwise, we
> > still need to symbolize.
>
> Hmm. Does PREEMPT break this?
Within the core scheduler functions interrupts should be disabled, and
as long as we only update task_struct there we shouldn't have a race.
> Can we actually use __builtin_return_address(0) in __schedule?
We'd need to do this in a few entry points above __schedule, since
get_wchan() currently walks until !in_sched_functions(). It should be
possible, though we might need to make sure those nexus points
aren't inlined.
Thanks,
Mark.
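[To make the __builtin_return_address(0) question concrete, a hedged
sketch of one such non-inlined nexus point; "saved_wchan" is a
hypothetical task_struct field, and which entry points would call this is
left unspecified.]

```
/* Must not be inlined, otherwise __builtin_return_address(0) would
 * point into the caller's caller rather than at the blocking call site.
 */
noinline void schedule_record_wchan(void)
{
	WRITE_ONCE(current->saved_wchan,
		   (unsigned long)__builtin_return_address(0));
	schedule();
	WRITE_ONCE(current->saved_wchan, 0UL);
}
```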
From: Mark Rutland
> Sent: 24 September 2021 14:54
>
> On Thu, Sep 23, 2021 at 06:16:16PM -0700, Kees Cook wrote:
> > On Thu, Sep 23, 2021 at 05:22:30PM -0700, Vito Caputo wrote:
> > > Instead of unwinding stacks maybe the kernel should be sticking an
> > > entrypoint address in the current task struct for get_wchan() to
> > > access, whenever userspace enters the kernel?
> >
> > wchan is supposed to show where the kernel is at the instant the
> > get_wchan() happens. (i.e. recording it at syscall entry would just
> > always show syscall entry.)
>
> It's supposed to show where a blocked task is blocked; the "wait
> channel".
>
> I'd wanted to remove get_wchan since it requires cross-task stack
> walking, which is generally painful.
>
> We could instead have the scheduler entrypoints snapshot their caller
> into a field in task_struct. If there are sufficiently few callers, that
> could be an inline wrapper that passes a __func__ string. Otherwise, we
> still need to symbolize.
It ought to be something stashed in the 'wait_queue_head'.
Perhaps defaulting to the address/name of the function that
initialised it.
That would be much nearer the original (pre Linux) semantics.
David
On Mon, Sep 27, 2021 at 10:03:51AM +0100, Mark Rutland wrote:
> On Fri, Sep 24, 2021 at 07:26:22AM -0700, Kees Cook wrote:
> > On Fri, Sep 24, 2021 at 02:54:24PM +0100, Mark Rutland wrote:
> > > On Thu, Sep 23, 2021 at 06:16:16PM -0700, Kees Cook wrote:
> > > > On Thu, Sep 23, 2021 at 05:22:30PM -0700, Vito Caputo wrote:
> > > > > Instead of unwinding stacks maybe the kernel should be sticking an
> > > > > entrypoint address in the current task struct for get_wchan() to
> > > > > access, whenever userspace enters the kernel?
> > > >
> > > > wchan is supposed to show where the kernel is at the instant the
> > > > get_wchan() happens. (i.e. recording it at syscall entry would just
> > > > always show syscall entry.)
> > >
> > > It's supposed to show where a blocked task is blocked; the "wait
> > > channel".
> > >
> > > I'd wanted to remove get_wchan since it requires cross-task stack
> > > walking, which is generally painful.
> >
> > Right -- this is the "fragile" part I'm worried about.
I'd like to clarify this concern first -- is the proposed fix actually
fragile? Because I think we'd be better off just restoring behavior than
trying to invent new behavior...
i.e. Josh, Jann, do you see any issues with Qi Zheng's fix here:
https://lore.kernel.org/all/[email protected]/
-Kees
--
Kees Cook
On Mon, Sep 27, 2021 at 11:07:27AM -0700, Kees Cook wrote:
> On Mon, Sep 27, 2021 at 10:03:51AM +0100, Mark Rutland wrote:
> > On Fri, Sep 24, 2021 at 07:26:22AM -0700, Kees Cook wrote:
> > > On Fri, Sep 24, 2021 at 02:54:24PM +0100, Mark Rutland wrote:
> > > > On Thu, Sep 23, 2021 at 06:16:16PM -0700, Kees Cook wrote:
> > > > > On Thu, Sep 23, 2021 at 05:22:30PM -0700, Vito Caputo wrote:
> > > > > > Instead of unwinding stacks maybe the kernel should be sticking an
> > > > > > entrypoint address in the current task struct for get_wchan() to
> > > > > > access, whenever userspace enters the kernel?
> > > > >
> > > > > wchan is supposed to show where the kernel is at the instant the
> > > > > get_wchan() happens. (i.e. recording it at syscall entry would just
> > > > > always show syscall entry.)
> > > >
> > > > It's supposed to show where a blocked task is blocked; the "wait
> > > > channel".
> > > >
> > > > I'd wanted to remove get_wchan since it requires cross-task stack
> > > > walking, which is generally painful.
> > >
> > > Right -- this is the "fragile" part I'm worried about.
>
> I'd like to clarify this concern first -- is the proposed fix actually
> fragile? Because I think we'd be better off just restoring behavior than
> trying to invent new behavior...
>
> i.e. Josh, Jann, do you see any issues with Qi Zheng's fix here:
> https://lore.kernel.org/all/[email protected]/
Even with that patch, it doesn't lock the task's runqueue before reading
the stack, so there's still the possibility of the task running on
another CPU and the unwinder going off the rails a bit, which might be
used by an attacker in creative ways similar to the /proc/<pid>/stack
vulnerability Jann mentioned earlier.
--
Josh
On Fri, Sep 24, 2021 at 02:54:24PM +0100, Mark Rutland wrote:
> We could instead have the scheduler entrypoints snapshot their caller
> into a field in task_struct. If there are sufficiently few callers, that
> could be an inline wrapper that passes a __func__ string. Otherwise, we
> still need to symbolize.
I strongly dislike that option; we'd make the scheduler slower for
something that's typically unused.
The 3 people actually using wchan should pay the cost of obtaining the
data, not everybody all the time.
On Mon, Sep 27, 2021 at 01:50:56PM -0700, Josh Poimboeuf wrote:
> On Mon, Sep 27, 2021 at 11:07:27AM -0700, Kees Cook wrote:
> > On Mon, Sep 27, 2021 at 10:03:51AM +0100, Mark Rutland wrote:
> > > On Fri, Sep 24, 2021 at 07:26:22AM -0700, Kees Cook wrote:
> > > > On Fri, Sep 24, 2021 at 02:54:24PM +0100, Mark Rutland wrote:
> > > > > On Thu, Sep 23, 2021 at 06:16:16PM -0700, Kees Cook wrote:
> > > > > > On Thu, Sep 23, 2021 at 05:22:30PM -0700, Vito Caputo wrote:
> > > > > > > Instead of unwinding stacks maybe the kernel should be sticking an
> > > > > > > entrypoint address in the current task struct for get_wchan() to
> > > > > > > access, whenever userspace enters the kernel?
> > > > > >
> > > > > > wchan is supposed to show where the kernel is at the instant the
> > > > > > get_wchan() happens. (i.e. recording it at syscall entry would just
> > > > > > always show syscall entry.)
> > > > >
> > > > > It's supposed to show where a blocked task is blocked; the "wait
> > > > > channel".
> > > > >
> > > > > I'd wanted to remove get_wchan since it requires cross-task stack
> > > > > walking, which is generally painful.
> > > >
> > > > Right -- this is the "fragile" part I'm worried about.
> >
> > I'd like to clarify this concern first -- is the proposed fix actually
> > fragile? Because I think we'd be better off just restoring behavior than
> > trying to invent new behavior...
> >
> > i.e. Josh, Jann, do you see any issues with Qi Zheng's fix here:
> > https://lore.kernel.org/all/[email protected]/
>
> Even with that patch, it doesn't lock the task's runqueue before reading
> the stack, so there's still the possibility of the task running on
> another CPU and the unwinder going off the rails a bit, which might be
> used by an attacker in creative ways similar to the /proc/<pid>/stack
> vulnerability Jann mentioned earlier.
Since I think we're considering get_wchan() to be slow-path, can we just
lock the runqueue and use arch_stack_walk_reliable()?
--
Kees Cook
On Wed, Sep 29, 2021 at 11:54:55AM -0700, Kees Cook wrote:
> > > > > > It's supposed to show where a blocked task is blocked; the "wait
> > > > > > channel".
> Since I think we're considering get_wchan() to be slow-path, can we just
> lock the runqueue and use arch_stack_walk_reliable()?
Funny thing, when a task is blocked it isn't on the runqueue :-)
So if all we want to do is capture a blocked task and fail otherwise we
don't need the rq->lock at all.
Something like:
unsigned long ip = 0;
raw_spin_lock_irq(&p->pi_lock);
state = READ_ONCE(p->__state);
smp_rmb(); /* see try_to_wake_up() */
if (state == TASK_RUNNING || state == TASK_WAKING || p->on_rq)
goto unlock;
ip = /* do actual stack walk on a blocked task */
unlock:
raw_spin_unlock_irq(&p->pi_lock);
return ip;
Should get us there.
On Wed, Sep 29, 2021 at 11:54:55AM -0700, Kees Cook wrote:
> On Mon, Sep 27, 2021 at 01:50:56PM -0700, Josh Poimboeuf wrote:
> > Even with that patch, it doesn't lock the task's runqueue before reading
> > the stack, so there's still the possibility of the task running on
> > another CPU and the unwinder going off the rails a bit, which might be
> > used by an attacker in creative ways similar to the /proc/<pid>/stack
> > vulnerability Jann mentioned earlier.
> Since I think we're considering get_wchan() to be slow-path, can we just
> lock the runqueue and use arch_stack_walk_reliable()?
Unfortunately arch_stack_walk_reliable() is only available for powerpc,
s390 and x86 currently - work is in progress to implement it for arm64
as well but it's not there yet.
On Wed, Sep 29, 2021 at 09:40:26PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 29, 2021 at 11:54:55AM -0700, Kees Cook wrote:
>
> > > > > > > It's supposed to show where a blocked task is blocked; the "wait
> > > > > > > channel".
>
> > Since I think we're considering get_wchan() to be slow-path, can we just
> > lock the runqueue and use arch_stack_walk_reliable()?
>
> Funny thing, when a task is blocked it isn't on the runqueue :-)
>
> So if all we want to do is capture a blocked task and fail otherwise we
> don't need the rq->lock at all.
>
> Something like:
>
> unsigned long ip = 0;
>
> raw_spin_lock_irq(&p->pi_lock);
> state = READ_ONCE(p->__state);
> smp_rmb(); /* see try_to_wake_up() */
> if (state == TASK_RUNNING || state == TASK_WAKING || p->on_rq)
> goto unlock;
>
> ip = /* do actual stack walk on a blocked task */
> unlock:
> raw_spin_unlock_irq(&p->pi_lock);
>
> return ip;
Ah, cool :-)
I'd also add that I don't see any reason to use the "reliable" unwinding
variant. AFAIK, just basic stack_trace_save_tsk() should be sufficient.
--
Josh
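[Putting Peter's blocked-task check together with the plain unwinder Josh
mentions, a hedged sketch of what a generic get_wchan() replacement might
look like. The function name and the fixed-size buffer are assumptions,
and this is not the final patch.]

```
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/sched/debug.h>		/* in_sched_functions() */
#include <linux/stacktrace.h>		/* stack_trace_save_tsk() */

static unsigned long sketch_get_wchan(struct task_struct *p)
{
	unsigned long entries[8];
	unsigned long ip = 0;
	unsigned int i, nr, state;

	raw_spin_lock_irq(&p->pi_lock);
	state = READ_ONCE(p->__state);
	smp_rmb(); /* see try_to_wake_up() */
	if (state == TASK_RUNNING || state == TASK_WAKING || p->on_rq)
		goto unlock;

	/* Blocked task: walk its stack and report the first caller
	 * outside the scheduler itself.
	 */
	nr = stack_trace_save_tsk(p, entries, ARRAY_SIZE(entries), 0);
	for (i = 0; i < nr; i++) {
		if (!in_sched_functions(entries[i])) {
			ip = entries[i];
			break;
		}
	}
unlock:
	raw_spin_unlock_irq(&p->pi_lock);
	return ip;
}
```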
On Wed, Sep 29, 2021 at 08:00:42PM +0100, Mark Brown wrote:
> On Wed, Sep 29, 2021 at 11:54:55AM -0700, Kees Cook wrote:
> > On Mon, Sep 27, 2021 at 01:50:56PM -0700, Josh Poimboeuf wrote:
>
> > > Even with that patch, it doesn't lock the task's runqueue before reading
> > > the stack, so there's still the possibility of the task running on
> > > another CPU and the unwinder going off the rails a bit, which might be
> > > used by an attacker in creative ways similar to the /proc/<pid>/stack
> > > vulnerability Jann mentioned earlier.
>
> > Since I think we're considering get_wchan() to be slow-path, can we just
> > lock the runqueue and use arch_stack_walk_reliable()?
>
> Unfortunately arch_stack_walk_reliable() is only available for powerpc,
> s390 and x86 currently - work is in progress to implement it for arm64
> as well but it's not there yet.
Strictly speaking, we're only trying to fix this for x86+ORC. The other
architectures (or non-ORC x86) already have their own non-ORC unwinders
behind get_wchan(). They may have similar weaknesses (which should
certainly be fixed), but I think the first step here is to restore wchan
under x86+ORC.
--
Kees Cook
On 9/23/21 4:31 PM, Kees Cook wrote:
> The /proc/$pid/wchan file has been broken by default on x86_64 for 4
> years now[1]. As this remains a potential leak of either kernel
> addresses (when symbolization fails) or limited observation of kernel
> function progress, just remove the contents for good.
>
> Unconditionally set the contents to "0" and also mark the wchan
> field in /proc/$pid/stat with 0.
Hi all,
It looks like there's already been pushback on this idea, but I wanted
to add another voice from a frequent user of /proc/$pid/wchan (via ps).
Much of my job involves diagnosing kernel issues and performance issues
on stable kernels, frequently on production systems where I can't do
anything too invasive. wchan is incredibly useful for these situations,
so much so that we store regular snapshots of ps output, and we expand
the size of the WCHAN column to fit more data (e.g. ps -e -o
pid,wchan=WCHAN-WIDE-COLUMN). Disabling wchan would remove a critical
tool for me and my team.
From our team's feedback:
1. It's fine if this needs to have CAP_SYS_ADMIN to read for tasks not
owned by the calling user; and for non-admin, if the symbolization
fails, to return 0 just like kallsyms does for unprivileged users.
2. We don't care about the stack of an actively running process
(/proc/$pid/stack is there for that). We only need WCHAN for
understanding why a task is blocked.
3. Keeping the function / symbol name in the wchan is ideal (so we can
pinpoint the exact area that a task is blocked at).
> This leaves kernel/sched/fair.c as the only user of get_wchan(). But
> again, since this was broken for 4 years, was this profiling logic
> actually doing anything useful?
This was only broken with CONFIG_UNWINDER_ORC. You may say this is the
default, but Ubuntu's latest kernel (5.11 in Hirsute) still ships with
CONFIG_UNWINDER_FRAME_POINTER, and many other distributions are the
same. Stable distributions have a lag time picking up new code, and an
even longer lag picking up new configurations -- even new defaults.
(Especially when frame pointers are so useful for debugging...) So
saying that this was broken for 4 years is at best misleading. Plenty of
users have been happily using recent kernels when this was supposedly
"broken", on valid configurations, without any issues.
It looks like we've backed off of the decision to rip out
/proc/$pid/wchan, but I just wanted to chime in, since it feels like the
discussion is happening without much input from users.
Thanks,
Stephen
On Thu, Sep 30, 2021 at 11:05:35AM -0700, Stephen Brennan wrote:
> On 9/23/21 4:31 PM, Kees Cook wrote:
> > The /proc/$pid/wchan file has been broken by default on x86_64 for 4
> > years now[1]. As this remains a potential leak of either kernel
> > addresses (when symbolization fails) or limited observation of kernel
> > function progress, just remove the contents for good.
> >
> > Unconditionally set the contents to "0" and also mark the wchan
> > field in /proc/$pid/stat with 0.
>
> Hi all,
>
> It looks like there's already been pushback on this idea, but I wanted
> to add another voice from a frequent user of /proc/$pid/wchan (via PS).
> Much of my job involves diagnosing kernel issues and performance issues
> on stable kernels, frequently on production systems where I can't do
> anything too invasive. wchan is incredibly useful for these situations,
> so much so that we store regular snapshots of ps output, and we expand
> the size of the WCHAN column to fit more data (e.g. ps -e -o
> pid,wchan=WCHAN-WIDE-COLUMN). Disabling wchan would remove a critical
> tool for me and my team.
Thanks for speaking up! Yes, we've moved to fixing wchan correctly as
it's clear it's still very much in use. :) Current thread is here:
https://lore.kernel.org/lkml/[email protected]/
--
Kees Cook