LinuxLists.cc - [PATCH 0/3] prevent /proc/<pid>/stack garbage for running tasks

[permalink] [raw]

Subject: [PATCH 1/3] sched: add sched_task_call()

Add a sched_task_call() to allow a callback function to be called with
the task's rq locked. It guarantees that a task remains either active
or inactive during the function call, without having to expose rq
locking details outside of the scheduler.

Signed-off-by: Josh Poimboeuf <[email protected]>
---
include/linux/sched.h | 4 ++++
kernel/sched/core.c | 17 +++++++++++++++++
2 files changed, 21 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 41c60e5..9b61e66 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2506,6 +2506,10 @@ static inline void set_task_comm(struct task_struct *tsk, const char *from)
}
extern char *get_task_comm(char *to, struct task_struct *tsk);

+typedef void (*sched_task_call_func_t)(struct task_struct *tsk, void *data);
+extern void sched_task_call(sched_task_call_func_t func,
+ struct task_struct *tsk, void *data);
+
#ifdef CONFIG_SMP
void scheduler_ipi(void);
extern unsigned long wait_task_inactive(struct task_struct *, long match_state);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 13049aa..c83a451 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1338,6 +1338,23 @@ void kick_process(struct task_struct *p)
EXPORT_SYMBOL_GPL(kick_process);
#endif /* CONFIG_SMP */

+/***
+ * sched_task_call - call a function with a task's state locked
+ *
+ * The task is guaranteed to remain either active or inactive during the
+ * function call.
+ */
+void sched_task_call(sched_task_call_func_t func, struct task_struct *p,
+ void *data)
+{
+ unsigned long flags;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &flags);
+ func(p, data);
+ task_rq_unlock(rq, p, &flags);
+}
+
#ifdef CONFIG_SMP
/*
* ->cpus_allowed is protected by both rq->lock and p->pi_lock
--
2.1.0

2015-02-16 18:53:30

[permalink] [raw]

Subject: [PATCH 2/3] stacktrace: add save_stack_trace_tsk_safe()

It isn't possible to get the stack of a running task (other than
current) because we don't have the stack pointer and the stack can be
inconsistent anyway. Add a safe stack saving facility which only saves
the stack of the task if it's sleeping or if it's the current task.

Suggested-by: Jiri Kosina <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
---
include/linux/stacktrace.h | 2 ++
kernel/stacktrace.c | 22 ++++++++++++++++++++++
2 files changed, 24 insertions(+)

diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
index 669045a..898f0fc 100644
--- a/include/linux/stacktrace.h
+++ b/include/linux/stacktrace.h
@@ -20,6 +20,8 @@ extern void save_stack_trace_regs(struct pt_regs *regs,
struct stack_trace *trace);
extern void save_stack_trace_tsk(struct task_struct *tsk,
struct stack_trace *trace);
+extern void save_stack_trace_tsk_safe(struct task_struct *tsk,
+ struct stack_trace *trace);

extern void print_stack_trace(struct stack_trace *trace, int spaces);
extern int snprint_stack_trace(char *buf, size_t size,
diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
index b6e4c16..49cdd7f 100644
--- a/kernel/stacktrace.c
+++ b/kernel/stacktrace.c
@@ -57,6 +57,28 @@ int snprint_stack_trace(char *buf, size_t size,
}
EXPORT_SYMBOL_GPL(snprint_stack_trace);

+static void __save_stack_trace_tsk_safe(struct task_struct *tsk,
+ void *data)
+{
+ struct stack_trace *trace = data;
+
+ if (tsk != current && tsk->on_cpu)
+ return;
+
+ save_stack_trace_tsk(tsk, trace);
+}
+
+/*
+ * If the task is sleeping, safely get its stack. Otherwise, do nothing, since
+ * the stack of a running task is undefined.
+ */
+void save_stack_trace_tsk_safe(struct task_struct *tsk,
+ struct stack_trace *trace)
+{
+ sched_task_call(__save_stack_trace_tsk_safe, tsk, trace);
+}
+EXPORT_SYMBOL_GPL(save_stack_trace_tsk_safe);
+
/*
* Architectures that do not implement save_stack_trace_tsk or
* save_stack_trace_regs get this weak alias and a once-per-bootup warning
--
2.1.0

2015-02-16 18:53:26

[permalink] [raw]

Subject: [PATCH 3/3] proc: fix /proc/<pid>/stack for running tasks

Reading /proc/<pid>/stack for a running task can show garbage. Use the
new safe version of the stack saving interface. For running tasks
(other than current) it won't show anything.

Signed-off-by: Josh Poimboeuf <[email protected]>
---
fs/proc/base.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3f3d7ae..4c3f5d5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -288,7 +288,7 @@ static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,

err = lock_trace(task);
if (!err) {
- save_stack_trace_tsk(task, &trace);
+ save_stack_trace_tsk_safe(task, &trace);

for (i = 0; i < trace.nr_entries; i++) {
seq_printf(m, "[<%pK>] %pS\n",
--
2.1.0

2015-02-16 20:44:43

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Mon, Feb 16, 2015 at 12:52:34PM -0600, Josh Poimboeuf wrote:
> +++ b/kernel/sched/core.c
> @@ -1338,6 +1338,23 @@ void kick_process(struct task_struct *p)
> EXPORT_SYMBOL_GPL(kick_process);
> #endif /* CONFIG_SMP */
>
> +/***
> + * sched_task_call - call a function with a task's state locked
> + *
> + * The task is guaranteed to remain either active or inactive during the
> + * function call.
> + */
> +void sched_task_call(sched_task_call_func_t func, struct task_struct *p,
> + void *data)
> +{
> + unsigned long flags;
> + struct rq *rq;
> +
> + rq = task_rq_lock(p, &flags);
> + func(p, data);
> + task_rq_unlock(rq, p, &flags);
> +}

Yeah, I think not. We're so not going to allow running random code under
rq->lock and p->pi_lock.

2015-02-16 22:05:36

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Mon, Feb 16, 2015 at 09:44:36PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 16, 2015 at 12:52:34PM -0600, Josh Poimboeuf wrote:
> > +++ b/kernel/sched/core.c
> > @@ -1338,6 +1338,23 @@ void kick_process(struct task_struct *p)
> > EXPORT_SYMBOL_GPL(kick_process);
> > #endif /* CONFIG_SMP */
> >
> > +/***
> > + * sched_task_call - call a function with a task's state locked
> > + *
> > + * The task is guaranteed to remain either active or inactive during the
> > + * function call.
> > + */
> > +void sched_task_call(sched_task_call_func_t func, struct task_struct *p,
> > + void *data)
> > +{
> > + unsigned long flags;
> > + struct rq *rq;
> > +
> > + rq = task_rq_lock(p, &flags);
> > + func(p, data);
> > + task_rq_unlock(rq, p, &flags);
> > +}
>
> Yeah, I think not. We're so not going to allow running random code under
> rq->lock and p->pi_lock.

Yeah, I can understand that. I definitely want to avoid touching the
scheduler code. Basically I'm trying to find a way to atomically do the
following:

if (task is sleeping) {
walk the stack
if (certain set of functions isn't on the stack)
set (or clear) a thread flag for the task
}

Any ideas on how I can achieve that? So far my ideas are:

1. Use task_rq_lock() -- but rq_lock is internal to sched code.

2. Use wait_task_inactive() -- I could call it twice, with the stack
checking in between, and use ncsw to ensure that it didn't reschedule
in the mean time. But this still seems racy. i.e., I think the task
could start running after the second call to wait_task_inactive()
returns but before setting the thread flag. Not sure if that's a
realistic race condition or not.

3. Use set_cpus_allowed() to temporarily pin the task to its current
CPU, and then call smp_call_function_single() to run the above
critical section on that CPU. I'm not sure if there's a race-free
way to do it but it's a lot more disruptive than I'd like...

Any ideas or guidance would be greatly appreciated!

--
Josh

2015-02-17 09:25:05

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Mon, Feb 16, 2015 at 04:05:05PM -0600, Josh Poimboeuf wrote:
> Yeah, I can understand that. I definitely want to avoid touching the
> scheduler code. Basically I'm trying to find a way to atomically do the
> following:
>
> if (task is sleeping) {
> walk the stack
> if (certain set of functions isn't on the stack)
> set (or clear) a thread flag for the task
> }
>
> Any ideas on how I can achieve that? So far my ideas are:

So far stack unwinding has basically been a best effort debug output
kind of thing, you're wanting to make the integrity of the kernel depend
on it.

You require an absolute 100% correctness of the stack unwinder -- where
today it is; as stated above; a best effort debug output thing.

That is a _big_ change.

Has this been properly considered; has all the asm of the relevant
architectures been audited? Are you planning on maintaining that level
of audit for all new patches?

Because the way you propose to do things, we'll end up with silent but
deadly fail if the unwinder is less than 100% correct. No way to easily
debug that, no warns, just silent corruption.

Are you really really sure you want to go do this?

2015-02-17 14:13:39

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote:
> On Mon, Feb 16, 2015 at 04:05:05PM -0600, Josh Poimboeuf wrote:
> > Yeah, I can understand that. I definitely want to avoid touching the
> > scheduler code. Basically I'm trying to find a way to atomically do the
> > following:
> >
> > if (task is sleeping) {
> > walk the stack
> > if (certain set of functions isn't on the stack)
> > set (or clear) a thread flag for the task
> > }
> >
> > Any ideas on how I can achieve that? So far my ideas are:
>
> So far stack unwinding has basically been a best effort debug output
> kind of thing, you're wanting to make the integrity of the kernel depend
> on it.
>
> You require an absolute 100% correctness of the stack unwinder -- where
> today it is; as stated above; a best effort debug output thing.
>
> That is a _big_ change.

True, this does seem to be the first attempt to rely on the correctness
of the stack walking code.

> Has this been properly considered; has all the asm of the relevant
> architectures been audited? Are you planning on maintaining that level
> of audit for all new patches?

I agree, the unwinder needs to be 100% correct. Currently we only
support x86_64. Luckily, in general, stacks are very well defined and
walking the stack of a sleeping task should be straightforward. I don't
think it would be too hard to ensure the stack unwinder is right for
other architectures as we port them.

> Because the way you propose to do things, we'll end up with silent but
> deadly fail if the unwinder is less than 100% correct. No way to easily
> debug that, no warns, just silent corruption.

That's a good point. There's definitely room for additional error
checking in the x86 stack walking code. A couple of ideas:

- make sure it starts from a __schedule() call at the top
- make sure we actually reach the bottom of the stack
- make sure each stack frame's return address is immediately after a
call instruction

If we hit any of those errors, we can bail out, unregister with ftrace
and restore the system to its original state.

> Are you really really sure you want to go do this?

Basically, yes. We've had a lot of conversations about many different
variations of how to do this over the past year, and this is by far the
best approach we've come up with.

FWIW, we've been doing something similar with kpatch and stop_machine()
over the last 1+ years, and have run into zero problems with that
approach.

--
Josh

2015-02-17 18:15:51

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Tue, Feb 17, 2015 at 08:12:11AM -0600, Josh Poimboeuf wrote:
> On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote:
> > So far stack unwinding has basically been a best effort debug output
> > kind of thing, you're wanting to make the integrity of the kernel depend
> > on it.
> >
> > You require an absolute 100% correctness of the stack unwinder -- where
> > today it is; as stated above; a best effort debug output thing.
> >
> > That is a _big_ change.
>
> True, this does seem to be the first attempt to rely on the correctness
> of the stack walking code.
>
> > Has this been properly considered; has all the asm of the relevant
> > architectures been audited? Are you planning on maintaining that level
> > of audit for all new patches?
>
> I agree, the unwinder needs to be 100% correct. Currently we only
> support x86_64. Luckily, in general, stacks are very well defined and
> walking the stack of a sleeping task should be straightforward. I don't
> think it would be too hard to ensure the stack unwinder is right for
> other architectures as we port them.

I would not be too sure about that, the x86 framepointer thing is not
universal. IIRC some have to have some form of in-kernel dwarf
unwinding.

And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
because otherwise x86 stacks are a mess too.

So even with CONFIG_FRAMEPOINTER, things like copy_to/from_user, which
are implemented in asm, don't honour that. So if one of those faults and
the fault handler sleeps, you'll miss say your
'copy_user_enhanced_fast_string' entry.

Then again, those asm functions don't have function trace bits either,
so you can't patch them to begin with I suppose.

Here's to hoping you don't have to..

> > Because the way you propose to do things, we'll end up with silent but
> > deadly fail if the unwinder is less than 100% correct. No way to easily
> > debug that, no warns, just silent corruption.
>
> That's a good point. There's definitely room for additional error
> checking in the x86 stack walking code. A couple of ideas:
>
> - make sure it starts from a __schedule() call at the top
> - make sure we actually reach the bottom of the stack
> - make sure each stack frame's return address is immediately after a
> call instruction
>
> If we hit any of those errors, we can bail out, unregister with ftrace
> and restore the system to its original state.

And then hope you can get a better trace next time around? Or will you
fall-back to an alternative method of patching?

> > Are you really really sure you want to go do this?
>
> Basically, yes. We've had a lot of conversations about many different
> variations of how to do this over the past year, and this is by far the
> best approach we've come up with.

For some reason I'm thinking jikos is going to disagree with you on that
:-)

I'm further thinking we don't actually need 2 (or more) different means
of live patching in the kernel. So you all had better sit down (again)
and come up with something you all agree on.

> FWIW, we've been doing something similar with kpatch and stop_machine()
> over the last 1+ years, and have run into zero problems with that
> approach.

Could be you've just been lucky...

2015-02-17 21:26:03

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Tue, Feb 17, 2015 at 07:15:41PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 17, 2015 at 08:12:11AM -0600, Josh Poimboeuf wrote:
> > On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote:
> > > So far stack unwinding has basically been a best effort debug output
> > > kind of thing, you're wanting to make the integrity of the kernel depend
> > > on it.
> > >
> > > You require an absolute 100% correctness of the stack unwinder -- where
> > > today it is; as stated above; a best effort debug output thing.
> > >
> > > That is a _big_ change.
> >
> > True, this does seem to be the first attempt to rely on the correctness
> > of the stack walking code.
> >
> > > Has this been properly considered; has all the asm of the relevant
> > > architectures been audited? Are you planning on maintaining that level
> > > of audit for all new patches?
> >
> > I agree, the unwinder needs to be 100% correct. Currently we only
> > support x86_64. Luckily, in general, stacks are very well defined and
> > walking the stack of a sleeping task should be straightforward. I don't
> > think it would be too hard to ensure the stack unwinder is right for
> > other architectures as we port them.
>
> I would not be too sure about that, the x86 framepointer thing is not
> universal. IIRC some have to have some form of in-kernel dwarf
> unwinding.
>
> And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
> because otherwise x86 stacks are a mess too.

Yeah, it'll rely on CONFIG_FRAME_POINTER. IIUC, the arches we care
about now (x86, power, s390, arm64) all have frame pointer support, and
the stack formats are all sane, AFAICT.

If we ever port livepatch to a more obscure arch for which walking the
stack is more difficult, we'll have several options:

a) spend the time to ensure the unwinding code is correct and resilient
to errors;

b) leave the consistency model compiled code out if !FRAME_POINTER and
allow users to patch without one (similar to the livepatch code
that's already in the livepatch tree today); or

c) not support that arch.

> So even with CONFIG_FRAMEPOINTER, things like copy_to/from_user, which
> are implemented in asm, don't honour that. So if one of those faults and
> the fault handler sleeps, you'll miss say your
> 'copy_user_enhanced_fast_string' entry.
>
> Then again, those asm functions don't have function trace bits either,
> so you can't patch them to begin with I suppose.
>
> Here's to hoping you don't have to..

Right. We can't patch asm functions because we rely on ftrace, which
needs the -mfentry prologue.

> > > Because the way you propose to do things, we'll end up with silent but
> > > deadly fail if the unwinder is less than 100% correct. No way to easily
> > > debug that, no warns, just silent corruption.
> >
> > That's a good point. There's definitely room for additional error
> > checking in the x86 stack walking code. A couple of ideas:
> >
> > - make sure it starts from a __schedule() call at the top
> > - make sure we actually reach the bottom of the stack
> > - make sure each stack frame's return address is immediately after a
> > call instruction
> >
> > If we hit any of those errors, we can bail out, unregister with ftrace
> > and restore the system to its original state.
>
> And then hope you can get a better trace next time around? Or will you
> fall-back to an alternative method of patching?

Yeah, on second thought, we wouldn't have to cancel the patch. We could
defer to check the task's stack again at a later time. If it's stuck
there, the user can try sending it a signal to unstick it, or cancel the
patching process. Those mechanisms are already in place with my
consistency model patch set.

I'd also do a WARN_ON_ONCE and a dump of the full stack data, since I'm
guessing it would either indicate an error in the unwinding code or
point us to an unexpected stack condition.

> > > Are you really really sure you want to go do this?
> >
> > Basically, yes. We've had a lot of conversations about many different
> > variations of how to do this over the past year, and this is by far the
> > best approach we've come up with.
>
> For some reason I'm thinking jikos is going to disagree with you on that
> :-)
>
> I'm further thinking we don't actually need 2 (or more) different means
> of live patching in the kernel. So you all had better sit down (again)
> and come up with something you all agree on.

Yeah, I also _really_ want to avoid multiple consistency models.

In fact, that's a big motivation behind my consistency model patch set.
It's heavily inspired by a suggestion from Vojtech [1]. It combines
kpatch (backtrace checking) with kGraft (per-thread consistency). It
has several advantages over either of the individual approaches. See
http://lwn.net/Articles/632582/ where I describe its pros over both
kpatch and kGraft.

Jiri, would you and Vojtech agree that the proposed consistency model is
all we need? Or do you still want to do the multiple consistency model
thing?

> > FWIW, we've been doing something similar with kpatch and stop_machine()
> > over the last 1+ years, and have run into zero problems with that
> > approach.
>
> Could be you've just been lucky...

I can't really disagree with that ;-)

[1] https://lkml.org/lkml/2014/11/7/354

--
Josh

2015-02-18 00:13:22

by Andrew Morton

[permalink] [raw]

Subject: Re: [PATCH 2/3] stacktrace: add save_stack_trace_tsk_safe()

On Mon, 16 Feb 2015 12:52:35 -0600 Josh Poimboeuf <[email protected]> wrote:

> It isn't possible to get the stack of a running task (other than
> current) because we don't have the stack pointer and the stack can be
> inconsistent anyway. Add a safe stack saving facility which only saves
> the stack of the task if it's sleeping or if it's the current task.

Can you send the task an IPI and get it to save its stack for you?

There's probably some (messy, arch-dependent) way of trimming away the
interrupt-related stuff off the stack, if that's really needed.

2015-02-18 15:21:12

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Tue, Feb 17, 2015 at 03:25:32PM -0600, Josh Poimboeuf wrote:
> > And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
> > because otherwise x86 stacks are a mess too.
>
> Yeah, it'll rely on CONFIG_FRAME_POINTER. IIUC, the arches we care
> about now (x86, power, s390, arm64) all have frame pointer support, and
> the stack formats are all sane, AFAICT.
>
> If we ever port livepatch to a more obscure arch for which walking the
> stack is more difficult, we'll have several options:
>
> a) spend the time to ensure the unwinding code is correct and resilient
> to errors;
>
> b) leave the consistency model compiled code out if !FRAME_POINTER and
> allow users to patch without one (similar to the livepatch code
> that's already in the livepatch tree today); or

Which has a much more limited set of patches it can do, right?

> c) not support that arch.

Which would be sad of course.

> > And then hope you can get a better trace next time around? Or will you
> > fall-back to an alternative method of patching?
>
> Yeah, on second thought, we wouldn't have to cancel the patch. We could
> defer to check the task's stack again at a later time. If it's stuck
> there, the user can try sending it a signal to unstick it, or cancel the
> patching process. Those mechanisms are already in place with my
> consistency model patch set.
>
> I'd also do a WARN_ON_ONCE and a dump of the full stack data, since I'm
> guessing it would either indicate an error in the unwinding code or
> point us to an unexpected stack condition.

So uhm, what happens if your target task is running? When will you
retry? The problem I see is that if you do a sample approach you might
never hit an opportune moment.

> > I'm further thinking we don't actually need 2 (or more) different means
> > of live patching in the kernel. So you all had better sit down (again)
> > and come up with something you all agree on.
>
> Yeah, I also _really_ want to avoid multiple consistency models.
>
> In fact, that's a big motivation behind my consistency model patch set.
> It's heavily inspired by a suggestion from Vojtech [1]. It combines
> kpatch (backtrace checking) with kGraft (per-thread consistency). It
> has several advantages over either of the individual approaches. See
> http://lwn.net/Articles/632582/ where I describe its pros over both
> kpatch and kGraft.
>
> Jiri, would you and Vojtech agree that the proposed consistency model is
> all we need? Or do you still want to do the multiple consistency model
> thing?

Skimmed that thread; you all mostly seem to agree that one would be good
but not quite agree on which one.

And I note, not all seem to require this stack walking stuff.

2015-02-18 17:13:29

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Wed, Feb 18, 2015 at 04:21:00PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 17, 2015 at 03:25:32PM -0600, Josh Poimboeuf wrote:
> > > And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
> > > because otherwise x86 stacks are a mess too.
> >
> > Yeah, it'll rely on CONFIG_FRAME_POINTER. IIUC, the arches we care
> > about now (x86, power, s390, arm64) all have frame pointer support, and
> > the stack formats are all sane, AFAICT.
> >
> > If we ever port livepatch to a more obscure arch for which walking the
> > stack is more difficult, we'll have several options:
> >
> > a) spend the time to ensure the unwinding code is correct and resilient
> > to errors;
> >
> > b) leave the consistency model compiled code out if !FRAME_POINTER and
> > allow users to patch without one (similar to the livepatch code
> > that's already in the livepatch tree today); or
>
> Which has a much more limited set of patches it can do, right?

If we're talking security patches, roughly 9 out of 10 of them don't
change function prototypes, data, or data semantics. If the patch
author is careful, he or she doesn't need a consistency model in those
cases.

So it's not _overly_ limited. If somebody's not happy about those
limitations, they can put in the work for option a :-)

> > > And then hope you can get a better trace next time around? Or will you
> > > fall-back to an alternative method of patching?
> >
> > Yeah, on second thought, we wouldn't have to cancel the patch. We could
> > defer to check the task's stack again at a later time. If it's stuck
> > there, the user can try sending it a signal to unstick it, or cancel the
> > patching process. Those mechanisms are already in place with my
> > consistency model patch set.
> >
> > I'd also do a WARN_ON_ONCE and a dump of the full stack data, since I'm
> > guessing it would either indicate an error in the unwinding code or
> > point us to an unexpected stack condition.
>
> So uhm, what happens if your target task is running? When will you
> retry? The problem I see is that if you do a sample approach you might
> never hit an opportune moment.

We attack it from multiple angles.

First we check the stack of all sleeping tasks. That patches the
majority of tasks immediately. If necessary, we also do that
periodically in a workqueue to catch any stragglers.

The next line of attack is patching tasks when exiting the kernel to
user space (system calls, interrupts, signals), to catch all CPU-bound
and some I/O-bound tasks. That's done in patch 9 [1] of the consistency
model patch set.

As a last resort, if there are still any tasks which are sleeping on a
to-be-patched function, the user can send them SIGSTOP and SIGCONT to
force them to be patched.

If none of those work, the user has the option of either canceling the
patch or just waiting to see if the patching process eventually
completes.

> > > I'm further thinking we don't actually need 2 (or more) different means
> > > of live patching in the kernel. So you all had better sit down (again)
> > > and come up with something you all agree on.
> >
> > Yeah, I also _really_ want to avoid multiple consistency models.
> >
> > In fact, that's a big motivation behind my consistency model patch set.
> > It's heavily inspired by a suggestion from Vojtech [1]. It combines
> > kpatch (backtrace checking) with kGraft (per-thread consistency). It
> > has several advantages over either of the individual approaches. See
> > http://lwn.net/Articles/632582/ where I describe its pros over both
> > kpatch and kGraft.
> >
> > Jiri, would you and Vojtech agree that the proposed consistency model is
> > all we need? Or do you still want to do the multiple consistency model
> > thing?
>
> Skimmed that thread; you all mostly seem to agree that one would be good
> but not quite agree on which one.

Well, I think we at least agreed that LEAVE_PATCHED_SET + SWITCH_THREAD
is the most interesting combination. That's what my patches implement.
Let's wait to see what Jiri and Vojtech say.

> And I note, not all seem to require this stack walking stuff.

True. But without stack walking, you have bigger problems:

1. You have to signal _all_ sleeping tasks to force them to be patched.
That's much more disruptive to the system.

2. There's no way to convert kthreads without either:

a) hacking up all the various kthread loops to call
klp_update_task_universe(); or

b) converting all kthreads' code to be freezeable. Then we could
freeze all tasks to patch them. Not only would it be a lot of
work to make all kthreads freezable, but it would encode within
the kernel the dangerous and non-obvious assumption that the
freeze point is a "safe" place for patching.

[1] https://lkml.org/lkml/2015/2/9/478

--
Josh

2015-02-19 00:21:09

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote:

> > > a) spend the time to ensure the unwinding code is correct and resilient
> > > to errors;
> > >
> > > b) leave the consistency model compiled code out if !FRAME_POINTER and
> > > allow users to patch without one (similar to the livepatch code
> > > that's already in the livepatch tree today); or
> >
> > Which has a much more limited set of patches it can do, right?
>
> If we're talking security patches, roughly 9 out of 10 of them don't
> change function prototypes, data, or data semantics. If the patch
> author is careful, he or she doesn't need a consistency model in those
> cases.
>
> So it's not _overly_ limited. If somebody's not happy about those
> limitations, they can put in the work for option a :-)

OK. So all this is really only about really funny bits.

> > So uhm, what happens if your target task is running? When will you
> > retry? The problem I see is that if you do a sample approach you might
> > never hit an opportune moment.
>
> We attack it from multiple angles.
>
> First we check the stack of all sleeping tasks. That patches the
> majority of tasks immediately. If necessary, we also do that
> periodically in a workqueue to catch any stragglers.

So not only do you need an 'atomic' stack save, you need to analyze and
flip its state in the same atomic region. The task cannot start running
again after the save and start using old functions while you analyze the
stack and flip it.

> The next line of attack is patching tasks when exiting the kernel to
> user space (system calls, interrupts, signals), to catch all CPU-bound
> and some I/O-bound tasks. That's done in patch 9 [1] of the consistency
> model patch set.

So the HPC people are really into userspace that does for (;;) ; and
isolate that on CPUs and have the tick interrupt stopped and all that.

You'll not catch those threads on the sysexit path.

And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either.

Now its fairly easy to also handle this; just mark those tasks with a
_TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to
go-away, then flip their state and clear the flag.

> As a last resort, if there are still any tasks which are sleeping on a
> to-be-patched function, the user can send them SIGSTOP and SIGCONT to
> force them to be patched.

You typically cannot SIGSTOP/SIGCONT kernel threads. Also
TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.

Bit pesky that.. needs pondering.

2015-02-19 04:18:30

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, Feb 19, 2015 at 01:20:58AM +0100, Peter Zijlstra wrote:
> On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote:
> > > So uhm, what happens if your target task is running? When will you
> > > retry? The problem I see is that if you do a sample approach you might
> > > never hit an opportune moment.
> >
> > We attack it from multiple angles.
> >
> > First we check the stack of all sleeping tasks. That patches the
> > majority of tasks immediately. If necessary, we also do that
> > periodically in a workqueue to catch any stragglers.
>
> So not only do you need an 'atomic' stack save, you need to analyze and
> flip its state in the same atomic region. The task cannot start running
> again after the save and start using old functions while you analyze the
> stack and flip it.

Yes, exactly.

> > The next line of attack is patching tasks when exiting the kernel to
> > user space (system calls, interrupts, signals), to catch all CPU-bound
> > and some I/O-bound tasks. That's done in patch 9 [1] of the consistency
> > model patch set.
>
> So the HPC people are really into userspace that does for (;;) ; and
> isolate that on CPUs and have the tick interrupt stopped and all that.
>
> You'll not catch those threads on the sysexit path.
>
> And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either.
>
> Now its fairly easy to also handle this; just mark those tasks with a
> _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to
> go-away, then flip their state and clear the flag.

I guess you mean patch the task when it makes a syscall? I'm doing that
already on syscall exit with a bit in _TIF_ALLWORK_MASK and
_TIF_DO_NOTIFY_MASK.

> > As a last resort, if there are still any tasks which are sleeping on a
> > to-be-patched function, the user can send them SIGSTOP and SIGCONT to
> > force them to be patched.
>
> You typically cannot SIGSTOP/SIGCONT kernel threads. Also
> TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.
>
> Bit pesky that.. needs pondering.

I did have a scheme for patching kthreads which are sleeping on
to-be-patched functions.

But now I'm thinking that kthreads will almost never be a problem. Most
kthreads are basically this:

void thread_fn()
{
while (1) {
/* do some stuff */

schedule();

/* do other stuff */
}
}

So a kthread would typically only fail the stack check if we're trying
to patch either schedule() or the top-level thread_fn.

Patching thread_fn wouldn't be possible unless we killed the thread.

And I'd guess we can probably live without being able to patch
schedule() for now :-)

--
Josh

2015-02-19 10:16:17

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Wed, Feb 18, 2015 at 10:17:53PM -0600, Josh Poimboeuf wrote:
> On Thu, Feb 19, 2015 at 01:20:58AM +0100, Peter Zijlstra wrote:
> > On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote:

> > > The next line of attack is patching tasks when exiting the kernel to
> > > user space (system calls, interrupts, signals), to catch all CPU-bound
> > > and some I/O-bound tasks. That's done in patch 9 [1] of the consistency
> > > model patch set.
> >
> > So the HPC people are really into userspace that does for (;;) ; and
> > isolate that on CPUs and have the tick interrupt stopped and all that.
> >
> > You'll not catch those threads on the sysexit path.
> >
> > And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either.
> >
> > Now its fairly easy to also handle this; just mark those tasks with a
> > _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to
> > go-away, then flip their state and clear the flag.
>
> I guess you mean patch the task when it makes a syscall? I'm doing that
> already on syscall exit with a bit in _TIF_ALLWORK_MASK and
> _TIF_DO_NOTIFY_MASK.

No, these tasks will _never_ make syscalls. So you need to guarantee
they don't accidentally enter the kernel while you flip them. Something
like so should do.

You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
them then clear TIF_ENTER_WAIT.

---
arch/x86/include/asm/thread_info.h | 4 +++-
arch/x86/kernel/entry_64.S | 2 ++
2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e82e95abc92b..baa836f13536 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -90,6 +90,7 @@ struct thread_info {
#define TIF_SYSCALL_TRACEPOINT 28 /* syscall tracepoint instrumentation */
#define TIF_ADDR32 29 /* 32-bit address space on 64 bits */
#define TIF_X32 30 /* 32-bit native x86-64 binary */
+#define TIF_ENTER_WAIT 31

#define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
@@ -113,12 +114,13 @@ struct thread_info {
#define _TIF_SYSCALL_TRACEPOINT (1 << TIF_SYSCALL_TRACEPOINT)
#define _TIF_ADDR32 (1 << TIF_ADDR32)
#define _TIF_X32 (1 << TIF_X32)
+#define _TIF_ENTER_WAIT (1 << TIF_ENTER_WAIT)

/* work to do in syscall_trace_enter() */
#define _TIF_WORK_SYSCALL_ENTRY \
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT | \
_TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT | \
- _TIF_NOHZ)
+ _TIF_NOHZ | _TIF_ENTER_WAIT)

/* work to do in syscall_trace_leave() */
#define _TIF_WORK_SYSCALL_EXIT \
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index db13655c3a2a..735566b35903 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -387,6 +387,8 @@ GLOBAL(system_call_after_swapgs)

/* Do syscall tracing */
tracesys:
+ andl $_TIF_ENTER_WAIT,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+ jnz tracesys;
leaq -REST_SKIP(%rsp), %rdi
movq $AUDIT_ARCH_X86_64, %rsi
call syscall_trace_enter_phase1

> > > As a last resort, if there are still any tasks which are sleeping on a
> > > to-be-patched function, the user can send them SIGSTOP and SIGCONT to
> > > force them to be patched.
> >
> > You typically cannot SIGSTOP/SIGCONT kernel threads. Also
> > TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.
> >
> > Bit pesky that.. needs pondering.

I still absolutely hate you need to disturb userspace like that. Signals
are quite visible and perturb userspace state.

Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for
what they thought was a good reason. You'd wreck their state.

> But now I'm thinking that kthreads will almost never be a problem. Most
> kthreads are basically this:

You guys are way too optimistic; maybe its because I've worked on
realtime stuff too much, but I'm always looking at worst cases. If you
can't handle those, I feel you might as well not bother :-)

> Patching thread_fn wouldn't be possible unless we killed the thread.

It is, see kthread_park().

2015-02-19 16:24:57

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, Feb 19, 2015 at 11:16:07AM +0100, Peter Zijlstra wrote:
> On Wed, Feb 18, 2015 at 10:17:53PM -0600, Josh Poimboeuf wrote:
> > On Thu, Feb 19, 2015 at 01:20:58AM +0100, Peter Zijlstra wrote:
> > > On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote:
>
> > > > The next line of attack is patching tasks when exiting the kernel to
> > > > user space (system calls, interrupts, signals), to catch all CPU-bound
> > > > and some I/O-bound tasks. That's done in patch 9 [1] of the consistency
> > > > model patch set.
> > >
> > > So the HPC people are really into userspace that does for (;;) ; and
> > > isolate that on CPUs and have the tick interrupt stopped and all that.
> > >
> > > You'll not catch those threads on the sysexit path.
> > >
> > > And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either.
> > >
> > > Now its fairly easy to also handle this; just mark those tasks with a
> > > _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to
> > > go-away, then flip their state and clear the flag.
> >
> > I guess you mean patch the task when it makes a syscall? I'm doing that
> > already on syscall exit with a bit in _TIF_ALLWORK_MASK and
> > _TIF_DO_NOTIFY_MASK.
>
> No, these tasks will _never_ make syscalls. So you need to guarantee
> they don't accidentally enter the kernel while you flip them. Something
> like so should do.
>
> You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> them then clear TIF_ENTER_WAIT.

Ah, that's a good idea. But how do we check if they're in user space?

> > > > As a last resort, if there are still any tasks which are sleeping on a
> > > > to-be-patched function, the user can send them SIGSTOP and SIGCONT to
> > > > force them to be patched.
> > >
> > > You typically cannot SIGSTOP/SIGCONT kernel threads. Also
> > > TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.
> > >
> > > Bit pesky that.. needs pondering.
>
> I still absolutely hate you need to disturb userspace like that. Signals
> are quite visible and perturb userspace state.

Yeah, SIGSTOP on a sleeping task can be intrusive to user space if it
results in EINTR being returned from a system call. It's not ideal, but
it's much less intrusive than something like suspend.

But anyway we leave it up to the user to decide whether they want to
take that risk, or wait for the task to wake up on its own, or cancel
the patch.

> Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for
> what they thought was a good reason. You'd wreck their state.

Hm, that's a good point. Maybe use the freezer instead of signals?

(Again this would only be for those user tasks which are sleeping on a
patched function)

> > But now I'm thinking that kthreads will almost never be a problem. Most
> > kthreads are basically this:
>
> You guys are way too optimistic; maybe its because I've worked on
> realtime stuff too much, but I'm always looking at worst cases. If you
> can't handle those, I feel you might as well not bother :-)

Well, I think we're already resigned to the fact that live patching
won't work for every patch, every time. And that the patch author must
know what they're doing, and must do it carefully.

Our target worst case is that the patching fails gracefully and the user
can't patch their system with that particular patch. Or that the system
remains in a partially patched state forever, if the user is ok with
that.

> > Patching thread_fn wouldn't be possible unless we killed the thread.
>
> It is, see kthread_park().

When the kthread returns from kthread_parkme(), it'll still be running
the old thread_fn code, regardless of whether we flipped the task's
patch state.

--
Josh

2015-02-19 16:34:03

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:

> > No, these tasks will _never_ make syscalls. So you need to guarantee
> > they don't accidentally enter the kernel while you flip them. Something
> > like so should do.
> >
> > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > them then clear TIF_ENTER_WAIT.
>
> Ah, that's a good idea. But how do we check if they're in user space?

I don't see the benefit in holding them in a loop - you can just as well
flip them from the syscall code as kGraft does.

> > I still absolutely hate you need to disturb userspace like that. Signals
> > are quite visible and perturb userspace state.
>
> Yeah, SIGSTOP on a sleeping task can be intrusive to user space if it
> results in EINTR being returned from a system call. It's not ideal, but
> it's much less intrusive than something like suspend.
>
> But anyway we leave it up to the user to decide whether they want to
> take that risk, or wait for the task to wake up on its own, or cancel
> the patch.
>
> > Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for
> > what they thought was a good reason. You'd wreck their state.
>
> Hm, that's a good point. Maybe use the freezer instead of signals?
>
> (Again this would only be for those user tasks which are sleeping on a
> patched function)
>
> > > But now I'm thinking that kthreads will almost never be a problem. Most
> > > kthreads are basically this:
> >
> > You guys are way too optimistic; maybe its because I've worked on
> > realtime stuff too much, but I'm always looking at worst cases. If you
> > can't handle those, I feel you might as well not bother :-)
>
> Well, I think we're already resigned to the fact that live patching
> won't work for every patch, every time. And that the patch author must
> know what they're doing, and must do it carefully.
>
> Our target worst case is that the patching fails gracefully and the user
> can't patch their system with that particular patch. Or that the system
> remains in a partially patched state forever, if the user is ok with
> that.
>
> > > Patching thread_fn wouldn't be possible unless we killed the thread.
> >
> > It is, see kthread_park().
>
> When the kthread returns from kthread_parkme(), it'll still be running
> the old thread_fn code, regardless of whether we flipped the task's
> patch state.

We could instrument kthread_parkme() to be able to return to a different
function, but it'd be a bit crazy indeed.

--
Vojtech Pavlik
Director SUSE Labs

2015-02-19 17:04:25

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
>
> > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > they don't accidentally enter the kernel while you flip them. Something
> > > like so should do.
> > >
> > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > them then clear TIF_ENTER_WAIT.
> >
> > Ah, that's a good idea. But how do we check if they're in user space?
>
> I don't see the benefit in holding them in a loop - you can just as well
> flip them from the syscall code as kGraft does.

But we were talking specifically about HPC tasks which never make
syscalls.

--
Josh

2015-02-19 17:08:28

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, 19 Feb 2015, Josh Poimboeuf wrote:

> > > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > > they don't accidentally enter the kernel while you flip them. Something
> > > > like so should do.
> > > >
> > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > > them then clear TIF_ENTER_WAIT.
> > >
> > > Ah, that's a good idea. But how do we check if they're in user space?
> >
> > I don't see the benefit in holding them in a loop - you can just as well
> > flip them from the syscall code as kGraft does.
>
> But we were talking specifically about HPC tasks which never make
> syscalls.

Yeah. And getting answer to "is this task_struct currently running in
userspace?" is not easy in a non-disruptive way.

Piggy-backing on CONFIG_CONTEXT_TRACKING is one option, but I guess the
number of systems that have this option turned on will be rather limited
(especially in HPC space).

Sending IPIs around to check whether the task is running in user-space or
kernel-space would work, but it's disruptive to the running task, which
might be unwanted as well.

--
Jiri Kosina
SUSE Labs

2015-02-19 17:19:32

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> >
> > > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > > they don't accidentally enter the kernel while you flip them. Something
> > > > like so should do.
> > > >
> > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > > them then clear TIF_ENTER_WAIT.
> > >
> > > Ah, that's a good idea. But how do we check if they're in user space?
> >
> > I don't see the benefit in holding them in a loop - you can just as well
> > flip them from the syscall code as kGraft does.
>
> But we were talking specifically about HPC tasks which never make
> syscalls.

Yes. I'm saying that rather than guaranteeing they don't enter the
kernel (by having them spin) you can flip them in case they try to do
that instead. That solves the race condition just as well.

--
Vojtech Pavlik
Director SUSE Labs

2015-02-19 17:33:30

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote:
> On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> > On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> > >
> > > > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > > > they don't accidentally enter the kernel while you flip them. Something
> > > > > like so should do.
> > > > >
> > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > > > them then clear TIF_ENTER_WAIT.
> > > >
> > > > Ah, that's a good idea. But how do we check if they're in user space?
> > >
> > > I don't see the benefit in holding them in a loop - you can just as well
> > > flip them from the syscall code as kGraft does.
> >
> > But we were talking specifically about HPC tasks which never make
> > syscalls.
>
> Yes. I'm saying that rather than guaranteeing they don't enter the
> kernel (by having them spin) you can flip them in case they try to do
> that instead. That solves the race condition just as well.

Ok, gotcha.

We'd still need a safe way to check if they're in user space though.

How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set
right at the border. Then for running tasks it's as simple as:

if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
klp_switch_task_universe(task);

--
Josh

2015-02-19 17:48:24

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On February 19, 2015 6:32:55 PM CET, Josh Poimboeuf <[email protected]> wrote:

>> Yes. I'm saying that rather than guaranteeing they don't enter the
>> kernel (by having them spin) you can flip them in case they try to do
>> that instead. That solves the race condition just as well.
>
>Ok, gotcha.
>
>We'd still need a safe way to check if they're in user space though.
>
>How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set
>right at the border. Then for running tasks it's as simple as:
>
>if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
> klp_switch_task_universe(task);

The s390x arch used to have a TIF_SYSCALL, which was doing exactly that (well, negated). I think it would work well, but it isn't entirely for free: two instructions per syscall and an extra TIF bit, which are (near) exhausted on some archs.

--
Vojtech Pavlik
Director SuSE Labs

2015-02-19 20:40:45

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, Feb 19, 2015 at 11:32:55AM -0600, Josh Poimboeuf wrote:
> On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote:
> > On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> > > On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > > > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> > > >
> > > > > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > > > > they don't accidentally enter the kernel while you flip them. Something
> > > > > > like so should do.
> > > > > >
> > > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > > > > them then clear TIF_ENTER_WAIT.
> > > > >
> > > > > Ah, that's a good idea. But how do we check if they're in user space?
> > > >
> > > > I don't see the benefit in holding them in a loop - you can just as well
> > > > flip them from the syscall code as kGraft does.
> > >
> > > But we were talking specifically about HPC tasks which never make
> > > syscalls.
> >
> > Yes. I'm saying that rather than guaranteeing they don't enter the
> > kernel (by having them spin) you can flip them in case they try to do
> > that instead. That solves the race condition just as well.
>
> Ok, gotcha.
>
> We'd still need a safe way to check if they're in user space though.

Having a safe way would be very nice and actually quite useful in other
cases, too.

For this specific purpose, however, we don't need a very safe way,
though. We don't require atomicity in any way, we don't mind even if it
creates false negatives, only false positives would be bad.

kGraft looks at the stacktrace of CPU hogs and if it finds no kernel
addresses there, it assumes userspace. Not very nice, but does the job.

> How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set
> right at the border. Then for running tasks it's as simple as:
>
> if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
> klp_switch_task_universe(task);

--
Vojtech Pavlik
Director SUSE Labs

2015-02-19 21:26:13

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, 19 Feb 2015, Josh Poimboeuf wrote:

> How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set
> right at the border. Then for running tasks it's as simple as:
>
> if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
> klp_switch_task_universe(task);

That's in principle what CONTEXT_TRACKING is doing, i.e. the condition
we'd be interested in would be

__this_cpu_read(context_tracking.state) == IN_USER

But it has overhead.

--
Jiri Kosina
SUSE Labs

2015-02-19 21:38:40

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, 19 Feb 2015, Jiri Kosina wrote:

> > How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set
> > right at the border. Then for running tasks it's as simple as:
> >
> > if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
> > klp_switch_task_universe(task);
>
> That's in principle what CONTEXT_TRACKING is doing, i.e. the condition
> we'd be interested in would be
>
> __this_cpu_read(context_tracking.state) == IN_USER

Well, more precisely

per_cpu(context_tracking.state, cpu) == IN_USER

of course.

--
Jiri Kosina
SUSE Labs

2015-02-19 21:43:01

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, Feb 19, 2015 at 09:40:36PM +0100, Vojtech Pavlik wrote:
> On Thu, Feb 19, 2015 at 11:32:55AM -0600, Josh Poimboeuf wrote:
> > On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote:
> > > On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> > > > On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > > > > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> > > > >
> > > > > > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > > > > > they don't accidentally enter the kernel while you flip them. Something
> > > > > > > like so should do.
> > > > > > >
> > > > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > > > > > them then clear TIF_ENTER_WAIT.
> > > > > >
> > > > > > Ah, that's a good idea. But how do we check if they're in user space?
> > > > >
> > > > > I don't see the benefit in holding them in a loop - you can just as well
> > > > > flip them from the syscall code as kGraft does.
> > > >
> > > > But we were talking specifically about HPC tasks which never make
> > > > syscalls.
> > >
> > > Yes. I'm saying that rather than guaranteeing they don't enter the
> > > kernel (by having them spin) you can flip them in case they try to do
> > > that instead. That solves the race condition just as well.
> >
> > Ok, gotcha.
> >
> > We'd still need a safe way to check if they're in user space though.
>
> Having a safe way would be very nice and actually quite useful in other
> cases, too.
>
> For this specific purpose, however, we don't need a very safe way,
> though. We don't require atomicity in any way, we don't mind even if it
> creates false negatives, only false positives would be bad.
>
> kGraft looks at the stacktrace of CPU hogs and if it finds no kernel
> addresses there, it assumes userspace. Not very nice, but does the job.

So I've looked at kgr_needs_lazy_migration(), but I still have no idea
how it works.

First of all, I think reading the stack while its being written to could
give you some garbage values, and a completely wrong nr_entries value
from save_stack_trace_tsk().

But also, how would you walk a stack without knowing its stack pointer?
That function relies on the saved stack pointer in
task_struct.thread.sp, which, AFAICT, was last saved during the last
call to schedule(). Since then, the stack could have been completely
rewritten, with different size stack frames, before the task exited the
kernel.

Am I missing something?

--
Josh

2015-02-19 23:12:14

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, Feb 19, 2015 at 10:26:09PM +0100, Jiri Kosina wrote:
> On Thu, 19 Feb 2015, Josh Poimboeuf wrote:
>
> > How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set
> > right at the border. Then for running tasks it's as simple as:
> >
> > if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
> > klp_switch_task_universe(task);
>
> That's in principle what CONTEXT_TRACKING is doing, i.e. the condition
> we'd be interested in would be
>
> __this_cpu_read(context_tracking.state) == IN_USER
>
> But it has overhead.

Yeah, that does seem to pretty much do what we want. Unfortunately it
has a much higher overhead than just setting a thread flag.

And from the Kconfig description for CONTEXT_TRACKING_FORCE, which would
enable it on all CPUs during boot, "this option brings an overhead that
you don't want in production."

Maybe that code needs a rewrite to rely on a thread flag instead. Then
we could use it too.

--
Josh

2015-02-20 07:46:56

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, 19 Feb 2015, Josh Poimboeuf wrote:

> So I've looked at kgr_needs_lazy_migration(), but I still have no idea
> how it works.
>
> First of all, I think reading the stack while its being written to could
> give you some garbage values, and a completely wrong nr_entries value
> from save_stack_trace_tsk().

I believe we've already been discussing this some time ago ...

I agree that this is a very crude optimization that should probably be
either removed (which would only cause slower convergence in the presence
of CPU-bound tasks), or rewritten to perform IPI-based stack dumping
(probably on a voluntarily-configurable basis).

Reading garbage values could only happen if the task would be running in
kernelspace. nr_entries would then be at least 2.

But I agree that relying on this very specific behavior is not really
safe in general in case someone changes the stack dumping implementation
in the future in an unpredictable way.

> But also, how would you walk a stack without knowing its stack pointer?
> That function relies on the saved stack pointer in
> task_struct.thread.sp, which, AFAICT, was last saved during the last
> call to schedule(). Since then, the stack could have been completely
> rewritten, with different size stack frames, before the task exited the
> kernel.

Same argument holds here as well, I believe.

Thanks,

--
Jiri Kosina
SUSE Labs

2015-02-20 08:49:36

[permalink] [raw]

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

Alright, so to sum it up:

- current stack dumping (even looking at /proc/<pid>/stack) is not
guaranteed to yield "correct" results in case the task is running at the
time the stack is being examined

- the only fool-proof way is to send IPI-NMI to all CPUs, and synchronize
the handlers between each other (to make sure that reschedule doesn't
happen in between on some CPU and other task doesn't start running in
the interim).
The NMI handler dumps its current stack in case it's running in context
of the process whose stack is to be dumped. Otherwise, one of the NMI
handlers looks up the required task_struct, and dumps it if it's not
running on any CPU

- For live patching use-case, the stack has to be analyzed (and decision
on what to do based on the analysis) in the NMI handler itself,
otherwise it gets racy again

Converting /proc/<pid>/stack to this mechanism seems like a correct thing
to do in any case, as it's slow path anyway.

The original intent seemed to have been to make this fast path for the
live patching case, but that's probably not possible, so it seems like the
price that will have to be paid for being able to finish live-patching of
CPU-bound processess is the cost of IPI-NMI broadcast.

Agreed?

Thanks,

--
Jiri Kosina
SUSE Labs

2015-02-20 09:32:47