2015-02-16 18:53:29

by Josh Poimboeuf

Subject: [PATCH 0/3] prevent /proc/<pid>/stack garbage for running tasks

Reading /proc/<pid>/stack for a running task (other than current) can print
garbage because the saved stack pointer is no longer accurate and the stack
itself can be inconsistent.

Add new sched and stacktrace functions so that /proc/<pid>/stack only walks the
stack for sleeping tasks and the current task.

The new sched_task_call() function will also be useful for future live patching
code which will need to atomically examine a task's stack before patching it.

Josh Poimboeuf (3):
sched: add sched_task_call()
stacktrace: add save_stack_trace_tsk_safe()
proc: fix /proc/<pid>/stack for running tasks

 fs/proc/base.c             |  2 +-
 include/linux/sched.h      |  4 ++++
 include/linux/stacktrace.h |  2 ++
 kernel/sched/core.c        | 17 +++++++++++++++++
 kernel/stacktrace.c        | 22 ++++++++++++++++++++++
5 files changed, 46 insertions(+), 1 deletion(-)

--
2.1.0


2015-02-16 18:53:31

by Josh Poimboeuf

Subject: [PATCH 1/3] sched: add sched_task_call()

Add a sched_task_call() to allow a callback function to be called with
the task's rq locked. It guarantees that a task remains either active
or inactive during the function call, without having to expose rq
locking details outside of the scheduler.

Signed-off-by: Josh Poimboeuf <[email protected]>
---
include/linux/sched.h | 4 ++++
kernel/sched/core.c | 17 +++++++++++++++++
2 files changed, 21 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 41c60e5..9b61e66 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2506,6 +2506,10 @@ static inline void set_task_comm(struct task_struct *tsk, const char *from)
}
extern char *get_task_comm(char *to, struct task_struct *tsk);

+typedef void (*sched_task_call_func_t)(struct task_struct *tsk, void *data);
+extern void sched_task_call(sched_task_call_func_t func,
+ struct task_struct *tsk, void *data);
+
#ifdef CONFIG_SMP
void scheduler_ipi(void);
extern unsigned long wait_task_inactive(struct task_struct *, long match_state);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 13049aa..c83a451 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1338,6 +1338,23 @@ void kick_process(struct task_struct *p)
EXPORT_SYMBOL_GPL(kick_process);
#endif /* CONFIG_SMP */

+/***
+ * sched_task_call - call a function with a task's state locked
+ *
+ * The task is guaranteed to remain either active or inactive during the
+ * function call.
+ */
+void sched_task_call(sched_task_call_func_t func, struct task_struct *p,
+		     void *data)
+{
+	unsigned long flags;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &flags);
+	func(p, data);
+	task_rq_unlock(rq, p, &flags);
+}
+
#ifdef CONFIG_SMP
/*
* ->cpus_allowed is protected by both rq->lock and p->pi_lock
--
2.1.0

2015-02-16 18:53:30

by Josh Poimboeuf

Subject: [PATCH 2/3] stacktrace: add save_stack_trace_tsk_safe()

It isn't possible to get the stack of a running task (other than
current) because we don't have the stack pointer and the stack can be
inconsistent anyway. Add a safe stack saving facility which only saves
the stack of the task if it's sleeping or if it's the current task.

Suggested-by: Jiri Kosina <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
---
include/linux/stacktrace.h | 2 ++
kernel/stacktrace.c | 22 ++++++++++++++++++++++
2 files changed, 24 insertions(+)

diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
index 669045a..898f0fc 100644
--- a/include/linux/stacktrace.h
+++ b/include/linux/stacktrace.h
@@ -20,6 +20,8 @@ extern void save_stack_trace_regs(struct pt_regs *regs,
struct stack_trace *trace);
extern void save_stack_trace_tsk(struct task_struct *tsk,
struct stack_trace *trace);
+extern void save_stack_trace_tsk_safe(struct task_struct *tsk,
+ struct stack_trace *trace);

extern void print_stack_trace(struct stack_trace *trace, int spaces);
extern int snprint_stack_trace(char *buf, size_t size,
diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
index b6e4c16..49cdd7f 100644
--- a/kernel/stacktrace.c
+++ b/kernel/stacktrace.c
@@ -57,6 +57,28 @@ int snprint_stack_trace(char *buf, size_t size,
}
EXPORT_SYMBOL_GPL(snprint_stack_trace);

+static void __save_stack_trace_tsk_safe(struct task_struct *tsk,
+					 void *data)
+{
+	struct stack_trace *trace = data;
+
+	if (tsk != current && tsk->on_cpu)
+		return;
+
+	save_stack_trace_tsk(tsk, trace);
+}
+
+/*
+ * If the task is sleeping, safely get its stack. Otherwise, do nothing, since
+ * the stack of a running task is undefined.
+ */
+void save_stack_trace_tsk_safe(struct task_struct *tsk,
+			       struct stack_trace *trace)
+{
+	sched_task_call(__save_stack_trace_tsk_safe, tsk, trace);
+}
+EXPORT_SYMBOL_GPL(save_stack_trace_tsk_safe);
+
/*
* Architectures that do not implement save_stack_trace_tsk or
* save_stack_trace_regs get this weak alias and a once-per-bootup warning
--
2.1.0

2015-02-16 18:53:26

by Josh Poimboeuf

Subject: [PATCH 3/3] proc: fix /proc/<pid>/stack for running tasks

Reading /proc/<pid>/stack for a running task can show garbage. Use the
new safe version of the stack saving interface. For running tasks
(other than current) it won't show anything.

Signed-off-by: Josh Poimboeuf <[email protected]>
---
fs/proc/base.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3f3d7ae..4c3f5d5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -288,7 +288,7 @@ static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,

err = lock_trace(task);
if (!err) {
-		save_stack_trace_tsk(task, &trace);
+		save_stack_trace_tsk_safe(task, &trace);

for (i = 0; i < trace.nr_entries; i++) {
seq_printf(m, "[<%pK>] %pS\n",
--
2.1.0

2015-02-16 20:44:43

by Peter Zijlstra

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Mon, Feb 16, 2015 at 12:52:34PM -0600, Josh Poimboeuf wrote:
> +++ b/kernel/sched/core.c
> @@ -1338,6 +1338,23 @@ void kick_process(struct task_struct *p)
> EXPORT_SYMBOL_GPL(kick_process);
> #endif /* CONFIG_SMP */
>
> +/***
> + * sched_task_call - call a function with a task's state locked
> + *
> + * The task is guaranteed to remain either active or inactive during the
> + * function call.
> + */
> +void sched_task_call(sched_task_call_func_t func, struct task_struct *p,
> + void *data)
> +{
> + unsigned long flags;
> + struct rq *rq;
> +
> + rq = task_rq_lock(p, &flags);
> + func(p, data);
> + task_rq_unlock(rq, p, &flags);
> +}

Yeah, I think not. We're so not going to allow running random code under
rq->lock and p->pi_lock.

2015-02-16 22:05:36

by Josh Poimboeuf

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Mon, Feb 16, 2015 at 09:44:36PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 16, 2015 at 12:52:34PM -0600, Josh Poimboeuf wrote:
> > +++ b/kernel/sched/core.c
> > @@ -1338,6 +1338,23 @@ void kick_process(struct task_struct *p)
> > EXPORT_SYMBOL_GPL(kick_process);
> > #endif /* CONFIG_SMP */
> >
> > +/***
> > + * sched_task_call - call a function with a task's state locked
> > + *
> > + * The task is guaranteed to remain either active or inactive during the
> > + * function call.
> > + */
> > +void sched_task_call(sched_task_call_func_t func, struct task_struct *p,
> > + void *data)
> > +{
> > + unsigned long flags;
> > + struct rq *rq;
> > +
> > + rq = task_rq_lock(p, &flags);
> > + func(p, data);
> > + task_rq_unlock(rq, p, &flags);
> > +}
>
> Yeah, I think not. We're so not going to allow running random code under
> rq->lock and p->pi_lock.

Yeah, I can understand that. I definitely want to avoid touching the
scheduler code. Basically I'm trying to find a way to atomically do the
following:

	if (task is sleeping) {
		walk the stack
		if (certain set of functions isn't on the stack)
			set (or clear) a thread flag for the task
	}
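
Expressed as code, the check itself would be roughly the sketch below
(klp_stack_has_patched_funcs() and TIF_KLP_NEED_UPDATE are made-up placeholder
names, and the whole thing is only valid if the task can't wake up halfway
through):

static bool klp_check_task_stack(struct task_struct *task)
{
	static unsigned long entries[128];	/* sketch: assume one caller */
	struct stack_trace trace = {
		.entries	= entries,
		.max_entries	= ARRAY_SIZE(entries),
	};

	/* on_cpu is the (SMP) way to tell the task isn't running right now. */
	if (task != current && task->on_cpu)
		return false;

	save_stack_trace_tsk(task, &trace);

	/* Placeholder for "certain set of functions isn't on the stack". */
	if (klp_stack_has_patched_funcs(&trace))
		return false;

	set_tsk_thread_flag(task, TIF_KLP_NEED_UPDATE);	/* hypothetical flag */
	return true;
}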

Any ideas on how I can achieve that? So far my ideas are:

1. Use task_rq_lock() -- but rq_lock is internal to sched code.

2. Use wait_task_inactive() -- I could call it twice, with the stack
checking in between, and use ncsw to ensure that it didn't reschedule
in the meantime. But this still seems racy, i.e. I think the task
could start running after the second call to wait_task_inactive()
returns but before setting the thread flag. Not sure if that's a
realistic race condition or not.

3. Use set_cpus_allowed() to temporarily pin the task to its current
CPU, and then call smp_call_function_single() to run the above
critical section on that CPU. I'm not sure if there's a race-free
way to do it but it's a lot more disruptive than I'd like...

Any ideas or guidance would be greatly appreciated!

--
Josh

2015-02-17 09:25:05

by Peter Zijlstra

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Mon, Feb 16, 2015 at 04:05:05PM -0600, Josh Poimboeuf wrote:
> Yeah, I can understand that. I definitely want to avoid touching the
> scheduler code. Basically I'm trying to find a way to atomically do the
> following:
>
> if (task is sleeping) {
> walk the stack
> if (certain set of functions isn't on the stack)
> set (or clear) a thread flag for the task
> }
>
> Any ideas on how I can achieve that? So far my ideas are:

So far stack unwinding has basically been a best effort debug output
kind of thing, you're wanting to make the integrity of the kernel depend
on it.

You require an absolute 100% correctness of the stack unwinder -- where
today it is; as stated above; a best effort debug output thing.

That is a _big_ change.

Has this been properly considered; has all the asm of the relevant
architectures been audited? Are you planning on maintaining that level
of audit for all new patches?

Because the way you propose to do things, we'll end up with silent but
deadly fail if the unwinder is less than 100% correct. No way to easily
debug that, no warns, just silent corruption.

Are you really really sure you want to go do this?

2015-02-17 14:13:39

by Josh Poimboeuf

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote:
> On Mon, Feb 16, 2015 at 04:05:05PM -0600, Josh Poimboeuf wrote:
> > Yeah, I can understand that. I definitely want to avoid touching the
> > scheduler code. Basically I'm trying to find a way to atomically do the
> > following:
> >
> > if (task is sleeping) {
> > walk the stack
> > if (certain set of functions isn't on the stack)
> > set (or clear) a thread flag for the task
> > }
> >
> > Any ideas on how I can achieve that? So far my ideas are:
>
> So far stack unwinding has basically been a best effort debug output
> kind of thing, you're wanting to make the integrity of the kernel depend
> on it.
>
> You require an absolute 100% correctness of the stack unwinder -- where
> today it is; as stated above; a best effort debug output thing.
>
> That is a _big_ change.

True, this does seem to be the first attempt to rely on the correctness
of the stack walking code.

> Has this been properly considered; has all the asm of the relevant
> architectures been audited? Are you planning on maintaining that level
> of audit for all new patches?

I agree, the unwinder needs to be 100% correct. Currently we only
support x86_64. Luckily, in general, stacks are very well defined and
walking the stack of a sleeping task should be straightforward. I don't
think it would be too hard to ensure the stack unwinder is right for
other architectures as we port them.

> Because the way you propose to do things, we'll end up with silent but
> deadly fail if the unwinder is less than 100% correct. No way to easily
> debug that, no warns, just silent corruption.

That's a good point. There's definitely room for additional error
checking in the x86 stack walking code. A couple of ideas:

- make sure it starts from a __schedule() call at the top
- make sure we actually reach the bottom of the stack
- make sure each stack frame's return address is immediately after a
call instruction

If we hit any of those errors, we can bail out, unregister with ftrace
and restore the system to its original state.
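
For illustration, the per-frame sanity checking could look something like this
sketch for a frame-pointer walk (the helper and the exact checks are
illustrative only):

/*
 * Sketch only: validate one frame of a frame-pointer stack walk.
 * A real implementation would also verify that the walk terminates at a
 * __schedule() frame and reaches the bottom of the stack.
 */
static bool klp_frame_looks_sane(struct task_struct *task,
				 unsigned long bp, unsigned long ret_addr)
{
	unsigned long stack_lo = (unsigned long)task_stack_page(task);
	unsigned long stack_hi = stack_lo + THREAD_SIZE;

	/* The saved frame pointer must stay within the task's stack. */
	if (bp < stack_lo || bp >= stack_hi)
		return false;

	/* The return address must point into kernel text. */
	if (!__kernel_text_address(ret_addr))
		return false;

	return true;
}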

> Are you really really sure you want to go do this?

Basically, yes. We've had a lot of conversations about many different
variations of how to do this over the past year, and this is by far the
best approach we've come up with.

FWIW, we've been doing something similar with kpatch and stop_machine()
over the last 1+ years, and have run into zero problems with that
approach.

--
Josh

2015-02-17 18:15:51

by Peter Zijlstra

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Tue, Feb 17, 2015 at 08:12:11AM -0600, Josh Poimboeuf wrote:
> On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote:
> > So far stack unwinding has basically been a best effort debug output
> > kind of thing, you're wanting to make the integrity of the kernel depend
> > on it.
> >
> > You require an absolute 100% correctness of the stack unwinder -- where
> > today it is; as stated above; a best effort debug output thing.
> >
> > That is a _big_ change.
>
> True, this does seem to be the first attempt to rely on the correctness
> of the stack walking code.
>
> > Has this been properly considered; has all the asm of the relevant
> > architectures been audited? Are you planning on maintaining that level
> > of audit for all new patches?
>
> I agree, the unwinder needs to be 100% correct. Currently we only
> support x86_64. Luckily, in general, stacks are very well defined and
> walking the stack of a sleeping task should be straightforward. I don't
> think it would be too hard to ensure the stack unwinder is right for
> other architectures as we port them.

I would not be too sure about that, the x86 framepointer thing is not
universal. IIRC some have to have some form of in-kernel dwarf
unwinding.

And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
because otherwise x86 stacks are a mess too.

So even with CONFIG_FRAMEPOINTER, things like copy_to/from_user, which
are implemented in asm, don't honour that. So if one of those faults and
the fault handler sleeps, you'll miss say your
'copy_user_enhanced_fast_string' entry.

Then again, those asm functions don't have function trace bits either,
so you can't patch them to begin with I suppose.

Here's to hoping you don't have to..

> > Because the way you propose to do things, we'll end up with silent but
> > deadly fail if the unwinder is less than 100% correct. No way to easily
> > debug that, no warns, just silent corruption.
>
> That's a good point. There's definitely room for additional error
> checking in the x86 stack walking code. A couple of ideas:
>
> - make sure it starts from a __schedule() call at the top
> - make sure we actually reach the bottom of the stack
> - make sure each stack frame's return address is immediately after a
> call instruction
>
> If we hit any of those errors, we can bail out, unregister with ftrace
> and restore the system to its original state.

And then hope you can get a better trace next time around? Or will you
fall-back to an alternative method of patching?

> > Are you really really sure you want to go do this?
>
> Basically, yes. We've had a lot of conversations about many different
> variations of how to do this over the past year, and this is by far the
> best approach we've come up with.

For some reason I'm thinking jikos is going to disagree with you on that
:-)

I'm further thinking we don't actually need 2 (or more) different means
of live patching in the kernel. So you all had better sit down (again)
and come up with something you all agree on.

> FWIW, we've been doing something similar with kpatch and stop_machine()
> over the last 1+ years, and have run into zero problems with that
> approach.

Could be you've just been lucky...

2015-02-17 21:26:03

by Josh Poimboeuf

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Tue, Feb 17, 2015 at 07:15:41PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 17, 2015 at 08:12:11AM -0600, Josh Poimboeuf wrote:
> > On Tue, Feb 17, 2015 at 10:24:50AM +0100, Peter Zijlstra wrote:
> > > So far stack unwinding has basically been a best effort debug output
> > > kind of thing, you're wanting to make the integrity of the kernel depend
> > > on it.
> > >
> > > You require an absolute 100% correctness of the stack unwinder -- where
> > > today it is; as stated above; a best effort debug output thing.
> > >
> > > That is a _big_ change.
> >
> > True, this does seem to be the first attempt to rely on the correctness
> > of the stack walking code.
> >
> > > Has this been properly considered; has all the asm of the relevant
> > > architectures been audited? Are you planning on maintaining that level
> > > of audit for all new patches?
> >
> > I agree, the unwinder needs to be 100% correct. Currently we only
> > support x86_64. Luckily, in general, stacks are very well defined and
> > walking the stack of a sleeping task should be straightforward. I don't
> > think it would be too hard to ensure the stack unwinder is right for
> > other architectures as we port them.
>
> I would not be too sure about that, the x86 framepointer thing is not
> universal. IIRC some have to have some form of in-kernel dwarf
> unwinding.
>
> And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
> because otherwise x86 stacks are a mess too.

Yeah, it'll rely on CONFIG_FRAME_POINTER. IIUC, the arches we care
about now (x86, power, s390, arm64) all have frame pointer support, and
the stack formats are all sane, AFAICT.

If we ever port livepatch to a more obscure arch for which walking the
stack is more difficult, we'll have several options:

a) spend the time to ensure the unwinding code is correct and resilient
to errors;

b) leave the consistency model compiled code out if !FRAME_POINTER and
allow users to patch without one (similar to the livepatch code
that's already in the livepatch tree today); or

c) not support that arch.

> So even with CONFIG_FRAMEPOINTER, things like copy_to/from_user, which
> are implemented in asm, don't honour that. So if one of those faults and
> the fault handler sleeps, you'll miss say your
> 'copy_user_enhanced_fast_string' entry.
>
> Then again, those asm functions don't have function trace bits either,
> so you can't patch them to begin with I suppose.
>
> Here's to hoping you don't have to..

Right. We can't patch asm functions because we rely on ftrace, which
needs the -mfentry prologue.

> > > Because the way you propose to do things, we'll end up with silent but
> > > deadly fail if the unwinder is less than 100% correct. No way to easily
> > > debug that, no warns, just silent corruption.
> >
> > That's a good point. There's definitely room for additional error
> > checking in the x86 stack walking code. A couple of ideas:
> >
> > - make sure it starts from a __schedule() call at the top
> > - make sure we actually reach the bottom of the stack
> > - make sure each stack frame's return address is immediately after a
> > call instruction
> >
> > If we hit any of those errors, we can bail out, unregister with ftrace
> > and restore the system to its original state.
>
> And then hope you can get a better trace next time around? Or will you
> fall-back to an alternative method of patching?

Yeah, on second thought, we wouldn't have to cancel the patch. We could
defer to check the task's stack again at a later time. If it's stuck
there, the user can try sending it a signal to unstick it, or cancel the
patching process. Those mechanisms are already in place with my
consistency model patch set.

I'd also do a WARN_ON_ONCE and a dump of the full stack data, since I'm
guessing it would either indicate an error in the unwinding code or
point us to an unexpected stack condition.

> > > Are you really really sure you want to go do this?
> >
> > Basically, yes. We've had a lot of conversations about many different
> > variations of how to do this over the past year, and this is by far the
> > best approach we've come up with.
>
> For some reason I'm thinking jikos is going to disagree with you on that
> :-)
>
> I'm further thinking we don't actually need 2 (or more) different means
> of live patching in the kernel. So you all had better sit down (again)
> and come up with something you all agree on.

Yeah, I also _really_ want to avoid multiple consistency models.

In fact, that's a big motivation behind my consistency model patch set.
It's heavily inspired by a suggestion from Vojtech [1]. It combines
kpatch (backtrace checking) with kGraft (per-thread consistency). It
has several advantages over either of the individual approaches. See
http://lwn.net/Articles/632582/ where I describe its pros over both
kpatch and kGraft.

Jiri, would you and Vojtech agree that the proposed consistency model is
all we need? Or do you still want to do the multiple consistency model
thing?

> > FWIW, we've been doing something similar with kpatch and stop_machine()
> > over the last 1+ years, and have run into zero problems with that
> > approach.
>
> Could be you've just been lucky...

I can't really disagree with that ;-)


[1] https://lkml.org/lkml/2014/11/7/354

--
Josh

2015-02-18 00:13:22

by Andrew Morton

Subject: Re: [PATCH 2/3] stacktrace: add save_stack_trace_tsk_safe()

On Mon, 16 Feb 2015 12:52:35 -0600 Josh Poimboeuf <[email protected]> wrote:

> It isn't possible to get the stack of a running task (other than
> current) because we don't have the stack pointer and the stack can be
> inconsistent anyway. Add a safe stack saving facility which only saves
> the stack of the task if it's sleeping or if it's the current task.

Can you send the task an IPI and get it to save its stack for you?

There's probably some (messy, arch-dependent) way of trimming away the
interrupt-related stuff off the stack, if that's really needed.
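
A rough sketch of that idea, with made-up names for the request struct and
handler (the caller would still have to handle the task migrating away, and
the resulting trace would include the interrupt-related frames mentioned
above):

struct stack_ipi_req {
	struct task_struct	*task;
	struct stack_trace	*trace;
	bool			saved;
};

static void save_stack_on_cpu(void *info)
{
	struct stack_ipi_req *req = info;

	/* Only useful if the target task is still what's running here. */
	if (current != req->task)
		return;

	/* Saves the stack of 'current'; it will include the IRQ frames. */
	save_stack_trace(req->trace);
	req->saved = true;
}

static bool save_running_task_stack(struct task_struct *task,
				    struct stack_trace *trace)
{
	struct stack_ipi_req req = { .task = task, .trace = trace };

	smp_call_function_single(task_cpu(task), save_stack_on_cpu, &req, 1);
	return req.saved;
}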

2015-02-18 15:21:12

by Peter Zijlstra

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Tue, Feb 17, 2015 at 03:25:32PM -0600, Josh Poimboeuf wrote:
> > And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
> > because otherwise x86 stacks are a mess too.
>
> Yeah, it'll rely on CONFIG_FRAME_POINTER. IIUC, the arches we care
> about now (x86, power, s390, arm64) all have frame pointer support, and
> the stack formats are all sane, AFAICT.
>
> If we ever port livepatch to a more obscure arch for which walking the
> stack is more difficult, we'll have several options:
>
> a) spend the time to ensure the unwinding code is correct and resilient
> to errors;
>
> b) leave the consistency model compiled code out if !FRAME_POINTER and
> allow users to patch without one (similar to the livepatch code
> that's already in the livepatch tree today); or

Which has a much more limited set of patches it can do, right?

> c) not support that arch.

Which would be sad of course.

> > And then hope you can get a better trace next time around? Or will you
> > fall-back to an alternative method of patching?
>
> Yeah, on second thought, we wouldn't have to cancel the patch. We could
> defer to check the task's stack again at a later time. If it's stuck
> there, the user can try sending it a signal to unstick it, or cancel the
> patching process. Those mechanisms are already in place with my
> consistency model patch set.
>
> I'd also do a WARN_ON_ONCE and a dump of the full stack data, since I'm
> guessing it would either indicate an error in the unwinding code or
> point us to an unexpected stack condition.

So uhm, what happens if your target task is running? When will you
retry? The problem I see is that if you do a sample approach you might
never hit an opportune moment.

> > I'm further thinking we don't actually need 2 (or more) different means
> > of live patching in the kernel. So you all had better sit down (again)
> > and come up with something you all agree on.
>
> Yeah, I also _really_ want to avoid multiple consistency models.
>
> In fact, that's a big motivation behind my consistency model patch set.
> It's heavily inspired by a suggestion from Vojtech [1]. It combines
> kpatch (backtrace checking) with kGraft (per-thread consistency). It
> has several advantages over either of the individual approaches. See
> http://lwn.net/Articles/632582/ where I describe its pros over both
> kpatch and kGraft.
>
> Jiri, would you and Vojtech agree that the proposed consistency model is
> all we need? Or do you still want to do the multiple consistency model
> thing?

Skimmed that thread; you all mostly seem to agree that one would be good
but not quite agree on which one.

And I note, not all seem to require this stack walking stuff.

2015-02-18 17:13:29

by Josh Poimboeuf

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Wed, Feb 18, 2015 at 04:21:00PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 17, 2015 at 03:25:32PM -0600, Josh Poimboeuf wrote:
> > > And I'm assuming you're hard relying on CONFIG_FRAMEPOINTER here,
> > > because otherwise x86 stacks are a mess too.
> >
> > Yeah, it'll rely on CONFIG_FRAME_POINTER. IIUC, the arches we care
> > about now (x86, power, s390, arm64) all have frame pointer support, and
> > the stack formats are all sane, AFAICT.
> >
> > If we ever port livepatch to a more obscure arch for which walking the
> > stack is more difficult, we'll have several options:
> >
> > a) spend the time to ensure the unwinding code is correct and resilient
> > to errors;
> >
> > b) leave the consistency model compiled code out if !FRAME_POINTER and
> > allow users to patch without one (similar to the livepatch code
> > that's already in the livepatch tree today); or
>
> Which has a much more limited set of patches it can do, right?

If we're talking security patches, roughly 9 out of 10 of them don't
change function prototypes, data, or data semantics. If the patch
author is careful, he or she doesn't need a consistency model in those
cases.

So it's not _overly_ limited. If somebody's not happy about those
limitations, they can put in the work for option a :-)

> > > And then hope you can get a better trace next time around? Or will you
> > > fall-back to an alternative method of patching?
> >
> > Yeah, on second thought, we wouldn't have to cancel the patch. We could
> > defer to check the task's stack again at a later time. If it's stuck
> > there, the user can try sending it a signal to unstick it, or cancel the
> > patching process. Those mechanisms are already in place with my
> > consistency model patch set.
> >
> > I'd also do a WARN_ON_ONCE and a dump of the full stack data, since I'm
> > guessing it would either indicate an error in the unwinding code or
> > point us to an unexpected stack condition.
>
> So uhm, what happens if your target task is running? When will you
> retry? The problem I see is that if you do a sample approach you might
> never hit an opportune moment.

We attack it from multiple angles.

First we check the stack of all sleeping tasks. That patches the
majority of tasks immediately. If necessary, we also do that
periodically in a workqueue to catch any stragglers.
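
For illustration, that periodic pass could be as small as the sketch below,
where klp_try_switch_task() is a placeholder for the atomic
stack-check-and-flip:

static void klp_transition_work_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(klp_transition_work, klp_transition_work_fn);

static void klp_transition_work_fn(struct work_struct *work)
{
	struct task_struct *g, *task;
	bool complete = true;

	read_lock(&tasklist_lock);
	for_each_process_thread(g, task) {
		/* Placeholder: does the stack check and flips on success. */
		if (!klp_try_switch_task(task))
			complete = false;
	}
	read_unlock(&tasklist_lock);

	/* Retry the stragglers again in a second. */
	if (!complete)
		schedule_delayed_work(&klp_transition_work, HZ);
}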

The next line of attack is patching tasks when exiting the kernel to
user space (system calls, interrupts, signals), to catch all CPU-bound
and some I/O-bound tasks. That's done in patch 9 [1] of the consistency
model patch set.

As a last resort, if there are still any tasks which are sleeping on a
to-be-patched function, the user can send them SIGSTOP and SIGCONT to
force them to be patched.

If none of those work, the user has the option of either canceling the
patch or just waiting to see if the patching process eventually
completes.

> > > I'm further thinking we don't actually need 2 (or more) different means
> > > of live patching in the kernel. So you all had better sit down (again)
> > > and come up with something you all agree on.
> >
> > Yeah, I also _really_ want to avoid multiple consistency models.
> >
> > In fact, that's a big motivation behind my consistency model patch set.
> > It's heavily inspired by a suggestion from Vojtech [1]. It combines
> > kpatch (backtrace checking) with kGraft (per-thread consistency). It
> > has several advantages over either of the individual approaches. See
> > http://lwn.net/Articles/632582/ where I describe its pros over both
> > kpatch and kGraft.
> >
> > Jiri, would you and Vojtech agree that the proposed consistency model is
> > all we need? Or do you still want to do the multiple consistency model
> > thing?
>
> Skimmed that thread; you all mostly seem to agree that one would be good
> but not quite agree on which one.

Well, I think we at least agreed that LEAVE_PATCHED_SET + SWITCH_THREAD
is the most interesting combination. That's what my patches implement.
Let's wait to see what Jiri and Vojtech say.

> And I note, not all seem to require this stack walking stuff.

True. But without stack walking, you have bigger problems:

1. You have to signal _all_ sleeping tasks to force them to be patched.
That's much more disruptive to the system.

2. There's no way to convert kthreads without either:

a) hacking up all the various kthread loops to call
klp_update_task_universe(); or

b) converting all kthreads' code to be freezeable. Then we could
freeze all tasks to patch them. Not only would it be a lot of
work to make all kthreads freezable, but it would encode within
the kernel the dangerous and non-obvious assumption that the
freeze point is a "safe" place for patching.


[1] https://lkml.org/lkml/2015/2/9/478


--
Josh

2015-02-19 00:21:09

by Peter Zijlstra

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote:

> > > a) spend the time to ensure the unwinding code is correct and resilient
> > > to errors;
> > >
> > > b) leave the consistency model compiled code out if !FRAME_POINTER and
> > > allow users to patch without one (similar to the livepatch code
> > > that's already in the livepatch tree today); or
> >
> > Which has a much more limited set of patches it can do, right?
>
> If we're talking security patches, roughly 9 out of 10 of them don't
> change function prototypes, data, or data semantics. If the patch
> author is careful, he or she doesn't need a consistency model in those
> cases.
>
> So it's not _overly_ limited. If somebody's not happy about those
> limitations, they can put in the work for option a :-)

OK. So all this is really only about really funny bits.

> > So uhm, what happens if your target task is running? When will you
> > retry? The problem I see is that if you do a sample approach you might
> > never hit an opportune moment.
>
> We attack it from multiple angles.
>
> First we check the stack of all sleeping tasks. That patches the
> majority of tasks immediately. If necessary, we also do that
> periodically in a workqueue to catch any stragglers.

So not only do you need an 'atomic' stack save, you need to analyze and
flip its state in the same atomic region. The task cannot start running
again after the save and start using old functions while you analyze the
stack and flip it.

> The next line of attack is patching tasks when exiting the kernel to
> user space (system calls, interrupts, signals), to catch all CPU-bound
> and some I/O-bound tasks. That's done in patch 9 [1] of the consistency
> model patch set.

So the HPC people are really into userspace that does for (;;) ; and
isolate that on CPUs and have the tick interrupt stopped and all that.

You'll not catch those threads on the sysexit path.

And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either.

Now it's fairly easy to also handle this; just mark those tasks with a
_TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to
go away, then flip their state and clear the flag.

> As a last resort, if there are still any tasks which are sleeping on a
> to-be-patched function, the user can send them SIGSTOP and SIGCONT to
> force them to be patched.

You typically cannot SIGSTOP/SIGCONT kernel threads. Also
TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.

Bit pesky that.. needs pondering.

2015-02-19 04:18:30

by Josh Poimboeuf

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, Feb 19, 2015 at 01:20:58AM +0100, Peter Zijlstra wrote:
> On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote:
> > > So uhm, what happens if your target task is running? When will you
> > > retry? The problem I see is that if you do a sample approach you might
> > > never hit an opportune moment.
> >
> > We attack it from multiple angles.
> >
> > First we check the stack of all sleeping tasks. That patches the
> > majority of tasks immediately. If necessary, we also do that
> > periodically in a workqueue to catch any stragglers.
>
> So not only do you need an 'atomic' stack save, you need to analyze and
> flip its state in the same atomic region. The task cannot start running
> again after the save and start using old functions while you analyze the
> stack and flip it.

Yes, exactly.

> > The next line of attack is patching tasks when exiting the kernel to
> > user space (system calls, interrupts, signals), to catch all CPU-bound
> > and some I/O-bound tasks. That's done in patch 9 [1] of the consistency
> > model patch set.
>
> So the HPC people are really into userspace that does for (;;) ; and
> isolate that on CPUs and have the tick interrupt stopped and all that.
>
> You'll not catch those threads on the sysexit path.
>
> And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either.
>
> Now it's fairly easy to also handle this; just mark those tasks with a
> _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to
> go away, then flip their state and clear the flag.

I guess you mean patch the task when it makes a syscall? I'm doing that
already on syscall exit with a bit in _TIF_ALLWORK_MASK and
_TIF_DO_NOTIFY_MASK.
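
Conceptually, that hook boils down to the following sketch (the flag name is a
placeholder and the helper's exact signature is an assumption):

	/* Sketch: in the exit-to-userspace slow path. */
	if (test_thread_flag(TIF_KLP_NEED_UPDATE)) {
		clear_thread_flag(TIF_KLP_NEED_UPDATE);
		klp_update_task_universe(current);
	}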

> > As a last resort, if there are still any tasks which are sleeping on a
> > to-be-patched function, the user can send them SIGSTOP and SIGCONT to
> > force them to be patched.
>
> You typically cannot SIGSTOP/SIGCONT kernel threads. Also
> TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.
>
> Bit pesky that.. needs pondering.

I did have a scheme for patching kthreads which are sleeping on
to-be-patched functions.

But now I'm thinking that kthreads will almost never be a problem. Most
kthreads are basically this:

void thread_fn()
{
	while (1) {
		/* do some stuff */

		schedule();

		/* do other stuff */
	}
}

So a kthread would typically only fail the stack check if we're trying
to patch either schedule() or the top-level thread_fn.

Patching thread_fn wouldn't be possible unless we killed the thread.

And I'd guess we can probably live without being able to patch
schedule() for now :-)

--
Josh

2015-02-19 10:16:17

by Peter Zijlstra

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Wed, Feb 18, 2015 at 10:17:53PM -0600, Josh Poimboeuf wrote:
> On Thu, Feb 19, 2015 at 01:20:58AM +0100, Peter Zijlstra wrote:
> > On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote:

> > > The next line of attack is patching tasks when exiting the kernel to
> > > user space (system calls, interrupts, signals), to catch all CPU-bound
> > > and some I/O-bound tasks. That's done in patch 9 [1] of the consistency
> > > model patch set.
> >
> > So the HPC people are really into userspace that does for (;;) ; and
> > isolate that on CPUs and have the tick interrupt stopped and all that.
> >
> > You'll not catch those threads on the sysexit path.
> >
> > And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either.
> >
> > Now it's fairly easy to also handle this; just mark those tasks with a
> > _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to
> > go away, then flip their state and clear the flag.
>
> I guess you mean patch the task when it makes a syscall? I'm doing that
> already on syscall exit with a bit in _TIF_ALLWORK_MASK and
> _TIF_DO_NOTIFY_MASK.

No, these tasks will _never_ make syscalls. So you need to guarantee
they don't accidentally enter the kernel while you flip them. Something
like so should do.

You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
them then clear TIF_ENTER_WAIT.

---
arch/x86/include/asm/thread_info.h | 4 +++-
arch/x86/kernel/entry_64.S | 2 ++
2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e82e95abc92b..baa836f13536 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -90,6 +90,7 @@ struct thread_info {
#define TIF_SYSCALL_TRACEPOINT 28 /* syscall tracepoint instrumentation */
#define TIF_ADDR32 29 /* 32-bit address space on 64 bits */
#define TIF_X32 30 /* 32-bit native x86-64 binary */
+#define TIF_ENTER_WAIT 31

#define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
@@ -113,12 +114,13 @@ struct thread_info {
#define _TIF_SYSCALL_TRACEPOINT (1 << TIF_SYSCALL_TRACEPOINT)
#define _TIF_ADDR32 (1 << TIF_ADDR32)
#define _TIF_X32 (1 << TIF_X32)
+#define _TIF_ENTER_WAIT (1 << TIF_ENTER_WAIT)

/* work to do in syscall_trace_enter() */
#define _TIF_WORK_SYSCALL_ENTRY \
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT | \
_TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT | \
- _TIF_NOHZ)
+ _TIF_NOHZ | _TIF_ENTER_WAIT)

/* work to do in syscall_trace_leave() */
#define _TIF_WORK_SYSCALL_EXIT \
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index db13655c3a2a..735566b35903 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -387,6 +387,8 @@ GLOBAL(system_call_after_swapgs)

/* Do syscall tracing */
tracesys:
+	andl $_TIF_ENTER_WAIT,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	jnz tracesys;
leaq -REST_SKIP(%rsp), %rdi
movq $AUDIT_ARCH_X86_64, %rsi
call syscall_trace_enter_phase1
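
The setter side isn't shown in the diff; conceptually it would be something
like the sketch below, where klp_task_in_userspace() is exactly the missing
piece (how to test that reliably is discussed further down) and
klp_switch_task_universe() is the per-task flip:

	/* Sketch: flip a CPU-bound userspace task without disturbing it. */
	set_tsk_thread_flag(task, TIF_ENTER_WAIT);

	/*
	 * While TIF_ENTER_WAIT is set, the syscall slow path above spins,
	 * so if the task is still purely in userspace it cannot start
	 * running kernel code underneath us.
	 */
	if (klp_task_in_userspace(task))
		klp_switch_task_universe(task);

	clear_tsk_thread_flag(task, TIF_ENTER_WAIT);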

> > > As a last resort, if there are still any tasks which are sleeping on a
> > > to-be-patched function, the user can send them SIGSTOP and SIGCONT to
> > > force them to be patched.
> >
> > You typically cannot SIGSTOP/SIGCONT kernel threads. Also
> > TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.
> >
> > Bit pesky that.. needs pondering.

I still absolutely hate you need to disturb userspace like that. Signals
are quite visible and perturb userspace state.

Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for
what they thought was a good reason. You'd wreck their state.

> But now I'm thinking that kthreads will almost never be a problem. Most
> kthreads are basically this:

You guys are way too optimistic; maybe it's because I've worked on
realtime stuff too much, but I'm always looking at worst cases. If you
can't handle those, I feel you might as well not bother :-)

> Patching thread_fn wouldn't be possible unless we killed the thread.

It is, see kthread_park().

2015-02-19 16:24:57

by Josh Poimboeuf

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, Feb 19, 2015 at 11:16:07AM +0100, Peter Zijlstra wrote:
> On Wed, Feb 18, 2015 at 10:17:53PM -0600, Josh Poimboeuf wrote:
> > On Thu, Feb 19, 2015 at 01:20:58AM +0100, Peter Zijlstra wrote:
> > > On Wed, Feb 18, 2015 at 11:12:56AM -0600, Josh Poimboeuf wrote:
>
> > > > The next line of attack is patching tasks when exiting the kernel to
> > > > user space (system calls, interrupts, signals), to catch all CPU-bound
> > > > and some I/O-bound tasks. That's done in patch 9 [1] of the consistency
> > > > model patch set.
> > >
> > > So the HPC people are really into userspace that does for (;;) ; and
> > > isolate that on CPUs and have the tick interrupt stopped and all that.
> > >
> > > You'll not catch those threads on the sysexit path.
> > >
> > > And I'm fairly sure they'll not want to SIGSTOP/CONT their stuff either.
> > >
> > > Now it's fairly easy to also handle this; just mark those tasks with a
> > > _TIF_WORK_SYSCALL_ENTRY flag, have that slowpath wait for the flag to
> > > go away, then flip their state and clear the flag.
> >
> > I guess you mean patch the task when it makes a syscall? I'm doing that
> > already on syscall exit with a bit in _TIF_ALLWORK_MASK and
> > _TIF_DO_NOTIFY_MASK.
>
> No, these tasks will _never_ make syscalls. So you need to guarantee
> they don't accidentally enter the kernel while you flip them. Something
> like so should do.
>
> You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> them then clear TIF_ENTER_WAIT.

Ah, that's a good idea. But how do we check if they're in user space?

> > > > As a last resort, if there are still any tasks which are sleeping on a
> > > > to-be-patched function, the user can send them SIGSTOP and SIGCONT to
> > > > force them to be patched.
> > >
> > > You typically cannot SIGSTOP/SIGCONT kernel threads. Also
> > > TASK_UNINTERRUPTIBLE sleeps are unaffected by signals.
> > >
> > > Bit pesky that.. needs pondering.
>
> I still absolutely hate you need to disturb userspace like that. Signals
> are quite visible and perturb userspace state.

Yeah, SIGSTOP on a sleeping task can be intrusive to user space if it
results in EINTR being returned from a system call. It's not ideal, but
it's much less intrusive than something like suspend.

But anyway we leave it up to the user to decide whether they want to
take that risk, or wait for the task to wake up on its own, or cancel
the patch.

> Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for
> what they thought was a good reason. You'd wreck their state.

Hm, that's a good point. Maybe use the freezer instead of signals?

(Again this would only be for those user tasks which are sleeping on a
patched function)

> > But now I'm thinking that kthreads will almost never be a problem. Most
> > kthreads are basically this:
>
> You guys are way too optimistic; maybe it's because I've worked on
> realtime stuff too much, but I'm always looking at worst cases. If you
> can't handle those, I feel you might as well not bother :-)

Well, I think we're already resigned to the fact that live patching
won't work for every patch, every time. And that the patch author must
know what they're doing, and must do it carefully.

Our target worst case is that the patching fails gracefully and the user
can't patch their system with that particular patch. Or that the system
remains in a partially patched state forever, if the user is ok with
that.

> > Patching thread_fn wouldn't be possible unless we killed the thread.
>
> It is, see kthread_park().

When the kthread returns from kthread_parkme(), it'll still be running
the old thread_fn code, regardless of whether we flipped the task's
patch state.

--
Josh

2015-02-19 16:34:03

by Vojtech Pavlik

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:

> > No, these tasks will _never_ make syscalls. So you need to guarantee
> > they don't accidentally enter the kernel while you flip them. Something
> > like so should do.
> >
> > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > them then clear TIF_ENTER_WAIT.
>
> Ah, that's a good idea. But how do we check if they're in user space?

I don't see the benefit in holding them in a loop - you can just as well
flip them from the syscall code as kGraft does.

> > I still absolutely hate you need to disturb userspace like that. Signals
> > are quite visible and perturb userspace state.
>
> Yeah, SIGSTOP on a sleeping task can be intrusive to user space if it
> results in EINTR being returned from a system call. It's not ideal, but
> it's much less intrusive than something like suspend.
>
> But anyway we leave it up to the user to decide whether they want to
> take that risk, or wait for the task to wake up on its own, or cancel
> the patch.
>
> > Also, you cannot SIGCONT a task that was SIGSTOP'ed by userspace for
> > what they thought was a good reason. You'd wreck their state.
>
> Hm, that's a good point. Maybe use the freezer instead of signals?
>
> (Again this would only be for those user tasks which are sleeping on a
> patched function)
>
> > > But now I'm thinking that kthreads will almost never be a problem. Most
> > > kthreads are basically this:
> >
> > You guys are way too optimistic; maybe it's because I've worked on
> > realtime stuff too much, but I'm always looking at worst cases. If you
> > can't handle those, I feel you might as well not bother :-)
>
> Well, I think we're already resigned to the fact that live patching
> won't work for every patch, every time. And that the patch author must
> know what they're doing, and must do it carefully.
>
> Our target worst case is that the patching fails gracefully and the user
> can't patch their system with that particular patch. Or that the system
> remains in a partially patched state forever, if the user is ok with
> that.
>
> > > Patching thread_fn wouldn't be possible unless we killed the thread.
> >
> > It is, see kthread_park().
>
> When the kthread returns from kthread_parkme(), it'll still be running
> the old thread_fn code, regardless of whether we flipped the task's
> patch state.

We could instrument kthread_parkme() to be able to return to a different
function, but it'd be a bit crazy indeed.

--
Vojtech Pavlik
Director SUSE Labs

2015-02-19 17:04:25

by Josh Poimboeuf

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
>
> > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > they don't accidentally enter the kernel while you flip them. Something
> > > like so should do.
> > >
> > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > them then clear TIF_ENTER_WAIT.
> >
> > Ah, that's a good idea. But how do we check if they're in user space?
>
> I don't see the benefit in holding them in a loop - you can just as well
> flip them from the syscall code as kGraft does.

But we were talking specifically about HPC tasks which never make
syscalls.

--
Josh

2015-02-19 17:08:28

by Jiri Kosina

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, 19 Feb 2015, Josh Poimboeuf wrote:

> > > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > > they don't accidentally enter the kernel while you flip them. Something
> > > > like so should do.
> > > >
> > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > > them then clear TIF_ENTER_WAIT.
> > >
> > > Ah, that's a good idea. But how do we check if they're in user space?
> >
> > I don't see the benefit in holding them in a loop - you can just as well
> > flip them from the syscall code as kGraft does.
>
> But we were talking specifically about HPC tasks which never make
> syscalls.

Yeah. And getting an answer to "is this task_struct currently running in
userspace?" is not easy in a non-disruptive way.

Piggy-backing on CONFIG_CONTEXT_TRACKING is one option, but I guess the
number of systems that have this option turned on will be rather limited
(especially in HPC space).

Sending IPIs around to check whether the task is running in user-space or
kernel-space would work, but it's disruptive to the running task, which
might be unwanted as well.

--
Jiri Kosina
SUSE Labs

2015-02-19 17:19:32

by Vojtech Pavlik

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> >
> > > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > > they don't accidentally enter the kernel while you flip them. Something
> > > > like so should do.
> > > >
> > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > > them then clear TIF_ENTER_WAIT.
> > >
> > > Ah, that's a good idea. But how do we check if they're in user space?
> >
> > I don't see the benefit in holding them in a loop - you can just as well
> > flip them from the syscall code as kGraft does.
>
> But we were talking specifically about HPC tasks which never make
> syscalls.

Yes. I'm saying that rather than guaranteeing they don't enter the
kernel (by having them spin) you can flip them in case they try to do
that instead. That solves the race condition just as well.

--
Vojtech Pavlik
Director SUSE Labs

2015-02-19 17:33:30

by Josh Poimboeuf

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote:
> On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> > On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> > >
> > > > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > > > they don't accidentally enter the kernel while you flip them. Something
> > > > > like so should do.
> > > > >
> > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > > > them then clear TIF_ENTER_WAIT.
> > > >
> > > > Ah, that's a good idea. But how do we check if they're in user space?
> > >
> > > I don't see the benefit in holding them in a loop - you can just as well
> > > flip them from the syscall code as kGraft does.
> >
> > But we were talking specifically about HPC tasks which never make
> > syscalls.
>
> Yes. I'm saying that rather than guaranteeing they don't enter the
> kernel (by having them spin) you can flip them in case they try to do
> that instead. That solves the race condition just as well.

Ok, gotcha.

We'd still need a safe way to check if they're in user space though.

How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set
right at the border. Then for running tasks it's as simple as:

	if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
		klp_switch_task_universe(task);

--
Josh

2015-02-19 17:48:24

by Vojtech Pavlik

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On February 19, 2015 6:32:55 PM CET, Josh Poimboeuf <[email protected]> wrote:

>> Yes. I'm saying that rather than guaranteeing they don't enter the
>> kernel (by having them spin) you can flip them in case they try to do
>> that instead. That solves the race condition just as well.
>
>Ok, gotcha.
>
>We'd still need a safe way to check if they're in user space though.
>
>How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set
>right at the border. Then for running tasks it's as simple as:
>
>if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
> klp_switch_task_universe(task);

The s390x arch used to have a TIF_SYSCALL, which was doing exactly that (well, negated). I think it would work well, but it isn't entirely for free: two instructions per syscall and an extra TIF bit, which are (near) exhausted on some archs.

--
Vojtech Pavlik
Director SuSE Labs

2015-02-19 20:40:45

by Vojtech Pavlik

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, Feb 19, 2015 at 11:32:55AM -0600, Josh Poimboeuf wrote:
> On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote:
> > On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> > > On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > > > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> > > >
> > > > > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > > > > they don't accidentally enter the kernel while you flip them. Something
> > > > > > like so should do.
> > > > > >
> > > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > > > > them then clear TIF_ENTER_WAIT.
> > > > >
> > > > > Ah, that's a good idea. But how do we check if they're in user space?
> > > >
> > > > I don't see the benefit in holding them in a loop - you can just as well
> > > > flip them from the syscall code as kGraft does.
> > >
> > > But we were talking specifically about HPC tasks which never make
> > > syscalls.
> >
> > Yes. I'm saying that rather than guaranteeing they don't enter the
> > kernel (by having them spin) you can flip them in case they try to do
> > that instead. That solves the race condition just as well.
>
> Ok, gotcha.
>
> We'd still need a safe way to check if they're in user space though.

Having a safe way would be very nice and actually quite useful in other
cases, too.

For this specific purpose, however, we don't need a very safe way,
though. We don't require atomicity in any way, we don't mind even if it
creates false negatives, only false positives would be bad.

kGraft looks at the stacktrace of CPU hogs and if it finds no kernel
addresses there, it assumes userspace. Not very nice, but does the job.
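
Roughly, that check amounts to something like the following sketch (not the
actual kGraft code):

/*
 * Sketch: guess whether a CPU hog is currently executing in userspace by
 * checking that a best-effort stack dump finds no kernel return addresses.
 */
static bool task_seems_to_be_in_userspace(struct task_struct *task)
{
	unsigned long entries[8];
	struct stack_trace trace = {
		.entries	= entries,
		.max_entries	= ARRAY_SIZE(entries),
	};

	save_stack_trace_tsk(task, &trace);

	/* A task sleeping or running in the kernel yields at least two entries. */
	return trace.nr_entries < 2;
}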

> How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set
> right at the border. Then for running tasks it's as simple as:
>
> if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
> klp_switch_task_universe(task);

--
Vojtech Pavlik
Director SUSE Labs

2015-02-19 21:26:13

by Jiri Kosina

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, 19 Feb 2015, Josh Poimboeuf wrote:

> How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set
> right at the border. Then for running tasks it's as simple as:
>
> if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
> klp_switch_task_universe(task);

That's in principle what CONTEXT_TRACKING is doing, i.e. the condition
we'd be interested in would be

__this_cpu_read(context_tracking.state) == IN_USER

But it has overhead.

--
Jiri Kosina
SUSE Labs

2015-02-19 21:38:40

by Jiri Kosina

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, 19 Feb 2015, Jiri Kosina wrote:

> > How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set
> > right at the border. Then for running tasks it's as simple as:
> >
> > if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
> > klp_switch_task_universe(task);
>
> That's in principle what CONTEXT_TRACKING is doing, i.e. the condition
> we'd be interested in would be
>
> __this_cpu_read(context_tracking.state) == IN_USER

Well, more precisely

per_cpu(context_tracking.state, cpu) == IN_USER

of course.

--
Jiri Kosina
SUSE Labs

2015-02-19 21:43:01

by Josh Poimboeuf

Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, Feb 19, 2015 at 09:40:36PM +0100, Vojtech Pavlik wrote:
> On Thu, Feb 19, 2015 at 11:32:55AM -0600, Josh Poimboeuf wrote:
> > On Thu, Feb 19, 2015 at 06:19:29PM +0100, Vojtech Pavlik wrote:
> > > On Thu, Feb 19, 2015 at 11:03:53AM -0600, Josh Poimboeuf wrote:
> > > > On Thu, Feb 19, 2015 at 05:33:59PM +0100, Vojtech Pavlik wrote:
> > > > > On Thu, Feb 19, 2015 at 10:24:29AM -0600, Josh Poimboeuf wrote:
> > > > >
> > > > > > > No, these tasks will _never_ make syscalls. So you need to guarantee
> > > > > > > they don't accidentally enter the kernel while you flip them. Something
> > > > > > > like so should do.
> > > > > > >
> > > > > > > You set TIF_ENTER_WAIT on them, check they're still in userspace, flip
> > > > > > > them then clear TIF_ENTER_WAIT.
> > > > > >
> > > > > > Ah, that's a good idea. But how do we check if they're in user space?
> > > > >
> > > > > I don't see the benefit in holding them in a loop - you can just as well
> > > > > flip them from the syscall code as kGraft does.
> > > >
> > > > But we were talking specifically about HPC tasks which never make
> > > > syscalls.
> > >
> > > Yes. I'm saying that rather than guaranteeing they don't enter the
> > > kernel (by having them spin) you can flip them in case they try to do
> > > that instead. That solves the race condition just as well.
> >
> > Ok, gotcha.
> >
> > We'd still need a safe way to check if they're in user space though.
>
> Having a safe way would be very nice and actually quite useful in other
> cases, too.
>
> For this specific purpose, however, we don't need a very safe way,
> though. We don't require atomicity in any way, we don't mind even if it
> creates false negatives, only false positives would be bad.
>
> kGraft looks at the stacktrace of CPU hogs and if it finds no kernel
> addresses there, it assumes userspace. Not very nice, but does the job.

So I've looked at kgr_needs_lazy_migration(), but I still have no idea
how it works.

First of all, I think reading the stack while it's being written to could
give you some garbage values, and a completely wrong nr_entries value
from save_stack_trace_tsk().

But also, how would you walk a stack without knowing its stack pointer?
That function relies on the saved stack pointer in
task_struct.thread.sp, which, AFAICT, was last saved during the last
call to schedule(). Since then, the stack could have been completely
rewritten, with different size stack frames, before the task exited the
kernel.

Am I missing something?

--
Josh

2015-02-19 23:12:14

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, Feb 19, 2015 at 10:26:09PM +0100, Jiri Kosina wrote:
> On Thu, 19 Feb 2015, Josh Poimboeuf wrote:
>
> > How about with a TIF_IN_USERSPACE thread flag? It could be cleared/set
> > right at the border. Then for running tasks it's as simple as:
> >
> > if (test_tsk_thread_flag(task, TIF_IN_USERSPACE))
> > klp_switch_task_universe(task);
>
> That's in principle what CONTEXT_TRACKING is doing, i.e. the condition
> we'd be interested in would be
>
> __this_cpu_read(context_tracking.state) == IN_USER
>
> But it has overhead.

Yeah, that does seem to pretty much do what we want. Unfortunately it
has a much higher overhead than just setting a thread flag.

And from the Kconfig description for CONTEXT_TRACKING_FORCE, which would
enable it on all CPUs during boot, "this option brings an overhead that
you don't want in production."

Maybe that code needs a rewrite to rely on a thread flag instead. Then
we could use it too.

--
Josh

2015-02-20 07:46:56

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Thu, 19 Feb 2015, Josh Poimboeuf wrote:

> So I've looked at kgr_needs_lazy_migration(), but I still have no idea
> how it works.
>
> First of all, I think reading the stack while its being written to could
> give you some garbage values, and a completely wrong nr_entries value
> from save_stack_trace_tsk().

I believe we've already been discussing this some time ago ...

I agree that this is a very crude optimization that should probably be
either removed (which would only cause slower convergence in the presence
of CPU-bound tasks), or rewritten to perform IPI-based stack dumping
(probably on an opt-in, configurable basis).

Reading garbage values could only happen if the task were running in
kernel space; nr_entries would then be at least 2.

But I agree that relying on this very specific behavior is not really
safe in general in case someone changes the stack dumping implementation
in the future in an unpredictable way.
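
For reference, a rough sketch of the heuristic described above; the real
kgr_needs_lazy_migration() lives in the out-of-tree kGraft patches, so the
name and details below are illustrative only:

#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/stacktrace.h>

/*
 * Illustrative only: a CPU-bound task whose saved trace contains no
 * kernel return addresses (fewer than two entries, counting the
 * ULONG_MAX end marker) is assumed to be running in user space.
 */
static bool task_looks_like_userspace(struct task_struct *p)
{
	unsigned long entries[4];
	struct stack_trace trace = {
		.max_entries	= ARRAY_SIZE(entries),
		.entries	= entries,
	};

	save_stack_trace_tsk(p, &trace);

	return trace.nr_entries < 2;
}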

> But also, how would you walk a stack without knowing its stack pointer?
> That function relies on the saved stack pointer in
> task_struct.thread.sp, which, AFAICT, was last saved during the last
> call to schedule(). Since then, the stack could have been completely
> rewritten, with different size stack frames, before the task exited the
> kernel.

Same argument holds here as well, I believe.

Thanks,

--
Jiri Kosina
SUSE Labs

2015-02-20 08:49:36

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 1/3] sched: add sched_task_call()

Alright, so to sum it up:

- current stack dumping (even looking at /proc/<pid>/stack) is not
guaranteed to yield "correct" results in case the task is running at the
time the stack is being examined

- the only fool-proof way is to send IPI-NMI to all CPUs, and synchronize
the handlers between each other (to make sure that reschedule doesn't
happen in between on some CPU and another task doesn't start running in
the interim).
The NMI handler dumps its current stack in case it's running in context
of the process whose stack is to be dumped. Otherwise, one of the NMI
handlers looks up the required task_struct, and dumps it if it's not
running on any CPU

- For the live patching use-case, the stack has to be analyzed (and the
decision on what to do made based on that analysis) in the NMI handler itself,
otherwise it gets racy again

Converting /proc/<pid>/stack to this mechanism seems like a correct thing
to do in any case, as it's a slow path anyway.

The original intent seemed to have been to make this a fast path for the
live patching case, but that's probably not possible, so it seems like the
price that will have to be paid for being able to finish live-patching of
CPU-bound processes is the cost of an IPI-NMI broadcast.

Agreed?

Thanks,

--
Jiri Kosina
SUSE Labs

2015-02-20 09:32:47

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 2/3] stacktrace: add save_stack_trace_tsk_safe()

On Tue, 17 Feb 2015, Andrew Morton wrote:

> > It isn't possible to get the stack of a running task (other than
> > current) because we don't have the stack pointer and the stack can be
> > inconsistent anyway. Add a safe stack saving facility which only saves
> > the stack of the task if it's sleeping or if it's the current task.
>
> Can you send the task an IPI and get it to save its stack for you?
>
> There's probably some (messy, arch-dependent) way of trimming away the
> interrupt-related stuff off the stack, if that's really needed.

I am afraid that you need to send a broadcast IPI to all CPUs and only then
dump the stack, to avoid races with rescheduling and make it really consistent.

Whether that is a reasonable price to pay for a consistent stack dump of a
single process is of course a question.
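
For context, a minimal sketch of the per-task IPI idea quoted above; it is
illustrative only and, as noted, does not by itself close the race with
rescheduling:

#include <linux/sched.h>
#include <linux/smp.h>
#include <linux/stacktrace.h>

/* Runs on the target CPU: save that CPU's current stack. */
static void save_trace_ipi(void *info)
{
	save_stack_trace(info);
}

/*
 * Illustrative only: ask the CPU the task last ran on to save its own
 * stack, so we never walk a stack that is being rewritten under us.
 * Only valid if the task is still running on that CPU when the IPI
 * arrives, which is exactly the race being discussed.
 */
static void save_stack_trace_remote(struct task_struct *p,
				    struct stack_trace *trace)
{
	smp_call_function_single(task_cpu(p), save_trace_ipi, trace, 1);
}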

--
Jiri Kosina
SUSE Labs

2015-02-20 09:50:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/3] sched: add sched_task_call()


* Jiri Kosina <[email protected]> wrote:

> Alright, so to sum it up:
>
> - current stack dumping (even looking at /proc/<pid>/stack) is not
> guaranteed to yield "correct" results in case the task is running at the
> time the stack is being examined

Don't even _think_ about trying to base something as
dangerous as live patching the kernel image on the concept
of:

'We can make all stack backtraces reliably correct all
the time, with no false positives, with no false
negatives, 100% of the time, and quickly discover and
fix bugs in that'.

It's not going to happen:

- The correctness of stacktraces partially depends on
tooling and we don't control those.

- More importantly, there's no strong force that ensures
we can rely on stack backtraces: correcting bad stack
traces depends on people hitting those functions and
situations that generate them, seeing a bad stack trace,
noticing that it's weird and correcting whatever code or
tooling quirk causes the stack entry to be incorrect.

Essentially unlike other kernel code which breaks stuff if
it's incorrect, there's no _functional_ dependence on stack
traces, so live patching would be the first (and pretty
much only) thing that breaks on bad stack traces ...

If you think you can make something like dwarf annotations
work reliably to base kernel live patching on that,
reconsider.

Even with frame pointers, backtraces can go bad sometimes, I
wouldn't base live patching even on _that_, and that's a
very simple concept with a performance cost that most
distros don't want to pay.

So if your design is based on being able to discover 'live'
functions in the kernel stack dump of all tasks in the
system, I think you need a serious reboot of the whole
approach and get rid of that fragility before any of that
functionality gets upstream!

> - For live patching use-case, the stack has to be
> analyzed (and decision on what to do based on the
> analysis) in the NMI handler itself, otherwise it gets
> racy again

You simply cannot reliably determine from the kernel stack
whether a function is used by a task or not, and actually
modify the kernel image, from a stack backtrace, as things
stand today. Full stop.

Thanks,

Ingo

2015-02-20 10:02:41

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Fri, 20 Feb 2015, Ingo Molnar wrote:

> So if your design is based on being able to discover 'live' functions in
> the kernel stack dump of all tasks in the system, I think you need a
> serious reboot of the whole approach and get rid of that fragility
> before any of that functionality gets upstream!

So let me repeat again, just to make sure that no more confusion is being
spread around -- there are approaches which do rely on stack contents, and
approaches which don't. kpatch (the Red Hat solution) and ksplice (the
Oracle solution) contain stack analysis as a conceptual design step,
kgraft (the SUSE solution) doesn't.

Also the limited (in features -- consistency models) common infrastructure
that is currently upstream doesn't care about stack contents.

We are now figuring out which consistency models make sense for upstream,
and how they can be achieved. Josh's patchset is one of the approaches that
is currently being discussed, but it's not the only available option.

Thanks,

--
Jiri Kosina
SUSE Labs

2015-02-20 10:44:25

by Ingo Molnar

[permalink] [raw]
Subject: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())


* Jiri Kosina <[email protected]> wrote:

> On Fri, 20 Feb 2015, Ingo Molnar wrote:
>
> > So if your design is based on being able to discover
>
> > 'live' functions in the kernel stack dump of all tasks
> > in the system, I think you need a serious reboot of the
> > whole approach and get rid of that fragility before any
> > of that functionality gets upstream!
>
> So let me repeat again, just to make sure that no more
> confusion is being spread around -- there are aproaches
> which do rely on stack contents, and aproaches which
> don't. kpatch (the Red Hat solution) and ksplice (the
> Oracle solution) contains stack analysis as a conceptual
> design step, kgraft (the SUSE solution) doesn't.

So just to make my position really clear: any talk about
looking at the kernel stack for backtraces is just crazy
talk, considering how stack backtrace technology stands
today and in the reasonable near future!

With that out of the way, the only safe mechanism to live
patch the kernel (for sufficiently simple sets of changes
to singular functions) I'm aware of at the moment is:

- forcing all user space tasks out of kernel mode and
intercepting them in a safe state. I.e. making sure that
no kernel code is executed, no kernel stack state is
used (modulo code closely related to the live
patching mechanism and kernel threads in safe state,
lets ignore them for this argument)

There are two variants of this concept, which deal with the
timing of how user-space tasks are forced into user mode:

- the simple method: force all user-space tasks out of
kernel mode, stop the machine for a brief moment and be
done with the patching safely and essentially
atomically.

- the complicated method spread out over time: uses the
same essential mechanism plus the ftrace patching
machinery to detect whether all tasks have transitioned
through a version flip. [this is what kgraft does in
part.]

All fundamental pieces of the simple method are necessary
to get guaranteed time transition from the complicated
method: task tracking and transparent catching of them,
handling kthreads, etc.

My argument is that the simple method should be implemented
first and foremost.

Then people can do add-on features to possibly spread out
the new function versions in a more complicated way if they
want to avoid the stop-all-tasks transition - although I'm
not convinced about it: I'm sure many sysadmins would
like the bug patching to be over with quickly and not have
their systems in an intermediate state as kgraft does.

In any case, as per my arguments above, examining the
kernel stack is superfluous (so we won't be exposed to the
fragility of it either): there's no need to examine it and
writing such patches is misguided...

Thanks,

Ingo

2015-02-20 10:58:20

by Jiri Kosina

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

On Fri, 20 Feb 2015, Ingo Molnar wrote:

> - the complicated method spread out over time: uses the
> same essential mechanism plus the ftrace patching
> machinery to detect whether all tasks have transitioned
> through a version flip. [this is what kgraft does in
> part.]

The only difference between this and what kgraft does is that alive-enough tasks
are not put into this kind of "out of kernel 'freezer'", but keep running.
Modifying kgraft to (optionally) add the synchronization barrier and then
flip the switch should be a rather trivial task, and can indeed be added
as a simple option for the patch author / sysadmin. However ...

> All fundamental pieces of the simple method are necessary to get
> guaranteed time transition from the complicated method: task tracking
> and transparent catching of them, handling kthreads, etc.
>
> My argument is that the simple method should be implemented first and
> foremost.
>
> Then people can do add-on features to possibly spread out the new
> function versions in a more complicated way if they want to avoid the
> stop-all-tasks transition - although I'm not convinced about it: I'm
> sure sure many sysadmins would like the bug patching to be over with
> quickly and not have their systems in an intermediate state like kgraft
> does it.

... the choice the sysadmins have here is either have the system running
in an intermediate state, or have the system completely dead for the *same
time*. Because to finish the transition successfully, all the tasks have
to be woken up in any case.

(please note that this is different from suspend/resume task freezing,
because we don't care about sleeping tasks there).

But I do get your point; what you are basically saying is that your
preference is what kgraft is doing, and an option to allow for a global
synchronization point should be added in parallel to the gradual lazy
migration.

> In any case, as per my arguments above, examining the kernel stack is
> superfluous (so we won't be exposed to the fragility of it either):
> there's no need to examine it and writing such patches is misguided...
>
> Thanks,
>
> Ingo
>

--
Jiri Kosina
SUSE Labs

2015-02-20 16:12:54

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Fri, Feb 20, 2015 at 10:50:03AM +0100, Ingo Molnar wrote:
>
> * Jiri Kosina <[email protected]> wrote:
>
> > Alright, so to sum it up:
> >
> > - current stack dumping (even looking at /proc/<pid>/stack) is not
> > guaranteed to yield "correct" results in case the task is running at the
> > time the stack is being examined
>
> Don't even _think_ about trying to base something as
> dangerous as live patching the kernel image on the concept
> of:
>
> 'We can make all stack backtraces reliably correct all
> the time, with no false positives, with no false
> negatives, 100% of the time, and quickly discover and
> fix bugs in that'.
>
> It's not going to happen:
>
> - The correctness of stacktraces partially depends on
> tooling and we don't control those.

What tooling are you referring to?

Sure, we rely on the compiler to produce the correct assembly for
putting frame pointers on the stack. But that's pretty straightforward.
The kernel already relies on the compiler to do plenty of things which
are much more complex than that.

> - More importantly, there's no strong force that ensures
> we can rely on stack backtraces: correcting bad stack
> traces depends on people hitting those functions and
> situations that generate them, seeing a bad stack trace,
> noticing that it's weird and correcting whatever code or
> tooling quirk causes the stack entry to be incorrect.
>
> Essentially unlike other kernel code which breaks stuff if
> it's incorrect, there's no _functional_ dependence on stack
> traces, so live patching would be the first (and pretty
> much only) thing that breaks on bad stack traces ...

First, there are several things we do to avoid anomalies:

- we don't patch asm functions
- we only walk the stack of sleeping tasks

Second, currently the stack unwinding code is rather crude and doesn't
do much in the way of error handling. There are several error
conditions we could easily check for programmatically:

- make sure it starts from a __schedule() call at the top (I've only
proposed walking the stacks of sleeping tasks)
- make sure we actually reach the bottom of the stack
- make sure each stack frame's return address is immediately after a
call instruction

If any of these checks fail, we can either delay the patching of the
task until later or we can cancel the patching process, with no harm
done. Either way we can WARN the user so that we get reports of these
anomalies.
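
As an illustration of the last check above, a minimal x86_64 sketch; the
helper is hypothetical and only handles the 5-byte 'call rel32' encoding,
so real code would also need to decode the indirect call forms:

#include <linux/types.h>
#include <linux/uaccess.h>

/*
 * Illustrative only: return true if 'addr' is plausibly a return
 * address, i.e. the five bytes before it decode as a direct near call.
 */
static bool ret_addr_follows_call(unsigned long addr)
{
	u8 opcode;

	if (probe_kernel_read(&opcode, (void *)(addr - 5), sizeof(opcode)))
		return false;

	return opcode == 0xe8;		/* call rel32 */
}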

> If you think you can make something like dwarf annotations
> work reliably to base kernel live patching on that,
> reconsider.

No, we would rely on CONFIG_FRAME_POINTER.

> Even with frame pointer backtraces can go bad sometimes, I
> wouldn't base live patching even on _that_,

Other than hand-coded assembly, can you provide more details about how
frame pointer stack traces can go bad?

> and that's a very simple concept with a performance cost that most
> distros don't want to pay.

Hm, CONFIG_FRAME_POINTER is enabled on all the distros I use.

> So if your design is based on being able to discover 'live'
> functions in the kernel stack dump of all tasks in the
> system, I think you need a serious reboot of the whole
> approach and get rid of that fragility before any of that
> functionality gets upstream!

Just to clarify a couple of things:

1. Despite the email subject, this discussion is really about another
RFC patch set [1]. It hasn't been merged, and there's still a lot of
ongoing discussion about it.

2. We don't dump the stack of _all_ tasks. Only sleeping ones.


[1] https://lkml.org/lkml/2015/2/9/475

--
Josh

2015-02-20 17:06:19

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Fri, Feb 20, 2015 at 09:49:32AM +0100, Jiri Kosina wrote:
> Alright, so to sum it up:
>
> - current stack dumping (even looking at /proc/<pid>/stack) is not
> guaranteed to yield "correct" results in case the task is running at the
> time the stack is being examined
>
> - the only fool-proof way is to send IPI-NMI to all CPUs, and synchronize
> the handlers between each other (to make sure that reschedule doesn't
> happen in between on some CPU and other task doesn't start running in
> the interim).
> The NMI handler dumps its current stack in case it's running in context
> of the process whose stack is to be dumped. Otherwise, one of the NMI
> handlers looks up the required task_struct, and dumps it if it's not
> running on any CPU
>
> - For live patching use-case, the stack has to be analyzed (and decision
> on what to do based on the analysis) in the NMI handler itself,
> otherwise it gets racy again
>
> Converting /proc/<pid>/stack to this mechanism seems like a correct thing
> to do in any case, as it's slow path anyway.
>
> The original intent seemed to have been to make this fast path for the
> live patching case, but that's probably not possible, so it seems like the
> price that will have to be paid for being able to finish live-patching of
> CPU-bound processess is the cost of IPI-NMI broadcast.

Hm, syncing IPIs among CPUs sounds pretty disruptive.

This is really two different issues, so I'll separate them:

1. /proc/pid/stack for running tasks

I haven't heard anybody demanding that /proc/<pid>/stack should actually
print the stack for running tasks. My suggestion was just that we avoid
the possibility of printing garbage.

Today's behavior for a running task is (usually):

# cat /proc/802/stack
[<ffffffffffffffff>] 0xffffffffffffffff

How about, when we detect a running task, just always show that?
That would give us today's behavior, except without occasionally
printing garbage, while avoiding all the overhead of syncing IPIs.
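
A crude sketch of that behaviour (illustrative only; the function name is
made up, and the running-task check ignores the race against the task
waking up in the middle of it):

#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/stacktrace.h>

/*
 * Illustrative only: for a task that is running and isn't current,
 * don't walk its stack at all -- just emit the end-of-trace marker,
 * which /proc/<pid>/stack shows as 0xffffffffffffffff.
 */
static void save_stack_trace_tsk_or_marker(struct task_struct *tsk,
					   struct stack_trace *trace)
{
	if (tsk != current && tsk->state == TASK_RUNNING) {
		if (trace->nr_entries < trace->max_entries)
			trace->entries[trace->nr_entries++] = ULONG_MAX;
		return;
	}

	save_stack_trace_tsk(tsk, trace);
}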

2. live patching of running tasks

I don't see why we would need to sync IPI's to patch CPU-bound
processes. Why not use context tracking or the TIF_USERSPACE flag like
I mentioned before?

--
Josh

2015-02-20 19:49:08

by Ingo Molnar

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())


* Jiri Kosina <[email protected]> wrote:

> > All fundamental pieces of the simple method are
> > necessary to get guaranteed time transition from the
> > complicated method: task tracking and transparent
> > catching of them, handling kthreads, etc.
> >
> > My argument is that the simple method should be
> > implemented first and foremost.
> >
> > Then people can do add-on features to possibly spread
> > out the new function versions in a more complicated way
> > if they want to avoid the stop-all-tasks transition -
> > although I'm not convinced about it: I'm sure sure many
> > sysadmins would like the bug patching to be over with
> > quickly and not have their systems in an intermediate
> > state like kgraft does it.
>
> ... the choice the sysadmins have here is either have the
> system running in an intermediate state, or have the
> system completely dead for the *same time*. Because to
> finish the transition successfully, all the tasks have to
> be woken up in any case.

That statement is false: an 'intermediate state' system
where 'new' tasks are still running is still running and
will interfere with the resolution of 'old' tasks.

> But I do get your point; what you are basically saying is
> that your preference is what kgraft is doing, and option
> to allow for a global synchronization point should be
> added in parallel to the gradual lazy migration.

I think you misunderstood: the 'simple' method I outlined
does not just 'synchronize', it actually executes the live
patching atomically, once all tasks are gathered and we
know they are _all_ in a safe state.

I.e. it's in essence the strong stop-all atomic patching
model of 'kpatch', combined with the reliable avoidance of
kernel stacks that 'kgraft' uses.

That should be the starting point, because it's the most
reliable method.

Thanks,

Ingo

2015-02-20 20:08:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/3] sched: add sched_task_call()


* Josh Poimboeuf <[email protected]> wrote:

> On Fri, Feb 20, 2015 at 10:50:03AM +0100, Ingo Molnar wrote:
> >
> > * Jiri Kosina <[email protected]> wrote:
> >
> > > Alright, so to sum it up:
> > >
> > > - current stack dumping (even looking at /proc/<pid>/stack) is not
> > > guaranteed to yield "correct" results in case the task is running at the
> > > time the stack is being examined
> >
> > Don't even _think_ about trying to base something as
> > dangerous as live patching the kernel image on the concept
> > of:
> >
> > 'We can make all stack backtraces reliably correct all
> > the time, with no false positives, with no false
> > negatives, 100% of the time, and quickly discover and
> > fix bugs in that'.
> >
> > It's not going to happen:
> >
> > - The correctness of stacktraces partially depends on
> > tooling and we don't control those.
>
> What tooling are you referring to?
>
> Sure, we rely on the compiler to produce the correct
> assembly for putting frame pointers on the stack. [...]

We also rely on all assembly code to have valid frames all
the time - and this is far from a given, we've had many
bugs in that area.

> [...] But that's pretty straightforward. The kernel
> already relies on the compiler to do plenty of things
> which are much more complex than that.

So this claim scares me because it's such nonsense:

We rely on the compiler to create _functional code_ for us.
If the compiler messes up then functionality, actual code
breaks.

'Valid stack frame' is not functional code, it's in essence
debug info. If a backtrace is wrong in some rare, racy case
then that won't break the kernel, it just gives slightly
misleading debug info in most cases.

Now you want 'debug info' to have functional
relevance, for something as intrusive as patching the
kernel code.

You really need to accept this distinction!

> > - More importantly, there's no strong force that ensures
> > we can rely on stack backtraces: correcting bad
> > stack traces depends on people hitting those
> > functions and situations that generate them, seeing
> > a bad stack trace, noticing that it's weird and
> > correcting whatever code or tooling quirk causes the
> > stack entry to be incorrect.
> >
> > Essentially unlike other kernel code which breaks stuff
> > if it's incorrect, there's no _functional_ dependence
> > on stack traces, so live patching would be the first
> > (and pretty much only) thing that breaks on bad stack
> > traces ...
>
> First, there are several things we do to avoid anomalies:
>
> - we don't patch asm functions

the problem with asm functions isn't that we might want to
patch them (although sometimes we might want to: especially
the more interesting security fixes tend to be in assembly
code), it's that asm functions can (easily) mess up debug
info, without that being apparent in any functional way.

> - we only walk the stack of sleeping tasks

sleeping tasks are affected by debug info bugs too.

> Second, currently the stack unwinding code is rather
> crude and doesn't do much in the way of error handling.

That's a big red flag in my book ...

Thanks,

Ingo

2015-02-20 21:22:57

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH 1/3] sched: add sched_task_call()

On Fri, Feb 20, 2015 at 09:08:49PM +0100, Ingo Molnar wrote:
> * Josh Poimboeuf <[email protected]> wrote:
> > On Fri, Feb 20, 2015 at 10:50:03AM +0100, Ingo Molnar wrote:
> > > * Jiri Kosina <[email protected]> wrote:
> > >
> > > > Alright, so to sum it up:
> > > >
> > > > - current stack dumping (even looking at /proc/<pid>/stack) is not
> > > > guaranteed to yield "correct" results in case the task is running at the
> > > > time the stack is being examined
> > >
> > > Don't even _think_ about trying to base something as
> > > dangerous as live patching the kernel image on the concept
> > > of:
> > >
> > > 'We can make all stack backtraces reliably correct all
> > > the time, with no false positives, with no false
> > > negatives, 100% of the time, and quickly discover and
> > > fix bugs in that'.
> > >
> > > It's not going to happen:
> > >
> > > - The correctness of stacktraces partially depends on
> > > tooling and we don't control those.
> >
> > What tooling are you referring to?
> >
> > Sure, we rely on the compiler to produce the correct
> > assembly for putting frame pointers on the stack. [...]
>
> We also rely on all assembly code to have valid frames all
> the time - and this is far from a given, we've had many
> bugs in that area.

Are you talking about kernel bugs or compiler bugs? Either way, would
those bugs not be caught by the stack unwinding validation improvements
I proposed?

> > [...] But that's pretty straightforward. The kernel
> > already relies on the compiler to do plenty of things
> > which are much more complex than that.
>
> So this claim scares me because it's such nonsense:
>
> We rely on the compiler to create _functional code_ for us.
> If the compiler messes up then functionality, actual code
> breaks.
>
> 'Valid stack frame' is not functional code, it's in essence
> debug info.

Only because we _treat_ it like debug info. I'm proposing that we treat
it like functional code. So that yes, if the compiler messes up frame
pointers, actual code breaks. (But we could detect that before we do
the patch, so it doesn't destabilize the system.)

If gcc can't do frame pointers right, it's a *major* gcc bug. It's
nothing more than:

push %rbp
mov %rsp,%rbp

at the beginning of the function, and:

pop %rbp

right before the function returns.

If you really think the compiler could possibly mess that up, I can
probably write a simple script, executed by the Makefile, which
validates that every C function in the kernel has the above pattern.

> If a backtrace is wrong in some rare, racy case
> then that won't break the kernel, it just gives slightly
> misleading debug info in most cases.
>
> Now you want to turn 'debug info' to have functional
> relevance, for something as intrusive as patching the
> kernel code.
>
> You really need to accept this distinction!

Yes, I already realize that this would be the first time the kernel
relies on stack traces being 100% correct. Peter and I already had some
discussion about it elsewhere in this thread.

Just because we haven't relied on its correctness in the past doesn't
mean we can't rely on it in the future.

> > > - More importantly, there's no strong force that ensures
> > > we can rely on stack backtraces: correcting bad
> > > stack traces depends on people hitting those
> > > functions and situations that generate them, seeing
> > > a bad stack trace, noticing that it's weird and
> > > correcting whatever code or tooling quirk causes the
> > > stack entry to be incorrect.
> > >
> > > Essentially unlike other kernel code which breaks stuff
> > > if it's incorrect, there's no _functional_ dependence
> > > on stack traces, so live patching would be the first
> > > (and pretty much only) thing that breaks on bad stack
> > > traces ...
> >
> > First, there are several things we do to avoid anomalies:
> >
> > - we don't patch asm functions
>
> the problem with asm functions isn't that we might want to
> patch them (although sometimes we might want to: especially
> the more interesting security fixes tend to be in assembly
> code), it's that asm functions can (easily) mess up debug
> info, without that being apparent in any functional way.

Again, that's why I proposed the stack unwinding validation
improvements.

> > - we only walk the stack of sleeping tasks
>
> sleeping tasks are affected by debug info bugs too.
>
> > Second, currently the stack unwinding code is rather crude and
> > doesn't do much in the way of error handling.
>
> That's a big red flag in my book ...

I wasn't talking about my patch set. I was talking about the existing
stack dumping code in the kernel.

--
Josh

2015-02-20 21:46:22

by Vojtech Pavlik

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote:

> > ... the choice the sysadmins have here is either have the
> > system running in an intermediate state, or have the
> > system completely dead for the *same time*. Because to
> > finish the transition successfully, all the tasks have to
> > be woken up in any case.
>
> That statement is false: an 'intermediate state' system
> where 'new' tasks are still running is still running and
> will interfere with the resolution of 'old' tasks.

Can you suggest how they would interfere? The transition happens
on entering or returning from a syscall, there is no influence between
individual tasks.

If you mean that the patch author has to consider the fact that both old
and new code will be running simultaneously, then yes, they have to.

> > But I do get your point; what you are basically saying is
> > that your preference is what kgraft is doing, and option
> > to allow for a global synchronization point should be
> > added in parallel to the gradual lazy migration.
>
> I think you misunderstood: the 'simple' method I outlined
> does not just 'synchronize', it actually executes the live
> patching atomically, once all tasks are gathered and we
> know they are _all_ in a safe state.

The 'simple' method has to catch and freeze all tasks one by one in
syscall entry/exit, at the kernel/userspace boundary, until all are
frozen and then patch the system atomically.

This means that each and every sleeping task in the system has to be
woken up in some way (sending a signal ...) to exit from a syscall it is
sleeping in. Same for CPU hogs. All kernel threads need to be parked.

This is exactly what you need to do for kGraft to complete patching.

This may take a significant amount of time to achieve and you won't be
able to use a userspace script to send the signals, because it'd be
frozen immediately.

> I.e. it's in essence the strong stop-all atomic patching
> model of 'kpatch', combined with the reliable avoidance of
> kernel stacks that 'kgraft' uses.

> That should be the starting point, because it's the most
> reliable method.

In the consistency models discussion, this was marked the
"LEAVE_KERNEL+SWITCH_KERNEL" model. It's indeed the strongest model of
all, but also comes at the highest cost in terms of impact on running
tasks. It's so high (the interruption may be seconds or more) that it
was deemed not worth implementing.

--
Vojtech Pavlik
Director SUSE Labs

2015-02-20 22:09:20

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

On Fri, Feb 20, 2015 at 10:46:13PM +0100, Vojtech Pavlik wrote:
> On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote:
> > I.e. it's in essence the strong stop-all atomic patching
> > model of 'kpatch', combined with the reliable avoidance of
> > kernel stacks that 'kgraft' uses.
>
> > That should be the starting point, because it's the most
> > reliable method.
>
> In the consistency models discussion, this was marked the
> "LEAVE_KERNEL+SWITCH_KERNEL" model. It's indeed the strongest model of
> all, but also comes at the highest cost in terms of impact on running
> tasks. It's so high (the interruption may be seconds or more) that it
> was deemed not worth implementing.

Yeah, this is way too disruptive to the user.

Even the comparatively tiny latency caused by kpatch's use of
stop_machine() was considered unacceptable by some.

Plus a lot of processes would see EINTR, causing more havoc.

--
Josh

2015-02-21 18:18:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())


* Vojtech Pavlik <[email protected]> wrote:

> On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote:
>
> > > ... the choice the sysadmins have here is either have
> > > the system running in an intermediate state, or have
> > > the system completely dead for the *same time*.
> > > Because to finish the transition successfully, all
> > > the tasks have to be woken up in any case.
> >
> > That statement is false: an 'intermediate state' system
> > where 'new' tasks are still running is still running
> > and will interfere with the resolution of 'old' tasks.
>
> Can you suggest a way how they would interfere? The
> transition happens on entering or returning from a
> syscall, there is no influence between individual tasks.

Well, a 'new' task does not stop executing after returning
from the syscall, right? If it's stopped (until all
patching is totally complete) then you are right and I
concede your point.

If it's allowed to continue its workload then my point
stands: subsequent execution of 'new' tasks can interfere
with, slow down, interact with 'old' tasks trying to get
patched.

> > I think you misunderstood: the 'simple' method I
> > outlined does not just 'synchronize', it actually
> > executes the live patching atomically, once all tasks
> > are gathered and we know they are _all_ in a safe
> > state.
>
> The 'simple' method has to catch and freeze all tasks one
> by one in syscall entry/exit, at the kernel/userspace
> boundary, until all are frozen and then patch the system
> atomically.

Correct.

> This means that each and every sleeping task in the
> system has to be woken up in some way (sending a signal
> ...) to exit from a syscall it is sleeping in. Same for
> CPU hogs. All kernel threads need to be parked.

Yes - although I'd not use signals for this, signals have
side effects - but yes, something functionally equivalent.

> This is exactly what you need to do for kGraft to
> complete patching.

My understanding of kGraft is that by default you allow
tasks to continue 'in the new universe' after they are
patched. Has this changed or have I misunderstood the
concept?

Thanks,

Ingo

2015-02-21 18:30:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())


* Josh Poimboeuf <[email protected]> wrote:

> On Fri, Feb 20, 2015 at 10:46:13PM +0100, Vojtech Pavlik wrote:
> > On Fri, Feb 20, 2015 at 08:49:01PM +0100, Ingo Molnar wrote:
> >
> > > I.e. it's in essence the strong stop-all atomic
> > > patching model of 'kpatch', combined with the
> > > reliable avoidance of kernel stacks that 'kgraft'
> > > uses.
> >
> > > That should be the starting point, because it's the
> > > most reliable method.
> >
> > In the consistency models discussion, this was marked
> > the "LEAVE_KERNEL+SWITCH_KERNEL" model. It's indeed the
> > strongest model of all, but also comes at the highest
> > cost in terms of impact on running tasks. It's so high
> > (the interruption may be seconds or more) that it was
> > deemed not worth implementing.
>
> Yeah, this is way too disruptive to the user.
>
> Even the comparatively tiny latency caused by kpatch's
> use of stop_machine() was considered unacceptable by
> some.

Unreliable, unrobust patching is even more disruptive...

What I think makes it long term fragile is that we combine
two unrobust, unlikely mechanisms: the chance that a task
just happens to execute a patched function, with the chance
that debug information is unreliable.

For example, the code patching done for tracing got debugged
to a fair degree because we rely on it for actual tracing
functionality. Even with that relatively robust usage model
we had our crises ...

I just don't see how a stack backtrace based live patching
method can become robust in the long run.

> Plus a lot of processes would see EINTR, causing more
> havoc.

Parking threads safely in user mode does not require the
propagation of syscall interruption to user-space.

(It does have some other requirements, such as making all
syscalls interruptible to a 'special' signalling method
that only live patching triggers - even syscalls that are
uninterruptible under the normal ABI, such as sys_sync().)

On the other hand, if it's too slow, people will work on
improving signal propagation latencies: making syscalls
more readily interruptible and more seamlessly restartable
has various other advantages beyond live kernel patching.

I.e. it's a win-win scenario and will improve various areas
of the kernel in terms of syscall interruptibility
latencies.

Thanks,

Ingo

2015-02-21 18:57:45

by Jiri Kosina

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

On Sat, 21 Feb 2015, Ingo Molnar wrote:

> > This means that each and every sleeping task in the system has to be
> > woken up in some way (sending a signal ...) to exit from a syscall it
> > is sleeping in. Same for CPU hogs. All kernel threads need to be
> > parked.
>
> Yes - although I'd not use signals for this, signals have
> side effects - but yes, something functionally equivalent.

This is similar to a proposal I came up with not too long ago: a fake
signal (analogous to, but not exactly the same as, what the freezer is
using) that would just make tasks cycle through the userspace/kernelspace
boundary without other side effects.
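
A very rough sketch of that idea (hypothetical helper, not the kGraft code):

#include <linux/sched.h>

/*
 * Illustrative only: wake the task through the signal machinery without
 * queueing a real signal.  The task finds nothing pending, so in the
 * common case the interrupted syscall is transparently restarted -- but
 * on the way out it crosses the kernel/userspace boundary, where the
 * patching code can switch it over.
 */
static void kick_task_through_boundary(struct task_struct *p)
{
	unsigned long flags;

	if (!lock_task_sighand(p, &flags))
		return;			/* task is exiting */

	signal_wake_up(p, 0);		/* sets TIF_SIGPENDING and wakes it */
	unlock_task_sighand(p, &flags);
}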

> > This is exactly what you need to do for kGraft to complete patching.
>
> My understanding of kGraft is that by default you allow tasks to
> continue 'in the new universe' after they are patched. Has this changed
> or have I misunderstood the concept?

What Vojtech meant here, I believe, is that the effort you have to make to
force all tasks to queue up and park at a safe place and then
restart their execution is exactly the same as the effort you have to make
to make kGraft converge and succeed.

But admittedly, if we reserve a special sort-of signal for making the
tasks pass through a safe checkpoint (and make them queue there (your
solution) or make them just pass through it and continue (current
kGraft)), it might reduce the time this effort needs considerably.

--
Jiri Kosina
SUSE Labs

2015-02-21 19:16:14

by Ingo Molnar

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())


* Jiri Kosina <[email protected]> wrote:

> On Sat, 21 Feb 2015, Ingo Molnar wrote:
>
> > > This means that each and every sleeping task in the
> > > system has to be woken up in some way (sending a
> > > signal ...) to exit from a syscall it is sleeping in.
> > > Same for CPU hogs. All kernel threads need to be
> > > parked.
> >
> > Yes - although I'd not use signals for this, signals
> > have side effects - but yes, something functionally
> > equivalent.
>
> This is similar to my proposal I came up with not too
> long time ago; a fake signal (analogically to, but not
> exactly the same, what freezer is using), that will just
> make tasks cycle through userspace/kernelspace boundary
> without other side-effects.

Yeah.

> > > This is exactly what you need to do for kGraft to
> > > complete patching.
> >
> > My understanding of kGraft is that by default you allow
> > tasks to continue 'in the new universe' after they are
> > patched. Has this changed or have I misunderstood the
> > concept?
>
> What Vojtech meant here, I believe, is that the effort
> you have to make to force all tasks to queue themselves
> to park them on a safe place and then restart their
> execution is exactly the same as the effort you have to
> make to make kGraft converge and succeed.

Yes - with the difference that in the 'simple' method I
suggested we'd have kpatch's patching robustness (all or
nothing atomic patching - no intermediate patching state,
no reliance on mcount entries, no doubt about which version
of the function is working - sans kpatch's stack trace
logic), combined with kGraft's task parking robustness.

> But admittedly, if we reserve a special sort-of signal
> for making the tasks pass through a safe checkpoint (and
> make them queue there (your solution) or make them just
> pass through it and continue (current kGraft)), it might
> reduce the time this effort needs considerably.

Well, I think the 'simple' method has another advantage: it
can only work if all those problems (kthreads, parking
machinery) are solved, because the patching will occur only
once everything is quiescent.

So no shortcuts are allowed, by design. It starts from a
fundamentally safe, robust base, while all the other
approaches I've seen were developed in a 'let's get the
patching to work, then iteratively try to make it safer'
manner, which really puts the cart before the horse.

So to put it more bluntly: I don't subscribe to the whole
'consistency model' nonsense: that's just crazy talk IMHO.

Make it fundamentally safe from the very beginning, the
'simple method' I suggested _won't live patch the kernel_
if the mechanism has a bug and some kthread or task does
not park. See the difference?

Thanks,

Ingo

2015-02-21 19:31:55

by Jiri Kosina

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

On Sat, 21 Feb 2015, Ingo Molnar wrote:

> > But admittedly, if we reserve a special sort-of signal
> > for making the tasks pass through a safe checkpoint (and
> > make them queue there (your solution) or make them just
> > pass through it and continue (current kGraft)), it might
> > reduce the time this effort needs considerably.
>
> Well, I think the 'simple' method has another advantage: it can only
> work if all those problems (kthreads, parking machinery) are solved,
> because the patching will occur only everything is quiescent.
>
> So no shortcuts are allowed, by design. It starts from a fundamentally
> safe, robust base, while all the other approaches I've seen were
> developed in a 'lets get the patching to work, then iteratively try to
> make it safer' which really puts the cart before the horse.
>
> So to put it more bluntly: I don't subscribe to the whole 'consistency
> model' nonsense: that's just crazy talk IMHO.
>
> Make it fundamentally safe from the very beginning, the 'simple method'
> I suggested _won't live patch the kernel_ if the mechanism has a bug and
> some kthread or task does not park. See the difference?

I see the difference, but I am afraid you are simplifying the situation a
little bit too much.

There will always be properties of patches that will make them
inapplicable in a "live patching" way by design. Think of data structure
layout changes (*).

Or think of a kernel that has some 3rd party vendor module loaded, and this
module spawning a kthread that is not capable of parking itself.

Or think of patching __notrace__ functions.

Etc.

So it's not black and white, it's really a rather philosophical question
where to draw the line (and make sure that everyone is aware of where the
line is and what it means).
This is exactly why we came up with consistency models -- they allow you to
draw the line at well-defined places.

(*) there are some rather crazy ideas about how to make this work, but the
price you pay is basically always an unacceptable slowdown

Thanks,

--
Jiri Kosina
SUSE Labs

2015-02-21 19:48:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())


* Jiri Kosina <[email protected]> wrote:

> On Sat, 21 Feb 2015, Ingo Molnar wrote:
>
> > > But admittedly, if we reserve a special sort-of
> > > signal for making the tasks pass through a safe
> > > checkpoint (and make them queue there (your solution)
> > > or make them just pass through it and continue
> > > (current kGraft)), it might reduce the time this
> > > effort needs considerably.
> >
> > Well, I think the 'simple' method has another
> > advantage: it can only work if all those problems
> > (kthreads, parking machinery) are solved, because the
> > patching will occur only everything is quiescent.
> >
> > So no shortcuts are allowed, by design. It starts from
> > a fundamentally safe, robust base, while all the other
> > approaches I've seen were developed in a 'lets get the
> > patching to work, then iteratively try to make it
> > safer' which really puts the cart before the horse.
> >
> > So to put it more bluntly: I don't subscribe to the
> > whole 'consistency model' nonsense: that's just crazy
> > talk IMHO.
> >
> > Make it fundamentally safe from the very beginning, the
> > 'simple method' I suggested _won't live patch the
> > kernel_ if the mechanism has a bug and some kthread or
> > task does not park. See the difference?
>
> I see the difference, but I am afraid you are simplifying
> the situation a litle bit too much.
>
> There will always be properties of patches that will make
> them unapplicable in a "live patching" way by design.
> Think of data structure layout changes (*).

Yes.

> Or think of kernel that has some 3rd party vendor module
> loaded, and this module spawning a ktrehad that is not
> capable of parking itself.

The kernel will refuse to live patch until the module is
fixed. It is a win by any measure.

> Or think of patching __notrace__ functions.

Why would __notrace__ functions be a problem in the
'simple' method? Live patching with this method will work
even if ftrace is not built in, we can just patch out the
function in the simplest possible fashion, because we do it
atomically and don't have to worry about per task
'transition state' - like kGraft does.

It's a massive simplification and there's no need to rely
on ftrace's mcount trick. (Sorry Steve!)

> Etc.
>
> So it's not black and white, it's really a rather
> philosophical question where to draw the line (and make
> sure that everyone is aware of where the line is and what
> it means).

Out of the three examples you mentioned, two are actually
an advantage to begin with - so I'd suggest you stop
condescending to me ...

> This is exactly why we came up with consistency models --
> it allows you to draw the line at well-defined places.

To still be blunt: they are snake oil, a bit like 'security
modules': allowing upstream merges by consensus between
competing pieces of crap, instead of coming up with a
single good design that we can call the Linux kernel live
patching method ...

Thanks,

Ingo

2015-02-21 20:10:37

by Jiri Kosina

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

On Sat, 21 Feb 2015, Ingo Molnar wrote:

> > I see the difference, but I am afraid you are simplifying
> > the situation a litle bit too much.
> >
> > There will always be properties of patches that will make
> > them unapplicable in a "live patching" way by design.
> > Think of data structure layout changes (*).
>
> Yes.

The (*) note described that it's actually in theory possible, but the
price to pay is very high.

So it's possible to write a patch that does this, and have it use a
different consistency model which would take care of the data structure
layout change (and the fact that this particular consistency model is very
expensive would be clear).

Your "simple one-method-to-rule-them-all aproach" is not sufficient for
this.

Even such a stretched case doesn't justify the existence of consistency models
for you?

> > Or think of kernel that has some 3rd party vendor module loaded, and
> > this module spawning a ktrehad that is not capable of parking itself.
>
> The kernel will refuse to live patch until the module is fixed. It is a
> win by any measure.

Depends on your point of view. If you are an administrator of the
vulnerable system, you have little to no clue why patching your
system is not possible (and what you should do to have it fixed in the
coming days, before your system is taken down by attackers, for example),
and would be really sad to be forced to run with an unpatched system.

I see your point that this would be a good lever we'd have over 3rd party
vendors, but OTOH we'd be taking OS users hostage in some sense.

> > Or think of patching __notrace__ functions.
>
> Why would __notrace__ functions be a problem in the 'simple' method?
> Live patching with this method will work even if ftrace is not built in,
> we can just patch out the function in the simplest possible fashion,
> because we do it atomically and don't have to worry about per task
> 'transition state' - like kGraft does.
>
> It's a massive simplification and there's no need to rely on ftrace's
> mcount trick. (Sorry Steve!)

This would indeed be possible iff we take the "global synchronizing point"
approach you are proposing. There are still technicalities to be solved (what
happens if you are patching a function that already has ftrace on it, etc),
but probably nothing fundamental.

> > So it's not black and white, it's really a rather philosophical
> > question where to draw the line (and make sure that everyone is aware
> > of where the line is and what it means).
>
> Out of the three examples you mentioned, two are actually an advantage
> to begin with - so I'd suggest you stop condescending me ...

Ugh, if my e-mails feel like attempts to condescend to you, I am really
surprised; I thought we were discussing technical details. If you feel
otherwise, we should probably just stop discussing then.

Thanks,

--
Jiri Kosina
SUSE Labs

2015-02-21 20:53:37

by Jiri Kosina

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

To make sure that this thread doesn't conclude in void, here's my take on
it:

- what's currently already there is the simplest of simple methods; it
allows you to apply context-less patches (such as adding bounds checking
to the beginning of a syscall, etc), which turns out to cover a vast portion
of applicable CVEs

- it can always be made more clever; the patch author always has to know the
version of the kernel he's preparing the patch for anyway (the live
patch and the kernel are closely tied together)

- the proposal to force sleeping or CPU-hogging tasks through a defined
safe checkpoint using a fake sort-of signal without any other
side effects might be useful even for kGraft and also for other proposed
approaches. I think we'll try to implement this as an optimization for
kGraft and we'll see how (a) fast and (b) non-intrusive we would be able to
make it. If it turns out to be successful, we can then just reuse it in
the upstream solution (whatever that would be)

Thanks,

--
Jiri Kosina
SUSE Labs

2015-02-22 08:46:11

by Ingo Molnar

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())


* Jiri Kosina <[email protected]> wrote:

> > > Or think of kernel that has some 3rd party vendor
> > > module loaded, and this module spawning a ktrehad
> > > that is not capable of parking itself.
> >
> > The kernel will refuse to live patch until the module
> > is fixed. It is a win by any measure.
>
> Depends on your point of view. If you are an
> administrator of the vulnerable system, you have no to
> very little clue why patching your system is not possible
> (and what you should to to have it fixed in the coming
> days, before your system is taken down by attackers, for
> example), and would be really sad to be forced to run
> with unpatched system.

I don't see what precise argument you are making here. That
we are unable to fix possible bugs in binary only modules?
News at 11.

Or are you making the argument that we should design our
core kernel capabilities to be deficient, just to
accommodate hypothetical scenarios with buggy third-party
modules that create unparkable kernel threads? That would
be a crazy proposition.

> > Why would __notrace__ functions be a problem in the
> > 'simple' method? Live patching with this method will
> > work even if ftrace is not built in, we can just patch
> > out the function in the simplest possible fashion,
> > because we do it atomically and don't have to worry
> > about per task 'transition state' - like kGraft does.
> >
> > It's a massive simplification and there's no need to
> > rely on ftrace's mcount trick. (Sorry Steve!)
>
> This would indeed be possible iff we take the "global
> synchronizing point" aproach you are proposing. [...]

Yes.

> [...] Still technicalities to be solved (what happens if
> you are patching that already has ftrace on it, etc), but
> probably nothing principal.

Correct.

> > > So it's not black and white, it's really a rather
> > > philosophical question where to draw the line (and
> > > make sure that everyone is aware of where the line is
> > > and what it means).
> >
> > Out of the three examples you mentioned, two are
> > actually an advantage to begin with - so I'd suggest
> > you stop condescending me ...
>
> Ugh, if you feel my e-mails like attempts to condescend
> you, I am really surprised; I thought we are discussing
> technical details. If you feel otherwise, we should
> probably just stop discussing then.

I am making specific technical arguments, but you attempted
to redirect my very specific arguments towards 'differences
in philosophy' and 'where to draw the line'. Let's discuss
the merits instead of brushing them aside as 'philosophical
differences' or a made-up category of 'consistency models'.

Anyway, let me try to reboot this discussion back to
technological details by summing up my arguments in another
mail.

Thanks,

Ingo

2015-02-22 08:52:43

by Jiri Kosina

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

On Sat, 21 Feb 2015, Ingo Molnar wrote:

> > Plus a lot of processes would see EINTR, causing more havoc.
>
> Parking threads safely in user mode does not require the propagation of
> syscall interruption to user-space.

BTW how exactly do you envision this will work? Do I understand your
proposal correctly that EINTR will be "handled" somewhere in the "live
patching special signal handler" and then have the interrupted syscall
restarted?

Even without EINTR propagation to userspace, this would make a lot of new
syscall restarts that were not there before, and I have yet to be
convinced that this is not going to cause a lot of new
user-visible breakage.

Yes, the breakage would be caused by kernel bugs (I mostly envision drivers
to be problematic in this area) that would be nice to have fixed, but the
user experience that will come out of it will be just completely horrible.

Or did I misunderstand what you are really proposing?

Thanks,

--
Jiri Kosina
SUSE Labs

2015-02-22 09:08:39

by Jiri Kosina

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

On Sun, 22 Feb 2015, Ingo Molnar wrote:

> I am making specific technical arguments, but you attempted to redirect
> my very specific arguments towards 'differences in philosophy' and
> 'where to draw the line'. Lets discuss the merits and brush them aside
> as 'philosophical differences' or a made up category of 'consistency
> models'.
>
> Anyway, let me try to reboot this discussion back to technological
> details by summing up my arguments in another mail.

Sounds good, thanks. Don't get me wrong -- I am not really opposing your
solution (because it's of course quite close to what kgraft is doing :p ),
but I'd like to make sure we understand each other well.

I still have to think a little bit more about how to handle kthreads
properly even in your proposed solution (i.e. how exactly is it superior
to what kgraft is currently doing in this respect). We'd have to go
through all of them anyway, and make them parkable, right?
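
(For reference: the kernel already has the kthread_park()/kthread_parkme()
machinery; the work being referred to here is auditing kthread main loops so
that they actually reach such a safe parking point. A minimal sketch of a
parking-aware main loop -- my_work(), my_waitq and my_wait_cond() are
made-up placeholders -- could look like this:)

#include <linux/kthread.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(my_waitq);

static void my_work(void *data);        /* placeholder for the thread's real job */
static bool my_wait_cond(void);         /* placeholder wakeup condition */

static int my_kthread_fn(void *data)
{
        while (!kthread_should_stop()) {
                if (kthread_should_park())
                        kthread_parkme();       /* safe point: no half-done state held */

                my_work(data);

                wait_event_interruptible(my_waitq,
                                         my_wait_cond() ||
                                         kthread_should_stop() ||
                                         kthread_should_park());
        }
        return 0;
}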

Thanks,

--
Jiri Kosina
SUSE Labs

2015-02-22 09:46:48

by Ingo Molnar

[permalink] [raw]
Subject: live kernel upgrades (was: live kernel patching design)


* Ingo Molnar <[email protected]> wrote:

> Anyway, let me try to reboot this discussion back to
> technological details by summing up my arguments in
> another mail.

So here's how I see the kGraft and kpatch series. To not
put too fine a point on it, I think they are fundamentally
misguided in both implementation and in design, which turns
them into an (unwilling) extended arm of the security
theater:

- kGraft creates a 'mixed' state where old kernel
functions and new kernel functions are allowed to
co-exist; furthermore there's currently no guarantee that
the patching gets done within a bounded
amount of time.

- kpatch uses kernel stack backtraces to determine whether
a task is executing a function or not - which IMO is
fundamentally fragile as kernel stack backtraces are
'debug info' and are maintained and created as such:
we've had long lasting stack backtrace bugs which would
now be turned into 'potentially patching a live
function' type of functional (and hard to debug) bugs.
I didn't see much effort that tries to turn this
equation around and makes kernel stacktraces more
robust.

- the whole 'consistency model' talk both projects employ
reminds me of how we grew 'security modules': where
people running various mediocre projects would in the
end not seek to create a superior upstream project, but
would seek the 'consensus' in the form of cross-acking
each other's patches as long as their own code got
upstream as well ...

I'm not blaming Linus for giving in to allowing security
modules: they might be the right model for such a hard
to define and in good part psychological discipline as
'security', but I sure don't see the necessity of doing
that for 'live kernel patching'.

More importantly, both kGraft and kpatch are pretty limited
in what kinds of updates they allow, and neither kGraft nor
kpatch has any clear path towards applying more complex
fixes to kernel images that I can see: kGraft can only
apply the simplest of fixes where both versions of a
function are interchangeable, and kpatch is only marginally
better at that - and that's pretty fundamental to both
projects!

I think all of these problems could be resolved by shooting
for the moon instead:

- work towards allowing arbitrary live kernel upgrades!

not just 'live kernel patches'.

Work towards the goal of full live kernel upgrades between
any two versions of a kernel that supports live kernel
upgrades (and that doesn't have fatal bugs in the kernel
upgrade support code requiring a hard system restart).

Arbitrary live kernel upgrades could be achieved by
starting with the 'simple method' I outlined in earlier
mails, using some of the methods that kpatch and kGraft are
both utilizing or planning to utilize:

- implement user task and kthread parking to get the
kernel into quiescent state.

- implement (optional, thus ABI-compatible)
system call interruptability and restartability
support.

- implement task state and (limited) device state
snapshotting support

- implement live kernel upgrades by:

- snapshotting all system state transparently

- fast-rebooting into the new kernel image without
shutting down and rebooting user-space, i.e. _much_
faster than a regular reboot.

- restoring system state transparently within the new
kernel image and resuming system workloads where
they were left.

Even complex external state like TCP socket state and
graphics state can be preserved over an upgrade. As far as
the user is concerned, nothing happened but a brief pause -
and he's now running a v3.21 kernel, not v3.20.
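
(To make the sequence above slightly more concrete, a very rough
kernel-side sketch built from existing primitives might look like the
following; the snapshotting step is the part that does not exist today,
snapshot_system_state() is a made-up placeholder, error handling is
simplified, and it assumes the new image was already loaded via the
kexec_load() syscall:)

#include <linux/freezer.h>
#include <linux/kexec.h>

/* placeholder: task + (limited) device state snapshotting does not exist today */
static int snapshot_system_state(void);

static int live_upgrade(void)
{
        int error;

        error = freeze_processes();             /* park user tasks */
        if (error)
                return error;

        error = freeze_kernel_threads();        /* park (parkable) kthreads */
        if (error)
                goto thaw;

        error = snapshot_system_state();        /* placeholder, see above */
        if (error)
                goto thaw;

        return kernel_kexec();                  /* jump into the pre-loaded new image */

thaw:
        thaw_processes();                       /* thaws kernel threads as well */
        return error;
}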

Obviously one of the simplest utilizations of live kernel
upgrades would be to apply simple security fixes to
production systems. But that's just a very simple
application of a much broader capability.

Note that if done right, then the time to perform a live
kernel upgrade on a typical system could be brought to well
below 10 seconds system stoppage time: adequate to the vast
majority of installations.

For special installations or well optimized hardware the
latency could possibly be brought below 1 second stoppage
time.

This 'live kernel upgrades' approach would have various
advantages:

- it brings together various principles working towards
shared goals:

- the boot time reduction folks
- the checkpoint/restore folks
- the hibernation folks
- the suspend/resume and power management folks
- the live patching folks (you)
- the syscall latency reduction folks

if so many disciplines are working together then maybe
something really good and long term maintainable can
crystallize out of that effort.

- it ignores the security theater that treats security
fixes as a separate, disproportionally more important
class of fixes and instead allows arbitrary complex
changes over live kernel upgrades.

- there's no need to 'engineer' live patches separately,
there's no need to review them and their usage sites
for live patching relevant side effects. Just create a
'better' kernel as defined by users of that kernel:

- in the enterprise distro space create a more stable
kernel and allow transparent upgrades into it.

- in the desktop distro space create a kernel that
will contain fixes and support for latest hardware.

- etc.

there's the need to engineer c/r and device state
support, but that's a much more concentrated and
specific field with many usecases beyond live
kernel upgrades.

We have many of the building blocks in place and have them
available:

- the freezer code already attempts at parking/unparking
threads transparently, that could be fixed/extended.

- hibernation, regular suspend/resume and in general
power management has in essence already implemented
most building blocks needed to enumerate and
checkpoint/restore device state that otherwise gets
lost in a shutdown/reboot cycle.

- c/r patches started user state enumeration and
checkpoint/restore logic

A feature like arbitrary live kernel upgrades would be well
worth the pain and would be worth the complications, and
it's actually very feasible technically.

The goals of the current live kernel patching projects,
"being able to apply only the simplest of live patches",
which would in my opinion mostly serve the security
theater? They are not forward looking enough, and in that
sense they could even be counterproductive.

Thanks,

Ingo

2015-02-22 10:17:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())


* Jiri Kosina <[email protected]> wrote:

> On Sat, 21 Feb 2015, Ingo Molnar wrote:
>
> > > Plus a lot of processes would see EINTR, causing more
> > > havoc.
> >
> > Parking threads safely in user mode does not require
> > the propagation of syscall interruption to user-space.
>
> BTW how exactly do you envision this will work? Do I
> understand your proposal correctly that EINTR will be
> "handled" somewhere in the "live patching special signal
> handler" and then have the interrupted syscall restarted?

If you want to think about it in signal handling terms then
it's a new automatic in-kernel handler, which does not
actually return back to user-mode at all.

We can do it via the signals machinery (mainly to reuse all
the existing signal_pending() code in various syscalls), or
via new TIF flags like the user work machinery: the
principle is the same: interrupt out of syscall functions
into a central place and restart them, and return to
user-space later on as if a single call had been performed.

This necessarily means some changes to syscalls, but not
insurmountable ones - and checkpoint/restore support would
want to have similar changes in any case so we can hit two
birds with the same stone.
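
(A minimal sketch of what this could look like inside one long-running
syscall, purely to illustrate the idea: TIF_PATCH_PENDING,
patch_park_current() and example_condition_met() are made-up names, and it
assumes the syscall-exit/TIF-work path is taught to treat the flag like a
pending signal, so that the -ERESTARTNOINTR return is handled in the kernel
and restarts the call instead of leaking to user-space:)

#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/thread_info.h>

static bool example_condition_met(void);        /* placeholder wait condition */
static void patch_park_current(void);           /* made up: park at the kernel boundary */

static long example_wait_syscall(void)
{
        while (!example_condition_met()) {
                if (test_thread_flag(TIF_PATCH_PENDING)) {      /* made-up flag */
                        patch_park_current();
                        return -ERESTARTNOINTR; /* restarted in-kernel, invisible to user-space */
                }
                schedule_timeout_interruptible(HZ / 10);
        }
        return 0;
}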

> Even without EINTR propagation to userspace, this would
> make a lot of new syscall restarts that were not there
> before, [...]

That's only a problem if you do system call restarts by
restarting them via user-space system call restart handler
- I'm not proposing that.

I'm suggesting a completely user-space transparent way to
execute long lasting system calls in a smarter way. I.e. it
would not be observable via strace either.

Thanks,

Ingo

2015-02-22 10:34:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)


* Ingo Molnar <[email protected]> wrote:

> - implement live kernel upgrades by:
>
> - snapshotting all system state transparently

Note that this step can be sped up further in the end,
because most of this work can be performed asynchronously
and transparently prior to the live kernel upgrade step
itself.

So if we split the snapshotting+parking preparatory step
into two parts:

- do opportunistic snapshotting of
sleeping/inactive user tasks while allowing
snapshotted tasks to continue to run

- once that is completed, do snapshotting+parking
of all user tasks, even running ones

The first step is largely asynchronous, can be done with
lower priority and does not park/stop any tasks on the
system.

Only the second step counts as 'system stoppage time': and
only those tasks have to be snapshotted again which
executed any code since the first snapshotting run was
performed.

Note that even this stoppage time can be reduced further:
if a system is running critical services/users that need as
little interruption as possible, they could be
prioritized/ordered to be snapshotted/parked closest to the
live kernel upgrade step.

> - fast-rebooting into the new kernel image without
> shutting down and rebooting user-space, i.e. _much_
> faster than a regular reboot.
>
> - restoring system state transparently within the new
> kernel image and resuming system workloads where
> they were left.
>
> Even complex external state like TCP socket state and
> graphics state can be preserved over an upgrade. As far
> as the user is concerned, nothing happened but a brief
> pause - and he's now running a v3.21 kernel, not v3.20.

So all this would allow 'live, rolling kernel upgrades' in
the end.

Thanks,

Ingo

2015-02-22 10:48:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)


* Ingo Molnar <[email protected]> wrote:

> We have many of the building blocks in place and have
> them available:
>
> - the freezer code already attempts at parking/unparking
> threads transparently, that could be fixed/extended.
>
> - hibernation, regular suspend/resume and in general
> power management has in essence already implemented
> most building blocks needed to enumerate and
> checkpoint/restore device state that otherwise gets
> lost in a shutdown/reboot cycle.
>
> - c/r patches started user state enumeration and
> checkpoint/restore logic

I forgot to mention:

- kexec allows the loading and execution of a new
kernel image.

It's all still tons of work to pull off a 'live kernel
upgrade' on native hardware, but IMHO it's tons of very
useful work that helps a dozen non-competing projects,
literally.

Thanks,

Ingo

2015-02-22 14:39:14

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)

[ adding live-patching mailing list to CC ]

On Sun, Feb 22, 2015 at 10:46:39AM +0100, Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
> > Anyway, let me try to reboot this discussion back to
> > technological details by summing up my arguments in
> > another mail.
>
> So here's how I see the kGraft and kpatch series. To not
> put too fine a point on it, I think they are fundamentally
> misguided in both implementation and in design, which turns
> them into an (unwilling) extended arm of the security
> theater:
>
> - kGraft creates a 'mixed' state where old kernel
> functions and new kernel functions are allowed to
> co-exist,

Yes, some tasks may be running old functions and some tasks may be
running new functions. This would only cause a problem if there are
changes to global data semantics. We have guidelines the patch author
can follow to ensure that this isn't a problem.

> there's currently no guarantee that the patching gets
> done within a bounded amount of time.

Don't forget about my RFC [1] which converges the system to a patched
state within a few seconds. If the system isn't patched by then, the
user space tool can trigger a safe patch revert.
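
(Roughly, the RFC's idea is per-task convergence: each task carries a marker
saying which 'universe' it is in and gets switched over at a safe point;
once every task has switched, the patch is fully applied, otherwise it can
still be reverted. A toy sketch of that shape -- not the actual RFC code;
klp_universe would be a new task_struct field here, and klp_target /
klp_task_on_old_func() are illustrative names:)

#include <linux/sched.h>

static int klp_target;          /* the patch generation we are converging to */

/* e.g. a stack-based check, or "task is parked at the syscall boundary" */
static bool klp_task_on_old_func(struct task_struct *task);

static bool klp_try_switch_task(struct task_struct *task)
{
        /* klp_universe would be a new task_struct field in this sketch */
        if (task->klp_universe == klp_target)
                return true;                            /* already converged */

        if (!klp_task_on_old_func(task))
                task->klp_universe = klp_target;        /* safe to migrate now */

        return task->klp_universe == klp_target;
}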

> - kpatch uses kernel stack backtraces to determine whether
> a task is executing a function or not - which IMO is
> fundamentally fragile as kernel stack backtraces are
> 'debug info' and are maintained and created as such:
> we've had long lasting stack backtrace bugs which would
> now be turned into 'potentially patching a live
> function' type of functional (and hard to debug) bugs.
> I didn't see much effort that tries to turn this
> equation around and makes kernel stacktraces more
> robust.

Again, I proposed several stack unwinding validation improvements which
would make this a non-issue IMO.
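
(For context, the basic check being argued about is conceptually simple; a
minimal sketch -- not kpatch's actual code -- using the existing
save_stack_trace_tsk() API, where old_func/old_size describe the function
being replaced:)

#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/stacktrace.h>

static int task_inside_old_func(struct task_struct *task,
                                unsigned long old_func, unsigned long old_size)
{
        unsigned long entries[32];
        struct stack_trace trace = {
                .entries        = entries,
                .max_entries    = ARRAY_SIZE(entries),
        };
        unsigned int i;

        save_stack_trace_tsk(task, &trace);

        for (i = 0; i < trace.nr_entries; i++)
                if (entries[i] >= old_func &&
                    entries[i] <  old_func + old_size)
                        return 1;       /* a return address is in the old function */

        return 0;
}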

> - the whole 'consistency model' talk both projects employ
> reminds me of how we grew 'security modules': where
> people running various mediocre projects would in the
> end not seek to create a superior upstream project, but
> would seek the 'consensus' in the form of cross-acking
> each other's patches as long as their own code got
> upstream as well ...

That's just not the case. The consistency models were used to describe
the features and the pros and cons of the different approaches.

The RFC is not a compromise to get "cross-acks". IMO it's an
improvement on both kpatch and kGraft. See the RFC cover letter [1] and
the original consistency model discussion [2] for more details.

> I'm not blaming Linus for giving in to allowing security
> modules: they might be the right model for such a hard
> to define and in good part psychological discipline as
> 'security', but I sure don't see the necessity of doing
> that for 'live kernel patching'.
>
> More importantly, both kGraft and kpatch are pretty limited
> in what kinds of updates they allow, and neither kGraft nor
> kpatch has any clear path towards applying more complex
> fixes to kernel images that I can see: kGraft can only
> apply the simplest of fixes where both versions of a
> function are interchangeable, and kpatch is only marginally
> better at that - and that's pretty fundamental to both
> projects!

Sorry, but that is just not true. We can apply complex patches,
including "non-interchangeable functions" and data structures/semantics.

The catch is that it requires the patch author to put in the work to
modify the patch to make it compatible with live patching. But that's
an acceptable tradeoff for distros who want to support live patching.

> I think all of these problems could be resolved by shooting
> for the moon instead:
>
> - work towards allowing arbitrary live kernel upgrades!
>
> not just 'live kernel patches'.
>
> Work towards the goal of full live kernel upgrades between
> any two versions of a kernel that supports live kernel
> upgrades (and that doesn't have fatal bugs in the kernel
> upgrade support code requiring a hard system restart).
>
> Arbitrary live kernel upgrades could be achieved by
> starting with the 'simple method' I outlined in earlier
> mails, using some of the methods that kpatch and kGraft are
> both utilizing or planning to utilize:
>
> - implement user task and kthread parking to get the
> kernel into quiescent state.
>
> - implement (optional, thus ABI-compatible)
> system call interruptability and restartability
> support.
>
> - implement task state and (limited) device state
> snapshotting support
>
> - implement live kernel upgrades by:
>
> - snapshotting all system state transparently
>
> - fast-rebooting into the new kernel image without
> shutting down and rebooting user-space, i.e. _much_
> faster than a regular reboot.
>
> - restoring system state transparently within the new
> kernel image and resuming system workloads where
> they were left.
>
> Even complex external state like TCP socket state and
> graphics state can be preserved over an upgrade. As far as
> the user is concerned, nothing happened but a brief pause -
> and he's now running a v3.21 kernel, not v3.20.
>
> Obviously one of the simplest utilizations of live kernel
> upgrades would be to apply simple security fixes to
> production systems. But that's just a very simple
> application of a much broader capability.
>
> Note that if done right, then the time to perform a live
> kernel upgrade on a typical system could be brought to well
> below 10 seconds system stoppage time: adequate to the vast
> majority of installations.
>
> For special installations or well optimized hardware the
> latency could possibly be brought below 1 second stoppage
> time.
>
> This 'live kernel upgrades' approach would have various
> advantages:
>
> - it brings together various principles working towards
> shared goals:
>
> - the boot time reduction folks
> - the checkpoint/restore folks
> - the hibernation folks
> - the suspend/resume and power management folks
> - the live patching folks (you)
> - the syscall latency reduction folks
>
> if so many disciplines are working together then maybe
> something really good and long term maintainable can
> crystallize out of that effort.
>
> - it ignores the security theater that treats security
> fixes as a separate, disproportionally more important
> class of fixes and instead allows arbitrary complex
> changes over live kernel upgrades.
>
> - there's no need to 'engineer' live patches separately,
> there's no need to review them and their usage sites
> for live patching relevant side effects. Just create a
> 'better' kernel as defined by users of that kernel:
>
> - in the enterprise distro space create a more stable
> kernel and allow transparent upgrades into it.
>
> - in the desktop distro space create a kernel that
> will contain fixes and support for latest hardware.
>
> - etc.
>
> there's the need to engineer c/r and device state
> support, but that's a much more concentrated and
> specific field with many usecases beyond live
> kernel upgrades.
>
> We have many of the building blocks in place and have them
> available:
>
> - the freezer code already attempts at parking/unparking
> threads transparently, that could be fixed/extended.
>
> - hibernation, regular suspend/resume and in general
> power management has in essence already implemented
> most building blocks needed to enumerate and
> checkpoint/restore device state that otherwise gets
> lost in a shutdown/reboot cycle.
>
> - c/r patches started user state enumeration and
> checkpoint/restore logic
>
> A feature like arbitrary live kernel upgrades would be well
> worth the pain and would be worth the complications, and
> it's actually very feasible technically.
>
> The goals of the current live kernel patching projects,
> "being able to apply only the simplest of live patches",
> which would in my opinion mostly serve the security
> theater? They are not forward looking enough, and in that
> sense they could even be counterproductive.

My only issue with this proposal is the assertion that it's somehow a
replacement for kpatch or kGraft.

IIUC, the idea is basically to kick the tasks out of the kernel,
checkpoint them, replace the entire kernel, and restore the tasks. It
sounds like kind of a variation on kexec+criu.

There really needs to be a distinction between the two: live upgrading
vs live patching. They are two completely different animals, for
different use cases.

Live upgrading is basically a less disruptive reboot, with application
states preserved.

Live patching is applying simple fixes which are as close to *zero*
disruption as possible.

Your upgrade proposal is an *enormous* disruption to the system:

- a latency of "well below 10" seconds is completely unacceptable to
most users who want to patch the kernel of a production system _while_
it's in production.

- more importantly, the number of things that would have to go *right*
in order to apply a simple security fix means that it is
_exponentially_ more complex than live patching. That kind of risk
and disruption to the system is *exactly* what a live patching user is
trying to avoid.

[1] https://lkml.org/lkml/2015/2/9/475
[2] https://lkml.org/lkml/2014/11/7/354

--
Josh

2015-02-22 16:41:37

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)

On Sun, Feb 22, 2015 at 08:37:58AM -0600, Josh Poimboeuf wrote:
> On Sun, Feb 22, 2015 at 10:46:39AM +0100, Ingo Molnar wrote:
> > - the whole 'consistency model' talk both projects employ
> > reminds me of how we grew 'security modules': where
> > people running various mediocre projects would in the
> > end not seek to create a superior upstream project, but
> > would seek the 'consensus' in the form of cross-acking
> > each other's patches as long as their own code got
> > upstream as well ...
>
> That's just not the case. The consistency models were used to describe
> the features and the pros and cons of the different approaches.
>
> The RFC is not a compromise to get "cross-acks". IMO it's an
> improvement on both kpatch and kGraft. See the RFC cover letter [1] and
> the original consistency model discussion [2] for more details.

BTW, I proposed that with my RFC we only need a _single_ consistency
model.

Yes, there have been some suggestions that we should support multiple
consistency models, but I haven't heard any good reasons that would
justify the added complexity.

--
Josh

2015-02-22 19:03:12

by Jiri Kosina

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)

On Sun, 22 Feb 2015, Josh Poimboeuf wrote:

> Yes, there have been some suggestions that we should support multiple
> consistency models, but I haven't heard any good reasons that would
> justify the added complexity.

I tend to agree, consistency models were just a temporary idea that seems
likely to become unnecessary given all the ideas on the unified solution
that have been presented so far.

(Well, with a small exception to this -- I still think we should be able
to "fire and forget" for patches where it's guaranteed that no
housekeeping is necessary -- my favorite example is again fixing out of
bounds access in a certain syscall entry ... i.e. the "super-simple"
consistency model).
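
(To illustrate that "super-simple" class: a fix where the old and new
function are fully interchangeable -- no change to data layout, locking or
calling conventions -- so it does not matter which version any given task
happens to execute at any moment. foo_ioctl(), foo_noop() and foo_handlers
are made-up names:)

#include <linux/fs.h>
#include <linux/kernel.h>

static long foo_noop(struct file *file, unsigned long arg)
{
        return 0;                               /* placeholder handler */
}

static long (*const foo_handlers[])(struct file *, unsigned long) = {
        foo_noop,
};

static long foo_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
        if (cmd >= ARRAY_SIZE(foo_handlers))    /* the added bounds check */
                return -EINVAL;

        return foo_handlers[cmd](file, arg);
}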

--
Jiri Kosina
SUSE Labs

2015-02-22 19:13:32

by Jiri Kosina

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)


[ added live-patching@ ML as well, in consistency with Josh ]

On Sun, 22 Feb 2015, Ingo Molnar wrote:

> It's all still tons of work to pull off a 'live kernel upgrade' on
> native hardware, but IMHO it's tons of very useful work that helps a
> dozen non-competing projects, literally.

Yes, I agree, it might be nice-to-have feature. The only issue with that
is that it's solving a completely different problem than live patching.

Guys working on criu have of course already made quite a few steps in
that direction; modulo bugs and current implementation limitations, you
should be able to checkpoint your userspace, kexec to a new kernel, and
restart your userspace.

But if you ask the folks who are hungry for live bug patching, they
wouldn't care.

You mentioned "10 seconds", that's more or less equal to infinity to them.
And frankly, even "10 seconds" is something we can't really guarantee. We
could optimize the kernel the craziest way we can, but hardware takes its
time to reinitialize. And in most cases, you'd really need to reinitialize
it; I don't see a way how you could safely suspend it somehow in the old
kernel and resume it in a new one, because the driver suspending the
device might be completely different than the driver resuming the device.
How are you able to provide hard guarantees that this is going to work?

So all in all, if you ask me -- yes, live kernel upgrades from v3.20 to
v3.21, pretty cool feature. Is it related to the problem we are after with
live bug patching? I very much don't think so.

Thanks,

--
Jiri Kosina
SUSE Labs

2015-02-22 19:18:10

by Jiri Kosina

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())


[ live-patching@ ML added to CC here as well ]

On Sun, 22 Feb 2015, Ingo Molnar wrote:

> > BTW how exactly do you envision this will work? Do I understand your
> > proposal correctly that EINTR will be "handled" somewhere in the "live
> > patching special signal handler" and then have the interrupted syscall
> > restarted?
>
> If you want to think about it in signal handling terms then it's a new
> automatic in-kernel handler, which does not actually return back to
> user-mode at all.
>
> We can do it via the signals machinery (mainly to reuse all the existing
> signal_pending() code in various syscalls), or via new TIF flags like
> the user work machinery: the principle is the same: interrupt out of
> syscall functions into a central place and restart them, and return to
> user-space later on as if a single call had been performed.
>
> This necessarily means some changes to syscalls, but not insurmountable
> ones - and checkpoint/restore support would want to have similar changes
> in any case so we can hit two birds with the same stone.

Understood. I think this by itself really would be a huge improvement on
how to force tasks to converge to the new universe without confusing the
userspace via artificial signals (i.e. even with the lazy migration
approaches, without a full serialization checkpoint). As I said yesterday, I
am pretty sure we'll experiment with this in kgraft. Thanks for bringing
this idea up,

--
Jiri Kosina
SUSE Labs

2015-02-22 23:02:32

by Andrew Morton

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)

On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina <[email protected]> wrote:

> But if you ask the folks who are hungry for live bug patching, they
> wouldn't care.
>
> You mentioned "10 seconds", that's more or less equal to infinity to them.

"10 seconds outage is unacceptable, but we're running our service on a
single machine with no failover". Who is doing this??

2015-02-23 00:18:58

by Dave Airlie

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)

On 23 February 2015 at 09:01, Andrew Morton <[email protected]> wrote:
> On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina <[email protected]> wrote:
>
>> But if you ask the folks who are hungry for live bug patching, they
>> wouldn't care.
>>
>> You mentioned "10 seconds", that's more or less equal to infinity to them.
>
> "10 seconds outage is unacceptable, but we're running our service on a
> single machine with no failover". Who is doing this??

if I had to guess, telcos generally, you've only got one wire between a phone
and the exchange and if the switch on the end needs patching it better be fast.

Dave.

2015-02-23 00:44:10

by Arjan van de Ven

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)

There's failover, there's running the core services in VMs (which can
migrate)...
I think 10 seconds is Ingo exaggerating a bit, since you can
boot a full system in a lot less time than that, and more so if you
know more about the system
(e.g. don't need to spin down and then discover and spin up disks). If
you're talking about inside a VM it's even more extreme than that.


Now, live patching sounds great as an ideal, but it may end up being
(mostly) similar to hardware hotplug: Everyone wants it, but nobody
wants to use it
(and just waits for a maintenance window instead). In the hotplug
case, while people say they want it, they're also aware that hardware
hotplug is fundamentally messy, and then nobody wants to do it on that
mission critical piece of hardware outside the maintenance window.
(hotswap drives seem to have been the exception to this, that seems to
have been worked out well enough, but that's replace-with-the-same).
I would be very afraid that hot kernel patching ends up in the same
space: The super-mission-critical folks are what it's aimed at, while
those are the exact same folks that would rather wait for the
maintenance window.

There's a lot of logistical issues (can you patch a patched system...
if live patching is a first class citizen you end up with dozens and
dozens of live patches applied, some out of sequence etc etc). There's
the "which patches do I have, and if the first patch for a security
hole was not complete, how do I cope by applying number two". There's
the "which of my 50.000 servers have which patch applied" logistics.

And Ingo is absolutely right: The scope is very fuzzy. Today's bugfix
is tomorrow's "oh oops it turns out to be exploitable".

I will throw a different hat in the ring: Maybe we don't want full
kernel update as step one, maybe we want this on a kernel module
level:
Hot-swap of kernel modules, where a kernel module makes itself go
quiet and serializes its state ("suspend" pretty much), then gets
swapped out (hot) by its replacement,
which then unserializes the state and continues.

If we can do this on a module level, then the next step is treating
more components of the kernel as modules, which is a fundamental
modularity thing.
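
(Purely as a sketch of what such an interface could look like -- nothing
like this exists in the kernel today, and all the names below are made up:
the old module instance would quiesce, hand over a serialized blob of its
state, and the replacement would pick it up.)

#include <linux/types.h>

struct module_swap_ops {
        int     (*quiesce)(void);                        /* stop accepting new work */
        ssize_t (*serialize)(void *buf, size_t len);     /* dump internal state */
        int     (*restore)(const void *buf, size_t len); /* replacement picks it up */
        void    (*resume)(void);                         /* start accepting work again */
};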



On Sun, Feb 22, 2015 at 4:18 PM, Dave Airlie <[email protected]> wrote:
> On 23 February 2015 at 09:01, Andrew Morton <[email protected]> wrote:
>> On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina <[email protected]> wrote:
>>
>>> But if you ask the folks who are hungry for live bug patching, they
>>> wouldn't care.
>>>
>>> You mentioned "10 seconds", that's more or less equal to infinity to them.
>>
>> "10 seconds outage is unacceptable, but we're running our service on a
>> single machine with no failover". Who is doing this??
>
> if I had to guess, telcos generally, you've only got one wire between a phone
> and the exchange and if the switch on the end needs patching it better be fast.
>
> Dave.

2015-02-23 06:36:13

by Vojtech Pavlik

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)

On Sun, Feb 22, 2015 at 03:01:48PM -0800, Andrew Morton wrote:

> On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina <[email protected]> wrote:
>
> > But if you ask the folks who are hungry for live bug patching, they
> > wouldn't care.
> >
> > You mentioned "10 seconds", that's more or less equal to infinity to them.
>
> "10 seconds outage is unacceptable, but we're running our service on a
> single machine with no failover". Who is doing this??

This is the most common argument that's raised when live patching is
discussed. "Why do need live patching when we have redundancy?"

People who are asking for live patching typically do have failover in
place, but prefer not to have to use it when they don't have to.

In many cases, the failover just can't be made transparent to the
outside world and there is a short outage. Examples would be legacy
applications which can't run in an active-active cluster and need to be
restarted on failover. Or trading systems, where the calculations must
be strictly serialized and response times are counted in tens of
microseconds.

Another usecase is large HPC clusters, where all nodes have to run
carefully synchronized. Once one gets behind in a calculation cycle,
others have to wait for the results and the efficiency of the whole
cluster goes down. There are people who run realtime on them for
that reason. Dumping all data and restarting the HPC cluster takes a lot
of time and many nodes (out of tens of thousands) may not come back up,
making the restore from media difficult. Doing a rolling upgrade causes
the nodes to stall one by one for 10+ seconds, which times 10k is a long
time, too.

And even the case where you have a perfect setup with everything
redundant and with instant failover does benefit from live patching.
Since you have to plan for failure, you have to plan for failure while
patching, too. With live patching you need 2 servers minimum (or N+1);
without it you need 3 (or N+2), as one will be offline during the
upgrade process.

10 seconds of outage may be acceptable in a disaster scenario. Not
necessarily for a regular update scenario.

The value of live patching is in near zero disruption.

--
Vojtech Pavlik
Director SUSE Labs

2015-02-23 08:18:03

by Jiri Kosina

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)

On Sun, 22 Feb 2015, Arjan van de Ven wrote:

> There's a lot of logistical issues (can you patch a patched system... if
> live patching is a first class citizen you end up with dozens and dozens
> of live patches applied, some out of sequence etc etc).

I can't speak on behalf of others, but I definitely can speak on behalf of
SUSE, as we are already basing a product on this.

Yes, you can patch a patched system, you can patch one function multiple
times, you can revert a patch. It's all tracked by dependencies.

Of course, if you are random Joe User, you can do whatever you want, i.e.
also compile your own home-brew patches and apply them randomly and brick
your system that way. But that's in no way different to what you as Joe
User can do today; there is nothing that will prevent you from shooting
yourself in a foot if you are creative.

Regarding "out of sequence", this is up to the vendor providing/packaging
the patches to make sure that this is guaranteed not to happen. SUSE for
example always provides an "all-in-one" patch for each and every released
and supported kernel codestream in a cumulative manner, which takes care of
the ordering issue completely.

It's not really too different from shipping external kernel modules and
making sure they have proper dependencies that need to be satisfied before
the module can be loaded.

> There's the "which patches do I have, and if the first patch for a
> security hole was not complete, how do I cope by applying number two".
> There's the "which of my 50.000 servers have which patch applied"
> logistics.

Yes. That's easy if distro/patch vendors make reasonable userspace and
distribution infrastructure around this.

Thanks,

--
Jiri Kosina
SUSE Labs

2015-02-23 10:42:22

by Richard Weinberger

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)

On Mon, Feb 23, 2015 at 9:17 AM, Jiri Kosina <[email protected]> wrote:
> On Sun, 22 Feb 2015, Arjan van de Ven wrote:
> Of course, if you are random Joe User, you can do whatever you want, i.e.
> also compile your own home-brew patches and apply them randomly and brick
> your system that way. But that's in no way different to what you as Joe
> User can do today; there is nothing that will prevent you from shooting
> yourself in a foot if you are creative.

Sorry if I ask something that got already discussed, I did not follow
the whole live-patching discussion.

How much of the userspace tools will be public available?
With live-patching in mainline the kernel offers the mechanism, but
random Joe User still needs the tools to create good live patches.

--
Thanks,
//richard

2015-02-23 11:08:23

by Vojtech Pavlik

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)

On Mon, Feb 23, 2015 at 11:42:17AM +0100, Richard Weinberger wrote:

> > Of course, if you are random Joe User, you can do whatever you want, i.e.
> > also compile your own home-brew patches and apply them randomly and brick
> > your system that way. But that's in no way different to what you as Joe
> > User can do today; there is nothing that will prevent you from shooting
> > yourself in a foot if you are creative.
>
> Sorry if I ask something that got already discussed, I did not follow
> the whole live-patching discussion.
>
> How much of the userspace tools will be public available?
> With live-patching mainline the kernel offers the mechanism, but
> random Joe user still needs
> the tools to create good live patches.

All the tools for kGraft and kpatch are available in public git
repositories.

Also, while kGraft has tools to automate the generation of patches,
these are generally not required to create a patch.

--
Vojtech Pavlik
Director SUSE Labs

2015-02-23 11:42:19

by Pavel Machek

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)

> More importantly, both kGraft and kpatch are pretty limited
> in what kinds of updates they allow, and neither kGraft nor
> kpatch has any clear path towards applying more complex
> fixes to kernel images that I can see: kGraft can only
> apply the simplest of fixes where both versions of a
> function are interchangeable, and kpatch is only marginally
> better at that - and that's pretty fundamental to both
> projects!
>
> I think all of these problems could be resolved by shooting
> for the moon instead:
>
> - work towards allowing arbitrary live kernel upgrades!
>
> not just 'live kernel patches'.

Note that live kernel upgrade would have interesting implications
outside kernel:

1) glibc does "what kernel version is this?" caches result
and alters behaviour accordingly.

2) apps will do recently_introduced_syscall(), get error
and not attempt it again.
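
(The pattern in 2) typically looks something like this in application code;
__NR_new_syscall is only a placeholder standing in for whatever newly added
syscall the application probes. After a live upgrade to a kernel that does
provide the syscall, the cached failure means the application still never
uses it:)

#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_new_syscall
#define __NR_new_syscall 1000   /* placeholder number for some newly added syscall */
#endif

static int new_syscall_works = -1;      /* -1 = not probed yet */

static long try_new_syscall(long arg)
{
        long ret;

        if (new_syscall_works == 0)
                return -1;              /* cached ENOSYS, never retried */

        ret = syscall(__NR_new_syscall, arg);
        if (ret == -1 && errno == ENOSYS) {
                new_syscall_works = 0;
                return -1;
        }

        new_syscall_works = 1;
        return ret;
}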

Pavel

2015-02-23 11:52:49

by Pavel Machek

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)

> kernel update as step one, maybe we want this on a kernel module
> level:
> Hot-swap of kernel modules, where a kernel module makes itself go
> quiet and serializes its state ("suspend" pretty much), then gets
> swapped out (hot) by its replacement,
> which then unserializes the state and continues.

Hmm. So Linux 5.0 will be micro-kernel? :-).

Pavel

2015-02-23 12:43:57

by Jiri Kosina

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())

On Sat, 21 Feb 2015, Ingo Molnar wrote:

> (It does have some other requirements, such as making all
> syscalls interruptible to a 'special' signalling method
> that only live patching triggers - even syscalls that are
> under the normal ABI uninterruptible, such as sys_sync().)

BTW I didn't really understand this -- could you please elaborate on
what exactly you propose to do here in your "simplified" patching
method (i.e. serializing everybody at the kernel boundary) for
TASK_UNINTERRUPTIBLE processes?

That actually seems to be the most crucial problem to me in this respect.
Other things are rather implementation details; no matter whether we are
sending normal SIGCONT or SIGPATCHING with special semantics you have
described above, at the end of the day we end up calling kick_process()
for the task in question, and that makes both interruptible sleepers and
CPU hogs go through the "checkpoint". SIGPATCHING would then be "just" an
improvement of this, making sure that EINTR doesn't spuriously get leaked
to userspace.
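
(The kick_process() step mentioned here, as a minimal sketch: walk all tasks
and poke them so that running ones cross the kernel/user boundary -- and
thus the patching checkpoint -- soon. Illustrative only:)

#include <linux/rcupdate.h>
#include <linux/sched.h>

static void patching_kick_all_tasks(void)
{
        struct task_struct *g, *t;

        rcu_read_lock();
        for_each_process_thread(g, t)
                kick_process(t);        /* make running tasks re-enter the kernel boundary */
        rcu_read_unlock();
}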

But I didn't understand your claims regarding uninterruptible sleeps in
your paragraph above. sys_sync() is one thing, that's just waiting
uninterruptibly for completion. But how about all the mutex waiters in
TASK_UNINTERRUPTIBLE, for example?

Thanks a lot,

--
Jiri Kosina
SUSE Labs

2015-02-24 09:16:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)


* Arjan van de Ven <[email protected]> wrote:

> I think 10 seconds is Ingo exaggerating a bit,
> since you can boot a full system in a lot less time than
> that, and more so if you know more about the system (e.g.
> don't need to spin down and then discover and spin up
> disks). If you're talking about inside a VM it's even
> more extreme than that.

Correct, I mentioned 10 seconds latency to be on the safe
side - but in general I suspect it can be reduced to below
1 second, which should be enough for everyone but the most
specialized cases: even specialized HA servers will update
their systems in low activity maintenance windows.

and we don't design the Linux kernel for weird, extreme
cases, we design for the common, sane case that has the
broadest appeal, and we hope that the feature garners
enough interest to be maintainable.

This is not a problem in general: the weird case can take
care of itself just fine - 'specialized and weird' usually
means there's enough money to throw at special hardware and
human solutions or it goes extinct quickly ...

Thanks,

Ingo

2015-02-24 09:44:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)


* Vojtech Pavlik <[email protected]> wrote:

> On Sun, Feb 22, 2015 at 03:01:48PM -0800, Andrew Morton wrote:
>
> > On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina <[email protected]> wrote:
> >
> > > But if you ask the folks who are hungry for live bug
> > > patching, they wouldn't care.
> > >
> > > You mentioned "10 seconds", that's more or less equal
> > > to infinity to them.
> >
> > 10 seconds outage is unacceptable, but we're running
> > our service on a single machine with no failover. Who
> > is doing this??
>
> This is the most common argument that's raised when live
> patching is discussed. "Why do need live patching when we
> have redundancy?"

My argument is that if we start off with a latency of 10
seconds and improve that gradually, it will be good for
everyone with a clear, actionable route for even those who
cannot take a 10 seconds delay today.

Lets see the use cases:

> [...] Examples would be legacy applications which can't
> run in an active-active cluster and need to be restarted
> on failover.

Most clusters (say web frontends) can take a stoppage of a
couple of seconds.

> [...] Or trading systems, where the calculations must be
> strictly serialized and response times are counted in
> tens of microseconds.

All trading systems I'm aware of have daily maintenance
time periods that can afford at minimum a couple of
seconds of optional maintenance latency: stock trading
systems can be maintained when there's no trading session
(which is many hours), aftermarket or global trading
systems can be maintained when the daily rollover
interest is calculated in a predetermined low activity
period.

> Another usecase is large HPC clusters, where all nodes
> have to run carefully synchronized. Once one gets behind
> in a calculation cycle, others have to wait for the
> results and the efficiency of the whole cluster goes
> down. [...]

I think calculation nodes on large HPC clusters qualify as
the specialized case that I mentioned, where the update
latency could be brought down into the 1 second range.

But I don't think calculation nodes are patched in the
typical case: you might want to patch Internet facing
frontend systems, the rest is left as undisturbed as
possible. So I'm not even sure this is a typical usecase.

In any case, there's no hard limit on how fast such a
kernel upgrade can get in principle, and the folks who care
about that latency will sure help out optimizing it and
many HPC projects are well funded.

> The value of live patching is in near zero disruption.

Latency is a good attribute of a kernel upgrade mechanism,
but it's by far not the only attribute and we should
definitely not design limitations into the approach and
hurt all the other attributes, just to optimize that single
attribute.

I.e. don't make it a single-issue project.

Thanks,

Ingo

2015-02-24 10:23:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)


* Josh Poimboeuf <[email protected]> wrote:

> Your upgrade proposal is an *enormous* disruption to the
> system:
>
> - a latency of "well below 10" seconds is completely
> unacceptable to most users who want to patch the kernel
> of a production system _while_ it's in production.

I think this statement is false for the following reasons.

- I'd say the majority of system operators of production
systems can live with a couple of seconds of delay at a
well defined moment of the day or week - with gradual,
pretty much open ended improvements in that latency
down the line.

- I think your argument ignores the fact that live
upgrades would extend the scope of 'users willing to
patch the kernel of a production system' _enormously_.

For example, I have a production system with this much
uptime:

10:50:09 up 153 days, 3:58, 34 users, load average: 0.00, 0.02, 0.05

Currently I'm reluctant to reboot the system to
upgrade the kernel (due to a reboot's intrusiveness),
which is why it has achieved a relatively high
uptime, but I'd definitely allow the kernel to upgrade
at 0:00am just fine. (I'd even give it up to a few
minutes, as long as TCP connections don't time out.)

And I don't think my usecase is special.

What gradual improvements in live upgrade latency am I
talking about?

- For example the majority of pure user-space process
pages in RAM could be saved from the old kernel over
into the new kernel - i.e. they'd stay in place in RAM,
but they'd be re-hashed for the new data structures.
This avoids a big chunk of checkpointing overhead.

- Likewise, most of the page cache could be saved from an
old kernel to a new kernel as well - further reducing
checkpointing overhead.

- The PROT_NONE mechanism of the current NUMA balancing
code could be used to transparently mark user-space
pages as 'checkpointed'. This would reduce system
interruption as only 'newly modified' pages would have
to be checkpointed when the upgrade happens.

- Hardware devices could be marked as 'already in well
defined state', skipping the more expensive steps of
driver initialization.

- Possibly full user-space page tables could be preserved
over an upgrade: this way user-space execution would be
unaffected even in the micro level: cache layout, TLB
patterns, etc.

There's lots of gradual speedups possible with such a model
IMO.

With live kernel patching we run into a brick wall of
complexity straight away: we have to analyze the nature of
the kernel modification, in the context of live patching,
and that only works for the simplest of kernel
modifications.

With live kernel upgrades no such brick wall exists, just
about any transition between kernel versions is possible.

Granted, with live kernel upgrades it's much more complex
to get the 'simple' case into an even rudimentarily working
fashion (full userspace state has to be enumerated, saved
and restored), but once we are there, it's a whole new
category of goodness and it probably covers 90%+ of the
live kernel patching usecases on day 1 already ...

Thanks,

Ingo

2015-02-24 10:25:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)


* Pavel Machek <[email protected]> wrote:

> > More importantly, both kGraft and kpatch are pretty limited
> > in what kinds of updates they allow, and neither kGraft nor
> > kpatch has any clear path towards applying more complex
> > fixes to kernel images that I can see: kGraft can only
> > apply the simplest of fixes where both versions of a
> > function are interchangeable, and kpatch is only marginally
> > better at that - and that's pretty fundamental to both
> > projects!
> >
> > I think all of these problems could be resolved by shooting
> > for the moon instead:
> >
> > - work towards allowing arbitrary live kernel upgrades!
> >
> > not just 'live kernel patches'.
>
> Note that live kernel upgrade would have interesting
> implications outside kernel:
>
> 1) glibc does "what kernel version is this?" caches
> result and alters behaviour accordingly.

That should be OK, as a new kernel will be ABI compatible
with an old kernel.

A later optimization could update the glibc cache on an
upgrade, fortunately both projects are open source.

> 2) apps will do recently_introduced_syscall(), get error
> and not attempt it again.

That should be fine too.

Thanks,

Ingo

2015-02-24 10:37:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: live patching design (was: Re: [PATCH 1/3] sched: add sched_task_call())


* Jiri Kosina <[email protected]> wrote:

> On Sat, 21 Feb 2015, Ingo Molnar wrote:
>
> > (It does have some other requirements, such as making
> > all syscalls interruptible to a 'special' signalling
> > method that only live patching triggers - even syscalls
> > that are under the normal ABI uninterruptible, such as
> > sys_sync().)
>
> BTW I didn't really understand this -- could you please
> elaborate on what exactly you propose to do here in your
> "simplified" patching method (i.e. serializing everybody
> at the kernel boundary) for TASK_UNINTERRUPTIBLE
> processes?

So I'd try to separate out the two main categories of
uninterruptible sleepers:

- those who just serialize with other local CPUs/tasks
relatively quickly

- those who are waiting for some potentially very long and
open ended request. [such as IO, potentially network IO.]

I'd only touch the latter: a prominent example would be
sys_sync(). I'd leave alone the myriads of other
uninterruptible sleepers.

> But I didn't understand your claims regarding
> uninterruptible sleeps in your paragraph above.
> sys_sync() is one thing, that's just waiting
> uninterruptibly for completion. But how about all the
> mutex waiters in TASK_UNINTERRUPTIBLE, for example?

I'd not touch those - unless they are waiting for something
that will not be done by the time we park all tasks: for
example NFS might have uninterruptible sleeps, and
sys_sync() will potentially do IO for minutes.

I think it would be the exception, not the rule - but it
would give us an approach that allows us to touch 'any'
kernel code if its wait times are unreasonably long or open
ended.

Thanks,

Ingo

2015-02-24 10:53:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)


* Jiri Kosina <[email protected]> wrote:

> [...] We could optimize the kernel the craziest way we
> can, but hardware takes its time to reinitialize. And in
> most cases, you'd really need to reinitialize it; [...]

If we want to reinitialize a device, most of the longer
initialization latencies during bootup these days involve
things like: 'poke hardware, see if there's any response'.
Those are mostly going away quickly with modern,
well-enumerated hardware interfaces.

Just try a modprobe of a random hardware driver - most
initialization sequences are very fast. (That's how people
are able to do cold bootups in less than 1 second.)

In theory this could also be optimized: we could avoid the
reinitialization step through an upgrade via relatively
simple means, for example if drivers define their own
version and the new kernel's driver checks whether the
previous state is from a compatible driver. Then the new
driver could do a shorter initialization sequence.

But I'd only do it only in special cases, where for some
reason the initialization sequence takes longer time and it
makes sense to share hardware discovery information between
two versions of the driver. I'm not convinced such a
mechanism is necessary in the general case.

Thanks,

Ingo

2015-02-24 11:10:55

by Petr Mladek

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)

On Tue 2015-02-24 11:23:29, Ingo Molnar wrote:
> What gradual improvements in live upgrade latency am I
> talking about?
>
> - For example the majority of pure user-space process
> pages in RAM could be saved from the old kernel over
> into the new kernel - i.e. they'd stay in place in RAM,
> but they'd be re-hashed for the new data structures.

I wonder how many structures we would need to rehash when we update
the whole kernel. I think that it is not only about memory but also
about any other subsystem: networking, scheduler, ...


> - Hardware devices could be marked as 'already in well
> defined state', skipping the more expensive steps of
> driver initialization.

This is another point that might easily go wrong. We know that
the quality of many drivers is not good. Yes, we want to make it
better. But we also know that system suspend has not worked well
on many systems for years, even with huge effort.


> - Possibly full user-space page tables could be preserved
> over an upgrade: this way user-space execution would be
> unaffected even in the micro level: cache layout, TLB
> patterns, etc.
>
> There's lots of gradual speedups possible with such a model
> IMO.
>
> With live kernel patching we run into a brick wall of
> complexity straight away: we have to analyze the nature of
> the kernel modification, in the context of live patching,
> and that only works for the simplest of kernel
> modifications.
>
> With live kernel upgrades no such brick wall exists, just
> about any transition between kernel versions is possible.

I see here a big difference in the complexity. If verifying patches
is considered complex, then I think that it is much, much more
complicated to verify that the whole kernel upgrade is safe and
that all states will be properly preserved and reused.

Otherwise, I think that live patching won't be for any Joe User.
The people producing patches will need to investigate the
changes anyway. They will not blindly take a patch from the internet
and convert it to a live patch. I think that this is true for
many other kernel features.


> Granted, with live kernel upgrades it's much more complex
> to get the 'simple' case into an even rudimentarily working
> fashion (full userspace state has to be enumerated, saved
> and restored), but once we are there, it's a whole new
> category of goodness and it probably covers 90%+ of the
> live kernel patching usecases on day 1 already ...

I like the idea and I see the benefit for other tasks: system suspend,
migration of systems to another hardware, ... But I also think that it
is another level of functionality.

IMHO, live patching is somewhere on the way to the full kernel
update and it will help as well. For example, we will need to somehow
solve transition of kthreads and thus fix their parking.

I think that live patching deserves its separate solution. I consider
it much less risky but still valuable. I am sure that it will have
its users. Also it will not block improving things for the full
update in the future.


Best Regards,
Petr

2015-02-24 12:12:01

by Jiri Slaby

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)

On 02/22/2015, 10:46 AM, Ingo Molnar wrote:
> Arbitrary live kernel upgrades could be achieved by
> starting with the 'simple method' I outlined in earlier
> mails, using some of the methods that kpatch and kGraft are
> both utilizing or planning to utilize:
>
> - implement user task and kthread parking to get the
> kernel into quiescent state.
>
> - implement (optional, thus ABI-compatible)
> system call interruptability and restartability
> support.
>
> - implement task state and (limited) device state
> snapshotting support
>
> - implement live kernel upgrades by:
>
> - snapshotting all system state transparently
>
> - fast-rebooting into the new kernel image without
> shutting down and rebooting user-space, i.e. _much_
> faster than a regular reboot.
>
> - restoring system state transparently within the new
> kernel image and resuming system workloads where
> they were left.
>
> Even complex external state like TCP socket state and
> graphics state can be preserved over an upgrade. As far as
> the user is concerned, nothing happened but a brief pause -
> and he's now running a v3.21 kernel, not v3.20.
>
> Obviously one of the simplest utilizations of live kernel
> upgrades would be to apply simple security fixes to
> production systems. But that's just a very simple
> application of a much broader capability.
>
> Note that if done right, then the time to perform a live
> kernel upgrade on a typical system could be brought to well
> below 10 seconds system stoppage time: adequate to the vast
> majority of installations.
>
> For special installations or well optimized hardware the
> latency could possibly be brought below 1 second stoppage
> time.

Hello,

IMNSHO, you cannot.

The criu-based approach you have just described is already alive as an
external project in Parallels. It is of course a perfect solution for
some use cases. But its use case is a distinctive one. It is not our
competitor, it is our complementer. I will try to explain why.

It is highly dependent on HW. Kexec (or any other arbitrary
kernel-exchange mechanism) is not supported by all HW, nor by all
drivers. There is not even a way to implement snapshotting for some
devices, which is a real issue, obviously.

Downtime is highly dependent on the scenario. If you have plenty of
dirty memory, you have to flush it first. This might take minutes,
especially when using a network FS. Or you can skip the flush, but a
failure to replace the kernel is then lethal. If you have a heap of open
FDs, restore time will take ages. You cannot fool any of this: it's pure
I/O. You cannot estimate the downtime, and that is a real downside.

Even if you can get the criu time under one second, this is still
unacceptable for live patching. Live patching has to be three orders of
magnitude faster than that, otherwise it makes no sense. If you can
afford a second, you probably already have a large enough maintenance
window or failure handling to perform a full and, mainly, safer
reboot/kexec anyway.

You cannot restore everything.
* TCP is one of the real beasts in this. And there are indeed plenty of
theoretical papers behind this, explaining what can and cannot be done.
* NFS is another one.
* Xorg. Today, we cannot even fluently switch between discrete and
integrated GFX chips. No go.
* There indeed are situations where NP-hard problems need to be solved
upon restoration. No way, if you want to finish the restore within this
century.

While you cannot live-patch everything using KLP either, whether a
particular fix can be applied is patch-dependent. Failure of restoration
is condition-dependent, and the condition is really fuzzy. That is a
huge difference.

Although you present the criu-based approach as provably safe and
correct, it is not in many cases, and cannot be by definition.

That said, we are not going to start moving that way, apart from adopting
the many good points which emerged during the discussion (fake signals,
to pick one).

> This 'live kernel upgrades' approach would have various
> advantages:
>
> - it brings together various principles working towards
> shared goals:
>
> - the boot time reduction folks
> - the checkpoint/restore folks
> - the hibernation folks
> - the suspend/resume and power management folks
> - the live patching folks (you)
> - the syscall latency reduction folks
>
> if so many disciplines are working together then maybe
> something really good and long term maintainable can
> crystallize out of that effort.

I must admit, whenever I implemented something in the kernel, nobody did
any work for me. So the above will only result in the live patching teams
doing all the work. I am not saying we do not want to do the work. I am
only pointing out that there is no such thing as "working together with
other teams" (unless we are paying their bills).

> - it ignores the security theater that treats security
> fixes as a separate, disproportionally more important
> class of fixes and instead allows arbitrary complex
> changes over live kernel upgrades.

Hmm, more changes, more regressions. Complex changes, even more
regressions. No customer desires complex changes in such updates.

> - there's no need to 'engineer' live patches separately,
> there's no need to review them and their usage sites
> for live patching relevant side effects. Just create a
> 'better' kernel as defined by users of that kernel:

Review is a basic process which has to be done in any case.

The ABI is supposedly stable, but not so much in reality. criu has the
same deficiency as KLP here:
* One example is the file size of entries in /sys or /proc. That can
change, and you have to take care of it, as processes "assume" something
about it.
* Return values of syscalls are standardized, but nothing prevents
anybody from changing them in subsequent kernels. And state machines in
processes might be confused by a different retval from two subsequent
syscalls (provided by two different kernels).

> - in the enterprise distro space create a more stable
> kernel and allow transparent upgrades into it.

This is IMHO unsupportable.

> We have many of the building blocks in place and have them
> available:
>
> - the freezer code already attempts at parking/unparking
> threads transparently, that could be fixed/extended.

That is broken in many funny ways. It needs to be fixed in any case:
nothing defines a good freezing point, and something of course should.
And would we want to use those well-defined points? No doubt. The
freezer would of course benefit too.

> - hibernation, regular suspend/resume and in general
> power management has in essence already implemented
> most building blocks needed to enumerate and
> checkpoint/restore device state that otherwise gets
> lost in a shutdown/reboot cycle.

Not at all. A lot of suspend/resume hooks result only in a
shutdown/reset. That is not what criu wants. And in many cases,
implementing c/r is not feasible (see above).

> A feature like arbitrary live kernel upgrades would be well
> worth the pain and would be worth the complications, and
> it's actually very feasible technically.

Yes, I like criu very much, but it is not going to save the universe,
as you seem to believe. Neither will KLP. Remember, they are
complementary. Maybe Pavel can comment on this too.

> The goals of the current live kernel patching projects,
> "being able to apply only the simplest of live patches",
> which would in my opinion mostly serve the security
> theater?

No, we all pray to the KISS principle. Having something basic which
works and can be built upon is everything we want as a starting point.
Extending the functionality from there is the right way. This is not our
idea; it is what current software management thinking says. User-driven
development is the kind that gets called successful.

> They are not forward looking enough, and in that
> sense they could even be counterproductive.

Being able to apply fixes for over 90% of the CVEs in 3.20 does not
sound bad or counterproductive to me at all, sorry.

thanks,
--
js
suse labs

2015-02-24 12:12:22

by Vojtech Pavlik

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)

On Tue, Feb 24, 2015 at 10:44:05AM +0100, Ingo Molnar wrote:

> > This is the most common argument that's raised when live
> > patching is discussed. "Why do we need live patching when we
> > have redundancy?"
>
> My argument is that if we start off with a latency of 10
> seconds and improve that gradually, it will be good for
> everyone with a clear, actionable route for even those who
> cannot take a 10 seconds delay today.

Sure, we can do it that way.

Or do it in the other direction.

Today we have a tool (livepatch) in the kernel that can apply trivial
single-function fixes without a measurable disruption to applications.
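
For illustration, this is roughly what such a trivial single-function fix
looks like with the klp API (modeled loosely on the in-tree livepatch
sample; the replaced function and its new behaviour here are made up for
the example, not an actual fix):

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/livepatch.h>
#include <linux/seq_file.h>

/* Replacement body for the function being patched. */
static int livepatch_cmdline_proc_show(struct seq_file *m, void *v)
{
	seq_printf(m, "%s\n", "this has been live patched");
	return 0;
}

static struct klp_func funcs[] = {
	{
		.old_name = "cmdline_proc_show",
		.new_func = livepatch_cmdline_proc_show,
	}, { }
};

static struct klp_object objs[] = {
	{
		/* name being NULL means vmlinux */
		.funcs = funcs,
	}, { }
};

static struct klp_patch patch = {
	.mod = THIS_MODULE,
	.objs = objs,
};

static int livepatch_init(void)
{
	int ret;

	ret = klp_register_patch(&patch);
	if (ret)
		return ret;
	ret = klp_enable_patch(&patch);
	if (ret) {
		WARN_ON(klp_unregister_patch(&patch));
		return ret;
	}
	return 0;
}

static void livepatch_exit(void)
{
	WARN_ON(klp_disable_patch(&patch));
	WARN_ON(klp_unregister_patch(&patch));
}

module_init(livepatch_init);
module_exit(livepatch_exit);
MODULE_LICENSE("GPL");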

And we can improve it gradually to expand the range of fixes it can
apply.

Dependent functions can be done by kGraft's lazy migration.

Limited data structure changes can be handled by shadowing.
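
To illustrate the shadowing idea (just a sketch; the helpers below are
invented for the example, not an existing kernel API): the extra field a
patch wants to attach to an existing structure lives in a separate hash
table keyed by the address of the original object, so the original
structure's layout never changes:

#include <linux/errno.h>
#include <linux/hashtable.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Maps an existing object to the extra data the patched code needs. */
struct shadow_entry {
	struct hlist_node node;
	void *obj;		/* the original, unmodified object */
	unsigned long new_field;/* the data the patch wants to attach */
};

static DEFINE_HASHTABLE(shadow_hash, 8);
static DEFINE_SPINLOCK(shadow_lock);

static int shadow_attach(void *obj, unsigned long value)
{
	struct shadow_entry *e = kzalloc(sizeof(*e), GFP_KERNEL);

	if (!e)
		return -ENOMEM;
	e->obj = obj;
	e->new_field = value;
	spin_lock(&shadow_lock);
	hash_add(shadow_hash, &e->node, (unsigned long)obj);
	spin_unlock(&shadow_lock);
	return 0;
}

static struct shadow_entry *shadow_find(void *obj)
{
	struct shadow_entry *e;

	spin_lock(&shadow_lock);
	hash_for_each_possible(shadow_hash, e, node, (unsigned long)obj) {
		if (e->obj == obj) {
			spin_unlock(&shadow_lock);
			return e;
		}
	}
	spin_unlock(&shadow_lock);
	return NULL;
}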

Major data structure and/or locking changes require stopping the kernel,
and trapping all tasks at the kernel/userspace boundary is clearly the
cleanest way to do that. It comes at a steep latency cost, though.

Full code replacement without any consideration of the scope of the
change requires full serialization and deserialization of hardware and
userspace interface state, which is something we don't have today and
which would require work on every single driver. Possible, but probably
a decade of effort.

With this approach you have something useful at every point, and every
piece of effort put in gives you a reward.

> Lets see the use cases:
>
> > [...] Examples would be legacy applications which can't
> > run in an active-active cluster and need to be restarted
> > on failover.
>
> Most clusters (say web frontends) can take a stoppage of a
> couple of seconds.

It's easy to find examples of workloads that can be stopped. That
doesn't rule out the significant set of workloads where stopping them is
very expensive.

> > Another usecase is large HPC clusters, where all nodes
> > have to run carefully synchronized. Once one gets behind
> > in a calculation cycle, others have to wait for the
> > results and the efficiency of the whole cluster goes
> > down. [...]
>
> I think calculation nodes on large HPC clusters qualify as
> the specialized case that I mentioned, where the update
> latency could be brought down into the 1 second range.
>
> But I don't think calculation nodes are patched in the
> typical case: you might want to patch Internet facing
> frontend systems, the rest is left as undisturbed as
> possible. So I'm not even sure this is a typical usecase.

They're not patched for security bugs, but stability bugs are an
important issue for multi-month calculations.

> In any case, there's no hard limit on how fast such a
> kernel upgrade can get in principle, and the folks who care
> about that latency will sure help out optimizing it and
> many HPC projects are well funded.

So far, unless you come up with an effective solution, if you're
catching all tasks at the kernel/userspace boundary (the "Kragle"
approach), the service interruption is effectively unbounded due to
tasks in D state.

> > The value of live patching is in near zero disruption.
>
> Latency is a good attribute of a kernel upgrade mechanism,
> but it's by far not the only attribute and we should
> definitely not design limitations into the approach and
> hurt all the other attributes, just to optimize that single
> attribute.

It's an attribute I'm not willing to give up. On the other hand, I
definitely wouldn't argue against having modes of operation where the
latency is higher and the tool is more powerful.

> I.e. don't make it a single-issue project.

There is no need to worry about that.

--
Vojtech Pavlik
Director SUSE Labs

2015-02-24 12:19:22

by Vojtech Pavlik

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)

On Tue, Feb 24, 2015 at 11:53:28AM +0100, Ingo Molnar wrote:
>
> * Jiri Kosina <[email protected]> wrote:
>
> > [...] We could optimize the kernel the craziest way we
> > can, but hardware takes its time to reinitialize. And in
> > most cases, you'd really need to reinitialize it; [...]
>
> If we want to reinitialize a device, most of the longer
> initialization latencies during bootup these days involve
> things like: 'poke hardware, see if there's any response'.
> Those are mostly going away quickly with modern,
> well-enumerated hardware interfaces.
>
> Just try a modprobe of a random hardware driver - most
> initialization sequences are very fast. (That's how people
> are able to do cold bootups in less than 1 second.)

Have you ever tried to boot a system with a large (> 100) number of
drives connected over FC? The discovery takes time, and you have to do
it, because the configuration could have changed while you were not
looking.

Or a machine with terabytes of memory? Just initializing the memory
takes minutes.

Or a desktop with USB? You have to reinitialize the USB bus and the
state of all the USB devices, because an application might be accessing
files on a USB drive.

> In theory this could also be optimized: we could avoid the
> reinitialization step through an upgrade via relatively
> simple means, for example if drivers define their own
> version and the new kernel's driver checks whether the
> previous state is from a compatible driver. Then the new
> driver could do a shorter initialization sequence.

There you're clearly getting into the "so complex to maintain that it'll
never work reliably" territory.

> But I'd only do it only in special cases, where for some
> reason the initialization sequence takes longer time and it
> makes sense to share hardware discovery information between
> two versions of the driver. I'm not convinced such a
> mechanism is necessary in the general case.

--
Vojtech Pavlik
Director SUSE Labs

2015-02-24 12:28:55

by Jiri Slaby

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)

On 02/24/2015, 10:16 AM, Ingo Molnar wrote:
> and we don't design the Linux kernel for weird, extreme
> cases, we design for the common, sane case that has the
> broadest appeal, and we hope that the feature garners
> enough interest to be maintainable.

Hello,

oh, so why do we have NR_CPUS up to 8192, then? I haven't met a machine
with more than 16 cores yet. You did. But you haven't met a guy who is
thankful that live patching is so easy to implement, yet fast. I did.

What some call extreme, others accept as standard. That is, I believe,
why you signed off on the support for up to 8192 CPUs.

We develop Linux to be scalable, i.e. usable in *whatever* scenario you
can imagine in any world. Be it large or small machines, lowmem/highmem,
NUMA/UMA, whatever. If you don't like something, you are free to disable
it. Democracy.

> This is not a problem in general: the weird case can take
> care of itself just fine - 'specialized and weird' usually
> means there's enough money to throw at special hardware and
> human solutions or it goes extinct quickly ...

Live patching is not a random idea which is about to die. It is the
result of months of negotiations with customers and management,
discussions between developers, establishing teams and really thinking
the idea through. The decisions were discussed at many conferences too.
I am trying to shed some light on why we are not trying to improve criu
or any other already existing project instead. We studied papers, code,
implementations, kSplice and the like, and decided to lean towards what
we have implemented, presented and merged.

thanks,
--
js
suse labs

2015-02-24 12:36:07

by Vojtech Pavlik

[permalink] [raw]
Subject: Re: live kernel upgrades (was: live kernel patching design)

On Tue, Feb 24, 2015 at 11:23:29AM +0100, Ingo Molnar wrote:

> > Your upgrade proposal is an *enormous* disruption to the
> > system:
> >
> > - a latency of "well below 10" seconds is completely
> > unacceptable to most users who want to patch the kernel
> > of a production system _while_ it's in production.
>
> I think this statement is false for the following reasons.

The statement is very true.

> - I'd say the majority of system operators of production
> systems can live with a couple of seconds of delay at a
> well defined moment of the day or week - with gradual,
> pretty much open ended improvements in that latency
> down the line.

In the usual corporate setting, any noticeable outage, even outside
business hours, requires advance notice and the agreement of all
stakeholders - the teams that depend on the system.

If a live patching technology introduces an outage, it's not "live", and
for these bureaucratic reasons it will not be used; a regular reboot
will be scheduled instead.

> - I think your argument ignores the fact that live
> upgrades would extend the scope of 'users willing to
> patch the kernel of a production system' _enormously_.
>
> For example, I have a production system with this much
> uptime:
>
> 10:50:09 up 153 days, 3:58, 34 users, load average: 0.00, 0.02, 0.05
>
> While currently I'm reluctant to reboot the system to
> upgrade the kernel (due to a reboot's intrusiveness),
> and that is why it has achieved a relatively high
> uptime, but I'd definitely allow the kernel to upgrade
> at 0:00am just fine. (I'd even give it up to a few
> minutes, as long as TCP connections don't time out.)
>
> And I don't think my usecase is special.

I agree that this is useful. But it is a different problem that only
partially overlaps with what we're trying to achieve with live patching.

If you can make full kernel upgrades work this way, which I doubt is
achievable in the next 10 years due to all the research and
infrastructure needed, then you certainly gain an additional group of
users. And a great tool. A large portion of those who ask for live
patching won't use it, though.

But honestly, I prefer a solution that works for small patches now to
a solution for unlimited patches sometime in the next decade.

> What gradual improvements in live upgrade latency am I
> talking about?
>
> - For example the majority of pure user-space process
> pages in RAM could be saved from the old kernel over
> into the new kernel - i.e. they'd stay in place in RAM,
> but they'd be re-hashed for the new data structures.
> This avoids a big chunk of checkpointing overhead.

I'd have hoped this would be a given. If you can't preserve memory
contents and have to re-load from disk, you may just as well reboot
entirely; the time needed will not be much more.

> - Likewise, most of the page cache could be saved from an
> old kernel to a new kernel as well - further reducing
> checkpointing overhead.
>
> - The PROT_NONE mechanism of the current NUMA balancing
> code could be used to transparently mark user-space
> pages as 'checkpointed'. This would reduce system
> interruption as only 'newly modified' pages would have
> to be checkpointed when the upgrade happens.
>
> - Hardware devices could be marked as 'already in well
> defined state', skipping the more expensive steps of
> driver initialization.
>
> - Possibly full user-space page tables could be preserved
> over an upgrade: this way user-space execution would be
> unaffected even in the micro level: cache layout, TLB
> patterns, etc.
>
> There's lots of gradual speedups possible with such a model
> IMO.

Yes, as I say above, guaranteeing decades of employment. ;)

> With live kernel patching we run into a brick wall of
> complexity straight away: we have to analyze the nature of
> the kernel modification, in the context of live patching,
> and that only works for the simplest of kernel
> modifications.

But you're able to _use_ it.

> With live kernel upgrades no such brick wall exists, just
> about any transition between kernel versions is possible.

The brick wall you run into is "I need to implement full kernel state
serialization before I can do anything at all." It isn't even clear
_how_ to do that. Particularly with the Linux kernel's development
model, where the internal ABI and structures are always in flux, it may
not even be realistic.

> Granted, with live kernel upgrades it's much more complex
> to get the 'simple' case into an even rudimentarily working
> fashion (full userspace state has to be enumerated, saved
> and restored), but once we are there, it's a whole new
> category of goodness and it probably covers 90%+ of the
> live kernel patching usecases on day 1 already ...

Feel free to start working on it. I'll stick with live patching.

--
Vojtech Pavlik
Director SUSE Labs

2015-02-24 13:18:40

by Pavel Emelyanov

[permalink] [raw]
Subject: Re: live kernel upgrades

On 02/24/2015 03:11 PM, Jiri Slaby wrote:
> On 02/22/2015, 10:46 AM, Ingo Molnar wrote:
>> Arbitrary live kernel upgrades could be achieved by
>> starting with the 'simple method' I outlined in earlier
>> mails, using some of the methods that kpatch and kGraft are
>> both utilizing or planning to utilize:
>>
>> - implement user task and kthread parking to get the
>> kernel into quiescent state.
>>
>> - implement (optional, thus ABI-compatible)
>> system call interruptability and restartability
>> support.
>>
>> - implement task state and (limited) device state
>> snapshotting support
>>
>> - implement live kernel upgrades by:
>>
>> - snapshotting all system state transparently
>>
>> - fast-rebooting into the new kernel image without
>> shutting down and rebooting user-space, i.e. _much_
>> faster than a regular reboot.
>>
>> - restoring system state transparently within the new
>> kernel image and resuming system workloads where
>> they were left.
>>
>> Even complex external state like TCP socket state and
>> graphics state can be preserved over an upgrade. As far as
>> the user is concerned, nothing happened but a brief pause -
>> and he's now running a v3.21 kernel, not v3.20.
>>
>> Obviously one of the simplest utilizations of live kernel
>> upgrades would be to apply simple security fixes to
>> production systems. But that's just a very simple
>> application of a much broader capability.
>>
>> Note that if done right, then the time to perform a live
>> kernel upgrade on a typical system could be brought to well
>> below 10 seconds system stoppage time: adequate to the vast
>> majority of installations.
>>
>> For special installations or well optimized hardware the
>> latency could possibly be brought below 1 second stoppage
>> time.
>
> Hello,
>
> IMNSHO, you cannot.
>
> The criu-based approach you have just described is already alive as an
> external project in Parallels. It is of course a perfect solution for
> some use cases. But its use case is a distinctive one. It is not our
> competitor, it is our complementer. I will try to explain why.

I fully agree -- these two approaches are not replacements for one
another. The only sensible way to go is to use them both, each where
appropriate.

But I have some comments on Jiri's points; please find them inline.

> It is highly dependent on HW. Kexec is not (or any other arbitrary
> kernel-exchange mechanism would not be) supported by all HW, neither
> drivers. There is not even a way to implement snapshotting for some
> devices which is a real issue, obviously.
>
> Downtime is highly dependent on the scenario. If you have a plenty of
> dirty memory, you have to flush first. This might be minutes, especially
> when using a network FS. Or you need not, but a failure to replace a
> kernel is then lethal.

It's not completely true. We can leave the dirty memory in memory (only
flushing the critical metadata) and remember that it is dirty. If the
node crashes during the reboot, this would just look as if you had
written into a file with write() and then crashed without fsync().
I.e. -- the update of the page cache is lost, but that is, well,
sometimes expected to happen.

This doesn't eliminate the downtime completely, but it helps to reduce
it several times over.

> If you have a heap of open FD, restore time will take ages.

One trick exists here too. For disk filesystems we can pull the
"relevant parts" of the block device cache into memory before freezing
the processes. In that case the open()s would bring up the dentries and
inodes without big I/O. It would still take a while, but not ages.
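
As a purely illustrative sketch of that kind of pre-warming (not CRIU
code; the helper below is hypothetical), a tool could walk the task's
/proc/<pid>/fd and stat() every regular file before the freeze, so the
relevant metadata blocks are already sitting in the cache that survives
the switch:

#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Touch the metadata of every file a task has open. */
static void prewarm_task_fds(pid_t pid)
{
	char fddir[64], link[PATH_MAX], target[PATH_MAX];
	struct dirent *d;
	struct stat st;
	ssize_t len;
	DIR *dir;

	snprintf(fddir, sizeof(fddir), "/proc/%d/fd", (int)pid);
	dir = opendir(fddir);
	if (!dir)
		return;

	while ((d = readdir(dir))) {
		if (d->d_name[0] == '.')
			continue;
		snprintf(link, sizeof(link), "%s/%s", fddir, d->d_name);
		len = readlink(link, target, sizeof(target) - 1);
		if (len <= 0)
			continue;
		target[len] = '\0';
		/* Pipes, sockets etc. show up as "type:[inode]"; skip them. */
		if (target[0] != '/')
			continue;
		/* Pull the dentry/inode (and its disk blocks) into the cache. */
		stat(target, &st);
	}
	closedir(dir);
}

int main(int argc, char **argv)
{
	if (argc > 1)
		prewarm_task_fds((pid_t)atoi(argv[1]));
	return 0;
}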

> You cannot fool any of those. It's pure I/O. You cannot
> estimate the downtime and that is a real downside.

Generally you're right. We cannot bring the downtime down to the same
small values as real live patching does. Not even close. But there are
tricks that can shorten it.

> Even if you can get the criu time under one second, this is still
> unacceptable for live patching. Live patching shall be by 3 orders of
> magnitude faster than that, otherwise it makes no sense. If you can
> afford a second, you probably already have a large enough windows or
> failure handling to perform a full and mainly safer reboot/kexec anyway.

Fully agree here.

> You cannot restore everything.
> * TCP is one of the pure beasts in this. And there is indeed a plenty of
> theoretical papers behind this, explaining what can or cannot be done.
> * NFS is another one.
> * Xorg. Today, we cannot even fluently switch between discrete and
> integrated GFX chips. No go.
> * There indeed are situations, where NP-hard problems need to be solved
> upon restoration. No way, if you want to restore yet in this century.

+1. Candidates for those are process linkage (pids, ppids, sids and pgids,
especially orphaned) and mountpoints (all of them -- shared, slave, private,
bind and real) in several mount namespaces.

> While you cannot live-patch everything using KLP, it is patch-dependent.
> Failure of restoration is condition-dependent and the condition is
> really fuzzy. That is a huge difference.
>
> Despite you put criu-based approach as provably safe and correct, it is
> not in many cases and cannot be by definition.
>
> That said, we are not going to start moving that way, except the many
> good points which emerged during the discussion (fake signals to pick one).
>
>> This 'live kernel upgrades' approach would have various
>> advantages:
>>
>> - it brings together various principles working towards
>> shared goals:
>>
>> - the boot time reduction folks
>> - the checkpoint/restore folks
>> - the hibernation folks
>> - the suspend/resume and power management folks
>> - the live patching folks (you)
>> - the syscall latency reduction folks
>>
>> if so many disciplines are working together then maybe
>> something really good and long term maintainable can
>> crystallize out of that effort.
>
> I must admit, whenever I implemented something in the kernel, nobody did
> any work for me. So the above will only result in live patching teams to
> do all the work. I am not saying we do not want to do the work. I am
> only pointing out that there is nothing like "work together with other
> teams" (unless we are sending them their pay-bills).
>
>> - it ignores the security theater that treats security
>> fixes as a separate, disproportionally more important
>> class of fixes and instead allows arbitrary complex
>> changes over live kernel upgrades.
>
> Hmm, more changes, more regressions. Complex changes, even more
> regressions. No customer desires complex changes in such updates.
>
>> - there's no need to 'engineer' live patches separately,
>> there's no need to review them and their usage sites
>> for live patching relevant side effects. Just create a
>> 'better' kernel as defined by users of that kernel:
>
> Review is the basic process which has to be done in any way.
>
> ABI is stable, not much in reality. criu has the same deficiency as KLP
> in here:
> * One example is file size of entries in /sys or /proc. That can change
> and you have to take care of it as processes "assume" something.
> * Return values of syscalls are standardized, but nothing protects
> anybody to change them in subsequent kernels. But state machines in
> processes might be confused by a different retval from two subsequent
> syscalls (provided by two kernels).

Well, this bites us even without C/R. We've seen several times that a
container with some "stabilized" distro fails on a newer kernel because
ESRCH is returned instead of ENOENT.
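
A contrived userspace fragment (hypothetical, not taken from a real
distro) shows how hard-coding a single errno value makes such a state
machine fragile across kernel versions:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * The application decides that a task has exited only when it sees the
 * exact errno it was written against.  If a newer kernel returns a
 * different (but equally sensible) code for the same condition, the
 * application's state machine gets stuck.
 */
static int task_has_exited(pid_t pid)
{
	char path[64];
	int fd;

	snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
	fd = open(path, O_RDONLY);
	if (fd >= 0) {
		close(fd);
		return 0;		/* still there */
	}
	return errno == ENOENT;		/* breaks if ESRCH shows up instead */
}

int main(void)
{
	printf("exited: %d\n", task_has_exited(999999));
	return 0;
}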

>> - in the enterprise distro space create a more stable
>> kernel and allow transparent upgrades into it.
>
> This is IMHO unsupportable.
>
>> We have many of the building blocks in place and have them
>> available:
>>
>> - the freezer code already attempts at parking/unparking
>> threads transparently, that could be fixed/extended.
>
> That is broken in many funny ways. It needs to be fixed in any case:
> nothing defines a good freezing point and something of course should.
> And if we want to use those well-defined points? No doubt. Freezer will
> benefit of course too.

BTW, we don't need the freezer. The thing is that we need to inject
parasite code into the processes we dump, so the process has to remain
in a runnable state. Thus we stop and freeze tasks with PTRACE_SEIZE
instead.
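
For reference, a minimal sketch of seizing and stopping a task that way
(error handling trimmed, threads ignored; this is not CRIU's actual
code, and it assumes a libc that exposes PTRACE_SEIZE/PTRACE_INTERRUPT):

#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Seize a task and bring it into a trace stop, ready for e.g. parasite
 * code injection, without sending it any signal of its own. */
static int seize_and_stop(pid_t pid)
{
	int status;

	if (ptrace(PTRACE_SEIZE, pid, NULL, NULL))
		return -1;
	if (ptrace(PTRACE_INTERRUPT, pid, NULL, NULL))
		return -1;
	/* Wait until the task actually enters the trace stop. */
	if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
		return -1;
	return 0;
}

/* Detaching resumes the task as if nothing had happened. */
static int resume(pid_t pid)
{
	return ptrace(PTRACE_DETACH, pid, NULL, NULL) ? -1 : 0;
}

int main(int argc, char **argv)
{
	pid_t pid = argc > 1 ? (pid_t)atoi(argv[1]) : 0;

	if (pid > 0 && !seize_and_stop(pid)) {
		printf("task %d stopped\n", (int)pid);
		resume(pid);
	}
	return 0;
}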

And this brings another obstacle to this kind of kernel update -- if a
process is sitting under gdb/strace, then there is no kernel update.

>> - hibernation, regular suspend/resume and in general
>> power management has in essence already implemented
>> most building blocks needed to enumerate and
>> checkpoint/restore device state that otherwise gets
>> lost in a shutdown/reboot cycle.
>
> Not at all. A lot of suspend/resume hooks results only in
> shutdown/reset. That is not what criu wants. And in many cases,
> implementing c/r is not feasible (see above).
>
>> A feature like arbitrary live kernel upgrades would be well
>> worth the pain and would be worth the complications, and
>> it's actually very feasible technically.
>
> Yes, I like criu very much, but it is not going to save the Universe as
> are your beliefs. Neither KLP. Remember, they are complementary. Maybe
> Pavel can comment on this too.

I agree, these two technologies are complementary. And I'd also add that
their potential users differ. For example, hosting providers like live
patching a lot, since they always provide their customers with a kernel
that has all (or most of) the recent security hot-fixes applied. But
they don't actively use CRIU-based updates, since they do major updates
either when a node crashes or by migrating containers from one node to
another and replacing the node afterwards.

>> The goals of the current live kernel patching projects,
>> "being able to apply only the simplest of live patches",
>> which would in my opinion mostly serve the security
>> theater?
>
> No, we all pray to the KISS principle. Having something basic, which
> works and can be built upon is everything we want as the starting point.
> Extending the functionality is the right way. Not our idea, the recent
> SW management point of views tell. User-driven development is called one
> successful.
>
>> They are not forward looking enough, and in that
>> sense they could even be counterproductive.
>
> Being able to apply over 90 % of CVEs in 3.20 does not sound bad or
> counterproductive to me at all, sorry.

Thanks,
Pavel