Date: 2023-02-24 16:50:18
From: Josh Poimboeuf
Subject: [PATCH v3 0/3] livepatch,sched: Add livepatch task switching to cond_resched()

v3:
- Add barriers (pmladek)
- Update comments

v2:
- Avoid calling klp_cond_resched_disable() in klp_cancel_transition()
- Fix race in klp_reverse_transition()

Fix patching stalls caused by busy kthreads.

Josh Poimboeuf (3):
livepatch: Skip task_call_func() for current task
livepatch,sched: Add livepatch task switching to cond_resched()
vhost: Fix livepatch timeouts in vhost_worker()

drivers/vhost/vhost.c | 3 +-
include/linux/livepatch.h | 1 +
include/linux/livepatch_sched.h | 29 ++++++++
include/linux/sched.h | 20 ++++--
kernel/livepatch/core.c | 1 +
kernel/livepatch/transition.c | 113 +++++++++++++++++++++++++++-----
kernel/sched/core.c | 64 +++++++++++++++---
7 files changed, 200 insertions(+), 31 deletions(-)
create mode 100644 include/linux/livepatch_sched.h

--
2.39.1



Date: 2023-02-24 16:50:21
From: Josh Poimboeuf
Subject: [PATCH v3 1/3] livepatch: Skip task_call_func() for current task

The current task doesn't need the scheduler's protection to unwind its
own stack.

Tested-by: Seth Forshee (DigitalOcean) <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
---
kernel/livepatch/transition.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index f1b25ec581e0..4d1f443778f7 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -307,7 +307,11 @@ static bool klp_try_switch_task(struct task_struct *task)
* functions. If all goes well, switch the task to the target patch
* state.
*/
- ret = task_call_func(task, klp_check_and_switch_task, &old_name);
+ if (task == current)
+ ret = klp_check_and_switch_task(current, &old_name);
+ else
+ ret = task_call_func(task, klp_check_and_switch_task, &old_name);
+
switch (ret) {
case 0: /* success */
break;
--
2.39.1
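
The principle can be illustrated outside the kernel: a running thread can always walk its own call stack, whereas unwinding some other running task needs the scheduler's help to keep that task off a CPU. A minimal userspace sketch using glibc's backtrace() (illustration only, not kernel code; build with gcc -rdynamic):

#include <execinfo.h>
#include <unistd.h>

/*
 * A thread unwinding its own stack needs no outside protection, which is
 * why klp_try_switch_task() can call klp_check_and_switch_task() directly
 * when task == current.
 */
static void show_own_stack(void)
{
	void *entries[32];
	int n = backtrace(entries, 32);	/* walk the calling thread's stack */

	backtrace_symbols_fd(entries, n, STDOUT_FILENO);
}

int main(void)
{
	show_own_stack();
	return 0;
}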


Date: 2023-02-24 16:50:24
From: Josh Poimboeuf
Subject: [PATCH v3 3/3] vhost: Fix livepatch timeouts in vhost_worker()

Livepatch timeouts were reported due to busy vhost_worker() kthreads.
Now that cond_resched() can do livepatch task switching, use
cond_resched() in vhost_worker(). That's the better way to
conditionally call schedule() anyway.

Reported-by: Seth Forshee (DigitalOcean) <[email protected]>
Link: https://lkml.kernel.org/lkml/[email protected]
Tested-by: Seth Forshee (DigitalOcean) <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
---
drivers/vhost/vhost.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 43c9770b86e5..87e3cf12da1c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -363,8 +363,7 @@ static int vhost_worker(void *data)
kcov_remote_start_common(dev->kcov_handle);
work->fn(work);
kcov_remote_stop();
- if (need_resched())
- schedule();
+ cond_resched();
}
}
kthread_unuse_mm(dev->mm);
--
2.39.1
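
For context on why cond_resched() is the better way: with this series applied, on a !PREEMPTION kernel the call roughly expands to the following (paraphrased from the include/linux/sched.h hunk in patch 2/3), so the open-coded need_resched()/schedule() pair in vhost_worker() was bypassing both the __might_resched() debug check and the new livepatch hook:

#define cond_resched() ({					\
	__might_resched(__FILE__, __LINE__, 0);			\
	_cond_resched();					\
})

static inline int _cond_resched(void)
{
	klp_sched_try_switch();		/* livepatch hook added by patch 2/3 */
	return __cond_resched();	/* calls schedule() when needed */
}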


Date: 2023-02-24 16:50:26
From: Josh Poimboeuf
Subject: [PATCH v3 2/3] livepatch,sched: Add livepatch task switching to cond_resched()

There have been reports [1][2] of live patches failing to complete
within a reasonable amount of time due to CPU-bound kthreads.

Fix it by patching tasks in cond_resched().

There are four different flavors of cond_resched(), depending on the
kernel configuration. Hook into all of them.

A more elegant solution might be to use a preempt notifier. However,
non-ORC unwinders can't unwind a preempted task reliably.

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lkml.kernel.org/lkml/[email protected]

Tested-by: Seth Forshee (DigitalOcean) <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
---
include/linux/livepatch.h | 1 +
include/linux/livepatch_sched.h | 29 +++++++++
include/linux/sched.h | 20 ++++--
kernel/livepatch/core.c | 1 +
kernel/livepatch/transition.c | 107 +++++++++++++++++++++++++++-----
kernel/sched/core.c | 64 ++++++++++++++++---
6 files changed, 194 insertions(+), 28 deletions(-)
create mode 100644 include/linux/livepatch_sched.h

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 293e29960c6e..9b9b38e89563 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -13,6 +13,7 @@
#include <linux/ftrace.h>
#include <linux/completion.h>
#include <linux/list.h>
+#include <linux/livepatch_sched.h>

#if IS_ENABLED(CONFIG_LIVEPATCH)

diff --git a/include/linux/livepatch_sched.h b/include/linux/livepatch_sched.h
new file mode 100644
index 000000000000..013794fb5da0
--- /dev/null
+++ b/include/linux/livepatch_sched.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _LINUX_LIVEPATCH_SCHED_H_
+#define _LINUX_LIVEPATCH_SCHED_H_
+
+#include <linux/jump_label.h>
+#include <linux/static_call_types.h>
+
+#ifdef CONFIG_LIVEPATCH
+
+void __klp_sched_try_switch(void);
+
+#if !defined(CONFIG_PREEMPT_DYNAMIC) || !defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
+
+DECLARE_STATIC_KEY_FALSE(klp_sched_try_switch_key);
+
+static __always_inline void klp_sched_try_switch(void)
+{
+ if (static_branch_unlikely(&klp_sched_try_switch_key))
+ __klp_sched_try_switch();
+}
+
+#endif /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
+
+#else /* !CONFIG_LIVEPATCH */
+static inline void klp_sched_try_switch(void) {}
+static inline void __klp_sched_try_switch(void) {}
+#endif /* CONFIG_LIVEPATCH */
+
+#endif /* _LINUX_LIVEPATCH_SCHED_H_ */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 853d08f7562b..bd1e6f02facb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -36,6 +36,7 @@
#include <linux/seqlock.h>
#include <linux/kcsan.h>
#include <linux/rv.h>
+#include <linux/livepatch_sched.h>
#include <asm/kmap_size.h>

/* task_struct member predeclarations (sorted alphabetically): */
@@ -2064,6 +2065,9 @@ extern int __cond_resched(void);

#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)

+void sched_dynamic_klp_enable(void);
+void sched_dynamic_klp_disable(void);
+
DECLARE_STATIC_CALL(cond_resched, __cond_resched);

static __always_inline int _cond_resched(void)
@@ -2072,6 +2076,7 @@ static __always_inline int _cond_resched(void)
}

#elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
+
extern int dynamic_cond_resched(void);

static __always_inline int _cond_resched(void)
@@ -2079,20 +2084,25 @@ static __always_inline int _cond_resched(void)
return dynamic_cond_resched();
}

-#else
+#else /* !CONFIG_PREEMPTION */

static inline int _cond_resched(void)
{
+ klp_sched_try_switch();
return __cond_resched();
}

-#endif /* CONFIG_PREEMPT_DYNAMIC */
+#endif /* PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */

-#else
+#else /* CONFIG_PREEMPTION && !CONFIG_PREEMPT_DYNAMIC */

-static inline int _cond_resched(void) { return 0; }
+static inline int _cond_resched(void)
+{
+ klp_sched_try_switch();
+ return 0;
+}

-#endif /* !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) */
+#endif /* !CONFIG_PREEMPTION || CONFIG_PREEMPT_DYNAMIC */

#define cond_resched() ({ \
__might_resched(__FILE__, __LINE__, 0); \
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 201f0c0482fb..3f79265dd6e5 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -33,6 +33,7 @@
*
* - klp_ftrace_handler()
* - klp_update_patch_state()
+ * - __klp_sched_try_switch()
*/
DEFINE_MUTEX(klp_mutex);

diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index 4d1f443778f7..2662f2efb164 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -9,6 +9,7 @@

#include <linux/cpu.h>
#include <linux/stacktrace.h>
+#include <linux/static_call.h>
#include "core.h"
#include "patch.h"
#include "transition.h"
@@ -24,6 +25,25 @@ static int klp_target_state = KLP_UNDEFINED;

static unsigned int klp_signals_cnt;

+/*
+ * When a livepatch is in progress, enable klp stack checking in
+ * cond_resched(). This helps CPU-bound kthreads get patched.
+ */
+#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
+
+#define klp_cond_resched_enable() sched_dynamic_klp_enable()
+#define klp_cond_resched_disable() sched_dynamic_klp_disable()
+
+#else /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
+
+DEFINE_STATIC_KEY_FALSE(klp_sched_try_switch_key);
+EXPORT_SYMBOL(klp_sched_try_switch_key);
+
+#define klp_cond_resched_enable() static_branch_enable(&klp_sched_try_switch_key)
+#define klp_cond_resched_disable() static_branch_disable(&klp_sched_try_switch_key)
+
+#endif /* CONFIG_PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
+
/*
* This work can be performed periodically to finish patching or unpatching any
* "straggler" tasks which failed to transition in the first attempt.
@@ -172,8 +192,8 @@ void klp_update_patch_state(struct task_struct *task)
* barrier (smp_rmb) for two cases:
*
* 1) Enforce the order of the TIF_PATCH_PENDING read and the
- * klp_target_state read. The corresponding write barrier is in
- * klp_init_transition().
+ * klp_target_state read. The corresponding write barriers are in
+ * klp_init_transition() and klp_reverse_transition().
*
* 2) Enforce the order of the TIF_PATCH_PENDING read and a future read
* of func->transition, if klp_ftrace_handler() is called later on
@@ -338,6 +358,44 @@ static bool klp_try_switch_task(struct task_struct *task)
return !ret;
}

+void __klp_sched_try_switch(void)
+{
+ if (likely(!klp_patch_pending(current)))
+ return;
+
+ /*
+ * This function is called from cond_resched() which is called in many
+ * places throughout the kernel. Using the klp_mutex here might
+ * deadlock.
+ *
+ * Instead, disable preemption to prevent racing with other callers of
+ * klp_try_switch_task(). Thanks to task_call_func() they won't be
+ * able to switch this task while it's running.
+ */
+ preempt_disable();
+
+ /*
+ * Make sure current didn't get patched between the above check and
+ * preempt_disable().
+ */
+ if (unlikely(!klp_patch_pending(current)))
+ goto out;
+
+ /*
+ * Enforce the order of the TIF_PATCH_PENDING read above and the
+ * klp_target_state read in klp_try_switch_task(). The corresponding
+ * write barriers are in klp_init_transition() and
+ * klp_reverse_transition().
+ */
+ smp_rmb();
+
+ klp_try_switch_task(current);
+
+out:
+ preempt_enable();
+}
+EXPORT_SYMBOL(__klp_sched_try_switch);
+
/*
* Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set.
* Kthreads with TIF_PATCH_PENDING set are woken up.
@@ -444,7 +502,8 @@ void klp_try_complete_transition(void)
return;
}

- /* we're done, now cleanup the data structures */
+ /* Done! Now cleanup the data structures. */
+ klp_cond_resched_disable();
patch = klp_transition_patch;
klp_complete_transition();

@@ -496,6 +555,8 @@ void klp_start_transition(void)
set_tsk_thread_flag(task, TIF_PATCH_PENDING);
}

+ klp_cond_resched_enable();
+
klp_signals_cnt = 0;
}

@@ -551,8 +612,9 @@ void klp_init_transition(struct klp_patch *patch, int state)
* see a func in transition with a task->patch_state of KLP_UNDEFINED.
*
* Also enforce the order of the klp_target_state write and future
- * TIF_PATCH_PENDING writes to ensure klp_update_patch_state() doesn't
- * set a task->patch_state to KLP_UNDEFINED.
+ * TIF_PATCH_PENDING writes to ensure klp_update_patch_state() and
+ * __klp_sched_try_switch() don't set a task->patch_state to
+ * KLP_UNDEFINED.
*/
smp_wmb();

@@ -588,14 +650,10 @@ void klp_reverse_transition(void)
klp_target_state == KLP_PATCHED ? "patching to unpatching" :
"unpatching to patching");

- klp_transition_patch->enabled = !klp_transition_patch->enabled;
-
- klp_target_state = !klp_target_state;
-
/*
* Clear all TIF_PATCH_PENDING flags to prevent races caused by
- * klp_update_patch_state() running in parallel with
- * klp_start_transition().
+ * klp_update_patch_state() or __klp_sched_try_switch() running in
+ * parallel with the reverse transition.
*/
read_lock(&tasklist_lock);
for_each_process_thread(g, task)
@@ -605,9 +663,28 @@ void klp_reverse_transition(void)
for_each_possible_cpu(cpu)
clear_tsk_thread_flag(idle_task(cpu), TIF_PATCH_PENDING);

- /* Let any remaining calls to klp_update_patch_state() complete */
+ /*
+ * Make sure all existing invocations of klp_update_patch_state() and
+ * __klp_sched_try_switch() see the cleared TIF_PATCH_PENDING before
+ * starting the reverse transition.
+ */
klp_synchronize_transition();

+ /*
+ * All patching has stopped, now re-initialize the global variables to
+ * prepare for the reverse transition.
+ */
+ klp_transition_patch->enabled = !klp_transition_patch->enabled;
+ klp_target_state = !klp_target_state;
+
+ /*
+ * Enforce the order of the klp_target_state write and the
+ * TIF_PATCH_PENDING writes in klp_start_transition() to ensure
+ * klp_update_patch_state() and __klp_sched_try_switch() don't set
+ * task->patch_state to the wrong value.
+ */
+ smp_wmb();
+
klp_start_transition();
}

@@ -621,9 +698,9 @@ void klp_copy_process(struct task_struct *child)
* the task flag up to date with the parent here.
*
* The operation is serialized against all klp_*_transition()
- * operations by the tasklist_lock. The only exception is
- * klp_update_patch_state(current), but we cannot race with
- * that because we are current.
+ * operations by the tasklist_lock. The only exceptions are
+ * klp_update_patch_state(current) and __klp_sched_try_switch(), but we
+ * cannot race with them because we are current.
*/
if (test_tsk_thread_flag(current, TIF_PATCH_PENDING))
set_tsk_thread_flag(child, TIF_PATCH_PENDING);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e838feb6adc5..895d2a1fdcb3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8487,6 +8487,7 @@ EXPORT_STATIC_CALL_TRAMP(might_resched);
static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);
int __sched dynamic_cond_resched(void)
{
+ klp_sched_try_switch();
if (!static_branch_unlikely(&sk_dynamic_cond_resched))
return 0;
return __cond_resched();
@@ -8635,13 +8636,17 @@ int sched_dynamic_mode(const char *str)
#error "Unsupported PREEMPT_DYNAMIC mechanism"
#endif

-void sched_dynamic_update(int mode)
+DEFINE_MUTEX(sched_dynamic_mutex);
+static bool klp_override;
+
+static void __sched_dynamic_update(int mode)
{
/*
* Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
* the ZERO state, which is invalid.
*/
- preempt_dynamic_enable(cond_resched);
+ if (!klp_override)
+ preempt_dynamic_enable(cond_resched);
preempt_dynamic_enable(might_resched);
preempt_dynamic_enable(preempt_schedule);
preempt_dynamic_enable(preempt_schedule_notrace);
@@ -8649,36 +8654,79 @@ void sched_dynamic_update(int mode)

switch (mode) {
case preempt_dynamic_none:
- preempt_dynamic_enable(cond_resched);
+ if (!klp_override)
+ preempt_dynamic_enable(cond_resched);
preempt_dynamic_disable(might_resched);
preempt_dynamic_disable(preempt_schedule);
preempt_dynamic_disable(preempt_schedule_notrace);
preempt_dynamic_disable(irqentry_exit_cond_resched);
- pr_info("Dynamic Preempt: none\n");
+ if (mode != preempt_dynamic_mode)
+ pr_info("Dynamic Preempt: none\n");
break;

case preempt_dynamic_voluntary:
- preempt_dynamic_enable(cond_resched);
+ if (!klp_override)
+ preempt_dynamic_enable(cond_resched);
preempt_dynamic_enable(might_resched);
preempt_dynamic_disable(preempt_schedule);
preempt_dynamic_disable(preempt_schedule_notrace);
preempt_dynamic_disable(irqentry_exit_cond_resched);
- pr_info("Dynamic Preempt: voluntary\n");
+ if (mode != preempt_dynamic_mode)
+ pr_info("Dynamic Preempt: voluntary\n");
break;

case preempt_dynamic_full:
- preempt_dynamic_disable(cond_resched);
+ if (!klp_override)
+ preempt_dynamic_disable(cond_resched);
preempt_dynamic_disable(might_resched);
preempt_dynamic_enable(preempt_schedule);
preempt_dynamic_enable(preempt_schedule_notrace);
preempt_dynamic_enable(irqentry_exit_cond_resched);
- pr_info("Dynamic Preempt: full\n");
+ if (mode != preempt_dynamic_mode)
+ pr_info("Dynamic Preempt: full\n");
break;
}

preempt_dynamic_mode = mode;
}

+void sched_dynamic_update(int mode)
+{
+ mutex_lock(&sched_dynamic_mutex);
+ __sched_dynamic_update(mode);
+ mutex_unlock(&sched_dynamic_mutex);
+}
+
+#ifdef CONFIG_HAVE_PREEMPT_DYNAMIC_CALL
+
+static int klp_cond_resched(void)
+{
+ __klp_sched_try_switch();
+ return __cond_resched();
+}
+
+void sched_dynamic_klp_enable(void)
+{
+ mutex_lock(&sched_dynamic_mutex);
+
+ klp_override = true;
+ static_call_update(cond_resched, klp_cond_resched);
+
+ mutex_unlock(&sched_dynamic_mutex);
+}
+
+void sched_dynamic_klp_disable(void)
+{
+ mutex_lock(&sched_dynamic_mutex);
+
+ klp_override = false;
+ __sched_dynamic_update(preempt_dynamic_mode);
+
+ mutex_unlock(&sched_dynamic_mutex);
+}
+
+#endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
+
static int __init setup_preempt_mode(char *str)
{
int mode = sched_dynamic_mode(str);
--
2.39.1
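
The PREEMPT_DYNAMIC/static-call flavor is the subtle one: while a transition is in progress, the cond_resched static call is redirected to klp_cond_resched(), and sched_dynamic_klp_disable() later lets __sched_dynamic_update() re-pick the target for the current preempt mode. Below is a small userspace model of that override/restore dance, with a plain function pointer standing in for the static call (hypothetical names, illustration only, not kernel code):

#include <stdbool.h>
#include <stdio.h>

static int plain_cond_resched(void)		/* stands in for __cond_resched() */
{
	return 0;
}

static int klp_cond_resched(void)		/* stands in for the klp wrapper */
{
	printf("try to switch current's patch state\n");
	return plain_cond_resched();
}

/* stands in for the 'cond_resched' static call site */
static int (*cond_resched_fn)(void) = plain_cond_resched;
static bool klp_override;			/* consulted by __sched_dynamic_update() in the real code */

static void model_klp_enable(void)
{
	klp_override = true;
	cond_resched_fn = klp_cond_resched;	/* static_call_update() in the patch */
}

static void model_klp_disable(void)
{
	klp_override = false;
	cond_resched_fn = plain_cond_resched;	/* __sched_dynamic_update() re-picks */
}

int main(void)
{
	cond_resched_fn();			/* no transition: plain behavior */
	model_klp_enable();
	cond_resched_fn();			/* transition in progress: klp hook runs */
	model_klp_disable();
	cond_resched_fn();			/* back to plain behavior */
	return 0;
}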


Date: 2023-02-27 15:55:59
From: Petr Mladek
Subject: Re: [PATCH v3 2/3] livepatch,sched: Add livepatch task switching to cond_resched()

On Fri 2023-02-24 08:50:00, Josh Poimboeuf wrote:
> There have been reports [1][2] of live patches failing to complete
> within a reasonable amount of time due to CPU-bound kthreads.
>
> Fix it by patching tasks in cond_resched().
>
> There are four different flavors of cond_resched(), depending on the
> kernel configuration. Hook into all of them.
>
> A more elegant solution might be to use a preempt notifier. However,
> non-ORC unwinders can't unwind a preempted task reliably.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lkml.kernel.org/lkml/[email protected]
>
> Tested-by: Seth Forshee (DigitalOcean) <[email protected]>
> Reviewed-by: Petr Mladek <[email protected]>
> Signed-off-by: Josh Poimboeuf <[email protected]>

Just for the record, I have double-checked the changes against v2
and everything looks good to me.

Best Regards,
Petr

Date: 2023-02-28 16:41:13
From: Seth Forshee
Subject: Re: [PATCH v3 0/3] livepatch,sched: Add livepatch task switching to cond_resched()

On Fri, Feb 24, 2023 at 08:49:58AM -0800, Josh Poimboeuf wrote:
> v3:
> - Add barriers (pmladek)
> - Update comments
>
> v2:
> - Avoid calling klp_cond_resched_disable() in klp_cancel_transition()
> - Fix race in klp_reverse_transition()
>
> Fix patching stalls caused by busy kthreads.

I ran the v3 patches through my reproducer, still working great.

Thanks,
Seth

Date: 2023-02-28 16:57:01
From: Josh Poimboeuf
Subject: Re: [PATCH v3 2/3] livepatch,sched: Add livepatch task switching to cond_resched()

On Mon, Feb 27, 2023 at 04:55:47PM +0100, Petr Mladek wrote:
> On Fri 2023-02-24 08:50:00, Josh Poimboeuf wrote:
> > There have been reports [1][2] of live patches failing to complete
> > within a reasonable amount of time due to CPU-bound kthreads.
> >
> > Fix it by patching tasks in cond_resched().
> >
> > There are four different flavors of cond_resched(), depending on the
> > kernel configuration. Hook into all of them.
> >
> > A more elegant solution might be to use a preempt notifier. However,
> > non-ORC unwinders can't unwind a preempted task reliably.
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
> > [2] https://lkml.kernel.org/lkml/[email protected]
> >
> > Tested-by: Seth Forshee (DigitalOcean) <[email protected]>
> > Reviewed-by: Petr Mladek <[email protected]>
> > Signed-off-by: Josh Poimboeuf <[email protected]>
>
> Just for the record, I have double-checked the changes against v2
> and everything looks good to me.

Whoops, so I found another little surprise:

static int klp_check_stack(struct task_struct *task, const char **oldname)
{
static unsigned long entries[MAX_STACK_ENTRIES];
^^^^^^

That entries array is shared between the klp_mutex owner and all
cond_resched() callers.

MAX_STACK_ENTRIES is 100, which seems excessive. If we halved that, the
array would be "only" 400 bytes, which is *almost* reasonable to
allocate on the stack?

Alternatively we could have a percpu entries array... :-/

--
Josh
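
For concreteness, the sizes under discussion (64-bit kernel, 8-byte unsigned long):

#define MAX_STACK_ENTRIES 100
static unsigned long entries[MAX_STACK_ENTRIES];	/* shared static buffer: 100 * 8 = 800 bytes */
/* halved and moved on-stack:  50 * 8 = 400 bytes of stack per call */
/* percpu alternative: DEFINE_PER_CPU(unsigned long[MAX_STACK_ENTRIES], klp_stack_entries), 800 bytes per CPU */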

Date: 2023-03-03 14:00:21
From: Petr Mladek
Subject: Re: [PATCH v3 2/3] livepatch,sched: Add livepatch task switching to cond_resched()

On Tue 2023-02-28 08:56:08, Josh Poimboeuf wrote:
> On Mon, Feb 27, 2023 at 04:55:47PM +0100, Petr Mladek wrote:
> > On Fri 2023-02-24 08:50:00, Josh Poimboeuf wrote:
> > > There have been reports [1][2] of live patches failing to complete
> > > within a reasonable amount of time due to CPU-bound kthreads.
> > >
> > > Fix it by patching tasks in cond_resched().
> > >
> > > There are four different flavors of cond_resched(), depending on the
> > > kernel configuration. Hook into all of them.
> > >
> > > A more elegant solution might be to use a preempt notifier. However,
> > > non-ORC unwinders can't unwind a preempted task reliably.
> > >
> > > [1] https://lore.kernel.org/lkml/[email protected]/
> > > [2] https://lkml.kernel.org/lkml/[email protected]
> > >
> > > Tested-by: Seth Forshee (DigitalOcean) <[email protected]>
> > > Reviewed-by: Petr Mladek <[email protected]>
> > > Signed-off-by: Josh Poimboeuf <[email protected]>
> >
> > Just for record, I have double checked the changes against v2
> > and everything looks good to me.
>
> Whoops, so I found another little surprise:
>
> static int klp_check_stack(struct task_struct *task, const char **oldname)
> {
> static unsigned long entries[MAX_STACK_ENTRIES];
> ^^^^^^
>
> That entries array is shared between the klp_mutex owner and all
> cond_resched() callers.

Huh, great catch!

> MAX_STACK_ENTRIES is 100, which seems excessive. If we halved that, the
> array would be "only" 400 bytes, which is *almost* reasonable to
> allocate on the stack?

It is just for the stack in the process context. Right?

I think that I have never seen a stack with over 50 entries. And in
the worst case, a larger number of entries would "just" result in
an unreliable stack, which might be acceptable.

It looks acceptable to me.

> Alternatively we could have a percpu entries array... :-/

That said, percpu entries would be fine as well. It sounds like
a good price for the livepatching feature. I think that livepatching
is used on big systems anyway.

I slightly prefer the per-cpu solution.

Best Regards,
Petr

Date: 2023-03-13 23:33:54
From: Josh Poimboeuf
Subject: [PATCH 0.5/3] livepatch: Convert stack entries array to percpu

On Fri, Mar 03, 2023 at 03:00:13PM +0100, Petr Mladek wrote:
> > MAX_STACK_ENTRIES is 100, which seems excessive. If we halved that, the
> > array would be "only" 400 bytes, which is *almost* reasonable to
> > allocate on the stack?
>
> It is just for the stack in the process context. Right?
>
> I think that I have never seen a stack with over 50 entries. And in
> the worst case, a larger number of entries would "just" result in
> an unreliable stack, which might be acceptable.
>
> It looks acceptable to me.
>
> > Alternatively we could have a percpu entries array... :-/
>
> That said, percpu entries would be fine as well. It sounds like
> a good price for the livepatching feature. I think that livepatching
> is used on big systems anyway.
>
> I slightly prefer the per-cpu solution.

Booting a kernel with PREEMPT+LOCKDEP gave me a high-water mark of 60+
stack entries, seen when probing a device. I decided not to mess with
MAX_STACK_ENTRIES, and instead just convert the entries to percpu. This
patch could be inserted at the beginning of the set.

---8<---

Subject: [PATCH 0.5/3] livepatch: Convert stack entries array to percpu

The entries array in klp_check_stack() is static local because it's too
big to be reasonably allocated on the stack. Serialized access is
enforced by the klp_mutex.

In preparation for calling klp_check_stack() without the mutex (from
cond_resched), convert it to a percpu variable.

Signed-off-by: Josh Poimboeuf <[email protected]>
---
kernel/livepatch/transition.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index f1b25ec581e0..135fc73e2e5d 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -14,6 +14,8 @@
#include "transition.h"

#define MAX_STACK_ENTRIES 100
+DEFINE_PER_CPU(unsigned long[MAX_STACK_ENTRIES], klp_stack_entries);
+
#define STACK_ERR_BUF_SIZE 128

#define SIGNALS_TIMEOUT 15
@@ -240,12 +242,15 @@ static int klp_check_stack_func(struct klp_func *func, unsigned long *entries,
*/
static int klp_check_stack(struct task_struct *task, const char **oldname)
{
- static unsigned long entries[MAX_STACK_ENTRIES];
+ unsigned long *entries = this_cpu_ptr(klp_stack_entries);
struct klp_object *obj;
struct klp_func *func;
int ret, nr_entries;

- ret = stack_trace_save_tsk_reliable(task, entries, ARRAY_SIZE(entries));
+ /* Protect 'klp_stack_entries' */
+ lockdep_assert_preemption_disabled();
+
+ ret = stack_trace_save_tsk_reliable(task, entries, MAX_STACK_ENTRIES);
if (ret < 0)
return -EINVAL;
nr_entries = ret;
--
2.39.2
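
A userspace model of the constraint the lockdep assertion encodes: each CPU owns one buffer, and that buffer is only safe to use for as long as the task cannot migrate off the CPU, which the kernel code guarantees with preemption disabled (names and sizes modeled on the patch, illustration only):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

#define MAX_STACK_ENTRIES	100
#define NR_CPUS_MODEL		64

/* one buffer per CPU, like DEFINE_PER_CPU(unsigned long[MAX_STACK_ENTRIES], klp_stack_entries) */
static unsigned long klp_stack_entries[NR_CPUS_MODEL][MAX_STACK_ENTRIES];

int main(void)
{
	int cpu = sched_getcpu();			/* which CPU's buffer we would use */

	if (cpu < 0 || cpu >= NR_CPUS_MODEL)
		return 1;

	unsigned long *entries = klp_stack_entries[cpu];

	/*
	 * In the kernel, preemption must stay disabled from taking this
	 * pointer until the stack walk into 'entries' has finished;
	 * otherwise a preempting task on the same CPU, or this task after
	 * migrating, could scribble over the same buffer.
	 */
	printf("would fill per-CPU buffer of cpu %d at %p\n", cpu, (void *)entries);
	return 0;
}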


Date: 2023-03-14 10:51:12
From: Petr Mladek
Subject: Re: [PATCH 0.5/3] livepatch: Convert stack entries array to percpu

On Mon 2023-03-13 16:33:46, Josh Poimboeuf wrote:
> On Fri, Mar 03, 2023 at 03:00:13PM +0100, Petr Mladek wrote:
> > > MAX_STACK_ENTRIES is 100, which seems excessive. If we halved that, the
> > > array would be "only" 400 bytes, which is *almost* reasonable to
> > > allocate on the stack?
> >
> > It is just for the stack in the process context. Right?
> >
> > I think that I have never seen a stack with over 50 entries. And in
> > the worst case, a larger number of entries would "just" result in
> > an unreliable stack, which might be acceptable.
> >
> > It looks acceptable to me.
> >
> > > Alternatively we could have a percpu entries array... :-/
> >
> > That said, percpu entries would be fine as well. It sounds like
> > a good price for the livepatching feature. I think that livepatching
> > is used on big systems anyway.
> >
> > I slightly prefer the per-cpu solution.
>
> Booting a kernel with PREEMPT+LOCKDEP gave me a high-water mark of 60+
> stack entries, seen when probing a device. I decided not to mess with
> MAX_STACK_ENTRIES, and instead just convert the entries to percpu. This
> patch could be inserted at the beginning of the set.

Good to know.

>
> ---8<---
>
> Subject: [PATCH 0.5/3] livepatch: Convert stack entries array to percpu
>
> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -240,12 +242,15 @@ static int klp_check_stack_func(struct klp_func *func, unsigned long *entries,
> */
> static int klp_check_stack(struct task_struct *task, const char **oldname)
> {
> - static unsigned long entries[MAX_STACK_ENTRIES];
> + unsigned long *entries = this_cpu_ptr(klp_stack_entries);
> struct klp_object *obj;
> struct klp_func *func;
> int ret, nr_entries;
>
> - ret = stack_trace_save_tsk_reliable(task, entries, ARRAY_SIZE(entries));
> + /* Protect 'klp_stack_entries' */
> + lockdep_assert_preemption_disabled();

I am thinking about adding:

/*
* Stay on the safe side even when cond_resched() is called from
* an IRQ context by mistake.
*/
if (!in_task())
return -EINVAL;

Or is this prevented another way, please?

> +
> + ret = stack_trace_save_tsk_reliable(task, entries, MAX_STACK_ENTRIES);
> if (ret < 0)
> return -EINVAL;
> nr_entries = ret;

Otherwise, it looks good to me.

Best Regards,
Petr

Date: 2023-03-15 00:26:18
From: Josh Poimboeuf
Subject: Re: [PATCH 0.5/3] livepatch: Convert stack entries array to percpu

On Tue, Mar 14, 2023 at 11:50:21AM +0100, Petr Mladek wrote:
> > static int klp_check_stack(struct task_struct *task, const char **oldname)
> > {
> > - static unsigned long entries[MAX_STACK_ENTRIES];
> > + unsigned long *entries = this_cpu_ptr(klp_stack_entries);
> > struct klp_object *obj;
> > struct klp_func *func;
> > int ret, nr_entries;
> >
> > - ret = stack_trace_save_tsk_reliable(task, entries, ARRAY_SIZE(entries));
> > + /* Protect 'klp_stack_entries' */
> > + lockdep_assert_preemption_disabled();
>
> I am thinking about adding:
>
> /*
> * Stay on the safe side even when cond_resched() is called from
> * an IRQ context by mistake.
> */
> if (!in_task())
> return -EINVAL;
>
> Or is this prevented another way, please?

An IRQ handler trying to schedule would be a pretty major bug, no?

I think there are already several different checks for that. For
example, the __might_resched() call in cond_resched().

--
Josh

Subject: [tip: sched/core] vhost: Fix livepatch timeouts in vhost_worker()

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 05bfb338fa8dd40b008ce443e397fc374f6bd107
Gitweb: https://git.kernel.org/tip/05bfb338fa8dd40b008ce443e397fc374f6bd107
Author: Josh Poimboeuf <[email protected]>
AuthorDate: Fri, 24 Feb 2023 08:50:01 -08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 22 Mar 2023 17:09:29 +01:00

vhost: Fix livepatch timeouts in vhost_worker()

Livepatch timeouts were reported due to busy vhost_worker() kthreads.
Now that cond_resched() can do livepatch task switching, use
cond_resched() in vhost_worker(). That's the better way to
conditionally call schedule() anyway.

Reported-by: Seth Forshee (DigitalOcean) <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Tested-by: Seth Forshee (DigitalOcean) <[email protected]>
Link: https://lore.kernel.org/r/509f6ea6fe6505f0a75a66026ba531c765ef922f.1677257135.git.jpoimboe@kernel.org
---
drivers/vhost/vhost.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f11bdbe..822fbdb 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -363,8 +363,7 @@ static int vhost_worker(void *data)
kcov_remote_start_common(dev->kcov_handle);
work->fn(work);
kcov_remote_stop();
- if (need_resched())
- schedule();
+ cond_resched();
}
}
kthread_unuse_mm(dev->mm);

Subject: [tip: sched/core] livepatch: Skip task_call_func() for current task

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 383439d3d400d4c5a7ffb4495124adc6be6a05e2
Gitweb: https://git.kernel.org/tip/383439d3d400d4c5a7ffb4495124adc6be6a05e2
Author: Josh Poimboeuf <[email protected]>
AuthorDate: Fri, 24 Feb 2023 08:49:59 -08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 22 Mar 2023 17:09:28 +01:00

livepatch: Skip task_call_func() for current task

The current task doesn't need the scheduler's protection to unwind its
own stack.

Signed-off-by: Josh Poimboeuf <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Tested-by: Seth Forshee (DigitalOcean) <[email protected]>
Link: https://lore.kernel.org/r/4b92e793462d532a05f03767151fa29db3e68e13.1677257135.git.jpoimboe@kernel.org
---
kernel/livepatch/transition.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index 135fc73..824e2e3 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -312,7 +312,11 @@ static bool klp_try_switch_task(struct task_struct *task)
* functions. If all goes well, switch the task to the target patch
* state.
*/
- ret = task_call_func(task, klp_check_and_switch_task, &old_name);
+ if (task == current)
+ ret = klp_check_and_switch_task(current, &old_name);
+ else
+ ret = task_call_func(task, klp_check_and_switch_task, &old_name);
+
switch (ret) {
case 0: /* success */
break;

Subject: [tip: sched/core] livepatch: Convert stack entries array to percpu

The following commit has been merged into the sched/core branch of tip:

Commit-ID: e92606fa172f63a26054885b9715be86c643229d
Gitweb: https://git.kernel.org/tip/e92606fa172f63a26054885b9715be86c643229d
Author: Josh Poimboeuf <[email protected]>
AuthorDate: Mon, 13 Mar 2023 16:33:46 -07:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 22 Mar 2023 17:09:28 +01:00

livepatch: Convert stack entries array to percpu

The entries array in klp_check_stack() is static local because it's too
big to be reasonably allocated on the stack. Serialized access is
enforced by the klp_mutex.

In preparation for calling klp_check_stack() without the mutex (from
cond_resched), convert it to a percpu variable.

Signed-off-by: Josh Poimboeuf <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/20230313233346.kayh4t2lpicjkpsv@treble
---
kernel/livepatch/transition.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index f1b25ec..135fc73 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -14,6 +14,8 @@
#include "transition.h"

#define MAX_STACK_ENTRIES 100
+DEFINE_PER_CPU(unsigned long[MAX_STACK_ENTRIES], klp_stack_entries);
+
#define STACK_ERR_BUF_SIZE 128

#define SIGNALS_TIMEOUT 15
@@ -240,12 +242,15 @@ static int klp_check_stack_func(struct klp_func *func, unsigned long *entries,
*/
static int klp_check_stack(struct task_struct *task, const char **oldname)
{
- static unsigned long entries[MAX_STACK_ENTRIES];
+ unsigned long *entries = this_cpu_ptr(klp_stack_entries);
struct klp_object *obj;
struct klp_func *func;
int ret, nr_entries;

- ret = stack_trace_save_tsk_reliable(task, entries, ARRAY_SIZE(entries));
+ /* Protect 'klp_stack_entries' */
+ lockdep_assert_preemption_disabled();
+
+ ret = stack_trace_save_tsk_reliable(task, entries, MAX_STACK_ENTRIES);
if (ret < 0)
return -EINVAL;
nr_entries = ret;

Subject: [tip: sched/core] livepatch,sched: Add livepatch task switching to cond_resched()

The following commit has been merged into the sched/core branch of tip:

Commit-ID: e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8
Gitweb: https://git.kernel.org/tip/e3ff7c609f39671d1aaff4fb4a8594e14f3e03f8
Author: Josh Poimboeuf <[email protected]>
AuthorDate: Fri, 24 Feb 2023 08:50:00 -08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 22 Mar 2023 17:09:28 +01:00

livepatch,sched: Add livepatch task switching to cond_resched()

There have been reports [1][2] of live patches failing to complete
within a reasonable amount of time due to CPU-bound kthreads.

Fix it by patching tasks in cond_resched().

There are four different flavors of cond_resched(), depending on the
kernel configuration. Hook into all of them.

A more elegant solution might be to use a preempt notifier. However,
non-ORC unwinders can't unwind a preempted task reliably.

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lkml.kernel.org/lkml/[email protected]

Signed-off-by: Josh Poimboeuf <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Tested-by: Seth Forshee (DigitalOcean) <[email protected]>
Link: https://lore.kernel.org/r/4ae981466b7814ec221014fc2554b2f86f3fb70b.1677257135.git.jpoimboe@kernel.org
---
include/linux/livepatch.h | 1 +-
include/linux/livepatch_sched.h | 29 ++++++++-
include/linux/sched.h | 20 ++++--
kernel/livepatch/core.c | 1 +-
kernel/livepatch/transition.c | 107 ++++++++++++++++++++++++++-----
kernel/sched/core.c | 64 ++++++++++++++++---
6 files changed, 194 insertions(+), 28 deletions(-)
create mode 100644 include/linux/livepatch_sched.h

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 293e299..9b9b38e 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -13,6 +13,7 @@
#include <linux/ftrace.h>
#include <linux/completion.h>
#include <linux/list.h>
+#include <linux/livepatch_sched.h>

#if IS_ENABLED(CONFIG_LIVEPATCH)

diff --git a/include/linux/livepatch_sched.h b/include/linux/livepatch_sched.h
new file mode 100644
index 0000000..013794f
--- /dev/null
+++ b/include/linux/livepatch_sched.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _LINUX_LIVEPATCH_SCHED_H_
+#define _LINUX_LIVEPATCH_SCHED_H_
+
+#include <linux/jump_label.h>
+#include <linux/static_call_types.h>
+
+#ifdef CONFIG_LIVEPATCH
+
+void __klp_sched_try_switch(void);
+
+#if !defined(CONFIG_PREEMPT_DYNAMIC) || !defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
+
+DECLARE_STATIC_KEY_FALSE(klp_sched_try_switch_key);
+
+static __always_inline void klp_sched_try_switch(void)
+{
+ if (static_branch_unlikely(&klp_sched_try_switch_key))
+ __klp_sched_try_switch();
+}
+
+#endif /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
+
+#else /* !CONFIG_LIVEPATCH */
+static inline void klp_sched_try_switch(void) {}
+static inline void __klp_sched_try_switch(void) {}
+#endif /* CONFIG_LIVEPATCH */
+
+#endif /* _LINUX_LIVEPATCH_SCHED_H_ */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d2421..6d654eb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -36,6 +36,7 @@
#include <linux/seqlock.h>
#include <linux/kcsan.h>
#include <linux/rv.h>
+#include <linux/livepatch_sched.h>
#include <asm/kmap_size.h>

/* task_struct member predeclarations (sorted alphabetically): */
@@ -2070,6 +2071,9 @@ extern int __cond_resched(void);

#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)

+void sched_dynamic_klp_enable(void);
+void sched_dynamic_klp_disable(void);
+
DECLARE_STATIC_CALL(cond_resched, __cond_resched);

static __always_inline int _cond_resched(void)
@@ -2078,6 +2082,7 @@ static __always_inline int _cond_resched(void)
}

#elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
+
extern int dynamic_cond_resched(void);

static __always_inline int _cond_resched(void)
@@ -2085,20 +2090,25 @@ static __always_inline int _cond_resched(void)
return dynamic_cond_resched();
}

-#else
+#else /* !CONFIG_PREEMPTION */

static inline int _cond_resched(void)
{
+ klp_sched_try_switch();
return __cond_resched();
}

-#endif /* CONFIG_PREEMPT_DYNAMIC */
+#endif /* PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */

-#else
+#else /* CONFIG_PREEMPTION && !CONFIG_PREEMPT_DYNAMIC */

-static inline int _cond_resched(void) { return 0; }
+static inline int _cond_resched(void)
+{
+ klp_sched_try_switch();
+ return 0;
+}

-#endif /* !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) */
+#endif /* !CONFIG_PREEMPTION || CONFIG_PREEMPT_DYNAMIC */

#define cond_resched() ({ \
__might_resched(__FILE__, __LINE__, 0); \
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 4bd2d5e..eea7c8e 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -33,6 +33,7 @@
*
* - klp_ftrace_handler()
* - klp_update_patch_state()
+ * - __klp_sched_try_switch()
*/
DEFINE_MUTEX(klp_mutex);

diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index 824e2e3..e9fd83a 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -9,6 +9,7 @@

#include <linux/cpu.h>
#include <linux/stacktrace.h>
+#include <linux/static_call.h>
#include "core.h"
#include "patch.h"
#include "transition.h"
@@ -27,6 +28,25 @@ static int klp_target_state = KLP_UNDEFINED;
static unsigned int klp_signals_cnt;

/*
+ * When a livepatch is in progress, enable klp stack checking in
+ * cond_resched(). This helps CPU-bound kthreads get patched.
+ */
+#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
+
+#define klp_cond_resched_enable() sched_dynamic_klp_enable()
+#define klp_cond_resched_disable() sched_dynamic_klp_disable()
+
+#else /* !CONFIG_PREEMPT_DYNAMIC || !CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
+
+DEFINE_STATIC_KEY_FALSE(klp_sched_try_switch_key);
+EXPORT_SYMBOL(klp_sched_try_switch_key);
+
+#define klp_cond_resched_enable() static_branch_enable(&klp_sched_try_switch_key)
+#define klp_cond_resched_disable() static_branch_disable(&klp_sched_try_switch_key)
+
+#endif /* CONFIG_PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
+
+/*
* This work can be performed periodically to finish patching or unpatching any
* "straggler" tasks which failed to transition in the first attempt.
*/
@@ -174,8 +194,8 @@ void klp_update_patch_state(struct task_struct *task)
* barrier (smp_rmb) for two cases:
*
* 1) Enforce the order of the TIF_PATCH_PENDING read and the
- * klp_target_state read. The corresponding write barrier is in
- * klp_init_transition().
+ * klp_target_state read. The corresponding write barriers are in
+ * klp_init_transition() and klp_reverse_transition().
*
* 2) Enforce the order of the TIF_PATCH_PENDING read and a future read
* of func->transition, if klp_ftrace_handler() is called later on
@@ -343,6 +363,44 @@ static bool klp_try_switch_task(struct task_struct *task)
return !ret;
}

+void __klp_sched_try_switch(void)
+{
+ if (likely(!klp_patch_pending(current)))
+ return;
+
+ /*
+ * This function is called from cond_resched() which is called in many
+ * places throughout the kernel. Using the klp_mutex here might
+ * deadlock.
+ *
+ * Instead, disable preemption to prevent racing with other callers of
+ * klp_try_switch_task(). Thanks to task_call_func() they won't be
+ * able to switch this task while it's running.
+ */
+ preempt_disable();
+
+ /*
+ * Make sure current didn't get patched between the above check and
+ * preempt_disable().
+ */
+ if (unlikely(!klp_patch_pending(current)))
+ goto out;
+
+ /*
+ * Enforce the order of the TIF_PATCH_PENDING read above and the
+ * klp_target_state read in klp_try_switch_task(). The corresponding
+ * write barriers are in klp_init_transition() and
+ * klp_reverse_transition().
+ */
+ smp_rmb();
+
+ klp_try_switch_task(current);
+
+out:
+ preempt_enable();
+}
+EXPORT_SYMBOL(__klp_sched_try_switch);
+
/*
* Sends a fake signal to all non-kthread tasks with TIF_PATCH_PENDING set.
* Kthreads with TIF_PATCH_PENDING set are woken up.
@@ -449,7 +507,8 @@ void klp_try_complete_transition(void)
return;
}

- /* we're done, now cleanup the data structures */
+ /* Done! Now cleanup the data structures. */
+ klp_cond_resched_disable();
patch = klp_transition_patch;
klp_complete_transition();

@@ -501,6 +560,8 @@ void klp_start_transition(void)
set_tsk_thread_flag(task, TIF_PATCH_PENDING);
}

+ klp_cond_resched_enable();
+
klp_signals_cnt = 0;
}

@@ -556,8 +617,9 @@ void klp_init_transition(struct klp_patch *patch, int state)
* see a func in transition with a task->patch_state of KLP_UNDEFINED.
*
* Also enforce the order of the klp_target_state write and future
- * TIF_PATCH_PENDING writes to ensure klp_update_patch_state() doesn't
- * set a task->patch_state to KLP_UNDEFINED.
+ * TIF_PATCH_PENDING writes to ensure klp_update_patch_state() and
+ * __klp_sched_try_switch() don't set a task->patch_state to
+ * KLP_UNDEFINED.
*/
smp_wmb();

@@ -593,14 +655,10 @@ void klp_reverse_transition(void)
klp_target_state == KLP_PATCHED ? "patching to unpatching" :
"unpatching to patching");

- klp_transition_patch->enabled = !klp_transition_patch->enabled;
-
- klp_target_state = !klp_target_state;
-
/*
* Clear all TIF_PATCH_PENDING flags to prevent races caused by
- * klp_update_patch_state() running in parallel with
- * klp_start_transition().
+ * klp_update_patch_state() or __klp_sched_try_switch() running in
+ * parallel with the reverse transition.
*/
read_lock(&tasklist_lock);
for_each_process_thread(g, task)
@@ -610,9 +668,28 @@ void klp_reverse_transition(void)
for_each_possible_cpu(cpu)
clear_tsk_thread_flag(idle_task(cpu), TIF_PATCH_PENDING);

- /* Let any remaining calls to klp_update_patch_state() complete */
+ /*
+ * Make sure all existing invocations of klp_update_patch_state() and
+ * __klp_sched_try_switch() see the cleared TIF_PATCH_PENDING before
+ * starting the reverse transition.
+ */
klp_synchronize_transition();

+ /*
+ * All patching has stopped, now re-initialize the global variables to
+ * prepare for the reverse transition.
+ */
+ klp_transition_patch->enabled = !klp_transition_patch->enabled;
+ klp_target_state = !klp_target_state;
+
+ /*
+ * Enforce the order of the klp_target_state write and the
+ * TIF_PATCH_PENDING writes in klp_start_transition() to ensure
+ * klp_update_patch_state() and __klp_sched_try_switch() don't set
+ * task->patch_state to the wrong value.
+ */
+ smp_wmb();
+
klp_start_transition();
}

@@ -626,9 +703,9 @@ void klp_copy_process(struct task_struct *child)
* the task flag up to date with the parent here.
*
* The operation is serialized against all klp_*_transition()
- * operations by the tasklist_lock. The only exception is
- * klp_update_patch_state(current), but we cannot race with
- * that because we are current.
+ * operations by the tasklist_lock. The only exceptions are
+ * klp_update_patch_state(current) and __klp_sched_try_switch(), but we
+ * cannot race with them because we are current.
*/
if (test_tsk_thread_flag(current, TIF_PATCH_PENDING))
set_tsk_thread_flag(child, TIF_PATCH_PENDING);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5ddd961..b9616f1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8525,6 +8525,7 @@ EXPORT_STATIC_CALL_TRAMP(might_resched);
static DEFINE_STATIC_KEY_FALSE(sk_dynamic_cond_resched);
int __sched dynamic_cond_resched(void)
{
+ klp_sched_try_switch();
if (!static_branch_unlikely(&sk_dynamic_cond_resched))
return 0;
return __cond_resched();
@@ -8673,13 +8674,17 @@ int sched_dynamic_mode(const char *str)
#error "Unsupported PREEMPT_DYNAMIC mechanism"
#endif

-void sched_dynamic_update(int mode)
+DEFINE_MUTEX(sched_dynamic_mutex);
+static bool klp_override;
+
+static void __sched_dynamic_update(int mode)
{
/*
* Avoid {NONE,VOLUNTARY} -> FULL transitions from ever ending up in
* the ZERO state, which is invalid.
*/
- preempt_dynamic_enable(cond_resched);
+ if (!klp_override)
+ preempt_dynamic_enable(cond_resched);
preempt_dynamic_enable(might_resched);
preempt_dynamic_enable(preempt_schedule);
preempt_dynamic_enable(preempt_schedule_notrace);
@@ -8687,36 +8692,79 @@ void sched_dynamic_update(int mode)

switch (mode) {
case preempt_dynamic_none:
- preempt_dynamic_enable(cond_resched);
+ if (!klp_override)
+ preempt_dynamic_enable(cond_resched);
preempt_dynamic_disable(might_resched);
preempt_dynamic_disable(preempt_schedule);
preempt_dynamic_disable(preempt_schedule_notrace);
preempt_dynamic_disable(irqentry_exit_cond_resched);
- pr_info("Dynamic Preempt: none\n");
+ if (mode != preempt_dynamic_mode)
+ pr_info("Dynamic Preempt: none\n");
break;

case preempt_dynamic_voluntary:
- preempt_dynamic_enable(cond_resched);
+ if (!klp_override)
+ preempt_dynamic_enable(cond_resched);
preempt_dynamic_enable(might_resched);
preempt_dynamic_disable(preempt_schedule);
preempt_dynamic_disable(preempt_schedule_notrace);
preempt_dynamic_disable(irqentry_exit_cond_resched);
- pr_info("Dynamic Preempt: voluntary\n");
+ if (mode != preempt_dynamic_mode)
+ pr_info("Dynamic Preempt: voluntary\n");
break;

case preempt_dynamic_full:
- preempt_dynamic_disable(cond_resched);
+ if (!klp_override)
+ preempt_dynamic_disable(cond_resched);
preempt_dynamic_disable(might_resched);
preempt_dynamic_enable(preempt_schedule);
preempt_dynamic_enable(preempt_schedule_notrace);
preempt_dynamic_enable(irqentry_exit_cond_resched);
- pr_info("Dynamic Preempt: full\n");
+ if (mode != preempt_dynamic_mode)
+ pr_info("Dynamic Preempt: full\n");
break;
}

preempt_dynamic_mode = mode;
}

+void sched_dynamic_update(int mode)
+{
+ mutex_lock(&sched_dynamic_mutex);
+ __sched_dynamic_update(mode);
+ mutex_unlock(&sched_dynamic_mutex);
+}
+
+#ifdef CONFIG_HAVE_PREEMPT_DYNAMIC_CALL
+
+static int klp_cond_resched(void)
+{
+ __klp_sched_try_switch();
+ return __cond_resched();
+}
+
+void sched_dynamic_klp_enable(void)
+{
+ mutex_lock(&sched_dynamic_mutex);
+
+ klp_override = true;
+ static_call_update(cond_resched, klp_cond_resched);
+
+ mutex_unlock(&sched_dynamic_mutex);
+}
+
+void sched_dynamic_klp_disable(void)
+{
+ mutex_lock(&sched_dynamic_mutex);
+
+ klp_override = false;
+ __sched_dynamic_update(preempt_dynamic_mode);
+
+ mutex_unlock(&sched_dynamic_mutex);
+}
+
+#endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
+
static int __init setup_preempt_mode(char *str)
{
int mode = sched_dynamic_mode(str);

Date: 2023-03-30 16:26:11
From: Miroslav Benes
Subject: Re: [PATCH v3 0/3] livepatch,sched: Add livepatch task switching to cond_resched()

On Fri, 24 Feb 2023, Josh Poimboeuf wrote:

> v3:
> - Add barriers (pmladek)
> - Update comments
>
> v2:
> - Avoid calling klp_cond_resched_disable() in klp_cancel_transition()
> - Fix race in klp_reverse_transition()
>
> Fix patching stalls caused by busy kthreads.
>
> Josh Poimboeuf (3):
> livepatch: Skip task_call_func() for current task
> livepatch,sched: Add livepatch task switching to cond_resched()
> vhost: Fix livepatch timeouts in vhost_worker()
>
> drivers/vhost/vhost.c | 3 +-
> include/linux/livepatch.h | 1 +
> include/linux/livepatch_sched.h | 29 ++++++++
> include/linux/sched.h | 20 ++++--
> kernel/livepatch/core.c | 1 +
> kernel/livepatch/transition.c | 113 +++++++++++++++++++++++++++-----
> kernel/sched/core.c | 64 +++++++++++++++---
> 7 files changed, 200 insertions(+), 31 deletions(-)
> create mode 100644 include/linux/livepatch_sched.h

Late, so just recording it here...

Acked-by: Miroslav Benes <[email protected]>

Thanks for improving the situation, Josh.

M