Hi folks,
this is v2 of:
https://lore.kernel.org/lkml/[email protected]/
respun from Frederic and Paul's helpful feedback. Tested atop v5.14-rc4-rt6 with
the v1 patches reverted. There, commit
d76e0926d835 ("rcu/nocb: Use the rcuog CPU's ->nocb_timer")
prevents the NOCB offload warning from firing if there are no NOCB CPUs (which
is sensible). Adding a single NOCB CPU brings the warning back, which patch 3
fixes.
Note that patches 2 & 4 are already in v5.14-rc4-rt6, but still apply against
mainline.
Revisions
=========
v1 -> v2
++++++++
o Rebased and tested against v5.14-rc4-rt6
o Picked rcutorture patch from
https://lore.kernel.org/lkml/[email protected]/
o Added a local_lock to protect NOCB offload state under PREEMPT_RT (Frederic,
Paul)
Valentin Schneider (4):
rcutorture: Don't disable softirqs with preemption disabled when
PREEMPT_RT
sched: Introduce is_pcpu_safe()
rcu/nocb: Protect NOCB state via local_lock() under PREEMPT_RT
arm64: mm: Make arch_faults_on_old_pte() check for migratability
arch/arm64/include/asm/pgtable.h | 2 +-
include/linux/sched.h | 10 ++++
kernel/rcu/rcutorture.c | 2 +
kernel/rcu/tree.c | 4 ++
kernel/rcu/tree.h | 4 ++
kernel/rcu/tree_plugin.h | 82 ++++++++++++++++++++++++++++----
6 files changed, 94 insertions(+), 10 deletions(-)
--
2.25.1
Some areas use preempt_disable() + preempt_enable() to safely access
per-CPU data. The PREEMPT_RT folks have shown this can also be done by
keeping preemption enabled and instead disabling migration (and acquiring a
sleepable lock, if relevant).
Introduce a helper which checks whether the current task can safely access
per-CPU data, IOW if the task's context guarantees the accesses will target
a single CPU. This accounts for preemption, CPU affinity, and migrate
disable - note that the CPU affinity check also mandates the presence of
PF_NO_SETAFFINITY, as otherwise userspace could concurrently render the
upcoming per-CPU access(es) unsafe.
Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/sched.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index debc960f41e3..b77d65f677f6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1715,6 +1715,16 @@ static inline bool is_percpu_thread(void)
#endif
}
+/* Is the current task guaranteed not to be migrated elsewhere? */
+static inline bool is_pcpu_safe(void)
+{
+#ifdef CONFIG_SMP
+ return !preemptible() || is_percpu_thread() || current->migration_disabled;
+#else
+ return true;
+#endif
+}
+
/* Per-process atomic flags. */
#define PFA_NO_NEW_PRIVS 0 /* May not gain new privileges. */
#define PFA_SPREAD_PAGE 1 /* Spread page cache over cpuset */
--
2.25.1
Running RCU torture with CONFIG_PREEMPT_RT under v5.14-rc4-rt6 triggers:
[ 8.755472] DEBUG_LOCKS_WARN_ON(this_cpu_read(softirq_ctrl.cnt))
[ 8.755482] WARNING: CPU: 1 PID: 137 at kernel/softirq.c:172 __local_bh_disable_ip (kernel/softirq.c:172 (discriminator 31))
[ 8.755500] Modules linked in:
[ 8.755506] CPU: 1 PID: 137 Comm: rcu_torture_rea Not tainted 5.14.0-rc4-rt6 #171
[ 8.755514] Hardware name: ARM Juno development board (r0) (DT)
[ 8.755518] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
[ 8.755622] Call trace:
[ 8.755624] __local_bh_disable_ip (kernel/softirq.c:172 (discriminator 31))
[ 8.755633] rcutorture_one_extend (kernel/rcu/rcutorture.c:1453)
[ 8.755640] rcu_torture_one_read (kernel/rcu/rcutorture.c:1601 kernel/rcu/rcutorture.c:1645)
[ 8.755645] rcu_torture_reader (kernel/rcu/rcutorture.c:1737)
[ 8.755650] kthread (kernel/kthread.c:327)
[ 8.755656] ret_from_fork (arch/arm64/kernel/entry.S:783)
[ 8.755663] irq event stamp: 11959
[ 8.755666] hardirqs last enabled at (11959): __rcu_read_unlock (kernel/rcu/tree_plugin.h:695 kernel/rcu/tree_plugin.h:451)
[ 8.755675] hardirqs last disabled at (11958): __rcu_read_unlock (kernel/rcu/tree_plugin.h:661 kernel/rcu/tree_plugin.h:451)
[ 8.755683] softirqs last enabled at (11950): __local_bh_enable_ip (./arch/arm64/include/asm/irqflags.h:85 kernel/softirq.c:261)
[ 8.755692] softirqs last disabled at (11942): rcutorture_one_extend (./include/linux/bottom_half.h:19 kernel/rcu/rcutorture.c:1441)
Per this warning, softirqs cannot be disabled in a non-preemptible region
under CONFIG_PREEMPT_RT. Adjust RCU torture accordingly.
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/rcu/rcutorture.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index eecd1caef71d..4f3db1d3170d 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -1548,6 +1548,8 @@ rcutorture_extend_mask(int oldmask, struct torture_random_state *trsp)
* them on non-RT.
*/
if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+ /* Can't disable bh in atomic context under PREEMPT_RT */
+ mask &= ~(RCUTORTURE_RDR_ATOM_BH | RCUTORTURE_RDR_ATOM_RBH);
/*
* Can't release the outermost rcu lock in an irq disabled
* section without preemption also being disabled, if irqs
--
2.25.1
Warning
=======
Running v5.13-rt1 on my arm64 Juno board triggers:
[ 0.156302] =============================
[ 0.160416] WARNING: suspicious RCU usage
[ 0.164529] 5.13.0-rt1 #20 Not tainted
[ 0.168300] -----------------------------
[ 0.172409] kernel/rcu/tree_plugin.h:69 Unsafe read of RCU_NOCB offloaded state!
[ 0.179920]
[ 0.179920] other info that might help us debug this:
[ 0.179920]
[ 0.188037]
[ 0.188037] rcu_scheduler_active = 1, debug_locks = 1
[ 0.194677] 3 locks held by rcuc/0/11:
[ 0.198448] #0: ffff00097ef10cf8 ((softirq_ctrl.lock).lock){+.+.}-{2:2}, at: __local_bh_disable_ip (./include/linux/rcupdate.h:662 kernel/softirq.c:171)
[ 0.208709] #1: ffff80001205e5f0 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock (kernel/locking/spinlock_rt.c:43 (discriminator 4))
[ 0.217134] #2: ffff80001205e5f0 (rcu_read_lock){....}-{1:2}, at: __local_bh_disable_ip (kernel/softirq.c:169)
[ 0.226428]
[ 0.226428] stack backtrace:
[ 0.230889] CPU: 0 PID: 11 Comm: rcuc/0 Not tainted 5.13.0-rt1 #20
[ 0.237100] Hardware name: ARM Juno development board (r0) (DT)
[ 0.243041] Call trace:
[ 0.245497] dump_backtrace (arch/arm64/kernel/stacktrace.c:163)
[ 0.249185] show_stack (arch/arm64/kernel/stacktrace.c:219)
[ 0.252522] dump_stack (lib/dump_stack.c:122)
[ 0.255947] lockdep_rcu_suspicious (kernel/locking/lockdep.c:6439)
[ 0.260328] rcu_rdp_is_offloaded (kernel/rcu/tree_plugin.h:69 kernel/rcu/tree_plugin.h:58)
[ 0.264537] rcu_core (kernel/rcu/tree.c:2332 kernel/rcu/tree.c:2398 kernel/rcu/tree.c:2777)
[ 0.267786] rcu_cpu_kthread (./include/linux/bottom_half.h:32 kernel/rcu/tree.c:2876)
[ 0.271644] smpboot_thread_fn (kernel/smpboot.c:165 (discriminator 3))
[ 0.275767] kthread (kernel/kthread.c:321)
[ 0.279013] ret_from_fork (arch/arm64/kernel/entry.S:1005)
In this case, this is the RCU core kthread accessing the local CPU's
rdp. Before that, rcu_cpu_kthread() invokes local_bh_disable().
Under !CONFIG_PREEMPT_RT (and rcutree.use_softirq=0), this ends up
incrementing the preempt_count, which satisfies the "local non-preemptible
read" of rcu_rdp_is_offloaded().
Under CONFIG_PREEMPT_RT however, this becomes
local_lock(&softirq_ctrl.lock)
which, under the same config, is migrate_disable() + rt_spin_lock(). As
pointed out by Frederic, this is not sufficient to safely access an rdp's
offload state, as the RCU core kthread can be preempted by a kworker
executing rcu_nocb_rdp_offload() [1].
Introduce a local_lock to serialize an rdp's offload state while the rdp's
associated core kthread is executing rcu_core().
rcu_core() preemptability considerations
========================================
As pointed out by Paul [2], keeping rcu_check_quiescent_state() preemptible
(which is the case under CONFIG_PREEMPT_RT) requires some consideration.
note_gp_changes() itself runs with irqs off, and enters
__note_gp_changes() with rnp->lock held (raw_spinlock), thus is safe vs
preemption.
rdp->core_needs_qs *could* change after being read by the RCU core
kthread if it then gets preempted. Consider, with
CONFIG_RCU_STRICT_GRACE_PERIOD:
  rcuc/x                                  task_foo

    rcu_check_quiescent_state()
    `\
      rdp->core_needs_qs == true
                        <PREEMPT>
                                            rcu_read_unlock()
                                            `\
                                              rcu_preempt_deferred_qs_irqrestore()
                                              `\
                                                rcu_report_qs_rdp()
                                                `\
                                                  rdp->core_needs_qs := false;
This would let rcuc/x's rcu_check_quiescent_state() proceed further down to
rcu_report_qs_rdp(), but task_foo's earlier rcu_report_qs_rdp() invocation
would have cleared the rdp grpmask from the rnp mask, so rcuc/x's invocation
would simply bail.
Since rcu_report_qs_rdp() can be safely invoked, even if rdp->core_needs_qs
changed, it appears safe to keep rcu_check_quiescent_state() preemptible.
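The reasoning above can be replayed with a small userspace model. This is only
a sketch under stated assumptions: struct model_rnp, struct model_rdp and
model_report_qs() are hypothetical stand-ins rather than the kernel's
definitions, and the interleaving is replayed in a single thread. It only
shows that a second report for the same CPU finds its bit already cleared and
bails.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's rcu_node / rcu_data. */
struct model_rnp {
	unsigned long qsmask;	/* CPUs still owing a quiescent state */
};

struct model_rdp {
	unsigned long grpmask;	/* this CPU's bit in rnp->qsmask */
	bool core_needs_qs;
	struct model_rnp *rnp;
};

/*
 * Model of the report path: returns true if this call reported the
 * quiescent state, false if it bailed because the bit was already clear.
 */
bool model_report_qs(struct model_rdp *rdp)
{
	if (!(rdp->rnp->qsmask & rdp->grpmask))
		return false;		/* someone else already reported */

	rdp->rnp->qsmask &= ~rdp->grpmask;
	rdp->core_needs_qs = false;
	return true;
}

/* Replay the interleaving from the diagram in a single thread. */
bool model_stale_report_bails(void)
{
	struct model_rnp rnp = { .qsmask = 0x3 };
	struct model_rdp rdp = {
		.grpmask = 0x1, .core_needs_qs = true, .rnp = &rnp,
	};
	bool stale = rdp.core_needs_qs;		/* rcuc/x reads the flag ... */
	bool first = model_report_qs(&rdp);	/* ... task_foo reports meanwhile */
	bool second = stale ? model_report_qs(&rdp) : false; /* rcuc/x resumes */

	assert(rnp.qsmask == 0x2);		/* the other CPU's bit is untouched */
	return first && !second;
}
```

In this model the stale read of core_needs_qs only costs a redundant call; the
node mask is not modified a second time.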
[1]: http://lore.kernel.org/r/20210727230814.GC283787@lothringen
[2]: http://lore.kernel.org/r/20210729010445.GO4397@paulmck-ThinkPad-P17-Gen-1
Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/rcu/tree.c | 4 ++
kernel/rcu/tree.h | 4 ++
kernel/rcu/tree_plugin.h | 82 +++++++++++++++++++++++++++++++++++-----
3 files changed, 81 insertions(+), 9 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 51f24ecd94b2..caadfec994f3 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -87,6 +87,7 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
.dynticks = ATOMIC_INIT(RCU_DYNTICK_CTRL_CTR),
#ifdef CONFIG_RCU_NOCB_CPU
.cblist.flags = SEGCBLIST_SOFTIRQ_ONLY,
+ .nocb_local_lock = INIT_LOCAL_LOCK(nocb_local_lock),
#endif
};
static struct rcu_state rcu_state = {
@@ -2853,10 +2854,12 @@ static void rcu_cpu_kthread(unsigned int cpu)
{
unsigned int *statusp = this_cpu_ptr(&rcu_data.rcu_cpu_kthread_status);
char work, *workp = this_cpu_ptr(&rcu_data.rcu_cpu_has_work);
+ struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
int spincnt;
trace_rcu_utilization(TPS("Start CPU kthread@rcu_run"));
for (spincnt = 0; spincnt < 10; spincnt++) {
+ rcu_nocb_local_lock(rdp);
local_bh_disable();
*statusp = RCU_KTHREAD_RUNNING;
local_irq_disable();
@@ -2866,6 +2869,7 @@ static void rcu_cpu_kthread(unsigned int cpu)
if (work)
rcu_core();
local_bh_enable();
+ rcu_nocb_local_unlock(rdp);
if (*workp == 0) {
trace_rcu_utilization(TPS("End CPU kthread@rcu_wait"));
*statusp = RCU_KTHREAD_WAITING;
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 305cf6aeb408..aa6831255fec 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -210,6 +210,8 @@ struct rcu_data {
struct timer_list nocb_timer; /* Enforce finite deferral. */
unsigned long nocb_gp_adv_time; /* Last call_rcu() CB adv (jiffies). */
+ local_lock_t nocb_local_lock;
+
/* The following fields are used by call_rcu, hence own cacheline. */
raw_spinlock_t nocb_bypass_lock ____cacheline_internodealigned_in_smp;
struct rcu_cblist nocb_bypass; /* Lock-contention-bypass CB list. */
@@ -445,6 +447,8 @@ static void rcu_nocb_unlock(struct rcu_data *rdp);
static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
unsigned long flags);
static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp);
+static void rcu_nocb_local_lock(struct rcu_data *rdp);
+static void rcu_nocb_local_unlock(struct rcu_data *rdp);
#ifdef CONFIG_RCU_NOCB_CPU
static void __init rcu_organize_nocb_kthreads(void);
#define rcu_nocb_lock_irqsave(rdp, flags) \
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0ff5e4fb933e..b1487cac9250 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -21,6 +21,17 @@ static inline int rcu_lockdep_is_held_nocb(struct rcu_data *rdp)
return lockdep_is_held(&rdp->nocb_lock);
}
+static inline int rcu_lockdep_is_held_nocb_local(struct rcu_data *rdp)
+{
+ return lockdep_is_held(
+#ifdef CONFIG_PREEMPT_RT
+ &rdp->nocb_local_lock.lock
+#else
+ &rdp->nocb_local_lock
+#endif
+ );
+}
+
static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp)
{
/* Race on early boot between thread creation and assignment */
@@ -38,7 +49,10 @@ static inline int rcu_lockdep_is_held_nocb(struct rcu_data *rdp)
{
return 0;
}
-
+static inline int rcu_lockdep_is_held_nocb_local(struct rcu_data *rdp)
+{
+ return 0;
+}
static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp)
{
return false;
@@ -46,23 +60,44 @@ static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp)
#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
+/*
+ * Is a local read of the rdp's offloaded state safe and stable?
+ * See rcu_nocb_local_lock() & family.
+ */
+static inline bool rcu_local_offload_access_safe(struct rcu_data *rdp)
+{
+ if (!preemptible())
+ return true;
+
+ if (is_pcpu_safe()) {
+ if (!IS_ENABLED(CONFIG_RCU_NOCB_CPU))
+ return true;
+
+ return rcu_lockdep_is_held_nocb_local(rdp);
+ }
+
+ return false;
+}
+
static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
{
/*
- * In order to read the offloaded state of an rdp is a safe
- * and stable way and prevent from its value to be changed
- * under us, we must either hold the barrier mutex, the cpu
- * hotplug lock (read or write) or the nocb lock. Local
- * non-preemptible reads are also safe. NOCB kthreads and
- * timers have their own means of synchronization against the
- * offloaded state updaters.
+ * In order to read the offloaded state of an rdp in a safe and stable
+ * way and prevent its value from being changed under us, we must either...
*/
RCU_LOCKDEP_WARN(
+ // ...hold the barrier mutex...
!(lockdep_is_held(&rcu_state.barrier_mutex) ||
+ // ... the cpu hotplug lock (read or write)...
(IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_held()) ||
+ // ... or the NOCB lock.
rcu_lockdep_is_held_nocb(rdp) ||
+ // Local reads still require the local state to remain stable
+ // (preemption disabled / local lock held)
(rdp == this_cpu_ptr(&rcu_data) &&
- !(IS_ENABLED(CONFIG_PREEMPT_COUNT) && preemptible())) ||
+ rcu_local_offload_access_safe(rdp)) ||
+ // NOCB kthreads and timers have their own means of synchronization
+ // against the offloaded state updaters.
rcu_current_is_nocb_kthread(rdp)),
"Unsafe read of RCU_NOCB offloaded state"
);
@@ -1629,6 +1664,22 @@ static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
}
}
+/*
+ * The invocation of rcu_core() within the RCU core kthreads remains preemptible
+ * under PREEMPT_RT, thus the offload state of a CPU could change while
+ * said kthreads are preempted. Prevent this from happening by protecting the
+ * offload state with a local_lock().
+ */
+static void rcu_nocb_local_lock(struct rcu_data *rdp)
+{
+ local_lock(&rcu_data.nocb_local_lock);
+}
+
+static void rcu_nocb_local_unlock(struct rcu_data *rdp)
+{
+ local_unlock(&rcu_data.nocb_local_lock);
+}
+
/* Lockdep check that ->cblist may be safely accessed. */
static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp)
{
@@ -2396,6 +2447,7 @@ static int rdp_offload_toggle(struct rcu_data *rdp,
if (rdp->nocb_cb_sleep)
rdp->nocb_cb_sleep = false;
rcu_nocb_unlock_irqrestore(rdp, flags);
+ rcu_nocb_local_unlock(rdp);
/*
* Ignore former value of nocb_cb_sleep and force wake up as it could
@@ -2427,6 +2479,7 @@ static long rcu_nocb_rdp_deoffload(void *arg)
pr_info("De-offloading %d\n", rdp->cpu);
+ rcu_nocb_local_lock(rdp);
rcu_nocb_lock_irqsave(rdp, flags);
/*
* Flush once and for all now. This suffices because we are
@@ -2509,6 +2562,7 @@ static long rcu_nocb_rdp_offload(void *arg)
* Can't use rcu_nocb_lock_irqsave() while we are in
* SEGCBLIST_SOFTIRQ_ONLY mode.
*/
+ rcu_nocb_local_lock(rdp);
raw_spin_lock_irqsave(&rdp->nocb_lock, flags);
/*
@@ -2868,6 +2922,16 @@ static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
local_irq_restore(flags);
}
+/* No ->nocb_local_lock to acquire. */
+static void rcu_nocb_local_lock(struct rcu_data *rdp)
+{
+}
+
+/* No ->nocb_local_lock to release. */
+static void rcu_nocb_local_unlock(struct rcu_data *rdp)
+{
+}
+
/* Lockdep check that ->cblist may be safely accessed. */
static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp)
{
--
2.25.1
Running v5.13-rt1 on my arm64 Juno board triggers:
[ 30.430643] WARNING: CPU: 4 PID: 1 at arch/arm64/include/asm/pgtable.h:985 do_set_pte (./arch/arm64/include/asm/pgtable.h:985 ./arch/arm64/include/asm/pgtable.h:997 mm/memory.c:3830)
[ 30.430669] Modules linked in:
[ 30.430679] CPU: 4 PID: 1 Comm: init Tainted: G W 5.13.0-rt1-00002-gcb994ad7c570 #35
[ 30.430690] Hardware name: ARM Juno development board (r0) (DT)
[ 30.430695] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[ 30.430874] Call trace:
[ 30.430878] do_set_pte (./arch/arm64/include/asm/pgtable.h:985 ./arch/arm64/include/asm/pgtable.h:997 mm/memory.c:3830)
[ 30.430886] filemap_map_pages (mm/filemap.c:3222)
[ 30.430895] __handle_mm_fault (mm/memory.c:4006 mm/memory.c:4020 mm/memory.c:4153 mm/memory.c:4412 mm/memory.c:4547)
[ 30.430904] handle_mm_fault (mm/memory.c:4645)
[ 30.430912] do_page_fault (arch/arm64/mm/fault.c:507 arch/arm64/mm/fault.c:607)
[ 30.430925] do_translation_fault (arch/arm64/mm/fault.c:692)
[ 30.430936] do_mem_abort (arch/arm64/mm/fault.c:821)
[ 30.430946] el0_ia (arch/arm64/kernel/entry-common.c:324)
[ 30.430959] el0_sync_handler (arch/arm64/kernel/entry-common.c:431)
[ 30.430967] el0_sync (arch/arm64/kernel/entry.S:744)
[ 30.430977] irq event stamp: 1228384
[ 30.430981] hardirqs last enabled at (1228383): lock_page_memcg (mm/memcontrol.c:2005 (discriminator 1))
[ 30.430993] hardirqs last disabled at (1228384): el1_dbg (arch/arm64/kernel/entry-common.c:144 arch/arm64/kernel/entry-common.c:234)
[ 30.431007] softirqs last enabled at (1228260): __local_bh_enable_ip (./arch/arm64/include/asm/irqflags.h:85 kernel/softirq.c:262)
[ 30.431022] softirqs last disabled at (1228232): fpsimd_restore_current_state (./include/linux/bottom_half.h:19 arch/arm64/kernel/fpsimd.c:183 arch/arm64/kernel/fpsimd.c:1182)
CONFIG_PREEMPT_RT turns the PTE lock into a sleepable spinlock. Since
acquiring such a lock also disables migration, any per-CPU access done
under the lock remains safe even if preemptible.
This affects:
  filemap_map_pages()
  `\
    do_set_pte()
    `\
      arch_wants_old_prefaulted_pte()
which checks preemptible() to figure out if the output of
cpu_has_hw_af() (IOW the underlying CPU) will remain stable for the
subsequent operations. Make it use is_pcpu_safe() instead.
Signed-off-by: Valentin Schneider <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index f09bf5c02891..767a064bedde 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -995,7 +995,7 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
*/
static inline bool arch_faults_on_old_pte(void)
{
- WARN_ON(preemptible());
+ WARN_ON(!is_pcpu_safe());
return !cpu_has_hw_af();
}
--
2.25.1
On Sat, 2021-08-07 at 01:58 +0100, Valentin Schneider wrote:
>
> +static inline bool is_pcpu_safe(void)
Nit: seems odd to avoid spelling it out to save two characters, percpu
is word like, rolls off the ole tongue better than p-c-p-u.
-Mike
On 07/08/21 03:42, Mike Galbraith wrote:
> On Sat, 2021-08-07 at 01:58 +0100, Valentin Schneider wrote:
>>
>> +static inline bool is_pcpu_safe(void)
>
> Nit: seems odd to avoid spelling it out to save two characters, percpu
> is word like, rolls off the ole tongue better than p-c-p-u.
>
> -Mike
True. A quick grep says both versions are used, though "percpu" wins by
about a factor of 2. I'll tweak that for a v3.
Hi,
On Sat, Aug 07, 2021 at 01:58:05AM +0100, Valentin Schneider wrote:
> Some areas use preempt_disable() + preempt_enable() to safely access
> per-CPU data. The PREEMPT_RT folks have shown this can also be done by
> keeping preemption enabled and instead disabling migration (and acquiring a
> sleepable lock, if relevant).
>
> Introduce a helper which checks whether the current task can safely access
> per-CPU data, IOW if the task's context guarantees the accesses will target
> a single CPU. This accounts for preemption, CPU affinity, and migrate
> disable - note that the CPU affinity check also mandates the presence of
> PF_NO_SETAFFINITY, as otherwise userspace could concurrently render the
> upcoming per-CPU access(es) unsafe.
>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> include/linux/sched.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index debc960f41e3..b77d65f677f6 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1715,6 +1715,16 @@ static inline bool is_percpu_thread(void)
> #endif
> }
>
> +/* Is the current task guaranteed not to be migrated elsewhere? */
> +static inline bool is_pcpu_safe(void)
> +{
> +#ifdef CONFIG_SMP
> + return !preemptible() || is_percpu_thread() || current->migration_disabled;
> +#else
> + return true;
> +#endif
> +}
I wonder whether the following can happen, say thread A is a worker
thread for CPU 1, so it has the flag PF_NO_SETAFFINITY set.
{ percpu variable X on CPU 2 is initially 0 }
thread A
========
<preemption enabled>
if (is_pcpu_safe()) { // nr_cpus_allowed == 1, so return true.
<preempted>
<hot unplug CPU 1>
unbind_workers(1); // A->cpus_mask becomes cpu_possible_mask
<back to run on CPU 2>
__this_cpu_inc(X);
tmp = X; // tmp == 0
<preempted>
<in thread B>
this_cpu_inc(X); // X becomes 1
<back to run A on CPU 2>
X = tmp + 1; // race!
}
if so, then is_percpu_thread() doesn't indicate is_pcpu_safe()?
Regards,
Boqun
> +
> /* Per-process atomic flags. */
> #define PFA_NO_NEW_PRIVS 0 /* May not gain new privileges. */
> #define PFA_SPREAD_PAGE 1 /* Spread page cache over cpuset */
> --
> 2.25.1
>
Hi,
On 10/08/21 10:42, Boqun Feng wrote:
> Hi,
>
> On Sat, Aug 07, 2021 at 01:58:05AM +0100, Valentin Schneider wrote:
>> Some areas use preempt_disable() + preempt_enable() to safely access
>> per-CPU data. The PREEMPT_RT folks have shown this can also be done by
>> keeping preemption enabled and instead disabling migration (and acquiring a
>> sleepable lock, if relevant).
>>
>> Introduce a helper which checks whether the current task can safely access
>> per-CPU data, IOW if the task's context guarantees the accesses will target
>> a single CPU. This accounts for preemption, CPU affinity, and migrate
>> disable - note that the CPU affinity check also mandates the presence of
>> PF_NO_SETAFFINITY, as otherwise userspace could concurrently render the
>> upcoming per-CPU access(es) unsafe.
>>
>> Signed-off-by: Valentin Schneider <[email protected]>
>> ---
>> include/linux/sched.h | 10 ++++++++++
>> 1 file changed, 10 insertions(+)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index debc960f41e3..b77d65f677f6 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1715,6 +1715,16 @@ static inline bool is_percpu_thread(void)
>> #endif
>> }
>>
>> +/* Is the current task guaranteed not to be migrated elsewhere? */
>> +static inline bool is_pcpu_safe(void)
>> +{
>> +#ifdef CONFIG_SMP
>> + return !preemptible() || is_percpu_thread() || current->migration_disabled;
>> +#else
>> + return true;
>> +#endif
>> +}
>
> I wonder whether the following can happen, say thread A is a worker
> thread for CPU 1, so it has the flag PF_NO_SETAFFINITY set.
>
> { percpu variable X on CPU 2 is initially 0 }
>
> thread A
> ========
>
> <preemption enabled>
> if (is_pcpu_safe()) { // nr_cpus_allowed == 1, so return true.
> <preempted>
> <hot unplug CPU 1>
> unbind_workers(1); // A->cpus_mask becomes cpu_possible_mask
> <back to run on CPU 2>
> __this_cpu_inc(X);
> tmp = X; // tmp == 0
> <preempted>
> <in thread B>
> this_cpu_inc(X); // X becomes 1
> <back to run A on CPU 2>
> X = tmp + 1; // race!
> }
>
> if so, then is_percpu_thread() doesn't indicate is_pcpu_safe()?
>
You're absolutely right.
migrate_disable() protects the thread against being migrated due to
hotplug, but pure CPU affinity doesn't at all. kthread_is_per_cpu() doesn't
work either, because parking is not the only approach to hotplug for those
(e.g. per-CPU workqueue threads unbind themselves on hotplug, as in your
example).
One could hold cpus_read_lock(), but I don't see much point here. So that
has to be
return !preemptible() || current->migration_disabled;
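To make the difference concrete, here is a small userspace model of the two
predicates. It is only a sketch: struct model_task and both functions are
hypothetical stand-ins, and the booleans stand for what preemptible(),
is_percpu_thread() and current->migration_disabled would report in the kernel.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the task state the helper looks at. */
struct model_task {
	bool preemptible;		/* preemptible() would return true */
	bool percpu_thread;		/* PF_NO_SETAFFINITY && nr_cpus_allowed == 1 */
	bool migration_disabled;	/* inside a migrate_disable() section */
};

/* v2 predicate, as posted in this patch. */
bool model_is_pcpu_safe_v2(const struct model_task *t)
{
	return !t->preemptible || t->percpu_thread || t->migration_disabled;
}

/* Revised predicate: CPU affinity alone is no longer accepted. */
bool model_is_pcpu_safe_revised(const struct model_task *t)
{
	return !t->preemptible || t->migration_disabled;
}
```

For a per-CPU worker running with preemption enabled and outside
migrate_disable(), the v2 check passes while the revised one does not, which
matches the hotplug scenario quoted above.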
Thanks!
On Sun, Aug 08, 2021 at 05:15:20PM +0100, Valentin Schneider wrote:
> On 07/08/21 03:42, Mike Galbraith wrote:
> > On Sat, 2021-08-07 at 01:58 +0100, Valentin Schneider wrote:
> >>
> >> +static inline bool is_pcpu_safe(void)
> >
> > Nit: seems odd to avoid spelling it out to save two characters, percpu
> > is word like, rolls off the ole tongue better than p-c-p-u.
> >
> > -Mike
>
> True. A quick grep says both versions are used, though "percpu" wins by
> about a factor of 2. I'll tweak that for a v3.
I wonder whether is_percpu_safe() is the correct name. The safety of
accesses to percpu variables means two things to me:
a) The thread cannot migrate to another CPU in the middle of
accessing a percpu variable; in other words, the following
cannot happen:
{ percpu variable X is 0 on CPU 0 and 2 on CPU 1 }
CPU 0                           CPU 1
========                        =========
<in thread A>
__this_cpu_inc(X);
  tmp = X; // tmp is 0
  <preempted>
  <migrate to CPU 1>
                                // continue __this_cpu_inc(X);
                                X = tmp + 1; // CPU 0 misses this
                                             // increment (this
                                             // may be OK), and
                                             // CPU 1's X got
                                             // corrupted.
b) The accesses to a percpu variable are exclusive, i.e. no
interrupt or preemption can happen in the middle of accessing,
in other words, the following cannot happen:
{ percpu variable X is 0 on CPU 0 }
CPU 0
========
<in thread A>
__this_cpu_inc(X);
tmp = X; // tmp is 0
<preempted>
<in other thread>
this_cpu_inc(X); // X is 1 afterwards.
<back to thread A>
X = tmp + 1; // X is 1, and we have a race condition.
And the is_p{er}cpu_safe() only detects the first, and it doesn't mean
totally safe for percpu accesses.
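Case b) can be replayed deterministically in userspace. The following is a
minimal sketch, not kernel code: X models one CPU's instance of the variable,
the two model_inc_*() helpers are hypothetical names for the load and store
halves of a non-atomic __this_cpu_inc(), and the "other thread" is simulated
inline rather than by a real scheduler. Note that nothing here involves
migration, so disabling migration alone would not prevent it.

```c
#include <stdbool.h>

/* One CPU's instance of a per-CPU counter (hypothetical model). */
static int X;

/* The two halves of a non-atomic __this_cpu_inc(X). */
static int model_inc_load(void)      { return X; }
static void model_inc_store(int tmp) { X = tmp + 1; }

/*
 * Replay case b): thread A is preempted between the load and the store,
 * and another thread on the same CPU increments X in between.
 * Returns the final value of X after both increments have run.
 */
int model_lost_update(bool preempted_mid_increment)
{
	int tmp;

	X = 0;
	tmp = model_inc_load();		/* thread A: tmp = X */
	if (preempted_mid_increment)
		X = X + 1;		/* other thread: this_cpu_inc(X) */
	model_inc_store(tmp);		/* thread A: X = tmp + 1 */
	if (!preempted_mid_increment)
		X = X + 1;		/* other thread runs after A is done */
	return X;
}
```

Two increments should leave X at 2; with the preemption in the middle, one of
them is lost and X ends up at 1.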
Maybe we can implement a migratable()? Although I'm not sure it's an English
word.
Regards,
Boqun
On 10/08/21 20:49, Boqun Feng wrote:
> On Sun, Aug 08, 2021 at 05:15:20PM +0100, Valentin Schneider wrote:
>> On 07/08/21 03:42, Mike Galbraith wrote:
>> > On Sat, 2021-08-07 at 01:58 +0100, Valentin Schneider wrote:
>> >>
>> >> +static inline bool is_pcpu_safe(void)
>> >
>> > Nit: seems odd to avoid spelling it out to save two characters, percpu
>> > is word like, rolls off the ole tongue better than p-c-p-u.
>> >
>> > -Mike
>>
>> True. A quick grep says both versions are used, though "percpu" wins by
>> about a factor of 2. I'll tweak that for a v3.
>
> I wonder whether is_percpu_safe() is the correct name. The safety of
> accesses to percpu variables means two things to me:
>
> a) The thread cannot migrate to another CPU in the middle of
> accessing a percpu variable; in other words, the following
> cannot happen:
>
> { percpu variable X is 0 on CPU 0 and 2 on CPU 1 }
> CPU 0                           CPU 1
> ========                        =========
> <in thread A>
> __this_cpu_inc(X);
>   tmp = X; // tmp is 0
>   <preempted>
>   <migrate to CPU 1>
>                                 // continue __this_cpu_inc(X);
>                                 X = tmp + 1; // CPU 0 misses this
>                                              // increment (this
>                                              // may be OK), and
>                                              // CPU 1's X got
>                                              // corrupted.
>
> b) The accesses to a percpu variable are exclusive, i.e. no
> interrupt or preemption can happen in the middle of accessing,
> in other words, the following cannot happen:
>
> { percpu variable X is 0 on CPU 0 }
> CPU 0
> ========
> <in thread A>
> __this_cpu_inc(X);
> tmp = X; // tmp is 0
> <preempted>
> <in other thread>
> this_cpu_inc(X); // X is 1 afterwards.
> <back to thread A>
> X = tmp + 1; // X is 1, and we have a race condition.
>
> And the is_p{er}cpu_safe() only detects the first, and it doesn't mean
> totally safe for percpu accesses.
>
Right. I do briefly point this out in the changelog (the bit about
"acquiring a sleepable lock if relevant"), but that doesn't do much to
clarify the helper name itself.
> Maybe we can implement a migratable()? Although I'm not sure it's an English
> word.
>
Funnily enough that is exactly how I named the thing in my initial draft,
but then I somehow convinced myself that tailoring the name to per-CPU
accesses would make its intent clearer.
I think you're right that "migratable()" is less confusing at the end of
the day. Oh well, so much for overthinking the naming problem :-)
> Regards,
> Boqun