2022-04-08 05:31:05

by Kalesh Singh

Subject: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

Enabling CONFIG_RCU_BOOST did not reduce RCU expedited grace-period
latency because its workqueues run at SCHED_OTHER, and thus can be
delayed by normal processes. This commit avoids these delays by moving
the expedited GP work items to a real-time-priority kthread_worker.
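
In outline, the change creates dedicated kthread_workers at boot, raises
them to SCHED_FIFO, and queues the expedited GP work items on them with
the kthread_work APIs instead of the workqueue APIs. A simplified sketch
of the mechanism follows (names here are illustrative only; the actual
wiring, error handling, and Kconfig gating are in the diff below):

	static struct kthread_worker *exp_gp_kworker;

	static void __init start_exp_gp_kworker(void)
	{
		/* kthread_prio is the existing rcutree.kthread_prio parameter. */
		struct sched_param param = { .sched_priority = kthread_prio };

		exp_gp_kworker = kthread_create_worker(0, "rcu_exp_gp_kthread_worker");
		if (IS_ERR_OR_NULL(exp_gp_kworker))
			return;

		/*
		 * SCHED_FIFO keeps expedited GP work from being starved by
		 * SCHED_OTHER tasks, which is what the workqueues were subject to.
		 */
		sched_setscheduler_nocheck(exp_gp_kworker->task, SCHED_FIFO, &param);
	}

Queueing then uses kthread_init_work()/kthread_queue_work() in place of
INIT_WORK()/queue_work(), and kthread_flush_work() in place of flush_work().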

The results were evaluated on arm64 Android devices (6 GB RAM) running
a 5.10 kernel, with trace data captured in critical user-level code.

The table below shows the resulting order-of-magnitude improvements
in synchronize_rcu_expedited() latency:

------------------------------------------------------------------------
|                          |  workqueues  | kthread_worker |    Diff   |
------------------------------------------------------------------------
| Count                    |          725 |            688 |           |
------------------------------------------------------------------------
| Min Duration (ns)        |          326 |            447 |    37.12% |
------------------------------------------------------------------------
| Q1 (ns)                  |       39,428 |         38,971 |    -1.16% |
------------------------------------------------------------------------
| Q2 - Median (ns)         |       98,225 |         69,743 |   -29.00% |
------------------------------------------------------------------------
| Q3 (ns)                  |      342,122 |        126,638 |   -62.98% |
------------------------------------------------------------------------
| Max Duration (ns)        |  372,766,967 |      2,329,671 |   -99.38% |
------------------------------------------------------------------------
| Avg Duration (ns)        |    2,746,353 |        151,242 |   -94.49% |
------------------------------------------------------------------------
| Standard Deviation (ns)  |   19,327,765 |        294,408 |           |
------------------------------------------------------------------------

The table below shows the range of maximums/minimums for
synchronize_rcu_expedited() latency across all experiments:

------------------------------------------------------------------------
|                          |  workqueues  | kthread_worker |    Diff   |
------------------------------------------------------------------------
| Total No. of Experiments |           25 |             23 |           |
------------------------------------------------------------------------
| Largest Maximum (ns)     |  372,766,967 |      2,329,671 |   -99.38% |
------------------------------------------------------------------------
| Smallest Maximum (ns)    |       38,819 |         86,954 |   124.00% |
------------------------------------------------------------------------
| Range of Maximums (ns)   |  372,728,148 |      2,242,717 |           |
------------------------------------------------------------------------
| Largest Minimum (ns)     |       88,623 |         27,588 |   -68.87% |
------------------------------------------------------------------------
| Smallest Minimum (ns)    |          326 |            447 |    37.12% |
------------------------------------------------------------------------
| Range of Minimums (ns)   |       88,297 |         27,141 |           |
------------------------------------------------------------------------

Cc: "Paul E. McKenney" <[email protected]>
Cc: Tejun Heo <[email protected]>
Reported-by: Tim Murray <[email protected]>
Reported-by: Wei Wang <[email protected]>
Tested-by: Kyle Lin <[email protected]>
Tested-by: Chunwei Lu <[email protected]>
Tested-by: Lulu Wang <[email protected]>
Signed-off-by: Kalesh Singh <[email protected]>
---

Changes in v2:
- Add stats on variation: standard deviation, range of maxs, etc; per Paul
- Make this optimization configurable (CONFIG_RCU_EXP_KTHREAD); per Paul

kernel/rcu/Kconfig | 10 +++
kernel/rcu/rcu.h | 5 ++
kernel/rcu/tree.c | 51 ++++++++++++++-
kernel/rcu/tree.h | 5 ++
kernel/rcu/tree_exp.h | 147 +++++++++++++++++++++++++++++++++---------
5 files changed, 184 insertions(+), 34 deletions(-)

diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index bf8e341e75b4..56c6552c2af6 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -195,6 +195,16 @@ config RCU_BOOST_DELAY

Accept the default if unsure.

+config RCU_EXP_KTHREAD
+ bool "Perform RCU expedited work in a real-time kthread"
+ depends on RCU_BOOST && RCU_EXPERT
+ default !PREEMPT_RT && NR_CPUS <= 32
+ help
+ Use this option to further reduce the latencies of expedited
+ grace periods at the expense of being more disruptive.
+
+ Accept the default if unsure.
+
config RCU_NOCB_CPU
bool "Offload RCU callback processing from boot-selected CPUs"
depends on TREE_RCU
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 24b5f2c2de87..5510d2231c32 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -534,7 +534,12 @@ int rcu_get_gp_kthreads_prio(void);
void rcu_fwd_progress_check(unsigned long j);
void rcu_force_quiescent_state(void);
extern struct workqueue_struct *rcu_gp_wq;
+#ifdef CONFIG_RCU_EXP_KTHREAD
+extern struct kthread_worker *rcu_exp_gp_kworker;
+extern struct kthread_worker *rcu_exp_par_gp_kworker;
+#else /* !CONFIG_RCU_EXP_KTHREAD */
extern struct workqueue_struct *rcu_par_gp_wq;
+#endif /* CONFIG_RCU_EXP_KTHREAD */
#endif /* #else #ifdef CONFIG_TINY_RCU */

#ifdef CONFIG_RCU_NOCB_CPU
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a4b8189455d5..763e45fdf49b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4471,6 +4471,51 @@ static int rcu_pm_notify(struct notifier_block *self,
return NOTIFY_OK;
}

+#ifdef CONFIG_RCU_EXP_KTHREAD
+struct kthread_worker *rcu_exp_gp_kworker;
+struct kthread_worker *rcu_exp_par_gp_kworker;
+
+static void __init rcu_start_exp_gp_kworkers(void)
+{
+ const char *par_gp_kworker_name = "rcu_exp_par_gp_kthread_worker";
+ const char *gp_kworker_name = "rcu_exp_gp_kthread_worker";
+ struct sched_param param = { .sched_priority = kthread_prio };
+
+ rcu_exp_gp_kworker = kthread_create_worker(0, gp_kworker_name);
+ if (IS_ERR_OR_NULL(rcu_exp_gp_kworker)) {
+ pr_err("Failed to create %s!\n", gp_kworker_name);
+ return;
+ }
+
+ rcu_exp_par_gp_kworker = kthread_create_worker(0, par_gp_kworker_name);
+ if (IS_ERR_OR_NULL(rcu_exp_par_gp_kworker)) {
+ pr_err("Failed to create %s!\n", par_gp_kworker_name);
+ kthread_destroy_worker(rcu_exp_gp_kworker);
+ return;
+ }
+
+ sched_setscheduler_nocheck(rcu_exp_gp_kworker->task, SCHED_FIFO, &param);
+ sched_setscheduler_nocheck(rcu_exp_par_gp_kworker->task, SCHED_FIFO,
+ &param);
+}
+
+static inline void rcu_alloc_par_gp_wq(void)
+{
+}
+#else /* !CONFIG_RCU_EXP_KTHREAD */
+struct workqueue_struct *rcu_par_gp_wq;
+
+static void __init rcu_start_exp_gp_kworkers(void)
+{
+}
+
+static inline void rcu_alloc_par_gp_wq(void)
+{
+ rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM, 0);
+ WARN_ON(!rcu_par_gp_wq);
+}
+#endif /* CONFIG_RCU_EXP_KTHREAD */
+
/*
* Spawn the kthreads that handle RCU's grace periods.
*/
@@ -4500,6 +4545,8 @@ static int __init rcu_spawn_gp_kthread(void)
rcu_spawn_nocb_kthreads();
rcu_spawn_boost_kthreads();
rcu_spawn_core_kthreads();
+ /* Create kthread worker for expedited GPs */
+ rcu_start_exp_gp_kworkers();
return 0;
}
early_initcall(rcu_spawn_gp_kthread);
@@ -4745,7 +4792,6 @@ static void __init rcu_dump_rcu_node_tree(void)
}

struct workqueue_struct *rcu_gp_wq;
-struct workqueue_struct *rcu_par_gp_wq;

static void __init kfree_rcu_batch_init(void)
{
@@ -4811,8 +4857,7 @@ void __init rcu_init(void)
/* Create workqueue for Tree SRCU and for expedited GPs. */
rcu_gp_wq = alloc_workqueue("rcu_gp", WQ_MEM_RECLAIM, 0);
WARN_ON(!rcu_gp_wq);
- rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM, 0);
- WARN_ON(!rcu_par_gp_wq);
+ rcu_alloc_par_gp_wq();

/* Fill in default value for rcutree.qovld boot parameter. */
/* -After- the rcu_node ->lock fields are initialized! */
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 926673ebe355..b577cdfdc851 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -10,6 +10,7 @@
*/

#include <linux/cache.h>
+#include <linux/kthread.h>
#include <linux/spinlock.h>
#include <linux/rtmutex.h>
#include <linux/threads.h>
@@ -23,7 +24,11 @@
/* Communicate arguments to a workqueue handler. */
struct rcu_exp_work {
unsigned long rew_s;
+#ifdef CONFIG_RCU_EXP_KTHREAD
+ struct kthread_work rew_work;
+#else
struct work_struct rew_work;
+#endif /* CONFIG_RCU_EXP_KTHREAD */
};

/* RCU's kthread states for tracing. */
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 60197ea24ceb..2c6c0e0bc83a 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -334,15 +334,13 @@ static bool exp_funnel_lock(unsigned long s)
* Select the CPUs within the specified rcu_node that the upcoming
* expedited grace period needs to wait for.
*/
-static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
+static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
{
int cpu;
unsigned long flags;
unsigned long mask_ofl_test;
unsigned long mask_ofl_ipi;
int ret;
- struct rcu_exp_work *rewp =
- container_of(wp, struct rcu_exp_work, rew_work);
struct rcu_node *rnp = container_of(rewp, struct rcu_node, rew);

raw_spin_lock_irqsave_rcu_node(rnp, flags);
@@ -417,13 +415,119 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
rcu_report_exp_cpu_mult(rnp, mask_ofl_test, false);
}

+static void rcu_exp_sel_wait_wake(unsigned long s);
+
+#ifdef CONFIG_RCU_EXP_KTHREAD
+static void sync_rcu_exp_select_node_cpus(struct kthread_work *wp)
+{
+ struct rcu_exp_work *rewp =
+ container_of(wp, struct rcu_exp_work, rew_work);
+
+ __sync_rcu_exp_select_node_cpus(rewp);
+}
+
+static inline bool rcu_gp_par_worker_started(void)
+{
+ return !!READ_ONCE(rcu_exp_par_gp_kworker);
+}
+
+static inline void sync_rcu_exp_select_cpus_queue_work(struct rcu_node *rnp)
+{
+ kthread_init_work(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
+ /*
+ * Use rcu_exp_par_gp_kworker, because flushing a work item from
+ * another work item on the same kthread worker can result in
+ * deadlock.
+ */
+ kthread_queue_work(rcu_exp_par_gp_kworker, &rnp->rew.rew_work);
+}
+
+static inline void sync_rcu_exp_select_cpus_flush_work(struct rcu_node *rnp)
+{
+ kthread_flush_work(&rnp->rew.rew_work);
+}
+
+/*
+ * Work-queue handler to drive an expedited grace period forward.
+ */
+static void wait_rcu_exp_gp(struct kthread_work *wp)
+{
+ struct rcu_exp_work *rewp;
+
+ rewp = container_of(wp, struct rcu_exp_work, rew_work);
+ rcu_exp_sel_wait_wake(rewp->rew_s);
+}
+
+static inline void synchronize_rcu_expedited_queue_work(struct rcu_exp_work *rew)
+{
+ kthread_init_work(&rew->rew_work, wait_rcu_exp_gp);
+ kthread_queue_work(rcu_exp_gp_kworker, &rew->rew_work);
+}
+
+static inline void synchronize_rcu_expedited_destroy_work(struct rcu_exp_work *rew)
+{
+}
+#else /* !CONFIG_RCU_EXP_KTHREAD */
+static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
+{
+ struct rcu_exp_work *rewp =
+ container_of(wp, struct rcu_exp_work, rew_work);
+
+ __sync_rcu_exp_select_node_cpus(rewp);
+}
+
+static inline bool rcu_gp_par_worker_started(void)
+{
+ return !!READ_ONCE(rcu_par_gp_wq);
+}
+
+static inline void sync_rcu_exp_select_cpus_queue_work(struct rcu_node *rnp)
+{
+ int cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1);
+
+ INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
+ /* If all offline, queue the work on an unbound CPU. */
+ if (unlikely(cpu > rnp->grphi - rnp->grplo))
+ cpu = WORK_CPU_UNBOUND;
+ else
+ cpu += rnp->grplo;
+ queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
+}
+
+static inline void sync_rcu_exp_select_cpus_flush_work(struct rcu_node *rnp)
+{
+ flush_work(&rnp->rew.rew_work);
+}
+
+/*
+ * Work-queue handler to drive an expedited grace period forward.
+ */
+static void wait_rcu_exp_gp(struct work_struct *wp)
+{
+ struct rcu_exp_work *rewp;
+
+ rewp = container_of(wp, struct rcu_exp_work, rew_work);
+ rcu_exp_sel_wait_wake(rewp->rew_s);
+}
+
+static inline void synchronize_rcu_expedited_queue_work(struct rcu_exp_work *rew)
+{
+ INIT_WORK_ONSTACK(&rew->rew_work, wait_rcu_exp_gp);
+ queue_work(rcu_gp_wq, &rew->rew_work);
+}
+
+static inline void synchronize_rcu_expedited_destroy_work(struct rcu_exp_work *rew)
+{
+ destroy_work_on_stack(&rew->rew_work);
+}
+#endif /* CONFIG_RCU_EXP_KTHREAD */
+
/*
* Select the nodes that the upcoming expedited grace period needs
* to wait for.
*/
static void sync_rcu_exp_select_cpus(void)
{
- int cpu;
struct rcu_node *rnp;

trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("reset"));
@@ -435,28 +539,21 @@ static void sync_rcu_exp_select_cpus(void)
rnp->exp_need_flush = false;
if (!READ_ONCE(rnp->expmask))
continue; /* Avoid early boot non-existent wq. */
- if (!READ_ONCE(rcu_par_gp_wq) ||
+ if (!rcu_gp_par_worker_started() ||
rcu_scheduler_active != RCU_SCHEDULER_RUNNING ||
rcu_is_last_leaf_node(rnp)) {
- /* No workqueues yet or last leaf, do direct call. */
+ /* No worker started yet or last leaf, do direct call. */
sync_rcu_exp_select_node_cpus(&rnp->rew.rew_work);
continue;
}
- INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
- cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1);
- /* If all offline, queue the work on an unbound CPU. */
- if (unlikely(cpu > rnp->grphi - rnp->grplo))
- cpu = WORK_CPU_UNBOUND;
- else
- cpu += rnp->grplo;
- queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
+ sync_rcu_exp_select_cpus_queue_work(rnp);
rnp->exp_need_flush = true;
}

- /* Wait for workqueue jobs (if any) to complete. */
+ /* Wait for jobs (if any) to complete. */
rcu_for_each_leaf_node(rnp)
if (rnp->exp_need_flush)
- flush_work(&rnp->rew.rew_work);
+ sync_rcu_exp_select_cpus_flush_work(rnp);
}

/*
@@ -622,17 +719,6 @@ static void rcu_exp_sel_wait_wake(unsigned long s)
rcu_exp_wait_wake(s);
}

-/*
- * Work-queue handler to drive an expedited grace period forward.
- */
-static void wait_rcu_exp_gp(struct work_struct *wp)
-{
- struct rcu_exp_work *rewp;
-
- rewp = container_of(wp, struct rcu_exp_work, rew_work);
- rcu_exp_sel_wait_wake(rewp->rew_s);
-}
-
#ifdef CONFIG_PREEMPT_RCU

/*
@@ -848,20 +934,19 @@ void synchronize_rcu_expedited(void)
} else {
/* Marshall arguments & schedule the expedited grace period. */
rew.rew_s = s;
- INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
- queue_work(rcu_gp_wq, &rew.rew_work);
+ synchronize_rcu_expedited_queue_work(&rew);
}

/* Wait for expedited grace period to complete. */
rnp = rcu_get_root();
wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
sync_exp_work_done(s));
- smp_mb(); /* Workqueue actions happen before return. */
+ smp_mb(); /* Work actions happen before return. */

/* Let the next expedited grace period start. */
mutex_unlock(&rcu_state.exp_mutex);

if (likely(!boottime))
- destroy_work_on_stack(&rew.rew_work);
+ synchronize_rcu_expedited_destroy_work(&rew);
}
EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);

base-commit: 42e7a03d3badebd4e70aea5362d6914dfc7c220b
--
2.35.1.1178.g4f1659d476-goog


2022-04-08 16:15:37

by Joel Fernandes

Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

On Fri, Apr 8, 2022 at 12:57 AM Kalesh Singh <[email protected]> wrote:
[..]
> +config RCU_EXP_KTHREAD
> + bool "Perform RCU expedited work in a real-time kthread"
> + depends on RCU_BOOST && RCU_EXPERT

Doesn't this disable the fix if a system is not RCU_EXPERT ? Please
see the definition of RCU_EXPERT:
"This option needs to be enabled if you wish to make expert-level
adjustments to RCU configuration."

I don't think a bug fix counts as an expert-level adjustment.

> + default !PREEMPT_RT && NR_CPUS <= 32

What is the benefit of turning it off on PREEMPT_RT, even if
PREEMPT_RT does not use expedited GPs much post-boot? I would think in
the future if PREEMPT_RT ever uses expedited GPs, they would want this
feature even more. I'd rather be future-proof now as I don't see any
advantages of disabling it on !PREEMPT_RT (And a drawback that the fix
won't apply to those systems). Also will keep the config simple.

> + help
> + Use this option to further reduce the latencies of expedited
> + grace periods at the expense of being more disruptive.
> +
> + Accept the default if unsure.
> +
> config RCU_NOCB_CPU
> bool "Offload RCU callback processing from boot-selected CPUs"
> depends on TREE_RCU
[...]

Thanks,

- Joel

2022-04-08 16:41:34

by Joel Fernandes

Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

On Fri, Apr 8, 2022 at 12:57 AM Kalesh Singh <[email protected]> wrote:
>
[...]
> @@ -334,15 +334,13 @@ static bool exp_funnel_lock(unsigned long s)
> * Select the CPUs within the specified rcu_node that the upcoming
> * expedited grace period needs to wait for.
> */
> -static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> +static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
> {
> int cpu;
> unsigned long flags;
> unsigned long mask_ofl_test;
> unsigned long mask_ofl_ipi;
> int ret;
> - struct rcu_exp_work *rewp =
> - container_of(wp, struct rcu_exp_work, rew_work);
> struct rcu_node *rnp = container_of(rewp, struct rcu_node, rew);
>
> raw_spin_lock_irqsave_rcu_node(rnp, flags);
> @@ -417,13 +415,119 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> rcu_report_exp_cpu_mult(rnp, mask_ofl_test, false);
> }
>
> +static void rcu_exp_sel_wait_wake(unsigned long s);
> +
> +#ifdef CONFIG_RCU_EXP_KTHREAD

Just my 2c:

Honestly, I am not sure if the benefits of duplicating the code to use
normal workqueues outweighs the drawbacks (namely code complexity,
code duplication - which can in turn cause more bugs and maintenance
headaches down the line). The code is harder to read and adding more
30 character function names does not help.

For something as important as expedited GPs, I can't imagine a
scenario where an RT kthread worker would cause "issues". If it does
cause issues, that's what the -rc cycles and the stable releases are
for. I prefer to trust the process than take a one-foot-in-the-door
approach.

So please, can we just keep it simple?

Thanks,

- Joel


> +static void sync_rcu_exp_select_node_cpus(struct kthread_work *wp)
> +{
> + struct rcu_exp_work *rewp =
> + container_of(wp, struct rcu_exp_work, rew_work);
> +
> + __sync_rcu_exp_select_node_cpus(rewp);
> +}
> +
> +static inline bool rcu_gp_par_worker_started(void)
> +{
> + return !!READ_ONCE(rcu_exp_par_gp_kworker);
> +}
> +
> +static inline void sync_rcu_exp_select_cpus_queue_work(struct rcu_node *rnp)
> +{
> + kthread_init_work(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
> + /*
> + * Use rcu_exp_par_gp_kworker, because flushing a work item from
> + * another work item on the same kthread worker can result in
> + * deadlock.
> + */
> + kthread_queue_work(rcu_exp_par_gp_kworker, &rnp->rew.rew_work);
> +}
> +
> +static inline void sync_rcu_exp_select_cpus_flush_work(struct rcu_node *rnp)
> +{
> + kthread_flush_work(&rnp->rew.rew_work);
> +}
> +
> +/*
> + * Work-queue handler to drive an expedited grace period forward.
> + */
> +static void wait_rcu_exp_gp(struct kthread_work *wp)
> +{
> + struct rcu_exp_work *rewp;
> +
> + rewp = container_of(wp, struct rcu_exp_work, rew_work);
> + rcu_exp_sel_wait_wake(rewp->rew_s);
> +}
> +
> +static inline void synchronize_rcu_expedited_queue_work(struct rcu_exp_work *rew)
> +{
> + kthread_init_work(&rew->rew_work, wait_rcu_exp_gp);
> + kthread_queue_work(rcu_exp_gp_kworker, &rew->rew_work);
> +}
> +
> +static inline void synchronize_rcu_expedited_destroy_work(struct rcu_exp_work *rew)
> +{
> +}
> +#else /* !CONFIG_RCU_EXP_KTHREAD */
> +static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> +{
> + struct rcu_exp_work *rewp =
> + container_of(wp, struct rcu_exp_work, rew_work);
> +
> + __sync_rcu_exp_select_node_cpus(rewp);
> +}
> +
> +static inline bool rcu_gp_par_worker_started(void)
> +{
> + return !!READ_ONCE(rcu_par_gp_wq);
> +}
> +
> +static inline void sync_rcu_exp_select_cpus_queue_work(struct rcu_node *rnp)
> +{
> + int cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1);
> +
> + INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
> + /* If all offline, queue the work on an unbound CPU. */
> + if (unlikely(cpu > rnp->grphi - rnp->grplo))
> + cpu = WORK_CPU_UNBOUND;
> + else
> + cpu += rnp->grplo;
> + queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
> +}
> +
> +static inline void sync_rcu_exp_select_cpus_flush_work(struct rcu_node *rnp)
> +{
> + flush_work(&rnp->rew.rew_work);
> +}
> +
> +/*
> + * Work-queue handler to drive an expedited grace period forward.
> + */
> +static void wait_rcu_exp_gp(struct work_struct *wp)
> +{
> + struct rcu_exp_work *rewp;
> +
> + rewp = container_of(wp, struct rcu_exp_work, rew_work);
> + rcu_exp_sel_wait_wake(rewp->rew_s);
> +}
> +
> +static inline void synchronize_rcu_expedited_queue_work(struct rcu_exp_work *rew)
> +{
> + INIT_WORK_ONSTACK(&rew->rew_work, wait_rcu_exp_gp);
> + queue_work(rcu_gp_wq, &rew->rew_work);
> +}
> +
> +static inline void synchronize_rcu_expedited_destroy_work(struct rcu_exp_work *rew)
> +{
> + destroy_work_on_stack(&rew->rew_work);
> +}
> +#endif /* CONFIG_RCU_EXP_KTHREAD */
> +
> /*
> * Select the nodes that the upcoming expedited grace period needs
> * to wait for.
> */
> static void sync_rcu_exp_select_cpus(void)
> {
> - int cpu;
> struct rcu_node *rnp;
>
> trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("reset"));
> @@ -435,28 +539,21 @@ static void sync_rcu_exp_select_cpus(void)
> rnp->exp_need_flush = false;
> if (!READ_ONCE(rnp->expmask))
> continue; /* Avoid early boot non-existent wq. */
> - if (!READ_ONCE(rcu_par_gp_wq) ||
> + if (!rcu_gp_par_worker_started() ||
> rcu_scheduler_active != RCU_SCHEDULER_RUNNING ||
> rcu_is_last_leaf_node(rnp)) {
> - /* No workqueues yet or last leaf, do direct call. */
> + /* No worker started yet or last leaf, do direct call. */
> sync_rcu_exp_select_node_cpus(&rnp->rew.rew_work);
> continue;
> }
> - INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
> - cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1);
> - /* If all offline, queue the work on an unbound CPU. */
> - if (unlikely(cpu > rnp->grphi - rnp->grplo))
> - cpu = WORK_CPU_UNBOUND;
> - else
> - cpu += rnp->grplo;
> - queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
> + sync_rcu_exp_select_cpus_queue_work(rnp);
> rnp->exp_need_flush = true;
> }
>
> - /* Wait for workqueue jobs (if any) to complete. */
> + /* Wait for jobs (if any) to complete. */
> rcu_for_each_leaf_node(rnp)
> if (rnp->exp_need_flush)
> - flush_work(&rnp->rew.rew_work);
> + sync_rcu_exp_select_cpus_flush_work(rnp);
> }
>
> /*
> @@ -622,17 +719,6 @@ static void rcu_exp_sel_wait_wake(unsigned long s)
> rcu_exp_wait_wake(s);
> }
>
> -/*
> - * Work-queue handler to drive an expedited grace period forward.
> - */
> -static void wait_rcu_exp_gp(struct work_struct *wp)
> -{
> - struct rcu_exp_work *rewp;
> -
> - rewp = container_of(wp, struct rcu_exp_work, rew_work);
> - rcu_exp_sel_wait_wake(rewp->rew_s);
> -}
> -
> #ifdef CONFIG_PREEMPT_RCU
>
> /*
> @@ -848,20 +934,19 @@ void synchronize_rcu_expedited(void)
> } else {
> /* Marshall arguments & schedule the expedited grace period. */
> rew.rew_s = s;
> - INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
> - queue_work(rcu_gp_wq, &rew.rew_work);
> + synchronize_rcu_expedited_queue_work(&rew);
> }
>
> /* Wait for expedited grace period to complete. */
> rnp = rcu_get_root();
> wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> sync_exp_work_done(s));
> - smp_mb(); /* Workqueue actions happen before return. */
> + smp_mb(); /* Work actions happen before return. */
>
> /* Let the next expedited grace period start. */
> mutex_unlock(&rcu_state.exp_mutex);
>
> if (likely(!boottime))
> - destroy_work_on_stack(&rew.rew_work);
> + synchronize_rcu_expedited_destroy_work(&rew);
> }
> EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
>
> base-commit: 42e7a03d3badebd4e70aea5362d6914dfc7c220b
> --
> 2.35.1.1178.g4f1659d476-goog
>

2022-04-08 21:06:45

by Paul E. McKenney

Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

On Fri, Apr 08, 2022 at 11:39:46AM -0400, Steven Rostedt wrote:
> On Fri, 8 Apr 2022 06:21:03 -0400
> Joel Fernandes <[email protected]> wrote:
>
> > > + default !PREEMPT_RT && NR_CPUS <= 32
> >
> > What is the benefit of turning it off on PREEMPT_RT, even if
> > PREEMPT_RT does not use expedited GPs much post-boot? I would think in
> > the future if PREEMPT_RT ever uses expedited GPs, they would want this
> > feature even more. I'd rather be future-proof now as I don't see any
> > advantages of disabling it on !PREEMPT_RT (And a drawback that the fix
> > won't apply to those systems). Also will keep the config simple.
>
> The default kthread priority is 1. This should not affect PREEMPT_RT, as
> PREEMPT_RT users are usually more concerned by the performance of their
> higher priority tasks. Priority 1 is not considered an issue.
>
> I do not see why PREEMPT_RT is special here. Why was it singled out?

Because I didn't see the point of doing kthread_create_worker() to create
a pair of kthread_worker structuress that were never going to be used
in PREEMPT_RT kernels.

Or are PREEMPT_RT kernels no longer booting with rcupdate.rcu_normal?
Last I knew, they all did this to avoid IPI latencies from expedited
grace periods. Has this changed?

Thanx, Paul

2022-04-08 21:11:48

by Joel Fernandes

Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

On Fri, Apr 8, 2022 at 11:34 AM Paul E. McKenney <[email protected]> wrote:
>
> On Fri, Apr 08, 2022 at 10:41:26AM -0400, Joel Fernandes wrote:
> > On Fri, Apr 8, 2022 at 10:34 AM Paul E. McKenney <[email protected]> wrote:
> > >
> > > On Fri, Apr 08, 2022 at 06:42:42AM -0400, Joel Fernandes wrote:
> > > > On Fri, Apr 8, 2022 at 12:57 AM Kalesh Singh <[email protected]> wrote:
> > > > >
> > > > [...]
> > > > > @@ -334,15 +334,13 @@ static bool exp_funnel_lock(unsigned long s)
> > > > > * Select the CPUs within the specified rcu_node that the upcoming
> > > > > * expedited grace period needs to wait for.
> > > > > */
> > > > > -static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > > > > +static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
> > > > > {
> > > > > int cpu;
> > > > > unsigned long flags;
> > > > > unsigned long mask_ofl_test;
> > > > > unsigned long mask_ofl_ipi;
> > > > > int ret;
> > > > > - struct rcu_exp_work *rewp =
> > > > > - container_of(wp, struct rcu_exp_work, rew_work);
> > > > > struct rcu_node *rnp = container_of(rewp, struct rcu_node, rew);
> > > > >
> > > > > raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > > > > @@ -417,13 +415,119 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > > > > rcu_report_exp_cpu_mult(rnp, mask_ofl_test, false);
> > > > > }
> > > > >
> > > > > +static void rcu_exp_sel_wait_wake(unsigned long s);
> > > > > +
> > > > > +#ifdef CONFIG_RCU_EXP_KTHREAD
> > > >
> > > > Just my 2c:
> > > >
> > > > Honestly, I am not sure if the benefits of duplicating the code to use
> > > > normal workqueues outweighs the drawbacks (namely code complexity,
> > > > code duplication - which can in turn cause more bugs and maintenance
> > > > headaches down the line). The code is harder to read and adding more
> > > > 30 character function names does not help.
> > > >
> > > > For something as important as expedited GPs, I can't imagine a
> > > > scenario where an RT kthread worker would cause "issues". If it does
> > > > cause issues, that's what the -rc cycles and the stable releases are
> > > > for. I prefer to trust the process than take a one-foot-in-the-door
> > > > approach.
> > > >
> > > > So please, can we just keep it simple?
> > >
> > > Yes and no.
> > >
> > > This is a bug fix, but only for those systems that are expecting real-time
> > > response from synchronize_rcu_expedited(). As far as I know, this is only
> > > Android. The rest of the systems are just fine with the current behavior.
> >
> > As far as you know, but are you sure?
>
> None of us are sure. We are balancing risks and potential benefits.

Right.

> > > In addition, this bug fix introduces significant risks, especially in
> > > terms of performance for throughput-oriented workloads.
> >
> > Could you explain what the risk is? That's the part I did not follow.
> > How can making synchronize_rcu_expedited() work getting priority
> > introduce throughput issues?
>
> Status quo has synchronize_rcu_expedited() workqueues running as
> SCHED_OTHER.

Yeah, so if we go by this, you are saying RCU_BOOST likely does not
work correctly on status quo right? I do not see what in the commit
message indicates that this is an Android-only issue, let me know what
I am missing.

> The users affected by this will instead have these running
> as SCHED_FIFO. That changes scheduling. For users not explicitly
> needing low-latency synchronize_rcu_expedited(), this change is very
> unlikely to be for the better. And historically, unnecessarily running
> portions of RCU at real-time priorities has been a change for the worse.
> As in greatly increased context-switch rates and consequently degraded
> performance. Please note that this is not a theoretical statement: Real
> users have really been burned by too much SCHED_FIFO in RCU kthreads in
> the past.

Android also has suffered from too much SCHED_FIFO in the past. I
remember the display thread called 'surfaceflinger' had to be dropped
from RT privilege at one point.

> > > So yes, let's do this bug fix (with appropriate adjustment), but let's
> > > also avoid exposing the non-Android workloads to risks from the inevitable
> > > unintended consequences. ;-)
> >
> > I would argue the risk is also adding code complexity and more bugs
> > without clear rationale for why it is being done. There's always risk
> > with any change, but that's the -rc cycles and stable kernels help
> > catch those. I think we should not add more code complexity if it is a
> > theoretical concern.
> >
> > There's also another possible risk - there is a possible hidden
> > problem here that probably the non-Android folks haven't noticed or
> > been able to debug. I would rather just do the right thing.
> >
> > Just my 2c,
>
> Sorry, but my answer is still "no".
>
> Your suggested change risks visiting unacceptable performance
> degradation on a very large number of innocent users for whom current
> synchronize_rcu_expedited() latency is plenty good enough.

I believe the process will catch any such risk, but it is your call! ;-)

Thanks,

- Joel

2022-04-09 01:58:16

by Paul E. McKenney

Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

On Fri, Apr 08, 2022 at 10:41:26AM -0400, Joel Fernandes wrote:
> On Fri, Apr 8, 2022 at 10:34 AM Paul E. McKenney <[email protected]> wrote:
> >
> > On Fri, Apr 08, 2022 at 06:42:42AM -0400, Joel Fernandes wrote:
> > > On Fri, Apr 8, 2022 at 12:57 AM Kalesh Singh <[email protected]> wrote:
> > > >
> > > [...]
> > > > @@ -334,15 +334,13 @@ static bool exp_funnel_lock(unsigned long s)
> > > > * Select the CPUs within the specified rcu_node that the upcoming
> > > > * expedited grace period needs to wait for.
> > > > */
> > > > -static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > > > +static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
> > > > {
> > > > int cpu;
> > > > unsigned long flags;
> > > > unsigned long mask_ofl_test;
> > > > unsigned long mask_ofl_ipi;
> > > > int ret;
> > > > - struct rcu_exp_work *rewp =
> > > > - container_of(wp, struct rcu_exp_work, rew_work);
> > > > struct rcu_node *rnp = container_of(rewp, struct rcu_node, rew);
> > > >
> > > > raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > > > @@ -417,13 +415,119 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > > > rcu_report_exp_cpu_mult(rnp, mask_ofl_test, false);
> > > > }
> > > >
> > > > +static void rcu_exp_sel_wait_wake(unsigned long s);
> > > > +
> > > > +#ifdef CONFIG_RCU_EXP_KTHREAD
> > >
> > > Just my 2c:
> > >
> > > Honestly, I am not sure if the benefits of duplicating the code to use
> > > normal workqueues outweighs the drawbacks (namely code complexity,
> > > code duplication - which can in turn cause more bugs and maintenance
> > > headaches down the line). The code is harder to read and adding more
> > > 30 character function names does not help.
> > >
> > > For something as important as expedited GPs, I can't imagine a
> > > scenario where an RT kthread worker would cause "issues". If it does
> > > cause issues, that's what the -rc cycles and the stable releases are
> > > for. I prefer to trust the process than take a one-foot-in-the-door
> > > approach.
> > >
> > > So please, can we just keep it simple?
> >
> > Yes and no.
> >
> > This is a bug fix, but only for those systems that are expecting real-time
> > response from synchronize_rcu_expedited(). As far as I know, this is only
> > Android. The rest of the systems are just fine with the current behavior.
>
> As far as you know, but are you sure?

None of us are sure. We are balancing risks and potential benefits.

> > In addition, this bug fix introduces significant risks, especially in
> > terms of performance for throughput-oriented workloads.
>
> Could you explain what the risk is? That's the part I did not follow.
> How can making synchronize_rcu_expedited() work getting priority
> introduce throughput issues?

Status quo has synchronize_rcu_expedited() workqueues running as
SCHED_OTHER. The users affected by this will instead have these running
as SCHED_FIFO. That changes scheduling. For users not explicitly
needing low-latency synchronize_rcu_expedited(), this change is very
unlikely to be for the better. And historically, unnecessarily running
portions of RCU at real-time priorities has been a change for the worse.
As in greatly increased context-switch rates and consequently degraded
performance. Please note that this is not a theoretical statement: Real
users have really been burned by too much SCHED_FIFO in RCU kthreads in
the past.

> > So yes, let's do this bug fix (with appropriate adjustment), but let's
> > also avoid exposing the non-Android workloads to risks from the inevitable
> > unintended consequences. ;-)
>
> I would argue the risk is also adding code complexity and more bugs
> without clear rationale for why it is being done. There's always risk
> with any change, but that's the -rc cycles and stable kernels help
> catch those. I think we should not add more code complexity if it is a
> theoretical concern.
>
> There's also another possible risk - there is a possible hidden
> problem here that probably the non-Android folks haven't noticed or
> been able to debug. I would rather just do the right thing.
>
> Just my 2c,

Sorry, but my answer is still "no".

Your suggested change risks visiting unacceptable performance
degradation on a very large number of innocent users for whom current
synchronize_rcu_expedited() latency is plenty good enough.

After all, if simplicity were the only goal, we would set NR_CPUS=0
and be happy. ;-)

Thanx, Paul

> - Joel
>
>
> >
> > Thanx, Paul
> >
> > > Thanks,
> > >
> > > - Joel
> > >
> > >
> > > > +static void sync_rcu_exp_select_node_cpus(struct kthread_work *wp)
> > > > +{
> > > > + struct rcu_exp_work *rewp =
> > > > + container_of(wp, struct rcu_exp_work, rew_work);
> > > > +
> > > > + __sync_rcu_exp_select_node_cpus(rewp);
> > > > +}
> > > > +
> > > > +static inline bool rcu_gp_par_worker_started(void)
> > > > +{
> > > > + return !!READ_ONCE(rcu_exp_par_gp_kworker);
> > > > +}
> > > > +
> > > > +static inline void sync_rcu_exp_select_cpus_queue_work(struct rcu_node *rnp)
> > > > +{
> > > > + kthread_init_work(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
> > > > + /*
> > > > + * Use rcu_exp_par_gp_kworker, because flushing a work item from
> > > > + * another work item on the same kthread worker can result in
> > > > + * deadlock.
> > > > + */
> > > > + kthread_queue_work(rcu_exp_par_gp_kworker, &rnp->rew.rew_work);
> > > > +}
> > > > +
> > > > +static inline void sync_rcu_exp_select_cpus_flush_work(struct rcu_node *rnp)
> > > > +{
> > > > + kthread_flush_work(&rnp->rew.rew_work);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Work-queue handler to drive an expedited grace period forward.
> > > > + */
> > > > +static void wait_rcu_exp_gp(struct kthread_work *wp)
> > > > +{
> > > > + struct rcu_exp_work *rewp;
> > > > +
> > > > + rewp = container_of(wp, struct rcu_exp_work, rew_work);
> > > > + rcu_exp_sel_wait_wake(rewp->rew_s);
> > > > +}
> > > > +
> > > > +static inline void synchronize_rcu_expedited_queue_work(struct rcu_exp_work *rew)
> > > > +{
> > > > + kthread_init_work(&rew->rew_work, wait_rcu_exp_gp);
> > > > + kthread_queue_work(rcu_exp_gp_kworker, &rew->rew_work);
> > > > +}
> > > > +
> > > > +static inline void synchronize_rcu_expedited_destroy_work(struct rcu_exp_work *rew)
> > > > +{
> > > > +}
> > > > +#else /* !CONFIG_RCU_EXP_KTHREAD */
> > > > +static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > > > +{
> > > > + struct rcu_exp_work *rewp =
> > > > + container_of(wp, struct rcu_exp_work, rew_work);
> > > > +
> > > > + __sync_rcu_exp_select_node_cpus(rewp);
> > > > +}
> > > > +
> > > > +static inline bool rcu_gp_par_worker_started(void)
> > > > +{
> > > > + return !!READ_ONCE(rcu_par_gp_wq);
> > > > +}
> > > > +
> > > > +static inline void sync_rcu_exp_select_cpus_queue_work(struct rcu_node *rnp)
> > > > +{
> > > > + int cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1);
> > > > +
> > > > + INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
> > > > + /* If all offline, queue the work on an unbound CPU. */
> > > > + if (unlikely(cpu > rnp->grphi - rnp->grplo))
> > > > + cpu = WORK_CPU_UNBOUND;
> > > > + else
> > > > + cpu += rnp->grplo;
> > > > + queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
> > > > +}
> > > > +
> > > > +static inline void sync_rcu_exp_select_cpus_flush_work(struct rcu_node *rnp)
> > > > +{
> > > > + flush_work(&rnp->rew.rew_work);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Work-queue handler to drive an expedited grace period forward.
> > > > + */
> > > > +static void wait_rcu_exp_gp(struct work_struct *wp)
> > > > +{
> > > > + struct rcu_exp_work *rewp;
> > > > +
> > > > + rewp = container_of(wp, struct rcu_exp_work, rew_work);
> > > > + rcu_exp_sel_wait_wake(rewp->rew_s);
> > > > +}
> > > > +
> > > > +static inline void synchronize_rcu_expedited_queue_work(struct rcu_exp_work *rew)
> > > > +{
> > > > + INIT_WORK_ONSTACK(&rew->rew_work, wait_rcu_exp_gp);
> > > > + queue_work(rcu_gp_wq, &rew->rew_work);
> > > > +}
> > > > +
> > > > +static inline void synchronize_rcu_expedited_destroy_work(struct rcu_exp_work *rew)
> > > > +{
> > > > + destroy_work_on_stack(&rew->rew_work);
> > > > +}
> > > > +#endif /* CONFIG_RCU_EXP_KTHREAD */
> > > > +
> > > > /*
> > > > * Select the nodes that the upcoming expedited grace period needs
> > > > * to wait for.
> > > > */
> > > > static void sync_rcu_exp_select_cpus(void)
> > > > {
> > > > - int cpu;
> > > > struct rcu_node *rnp;
> > > >
> > > > trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("reset"));
> > > > @@ -435,28 +539,21 @@ static void sync_rcu_exp_select_cpus(void)
> > > > rnp->exp_need_flush = false;
> > > > if (!READ_ONCE(rnp->expmask))
> > > > continue; /* Avoid early boot non-existent wq. */
> > > > - if (!READ_ONCE(rcu_par_gp_wq) ||
> > > > + if (!rcu_gp_par_worker_started() ||
> > > > rcu_scheduler_active != RCU_SCHEDULER_RUNNING ||
> > > > rcu_is_last_leaf_node(rnp)) {
> > > > - /* No workqueues yet or last leaf, do direct call. */
> > > > + /* No worker started yet or last leaf, do direct call. */
> > > > sync_rcu_exp_select_node_cpus(&rnp->rew.rew_work);
> > > > continue;
> > > > }
> > > > - INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
> > > > - cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1);
> > > > - /* If all offline, queue the work on an unbound CPU. */
> > > > - if (unlikely(cpu > rnp->grphi - rnp->grplo))
> > > > - cpu = WORK_CPU_UNBOUND;
> > > > - else
> > > > - cpu += rnp->grplo;
> > > > - queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
> > > > + sync_rcu_exp_select_cpus_queue_work(rnp);
> > > > rnp->exp_need_flush = true;
> > > > }
> > > >
> > > > - /* Wait for workqueue jobs (if any) to complete. */
> > > > + /* Wait for jobs (if any) to complete. */
> > > > rcu_for_each_leaf_node(rnp)
> > > > if (rnp->exp_need_flush)
> > > > - flush_work(&rnp->rew.rew_work);
> > > > + sync_rcu_exp_select_cpus_flush_work(rnp);
> > > > }
> > > >
> > > > /*
> > > > @@ -622,17 +719,6 @@ static void rcu_exp_sel_wait_wake(unsigned long s)
> > > > rcu_exp_wait_wake(s);
> > > > }
> > > >
> > > > -/*
> > > > - * Work-queue handler to drive an expedited grace period forward.
> > > > - */
> > > > -static void wait_rcu_exp_gp(struct work_struct *wp)
> > > > -{
> > > > - struct rcu_exp_work *rewp;
> > > > -
> > > > - rewp = container_of(wp, struct rcu_exp_work, rew_work);
> > > > - rcu_exp_sel_wait_wake(rewp->rew_s);
> > > > -}
> > > > -
> > > > #ifdef CONFIG_PREEMPT_RCU
> > > >
> > > > /*
> > > > @@ -848,20 +934,19 @@ void synchronize_rcu_expedited(void)
> > > > } else {
> > > > /* Marshall arguments & schedule the expedited grace period. */
> > > > rew.rew_s = s;
> > > > - INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
> > > > - queue_work(rcu_gp_wq, &rew.rew_work);
> > > > + synchronize_rcu_expedited_queue_work(&rew);
> > > > }
> > > >
> > > > /* Wait for expedited grace period to complete. */
> > > > rnp = rcu_get_root();
> > > > wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> > > > sync_exp_work_done(s));
> > > > - smp_mb(); /* Workqueue actions happen before return. */
> > > > + smp_mb(); /* Work actions happen before return. */
> > > >
> > > > /* Let the next expedited grace period start. */
> > > > mutex_unlock(&rcu_state.exp_mutex);
> > > >
> > > > if (likely(!boottime))
> > > > - destroy_work_on_stack(&rew.rew_work);
> > > > + synchronize_rcu_expedited_destroy_work(&rew);
> > > > }
> > > > EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
> > > >
> > > > base-commit: 42e7a03d3badebd4e70aea5362d6914dfc7c220b
> > > > --
> > > > 2.35.1.1178.g4f1659d476-goog
> > > >

2022-04-09 02:39:22

by Steven Rostedt

Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

On Fri, 8 Apr 2022 06:21:03 -0400
Joel Fernandes <[email protected]> wrote:

> > + default !PREEMPT_RT && NR_CPUS <= 32
>
> What is the benefit of turning it off on PREEMPT_RT, even if
> PREEMPT_RT does not use expedited GPs much post-boot? I would think in
> the future if PREEMPT_RT ever uses expedited GPs, they would want this
> feature even more. I'd rather be future-proof now as I don't see any
> advantages of disabling it on !PREEMPT_RT (And a drawback that the fix
> won't apply to those systems). Also will keep the config simple.

The default kthread priority is 1. This should not affect PREEMPT_RT, as
PREEMPT_RT users are usually more concerned by the performance of their
higher priority tasks. Priority 1 is not considered an issue.
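
For reference, that priority comes from the existing rcutree.kthread_prio
parameter (which defaults to 1 when RCU_BOOST is enabled), and the patch
applies it to the new workers with:

	struct sched_param param = { .sched_priority = kthread_prio };

	sched_setscheduler_nocheck(rcu_exp_gp_kworker->task, SCHED_FIFO, &param);
	sched_setscheduler_nocheck(rcu_exp_par_gp_kworker->task, SCHED_FIFO,
				   &param);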

I do not see why PREEMPT_RT is special here. Why was it singled out?

-- Steve

2022-04-09 04:16:30

by Paul E. McKenney

Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

On Fri, Apr 08, 2022 at 01:14:35PM -0400, Joel Fernandes wrote:
> On Fri, Apr 8, 2022 at 11:34 AM Paul E. McKenney <[email protected]> wrote:
> >
> > On Fri, Apr 08, 2022 at 10:41:26AM -0400, Joel Fernandes wrote:
> > > On Fri, Apr 8, 2022 at 10:34 AM Paul E. McKenney <[email protected]> wrote:
> > > >
> > > > On Fri, Apr 08, 2022 at 06:42:42AM -0400, Joel Fernandes wrote:
> > > > > On Fri, Apr 8, 2022 at 12:57 AM Kalesh Singh <[email protected]> wrote:
> > > > > >
> > > > > [...]
> > > > > > @@ -334,15 +334,13 @@ static bool exp_funnel_lock(unsigned long s)
> > > > > > * Select the CPUs within the specified rcu_node that the upcoming
> > > > > > * expedited grace period needs to wait for.
> > > > > > */
> > > > > > -static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > > > > > +static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
> > > > > > {
> > > > > > int cpu;
> > > > > > unsigned long flags;
> > > > > > unsigned long mask_ofl_test;
> > > > > > unsigned long mask_ofl_ipi;
> > > > > > int ret;
> > > > > > - struct rcu_exp_work *rewp =
> > > > > > - container_of(wp, struct rcu_exp_work, rew_work);
> > > > > > struct rcu_node *rnp = container_of(rewp, struct rcu_node, rew);
> > > > > >
> > > > > > raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > > > > > @@ -417,13 +415,119 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > > > > > rcu_report_exp_cpu_mult(rnp, mask_ofl_test, false);
> > > > > > }
> > > > > >
> > > > > > +static void rcu_exp_sel_wait_wake(unsigned long s);
> > > > > > +
> > > > > > +#ifdef CONFIG_RCU_EXP_KTHREAD
> > > > >
> > > > > Just my 2c:
> > > > >
> > > > > Honestly, I am not sure if the benefits of duplicating the code to use
> > > > > normal workqueues outweighs the drawbacks (namely code complexity,
> > > > > code duplication - which can in turn cause more bugs and maintenance
> > > > > headaches down the line). The code is harder to read and adding more
> > > > > 30 character function names does not help.
> > > > >
> > > > > For something as important as expedited GPs, I can't imagine a
> > > > > scenario where an RT kthread worker would cause "issues". If it does
> > > > > cause issues, that's what the -rc cycles and the stable releases are
> > > > > for. I prefer to trust the process than take a one-foot-in-the-door
> > > > > approach.
> > > > >
> > > > > So please, can we just keep it simple?
> > > >
> > > > Yes and no.
> > > >
> > > > This is a bug fix, but only for those systems that are expecting real-time
> > > > response from synchronize_rcu_expedited(). As far as I know, this is only
> > > > Android. The rest of the systems are just fine with the current behavior.
> > >
> > > As far as you know, but are you sure?
> >
> > None of us are sure. We are balancing risks and potential benefits.
>
> Right.
>
> > > > In addition, this bug fix introduces significant risks, especially in
> > > > terms of performance for throughput-oriented workloads.
> > >
> > > Could you explain what the risk is? That's the part I did not follow.
> > > How can making synchronize_rcu_expedited() work getting priority
> > > introduce throughput issues?
> >
> > Status quo has synchronize_rcu_expedited() workqueues running as
> > SCHED_OTHER.
>
> Yeah, so if we go by this, you are saying RCU_BOOST likely does not
> work correctly on status quo right? I do not see what in the commit
> message indicates that this is an Android-only issue, let me know what
> I am missing.

Android is in the process of imposing a new requirement, which I have been
letting you get away with calling a bug. Mostly because I don't care that
much what it is called. Either way, the code base needs to accommodate
Android's needs with acceptably low risk of breakage elsewhere.

You are right that RCU_BOOST has run additional kthreads at real-time
priority, and has done so for quite some time. But RCU_BOOST has not
(repeat, NOT) implied real-time response from synchronize_rcu_expedited(),
and the ONLY request thus far has come from Android.

Face it, the fact that synchronize_rcu_expedited() can achieve
tens-of-milliseconds "worst-case" latencies latencies is quite
amazing. Let's not spoil things by visiting the changes on
workloads that neither need nor care about "worst-case" latencies
from synchronize_rcu_expedited().

The "worst case" with the double quotes is because a single long-running
CPU-bound reader will force increased latencies. Then again, that is
exactly why Uladzislau created this commit:

0b74356e7b95 ("rcu: Introduce CONFIG_RCU_EXP_CPU_STALL_TIMEOUT")

Please note that this commit's short timeouts depend on ANDROID. If you
are serious about also having this work in ChromeOS, please send a patch
on top of Uladzislau's that also covers ChromeOS.

But please be very careful what you wish for...

Right now, you get expedited RCU CPU stall warnings at 21 seconds
in mainline kernels. With this commit, that becomes 20 milliseconds,
but only for ANDROID.

I could easily imagine that the Android guys would welcome help
fixing long-running RCU readers, but I find it even easier to
imagine that some of the more vocal ChromeOS users might stridently
object to being involuntarily recruited to provide such help. ;-)

> > The users affected by this will instead have these running
> > as SCHED_FIFO. That changes scheduling. For users not explicitly
> > needing low-latency synchronize_rcu_expedited(), this change is very
> > unlikely to be for the better. And historically, unnecessarily running
> > portions of RCU at real-time priorities has been a change for the worse.
> > As in greatly increased context-switch rates and consequently degraded
> > performance. Please note that this is not a theoretical statement: Real
> > users have really been burned by too much SCHED_FIFO in RCU kthreads in
> > the past.
>
> Android also has suffered from too much SCHED_FIFO in the past. I
> remember the display thread called 'surfaceflinger' had to be dropped
> from RT privilege at one point.

"With great power..." ;-)

Thanx, Paul

> > > > So yes, let's do this bug fix (with appropriate adjustment), but let's
> > > > also avoid exposing the non-Android workloads to risks from the inevitable
> > > > unintended consequences. ;-)
> > >
> > > I would argue the risk is also adding code complexity and more bugs
> > > without clear rationale for why it is being done. There's always risk
> > > with any change, but that's the -rc cycles and stable kernels help
> > > catch those. I think we should not add more code complexity if it is a
> > > theoretical concern.
> > >
> > > There's also another possible risk - there is a possible hidden
> > > problem here that probably the non-Android folks haven't noticed or
> > > been able to debug. I would rather just do the right thing.
> > >
> > > Just my 2c,
> >
> > Sorry, but my answer is still "no".
> >
> > Your suggested change risks visiting unacceptable performance
> > degradation on a very large number of innocent users for whom current
> > synchronize_rcu_expedited() latency is plenty good enough.
>
> I believe the process will catch any such risk, but it is your call! ;-)
>
> Thanks,
>
> - Joel

2022-04-09 04:55:08

by Paul E. McKenney

Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

On Fri, Apr 08, 2022 at 06:21:03AM -0400, Joel Fernandes wrote:
> On Fri, Apr 8, 2022 at 12:57 AM Kalesh Singh <[email protected]> wrote:
> [..]
> > +config RCU_EXP_KTHREAD
> > + bool "Perform RCU expedited work in a real-time kthread"
> > + depends on RCU_BOOST && RCU_EXPERT
>
> Doesn't this disable the fix if a system is not RCU_EXPERT ? Please
> see the definition of RCU_EXPERT:
> "This option needs to be enabled if you wish to make expert-level
> adjustments to RCU configuration."
>
> I don't think a bug fix counts as an expert-level adjustment.

Good catch!

The constraints that this Kconfig option must meet are as follows:

1. It must not expose innocent bystanders to additional questions
in response to "make oldconfig" and friends. So, if you build
a kernel, then apply this patch, then run "make oldconfig",
this last must run to completion without waiting for user input.

This point is not trivial, and everyone who has submitted a
mainline pull request violating this rule can attest. ;-)

2. Enabling RCU_EXPERT should be the exception rather than the
rule. (This rule is all too often honored in the breach.)

3. If a given Kconfig option is to be selected, it must not depend
on RCU_EXPERT. (Otherwise, the build system complains.)

So the question is "Does Android already enable RCU_EXPERT?" If so,
having this depend on RCU_EXPERT, though not great, is OK given that
Android is the only expected user. If not, another option is to make
this option depend on ANDROID or some similar Kconfig option that Linus
does not enable in his kernel builds.

> > + default !PREEMPT_RT && NR_CPUS <= 32
>
> What is the benefit of turning it off on PREEMPT_RT, even if
> PREEMPT_RT does not use expedited GPs much post-boot? I would think in
> the future if PREEMPT_RT ever uses expedited GPs, they would want this
> feature even more. I'd rather be future-proof now as I don't see any
> advantages of disabling it on !PREEMPT_RT (And a drawback that the fix
> won't apply to those systems). Also will keep the config simple.

Current PREEMPT_RT workloads do not care about synchronize_rcu_expedited()
latencies, so much so that they boot with rcupdate.rcu_normal=y. Plus
the kthread_create_worker() function does not yet have that many users.
So it makes sense to fence in the effects of this change.
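
To see why rcu_normal makes the new workers moot on those systems, here
is a paraphrased sketch of the top of synchronize_rcu_expedited()
(details vary by kernel version):

	void synchronize_rcu_expedited(void)
	{
		/* If the current state already implies a grace period, done. */
		if (rcu_blocking_is_gp())
			return;

		/*
		 * rcupdate.rcu_normal=1 prohibits expedited grace periods, so
		 * fall back to a normal GP.  On such systems the expedited
		 * workers (or workqueues) are never reached.
		 */
		if (rcu_gp_is_normal()) {
			wait_rcu_gp(call_rcu);
			return;
		}

		/*
		 * ... the expedited machinery, including the queue_work() or
		 * kthread_queue_work() call, only runs past this point ...
		 */
	}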

Of course, if you are instead saying that ChromeOS also wants low-latency
synchronize_rcu_expedited(), let's figure out the best way to get this
change to ChromeOS users as well as to Android users. ;-)

Thanx, Paul

> > + help
> > + Use this option to further reduce the latencies of expedited
> > + grace periods at the expense of being more disruptive.
> > +
> > + Accept the default if unsure.
> > +
> > config RCU_NOCB_CPU
> > bool "Offload RCU callback processing from boot-selected CPUs"
> > depends on TREE_RCU
> [...]
>
> Thanks,
>
> - Joel

2022-04-09 07:03:18

by Joel Fernandes

Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

On Fri, Apr 8, 2022 at 10:34 AM Paul E. McKenney <[email protected]> wrote:
>
> On Fri, Apr 08, 2022 at 06:42:42AM -0400, Joel Fernandes wrote:
> > On Fri, Apr 8, 2022 at 12:57 AM Kalesh Singh <[email protected]> wrote:
> > >
> > [...]
> > > @@ -334,15 +334,13 @@ static bool exp_funnel_lock(unsigned long s)
> > > * Select the CPUs within the specified rcu_node that the upcoming
> > > * expedited grace period needs to wait for.
> > > */
> > > -static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > > +static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
> > > {
> > > int cpu;
> > > unsigned long flags;
> > > unsigned long mask_ofl_test;
> > > unsigned long mask_ofl_ipi;
> > > int ret;
> > > - struct rcu_exp_work *rewp =
> > > - container_of(wp, struct rcu_exp_work, rew_work);
> > > struct rcu_node *rnp = container_of(rewp, struct rcu_node, rew);
> > >
> > > raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > > @@ -417,13 +415,119 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > > rcu_report_exp_cpu_mult(rnp, mask_ofl_test, false);
> > > }
> > >
> > > +static void rcu_exp_sel_wait_wake(unsigned long s);
> > > +
> > > +#ifdef CONFIG_RCU_EXP_KTHREAD
> >
> > Just my 2c:
> >
> > Honestly, I am not sure if the benefits of duplicating the code to use
> > normal workqueues outweighs the drawbacks (namely code complexity,
> > code duplication - which can in turn cause more bugs and maintenance
> > headaches down the line). The code is harder to read and adding more
> > 30 character function names does not help.
> >
> > For something as important as expedited GPs, I can't imagine a
> > scenario where an RT kthread worker would cause "issues". If it does
> > cause issues, that's what the -rc cycles and the stable releases are
> > for. I prefer to trust the process than take a one-foot-in-the-door
> > approach.
> >
> > So please, can we just keep it simple?
>
> Yes and no.
>
> This is a bug fix, but only for those systems that are expecting real-time
> response from synchronize_rcu_expedited(). As far as I know, this is only
> Android. The rest of the systems are just fine with the current behavior.

As far as you know, but are you sure?

> In addition, this bug fix introduces significant risks, especially in
> terms of performance for throughput-oriented workloads.

Could you explain what the risk is? That's the part I did not follow.
How can making synchronize_rcu_expedited() work getting priority
introduce throughput issues?

> So yes, let's do this bug fix (with appropriate adjustment), but let's
> also avoid exposing the non-Android workloads to risks from the inevitable
> unintended consequences. ;-)

I would argue the risk is also adding code complexity and more bugs
without clear rationale for why it is being done. There's always risk
with any change, but that's the -rc cycles and stable kernels help
catch those. I think we should not add more code complexity if it is a
theoretical concern.

There's also another possible risk - there is a possible hidden
problem here that probably the non-Android folks haven't noticed or
been able to debug. I would rather just do the right thing.

Just my 2c,

- Joel


>
> Thanx, Paul
>
> > Thanks,
> >
> > - Joel
> >
> >
> > > +static void sync_rcu_exp_select_node_cpus(struct kthread_work *wp)
> > > +{
> > > + struct rcu_exp_work *rewp =
> > > + container_of(wp, struct rcu_exp_work, rew_work);
> > > +
> > > + __sync_rcu_exp_select_node_cpus(rewp);
> > > +}
> > > +
> > > +static inline bool rcu_gp_par_worker_started(void)
> > > +{
> > > + return !!READ_ONCE(rcu_exp_par_gp_kworker);
> > > +}
> > > +
> > > +static inline void sync_rcu_exp_select_cpus_queue_work(struct rcu_node *rnp)
> > > +{
> > > + kthread_init_work(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
> > > + /*
> > > + * Use rcu_exp_par_gp_kworker, because flushing a work item from
> > > + * another work item on the same kthread worker can result in
> > > + * deadlock.
> > > + */
> > > + kthread_queue_work(rcu_exp_par_gp_kworker, &rnp->rew.rew_work);
> > > +}
> > > +
> > > +static inline void sync_rcu_exp_select_cpus_flush_work(struct rcu_node *rnp)
> > > +{
> > > + kthread_flush_work(&rnp->rew.rew_work);
> > > +}
> > > +
> > > +/*
> > > + * Work-queue handler to drive an expedited grace period forward.
> > > + */
> > > +static void wait_rcu_exp_gp(struct kthread_work *wp)
> > > +{
> > > + struct rcu_exp_work *rewp;
> > > +
> > > + rewp = container_of(wp, struct rcu_exp_work, rew_work);
> > > + rcu_exp_sel_wait_wake(rewp->rew_s);
> > > +}
> > > +
> > > +static inline void synchronize_rcu_expedited_queue_work(struct rcu_exp_work *rew)
> > > +{
> > > + kthread_init_work(&rew->rew_work, wait_rcu_exp_gp);
> > > + kthread_queue_work(rcu_exp_gp_kworker, &rew->rew_work);
> > > +}
> > > +
> > > +static inline void synchronize_rcu_expedited_destroy_work(struct rcu_exp_work *rew)
> > > +{
> > > +}
> > > +#else /* !CONFIG_RCU_EXP_KTHREAD */
> > > +static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > > +{
> > > + struct rcu_exp_work *rewp =
> > > + container_of(wp, struct rcu_exp_work, rew_work);
> > > +
> > > + __sync_rcu_exp_select_node_cpus(rewp);
> > > +}
> > > +
> > > +static inline bool rcu_gp_par_worker_started(void)
> > > +{
> > > + return !!READ_ONCE(rcu_par_gp_wq);
> > > +}
> > > +
> > > +static inline void sync_rcu_exp_select_cpus_queue_work(struct rcu_node *rnp)
> > > +{
> > > + int cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1);
> > > +
> > > + INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
> > > + /* If all offline, queue the work on an unbound CPU. */
> > > + if (unlikely(cpu > rnp->grphi - rnp->grplo))
> > > + cpu = WORK_CPU_UNBOUND;
> > > + else
> > > + cpu += rnp->grplo;
> > > + queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
> > > +}
> > > +
> > > +static inline void sync_rcu_exp_select_cpus_flush_work(struct rcu_node *rnp)
> > > +{
> > > + flush_work(&rnp->rew.rew_work);
> > > +}
> > > +
> > > +/*
> > > + * Work-queue handler to drive an expedited grace period forward.
> > > + */
> > > +static void wait_rcu_exp_gp(struct work_struct *wp)
> > > +{
> > > + struct rcu_exp_work *rewp;
> > > +
> > > + rewp = container_of(wp, struct rcu_exp_work, rew_work);
> > > + rcu_exp_sel_wait_wake(rewp->rew_s);
> > > +}
> > > +
> > > +static inline void synchronize_rcu_expedited_queue_work(struct rcu_exp_work *rew)
> > > +{
> > > + INIT_WORK_ONSTACK(&rew->rew_work, wait_rcu_exp_gp);
> > > + queue_work(rcu_gp_wq, &rew->rew_work);
> > > +}
> > > +
> > > +static inline void synchronize_rcu_expedited_destroy_work(struct rcu_exp_work *rew)
> > > +{
> > > + destroy_work_on_stack(&rew->rew_work);
> > > +}
> > > +#endif /* CONFIG_RCU_EXP_KTHREAD */
> > > +
> > > /*
> > > * Select the nodes that the upcoming expedited grace period needs
> > > * to wait for.
> > > */
> > > static void sync_rcu_exp_select_cpus(void)
> > > {
> > > - int cpu;
> > > struct rcu_node *rnp;
> > >
> > > trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("reset"));
> > > @@ -435,28 +539,21 @@ static void sync_rcu_exp_select_cpus(void)
> > > rnp->exp_need_flush = false;
> > > if (!READ_ONCE(rnp->expmask))
> > > continue; /* Avoid early boot non-existent wq. */
> > > - if (!READ_ONCE(rcu_par_gp_wq) ||
> > > + if (!rcu_gp_par_worker_started() ||
> > > rcu_scheduler_active != RCU_SCHEDULER_RUNNING ||
> > > rcu_is_last_leaf_node(rnp)) {
> > > - /* No workqueues yet or last leaf, do direct call. */
> > > + /* No worker started yet or last leaf, do direct call. */
> > > sync_rcu_exp_select_node_cpus(&rnp->rew.rew_work);
> > > continue;
> > > }
> > > - INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
> > > - cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1);
> > > - /* If all offline, queue the work on an unbound CPU. */
> > > - if (unlikely(cpu > rnp->grphi - rnp->grplo))
> > > - cpu = WORK_CPU_UNBOUND;
> > > - else
> > > - cpu += rnp->grplo;
> > > - queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
> > > + sync_rcu_exp_select_cpus_queue_work(rnp);
> > > rnp->exp_need_flush = true;
> > > }
> > >
> > > - /* Wait for workqueue jobs (if any) to complete. */
> > > + /* Wait for jobs (if any) to complete. */
> > > rcu_for_each_leaf_node(rnp)
> > > if (rnp->exp_need_flush)
> > > - flush_work(&rnp->rew.rew_work);
> > > + sync_rcu_exp_select_cpus_flush_work(rnp);
> > > }
> > >
> > > /*
> > > @@ -622,17 +719,6 @@ static void rcu_exp_sel_wait_wake(unsigned long s)
> > > rcu_exp_wait_wake(s);
> > > }
> > >
> > > -/*
> > > - * Work-queue handler to drive an expedited grace period forward.
> > > - */
> > > -static void wait_rcu_exp_gp(struct work_struct *wp)
> > > -{
> > > - struct rcu_exp_work *rewp;
> > > -
> > > - rewp = container_of(wp, struct rcu_exp_work, rew_work);
> > > - rcu_exp_sel_wait_wake(rewp->rew_s);
> > > -}
> > > -
> > > #ifdef CONFIG_PREEMPT_RCU
> > >
> > > /*
> > > @@ -848,20 +934,19 @@ void synchronize_rcu_expedited(void)
> > > } else {
> > > /* Marshall arguments & schedule the expedited grace period. */
> > > rew.rew_s = s;
> > > - INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
> > > - queue_work(rcu_gp_wq, &rew.rew_work);
> > > + synchronize_rcu_expedited_queue_work(&rew);
> > > }
> > >
> > > /* Wait for expedited grace period to complete. */
> > > rnp = rcu_get_root();
> > > wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> > > sync_exp_work_done(s));
> > > - smp_mb(); /* Workqueue actions happen before return. */
> > > + smp_mb(); /* Work actions happen before return. */
> > >
> > > /* Let the next expedited grace period start. */
> > > mutex_unlock(&rcu_state.exp_mutex);
> > >
> > > if (likely(!boottime))
> > > - destroy_work_on_stack(&rew.rew_work);
> > > + synchronize_rcu_expedited_destroy_work(&rew);
> > > }
> > > EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
> > >
> > > base-commit: 42e7a03d3badebd4e70aea5362d6914dfc7c220b
> > > --
> > > 2.35.1.1178.g4f1659d476-goog
> > >

2022-04-09 10:01:01

by Kalesh Singh

[permalink] [raw]
Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

On Fri, Apr 8, 2022 at 10:39 AM Paul E. McKenney <[email protected]> wrote:
>
> On Fri, Apr 08, 2022 at 01:14:35PM -0400, Joel Fernandes wrote:
> > On Fri, Apr 8, 2022 at 11:34 AM Paul E. McKenney <[email protected]> wrote:
> > >
> > > On Fri, Apr 08, 2022 at 10:41:26AM -0400, Joel Fernandes wrote:
> > > > On Fri, Apr 8, 2022 at 10:34 AM Paul E. McKenney <[email protected]> wrote:
> > > > >
> > > > > On Fri, Apr 08, 2022 at 06:42:42AM -0400, Joel Fernandes wrote:
> > > > > > On Fri, Apr 8, 2022 at 12:57 AM Kalesh Singh <[email protected]> wrote:
> > > > > > >
> > > > > > [...]
> > > > > > > @@ -334,15 +334,13 @@ static bool exp_funnel_lock(unsigned long s)
> > > > > > > * Select the CPUs within the specified rcu_node that the upcoming
> > > > > > > * expedited grace period needs to wait for.
> > > > > > > */
> > > > > > > -static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > > > > > > +static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
> > > > > > > {
> > > > > > > int cpu;
> > > > > > > unsigned long flags;
> > > > > > > unsigned long mask_ofl_test;
> > > > > > > unsigned long mask_ofl_ipi;
> > > > > > > int ret;
> > > > > > > - struct rcu_exp_work *rewp =
> > > > > > > - container_of(wp, struct rcu_exp_work, rew_work);
> > > > > > > struct rcu_node *rnp = container_of(rewp, struct rcu_node, rew);
> > > > > > >
> > > > > > > raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > > > > > > @@ -417,13 +415,119 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > > > > > > rcu_report_exp_cpu_mult(rnp, mask_ofl_test, false);
> > > > > > > }
> > > > > > >
> > > > > > > +static void rcu_exp_sel_wait_wake(unsigned long s);
> > > > > > > +
> > > > > > > +#ifdef CONFIG_RCU_EXP_KTHREAD
> > > > > >
> > > > > > Just my 2c:
> > > > > >
> > > > > > Honestly, I am not sure if the benefits of duplicating the code to use
> > > > > > normal workqueues outweighs the drawbacks (namely code complexity,
> > > > > > code duplication - which can in turn cause more bugs and maintenance
> > > > > > headaches down the line). The code is harder to read and adding more
> > > > > > 30 character function names does not help.
> > > > > >
> > > > > > For something as important as expedited GPs, I can't imagine a
> > > > > > scenario where an RT kthread worker would cause "issues". If it does
> > > > > > cause issues, that's what the -rc cycles and the stable releases are
> > > > > > for. I prefer to trust the process than take a one-foot-in-the-door
> > > > > > approach.
> > > > > >
> > > > > > So please, can we just keep it simple?
> > > > >
> > > > > Yes and no.
> > > > >
> > > > > This is a bug fix, but only for those systems that are expecting real-time
> > > > > response from synchronize_rcu_expedited(). As far as I know, this is only
> > > > > Android. The rest of the systems are just fine with the current behavior.
> > > >
> > > > As far as you know, but are you sure?
> > >
> > > None of us are sure. We are balancing risks and potential benefits.
> >
> > Right.
> >
> > > > > In addition, this bug fix introduces significant risks, especially in
> > > > > terms of performance for throughput-oriented workloads.
> > > >
> > > > Could you explain what the risk is? That's the part I did not follow.
> > > > How can making synchronize_rcu_expedited() work getting priority
> > > > introduce throughput issues?
> > >
> > > Status quo has synchronize_rcu_expedited() workqueues running as
> > > SCHED_OTHER.
> >
> > Yeah, so if we go by this, you are saying RCU_BOOST likely does not
> > work correctly on status quo right? I do not see what in the commit
> > message indicates that this is an Android-only issue, let me know what
> > I am missing.
>
> Android is in the process of imposing a new requirement, which I have been
> letting you get away with calling a bug. Mostly because I don't care that
> much what it is called. Either way, the code base needs to accommodate
> Android's needs with acceptably low risk of breakage elsewhere.
>
> You are right that RCU_BOOST has run additional kthreads at real-time
> priority, and has done so for quite some time. But RCU_BOOST has not
> (repeat, NOT) implied real-time response from synchronize_rcu_expedited(),
> and the ONLY request thus far has come from Android.
>
> Face it, the fact that synchronize_rcu_expedited() can achieve
> tens-of-milliseconds "worst-case" latencies is quite
> amazing. Let's not spoil things by visiting the changes on
> workloads that neither need nor care about "worst-case" latencies
> from synchronize_rcu_expedited().
>
> The "worst case" with the double quotes is because a single long-running
> CPU-bound reader will force increased latencies. Then again, that is
> exactly why Uladzislau created this commit:
>
> 0b74356e7b95 ("rcu: Introduce CONFIG_RCU_EXP_CPU_STALL_TIMEOUT")
>
> Please note that this commit's short timeouts depend on ANDROID. If you
> are serious about also having this work in ChromeOS, please send a patch
> on top of Uladzislau's that also covers ChromeOS.
>
> But please be very careful what you wish for...
>
> Right now, you get expedited RCU CPU stall warnings at 21 seconds
> in mainline kernels. With this commit, that becomes 20 milliseconds,
> but only for ANDROID.
>
> I could easily imagine that the Android guys would welcome help
> fixing long-running RCU readers, but I find it even easier to
> imagine that some of the more vocal ChromeOS users might stridently
> object to being involuntarily recruited to provide such help. ;-)

Thanks for the discussion everyone.

We didn't fully switch to kthread workers to avoid changing the
behavior for users that don't need these low-latency expedited GPs.
Another (and perhaps more important) reason is that kthread_worker
offers less concurrency than workqueues, which Paul reported can pose
issues on systems with a large number of CPUs.
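
To make that concrete, here is a minimal sketch of how such RT
kthread_workers might be created and raised to SCHED_FIFO. This is
illustrative only: it assumes the mainline kthread_create_worker() and
sched_setscheduler_nocheck() interfaces, and the function name and
priority value below are placeholders (the actual tree.c hunks are not
quoted in this thread).

/* Illustrative sketch, not the patch itself. */
static struct kthread_worker *rcu_exp_gp_kworker;
static struct kthread_worker *rcu_exp_par_gp_kworker;

static void __init rcu_start_exp_gp_kworkers(void)
{
        struct sched_param param = { .sched_priority = 1 };  /* placeholder priority */

        rcu_exp_gp_kworker = kthread_create_worker(0, "rcu_exp_gp_kworker");
        if (IS_ERR_OR_NULL(rcu_exp_gp_kworker)) {
                rcu_exp_gp_kworker = NULL;
                return;
        }

        rcu_exp_par_gp_kworker = kthread_create_worker(0, "rcu_exp_par_gp_kworker");
        if (IS_ERR_OR_NULL(rcu_exp_par_gp_kworker)) {
                kthread_destroy_worker(rcu_exp_gp_kworker);
                rcu_exp_gp_kworker = NULL;
                return;
        }

        /* Raise both workers to RT so expedited GP work is not delayed
         * behind SCHED_OTHER tasks. */
        sched_setscheduler_nocheck(rcu_exp_gp_kworker->task, SCHED_FIFO, &param);
        sched_setscheduler_nocheck(rcu_exp_par_gp_kworker->task, SCHED_FIFO, &param);
}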

The Kconfig depends on RCU_EXPERT. Android enables RCU_EXPERT in the
common kernel defconfigs to allow configuring performance
optimizations. Changing to depend on CONFIG_ANDROID would also mean
less testing. So I prefer to keep RCU_EXPERT.

If PREEMPT_RT systems don't disable expedited grace periods, we can
remove the default off for those. Steve?

Thanks,
Kalesh

>
> > > The users affected by this will instead have these running
> > > as SCHED_FIFO. That changes scheduling. For users not explicitly
> > > needing low-latency synchronize_rcu_expedited(), this change is very
> > > unlikely to be for the better. And historically, unnecessarily running
> > > portions of RCU at real-time priorities has been a change for the worse.
> > > As in greatly increased context-switch rates and consequently degraded
> > > performance. Please note that this is not a theoretical statement: Real
> > > users have really been burned by too much SCHED_FIFO in RCU kthreads in
> > > the past.
> >
> > Android also has suffered from too much SCHED_FIFO in the past. I
> > remember the display thread called 'surfaceflinger' had to be dropped
> > from RT privilege at one point.
>
> "With great power..." ;-)
>
> Thanx, Paul
>
> > > > > So yes, let's do this bug fix (with appropriate adjustment), but let's
> > > > > also avoid exposing the non-Android workloads to risks from the inevitable
> > > > > unintended consequences. ;-)
> > > >
> > > > I would argue the risk is also adding code complexity and more bugs
> > > > without clear rationale for why it is being done. There's always risk
> > > > with any change, but that's the -rc cycles and stable kernels help
> > > > catch those. I think we should not add more code complexity if it is a
> > > > theoretical concern.
> > > >
> > > > There's also another possible risk - there is a possible hidden
> > > > problem here that probably the non-Android folks haven't noticed or
> > > > been able to debug. I would rather just do the right thing.
> > > >
> > > > Just my 2c,
> > >
> > > Sorry, but my answer is still "no".
> > >
> > > Your suggested change risks visiting unacceptable performance
> > > degradation on a very large number of innocent users for whom current
> > > synchronize_rcu_expedited() latency is plenty good enough.
> >
> > I believe the process will catch any such risk, but it is your call! ;-)
> >
> > Thanks,
> >
> > - Joel
>

2022-04-09 13:18:56

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

On Fri, 8 Apr 2022 08:58:34 -0700
"Paul E. McKenney" <[email protected]> wrote:

> > I do not see why PREEMPT_RT is special here. Why was it singled out?
>
> Because I didn't see the point of doing kthread_create_worker() to create
> a pair of kthread_worker structures that were never going to be used
> in PREEMPT_RT kernels.
>
> Or are PREEMPT_RT kernels no longer booting with rcupdate.rcu_normal?
> Last I knew, they all did this to avoid IPI latencies from expedited
> grace periods. Has this changed?

Honestly, I don't know. But I guess I can go and find out ;-)

-- Steve

2022-04-09 21:12:14

by Kalesh Singh

[permalink] [raw]
Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

On Fri, Apr 8, 2022 at 2:42 PM Steven Rostedt <[email protected]> wrote:
>
> On Fri, 8 Apr 2022 10:53:53 -0700
> Kalesh Singh <[email protected]> wrote:
>
> > If PREEMPT_RT systems don't disable expedited grace periods, we can
> > remove the default off for those. Steve?
>
> Yep, commit 36221e109eb20 ("rcu: Enable rcu_normal_after_boot
> unconditionally for RT") is still there.
>
> OK, that explains it. I missed that change, so I didn't see the relation to
> PREEMPT_RT. It wasn't described in the change log nor was there any
> comment in the kconfig to say why it was special. Perhaps that should be
> added for those that don't know all this tribal knowledge ;-)

Thanks for verifying Steve. I'll update the commit and Kconfig texts
accordingly.

-Kalesh

>
> -- Steve
>

2022-04-11 06:19:04

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

On Fri, 8 Apr 2022 10:53:53 -0700
Kalesh Singh <[email protected]> wrote:

> If PREEMPT_RT systems don't disable expedited grace periods, we can
> remove the default off for those. Steve?

Yep, commit 36221e109eb20 ("rcu: Enable rcu_normal_after_boot
unconditionally for RT") is still there.

OK, that explains it. I missed that change, so I didn't see the relation to
PREEMPT_RT. It wasn't described in the change log nor was there any
comment in the kconfig to say why it was special. Perhaps that should be
added for those that don't know all this tribal knowledge ;-)
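
For anyone without that tribal knowledge, the effect of that commit is
roughly the following (paraphrased, not the exact diff): PREEMPT_RT
kernels default rcu_normal_after_boot to 1, so once boot completes,
expedited grace periods are mapped back to normal ones.

/* kernel/rcu/update.c, paraphrased */
static int rcu_normal_after_boot = IS_ENABLED(CONFIG_PREEMPT_RT);

void rcu_end_inkernel_boot(void)
{
        rcu_unexpedite_gp();
        if (rcu_normal_after_boot)
                WRITE_ONCE(rcu_normal, 1);
        rcu_boot_ended = true;
}

That is why PREEMPT_RT kernels are not expected to exercise
synchronize_rcu_expedited() after boot, and hence why the new option
can default off there.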

-- Steve

2022-04-12 06:54:53

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

On Sat, Apr 09, 2022 at 03:17:40PM +0800, Hillf Danton wrote:
> On Fri, 8 Apr 2022 10:53:53 -0700 Kalesh Singh wrote
> > Thanks for the discussion everyone.
> >
> > We didn't fully switch to kthread workers to avoid changing the
> > behavior for users that don't need this low latency exp GPs. Another
> > (and perhaps more important) reason is because kthread_worker offers
> > reduced concurrency than workqueues which Paul reported can pose
> > issues on systems with a large number of CPUs.
>
> A second ... what issues were reported wrt concurrency, given the output
> of grep -nr workqueue block mm drivers.
>
> Feel free to post a URL link to the issues.

The issues can be easily seen by inspecting kthread_queue_work() and
the functions that it invokes. In contrast, normal workqueues use
per-CPU mechanisms to avoid contention, as can equally easily be seen
by inspecting queue_work_on() and the functions that it invokes.
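
To make the contrast concrete, here is a simplified paraphrase of the
two queueing paths (illustrative only, not the verbatim mainline code):

bool kthread_queue_work(struct kthread_worker *worker,
                        struct kthread_work *work)
{
        unsigned long flags;
        bool ret = false;

        /* Every caller, on every CPU, funnels through this one lock. */
        raw_spin_lock_irqsave(&worker->lock, flags);
        if (list_empty(&work->node)) {
                list_add_tail(&work->node, &worker->work_list);
                wake_up_process(worker->task);
                ret = true;
        }
        raw_spin_unlock_irqrestore(&worker->lock, flags);
        return ret;
}

/*
 * queue_work_on(), in contrast, resolves to a per-CPU worker pool, so
 * concurrent callers on different CPUs mostly take different locks:
 *
 *      queue_work_on(cpu, wq, work)
 *        __queue_work(cpu, wq, work)
 *          pwq = pool_workqueue of the chosen CPU
 *          raw_spin_lock(&pwq->pool->lock);        // per-CPU pool lock
 *          insert_work(pwq, work, ...);
 *          raw_spin_unlock(&pwq->pool->lock);
 */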

Please do feel free to take a look.

If taking a look does not convince you, please construct some in-kernel
benchmarks to test the scalability of these two mechanisms. Please note
that some care will be required to make sure that you are doing a valid
apples-to-apples comparison.

Thanx, Paul

2022-04-12 12:39:53

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

On Fri, Apr 08, 2022 at 06:42:42AM -0400, Joel Fernandes wrote:
> On Fri, Apr 8, 2022 at 12:57 AM Kalesh Singh <[email protected]> wrote:
> >
> [...]
> > @@ -334,15 +334,13 @@ static bool exp_funnel_lock(unsigned long s)
> > * Select the CPUs within the specified rcu_node that the upcoming
> > * expedited grace period needs to wait for.
> > */
> > -static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > +static void __sync_rcu_exp_select_node_cpus(struct rcu_exp_work *rewp)
> > {
> > int cpu;
> > unsigned long flags;
> > unsigned long mask_ofl_test;
> > unsigned long mask_ofl_ipi;
> > int ret;
> > - struct rcu_exp_work *rewp =
> > - container_of(wp, struct rcu_exp_work, rew_work);
> > struct rcu_node *rnp = container_of(rewp, struct rcu_node, rew);
> >
> > raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > @@ -417,13 +415,119 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > rcu_report_exp_cpu_mult(rnp, mask_ofl_test, false);
> > }
> >
> > +static void rcu_exp_sel_wait_wake(unsigned long s);
> > +
> > +#ifdef CONFIG_RCU_EXP_KTHREAD
>
> Just my 2c:
>
> Honestly, I am not sure if the benefits of duplicating the code to use
> normal workqueues outweighs the drawbacks (namely code complexity,
> code duplication - which can in turn cause more bugs and maintenance
> headaches down the line). The code is harder to read and adding more
> 30 character function names does not help.
>
> For something as important as expedited GPs, I can't imagine a
> scenario where an RT kthread worker would cause "issues". If it does
> cause issues, that's what the -rc cycles and the stable releases are
> for. I prefer to trust the process than take a one-foot-in-the-door
> approach.
>
> So please, can we just keep it simple?

Yes and no.

This is a bug fix, but only for those systems that are expecting real-time
response from synchronize_rcu_expedited(). As far as I know, this is only
Android. The rest of the systems are just fine with the current behavior.

In addition, this bug fix introduces significant risks, especially in
terms of performance for throughput-oriented workloads.

So yes, let's do this bug fix (with appropriate adjustment), but let's
also avoid exposing the non-Android workloads to risks from the inevitable
unintended consequences. ;-)

Thanx, Paul

> Thanks,
>
> - Joel
>
>
> > +static void sync_rcu_exp_select_node_cpus(struct kthread_work *wp)
> > +{
> > + struct rcu_exp_work *rewp =
> > + container_of(wp, struct rcu_exp_work, rew_work);
> > +
> > + __sync_rcu_exp_select_node_cpus(rewp);
> > +}
> > +
> > +static inline bool rcu_gp_par_worker_started(void)
> > +{
> > + return !!READ_ONCE(rcu_exp_par_gp_kworker);
> > +}
> > +
> > +static inline void sync_rcu_exp_select_cpus_queue_work(struct rcu_node *rnp)
> > +{
> > + kthread_init_work(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
> > + /*
> > + * Use rcu_exp_par_gp_kworker, because flushing a work item from
> > + * another work item on the same kthread worker can result in
> > + * deadlock.
> > + */
> > + kthread_queue_work(rcu_exp_par_gp_kworker, &rnp->rew.rew_work);
> > +}
> > +
> > +static inline void sync_rcu_exp_select_cpus_flush_work(struct rcu_node *rnp)
> > +{
> > + kthread_flush_work(&rnp->rew.rew_work);
> > +}
> > +
> > +/*
> > + * Work-queue handler to drive an expedited grace period forward.
> > + */
> > +static void wait_rcu_exp_gp(struct kthread_work *wp)
> > +{
> > + struct rcu_exp_work *rewp;
> > +
> > + rewp = container_of(wp, struct rcu_exp_work, rew_work);
> > + rcu_exp_sel_wait_wake(rewp->rew_s);
> > +}
> > +
> > +static inline void synchronize_rcu_expedited_queue_work(struct rcu_exp_work *rew)
> > +{
> > + kthread_init_work(&rew->rew_work, wait_rcu_exp_gp);
> > + kthread_queue_work(rcu_exp_gp_kworker, &rew->rew_work);
> > +}
> > +
> > +static inline void synchronize_rcu_expedited_destroy_work(struct rcu_exp_work *rew)
> > +{
> > +}
> > +#else /* !CONFIG_RCU_EXP_KTHREAD */
> > +static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
> > +{
> > + struct rcu_exp_work *rewp =
> > + container_of(wp, struct rcu_exp_work, rew_work);
> > +
> > + __sync_rcu_exp_select_node_cpus(rewp);
> > +}
> > +
> > +static inline bool rcu_gp_par_worker_started(void)
> > +{
> > + return !!READ_ONCE(rcu_par_gp_wq);
> > +}
> > +
> > +static inline void sync_rcu_exp_select_cpus_queue_work(struct rcu_node *rnp)
> > +{
> > + int cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1);
> > +
> > + INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
> > + /* If all offline, queue the work on an unbound CPU. */
> > + if (unlikely(cpu > rnp->grphi - rnp->grplo))
> > + cpu = WORK_CPU_UNBOUND;
> > + else
> > + cpu += rnp->grplo;
> > + queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
> > +}
> > +
> > +static inline void sync_rcu_exp_select_cpus_flush_work(struct rcu_node *rnp)
> > +{
> > + flush_work(&rnp->rew.rew_work);
> > +}
> > +
> > +/*
> > + * Work-queue handler to drive an expedited grace period forward.
> > + */
> > +static void wait_rcu_exp_gp(struct work_struct *wp)
> > +{
> > + struct rcu_exp_work *rewp;
> > +
> > + rewp = container_of(wp, struct rcu_exp_work, rew_work);
> > + rcu_exp_sel_wait_wake(rewp->rew_s);
> > +}
> > +
> > +static inline void synchronize_rcu_expedited_queue_work(struct rcu_exp_work *rew)
> > +{
> > + INIT_WORK_ONSTACK(&rew->rew_work, wait_rcu_exp_gp);
> > + queue_work(rcu_gp_wq, &rew->rew_work);
> > +}
> > +
> > +static inline void synchronize_rcu_expedited_destroy_work(struct rcu_exp_work *rew)
> > +{
> > + destroy_work_on_stack(&rew->rew_work);
> > +}
> > +#endif /* CONFIG_RCU_EXP_KTHREAD */
> > +
> > /*
> > * Select the nodes that the upcoming expedited grace period needs
> > * to wait for.
> > */
> > static void sync_rcu_exp_select_cpus(void)
> > {
> > - int cpu;
> > struct rcu_node *rnp;
> >
> > trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("reset"));
> > @@ -435,28 +539,21 @@ static void sync_rcu_exp_select_cpus(void)
> > rnp->exp_need_flush = false;
> > if (!READ_ONCE(rnp->expmask))
> > continue; /* Avoid early boot non-existent wq. */
> > - if (!READ_ONCE(rcu_par_gp_wq) ||
> > + if (!rcu_gp_par_worker_started() ||
> > rcu_scheduler_active != RCU_SCHEDULER_RUNNING ||
> > rcu_is_last_leaf_node(rnp)) {
> > - /* No workqueues yet or last leaf, do direct call. */
> > + /* No worker started yet or last leaf, do direct call. */
> > sync_rcu_exp_select_node_cpus(&rnp->rew.rew_work);
> > continue;
> > }
> > - INIT_WORK(&rnp->rew.rew_work, sync_rcu_exp_select_node_cpus);
> > - cpu = find_next_bit(&rnp->ffmask, BITS_PER_LONG, -1);
> > - /* If all offline, queue the work on an unbound CPU. */
> > - if (unlikely(cpu > rnp->grphi - rnp->grplo))
> > - cpu = WORK_CPU_UNBOUND;
> > - else
> > - cpu += rnp->grplo;
> > - queue_work_on(cpu, rcu_par_gp_wq, &rnp->rew.rew_work);
> > + sync_rcu_exp_select_cpus_queue_work(rnp);
> > rnp->exp_need_flush = true;
> > }
> >
> > - /* Wait for workqueue jobs (if any) to complete. */
> > + /* Wait for jobs (if any) to complete. */
> > rcu_for_each_leaf_node(rnp)
> > if (rnp->exp_need_flush)
> > - flush_work(&rnp->rew.rew_work);
> > + sync_rcu_exp_select_cpus_flush_work(rnp);
> > }
> >
> > /*
> > @@ -622,17 +719,6 @@ static void rcu_exp_sel_wait_wake(unsigned long s)
> > rcu_exp_wait_wake(s);
> > }
> >
> > -/*
> > - * Work-queue handler to drive an expedited grace period forward.
> > - */
> > -static void wait_rcu_exp_gp(struct work_struct *wp)
> > -{
> > - struct rcu_exp_work *rewp;
> > -
> > - rewp = container_of(wp, struct rcu_exp_work, rew_work);
> > - rcu_exp_sel_wait_wake(rewp->rew_s);
> > -}
> > -
> > #ifdef CONFIG_PREEMPT_RCU
> >
> > /*
> > @@ -848,20 +934,19 @@ void synchronize_rcu_expedited(void)
> > } else {
> > /* Marshall arguments & schedule the expedited grace period. */
> > rew.rew_s = s;
> > - INIT_WORK_ONSTACK(&rew.rew_work, wait_rcu_exp_gp);
> > - queue_work(rcu_gp_wq, &rew.rew_work);
> > + synchronize_rcu_expedited_queue_work(&rew);
> > }
> >
> > /* Wait for expedited grace period to complete. */
> > rnp = rcu_get_root();
> > wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
> > sync_exp_work_done(s));
> > - smp_mb(); /* Workqueue actions happen before return. */
> > + smp_mb(); /* Work actions happen before return. */
> >
> > /* Let the next expedited grace period start. */
> > mutex_unlock(&rcu_state.exp_mutex);
> >
> > if (likely(!boottime))
> > - destroy_work_on_stack(&rew.rew_work);
> > + synchronize_rcu_expedited_destroy_work(&rew);
> > }
> > EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
> >
> > base-commit: 42e7a03d3badebd4e70aea5362d6914dfc7c220b
> > --
> > 2.35.1.1178.g4f1659d476-goog
> >

2022-04-13 18:15:22

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

Hi Paul,


On Wed, Apr 13, 2022 at 8:07 AM Paul E. McKenney <[email protected]> wrote:
>
> On Wed, Apr 13, 2022 at 07:37:11PM +0800, Hillf Danton wrote:
> > On Sat, 9 Apr 2022 08:56:12 -0700 Paul E. McKenney wrote:
> > > On Sat, Apr 09, 2022 at 03:17:40PM +0800, Hillf Danton wrote:
> > > > On Fri, 8 Apr 2022 10:53:53 -0700 Kalesh Singh wrote
> > > > > Thanks for the discussion everyone.
> > > > >
> > > > > We didn't fully switch to kthread workers to avoid changing the
> > > > > > behavior for users that don't need this low latency exp GPs. Another
> > > > > > (and perhaps more important) reason is because kthread_worker offers
> > > > > > reduced concurrency than workqueues which Paul reported can pose
> > > > > issues on systems with a large number of CPUs.
> > > >
> > > > A second ... what issues were reported wrt concurrency, given the output
> > > > of grep -nr workqueue block mm drivers.
> > > >
> > > > Feel free to post a URL link to the issues.
> > >
> > > The issues can be easily seen by inspecting kthread_queue_work() and
> > > the functions that it invokes. In contrast, normal workqueues uses
> > > per-CPU mechanisms to avoid contention, as can equally easily be seen
> > > by inspecting queue_work_on() and the functions that it invokes.
> >
> > The worker from kthread_create_worker() roughly matches unbound workqueue
> > that can get every CPU overloaded, thus the difference in implementation
> > details between kthread worker and WQ worker (either bound or unbound) can
> > be safely ignored if the kthread method works, given that priority is barely
> > a cure to concurrency issues.
>
> Please look again, this time taking lock contention in to account,
> keeping in mind that systems with several hundred CPUs are reasonably
> common and that systems with more than a thousand CPUs are not unheard of.


You are talking about lock contention in the kthread_worker infra
which unbound WQ does not suffer from, right? I don't think the worker
lock contention will be an issue unless several
synchronize_rcu_expedited() calls are trying to queue work at the same
time. Did I miss something? Considering synchronize_rcu_expedited()
can block in the normal case (blocking is a pretty heavy operation
involving the scheduler and load balancers), I don't see how
contending on the worker infra locks can be an issue. If it were
call_rcu(), then I could see the contention concern, since that
executes much more often.

I think the argument about too many things being RT is stronger though.

Thanks,

Joel


>
>
> Thanx, Paul
>
> > Hillf
> > >
> > > Please do feel free to take a look.
> > >
> > > If taking a look does not convince you, please construct some in-kernel
> > > benchmarks to test the scalability of these two mechanisms. Please note
> > > that some care will be required to make sure that you are doing a valid
> > > apples-to-apples comparison.
> > >
> > > Thanx, Paul
> > >

2022-04-14 01:45:01

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

On Wed, Apr 13, 2022 at 01:21:20PM -0400, Joel Fernandes wrote:
> Hi Paul,
>
>
> On Wed, Apr 13, 2022 at 8:07 AM Paul E. McKenney <[email protected]> wrote:
> >
> > On Wed, Apr 13, 2022 at 07:37:11PM +0800, Hillf Danton wrote:
> > > On Sat, 9 Apr 2022 08:56:12 -0700 Paul E. McKenney wrote:
> > > > On Sat, Apr 09, 2022 at 03:17:40PM +0800, Hillf Danton wrote:
> > > > > On Fri, 8 Apr 2022 10:53:53 -0700 Kalesh Singh wrote
> > > > > > Thanks for the discussion everyone.
> > > > > >
> > > > > > We didn't fully switch to kthread workers to avoid changing the
> > > > > > behavior for users that don't need this low latency exp GPs. Another
> > > > > > (and perhaps more important) reason is because kthread_worker offers
> > > > > > reduced concurrency than workqueues which Paul reported can pose
> > > > > > issues on systems with a large number of CPUs.
> > > > >
> > > > > A second ... what issues were reported wrt concurrency, given the output
> > > > > of grep -nr workqueue block mm drivers.
> > > > >
> > > > > Feel free to post a URL link to the issues.
> > > >
> > > > The issues can be easily seen by inspecting kthread_queue_work() and
> > > > the functions that it invokes. In contrast, normal workqueues uses
> > > > per-CPU mechanisms to avoid contention, as can equally easily be seen
> > > > by inspecting queue_work_on() and the functions that it invokes.
> > >
> > > The worker from kthread_create_worker() roughly matches unbound workqueue
> > > that can get every CPU overloaded, thus the difference in implementation
> > > details between kthread worker and WQ worker (either bound or unbound) can
> > > be safely ignored if the kthread method works, given that priority is barely
> > > a cure to concurrency issues.
> >
> > Please look again, this time taking lock contention in to account,
> > keeping in mind that systems with several hundred CPUs are reasonably
> > common and that systems with more than a thousand CPUs are not unheard of.
>
> You are talking about lock contention in the kthread_worker infra
> which unbound WQ does not suffer from, right? I don't think the worker
> lock contention will be an issue unless several
> synchronize_rcu_expedited() calls are trying to queue work at the same
> time. Did I miss something? Considering synchronize_rcu_expedited()
> can block in the normal case (blocking is a pretty heavy operation
> involving the scheduler and load balancers), I don't see how
> contending on the worker infra locks can be an issue. If it was
> call_rcu() , then I can relate to any contention since that executes
> much more often.

Think in terms of a system with 1536 CPUs (which IBM would be extremely
happy to sell you, last I checked). This has 96 leaf rcu_node structures.
Keeping that in mind, take another look at that code.

And in the past there have been real systems with 256 leaf rcu_node
structures.
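
For scale, assuming the default CONFIG_RCU_FANOUT_LEAF of 16, 1536 CPUs
works out to 1536 / 16 = 96 leaf rcu_node structures, so a single
expedited grace period hands off up to 96 per-leaf work items, and 256
leaves corresponds to a 4096-CPU system at that fanout. Queued onto one
kthread_worker, those items all serialize behind a single worker lock
and a single worker thread, whereas workqueues can spread them across
CPUs.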

> I think the argument about too many things being RT is stronger though.

Fair enough. Except that this could be dealt with by conditionally
setting SCHED_FIFO. But the lock contention would remain.

Thanx, Paul

> Thanks,
>
> Joel
>
>
> >
> >
> > Thanx, Paul
> >
> > > Hillf
> > > >
> > > > Please do feel free to take a look.
> > > >
> > > > If taking a look does not convince you, please construct some in-kernel
> > > > benchmarks to test the scalability of these two mechanisms. Please note
> > > > that some care will be required to make sure that you are doing a valid
> > > > apples-to-apples comparison.
> > > >
> > > > Thanx, Paul
> > > >

2022-04-14 08:27:46

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2] EXP rcu: Move expedited grace period (GP) work to RT kthread_worker

On Wed, Apr 13, 2022 at 07:37:11PM +0800, Hillf Danton wrote:
> On Sat, 9 Apr 2022 08:56:12 -0700 Paul E. McKenney wrote:
> > On Sat, Apr 09, 2022 at 03:17:40PM +0800, Hillf Danton wrote:
> > > On Fri, 8 Apr 2022 10:53:53 -0700 Kalesh Singh wrote
> > > > Thanks for the discussion everyone.
> > > >
> > > > We didn't fully switch to kthread workers to avoid changing the
> > > > behavior for users that don't need this low latency exp GPs. Another
> > > > (and perhaps more important) reason is because kthread_worker offers
> > > > reduced concurrency than workqueues which Paul reported can pose
> > > > issues on systems with a large number of CPUs.
> > >
> > > A second ... what issues were reported wrt concurrency, given the output
> > > of grep -nr workqueue block mm drivers.
> > >
> > > Feel free to post a URL link to the issues.
> >
> > The issues can be easily seen by inspecting kthread_queue_work() and
> > the functions that it invokes. In contrast, normal workqueues uses
> > per-CPU mechanisms to avoid contention, as can equally easily be seen
> > by inspecting queue_work_on() and the functions that it invokes.
>
> The worker from kthread_create_worker() roughly matches unbound workqueue
> that can get every CPU overloaded, thus the difference in implementation
> details between kthread worker and WQ worker (either bound or unbound) can
> be safely ignored if the kthread method works, given that priority is barely
> a cure to concurrency issues.

Please look again, this time taking lock contention into account,
keeping in mind that systems with several hundred CPUs are reasonably
common and that systems with more than a thousand CPUs are not unheard of.

Thanx, Paul

> Hillf
> >
> > Please do feel free to take a look.
> >
> > If taking a look does not convince you, please construct some in-kernel
> > benchmarks to test the scalability of these two mechanisms. Please note
> > that some care will be required to make sure that you are doing a valid
> > apples-to-apples comparison.
> >
> > Thanx, Paul
> >