2024-02-13 06:00:51

by Ankur Arora

Subject: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

Hi,

This series adds a new scheduling model, PREEMPT_AUTO, which, like
PREEMPT_DYNAMIC, allows dynamic switching between the none/voluntary/full
preemption models. However, unlike PREEMPT_DYNAMIC, it doesn't depend
on explicit preemption points for the voluntary models.

The series is based on Thomas' original proposal which he outlined
in [1], [2] and in his PoC [3].

An earlier RFC version is at [4].

Design
==

PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
PREEMPT_COUNT). This means that the scheduler can always safely
preempt. (This is identical to CONFIG_PREEMPT.)

Having that, the next step is to make the rescheduling policy dependent
on the chosen scheduling model. Currently, the scheduler uses a single
need-resched bit (TIF_NEED_RESCHED) to state that a reschedule is
needed.

PREEMPT_AUTO extends this by adding an additional need-resched bit
(TIF_NEED_RESCHED_LAZY) which, together with TIF_NEED_RESCHED, allows
the scheduler to express two kinds of rescheduling intent: schedule at
the earliest opportunity (TIF_NEED_RESCHED), or express a need for
rescheduling while allowing the task on the runqueue to run to
timeslice completion (TIF_NEED_RESCHED_LAZY).
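
For reference, the rest of this series refers to these two bits through
a resched_t selector (patches 2-5). A minimal sketch of that interface --
illustrative only, the exact definitions are in the patches -- looks
roughly like:

/* Illustrative sketch of the resched_t selector from patches 2-5. */
typedef enum {
	NR_now,		/* TIF_NEED_RESCHED: reschedule at the earliest opportunity */
	NR_lazy,	/* TIF_NEED_RESCHED_LAZY: reschedule at the next exit to user */
} resched_t;

static inline int tif_resched(resched_t rs)
{
	return rs == NR_now ? TIF_NEED_RESCHED : TIF_NEED_RESCHED_LAZY;
}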

As mentioned above, the scheduler chooses which need-resched bit to set
based on the preemption model in use:

                  TIF_NEED_RESCHED       TIF_NEED_RESCHED_LAZY

   none           never                  always [*]
   voluntary      higher sched class     other tasks [*]
   full           always                 never

[*] some details elided here.

The last part of the puzzle is when preemption happens or, stated
alternatively, when the need-resched bits are checked:

                        exit-to-user    ret-to-kernel    preempt_count()

   NEED_RESCHED_LAZY         Y                N                 N
   NEED_RESCHED              Y                Y                 Y

Using NEED_RESCHED_LAZY allows for run-to-completion semantics when
none/voluntary preemption policies are in effect, and eager semantics
under full preemption.

In addition, since this is driven purely by the scheduler (not
depending on cond_resched() placement and the like), there is enough
flexibility in the scheduler to cope with edge cases -- e.g. a kernel
task not relinquishing the CPU under NEED_RESCHED_LAZY can be handled
by simply upgrading to a full NEED_RESCHED, which can use more coercive
instruments like a resched IPI to induce a context-switch.
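
As a rough illustration of the check-points in the table above (a
sketch only; the actual changes are in patches 6-8, which use the
generic entry code), the exit-to-user path acts on both bits while the
return-to-kernel path only acts on the eager one:

/* Sketch: exit-to-user reschedules for either bit. */
static void exit_to_user_resched_sketch(void)
{
	if (tif_need_resched(NR_now) || tif_need_resched(NR_lazy))
		schedule();
}

/* Sketch: interrupt return to kernel preempts only for the eager bit. */
static void irqentry_exit_resched_sketch(void)
{
	if (IS_ENABLED(CONFIG_PREEMPTION) && tif_need_resched(NR_now))
		preempt_schedule_irq();
}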

Performance
==
The performance in the basic tests (perf bench sched messaging,
kernbench) is fairly close to what we see under PREEMPT_DYNAMIC.
(See patches 24, 25.)

Comparing stress-ng --cyclic latencies with a background kernel load
(stress-ng --mmap) serves as a good demonstration of how letting the
scheduler enforce priorities, tick exhaustion, etc. helps:

PREEMPT_DYNAMIC, preempt=voluntary
stress-ng: info: [12252] setting to a 300 second (5 mins, 0.00 secs) run per stressor
stress-ng: info: [12252] dispatching hogs: 1 cyclic
stress-ng: info: [12253] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info: [12253] cyclic: mean: 19973.46 ns, mode: 3560 ns
stress-ng: info: [12253] cyclic: min: 2541 ns, max: 2751830 ns, std.dev. 68891.71
stress-ng: info: [12253] cyclic: latency percentiles:
stress-ng: info: [12253] cyclic: 25.00%: 4800 ns
stress-ng: info: [12253] cyclic: 50.00%: 12458 ns
stress-ng: info: [12253] cyclic: 75.00%: 25220 ns
stress-ng: info: [12253] cyclic: 90.00%: 35404 ns


PREEMPT_AUTO, preempt=voluntary
stress-ng: info: [8883] setting to a 300 second (5 mins, 0.00 secs) run per stressor
stress-ng: info: [8883] dispatching hogs: 1 cyclic
stress-ng: info: [8884] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info: [8884] cyclic: mean: 14169.08 ns, mode: 3355 ns
stress-ng: info: [8884] cyclic: min: 2570 ns, max: 2234939 ns, std.dev. 66056.95
stress-ng: info: [8884] cyclic: latency percentiles:
stress-ng: info: [8884] cyclic: 25.00%: 3665 ns
stress-ng: info: [8884] cyclic: 50.00%: 5409 ns
stress-ng: info: [8884] cyclic: 75.00%: 16009 ns
stress-ng: info: [8884] cyclic: 90.00%: 24392 ns

Notice how much lower the 25/50/75/90 percentile latencies are for the
PREEMPT_AUTO case.
(See patch 26 for the full performance numbers.)


For a macro test, a colleague in Oracle's Exadata team tried two
OLTP benchmarks (on a 5.4.17 based Oracle kernel, with this series
backported.)

In both tests the data was cached on remote nodes (cells), and the
database nodes (compute) served client queries, with clients being
local in the first test and remote in the second.

Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs


                             PREEMPT_VOLUNTARY                     PREEMPT_AUTO
                                                                 (preempt=voluntary)
                     ==============================     =============================
          clients    throughput      cpu-usage          throughput      cpu-usage       Gain
                      (tx/min)    (utime %/stime %)      (tx/min)    (utime %/stime %)
          -------    ----------   -----------------     ----------   -----------------  -----


OLTP          384     9,315,653        25/ 6             9,253,252        25/ 6         -0.7%
benchmark    1536    13,177,565        50/10            13,657,306        50/10         +3.6%
(local       3456    14,063,017        63/12            14,179,706        64/12         +0.8%
 clients)


OLTP           96     8,973,985        17/ 2             8,924,926        17/ 2         -0.5%
benchmark     384    22,577,254        60/ 8            22,211,419        59/ 8         -1.6%
(remote      2304    25,882,857        82/11            25,536,100        82/11         -1.3%
 clients,
 90/10 RW ratio)


(Both sets of tests have a fair amount of network traffic since the
query tables etc. are cached on the cells. Additionally, the first set,
given the local clients, stresses the scheduler a bit more than the
second.)

The comparative performance for both tests is fairly close, more or
less within the margin of error.

IMO the tests above (sched-messaging, kernbench, stress-ng, OLTP) show
that this scheduling model has legs. That said, the none/voluntary
models under PREEMPT_AUTO are conceptually different enough that there
likely are workloads where performance would be subpar. That needs
more extensive testing to figure out the weak points.


Series layout
==

Patch 1,
"preempt: introduce CONFIG_PREEMPT_AUTO"
introduces the new scheduling model.

Patches 2-5,
"thread_info: selector for TIF_NEED_RESCHED[_LAZY]",
"thread_info: tif_need_resched() now takes resched_t as param",
"sched: make test_*_tsk_thread_flag() return bool",
"sched: *_tsk_need_resched() now takes resched_t as param"

introduce new thread_info/task helper interfaces or make changes to
pre-existing ones that will be used in the rest of the series.

Patches 6-9,
"entry: handle lazy rescheduling at user-exit",
"entry/kvm: handle lazy rescheduling at guest-entry",
"entry: irqentry_exit only preempts for TIF_NEED_RESCHED",
"sched: __schedule_loop() doesn't need to check for need_resched_lazy()"

make changes/document the rescheduling points.

Patches 10-11,
"sched: separate PREEMPT_DYNAMIC config logic",
"sched: runtime preemption config under PREEMPT_AUTO"

reuse the PREEMPT_DYNAMIC runtime configuration logic.

Patches 12-16,
"rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO",
"rcu: fix header guard for rcu_all_qs()",
"preempt,rcu: warn on PREEMPT_RCU=n, preempt_model_full",
"rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y",
"rcu: force context-switch for PREEMPT_RCU=n, PREEMPT_COUNT=y"

add RCU support.

Patch 17,
"x86/thread_info: define TIF_NEED_RESCHED_LAZY"

adds x86 support.

Note on platform support: this is x86 only for now. However, supporting
architectures with !ARCH_NO_PREEMPT is straightforward -- especially
if they support GENERIC_ENTRY.

Patches 18-21,
"sched: prepare for lazy rescheduling in resched_curr()",
"sched: default preemption policy for PREEMPT_AUTO",
"sched: handle idle preemption for PREEMPT_AUTO",
"sched: schedule eagerly in resched_cpu()"

are preparatory patches for adding PREEMPT_AUTO. Among other things
they add the default need-resched policy for !PREEMPT_AUTO,
PREEMPT_AUTO, and the idle task.

Patches 22-23,
"sched/fair: refactor update_curr(), entity_tick()",
"sched/fair: handle tick expiry under lazy preemption"

handle the 'hog' problem, where a kernel task does not voluntarily
schedule out.

And finally, patches 24-26,
"sched: support preempt=none under PREEMPT_AUTO"
"sched: support preempt=full under PREEMPT_AUTO"
"sched: handle preempt=voluntary under PREEMPT_AUTO"

add support for the three preemption models.

Patches 27-30,
"sched: latency warn for TIF_NEED_RESCHED_LAZY",
"tracing: support lazy resched",
"Documentation: tracing: add TIF_NEED_RESCHED_LAZY",
"osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y"

handle remaining bits and pieces to do with TIF_NEED_RESCHED_LAZY.


Changelog
==

RFC:
- Addresses review comments and is generally a more focused
version of the RFC.
- Lots of code reorganization.
- Bugfixes all over.
- need_resched() now only checks for TIF_NEED_RESCHED instead
of TIF_NEED_RESCHED|TIF_NEED_RESCHED_LAZY.
- set_nr_if_polling() now does not check for TIF_NEED_RESCHED_LAZY.
- Tighten idle related checks.
- RCU changes to force context-switches when a quiescent state is
urgently needed.
- Does not break live-patching anymore.

Also at: github.com/terminus/linux preempt-v1

Please review.

Thanks
Ankur

[1] https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/
[2] https://lore.kernel.org/lkml/87led2wdj0.ffs@tglx/
[3] https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
[4] https://lore.kernel.org/lkml/[email protected]/


Ankur Arora (30):
preempt: introduce CONFIG_PREEMPT_AUTO
thread_info: selector for TIF_NEED_RESCHED[_LAZY]
thread_info: tif_need_resched() now takes resched_t as param
sched: make test_*_tsk_thread_flag() return bool
sched: *_tsk_need_resched() now takes resched_t as param
entry: handle lazy rescheduling at user-exit
entry/kvm: handle lazy rescheduling at guest-entry
entry: irqentry_exit only preempts for TIF_NEED_RESCHED
sched: __schedule_loop() doesn't need to check for need_resched_lazy()
sched: separate PREEMPT_DYNAMIC config logic
sched: runtime preemption config under PREEMPT_AUTO
rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO
rcu: fix header guard for rcu_all_qs()
preempt,rcu: warn on PREEMPT_RCU=n, preempt_model_full
rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y
rcu: force context-switch for PREEMPT_RCU=n, PREEMPT_COUNT=y
x86/thread_info: define TIF_NEED_RESCHED_LAZY
sched: prepare for lazy rescheduling in resched_curr()
sched: default preemption policy for PREEMPT_AUTO
sched: handle idle preemption for PREEMPT_AUTO
sched: schedule eagerly in resched_cpu()
sched/fair: refactor update_curr(), entity_tick()
sched/fair: handle tick expiry under lazy preemption
sched: support preempt=none under PREEMPT_AUTO
sched: support preempt=full under PREEMPT_AUTO
sched: handle preempt=voluntary under PREEMPT_AUTO
sched: latency warn for TIF_NEED_RESCHED_LAZY
tracing: support lazy resched
Documentation: tracing: add TIF_NEED_RESCHED_LAZY
osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y

.../admin-guide/kernel-parameters.txt | 1 +
Documentation/trace/ftrace.rst | 6 +-
arch/s390/include/asm/preempt.h | 4 +-
arch/s390/mm/pfault.c | 2 +-
arch/x86/Kconfig | 1 +
arch/x86/include/asm/thread_info.h | 10 +-
drivers/acpi/processor_idle.c | 2 +-
include/asm-generic/preempt.h | 4 +-
include/linux/entry-common.h | 2 +-
include/linux/entry-kvm.h | 2 +-
include/linux/preempt.h | 2 +-
include/linux/rcutree.h | 2 +-
include/linux/sched.h | 43 ++-
include/linux/sched/idle.h | 8 +-
include/linux/thread_info.h | 57 +++-
include/linux/trace_events.h | 6 +-
init/Makefile | 1 +
kernel/Kconfig.preempt | 37 ++-
kernel/entry/common.c | 12 +-
kernel/entry/kvm.c | 4 +-
kernel/rcu/Kconfig | 2 +-
kernel/rcu/tiny.c | 2 +-
kernel/rcu/tree.c | 17 +-
kernel/rcu/tree_exp.h | 4 +-
kernel/rcu/tree_plugin.h | 15 +-
kernel/rcu/tree_stall.h | 2 +-
kernel/sched/core.c | 311 ++++++++++++------
kernel/sched/deadline.c | 6 +-
kernel/sched/debug.c | 13 +-
kernel/sched/fair.c | 56 ++--
kernel/sched/idle.c | 4 +-
kernel/sched/rt.c | 6 +-
kernel/sched/sched.h | 27 +-
kernel/trace/trace.c | 4 +-
kernel/trace/trace_osnoise.c | 22 +-
kernel/trace/trace_output.c | 16 +-
36 files changed, 498 insertions(+), 215 deletions(-)

--
2.31.1



2024-02-13 06:01:31

by Ankur Arora

Subject: [PATCH 19/30] sched: default preemption policy for PREEMPT_AUTO

Add resched_opt_translate() which determines the particular
need-resched flag based on scheduling policy.

Preemption models other than PREEMPT_AUTO: continue to use
tif_resched(NR_now).

PREEMPT_AUTO: use tif_resched(NR_lazy) to mark for exit-to-user
by default.

Note that the target task might be running in userspace or in the
kernel. Allow both to run to timeslice completion.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 31 +++++++++++++++++++++++++------
kernel/sched/sched.h | 12 +++++++++++-
2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7248d1dbdc14..6596b5e0b6c8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1032,20 +1032,39 @@ void wake_up_q(struct wake_q_head *head)
}

/*
- * resched_curr - mark rq's current task 'to be rescheduled now'.
+ * For preemption models other than PREEMPT_AUTO: always schedule
+ * eagerly.
*
- * On UP this means the setting of the need_resched flag, on SMP it
- * might also involve a cross-CPU call to trigger the scheduler on
- * the target CPU.
+ * For PREEMPT_AUTO: allow everything, whether running in user or
+ * kernel context, to finish its time quanta, and mark for
+ * rescheduling at the next exit to user.
*/
-void resched_curr(struct rq *rq)
+static resched_t resched_opt_translate(struct task_struct *curr,
+ enum resched_opt opt)
+{
+ if (!IS_ENABLED(CONFIG_PREEMPT_AUTO))
+ return NR_now;
+
+ return NR_lazy;
+}
+
+/*
+ * __resched_curr - mark rq's current task 'to be rescheduled now'.
+ *
+ * On UP this means the setting of the appropriate need_resched flag.
+ * On SMP, in addition it might also involve a cross-CPU call to
+ * trigger the scheduler on the target CPU.
+ */
+void __resched_curr(struct rq *rq, enum resched_opt opt)
{
struct task_struct *curr = rq->curr;
- resched_t rs = NR_now;
+ resched_t rs;
int cpu;

lockdep_assert_rq_held(rq);

+ rs = resched_opt_translate(curr, opt);
+
/*
* TIF_NEED_RESCHED is the higher priority bit, so if it is already
* set, nothing more to be done. So, the only combinations we want to
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 34899d17553e..c3ae70ad23ec 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2462,7 +2462,17 @@ extern void init_sched_fair_class(void);

extern void reweight_task(struct task_struct *p, int prio);

-extern void resched_curr(struct rq *rq);
+enum resched_opt {
+ RESCHED_DEFAULT,
+};
+
+extern void __resched_curr(struct rq *rq, enum resched_opt opt);
+
+static inline void resched_curr(struct rq *rq)
+{
+ __resched_curr(rq, RESCHED_DEFAULT);
+}
+
extern void resched_cpu(int cpu);

extern struct rt_bandwidth def_rt_bandwidth;
--
2.31.1


2024-02-13 06:01:45

by Ankur Arora

Subject: [PATCH 15/30] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y

With PREEMPT_RCU=n, cond_resched() provides urgently needed quiescent
states for read-side critical sections via rcu_all_qs().
One reason why this was necessary: lacking preempt-count, the tick
handler has no way of knowing whether it is executing in a read-side
critical section or not.

With PREEMPT_AUTO=y, there can be configurations with (PREEMPT_COUNT=y,
PREEMPT_RCU=n). This means that cond_resched() is a stub which does
not provide for quiescent states via rcu_all_qs().

So, use the availability of preempt_count() to report quiescent states
in rcu_flavor_sched_clock_irq().

Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/rcu/tree_plugin.h | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 26c79246873a..9b72e9d2b6fe 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -963,13 +963,16 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
*/
static void rcu_flavor_sched_clock_irq(int user)
{
- if (user || rcu_is_cpu_rrupt_from_idle()) {
+ if (user || rcu_is_cpu_rrupt_from_idle() ||
+ (IS_ENABLED(CONFIG_PREEMPT_COUNT) &&
+ !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))) {

/*
* Get here if this CPU took its interrupt from user
- * mode or from the idle loop, and if this is not a
- * nested interrupt. In this case, the CPU is in
- * a quiescent state, so note it.
+ * mode, from the idle loop without this being a nested
+ * interrupt, or while not holding a preempt count.
+ * In this case, the CPU is in a quiescent state, so note
+ * it.
*
* No memory barrier is required here because rcu_qs()
* references only CPU-local variables that other CPUs
--
2.31.1


2024-02-13 06:02:03

by Ankur Arora

Subject: [PATCH 17/30] x86/thread_info: define TIF_NEED_RESCHED_LAZY

Define TIF_NEED_RESCHED_LAZY which, with TIF_NEED_RESCHED provides the
scheduler with two kinds of rescheduling intent: TIF_NEED_RESCHED,
for the usual rescheduling at the next safe preemption point;
TIF_NEED_RESCHED_LAZY expressing an intent to reschedule at some
time in the future while allowing the current task to run to
completion.

Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/thread_info.h | 10 ++++++++--
2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5edec175b9bf..ab58558068a4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -275,6 +275,7 @@ config X86
select HAVE_STATIC_CALL
select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL
select HAVE_PREEMPT_DYNAMIC_CALL
+ select HAVE_PREEMPT_AUTO
select HAVE_RSEQ
select HAVE_RUST if X86_64
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index d63b02940747..88c1802185fc 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -81,8 +81,11 @@ struct thread_info {
#define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
#define TIF_SIGPENDING 2 /* signal pending */
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
-#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
-#define TIF_SSBD 5 /* Speculative store bypass disable */
+#ifdef CONFIG_PREEMPT_AUTO
+#define TIF_NEED_RESCHED_LAZY 4 /* Lazy rescheduling */
+#endif
+#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/
+#define TIF_SSBD 6 /* Speculative store bypass disable */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
#define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
@@ -104,6 +107,9 @@ struct thread_info {
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
+#ifdef CONFIG_PREEMPT_AUTO
+#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
+#endif
#define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
#define _TIF_SSBD (1 << TIF_SSBD)
#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
--
2.31.1


2024-02-13 06:02:27

by Ankur Arora

Subject: [PATCH 09/30] sched: __schedule_loop() doesn't need to check for need_resched_lazy()

Various scheduling loops recheck need_resched() to avoid a missed
scheduling opportunity.

Explicitly note that we don't need to check for need_resched_lazy()
since that only needs to be handled at exit-to-user.

Also update the comment above __schedule() to describe
TIF_NEED_RESCHED_LAZY semantics.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 22 +++++++++++++++-------
1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 41c3bd49a700..8e492d20021c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6573,18 +6573,21 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
*
* 1. Explicit blocking: mutex, semaphore, waitqueue, etc.
*
- * 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
- * paths. For example, see arch/x86/entry_64.S.
+ * 2. TIF_NEED_RESCHED flag is checked on interrupt and TIF_NEED_RESCHED[_LAZY]
+ * flags on userspace return paths. For example, see kernel/entry/common.c
*
- * To drive preemption between tasks, the scheduler sets the flag in timer
- * interrupt handler scheduler_tick().
+ * To drive preemption between tasks, the scheduler sets one of the need-
+ * resched flags in the timer interrupt handler scheduler_tick():
+ * - !CONFIG_PREEMPT_AUTO: TIF_NEED_RESCHED.
+ * - CONFIG_PREEMPT_AUTO: TIF_NEED_RESCHED or TIF_NEED_RESCHED_LAZY
+ * depending on the preemption model.
*
* 3. Wakeups don't really cause entry into schedule(). They add a
* task to the run-queue and that's it.
*
* Now, if the new task added to the run-queue preempts the current
- * task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
- * called on the nearest possible occasion:
+ * task, then the wakeup sets TIF_NEED_RESCHED[_LAZY] and schedule()
+ * gets called on the nearest possible occasion:
*
* - If the kernel is preemptible (CONFIG_PREEMPTION=y):
*
@@ -6802,6 +6805,11 @@ static __always_inline void __schedule_loop(unsigned int sched_mode)
preempt_disable();
__schedule(sched_mode);
sched_preempt_enable_no_resched();
+
+ /*
+ * We don't check for need_resched_lazy() here, since it is
+ * always handled at exit-to-user.
+ */
} while (need_resched());
}

@@ -6907,7 +6915,7 @@ static void __sched notrace preempt_schedule_common(void)
preempt_enable_no_resched_notrace();

/*
- * Check again in case we missed a preemption opportunity
+ * Check again in case we missed an eager preemption opportunity
* between schedule and now.
*/
} while (need_resched());
--
2.31.1


2024-02-13 06:03:00

by Ankur Arora

Subject: [PATCH 22/30] sched/fair: refactor update_curr(), entity_tick()

When updating the task's runtime statistics via update_curr()
or entity_tick(), we call resched_curr() to reschedule if needed.

Refactor update_curr() and entity_tick() to only update the stats,
deferring any rescheduling needed to task_tick_fair() or
update_curr().

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/fair.c | 54 ++++++++++++++++++++++-----------------------
1 file changed, 27 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae9b237fa32b..278eebe6656a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -975,10 +975,10 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
* XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
* this is probably good enough.
*/
-static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
if ((s64)(se->vruntime - se->deadline) < 0)
- return;
+ return false;

/*
* For EEVDF the virtual time slope is determined by w_i (iow.
@@ -996,9 +996,11 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
* The task has consumed its request, reschedule.
*/
if (cfs_rq->nr_running > 1) {
- resched_curr(rq_of(cfs_rq));
clear_buddies(cfs_rq, se);
+ return true;
}
+
+ return false;
}

#include "pelt.h"
@@ -1153,26 +1155,35 @@ s64 update_curr_common(struct rq *rq)
/*
* Update the current task's runtime statistics.
*/
-static void update_curr(struct cfs_rq *cfs_rq)
+static bool __update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
s64 delta_exec;
+ bool resched;

if (unlikely(!curr))
- return;
+ return false;

delta_exec = update_curr_se(rq_of(cfs_rq), curr);
if (unlikely(delta_exec <= 0))
- return;
+ return false;

curr->vruntime += calc_delta_fair(delta_exec, curr);
- update_deadline(cfs_rq, curr);
+ resched = update_deadline(cfs_rq, curr);
update_min_vruntime(cfs_rq);

if (entity_is_task(curr))
update_curr_task(task_of(curr), delta_exec);

account_cfs_rq_runtime(cfs_rq, delta_exec);
+
+ return resched;
+}
+
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+ if (__update_curr(cfs_rq))
+ resched_curr(rq_of(cfs_rq));
}

static void update_curr_fair(struct rq *rq)
@@ -5487,13 +5498,13 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
cfs_rq->curr = NULL;
}

-static void
-entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
+static bool
+entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
/*
* Update run-time statistics of the 'current'.
*/
- update_curr(cfs_rq);
+ bool resched = __update_curr(cfs_rq);

/*
* Ensure that runnable average is periodically updated.
@@ -5501,22 +5512,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
update_load_avg(cfs_rq, curr, UPDATE_TG);
update_cfs_group(curr);

-#ifdef CONFIG_SCHED_HRTICK
- /*
- * queued ticks are scheduled to match the slice, so don't bother
- * validating it and just reschedule.
- */
- if (queued) {
- resched_curr(rq_of(cfs_rq));
- return;
- }
- /*
- * don't let the period tick interfere with the hrtick preemption
- */
- if (!sched_feat(DOUBLE_TICK) &&
- hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
- return;
-#endif
+ return resched;
}


@@ -12617,12 +12613,16 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &curr->se;
+ bool resched = false;

for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
- entity_tick(cfs_rq, se, queued);
+ resched |= entity_tick(cfs_rq, se);
}

+ if (resched)
+ resched_curr(rq);
+
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);

--
2.31.1


2024-02-13 06:03:29

by Ankur Arora

Subject: [PATCH 24/30] sched: support preempt=none under PREEMPT_AUTO

The default preemption policy for the no forced preemption model under
PREEMPT_AUTO is to always schedule lazily for well-behaved, non-idle
tasks, preempting at exit-to-user.

We already have that, so enable it.

Comparing a scheduling/IPC workload:

# perf stat -a -e cs --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000

PREEMPT_AUTO, preempt=none

3,074,466 context-switches ( +- 0.34% )
3.66437 +- 0.00494 seconds time elapsed ( +- 0.13% )

PREEMPT_DYNAMIC, preempt=none

2,954,976 context-switches ( +- 0.70% )
3.62855 +- 0.00708 seconds time elapsed ( +- 0.20% )

Both perform similarly, but we incur a slightly higher number of
context-switches with PREEMPT_AUTO.

Drilling down we see that both voluntary and involuntary
context-switches are higher for this test:

PREEMPT_AUTO, preempt=none

2115660.30 +- 20442.34 voluntary context-switches ( +- 0.960% )
784690.40 +- 19629.42 involuntary context-switches ( +- 2.500% )

PREEMPT_DYNAMIC, preempt=none

2049027.10 +- 35237.10 voluntary context-switches ( +- 1.710% )
740676.90 +- 20346.45 involuntary context-switches ( +- 2.740% )

Assuming voluntary context-switches due to explicit blocking are
similar, we expect that PREEMPT_AUTO will incur more context-switches
at exit-to-user (counted as voluntary) since that is its default
rescheduling point.

Involuntary context-switches under PREEMPT_AUTO are seen when a
task has exceeded its time quanta by a tick. Under PREEMPT_DYNAMIC,
these are incurred when a task needs to be rescheduled and then
encounters a cond_resched().
So, these two numbers aren't directly comparable.

Comparing a kernbench workload:

# Half load (-j 32)

PREEMPT_AUTO PREEMPT_DYNAMIC

wall 74.41 +- 0.45 ( +- 0.60% ) 74.20 +- 0.33 sec ( +- 0.45% )
utime 1419.78 +- 2.04 ( +- 0.14% ) 1416.40 +- 6.07 sec ( +- 0.42% )
stime 247.70 +- 0.88 ( +- 0.35% ) 246.23 +- 1.20 sec ( +- 0.49% )
%cpu 2240.20 +- 16.03 ( +- 0.71% ) 2240.20 +- 19.34 ( +- 0.86% )
inv-csw 13056.00 +- 427.58 ( +- 3.27% ) 18750.60 +- 771.21 ( +- 4.11% )
vol-csw 191000.00 +- 1623.25 ( +- 0.84% ) 182857.00 +- 2373.12 ( +- 1.29% )

The runtimes are basically identical for both of these. Voluntary
context-switches, as above (and in the optimal and maximal runs below),
are higher, which, as mentioned above, does add up.

However, unlike the sched-messaging workload, the involuntary
context-switches are generally lower (also true for the optimal and
maximal runs below.) One reason for that might be that kbuild spends
~20% of its time executing in the kernel, while sched-messaging spends
~95% of its time in the kernel, which means a greater likelihood of
being preempted for exceeding its time quanta.

# Optimal load (-j 256)

PREEMPT_AUTO PREEMPT_DYNAMIC

wall 65.15 +- 0.08 ( +- 0.12% ) 65.10 +- 0.19 ( +- 0.29% )
utime 1876.56 +- 477.03 ( +- 25.42% ) 1873.63 +- 481.98 ( +- 25.72% )
stime 295.77 +- 49.17 ( +- 16.62% ) 294.41 +- 50.79 ( +- 17.25% )
%cpu 3179.30 +- 970.30 ( +- 30.51% ) 3172.90 +- 983.26 ( +- 30.98% )
inv-csw 369670.00 +- 375980.00 ( +- 101.70% ) 390848.00 +- 392231.00 ( +- 100.35% )
vol-csw 216544.00 +- 28604.60 ( +- 13.20% ) 205117.00 +- 23949.50 ( +- 11.67% )

# Maximal load (-j 0)

PREEMPT_AUTO PREEMPT_DYNAMIC

wall 66.02 +- 0.53 ( +- 0.80% ) 65.67 +- 0.55 ( +- 0.83% )
utime 2024.79 +- 439.74 ( +- 21.71% ) 2026.12 +- 446.28 ( +- 22.02% )
stime 312.13 +- 46.14 ( +- 14.78% ) 311.53 +- 47.84 ( +- 15.35% )
%cpu 3465.40 +- 883.75 ( +- 25.50% ) 3473.80 +- 903.27 ( +- 26.00% )
inv-csw 471639.00 +- 336424.00 ( +- 71.33% ) 500981.00 +- 353471.00 ( +- 70.55% )
vol-csw 190138.00 +- 44947.20 ( +- 23.63% ) 177813.00 +- 44345.50 ( +- 24.93% )

Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5df59a4548dc..2d33f3ff51a3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8968,7 +8968,9 @@ static void __sched_dynamic_update(int mode)
{
switch (mode) {
case preempt_dynamic_none:
- preempt_dynamic_mode = preempt_dynamic_undefined;
+ if (mode != preempt_dynamic_mode)
+ pr_info("%s: none\n", PREEMPT_MODE);
+ preempt_dynamic_mode = mode;
break;

case preempt_dynamic_voluntary:
--
2.31.1


2024-02-13 06:03:45

by Ankur Arora

Subject: [PATCH 21/30] sched: schedule eagerly in resched_cpu()

resched_cpu() is used as an RCU hammer of last resort. Force
rescheduling eagerly with tif_resched(NR_now).

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 15 +++++++++++----
kernel/sched/sched.h | 1 +
2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e3dd95efb73..de963e8e2bee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1035,9 +1035,9 @@ void wake_up_q(struct wake_q_head *head)
* For preemption models other than PREEMPT_AUTO: always schedule
* eagerly.
*
- * For PREEMPT_AUTO: allow everything, whether running in user or
- * kernel context, to finish its time quanta, and mark for
- * rescheduling at the next exit to user.
+ * For PREEMPT_AUTO: schedule idle threads eagerly, allow everything
+ * else, whether running in user or kernel context, to finish its time
+ * quanta, and mark for rescheduling at the next exit to user.
*/
static resched_t resched_opt_translate(struct task_struct *curr,
enum resched_opt opt)
@@ -1045,6 +1045,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
if (!IS_ENABLED(CONFIG_PREEMPT_AUTO))
return NR_now;

+ if (opt == RESCHED_FORCE)
+ return NR_now;
+
if (is_idle_task(curr))
return NR_now;

@@ -1106,7 +1109,11 @@ void resched_cpu(int cpu)

raw_spin_rq_lock_irqsave(rq, flags);
if (cpu_online(cpu) || cpu == smp_processor_id())
- resched_curr(rq);
+ /*
+ * resched_cpu() is typically used as an RCU hammer.
+ * Mark for imminent resched.
+ */
+ __resched_curr(rq, RESCHED_FORCE);
raw_spin_rq_unlock_irqrestore(rq, flags);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c3ae70ad23ec..3600a8673d08 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2464,6 +2464,7 @@ extern void reweight_task(struct task_struct *p, int prio);

enum resched_opt {
RESCHED_DEFAULT,
+ RESCHED_FORCE,
};

extern void __resched_curr(struct rq *rq, enum resched_opt opt);
--
2.31.1


2024-02-13 06:03:53

by Ankur Arora

Subject: [PATCH 01/30] preempt: introduce CONFIG_PREEMPT_AUTO

PREEMPT_AUTO adds a new scheduling model which, like PREEMPT_DYNAMIC,
allows dynamic switching between the none/voluntary/full preemption
models. However, unlike PREEMPT_DYNAMIC, it doesn't use explicit
preemption points for the voluntary models.

It works by depending on CONFIG_PREEMPTION (and thus PREEMPT_COUNT),
allowing the scheduler to always know when it is safe to preempt
for all three preemption models.

In addition, it adds a second need-resched bit
(TIF_NEED_RESCHED_LAZY) which, together with TIF_NEED_RESCHED, allows
the scheduler to express two kinds of rescheduling intent: schedule at
the earliest opportunity (the usual TIF_NEED_RESCHED semantics), or
express a need for rescheduling while allowing the task on the
runqueue to run to timeslice completion (TIF_NEED_RESCHED_LAZY).

Based on the preemption model in use, the scheduler chooses
need-resched in the following manner:

                  TIF_NEED_RESCHED       TIF_NEED_RESCHED_LAZY

   none           never                  always [*]
   voluntary      higher sched class     other tasks [*]
   full           always                 never

[*] when preempting idle, or for kernel tasks that are 'urgent' in
some way (e.g. resched_cpu() used as an RCU hammer), we use
TIF_NEED_RESCHED.

As mentioned above, the other part is when preemption happens -- when
are the need-resched flags checked:

                        exit-to-user    ret-to-kernel    preempt_count()
   NEED_RESCHED_LAZY         Y                N                 N
   NEED_RESCHED              Y                Y                 Y

Exposed under CONFIG_EXPERT for now.

Cc: Peter Zijlstra <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 1 +
include/linux/thread_info.h | 8 ++++
init/Makefile | 1 +
kernel/Kconfig.preempt | 37 +++++++++++++++++--
4 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 31b3a25680d0..5d2bd21f98e1 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4662,6 +4662,7 @@

preempt= [KNL]
Select preemption mode if you have CONFIG_PREEMPT_DYNAMIC
+ or CONFIG_PREEMPT_AUTO.
none - Limited to cond_resched() calls
voluntary - Limited to cond_resched() and might_sleep() calls
full - Any section that isn't explicitly preempt disabled
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 9ea0b28068f4..7b1d9185aac6 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -59,6 +59,14 @@ enum syscall_work_bit {

#include <asm/thread_info.h>

+/*
+ * Fall back to the default behaviour if we don't have CONFIG_PREEMPT_AUTO.
+ */
+#ifndef CONFIG_PREEMPT_AUTO
+#define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED
+#define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
+#endif
+
#ifdef __KERNEL__

#ifndef arch_set_restart_data
diff --git a/init/Makefile b/init/Makefile
index cbac576c57d6..da1dba3116dc 100644
--- a/init/Makefile
+++ b/init/Makefile
@@ -27,6 +27,7 @@ smp-flag-$(CONFIG_SMP) := SMP
preempt-flag-$(CONFIG_PREEMPT_BUILD) := PREEMPT
preempt-flag-$(CONFIG_PREEMPT_DYNAMIC) := PREEMPT_DYNAMIC
preempt-flag-$(CONFIG_PREEMPT_RT) := PREEMPT_RT
+preempt-flag-$(CONFIG_PREEMPT_AUTO) := PREEMPT_AUTO

build-version = $(or $(KBUILD_BUILD_VERSION), $(build-version-auto))
build-timestamp = $(or $(KBUILD_BUILD_TIMESTAMP), $(build-timestamp-auto))
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index c2f1fd95a821..fe83040ad755 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -11,13 +11,17 @@ config PREEMPT_BUILD
select PREEMPTION
select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK

+config HAVE_PREEMPT_AUTO
+ bool
+
choice
prompt "Preemption Model"
default PREEMPT_NONE

config PREEMPT_NONE
bool "No Forced Preemption (Server)"
- select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
+ select PREEMPT_NONE_BUILD if (!PREEMPT_DYNAMIC && !PREEMPT_AUTO)
+
help
This is the traditional Linux preemption model, geared towards
throughput. It will still provide good latencies most of the
@@ -32,7 +36,7 @@ config PREEMPT_NONE
config PREEMPT_VOLUNTARY
bool "Voluntary Kernel Preemption (Desktop)"
depends on !ARCH_NO_PREEMPT
- select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC
+ select PREEMPT_VOLUNTARY_BUILD if (!PREEMPT_DYNAMIC && !PREEMPT_AUTO)
help
This option reduces the latency of the kernel by adding more
"explicit preemption points" to the kernel code. These new
@@ -95,7 +99,7 @@ config PREEMPTION

config PREEMPT_DYNAMIC
bool "Preemption behaviour defined on boot"
- depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT
+ depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO
select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY
select PREEMPT_BUILD
default y if HAVE_PREEMPT_DYNAMIC_CALL
@@ -115,6 +119,33 @@ config PREEMPT_DYNAMIC
Interesting if you want the same pre-built kernel should be used for
both Server and Desktop workloads.

+config PREEMPT_AUTO
+ bool "Scheduler controlled preemption model"
+ depends on EXPERT && HAVE_PREEMPT_AUTO && !ARCH_NO_PREEMPT
+ select PREEMPT_BUILD
+ help
+ This option allows to define the preemption model on the kernel
+ command line parameter and thus override the default preemption
+ model selected during compile time.
+
+ However, note that the compile time choice of preemption model
+ might impact other kernel options like the specific RCU model.
+
+ This feature makes the latency of the kernel configurable by
+ allowing the scheduler to choose when to preempt based on
+ the preemption policy in effect. It does this without needing
+ voluntary preemption points.
+
+ With PREEMPT_NONE: the scheduler allows a task (executing in
+ user or kernel context) to run to completion, at least until
+ its current tick expires.
+
+ With PREEMPT_VOLUNTARY: similar to PREEMPT_NONE, but the scheduler
+ will also preempt for higher priority class of processes but not
+ lower.
+
+ With PREEMPT: the scheduler preempts at the earliest opportunity.
+
config SCHED_CORE
bool "Core Scheduling for SMT"
depends on SCHED_SMT
--
2.31.1


2024-02-13 06:03:57

by Ankur Arora

Subject: [PATCH 23/30] sched/fair: handle tick expiry under lazy preemption

The default policy for lazy scheduling is to schedule in exit-to-user,
assuming that would happen within the remaining time quanta of the
task.

However, that runs into the 'hog' problem -- the target task might
be running in the kernel and might not relinquish CPU on its own.

Handle that by upgrading the ignored tif_resched(NR_lazy) bit to
tif_resched(NR_now) at the next tick.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>

---
Note:
Instead of special casing the tick, it might be simpler to always
do the upgrade on the second resched_curr().

The theoretical problem with doing that is that the current
approach deterministically provides a well-defined extra unit of
time. Going with a second resched_curr() might mean that the
amount of extra time the task gets depends on the vagaries of
the incoming resched_curr() (which is fine if it's mostly from
the tick; not fine if we could get it due to other reasons.)

Practically, both performed equally well in my tests.

Thoughts?

kernel/sched/core.c | 8 ++++++++
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 2 +-
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 6 ++++++
5 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index de963e8e2bee..5df59a4548dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1038,6 +1038,10 @@ void wake_up_q(struct wake_q_head *head)
* For PREEMPT_AUTO: schedule idle threads eagerly, allow everything
* else, whether running in user or kernel context, to finish its time
* quanta, and mark for rescheduling at the next exit to user.
+ *
+ * Note: to avoid the hog problem, where the user does not relinquish
+ * CPU even after its time quanta has expired, upgrade lazy to eager
+ * on the second tick.
*/
static resched_t resched_opt_translate(struct task_struct *curr,
enum resched_opt opt)
@@ -1051,6 +1055,10 @@ static resched_t resched_opt_translate(struct task_struct *curr,
if (is_idle_task(curr))
return NR_now;

+ if (opt == RESCHED_TICK &&
+ unlikely(test_tsk_need_resched(curr, NR_lazy)))
+ return NR_now;
+
return NR_lazy;
}

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b4e68cfc3c3a..b935e634fbf8 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1379,7 +1379,7 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
}

if (!is_leftmost(dl_se, &rq->dl))
- resched_curr(rq);
+ resched_curr_tick(rq);
}

/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 278eebe6656a..92910b721adb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12621,7 +12621,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
}

if (resched)
- resched_curr(rq);
+ resched_curr_tick(rq);

if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index c57cc8427a57..1a2f3524d0eb 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1023,7 +1023,7 @@ static void update_curr_rt(struct rq *rq)
rt_rq->rt_time += delta_exec;
exceeded = sched_rt_runtime_exceeded(rt_rq);
if (exceeded)
- resched_curr(rq);
+ resched_curr_tick(rq);
raw_spin_unlock(&rt_rq->rt_runtime_lock);
if (exceeded)
do_start_rt_bandwidth(sched_rt_bandwidth(rt_rq));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3600a8673d08..c7e7acab1022 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2465,6 +2465,7 @@ extern void reweight_task(struct task_struct *p, int prio);
enum resched_opt {
RESCHED_DEFAULT,
RESCHED_FORCE,
+ RESCHED_TICK,
};

extern void __resched_curr(struct rq *rq, enum resched_opt opt);
@@ -2474,6 +2475,11 @@ static inline void resched_curr(struct rq *rq)
__resched_curr(rq, RESCHED_DEFAULT);
}

+static inline void resched_curr_tick(struct rq *rq)
+{
+ __resched_curr(rq, RESCHED_TICK);
+}
+
extern void resched_cpu(int cpu);

extern struct rt_bandwidth def_rt_bandwidth;
--
2.31.1


2024-02-13 06:04:35

by Ankur Arora

Subject: [PATCH 25/30] sched: support preempt=full under PREEMPT_AUTO

The default preemption policy for preempt-full under PREEMPT_AUTO is
to minimize latency, and thus to always schedule eagerly. This is
identical to CONFIG_PREEMPT, and so should result in similar
performance.

Comparing scheduling/IPC workload:

# perf stat -a -e cs --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000

PREEMPT_AUTO, preempt=full

3,080,508 context-switches ( +- 0.64% )
3.65171 +- 0.00654 seconds time elapsed ( +- 0.18% )

PREEMPT_DYNAMIC, preempt=full

3,087,527 context-switches ( +- 0.33% )
3.60163 +- 0.00633 seconds time elapsed ( +- 0.18% )

Looking at the breakdown between voluntary and involuntary
context-switches, we see almost identical behaviour as well.

PREEMPT_AUTO, preempt=full

2087910.00 +- 34720.95 voluntary context-switches ( +- 1.660% )
784437.60 +- 19827.79 involuntary context-switches ( +- 2.520% )

PREEMPT_DYNAMIC, preempt=full

2102879.60 +- 22767.11 voluntary context-switches ( +- 1.080% )
801189.90 +- 21324.18 involuntary context-switches ( +- 2.660% )

Comparing kernbench half load (-j 32), we see that both voluntary
and involuntary context-switches, and their stdev, are fairly similar.
So are the percentage of CPU taken and the various process times.

# Half load (-j 32)
PREEMPT_AUTO PREEMPT_DYNAMIC

wall 74.45 +- 0.39 sec ( +- 0.53% ) 74.08 +- 0.20 sec ( +- 0.27% )
utime 1419.68 +- 5.12 sec ( +- 0.36% ) 1419.76 +- 3.63 sec ( +- 0.25% )
stime 250.56 +- 1.08 sec ( +- 0.43% ) 248.94 +- 0.80 sec ( +- 0.32% )
%cpu 2243.20 +- 19.57 ( +- 0.87% ) 2251.80 +- 11.12 ( +- 0.49% )
inv-csw 20286.60 +- 547.48 ( +- 2.69% ) 20175.60 +- 214.20 ( +- 1.06% )
vol-csw 187688.00 +- 5097.26 ( +- 2.71% ) 182914.00 +- 2525.59 ( +- 1.38% )

Same for kernbench optimal and maximal loads.

# Optimal load (-j 256)

PREEMPT_AUTO PREEMPT_DYNAMIC

wall 65.10 +- 0.09 sec ( +- 0.14% ) 65.11 +- 0.27 sec ( +- 0.42% )
utime 1875.03 +- 479.98 sec ( +- 25.59% ) 1874.55 +- 479.39 sec ( +- 25.57% )
stime 297.70 +- 49.68 sec ( +- 16.69% ) 297.04 +- 50.69 sec ( +- 17.06% )
%cpu 3175.60 +- 982.93 ( +- 30.95% ) 3179.40 +- 977.87 ( +- 30.75% )
inv-csw 391147.00 +- 390941.00 ( +- 99.94% ) 392298.00 +- 392268.00 ( +- 99.99% )
vol-csw 212039.00 +- 26419.90 ( +- 12.45% ) 211349.00 +- 30227.30 ( +- 14.30% )

# Maximal load (-j 256)

PREEMPT_AUTO PREEMPT_DYNAMIC

wall 66.55 +- 0.34 sec ( +- 0.51% ) 66.41 +- 0.72 sec ( +- 1.09% )
utime 2028.83 +- 445.86 sec ( +- 21.97% ) 2027.59 +- 444.89 sec ( +- 21.94% )
stime 316.16 +- 48.29 sec ( +- 15.27% ) 313.97 +- 47.61 sec ( +- 15.16% )
%cpu 3463.93 +- 894.12 ( +- 25.81% ) 3465.33 +- 889.04 ( +- 25.65% )
inv-csw 491115.00 +- 345936.00 ( +- 70.43% ) 492028.00 +- 346745.00 ( +- 70.47% )
vol-csw 200509.00 +- 32922.60 ( +- 16.41% ) 187447.00 +- 42567.20 ( +- 22.70% )

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d33f3ff51a3..aaa87d5fecdd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1035,8 +1035,9 @@ void wake_up_q(struct wake_q_head *head)
* For preemption models other than PREEMPT_AUTO: always schedule
* eagerly.
*
- * For PREEMPT_AUTO: schedule idle threads eagerly, allow everything
- * else, whether running in user or kernel context, to finish its time
+ * For PREEMPT_AUTO: schedule idle threads eagerly, and under full
+ * preemption all tasks eagerly. Otherwise, allow everything else,
+ * whether running in user or kernel context, to finish its time
* quanta, and mark for rescheduling at the next exit to user.
*
* Note: to avoid the hog problem, where the user does not relinquish
@@ -1052,6 +1053,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
if (opt == RESCHED_FORCE)
return NR_now;

+ if (preempt_model_preemptible())
+ return NR_now;
+
if (is_idle_task(curr))
return NR_now;

@@ -8982,7 +8986,9 @@ static void __sched_dynamic_update(int mode)
pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
PREEMPT_MODE);

- preempt_dynamic_mode = preempt_dynamic_undefined;
+ if (mode != preempt_dynamic_mode)
+ pr_info("%s: full\n", PREEMPT_MODE);
+ preempt_dynamic_mode = mode;
break;
}
}
--
2.31.1


2024-02-13 06:04:51

by Ankur Arora

Subject: [PATCH 27/30] sched: latency warn for TIF_NEED_RESCHED_LAZY

resched_latency_warn() now also warns if TIF_NEED_RESCHED_LAZY
is set without rescheduling for more than the latency_warn_ms period.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 2 +-
kernel/sched/debug.c | 7 +++++--
2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aa31f23f43f4..cce84ba0417f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5670,7 +5670,7 @@ static u64 cpu_resched_latency(struct rq *rq)
if (sysctl_resched_latency_warn_once && warned_once)
return 0;

- if (!need_resched() || !latency_warn_ms)
+ if ((!need_resched() && !need_resched_lazy()) || !latency_warn_ms)
return 0;

if (system_state == SYSTEM_BOOTING)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e53f1b73bf4a..869a09781f2e 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1114,9 +1114,12 @@ void proc_sched_set_task(struct task_struct *p)
void resched_latency_warn(int cpu, u64 latency)
{
static DEFINE_RATELIMIT_STATE(latency_check_ratelimit, 60 * 60 * HZ, 1);
+ char *nr;
+
+ nr = tif_need_resched(NR_now) ? "need_resched" : "need_resched_lazy";

WARN(__ratelimit(&latency_check_ratelimit),
- "sched: CPU %d need_resched set for > %llu ns (%d ticks) "
+ "sched: CPU %d %s set for > %llu ns (%d ticks) "
"without schedule\n",
- cpu, latency, cpu_rq(cpu)->ticks_without_resched);
+ cpu, nr, latency, cpu_rq(cpu)->ticks_without_resched);
}
--
2.31.1


2024-02-13 06:05:33

by Ankur Arora

Subject: [PATCH 30/30] osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y

To reduce RCU noise for nohz_full configurations, osnoise depends
on cond_resched() providing quiescent states for PREEMPT_RCU=n
configurations. For PREEMPT_RCU=y configurations, it does this by
directly calling rcu_momentary_dyntick_idle().

With PREEMPT_AUTO=y, however, we can have configurations with
(PREEMPTION=y, PREEMPT_RCU=n), which means neither of the above can
help.

Handle that by falling back to the explicit quiescent states via
rcu_momentary_dyntick_idle().

Cc: Paul E. McKenney <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Steven Rostedt <[email protected]>
Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/trace/trace_osnoise.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index bd0d01d00fb9..8f9f654594ed 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -1532,18 +1532,20 @@ static int run_osnoise(void)
/*
* In some cases, notably when running on a nohz_full CPU with
* a stopped tick PREEMPT_RCU has no way to account for QSs.
- * This will eventually cause unwarranted noise as PREEMPT_RCU
- * will force preemption as the means of ending the current
- * grace period. We avoid this problem by calling
- * rcu_momentary_dyntick_idle(), which performs a zero duration
- * EQS allowing PREEMPT_RCU to end the current grace period.
- * This call shouldn't be wrapped inside an RCU critical
- * section.
+ * This will eventually cause unwarranted noise as RCU forces
+ * preemption as the means of ending the current grace period.
+ * We avoid this by calling rcu_momentary_dyntick_idle(),
+ * which performs a zero duration EQS allowing RCU to end the
+ * current grace period. This call shouldn't be wrapped inside
+ * an RCU critical section.
*
- * Note that in non PREEMPT_RCU kernels QSs are handled through
- * cond_resched()
+ * For non-PREEMPT_RCU kernels with cond_resched() (non-
+ * PREEMPT_AUTO configurations), QSs are handled through
+ * cond_resched(). For PREEMPT_AUTO kernels, we fallback to the
+ * zero duration QS via rcu_momentary_dyntick_idle().
*/
- if (IS_ENABLED(CONFIG_PREEMPT_RCU)) {
+ if (IS_ENABLED(CONFIG_PREEMPT_RCU) ||
+ (!IS_ENABLED(CONFIG_PREEMPT_RCU) && IS_ENABLED(CONFIG_PREEMPTION))) {
if (!disable_irq)
local_irq_disable();

--
2.31.1


2024-02-13 06:05:35

by Ankur Arora

Subject: [PATCH 28/30] tracing: support lazy resched

Add tracing support for TIF_NEED_RESCHED_LAZY.

trace_entry::flags is full, so reuse the TRACE_FLAG_IRQS_NOSUPPORT
bit for this. The flag is safe to reuse since it is only used in
old archs that don't support lockdep irq tracing.

Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
include/linux/trace_events.h | 6 +++---
kernel/trace/trace.c | 2 ++
kernel/trace/trace_output.c | 16 ++++++++++++++--
3 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index d68ff9b1247f..0c17e9de1434 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -178,7 +178,7 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status);

enum trace_flag_type {
TRACE_FLAG_IRQS_OFF = 0x01,
- TRACE_FLAG_IRQS_NOSUPPORT = 0x02,
+ TRACE_FLAG_NEED_RESCHED_LAZY = 0x02,
TRACE_FLAG_NEED_RESCHED = 0x04,
TRACE_FLAG_HARDIRQ = 0x08,
TRACE_FLAG_SOFTIRQ = 0x10,
@@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_ctx(void)

static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
{
- return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
+ return tracing_gen_ctx_irq_test(0);
}
static inline unsigned int tracing_gen_ctx(void)
{
- return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
+ return tracing_gen_ctx_irq_test(0);
}
#endif

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 0a9642fba852..8fb3a79771df 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2703,6 +2703,8 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status)

if (tif_need_resched(NR_now))
trace_flags |= TRACE_FLAG_NEED_RESCHED;
+ if (tif_need_resched(NR_lazy))
+ trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
if (test_preempt_need_resched())
trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 3e7fa44dc2b2..5e120c2404cf 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
(entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
bh_off ? 'b' :
- (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
+ !IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
'.';

- switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
+ switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
TRACE_FLAG_PREEMPT_RESCHED)) {
+ case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
+ need_resched = 'B';
+ break;
case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
need_resched = 'N';
break;
+ case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
+ need_resched = 'L';
+ break;
+ case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
+ need_resched = 'b';
+ break;
case TRACE_FLAG_NEED_RESCHED:
need_resched = 'n';
break;
+ case TRACE_FLAG_NEED_RESCHED_LAZY:
+ need_resched = 'l';
+ break;
case TRACE_FLAG_PREEMPT_RESCHED:
need_resched = 'p';
break;
--
2.31.1


2024-02-13 06:05:39

by Ankur Arora

Subject: [PATCH 29/30] Documentation: tracing: add TIF_NEED_RESCHED_LAZY

Document various combinations of resched flags.

Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
Documentation/trace/ftrace.rst | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 7e7b8ec17934..7f20c0bae009 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -1036,8 +1036,12 @@ explains which is which.
be printed here.

need-resched:
- - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
+ - 'B' all three, TIF_NEED_RESCHED, TIF_NEED_RESCHED_LAZY and PREEMPT_NEED_RESCHED are set,
+ - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED are set,
+ - 'L' both TIF_NEED_RESCHED_LAZY and PREEMPT_NEED_RESCHED are set,
+ - 'b' both TIF_NEED_RESCHED and TIF_NEED_RESCHED_LAZY are set,
- 'n' only TIF_NEED_RESCHED is set,
+ - 'l' only TIF_NEED_RESCHED_LAZY is set,
- 'p' only PREEMPT_NEED_RESCHED is set,
- '.' otherwise.

--
2.31.1


2024-02-13 06:10:19

by Ankur Arora

[permalink] [raw]
Subject: [PATCH 16/30] rcu: force context-switch for PREEMPT_RCU=n, PREEMPT_COUNT=y

With (PREEMPT_RCU=n, PREEMPT_COUNT=y), rcu_flavor_sched_clock_irq()
registers urgently needed quiescent states when preempt_count() is
available and no task or softirq is in a non-preemptible section.

This, however, does nothing for long-running loops where preemption
is only temporarily enabled, since the tick is unlikely to neatly fall
in the preemptible() section.

Handle that by forcing a context-switch when we require a quiescent
state urgently but are holding a preempt_count().
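
To illustrate the problem case, consider a hypothetical kernel loop (not
code from this series) that re-enables preemption only briefly on each
iteration:

        /* Hypothetical example: nearly all of the time is spent with preemption off. */
        static void process_items(struct item *items, int n)
        {
                int i;

                for (i = 0; i < n; i++) {
                        preempt_disable();
                        handle_item(&items[i]);  /* bulk of the work */
                        preempt_enable();        /* preemptible only for an instant */
                }
        }

The scheduler-clock interrupt will almost always observe a non-zero
preempt_count() here, so rcu_flavor_sched_clock_irq() rarely gets to report
the quiescent state; forcing a context switch instead closes that gap.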

Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/rcu/tree.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d6ac2b703a6d..5f61e7e0f16c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2248,8 +2248,17 @@ void rcu_sched_clock_irq(int user)
raw_cpu_inc(rcu_data.ticks_this_gp);
/* The load-acquire pairs with the store-release setting to true. */
if (smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) {
- /* Idle and userspace execution already are quiescent states. */
- if (!rcu_is_cpu_rrupt_from_idle() && !user) {
+ /*
+ * Idle and userspace execution already are quiescent states.
+ * If, however, we came here from a nested interrupt in the
+ * kernel, or if we have PREEMPT_RCU=n but are holding a
+ * preempt_count() (say, with CONFIG_PREEMPT_AUTO=y), then
+ * force a context switch.
+ */
+ if ((!rcu_is_cpu_rrupt_from_idle() && !user) ||
+ ((!IS_ENABLED(CONFIG_PREEMPT_RCU) &&
+ IS_ENABLED(CONFIG_PREEMPT_COUNT)) &&
+ (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))) {
set_tsk_need_resched(current, NR_now);
set_preempt_need_resched();
}
--
2.31.1


2024-02-13 06:10:24

by Ankur Arora

[permalink] [raw]
Subject: [PATCH 18/30] sched: prepare for lazy rescheduling in resched_curr()

Handle NR_lazy in resched_curr(), by registering an intent to
reschedule at exit-to-user.
Given that the rescheduling is not imminent, skip the preempt folding
and the resched IPI.

Also, update set_nr_and_not_polling() to handle NR_lazy. Note that
there are no changes to set_nr_if_polling(), since lazy rescheduling
is not meaningful for idle.

And finally, now that there are two need-resched bits, enforce a
priority order while setting them.
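
The intended precedence, summarized as a sketch (this restates the comment
added to resched_curr() below; it is not additional code):

        /*
         * current bits         requested       result
         * ------------------------------------------------------------
         * (none)               NR_lazy         TIF_NEED_RESCHED_LAZY
         * (none)               NR_now          TIF_NEED_RESCHED
         * NR_lazy              NR_now          both bits set
         * NR_now               either          unchanged (already strongest)
         */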


Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 41 +++++++++++++++++++++++++++++------------
1 file changed, 29 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 425e4f03e0af..7248d1dbdc14 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -899,14 +899,14 @@ static inline void hrtick_rq_init(struct rq *rq)

#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
/*
- * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
+ * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
* this avoids any races wrt polling state changes and thereby avoids
* spurious IPIs.
*/
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, resched_t rs)
{
struct thread_info *ti = task_thread_info(p);
- return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
+ return !(fetch_or(&ti->flags, _tif_resched(rs)) & _TIF_POLLING_NRFLAG);
}

/*
@@ -931,9 +931,9 @@ static bool set_nr_if_polling(struct task_struct *p)
}

#else
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, resched_t rs)
{
- set_tsk_need_resched(p, NR_now);
+ set_tsk_need_resched(p, rs);
return true;
}

@@ -1041,25 +1041,40 @@ void wake_up_q(struct wake_q_head *head)
void resched_curr(struct rq *rq)
{
struct task_struct *curr = rq->curr;
+ resched_t rs = NR_now;
int cpu;

lockdep_assert_rq_held(rq);

- if (test_tsk_need_resched(curr, NR_now))
+ /*
+ * TIF_NEED_RESCHED is the higher priority bit, so if it is already
+ * set, nothing more to be done. So, the only combinations we want to
+ * let in are:
+ *
+ * - . + (NR_now | NR_lazy)
+ * - NR_lazy + NR_now
+ *
+ * In the second case both flags would be set simultaneously.
+ */
+ if (test_tsk_need_resched(curr, NR_now) ||
+ (rs == NR_lazy && test_tsk_need_resched(curr, NR_lazy)))
return;

cpu = cpu_of(rq);

if (cpu == smp_processor_id()) {
- set_tsk_need_resched(curr, NR_now);
- set_preempt_need_resched();
+ set_tsk_need_resched(curr, rs);
+ if (rs == NR_now)
+ set_preempt_need_resched();
return;
}

- if (set_nr_and_not_polling(curr))
- smp_send_reschedule(cpu);
- else
+ if (set_nr_and_not_polling(curr, rs)) {
+ if (rs == NR_now)
+ smp_send_reschedule(cpu);
+ } else {
trace_sched_wake_idle_without_ipi(cpu);
+ }
}

void resched_cpu(int cpu)
@@ -1154,7 +1169,7 @@ static void wake_up_idle_cpu(int cpu)
* and testing of the above solutions didn't appear to report
* much benefits.
*/
- if (set_nr_and_not_polling(rq->idle))
+ if (set_nr_and_not_polling(rq->idle, NR_now))
smp_send_reschedule(cpu);
else
trace_sched_wake_idle_without_ipi(cpu);
@@ -6693,6 +6708,8 @@ static void __sched notrace __schedule(unsigned int sched_mode)
}

next = pick_next_task(rq, prev, &rf);
+
+ /* Clear both TIF_NEED_RESCHED, TIF_NEED_RESCHED_LAZY */
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
--
2.31.1


2024-02-13 06:11:30

by Ankur Arora

[permalink] [raw]
Subject: [PATCH 20/30] sched: handle idle preemption for PREEMPT_AUTO

When running the idle task, we always want to schedule out immediately.
Use tif_resched(NR_now) to do that.

This path should be identical across preemption models, which is borne
out by comparing latency via perf bench sched pipe (5 runs):

PREEMPT_AUTO: 4.430 usecs/op +- 0.080 ( +- 1.800% )
PREEMPT_DYNAMIC: 4.400 usecs/op +- 0.100 ( +- 2.270% )

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6596b5e0b6c8..5e3dd95efb73 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1045,6 +1045,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
if (!IS_ENABLED(CONFIG_PREEMPT_AUTO))
return NR_now;

+ if (is_idle_task(curr))
+ return NR_now;
+
return NR_lazy;
}

--
2.31.1


2024-02-13 06:12:57

by Ankur Arora

[permalink] [raw]
Subject: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO

The default preemption policy for voluntary preemption under
PREEMPT_AUTO is to schedule eagerly for tasks of higher scheduling
class, and lazily for well-behaved, non-idle tasks.

This is the same policy as preempt=none, with an eager handling of
higher priority scheduling classes.
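
Putting the pieces together, the decision order in resched_opt_translate()
after this patch looks roughly as follows (a consolidated sketch of the
hunks below plus the earlier idle-task patch; other cases, e.g.
RESCHED_FORCE, are elided):

        if (!IS_ENABLED(CONFIG_PREEMPT_AUTO))
                return NR_now;          /* only TIF_NEED_RESCHED exists */

        if (preempt_model_preemptible())
                return NR_now;          /* preempt=full: always eager */

        if (preempt_model_voluntary() && opt == RESCHED_PRIORITY)
                return NR_now;          /* wakeup from a higher sched class */

        if (is_idle_task(curr))
                return NR_now;          /* never keep the idle task around */

        return NR_lazy;                 /* everyone else runs out their slice */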

So, compare a SCHED_DEADLINE workload (stress-ng --cyclic) with a
background kernel load of 'stress-ng --mmap':

# stress-ng --mmap 0 &
# stress-ng --cyclic 1 --timeout 300

PREEMPT_AUTO, preempt=none
stress-ng: info: [8827] setting to a 300 second (5 mins, 0.00 secs) run per stressor
stress-ng: info: [8827] dispatching hogs: 1 cyclic
stress-ng: info: [8828] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info: [8828] cyclic: mean: 23829.70 ns, mode: 3317 ns
stress-ng: info: [8828] cyclic: min: 2688 ns, max: 5701735 ns, std.dev. 123502.57
stress-ng: info: [8828] cyclic: latency percentiles:
stress-ng: info: [8828] cyclic: 25.00%: 6289 ns
stress-ng: info: [8828] cyclic: 50.00%: 13945 ns
stress-ng: info: [8828] cyclic: 75.00%: 25335 ns
stress-ng: info: [8828] cyclic: 90.00%: 34500 ns
stress-ng: info: [8828] cyclic: 95.40%: 41470 ns
stress-ng: info: [8828] cyclic: 99.00%: 85602 ns
stress-ng: info: [8828] cyclic: 99.50%: 301099 ns
stress-ng: info: [8828] cyclic: 99.90%: 1798633 ns
stress-ng: info: [8828] cyclic: 99.99%: 5701735 ns
stress-ng: info: [8827] successful run completed in 300.01s (5 mins, 0.01 secs)

PREEMPT_AUTO, preempt=voluntary
stress-ng: info: [8883] setting to a 300 second (5 mins, 0.00 secs) run per stressor
stress-ng: info: [8883] dispatching hogs: 1 cyclic
stress-ng: info: [8884] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info: [8884] cyclic: mean: 14169.08 ns, mode: 3355 ns
stress-ng: info: [8884] cyclic: min: 2570 ns, max: 2234939 ns, std.dev. 66056.95
stress-ng: info: [8884] cyclic: latency percentiles:
stress-ng: info: [8884] cyclic: 25.00%: 3665 ns
stress-ng: info: [8884] cyclic: 50.00%: 5409 ns
stress-ng: info: [8884] cyclic: 75.00%: 16009 ns
stress-ng: info: [8884] cyclic: 90.00%: 24392 ns
stress-ng: info: [8884] cyclic: 95.40%: 28827 ns
stress-ng: info: [8884] cyclic: 99.00%: 39153 ns
stress-ng: info: [8884] cyclic: 99.50%: 168643 ns
stress-ng: info: [8884] cyclic: 99.90%: 1186267 ns
stress-ng: info: [8884] cyclic: 99.99%: 2234939 ns
stress-ng: info: [8883] successful run completed in 300.01s (5 mins, 0.01 secs)

So, the latency improves significantly -- even at the 25th percentile
threshold.

This configuration also compares quite favourably to voluntary
preemption under PREEMPT_DYNAMIC.

PREEMPT_DYNAMIC, preempt=voluntary
stress-ng: info: [12252] setting to a 300 second (5 mins, 0.00 secs) run per stressor
stress-ng: info: [12252] dispatching hogs: 1 cyclic
stress-ng: info: [12253] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info: [12253] cyclic: mean: 19973.46 ns, mode: 3560 ns
stress-ng: info: [12253] cyclic: min: 2541 ns, max: 2751830 ns, std.dev. 68891.71
stress-ng: info: [12253] cyclic: latency percentiles:
stress-ng: info: [12253] cyclic: 25.00%: 4800 ns
stress-ng: info: [12253] cyclic: 50.00%: 12458 ns
stress-ng: info: [12253] cyclic: 75.00%: 25220 ns
stress-ng: info: [12253] cyclic: 90.00%: 35404 ns
stress-ng: info: [12253] cyclic: 95.40%: 43135 ns
stress-ng: info: [12253] cyclic: 99.00%: 61053 ns
stress-ng: info: [12253] cyclic: 99.50%: 98159 ns
stress-ng: info: [12253] cyclic: 99.90%: 1164407 ns
stress-ng: info: [12253] cyclic: 99.99%: 2751830 ns
stress-ng: info: [12252] successful run completed in 300.01s (5 mins, 0.01 secs)

And, as you would expect, we perform nearly identically to preempt=full
with PREEMPT_DYNAMIC (ignoring the outliers at 99.99%).

PREEMPT_DYNAMIC, preempt=full
stress-ng: info: [12206] setting to a 300 second (5 mins, 0.00 secs) run per stressor
stress-ng: info: [12206] dispatching hogs: 1 cyclic
stress-ng: info: [12207] cyclic: sched SCHED_DEADLINE: 100000 ns delay, 10000 samples
stress-ng: info: [12207] cyclic: mean: 16647.05 ns, mode: 3535 ns
stress-ng: info: [12207] cyclic: min: 2548 ns, max: 4093382 ns, std.dev. 116825.95
stress-ng: info: [12207] cyclic: latency percentiles:
stress-ng: info: [12207] cyclic: 25.00%: 3568 ns
stress-ng: info: [12207] cyclic: 50.00%: 4755 ns
stress-ng: info: [12207] cyclic: 75.00%: 15187 ns
stress-ng: info: [12207] cyclic: 90.00%: 24083 ns
stress-ng: info: [12207] cyclic: 95.40%: 29314 ns
stress-ng: info: [12207] cyclic: 99.00%: 40102 ns
stress-ng: info: [12207] cyclic: 99.50%: 366571 ns
stress-ng: info: [12207] cyclic: 99.90%: 2752037 ns
stress-ng: info: [12207] cyclic: 99.99%: 4093382 ns
stress-ng: info: [12206] successful run completed in 300.01s (5 mins, 0.01 secs)

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 12 ++++++++----
kernel/sched/sched.h | 6 ++++++
2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aaa87d5fecdd..aa31f23f43f4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1056,6 +1056,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
if (preempt_model_preemptible())
return NR_now;

+ if (preempt_model_voluntary() && opt == RESCHED_PRIORITY)
+ return NR_now;
+
if (is_idle_task(curr))
return NR_now;

@@ -2297,7 +2300,7 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
if (p->sched_class == rq->curr->sched_class)
rq->curr->sched_class->wakeup_preempt(rq, p, flags);
else if (sched_class_above(p->sched_class, rq->curr->sched_class))
- resched_curr(rq);
+ resched_curr_priority(rq);

/*
* A queue event has occurred, and we're going to schedule. In
@@ -8974,11 +8977,11 @@ static void __sched_dynamic_update(int mode)
case preempt_dynamic_none:
if (mode != preempt_dynamic_mode)
pr_info("%s: none\n", PREEMPT_MODE);
- preempt_dynamic_mode = mode;
break;

case preempt_dynamic_voluntary:
- preempt_dynamic_mode = preempt_dynamic_undefined;
+ if (mode != preempt_dynamic_mode)
+ pr_info("%s: voluntary\n", PREEMPT_MODE);
break;

case preempt_dynamic_full:
@@ -8988,9 +8991,10 @@ static void __sched_dynamic_update(int mode)

if (mode != preempt_dynamic_mode)
pr_info("%s: full\n", PREEMPT_MODE);
- preempt_dynamic_mode = mode;
break;
}
+
+ preempt_dynamic_mode = mode;
}

#endif /* CONFIG_PREEMPT_AUTO */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7e7acab1022..197c038b87c6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2466,6 +2466,7 @@ enum resched_opt {
RESCHED_DEFAULT,
RESCHED_FORCE,
RESCHED_TICK,
+ RESCHED_PRIORITY,
};

extern void __resched_curr(struct rq *rq, enum resched_opt opt);
@@ -2480,6 +2481,11 @@ static inline void resched_curr_tick(struct rq *rq)
__resched_curr(rq, RESCHED_TICK);
}

+static inline void resched_curr_priority(struct rq *rq)
+{
+ __resched_curr(rq, RESCHED_PRIORITY);
+}
+
extern void resched_cpu(int cpu);

extern struct rt_bandwidth def_rt_bandwidth;
--
2.31.1


2024-02-13 10:00:20

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

Hi Ankur,

On Tue, Feb 13, 2024 at 6:56 AM Ankur Arora <[email protected]> wrote:
> This series adds a new scheduling model PREEMPT_AUTO, which like
> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> on explicit preemption points for the voluntary models.

Thanks for your series!

Can you please reduce the CC list for future submissions?
It is always a good idea to have a closer look at the output of
scripts/get_maintainer.pl, and edit it manually. There is usually no
need to include everyone who ever contributed a tiny change to one of
the affected files.

Thanks again!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2024-02-13 21:48:21

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling


Geert Uytterhoeven <[email protected]> writes:

> Hi Ankur,
>
> On Tue, Feb 13, 2024 at 6:56 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> This series adds a new scheduling model PREEMPT_AUTO, which like
>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
>> on explicit preemption points for the voluntary models.
>
> Thanks for your series!
>
> Can you please reduce the CC list for future submissions?

Will do.

> It is always a good idea to have a closer look at the output of
> scripts/get_maintainer.pl, and edit it manually. There is usually no
> need to include everyone who ever contributed a tiny change to one of
> the affected files.

I was in two minds about whether to prune the CC list or not. So this
is very helpful.


Thanks

--
ankur

2024-02-14 13:38:56

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 17/30] x86/thread_info: define TIF_NEED_RESCHED_LAZY

Hi Ankur,

On Mon, Feb 12, 2024 at 09:55:41PM -0800, Ankur Arora wrote:
> Define TIF_NEED_RESCHED_LAZY which, with TIF_NEED_RESCHED provides the
> scheduler with two kinds of rescheduling intent: TIF_NEED_RESCHED,
> for the usual rescheduling at the next safe preemption point;
> TIF_NEED_RESCHED_LAZY expressing an intent to reschedule at some
> time in the future while allowing the current task to run to
> completion.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Originally-by: Thomas Gleixner <[email protected]>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/thread_info.h | 10 ++++++++--
> 2 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5edec175b9bf..ab58558068a4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -275,6 +275,7 @@ config X86
> select HAVE_STATIC_CALL
> select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL
> select HAVE_PREEMPT_DYNAMIC_CALL
> + select HAVE_PREEMPT_AUTO
> select HAVE_RSEQ
> select HAVE_RUST if X86_64
> select HAVE_SYSCALL_TRACEPOINTS
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index d63b02940747..88c1802185fc 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -81,8 +81,11 @@ struct thread_info {
> #define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
> #define TIF_SIGPENDING 2 /* signal pending */
> #define TIF_NEED_RESCHED 3 /* rescheduling necessary */
> -#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
> -#define TIF_SSBD 5 /* Speculative store bypass disable */
> +#ifdef CONFIG_PREEMPT_AUTO
> +#define TIF_NEED_RESCHED_LAZY 4 /* Lazy rescheduling */
> +#endif
> +#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/
> +#define TIF_SSBD 6 /* Speculative store bypass disable */

It's a bit awkward/ugly to conditionally define the TIF_* bits in arch code,
and we don't do that for other bits that are only used in some configurations
(e.g. TIF_UPROBE). That's not just for aesthetics -- for example, on arm64 we
try to keep the TIF_WORK_MASK bits contiguous, which is difficult if a bit in
the middle doesn't exist in some configurations.
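
(Purely illustrative, not the actual arm64 definitions: a contiguous mask
such as

        #define _TIF_WORK_MASK  GENMASK(6, 0)

only works if every bit in the range exists, and means the same thing, in
every configuration; a bit 4 that is only defined for CONFIG_PREEMPT_AUTO=y
forces the mask to be spelled out per-config instead.)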

Is it painful to organise the common code so that arch code can define
TIF_NEED_RESCHED_LAZY regardless of whether CONFIG_PREEMPT_AUTO is selected?

Mark.

> #define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
> #define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
> #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
> @@ -104,6 +107,9 @@ struct thread_info {
> #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
> +#ifdef CONFIG_PREEMPT_AUTO
> +#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
> +#endif
> #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
> #define _TIF_SSBD (1 << TIF_SSBD)
> #define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
> --
> 2.31.1
>

2024-02-14 20:32:55

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 17/30] x86/thread_info: define TIF_NEED_RESCHED_LAZY


Mark Rutland <[email protected]> writes:

> Hi Ankur,
>
> On Mon, Feb 12, 2024 at 09:55:41PM -0800, Ankur Arora wrote:
>> Define TIF_NEED_RESCHED_LAZY which, with TIF_NEED_RESCHED provides the
>> scheduler with two kinds of rescheduling intent: TIF_NEED_RESCHED,
>> for the usual rescheduling at the next safe preemption point;
>> TIF_NEED_RESCHED_LAZY expressing an intent to reschedule at some
>> time in the future while allowing the current task to run to
>> completion.
>>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Borislav Petkov <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Originally-by: Thomas Gleixner <[email protected]>
>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>> Signed-off-by: Ankur Arora <[email protected]>
>> ---
>> arch/x86/Kconfig | 1 +
>> arch/x86/include/asm/thread_info.h | 10 ++++++++--
>> 2 files changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 5edec175b9bf..ab58558068a4 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -275,6 +275,7 @@ config X86
>> select HAVE_STATIC_CALL
>> select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL
>> select HAVE_PREEMPT_DYNAMIC_CALL
>> + select HAVE_PREEMPT_AUTO
>> select HAVE_RSEQ
>> select HAVE_RUST if X86_64
>> select HAVE_SYSCALL_TRACEPOINTS
>> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
>> index d63b02940747..88c1802185fc 100644
>> --- a/arch/x86/include/asm/thread_info.h
>> +++ b/arch/x86/include/asm/thread_info.h
>> @@ -81,8 +81,11 @@ struct thread_info {
>> #define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
>> #define TIF_SIGPENDING 2 /* signal pending */
>> #define TIF_NEED_RESCHED 3 /* rescheduling necessary */
>> -#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
>> -#define TIF_SSBD 5 /* Speculative store bypass disable */
>> +#ifdef CONFIG_PREEMPT_AUTO
>> +#define TIF_NEED_RESCHED_LAZY 4 /* Lazy rescheduling */
>> +#endif
>> +#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/
>> +#define TIF_SSBD 6 /* Speculative store bypass disable */
>
> It's a bit awkward/ugly to conditionally define the TIF_* bits in arch code,
> and we don't do that for other bits that are only used in some configurations
> (e.g. TIF_UPROBE). That's not just for aesthetics -- for example, on arm64 we
> try to keep the TIF_WORK_MASK bits contiguous, which is difficult if a bit in
> the middle doesn't exist in some configurations.

That's useful to know. And, I think you are right about the
ugliness of this.

> Is it painful to organise the common code so that arch code can define
> TIF_NEED_RESCHED_LAZY regardless of whether CONFIG_PREEMPT_AUTO is selected?

So, the original reason I did it this way was that I wanted to have
zero performance impact on !CONFIG_PREEMPT_AUTO configurations whether
TIF_NEED_RESCHED_LAZY was defined or not.
(I was doing some computation with TIF_NEED_RESCHED_LAZY at that point.)

Eventually I changed that part of code but this stayed.

Anyway, this should be easy enough to fix with some #ifdefry; maybe
something like the sketch below.
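
Hypothetical sketch only (not necessarily what the next version will do):
define the bit unconditionally in arch code, and have the generic header
collapse the lazy variant onto TIF_NEED_RESCHED when CONFIG_PREEMPT_AUTO=n,
so tif_resched(NR_lazy) reduces to TIF_NEED_RESCHED as before:

        /* arch/x86/include/asm/thread_info.h: always defined */
        #define TIF_NEED_RESCHED_LAZY   4       /* Lazy rescheduling */
        #define _TIF_NEED_RESCHED_LAZY  (1 << TIF_NEED_RESCHED_LAZY)

        /* include/linux/thread_info.h */
        #ifdef CONFIG_PREEMPT_AUTO
        # define TIF_NEED_RESCHED_LAZY_OFFSET   (TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED)
        #else
        # define TIF_NEED_RESCHED_LAZY_OFFSET   0
        #endif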

Thanks for reviewing.

--
ankur

2024-02-15 00:01:51

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
> Hi,
>
> This series adds a new scheduling model PREEMPT_AUTO, which like
> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> on explicit preemption points for the voluntary models.
>
> The series is based on Thomas' original proposal which he outlined
> in [1], [2] and in his PoC [3].
>
> An earlier RFC version is at [4].

This uncovered a couple of latent bugs in RCU due to its having been
a good long time since anyone built a !SMP preemptible kernel with
non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most
likely for the merge window after next, but let me know if you need
them sooner.

I am also seeing OOM conditions during rcutorture testing of callback
flooding, but I am still looking into this. The full diff on top of
your series on top of v6.8-rc4 is shown below. Please let me know if
I have messed up the Kconfig options.

Thanx, Paul

[1] 6a4352fd1418 ("rcu: Update lockdep while in RCU read-side critical section")
1b85e92eabcd ("rcu: Make TINY_RCU depend on !PREEMPT_RCU rather than !PREEMPTION")

------------------------------------------------------------------------

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 0746b1b0b6639..b0b61b8598b03 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -778,9 +778,9 @@ static inline void rcu_read_unlock(void)
{
RCU_LOCKDEP_WARN(!rcu_is_watching(),
"rcu_read_unlock() used illegally while idle");
+ rcu_lock_release(&rcu_lock_map); /* Keep acq info for rls diags. */
__release(RCU);
__rcu_read_unlock();
- rcu_lock_release(&rcu_lock_map); /* Keep acq info for rls diags. */
}

/**
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index d0ecc8ef17a72..6bf969857a85b 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -31,7 +31,7 @@ config PREEMPT_RCU

config TINY_RCU
bool
- default y if !PREEMPTION && !SMP
+ default y if !PREEMPT_RCU && !SMP
help
This option selects the RCU implementation that is
designed for UP systems from which real-time response
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-N b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-N
index 07f5e0a70ae70..737389417c7b3 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-N
+++ b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-N
@@ -3,8 +3,10 @@ CONFIG_SMP=y
CONFIG_NR_CPUS=4
CONFIG_HOTPLUG_CPU=y
CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=n
#CHECK#CONFIG_RCU_EXPERT=n
CONFIG_KPROBES=n
CONFIG_FTRACE=n
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-T b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-T
index c70cf0405f248..c9aca21d02f8c 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-T
+++ b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-T
@@ -1,5 +1,6 @@
CONFIG_SMP=n
CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=n
CONFIG_PREEMPT_DYNAMIC=n
@@ -10,3 +11,4 @@ CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_DEBUG_ATOMIC_SLEEP=y
#CHECK#CONFIG_PREEMPT_COUNT=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TASKS02 b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02
index 2f9fcffff5ae3..472259f9e0a6a 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TASKS02
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TASKS02
@@ -1,8 +1,10 @@
CONFIG_SMP=n
CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=n
CONFIG_PREEMPT_DYNAMIC=n
#CHECK#CONFIG_TASKS_RCU=y
CONFIG_FORCE_TASKS_RCU=y
CONFIG_RCU_EXPERT=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TINY02 b/tools/testing/selftests/rcutorture/configs/rcu/TINY02
index 30439f6fc20e6..df408933e7013 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TINY02
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TINY02
@@ -1,5 +1,6 @@
CONFIG_SMP=n
CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=n
CONFIG_PREEMPT_DYNAMIC=n
@@ -13,3 +14,4 @@ CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_DEBUG_OBJECTS=y
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TRACE01 b/tools/testing/selftests/rcutorture/configs/rcu/TRACE01
index 85b407467454a..2f75c7349d83a 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TRACE01
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TRACE01
@@ -2,6 +2,7 @@ CONFIG_SMP=y
CONFIG_NR_CPUS=5
CONFIG_HOTPLUG_CPU=y
CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=n
CONFIG_PREEMPT_DYNAMIC=n
@@ -12,3 +13,4 @@ CONFIG_FORCE_TASKS_TRACE_RCU=y
#CHECK#CONFIG_TASKS_TRACE_RCU=y
CONFIG_TASKS_TRACE_RCU_READ_MB=y
CONFIG_RCU_EXPERT=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index dc4985064b3ad..9ef845d54fa41 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -2,6 +2,7 @@ CONFIG_SMP=y
CONFIG_NR_CPUS=8
CONFIG_PREEMPT_NONE=n
CONFIG_PREEMPT_VOLUNTARY=y
+CONFIG_PREEMPT_AUTO=y
CONFIG_PREEMPT=n
CONFIG_PREEMPT_DYNAMIC=n
#CHECK#CONFIG_TREE_RCU=y
@@ -16,3 +17,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y
CONFIG_RCU_EQS_DEBUG=y
CONFIG_RCU_LAZY=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE05 b/tools/testing/selftests/rcutorture/configs/rcu/TREE05
index 9f48c73709ec3..31afd943d85ef 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE05
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE05
@@ -1,6 +1,7 @@
CONFIG_SMP=y
CONFIG_NR_CPUS=8
CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=n
#CHECK#CONFIG_TREE_RCU=y
@@ -18,3 +19,4 @@ CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RCU_LIST=y
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE06 b/tools/testing/selftests/rcutorture/configs/rcu/TREE06
index db27651de04b8..1180fe36a3a12 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE06
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE06
@@ -1,6 +1,7 @@
CONFIG_SMP=y
CONFIG_NR_CPUS=8
CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=n
#CHECK#CONFIG_TREE_RCU=y
@@ -17,3 +18,4 @@ CONFIG_PROVE_LOCKING=y
CONFIG_DEBUG_OBJECTS=y
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
CONFIG_RCU_EXPERT=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE07 b/tools/testing/selftests/rcutorture/configs/rcu/TREE07
index d30922d8c8832..969e852bd618b 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE07
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE07
@@ -1,6 +1,7 @@
CONFIG_SMP=y
CONFIG_NR_CPUS=16
CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=n
CONFIG_PREEMPT_DYNAMIC=n
@@ -15,3 +16,4 @@ CONFIG_RCU_FANOUT_LEAF=2
CONFIG_DEBUG_LOCK_ALLOC=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE10 b/tools/testing/selftests/rcutorture/configs/rcu/TREE10
index a323d8948b7cf..4af22599f13ed 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE10
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE10
@@ -1,6 +1,7 @@
CONFIG_SMP=y
CONFIG_NR_CPUS=56
CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=n
CONFIG_PREEMPT_DYNAMIC=n
@@ -16,3 +17,4 @@ CONFIG_PROVE_LOCKING=n
CONFIG_DEBUG_OBJECTS=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=n
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TRIVIAL b/tools/testing/selftests/rcutorture/configs/rcu/TRIVIAL
index 5d546efa68e83..7b2c9fb0cd826 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TRIVIAL
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TRIVIAL
@@ -1,6 +1,7 @@
CONFIG_SMP=y
CONFIG_NR_CPUS=8
CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=n
CONFIG_HZ_PERIODIC=n
@@ -9,3 +10,4 @@ CONFIG_NO_HZ_FULL=n
CONFIG_DEBUG_LOCK_ALLOC=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcuscale/TINY b/tools/testing/selftests/rcutorture/configs/rcuscale/TINY
index 0fa2dc086e10c..80230745e9dc7 100644
--- a/tools/testing/selftests/rcutorture/configs/rcuscale/TINY
+++ b/tools/testing/selftests/rcutorture/configs/rcuscale/TINY
@@ -1,5 +1,6 @@
CONFIG_SMP=n
CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=n
CONFIG_PREEMPT_DYNAMIC=n
@@ -14,3 +15,4 @@ CONFIG_RCU_BOOST=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y
CONFIG_RCU_TRACE=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcuscale/TRACE01 b/tools/testing/selftests/rcutorture/configs/rcuscale/TRACE01
index 0059592c7408a..eb47f36712305 100644
--- a/tools/testing/selftests/rcutorture/configs/rcuscale/TRACE01
+++ b/tools/testing/selftests/rcutorture/configs/rcuscale/TRACE01
@@ -1,5 +1,6 @@
CONFIG_SMP=y
CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=n
CONFIG_PREEMPT_DYNAMIC=n
@@ -14,3 +15,4 @@ CONFIG_RCU_BOOST=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y
CONFIG_RCU_TRACE=y
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/refscale/NOPREEMPT b/tools/testing/selftests/rcutorture/configs/refscale/NOPREEMPT
index 67f9d2998afd3..cb3219cb98c78 100644
--- a/tools/testing/selftests/rcutorture/configs/refscale/NOPREEMPT
+++ b/tools/testing/selftests/rcutorture/configs/refscale/NOPREEMPT
@@ -1,5 +1,6 @@
CONFIG_SMP=y
CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=n
CONFIG_PREEMPT_DYNAMIC=n
@@ -18,3 +19,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y
CONFIG_KPROBES=n
CONFIG_FTRACE=n
+CONFIG_EXPERT=y
diff --git a/tools/testing/selftests/rcutorture/configs/scf/NOPREEMPT b/tools/testing/selftests/rcutorture/configs/scf/NOPREEMPT
index 6133f54ce2a7d..241f28e965e57 100644
--- a/tools/testing/selftests/rcutorture/configs/scf/NOPREEMPT
+++ b/tools/testing/selftests/rcutorture/configs/scf/NOPREEMPT
@@ -1,5 +1,6 @@
CONFIG_SMP=y
CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_AUTO=y
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=n
CONFIG_PREEMPT_DYNAMIC=n
@@ -11,3 +12,4 @@ CONFIG_DEBUG_LOCK_ALLOC=n
CONFIG_PROVE_LOCKING=n
CONFIG_KPROBES=n
CONFIG_FTRACE=n
+CONFIG_EXPERT=y

2024-02-15 02:06:59

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling


Paul E. McKenney <[email protected]> writes:

> On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
>> Hi,
>>
>> This series adds a new scheduling model PREEMPT_AUTO, which like
>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
>> on explicit preemption points for the voluntary models.
>>
>> The series is based on Thomas' original proposal which he outlined
>> in [1], [2] and in his PoC [3].
>>
>> An earlier RFC version is at [4].
>
> This uncovered a couple of latent bugs in RCU due to its having been
> a good long time since anyone built a !SMP preemptible kernel with
> non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most
> likely for the merge window after next, but let me know if you need
> them sooner.

Thanks. As you can probably tell, I skipped out on !SMP in my testing.
But, the attached diff should tide me over until the fixes are in.

> I am also seeing OOM conditions during rcutorture testing of callback
> flooding, but I am still looking into this.

That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration?

Thanks

--
ankur

2024-02-15 03:46:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote:
>
> Paul E. McKenney <[email protected]> writes:
>
> > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
> >> Hi,
> >>
> >> This series adds a new scheduling model PREEMPT_AUTO, which like
> >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> >> on explicit preemption points for the voluntary models.
> >>
> >> The series is based on Thomas' original proposal which he outlined
> >> in [1], [2] and in his PoC [3].
> >>
> >> An earlier RFC version is at [4].
> >
> > This uncovered a couple of latent bugs in RCU due to its having been
> > a good long time since anyone built a !SMP preemptible kernel with
> > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most
> > likely for the merge window after next, but let me know if you need
> > them sooner.
>
> Thanks. As you can probably tell, I skipped out on !SMP in my testing.
> But, the attached diff should tide me over until the fixes are in.

That was indeed my guess. ;-)

> > I am also seeing OOM conditions during rcutorture testing of callback
> > flooding, but I am still looking into this.
>
> That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration?

On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on
two of them thus far. I am running a longer test to see if this might
be just luck. If not, I look to see what rcutorture scenarios TREE10
and TRACE01 have in common.

Thanx, Paul

2024-02-15 19:28:54

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote:
> >
> > Paul E. McKenney <[email protected]> writes:
> >
> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
> > >> Hi,
> > >>
> > >> This series adds a new scheduling model PREEMPT_AUTO, which like
> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> > >> on explicit preemption points for the voluntary models.
> > >>
> > >> The series is based on Thomas' original proposal which he outlined
> > >> in [1], [2] and in his PoC [3].
> > >>
> > >> An earlier RFC version is at [4].
> > >
> > > This uncovered a couple of latent bugs in RCU due to its having been
> > > a good long time since anyone built a !SMP preemptible kernel with
> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most
> > > likely for the merge window after next, but let me know if you need
> > > them sooner.
> >
> > Thanks. As you can probably tell, I skipped out on !SMP in my testing.
> > But, the attached diff should tide me over until the fixes are in.
>
> That was indeed my guess. ;-)
>
> > > I am also seeing OOM conditions during rcutorture testing of callback
> > > flooding, but I am still looking into this.
> >
> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration?
>
> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on
> two of them thus far. I am running a longer test to see if this might
> be just luck. If not, I look to see what rcutorture scenarios TREE10
> and TRACE01 have in common.

And still TRACE01 and TREE10 are hitting OOMs, still not seeing what
sets them apart. I also hit a grace-period hang in TREE04, which does
CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something
to dig into more.

I am also getting these from builds that enable KASAN:

vmlinux.o: warning: objtool: mwait_idle+0x13: call to tif_resched.constprop.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: acpi_processor_ffh_cstate_enter+0x36: call to tif_resched.constprop.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: cpu_idle_poll.isra.0+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: acpi_safe_halt+0x0: call to tif_resched.constprop.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: poll_idle+0x33: call to tif_resched.constprop.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: default_enter_idle+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section

Does tif_resched() need to be marked noinstr or some such?

Tracing got harder to disable, but I believe that is unrelated to lazy
preemption. ;-)

Thanx, Paul

2024-02-15 20:04:09

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Thu, Feb 15 2024 at 11:28, Paul E. McKenney wrote:
> On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
> I am also getting these from builds that enable KASAN:
>
> vmlinux.o: warning: objtool: mwait_idle+0x13: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: acpi_processor_ffh_cstate_enter+0x36: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: cpu_idle_poll.isra.0+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: acpi_safe_halt+0x0: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: poll_idle+0x33: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: default_enter_idle+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section
>
> Does tif_resched() need to be marked noinstr or some such?

__always_inline() probably

2024-02-15 20:54:32

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling


Paul E. McKenney <[email protected]> writes:

> On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
>> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote:
>> >
>> > Paul E. McKenney <[email protected]> writes:
>> >
>> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
>> > >> Hi,
>> > >>
>> > >> This series adds a new scheduling model PREEMPT_AUTO, which like
>> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
>> > >> on explicit preemption points for the voluntary models.
>> > >>
>> > >> The series is based on Thomas' original proposal which he outlined
>> > >> in [1], [2] and in his PoC [3].
>> > >>
>> > >> An earlier RFC version is at [4].
>> > >
>> > > This uncovered a couple of latent bugs in RCU due to its having been
>> > > a good long time since anyone built a !SMP preemptible kernel with
>> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most
>> > > likely for the merge window after next, but let me know if you need
>> > > them sooner.
>> >
>> > Thanks. As you can probably tell, I skipped out on !SMP in my testing.
>> > But, the attached diff should tide me over until the fixes are in.
>>
>> That was indeed my guess. ;-)
>>
>> > > I am also seeing OOM conditions during rcutorture testing of callback
>> > > flooding, but I am still looking into this.
>> >
>> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration?
>>
>> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on
>> two of them thus far. I am running a longer test to see if this might
>> be just luck. If not, I look to see what rcutorture scenarios TREE10
>> and TRACE01 have in common.
>
> And still TRACE01 and TREE10 are hitting OOMs, still not seeing what
> sets them apart. I also hit a grace-period hang in TREE04, which does
> CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something
> to dig into more.
>
> I am also getting these from builds that enable KASAN:
>
> vmlinux.o: warning: objtool: mwait_idle+0x13: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: acpi_processor_ffh_cstate_enter+0x36: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: cpu_idle_poll.isra.0+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: acpi_safe_halt+0x0: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: poll_idle+0x33: call to tif_resched.constprop.0() leaves .noinstr.text section
> vmlinux.o: warning: objtool: default_enter_idle+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section

Thanks Paul. Yeah, with KASAN, tif_resched() transforms into this
out-of-line function:

ffffffff810fec20 <tif_resched.constprop.0>:
ffffffff810fec20: e8 5b c6 20 00 call ffffffff8130b280 <__sanitizer_cov_trace_pc>
ffffffff810fec25: b8 03 00 00 00 mov $0x3,%eax
ffffffff810fec2a: e9 71 56 61 01 jmp ffffffff827142a0 <__x86_return_thunk>
ffffffff810fec2f: 90 nop

> Does tif_resched() need to be marked noinstr or some such?

Builds fine with Thomas' suggested fix.

--------
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 8752dbc2dac7..0810ddeb365d 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -81,12 +81,12 @@ typedef enum {
* reduce to the same value (TIF_NEED_RESCHED) leaving any scheduling behaviour
* unchanged.
*/
-static inline int tif_resched(resched_t rs)
+static __always_inline inline int tif_resched(resched_t rs)
{
return TIF_NEED_RESCHED + rs * TIF_NEED_RESCHED_LAZY_OFFSET;
}

-static inline int _tif_resched(resched_t rs)
+static __always_inline inline int _tif_resched(resched_t rs)
{
return 1 << tif_resched(rs);
}

2024-02-15 20:55:52

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Thu, Feb 15, 2024 at 12:53:13PM -0800, Ankur Arora wrote:
>
> Paul E. McKenney <[email protected]> writes:
>
> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote:
> >> >
> >> > Paul E. McKenney <[email protected]> writes:
> >> >
> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
> >> > >> Hi,
> >> > >>
> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like
> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> >> > >> on explicit preemption points for the voluntary models.
> >> > >>
> >> > >> The series is based on Thomas' original proposal which he outlined
> >> > >> in [1], [2] and in his PoC [3].
> >> > >>
> >> > >> An earlier RFC version is at [4].
> >> > >
> >> > > This uncovered a couple of latent bugs in RCU due to its having been
> >> > > a good long time since anyone built a !SMP preemptible kernel with
> >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most
> >> > > likely for the merge window after next, but let me know if you need
> >> > > them sooner.
> >> >
> >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing.
> >> > But, the attached diff should tide me over until the fixes are in.
> >>
> >> That was indeed my guess. ;-)
> >>
> >> > > I am also seeing OOM conditions during rcutorture testing of callback
> >> > > flooding, but I am still looking into this.
> >> >
> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration?
> >>
> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on
> >> two of them thus far. I am running a longer test to see if this might
> >> be just luck. If not, I look to see what rcutorture scenarios TREE10
> >> and TRACE01 have in common.
> >
> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what
> > sets them apart. I also hit a grace-period hang in TREE04, which does
> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something
> > to dig into more.
> >
> > I am also getting these from builds that enable KASAN:
> >
> > vmlinux.o: warning: objtool: mwait_idle+0x13: call to tif_resched.constprop.0() leaves .noinstr.text section
> > vmlinux.o: warning: objtool: acpi_processor_ffh_cstate_enter+0x36: call to tif_resched.constprop.0() leaves .noinstr.text section
> > vmlinux.o: warning: objtool: cpu_idle_poll.isra.0+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section
> > vmlinux.o: warning: objtool: acpi_safe_halt+0x0: call to tif_resched.constprop.0() leaves .noinstr.text section
> > vmlinux.o: warning: objtool: poll_idle+0x33: call to tif_resched.constprop.0() leaves .noinstr.text section
> > vmlinux.o: warning: objtool: default_enter_idle+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section
>
> Thanks Paul. Yeah, with KASAN, tif_resched() transforms into this out of
> line function:
>
> ffffffff810fec20 <tif_resched.constprop.0>:
> ffffffff810fec20: e8 5b c6 20 00 call ffffffff8130b280 <__sanitizer_cov_trace_pc>
> ffffffff810fec25: b8 03 00 00 00 mov $0x3,%eax
> ffffffff810fec2a: e9 71 56 61 01 jmp ffffffff827142a0 <__x86_return_thunk>
> ffffffff810fec2f: 90 nop
>
> > Does tif_resched() need to be marked noinstr or some such?
>
> Builds fine with Thomas' suggested fix.

You beat me to it. ;-)

Thanx, Paul

> --------
> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index 8752dbc2dac7..0810ddeb365d 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -81,12 +81,12 @@ typedef enum {
> * reduce to the same value (TIF_NEED_RESCHED) leaving any scheduling behaviour
> * unchanged.
> */
> -static inline int tif_resched(resched_t rs)
> +static __always_inline inline int tif_resched(resched_t rs)
> {
> return TIF_NEED_RESCHED + rs * TIF_NEED_RESCHED_LAZY_OFFSET;
> }
>
> -static inline int _tif_resched(resched_t rs)
> +static __always_inline inline int _tif_resched(resched_t rs)
> {
> return 1 << tif_resched(rs);
> }

2024-02-15 21:11:30

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Thu, Feb 15, 2024 at 09:04:00PM +0100, Thomas Gleixner wrote:
> On Thu, Feb 15 2024 at 11:28, Paul E. McKenney wrote:
> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
> > I am also getting these from builds that enable KASAN:
> >
> > vmlinux.o: warning: objtool: mwait_idle+0x13: call to tif_resched.constprop.0() leaves .noinstr.text section
> > vmlinux.o: warning: objtool: acpi_processor_ffh_cstate_enter+0x36: call to tif_resched.constprop.0() leaves .noinstr.text section
> > vmlinux.o: warning: objtool: cpu_idle_poll.isra.0+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section
> > vmlinux.o: warning: objtool: acpi_safe_halt+0x0: call to tif_resched.constprop.0() leaves .noinstr.text section
> > vmlinux.o: warning: objtool: poll_idle+0x33: call to tif_resched.constprop.0() leaves .noinstr.text section
> > vmlinux.o: warning: objtool: default_enter_idle+0x18: call to tif_resched.constprop.0() leaves .noinstr.text section
> >
> > Does tif_resched() need to be marked noinstr or some such?
>
> __always_inline() probably

That does the trick, thank you!

Thanx, Paul

------------------------------------------------------------------------

diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 8752dbc2dac75..43b729935804e 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -81,7 +81,7 @@ typedef enum {
* reduce to the same value (TIF_NEED_RESCHED) leaving any scheduling behaviour
* unchanged.
*/
-static inline int tif_resched(resched_t rs)
+static __always_inline int tif_resched(resched_t rs)
{
return TIF_NEED_RESCHED + rs * TIF_NEED_RESCHED_LAZY_OFFSET;
}

2024-02-15 21:30:39

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling


Paul E. McKenney <[email protected]> writes:

> On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
>> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote:
>> >
>> > Paul E. McKenney <[email protected]> writes:
>> >
>> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
>> > >> Hi,
>> > >>
>> > >> This series adds a new scheduling model PREEMPT_AUTO, which like
>> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
>> > >> on explicit preemption points for the voluntary models.
>> > >>
>> > >> The series is based on Thomas' original proposal which he outlined
>> > >> in [1], [2] and in his PoC [3].
>> > >>
>> > >> An earlier RFC version is at [4].
>> > >
>> > > This uncovered a couple of latent bugs in RCU due to its having been
>> > > a good long time since anyone built a !SMP preemptible kernel with
>> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most
>> > > likely for the merge window after next, but let me know if you need
>> > > them sooner.
>> >
>> > Thanks. As you can probably tell, I skipped out on !SMP in my testing.
>> > But, the attached diff should tide me over until the fixes are in.
>>
>> That was indeed my guess. ;-)
>>
>> > > I am also seeing OOM conditions during rcutorture testing of callback
>> > > flooding, but I am still looking into this.
>> >
>> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration?
>>
>> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on
>> two of them thus far. I am running a longer test to see if this might
>> be just luck. If not, I look to see what rcutorture scenarios TREE10
>> and TRACE01 have in common.
>
> And still TRACE01 and TREE10 are hitting OOMs, still not seeing what
> sets them apart. I also hit a grace-period hang in TREE04, which does
> CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something
> to dig into more.

So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder
if you would continue to hit the TREE04 hang with CONFIG_PREEMPT_NONE=y
as well?
(Just in the interest of minimizing configurations.)

---
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index 9ef845d54fa4..819cff9113d8 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -1,7 +1,7 @@
CONFIG_SMP=y
CONFIG_NR_CPUS=8
-CONFIG_PREEMPT_NONE=n
-CONFIG_PREEMPT_VOLUNTARY=y
+CONFIG_PREEMPT_NONE=y
+CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT_AUTO=y
CONFIG_PREEMPT=n
CONFIG_PREEMPT_DYNAMIC=n

2024-02-15 22:55:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote:
>
> Paul E. McKenney <[email protected]> writes:
>
> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote:
> >> >
> >> > Paul E. McKenney <[email protected]> writes:
> >> >
> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
> >> > >> Hi,
> >> > >>
> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like
> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> >> > >> on explicit preemption points for the voluntary models.
> >> > >>
> >> > >> The series is based on Thomas' original proposal which he outlined
> >> > >> in [1], [2] and in his PoC [3].
> >> > >>
> >> > >> An earlier RFC version is at [4].
> >> > >
> >> > > This uncovered a couple of latent bugs in RCU due to its having been
> >> > > a good long time since anyone built a !SMP preemptible kernel with
> >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most
> >> > > likely for the merge window after next, but let me know if you need
> >> > > them sooner.
> >> >
> >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing.
> >> > But, the attached diff should tide me over until the fixes are in.
> >>
> >> That was indeed my guess. ;-)
> >>
> >> > > I am also seeing OOM conditions during rcutorture testing of callback
> >> > > flooding, but I am still looking into this.
> >> >
> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration?
> >>
> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on
> >> two of them thus far. I am running a longer test to see if this might
> >> be just luck. If not, I look to see what rcutorture scenarios TREE10
> >> and TRACE01 have in common.
> >
> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what
> > sets them apart. I also hit a grace-period hang in TREE04, which does
> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something
> > to dig into more.
>
> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder
> if you would continue to hit the TREE04 hang with CONFIG_PREEMTP_NONE=y
> as well?
> (Just in the interest of minimizing configurations.)

I would be happy to, but in the spirit of full disclosure...

First, I have seen that failure only once, which is not enough to
conclude that it has much to do with TREE04. It might simply be low
probability, so that TREE04 simply was unlucky enough to hit it first.
In contrast, I have sufficient data to be reasonably confident that the
callback-flooding OOMs really do have something to do with the TRACE01 and
TREE10 scenarios, even though I am not yet seeing what these two scenarios
have in common that they don't also have in common with other scenarios.
But what is life without a bit of mystery? ;-)

Second, please see the attached tarball, which contains .csv files showing
Kconfig options and kernel boot parameters for the various torture tests.
The portions of the filenames preceding the "config.csv" correspond to
the directories in tools/testing/selftests/rcutorture/configs.

Third, there are additional scenarios hand-crafted by the script at
tools/testing/selftests/rcutorture/bin/torture.sh. Thus far, none of
them have triggered, other than via the newly increased difficulty
of configuring a tracing-free kernel with which to test, but they
can still be useful in ruling out particular Kconfig options or kernel
boot parameters being related to a given issue.

But please do take a look at the .csv files and let me know what
adjustments would be appropriate given the failure information.

Thanx, Paul

> ---
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
> index 9ef845d54fa4..819cff9113d8 100644
> --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
> @@ -1,7 +1,7 @@
> CONFIG_SMP=y
> CONFIG_NR_CPUS=8
> -CONFIG_PREEMPT_NONE=n
> -CONFIG_PREEMPT_VOLUNTARY=y
> +CONFIG_PREEMPT_NONE=y
> +CONFIG_PREEMPT_VOLUNTARY=n
> CONFIG_PREEMPT_AUTO=y
> CONFIG_PREEMPT=n
> CONFIG_PREEMPT_DYNAMIC=n

2024-02-15 23:04:57

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Thu, Feb 15, 2024 at 02:54:45PM -0800, Paul E. McKenney wrote:
> On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote:
> >
> > Paul E. McKenney <[email protected]> writes:
> >
> > > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
> > >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote:
> > >> >
> > >> > Paul E. McKenney <[email protected]> writes:
> > >> >
> > >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
> > >> > >> Hi,
> > >> > >>
> > >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like
> > >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> > >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> > >> > >> on explicit preemption points for the voluntary models.
> > >> > >>
> > >> > >> The series is based on Thomas' original proposal which he outlined
> > >> > >> in [1], [2] and in his PoC [3].
> > >> > >>
> > >> > >> An earlier RFC version is at [4].
> > >> > >
> > >> > > This uncovered a couple of latent bugs in RCU due to its having been
> > >> > > a good long time since anyone built a !SMP preemptible kernel with
> > >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most
> > >> > > likely for the merge window after next, but let me know if you need
> > >> > > them sooner.
> > >> >
> > >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing.
> > >> > But, the attached diff should tide me over until the fixes are in.
> > >>
> > >> That was indeed my guess. ;-)
> > >>
> > >> > > I am also seeing OOM conditions during rcutorture testing of callback
> > >> > > flooding, but I am still looking into this.
> > >> >
> > >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration?
> > >>
> > >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on
> > >> two of them thus far. I am running a longer test to see if this might
> > >> be just luck. If not, I look to see what rcutorture scenarios TREE10
> > >> and TRACE01 have in common.
> > >
> > > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what
> > > sets them apart. I also hit a grace-period hang in TREE04, which does
> > > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something
> > > to dig into more.
> >
> > So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder
> > if you would continue to hit the TREE04 hang with CONFIG_PREEMTP_NONE=y
> > as well?
> > (Just in the interest of minimizing configurations.)

This time with the tarball actually attached! :-/

Thanx, Paul

> I would be happy to, but in the spirit of full disclosure...
>
> First, I have seen that failure only once, which is not enough to
> conclude that it has much to do with TREE04. It might simply be low
> probability, so that TREE04 simply was unlucky enough to hit it first.
> In contrast, I have sufficient data to be reasonably confident that the
> callback-flooding OOMs really do have something to do with the TRACE01 and
> TREE10 scenarios, even though I am not yet seeing what these two scenarios
> have in common that they don't also have in common with other scenarios.
> But what is life without a bit of mystery? ;-)
>
> Second, please see the attached tarball, which contains .csv files showing
> Kconfig options and kernel boot parameters for the various torture tests.
> The portions of the filenames preceding the "config.csv" correspond to
> the directories in tools/testing/selftests/rcutorture/configs.
>
> Third, there are additional scenarios hand-crafted by the script at
> tools/testing/selftests/rcutorture/bin/torture.sh. Thus far, none of
> them have triggered, other than via the newly increased difficulty
> of configurating a tracing-free kernel with which to test, but they
> can still be useful in ruling out particular Kconfig options or kernel
> boot parameters being related to a given issue.
>
> But please do take a look at the .csv files and let me know what
> adjustments would be appropriate given the failure information.
>
> Thanx, Paul
>
> > ---
> > diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
> > index 9ef845d54fa4..819cff9113d8 100644
> > --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
> > +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
> > @@ -1,7 +1,7 @@
> > CONFIG_SMP=y
> > CONFIG_NR_CPUS=8
> > -CONFIG_PREEMPT_NONE=n
> > -CONFIG_PREEMPT_VOLUNTARY=y
> > +CONFIG_PREEMPT_NONE=y
> > +CONFIG_PREEMPT_VOLUNTARY=n
> > CONFIG_PREEMPT_AUTO=y
> > CONFIG_PREEMPT=n
> > CONFIG_PREEMPT_DYNAMIC=n


Attachments:
tortureconfigs.2024.02.15a.tgz (1.57 kB)

2024-02-16 00:47:25

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling


Paul E. McKenney <[email protected]> writes:

> On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote:
>>
>> Paul E. McKenney <[email protected]> writes:
>>
>> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
>> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote:
>> >> >
>> >> > Paul E. McKenney <[email protected]> writes:
>> >> >
>> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
>> >> > >> Hi,
>> >> > >>
>> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like
>> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
>> >> > >> on explicit preemption points for the voluntary models.
>> >> > >>
>> >> > >> The series is based on Thomas' original proposal which he outlined
>> >> > >> in [1], [2] and in his PoC [3].
>> >> > >>
>> >> > >> An earlier RFC version is at [4].
>> >> > >
>> >> > > This uncovered a couple of latent bugs in RCU due to its having been
>> >> > > a good long time since anyone built a !SMP preemptible kernel with
>> >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most
>> >> > > likely for the merge window after next, but let me know if you need
>> >> > > them sooner.
>> >> >
>> >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing.
>> >> > But, the attached diff should tide me over until the fixes are in.
>> >>
>> >> That was indeed my guess. ;-)
>> >>
>> >> > > I am also seeing OOM conditions during rcutorture testing of callback
>> >> > > flooding, but I am still looking into this.
>> >> >
>> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration?
>> >>
>> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on
>> >> two of them thus far. I am running a longer test to see if this might
>> >> be just luck. If not, I look to see what rcutorture scenarios TREE10
>> >> and TRACE01 have in common.
>> >
>> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what
>> > sets them apart. I also hit a grace-period hang in TREE04, which does
>> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something
>> > to dig into more.
>>
>> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder
>> if you would continue to hit the TREE04 hang with CONFIG_PREEMTP_NONE=y
>> as well?
>> (Just in the interest of minimizing configurations.)
>
> I would be happy to, but in the spirit of full disclosure...
>
> First, I have seen that failure only once, which is not enough to
> conclude that it has much to do with TREE04. It might simply be low
> probability, so that TREE04 simply was unlucky enough to hit it first.
> In contrast, I have sufficient data to be reasonably confident that the
> callback-flooding OOMs really do have something to do with the TRACE01 and
> TREE10 scenarios, even though I am not yet seeing what these two scenarios
> have in common that they don't also have in common with other scenarios.
> But what is life without a bit of mystery? ;-)

:).

> Second, please see the attached tarball, which contains .csv files showing
> Kconfig options and kernel boot parameters for the various torture tests.
> The portions of the filenames preceding the "config.csv" correspond to
> the directories in tools/testing/selftests/rcutorture/configs.

So, at least some of the HZ_FULL=y tests don't run into problems.

> Third, there are additional scenarios hand-crafted by the script at
> tools/testing/selftests/rcutorture/bin/torture.sh. Thus far, none of
> them have triggered, other than via the newly increased difficulty
> of configurating a tracing-free kernel with which to test, but they
> can still be useful in ruling out particular Kconfig options or kernel
> boot parameters being related to a given issue.
>
> But please do take a look at the .csv files and let me know what
> adjustments would be appropriate given the failure information.

Nothing stands out just yet. Let me start a run here and see if
that gives me some ideas.

I'm guessing the splats don't give any useful information or
you would have attached them ;).

Thanks for testing, btw.

--
ankur

2024-02-16 02:59:33

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Thu, Feb 15, 2024 at 04:45:17PM -0800, Ankur Arora wrote:
>
> Paul E. McKenney <[email protected]> writes:
>
> > On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote:
> >>
> >> Paul E. McKenney <[email protected]> writes:
> >>
> >> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
> >> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote:
> >> >> >
> >> >> > Paul E. McKenney <[email protected]> writes:
> >> >> >
> >> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
> >> >> > >> Hi,
> >> >> > >>
> >> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like
> >> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> >> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> >> >> > >> on explicit preemption points for the voluntary models.
> >> >> > >>
> >> >> > >> The series is based on Thomas' original proposal which he outlined
> >> >> > >> in [1], [2] and in his PoC [3].
> >> >> > >>
> >> >> > >> An earlier RFC version is at [4].
> >> >> > >
> >> >> > > This uncovered a couple of latent bugs in RCU due to its having been
> >> >> > > a good long time since anyone built a !SMP preemptible kernel with
> >> >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most
> >> >> > > likely for the merge window after next, but let me know if you need
> >> >> > > them sooner.
> >> >> >
> >> >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing.
> >> >> > But, the attached diff should tide me over until the fixes are in.
> >> >>
> >> >> That was indeed my guess. ;-)
> >> >>
> >> >> > > I am also seeing OOM conditions during rcutorture testing of callback
> >> >> > > flooding, but I am still looking into this.
> >> >> >
> >> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration?
> >> >>
> >> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on
> >> >> two of them thus far. I am running a longer test to see if this might
> >> >> be just luck. If not, I look to see what rcutorture scenarios TREE10
> >> >> and TRACE01 have in common.
> >> >
> >> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what
> >> > sets them apart. I also hit a grace-period hang in TREE04, which does
> >> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something
> >> > to dig into more.
> >>
> >> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder
> >> if you would continue to hit the TREE04 hang with CONFIG_PREEMTP_NONE=y
> >> as well?
> >> (Just in the interest of minimizing configurations.)
> >
> > I would be happy to, but in the spirit of full disclosure...
> >
> > First, I have seen that failure only once, which is not enough to
> > conclude that it has much to do with TREE04. It might simply be low
> > probability, so that TREE04 simply was unlucky enough to hit it first.
> > In contrast, I have sufficient data to be reasonably confident that the
> > callback-flooding OOMs really do have something to do with the TRACE01 and
> > TREE10 scenarios, even though I am not yet seeing what these two scenarios
> > have in common that they don't also have in common with other scenarios.
> > But what is life without a bit of mystery? ;-)
>
> :).
>
> > Second, please see the attached tarball, which contains .csv files showing
> > Kconfig options and kernel boot parameters for the various torture tests.
> > The portions of the filenames preceding the "config.csv" correspond to
> > the directories in tools/testing/selftests/rcutorture/configs.
>
> So, at least some of the HZ_FULL=y tests don't run into problems.
>
> > Third, there are additional scenarios hand-crafted by the script at
> > tools/testing/selftests/rcutorture/bin/torture.sh. Thus far, none of
> > them have triggered, other than via the newly increased difficulty
> > of configurating a tracing-free kernel with which to test, but they
> > can still be useful in ruling out particular Kconfig options or kernel
> > boot parameters being related to a given issue.
> >
> > But please do take a look at the .csv files and let me know what
> > adjustments would be appropriate given the failure information.
>
> Nothing stands out just yet. Let me start a run here and see if
> that gives me some ideas.

Sounds good, thank you!

> I'm guessing the splats don't give any useful information or
> you would have attached them ;).

My plan is to extract what can be extracted from the overnight run
that I just started. Just in case the fixes have any effect on things,
unlikely though that might be given those fixes and the runs that failed.

> Thanks for testing, btw.

The sooner we find them, the sooner they get fixed. ;-)

Thanx, Paul

2024-02-17 00:56:11

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Thu, Feb 15, 2024 at 06:59:25PM -0800, Paul E. McKenney wrote:
> On Thu, Feb 15, 2024 at 04:45:17PM -0800, Ankur Arora wrote:
> >
> > Paul E. McKenney <[email protected]> writes:
> >
> > > On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote:
> > >>
> > >> Paul E. McKenney <[email protected]> writes:
> > >>
> > >> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
> > >> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote:
> > >> >> >
> > >> >> > Paul E. McKenney <[email protected]> writes:
> > >> >> >
> > >> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
> > >> >> > >> Hi,
> > >> >> > >>
> > >> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like
> > >> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> > >> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> > >> >> > >> on explicit preemption points for the voluntary models.
> > >> >> > >>
> > >> >> > >> The series is based on Thomas' original proposal which he outlined
> > >> >> > >> in [1], [2] and in his PoC [3].
> > >> >> > >>
> > >> >> > >> An earlier RFC version is at [4].
> > >> >> > >
> > >> >> > > This uncovered a couple of latent bugs in RCU due to its having been
> > >> >> > > a good long time since anyone built a !SMP preemptible kernel with
> > >> >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most
> > >> >> > > likely for the merge window after next, but let me know if you need
> > >> >> > > them sooner.
> > >> >> >
> > >> >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing.
> > >> >> > But, the attached diff should tide me over until the fixes are in.
> > >> >>
> > >> >> That was indeed my guess. ;-)
> > >> >>
> > >> >> > > I am also seeing OOM conditions during rcutorture testing of callback
> > >> >> > > flooding, but I am still looking into this.
> > >> >> >
> > >> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration?
> > >> >>
> > >> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on
> > >> >> two of them thus far. I am running a longer test to see if this might
> > >> >> be just luck. If not, I look to see what rcutorture scenarios TREE10
> > >> >> and TRACE01 have in common.
> > >> >
> > >> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what
> > >> > sets them apart. I also hit a grace-period hang in TREE04, which does
> > >> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something
> > >> > to dig into more.
> > >>
> > >> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder
> > >> if you would continue to hit the TREE04 hang with CONFIG_PREEMTP_NONE=y
> > >> as well?
> > >> (Just in the interest of minimizing configurations.)
> > >
> > > I would be happy to, but in the spirit of full disclosure...
> > >
> > > First, I have seen that failure only once, which is not enough to
> > > conclude that it has much to do with TREE04. It might simply be low
> > > probability, so that TREE04 simply was unlucky enough to hit it first.
> > > In contrast, I have sufficient data to be reasonably confident that the
> > > callback-flooding OOMs really do have something to do with the TRACE01 and
> > > TREE10 scenarios, even though I am not yet seeing what these two scenarios
> > > have in common that they don't also have in common with other scenarios.
> > > But what is life without a bit of mystery? ;-)
> >
> > :).
> >
> > > Second, please see the attached tarball, which contains .csv files showing
> > > Kconfig options and kernel boot parameters for the various torture tests.
> > > The portions of the filenames preceding the "config.csv" correspond to
> > > the directories in tools/testing/selftests/rcutorture/configs.
> >
> > So, at least some of the HZ_FULL=y tests don't run into problems.
> >
> > > Third, there are additional scenarios hand-crafted by the script at
> > > tools/testing/selftests/rcutorture/bin/torture.sh. Thus far, none of
> > > them have triggered, other than via the newly increased difficulty
> > > of configurating a tracing-free kernel with which to test, but they
> > > can still be useful in ruling out particular Kconfig options or kernel
> > > boot parameters being related to a given issue.
> > >
> > > But please do take a look at the .csv files and let me know what
> > > adjustments would be appropriate given the failure information.
> >
> > Nothing stands out just yet. Let me start a run here and see if
> > that gives me some ideas.
>
> Sounds good, thank you!
>
> > I'm guessing the splats don't give any useful information or
> > you would have attached them ;).
>
> My plan is to extract what can be extracted from the overnight run
> that I just started. Just in case the fixes have any effect on things,
> unlikely though that might be given those fixes and the runs that failed.

And I got no failures from either TREE10 or TRACE01 on last night's
run. I merged your series on top of v6.8-rc4 with the -rcu tree's
dev branch, the latter to get the RCU fixes. But this means that last
night's results are not really comparable to earlier results.

I did get a few TREE09 failures, but I get those anyway. I took it
apart below for you because I got confused and thought that it was a
TREE10 failure. So just in case you were curious what one of these
looks like and because I am too lazy to delete it. ;-)

So from the viewpoint of moderate rcutorture testing, this series
looks good. Woo hoo!!!

We did uncover a separate issue with Tasks RCU, which I will report on
in more detail separately. However, this issue does not (repeat, *not*)
affect lazy preemption as such, but instead any attempt to remove all
of the cond_resched() invocations.

My next step is to try this on bare metal on a system configured as
is the fleet. But good progress for a week!!!

Thanx, Paul

------------------------------------------------------------------------

[ 3458.875819] rcu_torture_fwd_prog: Starting forward-progress test 0
[ 3458.877155] rcu_torture_fwd_prog_cr: Starting forward-progress test 0

This says that rcutorture is starting a callback-flood
forward-progress test.

[ 3459.252546] rcu-torture: rtc: 00000000ec445089 ver: 298757 tfle: 0 rta: 298758 rtaf: 0 rtf: 298747 rtmbe: 0 rtmbkf: 0/0 rtbe: 0 rtbke: 0 rtbf: 0 rtb: 0 nt: 895741 barrier: 27715/27716:0 read-exits: 3984 nocb-toggles: 0:0
[ 3459.261545] rcu-torture: Reader Pipe: 363878289 159521 0 0 0 0 0 0 0 0 0
[ 3459.263883] rcu-torture: Reader Batch: 363126419 911391 0 0 0 0 0 0 0 0 0
[ 3459.267544] rcu-torture: Free-Block Circulation: 298757 298756 298754 298753 298752 298751 298750 298749 298748 298747 0

These lines are just statistics that rcutorture prints out
periodically. Thus far, nothing bad happened. This is one of a
few SMP scenarios that does not do CPU hotplug. But the TRACE01
scenario does do CPU hotplug, so not likely a determining factor.
Another difference is that TREE10 is the only scenario with more
than 16 CPUs, but then again, TRACE01 has only five.

[ 3459.733109] ------------[ cut here ]------------
[ 3459.734165] rcutorture_oom_notify invoked upon OOM during forward-progress testing.
[ 3459.735828] WARNING: CPU: 0 PID: 43 at kernel/rcu/rcutorture.c:2874 rcutorture_oom_notify+0x3e/0x1d0

Now something bad happened. RCU was unable to keep up with the
callback flood. Given that users can create callback floods
with close(open()) loops, this is not good.
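
(For illustration, and not something from this thread: the close(open())
loop Paul mentions is just an ordinary unprivileged program along the
lines of the sketch below. The kernel-side details of which teardown
paths go through call_rcu() vary by kernel version, but the net effect
is a steady stream of RCU callbacks, one or more per iteration.)

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* Each open()/close() cycle queues deferred cleanup work in the
	 * kernel, typically including RCU callbacks. */
	for (;;)
		close(open("/dev/null", O_RDONLY));
}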

[ 3459.737761] Modules linked in:
[ 3459.738408] CPU: 0 PID: 43 Comm: rcu_torture_fwd Not tainted 6.8.0-rc4-00096-g40c2642e6f24 #8252
[ 3459.740295] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 3459.742651] RIP: 0010:rcutorture_oom_notify+0x3e/0x1d0
[ 3459.743821] Code: e8 37 48 c2 00 48 8b 1d f8 b4 dc 01 48 85 db 0f 84 80 01 00 00 90 48 c7 c6 40 f5 e0 92 48 c7 c7 10 52 23 93 e8 d3 b9 f9 ff 90 <0f> 0b 90 90 8b 35 f8 c0 66 01 85 f6 7e 40 45 31 ed 4d 63 e5 41 83
[ 3459.747935] RSP: 0018:ffffabbb0015bb30 EFLAGS: 00010282
[ 3459.749061] RAX: 0000000000000000 RBX: ffff9485812ae000 RCX: 00000000ffffdfff
[ 3459.750601] RDX: 0000000000000000 RSI: 00000000ffffffea RDI: 0000000000000001
[ 3459.752026] RBP: ffffabbb0015bb98 R08: ffffffff93539388 R09: 00000000ffffdfff
[ 3459.753616] R10: ffffffff934593a0 R11: ffffffff935093a0 R12: 0000000000000000
[ 3459.755134] R13: ffffabbb0015bb98 R14: ffffffff93547da0 R15: 00000000ffffffff
[ 3459.756695] FS: 0000000000000000(0000) GS:ffffffff9344f000(0000) knlGS:0000000000000000
[ 3459.758443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3459.759672] CR2: 0000000000600298 CR3: 0000000001958000 CR4: 00000000000006f0
[ 3459.761253] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3459.762748] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3459.764472] Call Trace:
[ 3459.765003] <TASK>
[ 3459.765483] ? __warn+0x61/0xe0
[ 3459.766176] ? rcutorture_oom_notify+0x3e/0x1d0
[ 3459.767154] ? report_bug+0x174/0x180
[ 3459.767942] ? handle_bug+0x3d/0x70
[ 3459.768715] ? exc_invalid_op+0x18/0x70
[ 3459.769561] ? asm_exc_invalid_op+0x1a/0x20
[ 3459.770494] ? rcutorture_oom_notify+0x3e/0x1d0
[ 3459.771501] blocking_notifier_call_chain+0x5c/0x80
[ 3459.772553] out_of_memory+0x236/0x4b0
[ 3459.773365] __alloc_pages+0x9ca/0xb10
[ 3459.774233] ? set_next_entity+0x8b/0x150
[ 3459.775107] new_slab+0x382/0x430
[ 3459.776454] ___slab_alloc+0x23c/0x8c0
[ 3459.777315] ? preempt_schedule_irq+0x32/0x50
[ 3459.778319] ? rcu_torture_fwd_prog+0x6bf/0x970
[ 3459.779291] ? rcu_torture_fwd_prog+0x6bf/0x970
[ 3459.780264] ? rcu_torture_fwd_prog+0x6bf/0x970
[ 3459.781244] kmalloc_trace+0x179/0x1a0
[ 3459.784651] rcu_torture_fwd_prog+0x6bf/0x970
[ 3459.785529] ? __pfx_rcu_torture_fwd_prog+0x10/0x10
[ 3459.786617] ? kthread+0xc3/0xf0
[ 3459.787352] ? __pfx_rcu_torture_fwd_prog+0x10/0x10
[ 3459.788417] kthread+0xc3/0xf0
[ 3459.789101] ? __pfx_kthread+0x10/0x10
[ 3459.789906] ret_from_fork+0x2f/0x50
[ 3459.790708] ? __pfx_kthread+0x10/0x10
[ 3459.791523] ret_from_fork_asm+0x1a/0x30
[ 3459.792359] </TASK>
[ 3459.792835] ---[ end trace 0000000000000000 ]---

Standard rcutorture stack trace for this failure mode.

[ 3459.793849] rcu_torture_fwd_cb_hist: Callback-invocation histogram 0 (duration 913 jiffies): 1s/10: 0:1 2s/10: 719677:32517 3s/10: 646965:0

So the whole thing lasted less than a second (913 jiffies).
Each element of the histogram is 100 milliseconds worth. Nothing
came through during the first 100 ms (not surprising), and one
grace period elapsed (also not surprising). A lot of callbacks
came through in the second 100 ms (also not surprising), but there
were some tens of thousands of grace periods (extremely surprising).
The third 100 ms got a lot of callbacks but no grace periods
(not surprising).

Huh. This started at t=3458.877155 and we got the OOM at
t=3459.733109, which roughly matches the reported time.

[ 3459.796413] rcu: rcu_fwd_progress_check: GP age 737 jiffies

The callback flood does seem to have stalled grace periods,
though not by all *that* much.

[ 3459.799402] rcu: rcu_preempt: wait state: RCU_GP_WAIT_FQS(5) ->state: 0x402 ->rt_priority 0 delta ->gp_start 740 ->gp_activity 0 ->gp_req_activity 747 ->gp_wake_time 68 ->gp_wake_seq 5535689 ->gp_seq 5535689 ->gp_seq_needed 5535696 ->gp_max 713 ->gp_flags 0x0

The RCU grace-period kthread is in its loop looking for
quiescent states, and is executing normally ("->gp_activity 0",
as opposed to some huge number indicating that the kthread was
never awakened).

[ 3459.804267] rcu: rcu_node 0:0 ->gp_seq 5535689 ->gp_seq_needed 5535696 ->qsmask 0x0 ...G ->n_boosts 0

The "->qsmask 0x0" says that all CPUs have provided a quiescent
state, but the "G" indicates that the normal grace period is
blocked by some task preempted within an RCU read-side critical
section. This output is strange because a 56-CPU scenario should
have considerably more output.

Plus this means that this cannot possibly be TREE10 because that
scenario is non-preemptible, so there cannot be grace periods
waiting for quiescent states on anything but CPUs.

This happens from time to time because TREE09 runs on a single
CPU, and has preemption enabled, but not RCU priority boosting.
And this output is quite appropriate for a single-CPU scenario.

I probably should enable RCU priority boosting on this scenario.
I would also need it to be pretty fast off the mark, because we
OOMed about 700 milliseconds into the grace period.

But that has nothing to do with lazy preemption!

[ 3459.806271] rcu: cpu 0 ->gp_seq_needed 5535692
[ 3459.807139] rcu: RCU callbacks invoked since boot: 65398010
[ 3459.808374] rcu: rcu_fwd_progress_check: callbacks 0: 7484262
[ 3459.809640] rcutorture_oom_notify: Freed 1 RCU callbacks.
[ 3460.616268] rcutorture_oom_notify: Freed 7484253 RCU callbacks.
[ 3460.619551] rcutorture_oom_notify: Freed 0 RCU callbacks.
[ 3460.620740] rcutorture_oom_notify returning after OOM processing.
[ 3460.622719] rcu_torture_fwd_prog_cr: Waiting for CBs: rcu_barrier+0x0/0x2c0() 0
[ 3461.678556] rcu_torture_fwd_prog_nr: Starting forward-progress test 0
[ 3461.684546] rcu_torture_fwd_prog_nr: Waiting for CBs: rcu_barrier+0x0/0x2c0() 0


2024-02-17 04:01:15

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling


Paul E. McKenney <[email protected]> writes:

> On Thu, Feb 15, 2024 at 06:59:25PM -0800, Paul E. McKenney wrote:
>> On Thu, Feb 15, 2024 at 04:45:17PM -0800, Ankur Arora wrote:
>> >
>> > Paul E. McKenney <[email protected]> writes:
>> >
>> > > On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote:
>> > >>
>> > >> Paul E. McKenney <[email protected]> writes:
>> > >>
>> > >> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
>> > >> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote:
>> > >> >> >
>> > >> >> > Paul E. McKenney <[email protected]> writes:
>> > >> >> >
>> > >> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
>> > >> >> > >> Hi,
>> > >> >> > >>
>> > >> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like
>> > >> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>> > >> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
>> > >> >> > >> on explicit preemption points for the voluntary models.
>> > >> >> > >>
>> > >> >> > >> The series is based on Thomas' original proposal which he outlined
>> > >> >> > >> in [1], [2] and in his PoC [3].
>> > >> >> > >>
>> > >> >> > >> An earlier RFC version is at [4].
>> > >> >> > >
>> > >> >> > > This uncovered a couple of latent bugs in RCU due to its having been
>> > >> >> > > a good long time since anyone built a !SMP preemptible kernel with
>> > >> >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most
>> > >> >> > > likely for the merge window after next, but let me know if you need
>> > >> >> > > them sooner.
>> > >> >> >
>> > >> >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing.
>> > >> >> > But, the attached diff should tide me over until the fixes are in.
>> > >> >>
>> > >> >> That was indeed my guess. ;-)
>> > >> >>
>> > >> >> > > I am also seeing OOM conditions during rcutorture testing of callback
>> > >> >> > > flooding, but I am still looking into this.
>> > >> >> >
>> > >> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration?
>> > >> >>
>> > >> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on
>> > >> >> two of them thus far. I am running a longer test to see if this might
>> > >> >> be just luck. If not, I look to see what rcutorture scenarios TREE10
>> > >> >> and TRACE01 have in common.
>> > >> >
>> > >> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what
>> > >> > sets them apart. I also hit a grace-period hang in TREE04, which does
>> > >> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something
>> > >> > to dig into more.
>> > >>
>> > >> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder
>> > >> if you would continue to hit the TREE04 hang with CONFIG_PREEMTP_NONE=y
>> > >> as well?
>> > >> (Just in the interest of minimizing configurations.)
>> > >
>> > > I would be happy to, but in the spirit of full disclosure...
>> > >
>> > > First, I have seen that failure only once, which is not enough to
>> > > conclude that it has much to do with TREE04. It might simply be low
>> > > probability, so that TREE04 simply was unlucky enough to hit it first.
>> > > In contrast, I have sufficient data to be reasonably confident that the
>> > > callback-flooding OOMs really do have something to do with the TRACE01 and
>> > > TREE10 scenarios, even though I am not yet seeing what these two scenarios
>> > > have in common that they don't also have in common with other scenarios.
>> > > But what is life without a bit of mystery? ;-)
>> >
>> > :).
>> >
>> > > Second, please see the attached tarball, which contains .csv files showing
>> > > Kconfig options and kernel boot parameters for the various torture tests.
>> > > The portions of the filenames preceding the "config.csv" correspond to
>> > > the directories in tools/testing/selftests/rcutorture/configs.
>> >
>> > So, at least some of the HZ_FULL=y tests don't run into problems.
>> >
>> > > Third, there are additional scenarios hand-crafted by the script at
>> > > tools/testing/selftests/rcutorture/bin/torture.sh. Thus far, none of
>> > > them have triggered, other than via the newly increased difficulty
>> > > of configurating a tracing-free kernel with which to test, but they
>> > > can still be useful in ruling out particular Kconfig options or kernel
>> > > boot parameters being related to a given issue.
>> > >
>> > > But please do take a look at the .csv files and let me know what
>> > > adjustments would be appropriate given the failure information.
>> >
>> > Nothing stands out just yet. Let me start a run here and see if
>> > that gives me some ideas.
>>
>> Sounds good, thank you!
>>
>> > I'm guessing the splats don't give any useful information or
>> > you would have attached them ;).
>>
>> My plan is to extract what can be extracted from the overnight run
>> that I just started. Just in case the fixes have any effect on things,
>> unlikely though that might be given those fixes and the runs that failed.
>
> And I only got no failures from either TREE10 or TRACE01 on last night's
> run.

Oh that's great news. Same for my overnight runs for TREE04 and TRACE01.

Ongoing: a 24 hour run for those. Let's see how that goes.

> I merged your series on top of v6.8-rc4 with the -rcu tree's
> dev branch, the latter to get the RCU fixes. But this means that last
> night's results are not really comparable to earlier results.
>
> I did get a few TREE09 failures, but I get those anyway. I took it
> apart below for you because I got confused and thought that it was a
> TREE10 failure. So just in case you were curious what one of these
> looks like and because I am too lazy to delete it. ;-)

Heh. Well, thanks for being lazy /after/ dissecting it nicely.

> So from the viewpoint of moderate rcutorture testing, this series
> looks good. Woo hoo!!!

Awesome!

> We did uncover a separate issue with Tasks RCU, which I will report on
> in more detail separately. However, this issue does not (repeat, *not*)
> affect lazy preemption as such, but instead any attempt to remove all
> of the cond_resched() invocations.

So, that sounds like it happens even with (CONFIG_PREEMPT_AUTO=n,
CONFIG_PREEMPT=y)?
Anyway, I will look out for it when you go into the details.

> My next step is to try this on bare metal on a system configured as
> is the fleet. But good progress for a week!!!

Yeah this is great. Fingers crossed for the wider set of tests.

Thanks

--
ankur

2024-02-18 18:17:57

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Fri, Feb 16, 2024 at 07:59:45PM -0800, Ankur Arora wrote:
> Paul E. McKenney <[email protected]> writes:
> > On Thu, Feb 15, 2024 at 06:59:25PM -0800, Paul E. McKenney wrote:
> >> On Thu, Feb 15, 2024 at 04:45:17PM -0800, Ankur Arora wrote:
> >> >
> >> > Paul E. McKenney <[email protected]> writes:
> >> >
> >> > > On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote:
> >> > >>
> >> > >> Paul E. McKenney <[email protected]> writes:
> >> > >>
> >> > >> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
> >> > >> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote:
> >> > >> >> >
> >> > >> >> > Paul E. McKenney <[email protected]> writes:
> >> > >> >> >
> >> > >> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
> >> > >> >> > >> Hi,
> >> > >> >> > >>
> >> > >> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like
> >> > >> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> >> > >> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> >> > >> >> > >> on explicit preemption points for the voluntary models.
> >> > >> >> > >>
> >> > >> >> > >> The series is based on Thomas' original proposal which he outlined
> >> > >> >> > >> in [1], [2] and in his PoC [3].
> >> > >> >> > >>
> >> > >> >> > >> An earlier RFC version is at [4].
> >> > >> >> > >
> >> > >> >> > > This uncovered a couple of latent bugs in RCU due to its having been
> >> > >> >> > > a good long time since anyone built a !SMP preemptible kernel with
> >> > >> >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most
> >> > >> >> > > likely for the merge window after next, but let me know if you need
> >> > >> >> > > them sooner.
> >> > >> >> >
> >> > >> >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing.
> >> > >> >> > But, the attached diff should tide me over until the fixes are in.
> >> > >> >>
> >> > >> >> That was indeed my guess. ;-)
> >> > >> >>
> >> > >> >> > > I am also seeing OOM conditions during rcutorture testing of callback
> >> > >> >> > > flooding, but I am still looking into this.
> >> > >> >> >
> >> > >> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration?
> >> > >> >>
> >> > >> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on
> >> > >> >> two of them thus far. I am running a longer test to see if this might
> >> > >> >> be just luck. If not, I look to see what rcutorture scenarios TREE10
> >> > >> >> and TRACE01 have in common.
> >> > >> >
> >> > >> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what
> >> > >> > sets them apart. I also hit a grace-period hang in TREE04, which does
> >> > >> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something
> >> > >> > to dig into more.
> >> > >>
> >> > >> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder
> >> > >> if you would continue to hit the TREE04 hang with CONFIG_PREEMTP_NONE=y
> >> > >> as well?
> >> > >> (Just in the interest of minimizing configurations.)
> >> > >
> >> > > I would be happy to, but in the spirit of full disclosure...
> >> > >
> >> > > First, I have seen that failure only once, which is not enough to
> >> > > conclude that it has much to do with TREE04. It might simply be low
> >> > > probability, so that TREE04 simply was unlucky enough to hit it first.
> >> > > In contrast, I have sufficient data to be reasonably confident that the
> >> > > callback-flooding OOMs really do have something to do with the TRACE01 and
> >> > > TREE10 scenarios, even though I am not yet seeing what these two scenarios
> >> > > have in common that they don't also have in common with other scenarios.
> >> > > But what is life without a bit of mystery? ;-)
> >> >
> >> > :).
> >> >
> >> > > Second, please see the attached tarball, which contains .csv files showing
> >> > > Kconfig options and kernel boot parameters for the various torture tests.
> >> > > The portions of the filenames preceding the "config.csv" correspond to
> >> > > the directories in tools/testing/selftests/rcutorture/configs.
> >> >
> >> > So, at least some of the HZ_FULL=y tests don't run into problems.
> >> >
> >> > > Third, there are additional scenarios hand-crafted by the script at
> >> > > tools/testing/selftests/rcutorture/bin/torture.sh. Thus far, none of
> >> > > them have triggered, other than via the newly increased difficulty
> >> > > of configurating a tracing-free kernel with which to test, but they
> >> > > can still be useful in ruling out particular Kconfig options or kernel
> >> > > boot parameters being related to a given issue.
> >> > >
> >> > > But please do take a look at the .csv files and let me know what
> >> > > adjustments would be appropriate given the failure information.
> >> >
> >> > Nothing stands out just yet. Let me start a run here and see if
> >> > that gives me some ideas.
> >>
> >> Sounds good, thank you!
> >>
> >> > I'm guessing the splats don't give any useful information or
> >> > you would have attached them ;).
> >>
> >> My plan is to extract what can be extracted from the overnight run
> >> that I just started. Just in case the fixes have any effect on things,
> >> unlikely though that might be given those fixes and the runs that failed.
> >
> > And I only got no failures from either TREE10 or TRACE01 on last night's
> > run.
>
> Oh that's great news. Same for my overnight runs for TREE04 and TRACE01.
>
> Ongoing: a 24 hour run for those. Let's see how that goes.
>
> > I merged your series on top of v6.8-rc4 with the -rcu tree's
> > dev branch, the latter to get the RCU fixes. But this means that last
> > night's results are not really comparable to earlier results.
> >
> > I did get a few TREE09 failures, but I get those anyway. I took it
> > apart below for you because I got confused and thought that it was a
> > TREE10 failure. So just in case you were curious what one of these
> > looks like and because I am too lazy to delete it. ;-)
>
> Heh. Well, thanks for being lazy /after/ dissecting it nicely.
>
> > So from the viewpoint of moderate rcutorture testing, this series
> > looks good. Woo hoo!!!
>
> Awesome!
>
> > We did uncover a separate issue with Tasks RCU, which I will report on
> > in more detail separately. However, this issue does not (repeat, *not*)
> > affect lazy preemption as such, but instead any attempt to remove all
> > of the cond_resched() invocations.
>
> So, that sounds like it happens even with (CONFIG_PREEMPT_AUTO=n,
> CONFIG_PREEMPT=y)?
> Anyway will look out for it when you go into the detail.

Fair point, normally Tasks RCU isn't present when cond_resched()
means anything.

I will look again -- it is quite possible that I was confused by earlier
in-fleet setups that had Tasks RCU enabled even when preemption was
disabled. (We don't do that anymore, and, had I been paying sufficient
attention, would not have been doing it to start with. Back in the day,
enabling rcutorture, even as a module, had the side effect of enabling
Tasks RCU. How else to test it, right? Well...)

> > My next step is to try this on bare metal on a system configured as
> > is the fleet. But good progress for a week!!!
>
> Yeah this is great. Fingers crossed for the wider set of tests.

I got what might be a one-off when hitting rcutorture and KASAN harder.
I am running 320*TRACE01 to see if it reproduces.

In the meantime...

[ 2242.502082] ------------[ cut here ]------------
[ 2242.502946] WARNING: CPU: 3 PID: 72 at kernel/rcu/rcutorture.c:2839 rcu_torture_fwd_prog+0x1125/0x14e0

Callback-flooding events go for at most eight seconds, and end
earlier if the RCU flavor under test can clear the flood sooner.
If the flood does consume the full eight seconds, then we get the
above WARN_ON if too few callbacks are invoked in the meantime.
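
(As an aside, and not the actual kernel/rcu/rcutorture.c code: a
simplified sketch of the pattern described above is shown below --
flood call_rcu() for a fixed window, then complain if too few of the
posted callbacks were invoked in the meantime. All names and the
threshold are made up for illustration.)

#include <linux/atomic.h>
#include <linux/bug.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/slab.h>

struct fwd_cb {
	struct rcu_head rh;
};

static atomic_long_t fwd_cbs_invoked;

static void fwd_cb_func(struct rcu_head *rhp)
{
	atomic_long_inc(&fwd_cbs_invoked);
	kfree(container_of(rhp, struct fwd_cb, rh));
}

static void fwd_prog_flood_check(void)
{
	unsigned long stop = jiffies + 8 * HZ;	/* eight-second window */
	long posted = 0;

	atomic_long_set(&fwd_cbs_invoked, 0);
	while (time_before(jiffies, stop)) {
		struct fwd_cb *p = kzalloc(sizeof(*p), GFP_KERNEL);

		if (!p)
			break;			/* the flood outran the allocator */
		call_rcu(&p->rh, fwd_cb_func);
		posted++;
		cond_resched();
	}

	/*
	 * If grace periods made forward progress, a healthy fraction of the
	 * posted callbacks should already have been invoked; the 1/4 here is
	 * an arbitrary illustrative threshold, not rcutorture's.
	 */
	WARN_ON_ONCE(atomic_long_read(&fwd_cbs_invoked) < posted / 4);

	rcu_barrier();				/* drain whatever is left */
}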

So we get a splat, which is mostly there to make sure that
rcutorture complains about this, not much information here.

[ 2242.504580] Modules linked in:
[ 2242.505125] CPU: 3 PID: 72 Comm: rcu_torture_fwd Not tainted 6.8.0-rc4-00101-ga3ecbc334926 #8321
[ 2242.506652] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 2242.508577] RIP: 0010:rcu_torture_fwd_prog+0x1125/0x14e0
[ 2242.509513] Code: 4b f9 ff ff e8 ac 16 3d 00 e9 0e f9 ff ff 48 c7 c7 c0 b9 00 91 e8 9b 16 3d 00 e9 71 f9 ff ff e8 91 16 3d 00 e9 bb f0 ff ff 90 <0f> 0b 90 e9 38 fe ff ff 48 8b bd 48 ff ff ff e8 47 16 3d 00 e9 53
[ 2242.512705] RSP: 0018:ffff8880028b7da0 EFLAGS: 00010293
[ 2242.513615] RAX: 000000010031ebc4 RBX: 0000000000000000 RCX: ffffffff8d5c6040
[ 2242.514843] RDX: 00000001001da27d RSI: 0000000000000008 RDI: 0000000000000e10
[ 2242.516073] RBP: ffff8880028b7ee0 R08: 0000000000000000 R09: ffffed100036d340
[ 2242.517308] R10: ffff888001b69a07 R11: 0000000000030001 R12: dffffc0000000000
[ 2242.518537] R13: 000000000001869e R14: ffffffff9100b9c0 R15: ffff888002714000
[ 2242.519765] FS: 0000000000000000(0000) GS:ffff88806d100000(0000) knlGS:0000000000000000
[ 2242.521152] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2242.522151] CR2: 0000000000000000 CR3: 0000000003f26000 CR4: 00000000000006f0
[ 2242.523392] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2242.524624] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2242.525857] Call Trace:
[ 2242.526301] <TASK>
[ 2242.526674] ? __warn+0xc8/0x260
[ 2242.527258] ? rcu_torture_fwd_prog+0x1125/0x14e0
[ 2242.528077] ? report_bug+0x291/0x2e0
[ 2242.528726] ? handle_bug+0x3d/0x80
[ 2242.529348] ? exc_invalid_op+0x18/0x50
[ 2242.530022] ? asm_exc_invalid_op+0x1a/0x20
[ 2242.530753] ? kthread_should_stop+0x70/0xc0
[ 2242.531508] ? rcu_torture_fwd_prog+0x1125/0x14e0
[ 2242.532333] ? __pfx_rcu_torture_fwd_prog+0x10/0x10
[ 2242.533191] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 2242.534083] ? set_cpus_allowed_ptr+0x7c/0xb0
[ 2242.534847] ? __pfx_set_cpus_allowed_ptr+0x10/0x10
[ 2242.535696] ? __pfx_rcu_torture_fwd_prog+0x10/0x10
[ 2242.536547] ? kthread+0x24a/0x300
[ 2242.537159] ? __pfx_rcu_torture_fwd_prog+0x10/0x10
[ 2242.538038] kthread+0x24a/0x300
[ 2242.538607] ? __pfx_kthread+0x10/0x10
[ 2242.539283] ret_from_fork+0x2f/0x70
[ 2242.539907] ? __pfx_kthread+0x10/0x10
[ 2242.540571] ret_from_fork_asm+0x1b/0x30
[ 2242.541273] </TASK>
[ 2242.541661] ---[ end trace 0000000000000000 ]---

But now there is some information...

[ 2242.542471] rcu_torture_fwd_prog_cr Duration 8000 barrier: 749 pending 49999 n_launders: 99998 n_launders_sa: 99998 n_max_gps: 0 n_max_cbs: 50000 cver 0 gps 0

The flood lasted the full eight seconds ("Duration 8000").

The rcu_barrier_trace() operation consumed an additional 749ms
("barrier: 749").

There were almost 50K callbacks for that rcu_barrier_trace()
to deal with ("pending 49999").

There were almost 100K that were recycled directly, as opposed
to being newly allocated ("n_launders: 99998"), and all launders
happened after the last allocation ("n_launders_sa: 99998").
This is to be expected because RCU Tasks Trace does not do
priority boosting of preempted readers, and therefore rcutorture
limits the number of outstanding callbacks in the flood to 50K.
And it might never boost readers, given that it is legal to
block in an RCU Tasks Trace read-side critical section.

Insufficient callbacks were invoked ("n_max_gps: 0") and the
full 50K permitted was allocated ("n_max_cbs: 50000").

The rcu_torture_writer() did not see a full grace period in the
meantime ("cver 0") and there was no recorded grace period in
the meantime ("gps 0").

[ 2242.544890] rcu_torture_fwd_cb_hist: Callback-invocation histogram 0 (duration 8751 jiffies): 1s/10: 0:0 2s/10: 0:0 3s/10: 0:0 4s/10: 0:0 5s/10: 0:0 6s/10: 0:0 7s/10: 0:0 8s/10: 50000:0 9s/10: 0:0 10s/10: 0:0 11s/10: 0:0 12s/10: 0:0 13s/10: 0:0 14s/10: 0:0 15s/10: 0:0 16s/10: 49999:0 17s/10: 0:0 18s/10: 0:0 19s/10: 0:0 20s/10: 0:0 21s/10: 0:0 22s/10: 0:0 23s/10: 0:0 24s/10: 0:0 25s/10: 0:0 26s/10: 0:0 27s/10: 0:0 28s/10: 0:0 29s/10: 0:0 30s/10: 0:0 31s/10: 0:0 32s/10: 0:0 33s/10: 0:0 34s/10: 0:0 35s/10: 0:0 36s/10: 0:0 37s/10: 0:0 38s/10: 0:0 39s/10: 0:0 40s/10: 0:0 41s/10: 0:0 42s/10: 0:0 43s/10: 0:0 44s/10: 0:0 45s/10: 0:0 46s/10: 0:0 47s/10: 0:0 48s/10: 0:0 49s/10: 0:0 50s/10: 0:0 51s/10: 0:0 52s/10: 0:0 53s/10: 0:0 54s/10: 0:0 55s/10: 0:0 56s/10: 0:0 57s/10: 0:0 58s/10: 0:0 59s/10: 0:0 60s/10: 0:0 61s/10: 0:0 62s/10: 0:0 63s/10: 0:0 64s/10: 0:0 65s/10: 0:0 66s/10: 0:0 67s/10: 0:0 68s/10: 0:0 69s/10: 0:0 70s/10: 0:0 71s/10: 0:0 72s/10: 0:0 73s/10: 0:0 74s/10: 0:0 75s/10: 0:0 76s/10: 0:0 77s/10: 0:0 78s/10: 0:0
[ 2242.545050] 79s/10: 0:0 80s/10: 0:0 81s/10: 49999:0

Except that we can see callbacks having been invoked during this
time ("8s/10: 50000:0", "16s/10: 49999:0", "81s/10: 49999:0").

In part because rcutorture is unaware of RCU Tasks Trace's
grace-period sequence number.

So, first see if it is reproducible, second enable more diagnostics,
third make more grace-period sequence numbers available to rcutorture,
fourth recheck the diagnostics code, and then see where we go from there.
It might be that lazy preemption needs adjustment, or it might be that
it just tickled latent diagnostic issues in rcutorture.

(I rarely hit this WARN_ON() except in early development, when the
problem is usually glaringly obvious, hence all the uncertainty.)

Thanx, Paul

2024-02-19 12:33:58

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 17/30] x86/thread_info: define TIF_NEED_RESCHED_LAZY

On Wed, Feb 14, 2024 at 12:31:29PM -0800, Ankur Arora wrote:
> Mark Rutland <[email protected]> writes:
> > On Mon, Feb 12, 2024 at 09:55:41PM -0800, Ankur Arora wrote:

> >> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> >> index d63b02940747..88c1802185fc 100644
> >> --- a/arch/x86/include/asm/thread_info.h
> >> +++ b/arch/x86/include/asm/thread_info.h
> >> @@ -81,8 +81,11 @@ struct thread_info {
> >> #define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
> >> #define TIF_SIGPENDING 2 /* signal pending */
> >> #define TIF_NEED_RESCHED 3 /* rescheduling necessary */
> >> -#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
> >> -#define TIF_SSBD 5 /* Speculative store bypass disable */
> >> +#ifdef CONFIG_PREEMPT_AUTO
> >> +#define TIF_NEED_RESCHED_LAZY 4 /* Lazy rescheduling */
> >> +#endif
> >> +#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/
> >> +#define TIF_SSBD 6 /* Speculative store bypass disable */
> >
> > It's a bit awkward/ugly to conditionally define the TIF_* bits in arch code,
> > and we don't do that for other bits that are only used in some configurations
> > (e.g. TIF_UPROBE). That's not just for aesthetics -- for example, on arm64 we
> > try to keep the TIF_WORK_MASK bits contiguous, which is difficult if a bit in
> > the middle doesn't exist in some configurations.
>
> That's useful to know. And, I think you are right about the
> ugliness of this.
>
> > Is it painful to organise the common code so that arch code can define
> > TIF_NEED_RESCHED_LAZY regardless of whether CONFIG_PREEMPT_AUTO is selected?
>
> So, the original reason I did it this way was because I wanted to have
> zero performance impact on !CONFIG_PREEMPT_AUTO configurations whether
> TIF_NEED_RESCHED_LAZY was defined or not.
> (I was doing some computation with TIF_NEED_RESCHED_LAZY at that point.)
>
> Eventually I changed that part of code but this stayed.
>
> Anyway, this should be easy enough to fix with some #ifdefry.
>
> Thanks for reviewing.

Great!

BTW, the series overall looks to be in very good shape; thanks a lot for
working on this!

Mark.
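
(For illustration, a minimal sketch of the reorganisation suggested
above: keep the TIF bit unconditionally defined in the arch header so
the bit numbering never depends on the preemption model, and let
generic code ignore the lazy bit when CONFIG_PREEMPT_AUTO is off.
Names such as _TIF_RESCHED_MASK are hypothetical, not from the posted
series.)

/* arch header, sketch: bit numbers no longer depend on CONFIG_PREEMPT_AUTO */
#define TIF_NEED_RESCHED	3	/* rescheduling necessary */
#define TIF_NEED_RESCHED_LAZY	4	/* lazy rescheduling */
#define TIF_SINGLESTEP		5	/* reenable singlestep on user return */
#define TIF_SSBD		6	/* Speculative store bypass disable */

#define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
#define _TIF_NEED_RESCHED_LAZY	(1 << TIF_NEED_RESCHED_LAZY)

/* generic code, sketch: the lazy bit is only acted upon when enabled */
#ifdef CONFIG_PREEMPT_AUTO
# define _TIF_RESCHED_MASK	(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)
#else
# define _TIF_RESCHED_MASK	_TIF_NEED_RESCHED
#endif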

2024-02-19 15:55:55

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Tue, Feb 13 2024 at 20:27, Thomas Gleixner wrote:
> On Mon, Feb 12 2024 at 21:55, Ankur Arora wrote:
>
>> Hi,
>>
>> This series adds a new scheduling model PREEMPT_AUTO, which like
>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
>> on explicit preemption points for the voluntary models.
>>
>> The series is based on Thomas' original proposal which he outlined
>> in [1], [2] and in his PoC [3].
>>
>> An earlier RFC version is at [4].

Aside of the nitpicks from me and Mark this really looks promising!

Thanks for working through this.

tglx


2024-02-19 16:48:28

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Sun, Feb 18, 2024 at 10:17:48AM -0800, Paul E. McKenney wrote:
> On Fri, Feb 16, 2024 at 07:59:45PM -0800, Ankur Arora wrote:
> > Paul E. McKenney <[email protected]> writes:
> > > On Thu, Feb 15, 2024 at 06:59:25PM -0800, Paul E. McKenney wrote:
> > >> On Thu, Feb 15, 2024 at 04:45:17PM -0800, Ankur Arora wrote:
> > >> >
> > >> > Paul E. McKenney <[email protected]> writes:
> > >> >
> > >> > > On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote:
> > >> > >>
> > >> > >> Paul E. McKenney <[email protected]> writes:
> > >> > >>
> > >> > >> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
> > >> > >> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote:
> > >> > >> >> >
> > >> > >> >> > Paul E. McKenney <[email protected]> writes:
> > >> > >> >> >
> > >> > >> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
> > >> > >> >> > >> Hi,
> > >> > >> >> > >>
> > >> > >> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like
> > >> > >> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> > >> > >> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> > >> > >> >> > >> on explicit preemption points for the voluntary models.
> > >> > >> >> > >>
> > >> > >> >> > >> The series is based on Thomas' original proposal which he outlined
> > >> > >> >> > >> in [1], [2] and in his PoC [3].
> > >> > >> >> > >>
> > >> > >> >> > >> An earlier RFC version is at [4].
> > >> > >> >> > >
> > >> > >> >> > > This uncovered a couple of latent bugs in RCU due to its having been
> > >> > >> >> > > a good long time since anyone built a !SMP preemptible kernel with
> > >> > >> >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most
> > >> > >> >> > > likely for the merge window after next, but let me know if you need
> > >> > >> >> > > them sooner.
> > >> > >> >> >
> > >> > >> >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing.
> > >> > >> >> > But, the attached diff should tide me over until the fixes are in.
> > >> > >> >>
> > >> > >> >> That was indeed my guess. ;-)
> > >> > >> >>
> > >> > >> >> > > I am also seeing OOM conditions during rcutorture testing of callback
> > >> > >> >> > > flooding, but I am still looking into this.
> > >> > >> >> >
> > >> > >> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration?
> > >> > >> >>
> > >> > >> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on
> > >> > >> >> two of them thus far. I am running a longer test to see if this might
> > >> > >> >> be just luck. If not, I look to see what rcutorture scenarios TREE10
> > >> > >> >> and TRACE01 have in common.
> > >> > >> >
> > >> > >> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what
> > >> > >> > sets them apart. I also hit a grace-period hang in TREE04, which does
> > >> > >> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something
> > >> > >> > to dig into more.
> > >> > >>
> > >> > >> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder
> > >> if you would continue to hit the TREE04 hang with CONFIG_PREEMPT_NONE=y
> > >> > >> as well?
> > >> > >> (Just in the interest of minimizing configurations.)
> > >> > >
> > >> > > I would be happy to, but in the spirit of full disclosure...
> > >> > >
> > >> > > First, I have seen that failure only once, which is not enough to
> > >> > > conclude that it has much to do with TREE04. It might simply be low
> > >> > > probability, so that TREE04 simply was unlucky enough to hit it first.
> > >> > > In contrast, I have sufficient data to be reasonably confident that the
> > >> > > callback-flooding OOMs really do have something to do with the TRACE01 and
> > >> > > TREE10 scenarios, even though I am not yet seeing what these two scenarios
> > >> > > have in common that they don't also have in common with other scenarios.
> > >> > > But what is life without a bit of mystery? ;-)
> > >> >
> > >> > :).
> > >> >
> > >> > > Second, please see the attached tarball, which contains .csv files showing
> > >> > > Kconfig options and kernel boot parameters for the various torture tests.
> > >> > > The portions of the filenames preceding the "config.csv" correspond to
> > >> > > the directories in tools/testing/selftests/rcutorture/configs.
> > >> >
> > >> > So, at least some of the HZ_FULL=y tests don't run into problems.
> > >> >
> > >> > > Third, there are additional scenarios hand-crafted by the script at
> > >> > > tools/testing/selftests/rcutorture/bin/torture.sh. Thus far, none of
> > >> > > them have triggered, other than via the newly increased difficulty
> > >> > > of configuring a tracing-free kernel with which to test, but they
> > >> > > can still be useful in ruling out particular Kconfig options or kernel
> > >> > > boot parameters being related to a given issue.
> > >> > >
> > >> > > But please do take a look at the .csv files and let me know what
> > >> > > adjustments would be appropriate given the failure information.
> > >> >
> > >> > Nothing stands out just yet. Let me start a run here and see if
> > >> > that gives me some ideas.
> > >>
> > >> Sounds good, thank you!
> > >>
> > >> > I'm guessing the splats don't give any useful information or
> > >> > you would have attached them ;).
> > >>
> > >> My plan is to extract what can be extracted from the overnight run
> > >> that I just started. Just in case the fixes have any effect on things,
> > >> unlikely though that might be given those fixes and the runs that failed.
> > >
> > > And I got no failures from either TREE10 or TRACE01 on last night's
> > > run.
> >
> > Oh that's great news. Same for my overnight runs for TREE04 and TRACE01.
> >
> > Ongoing: a 24 hour run for those. Let's see how that goes.
> >
> > > I merged your series on top of v6.8-rc4 with the -rcu tree's
> > > dev branch, the latter to get the RCU fixes. But this means that last
> > > night's results are not really comparable to earlier results.
> > >
> > > I did get a few TREE09 failures, but I get those anyway. I took it
> > > apart below for you because I got confused and thought that it was a
> > > TREE10 failure. So just in case you were curious what one of these
> > > looks like and because I am too lazy to delete it. ;-)
> >
> > Heh. Well, thanks for being lazy /after/ dissecting it nicely.
> >
> > > So from the viewpoint of moderate rcutorture testing, this series
> > > looks good. Woo hoo!!!
> >
> > Awesome!
> >
> > > We did uncover a separate issue with Tasks RCU, which I will report on
> > > in more detail separately. However, this issue does not (repeat, *not*)
> > > affect lazy preemption as such, but instead any attempt to remove all
> > > of the cond_resched() invocations.
> >
> > So, that sounds like it happens even with (CONFIG_PREEMPT_AUTO=n,
> > CONFIG_PREEMPT=y)?
> > Anyway will look out for it when you go into the detail.
>
> Fair point, normally Tasks RCU isn't present when cond_resched()
> means anything.
>
> I will look again -- it is quite possible that I was confused by earlier
> in-fleet setups that had Tasks RCU enabled even when preemption was
> disabled. (We don't do that anymore, and, had I been paying sufficient
> attention, would not have been doing it to start with. Back in the day,
> enabling rcutorture, even as a module, had the side effect of enabling
> Tasks RCU. How else to test it, right? Well...)

OK, I got my head straight on this one...

And the problem is in fact that Tasks RCU isn't normally present
in non-preemptible kernels. This is because normal RCU will wait
for preemption-disabled regions of code, and in PREEMPT_NONE and
PREEMPT_VOLUNTARY kernels, that includes pretty much any region of code
lacking an explicit schedule() or similar. And as I understand it,
tracing trampolines rely on this implicit lack of preemption.

So, with lazy preemption, we could preempt in the middle of a
trampoline, and synchronize_rcu() won't save us.

Steve and Mathieu will correct me if I am wrong.

If I do understand this correctly, one workaround is to remove the
"if PREEMPTIBLE" on all occurrences of "select TASKS_RCU". That way,
all kernels would use synchronize_rcu_tasks(), which would wait for
a voluntary context switch.

This workaround does increase the overhead and tracepoint-removal
latency on non-preemptible kernels, so it might be time to revisit the
synchronization of trampolines. Unfortunately, the things I have come
up with thus far have disadvantages:

o Keep a set of permanent trampolines that enter and exit
some sort of explicit RCU read-side critical section.
If the address for this trampoline to call is in a register,
then these permanent trampolines remain constant so that
no synchronization of them is required. The selected
flavor of RCU can then be used to deal with the non-permanent
trampolines.

The disadvantage here is a significant increase in the complexity
and overhead of trampoline code and the code that invokes the
trampolines. This overhead limits where tracing may be used
in the kernel, which is of course undesirable.

o Check for being preempted within a trampoline, and track this
within the tasks structure. The disadvantage here is that this
requires keeping track of all of the trampolines and adding a
check for being in one on a scheduler fast path.

o Have a variant of Tasks RCU which checks the stack of preempted
tasks, waiting until all have been seen without being preempted
in a trampoline. This still requires keeping track of all the
trampolines in an easy-to-search manner, but gets the overhead
of searching off of the scheduler fastpaths.

It is also necessary to check running tasks, which might have
been interrupted from within a trampoline.

I would have a hard time convincing myself that these return
addresses were unconditionally reliable. But maybe they are?

o Your idea here!

Again, the short-term workaround is to remove the "if PREEMPTIBLE" from
all of the "select TASKS_RCU" clauses.

> > > My next step is to try this on bare metal on a system configured as
> > > is the fleet. But good progress for a week!!!
> >
> > Yeah this is great. Fingers crossed for the wider set of tests.
>
> I got what might be a one-off when hitting rcutorture and KASAN harder.
> I am running 320*TRACE01 to see if it reproduces.

[ . . . ]

> So, first see if it is reproducible, second enable more diagnostics,
> third make more grace-period sequence numbers available to rcutorture,
> fourth recheck the diagnostics code, and then see where we go from there.
> It might be that lazy preemption needs adjustment, or it might be that
> it just tickled latent diagnostic issues in rcutorture.
>
> (I rarely hit this WARN_ON() except in early development, when the
> problem is usually glaringly obvious, hence all the uncertainty.)

And it is eminently reproducible. Digging into it...

Thanx, Paul

2024-02-21 06:49:36

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling


Paul E. McKenney <[email protected]> writes:

> On Thu, Feb 15, 2024 at 06:59:25PM -0800, Paul E. McKenney wrote:
>> On Thu, Feb 15, 2024 at 04:45:17PM -0800, Ankur Arora wrote:
>> >
>> > Paul E. McKenney <[email protected]> writes:
>> >
>> > > On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote:
>> > >>
>> > >> Paul E. McKenney <[email protected]> writes:
>> > >>
>> > >> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
>> > >> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote:
>> > >> >> >
>> > >> >> > Paul E. McKenney <[email protected]> writes:
>> > >> >> >
>> > >> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
>> > >> >> > >> Hi,
>> > >> >> > >>
>> > >> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like
>> > >> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>> > >> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
>> > >> >> > >> on explicit preemption points for the voluntary models.
>> > >> >> > >>
>> > >> >> > >> The series is based on Thomas' original proposal which he outlined
>> > >> >> > >> in [1], [2] and in his PoC [3].
>> > >> >> > >>
>> > >> >> > >> An earlier RFC version is at [4].
>> > >> >> > >
>> > >> >> > > This uncovered a couple of latent bugs in RCU due to its having been
>> > >> >> > > a good long time since anyone built a !SMP preemptible kernel with
>> > >> >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most
>> > >> >> > > likely for the merge window after next, but let me know if you need
>> > >> >> > > them sooner.
>> > >> >> >
>> > >> >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing.
>> > >> >> > But, the attached diff should tide me over until the fixes are in.
>> > >> >>
>> > >> >> That was indeed my guess. ;-)
>> > >> >>
>> > >> >> > > I am also seeing OOM conditions during rcutorture testing of callback
>> > >> >> > > flooding, but I am still looking into this.
>> > >> >> >
>> > >> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration?
>> > >> >>
>> > >> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on
>> > >> >> two of them thus far. I am running a longer test to see if this might
>> > >> >> be just luck. If not, I look to see what rcutorture scenarios TREE10
>> > >> >> and TRACE01 have in common.
>> > >> >
>> > >> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what
>> > >> > sets them apart. I also hit a grace-period hang in TREE04, which does
>> > >> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something
>> > >> > to dig into more.
>> > >>
>> > >> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder
>> > >> if you would continue to hit the TREE04 hang with CONFIG_PREEMPT_NONE=y
>> > >> as well?
>> > >> (Just in the interest of minimizing configurations.)
>> > >
>> > > I would be happy to, but in the spirit of full disclosure...
>> > >
>> > > First, I have seen that failure only once, which is not enough to
>> > > conclude that it has much to do with TREE04. It might simply be low
>> > > probability, so that TREE04 simply was unlucky enough to hit it first.
>> > > In contrast, I have sufficient data to be reasonably confident that the
>> > > callback-flooding OOMs really do have something to do with the TRACE01 and
>> > > TREE10 scenarios, even though I am not yet seeing what these two scenarios
>> > > have in common that they don't also have in common with other scenarios.
>> > > But what is life without a bit of mystery? ;-)
>> >
>> > :).
>> >
>> > > Second, please see the attached tarball, which contains .csv files showing
>> > > Kconfig options and kernel boot parameters for the various torture tests.
>> > > The portions of the filenames preceding the "config.csv" correspond to
>> > > the directories in tools/testing/selftests/rcutorture/configs.
>> >
>> > So, at least some of the HZ_FULL=y tests don't run into problems.
>> >
>> > > Third, there are additional scenarios hand-crafted by the script at
>> > > tools/testing/selftests/rcutorture/bin/torture.sh. Thus far, none of
>> > > them have triggered, other than via the newly increased difficulty
>> > > of configuring a tracing-free kernel with which to test, but they
>> > > can still be useful in ruling out particular Kconfig options or kernel
>> > > boot parameters being related to a given issue.
>> > >
>> > > But please do take a look at the .csv files and let me know what
>> > > adjustments would be appropriate given the failure information.
>> >
>> > Nothing stands out just yet. Let me start a run here and see if
>> > that gives me some ideas.
>>
>> Sounds good, thank you!
>>
>> > I'm guessing the splats don't give any useful information or
>> > you would have attached them ;).
>>
>> My plan is to extract what can be extracted from the overnight run
>> that I just started. Just in case the fixes have any effect on things,
>> unlikely though that might be given those fixes and the runs that failed.
>
> And I got no failures from either TREE10 or TRACE01 on last night's
> run. I merged your series on top of v6.8-rc4 with the -rcu tree's
> dev branch, the latter to get the RCU fixes. But this means that last
> night's results are not really comparable to earlier results.

Not sure if you saw any other instances of this since, but a couple of
things I belatedly noticed below.

[ ... ]

> [ 3459.733109] ------------[ cut here ]------------
> [ 3459.734165] rcutorture_oom_notify invoked upon OOM during forward-progress testing.
> [ 3459.735828] WARNING: CPU: 0 PID: 43 at kernel/rcu/rcutorture.c:2874 rcutorture_oom_notify+0x3e/0x1d0
>
> Now something bad happened. RCU was unable to keep up with the
> callback flood. Given that users can create callback floods
> with close(open()) loops, this is not good.
>
> [ 3459.737761] Modules linked in:
> [ 3459.738408] CPU: 0 PID: 43 Comm: rcu_torture_fwd Not tainted 6.8.0-rc4-00096-g40c2642e6f24 #8252
> [ 3459.740295] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> [ 3459.742651] RIP: 0010:rcutorture_oom_notify+0x3e/0x1d0
> [ 3459.743821] Code: e8 37 48 c2 00 48 8b 1d f8 b4 dc 01 48 85 db 0f 84 80 01 00 00 90 48 c7 c6 40 f5 e0 92 48 c7 c7 10 52 23 93 e8 d3 b9 f9 ff 90 <0f> 0b 90 90 8b 35 f8 c0 66 01 85 f6 7e 40 45 31 ed 4d 63 e5 41 83
> [ 3459.747935] RSP: 0018:ffffabbb0015bb30 EFLAGS: 00010282
> [ 3459.749061] RAX: 0000000000000000 RBX: ffff9485812ae000 RCX: 00000000ffffdfff
> [ 3459.750601] RDX: 0000000000000000 RSI: 00000000ffffffea RDI: 0000000000000001
> [ 3459.752026] RBP: ffffabbb0015bb98 R08: ffffffff93539388 R09: 00000000ffffdfff
> [ 3459.753616] R10: ffffffff934593a0 R11: ffffffff935093a0 R12: 0000000000000000
> [ 3459.755134] R13: ffffabbb0015bb98 R14: ffffffff93547da0 R15: 00000000ffffffff
> [ 3459.756695] FS: 0000000000000000(0000) GS:ffffffff9344f000(0000) knlGS:0000000000000000
> [ 3459.758443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3459.759672] CR2: 0000000000600298 CR3: 0000000001958000 CR4: 00000000000006f0
> [ 3459.761253] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 3459.762748] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 3459.764472] Call Trace:
> [ 3459.765003] <TASK>
> [ 3459.765483] ? __warn+0x61/0xe0
> [ 3459.766176] ? rcutorture_oom_notify+0x3e/0x1d0
> [ 3459.767154] ? report_bug+0x174/0x180
> [ 3459.767942] ? handle_bug+0x3d/0x70
> [ 3459.768715] ? exc_invalid_op+0x18/0x70
> [ 3459.769561] ? asm_exc_invalid_op+0x1a/0x20
> [ 3459.770494] ? rcutorture_oom_notify+0x3e/0x1d0
> [ 3459.771501] blocking_notifier_call_chain+0x5c/0x80
> [ 3459.772553] out_of_memory+0x236/0x4b0
> [ 3459.773365] __alloc_pages+0x9ca/0xb10
> [ 3459.774233] ? set_next_entity+0x8b/0x150
> [ 3459.775107] new_slab+0x382/0x430
> [ 3459.776454] ___slab_alloc+0x23c/0x8c0
> [ 3459.777315] ? preempt_schedule_irq+0x32/0x50
> [ 3459.778319] ? rcu_torture_fwd_prog+0x6bf/0x970
> [ 3459.779291] ? rcu_torture_fwd_prog+0x6bf/0x970
> [ 3459.780264] ? rcu_torture_fwd_prog+0x6bf/0x970
> [ 3459.781244] kmalloc_trace+0x179/0x1a0
> [ 3459.784651] rcu_torture_fwd_prog+0x6bf/0x970
> [ 3459.785529] ? __pfx_rcu_torture_fwd_prog+0x10/0x10
> [ 3459.786617] ? kthread+0xc3/0xf0
> [ 3459.787352] ? __pfx_rcu_torture_fwd_prog+0x10/0x10
> [ 3459.788417] kthread+0xc3/0xf0
> [ 3459.789101] ? __pfx_kthread+0x10/0x10
> [ 3459.789906] ret_from_fork+0x2f/0x50
> [ 3459.790708] ? __pfx_kthread+0x10/0x10
> [ 3459.791523] ret_from_fork_asm+0x1a/0x30
> [ 3459.792359] </TASK>
> [ 3459.792835] ---[ end trace 0000000000000000 ]---
>
> Standard rcutorture stack trace for this failure mode.

I see a preempt_schedule_irq() in the stack. So, I guess that at some
point current (the task responsible for the callback flood?) was marked
for lazy scheduling, did not schedule out, and then was eagerly
preempted out at the next tick.
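
A rough sketch of that escalation, purely illustrative and not the code
from this series, might look like the following, assuming the
TIF_NEED_RESCHED_LAZY flag added by the series and the standard
thread-flag helpers:

#include <linux/sched.h>

/*
 * Sketch: if the task was only marked for lazy rescheduling but is still
 * on the CPU when the next tick arrives, upgrade it to an immediate
 * resched so the usual preemption points, including
 * preempt_schedule_irq(), will honor it.
 */
static void maybe_upgrade_lazy_resched(struct task_struct *curr)
{
        if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY) &&
            !test_tsk_thread_flag(curr, TIF_NEED_RESCHED))
                set_tsk_need_resched(curr);
}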

> [ 3459.793849] rcu_torture_fwd_cb_hist: Callback-invocation histogram 0 (duration 913 jiffies): 1s/10: 0:1 2s/10: 719677:32517 3s/10: 646965:0
>
> So the whole thing lasted less than a second (913 jiffies).
> Each element of the histogram is 100 milliseconds worth. Nothing
> came through during the first 100 ms (not surprising), and one
> grace period elapsed (also not surprising). A lot of callbacks
> came through in the second 100 ms (also not surprising), but there
> were some tens of thousand grace periods (extremely surprising).
> The third 100 ms got a lot of callbacks but no grace periods
> (not surprising).
>
> Huh. This started at t=3458.877155 and we got the OOM at
> t=3459.733109, which roughly matches the reported time.
>
> [ 3459.796413] rcu: rcu_fwd_progress_check: GP age 737 jiffies
>
> The callback flood does seem to have stalled grace periods,
> though not by all *that* much.
>
> [ 3459.799402] rcu: rcu_preempt: wait state: RCU_GP_WAIT_FQS(5) ->state: 0x402 ->rt_priority 0 delta ->gp_start 740 ->gp_activity 0 ->gp_req_activity 747 ->gp_wake_time 68 ->gp_wake_seq 5535689 ->gp_seq 5535689 ->gp_seq_needed 5535696 ->gp_max 713 ->gp_flags 0x0
>
> The RCU grace-period kthread is in its loop looking for
> quiescent states, and is executing normally ("->gp_activity 0",
> as opposed to some huge number indicating that the kthread was
> never awakened).
>
> [ 3459.804267] rcu: rcu_node 0:0 ->gp_seq 5535689 ->gp_seq_needed 5535696 ->qsmask 0x0 ...G ->n_boosts 0
>
> The "->qsmask 0x0" says that all CPUs have provided a quiescent
> state, but the "G" indicates that the normal grace period is
> blocked by some task preempted within an RCU read-side critical
> section. This output is strange because a 56-CPU scenario should
> have considerably more output.
>
> Plus this means that this cannot possibly be TREE10 because that
> scenario is non-preemptible, so there cannot be grace periods
> waiting for quiescent states on anything but CPUs.

Might be missing the point, but with CONFIG_PREEMPT_NONE, you could
be preempted if you exceed your time quantum by more than one tick.
Though that of course needs the task to not be in the read-side critical
section.

Thanks

--
ankur

2024-02-21 06:49:49

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling


Thomas Gleixner <[email protected]> writes:

> On Tue, Feb 13 2024 at 20:27, Thomas Gleixner wrote:
>> On Mon, Feb 12 2024 at 21:55, Ankur Arora wrote:
>>
>>> Hi,
>>>
>>> This series adds a new scheduling model PREEMPT_AUTO, which like
>>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>>> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
>>> on explicit preemption points for the voluntary models.
>>>
>>> The series is based on Thomas' original proposal which he outlined
>>> in [1], [2] and in his PoC [3].
>>>
>>> An earlier RFC version is at [4].
>
> Aside of the nitpicks from me and Mark this really looks promising!

Thanks, I had great material to work with.

--
ankur

2024-02-21 12:24:04

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On 2/13/2024 11:25 AM, Ankur Arora wrote:
> Hi,
>
> This series adds a new scheduling model PREEMPT_AUTO, which like
> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> on explicit preemption points for the voluntary models.
>
> The series is based on Thomas' original proposal which he outlined
> in [1], [2] and in his PoC [3].
>
> An earlier RFC version is at [4].
>

[...]

Hello Ankur,
Thank you for the series. Just giving a crisp summary, since I am
expecting a respin of the patch series with the minor changes suggested by
Thomas and Mark and a fix by Paul, and am looking forward to testing that.

I was able to test the current patchset in a rather different way.

On Milan (2 node, 256 CPU, 512GB RAM), I did my regular benchmark
testing to see if there are any surprises.

Will do more detailed testing/analysis with some of the scheduler-specific
tests as well after your respin.

Configuration tested.
a) Base kernel (6.7),
b) patched with PREEMPT_AUTO voluntary preemption.
c) patched with PREEMPT_DYNAMIC voluntary preemption.

Workloads I tested and their %gain:

              case b    case c
NAS            +2.7      +1.9
Hashjoin       +0        +0
Graph500       -6        +0
XSBench        +1.7      +0

Also did the kernbench etc. tests from Mel's mmtests suite. Did not notice
much difference.

In summary, the benchmarks are mostly on the positive side.

Thanks and Regards
- Raghu

2024-02-21 17:16:17

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
> Configuration tested.
> a) Base kernel (6.7),

Which scheduling model is the baseline using?

> b) patched with PREEMPT_AUTO voluntary preemption.
> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>
> Workloads I tested and their %gain,
> case b case c
> NAS +2.7 +1.9
> Hashjoin, +0 +0
> XSBench +1.7 +0
> Graph500, -6 +0

The Graph500 stands out. Needs some analysis.

Thanks,

tglx

2024-02-21 17:28:19

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On 2/21/2024 10:45 PM, Thomas Gleixner wrote:
> On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
>> Configuration tested.
>> a) Base kernel (6.7),
>
> Which scheduling model is the baseline using?
>

baseline is also PREEMPT_DYNAMIC with voluntary preemption

>> b) patched with PREEMPT_AUTO voluntary preemption.
>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>>
>> Workloads I tested and their %gain,
>> case b case c
>> NAS +2.7 +1.9
>> Hashjoin, +0 +0
>> XSBench +1.7 +0
>> Graph500, -6 +0
>
> The Graph500 stands out. Needs some analysis.
>

Sure. Will do a more detailed analysis and come back on this.

Thanks
- Raghu

2024-02-21 17:46:56

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Tue, Feb 20, 2024 at 10:48:41PM -0800, Ankur Arora wrote:
> Paul E. McKenney <[email protected]> writes:
> > On Thu, Feb 15, 2024 at 06:59:25PM -0800, Paul E. McKenney wrote:
> >> On Thu, Feb 15, 2024 at 04:45:17PM -0800, Ankur Arora wrote:
> >> >
> >> > Paul E. McKenney <[email protected]> writes:
> >> >
> >> > > On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote:
> >> > >>
> >> > >> Paul E. McKenney <[email protected]> writes:
> >> > >>
> >> > >> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
> >> > >> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote:
> >> > >> >> >
> >> > >> >> > Paul E. McKenney <[email protected]> writes:
> >> > >> >> >
> >> > >> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
> >> > >> >> > >> Hi,
> >> > >> >> > >>
> >> > >> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like
> >> > >> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> >> > >> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> >> > >> >> > >> on explicit preemption points for the voluntary models.
> >> > >> >> > >>
> >> > >> >> > >> The series is based on Thomas' original proposal which he outlined
> >> > >> >> > >> in [1], [2] and in his PoC [3].
> >> > >> >> > >>
> >> > >> >> > >> An earlier RFC version is at [4].
> >> > >> >> > >
> >> > >> >> > > This uncovered a couple of latent bugs in RCU due to its having been
> >> > >> >> > > a good long time since anyone built a !SMP preemptible kernel with
> >> > >> >> > > non-preemptible RCU. I have a couple of fixes queued on -rcu [1], most
> >> > >> >> > > likely for the merge window after next, but let me know if you need
> >> > >> >> > > them sooner.
> >> > >> >> >
> >> > >> >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing.
> >> > >> >> > But, the attached diff should tide me over until the fixes are in.
> >> > >> >>
> >> > >> >> That was indeed my guess. ;-)
> >> > >> >>
> >> > >> >> > > I am also seeing OOM conditions during rcutorture testing of callback
> >> > >> >> > > flooding, but I am still looking into this.
> >> > >> >> >
> >> > >> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration?
> >> > >> >>
> >> > >> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on
> >> > >> >> two of them thus far. I am running a longer test to see if this might
> >> > >> >> be just luck. If not, I look to see what rcutorture scenarios TREE10
> >> > >> >> and TRACE01 have in common.
> >> > >> >
> >> > >> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what
> >> > >> > sets them apart. I also hit a grace-period hang in TREE04, which does
> >> > >> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y. Something
> >> > >> > to dig into more.
> >> > >>
> >> > >> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder
> >> > >> if you would continue to hit the TREE04 hang with CONFIG_PREEMPT_NONE=y
> >> > >> as well?
> >> > >> (Just in the interest of minimizing configurations.)
> >> > >
> >> > > I would be happy to, but in the spirit of full disclosure...
> >> > >
> >> > > First, I have seen that failure only once, which is not enough to
> >> > > conclude that it has much to do with TREE04. It might simply be low
> >> > > probability, so that TREE04 simply was unlucky enough to hit it first.
> >> > > In contrast, I have sufficient data to be reasonably confident that the
> >> > > callback-flooding OOMs really do have something to do with the TRACE01 and
> >> > > TREE10 scenarios, even though I am not yet seeing what these two scenarios
> >> > > have in common that they don't also have in common with other scenarios.
> >> > > But what is life without a bit of mystery? ;-)
> >> >
> >> > :).
> >> >
> >> > > Second, please see the attached tarball, which contains .csv files showing
> >> > > Kconfig options and kernel boot parameters for the various torture tests.
> >> > > The portions of the filenames preceding the "config.csv" correspond to
> >> > > the directories in tools/testing/selftests/rcutorture/configs.
> >> >
> >> > So, at least some of the HZ_FULL=y tests don't run into problems.
> >> >
> >> > > Third, there are additional scenarios hand-crafted by the script at
> >> > > tools/testing/selftests/rcutorture/bin/torture.sh. Thus far, none of
> >> > > them have triggered, other than via the newly increased difficulty
> >> > > of configuring a tracing-free kernel with which to test, but they
> >> > > can still be useful in ruling out particular Kconfig options or kernel
> >> > > boot parameters being related to a given issue.
> >> > >
> >> > > But please do take a look at the .csv files and let me know what
> >> > > adjustments would be appropriate given the failure information.
> >> >
> >> > Nothing stands out just yet. Let me start a run here and see if
> >> > that gives me some ideas.
> >>
> >> Sounds good, thank you!
> >>
> >> > I'm guessing the splats don't give any useful information or
> >> > you would have attached them ;).
> >>
> >> My plan is to extract what can be extracted from the overnight run
> >> that I just started. Just in case the fixes have any effect on things,
> >> unlikely though that might be given those fixes and the runs that failed.
> >
> > And I got no failures from either TREE10 or TRACE01 on last night's
> > run. I merged your series on top of v6.8-rc4 with the -rcu tree's
> > dev branch, the latter to get the RCU fixes. But this means that last
> > night's results are not really comparable to earlier results.
>
> Not sure if you saw any other instances of this since, but a couple of
> things I belatedly noticed below.

Thank you for taking a look!

> [ ... ]
>
> > [ 3459.733109] ------------[ cut here ]------------
> > [ 3459.734165] rcutorture_oom_notify invoked upon OOM during forward-progress testing.
> > [ 3459.735828] WARNING: CPU: 0 PID: 43 at kernel/rcu/rcutorture.c:2874 rcutorture_oom_notify+0x3e/0x1d0
> >
> > Now something bad happened. RCU was unable to keep up with the
> > callback flood. Given that users can create callback floods
> > with close(open()) loops, this is not good.
> >
> > [ 3459.737761] Modules linked in:
> > [ 3459.738408] CPU: 0 PID: 43 Comm: rcu_torture_fwd Not tainted 6.8.0-rc4-00096-g40c2642e6f24 #8252
> > [ 3459.740295] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > [ 3459.742651] RIP: 0010:rcutorture_oom_notify+0x3e/0x1d0
> > [ 3459.743821] Code: e8 37 48 c2 00 48 8b 1d f8 b4 dc 01 48 85 db 0f 84 80 01 00 00 90 48 c7 c6 40 f5 e0 92 48 c7 c7 10 52 23 93 e8 d3 b9 f9 ff 90 <0f> 0b 90 90 8b 35 f8 c0 66 01 85 f6 7e 40 45 31 ed 4d 63 e5 41 83
> > [ 3459.747935] RSP: 0018:ffffabbb0015bb30 EFLAGS: 00010282
> > [ 3459.749061] RAX: 0000000000000000 RBX: ffff9485812ae000 RCX: 00000000ffffdfff
> > [ 3459.750601] RDX: 0000000000000000 RSI: 00000000ffffffea RDI: 0000000000000001
> > [ 3459.752026] RBP: ffffabbb0015bb98 R08: ffffffff93539388 R09: 00000000ffffdfff
> > [ 3459.753616] R10: ffffffff934593a0 R11: ffffffff935093a0 R12: 0000000000000000
> > [ 3459.755134] R13: ffffabbb0015bb98 R14: ffffffff93547da0 R15: 00000000ffffffff
> > [ 3459.756695] FS: 0000000000000000(0000) GS:ffffffff9344f000(0000) knlGS:0000000000000000
> > [ 3459.758443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 3459.759672] CR2: 0000000000600298 CR3: 0000000001958000 CR4: 00000000000006f0
> > [ 3459.761253] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 3459.762748] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [ 3459.764472] Call Trace:
> > [ 3459.765003] <TASK>
> > [ 3459.765483] ? __warn+0x61/0xe0
> > [ 3459.766176] ? rcutorture_oom_notify+0x3e/0x1d0
> > [ 3459.767154] ? report_bug+0x174/0x180
> > [ 3459.767942] ? handle_bug+0x3d/0x70
> > [ 3459.768715] ? exc_invalid_op+0x18/0x70
> > [ 3459.769561] ? asm_exc_invalid_op+0x1a/0x20
> > [ 3459.770494] ? rcutorture_oom_notify+0x3e/0x1d0
> > [ 3459.771501] blocking_notifier_call_chain+0x5c/0x80
> > [ 3459.772553] out_of_memory+0x236/0x4b0
> > [ 3459.773365] __alloc_pages+0x9ca/0xb10
> > [ 3459.774233] ? set_next_entity+0x8b/0x150
> > [ 3459.775107] new_slab+0x382/0x430
> > [ 3459.776454] ___slab_alloc+0x23c/0x8c0
> > [ 3459.777315] ? preempt_schedule_irq+0x32/0x50
> > [ 3459.778319] ? rcu_torture_fwd_prog+0x6bf/0x970
> > [ 3459.779291] ? rcu_torture_fwd_prog+0x6bf/0x970
> > [ 3459.780264] ? rcu_torture_fwd_prog+0x6bf/0x970
> > [ 3459.781244] kmalloc_trace+0x179/0x1a0
> > [ 3459.784651] rcu_torture_fwd_prog+0x6bf/0x970
> > [ 3459.785529] ? __pfx_rcu_torture_fwd_prog+0x10/0x10
> > [ 3459.786617] ? kthread+0xc3/0xf0
> > [ 3459.787352] ? __pfx_rcu_torture_fwd_prog+0x10/0x10
> > [ 3459.788417] kthread+0xc3/0xf0
> > [ 3459.789101] ? __pfx_kthread+0x10/0x10
> > [ 3459.789906] ret_from_fork+0x2f/0x50
> > [ 3459.790708] ? __pfx_kthread+0x10/0x10
> > [ 3459.791523] ret_from_fork_asm+0x1a/0x30
> > [ 3459.792359] </TASK>
> > [ 3459.792835] ---[ end trace 0000000000000000 ]---
> >
> > Standard rcutorture stack trace for this failure mode.
>
> I see a preempt_schedule_irq() in the stack. So, I guess that at some
> point current (the task responsible for the callback flood?) was marked
> for lazy scheduling, did not schedule out, and then was eagerly
> preempted out at the next tick.

That is expected, given that the kthread doing the callback flooding
will run for up to eight seconds.

Some instrumentation shows grace periods waiting on tasks, but that
instrumentation is later than would be good, after the barrier operation.

> > [ 3459.793849] rcu_torture_fwd_cb_hist: Callback-invocation histogram 0 (duration 913 jiffies): 1s/10: 0:1 2s/10: 719677:32517 3s/10: 646965:0
> >
> > So the whole thing lasted less than a second (913 jiffies).
> > Each element of the histogram is 100 milliseconds worth. Nothing
> > came through during the first 100 ms (not surprising), and one
> > grace period elapsed (also not surprising). A lot of callbacks
> > came through in the second 100 ms (also not surprising), but there
> > were some tens of thousand grace periods (extremely surprising).
> > The third 100 ms got a lot of callbacks but no grace periods
> > (not surprising).
> >
> > Huh. This started at t=3458.877155 and we got the OOM at
> > t=3459.733109, which roughly matches the reported time.
> >
> > [ 3459.796413] rcu: rcu_fwd_progress_check: GP age 737 jiffies
> >
> > The callback flood does seem to have stalled grace periods,
> > though not by all *that* much.
> >
> > [ 3459.799402] rcu: rcu_preempt: wait state: RCU_GP_WAIT_FQS(5) ->state: 0x402 ->rt_priority 0 delta ->gp_start 740 ->gp_activity 0 ->gp_req_activity 747 ->gp_wake_time 68 ->gp_wake_seq 5535689 ->gp_seq 5535689 ->gp_seq_needed 5535696 ->gp_max 713 ->gp_flags 0x0
> >
> > The RCU grace-period kthread is in its loop looking for
> > quiescent states, and is executing normally ("->gp_activity 0",
> > as opposed to some huge number indicating that the kthread was
> > never awakened).
> >
> > [ 3459.804267] rcu: rcu_node 0:0 ->gp_seq 5535689 ->gp_seq_needed 5535696 ->qsmask 0x0 ...G ->n_boosts 0
> >
> > The "->qsmask 0x0" says that all CPUs have provided a quiescent
> > state, but the "G" indicates that the normal grace period is
> > blocked by some task preempted within an RCU read-side critical
> > section. This output is strange because a 56-CPU scenario should
> > have considerably more output.
> >
> > Plus this means that this cannot possibly be TREE10 because that
> > scenario is non-preemptible, so there cannot be grace periods
> > waiting for quiescent states on anything but CPUs.
>
> Might be missing the point, but with CONFIG_PREEMPT_NONE, you could
> be preempted if you exceed your time quantum by more than one tick.
> Though that of course needs the task to not be in the read-side critical
> section.

I have three things on my list: (1) Improve the instrumentation so that it
captures the grace-period diagnostics periodically in a list of strings,
then prints them only if something bad happened, (2) Use bisection
to work out which commit instigates this behavior, and (3) that old
fallback, code inspection.

Other thoughts?

Thanx, Paul

2024-02-21 18:17:24

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Mon, 19 Feb 2024 08:48:20 -0800
"Paul E. McKenney" <[email protected]> wrote:

> > I will look again -- it is quite possible that I was confused by earlier
> > in-fleet setups that had Tasks RCU enabled even when preemption was
> > disabled. (We don't do that anymore, and, had I been paying sufficient
> > attention, would not have been doing it to start with. Back in the day,
> > enabling rcutorture, even as a module, had the side effect of enabling
> > Tasks RCU. How else to test it, right? Well...)
>
> OK, I got my head straight on this one...
>
> And the problem is in fact that Tasks RCU isn't normally present
> in non-preemptible kernels. This is because normal RCU will wait
> for preemption-disabled regions of code, and in PREEMPT_NONE and
> PREEMPT_VOLUNTARY kernels, that includes pretty much any region of code
> lacking an explicit schedule() or similar. And as I understand it,
> tracing trampolines rely on this implicit lack of preemption.
>
> So, with lazy preemption, we could preempt in the middle of a
> trampoline, and synchronize_rcu() won't save us.
>
> Steve and Mathieu will correct me if I am wrong.
>
> If I do understand this correctly, one workaround is to remove the
> "if PREEMPTIBLE" on all occurrences of "select TASKS_RCU". That way,
> all kernels would use synchronize_rcu_tasks(), which would wait for
> a voluntary context switch.
>
> This workaround does increase the overhead and tracepoint-removal
> latency on non-preemptible kernels, so it might be time to revisit the
> synchronization of trampolines. Unfortunately, the things I have come
> up with thus far have disadvantages:
>
> o Keep a set of permanent trampolines that enter and exit
> some sort of explicit RCU read-side critical section.
> If the address for this trampoline to call is in a register,
> then these permanent trampolines remain constant so that
> no synchronization of them is required. The selected
> flavor of RCU can then be used to deal with the non-permanent
> trampolines.
>
> The disadvantage here is a significant increase in the complexity
> and overhead of trampoline code and the code that invokes the
> trampolines. This overhead limits where tracing may be used
> in the kernel, which is of course undesirable.

I wonder if we can just see if the instruction pointer at preemption is at
something that was allocated? That is, if __is_kernel(addr) returns
false, then we need to do more work. Of course that means modules will also
trigger this. We could check __is_module_text() but that does a bit more
work and may cause too much overhead. But who knows, if the module check is
only done if the __is_kernel() check fails, maybe it's not that bad.
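
For concreteness, a minimal sketch of that check might look like the
following. This is illustrative only, not a proposed patch, and it assumes
__is_kernel() and is_module_text_address() behave as their names suggest:

#include <asm/sections.h>
#include <linux/module.h>

/*
 * Sketch: only fall back to the more expensive trampoline tracking when
 * the preempted instruction pointer is in neither core-kernel nor module
 * text, i.e. when it might lie in an allocated trampoline.
 */
static bool preempted_ip_may_be_trampoline(unsigned long ip)
{
        if (__is_kernel(ip))
                return false;
        if (is_module_text_address(ip))
                return false;
        return true;
}

The module check is only reached after the core-kernel check fails, which
is the ordering suggested above to keep the common case cheap.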

-- Steve

>
> o Check for being preempted within a trampoline, and track this
> within the tasks structure. The disadvantage here is that this
> requires keeping track of all of the trampolines and adding a
> check for being in one on a scheduler fast path.
>
> o Have a variant of Tasks RCU which checks the stack of preempted
> tasks, waiting until all have been seen without being preempted
> in a trampoline. This still requires keeping track of all the
> trampolines in an easy-to-search manner, but gets the overhead
> of searching off of the scheduler fastpaths.
>
> It is also necessary to check running tasks, which might have
> been interrupted from within a trampoline.
>
> I would have a hard time convincing myself that these return
> addresses were unconditionally reliable. But maybe they are?
>
> o Your idea here!
>
> Again, the short-term workaround is to remove the "if PREEMPTIBLE" from
> all of the "select TASKS_RCU" clauses.
>
> > > > My next step is to try this on bare metal on a system configured as
> > > > is the fleet. But good progress for a week!!!
> > >
> > > Yeah this is great. Fingers crossed for the wider set of tests.
> >
> > I got what might be a one-off when hitting rcutorture and KASAN harder.
> > I am running 320*TRACE01 to see if it reproduces.
>
> [ . . . ]
>
> > So, first see if it is reproducible, second enable more diagnostics,
> > third make more grace-period sequence numbers available to rcutorture,
> > fourth recheck the diagnostics code, and then see where we go from there.
> > It might be that lazy preemption needs adjustment, or it might be that
> > it just tickled latent diagnostic issues in rcutorture.
> >
> > (I rarely hit this WARN_ON() except in early development, when the
> > problem is usually glaringly obvious, hence all the uncertainty.)
>
> And it is eminently reproducible. Digging into it...

2024-02-21 20:09:38

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Wed, Feb 21, 2024 at 01:19:01PM -0500, Steven Rostedt wrote:
> On Mon, 19 Feb 2024 08:48:20 -0800
> "Paul E. McKenney" <[email protected]> wrote:
>
> > > I will look again -- it is quite possible that I was confused by earlier
> > > in-fleet setups that had Tasks RCU enabled even when preemption was
> > > disabled. (We don't do that anymore, and, had I been paying sufficient
> > > attention, would not have been doing it to start with. Back in the day,
> > > enabling rcutorture, even as a module, had the side effect of enabling
> > > Tasks RCU. How else to test it, right? Well...)
> >
> > OK, I got my head straight on this one...
> >
> > And the problem is in fact that Tasks RCU isn't normally present
> > in non-preemptible kernels. This is because normal RCU will wait
> > for preemption-disabled regions of code, and in PREEMPT_NONE and
> > PREEMPT_VOLUNTARY kernels, that includes pretty much any region of code
> > lacking an explicit schedule() or similar. And as I understand it,
> > tracing trampolines rely on this implicit lack of preemption.
> >
> > So, with lazy preemption, we could preempt in the middle of a
> > trampoline, and synchronize_rcu() won't save us.
> >
> > Steve and Mathieu will correct me if I am wrong.
> >
> > If I do understand this correctly, one workaround is to remove the
> > "if PREEMPTIBLE" on all occurrences of "select TASKS_RCU". That way,
> > all kernels would use synchronize_rcu_tasks(), which would wait for
> > a voluntary context switch.
> >
> > This workaround does increase the overhead and tracepoint-removal
> > latency on non-preemptible kernels, so it might be time to revisit the
> > synchronization of trampolines. Unfortunately, the things I have come
> > up with thus far have disadvantages:
> >
> > o Keep a set of permanent trampolines that enter and exit
> > some sort of explicit RCU read-side critical section.
> > If the address for this trampoline to call is in a register,
> > then these permanent trampolines remain constant so that
> > no synchronization of them is required. The selected
> > flavor of RCU can then be used to deal with the non-permanent
> > trampolines.
> >
> > The disadvantage here is a significant increase in the complexity
> > and overhead of trampoline code and the code that invokes the
> > trampolines. This overhead limits where tracing may be used
> > in the kernel, which is of course undesirable.
>
> I wonder if we can just see if the instruction pointer at preemption is at
> > something that was allocated? That is, if __is_kernel(addr) returns
> false, then we need to do more work. Of course that means modules will also
> trigger this. We could check __is_module_text() but that does a bit more
> work and may cause too much overhead. But who knows, if the module check is
> only done if the __is_kernel() check fails, maybe it's not that bad.

I do like very much that idea, but it requires that we be able to identify
this instruction pointer perfectly, no matter what. It might also require
that we be able to perfectly identify any IRQ return addresses as well,
for example, if the preemption was triggered within an interrupt handler.
And interrupts from softirq environments might require identifying an
additional level of IRQ return address. The original IRQ might have
interrupted a trampoline, and then after transitioning into softirq,
another IRQ might also interrupt a trampoline, and this last IRQ handler
might have instigated a preemption.

Are there additional levels or mechanisms requiring identifying
return addresses?

For whatever it is worth, and in case it should prove necessary,
I have added a sneak preview of the Kconfig workaround at the end of
this message.

Thanx, Paul

> -- Steve
>
> >
> > o Check for being preempted within a trampoline, and track this
> > within the tasks structure. The disadvantage here is that this
> > requires keeping track of all of the trampolines and adding a
> > check for being in one on a scheduler fast path.
> >
> > o Have a variant of Tasks RCU which checks the stack of preempted
> > tasks, waiting until all have been seen without being preempted
> > in a trampoline. This still requires keeping track of all the
> > trampolines in an easy-to-search manner, but gets the overhead
> > of searching off of the scheduler fastpaths.
> >
> > It is also necessary to check running tasks, which might have
> > been interrupted from within a trampoline.
> >
> > I would have a hard time convincing myself that these return
> > addresses were unconditionally reliable. But maybe they are?
> >
> > o Your idea here!
> >
> > Again, the short-term workaround is to remove the "if PREEMPTIBLE" from
> > all of the "select TASKS_RCU" clauses.
> >
> > > > > My next step is to try this on bare metal on a system configured as
> > > > > is the fleet. But good progress for a week!!!
> > > >
> > > > Yeah this is great. Fingers crossed for the wider set of tests.
> > >
> > > I got what might be a one-off when hitting rcutorture and KASAN harder.
> > > I am running 320*TRACE01 to see if it reproduces.
> >
> > [ . . . ]
> >
> > > So, first see if it is reproducible, second enable more diagnostics,
> > > third make more grace-period sequence numbers available to rcutorture,
> > > fourth recheck the diagnostics code, and then see where we go from there.
> > > It might be that lazy preemption needs adjustment, or it might be that
> > > it just tickled latent diagnostic issues in rcutorture.
> > >
> > > (I rarely hit this WARN_ON() except in early development, when the
> > > problem is usually glaringly obvious, hence all the uncertainty.)
> >
> > And it is eminently reproducible. Digging into it...

diff --git a/arch/Kconfig b/arch/Kconfig
index c91917b508736..154f994547632 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -55,7 +55,7 @@ config KPROBES
depends on MODULES
depends on HAVE_KPROBES
select KALLSYMS
- select TASKS_RCU if PREEMPTION
+ select NEED_TASKS_RCU
help
Kprobes allows you to trap at almost any kernel address and
execute a callback function. register_kprobe() establishes
@@ -104,7 +104,7 @@ config STATIC_CALL_SELFTEST
config OPTPROBES
def_bool y
depends on KPROBES && HAVE_OPTPROBES
- select TASKS_RCU if PREEMPTION
+ select NEED_TASKS_RCU

config KPROBES_ON_FTRACE
def_bool y
diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
index 6a906ff930065..ce9fbc3b27ecf 100644
--- a/kernel/bpf/Kconfig
+++ b/kernel/bpf/Kconfig
@@ -27,7 +27,7 @@ config BPF_SYSCALL
bool "Enable bpf() system call"
select BPF
select IRQ_WORK
- select TASKS_RCU if PREEMPTION
+ select NEED_TASKS_RCU
select TASKS_TRACE_RCU
select BINARY_PRINTF
select NET_SOCK_MSG if NET
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 7dca0138260c3..3e079de0f5b43 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -85,9 +85,13 @@ config FORCE_TASKS_RCU
idle, and user-mode execution as quiescent states. Not for
manual selection in most cases.

-config TASKS_RCU
+config NEED_TASKS_RCU
bool
default n
+
+config TASKS_RCU
+ bool
+ default NEED_TASKS_RCU && (PREEMPTION || PREEMPT_AUTO)
select IRQ_WORK

config FORCE_TASKS_RUDE_RCU
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596d..6cdc5ff919b09 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -163,7 +163,7 @@ config TRACING
select BINARY_PRINTF
select EVENT_TRACING
select TRACE_CLOCK
- select TASKS_RCU if PREEMPTION
+ select NEED_TASKS_RCU

config GENERIC_TRACER
bool
@@ -204,7 +204,7 @@ config FUNCTION_TRACER
select GENERIC_TRACER
select CONTEXT_SWITCH_TRACER
select GLOB
- select TASKS_RCU if PREEMPTION
+ select NEED_TASKS_RCU
select TASKS_RUDE_RCU
help
Enable the kernel to trace every kernel function. This is done

2024-02-21 20:10:21

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Wed, 21 Feb 2024 11:41:47 -0800
"Paul E. McKenney" <[email protected]> wrote:

> > I wonder if we can just see if the instruction pointer at preemption is at
> > something that was allocated? That is, if __is_kernel(addr) returns
> > false, then we need to do more work. Of course that means modules will also
> > trigger this. We could check __is_module_text() but that does a bit more
> > work and may cause too much overhead. But who knows, if the module check is
> > only done if the __is_kernel() check fails, maybe it's not that bad.
>
> I do like very much that idea, but it requires that we be able to identify
> this instruction pointer perfectly, no matter what. It might also require
> that we be able to perfectly identify any IRQ return addresses as well,
> for example, if the preemption was triggered within an interrupt handler.
> And interrupts from softirq environments might require identifying an
> additional level of IRQ return address. The original IRQ might have
> interrupted a trampoline, and then after transitioning into softirq,
> another IRQ might also interrupt a trampoline, and this last IRQ handler
> might have instigated a preemption.

Note, softirqs still require a real interrupt to happen in order to preempt
executing code. Otherwise it should never be running from a trampoline.

>
> Are there additional levels or mechanisms requiring identifying
> return addresses?

Hmm, could we add to irq_enter_rcu()

__this_cpu_write(__rcu_ip, instruction_pointer(get_irq_regs()));

That is, to save off where the ip was when it was interrupted.

Hmm, but it looks like the get_irq_regs() is set up outside of
irq_enter_rcu() :-(

I wonder how hard it would be to change all the architectures to pass in
pt_regs to irq_enter_rcu()? All the places it is called, the regs should be
available.

Either way, it looks like it will be a bit of work around the trampoline or
around RCU to get this efficiently done.
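
A rough illustration of the record-the-interrupted-ip idea: __rcu_ip is a
hypothetical per-CPU variable, and since irq_enter_rcu() does not take a
pt_regs argument today, this is written as a wrapper rather than as the
real API change.

#include <linux/hardirq.h>
#include <linux/percpu.h>
#include <linux/ptrace.h>

/* Hypothetical per-CPU slot recording the interrupted instruction pointer. */
DEFINE_PER_CPU(unsigned long, __rcu_ip);

static void irq_enter_rcu_save_ip(struct pt_regs *regs)
{
        __this_cpu_write(__rcu_ip, instruction_pointer(regs));
        irq_enter_rcu();
}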

-- Steve

2024-02-21 20:22:43

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Wed, Feb 21, 2024 at 03:11:57PM -0500, Steven Rostedt wrote:
> On Wed, 21 Feb 2024 11:41:47 -0800
> "Paul E. McKenney" <[email protected]> wrote:
>
> > > I wonder if we can just see if the instruction pointer at preemption is at
> > > something that was allocated? That is, if __is_kernel(addr) returns
> > > false, then we need to do more work. Of course that means modules will also
> > > trigger this. We could check __is_module_text() but that does a bit more
> > > work and may cause too much overhead. But who knows, if the module check is
> > > only done if the __is_kernel() check fails, maybe it's not that bad.
> >
> > I do like very much that idea, but it requires that we be able to identify
> > this instruction pointer perfectly, no matter what. It might also require
> > that we be able to perfectly identify any IRQ return addresses as well,
> > for example, if the preemption was triggered within an interrupt handler.
> > And interrupts from softirq environments might require identifying an
> > additional level of IRQ return address. The original IRQ might have
> > interrupted a trampoline, and then after transitioning into softirq,
> > another IRQ might also interrupt a trampoline, and this last IRQ handler
> > might have instigated a preemption.
>
> Note, softirqs still require a real interrupt to happen in order to preempt
> executing code. Otherwise it should never be running from a trampoline.

Yes, the first interrupt interrupted a trampoline. Then, on return,
that interrupt transitioned to softirq (as opposed to ksoftirqd).
While a softirq handler was executing within a trampoline, we got
another interrupt. We thus have two interrupted trampolines.

Or am I missing something that prevents this?

> > Are there additional levels or mechanisms requiring identifying
> > return addresses?
>
> Hmm, could we add to irq_enter_rcu()
>
> __this_cpu_write(__rcu_ip, instruction_pointer(get_irq_regs()));
>
> That is to save off were the ip was when it was interrupted.
>
> Hmm, but it looks like the get_irq_regs() is set up outside of
> irq_enter_rcu() :-(
>
> I wonder how hard it would be to change all the architectures to pass in
> pt_regs to irq_enter_rcu()? All the places it is called, the regs should be
> available.
>
> Either way, it looks like it will be a bit of work around the trampoline or
> around RCU to get this efficiently done.

One approach would be to make Tasks RCU be present for PREEMPT_AUTO
kernels as well as PREEMPTIBLE kernels, and then, as architectures provide
the needed return-address infrastructure, transition those architectures
to something more precise.

Or maybe the workaround will prove to be good enough. We did
inadvertently test it for a year or so at scale, so I am at least
confident that it works. ;-)

Thanx, Paul

2024-02-21 21:19:07

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling


Raghavendra K T <[email protected]> writes:

> On 2/21/2024 10:45 PM, Thomas Gleixner wrote:
>> On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
>>> Configuration tested.
>>> a) Base kernel (6.7),
>> Which scheduling model is the baseline using?
>>
>
> baseline is also PREEMPT_DYNAMIC with voluntary preemption
>
>>> b) patched with PREEMPT_AUTO voluntary preemption.
>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>>>
>>> Workloads I tested and their %gain,
>>> case b case c
>>> NAS +2.7 +1.9
>>> Hashjoin, +0 +0
>>> XSBench +1.7 +0
>>> Graph500, -6 +0
>> The Graph500 stands out. Needs some analysis.
>>
>
> Sure. Will do more detailed analysis and comeback on this.

Thanks Raghu. Please keep me posted.

Also, let me try to reproduce this locally. Could you post the
parameters that you used for the Graph500 run?

Thanks

--
ankur

2024-02-21 21:39:13

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 23/30] sched/fair: handle tick expiry under lazy preemption

On Mon, 12 Feb 2024 21:55:47 -0800
Ankur Arora <[email protected]> wrote:

> The default policy for lazy scheduling is to schedule in exit-to-user,
> assuming that would happen within the remaining time quanta of the
> task.
>
> However, that runs into the 'hog' problem -- the target task might
> be running in the kernel and might not relinquish CPU on its own.
>
> Handle that by upgrading the ignored tif_resched(NR_lazy) bit to
> tif_resched(NR_now) at the next tick.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Originally-by: Thomas Gleixner <[email protected]>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <[email protected]>
>
> ---
> Note:
> Instead of special casing the tick, it might be simpler to always
> do the upgrade on the second resched_curr().
>
> The theoretical problem with doing that is that the current
> approach deterministically provides a well-defined extra unit of
> time. Going with a second resched_curr() might mean that the
> amount of extra time the task gets depends on the vagaries of
> the incoming resched_curr() (which is fine if it's mostly from
> the tick; not fine if we could get it due to other reasons.)
>
> Practically, both performed equally well in my tests.
>
> Thoughts?

I personally prefer the determinism of using the tick to force the resched.

-- Steve
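
To make the quoted note concrete, here is a minimal sketch of a tick-driven
upgrade. TIF_NEED_RESCHED_LAZY comes from this series; the helper name, its
exact call site in the tick path, and whether the real patch goes through
resched_curr() instead are assumptions.

/*
 * Illustrative only: if the running task was already marked for lazy
 * rescheduling and is still on the CPU when the next tick arrives,
 * escalate to an immediate reschedule so a kernel "hog" gets at most
 * one extra tick.
 */
static void tick_upgrade_lazy_resched(struct rq *rq)
{
	struct task_struct *curr = rq->curr;

	if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY) &&
	    !test_tsk_thread_flag(curr, TIF_NEED_RESCHED))
		set_tsk_need_resched(curr);
}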

2024-02-21 21:41:58

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 29/30] Documentation: tracing: add TIF_NEED_RESCHED_LAZY

On Mon, 12 Feb 2024 21:55:53 -0800
Ankur Arora <[email protected]> wrote:

> Document various combinations of resched flags.
>
> Cc: Steven Rostedt <[email protected]>
> Cc: Masami Hiramatsu <[email protected]>
> Cc: Jonathan Corbet <[email protected]>
> Originally-by: Thomas Gleixner <[email protected]>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> Documentation/trace/ftrace.rst | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
> index 7e7b8ec17934..7f20c0bae009 100644
> --- a/Documentation/trace/ftrace.rst
> +++ b/Documentation/trace/ftrace.rst
> @@ -1036,8 +1036,12 @@ explains which is which.
> be printed here.
>
> need-resched:
> - - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
> + - 'B' all three, TIF_NEED_RESCHED, TIF_NEED_RESCHED_LAZY and PREEMPT_NEED_RESCHED are set,
> + - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED are set,
> + - 'L' both TIF_NEED_RESCHED_LAZY and PREEMPT_NEED_RESCHED are set,
> + - 'b' both TIF_NEED_RESCHED and TIF_NEED_RESCHED_LAZY are set,
> - 'n' only TIF_NEED_RESCHED is set,
> + - 'l' only TIF_NEED_RESCHED_LAZY is set,
> - 'p' only PREEMPT_NEED_RESCHED is set,
> - '.' otherwise.
>

I wonder if we should also add this information in /sys/kernel/tracing/README
so that it is easier to find on a machine.

-- Steve

2024-02-21 23:23:50

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 29/30] Documentation: tracing: add TIF_NEED_RESCHED_LAZY


Steven Rostedt <[email protected]> writes:

> On Mon, 12 Feb 2024 21:55:53 -0800
> Ankur Arora <[email protected]> wrote:
>
>> Document various combinations of resched flags.
>>
>> Cc: Steven Rostedt <[email protected]>
>> Cc: Masami Hiramatsu <[email protected]>
>> Cc: Jonathan Corbet <[email protected]>
>> Originally-by: Thomas Gleixner <[email protected]>
>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>> Signed-off-by: Ankur Arora <[email protected]>
>> ---
>> Documentation/trace/ftrace.rst | 6 +++++-
>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
>> index 7e7b8ec17934..7f20c0bae009 100644
>> --- a/Documentation/trace/ftrace.rst
>> +++ b/Documentation/trace/ftrace.rst
>> @@ -1036,8 +1036,12 @@ explains which is which.
>> be printed here.
>>
>> need-resched:
>> - - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
>> + - 'B' all three, TIF_NEED_RESCHED, TIF_NEED_RESCHED_LAZY and PREEMPT_NEED_RESCHED are set,
>> + - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED are set,
>> + - 'L' both TIF_NEED_RESCHED_LAZY and PREEMPT_NEED_RESCHED are set,
>> + - 'b' both TIF_NEED_RESCHED and TIF_NEED_RESCHED_LAZY are set,
>> - 'n' only TIF_NEED_RESCHED is set,
>> + - 'l' only TIF_NEED_RESCHED_LAZY is set,
>> - 'p' only PREEMPT_NEED_RESCHED is set,
>> - '.' otherwise.
>>
>
> I wonder if we should also add this information in /sys/kernel/tracing/README
> so that it is easier to find on a machine.

Yeah, there is a problem with the discovery. Seems a little out of place
in tracing/README though.

How about something like this? Though this isn't really a model of clarity.

--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4292,7 +4292,7 @@ static void print_lat_help_header(struct seq_file *m)
{
seq_puts(m, "# _------=> CPU# \n"
"# / _-----=> irqs-off/BH-disabled\n"
- "# | / _----=> need-resched \n"
+ "# | / _----=> need-resched [ l: lazy, n: now, p: preempt, b: l|n, L: l|p, N: n|p, B: l|n|p ]\n"
"# || / _---=> hardirq/softirq \n"
"# ||| / _--=> preempt-depth \n"
"# |||| / _-=> migrate-disable \n"


Also, haven't looked at trace-cmd. Anything I should be sending a patch
out for?

Thanks

--
ankur

2024-02-21 23:54:15

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 29/30] Documentation: tracing: add TIF_NEED_RESCHED_LAZY

On Wed, 21 Feb 2024 15:22:20 -0800
Ankur Arora <[email protected]> wrote:


> > I wonder if we should also add this information in /sys/kernel/tracing/README
> > so that it is easier to find on a machine.
>
> Yeah, there is a problem with the discovery. Seems a little out of place
> in tracing/README though.
>
> How about something like this? Though this isn't really a model of clarity.

Could work, but I would also have it check the configs that are enabled in
the kernel, and only show the options that are available with the current
configuration.

>
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -4292,7 +4292,7 @@ static void print_lat_help_header(struct seq_file *m)
> {
> seq_puts(m, "# _------=> CPU# \n"
> "# / _-----=> irqs-off/BH-disabled\n"
> - "# | / _----=> need-resched \n"
> + "# | / _----=> need-resched [ l: lazy, n: now, p: preempt, b: l|n, L: l|p, N: n|p, B: l|n|p ]\n"
> "# || / _---=> hardirq/softirq \n"
> "# ||| / _--=> preempt-depth \n"
> "# |||| / _-=> migrate-disable \n"
>
>
> Also, haven't looked at trace-cmd. Anything I should be sending a patch
> out for?

There's really nothing that explains it. But that probably should be fixed.

-- Steve
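
A sketch of what that config-aware variant might look like; the
CONFIG_PREEMPT_AUTO symbol is taken from this series, while the helper name
and the exact legend strings are assumptions:

/*
 * Illustrative only: emit a need-resched legend that matches the flags
 * actually available in this kernel, so the lazy letters only show up
 * when lazy rescheduling is built in.
 */
static void print_need_resched_legend(struct seq_file *m)
{
	if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
		seq_puts(m, "# need-resched: l=lazy, n=now, p=preempt, b=l|n, L=l|p, N=n|p, B=l|n|p\n");
	else
		seq_puts(m, "# need-resched: n=now, p=preempt, N=n|p\n");
}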


2024-02-22 04:06:26

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On 2/22/2024 2:46 AM, Ankur Arora wrote:
>
> Raghavendra K T <[email protected]> writes:
>
>> On 2/21/2024 10:45 PM, Thomas Gleixner wrote:
>>> On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
>>>> Configuration tested.
>>>> a) Base kernel (6.7),
>>> Which scheduling model is the baseline using?
>>>
>>
>> baseline is also PREEMPT_DYNAMIC with voluntary preemption
>>
>>>> b) patched with PREEMPT_AUTO voluntary preemption.
>>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>>>>
>>>> Workloads I tested and their %gain,
>>>> case b case c
>>>> NAS +2.7 +1.9
>>>> Hashjoin, +0 +0
>>>> XSBench +1.7 +0
>>>> Graph500, -6 +0
>>> The Graph500 stands out. Needs some analysis.
>>>
>>
>> Sure. Will do more detailed analysis and comeback on this.
>
> Thanks Raghu. Please keep me posted.
>
> Also, let me try to reproduce this locally. Could you post the
> parameters that you used for the Graph500 run?
>

This was run as part of a test suite; from its output, the parameters I
see are:

SCALE: 27
nvtx: 134217728
edgefactor: 16
terasize: 3.43597383679999993e-02
A: 5.69999999999999951e-01
B: 1.90000000000000002e-01
C: 1.90000000000000002e-01
D: 5.00000000000000444e-02
generation_time: 4.93902114900000022e+00
construction_time: 2.55216929010000015e+01
nbfs: 64

Meanwhile, since the stddev for the runs I saw was a little on the higher
side, I thought the results were okay.

Rerunning with more iterations to see if there is a valid concern; if so,
I will dig deeper as Thomas suggested.
I will also post the results of the run.

Thanks and Regards
- Raghu

2024-02-22 13:05:10

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On 2/21/2024 10:45 PM, Thomas Gleixner wrote:
> On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
>> Configuration tested.
>> a) Base kernel (6.7),
>
> Which scheduling model is the baseline using?
>
>> b) patched with PREEMPT_AUTO voluntary preemption.
>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>>
>> Workloads I tested and their %gain,
>> case b case c
>> NAS +2.7 +1.9
>> Hashjoin, +0 +0
>> XSBench +1.7 +0
>> Graph500, -6 +0
>
> The Graph500 stands out. Needs some analysis.
>

Hello Thomas, Ankur,

Because of the high stddev I saw with the Graph500 runs, I continued
collecting results with more iterations.

Here is the result. It does not look like there is a concern here.

(You can see the *min* side of the preempt-auto case, which could have
produced the negative result in the earlier analysis. I should have posted
the stddev along with that; sorry for not being clearer there.)

Overall this looks good: sometimes better, but all within noise level.

Benchmark = Graph500

x 6.7.0+
+ 6.7.0-preempt-auto+

N Min Max Median Avg Stddev
x 15 6.7165689e+09 7.7607743e+09 7.2213638e+09 7.2759563e+09 3.3353312e+08
+ 15 6.4856432e+09 7.942607e+09 7.3115082e+09 7.3386124e+09 4.6474773e+08

No difference proven at 80.0% confidence
No difference proven at 95.0% confidence
No difference proven at 99.0% confidence

Thanks and Regards
- Raghu

2024-02-22 15:57:11

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Wed, Feb 21, 2024 at 12:22:35PM -0800, Paul E. McKenney wrote:
> On Wed, Feb 21, 2024 at 03:11:57PM -0500, Steven Rostedt wrote:
> > On Wed, 21 Feb 2024 11:41:47 -0800
> > "Paul E. McKenney" <[email protected]> wrote:
> >
> > > > I wonder if we can just see if the instruction pointer at preemption is at
> > > > something that was allocated? That is, if it __is_kernel(addr) returns
> > > > false, then we need to do more work. Of course that means modules will also
> > > > trigger this. We could check __is_module_text() but that does a bit more
> > > > work and may cause too much overhead. But who knows, if the module check is
> > > > only done if the __is_kernel() check fails, maybe it's not that bad.
> > >
> > > I do like very much that idea, but it requires that we be able to identify
> > > this instruction pointer perfectly, no matter what. It might also require
> > > that we be able to perfectly identify any IRQ return addresses as well,
> > > for example, if the preemption was triggered within an interrupt handler.
> > > And interrupts from softirq environments might require identifying an
> > > additional level of IRQ return address. The original IRQ might have
> > > interrupted a trampoline, and then after transitioning into softirq,
> > > another IRQ might also interrupt a trampoline, and this last IRQ handler
> > > might have instigated a preemption.
> >
> > Note, softirqs still require a real interrupt to happen in order to preempt
> > executing code. Otherwise it should never be running from a trampoline.
>
> Yes, the first interrupt interrupted a trampoline. Then, on return,
> that interrupt transitioned to softirq (as opposed to ksoftirqd).
> While a softirq handler was executing within a trampoline, we got
> another interrupt. We thus have two interrupted trampolines.
>
> Or am I missing something that prevents this?

Surely the problematic case is where the first interrupt is taken from a
trampoline, but the inner interrupt is taken from not-a-trampoline? If the
innermost interrupt context is a trampoline, that's the same as that without
any nesting.

We could handle nesting with a thread flag (e.g. TIF_IN_TRAMPOLINE) and a flag
in irqentry_state_t (which is on the stack, and so each nested IRQ gets its
own):

* At IRQ exception entry, if TIF_IN_TRAMPOLINE is clear and pt_regs::ip is a
trampoline, set TIF_IN_TRAMPOLINE and irqentry_state_t::entered_trampoline.

* At IRQ exception exit, if irqentry_state_t::entered_trampoline is set, clear
TIF_IN_TRAMPOLINE.

That naturally nests since the inner IRQ sees TIF_IN_TRAMPOLINE is already set
and does nothing on entry or exit, and anything in between can inspect
TIF_IN_TRAMPOLINE and see the right value.
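
A minimal sketch of that entry/exit handling; TIF_IN_TRAMPOLINE, the
entered_trampoline field, and in_trampoline_text() are all hypothetical
names from the description above, not existing kernel interfaces:

/* On IRQ exception entry, once pt_regs are available: */
static void note_trampoline_entry(struct pt_regs *regs,
				  irqentry_state_t *state)
{
	if (!test_thread_flag(TIF_IN_TRAMPOLINE) &&
	    in_trampoline_text(instruction_pointer(regs))) {
		set_thread_flag(TIF_IN_TRAMPOLINE);
		state->entered_trampoline = true;
	}
}

/* On the matching IRQ exception exit: */
static void note_trampoline_exit(irqentry_state_t *state)
{
	if (state->entered_trampoline)
		clear_thread_flag(TIF_IN_TRAMPOLINE);
}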

On arm64 we don't dynamically allocate trampolines, *but* we potentially have a
similar problem when changing the active ftrace_ops for a callsite, as all
callsites share a common trampoline in the kernel text which reads a pointer to
an ftrace_ops out of the callsite, then reads ftrace_ops::func from that.

Since the ops could be dynamically allocated, we want to wait for reads of that
to complete before reusing the memory, and ideally we wouldn't have new
entries into the func after we think we'd completed the transition. So Tasks
RCU might be preferable as it waits for both the trampoline *and* the func to
complete.

> > > Are there additional levels or mechanisms requiring identifying
> > > return addresses?
> >
> > Hmm, could we add to irq_enter_rcu()
> >
> > __this_cpu_write(__rcu_ip, instruction_pointer(get_irq_regs()));
> >
> > That is to save off were the ip was when it was interrupted.
> >
> > Hmm, but it looks like the get_irq_regs() is set up outside of
> > irq_enter_rcu() :-(
> >
> > I wonder how hard it would be to change all the architectures to pass in
> > pt_regs to irq_enter_rcu()? All the places it is called, the regs should be
> > available.
> >
> > Either way, it looks like it will be a bit of work around the trampoline or
> > around RCU to get this efficiently done.
>
> One approach would be to make Tasks RCU be present for PREEMPT_AUTO
> kernels as well as PREEMPTIBLE kernels, and then, as architectures provide
> the needed return-address infrastructure, transition those architectures
> to something more precise.

FWIW, that sounds good to me.

Mark.

2024-02-22 19:12:26

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Thu, Feb 22, 2024 at 03:50:02PM +0000, Mark Rutland wrote:
> On Wed, Feb 21, 2024 at 12:22:35PM -0800, Paul E. McKenney wrote:
> > On Wed, Feb 21, 2024 at 03:11:57PM -0500, Steven Rostedt wrote:
> > > On Wed, 21 Feb 2024 11:41:47 -0800
> > > "Paul E. McKenney" <[email protected]> wrote:
> > >
> > > > > I wonder if we can just see if the instruction pointer at preemption is at
> > > > > something that was allocated? That is, if it __is_kernel(addr) returns
> > > > > false, then we need to do more work. Of course that means modules will also
> > > > > trigger this. We could check __is_module_text() but that does a bit more
> > > > > work and may cause too much overhead. But who knows, if the module check is
> > > > > only done if the __is_kernel() check fails, maybe it's not that bad.
> > > >
> > > > I do like very much that idea, but it requires that we be able to identify
> > > > this instruction pointer perfectly, no matter what. It might also require
> > > > that we be able to perfectly identify any IRQ return addresses as well,
> > > > for example, if the preemption was triggered within an interrupt handler.
> > > > And interrupts from softirq environments might require identifying an
> > > > additional level of IRQ return address. The original IRQ might have
> > > > interrupted a trampoline, and then after transitioning into softirq,
> > > > another IRQ might also interrupt a trampoline, and this last IRQ handler
> > > > might have instigated a preemption.
> > >
> > > Note, softirqs still require a real interrupt to happen in order to preempt
> > > executing code. Otherwise it should never be running from a trampoline.
> >
> > Yes, the first interrupt interrupted a trampoline. Then, on return,
> > that interrupt transitioned to softirq (as opposed to ksoftirqd).
> > While a softirq handler was executing within a trampoline, we got
> > another interrupt. We thus have two interrupted trampolines.
> >
> > Or am I missing something that prevents this?
>
> Surely the problematic case is where the first interrupt is taken from a
> trampoline, but the inner interrupt is taken from not-a-trampoline? If the
> innermost interrupt context is a trampoline, that's the same as that without
> any nesting.

It depends. If we wait for each task to not have a trampoline in effect
then yes, we only need to know whether or not a given task has at least
one trampoline in use. One concern with this approach is that a given
task might have at least one trampoline in effect every time it is
checked, unlikely though that might seem.

If this is a problem, one way around it is to instead ask whether the
current task still has a reference to one of a set of trampolines that
has recently been removed. This avoids the problem of a task always
being on some trampoline or another, but requires exact identification
of any and all trampolines a given task is currently using.

Either way, we need some way of determining whether or not a given
PC value resides in a trampoline. This likely requires some data
structure (hash table? tree? something else?) that must be traversed
in order to carry out that determination. Depending on the traversal
overhead, it might (or might not) be necessary to make sure that the
traversal is not on the entry/exit/scheduler fast paths. It is also
necessary to keep the trampoline-use overhead low and the trampoline
call points small.
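
Purely as a sketch of what such a lookup could look like: a sorted table of
non-overlapping trampoline address ranges searched with a binary search.
None of these names or structures exist today, the table size is arbitrary,
and a real implementation would need locking or RCU protection for updates,
plus attention to where on the entry/exit paths this runs.

#define MAX_TRAMPOLINES	1024	/* arbitrary, for illustration */

struct tramp_range {
	unsigned long start;
	unsigned long end;	/* exclusive */
};

/* Sorted by ->start; maintained by the trampoline allocator. */
static struct tramp_range tramp_table[MAX_TRAMPOLINES];
static int nr_tramp_ranges;

/* Illustrative only: does @ip fall inside a registered trampoline? */
static bool pc_in_trampoline(unsigned long ip)
{
	int lo = 0, hi = nr_tramp_ranges - 1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (ip < tramp_table[mid].start)
			hi = mid - 1;
		else if (ip >= tramp_table[mid].end)
			lo = mid + 1;
		else
			return true;
	}
	return false;
}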

> We could handle nesting with a thread flag (e.g. TIF_IN_TRAMPOLINE) and a flag
> in irqentry_state_t (which is on the stack, and so each nested IRQ gets its
> own):
>
> * At IRQ exception entry, if TIF_IN_TRAMPOLINE is clear and pt_regs::ip is a
> trampoline, set TIF_IN_TRAMPOLINE and irqentry_state_t::entered_trampoline.
>
> * At IRQ exception exit, if irqentry_state_t::entered_trampoline is set, clear
> TIF_IN_TRAMPOLINE.
>
> That naturally nests since the inner IRQ sees TIF_IN_TRAMPOLINE is already set
> and does nothing on entry or exit, and anything in between can inspect
> TIF_IN_TRAMPOLINE and see the right value.

If the overhead of determining whether pt_regs::ip is a trampoline
is sufficiently low, sounds good to me! I suppose that different
architectures might have different answers to this question, just to
keep things entertaining. ;-)

> On arm64 we don't dynamically allocate trampolines, *but* we potentially have a
> similar problem when changing the active ftrace_ops for a callsite, as all
> callsites share a common trampoline in the kernel text which reads a pointer to
> an ftrace_ops out of the callsite, then reads ftrace_ops::func from that.
>
> Since the ops could be dynamically allocated, we want to wait for reads of that
> to complete before reusing the memory, and ideally we wouldn't have new
> entries into the func after we think we'd completed the transition. So Tasks
> RCU might be preferable as it waits for both the trampoline *and* the func to
> complete.

OK, I am guessing that it is easier to work out whether pt_regs::ip is
a trampoline than cases involving dynamic allocation of trampolines.

> > > > Are there additional levels or mechanisms requiring identifying
> > > > return addresses?
> > >
> > > Hmm, could we add to irq_enter_rcu()
> > >
> > > __this_cpu_write(__rcu_ip, instruction_pointer(get_irq_regs()));
> > >
> > > That is to save off were the ip was when it was interrupted.
> > >
> > > Hmm, but it looks like the get_irq_regs() is set up outside of
> > > irq_enter_rcu() :-(
> > >
> > > I wonder how hard it would be to change all the architectures to pass in
> > > pt_regs to irq_enter_rcu()? All the places it is called, the regs should be
> > > available.
> > >
> > > Either way, it looks like it will be a bit of work around the trampoline or
> > > around RCU to get this efficiently done.
> >
> > One approach would be to make Tasks RCU be present for PREEMPT_AUTO
> > kernels as well as PREEMPTIBLE kernels, and then, as architectures provide
> > the needed return-address infrastructure, transition those architectures
> > to something more precise.
>
> FWIW, that sounds good to me.

Thank you, and I will Cc you on the patches.

Thanx, Paul

2024-02-22 21:23:30

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Wed, Feb 21 2024 at 22:57, Raghavendra K T wrote:
> On 2/21/2024 10:45 PM, Thomas Gleixner wrote:
>> On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
>>> Configuration tested.
>>> a) Base kernel (6.7),
>>
>> Which scheduling model is the baseline using?
>>
>
> baseline is also PREEMPT_DYNAMIC with voluntary preemption
>
>>> b) patched with PREEMPT_AUTO voluntary preemption.
>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.

Which RCU variant do you have enabled with a, b, c ?

I.e. PREEMPT_RCU=?

Thanks,

tglx

2024-02-23 03:16:11

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling


Thomas Gleixner <[email protected]> writes:

> On Wed, Feb 21 2024 at 22:57, Raghavendra K T wrote:
>> On 2/21/2024 10:45 PM, Thomas Gleixner wrote:
>>> On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
>>>> Configuration tested.
>>>> a) Base kernel (6.7),
>>>
>>> Which scheduling model is the baseline using?
>>>
>>
>> baseline is also PREEMPT_DYNAMIC with voluntary preemption
>>
>>>> b) patched with PREEMPT_AUTO voluntary preemption.
>>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>
> Which RCU variant do you have enabled with a, b, c ?
>
> I.e. PREEMPT_RCU=?

Raghu please confirm this, but if the defaults were chosen
then we should have:

>> baseline is also PREEMPT_DYNAMIC with voluntary preemption
PREEMPT_RCU=y

>>>> b) patched with PREEMPT_AUTO voluntary preemption.

If this was built with PREEMPT_VOLUNTARY then, PREEMPT_RCU=n.
If with CONFIG_PREEMPT, PREEMPT_RCU=y.

Might be worth rerunning the tests with the other combination
as well (still with voluntary preemption).

>>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
PREEMPT_RCU=y


Thanks

--
ankur

2024-02-23 06:28:39

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On 2/23/2024 8:44 AM, Ankur Arora wrote:
>
> Thomas Gleixner <[email protected]> writes:
>
>> On Wed, Feb 21 2024 at 22:57, Raghavendra K T wrote:
>>> On 2/21/2024 10:45 PM, Thomas Gleixner wrote:
>>>> On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
>>>>> Configuration tested.
>>>>> a) Base kernel (6.7),
>>>>
>>>> Which scheduling model is the baseline using?
>>>>
>>>
>>> baseline is also PREEMPT_DYNAMIC with voluntary preemption
>>>
>>>>> b) patched with PREEMPT_AUTO voluntary preemption.
>>>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>>
>> Which RCU variant do you have enabled with a, b, c ?
>>
>> I.e. PREEMPT_RCU=?
>
> Raghu please confirm this, but if the defaults were chosen
> then we should have:
>
>>> baseline is also PREEMPT_DYNAMIC with voluntary preemption
> PREEMPT_RCU=y
>
>>>>> b) patched with PREEMPT_AUTO voluntary preemption.
>
> If this was built with PREEMPT_VOLUNTARY then, PREEMPT_RCU=n.
> If with CONFIG_PREEMPT, PREEMPT_RCU=y.
>
> Might be worth rerunning the tests with the other combination
> as well (still with voluntary preemption).
>
>>>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
> PREEMPT_RCU=y

Hello Thomas, Ankur,
Yes, Ankur's understanding is right, defaults were chosen all the time so
for
a) base 6.7.0+ + PREEMPT_DYNAMIC with voluntary preemption PREEMPT_RCU=y
b) patched + PREEMPT_AUTO voluntary preemption. PREEMPT_RCU = n
c) patched + PREEMPT_DYNAMIC with voluntary preemption PREEMPT_RCU=y

I will check with other combination (CONFIG_PREEMPT/PREEMPT_RCU) for (b)
and comeback if I see anything interesting.

Thanks and Regards
- Raghu



2024-02-23 11:32:18

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Thu, Feb 22, 2024 at 11:11:34AM -0800, Paul E. McKenney wrote:
> On Thu, Feb 22, 2024 at 03:50:02PM +0000, Mark Rutland wrote:
> > On Wed, Feb 21, 2024 at 12:22:35PM -0800, Paul E. McKenney wrote:
> > > On Wed, Feb 21, 2024 at 03:11:57PM -0500, Steven Rostedt wrote:
> > > > On Wed, 21 Feb 2024 11:41:47 -0800
> > > > "Paul E. McKenney" <[email protected]> wrote:
> > > >
> > > > > > I wonder if we can just see if the instruction pointer at preemption is at
> > > > > > something that was allocated? That is, if it __is_kernel(addr) returns
> > > > > > false, then we need to do more work. Of course that means modules will also
> > > > > > trigger this. We could check __is_module_text() but that does a bit more
> > > > > > work and may cause too much overhead. But who knows, if the module check is
> > > > > > only done if the __is_kernel() check fails, maybe it's not that bad.
> > > > >
> > > > > I do like very much that idea, but it requires that we be able to identify
> > > > > this instruction pointer perfectly, no matter what. It might also require
> > > > > that we be able to perfectly identify any IRQ return addresses as well,
> > > > > for example, if the preemption was triggered within an interrupt handler.
> > > > > And interrupts from softirq environments might require identifying an
> > > > > additional level of IRQ return address. The original IRQ might have
> > > > > interrupted a trampoline, and then after transitioning into softirq,
> > > > > another IRQ might also interrupt a trampoline, and this last IRQ handler
> > > > > might have instigated a preemption.
> > > >
> > > > Note, softirqs still require a real interrupt to happen in order to preempt
> > > > executing code. Otherwise it should never be running from a trampoline.
> > >
> > > Yes, the first interrupt interrupted a trampoline. Then, on return,
> > > that interrupt transitioned to softirq (as opposed to ksoftirqd).
> > > While a softirq handler was executing within a trampoline, we got
> > > another interrupt. We thus have two interrupted trampolines.
> > >
> > > Or am I missing something that prevents this?
> >
> > Surely the problematic case is where the first interrupt is taken from a
> > trampoline, but the inner interrupt is taken from not-a-trampoline? If the
> > innermost interrupt context is a trampoline, that's the same as that without
> > any nesting.
>
> It depends. If we wait for each task to not have a trampoline in effect
> then yes, we only need to know whether or not a given task has at least
> one trampoline in use. One concern with this approach is that a given
> task might have at least one trampoline in effect every time it is
> checked, unlikely though that might seem.
>
> If this is a problem, one way around it is to instead ask whether the
> current task still has a reference to one of a set of trampolines that
> has recently been removed. This avoids the problem of a task always
> > being on some trampoline or another, but requires exact identification
> of any and all trampolines a given task is currently using.
>
> Either way, we need some way of determining whether or not a given
> PC value resides in a trampoline. This likely requires some data
> structure (hash table? tree? something else?) that must be traversed
> in order to carry out that determination. Depending on the traversal
> overhead, it might (or might not) be necessary to make sure that the
> traversal is not on the entry/exit/scheduler fast paths. It is also
> necessary to keep the trampoline-use overhead low and the trampoline
> call points small.

Thanks; I hadn't thought about that shape of livelock problem; with that in
mind my suggestion using flags was inadequate.

I'm definitely in favour of just using Tasks RCU! That's what arm64 does today,
anyhow!

Mark.

2024-02-23 15:32:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Fri, Feb 23, 2024 at 11:05:45AM +0000, Mark Rutland wrote:
> On Thu, Feb 22, 2024 at 11:11:34AM -0800, Paul E. McKenney wrote:
> > On Thu, Feb 22, 2024 at 03:50:02PM +0000, Mark Rutland wrote:
> > > On Wed, Feb 21, 2024 at 12:22:35PM -0800, Paul E. McKenney wrote:
> > > > On Wed, Feb 21, 2024 at 03:11:57PM -0500, Steven Rostedt wrote:
> > > > > On Wed, 21 Feb 2024 11:41:47 -0800
> > > > > "Paul E. McKenney" <[email protected]> wrote:
> > > > >
> > > > > > > I wonder if we can just see if the instruction pointer at preemption is at
> > > > > > > something that was allocated? That is, if it __is_kernel(addr) returns
> > > > > > > false, then we need to do more work. Of course that means modules will also
> > > > > > > trigger this. We could check __is_module_text() but that does a bit more
> > > > > > > work and may cause too much overhead. But who knows, if the module check is
> > > > > > > only done if the __is_kernel() check fails, maybe it's not that bad.
> > > > > >
> > > > > > I do like very much that idea, but it requires that we be able to identify
> > > > > > this instruction pointer perfectly, no matter what. It might also require
> > > > > > that we be able to perfectly identify any IRQ return addresses as well,
> > > > > > for example, if the preemption was triggered within an interrupt handler.
> > > > > > And interrupts from softirq environments might require identifying an
> > > > > > additional level of IRQ return address. The original IRQ might have
> > > > > > interrupted a trampoline, and then after transitioning into softirq,
> > > > > > another IRQ might also interrupt a trampoline, and this last IRQ handler
> > > > > > might have instigated a preemption.
> > > > >
> > > > > Note, softirqs still require a real interrupt to happen in order to preempt
> > > > > executing code. Otherwise it should never be running from a trampoline.
> > > >
> > > > Yes, the first interrupt interrupted a trampoline. Then, on return,
> > > > that interrupt transitioned to softirq (as opposed to ksoftirqd).
> > > > While a softirq handler was executing within a trampoline, we got
> > > > another interrupt. We thus have two interrupted trampolines.
> > > >
> > > > Or am I missing something that prevents this?
> > >
> > > Surely the problematic case is where the first interrupt is taken from a
> > > trampoline, but the inner interrupt is taken from not-a-trampoline? If the
> > > innermost interrupt context is a trampoline, that's the same as that without
> > > any nesting.
> >
> > It depends. If we wait for each task to not have a trampoline in effect
> > then yes, we only need to know whether or not a given task has at least
> > one trampoline in use. One concern with this approach is that a given
> > task might have at least one trampoline in effect every time it is
> > checked, unlikely though that might seem.
> >
> > If this is a problem, one way around it is to instead ask whether the
> > current task still has a reference to one of a set of trampolines that
> > has recently been removed. This avoids the problem of a task always
> > being on some trampoline or another, but requires exact identification
> > of any and all trampolines a given task is currently using.
> >
> > Either way, we need some way of determining whether or not a given
> > PC value resides in a trampoline. This likely requires some data
> > structure (hash table? tree? something else?) that must be traversed
> > in order to carry out that determination. Depending on the traversal
> > overhead, it might (or might not) be necessary to make sure that the
> > traversal is not on the entry/exit/scheduler fast paths. It is also
> > necessary to keep the trampoline-use overhead low and the trampoline
> > call points small.
>
> Thanks; I hadn't thought about that shape of livelock problem; with that in
> mind my suggestion using flags was inadequate.
>
> I'm definitely in favour of just using Tasks RCU! That's what arm64 does today,
> anyhow!

Full speed ahead, then!!! But if you come up with a nicer solution,
please do not keep it a secret!

Thanx, Paul

2024-02-24 03:15:34

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On 2/23/2024 11:58 AM, Raghavendra K T wrote:
> On 2/23/2024 8:44 AM, Ankur Arora wrote:
>>
>> Thomas Gleixner <[email protected]> writes:
>>
>>> On Wed, Feb 21 2024 at 22:57, Raghavendra K T wrote:
>>>> On 2/21/2024 10:45 PM, Thomas Gleixner wrote:
>>>>> On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
>>>>>> Configuration tested.
>>>>>> a) Base kernel (6.7),
>>>>>
>>>>> Which scheduling model is the baseline using?
>>>>>
>>>>
>>>> baseline is also PREEMPT_DYNAMIC with voluntary preemption
>>>>
>>>>>> b) patched with PREEMPT_AUTO voluntary preemption.
>>>>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>>>
>>> Which RCU variant do you have enabled with a, b, c ?
>>>
>>> I.e. PREEMPT_RCU=?
>>
>> Raghu please confirm this, but if the defaults were chosen
>> then we should have:
>>
>>>> baseline is also PREEMPT_DYNAMIC with voluntary preemption
>> PREEMPT_RCU=y
>>
>>>>>> b) patched with PREEMPT_AUTO voluntary preemption.
>>
>> If this was built with PREEMPT_VOLUNTARY then, PREEMPT_RCU=n.
>> If with CONFIG_PREEMPT, PREEMPT_RCU=y.
>>
>> Might be worth rerunning the tests with the other combination
>> as well (still with voluntary preemption).
>>
>>>>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>> PREEMPT_RCU=y
>
> Hello Thomas, Ankur,
> Yes, Ankur's understanding is right, defaults were chosen all the time so
> for
> a) base 6.7.0+ + PREEMPT_DYNAMIC with voluntary preemption PREEMPT_RCU=y
> b) patched + PREEMPT_AUTO voluntary preemption. PREEMPT_RCU = n
> c) patched + PREEMPT_DYNAMIC with voluntary preemption PREEMPT_RCU=y

> I will check with other combination (CONFIG_PREEMPT/PREEMPT_RCU) for (b)
> and comeback if I see anything interesting.
>

I see that

d) patched + PREEMPT_AUTO=y voluntary preemption CONFIG_PREEMPT,
PREEMPT_RCU = y

All the results at 80% confidence
case (d)
HashJoin 0%
Graph500 0%
XSBench +1.2%
NAS-ft +2.1%

In general the averages are better for all the benchmarks, but at 99%
confidence there seems to be no difference.

Overall looks on par or better for case (d)

Thanks and Regards
- Raghu






2024-02-27 17:47:10

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling


Raghavendra K T <[email protected]> writes:

> On 2/23/2024 11:58 AM, Raghavendra K T wrote:
>> On 2/23/2024 8:44 AM, Ankur Arora wrote:
>>>
>>> Thomas Gleixner <[email protected]> writes:
>>>
>>>> On Wed, Feb 21 2024 at 22:57, Raghavendra K T wrote:
>>>>> On 2/21/2024 10:45 PM, Thomas Gleixner wrote:
>>>>>> On Wed, Feb 21 2024 at 17:53, Raghavendra K T wrote:
>>>>>>> Configuration tested.
>>>>>>> a) Base kernel (6.7),
>>>>>>
>>>>>> Which scheduling model is the baseline using?
>>>>>>
>>>>>
>>>>> baseline is also PREEMPT_DYNAMIC with voluntary preemption
>>>>>
>>>>>>> b) patched with PREEMPT_AUTO voluntary preemption.
>>>>>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>>>>
>>>> Which RCU variant do you have enabled with a, b, c ?
>>>>
>>>> I.e. PREEMPT_RCU=?
>>>
>>> Raghu please confirm this, but if the defaults were chosen
>>> then we should have:
>>>
>>>>> baseline is also PREEMPT_DYNAMIC with voluntary preemption
>>> PREEMPT_RCU=y
>>>
>>>>>>> b) patched with PREEMPT_AUTO voluntary preemption.
>>>
>>> If this was built with PREEMPT_VOLUNTARY then, PREEMPT_RCU=n.
>>> If with CONFIG_PREEMPT, PREEMPT_RCU=y.
>>>
>>> Might be worth rerunning the tests with the other combination
>>> as well (still with voluntary preemption).
>>>
>>>>>>> c) patched with PREEMPT_DYNAMIC voluntary preemption.
>>> PREEMPT_RCU=y
>> Hello Thomas, Ankur,
>> Yes, Ankur's understanding is right, defaults were chosen all the time so
>> for
>> a) base 6.7.0+ + PREEMPT_DYNAMIC with voluntary preemption PREEMPT_RCU=y
>> b) patched + PREEMPT_AUTO voluntary preemption. PREEMPT_RCU = n
>> c) patched + PREEMPT_DYNAMIC with voluntary preemption PREEMPT_RCU=y
>
>> I will check with other combination (CONFIG_PREEMPT/PREEMPT_RCU) for (b)
>> and comeback if I see anything interesting.
>>
>
> I see that
>
> d) patched + PREEMPT_AUTO=y voluntary preemption CONFIG_PREEMPT, PREEMPT_RCU = y
>
> All the results at 80% confidence
> case (d)
> HashJoin 0%
> Graph500 0%
> XSBench +1.2%
> NAS-ft +2.1%
>
> In general the averages are better for all the benchmarks, but at 99%
> confidence there seems to be no difference.
>
> Overall looks on par or better for case (d)

Thanks for running all of these Raghu. The numbers look pretty good
(better than I expected honestly).

--
ankur

2024-02-28 13:48:14

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH 23/30] sched/fair: handle tick expiry under lazy preemption

Hi Ankur,

On 12/02/24 21:55, Ankur Arora wrote:
> The default policy for lazy scheduling is to schedule in exit-to-user,
> assuming that would happen within the remaining time quanta of the
> task.
>
> However, that runs into the 'hog' problem -- the target task might
> be running in the kernel and might not relinquish CPU on its own.
>
> Handle that by upgrading the ignored tif_resched(NR_lazy) bit to
> tif_resched(NR_now) at the next tick.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Originally-by: Thomas Gleixner <[email protected]>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <[email protected]>
>
> ---
> Note:
> Instead of special casing the tick, it might be simpler to always
> do the upgrade on the second resched_curr().
>
> The theoretical problem with doing that is that the current
> approach deterministically provides a well-defined extra unit of
> time. Going with a second resched_curr() might mean that the
> amount of extra time the task gets depends on the vagaries of
> the incoming resched_curr() (which is fine if it's mostly from
> the tick; not fine if we could get it due to other reasons.)
>
> Practically, both performed equally well in my tests.
>
> Thoughts?

I'm still digesting the series, so I could simply be confused, but I
have the impression that the extra unit of time might be a problem for
deadline (and maybe rt as well?).

For deadline we call resched_curr_tick() from the throttle part of
update_curr_dl_se() if the dl_se happens to not be the leftmost anymore,
so in this case I believe we really want to reschedule straight away and
not wait for the second time around (otherwise we might be breaking the
new leftmost tasks guarantees)?

Thanks,
Juri


2024-02-29 06:46:29

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 23/30] sched/fair: handle tick expiry under lazy preemption


Juri Lelli <[email protected]> writes:

> Hi Ankur,
>
> On 12/02/24 21:55, Ankur Arora wrote:
>> The default policy for lazy scheduling is to schedule in exit-to-user,
>> assuming that would happen within the remaining time quanta of the
>> task.
>>
>> However, that runs into the 'hog' problem -- the target task might
>> be running in the kernel and might not relinquish CPU on its own.
>>
>> Handle that by upgrading the ignored tif_resched(NR_lazy) bit to
>> tif_resched(NR_now) at the next tick.
>>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Juri Lelli <[email protected]>
>> Cc: Vincent Guittot <[email protected]>
>> Originally-by: Thomas Gleixner <[email protected]>
>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>> Signed-off-by: Ankur Arora <[email protected]>
>>
>> ---
>> Note:
>> Instead of special casing the tick, it might be simpler to always
>> do the upgrade on the second resched_curr().
>>
>> The theoretical problem with doing that is that the current
>> approach deterministically provides a well-defined extra unit of
>> time. Going with a second resched_curr() might mean that the
>> amount of extra time the task gets depends on the vagaries of
>> the incoming resched_curr() (which is fine if it's mostly from
>> the tick; not fine if we could get it due to other reasons.)
>>
>> Practically, both performed equally well in my tests.
>>
>> Thoughts?
>
> I'm still digesting the series, so I could simply be confused, but I
> have the impression that the extra unit of time might be a problem for
> deadline (and maybe rt as well?).
>
> For deadline we call resched_curr_tick() from the throttle part of
> update_curr_dl_se() if the dl_se happens to not be the leftmost anymore,
> so in this case I believe we really want to reschedule straight away and
> not wait for the second time around (otherwise we might be breaking the
> new leftmost tasks guarantees)?

Yes, agreed, this looks like it breaks the deadline invariant for both
preempt=none and preempt=voluntary.

For RT, update_curr_rt() seems to have a similar problem if the task
doesn't have RUNTIME_INF.

Relatedly, do you think there's a similar problem when switching to
a task with a higher scheduling class?
(The related code is in patches 25 and 26.)

For preempt=voluntary, wakeup_preempt() will do the right thing, but
for preempt=none, we only reschedule lazily so the target might
continue to run until the end of the tick.

Thanks for the review, btw.

--
ankur

2024-02-29 09:44:21

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH 23/30] sched/fair: handle tick expiry under lazy preemption

On 28/02/24 22:43, Ankur Arora wrote:
> Juri Lelli <[email protected]> writes:

..

> > For deadline we call resched_curr_tick() from the throttle part of
> > update_curr_dl_se() if the dl_se happens to not be the leftmost anymore,
> > so in this case I believe we really want to reschedule straight away and
> > not wait for the second time around (otherwise we might be breaking the
> > new leftmost tasks guarantees)?
>
> Yes, agreed, this looks like it breaks the deadline invariant for both
> preempt=none and preempt=voluntary.
>
> For RT, update_curr_rt() seems to have a similar problem if the task
> doesn't have RUNTIME_INF.
>
> Relatedly, do you think there's a similar problem when switching to
> a task with a higher scheduling class?
> (The related code is in patches 25 and 26.)
>
> For preempt=voluntary, wakeup_preempt() will do the right thing, but

Right.

> for preempt=none, we only reschedule lazily so the target might
> continue to run until the end of the tick.

Hummm, not sure honestly, but I seem to understand that with
preempt=none we want to be super conservative wrt preemptions, so maybe
current behavior (1 tick of laziness) is OK? Otherwise what would be the
difference wrt preempt=voluntary from a scheduler pow? Yes, it might
break deadline guarantees, but if you wanted to use preempt=none maybe
there is a strong reason for it, I'm thinking.

> Thanks for the review, btw.

Sure. Thanks for working on this actually! :)

Best,
Juri


2024-02-29 23:56:10

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 23/30] sched/fair: handle tick expiry under lazy preemption


Juri Lelli <[email protected]> writes:

> On 28/02/24 22:43, Ankur Arora wrote:
>> Juri Lelli <[email protected]> writes:
>
> ..
>
>> > For deadline we call resched_curr_tick() from the throttle part of
>> > update_curr_dl_se() if the dl_se happens to not be the leftmost anymore,
>> > so in this case I believe we really want to reschedule straight away and
>> > not wait for the second time around (otherwise we might be breaking the
>> > new leftmost tasks guarantees)?
>>
>> Yes, agreed, this looks like it breaks the deadline invariant for both
>> preempt=none and preempt=voluntary.
>>
>> For RT, update_curr_rt() seems to have a similar problem if the task
>> doesn't have RUNTIME_INF.
>>
>> Relatedly, do you think there's a similar problem when switching to
>> a task with a higher scheduling class?
>> (Related to code is in patch 25, 26.)
>>
>> For preempt=voluntary, wakeup_preempt() will do the right thing, but
>
> Right.
>
>> for preempt=none, we only reschedule lazily so the target might
>> continue to run until the end of the tick.
>
> Hummm, not sure honestly, but I seem to understand that with
> preempt=none we want to be super conservative wrt preemptions, so maybe
> current behavior (1 tick of laziness) is OK? Otherwise what would be the

Yeah, that's kind of where I'm thinking of getting to. Be lazy so long
as we don't violate guarantees.

> difference wrt preempt=voluntary from a scheduler pow? Yes, it might
> break deadline guarantees, but if you wanted to use preempt=none maybe
> there is a strong reason for it, I'm thinking.

Yeah, the difference between preempt=none and preempt=voluntary is
looking narrower and narrower, and maybe a bit artificial in that
there seem to be very few cases where the two models would actually
differ in behaviour.

Thanks
Ankur

>> Thanks for the review, btw.
>
> Sure. Thanks for working on this actually! :)
>
> Best,
> Juri

2024-03-01 00:28:29

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 23/30] sched/fair: handle tick expiry under lazy preemption

On Thu, Feb 29, 2024 at 03:54:42PM -0800, Ankur Arora wrote:
>
> Juri Lelli <[email protected]> writes:
>
> > On 28/02/24 22:43, Ankur Arora wrote:
> >> Juri Lelli <[email protected]> writes:
> >
> > ..
> >
> >> > For deadline we call resched_curr_tick() from the throttle part of
> >> > update_curr_dl_se() if the dl_se happens to not be the leftmost anymore,
> >> > so in this case I believe we really want to reschedule straight away and
> >> > not wait for the second time around (otherwise we might be breaking the
> >> > new leftmost tasks guarantees)?
> >>
> >> Yes, agreed, this looks like it breaks the deadline invariant for both
> >> preempt=none and preempt=voluntary.
> >>
> >> For RT, update_curr_rt() seems to have a similar problem if the task
> >> doesn't have RUNTIME_INF.
> >>
> >> Relatedly, do you think there's a similar problem when switching to
> >> a task with a higher scheduling class?
> >> (The related code is in patches 25 and 26.)
> >>
> >> For preempt=voluntary, wakeup_preempt() will do the right thing, but
> >
> > Right.
> >
> >> for preempt=none, we only reschedule lazily so the target might
> >> continue to run until the end of the tick.
> >
> > Hummm, not sure honestly, but I seem to understand that with
> > preempt=none we want to be super conservative wrt preemptions, so maybe
> > current behavior (1 tick of laziness) is OK? Otherwise what would be the
>
> Yeah, that's kind of where I'm thinking of getting to. Be lazy so long
> as we don't violate guarantees.
>
> > difference wrt preempt=voluntary from a scheduler pow? Yes, it might
> > break deadline guarantees, but if you wanted to use preempt=none maybe
> > there is a strong reason for it, I'm thinking.
>
> Yeah, the difference between preempt=none and preempt=voluntary is
> looking narrower and narrower, and maybe a bit artificial in that
> there seem to be very few cases where the two models would actually
> differ in behaviour.

If it turns out that cond_resched() and the preemption points in
might_sleep() really can be dispensed with, then there would be little
difference between them. But that is still "if". ;-)

Thanx, Paul

> Thanks
> Ankur
>
> >> Thanks for the review, btw.
> >
> > Sure. Thanks for working on this actually! :)
> >
> > Best,
> > Juri

2024-03-01 23:34:08

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 29/30] Documentation: tracing: add TIF_NEED_RESCHED_LAZY

On Wed, Feb 21, 2024 at 04:43:34PM -0500, Steven Rostedt wrote:
> On Mon, 12 Feb 2024 21:55:53 -0800
> Ankur Arora <[email protected]> wrote:
>
> > Document various combinations of resched flags.
> >
> > Cc: Steven Rostedt <[email protected]>
> > Cc: Masami Hiramatsu <[email protected]>
> > Cc: Jonathan Corbet <[email protected]>
> > Originally-by: Thomas Gleixner <[email protected]>
> > Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> > Signed-off-by: Ankur Arora <[email protected]>
> > ---
> > Documentation/trace/ftrace.rst | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
> > index 7e7b8ec17934..7f20c0bae009 100644
> > --- a/Documentation/trace/ftrace.rst
> > +++ b/Documentation/trace/ftrace.rst
> > @@ -1036,8 +1036,12 @@ explains which is which.
> > be printed here.
> >
> > need-resched:
> > - - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
> > + - 'B' all three, TIF_NEED_RESCHED, TIF_NEED_RESCHED_LAZY and PREEMPT_NEED_RESCHED are set,
> > + - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED are set,
> > + - 'L' both TIF_NEED_RESCHED_LAZY and PREEMPT_NEED_RESCHED are set,
> > + - 'b' both TIF_NEED_RESCHED and TIF_NEED_RESCHED_LAZY are set,
> > - 'n' only TIF_NEED_RESCHED is set,
> > + - 'l' only TIF_NEED_RESCHED_LAZY is set,
> > - 'p' only PREEMPT_NEED_RESCHED is set,

One thing I was curious about in current code, under which situation will
"only PREEMPT_NEED_RESCHED is set" case happen? All the code paths I see set
the PREEMPT_NEED_RESCHED when the TIF has already been set. That kind of
makes sense, why enter the scheduler on preempt_enable() unless there was a
reason to, and TIF should have recorded that reason.

So in other words, if 'p' above cannot happen, then it could be removed as a
potential need-resched stat. If it can happen, I'd love to learn in which
case?

Also considering all users of set_tsk_need_resched() also call the
set_preempt_need_resched() shortly after, should we add a helper to squash
the 2 calls and simplify callers?

There are some "special cases" where only TIF flag need be set (such as setting
rescheduling from an interrupt or when rescheduling a remote CPU). For those,
they can directly call the set_tsk_need_resched() like they do now.

thanks,

- Joel




2024-03-02 01:18:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Fri, Feb 23, 2024 at 07:31:50AM -0800, Paul E. McKenney wrote:
> On Fri, Feb 23, 2024 at 11:05:45AM +0000, Mark Rutland wrote:
> > On Thu, Feb 22, 2024 at 11:11:34AM -0800, Paul E. McKenney wrote:
> > > On Thu, Feb 22, 2024 at 03:50:02PM +0000, Mark Rutland wrote:
> > > > On Wed, Feb 21, 2024 at 12:22:35PM -0800, Paul E. McKenney wrote:
> > > > > On Wed, Feb 21, 2024 at 03:11:57PM -0500, Steven Rostedt wrote:
> > > > > > On Wed, 21 Feb 2024 11:41:47 -0800
> > > > > > "Paul E. McKenney" <[email protected]> wrote:
> > > > > >
> > > > > > > > I wonder if we can just see if the instruction pointer at preemption is at
> > > > > > > > something that was allocated? That is, if it __is_kernel(addr) returns
> > > > > > > > false, then we need to do more work. Of course that means modules will also
> > > > > > > > trigger this. We could check __is_module_text() but that does a bit more
> > > > > > > > work and may cause too much overhead. But who knows, if the module check is
> > > > > > > > only done if the __is_kernel() check fails, maybe it's not that bad.
> > > > > > >
> > > > > > > I do like very much that idea, but it requires that we be able to identify
> > > > > > > this instruction pointer perfectly, no matter what. It might also require
> > > > > > > that we be able to perfectly identify any IRQ return addresses as well,
> > > > > > > for example, if the preemption was triggered within an interrupt handler.
> > > > > > > And interrupts from softirq environments might require identifying an
> > > > > > > additional level of IRQ return address. The original IRQ might have
> > > > > > > interrupted a trampoline, and then after transitioning into softirq,
> > > > > > > another IRQ might also interrupt a trampoline, and this last IRQ handler
> > > > > > > might have instigated a preemption.
> > > > > >
> > > > > > Note, softirqs still require a real interrupt to happen in order to preempt
> > > > > > executing code. Otherwise it should never be running from a trampoline.
> > > > >
> > > > > Yes, the first interrupt interrupted a trampoline. Then, on return,
> > > > > that interrupt transitioned to softirq (as opposed to ksoftirqd).
> > > > > While a softirq handler was executing within a trampoline, we got
> > > > > another interrupt. We thus have two interrupted trampolines.
> > > > >
> > > > > Or am I missing something that prevents this?
> > > >
> > > > Surely the problematic case is where the first interrupt is taken from a
> > > > trampoline, but the inner interrupt is taken from not-a-trampoline? If the
> > > > innermost interrupt context is a trampoline, that's the same as that without
> > > > any nesting.
> > >
> > > It depends. If we wait for each task to not have a trampoline in effect
> > > then yes, we only need to know whether or not a given task has at least
> > > one trampoline in use. One concern with this approach is that a given
> > > task might have at least one trampoline in effect every time it is
> > > checked, unlikely though that might seem.
> > >
> > > If this is a problem, one way around it is to instead ask whether the
> > > current task still has a reference to one of a set of trampolines that
> > > has recently been removed. This avoids the problem of a task always
> > > being on some trampoline or another, but requires exact identification
> > > of any and all trampolines a given task is currently using.
> > >
> > > Either way, we need some way of determining whether or not a given
> > > PC value resides in a trampoline. This likely requires some data
> > > structure (hash table? tree? something else?) that must be traversed
> > > in order to carry out that determination. Depending on the traversal
> > > overhead, it might (or might not) be necessary to make sure that the
> > > traversal is not on the entry/exit/scheduler fast paths. It is also
> > > necessary to keep the trampoline-use overhead low and the trampoline
> > > call points small.
> >
> > Thanks; I hadn't thought about that shape of livelock problem; with that in
> > mind my suggestion using flags was inadequate.
> >
> > I'm definitely in favour of just using Tasks RCU! That's what arm64 does today,
> > anyhow!
>
> Full speed ahead, then!!! But if you come up with a nicer solution,
> please do not keep it a secret!

The networking NAPI code ends up needing special help to avoid starving
Tasks RCU grace periods [1]. I am therefore revisiting trying to make
Tasks RCU directly detect trampoline usage, but without quite as much
need to identify specific trampolines...

I am putting this information in a Google document for future
reference [2].

Thoughts?

Thanx, Paul

[1] https://lore.kernel.org/all/[email protected]/
[2] https://docs.google.com/document/d/1kZY6AX-AHRIyYQsvUX6WJxS1LsDK4JA2CHuBnpkrR_U/edit?usp=sharing

2024-03-02 03:11:15

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 29/30] Documentation: tracing: add TIF_NEED_RESCHED_LAZY


Joel Fernandes <[email protected]> writes:

> On Wed, Feb 21, 2024 at 04:43:34PM -0500, Steven Rostedt wrote:
>> On Mon, 12 Feb 2024 21:55:53 -0800
>> Ankur Arora <[email protected]> wrote:
>>
>> > Document various combinations of resched flags.
>> >
>> > Cc: Steven Rostedt <[email protected]>
>> > Cc: Masami Hiramatsu <[email protected]>
>> > Cc: Jonathan Corbet <[email protected]>
>> > Originally-by: Thomas Gleixner <[email protected]>
>> > Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>> > Signed-off-by: Ankur Arora <[email protected]>
>> > ---
>> > Documentation/trace/ftrace.rst | 6 +++++-
>> > 1 file changed, 5 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
>> > index 7e7b8ec17934..7f20c0bae009 100644
>> > --- a/Documentation/trace/ftrace.rst
>> > +++ b/Documentation/trace/ftrace.rst
>> > @@ -1036,8 +1036,12 @@ explains which is which.
>> > be printed here.
>> >
>> > need-resched:
>> > - - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
>> > + - 'B' all three, TIF_NEED_RESCHED, TIF_NEED_RESCHED_LAZY and PREEMPT_NEED_RESCHED are set,
>> > + - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED are set,
>> > + - 'L' both TIF_NEED_RESCHED_LAZY and PREEMPT_NEED_RESCHED are set,
>> > + - 'b' both TIF_NEED_RESCHED and TIF_NEED_RESCHED_LAZY are set,
>> > - 'n' only TIF_NEED_RESCHED is set,
>> > + - 'l' only TIF_NEED_RESCHED_LAZY is set,
>> > - 'p' only PREEMPT_NEED_RESCHED is set,
>
> One thing I was curious about in current code, under which situation will
> "only PREEMPT_NEED_RESCHED is set" case happen? All the code paths I see set
> the PREEMPT_NEED_RESCHED when the TIF flag has already been set. That kind of
> makes sense: why enter the scheduler on preempt_enable() unless there was a
> reason to, and the TIF flag should have recorded that reason.

True. And, the only place where we clear PREEMPT_NEED_RESCHED is in
__schedule() after clearing the TIF_NEED_RESCHED* flags.
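
Roughly, the relevant lines in __schedule() look like this (paraphrased
from kernel/sched/core.c; a sketch, not the exact code):

	/* Once the next task has been picked, both indications are
	 * cleared together. */
	clear_tsk_need_resched(prev);
	clear_preempt_need_resched();

which is why 'p' on its own should not normally be observable.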

> So in other words, if 'p' above cannot happen, then it could be removed as a
> potential need-resched stat. If it can happen, I'd love to learn in which
> case?

Yeah, AFAICT the only case we might see it alone is in case of a bug.

> Also considering all users of set_tsk_need_resched() also call the
> set_preempt_need_resched() shortly after, should we add a helper to squash
> the 2 calls and simplify callers?
>
Just to explicitly lay out the reason for these being separate interfaces:
set_tsk_need_resched() can be called from a remote CPU, while
set_preempt_need_resched() (or its folding version) is only meant to be
used on the local CPU.

> There are some "special cases" where only TIF flag need be set (such as setting
> rescheduling from an interrupt or when rescheduling a remote CPU). For those,
> they can directly call the set_tsk_need_resched() like they do now.

The remote case, as you say, is always handled in the scheduler. So, maybe
it is worth having a set_need_resched_local_cpu() which takes care of
both of these things but, more importantly, makes clear the limits of this.

That said, these interfaces have very few users (just RCU, and the s390
page fault handler) and both of these are fairly sophisticated users, so
not sure yet another interface in this area is worth adding.
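
For illustration only, such a helper might look like the sketch below.
It is hypothetical (the name comes from the suggestion above) and not
part of this series:

	/*
	 * Only valid for the currently running task on the local CPU,
	 * since set_preempt_need_resched() folds the request into this
	 * CPU's preempt count.
	 */
	static inline void set_need_resched_local_cpu(void)
	{
		set_tsk_need_resched(current);
		set_preempt_need_resched();
	}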

Thanks

--
ankur

2024-03-03 01:08:34

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO

Hi Ankur,

On Mon, Feb 12, 2024 at 09:55:50PM -0800, Ankur Arora wrote:
> The default preemption policy for voluntary preemption under
> PREEMPT_AUTO is to schedule eagerly for tasks of higher scheduling
> class, and lazily for well-behaved, non-idle tasks.
>
> This is the same policy as preempt=none, with an eager handling of
> higher priority scheduling classes.

AFAICS, the meaning of the word 'voluntary' has changed versus the old
CONFIG_PREEMPT_VOLUNTARY, with this patch.

So the word voluntary does not completely make sense in this context. What is
VOLUNTARY about choosing a higher scheduling class?

For instance, even in the same scheduling class, there is a notion of higher
priority, not just between classes. Example, higher RT priority within RT, or
earlier deadline within EEVDF (formerly CFS).

IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
'voluntary' business because
1. The behavior vs =none is to allow a higher scheduling class to preempt; it
is not about the old voluntary.
2. You are also planning to remove cond_resched()s via this series and leave
it to the scheduler, right?

Or call it preempt=higher, or something? No one is going to understand the
meaning of voluntary the way it is implied here IMHO.

thanks,

- Joel


2024-03-03 19:32:39

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 29/30] Documentation: tracing: add TIF_NEED_RESCHED_LAZY



On 3/1/2024 10:09 PM, Ankur Arora wrote:
>
> Joel Fernandes <[email protected]> writes:
>
>> On Wed, Feb 21, 2024 at 04:43:34PM -0500, Steven Rostedt wrote:
>>> On Mon, 12 Feb 2024 21:55:53 -0800
>>> Ankur Arora <[email protected]> wrote:
>>>
>>>> Document various combinations of resched flags.
>>>>
>>>> Cc: Steven Rostedt <[email protected]>
>>>> Cc: Masami Hiramatsu <[email protected]>
>>>> Cc: Jonathan Corbet <[email protected]>
>>>> Originally-by: Thomas Gleixner <[email protected]>
>>>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>>>> Signed-off-by: Ankur Arora <[email protected]>
>>>> ---
>>>> Documentation/trace/ftrace.rst | 6 +++++-
>>>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
>>>> index 7e7b8ec17934..7f20c0bae009 100644
>>>> --- a/Documentation/trace/ftrace.rst
>>>> +++ b/Documentation/trace/ftrace.rst
>>>> @@ -1036,8 +1036,12 @@ explains which is which.
>>>> be printed here.
>>>>
>>>> need-resched:
>>>> - - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
>>>> + - 'B' all three, TIF_NEED_RESCHED, TIF_NEED_RESCHED_LAZY and PREEMPT_NEED_RESCHED are set,
>>>> + - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED are set,
>>>> + - 'L' both TIF_NEED_RESCHED_LAZY and PREEMPT_NEED_RESCHED are set,
>>>> + - 'b' both TIF_NEED_RESCHED and TIF_NEED_RESCHED_LAZY are set,
>>>> - 'n' only TIF_NEED_RESCHED is set,
>>>> + - 'l' only TIF_NEED_RESCHED_LAZY is set,
>>>> - 'p' only PREEMPT_NEED_RESCHED is set,
>>
>> One thing I was curious about in current code, under which situation will
>> "only PREEMPT_NEED_RESCHED is set" case happen? All the code paths I see set
>> the PREEMPT_NEED_RESCHED when the TIF flag has already been set. That kind of
>> makes sense: why enter the scheduler on preempt_enable() unless there was a
>> reason to, and the TIF flag should have recorded that reason.
>
> True. And, the only place where we clear PREEMPT_NEED_RESCHED is in
> __schedule() after clearing the TIF_NEED_RESCHED* flags.

Thanks for taking a look.

>> So in other words, if 'p' above cannot happen, then it could be removed as a
>> potential need-resched stat. If it can happen, I'd love to learn in which
>> case?
>
> Yeah, AFAICT the only case we might see it alone is in case of a bug.

Thanks for confirming as well. Steve, are you OK with a patch to remove 'p'? Or
would you rather leave it alone in case something changes in the future? I'd
prefer that it be removed.

>> Also considering all users of set_tsk_need_resched() also call the
>> set_preempt_need_resched() shortly after, should we add a helper to squash
>> the 2 calls and simplify callers?
>>
> Just to explicitly lay out the reason for these being separate interfaces:
> set_tsk_need_resched() can be set from a remote CPU, while
> set_preempt_need_resched() (or its folding version) is only meant to be
> used on the local CPU.

Yes, agreed.

>> There are some "special cases" where only TIF flag need be set (such as setting
>> rescheduling from an interrupt or when rescheduling a remote CPU). For those,
>> they can directly call the set_tsk_need_resched() like they do now.
>
> The remote case, as you say is always handled in the scheduler. So, maybe
> it is worth having a set_need_resched_local_cpu() which takes care of
> both of these things but more importantly makes clear the limits of this.
>
> That said, these interfaces have very few users (just RCU, and the s390
> page fault handler) and both of these are fairly sophisticated users, so
> not sure yet another interface in this area is worth adding.
>

Sounds good, though there are quite a few RCU users, and I do remember that in
the past, during a code review, I pointed out that both needed to be set, not
just one. :) I wonder what Paul McKenney thinks about this as well ;-)

thanks,

- Joel


2024-03-05 08:14:42

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO


Joel Fernandes <[email protected]> writes:

> Hi Ankur,
>
> On Mon, Feb 12, 2024 at 09:55:50PM -0800, Ankur Arora wrote:
>> The default preemption policy for voluntary preemption under
>> PREEMPT_AUTO is to schedule eagerly for tasks of higher scheduling
>> class, and lazily for well-behaved, non-idle tasks.
>>
>> This is the same policy as preempt=none, with an eager handling of
>> higher priority scheduling classes.
>
> AFAICS, the meaning of the word 'voluntary' has changed versus the old
> CONFIG_PREEMPT_VOLUNTARY, with this patch.
>
> So the word voluntary does not completely make sense in this context. What is
> VOLUNTARY about choosing a higher scheduling class?
>
> For instance, even in the same scheduling class, there is a notion of higher
> priority, not just between classes. Example, higher RT priority within RT, or
> earlier deadline within EEVDF (formerly CFS).

Agreed. The higher scheduling class line is pretty fuzzy and, after the discussion
with Juri, almost non-existent: https://lore.kernel.org/lkml/[email protected]/.

> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
> 'voluntary' business because
> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
> is not about the old voluntary.

What do you think about folding the higher scheduling class preemption logic
into preempt=none? As Juri pointed out, prioritization of at least the leftmost
deadline task needs to be done for correctness.

(That'll get rid of the current preempt=voluntary model, at least until
there's a separate use for it.)

> 2. you are also planning to remove cond_resched()s via this series and leave
> it to the scheduler right?

Yeah, under PREEMPT_AUTO, cond_resched() will /almost/ not be there. It gets
defined to:

static inline int _cond_resched(void)
{
	klp_sched_try_switch();
	return 0;
}

Right now, we need cond_resched() to make timely forward progress while
doing live-patching.

> Or call it preempt=higher, or something? No one is going to understand the
> meaning of voluntary the way it is implied here IMHO.

I don't think there's enough to make it worth adding a new model. For
now I'm tending towards moving the correctness parts to preempt=none and
making preempt=voluntary identical to preempt=none.

Thanks for the review.

--
ankur

2024-03-06 20:42:31

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO

Hi Ankur,

On 3/5/2024 3:11 AM, Ankur Arora wrote:
>
> Joel Fernandes <[email protected]> writes:
>
[..]
>> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
>> 'voluntary' business because
>> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
>> is not about the old voluntary.
>
> What do you think about folding the higher scheduling class preemption logic
> into preempt=none? As Juri pointed out, prioritization of at least the leftmost
> deadline task needs to be done for correctness.
>
> (That'll get rid of the current preempt=voluntary model, at least until
> there's a separate use for it.)

Yes, I am all in support for that. It's less confusing for the user as well, and
scheduling a higher-priority class at the next tick for preempt=none sounds good
to me. That is still an improvement for folks using SCHED_DEADLINE for whatever
reason, with a vanilla CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode
that is more aggressive, it could be added in the future.

>> 2. you are also planning to remove cond_resched()s via this series and leave
>> it to the scheduler right?
>
> Yeah, under PREEMPT_AUTO, cond_resched() will /almost/ be not there. Gets
> defined to:
>
> static inline int _cond_resched(void)
> {
> klp_sched_try_switch();
> return 0;
> }
>
> Right now, we need cond_resched() to make timely forward progress while
> doing live-patching.

Cool, got it!

>> Or call it preempt=higher, or something? No one is going to understand the
>> meaning of voluntary the way it is implied here IMHO.
>
> I don't think there's enough to make it worth adding a new model. For
> now I'm tending towards moving the correctness parts to preempt=none and
> making preempt=voluntary identical to preempt=none.

Got it, sounds good.

> Thanks for the review.

Sure! Thanks for this work. Looking forward to the next series,

- Joel


2024-03-07 19:01:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO

On Wed, Mar 06, 2024 at 03:42:10PM -0500, Joel Fernandes wrote:
> Hi Ankur,
>
> On 3/5/2024 3:11 AM, Ankur Arora wrote:
> >
> > Joel Fernandes <[email protected]> writes:
> >
> [..]
> >> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
> >> 'voluntary' business because
> >> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
> >> is not about the old voluntary.
> >
> > What do you think about folding the higher scheduling class preemption logic
> > into preempt=none? As Juri pointed out, prioritization of at least the leftmost
> > deadline task needs to be done for correctness.
> >
> > (That'll get rid of the current preempt=voluntary model, at least until
> > there's a separate use for it.)
>
> Yes I am all in support for that. Its less confusing for the user as well, and
> scheduling higher priority class at the next tick for preempt=none sounds good
> to me. That is still an improvement for folks using SCHED_DEADLINE for whatever
> reason, with a vanilla CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode
> that is more aggressive, it could be added in the future.

This would be something that happens only after removing the cond_resched()-style
preemption-point functionality from might_sleep(), correct?

Thanx, Paul

> >> 2. you are also planning to remove cond_resched()s via this series and leave
> >> it to the scheduler right?
> >
> > Yeah, under PREEMPT_AUTO, cond_resched() will /almost/ be not there. Gets
> > defined to:
> >
> > static inline int _cond_resched(void)
> > {
> > klp_sched_try_switch();
> > return 0;
> > }
> >
> > Right now, we need cond_resched() to make timely forward progress while
> > doing live-patching.
>
> Cool, got it!
>
> >> Or call it preempt=higher, or something? No one is going to understand the
> >> meaning of voluntary the way it is implied here IMHO.
> >
> > I don't think there's enough to make it worth adding a new model. For
> > now I'm tending towards moving the correctness parts to preempt=none and
> > making preempt=voluntary identical to preempt=none.
>
> Got it, sounds good.
>
> > Thanks for the review.
>
> Sure! Thanks for this work. Looking forward to the next series,
>
> - Joel
>

2024-03-08 00:15:51

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO



On 3/7/2024 2:01 PM, Paul E. McKenney wrote:
> On Wed, Mar 06, 2024 at 03:42:10PM -0500, Joel Fernandes wrote:
>> Hi Ankur,
>>
>> On 3/5/2024 3:11 AM, Ankur Arora wrote:
>>>
>>> Joel Fernandes <[email protected]> writes:
>>>
>> [..]
>>>> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
>>>> 'voluntary' business because
>>>> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
>>>> is not about the old voluntary.
>>>
>>> What do you think about folding the higher scheduling class preemption logic
>>> into preempt=none? As Juri pointed out, prioritization of at least the leftmost
>>> deadline task needs to be done for correctness.
>>>
>>> (That'll get rid of the current preempt=voluntary model, at least until
>>> there's a separate use for it.)
>>
>> Yes I am all in support for that. Its less confusing for the user as well, and
>> scheduling higher priority class at the next tick for preempt=none sounds good
>> to me. That is still an improvement for folks using SCHED_DEADLINE for whatever
>> reason, with a vanilla CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode
>> that is more aggressive, it could be added in the future.
>
> This would be something that happens only after removing cond_resched()
> might_sleep() functionality from might_sleep(), correct?

Firstly, maybe I misunderstood Ankur completely. Re-reading his comments above,
he seems to be suggesting preempting instantly for higher scheduling CLASSES
even for preempt=none mode, without having to wait till the next
scheduling-clock interrupt. Not sure that makes sense to me; I was asking not
to treat "higher class" any differently than "higher priority" for preempt=none.

And if SCHED_DEADLINE has a problem with that, then it already has that problem
with CONFIG_PREEMPT_NONE=y kernels, so a higher class needs no more special
treatment than a higher priority within the same class. Ankur/Juri?

Re: cond_resched(), I did not follow you, Paul: why does removing the proposed
preempt=voluntary mode (i.e. dropping this patch) have to happen only after
the cond_resched()/might_sleep() modifications?

thanks,

- Joel









2024-03-08 00:42:48

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO

On Thu, Mar 07, 2024 at 07:15:35PM -0500, Joel Fernandes wrote:
>
>
> On 3/7/2024 2:01 PM, Paul E. McKenney wrote:
> > On Wed, Mar 06, 2024 at 03:42:10PM -0500, Joel Fernandes wrote:
> >> Hi Ankur,
> >>
> >> On 3/5/2024 3:11 AM, Ankur Arora wrote:
> >>>
> >>> Joel Fernandes <[email protected]> writes:
> >>>
> >> [..]
> >>>> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
> >>>> 'voluntary' business because
> >>>> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
> >>>> is not about the old voluntary.
> >>>
> >>> What do you think about folding the higher scheduling class preemption logic
> >>> into preempt=none? As Juri pointed out, prioritization of at least the leftmost
> >>> deadline task needs to be done for correctness.
> >>>
> >>> (That'll get rid of the current preempt=voluntary model, at least until
> >>> there's a separate use for it.)
> >>
> >> Yes I am all in support for that. Its less confusing for the user as well, and
> >> scheduling higher priority class at the next tick for preempt=none sounds good
> >> to me. That is still an improvement for folks using SCHED_DEADLINE for whatever
> >> reason, with a vanilla CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode
> >> that is more aggressive, it could be added in the future.
> >
> > This would be something that happens only after removing cond_resched()
> > might_sleep() functionality from might_sleep(), correct?
>
> Firstly, Maybe I misunderstood Ankur completely. Re-reading his comments above,
> he seems to be suggesting preempting instantly for higher scheduling CLASSES
> even for preempt=none mode, without having to wait till the next
> scheduling-clock interrupt. Not sure if that makes sense to me, I was asking not
> to treat "higher class" any differently than "higher priority" for preempt=none.
>
> And if SCHED_DEADLINE has a problem with that, then it already happens so with
> CONFIG_PREEMPT_NONE=y kernels, so no need special treatment for higher class any
> more than the treatment given to higher priority within same class. Ankur/Juri?
>
> Re: cond_resched(), I did not follow you Paul, why does removing the proposed
> preempt=voluntary mode (i.e. dropping this patch) have to happen only after
> cond_resched()/might_sleep() modifications?

Because right now, one large difference between CONFIG_PREEMPT_NONE
and CONFIG_PREEMPT_VOLUNTARY is that for the latter might_sleep() is a
preemption point, but not for the former.

But if might_sleep() becomes debug-only, then there will no longer be
this difference.

Thanx, Paul

2024-03-08 03:50:56

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO


Joel Fernandes <[email protected]> writes:

> On 3/7/2024 2:01 PM, Paul E. McKenney wrote:
>> On Wed, Mar 06, 2024 at 03:42:10PM -0500, Joel Fernandes wrote:
>>> Hi Ankur,
>>>
>>> On 3/5/2024 3:11 AM, Ankur Arora wrote:
>>>>
>>>> Joel Fernandes <[email protected]> writes:
>>>>
>>> [..]
>>>>> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
>>>>> 'voluntary' business because
>>>>> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
>>>>> is not about the old voluntary.
>>>>
>>>> What do you think about folding the higher scheduling class preemption logic
>>>> into preempt=none? As Juri pointed out, prioritization of at least the leftmost
>>>> deadline task needs to be done for correctness.
>>>>
>>>> (That'll get rid of the current preempt=voluntary model, at least until
>>>> there's a separate use for it.)
>>>
>>> Yes I am all in support for that. Its less confusing for the user as well, and
>>> scheduling higher priority class at the next tick for preempt=none sounds good
>>> to me. That is still an improvement for folks using SCHED_DEADLINE for whatever
>>> reason, with a vanilla CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode
>>> that is more aggressive, it could be added in the future.
>>
>> This would be something that happens only after removing cond_resched()
>> might_sleep() functionality from might_sleep(), correct?
>
> Firstly, Maybe I misunderstood Ankur completely. Re-reading his comments above,
> he seems to be suggesting preempting instantly for higher scheduling CLASSES
> even for preempt=none mode, without having to wait till the next
> scheduling-clock interrupt.

Yes, that's what I was suggesting.

> Not sure if that makes sense to me, I was asking not
> to treat "higher class" any differently than "higher priority" for preempt=none.

Ah. Understood.

> And if SCHED_DEADLINE has a problem with that, then it already happens so with
> CONFIG_PREEMPT_NONE=y kernels, so no need special treatment for higher class any
> more than the treatment given to higher priority within same class. Ankur/Juri?

No. I think that behaviour might be worse for PREEMPT_AUTO.

PREEMPT_NONE=y (or PREEMPT_VOLUNTARY=y for that matter) doesn't
quite have a policy around when preemption happens. Preemption
might happen quickly, or might happen slowly, based on when the next
preemption point is found.

The PREEMPT_AUTO, preempt=none policy in this series will always
cause preemption to be at user exit or the next tick. That seems like
it would be worse for higher scheduling classes more often.
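
Illustrative pseudocode only, not the series' actual implementation: the
policy difference comes down to which bit is set against the running
task. TIF_NEED_RESCHED_LAZY is only acted on at exit-to-user or at the
next scheduling-clock tick, while TIF_NEED_RESCHED is also honoured on
return to kernel and at preempt_enable() (resched_curr_sketch below is a
hypothetical name):

	static void resched_curr_sketch(struct task_struct *curr,
					bool preempt_full)
	{
		if (preempt_full)
			set_tsk_thread_flag(curr, TIF_NEED_RESCHED);
		else
			set_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY);
	}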

But, I wonder what Juri thinks about this.

--
ankur

2024-03-08 04:23:28

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO


Paul E. McKenney <[email protected]> writes:

> On Thu, Mar 07, 2024 at 07:15:35PM -0500, Joel Fernandes wrote:
>>
>>
>> On 3/7/2024 2:01 PM, Paul E. McKenney wrote:
>> > On Wed, Mar 06, 2024 at 03:42:10PM -0500, Joel Fernandes wrote:
>> >> Hi Ankur,
>> >>
>> >> On 3/5/2024 3:11 AM, Ankur Arora wrote:
>> >>>
>> >>> Joel Fernandes <[email protected]> writes:
>> >>>
>> >> [..]
>> >>>> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
>> >>>> 'voluntary' business because
>> >>>> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
>> >>>> is not about the old voluntary.
>> >>>
>> >>> What do you think about folding the higher scheduling class preemption logic
>> >>> into preempt=none? As Juri pointed out, prioritization of at least the leftmost
>> >>> deadline task needs to be done for correctness.
>> >>>
>> >>> (That'll get rid of the current preempt=voluntary model, at least until
>> >>> there's a separate use for it.)
>> >>
>> >> Yes I am all in support for that. Its less confusing for the user as well, and
>> >> scheduling higher priority class at the next tick for preempt=none sounds good
>> >> to me. That is still an improvement for folks using SCHED_DEADLINE for whatever
>> >> reason, with a vanilla CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode
>> >> that is more aggressive, it could be added in the future.
>> >
>> > This would be something that happens only after removing cond_resched()
>> > might_sleep() functionality from might_sleep(), correct?
>>
>> Firstly, Maybe I misunderstood Ankur completely. Re-reading his comments above,
>> he seems to be suggesting preempting instantly for higher scheduling CLASSES
>> even for preempt=none mode, without having to wait till the next
>> scheduling-clock interrupt. Not sure if that makes sense to me, I was asking not
>> to treat "higher class" any differently than "higher priority" for preempt=none.
>>
>> And if SCHED_DEADLINE has a problem with that, then it already happens so with
>> CONFIG_PREEMPT_NONE=y kernels, so no need special treatment for higher class any
>> more than the treatment given to higher priority within same class. Ankur/Juri?
>>
>> Re: cond_resched(), I did not follow you Paul, why does removing the proposed
>> preempt=voluntary mode (i.e. dropping this patch) have to happen only after
>> cond_resched()/might_sleep() modifications?
>
> Because right now, one large difference between CONFIG_PREEMPT_NONE
> an CONFIG_PREEMPT_VOLUNTARY is that for the latter might_sleep() is a
> preemption point, but not for the former.

True. But, there is no difference between either of those with
PREEMPT_AUTO=y (at least right now).

For (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y, DEBUG_ATOMIC_SLEEP=y),
might_sleep() is:

# define might_resched() do { } while (0)
# define might_sleep() \
	do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)

And, cond_resched() for (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y,
DEBUG_ATOMIC_SLEEP=y):

static inline int _cond_resched(void)
{
	klp_sched_try_switch();
	return 0;
}
#define cond_resched() ({			\
	__might_resched(__FILE__, __LINE__, 0);	\
	_cond_resched();			\
})

And, no change for (PREEMPT_AUTO=y, PREEMPT_NONE=y, DEBUG_ATOMIC_SLEEP=y).

Thanks
Ankur

> But if might_sleep() becomes debug-only, then there will no longer be
> this difference.

2024-03-08 05:30:19

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO

On Thu, Mar 7, 2024 at 10:49 PM Ankur Arora <[email protected]> wrote:
>

> The PREEMPT_AUTO, preempt=none policy in this series will always
> cause preemption to be at user exit or the next tick. Seems like
> it would be worse for higher scheduling classes more often.

Ok, sounds good. I believe we are on the same page then.

Thanks.

2024-03-08 06:54:38

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO

On 07/03/24 19:49, Ankur Arora wrote:
> Joel Fernandes <[email protected]> writes:

..

> > Firstly, Maybe I misunderstood Ankur completely. Re-reading his comments above,
> > he seems to be suggesting preempting instantly for higher scheduling CLASSES
> > even for preempt=none mode, without having to wait till the next
> > scheduling-clock interrupt.
>
> Yes, that's what I was suggesting.
>
> > Not sure if that makes sense to me, I was asking not
> > to treat "higher class" any differently than "higher priority" for preempt=none.
>
> Ah. Understood.
>
> > And if SCHED_DEADLINE has a problem with that, then it already happens so with
> > CONFIG_PREEMPT_NONE=y kernels, so no need special treatment for higher class any
> > more than the treatment given to higher priority within same class. Ankur/Juri?
>
> No. I think that behaviour might be worse for PREEMPT_AUTO.
>
> PREEMPT_NONE=y (or PREEMPT_VOLUNTARY=y for that matter) don't
> quite have a policy around when preemption happens. Preemption
> might happen quickly, might happen slowly based on when the next
> preemption point is found.
>
> The PREEMPT_AUTO, preempt=none policy in this series will always
> cause preemption to be at user exit or the next tick. Seems like
> it would be worse for higher scheduling classes more often.
>
> But, I wonder what Juri thinks about this.

As I was saying in my last comment in the other discussion, I'm honestly
not sure, mostly because I currently fail to see what type of users
would choose preempt=none and have tasks scheduled with SCHED_DEADLINE
(please suggest example use cases, as I'm pretty sure I'm missing
something :). With that said, if the purpose of preempt=none is to have
a model which is super conservative wrt preemptions, having to wait one
tick to possibly schedule a DEADLINE task still seems kind of broken for
DEADLINE, but at least it is predictably broken (guess one needs to account
for that somehow when coming up with parameters :).

Thanks,
Juri


2024-03-08 21:33:50

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO

On Thu, Mar 07, 2024 at 08:22:30PM -0800, Ankur Arora wrote:
>
> Paul E. McKenney <[email protected]> writes:
>
> > On Thu, Mar 07, 2024 at 07:15:35PM -0500, Joel Fernandes wrote:
> >>
> >>
> >> On 3/7/2024 2:01 PM, Paul E. McKenney wrote:
> >> > On Wed, Mar 06, 2024 at 03:42:10PM -0500, Joel Fernandes wrote:
> >> >> Hi Ankur,
> >> >>
> >> >> On 3/5/2024 3:11 AM, Ankur Arora wrote:
> >> >>>
> >> >>> Joel Fernandes <[email protected]> writes:
> >> >>>
> >> >> [..]
> >> >>>> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
> >> >>>> 'voluntary' business because
> >> >>>> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
> >> >>>> is not about the old voluntary.
> >> >>>
> >> >>> What do you think about folding the higher scheduling class preemption logic
> >> >>> into preempt=none? As Juri pointed out, prioritization of at least the leftmost
> >> >>> deadline task needs to be done for correctness.
> >> >>>
> >> >>> (That'll get rid of the current preempt=voluntary model, at least until
> >> >>> there's a separate use for it.)
> >> >>
> >> >> Yes I am all in support for that. Its less confusing for the user as well, and
> >> >> scheduling higher priority class at the next tick for preempt=none sounds good
> >> >> to me. That is still an improvement for folks using SCHED_DEADLINE for whatever
> >> >> reason, with a vanilla CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode
> >> >> that is more aggressive, it could be added in the future.
> >> >
> >> > This would be something that happens only after removing cond_resched()
> >> > might_sleep() functionality from might_sleep(), correct?
> >>
> >> Firstly, Maybe I misunderstood Ankur completely. Re-reading his comments above,
> >> he seems to be suggesting preempting instantly for higher scheduling CLASSES
> >> even for preempt=none mode, without having to wait till the next
> >> scheduling-clock interrupt. Not sure if that makes sense to me, I was asking not
> >> to treat "higher class" any differently than "higher priority" for preempt=none.
> >>
> >> And if SCHED_DEADLINE has a problem with that, then it already happens so with
> >> CONFIG_PREEMPT_NONE=y kernels, so no need special treatment for higher class any
> >> more than the treatment given to higher priority within same class. Ankur/Juri?
> >>
> >> Re: cond_resched(), I did not follow you Paul, why does removing the proposed
> >> preempt=voluntary mode (i.e. dropping this patch) have to happen only after
> >> cond_resched()/might_sleep() modifications?
> >
> > Because right now, one large difference between CONFIG_PREEMPT_NONE
> > an CONFIG_PREEMPT_VOLUNTARY is that for the latter might_sleep() is a
> > preemption point, but not for the former.
>
> True. But, there is no difference between either of those with
> PREEMPT_AUTO=y (at least right now).
>
> For (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y, DEBUG_ATOMIC_SLEEP=y),
> might_sleep() is:
>
> # define might_resched() do { } while (0)
> # define might_sleep() \
> do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)
>
> And, cond_resched() for (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y,
> DEBUG_ATOMIC_SLEEP=y):
>
> static inline int _cond_resched(void)
> {
> klp_sched_try_switch();
> return 0;
> }
> #define cond_resched() ({ \
> __might_resched(__FILE__, __LINE__, 0); \
> _cond_resched(); \
> })
>
> And, no change for (PREEMPT_AUTO=y, PREEMPT_NONE=y, DEBUG_ATOMIC_SLEEP=y).

As long as it is easy to restore the prior cond_resched() functionality
for testing in the meantime, I should be OK. For example, it would
be great to have the commit removing the old functionality from
cond_resched() at the end of the series.

Thanx, Paul

2024-03-10 10:03:46

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 15/30] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y

Hello Ankur and Paul,

On Mon, Feb 12, 2024 at 09:55:39PM -0800, Ankur Arora wrote:
> With PREEMPT_RCU=n, cond_resched() provides urgently needed quiescent
> states for read-side critical sections via rcu_all_qs().
> One reason why this was necessary: lacking preempt-count, the tick
> handler has no way of knowing whether it is executing in a read-side
> critical section or not.
>
> With PREEMPT_AUTO=y, there can be configurations with (PREEMPT_COUNT=y,
> PREEMPT_RCU=n). This means that cond_resched() is a stub which does
> not provide for quiescent states via rcu_all_qs().
>
> So, use the availability of preempt_count() to report quiescent states
> in rcu_flavor_sched_clock_irq().
>
> Suggested-by: Paul E. McKenney <[email protected]>
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> kernel/rcu/tree_plugin.h | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 26c79246873a..9b72e9d2b6fe 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -963,13 +963,16 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
> */
> static void rcu_flavor_sched_clock_irq(int user)
> {
> - if (user || rcu_is_cpu_rrupt_from_idle()) {
> + if (user || rcu_is_cpu_rrupt_from_idle() ||
> + (IS_ENABLED(CONFIG_PREEMPT_COUNT) &&
> + !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))) {

I was wondering if it makes sense to even support !PREEMPT_RCU under
CONFIG_PREEMPT_AUTO.

AFAIU, this CONFIG_PREEMPT_AUTO series preempts the kernel on
the next tick boundary in the worst case, with all preempt modes including
the preempt=none mode.

Considering this, does it make sense for RCU to be non-preemptible in
CONFIG_PREEMPT_AUTO=y? Because if that were the case, and a read-side critical
section extended beyond the tick, then it prevents the PREEMPT_AUTO preemption
from happening, because rcu_read_lock() would preempt_disable().
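
Roughly, with CONFIG_PREEMPT_RCU=n the read-side markers reduce to
preempt-count manipulation (simplified from include/linux/rcupdate.h; the
real unlock path has additional debug/strict-grace-period hooks):

	static inline void __rcu_read_lock(void)
	{
		preempt_disable();
	}

	static inline void __rcu_read_unlock(void)
	{
		preempt_enable();
	}

so any read-side critical section is a non-preemptible region.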

To that end, I wonder if CONFIG_PREEMPT_AUTO should select CONFIG_PREEMPTION
(or CONFIG_PREEMPT_BUILD, not sure which) as well because it does cause
kernel preemption. That then forces selection of CONFIG_PREEMPT_RCU as well.

thanks,

- Joel







>
> /*
> * Get here if this CPU took its interrupt from user
> - * mode or from the idle loop, and if this is not a
> - * nested interrupt. In this case, the CPU is in
> - * a quiescent state, so note it.
> + * mode, from the idle loop without this being a nested
> + * interrupt, or while not holding a preempt count.
> + * In this case, the CPU is in a quiescent state, so note
> + * it.
> *
> * No memory barrier is required here because rcu_qs()
> * references only CPU-local variables that other CPUs
> --
> 2.31.1
>

2024-03-10 18:56:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 15/30] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y

On Sun, Mar 10, 2024 at 06:03:30AM -0400, Joel Fernandes wrote:
> Hello Ankur and Paul,
>
> On Mon, Feb 12, 2024 at 09:55:39PM -0800, Ankur Arora wrote:
> > With PREEMPT_RCU=n, cond_resched() provides urgently needed quiescent
> > states for read-side critical sections via rcu_all_qs().
> > One reason why this was necessary: lacking preempt-count, the tick
> > handler has no way of knowing whether it is executing in a read-side
> > critical section or not.
> >
> > With PREEMPT_AUTO=y, there can be configurations with (PREEMPT_COUNT=y,
> > PREEMPT_RCU=n). This means that cond_resched() is a stub which does
> > not provide for quiescent states via rcu_all_qs().
> >
> > So, use the availability of preempt_count() to report quiescent states
> > in rcu_flavor_sched_clock_irq().
> >
> > Suggested-by: Paul E. McKenney <[email protected]>
> > Signed-off-by: Ankur Arora <[email protected]>
> > ---
> > kernel/rcu/tree_plugin.h | 11 +++++++----
> > 1 file changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 26c79246873a..9b72e9d2b6fe 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -963,13 +963,16 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
> > */
> > static void rcu_flavor_sched_clock_irq(int user)
> > {
> > - if (user || rcu_is_cpu_rrupt_from_idle()) {
> > + if (user || rcu_is_cpu_rrupt_from_idle() ||
> > + (IS_ENABLED(CONFIG_PREEMPT_COUNT) &&
> > + !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))) {
>
> I was wondering if it makes sense to even support !PREEMPT_RCU under
> CONFIG_PREEMPT_AUTO.
>
> AFAIU, this CONFIG_PREEMPT_AUTO series preempts the kernel on
> the next tick boundary in the worst case, with all preempt modes including
> the preempt=none mode.
>
> Considering this, does it makes sense for RCU to be non-preemptible in
> CONFIG_PREEMPT_AUTO=y? Because if that were the case, and a read-side critical
> section extended beyond the tick, then it prevents the PREEMPT_AUTO preemption
> from happening, because rcu_read_lock() would preempt_disable().

Yes, it does make sense for RCU to be non-preemptible in kernels
built with CONFIG_PREEMPT_AUTO=y and either CONFIG_PREEMPT_NONE=y or
CONFIG_PREEMPT_VOLUNTARY=y. As noted in earlier discussions, there are
systems that are adequately but not abundantly endowed with memory.
Such systems need non-preemptible RCU to avoid preempted-reader OOMs.
Note well that non-preemptible RCU explicitly disables preemption across
all RCU readers.

Thanx, Paul


> To that end, I wonder if CONFIG_PREEMPT_AUTO should select CONFIG_PREEMPTION
> (or CONFIG_PREEMPT_BUILD, not sure which) as well because it does cause
> kernel preemption. That then forces selection of CONFIG_PREEMPT_RCU as well.
>
> thanks,
>
> - Joel
>
>
>
>
>
>
>
> >
> > /*
> > * Get here if this CPU took its interrupt from user
> > - * mode or from the idle loop, and if this is not a
> > - * nested interrupt. In this case, the CPU is in
> > - * a quiescent state, so note it.
> > + * mode, from the idle loop without this being a nested
> > + * interrupt, or while not holding a preempt count.
> > + * In this case, the CPU is in a quiescent state, so note
> > + * it.
> > *
> > * No memory barrier is required here because rcu_qs()
> > * references only CPU-local variables that other CPUs
> > --
> > 2.31.1
> >

2024-03-11 00:48:47

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 15/30] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y



On 3/10/2024 2:56 PM, Paul E. McKenney wrote:
> On Sun, Mar 10, 2024 at 06:03:30AM -0400, Joel Fernandes wrote:
>> Hello Ankur and Paul,
>>
>> On Mon, Feb 12, 2024 at 09:55:39PM -0800, Ankur Arora wrote:
>>> With PREEMPT_RCU=n, cond_resched() provides urgently needed quiescent
>>> states for read-side critical sections via rcu_all_qs().
>>> One reason why this was necessary: lacking preempt-count, the tick
>>> handler has no way of knowing whether it is executing in a read-side
>>> critical section or not.
>>>
>>> With PREEMPT_AUTO=y, there can be configurations with (PREEMPT_COUNT=y,
>>> PREEMPT_RCU=n). This means that cond_resched() is a stub which does
>>> not provide for quiescent states via rcu_all_qs().
>>>
>>> So, use the availability of preempt_count() to report quiescent states
>>> in rcu_flavor_sched_clock_irq().
>>>
>>> Suggested-by: Paul E. McKenney <[email protected]>
>>> Signed-off-by: Ankur Arora <[email protected]>
>>> ---
>>> kernel/rcu/tree_plugin.h | 11 +++++++----
>>> 1 file changed, 7 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>>> index 26c79246873a..9b72e9d2b6fe 100644
>>> --- a/kernel/rcu/tree_plugin.h
>>> +++ b/kernel/rcu/tree_plugin.h
>>> @@ -963,13 +963,16 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
>>> */
>>> static void rcu_flavor_sched_clock_irq(int user)
>>> {
>>> - if (user || rcu_is_cpu_rrupt_from_idle()) {
>>> + if (user || rcu_is_cpu_rrupt_from_idle() ||
>>> + (IS_ENABLED(CONFIG_PREEMPT_COUNT) &&
>>> + !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))) {
>>
>> I was wondering if it makes sense to even support !PREEMPT_RCU under
>> CONFIG_PREEMPT_AUTO.
>>
>> AFAIU, this CONFIG_PREEMPT_AUTO series preempts the kernel on
>> the next tick boundary in the worst case, with all preempt modes including
>> the preempt=none mode.
>>
>> Considering this, does it makes sense for RCU to be non-preemptible in
>> CONFIG_PREEMPT_AUTO=y? Because if that were the case, and a read-side critical
>> section extended beyond the tick, then it prevents the PREEMPT_AUTO preemption
>> from happening, because rcu_read_lock() would preempt_disable().
>
> Yes, it does make sense for RCU to be non-preemptible in kernels
> built with CONFIG_PREEMPT_AUTO=y and either CONFIG_PREEMPT_NONE=y or
> CONFIG_PREEMPT_VOLUNTARY=y.
> As noted in earlier discussions, there are

Sorry if I missed a discussion, appreciate a link.

> systems that are adequately but not abundantly endowed with memory.
> Such systems need non-preemptible RCU to avoid preempted-reader OOMs.

Then why don't such systems have a problem with CONFIG_PREEMPT_DYNAMIC=y and
preempt=none mode? CONFIG_PREEMPT_DYNAMIC forces CONFIG_PREEMPT_RCU=y. There's
no way to set CONFIG_PREEMPT_RCU=n with CONFIG_PREEMPT_DYNAMIC=y and the
preempt=none boot parameter. IMHO, if this feature is inconsistent with
CONFIG_PREEMPT_DYNAMIC, that makes it super confusing. In fact, I feel
CONFIG_PREEMPT_AUTO should instead just be another "preempt=auto" boot parameter
mode added to the CONFIG_PREEMPT_DYNAMIC feature; otherwise the proliferation of
CONFIG_PREEMPT config options is getting a bit insane, and is likely going to be
a burden to the users configuring the PREEMPT Kconfig option.

> Note well that non-preemptible RCU explicitly disables preemption across
> all RCU readers.

Yes, I mentioned this 'disabling preemption' aspect in my last email. My point
being, unlike CONFIG_PREEMPT_NONE, CONFIG_PREEMPT_AUTO allows for kernel
preemption in preempt=none. So the "Don't preempt the kernel" behavior has
changed. That is, preempt=none under CONFIG_PREEMPT_AUTO is different from
CONFIG_PREEMPT_NONE=y already. Here we *are* preempting. And RCU is getting in
the way. It is like saying you want an option for CONFIG_PREEMPT_RCU to be set
to =n for CONFIG_PREEMPT=y kernels, citing users who want a fully-preemptible
kernel but are worried about reader preemptions.

That aside, I do agree with your point of view that preemptible readers
present a problem to folks using preempt=none in this series, and we could
decide to keep CONFIG_PREEMPT_RCU optional for whoever wants it that way. I was
just saying that I want CONFIG_PREEMPT_AUTO's preempt=none mode to be somewhat
consistent with CONFIG_PREEMPT_DYNAMIC's preempt=none, because I'm pretty sure a
week from now, no one will likely be able to tell the difference ;-). So IMHO
either CONFIG_PREEMPT_DYNAMIC should be changed to make CONFIG_PREEMPT_RCU
optional, or this series should be altered to force CONFIG_PREEMPT_RCU=y.

Let me know if I missed something.

Thanks!

- Joel

2024-03-11 03:56:11

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 15/30] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y

On Sun, Mar 10, 2024 at 08:48:28PM -0400, Joel Fernandes wrote:
> On 3/10/2024 2:56 PM, Paul E. McKenney wrote:
> > On Sun, Mar 10, 2024 at 06:03:30AM -0400, Joel Fernandes wrote:
> >> Hello Ankur and Paul,
> >>
> >> On Mon, Feb 12, 2024 at 09:55:39PM -0800, Ankur Arora wrote:
> >>> With PREEMPT_RCU=n, cond_resched() provides urgently needed quiescent
> >>> states for read-side critical sections via rcu_all_qs().
> >>> One reason why this was necessary: lacking preempt-count, the tick
> >>> handler has no way of knowing whether it is executing in a read-side
> >>> critical section or not.
> >>>
> >>> With PREEMPT_AUTO=y, there can be configurations with (PREEMPT_COUNT=y,
> >>> PREEMPT_RCU=n). This means that cond_resched() is a stub which does
> >>> not provide for quiescent states via rcu_all_qs().
> >>>
> >>> So, use the availability of preempt_count() to report quiescent states
> >>> in rcu_flavor_sched_clock_irq().
> >>>
> >>> Suggested-by: Paul E. McKenney <[email protected]>
> >>> Signed-off-by: Ankur Arora <[email protected]>
> >>> ---
> >>> kernel/rcu/tree_plugin.h | 11 +++++++----
> >>> 1 file changed, 7 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> >>> index 26c79246873a..9b72e9d2b6fe 100644
> >>> --- a/kernel/rcu/tree_plugin.h
> >>> +++ b/kernel/rcu/tree_plugin.h
> >>> @@ -963,13 +963,16 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
> >>> */
> >>> static void rcu_flavor_sched_clock_irq(int user)
> >>> {
> >>> - if (user || rcu_is_cpu_rrupt_from_idle()) {
> >>> + if (user || rcu_is_cpu_rrupt_from_idle() ||
> >>> + (IS_ENABLED(CONFIG_PREEMPT_COUNT) &&
> >>> + !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))) {
> >>
> >> I was wondering if it makes sense to even support !PREEMPT_RCU under
> >> CONFIG_PREEMPT_AUTO.
> >>
> >> AFAIU, this CONFIG_PREEMPT_AUTO series preempts the kernel on
> >> the next tick boundary in the worst case, with all preempt modes including
> >> the preempt=none mode.
> >>
> >> Considering this, does it makes sense for RCU to be non-preemptible in
> >> CONFIG_PREEMPT_AUTO=y? Because if that were the case, and a read-side critical
> >> section extended beyond the tick, then it prevents the PREEMPT_AUTO preemption
> >> from happening, because rcu_read_lock() would preempt_disable().
> >
> > Yes, it does make sense for RCU to be non-preemptible in kernels
> > built with CONFIG_PREEMPT_AUTO=y and either CONFIG_PREEMPT_NONE=y or
> > CONFIG_PREEMPT_VOLUNTARY=y.
> > As noted in earlier discussions, there are
>
> Sorry if I missed a discussion, appreciate a link.

It is part of the discussion of the first version of this patch series,
if I recall correctly.

> > systems that are adequately but not abundantly endowed with memory.
> > Such systems need non-preemptible RCU to avoid preempted-reader OOMs.
>
> Then why don't such systems have a problem with CONFIG_PREEMPT_DYNAMIC=y and
> preempt=none mode? CONFIG_PREEMPT_DYNAMIC forces CONFIG_PREEMPT_RCU=y. There's
> no way to set CONFIG_PREEMPT_RCU=n with CONFIG_PREEMPT_DYNAMIC=y and
> preempt=none boot parameter. IMHO, if this feature is inconsistent with
> CONFIG_PREEMPT_DYNAMIC, that makes it super confusing. In fact, I feel
> CONFIG_PREEMPT_AUTO should instead just be another "preempt=auto" boot parameter
> mode added to CONFIG_PREEMPT_DYNAMIC feature, otherwise the proliferation of
> CONFIG_PREEMPT config options is getting a bit insane. And likely going to be
> burden to the users configuring the PREEMPT Kconfig option IMHO.

Because such systems are built with CONFIG_PREEMPT_DYNAMIC=n.

You could argue that we should just build with CONFIG_PREEMPT_AUTO=n,
but the long-term goal of eliminating cond_resched() will make that
ineffective.

> > Note well that non-preemptible RCU explicitly disables preemption across
> > all RCU readers.
>
> Yes, I mentioned this 'disabling preemption' aspect in my last email. My point
> being, unlike CONFIG_PREEMPT_NONE, CONFIG_PREEMPT_AUTO allows for kernel
> preemption in preempt=none. So the "Don't preempt the kernel" behavior has
> changed. That is, preempt=none under CONFIG_PREEMPT_AUTO is different from
> CONFIG_PREEMPT_NONE=y already. Here we *are* preempting. And RCU is getting on
> the way. It is like saying, you want an option for CONFIG_PREEMPT_RCU to be set
> to =n for CONFIG_PREEMPT=y kernels, sighting users who want a fully-preemptible
> kernel but are worried about reader preemptions.

Such users can simply avoid building with either CONFIG_PREEMPT_NONE=y
or with CONFIG_PREEMPT_VOLUNTARY=y. They might also experiment with
CONFIG_RCU_BOOST=y, and also with short timeouts until boosting.
If that doesn't do what they need, we talk with them and either help
them configure their kernels, make RCU do what they need, or help work
out some other way for them to get their jobs done.

> That aside, as such, I do agree your point of view, that preemptible readers
> presents a problem to folks using preempt=none in this series and we could
> decide to keep CONFIG_PREEMPT_RCU optional for whoever wants it that way. I was
> just saying that I want CONFIG_PREEMPT_AUTO's preempt=none mode to be somewhat
> consistent with CONFIG_PREEMPT_DYNAMIC's preempt=none. Because I'm pretty sure a
> week from now, no one will likely be able to tell the difference ;-). So IMHO
> either CONFIG_PREEMPT_DYNAMIC should be changed to make CONFIG_PREEMPT_RCU
> optional, or this series should be altered to force CONFIG_PREEMPT_RCU=y.
>
> Let me know if I missed something.

Why not key off of the value of CONFIG_PREEMPT_DYNAMIC? That way,
if both CONFIG_PREEMPT_AUTO=y and CONFIG_PREEMPT_DYNAMIC=y, RCU is
always preemptible. Then CONFIG_PREEMPT_DYNAMIC=y enables boot-time
(and maybe even run-time) switching between preemption flavors, while
CONFIG_PREEMPT_AUTO instead enables unconditional preemption of any
region of code that has not explicitly disabled preemption (or irq or
bh or whatever).

But one way or another, we really do need non-preemptible RCU in
conjunction with CONFIG_PREEMPT_AUTO=y.

Also, I don't yet see CONFIG_PREEMPT_AUTO in -next.

Thanx, Paul

2024-03-11 04:51:49

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO


Paul E. McKenney <[email protected]> writes:

> On Thu, Mar 07, 2024 at 08:22:30PM -0800, Ankur Arora wrote:
>>
>> Paul E. McKenney <[email protected]> writes:
>>
>> > On Thu, Mar 07, 2024 at 07:15:35PM -0500, Joel Fernandes wrote:
>> >>
>> >>
>> >> On 3/7/2024 2:01 PM, Paul E. McKenney wrote:
>> >> > On Wed, Mar 06, 2024 at 03:42:10PM -0500, Joel Fernandes wrote:
>> >> >> Hi Ankur,
>> >> >>
>> >> >> On 3/5/2024 3:11 AM, Ankur Arora wrote:
>> >> >>>
>> >> >>> Joel Fernandes <[email protected]> writes:
>> >> >>>
>> >> >> [..]
>> >> >>>> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
>> >> >>>> 'voluntary' business because
>> >> >>>> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
>> >> >>>> is not about the old voluntary.
>> >> >>>
>> >> >>> What do you think about folding the higher scheduling class preemption logic
>> >> >>> into preempt=none? As Juri pointed out, prioritization of at least the leftmost
>> >> >>> deadline task needs to be done for correctness.
>> >> >>>
>> >> >>> (That'll get rid of the current preempt=voluntary model, at least until
>> >> >>> there's a separate use for it.)
>> >> >>
>> >> >> Yes I am all in support for that. Its less confusing for the user as well, and
>> >> >> scheduling higher priority class at the next tick for preempt=none sounds good
>> >> >> to me. That is still an improvement for folks using SCHED_DEADLINE for whatever
>> >> >> reason, with a vanilla CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode
>> >> >> that is more aggressive, it could be added in the future.
>> >> >
>> >> > This would be something that happens only after removing cond_resched()
>> >> > might_sleep() functionality from might_sleep(), correct?
>> >>
>> >> Firstly, Maybe I misunderstood Ankur completely. Re-reading his comments above,
>> >> he seems to be suggesting preempting instantly for higher scheduling CLASSES
>> >> even for preempt=none mode, without having to wait till the next
>> >> scheduling-clock interrupt. Not sure if that makes sense to me, I was asking not
>> >> to treat "higher class" any differently than "higher priority" for preempt=none.
>> >>
>> >> And if SCHED_DEADLINE has a problem with that, then it already happens so with
>> >> CONFIG_PREEMPT_NONE=y kernels, so no need special treatment for higher class any
>> >> more than the treatment given to higher priority within same class. Ankur/Juri?
>> >>
>> >> Re: cond_resched(), I did not follow you Paul, why does removing the proposed
>> >> preempt=voluntary mode (i.e. dropping this patch) have to happen only after
>> >> cond_resched()/might_sleep() modifications?
>> >
>> > Because right now, one large difference between CONFIG_PREEMPT_NONE
>> > an CONFIG_PREEMPT_VOLUNTARY is that for the latter might_sleep() is a
>> > preemption point, but not for the former.
>>
>> True. But, there is no difference between either of those with
>> PREEMPT_AUTO=y (at least right now).
>>
>> For (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y, DEBUG_ATOMIC_SLEEP=y),
>> might_sleep() is:
>>
>> # define might_resched() do { } while (0)
>> # define might_sleep() \
>> do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)
>>
>> And, cond_resched() for (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y,
>> DEBUG_ATOMIC_SLEEP=y):
>>
>> static inline int _cond_resched(void)
>> {
>> klp_sched_try_switch();
>> return 0;
>> }
>> #define cond_resched() ({ \
>> __might_resched(__FILE__, __LINE__, 0); \
>> _cond_resched(); \
>> })
>>
>> And, no change for (PREEMPT_AUTO=y, PREEMPT_NONE=y, DEBUG_ATOMIC_SLEEP=y).
>
> As long as it is easy to restore the prior cond_resched() functionality
> for testing in the meantime, I should be OK. For example, it would
> be great to have the commit removing the old functionality from
> cond_resched() at the end of the series,

I would, of course, be happy to make any changes that help testing,
but I think I'm missing something that you are saying wrt
cond_resched()/might_sleep().

There's no commit explicitly removing the core cond_resched()
functionality: PREEMPT_AUTO explicitly selects PREEMPT_BUILD and selects
out PREEMPTION_{NONE,VOLUNTARY}_BUILD.
(That's patch-1 "preempt: introduce CONFIG_PREEMPT_AUTO".)

For the rest it just piggybacks on the CONFIG_PREEMPT_DYNAMIC work,
keying off (!CONFIG_PREEMPT_DYNAMIC && CONFIG_PREEMPTION):

#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
/* ... */
#if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
/* ... */
#elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
/* ... */
#else /* !CONFIG_PREEMPTION */
/* ... */
#endif /* PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */

#else /* CONFIG_PREEMPTION && !CONFIG_PREEMPT_DYNAMIC */
static inline int _cond_resched(void)
{
	klp_sched_try_switch();
	return 0;
}
#endif /* !CONFIG_PREEMPTION || CONFIG_PREEMPT_DYNAMIC */

Same for might_sleep() (which really amounts to might_resched()):

#ifdef CONFIG_PREEMPT_VOLUNTARY_BUILD
/* ... */
#elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
/* ... */
#elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
/* ... */
#else
# define might_resched() do { } while (0)
#endif /* CONFIG_PREEMPT_* */

But, I doubt that I'm telling you anything new. So, what am I missing?

--
ankur

2024-03-11 05:19:57

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 15/30] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y


Joel Fernandes <[email protected]> writes:

> On 3/10/2024 2:56 PM, Paul E. McKenney wrote:
>> On Sun, Mar 10, 2024 at 06:03:30AM -0400, Joel Fernandes wrote:
>>> Hello Ankur and Paul,
>>>
>>> On Mon, Feb 12, 2024 at 09:55:39PM -0800, Ankur Arora wrote:
>>>> With PREEMPT_RCU=n, cond_resched() provides urgently needed quiescent
>>>> states for read-side critical sections via rcu_all_qs().
>>>> One reason why this was necessary: lacking preempt-count, the tick
>>>> handler has no way of knowing whether it is executing in a read-side
>>>> critical section or not.
>>>>
>>>> With PREEMPT_AUTO=y, there can be configurations with (PREEMPT_COUNT=y,
>>>> PREEMPT_RCU=n). This means that cond_resched() is a stub which does
>>>> not provide for quiescent states via rcu_all_qs().
>>>>
>>>> So, use the availability of preempt_count() to report quiescent states
>>>> in rcu_flavor_sched_clock_irq().
>>>>
>>>> Suggested-by: Paul E. McKenney <[email protected]>
>>>> Signed-off-by: Ankur Arora <[email protected]>
>>>> ---
>>>> kernel/rcu/tree_plugin.h | 11 +++++++----
>>>> 1 file changed, 7 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>>>> index 26c79246873a..9b72e9d2b6fe 100644
>>>> --- a/kernel/rcu/tree_plugin.h
>>>> +++ b/kernel/rcu/tree_plugin.h
>>>> @@ -963,13 +963,16 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
>>>> */
>>>> static void rcu_flavor_sched_clock_irq(int user)
>>>> {
>>>> - if (user || rcu_is_cpu_rrupt_from_idle()) {
>>>> + if (user || rcu_is_cpu_rrupt_from_idle() ||
>>>> + (IS_ENABLED(CONFIG_PREEMPT_COUNT) &&
>>>> + !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))) {
>>>
>>> I was wondering if it makes sense to even support !PREEMPT_RCU under
>>> CONFIG_PREEMPT_AUTO.
>>>
>>> AFAIU, this CONFIG_PREEMPT_AUTO series preempts the kernel on
>>> the next tick boundary in the worst case, with all preempt modes including
>>> the preempt=none mode.
>>>
>>> Considering this, does it makes sense for RCU to be non-preemptible in
>>> CONFIG_PREEMPT_AUTO=y? Because if that were the case, and a read-side critical
>>> section extended beyond the tick, then it prevents the PREEMPT_AUTO preemption
>>> from happening, because rcu_read_lock() would preempt_disable().
>>
>> Yes, it does make sense for RCU to be non-preemptible in kernels
>> built with CONFIG_PREEMPT_AUTO=y and either CONFIG_PREEMPT_NONE=y or
>> CONFIG_PREEMPT_VOLUNTARY=y.
>> As noted in earlier discussions, there are
>
> Sorry if I missed a discussion, appreciate a link.
>
>> systems that are adequately but not abundantly endowed with memory.
>> Such systems need non-preemptible RCU to avoid preempted-reader OOMs.
>
> Then why don't such systems have a problem with CONFIG_PREEMPT_DYNAMIC=y and
> preempt=none mode? CONFIG_PREEMPT_DYNAMIC forces CONFIG_PREEMPT_RCU=y. There's
> no way to set CONFIG_PREEMPT_RCU=n with CONFIG_PREEMPT_DYNAMIC=y and
> preempt=none boot parameter. IMHO, if this feature is inconsistent with
> CONFIG_PREEMPT_DYNAMIC, that makes it super confusing. In fact, I feel
> CONFIG_PREEMPT_AUTO should instead just be another "preempt=auto" boot parameter
> mode added to CONFIG_PREEMPT_DYNAMIC feature, otherwise the proliferation of
> CONFIG_PREEMPT config options is getting a bit insane. And likely going to be
> burden to the users configuring the PREEMPT Kconfig option IMHO.
>
>> Note well that non-preemptible RCU explicitly disables preemption across
>> all RCU readers.
>
> Yes, I mentioned this 'disabling preemption' aspect in my last email. My point
> being, unlike CONFIG_PREEMPT_NONE, CONFIG_PREEMPT_AUTO allows for kernel
> preemption in preempt=none. So the "Don't preempt the kernel" behavior has
> changed. That is, preempt=none under CONFIG_PREEMPT_AUTO is different from
> CONFIG_PREEMPT_NONE=y already. Here we *are* preempting. And RCU is getting on

I think that's a view from too close to the implementation. Someone
using the kernel is not necessarily concerned with whether tasks are
preempted or not. They are concerned with throughput and latency.

Framed thus:

preempt=none: tasks typically run to completion, might result in high latency
preempt=full: preempt at the earliest opportunity, low latency
preempt=voluntary: somewhere between these two

In that case you could argue that CONFIG_PREEMPT_NONE,
(CONFIG_PREEMPT_DYNAMIC, preempt=none) and (CONFIG_PREEMPT_AUTO,
preempt=none) have broadly similar behaviour.

> the way. It is like saying, you want an option for CONFIG_PREEMPT_RCU to be set
> to =n for CONFIG_PREEMPT=y kernels, sighting users who want a fully-preemptible
> kernel but are worried about reader preemptions.
>
> That aside, as such, I do agree your point of view, that preemptible readers
> presents a problem to folks using preempt=none in this series and we could
> decide to keep CONFIG_PREEMPT_RCU optional for whoever wants it that way. I was
> just saying that I want CONFIG_PREEMPT_AUTO's preempt=none mode to be somewhat
> consistent with CONFIG_PREEMPT_DYNAMIC's preempt=none. Because I'm pretty sure a

PREEMPT_DYNAMIC and PREEMPT_AUTO are trying to do different tasks.

PREEMPT_DYNAMIC: allow dynamic switching between the original
PREEMPT_NONE, PREEMPT_VOLUNTARY, PREEMPT models.

PREEMPT_AUTO: remove the need for explicit preemption points, by
- bringing the scheduling model under complete control of the
scheduler
- always having unconditional preemption (and using it to varying
degrees of strictness based on the preemption model in effect.)

So, even though PREEMPT_AUTO does use PREEMPT_NONE, PREEMPT_VOLUNTARY,
PREEMPT options, those are just meant to loosely identify with Linux's
preemption models, and the intent is not to be identical to them -- they
can't be identical because the underlying implementation is completely
different.

The eventual hope is that we can get rid of explicit preemption points.
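
(To make that concrete -- this is *not* the code in the series, just a
rough sketch with made-up names -- the choice of need-resched bit is
conceptually something like:)

/* Sketch only: resched_curr_sketch() and 'higher_class' are illustrative. */
static void resched_curr_sketch(struct rq *rq, bool higher_class)
{
	int tif = TIF_NEED_RESCHED_LAZY;	/* default: let the task run on */

	if (preempt_model_full())
		tif = TIF_NEED_RESCHED;		/* always eager */
	else if (preempt_model_voluntary() && higher_class)
		tif = TIF_NEED_RESCHED;		/* ex. an RT/DL task woke up */

	set_tsk_thread_flag(rq->curr, tif);

	/*
	 * Only the eager bit warrants a resched IPI; the lazy bit gets
	 * picked up at the next tick or at exit to user space.
	 */
	if (tif == TIF_NEED_RESCHED)
		smp_send_reschedule(cpu_of(rq));
}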

> week from now, no one will likely be able to tell the difference ;-). So IMHO
> either CONFIG_PREEMPT_DYNAMIC should be changed to make CONFIG_PREEMPT_RCU
> optional, or this series should be altered to force CONFIG_PREEMPT_RCU=y.
>
I think that's a patch for CONFIG_PREEMPT_DYNAMIC :).

From earlier discussions on this, I'm convinced that PREEMPT_AUTO, at
least where the user has explicitly opted for throughput
(PREEMPT_{NONE,VOLUNTARY}), should support the non-preemptible RCU variant.

Thanks

--
ankur

2024-03-11 05:35:44

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO


Juri Lelli <[email protected]> writes:

> On 07/03/24 19:49, Ankur Arora wrote:
>> Joel Fernandes <[email protected]> writes:
>
> ...
>
>> > Firstly, Maybe I misunderstood Ankur completely. Re-reading his comments above,
>> > he seems to be suggesting preempting instantly for higher scheduling CLASSES
>> > even for preempt=none mode, without having to wait till the next
>> > scheduling-clock interrupt.
>>
>> Yes, that's what I was suggesting.
>>
>> > Not sure if that makes sense to me, I was asking not
>> > to treat "higher class" any differently than "higher priority" for preempt=none.
>>
>> Ah. Understood.
>>
>> > And if SCHED_DEADLINE has a problem with that, then it already happens so with
>> > CONFIG_PREEMPT_NONE=y kernels, so no need special treatment for higher class any
>> > more than the treatment given to higher priority within same class. Ankur/Juri?
>>
>> No. I think that behaviour might be worse for PREEMPT_AUTO.
>>
>> PREEMPT_NONE=y (or PREEMPT_VOLUNTARY=y for that matter) don't
>> quite have a policy around when preemption happens. Preemption
>> might happen quickly, might happen slowly based on when the next
>> preemption point is found.
>>
>> The PREEMPT_AUTO, preempt=none policy in this series will always
>> cause preemption to be at user exit or the next tick. Seems like
>> it would be worse for higher scheduling classes more often.
>>
>> But, I wonder what Juri thinks about this.
>
> As I was saying in my last comment in the other discussion, I'm honestly
> not sure, mostly because I'm currently fail to see what type of users
> would choose preempt=none and have tasks scheduled with SCHED_DEADLINE
> (please suggest example usecases, as I'm pretty sure I'm missing
> something :). With that said, if the purpose of preempt=none is to have
> a model which is super conservative wrt preemptions, having to wait one
> tick to possibly schedule a DEADLINE task still seems kind of broken for
> DEADLINE, but at least is predictably broken (guess one needs to account
> for that somehow when coming up with parameters :).

True :). Let me do some performance comparisons between (preempt=none +
the leftmost logic) and whatever is left of the preempt=voluntary
patch.

Thanks

--
ankur

2024-03-11 15:04:57

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 15/30] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y



On 3/10/2024 11:56 PM, Paul E. McKenney wrote:
> On Sun, Mar 10, 2024 at 08:48:28PM -0400, Joel Fernandes wrote:
>> On 3/10/2024 2:56 PM, Paul E. McKenney wrote:
>>> On Sun, Mar 10, 2024 at 06:03:30AM -0400, Joel Fernandes wrote:
>>>> Hello Ankur and Paul,
>>>>
>>>> On Mon, Feb 12, 2024 at 09:55:39PM -0800, Ankur Arora wrote:
>>>>> With PREEMPT_RCU=n, cond_resched() provides urgently needed quiescent
>>>>> states for read-side critical sections via rcu_all_qs().
>>>>> One reason why this was necessary: lacking preempt-count, the tick
>>>>> handler has no way of knowing whether it is executing in a read-side
>>>>> critical section or not.
>>>>>
>>>>> With PREEMPT_AUTO=y, there can be configurations with (PREEMPT_COUNT=y,
>>>>> PREEMPT_RCU=n). This means that cond_resched() is a stub which does
>>>>> not provide for quiescent states via rcu_all_qs().
>>>>>
>>>>> So, use the availability of preempt_count() to report quiescent states
>>>>> in rcu_flavor_sched_clock_irq().
>>>>>
>>>>> Suggested-by: Paul E. McKenney <[email protected]>
>>>>> Signed-off-by: Ankur Arora <[email protected]>
>>>>> ---
>>>>> kernel/rcu/tree_plugin.h | 11 +++++++----
>>>>> 1 file changed, 7 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>>>>> index 26c79246873a..9b72e9d2b6fe 100644
>>>>> --- a/kernel/rcu/tree_plugin.h
>>>>> +++ b/kernel/rcu/tree_plugin.h
>>>>> @@ -963,13 +963,16 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
>>>>> */
>>>>> static void rcu_flavor_sched_clock_irq(int user)
>>>>> {
>>>>> - if (user || rcu_is_cpu_rrupt_from_idle()) {
>>>>> + if (user || rcu_is_cpu_rrupt_from_idle() ||
>>>>> + (IS_ENABLED(CONFIG_PREEMPT_COUNT) &&
>>>>> + !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))) {
>>>>
>>>> I was wondering if it makes sense to even support !PREEMPT_RCU under
>>>> CONFIG_PREEMPT_AUTO.
>>>>
>>>> AFAIU, this CONFIG_PREEMPT_AUTO series preempts the kernel on
>>>> the next tick boundary in the worst case, with all preempt modes including
>>>> the preempt=none mode.
>>>>
>>>> Considering this, does it makes sense for RCU to be non-preemptible in
>>>> CONFIG_PREEMPT_AUTO=y? Because if that were the case, and a read-side critical
>>>> section extended beyond the tick, then it prevents the PREEMPT_AUTO preemption
>>>> from happening, because rcu_read_lock() would preempt_disable().
>>>
>>> Yes, it does make sense for RCU to be non-preemptible in kernels
>>> built with CONFIG_PREEMPT_AUTO=y and either CONFIG_PREEMPT_NONE=y or
>>> CONFIG_PREEMPT_VOLUNTARY=y.
>>> As noted in earlier discussions, there are
>>
>> Sorry if I missed a discussion, appreciate a link.
>
> It is part of the discussion of the first version of this patch series,
> if I recall correctly.
>
>>> systems that are adequately but not abundantly endowed with memory.
>>> Such systems need non-preemptible RCU to avoid preempted-reader OOMs.
>>
>> Then why don't such systems have a problem with CONFIG_PREEMPT_DYNAMIC=y and
>> preempt=none mode? CONFIG_PREEMPT_DYNAMIC forces CONFIG_PREEMPT_RCU=y. There's
>> no way to set CONFIG_PREEMPT_RCU=n with CONFIG_PREEMPT_DYNAMIC=y and
>> preempt=none boot parameter. IMHO, if this feature is inconsistent with
>> CONFIG_PREEMPT_DYNAMIC, that makes it super confusing. In fact, I feel
>> CONFIG_PREEMPT_AUTO should instead just be another "preempt=auto" boot parameter
>> mode added to CONFIG_PREEMPT_DYNAMIC feature, otherwise the proliferation of
>> CONFIG_PREEMPT config options is getting a bit insane. And likely going to be
>> burden to the users configuring the PREEMPT Kconfig option IMHO.
>
> Because such systems are built with CONFIG_PREEMPT_DYNAMIC=n.
>
> You could argue that we should just build with CONFIG_PREEMPT_AUTO=n,
> but the long-term goal of eliminating cond_resched() will make that
> ineffective.

I see what you mean. We/I could also highlight some of the differences in RCU
between DYNAMIC and AUTO.

>
>>> Note well that non-preemptible RCU explicitly disables preemption across
>>> all RCU readers.
>>
>> Yes, I mentioned this 'disabling preemption' aspect in my last email. My point
>> being, unlike CONFIG_PREEMPT_NONE, CONFIG_PREEMPT_AUTO allows for kernel
>> preemption in preempt=none. So the "Don't preempt the kernel" behavior has
>> changed. That is, preempt=none under CONFIG_PREEMPT_AUTO is different from
>> CONFIG_PREEMPT_NONE=y already. Here we *are* preempting. And RCU is getting on
>> the way. It is like saying, you want an option for CONFIG_PREEMPT_RCU to be set
>> to =n for CONFIG_PREEMPT=y kernels, sighting users who want a fully-preemptible
>> kernel but are worried about reader preemptions.
>
> Such users can simply avoid building with either CONFIG_PREEMPT_NONE=y
> or with CONFIG_PREEMPT_VOLUNTARY=y. They might also experiment with
> CONFIG_RCU_BOOST=y, and also with short timeouts until boosting.
> If that doesn't do what they need, we talk with them and either help
> them configure their kernels, make RCU do what they need, or help work
> out some other way for them to get their jobs done.

Makes sense.

>> That aside, as such, I do agree your point of view, that preemptible readers
>> presents a problem to folks using preempt=none in this series and we could
>> decide to keep CONFIG_PREEMPT_RCU optional for whoever wants it that way. I was
>> just saying that I want CONFIG_PREEMPT_AUTO's preempt=none mode to be somewhat
>> consistent with CONFIG_PREEMPT_DYNAMIC's preempt=none. Because I'm pretty sure a
>> week from now, no one will likely be able to tell the difference ;-). So IMHO
>> either CONFIG_PREEMPT_DYNAMIC should be changed to make CONFIG_PREEMPT_RCU
>> optional, or this series should be altered to force CONFIG_PREEMPT_RCU=y.
>>
>> Let me know if I missed something.
>
> Why not key off of the value of CONFIG_PREEMPT_DYNAMIC? That way,
> if both CONFIG_PREEMPT_AUTO=y and CONFIG_PREEMPT_DYNAMIC=y, RCU is
> always preemptible. Then CONFIG_PREEMPT_DYNAMIC=y enables boot-time
> (and maybe even run-time) switching between preemption flavors, while
> CONFIG_PREEMPT_AUTO instead enables unconditional preemption of any
> region of code that has not explicitly disabled preemption (or irq or
> bh or whatever).

That could be done. But currently, these patches disable DYNAMIC if AUTO is
enabled in the config. I think the reason is that the two features are incompatible,
i.e. DYNAMIC wants to override the preemption mode at boot time, whereas AUTO
wants the scheduler to have a say in it using the need-resched LAZY bit.

> But one way or another, we really do need non-preemptible RCU in
> conjunction with CONFIG_PREEMPT_AUTO=y.

Ok, lets go with that. Thanks,

- Joel


2024-03-11 15:45:28

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 15/30] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y



On 3/11/2024 1:18 AM, Ankur Arora wrote:
>
> Joel Fernandes <[email protected]> writes:
>
>> On 3/10/2024 2:56 PM, Paul E. McKenney wrote:
>>> On Sun, Mar 10, 2024 at 06:03:30AM -0400, Joel Fernandes wrote:
>>>> Hello Ankur and Paul,
>>>>
>>>> On Mon, Feb 12, 2024 at 09:55:39PM -0800, Ankur Arora wrote:
>>>>> With PREEMPT_RCU=n, cond_resched() provides urgently needed quiescent
>>>>> states for read-side critical sections via rcu_all_qs().
>>>>> One reason why this was necessary: lacking preempt-count, the tick
>>>>> handler has no way of knowing whether it is executing in a read-side
>>>>> critical section or not.
>>>>>
>>>>> With PREEMPT_AUTO=y, there can be configurations with (PREEMPT_COUNT=y,
>>>>> PREEMPT_RCU=n). This means that cond_resched() is a stub which does
>>>>> not provide for quiescent states via rcu_all_qs().
>>>>>
>>>>> So, use the availability of preempt_count() to report quiescent states
>>>>> in rcu_flavor_sched_clock_irq().
>>>>>
>>>>> Suggested-by: Paul E. McKenney <[email protected]>
>>>>> Signed-off-by: Ankur Arora <[email protected]>
>>>>> ---
>>>>> kernel/rcu/tree_plugin.h | 11 +++++++----
>>>>> 1 file changed, 7 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>>>>> index 26c79246873a..9b72e9d2b6fe 100644
>>>>> --- a/kernel/rcu/tree_plugin.h
>>>>> +++ b/kernel/rcu/tree_plugin.h
>>>>> @@ -963,13 +963,16 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
>>>>> */
>>>>> static void rcu_flavor_sched_clock_irq(int user)
>>>>> {
>>>>> - if (user || rcu_is_cpu_rrupt_from_idle()) {
>>>>> + if (user || rcu_is_cpu_rrupt_from_idle() ||
>>>>> + (IS_ENABLED(CONFIG_PREEMPT_COUNT) &&
>>>>> + !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))) {
>>>>
>>>> I was wondering if it makes sense to even support !PREEMPT_RCU under
>>>> CONFIG_PREEMPT_AUTO.
>>>>
>>>> AFAIU, this CONFIG_PREEMPT_AUTO series preempts the kernel on
>>>> the next tick boundary in the worst case, with all preempt modes including
>>>> the preempt=none mode.
>>>>
>>>> Considering this, does it makes sense for RCU to be non-preemptible in
>>>> CONFIG_PREEMPT_AUTO=y? Because if that were the case, and a read-side critical
>>>> section extended beyond the tick, then it prevents the PREEMPT_AUTO preemption
>>>> from happening, because rcu_read_lock() would preempt_disable().
>>>
>>> Yes, it does make sense for RCU to be non-preemptible in kernels
>>> built with CONFIG_PREEMPT_AUTO=y and either CONFIG_PREEMPT_NONE=y or
>>> CONFIG_PREEMPT_VOLUNTARY=y.
>>> As noted in earlier discussions, there are
>>
>> Sorry if I missed a discussion, appreciate a link.
>>
>>> systems that are adequately but not abundantly endowed with memory.
>>> Such systems need non-preemptible RCU to avoid preempted-reader OOMs.
>>
>> Then why don't such systems have a problem with CONFIG_PREEMPT_DYNAMIC=y and
>> preempt=none mode? CONFIG_PREEMPT_DYNAMIC forces CONFIG_PREEMPT_RCU=y. There's
>> no way to set CONFIG_PREEMPT_RCU=n with CONFIG_PREEMPT_DYNAMIC=y and
>> preempt=none boot parameter. IMHO, if this feature is inconsistent with
>> CONFIG_PREEMPT_DYNAMIC, that makes it super confusing. In fact, I feel
>> CONFIG_PREEMPT_AUTO should instead just be another "preempt=auto" boot parameter
>> mode added to CONFIG_PREEMPT_DYNAMIC feature, otherwise the proliferation of
>> CONFIG_PREEMPT config options is getting a bit insane. And likely going to be
>> burden to the users configuring the PREEMPT Kconfig option IMHO.
>>
>>> Note well that non-preemptible RCU explicitly disables preemption across
>>> all RCU readers.
>>
>> Yes, I mentioned this 'disabling preemption' aspect in my last email. My point
>> being, unlike CONFIG_PREEMPT_NONE, CONFIG_PREEMPT_AUTO allows for kernel
>> preemption in preempt=none. So the "Don't preempt the kernel" behavior has
>> changed. That is, preempt=none under CONFIG_PREEMPT_AUTO is different from
>> CONFIG_PREEMPT_NONE=y already. Here we *are* preempting. And RCU is getting on
>
> I think that's a view from too close to the implementation. Someone
> using the kernel is not necessarily concered with whether tasks are
> preempted or not. They are concerned with throughput and latency.

No, we are not only talking about that (throughput/latency). We are also talking
about the issue of RCU reader preemption causing OOM (which can hurt both
throughput and latency as well).

With CONFIG_PREEMPT_AUTO=y, you now preempt in the preempt=none mode. Something
very different from the classical CONFIG_PREEMPT_NONE=y.

Essentially this means preemption is now more aggressive from the point of view
of a preempt=none user. I was suggesting that one point of view could be that RCU
should always support preemptibility (don't give a PREEMPT_RCU=n option) because
AUTO *does preempt*, unlike classic CONFIG_PREEMPT_NONE. Otherwise it is
inconsistent -- say with CONFIG_PREEMPT=y (another *preemption mode*) which
forces CONFIG_PREEMPT_RCU. However, to Paul's point, we need to worry about those
users who are concerned with running out of memory due to reader preemption.

In that vein, maybe we should also support CONFIG_PREEMPT_RCU=n for
CONFIG_PREEMPT=y as well. There are plenty of popular systems with relatively
low memory that need low latency (like some low-end devices / laptops :-)).

> Framed thus:
>
> preempt=none: tasks typically run to completion, might result in high latency
> preempt=full: preempt at the earliest opportunity, low latency
> preempt=voluntary: somewhere between these two
>
> In that case you could argue that CONFIG_PREEMPT_NONE,
> (CONFIG_PREEMPT_DYNAMIC, preempt=none) and (CONFIG_PREEMPT_AUTO,
> preempt=none) have broadly similar behaviour.

Yes, in that respect I agree.

>
>> the way. It is like saying, you want an option for CONFIG_PREEMPT_RCU to be set
>> to =n for CONFIG_PREEMPT=y kernels, sighting users who want a fully-preemptible
>> kernel but are worried about reader preemptions.
>>
>> That aside, as such, I do agree your point of view, that preemptible readers
>> presents a problem to folks using preempt=none in this series and we could
>> decide to keep CONFIG_PREEMPT_RCU optional for whoever wants it that way. I was
>> just saying that I want CONFIG_PREEMPT_AUTO's preempt=none mode to be somewhat
>> consistent with CONFIG_PREEMPT_DYNAMIC's preempt=none. Because I'm pretty sure a
>
> PREEMPT_DYNAMIC and PREEMPT_AUTO are trying to do different tasks.
>
> PREEMPT_DYNAMIC: allow dynamic switching between the original
> PREEMPT_NONE, PREEMPT_VOLUNTARY, PREEEMPT models.
>
> PREEMPT_AUTO: remove the need for explicit preemption points, by
> - bringing the scheduling model under complete control of the
> scheduler
> - always having unconditional preemption (and using it to varying
> degrees of strictness based on the preemption model in effect.)
>
> So, even though PREEMPT_AUTO does use PREEMPT_NONE, PREEMPT_VOLUNTARY,
> PREEMPT options, those are just meant to loosely identify with Linux's
> preemption models, and the intent is not to be identical to it -- they
> can't be identical because the underlying implementation is completely
> different.
>
> The eventual hope is that we can get rid of explicit preemption points.

Sounds good.

>> week from now, no one will likely be able to tell the difference ;-). So IMHO
>> either CONFIG_PREEMPT_DYNAMIC should be changed to make CONFIG_PREEMPT_RCU
>> optional, or this series should be altered to force CONFIG_PREEMPT_RCU=y.
>>
> I think that's a patch for CONFIG_PREEMPT_DYNAMIC :).

Yes, I'm curious to explore it more. Specifically, with DYNAMIC=y, I'll explore
in my free time how differently preempt=none behaves from an RCU PoV
(say, versus CONFIG_PREEMPT_NONE). I will also continue reviewing these patches.

Thanks!

- Joel


2024-03-11 19:13:18

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 15/30] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y

On Mon, Mar 11 2024 at 11:25, Joel Fernandes wrote:
> On 3/11/2024 1:18 AM, Ankur Arora wrote:
>>> Yes, I mentioned this 'disabling preemption' aspect in my last email. My point
>>> being, unlike CONFIG_PREEMPT_NONE, CONFIG_PREEMPT_AUTO allows for kernel
>>> preemption in preempt=none. So the "Don't preempt the kernel" behavior has
>>> changed. That is, preempt=none under CONFIG_PREEMPT_AUTO is different from
>>> CONFIG_PREEMPT_NONE=y already. Here we *are* preempting. And RCU is getting on
>>
>> I think that's a view from too close to the implementation. Someone
>> using the kernel is not necessarily concered with whether tasks are
>> preempted or not. They are concerned with throughput and latency.
>
> No, we are not only talking about that (throughput/latency). We are also talking
> about the issue related to RCU reader-preemption causing OOM (well and that
> could hurt both throughput and latency as well).

That happens only when PREEMPT_RCU=y. For PREEMPT_RCU=n the read side
critical sections still have preemption disabled.

> With CONFIG_PREEMPT_AUTO=y, you now preempt in the preempt=none mode. Something
> very different from the classical CONFIG_PREEMPT_NONE=y.

In PREEMPT_RCU=y and preempt=none mode this happens only when really
required, i.e. when the task does not schedule out or returns to user
space on time, or when a higher scheduling class task gets runnable. For
the latter the jury is still out on whether this should be done or just
lazily deferred like the SCHED_OTHER preemption requests.

In any case for that to matter this forced preemption would need to
preempt a RCU read side critical section and then keep the preempted
task away from the CPU for a long time.

That's very different from the unconditional kernel preemption model which
preempt=full provides and only marginally different from the existing
PREEMPT_NONE model. I know there might be dragons, but I'm not convinced
yet that this is an actual problem.

OTOH, doesn't PREEMPT_RCU=y have a mechanism to mitigate that already?

> Essentially this means preemption is now more aggressive from the point of view
> of a preempt=none user. I was suggesting that, a point of view could be RCU
> should always support preepmtiblity (don't give PREEEMPT_RCU=n option) because
> AUTO *does preempt* unlike classic CONFIG_PREEMPT_NONE. Otherwise it is
> inconsistent -- say with CONFIG_PREEMPT=y (another *preemption mode*) which
> forces CONFIG_PREEMPT_RCU. However to Paul's point, we need to worry about those
> users who are concerned with running out of memory due to reader
> preemption.

What's wrong with the combination of PREEMPT_AUTO=y and PREEMPT_RCU=n?
Paul and I agreed long ago that this needs to be supported.

> In that vain, maybe we should also support CONFIG_PREEMPT_RCU=n for
> CONFIG_PREEMPT=y as well. There are plenty of popular systems with relatively
> low memory that need low latency (like some low-end devices / laptops
> :-)).

I'm not sure whether that's useful as the goal is to get rid of all the
CONFIG_PREEMPT_FOO options, no?

I'd rather spend brain cycles on figuring out whether RCU can be flipped
over between PREEMPT_RCU=n/y at boot or obviously run-time.

Thanks,

tglx

2024-03-11 19:26:50

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO

On Sun, Mar 10, 2024 at 09:50:33PM -0700, Ankur Arora wrote:
>
> Paul E. McKenney <[email protected]> writes:
>
> > On Thu, Mar 07, 2024 at 08:22:30PM -0800, Ankur Arora wrote:
> >>
> >> Paul E. McKenney <[email protected]> writes:
> >>
> >> > On Thu, Mar 07, 2024 at 07:15:35PM -0500, Joel Fernandes wrote:
> >> >>
> >> >>
> >> >> On 3/7/2024 2:01 PM, Paul E. McKenney wrote:
> >> >> > On Wed, Mar 06, 2024 at 03:42:10PM -0500, Joel Fernandes wrote:
> >> >> >> Hi Ankur,
> >> >> >>
> >> >> >> On 3/5/2024 3:11 AM, Ankur Arora wrote:
> >> >> >>>
> >> >> >>> Joel Fernandes <[email protected]> writes:
> >> >> >>>
> >> >> >> [..]
> >> >> >>>> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
> >> >> >>>> 'voluntary' business because
> >> >> >>>> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
> >> >> >>>> is not about the old voluntary.
> >> >> >>>
> >> >> >>> What do you think about folding the higher scheduling class preemption logic
> >> >> >>> into preempt=none? As Juri pointed out, prioritization of at least the leftmost
> >> >> >>> deadline task needs to be done for correctness.
> >> >> >>>
> >> >> >>> (That'll get rid of the current preempt=voluntary model, at least until
> >> >> >>> there's a separate use for it.)
> >> >> >>
> >> >> >> Yes I am all in support for that. Its less confusing for the user as well, and
> >> >> >> scheduling higher priority class at the next tick for preempt=none sounds good
> >> >> >> to me. That is still an improvement for folks using SCHED_DEADLINE for whatever
> >> >> >> reason, with a vanilla CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode
> >> >> >> that is more aggressive, it could be added in the future.
> >> >> >
> >> >> > This would be something that happens only after removing cond_resched()
> >> >> > might_sleep() functionality from might_sleep(), correct?
> >> >>
> >> >> Firstly, Maybe I misunderstood Ankur completely. Re-reading his comments above,
> >> >> he seems to be suggesting preempting instantly for higher scheduling CLASSES
> >> >> even for preempt=none mode, without having to wait till the next
> >> >> scheduling-clock interrupt. Not sure if that makes sense to me, I was asking not
> >> >> to treat "higher class" any differently than "higher priority" for preempt=none.
> >> >>
> >> >> And if SCHED_DEADLINE has a problem with that, then it already happens so with
> >> >> CONFIG_PREEMPT_NONE=y kernels, so no need special treatment for higher class any
> >> >> more than the treatment given to higher priority within same class. Ankur/Juri?
> >> >>
> >> >> Re: cond_resched(), I did not follow you Paul, why does removing the proposed
> >> >> preempt=voluntary mode (i.e. dropping this patch) have to happen only after
> >> >> cond_resched()/might_sleep() modifications?
> >> >
> >> > Because right now, one large difference between CONFIG_PREEMPT_NONE
> >> > an CONFIG_PREEMPT_VOLUNTARY is that for the latter might_sleep() is a
> >> > preemption point, but not for the former.
> >>
> >> True. But, there is no difference between either of those with
> >> PREEMPT_AUTO=y (at least right now).
> >>
> >> For (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y, DEBUG_ATOMIC_SLEEP=y),
> >> might_sleep() is:
> >>
> >> # define might_resched() do { } while (0)
> >> # define might_sleep() \
> >> do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)
> >>
> >> And, cond_resched() for (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y,
> >> DEBUG_ATOMIC_SLEEP=y):
> >>
> >> static inline int _cond_resched(void)
> >> {
> >> klp_sched_try_switch();
> >> return 0;
> >> }
> >> #define cond_resched() ({ \
> >> __might_resched(__FILE__, __LINE__, 0); \
> >> _cond_resched(); \
> >> })
> >>
> >> And, no change for (PREEMPT_AUTO=y, PREEMPT_NONE=y, DEBUG_ATOMIC_SLEEP=y).
> >
> > As long as it is easy to restore the prior cond_resched() functionality
> > for testing in the meantime, I should be OK. For example, it would
> > be great to have the commit removing the old functionality from
> > cond_resched() at the end of the series,
>
> I would, of course, be happy to make any changes that helps testing,
> but I think I'm missing something that you are saying wrt
> cond_resched()/might_sleep().
>
> There's no commit explicitly removing the core cond_reshed()
> functionality: PREEMPT_AUTO explicitly selects PREEMPT_BUILD and selects
> out PREEMPTION_{NONE,VOLUNTARY}_BUILD.
> (That's patch-1 "preempt: introduce CONFIG_PREEMPT_AUTO".)
>
> For the rest it just piggybacks on the CONFIG_PREEMPT_DYNAMIC work
> and just piggybacks on (!CONFIG_PREEMPT_DYNAMIC && CONFIG_PREEMPTION):
>
> #if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
> /* ... */
> #if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> /* ... */
> #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
> /* ... */
> #else /* !CONFIG_PREEMPTION */
> /* ... */
> #endif /* PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
>
> #else /* CONFIG_PREEMPTION && !CONFIG_PREEMPT_DYNAMIC */
> static inline int _cond_resched(void)
> {
> klp_sched_try_switch();
> return 0;
> }
> #endif /* !CONFIG_PREEMPTION || CONFIG_PREEMPT_DYNAMIC */
>
> Same for might_sleep() (which really amounts to might_resched()):
>
> #ifdef CONFIG_PREEMPT_VOLUNTARY_BUILD
> /* ... */
> #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> /* ... */
> #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
> /* ... */
> #else
> # define might_resched() do { } while (0)
> #endif /* CONFIG_PREEMPT_* */
>
> But, I doubt that I'm telling you anything new. So, what am I missing?

It is really a choice at your end.

Suppose we enable CONFIG_PREEMPT_AUTO on our fleet, and find that there
was some small set of cond_resched() calls that provided sub-jiffy
preemption that matter to some of our workloads. At that point, what
are our options?

1. Revert CONFIG_PREEMPT_AUTO.

2. Revert only the part that disables the voluntary preemption
semantics of cond_resched(). Which, as you point out, ends up
being the same as #1 above.

3. Hotwire a voluntary preemption into the required locations.
Which we would avoid doing due to upstream-acceptance concerns.

So, how easy would you like to make it for us to use as much of
CONFIG_PREEMPT_AUTO=y as possible under the various possible problem scenarios?

Yes, in a perfect world, we would have tested this already, but I
am still chasing down problems induced by simple rcutorture testing.
Cowardly of us, isn't it? ;-)

Thanx, Paul

2024-03-11 19:53:57

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 15/30] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y

On Mon, Mar 11, 2024 at 08:12:58PM +0100, Thomas Gleixner wrote:
> On Mon, Mar 11 2024 at 11:25, Joel Fernandes wrote:
> > On 3/11/2024 1:18 AM, Ankur Arora wrote:
> >>> Yes, I mentioned this 'disabling preemption' aspect in my last email. My point
> >>> being, unlike CONFIG_PREEMPT_NONE, CONFIG_PREEMPT_AUTO allows for kernel
> >>> preemption in preempt=none. So the "Don't preempt the kernel" behavior has
> >>> changed. That is, preempt=none under CONFIG_PREEMPT_AUTO is different from
> >>> CONFIG_PREEMPT_NONE=y already. Here we *are* preempting. And RCU is getting on
> >>
> >> I think that's a view from too close to the implementation. Someone
> >> using the kernel is not necessarily concered with whether tasks are
> >> preempted or not. They are concerned with throughput and latency.
> >
> > No, we are not only talking about that (throughput/latency). We are also talking
> > about the issue related to RCU reader-preemption causing OOM (well and that
> > could hurt both throughput and latency as well).
>
> That happens only when PREEMPT_RCU=y. For PREEMPT_RCU=n the read side
> critical sections still have preemption disabled.
>
> > With CONFIG_PREEMPT_AUTO=y, you now preempt in the preempt=none mode. Something
> > very different from the classical CONFIG_PREEMPT_NONE=y.
>
> In PREEMPT_RCU=y and preempt=none mode this happens only when really
> required, i.e. when the task does not schedule out or returns to user
> space on time, or when a higher scheduling class task gets runnable. For
> the latter the jury is still out whether this should be done or just
> lazily defered like the SCHED_OTHER preemption requests.
>
> In any case for that to matter this forced preemption would need to
> preempt a RCU read side critical section and then keep the preempted
> task away from the CPU for a long time.
>
> That's very different from the unconditional kernel preemption model which
> preempt=full provides and only marginally different from the existing
> PREEMPT_NONE model. I know there might be dragons, but I'm not convinced
> yet that this is an actual problem.
>
> OTOH, doesn't PREEMPT_RCU=y have mechanism to mitigate that already?

You are right, it does, CONFIG_RCU_BOOST=y.

> > Essentially this means preemption is now more aggressive from the point of view
> > of a preempt=none user. I was suggesting that, a point of view could be RCU
> > should always support preepmtiblity (don't give PREEEMPT_RCU=n option) because
> > AUTO *does preempt* unlike classic CONFIG_PREEMPT_NONE. Otherwise it is
> > inconsistent -- say with CONFIG_PREEMPT=y (another *preemption mode*) which
> > forces CONFIG_PREEMPT_RCU. However to Paul's point, we need to worry about those
> > users who are concerned with running out of memory due to reader
> > preemption.
>
> What's wrong with the combination of PREEMPT_AUTO=y and PREEMPT_RCU=n?
> Paul and me agreed long ago that this needs to be supported.
>
> > In that vain, maybe we should also support CONFIG_PREEMPT_RCU=n for
> > CONFIG_PREEMPT=y as well. There are plenty of popular systems with relatively
> > low memory that need low latency (like some low-end devices / laptops
> > :-)).
>
> I'm not sure whether that's useful as the goal is to get rid of all the
> CONFIG_PREEMPT_FOO options, no?
>
> I'd rather spend brain cycles on figuring out whether RCU can be flipped
> over between PREEMPT_RCU=n/y at boot or obviously run-time.

Well, it is just software, so anything is possible. But there can
be a wide gap between "possible" and "sensible". ;-)

In theory, one boot-time approach would be to build preemptible RCU,
and then to boot-time binary-rewrite calls to __rcu_read_lock()
and __rcu_read_unlock() to preempt_disable() and preempt_enable(),
respectively. Because preemptible RCU has to treat preemption-disabled
regions of code as RCU readers, this Should Just Work. However, there
would then be a lot of needless branches in the grace-period code.
Only the ones on fastpaths (for example, context switch) would need
to be static-branchified, but there would likely need to be other
restructuring, given the need for current preemptible RCU to do a better
job of emulating non-preemptible RCU. (Emulating non-preemptible RCU
is of course currently a complete non-goal for preemptible RCU.)

So maybe?
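
Roughly, the static-branch flavor of that fast path might look something
like the sketch below (purely illustrative: the key name is made up, and
nesting and the grace-period side are ignored entirely):

DEFINE_STATIC_KEY_TRUE(rcu_reader_preemptible);	/* flipped once, at boot */

static inline void rcu_read_lock_sketch(void)
{
	if (static_branch_likely(&rcu_reader_preemptible))
		__rcu_read_lock();	/* preemptible-RCU reader bookkeeping */
	else
		preempt_disable();	/* emulate non-preemptible RCU */
}

static inline void rcu_read_unlock_sketch(void)
{
	if (static_branch_likely(&rcu_reader_preemptible))
		__rcu_read_unlock();
	else
		preempt_enable();
}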

But this one needs careful design and review up front, as in step through
all the code and check assumptions and changes in behavior. After all,
this stuff is way easier to break than to debug and fix. ;-)


On the other hand, making RCU switch at runtime is... Tricky.

For example, if the system was in non-preemptible mode at rcu_read_lock()
time, the corresponding rcu_read_unlock() needs to be aware that it needs
to act as if the system was still in non-preemptible mode, and vice versa.
Grace period processing during the switch needs to be aware that different
CPUs will be switching at different times. Also, it will be common for a
given CPU's switch to span more than one grace period. So any approach
based on either binary rewrite or static branches will need to be set
up in a multi-phase multi-grace-period state machine. Sort of like
Frederic's runtime-switched callback offloading, but rather more complex,
and way more performance sensitive.
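
For example (again purely illustrative; the helper and the task_struct
field below are invented, and reader nesting is ignored), each reader
would have to record the mode it observed at lock time:

void rcu_read_lock_switchable(void)
{
	if (rcu_readers_preemptible()) {	/* global mode, at lock time */
		__rcu_read_lock();
		current->rcu_reader_saw_preempt = true;
	} else {
		preempt_disable();
		current->rcu_reader_saw_preempt = false;
	}
}

void rcu_read_unlock_switchable(void)
{
	/* Must honor the mode observed at rcu_read_lock() time. */
	if (current->rcu_reader_saw_preempt)
		__rcu_read_unlock();
	else
		preempt_enable();
}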

But do we even need to switch RCU at runtime, other than to say that
we did it? What is the use case? Or is this just a case of "it would
be cool!"? Don't get me wrong, I am a sucker for "it would be cool",
as you well know, but even for me there are limits. ;-)

At the moment, I would prioritize improving quiescent-state forcing for
existing RCU over this, especially perhaps given the concerns from the
MM folks.

But what is motivating the desire to boot-time/run-time switch RCU
between preemptible and non-preemptible?

Thanx, Paul

2024-03-11 20:10:50

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO


Paul E. McKenney <[email protected]> writes:

> On Sun, Mar 10, 2024 at 09:50:33PM -0700, Ankur Arora wrote:
>>
>> Paul E. McKenney <[email protected]> writes:
>>
>> > On Thu, Mar 07, 2024 at 08:22:30PM -0800, Ankur Arora wrote:
>> >>
>> >> Paul E. McKenney <[email protected]> writes:
>> >>
>> >> > On Thu, Mar 07, 2024 at 07:15:35PM -0500, Joel Fernandes wrote:
>> >> >>
>> >> >>
>> >> >> On 3/7/2024 2:01 PM, Paul E. McKenney wrote:
>> >> >> > On Wed, Mar 06, 2024 at 03:42:10PM -0500, Joel Fernandes wrote:
>> >> >> >> Hi Ankur,
>> >> >> >>
>> >> >> >> On 3/5/2024 3:11 AM, Ankur Arora wrote:
>> >> >> >>>
>> >> >> >>> Joel Fernandes <[email protected]> writes:
>> >> >> >>>
>> >> >> >> [..]
>> >> >> >>>> IMO, just kill 'voluntary' if PREEMPT_AUTO is enabled. There is no
>> >> >> >>>> 'voluntary' business because
>> >> >> >>>> 1. The behavior vs =none is to allow higher scheduling class to preempt, it
>> >> >> >>>> is not about the old voluntary.
>> >> >> >>>
>> >> >> >>> What do you think about folding the higher scheduling class preemption logic
>> >> >> >>> into preempt=none? As Juri pointed out, prioritization of at least the leftmost
>> >> >> >>> deadline task needs to be done for correctness.
>> >> >> >>>
>> >> >> >>> (That'll get rid of the current preempt=voluntary model, at least until
>> >> >> >>> there's a separate use for it.)
>> >> >> >>
>> >> >> >> Yes I am all in support for that. Its less confusing for the user as well, and
>> >> >> >> scheduling higher priority class at the next tick for preempt=none sounds good
>> >> >> >> to me. That is still an improvement for folks using SCHED_DEADLINE for whatever
>> >> >> >> reason, with a vanilla CONFIG_PREEMPT_NONE=y kernel. :-P. If we want a new mode
>> >> >> >> that is more aggressive, it could be added in the future.
>> >> >> >
>> >> >> > This would be something that happens only after removing cond_resched()
>> >> >> > might_sleep() functionality from might_sleep(), correct?
>> >> >>
>> >> >> Firstly, Maybe I misunderstood Ankur completely. Re-reading his comments above,
>> >> >> he seems to be suggesting preempting instantly for higher scheduling CLASSES
>> >> >> even for preempt=none mode, without having to wait till the next
>> >> >> scheduling-clock interrupt. Not sure if that makes sense to me, I was asking not
>> >> >> to treat "higher class" any differently than "higher priority" for preempt=none.
>> >> >>
>> >> >> And if SCHED_DEADLINE has a problem with that, then it already happens so with
>> >> >> CONFIG_PREEMPT_NONE=y kernels, so no need special treatment for higher class any
>> >> >> more than the treatment given to higher priority within same class. Ankur/Juri?
>> >> >>
>> >> >> Re: cond_resched(), I did not follow you Paul, why does removing the proposed
>> >> >> preempt=voluntary mode (i.e. dropping this patch) have to happen only after
>> >> >> cond_resched()/might_sleep() modifications?
>> >> >
>> >> > Because right now, one large difference between CONFIG_PREEMPT_NONE
>> >> > an CONFIG_PREEMPT_VOLUNTARY is that for the latter might_sleep() is a
>> >> > preemption point, but not for the former.
>> >>
>> >> True. But, there is no difference between either of those with
>> >> PREEMPT_AUTO=y (at least right now).
>> >>
>> >> For (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y, DEBUG_ATOMIC_SLEEP=y),
>> >> might_sleep() is:
>> >>
>> >> # define might_resched() do { } while (0)
>> >> # define might_sleep() \
>> >> do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)
>> >>
>> >> And, cond_resched() for (PREEMPT_AUTO=y, PREEMPT_VOLUNTARY=y,
>> >> DEBUG_ATOMIC_SLEEP=y):
>> >>
>> >> static inline int _cond_resched(void)
>> >> {
>> >> klp_sched_try_switch();
>> >> return 0;
>> >> }
>> >> #define cond_resched() ({ \
>> >> __might_resched(__FILE__, __LINE__, 0); \
>> >> _cond_resched(); \
>> >> })
>> >>
>> >> And, no change for (PREEMPT_AUTO=y, PREEMPT_NONE=y, DEBUG_ATOMIC_SLEEP=y).
>> >
>> > As long as it is easy to restore the prior cond_resched() functionality
>> > for testing in the meantime, I should be OK. For example, it would
>> > be great to have the commit removing the old functionality from
>> > cond_resched() at the end of the series,
>>
>> I would, of course, be happy to make any changes that helps testing,
>> but I think I'm missing something that you are saying wrt
>> cond_resched()/might_sleep().
>>
>> There's no commit explicitly removing the core cond_reshed()
>> functionality: PREEMPT_AUTO explicitly selects PREEMPT_BUILD and selects
>> out PREEMPTION_{NONE,VOLUNTARY}_BUILD.
>> (That's patch-1 "preempt: introduce CONFIG_PREEMPT_AUTO".)
>>
>> For the rest it just piggybacks on the CONFIG_PREEMPT_DYNAMIC work
>> and just piggybacks on (!CONFIG_PREEMPT_DYNAMIC && CONFIG_PREEMPTION):
>>
>> #if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
>> /* ... */
>> #if defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>> /* ... */
>> #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
>> /* ... */
>> #else /* !CONFIG_PREEMPTION */
>> /* ... */
>> #endif /* PREEMPT_DYNAMIC && CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
>>
>> #else /* CONFIG_PREEMPTION && !CONFIG_PREEMPT_DYNAMIC */
>> static inline int _cond_resched(void)
>> {
>> klp_sched_try_switch();
>> return 0;
>> }
>> #endif /* !CONFIG_PREEMPTION || CONFIG_PREEMPT_DYNAMIC */
>>
>> Same for might_sleep() (which really amounts to might_resched()):
>>
>> #ifdef CONFIG_PREEMPT_VOLUNTARY_BUILD
>> /* ... */
>> #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>> /* ... */
>> #elif defined(CONFIG_PREEMPT_DYNAMIC) && defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
>> /* ... */
>> #else
>> # define might_resched() do { } while (0)
>> #endif /* CONFIG_PREEMPT_* */
>>
>> But, I doubt that I'm telling you anything new. So, what am I missing?
>
> It is really a choice at your end.
>
> Suppose we enable CONFIG_PREEMPT_AUTO on our fleet, and find that there
> was some small set of cond_resched() calls that provided sub-jiffy
> preemption that matter to some of our workloads. At that point, what
> are our options?
>
> 1. Revert CONFIG_PREEMPT_AUTO.
>
> 2. Revert only the part that disables the voluntary preemption
> semantics of cond_resched(). Which, as you point out, ends up
> being the same as #1 above.
>
> 3. Hotwire a voluntary preemption into the required locations.
> Which we would avoid doing due to upstream-acceptance concerns.
>
> So, how easy would you like to make it for us to use as much of
> CONFIG_PREEMPT_AUTO=y under various possible problem scenarios?

Ah, I see your point. Basically, keep the lazy semantics but -- in
addition -- also provide the ability to dynamically toggle
cond_resched(), might_resched() as a feature to help move this along
further.

So, as I mentioned earlier, the callsites are already present, and
removing them needs work (with livepatch and more generally to ensure
PREEMPT_AUTO is good enough for the current PREEMPT_* scenarios so
we can ditch cond_resched()).

I honestly don't see any reason not to do this -- I would prefer
this be a temporary thing to help beat PREEMPT_AUTO into shape. And,
this provides an insurance policy for using PREEMPT_AUTO.

That said, I would like Thomas' opinion on this.

> 3. Hotwire a voluntary preemption into the required locations.
> Which we would avoid doing due to upstream-acceptance concerns.

Apropos of this, how would you determine which are the locations
where we specifically need voluntary preemption?

> Yes, in a perfect world, we would have tested this already, but I
> am still chasing down problems induced by simple rcutorture testing.
> Cowardly of us, isn't it? ;-)

Cowards are us :).

--
ankur

2024-03-11 20:23:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO

On Mon, 11 Mar 2024 at 13:10, Ankur Arora <[email protected]> wrote:
>
> Ah, I see your point. Basically, keep the lazy semantics but -- in
> addition -- also provide the ability to dynamically toggle
> cond_resched(), might_reshed() as a feature to help move this along
> further.

Please, let's not make up any random hypotheticals.

Honestly, if we ever hit the hypothetical scenario that Paul outlined, let's

(a) deal with it THEN, when we actually know what the situation is

(b) learn and document what it is that actually causes the odd behavior

IOW, instead of assuming that some "cond_resched()" case would even be
the right thing to do, maybe there are other issues going on? Let's
not paper over them by keeping some hack around - and *if* some
cond_resched() model is actually the right model in some individual
place, let's make it the rule that *when* we hit that case, we
document it.

And we should absolutely not have some hypothetical case keep us from
just doing the right thing and getting rid of the existing
cond_resched().

Because any potential future case is *not* going to be the same
cond_resched() that the current case is anyway. It is going to have
some very different cause.

Linus

2024-03-11 20:29:26

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 15/30] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y

Paul!

On Mon, Mar 11 2024 at 12:53, Paul E. McKenney wrote:
> On Mon, Mar 11, 2024 at 08:12:58PM +0100, Thomas Gleixner wrote:
>> I'd rather spend brain cycles on figuring out whether RCU can be flipped
>> over between PREEMPT_RCU=n/y at boot or obviously run-time.
>
> Well, it is just software, so anything is possible. But there can
> be a wide gap between "possible" and "sensible". ;-)

Indeed.

> In theory, one boot-time approach would be build preemptible RCU,
> and then to boot-time binary-rewrite calls to __rcu_read_lock()
> and __rcu_read_unlock() to preempt_disable() and preempt_enable(),
> respectively. Because preemptible RCU has to treat preemption-disabled
> regions of code as RCU readers, this Should Just Work. However, there
> would then be a lot of needless branches in the grace-period code.
> Only the ones on fastpaths (for example, context switch) would need
> to be static-branchified, but there would likely need to be other
> restructuring, given the need for current preemptible RCU to do a better
> job of emulating non-preemptible RCU. (Emulating non-preemptible RCU
> is of course currently a complete non-goal for preemptible RCU.)

Sure, but that might be a path towards a more unified RCU model in the
longer term, with the tweak of patching the flavor once at boot.

> But this one needs careful design and review up front, as in step through
> all the code and check assumptions and changes in behavior. After all,
> this stuff is way easier to break than to debug and fix. ;-)

Isn't that where all the real fun is? :)

> On the other hand, making RCU switch at runtime is... Tricky.

I was only half serious about that. Just thought to mention it for
completeness' sake and of course to make you write a novel. :)

> For example, if the system was in non-preemptible mode at rcu_read_lock()
> time, the corresponding rcu_read_unlock() needs to be aware that it needs
> to act as if the system was still in non-preemptible mode, and vice versa.
> Grace period processing during the switch needs to be aware that different
> CPUs will be switching at different times. Also, it will be common for a
> given CPU's switch to span more than one grace period. So any approach
> based on either binary rewrite or static branches will need to be set
> up in a multi-phase multi-grace-period state machine. Sort of like
> Frederic's runtime-switched callback offloading, but rather more complex,
> and way more performance sensitive.

Of course it would be a complex endeavour at the scale of run-time
switching to or from PREEMPT_RT, which will never happen. Even the boot
time switch for RT would be way harder than the RCU one :)

> But do we even need to switch RCU at runtime, other than to say that
> we did it? What is the use case? Or is this just a case of "it would
> be cool!"? Don't get me wrong, I am a sucker for "it would be cool",
> as you well know, but even for me there are limits. ;-)

There is no need for runtime switching other than "it would be cool" and
I'm happy that even you have limits. :)

> At the moment, I would prioritize improving quiescent-state forcing for
> existing RCU over this, especially perhaps given the concerns from the
> MM folks.
>
> But what is motivating the desire to boot-time/run-time switch RCU
> between preemptible and non-preemptible?

The goal of PREEMPT_AUTO is to get rid of the zoo of preemption models
and their associated horrors. As we debated some time ago, we still need
to have the distinction of preemptible and non-preemptible RCU even with
that.

So it's a pretty obvious consequence to make this decision at boot time
to reduce the number of kernel flavors for distros to build.

Nothing urgent, but it just came to my mind when reading through this
thread and replying to Joel.

Thanks,

tglx

2024-03-11 20:52:54

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 15/30] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y


Joel Fernandes <[email protected]> writes:

> On 3/10/2024 11:56 PM, Paul E. McKenney wrote:
>> On Sun, Mar 10, 2024 at 08:48:28PM -0400, Joel Fernandes wrote:
>>> On 3/10/2024 2:56 PM, Paul E. McKenney wrote:
>>>> On Sun, Mar 10, 2024 at 06:03:30AM -0400, Joel Fernandes wrote:
>>>>> Hello Ankur and Paul,
>>>>>
>>>>> On Mon, Feb 12, 2024 at 09:55:39PM -0800, Ankur Arora wrote:
>>>>>> With PREEMPT_RCU=n, cond_resched() provides urgently needed quiescent
>>>>>> states for read-side critical sections via rcu_all_qs().
>>>>>> One reason why this was necessary: lacking preempt-count, the tick
>>>>>> handler has no way of knowing whether it is executing in a read-side
>>>>>> critical section or not.
>>>>>>
>>>>>> With PREEMPT_AUTO=y, there can be configurations with (PREEMPT_COUNT=y,
>>>>>> PREEMPT_RCU=n). This means that cond_resched() is a stub which does
>>>>>> not provide for quiescent states via rcu_all_qs().
>>>>>>
>>>>>> So, use the availability of preempt_count() to report quiescent states
>>>>>> in rcu_flavor_sched_clock_irq().
>>>>>>
>>>>>> Suggested-by: Paul E. McKenney <[email protected]>
>>>>>> Signed-off-by: Ankur Arora <[email protected]>
>>>>>> ---
>>>>>> kernel/rcu/tree_plugin.h | 11 +++++++----
>>>>>> 1 file changed, 7 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>>>>>> index 26c79246873a..9b72e9d2b6fe 100644
>>>>>> --- a/kernel/rcu/tree_plugin.h
>>>>>> +++ b/kernel/rcu/tree_plugin.h
>>>>>> @@ -963,13 +963,16 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
>>>>>> */
>>>>>> static void rcu_flavor_sched_clock_irq(int user)
>>>>>> {
>>>>>> - if (user || rcu_is_cpu_rrupt_from_idle()) {
>>>>>> + if (user || rcu_is_cpu_rrupt_from_idle() ||
>>>>>> + (IS_ENABLED(CONFIG_PREEMPT_COUNT) &&
>>>>>> + !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))) {
>>>>>
>>>>> I was wondering if it makes sense to even support !PREEMPT_RCU under
>>>>> CONFIG_PREEMPT_AUTO.
>>>>>
>>>>> AFAIU, this CONFIG_PREEMPT_AUTO series preempts the kernel on
>>>>> the next tick boundary in the worst case, with all preempt modes including
>>>>> the preempt=none mode.
>>>>>
>>>>> Considering this, does it makes sense for RCU to be non-preemptible in
>>>>> CONFIG_PREEMPT_AUTO=y? Because if that were the case, and a read-side critical
>>>>> section extended beyond the tick, then it prevents the PREEMPT_AUTO preemption
>>>>> from happening, because rcu_read_lock() would preempt_disable().
>>>>
>>>> Yes, it does make sense for RCU to be non-preemptible in kernels
>>>> built with CONFIG_PREEMPT_AUTO=y and either CONFIG_PREEMPT_NONE=y or
>>>> CONFIG_PREEMPT_VOLUNTARY=y.
>>>> As noted in earlier discussions, there are
>>>
>>> Sorry if I missed a discussion, appreciate a link.
>>
>> It is part of the discussion of the first version of this patch series,
>> if I recall correctly.
>>
>>>> systems that are adequately but not abundantly endowed with memory.
>>>> Such systems need non-preemptible RCU to avoid preempted-reader OOMs.
>>>
>>> Then why don't such systems have a problem with CONFIG_PREEMPT_DYNAMIC=y and
>>> preempt=none mode? CONFIG_PREEMPT_DYNAMIC forces CONFIG_PREEMPT_RCU=y. There's
>>> no way to set CONFIG_PREEMPT_RCU=n with CONFIG_PREEMPT_DYNAMIC=y and
>>> preempt=none boot parameter. IMHO, if this feature is inconsistent with
>>> CONFIG_PREEMPT_DYNAMIC, that makes it super confusing. In fact, I feel
>>> CONFIG_PREEMPT_AUTO should instead just be another "preempt=auto" boot parameter
>>> mode added to CONFIG_PREEMPT_DYNAMIC feature, otherwise the proliferation of
>>> CONFIG_PREEMPT config options is getting a bit insane. And likely going to be
>>> burden to the users configuring the PREEMPT Kconfig option IMHO.
>>
>> Because such systems are built with CONFIG_PREEMPT_DYNAMIC=n.
>>
>> You could argue that we should just build with CONFIG_PREEMPT_AUTO=n,
>> but the long-term goal of eliminating cond_resched() will make that
>> ineffective.
>
> I see what you mean. We/I could also highlight some of the differences in RCU
> between DYNAMIC vs AUTO.
>
>>
>>>> Note well that non-preemptible RCU explicitly disables preemption across
>>>> all RCU readers.
>>>
>>> Yes, I mentioned this 'disabling preemption' aspect in my last email. My point
>>> being, unlike CONFIG_PREEMPT_NONE, CONFIG_PREEMPT_AUTO allows for kernel
>>> preemption in preempt=none. So the "Don't preempt the kernel" behavior has
>>> changed. That is, preempt=none under CONFIG_PREEMPT_AUTO is different from
>>> CONFIG_PREEMPT_NONE=y already. Here we *are* preempting. And RCU is getting on
>>> the way. It is like saying, you want an option for CONFIG_PREEMPT_RCU to be set
>>> to =n for CONFIG_PREEMPT=y kernels, sighting users who want a fully-preemptible
>>> kernel but are worried about reader preemptions.
>>
>> Such users can simply avoid building with either CONFIG_PREEMPT_NONE=y
>> or with CONFIG_PREEMPT_VOLUNTARY=y. They might also experiment with
>> CONFIG_RCU_BOOST=y, and also with short timeouts until boosting.
>> If that doesn't do what they need, we talk with them and either help
>> them configure their kernels, make RCU do what they need, or help work
>> out some other way for them to get their jobs done.
>
> Makes sense.
>
>>> That aside, as such, I do agree your point of view, that preemptible readers
>>> presents a problem to folks using preempt=none in this series and we could
>>> decide to keep CONFIG_PREEMPT_RCU optional for whoever wants it that way. I was
>>> just saying that I want CONFIG_PREEMPT_AUTO's preempt=none mode to be somewhat
>>> consistent with CONFIG_PREEMPT_DYNAMIC's preempt=none. Because I'm pretty sure a
>>> week from now, no one will likely be able to tell the difference ;-). So IMHO
>>> either CONFIG_PREEMPT_DYNAMIC should be changed to make CONFIG_PREEMPT_RCU
>>> optional, or this series should be altered to force CONFIG_PREEMPT_RCU=y.
>>>
>>> Let me know if I missed something.
>>
>> Why not key off of the value of CONFIG_PREEMPT_DYNAMIC? That way,
>> if both CONFIG_PREEMPT_AUTO=y and CONFIG_PREEMPT_DYNAMIC=y, RCU is
>> always preemptible. Then CONFIG_PREEMPT_DYNAMIC=y enables boot-time
>> (and maybe even run-time) switching between preemption flavors, while
>> CONFIG_PREEMPT_AUTO instead enables unconditional preemption of any
>> region of code that has not explicitly disabled preemption (or irq or
>> bh or whatever).

Currently CONFIG_PREEMPT_DYNAMIC does a few things:

1. dynamic selection of preemption model
2. dynamically toggling explicit preemption points
3. PREEMPT_RCU=y (though maybe this should be fixed to also
   allow PREEMPT_RCU=n)

Of these 3, PREEMPT_AUTO only really needs (1).

Maybe combining gives us the option of switching between the old and the
new models:
preempt=none | voluntary | full | auto-none | auto-voluntary

Where the last two provide the new auto semantics. But, the mixture
seems too rich.
This just complicates all the CONFIG_PREEMPT_* configurations more than
they were before when the end goal is to actually reduce and simplify
the number of options.
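
To put (1) in concrete terms -- a rough sketch of the kind of boot-time model
selection that AUTO actually needs (illustrative only, made-up names, not the
code in this series):

/*
 * Illustrative sketch only -- not the code in this series. The names
 * (preempt_model_sel, PREEMPT_MODEL_*) are made up. The point is just
 * that (1) amounts to parsing a boot parameter and stashing the chosen
 * default model for the scheduler to consult.
 */
#include <linux/init.h>
#include <linux/printk.h>
#include <linux/string.h>

enum preempt_model_sel {
        PREEMPT_MODEL_NONE,
        PREEMPT_MODEL_VOLUNTARY,
        PREEMPT_MODEL_FULL,
};

/* Default chosen here arbitrarily; the Kconfig default would pick it. */
static enum preempt_model_sel preempt_model = PREEMPT_MODEL_VOLUNTARY;

static int __init setup_preempt_model(char *str)
{
        if (!strcmp(str, "none"))
                preempt_model = PREEMPT_MODEL_NONE;
        else if (!strcmp(str, "voluntary"))
                preempt_model = PREEMPT_MODEL_VOLUNTARY;
        else if (!strcmp(str, "full"))
                preempt_model = PREEMPT_MODEL_FULL;
        else
                pr_warn("preempt=: unknown model '%s'\n", str);
        return 1;       /* option consumed */
}
__setup("preempt=", setup_preempt_model);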

> That could be done. But currently, these patches disable DYNAMIC if AUTO is
> enabled in the config. I think the reason is the 2 features are incompatible.
> i.e. DYNAMIC wants to override the preemption mode at boot time, whereas AUTO
> wants the scheduler to have a say in it using the need-resched LAZY bit.

Yeah exactly. That's why I originally made PREEMPT_AUTO and
PREEMPT_DYNAMIC exclusive of each other.

Thanks

--
ankur

2024-03-11 21:05:10

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO


Linus Torvalds <[email protected]> writes:

> On Mon, 11 Mar 2024 at 13:10, Ankur Arora <[email protected]> wrote:
>>
>> Ah, I see your point. Basically, keep the lazy semantics but -- in
>> addition -- also provide the ability to dynamically toggle
>> cond_resched(), might_resched() as a feature to help move this along
>> further.
>
> Please, let's not make up any random hypotheticals.
>
> Honestly, if we ever hit the hypothetical scenario that Paul outlined, let's
>
> (a) deal with it THEN, when we actually know what the situation is
>
> (b) learn and document what it is that actually causes the odd behavior
>
> IOW, instead of assuming that some "cond_resched()" case would even be
> the right thing to do, maybe there are other issues going on? Let's
> not paper over them by keeping some hack around - and *if* some
> cond_resched() model is actually the right model in some individual
> place, let's make it the rule that *when* we hit that case, we
> document it.
>
> And we should absolutely not have some hypothetical case keep us from
> just doing the right thing and getting rid of the existing
> cond_resched().
>
> Because any potential future case is *not* going to be the same
> cond_resched() that the current case is anyway. It is going to have
> some very different cause.

Ack that. And thanks, that makes sense to me.

--
ankur

2024-03-11 22:12:28

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 15/30] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y

On Mon, Mar 11 2024 at 13:51, Ankur Arora wrote:
> Joel Fernandes <[email protected]> writes:
>>> Why not key off of the value of CONFIG_PREEMPT_DYNAMIC? That way,
>>> if both CONFIG_PREEMPT_AUTO=y and CONFIG_PREEMPT_DYNAMIC=y, RCU is
>>> always preemptible. Then CONFIG_PREEMPT_DYNAMIC=y enables boot-time
>>> (and maybe even run-time) switching between preemption flavors, while
>>> CONFIG_PREEMPT_AUTO instead enables unconditional preemption of any
>>> region of code that has not explicitly disabled preemption (or irq or
>>> bh or whatever).
>
> Currently CONFIG_PREEMPT_DYNAMIC does a few things:
>
> 1. dynamic selection of preemption model
> 2. dynamically toggling explicit preemption points
> 3. PREEMPT_RCU=y (though maybe this should be fixed to also
>    allow PREEMPT_RCU=n)
>
> Of these 3, PREEMPT_AUTO only really needs (1).
>
> Maybe combining gives us the option of switching between the old and the
> new models:
> preempt=none | voluntary | full | auto-none | auto-voluntary
>
> Where the last two provide the new auto semantics. But, the mixture
> seems too rich.
> This just complicates all the CONFIG_PREEMPT_* configurations more than
> they were before when the end goal is to actually reduce and simplify
> the number of options.
>
>> That could be done. But currently, these patches disable DYNAMIC if AUTO is
>> enabled in the config. I think the reason is the 2 features are incompatible.
> i.e. DYNAMIC wants to override the preemption mode at boot time, whereas AUTO
>> wants the scheduler to have a say in it using the need-resched LAZY bit.
>
> Yeah exactly. That's why I originally made PREEMPT_AUTO and
> PREEMPT_DYNAMIC exclusive of each other.

Rightfully so. The purpose of PREEMPT_AUTO is to get rid of
PREEMPT_DYNAMIC, not to proliferate the existence of it.

There is no point. All that AUTO wants to provide at configuration time
is the default model. So what would DYNAMIC buy that AUTO does not
provide trivially with a single sysfs knob, which only affects the
scheduler, where the decisions are made, and nothing else?

The only extra config knob is PREEMPT_RCU, which, as we concluded long ago,
needs to support both no and yes when AUTO is selected, up to the point
where that model can be switched at boot time too.

Seriously, keep this stuff simple and straightforward and keep the real
goals in focus:

1) Simplify the preemption model zoo

2) Get rid of the ill defined cond_resched()/might_sleep() hackery

All the extra - pardon my french - ivory tower wankery on top is not
helpful at all. We can debate this forever on a theoretical base and
never get anywhere and anything done.

Please focus on getting the base mechanics in place with the required
fixes for the fallout for preemptible and non-preemptible RCU (selected
at compile time) and work it out from there.

Perfect is the enemy of good. Especially when nobody can come up with a
perfect definition of what 'perfect' actually means.

Thanks,

tglx

2024-03-12 00:01:38

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 15/30] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y

On Mon, Mar 11, 2024 at 09:29:14PM +0100, Thomas Gleixner wrote:
> Paul!
>
> On Mon, Mar 11 2024 at 12:53, Paul E. McKenney wrote:
> > On Mon, Mar 11, 2024 at 08:12:58PM +0100, Thomas Gleixner wrote:
> >> I'd rather spend brain cycles on figuring out whether RCU can be flipped
> >> over between PREEMPT_RCU=n/y at boot or obviously run-time.
> >
> > Well, it is just software, so anything is possible. But there can
> > be a wide gap between "possible" and "sensible". ;-)
>
> Indeed.
>
> > In theory, one boot-time approach would be build preemptible RCU,
> > and then to boot-time binary-rewrite calls to __rcu_read_lock()
> > and __rcu_read_unlock() to preempt_disable() and preempt_enable(),
> > respectively. Because preemptible RCU has to treat preemption-disabled
> > regions of code as RCU readers, this Should Just Work. However, there
> > would then be a lot of needless branches in the grace-period code.
> > Only the ones on fastpaths (for example, context switch) would need
> > to be static-branchified, but there would likely need to be other
> > restructuring, given the need for current preemptible RCU to do a better
> > job of emulating non-preemptible RCU. (Emulating non-preemptible RCU
> > is of course currently a complete non-goal for preemptible RCU.)
>
> Sure, but that might be a path to have a more unified RCU model in the
> longer term with the tweak of patching the flavor once at boot.

I am pretty sure that it can be made to happen. The big question in my
mind is "Is it worth it?"

> > But this one needs careful design and review up front, as in step through
> > all the code and check assumptions and changes in behavior. After all,
> > this stuff is way easier to break than to debug and fix. ;-)
>
> Isn't that where all the real fun is? :)
>
> > On the other hand, making RCU switch at runtime is... Tricky.
>
> I was only half serious about that. Just thought to mention it for
> completeness' sake and of course to make you write a novel. :)

A short novel. ;-)

> > For example, if the system was in non-preemptible mode at rcu_read_lock()
> > time, the corresponding rcu_read_unlock() needs to be aware that it needs
> > to act as if the system was still in non-preemptible mode, and vice versa.
> > Grace period processing during the switch needs to be aware that different
> > CPUs will be switching at different times. Also, it will be common for a
> > given CPU's switch to span more than one grace period. So any approach
> > based on either binary rewrite or static branches will need to be set
> > up in a multi-phase multi-grace-period state machine. Sort of like
> > Frederic's runtime-switched callback offloading, but rather more complex,
> > and way more performance sensitive.
>
> Of course it would be a complex endeavour at the scale of run-time
> switching to or from PREEMPT_RT, which will never happen. Even the boot
> time switch for RT would be way harder than the RCU one :)

Run-time switching of either RCU or PREEMPT_RT would likely provide many
opportunities along the way for black hats. ;-)

> > But do we even need to switch RCU at runtime, other than to say that
> > we did it? What is the use case? Or is this just a case of "it would
> > be cool!"? Don't get me wrong, I am a sucker for "it would be cool",
> > as you well know, but even for me there are limits. ;-)
>
> There is no need for runtime switching other than "it would be cool" and
> I'm happy that even you have limits. :)

So *that* was the point of this whole email exchange. ;-)

> > At the moment, I would prioritize improving quiescent-state forcing for
> > existing RCU over this, especially perhaps given the concerns from the
> > MM folks.
> >
> > But what is motivating the desire to boot-time/run-time switch RCU
> > between preemptible and non-preemptible?
>
> The goal of PREEMPT_AUTO is to get rid of the zoo of preemption models
> and their associated horrors. As we debated some time ago we still need
> to have the distinction of preemptible and non-preemptible RCU even with
> that.
>
> So it's a pretty obvious consequence to make this decision at boot time
> to reduce the number of kernel flavors for distros to build.

Very well, we can look to the distros for requirements, when and if.

> Nothing urgent, but it just came to my mind when reading through this
> thread and replying to Joel.

Got it, thank you!

Thanx, Paul

2024-03-12 00:04:05

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO

On Mon, Mar 11, 2024 at 01:23:09PM -0700, Linus Torvalds wrote:
> On Mon, 11 Mar 2024 at 13:10, Ankur Arora <[email protected]> wrote:
> >
> > Ah, I see your point. Basically, keep the lazy semantics but -- in
> > addition -- also provide the ability to dynamically toggle
> > cond_resched(), might_resched() as a feature to help move this along
> > further.
>
> Please, let's not make up any random hypotheticals.
>
> Honestly, if we ever hit the hypothetical scenario that Paul outlined, let's
>
> (a) deal with it THEN, when we actually know what the situation is
>
> (b) learn and document what it is that actually causes the odd behavior
>
> IOW, instead of assuming that some "cond_resched()" case would even be
> the right thing to do, maybe there are other issues going on? Let's
> not paper over them by keeping some hack around - and *if* some
> cond_resched() model is actually the right model in some individual
> place, let's make it the rule that *when* we hit that case, we
> document it.
>
> And we should absolutely not have some hypothetical case keep us from
> just doing the right thing and getting rid of the existing
> cond_resched().
>
> Because any potential future case is *not* going to be the same
> cond_resched() that the current case is anyway. It is going to have
> some very different cause.

Fair enough, and that approach just means that we will be reaching out
to Ankur and Thomas sooner rather than later if something goes sideways
latency-wise. ;-)

Thanx, Paul

2024-03-12 00:09:03

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 15/30] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y

Hi, Thomas,
Thanks for your reply! I replied below.

On 3/11/2024 3:12 PM, Thomas Gleixner wrote:
> On Mon, Mar 11 2024 at 11:25, Joel Fernandes wrote:
>> On 3/11/2024 1:18 AM, Ankur Arora wrote:
>>>> Yes, I mentioned this 'disabling preemption' aspect in my last email. My point
>>>> being, unlike CONFIG_PREEMPT_NONE, CONFIG_PREEMPT_AUTO allows for kernel
>>>> preemption in preempt=none. So the "Don't preempt the kernel" behavior has
>>>> changed. That is, preempt=none under CONFIG_PREEMPT_AUTO is different from
>>>> CONFIG_PREEMPT_NONE=y already. Here we *are* preempting. And RCU is getting in
>>>
>>> I think that's a view from too close to the implementation. Someone
>>> using the kernel is not necessarily concerned with whether tasks are
>>> preempted or not. They are concerned with throughput and latency.
>>
>> No, we are not only talking about that (throughput/latency). We are also talking
>> about the issue related to RCU reader-preemption causing OOM (well and that
>> could hurt both throughput and latency as well).
>
> That happens only when PREEMPT_RCU=y. For PREEMPT_RCU=n the read side
> critical sections still have preemption disabled.

Sorry, let me clarify. And please forgive my noise but it is just a point of
view. CONFIG_PREEMPT_AUTO always preempts sooner or later, even for
preempt=none. A point of view could be, if you are preempting anyway (like
CONFIG_PREEMPT=y), then why bother with disabling CONFIG_PREEMPT_RCU or even
give it as an option. After all, with CONFIG_PREEMPT=y, you cannot do
CONFIG_PREEMPT_RCU=n. It is just a point of view, while we are still discussing
this patch series ahead of its potential merge.

>> With CONFIG_PREEMPT_AUTO=y, you now preempt in the preempt=none mode. Something
>> very different from the classical CONFIG_PREEMPT_NONE=y.
>
> In PREEMPT_RCU=y and preempt=none mode this happens only when really
> required, i.e. when the task does not schedule out or returns to user
> space on time, or when a higher scheduling class task gets runnable. For
> the latter the jury is still out whether this should be done or just
> lazily deferred like the SCHED_OTHER preemption requests.
>
> In any case for that to matter this forced preemption would need to
> preempt a RCU read side critical section and then keep the preempted
> task away from the CPU for a long time.
>
> That's very different from the unconditional kernel preemption model which
> preempt=full provides and only marginally different from the existing
> PREEMPT_NONE model. I know there might be dragons, but I'm not convinced
> yet that this is an actual problem.

Sure, it is less aggressive than full preemption, but it is still preemption
nonetheless, so it's quirky with regard to whether or not RCU preemption is
provided as an option (as I mentioned as a point of view above).

> OTOH, doesn't PREEMPT_RCU=y have mechanism to mitigate that already?

It does. But that sounds more in favor of forcing PREEMPT_RCU=y for AUTO, since
such mitigation would address the concerns that AUTO users would otherwise have
with PREEMPT_RCU=y (?).

>> Essentially this means preemption is now more aggressive from the point of view
>> of a preempt=none user. I was suggesting that, a point of view could be RCU
>> should always support preemptibility (don't give the PREEMPT_RCU=n option) because
>> AUTO *does preempt* unlike classic CONFIG_PREEMPT_NONE. Otherwise it is
>> inconsistent -- say with CONFIG_PREEMPT=y (another *preemption mode*) which
>> forces CONFIG_PREEMPT_RCU. However to Paul's point, we need to worry about those
>> users who are concerned with running out of memory due to reader
>> preemption.
>
> What's wrong with the combination of PREEMPT_AUTO=y and PREEMPT_RCU=n?
> Paul and me agreed long ago that this needs to be supported.

There's nothing wrong with it. It's just a bit quirky (again, just a point of
view) that for a configuration that causes preemption (similar to
CONFIG_PREEMPT=y), PREEMPT_RCU can be disabled. After all, again with
CONFIG_PREEMPT=y, PREEMPT_RCU cannot currently be disabled.

>> In that vein, maybe we should also support CONFIG_PREEMPT_RCU=n for
>> CONFIG_PREEMPT=y as well. There are plenty of popular systems with relatively
>> low memory that need low latency (like some low-end devices / laptops
>> :-)).
>
> I'm not sure whether that's useful as the goal is to get rid of all the
> CONFIG_PREEMPT_FOO options, no?

I think I may have lost you here: how does forcing or not forcing
CONFIG_PREEMPT_RCU relate to getting rid of CONFIG options? There are no new
CONFIG options added one way or the other.

> I'd rather spend brain cycles on figuring out whether RCU can be flipped
> over between PREEMPT_RCU=n/y at boot or obviously run-time.

Yes, I agree with that actually, and I see Paul provided some detailed thoughts
in a reply to you, in your quest to get him to write a novel as you put it ;-). I
am OK with providing preemptible RCU as an option with AUTO, and it could be
documented well that this is possible. I am also fully on board with the
sentiment of getting rid of the zoo of CONFIG_PREEMPT options!

thanks,

- Joel

2024-03-12 03:17:15

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 15/30] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y


Joel Fernandes <[email protected]> writes:

> Hi, Thomas,
> Thanks for your reply! I replied below.
>
> On 3/11/2024 3:12 PM, Thomas Gleixner wrote:
>> On Mon, Mar 11 2024 at 11:25, Joel Fernandes wrote:

[ ... ]

>> What's wrong with the combination of PREEMPT_AUTO=y and PREEMPT_RCU=n?
>> Paul and me agreed long ago that this needs to be supported.
>
> There's nothing wrong with it. It's just a bit quirky (again, just a point of
> view) that for a configuration that causes preemption (similar to
> CONFIG_PREEMPT=y), PREEMPT_RCU can be disabled. After all, again with
> CONFIG_PREEMPT=y, PREEMPT_RCU cannot currently be disabled.

I think the argument was that PREEMPT_RCU=y is suboptimal for certain
workloads, and those configurations might prefer the stronger
forward-progress guarantees that PREEMPT_RCU=n provides.

See this:
https://lore.kernel.org/lkml/73ecce1c-d321-4579-b892-13b1e0a0620a@paulmck-laptop/T/#m6aab5a6fd5f1fd4c3dc9282ce564e64f2fa6cdc3

and the surrounding thread.

Thanks

--
ankur

2024-03-12 03:24:36

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 15/30] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y



On 3/11/2024 11:16 PM, Ankur Arora wrote:
>
> Joel Fernandes <[email protected]> writes:
>
>> Hi, Thomas,
>> Thanks for your reply! I replied below.
>>
>> On 3/11/2024 3:12 PM, Thomas Gleixner wrote:
>>> On Mon, Mar 11 2024 at 11:25, Joel Fernandes wrote:
>
> [ ... ]
>
>>> What's wrong with the combination of PREEMPT_AUTO=y and PREEMPT_RCU=n?
>>> Paul and me agreed long ago that this needs to be supported.
>>
>> There's nothing wrong with it. It's just a bit quirky (again, just a point of
>> view) that for a configuration that causes preemption (similar to
>> CONFIG_PREEMPT=y), PREEMPT_RCU can be disabled. After all, again with
>> CONFIG_PREEMPT=y, PREEMPT_RCU cannot currently be disabled.
>
> I think the argument was that PREEMPT_RCU=y is suboptimal for certain
> workloads, and those configurations might prefer the stronger
> forward-progress guarantees that PREEMPT_RCU=n provides.
>
> See this:
> https://lore.kernel.org/lkml/73ecce1c-d321-4579-b892-13b1e0a0620a@paulmck-laptop/T/#m6aab5a6fd5f1fd4c3dc9282ce564e64f2fa6cdc3
>
> and the surrounding thread.

Thanks for the link. Sorry for any noise due to being late to the party. Based
on the discussions, I concur with everyone on the goal of getting rid of
CONFIG_PREEMPT_DYNAMIC and the various cond_resched()/might_sleep() things. I'll
also go look harder at what else we need to get CONFIG_PREEMPT_RCU=y/n working
with CONFIG_PREEMPT_AUTO=y.

thanks,

- Joel

2024-03-12 05:25:27

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 15/30] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y


Joel Fernandes <[email protected]> writes:

> On 3/11/2024 11:16 PM, Ankur Arora wrote:
>>
>> Joel Fernandes <[email protected]> writes:
>>
>>> Hi, Thomas,
>>> Thanks for your reply! I replied below.
>>>
>>> On 3/11/2024 3:12 PM, Thomas Gleixner wrote:
>>>> On Mon, Mar 11 2024 at 11:25, Joel Fernandes wrote:
>>
>> [ ... ]
>>
>>>> What's wrong with the combination of PREEMPT_AUTO=y and PREEMPT_RCU=n?
>>>> Paul and me agreed long ago that this needs to be supported.
>>>
>>> There's nothing wrong with it. It's just a bit quirky (again, just a point of
>>> view) that for a configuration that causes preemption (similar to
>>> CONFIG_PREEMPT=y), PREEMPT_RCU can be disabled. After all, again with
>>> CONFIG_PREEMPT=y, PREEMPT_RCU cannot currently be disabled.
>>
>> I think the argument was that PREEMPT_RCU=y is suboptimal for certain
>> workloads, and those configurations might prefer the stronger
>> forward-progress guarantees that PREEMPT_RCU=n provides.
>>
>> See this:
>> https://lore.kernel.org/lkml/73ecce1c-d321-4579-b892-13b1e0a0620a@paulmck-laptop/T/#m6aab5a6fd5f1fd4c3dc9282ce564e64f2fa6cdc3
>>
>> and the surrounding thread.
>
> Thanks for the link. Sorry for any noise due to being late to the party. Based
> on the discussions, I concur with everyone on the goal of getting rid of

No worries. Given the unending context, easy enough to miss.

> CONFIG_PREEMPT_DYNAMIC and the various cond_resched()/might_sleep() things. I'll
> also go look harder at what else we need to get CONFIG_PREEMPT_RCU=y/n working
> with CONFIG_PREEMPT_AUTO=y.

Sounds great. Thanks.

And, please keep the review comments coming.

--
ankur

2024-03-12 12:14:47

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO

On Mon, Mar 11 2024 at 17:03, Paul E. McKenney wrote:
> On Mon, Mar 11, 2024 at 01:23:09PM -0700, Linus Torvalds wrote:
>> Because any potential future case is *not* going to be the same
>> cond_resched() that the current case is anyway. It is going to have
>> some very different cause.
>
> Fair enough, and that approach just means that we will be reaching out
> to Ankur and Thomas sooner rather than later if something goes sideways
> latency-wise. ;-)

You can't make an omelette without breaking eggs.

2024-03-12 19:40:12

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 26/30] sched: handle preempt=voluntary under PREEMPT_AUTO

On Tue, Mar 12, 2024 at 01:14:37PM +0100, Thomas Gleixner wrote:
> On Mon, Mar 11 2024 at 17:03, Paul E. McKenney wrote:
> > On Mon, Mar 11, 2024 at 01:23:09PM -0700, Linus Torvalds wrote:
> >> Because any potential future case is *not* going to be the same
> >> cond_resched() that the current case is anyway. It is going to have
> >> some very different cause.
> >
> > Fair enough, and that approach just means that we will be reaching out
> > to Ankur and Thomas sooner rather than later if something goes sideways
> > latency-wise. ;-)
>
> You can't make an omelette without breaking eggs.

Precisely! ;-)

Thanx, Paul

2024-03-19 11:45:49

by Mark Rutland

[permalink] [raw]
Subject: Tasks RCU, ftrace, and trampolines (was: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling)

Hi Paul,

On Fri, Mar 01, 2024 at 05:16:33PM -0800, Paul E. McKenney wrote:
> The networking NAPI code ends up needing special help to avoid starving
> Tasks RCU grace periods [1]. I am therefore revisiting trying to make
> Tasks RCU directly detect trampoline usage, but without quite as much
> need to identify specific trampolines...
>
> I am putting this information in a Google document for future
> reference [2].
>
> Thoughts?

Sorry for the long delay! I've been looking into this general area over the
last couple of weeks due to the latent bugs I mentioned in:

https://lore.kernel.org/lkml/Zenx_Q0UiwMbSAdP@FVFF77S0Q05N/

I was somewhat hoping that staring at the code for long enough would result in
an epiphany (and a nice simple-to-backport solution for the latent issues), but
so far that has eluded me.

I believe some of those cases will need to use synchronize_rcu_tasks() and we
might be able to make some structural changes to minimize the number of times
we'd need to synchronize (e.g. having static ftrace call ops->func from the ops
pointer, so we can switch ops+func atomically), but those look pretty invasive
so far.
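
To sketch the ops+func idea (hypothetical names, not a patch): if the static
trampoline reads the ops pointer once and calls through it, a caller can never
see a mismatched ops/func pair; freeing the old ops would still need
synchronize_rcu_tasks().

/*
 * Rough sketch, hypothetical names. 'current_ops' stands in for however the
 * live ftrace_ops would be published, and the caller's own signature is made
 * up. The point is that the caller reads the ops pointer once and invokes
 * ops->func through that same pointer, so ops and func switch together.
 */
#include <linux/ftrace.h>

static struct ftrace_ops *current_ops;          /* hypothetical publication point */

static void ftrace_static_caller(unsigned long ip, unsigned long parent_ip,
                                 struct ftrace_regs *fregs)
{
        /* One load of the ops pointer; func comes from that same ops. */
        struct ftrace_ops *ops = READ_ONCE(current_ops);

        ops->func(ip, parent_ip, ops, fregs);
}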

I haven't been able to come up with "a precise and completely reliable way to
determine whether the current preemption occurred within a trampoline". Since
preemption might occur within a trampoline's callee that eventually returns
back to the trampoline, I believe that'll either depend on having a reliable
stacktrace or requiring the trampoline to dynamically register/unregister
somewhere around calling other functions. That, and we do also care about those
callees themselves, and it's not just about the trampolines...

On arm64, we kinda have "permanent trampolines", as our
DYNAMIC_FTRACE_WILL_CALL_OPS implementation uses a common trampoline. However,
that will tail-call direct functions (and those could also be directly called
from ftrace callsites), so we don't have a good way of handling those without a
change to the direct func calling convention.

I assume that permanent trampolines wouldn't be an option on architectures
where trampolines are a spectre mitigation.

Mark.

> Thanx, Paul
>
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://docs.google.com/document/d/1kZY6AX-AHRIyYQsvUX6WJxS1LsDK4JA2CHuBnpkrR_U/edit?usp=sharing

2024-03-19 23:33:29

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Tasks RCU, ftrace, and trampolines (was: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling)

On Tue, Mar 19, 2024 at 11:45:15AM +0000, Mark Rutland wrote:
> Hi Paul,
>
> On Fri, Mar 01, 2024 at 05:16:33PM -0800, Paul E. McKenney wrote:
> > The networking NAPI code ends up needing special help to avoid starving
> > Tasks RCU grace periods [1]. I am therefore revisiting trying to make
> > Tasks RCU directly detect trampoline usage, but without quite as much
> > need to identify specific trampolines...
> >
> > I am putting this information in a Google document for future
> > reference [2].
> >
> > Thoughts?
>
> Sorry for the long delay! I've been looking into this general area over the
> last couple of weeks due to the latent bugs I mentioned in:
>
> https://lore.kernel.org/lkml/Zenx_Q0UiwMbSAdP@FVFF77S0Q05N/
>
> I was somewhat hoping that staring at the code for long enough would result in
> an epiphany (and a nice simple-to-backport solution for the latent issues), but
> so far that has eluded me.
>
> I believe some of those cases will need to use synchronize_rcu_tasks() and we
> might be able to make some structural changes to minimize the number of times
> we'd need to synchronize (e.g. having static ftrace call ops->func from the ops
> pointer, so we can switch ops+func atomically), but those look pretty invasive
> so far.
>
> I haven't been able to come up with "a precise and completely reliable way to
> determine whether the current preemption occurred within a trampoline". Since
> preemption might occur within a trampoline's callee that eventually returns
> back to the trampoline, I believe that'll either depend on having a reliable
> stacktrace or requiring the trampoline to dynamically register/unregister
> somewhere around calling other functions. That, and we do also care about those
> callees themselves, and it's not just about the trampolines...
>
> On arm64, we kinda have "permanent trampolines", as our
> DYNAMIC_FTRACE_WILL_CALL_OPS implementation uses a common trampoline. However,
> that will tail-call direct functions (and those could also be directly called
> from ftrace callsites), so we don't have a good way of handling those without a
> change to the direct func calling convention.
>
> I assume that permanent trampolines wouldn't be an option on architectures
> where trampolines are a spectre mitigation.

Thank you for checking! I placed a pointer to this email in the document
and updated the relevant sections accordingly.

> Mark.
>
> > Thanx, Paul
> >
> > [1] https://lore.kernel.org/all/[email protected]/
> > [2] https://docs.google.com/document/d/1kZY6AX-AHRIyYQsvUX6WJxS1LsDK4JA2CHuBnpkrR_U/edit?usp=sharing

2024-04-23 15:28:05

by Shrikanth Hegde

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling



On 2/13/24 11:25 AM, Ankur Arora wrote:
> Hi,
>
> This series adds a new scheduling model PREEMPT_AUTO, which like
> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> on explicit preemption points for the voluntary models.
>
> The series is based on Thomas' original proposal which he outlined
> in [1], [2] and in his PoC [3].
>
> An earlier RFC version is at [4].
>

Hi Ankur/Thomas.

Thank you for this series and the previous ones.
These are very interesting patch series, and the discussions are even more
interesting. I have been trying to go through them to understand the different bits.


Tried this patch on PowerPC by defining LAZY similar to x86. The change is below.
Kept it at PREEMPT=none for PREEMPT_AUTO.

Running into soft lockup on large systems (40Cores, SMT8) and seeing close to 100%
regression on small system ( 12 Cores, SMT8). More details are after the patch.

Are these the only arch bits that need to be defined? am I missing something very
basic here? will try to debug this further. Any inputs?

---
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/thread_info.h | 4 +++-
2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1c4be3373686..11e7008f5dd3 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -268,6 +268,7 @@ config PPC
select HAVE_PERF_EVENTS_NMI if PPC64
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
+ select HAVE_PREEMPT_AUTO
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE
select HAVE_RSEQ
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 15c5691dd218..c28780443b3b 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -117,11 +117,13 @@ void arch_setup_new_exec(void);
#endif
#define TIF_POLLING_NRFLAG 19 /* true if poll_idle() is polling TIF_NEED_RESCHED */
#define TIF_32BIT 20 /* 32 bit binary */
+#define TIF_NEED_RESCHED_LAZY 21 /* Lazy rescheduling */

/* as above, but as bit values */
#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
#define _TIF_SIGPENDING (1<<TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1<<TIF_NEED_RESCHED)
+#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
#define _TIF_NOTIFY_SIGNAL (1<<TIF_NOTIFY_SIGNAL)
#define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG)
#define _TIF_32BIT (1<<TIF_32BIT)
@@ -144,7 +146,7 @@ void arch_setup_new_exec(void);
#define _TIF_USER_WORK_MASK (_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
_TIF_NOTIFY_RESUME | _TIF_UPROBE | \
_TIF_RESTORE_TM | _TIF_PATCH_PENDING | \
- _TIF_NOTIFY_SIGNAL)
+ _TIF_NOTIFY_SIGNAL | _TIF_NEED_RESCHED_LAZY)
#define _TIF_PERSYSCALL_MASK (_TIF_RESTOREALL|_TIF_NOERROR)

/* Bits in local_flags */

---------------- Smaller system ---------------------------------

NUMA:
NUMA node(s): 5
NUMA node2 CPU(s): 0-7
NUMA node3 CPU(s): 8-31
NUMA node5 CPU(s): 32-39
NUMA node6 CPU(s): 40-47
NUMA node7 CPU(s): 48-95

Hackbench 6.9 +preempt_auto (=none)
(10 iterations, 10000 loops)

Process 10 groups : 3.00, 3.07( -2.33)
Process 20 groups : 5.47, 5.81( -6.22)
Process 30 groups : 7.78, 8.52( -9.51)
Process 40 groups : 10.16, 11.28( -11.02)
Process 50 groups : 12.37, 13.90( -12.37)
Process 60 groups : 14.58, 16.68( -14.40)
Thread 10 groups : 3.24, 3.28( -1.23)
Thread 20 groups : 5.93, 6.16( -3.88)
Process(Pipe) 10 groups : 1.94, 2.96( -52.58)
Process(Pipe) 20 groups : 2.91, 5.44( -86.94)
Process(Pipe) 30 groups : 4.23, 7.83( -85.11)
Process(Pipe) 40 groups : 5.35, 10.61( -98.32)
Process(Pipe) 50 groups : 6.64, 13.18( -98.49)
Process(Pipe) 60 groups : 7.88, 16.69(-111.80)
Thread(Pipe) 10 groups : 1.92, 3.02( -57.29)
Thread(Pipe) 20 groups : 3.25, 5.36( -64.92)

------------------- Large systems -------------------------

NUMA:
NUMA node(s): 4
NUMA node2 CPU(s): 0-31
NUMA node3 CPU(s): 32-127
NUMA node6 CPU(s): 128-223
NUMA node7 CPU(s): 224-319


watchdog: BUG: soft lockup - CPU#278 stuck for 26s! [hackbench:7137]
Modules linked in: bonding tls rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink pseries_rng vmx_crypto drm drm_panel_orientation_quirks xfs libcrc32c sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 278 PID: 7137 Comm: hackbench Kdump: loaded Tainted: G L 6.9.0-rc1+ #42
Hardware name: IBM,9043-MRX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1050.00 (NM1050_052) hv:phyp pSeries
NIP: c000000000037fbc LR: c000000000038324 CTR: c0000000001a8548
REGS: c0000003de72fbb8 TRAP: 0900 Tainted: G L (6.9.0-rc1+)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 28002222 XER: 20040000
CFAR: 0000000000000000 IRQMASK: 0
GPR00: c000000000038324 c0000003de72fb90 c000000001973e00 c0000003de72fb88
GPR04: 0000000000240080 0000000000000007 0010000000000000 c000000002220090
GPR08: 4000000000000002 0000000000000049 c0000003f1dcff00 0000000000002000
GPR12: c0000000001a8548 c000001fff72d080 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000002002000
GPR24: 0000000000000001 0000000000000000 0000000002802000 0000000000000002
GPR28: 0000000000000003 fcffffffffffffff fcffffffffffffff c0000003f1dcff00
NIP [c000000000037fbc] __replay_soft_interrupts+0x3c/0x154
LR [c000000000038324] arch_local_irq_restore.part.0+0x1cc/0x214
Call Trace:
[c0000003de72fb90] [c000000000038020] __replay_soft_interrupts+0xa0/0x154 (unreliable)
[c0000003de72fd40] [c000000000038324] arch_local_irq_restore.part.0+0x1cc/0x214
[c0000003de72fd90] [c000000000030268] interrupt_exit_user_prepare_main+0x19c/0x274
[c0000003de72fe00] [c0000000000304e0] syscall_exit_prepare+0x1a0/0x1c8
[c0000003de72fe50] [c00000000000cee8] system_call_vectored_common+0x168/0x2ec


+mpe, nick

2024-04-23 16:14:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

On Tue, 23 Apr 2024 at 08:23, Shrikanth Hegde <[email protected]> wrote:
>
> Tried this patch on PowerPC by defining LAZY similar to x86. The change is below.
> Kept it at PREEMPT=none for PREEMPT_AUTO.
>
> Running into soft lockup on large systems (40Cores, SMT8) and seeing close to 100%
> regression on small system ( 12 Cores, SMT8). More details are after the patch.
>
> Are these the only arch bits that need to be defined? am I missing something very
> basic here? will try to debug this further. Any inputs?

I don't think powerpc uses the generic *_exit_to_user_mode() helper
functions, so you'll need to also add that logic to the low-level
powerpc code.

IOW, on x86, with this patch series, patch 06/30 did this:

- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();

in kernel/entry/common.c exit_to_user_mode_loop().

But that works on x86 because it uses the irqentry_exit_to_user_mode().

On PowerPC, I think you need to at least fix up

interrupt_exit_user_prepare_main()

similarly (and any other paths like that - I used to know the powerpc
code, but that was long long LOOONG ago).

Linus

2024-04-26 07:50:13

by Shrikanth Hegde

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling



On 4/23/24 9:43 PM, Linus Torvalds wrote:
> On Tue, 23 Apr 2024 at 08:23, Shrikanth Hegde <[email protected]> wrote:
>>
>>
>> Are these the only arch bits that need to be defined? am I missing something very
>> basic here? will try to debug this further. Any inputs?
>
> I don't think powerpc uses the generic *_exit_to_user_mode() helper
> functions, so you'll need to also add that logic to the low-level
> powerpc code.
>
> IOW, on x86, with this patch series, patch 06/30 did this:
>
> - if (ti_work & _TIF_NEED_RESCHED)
> + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> schedule();
>
> in kernel/entry/common.c exit_to_user_mode_loop().
>
> But that works on x86 because it uses the irqentry_exit_to_user_mode().
>
> On PowerPC, I think you need to at least fix up
>
> interrupt_exit_user_prepare_main()
>
> similarly (and any other paths like that - I used to know the powerpc
> code, but that was long long LOOONG ago).
>
> Linus

Thank you Linus for the pointers. That indeed did the trick.

diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
index eca293794a1e..f0f38bf5cea9 100644
--- a/arch/powerpc/kernel/interrupt.c
+++ b/arch/powerpc/kernel/interrupt.c
@@ -185,7 +185,7 @@ interrupt_exit_user_prepare_main(unsigned long ret, struct pt_regs *regs)
ti_flags = read_thread_flags();
while (unlikely(ti_flags & (_TIF_USER_WORK_MASK & ~_TIF_RESTORE_TM))) {
local_irq_enable();
- if (ti_flags & _TIF_NEED_RESCHED) {
+ if (ti_flags & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY) ) {
schedule();
} else {


By adding LAZY checks in interrupt_exit_user_prepare_main, softlockup is no longer seen and
hackbench results are more or less the same on the smaller system (96 CPUs). However, I still see 20-50%
regression on the larger system (320 CPUs). I will continue to debug why.


2024-04-26 19:02:15

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling


Shrikanth Hegde <[email protected]> writes:

> On 4/23/24 9:43 PM, Linus Torvalds wrote:
>> On Tue, 23 Apr 2024 at 08:23, Shrikanth Hegde <[email protected]> wrote:
>>>
>>>
>>> Are these the only arch bits that need to be defined? am I missing something very
>>> basic here? will try to debug this further. Any inputs?
>>
>> I don't think powerpc uses the generic *_exit_to_user_mode() helper
>> functions, so you'll need to also add that logic to the low-level
>> powerpc code.
>>
>> IOW, on x86, with this patch series, patch 06/30 did this:
>>
>> - if (ti_work & _TIF_NEED_RESCHED)
>> + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
>> schedule();
>>
>> in kernel/entry/common.c exit_to_user_mode_loop().
>>
>> But that works on x86 because it uses the irqentry_exit_to_user_mode().
>>
>> On PowerPC, I think you need to at least fix up
>>
>> interrupt_exit_user_prepare_main()
>>
>> similarly (and any other paths like that - I used to know the powerpc
>> code, but that was long long LOOONG ago).
>>
>> Linus
>
> Thank you Linus for the pointers. That indeed did the trick.
>
> diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
> index eca293794a1e..f0f38bf5cea9 100644
> --- a/arch/powerpc/kernel/interrupt.c
> +++ b/arch/powerpc/kernel/interrupt.c
> @@ -185,7 +185,7 @@ interrupt_exit_user_prepare_main(unsigned long ret, struct pt_regs *regs)
> ti_flags = read_thread_flags();
> while (unlikely(ti_flags & (_TIF_USER_WORK_MASK & ~_TIF_RESTORE_TM))) {
> local_irq_enable();
> - if (ti_flags & _TIF_NEED_RESCHED) {
> + if (ti_flags & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY) ) {
> schedule();
> } else {
>
>
> By adding LAZY checks in interrupt_exit_user_prepare_main, softlockup is no longer seen and
> hackbench results are more or less the same on the smaller system (96 CPUs).

Great. I'm guessing these tests were run in the voluntary preemption
mode (under PREEMPT_AUTO).

If you haven't, could you also try full preemption? There you should see
identical results unless something is horribly wrong.

> However, I still see 20-50%
> regression on the larger system (320 CPUs). I will continue to debug why.

Could you try this patch? This is needed because PREEMPT_AUTO turns on
CONFIG_PREEMPTION, but not CONFIG_PREEMPT:

diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
index eca293794a1e..599410050f6b 100644
--- a/arch/powerpc/kernel/interrupt.c
+++ b/arch/powerpc/kernel/interrupt.c
@@ -396,7 +396,7 @@ notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs)
/* Returning to a kernel context with local irqs enabled. */
WARN_ON_ONCE(!(regs->msr & MSR_EE));
again:
- if (IS_ENABLED(CONFIG_PREEMPT)) {
+ if (IS_ENABLED(CONFIG_PREEMPTION)) {
/* Return to preemptible kernel context */
if (unlikely(read_thread_flags() & _TIF_NEED_RESCHED)) {
if (preempt_count() == 0)


--
ankur