2024-05-28 00:36:14

by Ankur Arora

Subject: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling

Hi,

This series adds a new scheduling model PREEMPT_AUTO, which, like
PREEMPT_DYNAMIC, allows dynamic switching between the none/voluntary/full
preemption models. Unlike PREEMPT_DYNAMIC, it doesn't depend
on explicit preemption points for the voluntary models.

The series is based on Thomas' original proposal which he outlined
in [1], [2] and in his PoC [3].

v2 is mostly a rework of v1, one of the main changes being less
noisy need-resched-lazy related interfaces.
More details in the changelog below.

The v1 of the series is at [4] and the RFC at [5].

Design
==

PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
PREEMPT_COUNT). This means that the scheduler can always safely
preempt. (This is identical to CONFIG_PREEMPT.)

Having that, the next step is to make the rescheduling policy dependent
on the chosen scheduling model. Currently, the scheduler uses a single
need-resched bit (TIF_NEED_RESCHED) to signal that a reschedule is
needed.
PREEMPT_AUTO extends this by adding a second need-resched bit
(TIF_NEED_RESCHED_LAZY) which, together with TIF_NEED_RESCHED, allows
the scheduler to express two kinds of rescheduling intent: schedule at
the earliest opportunity (TIF_NEED_RESCHED), or express a need for
rescheduling while allowing the task on the runqueue to run to
timeslice completion (TIF_NEED_RESCHED_LAZY).

The scheduler decides which need-resched bit to set based on the
preemption model in use:

                 TIF_NEED_RESCHED       TIF_NEED_RESCHED_LAZY

none             never                  always [*]
voluntary        higher sched class     other tasks [*]
full             always                 never

[*] some details elided.
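
As a rough sketch of the table above (illustrative only, not the
series' implementation: resched_policy() is a made-up helper name and
the conditions are simplified; resched_t, RESCHED_NOW, RESCHED_LAZY and
the preempt_model_*()/sched_class_above() helpers are real):

    /*
     * Sketch: choose which need-resched flavour to set for a task p
     * that should preempt rq->curr.
     */
    static resched_t resched_policy(struct rq *rq, struct task_struct *p)
    {
            /* preempt=full: always reschedule eagerly. */
            if (preempt_model_preemptible())
                    return RESCHED_NOW;

            /* The idle task is always preempted eagerly. */
            if (is_idle_task(rq->curr))
                    return RESCHED_NOW;

            /* preempt=voluntary: only a higher scheduling class is eager. */
            if (preempt_model_voluntary() &&
                sched_class_above(p->sched_class, rq->curr->sched_class))
                    return RESCHED_NOW;

            /* preempt=none (and everything else): run out the timeslice. */
            return RESCHED_LAZY;
    }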

The last part of the puzzle is when preemption happens or, stated
differently, when the need-resched bits are checked:

                       exit-to-user     ret-to-kernel     preempt_count()

NEED_RESCHED_LAZY           Y                 N                  N
NEED_RESCHED                Y                 Y                  Y

Using NEED_RESCHED_LAZY allows for run-to-completion semantics when the
none/voluntary preemption policies are in effect, and eager semantics
under full preemption.
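
In code, the table above boils down to checks like these (lifted, in
simplified form, from the user-exit and irqentry-exit patches below;
the rest of the work loops is elided):

    /* exit-to-user: either bit gets us to schedule() */
    if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
            schedule();

    /* ret-to-kernel (and preempt_enable()): only the eager bit matters */
    if (__tif_need_resched(RESCHED_NOW))
            preempt_schedule_irq();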

In addition, since this is driven purely by the scheduler (not
depending on cond_resched() placement and the like), there is enough
flexibility in the scheduler to cope with edge cases -- e.g. a kernel
task not relinquishing the CPU under NEED_RESCHED_LAZY can be handled
by simply upgrading to a full NEED_RESCHED, which can use more coercive
instruments like a resched IPI to induce a context-switch.
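
A sketch of that escalation (illustrative only: the helper name is made
up, and the series actually drives this from the tick via
update_curr()/entity_tick(); resched_curr() here is the lazy-aware
version from the resched_curr() patch below):

    /* Caller holds rq->lock, e.g. from the tick. */
    static void escalate_lazy_resched(struct rq *rq)
    {
            struct task_struct *curr = rq->curr;

            /* Already marked for eager rescheduling: nothing to do. */
            if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED))
                    return;

            /*
             * Marked lazily, and the caller has decided the task has run
             * long enough: upgrade to TIF_NEED_RESCHED so the next
             * interrupt return (or a resched IPI) forces the switch.
             */
            if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY))
                    resched_curr(rq);
    }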

Performance
==
The performance in the basic tests (perf bench sched messaging, kernbench,
cyclictest) matches or improves on what we see under PREEMPT_DYNAMIC.
(See patches
"sched: support preempt=none under PREEMPT_AUTO"
"sched: support preempt=full under PREEMPT_AUTO"
"sched: handle preempt=voluntary under PREEMPT_AUTO")

For a macro test, a colleague in Oracle's Exadata team tried two
OLTP benchmarks (on a 5.4.17-based Oracle kernel, with the v1 series
backported).

In both tests the data was cached on remote nodes (cells), and the
database nodes (compute) served client queries, with clients being
local in the first test and remote in the second.

Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs


                             PREEMPT_VOLUNTARY                PREEMPT_AUTO
                                                           (preempt=voluntary)
                    ==============================   ==============================
          clients    throughput    cpu-usage          throughput    cpu-usage        Gain
                     (tx/min)    (utime %/stime %)    (tx/min)    (utime %/stime %)
          -------    ----------  -----------------    ----------  -----------------  -----

OLTP                  384    9,315,653    25/ 6         9,253,252    25/ 6          -0.7%
benchmark            1536   13,177,565    50/10        13,657,306    50/10          +3.6%
(local clients)      3456   14,063,017    63/12        14,179,706    64/12          +0.8%

OLTP                   96    8,973,985    17/ 2         8,924,926    17/ 2          -0.5%
benchmark             384   22,577,254    60/ 8        22,211,419    59/ 8          -1.6%
(remote clients,     2304   25,882,857    82/11        25,536,100    82/11          -1.3%
 90/10 RW ratio)


(Both sets of tests have a fair amount of network traffic since the
query tables etc. are cached on the cells. Additionally, the first set,
given the local clients, stresses the scheduler a bit more than the
second.)

The comparative performance for both tests is fairly close, more or
less within the margin of error.

Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM):

"
a) Base kernel (6.7),
b) v1, PREEMPT_AUTO, preempt=voluntary
c) v1, PREEMPT_DYNAMIC, preempt=voluntary
d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y

Workloads I tested and their %gain,
                 case b     case c     case d
NAS              +2.7%      +1.9%      +2.1%
Hashjoin         +0.0%      +0.0%      +0.0%
Graph500         -6.0%      +0.0%      +0.0%
XSBench          +1.7%      +0.0%      +1.2%

(Note about the Graph500 numbers at [8].)

Did kernbench etc test from Mel's mmtests suite also. Did not notice
much difference.
"

One case where there is a significant performance drop is on powerpc,
seen when running hackbench on a 320-core system (a test on a smaller
system is fine). In theory there's no reason for this to happen only on
powerpc since most of the code is common, but I haven't been able to
reproduce it on x86 so far.

All in all, I think the tests above show that this scheduling model has legs.
However, the none/voluntary models under PREEMPT_AUTO are conceptually
different enough from the current none/voluntary models that there
likely are workloads where performance would be subpar. That needs more
extensive testing to figure out the weak points.


Series layout
==

Patches 1-2,
"sched/core: Move preempt_model_*() helpers from sched.h to preempt.h"
"sched/core: Drop spinlocks on contention iff kernel is preemptible"
condition spin_needbreak() on the dynamic preempt_model_*().
Not strictly required, but a useful bugfix for PREEMPT_DYNAMIC and PREEMPT_AUTO.

Patch 3
"sched: make test_*_tsk_thread_flag() return bool"
is a minor cleanup.

Patch 4,
"preempt: introduce CONFIG_PREEMPT_AUTO"
introduces the new scheduling model.

Patches 5-7,
"thread_info: selector for TIF_NEED_RESCHED[_LAZY]"
"thread_info: define __tif_need_resched(resched_t)"
"sched: define *_tsk_need_resched_lazy() helpers"

introduce new thread_info/task helper interfaces or make changes to
pre-existing ones that will be used in the rest of the series.
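
For reference, the shapes of these new interfaces, as used later in the
series (signatures inferred from their use in the thread_info and
resched_curr() patches; the bodies shown are illustrative sketches, not
the patches themselves):

    /* Flag selector (patch 5): RESCHED_NOW/RESCHED_LAZY -> TIF_* bit. */
    static __always_inline int tif_resched(resched_t rs);

    /* Check current's thread_info for the chosen flavour (patch 6). */
    static __always_inline bool __tif_need_resched(resched_t rs);

    /* Per-task set/test helpers (patch 7), roughly: */
    static inline void __set_tsk_need_resched(struct task_struct *tsk, resched_t rs)
    {
            set_tsk_thread_flag(tsk, tif_resched(rs));
    }

    static inline bool __test_tsk_need_resched(struct task_struct *tsk, resched_t rs)
    {
            return test_tsk_thread_flag(tsk, tif_resched(rs));
    }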

Patches 8-11,
"entry: handle lazy rescheduling at user-exit"
"entry/kvm: handle lazy rescheduling at guest-entry"
"entry: irqentry_exit only preempts for TIF_NEED_RESCHED"
"sched: __schedule_loop() doesn't need to check for need_resched_lazy()"

make changes/document the rescheduling points.

Patches 12-13,
"sched: separate PREEMPT_DYNAMIC config logic"
"sched: allow runtime config for PREEMPT_AUTO"

reuse the PREEMPT_DYNAMIC runtime configuration logic.

Patches 14-18,

"rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO"
"rcu: fix header guard for rcu_all_qs()"
"preempt,rcu: warn on PREEMPT_RCU=n, preempt=full"
"rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y"
"rcu: force context-switch for PREEMPT_RCU=n, PREEMPT_COUNT=y"

add changes needed for RCU.

Patches 19-20,
"x86/thread_info: define TIF_NEED_RESCHED_LAZY"
"powerpc: add support for PREEMPT_AUTO"

add x86 and powerpc support.

Patches 21-24,
"sched: prepare for lazy rescheduling in resched_curr()"
"sched: default preemption policy for PREEMPT_AUTO"
"sched: handle idle preemption for PREEMPT_AUTO"
"sched: schedule eagerly in resched_cpu()"

are preparatory patches for adding PREEMPT_AUTO. Among other things
they add the default need-resched policy for !PREEMPT_AUTO,
PREEMPT_AUTO, and the idle task.

Patches 25-26,
"sched/fair: refactor update_curr(), entity_tick()",
"sched/fair: handle tick expiry under lazy preemption"

handle the 'hog' problem, where a kernel task does not voluntarily
schedule out.

Patches 27-29,
"sched: support preempt=none under PREEMPT_AUTO"
"sched: support preempt=full under PREEMPT_AUTO"
"sched: handle preempt=voluntary under PREEMPT_AUTO"

add support for the three preemption models.

Patches 30-33,
"sched: latency warn for TIF_NEED_RESCHED_LAZY",
"tracing: support lazy resched",
"Documentation: tracing: add TIF_NEED_RESCHED_LAZY",
"osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y"

handle the remaining bits and pieces to do with TIF_NEED_RESCHED_LAZY.

And, finally, patches 34-35,

"kconfig: decompose ARCH_NO_PREEMPT"
"arch: decompose ARCH_NO_PREEMPT"

decompose ARCH_NO_PREEMPT, which might make it easier to support
CONFIG_PREEMPTION on some architectures.


Changelog
==
v2: rebased to v6.9, addresses review comments, folds some other patches.

- the lazy interfaces are less noisy now: the current interfaces stay
unchanged so non-scheduler code doesn't need to change.
This also means that lazy preemption becomes a scheduler detail,
which works well with the core idea of lazy scheduling.
(Mark Rutland, Thomas Gleixner)

- preempt=none model now respects the leftmost deadline (Juri Lelli)
- Add need-resched flag combination state in tracing headers (Steven Rostedt)
- Decompose ARCH_NO_PREEMPT
- Changes for RCU (and TASKS_RCU) will go in separately [6]

- spin_needbreak() should be conditioned on preempt_model_*() at
runtime (patches from Sean Christopherson [7])
- powerpc support from Shrikanth Hegde

v1:
- Addresses review comments and is generally a more focused
version of the RFC.
- Lots of code reorganization.
- Bugfixes all over.
- need_resched() now only checks for TIF_NEED_RESCHED instead
of TIF_NEED_RESCHED|TIF_NEED_RESCHED_LAZY.
- set_nr_if_polling() now does not check for TIF_NEED_RESCHED_LAZY.
- Tighten idle related checks.
- RCU changes to force context-switches when a quiescent state is
urgently needed.
- Does not break live-patching anymore

Also at: github.com/terminus/linux preempt-v2

Please review.

Thanks
Ankur

Cc: Thomas Gleixner <[email protected]>
Cc: Raghavendra K T <[email protected]>
Cc: Shrikanth Hegde <[email protected]>

[1] https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/
[2] https://lore.kernel.org/lkml/87led2wdj0.ffs@tglx/
[3] https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
[4] https://lore.kernel.org/lkml/[email protected]/
[5] https://lore.kernel.org/lkml/[email protected]/
[6] https://lore.kernel.org/lkml/[email protected]/
[7] https://lore.kernel.org/lkml/[email protected]/
[8] https://lore.kernel.org/lkml/[email protected]/
[9] https://lore.kernel.org/lkml/[email protected]/

Ankur Arora (32):
sched: make test_*_tsk_thread_flag() return bool
preempt: introduce CONFIG_PREEMPT_AUTO
thread_info: selector for TIF_NEED_RESCHED[_LAZY]
thread_info: define __tif_need_resched(resched_t)
sched: define *_tsk_need_resched_lazy() helpers
entry: handle lazy rescheduling at user-exit
entry/kvm: handle lazy rescheduling at guest-entry
entry: irqentry_exit only preempts for TIF_NEED_RESCHED
sched: __schedule_loop() doesn't need to check for need_resched_lazy()
sched: separate PREEMPT_DYNAMIC config logic
sched: allow runtime config for PREEMPT_AUTO
rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO
rcu: fix header guard for rcu_all_qs()
preempt,rcu: warn on PREEMPT_RCU=n, preempt=full
rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y
rcu: force context-switch for PREEMPT_RCU=n, PREEMPT_COUNT=y
x86/thread_info: define TIF_NEED_RESCHED_LAZY
sched: prepare for lazy rescheduling in resched_curr()
sched: default preemption policy for PREEMPT_AUTO
sched: handle idle preemption for PREEMPT_AUTO
sched: schedule eagerly in resched_cpu()
sched/fair: refactor update_curr(), entity_tick()
sched/fair: handle tick expiry under lazy preemption
sched: support preempt=none under PREEMPT_AUTO
sched: support preempt=full under PREEMPT_AUTO
sched: handle preempt=voluntary under PREEMPT_AUTO
sched: latency warn for TIF_NEED_RESCHED_LAZY
tracing: support lazy resched
Documentation: tracing: add TIF_NEED_RESCHED_LAZY
osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y
kconfig: decompose ARCH_NO_PREEMPT
arch: decompose ARCH_NO_PREEMPT

Sean Christopherson (2):
sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
sched/core: Drop spinlocks on contention iff kernel is preemptible

Shrikanth Hegde (1):
powerpc: add support for PREEMPT_AUTO

.../admin-guide/kernel-parameters.txt | 5 +-
Documentation/trace/ftrace.rst | 6 +-
arch/Kconfig | 7 +
arch/alpha/Kconfig | 3 +-
arch/hexagon/Kconfig | 3 +-
arch/m68k/Kconfig | 3 +-
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/thread_info.h | 5 +-
arch/powerpc/kernel/interrupt.c | 5 +-
arch/um/Kconfig | 3 +-
arch/x86/Kconfig | 1 +
arch/x86/include/asm/thread_info.h | 6 +-
include/linux/entry-common.h | 2 +-
include/linux/entry-kvm.h | 2 +-
include/linux/preempt.h | 43 ++-
include/linux/rcutree.h | 2 +-
include/linux/sched.h | 101 +++---
include/linux/spinlock.h | 14 +-
include/linux/thread_info.h | 71 +++-
include/linux/trace_events.h | 6 +-
init/Makefile | 1 +
kernel/Kconfig.preempt | 37 ++-
kernel/entry/common.c | 16 +-
kernel/entry/kvm.c | 4 +-
kernel/rcu/Kconfig | 2 +-
kernel/rcu/tree.c | 13 +-
kernel/rcu/tree_plugin.h | 11 +-
kernel/sched/core.c | 311 ++++++++++++------
kernel/sched/deadline.c | 9 +-
kernel/sched/debug.c | 13 +-
kernel/sched/fair.c | 56 ++--
kernel/sched/rt.c | 6 +-
kernel/sched/sched.h | 27 +-
kernel/trace/trace.c | 30 +-
kernel/trace/trace_osnoise.c | 22 +-
kernel/trace/trace_output.c | 16 +-
36 files changed, 598 insertions(+), 265 deletions(-)

--
2.31.1



2024-05-28 00:36:27

by Ankur Arora

Subject: [PATCH v2 03/35] sched: make test_*_tsk_thread_flag() return bool

All users of test_*_tsk_thread_flag() treat the result as boolean.
This is also true for the underlying test_and_*_bit() operations.

Change the return type to bool.

Cc: Peter Zijlstra <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
Acked-by: Mark Rutland <[email protected]>
---
include/linux/sched.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 73a3402843c6..4808e5dd4f69 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1937,17 +1937,17 @@ static inline void update_tsk_thread_flag(struct task_struct *tsk, int flag,
update_ti_thread_flag(task_thread_info(tsk), flag, value);
}

-static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
}

-static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
}

-static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_ti_thread_flag(task_thread_info(tsk), flag);
}
@@ -1962,7 +1962,7 @@ static inline void clear_tsk_need_resched(struct task_struct *tsk)
clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
}

-static inline int test_tsk_need_resched(struct task_struct *tsk)
+static inline bool test_tsk_need_resched(struct task_struct *tsk)
{
return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
}
--
2.31.1


2024-05-28 00:36:28

by Ankur Arora

Subject: [PATCH v2 05/35] thread_info: selector for TIF_NEED_RESCHED[_LAZY]

Define tif_resched() to serve as a selector for the specific
need-resched flag: tif_resched() maps to TIF_NEED_RESCHED
or to TIF_NEED_RESCHED_LAZY.

For !CONFIG_PREEMPT_AUTO, tif_resched() always evaluates
to TIF_NEED_RESCHED, preserving the existing scheduling behaviour.

Cc: Peter Zijlstra <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
include/linux/thread_info.h | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 06e13e7acbe2..65e5beedc915 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -71,6 +71,31 @@ enum syscall_work_bit {
#define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
#endif

+typedef enum {
+ RESCHED_NOW = 0,
+ RESCHED_LAZY = 1,
+} resched_t;
+
+/*
+ * tif_resched(r) maps to TIF_NEED_RESCHED[_LAZY] with CONFIG_PREEMPT_AUTO.
+ *
+ * For !CONFIG_PREEMPT_AUTO, both tif_resched(RESCHED_NOW) and
+ * tif_resched(RESCHED_LAZY) reduce to the same value (TIF_NEED_RESCHED)
+ * leaving any scheduling behaviour unchanged.
+ */
+static __always_inline int tif_resched(resched_t rs)
+{
+ if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
+ return (rs == RESCHED_NOW) ? TIF_NEED_RESCHED : TIF_NEED_RESCHED_LAZY;
+ else
+ return TIF_NEED_RESCHED;
+}
+
+static __always_inline int _tif_resched(resched_t rs)
+{
+ return 1 << tif_resched(rs);
+}
+
#ifdef __KERNEL__

#ifndef arch_set_restart_data
--
2.31.1


2024-05-28 00:37:07

by Ankur Arora

Subject: [PATCH v2 06/35] thread_info: define __tif_need_resched(resched_t)

Define __tif_need_resched() which takes a resched_t parameter to
decide the immediacy of the need-resched.

Update need_resched() and should_resched() so they both check for
__tif_need_resched(RESCHED_NOW), which keeps the current semantics.

Non-scheduling code -- which only cares about any immediately required
preemption -- can continue unchanged since the commonly used interfaces
(need_resched(), should_resched(), tif_need_resched()) stay the same.

This also allows lazy preemption to just be a scheduler detail.

Cc: Arnd Bergmann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rafael J. Wysocki" <[email protected]>
Cc: Steven Rostedt <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
include/linux/preempt.h | 2 +-
include/linux/sched.h | 7 ++++++-
include/linux/thread_info.h | 34 ++++++++++++++++++++++++++++------
kernel/trace/trace.c | 2 +-
4 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index ce76f1a45722..d453f5e34390 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -312,7 +312,7 @@ do { \
} while (0)
#define preempt_fold_need_resched() \
do { \
- if (tif_need_resched()) \
+ if (__tif_need_resched(RESCHED_NOW)) \
set_preempt_need_resched(); \
} while (0)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4808e5dd4f69..37a51115b691 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2062,7 +2062,12 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);

static __always_inline bool need_resched(void)
{
- return unlikely(tif_need_resched());
+ return unlikely(__tif_need_resched(RESCHED_NOW));
+}
+
+static __always_inline bool need_resched_lazy(void)
+{
+ return unlikely(__tif_need_resched(RESCHED_LAZY));
}

/*
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 65e5beedc915..e246b01553a5 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -216,22 +216,44 @@ static __always_inline unsigned long read_ti_thread_flags(struct thread_info *ti

#ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H

-static __always_inline bool tif_need_resched(void)
+static __always_inline bool __tif_need_resched_bitop(int nr_flag)
{
- return arch_test_bit(TIF_NEED_RESCHED,
- (unsigned long *)(&current_thread_info()->flags));
+ return arch_test_bit(nr_flag,
+ (unsigned long *)(&current_thread_info()->flags));
}

#else

-static __always_inline bool tif_need_resched(void)
+static __always_inline bool __tif_need_resched_bitop(int nr_flag)
{
- return test_bit(TIF_NEED_RESCHED,
- (unsigned long *)(&current_thread_info()->flags));
+ return test_bit(nr_flag,
+ (unsigned long *)(&current_thread_info()->flags));
}

#endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */

+static __always_inline bool __tif_need_resched(resched_t rs)
+{
+ /*
+ * With !PREEMPT_AUTO, this check is only meaningful if we
+ * are checking if tif_resched(RESCHED_NOW) is set.
+ */
+ if (IS_ENABLED(CONFIG_PREEMPT_AUTO) || rs == RESCHED_NOW)
+ return __tif_need_resched_bitop(tif_resched(rs));
+ else
+ return false;
+}
+
+static __always_inline bool tif_need_resched(void)
+{
+ return __tif_need_resched(RESCHED_NOW);
+}
+
+static __always_inline bool tif_need_resched_lazy(void)
+{
+ return __tif_need_resched(RESCHED_LAZY);
+}
+
#ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
static inline int arch_within_stack_frames(const void * const stack,
const void * const stackend,
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 233d1af39fff..ed229527be05 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2511,7 +2511,7 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status)
if (softirq_count() >> (SOFTIRQ_SHIFT + 1))
trace_flags |= TRACE_FLAG_BH_OFF;

- if (tif_need_resched())
+ if (__tif_need_resched(RESCHED_NOW))
trace_flags |= TRACE_FLAG_NEED_RESCHED;
if (test_preempt_need_resched())
trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
--
2.31.1


2024-05-28 00:37:22

by Ankur Arora

Subject: [PATCH v2 08/35] entry: handle lazy rescheduling at user-exit

The scheduling policy for TIF_NEED_RESCHED_LAZY is to allow the
running task to voluntarily schedule out, letting it run to
timeslice completion.

For archs with GENERIC_ENTRY, do this by adding a check in
exit_to_user_mode_loop().

Cc: Peter Zijlstra <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
include/linux/entry-common.h | 2 +-
kernel/entry/common.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index b0fb775a600d..f5bb19369973 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -65,7 +65,7 @@
#define EXIT_TO_USER_MODE_WORK \
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
_TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
- ARCH_EXIT_TO_USER_MODE_WORK)
+ _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)

/**
* arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 90843cc38588..bcb23c866425 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -98,7 +98,7 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,

local_irq_enable_exit_to_user(ti_work);

- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();

if (ti_work & _TIF_UPROBE)
--
2.31.1


2024-05-28 00:37:23

by Ankur Arora

Subject: [PATCH v2 02/35] sched/core: Drop spinlocks on contention iff kernel is preemptible

From: Sean Christopherson <[email protected]>

Use preempt_model_preemptible() to detect a preemptible kernel when
deciding whether or not to reschedule in order to drop a contended
spinlock or rwlock. Because PREEMPT_DYNAMIC selects PREEMPTION, kernels
built with PREEMPT_DYNAMIC=y will yield contended locks even if the live
preemption model is "none" or "voluntary". In short, make kernels with
dynamically selected models behave the same as kernels with statically
selected models.

Somewhat counter-intuitively, NOT yielding a lock can provide better
latency for the relevant tasks/processes. E.g. KVM x86's mmu_lock, a
rwlock, is often contended between an invalidation event (takes mmu_lock
for write) and a vCPU servicing a guest page fault (takes mmu_lock for
read). For _some_ setups, letting the invalidation task complete even
if there is mmu_lock contention provides lower latency for *all* tasks,
i.e. the invalidation completes sooner *and* the vCPU services the guest
page fault sooner.

But even KVM's mmu_lock behavior isn't uniform, e.g. the "best" behavior
can vary depending on the host VMM, the guest workload, the number of
vCPUs, the number of pCPUs in the host, why there is lock contention, etc.

In other words, simply deleting the CONFIG_PREEMPTION guard (or doing the
opposite and removing contention yielding entirely) needs to come with a
big pile of data proving that changing the status quo is a net positive.

Opportunistically document this side effect of preempt=full, as yielding
contended spinlocks can have significant, user-visible impact.

Fixes: c597bfddc9e9 ("sched: Provide Kconfig support for default dynamic preempt mode")
Link: https://lore.kernel.org/kvm/[email protected]
Cc: Valentin Schneider <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: David Matlack <[email protected]>
Cc: Friedrich Weber <[email protected]>
Cc: Ankur Arora <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ankur Arora <[email protected]>
Reviewed-by: Chen Yu <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 4 +++-
include/linux/spinlock.h | 14 ++++++--------
2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 396137ee018d..2d693300ab57 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4722,7 +4722,9 @@
none - Limited to cond_resched() calls
voluntary - Limited to cond_resched() and might_sleep() calls
full - Any section that isn't explicitly preempt disabled
- can be preempted anytime.
+ can be preempted anytime. Tasks will also yield
+ contended spinlocks (if the critical section isn't
+ explicitly preempt disabled beyond the lock itself).

print-fatal-signals=
[KNL] debug: print fatal signals
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 3fcd20de6ca8..63dd8cf3c3c2 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -462,11 +462,10 @@ static __always_inline int spin_is_contended(spinlock_t *lock)
*/
static inline int spin_needbreak(spinlock_t *lock)
{
-#ifdef CONFIG_PREEMPTION
+ if (!preempt_model_preemptible())
+ return 0;
+
return spin_is_contended(lock);
-#else
- return 0;
-#endif
}

/*
@@ -479,11 +478,10 @@ static inline int spin_needbreak(spinlock_t *lock)
*/
static inline int rwlock_needbreak(rwlock_t *lock)
{
-#ifdef CONFIG_PREEMPTION
+ if (!preempt_model_preemptible())
+ return 0;
+
return rwlock_is_contended(lock);
-#else
- return 0;
-#endif
}

/*
--
2.31.1


2024-05-28 00:37:53

by Ankur Arora

Subject: [PATCH v2 09/35] entry/kvm: handle lazy rescheduling at guest-entry

Archs defining CONFIG_KVM_XFER_TO_GUEST_WORK call
xfer_to_guest_mode_handle_work() from various KVM vcpu-run
loops to check for any task work including rescheduling.

Handle TIF_NEED_RESCHED_LAZY alongside TIF_NEED_RESCHED.

Also, while at it, remove the explicit check for need_resched() in
the exit condition as that is already covered in the loop condition.

Cc: Paolo Bonzini <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
include/linux/entry-kvm.h | 2 +-
kernel/entry/kvm.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 6813171afccb..674a622c91be 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -18,7 +18,7 @@

#define XFER_TO_GUEST_MODE_WORK \
(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
- _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
+ _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)

struct kvm_vcpu;

diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 2e0f75bcb7fd..8485f63863af 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
return -EINTR;
}

- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();

if (ti_work & _TIF_NOTIFY_RESUME)
@@ -24,7 +24,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
return ret;

ti_work = read_thread_flags();
- } while (ti_work & XFER_TO_GUEST_MODE_WORK || need_resched());
+ } while (ti_work & XFER_TO_GUEST_MODE_WORK);
return 0;
}

--
2.31.1


2024-05-28 00:38:03

by Ankur Arora

Subject: [PATCH v2 10/35] entry: irqentry_exit only preempts for TIF_NEED_RESCHED

Use __tif_need_resched(RESCHED_NOW) instead of need_resched() to be
explicit that this path only reschedules if it is needed imminently.

Also, add a comment about why we need a need-resched check here at
all, given that the top level conditional has already checked the
preempt_count().

Cc: Peter Zijlstra <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/entry/common.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index bcb23c866425..c684385921de 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -307,7 +307,16 @@ void raw_irqentry_exit_cond_resched(void)
rcu_irq_exit_check_preempt();
if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
WARN_ON_ONCE(!on_thread_stack());
- if (need_resched())
+
+ /*
+ * Check if we need to preempt eagerly.
+ *
+ * Note: we need an explicit check here because some
+ * architectures don't fold TIF_NEED_RESCHED in the
+ * preempt_count. For archs that do, this is already covered
+ * in the conditional above.
+ */
+ if (__tif_need_resched(RESCHED_NOW))
preempt_schedule_irq();
}
}
--
2.31.1


2024-05-28 00:38:15

by Ankur Arora

Subject: [PATCH v2 11/35] sched: __schedule_loop() doesn't need to check for need_resched_lazy()

Various scheduling loops recheck need_resched() to avoid a missed
scheduling opportunity.

Explicitly note that we don't need to check for need_resched_lazy()
since that only needs to be handled at exit-to-user.

Also update the comment above __schedule() to describe
TIF_NEED_RESCHED_LAZY semantics.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d00d7b45303e..0c26b60c1101 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6582,20 +6582,23 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
*
* 1. Explicit blocking: mutex, semaphore, waitqueue, etc.
*
- * 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
- * paths. For example, see arch/x86/entry_64.S.
+ * 2. TIF_NEED_RESCHED flag is checked on interrupt and TIF_NEED_RESCHED[_LAZY]
+ * flags on userspace return paths. For example, see kernel/entry/common.c
*
- * To drive preemption between tasks, the scheduler sets the flag in timer
- * interrupt handler scheduler_tick().
+ * To drive preemption between tasks, the scheduler sets one of the need-
+ * resched flags in the timer interrupt handler scheduler_tick():
+ * - !CONFIG_PREEMPT_AUTO: TIF_NEED_RESCHED.
+ * - CONFIG_PREEMPT_AUTO: TIF_NEED_RESCHED or TIF_NEED_RESCHED_LAZY
+ * depending on the preemption model.
*
* 3. Wakeups don't really cause entry into schedule(). They add a
* task to the run-queue and that's it.
*
* Now, if the new task added to the run-queue preempts the current
- * task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
- * called on the nearest possible occasion:
+ * task, then the wakeup sets TIF_NEED_RESCHED[_LAZY] and schedule()
+ * gets called on the nearest possible occasion:
*
- * - If the kernel is preemptible (CONFIG_PREEMPTION=y):
+ * - If the kernel is running under preempt_model_preemptible():
*
* - in syscall or exception context, at the next outmost
* preempt_enable(). (this might be as soon as the wake_up()'s
@@ -6604,8 +6607,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
* - in IRQ context, return from interrupt-handler to
* preemptible context
*
- * - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)
- * then at the next:
+ * - If the kernel is running under preempt_model_none(), or
+ * preempt_model_voluntary(), then at the next:
*
* - cond_resched() call
* - explicit schedule() call
@@ -6823,6 +6826,11 @@ static __always_inline void __schedule_loop(unsigned int sched_mode)
preempt_disable();
__schedule(sched_mode);
sched_preempt_enable_no_resched();
+
+ /*
+ * We don't check for need_resched_lazy() here, since it is
+ * always handled at exit-to-user.
+ */
} while (need_resched());
}

@@ -6928,7 +6936,7 @@ static void __sched notrace preempt_schedule_common(void)
preempt_enable_no_resched_notrace();

/*
- * Check again in case we missed a preemption opportunity
+ * Check again in case we missed an eager preemption opportunity
* between schedule and now.
*/
} while (need_resched());
--
2.31.1


2024-05-28 00:38:30

by Ankur Arora

Subject: [PATCH v2 12/35] sched: separate PREEMPT_DYNAMIC config logic

Pull out the PREEMPT_DYNAMIC setup logic to allow other preemption
models to dynamically configure preemption.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 165 +++++++++++++++++++++++---------------------
1 file changed, 86 insertions(+), 79 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c26b60c1101..349f6257fdcd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8713,6 +8713,89 @@ int __cond_resched_rwlock_write(rwlock_t *lock)
}
EXPORT_SYMBOL(__cond_resched_rwlock_write);

+#if defined(CONFIG_PREEMPT_DYNAMIC)
+
+#define PREEMPT_MODE "Dynamic Preempt"
+
+enum {
+ preempt_dynamic_undefined = -1,
+ preempt_dynamic_none,
+ preempt_dynamic_voluntary,
+ preempt_dynamic_full,
+};
+
+int preempt_dynamic_mode = preempt_dynamic_undefined;
+static DEFINE_MUTEX(sched_dynamic_mutex);
+
+int sched_dynamic_mode(const char *str)
+{
+ if (!strcmp(str, "none"))
+ return preempt_dynamic_none;
+
+ if (!strcmp(str, "voluntary"))
+ return preempt_dynamic_voluntary;
+
+ if (!strcmp(str, "full"))
+ return preempt_dynamic_full;
+
+ return -EINVAL;
+}
+
+static void __sched_dynamic_update(int mode);
+void sched_dynamic_update(int mode)
+{
+ mutex_lock(&sched_dynamic_mutex);
+ __sched_dynamic_update(mode);
+ mutex_unlock(&sched_dynamic_mutex);
+}
+
+static void __init preempt_dynamic_init(void)
+{
+ if (preempt_dynamic_mode == preempt_dynamic_undefined) {
+ if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
+ sched_dynamic_update(preempt_dynamic_none);
+ } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
+ sched_dynamic_update(preempt_dynamic_voluntary);
+ } else {
+ /* Default static call setting, nothing to do */
+ WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
+ preempt_dynamic_mode = preempt_dynamic_full;
+ pr_info("%s: full\n", PREEMPT_MODE);
+ }
+ }
+}
+
+static int __init setup_preempt_mode(char *str)
+{
+ int mode = sched_dynamic_mode(str);
+ if (mode < 0) {
+ pr_warn("%s: unsupported mode: %s\n", PREEMPT_MODE, str);
+ return 0;
+ }
+
+ sched_dynamic_update(mode);
+ return 1;
+}
+__setup("preempt=", setup_preempt_mode);
+
+#define PREEMPT_MODEL_ACCESSOR(mode) \
+ bool preempt_model_##mode(void) \
+ { \
+ WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
+ return preempt_dynamic_mode == preempt_dynamic_##mode; \
+ } \
+ EXPORT_SYMBOL_GPL(preempt_model_##mode)
+
+PREEMPT_MODEL_ACCESSOR(none);
+PREEMPT_MODEL_ACCESSOR(voluntary);
+PREEMPT_MODEL_ACCESSOR(full);
+
+#else /* !CONFIG_PREEMPT_DYNAMIC */
+
+static inline void preempt_dynamic_init(void) { }
+
+#endif /* !CONFIG_PREEMPT_DYNAMIC */
+
#ifdef CONFIG_PREEMPT_DYNAMIC

#ifdef CONFIG_GENERIC_ENTRY
@@ -8749,29 +8832,6 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
* irqentry_exit_cond_resched <- irqentry_exit_cond_resched
*/

-enum {
- preempt_dynamic_undefined = -1,
- preempt_dynamic_none,
- preempt_dynamic_voluntary,
- preempt_dynamic_full,
-};
-
-int preempt_dynamic_mode = preempt_dynamic_undefined;
-
-int sched_dynamic_mode(const char *str)
-{
- if (!strcmp(str, "none"))
- return preempt_dynamic_none;
-
- if (!strcmp(str, "voluntary"))
- return preempt_dynamic_voluntary;
-
- if (!strcmp(str, "full"))
- return preempt_dynamic_full;
-
- return -EINVAL;
-}
-
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
#define preempt_dynamic_enable(f) static_call_update(f, f##_dynamic_enabled)
#define preempt_dynamic_disable(f) static_call_update(f, f##_dynamic_disabled)
@@ -8782,7 +8842,6 @@ int sched_dynamic_mode(const char *str)
#error "Unsupported PREEMPT_DYNAMIC mechanism"
#endif

-static DEFINE_MUTEX(sched_dynamic_mutex);
static bool klp_override;

static void __sched_dynamic_update(int mode)
@@ -8807,7 +8866,7 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_disable(preempt_schedule_notrace);
preempt_dynamic_disable(irqentry_exit_cond_resched);
if (mode != preempt_dynamic_mode)
- pr_info("Dynamic Preempt: none\n");
+ pr_info("%s: none\n", PREEMPT_MODE);
break;

case preempt_dynamic_voluntary:
@@ -8818,7 +8877,7 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_disable(preempt_schedule_notrace);
preempt_dynamic_disable(irqentry_exit_cond_resched);
if (mode != preempt_dynamic_mode)
- pr_info("Dynamic Preempt: voluntary\n");
+ pr_info("%s: voluntary\n", PREEMPT_MODE);
break;

case preempt_dynamic_full:
@@ -8829,20 +8888,13 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_enable(preempt_schedule_notrace);
preempt_dynamic_enable(irqentry_exit_cond_resched);
if (mode != preempt_dynamic_mode)
- pr_info("Dynamic Preempt: full\n");
+ pr_info("%s: full\n", PREEMPT_MODE);
break;
}

preempt_dynamic_mode = mode;
}

-void sched_dynamic_update(int mode)
-{
- mutex_lock(&sched_dynamic_mutex);
- __sched_dynamic_update(mode);
- mutex_unlock(&sched_dynamic_mutex);
-}
-
#ifdef CONFIG_HAVE_PREEMPT_DYNAMIC_CALL

static int klp_cond_resched(void)
@@ -8873,51 +8925,6 @@ void sched_dynamic_klp_disable(void)

#endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */

-static int __init setup_preempt_mode(char *str)
-{
- int mode = sched_dynamic_mode(str);
- if (mode < 0) {
- pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
- return 0;
- }
-
- sched_dynamic_update(mode);
- return 1;
-}
-__setup("preempt=", setup_preempt_mode);
-
-static void __init preempt_dynamic_init(void)
-{
- if (preempt_dynamic_mode == preempt_dynamic_undefined) {
- if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
- sched_dynamic_update(preempt_dynamic_none);
- } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
- sched_dynamic_update(preempt_dynamic_voluntary);
- } else {
- /* Default static call setting, nothing to do */
- WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
- preempt_dynamic_mode = preempt_dynamic_full;
- pr_info("Dynamic Preempt: full\n");
- }
- }
-}
-
-#define PREEMPT_MODEL_ACCESSOR(mode) \
- bool preempt_model_##mode(void) \
- { \
- WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
- return preempt_dynamic_mode == preempt_dynamic_##mode; \
- } \
- EXPORT_SYMBOL_GPL(preempt_model_##mode)
-
-PREEMPT_MODEL_ACCESSOR(none);
-PREEMPT_MODEL_ACCESSOR(voluntary);
-PREEMPT_MODEL_ACCESSOR(full);
-
-#else /* !CONFIG_PREEMPT_DYNAMIC */
-
-static inline void preempt_dynamic_init(void) { }
-
#endif /* #ifdef CONFIG_PREEMPT_DYNAMIC */

/**
--
2.31.1


2024-05-28 00:38:44

by Ankur Arora

Subject: [PATCH v2 14/35] rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO

Under PREEMPT_AUTO, CONFIG_PREEMPTION is enabled, and much like
PREEMPT_DYNAMIC, PREEMPT_AUTO also allows for dynamic switching
of preemption models.

The RCU model, however, is fixed at compile time.

Now, RCU typically selects PREEMPT_RCU if CONFIG_PREEMPTION is enabled.
Given the trade-offs between PREEMPT_RCU=y and PREEMPT_RCU=n, some
configurations might prefer the stronger forward-progress guarantees
of PREEMPT_RCU=n.

Accordingly, default to PREEMPT_RCU=y only for configurations that
explicitly choose a preemptible kernel at compile time (PREEMPT,
PREEMPT_DYNAMIC, PREEMPT_RT), instead of keying it off CONFIG_PREEMPTION.

Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/rcu/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index e7d2dd267593..9dedb70ac2e6 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -18,7 +18,7 @@ config TREE_RCU

config PREEMPT_RCU
bool
- default y if PREEMPTION
+ default y if (PREEMPT || PREEMPT_DYNAMIC || PREEMPT_RT)
select TREE_RCU
help
This option selects the RCU implementation that is
--
2.31.1


2024-05-28 00:38:47

by Ankur Arora

Subject: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO

Reuse sched_dynamic_update() and related logic to enable choosing
the preemption model at boot or runtime for PREEMPT_AUTO.

The interface is identical to PREEMPT_DYNAMIC.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>

Changelog:
change title
---
include/linux/preempt.h | 2 +-
kernel/sched/core.c | 31 +++++++++++++++++++++++++++----
kernel/sched/debug.c | 6 +++---
kernel/sched/sched.h | 2 +-
4 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index d453f5e34390..d4f568606eda 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -481,7 +481,7 @@ DEFINE_LOCK_GUARD_0(preempt, preempt_disable(), preempt_enable())
DEFINE_LOCK_GUARD_0(preempt_notrace, preempt_disable_notrace(), preempt_enable_notrace())
DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())

-#ifdef CONFIG_PREEMPT_DYNAMIC
+#if defined(CONFIG_PREEMPT_DYNAMIC) || defined(CONFIG_PREEMPT_AUTO)

extern bool preempt_model_none(void);
extern bool preempt_model_voluntary(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 349f6257fdcd..d7804e29182d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8713,9 +8713,13 @@ int __cond_resched_rwlock_write(rwlock_t *lock)
}
EXPORT_SYMBOL(__cond_resched_rwlock_write);

-#if defined(CONFIG_PREEMPT_DYNAMIC)
+#if defined(CONFIG_PREEMPT_DYNAMIC) || defined(CONFIG_PREEMPT_AUTO)

+#ifdef CONFIG_PREEMPT_DYNAMIC
#define PREEMPT_MODE "Dynamic Preempt"
+#else
+#define PREEMPT_MODE "Preempt Auto"
+#endif

enum {
preempt_dynamic_undefined = -1,
@@ -8790,11 +8794,11 @@ PREEMPT_MODEL_ACCESSOR(none);
PREEMPT_MODEL_ACCESSOR(voluntary);
PREEMPT_MODEL_ACCESSOR(full);

-#else /* !CONFIG_PREEMPT_DYNAMIC */
+#else /* !CONFIG_PREEMPT_DYNAMIC && !CONFIG_PREEMPT_AUTO */

static inline void preempt_dynamic_init(void) { }

-#endif /* !CONFIG_PREEMPT_DYNAMIC */
+#endif /* !CONFIG_PREEMPT_DYNAMIC && !CONFIG_PREEMPT_AUTO */

#ifdef CONFIG_PREEMPT_DYNAMIC

@@ -8925,7 +8929,26 @@ void sched_dynamic_klp_disable(void)

#endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */

-#endif /* #ifdef CONFIG_PREEMPT_DYNAMIC */
+#elif defined(CONFIG_PREEMPT_AUTO)
+
+static void __sched_dynamic_update(int mode)
+{
+ switch (mode) {
+ case preempt_dynamic_none:
+ preempt_dynamic_mode = preempt_dynamic_undefined;
+ break;
+
+ case preempt_dynamic_voluntary:
+ preempt_dynamic_mode = preempt_dynamic_undefined;
+ break;
+
+ case preempt_dynamic_full:
+ preempt_dynamic_mode = preempt_dynamic_undefined;
+ break;
+ }
+}
+
+#endif /* CONFIG_PREEMPT_AUTO */

/**
* yield - yield the current processor to other threads.
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8d5d98a5834d..e53f1b73bf4a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -216,7 +216,7 @@ static const struct file_operations sched_scaling_fops = {

#endif /* SMP */

-#ifdef CONFIG_PREEMPT_DYNAMIC
+#if defined(CONFIG_PREEMPT_DYNAMIC) || defined(CONFIG_PREEMPT_AUTO)

static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
size_t cnt, loff_t *ppos)
@@ -276,7 +276,7 @@ static const struct file_operations sched_dynamic_fops = {
.release = single_release,
};

-#endif /* CONFIG_PREEMPT_DYNAMIC */
+#endif /* CONFIG_PREEMPT_DYNAMIC || CONFIG_PREEMPT_AUTO */

__read_mostly bool sched_debug_verbose;

@@ -343,7 +343,7 @@ static __init int sched_init_debug(void)

debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
debugfs_create_file_unsafe("verbose", 0644, debugfs_sched, &sched_debug_verbose, &sched_verbose_fops);
-#ifdef CONFIG_PREEMPT_DYNAMIC
+#if defined(CONFIG_PREEMPT_DYNAMIC) || defined(CONFIG_PREEMPT_AUTO)
debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
#endif

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ae50f212775e..c9239c0b0095 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3231,7 +3231,7 @@ extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *w

extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);

-#ifdef CONFIG_PREEMPT_DYNAMIC
+#if defined(CONFIG_PREEMPT_DYNAMIC) || defined(CONFIG_PREEMPT_AUTO)
extern int preempt_dynamic_mode;
extern int sched_dynamic_mode(const char *str);
extern void sched_dynamic_update(int mode);
--
2.31.1


2024-05-28 00:39:02

by Ankur Arora

Subject: [PATCH v2 15/35] rcu: fix header guard for rcu_all_qs()

rcu_all_qs() is defined for !CONFIG_PREEMPT_RCU, but its declaration
is guarded by #ifndef CONFIG_PREEMPTION.

With CONFIG_PREEMPT_AUTO, you can have configurations where
CONFIG_PREEMPTION is enabled without also enabling CONFIG_PREEMPT_RCU,
leaving rcu_all_qs() defined but undeclared.

Decouple the two by conditioning the declaration on CONFIG_PREEMPT_RCU.

Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>

Changelog:
Might be going away
---
include/linux/rcutree.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 254244202ea9..be2b77c81a6d 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -103,7 +103,7 @@ extern int rcu_scheduler_active;
void rcu_end_inkernel_boot(void);
bool rcu_inkernel_boot_has_ended(void);
bool rcu_is_watching(void);
-#ifndef CONFIG_PREEMPTION
+#ifndef CONFIG_PREEMPT_RCU
void rcu_all_qs(void);
#endif

--
2.31.1


2024-05-28 00:39:18

by Ankur Arora

Subject: [PATCH v2 17/35] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y

With PREEMPT_RCU=n, cond_resched() provides urgently needed quiescent
states for read-side critical sections via rcu_all_qs().
One reason why this was necessary: lacking preempt-count, the tick
handler has no way of knowing whether it is executing in a read-side
critical section or not.

With PREEMPT_AUTO=y, there can be configurations with (PREEMPT_COUNT=y,
PREEMPT_RCU=n). This means that cond_resched() is a stub which does
not provide for quiescent states via rcu_all_qs().

So, use the availability of preempt_count() to report quiescent states
in rcu_flavor_sched_clock_irq().

Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/rcu/tree_plugin.h | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 36a8b5dbf5b5..741476c841a1 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -963,13 +963,16 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
*/
static void rcu_flavor_sched_clock_irq(int user)
{
- if (user || rcu_is_cpu_rrupt_from_idle()) {
+ if (user || rcu_is_cpu_rrupt_from_idle() ||
+ (IS_ENABLED(CONFIG_PREEMPT_COUNT) &&
+ !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))) {

/*
* Get here if this CPU took its interrupt from user
- * mode or from the idle loop, and if this is not a
- * nested interrupt. In this case, the CPU is in
- * a quiescent state, so note it.
+ * mode, from the idle loop without this being a nested
+ * interrupt, or while not holding a preempt count.
+ * In this case, the CPU is in a quiescent state, so note
+ * it.
*
* No memory barrier is required here because rcu_qs()
* references only CPU-local variables that other CPUs
--
2.31.1


2024-05-28 00:39:24

by Ankur Arora

Subject: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full

The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
works at cross purposes: the RCU read side critical sections disable
preemption, while preempt=full schedules eagerly to minimize
latency.

Warn if the user is switching to full preemption with PREEMPT_RCU=n.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Suggested-by: Paul E. McKenney <[email protected]>
Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d7804e29182d..df8e333f2d8b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
break;

case preempt_dynamic_full:
+ if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
+ pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
+ PREEMPT_MODE);
+
preempt_dynamic_mode = preempt_dynamic_undefined;
break;
}
--
2.31.1


2024-05-28 00:39:32

by Ankur Arora

Subject: [PATCH v2 18/35] rcu: force context-switch for PREEMPT_RCU=n, PREEMPT_COUNT=y

With (PREEMPT_RCU=n, PREEMPT_COUNT=y), rcu_flavor_sched_clock_irq()
registers urgently needed quiescent states when preempt_count() is
available and no task or softirq is in a non-preemptible section.

This, however, does nothing for long running loops where preemption
is only temporarily enabled, since the tick is unlikely to neatly fall
in the preemptible() section.

Handle that by forcing a context-switch when we require a quiescent
state urgently but are holding a preempt_count().

Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/rcu/tree.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d9642dd06c25..3a0e1d0b939c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2286,8 +2286,17 @@ void rcu_sched_clock_irq(int user)
raw_cpu_inc(rcu_data.ticks_this_gp);
/* The load-acquire pairs with the store-release setting to true. */
if (smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) {
- /* Idle and userspace execution already are quiescent states. */
- if (!rcu_is_cpu_rrupt_from_idle() && !user) {
+ /*
+ * Idle and userspace execution already are quiescent states.
+ * If, however, we came here from a nested interrupt in the
+ * kernel, or if we have PREEMPT_RCU=n but are holding a
+ * preempt_count() (say, with CONFIG_PREEMPT_AUTO=y), then
+ * force a context switch.
+ */
+ if ((!rcu_is_cpu_rrupt_from_idle() && !user) ||
+ ((!IS_ENABLED(CONFIG_PREEMPT_RCU) &&
+ IS_ENABLED(CONFIG_PREEMPT_COUNT)) &&
+ (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))) {
set_tsk_need_resched(current);
set_preempt_need_resched();
}
--
2.31.1


2024-05-28 00:39:52

by Ankur Arora

Subject: [PATCH v2 20/35] powerpc: add support for PREEMPT_AUTO

From: Shrikanth Hegde <[email protected]>

Add powerpc arch support for PREEMPT_AUTO by defining the LAZY bits.

Since powerpc doesn't use the generic entry/exit code, add the
NEED_RESCHED_LAZY check to the exit-to-user path, and key the
exit-to-kernel-from-interrupt preemption check off CONFIG_PREEMPTION.

Signed-off-by: Shrikanth Hegde <[email protected]>
[ Changed TIF_NEED_RESCHED_LAZY to now be defined unconditionally. ]
Signed-off-by: Ankur Arora <[email protected]>
---
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/thread_info.h | 5 ++++-
arch/powerpc/kernel/interrupt.c | 5 +++--
3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1c4be3373686..11e7008f5dd3 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -268,6 +268,7 @@ config PPC
select HAVE_PERF_EVENTS_NMI if PPC64
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
+ select HAVE_PREEMPT_AUTO
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE
select HAVE_RSEQ
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 15c5691dd218..0d170e2be2b6 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -117,11 +117,14 @@ void arch_setup_new_exec(void);
#endif
#define TIF_POLLING_NRFLAG 19 /* true if poll_idle() is polling TIF_NEED_RESCHED */
#define TIF_32BIT 20 /* 32 bit binary */
+#define TIF_NEED_RESCHED_LAZY 21 /* Lazy rescheduling */

/* as above, but as bit values */
#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
#define _TIF_SIGPENDING (1<<TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1<<TIF_NEED_RESCHED)
+#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
+
#define _TIF_NOTIFY_SIGNAL (1<<TIF_NOTIFY_SIGNAL)
#define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG)
#define _TIF_32BIT (1<<TIF_32BIT)
@@ -144,7 +147,7 @@ void arch_setup_new_exec(void);
#define _TIF_USER_WORK_MASK (_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
_TIF_NOTIFY_RESUME | _TIF_UPROBE | \
_TIF_RESTORE_TM | _TIF_PATCH_PENDING | \
- _TIF_NOTIFY_SIGNAL)
+ _TIF_NOTIFY_SIGNAL | _TIF_NEED_RESCHED_LAZY)
#define _TIF_PERSYSCALL_MASK (_TIF_RESTOREALL|_TIF_NOERROR)

/* Bits in local_flags */
diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
index eca293794a1e..0b97cdd4b94e 100644
--- a/arch/powerpc/kernel/interrupt.c
+++ b/arch/powerpc/kernel/interrupt.c
@@ -185,7 +185,7 @@ interrupt_exit_user_prepare_main(unsigned long ret, struct pt_regs *regs)
ti_flags = read_thread_flags();
while (unlikely(ti_flags & (_TIF_USER_WORK_MASK & ~_TIF_RESTORE_TM))) {
local_irq_enable();
- if (ti_flags & _TIF_NEED_RESCHED) {
+ if (ti_flags & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
schedule();
} else {
/*
@@ -396,7 +396,8 @@ notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs)
/* Returning to a kernel context with local irqs enabled. */
WARN_ON_ONCE(!(regs->msr & MSR_EE));
again:
- if (IS_ENABLED(CONFIG_PREEMPT)) {
+
+ if (IS_ENABLED(CONFIG_PREEMPTION)) {
/* Return to preemptible kernel context */
if (unlikely(read_thread_flags() & _TIF_NEED_RESCHED)) {
if (preempt_count() == 0)
--
2.31.1


2024-05-28 00:39:59

by Ankur Arora

Subject: [PATCH v2 21/35] sched: prepare for lazy rescheduling in resched_curr()

Handle RESCHED_LAZY in resched_curr(), by registering an intent to
reschedule at exit-to-user.
Given that the rescheduling is not imminent, skip the preempt folding
and the resched IPI.

Also, update set_nr_and_not_polling() to handle RESCHED_LAZY. Note that
there are no changes to set_nr_if_polling(), since lazy rescheduling
is not meaningful for idle.

And finally, now that there are two need-resched bits, enforce a
priority order while setting them.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 35 +++++++++++++++++++++++------------
1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index df8e333f2d8b..27b908cc9134 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -899,14 +899,14 @@ static inline void hrtick_rq_init(struct rq *rq)

#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
/*
- * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
+ * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
* this avoids any races wrt polling state changes and thereby avoids
* spurious IPIs.
*/
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, resched_t rs)
{
struct thread_info *ti = task_thread_info(p);
- return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
+ return !(fetch_or(&ti->flags, _tif_resched(rs)) & _TIF_POLLING_NRFLAG);
}

/*
@@ -931,9 +931,9 @@ static bool set_nr_if_polling(struct task_struct *p)
}

#else
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, resched_t rs)
{
- __set_tsk_need_resched(p, RESCHED_NOW);
+ __set_tsk_need_resched(p, rs);
return true;
}

@@ -1041,25 +1041,34 @@ void wake_up_q(struct wake_q_head *head)
void resched_curr(struct rq *rq)
{
struct task_struct *curr = rq->curr;
+ resched_t rs = RESCHED_NOW;
int cpu;

lockdep_assert_rq_held(rq);

- if (__test_tsk_need_resched(curr, RESCHED_NOW))
+ /*
+ * TIF_NEED_RESCHED is the higher priority bit, so if it is already
+ * set, nothing more to be done.
+ */
+ if (__test_tsk_need_resched(curr, RESCHED_NOW) ||
+ (rs == RESCHED_LAZY && __test_tsk_need_resched(curr, RESCHED_LAZY)))
return;

cpu = cpu_of(rq);

if (cpu == smp_processor_id()) {
- __set_tsk_need_resched(curr, RESCHED_NOW);
- set_preempt_need_resched();
+ __set_tsk_need_resched(curr, rs);
+ if (rs == RESCHED_NOW)
+ set_preempt_need_resched();
return;
}

- if (set_nr_and_not_polling(curr))
- smp_send_reschedule(cpu);
- else
+ if (set_nr_and_not_polling(curr, rs)) {
+ if (rs == RESCHED_NOW)
+ smp_send_reschedule(cpu);
+ } else {
trace_sched_wake_idle_without_ipi(cpu);
+ }
}

void resched_cpu(int cpu)
@@ -1154,7 +1163,7 @@ static void wake_up_idle_cpu(int cpu)
* and testing of the above solutions didn't appear to report
* much benefits.
*/
- if (set_nr_and_not_polling(rq->idle))
+ if (set_nr_and_not_polling(rq->idle, RESCHED_NOW))
smp_send_reschedule(cpu);
else
trace_sched_wake_idle_without_ipi(cpu);
@@ -6704,6 +6713,8 @@ static void __sched notrace __schedule(unsigned int sched_mode)
}

next = pick_next_task(rq, prev, &rf);
+
+ /* Clear both TIF_NEED_RESCHED, TIF_NEED_RESCHED_LAZY */
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
--
2.31.1


2024-05-28 00:41:05

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v2 19/35] x86/thread_info: define TIF_NEED_RESCHED_LAZY

Define TIF_NEED_RESCHED_LAZY which, together with TIF_NEED_RESCHED,
provides the scheduler with two kinds of rescheduling intent:
TIF_NEED_RESCHED, for the usual rescheduling at the next safe
preemption point; and TIF_NEED_RESCHED_LAZY, expressing an intent to
reschedule at some point in the future while allowing the current
task to run to completion.
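
For reference, a minimal standalone sketch of how the two bits are
meant to be consumed (bit numbers mirror the hunk below; the helpers
are made up for illustration):

#include <stdbool.h>
#include <stdio.h>

#define TIF_NEED_RESCHED        3
#define TIF_NEED_RESCHED_LAZY   4

#define _TIF_NEED_RESCHED       (1u << TIF_NEED_RESCHED)
#define _TIF_NEED_RESCHED_LAZY  (1u << TIF_NEED_RESCHED_LAZY)

/* Preempt at the next safe preemption point. */
static bool wants_resched_now(unsigned int flags)
{
    return flags & _TIF_NEED_RESCHED;
}

/* Either bit is enough to reschedule at exit-to-user. */
static bool wants_resched_eventually(unsigned int flags)
{
    return flags & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY);
}

int main(void)
{
    unsigned int flags = _TIF_NEED_RESCHED_LAZY;

    printf("now: %d, eventually: %d\n",
           wants_resched_now(flags), wants_resched_eventually(flags));
    return 0;
}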

Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/thread_info.h | 6 ++++--
2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 928820e61cb5..5cd83b12f6fd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -277,6 +277,7 @@ config X86
select HAVE_STATIC_CALL
select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL
select HAVE_PREEMPT_DYNAMIC_CALL
+ select HAVE_PREEMPT_AUTO
select HAVE_RSEQ
select HAVE_RUST if X86_64
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 12da7dfd5ef1..6862bbbb98ab 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -87,8 +87,9 @@ struct thread_info {
#define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
#define TIF_SIGPENDING 2 /* signal pending */
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
-#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
-#define TIF_SSBD 5 /* Speculative store bypass disable */
+#define TIF_NEED_RESCHED_LAZY 4 /* Lazy rescheduling */
+#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/
+#define TIF_SSBD 6 /* Speculative store bypass disable */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
#define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
@@ -110,6 +111,7 @@ struct thread_info {
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
+#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
#define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
#define _TIF_SSBD (1 << TIF_SSBD)
#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
--
2.31.1


2024-05-28 00:41:16

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v2 25/35] sched/fair: refactor update_curr(), entity_tick()

When updating the task's runtime statistics via update_curr()
or entity_tick(), we call resched_curr() to reschedule if needed.

Refactor update_curr() and entity_tick() to only update the stats,
deferring any rescheduling needed to task_tick_fair() or
update_curr().
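
The shape of the refactor, sketched as a standalone toy (hypothetical
names; the real split is __update_curr()/update_curr() and
entity_tick()/task_tick_fair() in the diff below):

#include <stdbool.h>
#include <stdio.h>

/* Stats-only helper: reports whether a reschedule is warranted. */
static bool update_stats(int *budget, int used)
{
    *budget -= used;
    return *budget <= 0;        /* the caller decides what to do about it */
}

/* Single decision point, analogous to task_tick_fair(). */
static void tick(int *budgets, int n)
{
    bool resched = false;
    int i;

    for (i = 0; i < n; i++)
        resched |= update_stats(&budgets[i], 1);

    if (resched)
        printf("resched_curr()\n");     /* one call, after all updates */
}

int main(void)
{
    int budgets[2] = { 2, 1 };

    tick(budgets, 2);   /* budgets[1] hits 0 -> resched */
    return 0;
}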

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/fair.c | 54 ++++++++++++++++++++++-----------------------
1 file changed, 27 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c5171c247466..dd34709f294c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -981,10 +981,10 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
* XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
* this is probably good enough.
*/
-static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
if ((s64)(se->vruntime - se->deadline) < 0)
- return;
+ return false;

/*
* For EEVDF the virtual time slope is determined by w_i (iow.
@@ -1002,9 +1002,11 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
* The task has consumed its request, reschedule.
*/
if (cfs_rq->nr_running > 1) {
- resched_curr(rq_of(cfs_rq));
clear_buddies(cfs_rq, se);
+ return true;
}
+
+ return false;
}

#include "pelt.h"
@@ -1159,26 +1161,35 @@ s64 update_curr_common(struct rq *rq)
/*
* Update the current task's runtime statistics.
*/
-static void update_curr(struct cfs_rq *cfs_rq)
+static bool __update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
s64 delta_exec;
+ bool resched;

if (unlikely(!curr))
- return;
+ return false;

delta_exec = update_curr_se(rq_of(cfs_rq), curr);
if (unlikely(delta_exec <= 0))
- return;
+ return false;

curr->vruntime += calc_delta_fair(delta_exec, curr);
- update_deadline(cfs_rq, curr);
+ resched = update_deadline(cfs_rq, curr);
update_min_vruntime(cfs_rq);

if (entity_is_task(curr))
update_curr_task(task_of(curr), delta_exec);

account_cfs_rq_runtime(cfs_rq, delta_exec);
+
+ return resched;
+}
+
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+ if (__update_curr(cfs_rq))
+ resched_curr(rq_of(cfs_rq));
}

static void update_curr_fair(struct rq *rq)
@@ -5499,13 +5510,13 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
cfs_rq->curr = NULL;
}

-static void
-entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
+static bool
+entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
/*
* Update run-time statistics of the 'current'.
*/
- update_curr(cfs_rq);
+ bool resched = __update_curr(cfs_rq);

/*
* Ensure that runnable average is periodically updated.
@@ -5513,22 +5524,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
update_load_avg(cfs_rq, curr, UPDATE_TG);
update_cfs_group(curr);

-#ifdef CONFIG_SCHED_HRTICK
- /*
- * queued ticks are scheduled to match the slice, so don't bother
- * validating it and just reschedule.
- */
- if (queued) {
- resched_curr(rq_of(cfs_rq));
- return;
- }
- /*
- * don't let the period tick interfere with the hrtick preemption
- */
- if (!sched_feat(DOUBLE_TICK) &&
- hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
- return;
-#endif
+ return resched;
}


@@ -12611,12 +12607,16 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &curr->se;
+ bool resched = false;

for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
- entity_tick(cfs_rq, se, queued);
+ resched |= entity_tick(cfs_rq, se);
}

+ if (resched)
+ resched_curr(rq);
+
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);

--
2.31.1


2024-05-28 00:41:30

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v2 26/35] sched/fair: handle tick expiry under lazy preemption

The default policy for lazy scheduling is to schedule at exit-to-user,
so do that for all but deadline tasks. For deadline tasks, once the
task is no longer leftmost, force it to be scheduled away.

Always scheduling lazily, however, runs into the 'hog' problem -- the
target task might be running in the kernel and might not relinquish
the CPU on its own.

Handle that by upgrading the ignored tif_resched(RESCHED_LAZY) bit to
tif_resched(RESCHED_NOW) at the next tick.
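
A standalone sketch of the tick-time upgrade (illustrative only,
made-up names; the real check lives in resched_opt_translate() below
and tests the task's TIF bits):

#include <stdbool.h>
#include <stdio.h>

typedef enum { RESCHED_NOW, RESCHED_LAZY } resched_t;
enum resched_opt { RESCHED_DEFAULT, RESCHED_TICK };

struct task { bool lazy_pending; };

static resched_t translate(struct task *t, enum resched_opt opt)
{
    /*
     * Lazy bit still set when the next tick arrives: the task has not
     * rescheduled on its own, so force it away.
     */
    if (opt == RESCHED_TICK && t->lazy_pending)
        return RESCHED_NOW;

    return RESCHED_LAZY;
}

int main(void)
{
    struct task hog = { .lazy_pending = false };

    /* first tick: only a lazy request is registered */
    if (translate(&hog, RESCHED_TICK) == RESCHED_LAZY)
        hog.lazy_pending = true;

    /* second tick: the lazy bit is still pending, upgrade */
    printf("second tick: %s\n",
           translate(&hog, RESCHED_TICK) == RESCHED_NOW ?
           "RESCHED_NOW" : "RESCHED_LAZY");
    return 0;
}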

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 8 ++++++++
kernel/sched/deadline.c | 5 ++++-
kernel/sched/fair.c | 2 +-
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 6 ++++++
5 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e838328d93d1..2bc7f636267d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1051,6 +1051,14 @@ static resched_t resched_opt_translate(struct task_struct *curr,
if (is_idle_task(curr))
return RESCHED_NOW;

+ if (opt == RESCHED_TICK &&
+ unlikely(__test_tsk_need_resched(curr, RESCHED_LAZY)))
+ /*
+ * If the task hasn't switched away by the second tick,
+ * force it away by upgrading to TIF_NEED_RESCHED.
+ */
+ return RESCHED_NOW;
+
return RESCHED_LAZY;
}

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d24d6bfee293..cb0dd77508b1 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1378,8 +1378,11 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
enqueue_task_dl(rq, dl_task_of(dl_se), ENQUEUE_REPLENISH);
}

+ /*
+ * We are not leftmost anymore. Reschedule straight away.
+ */
if (!is_leftmost(dl_se, &rq->dl))
- resched_curr(rq);
+ __resched_curr(rq, RESCHED_FORCE);
}

/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dd34709f294c..faa6afe0af0d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12615,7 +12615,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
}

if (resched)
- resched_curr(rq);
+ resched_curr_tick(rq);

if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f0a6c9bb890b..4713783bbdef 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1023,7 +1023,7 @@ static void update_curr_rt(struct rq *rq)
rt_rq->rt_time += delta_exec;
exceeded = sched_rt_runtime_exceeded(rt_rq);
if (exceeded)
- resched_curr(rq);
+ resched_curr_tick(rq);
raw_spin_unlock(&rt_rq->rt_runtime_lock);
if (exceeded)
do_start_rt_bandwidth(sched_rt_bandwidth(rt_rq));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e5e4747fbef2..107c5fc2b7bb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2467,6 +2467,7 @@ extern void reweight_task(struct task_struct *p, int prio);
enum resched_opt {
RESCHED_DEFAULT,
RESCHED_FORCE,
+ RESCHED_TICK,
};

extern void __resched_curr(struct rq *rq, enum resched_opt opt);
@@ -2476,6 +2477,11 @@ static inline void resched_curr(struct rq *rq)
__resched_curr(rq, RESCHED_DEFAULT);
}

+static inline void resched_curr_tick(struct rq *rq)
+{
+ __resched_curr(rq, RESCHED_TICK);
+}
+
extern void resched_cpu(int cpu);

extern struct rt_bandwidth def_rt_bandwidth;
--
2.31.1


2024-05-28 00:41:55

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v2 29/35] sched: handle preempt=voluntary under PREEMPT_AUTO

The default preemption policy for voluntary preemption under
PREEMPT_AUTO is to schedule eagerly for tasks of a higher scheduling
class, and lazily for well-behaved, non-idle tasks.

This is the same policy as preempt=none, with the addition of eager
handling of higher-priority scheduling classes.
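
Sketched outside the kernel, the voluntary policy roughly reduces to
the following (made-up helper; the real logic is in the
resched_opt_translate() and wakeup_preempt() hunks below):

#include <stdbool.h>
#include <stdio.h>

typedef enum { RESCHED_NOW, RESCHED_LAZY } resched_t;
enum resched_opt { RESCHED_DEFAULT, RESCHED_PRIORITY };

/* preempt=voluntary: eager only for higher sched class wakeups or idle. */
static resched_t voluntary_policy(bool curr_is_idle, enum resched_opt opt)
{
    if (opt == RESCHED_PRIORITY)    /* woken task is of a higher class */
        return RESCHED_NOW;

    if (curr_is_idle)
        return RESCHED_NOW;

    return RESCHED_LAZY;            /* everything else runs out its slice */
}

int main(void)
{
    printf("%d %d %d\n",
           voluntary_policy(false, RESCHED_DEFAULT),   /* 1 == RESCHED_LAZY */
           voluntary_policy(false, RESCHED_PRIORITY),  /* 0 == RESCHED_NOW */
           voluntary_policy(true,  RESCHED_DEFAULT));  /* 0 == RESCHED_NOW */
    return 0;
}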

Comparing a cyclictest workload with a background kernel load of
'stress-ng --mmap', shows that both the average and the maximum
latencies improve:

# stress-ng --mmap 0 &
# cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0 -q -D 300

Min ( %stdev ) Act ( %stdev ) Avg ( %stdev ) Max ( %stdev )

PREEMPT_AUTO, preempt=voluntary 1.73 ( +- 25.43% ) 62.16 ( +- 303.39% ) 14.92 ( +- 17.96% ) 2778.22 ( +- 15.04% )
PREEMPT_DYNAMIC, preempt=voluntary 1.83 ( +- 20.76% ) 253.45 ( +- 233.21% ) 18.70 ( +- 15.88% ) 2992.45 ( +- 15.95% )

The table above shows the aggregated latencies across all CPUs.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 12 ++++++++----
kernel/sched/sched.h | 6 ++++++
2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c25cccc09b65..2bc3ae21a9d0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1052,6 +1052,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
if (preempt_model_preemptible())
return RESCHED_NOW;

+ if (preempt_model_voluntary() && opt == RESCHED_PRIORITY)
+ return RESCHED_NOW;
+
if (is_idle_task(curr))
return RESCHED_NOW;

@@ -2289,7 +2292,7 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
if (p->sched_class == rq->curr->sched_class)
rq->curr->sched_class->wakeup_preempt(rq, p, flags);
else if (sched_class_above(p->sched_class, rq->curr->sched_class))
- resched_curr(rq);
+ resched_curr_priority(rq);

/*
* A queue event has occurred, and we're going to schedule. In
@@ -8989,11 +8992,11 @@ static void __sched_dynamic_update(int mode)
case preempt_dynamic_none:
if (mode != preempt_dynamic_mode)
pr_info("%s: none\n", PREEMPT_MODE);
- preempt_dynamic_mode = mode;
break;

case preempt_dynamic_voluntary:
- preempt_dynamic_mode = preempt_dynamic_undefined;
+ if (mode != preempt_dynamic_mode)
+ pr_info("%s: voluntary\n", PREEMPT_MODE);
break;

case preempt_dynamic_full:
@@ -9003,9 +9006,10 @@ static void __sched_dynamic_update(int mode)

if (mode != preempt_dynamic_mode)
pr_info("%s: full\n", PREEMPT_MODE);
- preempt_dynamic_mode = mode;
break;
}
+
+ preempt_dynamic_mode = mode;
}

#endif /* CONFIG_PREEMPT_AUTO */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 107c5fc2b7bb..ee8e99a9a677 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2468,6 +2468,7 @@ enum resched_opt {
RESCHED_DEFAULT,
RESCHED_FORCE,
RESCHED_TICK,
+ RESCHED_PRIORITY,
};

extern void __resched_curr(struct rq *rq, enum resched_opt opt);
@@ -2482,6 +2483,11 @@ static inline void resched_curr_tick(struct rq *rq)
__resched_curr(rq, RESCHED_TICK);
}

+static inline void resched_curr_priority(struct rq *rq)
+{
+ __resched_curr(rq, RESCHED_PRIORITY);
+}
+
extern void resched_cpu(int cpu);

extern struct rt_bandwidth def_rt_bandwidth;
--
2.31.1


2024-05-28 00:42:16

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v2 28/35] sched: support preempt=full under PREEMPT_AUTO

The default preemption policy for preempt-full under PREEMPT_AUTO is
to minimize latency, and thus to always schedule eagerly. This is
identical to CONFIG_PREEMPT, and so should result in similar
performance.

Comparing scheduling/IPC workload:

# perf stat -a -e cs --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000

PREEMPT_AUTO, preempt=full

3,080,508 context-switches ( +- 0.64% )
3.65171 +- 0.00654 seconds time elapsed ( +- 0.18% )

PREEMPT_DYNAMIC, preempt=full

3,087,527 context-switches ( +- 0.33% )
3.60163 +- 0.00633 seconds time elapsed ( +- 0.18% )

Looking at the breakup between voluntary and involuntary
context-switches, we see almost identical behaviour as well.

PREEMPT_AUTO, preempt=full

2087910.00 +- 34720.95 voluntary context-switches ( +- 1.660% )
784437.60 +- 19827.79 involuntary context-switches ( +- 2.520% )

PREEMPT_DYNAMIC, preempt=full

2102879.60 +- 22767.11 voluntary context-switches ( +- 1.080% )
801189.90 +- 21324.18 involuntary context-switches ( +- 2.660% )

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c3ba33c77053..c25cccc09b65 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1035,9 +1035,10 @@ void wake_up_q(struct wake_q_head *head)
* For preemption models other than PREEMPT_AUTO: always schedule
* eagerly.
*
- * For PREEMPT_AUTO: schedule idle threads eagerly, allow everything else
- * to finish its time quanta, and mark for rescheduling at the next exit
- * to user.
+ * For PREEMPT_AUTO: schedule idle threads eagerly, and under full
+ * preemption, all tasks eagerly. Otherwise, allow everything else
+ * to finish its time quanta, and mark for rescheduling at the next
+ * exit to user.
*/
static resched_t resched_opt_translate(struct task_struct *curr,
enum resched_opt opt)
@@ -1048,6 +1049,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
if (opt == RESCHED_FORCE)
return RESCHED_NOW;

+ if (preempt_model_preemptible())
+ return RESCHED_NOW;
+
if (is_idle_task(curr))
return RESCHED_NOW;

@@ -8997,7 +9001,9 @@ static void __sched_dynamic_update(int mode)
pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
PREEMPT_MODE);

- preempt_dynamic_mode = preempt_dynamic_undefined;
+ if (mode != preempt_dynamic_mode)
+ pr_info("%s: full\n", PREEMPT_MODE);
+ preempt_dynamic_mode = mode;
break;
}
}
--
2.31.1


2024-05-28 00:42:40

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v2 30/35] sched: latency warn for TIF_NEED_RESCHED_LAZY

resched_latency_warn() now also warns if TIF_NEED_RESCHED_LAZY is set
without rescheduling for more than the latency_warn_ms period.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 2 +-
kernel/sched/debug.c | 7 +++++--
2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2bc3ae21a9d0..4f0ac90b7d47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5673,7 +5673,7 @@ static u64 cpu_resched_latency(struct rq *rq)
if (sysctl_resched_latency_warn_once && warned_once)
return 0;

- if (!need_resched() || !latency_warn_ms)
+ if ((!need_resched() && !need_resched_lazy()) || !latency_warn_ms)
return 0;

if (system_state == SYSTEM_BOOTING)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e53f1b73bf4a..a1be40101844 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1114,9 +1114,12 @@ void proc_sched_set_task(struct task_struct *p)
void resched_latency_warn(int cpu, u64 latency)
{
static DEFINE_RATELIMIT_STATE(latency_check_ratelimit, 60 * 60 * HZ, 1);
+ char *nr;
+
+ nr = __tif_need_resched(RESCHED_NOW) ? "need_resched" : "need_resched_lazy";

WARN(__ratelimit(&latency_check_ratelimit),
- "sched: CPU %d need_resched set for > %llu ns (%d ticks) "
+ "sched: CPU %d %s set for > %llu ns (%d ticks) "
"without schedule\n",
- cpu, latency, cpu_rq(cpu)->ticks_without_resched);
+ cpu, nr, latency, cpu_rq(cpu)->ticks_without_resched);
}
--
2.31.1


2024-05-28 00:48:12

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v2 31/35] tracing: support lazy resched

trace_entry::flags is full, so reuse the TRACE_FLAG_IRQS_NOSUPPORT
bit for this. The flag is safe to reuse since it is only used in
old archs that don't support lockdep irq tracing.

Also, now that we have a variety of need-resched combinations, document
these in the tracing headers.
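
For reference, a standalone decoder for the resulting need-resched
column, mirroring the switch added to trace_output.c below (the
TRACE_FLAG_PREEMPT_RESCHED value is taken from the existing enum):

#include <stdio.h>

#define TRACE_FLAG_NEED_RESCHED_LAZY    0x02
#define TRACE_FLAG_NEED_RESCHED         0x04
#define TRACE_FLAG_PREEMPT_RESCHED      0x20

static char need_resched_char(unsigned int flags)
{
    unsigned int now  = flags & TRACE_FLAG_NEED_RESCHED;
    unsigned int lazy = flags & TRACE_FLAG_NEED_RESCHED_LAZY;
    unsigned int prr  = flags & TRACE_FLAG_PREEMPT_RESCHED;

    if (now && lazy && prr) return 'B';
    if (now && prr)         return 'N';
    if (lazy && prr)        return 'L';
    if (now && lazy)        return 'b';
    if (now)                return 'n';
    if (lazy)               return 'l';
    if (prr)                return 'p';
    return '.';
}

int main(void)
{
    printf("%c %c %c\n",
           need_resched_char(TRACE_FLAG_NEED_RESCHED_LAZY),                          /* l */
           need_resched_char(TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED),  /* N */
           need_resched_char(0));                                                    /* . */
    return 0;
}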

Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
include/linux/trace_events.h | 6 +++---
kernel/trace/trace.c | 28 ++++++++++++++++++----------
kernel/trace/trace_output.c | 16 ++++++++++++++--
3 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 6f9bdfb09d1d..329002785b4d 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -184,7 +184,7 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status);

enum trace_flag_type {
TRACE_FLAG_IRQS_OFF = 0x01,
- TRACE_FLAG_IRQS_NOSUPPORT = 0x02,
+ TRACE_FLAG_NEED_RESCHED_LAZY = 0x02,
TRACE_FLAG_NEED_RESCHED = 0x04,
TRACE_FLAG_HARDIRQ = 0x08,
TRACE_FLAG_SOFTIRQ = 0x10,
@@ -211,11 +211,11 @@ static inline unsigned int tracing_gen_ctx(void)

static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
{
- return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
+ return tracing_gen_ctx_irq_test(0);
}
static inline unsigned int tracing_gen_ctx(void)
{
- return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
+ return tracing_gen_ctx_irq_test(0);
}
#endif

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ed229527be05..7941e9ec979a 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2513,6 +2513,8 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status)

if (__tif_need_resched(RESCHED_NOW))
trace_flags |= TRACE_FLAG_NEED_RESCHED;
+ if (__tif_need_resched(RESCHED_LAZY))
+ trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
if (test_preempt_need_resched())
trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |
@@ -4096,17 +4098,23 @@ unsigned long trace_total_entries(struct trace_array *tr)
return entries;
}

+#ifdef CONFIG_PREEMPT_AUTO
+#define NR_LEGEND "l: lazy, n: now, p: preempt, b: l|n, L: l|p, N: n|p, B: l|n|p"
+#else
+#define NR_LEGEND "n: now, p: preempt, N: n|p"
+#endif
+
static void print_lat_help_header(struct seq_file *m)
{
- seq_puts(m, "# _------=> CPU# \n"
- "# / _-----=> irqs-off/BH-disabled\n"
- "# | / _----=> need-resched \n"
- "# || / _---=> hardirq/softirq \n"
- "# ||| / _--=> preempt-depth \n"
- "# |||| / _-=> migrate-disable \n"
- "# ||||| / delay \n"
- "# cmd pid |||||| time | caller \n"
- "# \\ / |||||| \\ | / \n");
+ seq_printf(m, "# _------=> CPU# \n"
+ "# / _-----=> irqs-off/BH-disabled\n"
+ "# | / _----=> need-resched ( %s ) \n"
+ "# || / _---=> hardirq/softirq \n"
+ "# ||| / _--=> preempt-depth \n"
+ "# |||| / _-=> migrate-disable \n"
+ "# ||||| / delay \n"
+ "# cmd pid |||||| time | caller \n"
+ "# \\ / |||||| \\ | / \n", NR_LEGEND);
}

static void print_event_info(struct array_buffer *buf, struct seq_file *m)
@@ -4141,7 +4149,7 @@ static void print_func_help_header_irq(struct array_buffer *buf, struct seq_file
print_event_info(buf, m);

seq_printf(m, "# %.*s _-----=> irqs-off/BH-disabled\n", prec, space);
- seq_printf(m, "# %.*s / _----=> need-resched\n", prec, space);
+ seq_printf(m, "# %.*s / _----=> need-resched ( %s )\n", prec, space, NR_LEGEND);
seq_printf(m, "# %.*s| / _---=> hardirq/softirq\n", prec, space);
seq_printf(m, "# %.*s|| / _--=> preempt-depth\n", prec, space);
seq_printf(m, "# %.*s||| / _-=> migrate-disable\n", prec, space);
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d8b302d01083..4f58a196e14c 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
(entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
bh_off ? 'b' :
- (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
+ !IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
'.';

- switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
+ switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
TRACE_FLAG_PREEMPT_RESCHED)) {
+ case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
+ need_resched = 'B';
+ break;
case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
need_resched = 'N';
break;
+ case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
+ need_resched = 'L';
+ break;
+ case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
+ need_resched = 'b';
+ break;
case TRACE_FLAG_NEED_RESCHED:
need_resched = 'n';
break;
+ case TRACE_FLAG_NEED_RESCHED_LAZY:
+ need_resched = 'l';
+ break;
case TRACE_FLAG_PREEMPT_RESCHED:
need_resched = 'p';
break;
--
2.31.1


2024-05-28 00:48:37

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v2 32/35] Documentation: tracing: add TIF_NEED_RESCHED_LAZY

Document various combinations of resched flags.

Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
Documentation/trace/ftrace.rst | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 7e7b8ec17934..7f20c0bae009 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -1036,8 +1036,12 @@ explains which is which.
be printed here.

need-resched:
- - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
+ - 'B' all three, TIF_NEED_RESCHED, TIF_NEED_RESCHED_LAZY and PREEMPT_NEED_RESCHED are set,
+ - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED are set,
+ - 'L' both TIF_NEED_RESCHED_LAZY and PREEMPT_NEED_RESCHED are set,
+ - 'b' both TIF_NEED_RESCHED and TIF_NEED_RESCHED_LAZY are set,
- 'n' only TIF_NEED_RESCHED is set,
+ - 'l' only TIF_NEED_RESCHED_LAZY is set,
- 'p' only PREEMPT_NEED_RESCHED is set,
- '.' otherwise.

--
2.31.1


2024-05-28 00:53:19

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v2 33/35] osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y

To reduce RCU noise for nohz_full configurations, osnoise depends
on cond_resched() providing quiescent states for PREEMPT_RCU=n
configurations. For PREEMPT_RCU=y configurations, it does this by
directly calling rcu_momentary_dyntick_idle().

With PREEMPT_AUTO=y, however, we can have configurations with
(PREEMPTION=y, PREEMPT_RCU=n), which means neither of the above can
help.

Handle that by falling back to explicit quiescent states via
rcu_momentary_dyntick_idle().
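
A throwaway sketch of which quiescent-state mechanism applies per
configuration (plain booleans stand in for the Kconfig symbols;
illustration only):

#include <stdbool.h>
#include <stdio.h>

static const char *qs_mechanism(bool preempt_rcu, bool preemption)
{
    /* matches: PREEMPT_RCU || (!PREEMPT_RCU && PREEMPTION) */
    if (preempt_rcu || preemption)
        return "rcu_momentary_dyntick_idle()";

    return "cond_resched()";    /* non-preemptible PREEMPT_RCU=n kernels */
}

int main(void)
{
    printf("PREEMPT_RCU=y               -> %s\n", qs_mechanism(true,  true));
    printf("PREEMPT_RCU=n, PREEMPTION=y -> %s\n", qs_mechanism(false, true));
    printf("PREEMPT_RCU=n, PREEMPTION=n -> %s\n", qs_mechanism(false, false));
    return 0;
}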

Cc: Paul E. McKenney <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Steven Rostedt <[email protected]>
Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/trace/trace_osnoise.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index a8e28f9b9271..88d2cd2593c4 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -1532,18 +1532,20 @@ static int run_osnoise(void)
/*
* In some cases, notably when running on a nohz_full CPU with
* a stopped tick PREEMPT_RCU has no way to account for QSs.
- * This will eventually cause unwarranted noise as PREEMPT_RCU
- * will force preemption as the means of ending the current
- * grace period. We avoid this problem by calling
- * rcu_momentary_dyntick_idle(), which performs a zero duration
- * EQS allowing PREEMPT_RCU to end the current grace period.
- * This call shouldn't be wrapped inside an RCU critical
- * section.
+ * This will eventually cause unwarranted noise as RCU forces
+ * preemption as the means of ending the current grace period.
+ * We avoid this by calling rcu_momentary_dyntick_idle(),
+ * which performs a zero duration EQS allowing RCU to end the
+ * current grace period. This call shouldn't be wrapped inside
+ * an RCU critical section.
*
- * Note that in non PREEMPT_RCU kernels QSs are handled through
- * cond_resched()
+ * For non-PREEMPT_RCU kernels with cond_resched() (non-
+ * PREEMPT_AUTO configurations), QSs are handled through
+ * cond_resched(). For PREEMPT_AUTO kernels, we fallback to the
+ * zero duration QS via rcu_momentary_dyntick_idle().
*/
- if (IS_ENABLED(CONFIG_PREEMPT_RCU)) {
+ if (IS_ENABLED(CONFIG_PREEMPT_RCU) ||
+ (!IS_ENABLED(CONFIG_PREEMPT_RCU) && IS_ENABLED(CONFIG_PREEMPTION))) {
if (!disable_irq)
local_irq_disable();

--
2.31.1


Subject: Re: [PATCH v2 33/35] osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y

On 5/28/24 02:35, Ankur Arora wrote:
> To reduce RCU noise for nohz_full configurations, osnoise depends
> on cond_resched() providing quiescent states for PREEMPT_RCU=n
> configurations. And, for PREEMPT_RCU=y configurations does this
> by directly calling rcu_momentary_dyntick_idle().
>
> With PREEMPT_AUTO=y, however, we can have configurations with
> (PREEMPTION=y, PREEMPT_RCU=n), which means neither of the above can
> help.
>
> Handle that by fallback to the explicit quiescent states via
> rcu_momentary_dyntick_idle().
>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Suggested-by: Paul E. McKenney <[email protected]>
> Signed-off-by: Ankur Arora <[email protected]>

Acked-by: Daniel Bristot de Oliveira <[email protected]>

Thanks
-- Daniel

2024-05-28 16:03:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 06/35] thread_info: define __tif_need_resched(resched_t)

On Mon, May 27, 2024 at 05:34:52PM -0700, Ankur Arora wrote:

> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index 65e5beedc915..e246b01553a5 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -216,22 +216,44 @@ static __always_inline unsigned long read_ti_thread_flags(struct thread_info *ti
>
> #ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H
>
> -static __always_inline bool tif_need_resched(void)
> +static __always_inline bool __tif_need_resched_bitop(int nr_flag)
> {
> - return arch_test_bit(TIF_NEED_RESCHED,
> - (unsigned long *)(&current_thread_info()->flags));
> + return arch_test_bit(nr_flag,
> + (unsigned long *)(&current_thread_info()->flags));
> }
>
> #else
>
> -static __always_inline bool tif_need_resched(void)
> +static __always_inline bool __tif_need_resched_bitop(int nr_flag)
> {
> - return test_bit(TIF_NEED_RESCHED,
> - (unsigned long *)(&current_thread_info()->flags));
> + return test_bit(nr_flag,
> + (unsigned long *)(&current_thread_info()->flags));
> }

:se cino=(0:0

That is, you're wrecking the indentation here.

>
> #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
>
> +static __always_inline bool __tif_need_resched(resched_t rs)
> +{
> + /*
> + * With !PREEMPT_AUTO, this check is only meaningful if we
> + * are checking if tif_resched(RESCHED_NOW) is set.
> + */
> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO) || rs == RESCHED_NOW)
> + return __tif_need_resched_bitop(tif_resched(rs));
> + else
> + return false;
> +}

if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) && rs == RESCHED_LAZY)
return false;

return __tif_need_resched_bitop(tif_resched(rs));


> +
> +static __always_inline bool tif_need_resched(void)
> +{
> + return __tif_need_resched(RESCHED_NOW);
> +}
> +
> +static __always_inline bool tif_need_resched_lazy(void)
> +{
> + return __tif_need_resched(RESCHED_LAZY);
> +}
> +
> #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
> static inline int arch_within_stack_frames(const void * const stack,
> const void * const stackend,
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 233d1af39fff..ed229527be05 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -2511,7 +2511,7 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status)
> if (softirq_count() >> (SOFTIRQ_SHIFT + 1))
> trace_flags |= TRACE_FLAG_BH_OFF;
>
> - if (tif_need_resched())
> + if (__tif_need_resched(RESCHED_NOW))
> trace_flags |= TRACE_FLAG_NEED_RESCHED;

Per the above this is a NO-OP.

2024-05-28 16:26:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 12/35] sched: separate PREEMPT_DYNAMIC config logic

On Mon, May 27, 2024 at 05:34:58PM -0700, Ankur Arora wrote:
> Pull out the PREEMPT_DYNAMIC setup logic to allow other preemption
> models to dynamically configure preemption.

Uh what ?!? What's the point of creating back-to-back #ifdef sections ?

> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> kernel/sched/core.c | 165 +++++++++++++++++++++++---------------------
> 1 file changed, 86 insertions(+), 79 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0c26b60c1101..349f6257fdcd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8713,6 +8713,89 @@ int __cond_resched_rwlock_write(rwlock_t *lock)
> }
> EXPORT_SYMBOL(__cond_resched_rwlock_write);
>
> +#if defined(CONFIG_PREEMPT_DYNAMIC)
> +
> +#define PREEMPT_MODE "Dynamic Preempt"
> +
> +enum {
> + preempt_dynamic_undefined = -1,
> + preempt_dynamic_none,
> + preempt_dynamic_voluntary,
> + preempt_dynamic_full,
> +};
> +
> +int preempt_dynamic_mode = preempt_dynamic_undefined;
> +static DEFINE_MUTEX(sched_dynamic_mutex);
> +
> +int sched_dynamic_mode(const char *str)
> +{
> + if (!strcmp(str, "none"))
> + return preempt_dynamic_none;
> +
> + if (!strcmp(str, "voluntary"))
> + return preempt_dynamic_voluntary;
> +
> + if (!strcmp(str, "full"))
> + return preempt_dynamic_full;
> +
> + return -EINVAL;
> +}
> +
> +static void __sched_dynamic_update(int mode);
> +void sched_dynamic_update(int mode)
> +{
> + mutex_lock(&sched_dynamic_mutex);
> + __sched_dynamic_update(mode);
> + mutex_unlock(&sched_dynamic_mutex);
> +}
> +
> +static void __init preempt_dynamic_init(void)
> +{
> + if (preempt_dynamic_mode == preempt_dynamic_undefined) {
> + if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
> + sched_dynamic_update(preempt_dynamic_none);
> + } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
> + sched_dynamic_update(preempt_dynamic_voluntary);
> + } else {
> + /* Default static call setting, nothing to do */
> + WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
> + preempt_dynamic_mode = preempt_dynamic_full;
> + pr_info("%s: full\n", PREEMPT_MODE);
> + }
> + }
> +}
> +
> +static int __init setup_preempt_mode(char *str)
> +{
> + int mode = sched_dynamic_mode(str);
> + if (mode < 0) {
> + pr_warn("%s: unsupported mode: %s\n", PREEMPT_MODE, str);
> + return 0;
> + }
> +
> + sched_dynamic_update(mode);
> + return 1;
> +}
> +__setup("preempt=", setup_preempt_mode);
> +
> +#define PREEMPT_MODEL_ACCESSOR(mode) \
> + bool preempt_model_##mode(void) \
> + { \
> + WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
> + return preempt_dynamic_mode == preempt_dynamic_##mode; \
> + } \
> + EXPORT_SYMBOL_GPL(preempt_model_##mode)
> +
> +PREEMPT_MODEL_ACCESSOR(none);
> +PREEMPT_MODEL_ACCESSOR(voluntary);
> +PREEMPT_MODEL_ACCESSOR(full);
> +
> +#else /* !CONFIG_PREEMPT_DYNAMIC */
> +
> +static inline void preempt_dynamic_init(void) { }
> +
> +#endif /* !CONFIG_PREEMPT_DYNAMIC */
> +
> #ifdef CONFIG_PREEMPT_DYNAMIC
>
> #ifdef CONFIG_GENERIC_ENTRY
> @@ -8749,29 +8832,6 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
> * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
> */
>
> -enum {
> - preempt_dynamic_undefined = -1,
> - preempt_dynamic_none,
> - preempt_dynamic_voluntary,
> - preempt_dynamic_full,
> -};
> -
> -int preempt_dynamic_mode = preempt_dynamic_undefined;
> -
> -int sched_dynamic_mode(const char *str)
> -{
> - if (!strcmp(str, "none"))
> - return preempt_dynamic_none;
> -
> - if (!strcmp(str, "voluntary"))
> - return preempt_dynamic_voluntary;
> -
> - if (!strcmp(str, "full"))
> - return preempt_dynamic_full;
> -
> - return -EINVAL;
> -}
> -
> #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> #define preempt_dynamic_enable(f) static_call_update(f, f##_dynamic_enabled)
> #define preempt_dynamic_disable(f) static_call_update(f, f##_dynamic_disabled)
> @@ -8782,7 +8842,6 @@ int sched_dynamic_mode(const char *str)
> #error "Unsupported PREEMPT_DYNAMIC mechanism"
> #endif
>
> -static DEFINE_MUTEX(sched_dynamic_mutex);
> static bool klp_override;
>
> static void __sched_dynamic_update(int mode)
> @@ -8807,7 +8866,7 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_disable(preempt_schedule_notrace);
> preempt_dynamic_disable(irqentry_exit_cond_resched);
> if (mode != preempt_dynamic_mode)
> - pr_info("Dynamic Preempt: none\n");
> + pr_info("%s: none\n", PREEMPT_MODE);
> break;
>
> case preempt_dynamic_voluntary:
> @@ -8818,7 +8877,7 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_disable(preempt_schedule_notrace);
> preempt_dynamic_disable(irqentry_exit_cond_resched);
> if (mode != preempt_dynamic_mode)
> - pr_info("Dynamic Preempt: voluntary\n");
> + pr_info("%s: voluntary\n", PREEMPT_MODE);
> break;
>
> case preempt_dynamic_full:
> @@ -8829,20 +8888,13 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_enable(preempt_schedule_notrace);
> preempt_dynamic_enable(irqentry_exit_cond_resched);
> if (mode != preempt_dynamic_mode)
> - pr_info("Dynamic Preempt: full\n");
> + pr_info("%s: full\n", PREEMPT_MODE);
> break;
> }
>
> preempt_dynamic_mode = mode;
> }
>
> -void sched_dynamic_update(int mode)
> -{
> - mutex_lock(&sched_dynamic_mutex);
> - __sched_dynamic_update(mode);
> - mutex_unlock(&sched_dynamic_mutex);
> -}
> -
> #ifdef CONFIG_HAVE_PREEMPT_DYNAMIC_CALL
>
> static int klp_cond_resched(void)
> @@ -8873,51 +8925,6 @@ void sched_dynamic_klp_disable(void)
>
> #endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
>
> -static int __init setup_preempt_mode(char *str)
> -{
> - int mode = sched_dynamic_mode(str);
> - if (mode < 0) {
> - pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
> - return 0;
> - }
> -
> - sched_dynamic_update(mode);
> - return 1;
> -}
> -__setup("preempt=", setup_preempt_mode);
> -
> -static void __init preempt_dynamic_init(void)
> -{
> - if (preempt_dynamic_mode == preempt_dynamic_undefined) {
> - if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
> - sched_dynamic_update(preempt_dynamic_none);
> - } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
> - sched_dynamic_update(preempt_dynamic_voluntary);
> - } else {
> - /* Default static call setting, nothing to do */
> - WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
> - preempt_dynamic_mode = preempt_dynamic_full;
> - pr_info("Dynamic Preempt: full\n");
> - }
> - }
> -}
> -
> -#define PREEMPT_MODEL_ACCESSOR(mode) \
> - bool preempt_model_##mode(void) \
> - { \
> - WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
> - return preempt_dynamic_mode == preempt_dynamic_##mode; \
> - } \
> - EXPORT_SYMBOL_GPL(preempt_model_##mode)
> -
> -PREEMPT_MODEL_ACCESSOR(none);
> -PREEMPT_MODEL_ACCESSOR(voluntary);
> -PREEMPT_MODEL_ACCESSOR(full);
> -
> -#else /* !CONFIG_PREEMPT_DYNAMIC */
> -
> -static inline void preempt_dynamic_init(void) { }
> -
> #endif /* #ifdef CONFIG_PREEMPT_DYNAMIC */
>
> /**
> --
> 2.31.1
>

2024-05-28 17:01:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 05/35] thread_info: selector for TIF_NEED_RESCHED[_LAZY]

On Mon, May 27, 2024 at 05:34:51PM -0700, Ankur Arora wrote:
> Define tif_resched() to serve as selector for the specific
> need-resched flag: with tif_resched() mapping to TIF_NEED_RESCHED
> or to TIF_NEED_RESCHED_LAZY.
>
> For !CONFIG_PREEMPT_AUTO, tif_resched() always evaluates
> to TIF_NEED_RESCHED, preserving existing scheduling behaviour.
>
> Cc: Peter Ziljstra <[email protected]>
> Originally-by: Thomas Gleixner <[email protected]>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> include/linux/thread_info.h | 25 +++++++++++++++++++++++++
> 1 file changed, 25 insertions(+)
>
> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index 06e13e7acbe2..65e5beedc915 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -71,6 +71,31 @@ enum syscall_work_bit {
> #define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
> #endif
>
> +typedef enum {
> + RESCHED_NOW = 0,
> + RESCHED_LAZY = 1,
> +} resched_t;
> +
> +/*
> + * tif_resched(r) maps to TIF_NEED_RESCHED[_LAZY] with CONFIG_PREEMPT_AUTO.
> + *
> + * For !CONFIG_PREEMPT_AUTO, both tif_resched(RESCHED_NOW) and
> + * tif_resched(RESCHED_LAZY) reduce to the same value (TIF_NEED_RESCHED)
> + * leaving any scheduling behaviour unchanged.
> + */
> +static __always_inline int tif_resched(resched_t rs)
> +{
> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
> + return (rs == RESCHED_NOW) ? TIF_NEED_RESCHED : TIF_NEED_RESCHED_LAZY;
> + else
> + return TIF_NEED_RESCHED;
> +}

Perhaps:

if (IS_ENABLED(CONFIG_PREEMPT_AUTO) && rs == RESCHED_LAZY)
return TIF_NEED_RESCHED_LAZY;

return TIF_NEED_RESCHED;

hmm?

> +
> +static __always_inline int _tif_resched(resched_t rs)
> +{
> + return 1 << tif_resched(rs);
> +}
> +
> #ifdef __KERNEL__
>
> #ifndef arch_set_restart_data
> --
> 2.31.1
>

2024-05-28 17:26:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 08/35] entry: handle lazy rescheduling at user-exit

On Mon, May 27, 2024 at 05:34:54PM -0700, Ankur Arora wrote:
> The scheduling policy for TIF_NEED_RESCHED_LAZY is to allow the
> running task to voluntarily schedule out, running it to completion.
>
> For archs with GENERIC_ENTRY, do this by adding a check in
> exit_to_user_mode_loop().
>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Originally-by: Thomas Gleixner <[email protected]>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> include/linux/entry-common.h | 2 +-
> kernel/entry/common.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index b0fb775a600d..f5bb19369973 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -65,7 +65,7 @@
> #define EXIT_TO_USER_MODE_WORK \
> (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
> _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
> - ARCH_EXIT_TO_USER_MODE_WORK)
> + _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)

Should we be wanting both TIF_NEED_RESCHED flags side-by-side?

2024-05-28 17:32:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 09/35] entry/kvm: handle lazy rescheduling at guest-entry

On Mon, May 27, 2024 at 05:34:55PM -0700, Ankur Arora wrote:
> Archs defining CONFIG_KVM_XFER_TO_GUEST_WORK call
> xfer_to_guest_mode_handle_work() from various KVM vcpu-run
> loops to check for any task work including rescheduling.
>
> Handle TIF_NEED_RESCHED_LAZY alongside TIF_NEED_RESCHED.
>
> Also, while at it, remove the explicit check for need_resched() in
> the exit condition as that is already covered in the loop condition.
>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Originally-by: Thomas Gleixner <[email protected]>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> include/linux/entry-kvm.h | 2 +-
> kernel/entry/kvm.c | 4 ++--
> 2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
> index 6813171afccb..674a622c91be 100644
> --- a/include/linux/entry-kvm.h
> +++ b/include/linux/entry-kvm.h
> @@ -18,7 +18,7 @@
>
> #define XFER_TO_GUEST_MODE_WORK \
> (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
> - _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
> + _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)

Same as last patch, it seems weird to have both RESCHED flags so far
apart.

2024-05-28 19:27:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 10/35] entry: irqentry_exit only preempts for TIF_NEED_RESCHED

On Mon, May 27, 2024 at 05:34:56PM -0700, Ankur Arora wrote:
> Use __tif_need_resched(RESCHED_NOW) instead of need_resched() to be
> explicit that this path only reschedules if it is needed imminently.
>
> Also, add a comment about why we need a need-resched check here at
> all, given that the top level conditional has already checked the
> preempt_count().
>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Originally-by: Thomas Gleixner <[email protected]>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> kernel/entry/common.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index bcb23c866425..c684385921de 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -307,7 +307,16 @@ void raw_irqentry_exit_cond_resched(void)
> rcu_irq_exit_check_preempt();
> if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
> WARN_ON_ONCE(!on_thread_stack());
> - if (need_resched())
> +
> + /*
> + * Check if we need to preempt eagerly.
> + *
> + * Note: we need an explicit check here because some
> + * architectures don't fold TIF_NEED_RESCHED in the
> + * preempt_count. For archs that do, this is already covered
> + * in the conditional above.
> + */
> + if (__tif_need_resched(RESCHED_NOW))
> preempt_schedule_irq();

Seeing how you introduced need_resched_lazy() and kept need_resched() to
be the NOW thing, I really don't see the point of using the long form
here?


2024-05-28 21:36:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO

On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
> Reuse sched_dynamic_update() and related logic to enable choosing
> the preemption model at boot or runtime for PREEMPT_AUTO.
>
> The interface is identical to PREEMPT_DYNAMIC.

Colour me confused, why?!? What are you doing and why aren't you just
adding AUTO to the existing DYNAMIC thing?

2024-05-29 06:19:06

by Shrikanth Hegde

[permalink] [raw]
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling



On 5/28/24 6:04 AM, Ankur Arora wrote:
> Hi,
>
> This series adds a new scheduling model PREEMPT_AUTO, which like
> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend
> on explicit preemption points for the voluntary models.
>
> The series is based on Thomas' original proposal which he outlined
> in [1], [2] and in his PoC [3].
>
> v2 mostly reworks v1, with one of the main changes having less
> noisy need-resched-lazy related interfaces.
> More details in the changelog below.
>

Hi Ankur. Thanks for the series.

nit: had to manually apply patches 11, 12, 13 since they didn't apply cleanly
on tip/master and tip/sched/core, mostly due to some word differences in the change.

tip/master was at:
commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD)
Merge: 5d145493a139 47ff30cc1be7
Author: Ingo Molnar <[email protected]>
Date: Tue May 28 12:44:26 2024 +0200

Merge branch into tip/master: 'x86/percpu'



> The v1 of the series is at [4] and the RFC at [5].
>
> Design
> ==
>
> PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
> PREEMPT_COUNT). This means that the scheduler can always safely
> preempt. (This is identical to CONFIG_PREEMPT.)
>
> Having that, the next step is to make the rescheduling policy dependent
> on the chosen scheduling model. Currently, the scheduler uses a single
> need-resched bit (TIF_NEED_RESCHED) which it uses to state that a
> reschedule is needed.
> PREEMPT_AUTO extends this by adding an additional need-resched bit
> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED now allows the
> scheduler to express two kinds of rescheduling intent: schedule at
> the earliest opportunity (TIF_NEED_RESCHED), or express a need for
> rescheduling while allowing the task on the runqueue to run to
> timeslice completion (TIF_NEED_RESCHED_LAZY).
>
> The scheduler decides which need-resched bits are chosen based on
> the preemption model in use:
>
> TIF_NEED_RESCHED TIF_NEED_RESCHED_LAZY
>
> none never always [*]
> voluntary higher sched class other tasks [*]
> full always never
>
> [*] some details elided.
>
> The last part of the puzzle is, when does preemption happen, or
> alternately stated, when are the need-resched bits checked:
>
> exit-to-user ret-to-kernel preempt_count()
>
> NEED_RESCHED_LAZY Y N N
> NEED_RESCHED Y Y Y
>
> Using NEED_RESCHED_LAZY allows for run-to-completion semantics when
> none/voluntary preemption policies are in effect. And eager semantics
> under full preemption.
>
> In addition, since this is driven purely by the scheduler (not
> depending on cond_resched() placement and the like), there is enough
> flexibility in the scheduler to cope with edge cases -- ex. a kernel
> task not relinquishing CPU under NEED_RESCHED_LAZY can be handled by
> simply upgrading to a full NEED_RESCHED which can use more coercive
> instruments like resched IPI to induce a context-switch.
>
> Performance
> ==
> The performance in the basic tests (perf bench sched messaging, kernbench,
> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC.
> (See patches
> "sched: support preempt=none under PREEMPT_AUTO"
> "sched: support preempt=full under PREEMPT_AUTO"
> "sched: handle preempt=voluntary under PREEMPT_AUTO")
>
> For a macro test, a colleague in Oracle's Exadata team tried two
> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series
> backported.)
>
> In both tests the data was cached on remote nodes (cells), and the
> database nodes (compute) served client queries, with clients being
> local in the first test and remote in the second.
>
> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs
>
>
> PREEMPT_VOLUNTARY PREEMPT_AUTO
> (preempt=voluntary)
> ============================== =============================
> clients throughput cpu-usage throughput cpu-usage Gain
> (tx/min) (utime %/stime %) (tx/min) (utime %/stime %)
> ------- ---------- ----------------- ---------- ----------------- -------
>
>
> OLTP 384 9,315,653 25/ 6 9,253,252 25/ 6 -0.7%
> benchmark 1536 13,177,565 50/10 13,657,306 50/10 +3.6%
> (local clients) 3456 14,063,017 63/12 14,179,706 64/12 +0.8%
>
>
> OLTP 96 8,973,985 17/ 2 8,924,926 17/ 2 -0.5%
> benchmark 384 22,577,254 60/ 8 22,211,419 59/ 8 -1.6%
> (remote clients, 2304 25,882,857 82/11 25,536,100 82/11 -1.3%
> 90/10 RW ratio)
>
>
> (Both sets of tests have a fair amount of NW traffic since the query
> tables etc are cached on the cells. Additionally, the first set,
> given the local clients, stress the scheduler a bit more than the
> second.)
>
> The comparative performance for both the tests is fairly close,
> more or less within a margin of error.
>
> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM):
>
> "
> a) Base kernel (6.7),
> b) v1, PREEMPT_AUTO, preempt=voluntary
> c) v1, PREEMPT_DYNAMIC, preempt=voluntary
> d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y
>
> Workloads I tested and their %gain,
> case b case c case d
> NAS +2.7% +1.9% +2.1%
> Hashjoin, +0.0% +0.0% +0.0%
> Graph500, -6.0% +0.0% +0.0%
> XSBench +1.7% +0.0% +1.2%
>
> (Note about the Graph500 numbers at [8].)
>
> Did kernbench etc test from Mel's mmtests suite also. Did not notice
> much difference.
> "
>
> One case where there is a significant performance drop is on powerpc,
> seen running hackbench on a 320 core system (a test on a smaller system is
> fine.) In theory there's no reason for this to only happen on powerpc
> since most of the code is common, but I haven't been able to reproduce
> it on x86 so far.
>
> All in all, I think the tests above show that this scheduling model has legs.
> However, the none/voluntary models under PREEMPT_AUTO are conceptually
> different enough from the current none/voluntary models that there
> likely are workloads where performance would be subpar. That needs more
> extensive testing to figure out the weak points.
>
>
>
Did test it again on PowerPC. Unfortunately the numbers show there is still a
regression compared to 6.10-rc1. This is with preempt=none. I tried again on the
smaller system too to confirm. For now I have done the comparison for hackbench,
where the highest regression was seen in v1.

perf stat collected over 20 iterations shows more context switches and more
migrations. Could it be that the LAZY bit is causing more context switches? Or
could it be something else? Could it be that more exit-to-user transitions
happen on PowerPC? Will continue to debug.

Meanwhile, will run more tests with other micro-benchmarks and post the results.


More details below.
CONFIG_HZ = 100
/hackbench -pipe 60 process 100000 loops

====================================================================================
On the larger system. (40 Cores, 320CPUS)
====================================================================================
6.10-rc1 +preempt_auto
preempt=none preempt=none
20 iterations avg value
hackbench pipe(60) 26.403 32.368 ( -31.1%)

++++++++++++++++++
baseline 6.10-rc1:
++++++++++++++++++
Performance counter stats for 'system wide' (20 runs):
168,980,939.76 msec cpu-clock # 6400.026 CPUs utilized ( +- 6.59% )
6,299,247,371 context-switches # 70.596 K/sec ( +- 6.60% )
246,646,236 cpu-migrations # 2.764 K/sec ( +- 6.57% )
1,759,232 page-faults # 19.716 /sec ( +- 6.61% )
577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% )
226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% )
37,280,192,946,445 branches # 417.801 M/sec ( +- 6.61% )
166,456,311,053 branch-misses # 0.85% of all branches ( +- 6.60% )

26.403 +- 0.166 seconds time elapsed ( +- 0.63% )

++++++++++++
preempt auto
++++++++++++
Performance counter stats for 'system wide' (20 runs):
207,154,235.95 msec cpu-clock # 6400.009 CPUs utilized ( +- 6.64% )
9,337,462,696 context-switches # 85.645 K/sec ( +- 6.68% )
631,276,554 cpu-migrations # 5.790 K/sec ( +- 6.79% )
1,756,583 page-faults # 16.112 /sec ( +- 6.59% )
700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% )
254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% )
42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% )
231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% )

32.368 +- 0.200 seconds time elapsed ( +- 0.62% )


============================================================================================
Smaller system ( 12Cores, 96CPUS)
============================================================================================
6.10-rc1 +preempt_auto
preempt=none preempt=none
20 iterations avg value
hackbench pipe(60) 55.930 65.75 ( -17.6%)

++++++++++++++++++
baseline 6.10-rc1:
++++++++++++++++++
Performance counter stats for 'system wide' (20 runs):
107,386,299.19 msec cpu-clock # 1920.003 CPUs utilized ( +- 6.55% )
1,388,830,542 context-switches # 24.536 K/sec ( +- 6.19% )
44,538,641 cpu-migrations # 786.840 /sec ( +- 6.23% )
1,698,710 page-faults # 30.010 /sec ( +- 6.58% )
412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% )
192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% )
30,328,724,557,878 branches # 535.801 M/sec ( +- 6.58% )
99,642,840,901 branch-misses # 0.63% of all branches ( +- 6.57% )

55.930 +- 0.509 seconds time elapsed ( +- 0.91% )


+++++++++++++++++
v2_preempt_auto
+++++++++++++++++
Performance counter stats for 'system wide' (20 runs):
126,244,029.04 msec cpu-clock # 1920.005 CPUs utilized ( +- 6.51% )
2,563,720,294 context-switches # 38.356 K/sec ( +- 6.10% )
147,445,392 cpu-migrations # 2.206 K/sec ( +- 6.37% )
1,710,637 page-faults # 25.593 /sec ( +- 6.55% )
483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% )
210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% )
33,851,562,301,187 branches # 506.454 M/sec ( +- 6.56% )
134,059,721,699 branch-misses # 0.75% of all branches ( +- 6.45% )

65.75 +- 1.06 seconds time elapsed ( +- 1.61% )


2024-05-29 10:50:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 21/35] sched: prepare for lazy rescheduling in resched_curr()

On Mon, May 27, 2024 at 05:35:07PM -0700, Ankur Arora wrote:

> @@ -1041,25 +1041,34 @@ void wake_up_q(struct wake_q_head *head)
> void resched_curr(struct rq *rq)
> {
> struct task_struct *curr = rq->curr;
> + resched_t rs = RESCHED_NOW;
> int cpu;
>
> lockdep_assert_rq_held(rq);
>
> - if (__test_tsk_need_resched(curr, RESCHED_NOW))
> + /*
> + * TIF_NEED_RESCHED is the higher priority bit, so if it is already
> + * set, nothing more to be done.
> + */
> + if (__test_tsk_need_resched(curr, RESCHED_NOW) ||
> + (rs == RESCHED_LAZY && __test_tsk_need_resched(curr, RESCHED_LAZY)))
> return;
>
> cpu = cpu_of(rq);
>
> if (cpu == smp_processor_id()) {
> - __set_tsk_need_resched(curr, RESCHED_NOW);
> - set_preempt_need_resched();
> + __set_tsk_need_resched(curr, rs);
> + if (rs == RESCHED_NOW)
> + set_preempt_need_resched();
> return;
> }
>
> - if (set_nr_and_not_polling(curr))
> - smp_send_reschedule(cpu);
> - else
> + if (set_nr_and_not_polling(curr, rs)) {
> + if (rs == RESCHED_NOW)
> + smp_send_reschedule(cpu);

I'm thinking this wants at least something like:

WARN_ON_ONCE(rs == RESCHED_LAZY && is_idle_task(curr));


> + } else {
> trace_sched_wake_idle_without_ipi(cpu);
> + }
> }
>
> void resched_cpu(int cpu)
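
(That is, one possible placement, sketched against the hunks above and reusing
the resched_t plumbing from the patch -- illustrative only:)

	cpu = cpu_of(rq);

	/* Lazily rescheduling the idle task makes no sense; it must be kicked eagerly. */
	WARN_ON_ONCE(rs == RESCHED_LAZY && is_idle_task(curr));

	if (cpu == smp_processor_id()) {
		__set_tsk_need_resched(curr, rs);
		if (rs == RESCHED_NOW)
			set_preempt_need_resched();
		return;
	}

	if (set_nr_and_not_polling(curr, rs)) {
		if (rs == RESCHED_NOW)
			smp_send_reschedule(cpu);
	} else {
		trace_sched_wake_idle_without_ipi(cpu);
	}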

2024-05-29 11:50:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full

On Mon, May 27, 2024 at 05:35:02PM -0700, Ankur Arora wrote:
> The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
> works at cross purposes: the RCU read side critical sections disable
> preemption, while preempt=full schedules eagerly to minimize
> latency.
>
> Warn if the user is switching to full preemption with PREEMPT_RCU=n.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Suggested-by: Paul E. McKenney <[email protected]>
> Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> kernel/sched/core.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d7804e29182d..df8e333f2d8b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
> break;
>
> case preempt_dynamic_full:
> + if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
> + pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
> + PREEMPT_MODE);
> +

Yeah, so I don't believe this is a viable strategy.

Firstly, none of these RCU patches are actually about the whole LAZY
preempt scheme, they apply equally well (arguably better) to the
existing PREEMPT_DYNAMIC thing.

Secondly, esp. with the LAZY thing, you are effectively running FULL at
all times. It's just that some of the preemptions, typically those of
the normal scheduling class are somewhat delayed. However RT/DL classes
are still insta preempt.

Meaning that if you run anything in the realtime classes you're running
a fully preemptible kernel. As such, RCU had better be able to deal with
it.

So no, I don't believe this is right.

2024-05-30 09:05:19

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 10/35] entry: irqentry_exit only preempts for TIF_NEED_RESCHED


Peter Zijlstra <[email protected]> writes:

> On Mon, May 27, 2024 at 05:34:56PM -0700, Ankur Arora wrote:
>> Use __tif_need_resched(RESCHED_NOW) instead of need_resched() to be
>> explicit that this path only reschedules if it is needed imminently.
>>
>> Also, add a comment about why we need a need-resched check here at
>> all, given that the top level conditional has already checked the
>> preempt_count().
>>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Originally-by: Thomas Gleixner <[email protected]>
>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>> Signed-off-by: Ankur Arora <[email protected]>
>> ---
>> kernel/entry/common.c | 11 ++++++++++-
>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>> index bcb23c866425..c684385921de 100644
>> --- a/kernel/entry/common.c
>> +++ b/kernel/entry/common.c
>> @@ -307,7 +307,16 @@ void raw_irqentry_exit_cond_resched(void)
>> rcu_irq_exit_check_preempt();
>> if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
>> WARN_ON_ONCE(!on_thread_stack());
>> - if (need_resched())
>> +
>> + /*
>> + * Check if we need to preempt eagerly.
>> + *
>> + * Note: we need an explicit check here because some
>> + * architectures don't fold TIF_NEED_RESCHED in the
>> + * preempt_count. For archs that do, this is already covered
>> + * in the conditional above.
>> + */
>> + if (__tif_need_resched(RESCHED_NOW))
>> preempt_schedule_irq();
>
> Seeing how you introduced need_resched_lazy() and kept need_resched() to
> be the NOW thing, I really don't see the point of using the long form
> here?

So, the reason I used the lower level interface here (and in the scheduler)
was to spell out exactly what is happening here.

Basically, keep need_resched()/need_resched_lazy() for the non-core code.
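
Roughly, the split I have in mind (a sketch, not the exact definitions from
the series):

	/* core code spells out which bit it is checking ... */
	if (__tif_need_resched(RESCHED_NOW))
		preempt_schedule_irq();

	/* ... while non-core code keeps the familiar wrappers */
	static __always_inline bool need_resched(void)
	{
		return unlikely(__tif_need_resched(RESCHED_NOW));
	}

	static __always_inline bool need_resched_lazy(void)
	{
		return unlikely(__tif_need_resched(RESCHED_LAZY));
	}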

--
ankur

2024-05-30 09:05:37

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 09/35] entry/kvm: handle lazy rescheduling at guest-entry


Peter Zijlstra <[email protected]> writes:

> On Mon, May 27, 2024 at 05:34:55PM -0700, Ankur Arora wrote:
>> Archs defining CONFIG_KVM_XFER_TO_GUEST_WORK call
>> xfer_to_guest_mode_handle_work() from various KVM vcpu-run
>> loops to check for any task work including rescheduling.
>>
>> Handle TIF_NEED_RESCHED_LAZY alongside TIF_NEED_RESCHED.
>>
>> Also, while at it, remove the explicit check for need_resched() in
>> the exit condition as that is already covered in the loop condition.
>>
>> Cc: Paolo Bonzini <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Originally-by: Thomas Gleixner <[email protected]>
>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>> Signed-off-by: Ankur Arora <[email protected]>
>> ---
>> include/linux/entry-kvm.h | 2 +-
>> kernel/entry/kvm.c | 4 ++--
>> 2 files changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
>> index 6813171afccb..674a622c91be 100644
>> --- a/include/linux/entry-kvm.h
>> +++ b/include/linux/entry-kvm.h
>> @@ -18,7 +18,7 @@
>>
>> #define XFER_TO_GUEST_MODE_WORK \
>> (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
>> - _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
>> + _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)
>
> Same as last patch, it seems weird to have both RESCHED flags so far
> apart.

True. Will fix this and the other.
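
That is, keep both resched bits adjacent, roughly (sketch only, not the exact
hunk):

	#define XFER_TO_GUEST_MODE_WORK					\
		(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY |		\
		 _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL |			\
		 _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)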

--
ankur

2024-05-30 09:19:21

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 05/35] thread_info: selector for TIF_NEED_RESCHED[_LAZY]


Peter Zijlstra <[email protected]> writes:

> On Mon, May 27, 2024 at 05:34:51PM -0700, Ankur Arora wrote:
>> Define tif_resched() to serve as selector for the specific
>> need-resched flag: with tif_resched() mapping to TIF_NEED_RESCHED
>> or to TIF_NEED_RESCHED_LAZY.
>>
>> For !CONFIG_PREEMPT_AUTO, tif_resched() always evaluates
>> to TIF_NEED_RESCHED, preserving existing scheduling behaviour.
>>
>> Cc: Peter Ziljstra <[email protected]>
>> Originally-by: Thomas Gleixner <[email protected]>
>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>> Signed-off-by: Ankur Arora <[email protected]>
>> ---
>> include/linux/thread_info.h | 25 +++++++++++++++++++++++++
>> 1 file changed, 25 insertions(+)
>>
>> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
>> index 06e13e7acbe2..65e5beedc915 100644
>> --- a/include/linux/thread_info.h
>> +++ b/include/linux/thread_info.h
>> @@ -71,6 +71,31 @@ enum syscall_work_bit {
>> #define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
>> #endif
>>
>> +typedef enum {
>> + RESCHED_NOW = 0,
>> + RESCHED_LAZY = 1,
>> +} resched_t;
>> +
>> +/*
>> + * tif_resched(r) maps to TIF_NEED_RESCHED[_LAZY] with CONFIG_PREEMPT_AUTO.
>> + *
>> + * For !CONFIG_PREEMPT_AUTO, both tif_resched(RESCHED_NOW) and
>> + * tif_resched(RESCHED_LAZY) reduce to the same value (TIF_NEED_RESCHED)
>> + * leaving any scheduling behaviour unchanged.
>> + */
>> +static __always_inline int tif_resched(resched_t rs)
>> +{
>> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
>> + return (rs == RESCHED_NOW) ? TIF_NEED_RESCHED : TIF_NEED_RESCHED_LAZY;
>> + else
>> + return TIF_NEED_RESCHED;
>> +}
>
> Perhaps:
>
> if (IS_ENABLED(CONFIG_PREEMPT_AUTO) && rs == RESCHED_LAZY)
> return TIF_NEED_RESCHED_LAZY;
>
> return TIF_NEED_RESCHED;

This and similar other interface changes make it much clearer.
Thanks. Will fix.
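
(So tif_resched() would collapse to something like the sketch below:)

	static __always_inline int tif_resched(resched_t rs)
	{
		if (IS_ENABLED(CONFIG_PREEMPT_AUTO) && rs == RESCHED_LAZY)
			return TIF_NEED_RESCHED_LAZY;

		return TIF_NEED_RESCHED;
	}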


--
ankur

2024-05-30 09:31:34

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO


Peter Zijlstra <[email protected]> writes:

> On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
>> Reuse sched_dynamic_update() and related logic to enable choosing
>> the preemption model at boot or runtime for PREEMPT_AUTO.
>>
>> The interface is identical to PREEMPT_DYNAMIC.
>
> Colour me confused, why?!? What are you doing and why aren't just just
> adding AUTO to the existing DYNAMIC thing?

You mean have a single __sched_dynamic_update()? AUTO doesn't use any
of the static_call/static_key stuff so I'm not sure how that would work.

Or am I missing the point of what you are saying?

--
ankur

2024-05-30 09:38:42

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 12/35] sched: separate PREEMPT_DYNAMIC config logic


Peter Zijlstra <[email protected]> writes:

> On Mon, May 27, 2024 at 05:34:58PM -0700, Ankur Arora wrote:
>> Pull out the PREEMPT_DYNAMIC setup logic to allow other preemption
>> models to dynamically configure preemption.
>
> Uh what ?!? What's the point of creating back-to-back #ifdef sections ?

Now that you mention it, it does seem quite odd.

Assuming I keep the separation, maybe it makes sense to make the runtime
configuration its own configuration option, say CONFIG_PREEMPT_RUNTIME.

And then PREEMPT_AUTO and PREEMPT_DYNAMIC could both select it?


>> Cc: Ingo Molnar <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Juri Lelli <[email protected]>
>> Cc: Vincent Guittot <[email protected]>
>> Signed-off-by: Ankur Arora <[email protected]>
>> ---
>> kernel/sched/core.c | 165 +++++++++++++++++++++++---------------------
>> 1 file changed, 86 insertions(+), 79 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 0c26b60c1101..349f6257fdcd 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -8713,6 +8713,89 @@ int __cond_resched_rwlock_write(rwlock_t *lock)
>> }
>> EXPORT_SYMBOL(__cond_resched_rwlock_write);
>>
>> +#if defined(CONFIG_PREEMPT_DYNAMIC)
>> +
>> +#define PREEMPT_MODE "Dynamic Preempt"
>> +
>> +enum {
>> + preempt_dynamic_undefined = -1,
>> + preempt_dynamic_none,
>> + preempt_dynamic_voluntary,
>> + preempt_dynamic_full,
>> +};
>> +
>> +int preempt_dynamic_mode = preempt_dynamic_undefined;
>> +static DEFINE_MUTEX(sched_dynamic_mutex);
>> +
>> +int sched_dynamic_mode(const char *str)
>> +{
>> + if (!strcmp(str, "none"))
>> + return preempt_dynamic_none;
>> +
>> + if (!strcmp(str, "voluntary"))
>> + return preempt_dynamic_voluntary;
>> +
>> + if (!strcmp(str, "full"))
>> + return preempt_dynamic_full;
>> +
>> + return -EINVAL;
>> +}
>> +
>> +static void __sched_dynamic_update(int mode);
>> +void sched_dynamic_update(int mode)
>> +{
>> + mutex_lock(&sched_dynamic_mutex);
>> + __sched_dynamic_update(mode);
>> + mutex_unlock(&sched_dynamic_mutex);
>> +}
>> +
>> +static void __init preempt_dynamic_init(void)
>> +{
>> + if (preempt_dynamic_mode == preempt_dynamic_undefined) {
>> + if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
>> + sched_dynamic_update(preempt_dynamic_none);
>> + } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
>> + sched_dynamic_update(preempt_dynamic_voluntary);
>> + } else {
>> + /* Default static call setting, nothing to do */
>> + WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
>> + preempt_dynamic_mode = preempt_dynamic_full;
>> + pr_info("%s: full\n", PREEMPT_MODE);
>> + }
>> + }
>> +}
>> +
>> +static int __init setup_preempt_mode(char *str)
>> +{
>> + int mode = sched_dynamic_mode(str);
>> + if (mode < 0) {
>> + pr_warn("%s: unsupported mode: %s\n", PREEMPT_MODE, str);
>> + return 0;
>> + }
>> +
>> + sched_dynamic_update(mode);
>> + return 1;
>> +}
>> +__setup("preempt=", setup_preempt_mode);
>> +
>> +#define PREEMPT_MODEL_ACCESSOR(mode) \
>> + bool preempt_model_##mode(void) \
>> + { \
>> + WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
>> + return preempt_dynamic_mode == preempt_dynamic_##mode; \
>> + } \
>> + EXPORT_SYMBOL_GPL(preempt_model_##mode)
>> +
>> +PREEMPT_MODEL_ACCESSOR(none);
>> +PREEMPT_MODEL_ACCESSOR(voluntary);
>> +PREEMPT_MODEL_ACCESSOR(full);
>> +
>> +#else /* !CONFIG_PREEMPT_DYNAMIC */
>> +
>> +static inline void preempt_dynamic_init(void) { }
>> +
>> +#endif /* !CONFIG_PREEMPT_DYNAMIC */
>> +
>> #ifdef CONFIG_PREEMPT_DYNAMIC
>>
>> #ifdef CONFIG_GENERIC_ENTRY
>> @@ -8749,29 +8832,6 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
>> * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
>> */
>>
>> -enum {
>> - preempt_dynamic_undefined = -1,
>> - preempt_dynamic_none,
>> - preempt_dynamic_voluntary,
>> - preempt_dynamic_full,
>> -};
>> -
>> -int preempt_dynamic_mode = preempt_dynamic_undefined;
>> -
>> -int sched_dynamic_mode(const char *str)
>> -{
>> - if (!strcmp(str, "none"))
>> - return preempt_dynamic_none;
>> -
>> - if (!strcmp(str, "voluntary"))
>> - return preempt_dynamic_voluntary;
>> -
>> - if (!strcmp(str, "full"))
>> - return preempt_dynamic_full;
>> -
>> - return -EINVAL;
>> -}
>> -
>> #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>> #define preempt_dynamic_enable(f) static_call_update(f, f##_dynamic_enabled)
>> #define preempt_dynamic_disable(f) static_call_update(f, f##_dynamic_disabled)
>> @@ -8782,7 +8842,6 @@ int sched_dynamic_mode(const char *str)
>> #error "Unsupported PREEMPT_DYNAMIC mechanism"
>> #endif
>>
>> -static DEFINE_MUTEX(sched_dynamic_mutex);
>> static bool klp_override;
>>
>> static void __sched_dynamic_update(int mode)
>> @@ -8807,7 +8866,7 @@ static void __sched_dynamic_update(int mode)
>> preempt_dynamic_disable(preempt_schedule_notrace);
>> preempt_dynamic_disable(irqentry_exit_cond_resched);
>> if (mode != preempt_dynamic_mode)
>> - pr_info("Dynamic Preempt: none\n");
>> + pr_info("%s: none\n", PREEMPT_MODE);
>> break;
>>
>> case preempt_dynamic_voluntary:
>> @@ -8818,7 +8877,7 @@ static void __sched_dynamic_update(int mode)
>> preempt_dynamic_disable(preempt_schedule_notrace);
>> preempt_dynamic_disable(irqentry_exit_cond_resched);
>> if (mode != preempt_dynamic_mode)
>> - pr_info("Dynamic Preempt: voluntary\n");
>> + pr_info("%s: voluntary\n", PREEMPT_MODE);
>> break;
>>
>> case preempt_dynamic_full:
>> @@ -8829,20 +8888,13 @@ static void __sched_dynamic_update(int mode)
>> preempt_dynamic_enable(preempt_schedule_notrace);
>> preempt_dynamic_enable(irqentry_exit_cond_resched);
>> if (mode != preempt_dynamic_mode)
>> - pr_info("Dynamic Preempt: full\n");
>> + pr_info("%s: full\n", PREEMPT_MODE);
>> break;
>> }
>>
>> preempt_dynamic_mode = mode;
>> }
>>
>> -void sched_dynamic_update(int mode)
>> -{
>> - mutex_lock(&sched_dynamic_mutex);
>> - __sched_dynamic_update(mode);
>> - mutex_unlock(&sched_dynamic_mutex);
>> -}
>> -
>> #ifdef CONFIG_HAVE_PREEMPT_DYNAMIC_CALL
>>
>> static int klp_cond_resched(void)
>> @@ -8873,51 +8925,6 @@ void sched_dynamic_klp_disable(void)
>>
>> #endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
>>
>> -static int __init setup_preempt_mode(char *str)
>> -{
>> - int mode = sched_dynamic_mode(str);
>> - if (mode < 0) {
>> - pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
>> - return 0;
>> - }
>> -
>> - sched_dynamic_update(mode);
>> - return 1;
>> -}
>> -__setup("preempt=", setup_preempt_mode);
>> -
>> -static void __init preempt_dynamic_init(void)
>> -{
>> - if (preempt_dynamic_mode == preempt_dynamic_undefined) {
>> - if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
>> - sched_dynamic_update(preempt_dynamic_none);
>> - } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
>> - sched_dynamic_update(preempt_dynamic_voluntary);
>> - } else {
>> - /* Default static call setting, nothing to do */
>> - WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
>> - preempt_dynamic_mode = preempt_dynamic_full;
>> - pr_info("Dynamic Preempt: full\n");
>> - }
>> - }
>> -}
>> -
>> -#define PREEMPT_MODEL_ACCESSOR(mode) \
>> - bool preempt_model_##mode(void) \
>> - { \
>> - WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
>> - return preempt_dynamic_mode == preempt_dynamic_##mode; \
>> - } \
>> - EXPORT_SYMBOL_GPL(preempt_model_##mode)
>> -
>> -PREEMPT_MODEL_ACCESSOR(none);
>> -PREEMPT_MODEL_ACCESSOR(voluntary);
>> -PREEMPT_MODEL_ACCESSOR(full);
>> -
>> -#else /* !CONFIG_PREEMPT_DYNAMIC */
>> -
>> -static inline void preempt_dynamic_init(void) { }
>> -
>> #endif /* #ifdef CONFIG_PREEMPT_DYNAMIC */
>>
>> /**
>> --
>> 2.31.1
>>


--
ankur

2024-05-30 18:33:03

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full

On Wed, May 29, 2024 at 10:14:04AM +0200, Peter Zijlstra wrote:
> On Mon, May 27, 2024 at 05:35:02PM -0700, Ankur Arora wrote:
> > The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
> > works at cross purposes: the RCU read side critical sections disable
> > preemption, while preempt=full schedules eagerly to minimize
> > latency.
> >
> > Warn if the user is switching to full preemption with PREEMPT_RCU=n.
> >
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Juri Lelli <[email protected]>
> > Cc: Vincent Guittot <[email protected]>
> > Suggested-by: Paul E. McKenney <[email protected]>
> > Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
> > Signed-off-by: Ankur Arora <[email protected]>
> > ---
> > kernel/sched/core.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index d7804e29182d..df8e333f2d8b 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
> > break;
> >
> > case preempt_dynamic_full:
> > + if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
> > + pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
> > + PREEMPT_MODE);
> > +
>
> Yeah, so I don't believe this is a viable strategy.
>
> Firstly, none of these RCU patches are actually about the whole LAZY
> preempt scheme, they apply equally well (arguably better) to the
> existing PREEMPT_DYNAMIC thing.
>
> Secondly, esp. with the LAZY thing, you are effectively running FULL at
> all times. It's just that some of the preemptions, typically those of
> the normal scheduling class are somewhat delayed. However RT/DL classes
> are still insta preempt.
>
> Meaning that if you run anything in the realtime classes you're running
> a fully preemptible kernel. As such, RCU had better be able to deal with
> it.
>
> So no, I don't believe this is right.

At one point, lazy preemption selected PREEMPT_COUNT (which I am
not seeing in this version, perhaps due to blindness on my part).
Of course, selecting PREEMPT_COUNT would result in !PREEMPT_RCU kernel's
rcu_read_lock() explicitly disabling preemption, thus avoiding preemption
(including lazy preemption) in RCU read-side critical sections.

Ankur, what am I missing here?

Thanx, Paul

2024-05-30 23:06:35

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full


Paul E. McKenney <[email protected]> writes:

> On Wed, May 29, 2024 at 10:14:04AM +0200, Peter Zijlstra wrote:
>> On Mon, May 27, 2024 at 05:35:02PM -0700, Ankur Arora wrote:
>> > The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
>> > works at cross purposes: the RCU read side critical sections disable
>> > preemption, while preempt=full schedules eagerly to minimize
>> > latency.
>> >
>> > Warn if the user is switching to full preemption with PREEMPT_RCU=n.
>> >
>> > Cc: Ingo Molnar <[email protected]>
>> > Cc: Peter Zijlstra <[email protected]>
>> > Cc: Juri Lelli <[email protected]>
>> > Cc: Vincent Guittot <[email protected]>
>> > Suggested-by: Paul E. McKenney <[email protected]>
>> > Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
>> > Signed-off-by: Ankur Arora <[email protected]>
>> > ---
>> > kernel/sched/core.c | 4 ++++
>> > 1 file changed, 4 insertions(+)
>> >
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index d7804e29182d..df8e333f2d8b 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
>> > break;
>> >
>> > case preempt_dynamic_full:
>> > + if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
>> > + pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
>> > + PREEMPT_MODE);
>> > +
>>
>> Yeah, so I don't believe this is a viable strategy.
>>
>> Firstly, none of these RCU patches are actually about the whole LAZY
>> preempt scheme, they apply equally well (arguably better) to the
>> existing PREEMPT_DYNAMIC thing.
>>
>> Secondly, esp. with the LAZY thing, you are effectively running FULL at
>> all times. It's just that some of the preemptions, typically those of
>> the normal scheduling class are somewhat delayed. However RT/DL classes
>> are still insta preempt.
>>
>> Meaning that if you run anything in the realtime classes you're running
>> a fully preemptible kernel. As such, RCU had better be able to deal with
>> it.
>>
>> So no, I don't believe this is right.
>
> At one point, lazy preemption selected PREEMPT_COUNT (which I am
> not seeing in this version, perhaps due to blindness on my part).
> Of course, selecting PREEMPT_COUNT would result in !PREEMPT_RCU kernel's
> rcu_read_lock() explicitly disabling preemption, thus avoiding preemption
> (including lazy preemption) in RCU read-side critical sections.

That should be still happening, just transitively. PREEMPT_AUTO selects
PREEMPT_BUILD, which selects PREEMPTION, and that in turn selects
PREEMPT_COUNT.


--
ankur

2024-05-30 23:12:23

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full


Peter Zijlstra <[email protected]> writes:

> On Mon, May 27, 2024 at 05:35:02PM -0700, Ankur Arora wrote:
>> The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
>> works at cross purposes: the RCU read side critical sections disable
>> preemption, while preempt=full schedules eagerly to minimize
>> latency.
>>
>> Warn if the user is switching to full preemption with PREEMPT_RCU=n.
>>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Juri Lelli <[email protected]>
>> Cc: Vincent Guittot <[email protected]>
>> Suggested-by: Paul E. McKenney <[email protected]>
>> Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
>> Signed-off-by: Ankur Arora <[email protected]>
>> ---
>> kernel/sched/core.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index d7804e29182d..df8e333f2d8b 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
>> break;
>>
>> case preempt_dynamic_full:
>> + if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
>> + pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
>> + PREEMPT_MODE);
>> +
>
> Yeah, so I don't believe this is a viable strategy.
>
> Firstly, none of these RCU patches are actually about the whole LAZY
> preempt scheme, they apply equally well (arguably better) to the
> existing PREEMPT_DYNAMIC thing.

Agreed.

> Secondly, esp. with the LAZY thing, you are effectively running FULL at
> all times. It's just that some of the preemptions, typically those of
> the normal scheduling class are somewhat delayed. However RT/DL classes
> are still insta preempt.

Also, agreed.

> Meaning that if you run anything in the realtime classes you're running
> a fully preemptible kernel. As such, RCU had better be able to deal with
> it.

So, RCU can deal with (PREEMPT_RCU=y, PREEMPT_AUTO=y, preempt=none/voluntary/full),
since that's basically what PREEMPT_DYNAMIC already works with.

The other combination, (PREEMPT_RCU=n, PREEMPT_AUTO,
preempt=none/voluntary) would generally be business as usual, except, as
you say, it is really PREEMPT_RCU=n, preempt=full in disguise.

However, as Paul says, __rcu_read_lock() for PREEMPT_RCU=n is defined as:

	static inline void __rcu_read_lock(void)
	{
		preempt_disable();
	}

So, this combination -- though non standard -- should also work.

The reason for adding the warning was that Paul had warned in earlier
discussions (see here for instance:
https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/)
that PREEMPT_FULL=y with PREEMPT_RCU=n is basically useless. But at
least in my understanding that's primarily a performance concern, not a
correctness concern. Paul can probably speak to that more.

"PREEMPT_FULL=y plus PREEMPT_RCU=n appears to be a useless
combination. All of the gains from PREEMPT_FULL=y are more than lost
due to PREEMPT_RCU=n, especially when the kernel decides to do something
like walk a long task list under RCU protection. We should not waste
people's time getting burned by this combination, nor should we waste
cycles testing it."


--
ankur

2024-05-30 23:15:29

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full

On Thu, May 30, 2024 at 04:05:26PM -0700, Ankur Arora wrote:
>
> Paul E. McKenney <[email protected]> writes:
>
> > On Wed, May 29, 2024 at 10:14:04AM +0200, Peter Zijlstra wrote:
> >> On Mon, May 27, 2024 at 05:35:02PM -0700, Ankur Arora wrote:
> >> > The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
> >> > works at cross purposes: the RCU read side critical sections disable
> >> > preemption, while preempt=full schedules eagerly to minimize
> >> > latency.
> >> >
> >> > Warn if the user is switching to full preemption with PREEMPT_RCU=n.
> >> >
> >> > Cc: Ingo Molnar <[email protected]>
> >> > Cc: Peter Zijlstra <[email protected]>
> >> > Cc: Juri Lelli <[email protected]>
> >> > Cc: Vincent Guittot <[email protected]>
> >> > Suggested-by: Paul E. McKenney <[email protected]>
> >> > Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
> >> > Signed-off-by: Ankur Arora <[email protected]>
> >> > ---
> >> > kernel/sched/core.c | 4 ++++
> >> > 1 file changed, 4 insertions(+)
> >> >
> >> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> > index d7804e29182d..df8e333f2d8b 100644
> >> > --- a/kernel/sched/core.c
> >> > +++ b/kernel/sched/core.c
> >> > @@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
> >> > break;
> >> >
> >> > case preempt_dynamic_full:
> >> > + if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
> >> > + pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
> >> > + PREEMPT_MODE);
> >> > +
> >>
> >> Yeah, so I don't believe this is a viable strategy.
> >>
> >> Firstly, none of these RCU patches are actually about the whole LAZY
> >> preempt scheme, they apply equally well (arguably better) to the
> >> existing PREEMPT_DYNAMIC thing.
> >>
> >> Secondly, esp. with the LAZY thing, you are effectively running FULL at
> >> all times. It's just that some of the preemptions, typically those of
> >> the normal scheduling class are somewhat delayed. However RT/DL classes
> >> are still insta preempt.
> >>
> >> Meaning that if you run anything in the realtime classes you're running
> >> a fully preemptible kernel. As such, RCU had better be able to deal with
> >> it.
> >>
> >> So no, I don't believe this is right.
> >
> > At one point, lazy preemption selected PREEMPT_COUNT (which I am
> > not seeing in this version, perhaps due to blindness on my part).
> > Of course, selecting PREEMPT_COUNT would result in !PREEMPT_RCU kernel's
> > rcu_read_lock() explicitly disabling preemption, thus avoiding preemption
> > (including lazy preemption) in RCU read-side critical sections.
>
> That should be still happening, just transitively. PREEMPT_AUTO selects
> PREEMPT_BUILD, which selects PREEMPTION, and that in turn selects
> PREEMPT_COUNT.

Ah, I gave up too soon. Thank you!

Thanx, Paul

2024-05-30 23:29:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full

On Thu, May 30, 2024 at 04:04:41PM -0700, Ankur Arora wrote:
>
> Peter Zijlstra <[email protected]> writes:
>
> > On Mon, May 27, 2024 at 05:35:02PM -0700, Ankur Arora wrote:
> >> The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
> >> works at cross purposes: the RCU read side critical sections disable
> >> preemption, while preempt=full schedules eagerly to minimize
> >> latency.
> >>
> >> Warn if the user is switching to full preemption with PREEMPT_RCU=n.
> >>
> >> Cc: Ingo Molnar <[email protected]>
> >> Cc: Peter Zijlstra <[email protected]>
> >> Cc: Juri Lelli <[email protected]>
> >> Cc: Vincent Guittot <[email protected]>
> >> Suggested-by: Paul E. McKenney <[email protected]>
> >> Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
> >> Signed-off-by: Ankur Arora <[email protected]>
> >> ---
> >> kernel/sched/core.c | 4 ++++
> >> 1 file changed, 4 insertions(+)
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index d7804e29182d..df8e333f2d8b 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
> >> break;
> >>
> >> case preempt_dynamic_full:
> >> + if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
> >> + pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
> >> + PREEMPT_MODE);
> >> +
> >
> > Yeah, so I don't believe this is a viable strategy.
> >
> > Firstly, none of these RCU patches are actually about the whole LAZY
> > preempt scheme, they apply equally well (arguably better) to the
> > existing PREEMPT_DYNAMIC thing.
>
> Agreed.
>
> > Secondly, esp. with the LAZY thing, you are effectively running FULL at
> > all times. It's just that some of the preemptions, typically those of
> > the normal scheduling class are somewhat delayed. However RT/DL classes
> > are still insta preempt.
>
> Also, agreed.
>
> > Meaning that if you run anything in the realtime classes you're running
> > a fully preemptible kernel. As such, RCU had better be able to deal with
> > it.
>
> So, RCU can deal with (PREEMPT_RCU=y, PREEMPT_AUTO=y, preempt=none/voluntary/full).
> Since that's basically what PREEMPT_DYNAMIC already works with.
>
> The other combination, (PREEMPT_RCU=n, PREEMPT_AUTO,
> preempt=none/voluntary) would generally be business as usual, except, as
> you say, it is really PREEMPT_RCU=n, preempt=full in disguise.
>
> However, as Paul says __rcu_read_lock(), for PREEMPT_RCU=n is defined as:
>
> static inline void __rcu_read_lock(void)
> {
> preempt_disable();
> }
>
> So, this combination -- though non standard -- should also work.
>
> The reason for adding the warning was because Paul had warned in
> discussions earlier (see here for instance:
> https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/)
>
> that the PREEMPT_FULL=y and PREEMPT_RCU=n is basically useless. But at
> least in my understanding that's primarily a performance concern not a
> correctness concern. But, Paul can probably speak to that more.
>
> "PREEMPT_FULL=y plus PREEMPT_RCU=n appears to be a useless
> combination. All of the gains from PREEMPT_FULL=y are more than lost
> due to PREEMPT_RCU=n, especially when the kernel decides to do something
> like walk a long task list under RCU protection. We should not waste
> people's time getting burned by this combination, nor should we waste
> cycles testing it."

My selfish motivation here is to avoid testing this combination unless
and until someone actually has a good use for it. I do not think that
anyone will ever need it, but perhaps I am suffering from a failure
of imagination. If so, they hit that WARN, complain and explain their
use case, and at that point I start testing it (and fixing whatever bugs
have accumulated in the meantime). But until that time, I save time by
avoiding testing it.

Thanx, Paul

2024-06-01 11:48:24

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling


Shrikanth Hegde <[email protected]> writes:

> On 5/28/24 6:04 AM, Ankur Arora wrote:
>> Hi,
>>
>> This series adds a new scheduling model PREEMPT_AUTO, which like
>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend
>> on explicit preemption points for the voluntary models.
>>
>> The series is based on Thomas' original proposal which he outlined
>> in [1], [2] and in his PoC [3].
>>
>> v2 mostly reworks v1, with one of the main changes having less
>> noisy need-resched-lazy related interfaces.
>> More details in the changelog below.
>>
>
> Hi Ankur. Thanks for the series.
>
> nit: had to manually patch 11,12,13 since it didnt apply cleanly on
> tip/master and tip/sched/core. Mostly due some word differences in the change.
>
> tip/master was at:
> commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD)
> Merge: 5d145493a139 47ff30cc1be7
> Author: Ingo Molnar <[email protected]>
> Date: Tue May 28 12:44:26 2024 +0200
>
> Merge branch into tip/master: 'x86/percpu'
>
>
>
>> The v1 of the series is at [4] and the RFC at [5].
>>
>> Design
>> ==
>>
>> PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
>> PREEMPT_COUNT). This means that the scheduler can always safely
>> preempt. (This is identical to CONFIG_PREEMPT.)
>>
>> Having that, the next step is to make the rescheduling policy dependent
>> on the chosen scheduling model. Currently, the scheduler uses a single
>> need-resched bit (TIF_NEED_RESCHED) which it uses to state that a
>> reschedule is needed.
>> PREEMPT_AUTO extends this by adding an additional need-resched bit
>> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED now allows the
>> scheduler to express two kinds of rescheduling intent: schedule at
>> the earliest opportunity (TIF_NEED_RESCHED), or express a need for
>> rescheduling while allowing the task on the runqueue to run to
>> timeslice completion (TIF_NEED_RESCHED_LAZY).
>>
>> The scheduler decides which need-resched bits are chosen based on
>> the preemption model in use:
>>
>> TIF_NEED_RESCHED TIF_NEED_RESCHED_LAZY
>>
>> none never always [*]
>> voluntary higher sched class other tasks [*]
>> full always never
>>
>> [*] some details elided.
>>
>> The last part of the puzzle is, when does preemption happen, or
>> alternately stated, when are the need-resched bits checked:
>>
>> exit-to-user ret-to-kernel preempt_count()
>>
>> NEED_RESCHED_LAZY Y N N
>> NEED_RESCHED Y Y Y
>>
>> Using NEED_RESCHED_LAZY allows for run-to-completion semantics when
>> none/voluntary preemption policies are in effect. And eager semantics
>> under full preemption.
>>
>> In addition, since this is driven purely by the scheduler (not
>> depending on cond_resched() placement and the like), there is enough
>> flexibility in the scheduler to cope with edge cases -- ex. a kernel
>> task not relinquishing CPU under NEED_RESCHED_LAZY can be handled by
>> simply upgrading to a full NEED_RESCHED which can use more coercive
>> instruments like resched IPI to induce a context-switch.
>>
>> Performance
>> ==
>> The performance in the basic tests (perf bench sched messaging, kernbench,
>> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC.
>> (See patches
>> "sched: support preempt=none under PREEMPT_AUTO"
>> "sched: support preempt=full under PREEMPT_AUTO"
>> "sched: handle preempt=voluntary under PREEMPT_AUTO")
>>
>> For a macro test, a colleague in Oracle's Exadata team tried two
>> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series
>> backported.)
>>
>> In both tests the data was cached on remote nodes (cells), and the
>> database nodes (compute) served client queries, with clients being
>> local in the first test and remote in the second.
>>
>> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
>> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs
>>
>>
>> PREEMPT_VOLUNTARY PREEMPT_AUTO
>> (preempt=voluntary)
>> ============================== =============================
>> clients throughput cpu-usage throughput cpu-usage Gain
>> (tx/min) (utime %/stime %) (tx/min) (utime %/stime %)
>> ------- ---------- ----------------- ---------- ----------------- -------
>>
>>
>> OLTP 384 9,315,653 25/ 6 9,253,252 25/ 6 -0.7%
>> benchmark 1536 13,177,565 50/10 13,657,306 50/10 +3.6%
>> (local clients) 3456 14,063,017 63/12 14,179,706 64/12 +0.8%
>>
>>
>> OLTP 96 8,973,985 17/ 2 8,924,926 17/ 2 -0.5%
>> benchmark 384 22,577,254 60/ 8 22,211,419 59/ 8 -1.6%
>> (remote clients, 2304 25,882,857 82/11 25,536,100 82/11 -1.3%
>> 90/10 RW ratio)
>>
>>
>> (Both sets of tests have a fair amount of NW traffic since the query
>> tables etc are cached on the cells. Additionally, the first set,
>> given the local clients, stress the scheduler a bit more than the
>> second.)
>>
>> The comparative performance for both the tests is fairly close,
>> more or less within a margin of error.
>>
>> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM):
>>
>> "
>> a) Base kernel (6.7),
>> b) v1, PREEMPT_AUTO, preempt=voluntary
>> c) v1, PREEMPT_DYNAMIC, preempt=voluntary
>> d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y
>>
>> Workloads I tested and their %gain,
>> case b case c case d
>> NAS +2.7% +1.9% +2.1%
>> Hashjoin, +0.0% +0.0% +0.0%
>> Graph500, -6.0% +0.0% +0.0%
>> XSBench +1.7% +0.0% +1.2%
>>
>> (Note about the Graph500 numbers at [8].)
>>
>> Did kernbench etc test from Mel's mmtests suite also. Did not notice
>> much difference.
>> "
>>
>> One case where there is a significant performance drop is on powerpc,
>> seen running hackbench on a 320 core system (a test on a smaller system is
>> fine.) In theory there's no reason for this to only happen on powerpc
>> since most of the code is common, but I haven't been able to reproduce
>> it on x86 so far.
>>
>> All in all, I think the tests above show that this scheduling model has legs.
>> However, the none/voluntary models under PREEMPT_AUTO are conceptually
>> different enough from the current none/voluntary models that there
>> likely are workloads where performance would be subpar. That needs more
>> extensive testing to figure out the weak points.
>>
>>
>>
> Did test it again on PowerPC. Unfortunately numbers shows there is regression
> still compared to 6.10-rc1. This is done with preempt=none. I tried again on the
> smaller system too to confirm. For now I have done the comparison for the hackbench
> where highest regression was seen in v1.
>
> perf stat collected for 20 iterations show higher context switch and higher migrations.
> Could it be that LAZY bit is causing more context switches? or could it be something
> else? Could it be that more exit-to-user happens in PowerPC? will continue to debug.

Thanks for trying it out.

As you point out, context-switches and migrations are significantly higher.

Definitely unexpected. I ran the same test on an x86 box
(Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference.

6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% )
6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% )
6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% )

6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% )
6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% )
6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% )

Clearly there's something different going on on powerpc. I'm travelling
right now, but will dig deeper into this once I get back.

Meanwhile, can you check whether the increased context-switches are voluntary or
involuntary (or what the split is)?


Thanks
Ankur

> Meanwhile, will do more test with other micro-benchmarks and post the results.
>
>
> More details below.
> CONFIG_HZ = 100
> ./hackbench -pipe 60 process 100000 loops
>
> ====================================================================================
> On the larger system. (40 Cores, 320CPUS)
> ====================================================================================
> 6.10-rc1 +preempt_auto
> preempt=none preempt=none
> 20 iterations avg value
> hackbench pipe(60) 26.403 32.368 ( -31.1%)
>
> ++++++++++++++++++
> baseline 6.10-rc1:
> ++++++++++++++++++
> Performance counter stats for 'system wide' (20 runs):
> 168,980,939.76 msec cpu-clock # 6400.026 CPUs utilized ( +- 6.59% )
> 6,299,247,371 context-switches # 70.596 K/sec ( +- 6.60% )
> 246,646,236 cpu-migrations # 2.764 K/sec ( +- 6.57% )
> 1,759,232 page-faults # 19.716 /sec ( +- 6.61% )
> 577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% )
> 226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% )
> 37,280,192,946,445 branches # 417.801 M/sec ( +- 6.61% )
> 166,456,311,053 branch-misses # 0.85% of all branches ( +- 6.60% )
>
> 26.403 +- 0.166 seconds time elapsed ( +- 0.63% )
>
> ++++++++++++
> preempt auto
> ++++++++++++
> Performance counter stats for 'system wide' (20 runs):
> 207,154,235.95 msec cpu-clock # 6400.009 CPUs utilized ( +- 6.64% )
> 9,337,462,696 context-switches # 85.645 K/sec ( +- 6.68% )
> 631,276,554 cpu-migrations # 5.790 K/sec ( +- 6.79% )
> 1,756,583 page-faults # 16.112 /sec ( +- 6.59% )
> 700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% )
> 254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% )
> 42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% )
> 231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% )
>
> 32.368 +- 0.200 seconds time elapsed ( +- 0.62% )
>
>
> ============================================================================================
> Smaller system ( 12Cores, 96CPUS)
> ============================================================================================
> 6.10-rc1 +preempt_auto
> preempt=none preempt=none
> 20 iterations avg value
> hackbench pipe(60) 55.930 65.75 ( -17.6%)
>
> ++++++++++++++++++
> baseline 6.10-rc1:
> ++++++++++++++++++
> Performance counter stats for 'system wide' (20 runs):
> 107,386,299.19 msec cpu-clock # 1920.003 CPUs utilized ( +- 6.55% )
> 1,388,830,542 context-switches # 24.536 K/sec ( +- 6.19% )
> 44,538,641 cpu-migrations # 786.840 /sec ( +- 6.23% )
> 1,698,710 page-faults # 30.010 /sec ( +- 6.58% )
> 412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% )
> 192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% )
> 30,328,724,557,878 branches # 535.801 M/sec ( +- 6.58% )
> 99,642,840,901 branch-misses # 0.63% of all branches ( +- 6.57% )
>
> 55.930 +- 0.509 seconds time elapsed ( +- 0.91% )
>
>
> +++++++++++++++++
> v2_preempt_auto
> +++++++++++++++++
> Performance counter stats for 'system wide' (20 runs):
> 126,244,029.04 msec cpu-clock # 1920.005 CPUs utilized ( +- 6.51% )
> 2,563,720,294 context-switches # 38.356 K/sec ( +- 6.10% )
> 147,445,392 cpu-migrations # 2.206 K/sec ( +- 6.37% )
> 1,710,637 page-faults # 25.593 /sec ( +- 6.55% )
> 483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% )
> 210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% )
> 33,851,562,301,187 branches # 506.454 M/sec ( +- 6.56% )
> 134,059,721,699 branch-misses # 0.75% of all branches ( +- 6.45% )
>
> 65.75 +- 1.06 seconds time elapsed ( +- 1.61% )

So, the context-switches are meaningfully higher.

--
ankur

2024-06-04 07:33:50

by Shrikanth Hegde

[permalink] [raw]
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling



On 6/1/24 5:17 PM, Ankur Arora wrote:
>
> Shrikanth Hegde <[email protected]> writes:
>
>> On 5/28/24 6:04 AM, Ankur Arora wrote:
>>> Hi,
>>>
>>> This series adds a new scheduling model PREEMPT_AUTO, which like
>>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>>> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend
>>> on explicit preemption points for the voluntary models.
>>>
>>> The series is based on Thomas' original proposal which he outlined
>>> in [1], [2] and in his PoC [3].
>>>
>>> v2 mostly reworks v1, with one of the main changes having less
>>> noisy need-resched-lazy related interfaces.
>>> More details in the changelog below.
>>>
>>
>> Hi Ankur. Thanks for the series.
>>
>> nit: had to manually patch 11,12,13 since it didnt apply cleanly on
>> tip/master and tip/sched/core. Mostly due some word differences in the change.
>>
>> tip/master was at:
>> commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD)
>> Merge: 5d145493a139 47ff30cc1be7
>> Author: Ingo Molnar <[email protected]>
>> Date: Tue May 28 12:44:26 2024 +0200
>>
>> Merge branch into tip/master: 'x86/percpu'
>>
>>
>>
>>> The v1 of the series is at [4] and the RFC at [5].
>>>
>>> Design
>>> ==
>>>
>>> PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
>>> PREEMPT_COUNT). This means that the scheduler can always safely
>>> preempt. (This is identical to CONFIG_PREEMPT.)
>>>
>>> Having that, the next step is to make the rescheduling policy dependent
>>> on the chosen scheduling model. Currently, the scheduler uses a single
>>> need-resched bit (TIF_NEED_RESCHED) which it uses to state that a
>>> reschedule is needed.
>>> PREEMPT_AUTO extends this by adding an additional need-resched bit
>>> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED now allows the
>>> scheduler to express two kinds of rescheduling intent: schedule at
>>> the earliest opportunity (TIF_NEED_RESCHED), or express a need for
>>> rescheduling while allowing the task on the runqueue to run to
>>> timeslice completion (TIF_NEED_RESCHED_LAZY).
>>>
>>> The scheduler decides which need-resched bits are chosen based on
>>> the preemption model in use:
>>>
>>> TIF_NEED_RESCHED TIF_NEED_RESCHED_LAZY
>>>
>>> none never always [*]
>>> voluntary higher sched class other tasks [*]
>>> full always never
>>>
>>> [*] some details elided.
>>>
>>> The last part of the puzzle is, when does preemption happen, or
>>> alternately stated, when are the need-resched bits checked:
>>>
>>> exit-to-user ret-to-kernel preempt_count()
>>>
>>> NEED_RESCHED_LAZY Y N N
>>> NEED_RESCHED Y Y Y
>>>
>>> Using NEED_RESCHED_LAZY allows for run-to-completion semantics when
>>> none/voluntary preemption policies are in effect. And eager semantics
>>> under full preemption.
>>>
>>> In addition, since this is driven purely by the scheduler (not
>>> depending on cond_resched() placement and the like), there is enough
>>> flexibility in the scheduler to cope with edge cases -- ex. a kernel
>>> task not relinquishing CPU under NEED_RESCHED_LAZY can be handled by
>>> simply upgrading to a full NEED_RESCHED which can use more coercive
>>> instruments like resched IPI to induce a context-switch.
>>>
>>> Performance
>>> ==
>>> The performance in the basic tests (perf bench sched messaging, kernbench,
>>> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC.
>>> (See patches
>>> "sched: support preempt=none under PREEMPT_AUTO"
>>> "sched: support preempt=full under PREEMPT_AUTO"
>>> "sched: handle preempt=voluntary under PREEMPT_AUTO")
>>>
>>> For a macro test, a colleague in Oracle's Exadata team tried two
>>> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series
>>> backported.)
>>>
>>> In both tests the data was cached on remote nodes (cells), and the
>>> database nodes (compute) served client queries, with clients being
>>> local in the first test and remote in the second.
>>>
>>> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
>>> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs
>>>
>>>
>>> PREEMPT_VOLUNTARY PREEMPT_AUTO
>>> (preempt=voluntary)
>>> ============================== =============================
>>> clients throughput cpu-usage throughput cpu-usage Gain
>>> (tx/min) (utime %/stime %) (tx/min) (utime %/stime %)
>>> ------- ---------- ----------------- ---------- ----------------- -------
>>>
>>>
>>> OLTP 384 9,315,653 25/ 6 9,253,252 25/ 6 -0.7%
>>> benchmark 1536 13,177,565 50/10 13,657,306 50/10 +3.6%
>>> (local clients) 3456 14,063,017 63/12 14,179,706 64/12 +0.8%
>>>
>>>
>>> OLTP 96 8,973,985 17/ 2 8,924,926 17/ 2 -0.5%
>>> benchmark 384 22,577,254 60/ 8 22,211,419 59/ 8 -1.6%
>>> (remote clients, 2304 25,882,857 82/11 25,536,100 82/11 -1.3%
>>> 90/10 RW ratio)
>>>
>>>
>>> (Both sets of tests have a fair amount of NW traffic since the query
>>> tables etc are cached on the cells. Additionally, the first set,
>>> given the local clients, stress the scheduler a bit more than the
>>> second.)
>>>
>>> The comparative performance for both the tests is fairly close,
>>> more or less within a margin of error.
>>>
>>> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM):
>>>
>>> "
>>> a) Base kernel (6.7),
>>> b) v1, PREEMPT_AUTO, preempt=voluntary
>>> c) v1, PREEMPT_DYNAMIC, preempt=voluntary
>>> d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y
>>>
>>> Workloads I tested and their %gain,
>>> case b case c case d
>>> NAS +2.7% +1.9% +2.1%
>>> Hashjoin, +0.0% +0.0% +0.0%
>>> Graph500, -6.0% +0.0% +0.0%
>>> XSBench +1.7% +0.0% +1.2%
>>>
>>> (Note about the Graph500 numbers at [8].)
>>>
>>> Did kernbench etc test from Mel's mmtests suite also. Did not notice
>>> much difference.
>>> "
>>>
>>> One case where there is a significant performance drop is on powerpc,
>>> seen running hackbench on a 320 core system (a test on a smaller system is
>>> fine.) In theory there's no reason for this to only happen on powerpc
>>> since most of the code is common, but I haven't been able to reproduce
>>> it on x86 so far.
>>>
>>> All in all, I think the tests above show that this scheduling model has legs.
>>> However, the none/voluntary models under PREEMPT_AUTO are conceptually
>>> different enough from the current none/voluntary models that there
>>> likely are workloads where performance would be subpar. That needs more
>>> extensive testing to figure out the weak points.
>>>
>>>
>>>
>> Did test it again on PowerPC. Unfortunately numbers shows there is regression
>> still compared to 6.10-rc1. This is done with preempt=none. I tried again on the
>> smaller system too to confirm. For now I have done the comparison for the hackbench
>> where highest regression was seen in v1.
>>
>> perf stat collected for 20 iterations show higher context switch and higher migrations.
>> Could it be that LAZY bit is causing more context switches? or could it be something
>> else? Could it be that more exit-to-user happens in PowerPC? will continue to debug.
>
> Thanks for trying it out.
>
> As you point out, context-switches and migrations are signficantly higher.
>
> Definitely unexpected. I ran the same test on an x86 box
> (Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference.
>
> 6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% )
> 6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% )
> 6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% )
>
> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% )
> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% )
> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% )
>
> Clearly there's something different going on powerpc. I'm travelling
> right now, but will dig deeper into this once I get back.
>
> Meanwhile can you check if the increased context-switches are voluntary or
> involuntary (or what the division is)?


Used "pidstat -w -p ALL 1 10" to capture 10 seconds data at 1 second interval for
context switches per second while running "hackbench -pipe 60 process 100000 loops"


(preempt=none)                       6.10            preempt_auto
=============================================================================
voluntary context switches           7632166.19      9391636.34  (+23%)
involuntary context switches         2305544.07      3527293.94  (+53%)

Numbers vary between runs, but the trend seems similar: both kinds of context switches
increase, with involuntary switches increasing at a higher rate.


BTW, ran Unixbench as well; it shows a slight regression. stress-ng numbers didn't seem conclusive.
schbench (old) showed slightly lower latency when the number of threads was low, and higher tail
latency at higher thread counts, but those numbers don't seem very convincing either.
All of these were done under preempt=none on both 6.10 and preempt_auto.


Unixbench                                  6.10        preempt_auto (% change)
=====================================================================
1 X Execl Throughput : 5345.70, 5109.68(-4.42)
4 X Execl Throughput : 15610.54, 15087.92(-3.35)
1 X Pipe-based Context Switching : 183172.30, 177069.52(-3.33)
4 X Pipe-based Context Switching : 615471.66, 602773.74(-2.06)
1 X Process Creation : 10778.92, 10443.76(-3.11)
4 X Process Creation : 24327.06, 25150.42(+3.38)
1 X Shell Scripts (1 concurrent) : 10416.76, 10222.28(-1.87)
4 X Shell Scripts (1 concurrent) : 36051.00, 35206.90(-2.34)
1 X Shell Scripts (8 concurrent) : 5004.22, 4907.32(-1.94)
4 X Shell Scripts (8 concurrent) : 12676.08, 12418.18(-2.03)


>
>
> Thanks
> Ankur
>
>> Meanwhile, will do more test with other micro-benchmarks and post the results.
>>
>>
>> More details below.
>> CONFIG_HZ = 100
>> ./hackbench -pipe 60 process 100000 loops
>>
>> ====================================================================================
>> On the larger system. (40 Cores, 320CPUS)
>> ====================================================================================
>> 6.10-rc1 +preempt_auto
>> preempt=none preempt=none
>> 20 iterations avg value
>> hackbench pipe(60) 26.403 32.368 ( -31.1%)
>>
>> ++++++++++++++++++
>> baseline 6.10-rc1:
>> ++++++++++++++++++
>> Performance counter stats for 'system wide' (20 runs):
>> 168,980,939.76 msec cpu-clock # 6400.026 CPUs utilized ( +- 6.59% )
>> 6,299,247,371 context-switches # 70.596 K/sec ( +- 6.60% )
>> 246,646,236 cpu-migrations # 2.764 K/sec ( +- 6.57% )
>> 1,759,232 page-faults # 19.716 /sec ( +- 6.61% )
>> 577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% )
>> 226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% )
>> 37,280,192,946,445 branches # 417.801 M/sec ( +- 6.61% )
>> 166,456,311,053 branch-misses # 0.85% of all branches ( +- 6.60% )
>>
>> 26.403 +- 0.166 seconds time elapsed ( +- 0.63% )
>>
>> ++++++++++++
>> preempt auto
>> ++++++++++++
>> Performance counter stats for 'system wide' (20 runs):
>> 207,154,235.95 msec cpu-clock # 6400.009 CPUs utilized ( +- 6.64% )
>> 9,337,462,696 context-switches # 85.645 K/sec ( +- 6.68% )
>> 631,276,554 cpu-migrations # 5.790 K/sec ( +- 6.79% )
>> 1,756,583 page-faults # 16.112 /sec ( +- 6.59% )
>> 700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% )
>> 254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% )
>> 42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% )
>> 231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% )
>>
>> 32.368 +- 0.200 seconds time elapsed ( +- 0.62% )
>>
>>
>> ============================================================================================
>> Smaller system ( 12Cores, 96CPUS)
>> ============================================================================================
>> 6.10-rc1 +preempt_auto
>> preempt=none preempt=none
>> 20 iterations avg value
>> hackbench pipe(60) 55.930 65.75 ( -17.6%)
>>
>> ++++++++++++++++++
>> baseline 6.10-rc1:
>> ++++++++++++++++++
>> Performance counter stats for 'system wide' (20 runs):
>> 107,386,299.19 msec cpu-clock # 1920.003 CPUs utilized ( +- 6.55% )
>> 1,388,830,542 context-switches # 24.536 K/sec ( +- 6.19% )
>> 44,538,641 cpu-migrations # 786.840 /sec ( +- 6.23% )
>> 1,698,710 page-faults # 30.010 /sec ( +- 6.58% )
>> 412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% )
>> 192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% )
>> 30,328,724,557,878 branches # 535.801 M/sec ( +- 6.58% )
>> 99,642,840,901 branch-misses # 0.63% of all branches ( +- 6.57% )
>>
>> 55.930 +- 0.509 seconds time elapsed ( +- 0.91% )
>>
>>
>> +++++++++++++++++
>> v2_preempt_auto
>> +++++++++++++++++
>> Performance counter stats for 'system wide' (20 runs):
>> 126,244,029.04 msec cpu-clock # 1920.005 CPUs utilized ( +- 6.51% )
>> 2,563,720,294 context-switches # 38.356 K/sec ( +- 6.10% )
>> 147,445,392 cpu-migrations # 2.206 K/sec ( +- 6.37% )
>> 1,710,637 page-faults # 25.593 /sec ( +- 6.55% )
>> 483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% )
>> 210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% )
>> 33,851,562,301,187 branches # 506.454 M/sec ( +- 6.56% )
>> 134,059,721,699 branch-misses # 0.75% of all branches ( +- 6.45% )
>>
>> 65.75 +- 1.06 seconds time elapsed ( +- 1.61% )
>
> So, the context-switches are meaningfully higher.
>
> --
> ankur

2024-06-05 15:45:00

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling

On Mon, May 27, 2024, Ankur Arora wrote:
> Patches 1,2
> "sched/core: Move preempt_model_*() helpers from sched.h to preempt.h"
> "sched/core: Drop spinlocks on contention iff kernel is preemptible"
> condition spin_needbreak() on the dynamic preempt_model_*().

...

> Not really required but a useful bugfix for PREEMPT_DYNAMIC and PREEMPT_AUTO.
> Sean Christopherson (2):
> sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
> sched/core: Drop spinlocks on contention iff kernel is preemptible

Peter and/or Thomas, would it be possible to get these applied to tip-tree sooner
than later? They fix a real bug that affects KVM to varying degrees.

2024-06-05 17:45:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling

On Wed, Jun 05, 2024 at 08:44:50AM -0700, Sean Christopherson wrote:
> On Mon, May 27, 2024, Ankur Arora wrote:
> > Patches 1,2
> > "sched/core: Move preempt_model_*() helpers from sched.h to preempt.h"
> > "sched/core: Drop spinlocks on contention iff kernel is preemptible"
> > condition spin_needbreak() on the dynamic preempt_model_*().
>
> ...
>
> > Not really required but a useful bugfix for PREEMPT_DYNAMIC and PREEMPT_AUTO.
> > Sean Christopherson (2):
> > sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
> > sched/core: Drop spinlocks on contention iff kernel is preemptible
>
> Peter and/or Thomas, would it be possible to get these applied to tip-tree sooner
> than later? They fix a real bug that affects KVM to varying degrees.

It so happens I've queued them for sched/core earlier today (see
queue/sched/core). If the robot comes back happy, I'll push them into
tip.

Thanks!

2024-06-06 11:51:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO

On Thu, May 30, 2024 at 02:29:45AM -0700, Ankur Arora wrote:
>
> Peter Zijlstra <[email protected]> writes:
>
> > On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
> >> Reuse sched_dynamic_update() and related logic to enable choosing
> >> the preemption model at boot or runtime for PREEMPT_AUTO.
> >>
> >> The interface is identical to PREEMPT_DYNAMIC.
> >
> > Colour me confused, why?!? What are you doing and why aren't just just
> > adding AUTO to the existing DYNAMIC thing?
>
> You mean have a single __sched_dynamic_update()? AUTO doesn't use any
> of the static_call/static_key stuff so I'm not sure how that would work.

*sigh*... see the below, seems to work.

---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/thread_info.h | 6 +-
include/linux/entry-common.h | 3 +-
include/linux/entry-kvm.h | 5 +-
include/linux/sched.h | 10 +++-
include/linux/thread_info.h | 21 +++++--
kernel/Kconfig.preempt | 11 ++++
kernel/entry/common.c | 2 +-
kernel/entry/kvm.c | 4 +-
kernel/sched/core.c | 110 ++++++++++++++++++++++++++++++++-----
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 4 +-
kernel/sched/sched.h | 1 +
13 files changed, 148 insertions(+), 32 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e8837116704ce..61f86b69524d7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -91,6 +91,7 @@ config X86
select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PMEM_API if X86_64
+ select ARCH_HAS_PREEMPT_LAZY
select ARCH_HAS_PTE_DEVMAP if X86_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_HW_PTE_YOUNG
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 12da7dfd5ef13..75bb390f7baf5 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -87,8 +87,9 @@ struct thread_info {
#define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
#define TIF_SIGPENDING 2 /* signal pending */
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
-#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
-#define TIF_SSBD 5 /* Speculative store bypass disable */
+#define TIF_NEED_RESCHED_LAZY 4 /* rescheduling necessary */
+#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/
+#define TIF_SSBD 6 /* Speculative store bypass disable */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
#define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
@@ -110,6 +111,7 @@ struct thread_info {
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
+#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
#define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
#define _TIF_SSBD (1 << TIF_SSBD)
#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index b0fb775a600d9..e66c8a7c113f4 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -64,7 +64,8 @@

#define EXIT_TO_USER_MODE_WORK \
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
- _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
+ _TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | \
+ _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
ARCH_EXIT_TO_USER_MODE_WORK)

/**
diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 6813171afccb2..16149f6625e48 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -17,8 +17,9 @@
#endif

#define XFER_TO_GUEST_MODE_WORK \
- (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
- _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
+ (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | _TIF_SIGPENDING | \
+ _TIF_NOTIFY_SIGNAL | _TIF_NOTIFY_RESUME | \
+ ARCH_XFER_TO_GUEST_MODE_WORK)

struct kvm_vcpu;

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7635045b2395c..5900d84e08b3c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1968,7 +1968,8 @@ static inline void set_tsk_need_resched(struct task_struct *tsk)

static inline void clear_tsk_need_resched(struct task_struct *tsk)
{
- clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
+ atomic_long_andnot(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY,
+ (atomic_long_t *)&task_thread_info(tsk)->flags);
}

static inline int test_tsk_need_resched(struct task_struct *tsk)
@@ -2074,6 +2075,7 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
extern bool preempt_model_none(void);
extern bool preempt_model_voluntary(void);
extern bool preempt_model_full(void);
+extern bool preempt_model_lazy(void);

#else

@@ -2089,6 +2091,10 @@ static inline bool preempt_model_full(void)
{
return IS_ENABLED(CONFIG_PREEMPT);
}
+static inline bool preempt_model_lazy(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT_LAZY);
+}

#endif

@@ -2107,7 +2113,7 @@ static inline bool preempt_model_rt(void)
*/
static inline bool preempt_model_preemptible(void)
{
- return preempt_model_full() || preempt_model_rt();
+ return preempt_model_full() || preempt_model_lazy() || preempt_model_rt();
}

static __always_inline bool need_resched(void)
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 9ea0b28068f49..cf2446c9c30d4 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -59,6 +59,14 @@ enum syscall_work_bit {

#include <asm/thread_info.h>

+#ifndef TIF_NEED_RESCHED_LAZY
+#ifdef CONFIG_ARCH_HAS_PREEMPT_LAZY
+#error Inconsistent PREEMPT_LAZY
+#endif
+#define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED
+#define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
+#endif
+
#ifdef __KERNEL__

#ifndef arch_set_restart_data
@@ -179,22 +187,27 @@ static __always_inline unsigned long read_ti_thread_flags(struct thread_info *ti

#ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H

-static __always_inline bool tif_need_resched(void)
+static __always_inline bool tif_test_bit(int bit)
{
- return arch_test_bit(TIF_NEED_RESCHED,
+ return arch_test_bit(bit,
(unsigned long *)(&current_thread_info()->flags));
}

#else

-static __always_inline bool tif_need_resched(void)
+static __always_inline bool tif_test_bit(int bit)
{
- return test_bit(TIF_NEED_RESCHED,
+ return test_bit(bit,
(unsigned long *)(&current_thread_info()->flags));
}

#endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */

+static __always_inline bool tif_need_resched(void)
+{
+ return tif_test_bit(TIF_NEED_RESCHED);
+}
+
#ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
static inline int arch_within_stack_frames(const void * const stack,
const void * const stackend,
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index c2f1fd95a8214..1a2e3849e3e5f 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -11,6 +11,9 @@ config PREEMPT_BUILD
select PREEMPTION
select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK

+config ARCH_HAS_PREEMPT_LAZY
+ bool
+
choice
prompt "Preemption Model"
default PREEMPT_NONE
@@ -67,6 +70,14 @@ config PREEMPT
embedded system with latency requirements in the milliseconds
range.

+config PREEMPT_LAZY
+ bool "Scheduler controlled preemption model"
+ depends on !ARCH_NO_PREEMPT
+ depends on ARCH_HAS_PREEMPT_LAZY
+ select PREEMPT_BUILD
+ help
+ Hamsters in your brain...
+
config PREEMPT_RT
bool "Fully Preemptible Kernel (Real-Time)"
depends on EXPERT && ARCH_SUPPORTS_RT
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 90843cc385880..bcb23c866425e 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -98,7 +98,7 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,

local_irq_enable_exit_to_user(ti_work);

- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();

if (ti_work & _TIF_UPROBE)
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 2e0f75bcb7fd1..8485f63863afc 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
return -EINTR;
}

- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();

if (ti_work & _TIF_NOTIFY_RESUME)
@@ -24,7 +24,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
return ret;

ti_work = read_thread_flags();
- } while (ti_work & XFER_TO_GUEST_MODE_WORK || need_resched());
+ } while (ti_work & XFER_TO_GUEST_MODE_WORK);
return 0;
}

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 965e6464e68e9..c32de809283cf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -904,10 +904,9 @@ static inline void hrtick_rq_init(struct rq *rq)
* this avoids any races wrt polling state changes and thereby avoids
* spurious IPIs.
*/
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
{
- struct thread_info *ti = task_thread_info(p);
- return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
+ return !(fetch_or(&ti->flags, 1 << tif) & _TIF_POLLING_NRFLAG);
}

/*
@@ -932,9 +931,9 @@ static bool set_nr_if_polling(struct task_struct *p)
}

#else
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
{
- set_tsk_need_resched(p);
+ atomic_long_or(1 << tif, (atomic_long_t *)&ti->flags);
return true;
}

@@ -1039,28 +1038,66 @@ void wake_up_q(struct wake_q_head *head)
* might also involve a cross-CPU call to trigger the scheduler on
* the target CPU.
*/
-void resched_curr(struct rq *rq)
+static void __resched_curr(struct rq *rq, int tif)
{
struct task_struct *curr = rq->curr;
+ struct thread_info *cti = task_thread_info(curr);
int cpu;

lockdep_assert_rq_held(rq);

- if (test_tsk_need_resched(curr))
+ if (is_idle_task(curr) && tif == TIF_NEED_RESCHED_LAZY)
+ tif = TIF_NEED_RESCHED;
+
+ if (cti->flags & ((1 << tif) | _TIF_NEED_RESCHED))
return;

cpu = cpu_of(rq);

if (cpu == smp_processor_id()) {
- set_tsk_need_resched(curr);
- set_preempt_need_resched();
+ set_ti_thread_flag(cti, tif);
+ if (tif == TIF_NEED_RESCHED)
+ set_preempt_need_resched();
return;
}

- if (set_nr_and_not_polling(curr))
- smp_send_reschedule(cpu);
- else
+ if (set_nr_and_not_polling(cti, tif)) {
+ if (tif == TIF_NEED_RESCHED)
+ smp_send_reschedule(cpu);
+ } else {
trace_sched_wake_idle_without_ipi(cpu);
+ }
+}
+
+void resched_curr(struct rq *rq)
+{
+ __resched_curr(rq, TIF_NEED_RESCHED);
+}
+
+#ifdef CONFIG_PREEMPT_DYNAMIC
+static DEFINE_STATIC_KEY_FALSE(sk_dynamic_preempt_lazy);
+static __always_inline bool dynamic_preempt_lazy(void)
+{
+ return static_branch_unlikely(&sk_dynamic_preempt_lazy);
+}
+#else
+static __always_inline bool dynamic_preempt_lazy(void)
+{
+ return IS_ENABLED(PREEMPT_LAZY);
+}
+#endif
+
+static __always_inline int tif_need_resched_lazy(void)
+{
+ if (dynamic_preempt_lazy())
+ return TIF_NEED_RESCHED_LAZY;
+
+ return TIF_NEED_RESCHED;
+}
+
+void resched_curr_lazy(struct rq *rq)
+{
+ __resched_curr(rq, tif_need_resched_lazy());
}

void resched_cpu(int cpu)
@@ -1155,7 +1192,7 @@ static void wake_up_idle_cpu(int cpu)
* and testing of the above solutions didn't appear to report
* much benefits.
*/
- if (set_nr_and_not_polling(rq->idle))
+ if (set_nr_and_not_polling(task_thread_info(rq->idle), TIF_NEED_RESCHED))
smp_send_reschedule(cpu);
else
trace_sched_wake_idle_without_ipi(cpu);
@@ -5537,6 +5574,10 @@ void sched_tick(void)
update_rq_clock(rq);
hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);
+
+ if (dynamic_preempt_lazy() && tif_test_bit(TIF_NEED_RESCHED_LAZY))
+ resched_curr(rq);
+
curr->sched_class->task_tick(rq, curr, 0);
if (sched_feat(LATENCY_WARN))
resched_latency = cpu_resched_latency(rq);
@@ -7245,6 +7286,7 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
* preempt_schedule <- NOP
* preempt_schedule_notrace <- NOP
* irqentry_exit_cond_resched <- NOP
+ * dynamic_preempt_lazy <- false
*
* VOLUNTARY:
* cond_resched <- __cond_resched
@@ -7252,6 +7294,7 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
* preempt_schedule <- NOP
* preempt_schedule_notrace <- NOP
* irqentry_exit_cond_resched <- NOP
+ * dynamic_preempt_lazy <- false
*
* FULL:
* cond_resched <- RET0
@@ -7259,6 +7302,15 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
* preempt_schedule <- preempt_schedule
* preempt_schedule_notrace <- preempt_schedule_notrace
* irqentry_exit_cond_resched <- irqentry_exit_cond_resched
+ * dynamic_preempt_lazy <- false
+ *
+ * LAZY:
+ * cond_resched <- RET0
+ * might_resched <- RET0
+ * preempt_schedule <- preempt_schedule
+ * preempt_schedule_notrace <- preempt_schedule_notrace
+ * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
+ * dynamic_preempt_lazy <- true
*/

enum {
@@ -7266,6 +7318,7 @@ enum {
preempt_dynamic_none,
preempt_dynamic_voluntary,
preempt_dynamic_full,
+ preempt_dynamic_lazy,
};

int preempt_dynamic_mode = preempt_dynamic_undefined;
@@ -7281,15 +7334,23 @@ int sched_dynamic_mode(const char *str)
if (!strcmp(str, "full"))
return preempt_dynamic_full;

+#ifdef CONFIG_ARCH_HAS_PREEMPT_LAZY
+ if (!strcmp(str, "lazy"))
+ return preempt_dynamic_lazy;
+#endif
+
return -EINVAL;
}

+#define preempt_dynamic_key_enable(f) static_key_enable(&sk_dynamic_##f.key)
+#define preempt_dynamic_key_disable(f) static_key_disable(&sk_dynamic_##f.key)
+
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
#define preempt_dynamic_enable(f) static_call_update(f, f##_dynamic_enabled)
#define preempt_dynamic_disable(f) static_call_update(f, f##_dynamic_disabled)
#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
-#define preempt_dynamic_enable(f) static_key_enable(&sk_dynamic_##f.key)
-#define preempt_dynamic_disable(f) static_key_disable(&sk_dynamic_##f.key)
+#define preempt_dynamic_enable(f) preempt_dynamic_key_enable(f)
+#define preempt_dynamic_disable(f) preempt_dynamic_key_disable(f)
#else
#error "Unsupported PREEMPT_DYNAMIC mechanism"
#endif
@@ -7309,6 +7370,7 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_enable(preempt_schedule);
preempt_dynamic_enable(preempt_schedule_notrace);
preempt_dynamic_enable(irqentry_exit_cond_resched);
+ preempt_dynamic_key_disable(preempt_lazy);

switch (mode) {
case preempt_dynamic_none:
@@ -7318,6 +7380,7 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_disable(preempt_schedule);
preempt_dynamic_disable(preempt_schedule_notrace);
preempt_dynamic_disable(irqentry_exit_cond_resched);
+ preempt_dynamic_key_disable(preempt_lazy);
if (mode != preempt_dynamic_mode)
pr_info("Dynamic Preempt: none\n");
break;
@@ -7329,6 +7392,7 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_disable(preempt_schedule);
preempt_dynamic_disable(preempt_schedule_notrace);
preempt_dynamic_disable(irqentry_exit_cond_resched);
+ preempt_dynamic_key_disable(preempt_lazy);
if (mode != preempt_dynamic_mode)
pr_info("Dynamic Preempt: voluntary\n");
break;
@@ -7340,9 +7404,22 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_enable(preempt_schedule);
preempt_dynamic_enable(preempt_schedule_notrace);
preempt_dynamic_enable(irqentry_exit_cond_resched);
+ preempt_dynamic_key_disable(preempt_lazy);
if (mode != preempt_dynamic_mode)
pr_info("Dynamic Preempt: full\n");
break;
+
+ case preempt_dynamic_lazy:
+ if (!klp_override)
+ preempt_dynamic_disable(cond_resched);
+ preempt_dynamic_disable(might_resched);
+ preempt_dynamic_enable(preempt_schedule);
+ preempt_dynamic_enable(preempt_schedule_notrace);
+ preempt_dynamic_enable(irqentry_exit_cond_resched);
+ preempt_dynamic_key_enable(preempt_lazy);
+ if (mode != preempt_dynamic_mode)
+ pr_info("Dynamic Preempt: lazy\n");
+ break;
}

preempt_dynamic_mode = mode;
@@ -7405,6 +7482,8 @@ static void __init preempt_dynamic_init(void)
sched_dynamic_update(preempt_dynamic_none);
} else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
sched_dynamic_update(preempt_dynamic_voluntary);
+ } else if (IS_ENABLED(CONFIG_PREEMPT_LAZY)) {
+ sched_dynamic_update(preempt_dynamic_lazy);
} else {
/* Default static call setting, nothing to do */
WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
@@ -7425,6 +7504,7 @@ static void __init preempt_dynamic_init(void)
PREEMPT_MODEL_ACCESSOR(none);
PREEMPT_MODEL_ACCESSOR(voluntary);
PREEMPT_MODEL_ACCESSOR(full);
+PREEMPT_MODEL_ACCESSOR(lazy);

#else /* !CONFIG_PREEMPT_DYNAMIC: */

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1bc24410ae501..87309cf247c68 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -245,7 +245,7 @@ static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
static int sched_dynamic_show(struct seq_file *m, void *v)
{
static const char * preempt_modes[] = {
- "none", "voluntary", "full"
+ "none", "voluntary", "full", "lazy",
};
int i;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b5d50dbc79dc..71b4112cadde0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1007,7 +1007,7 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
* The task has consumed its request, reschedule.
*/
if (cfs_rq->nr_running > 1) {
- resched_curr(rq_of(cfs_rq));
+ resched_curr_lazy(rq_of(cfs_rq));
clear_buddies(cfs_rq, se);
}
}
@@ -8615,7 +8615,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
return;

preempt:
- resched_curr(rq);
+ resched_curr_lazy(rq);
}

static struct task_struct *pick_task_fair(struct rq *rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 041d8e00a1568..48a4617a5b28b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2494,6 +2494,7 @@ extern void init_sched_fair_class(void);
extern void reweight_task(struct task_struct *p, int prio);

extern void resched_curr(struct rq *rq);
+extern void resched_curr_lazy(struct rq *rq);
extern void resched_cpu(int cpu);

extern struct rt_bandwidth def_rt_bandwidth;
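
For context on the interface being reused here: PREEMPT_DYNAMIC already exposes the
preemption model via a debugfs knob and the preempt= boot parameter, and with the patch
above "lazy" would presumably become a fourth selectable mode. A rough sketch of the
runtime usage, assuming debugfs is mounted at /sys/kernel/debug (the output shown is
illustrative):

# list the available models; the active one is shown in parentheses
cat /sys/kernel/debug/sched/preempt
none voluntary (full) lazy

# switch to the lazy model at runtime, or boot with preempt=lazy
echo lazy > /sys/kernel/debug/sched/preempt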

2024-06-06 11:53:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full

On Thu, May 30, 2024 at 04:20:26PM -0700, Paul E. McKenney wrote:

> My selfish motivation here is to avoid testing this combination unless
> and until someone actually has a good use for it.

That doesn't make sense, the whole LAZY thing is fundamentally identical
to FULL, except it sometimes delays the preemption a wee bit. But all
the preemption scenarios from FULL are possible.

As such, it makes far more sense to only test FULL.

2024-06-06 13:39:06

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full

On Thu, Jun 06, 2024 at 01:53:25PM +0200, Peter Zijlstra wrote:
> On Thu, May 30, 2024 at 04:20:26PM -0700, Paul E. McKenney wrote:
>
> > My selfish motivation here is to avoid testing this combination unless
> > and until someone actually has a good use for it.
>
> That doesn't make sense, the whole LAZY thing is fundamentally identical
> to FULL, except it sometimes delays the preemption a wee bit. But all
> the preemption scenarios from FULL are possible.

As noted earlier in this thread, this is not the case for non-preemptible
RCU, which disables preemption across its read-side critical sections.
In addition, from a performance/throughput viewpoint, it is not just
the possibility of preemption that matters, but also the probability.

> As such, it makes far more sense to only test FULL.

You have considerable work left to do in order to convince me of this one.

Thanx, Paul

2024-06-06 15:33:10

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO


Peter Zijlstra <[email protected]> writes:

> On Thu, May 30, 2024 at 02:29:45AM -0700, Ankur Arora wrote:
>>
>> Peter Zijlstra <[email protected]> writes:
>>
>> > On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
>> >> Reuse sched_dynamic_update() and related logic to enable choosing
>> >> the preemption model at boot or runtime for PREEMPT_AUTO.
>> >>
>> >> The interface is identical to PREEMPT_DYNAMIC.
>> >
>> > Colour me confused, why?!? What are you doing and why aren't just just
>> > adding AUTO to the existing DYNAMIC thing?
>>
>> You mean have a single __sched_dynamic_update()? AUTO doesn't use any
>> of the static_call/static_key stuff so I'm not sure how that would work.
>
> *sigh*... see the below, seems to work.

Sorry, didn't mean for you to have to do all that work to prove the
point.

I phrased it badly. I do understand how lazy can be folded in as
you do here:

> + case preempt_dynamic_lazy:
> + if (!klp_override)
> + preempt_dynamic_disable(cond_resched);
> + preempt_dynamic_disable(might_resched);
> + preempt_dynamic_enable(preempt_schedule);
> + preempt_dynamic_enable(preempt_schedule_notrace);
> + preempt_dynamic_enable(irqentry_exit_cond_resched);
> + preempt_dynamic_key_enable(preempt_lazy);
> + if (mode != preempt_dynamic_mode)
> + pr_info("Dynamic Preempt: lazy\n");
> + break;
> }

But, if the long term goal (at least as I understand it) is to get rid
of cond_resched() -- to allow optimizations that needing to call cond_resched()
makes impossible -- does it make sense to pull all of these together?

Say, eventually preempt_dynamic_lazy and preempt_dynamic_full are the
only two models left. Then we will have (modulo figuring out how to
switch over klp from cond_resched() to a different unwinding technique):

static void __sched_dynamic_update(int mode)
{
preempt_dynamic_enable(preempt_schedule);
preempt_dynamic_enable(preempt_schedule_notrace);
preempt_dynamic_enable(irqentry_exit_cond_resched);

switch (mode) {
case preempt_dynamic_full:
preempt_dynamic_key_disable(preempt_lazy);
if (mode != preempt_dynamic_mode)
pr_info("%s: full\n", PREEMPT_MODE);
break;

case preempt_dynamic_lazy:
preempt_dynamic_key_enable(preempt_lazy);
if (mode != preempt_dynamic_mode)
pr_info("Dynamic Preempt: lazy\n");
break;
}

preempt_dynamic_mode = mode;
}

Which is pretty similar to what the PREEMPT_AUTO code was doing.

Thanks
Ankur

> ---
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/thread_info.h | 6 +-
> include/linux/entry-common.h | 3 +-
> include/linux/entry-kvm.h | 5 +-
> include/linux/sched.h | 10 +++-
> include/linux/thread_info.h | 21 +++++--
> kernel/Kconfig.preempt | 11 ++++
> kernel/entry/common.c | 2 +-
> kernel/entry/kvm.c | 4 +-
> kernel/sched/core.c | 110 ++++++++++++++++++++++++++++++++-----
> kernel/sched/debug.c | 2 +-
> kernel/sched/fair.c | 4 +-
> kernel/sched/sched.h | 1 +
> 13 files changed, 148 insertions(+), 32 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index e8837116704ce..61f86b69524d7 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -91,6 +91,7 @@ config X86
> select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
> select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
> select ARCH_HAS_PMEM_API if X86_64
> + select ARCH_HAS_PREEMPT_LAZY
> select ARCH_HAS_PTE_DEVMAP if X86_64
> select ARCH_HAS_PTE_SPECIAL
> select ARCH_HAS_HW_PTE_YOUNG
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index 12da7dfd5ef13..75bb390f7baf5 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -87,8 +87,9 @@ struct thread_info {
> #define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
> #define TIF_SIGPENDING 2 /* signal pending */
> #define TIF_NEED_RESCHED 3 /* rescheduling necessary */
> -#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
> -#define TIF_SSBD 5 /* Speculative store bypass disable */
> +#define TIF_NEED_RESCHED_LAZY 4 /* rescheduling necessary */
> +#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/
> +#define TIF_SSBD 6 /* Speculative store bypass disable */
> #define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
> #define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
> #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
> @@ -110,6 +111,7 @@ struct thread_info {
> #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
> +#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
> #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
> #define _TIF_SSBD (1 << TIF_SSBD)
> #define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index b0fb775a600d9..e66c8a7c113f4 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -64,7 +64,8 @@
>
> #define EXIT_TO_USER_MODE_WORK \
> (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
> - _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
> + _TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | \
> + _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
> ARCH_EXIT_TO_USER_MODE_WORK)
>
> /**
> diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
> index 6813171afccb2..16149f6625e48 100644
> --- a/include/linux/entry-kvm.h
> +++ b/include/linux/entry-kvm.h
> @@ -17,8 +17,9 @@
> #endif
>
> #define XFER_TO_GUEST_MODE_WORK \
> - (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
> - _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
> + (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | _TIF_SIGPENDING | \
> + _TIF_NOTIFY_SIGNAL | _TIF_NOTIFY_RESUME | \
> + ARCH_XFER_TO_GUEST_MODE_WORK)
>
> struct kvm_vcpu;
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 7635045b2395c..5900d84e08b3c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1968,7 +1968,8 @@ static inline void set_tsk_need_resched(struct task_struct *tsk)
>
> static inline void clear_tsk_need_resched(struct task_struct *tsk)
> {
> - clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
> + atomic_long_andnot(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY,
> + (atomic_long_t *)&task_thread_info(tsk)->flags);
> }
>
> static inline int test_tsk_need_resched(struct task_struct *tsk)
> @@ -2074,6 +2075,7 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
> extern bool preempt_model_none(void);
> extern bool preempt_model_voluntary(void);
> extern bool preempt_model_full(void);
> +extern bool preempt_model_lazy(void);
>
> #else
>
> @@ -2089,6 +2091,10 @@ static inline bool preempt_model_full(void)
> {
> return IS_ENABLED(CONFIG_PREEMPT);
> }
> +static inline bool preempt_model_lazy(void)
> +{
> + return IS_ENABLED(CONFIG_PREEMPT_LAZY);
> +}
>
> #endif
>
> @@ -2107,7 +2113,7 @@ static inline bool preempt_model_rt(void)
> */
> static inline bool preempt_model_preemptible(void)
> {
> - return preempt_model_full() || preempt_model_rt();
> + return preempt_model_full() || preempt_model_lazy() || preempt_model_rt();
> }
>
> static __always_inline bool need_resched(void)
> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index 9ea0b28068f49..cf2446c9c30d4 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -59,6 +59,14 @@ enum syscall_work_bit {
>
> #include <asm/thread_info.h>
>
> +#ifndef TIF_NEED_RESCHED_LAZY
> +#ifdef CONFIG_ARCH_HAS_PREEMPT_LAZY
> +#error Inconsistent PREEMPT_LAZY
> +#endif
> +#define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED
> +#define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
> +#endif
> +
> #ifdef __KERNEL__
>
> #ifndef arch_set_restart_data
> @@ -179,22 +187,27 @@ static __always_inline unsigned long read_ti_thread_flags(struct thread_info *ti
>
> #ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H
>
> -static __always_inline bool tif_need_resched(void)
> +static __always_inline bool tif_test_bit(int bit)
> {
> - return arch_test_bit(TIF_NEED_RESCHED,
> + return arch_test_bit(bit,
> (unsigned long *)(&current_thread_info()->flags));
> }
>
> #else
>
> -static __always_inline bool tif_need_resched(void)
> +static __always_inline bool tif_test_bit(int bit)
> {
> - return test_bit(TIF_NEED_RESCHED,
> + return test_bit(bit,
> (unsigned long *)(&current_thread_info()->flags));
> }
>
> #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
>
> +static __always_inline bool tif_need_resched(void)
> +{
> + return tif_test_bit(TIF_NEED_RESCHED);
> +}
> +
> #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
> static inline int arch_within_stack_frames(const void * const stack,
> const void * const stackend,
> diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
> index c2f1fd95a8214..1a2e3849e3e5f 100644
> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -11,6 +11,9 @@ config PREEMPT_BUILD
> select PREEMPTION
> select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
>
> +config ARCH_HAS_PREEMPT_LAZY
> + bool
> +
> choice
> prompt "Preemption Model"
> default PREEMPT_NONE
> @@ -67,6 +70,14 @@ config PREEMPT
> embedded system with latency requirements in the milliseconds
> range.
>
> +config PREEMPT_LAZY
> + bool "Scheduler controlled preemption model"
> + depends on !ARCH_NO_PREEMPT
> + depends on ARCH_HAS_PREEMPT_LAZY
> + select PREEMPT_BUILD
> + help
> + Hamsters in your brain...
> +
> config PREEMPT_RT
> bool "Fully Preemptible Kernel (Real-Time)"
> depends on EXPERT && ARCH_SUPPORTS_RT
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 90843cc385880..bcb23c866425e 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -98,7 +98,7 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>
> local_irq_enable_exit_to_user(ti_work);
>
> - if (ti_work & _TIF_NEED_RESCHED)
> + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> schedule();
>
> if (ti_work & _TIF_UPROBE)
> diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
> index 2e0f75bcb7fd1..8485f63863afc 100644
> --- a/kernel/entry/kvm.c
> +++ b/kernel/entry/kvm.c
> @@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
> return -EINTR;
> }
>
> - if (ti_work & _TIF_NEED_RESCHED)
> + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> schedule();
>
> if (ti_work & _TIF_NOTIFY_RESUME)
> @@ -24,7 +24,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
> return ret;
>
> ti_work = read_thread_flags();
> - } while (ti_work & XFER_TO_GUEST_MODE_WORK || need_resched());
> + } while (ti_work & XFER_TO_GUEST_MODE_WORK);
> return 0;
> }
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 965e6464e68e9..c32de809283cf 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -904,10 +904,9 @@ static inline void hrtick_rq_init(struct rq *rq)
> * this avoids any races wrt polling state changes and thereby avoids
> * spurious IPIs.
> */
> -static inline bool set_nr_and_not_polling(struct task_struct *p)
> +static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
> {
> - struct thread_info *ti = task_thread_info(p);
> - return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
> + return !(fetch_or(&ti->flags, 1 << tif) & _TIF_POLLING_NRFLAG);
> }
>
> /*
> @@ -932,9 +931,9 @@ static bool set_nr_if_polling(struct task_struct *p)
> }
>
> #else
> -static inline bool set_nr_and_not_polling(struct task_struct *p)
> +static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
> {
> - set_tsk_need_resched(p);
> + atomic_long_or(1 << tif, (atomic_long_t *)&ti->flags);
> return true;
> }
>
> @@ -1039,28 +1038,66 @@ void wake_up_q(struct wake_q_head *head)
> * might also involve a cross-CPU call to trigger the scheduler on
> * the target CPU.
> */
> -void resched_curr(struct rq *rq)
> +static void __resched_curr(struct rq *rq, int tif)
> {
> struct task_struct *curr = rq->curr;
> + struct thread_info *cti = task_thread_info(curr);
> int cpu;
>
> lockdep_assert_rq_held(rq);
>
> - if (test_tsk_need_resched(curr))
> + if (is_idle_task(curr) && tif == TIF_NEED_RESCHED_LAZY)
> + tif = TIF_NEED_RESCHED;
> +
> + if (cti->flags & ((1 << tif) | _TIF_NEED_RESCHED))
> return;
>
> cpu = cpu_of(rq);
>
> if (cpu == smp_processor_id()) {
> - set_tsk_need_resched(curr);
> - set_preempt_need_resched();
> + set_ti_thread_flag(cti, tif);
> + if (tif == TIF_NEED_RESCHED)
> + set_preempt_need_resched();
> return;
> }
>
> - if (set_nr_and_not_polling(curr))
> - smp_send_reschedule(cpu);
> - else
> + if (set_nr_and_not_polling(cti, tif)) {
> + if (tif == TIF_NEED_RESCHED)
> + smp_send_reschedule(cpu);
> + } else {
> trace_sched_wake_idle_without_ipi(cpu);
> + }
> +}
> +
> +void resched_curr(struct rq *rq)
> +{
> + __resched_curr(rq, TIF_NEED_RESCHED);
> +}
> +
> +#ifdef CONFIG_PREEMPT_DYNAMIC
> +static DEFINE_STATIC_KEY_FALSE(sk_dynamic_preempt_lazy);
> +static __always_inline bool dynamic_preempt_lazy(void)
> +{
> + return static_branch_unlikely(&sk_dynamic_preempt_lazy);
> +}
> +#else
> +static __always_inline bool dynamic_preempt_lazy(void)
> +{
> + return IS_ENABLED(PREEMPT_LAZY);
> +}
> +#endif
> +
> +static __always_inline int tif_need_resched_lazy(void)
> +{
> + if (dynamic_preempt_lazy())
> + return TIF_NEED_RESCHED_LAZY;
> +
> + return TIF_NEED_RESCHED;
> +}
> +
> +void resched_curr_lazy(struct rq *rq)
> +{
> + __resched_curr(rq, tif_need_resched_lazy());
> }
>
> void resched_cpu(int cpu)
> @@ -1155,7 +1192,7 @@ static void wake_up_idle_cpu(int cpu)
> * and testing of the above solutions didn't appear to report
> * much benefits.
> */
> - if (set_nr_and_not_polling(rq->idle))
> + if (set_nr_and_not_polling(task_thread_info(rq->idle), TIF_NEED_RESCHED))
> smp_send_reschedule(cpu);
> else
> trace_sched_wake_idle_without_ipi(cpu);
> @@ -5537,6 +5574,10 @@ void sched_tick(void)
> update_rq_clock(rq);
> hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
> update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);
> +
> + if (dynamic_preempt_lazy() && tif_test_bit(TIF_NEED_RESCHED_LAZY))
> + resched_curr(rq);
> +
> curr->sched_class->task_tick(rq, curr, 0);
> if (sched_feat(LATENCY_WARN))
> resched_latency = cpu_resched_latency(rq);
> @@ -7245,6 +7286,7 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
> * preempt_schedule <- NOP
> * preempt_schedule_notrace <- NOP
> * irqentry_exit_cond_resched <- NOP
> + * dynamic_preempt_lazy <- false
> *
> * VOLUNTARY:
> * cond_resched <- __cond_resched
> @@ -7252,6 +7294,7 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
> * preempt_schedule <- NOP
> * preempt_schedule_notrace <- NOP
> * irqentry_exit_cond_resched <- NOP
> + * dynamic_preempt_lazy <- false
> *
> * FULL:
> * cond_resched <- RET0
> @@ -7259,6 +7302,15 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
> * preempt_schedule <- preempt_schedule
> * preempt_schedule_notrace <- preempt_schedule_notrace
> * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
> + * dynamic_preempt_lazy <- false
> + *
> + * LAZY:
> + * cond_resched <- RET0
> + * might_resched <- RET0
> + * preempt_schedule <- preempt_schedule
> + * preempt_schedule_notrace <- preempt_schedule_notrace
> + * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
> + * dynamic_preempt_lazy <- true
> */
>
> enum {
> @@ -7266,6 +7318,7 @@ enum {
> preempt_dynamic_none,
> preempt_dynamic_voluntary,
> preempt_dynamic_full,
> + preempt_dynamic_lazy,
> };
>
> int preempt_dynamic_mode = preempt_dynamic_undefined;
> @@ -7281,15 +7334,23 @@ int sched_dynamic_mode(const char *str)
> if (!strcmp(str, "full"))
> return preempt_dynamic_full;
>
> +#ifdef CONFIG_ARCH_HAS_PREEMPT_LAZY
> + if (!strcmp(str, "lazy"))
> + return preempt_dynamic_lazy;
> +#endif
> +
> return -EINVAL;
> }
>
> +#define preempt_dynamic_key_enable(f) static_key_enable(&sk_dynamic_##f.key)
> +#define preempt_dynamic_key_disable(f) static_key_disable(&sk_dynamic_##f.key)
> +
> #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> #define preempt_dynamic_enable(f) static_call_update(f, f##_dynamic_enabled)
> #define preempt_dynamic_disable(f) static_call_update(f, f##_dynamic_disabled)
> #elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
> -#define preempt_dynamic_enable(f) static_key_enable(&sk_dynamic_##f.key)
> -#define preempt_dynamic_disable(f) static_key_disable(&sk_dynamic_##f.key)
> +#define preempt_dynamic_enable(f) preempt_dynamic_key_enable(f)
> +#define preempt_dynamic_disable(f) preempt_dynamic_key_disable(f)
> #else
> #error "Unsupported PREEMPT_DYNAMIC mechanism"
> #endif
> @@ -7309,6 +7370,7 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_enable(preempt_schedule);
> preempt_dynamic_enable(preempt_schedule_notrace);
> preempt_dynamic_enable(irqentry_exit_cond_resched);
> + preempt_dynamic_key_disable(preempt_lazy);
>
> switch (mode) {
> case preempt_dynamic_none:
> @@ -7318,6 +7380,7 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_disable(preempt_schedule);
> preempt_dynamic_disable(preempt_schedule_notrace);
> preempt_dynamic_disable(irqentry_exit_cond_resched);
> + preempt_dynamic_key_disable(preempt_lazy);
> if (mode != preempt_dynamic_mode)
> pr_info("Dynamic Preempt: none\n");
> break;
> @@ -7329,6 +7392,7 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_disable(preempt_schedule);
> preempt_dynamic_disable(preempt_schedule_notrace);
> preempt_dynamic_disable(irqentry_exit_cond_resched);
> + preempt_dynamic_key_disable(preempt_lazy);
> if (mode != preempt_dynamic_mode)
> pr_info("Dynamic Preempt: voluntary\n");
> break;
> @@ -7340,9 +7404,22 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_enable(preempt_schedule);
> preempt_dynamic_enable(preempt_schedule_notrace);
> preempt_dynamic_enable(irqentry_exit_cond_resched);
> + preempt_dynamic_key_disable(preempt_lazy);
> if (mode != preempt_dynamic_mode)
> pr_info("Dynamic Preempt: full\n");
> break;
> +
> + case preempt_dynamic_lazy:
> + if (!klp_override)
> + preempt_dynamic_disable(cond_resched);
> + preempt_dynamic_disable(might_resched);
> + preempt_dynamic_enable(preempt_schedule);
> + preempt_dynamic_enable(preempt_schedule_notrace);
> + preempt_dynamic_enable(irqentry_exit_cond_resched);
> + preempt_dynamic_key_enable(preempt_lazy);
> + if (mode != preempt_dynamic_mode)
> + pr_info("Dynamic Preempt: lazy\n");
> + break;
> }
>
> preempt_dynamic_mode = mode;
> @@ -7405,6 +7482,8 @@ static void __init preempt_dynamic_init(void)
> sched_dynamic_update(preempt_dynamic_none);
> } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
> sched_dynamic_update(preempt_dynamic_voluntary);
> + } else if (IS_ENABLED(CONFIG_PREEMPT_LAZY)) {
> + sched_dynamic_update(preempt_dynamic_lazy);
> } else {
> /* Default static call setting, nothing to do */
> WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
> @@ -7425,6 +7504,7 @@ static void __init preempt_dynamic_init(void)
> PREEMPT_MODEL_ACCESSOR(none);
> PREEMPT_MODEL_ACCESSOR(voluntary);
> PREEMPT_MODEL_ACCESSOR(full);
> +PREEMPT_MODEL_ACCESSOR(lazy);
>
> #else /* !CONFIG_PREEMPT_DYNAMIC: */
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 1bc24410ae501..87309cf247c68 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -245,7 +245,7 @@ static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
> static int sched_dynamic_show(struct seq_file *m, void *v)
> {
> static const char * preempt_modes[] = {
> - "none", "voluntary", "full"
> + "none", "voluntary", "full", "lazy",
> };
> int i;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5b5d50dbc79dc..71b4112cadde0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1007,7 +1007,7 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> * The task has consumed its request, reschedule.
> */
> if (cfs_rq->nr_running > 1) {
> - resched_curr(rq_of(cfs_rq));
> + resched_curr_lazy(rq_of(cfs_rq));
> clear_buddies(cfs_rq, se);
> }
> }
> @@ -8615,7 +8615,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> return;
>
> preempt:
> - resched_curr(rq);
> + resched_curr_lazy(rq);
> }
>
> static struct task_struct *pick_task_fair(struct rq *rq)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 041d8e00a1568..48a4617a5b28b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2494,6 +2494,7 @@ extern void init_sched_fair_class(void);
> extern void reweight_task(struct task_struct *p, int prio);
>
> extern void resched_curr(struct rq *rq);
> +extern void resched_curr_lazy(struct rq *rq);
> extern void resched_cpu(int cpu);
>
> extern struct rt_bandwidth def_rt_bandwidth;


--
ankur

2024-06-06 17:33:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO

On Thu, Jun 06, 2024 at 08:11:41AM -0700, Ankur Arora wrote:
>
> Peter Zijlstra <[email protected]> writes:
>
> > On Thu, May 30, 2024 at 02:29:45AM -0700, Ankur Arora wrote:
> >>
> >> Peter Zijlstra <[email protected]> writes:
> >>
> >> > On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
> >> >> Reuse sched_dynamic_update() and related logic to enable choosing
> >> >> the preemption model at boot or runtime for PREEMPT_AUTO.
> >> >>
> >> >> The interface is identical to PREEMPT_DYNAMIC.
> >> >
> >> > Colour me confused, why?!? What are you doing and why aren't just just
> >> > adding AUTO to the existing DYNAMIC thing?
> >>
> >> You mean have a single __sched_dynamic_update()? AUTO doesn't use any
> >> of the static_call/static_key stuff so I'm not sure how that would work.
> >
> > *sigh*... see the below, seems to work.
>
> Sorry, didn't mean for you to have to do all that work to prove the
> point.

Well, for a large part it was needed for me to figure out what your
patches were actually doing anyway. Peel away all the layers and this is
what remains.

> I phrased it badly. I do understand how lazy can be folded in as
> you do here:
>
> > + case preempt_dynamic_lazy:
> > + if (!klp_override)
> > + preempt_dynamic_disable(cond_resched);
> > + preempt_dynamic_disable(might_resched);
> > + preempt_dynamic_enable(preempt_schedule);
> > + preempt_dynamic_enable(preempt_schedule_notrace);
> > + preempt_dynamic_enable(irqentry_exit_cond_resched);
> > + preempt_dynamic_key_enable(preempt_lazy);
> > + if (mode != preempt_dynamic_mode)
> > + pr_info("Dynamic Preempt: lazy\n");
> > + break;
> > }
>
> But, if the long term goal (at least as I understand it) is to get rid
> of cond_resched() -- to allow optimizations that needing to call cond_resched()
> makes impossible -- does it make sense to pull all of these together?

It certainly doesn't make sense to add yet another configurable thing. We
have one, so yes add it here.

> Say, eventually preempt_dynamic_lazy and preempt_dynamic_full are the
> only two models left. Then we will have (modulo figuring out how to
> switch over klp from cond_resched() to a different unwinding technique):
>
> static void __sched_dynamic_update(int mode)
> {
> preempt_dynamic_enable(preempt_schedule);
> preempt_dynamic_enable(preempt_schedule_notrace);
> preempt_dynamic_enable(irqentry_exit_cond_resched);
>
> switch (mode) {
> case preempt_dynamic_full:
> preempt_dynamic_key_disable(preempt_lazy);
> if (mode != preempt_dynamic_mode)
> pr_info("%s: full\n", PREEMPT_MODE);
> break;
>
> case preempt_dynamic_lazy:
> preempt_dynamic_key_enable(preempt_lazy);
> if (mode != preempt_dynamic_mode)
> pr_info("Dynamic Preempt: lazy\n");
> break;
> }
>
> preempt_dynamic_mode = mode;
> }
>
> Which is pretty similar to what the PREEMPT_AUTO code was doing.

Right, but without duplicating all that stuff in the interim.

2024-06-07 16:49:30

by Shrikanth Hegde

[permalink] [raw]
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling



On 6/4/24 1:02 PM, Shrikanth Hegde wrote:
>
>
> On 6/1/24 5:17 PM, Ankur Arora wrote:
>>
>> Shrikanth Hegde <[email protected]> writes:
>>
>>> On 5/28/24 6:04 AM, Ankur Arora wrote:
>>>> Hi,
>>>>
>>>> This series adds a new scheduling model PREEMPT_AUTO, which like
>>>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>>>> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend
>>>> on explicit preemption points for the voluntary models.
>>>>
>>>> The series is based on Thomas' original proposal which he outlined
>>>> in [1], [2] and in his PoC [3].
>>>>
>>>> v2 mostly reworks v1, with one of the main changes having less
>>>> noisy need-resched-lazy related interfaces.
>>>> More details in the changelog below.
>>>>
>>>
>>> Hi Ankur. Thanks for the series.
>>>
>>> nit: had to manually patch 11,12,13 since it didnt apply cleanly on
>>> tip/master and tip/sched/core. Mostly due some word differences in the change.
>>>
>>> tip/master was at:
>>> commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD)
>>> Merge: 5d145493a139 47ff30cc1be7
>>> Author: Ingo Molnar <[email protected]>
>>> Date: Tue May 28 12:44:26 2024 +0200
>>>
>>> Merge branch into tip/master: 'x86/percpu'
>>>
>>>
>>>
>>>> The v1 of the series is at [4] and the RFC at [5].
>>>>
>>>> Design
>>>> ==
>>>>
>>>> PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
>>>> PREEMPT_COUNT). This means that the scheduler can always safely
>>>> preempt. (This is identical to CONFIG_PREEMPT.)
>>>>
>>>> Having that, the next step is to make the rescheduling policy dependent
>>>> on the chosen scheduling model. Currently, the scheduler uses a single
>>>> need-resched bit (TIF_NEED_RESCHED) which it uses to state that a
>>>> reschedule is needed.
>>>> PREEMPT_AUTO extends this by adding an additional need-resched bit
>>>> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED now allows the
>>>> scheduler to express two kinds of rescheduling intent: schedule at
>>>> the earliest opportunity (TIF_NEED_RESCHED), or express a need for
>>>> rescheduling while allowing the task on the runqueue to run to
>>>> timeslice completion (TIF_NEED_RESCHED_LAZY).
>>>>
>>>> The scheduler decides which need-resched bits are chosen based on
>>>> the preemption model in use:
>>>>
>>>> TIF_NEED_RESCHED TIF_NEED_RESCHED_LAZY
>>>>
>>>> none never always [*]
>>>> voluntary higher sched class other tasks [*]
>>>> full always never
>>>>
>>>> [*] some details elided.
>>>>
>>>> The last part of the puzzle is, when does preemption happen, or
>>>> alternately stated, when are the need-resched bits checked:
>>>>
>>>> exit-to-user ret-to-kernel preempt_count()
>>>>
>>>> NEED_RESCHED_LAZY Y N N
>>>> NEED_RESCHED Y Y Y
>>>>
>>>> Using NEED_RESCHED_LAZY allows for run-to-completion semantics when
>>>> none/voluntary preemption policies are in effect. And eager semantics
>>>> under full preemption.
>>>>
>>>> In addition, since this is driven purely by the scheduler (not
>>>> depending on cond_resched() placement and the like), there is enough
>>>> flexibility in the scheduler to cope with edge cases -- ex. a kernel
>>>> task not relinquishing CPU under NEED_RESCHED_LAZY can be handled by
>>>> simply upgrading to a full NEED_RESCHED which can use more coercive
>>>> instruments like resched IPI to induce a context-switch.
>>>>
>>>> Performance
>>>> ==
>>>> The performance in the basic tests (perf bench sched messaging, kernbench,
>>>> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC.
>>>> (See patches
>>>> "sched: support preempt=none under PREEMPT_AUTO"
>>>> "sched: support preempt=full under PREEMPT_AUTO"
>>>> "sched: handle preempt=voluntary under PREEMPT_AUTO")
>>>>
>>>> For a macro test, a colleague in Oracle's Exadata team tried two
>>>> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series
>>>> backported.)
>>>>
>>>> In both tests the data was cached on remote nodes (cells), and the
>>>> database nodes (compute) served client queries, with clients being
>>>> local in the first test and remote in the second.
>>>>
>>>> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
>>>> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs
>>>>
>>>>
>>>> PREEMPT_VOLUNTARY PREEMPT_AUTO
>>>> (preempt=voluntary)
>>>> ============================== =============================
>>>> clients throughput cpu-usage throughput cpu-usage Gain
>>>> (tx/min) (utime %/stime %) (tx/min) (utime %/stime %)
>>>> ------- ---------- ----------------- ---------- ----------------- -------
>>>>
>>>>
>>>> OLTP 384 9,315,653 25/ 6 9,253,252 25/ 6 -0.7%
>>>> benchmark 1536 13,177,565 50/10 13,657,306 50/10 +3.6%
>>>> (local clients) 3456 14,063,017 63/12 14,179,706 64/12 +0.8%
>>>>
>>>>
>>>> OLTP 96 8,973,985 17/ 2 8,924,926 17/ 2 -0.5%
>>>> benchmark 384 22,577,254 60/ 8 22,211,419 59/ 8 -1.6%
>>>> (remote clients, 2304 25,882,857 82/11 25,536,100 82/11 -1.3%
>>>> 90/10 RW ratio)
>>>>
>>>>
>>>> (Both sets of tests have a fair amount of NW traffic since the query
>>>> tables etc are cached on the cells. Additionally, the first set,
>>>> given the local clients, stress the scheduler a bit more than the
>>>> second.)
>>>>
>>>> The comparative performance for both the tests is fairly close,
>>>> more or less within a margin of error.
>>>>
>>>> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM):
>>>>
>>>> "
>>>> a) Base kernel (6.7),
>>>> b) v1, PREEMPT_AUTO, preempt=voluntary
>>>> c) v1, PREEMPT_DYNAMIC, preempt=voluntary
>>>> d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y
>>>>
>>>> Workloads I tested and their %gain,
>>>> case b case c case d
>>>> NAS +2.7% +1.9% +2.1%
>>>> Hashjoin, +0.0% +0.0% +0.0%
>>>> Graph500, -6.0% +0.0% +0.0%
>>>> XSBench +1.7% +0.0% +1.2%
>>>>
>>>> (Note about the Graph500 numbers at [8].)
>>>>
>>>> Did kernbench etc test from Mel's mmtests suite also. Did not notice
>>>> much difference.
>>>> "
>>>>
>>>> One case where there is a significant performance drop is on powerpc,
>>>> seen running hackbench on a 320 core system (a test on a smaller system is
>>>> fine.) In theory there's no reason for this to only happen on powerpc
>>>> since most of the code is common, but I haven't been able to reproduce
>>>> it on x86 so far.
>>>>
>>>> All in all, I think the tests above show that this scheduling model has legs.
>>>> However, the none/voluntary models under PREEMPT_AUTO are conceptually
>>>> different enough from the current none/voluntary models that there
>>>> likely are workloads where performance would be subpar. That needs more
>>>> extensive testing to figure out the weak points.
>>>>
>>>>
>>>>
>>> Did test it again on PowerPC. Unfortunately numbers shows there is regression
>>> still compared to 6.10-rc1. This is done with preempt=none. I tried again on the
>>> smaller system too to confirm. For now I have done the comparison for the hackbench
>>> where highest regression was seen in v1.
>>>
>>> perf stat collected for 20 iterations show higher context switch and higher migrations.
>>> Could it be that LAZY bit is causing more context switches? or could it be something
>>> else? Could it be that more exit-to-user happens in PowerPC? will continue to debug.
>>
>> Thanks for trying it out.
>>
>> As you point out, context-switches and migrations are signficantly higher.
>>
>> Definitely unexpected. I ran the same test on an x86 box
>> (Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference.
>>
>> 6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% )
>> 6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% )
>> 6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% )
>>
>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% )
>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% )
>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% )
>>
>> Clearly there's something different going on powerpc. I'm travelling
>> right now, but will dig deeper into this once I get back.
>>
>> Meanwhile can you check if the increased context-switches are voluntary or
>> involuntary (or what the division is)?
>
>
> Used "pidstat -w -p ALL 1 10" to capture 10 seconds data at 1 second interval for
> context switches per second while running "hackbench -pipe 60 process 100000 loops"
>
>
> preempt=none 6.10 preempt_auto
> =============================================================================
> voluntary context switches 7632166.19 9391636.34(+23%)
> involuntary context switches 2305544.07 3527293.94(+53%)
>
> Numbers vary between multiple runs. But trend seems to be similar. Both the context switches increase
> involuntary seems to increase at higher rate.
>
>


Continued data on the hackbench regression; preempt=none in both cases.
From mpstat, I see slightly higher idle time and more irq time with preempt_auto.

6.10-rc1:
=========
10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
09:45:23 AM all 4.14 0.00 77.57 0.00 16.92 0.00 0.00 0.00 0.00 1.37
09:45:24 AM all 4.42 0.00 77.62 0.00 16.76 0.00 0.00 0.00 0.00 1.20
09:45:25 AM all 4.43 0.00 77.45 0.00 16.94 0.00 0.00 0.00 0.00 1.18
09:45:26 AM all 4.45 0.00 77.87 0.00 16.68 0.00 0.00 0.00 0.00 0.99

PREEMPT_AUTO:
===========
10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:09:56 AM all 3.11 0.00 72.59 0.00 21.34 0.00 0.00 0.00 0.00 2.96
10:09:57 AM all 3.31 0.00 73.10 0.00 20.99 0.00 0.00 0.00 0.00 2.60
10:09:58 AM all 3.40 0.00 72.83 0.00 20.85 0.00 0.00 0.00 0.00 2.92
10:10:00 AM all 3.21 0.00 72.87 0.00 21.19 0.00 0.00 0.00 0.00 2.73
10:10:01 AM all 3.02 0.00 72.18 0.00 21.08 0.00 0.00 0.00 0.00 3.71

Used the bcc tools hardirq and softirq to see if irqs are increasing. The softirq data implies there are
more timer and sched softirqs. Numbers vary between different samples, but the trend seems to be similar.

6.10-rc1:
=========
SOFTIRQ TOTAL_usecs
tasklet 71
block 145
net_rx 7914
rcu 136988
timer 304357
sched 1404497



PREEMPT_AUTO:
===========
SOFTIRQ TOTAL_usecs
tasklet 80
block 139
net_rx 6907
rcu 223508
timer 492767
sched 1794441
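
For reference, the per-vector totals above look like the default output of the
bcc softirqs tool; something along these lines (install path may differ by
distro, so treat this as a sketch) should reproduce them:

  # total time spent in each softirq vector over a 10 second window, in usecs
  /usr/share/bcc/tools/softirqs 10 1
  # likewise for hardirq handlers
  /usr/share/bcc/tools/hardirqs 10 1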


Would any specific setting of RCU matter for this?
This is what I have in config.

# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_TREE_SRCU=y
CONFIG_NEED_SRCU_NMI_SAFE=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_NEED_TASKS_RCU=y
CONFIG_TASKS_RCU=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
# CONFIG_RCU_LAZY is not set
# end of RCU Subsystem


# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_CONTEXT_TRACKING_USER=y
# CONFIG_CONTEXT_TRACKING_USER_FORCE is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem


2024-06-09 00:47:28

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO


Peter Zijlstra <[email protected]> writes:

> On Thu, Jun 06, 2024 at 08:11:41AM -0700, Ankur Arora wrote:
>>
>> Peter Zijlstra <[email protected]> writes:
>>
>> > On Thu, May 30, 2024 at 02:29:45AM -0700, Ankur Arora wrote:
>> >>
>> >> Peter Zijlstra <[email protected]> writes:
>> >>
>> >> > On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
>> >> >> Reuse sched_dynamic_update() and related logic to enable choosing
>> >> >> the preemption model at boot or runtime for PREEMPT_AUTO.
>> >> >>
>> >> >> The interface is identical to PREEMPT_DYNAMIC.
>> >> >
>> >> > Colour me confused, why?!? What are you doing and why aren't you just
>> >> > adding AUTO to the existing DYNAMIC thing?
>> >>
>> >> You mean have a single __sched_dynamic_update()? AUTO doesn't use any
>> >> of the static_call/static_key stuff so I'm not sure how that would work.
>> >
>> > *sigh*... see the below, seems to work.
>>
>> Sorry, didn't mean for you to have to do all that work to prove the
>> point.
>
> Well, for a large part it was needed for me to figure out what your
> patches were actually doing anyway. Peel away all the layers and this is
> what remains.
>
>> I phrased it badly. I do understand how lazy can be folded in as
>> you do here:
>>
>> > + case preempt_dynamic_lazy:
>> > + if (!klp_override)
>> > + preempt_dynamic_disable(cond_resched);
>> > + preempt_dynamic_disable(might_resched);
>> > + preempt_dynamic_enable(preempt_schedule);
>> > + preempt_dynamic_enable(preempt_schedule_notrace);
>> > + preempt_dynamic_enable(irqentry_exit_cond_resched);
>> > + preempt_dynamic_key_enable(preempt_lazy);
>> > + if (mode != preempt_dynamic_mode)
>> > + pr_info("Dynamic Preempt: lazy\n");
>> > + break;
>> > }
>>
>> But, if the long term goal (at least as I understand it) is to get rid
>> of cond_resched() -- to allow optimizations that needing to call cond_resched()
>> makes impossible -- does it make sense to pull all of these together?
>
> It certainly doesn't make sense to add yet another configurable thing. We
> have one, so yes add it here.
>
>> Say, eventually preempt_dynamic_lazy and preempt_dynamic_full are the
>> only two models left. Then we will have (modulo figuring out how to
>> switch over klp from cond_resched() to a different unwinding technique):
>>
>> static void __sched_dynamic_update(int mode)
>> {
>> preempt_dynamic_enable(preempt_schedule);
>> preempt_dynamic_enable(preempt_schedule_notrace);
>> preempt_dynamic_enable(irqentry_exit_cond_resched);
>>
>> switch (mode) {
>> case preempt_dynamic_full:
>> preempt_dynamic_key_disable(preempt_lazy);
>> if (mode != preempt_dynamic_mode)
>> pr_info("%s: full\n", PREEMPT_MODE);
>> break;
>>
>> case preempt_dynamic_lazy:
>> preempt_dynamic_key_enable(preempt_lazy);
>> if (mode != preempt_dynamic_mode)
>> pr_info("Dynamic Preempt: lazy\n");
>> break;
>> }
>>
>> preempt_dynamic_mode = mode;
>> }
>>
>> Which is pretty similar to what the PREEMPT_AUTO code was doing.
>
> Right, but without duplicating all that stuff in the interim.

Yeah, that makes sense. Joel had suggested something on these lines
earlier [1], to which I was resistant.

However, the duplication (and the fact that the voluntary model
was quite thin) should have told me that (AUTO, preempt=voluntary)
should just be folded under PREEMPT_DYNAMIC.

I'll rework the series to do that.

That should also simplify the RCU related choices, which I think Paul will
like. Given that the lazy model is meant to eventually replace
none/voluntary, the PREEMPT_RCU configuration can just be:

--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -18,7 +18,7 @@ config TREE_RCU

config PREEMPT_RCU
bool
- default y if PREEMPTION
+ default y if PREEMPTION && !PREEMPT_LAZY


Or, maybe we should instead have this:

--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -18,7 +18,7 @@ config TREE_RCU

config PREEMPT_RCU
bool
- default y if PREEMPTION
+ default y if PREEMPT || PREEMPT_RT
select TREE_RCU

Though this would be a change in behaviour for current PREEMPT_DYNAMIC
users.

[1] https://lore.kernel.org/lkml/[email protected]/

Thanks
--
ankur

2024-06-10 07:24:57

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling


Shrikanth Hegde <[email protected]> writes:

> On 6/4/24 1:02 PM, Shrikanth Hegde wrote:
>>
>>
>> On 6/1/24 5:17 PM, Ankur Arora wrote:
>>>
>>> Shrikanth Hegde <[email protected]> writes:
>>>
>>>> On 5/28/24 6:04 AM, Ankur Arora wrote:
>>>>> Hi,
>>>>>
>>>>> This series adds a new scheduling model PREEMPT_AUTO, which like
>>>>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>>>>> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend
>>>>> on explicit preemption points for the voluntary models.
>>>>>
>>>>> The series is based on Thomas' original proposal which he outlined
>>>>> in [1], [2] and in his PoC [3].
>>>>>
>>>>> v2 mostly reworks v1, with one of the main changes having less
>>>>> noisy need-resched-lazy related interfaces.
>>>>> More details in the changelog below.
>>>>>
>>>>
>>>> Hi Ankur. Thanks for the series.
>>>>
>>>> nit: had to manually apply patches 11, 12, 13 since they didn't apply cleanly on
>>>> tip/master and tip/sched/core. Mostly due to some word differences in the change.
>>>>
>>>> tip/master was at:
>>>> commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD)
>>>> Merge: 5d145493a139 47ff30cc1be7
>>>> Author: Ingo Molnar <[email protected]>
>>>> Date: Tue May 28 12:44:26 2024 +0200
>>>>
>>>> Merge branch into tip/master: 'x86/percpu'
>>>>
>>>>
>>>>
>>>>> [...]
>>>>>
>>>>> One case where there is a significant performance drop is on powerpc,
>>>>> seen running hackbench on a 320 core system (a test on a smaller system is
>>>>> fine.) In theory there's no reason for this to only happen on powerpc
>>>>> since most of the code is common, but I haven't been able to reproduce
>>>>> it on x86 so far.
>>>>>
>>>>> All in all, I think the tests above show that this scheduling model has legs.
>>>>> However, the none/voluntary models under PREEMPT_AUTO are conceptually
>>>>> different enough from the current none/voluntary models that there
>>>>> likely are workloads where performance would be subpar. That needs more
>>>>> extensive testing to figure out the weak points.
>>>>>
>>>>>
>>>>>
>>>> Did test it again on PowerPC. Unfortunately the numbers show there is still a
>>>> regression compared to 6.10-rc1. This is done with preempt=none. I tried again on the
>>>> smaller system too to confirm. For now I have done the comparison for hackbench,
>>>> where the highest regression was seen in v1.
>>>>
>>>> perf stat collected for 20 iterations shows higher context switches and higher migrations.
>>>> Could it be that the LAZY bit is causing more context switches? Or could it be something
>>>> else? Could it be that more exit-to-user transitions happen on PowerPC? Will continue to debug.
>>>
>>> Thanks for trying it out.
>>>
>>> As you point out, context-switches and migrations are significantly higher.
>>>
>>> Definitely unexpected. I ran the same test on an x86 box
>>> (Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference.
>>>
>>> 6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% )
>>> 6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% )
>>> 6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% )
>>>
>>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% )
>>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% )
>>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% )
>>>
>>> Clearly there's something different going on on powerpc. I'm travelling
>>> right now, but will dig deeper into this once I get back.
>>>
>>> Meanwhile can you check if the increased context-switches are voluntary or
>>> involuntary (or what the division is)?
>>
>>
>> Used "pidstat -w -p ALL 1 10" to capture 10 seconds data at 1 second interval for
>> context switches per second while running "hackbench -pipe 60 process 100000 loops"
>>
>>
>> preempt=none 6.10 preempt_auto
>> =============================================================================
>> voluntary context switches 7632166.19 9391636.34(+23%)
>> involuntary context switches 2305544.07 3527293.94(+53%)
>>
>> Numbers vary between multiple runs, but the trend seems to be similar: both kinds of context
>> switches increase, with involuntary switches increasing at a higher rate.
>>
>>
>
>
> Continued data from the hackbench regression. preempt=none in both cases.
> From mpstat, I see slightly higher idle time and more irq time with preempt_auto.
>
> 6.10-rc1:
> =========
> 10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 09:45:23 AM all 4.14 0.00 77.57 0.00 16.92 0.00 0.00 0.00 0.00 1.37
> 09:45:24 AM all 4.42 0.00 77.62 0.00 16.76 0.00 0.00 0.00 0.00 1.20
> 09:45:25 AM all 4.43 0.00 77.45 0.00 16.94 0.00 0.00 0.00 0.00 1.18
> 09:45:26 AM all 4.45 0.00 77.87 0.00 16.68 0.00 0.00 0.00 0.00 0.99
>
> PREEMPT_AUTO:
> ===========
> 10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 10:09:56 AM all 3.11 0.00 72.59 0.00 21.34 0.00 0.00 0.00 0.00 2.96
> 10:09:57 AM all 3.31 0.00 73.10 0.00 20.99 0.00 0.00 0.00 0.00 2.60
> 10:09:58 AM all 3.40 0.00 72.83 0.00 20.85 0.00 0.00 0.00 0.00 2.92
> 10:10:00 AM all 3.21 0.00 72.87 0.00 21.19 0.00 0.00 0.00 0.00 2.73
> 10:10:01 AM all 3.02 0.00 72.18 0.00 21.08 0.00 0.00 0.00 0.00 3.71
>
> Used the bcc tools hardirq and softirq to see if irqs are increasing. The softirq data implies there are
> more timer and sched softirqs. Numbers vary between different samples, but the trend seems to be similar.

Yeah, the %sys is lower and the %irq higher. Can you also see where the
increased %irq is coming from? For instance, are the resched IPI counts greater?
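
A crude way to check that -- assuming the relevant IPI types show up as
separate lines in /proc/interrupts on this platform -- would be to diff
snapshots taken around the run, e.g.:

  cat /proc/interrupts > /tmp/irqs.before
  hackbench -pipe 60 process 100000 loops
  cat /proc/interrupts > /tmp/irqs.after
  diff /tmp/irqs.before /tmp/irqs.after   # look at the resched/call-function IPI lines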

> 6.10-rc1:
> =========
> SOFTIRQ TOTAL_usecs
> tasklet 71
> block 145
> net_rx 7914
> rcu 136988
> timer 304357
> sched 1404497
>
>
>
> PREEMPT_AUTO:
> ===========
> SOFTIRQ TOTAL_usecs
> tasklet 80
> block 139
> net_rx 6907
> rcu 223508
> timer 492767
> sched 1794441
>
>
> Would any specific setting of RCU matter for this?
> This is what I have in config.

Don't see how it could matter unless the RCU settings are changing
between the two tests? In my testing I'm also using TREE_RCU=y,
PREEMPT_RCU=n.

Let me see if I can find a test that shows a similar trend to what you
are seeing. And then maybe see if tracing sched-switch might point to
an interesting difference between x86 and powerpc.
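
Roughly what I have in mind (a sketch; loop count reduced here only to keep
the trace manageable, adjust as needed):

  perf sched record -- hackbench -pipe 60 process 10000 loops
  perf sched latency          # per-task runtime, switch counts, wakeup latencies
  perf script | less          # raw sched_switch events; prev_state=R marks an
                              # involuntary (preemption) switch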


Thanks for all the detail.

Ankur

> # RCU Subsystem
> #
> CONFIG_TREE_RCU=y
> # CONFIG_RCU_EXPERT is not set
> CONFIG_TREE_SRCU=y
> CONFIG_NEED_SRCU_NMI_SAFE=y
> CONFIG_TASKS_RCU_GENERIC=y
> CONFIG_NEED_TASKS_RCU=y
> CONFIG_TASKS_RCU=y
> CONFIG_TASKS_RUDE_RCU=y
> CONFIG_TASKS_TRACE_RCU=y
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_RCU_NEED_SEGCBLIST=y
> CONFIG_RCU_NOCB_CPU=y
> # CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
> # CONFIG_RCU_LAZY is not set
> # end of RCU Subsystem
>
>
> # Timers subsystem
> #
> CONFIG_TICK_ONESHOT=y
> CONFIG_NO_HZ_COMMON=y
> # CONFIG_HZ_PERIODIC is not set
> # CONFIG_NO_HZ_IDLE is not set
> CONFIG_NO_HZ_FULL=y
> CONFIG_CONTEXT_TRACKING_USER=y
> # CONFIG_CONTEXT_TRACKING_USER_FORCE is not set
> CONFIG_NO_HZ=y
> CONFIG_HIGH_RES_TIMERS=y
> # end of Timers subsystem


--
ankur

2024-06-12 18:11:20

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO

On Sat, Jun 08, 2024 at 05:46:26PM -0700, Ankur Arora wrote:
> Peter Zijlstra <[email protected]> writes:
> > On Thu, Jun 06, 2024 at 08:11:41AM -0700, Ankur Arora wrote:
> >> Peter Zijlstra <[email protected]> writes:
> >> > On Thu, May 30, 2024 at 02:29:45AM -0700, Ankur Arora wrote:
> >> >> Peter Zijlstra <[email protected]> writes:
> >> >> > On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
> >> >> >> Reuse sched_dynamic_update() and related logic to enable choosing
> >> >> >> the preemption model at boot or runtime for PREEMPT_AUTO.
> >> >> >>
> >> >> >> The interface is identical to PREEMPT_DYNAMIC.
> >> >> >
> >> >> > Colour me confused, why?!? What are you doing and why aren't you just
> >> >> > adding AUTO to the existing DYNAMIC thing?
> >> >>
> >> >> You mean have a single __sched_dynamic_update()? AUTO doesn't use any
> >> >> of the static_call/static_key stuff so I'm not sure how that would work.
> >> >
> >> > *sigh*... see the below, seems to work.
> >>
> >> Sorry, didn't mean for you to have to do all that work to prove the
> >> point.
> >
> > Well, for a large part it was needed for me to figure out what your
> > patches were actually doing anyway. Peel away all the layers and this is
> > what remains.
> >
> >> I phrased it badly. I do understand how lazy can be folded in as
> >> you do here:
> >>
> >> > + case preempt_dynamic_lazy:
> >> > + if (!klp_override)
> >> > + preempt_dynamic_disable(cond_resched);
> >> > + preempt_dynamic_disable(might_resched);
> >> > + preempt_dynamic_enable(preempt_schedule);
> >> > + preempt_dynamic_enable(preempt_schedule_notrace);
> >> > + preempt_dynamic_enable(irqentry_exit_cond_resched);
> >> > + preempt_dynamic_key_enable(preempt_lazy);
> >> > + if (mode != preempt_dynamic_mode)
> >> > + pr_info("Dynamic Preempt: lazy\n");
> >> > + break;
> >> > }
> >>
> >> But, if the long term goal (at least as I understand it) is to get rid
> >> of cond_resched() -- to allow optimizations that needing to call cond_resched()
> >> makes impossible -- does it make sense to pull all of these together?
> >
> > It certainly doesn't make sense to add yet another configurable thing. We
> > have one, so yes add it here.
> >
> >> Say, eventually preempt_dynamic_lazy and preempt_dynamic_full are the
> >> only two models left. Then we will have (modulo figuring out how to
> >> switch over klp from cond_resched() to a different unwinding technique):
> >>
> >> static void __sched_dynamic_update(int mode)
> >> {
> >> preempt_dynamic_enable(preempt_schedule);
> >> preempt_dynamic_enable(preempt_schedule_notrace);
> >> preempt_dynamic_enable(irqentry_exit_cond_resched);
> >>
> >> switch (mode) {
> >> case preempt_dynamic_full:
> >> preempt_dynamic_key_disable(preempt_lazy);
> >> if (mode != preempt_dynamic_mode)
> >> pr_info("%s: full\n", PREEMPT_MODE);
> >> break;
> >>
> >> case preempt_dynamic_lazy:
> >> preempt_dynamic_key_enable(preempt_lazy);
> >> if (mode != preempt_dynamic_mode)
> >> pr_info("Dynamic Preempt: lazy\n");
> >> break;
> >> }
> >>
> >> preempt_dynamic_mode = mode;
> >> }
> >>
> >> Which is pretty similar to what the PREEMPT_AUTO code was doing.
> >
> > Right, but without duplicating all that stuff in the interim.
>
> Yeah, that makes sense. Joel had suggested something on these lines
> earlier [1], to which I was resistant.
>
> However, the duplication (and the fact that the voluntary model
> was quite thin) should have told me that (AUTO, preempt=voluntary)
> should just be folded under PREEMPT_DYNAMIC.
>
> I'll rework the series to do that.
>
> That should also simplify the RCU related choices, which I think Paul will
> like. Given that the lazy model is meant to eventually replace
> none/voluntary, the PREEMPT_RCU configuration can just be:
>
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -18,7 +18,7 @@ config TREE_RCU
>
> config PREEMPT_RCU
> bool
> - default y if PREEMPTION
> + default y if PREEMPTION && !PREEMPT_LAZY

Given that PREEMPT_DYNAMIC selects PREEMPT_BUILD which in turn selects
PREEMPTION, this should work.

> Or, maybe we should instead have this:
>
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -18,7 +18,7 @@ config TREE_RCU
>
> config PREEMPT_RCU
> bool
> - default y if PREEMPTION
> + default y if PREEMPT || PREEMPT_RT
> select TREE_RCU
>
> Though this would be a change in behaviour for current PREEMPT_DYNAMIC
> users.

Which I believe to be a no-go. I believe that PREEMPT_DYNAMIC users
really need their preemptible kernels to include preemptible RCU.

If PREEMPT_LAZY causes PREEMPT_DYNAMIC non-preemptible kernels to become
lazily preemptible, that is a topic to discuss with PREEMPT_DYNAMIC users.
On the other hand, if PREEMPT_LAZY does not cause PREEMPT_DYNAMIC
kernels to become lazily preemptible, then I would expect there to be
hard questions about removing cond_resched() and might_sleep(), or,
for that matter changing their semantics. Which I again must leave
to PREEMPT_DYNAMIC users.

Thanx, Paul

> [1] https://lore.kernel.org/lkml/[email protected]/
>
> Thanks
> --
> ankur

2024-06-15 15:05:33

by Shrikanth Hegde

[permalink] [raw]
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling



On 6/10/24 12:53 PM, Ankur Arora wrote:
>
>> From mpstat, I see slightly higher idle time and more irq time with preempt_auto.
>>
>> 6.10-rc1:
>> =========
>> 10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>> 09:45:23 AM all 4.14 0.00 77.57 0.00 16.92 0.00 0.00 0.00 0.00 1.37
>> 09:45:24 AM all 4.42 0.00 77.62 0.00 16.76 0.00 0.00 0.00 0.00 1.20
>> 09:45:25 AM all 4.43 0.00 77.45 0.00 16.94 0.00 0.00 0.00 0.00 1.18
>> 09:45:26 AM all 4.45 0.00 77.87 0.00 16.68 0.00 0.00 0.00 0.00 0.99
>>
>> PREEMPT_AUTO:
>> ===========
>> 10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>> 10:09:56 AM all 3.11 0.00 72.59 0.00 21.34 0.00 0.00 0.00 0.00 2.96
>> 10:09:57 AM all 3.31 0.00 73.10 0.00 20.99 0.00 0.00 0.00 0.00 2.60
>> 10:09:58 AM all 3.40 0.00 72.83 0.00 20.85 0.00 0.00 0.00 0.00 2.92
>> 10:10:00 AM all 3.21 0.00 72.87 0.00 21.19 0.00 0.00 0.00 0.00 2.73
>> 10:10:01 AM all 3.02 0.00 72.18 0.00 21.08 0.00 0.00 0.00 0.00 3.71
>>
>> Used the bcc tools hardirq and softirq to see if irqs are increasing. The softirq data implies there are
>> more timer and sched softirqs. Numbers vary between different samples, but the trend seems to be similar.
>
> Yeah, the %sys is lower and the %irq higher. Can you also see where the
> increased %irq is coming from? For instance, are the resched IPI counts greater?

Hi Ankur,


Used mpstat -I ALL to capture this info for 20 seconds.

HARDIRQ per second:
===================
6.10:
===================
18 19 22 23 48 49 50 51 LOC BCT LOC2 SPU PMI MCE NMI WDG DBL
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
417956.86 1114642.30 1712683.65 2058664.99 0.00 0.00 18.30 0.39 31978.37 0.00 0.35 351.98 0.00 0.00 0.00 6405.54 329189.45

Preempt_auto:
===================
18 19 22 23 48 49 50 51 LOC BCT LOC2 SPU PMI MCE NMI WDG DBL
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
609509.69 1910413.99 1923503.52 2061876.33 0.00 0.00 19.14 0.30 31916.59 0.00 0.45 497.88 0.00 0.00 0.00 6825.49 88247.85

18, 19, 22, 23 are XIVE interrupts; these are IPIs. I am not sure which type of IPI these are. Will have to see why they are increasing.
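
One way to attribute them might be the generic IPI tracepoints, assuming
ipi:ipi_send_cpu / ipi:ipi_send_cpumask are present and wired up on this
kernel and platform -- the callchains should separate resched IPIs from
call-function IPIs:

  # while hackbench runs in another shell
  perf record -a -g -e ipi:ipi_send_cpu,ipi:ipi_send_cpumask -- sleep 10
  perf report --stdio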


SOFTIRQ per second:
===================
6.10:
===================
HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
0.00 3966.47 0.00 18.25 0.59 0.00 0.34 12811.00 0.00 9693.95

Preempt_auto:
===================
HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
0.00 4871.67 0.00 18.94 0.40 0.00 0.25 13518.66 0.00 15732.77

Note: RCU softirq seems to increase significantly. Not sure what triggers it; still trying to figure out why.
It may be irqs triggering softirqs, or softirqs causing more IPIs.
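
A sketch of one way to see who raises the RCU softirq, assuming bpftrace is
available (vec 9 is RCU_SOFTIRQ in the softirq enum):

  bpftrace -e 'tracepoint:irq:softirq_raise /args->vec == 9/ { @[kstack] = count(); }'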



Also, noticed the config difference below, where these options get dropped with preempt auto. This happens because PREEMPTION forces them to N. Made changes in kernel/Kconfig.locks to get them
enabled again, but I still see the same regression in hackbench. These configs may still need attention?

6.10 | preempt auto
CONFIG_INLINE_SPIN_UNLOCK_IRQ=y | CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_INLINE_READ_UNLOCK=y | ----------------------------------------------------------------------------
CONFIG_INLINE_READ_UNLOCK_IRQ=y | ----------------------------------------------------------------------------
CONFIG_INLINE_WRITE_UNLOCK=y | ----------------------------------------------------------------------------
CONFIG_INLINE_WRITE_UNLOCK_IRQ=y | ----------------------------------------------------------------------------
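
As an aside, the kernel's scripts/diffconfig helper is a convenient way to
spot such differences; a sketch, with placeholder names for the two saved
configs:

  # run from the kernel source tree
  scripts/diffconfig config-6.10 config-preempt-auto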


>
>> 6.10-rc1:
>> =========
>> SOFTIRQ TOTAL_usecs
>> tasklet 71
>> block 145
>> net_rx 7914
>> rcu 136988
>> timer 304357
>> sched 1404497
>>
>>
>>
>> PREEMPT_AUTO:
>> ===========
>> SOFTIRQ TOTAL_usecs
>> tasklet 80
>> block 139
>> net_rx 6907
>> rcu 223508
>> timer 492767
>> sched 1794441
>>
>>
>> Would any specific setting of RCU matter for this?
>> This is what I have in config.
>
> Don't see how it could matter unless the RCU settings are changing
> between the two tests? In my testing I'm also using TREE_RCU=y,
> PREEMPT_RCU=n.
>
> Let me see if I can find a test that shows a similar trend to what you
> are seeing. And then maybe see if tracing sched-switch might point to
> an interesting difference between x86 and powerpc.
>
>
> Thanks for all the detail.
>
> Ankur
>
>> # RCU Subsystem
>> #
>> CONFIG_TREE_RCU=y
>> # CONFIG_RCU_EXPERT is not set
>> CONFIG_TREE_SRCU=y
>> CONFIG_NEED_SRCU_NMI_SAFE=y
>> CONFIG_TASKS_RCU_GENERIC=y
>> CONFIG_NEED_TASKS_RCU=y
>> CONFIG_TASKS_RCU=y
>> CONFIG_TASKS_RUDE_RCU=y
>> CONFIG_TASKS_TRACE_RCU=y
>> CONFIG_RCU_STALL_COMMON=y
>> CONFIG_RCU_NEED_SEGCBLIST=y
>> CONFIG_RCU_NOCB_CPU=y
>> # CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
>> # CONFIG_RCU_LAZY is not set
>> # end of RCU Subsystem
>>
>>
>> # Timers subsystem
>> #
>> CONFIG_TICK_ONESHOT=y
>> CONFIG_NO_HZ_COMMON=y
>> # CONFIG_HZ_PERIODIC is not set
>> # CONFIG_NO_HZ_IDLE is not set
>> CONFIG_NO_HZ_FULL=y
>> CONFIG_CONTEXT_TRACKING_USER=y
>> # CONFIG_CONTEXT_TRACKING_USER_FORCE is not set
>> CONFIG_NO_HZ=y
>> CONFIG_HIGH_RES_TIMERS=y
>> # end of Timers subsystem
>
>
> --
> ankur