2023-07-05 18:25:51

by Valentin Schneider

Subject: [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition

Context
=======

We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:

64359.052209596 NetworkManager 0 1405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
smp_call_function_many_cond+0x1
smp_call_function+0x39
on_each_cpu+0x2a
flush_tlb_kernel_range+0x7b
__purge_vmap_area_lazy+0x70
_vm_unmap_aliases.part.42+0xdf
change_page_attr_set_clr+0x16a
set_memory_ro+0x26
bpf_int_jit_compile+0x2f9
bpf_prog_select_runtime+0xc6
bpf_prepare_filter+0x523
sk_attach_filter+0x13
sock_setsockopt+0x92c
__sys_setsockopt+0x16a
__x64_sys_setsockopt+0x20
do_syscall_64+0x87
entry_SYSCALL_64_after_hwframe+0x65

The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, provided it can be executed "early
enough" in the entry code.

The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.

Deferral approach
=================

Storing each and every callback, like a secondary call_single_queue, turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.

Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.

What about IPIs whose callbacks take a parameter, you may ask?

Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.

This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).
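
As a rough illustration of the coalescing idea (the names below are made up
for this sketch; the actual interface is introduced in patch 11): one bit per
deferrable IPI type, OR'd in by the sender, flushed on kernel entry.

#include <linux/atomic.h>
#include <linux/bits.h>
#include <linux/percpu.h>
#include <asm/sync_core.h>
#include <asm/tlbflush.h>

/* Minimal sketch, not the implementation from this series. */
enum defer_work_type {
	DEFER_WORK_SYNC_CORE   = BIT(0),	/* instruction patching sync */
	DEFER_WORK_KERNEL_TLBI = BIT(1),	/* kernel range TLB flush */
};

static DEFINE_PER_CPU(atomic_long_t, deferred_work);

/* Housekeeping CPU: instead of IPI'ing an isolated CPU, OR in the work type. */
static void defer_work_on(int cpu, unsigned long work)
{
	atomic_long_or(work, per_cpu_ptr(&deferred_work, cpu));
}

/* Isolated CPU, early on its next user->kernel transition. */
static void flush_deferred_work(void)
{
	unsigned long work = atomic_long_xchg(this_cpu_ptr(&deferred_work), 0);

	if (work & DEFER_WORK_SYNC_CORE)
		sync_core();
	if (work & DEFER_WORK_KERNEL_TLBI)
		__flush_tlb_all();
}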

Kernel entry vs execution of the deferred operation
===================================================

There is a non-zero length of code that is executed upon kernel entry before the
deferred operation can itself be executed (i.e. before we start getting into
context_tracking.c proper).

This means one must take extra care about what can happen in the early entry code,
and ensure that <bad things> cannot happen. For instance, we really don't want to hit
instructions that have been modified by a remote text_poke() while we're on our
way to execute a deferred sync_core().

Patches
=======

o Patches 1-5 have been submitted previously and are included for the sake of
testing

o Patches 6-10 focus on having objtool detect problematic static key usage in early
entry

The context_tracking_key one causes a page_fault_oops() on my RHEL userspace
due to the KVM module, cf. changelog.

o Patch 11 adds the infrastructure for IPI deferral.
o Patch 12 adds text_poke() IPI deferral.

This one I'm fairly confident about; we "just" need to do something about the
__ro_after_init key vs module loading issue.

o Patches 13-14 add vunmap() flush_tlb_kernel_range() IPI deferral

These ones I'm a lot less confident about, mostly due to lacking
instrumentation/verification.

The actual deferred callback is also incomplete as it's not properly noinstr:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x19: call to native_write_cr4() leaves .noinstr.text section
and it doesn't support PARAVIRT - it's going to need a pv_ops.mmu entry, but I
have *no idea* what a sane implementation would be for Xen so I haven't
touched that yet.

Patches are also available at:

https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v1

Testing
=======

Xeon Platinum 8380 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL9 userspace.

Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:

$ trace-cmd record -e "ipi_send_cpumask" -f "cpumask & MASK{$ISOL_CPUS}" \
                   -e "ipi_send_cpu" -f "cpu & MASK{$ISOL_CPUS}" \
                   rteval --onlyload --loads-cpulist=$HK_CPUS \
                   --hackbench-runlowmem=True --duration=$DURATION

This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 20 minutes.

v6.4 (+ cpumask filtering patches):
$ trace-cmd report | grep callback | awk '{ print $NF }' | sort | uniq -c
236 callback=do_flush_tlb_all+0x0
576 callback=do_sync_core+0x0
814 callback=generic_smp_call_function_single_interrupt+0x0
309 callback=nohz_full_kick_func+0x0

v6.4 + patches:
$ trace-cmd report | grep callback | awk '{ print $NF }' | sort | uniq -c
22 callback=do_flush_tlb_all+0x0
24 callback=generic_smp_call_function_single_interrupt+0x0
307 callback=nohz_full_kick_func+0x0

o IPIs from instruction patching are entirely gone.

o Some TLB flushes remain as I only patched the vunmap cases:

kworker/2:0-13856 [002] 3517.445719: ipi_send_cpumask: cpumask=0-1,3-79 callsite=on_each_cpu_cond_mask+0x20
callback=do_flush_tlb_all+0x0
kworker/2:0-13856 [002] 3517.445722: kernel_stack: <stack trace >
=> trace_event_raw_event_ipi_send_cpumask (ffffffffa974a050)
=> smp_call_function_many_cond (ffffffffa97fb0a7)
=> on_each_cpu_cond_mask (ffffffffa97fb1f0)
=> pcpu_reclaim_populated (ffffffffa996b451)
=> pcpu_balance_workfn (ffffffffa996c399)
=> process_one_work (ffffffffa9730e14)
=> worker_thread (ffffffffa9731440)
=> kthread (ffffffffa973984e)

o The nohz_full_kick_func() ones seem to come from dev_watchdog() but are in
any case consistent across revisions

<...>-3734 [042] 392.890491: ipi_send_cpu: cpu=42 callsite=irq_work_queue_on+0x77 callback=nohz_full_kick_func+0x0
<...>-3734 [042] 392.890497: kernel_stack: <stack trace >
=> trace_event_raw_event_ipi_send_cpu (ffffffff901492d8)
=> __irq_work_queue_local (ffffffff902acb3d)
=> irq_work_queue_on (ffffffff902acc47)
=> __mod_timer (ffffffff901dcd81)
=> dev_watchdog (ffffffff90a75310)
=> call_timer_fn (ffffffff901dc174)
=> __run_timers.part.0 (ffffffff901dc47e)
=> run_timer_softirq (ffffffff901dc546)
=> __do_softirq (ffffffff90c52348)
=> __irq_exit_rcu (ffffffff90113329)
=> sysvec_apic_timer_interrupt (ffffffff90c3c895)
=> asm_sysvec_apic_timer_interrupt (ffffffff90e00d86)

Acknowledgements
================

Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for his guidance regarding objtool and hinting at the
.data..ro_after_init section.

Links
=====

[1]: https://lore.kernel.org/all/[email protected]/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE

Valentin Schneider (14):
tracing/filters: Dynamically allocate filter_pred.regex
tracing/filters: Enable filtering a cpumask field by another cpumask
tracing/filters: Enable filtering a scalar field by a cpumask
tracing/filters: Enable filtering the CPU common field by a cpumask
tracing/filters: Document cpumask filtering
objtool: Flesh out warning related to pv_ops[] calls
objtool: Warn about non __ro_after_init static key usage in .noinstr
BROKEN: context_tracking: Make context_tracking_key __ro_after_init
x86/kvm: Make kvm_async_pf_enabled __ro_after_init
x86/sev-es: Make sev_es_enable_key __ro_after_init
context-tracking: Introduce work deferral infrastructure
context_tracking,x86: Defer kernel text patching IPIs
context_tracking,x86: Add infrastructure to defer kernel TLBI
x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL
CPUs

Documentation/trace/events.rst | 14 ++
arch/Kconfig | 9 +
arch/x86/Kconfig | 1 +
arch/x86/include/asm/context_tracking_work.h | 20 ++
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/include/asm/tlbflush.h | 2 +
arch/x86/kernel/alternative.c | 24 +-
arch/x86/kernel/kprobes/core.c | 4 +-
arch/x86/kernel/kprobes/opt.c | 4 +-
arch/x86/kernel/kvm.c | 2 +-
arch/x86/kernel/module.c | 2 +-
arch/x86/kernel/sev.c | 2 +-
arch/x86/mm/tlb.c | 40 +++-
include/linux/context_tracking.h | 1 +
include/linux/context_tracking_state.h | 1 +
include/linux/context_tracking_work.h | 30 +++
include/linux/trace_events.h | 1 +
kernel/context_tracking.c | 65 +++++-
kernel/time/Kconfig | 5 +
kernel/trace/trace_events_filter.c | 228 ++++++++++++++++---
mm/vmalloc.c | 15 +-
tools/objtool/check.c | 22 +-
tools/objtool/include/objtool/check.h | 1 +
tools/objtool/include/objtool/special.h | 2 +
tools/objtool/special.c | 3 +
25 files changed, 451 insertions(+), 48 deletions(-)
create mode 100644 arch/x86/include/asm/context_tracking_work.h
create mode 100644 include/linux/context_tracking_work.h

--
2.31.1



2023-07-05 18:25:57

by Valentin Schneider

Subject: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure

smp_call_function() & friends have the unfortunate habit of sending IPIs to
isolated, NOHZ_FULL, in-userspace CPUs, as they blindly target all online
CPUs.

Some callsites can be bent into doing the right thing, such as done by commit:

cc9e303c91f5 ("x86/cpu: Disable frequency requests via aperfmperf IPI for nohz_full CPUs")

Unfortunately, not all SMP callbacks can be omitted in this
fashion. However, some of them only affect execution in kernelspace, which
means they don't have to be executed *immediately* if the target CPU is in
userspace: stashing the callback and executing it upon the next kernel entry
would suffice. x86 kernel instruction patching or kernel TLB invalidation
are prime examples of it.

Add a field in struct context_tracking used as a bitmask to track deferred
callbacks to execute upon kernel entry. The LSB of that field is used as a
flag to prevent queueing deferred work when the CPU leaves userspace.

Later commits introduce the bit:callback mappings.
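
For illustration, a sender-side sketch of how this is meant to be used
(hypothetical callsite; the real conversions happen in the later patches):
the IPI for a given CPU is only skipped if the work bit could be queued,
i.e. if that CPU is currently in userspace.

#include <linux/context_tracking.h>
#include <linux/smp.h>

/* Hypothetical user of the deferral infrastructure, for illustration only. */
static void do_some_work(void *info)
{
	/* The regular IPI callback, run on CPUs that could not defer. */
}

static bool defer_cond(int cpu, void *info)
{
	/*
	 * true  -> send the IPI to @cpu as usual
	 * false -> work queued, it will run on @cpu's next kernel entry
	 */
	return !ct_set_cpu_work(cpu, CONTEXT_WORK_n);
}

static void kick_all_cpus(void)
{
	on_each_cpu_cond(defer_cond, do_some_work, NULL, 1);
}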

Note: A previous approach by PeterZ [1] used an extra bit in
context_tracking.state to flag the presence of deferred callbacks to
execute, and the actual callbacks were stored in a separate atomic
variable.

This meant that the atomic read of context_tracking.state was sufficient to
determine whether there are any deferred callbacks to execute.
Unfortunately, it presents a race window. Consider the work setting
function as:

preempt_disable();
seq = atomic_read(&ct->seq);
if (__context_tracking_seq_in_user(seq)) {
	/* ctrl-dep */
	atomic_or(work, &ct->work);
	ret = atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);
}
preempt_enable();

return ret;

Then the following can happen:

  CPUx                                           CPUy
                                                   CT_SEQ_WORK \in context_tracking.state
  atomic_or(WORK_N, &ct->work);
                                                   ct_kernel_enter()
                                                     ct_state_inc();
  atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);

The cmpxchg() would fail, ultimately causing an IPI for WORK_N to be
sent. Unfortunately, the work bit would remain set, and it can't be sanely
cleared in case another CPU set it concurrently - this would ultimately
lead to a double execution of the callback, one as a deferred callback and
one in the IPI. As not all IPI callbacks are idempotent, this is
undesirable.

Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Nicolas Saenz Julienne <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
arch/Kconfig | 9 +++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/context_tracking_work.h | 14 +++++
include/linux/context_tracking.h | 1 +
include/linux/context_tracking_state.h | 1 +
include/linux/context_tracking_work.h | 28 +++++++++
kernel/context_tracking.c | 63 ++++++++++++++++++++
kernel/time/Kconfig | 5 ++
8 files changed, 122 insertions(+)
create mode 100644 arch/x86/include/asm/context_tracking_work.h
create mode 100644 include/linux/context_tracking_work.h

diff --git a/arch/Kconfig b/arch/Kconfig
index 205fd23e0cada..e453e9fb864be 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -851,6 +851,15 @@ config HAVE_CONTEXT_TRACKING_USER_OFFSTACK
- No use of instrumentation, unless instrumentation_begin() got
called.

+config HAVE_CONTEXT_TRACKING_WORK
+ bool
+ help
+ Architecture supports deferring work while not in kernel context.
+ This is especially useful on setups with isolated CPUs that might
+ want to avoid being interrupted to perform housekeeping tasks (for
+ ex. TLB invalidation or icache invalidation). The housekeeping
+ operations are performed upon re-entering the kernel.
+
config HAVE_TIF_NOHZ
bool
help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee4..490c773105c0c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -197,6 +197,7 @@ config X86
select HAVE_CMPXCHG_LOCAL
select HAVE_CONTEXT_TRACKING_USER if X86_64
select HAVE_CONTEXT_TRACKING_USER_OFFSTACK if HAVE_CONTEXT_TRACKING_USER
+ select HAVE_CONTEXT_TRACKING_WORK if X86_64
select HAVE_C_RECORDMCOUNT
select HAVE_OBJTOOL_MCOUNT if HAVE_OBJTOOL
select HAVE_OBJTOOL_NOP_MCOUNT if HAVE_OBJTOOL_MCOUNT
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
new file mode 100644
index 0000000000000..5bc29e6b2ed38
--- /dev/null
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
+#define _ASM_X86_CONTEXT_TRACKING_WORK_H
+
+static __always_inline void arch_context_tracking_work(int work)
+{
+ switch (work) {
+ case CONTEXT_WORK_n:
+ // Do work...
+ break;
+ }
+}
+
+#endif
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index d3cbb6c16babf..80d571ddfc3a4 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -5,6 +5,7 @@
#include <linux/sched.h>
#include <linux/vtime.h>
#include <linux/context_tracking_state.h>
+#include <linux/context_tracking_work.h>
#include <linux/instrumentation.h>

#include <asm/ptrace.h>
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index fdd537ea513ff..5af06ed26f858 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -36,6 +36,7 @@ struct context_tracking {
int recursion;
#endif
#ifdef CONFIG_CONTEXT_TRACKING
+ atomic_t work;
atomic_t state;
#endif
#ifdef CONFIG_CONTEXT_TRACKING_IDLE
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
new file mode 100644
index 0000000000000..0b06c3dab58c7
--- /dev/null
+++ b/include/linux/context_tracking_work.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CONTEXT_TRACKING_WORK_H
+#define _LINUX_CONTEXT_TRACKING_WORK_H
+
+#include <linux/bitops.h>
+
+enum {
+ CONTEXT_WORK_DISABLED_OFFSET,
+ CONTEXT_WORK_n_OFFSET,
+ CONTEXT_WORK_MAX_OFFSET
+};
+
+enum ct_work {
+ CONTEXT_WORK_DISABLED = BIT(CONTEXT_WORK_DISABLED_OFFSET),
+ CONTEXT_WORK_n = BIT(CONTEXT_WORK_n_OFFSET),
+ CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET)
+};
+
+#include <asm/context_tracking_work.h>
+
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+extern bool ct_set_cpu_work(unsigned int cpu, unsigned int work);
+#else
+static inline bool
+ct_set_cpu_work(unsigned int cpu, unsigned int work) { return false; }
+#endif
+
+#endif
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 4e6cb14272fcb..b6aee3d0c0528 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -32,6 +32,9 @@ DEFINE_PER_CPU(struct context_tracking, context_tracking) = {
.dynticks_nmi_nesting = DYNTICK_IRQ_NONIDLE,
#endif
.state = ATOMIC_INIT(RCU_DYNTICKS_IDX),
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+ .work = ATOMIC_INIT(CONTEXT_WORK_DISABLED),
+#endif
};
EXPORT_SYMBOL_GPL(context_tracking);

@@ -72,6 +75,57 @@ static __always_inline void rcu_dynticks_task_trace_exit(void)
#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
}

+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static __always_inline unsigned int ct_work_fetch(struct context_tracking *ct)
+{
+ return arch_atomic_fetch_or(CONTEXT_WORK_DISABLED, &ct->work);
+}
+static __always_inline void ct_work_clear(struct context_tracking *ct)
+{
+ arch_atomic_set(&ct->work, 0);
+}
+
+static noinstr void ct_work_flush(unsigned long work)
+{
+ int bit;
+
+ /* DISABLED is never set while there are deferred works */
+ WARN_ON_ONCE(work & CONTEXT_WORK_DISABLED);
+
+ /*
+ * arch_context_tracking_work() must be noinstr, non-blocking,
+ * and NMI safe.
+ */
+ for_each_set_bit(bit, &work, CONTEXT_WORK_MAX)
+ arch_context_tracking_work(BIT(bit));
+}
+
+bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
+{
+ struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
+ unsigned int old_work;
+ bool ret = false;
+
+ preempt_disable();
+
+ old_work = atomic_read(&ct->work);
+ /*
+ * Try setting the work until either
+ * - the target CPU no longer accepts any more deferred work
+ * - the work has been set
+ */
+ while (!(old_work & CONTEXT_WORK_DISABLED) && !ret)
+ ret = atomic_try_cmpxchg(&ct->work, &old_work, old_work | work);
+
+ preempt_enable();
+ return ret;
+}
+#else
+static __always_inline void ct_work_flush(unsigned long work) { }
+static __always_inline unsigned int ct_work_fetch(struct context_tracking *ct) { return 0; }
+static __always_inline void ct_work_clear(struct context_tracking *ct) { }
+#endif
+
/*
* Record entry into an extended quiescent state. This is only to be
* called when not already in an extended quiescent state, that is,
@@ -89,6 +143,10 @@ static noinstr void ct_kernel_exit_state(int offset)
*/
rcu_dynticks_task_trace_enter(); // Before ->dynticks update!
seq = ct_state_inc(offset);
+
+ /* Let this CPU allow deferred callbacks again */
+ ct_work_clear(this_cpu_ptr(&context_tracking));
+
// RCU is no longer watching. Better be in extended quiescent state!
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && (seq & RCU_DYNTICKS_IDX));
}
@@ -100,14 +158,19 @@ static noinstr void ct_kernel_exit_state(int offset)
*/
static noinstr void ct_kernel_enter_state(int offset)
{
+ struct context_tracking *ct = this_cpu_ptr(&context_tracking);
int seq;
+ unsigned int work;

+ work = ct_work_fetch(ct);
/*
* CPUs seeing atomic_add_return() must see prior idle sojourns,
* and we also must force ordering with the next RCU read-side
* critical section.
*/
seq = ct_state_inc(offset);
+ if (work)
+ ct_work_flush(work);
// RCU is now watching. Better not be in an extended quiescent state!
rcu_dynticks_task_trace_exit(); // After ->dynticks update!
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & RCU_DYNTICKS_IDX));
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index bae8f11070bef..fdb266f2d774b 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -181,6 +181,11 @@ config CONTEXT_TRACKING_USER_FORCE
Say N otherwise, this option brings an overhead that you
don't want in production.

+config CONTEXT_TRACKING_WORK
+ bool
+ depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
+ default y
+
config NO_HZ
bool "Old Idle dynticks config"
help
--
2.31.1


2023-07-05 18:26:09

by Valentin Schneider

Subject: [RFC PATCH 12/14] context_tracking,x86: Defer kernel text patching IPIs

text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
them vs the newly patched instruction. CPUs that are executing in userspace
do not need this synchronization to happen immediately, and this is
actually harmful interference for NOHZ_FULL CPUs.

As the synchronization IPIs are sent using a blocking call, returning from
text_poke_bp_batch() implies all CPUs will observe the patched
instruction(s), and this should be preserved even if the IPI is deferred.
In other words, to safely defer this synchronization, any kernel
instruction leading to the execution of the deferred instruction
sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.

This means we must pay attention to mutable instructions in the early entry
code:
- alternatives
- static keys
- all sorts of probes (kprobes/ftrace/bpf/???)

The early entry code leading to ct_work_flush() is noinstr, which gets rid
of the probes.

Alternatives are safe, because it's boot-time patching (before SMP is
even brought up) which is before any IPI deferral can happen.

This leaves us with static keys. Any static key used in early entry code
should only ever be enabled at boot time and never change afterwards, IOW be
__ro_after_init (pretty much like alternatives). Objtool is now able to point
at static keys that don't respect this, and all static keys used in early
entry code have now been verified to behave accordingly.
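
For illustration (not code from this series, and the key/config names are
hypothetical), such a key would typically be declared and set up along these
lines, so it lands in .data..ro_after_init and can only be flipped during boot:

#include <linux/jump_label.h>
#include <linux/init.h>

/*
 * Hypothetical key used on the entry path; the _RO variant places it in
 * .data..ro_after_init, so it cannot be flipped once init is done.
 */
static DEFINE_STATIC_KEY_FALSE_RO(early_entry_feature_key);

static int __init early_entry_feature_setup(void)
{
	/* Hypothetical boot-time condition */
	if (IS_ENABLED(CONFIG_SOME_FEATURE))
		static_branch_enable(&early_entry_feature_key);
	return 0;
}
early_initcall(early_entry_feature_setup);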

Leverage the new context_tracking infrastructure to defer sync_core() IPIs
to a target CPU's next kernel entry.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Nicolas Saenz Julienne <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
arch/x86/include/asm/context_tracking_work.h | 6 +++--
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/kernel/alternative.c | 24 ++++++++++++++++----
arch/x86/kernel/kprobes/core.c | 4 ++--
arch/x86/kernel/kprobes/opt.c | 4 ++--
arch/x86/kernel/module.c | 2 +-
include/linux/context_tracking_work.h | 4 ++--
7 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 5bc29e6b2ed38..2c66687ce00e2 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -2,11 +2,13 @@
#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
#define _ASM_X86_CONTEXT_TRACKING_WORK_H

+#include <asm/sync_core.h>
+
static __always_inline void arch_context_tracking_work(int work)
{
switch (work) {
- case CONTEXT_WORK_n:
- // Do work...
+ case CONTEXT_WORK_SYNC:
+ sync_core();
break;
}
}
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 29832c338cdc5..b6939e965e69d 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -43,6 +43,7 @@ extern void text_poke_early(void *addr, const void *opcode, size_t len);
*/
extern void *text_poke(void *addr, const void *opcode, size_t len);
extern void text_poke_sync(void);
+extern void text_poke_sync_deferrable(void);
extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
extern void *text_poke_copy_locked(void *addr, const void *opcode, size_t len, bool core_ok);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index f615e0cb6d932..7770aef2204f3 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -18,6 +18,7 @@
#include <linux/mmu_context.h>
#include <linux/bsearch.h>
#include <linux/sync_core.h>
+#include <linux/context_tracking.h>
#include <asm/text-patching.h>
#include <asm/alternative.h>
#include <asm/sections.h>
@@ -1765,9 +1766,24 @@ static void do_sync_core(void *info)
sync_core();
}

+static bool do_sync_core_defer_cond(int cpu, void *info)
+{
+ return !ct_set_cpu_work(cpu, CONTEXT_WORK_SYNC);
+}
+
+static void __text_poke_sync(smp_cond_func_t cond_func)
+{
+ on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
+}
+
void text_poke_sync(void)
{
- on_each_cpu(do_sync_core, NULL, 1);
+ __text_poke_sync(NULL);
+}
+
+void text_poke_sync_deferrable(void)
+{
+ __text_poke_sync(do_sync_core_defer_cond);
}

/*
@@ -1967,7 +1983,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
}

- text_poke_sync();
+ text_poke_sync_deferrable();

/*
* Second step: update all but the first byte of the patched range.
@@ -2029,7 +2045,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
* not necessary and we'd be safe even without it. But
* better safe than sorry (plus there's not only Intel).
*/
- text_poke_sync();
+ text_poke_sync_deferrable();
}

/*
@@ -2050,7 +2066,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
}

if (do_sync)
- text_poke_sync();
+ text_poke_sync_deferrable();

/*
* Remove and wait for refs to be zero.
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index f7f6042eb7e6c..a38c914753397 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -735,7 +735,7 @@ void arch_arm_kprobe(struct kprobe *p)
u8 int3 = INT3_INSN_OPCODE;

text_poke(p->addr, &int3, 1);
- text_poke_sync();
+ text_poke_sync_deferrable();
perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1);
}

@@ -745,7 +745,7 @@ void arch_disarm_kprobe(struct kprobe *p)

perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1);
text_poke(p->addr, &p->opcode, 1);
- text_poke_sync();
+ text_poke_sync_deferrable();
}

void arch_remove_kprobe(struct kprobe *p)
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
index 57b0037d0a996..88451a744ceda 100644
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -521,11 +521,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
JMP32_INSN_SIZE - INT3_INSN_SIZE);

text_poke(addr, new, INT3_INSN_SIZE);
- text_poke_sync();
+ text_poke_sync_deferrable();
text_poke(addr + INT3_INSN_SIZE,
new + INT3_INSN_SIZE,
JMP32_INSN_SIZE - INT3_INSN_SIZE);
- text_poke_sync();
+ text_poke_sync_deferrable();

perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE);
}
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b05f62ee2344b..8b4542dc51b6d 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -242,7 +242,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
write, apply);

if (!early) {
- text_poke_sync();
+ text_poke_sync_deferrable();
mutex_unlock(&text_mutex);
}

diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index 0b06c3dab58c7..b0c7463048b60 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -6,13 +6,13 @@

enum {
CONTEXT_WORK_DISABLED_OFFSET,
- CONTEXT_WORK_n_OFFSET,
+ CONTEXT_WORK_SYNC_OFFSET,
CONTEXT_WORK_MAX_OFFSET
};

enum ct_work {
CONTEXT_WORK_DISABLED = BIT(CONTEXT_WORK_DISABLED_OFFSET),
- CONTEXT_WORK_n = BIT(CONTEXT_WORK_n_OFFSET),
+ CONTEXT_WORK_SYNC = BIT(CONTEXT_WORK_SYNC_OFFSET),
CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET)
};

--
2.31.1


2023-07-05 18:26:10

by Valentin Schneider

Subject: [RFC PATCH 04/14] tracing/filters: Enable filtering the CPU common field by a cpumask

The tracing_cpumask lets us specify which CPUs are traced in a buffer
instance, but doesn't let us do this on a per-event basis (unless one
creates an instance per event).

A previous commit added filtering of scalar fields by a user-given cpumask;
make this work with the CPU common field as well.

This enables doing things like

$ trace-cmd record -e 'sched_switch' -f 'CPU & MASK{12-52}' \
                   -e 'sched_wakeup' -f 'target_cpu & MASK{12-52}'

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/trace/trace_events_filter.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 99e111c237a93..b3d2612d4670a 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -68,6 +68,7 @@ enum filter_pred_fn {
FILTER_PRED_FN_PCHAR_USER,
FILTER_PRED_FN_PCHAR,
FILTER_PRED_FN_CPU,
+ FILTER_PRED_FN_CPU_CPUMASK,
FILTER_PRED_FN_CPUMASK,
FILTER_PRED_FN_FUNCTION,
FILTER_PRED_FN_,
@@ -933,6 +934,13 @@ static int filter_pred_cpu(struct filter_pred *pred, void *event)
}
}

+static int filter_pred_cpu_cpumask(struct filter_pred *pred, void *event)
+{
+ int cpu = raw_smp_processor_id();
+
+ return do_filter_cpumask_scalar(pred->op, cpu, pred->mask);
+}
+
/* Filter predicate for cpumasks. */
static int filter_pred_cpumask(struct filter_pred *pred, void *event)
{
@@ -1436,6 +1444,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
return filter_pred_pchar(pred, event);
case FILTER_PRED_FN_CPU:
return filter_pred_cpu(pred, event);
+ case FILTER_PRED_FN_CPU_CPUMASK:
+ return filter_pred_cpu_cpumask(pred, event);
case FILTER_PRED_FN_CPUMASK:
return filter_pred_cpumask(pred, event);
case FILTER_PRED_FN_FUNCTION:
@@ -1654,6 +1664,7 @@ static int parse_pred(const char *str, void *data,
char *tmp;

if (field->filter_type != FILTER_CPUMASK &&
+ field->filter_type != FILTER_CPU &&
field->filter_type != FILTER_OTHER) {
parse_error(pe, FILT_ERR_ILLEGAL_FIELD_OP, pos + i);
goto err_free;
@@ -1698,6 +1709,8 @@ static int parse_pred(const char *str, void *data,
i++;
if (field->filter_type == FILTER_CPUMASK) {
pred->fn_num = FILTER_PRED_FN_CPUMASK;
+ } else if (field->filter_type == FILTER_CPU) {
+ pred->fn_num = FILTER_PRED_FN_CPU_CPUMASK;
} else {
switch (field->size) {
case 8:
--
2.31.1


2023-07-05 18:26:13

by Valentin Schneider

Subject: [RFC PATCH 13/14] context_tracking,x86: Add infrastructure to defer kernel TLBI

Kernel TLB invalidation IPIs are a common source of interference on
NOHZ_FULL CPUs. Given NOHZ_FULL CPUs executing in userspace are not
accessing any kernel addresses, these invalidations do not need to happen
immediately, and can be deferred until the next user->kernel transition.

Rather than make __flush_tlb_all() noinstr, add a minimal noinstr
variant that doesn't try to leverage INVPCID.

FIXME: not fully noinstr compliant
XXX: same issue as with ins patching, when do we access data that should be
invalidated?

Signed-off-by: Valentin Schneider <[email protected]>
---
arch/x86/include/asm/context_tracking_work.h | 4 ++++
arch/x86/include/asm/tlbflush.h | 1 +
arch/x86/mm/tlb.c | 17 +++++++++++++++++
include/linux/context_tracking_work.h | 2 ++
4 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 2c66687ce00e2..9d4f021b5a45b 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -3,6 +3,7 @@
#define _ASM_X86_CONTEXT_TRACKING_WORK_H

#include <asm/sync_core.h>
+#include <asm/tlbflush.h>

static __always_inline void arch_context_tracking_work(int work)
{
@@ -10,6 +11,9 @@ static __always_inline void arch_context_tracking_work(int work)
case CONTEXT_WORK_SYNC:
sync_core();
break;
+ case CONTEXT_WORK_TLBI:
+ __flush_tlb_all_noinstr();
+ break;
}
}

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 75bfaa4210303..9064aa0027d05 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -15,6 +15,7 @@
#include <asm/pgtable.h>

void __flush_tlb_all(void);
+void noinstr __flush_tlb_all_noinstr(void);

#define TLB_FLUSH_ALL -1UL
#define TLB_GENERATION_INVALID 0
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 267acf27480af..631df9189ded4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1237,6 +1237,23 @@ void __flush_tlb_all(void)
}
EXPORT_SYMBOL_GPL(__flush_tlb_all);

+void noinstr __flush_tlb_all_noinstr(void)
+{
+ /*
+ * This is for invocation in early entry code that cannot be
+ * instrumented. A RMW to CR4 works for most cases, but relies on
+ * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
+ * would require also resetting CR3.PCID, so just try with CR4.PGE, else
+ * do the CR3 write.
+ *
+ * TODO: paravirt
+ */
+ if (cpu_feature_enabled(X86_FEATURE_PGE))
+ __native_tlb_flush_global(this_cpu_read(cpu_tlbstate.cr4));
+ else
+ flush_tlb_local();
+}
+
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
struct flush_tlb_info *info;
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index b0c7463048b60..c084a60d2fec5 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -7,12 +7,14 @@
enum {
CONTEXT_WORK_DISABLED_OFFSET,
CONTEXT_WORK_SYNC_OFFSET,
+ CONTEXT_WORK_TLBI_OFFSET,
CONTEXT_WORK_MAX_OFFSET
};

enum ct_work {
CONTEXT_WORK_DISABLED = BIT(CONTEXT_WORK_DISABLED_OFFSET),
CONTEXT_WORK_SYNC = BIT(CONTEXT_WORK_SYNC_OFFSET),
+ CONTEXT_WORK_TLBI = BIT(CONTEXT_WORK_TLBI_OFFSET),
CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET)
};

--
2.31.1


2023-07-05 18:26:17

by Valentin Schneider

Subject: [RFC PATCH 01/14] tracing/filters: Dynamically allocate filter_pred.regex

Every predicate allocation includes a MAX_FILTER_STR_VAL (256) char array
in the regex field, even if the predicate function does not use the field.

A later commit will introduce a dynamically allocated cpumask to struct
filter_pred, which will require a dedicated freeing function. Bite the
bullet and make filter_pred.regex dynamically allocated.

While at it, reorder the fields of filter_pred to fill in the byte
holes. The struct now fits on a single cacheline.

No change in behaviour intended.

The kfree()'s were patched via Coccinelle:
@@
struct filter_pred *pred;
@@

-kfree(pred);
+free_predicate(pred);

Signed-off-by: Valentin Schneider <[email protected]>
---
kernel/trace/trace_events_filter.c | 62 ++++++++++++++++++------------
1 file changed, 37 insertions(+), 25 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 1dad64267878c..d999a218fe833 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -70,15 +70,15 @@ enum filter_pred_fn {
};

struct filter_pred {
- enum filter_pred_fn fn_num;
- u64 val;
- u64 val2;
- struct regex regex;
+ struct regex *regex;
unsigned short *ops;
struct ftrace_event_field *field;
- int offset;
+ u64 val;
+ u64 val2;
+ enum filter_pred_fn fn_num;
+ int offset;
int not;
- int op;
+ int op;
};

/*
@@ -186,6 +186,12 @@ enum {
PROCESS_OR = 4,
};

+static void free_predicate(struct filter_pred *pred)
+{
+ kfree(pred->regex);
+ kfree(pred);
+}
+
/*
* Without going into a formal proof, this explains the method that is used in
* parsing the logical expressions.
@@ -623,7 +629,7 @@ predicate_parse(const char *str, int nr_parens, int nr_preds,
kfree(inverts);
if (prog_stack) {
for (i = 0; prog_stack[i].pred; i++)
- kfree(prog_stack[i].pred);
+ free_predicate(prog_stack[i].pred);
kfree(prog_stack);
}
return ERR_PTR(ret);
@@ -750,7 +756,7 @@ static int filter_pred_string(struct filter_pred *pred, void *event)
char *addr = (char *)(event + pred->offset);
int cmp, match;

- cmp = pred->regex.match(addr, &pred->regex, pred->regex.field_len);
+ cmp = pred->regex->match(addr, pred->regex, pred->regex->field_len);

match = cmp ^ pred->not;

@@ -763,7 +769,7 @@ static __always_inline int filter_pchar(struct filter_pred *pred, char *str)
int len;

len = strlen(str) + 1; /* including tailing '\0' */
- cmp = pred->regex.match(str, &pred->regex, len);
+ cmp = pred->regex->match(str, pred->regex, len);

match = cmp ^ pred->not;

@@ -813,7 +819,7 @@ static int filter_pred_strloc(struct filter_pred *pred, void *event)
char *addr = (char *)(event + str_loc);
int cmp, match;

- cmp = pred->regex.match(addr, &pred->regex, str_len);
+ cmp = pred->regex->match(addr, pred->regex, str_len);

match = cmp ^ pred->not;

@@ -836,7 +842,7 @@ static int filter_pred_strrelloc(struct filter_pred *pred, void *event)
char *addr = (char *)(&item[1]) + str_loc;
int cmp, match;

- cmp = pred->regex.match(addr, &pred->regex, str_len);
+ cmp = pred->regex->match(addr, pred->regex, str_len);

match = cmp ^ pred->not;

@@ -874,7 +880,7 @@ static int filter_pred_comm(struct filter_pred *pred, void *event)
{
int cmp;

- cmp = pred->regex.match(current->comm, &pred->regex,
+ cmp = pred->regex->match(current->comm, pred->regex,
TASK_COMM_LEN);
return cmp ^ pred->not;
}
@@ -1004,7 +1010,7 @@ enum regex_type filter_parse_regex(char *buff, int len, char **search, int *not)

static void filter_build_regex(struct filter_pred *pred)
{
- struct regex *r = &pred->regex;
+ struct regex *r = pred->regex;
char *search;
enum regex_type type = MATCH_FULL;

@@ -1169,7 +1175,7 @@ static void free_prog(struct event_filter *filter)
return;

for (i = 0; prog[i].pred; i++)
- kfree(prog[i].pred);
+ free_predicate(prog[i].pred);
kfree(prog);
}

@@ -1553,9 +1559,12 @@ static int parse_pred(const char *str, void *data,
goto err_free;
}

- pred->regex.len = len;
- strncpy(pred->regex.pattern, str + s, len);
- pred->regex.pattern[len] = 0;
+ pred->regex = kzalloc(sizeof(*pred->regex), GFP_KERNEL);
+ if (!pred->regex)
+ goto err_mem;
+ pred->regex->len = len;
+ strncpy(pred->regex->pattern, str + s, len);
+ pred->regex->pattern[len] = 0;

/* This is either a string, or an integer */
} else if (str[i] == '\'' || str[i] == '"') {
@@ -1597,9 +1606,12 @@ static int parse_pred(const char *str, void *data,
goto err_free;
}

- pred->regex.len = len;
- strncpy(pred->regex.pattern, str + s, len);
- pred->regex.pattern[len] = 0;
+ pred->regex = kzalloc(sizeof(*pred->regex), GFP_KERNEL);
+ if (!pred->regex)
+ goto err_mem;
+ pred->regex->len = len;
+ strncpy(pred->regex->pattern, str + s, len);
+ pred->regex->pattern[len] = 0;

filter_build_regex(pred);

@@ -1608,7 +1620,7 @@ static int parse_pred(const char *str, void *data,

} else if (field->filter_type == FILTER_STATIC_STRING) {
pred->fn_num = FILTER_PRED_FN_STRING;
- pred->regex.field_len = field->size;
+ pred->regex->field_len = field->size;

} else if (field->filter_type == FILTER_DYN_STRING) {
pred->fn_num = FILTER_PRED_FN_STRLOC;
@@ -1691,10 +1703,10 @@ static int parse_pred(const char *str, void *data,
return i;

err_free:
- kfree(pred);
+ free_predicate(pred);
return -EINVAL;
err_mem:
- kfree(pred);
+ free_predicate(pred);
return -ENOMEM;
}

@@ -2287,8 +2299,8 @@ static int ftrace_function_set_filter_pred(struct filter_pred *pred,
return ret;

return __ftrace_function_set_filter(pred->op == OP_EQ,
- pred->regex.pattern,
- pred->regex.len,
+ pred->regex->pattern,
+ pred->regex->len,
data);
}

--
2.31.1


2023-07-05 18:27:05

by Valentin Schneider

Subject: [RFC PATCH 14/14] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs

vunmap()'s issued from housekeeping CPUs are a relatively common source of
interference for isolated NOHZ_FULL CPUs, as they are hit by the
flush_tlb_kernel_range() IPIs.

Given that CPUs executing in userspace do not access data in the vmalloc
range, these IPIs could be deferred until their next kernel entry.

This does require a guarantee that nothing in the vmalloc range can be
accessed in early entry code. vmalloc'd kernel stacks (VMAP_STACK) are
AFAICT a safe exception, as a task running in userspace needs to enter
kernelspace to execute do_exit() before its stack can be vfree'd.

XXX: Validation that nothing in the vmalloc range is accessed in .noinstr or
somesuch?

Blindly deferring any and all flush of the kernel mappings is a risky move,
so introduce a variant of flush_tlb_kernel_range() that explicitly allows
deferral. Use it for vunmap flushes.

Note that while flush_tlb_kernel_range() may end up issuing a full
flush (including user mappings), this only happens when reaching an
invalidation range threshold where it is cheaper to do a full flush than to
individually invalidate each page in the range via INVLPG. IOW, it doesn't
*require* invalidating user mappings, and thus remains safe to defer until
a later kernel entry.

Signed-off-by: Valentin Schneider <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 1 +
arch/x86/mm/tlb.c | 23 ++++++++++++++++++++---
mm/vmalloc.c | 15 ++++++++++-----
3 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 9064aa0027d05..2b3605c09649c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -255,6 +255,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned int stride_shift,
bool freed_tables);
extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end);

static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
{
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 631df9189ded4..bb18b35e61b4a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -10,6 +10,7 @@
#include <linux/debugfs.h>
#include <linux/sched/smt.h>
#include <linux/task_work.h>
+#include <linux/context_tracking.h>

#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
@@ -1045,6 +1046,11 @@ static void do_flush_tlb_all(void *info)
__flush_tlb_all();
}

+static bool do_kernel_flush_defer_cond(int cpu, void *info)
+{
+ return !ct_set_cpu_work(cpu, CONTEXT_WORK_TLBI);
+}
+
void flush_tlb_all(void)
{
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
@@ -1061,12 +1067,13 @@ static void do_kernel_range_flush(void *info)
flush_tlb_one_kernel(addr);
}

-void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+static inline void
+__flush_tlb_kernel_range(smp_cond_func_t cond_func, unsigned long start, unsigned long end)
{
/* Balance as user space task's flush, a bit conservative */
if (end == TLB_FLUSH_ALL ||
(end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
- on_each_cpu(do_flush_tlb_all, NULL, 1);
+ on_each_cpu_cond(cond_func, do_flush_tlb_all, NULL, 1);
} else {
struct flush_tlb_info *info;

@@ -1074,13 +1081,23 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
info = get_flush_tlb_info(NULL, start, end, 0, false,
TLB_GENERATION_INVALID);

- on_each_cpu(do_kernel_range_flush, info, 1);
+ on_each_cpu_cond(cond_func, do_kernel_range_flush, info, 1);

put_flush_tlb_info();
preempt_enable();
}
}

+void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+ __flush_tlb_kernel_range(NULL, start, end);
+}
+
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+ __flush_tlb_kernel_range(do_kernel_flush_defer_cond, start, end);
+}
+
/*
* This can be used from process context to figure out what the value of
* CR3 is without needing to do a (slow) __read_cr3().
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 1d13d71687d73..10f99224e12e0 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -439,6 +439,11 @@ void vunmap_range_noflush(unsigned long start, unsigned long end)
__vunmap_range_noflush(start, end);
}

+#ifndef flush_tlb_kernel_range_deferrable
+#define flush_tlb_kernel_range_deferrable(start, end) \
+ flush_tlb_kernel_range(start, end)
+#endif
+
/**
* vunmap_range - unmap kernel virtual addresses
* @addr: start of the VM area to unmap
@@ -452,7 +457,7 @@ void vunmap_range(unsigned long addr, unsigned long end)
{
flush_cache_vunmap(addr, end);
vunmap_range_noflush(addr, end);
- flush_tlb_kernel_range(addr, end);
+ flush_tlb_kernel_range_deferrable(addr, end);
}

static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
@@ -1747,7 +1752,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
list_last_entry(&local_purge_list,
struct vmap_area, list)->va_end);

- flush_tlb_kernel_range(start, end);
+ flush_tlb_kernel_range_deferrable(start, end);
resched_threshold = lazy_max_pages() << 1;

spin_lock(&free_vmap_area_lock);
@@ -1849,7 +1854,7 @@ static void free_unmap_vmap_area(struct vmap_area *va)
flush_cache_vunmap(va->va_start, va->va_end);
vunmap_range_noflush(va->va_start, va->va_end);
if (debug_pagealloc_enabled_static())
- flush_tlb_kernel_range(va->va_start, va->va_end);
+ flush_tlb_kernel_range_deferrable(va->va_start, va->va_end);

free_vmap_area_noflush(va);
}
@@ -2207,7 +2212,7 @@ static void vb_free(unsigned long addr, unsigned long size)
vunmap_range_noflush(addr, addr + size);

if (debug_pagealloc_enabled_static())
- flush_tlb_kernel_range(addr, addr + size);
+ flush_tlb_kernel_range_deferrable(addr, addr + size);

spin_lock(&vb->lock);

@@ -2260,7 +2265,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
mutex_lock(&vmap_purge_lock);
purge_fragmented_blocks_allcpus();
if (!__purge_vmap_area_lazy(start, end) && flush)
- flush_tlb_kernel_range(start, end);
+ flush_tlb_kernel_range_deferrable(start, end);
mutex_unlock(&vmap_purge_lock);
}

--
2.31.1


2023-07-05 18:58:54

by Nadav Amit

Subject: Re: [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition



> On Jul 5, 2023, at 11:12 AM, Valentin Schneider <[email protected]> wrote:
>
> Deferral approach
> =================
>
> Storing each and every callback, like a secondary call_single_queue turned out
> to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
> userspace for as long as possible - no signal of any form would be sent when
> deferring an IPI. This means that any form of queuing for deferred callbacks
> would end up as a convoluted memory leak.
>
> Deferred IPIs must thus be coalesced, which this series achieves by assigning
> IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
> kernel entry.

I have some experience with a similar optimization. Overall, it can make
sense and as you show, it can reduce the number of interrupts.

The main problem of such an approach might be in cases where a process
frequently enters and exits the kernel between deferred-IPIs, or even worse -
the IPI is sent while the remote CPU is inside the kernel. In such cases, you
pay the extra cost of synchronization and cache traffic, and might not even
get the benefit of reducing the number of IPIs.

In a sense, it's a more extreme case of the overhead that x86’s lazy-TLB
mechanism introduces while tracking whether a process is running or not. But
lazy-TLB would change is_lazy much less frequently than context tracking,
which means that deferring the IPIs as done in this patch-set has a
greater potential to hurt performance than lazy-TLB.

tl;dr - it would be beneficial to show some performance numbers for both a
“good” case where a process spends most of the time in userspace, and a “bad”
one where a process enters and exits the kernel very frequently. Reducing
the number of IPIs is good but I don’t think it is a goal by its own.
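
For instance, the “bad” case could be approximated with something as trivial
as a tight syscall loop pinned to an isolated CPU (illustrative sketch, not a
proposed benchmark harness):

#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	/* Cheap syscall in a tight loop: constant user<->kernel transitions. */
	for (;;)
		syscall(SYS_getpid);

	return 0;
}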

[ BTW: I did not go over the patches in detail. Obviously, there are
various delicate points that need to be checked, such as avoiding the
deferral of IPIs when page-tables are freed. ]

2023-07-05 19:41:24

by Steven Rostedt

Subject: Re: [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition

On Wed, 5 Jul 2023 19:12:42 +0100
Valentin Schneider <[email protected]> wrote:

> o Patches 1-5 have been submitted previously and are included for the sake of
> testing

I should have commented on the previous set, but I did my review on this set ;-)

Anyway, I'm all for the patches. Care to send a new version covering my input?

Thanks,

-- Steve

2023-07-05 22:39:38

by Frederic Weisbecker

Subject: Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure

Le Wed, Jul 05, 2023 at 07:12:53PM +0100, Valentin Schneider a écrit :
> +bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
> +{
> + struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
> + unsigned int old_work;
> + bool ret = false;
> +
> + preempt_disable();
> +
> + old_work = atomic_read(&ct->work);
> + /*
> + * Try setting the work until either
> + * - the target CPU no longer accepts any more deferred work
> + * - the work has been set
> + */
> + while (!(old_work & CONTEXT_WORK_DISABLED) && !ret)

Isn't there a race here where you may have missed a CPU that just entered
userspace, and you eventually disturb it?

> + ret = atomic_try_cmpxchg(&ct->work, &old_work, old_work | work);
> +
> + preempt_enable();
> + return ret;
> +}
[...]
> @@ -100,14 +158,19 @@ static noinstr void ct_kernel_exit_state(int offset)
> */
> static noinstr void ct_kernel_enter_state(int offset)
> {
> + struct context_tracking *ct = this_cpu_ptr(&context_tracking);
> int seq;
> + unsigned int work;
>
> + work = ct_work_fetch(ct);

So this adds another fully ordered operation on user <-> kernel transition.
How many such IPIs can we expect?

If this is just about a dozen, can we stuff them in the state like in the
following? We can potentially add more of them especially on 64 bits we could
afford 30 different works, this is just shrinking the RCU extended quiescent
state counter space. Worst case that can happen is that RCU misses 65535
idle/user <-> kernel transitions and delays a grace period...

diff --git a/arch/Kconfig b/arch/Kconfig
index 205fd23e0cad..e453e9fb864b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -851,6 +851,15 @@ config HAVE_CONTEXT_TRACKING_USER_OFFSTACK
- No use of instrumentation, unless instrumentation_begin() got
called.

+config HAVE_CONTEXT_TRACKING_WORK
+ bool
+ help
+ Architecture supports deferring work while not in kernel context.
+ This is especially useful on setups with isolated CPUs that might
+ want to avoid being interrupted to perform housekeeping tasks (for
+ ex. TLB invalidation or icache invalidation). The housekeeping
+ operations are performed upon re-entering the kernel.
+
config HAVE_TIF_NOHZ
bool
help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee..490c773105c0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -197,6 +197,7 @@ config X86
select HAVE_CMPXCHG_LOCAL
select HAVE_CONTEXT_TRACKING_USER if X86_64
select HAVE_CONTEXT_TRACKING_USER_OFFSTACK if HAVE_CONTEXT_TRACKING_USER
+ select HAVE_CONTEXT_TRACKING_WORK if X86_64
select HAVE_C_RECORDMCOUNT
select HAVE_OBJTOOL_MCOUNT if HAVE_OBJTOOL
select HAVE_OBJTOOL_NOP_MCOUNT if HAVE_OBJTOOL_MCOUNT
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
new file mode 100644
index 000000000000..5bc29e6b2ed3
--- /dev/null
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
+#define _ASM_X86_CONTEXT_TRACKING_WORK_H
+
+static __always_inline void arch_context_tracking_work(int work)
+{
+ switch (work) {
+ case CONTEXT_WORK_n:
+ // Do work...
+ break;
+ }
+}
+
+#endif
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index d3cbb6c16bab..333b26d7cbe5 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -5,6 +5,7 @@
#include <linux/sched.h>
#include <linux/vtime.h>
#include <linux/context_tracking_state.h>
+#include <linux/context_tracking_work.h>
#include <linux/instrumentation.h>

#include <asm/ptrace.h>
@@ -75,7 +76,7 @@ static inline void exception_exit(enum ctx_state prev_ctx)
static __always_inline bool context_tracking_guest_enter(void)
{
if (context_tracking_enabled())
- __ct_user_enter(CONTEXT_GUEST);
+ __ct_user_enter(CONTEXT_USER);

return context_tracking_enabled_this_cpu();
}
@@ -83,7 +84,7 @@ static __always_inline bool context_tracking_guest_enter(void)
static __always_inline void context_tracking_guest_exit(void)
{
if (context_tracking_enabled())
- __ct_user_exit(CONTEXT_GUEST);
+ __ct_user_exit(CONTEXT_USER);
}

#define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
@@ -122,6 +123,26 @@ static __always_inline bool rcu_dynticks_curr_cpu_in_eqs(void)
return !(arch_atomic_read(this_cpu_ptr(&context_tracking.state)) & RCU_DYNTICKS_IDX);
}

+/*
+ * Increment the current CPU's context_tracking structure's ->state field
+ * with ordering and clear the work bits. Return the new value.
+ */
+static __always_inline unsigned long ct_state_inc_clear_work(int incby)
+{
+ struct context_tracking *ct = this_cpu_ptr(&context_tracking);
+ unsigned long new, old, state;
+
+ state = arch_atomic_read(&ct->state);
+ do {
+ old = state;
+ new = old & ~CONTEXT_WORK_MASK;
+ new += incby;
+ state = arch_atomic_cmpxchg(&ct->state, old, new);
+ } while (old != state);
+
+ return state;
+}
+
/*
* Increment the current CPU's context_tracking structure's ->state field
* with ordering. Return the new value.
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index fdd537ea513f..ec3d172601c5 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -10,14 +10,19 @@
#define DYNTICK_IRQ_NONIDLE ((LONG_MAX / 2) + 1)

enum ctx_state {
+ /* Following are values */
CONTEXT_DISABLED = -1, /* returned by ct_state() if unknown */
CONTEXT_KERNEL = 0,
CONTEXT_IDLE = 1,
CONTEXT_USER = 2,
- CONTEXT_GUEST = 3,
- CONTEXT_MAX = 4,
+ /* Following are bit numbers */
+ CONTEXT_WORK = 2,
+ CONTEXT_MAX = 16,
};

+#define CONTEXT_MASK (BIT(CONTEXT_WORK) - 1)
+#define CONTEXT_WORK_MASK ((BIT(CONTEXT_MAX) - 1) & ~(BIT(CONTEXT_WORK) - 1))
+
/* Even value for idle, else odd. */
#define RCU_DYNTICKS_IDX CONTEXT_MAX

diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
new file mode 100644
index 000000000000..fb74db8876dd
--- /dev/null
+++ b/include/linux/context_tracking_work.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CONTEXT_TRACKING_WORK_H
+#define _LINUX_CONTEXT_TRACKING_WORK_H
+
+#include <linux/bitops.h>
+
+enum {
+ CONTEXT_WORK_n_OFFSET,
+ CONTEXT_WORK_MAX_OFFSET
+};
+
+enum ct_work {
+ CONTEXT_WORK_n = BIT(CONTEXT_WORK_n_OFFSET),
+ CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET)
+};
+
+#include <asm/context_tracking_work.h>
+
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+extern bool ct_set_cpu_work(unsigned int cpu, unsigned int work);
+#else
+static inline bool
+ct_set_cpu_work(unsigned int cpu, unsigned int work) { return false; }
+#endif
+
+#endif
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index a09f1c19336a..732042b9a7b7 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -72,6 +72,58 @@ static __always_inline void rcu_dynticks_task_trace_exit(void)
#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
}

+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+static noinstr void ct_work_flush(unsigned long seq)
+{
+ unsigned int bit;
+ /*
+ * arch_context_tracking_work() must be noinstr, non-blocking,
+ * and NMI safe.
+ */
+ for_each_set_bit(bit, &seq, CONTEXT_MAX)
+ arch_context_tracking_work(BIT(bit) >> CONTEXT_WORK);
+}
+
+bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
+{
+ struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
+ unsigned int old, new, state;
+ bool ret = false;
+
+ preempt_disable();
+
+ work <<= CONTEXT_WORK;
+ state = atomic_read(&ct->state);
+ /*
+ * Try setting the work until either
+ * - the target CPU is on the kernel
+ * - the work has been set
+ */
+ for (;;) {
+ /* Only set if running in user/guest */
+ old = state;
+ old &= ~CONTEXT_MASK;
+ old |= CONTEXT_USER;
+
+ new = old | work;
+
+ state = atomic_cmpxchg(&ct->state, old, new);
+ if (state & work) {
+ ret = true;
+ break;
+ }
+
+ if ((state & CONTEXT_MASK) != CONTEXT_USER)
+ break;
+ }
+
+ preempt_enable();
+ return ret;
+}
+#else
+static __always_inline void ct_work_flush(unsigned long seq) { }
+#endif
+
/*
* Record entry into an extended quiescent state. This is only to be
* called when not already in an extended quiescent state, that is,
@@ -100,14 +152,18 @@ static noinstr void ct_kernel_exit_state(int offset)
*/
static noinstr void ct_kernel_enter_state(int offset)
{
- int seq;
+ struct context_tracking *ct = this_cpu_ptr(&context_tracking);
+ unsigned long seq;

/*
* CPUs seeing atomic_add_return() must see prior idle sojourns,
* and we also must force ordering with the next RCU read-side
* critical section.
*/
- seq = ct_state_inc(offset);
+ seq = ct_state_inc_clear_work(offset);
+ if (seq & CONTEXT_WORK_MASK)
+ ct_work_flush(seq & CONTEXT_WORK_MASK);
+
// RCU is now watching. Better not be in an extended quiescent state!
rcu_dynticks_task_trace_exit(); // After ->dynticks update!
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !(seq & RCU_DYNTICKS_IDX));
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index bae8f11070be..fdb266f2d774 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -181,6 +181,11 @@ config CONTEXT_TRACKING_USER_FORCE
Say N otherwise, this option brings an overhead that you
don't want in production.

+config CONTEXT_TRACKING_WORK
+ bool
+ depends on HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
+ default y
+
config NO_HZ
bool "Old Idle dynticks config"
help
--
2.40.1


2023-07-05 23:01:14

by Peter Zijlstra

Subject: Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure

On Thu, Jul 06, 2023 at 12:23:57AM +0200, Frederic Weisbecker wrote:
> If this is just about a dozen, can we stuff them in the state like in the
> following? We can potentially add more of them especially on 64 bits we could
> afford 30 different works, this is just shrinking the RCU extended quiescent
> state counter space. Worst case that can happen is that RCU misses 65535
> idle/user <-> kernel transitions and delays a grace period...

We can make all this a 64bit only feature and use atomic_long_t :-)
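
Sketching what that could look like on top of the CONTEXT_TRACKING_WORK
entry from the patch above (illustrative only, not what was posted):

  config CONTEXT_TRACKING_WORK
  	bool
  	depends on 64BIT && HAVE_CONTEXT_TRACKING_WORK && CONTEXT_TRACKING_USER
  	default y

with context_tracking.state presumably becoming an atomic_long_t, leaving
plenty of room for both the work bits and the RCU counter.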

2023-07-05 23:02:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure

On Wed, Jul 05, 2023 at 07:12:53PM +0100, Valentin Schneider wrote:

> Note: A previous approach by PeterZ [1] used an extra bit in
> context_tracking.state to flag the presence of deferred callbacks to
> execute, and the actual callbacks were stored in a separate atomic
> variable.
>
> This meant that the atomic read of context_tracking.state was sufficient to
> determine whether there are any deferred callbacks to execute.
> Unfortunately, it presents a race window. Consider the work setting
> function as:
>
> 	preempt_disable();
> 	seq = atomic_read(&ct->seq);
> 	if (__context_tracking_seq_in_user(seq)) {
> 		/* ctrl-dep */
> 		atomic_or(work, &ct->work);
> 		ret = atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);
> 	}
> 	preempt_enable();
>
> 	return ret;
>
> Then the following can happen:
>
>     CPUx                                       CPUy
>                                                  CT_SEQ_WORK \in context_tracking.state
>       atomic_or(WORK_N, &ct->work);
>                                                  ct_kernel_enter()
>                                                    ct_state_inc();
>       atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);
>
> The cmpxchg() would fail, ultimately causing an IPI for WORK_N to be
> sent. Unfortunately, the work bit would remain set, and it can't be sanely
> cleared in case another CPU set it concurrently - this would ultimately
> lead to a double execution of the callback, one as a deferred callback and
> one in the IPI. As not all IPI callbacks are idempotent, this is
> undesirable.

So adding another atomic is arguably worse.

The thing is, if the NOHZ_FULL CPU is actually doing context transitions
(SYSCALLs etc..) then everything is fundamentally racy, there is no
winning that game, we could find the remote CPU is in-kernel, send an
IPI, the remote CPU does return-to-user and receives the IPI.

And then the USER is upset... because he got an IPI.

The whole NOHZ_FULL thing really only works if userspace does not do
SYSCALLs.

But the sad sad state of affairs is that some people think it is
acceptable to do SYSCALLs while NOHZ_FULL and cry about how slow stuff
is.



2023-07-06 10:27:15

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure

On Thu, Jul 06, 2023 at 12:23:57AM +0200, Frederic Weisbecker wrote:
> diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
> index fdd537ea513f..ec3d172601c5 100644
> --- a/include/linux/context_tracking_state.h
> +++ b/include/linux/context_tracking_state.h
> @@ -10,14 +10,19 @@
>  #define DYNTICK_IRQ_NONIDLE	((LONG_MAX / 2) + 1)
>
>  enum ctx_state {
> +	/* Following are values */
>  	CONTEXT_DISABLED	= -1,	/* returned by ct_state() if unknown */
>  	CONTEXT_KERNEL		= 0,
>  	CONTEXT_IDLE		= 1,
>  	CONTEXT_USER		= 2,
> -	CONTEXT_GUEST		= 3,
> -	CONTEXT_MAX		= 4,
> +	/* Following are bit numbers */
> +	CONTEXT_WORK		= 2,
> +	CONTEXT_MAX		= 16,
>  };
>
> +#define CONTEXT_MASK		(BIT(CONTEXT_WORK) - 1)
> +#define CONTEXT_WORK_MASK	((BIT(CONTEXT_MAX) - 1) & ~(BIT(CONTEXT_WORK) - 1))
> +
>  /* Even value for idle, else odd. */
>  #define RCU_DYNTICKS_IDX	CONTEXT_MAX

And that should be:

#define RCU_DYNTICKS_IDX BIT(CONTEXT_MAX)

Did I mention it's not even build tested? :o)
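
Folding that fix into the quoted diff, the intended layout of
context_tracking.state comes out roughly as follows (same
not-even-build-tested caveat):

  /*
   *  bits  0-1 : context state value (CONTEXT_KERNEL / _IDLE / _USER)
   *  bits  2-15: deferred work bits
   *  bits 16-31: RCU extended quiescent state counter
   */
  #define CONTEXT_MASK		(BIT(CONTEXT_WORK) - 1)
  #define CONTEXT_WORK_MASK	((BIT(CONTEXT_MAX) - 1) & ~(BIT(CONTEXT_WORK) - 1))
  #define RCU_DYNTICKS_IDX	BIT(CONTEXT_MAX)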

2023-07-06 10:28:54

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure

On Thu, Jul 06, 2023 at 12:41:04AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 06, 2023 at 12:23:57AM +0200, Frederic Weisbecker wrote:
> > If this is just about a dozen, can we stuff them in the state like in the
> > following? We can potentially add more of them especially on 64 bits we could
> > afford 30 different works, this is just shrinking the RCU extended quiescent
> > state counter space. Worst case that can happen is that RCU misses 65535
> > idle/user <-> kernel transitions and delays a grace period...
>
> We can make all this a 64bit only feature and use atomic_long_t :-)

Works for me :)

2023-07-06 11:40:53

by Valentin Schneider

[permalink] [raw]
Subject: Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure

On 06/07/23 00:39, Peter Zijlstra wrote:
> On Wed, Jul 05, 2023 at 07:12:53PM +0100, Valentin Schneider wrote:
>
>> Note: A previous approach by PeterZ [1] used an extra bit in
>> context_tracking.state to flag the presence of deferred callbacks to
>> execute, and the actual callbacks were stored in a separate atomic
>> variable.
>>
>> This meant that the atomic read of context_tracking.state was sufficient to
>> determine whether there are any deferred callbacks to execute.
>> Unfortunately, it presents a race window. Consider the work setting
>> function as:
>>
>> 	preempt_disable();
>> 	seq = atomic_read(&ct->seq);
>> 	if (__context_tracking_seq_in_user(seq)) {
>> 		/* ctrl-dep */
>> 		atomic_or(work, &ct->work);
>> 		ret = atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);
>> 	}
>> 	preempt_enable();
>>
>> 	return ret;
>>
>> Then the following can happen:
>>
>>     CPUx                                       CPUy
>>                                                  CT_SEQ_WORK \in context_tracking.state
>>       atomic_or(WORK_N, &ct->work);
>>                                                  ct_kernel_enter()
>>                                                    ct_state_inc();
>>       atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);
>>
>> The cmpxchg() would fail, ultimately causing an IPI for WORK_N to be
>> sent. Unfortunately, the work bit would remain set, and it can't be sanely
>> cleared in case another CPU set it concurrently - this would ultimately
>> lead to a double execution of the callback, one as a deferred callback and
>> one in the IPI. As not all IPI callbacks are idempotent, this is
>> undesirable.
>
> So adding another atomic is arguably worse.
>
> The thing is, if the NOHZ_FULL CPU is actually doing context transitions
> (SYSCALLs etc..) then everything is fundamentally racy, there is no
> winning that game, we could find the remote CPU is in-kernel, send an
> IPI, the remote CPU does return-to-user and receives the IPI.
>
> And then the USER is upset... because he got an IPI.
>

Yeah, that part is inevitably racy.

The thing I was especially worried about was the potential double
executions (once in the IPI, again in the deferred work). It's not /too/
bad, as the only two deferred callbacks I'm introducing here are
costly-but-stateless, but IMO it's a bad foundation.

But it seems like we can reuse the existing atomic and squeeze some bits in
there, so let's see how that goes :-)


2023-07-06 11:50:59

by Valentin Schneider

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition

On 05/07/23 18:48, Nadav Amit wrote:
>> On Jul 5, 2023, at 11:12 AM, Valentin Schneider <[email protected]> wrote:
>>
>> Deferral approach
>> =================
>>
>> Storing each and every callback, like a secondary call_single_queue turned out
>> to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
>> userspace for as long as possible - no signal of any form would be sent when
>> deferring an IPI. This means that any form of queuing for deferred callbacks
>> would end up as a convoluted memory leak.
>>
>> Deferred IPIs must thus be coalesced, which this series achieves by assigning
>> IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
>> kernel entry.
>
> I have some experience with a similar optimization. Overall, it can make
> sense and, as you show, it can reduce the number of interrupts.
>
> The main problem with such an approach might be in cases where a process
> frequently enters and exits the kernel between deferred IPIs, or even worse -
> the IPI is sent while the remote CPU is inside the kernel. In such cases, you
> pay the extra cost of synchronization and cache traffic, and might not even
> get the benefit of reducing the number of IPIs.
>
> In a sense, it's a more extreme case of the overhead that x86’s lazy-TLB
> mechanism introduces while tracking whether a process is running or not. But
> lazy-TLB would change is_lazy much less frequently than context tracking,
> which means that deferring the IPIs as done in this patch set has a
> greater potential to hurt performance than lazy-TLB.
>
> tl;dr - it would be beneficial to show some performance numbers for both a
> “good” case where a process spends most of its time in userspace, and a “bad”
> one where a process enters and exits the kernel very frequently. Reducing
> the number of IPIs is good, but I don’t think it is a goal on its own.
>

There already is a significant overhead incurred on kernel entry for
nohz_full CPUs due to all of the context_tracking faff; now I *am* making it
worse with that extra atomic, but I get the feeling it's not going to stay
:D

nohz_full CPUs that do context transitions very frequently are
unfortunately in the realm of "you shouldn't do that". Due to what's out
there I have to care about *occasional* transitions, but some folks
consider even that to be broken usage, so I don't believe numbers for that
case would be particularly relevant.

> [ BTW: I did not go over the patches in detail. Obviously, there are
> various delicate points that need to be checked, as avoiding the
> deferring of IPIs if page-tables are freed. ]


2023-07-06 12:07:07

by Valentin Schneider

[permalink] [raw]
Subject: Re: [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition

On 05/07/23 15:03, Steven Rostedt wrote:
> On Wed, 5 Jul 2023 19:12:42 +0100
> Valentin Schneider <[email protected]> wrote:
>
>> o Patches 1-5 have been submitted previously and are included for the sake of
>> testing
>
> I should have commented on the previous set, but I did my review on this set ;-)
>

Thanks for having a look!

> Anyway, I'm all for the patches. Care to send a new version covering my input?
>

Sure thing, I'll send a v2 of these patches soonish.

> Thanks,
>
> -- Steve


2023-07-06 12:10:52

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure

On Thu, Jul 06, 2023 at 12:30:46PM +0100, Valentin Schneider wrote:
> >> + ret = atomic_try_cmpxchg(&ct->work, &old_work, old_work | work);
> >> +
> >> + preempt_enable();
> >> + return ret;
> >> +}
> > [...]
> >> @@ -100,14 +158,19 @@ static noinstr void ct_kernel_exit_state(int offset)
> >> */
> >> static noinstr void ct_kernel_enter_state(int offset)
> >> {
> >> + struct context_tracking *ct = this_cpu_ptr(&context_tracking);
> >> int seq;
> >> + unsigned int work;
> >>
> >> + work = ct_work_fetch(ct);
> >
> > So this adds another fully ordered operation on user <-> kernel transition.
> > How many such IPIs can we expect?
> >
>
> Despite having spent quite a lot of time on that question, I think I still
> only have a hunch.
>
> Poking around RHEL systems, I'd say 99% of the problematic IPIs are
> instruction patching and TLB flushes.
>
> Staring at the code, there's quite a lot of smp_calls for which it's hard
> to say whether the target CPUs can actually be isolated or not (e.g. the
> CPU comes from a cpumask shoved in a struct that was built using data from
> another struct of uncertain origins), but then again some of them don't
> need to hook into context_tracking.
>
> Long story short: I /think/ we can consider that number to be fairly small,
> but there could be more lurking in the shadows.

I guess there will still be time to reconsider the design if we ever reach such a size.

>
> > If this is just about a dozen, can we stuff them in the state like in the
> > following? We can potentially add more of them especially on 64 bits we could
> > afford 30 different works, this is just shrinking the RCU extended quiescent
> > state counter space. Worst case that can happen is that RCU misses 65535
> > idle/user <-> kernel transitions and delays a grace period...
> >
>
> I'm trying to grok how this impacts RCU, IIUC most of RCU mostly cares about the
> even/odd-ness of the thing, and rcu_gp_fqs() cares about the actual value
> but only to check if it has changed over time (rcu_dynticks_in_eqs_since()
> only does a !=).
>
> I'm rephrasing here to make sure I get it - is it then that the worst case
> here is 2^(dynticks_counter_size) transitions happen between saving the
> dynticks snapshot and checking it again, so RCU waits some more?

That's my understanding as well, but I have to defer to Paul to make sure I'm
not overlooking something.

Thanks.

2023-07-06 12:11:36

by Valentin Schneider

[permalink] [raw]
Subject: Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure

On 06/07/23 00:23, Frederic Weisbecker wrote:
> Le Wed, Jul 05, 2023 at 07:12:53PM +0100, Valentin Schneider a écrit :
>> +bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
>> +{
>> + struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
>> + unsigned int old_work;
>> + bool ret = false;
>> +
>> + preempt_disable();
>> +
>> + old_work = atomic_read(&ct->work);
>> + /*
>> + * Try setting the work until either
>> + * - the target CPU no longer accepts any more deferred work
>> + * - the work has been set
>> + */
>> + while (!(old_work & CONTEXT_WORK_DISABLED) && !ret)
>
> Isn't there a race here where you may have missed a CPU that just entered in
> user and you eventually disturb it?
>

Yes, unfortunately.

>> + ret = atomic_try_cmpxchg(&ct->work, &old_work, old_work | work);
>> +
>> + preempt_enable();
>> + return ret;
>> +}
> [...]
>> @@ -100,14 +158,19 @@ static noinstr void ct_kernel_exit_state(int offset)
>> */
>> static noinstr void ct_kernel_enter_state(int offset)
>> {
>> + struct context_tracking *ct = this_cpu_ptr(&context_tracking);
>> int seq;
>> + unsigned int work;
>>
>> + work = ct_work_fetch(ct);
>
> So this adds another fully ordered operation on user <-> kernel transition.
> How many such IPIs can we expect?
>

Despite having spent quite a lot of time on that question, I think I still
only have a hunch.

Poking around RHEL systems, I'd say 99% of the problematic IPIs are
instruction patching and TLB flushes.

Staring at the code, there's quite a lot of smp_calls for which it's hard
to say whether the target CPUs can actually be isolated or not (e.g. the
CPU comes from a cpumask shoved in a struct that was built using data from
another struct of uncertain origins), but then again some of them don't
need to hook into context_tracking.

Long story short: I /think/ we can consider that number to be fairly small,
but there could be more lurking in the shadows.

> If this is just about a dozen, can we stuff them in the state like in the
> following? We can potentially add more of them especially on 64 bits we could
> afford 30 different works, this is just shrinking the RCU extended quiescent
> state counter space. Worst case that can happen is that RCU misses 65535
> idle/user <-> kernel transitions and delays a grace period...
>

I'm trying to grok how this impacts RCU. IIUC, RCU mostly cares about the
even/odd-ness of the thing, and rcu_gp_fqs() cares about the actual value
only to check whether it has changed over time (rcu_dynticks_in_eqs_since()
only does a !=).

I'm rephrasing here to make sure I get it - is it then that the worst case
here is 2^(dynticks_counter_size) transitions happen between saving the
dynticks snapshot and checking it again, so RCU waits some more?


2023-07-06 12:13:30

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure

On Thu, Jul 06, 2023 at 12:23:57AM +0200, Frederic Weisbecker wrote:
> Le Wed, Jul 05, 2023 at 07:12:53PM +0100, Valentin Schneider a écrit :
> +bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
> +{
> +	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
> +	unsigned int old, new, state;
> +	bool ret = false;
> +
> +	preempt_disable();
> +
> +	work <<= CONTEXT_WORK;
> +	state = atomic_read(&ct->state);
> +	/*
> +	 * Try setting the work until either
> +	 * - the target CPU is on the kernel
> +	 * - the work has been set
> +	 */
> +	for (;;) {
> +		/* Only set if running in user/guest */
> +		old = state;
> +		old &= ~CONTEXT_MASK;
> +		old |= CONTEXT_USER;
> +
> +		new = old | work;
> +
> +		state = atomic_cmpxchg(&ct->state, old, new);
> +		if (state & work) {

And this should be "if (state == old)", otherwise there is
a risk that someone else had set the work but atomic_cmpxchg()
failed due to other modifications in the meantime. It's then
dangerous to defer the work in that case, because atomic_cmpxchg()
failures don't imply full ordering. So there is a risk that the
target executes the work but doesn't see the most recent data.

Thanks.
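
For reference, a sketch of the setter with that check folded in (same
untested caveat as the proposal above):

  bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
  {
          struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
          unsigned int old, new, state;
          bool ret = false;

          preempt_disable();

          work <<= CONTEXT_WORK;
          state = atomic_read(&ct->state);

          for (;;) {
                  /* Only set the work if the target is running in userspace */
                  old = (state & ~CONTEXT_MASK) | CONTEXT_USER;
                  new = old | work;

                  state = atomic_cmpxchg(&ct->state, old, new);
                  if (state == old) {
                          /* Success: the cmpxchg() provides the ordering the work relies on */
                          ret = true;
                          break;
                  }

                  /* Target is in the kernel: bail out, the caller sends an IPI */
                  if ((state & CONTEXT_MASK) != CONTEXT_USER)
                          break;
          }

          preempt_enable();
          return ret;
  }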

2023-07-06 16:47:26

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure

On Thu, Jul 06, 2023 at 01:40:14PM +0200, Frederic Weisbecker wrote:
> On Thu, Jul 06, 2023 at 12:30:46PM +0100, Valentin Schneider wrote:
> > >> + ret = atomic_try_cmpxchg(&ct->work, &old_work, old_work | work);
> > >> +
> > >> + preempt_enable();
> > >> + return ret;
> > >> +}
> > > [...]
> > >> @@ -100,14 +158,19 @@ static noinstr void ct_kernel_exit_state(int offset)
> > >> */
> > >> static noinstr void ct_kernel_enter_state(int offset)
> > >> {
> > >> + struct context_tracking *ct = this_cpu_ptr(&context_tracking);
> > >> int seq;
> > >> + unsigned int work;
> > >>
> > >> + work = ct_work_fetch(ct);
> > >
> > > So this adds another fully ordered operation on user <-> kernel transition.
> > > How many such IPIs can we expect?
> > >
> >
> > Despite having spent quite a lot of time on that question, I think I still
> > only have a hunch.
> >
> > Poking around RHEL systems, I'd say 99% of the problematic IPIs are
> > instruction patching and TLB flushes.
> >
> > Staring at the code, there's quite a lot of smp_calls for which it's hard
> > to say whether the target CPUs can actually be isolated or not (e.g. the
> > CPU comes from a cpumask shoved in a struct that was built using data from
> > another struct of uncertain origins), but then again some of them don't
> > need to hook into context_tracking.
> >
> > Long story short: I /think/ we can consider that number to be fairly small,
> > but there could be more lurking in the shadows.
>
> I guess it will still be time to reconsider the design if we ever reach such size.
>
> > > If this is just about a dozen, can we stuff them in the state like in the
> > > following? We can potentially add more of them especially on 64 bits we could
> > > afford 30 different works, this is just shrinking the RCU extended quiescent
> > > state counter space. Worst case that can happen is that RCU misses 65535
> > > idle/user <-> kernel transitions and delays a grace period...
> > >
> >
> > I'm trying to grok how this impacts RCU, IIUC most of RCU mostly cares about the
> > even/odd-ness of the thing, and rcu_gp_fqs() cares about the actual value
> > but only to check if it has changed over time (rcu_dynticks_in_eqs_since()
> > only does a !=).
> >
> > I'm rephrasing here to make sure I get it - is it then that the worst case
> > here is 2^(dynticks_counter_size) transitions happen between saving the
> > dynticks snapshot and checking it again, so RCU waits some more?
>
> That's my understanding as well but I have to defer on Paul to make sure I'm
> not overlooking something.

That does look plausible to me.

And yes, RCU really cares about whether its part of this counter has
been a multiple of two during a given interval of time, because this
indicates that the CPU has no pre-existing RCU readers still active.
One way that this can happen is for that value to be a multiple of two
at some point in time. The other way is for the value to have changed.
No matter what the start and end values are, if they differ, the counter
must necessarily have passed through at least one multiple of two in the
meantime, again guaranteeing that any RCU readers that were around when
the count was first fetched have now finished.
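
Paraphrasing the above as pseudo-C (names and bit handling are purely
illustrative, this is not the actual rcu_dynticks_in_eqs*() code):

  /*
   * snap: RCU's part of the counter, sampled when the grace period started.
   * cur:  the same bits re-read now.
   */
  static bool cpu_passed_through_eqs(unsigned long snap, unsigned long cur)
  {
          /* Even at snapshot time: the CPU was already in an EQS */
          if (!(snap & 0x1))
                  return true;

          /* The counter moved since, so it must have passed through an even value */
          return cur != snap;
  }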

But we should take the machine's opinions much more seriously than we
take any of our own opinions. Why not adjust RCU_DYNTICKS_IDX so as
to crank RCU's portion of this counter down to (say) two or three bits
and let rcutorture have at it on TREE04 or TREE07, both of which have
nohz_full CPUs?

Maybe also adjust mkinitrd.sh to make the user/kernel transitions more
frequent?

Please note that I do -not- recommend production use of a three-bit
(let alone a two-bit) RCU portion because this has a high probability
of excessively extending grace periods. But it might be good to keep
a tiny counter as a debug option so that we regularly rcutorture it.

Thanx, Paul
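
One way such a debug knob could be expressed, purely as an illustration of
the suggestion above (nothing like this exists in the series yet):

  config RCU_DYNTICKS_TORTURE
  	bool "Shrink RCU's portion of the context tracking counter"
  	depends on RCU_EXPERT
  	help
  	  Leave RCU only a couple of bits of the context tracking counter,
  	  so that counter wrap happens every handful of kernel entries and
  	  gets exercised regularly by rcutorture. Not for production use.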

2023-07-06 18:56:36

by Valentin Schneider

[permalink] [raw]
Subject: Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure

On 06/07/23 09:39, Paul E. McKenney wrote:
> On Thu, Jul 06, 2023 at 01:40:14PM +0200, Frederic Weisbecker wrote:
>> On Thu, Jul 06, 2023 at 12:30:46PM +0100, Valentin Schneider wrote:
>> > I'm trying to grok how this impacts RCU, IIUC most of RCU mostly cares about the
>> > even/odd-ness of the thing, and rcu_gp_fqs() cares about the actual value
>> > but only to check if it has changed over time (rcu_dynticks_in_eqs_since()
>> > only does a !=).
>> >
>> > I'm rephrasing here to make sure I get it - is it then that the worst case
>> > here is 2^(dynticks_counter_size) transitions happen between saving the
>> > dynticks snapshot and checking it again, so RCU waits some more?
>>
>> That's my understanding as well but I have to defer on Paul to make sure I'm
>> not overlooking something.
>
> That does look plausible to me.
>
> And yes, RCU really cares about whether its part of this counter has
> been a multiple of two during a given interval of time, because this
> indicates that the CPU has no pre-existing RCU readers still active.
> One way that this can happen is for that value to be a multiple of two
> at some point in time. The other way that this can happen is for the
> value to have changed. No matter what the start and end values, if they
> are different, the counter must necessarily have at least passed through
> multiple of two in the meantime, again guaranteeing that any RCU readers
> that around when the count was first fetched have now finished.
>

Thank you for the demystification!

> But we should take the machine's opinions much more seriously than we
> take any of our own opinions.

Heh :-)

> Why not adjust RCU_DYNTICKS_IDX so as
> to crank RCU's portion of this counter down to (say) two or three bits
> and let rcutorture have at it on TREE04 or TREE07, both of which have
> nohz_full CPUs?
>
> Maybe also adjust mkinitrd.sh to make the user/kernel transitions more
> frequent?
>
> Please note that I do -not- recommend production use of a three-bit
> (let alone a two-bit) RCU portion because this has a high probability
> of excessively extending grace periods. But it might be good to keep
> a tiny counter as a debug option so that we regularly rcutorture it.
>

Sounds sensible, I'll add that to my v2 todolist.

Thanks!

> Thanx, Paul