Context
=======
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:
64359.052209596 NetworkManager 0 1405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
    smp_call_function_many_cond+0x1
    smp_call_function+0x39
    on_each_cpu+0x2a
    flush_tlb_kernel_range+0x7b
    __purge_vmap_area_lazy+0x70
    _vm_unmap_aliases.part.42+0xdf
    change_page_attr_set_clr+0x16a
    set_memory_ro+0x26
    bpf_int_jit_compile+0x2f9
    bpf_prog_select_runtime+0xc6
    bpf_prepare_filter+0x523
    sk_attach_filter+0x13
    sock_setsockopt+0x92c
    __sys_setsockopt+0x16a
    __x64_sys_setsockopt+0x20
    do_syscall_64+0x87
    entry_SYSCALL_64_after_hwframe+0x65
The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, provided it can be executed "early
enough" in the entry code.
The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.
Deferral approach
=================
Storing each and every callback, as in a secondary call_single_queue, turned
out to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form is sent when an IPI
is deferred, so nothing would ever prompt the CPU into draining a queue. Any
form of queuing for deferred callbacks would thus end up as a convoluted
memory leak.
Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
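Concretely, the type-to-callback mapping boils down to the below, trimmed
from the diffs further down (patch 18 adds the sync_core() deferral, patches
19-20 the kernel TLB flush one):

  enum ct_work {
          CONTEXT_WORK_SYNC = BIT(CONTEXT_WORK_SYNC_OFFSET),
          CONTEXT_WORK_TLBI = BIT(CONTEXT_WORK_TLBI_OFFSET),
  };

  /* Invoked on kernel entry for each pending work bit */
  static __always_inline void arch_context_tracking_work(int work)
  {
          switch (work) {
          case CONTEXT_WORK_SYNC:
                  sync_core();               /* deferred text_poke() sync */
                  break;
          case CONTEXT_WORK_TLBI:
                  __flush_tlb_all_noinstr(); /* deferred kernel TLB flush */
                  break;
          }
  }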
What about IPIs whose callbacks take a parameter, you may ask?
Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.
This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).
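The nearest illustration in this series is the kernel TLB flush deferral
(patches 19-20): rather than transporting do_kernel_range_flush()'s
(start, end) arguments to the deferring CPU, the deferred operation is a
full kernel TLB flush, which subsumes any number of pending ranged flushes.
In sketch form:

  /*
   * Sketch: a parameterless, coalescable stand-in for a parameterized
   * callback - one full flush on kernel entry covers every ranged flush
   * that was deferred while the CPU sat in userspace.
   */
  static void deferred_kernel_tlb_flush(void)
  {
          __flush_tlb_all();
  }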
Kernel entry vs execution of the deferred operation
===================================================
There is a non-zero amount of code that is executed upon kernel entry before
the deferred operation itself can be executed (i.e. before we get into
context_tracking.c proper).
This means one must take extra care about what can happen in the early entry
code, and ensure that <bad things> cannot happen. For instance, we really
don't want to hit
instructions that have been modified by a remote text_poke() while we're on our
way to execute a deferred sync_core().
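To make that concrete, here is a simplified view of where the flush has to
sit in the entry path - ct_work_flush() is a made-up name for whatever
executes the deferred callbacks, the real ordering lives in the arch entry
code:

  __always_inline void enter_from_user_mode(struct pt_regs *regs)
  {
          /*
           * Anything running here runs *before* the deferred work: it must
           * not execute text a remote text_poke() may have patched (the
           * deferred sync_core() hasn't happened yet), nor touch mappings
           * whose deferred kernel TLB flush is still pending.
           */
          ct_work_flush();  /* hypothetical: run deferred callbacks */

          /* Past this point, patched text and flushed TLBs are safe */
  }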
Patches
=======
o Patches 1-9 have been submitted separately and are included for the sake of
testing
o Patches 10-14 focus on having objtool detect problematic static key usage in
early entry
o Patch 15 adds the infrastructure for IPI deferral.
o Patches 16-17 add some RCU testing infrastructure
o Patch 18 adds text_poke() IPI deferral.
o Patches 19-20 add vunmap() flush_tlb_kernel_range() IPI deferral
  These I'm a lot less confident about, mostly due to lacking
  instrumentation/verification.
The actual deferred callback is also incomplete as it's not properly noinstr:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x19: call to native_write_cr4() leaves .noinstr.text section
and it doesn't support PARAVIRT - it's going to need a pv_ops.mmu entry, but I
have *no idea* what a sane implementation would be for Xen so I haven't
touched that yet.
Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v2
Testing
=======
Note: this is a different machine than used for v1, because that machine decided
to act difficult.
Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL9 userspace.
Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
$ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
                   -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
                   -e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
    rteval --onlyload --loads-cpulist=$HK_CPUS \
           --hackbench-runlowmem=True --duration=$DURATION
This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 30 minutes.
v6.5-rc1 (+ cpumask filtering patches):
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
    338 callback=generic_smp_call_function_single_interrupt+0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
   9207 func=do_flush_tlb_all
   1116 func=do_sync_core
     62 func=do_kernel_range_flush
      3 func=nohz_full_kick_func
v6.5-rc1 + patches:
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
      2 callback=generic_smp_call_function_single_interrupt+0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
      2 func=nohz_full_kick_func
The incriminating IPIs are all gone, but note that on the machine I used to test
v1 there were still some do_flush_tlb_all() IPIs caused by
pcpu_balance_workfn(), since only vmalloc is affected by the deferral
mechanism.
Acknowledgements
================
Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for his guidance regarding objtool and hinting at the
.data..ro_after_init section.
Links
=====
[1]: https://lore.kernel.org/all/[email protected]/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
Revisions
=========
RFCv1 -> RFCv2
++++++++++++++
o Rebased onto v6.5-rc1
o Updated the trace filter patches (Steven)
o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
existing .state field (Peter, Frederic)
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
rcutorture case for a low-size counter (Paul)
The new TREE11 case with a 2-bit dynticks counter seems to pass when run
against this series.
o Fixed flush_tlb_kernel_range_deferrable() definition
Peter Zijlstra (1):
jump_label,module: Don't alloc static_key_mod for __ro_after_init keys
Valentin Schneider (19):
tracing/filters: Dynamically allocate filter_pred.regex
tracing/filters: Enable filtering a cpumask field by another cpumask
tracing/filters: Enable filtering a scalar field by a cpumask
tracing/filters: Enable filtering the CPU common field by a cpumask
tracing/filters: Optimise cpumask vs cpumask filtering when user mask
is a single CPU
tracing/filters: Optimise scalar vs cpumask filtering when the user
mask is a single CPU
tracing/filters: Optimise CPU vs cpumask filtering when the user mask
is a single CPU
tracing/filters: Further optimise scalar vs cpumask comparison
tracing/filters: Document cpumask filtering
objtool: Flesh out warning related to pv_ops[] calls
objtool: Warn about non __ro_after_init static key usage in .noinstr
context_tracking: Make context_tracking_key __ro_after_init
x86/kvm: Make kvm_async_pf_enabled __ro_after_init
context-tracking: Introduce work deferral infrastructure
rcu: Make RCU dynticks counter size configurable
rcutorture: Add a test config to torture test low RCU_DYNTICKS width
context_tracking,x86: Defer kernel text patching IPIs
context_tracking,x86: Add infrastructure to defer kernel TLBI
x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL
CPUs
Documentation/trace/events.rst | 14 +
arch/Kconfig | 9 +
arch/x86/Kconfig | 1 +
arch/x86/include/asm/context_tracking_work.h | 20 ++
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/include/asm/tlbflush.h | 2 +
arch/x86/kernel/alternative.c | 24 +-
arch/x86/kernel/kprobes/core.c | 4 +-
arch/x86/kernel/kprobes/opt.c | 4 +-
arch/x86/kernel/kvm.c | 2 +-
arch/x86/kernel/module.c | 2 +-
arch/x86/mm/tlb.c | 40 ++-
include/asm-generic/sections.h | 5 +
include/linux/context_tracking.h | 26 ++
include/linux/context_tracking_state.h | 65 +++-
include/linux/context_tracking_work.h | 28 ++
include/linux/jump_label.h | 1 +
include/linux/trace_events.h | 1 +
init/main.c | 1 +
kernel/context_tracking.c | 53 ++-
kernel/jump_label.c | 49 +++
kernel/rcu/Kconfig | 33 ++
kernel/time/Kconfig | 5 +
kernel/trace/trace_events_filter.c | 302 ++++++++++++++++--
mm/vmalloc.c | 19 +-
tools/objtool/check.c | 22 +-
tools/objtool/include/objtool/check.h | 1 +
tools/objtool/include/objtool/special.h | 2 +
tools/objtool/special.c | 3 +
.../selftests/rcutorture/configs/rcu/TREE11 | 19 ++
.../rcutorture/configs/rcu/TREE11.boot | 1 +
31 files changed, 695 insertions(+), 64 deletions(-)
create mode 100644 arch/x86/include/asm/context_tracking_work.h
create mode 100644 include/linux/context_tracking_work.h
create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
--
2.31.1
Cpumask, scalar and CPU fields can now be filtered by a user-provided
cpumask, document the syntax.
Signed-off-by: Valentin Schneider <[email protected]>
---
Documentation/trace/events.rst | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst
index f5fcb8e1218f6..34108d5a55b41 100644
--- a/Documentation/trace/events.rst
+++ b/Documentation/trace/events.rst
@@ -219,6 +219,20 @@ the function "security_prepare_creds" and less than the end of that function.
The ".function" postfix can only be attached to values of size long, and can only
be compared with "==" or "!=".
+Cpumask fields or scalar fields that encode a CPU number can be filtered using
+a user-provided cpumask in cpulist format. The format is as follows::
+
+ CPUS{$cpulist}
+
+Operators available to cpumask filtering are:
+
+& (intersection), ==, !=
+
+For example, this will filter events that have their .target_cpu field present
+in the given cpumask::
+
+ target_cpu & CPUS{17-42}
+
5.2 Setting filters
-------------------
--
2.31.1
Several events use a scalar field to denote a CPU:
o sched_wakeup.target_cpu
o sched_migrate_task.orig_cpu,dest_cpu
o sched_move_numa.src_cpu,dst_cpu
o ipi_send_cpu.cpu
o ...
Filtering these currently requires using arithmetic comparison functions,
which can be tedious when dealing with interleaved SMT or NUMA CPU ids.
Allow these to be filtered by a user-provided cpumask, which enables e.g.:
$ trace-cmd record -e 'sched_wakeup' -f 'target_cpu & CPUS{2,4,6,8-32}'
Signed-off-by: Valentin Schneider <[email protected]>
---
NOTE: I went with an implicit cpumask conversion of the event field, as
AFAICT predicate_parse() does not support parsing the application of a
function to a field (e.g. 'CPUS(target_cpu) & CPUS{2,4,6,8-32}')
---
kernel/trace/trace_events_filter.c | 92 ++++++++++++++++++++++++++----
1 file changed, 81 insertions(+), 11 deletions(-)
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index cb1863dfa280b..1e14f801685a8 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -46,15 +46,19 @@ static const char * ops[] = { OPS };
enum filter_pred_fn {
FILTER_PRED_FN_NOP,
FILTER_PRED_FN_64,
+ FILTER_PRED_FN_64_CPUMASK,
FILTER_PRED_FN_S64,
FILTER_PRED_FN_U64,
FILTER_PRED_FN_32,
+ FILTER_PRED_FN_32_CPUMASK,
FILTER_PRED_FN_S32,
FILTER_PRED_FN_U32,
FILTER_PRED_FN_16,
+ FILTER_PRED_FN_16_CPUMASK,
FILTER_PRED_FN_S16,
FILTER_PRED_FN_U16,
FILTER_PRED_FN_8,
+ FILTER_PRED_FN_8_CPUMASK,
FILTER_PRED_FN_S8,
FILTER_PRED_FN_U8,
FILTER_PRED_FN_COMM,
@@ -643,6 +647,39 @@ predicate_parse(const char *str, int nr_parens, int nr_preds,
return ERR_PTR(ret);
}
+static inline int
+do_filter_cpumask(int op, const struct cpumask *mask, const struct cpumask *cmp)
+{
+ switch (op) {
+ case OP_EQ:
+ return cpumask_equal(mask, cmp);
+ case OP_NE:
+ return !cpumask_equal(mask, cmp);
+ case OP_BAND:
+ return cpumask_intersects(mask, cmp);
+ default:
+ return 0;
+ }
+}
+
+/* Optimisation of do_filter_cpumask() for scalar fields */
+static inline int
+do_filter_scalar_cpumask(int op, unsigned int cpu, const struct cpumask *mask)
+{
+ switch (op) {
+ case OP_EQ:
+ return cpumask_test_cpu(cpu, mask) &&
+ cpumask_nth(1, mask) >= nr_cpu_ids;
+ case OP_NE:
+ return !cpumask_test_cpu(cpu, mask) ||
+ cpumask_nth(1, mask) < nr_cpu_ids;
+ case OP_BAND:
+ return cpumask_test_cpu(cpu, mask);
+ default:
+ return 0;
+ }
+}
+
enum pred_cmp_types {
PRED_CMP_TYPE_NOP,
PRED_CMP_TYPE_LT,
@@ -686,6 +723,18 @@ static int filter_pred_##type(struct filter_pred *pred, void *event) \
} \
}
+#define DEFINE_CPUMASK_COMPARISON_PRED(size) \
+static int filter_pred_##size##_cpumask(struct filter_pred *pred, void *event) \
+{ \
+ u##size *addr = (u##size *)(event + pred->offset); \
+ unsigned int cpu = *addr; \
+ \
+ if (cpu >= nr_cpu_ids) \
+ return 0; \
+ \
+ return do_filter_scalar_cpumask(pred->op, cpu, pred->mask); \
+}
+
#define DEFINE_EQUALITY_PRED(size) \
static int filter_pred_##size(struct filter_pred *pred, void *event) \
{ \
@@ -707,6 +756,11 @@ DEFINE_COMPARISON_PRED(u16);
DEFINE_COMPARISON_PRED(s8);
DEFINE_COMPARISON_PRED(u8);
+DEFINE_CPUMASK_COMPARISON_PRED(64);
+DEFINE_CPUMASK_COMPARISON_PRED(32);
+DEFINE_CPUMASK_COMPARISON_PRED(16);
+DEFINE_CPUMASK_COMPARISON_PRED(8);
+
DEFINE_EQUALITY_PRED(64);
DEFINE_EQUALITY_PRED(32);
DEFINE_EQUALITY_PRED(16);
@@ -891,16 +945,7 @@ static int filter_pred_cpumask(struct filter_pred *pred, void *event)
const struct cpumask *mask = (event + loc);
const struct cpumask *cmp = pred->mask;
- switch (pred->op) {
- case OP_EQ:
- return cpumask_equal(mask, cmp);
- case OP_NE:
- return !cpumask_equal(mask, cmp);
- case OP_BAND:
- return cpumask_intersects(mask, cmp);
- default:
- return 0;
- }
+ return do_filter_cpumask(pred->op, mask, cmp);
}
/* Filter predicate for COMM. */
@@ -1351,24 +1396,32 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
switch (pred->fn_num) {
case FILTER_PRED_FN_64:
return filter_pred_64(pred, event);
+ case FILTER_PRED_FN_64_CPUMASK:
+ return filter_pred_64_cpumask(pred, event);
case FILTER_PRED_FN_S64:
return filter_pred_s64(pred, event);
case FILTER_PRED_FN_U64:
return filter_pred_u64(pred, event);
case FILTER_PRED_FN_32:
return filter_pred_32(pred, event);
+ case FILTER_PRED_FN_32_CPUMASK:
+ return filter_pred_32_cpumask(pred, event);
case FILTER_PRED_FN_S32:
return filter_pred_s32(pred, event);
case FILTER_PRED_FN_U32:
return filter_pred_u32(pred, event);
case FILTER_PRED_FN_16:
return filter_pred_16(pred, event);
+ case FILTER_PRED_FN_16_CPUMASK:
+ return filter_pred_16_cpumask(pred, event);
case FILTER_PRED_FN_S16:
return filter_pred_s16(pred, event);
case FILTER_PRED_FN_U16:
return filter_pred_u16(pred, event);
case FILTER_PRED_FN_8:
return filter_pred_8(pred, event);
+ case FILTER_PRED_FN_8_CPUMASK:
+ return filter_pred_8_cpumask(pred, event);
case FILTER_PRED_FN_S8:
return filter_pred_s8(pred, event);
case FILTER_PRED_FN_U8:
@@ -1606,6 +1659,7 @@ static int parse_pred(const char *str, void *data,
switch (field->filter_type) {
case FILTER_CPUMASK:
+ case FILTER_OTHER:
break;
default:
parse_error(pe, FILT_ERR_ILLEGAL_FIELD_OP, pos + i);
@@ -1658,8 +1712,24 @@ static int parse_pred(const char *str, void *data,
/* Move along */
i++;
- if (field->filter_type == FILTER_CPUMASK)
+ if (field->filter_type == FILTER_CPUMASK) {
pred->fn_num = FILTER_PRED_FN_CPUMASK;
+ } else {
+ switch (field->size) {
+ case 8:
+ pred->fn_num = FILTER_PRED_FN_64_CPUMASK;
+ break;
+ case 4:
+ pred->fn_num = FILTER_PRED_FN_32_CPUMASK;
+ break;
+ case 2:
+ pred->fn_num = FILTER_PRED_FN_16_CPUMASK;
+ break;
+ case 1:
+ pred->fn_num = FILTER_PRED_FN_8_CPUMASK;
+ break;
+ }
+ }
/* This is either a string, or an integer */
} else if (str[i] == '\'' || str[i] == '"') {
--
2.31.1
The recently introduced ipi_send_cpumask trace event contains a cpumask
field, but it currently cannot be used in filter expressions.
Make event filtering aware of cpumask fields, and allow these to be
filtered by a user-provided cpumask.
The user-provided cpumask is to be given in cpulist format and wrapped as:
"CPUS{$cpulist}". The use of curly braces instead of parentheses is to
prevent predicate_parse() from parsing the contents of CPUS{...} as a
full-fledged predicate subexpression.
This enables e.g.:
$ trace-cmd record -e 'ipi_send_cpumask' -f 'cpumask & CPUS{2,4,6,8-32}'
Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/trace_events.h | 1 +
kernel/trace/trace_events_filter.c | 97 +++++++++++++++++++++++++++++-
2 files changed, 96 insertions(+), 2 deletions(-)
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 3930e676436c9..302be73918336 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -807,6 +807,7 @@ enum {
FILTER_RDYN_STRING,
FILTER_PTR_STRING,
FILTER_TRACE_FN,
+ FILTER_CPUMASK,
FILTER_COMM,
FILTER_CPU,
FILTER_STACKTRACE,
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 91fc9990107f1..cb1863dfa280b 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -64,6 +64,7 @@ enum filter_pred_fn {
FILTER_PRED_FN_PCHAR_USER,
FILTER_PRED_FN_PCHAR,
FILTER_PRED_FN_CPU,
+ FILTER_PRED_FN_CPUMASK,
FILTER_PRED_FN_FUNCTION,
FILTER_PRED_FN_,
FILTER_PRED_TEST_VISITED,
@@ -71,6 +72,7 @@ enum filter_pred_fn {
struct filter_pred {
struct regex *regex;
+ struct cpumask *mask;
unsigned short *ops;
struct ftrace_event_field *field;
u64 val;
@@ -94,6 +96,8 @@ struct filter_pred {
C(TOO_MANY_OPEN, "Too many '('"), \
C(TOO_MANY_CLOSE, "Too few '('"), \
C(MISSING_QUOTE, "Missing matching quote"), \
+ C(MISSING_BRACE_OPEN, "Missing '{'"), \
+ C(MISSING_BRACE_CLOSE, "Missing '}'"), \
C(OPERAND_TOO_LONG, "Operand too long"), \
C(EXPECT_STRING, "Expecting string field"), \
C(EXPECT_DIGIT, "Expecting numeric field"), \
@@ -103,6 +107,7 @@ struct filter_pred {
C(BAD_SUBSYS_FILTER, "Couldn't find or set field in one of a subsystem's events"), \
C(TOO_MANY_PREDS, "Too many terms in predicate expression"), \
C(INVALID_FILTER, "Meaningless filter expression"), \
+ C(INVALID_CPULIST, "Invalid cpulist"), \
C(IP_FIELD_ONLY, "Only 'ip' field is supported for function trace"), \
C(INVALID_VALUE, "Invalid value (did you forget quotes)?"), \
C(NO_FUNCTION, "Function not found"), \
@@ -190,6 +195,7 @@ static void free_predicate(struct filter_pred *pred)
{
if (pred) {
kfree(pred->regex);
+ kfree(pred->mask);
kfree(pred);
}
}
@@ -877,6 +883,26 @@ static int filter_pred_cpu(struct filter_pred *pred, void *event)
}
}
+/* Filter predicate for cpumask field vs user-provided cpumask */
+static int filter_pred_cpumask(struct filter_pred *pred, void *event)
+{
+ u32 item = *(u32 *)(event + pred->offset);
+ int loc = item & 0xffff;
+ const struct cpumask *mask = (event + loc);
+ const struct cpumask *cmp = pred->mask;
+
+ switch (pred->op) {
+ case OP_EQ:
+ return cpumask_equal(mask, cmp);
+ case OP_NE:
+ return !cpumask_equal(mask, cmp);
+ case OP_BAND:
+ return cpumask_intersects(mask, cmp);
+ default:
+ return 0;
+ }
+}
+
/* Filter predicate for COMM. */
static int filter_pred_comm(struct filter_pred *pred, void *event)
{
@@ -1244,8 +1270,12 @@ static void filter_free_subsystem_filters(struct trace_subsystem_dir *dir,
int filter_assign_type(const char *type)
{
- if (strstr(type, "__data_loc") && strstr(type, "char"))
- return FILTER_DYN_STRING;
+ if (strstr(type, "__data_loc")) {
+ if (strstr(type, "char"))
+ return FILTER_DYN_STRING;
+ if (strstr(type, "cpumask_t"))
+ return FILTER_CPUMASK;
+ }
if (strstr(type, "__rel_loc") && strstr(type, "char"))
return FILTER_RDYN_STRING;
@@ -1357,6 +1387,8 @@ static int filter_pred_fn_call(struct filter_pred *pred, void *event)
return filter_pred_pchar(pred, event);
case FILTER_PRED_FN_CPU:
return filter_pred_cpu(pred, event);
+ case FILTER_PRED_FN_CPUMASK:
+ return filter_pred_cpumask(pred, event);
case FILTER_PRED_FN_FUNCTION:
return filter_pred_function(pred, event);
case FILTER_PRED_TEST_VISITED:
@@ -1568,6 +1600,67 @@ static int parse_pred(const char *str, void *data,
strncpy(pred->regex->pattern, str + s, len);
pred->regex->pattern[len] = 0;
+ } else if (!strncmp(str + i, "CPUS", 4)) {
+ unsigned int maskstart;
+ char *tmp;
+
+ switch (field->filter_type) {
+ case FILTER_CPUMASK:
+ break;
+ default:
+ parse_error(pe, FILT_ERR_ILLEGAL_FIELD_OP, pos + i);
+ goto err_free;
+ }
+
+ switch (op) {
+ case OP_EQ:
+ case OP_NE:
+ case OP_BAND:
+ break;
+ default:
+ parse_error(pe, FILT_ERR_ILLEGAL_FIELD_OP, pos + i);
+ goto err_free;
+ }
+
+ /* Skip CPUS */
+ i += 4;
+ if (str[i++] != '{') {
+ parse_error(pe, FILT_ERR_MISSING_BRACE_OPEN, pos + i);
+ goto err_free;
+ }
+ maskstart = i;
+
+ /* Walk the cpulist until closing } */
+ for (; str[i] && str[i] != '}'; i++);
+ if (str[i] != '}') {
+ parse_error(pe, FILT_ERR_MISSING_BRACE_CLOSE, pos + i);
+ goto err_free;
+ }
+
+ if (maskstart == i) {
+ parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
+ goto err_free;
+ }
+
+ /* Copy the cpulist between { and } */
+ tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
+ strscpy(tmp, str + maskstart, (i - maskstart) + 1);
+
+ pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
+ if (!pred->mask)
+ goto err_mem;
+
+ /* Now parse it */
+ if (cpulist_parse(tmp, pred->mask)) {
+ parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
+ goto err_free;
+ }
+
+ /* Move along */
+ i++;
+ if (field->filter_type == FILTER_CPUMASK)
+ pred->fn_num = FILTER_PRED_FN_CPUMASK;
+
/* This is either a string, or an integer */
} else if (str[i] == '\'' || str[i] == '"') {
char q = str[i];
--
2.31.1
From: Peter Zijlstra <[email protected]>
When a static_key is marked ro_after_init, its state will never change
(after init), therefore jump_label_update() will never need to iterate
the entries, and thus module load won't actually need to track this --
avoiding the static_key::next write.
Therefore, mark these keys such that jump_label_add_module() might
recognise them and avoid the modification.
Use the special state: 'static_key_linked(key) && !static_key_mod(key)'
to denote such keys.
Link: http://lore.kernel.org/r/[email protected]
NOT-Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
@Peter: I've barely touched this patch, it's just been writing a comment
and fixing benign compilation issues, so credit's all yours really!
---
include/asm-generic/sections.h | 5 ++++
include/linux/jump_label.h | 1 +
init/main.c | 1 +
kernel/jump_label.c | 49 ++++++++++++++++++++++++++++++++++
4 files changed, 56 insertions(+)
diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
index db13bb620f527..c768de6f19a9a 100644
--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -180,6 +180,11 @@ static inline bool is_kernel_rodata(unsigned long addr)
addr < (unsigned long)__end_rodata;
}
+static inline bool is_kernel_ro_after_init(unsigned long addr)
+{
+ return addr >= (unsigned long)__start_ro_after_init &&
+ addr < (unsigned long)__end_ro_after_init;
+}
/**
* is_kernel_inittext - checks if the pointer address is located in the
* .init.text section
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index f0a949b7c9733..88ef9e776af8d 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -216,6 +216,7 @@ extern struct jump_entry __start___jump_table[];
extern struct jump_entry __stop___jump_table[];
extern void jump_label_init(void);
+extern void jump_label_ro(void);
extern void jump_label_lock(void);
extern void jump_label_unlock(void);
extern void arch_jump_label_transform(struct jump_entry *entry,
diff --git a/init/main.c b/init/main.c
index ad920fac325c3..cb5304ca18f4d 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1403,6 +1403,7 @@ static void mark_readonly(void)
* insecure pages which are W+X.
*/
rcu_barrier();
+ jump_label_ro();
mark_rodata_ro();
rodata_test();
} else
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index d9c822bbffb8d..661ef74dee9b7 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -530,6 +530,45 @@ void __init jump_label_init(void)
cpus_read_unlock();
}
+static inline bool static_key_sealed(struct static_key *key)
+{
+ return (key->type & JUMP_TYPE_LINKED) && !(key->type & ~JUMP_TYPE_MASK);
+}
+
+static inline void static_key_seal(struct static_key *key)
+{
+ unsigned long type = key->type & JUMP_TYPE_TRUE;
+ key->type = JUMP_TYPE_LINKED | type;
+}
+
+void jump_label_ro(void)
+{
+ struct jump_entry *iter_start = __start___jump_table;
+ struct jump_entry *iter_stop = __stop___jump_table;
+ struct jump_entry *iter;
+
+ if (WARN_ON_ONCE(!static_key_initialized))
+ return;
+
+ cpus_read_lock();
+ jump_label_lock();
+
+ for (iter = iter_start; iter < iter_stop; iter++) {
+ struct static_key *iterk = jump_entry_key(iter);
+
+ if (!is_kernel_ro_after_init((unsigned long)iterk))
+ continue;
+
+ if (static_key_sealed(iterk))
+ continue;
+
+ static_key_seal(iterk);
+ }
+
+ jump_label_unlock();
+ cpus_read_unlock();
+}
+
#ifdef CONFIG_MODULES
enum jump_label_type jump_label_init_type(struct jump_entry *entry)
@@ -650,6 +689,15 @@ static int jump_label_add_module(struct module *mod)
static_key_set_entries(key, iter);
continue;
}
+
+ /*
+ * If the key was sealed at init, then there's no need to keep
+ * a reference to its module entries - just patch them now and
+ * be done with it.
+ */
+ if (static_key_sealed(key))
+ goto do_poke;
+
jlm = kzalloc(sizeof(struct static_key_mod), GFP_KERNEL);
if (!jlm)
return -ENOMEM;
@@ -675,6 +723,7 @@ static int jump_label_add_module(struct module *mod)
static_key_set_linked(key);
/* Only update if we've changed from our initial state */
+do_poke:
if (jump_label_type(iter) != jump_label_init_type(iter))
__jump_label_update(key, iter, iter_stop, true);
}
--
2.31.1
Kernel TLB invalidation IPIs are a common source of interference on
NOHZ_FULL CPUs. Given NOHZ_FULL CPUs executing in userspace are not
accessing any kernel addresses, these invalidations do not need to happen
immediately, and can be deferred until the next user->kernel transition.
Rather than make __flush_tlb_all() noinstr, add a minimal noinstr
variant that doesn't try to leverage INVPCID.
FIXME: not fully noinstr compliant
XXX: same issue as with ins patching, when do we access data that should be
invalidated?
Signed-off-by: Valentin Schneider <[email protected]>
---
arch/x86/include/asm/context_tracking_work.h | 4 ++++
arch/x86/include/asm/tlbflush.h | 1 +
arch/x86/mm/tlb.c | 17 +++++++++++++++++
include/linux/context_tracking_state.h | 4 ++++
include/linux/context_tracking_work.h | 2 ++
5 files changed, 28 insertions(+)
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 2c66687ce00e2..9d4f021b5a45b 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -3,6 +3,7 @@
#define _ASM_X86_CONTEXT_TRACKING_WORK_H
#include <asm/sync_core.h>
+#include <asm/tlbflush.h>
static __always_inline void arch_context_tracking_work(int work)
{
@@ -10,6 +11,9 @@ static __always_inline void arch_context_tracking_work(int work)
case CONTEXT_WORK_SYNC:
sync_core();
break;
+ case CONTEXT_WORK_TLBI:
+ __flush_tlb_all_noinstr();
+ break;
}
}
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 80450e1d5385a..323b971987af7 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -17,6 +17,7 @@
DECLARE_PER_CPU(u64, tlbstate_untag_mask);
void __flush_tlb_all(void);
+void noinstr __flush_tlb_all_noinstr(void);
#define TLB_FLUSH_ALL -1UL
#define TLB_GENERATION_INVALID 0
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 267acf27480af..631df9189ded4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1237,6 +1237,23 @@ void __flush_tlb_all(void)
}
EXPORT_SYMBOL_GPL(__flush_tlb_all);
+void noinstr __flush_tlb_all_noinstr(void)
+{
+ /*
+ * This is for invocation in early entry code that cannot be
+ * instrumented. A RMW to CR4 works for most cases, but relies on
+ * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
+ * would require also resetting CR3.PCID, so just try with CR4.PGE, else
+ * do the CR3 write.
+ *
+ * TODO: paravirt
+ */
+ if (cpu_feature_enabled(X86_FEATURE_PGE))
+ __native_tlb_flush_global(this_cpu_read(cpu_tlbstate.cr4));
+ else
+ flush_tlb_local();
+}
+
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
struct flush_tlb_info *info;
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 292a0b7c06948..3571c62cbb9cd 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -62,6 +62,10 @@ enum ctx_state {
#define RCU_DYNTICKS_END (CT_STATE_SIZE - 1)
#define RCU_DYNTICKS_IDX BIT(RCU_DYNTICKS_START)
+/*
+ * When CONFIG_CONTEXT_TRACKING_WORK=n, _END is 1 behind _START, which makes
+ * the CONTEXT_WORK size computation below 0, which is what we want!
+ */
#define CONTEXT_WORK_START (CONTEXT_STATE_END + 1)
#define CONTEXT_WORK_END (RCU_DYNTICKS_START - 1)
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index 13fc97b395030..47d5ced39a43a 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -6,11 +6,13 @@
enum {
CONTEXT_WORK_SYNC_OFFSET,
+ CONTEXT_WORK_TLBI_OFFSET,
CONTEXT_WORK_MAX_OFFSET
};
enum ct_work {
CONTEXT_WORK_SYNC = BIT(CONTEXT_WORK_SYNC_OFFSET),
+ CONTEXT_WORK_TLBI = BIT(CONTEXT_WORK_TLBI_OFFSET),
CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET)
};
--
2.31.1
We now have an RCU_EXPERT knob for configuring the size of the dynticks
counter: CONFIG_RCU_DYNTICKS_BITS.
Add a torture config for a ridiculously small counter (2 bits). This is a
copy of TREE04 with the added counter size restriction.
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
.../selftests/rcutorture/configs/rcu/TREE11 | 19 +++++++++++++++++++
.../rcutorture/configs/rcu/TREE11.boot | 1 +
2 files changed, 20 insertions(+)
create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
new file mode 100644
index 0000000000000..aa7274efd9819
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
@@ -0,0 +1,19 @@
+CONFIG_SMP=y
+CONFIG_NR_CPUS=8
+CONFIG_PREEMPT_NONE=n
+CONFIG_PREEMPT_VOLUNTARY=y
+CONFIG_PREEMPT=n
+CONFIG_PREEMPT_DYNAMIC=n
+#CHECK#CONFIG_TREE_RCU=y
+CONFIG_HZ_PERIODIC=n
+CONFIG_NO_HZ_IDLE=n
+CONFIG_NO_HZ_FULL=y
+CONFIG_RCU_TRACE=y
+CONFIG_RCU_FANOUT=4
+CONFIG_RCU_FANOUT_LEAF=3
+CONFIG_DEBUG_LOCK_ALLOC=n
+CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
+CONFIG_RCU_EXPERT=y
+CONFIG_RCU_EQS_DEBUG=y
+CONFIG_RCU_LAZY=y
+CONFIG_RCU_DYNTICKS_BITS=2
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
new file mode 100644
index 0000000000000..a8d94caf7d2fd
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
@@ -0,0 +1 @@
+rcutree.rcu_fanout_leaf=4 nohz_full=1-N
--
2.31.1
CONTEXT_TRACKING_WORK reduces the size of the dynticks counter to free up
some bits for work deferral. Paul suggested making the actual counter size
configurable for rcutorture to poke at, so do that.
Make it only configurable under RCU_EXPERT. Previous commits have added
build-time checks that ensure a kernel with problematic dynticks counter
width can't be built.
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/context_tracking.h | 3 ++-
include/linux/context_tracking_state.h | 3 +--
kernel/rcu/Kconfig | 33 ++++++++++++++++++++++++++
3 files changed, 36 insertions(+), 3 deletions(-)
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 8aee086d0a25f..9c0c622bc27bb 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -12,7 +12,8 @@
#ifdef CONFIG_CONTEXT_TRACKING_WORK
static_assert(CONTEXT_WORK_MAX_OFFSET <= CONTEXT_WORK_END + 1 - CONTEXT_WORK_START,
- "Not enough bits for CONTEXT_WORK");
+ "Not enough bits for CONTEXT_WORK, "
+ "CONFIG_RCU_DYNTICKS_BITS might be too high");
#endif
#ifdef CONFIG_CONTEXT_TRACKING_USER
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 828fcdb801f73..292a0b7c06948 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -58,8 +58,7 @@ enum ctx_state {
#define CONTEXT_STATE_START 0
#define CONTEXT_STATE_END (bits_per(CONTEXT_MAX - 1) - 1)
-#define RCU_DYNTICKS_BITS (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? 16 : 31)
-#define RCU_DYNTICKS_START (CT_STATE_SIZE - RCU_DYNTICKS_BITS)
+#define RCU_DYNTICKS_START (CT_STATE_SIZE - CONFIG_RCU_DYNTICKS_BITS)
#define RCU_DYNTICKS_END (CT_STATE_SIZE - 1)
#define RCU_DYNTICKS_IDX BIT(RCU_DYNTICKS_START)
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index bdd7eadb33d8f..1ff2aab24e964 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME
Say Y here if you need tighter callback-limit enforcement.
Say N here if you are unsure.
+config RCU_DYNTICKS_RANGE_BEGIN
+ int
+ depends on !RCU_EXPERT
+ default 31 if !CONTEXT_TRACKING_WORK
+ default 16 if CONTEXT_TRACKING_WORK
+
+config RCU_DYNTICKS_RANGE_BEGIN
+ int
+ depends on RCU_EXPERT
+ default 2
+
+config RCU_DYNTICKS_RANGE_END
+ int
+ default 31 if !CONTEXT_TRACKING_WORK
+ default 16 if CONTEXT_TRACKING_WORK
+
+config RCU_DYNTICKS_BITS_DEFAULT
+ int
+ default 31 if !CONTEXT_TRACKING_WORK
+ default 16 if CONTEXT_TRACKING_WORK
+
+config RCU_DYNTICKS_BITS
+ int "Dynticks counter width" if CONTEXT_TRACKING_WORK
+ range RCU_DYNTICKS_RANGE_BEGIN RCU_DYNTICKS_RANGE_END
+ default RCU_DYNTICKS_BITS_DEFAULT
+ help
+ This option controls the width of the dynticks counter.
+
+ Lower values will make overflows more frequent, which will increase
+ the likelihood of extending grace-periods.
+
+ Don't touch this unless you are running some tests.
+
endmenu # "RCU Subsystem"
--
2.31.1
On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
> We now have an RCU_EXPERT knob for configuring the size of the dynticks
> counter: CONFIG_RCU_DYNTICKS_BITS.
>
> Add a torture config for a ridiculously small counter (2 bits). This is a
> copy of TREE04 with the added counter size restriction.
>
> Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
> Suggested-by: Paul E. McKenney <[email protected]>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> .../selftests/rcutorture/configs/rcu/TREE11 | 19 +++++++++++++++++++
> .../rcutorture/configs/rcu/TREE11.boot | 1 +
> 2 files changed, 20 insertions(+)
> create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
> create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
>
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> new file mode 100644
> index 0000000000000..aa7274efd9819
> --- /dev/null
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> @@ -0,0 +1,19 @@
> +CONFIG_SMP=y
> +CONFIG_NR_CPUS=8
> +CONFIG_PREEMPT_NONE=n
> +CONFIG_PREEMPT_VOLUNTARY=y
> +CONFIG_PREEMPT=n
> +CONFIG_PREEMPT_DYNAMIC=n
> +#CHECK#CONFIG_TREE_RCU=y
> +CONFIG_HZ_PERIODIC=n
> +CONFIG_NO_HZ_IDLE=n
> +CONFIG_NO_HZ_FULL=y
> +CONFIG_RCU_TRACE=y
> +CONFIG_RCU_FANOUT=4
> +CONFIG_RCU_FANOUT_LEAF=3
> +CONFIG_DEBUG_LOCK_ALLOC=n
> +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> +CONFIG_RCU_EXPERT=y
> +CONFIG_RCU_EQS_DEBUG=y
> +CONFIG_RCU_LAZY=y
> +CONFIG_RCU_DYNTICKS_BITS=2
Why not just add this last line to the existing TREE04 scenario?
That would ensure that it gets tested regularly without extending the
time required to run a full set of rcutorture tests.
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
> new file mode 100644
> index 0000000000000..a8d94caf7d2fd
> --- /dev/null
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
> @@ -0,0 +1 @@
> +rcutree.rcu_fanout_leaf=4 nohz_full=1-N
> --
> 2.31.1
>
On Thu, Jul 20, 2023 at 12:53:05PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
> > We now have an RCU_EXPERT knob for configuring the size of the dynticks
> > counter: CONFIG_RCU_DYNTICKS_BITS.
> >
> > Add a torture config for a ridiculously small counter (2 bits). This is a
> > copy of TREE04 with the added counter size restriction.
> >
> > Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
> > Suggested-by: Paul E. McKenney <[email protected]>
> > Signed-off-by: Valentin Schneider <[email protected]>
> > ---
> > .../selftests/rcutorture/configs/rcu/TREE11 | 19 +++++++++++++++++++
> > .../rcutorture/configs/rcu/TREE11.boot | 1 +
> > 2 files changed, 20 insertions(+)
> > create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
> > create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
> >
> > diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> > new file mode 100644
> > index 0000000000000..aa7274efd9819
> > --- /dev/null
> > +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> > @@ -0,0 +1,19 @@
> > +CONFIG_SMP=y
> > +CONFIG_NR_CPUS=8
> > +CONFIG_PREEMPT_NONE=n
> > +CONFIG_PREEMPT_VOLUNTARY=y
> > +CONFIG_PREEMPT=n
> > +CONFIG_PREEMPT_DYNAMIC=n
> > +#CHECK#CONFIG_TREE_RCU=y
> > +CONFIG_HZ_PERIODIC=n
> > +CONFIG_NO_HZ_IDLE=n
> > +CONFIG_NO_HZ_FULL=y
> > +CONFIG_RCU_TRACE=y
> > +CONFIG_RCU_FANOUT=4
> > +CONFIG_RCU_FANOUT_LEAF=3
> > +CONFIG_DEBUG_LOCK_ALLOC=n
> > +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> > +CONFIG_RCU_EXPERT=y
> > +CONFIG_RCU_EQS_DEBUG=y
> > +CONFIG_RCU_LAZY=y
> > +CONFIG_RCU_DYNTICKS_BITS=2
>
> Why not just add this last line to the existing TREE04 scenario?
> That would ensure that it gets tested regularly without extending the
> time required to run a full set of rcutorture tests.
Please see below for the version of this patch that I am running overnight
tests with. Does this one work for you?
Thanx, Paul
------------------------------------------------------------------------
commit 1aa13731e665193cd833edac5ebc86a9c3fea2b7
Author: Valentin Schneider <[email protected]>
Date: Thu Jul 20 20:58:41 2023 -0700
rcutorture: Add a test config to torture test low RCU_DYNTICKS width
We now have an RCU_EXPERT knob for configuring the size of the dynticks
counter: CONFIG_RCU_DYNTICKS_BITS.
Modify scenario TREE04 to exercise a ridiculously small counter
(2 bits).
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04 b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
index dc4985064b3a..aa7274efd981 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04
@@ -16,3 +16,4 @@ CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_RCU_EXPERT=y
CONFIG_RCU_EQS_DEBUG=y
CONFIG_RCU_LAZY=y
+CONFIG_RCU_DYNTICKS_BITS=2
On 20/07/23 21:00, Paul E. McKenney wrote:
> On Thu, Jul 20, 2023 at 12:53:05PM -0700, Paul E. McKenney wrote:
>> On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
>> > diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
>> > new file mode 100644
>> > index 0000000000000..aa7274efd9819
>> > --- /dev/null
>> > +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
>> > @@ -0,0 +1,19 @@
>> > +CONFIG_SMP=y
>> > +CONFIG_NR_CPUS=8
>> > +CONFIG_PREEMPT_NONE=n
>> > +CONFIG_PREEMPT_VOLUNTARY=y
>> > +CONFIG_PREEMPT=n
>> > +CONFIG_PREEMPT_DYNAMIC=n
>> > +#CHECK#CONFIG_TREE_RCU=y
>> > +CONFIG_HZ_PERIODIC=n
>> > +CONFIG_NO_HZ_IDLE=n
>> > +CONFIG_NO_HZ_FULL=y
>> > +CONFIG_RCU_TRACE=y
>> > +CONFIG_RCU_FANOUT=4
>> > +CONFIG_RCU_FANOUT_LEAF=3
>> > +CONFIG_DEBUG_LOCK_ALLOC=n
>> > +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
>> > +CONFIG_RCU_EXPERT=y
>> > +CONFIG_RCU_EQS_DEBUG=y
>> > +CONFIG_RCU_LAZY=y
>> > +CONFIG_RCU_DYNTICKS_BITS=2
>>
>> Why not just add this last line to the existing TREE04 scenario?
>> That would ensure that it gets tested regularly without extending the
>> time required to run a full set of rcutorture tests.
>
> Please see below for the version of this patch that I am running overnight
> tests with. Does this one work for you?
>
Yep that's fine with me. I only went with a separate test file as I wasn't
sure how new test options should be handled (merged into existing tests vs
new tests created), and didn't want to negatively impact TREE04 or
TREE06. If merging into TREE04 is preferred, then I'll do just that and
carry this patch moving forwards.
Thanks!
On 20/07/23 17:30, Valentin Schneider wrote:
> index bdd7eadb33d8f..1ff2aab24e964 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME
> Say Y here if you need tighter callback-limit enforcement.
> Say N here if you are unsure.
>
> +config RCU_DYNTICKS_RANGE_BEGIN
> + int
> + depends on !RCU_EXPERT
> + default 31 if !CONTEXT_TRACKING_WORK
You'll note that this should be 30 really, because the lower *2* bits are
taken by the context state (CONTEXT_GUEST has a value of 3).
This highlights the fragile part of this: the Kconfig values are hardcoded,
but they depend on CT_STATE_SIZE, CONTEXT_MASK and CONTEXT_WORK_MAX. The
static_assert() will at least capture any misconfiguration, but having that
enforced by the actual Kconfig ranges would be less awkward.
Do we currently have a way of e.g. making a Kconfig file depend on and use
values generated by a C header?
On Fri, Jul 21, 2023 at 09:17:53AM +0100, Valentin Schneider wrote:
> On 20/07/23 17:30, Valentin Schneider wrote:
> > index bdd7eadb33d8f..1ff2aab24e964 100644
> > --- a/kernel/rcu/Kconfig
> > +++ b/kernel/rcu/Kconfig
> > @@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME
> > Say Y here if you need tighter callback-limit enforcement.
> > Say N here if you are unsure.
> >
> > +config RCU_DYNTICKS_RANGE_BEGIN
> > + int
> > + depends on !RCU_EXPERT
> > + default 31 if !CONTEXT_TRACKING_WORK
>
> You'll note that this should be 30 really, because the lower *2* bits are
> taken by the context state (CONTEXT_GUEST has a value of 3).
>
> This highlights the fragile part of this: the Kconfig values are hardcoded,
> but they depend on CT_STATE_SIZE, CONTEXT_MASK and CONTEXT_WORK_MAX. The
> static_assert() will at least capture any misconfiguration, but having that
> enforced by the actual Kconfig ranges would be less awkward.
>
> Do we currently have a way of e.g. making a Kconfig file depend on and use
> values generated by a C header?
Why not just have something like a boolean RCU_DYNTICKS_TORTURE Kconfig
option and let the C code work out what the number of bits should be?
I suppose that there might be a failure whose frequency depended on
the number of bits, which might be an argument for keeping something
like RCU_DYNTICKS_RANGE_BEGIN for fault isolation. But still using
RCU_DYNTICKS_TORTURE for normal testing.
Thoughts?
Thanx, Paul
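FWIW, a sketch of that suggestion - CONFIG_RCU_DYNTICKS_TORTURE is
hypothetical here, and 31/16 are the defaults from the patch (which, per
Valentin's note above, arguably ought to be 30/16):

  config RCU_DYNTICKS_TORTURE
          bool "Torture-test the minimum dynticks counter width"
          depends on RCU_EXPERT

  /* C side works out the width; no integers plumbed through Kconfig */
  #define RCU_DYNTICKS_BITS \
          (IS_ENABLED(CONFIG_RCU_DYNTICKS_TORTURE) ? 2 : \
           (IS_ENABLED(CONFIG_CONTEXT_TRACKING_WORK) ? 16 : 31))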
On Fri, Jul 21, 2023 at 08:58:53AM +0100, Valentin Schneider wrote:
> On 20/07/23 21:00, Paul E. McKenney wrote:
> > On Thu, Jul 20, 2023 at 12:53:05PM -0700, Paul E. McKenney wrote:
> >> On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
> >> > diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> >> > new file mode 100644
> >> > index 0000000000000..aa7274efd9819
> >> > --- /dev/null
> >> > +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
> >> > @@ -0,0 +1,19 @@
> >> > +CONFIG_SMP=y
> >> > +CONFIG_NR_CPUS=8
> >> > +CONFIG_PREEMPT_NONE=n
> >> > +CONFIG_PREEMPT_VOLUNTARY=y
> >> > +CONFIG_PREEMPT=n
> >> > +CONFIG_PREEMPT_DYNAMIC=n
> >> > +#CHECK#CONFIG_TREE_RCU=y
> >> > +CONFIG_HZ_PERIODIC=n
> >> > +CONFIG_NO_HZ_IDLE=n
> >> > +CONFIG_NO_HZ_FULL=y
> >> > +CONFIG_RCU_TRACE=y
> >> > +CONFIG_RCU_FANOUT=4
> >> > +CONFIG_RCU_FANOUT_LEAF=3
> >> > +CONFIG_DEBUG_LOCK_ALLOC=n
> >> > +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> >> > +CONFIG_RCU_EXPERT=y
> >> > +CONFIG_RCU_EQS_DEBUG=y
> >> > +CONFIG_RCU_LAZY=y
> >> > +CONFIG_RCU_DYNTICKS_BITS=2
> >>
> >> Why not just add this last line to the existing TREE04 scenario?
> >> That would ensure that it gets tested regularly without extending the
> >> time required to run a full set of rcutorture tests.
> >
> > Please see below for the version of this patch that I am running overnight
> > tests with. Does this one work for you?
>
> Yep that's fine with me. I only went with a separate test file as I wasn't
> sure how new test options should be handled (merged into existing tests vs
> new tests created), and didn't want to negatively impact TREE04 or
> TREE06. If merging into TREE04 is preferred, then I'll do just that and
> carry this patch moving forwards.
Things worked fine for this one-hour-per-scenario test run on my laptop,
except for the CONFIG_SMP=n runs, which all got build errors like the
following.
Thanx, Paul
------------------------------------------------------------------------
In file included from ./include/linux/container_of.h:5,
from ./include/linux/list.h:5,
from ./include/linux/swait.h:5,
from ./include/linux/completion.h:12,
from ./include/linux/crypto.h:15,
from arch/x86/kernel/asm-offsets.c:9:
./include/linux/context_tracking_state.h:56:61: error: ‘struct context_tracking’ has no member named ‘state’
56 | #define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
| ^~
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
| ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:61:29: note: in expansion of macro ‘CT_STATE_SIZE’
61 | #define RCU_DYNTICKS_START (CT_STATE_SIZE - CONFIG_RCU_DYNTICKS_BITS)
| ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:70:29: note: in expansion of macro ‘RCU_DYNTICKS_START’
70 | #define CONTEXT_WORK_END (RCU_DYNTICKS_START - 1)
| ^~~~~~~~~~~~~~~~~~
./include/linux/context_tracking_state.h:74:16: note: in expansion of macro ‘CONTEXT_WORK_END’
74 | (CONTEXT_WORK_END + 1 - CONTEXT_WORK_START) +
| ^~~~~~~~~~~~~~~~
./include/linux/context_tracking_state.h:56:61: error: ‘struct context_tracking’ has no member named ‘state’
56 | #define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
| ^~
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
| ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:62:29: note: in expansion of macro ‘CT_STATE_SIZE’
62 | #define RCU_DYNTICKS_END (CT_STATE_SIZE - 1)
| ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:75:16: note: in expansion of macro ‘RCU_DYNTICKS_END’
75 | (RCU_DYNTICKS_END + 1 - RCU_DYNTICKS_START) ==
| ^~~~~~~~~~~~~~~~
./include/linux/context_tracking_state.h:56:61: error: ‘struct context_tracking’ has no member named ‘state’
56 | #define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
| ^~
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
| ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:61:29: note: in expansion of macro ‘CT_STATE_SIZE’
61 | #define RCU_DYNTICKS_START (CT_STATE_SIZE - CONFIG_RCU_DYNTICKS_BITS)
| ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:75:40: note: in expansion of macro ‘RCU_DYNTICKS_START’
75 | (RCU_DYNTICKS_END + 1 - RCU_DYNTICKS_START) ==
| ^~~~~~~~~~~~~~~~~~
./include/linux/context_tracking_state.h:56:61: error: ‘struct context_tracking’ has no member named ‘state’
56 | #define CT_STATE_SIZE (sizeof(((struct context_tracking *)0)->state) * BITS_PER_BYTE)
| ^~
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
| ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:76:15: note: in expansion of macro ‘CT_STATE_SIZE’
76 | CT_STATE_SIZE);
| ^~~~~~~~~~~~~
./include/linux/context_tracking_state.h:73:15: error: expression in static assertion is not an integer
73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
| ^
./include/linux/build_bug.h:78:56: note: in definition of macro ‘__static_assert’
78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
| ^~~~
./include/linux/context_tracking_state.h:73:1: note: in expansion of macro ‘static_assert’
73 | static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
| ^~~~~~~~~~~~~
make[2]: *** [scripts/Makefile.build:116: arch/x86/kernel/asm-offsets.s] Error 1
make[1]: *** [/home/git/linux-rcu-1/Makefile:1275: prepare0] Error 2
make[1]: *** Waiting for unfinished jobs....
LD /home/git/linux-rcu-1/tools/objtool/objtool-in.o
LINK /home/git/linux-rcu-1/tools/objtool/objtool
make: *** [Makefile:234: __sub-make] Error 2
On 21/07/23 07:07, Paul E. McKenney wrote:
> On Fri, Jul 21, 2023 at 08:58:53AM +0100, Valentin Schneider wrote:
>> On 20/07/23 21:00, Paul E. McKenney wrote:
>> > On Thu, Jul 20, 2023 at 12:53:05PM -0700, Paul E. McKenney wrote:
>> >> On Thu, Jul 20, 2023 at 05:30:53PM +0100, Valentin Schneider wrote:
>> >> > diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
>> >> > new file mode 100644
>> >> > index 0000000000000..aa7274efd9819
>> >> > --- /dev/null
>> >> > +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
>> >> > @@ -0,0 +1,19 @@
>> >> > +CONFIG_SMP=y
>> >> > +CONFIG_NR_CPUS=8
>> >> > +CONFIG_PREEMPT_NONE=n
>> >> > +CONFIG_PREEMPT_VOLUNTARY=y
>> >> > +CONFIG_PREEMPT=n
>> >> > +CONFIG_PREEMPT_DYNAMIC=n
>> >> > +#CHECK#CONFIG_TREE_RCU=y
>> >> > +CONFIG_HZ_PERIODIC=n
>> >> > +CONFIG_NO_HZ_IDLE=n
>> >> > +CONFIG_NO_HZ_FULL=y
>> >> > +CONFIG_RCU_TRACE=y
>> >> > +CONFIG_RCU_FANOUT=4
>> >> > +CONFIG_RCU_FANOUT_LEAF=3
>> >> > +CONFIG_DEBUG_LOCK_ALLOC=n
>> >> > +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
>> >> > +CONFIG_RCU_EXPERT=y
>> >> > +CONFIG_RCU_EQS_DEBUG=y
>> >> > +CONFIG_RCU_LAZY=y
>> >> > +CONFIG_RCU_DYNTICKS_BITS=2
>> >>
>> >> Why not just add this last line to the existing TREE04 scenario?
>> >> That would ensure that it gets tested regularly without extending the
>> >> time required to run a full set of rcutorture tests.
>> >
>> > Please see below for the version of this patch that I am running overnight
>> > tests with. Does this one work for you?
>>
>> Yep that's fine with me. I only went with a separate test file as I wasn't
>> sure how new test options should be handled (merged into existing tests vs
>> new tests created), and didn't want to negatively impact TREE04 or
>> TREE06. If merging into TREE04 is preferred, then I'll do just that and
>> carry this patch moving forwards.
>
> Things worked fine for this one-hour-per-scenario test run on my laptop,
Many thanks for testing!
> except for the CONFIG_SMP=n runs, which all got build errors like the
> following.
>
Harumph, yes !SMP (and !CONTEXT_TRACKING_WORK) doesn't compile nicely, I'll
fix that for v3.
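One possible shape of that fix - an assumption on my part, not taken from
v3: only emit the layout assertion when struct context_tracking actually has
a .state field, e.g.

  #ifdef CONFIG_CONTEXT_TRACKING
  static_assert((CONTEXT_STATE_END + 1 - CONTEXT_STATE_START) +
                (CONTEXT_WORK_END + 1 - CONTEXT_WORK_START) +
                (RCU_DYNTICKS_END + 1 - RCU_DYNTICKS_START) ==
                CT_STATE_SIZE);
  #endif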
On 21/07/23 07:10, Paul E. McKenney wrote:
> On Fri, Jul 21, 2023 at 09:17:53AM +0100, Valentin Schneider wrote:
>> On 20/07/23 17:30, Valentin Schneider wrote:
>> > index bdd7eadb33d8f..1ff2aab24e964 100644
>> > --- a/kernel/rcu/Kconfig
>> > +++ b/kernel/rcu/Kconfig
>> > @@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME
>> > Say Y here if you need tighter callback-limit enforcement.
>> > Say N here if you are unsure.
>> >
>> > +config RCU_DYNTICKS_RANGE_BEGIN
>> > + int
>> > + depends on !RCU_EXPERT
>> > + default 31 if !CONTEXT_TRACKING_WORK
>>
>> You'll note that this should be 30 really, because the lower *2* bits are
>> taken by the context state (CONTEXT_GUEST has a value of 3).
>>
>> This highlights the fragile part of this: the Kconfig values are hardcoded,
>> but they depend on CT_STATE_SIZE, CONTEXT_MASK and CONTEXT_WORK_MAX. The
>> static_assert() will at least capture any misconfiguration, but having that
>> enforced by the actual Kconfig ranges would be less awkward.
>>
>> Do we currently have a way of e.g. making a Kconfig file depend on and use
>> values generated by a C header?
>
> Why not just have something like a boolean RCU_DYNTICKS_TORTURE Kconfig
> option and let the C code work out what the number of bits should be?
>
> I suppose that there might be a failure whose frequency depended on
> the number of bits, which might be an argument for keeping something
> like RCU_DYNTICKS_RANGE_BEGIN for fault isolation. But still using
> RCU_DYNTICKS_TORTURE for normal testing.
>
> Thoughts?
>
AFAICT if we run tests with the minimum possible width, then intermediate
values shouldn't have much value.
Your RCU_DYNTICKS_TORTURE suggestion sounds like a saner option than what I
came up with, as we can let the context tracking code figure out the widths
itself and not expose any of that to Kconfig.
> Thanx, Paul
On Fri, Jul 21, 2023 at 04:08:10PM +0100, Valentin Schneider wrote:
> On 21/07/23 07:10, Paul E. McKenney wrote:
> > On Fri, Jul 21, 2023 at 09:17:53AM +0100, Valentin Schneider wrote:
> >> On 20/07/23 17:30, Valentin Schneider wrote:
> >> > index bdd7eadb33d8f..1ff2aab24e964 100644
> >> > --- a/kernel/rcu/Kconfig
> >> > +++ b/kernel/rcu/Kconfig
> >> > @@ -332,4 +332,37 @@ config RCU_DOUBLE_CHECK_CB_TIME
> >> > Say Y here if you need tighter callback-limit enforcement.
> >> > Say N here if you are unsure.
> >> >
> >> > +config RCU_DYNTICKS_RANGE_BEGIN
> >> > + int
> >> > + depends on !RCU_EXPERT
> >> > + default 31 if !CONTEXT_TRACKING_WORK
> >>
> >> You'll note that this should be 30 really, because the lower *2* bits are
> >> taken by the context state (CONTEXT_GUEST has a value of 3).
> >>
> >> This highlights the fragile part of this: the Kconfig values are hardcoded,
> >> but they depend on CT_STATE_SIZE, CONTEXT_MASK and CONTEXT_WORK_MAX. The
> >> static_assert() will at least capture any misconfiguration, but having that
> >> enforced by the actual Kconfig ranges would be less awkward.
> >>
> >> Do we currently have a way of e.g. making a Kconfig file depend on and use
> >> values generated by a C header?
> >
> > Why not just have something like a boolean RCU_DYNTICKS_TORTURE Kconfig
> > option and let the C code work out what the number of bits should be?
> >
> > I suppose that there might be a failure whose frequency depended on
> > the number of bits, which might be an argument for keeping something
> > like RCU_DYNTICKS_RANGE_BEGIN for fault isolation. But still using
> > RCU_DYNTICKS_TORTURE for normal testing.
> >
> > Thoughts?
> >
>
> AFAICT if we run tests with the minimum possible width, then intermediate
> values shouldn't have much value.
>
> Your RCU_DYNTICKS_TORTURE suggestion sounds like a saner option than what I
> came up with, as we can let the context tracking code figure out the widths
> itself and not expose any of that to Kconfig.
Agreed. If a need for variable numbers of bits ever does arise, we can
worry about it at that time. And then we would have more information
on what a variable-bit facility should look like.
Thanx, Paul
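A minimal sketch of what the RCU_DYNTICKS_TORTURE approach could look like,
assuming a boolean Kconfig option with the C side deriving the width from
the context-tracking constants; RCU_DYNTICKS_WIDTH and CT_STATE_WORK_BITS
below are illustrative stand-ins, not the actual patch:

  config RCU_DYNTICKS_TORTURE
          bool "Minimize RCU dynticks counter width"
          depends on RCU_EXPERT
          help
            Shrink the dynticks counter to the smallest usable width so
            that counter wrap-around is exercised as often as possible.

  /*
   * Illustrative C side: CT_STATE_WORK_BITS stands in for however many
   * low bits the context state and deferred-work flags occupy.
   */
  #ifdef CONFIG_RCU_DYNTICKS_TORTURE
  /* Smallest counter that can still detect a wrap. */
  # define RCU_DYNTICKS_WIDTH	2
  #else
  /* Use every bit left over after the state and work bits. */
  # define RCU_DYNTICKS_WIDTH	(32 - CT_STATE_WORK_BITS)
  #endif

  static_assert(RCU_DYNTICKS_WIDTH + CT_STATE_WORK_BITS <= 32);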
On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
> int filter_assign_type(const char *type)
> {
> - if (strstr(type, "__data_loc") && strstr(type, "char"))
> - return FILTER_DYN_STRING;
> + if (strstr(type, "__data_loc")) {
> + if (strstr(type, "char"))
> + return FILTER_DYN_STRING;
> + if (strstr(type, "cpumask_t"))
> + return FILTER_CPUMASK;
> + }
The closing bracket has the wrong indentation.
> + /* Copy the cpulist between { and } */
> + tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
> + strscpy(tmp, str + maskstart, (i - maskstart) + 1);
Need to check kmalloc() failure? And also free tmp?
> +
> + pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
> + if (!pred->mask)
> + goto err_mem;
> +
> + /* Now parse it */
> + if (cpulist_parse(tmp, pred->mask)) {
> + parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
> + goto err_free;
> + }
> +
> + /* Move along */
> + i++;
> + if (field->filter_type == FILTER_CPUMASK)
> + pred->fn_num = FILTER_PRED_FN_CPUMASK;
> +
--
Josh
On 26/07/23 12:41, Josh Poimboeuf wrote:
> On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
>> int filter_assign_type(const char *type)
>> {
>> - if (strstr(type, "__data_loc") && strstr(type, "char"))
>> - return FILTER_DYN_STRING;
>> + if (strstr(type, "__data_loc")) {
>> + if (strstr(type, "char"))
>> + return FILTER_DYN_STRING;
>> + if (strstr(type, "cpumask_t"))
>> + return FILTER_CPUMASK;
>> + }
>
> The closing bracket has the wrong indentation.
>
>> + /* Copy the cpulist between { and } */
>> + tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
>> + strscpy(tmp, str + maskstart, (i - maskstart) + 1);
>
> Need to check kmalloc() failure? And also free tmp?
>
Heh, indeed, shoddy that :-)
Thanks!
>> +
>> + pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
>> + if (!pred->mask)
>> + goto err_mem;
>> +
>> + /* Now parse it */
>> + if (cpulist_parse(tmp, pred->mask)) {
>> + parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
>> + goto err_free;
>> + }
>> +
>> + /* Move along */
>> + i++;
>> + if (field->filter_type == FILTER_CPUMASK)
>> + pred->fn_num = FILTER_PRED_FN_CPUMASK;
>> +
>
> --
> Josh
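For reference, a sketch of the error handling being asked for above: check
the kmalloc() return, and free tmp on every path out of the parse (the
err_mem/err_free labels are assumed to do the pred->mask cleanup as in the
rest of the function):

	/* Copy the cpulist between { and } */
	tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
	if (!tmp)
		goto err_mem;
	strscpy(tmp, str + maskstart, (i - maskstart) + 1);

	pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
	if (!pred->mask) {
		kfree(tmp);
		goto err_mem;
	}

	/* Now parse it; tmp is only needed until it has been parsed */
	if (cpulist_parse(tmp, pred->mask)) {
		kfree(tmp);
		parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
		goto err_free;
	}
	kfree(tmp);

cpulist_parse() takes the usual cpulist format, so a filter value such as
{0-3,8} would parse as CPUs 0, 1, 2, 3 and 8.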
On Thu, Jul 20, 2023 at 05:30:46PM +0100, Valentin Schneider wrote:
> From: Peter Zijlstra <[email protected]>
>
> When a static_key is marked ro_after_init, its state will never change
> (after init), therefore jump_label_update() will never need to iterate
> the entries, and thus module load won't actually need to track this --
> avoiding the static_key::next write.
>
> Therefore, mark these keys such that jump_label_add_module() might
> recognise them and avoid the modification.
>
> Use the special state: 'static_key_linked(key) && !static_key_mod(key)'
> to denote such keys.
>
> Link: http://lore.kernel.org/r/[email protected]
> NOT-Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Valentin Schneider <[email protected]>
> ---
> @Peter: I've barely touched this patch; all I've done is write a comment
> and fix benign compilation issues, so credit's all yours really!
Ah, it works? Excellent! You can remove the NOT from the SoB then ;-)
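To make the special state concrete, here is roughly what the check could
look like in jump_label_add_module(), assuming the existing
static_key_linked()/static_key_mod() helpers (the static_key_sealed() name
is illustrative):

	/*
	 * A key that can never change state carries the LINKED type bit
	 * but has no static_key_mod chain behind it: a combination that
	 * cannot otherwise occur.
	 */
	static bool static_key_sealed(struct static_key *key)
	{
		return static_key_linked(key) && !static_key_mod(key);
	}

	/* in jump_label_add_module(), while walking the module's entries: */
	if (static_key_sealed(key))
		/* ro_after_init: state is fixed, no need to track it. */
		continue;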
On Wed, 26 Jul 2023 12:41:48 -0700
Josh Poimboeuf <[email protected]> wrote:
> On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
> > int filter_assign_type(const char *type)
> > {
> > - if (strstr(type, "__data_loc") && strstr(type, "char"))
> > - return FILTER_DYN_STRING;
> > + if (strstr(type, "__data_loc")) {
> > + if (strstr(type, "char"))
> > + return FILTER_DYN_STRING;
> > + if (strstr(type, "cpumask_t"))
> > + return FILTER_CPUMASK;
> > + }
>
> The closing bracket has the wrong indentation.
>
> > + /* Copy the cpulist between { and } */
> > + tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
> > + strscpy(tmp, str + maskstart, (i - maskstart) + 1);
>
> Need to check kmalloc() failure? And also free tmp?
I came to state the same thing.
Also, when you do an empty for loop:
for (; str[i] && str[i] != '}'; i++);
Always put the semicolon on the next line, otherwise it is really easy
to think that the next line is part of the for loop. That is, instead
of the above, do:
for (; str[i] && str[i] != '}'; i++)
;
-- Steve
>
> > +
> > + pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
> > + if (!pred->mask)
> > + goto err_mem;
> > +
> > + /* Now parse it */
> > + if (cpulist_parse(tmp, pred->mask)) {
> > + parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
> > + goto err_free;
> > + }
> > +
> > + /* Move along */
> > + i++;
> > + if (field->filter_type == FILTER_CPUMASK)
> > + pred->fn_num = FILTER_PRED_FN_CPUMASK;
> > +
>
On 29/07/23 15:09, Steven Rostedt wrote:
> On Wed, 26 Jul 2023 12:41:48 -0700
> Josh Poimboeuf <[email protected]> wrote:
>
>> On Thu, Jul 20, 2023 at 05:30:38PM +0100, Valentin Schneider wrote:
>> > int filter_assign_type(const char *type)
>> > {
>> > - if (strstr(type, "__data_loc") && strstr(type, "char"))
>> > - return FILTER_DYN_STRING;
>> > + if (strstr(type, "__data_loc")) {
>> > + if (strstr(type, "char"))
>> > + return FILTER_DYN_STRING;
>> > + if (strstr(type, "cpumask_t"))
>> > + return FILTER_CPUMASK;
>> > + }
>>
>> The closing bracket has the wrong indentation.
>>
>> > + /* Copy the cpulist between { and } */
>> > + tmp = kmalloc((i - maskstart) + 1, GFP_KERNEL);
>> > + strscpy(tmp, str + maskstart, (i - maskstart) + 1);
>>
>> Need to check kmalloc() failure? And also free tmp?
>
> I came to state the same thing.
>
> Also, when you do an empty for loop:
>
> for (; str[i] && str[i] != '}'; i++);
>
> Always put the semicolon on the next line, otherwise it is really easy
> to think that the next line is part of the for loop. That is, instead
> of the above, do:
>
> for (; str[i] && str[i] != '}'; i++)
> ;
>
Interestingly, I don't think I've ever encountered that variant; usually an
empty line after the loop (which this lacks) plus the indentation level is
enough to identify these. Regardless, I'll change it.
>
> -- Steve
>
>
>>
>> > +
>> > + pred->mask = kzalloc(cpumask_size(), GFP_KERNEL);
>> > + if (!pred->mask)
>> > + goto err_mem;
>> > +
>> > + /* Now parse it */
>> > + if (cpulist_parse(tmp, pred->mask)) {
>> > + parse_error(pe, FILT_ERR_INVALID_CPULIST, pos + i);
>> > + goto err_free;
>> > + }
>> > +
>> > + /* Move along */
>> > + i++;
>> > + if (field->filter_type == FILTER_CPUMASK)
>> > + pred->fn_num = FILTER_PRED_FN_CPUMASK;
>> > +
>>
On Mon, 31 Jul 2023 12:19:51 +0100
Valentin Schneider <[email protected]> wrote:
> >
> > Also, when you do an empty for loop:
> >
> > for (; str[i] && str[i] != '}'; i++);
> >
> > Always put the semicolon on the next line, otherwise it is really easy
> > to think that the next line is part of the for loop. That is, instead
> > of the above, do:
> >
> > for (; str[i] && str[i] != '}'; i++)
> > ;
> >
>
> Interestingly, I don't think I've ever encountered that variant; usually an
> empty line after the loop (which this lacks) plus the indentation level is
> enough to identify these. Regardless, I'll change it.
Do a "git grep -B1 -e '^\s*;\s*$'"
You'll find that it is quite common.
Thanks,
-- Steve