Subject: [RFC PATCH 0/6] x86/jump_label: Bound IPIs sent when updating a static key

While tuning a system with CPUs isolated as much as possible, we've
noticed that isolated CPUs were receiving periodic bursts of 12 IPIs.
Tracing the functions that emit IPIs, we saw chronyd - an unprivileged
process - generating the IPIs when changing a static key to enable
network timestamping on sockets.

For instance, the trace showed:

# trace-cmd record --func-stack -p function -l smp_call_function_single -e irq_vectors -f 'vector == 251'...
# trace-cmd report...
[...]
chronyd-858 [000] 433.286406: function: smp_call_function_single
chronyd-858 [000] 433.286408: kernel_stack: <stack trace>
=> smp_call_function_many (ffffffffbc10ab22)
=> smp_call_function (ffffffffbc10abaa)
=> on_each_cpu (ffffffffbc10ac0b)
=> text_poke_bp (ffffffffbc02436a)
=> arch_jump_label_transform (ffffffffbc020ec8)
=> __jump_label_update (ffffffffbc1b300f)
=> jump_label_update (ffffffffbc1b30ed)
=> static_key_slow_inc (ffffffffbc1b334d)
=> net_enable_timestamp (ffffffffbc625c44)
=> sock_enable_timestamp (ffffffffbc6127d5)
=> sock_setsockopt (ffffffffbc612d2f)
=> SyS_setsockopt (ffffffffbc60d5e6)
=> tracesys (ffffffffbc7681c5)
<idle>-0 [001] 433.286416: call_function_single_entry: vector=251
<idle>-0 [001] 433.286419: call_function_single_exit: vector=251

[... The IPI takes place 12 times]

The static key in question was netstamp_needed_key. A static key
change from enabled->disabled/disabled->enabled causes the code to be
changed, and this is done in three steps:

-- Pseudo-code #1 - Current implementation ---
For each key to be updated:
1) add an int3 trap to the address that will be patched
sync cores (send IPI to all other CPUs)
2) update all but the first byte of the patched range
sync cores (send IPI to all other CPUs)
3) replace the first byte (int3) by the first byte of replacing opcode
sync cores (send IPI to all other CPUs)
-- Pseudo-code #1 ---

As the static key netstamp_needed_key has four entries (used in four
places in the code) in our kernel, 3 IPIs were generated for each
entry, resulting in 12 IPIs. The number of IPIs is therefore linear with
regard to the number 'n' of entries of a key: O(n*3), which is O(n).

This algorithm works fine for the update of a single key. But we think
it is possible to optimize the case in which a static key has more than
one entry. For instance, the sched_schedstats jump label has 56 entries
in my (updated) Fedora kernel, resulting in 168 IPIs for each CPU on
which the thread that is doing the enabling is _not_ running.

In this patch, rather than doing each update at once, it is possible to
queue all updates first and then apply them all at once, rewriting
pseudo-code #1 in this way:


-- Pseudo-code #2 - This patch ---
1) for each key in the queue:
add an int3 trap to the address that will be patched
sync cores (send IPI to all other CPUs)

2) for each key in the queue:
update all but the first byte of the patched range
sync cores (send IPI to all other CPUs)

3) for each key in the queue:
replace the first byte (int3) by the first byte of replacing opcode
sync cores (send IPI to all other CPUs)
-- Pseudo-code #2 - This patch ---

Doing the update in this way, the number of IPI rounds becomes three,
regardless of the number of entries of a key: O(3), which is O(1).

Currently, the jump label of a static key is transformed via the arch
specific function:

void arch_jump_label_transform(struct jump_entry *entry,
enum jump_label_type type)

The new approach (batch mode) uses two arch functions. The first has the
same arguments as arch_jump_label_transform(), and is the function:

void arch_jump_label_transform_queue(struct jump_entry *entry,
enum jump_label_type type)

Rather than transforming the code, it adds the jump_entry to a queue of
entries to be updated.

After queuing all jump_entries, the function:

void arch_jump_label_transform_apply(void)

applies the changes in the queue.
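
For reference, this is how the batched __jump_label_update() of patch 6/6
uses the pair:

static void __jump_label_update(struct static_key *key,
				struct jump_entry *entry,
				struct jump_entry *stop)
{
	/* queue every valid entry of the key... */
	for_each_label_entry(key, entry, stop) {
		if (jump_label_can_update_check(entry))
			arch_jump_label_transform_queue(entry,
							jump_label_type(entry));
	}
	/* ...and patch them all with a single three-IPI-round cycle */
	arch_jump_label_transform_apply();
}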

One easy way to see the benefits of this patch is switching the
schedstats on and off. For instance:

-------------------------- %< ----------------------------
#!/bin/bash

while [ true ]; do
sysctl -w kernel.sched_schedstats=1
sleep 2
sysctl -w kernel.sched_schedstats=0
sleep 2
done
-------------------------- >% ----------------------------

while watching the IPI count:

-------------------------- %< ----------------------------
# watch -n1 "cat /proc/interrupts | grep Function"
-------------------------- >% ----------------------------

With the current implementation, it is possible to see roughly 168 IPIs
every 2 seconds, while with this patch the number of IPIs goes down to 3
every 2 seconds.

Although the motivation of this patch is to reduce the noise on threads
that are *not* causing the enabling/disabling of the static key,
counter-intuitively, this patch also improves the performance of the
enabling/disabling (slow) path for the thread that is actually doing the
change. The reason is that the cost of allocating memory, manipulating
the list, and freeing the memory is smaller than the cost of sending the
extra IPIs. For example, on a system with 24 CPUs, the current cost of
enabling the schedstats key is roughly 170000 us, while with this patch
it decreases to roughly 2200 us.

This is an RFC, so comments and criticism about things I am missing are
more than welcome.

The batch of operations was suggested by Scott Wood <[email protected]>.

Daniel Bristot de Oliveira (6):
jump_label: Add for_each_label_entry helper
jump_label: Add the jump_label_can_update_check() helper
x86/jump_label: Move checking code away from __jump_label_transform()
x86/jump_label: Add __jump_label_set_jump_code() helper
x86/alternative: Split text_poke_bp() into three steps
x86/jump_label,x86/alternatives: Batch jump label transformations

arch/x86/include/asm/jump_label.h | 2 +
arch/x86/include/asm/text-patching.h | 9 ++
arch/x86/kernel/alternative.c | 115 ++++++++++++++++---
arch/x86/kernel/jump_label.c | 161 ++++++++++++++++++++-------
include/linux/jump_label.h | 8 ++
kernel/jump_label.c | 46 ++++++--
6 files changed, 273 insertions(+), 68 deletions(-)

--
2.17.1



Subject: [RFC PATCH 1/6] jump_label: Add for_each_label_entry helper

This patch adds the helper:

for_each_label_entry(key, entry, stop)

for the "for each jump label entry" loop, defined as:

for (; (entry < stop) && (jump_entry_key(entry) == key); entry++)

simplifying its reading and usage.

Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: "Steven Rostedt (VMware)" <[email protected]>
Cc: Zhou Chengming <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Chris von Recklinghausen <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Scott Wood <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/jump_label.h | 3 +++
kernel/jump_label.c | 2 +-
2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 1a0b6f17a5d6..cd3bed880ca0 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -170,6 +170,9 @@ extern void static_key_disable(struct static_key *key);
extern void static_key_enable_cpuslocked(struct static_key *key);
extern void static_key_disable_cpuslocked(struct static_key *key);

+#define for_each_label_entry(key, entry, stop) \
+ for (; (entry < stop) && (jump_entry_key(entry) == key); entry++)
+
/*
* We should be using ATOMIC_INIT() for initializing .enabled, but
* the inclusion of atomic.h is problematic for inclusion of jump_label.h
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 2e62503bea0d..e853916a3b46 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -365,7 +365,7 @@ static void __jump_label_update(struct static_key *key,
struct jump_entry *entry,
struct jump_entry *stop)
{
- for (; (entry < stop) && (jump_entry_key(entry) == key); entry++) {
+ for_each_label_entry(key, entry, stop) {
/*
* An entry->code of 0 indicates an entry which has been
* disabled because it was in an init text area.
--
2.17.1


Subject: [RFC PATCH 3/6] x86/jump_label: Move checking code away from __jump_label_transform()

This patch creates two new functions, __jump_label_enabling_check()
and __jump_label_disabling_check(), to perform the respective checks
before updating a jump_entry.

Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: "Steven Rostedt (VMware)" <[email protected]>
Cc: Zhou Chengming <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Chris von Recklinghausen <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Scott Wood <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
arch/x86/kernel/jump_label.c | 88 ++++++++++++++++++++++--------------
1 file changed, 53 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index eeea935e9bb5..a11d836d9aaa 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -37,57 +37,75 @@ static void bug_at(unsigned char *ip, int line)
BUG();
}

+static inline void __jump_label_enabling_check(struct jump_entry *entry,
+ enum jump_label_type type,
+ int init)
+{
+ const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
+ const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
+
+ if (init) {
+ /*
+ * Jump label is enabled for the first time.
+ * So we expect a default_nop...
+ */
+ if (unlikely(memcmp((void *)entry->code, default_nop, 5)
+ != 0))
+ bug_at((void *)entry->code, __LINE__);
+ } else {
+ /*
+ * ...otherwise expect an ideal_nop. Otherwise
+ * something went horribly wrong.
+ */
+ if (unlikely(memcmp((void *)entry->code, ideal_nop, 5)
+ != 0))
+ bug_at((void *)entry->code, __LINE__);
+ }
+}
+
+static inline void __jump_label_disabling_check(struct jump_entry *entry,
+ enum jump_label_type type,
+ int init)
+{
+ union jump_code_union code;
+ const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
+
+ /*
+ * We are disabling this jump label. If it is not what
+ * we think it is, then something must have gone wrong.
+ * If this is the first initialization call, then we
+ * are converting the default nop to the ideal nop.
+ */
+ if (init) {
+ if (unlikely(memcmp((void *)entry->code, default_nop, 5) != 0))
+ bug_at((void *)entry->code, __LINE__);
+ } else {
+ code.jump = 0xe9;
+ code.offset = entry->target -
+ (entry->code + JUMP_LABEL_NOP_SIZE);
+ if (unlikely(memcmp((void *)entry->code, &code, 5) != 0))
+ bug_at((void *)entry->code, __LINE__);
+ }
+}
+
static void __ref __jump_label_transform(struct jump_entry *entry,
enum jump_label_type type,
void *(*poker)(void *, const void *, size_t),
int init)
{
union jump_code_union code;
- const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
- const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];

if (early_boot_irqs_disabled)
poker = text_poke_early;

if (type == JUMP_LABEL_JMP) {
- if (init) {
- /*
- * Jump label is enabled for the first time.
- * So we expect a default_nop...
- */
- if (unlikely(memcmp((void *)entry->code, default_nop, 5)
- != 0))
- bug_at((void *)entry->code, __LINE__);
- } else {
- /*
- * ...otherwise expect an ideal_nop. Otherwise
- * something went horribly wrong.
- */
- if (unlikely(memcmp((void *)entry->code, ideal_nop, 5)
- != 0))
- bug_at((void *)entry->code, __LINE__);
- }
+ __jump_label_enabling_check(entry, type, init);

code.jump = 0xe9;
code.offset = entry->target -
(entry->code + JUMP_LABEL_NOP_SIZE);
} else {
- /*
- * We are disabling this jump label. If it is not what
- * we think it is, then something must have gone wrong.
- * If this is the first initialization call, then we
- * are converting the default nop to the ideal nop.
- */
- if (init) {
- if (unlikely(memcmp((void *)entry->code, default_nop, 5) != 0))
- bug_at((void *)entry->code, __LINE__);
- } else {
- code.jump = 0xe9;
- code.offset = entry->target -
- (entry->code + JUMP_LABEL_NOP_SIZE);
- if (unlikely(memcmp((void *)entry->code, &code, 5) != 0))
- bug_at((void *)entry->code, __LINE__);
- }
+ __jump_label_disabling_check(entry, type, init);
memcpy(&code, ideal_nops[NOP_ATOMIC5], JUMP_LABEL_NOP_SIZE);
}

--
2.17.1


Subject: [RFC PATCH 4/6] x86/jump_label: Add __jump_label_set_jump_code() helper

Move the definition of the code to be written from
__jump_label_transform() to a specialized function.

Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: "Steven Rostedt (VMware)" <[email protected]>
Cc: Zhou Chengming <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Chris von Recklinghausen <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Scott Wood <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
arch/x86/kernel/jump_label.c | 27 +++++++++++++++++----------
1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index a11d836d9aaa..de588ff47f81 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -88,6 +88,22 @@ static inline void __jump_label_disabling_check(struct jump_entry *entry,
}
}

+static void __jump_label_set_jump_code(struct jump_entry *entry,
+ enum jump_label_type type,
+ int init,
+ union jump_code_union *code)
+{
+ if (type == JUMP_LABEL_JMP) {
+ __jump_label_enabling_check(entry, type, init);
+
+ code->jump = 0xe9;
+ code->offset = entry->target - (entry->code + JUMP_LABEL_NOP_SIZE);
+ } else {
+ __jump_label_disabling_check(entry, type, init);
+ memcpy(code, ideal_nops[NOP_ATOMIC5], JUMP_LABEL_NOP_SIZE);
+ }
+}
+
static void __ref __jump_label_transform(struct jump_entry *entry,
enum jump_label_type type,
void *(*poker)(void *, const void *, size_t),
@@ -98,16 +114,7 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
if (early_boot_irqs_disabled)
poker = text_poke_early;

- if (type == JUMP_LABEL_JMP) {
- __jump_label_enabling_check(entry, type, init);
-
- code.jump = 0xe9;
- code.offset = entry->target -
- (entry->code + JUMP_LABEL_NOP_SIZE);
- } else {
- __jump_label_disabling_check(entry, type, init);
- memcpy(&code, ideal_nops[NOP_ATOMIC5], JUMP_LABEL_NOP_SIZE);
- }
+ __jump_label_set_jump_code(entry, type, init, &code);

/*
* Make text_poke_bp() a default fallback poker.
--
2.17.1


Subject: [RFC PATCH 2/6] jump_label: Add the jump_label_can_update_check() helper

Move the check of whether a jump_entry is valid to a separate function.

Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: "Steven Rostedt (VMware)" <[email protected]>
Cc: Zhou Chengming <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Chris von Recklinghausen <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Scott Wood <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/jump_label.c | 29 ++++++++++++++++++-----------
1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index e853916a3b46..940ba7819c87 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -361,22 +361,29 @@ static enum jump_label_type jump_label_type(struct jump_entry *entry)
return enabled ^ branch;
}

+bool jump_label_can_update_check(struct jump_entry *entry)
+{
+ /*
+ * An entry->code of 0 indicates an entry which has been
+ * disabled because it was in an init text area.
+ */
+ if (entry->code) {
+ if (kernel_text_address(entry->code))
+ return 1;
+ else
+ WARN_ONCE(1, "can't patch jump_label at %pS",
+ (void *)(unsigned long)entry->code);
+ }
+ return 0;
+}
+
static void __jump_label_update(struct static_key *key,
struct jump_entry *entry,
struct jump_entry *stop)
{
for_each_label_entry(key, entry, stop) {
- /*
- * An entry->code of 0 indicates an entry which has been
- * disabled because it was in an init text area.
- */
- if (entry->code) {
- if (kernel_text_address(entry->code))
- arch_jump_label_transform(entry, jump_label_type(entry));
- else
- WARN_ONCE(1, "can't patch jump_label at %pS",
- (void *)(unsigned long)entry->code);
- }
+ if (jump_label_can_update_check(entry))
+ arch_jump_label_transform(entry, jump_label_type(entry));
}
}

--
2.17.1


Subject: [RFC PATCH 5/6] x86/alternative: Split text_poke_bp() into three steps

text_poke_bp() updates instructions on a live kernel on SMP in three steps:
1) add an int3 trap to the address that will be patched
2) update all but the first byte of the patched range
3) replace the first byte (int3) by the first byte of replacing opcode

This patch creates one function for each of these steps.
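
After the split, the core of text_poke_bp() boils down to the sequence
below (the bp_patching_in_progress bookkeeping and barriers are omitted
here; see the diff):

	/* step 1: install the int3 trap and tell everyone about it */
	text_poke_bp_set_handler(addr, handler, int3);
	on_each_cpu(do_sync_core, NULL, 1);

	/* step 2: write everything but the first byte */
	if (len - sizeof(int3) > 0) {
		patch_all_but_first_byte(addr, opcode, len, int3);
		on_each_cpu(do_sync_core, NULL, 1);
	}

	/* step 3: replace the int3 with the final first byte */
	patch_first_byte(addr, opcode, int3);
	on_each_cpu(do_sync_core, NULL, 1);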

Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: "Steven Rostedt (VMware)" <[email protected]>
Cc: Zhou Chengming <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Chris von Recklinghausen <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Scott Wood <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
arch/x86/kernel/alternative.c | 38 +++++++++++++++++++++++++----------
1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index b9d5e7c9ef43..a4c83cb49cd0 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -763,6 +763,29 @@ int poke_int3_handler(struct pt_regs *regs)

}

+static void text_poke_bp_set_handler(void *addr, void *handler,
+ unsigned char int3)
+{
+ bp_int3_handler = handler;
+ bp_int3_addr = (u8 *)addr + sizeof(int3);
+ text_poke(addr, &int3, sizeof(int3));
+}
+
+static void patch_all_but_first_byte(void *addr, const void *opcode,
+ size_t len, unsigned char int3)
+{
+ /* patch all but the first byte */
+ text_poke((char *)addr + sizeof(int3),
+ (const char *) opcode + sizeof(int3),
+ len - sizeof(int3));
+}
+
+static void patch_first_byte(void *addr, const void *opcode, unsigned char int3)
+{
+ /* patch the first byte */
+ text_poke(addr, opcode, sizeof(int3));
+}
+
/**
* text_poke_bp() -- update instructions on live kernel on SMP
* @addr: address to patch
@@ -787,27 +810,21 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
{
unsigned char int3 = 0xcc;

- bp_int3_handler = handler;
- bp_int3_addr = (u8 *)addr + sizeof(int3);
- bp_patching_in_progress = true;
-
lockdep_assert_held(&text_mutex);

+ bp_patching_in_progress = true;
/*
* Corresponding read barrier in int3 notifier for making sure the
* in_progress and handler are correctly ordered wrt. patching.
*/
smp_wmb();

- text_poke(addr, &int3, sizeof(int3));
+ text_poke_bp_set_handler(addr, handler, int3);

on_each_cpu(do_sync_core, NULL, 1);

if (len - sizeof(int3) > 0) {
- /* patch all but the first byte */
- text_poke((char *)addr + sizeof(int3),
- (const char *) opcode + sizeof(int3),
- len - sizeof(int3));
+ patch_all_but_first_byte(addr, opcode, len, int3);
/*
* According to Intel, this core syncing is very likely
* not necessary and we'd be safe even without it. But
@@ -816,8 +833,7 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
on_each_cpu(do_sync_core, NULL, 1);
}

- /* patch the first byte */
- text_poke(addr, opcode, sizeof(int3));
+ patch_first_byte(addr, opcode, int3);

on_each_cpu(do_sync_core, NULL, 1);
/*
--
2.17.1


Subject: [RFC PATCH 6/6] x86/jump_label,x86/alternatives: Batch jump label transformations

A static key change from enabled->disabled/disabled->enabled causes
the code to be changed, and this is done in three steps:

-- Pseudo-code #1 - Current implementation ---
For each key to be updated:
1) add an int3 trap to the address that will be patched
sync cores (send IPI to all other CPUs)
2) update all but the first byte of the patched range
sync cores (send IPI to all other CPUs)
3) replace the first byte (int3) by the first byte of replacing opcode
sync cores (send IPI to all other CPUs)
-- Pseudo-code #1 ---

The number of IPIs sent is then linear with regard to the number 'n' of
entries of a key: O(n*3), which is O(n). For instance, as the static key
netstamp_needed_key has four entries (used in four places in the code)
in our kernel, 3 IPIs were generated for each entry, resulting in 12 IPIs.

This algorithm works fine for the update of a single key. But we think
it is possible to optimize the case in which a static key has more than
one entry. For instance, the sched_schedstats jump label has 56 entries
in my (updated) Fedora kernel, resulting in 168 IPIs for each CPU on
which the thread that is doing the enabling is _not_ running.

In this patch, rather than doing each update at once, we queue all
updates first and then apply all updates at once, rewriting
pseudo-code #1 in this way:

-- Pseudo-code #2 - This patch ---
1) for each key in the queue:
add an int3 trap to the address that will be patched
sync cores (send IPI to all other CPUs)

2) for each key in the queue:
update all but the first byte of the patched range
sync cores (send IPI to all other CPUs)

3) for each key in the queue:
replace the first byte (int3) by the first byte of replacing opcode
sync cores (send IPI to all other CPUs)
-- Pseudo-code #2 - This patch ---

Doing the update in this way, the number of IPI rounds becomes three,
regardless of the number of entries of a key: O(3), which is O(1).

Currently, the jump label of a static key is transformed via the arch
specific function:

void arch_jump_label_transform(struct jump_entry *entry,
enum jump_label_type type)

The new approach (batch mode) uses two arch functions. The first has the
same arguments as arch_jump_label_transform(), and is the function:

void arch_jump_label_transform_queue(struct jump_entry *entry,
enum jump_label_type type)

Rather than transforming the code, it adds the jump_entry to a queue of
entries to be updated.

After queuing all jump_entries, the function:

void arch_jump_label_transform_apply(void)

applies the changes in the queue.
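
On the text patching side, the queue is a list of struct text_to_poke
entries handed to the new text_poke_bp_list(). A minimal sketch of
queueing a single site (addr and opcode below are placeholders for the
real address and the JUMP_LABEL_NOP_SIZE replacement bytes):

	struct text_to_poke *tp;
	LIST_HEAD(batch);

	tp = kzalloc(sizeof(*tp), GFP_KERNEL | __GFP_NOFAIL);
	tp->addr = addr;			/* instruction to patch */
	tp->opcode = opcode;			/* replacement bytes */
	tp->len = JUMP_LABEL_NOP_SIZE;
	tp->handler = addr + JUMP_LABEL_NOP_SIZE; /* where the int3 handler sends the CPU */
	list_add_tail(&tp->list, &batch);

	/* one int3/patch/restore cycle (3 IPI rounds) for the whole list */
	mutex_lock(&text_mutex);
	text_poke_bp_list(&batch);
	mutex_unlock(&text_mutex);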

The batch of operations was suggested by Scott Wood.

Suggested-by: Scott Wood <[email protected]>

Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: "Steven Rostedt (VMware)" <[email protected]>
Cc: Zhou Chengming <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Chris von Recklinghausen <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Scott Wood <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Clark Williams <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
arch/x86/include/asm/jump_label.h | 2 +
arch/x86/include/asm/text-patching.h | 9 +++
arch/x86/kernel/alternative.c | 83 +++++++++++++++++++++++++---
arch/x86/kernel/jump_label.c | 54 ++++++++++++++++++
include/linux/jump_label.h | 5 ++
kernel/jump_label.c | 15 +++++
6 files changed, 161 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index 8c0de4282659..d61c476046fe 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -15,6 +15,8 @@
#error asm/jump_label.h included on a non-jump-label kernel
#endif

+#define HAVE_JUMP_LABEL_BATCH
+
#define JUMP_LABEL_NOP_SIZE 5

#ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..a28230f09d72 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -18,6 +18,14 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
#define __parainstructions_end NULL
#endif

+struct text_to_poke {
+ struct list_head list;
+ void *opcode;
+ void *addr;
+ void *handler;
+ size_t len;
+};
+
extern void *text_poke_early(void *addr, const void *opcode, size_t len);

/*
@@ -37,6 +45,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
extern void *text_poke(void *addr, const void *opcode, size_t len);
extern int poke_int3_handler(struct pt_regs *regs);
extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern void text_poke_bp_list(struct list_head *entry_list);
extern int after_bootmem;

#endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index a4c83cb49cd0..3bd502ea4c53 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -735,9 +735,12 @@ static void do_sync_core(void *info)

static bool bp_patching_in_progress;
static void *bp_int3_handler, *bp_int3_addr;
+struct list_head *bp_list;

int poke_int3_handler(struct pt_regs *regs)
{
+ void *ip;
+ struct text_to_poke *tp;
/*
* Having observed our INT3 instruction, we now must observe
* bp_patching_in_progress.
@@ -753,21 +756,38 @@ int poke_int3_handler(struct pt_regs *regs)
if (likely(!bp_patching_in_progress))
return 0;

- if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
+ if (user_mode(regs))
return 0;

- /* set up the specified breakpoint handler */
- regs->ip = (unsigned long) bp_int3_handler;
+ /*
+ * Single poke.
+ */
+ if (bp_int3_addr) {
+ if (regs->ip == (unsigned long) bp_int3_addr) {
+ regs->ip = (unsigned long) bp_int3_handler;
+ return 1;
+ }
+ return 0;
+ }

- return 1;
+ /*
+ * Batch mode.
+ */
+ ip = (void *) regs->ip - sizeof(unsigned char);
+ list_for_each_entry(tp, bp_list, list) {
+ if (ip == tp->addr) {
+ /* set up the specified breakpoint handler */
+ regs->ip = (unsigned long) tp->handler;
+ return 1;
+ }
+ }

+ return 0;
}

static void text_poke_bp_set_handler(void *addr, void *handler,
unsigned char int3)
{
- bp_int3_handler = handler;
- bp_int3_addr = (u8 *)addr + sizeof(int3);
text_poke(addr, &int3, sizeof(int3));
}

@@ -812,6 +832,9 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)

lockdep_assert_held(&text_mutex);

+ bp_int3_handler = handler;
+ bp_int3_addr = (u8 *)addr + sizeof(int3);
+
bp_patching_in_progress = true;
/*
* Corresponding read barrier in int3 notifier for making sure the
@@ -841,7 +864,53 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
* the writing of the new instruction.
*/
bp_patching_in_progress = false;
-
+ bp_int3_handler = bp_int3_addr = 0;
return addr;
}

+void text_poke_bp_list(struct list_head *entry_list)
+{
+ unsigned char int3 = 0xcc;
+ int patched_all_but_first = 0;
+ struct text_to_poke *tp;
+
+ bp_list = entry_list;
+ bp_patching_in_progress = true;
+ /*
+ * Corresponding read barrier in int3 notifier for making sure the
+ * in_progress and handler are correctly ordered wrt. patching.
+ */
+ smp_wmb();
+
+ list_for_each_entry(tp, entry_list, list)
+ text_poke_bp_set_handler(tp->addr, tp->handler, int3);
+
+ on_each_cpu(do_sync_core, NULL, 1);
+
+ list_for_each_entry(tp, entry_list, list) {
+ if (tp->len - sizeof(int3) > 0) {
+ patch_all_but_first_byte(tp->addr, tp->opcode, tp->len, int3);
+ patched_all_but_first++;
+ }
+ }
+
+ if (patched_all_but_first) {
+ /*
+ * According to Intel, this core syncing is very likely
+ * not necessary and we'd be safe even without it. But
+ * better safe than sorry (plus there's not only Intel).
+ */
+ on_each_cpu(do_sync_core, NULL, 1);
+ }
+
+ list_for_each_entry(tp, entry_list, list)
+ patch_first_byte(tp->addr, tp->opcode, int3);
+
+ on_each_cpu(do_sync_core, NULL, 1);
+ /*
+ * sync_core() implies an smp_mb() and orders this store against
+ * the writing of the new instruction.
+ */
+ bp_list = 0;
+ bp_patching_in_progress = false;
+}
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index de588ff47f81..3da5af5de4d3 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -12,6 +12,8 @@
#include <linux/list.h>
#include <linux/jhash.h>
#include <linux/cpu.h>
+#include <linux/slab.h>
+#include <linux/list.h>
#include <asm/kprobes.h>
#include <asm/alternative.h>
#include <asm/text-patching.h>
@@ -139,6 +141,58 @@ void arch_jump_label_transform(struct jump_entry *entry,
mutex_unlock(&text_mutex);
}

+LIST_HEAD(batch_list);
+
+void arch_jump_label_transform_queue(struct jump_entry *entry,
+ enum jump_label_type type)
+{
+ struct text_to_poke *tp;
+
+ /*
+ * Batch mode disabled at boot time.
+ */
+ if (early_boot_irqs_disabled)
+ goto fallback;
+
+ /*
+ * RFC Note: I put __GFP_NOFAIL, but I could also goto fallback;
+ * thoughts?
+ */
+ tp = kzalloc(sizeof(struct text_to_poke), GFP_KERNEL | __GFP_NOFAIL);
+ tp->opcode = kzalloc(sizeof(union jump_code_union),
+ GFP_KERNEL | __GFP_NOFAIL);
+
+ __jump_label_set_jump_code(entry, type, 0, tp->opcode);
+ tp->addr = (void *) entry->code;
+ tp->len = JUMP_LABEL_NOP_SIZE;
+ tp->handler = (void *) entry->code + JUMP_LABEL_NOP_SIZE;
+
+ list_add_tail(&tp->list, &batch_list);
+
+ return;
+
+fallback:
+ arch_jump_label_transform(entry, type);
+}
+
+void arch_jump_label_transform_apply(void)
+{
+ struct text_to_poke *tp, *next;
+
+ if (early_boot_irqs_disabled)
+ return;
+
+ mutex_lock(&text_mutex);
+ text_poke_bp_list(&batch_list);
+ mutex_unlock(&text_mutex);
+
+ list_for_each_entry_safe(tp, next, &batch_list, list) {
+ list_del(&tp->list);
+ kfree(tp->opcode);
+ kfree(tp);
+ }
+}
+
static enum {
JL_STATE_START,
JL_STATE_NO_UPDATE,
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index cd3bed880ca0..2aca92e03494 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -156,6 +156,11 @@ extern void jump_label_lock(void);
extern void jump_label_unlock(void);
extern void arch_jump_label_transform(struct jump_entry *entry,
enum jump_label_type type);
+#ifdef HAVE_JUMP_LABEL_BATCH
+extern void arch_jump_label_transform_queue(struct jump_entry *entry,
+ enum jump_label_type type);
+extern void arch_jump_label_transform_apply(void);
+#endif
extern void arch_jump_label_transform_static(struct jump_entry *entry,
enum jump_label_type type);
extern int jump_label_text_reserved(void *start, void *end);
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 940ba7819c87..f534d9c4e07f 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -377,6 +377,7 @@ bool jump_label_can_update_check(struct jump_entry *entry)
return 0;
}

+#ifndef HAVE_JUMP_LABEL_BATCH
static void __jump_label_update(struct static_key *key,
struct jump_entry *entry,
struct jump_entry *stop)
@@ -386,6 +387,20 @@ static void __jump_label_update(struct static_key *key,
arch_jump_label_transform(entry, jump_label_type(entry));
}
}
+#else
+static void __jump_label_update(struct static_key *key,
+ struct jump_entry *entry,
+ struct jump_entry *stop)
+{
+ for_each_label_entry(key, entry, stop) {
+ if (jump_label_can_update_check(entry))
+ arch_jump_label_transform_queue(entry,
+ jump_label_type(entry));
+ }
+ arch_jump_label_transform_apply();
+
+}
+#endif

void __init jump_label_init(void)
{
--
2.17.1


2018-10-08 15:13:34

by Jason Baron

Subject: Re: [RFC PATCH 6/6] x86/jump_label,x86/alternatives: Batch jump label transformations

On 10/08/2018 08:53 AM, Daniel Bristot de Oliveira wrote:
> A static key, changing from enabled->disabled/disabled->enabled causes
> the code to be changed, and this is done in three steps:
>
> -- Pseudo-code #1 - Current implementation ---
> For each key to be updated:
> 1) add an int3 trap to the address that will be patched
> sync cores (send IPI to all other CPUs)
> 2) update all but the first byte of the patched range
> sync cores (send IPI to all other CPUs)
> 3) replace the first byte (int3) by the first byte of replacing opcode
> sync cores (send IPI to all other CPUs)
> -- Pseudo-code #1 ---
>
> The number of IPIs sent is then linear with regard to the number 'n' of
> entries of a key: O(n*3), which is O(n). For instance, as the static key
> netstamp_needed_key has four entries (used in for places in the code)
> in our kernel, 3 IPIs were generated for each entry, resulting in 12 IPIs.
>
> This algorithm works fine for the update of a single key. But we think
> it is possible to optimize the case in which a static key has more than
> one entry. For instance, the sched_schedstats jump label has 56 entries
> in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
> which the thread that is enabling is _not_ running.
>
> In this patch, rather than doing each updated at once, it queue all
> updates first, and the, apply all updates at once, rewriting the
> pseudo-code #1 in this way:
>
> -- Pseudo-code #2 - This patch ---
> 1) for each key in the queue:
> add an int3 trap to the address that will be patched
> sync cores (send IPI to all other CPUs)
>
> 2) for each key in the queue:
> update all but the first byte of the patched range
> sync cores (send IPI to all other CPUs)
>
> 3) for each key in the queue:
> replace the first byte (int3) by the first byte of replacing opcode
>
> sync cores (send IPI to all other CPUs)
> -- Pseudo-code #2 - This patch ---
>
> Doing the update in this way, the number of IPI becomes O(3) with regard
> to the number of keys, which is O(1).
>
> Currently, the jump label of a static key is transformed via the arch
> specific function:
>
> void arch_jump_label_transform(struct jump_entry *entry,
> enum jump_label_type type)
>
> The new approach (batch mode) uses two arch functions, the first has the
> same arguments of the arch_jump_label_transform(), and is the function:
>
> void arch_jump_label_transform_queue(struct jump_entry *entry,
> enum jump_label_type type)
>
> Rather than transforming the code, it adds the jump_entry in a queue of
> entries to be updated.
>
> After queuing all jump_entries, the function:
>
> void arch_jump_label_transform_apply(void)
>
> Applies the changes in the queue.
>
> The batch of operations was:
> Suggested-by: Scott Wood <[email protected]>

Hi,

We've discussed a 'batch' mode here before, and we had patches in the
past iirc, but they never quite reached a merge-able state. I think for
this patch, we want to separate it into 2 - the text patching code that
now takes a list, and the jump_label code consumer. Comments below.


>
> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Pavel Tatashin <[email protected]>
> Cc: Masami Hiramatsu <[email protected]>
> Cc: "Steven Rostedt (VMware)" <[email protected]>
> Cc: Zhou Chengming <[email protected]>
> Cc: Jiri Kosina <[email protected]>
> Cc: Josh Poimboeuf <[email protected]>
> Cc: "Peter Zijlstra (Intel)" <[email protected]>
> Cc: Chris von Recklinghausen <[email protected]>
> Cc: Jason Baron <[email protected]>
> Cc: Scott Wood <[email protected]>
> Cc: Marcelo Tosatti <[email protected]>
> Cc: Clark Williams <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> arch/x86/include/asm/jump_label.h | 2 +
> arch/x86/include/asm/text-patching.h | 9 +++
> arch/x86/kernel/alternative.c | 83 +++++++++++++++++++++++++---
> arch/x86/kernel/jump_label.c | 54 ++++++++++++++++++
> include/linux/jump_label.h | 5 ++
> kernel/jump_label.c | 15 +++++
> 6 files changed, 161 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
> index 8c0de4282659..d61c476046fe 100644
> --- a/arch/x86/include/asm/jump_label.h
> +++ b/arch/x86/include/asm/jump_label.h
> @@ -15,6 +15,8 @@
> #error asm/jump_label.h included on a non-jump-label kernel
> #endif
>
> +#define HAVE_JUMP_LABEL_BATCH
> +
> #define JUMP_LABEL_NOP_SIZE 5
>
> #ifdef CONFIG_X86_64
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index e85ff65c43c3..a28230f09d72 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -18,6 +18,14 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
> #define __parainstructions_end NULL
> #endif
>
> +struct text_to_poke {
> + struct list_head list;
> + void *opcode;
> + void *addr;
> + void *handler;
> + size_t len;
> +};
> +
> extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>
> /*
> @@ -37,6 +45,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
> extern void *text_poke(void *addr, const void *opcode, size_t len);
> extern int poke_int3_handler(struct pt_regs *regs);
> extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> +extern void text_poke_bp_list(struct list_head *entry_list);
> extern int after_bootmem;
>
> #endif /* _ASM_X86_TEXT_PATCHING_H */
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index a4c83cb49cd0..3bd502ea4c53 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -735,9 +735,12 @@ static void do_sync_core(void *info)
>
> static bool bp_patching_in_progress;
> static void *bp_int3_handler, *bp_int3_addr;
> +struct list_head *bp_list;
>
> int poke_int3_handler(struct pt_regs *regs)
> {
> + void *ip;
> + struct text_to_poke *tp;
> /*
> * Having observed our INT3 instruction, we now must observe
> * bp_patching_in_progress.
> @@ -753,21 +756,38 @@ int poke_int3_handler(struct pt_regs *regs)
> if (likely(!bp_patching_in_progress))
> return 0;
>
> - if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> + if (user_mode(regs))
> return 0;
>
> - /* set up the specified breakpoint handler */
> - regs->ip = (unsigned long) bp_int3_handler;
> + /*
> + * Single poke.
> + */
> + if (bp_int3_addr) {
> + if (regs->ip == (unsigned long) bp_int3_addr) {
> + regs->ip = (unsigned long) bp_int3_handler;
> + return 1;
> + }
> + return 0;
> + }
>
> - return 1;
> + /*
> + * Batch mode.
> + */
> + ip = (void *) regs->ip - sizeof(unsigned char);
> + list_for_each_entry(tp, bp_list, list) {
> + if (ip == tp->addr) {
> + /* set up the specified breakpoint handler */
> + regs->ip = (unsigned long) tp->handler;
> + return 1;
> + }
> + }
>
> + return 0;
> }
>
> static void text_poke_bp_set_handler(void *addr, void *handler,
> unsigned char int3)
> {
> - bp_int3_handler = handler;
> - bp_int3_addr = (u8 *)addr + sizeof(int3);
> text_poke(addr, &int3, sizeof(int3));
> }
>
> @@ -812,6 +832,9 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>
> lockdep_assert_held(&text_mutex);
>
> + bp_int3_handler = handler;
> + bp_int3_addr = (u8 *)addr + sizeof(int3);
> +
> bp_patching_in_progress = true;
> /*
> * Corresponding read barrier in int3 notifier for making sure the
> @@ -841,7 +864,53 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> * the writing of the new instruction.
> */
> bp_patching_in_progress = false;
> -
> + bp_int3_handler = bp_int3_addr = 0;
> return addr;
> }
>
> +void text_poke_bp_list(struct list_head *entry_list)
> +{
> + unsigned char int3 = 0xcc;
> + int patched_all_but_first = 0;
> + struct text_to_poke *tp;
> +
> + bp_list = entry_list;
> + bp_patching_in_progress = true;
> + /*
> + * Corresponding read barrier in int3 notifier for making sure the
> + * in_progress and handler are correctly ordered wrt. patching.
> + */
> + smp_wmb();
> +
> + list_for_each_entry(tp, entry_list, list)
> + text_poke_bp_set_handler(tp->addr, tp->handler, int3);
> +
> + on_each_cpu(do_sync_core, NULL, 1);
> +
> + list_for_each_entry(tp, entry_list, list) {
> + if (tp->len - sizeof(int3) > 0) {
> + patch_all_but_first_byte(tp->addr, tp->opcode, tp->len, int3);
> + patched_all_but_first++;
> + }
> + }
> +
> + if (patched_all_but_first) {
> + /*
> + * According to Intel, this core syncing is very likely
> + * not necessary and we'd be safe even without it. But
> + * better safe than sorry (plus there's not only Intel).
> + */
> + on_each_cpu(do_sync_core, NULL, 1);
> + }
> +
> + list_for_each_entry(tp, entry_list, list)
> + patch_first_byte(tp->addr, tp->opcode, int3);
> +
> + on_each_cpu(do_sync_core, NULL, 1);
> + /*
> + * sync_core() implies an smp_mb() and orders this store against
> + * the writing of the new instruction.
> + */
> + bp_list = 0;
> + bp_patching_in_progress = false;
> +}
> diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
> index de588ff47f81..3da5af5de4d3 100644
> --- a/arch/x86/kernel/jump_label.c
> +++ b/arch/x86/kernel/jump_label.c
> @@ -12,6 +12,8 @@
> #include <linux/list.h>
> #include <linux/jhash.h>
> #include <linux/cpu.h>
> +#include <linux/slab.h>
> +#include <linux/list.h>
> #include <asm/kprobes.h>
> #include <asm/alternative.h>
> #include <asm/text-patching.h>
> @@ -139,6 +141,58 @@ void arch_jump_label_transform(struct jump_entry *entry,
> mutex_unlock(&text_mutex);
> }
>
> +LIST_HEAD(batch_list);
> +
> +void arch_jump_label_transform_queue(struct jump_entry *entry,
> + enum jump_label_type type)
> +{
> + struct text_to_poke *tp;
> +
> + /*
> + * Batch mode disabled at boot time.
> + */
> + if (early_boot_irqs_disabled)
> + goto fallback;
> +
> + /*
> + * RFC Note: I put __GFP_NOFAIL, but I could also goto fallback;
> + * thoughts?
> + */
> + tp = kzalloc(sizeof(struct text_to_poke), GFP_KERNEL | __GFP_NOFAIL);
> + tp->opcode = kzalloc(sizeof(union jump_code_union),
> + GFP_KERNEL | __GFP_NOFAIL);


I wonder if we should just set aside a page here so that we can avoid
the allocation altogether. I think the size of the text_to_poke on
x86_64 is 44 bytes, so that's 93 or so entries, which I think covers the
use-case here. If we go over that limit, we would just do things in
batches of 93. I just think it's nice to avoid memory allocations here to
avoid creating additional dependencies, although I'm not aware of any
specific ones.
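
Something along these lines, purely as a rough sketch (tp_vec, tp_opcode,
TP_VEC_MAX and tp_vec_nr are made-up names):

#define TP_VEC_MAX (PAGE_SIZE / sizeof(struct text_to_poke))

static struct text_to_poke tp_vec[TP_VEC_MAX];
static union jump_code_union tp_opcode[TP_VEC_MAX];
static int tp_vec_nr;

void arch_jump_label_transform_queue(struct jump_entry *entry,
				     enum jump_label_type type)
{
	struct text_to_poke *tp;

	/* vector full: apply what we have and start a new batch */
	if (tp_vec_nr == TP_VEC_MAX)
		arch_jump_label_transform_apply();

	tp = &tp_vec[tp_vec_nr];
	tp->opcode = &tp_opcode[tp_vec_nr];
	__jump_label_set_jump_code(entry, type, 0, tp->opcode);
	tp->addr = (void *) entry->code;
	tp->len = JUMP_LABEL_NOP_SIZE;
	tp->handler = (void *) entry->code + JUMP_LABEL_NOP_SIZE;
	list_add_tail(&tp->list, &batch_list);
	tp_vec_nr++;
}

arch_jump_label_transform_apply() would then just reset tp_vec_nr to 0 and
re-initialize batch_list instead of kfree()ing the entries.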

> +
> + __jump_label_set_jump_code(entry, type, 0, tp->opcode);
> + tp->addr = (void *) entry->code;
> + tp->len = JUMP_LABEL_NOP_SIZE;
> + tp->handler = (void *) entry->code + JUMP_LABEL_NOP_SIZE;
> +
> + list_add_tail(&tp->list, &batch_list);
> +
> + return;
> +
> +fallback:
> + arch_jump_label_transform(entry, type);
> +}
> +
> +void arch_jump_label_transform_apply(void)
> +{
> + struct text_to_poke *tp, *next;
> +
> + if (early_boot_irqs_disabled)
> + return;
> +
> + mutex_lock(&text_mutex);
> + text_poke_bp_list(&batch_list);
> + mutex_unlock(&text_mutex);
> +
> + list_for_each_entry_safe(tp, next, &batch_list, list) {
> + list_del(&tp->list);
> + kfree(tp->opcode);
> + kfree(tp);
> + }
> +}
> +
> static enum {
> JL_STATE_START,
> JL_STATE_NO_UPDATE,
> diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
> index cd3bed880ca0..2aca92e03494 100644
> --- a/include/linux/jump_label.h
> +++ b/include/linux/jump_label.h
> @@ -156,6 +156,11 @@ extern void jump_label_lock(void);
> extern void jump_label_unlock(void);
> extern void arch_jump_label_transform(struct jump_entry *entry,
> enum jump_label_type type);
> +#ifdef HAVE_JUMP_LABEL_BATCH
> +extern void arch_jump_label_transform_queue(struct jump_entry *entry,
> + enum jump_label_type type);
> +extern void arch_jump_label_transform_apply(void);
> +#endif
> extern void arch_jump_label_transform_static(struct jump_entry *entry,
> enum jump_label_type type);
> extern int jump_label_text_reserved(void *start, void *end);
> diff --git a/kernel/jump_label.c b/kernel/jump_label.c
> index 940ba7819c87..f534d9c4e07f 100644
> --- a/kernel/jump_label.c
> +++ b/kernel/jump_label.c
> @@ -377,6 +377,7 @@ bool jump_label_can_update_check(struct jump_entry *entry)
> return 0;
> }
>
> +#ifndef HAVE_JUMP_LABEL_BATCH
> static void __jump_label_update(struct static_key *key,
> struct jump_entry *entry,
> struct jump_entry *stop)
> @@ -386,6 +387,20 @@ static void __jump_label_update(struct static_key *key,
> arch_jump_label_transform(entry, jump_label_type(entry));
> }
> }
> +#else
> +static void __jump_label_update(struct static_key *key,
> + struct jump_entry *entry,
> + struct jump_entry *stop)
> +{
> + for_each_label_entry(key, entry, stop) {
> + if (jump_label_can_update_check(entry))
> + arch_jump_label_transform_queue(entry,
> + jump_label_type(entry));
> + }

So this could be done in batches if there are more entries than
PAGE_SIZE / sizeof(struct text_to_poke)


> + arch_jump_label_transform_apply();
> +
> +}
> +#endif
>
> void __init jump_label_init(void)
> {
>

Subject: Re: [RFC PATCH 6/6] x86/jump_label,x86/alternatives: Batch jump label transformations

On 10/8/18 4:33 PM, Jason Baron wrote:
> On 10/08/2018 08:53 AM, Daniel Bristot de Oliveira wrote:
>> A static key, changing from enabled->disabled/disabled->enabled causes
>> the code to be changed, and this is done in three steps:
>>
>> -- Pseudo-code #1 - Current implementation ---
>> For each key to be updated:
>> 1) add an int3 trap to the address that will be patched
>> sync cores (send IPI to all other CPUs)
>> 2) update all but the first byte of the patched range
>> sync cores (send IPI to all other CPUs)
>> 3) replace the first byte (int3) by the first byte of replacing opcode
>> sync cores (send IPI to all other CPUs)
>> -- Pseudo-code #1 ---
>>
>> The number of IPIs sent is then linear with regard to the number 'n' of
>> entries of a key: O(n*3), which is O(n). For instance, as the static key
>> netstamp_needed_key has four entries (used in for places in the code)
>> in our kernel, 3 IPIs were generated for each entry, resulting in 12 IPIs.
>>
>> This algorithm works fine for the update of a single key. But we think
>> it is possible to optimize the case in which a static key has more than
>> one entry. For instance, the sched_schedstats jump label has 56 entries
>> in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
>> which the thread that is enabling is _not_ running.
>>
>> In this patch, rather than doing each updated at once, it queue all
>> updates first, and the, apply all updates at once, rewriting the
>> pseudo-code #1 in this way:
>>
>> -- Pseudo-code #2 - This patch ---
>> 1) for each key in the queue:
>> add an int3 trap to the address that will be patched
>> sync cores (send IPI to all other CPUs)
>>
>> 2) for each key in the queue:
>> update all but the first byte of the patched range
>> sync cores (send IPI to all other CPUs)
>>
>> 3) for each key in the queue:
>> replace the first byte (int3) by the first byte of replacing opcode
>>
>> sync cores (send IPI to all other CPUs)
>> -- Pseudo-code #2 - This patch ---
>>
>> Doing the update in this way, the number of IPI becomes O(3) with regard
>> to the number of keys, which is O(1).
>>
>> Currently, the jump label of a static key is transformed via the arch
>> specific function:
>>
>> void arch_jump_label_transform(struct jump_entry *entry,
>> enum jump_label_type type)
>>
>> The new approach (batch mode) uses two arch functions, the first has the
>> same arguments of the arch_jump_label_transform(), and is the function:
>>
>> void arch_jump_label_transform_queue(struct jump_entry *entry,
>> enum jump_label_type type)
>>
>> Rather than transforming the code, it adds the jump_entry in a queue of
>> entries to be updated.
>>
>> After queuing all jump_entries, the function:
>>
>> void arch_jump_label_transform_apply(void)
>>
>> Applies the changes in the queue.
>>
>> The batch of operations was:
>> Suggested-by: Scott Wood <[email protected]>
>
> Hi,
>
> We've discussed a 'batch' mode here before, and we had patches in the
> past iirc, but they never quite reached a merge-able state.

Hi Jason!

Thanks for your comments! I will try to find references to the old patches!

> I think for
> this patch, we want to separate it in 2 - the text patching code that
> now takes a list, and the jump_label code consumer. Comments below.
>

I see your point. I agree on breaking this patch into two.

>>
>> Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Borislav Petkov <[email protected]>
>> Cc: "H. Peter Anvin" <[email protected]>
>> Cc: Greg Kroah-Hartman <[email protected]>
>> Cc: Pavel Tatashin <[email protected]>
>> Cc: Masami Hiramatsu <[email protected]>
>> Cc: "Steven Rostedt (VMware)" <[email protected]>
>> Cc: Zhou Chengming <[email protected]>
>> Cc: Jiri Kosina <[email protected]>
>> Cc: Josh Poimboeuf <[email protected]>
>> Cc: "Peter Zijlstra (Intel)" <[email protected]>
>> Cc: Chris von Recklinghausen <[email protected]>
>> Cc: Jason Baron <[email protected]>
>> Cc: Scott Wood <[email protected]>
>> Cc: Marcelo Tosatti <[email protected]>
>> Cc: Clark Williams <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> ---
>> arch/x86/include/asm/jump_label.h | 2 +
>> arch/x86/include/asm/text-patching.h | 9 +++
>> arch/x86/kernel/alternative.c | 83 +++++++++++++++++++++++++---
>> arch/x86/kernel/jump_label.c | 54 ++++++++++++++++++
>> include/linux/jump_label.h | 5 ++
>> kernel/jump_label.c | 15 +++++
>> 6 files changed, 161 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
>> index 8c0de4282659..d61c476046fe 100644
>> --- a/arch/x86/include/asm/jump_label.h
>> +++ b/arch/x86/include/asm/jump_label.h
>> @@ -15,6 +15,8 @@
>> #error asm/jump_label.h included on a non-jump-label kernel
>> #endif
>>
>> +#define HAVE_JUMP_LABEL_BATCH
>> +
>> #define JUMP_LABEL_NOP_SIZE 5
>>
>> #ifdef CONFIG_X86_64
>> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
>> index e85ff65c43c3..a28230f09d72 100644
>> --- a/arch/x86/include/asm/text-patching.h
>> +++ b/arch/x86/include/asm/text-patching.h
>> @@ -18,6 +18,14 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
>> #define __parainstructions_end NULL
>> #endif
>>
>> +struct text_to_poke {
>> + struct list_head list;
>> + void *opcode;
>> + void *addr;
>> + void *handler;
>> + size_t len;
>> +};
>> +
>> extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>>
>> /*
>> @@ -37,6 +45,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>> extern void *text_poke(void *addr, const void *opcode, size_t len);
>> extern int poke_int3_handler(struct pt_regs *regs);
>> extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
>> +extern void text_poke_bp_list(struct list_head *entry_list);
>> extern int after_bootmem;
>>
>> #endif /* _ASM_X86_TEXT_PATCHING_H */
>> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
>> index a4c83cb49cd0..3bd502ea4c53 100644
>> --- a/arch/x86/kernel/alternative.c
>> +++ b/arch/x86/kernel/alternative.c
>> @@ -735,9 +735,12 @@ static void do_sync_core(void *info)
>>
>> static bool bp_patching_in_progress;
>> static void *bp_int3_handler, *bp_int3_addr;
>> +struct list_head *bp_list;
>>
>> int poke_int3_handler(struct pt_regs *regs)
>> {
>> + void *ip;
>> + struct text_to_poke *tp;
>> /*
>> * Having observed our INT3 instruction, we now must observe
>> * bp_patching_in_progress.
>> @@ -753,21 +756,38 @@ int poke_int3_handler(struct pt_regs *regs)
>> if (likely(!bp_patching_in_progress))
>> return 0;
>>
>> - if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
>> + if (user_mode(regs))
>> return 0;
>>
>> - /* set up the specified breakpoint handler */
>> - regs->ip = (unsigned long) bp_int3_handler;
>> + /*
>> + * Single poke.
>> + */
>> + if (bp_int3_addr) {
>> + if (regs->ip == (unsigned long) bp_int3_addr) {
>> + regs->ip = (unsigned long) bp_int3_handler;
>> + return 1;
>> + }
>> + return 0;
>> + }
>>
>> - return 1;
>> + /*
>> + * Batch mode.
>> + */
>> + ip = (void *) regs->ip - sizeof(unsigned char);
>> + list_for_each_entry(tp, bp_list, list) {
>> + if (ip == tp->addr) {
>> + /* set up the specified breakpoint handler */
>> + regs->ip = (unsigned long) tp->handler;
>> + return 1;
>> + }
>> + }
>>
>> + return 0;
>> }
>>
>> static void text_poke_bp_set_handler(void *addr, void *handler,
>> unsigned char int3)
>> {
>> - bp_int3_handler = handler;
>> - bp_int3_addr = (u8 *)addr + sizeof(int3);
>> text_poke(addr, &int3, sizeof(int3));
>> }
>>
>> @@ -812,6 +832,9 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>>
>> lockdep_assert_held(&text_mutex);
>>
>> + bp_int3_handler = handler;
>> + bp_int3_addr = (u8 *)addr + sizeof(int3);
>> +
>> bp_patching_in_progress = true;
>> /*
>> * Corresponding read barrier in int3 notifier for making sure the
>> @@ -841,7 +864,53 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>> * the writing of the new instruction.
>> */
>> bp_patching_in_progress = false;
>> -
>> + bp_int3_handler = bp_int3_addr = 0;
>> return addr;
>> }
>>
>> +void text_poke_bp_list(struct list_head *entry_list)
>> +{
>> + unsigned char int3 = 0xcc;
>> + int patched_all_but_first = 0;
>> + struct text_to_poke *tp;
>> +
>> + bp_list = entry_list;
>> + bp_patching_in_progress = true;
>> + /*
>> + * Corresponding read barrier in int3 notifier for making sure the
>> + * in_progress and handler are correctly ordered wrt. patching.
>> + */
>> + smp_wmb();
>> +
>> + list_for_each_entry(tp, entry_list, list)
>> + text_poke_bp_set_handler(tp->addr, tp->handler, int3);
>> +
>> + on_each_cpu(do_sync_core, NULL, 1);
>> +
>> + list_for_each_entry(tp, entry_list, list) {
>> + if (tp->len - sizeof(int3) > 0) {
>> + patch_all_but_first_byte(tp->addr, tp->opcode, tp->len, int3);
>> + patched_all_but_first++;
>> + }
>> + }
>> +
>> + if (patched_all_but_first) {
>> + /*
>> + * According to Intel, this core syncing is very likely
>> + * not necessary and we'd be safe even without it. But
>> + * better safe than sorry (plus there's not only Intel).
>> + */
>> + on_each_cpu(do_sync_core, NULL, 1);
>> + }
>> +
>> + list_for_each_entry(tp, entry_list, list)
>> + patch_first_byte(tp->addr, tp->opcode, int3);
>> +
>> + on_each_cpu(do_sync_core, NULL, 1);
>> + /*
>> + * sync_core() implies an smp_mb() and orders this store against
>> + * the writing of the new instruction.
>> + */
>> + bp_list = 0;
>> + bp_patching_in_progress = false;
>> +}
>> diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
>> index de588ff47f81..3da5af5de4d3 100644
>> --- a/arch/x86/kernel/jump_label.c
>> +++ b/arch/x86/kernel/jump_label.c
>> @@ -12,6 +12,8 @@
>> #include <linux/list.h>
>> #include <linux/jhash.h>
>> #include <linux/cpu.h>
>> +#include <linux/slab.h>
>> +#include <linux/list.h>
>> #include <asm/kprobes.h>
>> #include <asm/alternative.h>
>> #include <asm/text-patching.h>
>> @@ -139,6 +141,58 @@ void arch_jump_label_transform(struct jump_entry *entry,
>> mutex_unlock(&text_mutex);
>> }
>>
>> +LIST_HEAD(batch_list);
>> +
>> +void arch_jump_label_transform_queue(struct jump_entry *entry,
>> + enum jump_label_type type)
>> +{
>> + struct text_to_poke *tp;
>> +
>> + /*
>> + * Batch mode disabled at boot time.
>> + */
>> + if (early_boot_irqs_disabled)
>> + goto fallback;
>> +
>> + /*
>> + * RFC Note: I put __GFP_NOFAIL, but I could also goto fallback;
>> + * thoughts?
>> + */
>> + tp = kzalloc(sizeof(struct text_to_poke), GFP_KERNEL | __GFP_NOFAIL);
>> + tp->opcode = kzalloc(sizeof(union jump_code_union),
>> + GFP_KERNEL | __GFP_NOFAIL);
>
>
> I wonder if we should just set aside a page here so that we can avoid
> the allocation altogether. I think the size of the text_to_poke on
> x86_64 is 44 bytes, so that's 93 or so entries, which I think covers the
> use-case here. If we go over that limit, we would just do things in
> batches of 93. I just think it's nice to avoid memory allocations here to
> avoid creating additional dependencies, although I'm not aware of any
> specific ones.

Yeah, the memory allocation is the "weak" point. In the initial
implementation, I was passing all the arguments from

__jump_label_update()

to the arch code. But I ended up mixing a lot of non-arch with arch
code. It was ugly.

Then, I created the functions the way they are now, but I was putting the
entries in a statically allocated vector. It worked fine, but it was not
efficient w.r.t. memory.

I decided to try using the list with memory allocation because it would
not "waste" memory and would not impose a limit on the number of entries
(which is better for maintenance...). The downsides of the memory
allocation are:

1) The alloc/list/free costs;
2) What to do when there is no memory.

Point 1) is not as bad as it seems, because:

a) it runs in preemptible/irqs-enabled context, so it does not cause
latency in the -rt case;
b) it is in the "absolute slow path" - as mentioned in [1];
c) even with these operations, this method is faster when more than one
key is being updated - and that is the performance-sensitive case,
machine-wise.

Regarding performance, I tested this patch on -rt as well, and it did not
add latency (more tests are welcome). Rather, I saw a reduction in the
average latency on all CPUs.

Regarding 2), I used __GFP_NOFAIL because it looks cleaner. But I also had
a version in which, if the system fails to allocate memory, the function
simply falls back to the regular case via the goto fallback.
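
Just to illustrate, that variant looked roughly like this (untested
sketch; it is the same function shape as in this patch, only dropping
__GFP_NOFAIL and failing over to the single-entry path):

void arch_jump_label_transform_queue(struct jump_entry *entry,
				     enum jump_label_type type)
{
	struct text_to_poke *tp;

	/*
	 * Batch mode disabled at boot time.
	 */
	if (early_boot_irqs_disabled)
		goto fallback;

	tp = kzalloc(sizeof(*tp), GFP_KERNEL);
	if (!tp)
		goto fallback;

	tp->opcode = kzalloc(sizeof(union jump_code_union), GFP_KERNEL);
	if (!tp->opcode) {
		kfree(tp);
		goto fallback;
	}

	__jump_label_set_jump_code(entry, type, 0, tp->opcode);
	tp->addr = (void *) entry->code;
	tp->len = JUMP_LABEL_NOP_SIZE;
	tp->handler = (void *) entry->code + JUMP_LABEL_NOP_SIZE;

	list_add_tail(&tp->list, &batch_list);

	return;

fallback:
	/* No memory (or too early in boot): patch this entry right away. */
	arch_jump_label_transform(entry, type);
}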

Anyway, I agree that this is the questionable point of this patch (that is
my opinion as well), and I would like to hear people's opinions about it.

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/jump_label.h#n56

>> +
>> + __jump_label_set_jump_code(entry, type, 0, tp->opcode);
>> + tp->addr = (void *) entry->code;
>> + tp->len = JUMP_LABEL_NOP_SIZE;
>> + tp->handler = (void *) entry->code + JUMP_LABEL_NOP_SIZE;
>> +
>> + list_add_tail(&tp->list, &batch_list);
>> +
>> + return;
>> +
>> +fallback:
>> + arch_jump_label_transform(entry, type);
>> +}
>> +
>> +void arch_jump_label_transform_apply(void)
>> +{
>> + struct text_to_poke *tp, *next;
>> +
>> + if (early_boot_irqs_disabled)
>> + return;
>> +
>> + mutex_lock(&text_mutex);
>> + text_poke_bp_list(&batch_list);
>> + mutex_unlock(&text_mutex);
>> +
>> + list_for_each_entry_safe(tp, next, &batch_list, list) {
>> + list_del(&tp->list);
>> + kfree(tp->opcode);
>> + kfree(tp);
>> + }
>> +}
>> +
>> static enum {
>> JL_STATE_START,
>> JL_STATE_NO_UPDATE,
>> diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
>> index cd3bed880ca0..2aca92e03494 100644
>> --- a/include/linux/jump_label.h
>> +++ b/include/linux/jump_label.h
>> @@ -156,6 +156,11 @@ extern void jump_label_lock(void);
>> extern void jump_label_unlock(void);
>> extern void arch_jump_label_transform(struct jump_entry *entry,
>> enum jump_label_type type);
>> +#ifdef HAVE_JUMP_LABEL_BATCH
>> +extern void arch_jump_label_transform_queue(struct jump_entry *entry,
>> + enum jump_label_type type);
>> +extern void arch_jump_label_transform_apply(void);
>> +#endif
>> extern void arch_jump_label_transform_static(struct jump_entry *entry,
>> enum jump_label_type type);
>> extern int jump_label_text_reserved(void *start, void *end);
>> diff --git a/kernel/jump_label.c b/kernel/jump_label.c
>> index 940ba7819c87..f534d9c4e07f 100644
>> --- a/kernel/jump_label.c
>> +++ b/kernel/jump_label.c
>> @@ -377,6 +377,7 @@ bool jump_label_can_update_check(struct jump_entry *entry)
>> return 0;
>> }
>>
>> +#ifndef HAVE_JUMP_LABEL_BATCH
>> static void __jump_label_update(struct static_key *key,
>> struct jump_entry *entry,
>> struct jump_entry *stop)
>> @@ -386,6 +387,20 @@ static void __jump_label_update(struct static_key *key,
>> arch_jump_label_transform(entry, jump_label_type(entry));
>> }
>> }
>> +#else
>> +static void __jump_label_update(struct static_key *key,
>> + struct jump_entry *entry,
>> + struct jump_entry *stop)
>> +{
>> + for_each_label_entry(key, entry, stop) {
>> + if (jump_label_can_update_check(entry))
>> + arch_jump_label_transform_queue(entry,
>> + jump_label_type(entry));
>> + }
>
> So this could be done in batches if there are more entries than
> PAGE_SIZE / sizeof(struct text_to_poke)

Yeah, but that was one of the reasons for me to try the alloc/list
approach. As this is the slow path, maintenance and clarity of the code
are a consideration too.

Anyway, I am not against doing the static allocation! If people agree this
is the way to go, I will go for it as well.
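
If we go that way, I am thinking of something like this (untested sketch;
TP_VEC_MAX, tp_vec and tp_vec_nr are just illustrative names, and it
assumes struct text_to_poke embeds the opcode and drops the list_head,
since the entries no longer live on a list):

/* One page worth of entries, set aside at build time. */
#define TP_VEC_MAX	(PAGE_SIZE / sizeof(struct text_to_poke))

static struct text_to_poke tp_vec[TP_VEC_MAX];
static int tp_vec_nr;

void arch_jump_label_transform_queue(struct jump_entry *entry,
				     enum jump_label_type type)
{
	struct text_to_poke *tp;

	/* Batch mode disabled at boot time. */
	if (early_boot_irqs_disabled) {
		arch_jump_label_transform(entry, type);
		return;
	}

	/* Vector full: flush what is queued so far and start over. */
	if (tp_vec_nr == TP_VEC_MAX)
		arch_jump_label_transform_apply();

	tp = &tp_vec[tp_vec_nr++];

	__jump_label_set_jump_code(entry, type, 0, &tp->opcode);
	tp->addr = (void *) entry->code;
	tp->len = JUMP_LABEL_NOP_SIZE;
	tp->handler = (void *) entry->code + JUMP_LABEL_NOP_SIZE;
}

arch_jump_label_transform_apply() would then walk tp_vec[0 .. tp_vec_nr)
and reset tp_vec_nr to zero, instead of freeing list entries.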

-- Daniel

>
>> + arch_jump_label_transform_apply();
>> +
>> +}
>> +#endif
>>
>> void __init jump_label_init(void)
>> {
>>


Date: 2018-10-10 20:18:50
From: Steven Rostedt

Subject: Re: [RFC PATCH 6/6] x86/jump_label,x86/alternatives: Batch jump label transformations

On Mon, 8 Oct 2018 14:53:05 +0200
Daniel Bristot de Oliveira <[email protected]> wrote:

> diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
> index 8c0de4282659..d61c476046fe 100644
> --- a/arch/x86/include/asm/jump_label.h
> +++ b/arch/x86/include/asm/jump_label.h
> @@ -15,6 +15,8 @@
> #error asm/jump_label.h included on a non-jump-label kernel
> #endif
>
> +#define HAVE_JUMP_LABEL_BATCH
> +
> #define JUMP_LABEL_NOP_SIZE 5
>
> #ifdef CONFIG_X86_64
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index e85ff65c43c3..a28230f09d72 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -18,6 +18,14 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
> #define __parainstructions_end NULL
> #endif
>
> +struct text_to_poke {
> + struct list_head list;
> + void *opcode;
> + void *addr;
> + void *handler;
> + size_t len;
> +};
> +
> extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>
> /*
> @@ -37,6 +45,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
> extern void *text_poke(void *addr, const void *opcode, size_t len);
> extern int poke_int3_handler(struct pt_regs *regs);
> extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> +extern void text_poke_bp_list(struct list_head *entry_list);
> extern int after_bootmem;
>
> #endif /* _ASM_X86_TEXT_PATCHING_H */
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index a4c83cb49cd0..3bd502ea4c53 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -735,9 +735,12 @@ static void do_sync_core(void *info)
>
> static bool bp_patching_in_progress;
> static void *bp_int3_handler, *bp_int3_addr;
> +struct list_head *bp_list;
>
> int poke_int3_handler(struct pt_regs *regs)
> {
> + void *ip;
> + struct text_to_poke *tp;
> /*
> * Having observed our INT3 instruction, we now must observe
> * bp_patching_in_progress.
> @@ -753,21 +756,38 @@ int poke_int3_handler(struct pt_regs *regs)
> if (likely(!bp_patching_in_progress))
> return 0;
>
> - if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> + if (user_mode(regs))
> return 0;
>
> - /* set up the specified breakpoint handler */
> - regs->ip = (unsigned long) bp_int3_handler;
> + /*
> + * Single poke.
> + */
> + if (bp_int3_addr) {
> + if (regs->ip == (unsigned long) bp_int3_addr) {
> + regs->ip = (unsigned long) bp_int3_handler;
> + return 1;
> + }
> + return 0;
> + }
>
> - return 1;
> + /*
> + * Batch mode.
> + */
> + ip = (void *) regs->ip - sizeof(unsigned char);
> + list_for_each_entry(tp, bp_list, list) {
> + if (ip == tp->addr) {
> + /* set up the specified breakpoint handler */
> + regs->ip = (unsigned long) tp->handler;

I hate this loop. It makes hitting the static branch (which is probably in
a fast path, otherwise why use static branches?) do an O(n) walk over the
batch entries. You may not have that many, but why not optimize it?

Can we make an array of the handlers, in sorted order, and then simply do
a binary search for the ip involved? That turns this from O(n) into
O(log2(n)).

As Jason mentioned, if you set aside a page for processing batches of up
to PAGE_SIZE / sizeof(struct text_to_poke) entries, you can also keep it
sorted and replace the list iteration with a binary search.

-- Steve


> + return 1;
> + }
> + }
>
> + return 0;
> }
>
> static void text_poke_bp_set_handler(void *addr, void *handler,
> unsigned char int3)
> {
> - bp_int3_handler = handler;
> - bp_int3_addr = (u8 *)addr + sizeof(int3);
> text_poke(addr, &int3, sizeof(int3));
> }
>
> @@ -812,6 +832,9 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>
> lockdep_assert_held(&text_mutex);
>
> + bp_int3_handler = handler;
> + bp_int3_addr = (u8 *)addr + sizeof(int3);
> +
> bp_patching_in_progress = true;
> /*
> * Corresponding read barrier in int3 notifier for making sure the
> @@ -841,7 +864,53 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> * the writing of the new instruction.
> */
> bp_patching_in_progress = false;
> -
> + bp_int3_handler = bp_int3_addr = 0;
> return addr;
> }
>
> +void text_poke_bp_list(struct list_head *entry_list)
> +{
> + unsigned char int3 = 0xcc;
> + int patched_all_but_first = 0;
> + struct text_to_poke *tp;
> +
> + bp_list = entry_list;
> + bp_patching_in_progress = true;
> + /*
> + * Corresponding read barrier in int3 notifier for making sure the
> + * in_progress and handler are correctly ordered wrt. patching.
> + */
> + smp_wmb();
> +
> + list_for_each_entry(tp, entry_list, list)
> + text_poke_bp_set_handler(tp->addr, tp->handler, int3);
> +
> + on_each_cpu(do_sync_core, NULL, 1);
> +
> + list_for_each_entry(tp, entry_list, list) {
> + if (tp->len - sizeof(int3) > 0) {
> + patch_all_but_first_byte(tp->addr, tp->opcode, tp->len, int3);
> + patched_all_but_first++;
> + }
> + }
> +
> + if (patched_all_but_first) {
> + /*
> + * According to Intel, this core syncing is very likely
> + * not necessary and we'd be safe even without it. But
> + * better safe than sorry (plus there's not only Intel).
> + */
> + on_each_cpu(do_sync_core, NULL, 1);
> + }
> +
> + list_for_each_entry(tp, entry_list, list)
> + patch_first_byte(tp->addr, tp->opcode, int3);
> +
> + on_each_cpu(do_sync_core, NULL, 1);
> + /*
> + * sync_core() implies an smp_mb() and orders this store against
> + * the writing of the new instruction.
> + */
> + bp_list = 0;
> + bp_patching_in_progress = false;
> +}

Subject: Re: [RFC PATCH 6/6] x86/jump_label,x86/alternatives: Batch jump label transformations

On 10/10/18 10:17 PM, Steven Rostedt wrote:
> On Mon, 8 Oct 2018 14:53:05 +0200
> Daniel Bristot de Oliveira <[email protected]> wrote:
>
>> diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
>> index 8c0de4282659..d61c476046fe 100644
>> --- a/arch/x86/include/asm/jump_label.h
>> +++ b/arch/x86/include/asm/jump_label.h
>> @@ -15,6 +15,8 @@
>> #error asm/jump_label.h included on a non-jump-label kernel
>> #endif
>>
>> +#define HAVE_JUMP_LABEL_BATCH
>> +
>> #define JUMP_LABEL_NOP_SIZE 5
>>
>> #ifdef CONFIG_X86_64
>> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
>> index e85ff65c43c3..a28230f09d72 100644
>> --- a/arch/x86/include/asm/text-patching.h
>> +++ b/arch/x86/include/asm/text-patching.h
>> @@ -18,6 +18,14 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
>> #define __parainstructions_end NULL
>> #endif
>>
>> +struct text_to_poke {
>> + struct list_head list;
>> + void *opcode;
>> + void *addr;
>> + void *handler;
>> + size_t len;
>> +};
>> +
>> extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>>
>> /*
>> @@ -37,6 +45,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>> extern void *text_poke(void *addr, const void *opcode, size_t len);
>> extern int poke_int3_handler(struct pt_regs *regs);
>> extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
>> +extern void text_poke_bp_list(struct list_head *entry_list);
>> extern int after_bootmem;
>>
>> #endif /* _ASM_X86_TEXT_PATCHING_H */
>> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
>> index a4c83cb49cd0..3bd502ea4c53 100644
>> --- a/arch/x86/kernel/alternative.c
>> +++ b/arch/x86/kernel/alternative.c
>> @@ -735,9 +735,12 @@ static void do_sync_core(void *info)
>>
>> static bool bp_patching_in_progress;
>> static void *bp_int3_handler, *bp_int3_addr;
>> +struct list_head *bp_list;
>>
>> int poke_int3_handler(struct pt_regs *regs)
>> {
>> + void *ip;
>> + struct text_to_poke *tp;
>> /*
>> * Having observed our INT3 instruction, we now must observe
>> * bp_patching_in_progress.
>> @@ -753,21 +756,38 @@ int poke_int3_handler(struct pt_regs *regs)
>> if (likely(!bp_patching_in_progress))
>> return 0;
>>
>> - if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
>> + if (user_mode(regs))
>> return 0;
>>
>> - /* set up the specified breakpoint handler */
>> - regs->ip = (unsigned long) bp_int3_handler;
>> + /*
>> + * Single poke.
>> + */
>> + if (bp_int3_addr) {
>> + if (regs->ip == (unsigned long) bp_int3_addr) {
>> + regs->ip = (unsigned long) bp_int3_handler;
>> + return 1;
>> + }
>> + return 0;
>> + }
>>
>> - return 1;
>> + /*
>> + * Batch mode.
>> + */
>> + ip = (void *) regs->ip - sizeof(unsigned char);
>> + list_for_each_entry(tp, bp_list, list) {
>> + if (ip == tp->addr) {
>> + /* set up the specified breakpoint handler */
>> + regs->ip = (unsigned long) tp->handler;
>
> I hate this loop. It makes hitting the static branch (which is probably in
> a fast path, otherwise why use static branches?) do an O(n) walk over the
> batch entries. You may not have that many, but why not optimize it?
>
> Can we make an array of the handlers, in sorted order, and then simply do
> a binary search for the ip involved? That turns this from O(n) into
> O(log2(n)).
>
> As Jason mentioned, if you set aside a page for processing batches of up
> to PAGE_SIZE / sizeof(struct text_to_poke) entries, you can also keep it
> sorted and replace the list iteration with a binary search.

Hi Steven!

I am convinced! I am working on v2 now, which will:

Split the batch mode into two patches (Jason)
Use a page set aside rather than allocating memory (Jason)
Use a sorted vector of entries with a binary search in the int3 handler (Steven)

I will also do some performance tests on the int3 handler (as peterz
suggested on IRC).
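
For the int3 handler side, I have something along these lines in mind
(untested sketch; tp_vec/tp_vec_nr are just illustrative names for the
sorted batch vector, it needs <linux/bsearch.h>, and the single-poke case
would simply become a one-entry batch):

/* Compare the faulting address against a batch entry, for bsearch(). */
static int patch_cmp(const void *key, const void *elt)
{
	const struct text_to_poke *tp = elt;

	if (key < tp->addr)
		return -1;
	if (key > tp->addr)
		return 1;
	return 0;
}

int poke_int3_handler(struct pt_regs *regs)
{
	struct text_to_poke *tp;
	void *ip;

	/*
	 * Having observed our INT3 instruction, we now must observe
	 * bp_patching_in_progress.
	 */
	smp_rmb();

	if (likely(!bp_patching_in_progress))
		return 0;

	if (user_mode(regs))
		return 0;

	/* The int3 already executed, so regs->ip points one byte past it. */
	ip = (void *) regs->ip - sizeof(unsigned char);

	/* Binary search over the batch, which is sorted by address. */
	tp = bsearch(ip, tp_vec, tp_vec_nr, sizeof(struct text_to_poke),
		     patch_cmp);
	if (!tp)
		return 0;

	/* set up the specified breakpoint handler */
	regs->ip = (unsigned long) tp->handler;

	return 1;
}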

Thanks!

-- Daniel