From: Daniel Bristot de Oliveira
To: linux-kernel@vger.kernel.org, x86@kernel.org
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, "H. Peter Anvin",
    Greg Kroah-Hartman, Pavel Tatashin, Masami Hiramatsu,
    "Steven Rostedt (VMware)", Zhou Chengming, Jiri Kosina, Josh Poimboeuf,
    "Peter Zijlstra (Intel)", Chris von Recklinghausen, Jason Baron,
    Scott Wood, Marcelo Tosatti, Clark Williams
Subject: [RFC PATCH 6/6] x86/jump_label,x86/alternatives: Batch jump label transformations
Date: Mon, 8 Oct 2018 14:53:05 +0200

A static key, changing from enabled->disabled or from disabled->enabled,
causes the code to be changed, and this is done in three steps:

-- Pseudo-code #1 - Current implementation ---
For each key to be updated:
	1) add an int3 trap to the address that will be patched
	    sync cores (send IPI to all other CPUs)
	2) update all but the first byte of the patched range
	    sync cores (send IPI to all other CPUs)
	3) replace the first byte (int3) with the first byte of the replacing opcode
	    sync cores (send IPI to all other CPUs)
-- Pseudo-code #1 ---

The number of IPIs sent is then linear with regard to the number 'n' of
entries of a key: O(n*3), which is O(n). For instance, as the static key
netstamp_needed_key has four entries (it is used in four places in the code)
in our kernel, 3 IPIs were generated for each entry, resulting in 12 IPIs.

This algorithm works fine for the update of a single key. But we think it is
possible to optimize the case in which a static key has more than one entry.
For instance, the sched_schedstats jump label has 56 entries in my (updated)
Fedora kernel, resulting in 168 IPIs to each CPU on which the thread that is
enabling the key is _not_ running.

Rather than doing each update at once, this patch queues all updates first
and then applies them all at once, rewriting pseudo-code #1 in this way:

-- Pseudo-code #2 - This patch ---
1) for each key in the queue:
	add an int3 trap to the address that will be patched
   sync cores (send IPI to all other CPUs)
2) for each key in the queue:
	update all but the first byte of the patched range
   sync cores (send IPI to all other CPUs)
3) for each key in the queue:
	replace the first byte (int3) with the first byte of the replacing opcode
   sync cores (send IPI to all other CPUs)
-- Pseudo-code #2 - This patch ---

Doing the update in this way, the number of IPIs becomes O(3) with regard to
the number of entries, which is O(1).

Currently, the jump label of a static key is transformed via the
arch-specific function:

	void arch_jump_label_transform(struct jump_entry *entry,
				       enum jump_label_type type)

The new approach (batch mode) uses two arch functions. The first has the same
arguments as arch_jump_label_transform():

	void arch_jump_label_transform_queue(struct jump_entry *entry,
					     enum jump_label_type type)

Rather than transforming the code, it adds the jump_entry to a queue of
entries to be updated. After queuing all jump_entries, the function:

	void arch_jump_label_transform_apply(void)

applies the changes in the queue.
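As an illustration (not part of the diff below), here is a condensed sketch of
how the generic jump label code is expected to drive the batch interface. The
helpers for_each_label_entry(), jump_label_can_update_check() and
jump_label_type() come from the earlier patches in this series, and the
wrapper name __jump_label_update_batched() is made up for the example:

	static void __jump_label_update_batched(struct static_key *key,
						struct jump_entry *entry,
						struct jump_entry *stop)
	{
		/* Queue every entry of the key; nothing is patched yet. */
		for_each_label_entry(key, entry, stop) {
			if (jump_label_can_update_check(entry))
				arch_jump_label_transform_queue(entry,
								jump_label_type(entry));
		}

		/*
		 * A single apply call walks the queue and performs the three
		 * int3 patching steps for all queued entries, so only three
		 * rounds of sync-core IPIs are sent, regardless of how many
		 * entries were queued.
		 */
		arch_jump_label_transform_apply();
	}

This is essentially what the kernel/jump_label.c hunk at the end of this patch
does when HAVE_JUMP_LABEL_BATCH is defined.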
The batch of operations was:
Suggested-by: Scott Wood
Signed-off-by: Daniel Bristot de Oliveira
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: Greg Kroah-Hartman
Cc: Pavel Tatashin
Cc: Masami Hiramatsu
Cc: "Steven Rostedt (VMware)"
Cc: Zhou Chengming
Cc: Jiri Kosina
Cc: Josh Poimboeuf
Cc: "Peter Zijlstra (Intel)"
Cc: Chris von Recklinghausen
Cc: Jason Baron
Cc: Scott Wood
Cc: Marcelo Tosatti
Cc: Clark Williams
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/jump_label.h    |  2 +
 arch/x86/include/asm/text-patching.h |  9 +++
 arch/x86/kernel/alternative.c        | 83 +++++++++++++++++++++++++---
 arch/x86/kernel/jump_label.c         | 54 ++++++++++++++++++
 include/linux/jump_label.h           |  5 ++
 kernel/jump_label.c                  | 15 +++++
 6 files changed, 161 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index 8c0de4282659..d61c476046fe 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -15,6 +15,8 @@
 #error asm/jump_label.h included on a non-jump-label kernel
 #endif
 
+#define HAVE_JUMP_LABEL_BATCH
+
 #define JUMP_LABEL_NOP_SIZE 5
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..a28230f09d72 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -18,6 +18,14 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
 #define __parainstructions_end NULL
 #endif
 
+struct text_to_poke {
+	struct list_head list;
+	void *opcode;
+	void *addr;
+	void *handler;
+	size_t len;
+};
+
 extern void *text_poke_early(void *addr, const void *opcode, size_t len);
 
 /*
@@ -37,6 +45,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern void text_poke_bp_list(struct list_head *entry_list);
 extern int after_bootmem;
 
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index a4c83cb49cd0..3bd502ea4c53 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -735,9 +735,12 @@ static void do_sync_core(void *info)
 
 static bool bp_patching_in_progress;
 static void *bp_int3_handler, *bp_int3_addr;
+struct list_head *bp_list;
 
 int poke_int3_handler(struct pt_regs *regs)
 {
+	void *ip;
+	struct text_to_poke *tp;
 	/*
 	 * Having observed our INT3 instruction, we now must observe
 	 * bp_patching_in_progress.
@@ -753,21 +756,38 @@ int poke_int3_handler(struct pt_regs *regs)
 	if (likely(!bp_patching_in_progress))
 		return 0;
 
-	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
+	if (user_mode(regs))
 		return 0;
 
-	/* set up the specified breakpoint handler */
-	regs->ip = (unsigned long) bp_int3_handler;
+	/*
+	 * Single poke.
+	 */
+	if (bp_int3_addr) {
+		if (regs->ip == (unsigned long) bp_int3_addr) {
+			regs->ip = (unsigned long) bp_int3_handler;
+			return 1;
+		}
+		return 0;
+	}
 
-	return 1;
+	/*
+	 * Batch mode.
+	 */
+	ip = (void *) regs->ip - sizeof(unsigned char);
+	list_for_each_entry(tp, bp_list, list) {
+		if (ip == tp->addr) {
+			/* set up the specified breakpoint handler */
+			regs->ip = (unsigned long) tp->handler;
+			return 1;
+		}
+	}
+	return 0;
 }
 
 static void text_poke_bp_set_handler(void *addr, void *handler,
 				     unsigned char int3)
 {
-	bp_int3_handler = handler;
-	bp_int3_addr = (u8 *)addr + sizeof(int3);
 	text_poke(addr, &int3, sizeof(int3));
 }
 
@@ -812,6 +832,9 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 
 	lockdep_assert_held(&text_mutex);
 
+	bp_int3_handler = handler;
+	bp_int3_addr = (u8 *)addr + sizeof(int3);
+
 	bp_patching_in_progress = true;
 	/*
 	 * Corresponding read barrier in int3 notifier for making sure the
@@ -841,7 +864,53 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 	 * the writing of the new instruction.
 	 */
 	bp_patching_in_progress = false;
-
+	bp_int3_handler = bp_int3_addr = 0;
 	return addr;
 }
 
+void text_poke_bp_list(struct list_head *entry_list)
+{
+	unsigned char int3 = 0xcc;
+	int patched_all_but_first = 0;
+	struct text_to_poke *tp;
+
+	bp_list = entry_list;
+	bp_patching_in_progress = true;
+	/*
+	 * Corresponding read barrier in int3 notifier for making sure the
+	 * in_progress and handler are correctly ordered wrt. patching.
+	 */
+	smp_wmb();
+
+	list_for_each_entry(tp, entry_list, list)
+		text_poke_bp_set_handler(tp->addr, tp->handler, int3);
+
+	on_each_cpu(do_sync_core, NULL, 1);
+
+	list_for_each_entry(tp, entry_list, list) {
+		if (tp->len - sizeof(int3) > 0) {
+			patch_all_but_first_byte(tp->addr, tp->opcode, tp->len, int3);
+			patched_all_but_first++;
+		}
+	}
+
+	if (patched_all_but_first) {
+		/*
+		 * According to Intel, this core syncing is very likely
+		 * not necessary and we'd be safe even without it. But
+		 * better safe than sorry (plus there's not only Intel).
+		 */
+		on_each_cpu(do_sync_core, NULL, 1);
+	}
+
+	list_for_each_entry(tp, entry_list, list)
+		patch_first_byte(tp->addr, tp->opcode, int3);
+
+	on_each_cpu(do_sync_core, NULL, 1);
+	/*
+	 * sync_core() implies an smp_mb() and orders this store against
+	 * the writing of the new instruction.
+	 */
+	bp_list = 0;
+	bp_patching_in_progress = false;
+}
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index de588ff47f81..3da5af5de4d3 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -12,6 +12,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
 #include
 #include
@@ -139,6 +141,58 @@ void arch_jump_label_transform(struct jump_entry *entry,
 	mutex_unlock(&text_mutex);
 }
 
+LIST_HEAD(batch_list);
+
+void arch_jump_label_transform_queue(struct jump_entry *entry,
+				     enum jump_label_type type)
+{
+	struct text_to_poke *tp;
+
+	/*
+	 * Batch mode disabled at boot time.
+	 */
+	if (early_boot_irqs_disabled)
+		goto fallback;
+
+	/*
+	 * RFC Note: I put __GFP_NOFAIL, but I could also goto fallback;
+	 * thoughts?
+	 */
+	tp = kzalloc(sizeof(struct text_to_poke), GFP_KERNEL | __GFP_NOFAIL);
+	tp->opcode = kzalloc(sizeof(union jump_code_union),
+			     GFP_KERNEL | __GFP_NOFAIL);
+
+	__jump_label_set_jump_code(entry, type, 0, tp->opcode);
+	tp->addr = (void *) entry->code;
+	tp->len = JUMP_LABEL_NOP_SIZE;
+	tp->handler = (void *) entry->code + JUMP_LABEL_NOP_SIZE;
+
+	list_add_tail(&tp->list, &batch_list);
+
+	return;
+
+fallback:
+	arch_jump_label_transform(entry, type);
+}
+
+void arch_jump_label_transform_apply(void)
+{
+	struct text_to_poke *tp, *next;
+
+	if (early_boot_irqs_disabled)
+		return;
+
+	mutex_lock(&text_mutex);
+	text_poke_bp_list(&batch_list);
+	mutex_unlock(&text_mutex);
+
+	list_for_each_entry_safe(tp, next, &batch_list, list) {
+		list_del(&tp->list);
+		kfree(tp->opcode);
+		kfree(tp);
+	}
+}
+
 static enum {
 	JL_STATE_START,
 	JL_STATE_NO_UPDATE,
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index cd3bed880ca0..2aca92e03494 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -156,6 +156,11 @@ extern void jump_label_lock(void);
 extern void jump_label_unlock(void);
 extern void arch_jump_label_transform(struct jump_entry *entry,
 				      enum jump_label_type type);
+#ifdef HAVE_JUMP_LABEL_BATCH
+extern void arch_jump_label_transform_queue(struct jump_entry *entry,
+					    enum jump_label_type type);
+extern void arch_jump_label_transform_apply(void);
+#endif
 extern void arch_jump_label_transform_static(struct jump_entry *entry,
 					     enum jump_label_type type);
 extern int jump_label_text_reserved(void *start, void *end);
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 940ba7819c87..f534d9c4e07f 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -377,6 +377,7 @@ bool jump_label_can_update_check(struct jump_entry *entry)
 	return 0;
 }
 
+#ifndef HAVE_JUMP_LABEL_BATCH
 static void __jump_label_update(struct static_key *key,
 				struct jump_entry *entry,
 				struct jump_entry *stop)
@@ -386,6 +387,20 @@ static void __jump_label_update(struct static_key *key,
 		arch_jump_label_transform(entry, jump_label_type(entry));
 	}
 }
+#else
+static void __jump_label_update(struct static_key *key,
+				struct jump_entry *entry,
+				struct jump_entry *stop)
+{
+	for_each_label_entry(key, entry, stop) {
+		if (jump_label_can_update_check(entry))
+			arch_jump_label_transform_queue(entry,
+							jump_label_type(entry));
+	}
+	arch_jump_label_transform_apply();
+
+}
+#endif
 
 void __init jump_label_init(void)
 {
-- 
2.17.1