Hi,
since the merge window is closing in and y'all are at a conference, I
thought I should take another stab at it. It being something which Ingo,
Linus and Peter have suggested in the past at least once.
Instead of calling memset:
ffffffff8100cd8d: e8 0e 15 7a 00 callq ffffffff817ae2a0 <__memset>
and having a JMP inside it depending on the feature supported, let's simply
have the REP; STOSB directly in the code:
...
ffffffff81000442: 4c 89 d7 mov %r10,%rdi
ffffffff81000445: b9 00 10 00 00 mov $0x1000,%ecx
<---- new memset
ffffffff8100044a: f3 aa rep stos %al,%es:(%rdi)
ffffffff8100044c: 90 nop
ffffffff8100044d: 90 nop
ffffffff8100044e: 90 nop
<----
ffffffff8100044f: 4c 8d 84 24 98 00 00 lea 0x98(%rsp),%r8
ffffffff81000456: 00
...
And since the majority of x86 boxes out there are Intel, they haz
X86_FEATURE_ERMS so they won't even need to alternative-patch those call
sites when booting.
In order to patch on machines which don't set X86_FEATURE_ERMS, I need
to do a "reversed" patching of sorts, i.e., patch when the x86 feature
flag is NOT set. See the below changes in alternative.c which basically
add a flags field to struct alt_instr and thus control the patching
behavior in apply_alternatives().
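In pseudo-C, the patching decision then boils down to this (a condensed
sketch of the apply_alternatives() hunk further down, modulo the
optimize_nops() handling):

bool feature_set = boot_cpu_has(a->cpuid);
bool reverse     = a->flags & ALT_FLAG_REVERSE;

/* the same check covers both entry kinds: leave the default in place ... */
if (feature_set == reverse)
        continue;

/* ... otherwise, patch in the replacement */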
The result is this:
static __always_inline void *memset(void *dest, int c, size_t n)
{
void *ret, *dummy;
asm volatile(ALTERNATIVE_2_REVERSE("rep; stosb",
"call memset_rep", X86_FEATURE_ERMS,
"call memset_orig", X86_FEATURE_REP_GOOD)
: "=&D" (ret), "=a" (dummy)
: "0" (dest), "a" (c), "c" (n)
/* clobbers used by memset_orig() and memset_rep_good() */
: "rsi", "rdx", "r8", "r9", "memory");
return dest;
}
and so in the !ERMS case, we patch in a call to the memset_rep() version
which is the old variant in memset_64.S. There we need to do some reg
shuffling because I need to map the registers from where REP; STOSB
expects them to where the x86_64 ABI wants them. Not a big deal - a push
and two moves and a pop at the end.
If X86_FEATURE_REP_GOOD is not set either, we fall back to another call
to the original unrolled memset.
The rest of the diff is me trying to untangle memset()'s definitions
from the early code too, because we include kernel proper headers there
and all kinds of crazy include hell ensues; but more on that later.
Anyway, this is just a pre-alpha version to get people's thoughts and
see whether I'm headed in the right direction or whether you guys have
better ideas.
Thx.
---
Signed-off-by: Borislav Petkov <[email protected]>
---
arch/x86/boot/compressed/Makefile | 1 +
arch/x86/boot/compressed/eboot.c | 3 +-
arch/x86/boot/compressed/misc.h | 2 +
arch/x86/boot/compressed/pgtable_64.c | 2 +-
arch/x86/boot/string.h | 2 +
arch/x86/entry/vdso/Makefile | 2 +-
arch/x86/include/asm/alternative-asm.h | 1 +
arch/x86/include/asm/alternative.h | 37 ++++++---
arch/x86/include/asm/cpufeature.h | 2 +
arch/x86/include/asm/string_64.h | 25 +++++-
arch/x86/kernel/alternative.c | 24 ++++--
arch/x86/lib/memset_64.S | 76 ++++++++-----------
.../firmware/efi/libstub/efi-stub-helper.c | 1 +
tools/objtool/special.c | 6 +-
14 files changed, 116 insertions(+), 68 deletions(-)
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 6b84afdd7538..d0f465c535f3 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -38,6 +38,7 @@ KBUILD_CFLAGS += $(call cc-option,-fno-stack-protector)
KBUILD_CFLAGS += $(call cc-disable-warning, address-of-packed-member)
KBUILD_CFLAGS += $(call cc-disable-warning, gnu)
KBUILD_CFLAGS += -Wno-pointer-sign
+KBUILD_CFLAGS += -D_SETUP
KBUILD_AFLAGS := $(KBUILD_CFLAGS) -D__ASSEMBLY__
GCOV_PROFILE := n
diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c
index d6662fdef300..767eb3d0b640 100644
--- a/arch/x86/boot/compressed/eboot.c
+++ b/arch/x86/boot/compressed/eboot.c
@@ -6,6 +6,8 @@
*
* ----------------------------------------------------------------------- */
+#include "../string.h"
+
#include <linux/efi.h>
#include <linux/pci.h>
@@ -14,7 +16,6 @@
#include <asm/setup.h>
#include <asm/desc.h>
-#include "../string.h"
#include "eboot.h"
static efi_system_table_t *sys_table;
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index c8181392f70d..b712c3e7273f 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -16,6 +16,8 @@
/* cpu_feature_enabled() cannot be used this early */
#define USE_EARLY_PGTABLE_L5
+#include "../string.h"
+
#include <linux/linkage.h>
#include <linux/screen_info.h>
#include <linux/elf.h>
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index c8862696a47b..7ad83d90aaa7 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -1,9 +1,9 @@
+#include "../string.h"
#include <linux/efi.h>
#include <asm/e820/types.h>
#include <asm/processor.h>
#include <asm/efi.h>
#include "pgtable.h"
-#include "../string.h"
/*
* __force_order is used by special_insns.h asm code to force instruction
diff --git a/arch/x86/boot/string.h b/arch/x86/boot/string.h
index 38d8f2f5e47e..cd970beefc5d 100644
--- a/arch/x86/boot/string.h
+++ b/arch/x86/boot/string.h
@@ -2,6 +2,8 @@
#ifndef BOOT_STRING_H
#define BOOT_STRING_H
+#include <linux/types.h>
+
/* Undef any of these macros coming from string_32.h. */
#undef memcpy
#undef memset
diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 8df549138193..3b9d7b2d20d0 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -23,7 +23,7 @@ VDSO32-$(CONFIG_X86_32) := y
VDSO32-$(CONFIG_IA32_EMULATION) := y
# files to link into the vdso
-vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o
+vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o ../../lib/memset_64.o
# files to link into kernel
obj-y += vma.o
diff --git a/arch/x86/include/asm/alternative-asm.h b/arch/x86/include/asm/alternative-asm.h
index 464034db299f..78a60c46c472 100644
--- a/arch/x86/include/asm/alternative-asm.h
+++ b/arch/x86/include/asm/alternative-asm.h
@@ -43,6 +43,7 @@
.byte \orig_len
.byte \alt_len
.byte \pad_len
+ .byte 0
.endm
/*
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 094fbc9c0b1c..2a20604b933d 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -4,6 +4,7 @@
#ifndef __ASSEMBLY__
+#include <linux/bits.h>
#include <linux/types.h>
#include <linux/stddef.h>
#include <linux/stringify.h>
@@ -55,6 +56,11 @@
".long 999b - .\n\t" \
".popsection\n\t"
+/*
+ * Patch in reverse order, when the feature bit is *NOT* set.
+ */
+#define ALT_FLAG_REVERSE BIT(0)
+
struct alt_instr {
s32 instr_offset; /* original instruction */
s32 repl_offset; /* offset to replacement instruction */
@@ -62,6 +68,7 @@ struct alt_instr {
u8 instrlen; /* length of original instruction */
u8 replacementlen; /* length of new instruction */
u8 padlen; /* length of build-time padding */
+ u8 flags; /* patching flags */
} __packed;
/*
@@ -142,13 +149,14 @@ static inline int alternatives_text_reserved(void *start, void *end)
" - (" alt_slen ")), 0x90\n" \
alt_end_marker ":\n"
-#define ALTINSTR_ENTRY(feature, num) \
+#define ALTINSTR_ENTRY(feature, num, flags) \
" .long 661b - .\n" /* label */ \
" .long " b_replacement(num)"f - .\n" /* new instruction */ \
" .word " __stringify(feature) "\n" /* feature bit */ \
" .byte " alt_total_slen "\n" /* source len */ \
" .byte " alt_rlen(num) "\n" /* replacement len */ \
- " .byte " alt_pad_len "\n" /* pad len */
+ " .byte " alt_pad_len "\n" /* pad len */ \
+ " .byte " __stringify(flags) "\n" /* flags */
#define ALTINSTR_REPLACEMENT(newinstr, feature, num) /* replacement */ \
"# ALT: replacement " #num "\n" \
@@ -158,29 +166,40 @@ static inline int alternatives_text_reserved(void *start, void *end)
#define ALTERNATIVE(oldinstr, newinstr, feature) \
OLDINSTR(oldinstr, 1) \
".pushsection .altinstructions,\"a\"\n" \
- ALTINSTR_ENTRY(feature, 1) \
+ ALTINSTR_ENTRY(feature, 1, 0) \
".popsection\n" \
".pushsection .altinstr_replacement, \"ax\"\n" \
ALTINSTR_REPLACEMENT(newinstr, feature, 1) \
".popsection\n"
-#define ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2)\
+#define ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2) \
OLDINSTR_2(oldinstr, 1, 2) \
".pushsection .altinstructions,\"a\"\n" \
- ALTINSTR_ENTRY(feature1, 1) \
- ALTINSTR_ENTRY(feature2, 2) \
+ ALTINSTR_ENTRY(feature1, 1, 0) \
+ ALTINSTR_ENTRY(feature2, 2, 0) \
".popsection\n" \
".pushsection .altinstr_replacement, \"ax\"\n" \
ALTINSTR_REPLACEMENT(newinstr1, feature1, 1) \
ALTINSTR_REPLACEMENT(newinstr2, feature2, 2) \
".popsection\n"
+#define ALTERNATIVE_2_REVERSE(def, alt1, ft1, alt2, ft2) \
+ OLDINSTR_2(def, 1, 2) \
+ ".pushsection .altinstructions,\"a\"\n" \
+ ALTINSTR_ENTRY(ft1, 1, ALT_FLAG_REVERSE) \
+ ALTINSTR_ENTRY(ft2, 2, ALT_FLAG_REVERSE) \
+ ".popsection\n" \
+ ".pushsection .altinstr_replacement, \"ax\"\n" \
+ ALTINSTR_REPLACEMENT(alt1, ft1, 1) \
+ ALTINSTR_REPLACEMENT(alt2, ft2, 2) \
+ ".popsection\n"
+
#define ALTERNATIVE_3(oldinsn, newinsn1, feat1, newinsn2, feat2, newinsn3, feat3) \
OLDINSTR_3(oldinsn, 1, 2, 3) \
".pushsection .altinstructions,\"a\"\n" \
- ALTINSTR_ENTRY(feat1, 1) \
- ALTINSTR_ENTRY(feat2, 2) \
- ALTINSTR_ENTRY(feat3, 3) \
+ ALTINSTR_ENTRY(feat1, 1, 0) \
+ ALTINSTR_ENTRY(feat2, 2, 0) \
+ ALTINSTR_ENTRY(feat3, 3, 0) \
".popsection\n" \
".pushsection .altinstr_replacement, \"ax\"\n" \
ALTINSTR_REPLACEMENT(newinsn1, feat1, 1) \
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 59bf91c57aa8..eb936caf6657 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -184,6 +184,7 @@ static __always_inline bool _static_cpu_has(u16 bit)
" .byte 3b - 1b\n" /* src len */
" .byte 5f - 4f\n" /* repl len */
" .byte 3b - 2b\n" /* pad len */
+ " .byte 0\n" /* flags */
".previous\n"
".section .altinstr_replacement,\"ax\"\n"
"4: jmp %l[t_no]\n"
@@ -196,6 +197,7 @@ static __always_inline bool _static_cpu_has(u16 bit)
" .byte 3b - 1b\n" /* src len */
" .byte 0\n" /* repl len */
" .byte 0\n" /* pad len */
+ " .byte 0\n" /* flags */
".previous\n"
".section .altinstr_aux,\"ax\"\n"
"6:\n"
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 75314c3dbe47..a242f5fbca8b 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -5,6 +5,8 @@
#ifdef __KERNEL__
#include <linux/jump_label.h>
+#include <asm/cpufeatures.h>
+
/* Written 2002 by Andi Kleen */
/* Even with __builtin_ the compiler may decide to use the out of line
@@ -15,8 +17,27 @@ extern void *memcpy(void *to, const void *from, size_t len);
extern void *__memcpy(void *to, const void *from, size_t len);
#define __HAVE_ARCH_MEMSET
-void *memset(void *s, int c, size_t n);
-void *__memset(void *s, int c, size_t n);
+extern void *memset_rep(void *s, int c, size_t n);
+extern void *memset_orig(void *s, int c, size_t n);
+
+#ifndef _SETUP
+static __always_inline void *memset(void *dest, int c, size_t n)
+{
+ void *ret, *dummy;
+
+ asm volatile(ALTERNATIVE_2_REVERSE("rep; stosb",
+ "call memset_rep", X86_FEATURE_ERMS,
+ "call memset_orig", X86_FEATURE_REP_GOOD)
+ : "=&D" (ret), "=a" (dummy)
+ : "0" (dest), "a" (c), "c" (n)
+ /* clobbers used by memset_orig() and memset_rep_good() */
+ : "rsi", "rdx", "r8", "r9", "memory");
+
+ return dest;
+}
+
+#define __memset(s, c, n) memset(s, c, n)
+#endif
#define __HAVE_ARCH_MEMSET16
static inline void *memset16(uint16_t *s, uint16_t v, size_t n)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 9d3a971ea364..9fca4fbecd4f 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -390,18 +390,32 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
replacement = (u8 *)&a->repl_offset + a->repl_offset;
BUG_ON(a->instrlen > sizeof(insn_buff));
BUG_ON(a->cpuid >= (NCAPINTS + NBUGINTS) * 32);
+
+ /*
+ * Patch only if the feature flag is present or if reverse
+ * patching is requested, i.e., patch if feature bit is *not*
+ * present.
+ */
if (!boot_cpu_has(a->cpuid)) {
- if (a->padlen > 1)
- optimize_nops(a, instr);
+ if (!(a->flags & ALT_FLAG_REVERSE)) {
+ if (a->padlen > 1)
+ optimize_nops(a, instr);
- continue;
+ continue;
+ }
}
- DPRINTK("feat: %d*32+%d, old: (%pS (%px) len: %d), repl: (%px, len: %d), pad: %d",
+ /*
+ * ... or skip if feature bit is present but reverse is set:
+ */
+ if (boot_cpu_has(a->cpuid) && (a->flags & ALT_FLAG_REVERSE))
+ continue;
+
+ DPRINTK("feat: %d*32+%d, old: (%pS (%px) len: %d), repl: (%px, len: %d), pad: %d, flags: 0x%x",
a->cpuid >> 5,
a->cpuid & 0x1f,
instr, instr, a->instrlen,
- replacement, a->replacementlen, a->padlen);
+ replacement, a->replacementlen, a->padlen, a->flags);
DUMP_BYTES(instr, a->instrlen, "%px: old_insn: ", instr);
DUMP_BYTES(replacement, a->replacementlen, "%px: rpl_insn: ", replacement);
diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index 9bc861c71e75..f1b459bb310c 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -6,32 +6,27 @@
#include <asm/alternative-asm.h>
#include <asm/export.h>
-.weak memset
-
/*
* ISO C memset - set a memory block to a byte value. This function uses fast
* string to get better performance than the original function. The code is
* simpler and shorter than the original function as well.
*
* rdi destination
- * rsi value (char)
- * rdx count (bytes)
+ * rax value (char)
+ * rcx count (bytes)
*
* rax original destination
*/
-ENTRY(memset)
-ENTRY(__memset)
- /*
- * Some CPUs support enhanced REP MOVSB/STOSB feature. It is recommended
- * to use it when possible. If not available, use fast string instructions.
- *
- * Otherwise, use original memset function.
- */
- ALTERNATIVE_2 "jmp memset_orig", "", X86_FEATURE_REP_GOOD, \
- "jmp memset_erms", X86_FEATURE_ERMS
-
- movq %rdi,%r9
- movq %rdx,%rcx
+ENTRY(memset_rep)
+ /* stash retval */
+ push %rdi
+
+ /* value to memset with is in %rax */
+ mov %rax, %rsi
+
+ /* count is in %rcx, stash it */
+ movq %rcx, %rdx
+
andl $7,%edx
shrq $3,%rcx
/* expand byte value */
@@ -41,35 +36,22 @@ ENTRY(__memset)
rep stosq
movl %edx,%ecx
rep stosb
- movq %r9,%rax
- ret
-ENDPROC(memset)
-ENDPROC(__memset)
-EXPORT_SYMBOL(memset)
-EXPORT_SYMBOL(__memset)
-/*
- * ISO C memset - set a memory block to a byte value. This function uses
- * enhanced rep stosb to override the fast string function.
- * The code is simpler and shorter than the fast string function as well.
- *
- * rdi destination
- * rsi value (char)
- * rdx count (bytes)
- *
- * rax original destination
- */
-ENTRY(memset_erms)
- movq %rdi,%r9
- movb %sil,%al
- movq %rdx,%rcx
- rep stosb
- movq %r9,%rax
+ /* restore retval */
+ pop %rax
ret
-ENDPROC(memset_erms)
+ENDPROC(memset_rep)
+EXPORT_SYMBOL(memset_rep)
ENTRY(memset_orig)
- movq %rdi,%r10
+ /* stash retval */
+ push %rdi
+
+ /* value to memset with is in %rax */
+ mov %rax, %rsi
+
+ /* count is in %rcx, stash it */
+ mov %rcx, %rdx
/* expand byte value */
movzbl %sil,%ecx
@@ -105,8 +87,8 @@ ENTRY(memset_orig)
.p2align 4
.Lhandle_tail:
movl %edx,%ecx
- andl $63&(~7),%ecx
- jz .Lhandle_7
+ andl $63 & (~7),%ecx
+ jz .Lhandle_7
shrl $3,%ecx
.p2align 4
.Lloop_8:
@@ -121,12 +103,13 @@ ENTRY(memset_orig)
.p2align 4
.Lloop_1:
decl %edx
- movb %al,(%rdi)
+ movb %al,(%rdi)
leaq 1(%rdi),%rdi
jnz .Lloop_1
.Lende:
- movq %r10,%rax
+ /* restore retval */
+ pop %rax
ret
.Lbad_alignment:
@@ -140,3 +123,4 @@ ENTRY(memset_orig)
jmp .Lafter_bad_alignment
.Lfinal:
ENDPROC(memset_orig)
+EXPORT_SYMBOL(memset_orig)
diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c b/drivers/firmware/efi/libstub/efi-stub-helper.c
index 3caae7f2cf56..a3fc93e68d5b 100644
--- a/drivers/firmware/efi/libstub/efi-stub-helper.c
+++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
@@ -9,6 +9,7 @@
#include <linux/efi.h>
#include <asm/efi.h>
+#include <asm/string.h>
#include "efistub.h"
diff --git a/tools/objtool/special.c b/tools/objtool/special.c
index fdbaa611146d..95f423827fbd 100644
--- a/tools/objtool/special.c
+++ b/tools/objtool/special.c
@@ -23,7 +23,7 @@
#define JUMP_ORIG_OFFSET 0
#define JUMP_NEW_OFFSET 4
-#define ALT_ENTRY_SIZE 13
+#define ALT_ENTRY_SIZE 14
#define ALT_ORIG_OFFSET 0
#define ALT_NEW_OFFSET 4
#define ALT_FEATURE_OFFSET 8
@@ -172,8 +172,8 @@ int special_get_alts(struct elf *elf, struct list_head *alts)
continue;
if (sec->len % entry->size != 0) {
- WARN("%s size not a multiple of %d",
- sec->name, entry->size);
+ WARN("%s, size %d not a multiple of %d",
+ sec->name, sec->len, entry->size);
return -1;
}
--
2.21.0
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Borislav Petkov <[email protected]> wrote:
> Hi,
>
> since the merge window is closing in and y'all are on a conference, I
> thought I should take another stab at it. It being something which Ingo,
> Linus and Peter have suggested in the past at least once.
>
> Instead of calling memset:
>
> ffffffff8100cd8d: e8 0e 15 7a 00 callq ffffffff817ae2a0 <__memset>
>
> and having a JMP inside it depending on the feature supported, let's simply
> have the REP; STOSB directly in the code:
>
> ...
> ffffffff81000442: 4c 89 d7 mov %r10,%rdi
> ffffffff81000445: b9 00 10 00 00 mov $0x1000,%ecx
>
> <---- new memset
> ffffffff8100044a: f3 aa rep stos %al,%es:(%rdi)
> ffffffff8100044c: 90 nop
> ffffffff8100044d: 90 nop
> ffffffff8100044e: 90 nop
> <----
>
> ffffffff8100044f: 4c 8d 84 24 98 00 00 lea 0x98(%rsp),%r8
> ffffffff81000456: 00
> ...
>
> And since the majority of x86 boxes out there is Intel, they haz
> X86_FEATURE_ERMS so they won't even need to alternative-patch those call
> sites when booting.
>
> In order to patch on machines which don't set X86_FEATURE_ERMS, I need
> to do a "reversed" patching of sorts, i.e., patch when the x86 feature
> flag is NOT set. See the below changes in alternative.c which basically
> add a flags field to struct alt_instr and thus control the patching
> behavior in apply_alternatives().
>
> The result is this:
>
> static __always_inline void *memset(void *dest, int c, size_t n)
> {
> void *ret, *dummy;
>
> asm volatile(ALTERNATIVE_2_REVERSE("rep; stosb",
> "call memset_rep", X86_FEATURE_ERMS,
> "call memset_orig", X86_FEATURE_REP_GOOD)
> : "=&D" (ret), "=a" (dummy)
> : "0" (dest), "a" (c), "c" (n)
> /* clobbers used by memset_orig() and memset_rep_good() */
> : "rsi", "rdx", "r8", "r9", "memory");
>
> return dest;
> }
>
> and so in the !ERMS case, we patch in a call to the memset_rep() version
> which is the old variant in memset_64.S. There we need to do some reg
> shuffling because I need to map the registers from where REP; STOSB
> expects them to where the x86_64 ABI wants them. Not a big deal - a push
> and two moves and a pop at the end.
>
> If X86_FEATURE_REP_GOOD is not set either, we fallback to another call
> to the original unrolled memset.
>
> The rest of the diff is me trying to untangle memset()'s definitions
> from the early code too because we include kernel proper headers there
> and all kinds of crazy include hell ensues but that later.
>
> Anyway, this is just a pre-alpha version to get people's thoughts and
> see whether I'm in the right direction or you guys might have better
> ideas.
That looks exciting - I'm wondering what effects this has on code
footprint - for example defconfig vmlinux code size, and what the average
per call site footprint impact is?
If the footprint effect is acceptable, then I'd expect this to improve
performance, especially in hot loops.
Thanks,
Ingo
On 13/09/2019 09.22, Borislav Petkov wrote:
>
> Instead of calling memset:
>
> ffffffff8100cd8d: e8 0e 15 7a 00 callq ffffffff817ae2a0 <__memset>
>
> and having a JMP inside it depending on the feature supported, let's simply
> have the REP; STOSB directly in the code:
>
> ...
> ffffffff81000442: 4c 89 d7 mov %r10,%rdi
> ffffffff81000445: b9 00 10 00 00 mov $0x1000,%ecx
>
> <---- new memset
> ffffffff8100044a: f3 aa rep stos %al,%es:(%rdi)
> ffffffff8100044c: 90 nop
> ffffffff8100044d: 90 nop
> ffffffff8100044e: 90 nop
> <----
>
> The result is this:
>
> static __always_inline void *memset(void *dest, int c, size_t n)
> {
> void *ret, *dummy;
How is this going to affect cases like memset(p, 0, 4/8/16); where gcc
would normally just do one or two word stores? Is rep; stosb still
competitive in that case? It seems that gcc doesn't recognize memset as
a builtin with this always_inline definition present [1].
> asm volatile(ALTERNATIVE_2_REVERSE("rep; stosb",
> "call memset_rep", X86_FEATURE_ERMS,
> "call memset_orig", X86_FEATURE_REP_GOOD)
> : "=&D" (ret), "=a" (dummy)
> : "0" (dest), "a" (c), "c" (n)
> /* clobbers used by memset_orig() and memset_rep_good() */
> : "rsi", "rdx", "r8", "r9", "memory");
>
> return dest;
> }
>
Also, am I reading this
> #include <asm/export.h>
>
> -.weak memset
> -
> /*
> */
> -ENTRY(memset)
> -ENTRY(__memset)
right, so there's no longer an actual memset symbol in vmlinux? It seems
that despite the above always_inline definition of memset, gcc can still
emit calls to that to implement large initializations. (The gcc docs are
actually explicit about that, even under -ffreestanding, "GCC requires
the freestanding environment provide 'memcpy', 'memmove', 'memset' and
'memcmp'.")
[1] I tried this silly stripped-down version with gcc-8:
//#include <string.h>
#include <stddef.h>
#if 1
#define always_inline __inline__ __attribute__((__always_inline__))
static always_inline void *memset(void *dest, int c, size_t n)
{
void *ret, *dummy;
asm volatile("rep; stosb"
: "=&D" (ret), "=a" (dummy)
: "0" (dest), "a" (c), "c" (n)
/* clobbers used by memset_orig() and memset_rep_good() */
: "rsi", "rdx", "r8", "r9", "memory");
return dest;
}
#endif
struct s { long x; long y; };
int h(struct s *s);
int f(void)
{
struct s s;
memset(&s, 0, sizeof(s));
return h(&s);
}
int g(void)
{
struct s s[1024] = {};
return h(s);
}
With or without the string.h include, the result was
0000000000000000 <f>:
0: 48 83 ec 18 sub $0x18,%rsp
4: 31 c0 xor %eax,%eax
6: b9 10 00 00 00 mov $0x10,%ecx
b: 49 89 e2 mov %rsp,%r10
e: 4c 89 d7 mov %r10,%rdi
11: f3 aa rep stos %al,%es:(%rdi)
13: 4c 89 d7 mov %r10,%rdi
16: e8 00 00 00 00 callq 1b <f+0x1b>
17: R_X86_64_PLT32 h-0x4
1b: 48 83 c4 18 add $0x18,%rsp
1f: c3 retq
0000000000000020 <g>:
20: 48 81 ec 08 40 00 00 sub $0x4008,%rsp
27: ba 00 40 00 00 mov $0x4000,%edx
2c: 31 f6 xor %esi,%esi
2e: 48 89 e1 mov %rsp,%rcx
31: 48 89 cf mov %rcx,%rdi
34: e8 00 00 00 00 callq 39 <g+0x19>
35: R_X86_64_PLT32 memset-0x4
39: 48 89 c7 mov %rax,%rdi
3c: e8 00 00 00 00 callq 41 <g+0x21>
3d: R_X86_64_PLT32 h-0x4
41: 48 81 c4 08 40 00 00 add $0x4008,%rsp
48: c3 retq
With the asm version #if 0'ed out, f() uses two movq (and yields a
warning if the string.h include is omitted, but it still recognizes
memset()).
Rasmus
On Fri, Sep 13, 2019 at 8:22 AM Borislav Petkov <[email protected]> wrote:
>
> since the merge window is closing in and y'all are on a conference, I
> thought I should take another stab at it. It being something which Ingo,
> Linus and Peter have suggested in the past at least once.
>
> Instead of calling memset:
>
> ffffffff8100cd8d: e8 0e 15 7a 00 callq ffffffff817ae2a0 <__memset>
>
> and having a JMP inside it depending on the feature supported, let's simply
> have the REP; STOSB directly in the code:
That's probably fine for when the memset *is* a call, but:
> The result is this:
>
> static __always_inline void *memset(void *dest, int c, size_t n)
> {
> void *ret, *dummy;
>
> asm volatile(ALTERNATIVE_2_REVERSE("rep; stosb",
Forcing this code means that if you do
struct { long hi, low; } a;
memset(&a, 0, sizeof(a));
you force that "rep stosb". Which is HORRID.
The compiler should turn it into just one single 8-byte store. But
because you took over all of memset(), now that doesn't happen.
In fact, the compiler should be able to keep a structure like that in
registers if the use of it is fairly simple. Which again wouldn't
happen due to forcing that inline asm.
And "rep movsb" is ok for variable-sized memsets (well, honestly,
generally only when size is sufficient, but it's been getting
progressively better). But "rep movsb" is absolutely disastrous for
small constant-sized memset() calls. It serializes the pipeline, it
takes tens of cycles etc - for something that can take one single
cycle and be easily hidden in the instruction stream among other
changes.
And we do have a number of small structs etc in the kernel.
So we do need to have gcc do the __builtin_memset() for the simple cases..
Linus
On Fri, Sep 13, 2019 at 09:35:30AM +0200, Ingo Molnar wrote:
> That looks exciting - I'm wondering what effects this has on code
> footprint - for example defconfig vmlinux code size, and what the average
> per call site footprint impact is?
>
> If the footprint effect is acceptable, then I'd expect this to improve
> performance, especially in hot loops.
Well, it grows a bit but that's my conglomerate ugly patch so I'll redo
that test when I've cleaned it up:
text data bss dec hex filename
19699924 5201260 1642568 26543752 1950688 vmlinux.before
19791674 5201388 1552456 26545518 1950d6e vmlinux.after
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 13/09/2019 11.00, Linus Torvalds wrote:
> On Fri, Sep 13, 2019 at 8:22 AM Borislav Petkov <[email protected]> wrote:
>>
>> since the merge window is closing in and y'all are on a conference, I
>> thought I should take another stab at it. It being something which Ingo,
>> Linus and Peter have suggested in the past at least once.
>>
>> Instead of calling memset:
>>
>> ffffffff8100cd8d: e8 0e 15 7a 00 callq ffffffff817ae2a0 <__memset>
>>
>> and having a JMP inside it depending on the feature supported, let's simply
>> have the REP; STOSB directly in the code:
>
> That's probably fine for when the memset *is* a call, but:
>
>> The result is this:
>>
>> static __always_inline void *memset(void *dest, int c, size_t n)
>> {
>> void *ret, *dummy;
>>
>> asm volatile(ALTERNATIVE_2_REVERSE("rep; stosb",
>
> Forcing this code means that if you do
>
> struct { long hi, low; } a;
> memset(&a, 0, sizeof(a));
>
> you force that "rep stosb". Which is HORRID.
>
> The compiler should turn it into just one single 8-byte store. But
> because you took over all of memset(), now that doesn't happen.
OK, that answers my question.
> So we do need to have gcc do the __builtin_memset() for the simple cases..
Something like
if (__builtin_constant_p(c) && __builtin_constant_p(n) && n <= 32)
return __builtin_memset(dest, c, n);
might be enough? Of course it would be sad if 32 was so high that this
turned into a memset() call, but there's -mmemset-strategy= if one wants
complete control. Though that's of course build-time, so can't consider
differences between cpu models.
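I.e., the inline from the patch would simply grow a fast path on top,
something like this sketch (the 32 is just a placeholder cutoff):

static __always_inline void *memset(void *dest, int c, size_t n)
{
        void *ret, *dummy;

        /* let gcc expand small, build-time-constant sizes itself */
        if (__builtin_constant_p(c) && __builtin_constant_p(n) && n <= 32)
                return __builtin_memset(dest, c, n);

        asm volatile(ALTERNATIVE_2_REVERSE("rep; stosb",
                                "call memset_rep", X86_FEATURE_ERMS,
                                "call memset_orig", X86_FEATURE_REP_GOOD)
                : "=&D" (ret), "=a" (dummy)
                : "0" (dest), "a" (c), "c" (n)
                : "rsi", "rdx", "r8", "r9", "memory");

        return dest;
}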
Rasmus
On Fri, Sep 13, 2019 at 11:18:00AM +0200, Rasmus Villemoes wrote:
> Something like
>
> if (__builtin_constant_p(c) && __builtin_constant_p(n) && n <= 32)
> return __builtin_memset(dest, c, n);
>
> might be enough? Of course it would be sad if 32 was so high that this
> turned into a memset() call, but there's -mmemset-strategy= if one wants
> complete control. Though that's of course build-time, so can't consider
> differences between cpu models.
Yah, that seems to turn this:
memset(&tr, 0, sizeof(tr));
tr.b = b;
where sizeof(tr) < 32 into:
# ./arch/x86/include/asm/string_64.h:29: return __builtin_memset(dest, c, n);
movq $0, 16(%rsp) #, MEM[(void *)&tr + 8B]
movq $0, 24(%rsp) #, MEM[(void *)&tr + 8B]
# arch/x86/kernel/cpu/mce/amd.c:1070: tr.b = b;
movq %rbx, 8(%rsp) # b, tr.b
which is 2 u64 moves and the assignment of b at offset 8.
Question is, where should we put the size cap? I'm thinking 32 or 64
bytes...
Linus, got any suggestions?
Or should we talk to Intel hw folks about it...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Fri, Sep 13, 2019 at 12:42:32PM +0200, Borislav Petkov wrote:
> Or should we talk to Intel hw folks about it...
Or, I can do something like this, while waiting. Benchmark at the end.
The numbers are from a KBL box:
model : 158
model name : Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz
stepping : 12
and if I'm not doing anything wrong with the benchmark (the asm looks
ok but I could very well be missing something), the numbers say that
REP; STOSB is better from sizes of 8 upwards, and that up to two
cachelines we're pretty much on par with the builtin variant.
But after that we're pretty close too.
But this is just a single uarch which is pretty new; I guess the 32 or
64 byte cap should be good enough for most, especially those where REP;
STOSB isn't as good.
Hmm.
Loops: 1000000
size builtin_memset rep;stosb memset_rep
0 46 85 131
8 88 65 104
16 66 65 103
24 66 65 103
32 66 65 104
40 66 65 104
48 66 65 103
56 66 65 103
64 66 65 103
72 66 65 103
80 66 65 103
88 66 65 103
96 66 65 103
104 66 65 103
112 66 65 103
120 66 65 103
128 66 65 103
136 66 71 108
144 73 71 108
152 72 71 108
160 73 71 108
168 73 71 108
176 73 71 108
184 73 71 108
192 72 74 107
200 75 78 116
208 80 78 116
216 80 78 116
224 80 78 116
232 80 78 116
240 80 78 116
248 80 78 116
---
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef unsigned long long u64;
#define DECLARE_ARGS(val, low, high) unsigned low, high
#define EAX_EDX_VAL(val, low, high) ((low) | ((u64)(high) << 32))
#define EAX_EDX_ARGS(val, low, high) "a" (low), "d" (high)
#define EAX_EDX_RET(val, low, high) "=a" (low), "=d" (high)
static __always_inline unsigned long long rdtsc(void)
{
DECLARE_ARGS(val, low, high);
asm volatile("rdtsc" : EAX_EDX_RET(val, low, high));
return EAX_EDX_VAL(val, low, high);
}
/**
* rdtsc_ordered() - read the current TSC in program order
*
* rdtsc_ordered() returns the result of RDTSC as a 64-bit integer.
* It is ordered like a load to a global in-memory counter. It should
* be impossible to observe non-monotonic rdtsc_unordered() behavior
* across multiple CPUs as long as the TSC is synced.
*/
static __always_inline unsigned long long rdtsc_ordered(void)
{
/*
* The RDTSC instruction is not ordered relative to memory
* access. The Intel SDM and the AMD APM are both vague on this
* point, but empirically an RDTSC instruction can be
* speculatively executed before prior loads. An RDTSC
* immediately after an appropriate barrier appears to be
* ordered as a normal load, that is, it provides the same
* ordering guarantees as reading from a global memory location
* that some other imaginary CPU is updating continuously with a
* time stamp.
*/
asm volatile("mfence");
return rdtsc();
}
static __always_inline void *rep_stosb(void *dest, int c, size_t n)
{
void *ret, *dummy;
asm volatile("rep; stosb"
: "=&D" (ret), "=c" (dummy)
: "0" (dest), "a" (c), "c" (n)
: "memory");
return ret;
}
static __always_inline void *memset_rep(void *dest, int c, size_t n)
{
void *ret, *dummy;
asm volatile("push %%rdi\n\t"
"mov %%rax, %%rsi\n\t"
"mov %%rcx, %%rdx\n\t"
"andl $7,%%edx\n\t"
"shrq $3,%%rcx\n\t"
"movzbl %%sil,%%esi\n\t"
"movabs $0x0101010101010101,%%rax\n\t"
"imulq %%rsi,%%rax\n\t"
"rep stosq\n\t"
"movl %%edx,%%ecx\n\t"
"rep stosb\n\t"
"pop %%rax\n\t"
: "=&D" (ret), "=c" (dummy)
: "0" (dest), "a" (c), "c" (n)
: "rsi", "rdx", "memory");
return ret;
}
#define BUF_SIZE 1024
static void bench_memset(unsigned int loops)
{
u64 p0, p1, m_cycles = 0, r_cycles = 0, rep_cyc = 0;
int i, s;
void *dst;
dst = malloc(BUF_SIZE);
if (!dst)
perror("malloc");
printf("Loops: %d\n", loops);
printf("size\t\tbuiltin_memset\trep;stosb\tmemset_rep\n");
for (s = 0; s < BUF_SIZE; s += 16) {
for (i = 0; i < loops; i++) {
p0 = rdtsc_ordered();
__builtin_memset(dst, 0, s);
p1 = rdtsc_ordered();
m_cycles += p1 - p0;
}
for (i = 0; i < loops; i++) {
p0 = rdtsc_ordered();
rep_stosb(dst, 0, s);
p1 = rdtsc_ordered();
r_cycles += p1 - p0;
}
for (i = 0; i < loops; i++) {
p0 = rdtsc_ordered();
memset_rep(dst, 0, s);
p1 = rdtsc_ordered();
rep_cyc += p1 - p0;
}
m_cycles /= loops;
r_cycles /= loops;
rep_cyc /= loops;
printf("%d\t\t%llu\t\t%llu\t\t%llu\n", s, m_cycles, r_cycles, rep_cyc);
m_cycles = r_cycles = rep_cyc = 0;
}
free(dst);
}
int main(void)
{
bench_memset(1000000);
return 0;
}
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
> Instead of calling memset:
>
> ffffffff8100cd8d: e8 0e 15 7a 00 callq ffffffff817ae2a0 <__memset>
>
> and having a JMP inside it depending on the feature supported, let's simply
> have the REP; STOSB directly in the code:
>
> ...
> ffffffff81000442: 4c 89 d7 mov %r10,%rdi
> ffffffff81000445: b9 00 10 00 00 mov $0x1000,%ecx
>
> <---- new memset
> ffffffff8100044a: f3 aa rep stos %al,%es:(%rdi)
> ffffffff8100044c: 90 nop
> ffffffff8100044d: 90 nop
> ffffffff8100044e: 90 nop
You can fit the entire "xor eax, eax; rep stosb" (four bytes) inside a
five-byte call instruction.
> /* clobbers used by memset_orig() and memset_rep_good() */
> : "rsi", "rdx", "r8", "r9", "memory");
eh... I'd just drop it. These registers screw up everything.
Time to rebase memset0().
On Sat, Sep 14, 2019 at 12:29:15PM +0300, Alexey Dobriyan wrote:
> eh... I'd just drop it. These registers screw up everything.
The intent is to not touch memset_orig and let it die with its users. It
is irrelevant now anyway.
If it can be shown that the extended list of clobbered registers hurt
performance, then we can improve it for the sake of keeping register
pressure low.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 13/09/2019 18.36, Borislav Petkov wrote:
> On Fri, Sep 13, 2019 at 12:42:32PM +0200, Borislav Petkov wrote:
>> Or should we talk to Intel hw folks about it...
>
> Or, I can do something like this, while waiting. Benchmark at the end.
>
> The numbers are from a KBL box:
>
> model : 158
> model name : Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz
> stepping : 12
>
> and if I'm not doing anything wrong with the benchmark
Eh, this benchmark doesn't seem to provide any hints on where to set the
cut-off for a compile-time constant n, i.e. the 32 in
__b_c_p(n) && n <= 32
- unless gcc has unrolled your loop completely, which I find highly
unlikely.
> (the asm looks
> ok
By "looks ok", do you mean the the builtin_memset() have been made into
calls to libc memset(), or how has gcc expanded that? And if so, what's
the disassembly of your libc's memset()? The thing is, what needs to be
compared is how a rep;stosb of 32 bytes compares to 4 immediate stores.
In fact, perhaps we shouldn't even try to find a cutoff. If __b_c_p(n),
just use __builtin_memset unconditionally. If n is smallish, gcc will do
a few stores, and if n is largish and gcc ends up emitting a call to
memset(), well, we can optimize memset() itself based on cpu
capabilities _and_ it's not the call/ret that will dominate. There are
also optimization and diagnostic advantages of having gcc know the
semantics of the memset() call (e.g. the tr.b DSE you showed).
> but I could very well be missing something), the numbers say that
> the REP; STOSB is better from sizes of 8 and upwards and up to two
> cachelines we're pretty much on-par with the builtin variant.
I don't think that's what the numbers say.
Rasmus
On Mon, Sep 16, 2019 at 2:18 AM Rasmus Villemoes
<[email protected]> wrote:
>
> Eh, this benchmark doesn't seem to provide any hints on where to set the
> cut-off for a compile-time constant n, i.e. the 32 in
Yes, you'd need to use proper fixed-size memset's with
__builtin_memset() to test that case. Probably easy enough with some
preprocessor macros to expand to a lot of cases.
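Something along these lines, say (an untested sketch on top of Boris'
test program, reusing its rdtsc_ordered()/rep_stosb() helpers and the
'dst'/'loops' variables from bench_memset()):

/* time one compile-time-constant size N so gcc can expand the builtin */
#define BENCH_FIXED(N) do {                                     \
        u64 t0, t1, builtin = 0, stosb = 0;                     \
        int i;                                                  \
                                                                \
        for (i = 0; i < loops; i++) {                           \
                t0 = rdtsc_ordered();                           \
                __builtin_memset(dst, 0, (N));                  \
                t1 = rdtsc_ordered();                           \
                builtin += t1 - t0;                             \
        }                                                       \
        for (i = 0; i < loops; i++) {                           \
                t0 = rdtsc_ordered();                           \
                rep_stosb(dst, 0, (N));                         \
                t1 = rdtsc_ordered();                           \
                stosb += t1 - t0;                               \
        }                                                       \
        printf("%d\t\t%llu\t\t%llu\n", (int)(N),                \
               builtin / loops, stosb / loops);                 \
} while (0)

/* ... and expand it for the interesting constant sizes: */
BENCH_FIXED(8);
BENCH_FIXED(16);
BENCH_FIXED(32);
BENCH_FIXED(64);
BENCH_FIXED(128);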
But even then it will not show some of the advantages of inlining the
memset (quite often you have a "memset structure to zero, then
initialize a couple of fields" pattern, and gcc does much better for
that when it just inlines the memset to stores - to the point of just
removing all the memset entirely and just storing a couple of zeroes
between the fields you initialized).
So the "inline constant sizes" case has advantages over and beyond the
obvious ones. I suspect that a reasonable cut-off point is something
like "8*sizeof(long)". But look at things like "struct kstat" uses
etc, the limit might actually be even higher than that.
Also note that while "rep stosb" is _reasonably_ good with current
CPU's (ie roughly gen 8+), it's not so great a few generations ago
(gen 6ish), and it can be absolutely horrid on older cores and/or
atom. The limit for when it is a win ends up depending on whether I$
footprint is an issue too, of course, but some of the bigger wins tend
to happen when you have sizes >= 128.
You can basically always beat "rep movs/stos" with hand-tuned AVX2/512
code for specific cases if you don't look at I$ footprint and the cost
of the AVX setup (and the cost of frequency changes, which often go
hand-in-hand with the AVX use). So "rep movs/stos" is seldom
_optimal_, but it tends to be "quite good" for modern CPU's with
variable sizes that are in the 100+ byte range.
Linus
On Mon, Sep 16, 2019 at 10:41 AM Andy Lutomirski <[email protected]> wrote:
>
> After some experimentation, I think y'all are just doing it wrong.
> GCC is very clever about this as long as it's given the chance. This
> test, for example, generates excellent code:
>
> #include <string.h>
>
> __THROW __nonnull ((1)) __attribute__((always_inline)) void
> *memset(void *s, int c, size_t n)
> {
> asm volatile ("nop");
> return s;
> }
>
> /* generates 'nop' */
> void zero(void *dest, size_t size)
> {
> __builtin_memset(dest, 0, size);
> }
I think the point was that we'd like to get the default memset (for
when __builtin_memset() doesn't generate inline code) also inlined
into just "rep stosb", instead of that tail-call "jmp memset".
> So I'm thinking maybe compiler.h should actually do something like:
>
> #define memset __builtin_memset
>
> and we should have some appropriate magic so that the memset inline is
> exempt from the macro.
That "appropriate magic" is easy enough: make sure the memset inline
shows up before the macro definition.
However, gcc never actually inlines the memset() for me, always doing
that "jmp memset"
> FWIW, this obviously wrong code:
>
> __THROW __nonnull ((1)) __attribute__((always_inline)) void
> *memset(void *s, int c, size_t n)
> {
> __builtin_memset(s, c, n);
> return s;
> }
>
> generates 'jmp memset'. It's not entirely clear to me exactly what's
> happening here.
I think calling memset() is always the default fallback for
__builtin_memset, and because it can't be recursively inlined, it's
done as a call. Which is then turned into a tailcall because the
calling conventions match, thus the "jmp memset".
But as mentioned, the example you claim generates excellent code
doesn't actually inline memset() at all for me, and they are all doing
"jmp memset" except for the cases that get turned into direct stores.
Eg (removing the cfi annotations etc stuff):
zero:
movq %rsi, %rdx
xorl %esi, %esi
jmp memset
rather than that "nop" showing up inside the zero function.
But I agree that when __builtin_memset() generates manual inline code,
it does the right thing, ie
memset_a_bit:
movl $0, (%rdi)
ret
is clearly the right thing to do. We knew that.
Linus
On Mon, Sep 16, 2019 at 10:25 AM Linus Torvalds
<[email protected]> wrote:
>
> On Mon, Sep 16, 2019 at 2:18 AM Rasmus Villemoes
> <[email protected]> wrote:
> >
> > Eh, this benchmark doesn't seem to provide any hints on where to set the
> > cut-off for a compile-time constant n, i.e. the 32 in
>
> Yes, you'd need to use proper fixed-size memset's with
> __builtin_memset() to test that case. Probably easy enough with some
> preprocessor macros to expand to a lot of cases.
>
> But even then it will not show some of the advantages of inlining the
> memset (quite often you have a "memset structure to zero, then
> initialize a couple of fields" pattern, and gcc does much better for
> that when it just inlines the memset to stores - to the point of just
> removing all the memset entirely and just storing a couple of zeroes
> between the fields you initialized).
After some experimentation, I think y'all are just doing it wrong.
GCC is very clever about this as long as it's given the chance. This
test, for example, generates excellent code:
#include <string.h>
__THROW __nonnull ((1)) __attribute__((always_inline)) void
*memset(void *s, int c, size_t n)
{
asm volatile ("nop");
return s;
}
/* generates 'nop' */
void zero(void *dest, size_t size)
{
__builtin_memset(dest, 0, size);
}
/* xorl %eax, %eax */
int test(void)
{
int x;
__builtin_memset(&x, 0, sizeof(x));
return x;
}
/* movl $0, (%rdi) */
void memset_a_bit(int *ptr)
{
__builtin_memset(ptr, 0, sizeof(*ptr));
}
So I'm thinking maybe compiler.h should actually do something like:
#define memset __builtin_memset
and we should have some appropriate magic so that the memset inline is
exempt from the macro. Or maybe there's some very clever way to put
all of this into the memset inline function. FWIW, this obviously
wrong code:
__THROW __nonnull ((1)) __attribute__((always_inline)) void
*memset(void *s, int c, size_t n)
{
__builtin_memset(s, c, n);
return s;
}
generates 'jmp memset'. It's not entirely clear to me exactly what's
happening here.
--Andy
On Mon, Sep 16, 2019 at 4:14 PM Andy Lutomirski <[email protected]> wrote:
>
> Well, when I wrote this email, I *thought* it was inlining the
> 'memset' function, but maybe I just can't read gcc's output today.
Not having your compiler, it's also possible that it works for you,
but just doesn't work for me.
> It seems like gcc is maybe smart enough to occasionally optimize
> memset just because it's called 'memset'. This generates good code:
Yup, that does the right thing for me and ignores the definition of
memset() in favor of the built-in one.
However, at least part of this discussion started because of the
reverse problem (turning a couple of assignments into memset), and the
suggestion that we might be able to use -ffreestanding together with
#define memset __builtin_memset
and then your nice code generation goes away, because the magical
treatment of memset goes away. I get
one_word:
xorl %eax, %eax
ret
not_one_word:
movq %rsi, %rdx
xorl %esi, %esi
jmp memset
despite having that "inline void *memset()" definition.
Linus
On Mon, Sep 16, 2019 at 2:30 PM Linus Torvalds
<[email protected]> wrote:
>
> On Mon, Sep 16, 2019 at 10:41 AM Andy Lutomirski <[email protected]> wrote:
> >
> > After some experimentation, I think y'all are just doing it wrong.
> > GCC is very clever about this as long as it's given the chance. This
> > test, for example, generates excellent code:
> >
> > #include <string.h>
> >
> > __THROW __nonnull ((1)) __attribute__((always_inline)) void
> > *memset(void *s, int c, size_t n)
> > {
> > asm volatile ("nop");
> > return s;
> > }
> >
> > /* generates 'nop' */
> > void zero(void *dest, size_t size)
> > {
> > __builtin_memset(dest, 0, size);
> > }
>
> I think the point was that we'd like to get the default memset (for
> when __builtin_memset() doesn't generate inline code) also inlined
> into just "rep stosb", instead of that tail-call "jmp memset".
Well, when I wrote this email, I *thought* it was inlining the
'memset' function, but maybe I just can't read gcc's output today.
It seems like gcc is maybe smart enough to occasionally optimize
memset just because it's called 'memset'. This generates good code:
#include <stddef.h>
inline void *memset(void *dest, int c, size_t n)
{
/* Boris' implementation */
void *ret, *dummy;
asm volatile("push %%rdi\n\t"
"mov %%rax, %%rsi\n\t"
"mov %%rcx, %%rdx\n\t"
"andl $7,%%edx\n\t"
"shrq $3,%%rcx\n\t"
"movzbl %%sil,%%esi\n\t"
"movabs $0x0101010101010101,%%rax\n\t"
"imulq %%rsi,%%rax\n\t"
"rep stosq\n\t"
"movl %%edx,%%ecx\n\t"
"rep stosb\n\t"
"pop %%rax\n\t"
: "=&D" (ret), "=c" (dummy)
: "0" (dest), "a" (c), "c" (n)
: "rsi", "rdx", "memory");
return ret;
}
int one_word(void)
{
int x;
memset(&x, 0, sizeof(x));
return x;
}
So maybe Boris' patch is good after all.
On Mon, Sep 16, 2019 at 10:25:25AM -0700, Linus Torvalds wrote:
> So the "inline constant sizes" case has advantages over and beyond the
> obvious ones. I suspect that a reasonable cut-off point is somethinig
> like "8*sizeof(long)". But look at things like "struct kstat" uses
> etc, the limit might actually be even higher than that.
Ok, sounds to me like we want to leave the small and build-time known
sizes to gcc to optimize. Especially if you want to clear only a subset
of the struct members and set the rest.
> Also note that while "rep stosb" is _reasonably_ good with current
> CPU's (ie roughly gen 8+), it's not so great a few generations ago
> (gen 6ish), and it can be absolutely horrid on older cores and/or
> atom. The limit for when it is a win ends up depending on whether I$
> footprint is an issue too, of course, but some of the bigger wins tend
> to happen when you have sizes >= 128.
Ok, so I have X86_FEATURE_ERMS set on this old IVB laptop so if gen8 and
newer do perform better, then we need to change our feature detection to
clear ERMS on those older generations and really leave it set only on
gen8+.
I'll change my benchmark to do explicit small-sized moves (instead of
using __builtin_memset) to see whether I can compare performance better.
Which also begs the question: do we use __builtin_memset() at all or do we
code those small-sized moves ourselves? We'd lose the advantage of the
partial clearing of struct members but we'd avoid the differences that
different compiler versions would introduce when the builtin is used.
And last but not least, we should not lose readability somewhere along
the way: I'd prefer good performance and maintainable code over some
hypothetical max perf on some uarch achieved with an ugly and unreadable
macro mess...
> You can basically always beat "rep movs/stos" with hand-tuned AVX2/512
> code for specific cases if you don't look at I$ footprint and the cost
> of the AVX setup (and the cost of frequency changes, which often go
> hand-in-hand with the AVX use). So "rep movs/stos" is seldom
> _optimal_, but it tends to be "quite good" for modern CPU's with
> variable sizes that are in the 100+ byte range.
Right. If we did this, it would be a conditional in front of the REP;
STOS but that ain't worse than our current CALL+JMP.
Thx.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
--
From: Linus Torvalds
> Sent: 16 September 2019 18:25
...
> You can basically always beat "rep movs/stos" with hand-tuned AVX2/512
> code for specific cases if you don't look at I$ footprint and the cost
> of the AVX setup (and the cost of frequency changes, which often go
> hand-in-hand with the AVX use). So "rep movs/stos" is seldom
> _optimal_, but it tends to be "quite good" for modern CPU's with
> variable sizes that are in the 100+ byte range.
Years ago I managed to match 'rep movs' on my Athlon 700 with a
'normal' code loop.
I can't remember whether I beat the setup time though.
The 'trick' was to do 'read (read-write)*n write' to
avoid stalls and get all the loop processing for free.
The I$ footprint was larger though.
The setup costs for 'rep movx' are significant.
I think the worst cpu was the P4 Netburst at 40-50 clocks.
My guess is over 10 for all cpus (except pre-pentium ones).
IIRC the only cpu on which you should use 'rep movsb' for the
trailing bytes is the one before Intel added the fast copy logic.
That one had special optimisations for 'rep movsb' of length < 8.
Remember, if you are inlining you probably have to assume cold-cache
and an untrained branch predictor.
Most benchmarking is done hot-cache with the branch predictor trained.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Fri, Sep 13, 2019 at 09:22:37AM +0200, Borislav Petkov wrote:
> In order to patch on machines which don't set X86_FEATURE_ERMS, I need
> to do a "reversed" patching of sorts, i.e., patch when the x86 feature
> flag is NOT set. See the below changes in alternative.c which basically
> add a flags field to struct alt_instr and thus control the patching
> behavior in apply_alternatives().
>
> The result is this:
>
> static __always_inline void *memset(void *dest, int c, size_t n)
> {
> void *ret, *dummy;
>
> asm volatile(ALTERNATIVE_2_REVERSE("rep; stosb",
> "call memset_rep", X86_FEATURE_ERMS,
> "call memset_orig", X86_FEATURE_REP_GOOD)
> : "=&D" (ret), "=a" (dummy)
> : "0" (dest), "a" (c), "c" (n)
> /* clobbers used by memset_orig() and memset_rep_good() */
> : "rsi", "rdx", "r8", "r9", "memory");
>
> return dest;
> }
I think this also needs ASM_CALL_CONSTRAINT.
Doesn't this break on older non-ERMS CPUs when the memset() is done
early, before alternative patching?
Could it instead do this?
ALTERNATIVE_2("call memset_orig",
"call memset_rep", X86_FEATURE_REP_GOOD,
"rep; stosb", X86_FEATURE_ERMS)
Then the "reverse alternatives" feature wouldn't be needed anyway.
--
Josh
On Tue, Sep 17, 2019 at 1:10 PM Josh Poimboeuf <[email protected]> wrote:
>
> Could it instead do this?
>
> ALTERNATIVE_2("call memset_orig",
> "call memset_rep", X86_FEATURE_REP_GOOD,
> "rep; stosb", X86_FEATURE_ERMS)
>
> Then the "reverse alternatives" feature wouldn't be needed anyway.
That sounds better, but I'm a bit nervous about the whole thing
because who knows when the alternatives code itself internally uses
memset() and then we have a nasty little chicken-and-egg problem.
Also, for it to make sense to inline rep stosb, I think we also need
to just make the calling conventions for the alternative calls be that
they _don't_ clobber other registers than the usual rep ones
(cx/di/si). Otherwise one big code generation advantage of inlining
the thing just goes away.
On the whole I get the feeling that this is all painful complexity and
we shouldn't do it. At least not without some hard performance numbers
for some huge improvement, which I don't think we've seen.
Because I find the thing fascinating conceptually, but am not at all
convinced I want to deal with the pain in practice ;)
Linus
On Tue, Sep 17, 2019 at 03:10:21PM -0500, Josh Poimboeuf wrote:
> Then the "reverse alternatives" feature wouldn't be needed anyway.
The intent was to have the default, most-used version be there at
build-time, obviating the need to patch. Therefore on those old !ERMS
CPUs we'll end up doing rep;stosb before patching, which is supported,
albeit slow.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
--
On Tue, Sep 17, 2019 at 01:45:20PM -0700, Linus Torvalds wrote:
> That sounds better, but I'm a bit nervous about the whole thing
> because who knows when the alternatives code itself internally uses
> memset() and then we have a nasty little chicken-and-egg problem.
You mean memcpy()...?
> Also, for it to make sense to inline rep stosb, I think we also need
> to just make the calling conventions for the alternative calls be that
> they _don't_ clobber other registers than the usual rep ones
> (cx/di/si). Otherwise one big code generation advantage of inlining
> the thing just goes away.
Yah, that is tricky and I have no smart idea how. The ABI puts the
operands in rdi,rsi,rdx, ... while REP; STOSB wants them in rax,rcx,rdi.
And if it were only that, then we could probably accept the 2 movs and
a push but then the old functions clobber three more: "rdx", "r8", "r9".
I could try to rewrite the old functions to see if I can save some regs...
> On the whole I get the feeling that this is all painful complexity and
> we shouldn't do it. At least not without some hard performance numbers
> for some huge improvement, which I don't think we've seen.
Yap, it is starting to become hairy.
> Because I find the thing fascinating conceptually, but am not at all
> convinced I want to deal with the pain in practice ;)
I hear ya.
Thx.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
--