Hi,
I'm trying to chase down a performance issue with a driver I'm working
on that does a repeated memcpy_fromio of about 1KB from a PCI device. I
made a small change from a fixed-size copy to a variable-size one, only
to be surprised by performance dropping to roughly a third of what it was.
I've looked through the code and discovered this is due to switching out
__builtin_memcpy with __memcpy when the length is not constant. This
makes sense and I've now been looking into the inner workings of
memcpy_64.S.
The CPU I'm testing on is a Sandy Bridge and according to /proc/cpuinfo,
it does _not_ have the erms bit and does have the rep_good bit. So I
expect to be using the "rep movsq" version of memcpy. However, I've done
some testing with my driver by hacking in specific memcpy
implementations and I've found the following results:
__builtin_memcpy w/const length:  85 KB/s
memcpy_fromio:                    26 KB/s
__builtin_memcpy:                 26 KB/s
memcpy_movsq:                    126 KB/s
memcpy_erms:                      26 KB/s
Thus, based on these performance numbers, it almost seems like my
platform is using the erms version when it probably shouldn't be.
So my question is: how do I find out what version of memcpy my actual
machine is using and fix it if it is wrong?
I'm running with a 4.10.0 kernel and I've attached my cpuinfo.
Thanks for any insights,
Logan
On Sat, Mar 04, 2017 at 01:08:15PM -0700, Logan Gunthorpe wrote:
> So my question is: how do I find out what version of memcpy my actual
> machine is using and fix it if it is wrong?
You can boot with "debug-alternative" and look for those strings where
it says "feat:"
[ 0.261386] apply_alternatives: feat: 3*32+18, old: (ffffffff8101e3c4, len: 3), repl: (ffffffff81de4e7a, len: 3), pad: 3
^^^^^^^
[ 0.264003] ffffffff8101e3c4: old_insn: 0f 1f 00
[ 0.265235] ffffffff81de4e7a: rpl_insn: 0f ae e8
[ 0.266469] ffffffff8101e3c4: final_insn: 0f ae e8
if it says 9*32+9 there, then it really does patch in the ERMS variant.
If it says 3*32+16 and it patches a NOP, i.e., something like this:
[ 2.022665] apply_alternatives: feat: 3*32+16, old: (ffffffff812dbfe0, len: 5), repl: (ffffffff81de5336, len: 0), pad: 3
[ 2.024003] ffffffff812dbfe0: old_insn: eb 0e 90 90 90
[ 2.025332] ffffffff812dbfe0: final_insn: 0f 1f 44 00 00
^^^^ NOP
then, you're using the rep; movsq variant.
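(For reference, those "feat" numbers are the word/bit positions from
arch/x86/include/asm/cpufeatures.h, and the patching is driven by the
ALTERNATIVE_2 dispatch at the top of arch/x86/lib/memcpy_64.S - roughly
this in a 4.10-era tree:)

  #define X86_FEATURE_REP_GOOD ( 3*32+16) /* rep microcode works well */
  #define X86_FEATURE_ERMS     ( 9*32+ 9) /* Enhanced REP MOVSB/STOSB */

  ENTRY(__memcpy)
  ENTRY(memcpy)
          ALTERNATIVE_2 "jmp memcpy_orig", "", X86_FEATURE_REP_GOOD, \
                        "jmp memcpy_erms", X86_FEATURE_ERMS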
Also, do
make <path-to-file>.s
of the file where it does memcpy_fromio() and attach it here. I'd like to see
what the compiler generates.
HTH.
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
On March 4, 2017 3:46:44 PM PST, Logan Gunthorpe <[email protected]> wrote:
>Hi Borislav,
>
>Thanks for the help.
>
>On 04/03/17 03:43 PM, Borislav Petkov wrote:
>> You can boot with "debug-alternative" and look for those strings
>where
>
>Here's the symbols for memcpy and the corresponding apply_alternatives
>lines:
>
>ffffffff8122df90 T __memcpy
>ffffffff8122df90 W memcpy
>ffffffff8122dfb0 T memcpy_erms
>ffffffff8122dfc0 T memcpy_orig
>
>[ 0.076018] apply_alternatives: feat: 3*32+16, old:
>(ffffffff8122df90, len: 5), repl: (ffffffff819d7ac5, len: 0), pad: 0
>[ 0.076019] ffffffff8122df90: old_insn: e9 2b 00 00 00
>[ 0.076021] ffffffff8122df90: final_insn: 66 66 90 66 90
>
>So it looks like it's patching in a NOP as it is supposed to.
>
>
>> Also, do
>>
>> make <path-to-file>.s
>>
>> of the file where it does memcpy_fromio() and attach it here. I'd
>like to see
>> what the compiler generates.
>
>I've attached the whole file but this is the code in question:
>
> .loc 5 219 0
> movq 936(%rbp), %rdx # stdev_3(D)->mmio_mrpc, tmp127
> leaq 48(%rbx), %rax #, tmp113
> movq 40(%rbx), %rcx # MEM[(struct switchtec_user *)__mptr_7 +
>-56B].read_len, MEM[(struct switchtec_user *)__mptr_7 + -56B].read_len
> movq %rax, %rdi # tmp113, tmp118
> leaq 1024(%rdx), %rsi #, tmp114
> rep movsb
>
>Strangely, it decided to inline a memcpy with rep movsb. Any idea why
>the compiler would do that? Is this just gcc being stupid? My gcc
>version is below.
>
>Thanks,
>
>Logan
>
>
>Using built-in specs.
>COLLECT_GCC=gcc
>COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.9/lto-wrapper
>Target: x86_64-linux-gnu
>Configured with: ../src/configure -v --with-pkgversion='Debian
>4.9.2-10'
>--with-bugurl=file:///usr/share/doc/gcc-4.9/README.Bugs
>--enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr
>--program-suffix=-4.9 --enable-shared --enable-linker-build-id
>--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
>--with-gxx-include-dir=/usr/include/c++/4.9 --libdir=/usr/lib
>--enable-nls --with-sysroot=/ --enable-clocale=gnu
>--enable-libstdcxx-debug --enable-libstdcxx-time=yes
>--enable-gnu-unique-object --disable-vtable-verify --enable-plugin
>--with-system-zlib --disable-browser-plugin --enable-java-awt=gtk
>--enable-gtk-cairo
>--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64/jre
>--enable-java-home
>--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64
>--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.9-amd64
>--with-arch-directory=amd64
>--with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc
>--enable-multiarch --with-arch-32=i586 --with-abi=m64
>--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
>--enable-checking=release --build=x86_64-linux-gnu
>--host=x86_64-linux-gnu --target=x86_64-linux-gnu
>Thread model: posix
>gcc version 4.9.2 (Debian 4.9.2-10)
For newer processors, as determined by -mtune=, it is actually the best option for an arbitrary copy.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Sat, Mar 04, 2017 at 03:55:27PM -0800, [email protected] wrote:
> For newer processors, as determined by -mtune=, it is actually the
> best option for an arbitrary copy.
So his doesn't have ERMS - it is a SNB - so if for SNB REP_GOOD is
the best option for memcpy, then we should probably build with
-fno-builtin-memcpy unconditionally. Otherwise gcc apparently inserts
its own memcpy variant.
And this is probably wrong because we do the detection at boot time and
not at build time.
For example here it generates REP; MOVSL for the call in
drivers/firmware/dmi_scan.c::dmi_scan_machine() which looks wrong to me.
Length is 32 so it could just as well do REP; MOVSQ.
IOW, we could do something like this:
---
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 2d449337a360..c1b68d147b8d 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -142,10 +142,7 @@ ifdef CONFIG_X86_X32
endif
export CONFIG_X86_X32_ABI
-# Don't unroll struct assignments with kmemcheck enabled
-ifeq ($(CONFIG_KMEMCHECK),y)
- KBUILD_CFLAGS += $(call cc-option,-fno-builtin-memcpy)
-endif
+KBUILD_CFLAGS += $(call cc-option,-fno-builtin-memcpy)
# Stackpointer is addressed different for 32 bit and 64 bit x86
sp-$(CONFIG_X86_32) := esp
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
On March 4, 2017 4:14:47 PM PST, Borislav Petkov <[email protected]> wrote:
>On Sat, Mar 04, 2017 at 03:55:27PM -0800, [email protected] wrote:
>> For newer processors, as determined by -mtune=, it is actually the
>> best option for an arbitrary copy.
>
>So his doesn't have ERMS - it is a SNB - so if for SNB REP_GOOD is
>the best option for memcpy, then we should probably build with
>-fno-builtin-memcpy unconditionally. Otherwise gcc apparently inserts
>its own memcpy variant.
>
>And this is probably wrong because we do the detection at boot time and
>not at build time.
>
>For example here it generates REP; MOVSL for the call in
>drivers/firmware/dmi_scan.c::dmi_scan_machine() which looks wrong to
>me.
>Length is 32 so it could just as well do REP; MOVSQ.
>
>IOW, we could do something like this:
>
>---
>diff --git a/arch/x86/Makefile b/arch/x86/Makefile
>index 2d449337a360..c1b68d147b8d 100644
>--- a/arch/x86/Makefile
>+++ b/arch/x86/Makefile
>@@ -142,10 +142,7 @@ ifdef CONFIG_X86_X32
> endif
> export CONFIG_X86_X32_ABI
>
>-# Don't unroll struct assignments with kmemcheck enabled
>-ifeq ($(CONFIG_KMEMCHECK),y)
>- KBUILD_CFLAGS += $(call cc-option,-fno-builtin-memcpy)
>-endif
>+KBUILD_CFLAGS += $(call cc-option,-fno-builtin-memcpy)
>
> # Stackpointer is addressed different for 32 bit and 64 bit x86
> sp-$(CONFIG_X86_32) := esp
What are the compilation flags? It may be that gcc still does TRT depending on this call site. I'd check what gcc6 or 7 generates, though.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Sat, Mar 04, 2017 at 04:23:17PM -0800, [email protected] wrote:
> What are the compilation flags? It may be that gcc still does TRT
> depending on this call site. I'd check what gcc6 or 7 generates,
> though.
Well, I don't think that matters: if you're building a kernel on one
machine to boot on another machine, the compiler can't know at build
time what the best MOVS* variant would be for the target machine. That's
why we're doing the alternatives patching at *boot* time.
However, if the size is small enough, the CALL/patch overhead would be
definitely too much.
Hmm, I wish we were able to say, "let gcc decide for small sizes and let
us do the patching for larger ones."
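(A minimal sketch of that idea - hypothetical, not something in the tree, and
the 64-byte cutoff is made up: let gcc expand small, compile-time-constant
copies inline and only take the boot-patched out-of-line call for the rest:)

  static __always_inline void *memcpy_hybrid(void *to, const void *from, size_t len)
  {
          /* small, constant-size copies: let the compiler pick the expansion */
          if (__builtin_constant_p(len) && len <= 64)
                  return __builtin_memcpy(to, from, len);

          /* everything else: the out-of-line memcpy patched via alternatives at boot */
          return __memcpy(to, from, len);
  }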
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
On March 4, 2017 4:33:49 PM PST, Borislav Petkov <[email protected]> wrote:
>On Sat, Mar 04, 2017 at 04:23:17PM -0800, [email protected] wrote:
>> What are the compilation flags? It may be that gcc still does TRT
>> depending on this call site. I'd check what gcc6 or 7 generates,
>> though.
>
>Well, I don't think that matters: if you're building a kernel on one
>machine to boot on another machine, the compiler can't know at build
>time what the best MOVS* variant would be for the target machine.
>That's
>why we're doing the alternatives patching at *boot* time.
>
>However, if the size is small enough, the CALL/patch overhead would be
>definitely too much.
>
>Hmm, I wish we were able to say, "let gcc decide for small sizes and
>let
>us do the patching for larger ones."
That's what the -march= and -mtune= options do!
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Hey,
On 04/03/17 05:33 PM, Borislav Petkov wrote:
> On Sat, Mar 04, 2017 at 04:23:17PM -0800, [email protected] wrote:
>> What are the compilation flags? It may be that gcc still does TRT
>> depending on this call site. I'd check what gcc6 or 7 generates,
>> though.
> Hmm, I wish we were able to say, "let gcc decide for small sizes and let
> us do the patching for larger ones."
So, I've found that my kernel config had the OPTIMIZE_FOR_SIZE selected
instead of OPTIMIZE_FOR_PERFORMANCE. I'm not sure why that is but
switching to the latter option fixes my problem. A memcpy call is used
instead of the poor inline solution. (I'm not really sure how the inline
solution even makes any sense as it almost certainly makes things larger
in the grand scheme of things.)
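(For anyone else hitting this: the knob is the "Compiler optimization level"
choice; in a 4.10-era .config what fixed it here amounts to)

  # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
  CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y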
It still might make sense to apply your patch asking gcc to never use
the broken memcpy but I'll leave that in your capable hands to decide.
Anyway, thanks for the help with this.
Logan
On Sat, Mar 04, 2017 at 04:56:38PM -0800, [email protected] wrote:
> That's what the -march= and -mtune= options do!
How does that even help with a distro kernel built with -mtune=generic?
gcc can't possibly know what targets that kernel is going to be booted
on. So it probably does some universally optimal things, like in
the dmi_scan_machine() case:
memcpy_fromio(buf, p, 32);
turns into:
.loc 3 219 0
movl $8, %ecx #, tmp79
movq %rax, %rsi # p, p
movq %rsp, %rdi #, tmp77
rep movsl
Apparently it thinks it is fine to do 8*4-byte MOVS. But why not
4*8-byte MOVS?
That's half the loops.
[ What the machine actually does underneath is a whole different story. The
  copy being only half a cacheline probably doesn't help, and if it really does
  the separate MOVs, then 4 8-byte ones would still be cheaper. ]
One thing's for sure - both variants are certainly cheaper than a CALL to
a memcpy variant.
What we probably should try to do, though, is simply patch in the body
of REP; MOVSQ or REP; MOVSB into the call sites and only have a call to
memcpy_orig(), because that last one is fat.
I remember we did talk about it at some point but don't remember why we
didn't do it.
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
On Sat, Mar 04, 2017 at 09:58:14PM -0700, Logan Gunthorpe wrote:
> So, I've found that my kernel config had the OPTIMIZE_FOR_SIZE selected
> instead of OPTIMIZE_FOR_PERFORMANCE. I'm not sure why that is but
> switching to the latter option fixes my problem. A memcpy call is used
> instead of the poor inline solution. (I'm not really sure how the inline
> solution even makes any sense as it almost certainly makes things larger
> in the grand scheme of things.)
Probably some gcc heuristics don't work as expected...
In any case, I have
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
here and it still generates REP; MOVSL in dmi_scan_machine().
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
On Sun, Mar 05, 2017 at 10:50:59AM +0100, Borislav Petkov wrote:
> On Sat, Mar 04, 2017 at 04:56:38PM -0800, [email protected] wrote:
> > That's what the -march= and -mtune= options do!
>
> How does that even help with a distro kernel built with -mtune=generic?
>
> gcc can't possibly know what targets that kernel is going to be booted
> on. So it probably does some universally optimal things, like in
> the dmi_scan_machine() case:
>
> memcpy_fromio(buf, p, 32);
>
> turns into:
>
> .loc 3 219 0
> movl $8, %ecx #, tmp79
> movq %rax, %rsi # p, p
> movq %rsp, %rdi #, tmp77
> rep movsl
>
> Apparently it thinks it is fine to do 8*4-byte MOVS. But why not
> 4*8-byte MOVS?
>
> That's half the loops.
>
> [ What the machine actually does underneath is a whole different story. The
>   copy being only half a cacheline probably doesn't help, and if it really does
>   the separate MOVs, then 4 8-byte ones would still be cheaper. ]
>
> One thing's for sure - both variants are certainly cheaper than a CALL to
> a memcpy variant.
>
> What we probably should try to do, though, is simply patch in the body
> of REP; MOVSQ or REP; MOVSB into the call sites and only have a call to
> memcpy_orig(), because that last one is fat.
>
> I remember we did talk about it at some point but don't remember why we
> didn't do it.
Right, and I believe it was Linus who mentioned we should simply patch
in the small REP_GOOD and ERMS memcpy body into the call sites. CCed.
Anyway, something like below.
It almost builds here - I still need to figure out arch/x86/boot/compressed/misc.c,
which includes the decompression logic that does memcpy; that memcpy now turns
into the alternative version, which we don't want in arch/x86/boot/compressed/.
Also, I need to check what vmlinuz size bloat we're talking: with the
diff below, we do add padding which looks like this:
3d: 90 nop
3e: 90 nop
3f: 90 nop
40: 90 nop
41: 90 nop
42: 90 nop
43: 90 nop
44: 90 nop
45: 90 nop
46: 90 nop
47: 90 nop
48: 90 nop
49: 90 nop
4a: 90 nop
4b: 90 nop
4c: 90 nop
4d: 90 nop
4e: 90 nop
4f: 90 nop
and that at every call site.
But at least we save ourselves the call overhead and use the optimal
memcpy variant on each machine.
---
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 2d449337a360..24711ca4033b 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -143,9 +143,7 @@ endif
export CONFIG_X86_X32_ABI
# Don't unroll struct assignments with kmemcheck enabled
-ifeq ($(CONFIG_KMEMCHECK),y)
- KBUILD_CFLAGS += $(call cc-option,-fno-builtin-memcpy)
-endif
+KBUILD_CFLAGS += $(call cc-option,-fno-builtin)
# Stackpointer is addressed different for 32 bit and 64 bit x86
sp-$(CONFIG_X86_32) := esp
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 1b020381ab38..8216c2e610ae 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -204,6 +204,12 @@ static inline int alternatives_text_reserved(void *start, void *end)
asm volatile (ALTERNATIVE(oldinstr, newinstr, feature) \
: output : "i" (0), ## input)
+#define alternative_io_2(oldinstr, newinstr1, feature1, newinstr2, feature2, \
+ output, input...) \
+ asm volatile(ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2) \
+ : output \
+ : "i" (0), ## input)
+
/* Like alternative_io, but for replacing a direct call with another one. */
#define alternative_call(oldfunc, newfunc, feature, output, input...) \
asm volatile (ALTERNATIVE("call %P[old]", "call %P[new]", feature) \
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index a164862d77e3..a529a12a35ed 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -3,6 +3,8 @@
#ifdef __KERNEL__
#include <linux/jump_label.h>
+#include <asm/cpufeatures.h>
+#include <asm/alternative.h>
/* Written 2002 by Andi Kleen */
@@ -28,8 +30,37 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
function. */
#define __HAVE_ARCH_MEMCPY 1
-extern void *memcpy(void *to, const void *from, size_t len);
-extern void *__memcpy(void *to, const void *from, size_t len);
+void *memcpy_orig(void *to, const void *from, size_t len);
+
+/*
+ * We build a call to memcpy_orig by default which gets replaced on
+ * the majority of x86 CPUs which set REP_GOOD. In addition, CPUs which
+ * have the enhanced REP MOVSB/STOSB feature (ERMS), change the REP_GOOD
+ * routine to a REP; MOVSB mem copy.
+ */
+static inline void *memcpy(void *to, const void *from, size_t len)
+{
+ void *ret;
+
+ alternative_io_2("call memcpy_orig\n\t",
+ "movq %%rdi, %%rax\n\t"
+ "movq %%rdx, %%rcx\n\t"
+ "shrq $3, %%rcx\n\t"
+ "andl $7, %%edx\n\t"
+ "rep movsq\n\t"
+ "movl %%edx, %%ecx\n\t"
+ "rep movsb\n\t",
+ X86_FEATURE_REP_GOOD,
+ "movq %%rdi, %%rax\n\t"
+ "movq %%rdx, %%rcx\n\t"
+ "rep movsb\n\t",
+ X86_FEATURE_ERMS,
+ ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from), "=d" (len)),
+ "1" (to), "2" (from), "3" (len)
+ : "memory", "rcx");
+
+ return ret;
+}
#ifndef CONFIG_KMEMCHECK
#if (__GNUC__ == 4 && __GNUC_MINOR__ < 3) || __GNUC__ < 4
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 779782f58324..c3c8890b155c 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -6,13 +6,6 @@
#include <asm/alternative-asm.h>
#include <asm/export.h>
-/*
- * We build a jump to memcpy_orig by default which gets NOPped out on
- * the majority of x86 CPUs which set REP_GOOD. In addition, CPUs which
- * have the enhanced REP MOVSB/STOSB feature (ERMS), change those NOPs
- * to a jmp to memcpy_erms which does the REP; MOVSB mem copy.
- */
-
.weak memcpy
/*
@@ -26,36 +19,12 @@
* Output:
* rax original destination
*/
-ENTRY(__memcpy)
-ENTRY(memcpy)
- ALTERNATIVE_2 "jmp memcpy_orig", "", X86_FEATURE_REP_GOOD, \
- "jmp memcpy_erms", X86_FEATURE_ERMS
-
- movq %rdi, %rax
- movq %rdx, %rcx
- shrq $3, %rcx
- andl $7, %edx
- rep movsq
- movl %edx, %ecx
- rep movsb
- ret
-ENDPROC(memcpy)
-ENDPROC(__memcpy)
-EXPORT_SYMBOL(memcpy)
-EXPORT_SYMBOL(__memcpy)
-
-/*
- * memcpy_erms() - enhanced fast string memcpy. This is faster and
- * simpler than memcpy. Use memcpy_erms when possible.
- */
-ENTRY(memcpy_erms)
- movq %rdi, %rax
- movq %rdx, %rcx
- rep movsb
- ret
-ENDPROC(memcpy_erms)
-
ENTRY(memcpy_orig)
+ pushq %r8
+ pushq %r9
+ pushq %r10
+ pushq %r11
+
movq %rdi, %rax
cmpq $0x20, %rdx
@@ -179,8 +148,14 @@ ENTRY(memcpy_orig)
movb %cl, (%rdi)
.Lend:
+ popq %r11
+ popq %r10
+ popq %r9
+ popq %r8
+
retq
ENDPROC(memcpy_orig)
+EXPORT_SYMBOL_GPL(memcpy_orig)
#ifndef CONFIG_UML
/*
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
On Sun, Mar 05, 2017 at 12:18:23PM +0100, Borislav Petkov wrote:
> Also, I need to check what vmlinuz size bloat we're talking: with the
> diff below, we do add padding which looks like this:
Yeah, even a tailored config adds ~67K:
   text     data      bss       dec     hex filename
7567290  4040894  2560000  14168184  d83078 vmlinux
7635332  4041694  2560000  14237026  d93d62 vmlinux.memcpy
I'm assuming allmodconfig would become even bigger.
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
On Sun, Mar 5, 2017 at 1:50 AM, Borislav Petkov <[email protected]> wrote:
>
> gcc can't possibly know on what targets is that kernel going to be
> booted on. So it probably does some universally optimal things, like in
> the dmi_scan_machine() case:
>
> memcpy_fromio(buf, p, 32);
>
> turns into:
>
> .loc 3 219 0
> movl $8, %ecx #, tmp79
> movq %rax, %rsi # p, p
> movq %rsp, %rdi #, tmp77
> rep movsl
>
> Apparently it thinks it is fine to do 8*4-byte MOVS. But why not
> 4*8-byte MOVS?
Actually, the "fromio/toio" code should never use regular memcpy().
There used to be devices that literally broke on 64-bit accesses due
to broken PCI crud.
We seem to have broken this *really* long ago, though. On x86-64 we
used to have a special __inline_memcpy() that copies our historical
32-bit thing, and was used for memcpy_fromio() and memcpy_toio(). That
was then undone by commit 6175ddf06b61 ("x86: Clean up mem*io
functions")
That commit says
"Iomem has no special significance on x86"
but that's not strictly true. iomem is in the same address space and
uses the same access instructions as regular memory, but iomem _is_
special.
And I think it's a bug that we use "memcpy()" on it. Not because of
any gcc issues, but simply because our own memcpy() optimizations are
not appropriate for iomem.
For example, "rep movsb" really is the right thing to use on normal
memory on modern CPU's.
But it is *not* the right thing to use on IO memory, because the CPU
only does the magic cacheline access optimizations on cacheable
memory!
So I think we should re-introduce that old "__inline_memcpy()" as that
special "safe memcpy" thing. Not just for KMEMCHECK, and not just for
64-bit.
Hmm?
Linus
On Sun, Mar 05, 2017 at 11:19:42AM -0800, Linus Torvalds wrote:
> Actually, the "fromio/toio" code should never use regular memcpy().
> There used to be devices that literally broke on 64-bit accesses due
> to broken PCI crud.
>
> We seem to have broken this *really* long ago, though.
I wonder why nothing blew up or failed strangely by now...
> On x86-64 we used to have a special __inline_memcpy() that copies our
> historical
It is still there in arch/x86/include/asm/string_64.h.
The comment "Only used for special circumstances." over it is grossly
understating it.
> 32-bit thing, and was used for memcpy_fromio() and memcpy_toio(). That
> was then undone by commit 6175ddf06b61 ("x86: Clean up mem*io
> functions")
>
> That commit says
>
> "Iomem has no special significance on x86"
>
> but that's not strictly true. iomem is in the same address space and
> uses the same access instructions as regular memory, but iomem _is_
> special.
>
> And I think it's a bug that we use "memcpy()" on it. Not because of
> any gcc issues, but simply because our own memcpy() optimizations are
> not appropriate for iomem.
>
> For example, "rep movsb" really is the right thing to use on normal
> memory on modern CPU's.
So Logan's box is a SNB and it doesn't have the ERMS optimizations. Are
you saying, regardless, we should let gcc put REP; MOVSB for smaller
sizes?
Because gcc does generate a REP; MOVSB there when it puts its own
memcpy, see mail upthread. (Even though that is wrong to do on iomem.)
> But it is *not* the right thing to use on IO memory, because the CPU
> only does the magic cacheline access optimizations on cacheable
> memory!
Ah, yes.
> So I think we should re-introduce that old "__inline_memcpy()" as that
> special "safe memcpy" thing. Not just for KMEMCHECK, and not just for
> 64-bit.
Logan, wanna give that a try, see if it takes care of your issue?
Oh, and along with the revert we would need a big fat warning explaining
why we need that special memcpy for IO memory.
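(For reference, the __inline_memcpy() still in arch/x86/include/asm/string_64.h
is roughly the classic 4-bytes-at-a-time copy with a 2/1-byte tail - quoted from
memory, so double-check against the tree:)

  static __always_inline void *__inline_memcpy(void *to, const void *from, size_t n)
  {
          unsigned long d0, d1, d2;
          asm volatile("rep ; movsl\n\t"
                       "testb $2,%b4\n\t"
                       "je 1f\n\t"
                       "movsw\n"
                       "1:\ttestb $1,%b4\n\t"
                       "je 2f\n\t"
                       "movsb\n"
                       "2:"
                       : "=&c" (d0), "=&D" (d1), "=&S" (d2)
                       : "0" (n / 4), "q" (n), "1" ((long)to), "2" ((long)from)
                       : "memory");
          return to;
  }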
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
On Sun, Mar 5, 2017 at 11:54 AM, Borislav Petkov <[email protected]> wrote:
>>
>> We seem to have broken this *really* long ago, though.
>
> I wonder why nothing blew up or failed strangely by now...
The hardware that cared was pretty broken to begin with, and I think
it was mainly some really odd graphics cards.
And from memory, they had issues with 64-bit writes. We actually have
a slow 16-bit word at a time copy for exactly these kinds of issues:
scr_memcpyw() and friends.
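(scr_memcpyw() lives in include/linux/vt_buffer.h; when an arch doesn't provide
its own, the fallback is a 16-bit-at-a-time loop, roughly:)

  static inline void scr_memcpyw(u16 *d, const u16 *s, unsigned int count)
  {
          count /= 2;
          while (count--)
                  scr_writew(scr_readw(s++), d++);
  }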
I'd like to say that it was one of those shit server-only cards that
nobody sane would ever use (but "server hardware is validated and
better quality!"), but that might have been another issue.
>> For example, "rep movsb" really is the right thing to use on normal
>> memory on modern CPU's.
>
> So Logan's box is a SNB and it doesn't have the ERMS optimizations. Are
> you saying, regardless, we should let gcc put REP; MOVSB for smaller
> sizes?
I think gcc makes bad choices, but they are gcc's choices to make.
I gave up on gcc's "-Os" because the choices were so bad and not getting fixed.
.. and none of that has _anything_ to do with accesses to IO memory,
which is fundamentally different.
> Because gcc does generate a REP; MOVSB there when it puts its own
> memcpy, see mail upthread. (Even though that is wrong to do on iomem.)
No, *THAT* is not wrong to do on iomem. If we tell gcc that "memcpy()"
works on iomem, then gcc can damn well do whatever it wants.
"rep stosb" isn't wrong for memcpy(). Gcc may do stupid things with
it, but that's completely immaterial.
> Oh, and along with the revert we would need a big fat warning explaining
> why we need that special memcpy for IO memory.
Well, quite frankly, just a simple "IO memory is different from cached
memory" should be sufficient.
Linus
On Sun, Mar 05, 2017 at 11:19:42AM -0800, Linus Torvalds wrote:
>> But it is *not* the right thing to use on IO memory, because the CPU
>> only does the magic cacheline access optimizations on cacheable
>> memory!
Yes, and actually this is where I started. I thought my memcpy was using
byte accesses on purpose and that I needed to create a patch for a different
IO memcpy, because byte accesses over the PCI bus would obviously be far from
ideal. However, once I found my system wasn't intentionally using that
implementation, it was no longer my focus.
So, I have no way to test this, but it sounds like any Ivy Bridge system
using the ERMS version of memcpy could have the same slow PCI memcpy
performance I've been seeing (unless the microcode fixes it up?). So it
sounds like it would be a good idea to revert the change Linus is
talking about.
>> So I think we should re-introduce that old "__inline_memcpy()" as that
>> special "safe memcpy" thing. Not just for KMEMCHECK, and not just for
>> 64-bit.
On 05/03/17 12:54 PM, Borislav Petkov wrote:
> Logan, wanna give that a try, see if it takes care of your issue?
Well honestly my issue was solved by fixing my kernel config. I have no
idea why I had optimize for size in there in the first place.
Thanks,
Logan
On 03/05/17 23:01, Logan Gunthorpe wrote:
>
> On 05/03/17 12:54 PM, Borislav Petkov wrote:
>> Logan, wanna give that a try, see if it takes care of your issue?
>
> Well honestly my issue was solved by fixing my kernel config. I have no
> idea why I had optimize for size in there in the first place.
>
Yes, to gcc "optimize for size" means exactly that... intended for cases
where saving storage (e.g. ROM) or code download time is paramount.
We have frequently asked them for an optimization option that is not
strictly byte-count-based but still tries to reduce cache footprint, but
it is a nontrivial project.
-hpa
On Mon, Mar 06, 2017 at 12:01:10AM -0700, Logan Gunthorpe wrote:
> Well honestly my issue was solved by fixing my kernel config. I have no
> idea why I had optimize for size in there in the first place.
I still think that we should address the iomem memcpy Linus mentioned.
So how about this partial revert. I've made 32-bit use the same special
__memcpy() version.
Hmmm?
---
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 7afb0e2f07f4..9e378a10796d 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -201,6 +201,7 @@ extern void set_iounmap_nonlazy(void);
#ifdef __KERNEL__
#include <asm-generic/iomap.h>
+#include <asm/string.h>
/*
* Convert a virtual cached pointer to an uncached pointer
@@ -227,12 +228,13 @@ memset_io(volatile void __iomem *addr, unsigned char val, size_t count)
* @src: The (I/O memory) source for the data
* @count: The number of bytes to copy
*
- * Copy a block of data from I/O memory.
+ * Copy a block of data from I/O memory. IO memory is different from
+ * cached memory so we use special memcpy version.
*/
static inline void
memcpy_fromio(void *dst, const volatile void __iomem *src, size_t count)
{
- memcpy(dst, (const void __force *)src, count);
+ __inline_memcpy(dst, (const void __force *)src, count);
}
/**
@@ -241,12 +243,13 @@ memcpy_fromio(void *dst, const volatile void __iomem *src, size_t count)
* @src: The (RAM) source for the data
* @count: The number of bytes to copy
*
- * Copy a block of data to I/O memory.
+ * Copy a block of data to I/O memory. IO memory is different from
+ * cached memory so we use special memcpy version.
*/
static inline void
memcpy_toio(volatile void __iomem *dst, const void *src, size_t count)
{
- memcpy((void __force *)dst, src, count);
+ __inline_memcpy((void __force *)dst, src, count);
}
/*
diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index 3d3e8353ee5c..556fa4a975ff 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -29,6 +29,7 @@ extern char *strchr(const char *s, int c);
#define __HAVE_ARCH_STRLEN
extern size_t strlen(const char *s);
+#define __inline_memcpy __memcpy
static __always_inline void *__memcpy(void *to, const void *from, size_t n)
{
int d0, d1, d2;
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
On March 6, 2017 5:33:28 AM PST, Borislav Petkov <[email protected]> wrote:
>On Mon, Mar 06, 2017 at 12:01:10AM -0700, Logan Gunthorpe wrote:
>> Well honestly my issue was solved by fixing my kernel config. I have
>no
>> idea why I had optimize for size in there in the first place.
>
>I still think that we should address the iomem memcpy Linus mentioned.
>So how about this partial revert. I've made 32-bit use the same special
>__memcpy() version.
>
>Hmmm?
>
>---
>diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
>index 7afb0e2f07f4..9e378a10796d 100644
>--- a/arch/x86/include/asm/io.h
>+++ b/arch/x86/include/asm/io.h
>@@ -201,6 +201,7 @@ extern void set_iounmap_nonlazy(void);
> #ifdef __KERNEL__
>
> #include <asm-generic/iomap.h>
>+#include <asm/string.h>
>
> /*
> * Convert a virtual cached pointer to an uncached pointer
>@@ -227,12 +228,13 @@ memset_io(volatile void __iomem *addr, unsigned
>char val, size_t count)
> * @src: The (I/O memory) source for the data
> * @count: The number of bytes to copy
> *
>- * Copy a block of data from I/O memory.
>+ * Copy a block of data from I/O memory. IO memory is different from
>+ * cached memory so we use special memcpy version.
> */
> static inline void
>memcpy_fromio(void *dst, const volatile void __iomem *src, size_t
>count)
> {
>- memcpy(dst, (const void __force *)src, count);
>+ __inline_memcpy(dst, (const void __force *)src, count);
> }
>
> /**
>@@ -241,12 +243,13 @@ memcpy_fromio(void *dst, const volatile void
>__iomem *src, size_t count)
> * @src: The (RAM) source for the data
> * @count: The number of bytes to copy
> *
>- * Copy a block of data to I/O memory.
>+ * Copy a block of data to I/O memory. IO memory is different from
>+ * cached memory so we use special memcpy version.
> */
> static inline void
> memcpy_toio(volatile void __iomem *dst, const void *src, size_t count)
> {
>- memcpy((void __force *)dst, src, count);
>+ __inline_memcpy((void __force *)dst, src, count);
> }
>
> /*
>diff --git a/arch/x86/include/asm/string_32.h
>b/arch/x86/include/asm/string_32.h
>index 3d3e8353ee5c..556fa4a975ff 100644
>--- a/arch/x86/include/asm/string_32.h
>+++ b/arch/x86/include/asm/string_32.h
>@@ -29,6 +29,7 @@ extern char *strchr(const char *s, int c);
> #define __HAVE_ARCH_STRLEN
> extern size_t strlen(const char *s);
>
>+#define __inline_memcpy __memcpy
>static __always_inline void *__memcpy(void *to, const void *from,
>size_t n)
> {
> int d0, d1, d2;
It isn't really that straightforward IMO.
For UC memory, the transaction size really needs to be specified explicitly at all times and should be part of the API, rather than implicit.
For WC/WT/WB device memory, the ordinary memcpy is valid and preferred.
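(A minimal sketch of the "explicit transaction size" idea, assuming a device
that wants 32-bit accesses - mmio_copy_fromio_32() is a made-up name, not a
kernel API, and readl() comes from linux/io.h:)

  static void mmio_copy_fromio_32(void *dst, const void __iomem *src, size_t count)
  {
          u32 *d = dst;

          /* every MMIO read is an explicit 32-bit transaction */
          for (; count >= 4; count -= 4, src += 4)
                  *d++ = readl(src);
  }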
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Mon, Mar 06, 2017 at 05:41:22AM -0800, [email protected] wrote:
> It isn't really that straightforward IMO.
>
> For UC memory, the transaction size really needs to be specified explicitly
> at all times and should be part of the API, rather than implicit.
>
> For WC/WT/WB device memory, the ordinary memcpy is valid and
> preferred.
I'm practically partially reverting
6175ddf06b61 ("x86: Clean up mem*io functions.")
Are you saying, this was wrong before too?
Maybe it was wrong, strictly speaking, but maybe that was good enough
for our purposes...
--
Regards/Gruss,
Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--
On 06/03/17 12:28 AM, H. Peter Anvin wrote:
> On 03/05/17 23:01, Logan Gunthorpe wrote:
>>
>> On 05/03/17 12:54 PM, Borislav Petkov wrote:
>>> Logan, wanna give that a try, see if it takes care of your issue?
>>
>> Well honestly my issue was solved by fixing my kernel config. I have no
>> idea why I had optimize for size in there in the first place.
>>
>
> Yes, to gcc "optimize for size" means exactly that... intended for cases
> where saving storage (e.g. ROM) or code download time is paramount.
I agree and understand; however, placing a poorly performing _inline_
memcpy instead of a single call instruction to a performant memcpy
probably took more code space in the end. So, like Linus, I just have to
scratch my head at the -Os optimization option.
Logan
On March 6, 2017 9:12:41 AM PST, Logan Gunthorpe <[email protected]> wrote:
>
>
>On 06/03/17 12:28 AM, H. Peter Anvin wrote:
>> On 03/05/17 23:01, Logan Gunthorpe wrote:
>>>
>>> On 05/03/17 12:54 PM, Borislav Petkov wrote:
>>>> Logan, wanna give that a try, see if it takes care of your issue?
>>>
>>> Well honestly my issue was solved by fixing my kernel config. I have
>no
>>> idea why I had optimize for size in there in the first place.
>>>
>>
>> Yes, to gcc "optimize for size" means exactly that... intended for
>cases
>> where saving storage (e.g. ROM) or code download time is paramount.
>
>I agree and understand; however, placing a poorly performing _inline_
>memcpy instead of a single call instruction to a performant memcpy
>probably took more code space in the end. So, like Linus, I just have to
>scratch my head at the -Os optimization option.
>
>Logan
No, it will be smaller: -Os counts bytes.
If you think about it, there is no way that replacing a five-byte subroutine call with a two-byte instruction opcode can make it bigger! The only other difference between the two from a size perspective is that the compiler doesn't have to worry about clobbered registers other than the argument registers.
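(Concretely, under the standard encodings - and the rdi/rsi/count register
setup is needed in both cases anyway:)

  e8 xx xx xx xx      call   memcpy     # 5 bytes
  f3 a4               rep movsb         # 2 bytes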
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.