2020-04-07 15:06:11

by Mikulas Patocka

Subject: [PATCH] memcpy_flushcache: use cache flushing for larger lengths

[ resending this to x86 maintainers ]

Hi

I tested the performance of various methods of writing to Optane-based
persistent memory and found that non-temporal stores achieve a throughput
of 1.3 GB/s, while 8 cached stores immediately followed by clflushopt or
clwb achieve 1.6 GB/s.
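
For anyone who wants to try the two write methods from user space, here is a minimal sketch, assuming x86-64 and GCC or Clang. The helper names (have_clflushopt, flush_line, copy_nt, copy_flush) are made up for illustration, and this is not the benchmark that produced the numbers above. It only shows the shape of each method: 8-byte movnti stores versus filling one 64-byte line with ordinary stores and flushing it right away.

```c
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <cpuid.h>
#include <emmintrin.h>	/* _mm_stream_si64, _mm_clflush, _mm_sfence */

/* CPUID leaf 7, sub-leaf 0: EBX bit 23 reports CLFLUSHOPT support. */
static int have_clflushopt(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 0;
	return (ebx >> 23) & 1;
}

/* Flush one cache line; fall back to plain clflush when needed. */
static void flush_line(void *p, int use_opt)
{
	if (use_opt)
		__asm__ volatile("clflushopt %0" : "+m" (*(volatile char *)p));
	else
		_mm_clflush(p);
}

/* Method 1: 8-byte non-temporal stores (movnti). Copies len / 8 words. */
static void copy_nt(void *dst, const void *src, size_t len)
{
	long long *d = dst;
	const char *s = src;
	size_t i;

	for (i = 0; i < len / 8; i++) {
		long long v;

		memcpy(&v, s + 8 * i, sizeof(v));
		_mm_stream_si64(&d[i], v);
	}
	_mm_sfence();	/* order the weakly-ordered stores */
}

/* Method 2: fill one 64-byte line with cached stores, then flush it. */
static void copy_flush(void *dst, const void *src, size_t len)
{
	int use_opt = have_clflushopt();
	size_t off;

	for (off = 0; off + 64 <= len; off += 64) {
		memcpy((char *)dst + off, (const char *)src + off, 64);
		flush_line((char *)dst + off, use_opt);
	}
	_mm_sfence();	/* clflushopt is weakly ordered as well */
}
#else
/* Portable stand-ins so the sketch still builds on other targets. */
static void copy_nt(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len & ~(size_t)7);
}

static void copy_flush(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len & ~(size_t)63);
}
#endif
```

Both helpers expect a 64-byte-aligned destination and a length that is a multiple of 64. Any throughput difference only shows up when the destination is real persistent memory, for example a mapped /dev/dax device.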

memcpy_flushcache uses non-temporal stores. I modified it to use cached
stores + clflushopt, and this improved the performance of the dm-writecache
target significantly:

dm-writecache throughput:
(dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct)
writecache block size   512        1024       2048       4096
movnti                  496 MB/s   642 MB/s   725 MB/s   744 MB/s
clflushopt              373 MB/s   688 MB/s   1.1 GB/s   1.2 GB/s

For block size 512, movnti works better; for larger block sizes,
clflushopt is better.

I also tested the novafs filesystem. It is not upstream, but it
benefited from a similar change in __memcpy_flushcache and
__copy_user_nocache:
write throughput on big files - movnti: 662 MB/s, clwb: 1323 MB/s
write throughput on small files - movnti: 621 MB/s, clwb: 1013 MB/s


I am submitting this patch for __memcpy_flushcache, which improves
dm-writecache performance.

Other ideas - should we introduce memcpy_to_pmem instead of modifying
memcpy_flushcache and move this logic there? Or should I modify the
dm-writecache target directly to use clflushopt with no change to the
architecture-specific code?

Mikulas




From: Mikulas Patocka <[email protected]>

I tested dm-writecache performance on a machine with Optane nvdimm and it
turned out that for larger writes, cached stores + cache flushing perform
better than non-temporal stores. This is the throughput of dm-writecache
measured with this command:
dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct

block size    512        1024       2048       4096
movnti        496 MB/s   642 MB/s   725 MB/s   744 MB/s
clflushopt    373 MB/s   688 MB/s   1.1 GB/s   1.2 GB/s

We can see that for smaller blocks, movnti performs better, but for larger
blocks, clflushopt has better performance.

This patch changes the function __memcpy_flushcache accordingly, so that
with size >= 768 it performs cached stores and cache flushing. Note that
we must not use the new branch if the CPU doesn't have clflushopt - in
that case, the kernel would use the inefficient "clflush" instruction,
which has very bad performance.
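
The selection rule described above can be written out as a small predicate. This is only a sketch in plain C: use_flush_path, head_bytes, FLUSH_THRESHOLD and LINE are illustrative names, not kernel symbols. The second helper computes how many bytes are consumed before the destination reaches a 64-byte boundary, which is at most 56 when the destination is already 8-byte aligned.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FLUSH_THRESHOLD 768	/* size limit taken from the patch description */
#define LINE 64			/* cache line size the new branch requires */

/* True when the cached-store + flush branch would be taken. */
static bool use_flush_path(size_t size, bool has_clflushopt,
			   unsigned int clflush_size)
{
	return has_clflushopt && size >= FLUSH_THRESHOLD &&
	       clflush_size == LINE;
}

/* Bytes to copy before dest reaches the next 64-byte boundary. */
static size_t head_bytes(uintptr_t dest)
{
	return (LINE - (dest & (LINE - 1))) & (LINE - 1);
}
```

With a threshold of 768 and at most 56 head bytes, at least 712 bytes remain once the destination is aligned, so a loop that copies whole 64-byte lines can always run at least once.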

Signed-off-by: Mikulas Patocka <[email protected]>

---
arch/x86/lib/usercopy_64.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)

Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c 2020-03-24 15:15:36.644945091 -0400
+++ linux-2.6/arch/x86/lib/usercopy_64.c 2020-03-30 07:17:51.450290007 -0400
@@ -152,6 +152,42 @@ void __memcpy_flushcache(void *_dst, con
 		return;
 	}

+	if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && size >= 768 && likely(boot_cpu_data.x86_clflush_size == 64)) {
+		while (!IS_ALIGNED(dest, 64)) {
+			asm("movq (%0), %%r8\n"
+			    "movnti %%r8, (%1)\n"
+			    :: "r" (source), "r" (dest)
+			    : "memory", "r8");
+			dest += 8;
+			source += 8;
+			size -= 8;
+		}
+		do {
+			asm("movq (%0), %%r8\n"
+			    "movq 8(%0), %%r9\n"
+			    "movq 16(%0), %%r10\n"
+			    "movq 24(%0), %%r11\n"
+			    "movq %%r8, (%1)\n"
+			    "movq %%r9, 8(%1)\n"
+			    "movq %%r10, 16(%1)\n"
+			    "movq %%r11, 24(%1)\n"
+			    "movq 32(%0), %%r8\n"
+			    "movq 40(%0), %%r9\n"
+			    "movq 48(%0), %%r10\n"
+			    "movq 56(%0), %%r11\n"
+			    "movq %%r8, 32(%1)\n"
+			    "movq %%r9, 40(%1)\n"
+			    "movq %%r10, 48(%1)\n"
+			    "movq %%r11, 56(%1)\n"
+			    :: "r" (source), "r" (dest)
+			    : "memory", "r8", "r9", "r10", "r11");
+			clflushopt((void *)dest);
+			dest += 64;
+			source += 64;
+			size -= 64;
+		} while (size >= 64);
+	}
+
 	/* 4x8 movnti loop */
 	while (size >= 32) {
 		asm("movq (%0), %%r8\n"
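
The control flow of the new branch can be modelled in portable C to check its bookkeeping. This is a sketch under stated assumptions, not the kernel code: memcpy stands in for the movq/movnti sequences, the flush is a caller-supplied hook (count_flush here only counts calls), and the function name copy_flush_branch is invented. It assumes size >= 768 and an 8-byte-aligned destination, which is what the head loop relies on to terminate.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static unsigned long flushed_lines;	/* demo counter, not part of the patch */

/* Stand-in for clflushopt(): it only counts how often it is called. */
static void count_flush(const void *line)
{
	(void)line;
	flushed_lines++;
}

/*
 * Model of the new branch. Preconditions: size >= 768 and dst is
 * 8-byte aligned. Returns the number of tail bytes (< 64) that are
 * left for the existing movnti loops.
 */
static size_t copy_flush_branch(unsigned char *dst, const unsigned char *src,
				size_t size, void (*flush)(const void *))
{
	/* Head: 8-byte steps until dst sits on a 64-byte boundary. */
	while ((uintptr_t)dst & 63) {
		memcpy(dst, src, 8);
		dst += 8;
		src += 8;
		size -= 8;
	}

	/* Body: one full cache line per iteration, flushed immediately. */
	do {
		memcpy(dst, src, 64);
		flush(dst);
		dst += 64;
		src += 64;
		size -= 64;
	} while (size >= 64);

	return size;
}
```

For example, an 800-byte copy whose destination starts 8 bytes past a line boundary spends 56 bytes in the head loop, flushes 11 lines, and leaves a 40-byte tail.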


2020-04-07 16:10:51

by Andy Lutomirski

Subject: Re: [PATCH] memcpy_flushcache: use cache flushing for larger lengths


> On Apr 7, 2020, at 8:01 AM, Mikulas Patocka <[email protected]> wrote:
>
> [ resending this to x86 maintainers ]
>
> Hi
>
> I tested performance of various methods how to write to optane-based
> persistent memory, and found out that non-temporal stores achieve
> throughput 1.3 GB/s. 8 cached stores immediatelly followed by clflushopt
> or clwb achieve throughput 1.6 GB/s.
>
> memcpy_flushcache uses non-temporal stores, I modified it to use cached
> stores + clflushopt and it improved performance of the dm-writecache
> target significantly:
>
> dm-writecache throughput:
> (dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct)
> writecache block size 512 1024 2048 4096
> movnti 496 MB/s 642 MB/s 725 MB/s 744 MB/s
> clflushopt 373 MB/s 688 MB/s 1.1 GB/s 1.2 GB/s
>
> For block size 512, movnti works better, for larger block sizes,
> clflushopt is better.
>
> I was also testing the novafs filesystem, it is not upstream, but it
> benefitted from similar change in __memcpy_flushcache and
> __copy_user_nocache:
> write throughput on big files - movnti: 662 MB/s, clwb: 1323 MB/s
> write throughput on small files - movnti: 621 MB/s, clwb: 1013 MB/s
>
>
> I submit this patch for __memcpy_flushcache that improves dm-writecache
> performance.
>
> Other ideas - should we introduce memcpy_to_pmem instead of modifying
> memcpy_flushcache and move this logic there? Or should I modify the
> dm-writecache target directly to use clflushopt with no change to the
> architecture-specific code?
>
> Mikulas
>
>
>
>
> From: Mikulas Patocka <[email protected]>
>
> I tested dm-writecache performance on a machine with Optane nvdimm and it
> turned out that for larger writes, cached stores + cache flushing perform
> better than non-temporal stores. This is the throughput of dm-writecache
> measured with this command:
> dd if=/dev/zero of=/dev/mapper/wc bs=64 oflag=direct
>
> block size 512 1024 2048 4096
> movnti 496 MB/s 642 MB/s 725 MB/s 744 MB/s
> clflushopt 373 MB/s 688 MB/s 1.1 GB/s 1.2 GB/s
>
> We can see that for smaller block, movnti performs better, but for larger
> blocks, clflushopt has better performance.
>
> This patch changes the function __memcpy_flushcache accordingly, so that
> with size >= 768 it performs cached stores and cache flushing. Note that
> we must not use the new branch if the CPU doesn't have clflushopt - in
> that case, the kernel would use inefficient "clflush" instruction that has
> very bad performance.
>
> Signed-off-by: Mikulas Patocka <[email protected]>
>
> ---
> arch/x86/lib/usercopy_64.c | 36 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
>
> Index: linux-2.6/arch/x86/lib/usercopy_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/lib/usercopy_64.c 2020-03-24 15:15:36.644945091 -0400
> +++ linux-2.6/arch/x86/lib/usercopy_64.c 2020-03-30 07:17:51.450290007 -0400
> @@ -152,6 +152,42 @@ void __memcpy_flushcache(void *_dst, con
> return;
> }
>
> + if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && size >= 768 && likely(boot_cpu_data.x86_clflush_size == 64)) {
> + while (!IS_ALIGNED(dest, 64)) {
> + asm("movq (%0), %%r8\n"
> + "movnti %%r8, (%1)\n"
> + :: "r" (source), "r" (dest)
> + : "memory", "r8");
> + dest += 8;
> + source += 8;
> + size -= 8;
> + }
> + do {
> + asm("movq (%0), %%r8\n"
> + "movq 8(%0), %%r9\n"
> + "movq 16(%0), %%r10\n"
> + "movq 24(%0), %%r11\n"
> + "movq %%r8, (%1)\n"
> + "movq %%r9, 8(%1)\n"
> + "movq %%r10, 16(%1)\n"
> + "movq %%r11, 24(%1)\n"
> + "movq 32(%0), %%r8\n"
> + "movq 40(%0), %%r9\n"
> + "movq 48(%0), %%r10\n"
> + "movq 56(%0), %%r11\n"
> + "movq %%r8, 32(%1)\n"
> + "movq %%r9, 40(%1)\n"
> + "movq %%r10, 48(%1)\n"
> + "movq %%r11, 56(%1)\n"
> + :: "r" (source), "r" (dest)
> + : "memory", "r8", "r9", "r10", "r11");

Does this actually work better than the corresponding C code?

Also, that memory clobber probably isn’t doing your code generation any favors. Experimentally, you have the constraints wrong. An “r” constraint doesn’t tell GCC that you are dereferencing the pointer. You need to use “m” with a correctly-sized type. But I bet plain C is at least as good.

> + clflushopt((void *)dest);
> + dest += 64;
> + source += 64;
> + size -= 64;
> + } while (size >= 64);
> + }
> +
> /* 4x8 movnti loop */
> while (size >= 32) {
> asm("movq (%0), %%r8\n"
>

2020-04-07 16:36:10

by Mikulas Patocka

Subject: Re: [PATCH] memcpy_flushcache: use cache flushing for larger lengths



On Tue, 7 Apr 2020, Andy Lutomirski wrote:

>
> > On Apr 7, 2020, at 8:01 AM, Mikulas Patocka <[email protected]> wrote:
> >
> > [ resending this to x86 maintainers ]
> >
> > Hi
> >
> > I tested performance of various methods how to write to optane-based
> > persistent memory, and found out that non-temporal stores achieve
> > throughput 1.3 GB/s. 8 cached stores immediatelly followed by clflushopt
> > or clwb achieve throughput 1.6 GB/s.
> >
> > memcpy_flushcache uses non-temporal stores, I modified it to use cached
> > stores + clflushopt and it improved performance of the dm-writecache
> > target significantly:
> >
> > dm-writecache throughput:
> > (dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct)
> > writecache block size 512 1024 2048 4096
> > movnti 496 MB/s 642 MB/s 725 MB/s 744 MB/s
> > clflushopt 373 MB/s 688 MB/s 1.1 GB/s 1.2 GB/s
> >
> > For block size 512, movnti works better, for larger block sizes,
> > clflushopt is better.
> >
> > I was also testing the novafs filesystem, it is not upstream, but it
> > benefitted from similar change in __memcpy_flushcache and
> > __copy_user_nocache:
> > write throughput on big files - movnti: 662 MB/s, clwb: 1323 MB/s
> > write throughput on small files - movnti: 621 MB/s, clwb: 1013 MB/s
> >
> >
> > I submit this patch for __memcpy_flushcache that improves dm-writecache
> > performance.
> >
> > Other ideas - should we introduce memcpy_to_pmem instead of modifying
> > memcpy_flushcache and move this logic there? Or should I modify the
> > dm-writecache target directly to use clflushopt with no change to the
> > architecture-specific code?
> >
> > Mikulas
> >
> >
> >
> >
> > From: Mikulas Patocka <[email protected]>
> >
> > I tested dm-writecache performance on a machine with Optane nvdimm and it
> > turned out that for larger writes, cached stores + cache flushing perform
> > better than non-temporal stores. This is the throughput of dm-writecache
> > measured with this command:
> > dd if=/dev/zero of=/dev/mapper/wc bs=64 oflag=direct
> >
> > block size 512 1024 2048 4096
> > movnti 496 MB/s 642 MB/s 725 MB/s 744 MB/s
> > clflushopt 373 MB/s 688 MB/s 1.1 GB/s 1.2 GB/s
> >
> > We can see that for smaller block, movnti performs better, but for larger
> > blocks, clflushopt has better performance.
> >
> > This patch changes the function __memcpy_flushcache accordingly, so that
> > with size >= 768 it performs cached stores and cache flushing. Note that
> > we must not use the new branch if the CPU doesn't have clflushopt - in
> > that case, the kernel would use inefficient "clflush" instruction that has
> > very bad performance.
> >
> > Signed-off-by: Mikulas Patocka <[email protected]>
> >
> > ---
> > arch/x86/lib/usercopy_64.c | 36 ++++++++++++++++++++++++++++++++++++
> > 1 file changed, 36 insertions(+)
> >
> > Index: linux-2.6/arch/x86/lib/usercopy_64.c
> > ===================================================================
> > --- linux-2.6.orig/arch/x86/lib/usercopy_64.c 2020-03-24 15:15:36.644945091 -0400
> > +++ linux-2.6/arch/x86/lib/usercopy_64.c 2020-03-30 07:17:51.450290007 -0400
> > @@ -152,6 +152,42 @@ void __memcpy_flushcache(void *_dst, con
> > return;
> > }
> >
> > + if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && size >= 768 && likely(boot_cpu_data.x86_clflush_size == 64)) {
> > + while (!IS_ALIGNED(dest, 64)) {
> > + asm("movq (%0), %%r8\n"
> > + "movnti %%r8, (%1)\n"
> > + :: "r" (source), "r" (dest)
> > + : "memory", "r8");
> > + dest += 8;
> > + source += 8;
> > + size -= 8;
> > + }
> > + do {
> > + asm("movq (%0), %%r8\n"
> > + "movq 8(%0), %%r9\n"
> > + "movq 16(%0), %%r10\n"
> > + "movq 24(%0), %%r11\n"
> > + "movq %%r8, (%1)\n"
> > + "movq %%r9, 8(%1)\n"
> > + "movq %%r10, 16(%1)\n"
> > + "movq %%r11, 24(%1)\n"
> > + "movq 32(%0), %%r8\n"
> > + "movq 40(%0), %%r9\n"
> > + "movq 48(%0), %%r10\n"
> > + "movq 56(%0), %%r11\n"
> > + "movq %%r8, 32(%1)\n"
> > + "movq %%r9, 40(%1)\n"
> > + "movq %%r10, 48(%1)\n"
> > + "movq %%r11, 56(%1)\n"
> > + :: "r" (source), "r" (dest)
> > + : "memory", "r8", "r9", "r10", "r11");
>
> Does this actually work better than the corresponding C code?
>
> Also, that memory clobber probably isn’t doing your code generation any
> favors. Experimentally, you have the constraints wrong. An “r”

The existing "movnti" loop uses exactly the same constraints (and the
"memory" clobber).

> constraint doesn’t tell GCC that you are dereferencing the pointer.
> You need to use “m” with a correctly-sized type.

But you would use
"=m"(*(char *)dest),"=m"(*((char *)dest + 8)),"=m"(*((char *)dest + 16))...
and so on, until you run out of argument numbers.
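
One way around the operand-count problem, sketched here as an assumption rather than anything proposed in the thread, is to describe the whole 64-byte line with a single memory operand whose type has the right size. The struct line64 type and the copy_line64 name are hypothetical. The "m" operands never appear in the template; they only tell the compiler which bytes are read and written, so no "memory" clobber is needed. The temporaries use early-clobber outputs instead of hard-coded r8-r11.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: lets one operand name a whole 64-byte line. */
struct line64 {
	unsigned char bytes[64];
};

static void copy_line64(void *dst, const void *src)
{
#if defined(__x86_64__) && defined(__GNUC__)
	uint64_t t0, t1, t2, t3;

	__asm__("movq (%[s]), %[t0]\n\t"
		"movq 8(%[s]), %[t1]\n\t"
		"movq 16(%[s]), %[t2]\n\t"
		"movq 24(%[s]), %[t3]\n\t"
		"movq %[t0], (%[d])\n\t"
		"movq %[t1], 8(%[d])\n\t"
		"movq %[t2], 16(%[d])\n\t"
		"movq %[t3], 24(%[d])\n\t"
		"movq 32(%[s]), %[t0]\n\t"
		"movq 40(%[s]), %[t1]\n\t"
		"movq 48(%[s]), %[t2]\n\t"
		"movq 56(%[s]), %[t3]\n\t"
		"movq %[t0], 32(%[d])\n\t"
		"movq %[t1], 40(%[d])\n\t"
		"movq %[t2], 48(%[d])\n\t"
		"movq %[t3], 56(%[d])"
		: [t0] "=&r" (t0), [t1] "=&r" (t1),
		  [t2] "=&r" (t2), [t3] "=&r" (t3),
		  /* the asm writes all 64 bytes at dst */
		  [out] "=m" (*(struct line64 *)dst)
		: [s] "r" (src), [d] "r" (dst),
		  /* the asm reads all 64 bytes at src */
		  [in] "m" (*(const struct line64 *)src));
#else
	/* Fallback for other targets: let the compiler expand the copy. */
	memcpy(dst, src, 64);
#endif
}
```

Whether this generates better code than the "r" plus "memory" clobber version would have to be checked in the compiler output for the real function.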

> But I bet plain C is at least as good.

I tried to replace it with
memcpy((void *)dest, (void *)source, 64);

The compiler inlined the memcpy function into 8 loads and 8 stores.
However, the whole function __memcpy_flushcache consumed one more saved
register and the machine code was a few bytes longer.

Mikulas

2020-04-07 17:54:15

by Dan Williams

Subject: Re: [PATCH] memcpy_flushcache: use cache flushing for larger lengths

On Tue, Apr 7, 2020 at 8:02 AM Mikulas Patocka <[email protected]> wrote:
>
> [ resending this to x86 maintainers ]
>
> Hi
>
> I tested performance of various methods how to write to optane-based
> persistent memory, and found out that non-temporal stores achieve
> throughput 1.3 GB/s. 8 cached stores immediatelly followed by clflushopt
> or clwb achieve throughput 1.6 GB/s.
>
> memcpy_flushcache uses non-temporal stores, I modified it to use cached
> stores + clflushopt and it improved performance of the dm-writecache
> target significantly:
>
> dm-writecache throughput:
> (dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct)
> writecache block size 512 1024 2048 4096
> movnti 496 MB/s 642 MB/s 725 MB/s 744 MB/s
> clflushopt 373 MB/s 688 MB/s 1.1 GB/s 1.2 GB/s
>
> For block size 512, movnti works better, for larger block sizes,
> clflushopt is better.

This should use clwb instead of clflushopt; the clwb macro
automatically falls back to clflushopt if clwb is not supported.

>
> I was also testing the novafs filesystem, it is not upstream, but it
> benefitted from similar change in __memcpy_flushcache and
> __copy_user_nocache:
> write throughput on big files - movnti: 662 MB/s, clwb: 1323 MB/s
> write throughput on small files - movnti: 621 MB/s, clwb: 1013 MB/s
>
>
> I submit this patch for __memcpy_flushcache that improves dm-writecache
> performance.
>
> Other ideas - should we introduce memcpy_to_pmem instead of modifying
> memcpy_flushcache and move this logic there? Or should I modify the
> dm-writecache target directly to use clflushopt with no change to the
> architecture-specific code?

This also needs to mention your analysis that showed that this can
have negative cache pollution effects [1], so I'm not sure how to
decide when to make the tradeoff. Once we have movdir64b the tradeoff
equation changes yet again:

[1]: https://lore.kernel.org/linux-nvdimm/alpine.LRH.2.02.2004010941310.23210@file01.intranet.prod.int.rdu2.redhat.com/


>
> Mikulas
>
>
>
>
> From: Mikulas Patocka <[email protected]>
>
> I tested dm-writecache performance on a machine with Optane nvdimm and it
> turned out that for larger writes, cached stores + cache flushing perform
> better than non-temporal stores. This is the throughput of dm-writecache
> measured with this command:
> dd if=/dev/zero of=/dev/mapper/wc bs=64 oflag=direct
>
> block size 512 1024 2048 4096
> movnti 496 MB/s 642 MB/s 725 MB/s 744 MB/s
> clflushopt 373 MB/s 688 MB/s 1.1 GB/s 1.2 GB/s
>
> We can see that for smaller block, movnti performs better, but for larger
> blocks, clflushopt has better performance.
>
> This patch changes the function __memcpy_flushcache accordingly, so that
> with size >= 768 it performs cached stores and cache flushing. Note that
> we must not use the new branch if the CPU doesn't have clflushopt - in
> that case, the kernel would use inefficient "clflush" instruction that has
> very bad performance.
>
> Signed-off-by: Mikulas Patocka <[email protected]>
>
> ---
> arch/x86/lib/usercopy_64.c | 36 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
>
> Index: linux-2.6/arch/x86/lib/usercopy_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/lib/usercopy_64.c 2020-03-24 15:15:36.644945091 -0400
> +++ linux-2.6/arch/x86/lib/usercopy_64.c 2020-03-30 07:17:51.450290007 -0400
> @@ -152,6 +152,42 @@ void __memcpy_flushcache(void *_dst, con
> return;
> }
>
> + if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && size >= 768 && likely(boot_cpu_data.x86_clflush_size == 64)) {
> + while (!IS_ALIGNED(dest, 64)) {
> + asm("movq (%0), %%r8\n"
> + "movnti %%r8, (%1)\n"
> + :: "r" (source), "r" (dest)
> + : "memory", "r8");
> + dest += 8;
> + source += 8;
> + size -= 8;
> + }
> + do {
> + asm("movq (%0), %%r8\n"
> + "movq 8(%0), %%r9\n"
> + "movq 16(%0), %%r10\n"
> + "movq 24(%0), %%r11\n"
> + "movq %%r8, (%1)\n"
> + "movq %%r9, 8(%1)\n"
> + "movq %%r10, 16(%1)\n"
> + "movq %%r11, 24(%1)\n"
> + "movq 32(%0), %%r8\n"
> + "movq 40(%0), %%r9\n"
> + "movq 48(%0), %%r10\n"
> + "movq 56(%0), %%r11\n"
> + "movq %%r8, 32(%1)\n"
> + "movq %%r9, 40(%1)\n"
> + "movq %%r10, 48(%1)\n"
> + "movq %%r11, 56(%1)\n"
> + :: "r" (source), "r" (dest)
> + : "memory", "r8", "r9", "r10", "r11");
> + clflushopt((void *)dest);
> + dest += 64;
> + source += 64;
> + size -= 64;
> + } while (size >= 64);
> + }
> +
> /* 4x8 movnti loop */
> while (size >= 32) {
> asm("movq (%0), %%r8\n"
>

2020-04-08 20:24:49

by Mikulas Patocka

Subject: Re: [PATCH] memcpy_flushcache: use cache flushing for larger lengths



On Tue, 7 Apr 2020, Dan Williams wrote:

> On Tue, Apr 7, 2020 at 8:02 AM Mikulas Patocka <[email protected]> wrote:
> >
> > [ resending this to x86 maintainers ]
> >
> > Hi
> >
> > I tested performance of various methods how to write to optane-based
> > persistent memory, and found out that non-temporal stores achieve
> > throughput 1.3 GB/s. 8 cached stores immediatelly followed by clflushopt
> > or clwb achieve throughput 1.6 GB/s.
> >
> > memcpy_flushcache uses non-temporal stores, I modified it to use cached
> > stores + clflushopt and it improved performance of the dm-writecache
> > target significantly:
> >
> > dm-writecache throughput:
> > (dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct)
> > writecache block size 512 1024 2048 4096
> > movnti 496 MB/s 642 MB/s 725 MB/s 744 MB/s
> > clflushopt 373 MB/s 688 MB/s 1.1 GB/s 1.2 GB/s
> >
> > For block size 512, movnti works better, for larger block sizes,
> > clflushopt is better.
>
> This should use clwb instead of clflushopt, the clwb macri
> automatically converts back to clflushopt if clwb is not supported.

But we want to invalidate the cache; we do not expect the CPU to access
this data anymore (it will be accessed by a DMA engine during writeback).

> > I was also testing the novafs filesystem, it is not upstream, but it
> > benefitted from similar change in __memcpy_flushcache and
> > __copy_user_nocache:
> > write throughput on big files - movnti: 662 MB/s, clwb: 1323 MB/s
> > write throughput on small files - movnti: 621 MB/s, clwb: 1013 MB/s
> >
> >
> > I submit this patch for __memcpy_flushcache that improves dm-writecache
> > performance.
> >
> > Other ideas - should we introduce memcpy_to_pmem instead of modifying
> > memcpy_flushcache and move this logic there? Or should I modify the
> > dm-writecache target directly to use clflushopt with no change to the
> > architecture-specific code?
>
> This also needs to mention your analysis that showed that this can
> have negative cache pollution effects [1], so I'm not sure how to
> decide when to make the tradeoff. Once we have movdir64b the tradeoff
> equation changes yet again:
>
> [1]: https://lore.kernel.org/linux-nvdimm/alpine.LRH.2.02.2004010941310.23210@file01.intranet.prod.int.rdu2.redhat.com/

I analyzed it some more. I have created this program that tests writecache
w.r.t. cache pollution:

http://people.redhat.com/~mpatocka/testcases/pmem/misc/l1-test-2.c

It fills the cache with a chain of random pointers and then walks these
pointers to evaluate cache pollution. Between the walks, it writes data to
the dm-writecache target.
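
The pointer-chasing technique described above can be sketched as follows. This is not the code from l1-test-2.c; it is a minimal illustration with invented names (lcg_next, build_chain, walk_chain). It links the nodes into one random cycle using Sattolo's algorithm, so a walk visits every node exactly once before returning to the start, and each load depends on the previous one.

```c
#include <stddef.h>
#include <stdint.h>

/* Small deterministic generator so runs are repeatable (an assumption). */
static uint32_t lcg_next(uint32_t *state)
{
	*state = *state * 1664525u + 1013904223u;
	return *state >> 8;
}

/* Fill next[] with a single random cycle over n nodes (Sattolo). */
static void build_chain(size_t *next, size_t n, uint32_t seed)
{
	size_t i;

	for (i = 0; i < n; i++)
		next[i] = i;
	for (i = n; i-- > 1; ) {
		size_t j = lcg_next(&seed) % i;	/* 0 <= j < i */
		size_t tmp = next[i];

		next[i] = next[j];
		next[j] = tmp;
	}
}

/* Follow the chain for a number of steps; each load depends on the last. */
static size_t walk_chain(const size_t *next, size_t start, size_t steps)
{
	size_t pos = start;

	while (steps--)
		pos = next[pos];
	return pos;
}
```

Timing walk_chain() before and after a burst of writes (for example with clock_gettime(CLOCK_MONOTONIC, ...)) gives a rough measure of how many chain nodes were evicted in between. To target a particular cache level, size the node array to fit that cache.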

With the original kernel, the result is:
8503 - 11366
real 0m7.985s
user 0m0.585s
sys 0m7.390s

With dm-writecache hacked to use cached writes + clflushopt:
8513 - 11379
real 0m5.045s
user 0m0.670s
sys 0m4.365s

So, the hacked dm-writecache is significantly faster, while the cache
micro-benchmark doesn't show any more cache pollution.

That's for dm-writecache. Are there some other significant users of
memcpy_flushcache that need to be checked?

Mikulas

2020-04-08 21:56:29

by Dan Williams

Subject: Re: [PATCH] memcpy_flushcache: use cache flushing for larger lengths

On Wed, Apr 8, 2020 at 11:54 AM Mikulas Patocka <[email protected]> wrote:
>
>
>
> On Tue, 7 Apr 2020, Dan Williams wrote:
>
> > On Tue, Apr 7, 2020 at 8:02 AM Mikulas Patocka <[email protected]> wrote:
> > >
> > > [ resending this to x86 maintainers ]
> > >
> > > Hi
> > >
> > > I tested performance of various methods how to write to optane-based
> > > persistent memory, and found out that non-temporal stores achieve
> > > throughput 1.3 GB/s. 8 cached stores immediatelly followed by clflushopt
> > > or clwb achieve throughput 1.6 GB/s.
> > >
> > > memcpy_flushcache uses non-temporal stores, I modified it to use cached
> > > stores + clflushopt and it improved performance of the dm-writecache
> > > target significantly:
> > >
> > > dm-writecache throughput:
> > > (dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct)
> > > writecache block size 512 1024 2048 4096
> > > movnti 496 MB/s 642 MB/s 725 MB/s 744 MB/s
> > > clflushopt 373 MB/s 688 MB/s 1.1 GB/s 1.2 GB/s
> > >
> > > For block size 512, movnti works better, for larger block sizes,
> > > clflushopt is better.
> >
> > This should use clwb instead of clflushopt, the clwb macri
> > automatically converts back to clflushopt if clwb is not supported.
>
> But we want to invalidate cache, we do not expect CPU to access these data
> anymore (it will be accessed by a DMA engine during writeback).

The clflushopt and clwb instructions should have identical overhead,
but clwb wins on the rare chance the written data is needed again
soon. If it is never needed again then the cost of dropping a clean
cache line is the same as if the line was invalidated in the first
instance. In both cases (clflushopt and clwb) the snoop traffic
overhead is still paid whether the written-back line is still present
in the cache or not.

>
> > > I was also testing the novafs filesystem, it is not upstream, but it
> > > benefitted from similar change in __memcpy_flushcache and
> > > __copy_user_nocache:
> > > write throughput on big files - movnti: 662 MB/s, clwb: 1323 MB/s
> > > write throughput on small files - movnti: 621 MB/s, clwb: 1013 MB/s
> > >
> > >
> > > I submit this patch for __memcpy_flushcache that improves dm-writecache
> > > performance.
> > >
> > > Other ideas - should we introduce memcpy_to_pmem instead of modifying
> > > memcpy_flushcache and move this logic there? Or should I modify the
> > > dm-writecache target directly to use clflushopt with no change to the
> > > architecture-specific code?
> >
> > This also needs to mention your analysis that showed that this can
> > have negative cache pollution effects [1], so I'm not sure how to
> > decide when to make the tradeoff. Once we have movdir64b the tradeoff
> > equation changes yet again:
> >
> > [1]: https://lore.kernel.org/linux-nvdimm/alpine.LRH.2.02.2004010941310.23210@file01.intranet.prod.int.rdu2.redhat.com/
>
> I analyzed it some more. I have created this program that tests writecache
> w.r.t. cache pollution:
>
> http://people.redhat.com/~mpatocka/testcases/pmem/misc/l1-test-2.c
>
> It fills the cache with a chain of random pointers and then walks these
> pointers to evaluate cache pollution. Between the walks, it writes data to
> the dm-writecache target.
>
> With the original kernel, the result is:
> 8503 - 11366
> real 0m7.985s
> user 0m0.585s
> sys 0m7.390s
>
> With dm-writecache hacked to use cached writes + clflushopt:
> 8513 - 11379
> real 0m5.045s
> user 0m0.670s
> sys 0m4.365s
>
> So, the hacked dm-writecache is significantly faster, while the cache
> micro-benchmark doesn't show any more cache pollution.

Nice. Are these now the pmem numbers, or DRAM? Otherwise, what changed
that was making nt-writes on pmem perform better compared to your
previous test? I'm just trying to track the results.

> That's for dm-writecache. Are there some other significant users of
> memcpy_flushcache that need to be checked?

The only other user is direct and dax-I/O to the pmem driver.

2020-04-09 14:38:07

by Mikulas Patocka

Subject: Re: [PATCH] memcpy_flushcache: use cache flushing for larger lengths



On Wed, 8 Apr 2020, Dan Williams wrote:

> On Wed, Apr 8, 2020 at 11:54 AM Mikulas Patocka <[email protected]> wrote:
> >
> >
> >
> > On Tue, 7 Apr 2020, Dan Williams wrote:
> >
> > > On Tue, Apr 7, 2020 at 8:02 AM Mikulas Patocka <[email protected]> wrote:
> > > >
> > > This should use clwb instead of clflushopt, the clwb macri
> > > automatically converts back to clflushopt if clwb is not supported.
> >
> > But we want to invalidate cache, we do not expect CPU to access these data
> > anymore (it will be accessed by a DMA engine during writeback).
>
> The cluflushopt and clwb instructions should have identical overhead,
> but clwb wins on the rare chance the written data is needed again
> soon. If it is never needed again then the cost of dropping a clean
> cache line is the same as if the line was invalidated in the first
> instance. In both cases (clflushopt and clwb) the snoop traffic
> overhead is still paid whether the written-back line is still present
> in the cache or not.

But my concern is that clflushopt removes the line from the cache and
makes room for another line (this is desired behavior) - clwb keeps the
line cached and the line would have to compete with other cache lines in
the same associative set.

Do you know how the CPU selects the cache line to be replaced?

dm-writecache is intended to be used for workloads like database logs that
need extra-low commit latency. The committed data is not read back during
normal workload.
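
The concern about a written-back line competing within its associative set can be illustrated with a toy model. Everything here is an assumption made for illustration: a single 4-way set, strict LRU replacement, and invented names (struct cache_set, touch, invalidate, misses_after_write). Real CPUs may use different replacement policies, so this shows a possible worst case and says nothing about measured behaviour.

```c
#define WAYS 4
#define INVALID 0	/* tag 0 marks an empty way */

/* tag[0] is the least recently used way, tag[WAYS - 1] the most recent. */
struct cache_set {
	int tag[WAYS];
};

/* Access a line; returns 1 on a miss, 0 on a hit. */
static int touch(struct cache_set *s, int tag)
{
	int i, pos = -1, miss;

	for (i = 0; i < WAYS; i++)
		if (s->tag[i] == tag)
			pos = i;
	miss = pos < 0;
	if (miss) {
		/* Prefer an empty way, otherwise evict the LRU way. */
		pos = 0;
		for (i = 0; i < WAYS; i++) {
			if (s->tag[i] == INVALID) {
				pos = i;
				break;
			}
		}
	}
	/* Move the accessed line to the most-recently-used position. */
	for (i = pos; i < WAYS - 1; i++)
		s->tag[i] = s->tag[i + 1];
	s->tag[WAYS - 1] = tag;
	return miss;
}

/* Model of clflushopt: the line leaves the cache and frees its way. */
static void invalidate(struct cache_set *s, int tag)
{
	int i;

	for (i = 0; i < WAYS; i++)
		if (s->tag[i] == tag)
			s->tag[i] = INVALID;
}

/*
 * Four useful lines fill the set, one pmem line (tag 99) is written,
 * then the useful lines are walked again. Returns the misses in the walk.
 */
static int misses_after_write(int flush_invalidates)
{
	struct cache_set s = { { 1, 2, 3, 4 } };
	int t, misses = 0;

	touch(&s, 99);			/* cached store to the pmem line */
	if (flush_invalidates)
		invalidate(&s, 99);	/* clflushopt; clwb keeps the line */
	for (t = 1; t <= 4; t++)
		misses += touch(&s, t);
	return misses;
}
```

In this model the walk misses once when the written line is invalidated and four times when it stays resident, because under strict LRU each reload evicts the line that is needed next.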

> > > > Other ideas - should we introduce memcpy_to_pmem instead of modifying
> > > > memcpy_flushcache and move this logic there? Or should I modify the
> > > > dm-writecache target directly to use clflushopt with no change to the
> > > > architecture-specific code?
> > >
> > > This also needs to mention your analysis that showed that this can
> > > have negative cache pollution effects [1], so I'm not sure how to
> > > decide when to make the tradeoff. Once we have movdir64b the tradeoff
> > > equation changes yet again:
> > >
> > > [1]: https://lore.kernel.org/linux-nvdimm/alpine.LRH.2.02.2004010941310.23210@file01.intranet.prod.int.rdu2.redhat.com/
> >
> > I analyzed it some more. I have created this program that tests writecache
> > w.r.t. cache pollution:
> >
> > http://people.redhat.com/~mpatocka/testcases/pmem/misc/l1-test-2.c
> >
> > It fills the cache with a chain of random pointers and then walks these
> > pointers to evaluate cache pollution. Between the walks, it writes data to
> > the dm-writecache target.
> >
> > With the original kernel, the result is:
> > 8503 - 11366
> > real 0m7.985s
> > user 0m0.585s
> > sys 0m7.390s
> >
> > With dm-writecache hacked to use cached writes + clflushopt:
> > 8513 - 11379
> > real 0m5.045s
> > user 0m0.670s
> > sys 0m4.365s
> >
> > So, the hacked dm-writecache is significantly faster, while the cache
> > micro-benchmark doesn't show any more cache pollution.
>
> Nice. These are now the pmem numbers, or dram?

pmem


With dm-writecache on emulated pmem (with the memmap argument), we get

With the original kernel:
8508 - 11378
real 0m4.960s
user 0m0.638s
sys 0m4.312s

With dm-writecache hacked to use cached writes + clflushopt:
8505 - 11378
real 0m4.151s
user 0m0.560s
sys 0m3.582s

So - clflushopt is still slightly better.

> Otherwise, what changed that was making nt-writes on pmem perform better
> compared to your previous test? I'm just trying to track the results.

I re-ran the previous test
( http://people.redhat.com/~mpatocka/testcases/pmem/misc/l1-test.c )
and the result is this:

Write + clflushopt:
./l1-test /dev/ram0 f
8502 - 22616
./l1-test /dev/dax3.0 f
8502 - 22902
./l1-test /dev/dax4.0 f
8500 - 11970

Write + clwb:
./l1-test /dev/ram0 w
8502 - 22602
./l1-test /dev/dax3.0 w
8502 - 22454
./l1-test /dev/dax4.0 w
8502 - 11566

Non-temporal stores:
./l1-test /dev/ram0 n
8504 - 22162
./l1-test /dev/dax3.0 n
8502 - 12336
./l1-test /dev/dax4.0 n
8502 - 10662

(/dev/dax3.0 is the real persistent memory, /dev/dax4.0 is pmem emulated
with the memmap parameter)

"./l1-test /dev/ram0 n" is slower than "./l1-test /dev/dax4.0 n" while
both of these tests are on RAM. The pmem is mapped with large pages and
the memory map for the ramdisk is not - perhaps this is making the
difference?

"./l1-test /dev/dax3.0 n" is better than "./l1-test /dev/dax3.0 w" and
"./l1-test /dev/dax3.0 f" - although the benchmarks done on dm-writecache
show that cached writes + clflushopt perform better. I don't know why
there is this disparity.

> > That's for dm-writecache. Are there some other significant users of
> > memcpy_flushcache that need to be checked?
>
> The only other user is direct and dax-I/O to the pmem driver.

Mikulas

2020-04-16 08:46:08

by Mikulas Patocka

Subject: Re: [PATCH] memcpy_flushcache: use cache flushing for larger lengths



On Thu, 9 Apr 2020, Mikulas Patocka wrote:

> With dm-writecache on emulated pmem (with the memmap argument), we get
>
> With the original kernel:
> 8508 - 11378
> real 0m4.960s
> user 0m0.638s
> sys 0m4.312s
>
> With dm-writecache hacked to use cached writes + clflushopt:
> 8505 - 11378
> real 0m4.151s
> user 0m0.560s
> sys 0m3.582s

I did some multithreaded tests:
http://people.redhat.com/~mpatocka/testcases/pmem/microbenchmarks/pmem-multithreaded.txt

And it turns out that for singlethreaded access, write+clwb performs
better, while for multithreaded access, non-temporal stores perform
better.

threads   operation                          throughput
1         sequential write-nt 8 bytes        1.3 GB/s
2         sequential write-nt 8 bytes        2.5 GB/s
3         sequential write-nt 8 bytes        2.8 GB/s
4         sequential write-nt 8 bytes        2.8 GB/s
5         sequential write-nt 8 bytes        2.5 GB/s

1         sequential write 8 bytes + clwb    1.6 GB/s
2         sequential write 8 bytes + clwb    2.4 GB/s
3         sequential write 8 bytes + clwb    1.7 GB/s
4         sequential write 8 bytes + clwb    1.2 GB/s
5         sequential write 8 bytes + clwb    0.8 GB/s

For one thread, we can see that write-nt 8 bytes has 1.3 GB/s and write
8+clwb has 1.6 GB/s, but for multiple threads, write-nt has better
throughput.

The dm-writecache target is singlethreaded (all the copying is done while
holding the writecache lock), so it benefits from clwb.
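
The crossover in the table can be made explicit with a few lines of C. The arrays below copy the reported throughput for 1 to 5 threads; the function name first_nt_win is invented for this sketch.

```c
#include <stddef.h>

/* Throughput in GB/s for 1..5 threads, copied from the results above. */
static const double nt_gbps[]   = { 1.3, 2.5, 2.8, 2.8, 2.5 };
static const double clwb_gbps[] = { 1.6, 2.4, 1.7, 1.2, 0.8 };

/* Smallest thread count at which non-temporal stores are faster, or 0. */
static unsigned int first_nt_win(void)
{
	size_t i;

	for (i = 0; i < sizeof(nt_gbps) / sizeof(nt_gbps[0]); i++)
		if (nt_gbps[i] > clwb_gbps[i])
			return (unsigned int)(i + 1);
	return 0;
}
```

With these numbers the function returns 2: write + clwb only wins in the single-threaded case, and the gap widens as threads are added.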

Should memcpy_flushcache be changed to write+clwb? Or are there some
multithreaded users of memcpy_flushcache that would be hurt by this
change?

Mikulas

2020-04-16 20:44:09

by Dan Williams

Subject: Re: [PATCH] memcpy_flushcache: use cache flushing for larger lengths

On Thu, Apr 16, 2020 at 1:24 AM Mikulas Patocka <[email protected]> wrote:
>
>
>
> On Thu, 9 Apr 2020, Mikulas Patocka wrote:
>
> > With dm-writecache on emulated pmem (with the memmap argument), we get
> >
> > With the original kernel:
> > 8508 - 11378
> > real 0m4.960s
> > user 0m0.638s
> > sys 0m4.312s
> >
> > With dm-writecache hacked to use cached writes + clflushopt:
> > 8505 - 11378
> > real 0m4.151s
> > user 0m0.560s
> > sys 0m3.582s
>
> I did some multithreaded tests:
> http://people.redhat.com/~mpatocka/testcases/pmem/microbenchmarks/pmem-multithreaded.txt
>
> And it turns out that for singlethreaded access, write+clwb performs
> better, while for multithreaded access, non-temporal stores perform
> better.
>
> 1 sequential write-nt 8 bytes 1.3 GB/s
> 2 sequential write-nt 8 bytes 2.5 GB/s
> 3 sequential write-nt 8 bytes 2.8 GB/s
> 4 sequential write-nt 8 bytes 2.8 GB/s
> 5 sequential write-nt 8 bytes 2.5 GB/s
>
> 1 sequential write 8 bytes + clwb 1.6 GB/s
> 2 sequential write 8 bytes + clwb 2.4 GB/s
> 3 sequential write 8 bytes + clwb 1.7 GB/s
> 4 sequential write 8 bytes + clwb 1.2 GB/s
> 5 sequential write 8 bytes + clwb 0.8 GB/s
>
> For one thread, we can see that write-nt 8 bytes has 1.3 GB/s and write
> 8+clwb has 1.6 GB/s, but for multiple threads, write-nt has better
> throughput.
>
> The dm-writecache target is singlethreaded (all the copying is done while
> holding the writecache lock), so it benefits from clwb.
>
> Should memcpy_flushcache be changed to write+clwb? Or are there some
> multithreaded users of memcpy_flushcache that would be hurt by this
> change?

Maybe this is asking for a specific memcpy_flushcache_inatomic()
implementation for your use case, but leave nt-writes for the general
case?

2020-04-17 12:52:07

by Mikulas Patocka

Subject: [PATCH] x86: introduce memcpy_flushcache_clflushopt



On Thu, 16 Apr 2020, Dan Williams wrote:

> On Thu, Apr 16, 2020 at 1:24 AM Mikulas Patocka <[email protected]> wrote:
> >
> >
> >
> > On Thu, 9 Apr 2020, Mikulas Patocka wrote:
> >
> > > With dm-writecache on emulated pmem (with the memmap argument), we get
> > >
> > > With the original kernel:
> > > 8508 - 11378
> > > real 0m4.960s
> > > user 0m0.638s
> > > sys 0m4.312s
> > >
> > > With dm-writecache hacked to use cached writes + clflushopt:
> > > 8505 - 11378
> > > real 0m4.151s
> > > user 0m0.560s
> > > sys 0m3.582s
> >
> > I did some multithreaded tests:
> > http://people.redhat.com/~mpatocka/testcases/pmem/microbenchmarks/pmem-multithreaded.txt
> >
> > And it turns out that for singlethreaded access, write+clwb performs
> > better, while for multithreaded access, non-temporal stores perform
> > better.
> >
> > 1 sequential write-nt 8 bytes 1.3 GB/s
> > 2 sequential write-nt 8 bytes 2.5 GB/s
> > 3 sequential write-nt 8 bytes 2.8 GB/s
> > 4 sequential write-nt 8 bytes 2.8 GB/s
> > 5 sequential write-nt 8 bytes 2.5 GB/s
> >
> > 1 sequential write 8 bytes + clwb 1.6 GB/s
> > 2 sequential write 8 bytes + clwb 2.4 GB/s
> > 3 sequential write 8 bytes + clwb 1.7 GB/s
> > 4 sequential write 8 bytes + clwb 1.2 GB/s
> > 5 sequential write 8 bytes + clwb 0.8 GB/s
> >
> > For one thread, we can see that write-nt 8 bytes has 1.3 GB/s and write
> > 8+clwb has 1.6 GB/s, but for multiple threads, write-nt has better
> > throughput.
> >
> > The dm-writecache target is singlethreaded (all the copying is done while
> > holding the writecache lock), so it benefits from clwb.
> >
> > Should memcpy_flushcache be changed to write+clwb? Or are there some
> > multithreaded users of memcpy_flushcache that would be hurt by this
> > change?
>
> Maybe this is asking for a specific memcpy_flushcache_inatomic()
> implementation for your use case, but leave nt-writes for the general
> case?

Yes - I have created this patch that adds a new function
memcpy_flushcache_clflushopt and makes dm-writecache use it.

Mikulas



From: Mikulas Patocka <[email protected]>

Implement the function memcpy_flushcache_clflushopt which flushes cache
just like memcpy_flushcache - except that it uses cached writes and
explicit cache flushing instead of non-temporal stores.

Explicit cache flushing performs better in some cases (e.g. the
dm-writecache target with block size greater than 512), while non-temporal
stores perform better in other cases (mostly multithreaded workloads). We
therefore provide both functions, and the caller should select whichever
is faster for its particular workload.

dm-writecache throughput (on real Optane-based persistent memory):
block size    512         1024        2048        4096
movnti        496 MB/s    642 MB/s    725 MB/s    744 MB/s
clflushopt    373 MB/s    688 MB/s    1.1 GB/s    1.2 GB/s
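
The copy loop proposed in this patch works in three steps: an unaligned head
up to the next 64-byte boundary, whole 64-byte cache lines, then a tail, with
one flush per touched line. Below is a userspace sketch of that chunking only.
It is not the kernel function: copy_flush_by_line(), flush_line_fn and
count_flush() are hypothetical names, the flush is a callback standing in for
clflushopt(), and ordering (sfence) is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64u

/* Hypothetical stand-in for clflushopt(): called once per touched line. */
typedef void (*flush_line_fn)(void *line, void *ctx);

static void copy_flush_by_line(void *dst, const void *src, size_t size,
			       flush_line_fn flush, void *ctx)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t misalign = (uintptr_t)d % CACHE_LINE;

	assert(flush != NULL);

	/* Head: copy up to the next cache-line boundary of the destination. */
	if (misalign != 0) {
		size_t len = CACHE_LINE - misalign;

		if (len > size)
			len = size;
		memcpy(d, s, len);
		flush(d, ctx);
		d += len;
		s += len;
		size -= len;
	}
	/* Body: the destination is now line-aligned, copy whole lines. */
	while (size >= CACHE_LINE) {
		memcpy(d, s, CACHE_LINE);
		flush(d, ctx);
		d += CACHE_LINE;
		s += CACHE_LINE;
		size -= CACHE_LINE;
	}
	/* Tail: a partial last line still needs one flush. */
	if (size != 0) {
		memcpy(d, s, size);
		flush(d, ctx);
	}
}

/* Example callback: counts flushes instead of touching the cache. */
static void count_flush(void *line, void *ctx)
{
	(void)line;
	++*(size_t *)ctx;
}
```

Passing count_flush() with a size_t counter as ctx makes it easy to check
that a copy spanning three destination cache lines issues three flushes.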

Signed-off-by: Mikulas Patocka <[email protected]>

---
arch/x86/include/asm/string_64.h | 10 ++++++++++
arch/x86/lib/usercopy_64.c | 32 ++++++++++++++++++++++++++++++++
drivers/md/dm-writecache.c | 5 ++++-
include/linux/string.h | 6 ++++++
4 files changed, 52 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/include/asm/string_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/string_64.h 2020-04-17 14:06:35.139999000 +0200
+++ linux-2.6/arch/x86/include/asm/string_64.h 2020-04-17 14:06:35.129999000 +0200
@@ -114,6 +114,14 @@ memcpy_mcsafe(void *dst, const void *src
return 0;
}

+/*
+ * In some cases (mostly single-threaded workload), clflushopt is faster
+ * than non-temporal stores. In other situations, non-temporal stores are
+ * faster. So, we provide two functions:
+ * memcpy_flushcache using non-temporal stores
+ * memcpy_flushcache_clflushopt using clflushopt
+ * The caller should test which one is faster for the particular workload.
+ */
#ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
#define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
@@ -135,6 +143,8 @@ static __always_inline void memcpy_flush
}
__memcpy_flushcache(dst, src, cnt);
}
+#define __HAVE_ARCH_MEMCPY_FLUSHCACHE_CLFLUSHOPT 1
+void memcpy_flushcache_clflushopt(void *dst, const void *src, size_t cnt);
#endif

#endif /* __KERNEL__ */
Index: linux-2.6/include/linux/string.h
===================================================================
--- linux-2.6.orig/include/linux/string.h 2020-04-17 14:06:35.139999000 +0200
+++ linux-2.6/include/linux/string.h 2020-04-17 14:06:35.129999000 +0200
@@ -175,6 +175,12 @@ static inline void memcpy_flushcache(voi
memcpy(dst, src, cnt);
}
#endif
+#ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE_CLFLUSHOPT
+static inline void memcpy_flushcache_clflushopt(void *dst, const void *src, size_t cnt)
+{
+ memcpy_flushcache(dst, src, cnt);
+}
+#endif
void *memchr_inv(const void *s, int c, size_t n);
char *strreplace(char *s, char old, char new);

Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c 2020-04-17 14:06:35.139999000 +0200
+++ linux-2.6/arch/x86/lib/usercopy_64.c 2020-04-17 14:25:18.569999000 +0200
@@ -199,6 +199,38 @@ void __memcpy_flushcache(void *_dst, con
}
EXPORT_SYMBOL_GPL(__memcpy_flushcache);

+void memcpy_flushcache_clflushopt(void *_dst, const void *_src, size_t size)
+{
+	unsigned long dest = (unsigned long) _dst;
+	unsigned long source = (unsigned long) _src;
+
+	if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && likely(boot_cpu_data.x86_clflush_size == 64)) {
+		if (unlikely(!IS_ALIGNED(dest, 64))) {
+			size_t len = min_t(size_t, size, ALIGN(dest, 64) - dest);
+
+			memcpy((void *) dest, (void *) source, len);
+			clflushopt((void *)dest);
+			dest += len;
+			source += len;
+			size -= len;
+		}
+		while (size >= 64) {
+			memcpy((void *)dest, (void *)source, 64);
+			clflushopt((void *)dest);
+			dest += 64;
+			source += 64;
+			size -= 64;
+		}
+		if (unlikely(size != 0)) {
+			memcpy((void *)dest, (void *)source, size);
+			clflushopt((void *)dest);
+		}
+		return;
+	}
+	memcpy_flushcache((void *)dest, (void *)source, size);
+}
+EXPORT_SYMBOL_GPL(memcpy_flushcache_clflushopt);
+
void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
size_t len)
{
Index: linux-2.6/drivers/md/dm-writecache.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-writecache.c 2020-04-17 14:06:35.139999000 +0200
+++ linux-2.6/drivers/md/dm-writecache.c 2020-04-17 14:06:35.129999000 +0200
@@ -1166,7 +1166,10 @@ static void bio_copy_block(struct dm_wri
}
} else {
flush_dcache_page(bio_page(bio));
- memcpy_flushcache(data, buf, size);
+ if (likely(size > 512))
+ memcpy_flushcache_clflushopt(data, buf, size);
+ else
+ memcpy_flushcache(data, buf, size);
}

bvec_kunmap_irq(buf, &flags);

2020-04-17 18:02:31

by Dan Williams

Subject: Re: [PATCH] x86: introduce memcpy_flushcache_clflushopt

On Fri, Apr 17, 2020 at 5:47 AM Mikulas Patocka <[email protected]> wrote:
>
>
>
> On Thu, 16 Apr 2020, Dan Williams wrote:
>
> > On Thu, Apr 16, 2020 at 1:24 AM Mikulas Patocka <[email protected]> wrote:
> > >
> > >
> > >
> > > On Thu, 9 Apr 2020, Mikulas Patocka wrote:
> > >
> > > > With dm-writecache on emulated pmem (with the memmap argument), we get
> > > >
> > > > With the original kernel:
> > > > 8508 - 11378
> > > > real 0m4.960s
> > > > user 0m0.638s
> > > > sys 0m4.312s
> > > >
> > > > With dm-writecache hacked to use cached writes + clflushopt:
> > > > 8505 - 11378
> > > > real 0m4.151s
> > > > user 0m0.560s
> > > > sys 0m3.582s
> > >
> > > I did some multithreaded tests:
> > > http://people.redhat.com/~mpatocka/testcases/pmem/microbenchmarks/pmem-multithreaded.txt
> > >
> > > And it turns out that for singlethreaded access, write+clwb performs
> > > better, while for multithreaded access, non-temporal stores perform
> > > better.
> > >
> > > 1 sequential write-nt 8 bytes 1.3 GB/s
> > > 2 sequential write-nt 8 bytes 2.5 GB/s
> > > 3 sequential write-nt 8 bytes 2.8 GB/s
> > > 4 sequential write-nt 8 bytes 2.8 GB/s
> > > 5 sequential write-nt 8 bytes 2.5 GB/s
> > >
> > > 1 sequential write 8 bytes + clwb 1.6 GB/s
> > > 2 sequential write 8 bytes + clwb 2.4 GB/s
> > > 3 sequential write 8 bytes + clwb 1.7 GB/s
> > > 4 sequential write 8 bytes + clwb 1.2 GB/s
> > > 5 sequential write 8 bytes + clwb 0.8 GB/s
> > >
> > > For one thread, we can see that write-nt 8 bytes has 1.3 GB/s and write
> > > 8+clwb has 1.6 GB/s, but for multiple threads, write-nt has better
> > > throughput.
> > >
> > > The dm-writecache target is singlethreaded (all the copying is done while
> > > holding the writecache lock), so it benefits from clwb.
> > >
> > > Should memcpy_flushcache be changed to write+clwb? Or are there some
> > > multithreaded users of memcpy_flushcache that would be hurt by this
> > > change?
> >
> > Maybe this is asking for a specific memcpy_flushcache_inatomic()
> > implementation for your use case, but leave nt-writes for the general
> > case?
>
> Yes - I have created this patch that adds a new function
> memcpy_flushcache_clflushopt and makes dm-writecache use it.
>
> Mikulas
>
>
>
> From: Mikulas Patocka <[email protected]>
>
> Implement the function memcpy_flushcache_clflushopt which flushes cache
> just like memcpy_flushcache - except that it uses cached writes and
> explicit cache flushing instead of non-temporal stores.
>
> Explicit cache flushing performs better in some cases (i.e. the
> dm-writecache target with block size greater than 512), non-temporal
> stores perform better in other cases (mostly multithreaded workloads) - so
> we provide these two functions and the user should select which one is
> faster for his particular workload.
>
> dm-writecache throughput (on real Optane-based persistent memory):
> block size 512 1024 2048 4096
> movnti 496 MB/s 642 MB/s 725 MB/s 744 MB/s
> clflushopt 373 MB/s 688 MB/s 1.1 GB/s 1.2 GB/s
>
> Signed-off-by: Mikulas Patocka <[email protected]>
>
> ---
> arch/x86/include/asm/string_64.h | 10 ++++++++++
> arch/x86/lib/usercopy_64.c | 32 ++++++++++++++++++++++++++++++++
> drivers/md/dm-writecache.c | 5 ++++-
> include/linux/string.h | 6 ++++++
> 4 files changed, 52 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/arch/x86/include/asm/string_64.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/string_64.h 2020-04-17 14:06:35.139999000 +0200
> +++ linux-2.6/arch/x86/include/asm/string_64.h 2020-04-17 14:06:35.129999000 +0200
> @@ -114,6 +114,14 @@ memcpy_mcsafe(void *dst, const void *src
> return 0;
> }
>
> +/*
> + * In some cases (mostly single-threaded workload), clflushopt is faster
> + * than non-temporal stores. In other situations, non-temporal stores are
> + * faster. So, we provide two functions:
> + * memcpy_flushcache using non-temporal stores
> + * memcpy_flushcache_clflushopt using clflushopt
> + * The caller should test which one is faster for the particular workload.
> + */
> #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
> #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
> void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
> @@ -135,6 +143,8 @@ static __always_inline void memcpy_flush
> }
> __memcpy_flushcache(dst, src, cnt);
> }
> +#define __HAVE_ARCH_MEMCPY_FLUSHCACHE_CLFLUSHOPT 1
> +void memcpy_flushcache_clflushopt(void *dst, const void *src, size_t cnt);

This naming promotes an x86ism and it does not help the caller
understand why 'flushcache_clflushopt' is preferred over 'flushcache'.
The goal of naming it _inatomic() was specifically for the observation
that your driver coordinates atomic access and does not benefit from
the cache friendliness that non-temporal stores afford. That said
_inatomic() is arguably not a good choice either because that refers
to whether the copy is prepared to take a fault or not. What about
_exclusive() or _single()? Anything but _clflushopt(), which conveys no
contextual information.

Other than quibbling with the name, and one more comment below, this
looks ok to me.

> #endif
>
> #endif /* __KERNEL__ */
> Index: linux-2.6/include/linux/string.h
> ===================================================================
> --- linux-2.6.orig/include/linux/string.h 2020-04-17 14:06:35.139999000 +0200
> +++ linux-2.6/include/linux/string.h 2020-04-17 14:06:35.129999000 +0200
> @@ -175,6 +175,12 @@ static inline void memcpy_flushcache(voi
> memcpy(dst, src, cnt);
> }
> #endif
> +#ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE_CLFLUSHOPT
> +static inline void memcpy_flushcache_clflushopt(void *dst, const void *src, size_t cnt)
> +{
> + memcpy_flushcache(dst, src, cnt);
> +}
> +#endif
> void *memchr_inv(const void *s, int c, size_t n);
> char *strreplace(char *s, char old, char new);
>
> Index: linux-2.6/arch/x86/lib/usercopy_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/lib/usercopy_64.c 2020-04-17 14:06:35.139999000 +0200
> +++ linux-2.6/arch/x86/lib/usercopy_64.c 2020-04-17 14:25:18.569999000 +0200
> @@ -199,6 +199,38 @@ void __memcpy_flushcache(void *_dst, con
> }
> EXPORT_SYMBOL_GPL(__memcpy_flushcache);
>
> +void memcpy_flushcache_clflushopt(void *_dst, const void *_src, size_t size)
> +{
> + unsigned long dest = (unsigned long) _dst;
> + unsigned long source = (unsigned long) _src;
> +
> + if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && likely(boot_cpu_data.x86_clflush_size == 64)) {
> + if (unlikely(!IS_ALIGNED(dest, 64))) {
> + size_t len = min_t(size_t, size, ALIGN(dest, 64) - dest);
> +
> + memcpy((void *) dest, (void *) source, len);
> + clflushopt((void *)dest);
> + dest += len;
> + source += len;
> + size -= len;
> + }
> + while (size >= 64) {
> + memcpy((void *)dest, (void *)source, 64);
> + clflushopt((void *)dest);
> + dest += 64;
> + source += 64;
> + size -= 64;
> + }
> + if (unlikely(size != 0)) {
> + memcpy((void *)dest, (void *)source, size);
> + clflushopt((void *)dest);
> + }
> + return;
> + }
> + memcpy_flushcache((void *)dest, (void *)source, size);
> +}
> +EXPORT_SYMBOL_GPL(memcpy_flushcache_clflushopt);
> +
> void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
> size_t len)
> {
> Index: linux-2.6/drivers/md/dm-writecache.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-writecache.c 2020-04-17 14:06:35.139999000 +0200
> +++ linux-2.6/drivers/md/dm-writecache.c 2020-04-17 14:06:35.129999000 +0200
> @@ -1166,7 +1166,10 @@ static void bio_copy_block(struct dm_wri
> }
> } else {
> flush_dcache_page(bio_page(bio));
> - memcpy_flushcache(data, buf, size);
> + if (likely(size > 512))

This needs some reference to how this magic number is chosen and how a
future developer might determine whether the value needs to be
adjusted.

Will also need to remember to come back and re-evaluate this once
memcpy_flushcache() is enabled to use movdir64b which might invalidate
the performance advantage you are currently seeing with
cache-allocating-writes plus flushing.

> + memcpy_flushcache_clflushopt(data, buf, size);
> + else
> + memcpy_flushcache(data, buf, size);
> }
>
> bvec_kunmap_irq(buf, &flags);
>

2020-04-17 20:49:28

by Thomas Gleixner

Subject: Re: [PATCH] x86: introduce memcpy_flushcache_clflushopt

Dan Williams <[email protected]> writes:
> On Fri, Apr 17, 2020 at 5:47 AM Mikulas Patocka <[email protected]> wrote:
>> +#define __HAVE_ARCH_MEMCPY_FLUSHCACHE_CLFLUSHOPT 1
>> +void memcpy_flushcache_clflushopt(void *dst, const void *src, size_t cnt);
>
> This naming promotes an x86ism and it does not help the caller
> understand why 'flushcache_clflushopt' is preferred over 'flushcache'.

Right.

> The goal of naming it _inatomic() was specifically for the observation
> that your driver coordinates atomic access and does not benefit from
> the cache friendliness that non-temporal stores afford. That said
> _inatomic() is arguably not a good choice either because that refers
> to whether the copy is prepared to take a fault or not. What about
> _exclusive() or _single()? Anything but _clflushopt() that conveys no
> contextual information.
>
> Other than quibbling with the name, and one more comment below, this
> looks ok to me.
>
>> Index: linux-2.6/drivers/md/dm-writecache.c
>> ===================================================================
>> --- linux-2.6.orig/drivers/md/dm-writecache.c 2020-04-17 14:06:35.139999000 +0200
>> +++ linux-2.6/drivers/md/dm-writecache.c 2020-04-17 14:06:35.129999000 +0200
>> @@ -1166,7 +1166,10 @@ static void bio_copy_block(struct dm_wri
>> }
>> } else {
>> flush_dcache_page(bio_page(bio));
>> - memcpy_flushcache(data, buf, size);
>> + if (likely(size > 512))
>
> This needs some reference to how this magic number is chosen and how a
> future developer might determine whether the value needs to be
> adjusted.

I don't think it's a good idea to make this decision in generic code as
architectures or even CPU models might have different constraints on the
size.

So I'd rather let the architecture implementation decide and make this

flush_dcache_page(bio_page(bio));
- memcpy_flushcache(data, buf, size);
+ memcpy_flushcache_bikesheddedname(data, buf, size);

and have the default fallback memcpy_flushcache() and let the
architecture sort the size limit and the underlying technology out.

So x86 can use clflushopt or implement it with movdir64b and any other
architecture can provide their own magic soup without changing the
callsite.
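
A minimal userspace sketch of that layering, under stated assumptions: all
sketch_* and SKETCH_* names are hypothetical, the plain memcpy() calls only
stand in for the real store and flush sequences, and the 512-byte cutoff is
the value from the patch under discussion, kept inside the "architecture"
part so that the call site never sees it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Records which path ran, so the sketch can be checked. */
enum copy_path { PATH_NONE, PATH_NT_STORES, PATH_CACHED_PLUS_FLUSH };
static enum copy_path last_path = PATH_NONE;

/* Stand-in for memcpy_flushcache(); real code uses non-temporal stores. */
static void sketch_memcpy_flushcache(void *dst, const void *src, size_t cnt)
{
	memcpy(dst, src, cnt);
	last_path = PATH_NT_STORES;
}

/* ---- what an architecture header would provide ---- */
#define SKETCH_HAVE_ARCH_FLUSHCACHE_SINGLE 1
#define SKETCH_ARCH_SINGLE_MIN_BYTES 512	/* arch-private cutoff */

static void sketch_memcpy_flushcache_single(void *dst, const void *src,
					    size_t cnt)
{
	assert(dst != NULL && src != NULL);
	if (cnt <= SKETCH_ARCH_SINGLE_MIN_BYTES) {
		sketch_memcpy_flushcache(dst, src, cnt);
		return;
	}
	/* Real code: cached stores followed by a flush of each line. */
	memcpy(dst, src, cnt);
	last_path = PATH_CACHED_PLUS_FLUSH;
}

/* ---- generic fallback, used when the architecture provides nothing ---- */
#ifndef SKETCH_HAVE_ARCH_FLUSHCACHE_SINGLE
static void sketch_memcpy_flushcache_single(void *dst, const void *src,
					    size_t cnt)
{
	sketch_memcpy_flushcache(dst, src, cnt);
}
#endif

/* ---- call site: no size check, no instruction names ---- */
static void sketch_copy_block(void *data, const void *buf, size_t size)
{
	sketch_memcpy_flushcache_single(data, buf, size);
}
```

An architecture without a better option simply leaves the macro undefined
and gets the fallback; changing the cutoff or the underlying instruction
does not touch sketch_copy_block().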

Thanks,

tglx



2020-04-18 13:32:37

by David Laight

Subject: RE: [PATCH] x86: introduce memcpy_flushcache_clflushopt

From: Mikulas Patocka
> Sent: 17 April 2020 13:47
...
> Index: linux-2.6/drivers/md/dm-writecache.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-writecache.c 2020-04-17 14:06:35.139999000 +0200
> +++ linux-2.6/drivers/md/dm-writecache.c 2020-04-17 14:06:35.129999000 +0200
> @@ -1166,7 +1166,10 @@ static void bio_copy_block(struct dm_wri
> }
> } else {
> flush_dcache_page(bio_page(bio));
> - memcpy_flushcache(data, buf, size);
> + if (likely(size > 512))
> + memcpy_flushcache_clflushopt(data, buf, size);
> + else
> + memcpy_flushcache(data, buf, size);

Hmmm... have you looked at how long clflush actually takes?
It isn't too bad if you just do a small number, but using it
to flush large buffers can be very slow.

I've an Ivy Bridge system where the X-server process requests the
frame buffer be flushed out every 10 seconds (no idea why).
With my 2560x1440 monitor this takes over 3ms.

This really needs a cond_resched() every few clflush instructions.

David


2020-04-18 15:23:50

by Mikulas Patocka

Subject: RE: [PATCH] x86: introduce memcpy_flushcache_clflushopt



On Sat, 18 Apr 2020, David Laight wrote:

> From: Mikulas Patocka
> > Sent: 17 April 2020 13:47
> ...
> > Index: linux-2.6/drivers/md/dm-writecache.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/md/dm-writecache.c 2020-04-17 14:06:35.139999000 +0200
> > +++ linux-2.6/drivers/md/dm-writecache.c 2020-04-17 14:06:35.129999000 +0200
> > @@ -1166,7 +1166,10 @@ static void bio_copy_block(struct dm_wri
> > }
> > } else {
> > flush_dcache_page(bio_page(bio));
> > - memcpy_flushcache(data, buf, size);
> > + if (likely(size > 512))
> > + memcpy_flushcache_clflushopt(data, buf, size);
> > + else
> > + memcpy_flushcache(data, buf, size);
>
> Hmmm... have you looked at how long clflush actually takes?
> It isn't too bad if you just do a small number, but using it
> to flush large buffers can be very slow.

Yes, I have. It's here:
http://people.redhat.com/~mpatocka/testcases/pmem/microbenchmarks/pmem.txt

sequential write 8 + clflush - 0.3 GB/s on nvdimm
sequential write 8 + clflushopt - 1.6 GB/s on nvdimm
sequential write-nt 8 bytes - 1.3 GB/s on nvdimm

> I've an Ivy bridge system where the X-server process requests the
> frame buffer be flushed out every 10 seconds (no idea why).
> With my 2560x1440 monitor this takes over 3ms.
>
> This really needs a cond_resched() every few clflush instructions.
>
> David

AFAIK Ivy Bridge doesn't have clflushopt, it only has clflush. clflush
only allows one outstanding cache line flush, so it's very slow.
clflushopt and clwb relaxed this restriction, so there can be multiple
cache-line flush requests in flight until the user serializes them with
the sfence instruction.

The patch checks for clflushopt with
"static_cpu_has(X86_FEATURE_CLFLUSHOPT)" and if it is not present, it
falls back to non-temporal stores.
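
For readers outside the kernel, the same decision can be sketched in
userspace. This is an illustration, not the kernel's static_cpu_has()
mechanism: it assumes a GCC or Clang toolchain that ships <cpuid.h> with
__get_cpuid_count(), it reads the CLFLUSHOPT flag from CPUID leaf 7
(subleaf 0, EBX bit 23), and the enum and function names are hypothetical.

```c
#include <assert.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

enum flush_strategy { USE_NT_STORES, USE_CLFLUSHOPT };

/* Returns 1 if the CPU reports CLFLUSHOPT, 0 otherwise (or on non-x86). */
static int cpu_has_clflushopt(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid_count() returns 0 if leaf 7 is not supported. */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 0;
	return (int)((ebx >> 23) & 1u);
#else
	return 0;
#endif
}

/*
 * Mirrors the condition in the patch: use the flush-based copy only when
 * CLFLUSHOPT exists and the flush granularity is 64 bytes; otherwise fall
 * back to non-temporal stores. Plain clflush is never selected.
 */
static enum flush_strategy pick_strategy(int has_clflushopt,
					 unsigned int clflush_size)
{
	assert(has_clflushopt == 0 || has_clflushopt == 1);
	if (has_clflushopt && clflush_size == 64)
		return USE_CLFLUSHOPT;
	return USE_NT_STORES;
}
```

On a CPU without the flag, such as the Ivy Bridge system mentioned above,
pick_strategy(cpu_has_clflushopt(), 64) yields USE_NT_STORES.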

Mikulas

2020-04-19 17:50:30

by David Laight

Subject: RE: [PATCH] x86: introduce memcpy_flushcache_clflushopt

From: Mikulas Patocka
> Sent: 18 April 2020 16:21
>
> On Sat, 18 Apr 2020, David Laight wrote:
>
> > From: Mikulas Patocka
> > > Sent: 17 April 2020 13:47
> > ...
> > > Index: linux-2.6/drivers/md/dm-writecache.c
> > > ===================================================================
> > > --- linux-2.6.orig/drivers/md/dm-writecache.c 2020-04-17 14:06:35.139999000 +0200
> > > +++ linux-2.6/drivers/md/dm-writecache.c 2020-04-17 14:06:35.129999000 +0200
> > > @@ -1166,7 +1166,10 @@ static void bio_copy_block(struct dm_wri
> > > }
> > > } else {
> > > flush_dcache_page(bio_page(bio));
> > > - memcpy_flushcache(data, buf, size);
> > > + if (likely(size > 512))
> > > + memcpy_flushcache_clflushopt(data, buf, size);
> > > + else
> > > + memcpy_flushcache(data, buf, size);
> >
> > Hmmm... have you looked at how long clflush actually takes?
> > It isn't too bad if you just do a small number, but using it
> > to flush large buffers can be very slow.
>
> Yes, I have. It's here:
> http://people.redhat.com/~mpatocka/testcases/pmem/microbenchmarks/pmem.txt
>
> sequential write 8 + clflush - 0.3 GB/s on nvdimm
> sequential write 8 + clflushopt - 1.6 GB/s on nvdimm
> sequential write-nt 8 bytes - 1.3 GB/s on nvdimm

That table doesn't give enough information to be useful.
The cpu speed, memory speed and transfer lengths are all relevant.

> > I've an Ivy bridge system where the X-server process requests the
> > frame buffer be flushed out every 10 seconds (no idea why).
> > With my 2560x1440 monitor this takes over 3ms.
> >
> > This really needs a cond_resched() every few clflush instructions.
> >
> > David
>
> AFAIK Ivy Bridge doesn't have clflushopt, it only has clflush. clflush
> only allows one outstanding cache line flush, so it's very slow.
> clflushopt and clwb relaxed this restriction and there can be multiple
> cache-invalidation requests in flight until the user serializes it with
> the sfence instruction.

It isn't that simple.
While clflush on Ivy Bridge is slower than clflushopt on newer processors,
both instructions are (relatively) fast for something like 16 or 32
iterations. After that they get much slower.
I can't remember where I found the relevant figures, and even the ones I
found didn't show how large the transfers needed to be before the bytes/sec
became constant.

> The patch checks for clflushopt with
> "static_cpu_has(X86_FEATURE_CLFLUSHOPT)" and if it is not present, it
> falls back to non-temporal stores.

Ok, I was expecting you'd be falling back to clflush first.

David


2020-04-20 04:50:42

by Dan Williams

Subject: Re: [PATCH] x86: introduce memcpy_flushcache_clflushopt

On Sun, Apr 19, 2020 at 10:49 AM David Laight <[email protected]> wrote:
>
> From: Mikulas Patocka
> > Sent: 18 April 2020 16:21
> >
> > On Sat, 18 Apr 2020, David Laight wrote:
> >
> > > From: Mikulas Patocka
> > > > Sent: 17 April 2020 13:47
> > > ...
> > > > Index: linux-2.6/drivers/md/dm-writecache.c
> > > > ===================================================================
> > > > --- linux-2.6.orig/drivers/md/dm-writecache.c 2020-04-17 14:06:35.139999000 +0200
> > > > +++ linux-2.6/drivers/md/dm-writecache.c 2020-04-17 14:06:35.129999000 +0200
> > > > @@ -1166,7 +1166,10 @@ static void bio_copy_block(struct dm_wri
> > > > }
> > > > } else {
> > > > flush_dcache_page(bio_page(bio));
> > > > - memcpy_flushcache(data, buf, size);
> > > > + if (likely(size > 512))
> > > > + memcpy_flushcache_clflushopt(data, buf, size);
> > > > + else
> > > > + memcpy_flushcache(data, buf, size);
> > >
> > > Hmmm... have you looked at how long clflush actually takes?
> > > It isn't too bad if you just do a small number, but using it
> > > to flush large buffers can be very slow.
> >
> > Yes, I have. It's here:
> > http://people.redhat.com/~mpatocka/testcases/pmem/microbenchmarks/pmem.txt
> >
> > sequential write 8 + clflush - 0.3 GB/s on nvdimm
> > sequential write 8 + clflushopt - 1.6 GB/s on nvdimm
> > sequential write-nt 8 bytes - 1.3 GB/s on nvdimm
>
> That table doesn't give enough information to be useful.
> The cpu speed, memory speed and transfer lengths are all relevant.
>
> > > I've an Ivy bridge system where the X-server process requests the
> > > frame buffer be flushed out every 10 seconds (no idea why).
> > > With my 2560x1440 monitor this takes over 3ms.
> > >
> > > This really needs a cond_resched() every few clflush instructions.
> > >
> > > David
> >
> > AFAIK Ivy Bridge doesn't have clflushopt, it only has clflush. clflush
> > > only allows one outstanding cache line flush, so it's very slow.
> > clflushopt and clwb relaxed this restriction and there can be multiple
> > cache-invalidation requests in flight until the user serializes it with
> > the sfence instruction.
>
> It isn't that simple.
> While clflush on Ivybridge is slower than clflushopt on newer processors
> both instructions are (relatively) fast for something like 16 or 32
> iterations. After that they get much slower.
> I can't remember where I found the relevant figures, even the ones I
> found didn't show how large the transfers needed to be before the bytes/sec
> became constant.
>
> > The patch checks for clflushopt with
> > "static_cpu_has(X86_FEATURE_CLFLUSHOPT)" and if it is not present, it
> > falls back to non-temporal stores.
>
> Ok, I was expecting you'd be falling back to clflush first.

clflush is a serializing instruction, clflushopt and non-temporal
stores are not.

2020-04-20 16:38:41

by Mikulas Patocka

Subject: [PATCH v2] x86: introduce memcpy_flushcache_single



On Fri, 17 Apr 2020, Thomas Gleixner wrote:

> Dan Williams <[email protected]> writes:
>
> > The goal of naming it _inatomic() was specifically for the observation
> > that your driver coordinates atomic access and does not benefit from
> > the cache friendliness that non-temporal stores afford. That said
> > _inatomic() is arguably not a good choice either because that refers
> > to whether the copy is prepared to take a fault or not. What about
> > _exclusive() or _single()? Anything but _clflushopt() that conveys no
> > contextual information.

OK. I renamed it to memcpy_flushcache_single

> > Other than quibbling with the name, and one more comment below, this
> > looks ok to me.
> >
> >> Index: linux-2.6/drivers/md/dm-writecache.c
> >> ===================================================================
> >> --- linux-2.6.orig/drivers/md/dm-writecache.c 2020-04-17 14:06:35.139999000 +0200
> >> +++ linux-2.6/drivers/md/dm-writecache.c 2020-04-17 14:06:35.129999000 +0200
> >> @@ -1166,7 +1166,10 @@ static void bio_copy_block(struct dm_wri
> >> }
> >> } else {
> >> flush_dcache_page(bio_page(bio));
> >> - memcpy_flushcache(data, buf, size);
> >> + if (likely(size > 512))
> >
> > This needs some reference to how this magic number is chosen and how a
> > future developer might determine whether the value needs to be
> > adjusted.
>
> I don't think it's a good idea to make this decision in generic code as
> architectures or even CPU models might have different constraints on the
> size.
>
> So I'd rather let the architecture implementation decide and make this
>
> flush_dcache_page(bio_page(bio));
> - memcpy_flushcache(data, buf, size);
> + memcpy_flushcache_bikesheddedname(data, buf, size);
>
> and have the default fallback memcpy_flushcache() and let the
> architecture sort the size limit and the underlying technology out.
>
> So x86 can use clflushopt or implement it with movdir64b and any other
> architecture can provide their own magic soup without changing the
> callsite.
>
> Thanks,
>
> tglx

OK - so I moved the decision to memcpy_flushcache_single and I added a
comment that explains the magic number.

Mikulas




From: Mikulas Patocka <[email protected]>

Implement the function memcpy_flushcache_single which flushes cache just
like memcpy_flushcache - except that it uses cached writes and explicit
cache flushing instead of non-temporal stores.

Explicit cache flushing performs better in single-threaded cases (e.g. the
dm-writecache target with block size greater than 512), while non-temporal
stores perform better in other cases (mostly multithreaded workloads). We
therefore provide both functions, and the caller should select whichever
is faster for its particular workload.

dm-writecache throughput (on real Optane-based persistent memory):
block size    512         1024        2048        4096
movnti        496 MB/s    642 MB/s    725 MB/s    744 MB/s
clflushopt    373 MB/s    688 MB/s    1.1 GB/s    1.2 GB/s

Signed-off-by: Mikulas Patocka <[email protected]>

---
arch/x86/include/asm/string_64.h | 10 ++++++++
arch/x86/lib/usercopy_64.c | 46 +++++++++++++++++++++++++++++++++++++++
drivers/md/dm-writecache.c | 2 -
include/linux/string.h | 6 +++++
4 files changed, 63 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/include/asm/string_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/string_64.h 2020-04-20 15:31:46.939999000 +0200
+++ linux-2.6/arch/x86/include/asm/string_64.h 2020-04-20 15:31:46.929999000 +0200
@@ -114,6 +114,14 @@ memcpy_mcsafe(void *dst, const void *src
return 0;
}

+/*
+ * In some cases (mostly single-threaded workload), clflushopt is faster
+ * than non-temporal stores. In other situations, non-temporal stores are
+ * faster. So, we provide two functions:
+ * memcpy_flushcache using non-temporal stores
+ * memcpy_flushcache_single using clflushopt
+ * The caller should test which one is faster for the particular workload.
+ */
#ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
#define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
@@ -135,6 +143,8 @@ static __always_inline void memcpy_flush
}
__memcpy_flushcache(dst, src, cnt);
}
+#define __HAVE_ARCH_MEMCPY_FLUSHCACHE_CLFLUSHOPT 1
+void memcpy_flushcache_single(void *dst, const void *src, size_t cnt);
#endif

#endif /* __KERNEL__ */
Index: linux-2.6/include/linux/string.h
===================================================================
--- linux-2.6.orig/include/linux/string.h 2020-04-20 15:31:46.939999000 +0200
+++ linux-2.6/include/linux/string.h 2020-04-20 15:31:46.929999000 +0200
@@ -175,6 +175,12 @@ static inline void memcpy_flushcache(voi
memcpy(dst, src, cnt);
}
#endif
+#ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE_CLFLUSHOPT
+static inline void memcpy_flushcache_single(void *dst, const void *src, size_t cnt)
+{
+ memcpy_flushcache(dst, src, cnt);
+}
+#endif
void *memchr_inv(const void *s, int c, size_t n);
char *strreplace(char *s, char old, char new);

Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c 2020-04-20 15:31:46.939999000 +0200
+++ linux-2.6/arch/x86/lib/usercopy_64.c 2020-04-20 15:38:13.159999000 +0200
@@ -199,6 +199,52 @@ void __memcpy_flushcache(void *_dst, con
}
EXPORT_SYMBOL_GPL(__memcpy_flushcache);

+void memcpy_flushcache_single(void *_dst, const void *_src, size_t size)
+{
+ unsigned long dest = (unsigned long) _dst;
+ unsigned long source = (unsigned long) _src;
+
+ /*
+ * dm-writecache throughput (on real Optane-based persistent memory):
+ * measured with dd:
+ *
+ * block size    512         1024        2048        4096
+ * movnti        496 MB/s    642 MB/s    725 MB/s    744 MB/s
+ * clflushopt    373 MB/s    688 MB/s    1.1 GB/s    1.2 GB/s
+ *
+ * We see that movnti performs better for 512-byte blocks, and
+ * clflushopt performs better for 1024-byte and larger blocks. So, we
+ * prefer clflushopt for sizes >= 768.
+ */
+
+ if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && likely(boot_cpu_data.x86_clflush_size == 64) &&
+ likely(size >= 768)) {
+ if (unlikely(!IS_ALIGNED(dest, 64))) {
+ size_t len = min_t(size_t, size, ALIGN(dest, 64) - dest);
+
+ memcpy((void *) dest, (void *) source, len);
+ clflushopt((void *)dest);
+ dest += len;
+ source += len;
+ size -= len;
+ }
+ do {
+ memcpy((void *)dest, (void *)source, 64);
+ clflushopt((void *)dest);
+ dest += 64;
+ source += 64;
+ size -= 64;
+ } while (size >= 64);
+ if (unlikely(size != 0)) {
+ memcpy((void *)dest, (void *)source, size);
+ clflushopt((void *)dest);
+ }
+ return;
+ }
+ memcpy_flushcache((void *)dest, (void *)source, size);
+}
+EXPORT_SYMBOL_GPL(memcpy_flushcache_single);
+
void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
size_t len)
{
Index: linux-2.6/drivers/md/dm-writecache.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-writecache.c 2020-04-20 15:31:46.939999000 +0200
+++ linux-2.6/drivers/md/dm-writecache.c 2020-04-20 15:32:35.549999000 +0200
@@ -1166,7 +1166,7 @@ static void bio_copy_block(struct dm_wri
}
} else {
flush_dcache_page(bio_page(bio));
- memcpy_flushcache(data, buf, size);
+ memcpy_flushcache_single(data, buf, size);
}

bvec_kunmap_irq(buf, &flags);
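
The head/body/tail structure of the copy loop above can be tried out in user space. The following is only a sketch under stated assumptions: `copy_and_flush`, `flush_line`, `flushed_lines` and `LINE` are hypothetical names that do not appear in the patch, the flush is a counting stub rather than a real clflushopt, and the body uses a plain `while` loop instead of the patch's `do { } while`, so it also handles sizes below one cache line.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE 64u	/* assumed cache line size, as checked in the patch */

/* Hypothetical stand-in for clflushopt(): it only counts calls. */
static unsigned long flushed_lines;

static void flush_line(const void *p)
{
	(void)p;
	flushed_lines++;
}

/* Copy size bytes and "flush" each touched destination line once. */
static void copy_and_flush(void *dst, const void *src, size_t size)
{
	uintptr_t d = (uintptr_t)dst;
	const unsigned char *s = src;
	/* Bytes needed to reach the next line boundary (0 if aligned). */
	size_t head = (LINE - (d % LINE)) % LINE;

	if (head > size)
		head = size;
	if (head) {
		/* Unaligned head: partial line, flushed once. */
		memcpy((void *)d, s, head);
		flush_line((void *)d);
		d += head;
		s += head;
		size -= head;
	}
	while (size >= LINE) {
		/* Aligned body: one full line per iteration. */
		memcpy((void *)d, s, LINE);
		flush_line((void *)d);
		d += LINE;
		s += LINE;
		size -= LINE;
	}
	if (size) {
		/* Tail: remaining bytes all sit in one line. */
		memcpy((void *)d, s, size);
		flush_line((void *)d);
	}
}
```

Copying 1000 bytes to a destination that starts 5 bytes past a line boundary touches 16 cache lines (a 59-byte head, 14 full lines, and a 45-byte tail), so the stub is called 16 times.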

2020-04-21 18:45:40

by Dan Williams

Subject: Re: [PATCH v2] x86: introduce memcpy_flushcache_single

On Mon, Apr 20, 2020 at 6:48 AM Mikulas Patocka <[email protected]> wrote:
>
>
>
> On Fri, 17 Apr 2020, Thomas Gleixner wrote:
>
> > Dan Williams <[email protected]> writes:
> >
> > > The goal of naming it _inatomic() was specifically for the observation
> > > that your driver coordinates atomic access and does not benefit from
> > > the cache friendliness that non-temporal stores afford. That said
> > > _inatomic() is arguably not a good choice either because that refers
> > > to whether the copy is prepared to take a fault or not. What about
> > > _exclusive() or _single()? Anything but _clflushopt() that conveys no
> > > contextual information.
>
> OK. I renamed it to memcpy_flushcache_single
>
> > > Other than quibbling with the name, and one more comment below, this
> > > looks ok to me.
> > >
> > >> Index: linux-2.6/drivers/md/dm-writecache.c
> > >> ===================================================================
> > >> --- linux-2.6.orig/drivers/md/dm-writecache.c 2020-04-17 14:06:35.139999000 +0200
> > >> +++ linux-2.6/drivers/md/dm-writecache.c 2020-04-17 14:06:35.129999000 +0200
> > >> @@ -1166,7 +1166,10 @@ static void bio_copy_block(struct dm_wri
> > >> }
> > >> } else {
> > >> flush_dcache_page(bio_page(bio));
> > >> - memcpy_flushcache(data, buf, size);
> > >> + if (likely(size > 512))
> > >
> > > This needs some reference to how this magic number is chosen and how a
> > > future developer might determine whether the value needs to be
> > > adjusted.
> >
> > I don't think it's a good idea to make this decision in generic code as
> > architectures or even CPU models might have different constraints on the
> > size.
> >
> > So I'd rather let the architecture implementation decide and make this
> >
> > flush_dcache_page(bio_page(bio));
> > - memcpy_flushcache(data, buf, size);
> > + memcpy_flushcache_bikesheddedname(data, buf, size);
> >
> > and have the default fallback memcpy_flushcache() and let the
> > architecture sort the size limit and the underlying technology out.
> >
> > So x86 can use clflushopt or implement it with movdir64b and any other
> > architecture can provide their own magic soup without changing the
> > callsite.
> >
> > Thanks,
> >
> > tglx
>
> OK - so I moved the decision to memcpy_flushcache_single and I added a
> comment that explains the magic number.
>
> Mikulas
>
>
>
>
> From: Mikulas Patocka <[email protected]>
>
> Implement the function memcpy_flushcache_single which flushes cache just
> like memcpy_flushcache - except that it uses cached writes and explicit
> cache flushing instead of non-temporal stores.
>
> Explicit cache flushing performs better in single-threaded cases (e.g. the
> dm-writecache target with block size greater than 512), while non-temporal
> stores perform better in other cases (mostly multithreaded workloads). We
> therefore provide both functions, and the caller should select whichever
> is faster for its particular workload.

I would mention that dm-writecache is choosing to use
memcpy_flushcache_single() because it is regularly invoked under a
lock.

"The dm-writecache target is singlethreaded (all the copying is done
while holding the writecache lock), so it benefits from clwb." [1]

[1]: http://lore.kernel.org/r/alpine.LRH.2.02.2004160411460.7833@file01.intranet.prod.int.rdu2.redhat.com

>
> dm-writecache throughput (on real Optane-based persistent memory):
> block size      512         1024        2048        4096
> movnti          496 MB/s    642 MB/s    725 MB/s    744 MB/s
> clflushopt      373 MB/s    688 MB/s    1.1 GB/s    1.2 GB/s
>
> Signed-off-by: Mikulas Patocka <[email protected]>
>
> ---
> arch/x86/include/asm/string_64.h | 10 ++++++++
> arch/x86/lib/usercopy_64.c | 46 +++++++++++++++++++++++++++++++++++++++
> drivers/md/dm-writecache.c | 2 -
> include/linux/string.h | 6 +++++
> 4 files changed, 63 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/arch/x86/include/asm/string_64.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/string_64.h 2020-04-20 15:31:46.939999000 +0200
> +++ linux-2.6/arch/x86/include/asm/string_64.h 2020-04-20 15:31:46.929999000 +0200
> @@ -114,6 +114,14 @@ memcpy_mcsafe(void *dst, const void *src
> return 0;
> }
>
> +/*
> + * In some cases (mostly single-threaded workload), clflushopt is faster
> + * than non-temporal stores. In other situations, non-temporal stores are
> + * faster. So, we provide two functions:
> + * memcpy_flushcache using non-temporal stores
> + * memcpy_flushcache_single using clflushopt
> + * The caller should test which one is faster for the particular workload.
> + */
> #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
> #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
> void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
> @@ -135,6 +143,8 @@ static __always_inline void memcpy_flush
> }
> __memcpy_flushcache(dst, src, cnt);
> }
> +#define __HAVE_ARCH_MEMCPY_FLUSHCACHE_CLFLUSHOPT 1
> +void memcpy_flushcache_single(void *dst, const void *src, size_t cnt);
> #endif
>
> #endif /* __KERNEL__ */
> Index: linux-2.6/include/linux/string.h
> ===================================================================
> --- linux-2.6.orig/include/linux/string.h 2020-04-20 15:31:46.939999000 +0200
> +++ linux-2.6/include/linux/string.h 2020-04-20 15:31:46.929999000 +0200
> @@ -175,6 +175,12 @@ static inline void memcpy_flushcache(voi
> memcpy(dst, src, cnt);
> }
> #endif
> +#ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE_CLFLUSHOPT
> +static inline void memcpy_flushcache_single(void *dst, const void *src, size_t cnt)
> +{
> + memcpy_flushcache(dst, src, cnt);
> +}
> +#endif
> void *memchr_inv(const void *s, int c, size_t n);
> char *strreplace(char *s, char old, char new);
>
> Index: linux-2.6/arch/x86/lib/usercopy_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/lib/usercopy_64.c 2020-04-20 15:31:46.939999000 +0200
> +++ linux-2.6/arch/x86/lib/usercopy_64.c 2020-04-20 15:38:13.159999000 +0200
> @@ -199,6 +199,52 @@ void __memcpy_flushcache(void *_dst, con
> }
> EXPORT_SYMBOL_GPL(__memcpy_flushcache);
>
> +void memcpy_flushcache_single(void *_dst, const void *_src, size_t size)
> +{
> + unsigned long dest = (unsigned long) _dst;
> + unsigned long source = (unsigned long) _src;
> +
> + /*
> + * dm-writecache throughput (on real Optane-based persistent memory):
> + * measured with dd:

Why mention Optane? There are several types of persistent memory.
Typical persistent memory to date behaves like DDR because it is
battery backed. So if you're going to mention the memory type I would
also include the DDR details.

At a minimum include the lore link in the changelog to the wider
analysis you contributed on the mailing list.

> + *
> + * block size    512         1024        2048        4096
> + * movnti        496 MB/s    642 MB/s    725 MB/s    744 MB/s
> + * clflushopt    373 MB/s    688 MB/s    1.1 GB/s    1.2 GB/s
> + *
> + * We see that movnti performs better for 512-byte blocks, and
> + * clflushopt performs better for 1024-byte and larger blocks. So, we
> + * prefer clflushopt for sizes >= 768.
> + */
> +
> + if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && likely(boot_cpu_data.x86_clflush_size == 64) &&
> + likely(size >= 768)) {
> + if (unlikely(!IS_ALIGNED(dest, 64))) {
> + size_t len = min_t(size_t, size, ALIGN(dest, 64) - dest);
> +
> + memcpy((void *) dest, (void *) source, len);
> + clflushopt((void *)dest);
> + dest += len;
> + source += len;
> + size -= len;
> + }
> + do {
> + memcpy((void *)dest, (void *)source, 64);
> + clflushopt((void *)dest);
> + dest += 64;
> + source += 64;
> + size -= 64;
> + } while (size >= 64);
> + if (unlikely(size != 0)) {
> + memcpy((void *)dest, (void *)source, size);
> + clflushopt((void *)dest);
> + }
> + return;
> + }
> + memcpy_flushcache((void *)dest, (void *)source, size);
> +}
> +EXPORT_SYMBOL_GPL(memcpy_flushcache_single);
> +
> void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
> size_t len)
> {
> Index: linux-2.6/drivers/md/dm-writecache.c
> ===================================================================
> --- linux-2.6.orig/drivers/md/dm-writecache.c 2020-04-20 15:31:46.939999000 +0200
> +++ linux-2.6/drivers/md/dm-writecache.c 2020-04-20 15:32:35.549999000 +0200
> @@ -1166,7 +1166,7 @@ static void bio_copy_block(struct dm_wri
> }
> } else {
> flush_dcache_page(bio_page(bio));
> - memcpy_flushcache(data, buf, size);
> + memcpy_flushcache_single(data, buf, size);
> }
>
> bvec_kunmap_irq(buf, &flags);
>
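
For readers who want to reason about the size cutoff on its own, the condition guarding the clflushopt path in the quoted patch can be isolated as a small predicate. This is a sketch with assumed names: `use_flush_path` and `FLUSH_PATH_MIN_SIZE` are hypothetical and not part of the patch, and the CPU feature test is passed in as a plain boolean instead of calling static_cpu_has().

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Threshold taken from the patch comment: it lies between the 512-byte
 * case (movnti faster) and the 1024-byte case (clflushopt faster).
 */
#define FLUSH_PATH_MIN_SIZE 768u

/* Hypothetical predicate mirroring the quoted "if" condition. */
static bool use_flush_path(bool has_clflushopt, unsigned int line_size,
			   size_t size)
{
	return has_clflushopt && line_size == 64 &&
	       size >= FLUSH_PATH_MIN_SIZE;
}
```

With this predicate a 512-byte copy stays on the non-temporal path, while copies of 768 bytes or more take the flush path, provided the CPU supports clflushopt and uses 64-byte cache lines.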