2016-10-26 15:51:57

by Boylston, Brian

Subject: [PATCH v2 0/3] use nocache copy in copy_from_iter_nocache()

Currently, copy_from_iter_nocache() uses "nocache" copies only for
iovecs; bvecs and kvecs use normal copies. This requires
x86's arch_copy_from_iter_pmem() to issue flushes for bvecs and kvecs,
which has a negative impact on performance when splice()ing from a pipe
to a pmem-backed file on a DAX-mounted file system.

This patch set enables nocache copies in copy_from_iter_nocache() for
bvecs and kvecs for arches that support it (x86 initially). This provides
a 2-3X improvement in splice() pipe-to-DAX-file throughput.
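To make "nocache" concrete: on x86 these copies use non-temporal stores that bypass the cache hierarchy, which is why no separate flush is needed afterward. The userspace sketch below is only an illustration with SSE2 intrinsics, not code from the patch set (the kernel's real implementation is __copy_from_user_nocache()); it shows the shape of such a copy, including the plain-memcpy() fallback the cover letter describes for arches without support:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Illustrative sketch of a "nocache" copy; name and structure are
 * invented for this example and do not match the kernel routines. */
static void *memcpy_nocache_sketch(void *dest, const void *src, size_t count)
{
#if defined(__SSE2__)
	char *d = dest;
	const char *s = src;

	/* Byte-copy until dest is 16-byte aligned: movntdq requires it. */
	while (count && ((uintptr_t)d & 15)) {
		*d++ = *s++;
		count--;
	}
	/* Bulk: non-temporal 16-byte stores bypass the cache hierarchy. */
	while (count >= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)s);
		_mm_stream_si128((__m128i *)d, v);
		s += 16;
		d += 16;
		count -= 16;
	}
	/* Tail via a normal (cached) copy, then order the NT stores. */
	memcpy(d, s, count);
	_mm_sfence();
#else
	/* Arch without support: fall back to a plain copy. */
	memcpy(dest, src, count);
#endif
	return dest;
}
```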

The first patch introduces memcpy_nocache(), which defaults to just
memcpy(), but for which an x86-specific implementation is provided.

For this patch, I sought to use a static inline function for x86, but
I could not find an obvious header file to put it in.
The build seemed to work when I put it in arch/x86/include/asm/uaccess.h,
but that didn't feel completely right. I also tried
arch/x86/include/asm/pmem.h, but that doesn't feel right either and it
didn't build. So, I offer it here in arch/x86/lib/misc.c for discussion.

The second patch updates copy_from_iter_nocache() to use the new
memcpy_nocache().

The third patch removes the flushes from x86's arch_copy_from_iter_pmem().

For testing, I ran fio with the posixaio, mmap, sync, psync, vsync, pvsync,
and splice engines, against both ext4 and xfs. Only the splice engine
showed any change in performance. For example, for xfs:

Unpatched 4.8:

Run status group 2 (all jobs):
WRITE: io=37602MB, aggrb=641724KB/s, minb=641724KB/s, maxb=641724KB/s, mint=60001msec, maxt=60001msec

Run status group 3 (all jobs):
WRITE: io=36244MB, aggrb=618553KB/s, minb=618553KB/s, maxb=618553KB/s, mint=60001msec, maxt=60001msec

With this patch set:

Run status group 2 (all jobs):
WRITE: io=128055MB, aggrb=2134.3MB/s, minb=2134.3MB/s, maxb=2134.3MB/s, mint=60001msec, maxt=60001msec

Run status group 3 (all jobs):
WRITE: io=122586MB, aggrb=2043.8MB/s, minb=2043.8MB/s, maxb=2043.8MB/s, mint=60001msec, maxt=60001msec
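The fio job files themselves are not included in the posting. A minimal job along these lines would exercise the splice engine against a DAX-mounted file; every parameter below is an assumption for illustration, not the author's actual configuration:

```ini
; Hypothetical fio job; the thread does not include the real one.
; Assumes a pmem-backed file system mounted with -o dax at /mnt/pmem.
[global]
directory=/mnt/pmem
runtime=60
time_based
size=1g

[splice-write]
ioengine=splice
rw=write
bs=64k
```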

Cc: Ross Zwisler <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Brian Boylston <[email protected]>
Reviewed-by: Toshi Kani <[email protected]>
Reported-by: Oliver Moreno <[email protected]>

Changes in v2:
- Split into multiple patches (Toshi Kani)
- Introduce memcpy_nocache() (Al Viro)
- Use nocache for kvecs as well

Brian Boylston (3):
introduce memcpy_nocache()
use a nocache copy for bvecs and kvecs in copy_from_iter_nocache()
x86: remove unneeded flush in arch_copy_from_iter_pmem()

arch/x86/include/asm/pmem.h | 19 +------------------
arch/x86/include/asm/string_32.h | 3 +++
arch/x86/include/asm/string_64.h | 3 +++
arch/x86/lib/misc.c | 12 ++++++++++++
include/linux/string.h | 15 +++++++++++++++
lib/iov_iter.c | 14 +++++++++++---
6 files changed, 45 insertions(+), 21 deletions(-)

--
2.8.3


2016-10-26 15:52:27

by Boylston, Brian

Subject: [PATCH v2 2/3] use a nocache copy for bvecs and kvecs in copy_from_iter_nocache()

Update copy_from_iter_nocache() to use memcpy_nocache()
for bvecs and kvecs.

Cc: Ross Zwisler <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Brian Boylston <[email protected]>
Reviewed-by: Toshi Kani <[email protected]>
Reported-by: Oliver Moreno <[email protected]>
---
lib/iov_iter.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 7e3138c..71e4531 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -342,6 +342,13 @@ static void memcpy_from_page(char *to, struct page *page, size_t offset, size_t
kunmap_atomic(from);
}

+static void memcpy_from_page_nocache(char *to, struct page *page, size_t offset, size_t len)
+{
+ char *from = kmap_atomic(page);
+ memcpy_nocache(to, from + offset, len);
+ kunmap_atomic(from);
+}
+
static void memcpy_to_page(struct page *page, size_t offset, const char *from, size_t len)
{
char *to = kmap_atomic(page);
@@ -392,9 +399,10 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
iterate_and_advance(i, bytes, v,
__copy_from_user_nocache((to += v.iov_len) - v.iov_len,
v.iov_base, v.iov_len),
- memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
- v.bv_offset, v.bv_len),
- memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
+ memcpy_from_page_nocache((to += v.bv_len) - v.bv_len,
+ v.bv_page, v.bv_offset, v.bv_len),
+ memcpy_nocache((to += v.iov_len) - v.iov_len,
+ v.iov_base, v.iov_len)
)

return bytes;
--
2.8.3

2016-10-26 15:52:22

by Boylston, Brian

Subject: [PATCH v2 1/3] introduce memcpy_nocache()

Introduce memcpy_nocache() as a memcpy() that avoids the processor cache
if possible. Without arch-specific support, this defaults to just
memcpy(). For now, include arch-specific support for x86.

Cc: Ross Zwisler <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Brian Boylston <[email protected]>
Reviewed-by: Toshi Kani <[email protected]>
Reported-by: Oliver Moreno <[email protected]>
---
arch/x86/include/asm/string_32.h | 3 +++
arch/x86/include/asm/string_64.h | 3 +++
arch/x86/lib/misc.c | 12 ++++++++++++
include/linux/string.h | 15 +++++++++++++++
4 files changed, 33 insertions(+)

diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index 3d3e835..64f80c0 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -196,6 +196,9 @@ static inline void *__memcpy3d(void *to, const void *from, size_t len)

#endif

+#define __HAVE_ARCH_MEMCPY_NOCACHE
+extern void *memcpy_nocache(void *dest, const void *src, size_t count);
+
#define __HAVE_ARCH_MEMMOVE
void *memmove(void *dest, const void *src, size_t n);

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 90dbbd9..a8fdd55 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -51,6 +51,9 @@ extern void *__memcpy(void *to, const void *from, size_t len);
#define memcpy(dst, src, len) __inline_memcpy((dst), (src), (len))
#endif

+#define __HAVE_ARCH_MEMCPY_NOCACHE
+extern void *memcpy_nocache(void *dest, const void *src, size_t count);
+
#define __HAVE_ARCH_MEMSET
void *memset(void *s, int c, size_t n);
void *__memset(void *s, int c, size_t n);
diff --git a/arch/x86/lib/misc.c b/arch/x86/lib/misc.c
index 76b373a..c993ab3 100644
--- a/arch/x86/lib/misc.c
+++ b/arch/x86/lib/misc.c
@@ -1,3 +1,6 @@
+#include <linux/export.h>
+#include <linux/uaccess.h>
+
/*
* Count the digits of @val including a possible sign.
*
@@ -19,3 +22,12 @@ int num_digits(int val)
}
return d;
}
+
+#ifdef __HAVE_ARCH_MEMCPY_NOCACHE
+void *memcpy_nocache(void *dest, const void *src, size_t count)
+{
+ __copy_from_user_inatomic_nocache(dest, src, count);
+ return dest;
+}
+EXPORT_SYMBOL(memcpy_nocache);
+#endif
diff --git a/include/linux/string.h b/include/linux/string.h
index 26b6f6a..7f40c41 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -102,6 +102,21 @@ extern void * memset(void *,int,__kernel_size_t);
#ifndef __HAVE_ARCH_MEMCPY
extern void * memcpy(void *,const void *,__kernel_size_t);
#endif
+
+#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
+/**
+ * memcpy_nocache - Copy one area of memory to another, avoiding the
+ * processor cache if possible
+ * @dest: Where to copy to
+ * @src: Where to copy from
+ * @count: The size of the area.
+ */
+static inline void *memcpy_nocache(void *dest, const void *src, size_t count)
+{
+ return memcpy(dest, src, count);
+}
+#endif
+
#ifndef __HAVE_ARCH_MEMMOVE
extern void * memmove(void *,const void *,__kernel_size_t);
#endif
--
2.8.3

2016-10-26 15:52:26

by Boylston, Brian

[permalink] [raw]
Subject: [PATCH v2 3/3] x86: remove unneeded flush in arch_copy_from_iter_pmem()

copy_from_iter_nocache() now uses nocache copies for all types of iovecs
on x86, so the flush in arch_copy_from_iter_pmem() is no longer needed.

Cc: Ross Zwisler <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Brian Boylston <[email protected]>
Reviewed-by: Toshi Kani <[email protected]>
Reported-by: Oliver Moreno <[email protected]>
---
arch/x86/include/asm/pmem.h | 19 +------------------
1 file changed, 1 insertion(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 643eba4..2fbf4ae 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -72,15 +72,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
clwb(p);
}

-/*
- * copy_from_iter_nocache() on x86 only uses non-temporal stores for iovec
- * iterators, so for other types (bvec & kvec) we must do a cache write-back.
- */
-static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
-{
- return iter_is_iovec(i) == false;
-}
-
/**
* arch_copy_from_iter_pmem - copy data from an iterator to PMEM
* @addr: PMEM destination address
@@ -92,15 +83,7 @@ static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
struct iov_iter *i)
{
- size_t len;
-
- /* TODO: skip the write-back by always using non-temporal stores */
- len = copy_from_iter_nocache(addr, bytes, i);
-
- if (__iter_needs_pmem_wb(i))
- arch_wb_cache_pmem(addr, bytes);
-
- return len;
+ return copy_from_iter_nocache(addr, bytes, i);
}

/**
--
2.8.3

2016-10-26 19:33:19

by Thomas Gleixner

Subject: Re: [PATCH v2 1/3] introduce memcpy_nocache()

On Wed, 26 Oct 2016, Brian Boylston wrote:
> --- a/arch/x86/include/asm/string_32.h
> +++ b/arch/x86/include/asm/string_32.h
> @@ -196,6 +196,9 @@ static inline void *__memcpy3d(void *to, const void *from, size_t len)

> +#define __HAVE_ARCH_MEMCPY_NOCACHE
> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);

Can we please move that prototype to linux/string.h instead of having it in
both files and just define __HAVE_ARCH_MEMCPY_NOCACHE? And that define can
move into x86/include/asm/string.h. There is no point in duplicating all
that stuff.

> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -51,6 +51,9 @@ extern void *__memcpy(void *to, const void *from, size_t len);

> +#define __HAVE_ARCH_MEMCPY_NOCACHE
> +extern void *memcpy_nocache(void *dest, const void *src, size_t count)

> +#ifdef __HAVE_ARCH_MEMCPY_NOCACHE

You define __HAVE_ARCH_MEMCPY_NOCACHE for both 32-bit and 64-bit. What is
this #ifdef for?

> +void *memcpy_nocache(void *dest, const void *src, size_t count)
> +{
> + __copy_from_user_inatomic_nocache(dest, src, count);

You want to replace a memcpy() with this and then use a copy-from-user
primitive. Why? That does not make any sense to me, and even if it makes sense
for whatever reason, then it wants to be documented and the src argument needs
a type cast to (void __user *).

Further, this uses the inatomic variant. Why? Does a memcpy() replacement
suddenly need to be called in atomic context? Again, this makes no sense
and lacks any form of comment. The kernel doc below does not say anything
about that.

Nor does the changelog provide any useful information complementary to what
the patch actually does.

> +#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
> +/**
> + * memcpy_nocache - Copy one area of memory to another, avoiding the
> + * processor cache if possible

I'm not sure whether kerneldoc is happy about that line break. Did you
build the docs?

Make the initial line shorter and explain the functionality in detail
below the arguments

> + * @dest: Where to copy to
> + * @src: Where to copy from
> + * @count: The size of the area.

Nit. Can you please make this in tabular form for readability's sake?

* @dest: Where to copy to
* @src: Where to copy from
* @count: The size of the area.

> + */
> +static inline void *memcpy_nocache(void *dest, const void *src, size_t count)
> +{
> + return memcpy(dest, src, count);
> +}
> +#endif

Thanks,

tglx
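
One way to address the cast and documentation points raised above could look like the kernel-style fragment below. This is a sketch only, not code from a later posting in the thread; the comment wording is invented:

```c
/* Sketch: patch 1's x86 memcpy_nocache() with the __user cast and the
 * explanatory comment Thomas asks for. Not from the actual series. */
void *memcpy_nocache(void *dest, const void *src, size_t count)
{
	/*
	 * Reuse the nocache user-copy primitive for a kernel-to-kernel
	 * copy: it is the existing x86 routine that performs
	 * non-temporal stores. src is a plain kernel pointer, hence
	 * the explicit __user cast.
	 */
	__copy_from_user_inatomic_nocache(dest,
			(const void __user *)src, count);
	return dest;
}
```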

2016-10-26 19:52:14

by Boaz Harrosh

Subject: Re: [PATCH v2 1/3] introduce memcpy_nocache()

On 10/26/2016 06:50 PM, Brian Boylston wrote:
> Introduce memcpy_nocache() as a memcpy() that avoids the processor cache
> if possible. Without arch-specific support, this defaults to just
> memcpy(). For now, include arch-specific support for x86.
>
> Cc: Ross Zwisler <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Dan Williams <[email protected]>
> Signed-off-by: Brian Boylston <[email protected]>
> Reviewed-by: Toshi Kani <[email protected]>
> Reported-by: Oliver Moreno <[email protected]>
> ---
> arch/x86/include/asm/string_32.h | 3 +++
> arch/x86/include/asm/string_64.h | 3 +++
> arch/x86/lib/misc.c | 12 ++++++++++++
> include/linux/string.h | 15 +++++++++++++++
> 4 files changed, 33 insertions(+)
>
> diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
> index 3d3e835..64f80c0 100644
> --- a/arch/x86/include/asm/string_32.h
> +++ b/arch/x86/include/asm/string_32.h
> @@ -196,6 +196,9 @@ static inline void *__memcpy3d(void *to, const void *from, size_t len)
>
> #endif
>
> +#define __HAVE_ARCH_MEMCPY_NOCACHE
> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
> +
> #define __HAVE_ARCH_MEMMOVE
> void *memmove(void *dest, const void *src, size_t n);
>
> diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
> index 90dbbd9..a8fdd55 100644
> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -51,6 +51,9 @@ extern void *__memcpy(void *to, const void *from, size_t len);
> #define memcpy(dst, src, len) __inline_memcpy((dst), (src), (len))
> #endif
>
> +#define __HAVE_ARCH_MEMCPY_NOCACHE
> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
> +
> #define __HAVE_ARCH_MEMSET
> void *memset(void *s, int c, size_t n);
> void *__memset(void *s, int c, size_t n);
> diff --git a/arch/x86/lib/misc.c b/arch/x86/lib/misc.c
> index 76b373a..c993ab3 100644
> --- a/arch/x86/lib/misc.c
> +++ b/arch/x86/lib/misc.c
> @@ -1,3 +1,6 @@
> +#include <linux/export.h>
> +#include <linux/uaccess.h>
> +
> /*
> * Count the digits of @val including a possible sign.
> *
> @@ -19,3 +22,12 @@ int num_digits(int val)
> }
> return d;
> }
> +
> +#ifdef __HAVE_ARCH_MEMCPY_NOCACHE
> +void *memcpy_nocache(void *dest, const void *src, size_t count)
> +{
> + __copy_from_user_inatomic_nocache(dest, src, count);
> + return dest;
> +}
> +EXPORT_SYMBOL(memcpy_nocache);
> +#endif
> diff --git a/include/linux/string.h b/include/linux/string.h
> index 26b6f6a..7f40c41 100644
> --- a/include/linux/string.h
> +++ b/include/linux/string.h
> @@ -102,6 +102,21 @@ extern void * memset(void *,int,__kernel_size_t);
> #ifndef __HAVE_ARCH_MEMCPY
> extern void * memcpy(void *,const void *,__kernel_size_t);
> #endif
> +
> +#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
> +/**
> + * memcpy_nocache - Copy one area of memory to another, avoiding the
> + * processor cache if possible
> + * @dest: Where to copy to
> + * @src: Where to copy from
> + * @count: The size of the area.
> + */
> +static inline void *memcpy_nocache(void *dest, const void *src, size_t count)
> +{
> + return memcpy(dest, src, count);
> +}

What about memcpy_to_pmem() in linux/pmem.h? It already has all the arch switches.

Feels bad to add yet another arch switch over __copy_user_nocache.

Just feels like too many things that do the same thing. Sigh

Boaz

> +#endif
> +
> #ifndef __HAVE_ARCH_MEMMOVE
> extern void * memmove(void *,const void *,__kernel_size_t);
> #endif
>

2016-10-26 19:57:12

by Boaz Harrosh

Subject: Re: [PATCH v2 3/3] x86: remove unneeded flush in arch_copy_from_iter_pmem()

On 10/26/2016 06:50 PM, Brian Boylston wrote:
> copy_from_iter_nocache() now uses nocache copies for all types of iovecs
> on x86, so the flush in arch_copy_from_iter_pmem() is no longer needed.
>
> Cc: Ross Zwisler <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Dan Williams <[email protected]>
> Signed-off-by: Brian Boylston <[email protected]>
> Reviewed-by: Toshi Kani <[email protected]>
> Reported-by: Oliver Moreno <[email protected]>
> ---
> arch/x86/include/asm/pmem.h | 19 +------------------
> 1 file changed, 1 insertion(+), 18 deletions(-)
>
> diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
> index 643eba4..2fbf4ae 100644
> --- a/arch/x86/include/asm/pmem.h
> +++ b/arch/x86/include/asm/pmem.h
> @@ -72,15 +72,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
> clwb(p);
> }
>
> -/*
> - * copy_from_iter_nocache() on x86 only uses non-temporal stores for iovec
> - * iterators, so for other types (bvec & kvec) we must do a cache write-back.
> - */
> -static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
> -{
> - return iter_is_iovec(i) == false;
> -}
> -
> /**
> * arch_copy_from_iter_pmem - copy data from an iterator to PMEM
> * @addr: PMEM destination address
> @@ -92,15 +83,7 @@ static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
> static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
> struct iov_iter *i)
> {
> - size_t len;
> -
> - /* TODO: skip the write-back by always using non-temporal stores */
> - len = copy_from_iter_nocache(addr, bytes, i);
> -
> - if (__iter_needs_pmem_wb(i))
> - arch_wb_cache_pmem(addr, bytes);
> -
> - return len;
> + return copy_from_iter_nocache(addr, bytes, i);

I wish you would remove all this completely. Don't see the point for it anymore

Thanks a lot for doing this. I have had this patch in my trees for ages and was
too lazy to fight it through. Yes, it is a big boost for many workloads.

Lots of gratitude
Boaz

> }
>
> /**
>

2016-10-27 04:47:03

by Ross Zwisler

Subject: Re: [PATCH v2 2/3] use a nocache copy for bvecs and kvecs in copy_from_iter_nocache()

On Wed, Oct 26, 2016 at 10:50:20AM -0500, Brian Boylston wrote:
> Update copy_from_iter_nocache() to use memcpy_nocache()
> for bvecs and kvecs.
>
> Cc: Ross Zwisler <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Dan Williams <[email protected]>
> Signed-off-by: Brian Boylston <[email protected]>
> Reviewed-by: Toshi Kani <[email protected]>
> Reported-by: Oliver Moreno <[email protected]>
> ---
> lib/iov_iter.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 7e3138c..71e4531 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -342,6 +342,13 @@ static void memcpy_from_page(char *to, struct page *page, size_t offset, size_t
> kunmap_atomic(from);
> }
>
> +static void memcpy_from_page_nocache(char *to, struct page *page, size_t offset, size_t len)
> +{
> + char *from = kmap_atomic(page);
> + memcpy_nocache(to, from + offset, len);
> + kunmap_atomic(from);
> +}
> +
> static void memcpy_to_page(struct page *page, size_t offset, const char *from, size_t len)
> {
> char *to = kmap_atomic(page);
> @@ -392,9 +399,10 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
> iterate_and_advance(i, bytes, v,
> __copy_from_user_nocache((to += v.iov_len) - v.iov_len,
> v.iov_base, v.iov_len),
> - memcpy_from_page((to += v.bv_len) - v.bv_len, v.bv_page,
> - v.bv_offset, v.bv_len),
> - memcpy((to += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
> + memcpy_from_page_nocache((to += v.bv_len) - v.bv_len,
> + v.bv_page, v.bv_offset, v.bv_len),
> + memcpy_nocache((to += v.iov_len) - v.iov_len,
> + v.iov_base, v.iov_len)
> )
>
> return bytes;
> --
> 2.8.3

I generally agree with Boaz's comments to your patch 1 - this feels like yet
another layer where we have indirection based on the architecture.

We already have an arch switch at memcpy_to_pmem() based on whether the
architecture supports the PMEM API. And we already have
__copy_from_user_nocache(), which based on the architecture can map to either
a non-cached memcpy (x86_32, x86_64), or it can just map to a normal memcpy
via __copy_from_user_inatomic() (this happens in include/linux/uaccess.h, which
I believe is used for all non-x86 architectures).

memcpy_nocache() now does the same thing as __copy_from_user_nocache(), where
we define an uncached memcpy for x86_32 and x86_64, and a normal memcpy
variant for other architectures. But, weirdly, the x86 version maps to
__copy_from_user_nocache(), while on non-x86 we don't map to
__copy_from_user_nocache() and use its fallback; instead, we provide a new
fallback of our own via a direct call to memcpy()?

And, now in copy_from_iter_nocache() on x86 we call __copy_from_user_nocache()
two different ways: directly, and indirectly through memcpy_nocache() and
memcpy_from_page_nocache()=>memcpy_nocache().

Is there any way to simplify all of this?

All in all I'm on board with doing non-temporal copies in all cases, and I
like the idea behind memcpy_from_page_nocache(). I just think there must be a
way to make it simpler.
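The arch-switch-plus-fallback pattern Ross questions can be modeled in a few lines of plain C. This is a toy sketch with stand-in names, not kernel source; it only shows the shape of the pattern that both __copy_from_user_nocache() and the proposed memcpy_nocache() implement:

```c
#include <stddef.h>
#include <string.h>

/* Toy model of the layering: the arch either overrides the nocache
 * copy with a non-temporal routine, or falls back to plain memcpy().
 * All names here are invented stand-ins for the kernel machinery. */

#define ARCH_HAS_NOCACHE 1	/* pretend we are x86 */

static int nt_path_used;	/* records which path ran */

/* Stand-in for the x86 non-temporal copy routine. */
static void *arch_nt_copy(void *dst, const void *src, size_t n)
{
	nt_path_used = 1;
	return memcpy(dst, src, n);	/* real code would use movnti */
}

/* The pattern from patch 1: arch override, or a new plain-memcpy()
 * fallback, which is the duplication Ross points out. */
static void *memcpy_nocache(void *dst, const void *src, size_t n)
{
#if ARCH_HAS_NOCACHE
	return arch_nt_copy(dst, src, n);
#else
	return memcpy(dst, src, n);
#endif
}
```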

2016-10-28 01:58:33

by Boylston, Brian

Subject: RE: [PATCH v2 3/3] x86: remove unneeded flush in arch_copy_from_iter_pmem()

Boaz Harrosh wrote on 2016-10-26:
> On 10/26/2016 06:50 PM, Brian Boylston wrote:
>> copy_from_iter_nocache() now uses nocache copies for all types of iovecs
>> on x86, so the flush in arch_copy_from_iter_pmem() is no longer needed.
>>
>> Cc: Ross Zwisler <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: "H. Peter Anvin" <[email protected]>
>> Cc: <[email protected]>
>> Cc: Al Viro <[email protected]>
>> Cc: Dan Williams <[email protected]>
>> Signed-off-by: Brian Boylston <[email protected]>
>> Reviewed-by: Toshi Kani <[email protected]>
>> Reported-by: Oliver Moreno <[email protected]>
>> ---
>> arch/x86/include/asm/pmem.h | 19 +------------------
>> 1 file changed, 1 insertion(+), 18 deletions(-)
>> diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
>> index 643eba4..2fbf4ae 100644
>> --- a/arch/x86/include/asm/pmem.h
>> +++ b/arch/x86/include/asm/pmem.h
>> @@ -72,15 +72,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
>> clwb(p);
>> }
>> -/*
>> - * copy_from_iter_nocache() on x86 only uses non-temporal stores for iovec
>> - * iterators, so for other types (bvec & kvec) we must do a cache write-back.
>> - */
>> -static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
>> -{
>> - return iter_is_iovec(i) == false;
>> -}
>> -
>> /**
>> * arch_copy_from_iter_pmem - copy data from an iterator to PMEM
>> * @addr: PMEM destination address
>> @@ -92,15 +83,7 @@ static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
>> static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
>> struct iov_iter *i)
>> {
>> - size_t len;
>> -
>> - /* TODO: skip the write-back by always using non-temporal stores */
>> - len = copy_from_iter_nocache(addr, bytes, i);
>> -
>> - if (__iter_needs_pmem_wb(i))
>> - arch_wb_cache_pmem(addr, bytes);
>> -
>> - return len;
>> + return copy_from_iter_nocache(addr, bytes, i);
>
> I wish you would remove all this completely. Don't see the point for it anymore
>

At first, I was just targeting bvecs for the splice() case, so I had left the
check and the flush in there for kvecs. When doing this v2 of the patch set,
I made kvecs nocache as well and removed the check and flush, but didn't follow
through completely, as you suggest. Will revisit.

Thanks!
Brian

> Thanks a lot for doing this. I have had this patch in my trees for ages and was
> too lazy to fight it through. Yes, it is a big boost for many workloads.
>
> Lots of gratitude
> Boaz
>
>> }
>> /**

2016-10-28 02:07:25

by Boylston, Brian

Subject: RE: [PATCH v2 1/3] introduce memcpy_nocache()

Thomas Gleixner wrote on 2016-10-26:
> On Wed, 26 Oct 2016, Brian Boylston wrote:
>> --- a/arch/x86/include/asm/string_32.h
>> +++ b/arch/x86/include/asm/string_32.h
>> @@ -196,6 +196,9 @@ static inline void *__memcpy3d(void *to, const void *from, size_t len)
>
>> +#define __HAVE_ARCH_MEMCPY_NOCACHE
>> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
>
> Can we please move that prototype to linux/string.h instead of having it in
> both files and just define __HAVE_ARCH_MEMCPY_NOCACHE? And that define can
> move into x86/include/asm/string.h. There is no point in duplicating all
> that stuff.

Yes, sounds good.

>
>> --- a/arch/x86/include/asm/string_64.h
>> +++ b/arch/x86/include/asm/string_64.h
>> @@ -51,6 +51,9 @@ extern void *__memcpy(void *to, const void *from, size_t len);
>
>> +#define __HAVE_ARCH_MEMCPY_NOCACHE
>> +extern void *memcpy_nocache(void *dest, const void *src, size_t count)
>
>> +#ifdef __HAVE_ARCH_MEMCPY_NOCACHE
>
> You define __HAVE_ARCH_MEMCPY_NOCACHE for both 32 and 64 bit. What is
> this #ifdef for ?

Now that you raise the question, I'm not sure, and I see that the
x86 memcpy() definition isn't similarly wrapped. I can adjust it.

>
>> +void *memcpy_nocache(void *dest, const void *src, size_t count)
>> +{
>> + __copy_from_user_inatomic_nocache(dest, src, count);
>
> You want to replace a memcpy with this and then use copy from user. Why?
> That does not make any sense to me and even if it makes sense for whatever
> reason then this wants to be documented and the src argument needs a type
> cast to (void __user *)..

I agree that the documentation needs improvement. The goal is to memcpy
while avoiding the processor cache. Following x86's arch_memcpy_to_pmem(),
I was thinking that the user nocache implementation on x86 would work,
but I have none of the comments that arch_memcpy_to_pmem() has...

>
> Further this uses the inatomic variant. Why? Does a memcpy replacement
> suddenly require to be called in atomic context? Again, this makes no sense
> and lacks any form of comment. The kernel doc below does not say anything
> about that.

I borrowed the use of the inatomic variant from arch_memcpy_to_pmem().

Perhaps I need to look at memcpy_to_pmem() as suggested by Boaz.

>
> Neither provides the changelog any useful information which is
> complementary to what the patch actually does.
>
>> +#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
>> +/**
>> + * memcpy_nocache - Copy one area of memory to another, avoiding the
>> + * processor cache if possible
>
> I'm not sure whether kerneldoc is happy about that line break. Did you
> build the docs?
>
> Make the initial line shorter and explain the functionality in detail
> below the arguments
>
>> + * @dest: Where to copy to
>> + * @src: Where to copy from
>> + * @count: The size of the area.
>
> Nit. Can you please make this in tabular form for readability sake?

No, I did not try to build the docs. I'll revisit this and your other
doc-related feedback.

Thanks!
Brian

>
> * @dest: Where to copy to
> * @src: Where to copy from
> * @count: The size of the area.
>> + */
>> +static inline void *memcpy_nocache(void *dest, const void *src, size_t count)
>> +{
>> + return memcpy(dest, src, count);
>> +}
>> +#endif
>
> Thanks,
>
> tglx

2016-10-28 02:29:02

by Boylston, Brian

Subject: RE: [PATCH v2 1/3] introduce memcpy_nocache()

Boaz Harrosh wrote on 2016-10-26:
> On 10/26/2016 06:50 PM, Brian Boylston wrote:
>> Introduce memcpy_nocache() as a memcpy() that avoids the processor cache
>> if possible. Without arch-specific support, this defaults to just
>> memcpy(). For now, include arch-specific support for x86.
>>
>> Cc: Ross Zwisler <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: "H. Peter Anvin" <[email protected]>
>> Cc: <[email protected]>
>> Cc: Al Viro <[email protected]>
>> Cc: Dan Williams <[email protected]>
>> Signed-off-by: Brian Boylston <[email protected]>
>> Reviewed-by: Toshi Kani <[email protected]>
>> Reported-by: Oliver Moreno <[email protected]>
>> ---
>> arch/x86/include/asm/string_32.h | 3 +++
>> arch/x86/include/asm/string_64.h | 3 +++
>> arch/x86/lib/misc.c | 12 ++++++++++++
>> include/linux/string.h | 15 +++++++++++++++
>> 4 files changed, 33 insertions(+)
>> diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
>> index 3d3e835..64f80c0 100644
>> --- a/arch/x86/include/asm/string_32.h
>> +++ b/arch/x86/include/asm/string_32.h
>> @@ -196,6 +196,9 @@ static inline void *__memcpy3d(void *to, const void *from, size_t len)
>>
>> #endif
>> +#define __HAVE_ARCH_MEMCPY_NOCACHE
>> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
>> +
>> #define __HAVE_ARCH_MEMMOVE
>> void *memmove(void *dest, const void *src, size_t n);
>> diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
>> index 90dbbd9..a8fdd55 100644
>> --- a/arch/x86/include/asm/string_64.h
>> +++ b/arch/x86/include/asm/string_64.h
>> @@ -51,6 +51,9 @@ extern void *__memcpy(void *to, const void *from, size_t len);
>> #define memcpy(dst, src, len) __inline_memcpy((dst), (src), (len))
>> #endif
>> +#define __HAVE_ARCH_MEMCPY_NOCACHE
>> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
>> +
>> #define __HAVE_ARCH_MEMSET
>> void *memset(void *s, int c, size_t n);
>> void *__memset(void *s, int c, size_t n);
>> diff --git a/arch/x86/lib/misc.c b/arch/x86/lib/misc.c
>> index 76b373a..c993ab3 100644
>> --- a/arch/x86/lib/misc.c
>> +++ b/arch/x86/lib/misc.c
>> @@ -1,3 +1,6 @@
>> +#include <linux/export.h>
>> +#include <linux/uaccess.h>
>> +
>> /*
>> * Count the digits of @val including a possible sign.
>> *
>> @@ -19,3 +22,12 @@ int num_digits(int val)
>> }
>> return d;
>> }
>> +
>> +#ifdef __HAVE_ARCH_MEMCPY_NOCACHE
>> +void *memcpy_nocache(void *dest, const void *src, size_t count)
>> +{
>> + __copy_from_user_inatomic_nocache(dest, src, count);
>> + return dest;
>> +}
>> +EXPORT_SYMBOL(memcpy_nocache);
>> +#endif
>> diff --git a/include/linux/string.h b/include/linux/string.h
>> index 26b6f6a..7f40c41 100644
>> --- a/include/linux/string.h
>> +++ b/include/linux/string.h
>> @@ -102,6 +102,21 @@ extern void * memset(void *,int,__kernel_size_t);
>> #ifndef __HAVE_ARCH_MEMCPY
>> extern void * memcpy(void *,const void *,__kernel_size_t);
>> #endif
>> +
>> +#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
>> +/**
>> + * memcpy_nocache - Copy one area of memory to another, avoiding the
>> + * processor cache if possible
>> + * @dest: Where to copy to
>> + * @src: Where to copy from
>> + * @count: The size of the area.
>> + */
>> +static inline void *memcpy_nocache(void *dest, const void *src, size_t count)
>> +{
>> + return memcpy(dest, src, count);
>> +}
>
> What about memcpy_to_pmem() in linux/pmem.h? It already has all the arch switches.
>
> Feels bad to add yet another arch switch over __copy_user_nocache.
>
> Just feels like too many things that do the same thing. Sigh

I agree that this looks like a nicer path.

I had considered adjusting copy_from_iter_nocache() to use memcpy_to_pmem(),
but lib/iov_iter.c doesn't currently #include linux/pmem.h. Would it be
acceptable to add it? Also, I wasn't sure if memcpy_to_pmem() would always
mean exactly "memcpy nocache".

I had also considered adjusting copy_from_iter_pmem() (also in linux/pmem.h)
to just use memcpy_to_pmem() directly, but then it can't use the goodness
that is the iterate_and_advance() macro in iov_iter.c.

So, I took a shot with a possibly ill-fated memcpy_nocache(). Thoughts on
either of the above two? Are these even in line with what you were thinking?

Thanks!
Brian

>
> Boaz
>
>> +#endif
>> +
>> #ifndef __HAVE_ARCH_MEMMOVE
>> extern void * memmove(void *,const void *,__kernel_size_t);
>> #endif

2016-11-01 14:25:19

by Boaz Harrosh

Subject: Re: [PATCH v2 1/3] introduce memcpy_nocache()

On 10/28/2016 04:54 AM, Boylston, Brian wrote:
> Boaz Harrosh wrote on 2016-10-26:
>> On 10/26/2016 06:50 PM, Brian Boylston wrote:
>>> Introduce memcpy_nocache() as a memcpy() that avoids the processor cache
>>> if possible. Without arch-specific support, this defaults to just
>>> memcpy(). For now, include arch-specific support for x86.
>>>
>>> Cc: Ross Zwisler <[email protected]>
>>> Cc: Thomas Gleixner <[email protected]>
>>> Cc: Ingo Molnar <[email protected]>
>>> Cc: "H. Peter Anvin" <[email protected]>
>>> Cc: <[email protected]>
>>> Cc: Al Viro <[email protected]>
>>> Cc: Dan Williams <[email protected]>
>>> Signed-off-by: Brian Boylston <[email protected]>
>>> Reviewed-by: Toshi Kani <[email protected]>
>>> Reported-by: Oliver Moreno <[email protected]>
>>> ---
>>> arch/x86/include/asm/string_32.h | 3 +++
>>> arch/x86/include/asm/string_64.h | 3 +++
>>> arch/x86/lib/misc.c | 12 ++++++++++++
>>> include/linux/string.h | 15 +++++++++++++++
>>> 4 files changed, 33 insertions(+)
>>> diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
>>> index 3d3e835..64f80c0 100644
>>> --- a/arch/x86/include/asm/string_32.h
>>> +++ b/arch/x86/include/asm/string_32.h
>>> @@ -196,6 +196,9 @@ static inline void *__memcpy3d(void *to, const void *from, size_t len)
>>>
>>> #endif
>>> +#define __HAVE_ARCH_MEMCPY_NOCACHE
>>> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
>>> +
>>> #define __HAVE_ARCH_MEMMOVE
>>> void *memmove(void *dest, const void *src, size_t n);
>>> diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
>>> index 90dbbd9..a8fdd55 100644
>>> --- a/arch/x86/include/asm/string_64.h
>>> +++ b/arch/x86/include/asm/string_64.h
>>> @@ -51,6 +51,9 @@ extern void *__memcpy(void *to, const void *from, size_t len);
>>> #define memcpy(dst, src, len) __inline_memcpy((dst), (src), (len))
>>> #endif
>>> +#define __HAVE_ARCH_MEMCPY_NOCACHE
>>> +extern void *memcpy_nocache(void *dest, const void *src, size_t count);
>>> +
>>> #define __HAVE_ARCH_MEMSET
>>> void *memset(void *s, int c, size_t n);
>>> void *__memset(void *s, int c, size_t n);
>>> diff --git a/arch/x86/lib/misc.c b/arch/x86/lib/misc.c
>>> index 76b373a..c993ab3 100644
>>> --- a/arch/x86/lib/misc.c
>>> +++ b/arch/x86/lib/misc.c
>>> @@ -1,3 +1,6 @@
>>> +#include <linux/export.h>
>>> +#include <linux/uaccess.h>
>>> +
>>> /*
>>> * Count the digits of @val including a possible sign.
>>> *
>>> @@ -19,3 +22,12 @@ int num_digits(int val)
>>> }
>>> return d;
>>> }
>>> +
>>> +#ifdef __HAVE_ARCH_MEMCPY_NOCACHE
>>> +void *memcpy_nocache(void *dest, const void *src, size_t count)
>>> +{
>>> + __copy_from_user_inatomic_nocache(dest, src, count);
>>> + return dest;
>>> +}
>>> +EXPORT_SYMBOL(memcpy_nocache);
>>> +#endif
>>> diff --git a/include/linux/string.h b/include/linux/string.h
>>> index 26b6f6a..7f40c41 100644
>>> --- a/include/linux/string.h
>>> +++ b/include/linux/string.h
>>> @@ -102,6 +102,21 @@ extern void * memset(void *,int,__kernel_size_t);
>>> #ifndef __HAVE_ARCH_MEMCPY
>>> extern void * memcpy(void *,const void *,__kernel_size_t);
>>> #endif
>>> +
>>> +#ifndef __HAVE_ARCH_MEMCPY_NOCACHE
>>> +/**
>>> + * memcpy_nocache - Copy one area of memory to another, avoiding the
>>> + * processor cache if possible
>>> + * @dest: Where to copy to
>>> + * @src: Where to copy from
>>> + * @count: The size of the area.
>>> + */
>>> +static inline void *memcpy_nocache(void *dest, const void *src, size_t count)
>>> +{
>>> + return memcpy(dest, src, count);
>>> +}
>>
>> What about memcpy_to_pmem() in linux/pmem.h it already has all the arch switches.
>>
>> Feels bad to add yet just another arch switch over __copy_user_nocache
>>
>> Just feels like too many things that do the same thing. Sigh
>
> I agree that this looks like a nicer path.
>
> I had considered adjusting copy_from_iter_nocache() to use memcpy_to_pmem(),
> but lib/iov_iter.c doesn't currently #include linux/pmem.h. Would it be
> acceptable to add it? Also, I wasn't sure if memcpy_to_pmem() would always
> mean exactly "memcpy nocache".
>

I think this is the way to go. In my opinion there is no reason not to include
pmem.h into lib/iov_iter.c.

And I think memcpy_to_pmem() would always be the fastest arch way to bypass cache
so it should be safe to use this for all cases. It is so in the arches that support
this now, and I cannot imagine a theoretical arch that would differ. But let the
specific arch people holler if this steps on their toes, later when they care about
this at all.

> I had also considered adjusting copy_from_iter_pmem() (also in linux/pmem.h)
> to just use memcpy_to_pmem() directly, but then it can't use the goodness
> that is the iterate_and_advance() macro in iov_iter.c.
>

No, please keep all these local to iov_iter.c. Any future changes should be local
to here and fixed in one place.

> So, I took a shot with a possibly ill-fated memcpy_nocache(). Thoughts on
> either of the above two? Are these even in line with what you were thinking?
>

Yes, thanks again for pushing this. I think it is important. CC me on patches
and I will send you my Reviewed-by.
Boaz

> Thanks!
> Brian
>
>>
>> Boaz
>>
>>> +#endif
>>> +
>>> #ifndef __HAVE_ARCH_MEMMOVE
>>> extern void * memmove(void *,const void *,__kernel_size_t);
>>> #endif

2016-12-28 23:43:35

by Al Viro

Subject: Re: [PATCH v2 1/3] introduce memcpy_nocache()

On Tue, Nov 01, 2016 at 04:25:12PM +0200, Boaz Harrosh wrote:

> >> What about memcpy_to_pmem() in linux/pmem.h it already has all the arch switches.
> >>
> >> Feels bad to add yet just another arch switch over __copy_user_nocache
> >>
> >> Just feels like too many things that do the same thing. Sigh
> >
> > I agree that this looks like a nicer path.
> >
> > I had considered adjusting copy_from_iter_nocache() to use memcpy_to_pmem(),
> > but lib/iov_iter.c doesn't currently #include linux/pmem.h. Would it be
> > acceptable to add it? Also, I wasn't sure if memcpy_to_pmem() would always
> > mean exactly "memcpy nocache".
> >
>
> I think this is the way to go. In my opinion there is no reason not to include
> pmem.h into lib/iov_iter.c.
>
> And I think memcpy_to_pmem() would always be the fastest arch way to bypass cache
> so it should be safe to use this for all cases. It is so in the arches that support
> this now, and I cannot imagine a theoretical arch that would differ. But let the
> specific arch people holler if this steps on their toes, later when they care about
> this at all.

First of all, if it's the fastest arch way to bypass cache, why the hell
is it sitting in pmem-related areas?

More to the point, x86 implementation of that thing is tied to uaccess API
for no damn reason whatsoever. Let's add a real memcpy_nocache() and
be done with that. I mean, this
	if (WARN(rem, "%s: fault copying %p <- %p unwritten: %d\n",
			__func__, dst, src, rem))
		BUG();
is *screaming* "API misused here". And let's stay away from STAC et al. -
it's pointless for kernel-to-kernel copies.

BTW, your "it's iovec, only non-temporal stores there" logic in
arch_copy_from_iter_pmem() is simply wrong - for one thing, unaligned
copies will have parts done via normal stores, for another 32bit will
_not_ go for non-caching codepath for short copies. What semantics do
we really need there?
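The alignment behaviour Al points out can be sketched in user space. The following is only an illustration of the head/body/tail split that x86's __copy_user_nocache performs, not the actual implementation; only the 8-byte-aligned middle section would use non-temporal stores, while the unaligned head and tail (and, on some paths, short copies entirely) go through normal, cached stores. All names here are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct copy_split {
	size_t head;	/* unaligned prefix: normal stores */
	size_t body;	/* aligned middle: non-temporal stores */
	size_t tail;	/* unaligned suffix: normal stores */
};

/* Compute how a nocache copy to @dst of @len bytes would be split. */
static struct copy_split split_nocache(const void *dst, size_t len)
{
	struct copy_split s = { 0, 0, 0 };
	uintptr_t addr = (uintptr_t)dst;
	size_t head = (8 - (addr & 7)) & 7;

	if (head > len)
		head = len;	/* short copy: never reaches the NT path */
	s.head = head;
	s.body = (len - head) & ~(size_t)7;
	s.tail = len - head - s.body;
	return s;
}
```

A 4-byte copy to an odd address, for example, is all "head": none of it is written with non-temporal stores, which is exactly why a caller cannot assume the cache was bypassed for every byte.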

2016-12-29 18:23:18

by Dan Williams

Subject: Re: [PATCH v2 1/3] introduce memcpy_nocache()

On Wed, Dec 28, 2016 at 3:43 PM, Al Viro <[email protected]> wrote:
> On Tue, Nov 01, 2016 at 04:25:12PM +0200, Boaz Harrosh wrote:
>
>> >> What about memcpy_to_pmem() in linux/pmem.h it already has all the arch switches.
>> >>
>> >> Feels bad to add yet just another arch switch over __copy_user_nocache
>> >>
>> >> Just feels like too many things that do the same thing. Sigh
>> >
>> > I agree that this looks like a nicer path.
>> >
>> > I had considered adjusting copy_from_iter_nocache() to use memcpy_to_pmem(),
>> > but lib/iov_iter.c doesn't currently #include linux/pmem.h. Would it be
>> > acceptable to add it? Also, I wasn't sure if memcpy_to_pmem() would always
>> > mean exactly "memcpy nocache".
>> >
>>
>> I think this is the way to go. In my opinion there is no reason not to include
>> pmem.h into lib/iov_iter.c.
>>
>> And I think memcpy_to_pmem() would always be the fastest arch way to bypass cache
>> so it should be safe to use this for all cases. It is so in the arches that support
>> this now, and I cannot imagine a theoretical arch that would differ. But let the
>> specific arch people holler if this steps on their toes, later when they care about
>> this at all.
>
> First of all, if it's the fastest arch way to bypass cache, why the hell
> is it sitting in pmem-related areas?

Agreed, pmem has little to do with a cache-avoiding memcpy. I believe
there are embedded platforms in the field that have system-wide
batteries and arrange for CPU caches to be flushed on power loss. So a
cache-avoiding memory copy may not always be the best choice for pmem.

> More to the point, x86 implementation of that thing is tied to uaccess API
> for no damn reason whatsoever. Let's add a real memcpy_nocache() and
> be done with that. I mean, this
> 	if (WARN(rem, "%s: fault copying %p <- %p unwritten: %d\n",
> 			__func__, dst, src, rem))
> 		BUG();
> is *screaming* "API misused here". And let's stay away from STAC et al. -
> it's pointless for kernel-to-kernel copies.

Yes, that's my turd and I agree we should opt for a generic
cache-bypassing copy.

> BTW, your "it's iovec, only non-temporal stores there" logic in
> arch_copy_from_iter_pmem() is simply wrong - for one thing, unaligned
> copies will have parts done via normal stores, for another 32bit will
> _not_ go for non-caching codepath for short copies. What semantics do
> we really need there?

For typical pmem platforms we need to make sure all the writes are on
the way to memory such than a later sfence can guarantee that all
previous writes are visible to the platform "ADR" logic. ADR handles
flushing memory controller write buffers to media. At a minimum
arch_copy_from_iter_pmem() needs to trigger a clwb (unordered cache
line writeback) of each touched cache line if it is not using a cache
bypassing store.
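A minimal sketch of the per-cacheline writeback Dan describes: walk every cache line overlapping the buffer, clwb each one, then fence. clwb() and sfence() are stubbed out here so the line-walking arithmetic can run anywhere; on x86 they would be the CLWB instruction and SFENCE. Names are illustrative only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_SIZE 64

static unsigned long clwb_count;	/* instrumentation for the sketch */

/* Stand-ins for the real instructions. */
static void clwb(const void *line) { (void)line; clwb_count++; }
static void sfence(void) { }

/* Write back every cache line overlapping [addr, addr + len). */
static void wb_cache_range(const void *addr, size_t len)
{
	uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE_SIZE - 1);
	uintptr_t end = (uintptr_t)addr + len;

	for (; p < end; p += CACHELINE_SIZE)
		clwb((const void *)p);
	sfence();	/* order the writebacks before any later store */
}
```

Note that the start address is rounded down: an unaligned buffer still gets its first (partially covered) cache line written back.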

2016-12-30 03:53:09

by Al Viro

Subject: Re: [PATCH v2 1/3] introduce memcpy_nocache()

On Thu, Dec 29, 2016 at 10:23:15AM -0800, Dan Williams wrote:

> > BTW, your "it's iovec, only non-temporal stores there" logic in
> > arch_copy_from_iter_pmem() is simply wrong - for one thing, unaligned
> > copies will have parts done via normal stores, for another 32bit will
> > _not_ go for non-caching codepath for short copies. What semantics do
> > we really need there?
>
> For typical pmem platforms we need to make sure all the writes are on
the way to memory such that a later sfence can guarantee that all
> previous writes are visible to the platform "ADR" logic. ADR handles
> flushing memory controller write buffers to media. At a minimum
> arch_copy_from_iter_pmem() needs to trigger a clwb (unordered cache
> line writeback) of each touched cache line if it is not using a cache
> bypassing store.

Um... Then we do have a problem - nocache variant of uaccess primitives
does *not* guarantee that clwb is redundant.

What about the requirements of e.g. tcp_sendmsg() with its use of
skb_add_data_nocache()? What warranties do we need there?

2016-12-30 04:56:15

by Dan Williams

Subject: Re: [PATCH v2 1/3] introduce memcpy_nocache()

On Thu, Dec 29, 2016 at 7:52 PM, Al Viro <[email protected]> wrote:
> On Thu, Dec 29, 2016 at 10:23:15AM -0800, Dan Williams wrote:
>
>> > BTW, your "it's iovec, only non-temporal stores there" logic in
>> > arch_copy_from_iter_pmem() is simply wrong - for one thing, unaligned
>> > copies will have parts done via normal stores, for another 32bit will
>> > _not_ go for non-caching codepath for short copies. What semantics do
>> > we really need there?
>>
>> For typical pmem platforms we need to make sure all the writes are on
>> the way to memory such that a later sfence can guarantee that all
>> previous writes are visible to the platform "ADR" logic. ADR handles
>> flushing memory controller write buffers to media. At a minimum
>> arch_copy_from_iter_pmem() needs to trigger a clwb (unordered cache
>> line writeback) of each touched cache line if it is not using a cache
>> bypassing store.
>
> Um... Then we do have a problem - nocache variant of uaccess primitives
> does *not* guarantee that clwb is redundant.
>
> What about the requirements of e.g. tcp_sendmsg() with its use of
> skb_add_data_nocache()? What warranties do we need there?

Yes, we need to distinguish the existing "nocache" that tries to avoid
unnecessary cache pollution from this new "must write through" semantic
for writing to persistent memory. I suspect usages of
skb_add_data_nocache() are ok since they are in the transmit path.
Receiving directly into a buffer that is expected to be persisted
immediately is where we would need to be careful, but that is already
backstopped by dirty cacheline tracking. So as far as I can see, we
should only need a new memcpy_writethrough() (?) for the pmem
direct-i/o path at present.
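A rough sketch of what such a memcpy_writethrough() could look like: an ordinary copy followed by a per-cacheline writeback and a fence. The name is only under discussion here, not an existing kernel API, and the arch primitives are stubbed so the shape is portable:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static unsigned long flushed_bytes;	/* instrumentation for the sketch */

/* Stand-ins for the arch primitives (a CLWB loop plus SFENCE on x86). */
static void arch_wb_cache(const void *addr, size_t len)
{
	(void)addr;
	flushed_bytes += len;
}

static void arch_sfence(void) { }

/*
 * Writethrough copy: the data must be past the CPU caches, on its way
 * to media, by the time the fence completes.  This is a stronger
 * guarantee than "nocache", which only tries to avoid cache pollution.
 */
static void *memcpy_writethrough(void *dest, const void *src, size_t count)
{
	memcpy(dest, src, count);	/* plain, possibly cached copy ... */
	arch_wb_cache(dest, count);	/* ... so write back every touched line */
	arch_sfence();			/* order writebacks before later stores */
	return dest;
}
```

On an arch whose nocache copy already guarantees write-through for every byte, arch_wb_cache() could compile away; the point of the separate name is that callers get the persistence ordering either way.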

2016-12-31 02:26:19

by Al Viro

Subject: [RFC] memcpy_nocache() and memcpy_writethrough()

On Thu, Dec 29, 2016 at 08:56:13PM -0800, Dan Williams wrote:

> > Um... Then we do have a problem - nocache variant of uaccess primitives
> > does *not* guarantee that clwb is redundant.
> >
> > What about the requirements of e.g. tcp_sendmsg() with its use of
> > skb_add_data_nocache()? What warranties do we need there?
>
> Yes, we need to distinguish the existing "nocache" that tries to avoid
> unnecessary cache pollution and this new "must write through" semantic
> for writing to persistent memory. I suspect usages of
> skb_add_data_nocache() are ok since they are in the transmit path.
> Receiving directly into a buffer that is expected to be persisted
> immediately is where we would need to be careful, but that is already
> backstopped by dirty cacheline tracking. So as far as I can see, we
> should only need a new memcpy_writethrough() (?) for the pmem
> direct-i/o path at present.

OK... Right now we have several places playing with nocache:
* dax_iomap_actor(). Writethrough warranties needed, nocache
side serves to reduce the cache impact *and* avoid the need for clwb
for writethrough.
* several memcpy_to_pmem() users - acpi_nfit_blk_single_io(),
nsio_rw_bytes(), write_pmem(). No clwb attempted; is it needed there?
* hfi1_copy_sge(). Cache pollution avoidance? The source is
in the kernel; looks like a memcpy_nocache() candidate.
* ntb_memcpy_tx(). Really fishy one - it's from kernel to iomem,
with nocache userland->kernel copying primitive abused on x86. As soon
as e.g. powerpc or sparc grows ARCH_HAS_NOCACHE_UACCESS, we are in trouble
there. What is it actually trying to achieve? memcpy_toio() with
cache pollution avoidance?
* networking copy_from_iter_full_nocache() users - cache pollution
avoidance, AFAICS; no writethrough warranties sought.

Why does pmem need writethrough warranties, anyway? All explanations I've
found on the net had been along the lines of "we should not store a pointer
to a pmem data structure until the structure itself has been committed to
pmem itself" and it looks like something that ought to be a job for barriers
- after all, we don't want the pointer store to be observed by _anything_
in the system until the earlier stores are visible, so what makes pmem
different from e.g. another CPU or a PCI busmaster, or...

I'm trying to figure out what would be the right API here; sure, we can
add separate memcpy_writethrough()/__copy_from_user_inatomic_writethrough()/
copy_from_iter_writethrough(), but I would like to understand what's going
on first.