2017-06-09 20:30:16

by Dan Williams

Subject: [PATCH v3 00/14] pmem: stop abusing __copy_user_nocache(), and other reworks

Changes since v2 [1]:
1/ Address the concerns from "[NAK] copy_from_iter_ops()" [2]. The
copy_from_iter_ops approach is replaced with a new set _flushcache
memcpy and user-copy helpers (Al)

2/ Use _flushcache as the suffix for the new cache managing copy helpers
rather than _writethrough (Ingo and Toshi)

3/ Keep asm/pmem.h instead of moving the helpers to
drivers/nvdimm/$arch.c (another side effect of Al's feedback)

[1]: https://lkml.org/lkml/2017/4/21/823
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html

---
A few months back, in the course of reviewing the memcpy_nocache()
proposal from Brian, Linus proposed that the pmem specific
memcpy_to_pmem() routine be moved to be implemented at the driver level
[3]:

"Quite frankly, the whole 'memcpy_nocache()' idea or (ab-)using
copy_user_nocache() just needs to die. It's idiotic.

As you point out, it's also fundamentally buggy crap.

Throw it away. There is no possible way this is ever valid or
portable. We're not going to lie and claim that it is.

If some driver ends up using 'movnt' by hand, that is up to that
*driver*. But no way in hell should we care about this one whit in
the sense of <linux/uaccess.h>."

This feedback also dovetails with another fs/dax.c design wart of being
hard coded to assume the backing device is pmem. We call the pmem
specific copy, clear, and flush routines even if the backing device
driver is one of the other three dax drivers (axonram, dcssblk, or brd).
There is no reason to spend cpu cycles flushing the cache after writing
to brd, for example, since it is using volatile memory for storage.

Moreover, the pmem driver might be fronting a volatile memory range
published by the ACPI NFIT, or the platform might have arranged to flush
cpu caches on power fail. This latter capability is a feature that has
appeared in embedded storage appliances (pre-ACPI-NFIT nvdimm
platforms).

Now, the comment about completely avoiding uaccess.h is augmented by
Al's recent assertion:

"And for !@#!@# sake, comments like this
+ * On x86_64 __copy_from_user_nocache() uses non-temporal stores
+ * for the bulk of the transfer, but we need to manually flush
+ * if the transfer is unaligned. A cached memory copy is used
+ * when destination or size is not naturally aligned. That is:
+ * - Require 8-byte alignment when size is 8 bytes or larger.
+ * - Require 4-byte alignment when size is 4 bytes.
mean only one thing: this should live in arch/x86/lib/usercopy_64.c,
right next to the actual function that does copying. NOT in
drivers/nvdimm/x86.c. At the very least it needs a comment in usercopy_64.c
with dire warnings along the lines of "don't touch that code without
looking into <filename>:pmem_from_user().."

So, this series proceeds to keep all the usercopy code centralized. The
change set:

1/ Moves what was previously named "the pmem api" out of the global
namespace and into the libnvdimm sub-system that needs to be
concerned with architecture specific persistent memory considerations.

2/ Arranges for dax to stop abusing __copy_user_nocache() and implements
formal _flushcache helpers that use 'movnt' on x86_64.

3/ Makes filesystem-dax cache maintenance optional by arranging for dax
to call driver specific copy and flush operations only if the driver
publishes them.

4/ Allows filesystem-dax cache management to be controlled by the block
device write-cache queue flag. The pmem driver is updated to clear
that flag by default when pmem is driving volatile memory. In the future
this same path may be used to detect platforms that have a
cpu-cache-flush-on-fail capability. That said, an administrator has the
option to force this behavior by writing to the $bdev/queue/write_cache
attribute in sysfs.
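For reference, the administrator override mentioned above goes through the
standard block-layer queue sysfs attribute. A sketch of the interaction
(pmem0 is an example device name; the accepted values follow the queue
sysfs convention):

```shell
# Inspect the current write cache mode of a pmem block device
cat /sys/block/pmem0/queue/write_cache

# Force dax cache flushing on ("write back") or off ("write through")
echo "write back" > /sys/block/pmem0/queue/write_cache
echo "write through" > /sys/block/pmem0/queue/write_cache
```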

[3]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html

This series is based on v4.12-rc4 and passes the current ndctl
regression suite.

---

Dan Williams (14):
x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations
dm: add ->copy_from_iter() dax operation support
filesystem-dax: convert to dax_copy_from_iter()
dax, pmem: introduce an optional 'flush' dax_operation
dm: add ->flush() dax operation support
filesystem-dax: convert to dax_flush()
x86, dax: replace clear_pmem() with open coded memset + dax_ops->flush
x86, dax, libnvdimm: move wb_cache_pmem() to libnvdimm
x86, libnvdimm, pmem: move arch_invalidate_pmem() to libnvdimm
pmem: remove global pmem api
libnvdimm, pmem: fix persistence warning
libnvdimm, nfit: enable support for volatile ranges
filesystem-dax: gate calls to dax_flush() on QUEUE_FLAG_WC
libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region


MAINTAINERS | 1
arch/x86/Kconfig | 1
arch/x86/include/asm/pmem.h | 81 ---------------------
arch/x86/include/asm/string_64.h | 5 +
arch/x86/include/asm/uaccess_64.h | 12 +++
arch/x86/lib/usercopy_64.c | 129 ++++++++++++++++++++++++++++++++++
drivers/acpi/nfit/core.c | 15 +++-
drivers/dax/super.c | 24 ++++++
drivers/md/dm-linear.c | 30 ++++++++
drivers/md/dm-stripe.c | 40 ++++++++++
drivers/md/dm.c | 45 ++++++++++++
drivers/nvdimm/bus.c | 8 +-
drivers/nvdimm/claim.c | 6 +-
drivers/nvdimm/core.c | 2 -
drivers/nvdimm/dax_devs.c | 2 -
drivers/nvdimm/dimm_devs.c | 10 ++-
drivers/nvdimm/namespace_devs.c | 14 +---
drivers/nvdimm/nd-core.h | 9 ++
drivers/nvdimm/pfn_devs.c | 4 +
drivers/nvdimm/pmem.c | 32 +++++++-
drivers/nvdimm/pmem.h | 13 +++
drivers/nvdimm/region_devs.c | 43 +++++++----
fs/dax.c | 11 ++-
include/linux/dax.h | 9 ++
include/linux/device-mapper.h | 6 ++
include/linux/libnvdimm.h | 2 +
include/linux/pmem.h | 142 -------------------------------------
include/linux/string.h | 6 ++
include/linux/uio.h | 15 ++++
lib/Kconfig | 3 +
lib/iov_iter.c | 22 ++++++
31 files changed, 464 insertions(+), 278 deletions(-)
delete mode 100644 include/linux/pmem.h


2017-06-09 20:30:27

by Dan Williams

Subject: [PATCH v3 01/14] x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations

The pmem driver needs to transfer data to a persistent memory destination
and rely on the destination writes not being cached. It is sufficient for
the writes to be flushed to a cpu-store-buffer
(non-temporal / "movnt" in x86 terms), as we expect userspace to call fsync()
to ensure data-writes have reached a power-fail-safe zone in the platform. The
fsync() triggers a REQ_FUA or REQ_FLUSH to the pmem driver which will turn
around and fence previous writes with an "sfence".

Implement a __copy_from_user_inatomic_flushcache, memcpy_page_flushcache, and
memcpy_flushcache, that guarantee that the destination buffer is not dirty in
the cpu cache on completion. The new copy_from_iter_flushcache and sub-routines
will be used to replace the "pmem api" (include/linux/pmem.h +
arch/x86/include/asm/pmem.h). The availability of copy_from_iter_flushcache()
and memcpy_flushcache() are gated by the CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
config symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
otherwise.
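The alignment rules that force the manual-flush fixup can be modeled by a
small predicate (illustrative only; the real check lives in
__copy_user_flushcache() in arch/x86/lib/usercopy_64.c): a copy avoids
manual cache flushing only when it is serviced entirely by naturally
aligned non-temporal stores.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative model of the x86_64 __copy_user_nocache() alignment rules:
 * the bulk copy uses non-temporal stores only for naturally aligned
 * accesses, so any unaligned head or sub-8-byte tail goes through the
 * cache and must be flushed by hand afterwards.
 */
static bool needs_manual_flush(uintptr_t dest, size_t size)
{
	if (size < 8)
		/* only an aligned 4-byte store bypasses the cache */
		return !(size == 4 && (dest & 3) == 0);
	/* 8+ bytes: head must be 8-aligned, tail a multiple of 8 */
	return (dest & 7) != 0 || (size & 7) != 0;
}
```

With this model, an aligned page-sized copy needs no fixup, while a
12-byte copy does, because its trailing 4 bytes are written with cached
stores.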

This is meant to satisfy the concern from Linus that if a driver wants to do
something beyond the normal nocache semantics it should be something private to
that driver [1], and Al's concern that anything uaccess related belongs with
the rest of the uaccess code [2].

The first consumer of this interface is a new 'copy_from_iter' dax operation so
that pmem can inject cache maintenance operations without imposing this
overhead on other dax-capable drivers.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html

Cc: <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/Kconfig | 1
arch/x86/include/asm/string_64.h | 5 +
arch/x86/include/asm/uaccess_64.h | 11 +++
arch/x86/lib/usercopy_64.c | 128 +++++++++++++++++++++++++++++++++++++
drivers/acpi/nfit/core.c | 3 -
drivers/nvdimm/claim.c | 2 -
drivers/nvdimm/pmem.c | 13 +++-
drivers/nvdimm/region_devs.c | 4 +
include/linux/dax.h | 3 +
include/linux/string.h | 6 ++
include/linux/uio.h | 15 ++++
lib/Kconfig | 3 +
lib/iov_iter.c | 22 ++++++
13 files changed, 209 insertions(+), 7 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4ccfacc7232a..bb273b2f50b5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -54,6 +54,7 @@ config X86
select ARCH_HAS_KCOV if X86_64
select ARCH_HAS_MMIO_FLUSH
select ARCH_HAS_PMEM_API if X86_64
+ select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 733bae07fb29..1f22bc277c45 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -109,6 +109,11 @@ memcpy_mcsafe(void *dst, const void *src, size_t cnt)
return 0;
}

+#ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
+#define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
+void memcpy_flushcache(void *dst, const void *src, size_t cnt);
+#endif
+
#endif /* __KERNEL__ */

#endif /* _ASM_X86_STRING_64_H */
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index c5504b9a472e..b16f6a1d8b26 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -171,6 +171,10 @@ unsigned long raw_copy_in_user(void __user *dst, const void __user *src, unsigne
extern long __copy_user_nocache(void *dst, const void __user *src,
unsigned size, int zerorest);

+extern long __copy_user_flushcache(void *dst, const void __user *src, unsigned size);
+extern void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
+ size_t len);
+
static inline int
__copy_from_user_inatomic_nocache(void *dst, const void __user *src,
unsigned size)
@@ -179,6 +183,13 @@ __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
return __copy_user_nocache(dst, src, size, 0);
}

+static inline int
+__copy_from_user_flushcache(void *dst, const void __user *src, unsigned size)
+{
+ kasan_check_write(dst, size);
+ return __copy_user_flushcache(dst, src, size);
+}
+
unsigned long
copy_user_handle_tail(char *to, char *from, unsigned len);

diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 3b7c40a2e3e1..f42d2fd86ca3 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -7,6 +7,7 @@
*/
#include <linux/export.h>
#include <linux/uaccess.h>
+#include <linux/highmem.h>

/*
* Zero Userspace
@@ -73,3 +74,130 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
clac();
return len;
}
+
+#ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
+/**
+ * clean_cache_range - write back a cache range with CLWB
+ * @vaddr: virtual start address
+ * @size: number of bytes to write back
+ *
+ * Write back a cache range using the CLWB (cache line write back)
+ * instruction. Note that @size is internally rounded up to be cache
+ * line size aligned.
+ */
+static void clean_cache_range(void *addr, size_t size)
+{
+ u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
+ unsigned long clflush_mask = x86_clflush_size - 1;
+ void *vend = addr + size;
+ void *p;
+
+ for (p = (void *)((unsigned long)addr & ~clflush_mask);
+ p < vend; p += x86_clflush_size)
+ clwb(p);
+}
+
+long __copy_user_flushcache(void *dst, const void __user *src, unsigned size)
+{
+ unsigned long flushed, dest = (unsigned long) dst;
+ long rc = __copy_user_nocache(dst, src, size, 0);
+
+ /*
+ * __copy_user_nocache() uses non-temporal stores for the bulk
+ * of the transfer, but we need to manually flush if the
+ * transfer is unaligned. A cached memory copy is used when
+ * destination or size is not naturally aligned. That is:
+ * - Require 8-byte alignment when size is 8 bytes or larger.
+ * - Require 4-byte alignment when size is 4 bytes.
+ */
+ if (size < 8) {
+ if (!IS_ALIGNED(dest, 4) || size != 4)
+ clean_cache_range(dst, 1);
+ } else {
+ if (!IS_ALIGNED(dest, 8)) {
+ dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+ clean_cache_range(dst, 1);
+ }
+
+ flushed = dest - (unsigned long) dst;
+ if (size > flushed && !IS_ALIGNED(size - flushed, 8))
+ clean_cache_range(dst + size - 1, 1);
+ }
+
+ return rc;
+}
+
+void memcpy_flushcache(void *_dst, const void *_src, size_t size)
+{
+ unsigned long dest = (unsigned long) _dst;
+ unsigned long source = (unsigned long) _src;
+
+ /* cache copy and flush to align dest */
+ if (!IS_ALIGNED(dest, 8)) {
+ unsigned len = min_t(unsigned, size, ALIGN(dest, 8) - dest);
+
+ memcpy((void *) dest, (void *) source, len);
+ clean_cache_range((void *) dest, len);
+ dest += len;
+ source += len;
+ size -= len;
+ if (!size)
+ return;
+ }
+
+ /* 4x8 movnti loop */
+ while (size >= 32) {
+ asm("movq (%0), %%r8\n"
+ "movq 8(%0), %%r9\n"
+ "movq 16(%0), %%r10\n"
+ "movq 24(%0), %%r11\n"
+ "movnti %%r8, (%1)\n"
+ "movnti %%r9, 8(%1)\n"
+ "movnti %%r10, 16(%1)\n"
+ "movnti %%r11, 24(%1)\n"
+ :: "r" (source), "r" (dest)
+ : "memory", "r8", "r9", "r10", "r11");
+ dest += 32;
+ source += 32;
+ size -= 32;
+ }
+
+ /* 1x8 movnti loop */
+ while (size >= 8) {
+ asm("movq (%0), %%r8\n"
+ "movnti %%r8, (%1)\n"
+ :: "r" (source), "r" (dest)
+ : "memory", "r8");
+ dest += 8;
+ source += 8;
+ size -= 8;
+ }
+
+ /* 1x4 movnti loop */
+ while (size >= 4) {
+ asm("movl (%0), %%r8d\n"
+ "movnti %%r8d, (%1)\n"
+ :: "r" (source), "r" (dest)
+ : "memory", "r8");
+ dest += 4;
+ source += 4;
+ size -= 4;
+ }
+
+ /* cache copy for remaining bytes */
+ if (size) {
+ memcpy((void *) dest, (void *) source, size);
+ clean_cache_range((void *) dest, size);
+ }
+}
+EXPORT_SYMBOL_GPL(memcpy_flushcache);
+
+void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
+ size_t len)
+{
+ char *from = kmap_atomic(page);
+
+ memcpy_flushcache(to, from + offset, len);
+ kunmap_atomic(from);
+}
+#endif
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 656acb5d7166..cbd5596e7562 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1842,8 +1842,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
}

if (rw)
- memcpy_to_pmem(mmio->addr.aperture + offset,
- iobuf + copied, c);
+ memcpy_flushcache(mmio->addr.aperture + offset, iobuf + copied, c);
else {
if (nfit_blk->dimm_flags & NFIT_BLK_READ_FLUSH)
mmio_flush_range((void __force *)
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 7ceb5fa4f2a1..b8b9c8ca7862 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -277,7 +277,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
rc = -EIO;
}

- memcpy_to_pmem(nsio->addr + offset, buf, size);
+ memcpy_flushcache(nsio->addr + offset, buf, size);
nvdimm_flush(to_nd_region(ndns->dev.parent));

return rc;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index c544d466ea51..2f3aefe565c6 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -29,6 +29,7 @@
#include <linux/pfn_t.h>
#include <linux/slab.h>
#include <linux/pmem.h>
+#include <linux/uio.h>
#include <linux/dax.h>
#include <linux/nd.h>
#include "pmem.h"
@@ -80,7 +81,7 @@ static void write_pmem(void *pmem_addr, struct page *page,
{
void *mem = kmap_atomic(page);

- memcpy_to_pmem(pmem_addr, mem + off, len);
+ memcpy_flushcache(pmem_addr, mem + off, len);
kunmap_atomic(mem);
}

@@ -235,8 +236,15 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
}

+static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i)
+{
+ return copy_from_iter_flushcache(addr, bytes, i);
+}
+
static const struct dax_operations pmem_dax_ops = {
.direct_access = pmem_dax_direct_access,
+ .copy_from_iter = pmem_copy_from_iter,
};

static void pmem_release_queue(void *q)
@@ -294,7 +302,8 @@ static int pmem_attach_disk(struct device *dev,
dev_set_drvdata(dev, pmem);
pmem->phys_addr = res->start;
pmem->size = resource_size(res);
- if (nvdimm_has_flush(nd_region) < 0)
+ if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE)
+ || nvdimm_has_flush(nd_region) < 0)
dev_warn(dev, "unable to guarantee persistence of writes\n");

if (!devm_request_mem_region(dev, res->start, resource_size(res),
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b550edf2571f..985b0e11bd73 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1015,8 +1015,8 @@ void nvdimm_flush(struct nd_region *nd_region)
* The first wmb() is needed to 'sfence' all previous writes
* such that they are architecturally visible for the platform
* buffer flush. Note that we've already arranged for pmem
- * writes to avoid the cache via arch_memcpy_to_pmem(). The
- * final wmb() ensures ordering for the NVDIMM flush write.
+ * writes to avoid the cache via memcpy_flushcache(). The final
+ * wmb() ensures ordering for the NVDIMM flush write.
*/
wmb();
for (i = 0; i < nd_region->ndr_mappings; i++)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 5ec1f6c47716..bbe79ed90e2b 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -16,6 +16,9 @@ struct dax_operations {
*/
long (*direct_access)(struct dax_device *, pgoff_t, long,
void **, pfn_t *);
+ /* copy_from_iter: dax-driver override for default copy_from_iter */
+ size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
+ struct iov_iter *);
};

#if IS_ENABLED(CONFIG_DAX)
diff --git a/include/linux/string.h b/include/linux/string.h
index 537918f8a98e..7439d83eaa33 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -122,6 +122,12 @@ static inline __must_check int memcpy_mcsafe(void *dst, const void *src,
return 0;
}
#endif
+#ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE
+static inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
+{
+ memcpy(dst, src, cnt);
+}
+#endif
void *memchr_inv(const void *s, int c, size_t n);
char *strreplace(char *s, char old, char new);

diff --git a/include/linux/uio.h b/include/linux/uio.h
index f2d36a3d3005..55cd54a0e941 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -95,6 +95,21 @@ size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
bool copy_from_iter_full(void *addr, size_t bytes, struct iov_iter *i);
size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i);
+#ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
+/*
+ * Note, users like pmem that depend on the stricter semantics of
+ * copy_from_iter_flushcache() than copy_from_iter_nocache() must check for
+ * IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE) before assuming that the
+ * destination is flushed from the cache on return.
+ */
+size_t copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter *i);
+#else
+static inline size_t copy_from_iter_flushcache(void *addr, size_t bytes,
+ struct iov_iter *i)
+{
+ return copy_from_iter_nocache(addr, bytes, i);
+}
+#endif
bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i);
size_t iov_iter_zero(size_t bytes, struct iov_iter *);
unsigned long iov_iter_alignment(const struct iov_iter *i);
diff --git a/lib/Kconfig b/lib/Kconfig
index 0c8b78a9ae2e..2d1c4b3a085c 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -548,6 +548,9 @@ config ARCH_HAS_SG_CHAIN
config ARCH_HAS_PMEM_API
bool

+config ARCH_HAS_UACCESS_FLUSHCACHE
+ bool
+
config ARCH_HAS_MMIO_FLUSH
bool

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f835964c9485..c9a69064462f 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -615,6 +615,28 @@ size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
}
EXPORT_SYMBOL(copy_from_iter_nocache);

+#ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
+size_t copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter *i)
+{
+ char *to = addr;
+ if (unlikely(i->type & ITER_PIPE)) {
+ WARN_ON(1);
+ return 0;
+ }
+ iterate_and_advance(i, bytes, v,
+ __copy_from_user_flushcache((to += v.iov_len) - v.iov_len,
+ v.iov_base, v.iov_len),
+ memcpy_page_flushcache((to += v.bv_len) - v.bv_len, v.bv_page,
+ v.bv_offset, v.bv_len),
+ memcpy_flushcache((to += v.iov_len) - v.iov_len, v.iov_base,
+ v.iov_len)
+ )
+
+ return bytes;
+}
+EXPORT_SYMBOL_GPL(copy_from_iter_flushcache);
+#endif
+
bool copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i)
{
char *to = addr;

2017-06-09 20:30:35

by Dan Williams

Subject: [PATCH v3 03/14] filesystem-dax: convert to dax_copy_from_iter()

Now that all possible providers of the dax_operations copy_from_iter
method are implemented, switch filesystem-dax to call the driver rather
than copy_from_iter_pmem.

Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/include/asm/pmem.h | 50 -------------------------------------------
fs/dax.c | 3 ++-
include/linux/pmem.h | 24 ---------------------
3 files changed, 2 insertions(+), 75 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 0ff8fe71b255..60e8edbe0205 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -66,56 +66,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
}

/**
- * arch_copy_from_iter_pmem - copy data from an iterator to PMEM
- * @addr: PMEM destination address
- * @bytes: number of bytes to copy
- * @i: iterator with source data
- *
- * Copy data from the iterator 'i' to the PMEM buffer starting at 'addr'.
- */
-static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
- struct iov_iter *i)
-{
- size_t len;
-
- /* TODO: skip the write-back by always using non-temporal stores */
- len = copy_from_iter_nocache(addr, bytes, i);
-
- /*
- * In the iovec case on x86_64 copy_from_iter_nocache() uses
- * non-temporal stores for the bulk of the transfer, but we need
- * to manually flush if the transfer is unaligned. A cached
- * memory copy is used when destination or size is not naturally
- * aligned. That is:
- * - Require 8-byte alignment when size is 8 bytes or larger.
- * - Require 4-byte alignment when size is 4 bytes.
- *
- * In the non-iovec case the entire destination needs to be
- * flushed.
- */
- if (iter_is_iovec(i)) {
- unsigned long flushed, dest = (unsigned long) addr;
-
- if (bytes < 8) {
- if (!IS_ALIGNED(dest, 4) || (bytes != 4))
- arch_wb_cache_pmem(addr, bytes);
- } else {
- if (!IS_ALIGNED(dest, 8)) {
- dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
- arch_wb_cache_pmem(addr, 1);
- }
-
- flushed = dest - (unsigned long) addr;
- if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
- arch_wb_cache_pmem(addr + bytes - 1, 1);
- }
- } else
- arch_wb_cache_pmem(addr, bytes);
-
- return len;
-}
-
-/**
* arch_clear_pmem - zero a PMEM memory range
* @addr: virtual start address
* @size: number of bytes to zero
diff --git a/fs/dax.c b/fs/dax.c
index 2a6889b3585f..b459948de427 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1054,7 +1054,8 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
map_len = end - pos;

if (iov_iter_rw(iter) == WRITE)
- map_len = copy_from_iter_pmem(kaddr, map_len, iter);
+ map_len = dax_copy_from_iter(dax_dev, pgoff, kaddr,
+ map_len, iter);
else
map_len = copy_to_iter(kaddr, map_len, iter);
if (map_len <= 0) {
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 71ecf3d46aac..9d542a5600e4 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,13 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
BUG();
}

-static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
- struct iov_iter *i)
-{
- BUG();
- return 0;
-}
-
static inline void arch_clear_pmem(void *addr, size_t size)
{
BUG();
@@ -80,23 +73,6 @@ static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
}

/**
- * copy_from_iter_pmem - copy data from an iterator to PMEM
- * @addr: PMEM destination address
- * @bytes: number of bytes to copy
- * @i: iterator with source data
- *
- * Copy data from the iterator 'i' to the PMEM buffer starting at 'addr'.
- * See blkdev_issue_flush() note for memcpy_to_pmem().
- */
-static inline size_t copy_from_iter_pmem(void *addr, size_t bytes,
- struct iov_iter *i)
-{
- if (arch_has_pmem_api())
- return arch_copy_from_iter_pmem(addr, bytes, i);
- return copy_from_iter_nocache(addr, bytes, i);
-}
-
-/**
* clear_pmem - zero a PMEM memory range
* @addr: virtual start address
* @size: number of bytes to zero

2017-06-09 20:30:43

by Dan Williams

Subject: [PATCH v3 02/14] dm: add ->copy_from_iter() dax operation support

Allow device-mapper to route copy_from_iter operations to the
per-target implementation. In order for the device stacking to work we
need a dax_dev and a pgoff relative to that device. This gives each
layer of the stack the information it needs to look up the operation
pointer for the next level.

This conceptually allows for an array of mixed device drivers with
varying copy_from_iter implementations.
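The pgoff translation that makes this stacking work can be sketched with
simplified arithmetic (a hypothetical model: the real code uses
linear_map_sector() and bdev_dax_pgoff(), and additionally accounts for
the partition offset of the backing device):

```c
#define MODEL_PAGE_SIZE		4096ULL
#define MODEL_PAGE_SECTORS	(MODEL_PAGE_SIZE / 512)	/* 8 sectors/page */

typedef unsigned long long model_sector_t;

struct linear_model {
	model_sector_t target_begin;	/* ti->begin: target start in dm device */
	model_sector_t dev_start;	/* lc->start: offset into backing device */
};

/* linear_map_sector(): shift a dm-relative sector onto the backing device */
static model_sector_t map_sector(const struct linear_model *lc,
				 model_sector_t s)
{
	return lc->dev_start + (s - lc->target_begin);
}

/* pgoff -> sector -> device sector -> device pgoff, as in the dax stack */
static unsigned long long remap_pgoff(const struct linear_model *lc,
				      unsigned long long pgoff)
{
	model_sector_t dev_sector = map_sector(lc, pgoff * MODEL_PAGE_SECTORS);

	return dev_sector / MODEL_PAGE_SECTORS; /* bdev_dax_pgoff(), simplified */
}
```

Each layer repeats this remap, then invokes the next device's
copy_from_iter with a pgoff that is relative to that device.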

Cc: Toshi Kani <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/super.c | 13 +++++++++++++
drivers/md/dm-linear.c | 15 +++++++++++++++
drivers/md/dm-stripe.c | 20 ++++++++++++++++++++
drivers/md/dm.c | 26 ++++++++++++++++++++++++++
include/linux/dax.h | 2 ++
include/linux/device-mapper.h | 3 +++
6 files changed, 79 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 6ed32aac8bbe..dd299e55f65d 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -18,6 +18,7 @@
#include <linux/cdev.h>
#include <linux/hash.h>
#include <linux/slab.h>
+#include <linux/uio.h>
#include <linux/dax.h>
#include <linux/fs.h>

@@ -172,6 +173,18 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
}
EXPORT_SYMBOL_GPL(dax_direct_access);

+size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+ size_t bytes, struct iov_iter *i)
+{
+ if (!dax_alive(dax_dev))
+ return 0;
+
+ if (!dax_dev->ops->copy_from_iter)
+ return copy_from_iter(addr, bytes, i);
+ return dax_dev->ops->copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+}
+EXPORT_SYMBOL_GPL(dax_copy_from_iter);
+
bool dax_alive(struct dax_device *dax_dev)
{
lockdep_assert_held(&dax_srcu);
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 7d42a9d9f406..0841ec1bfbad 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -159,6 +159,20 @@ static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
}

+static size_t linear_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i)
+{
+ struct linear_c *lc = ti->private;
+ struct block_device *bdev = lc->dev->bdev;
+ struct dax_device *dax_dev = lc->dev->dax_dev;
+ sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+
+ dev_sector = linear_map_sector(ti, sector);
+ if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(bytes, PAGE_SIZE), &pgoff))
+ return 0;
+ return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+}
+
static struct target_type linear_target = {
.name = "linear",
.version = {1, 3, 0},
@@ -171,6 +185,7 @@ static struct target_type linear_target = {
.prepare_ioctl = linear_prepare_ioctl,
.iterate_devices = linear_iterate_devices,
.direct_access = linear_dax_direct_access,
+ .dax_copy_from_iter = linear_dax_copy_from_iter,
};

int __init dm_linear_init(void)
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 75152482f3ad..1ef914f9ca72 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -332,6 +332,25 @@ static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff,
return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn);
}

+static size_t stripe_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i)
+{
+ sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+ struct stripe_c *sc = ti->private;
+ struct dax_device *dax_dev;
+ struct block_device *bdev;
+ uint32_t stripe;
+
+ stripe_map_sector(sc, sector, &stripe, &dev_sector);
+ dev_sector += sc->stripe[stripe].physical_start;
+ dax_dev = sc->stripe[stripe].dev->dax_dev;
+ bdev = sc->stripe[stripe].dev->bdev;
+
+ if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(bytes, PAGE_SIZE), &pgoff))
+ return 0;
+ return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
+}
+
/*
* Stripe status:
*
@@ -451,6 +470,7 @@ static struct target_type stripe_target = {
.iterate_devices = stripe_iterate_devices,
.io_hints = stripe_io_hints,
.direct_access = stripe_dax_direct_access,
+ .dax_copy_from_iter = stripe_dax_copy_from_iter,
};

int __init dm_stripe_init(void)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 37ccd73c79ec..7faaceb52819 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -19,6 +19,7 @@
#include <linux/dax.h>
#include <linux/slab.h>
#include <linux/idr.h>
+#include <linux/uio.h>
#include <linux/hdreg.h>
#include <linux/delay.h>
#include <linux/wait.h>
@@ -969,6 +970,30 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
return ret;
}

+static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i)
+{
+ struct mapped_device *md = dax_get_private(dax_dev);
+ sector_t sector = pgoff * PAGE_SECTORS;
+ struct dm_target *ti;
+ long ret = 0;
+ int srcu_idx;
+
+ ti = dm_dax_get_live_target(md, sector, &srcu_idx);
+
+ if (!ti)
+ goto out;
+ if (!ti->type->dax_copy_from_iter) {
+ ret = copy_from_iter(addr, bytes, i);
+ goto out;
+ }
+ ret = ti->type->dax_copy_from_iter(ti, pgoff, addr, bytes, i);
+ out:
+ dm_put_live_table(md, srcu_idx);
+
+ return ret;
+}
+
/*
* A target may call dm_accept_partial_bio only from the map routine. It is
* allowed for all bio types except REQ_PREFLUSH.
@@ -2859,6 +2884,7 @@ static const struct block_device_operations dm_blk_dops = {

static const struct dax_operations dm_dax_ops = {
.direct_access = dm_dax_direct_access,
+ .copy_from_iter = dm_dax_copy_from_iter,
};

/*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index bbe79ed90e2b..28e398f8c59e 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -78,6 +78,8 @@ void kill_dax(struct dax_device *dax_dev);
void *dax_get_private(struct dax_device *dax_dev);
long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
void **kaddr, pfn_t *pfn);
+size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+ size_t bytes, struct iov_iter *i);

/*
* We use lowest available bit in exceptional entry for locking, one bit for
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index f4c639c0c362..11c8a0a92f9c 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -132,6 +132,8 @@ typedef int (*dm_busy_fn) (struct dm_target *ti);
*/
typedef long (*dm_dax_direct_access_fn) (struct dm_target *ti, pgoff_t pgoff,
long nr_pages, void **kaddr, pfn_t *pfn);
+typedef size_t (*dm_dax_copy_from_iter_fn)(struct dm_target *ti, pgoff_t pgoff,
+ void *addr, size_t bytes, struct iov_iter *i);
#define PAGE_SECTORS (PAGE_SIZE / 512)

void dm_error(const char *message);
@@ -181,6 +183,7 @@ struct target_type {
dm_iterate_devices_fn iterate_devices;
dm_io_hints_fn io_hints;
dm_dax_direct_access_fn direct_access;
+ dm_dax_copy_from_iter_fn dax_copy_from_iter;

/* For internal device-mapper use. */
struct list_head list;

2017-06-09 20:30:49

by Dan Williams

Subject: [PATCH v3 06/14] filesystem-dax: convert to dax_flush()

Filesystem-DAX flushes caches whenever it writes to the address returned
through dax_direct_access() and when writing back dirty radix entries.
That flushing is only required in the pmem case, so the dax_flush()
helper skips cache management work when the underlying driver does not
specify a flush method.

We still do all the dirty tracking since the radix entry will already be
there for locking purposes. However, the work to clean the entry will be
a nop for some dax drivers.

Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/dax.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index b459948de427..0933fc460ada 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -784,7 +784,7 @@ static int dax_writeback_one(struct block_device *bdev,
}

dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
- wb_cache_pmem(kaddr, size);
+ dax_flush(dax_dev, pgoff, kaddr, size);
/*
* After we have flushed the cache, we can clear the dirty tag. There
* cannot be new dirty data in the pfn after the flush has completed as

2017-06-09 20:31:05

by Dan Williams

Subject: [PATCH v3 07/14] x86, dax: replace clear_pmem() with open coded memset + dax_ops->flush

The clear_pmem() helper is simply a memset() plus a cache flush. Now
that the flush routine is optionally provided by the dax device driver,
we can avoid unnecessary cache management on dax devices fronting
volatile memory.

With clear_pmem() gone we can follow on with a patch to make pmem cache
management completely defined within the pmem driver.

Cc: <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/include/asm/pmem.h | 13 -------------
fs/dax.c | 3 ++-
include/linux/pmem.h | 21 ---------------------
3 files changed, 2 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 60e8edbe0205..f4c119d253f3 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -65,19 +65,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
clwb(p);
}

-/**
- * arch_clear_pmem - zero a PMEM memory range
- * @addr: virtual start address
- * @size: number of bytes to zero
- *
- * Write zeros into the memory range starting at 'addr' for 'size' bytes.
- */
-static inline void arch_clear_pmem(void *addr, size_t size)
-{
- memset(addr, 0, size);
- arch_wb_cache_pmem(addr, size);
-}
-
static inline void arch_invalidate_pmem(void *addr, size_t size)
{
clflush_cache_range(addr, size);
diff --git a/fs/dax.c b/fs/dax.c
index 0933fc460ada..554b8e7d921c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -975,7 +975,8 @@ int __dax_zero_page_range(struct block_device *bdev,
dax_read_unlock(id);
return rc;
}
- clear_pmem(kaddr + offset, size);
+ memset(kaddr + offset, 0, size);
+ dax_flush(dax_dev, pgoff, kaddr + offset, size);
dax_read_unlock(id);
}
return 0;
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 9d542a5600e4..772bd02a5b52 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,11 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
BUG();
}

-static inline void arch_clear_pmem(void *addr, size_t size)
-{
- BUG();
-}
-
static inline void arch_wb_cache_pmem(void *addr, size_t size)
{
BUG();
@@ -73,22 +68,6 @@ static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
}

/**
- * clear_pmem - zero a PMEM memory range
- * @addr: virtual start address
- * @size: number of bytes to zero
- *
- * Write zeros into the memory range starting at 'addr' for 'size' bytes.
- * See blkdev_issue_flush() note for memcpy_to_pmem().
- */
-static inline void clear_pmem(void *addr, size_t size)
-{
- if (arch_has_pmem_api())
- arch_clear_pmem(addr, size);
- else
- memset(addr, 0, size);
-}
-
-/**
* invalidate_pmem - flush a pmem range from the cache hierarchy
* @addr: virtual start address
* @size: bytes to invalidate (internally aligned to cache line size)

2017-06-09 20:31:12

by Dan Williams

Subject: [PATCH v3 04/14] dax, pmem: introduce an optional 'flush' dax_operation

Filesystem-DAX flushes caches whenever it writes to the address returned
through dax_direct_access() and when writing back dirty radix entries.
That flushing is only required in the pmem case, so add a dax operation
to allow pmem to take this extra action, but skip it for other dax
capable devices that do not provide a flush routine.

An example of this differentiation is a volatile ram disk, where there
is no expectation of persistence. In fact, the pmem driver itself might
front such an address range advertised by the NFIT, so this "no flush"
property might be something passed down by the bus / libnvdimm.
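
For illustration, here is a userspace sketch of the optional-op dispatch
this patch sets up. The structures and names below mirror the kernel code
but are heavily simplified, and flush_calls is instrumentation invented
for this model only:

```c
#include <stddef.h>

/* trimmed-down model of struct dax_operations / struct dax_device */
struct dax_operations {
	void (*flush)(void *addr, size_t size);
};

struct dax_device {
	const struct dax_operations *ops;
};

static int flush_calls; /* instrumentation for this model only */

/* stand-in for wb_cache_pmem(): would write back dirty cache lines */
static void pmem_flush(void *addr, size_t size)
{
	(void)addr;
	(void)size;
	flush_calls++;
}

/* the generic helper is a nop when the driver supplies no flush method */
static void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
{
	if (dax_dev->ops->flush)
		dax_dev->ops->flush(addr, size);
}

/* pmem provides a flush routine; a volatile ram disk like brd does not */
static const struct dax_operations pmem_dax_ops = { .flush = pmem_flush };
static const struct dax_operations brd_dax_ops = { .flush = NULL };
```

The point of the NULL-op convention is that fs/dax.c can call dax_flush()
unconditionally and let the driver's advertised capabilities decide
whether any cache management actually happens.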

Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/pmem.c | 7 +++++++
include/linux/dax.h | 2 ++
2 files changed, 9 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 2f3aefe565c6..823b07774244 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -242,9 +242,16 @@ static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
return copy_from_iter_flushcache(addr, bytes, i);
}

+static void pmem_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff,
+ void *addr, size_t size)
+{
+ wb_cache_pmem(addr, size);
+}
+
static const struct dax_operations pmem_dax_ops = {
.direct_access = pmem_dax_direct_access,
.copy_from_iter = pmem_copy_from_iter,
+ .flush = pmem_dax_flush,
};

static void pmem_release_queue(void *q)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 28e398f8c59e..407dd3ff6e54 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -19,6 +19,8 @@ struct dax_operations {
/* copy_from_iter: dax-driver override for default copy_from_iter */
size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
struct iov_iter *);
+ /* flush: optional driver-specific cache management after writes */
+ void (*flush)(struct dax_device *, pgoff_t, void *, size_t);
};

#if IS_ENABLED(CONFIG_DAX)

2017-06-09 20:31:21

by Dan Williams

Subject: [PATCH v3 11/14] libnvdimm, pmem: fix persistence warning

The pmem driver assumes that if platform firmware describes the memory
devices associated with a persistent memory range, and
CONFIG_ARCH_HAS_PMEM_API=y, then it has all the mechanisms necessary to
flush data to a power-fail safe zone. We warn if the firmware does not
describe memory devices, but we also need to warn if the architecture
does not claim pmem support.

Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/region_devs.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 3c06a6ea6958..41b4cdf5dea8 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1037,8 +1037,9 @@ int nvdimm_has_flush(struct nd_region *nd_region)
{
int i;

- /* no nvdimm == flushing capability unknown */
- if (nd_region->ndr_mappings == 0)
+ /* no nvdimm or pmem api == flushing capability unknown */
+ if (nd_region->ndr_mappings == 0
+ || !IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API))
return -ENXIO;

for (i = 0; i < nd_region->ndr_mappings; i++) {

2017-06-09 20:31:37

by Dan Williams

Subject: [PATCH v3 14/14] libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region

The pmem driver attaches to both persistent and volatile memory ranges
advertised by the ACPI NFIT. When the region is volatile it is redundant
to spend cycles flushing caches at fsync(). Check if the hosting region
is volatile and do not set QUEUE_FLAG_WC if it is.
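
As a sketch of the flag selection described above: the helper name
pmem_write_cache_flags() and its argument conventions are invented for
this userspace model; in the driver the results feed
blk_queue_write_cache().

```c
#include <stdbool.h>

struct wc_flags { bool wbc; bool fua; };

/*
 * has_flush mirrors nvdimm_has_flush(): negative means the flush
 * capability is unknown. has_cache mirrors nvdimm_has_cache(): nonzero
 * only for true pmem regions with cpu caches to manage.
 */
static struct wc_flags pmem_write_cache_flags(bool arch_has_flushcache,
		int has_flush, int has_cache)
{
	struct wc_flags f;
	int fua = has_flush;

	if (!arch_has_flushcache || fua < 0)
		fua = 0; /* "unable to guarantee persistence of writes" */

	f.fua = fua != 0;
	f.wbc = has_cache != 0; /* volatile region: skip QUEUE_FLAG_WC */
	return f;
}
```

With this split, a pmem device fronting a volatile NFIT range advertises
neither a writeback cache nor FUA support, so the flush machinery is
skipped entirely.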

Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/pmem.c | 13 ++++++++-----
drivers/nvdimm/region_devs.c | 6 ++++++
include/linux/libnvdimm.h | 1 +
3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 06f6c27ec1e9..5cac9fb39db8 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -279,10 +279,10 @@ static int pmem_attach_disk(struct device *dev,
struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
struct nd_region *nd_region = to_nd_region(dev->parent);
struct vmem_altmap __altmap, *altmap = NULL;
+ int nid = dev_to_node(dev), fua, wbc;
struct resource *res = &nsio->res;
struct nd_pfn *nd_pfn = NULL;
struct dax_device *dax_dev;
- int nid = dev_to_node(dev);
struct nd_pfn_sb *pfn_sb;
struct pmem_device *pmem;
struct resource pfn_res;
@@ -308,9 +308,12 @@ static int pmem_attach_disk(struct device *dev,
dev_set_drvdata(dev, pmem);
pmem->phys_addr = res->start;
pmem->size = resource_size(res);
- if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE)
- || nvdimm_has_flush(nd_region) < 0)
- dev_warn(dev, "unable to guarantee persistence of writes\n");
+ fua = nvdimm_has_flush(nd_region);
+ if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE) || fua < 0) {
+ dev_warn(dev, "unable to guarantee persistence of writes\n");
+ fua = 0;
+ }
+ wbc = nvdimm_has_cache(nd_region);

if (!devm_request_mem_region(dev, res->start, resource_size(res),
dev_name(&ndns->dev))) {
@@ -354,7 +357,7 @@ static int pmem_attach_disk(struct device *dev,
return PTR_ERR(addr);
pmem->virt_addr = addr;

- blk_queue_write_cache(q, true, true);
+ blk_queue_write_cache(q, wbc, fua);
blk_queue_make_request(q, pmem_make_request);
blk_queue_physical_block_size(q, PAGE_SIZE);
blk_queue_max_hw_sectors(q, UINT_MAX);
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 53a64a16aba4..0c3b089b280a 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1060,6 +1060,12 @@ int nvdimm_has_flush(struct nd_region *nd_region)
}
EXPORT_SYMBOL_GPL(nvdimm_has_flush);

+int nvdimm_has_cache(struct nd_region *nd_region)
+{
+ return is_nd_pmem(&nd_region->dev);
+}
+EXPORT_SYMBOL_GPL(nvdimm_has_cache);
+
void __exit nd_region_devs_exit(void)
{
ida_destroy(&region_ida);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index b2f659bd661d..a8ee1d0afd70 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -165,4 +165,5 @@ void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
u64 nd_fletcher64(void *addr, size_t len, bool le);
void nvdimm_flush(struct nd_region *nd_region);
int nvdimm_has_flush(struct nd_region *nd_region);
+int nvdimm_has_cache(struct nd_region *nd_region);
#endif /* __LIBNVDIMM_H__ */

2017-06-09 20:31:26

by Dan Williams

Subject: [PATCH v3 10/14] pmem: remove global pmem api

Now that all callers of the pmem api have been converted to dax helpers that
call back to the pmem driver, we can remove include/linux/pmem.h.

Cc: <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Oliver O'Halloran <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
MAINTAINERS | 1 -
drivers/acpi/nfit/core.c | 3 +-
drivers/nvdimm/claim.c | 1 -
drivers/nvdimm/dimm_devs.c | 8 +++++
drivers/nvdimm/namespace_devs.c | 6 +---
drivers/nvdimm/pmem.c | 1 -
drivers/nvdimm/pmem.h | 5 +++
drivers/nvdimm/region_devs.c | 1 -
fs/dax.c | 1 -
include/linux/libnvdimm.h | 1 +
include/linux/pmem.h | 59 ---------------------------------------
11 files changed, 15 insertions(+), 72 deletions(-)
delete mode 100644 include/linux/pmem.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 7a28acd7f525..d77ad3194adc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7593,7 +7593,6 @@ L: [email protected]
Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
S: Supported
F: drivers/nvdimm/pmem.c
-F: include/linux/pmem.h
F: arch/*/include/asm/pmem.h

LIGHTNVM PLATFORM SUPPORT
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index cbd5596e7562..ac2436538b7e 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -20,7 +20,6 @@
#include <linux/list.h>
#include <linux/acpi.h>
#include <linux/sort.h>
-#include <linux/pmem.h>
#include <linux/io.h>
#include <linux/nd.h>
#include <asm/cacheflush.h>
@@ -1956,7 +1955,7 @@ static int acpi_nfit_blk_region_enable(struct nvdimm_bus *nvdimm_bus,
nfit_blk->bdw_offset = nfit_mem->bdw->offset;
mmio = &nfit_blk->mmio[BDW];
mmio->addr.base = devm_nvdimm_memremap(dev, nfit_mem->spa_bdw->address,
- nfit_mem->spa_bdw->length, ARCH_MEMREMAP_PMEM);
+ nfit_mem->spa_bdw->length, nd_blk_memremap_flags(ndbr));
if (!mmio->addr.base) {
dev_dbg(dev, "%s: %s failed to map bdw\n", __func__,
nvdimm_name(nvdimm));
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index d2e16c0401df..3beedf173902 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -12,7 +12,6 @@
*/
#include <linux/device.h>
#include <linux/sizes.h>
-#include <linux/pmem.h>
#include "nd-core.h"
#include "pmem.h"
#include "pfn.h"
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 9852a3355509..6a1e7a3c0c17 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -20,6 +20,7 @@
#include <linux/mm.h>
#include "nd-core.h"
#include "label.h"
+#include "pmem.h"
#include "nd.h"

static DEFINE_IDA(dimm_ida);
@@ -235,6 +236,13 @@ struct nvdimm *nd_blk_region_to_dimm(struct nd_blk_region *ndbr)
}
EXPORT_SYMBOL_GPL(nd_blk_region_to_dimm);

+unsigned long nd_blk_memremap_flags(struct nd_blk_region *ndbr)
+{
+ /* pmem mapping properties are private to libnvdimm */
+ return ARCH_MEMREMAP_PMEM;
+}
+EXPORT_SYMBOL_GPL(nd_blk_memremap_flags);
+
struct nvdimm_drvdata *to_ndd(struct nd_mapping *nd_mapping)
{
struct nvdimm *nvdimm = nd_mapping->nvdimm;
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 2f9dfbd2dbec..4e9261ef8a95 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -14,10 +14,10 @@
#include <linux/device.h>
#include <linux/sort.h>
#include <linux/slab.h>
-#include <linux/pmem.h>
#include <linux/list.h>
#include <linux/nd.h>
#include "nd-core.h"
+#include "pmem.h"
#include "nd.h"

static void namespace_io_release(struct device *dev)
@@ -155,11 +155,7 @@ bool pmem_should_map_pages(struct device *dev)
IORES_DESC_NONE) == REGION_MIXED)
return false;

-#ifdef ARCH_MEMREMAP_PMEM
return ARCH_MEMREMAP_PMEM == MEMREMAP_WB;
-#else
- return false;
-#endif
}
EXPORT_SYMBOL(pmem_should_map_pages);

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 68737bc68a07..06f6c27ec1e9 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -28,7 +28,6 @@
#include <linux/blk-mq.h>
#include <linux/pfn_t.h>
#include <linux/slab.h>
-#include <linux/pmem.h>
#include <linux/uio.h>
#include <linux/dax.h>
#include <linux/nd.h>
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index d579c7095a45..2b02e00b44eb 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -6,7 +6,10 @@
#include <linux/fs.h>
#include <asm/pmem.h>

-#ifndef CONFIG_ARCH_HAS_PMEM_API
+#ifdef CONFIG_ARCH_HAS_PMEM_API
+#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
+#else
+#define ARCH_MEMREMAP_PMEM MEMREMAP_WT
static inline void arch_wb_cache_pmem(void *addr, size_t size)
{
}
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 985b0e11bd73..3c06a6ea6958 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -15,7 +15,6 @@
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/hash.h>
-#include <linux/pmem.h>
#include <linux/sort.h>
#include <linux/io.h>
#include <linux/nd.h>
diff --git a/fs/dax.c b/fs/dax.c
index 554b8e7d921c..6d8699feae2e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -25,7 +25,6 @@
#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/pagevec.h>
-#include <linux/pmem.h>
#include <linux/sched.h>
#include <linux/sched/signal.h>
#include <linux/uio.h>
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 6c807017128d..b2f659bd661d 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -159,6 +159,7 @@ void *nd_region_provider_data(struct nd_region *nd_region);
void *nd_blk_region_provider_data(struct nd_blk_region *ndbr);
void nd_blk_region_set_provider_data(struct nd_blk_region *ndbr, void *data);
struct nvdimm *nd_blk_region_to_dimm(struct nd_blk_region *ndbr);
+unsigned long nd_blk_memremap_flags(struct nd_blk_region *ndbr);
unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
u64 nd_fletcher64(void *addr, size_t len, bool le);
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
deleted file mode 100644
index 559c00848583..000000000000
--- a/include/linux/pmem.h
+++ /dev/null
@@ -1,59 +0,0 @@
-/*
- * Copyright(c) 2015 Intel Corporation. All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of version 2 of the GNU General Public License as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- * General Public License for more details.
- */
-#ifndef __PMEM_H__
-#define __PMEM_H__
-
-#include <linux/io.h>
-#include <linux/uio.h>
-
-#ifdef CONFIG_ARCH_HAS_PMEM_API
-#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
-#include <asm/pmem.h>
-#else
-#define ARCH_MEMREMAP_PMEM MEMREMAP_WT
-/*
- * These are simply here to enable compilation, all call sites gate
- * calling these symbols with arch_has_pmem_api() and redirect to the
- * implementation in asm/pmem.h.
- */
-static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
-{
- BUG();
-}
-#endif
-
-static inline bool arch_has_pmem_api(void)
-{
- return IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API);
-}
-
-/**
- * memcpy_to_pmem - copy data to persistent memory
- * @dst: destination buffer for the copy
- * @src: source buffer for the copy
- * @n: length of the copy in bytes
- *
- * Perform a memory copy that results in the destination of the copy
- * being effectively evicted from, or never written to, the processor
- * cache hierarchy after the copy completes. After memcpy_to_pmem()
- * data may still reside in cpu or platform buffers, so this operation
- * must be followed by a blkdev_issue_flush() on the pmem block device.
- */
-static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
-{
- if (arch_has_pmem_api())
- arch_memcpy_to_pmem(dst, src, n);
- else
- memcpy(dst, src, n);
-}
-#endif /* __PMEM_H__ */

2017-06-09 20:31:41

by Dan Williams

Subject: [PATCH v3 12/14] libnvdimm, nfit: enable support for volatile ranges

Allow volatile nfit ranges to participate in all the same infrastructure
provided for persistent memory regions. A resulting namespace
device will still be called "pmem", but the parent region type will be
"nd_volatile". This is in preparation for disabling the dax ->flush()
operation in the pmem driver when it is hosted on a volatile range.

Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/acpi/nfit/core.c | 9 ++++++++-
drivers/nvdimm/bus.c | 8 ++++----
drivers/nvdimm/core.c | 2 +-
drivers/nvdimm/dax_devs.c | 2 +-
drivers/nvdimm/dimm_devs.c | 2 +-
drivers/nvdimm/namespace_devs.c | 8 ++++----
drivers/nvdimm/nd-core.h | 9 +++++++++
drivers/nvdimm/pfn_devs.c | 4 ++--
drivers/nvdimm/region_devs.c | 27 ++++++++++++++-------------
9 files changed, 44 insertions(+), 27 deletions(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index ac2436538b7e..60d1ca149cc1 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2227,6 +2227,13 @@ static bool nfit_spa_is_virtual(struct acpi_nfit_system_address *spa)
nfit_spa_type(spa) == NFIT_SPA_PCD);
}

+static bool nfit_spa_is_volatile(struct acpi_nfit_system_address *spa)
+{
+ return (nfit_spa_type(spa) == NFIT_SPA_VDISK ||
+ nfit_spa_type(spa) == NFIT_SPA_VCD ||
+ nfit_spa_type(spa) == NFIT_SPA_VOLATILE);
+}
+
static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
struct nfit_spa *nfit_spa)
{
@@ -2301,7 +2308,7 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
ndr_desc);
if (!nfit_spa->nd_region)
rc = -ENOMEM;
- } else if (nfit_spa_type(spa) == NFIT_SPA_VOLATILE) {
+ } else if (nfit_spa_is_volatile(spa)) {
nfit_spa->nd_region = nvdimm_volatile_region_create(nvdimm_bus,
ndr_desc);
if (!nfit_spa->nd_region)
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index e9361bffe5ee..4cfba534814b 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -38,13 +38,13 @@ static int to_nd_device_type(struct device *dev)
{
if (is_nvdimm(dev))
return ND_DEVICE_DIMM;
- else if (is_nd_pmem(dev))
+ else if (is_memory(dev))
return ND_DEVICE_REGION_PMEM;
else if (is_nd_blk(dev))
return ND_DEVICE_REGION_BLK;
else if (is_nd_dax(dev))
return ND_DEVICE_DAX_PMEM;
- else if (is_nd_pmem(dev->parent) || is_nd_blk(dev->parent))
+ else if (is_nd_region(dev->parent))
return nd_region_to_nstype(to_nd_region(dev->parent));

return 0;
@@ -56,7 +56,7 @@ static int nvdimm_bus_uevent(struct device *dev, struct kobj_uevent_env *env)
* Ensure that region devices always have their numa node set as
* early as possible.
*/
- if (is_nd_pmem(dev) || is_nd_blk(dev))
+ if (is_nd_region(dev))
set_dev_node(dev, to_nd_region(dev)->numa_node);
return add_uevent_var(env, "MODALIAS=" ND_DEVICE_MODALIAS_FMT,
to_nd_device_type(dev));
@@ -65,7 +65,7 @@ static int nvdimm_bus_uevent(struct device *dev, struct kobj_uevent_env *env)
static struct module *to_bus_provider(struct device *dev)
{
/* pin bus providers while regions are enabled */
- if (is_nd_pmem(dev) || is_nd_blk(dev)) {
+ if (is_nd_region(dev)) {
struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);

return nvdimm_bus->nd_desc->module;
diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
index 2dee908e4bae..22e3ef463401 100644
--- a/drivers/nvdimm/core.c
+++ b/drivers/nvdimm/core.c
@@ -504,7 +504,7 @@ void nvdimm_badblocks_populate(struct nd_region *nd_region,
struct nvdimm_bus *nvdimm_bus;
struct list_head *poison_list;

- if (!is_nd_pmem(&nd_region->dev)) {
+ if (!is_memory(&nd_region->dev)) {
dev_WARN_ONCE(&nd_region->dev, 1,
"%s only valid for pmem regions\n", __func__);
return;
diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
index c1b6556aea6e..a304983ac417 100644
--- a/drivers/nvdimm/dax_devs.c
+++ b/drivers/nvdimm/dax_devs.c
@@ -89,7 +89,7 @@ struct device *nd_dax_create(struct nd_region *nd_region)
struct device *dev = NULL;
struct nd_dax *nd_dax;

- if (!is_nd_pmem(&nd_region->dev))
+ if (!is_memory(&nd_region->dev))
return NULL;

nd_dax = nd_dax_alloc(nd_region);
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 6a1e7a3c0c17..f0d1b7e5de01 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -419,7 +419,7 @@ int alias_dpa_busy(struct device *dev, void *data)
struct resource *res;
int i;

- if (!is_nd_pmem(dev))
+ if (!is_memory(dev))
return 0;

nd_region = to_nd_region(dev);
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 4e9261ef8a95..57724da484d0 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -112,7 +112,7 @@ static int is_uuid_busy(struct device *dev, void *data)

static int is_namespace_uuid_busy(struct device *dev, void *data)
{
- if (is_nd_pmem(dev) || is_nd_blk(dev))
+ if (is_nd_region(dev))
return device_for_each_child(dev, data, is_uuid_busy);
return 0;
}
@@ -783,7 +783,7 @@ static int __reserve_free_pmem(struct device *dev, void *data)
struct nd_label_id label_id;
int i;

- if (!is_nd_pmem(dev))
+ if (!is_memory(dev))
return 0;

nd_region = to_nd_region(dev);
@@ -1872,7 +1872,7 @@ static struct device *nd_namespace_pmem_create(struct nd_region *nd_region)
struct resource *res;
struct device *dev;

- if (!is_nd_pmem(&nd_region->dev))
+ if (!is_memory(&nd_region->dev))
return NULL;

nspm = kzalloc(sizeof(*nspm), GFP_KERNEL);
@@ -2152,7 +2152,7 @@ static struct device **scan_labels(struct nd_region *nd_region)
}
dev->parent = &nd_region->dev;
devs[count++] = dev;
- } else if (is_nd_pmem(&nd_region->dev)) {
+ } else if (is_memory(&nd_region->dev)) {
/* clean unselected labels */
for (i = 0; i < nd_region->ndr_mappings; i++) {
struct list_head *l, *e;
diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h
index 4c4bd209e725..86bc19ae30da 100644
--- a/drivers/nvdimm/nd-core.h
+++ b/drivers/nvdimm/nd-core.h
@@ -64,7 +64,16 @@ struct blk_alloc_info {

bool is_nvdimm(struct device *dev);
bool is_nd_pmem(struct device *dev);
+bool is_nd_volatile(struct device *dev);
bool is_nd_blk(struct device *dev);
+static inline bool is_nd_region(struct device *dev)
+{
+ return is_nd_pmem(dev) || is_nd_blk(dev) || is_nd_volatile(dev);
+}
+static inline bool is_memory(struct device *dev)
+{
+ return is_nd_pmem(dev) || is_nd_volatile(dev);
+}
struct nvdimm_bus *walk_to_nvdimm_bus(struct device *nd_dev);
int __init nvdimm_bus_init(void);
void nvdimm_bus_exit(void);
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index a6c403600d19..5929eb65cee3 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -331,7 +331,7 @@ struct device *nd_pfn_create(struct nd_region *nd_region)
struct nd_pfn *nd_pfn;
struct device *dev;

- if (!is_nd_pmem(&nd_region->dev))
+ if (!is_memory(&nd_region->dev))
return NULL;

nd_pfn = nd_pfn_alloc(nd_region);
@@ -354,7 +354,7 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char *sig)
if (!pfn_sb || !ndns)
return -ENODEV;

- if (!is_nd_pmem(nd_pfn->dev.parent))
+ if (!is_memory(nd_pfn->dev.parent))
return -ENODEV;

if (nvdimm_read_bytes(ndns, SZ_4K, pfn_sb, sizeof(*pfn_sb), 0))
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 41b4cdf5dea8..53a64a16aba4 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -168,6 +168,11 @@ bool is_nd_blk(struct device *dev)
return dev ? dev->type == &nd_blk_device_type : false;
}

+bool is_nd_volatile(struct device *dev)
+{
+ return dev ? dev->type == &nd_volatile_device_type : false;
+}
+
struct nd_region *to_nd_region(struct device *dev)
{
struct nd_region *nd_region = container_of(dev, struct nd_region, dev);
@@ -214,7 +219,7 @@ EXPORT_SYMBOL_GPL(nd_blk_region_set_provider_data);
*/
int nd_region_to_nstype(struct nd_region *nd_region)
{
- if (is_nd_pmem(&nd_region->dev)) {
+ if (is_memory(&nd_region->dev)) {
u16 i, alias;

for (i = 0, alias = 0; i < nd_region->ndr_mappings; i++) {
@@ -242,7 +247,7 @@ static ssize_t size_show(struct device *dev,
struct nd_region *nd_region = to_nd_region(dev);
unsigned long long size = 0;

- if (is_nd_pmem(dev)) {
+ if (is_memory(dev)) {
size = nd_region->ndr_size;
} else if (nd_region->ndr_mappings == 1) {
struct nd_mapping *nd_mapping = &nd_region->mapping[0];
@@ -307,7 +312,7 @@ static ssize_t set_cookie_show(struct device *dev,
struct nd_region *nd_region = to_nd_region(dev);
struct nd_interleave_set *nd_set = nd_region->nd_set;

- if (is_nd_pmem(dev) && nd_set)
+ if (is_memory(dev) && nd_set)
/* pass, should be precluded by region_visible */;
else
return -ENXIO;
@@ -334,7 +339,7 @@ resource_size_t nd_region_available_dpa(struct nd_region *nd_region)
if (!ndd)
return 0;

- if (is_nd_pmem(&nd_region->dev)) {
+ if (is_memory(&nd_region->dev)) {
available += nd_pmem_available_dpa(nd_region,
nd_mapping, &overlap);
if (overlap > blk_max_overlap) {
@@ -520,10 +525,10 @@ static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
struct nd_interleave_set *nd_set = nd_region->nd_set;
int type = nd_region_to_nstype(nd_region);

- if (!is_nd_pmem(dev) && a == &dev_attr_pfn_seed.attr)
+ if (!is_memory(dev) && a == &dev_attr_pfn_seed.attr)
return 0;

- if (!is_nd_pmem(dev) && a == &dev_attr_dax_seed.attr)
+ if (!is_memory(dev) && a == &dev_attr_dax_seed.attr)
return 0;

if (!is_nd_pmem(dev) && a == &dev_attr_badblocks.attr)
@@ -551,7 +556,7 @@ static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
|| type == ND_DEVICE_NAMESPACE_BLK)
&& a == &dev_attr_available_size.attr)
return a->mode;
- else if (is_nd_pmem(dev) && nd_set)
+ else if (is_memory(dev) && nd_set)
return a->mode;

return 0;
@@ -603,7 +608,7 @@ static void nd_region_notify_driver_action(struct nvdimm_bus *nvdimm_bus,
{
struct nd_region *nd_region;

- if (!probe && (is_nd_pmem(dev) || is_nd_blk(dev))) {
+ if (!probe && is_nd_region(dev)) {
int i;

nd_region = to_nd_region(dev);
@@ -621,12 +626,8 @@ static void nd_region_notify_driver_action(struct nvdimm_bus *nvdimm_bus,
if (ndd)
atomic_dec(&nvdimm->busy);
}
-
- if (is_nd_pmem(dev))
- return;
}
- if (dev->parent && (is_nd_blk(dev->parent) || is_nd_pmem(dev->parent))
- && probe) {
+ if (dev->parent && is_nd_region(dev->parent) && probe) {
nd_region = to_nd_region(dev->parent);
nvdimm_bus_lock(dev);
if (nd_region->ns_seed == dev)

2017-06-09 20:31:58

by Dan Williams

Subject: [PATCH v3 13/14] filesystem-dax: gate calls to dax_flush() on QUEUE_FLAG_WC

Some platforms arrange for cpu caches to be flushed on power-fail. On
those platforms there is no requirement that the kernel track and flush
potentially dirty cache lines. Given that we still insert entries into
the radix for locking purposes, this patch only disables the cache flush
loop, not the dirty tracking.

Userspace can override the default cache setting via the block device
queue "write_cache" attribute in sysfs.
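
A possible sketch of that override (the device name pmem0 is an
assumption for the example; the attribute accepts the strings
"write back" and "write through"):

```shell
# Inspect the current cache mode of a pmem block device
cat /sys/block/pmem0/queue/write_cache

# Force writeback-cache behavior (dax_flush() calls enabled)
echo 'write back' > /sys/block/pmem0/queue/write_cache

# Treat the device as write-through (skip the cache flush loop)
echo 'write through' > /sys/block/pmem0/queue/write_cache
```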

Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
fs/dax.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 6d8699feae2e..c3140343ff7e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -783,7 +783,8 @@ static int dax_writeback_one(struct block_device *bdev,
}

dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
- dax_flush(dax_dev, pgoff, kaddr, size);
+ if (test_bit(QUEUE_FLAG_WC, &bdev->bd_queue->queue_flags))
+ dax_flush(dax_dev, pgoff, kaddr, size);
/*
* After we have flushed the cache, we can clear the dirty tag. There
* cannot be new dirty data in the pfn after the flush has completed as
@@ -975,7 +976,8 @@ int __dax_zero_page_range(struct block_device *bdev,
return rc;
}
memset(kaddr + offset, 0, size);
- dax_flush(dax_dev, pgoff, kaddr + offset, size);
+ if (test_bit(QUEUE_FLAG_WC, &bdev->bd_queue->queue_flags))
+ dax_flush(dax_dev, pgoff, kaddr + offset, size);
dax_read_unlock(id);
}
return 0;

2017-06-09 20:31:10

by Dan Williams

[permalink] [raw]
Subject: [PATCH v3 09/14] x86, libnvdimm, pmem: move arch_invalidate_pmem() to libnvdimm

Kill this globally defined wrapper and move it to libnvdimm so that we can
ultimately remove include/linux/pmem.h.

Cc: <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/claim.c | 3 ++-
drivers/nvdimm/pmem.c | 2 +-
drivers/nvdimm/pmem.h | 3 +++
include/linux/pmem.h | 19 -------------------
4 files changed, 6 insertions(+), 21 deletions(-)

diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index b8b9c8ca7862..d2e16c0401df 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -14,6 +14,7 @@
#include <linux/sizes.h>
#include <linux/pmem.h>
#include "nd-core.h"
+#include "pmem.h"
#include "pfn.h"
#include "btt.h"
#include "nd.h"
@@ -272,7 +273,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
cleared /= 512;
badblocks_clear(&nsio->bb, sector, cleared);
}
- invalidate_pmem(nsio->addr + offset, size);
+ arch_invalidate_pmem(nsio->addr + offset, size);
} else
rc = -EIO;
}
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b87702d46bb..68737bc68a07 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -71,7 +71,7 @@ static int pmem_clear_poison(struct pmem_device *pmem, phys_addr_t offset,
badblocks_clear(&pmem->bb, sector, cleared);
}

- invalidate_pmem(pmem->virt_addr + offset, len);
+ arch_invalidate_pmem(pmem->virt_addr + offset, len);

return rc;
}
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 9137ec80b85f..d579c7095a45 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -10,6 +10,9 @@
static inline void arch_wb_cache_pmem(void *addr, size_t size)
{
}
+static inline void arch_invalidate_pmem(void *addr, size_t size)
+{
+}
#endif

/* this definition is in its own header for tools/testing/nvdimm to consume */
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 33ae761f010a..559c00848583 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -30,11 +30,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
{
BUG();
}
-
-static inline void arch_invalidate_pmem(void *addr, size_t size)
-{
- BUG();
-}
#endif

static inline bool arch_has_pmem_api(void)
@@ -61,18 +56,4 @@ static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
else
memcpy(dst, src, n);
}
-
-/**
- * invalidate_pmem - flush a pmem range from the cache hierarchy
- * @addr: virtual start address
- * @size: bytes to invalidate (internally aligned to cache line size)
- *
- * For platforms that support clearing poison this flushes any poisoned
- * ranges out of the cache
- */
-static inline void invalidate_pmem(void *addr, size_t size)
-{
- if (arch_has_pmem_api())
- arch_invalidate_pmem(addr, size);
-}
#endif /* __PMEM_H__ */

2017-06-09 20:32:57

by Dan Williams

[permalink] [raw]
Subject: [PATCH v3 08/14] x86, dax, libnvdimm: move wb_cache_pmem() to libnvdimm

With all calls to this routine redirected through the pmem driver, we can kill
the pmem api indirection. arch_wb_cache_pmem() is now optionally supplied by
the arch-specific asm/pmem.h. As before, pmem flushing is only defined
for x86_64, but it is straightforward to add other archs in the future.

Cc: <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Oliver O'Halloran <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/include/asm/pmem.h | 18 +-----------------
arch/x86/include/asm/uaccess_64.h | 1 +
arch/x86/lib/usercopy_64.c | 3 ++-
drivers/nvdimm/pmem.c | 2 +-
drivers/nvdimm/pmem.h | 7 +++++++
include/linux/pmem.h | 19 -------------------
6 files changed, 12 insertions(+), 38 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index f4c119d253f3..862be3a9275c 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -44,25 +44,9 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
BUG();
}

-/**
- * arch_wb_cache_pmem - write back a cache range with CLWB
- * @vaddr: virtual start address
- * @size: number of bytes to write back
- *
- * Write back a cache range using the CLWB (cache line write back)
- * instruction. Note that @size is internally rounded up to be cache
- * line size aligned.
- */
static inline void arch_wb_cache_pmem(void *addr, size_t size)
{
- u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
- unsigned long clflush_mask = x86_clflush_size - 1;
- void *vend = addr + size;
- void *p;
-
- for (p = (void *)((unsigned long)addr & ~clflush_mask);
- p < vend; p += x86_clflush_size)
- clwb(p);
+ clean_cache_range(addr, size);
}

static inline void arch_invalidate_pmem(void *addr, size_t size)
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index b16f6a1d8b26..bdc4a2761525 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -174,6 +174,7 @@ extern long __copy_user_nocache(void *dst, const void __user *src,
extern long __copy_user_flushcache(void *dst, const void __user *src, unsigned size);
extern void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
size_t len);
+void clean_cache_range(void *addr, size_t size);

static inline int
__copy_from_user_inatomic_nocache(void *dst, const void __user *src,
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index f42d2fd86ca3..baa80ff29da8 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -85,7 +85,7 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
* instruction. Note that @size is internally rounded up to be cache
* line size aligned.
*/
-static void clean_cache_range(void *addr, size_t size)
+void clean_cache_range(void *addr, size_t size)
{
u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
unsigned long clflush_mask = x86_clflush_size - 1;
@@ -96,6 +96,7 @@ static void clean_cache_range(void *addr, size_t size)
p < vend; p += x86_clflush_size)
clwb(p);
}
+EXPORT_SYMBOL(clean_cache_range);

long __copy_user_flushcache(void *dst, const void __user *src, unsigned size)
{
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 823b07774244..3b87702d46bb 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -245,7 +245,7 @@ static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
static void pmem_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t size)
{
- wb_cache_pmem(addr, size);
+ arch_wb_cache_pmem(addr, size);
}

static const struct dax_operations pmem_dax_ops = {
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 7f4dbd72a90a..9137ec80b85f 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -4,6 +4,13 @@
#include <linux/types.h>
#include <linux/pfn_t.h>
#include <linux/fs.h>
+#include <asm/pmem.h>
+
+#ifndef CONFIG_ARCH_HAS_PMEM_API
+static inline void arch_wb_cache_pmem(void *addr, size_t size)
+{
+}
+#endif

/* this definition is in its own header for tools/testing/nvdimm to consume */
struct pmem_device {
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 772bd02a5b52..33ae761f010a 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,11 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
BUG();
}

-static inline void arch_wb_cache_pmem(void *addr, size_t size)
-{
- BUG();
-}
-
static inline void arch_invalidate_pmem(void *addr, size_t size)
{
BUG();
@@ -80,18 +75,4 @@ static inline void invalidate_pmem(void *addr, size_t size)
if (arch_has_pmem_api())
arch_invalidate_pmem(addr, size);
}
-
-/**
- * wb_cache_pmem - write back processor cache for PMEM memory range
- * @addr: virtual start address
- * @size: number of bytes to write back
- *
- * Write back the processor cache range starting at 'addr' for 'size' bytes.
- * See blkdev_issue_flush() note for memcpy_to_pmem().
- */
-static inline void wb_cache_pmem(void *addr, size_t size)
-{
- if (arch_has_pmem_api())
- arch_wb_cache_pmem(addr, size);
-}
#endif /* __PMEM_H__ */

2017-06-09 20:33:17

by Dan Williams

[permalink] [raw]
Subject: [PATCH v3 05/14] dm: add ->flush() dax operation support

Allow device-mapper to route flush operations to the
per-target implementation. In order for the device stacking to work, we
need a dax_dev and a pgoff relative to that device. This gives each
layer of the stack the information it needs to look up the operation
pointer for the next level.

This conceptually allows for an array of mixed device drivers with
varying flush implementations.

Cc: Toshi Kani <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/super.c | 11 +++++++++++
drivers/md/dm-linear.c | 15 +++++++++++++++
drivers/md/dm-stripe.c | 20 ++++++++++++++++++++
drivers/md/dm.c | 19 +++++++++++++++++++
include/linux/dax.h | 2 ++
include/linux/device-mapper.h | 3 +++
6 files changed, 70 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index dd299e55f65d..b7729e4d351a 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -185,6 +185,17 @@ size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
}
EXPORT_SYMBOL_GPL(dax_copy_from_iter);

+void dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+ size_t size)
+{
+ if (!dax_alive(dax_dev))
+ return;
+
+ if (dax_dev->ops->flush)
+ dax_dev->ops->flush(dax_dev, pgoff, addr, size);
+}
+EXPORT_SYMBOL_GPL(dax_flush);
+
bool dax_alive(struct dax_device *dax_dev)
{
lockdep_assert_held(&dax_srcu);
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 0841ec1bfbad..25e661974319 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -173,6 +173,20 @@ static size_t linear_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
}

+static void linear_dax_flush(struct dm_target *ti, pgoff_t pgoff, void *addr,
+ size_t size)
+{
+ struct linear_c *lc = ti->private;
+ struct block_device *bdev = lc->dev->bdev;
+ struct dax_device *dax_dev = lc->dev->dax_dev;
+ sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+
+ dev_sector = linear_map_sector(ti, sector);
+ if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(size, PAGE_SIZE), &pgoff))
+ return;
+ dax_flush(dax_dev, pgoff, addr, size);
+}
+
static struct target_type linear_target = {
.name = "linear",
.version = {1, 3, 0},
@@ -186,6 +200,7 @@ static struct target_type linear_target = {
.iterate_devices = linear_iterate_devices,
.direct_access = linear_dax_direct_access,
.dax_copy_from_iter = linear_dax_copy_from_iter,
+ .dax_flush = linear_dax_flush,
};

int __init dm_linear_init(void)
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index 1ef914f9ca72..8e73517967b6 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -351,6 +351,25 @@ static size_t stripe_dax_copy_from_iter(struct dm_target *ti, pgoff_t pgoff,
return dax_copy_from_iter(dax_dev, pgoff, addr, bytes, i);
}

+static void stripe_dax_flush(struct dm_target *ti, pgoff_t pgoff, void *addr,
+ size_t size)
+{
+ sector_t dev_sector, sector = pgoff * PAGE_SECTORS;
+ struct stripe_c *sc = ti->private;
+ struct dax_device *dax_dev;
+ struct block_device *bdev;
+ uint32_t stripe;
+
+ stripe_map_sector(sc, sector, &stripe, &dev_sector);
+ dev_sector += sc->stripe[stripe].physical_start;
+ dax_dev = sc->stripe[stripe].dev->dax_dev;
+ bdev = sc->stripe[stripe].dev->bdev;
+
+ if (bdev_dax_pgoff(bdev, dev_sector, ALIGN(size, PAGE_SIZE), &pgoff))
+ return;
+ dax_flush(dax_dev, pgoff, addr, size);
+}
+
/*
* Stripe status:
*
@@ -471,6 +490,7 @@ static struct target_type stripe_target = {
.io_hints = stripe_io_hints,
.direct_access = stripe_dax_direct_access,
.dax_copy_from_iter = stripe_dax_copy_from_iter,
+ .dax_flush = stripe_dax_flush,
};

int __init dm_stripe_init(void)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 7faaceb52819..09b3efdc8abf 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -994,6 +994,24 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
return ret;
}

+static void dm_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+ size_t size)
+{
+ struct mapped_device *md = dax_get_private(dax_dev);
+ sector_t sector = pgoff * PAGE_SECTORS;
+ struct dm_target *ti;
+ int srcu_idx;
+
+ ti = dm_dax_get_live_target(md, sector, &srcu_idx);
+
+ if (!ti)
+ goto out;
+ if (ti->type->dax_flush)
+ ti->type->dax_flush(ti, pgoff, addr, size);
+ out:
+ dm_put_live_table(md, srcu_idx);
+}
+
/*
* A target may call dm_accept_partial_bio only from the map routine. It is
* allowed for all bio types except REQ_PREFLUSH.
@@ -2885,6 +2903,7 @@ static const struct block_device_operations dm_blk_dops = {
static const struct dax_operations dm_dax_ops = {
.direct_access = dm_dax_direct_access,
.copy_from_iter = dm_dax_copy_from_iter,
+ .flush = dm_dax_flush,
};

/*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 407dd3ff6e54..1f6b6072af64 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -82,6 +82,8 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
void **kaddr, pfn_t *pfn);
size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
size_t bytes, struct iov_iter *i);
+void dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
+ size_t size);

/*
* We use lowest available bit in exceptional entry for locking, one bit for
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 11c8a0a92f9c..67bfe8ddcb32 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -134,6 +134,8 @@ typedef long (*dm_dax_direct_access_fn) (struct dm_target *ti, pgoff_t pgoff,
long nr_pages, void **kaddr, pfn_t *pfn);
typedef size_t (*dm_dax_copy_from_iter_fn)(struct dm_target *ti, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i);
+typedef void (*dm_dax_flush_fn)(struct dm_target *ti, pgoff_t pgoff, void *addr,
+ size_t size);
#define PAGE_SECTORS (PAGE_SIZE / 512)

void dm_error(const char *message);
@@ -184,6 +186,7 @@ struct target_type {
dm_io_hints_fn io_hints;
dm_dax_direct_access_fn direct_access;
dm_dax_copy_from_iter_fn dax_copy_from_iter;
+ dm_dax_flush_fn dax_flush;

/* For internal device-mapper use. */
struct list_head list;

2017-06-09 23:21:18

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 14/14] libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region

On Fri, Jun 9, 2017 at 1:25 PM, Dan Williams <[email protected]> wrote:
> The pmem driver attaches to both persistent and volatile memory ranges
> advertised by the ACPI NFIT. When the region is volatile it is redundant
> to spend cycles flushing caches at fsync(). Check if the hosting region
> is volatile and do not set QUEUE_FLAG_WC if it is.
>
> Cc: Jan Kara <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> drivers/nvdimm/pmem.c | 13 ++++++++-----
> drivers/nvdimm/region_devs.c | 6 ++++++
> include/linux/libnvdimm.h | 1 +
> 3 files changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 06f6c27ec1e9..5cac9fb39db8 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -279,10 +279,10 @@ static int pmem_attach_disk(struct device *dev,
> struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
> struct nd_region *nd_region = to_nd_region(dev->parent);
> struct vmem_altmap __altmap, *altmap = NULL;
> + int nid = dev_to_node(dev), fua, wbc;
> struct resource *res = &nsio->res;
> struct nd_pfn *nd_pfn = NULL;
> struct dax_device *dax_dev;
> - int nid = dev_to_node(dev);
> struct nd_pfn_sb *pfn_sb;
> struct pmem_device *pmem;
> struct resource pfn_res;
> @@ -308,9 +308,12 @@ static int pmem_attach_disk(struct device *dev,
> dev_set_drvdata(dev, pmem);
> pmem->phys_addr = res->start;
> pmem->size = resource_size(res);
> - if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE)
> - || nvdimm_has_flush(nd_region) < 0)
> - dev_warn(dev, "unable to guarantee persistence of writes\n");
> + fua = nvdimm_has_flush(nd_region);
> + if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE) || fua < 0)
> + dev_warn(dev, "unable to guarantee persistence of writes\n"); {

I sent this patch with local uncommitted changes. Will send a v2
shortly, and also update stg mail to abort if the tree is dirty.

2017-06-10 18:01:08

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 14/14] libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region

The pmem driver attaches to both persistent and volatile memory ranges
advertised by the ACPI NFIT. When the region is volatile it is redundant
to spend cycles flushing caches at fsync(). Check if the hosting region
is volatile and do not set QUEUE_FLAG_WC if it is.

Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
Changes since v3:
* Fix compile error.

drivers/nvdimm/pmem.c | 11 +++++++----
drivers/nvdimm/region_devs.c | 6 ++++++
include/linux/libnvdimm.h | 1 +
3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 06f6c27ec1e9..f39e095dd46d 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -279,10 +279,10 @@ static int pmem_attach_disk(struct device *dev,
struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
struct nd_region *nd_region = to_nd_region(dev->parent);
struct vmem_altmap __altmap, *altmap = NULL;
+ int nid = dev_to_node(dev), fua, wbc;
struct resource *res = &nsio->res;
struct nd_pfn *nd_pfn = NULL;
struct dax_device *dax_dev;
- int nid = dev_to_node(dev);
struct nd_pfn_sb *pfn_sb;
struct pmem_device *pmem;
struct resource pfn_res;
@@ -308,9 +308,12 @@ static int pmem_attach_disk(struct device *dev,
dev_set_drvdata(dev, pmem);
pmem->phys_addr = res->start;
pmem->size = resource_size(res);
- if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE)
- || nvdimm_has_flush(nd_region) < 0)
+ fua = nvdimm_has_flush(nd_region);
+ if (!IS_ENABLED(CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE) || fua < 0) {
dev_warn(dev, "unable to guarantee persistence of writes\n");
+ fua = 0;
+ }
+ wbc = nvdimm_has_cache(nd_region);

if (!devm_request_mem_region(dev, res->start, resource_size(res),
dev_name(&ndns->dev))) {
@@ -354,7 +357,7 @@ static int pmem_attach_disk(struct device *dev,
return PTR_ERR(addr);
pmem->virt_addr = addr;

- blk_queue_write_cache(q, true, true);
+ blk_queue_write_cache(q, wbc, fua);
blk_queue_make_request(q, pmem_make_request);
blk_queue_physical_block_size(q, PAGE_SIZE);
blk_queue_max_hw_sectors(q, UINT_MAX);
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 53a64a16aba4..0c3b089b280a 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1060,6 +1060,12 @@ int nvdimm_has_flush(struct nd_region *nd_region)
}
EXPORT_SYMBOL_GPL(nvdimm_has_flush);

+int nvdimm_has_cache(struct nd_region *nd_region)
+{
+ return is_nd_pmem(&nd_region->dev);
+}
+EXPORT_SYMBOL_GPL(nvdimm_has_cache);
+
void __exit nd_region_devs_exit(void)
{
ida_destroy(&region_ida);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index b2f659bd661d..a8ee1d0afd70 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -165,4 +165,5 @@ void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
u64 nd_fletcher64(void *addr, size_t len, bool le);
void nvdimm_flush(struct nd_region *nd_region);
int nvdimm_has_flush(struct nd_region *nd_region);
+int nvdimm_has_cache(struct nd_region *nd_region);
#endif /* __LIBNVDIMM_H__ */

2017-06-12 00:35:52

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 08/14] x86, dax, libnvdimm: move wb_cache_pmem() to libnvdimm

With all calls to this routine redirected through the pmem driver, we can kill
the pmem api indirection. arch_wb_cache_pmem() is now optionally supplied by
the arch-specific asm/pmem.h. As before, pmem flushing is only defined
for x86_64, but it is straightforward to add other archs in the future.

Cc: <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Oliver O'Halloran <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
Change since v3:
* move the asm/pmem.h include inside CONFIG_ARCH_HAS_PMEM_API ifdef
guard

arch/x86/include/asm/pmem.h | 18 +-----------------
arch/x86/include/asm/uaccess_64.h | 1 +
arch/x86/lib/usercopy_64.c | 3 ++-
drivers/nvdimm/pmem.c | 2 +-
drivers/nvdimm/pmem.h | 8 ++++++++
include/linux/pmem.h | 19 -------------------
6 files changed, 13 insertions(+), 38 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index f4c119d253f3..862be3a9275c 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -44,25 +44,9 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
BUG();
}

-/**
- * arch_wb_cache_pmem - write back a cache range with CLWB
- * @vaddr: virtual start address
- * @size: number of bytes to write back
- *
- * Write back a cache range using the CLWB (cache line write back)
- * instruction. Note that @size is internally rounded up to be cache
- * line size aligned.
- */
static inline void arch_wb_cache_pmem(void *addr, size_t size)
{
- u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
- unsigned long clflush_mask = x86_clflush_size - 1;
- void *vend = addr + size;
- void *p;
-
- for (p = (void *)((unsigned long)addr & ~clflush_mask);
- p < vend; p += x86_clflush_size)
- clwb(p);
+ clean_cache_range(addr, size);
}

static inline void arch_invalidate_pmem(void *addr, size_t size)
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index b16f6a1d8b26..bdc4a2761525 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -174,6 +174,7 @@ extern long __copy_user_nocache(void *dst, const void __user *src,
extern long __copy_user_flushcache(void *dst, const void __user *src, unsigned size);
extern void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
size_t len);
+void clean_cache_range(void *addr, size_t size);

static inline int
__copy_from_user_inatomic_nocache(void *dst, const void __user *src,
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index f42d2fd86ca3..baa80ff29da8 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -85,7 +85,7 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
* instruction. Note that @size is internally rounded up to be cache
* line size aligned.
*/
-static void clean_cache_range(void *addr, size_t size)
+void clean_cache_range(void *addr, size_t size)
{
u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
unsigned long clflush_mask = x86_clflush_size - 1;
@@ -96,6 +96,7 @@ static void clean_cache_range(void *addr, size_t size)
p < vend; p += x86_clflush_size)
clwb(p);
}
+EXPORT_SYMBOL(clean_cache_range);

long __copy_user_flushcache(void *dst, const void __user *src, unsigned size)
{
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 823b07774244..3b87702d46bb 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -245,7 +245,7 @@ static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
static void pmem_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t size)
{
- wb_cache_pmem(addr, size);
+ arch_wb_cache_pmem(addr, size);
}

static const struct dax_operations pmem_dax_ops = {
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 7f4dbd72a90a..0169a0422b88 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -5,6 +5,14 @@
#include <linux/pfn_t.h>
#include <linux/fs.h>

+#ifdef CONFIG_ARCH_HAS_PMEM_API
+#include <asm/pmem.h>
+#else
+static inline void arch_wb_cache_pmem(void *addr, size_t size)
+{
+}
+#endif
+
/* this definition is in its own header for tools/testing/nvdimm to consume */
struct pmem_device {
/* One contiguous memory region per device */
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 772bd02a5b52..33ae761f010a 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -31,11 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
BUG();
}

-static inline void arch_wb_cache_pmem(void *addr, size_t size)
-{
- BUG();
-}
-
static inline void arch_invalidate_pmem(void *addr, size_t size)
{
BUG();
@@ -80,18 +75,4 @@ static inline void invalidate_pmem(void *addr, size_t size)
if (arch_has_pmem_api())
arch_invalidate_pmem(addr, size);
}
-
-/**
- * wb_cache_pmem - write back processor cache for PMEM memory range
- * @addr: virtual start address
- * @size: number of bytes to write back
- *
- * Write back the processor cache range starting at 'addr' for 'size' bytes.
- * See blkdev_issue_flush() note for memcpy_to_pmem().
- */
-static inline void wb_cache_pmem(void *addr, size_t size)
-{
- if (arch_has_pmem_api())
- arch_wb_cache_pmem(addr, size);
-}
#endif /* __PMEM_H__ */

2017-06-14 10:46:30

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 13/14] filesystem-dax: gate calls to dax_flush() on QUEUE_FLAG_WC

On Fri 09-06-17 13:24:56, Dan Williams wrote:
> Some platforms arrange for cpu caches to be flushed on power-fail. On
> those platforms there is no requirement that the kernel track and flush
> potentially dirty cache lines. Given that we still insert entries into
> the radix for locking purposes, this patch only disables the cache flush
> loop, not the dirty tracking.
>
> Userspace can override the default cache setting via the block device
> queue "write_cache" attribute in sysfs.
>
> Cc: Jan Kara <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>

...

> - dax_flush(dax_dev, pgoff, kaddr, size);
> + if (test_bit(QUEUE_FLAG_WC, &bdev->bd_queue->queue_flags))
> + dax_flush(dax_dev, pgoff, kaddr, size);

IMHO the check belongs in dax_flush(), similarly to how blkdev_issue_flush()
silently handles whether the flush is actually needed or not.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-06-14 10:48:41

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 10/14] pmem: remove global pmem api

On Fri 09-06-17 13:24:40, Dan Williams wrote:
> Now that all callers of the pmem api have been converted to dax helpers that
> call back to the pmem driver, we can remove include/linux/pmem.h.
>
> Cc: <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Toshi Kani <[email protected]>
> Cc: Oliver O'Halloran <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> MAINTAINERS | 1 -
> drivers/acpi/nfit/core.c | 3 +-
> drivers/nvdimm/claim.c | 1 -
> drivers/nvdimm/dimm_devs.c | 8 +++++
> drivers/nvdimm/namespace_devs.c | 6 +---
> drivers/nvdimm/pmem.c | 1 -
> drivers/nvdimm/pmem.h | 5 +++
> drivers/nvdimm/region_devs.c | 1 -
> fs/dax.c | 1 -
> include/linux/libnvdimm.h | 1 +
> include/linux/pmem.h | 59 ---------------------------------------
> 11 files changed, 15 insertions(+), 72 deletions(-)
> delete mode 100644 include/linux/pmem.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7a28acd7f525..d77ad3194adc 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7593,7 +7593,6 @@ L: [email protected]
> Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
> S: Supported
> F: drivers/nvdimm/pmem.c
> -F: include/linux/pmem.h
> F: arch/*/include/asm/pmem.h
>
> LIGHTNVM PLATFORM SUPPORT
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index cbd5596e7562..ac2436538b7e 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -20,7 +20,6 @@
> #include <linux/list.h>
> #include <linux/acpi.h>
> #include <linux/sort.h>
> -#include <linux/pmem.h>
> #include <linux/io.h>
> #include <linux/nd.h>
> #include <asm/cacheflush.h>
> @@ -1956,7 +1955,7 @@ static int acpi_nfit_blk_region_enable(struct nvdimm_bus *nvdimm_bus,
> nfit_blk->bdw_offset = nfit_mem->bdw->offset;
> mmio = &nfit_blk->mmio[BDW];
> mmio->addr.base = devm_nvdimm_memremap(dev, nfit_mem->spa_bdw->address,
> - nfit_mem->spa_bdw->length, ARCH_MEMREMAP_PMEM);
> + nfit_mem->spa_bdw->length, nd_blk_memremap_flags(ndbr));
> if (!mmio->addr.base) {
> dev_dbg(dev, "%s: %s failed to map bdw\n", __func__,
> nvdimm_name(nvdimm));
> diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
> index d2e16c0401df..3beedf173902 100644
> --- a/drivers/nvdimm/claim.c
> +++ b/drivers/nvdimm/claim.c
> @@ -12,7 +12,6 @@
> */
> #include <linux/device.h>
> #include <linux/sizes.h>
> -#include <linux/pmem.h>
> #include "nd-core.h"
> #include "pmem.h"
> #include "pfn.h"
> diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
> index 9852a3355509..6a1e7a3c0c17 100644
> --- a/drivers/nvdimm/dimm_devs.c
> +++ b/drivers/nvdimm/dimm_devs.c
> @@ -20,6 +20,7 @@
> #include <linux/mm.h>
> #include "nd-core.h"
> #include "label.h"
> +#include "pmem.h"
> #include "nd.h"
>
> static DEFINE_IDA(dimm_ida);
> @@ -235,6 +236,13 @@ struct nvdimm *nd_blk_region_to_dimm(struct nd_blk_region *ndbr)
> }
> EXPORT_SYMBOL_GPL(nd_blk_region_to_dimm);
>
> +unsigned long nd_blk_memremap_flags(struct nd_blk_region *ndbr)
> +{
> + /* pmem mapping properties are private to libnvdimm */
> + return ARCH_MEMREMAP_PMEM;
> +}
> +EXPORT_SYMBOL_GPL(nd_blk_memremap_flags);
> +
> struct nvdimm_drvdata *to_ndd(struct nd_mapping *nd_mapping)
> {
> struct nvdimm *nvdimm = nd_mapping->nvdimm;
> diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
> index 2f9dfbd2dbec..4e9261ef8a95 100644
> --- a/drivers/nvdimm/namespace_devs.c
> +++ b/drivers/nvdimm/namespace_devs.c
> @@ -14,10 +14,10 @@
> #include <linux/device.h>
> #include <linux/sort.h>
> #include <linux/slab.h>
> -#include <linux/pmem.h>
> #include <linux/list.h>
> #include <linux/nd.h>
> #include "nd-core.h"
> +#include "pmem.h"
> #include "nd.h"
>
> static void namespace_io_release(struct device *dev)
> @@ -155,11 +155,7 @@ bool pmem_should_map_pages(struct device *dev)
> IORES_DESC_NONE) == REGION_MIXED)
> return false;
>
> -#ifdef ARCH_MEMREMAP_PMEM
> return ARCH_MEMREMAP_PMEM == MEMREMAP_WB;
> -#else
> - return false;
> -#endif
> }
> EXPORT_SYMBOL(pmem_should_map_pages);
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 68737bc68a07..06f6c27ec1e9 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -28,7 +28,6 @@
> #include <linux/blk-mq.h>
> #include <linux/pfn_t.h>
> #include <linux/slab.h>
> -#include <linux/pmem.h>
> #include <linux/uio.h>
> #include <linux/dax.h>
> #include <linux/nd.h>
> diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
> index d579c7095a45..2b02e00b44eb 100644
> --- a/drivers/nvdimm/pmem.h
> +++ b/drivers/nvdimm/pmem.h
> @@ -6,7 +6,10 @@
> #include <linux/fs.h>
> #include <asm/pmem.h>
>
> -#ifndef CONFIG_ARCH_HAS_PMEM_API
> +#ifdef CONFIG_ARCH_HAS_PMEM_API
> +#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
> +#else
> +#define ARCH_MEMREMAP_PMEM MEMREMAP_WT
> static inline void arch_wb_cache_pmem(void *addr, size_t size)
> {
> }
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 985b0e11bd73..3c06a6ea6958 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -15,7 +15,6 @@
> #include <linux/sched.h>
> #include <linux/slab.h>
> #include <linux/hash.h>
> -#include <linux/pmem.h>
> #include <linux/sort.h>
> #include <linux/io.h>
> #include <linux/nd.h>
> diff --git a/fs/dax.c b/fs/dax.c
> index 554b8e7d921c..6d8699feae2e 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -25,7 +25,6 @@
> #include <linux/mm.h>
> #include <linux/mutex.h>
> #include <linux/pagevec.h>
> -#include <linux/pmem.h>
> #include <linux/sched.h>
> #include <linux/sched/signal.h>
> #include <linux/uio.h>
> diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
> index 6c807017128d..b2f659bd661d 100644
> --- a/include/linux/libnvdimm.h
> +++ b/include/linux/libnvdimm.h
> @@ -159,6 +159,7 @@ void *nd_region_provider_data(struct nd_region *nd_region);
> void *nd_blk_region_provider_data(struct nd_blk_region *ndbr);
> void nd_blk_region_set_provider_data(struct nd_blk_region *ndbr, void *data);
> struct nvdimm *nd_blk_region_to_dimm(struct nd_blk_region *ndbr);
> +unsigned long nd_blk_memremap_flags(struct nd_blk_region *ndbr);
> unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
> void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
> u64 nd_fletcher64(void *addr, size_t len, bool le);
> diff --git a/include/linux/pmem.h b/include/linux/pmem.h
> deleted file mode 100644
> index 559c00848583..000000000000
> --- a/include/linux/pmem.h
> +++ /dev/null
> @@ -1,59 +0,0 @@
> -/*
> - * Copyright(c) 2015 Intel Corporation. All rights reserved.
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of version 2 of the GNU General Public License as
> - * published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it will be useful, but
> - * WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> - * General Public License for more details.
> - */
> -#ifndef __PMEM_H__
> -#define __PMEM_H__
> -
> -#include <linux/io.h>
> -#include <linux/uio.h>
> -
> -#ifdef CONFIG_ARCH_HAS_PMEM_API
> -#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
> -#include <asm/pmem.h>
> -#else
> -#define ARCH_MEMREMAP_PMEM MEMREMAP_WT
> -/*
> - * These are simply here to enable compilation, all call sites gate
> - * calling these symbols with arch_has_pmem_api() and redirect to the
> - * implementation in asm/pmem.h.
> - */
> -static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
> -{
> - BUG();
> -}
> -#endif
> -
> -static inline bool arch_has_pmem_api(void)
> -{
> - return IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API);
> -}
> -
> -/**
> - * memcpy_to_pmem - copy data to persistent memory
> - * @dst: destination buffer for the copy
> - * @src: source buffer for the copy
> - * @n: length of the copy in bytes
> - *
> - * Perform a memory copy that results in the destination of the copy
> - * being effectively evicted from, or never written to, the processor
> - * cache hierarchy after the copy completes. After memcpy_to_pmem()
> - * data may still reside in cpu or platform buffers, so this operation
> - * must be followed by a blkdev_issue_flush() on the pmem block device.
> - */
> -static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
> -{
> - if (arch_has_pmem_api())
> - arch_memcpy_to_pmem(dst, src, n);
> - else
> - memcpy(dst, src, n);
> -}
> -#endif /* __PMEM_H__ */
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-06-14 10:49:16

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 09/14] x86, libnvdimm, pmem: move arch_invalidate_pmem() to libnvdimm

On Fri 09-06-17 13:24:35, Dan Williams wrote:
> Kill this globally defined wrapper and move to libnvdimm so that we can
> ultimately remove include/linux/pmem.h.
>
> Cc: <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> drivers/nvdimm/claim.c | 3 ++-
> drivers/nvdimm/pmem.c | 2 +-
> drivers/nvdimm/pmem.h | 3 +++
> include/linux/pmem.h | 19 -------------------
> 4 files changed, 6 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
> index b8b9c8ca7862..d2e16c0401df 100644
> --- a/drivers/nvdimm/claim.c
> +++ b/drivers/nvdimm/claim.c
> @@ -14,6 +14,7 @@
> #include <linux/sizes.h>
> #include <linux/pmem.h>
> #include "nd-core.h"
> +#include "pmem.h"
> #include "pfn.h"
> #include "btt.h"
> #include "nd.h"
> @@ -272,7 +273,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
> cleared /= 512;
> badblocks_clear(&nsio->bb, sector, cleared);
> }
> - invalidate_pmem(nsio->addr + offset, size);
> + arch_invalidate_pmem(nsio->addr + offset, size);
> } else
> rc = -EIO;
> }
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 3b87702d46bb..68737bc68a07 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -71,7 +71,7 @@ static int pmem_clear_poison(struct pmem_device *pmem, phys_addr_t offset,
> badblocks_clear(&pmem->bb, sector, cleared);
> }
>
> - invalidate_pmem(pmem->virt_addr + offset, len);
> + arch_invalidate_pmem(pmem->virt_addr + offset, len);
>
> return rc;
> }
> diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
> index 9137ec80b85f..d579c7095a45 100644
> --- a/drivers/nvdimm/pmem.h
> +++ b/drivers/nvdimm/pmem.h
> @@ -10,6 +10,9 @@
> static inline void arch_wb_cache_pmem(void *addr, size_t size)
> {
> }
> +static inline void arch_invalidate_pmem(void *addr, size_t size)
> +{
> +}
> #endif
>
> /* this definition is in it's own header for tools/testing/nvdimm to consume */
> diff --git a/include/linux/pmem.h b/include/linux/pmem.h
> index 33ae761f010a..559c00848583 100644
> --- a/include/linux/pmem.h
> +++ b/include/linux/pmem.h
> @@ -30,11 +30,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
> {
> BUG();
> }
> -
> -static inline void arch_invalidate_pmem(void *addr, size_t size)
> -{
> - BUG();
> -}
> #endif
>
> static inline bool arch_has_pmem_api(void)
> @@ -61,18 +56,4 @@ static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
> else
> memcpy(dst, src, n);
> }
> -
> -/**
> - * invalidate_pmem - flush a pmem range from the cache hierarchy
> - * @addr: virtual start address
> - * @size: bytes to invalidate (internally aligned to cache line size)
> - *
> - * For platforms that support clearing poison this flushes any poisoned
> - * ranges out of the cache
> - */
> -static inline void invalidate_pmem(void *addr, size_t size)
> -{
> - if (arch_has_pmem_api())
> - arch_invalidate_pmem(addr, size);
> -}
> #endif /* __PMEM_H__ */
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-06-14 10:54:28

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 08/14] x86, dax, libnvdimm: move wb_cache_pmem() to libnvdimm

On Fri 09-06-17 13:24:29, Dan Williams wrote:
> With all calls to this routine re-directed through the pmem driver, we can kill
> the pmem api indirection. arch_wb_cache_pmem() is now optionally supplied by
> the arch specific asm/pmem.h. Same as before, pmem flushing is only defined
> for x86_64, but it is straightforward to add other archs in the future.
>
> Cc: <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Oliver O'Halloran <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>

Looks good to me. Just one question below...

> -/**
> - * arch_wb_cache_pmem - write back a cache range with CLWB
> - * @vaddr: virtual start address
> - * @size: number of bytes to write back
> - *
> - * Write back a cache range using the CLWB (cache line write back)
> - * instruction. Note that @size is internally rounded up to be cache
> - * line size aligned.
> - */
> static inline void arch_wb_cache_pmem(void *addr, size_t size)
> {
> - u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
> - unsigned long clflush_mask = x86_clflush_size - 1;
> - void *vend = addr + size;
> - void *p;
> -
> - for (p = (void *)((unsigned long)addr & ~clflush_mask);
> - p < vend; p += x86_clflush_size)
> - clwb(p);
> + clean_cache_range(addr,size);
> }

So this will make compilation break on 32-bit x86 as it does not define
clean_cache_range(). Do we enforce somewhere that we are on x86_64 when pmem
is enabled?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-06-14 10:55:46

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 07/14] x86, dax: replace clear_pmem() with open coded memset + dax_ops->flush

On Fri 09-06-17 13:24:24, Dan Williams wrote:
> The clear_pmem() helper simply combines a memset() plus a cache flush.
> Now that the flush routine is optionally provided by the dax device
> driver we can avoid unnecessary cache management on dax devices fronting
> volatile memory.
>
> With clear_pmem() gone we can follow on with a patch to make pmem cache
> management completely defined within the pmem driver.
>
> Cc: <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza


> ---
> arch/x86/include/asm/pmem.h | 13 -------------
> fs/dax.c | 3 ++-
> include/linux/pmem.h | 21 ---------------------
> 3 files changed, 2 insertions(+), 35 deletions(-)
>
> diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
> index 60e8edbe0205..f4c119d253f3 100644
> --- a/arch/x86/include/asm/pmem.h
> +++ b/arch/x86/include/asm/pmem.h
> @@ -65,19 +65,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
> clwb(p);
> }
>
> -/**
> - * arch_clear_pmem - zero a PMEM memory range
> - * @addr: virtual start address
> - * @size: number of bytes to zero
> - *
> - * Write zeros into the memory range starting at 'addr' for 'size' bytes.
> - */
> -static inline void arch_clear_pmem(void *addr, size_t size)
> -{
> - memset(addr, 0, size);
> - arch_wb_cache_pmem(addr, size);
> -}
> -
> static inline void arch_invalidate_pmem(void *addr, size_t size)
> {
> clflush_cache_range(addr, size);
> diff --git a/fs/dax.c b/fs/dax.c
> index 0933fc460ada..554b8e7d921c 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -975,7 +975,8 @@ int __dax_zero_page_range(struct block_device *bdev,
> dax_read_unlock(id);
> return rc;
> }
> - clear_pmem(kaddr + offset, size);
> + memset(kaddr + offset, 0, size);
> + dax_flush(dax_dev, pgoff, kaddr + offset, size);
> dax_read_unlock(id);
> }
> return 0;
> diff --git a/include/linux/pmem.h b/include/linux/pmem.h
> index 9d542a5600e4..772bd02a5b52 100644
> --- a/include/linux/pmem.h
> +++ b/include/linux/pmem.h
> @@ -31,11 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
> BUG();
> }
>
> -static inline void arch_clear_pmem(void *addr, size_t size)
> -{
> - BUG();
> -}
> -
> static inline void arch_wb_cache_pmem(void *addr, size_t size)
> {
> BUG();
> @@ -73,22 +68,6 @@ static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
> }
>
> /**
> - * clear_pmem - zero a PMEM memory range
> - * @addr: virtual start address
> - * @size: number of bytes to zero
> - *
> - * Write zeros into the memory range starting at 'addr' for 'size' bytes.
> - * See blkdev_issue_flush() note for memcpy_to_pmem().
> - */
> -static inline void clear_pmem(void *addr, size_t size)
> -{
> - if (arch_has_pmem_api())
> - arch_clear_pmem(addr, size);
> - else
> - memset(addr, 0, size);
> -}
> -
> -/**
> * invalidate_pmem - flush a pmem range from the cache hierarchy
> * @addr: virtual start address
> * @size: bytes to invalidate (internally aligned to cache line size)
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-06-14 10:56:48

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 06/14] filesystem-dax: convert to dax_flush()

On Fri 09-06-17 13:24:18, Dan Williams wrote:
> Filesystem-DAX flushes caches whenever it writes to the address returned
> through dax_direct_access() and when writing back dirty radix entries.
> That flushing is only required in the pmem case, so the dax_flush()
> helper skips cache management work when the underlying driver does not
> specify a flush method.
>
> We still do all the dirty tracking since the radix entry will already be
> there for locking purposes. However, the work to clean the entry will be
> a nop for some dax drivers.
>
> Cc: Jan Kara <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/dax.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index b459948de427..0933fc460ada 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -784,7 +784,7 @@ static int dax_writeback_one(struct block_device *bdev,
> }
>
> dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(pfn));
> - wb_cache_pmem(kaddr, size);
> + dax_flush(dax_dev, pgoff, kaddr, size);
> /*
> * After we have flushed the cache, we can clear the dirty tag. There
> * cannot be new dirty data in the pfn after the flush has completed as
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
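The "nop for some dax drivers" behavior described in the patch above can be modeled in plain userspace C. The sketch below is hypothetical illustration only (the struct names follow the patch, but the bodies and the byte counter are invented): a pmem-backed device supplies a flush callback, a volatile-memory device leaves it NULL, and dax_flush() silently skips cache management for the latter.

```c
#include <stddef.h>

/*
 * Hypothetical userspace model of the optional ->flush dax operation.
 * A pmem-backed device provides a flush callback; a driver fronting
 * volatile memory (e.g. brd) leaves it NULL, making dax_flush() a nop.
 */
struct dax_operations {
	void (*flush)(void *addr, size_t size);
};

struct dax_device {
	const struct dax_operations *ops;
};

static size_t flushed_bytes;	/* records work done, for illustration */

static void pmem_dax_flush(void *addr, size_t size)
{
	(void)addr;
	flushed_bytes += size;	/* real driver: arch_wb_cache_pmem() */
}

static void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
{
	/* drivers fronting volatile memory skip cache management */
	if (dax_dev->ops->flush)
		dax_dev->ops->flush(addr, size);
}

static const struct dax_operations pmem_ops = { .flush = pmem_dax_flush };
static const struct dax_operations brd_ops = { .flush = NULL };
```

In this model the dirty tracking above dax_flush() is unchanged; only the cache work becomes conditional, matching the commit message.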

2017-06-14 10:57:34

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 04/14] dax, pmem: introduce an optional 'flush' dax_operation

On Fri 09-06-17 13:24:07, Dan Williams wrote:
> Filesystem-DAX flushes caches whenever it writes to the address returned
> through dax_direct_access() and when writing back dirty radix entries.
> That flushing is only required in the pmem case, so add a dax operation
> to allow pmem to take this extra action, but skip it for other dax
> capable devices that do not provide a flush routine.
>
> An example for this differentiation might be a volatile ram disk where
> there is no expectation of persistence. In fact the pmem driver itself might
> front such an address range specified by the NFIT. So, this "no flush"
> property might be something passed down by the bus / libnvdimm.
>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> drivers/nvdimm/pmem.c | 7 +++++++
> include/linux/dax.h | 2 ++
> 2 files changed, 9 insertions(+)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 2f3aefe565c6..823b07774244 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -242,9 +242,16 @@ static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
> return copy_from_iter_flushcache(addr, bytes, i);
> }
>
> +static void pmem_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff,
> + void *addr, size_t size)
> +{
> + wb_cache_pmem(addr, size);
> +}
> +
> static const struct dax_operations pmem_dax_ops = {
> .direct_access = pmem_dax_direct_access,
> .copy_from_iter = pmem_copy_from_iter,
> + .flush = pmem_dax_flush,
> };
>
> static void pmem_release_queue(void *q)
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 28e398f8c59e..407dd3ff6e54 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -19,6 +19,8 @@ struct dax_operations {
> /* copy_from_iter: dax-driver override for default copy_from_iter */
> size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
> struct iov_iter *);
> + /* flush: optional driver-specific cache management after writes */
> + void (*flush)(struct dax_device *, pgoff_t, void *, size_t);
> };
>
> #if IS_ENABLED(CONFIG_DAX)
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-06-14 10:58:37

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 03/14] filesystem-dax: convert to dax_copy_from_iter()

On Fri 09-06-17 13:24:02, Dan Williams wrote:
> Now that all possible providers of the dax_operations copy_from_iter
> method are implemented, switch filesystem-dax to call the driver rather
> than copy_from_iter_pmem.
>
> Signed-off-by: Dan Williams <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> arch/x86/include/asm/pmem.h | 50 -------------------------------------------
> fs/dax.c | 3 ++-
> include/linux/pmem.h | 24 ---------------------
> 3 files changed, 2 insertions(+), 75 deletions(-)
>
> diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
> index 0ff8fe71b255..60e8edbe0205 100644
> --- a/arch/x86/include/asm/pmem.h
> +++ b/arch/x86/include/asm/pmem.h
> @@ -66,56 +66,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
> }
>
> /**
> - * arch_copy_from_iter_pmem - copy data from an iterator to PMEM
> - * @addr: PMEM destination address
> - * @bytes: number of bytes to copy
> - * @i: iterator with source data
> - *
> - * Copy data from the iterator 'i' to the PMEM buffer starting at 'addr'.
> - */
> -static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
> - struct iov_iter *i)
> -{
> - size_t len;
> -
> - /* TODO: skip the write-back by always using non-temporal stores */
> - len = copy_from_iter_nocache(addr, bytes, i);
> -
> - /*
> - * In the iovec case on x86_64 copy_from_iter_nocache() uses
> - * non-temporal stores for the bulk of the transfer, but we need
> - * to manually flush if the transfer is unaligned. A cached
> - * memory copy is used when destination or size is not naturally
> - * aligned. That is:
> - * - Require 8-byte alignment when size is 8 bytes or larger.
> - * - Require 4-byte alignment when size is 4 bytes.
> - *
> - * In the non-iovec case the entire destination needs to be
> - * flushed.
> - */
> - if (iter_is_iovec(i)) {
> - unsigned long flushed, dest = (unsigned long) addr;
> -
> - if (bytes < 8) {
> - if (!IS_ALIGNED(dest, 4) || (bytes != 4))
> - arch_wb_cache_pmem(addr, bytes);
> - } else {
> - if (!IS_ALIGNED(dest, 8)) {
> - dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
> - arch_wb_cache_pmem(addr, 1);
> - }
> -
> - flushed = dest - (unsigned long) addr;
> - if (bytes > flushed && !IS_ALIGNED(bytes - flushed, 8))
> - arch_wb_cache_pmem(addr + bytes - 1, 1);
> - }
> - } else
> - arch_wb_cache_pmem(addr, bytes);
> -
> - return len;
> -}
> -
> -/**
> * arch_clear_pmem - zero a PMEM memory range
> * @addr: virtual start address
> * @size: number of bytes to zero
> diff --git a/fs/dax.c b/fs/dax.c
> index 2a6889b3585f..b459948de427 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1054,7 +1054,8 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> map_len = end - pos;
>
> if (iov_iter_rw(iter) == WRITE)
> - map_len = copy_from_iter_pmem(kaddr, map_len, iter);
> + map_len = dax_copy_from_iter(dax_dev, pgoff, kaddr,
> + map_len, iter);
> else
> map_len = copy_to_iter(kaddr, map_len, iter);
> if (map_len <= 0) {
> diff --git a/include/linux/pmem.h b/include/linux/pmem.h
> index 71ecf3d46aac..9d542a5600e4 100644
> --- a/include/linux/pmem.h
> +++ b/include/linux/pmem.h
> @@ -31,13 +31,6 @@ static inline void arch_memcpy_to_pmem(void *dst, const void *src, size_t n)
> BUG();
> }
>
> -static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
> - struct iov_iter *i)
> -{
> - BUG();
> - return 0;
> -}
> -
> static inline void arch_clear_pmem(void *addr, size_t size)
> {
> BUG();
> @@ -80,23 +73,6 @@ static inline void memcpy_to_pmem(void *dst, const void *src, size_t n)
> }
>
> /**
> - * copy_from_iter_pmem - copy data from an iterator to PMEM
> - * @addr: PMEM destination address
> - * @bytes: number of bytes to copy
> - * @i: iterator with source data
> - *
> - * Copy data from the iterator 'i' to the PMEM buffer starting at 'addr'.
> - * See blkdev_issue_flush() note for memcpy_to_pmem().
> - */
> -static inline size_t copy_from_iter_pmem(void *addr, size_t bytes,
> - struct iov_iter *i)
> -{
> - if (arch_has_pmem_api())
> - return arch_copy_from_iter_pmem(addr, bytes, i);
> - return copy_from_iter_nocache(addr, bytes, i);
> -}
> -
> -/**
> * clear_pmem - zero a PMEM memory range
> * @addr: virtual start address
> * @size: number of bytes to zero
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-06-14 16:49:32

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 08/14] x86, dax, libnvdimm: move wb_cache_pmem() to libnvdimm

On Wed, Jun 14, 2017 at 3:54 AM, Jan Kara <[email protected]> wrote:
> On Fri 09-06-17 13:24:29, Dan Williams wrote:
>> With all calls to this routine re-directed through the pmem driver, we can kill
>> the pmem api indirection. arch_wb_cache_pmem() is now optionally supplied by
>> the arch specific asm/pmem.h. Same as before, pmem flushing is only defined
>> for x86_64, but it is straightforward to add other archs in the future.
>>
>> Cc: <[email protected]>
>> Cc: Jan Kara <[email protected]>
>> Cc: Jeff Moyer <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Christoph Hellwig <[email protected]>
>> Cc: "H. Peter Anvin" <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Oliver O'Halloran <[email protected]>
>> Cc: Matthew Wilcox <[email protected]>
>> Cc: Ross Zwisler <[email protected]>
>> Signed-off-by: Dan Williams <[email protected]>
>
> Looks good to me. Just one question below...
>
>> -/**
>> - * arch_wb_cache_pmem - write back a cache range with CLWB
>> - * @vaddr: virtual start address
>> - * @size: number of bytes to write back
>> - *
>> - * Write back a cache range using the CLWB (cache line write back)
>> - * instruction. Note that @size is internally rounded up to be cache
>> - * line size aligned.
>> - */
>> static inline void arch_wb_cache_pmem(void *addr, size_t size)
>> {
>> - u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
>> - unsigned long clflush_mask = x86_clflush_size - 1;
>> - void *vend = addr + size;
>> - void *p;
>> -
>> - for (p = (void *)((unsigned long)addr & ~clflush_mask);
>> - p < vend; p += x86_clflush_size)
>> - clwb(p);
>> + clean_cache_range(addr,size);
>> }
>
> So this will make compilation break on 32-bit x86 as it does not define
> clean_cache_range(). Do we enforce somewhere that we are on x86_64 when pmem
> is enabled?

Yes, this is enforced by:

select ARCH_HAS_PMEM_API if X86_64

...in arch/x86/Kconfig. We fall back to a dummy arch_wb_cache_pmem()
implementation and emit this warning for !ARCH_HAS_PMEM_API archs:

"nd_pmem namespace0.0: unable to guarantee persistence of writes"

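The compile-time gating Dan describes can be modeled outside the kernel. The sketch below is a hypothetical userspace illustration (the macro name and counter are invented, and the real selection happens in Kconfig, not via a compiler define): when the arch does not provide the pmem API, arch_wb_cache_pmem() collapses to an empty inline and persistence cannot be guaranteed.

```c
#include <stddef.h>

/*
 * Hypothetical model of the CONFIG_ARCH_HAS_PMEM_API fallback.
 * Compile with -DARCH_HAS_PMEM_API to model the x86_64 case; without
 * it, the nop fallback is used and the driver can only warn.
 */
static size_t wb_calls;		/* counts write-backs actually performed */

#ifdef ARCH_HAS_PMEM_API
static inline int arch_has_pmem_api(void) { return 1; }
static inline void arch_wb_cache_pmem(void *addr, size_t size)
{
	(void)addr;
	(void)size;
	wb_calls++;		/* real kernel: clwb loop over cache lines */
}
#else
static inline int arch_has_pmem_api(void) { return 0; }
static inline void arch_wb_cache_pmem(void *addr, size_t size)
{
	(void)addr;		/* nop: no arch support for cache flushing */
	(void)size;
}
#endif
```

Built without the define, the fallback compiles to nothing, mirroring how a !ARCH_HAS_PMEM_API kernel still links but cannot flush.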
2017-06-14 16:49:53

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 13/14] filesystem-dax: gate calls to dax_flush() on QUEUE_FLAG_WC

On Wed, Jun 14, 2017 at 3:46 AM, Jan Kara <[email protected]> wrote:
> On Fri 09-06-17 13:24:56, Dan Williams wrote:
>> Some platforms arrange for cpu caches to be flushed on power-fail. On
>> those platforms there is no requirement that the kernel track and flush
>> potentially dirty cache lines. Given that we still insert entries into
>> the radix for locking purposes this patch only disables the cache flush
>> loop, not the dirty tracking.
>>
>> Userspace can override the default cache setting via the block device
>> queue "write_cache" attribute in sysfs.
>>
>> Cc: Jan Kara <[email protected]>
>> Cc: Jeff Moyer <[email protected]>
>> Cc: Christoph Hellwig <[email protected]>
>> Cc: Matthew Wilcox <[email protected]>
>> Cc: Ross Zwisler <[email protected]>
>> Signed-off-by: Dan Williams <[email protected]>
>
> ...
>
>> - dax_flush(dax_dev, pgoff, kaddr, size);
>> + if (test_bit(QUEUE_FLAG_WC, &bdev->bd_queue->queue_flags))
>> + dax_flush(dax_dev, pgoff, kaddr, size);
>
> IMHO the check belongs in dax_flush(), similarly to how blkdev_issue_flush()
> silently handles whether the flush is actually needed or not.

Looks good, will do.

2017-06-14 23:17:55

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 13/14] libnvdimm, pmem: gate cache management on QUEUE_FLAG_WC in pmem_dax_flush()

Some platforms arrange for cpu caches to be flushed on power-fail. On
those platforms there is no requirement that the kernel track and flush
potentially dirty cache lines. Given that we still insert entries into
the radix for locking purposes this patch only disables the cache flush
loop, not the dirty tracking.

Userspace can override the default cache setting via the block device
queue "write_cache" attribute in sysfs.

Cc: Jeff Moyer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ross Zwisler <[email protected]>
Suggested-by: Jan Kara <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
Changes since v3:
* move the check of QUEUE_FLAG_WC into the pmem driver directly (Jan)

drivers/nvdimm/pmem.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 06f6c27ec1e9..49938b246a7b 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -244,7 +244,16 @@ static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
static void pmem_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t size)
{
- arch_wb_cache_pmem(addr, size);
+ struct pmem_device *pmem = dax_get_private(dax_dev);
+ struct gendisk *disk = pmem->disk;
+ struct request_queue *q = disk->queue;
+
+ /*
+ * Only perform cache management when the queue has caching
+ * enabled.
+ */
+ if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
+ arch_wb_cache_pmem(addr, size);
}

static const struct dax_operations pmem_dax_ops = {

2017-06-15 00:46:19

by Kani, Toshimitsu

[permalink] [raw]
Subject: Re: [PATCH v3 02/14] dm: add ->copy_from_iter() dax operation support

On Fri, 2017-06-09 at 13:23 -0700, Dan Williams wrote:
> Allow device-mapper to route copy_from_iter operations to the
> per-target implementation. In order for the device stacking to work
> we
> need a dax_dev and a pgoff relative to that device. This gives each
> layer of the stack the information it needs to look up the operation
> pointer for the next level.
>
> This conceptually allows for an array of mixed device drivers with
> varying copy_from_iter implementations.
>
> Cc: Toshi Kani <[email protected]>
> Reviewed-by: Mike Snitzer <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>

I was worried about possible overhead with additional stub calls, but
it looks fine with a single thread fio write test with direct=1.

 92.62%  [kernel.kallsyms]   [k] __copy_user_nocache
  0.04%  [kernel.kallsyms]   [k] entry_SYSCALL_64_fastpath
  0.08%  libpthread-2.22.so  [.] __GI___libc_write
  0.01%  [kernel.kallsyms]   [k] sys_write
  0.02%  [kernel.kallsyms]   [k] vfs_write
  0.02%  [kernel.kallsyms]   [k] __vfs_write
  0.02%  [kernel.kallsyms]   [k] ext4_file_write_iter
  0.02%  [kernel.kallsyms]   [k] dax_iomap_rw
  0.03%  [kernel.kallsyms]   [k] iomap_apply
  0.04%  [kernel.kallsyms]   [k] dax_iomap_actor
  0.01%  [kernel.kallsyms]   [k] dax_copy_from_iter
  0.01%  [kernel.kallsyms]   [k] dm_dax_copy_from_iter
  0.01%  [kernel.kallsyms]   [k] linear_dax_copy_from_iter
  0.03%  [kernel.kallsyms]   [k] copy_from_iter_flushcache
  0.00%  [kernel.kallsyms]   [k] pmem_copy_from_iter

Multi-thread fio test hits hard in inode_lock(), no contention from the
dm-layer.

Reviewed-by: Toshi Kani <[email protected]>

Thanks,
-Toshi
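The per-target routing that Toshi profiled above (dax_copy_from_iter → dm_dax_copy_from_iter → linear_dax_copy_from_iter → pmem_copy_from_iter) can be sketched as a hypothetical userspace model. Names loosely follow the call chain, but the iov_iter is reduced to a plain source buffer and the offset translation is invented for illustration: each stacking layer remaps pgoff into its backing device and dispatches one level down.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of device-mapper copy_from_iter stacking: the dm
 * layer translates (dax_dev, pgoff) to the underlying target and calls
 * that target's copy_from_iter.
 */
struct dax_device;

typedef size_t (*copy_from_iter_fn)(struct dax_device *dev, unsigned long pgoff,
				    void *addr, size_t bytes, const void *src);

struct dax_device {
	copy_from_iter_fn copy_from_iter;
	struct dax_device *target;	/* next device down the stack, if any */
	unsigned long target_offset;	/* pgoff adjustment into the target */
};

static size_t pmem_copy_from_iter(struct dax_device *dev, unsigned long pgoff,
				  void *addr, size_t bytes, const void *src)
{
	(void)dev;
	(void)pgoff;
	memcpy(addr, src, bytes);	/* real driver: copy_from_iter_flushcache() */
	return bytes;
}

static size_t dm_linear_copy_from_iter(struct dax_device *dev, unsigned long pgoff,
				       void *addr, size_t bytes, const void *src)
{
	/* remap pgoff into the backing device and dispatch one level down */
	struct dax_device *t = dev->target;

	return t->copy_from_iter(t, pgoff + dev->target_offset, addr, bytes, src);
}
```

Each level is one indirect call plus an offset add, which is consistent with the small per-call overhead seen in the profile.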

2017-06-15 01:21:14

by Kani, Toshimitsu

[permalink] [raw]
Subject: Re: [PATCH v3 02/14] dm: add ->copy_from_iter() dax operation support

On Wed, 2017-06-14 at 18:45 -0600, Toshi Kani wrote:
> On Fri, 2017-06-09 at 13:23 -0700, Dan Williams wrote:
> > Allow device-mapper to route copy_from_iter operations to the
> > per-target implementation. In order for the device stacking to work
> > we need a dax_dev and a pgoff relative to that device. This gives
> > each layer of the stack the information it needs to look up the
> > operation pointer for the next level.
> >
> > This conceptually allows for an array of mixed device drivers with
> > varying copy_from_iter implementations.
> >
> > Cc: Toshi Kani <[email protected]>
> > Reviewed-by: Mike Snitzer <[email protected]>
> > Signed-off-by: Dan Williams <[email protected]>
>
> I was worried about possible overhead with additional stub calls, but
> it looks fine with a single thread fio write test with direct=1.
>
>  92.62%  [kernel.kallsyms]   [k] __copy_user_nocache
>   0.04%  [kernel.kallsyms]   [k] entry_SYSCALL_64_fastpath
>   0.08%  libpthread-2.22.so  [.] __GI___libc_write
>   0.01%  [kernel.kallsyms]   [k] sys_write
>   0.02%  [kernel.kallsyms]   [k] vfs_write
>   0.02%  [kernel.kallsyms]   [k] __vfs_write
>   0.02%  [kernel.kallsyms]   [k] ext4_file_write_iter
>   0.02%  [kernel.kallsyms]   [k] dax_iomap_rw
>   0.03%  [kernel.kallsyms]   [k] iomap_apply
>   0.04%  [kernel.kallsyms]   [k] dax_iomap_actor
>   0.01%  [kernel.kallsyms]   [k] dax_copy_from_iter
>   0.01%  [kernel.kallsyms]   [k] dm_dax_copy_from_iter
>   0.01%  [kernel.kallsyms]   [k] linear_dax_copy_from_iter
>   0.03%  [kernel.kallsyms]   [k] copy_from_iter_flushcache
>   0.00%  [kernel.kallsyms]   [k] pmem_copy_from_iter

I had bs=256k, which was too big for this test. bs=4k result is not
this pretty at all, only 23% in __copy_user_nocache. This change
accounts for approx. 1% with 4k. Given we have larger overheads in
many other functions in the path, the change looks acceptable (I keep
my Reviewed-by). I'd prefer to reduce code in the path, though.

Thanks,
-Toshi

2017-06-15 01:45:00

by Kani, Toshimitsu

[permalink] [raw]
Subject: Re: [PATCH v3 05/14] dm: add ->flush() dax operation support

On Fri, 2017-06-09 at 13:24 -0700, Dan Williams wrote:
> Allow device-mapper to route flush operations to the
> per-target implementation. In order for the device stacking to work
> we need a dax_dev and a pgoff relative to that device. This gives
> each layer of the stack the information it needs to look up the
> operation pointer for the next level.
>
> This conceptually allows for an array of mixed device drivers with
> varying flush implementations.
>
> Cc: Toshi Kani <[email protected]>
> Reviewed-by: Mike Snitzer <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>

Looks good to me.

Reviewed-by: Toshi Kani <[email protected]>

Thanks,
-Toshi

2017-06-15 08:09:16

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 13/14] libnvdimm, pmem: gate cache management on QUEUE_FLAG_WC in pmem_dax_flush()

On Wed 14-06-17 16:11:26, Dan Williams wrote:
> Some platforms arrange for cpu caches to be flushed on power-fail. On
> those platforms there is no requirement that the kernel track and flush
> potentially dirty cache lines. Given that we still insert entries into
> the radix for locking purposes this patch only disables the cache flush
> loop, not the dirty tracking.
>
> Userspace can override the default cache setting via the block device
> queue "write_cache" attribute in sysfs.
>
> Cc: Jeff Moyer <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Suggested-by: Jan Kara <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> Changes since v3:
> * move the check of QUEUE_FLAG_WC into the pmem driver directly (Jan)
>
> drivers/nvdimm/pmem.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 06f6c27ec1e9..49938b246a7b 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -244,7 +244,16 @@ static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
> static void pmem_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff,
> void *addr, size_t size)
> {
> - arch_wb_cache_pmem(addr, size);
> + struct pmem_device *pmem = dax_get_private(dax_dev);
> + struct gendisk *disk = pmem->disk;
> + struct request_queue *q = disk->queue;
> +
> + /*
> + * Only perform cache management when the queue has caching
> + * enabled.
> + */
> + if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
> + arch_wb_cache_pmem(addr, size);
> }
>
> static const struct dax_operations pmem_dax_ops = {
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
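
[Editor's note: a sketch of the userspace override the patch description
mentions. The device name pmem0 is a placeholder; the attribute path and
its values follow the block-layer queue sysfs ABI. Writing "write through"
tells the kernel no cache flushing is needed, so only do this on platforms
that actually flush caches on power failure.]

```shell
# Hypothetical device name; adjust to the pmem device on your system.
attr=/sys/block/pmem0/queue/write_cache
if [ -w "$attr" ]; then
    cat "$attr"                       # reports "write back" or "write through"
    echo "write through" > "$attr"    # claim caches survive power loss
    status=changed
else
    status=skipped                    # no writable pmem queue attribute here
fi
```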

2017-06-15 08:11:28

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 08/14] x86, dax, libnvdimm: move wb_cache_pmem() to libnvdimm

On Wed 14-06-17 09:49:29, Dan Williams wrote:
> On Wed, Jun 14, 2017 at 3:54 AM, Jan Kara <[email protected]> wrote:
> >> -/**
> >> - * arch_wb_cache_pmem - write back a cache range with CLWB
> >> - * @vaddr: virtual start address
> >> - * @size: number of bytes to write back
> >> - *
> >> - * Write back a cache range using the CLWB (cache line write back)
> >> - * instruction. Note that @size is internally rounded up to be cache
> >> - * line size aligned.
> >> - */
> >> static inline void arch_wb_cache_pmem(void *addr, size_t size)
> >> {
> >> - u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
> >> - unsigned long clflush_mask = x86_clflush_size - 1;
> >> - void *vend = addr + size;
> >> - void *p;
> >> -
> >> - for (p = (void *)((unsigned long)addr & ~clflush_mask);
> >> - p < vend; p += x86_clflush_size)
> >> - clwb(p);
> >> + clean_cache_range(addr,size);
> >> }
> >
> > So this will make compilation break on 32-bit x86 as it does not define
> > clean_cache_range(). Do we somewhere force we are on x86_64 when pmem is
> > enabled?
>
> Yes, this is enforced by:
>
> select ARCH_HAS_PMEM_API if X86_64
>
> ...in arch/x86/Kconfig. We fall back to a dummy arch_wb_cache_pmem()
> implementation and emit this warning for !ARCH_HAS_PMEM_API archs:
>
> "nd_pmem namespace0.0: unable to guarantee persistence of writes"

Aha, right. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-06-18 08:28:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 01/14] x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations

On Fri, Jun 09, 2017 at 01:23:51PM -0700, Dan Williams wrote:
> Implement a __copy_from_user_inatomic_flushcache, memcpy_page_flushcache, and
> memcpy_flushcache, that guarantee that the destination buffer is not dirty in
> the cpu cache on completion. The new copy_from_iter_flushcache and sub-routines

Wouldn't writethrough be a better name?

> will be used to replace the "pmem api" (include/linux/pmem.h +
> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_flushcache()
> and memcpy_flushcache() are gated by the CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
> config symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
> otherwise.

What is UACCESS about memcpy_flushcache?

2017-06-18 08:37:06

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 02/14] dm: add ->copy_from_iter() dax operation support

> +size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
> + size_t bytes, struct iov_iter *i)
> +{
> + if (!dax_alive(dax_dev))
> + return 0;
> +
> + if (!dax_dev->ops->copy_from_iter)
> + return copy_from_iter(addr, bytes, i);
> + return dax_dev->ops->copy_from_iter(dax_dev, pgoff, addr, bytes, i);
> +}
> +EXPORT_SYMBOL_GPL(dax_copy_from_iter);

Can you remove the fallbacks after this series so that we have
a clean abstraction?

2017-06-18 08:40:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 08/14] x86, dax, libnvdimm: move wb_cache_pmem() to libnvdimm

> +void clean_cache_range(void *addr, size_t size);
>
> static inline int
> __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> index f42d2fd86ca3..baa80ff29da8 100644
> --- a/arch/x86/lib/usercopy_64.c
> +++ b/arch/x86/lib/usercopy_64.c
> @@ -85,7 +85,7 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
> * instruction. Note that @size is internally rounded up to be cache
> * line size aligned.
> */
> -static void clean_cache_range(void *addr, size_t size)
> +void clean_cache_range(void *addr, size_t size)

Can you keep clean_cache_range private please? Just add
arch_wb_cache_pmem to usercopy_64.c just behind it so that the
compiler can tail-call and export that instead.

> --- a/drivers/nvdimm/pmem.h
> +++ b/drivers/nvdimm/pmem.h
> @@ -4,6 +4,13 @@
> #include <linux/types.h>
> #include <linux/pfn_t.h>
> #include <linux/fs.h>
> +#include <asm/pmem.h>
> +
> +#ifndef CONFIG_ARCH_HAS_PMEM_API
> +static inline void arch_wb_cache_pmem(void *addr, size_t size)
> +{
> +}
> +#endif

And our normal Linux style would be to have this in linux/pmem.h,
which should always be included for the asm version.

2017-06-18 08:46:02

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 13/14] filesystem-dax: gate calls to dax_flush() on QUEUE_FLAG_WC

On Fri, Jun 09, 2017 at 01:24:56PM -0700, Dan Williams wrote:
> Some platforms arrange for cpu caches to be flushed on power-fail. On
> those platforms there is no requirement that the kernel track and flush
> potentially dirty cache lines. Given that we still insert entries into
> the radix for locking purposes this patch only disables the cache flush
> loop, not the dirty tracking.
>
> Userspace can override the default cache setting via the block device
> queue "write_cache" attribute in sysfs.

NAK. Please stop using the block infrastructure for dax values. Have
your own flag and sysfs file in the dax infrastructure and only propagate
it to the block layer for the block devices using dax.

2017-06-19 02:02:58

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 01/14] x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations

On Sun, Jun 18, 2017 at 1:28 AM, Christoph Hellwig <[email protected]> wrote:
> On Fri, Jun 09, 2017 at 01:23:51PM -0700, Dan Williams wrote:
>> Implement a __copy_from_user_inatomic_flushcache, memcpy_page_flushcache, and
>> memcpy_flushcache, that guarantee that the destination buffer is not dirty in
>> the cpu cache on completion. The new copy_from_iter_flushcache and sub-routines
>
> Wouldn't writethrough be a better name?

I started with _writethrough, Ingo suggested _wt, and then Toshi
rightly pointed out that _wt might lead applications to assume that
the involved cache-lines are valid on return, which may not be true. So
we settled on _flushcache in this thread [1].

[1]: https://lkml.org/lkml/2017/5/6/99

>> will be used to replace the "pmem api" (include/linux/pmem.h +
>> arch/x86/include/asm/pmem.h). The availability of copy_from_iter_flushcache()
>> and memcpy_flushcache() are gated by the CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
>> config symbol, and fall back to copy_from_iter_nocache() and plain memcpy()
>> otherwise.
>
> What is UACCESS about memcpy_flushcache?

The uaccess part comes from the fact that the conversion provides all
the _flushcache versions of the copy_from_iter() operations
(__copy_from_user_flushcache, memcpy_page_flushcache,
memcpy_flushcache). It also stems from Al asking that
__copy_user_flushcache() live in arch/x86/lib/usercopy_64.c. That
said, I wouldn't object to a different name for the config symbol.

2017-06-19 02:05:01

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 02/14] dm: add ->copy_from_iter() dax operation support

On Sun, Jun 18, 2017 at 1:37 AM, Christoph Hellwig <[email protected]> wrote:
>> +size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
>> + size_t bytes, struct iov_iter *i)
>> +{
>> + if (!dax_alive(dax_dev))
>> + return 0;
>> +
>> + if (!dax_dev->ops->copy_from_iter)
>> + return copy_from_iter(addr, bytes, i);
>> + return dax_dev->ops->copy_from_iter(dax_dev, pgoff, addr, bytes, i);
>> +}
>> +EXPORT_SYMBOL_GPL(dax_copy_from_iter);
>
> Can you remove the fallbacks after this series so that we have
> a clean abstraction?

You mean update all implementations to register copy_from_iter() as
their default op rather than working around a NULL op in the core?
Yeah, I can do that.

2017-06-19 02:06:18

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 08/14] x86, dax, libnvdimm: move wb_cache_pmem() to libnvdimm

On Sun, Jun 18, 2017 at 1:40 AM, Christoph Hellwig <[email protected]> wrote:
>> +void clean_cache_range(void *addr, size_t size);
>>
>> static inline int
>> __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
>> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
>> index f42d2fd86ca3..baa80ff29da8 100644
>> --- a/arch/x86/lib/usercopy_64.c
>> +++ b/arch/x86/lib/usercopy_64.c
>> @@ -85,7 +85,7 @@ copy_user_handle_tail(char *to, char *from, unsigned len)
>> * instruction. Note that @size is internally rounded up to be cache
>> * line size aligned.
>> */
>> -static void clean_cache_range(void *addr, size_t size)
>> +void clean_cache_range(void *addr, size_t size)
>
> Can you keep clean_cache_range private please? Just add
> arch_wb_cache_pmem to usercopy_64.c just behind it so that the
> compiler can tail-call and export that instead.
>
>> --- a/drivers/nvdimm/pmem.h
>> +++ b/drivers/nvdimm/pmem.h
>> @@ -4,6 +4,13 @@
>> #include <linux/types.h>
>> #include <linux/pfn_t.h>
>> #include <linux/fs.h>
>> +#include <asm/pmem.h>
>> +
>> +#ifndef CONFIG_ARCH_HAS_PMEM_API
>> +static inline void arch_wb_cache_pmem(void *addr, size_t size)
>> +{
>> +}
>> +#endif
>
> And our normal Linux style would be to have this in linux/pmem.h,
> which should always be included for the asm version.

Ok, will do.

2017-06-19 02:07:14

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v3 13/14] filesystem-dax: gate calls to dax_flush() on QUEUE_FLAG_WC

On Sun, Jun 18, 2017 at 1:45 AM, Christoph Hellwig <[email protected]> wrote:
> On Fri, Jun 09, 2017 at 01:24:56PM -0700, Dan Williams wrote:
>> Some platforms arrange for cpu caches to be flushed on power-fail. On
>> those platforms there is no requirement that the kernel track and flush
>> potentially dirty cache lines. Given that we still insert entries into
>> the radix for locking purposes this patch only disables the cache flush
>> loop, not the dirty tracking.
>>
>> Userspace can override the default cache setting via the block device
>> queue "write_cache" attribute in sysfs.
>
> NAK. Please stop using the block infrastructure for dax values. Have
> your own flag and sysfs file in the dax infrastructure and only propagate
> it to the block layer for the block devices using dax.

Ok, that makes sense.