2015-08-26 01:33:10

by Dan Williams

Subject: [PATCH v2 0/9] initial struct page support for pmem

Changes since v1 [1]:

1/ Several simplifications from Christoph, including dropping the
__pfn_t dependency and merging ZONE_DEVICE into the base
arch_add_memory() implementation.

2/ Drop the deeper changes to the memory hotplug code that enabled
allocating the backing 'struct page' array from pmem (struct
vmem_altmap). That functionality will still be needed when
large-capacity PMEM devices arrive. However, for now we can take this
simple step to enable struct page mapping in RAM, and enable it by
default for small-capacity CONFIG_X86_PMEM_LEGACY devices.

3/ A rework of the PMEM API to allow use of the non-temporal
memcpy_to_pmem() implementation even on platforms without pcommit
instruction support (see the sketch below).

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-August/001809.html
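
For reference, a minimal sketch of the caller pattern this rework
preserves (example_write() is hypothetical; memcpy_to_pmem() and
wmb_pmem() are the helpers reworked in patch 5/9):

/*
 * Hypothetical driver write path: memcpy_to_pmem() is now callable on
 * any configuration, and wmb_pmem() degrades to a plain wmb() when
 * durability cannot be guaranteed (no pcommit).
 */
static void example_write(void __pmem *dst, const void *src, size_t n)
{
	memcpy_to_pmem(dst, src, n);
	wmb_pmem();
}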

---

When we last left this debate [2], it was becoming clear that the
'page-less' approach left too many I/O scenarios off the table. The
page-less enabling is still useful for avoiding the overhead of struct
page where it is not needed, but in the end, page-backed persistent
memory seems to be a requirement. We confirmed as much at the recently
concluded Persistent Memory Microconference at Linux Plumbers.

Whereas the initial RFC of this functionality let userspace pick
whether struct page is allocated from RAM or PMEM, this new version
only enables RAM-backed struct page for now. This is suitable for
existing NVDIMM devices and is a starting point for incrementally
building "allocate struct page from PMEM" support.

[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-May/000748.html

---

Christoph Hellwig (2):
mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
add devm_memremap_pages

Dan Williams (7):
dax: drop size parameter to ->direct_access()
mm: ZONE_DEVICE for "device memory"
x86, pmem: push fallback handling to arch code
libnvdimm, pfn: 'struct page' provider infrastructure
libnvdimm, pmem: 'struct page' for pmem
libnvdimm, pmem: direct map legacy pmem by default
devm_memremap_pages: protect against pmem device unbind


arch/arm/include/asm/memory.h | 6 -
arch/arm64/include/asm/memory.h | 6 -
arch/ia64/mm/init.c | 4
arch/powerpc/mm/mem.c | 4
arch/powerpc/sysdev/axonram.c | 2
arch/s390/mm/init.c | 2
arch/sh/mm/init.c | 5 -
arch/tile/mm/init.c | 2
arch/unicore32/include/asm/memory.h | 6 -
arch/x86/include/asm/io.h | 2
arch/x86/include/asm/pmem.h | 41 ++++
arch/x86/mm/init_32.c | 4
arch/x86/mm/init_64.c | 4
drivers/acpi/nfit.c | 2
drivers/block/brd.c | 6 -
drivers/nvdimm/Kconfig | 23 ++
drivers/nvdimm/Makefile | 2
drivers/nvdimm/btt.c | 6 -
drivers/nvdimm/btt_devs.c | 172 +-----------------
drivers/nvdimm/claim.c | 201 +++++++++++++++++++++
drivers/nvdimm/e820.c | 1
drivers/nvdimm/namespace_devs.c | 62 +++++-
drivers/nvdimm/nd-core.h | 9 +
drivers/nvdimm/nd.h | 59 ++++++
drivers/nvdimm/pfn.h | 35 ++++
drivers/nvdimm/pfn_devs.c | 337 +++++++++++++++++++++++++++++++++++
drivers/nvdimm/pmem.c | 220 +++++++++++++++++++++--
drivers/nvdimm/region.c | 2
drivers/nvdimm/region_devs.c | 20 ++
drivers/s390/block/dcssblk.c | 4
fs/block_dev.c | 2
include/asm-generic/memory_model.h | 6 +
include/asm-generic/pmem.h | 72 +++++++
include/linux/blkdev.h | 2
include/linux/io.h | 57 ++++++
include/linux/libnvdimm.h | 4
include/linux/memory_hotplug.h | 5 -
include/linux/mmzone.h | 23 ++
include/linux/pmem.h | 73 +-------
kernel/memremap.c | 136 ++++++++++++++
mm/Kconfig | 17 ++
mm/memory_hotplug.c | 14 +
mm/page_alloc.c | 3
tools/testing/nvdimm/Kbuild | 3
tools/testing/nvdimm/test/iomap.c | 13 +
45 files changed, 1369 insertions(+), 310 deletions(-)
create mode 100644 drivers/nvdimm/claim.c
create mode 100644 drivers/nvdimm/pfn.h
create mode 100644 drivers/nvdimm/pfn_devs.c
create mode 100644 include/asm-generic/pmem.h


2015-08-26 01:33:15

by Dan Williams

Subject: [PATCH v2 1/9] dax: drop size parameter to ->direct_access()

None of the implementations currently use it. The common
bdev_direct_access() entry point handles all the size checks before
calling ->direct_access().
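
For context, a minimal sketch of the contract at this boundary
(example_map() is a hypothetical caller; the bdev_direct_access()
signature matches the fs/block_dev.c hunk below):

/*
 * bdev_direct_access() validates 'size' and sector alignment before
 * invoking ->direct_access(), so implementations no longer need a size
 * parameter; the return value is the directly addressable byte count.
 */
static long example_map(struct block_device *bdev, sector_t sector,
		long size)
{
	void __pmem *addr;
	unsigned long pfn;

	return bdev_direct_access(bdev, sector, &addr, &pfn, size);
}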

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/powerpc/sysdev/axonram.c | 2 +-
drivers/block/brd.c | 6 +-----
drivers/nvdimm/pmem.c | 2 +-
drivers/s390/block/dcssblk.c | 4 ++--
fs/block_dev.c | 2 +-
include/linux/blkdev.h | 2 +-
6 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index a2be2a66dab6..4419c84ac15a 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -141,7 +141,7 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
*/
static long
axon_ram_direct_access(struct block_device *device, sector_t sector,
- void __pmem **kaddr, unsigned long *pfn, long size)
+ void __pmem **kaddr, unsigned long *pfn)
{
struct axon_ram_bank *bank = device->bd_disk->private_data;
loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index c96402fd1560..03c45c41bdfa 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -371,7 +371,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,

#ifdef CONFIG_BLK_DEV_RAM_DAX
static long brd_direct_access(struct block_device *bdev, sector_t sector,
- void __pmem **kaddr, unsigned long *pfn, long size)
+ void __pmem **kaddr, unsigned long *pfn)
{
struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;
@@ -384,10 +384,6 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
*kaddr = (void __pmem *)page_address(page);
*pfn = page_to_pfn(page);

- /*
- * TODO: If size > PAGE_SIZE, we could look to see if the next page in
- * the file happens to be mapped to the next page of physical RAM.
- */
return PAGE_SIZE;
}
#else
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index f3b629779266..3b5b9cb758b6 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -92,7 +92,7 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
}

static long pmem_direct_access(struct block_device *bdev, sector_t sector,
- void __pmem **kaddr, unsigned long *pfn, long size)
+ void __pmem **kaddr, unsigned long *pfn)
{
struct pmem_device *pmem = bdev->bd_disk->private_data;
size_t offset = sector << 9;
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 2c5a397b9f3e..8c027a9e4e8a 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -29,7 +29,7 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode);
static void dcssblk_release(struct gendisk *disk, fmode_t mode);
static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
- void __pmem **kaddr, unsigned long *pfn, long size);
+ void __pmem **kaddr, unsigned long *pfn);

static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";

@@ -879,7 +879,7 @@ fail:

static long
dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
- void __pmem **kaddr, unsigned long *pfn, long size)
+ void __pmem **kaddr, unsigned long *pfn)
{
struct dcssblk_dev_info *dev_info;
unsigned long offset, dev_sz;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2345a9870e2c..3831e5691b32 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -462,7 +462,7 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
sector += get_start_sect(bdev);
if (sector % (PAGE_SIZE / 512))
return -EINVAL;
- avail = ops->direct_access(bdev, sector, addr, pfn, size);
+ avail = ops->direct_access(bdev, sector, addr, pfn);
if (!avail)
return -ERANGE;
return min(avail, size);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c401ecdff9cb..c22064f326b2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1556,7 +1556,7 @@ struct block_device_operations {
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
long (*direct_access)(struct block_device *, sector_t, void __pmem **,
- unsigned long *pfn, long size);
+ unsigned long *pfn);
unsigned int (*check_events) (struct gendisk *disk,
unsigned int clearing);
/* ->media_changed() is DEPRECATED, use ->check_events() instead */

2015-08-26 01:33:22

by Dan Williams

Subject: [PATCH v2 2/9] mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h

From: Christoph Hellwig <[email protected]>

Three architectures already define these, and we'll need them
generically soon.
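
For reference, these are the usual page-shift conversions, e.g. with
PAGE_SHIFT == 12:

	unsigned long pfn = __phys_to_pfn(0x200000);	/* 0x200 */
	phys_addr_t paddr = __pfn_to_phys(pfn);		/* 0x200000 */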

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/arm/include/asm/memory.h | 6 ------
arch/arm64/include/asm/memory.h | 6 ------
arch/unicore32/include/asm/memory.h | 6 ------
include/asm-generic/memory_model.h | 6 ++++++
4 files changed, 6 insertions(+), 18 deletions(-)

diff --git a/arch/arm/include/asm/memory.h b/arch/arm/include/asm/memory.h
index b7f6fb462ea0..98d58bb04ac5 100644
--- a/arch/arm/include/asm/memory.h
+++ b/arch/arm/include/asm/memory.h
@@ -119,12 +119,6 @@
#endif

/*
- * Convert a physical address to a Page Frame Number and back
- */
-#define __phys_to_pfn(paddr) ((unsigned long)((paddr) >> PAGE_SHIFT))
-#define __pfn_to_phys(pfn) ((phys_addr_t)(pfn) << PAGE_SHIFT)
-
-/*
* Convert a page to/from a physical address
*/
#define page_to_phys(page) (__pfn_to_phys(page_to_pfn(page)))
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index f800d45ea226..d808bb688751 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -81,12 +81,6 @@
#define __phys_to_virt(x) ((unsigned long)((x) - PHYS_OFFSET + PAGE_OFFSET))

/*
- * Convert a physical address to a Page Frame Number and back
- */
-#define __phys_to_pfn(paddr) ((unsigned long)((paddr) >> PAGE_SHIFT))
-#define __pfn_to_phys(pfn) ((phys_addr_t)(pfn) << PAGE_SHIFT)
-
-/*
* Convert a page to/from a physical address
*/
#define page_to_phys(page) (__pfn_to_phys(page_to_pfn(page)))
diff --git a/arch/unicore32/include/asm/memory.h b/arch/unicore32/include/asm/memory.h
index debafc40200a..3bb0a29fd2d7 100644
--- a/arch/unicore32/include/asm/memory.h
+++ b/arch/unicore32/include/asm/memory.h
@@ -61,12 +61,6 @@
#endif

/*
- * Convert a physical address to a Page Frame Number and back
- */
-#define __phys_to_pfn(paddr) ((paddr) >> PAGE_SHIFT)
-#define __pfn_to_phys(pfn) ((pfn) << PAGE_SHIFT)
-
-/*
* Convert a page to/from a physical address
*/
#define page_to_phys(page) (__pfn_to_phys(page_to_pfn(page)))
diff --git a/include/asm-generic/memory_model.h b/include/asm-generic/memory_model.h
index 14909b0b9cae..f20f407ce45d 100644
--- a/include/asm-generic/memory_model.h
+++ b/include/asm-generic/memory_model.h
@@ -69,6 +69,12 @@
})
#endif /* CONFIG_FLATMEM/DISCONTIGMEM/SPARSEMEM */

+/*
+ * Convert a physical address to a Page Frame Number and back
+ */
+#define __phys_to_pfn(paddr) ((unsigned long)((paddr) >> PAGE_SHIFT))
+#define __pfn_to_phys(pfn) ((pfn) << PAGE_SHIFT)
+
#define page_to_pfn __page_to_pfn
#define pfn_to_page __pfn_to_page

2015-08-26 01:33:26

by Dan Williams

Subject: [PATCH v2 3/9] mm: ZONE_DEVICE for "device memory"

While pmem is usable as a block device or via DAX mappings to
userspace, there are several usage scenarios that cannot target pmem
due to its lack of struct page coverage. In preparation for "hot
plugging" pmem into the vmemmap, add ZONE_DEVICE as a new zone to tag
these pages separately from the ones that are subject to standard page
allocations. Importantly, "device memory" can be removed at will by
userspace unbinding the device's driver.

Having a separate zone keeps these pages out of the page allocator and
otherwise marks them as distinct from typical uniform memory; device
memory has different lifetime and performance characteristics than
RAM. However, since we have run out of ZONES_SHIFT bits, this
functionality currently depends on sacrificing ZONE_DMA.
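
As a sketch of the intended consumer pattern
(example_is_device_page() is hypothetical; is_dev_zone() is the helper
added by this patch):

/*
 * Distinguish ZONE_DEVICE pages from regular RAM. is_dev_zone()
 * compiles to false when CONFIG_ZONE_DEVICE is disabled.
 */
static bool example_is_device_page(struct page *page)
{
	return is_dev_zone(page_zone(page));
}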

Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jerome Glisse <[email protected]>
[hch: various simplifications in the arch interface]
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/ia64/mm/init.c | 4 ++--
arch/powerpc/mm/mem.c | 4 ++--
arch/s390/mm/init.c | 2 +-
arch/sh/mm/init.c | 5 +++--
arch/tile/mm/init.c | 2 +-
arch/x86/mm/init_32.c | 4 ++--
arch/x86/mm/init_64.c | 4 ++--
include/linux/memory_hotplug.h | 5 +++--
include/linux/mmzone.h | 23 +++++++++++++++++++++++
mm/Kconfig | 17 +++++++++++++++++
mm/memory_hotplug.c | 14 +++++++++++---
mm/page_alloc.c | 3 +++
12 files changed, 70 insertions(+), 17 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 97e48b0eefc7..1841ef69183d 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -645,7 +645,7 @@ mem_init (void)
}

#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size)
+int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
{
pg_data_t *pgdat;
struct zone *zone;
@@ -656,7 +656,7 @@ int arch_add_memory(int nid, u64 start, u64 size)
pgdat = NODE_DATA(nid);

zone = pgdat->node_zones +
- zone_for_memory(nid, start, size, ZONE_NORMAL);
+ zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
ret = __add_pages(nid, zone, start_pfn, nr_pages);

if (ret)
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 0f11819d8f1d..6571cfb05668 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -113,7 +113,7 @@ int memory_add_physaddr_to_nid(u64 start)
}
#endif

-int arch_add_memory(int nid, u64 start, u64 size)
+int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
{
struct pglist_data *pgdata;
struct zone *zone;
@@ -128,7 +128,7 @@ int arch_add_memory(int nid, u64 start, u64 size)

/* this should work for most non-highmem platforms */
zone = pgdata->node_zones +
- zone_for_memory(nid, start, size, 0);
+ zone_for_memory(nid, start, size, 0, for_device);

return __add_pages(nid, zone, start_pfn, nr_pages);
}
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 76e873748b56..48ee78be88ba 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -168,7 +168,7 @@ void __init free_initrd_mem(unsigned long start, unsigned long end)
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size)
+int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
{
unsigned long zone_start_pfn, zone_end_pfn, nr_pages;
unsigned long start_pfn = PFN_DOWN(start);
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 2790b6a64157..c1490096b863 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -485,7 +485,7 @@ void free_initrd_mem(unsigned long start, unsigned long end)
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size)
+int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
{
pg_data_t *pgdat;
unsigned long start_pfn = start >> PAGE_SHIFT;
@@ -496,7 +496,8 @@ int arch_add_memory(int nid, u64 start, u64 size)

/* We only have ZONE_NORMAL, so this is easy.. */
ret = __add_pages(nid, pgdat->node_zones +
- zone_for_memory(nid, start, size, ZONE_NORMAL),
+ zone_for_memory(nid, start, size, ZONE_NORMAL,
+ for_device),
start_pfn, nr_pages);
if (unlikely(ret))
printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
index 5bd252e3fdc5..d4e1fc41d06d 100644
--- a/arch/tile/mm/init.c
+++ b/arch/tile/mm/init.c
@@ -863,7 +863,7 @@ void __init mem_init(void)
* memory to the highmem for now.
*/
#ifndef CONFIG_NEED_MULTIPLE_NODES
-int arch_add_memory(u64 start, u64 size)
+int arch_add_memory(u64 start, u64 size, bool for_device)
{
struct pglist_data *pgdata = &contig_page_data;
struct zone *zone = pgdata->node_zones + MAX_NR_ZONES-1;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 8340e45c891a..2a9237d20a70 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -822,11 +822,11 @@ void __init mem_init(void)
}

#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size)
+int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
{
struct pglist_data *pgdata = NODE_DATA(nid);
struct zone *zone = pgdata->node_zones +
- zone_for_memory(nid, start, size, ZONE_HIGHMEM);
+ zone_for_memory(nid, start, size, ZONE_HIGHMEM, for_device);
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3fba623e3ba5..30564e2752d3 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -687,11 +687,11 @@ static void update_end_of_memory_vars(u64 start, u64 size)
* Memory is added always to NORMAL zone. This means you will never get
* additional DMA/DMA32 memory.
*/
-int arch_add_memory(int nid, u64 start, u64 size)
+int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
{
struct pglist_data *pgdat = NODE_DATA(nid);
struct zone *zone = pgdat->node_zones +
- zone_for_memory(nid, start, size, ZONE_NORMAL);
+ zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 6ffa0ac7f7d6..8f60e899b33c 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -266,8 +266,9 @@ static inline void remove_memory(int nid, u64 start, u64 size) {}
extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
void *arg, int (*func)(struct memory_block *, void *));
extern int add_memory(int nid, u64 start, u64 size);
-extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default);
-extern int arch_add_memory(int nid, u64 start, u64 size);
+extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
+ bool for_device);
+extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
extern bool is_memblock_offlined(struct memory_block *mem);
extern void remove_memory(int nid, u64 start, u64 size);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 754c25966a0a..9217fd93c25b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -319,7 +319,11 @@ enum zone_type {
ZONE_HIGHMEM,
#endif
ZONE_MOVABLE,
+#ifdef CONFIG_ZONE_DEVICE
+ ZONE_DEVICE,
+#endif
__MAX_NR_ZONES
+
};

#ifndef __GENERATING_BOUNDS_H
@@ -794,6 +798,25 @@ static inline bool pgdat_is_empty(pg_data_t *pgdat)
return !pgdat->node_start_pfn && !pgdat->node_spanned_pages;
}

+static inline int zone_id(const struct zone *zone)
+{
+ struct pglist_data *pgdat = zone->zone_pgdat;
+
+ return zone - pgdat->node_zones;
+}
+
+#ifdef CONFIG_ZONE_DEVICE
+static inline bool is_dev_zone(const struct zone *zone)
+{
+ return zone_id(zone) == ZONE_DEVICE;
+}
+#else
+static inline bool is_dev_zone(const struct zone *zone)
+{
+ return false;
+}
+#endif
+
#include <linux/memory_hotplug.h>

extern struct mutex zonelists_mutex;
diff --git a/mm/Kconfig b/mm/Kconfig
index e79de2bd12cd..a0cd086df16b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -654,3 +654,20 @@ config DEFERRED_STRUCT_PAGE_INIT
when kswapd starts. This has a potential performance impact on
processes running early in the lifetime of the systemm until kswapd
finishes the initialisation.
+
+config ZONE_DEVICE
+ bool "Device memory (pmem, etc...) hotplug support" if EXPERT
+ default !ZONE_DMA
+ depends on !ZONE_DMA
+ depends on MEMORY_HOTPLUG
+ depends on MEMORY_HOTREMOVE
+ depends on X86_64 #arch_add_memory() comprehends device memory
+
+ help
+ Device memory hotplug support allows for establishing pmem,
+ or other device driver discovered memory regions, in the
+ memmap. This allows pfn_to_page() lookups of otherwise
+ "device-physical" addresses which is needed for using a DAX
+ mapping in an O_DIRECT operation, among other things.
+
+ If FS_DAX is enabled, then say Y.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 26fbba7d888f..24e4c76c951b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -770,7 +770,10 @@ int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,

start = phys_start_pfn << PAGE_SHIFT;
size = nr_pages * PAGE_SIZE;
- ret = release_mem_region_adjustable(&iomem_resource, start, size);
+
+ /* in the ZONE_DEVICE case device driver owns the memory region */
+ if (!is_dev_zone(zone))
+ ret = release_mem_region_adjustable(&iomem_resource, start, size);
if (ret) {
resource_size_t endres = start + size - 1;

@@ -1207,8 +1210,13 @@ static int should_add_memory_movable(int nid, u64 start, u64 size)
return 0;
}

-int zone_for_memory(int nid, u64 start, u64 size, int zone_default)
+int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
+ bool for_device)
{
+#ifdef CONFIG_ZONE_DEVICE
+ if (for_device)
+ return ZONE_DEVICE;
+#endif
if (should_add_memory_movable(nid, start, size))
return ZONE_MOVABLE;

@@ -1249,7 +1257,7 @@ int __ref add_memory(int nid, u64 start, u64 size)
}

/* call arch's memory hotadd */
- ret = arch_add_memory(nid, start, size);
+ ret = arch_add_memory(nid, start, size, false);

if (ret < 0)
goto error;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ef19f22b2b7d..0f19b4e18233 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -207,6 +207,9 @@ static char * const zone_names[MAX_NR_ZONES] = {
"HighMem",
#endif
"Movable",
+#ifdef CONFIG_ZONE_DEVICE
+ "Device",
+#endif
};

int min_free_kbytes = 1024;

2015-08-26 01:33:33

by Dan Williams

Subject: [PATCH v2 4/9] add devm_memremap_pages

From: Christoph Hellwig <[email protected]>

This behaves like devm_memremap() except that it ensures page
structures are available to back the region.
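
A minimal sketch of the expected driver usage (example_probe() is
hypothetical; the pmem driver adopts this pattern later in the
series):

/*
 * Remap a device resource with struct page coverage. On success,
 * pfn_to_page() becomes valid for pfns within 'res'; on failure an
 * ERR_PTR is returned and the caller can fall back to devm_memremap().
 */
static int example_probe(struct device *dev, struct resource *res)
{
	void *addr = devm_memremap_pages(dev, res);

	if (IS_ERR(addr))
		return PTR_ERR(addr);
	return 0;
}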

Signed-off-by: Christoph Hellwig <[email protected]>
[djbw: catch attempts to remap RAM, drop flags]
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/io.h | 20 ++++++++++++++++++++
kernel/memremap.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 73 insertions(+)

diff --git a/include/linux/io.h b/include/linux/io.h
index d8d749abd665..de64c1e53612 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -20,10 +20,13 @@

#include <linux/types.h>
#include <linux/init.h>
+#include <linux/bug.h>
+#include <linux/err.h>
#include <asm/io.h>
#include <asm/page.h>

struct device;
+struct resource;

__visible void __iowrite32_copy(void __iomem *to, const void *from, size_t count);
void __iowrite64_copy(void __iomem *to, const void *from, size_t count);
@@ -84,6 +87,23 @@ void *devm_memremap(struct device *dev, resource_size_t offset,
size_t size, unsigned long flags);
void devm_memunmap(struct device *dev, void *addr);

+void *__devm_memremap_pages(struct device *dev, struct resource *res);
+
+#ifdef CONFIG_ZONE_DEVICE
+void *devm_memremap_pages(struct device *dev, struct resource *res);
+#else
+static inline void *devm_memremap_pages(struct device *dev, struct resource *res)
+{
+ /*
+ * Fail attempts to call devm_memremap_pages() without
+ * ZONE_DEVICE support enabled, this requires callers to fall
+ * back to plain devm_memremap() based on config
+ */
+ WARN_ON_ONCE(1);
+ return ERR_PTR(-ENXIO);
+}
+#endif
+
/*
* Some systems do not have legacy ISA devices.
* /dev/port is not a valid interface on these systems.
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 5c9b55eaf121..72b0c66628b6 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -14,6 +14,7 @@
#include <linux/types.h>
#include <linux/io.h>
#include <linux/mm.h>
+#include <linux/memory_hotplug.h>

#ifndef ioremap_cache
/* temporary while we convert existing ioremap_cache users to memremap */
@@ -135,3 +136,55 @@ void devm_memunmap(struct device *dev, void *addr)
memunmap(addr);
}
EXPORT_SYMBOL(devm_memunmap);
+
+#ifdef CONFIG_ZONE_DEVICE
+struct page_map {
+ struct resource res;
+};
+
+static void devm_memremap_pages_release(struct device *dev, void *res)
+{
+ struct page_map *page_map = res;
+
+ /* pages are dead and unused, undo the arch mapping */
+ arch_remove_memory(page_map->res.start, resource_size(&page_map->res));
+}
+
+void *devm_memremap_pages(struct device *dev, struct resource *res)
+{
+ int is_ram = region_intersects(res->start, resource_size(res),
+ "System RAM");
+ struct page_map *page_map;
+ int error, nid;
+
+ if (is_ram == REGION_MIXED) {
+ WARN_ONCE(1, "%s attempted on mixed region %pr\n",
+ __func__, res);
+ return ERR_PTR(-ENXIO);
+ }
+
+ if (is_ram == REGION_INTERSECTS)
+ return __va(res->start);
+
+ page_map = devres_alloc(devm_memremap_pages_release,
+ sizeof(*page_map), GFP_KERNEL);
+ if (!page_map)
+ return ERR_PTR(-ENOMEM);
+
+ memcpy(&page_map->res, res, sizeof(*res));
+
+ nid = dev_to_node(dev);
+ if (nid < 0)
+ nid = 0;
+
+ error = arch_add_memory(nid, res->start, resource_size(res), true);
+ if (error) {
+ devres_free(page_map);
+ return ERR_PTR(error);
+ }
+
+ devres_add(dev, page_map);
+ return __va(res->start);
+}
+EXPORT_SYMBOL(devm_memremap_pages);
+#endif /* CONFIG_ZONE_DEVICE */

2015-08-26 01:33:39

by Dan Williams

Subject: [PATCH v2 5/9] x86, pmem: push fallback handling to arch code

The decision of when to fall back to the default pmem APIs is
currently made at too high a level. In particular, the test for
arch_has_pmem_api() in memcpy_to_pmem() really wants to decide whether
the arch_memcpy_to_pmem() implementation is placing data in a location
that a subsequent wmb_pmem() can flush.

For x86 this equates to an arch_memcpy_to_pmem() implementation that
guarantees write data is at most sitting in the local CPU write
buffer. The current usage of __copy_from_user_inatomic_nocache()
guarantees this property on all 64-bit x86 implementations (the Intel
SDM notes that Pentium M, a 32-bit-only part, may leave dirty data in
the cache after a non-temporal store). In the 32-bit case, waiting
until memcpy_to_pmem() time to perform a fallback is too late.
Instead, 32-bit x86 is converted to use write-through mappings for
pmem.

arch_has_pmem_api() is updated to only indicate whether the arch
provides the proper helpers. Code that cares whether wmb_pmem()
actually flushes writes to pmem must now call arch_has_wmb_pmem()
directly.
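
A minimal sketch of the resulting call-site pattern (example_store()
is hypothetical; compare the nfit.c and pmem.c hunks below):

/*
 * arch_has_pmem_api() now only gates availability of the helpers;
 * arch_has_wmb_pmem() gates the durability guarantee.
 */
static void example_store(struct device *dev, void __pmem *dst,
		const void *src, size_t n)
{
	if (!arch_has_wmb_pmem())
		dev_warn(dev, "unable to guarantee persistence of writes\n");
	memcpy_to_pmem(dst, src, n);	/* valid on any configuration */
	wmb_pmem();			/* may fall back to plain wmb() */
}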

Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/include/asm/io.h | 2 -
arch/x86/include/asm/pmem.h | 41 ++++++++++++++++++++++--
drivers/acpi/nfit.c | 2 +
drivers/nvdimm/pmem.c | 2 +
include/asm-generic/pmem.h | 72 ++++++++++++++++++++++++++++++++++++++++++
include/linux/pmem.h | 73 +++++++------------------------------------
6 files changed, 123 insertions(+), 69 deletions(-)
create mode 100644 include/asm-generic/pmem.h

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index d241fbd5c87b..83ec9b1d77cc 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -248,8 +248,6 @@ static inline void flush_write_buffers(void)
#endif
}

-#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
-
#endif /* __KERNEL__ */

extern void native_io_delay(void);
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index a3a0df6545ee..6eb3c1da5d57 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -16,9 +16,12 @@
#include <linux/uaccess.h>
#include <asm/cacheflush.h>
#include <asm/cpufeature.h>
+#include <asm-generic/pmem.h>
#include <asm/special_insns.h>

#ifdef CONFIG_ARCH_HAS_PMEM_API
+#ifdef CONFIG_X86_64
+#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
/**
* arch_memcpy_to_pmem - copy data to persistent memory
* @dst: destination buffer for the copy
@@ -141,18 +144,48 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
__arch_wb_cache_pmem(vaddr, size);
}

-static inline bool arch_has_wmb_pmem(void)
+static inline bool __arch_has_wmb_pmem(void)
{
-#ifdef CONFIG_X86_64
/*
* We require that wmb() be an 'sfence', that is only guaranteed on
* 64-bit builds
*/
return static_cpu_has(X86_FEATURE_PCOMMIT);
+}
#else
+/*
+ * Some 32-bit implementations may leave dirty-data in cache after a
+ * series of non-temporal stores, so set pmem ranges to write-through
+ * caching.
+ */
+#define ARCH_MEMREMAP_PMEM MEMREMAP_WT
+
+static inline void arch_memcpy_to_pmem(void __pmem *dst, const void *src,
+ size_t n)
+{
+ default_memcpy_pmem(dst, src, n);
+}
+
+static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes,
+ struct iov_iter *i)
+{
+ return default_copy_from_iter_pmem(addr, bytes, i);
+}
+
+static inline void arch_clear_pmem(void __pmem *addr, size_t size)
+{
+ default_clear_pmem(addr, size);
+}
+
+static inline void arch_wmb_pmem(void)
+{
+ wmb();
+}
+
+static inline bool __arch_has_wmb_pmem(void)
+{
return false;
-#endif
}
+#endif /* CONFIG_X86_64 */
#endif /* CONFIG_ARCH_HAS_PMEM_API */
-
#endif /* __ASM_X86_PMEM_H__ */
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 7c2638f914a9..c3fe20635562 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -1364,7 +1364,7 @@ static int acpi_nfit_blk_region_enable(struct nvdimm_bus *nvdimm_bus,
return -ENOMEM;
}

- if (!arch_has_pmem_api() && !nfit_blk->nvdimm_flush)
+ if (!arch_has_wmb_pmem() && !nfit_blk->nvdimm_flush)
dev_warn(dev, "unable to guarantee persistence of writes\n");

if (mmio->line_size == 0)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b5b9cb758b6..20bf122328da 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -125,7 +125,7 @@ static struct pmem_device *pmem_alloc(struct device *dev,

pmem->phys_addr = res->start;
pmem->size = resource_size(res);
- if (!arch_has_pmem_api())
+ if (!arch_has_wmb_pmem())
dev_warn(dev, "unable to guarantee persistence of writes\n");

if (!devm_request_mem_region(dev, pmem->phys_addr, pmem->size,
diff --git a/include/asm-generic/pmem.h b/include/asm-generic/pmem.h
new file mode 100644
index 000000000000..95d1a6ac0df7
--- /dev/null
+++ b/include/asm-generic/pmem.h
@@ -0,0 +1,72 @@
+/*
+ * Copyright(c) 2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#ifndef __ASM_GENERIC_PMEM_H__
+#define __ASM_GENERIC_PMEM_H__
+/*
+ * These defaults seek to offer decent performance and minimize the
+ * window between i/o completion and writes being durable on media.
+ * However, it is undefined / architecture specific whether
+ * default_memremap_pmem + default_memcpy_to_pmem is sufficient for
+ * making data durable relative to i/o completion.
+ */
+static inline void default_memcpy_to_pmem(void __pmem *dst, const void *src,
+ size_t size)
+{
+ memcpy((void __force *) dst, src, size);
+}
+
+static inline size_t default_copy_from_iter_pmem(void __pmem *addr,
+ size_t bytes, struct iov_iter *i)
+{
+ return copy_from_iter_nocache((void __force *)addr, bytes, i);
+}
+
+static inline void default_clear_pmem(void __pmem *addr, size_t size)
+{
+ if (size == PAGE_SIZE && ((unsigned long)addr & ~PAGE_MASK) == 0)
+ clear_page((void __force *)addr);
+ else
+ memset((void __force *)addr, 0, size);
+}
+
+#ifndef CONFIG_ARCH_HAS_PMEM_API
+/*
+ * These are simply here to enable compilation, all call sites gate
+ * calling these symbols with arch_has_pmem_api() and redirect to the
+ * implementation in asm/pmem.h.
+ */
+
+static inline bool __arch_has_wmb_pmem(void)
+{
+ return false;
+}
+
+static inline void arch_memcpy_to_pmem(void __pmem *dst, const void *src,
+ size_t n)
+{
+ BUG();
+}
+
+static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes,
+ struct iov_iter *i)
+{
+ BUG();
+ return 0;
+}
+
+static inline void arch_clear_pmem(void __pmem *addr, size_t size)
+{
+ BUG();
+}
+#endif /* CONFIG_ARCH_HAS_PMEM_API */
+#endif /* __ASM_GENERIC_PMEM_H__ */
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index a9d84bf335ee..f7f5a713a860 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -15,37 +15,9 @@

#include <linux/io.h>
#include <linux/uio.h>
-
+#include <asm-generic/pmem.h>
#ifdef CONFIG_ARCH_HAS_PMEM_API
#include <asm/pmem.h>
-#else
-static inline void arch_wmb_pmem(void)
-{
- BUG();
-}
-
-static inline bool arch_has_wmb_pmem(void)
-{
- return false;
-}
-
-static inline void arch_memcpy_to_pmem(void __pmem *dst, const void *src,
- size_t n)
-{
- BUG();
-}
-
-static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes,
- struct iov_iter *i)
-{
- BUG();
- return 0;
-}
-
-static inline void arch_clear_pmem(void __pmem *addr, size_t size)
-{
- BUG();
-}
#endif

/*
@@ -53,7 +25,6 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
* implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(),
* arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem().
*/
-
static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size)
{
memcpy(dst, (void __force const *) src, size);
@@ -64,8 +35,13 @@ static inline void memunmap_pmem(struct device *dev, void __pmem *addr)
devm_memunmap(dev, (void __force *) addr);
}

+static inline bool arch_has_pmem_api(void)
+{
+ return IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API);
+}
+
/**
- * arch_has_pmem_api - true if wmb_pmem() ensures durability
+ * arch_has_wmb_pmem - true if wmb_pmem() ensures durability
*
* For a given cpu implementation within an architecture it is possible
* that wmb_pmem() resolves to a nop. In the case this returns
@@ -73,36 +49,9 @@ static inline void memunmap_pmem(struct device *dev, void __pmem *addr)
* fall back to a different data consistency model, or otherwise notify
* the user.
*/
-static inline bool arch_has_pmem_api(void)
-{
- return IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API) && arch_has_wmb_pmem();
-}
-
-/*
- * These defaults seek to offer decent performance and minimize the
- * window between i/o completion and writes being durable on media.
- * However, it is undefined / architecture specific whether
- * default_memremap_pmem + default_memcpy_to_pmem is sufficient for
- * making data durable relative to i/o completion.
- */
-static inline void default_memcpy_to_pmem(void __pmem *dst, const void *src,
- size_t size)
-{
- memcpy((void __force *) dst, src, size);
-}
-
-static inline size_t default_copy_from_iter_pmem(void __pmem *addr,
- size_t bytes, struct iov_iter *i)
-{
- return copy_from_iter_nocache((void __force *)addr, bytes, i);
-}
-
-static inline void default_clear_pmem(void __pmem *addr, size_t size)
+static inline bool arch_has_wmb_pmem(void)
{
- if (size == PAGE_SIZE && ((unsigned long)addr & ~PAGE_MASK) == 0)
- clear_page((void __force *)addr);
- else
- memset((void __force *)addr, 0, size);
+ return arch_has_pmem_api() && __arch_has_wmb_pmem();
}

/**
@@ -158,8 +107,10 @@ static inline void memcpy_to_pmem(void __pmem *dst, const void *src, size_t n)
*/
static inline void wmb_pmem(void)
{
- if (arch_has_pmem_api())
+ if (arch_has_wmb_pmem())
arch_wmb_pmem();
+ else
+ wmb();
}

/**

2015-08-26 01:33:43

by Dan Williams

Subject: [PATCH v2 6/9] libnvdimm, pfn: 'struct page' provider infrastructure

Implement the base infrastructure for libnvdimm PFN devices. Similar
to BTT devices, they take a namespace as a backing device and layer
functionality on top. In this case the functionality is reserving
space for an array of 'struct page' entries to be handed out through
pfn_to_page(). For now this is just the basic libnvdimm device model
for configuring the base PFN device.

As the namespace claiming mechanism for PFN devices is mostly
identical to that of BTT devices, drivers/nvdimm/claim.c is created to
house the common bits.
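
A minimal sketch of the shared sysfs flow (example_namespace_store()
is hypothetical; compare namespace_store() in the btt_devs.c hunk
below):

/*
 * Both the BTT and PFN 'namespace' attributes take the bus lock and
 * the device lock, then funnel into the common nd_namespace_store().
 */
static ssize_t example_namespace_store(struct device *dev,
		struct nd_namespace_common **ndns, const char *buf,
		size_t len)
{
	ssize_t rc;

	nvdimm_bus_lock(dev);
	device_lock(dev);
	rc = nd_namespace_store(dev, ndns, buf, len);
	device_unlock(dev);
	nvdimm_bus_unlock(dev);
	return rc;
}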

Cc: Ross Zwisler <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/Kconfig | 22 +++
drivers/nvdimm/Makefile | 2
drivers/nvdimm/btt.c | 6 -
drivers/nvdimm/btt_devs.c | 172 +-------------------
drivers/nvdimm/claim.c | 201 +++++++++++++++++++++++
drivers/nvdimm/namespace_devs.c | 34 +++-
drivers/nvdimm/nd-core.h | 9 +
drivers/nvdimm/nd.h | 51 ++++++
drivers/nvdimm/pfn.h | 35 ++++
drivers/nvdimm/pfn_devs.c | 336 +++++++++++++++++++++++++++++++++++++++
drivers/nvdimm/region.c | 2
drivers/nvdimm/region_devs.c | 19 ++
tools/testing/nvdimm/Kbuild | 2
13 files changed, 714 insertions(+), 177 deletions(-)
create mode 100644 drivers/nvdimm/claim.c
create mode 100644 drivers/nvdimm/pfn.h
create mode 100644 drivers/nvdimm/pfn_devs.c

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 72226acb5c0f..ace25b53b755 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -21,6 +21,7 @@ config BLK_DEV_PMEM
default LIBNVDIMM
depends on HAS_IOMEM
select ND_BTT if BTT
+ select ND_PFN if NVDIMM_PFN
help
Memory ranges for PMEM are described by either an NFIT
(NVDIMM Firmware Interface Table, see CONFIG_NFIT_ACPI), a
@@ -47,12 +48,16 @@ config ND_BLK
(CONFIG_ACPI_NFIT), or otherwise exposes BLK-mode
capabilities.

+config ND_CLAIM
+ bool
+
config ND_BTT
tristate

config BTT
bool "BTT: Block Translation Table (atomic sector updates)"
default y if LIBNVDIMM
+ select ND_CLAIM
help
The Block Translation Table (BTT) provides atomic sector
update semantics for persistent memory devices, so that
@@ -65,4 +70,21 @@ config BTT

Select Y if unsure

+config ND_PFN
+ tristate
+
+config NVDIMM_PFN
+ bool "PFN: Map persistent (device) memory"
+ default LIBNVDIMM
+ select ND_CLAIM
+ help
+ Map persistent memory, i.e. advertise it to the memory
+ management sub-system. By default persistent memory does
+ not support direct I/O, RDMA, or any other usage that
+ requires a 'struct page' to mediate an I/O request. This
+ driver allocates and initializes the infrastructure needed
+ to support those use cases.
+
+ Select Y if unsure
+
endif
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 9bf15db52dee..ea84d3c4e8e5 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -20,4 +20,6 @@ libnvdimm-y += region_devs.o
libnvdimm-y += region.o
libnvdimm-y += namespace_devs.o
libnvdimm-y += label.o
+libnvdimm-$(CONFIG_ND_CLAIM) += claim.o
libnvdimm-$(CONFIG_BTT) += btt_devs.o
+libnvdimm-$(CONFIG_NVDIMM_PFN) += pfn_devs.o
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 19588291550b..028d2d137bc5 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -731,6 +731,7 @@ static int create_arenas(struct btt *btt)
static int btt_arena_write_layout(struct arena_info *arena)
{
int ret;
+ u64 sum;
struct btt_sb *super;
struct nd_btt *nd_btt = arena->nd_btt;
const u8 *parent_uuid = nd_dev_to_uuid(&nd_btt->ndns->dev);
@@ -770,7 +771,8 @@ static int btt_arena_write_layout(struct arena_info *arena)
super->info2off = cpu_to_le64(arena->info2off - arena->infooff);

super->flags = 0;
- super->checksum = cpu_to_le64(nd_btt_sb_checksum(super));
+ sum = nd_sb_checksum((struct nd_gen_sb *) super);
+ super->checksum = cpu_to_le64(sum);

ret = btt_info_write(arena, super);

@@ -1422,8 +1424,6 @@ static int __init nd_btt_init(void)
{
int rc;

- BUILD_BUG_ON(sizeof(struct btt_sb) != SZ_4K);
-
btt_major = register_blkdev(0, "btt");
if (btt_major < 0)
return btt_major;
diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
index 242ae1c550ad..59ad54a63d9f 100644
--- a/drivers/nvdimm/btt_devs.c
+++ b/drivers/nvdimm/btt_devs.c
@@ -21,63 +21,13 @@
#include "btt.h"
#include "nd.h"

-static void __nd_btt_detach_ndns(struct nd_btt *nd_btt)
-{
- struct nd_namespace_common *ndns = nd_btt->ndns;
-
- dev_WARN_ONCE(&nd_btt->dev, !mutex_is_locked(&ndns->dev.mutex)
- || ndns->claim != &nd_btt->dev,
- "%s: invalid claim\n", __func__);
- ndns->claim = NULL;
- nd_btt->ndns = NULL;
- put_device(&ndns->dev);
-}
-
-static void nd_btt_detach_ndns(struct nd_btt *nd_btt)
-{
- struct nd_namespace_common *ndns = nd_btt->ndns;
-
- if (!ndns)
- return;
- get_device(&ndns->dev);
- device_lock(&ndns->dev);
- __nd_btt_detach_ndns(nd_btt);
- device_unlock(&ndns->dev);
- put_device(&ndns->dev);
-}
-
-static bool __nd_btt_attach_ndns(struct nd_btt *nd_btt,
- struct nd_namespace_common *ndns)
-{
- if (ndns->claim)
- return false;
- dev_WARN_ONCE(&nd_btt->dev, !mutex_is_locked(&ndns->dev.mutex)
- || nd_btt->ndns,
- "%s: invalid claim\n", __func__);
- ndns->claim = &nd_btt->dev;
- nd_btt->ndns = ndns;
- get_device(&ndns->dev);
- return true;
-}
-
-static bool nd_btt_attach_ndns(struct nd_btt *nd_btt,
- struct nd_namespace_common *ndns)
-{
- bool claimed;
-
- device_lock(&ndns->dev);
- claimed = __nd_btt_attach_ndns(nd_btt, ndns);
- device_unlock(&ndns->dev);
- return claimed;
-}
-
static void nd_btt_release(struct device *dev)
{
struct nd_region *nd_region = to_nd_region(dev->parent);
struct nd_btt *nd_btt = to_nd_btt(dev);

dev_dbg(dev, "%s\n", __func__);
- nd_btt_detach_ndns(nd_btt);
+ nd_detach_ndns(&nd_btt->dev, &nd_btt->ndns);
ida_simple_remove(&nd_region->btt_ida, nd_btt->id);
kfree(nd_btt->uuid);
kfree(nd_btt);
@@ -172,104 +122,15 @@ static ssize_t namespace_show(struct device *dev,
return rc;
}

-static int namespace_match(struct device *dev, void *data)
-{
- char *name = data;
-
- return strcmp(name, dev_name(dev)) == 0;
-}
-
-static bool is_nd_btt_idle(struct device *dev)
-{
- struct nd_region *nd_region = to_nd_region(dev->parent);
- struct nd_btt *nd_btt = to_nd_btt(dev);
-
- if (nd_region->btt_seed == dev || nd_btt->ndns || dev->driver)
- return false;
- return true;
-}
-
-static ssize_t __namespace_store(struct device *dev,
- struct device_attribute *attr, const char *buf, size_t len)
-{
- struct nd_btt *nd_btt = to_nd_btt(dev);
- struct nd_namespace_common *ndns;
- struct device *found;
- char *name;
-
- if (dev->driver) {
- dev_dbg(dev, "%s: -EBUSY\n", __func__);
- return -EBUSY;
- }
-
- name = kstrndup(buf, len, GFP_KERNEL);
- if (!name)
- return -ENOMEM;
- strim(name);
-
- if (strncmp(name, "namespace", 9) == 0 || strcmp(name, "") == 0)
- /* pass */;
- else {
- len = -EINVAL;
- goto out;
- }
-
- ndns = nd_btt->ndns;
- if (strcmp(name, "") == 0) {
- /* detach the namespace and destroy / reset the btt device */
- nd_btt_detach_ndns(nd_btt);
- if (is_nd_btt_idle(dev))
- nd_device_unregister(dev, ND_ASYNC);
- else {
- nd_btt->lbasize = 0;
- kfree(nd_btt->uuid);
- nd_btt->uuid = NULL;
- }
- goto out;
- } else if (ndns) {
- dev_dbg(dev, "namespace already set to: %s\n",
- dev_name(&ndns->dev));
- len = -EBUSY;
- goto out;
- }
-
- found = device_find_child(dev->parent, name, namespace_match);
- if (!found) {
- dev_dbg(dev, "'%s' not found under %s\n", name,
- dev_name(dev->parent));
- len = -ENODEV;
- goto out;
- }
-
- ndns = to_ndns(found);
- if (__nvdimm_namespace_capacity(ndns) < SZ_16M) {
- dev_dbg(dev, "%s too small to host btt\n", name);
- len = -ENXIO;
- goto out_attach;
- }
-
- WARN_ON_ONCE(!is_nvdimm_bus_locked(&nd_btt->dev));
- if (!nd_btt_attach_ndns(nd_btt, ndns)) {
- dev_dbg(dev, "%s already claimed\n",
- dev_name(&ndns->dev));
- len = -EBUSY;
- }
-
- out_attach:
- put_device(&ndns->dev); /* from device_find_child */
- out:
- kfree(name);
- return len;
-}
-
static ssize_t namespace_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
+ struct nd_btt *nd_btt = to_nd_btt(dev);
ssize_t rc;

nvdimm_bus_lock(dev);
device_lock(dev);
- rc = __namespace_store(dev, attr, buf, len);
+ rc = nd_namespace_store(dev, &nd_btt->ndns, buf, len);
dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
rc, buf, buf[len - 1] == '\n' ? "" : "\n");
device_unlock(dev);
@@ -324,7 +185,7 @@ static struct device *__nd_btt_create(struct nd_region *nd_region,
dev->type = &nd_btt_device_type;
dev->groups = nd_btt_attribute_groups;
device_initialize(&nd_btt->dev);
- if (ndns && !__nd_btt_attach_ndns(nd_btt, ndns)) {
+ if (ndns && !__nd_attach_ndns(&nd_btt->dev, ndns, &nd_btt->ndns)) {
dev_dbg(&ndns->dev, "%s failed, already claimed by %s\n",
__func__, dev_name(ndns->claim));
put_device(dev);
@@ -375,7 +236,7 @@ bool nd_btt_arena_is_valid(struct nd_btt *nd_btt, struct btt_sb *super)

checksum = le64_to_cpu(super->checksum);
super->checksum = 0;
- if (checksum != nd_btt_sb_checksum(super))
+ if (checksum != nd_sb_checksum((struct nd_gen_sb *) super))
return false;
super->checksum = cpu_to_le64(checksum);

@@ -387,25 +248,6 @@ bool nd_btt_arena_is_valid(struct nd_btt *nd_btt, struct btt_sb *super)
}
EXPORT_SYMBOL(nd_btt_arena_is_valid);

-/*
- * nd_btt_sb_checksum: compute checksum for btt info block
- *
- * Returns a fletcher64 checksum of everything in the given info block
- * except the last field (since that's where the checksum lives).
- */
-u64 nd_btt_sb_checksum(struct btt_sb *btt_sb)
-{
- u64 sum;
- __le64 sum_save;
-
- sum_save = btt_sb->checksum;
- btt_sb->checksum = 0;
- sum = nd_fletcher64(btt_sb, sizeof(*btt_sb), 1);
- btt_sb->checksum = sum_save;
- return sum;
-}
-EXPORT_SYMBOL(nd_btt_sb_checksum);
-
static int __nd_btt_probe(struct nd_btt *nd_btt,
struct nd_namespace_common *ndns, struct btt_sb *btt_sb)
{
@@ -453,7 +295,9 @@ int nd_btt_probe(struct nd_namespace_common *ndns, void *drvdata)
dev_dbg(&ndns->dev, "%s: btt: %s\n", __func__,
rc == 0 ? dev_name(dev) : "<none>");
if (rc < 0) {
- __nd_btt_detach_ndns(to_nd_btt(dev));
+ struct nd_btt *nd_btt = to_nd_btt(dev);
+
+ __nd_detach_ndns(dev, &nd_btt->ndns);
put_device(dev);
}

diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
new file mode 100644
index 000000000000..e8f03b0e95e4
--- /dev/null
+++ b/drivers/nvdimm/claim.c
@@ -0,0 +1,201 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/device.h>
+#include <linux/sizes.h>
+#include "nd-core.h"
+#include "pfn.h"
+#include "btt.h"
+#include "nd.h"
+
+void __nd_detach_ndns(struct device *dev, struct nd_namespace_common **_ndns)
+{
+ struct nd_namespace_common *ndns = *_ndns;
+
+ dev_WARN_ONCE(dev, !mutex_is_locked(&ndns->dev.mutex)
+ || ndns->claim != dev,
+ "%s: invalid claim\n", __func__);
+ ndns->claim = NULL;
+ *_ndns = NULL;
+ put_device(&ndns->dev);
+}
+
+void nd_detach_ndns(struct device *dev,
+ struct nd_namespace_common **_ndns)
+{
+ struct nd_namespace_common *ndns = *_ndns;
+
+ if (!ndns)
+ return;
+ get_device(&ndns->dev);
+ device_lock(&ndns->dev);
+ __nd_detach_ndns(dev, _ndns);
+ device_unlock(&ndns->dev);
+ put_device(&ndns->dev);
+}
+
+bool __nd_attach_ndns(struct device *dev, struct nd_namespace_common *attach,
+ struct nd_namespace_common **_ndns)
+{
+ if (attach->claim)
+ return false;
+ dev_WARN_ONCE(dev, !mutex_is_locked(&attach->dev.mutex)
+ || *_ndns,
+ "%s: invalid claim\n", __func__);
+ attach->claim = dev;
+ *_ndns = attach;
+ get_device(&attach->dev);
+ return true;
+}
+
+bool nd_attach_ndns(struct device *dev, struct nd_namespace_common *attach,
+ struct nd_namespace_common **_ndns)
+{
+ bool claimed;
+
+ device_lock(&attach->dev);
+ claimed = __nd_attach_ndns(dev, attach, _ndns);
+ device_unlock(&attach->dev);
+ return claimed;
+}
+
+static int namespace_match(struct device *dev, void *data)
+{
+ char *name = data;
+
+ return strcmp(name, dev_name(dev)) == 0;
+}
+
+static bool is_idle(struct device *dev, struct nd_namespace_common *ndns)
+{
+ struct nd_region *nd_region = to_nd_region(dev->parent);
+ struct device *seed = NULL;
+
+ if (is_nd_btt(dev))
+ seed = nd_region->btt_seed;
+ else if (is_nd_pfn(dev))
+ seed = nd_region->pfn_seed;
+
+ if (seed == dev || ndns || dev->driver)
+ return false;
+ return true;
+}
+
+static void nd_detach_and_reset(struct device *dev,
+ struct nd_namespace_common **_ndns)
+{
+ /* detach the namespace and destroy / reset the device */
+ nd_detach_ndns(dev, _ndns);
+ if (is_idle(dev, *_ndns)) {
+ nd_device_unregister(dev, ND_ASYNC);
+ } else if (is_nd_btt(dev)) {
+ struct nd_btt *nd_btt = to_nd_btt(dev);
+
+ nd_btt->lbasize = 0;
+ kfree(nd_btt->uuid);
+ nd_btt->uuid = NULL;
+ } else if (is_nd_pfn(dev)) {
+ struct nd_pfn *nd_pfn = to_nd_pfn(dev);
+
+ kfree(nd_pfn->uuid);
+ nd_pfn->uuid = NULL;
+ nd_pfn->mode = PFN_MODE_NONE;
+ }
+}
+
+ssize_t nd_namespace_store(struct device *dev,
+ struct nd_namespace_common **_ndns, const char *buf,
+ size_t len)
+{
+ struct nd_namespace_common *ndns;
+ struct device *found;
+ char *name;
+
+ if (dev->driver) {
+ dev_dbg(dev, "%s: -EBUSY\n", __func__);
+ return -EBUSY;
+ }
+
+ name = kstrndup(buf, len, GFP_KERNEL);
+ if (!name)
+ return -ENOMEM;
+ strim(name);
+
+ if (strncmp(name, "namespace", 9) == 0 || strcmp(name, "") == 0)
+ /* pass */;
+ else {
+ len = -EINVAL;
+ goto out;
+ }
+
+ ndns = *_ndns;
+ if (strcmp(name, "") == 0) {
+ nd_detach_and_reset(dev, _ndns);
+ goto out;
+ } else if (ndns) {
+ dev_dbg(dev, "namespace already set to: %s\n",
+ dev_name(&ndns->dev));
+ len = -EBUSY;
+ goto out;
+ }
+
+ found = device_find_child(dev->parent, name, namespace_match);
+ if (!found) {
+ dev_dbg(dev, "'%s' not found under %s\n", name,
+ dev_name(dev->parent));
+ len = -ENODEV;
+ goto out;
+ }
+
+ ndns = to_ndns(found);
+ if (__nvdimm_namespace_capacity(ndns) < SZ_16M) {
+ dev_dbg(dev, "%s too small to host\n", name);
+ len = -ENXIO;
+ goto out_attach;
+ }
+
+ WARN_ON_ONCE(!is_nvdimm_bus_locked(dev));
+ if (!nd_attach_ndns(dev, ndns, _ndns)) {
+ dev_dbg(dev, "%s already claimed\n",
+ dev_name(&ndns->dev));
+ len = -EBUSY;
+ }
+
+ out_attach:
+ put_device(&ndns->dev); /* from device_find_child */
+ out:
+ kfree(name);
+ return len;
+}
+
+/*
+ * nd_sb_checksum: compute checksum for a generic info block
+ *
+ * Returns a fletcher64 checksum of everything in the given info block
+ * except the last field (since that's where the checksum lives).
+ */
+u64 nd_sb_checksum(struct nd_gen_sb *nd_gen_sb)
+{
+ u64 sum;
+ __le64 sum_save;
+
+ BUILD_BUG_ON(sizeof(struct btt_sb) != SZ_4K);
+ BUILD_BUG_ON(sizeof(struct nd_pfn_sb) != SZ_4K);
+ BUILD_BUG_ON(sizeof(struct nd_gen_sb) != SZ_4K);
+
+ sum_save = nd_gen_sb->checksum;
+ nd_gen_sb->checksum = 0;
+ sum = nd_fletcher64(nd_gen_sb, sizeof(*nd_gen_sb), 1);
+ nd_gen_sb->checksum = sum_save;
+ return sum;
+}
+EXPORT_SYMBOL(nd_sb_checksum);
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index b18ffea9d85b..9303ca29be9b 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -82,8 +82,16 @@ const char *nvdimm_namespace_disk_name(struct nd_namespace_common *ndns,
struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
const char *suffix = "";

- if (ndns->claim && is_nd_btt(ndns->claim))
- suffix = "s";
+ if (ndns->claim) {
+ if (is_nd_btt(ndns->claim))
+ suffix = "s";
+ else if (is_nd_pfn(ndns->claim))
+ suffix = "m";
+ else
+ dev_WARN_ONCE(&ndns->dev, 1,
+ "unknown claim type by %s\n",
+ dev_name(ndns->claim));
+ }

if (is_namespace_pmem(&ndns->dev) || is_namespace_io(&ndns->dev))
sprintf(name, "pmem%d%s", nd_region->id, suffix);
@@ -1255,12 +1263,22 @@ static const struct attribute_group *nd_namespace_attribute_groups[] = {
struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev)
{
struct nd_btt *nd_btt = is_nd_btt(dev) ? to_nd_btt(dev) : NULL;
+ struct nd_pfn *nd_pfn = is_nd_pfn(dev) ? to_nd_pfn(dev) : NULL;
struct nd_namespace_common *ndns;
resource_size_t size;

- if (nd_btt) {
- ndns = nd_btt->ndns;
- if (!ndns)
+ if (nd_btt || nd_pfn) {
+ struct device *host = NULL;
+
+ if (nd_btt) {
+ host = &nd_btt->dev;
+ ndns = nd_btt->ndns;
+ } else if (nd_pfn) {
+ host = &nd_pfn->dev;
+ ndns = nd_pfn->ndns;
+ }
+
+ if (!ndns || !host)
return ERR_PTR(-ENODEV);

/*
@@ -1271,12 +1289,12 @@ struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev)
device_unlock(&ndns->dev);
if (ndns->dev.driver) {
dev_dbg(&ndns->dev, "is active, can't bind %s\n",
- dev_name(&nd_btt->dev));
+ dev_name(host));
return ERR_PTR(-EBUSY);
}
- if (dev_WARN_ONCE(&ndns->dev, ndns->claim != &nd_btt->dev,
+ if (dev_WARN_ONCE(&ndns->dev, ndns->claim != host,
"host (%s) vs claim (%s) mismatch\n",
- dev_name(&nd_btt->dev),
+ dev_name(host),
dev_name(ndns->claim)))
return ERR_PTR(-ENXIO);
} else {
diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h
index e1970c71ad1c..159aed532042 100644
--- a/drivers/nvdimm/nd-core.h
+++ b/drivers/nvdimm/nd-core.h
@@ -80,4 +80,13 @@ struct resource *nsblk_add_resource(struct nd_region *nd_region,
int nvdimm_num_label_slots(struct nvdimm_drvdata *ndd);
void get_ndd(struct nvdimm_drvdata *ndd);
resource_size_t __nvdimm_namespace_capacity(struct nd_namespace_common *ndns);
+void nd_detach_ndns(struct device *dev, struct nd_namespace_common **_ndns);
+void __nd_detach_ndns(struct device *dev, struct nd_namespace_common **_ndns);
+bool nd_attach_ndns(struct device *dev, struct nd_namespace_common *attach,
+ struct nd_namespace_common **_ndns);
+bool __nd_attach_ndns(struct device *dev, struct nd_namespace_common *attach,
+ struct nd_namespace_common **_ndns);
+ssize_t nd_namespace_store(struct device *dev,
+ struct nd_namespace_common **_ndns, const char *buf,
+ size_t len);
#endif /* __ND_CORE_H__ */
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index f9615824947b..95f7efc7fed9 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -29,6 +29,8 @@ enum {
ND_MAX_LANES = 256,
SECTOR_SHIFT = 9,
INT_LBASIZE_ALIGNMENT = 64,
+ ND_PFN_ALIGN = PAGES_PER_SECTION * PAGE_SIZE,
+ ND_PFN_MASK = ND_PFN_ALIGN - 1,
};

struct nvdimm_drvdata {
@@ -92,8 +94,10 @@ struct nd_region {
struct device dev;
struct ida ns_ida;
struct ida btt_ida;
+ struct ida pfn_ida;
struct device *ns_seed;
struct device *btt_seed;
+ struct device *pfn_seed;
u16 ndr_mappings;
u64 ndr_size;
u64 ndr_start;
@@ -133,6 +137,22 @@ struct nd_btt {
int id;
};

+enum nd_pfn_mode {
+ PFN_MODE_NONE,
+ PFN_MODE_RAM,
+ PFN_MODE_PMEM,
+};
+
+struct nd_pfn {
+ int id;
+ u8 *uuid;
+ struct device dev;
+ unsigned long npfns;
+ enum nd_pfn_mode mode;
+ struct nd_pfn_sb *pfn_sb;
+ struct nd_namespace_common *ndns;
+};
+
enum nd_async_mode {
ND_SYNC,
ND_ASYNC,
@@ -159,8 +179,13 @@ int nvdimm_init_config_data(struct nvdimm_drvdata *ndd);
int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, size_t offset,
void *buf, size_t len);
struct nd_btt *to_nd_btt(struct device *dev);
-struct btt_sb;
-u64 nd_btt_sb_checksum(struct btt_sb *btt_sb);
+
+struct nd_gen_sb {
+ char reserved[SZ_4K - 8];
+ __le64 checksum;
+};
+
+u64 nd_sb_checksum(struct nd_gen_sb *sb);
#if IS_ENABLED(CONFIG_BTT)
int nd_btt_probe(struct nd_namespace_common *ndns, void *drvdata);
bool is_nd_btt(struct device *dev);
@@ -180,8 +205,30 @@ static inline struct device *nd_btt_create(struct nd_region *nd_region)
{
return NULL;
}
+#endif

+struct nd_pfn *to_nd_pfn(struct device *dev);
+#if IS_ENABLED(CONFIG_NVDIMM_PFN)
+int nd_pfn_probe(struct nd_namespace_common *ndns, void *drvdata);
+bool is_nd_pfn(struct device *dev);
+struct device *nd_pfn_create(struct nd_region *nd_region);
+#else
+static inline int nd_pfn_probe(struct nd_namespace_common *ndns, void *drvdata)
+{
+ return -ENODEV;
+}
+
+static inline bool is_nd_pfn(struct device *dev)
+{
+ return false;
+}
+
+static inline struct device *nd_pfn_create(struct nd_region *nd_region)
+{
+ return NULL;
+}
#endif
+
struct nd_region *to_nd_region(struct device *dev);
int nd_region_to_nstype(struct nd_region *nd_region);
int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h
new file mode 100644
index 000000000000..cc243754acef
--- /dev/null
+++ b/drivers/nvdimm/pfn.h
@@ -0,0 +1,35 @@
+/*
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef __NVDIMM_PFN_H
+#define __NVDIMM_PFN_H
+
+#include <linux/types.h>
+
+#define PFN_SIG_LEN 16
+#define PFN_SIG "NVDIMM_PFN_INFO\0"
+
+struct nd_pfn_sb {
+ u8 signature[PFN_SIG_LEN];
+ u8 uuid[16];
+ u8 parent_uuid[16];
+ __le32 flags;
+ __le16 version_major;
+ __le16 version_minor;
+ __le64 dataoff;
+ __le64 npfns;
+ __le32 mode;
+ u8 padding[4012];
+ __le64 checksum;
+};
+#endif /* __NVDIMM_PFN_H */
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
new file mode 100644
index 000000000000..f708d63709a5
--- /dev/null
+++ b/drivers/nvdimm/pfn_devs.c
@@ -0,0 +1,336 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/blkdev.h>
+#include <linux/device.h>
+#include <linux/genhd.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include "nd-core.h"
+#include "pfn.h"
+#include "nd.h"
+
+static void nd_pfn_release(struct device *dev)
+{
+ struct nd_region *nd_region = to_nd_region(dev->parent);
+ struct nd_pfn *nd_pfn = to_nd_pfn(dev);
+
+ dev_dbg(dev, "%s\n", __func__);
+ nd_detach_ndns(&nd_pfn->dev, &nd_pfn->ndns);
+ ida_simple_remove(&nd_region->pfn_ida, nd_pfn->id);
+ kfree(nd_pfn->uuid);
+ kfree(nd_pfn);
+}
+
+static struct device_type nd_pfn_device_type = {
+ .name = "nd_pfn",
+ .release = nd_pfn_release,
+};
+
+bool is_nd_pfn(struct device *dev)
+{
+ return dev ? dev->type == &nd_pfn_device_type : false;
+}
+EXPORT_SYMBOL(is_nd_pfn);
+
+struct nd_pfn *to_nd_pfn(struct device *dev)
+{
+ struct nd_pfn *nd_pfn = container_of(dev, struct nd_pfn, dev);
+
+ WARN_ON(!is_nd_pfn(dev));
+ return nd_pfn;
+}
+EXPORT_SYMBOL(to_nd_pfn);
+
+static ssize_t mode_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct nd_pfn *nd_pfn = to_nd_pfn(dev);
+
+ switch (nd_pfn->mode) {
+ case PFN_MODE_RAM:
+ return sprintf(buf, "ram\n");
+ case PFN_MODE_PMEM:
+ return sprintf(buf, "pmem\n");
+ default:
+ return sprintf(buf, "none\n");
+ }
+}
+
+static ssize_t mode_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t len)
+{
+ struct nd_pfn *nd_pfn = to_nd_pfn(dev);
+ ssize_t rc = 0;
+
+ device_lock(dev);
+ nvdimm_bus_lock(dev);
+ if (dev->driver)
+ rc = -EBUSY;
+ else {
+ size_t n = len - 1;
+
+ if (strncmp(buf, "pmem\n", n) == 0
+ || strncmp(buf, "pmem", n) == 0) {
+ /* TODO: allocate from PMEM support */
+ rc = -ENOTTY;
+ } else if (strncmp(buf, "ram\n", n) == 0
+ || strncmp(buf, "ram", n) == 0)
+ nd_pfn->mode = PFN_MODE_RAM;
+ else if (strncmp(buf, "none\n", n) == 0
+ || strncmp(buf, "none", n) == 0)
+ nd_pfn->mode = PFN_MODE_NONE;
+ else
+ rc = -EINVAL;
+ }
+ dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+ rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+ nvdimm_bus_unlock(dev);
+ device_unlock(dev);
+
+ return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(mode);
+
+static ssize_t uuid_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct nd_pfn *nd_pfn = to_nd_pfn(dev);
+
+ if (nd_pfn->uuid)
+ return sprintf(buf, "%pUb\n", nd_pfn->uuid);
+ return sprintf(buf, "\n");
+}
+
+static ssize_t uuid_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t len)
+{
+ struct nd_pfn *nd_pfn = to_nd_pfn(dev);
+ ssize_t rc;
+
+ device_lock(dev);
+ rc = nd_uuid_store(dev, &nd_pfn->uuid, buf, len);
+ dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+ rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+ device_unlock(dev);
+
+ return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(uuid);
+
+static ssize_t namespace_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct nd_pfn *nd_pfn = to_nd_pfn(dev);
+ ssize_t rc;
+
+ nvdimm_bus_lock(dev);
+ rc = sprintf(buf, "%s\n", nd_pfn->ndns
+ ? dev_name(&nd_pfn->ndns->dev) : "");
+ nvdimm_bus_unlock(dev);
+ return rc;
+}
+
+static ssize_t namespace_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t len)
+{
+ struct nd_pfn *nd_pfn = to_nd_pfn(dev);
+ ssize_t rc;
+
+ nvdimm_bus_lock(dev);
+ device_lock(dev);
+ rc = nd_namespace_store(dev, &nd_pfn->ndns, buf, len);
+ dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+ rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+ device_unlock(dev);
+ nvdimm_bus_unlock(dev);
+
+ return rc;
+}
+static DEVICE_ATTR_RW(namespace);
+
+static struct attribute *nd_pfn_attributes[] = {
+ &dev_attr_mode.attr,
+ &dev_attr_namespace.attr,
+ &dev_attr_uuid.attr,
+ NULL,
+};
+
+static struct attribute_group nd_pfn_attribute_group = {
+ .attrs = nd_pfn_attributes,
+};
+
+static const struct attribute_group *nd_pfn_attribute_groups[] = {
+ &nd_pfn_attribute_group,
+ &nd_device_attribute_group,
+ &nd_numa_attribute_group,
+ NULL,
+};
+
+static struct device *__nd_pfn_create(struct nd_region *nd_region,
+ u8 *uuid, enum nd_pfn_mode mode,
+ struct nd_namespace_common *ndns)
+{
+ struct nd_pfn *nd_pfn;
+ struct device *dev;
+
+ /* we can only create pages for contiguous ranges of pmem */
+ if (!is_nd_pmem(&nd_region->dev))
+ return NULL;
+
+ nd_pfn = kzalloc(sizeof(*nd_pfn), GFP_KERNEL);
+ if (!nd_pfn)
+ return NULL;
+
+ nd_pfn->id = ida_simple_get(&nd_region->pfn_ida, 0, 0, GFP_KERNEL);
+ if (nd_pfn->id < 0) {
+ kfree(nd_pfn);
+ return NULL;
+ }
+
+ nd_pfn->mode = mode;
+ if (uuid)
+ uuid = kmemdup(uuid, 16, GFP_KERNEL);
+ nd_pfn->uuid = uuid;
+ dev = &nd_pfn->dev;
+ dev_set_name(dev, "pfn%d.%d", nd_region->id, nd_pfn->id);
+ dev->parent = &nd_region->dev;
+ dev->type = &nd_pfn_device_type;
+ dev->groups = nd_pfn_attribute_groups;
+ device_initialize(&nd_pfn->dev);
+ if (ndns && !__nd_attach_ndns(&nd_pfn->dev, ndns, &nd_pfn->ndns)) {
+ dev_dbg(&ndns->dev, "%s failed, already claimed by %s\n",
+ __func__, dev_name(ndns->claim));
+ put_device(dev);
+ return NULL;
+ }
+ return dev;
+}
+
+struct device *nd_pfn_create(struct nd_region *nd_region)
+{
+ struct device *dev = __nd_pfn_create(nd_region, NULL, PFN_MODE_NONE,
+ NULL);
+
+ if (dev)
+ __nd_device_register(dev);
+ return dev;
+}
+
+static int nd_pfn_validate(struct nd_pfn *nd_pfn)
+{
+ struct nd_namespace_common *ndns = nd_pfn->ndns;
+ struct nd_pfn_sb *pfn_sb = nd_pfn->pfn_sb;
+ struct nd_namespace_io *nsio;
+ u64 checksum, offset;
+
+ if (!pfn_sb || !ndns)
+ return -ENODEV;
+
+ if (!is_nd_pmem(nd_pfn->dev.parent))
+ return -ENODEV;
+
+ /* section alignment for simple hotplug */
+ if (nvdimm_namespace_capacity(ndns) < ND_PFN_ALIGN)
+ return -ENODEV;
+
+ if (nvdimm_read_bytes(ndns, SZ_4K, pfn_sb, sizeof(*pfn_sb)))
+ return -ENXIO;
+
+ if (memcmp(pfn_sb->signature, PFN_SIG, PFN_SIG_LEN) != 0)
+ return -ENODEV;
+
+ checksum = le64_to_cpu(pfn_sb->checksum);
+ pfn_sb->checksum = 0;
+ if (checksum != nd_sb_checksum((struct nd_gen_sb *) pfn_sb))
+ return -ENODEV;
+ pfn_sb->checksum = cpu_to_le64(checksum);
+
+ switch (le32_to_cpu(pfn_sb->mode)) {
+ case PFN_MODE_RAM:
+ break;
+ case PFN_MODE_PMEM:
+ /* TODO: allocate from PMEM support */
+ return -ENOTTY;
+ default:
+ return -ENXIO;
+ }
+
+ if (!nd_pfn->uuid) {
+ /* from probe we allocate */
+ nd_pfn->uuid = kmemdup(pfn_sb->uuid, 16, GFP_KERNEL);
+ if (!nd_pfn->uuid)
+ return -ENOMEM;
+ } else {
+ /* from init we validate */
+ if (memcmp(nd_pfn->uuid, pfn_sb->uuid, 16) != 0)
+ return -EINVAL;
+ }
+
+ /*
+ * These warnings are verbose because they can only trigger in
+ * the case where the physical address alignment of the
+ * namespace has changed since the pfn superblock was
+ * established.
+ */
+ offset = le64_to_cpu(pfn_sb->dataoff);
+ nsio = to_nd_namespace_io(&ndns->dev);
+ if ((nsio->res.start + offset) & (ND_PFN_ALIGN - 1)) {
+ dev_err(&nd_pfn->dev,
+ "init failed: %s with offset %#llx not section aligned\n",
+ dev_name(&ndns->dev), offset);
+ return -EBUSY;
+ } else if (offset >= resource_size(&nsio->res)) {
+ dev_err(&nd_pfn->dev, "pfn array size exceeds capacity of %s\n",
+ dev_name(&ndns->dev));
+ return -EBUSY;
+ }
+
+ return 0;
+}
+
+int nd_pfn_probe(struct nd_namespace_common *ndns, void *drvdata)
+{
+ int rc;
+ struct device *dev;
+ struct nd_pfn *nd_pfn;
+ struct nd_pfn_sb *pfn_sb;
+ struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
+
+ if (ndns->force_raw)
+ return -ENODEV;
+
+ nvdimm_bus_lock(&ndns->dev);
+ dev = __nd_pfn_create(nd_region, NULL, PFN_MODE_NONE, ndns);
+ nvdimm_bus_unlock(&ndns->dev);
+ if (!dev)
+ return -ENOMEM;
+ dev_set_drvdata(dev, drvdata);
+ pfn_sb = kzalloc(sizeof(*pfn_sb), GFP_KERNEL);
+ nd_pfn = to_nd_pfn(dev);
+ nd_pfn->pfn_sb = pfn_sb;
+ rc = nd_pfn_validate(nd_pfn);
+ nd_pfn->pfn_sb = NULL;
+ kfree(pfn_sb);
+ dev_dbg(&ndns->dev, "%s: pfn: %s\n", __func__,
+ rc == 0 ? dev_name(dev) : "<none>");
+ if (rc < 0) {
+ __nd_detach_ndns(dev, &nd_pfn->ndns);
+ put_device(dev);
+ } else
+ __nd_device_register(&nd_pfn->dev);
+
+ return rc;
+}
+EXPORT_SYMBOL(nd_pfn_probe);
diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
index f28f78ccff19..7da63eac78ee 100644
--- a/drivers/nvdimm/region.c
+++ b/drivers/nvdimm/region.c
@@ -53,6 +53,7 @@ static int nd_region_probe(struct device *dev)
return -ENODEV;

nd_region->btt_seed = nd_btt_create(nd_region);
+ nd_region->pfn_seed = nd_pfn_create(nd_region);
if (err == 0)
return 0;

@@ -84,6 +85,7 @@ static int nd_region_remove(struct device *dev)
nvdimm_bus_lock(dev);
nd_region->ns_seed = NULL;
nd_region->btt_seed = NULL;
+ nd_region->pfn_seed = NULL;
dev_set_drvdata(dev, NULL);
nvdimm_bus_unlock(dev);

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 7384455792bf..da4338154ad2 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -345,6 +345,23 @@ static ssize_t btt_seed_show(struct device *dev,
}
static DEVICE_ATTR_RO(btt_seed);

+static ssize_t pfn_seed_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct nd_region *nd_region = to_nd_region(dev);
+ ssize_t rc;
+
+ nvdimm_bus_lock(dev);
+ if (nd_region->pfn_seed)
+ rc = sprintf(buf, "%s\n", dev_name(nd_region->pfn_seed));
+ else
+ rc = sprintf(buf, "\n");
+ nvdimm_bus_unlock(dev);
+
+ return rc;
+}
+static DEVICE_ATTR_RO(pfn_seed);
+
static ssize_t read_only_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -373,6 +390,7 @@ static struct attribute *nd_region_attributes[] = {
&dev_attr_nstype.attr,
&dev_attr_mappings.attr,
&dev_attr_btt_seed.attr,
+ &dev_attr_pfn_seed.attr,
&dev_attr_read_only.attr,
&dev_attr_set_cookie.attr,
&dev_attr_available_size.attr,
@@ -744,6 +762,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
nd_region->numa_node = ndr_desc->numa_node;
ida_init(&nd_region->ns_ida);
ida_init(&nd_region->btt_ida);
+ ida_init(&nd_region->pfn_ida);
dev = &nd_region->dev;
dev_set_name(dev, "region%d", nd_region->id);
dev->parent = &nvdimm_bus->dev;
diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
index e667579d38a0..22d4d19a49bc 100644
--- a/tools/testing/nvdimm/Kbuild
+++ b/tools/testing/nvdimm/Kbuild
@@ -41,7 +41,9 @@ libnvdimm-y += $(NVDIMM_SRC)/region_devs.o
libnvdimm-y += $(NVDIMM_SRC)/region.o
libnvdimm-y += $(NVDIMM_SRC)/namespace_devs.o
libnvdimm-y += $(NVDIMM_SRC)/label.o
+libnvdimm-$(CONFIG_ND_CLAIM) += $(NVDIMM_SRC)/claim.o
libnvdimm-$(CONFIG_BTT) += $(NVDIMM_SRC)/btt_devs.o
+libnvdimm-$(CONFIG_NVDIMM_PFN) += $(NVDIMM_SRC)/pfn_devs.o
libnvdimm-y += config_check.o

obj-m += test/

2015-08-26 01:33:49

by Dan Williams

[permalink] [raw]
Subject: [PATCH v2 7/9] libnvdimm, pmem: 'struct page' for pmem

Enable the pmem driver to handle PFN device instances. Attaching a pmem
namespace to a pfn device triggers the driver to allocate and initialize
struct page entries for pmem. Memory capacity for this allocation comes
exclusively from RAM for now, which is suitable for low PMEM-to-RAM
ratios. This mechanism will be expanded later to support an "allocate
from PMEM" policy.

Cc: Boaz Harrosh <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/Kconfig | 1
drivers/nvdimm/nd.h | 6 +
drivers/nvdimm/pfn_devs.c | 9 +-
drivers/nvdimm/pmem.c | 203 +++++++++++++++++++++++++++++++++++--
tools/testing/nvdimm/Kbuild | 1
tools/testing/nvdimm/test/iomap.c | 13 ++
6 files changed, 216 insertions(+), 17 deletions(-)

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index ace25b53b755..53c11621d5b1 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -76,6 +76,7 @@ config ND_PFN
config NVDIMM_PFN
bool "PFN: Map persistent (device) memory"
default LIBNVDIMM
+ depends on ZONE_DEVICE
select ND_CLAIM
help
Map persistent memory, i.e. advertise it to the memory
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 95f7efc7fed9..0fe939c42ce5 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -212,6 +212,7 @@ struct nd_pfn *to_nd_pfn(struct device *dev);
int nd_pfn_probe(struct nd_namespace_common *ndns, void *drvdata);
bool is_nd_pfn(struct device *dev);
struct device *nd_pfn_create(struct nd_region *nd_region);
+int nd_pfn_validate(struct nd_pfn *nd_pfn);
#else
static inline int nd_pfn_probe(struct nd_namespace_common *ndns, void *drvdata)
{
@@ -227,6 +228,11 @@ static inline struct device *nd_pfn_create(struct nd_region *nd_region)
{
return NULL;
}
+
+static inline int nd_pfn_validate(struct nd_pfn *nd_pfn)
+{
+ return -ENODEV;
+}
#endif

struct nd_region *to_nd_region(struct device *dev);
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index f708d63709a5..3fd7d0d81a47 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -228,7 +228,7 @@ struct device *nd_pfn_create(struct nd_region *nd_region)
return dev;
}

-static int nd_pfn_validate(struct nd_pfn *nd_pfn)
+int nd_pfn_validate(struct nd_pfn *nd_pfn)
{
struct nd_namespace_common *ndns = nd_pfn->ndns;
struct nd_pfn_sb *pfn_sb = nd_pfn->pfn_sb;
@@ -286,10 +286,10 @@ static int nd_pfn_validate(struct nd_pfn *nd_pfn)
*/
offset = le64_to_cpu(pfn_sb->dataoff);
nsio = to_nd_namespace_io(&ndns->dev);
- if ((nsio->res.start + offset) & (ND_PFN_ALIGN - 1)) {
+ if (nsio->res.start & ND_PFN_MASK) {
dev_err(&nd_pfn->dev,
- "init failed: %s with offset %#llx not section aligned\n",
- dev_name(&ndns->dev), offset);
+ "init failed: %s not section aligned\n",
+ dev_name(&ndns->dev));
return -EBUSY;
} else if (offset >= resource_size(&nsio->res)) {
dev_err(&nd_pfn->dev, "pfn array size exceeds capacity of %s\n",
@@ -299,6 +299,7 @@ static int nd_pfn_validate(struct nd_pfn *nd_pfn)

return 0;
}
+EXPORT_SYMBOL(nd_pfn_validate);

int nd_pfn_probe(struct nd_namespace_common *ndns, void *drvdata)
{
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 20bf122328da..13cee46a7b8b 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -21,18 +21,24 @@
#include <linux/init.h>
#include <linux/platform_device.h>
#include <linux/module.h>
+#include <linux/memory_hotplug.h>
#include <linux/moduleparam.h>
+#include <linux/vmalloc.h>
#include <linux/slab.h>
#include <linux/pmem.h>
#include <linux/nd.h>
+#include "pfn.h"
#include "nd.h"

struct pmem_device {
struct request_queue *pmem_queue;
struct gendisk *pmem_disk;
+ struct nd_namespace_common *ndns;

/* One contiguous memory region per device */
phys_addr_t phys_addr;
+ /* when non-zero this device is hosting a 'pfn' instance */
+ phys_addr_t data_offset;
void __pmem *virt_addr;
size_t size;
};
@@ -44,7 +50,7 @@ static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
sector_t sector)
{
void *mem = kmap_atomic(page);
- size_t pmem_off = sector << 9;
+ phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
void __pmem *pmem_addr = pmem->virt_addr + pmem_off;

if (rw == READ) {
@@ -95,16 +101,23 @@ static long pmem_direct_access(struct block_device *bdev, sector_t sector,
void __pmem **kaddr, unsigned long *pfn)
{
struct pmem_device *pmem = bdev->bd_disk->private_data;
- size_t offset = sector << 9;
-
- if (!pmem)
- return -ENODEV;
+ resource_size_t offset = sector * 512 + pmem->data_offset;
+ resource_size_t size;
+
+ if (pmem->data_offset) {
+ /*
+ * Limit the direct_access() size to what is covered by
+ * the memmap
+ */
+ size = (pmem->size - offset) & ~ND_PFN_MASK;
+ } else
+ size = pmem->size - offset;

/* FIXME convert DAX to comprehend that this mapping has a lifetime */
*kaddr = pmem->virt_addr + offset;
*pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT;

- return pmem->size - offset;
+ return size;
}

static const struct block_device_operations pmem_fops = {
@@ -144,13 +157,16 @@ static struct pmem_device *pmem_alloc(struct device *dev,

static void pmem_detach_disk(struct pmem_device *pmem)
{
+ if (!pmem->pmem_disk)
+ return;
+
del_gendisk(pmem->pmem_disk);
put_disk(pmem->pmem_disk);
blk_cleanup_queue(pmem->pmem_queue);
}

-static int pmem_attach_disk(struct nd_namespace_common *ndns,
- struct pmem_device *pmem)
+static int pmem_attach_disk(struct device *dev,
+ struct nd_namespace_common *ndns, struct pmem_device *pmem)
{
struct gendisk *disk;

@@ -177,8 +193,8 @@ static int pmem_attach_disk(struct nd_namespace_common *ndns,
disk->queue = pmem->pmem_queue;
disk->flags = GENHD_FL_EXT_DEVT;
nvdimm_namespace_disk_name(ndns, disk->disk_name);
- disk->driverfs_dev = &ndns->dev;
- set_capacity(disk, pmem->size >> 9);
+ disk->driverfs_dev = dev;
+ set_capacity(disk, (pmem->size - pmem->data_offset) / 512);
pmem->pmem_disk = disk;

add_disk(disk);
@@ -207,6 +223,154 @@ static int pmem_rw_bytes(struct nd_namespace_common *ndns,
return 0;
}

+static int nd_pfn_init(struct nd_pfn *nd_pfn)
+{
+ struct nd_pfn_sb *pfn_sb = kzalloc(sizeof(*pfn_sb), GFP_KERNEL);
+ struct pmem_device *pmem = dev_get_drvdata(&nd_pfn->dev);
+ struct nd_namespace_common *ndns = nd_pfn->ndns;
+ struct nd_region *nd_region;
+ unsigned long npfns;
+ phys_addr_t offset;
+ u64 checksum;
+ int rc;
+
+ if (!pfn_sb)
+ return -ENOMEM;
+
+ nd_pfn->pfn_sb = pfn_sb;
+ rc = nd_pfn_validate(nd_pfn);
+ if (rc == 0 || rc == -EBUSY)
+ return rc;
+
+ /* section alignment for simple hotplug */
+ if (nvdimm_namespace_capacity(ndns) < ND_PFN_ALIGN
+ || pmem->phys_addr & ND_PFN_MASK)
+ return -ENODEV;
+
+ nd_region = to_nd_region(nd_pfn->dev.parent);
+ if (nd_region->ro) {
+ dev_info(&nd_pfn->dev,
+ "%s is read-only, unable to init metadata\n",
+ dev_name(&nd_region->dev));
+ goto err;
+ }
+
+ memset(pfn_sb, 0, sizeof(*pfn_sb));
+ npfns = (pmem->size - SZ_8K) / SZ_4K;
+ /*
+ * Note, we use 64 here for the standard size of struct page,
+ * debugging options may cause it to be larger in which case the
+ * implementation will limit the pfns advertised through
+ * ->direct_access() to those that are included in the memmap.
+ */
+ if (nd_pfn->mode == PFN_MODE_PMEM)
+ offset = ALIGN(SZ_8K + 64 * npfns, PMD_SIZE);
+ else if (nd_pfn->mode == PFN_MODE_RAM)
+ offset = SZ_8K;
+ else
+ goto err;
+
+ npfns = (pmem->size - offset) / SZ_4K;
+ pfn_sb->mode = cpu_to_le32(nd_pfn->mode);
+ pfn_sb->dataoff = cpu_to_le64(offset);
+ pfn_sb->npfns = cpu_to_le64(npfns);
+ memcpy(pfn_sb->signature, PFN_SIG, PFN_SIG_LEN);
+ memcpy(pfn_sb->uuid, nd_pfn->uuid, 16);
+ pfn_sb->version_major = cpu_to_le16(1);
+ checksum = nd_sb_checksum((struct nd_gen_sb *) pfn_sb);
+ pfn_sb->checksum = cpu_to_le64(checksum);
+
+ rc = nvdimm_write_bytes(ndns, SZ_4K, pfn_sb, sizeof(*pfn_sb));
+ if (rc)
+ goto err;
+
+ return 0;
+ err:
+ nd_pfn->pfn_sb = NULL;
+ kfree(pfn_sb);
+ return -ENXIO;
+}
+
+static int nvdimm_namespace_detach_pfn(struct nd_namespace_common *ndns)
+{
+ struct nd_pfn *nd_pfn = to_nd_pfn(ndns->claim);
+ struct pmem_device *pmem;
+
+ /* free pmem disk */
+ pmem = dev_get_drvdata(&nd_pfn->dev);
+ pmem_detach_disk(pmem);
+
+ /* release nd_pfn resources */
+ kfree(nd_pfn->pfn_sb);
+ nd_pfn->pfn_sb = NULL;
+
+ return 0;
+}
+
+static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
+{
+ struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
+ struct nd_pfn *nd_pfn = to_nd_pfn(ndns->claim);
+ struct device *dev = &nd_pfn->dev;
+ struct vmem_altmap *altmap;
+ struct nd_region *nd_region;
+ struct nd_pfn_sb *pfn_sb;
+ struct pmem_device *pmem;
+ phys_addr_t offset;
+ int rc;
+
+ if (!nd_pfn->uuid || !nd_pfn->ndns)
+ return -ENODEV;
+
+ nd_region = to_nd_region(dev->parent);
+ rc = nd_pfn_init(nd_pfn);
+ if (rc)
+ return rc;
+
+ if (PAGE_SIZE != SZ_4K) {
+ dev_err(dev, "only supported on systems with 4K PAGE_SIZE\n");
+ return -ENXIO;
+ }
+ if (nsio->res.start & ND_PFN_MASK) {
+ dev_err(dev, "%s not memory hotplug section aligned\n",
+ dev_name(&ndns->dev));
+ return -ENXIO;
+ }
+
+ pfn_sb = nd_pfn->pfn_sb;
+ offset = le64_to_cpu(pfn_sb->dataoff);
+ nd_pfn->mode = le32_to_cpu(nd_pfn->pfn_sb->mode);
+ if (nd_pfn->mode == PFN_MODE_RAM) {
+ if (offset != SZ_8K)
+ return -EINVAL;
+ nd_pfn->npfns = le64_to_cpu(pfn_sb->npfns);
+ altmap = NULL;
+ } else {
+ rc = -ENXIO;
+ goto err;
+ }
+
+ /* establish pfn range for lookup, and switch to direct map */
+ pmem = dev_get_drvdata(dev);
+ memunmap_pmem(dev, pmem->virt_addr);
+ pmem->virt_addr = (void __pmem *)devm_memremap_pages(dev, &nsio->res);
+ if (IS_ERR(pmem->virt_addr)) {
+ rc = PTR_ERR(pmem->virt_addr);
+ goto err;
+ }
+
+ /* attach pmem disk in "pfn-mode" */
+ pmem->data_offset = offset;
+ rc = pmem_attach_disk(dev, ndns, pmem);
+ if (rc)
+ goto err;
+
+ return rc;
+ err:
+ nvdimm_namespace_detach_pfn(ndns);
+ return rc;
+}
+
static int nd_pmem_probe(struct device *dev)
{
struct nd_region *nd_region = to_nd_region(dev->parent);
@@ -223,16 +387,27 @@ static int nd_pmem_probe(struct device *dev)
if (IS_ERR(pmem))
return PTR_ERR(pmem);

+ pmem->ndns = ndns;
dev_set_drvdata(dev, pmem);
ndns->rw_bytes = pmem_rw_bytes;

if (is_nd_btt(dev))
return nvdimm_namespace_attach_btt(ndns);

- if (nd_btt_probe(ndns, pmem) == 0)
+ if (is_nd_pfn(dev))
+ return nvdimm_namespace_attach_pfn(ndns);
+
+ if (nd_btt_probe(ndns, pmem) == 0) {
/* we'll come back as btt-pmem */
return -ENXIO;
- return pmem_attach_disk(ndns, pmem);
+ }
+
+ if (nd_pfn_probe(ndns, pmem) == 0) {
+ /* we'll come back as pfn-pmem */
+ return -ENXIO;
+ }
+
+ return pmem_attach_disk(dev, ndns, pmem);
}

static int nd_pmem_remove(struct device *dev)
@@ -240,7 +415,9 @@ static int nd_pmem_remove(struct device *dev)
struct pmem_device *pmem = dev_get_drvdata(dev);

if (is_nd_btt(dev))
- nvdimm_namespace_detach_btt(to_nd_btt(dev)->ndns);
+ nvdimm_namespace_detach_btt(pmem->ndns);
+ else if (is_nd_pfn(dev))
+ nvdimm_namespace_detach_pfn(pmem->ndns);
else
pmem_detach_disk(pmem);

diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
index 22d4d19a49bc..eff633f8b6db 100644
--- a/tools/testing/nvdimm/Kbuild
+++ b/tools/testing/nvdimm/Kbuild
@@ -1,6 +1,7 @@
ldflags-y += --wrap=ioremap_wc
ldflags-y += --wrap=devm_ioremap_nocache
ldflags-y += --wrap=devm_memremap
+ldflags-y += --wrap=devm_memunmap
ldflags-y += --wrap=ioremap_nocache
ldflags-y += --wrap=iounmap
ldflags-y += --wrap=__devm_request_region
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index ff1e00458864..3609f6713075 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -95,6 +95,19 @@ void *__wrap_devm_memremap(struct device *dev, resource_size_t offset,
}
EXPORT_SYMBOL(__wrap_devm_memremap);

+void __wrap_devm_memunmap(struct device *dev, void *addr)
+{
+ struct nfit_test_resource *nfit_res;
+
+ rcu_read_lock();
+ nfit_res = get_nfit_res((unsigned long) addr);
+ rcu_read_unlock();
+ if (nfit_res)
+ return;
+ return devm_memunmap(dev, addr);
+}
+EXPORT_SYMBOL(__wrap_devm_memunmap);
+
void __iomem *__wrap_ioremap_nocache(resource_size_t offset, unsigned long size)
{
return __nfit_test_ioremap(offset, size, ioremap_nocache);

2015-08-26 01:33:54

by Dan Williams

[permalink] [raw]
Subject: [PATCH v2 8/9] libnvdimm, pmem: direct map legacy pmem by default

The expectation is that the legacy / non-standard pmem discovery method
(e820 type-12) will only ever be used to describe small quantities of
persistent memory. Larger capacities will be described via the ACPI
NFIT. When "allocate struct page from pmem" support is added this default
policy can be overridden by assigning a legacy pmem namespace to a pfn
device, however this would be only be necessary if a platform used the
legacy mechanism to define a very large range.
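
For illustration, the opt-in is a per-region flag. A minimal sketch of
a bus provider requesting the default direct mapping, mirroring the
e820 hunk below (descriptor setup elided):

	struct nd_region_desc ndr_desc;

	memset(&ndr_desc, 0, sizeof(ndr_desc));
	/* ... fill in res, attr_groups, numa_node ... */
	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
	nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);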

Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/e820.c | 1 +
drivers/nvdimm/namespace_devs.c | 28 ++++++++++++++++++++++++++--
drivers/nvdimm/nd.h | 2 ++
drivers/nvdimm/pmem.c | 13 ++++++++++---
drivers/nvdimm/region_devs.c | 1 +
include/linux/libnvdimm.h | 4 ++++
6 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
index 1b5743ad92db..8282db2ef99e 100644
--- a/drivers/nvdimm/e820.c
+++ b/drivers/nvdimm/e820.c
@@ -49,6 +49,7 @@ static int e820_pmem_probe(struct platform_device *pdev)
ndr_desc.res = p;
ndr_desc.attr_groups = e820_pmem_region_attribute_groups;
ndr_desc.numa_node = NUMA_NO_NODE;
+ set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
goto err;
}
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 9303ca29be9b..500bfb2825b3 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -13,6 +13,7 @@
#include <linux/module.h>
#include <linux/device.h>
#include <linux/slab.h>
+#include <linux/pmem.h>
#include <linux/nd.h>
#include "nd-core.h"
#include "nd.h"
@@ -76,6 +77,27 @@ static bool is_namespace_io(struct device *dev)
return dev ? dev->type == &namespace_io_device_type : false;
}

+bool pmem_should_map_pages(struct device *dev)
+{
+ struct nd_region *nd_region = to_nd_region(dev->parent);
+
+ if (!IS_ENABLED(CONFIG_ZONE_DEVICE))
+ return false;
+
+ if (!test_bit(ND_REGION_PAGEMAP, &nd_region->flags))
+ return false;
+
+ if (is_nd_pfn(dev) || is_nd_btt(dev))
+ return false;
+
+#ifdef ARCH_MEMREMAP_PMEM
+ return ARCH_MEMREMAP_PMEM == MEMREMAP_WB;
+#else
+ return false;
+#endif
+}
+EXPORT_SYMBOL(pmem_should_map_pages);
+
const char *nvdimm_namespace_disk_name(struct nd_namespace_common *ndns,
char *name)
{
@@ -93,9 +115,11 @@ const char *nvdimm_namespace_disk_name(struct nd_namespace_common *ndns,
dev_name(ndns->claim));
}

- if (is_namespace_pmem(&ndns->dev) || is_namespace_io(&ndns->dev))
+ if (is_namespace_pmem(&ndns->dev) || is_namespace_io(&ndns->dev)) {
+ if (pmem_should_map_pages(&ndns->dev))
+ suffix = "m";
sprintf(name, "pmem%d%s", nd_region->id, suffix);
- else if (is_namespace_blk(&ndns->dev)) {
+ } else if (is_namespace_blk(&ndns->dev)) {
struct nd_namespace_blk *nsblk;

nsblk = to_nd_namespace_blk(&ndns->dev);
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 0fe939c42ce5..182eb64f6081 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -95,6 +95,7 @@ struct nd_region {
struct ida ns_ida;
struct ida btt_ida;
struct ida pfn_ida;
+ unsigned long flags;
struct device *ns_seed;
struct device *btt_seed;
struct device *pfn_seed;
@@ -271,4 +272,5 @@ static inline bool nd_iostat_start(struct bio *bio, unsigned long *start)
void nd_iostat_end(struct bio *bio, unsigned long start);
resource_size_t nd_namespace_blk_validate(struct nd_namespace_blk *nsblk);
const u8 *nd_dev_to_uuid(struct device *dev);
+bool pmem_should_map_pages(struct device *dev);
#endif /* __ND_H__ */
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 13cee46a7b8b..6ea2ead35fb2 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -148,9 +148,16 @@ static struct pmem_device *pmem_alloc(struct device *dev,
return ERR_PTR(-EBUSY);
}

- pmem->virt_addr = memremap_pmem(dev, pmem->phys_addr, pmem->size);
- if (!pmem->virt_addr)
- return ERR_PTR(-ENXIO);
+ if (pmem_should_map_pages(dev)) {
+ pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res);
+ if (IS_ERR(pmem->virt_addr))
+ return pmem->virt_addr;
+ } else {
+ pmem->virt_addr = memremap_pmem(dev, pmem->phys_addr,
+ pmem->size);
+ if (!pmem->virt_addr)
+ return ERR_PTR(-ENXIO);
+ }

return pmem;
}
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index da4338154ad2..529f3f02e7b2 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -758,6 +758,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
nd_region->provider_data = ndr_desc->provider_data;
nd_region->nd_set = ndr_desc->nd_set;
nd_region->num_lanes = ndr_desc->num_lanes;
+ nd_region->flags = ndr_desc->flags;
nd_region->ro = ro;
nd_region->numa_node = ndr_desc->numa_node;
ida_init(&nd_region->ns_ida);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 75e3af01ee32..3f021dc5da8c 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -31,6 +31,9 @@ enum {
ND_CMD_ARS_STATUS_MAX = SZ_4K,
ND_MAX_MAPPINGS = 32,

+ /* region flag indicating to direct-map persistent memory by default */
+ ND_REGION_PAGEMAP = 0,
+
/* mark newly adjusted resources as requiring a label update */
DPA_RESOURCE_ADJUSTED = 1 << 0,
};
@@ -91,6 +94,7 @@ struct nd_region_desc {
void *provider_data;
int num_lanes;
int numa_node;
+ unsigned long flags;
};

struct nvdimm_bus;

2015-08-26 01:33:59

by Dan Williams

[permalink] [raw]
Subject: [PATCH v2 9/9] devm_memremap_pages: protect against pmem device unbind

Given that:

1/ device ->remove() can not fail

2/ a pmem device may be unbound at any time

3/ we do not know what other parts of the kernel are actively using a
'struct page' from devm_memremap_pages()

...provide a facility for active usages of device memory to block pmem
device unbind. With a percpu_ref it should be feasible to take a
reference on a per-I/O or other high frequency basis.
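
A hypothetical per-I/O caller, for illustration only since this series
adds no user of the api yet; pmem_rw() is a made-up stand-in for the
actual I/O work:

static int pmem_do_rw(struct pmem_device *pmem, void *buf, size_t len,
		resource_size_t off)
{
	struct page_map *map = get_page_map(pmem->phys_addr + off);
	int rc;

	if (!map)
		return -ENXIO;	/* device already unbound */
	rc = pmem_rw(pmem, buf, len, off);	/* hypothetical helper */
	put_page_map(map);	/* unbind can complete once refs drain */
	return rc;
}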

Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/io.h | 36 ++++++++++++++++++++++
kernel/memremap.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 122 insertions(+), 3 deletions(-)

diff --git a/include/linux/io.h b/include/linux/io.h
index de64c1e53612..e20cc04f42b7 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -90,8 +90,31 @@ void devm_memunmap(struct device *dev, void *addr);
void *__devm_memremap_pages(struct device *dev, struct resource *res);

#ifdef CONFIG_ZONE_DEVICE
+#include <linux/percpu-refcount.h>
+#include <linux/ioport.h>
+#include <linux/list.h>
+
+struct page_map {
+ struct resource res;
+ struct list_head list;
+ unsigned long flags;
+ struct percpu_ref percpu_ref;
+ struct device *dev;
+};
+
void *devm_memremap_pages(struct device *dev, struct resource *res);
+struct page_map * __must_check get_page_map(resource_size_t addr);
+static inline void ref_page_map(struct page_map *page_map)
+{
+ percpu_ref_get(&page_map->percpu_ref);
+}
+
+static inline void put_page_map(struct page_map *page_map)
+{
+ percpu_ref_put(&page_map->percpu_ref);
+}
#else
+struct page_map;
static inline void *devm_memremap_pages(struct device *dev, struct resource *res)
{
/*
@@ -102,6 +125,19 @@ static inline void *devm_memremap_pages(struct device *dev, struct resource *res
WARN_ON_ONCE(1);
return ERR_PTR(-ENXIO);
}
+
+static inline __must_check struct page_map *get_page_map(resource_size_t addr)
+{
+ return NULL;
+}
+
+static inline void ref_page_map(struct page_map *page_map)
+{
+}
+
+static inline void put_page_map(struct page_map *page_map)
+{
+}
#endif

/*
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 72b0c66628b6..65a6c9396062 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -12,6 +12,8 @@
*/
#include <linux/device.h>
#include <linux/types.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
#include <linux/io.h>
#include <linux/mm.h>
#include <linux/memory_hotplug.h>
@@ -138,14 +140,66 @@ void devm_memunmap(struct device *dev, void *addr)
EXPORT_SYMBOL(devm_memunmap);

#ifdef CONFIG_ZONE_DEVICE
-struct page_map {
- struct resource res;
+static DEFINE_MUTEX(page_map_lock);
+static DECLARE_WAIT_QUEUE_HEAD(page_map_wait);
+static LIST_HEAD(page_maps);
+
+enum {
+ PAGE_MAP_LIVE,
+ PAGE_MAP_CONFIRM,
};

+static struct page_map *to_page_map(struct percpu_ref *ref)
+{
+ return container_of(ref, struct page_map, percpu_ref);
+}
+
+static void page_map_release(struct percpu_ref *ref)
+{
+ struct page_map *page_map = to_page_map(ref);
+
+ /* signal page_map is idle (no more refs) */
+ clear_bit(PAGE_MAP_LIVE, &page_map->flags);
+ wake_up_all(&page_map_wait);
+}
+
+static void page_map_confirm(struct percpu_ref *ref)
+{
+ struct page_map *page_map = to_page_map(ref);
+
+ /* signal page_map is confirmed dead (slow path ref mode) */
+ set_bit(PAGE_MAP_CONFIRM, &page_map->flags);
+ wake_up_all(&page_map_wait);
+}
+
+static void page_map_destroy(struct page_map *page_map)
+{
+ long tmo;
+
+ /* flush new lookups */
+ mutex_lock(&page_map_lock);
+ list_del_rcu(&page_map->list);
+ mutex_unlock(&page_map_lock);
+ synchronize_rcu();
+
+ percpu_ref_kill_and_confirm(&page_map->percpu_ref, page_map_confirm);
+ do {
+ tmo = wait_event_interruptible_timeout(page_map_wait,
+ !test_bit(PAGE_MAP_LIVE, &page_map->flags)
+ && test_bit(PAGE_MAP_CONFIRM, &page_map->flags), 5*HZ);
+ if (tmo <= 0)
+ dev_dbg(page_map->dev,
+ "page map active, continuing to wait...\n");
+ } while (tmo <= 0);
+}
+
static void devm_memremap_pages_release(struct device *dev, void *res)
{
struct page_map *page_map = res;

+ if (test_bit(PAGE_MAP_LIVE, &page_map->flags))
+ page_map_destroy(page_map);
+
/* pages are dead and unused, undo the arch mapping */
arch_remove_memory(page_map->res.start, resource_size(&page_map->res));
}
@@ -155,7 +209,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res)
int is_ram = region_intersects(res->start, resource_size(res),
"System RAM");
struct page_map *page_map;
- int error, nid;
+ int error, nid, rc;

if (is_ram == REGION_MIXED) {
WARN_ONCE(1, "%s attempted on mixed region %pr\n",
@@ -172,6 +226,12 @@ void *devm_memremap_pages(struct device *dev, struct resource *res)
return ERR_PTR(-ENOMEM);

memcpy(&page_map->res, res, sizeof(*res));
+ INIT_LIST_HEAD(&page_map->list);
+ page_map->dev = dev;
+ rc = percpu_ref_init(&page_map->percpu_ref, page_map_release, 0,
+ GFP_KERNEL);
+ if (rc)
+ return ERR_PTR(rc);

nid = dev_to_node(dev);
if (nid < 0)
@@ -183,8 +243,31 @@ void *devm_memremap_pages(struct device *dev, struct resource *res)
return ERR_PTR(error);
}

+ set_bit(PAGE_MAP_LIVE, &page_map->flags);
+ mutex_lock(&page_map_lock);
+ list_add_rcu(&page_map->list, &page_maps);
+ mutex_unlock(&page_map_lock);
+
devres_add(dev, page_map);
return __va(res->start);
}
EXPORT_SYMBOL(devm_memremap_pages);
+
+struct page_map * __must_check get_page_map(resource_size_t addr)
+{
+ struct page_map *page_map, *ret = NULL;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(page_map, &page_maps, list) {
+ if (addr >= page_map->res.start && addr <= page_map->res.end) {
+ if (percpu_ref_tryget(&page_map->percpu_ref))
+ ret = page_map;
+ break;
+ }
+ }
+ rcu_read_unlock();
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(get_page_map);
#endif /* CONFIG_ZONE_DEVICE */

2015-08-26 12:41:29

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] x86, pmem: push fallback handling to arch code

I like the intent behind this, but not the implementation.

I think the right approach is to keep the defaults in linux/pmem.h
and simply not set CONFIG_ARCH_HAS_PMEM_API for x86-32.

2015-08-26 12:46:52

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 9/9] devm_memremap_pages: protect against pmem device unbind

On Tue, Aug 25, 2015 at 09:28:13PM -0400, Dan Williams wrote:
> Given that:
>
> 1/ device ->remove() can not fail
>
> 2/ a pmem device may be unbound at any time
>
> 3/ we do not know what other parts of the kernel are actively using a
> 'struct page' from devm_memremap_pages()
>
> ...provide a facility for active usages of device memory to block pmem
> device unbind. With a percpu_ref it should be feasible to take a
> reference on a per-I/O or other high frequency basis.

Without a caller of get_page_map this is just adding dead code. I'd
suggest to group it in a series with that caller.

Also if the page_map gets exposed in a header the name is a bit too generic.
memremap_map maybe?

2015-08-26 21:34:24

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] x86, pmem: push fallback handling to arch code

On Wed, 2015-08-26 at 14:41 +0200, Christoph Hellwig wrote:
> I like the intent behind this, but not the implementation.
>
> I think the right approach is to keep the defaults in linux/pmem.h
> and simply not set CONFIG_ARCH_HAS_PMEM_API for x86-32.

Yes, that makes things much cleaner. Revised patch and changelog below:

8<----
Subject: x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB

From: Dan Williams <[email protected]>

Given that a write-back (WB) mapping plus non-temporal stores is
expected to be the most efficient way to access PMEM, update the
definition of ARCH_HAS_PMEM_API to imply arch support for
WB-mapped-PMEM. This is needed as a pre-requisite for adding PMEM to
the direct map and mapping it with struct page.

The above clarification for X86_64 means that memcpy_to_pmem() is
permitted to use the non-temporal arch_memcpy_to_pmem() rather than
needlessly fall back to default_memcpy_to_pmem() when the pcommit
instruction is not available. When arch_memcpy_to_pmem() is not
guaranteed to flush writes out of cache, i.e. on older X86_32
implementations where non-temporal stores may just dirty cache,
ARCH_HAS_PMEM_API is simply disabled.

The default fallback for persistent memory handling remains. Namely,
map it with the WT (write-through) cache-type and hope for the best.

arch_has_pmem_api() is updated to only indicate whether the arch
provides the proper helpers to meet the minimum "writes are visible
outside the cache hierarchy after memcpy_to_pmem() + wmb_pmem()". Code
that cares whether wmb_pmem() actually flushes writes to pmem must now
call arch_has_wmb_pmem() directly.
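
Caller-side, the two predicates now answer different questions; a
sketch condensed from the nfit/pmem hunks below:

	/* arch_has_pmem_api(): arch supplies WB-mapped pmem helpers */
	/* arch_has_wmb_pmem(): wmb_pmem() additionally flushes to media */
	if (!arch_has_wmb_pmem())
		dev_warn(dev, "unable to guarantee persistence of writes\n");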

Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Christoph Hellwig <[email protected]>
[hch: set ARCH_HAS_PMEM_API=n on X86_32]
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/Kconfig | 2 +-
arch/x86/include/asm/io.h | 2 --
arch/x86/include/asm/pmem.h | 8 ++------
drivers/acpi/nfit.c | 2 +-
drivers/nvdimm/pmem.c | 2 +-
include/linux/pmem.h | 28 +++++++++++++++++-----------
6 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 76c61154ed50..5912859df533 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,7 +27,7 @@ config X86
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FAST_MULTIPLIER
select ARCH_HAS_GCOV_PROFILE_ALL
- select ARCH_HAS_PMEM_API
+ select ARCH_HAS_PMEM_API if X86_64
select ARCH_HAS_SG_CHAIN
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index d241fbd5c87b..83ec9b1d77cc 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -248,8 +248,6 @@ static inline void flush_write_buffers(void)
#endif
}

-#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
-
#endif /* __KERNEL__ */

extern void native_io_delay(void);
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index a3a0df6545ee..5111f1f053a4 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -19,6 +19,7 @@
#include <asm/special_insns.h>

#ifdef CONFIG_ARCH_HAS_PMEM_API
+#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
/**
* arch_memcpy_to_pmem - copy data to persistent memory
* @dst: destination buffer for the copy
@@ -141,18 +142,13 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
__arch_wb_cache_pmem(vaddr, size);
}

-static inline bool arch_has_wmb_pmem(void)
+static inline bool __arch_has_wmb_pmem(void)
{
-#ifdef CONFIG_X86_64
/*
* We require that wmb() be an 'sfence', that is only guaranteed on
* 64-bit builds
*/
return static_cpu_has(X86_FEATURE_PCOMMIT);
-#else
- return false;
-#endif
}
#endif /* CONFIG_ARCH_HAS_PMEM_API */
-
#endif /* __ASM_X86_PMEM_H__ */
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 7c2638f914a9..c3fe20635562 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -1364,7 +1364,7 @@ static int acpi_nfit_blk_region_enable(struct nvdimm_bus *nvdimm_bus,
return -ENOMEM;
}

- if (!arch_has_pmem_api() && !nfit_blk->nvdimm_flush)
+ if (!arch_has_wmb_pmem() && !nfit_blk->nvdimm_flush)
dev_warn(dev, "unable to guarantee persistence of writes\n");

if (mmio->line_size == 0)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b5b9cb758b6..20bf122328da 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -125,7 +125,7 @@ static struct pmem_device *pmem_alloc(struct device *dev,

pmem->phys_addr = res->start;
pmem->size = resource_size(res);
- if (!arch_has_pmem_api())
+ if (!arch_has_wmb_pmem())
dev_warn(dev, "unable to guarantee persistence of writes\n");

if (!devm_request_mem_region(dev, pmem->phys_addr, pmem->size,
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index a9d84bf335ee..9ec42710315e 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -19,12 +19,12 @@
#ifdef CONFIG_ARCH_HAS_PMEM_API
#include <asm/pmem.h>
#else
-static inline void arch_wmb_pmem(void)
-{
- BUG();
-}
-
-static inline bool arch_has_wmb_pmem(void)
+/*
+ * These are simply here to enable compilation; all call sites gate
+ * calling these symbols with arch_has_pmem_api() and redirect to the
+ * implementation in asm/pmem.h.
+ */
+static inline bool __arch_has_wmb_pmem(void)
{
return false;
}
@@ -53,7 +53,6 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
* implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(),
* arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem().
*/
-
static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size)
{
memcpy(dst, (void __force const *) src, size);
@@ -64,8 +63,13 @@ static inline void memunmap_pmem(struct device *dev, void __pmem *addr)
devm_memunmap(dev, (void __force *) addr);
}

+static inline bool arch_has_pmem_api(void)
+{
+ return IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API);
+}
+
/**
- * arch_has_pmem_api - true if wmb_pmem() ensures durability
+ * arch_has_wmb_pmem - true if wmb_pmem() ensures durability
*
* For a given cpu implementation within an architecture it is possible
* that wmb_pmem() resolves to a nop. In the case this returns
@@ -73,9 +77,9 @@ static inline void memunmap_pmem(struct device *dev, void __pmem *addr)
* fall back to a different data consistency model, or otherwise notify
* the user.
*/
-static inline bool arch_has_pmem_api(void)
+static inline bool arch_has_wmb_pmem(void)
{
- return IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API) && arch_has_wmb_pmem();
+ return arch_has_pmem_api() && __arch_has_wmb_pmem();
}

/*
@@ -158,8 +162,10 @@ static inline void memcpy_to_pmem(void __pmem *dst, const void *src, size_t n)
*/
static inline void wmb_pmem(void)
{
- if (arch_has_pmem_api())
+ if (arch_has_wmb_pmem())
arch_wmb_pmem();
+ else
+ wmb();
}

/**


2015-08-26 21:39:23

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 9/9] devm_memremap_pages: protect against pmem device unbind

On Wed, 2015-08-26 at 14:46 +0200, Christoph Hellwig wrote:
> On Tue, Aug 25, 2015 at 09:28:13PM -0400, Dan Williams wrote:
> > Given that:
> >
> > 1/ device ->remove() can not fail
> >
> > 2/ a pmem device may be unbound at any time
> >
> > 3/ we do not know what other parts of the kernel are actively using a
> > 'struct page' from devm_memremap_pages()
> >
> > ...provide a facility for active usages of device memory to block pmem
> > device unbind. With a percpu_ref it should be feasible to take a
> > reference on a per-I/O or other high frequency basis.
>
> Without a caller of get_page_map this is just adding dead code. I'd
> suggest to group it in a series with that caller.
>

Agreed, we can drop this until the first user arrives.

> Also if the page_map gets exposed in a header the name is a bit too generic.
> memremap_map maybe?

Done, and in the patch below I hide the internal implementation details
of page_map in kernel/memremap.c and only expose the percpu_ref in the
public memremap_map.

8<----
Subject: devm_memremap_pages: protect against pmem device unbind

From: Dan Williams <[email protected]>

Given that:

1/ device ->remove() can not fail

2/ a pmem device may be unbound at any time

3/ we do not know what other parts of the kernel are actively using a
'struct page' from devm_memremap_pages()

...provide a facility for active usages of device memory to block pmem
device unbind. With a percpu_ref it should be feasible to take a
reference on a per-I/O or other high frequency basis.

Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/io.h | 30 ++++++++++++++++
kernel/memremap.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 125 insertions(+), 1 deletion(-)

diff --git a/include/linux/io.h b/include/linux/io.h
index de64c1e53612..9e696b114c6d 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -90,8 +90,25 @@ void devm_memunmap(struct device *dev, void *addr);
void *__devm_memremap_pages(struct device *dev, struct resource *res);

#ifdef CONFIG_ZONE_DEVICE
+#include <linux/percpu-refcount.h>
+
+struct memremap_map {
+ struct percpu_ref percpu_ref;
+};
+
void *devm_memremap_pages(struct device *dev, struct resource *res);
+struct memremap_map * __must_check get_memremap_map(resource_size_t addr);
+static inline void ref_memremap_map(struct memremap_map *memremap_map)
+{
+ percpu_ref_get(&memremap_map->percpu_ref);
+}
+
+static inline void put_memremap_map(struct memremap_map *memremap_map)
+{
+ percpu_ref_put(&memremap_map->percpu_ref);
+}
#else
+struct memremap_map;
static inline void *devm_memremap_pages(struct device *dev, struct resource *res)
{
/*
@@ -102,6 +119,19 @@ static inline void *devm_memremap_pages(struct device *dev, struct resource *res
WARN_ON_ONCE(1);
return ERR_PTR(-ENXIO);
}
+
+static inline __must_check struct memremap_map *get_memremap_map(resource_size_t addr)
+{
+ return NULL;
+}
+
+static inline void ref_memremap_map(struct memremap_map *memremap_map)
+{
+}
+
+static inline void put_memremap_map(struct memremap_map *memremap_map)
+{
+}
#endif

/*
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 72b0c66628b6..5b9f04789b96 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -11,7 +11,11 @@
* General Public License for more details.
*/
#include <linux/device.h>
+#include <linux/ioport.h>
#include <linux/types.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/list.h>
#include <linux/io.h>
#include <linux/mm.h>
#include <linux/memory_hotplug.h>
@@ -138,14 +142,74 @@ void devm_memunmap(struct device *dev, void *addr)
EXPORT_SYMBOL(devm_memunmap);

#ifdef CONFIG_ZONE_DEVICE
+static DEFINE_MUTEX(page_map_lock);
+static DECLARE_WAIT_QUEUE_HEAD(page_map_wait);
+static LIST_HEAD(page_maps);
+
+enum {
+ PAGE_MAP_LIVE,
+ PAGE_MAP_CONFIRM,
+};
+
struct page_map {
struct resource res;
+ struct list_head list;
+ unsigned long flags;
+ struct memremap_map map;
+ struct device *dev;
};

+static struct page_map *to_page_map(struct percpu_ref *ref)
+{
+ return container_of(ref, struct page_map, map.percpu_ref);
+}
+
+static void page_map_release(struct percpu_ref *ref)
+{
+ struct page_map *page_map = to_page_map(ref);
+
+ /* signal page_map is idle (no more refs) */
+ clear_bit(PAGE_MAP_LIVE, &page_map->flags);
+ wake_up_all(&page_map_wait);
+}
+
+static void page_map_confirm(struct percpu_ref *ref)
+{
+ struct page_map *page_map = to_page_map(ref);
+
+ /* signal page_map is confirmed dead (slow path ref mode) */
+ set_bit(PAGE_MAP_CONFIRM, &page_map->flags);
+ wake_up_all(&page_map_wait);
+}
+
+static void page_map_destroy(struct page_map *page_map)
+{
+ long tmo;
+
+ /* flush new lookups */
+ mutex_lock(&page_map_lock);
+ list_del_rcu(&page_map->list);
+ mutex_unlock(&page_map_lock);
+ synchronize_rcu();
+
+ percpu_ref_kill_and_confirm(&page_map->map.percpu_ref, page_map_confirm);
+ do {
+ tmo = wait_event_interruptible_timeout(page_map_wait,
+ !test_bit(PAGE_MAP_LIVE, &page_map->flags)
+ && test_bit(PAGE_MAP_CONFIRM, &page_map->flags), 5*HZ);
+ if (tmo <= 0)
+ dev_dbg(page_map->dev,
+ "page map active, continuing to wait...\n");
+ } while (tmo <= 0);
+}
+
static void devm_memremap_pages_release(struct device *dev, void *res)
{
struct page_map *page_map = res;

+ if (test_bit(PAGE_MAP_LIVE, &page_map->flags))
+ page_map_destroy(page_map);
+
/* pages are dead and unused, undo the arch mapping */
arch_remove_memory(page_map->res.start, resource_size(&page_map->res));
}
@@ -155,7 +219,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res)
int is_ram = region_intersects(res->start, resource_size(res),
"System RAM");
struct page_map *page_map;
- int error, nid;
+ int error, nid, rc;

if (is_ram == REGION_MIXED) {
WARN_ONCE(1, "%s attempted on mixed region %pr\n",
@@ -172,6 +236,12 @@ void *devm_memremap_pages(struct device *dev, struct resource *res)
return ERR_PTR(-ENOMEM);

memcpy(&page_map->res, res, sizeof(*res));
+ INIT_LIST_HEAD(&page_map->list);
+ page_map->dev = dev;
+ rc = percpu_ref_init(&page_map->map.percpu_ref, page_map_release, 0,
+ GFP_KERNEL);
+ if (rc)
+ return ERR_PTR(rc);

nid = dev_to_node(dev);
if (nid < 0)
@@ -183,8 +253,32 @@ void *devm_memremap_pages(struct device *dev, struct resource *res)
return ERR_PTR(error);
}

+ set_bit(PAGE_MAP_LIVE, &page_map->flags);
+ mutex_lock(&page_map_lock);
+ list_add_rcu(&page_map->list, &page_maps);
+ mutex_unlock(&page_map_lock);
+
devres_add(dev, page_map);
return __va(res->start);
}
EXPORT_SYMBOL(devm_memremap_pages);
+
+struct memremap_map * __must_check get_memremap_map(resource_size_t addr)
+{
+ struct memremap_map *ret = NULL;
+ struct page_map *page_map;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(page_map, &page_maps, list) {
+ if (addr >= page_map->res.start && addr <= page_map->res.end) {
+ if (percpu_ref_tryget(&page_map->map.percpu_ref))
+ ret = &page_map->map;
+ break;
+ }
+ }
+ rcu_read_unlock();
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(get_memremap_map);
#endif /* CONFIG_ZONE_DEVICE */


2015-08-27 07:33:09

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] x86, pmem: push fallback handling to arch code

This looks fine to me,

Reviewed-by: Christoph Hellwig <[email protected]>

2015-08-27 07:33:28

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 9/9] devm_memremap_pages: protect against pmem device unbind

On Wed, Aug 26, 2015 at 09:39:18PM +0000, Williams, Dan J wrote:
> On Wed, 2015-08-26 at 14:46 +0200, Christoph Hellwig wrote:
> > On Tue, Aug 25, 2015 at 09:28:13PM -0400, Dan Williams wrote:
> > > Given that:
> > >
> > > 1/ device ->remove() can not fail
> > >
> > > 2/ a pmem device may be unbound at any time
> > >
> > > 3/ we do not know what other parts of the kernel are actively using a
> > > 'struct page' from devm_memremap_pages()
> > >
> > > ...provide a facility for active usages of device memory to block pmem
> > > device unbind. With a percpu_ref it should be feasible to take a
> > > reference on a per-I/O or other high frequency basis.
> >
> > Without a caller of get_page_map this is just adding dead code. I'd
> > suggest to group it in a series with that caller.
> >
>
> Agreed, we can drop this until the first user arrives.
>
> > Also if the page_map gets exposed in a header the name is a bit too generic.
> > memremap_map maybe?
>
> Done, and in the patch below I hide the internal implementation details
> of page_map in kernel/memremap.c and only expose the percpu_ref in the
> public memremap_map.

Yes, that looks good once we're getting the users for it.

2015-08-28 20:22:15

by Ross Zwisler

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] x86, pmem: push fallback handling to arch code

On Wed, Aug 26, 2015 at 09:34:20PM +0000, Williams, Dan J wrote:
> On Wed, 2015-08-26 at 14:41 +0200, Christoph Hellwig wrote:
> > I like the intent behind this, but not the implementation.
> >
> > I think the right approach is to keep the defaults in linux/pmem.h
> > and simply not set CONFIG_ARCH_HAS_PMEM_API for x86-32.
>
> Yes, that makes things much cleaner. Revised patch and changelog below:
>
> 8<----
> Subject: x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
>
> From: Dan Williams <[email protected]>
>
> Given that a write-back (WB) mapping plus non-temporal stores is
> expected to be the most efficient way to access PMEM, update the
> definition of ARCH_HAS_PMEM_API to imply arch support for
> WB-mapped-PMEM. This is needed as a pre-requisite for adding PMEM to
> the direct map and mapping it with struct page.
>
> The above clarification for X86_64 means that memcpy_to_pmem() is
> permitted to use the non-temporal arch_memcpy_to_pmem() rather than
> needlessly fall back to default_memcpy_to_pmem() when the pcommit
> instruction is not available. When arch_memcpy_to_pmem() is not
> guaranteed to flush writes out of cache, i.e. on older X86_32
> implementations where non-temporal stores may just dirty cache,
> ARCH_HAS_PMEM_API is simply disabled.
>
> The default fallback for persistent memory handling remains. Namely,
> map it with the WT (write-through) cache-type and hope for the best.
>
> arch_has_pmem_api() is updated to only indicate whether the arch
> provides the proper helpers to meet the minimum "writes are visible
> outside the cache hierarchy after memcpy_to_pmem() + wmb_pmem()". Code
> that cares whether wmb_pmem() actually flushes writes to pmem must now
> call arch_has_wmb_pmem() directly.
>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Toshi Kani <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> [hch: set ARCH_HAS_PMEM_API=n on X86_32]
> Signed-off-by: Dan Williams <[email protected]>

Yep, this seems like a good change.

Reviewed-by: Ross Zwisler <[email protected]>

2015-08-28 21:43:37

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] x86, pmem: push fallback handling to arch code

On Wed, 2015-08-26 at 21:34 +0000, Williams, Dan J wrote:
> On Wed, 2015-08-26 at 14:41 +0200, Christoph Hellwig wrote:
> > I like the intent behind this, but not the implementation.
> >
> > I think the right approach is to keep the defaults in linux/pmem.h
> > and simply not set CONFIG_ARCH_HAS_PMEM_API for x86-32.
>
> Yes, that makes things much cleaner. Revised patch and changelog below:
>
> 8<----
> Subject: x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
>
> From: Dan Williams <[email protected]>
>
> Given that a write-back (WB) mapping plus non-temporal stores is
> expected to be the most efficient way to access PMEM, update the
> definition of ARCH_HAS_PMEM_API to imply arch support for
> WB-mapped-PMEM. This is needed as a pre-requisite for adding PMEM to
> the direct map and mapping it with struct page.
>
> The above clarification for X86_64 means that memcpy_to_pmem() is
> permitted to use the non-temporal arch_memcpy_to_pmem() rather than
> needlessly fall back to default_memcpy_to_pmem() when the pcommit
> instruction is not available. When arch_memcpy_to_pmem() is not
> guaranteed to flush writes out of cache, i.e. on older X86_32
> implementations where non-temporal stores may just dirty cache,
> ARCH_HAS_PMEM_API is simply disabled.
>
> The default fallback for persistent memory handling remains. Namely,
> map it with the WT (write-through) cache-type and hope for the best.
>
> arch_has_pmem_api() is updated to only indicate whether the arch
> provides the proper helpers to meet the minimum guarantee that "writes
> are visible outside the cache hierarchy after memcpy_to_pmem() +
> wmb_pmem()". Code that cares whether wmb_pmem() actually flushes writes
> to pmem must now call arch_has_wmb_pmem() directly.
>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Toshi Kani <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> [hch: set ARCH_HAS_PMEM_API=n on X86_32]
> Signed-off-by: Dan Williams <[email protected]>

Thanks for making this change! It looks good.

Reviewed-by: Toshi Kani <[email protected]>

I have one minor comment below:

> ---
> arch/x86/Kconfig | 2 +-
> arch/x86/include/asm/io.h | 2 --
> arch/x86/include/asm/pmem.h | 8 ++------
> drivers/acpi/nfit.c | 2 +-
> drivers/nvdimm/pmem.c | 2 +-
> include/linux/pmem.h | 28 +++++++++++++++++-----------
> 6 files changed, 22 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 76c61154ed50..5912859df533 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -27,7 +27,7 @@ config X86
> select ARCH_HAS_ELF_RANDOMIZE
> select ARCH_HAS_FAST_MULTIPLIER
> select ARCH_HAS_GCOV_PROFILE_ALL
> - select ARCH_HAS_PMEM_API
> + select ARCH_HAS_PMEM_API if X86_64
> select ARCH_HAS_SG_CHAIN
> select ARCH_HAVE_NMI_SAFE_CMPXCHG
> select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> index d241fbd5c87b..83ec9b1d77cc 100644
> --- a/arch/x86/include/asm/io.h
> +++ b/arch/x86/include/asm/io.h
> @@ -248,8 +248,6 @@ static inline void flush_write_buffers(void)
> #endif
> }
>
> -#define ARCH_MEMREMAP_PMEM MEMREMAP_WB

Would it be better to do:

#else /* !CONFIG_ARCH_HAS_PMEM_API */
#define ARCH_MEMREMAP_PMEM MEMREMAP_WT

so that you can remove all '#ifdef ARCH_MEMREMAP_PMEM' stuff?

Thanks,
-Toshi
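
The effect of the suggestion is that linux/pmem.h always provides a
definition of ARCH_MEMREMAP_PMEM (WB when the arch implements the pmem
API, WT otherwise), so memremap_pmem() can use it unconditionally. A
sketch of the resulting shape, which is what the reworked patch later
in the thread adopts:

/* linux/pmem.h */
#ifdef CONFIG_ARCH_HAS_PMEM_API
#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
#include <asm/pmem.h>
#else
#define ARCH_MEMREMAP_PMEM MEMREMAP_WT
/* ... compile-only stubs for the arch_*_pmem() helpers ... */
#endif

static inline void __pmem *memremap_pmem(struct device *dev,
		resource_size_t offset, unsigned long size)
{
	/* no '#ifdef ARCH_MEMREMAP_PMEM' needed any more */
	return (void __pmem *) devm_memremap(dev, offset, size,
			ARCH_MEMREMAP_PMEM);
}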

2015-08-28 21:47:20

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] x86, pmem: push fallback handling to arch code

On Fri, Aug 28, 2015 at 2:41 PM, Toshi Kani <[email protected]> wrote:
> On Wed, 2015-08-26 at 21:34 +0000, Williams, Dan J wrote:
[..]
>> -#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
>
> Would it be better to do:
>
> #else /* !CONFIG_ARCH_HAS_PMEM_API */
> #define ARCH_MEMREMAP_PMEM MEMREMAP_WT
>
> so that you can remove all '#ifdef ARCH_MEMREMAP_PMEM' stuff?

Yeah, that seems like a nice incremental cleanup for memremap_pmem()
to just unconditionally use ARCH_MEMREMAP_PMEM; feel free to send it
along.

2015-08-28 21:50:37

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] x86, pmem: push fallback handling to arch code

On Fri, 2015-08-28 at 14:47 -0700, Dan Williams wrote:
> On Fri, Aug 28, 2015 at 2:41 PM, Toshi Kani <[email protected]> wrote:
> > On Wed, 2015-08-26 at 21:34 +0000, Williams, Dan J wrote:
> [..]
> > > -#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
> >
> > Would it be better to do:
> >
> > #else /* !CONFIG_ARCH_HAS_PMEM_API */
> > #define ARCH_MEMREMAP_PMEM MEMREMAP_WT
> >
> > so that you can remove all '#ifdef ARCH_MEMREMAP_PMEM' stuff?
>
> Yeah, that seems like a nice incremental cleanup for memremap_pmem()
> to just unconditionally use ARCH_MEMREMAP_PMEM; feel free to send it
> along.

OK. Will do.

Thanks,
-Toshi

2015-08-29 04:05:09

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] x86, pmem: push fallback handling to arch code

On Fri, 2015-08-28 at 15:48 -0600, Toshi Kani wrote:
> On Fri, 2015-08-28 at 14:47 -0700, Dan Williams wrote:
> > On Fri, Aug 28, 2015 at 2:41 PM, Toshi Kani <[email protected]> wrote:
> > > On Wed, 2015-08-26 at 21:34 +0000, Williams, Dan J wrote:
> > [..]
> > > > -#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
> > >
> > > Would it be better to do:
> > >
> > > #else /* !CONFIG_ARCH_HAS_PMEM_API */
> > > #define ARCH_MEMREMAP_PMEM MEMREMAP_WT
> > >
> > > so that you can remove all '#ifdef ARCH_MEMREMAP_PMEM' stuff?
> >
> > Yeah, that seems like a nice incremental cleanup for memremap_pmem()
> > to just unconditionally use ARCH_MEMREMAP_PMEM; feel free to send it
> > along.
>
> OK. Will do.
>

Here's the re-worked patch with Toshi's fixes folded in:

8<-----
Subject: x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB

From: Dan Williams <[email protected]>

Given that a write-back (WB) mapping plus non-temporal stores is
expected to be the most efficient way to access PMEM, update the
definition of ARCH_HAS_PMEM_API to imply arch support for
WB-mapped-PMEM. This is needed as a pre-requisite for adding PMEM to
the direct map and mapping it with struct page.

The above clarification for X86_64 means that memcpy_to_pmem() is
permitted to use the non-temporal arch_memcpy_to_pmem() rather than
needlessly fall back to default_memcpy_to_pmem() when the pcommit
instruction is not available. When arch_memcpy_to_pmem() is not
guaranteed to flush writes out of cache, i.e. on older X86_32
implementations where non-temporal stores may just dirty cache,
ARCH_HAS_PMEM_API is simply disabled.

The default fallback for persistent memory handling remains: map it
with the WT (write-through) cache type and hope for the best.

arch_has_pmem_api() is updated to only indicate whether the arch
provides the proper helpers to meet the minimum guarantee that "writes
are visible outside the cache hierarchy after memcpy_to_pmem() +
wmb_pmem()". Code that cares whether wmb_pmem() actually flushes writes
to pmem must now call arch_has_wmb_pmem() directly.

Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
[hch: set ARCH_HAS_PMEM_API=n on x86_32]
Reviewed-by: Christoph Hellwig <[email protected]>
[toshi: x86_32 compile fixes]
Signed-off-by: Toshi Kani <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/Kconfig | 2 +-
arch/x86/include/asm/pmem.h | 9 +--------
drivers/acpi/nfit.c | 3 ++-
drivers/nvdimm/pmem.c | 2 +-
include/linux/pmem.h | 36 ++++++++++++++++++++++--------------
5 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 03ab6122325a..ef4c6bbb3af1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,7 +27,7 @@ config X86
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FAST_MULTIPLIER
select ARCH_HAS_GCOV_PROFILE_ALL
- select ARCH_HAS_PMEM_API
+ select ARCH_HAS_PMEM_API if X86_64
select ARCH_HAS_MMIO_FLUSH
select ARCH_HAS_SG_CHAIN
select ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index bb026c5adf8a..d8ce3ec816ab 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -18,8 +18,6 @@
#include <asm/cpufeature.h>
#include <asm/special_insns.h>

-#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
-
#ifdef CONFIG_ARCH_HAS_PMEM_API
/**
* arch_memcpy_to_pmem - copy data to persistent memory
@@ -143,18 +141,13 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
__arch_wb_cache_pmem(vaddr, size);
}

-static inline bool arch_has_wmb_pmem(void)
+static inline bool __arch_has_wmb_pmem(void)
{
-#ifdef CONFIG_X86_64
/*
* We require that wmb() be an 'sfence', that is only guaranteed on
* 64-bit builds
*/
return static_cpu_has(X86_FEATURE_PCOMMIT);
-#else
- return false;
-#endif
}
#endif /* CONFIG_ARCH_HAS_PMEM_API */
-
#endif /* __ASM_X86_PMEM_H__ */
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 56fff0141636..f61e69fa2ad1 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -20,6 +20,7 @@
#include <linux/sort.h>
#include <linux/pmem.h>
#include <linux/io.h>
+#include <asm/cacheflush.h>
#include "nfit.h"

/*
@@ -1371,7 +1372,7 @@ static int acpi_nfit_blk_region_enable(struct nvdimm_bus *nvdimm_bus,
return -ENOMEM;
}

- if (!arch_has_pmem_api() && !nfit_blk->nvdimm_flush)
+ if (!arch_has_wmb_pmem() && !nfit_blk->nvdimm_flush)
dev_warn(dev, "unable to guarantee persistence of writes\n");

if (mmio->line_size == 0)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 3b5b9cb758b6..20bf122328da 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -125,7 +125,7 @@ static struct pmem_device *pmem_alloc(struct device *dev,

pmem->phys_addr = res->start;
pmem->size = resource_size(res);
- if (!arch_has_pmem_api())
+ if (!arch_has_wmb_pmem())
dev_warn(dev, "unable to guarantee persistence of writes\n");

if (!devm_request_mem_region(dev, pmem->phys_addr, pmem->size,
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index a9d84bf335ee..85f810b33917 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -17,16 +17,23 @@
#include <linux/uio.h>

#ifdef CONFIG_ARCH_HAS_PMEM_API
+#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
#include <asm/pmem.h>
#else
-static inline void arch_wmb_pmem(void)
+#define ARCH_MEMREMAP_PMEM MEMREMAP_WT
+/*
+ * These are simply here to enable compilation, all call sites gate
+ * calling these symbols with arch_has_pmem_api() and redirect to the
+ * implementation in asm/pmem.h.
+ */
+static inline bool __arch_has_wmb_pmem(void)
{
- BUG();
+ return false;
}

-static inline bool arch_has_wmb_pmem(void)
+static inline void arch_wmb_pmem(void)
{
- return false;
+ BUG();
}

static inline void arch_memcpy_to_pmem(void __pmem *dst, const void *src,
@@ -53,7 +60,6 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
* implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(),
* arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem().
*/
-
static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size)
{
memcpy(dst, (void __force const *) src, size);
@@ -64,8 +70,13 @@ static inline void memunmap_pmem(struct device *dev, void __pmem *addr)
devm_memunmap(dev, (void __force *) addr);
}

+static inline bool arch_has_pmem_api(void)
+{
+ return IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API);
+}
+
/**
- * arch_has_pmem_api - true if wmb_pmem() ensures durability
+ * arch_has_wmb_pmem - true if wmb_pmem() ensures durability
*
* For a given cpu implementation within an architecture it is possible
* that wmb_pmem() resolves to a nop. In the case this returns
@@ -73,9 +84,9 @@ static inline void memunmap_pmem(struct device *dev, void __pmem *addr)
* fall back to a different data consistency model, or otherwise notify
* the user.
*/
-static inline bool arch_has_pmem_api(void)
+static inline bool arch_has_wmb_pmem(void)
{
- return IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API) && arch_has_wmb_pmem();
+ return arch_has_pmem_api() && __arch_has_wmb_pmem();
}

/*
@@ -120,13 +131,8 @@ static inline void default_clear_pmem(void __pmem *addr, size_t size)
static inline void __pmem *memremap_pmem(struct device *dev,
resource_size_t offset, unsigned long size)
{
-#ifdef ARCH_MEMREMAP_PMEM
return (void __pmem *) devm_memremap(dev, offset, size,
ARCH_MEMREMAP_PMEM);
-#else
- return (void __pmem *) devm_memremap(dev, offset, size,
- MEMREMAP_WT);
-#endif
}

/**
@@ -158,8 +164,10 @@ static inline void memcpy_to_pmem(void __pmem *dst, const void *src, size_t n)
*/
static inline void wmb_pmem(void)
{
- if (arch_has_pmem_api())
+ if (arch_has_wmb_pmem())
arch_wmb_pmem();
+ else
+ wmb();
}

/**


2015-08-29 13:57:17

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] x86, pmem: push fallback handling to arch code

On Sat, Aug 29, 2015 at 04:04:58AM +0000, Williams, Dan J wrote:
> On Fri, 2015-08-28 at 15:48 -0600, Toshi Kani wrote:
> > On Fri, 2015-08-28 at 14:47 -0700, Dan Williams wrote:
> > > On Fri, Aug 28, 2015 at 2:41 PM, Toshi Kani <[email protected]> wrote:
> > > > On Wed, 2015-08-26 at 21:34 +0000, Williams, Dan J wrote:
> > > [..]
> > > > > -#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
> > > >
> > > > Would it be better to do:
> > > >
> > > > #else /* !CONFIG_ARCH_HAS_PMEM_API */
> > > > #define ARCH_MEMREMAP_PMEM MEMREMAP_WT
> > > >
> > > > so that you can remove all '#ifdef ARCH_MEMREMAP_PMEM' stuff?
> > >
> > > Yeah, that seems like a nice incremental cleanup for memremap_pmem()
> > > to just unconditionally use ARCH_MEMREMAP_PMEM; feel free to send it
> > > along.
> >
> > OK. Will do.
> >
>
> Here's the re-worked patch with Toshi's fixes folded in:

I like this in principle, but we'll have to be careful now if we
want to drop the fallbacks in memremap, as we will have to shift
them into the pmem driver then.
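
To illustrate the concern with a purely hypothetical sketch (nothing
like this exists in the series, and the field name virt_addr is assumed
for illustration): if devm_memremap() no longer provided a cache-type
fallback, a caller such as pmem_alloc() would have to retry with a
weaker mapping itself, along these lines:

	/* hypothetical: cache-type fallback pushed into the pmem driver */
	void *addr = devm_memremap(dev, pmem->phys_addr, pmem->size,
			ARCH_MEMREMAP_PMEM);

	if (!addr && ARCH_MEMREMAP_PMEM != MEMREMAP_WT)
		addr = devm_memremap(dev, pmem->phys_addr, pmem->size,
				MEMREMAP_WT);
	pmem->virt_addr = (void __pmem *) addr;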