2020-01-10 19:09:03

by Joao Martins

Subject: [PATCH RFC 00/10] device-dax: Support devices without PFN metadata

Hey,

Presented herewith is a small series which allows device-dax memory without
struct page to be used to back KVM guest memory. It's an RFC, and there are
still some items we're looking at (see TODO below); but we're wondering if folks
would be OK carving some time out of their busy schedules to provide feedback,
direction-wise, on this work.

In virtualized environments (especially those with no kernel-backed PV
interfaces, and just SR-IOV), memory is largely assigned to guests: either
persistent with NVDIMMs or volatile for regular RAM. The kernel
(hypervisor) tracks it with a 'struct page' (64 bytes) for each 4K page. Overall
we're spending 16GB for each 1TB of host memory tracked that the kernel won't
need, which could instead be used to create other guests. One of the motivations
of this series is to reclaim the memory used for 'struct page' when that memory
is meant to be used solely by userspace. This is also useful for the case of
memory backing guests' virtual NVDIMMs. The other neat side effect is that the
hypervisor has no virtual mapping of the guest, and hence code gadgets (if
found) are limited in their effectiveness.

It is expected that a smaller (instead of the total) amount of host memory is
defined for the kernel (with mem=X or memmap=X!Y). For a KVM userspace VMM (e.g.
QEMU), the main thing that is needed is a device which supports open + mmap +
close with a certain alignment (4K, 2M, 1G). That made us look at device-dax,
which does just that, and so the work comprised here improves what's there and
the interfaces it uses.

The series is divided as follows:

* Patch 1 , 3: Preparatory work for patch 7 for adding support for
vmf_insert_{pmd,pud} with dax pfn flags PFN_DEV|PFN_SPECIAL

* Patch 2 , 4: Preparatory work for patch 7 for adding support for
follow_pfn() to work with 2M/1G huge pages, which is
what KVM uses for VM_PFNMAP.

* Patch 5 - 7: One bugfix and device-dax support for PFN_DEV|PFN_SPECIAL,
which encompasses mainly dealing with the lack of devmap,
and creating a VM_PFNMAP vma.

* Patch 8: PMEM support for no PFN metadata only for device-dax namespaces.
At the very end of the cover letter (after scissors mark),
there's a patch for ndctl to be able to create namespaces
with '--mode devdax --map none'.

* Patch 9: Let VFIO handle VM_PFNMAP without relying on vm_pgoff being
a PFN.

* Patch 10: The actual end-consumer example for the RAM case. The patch just
adds a label storage area, which consequently allows namespaces to be
created. We picked PMEM legacy for starters.

Thoughts, comments appreciated.

Joao

P.S. As an example to try this out:

1) add 'memmap=48G!16G' to the kernel command line, on a host with 64G,
leaving the kernel with 16G.

2) create a devdax namespace with 1G hugepages:

$ ndctl create-namespace --verbose --mode devdax --map none --size 32G --align 1G -r 0
{
"dev":"namespace0.0",
"mode":"devdax",
"map":"none",
"size":"32.00 GiB (34.36 GB)",
"uuid":"dfdd05cd-2611-46ac-8bcd-10b6194f32d4",
"daxregion":{
"id":0,
"size":"32.00 GiB (34.36 GB)",
"align":1073741824,
"devices":[
{
"chardev":"dax0.0",
"size":"32.00 GiB (34.36 GB)",
"target_node":0,
"mode":"devdax"
}
]
},
"align":1073741824
}

3) Add this to your qemu params:
-m 32G
-object memory-backend-file,id=mem,size=32G,mem-path=/dev/dax0.0,share=on,align=1G
-numa node,memdev=mem

TODO:

* Discontiguous regions/namespaces: The work above is limited to the maximum
contiguous extent, coming from nvdimm DPA allocation heuristics -- which I take
is because of what the specs allow for persistent namespaces. But for the
volatile RAM case we would need handling of discontiguous extents (hence a
region would represent more than one resource) to be less bound to how guests
are placed on the system. I played around with multi-resource for device-dax,
but I'm wondering about the UABI: 1) whether nvdimm DPA allocation heuristics
should be relaxed for the RAM case (under certain nvdimm region bits); or 2)
whether device-dax should have its own separate UABI to be used by daxctl
(which would also be useful for hmem devices?).

* MCE handling: For contiguous regions vm_pgoff could be set to the pfn in
device-dax, which would allow collect_procs() to find the processes solely based
on the PFN. But for discontiguous namespaces I'm not sure this would work;
perhaps we could look at the dax-region pfn range for each DAX vma.

* NUMA: For now we excluded setting the target_node while these two patches
are being worked on[1][2].

[1] https://lore.kernel.org/lkml/157401276776.43284.12396353118982684546.stgit@dwillia2-desk3.amr.corp.intel.com/
[2] https://lore.kernel.org/lkml/157401277293.43284.3805106435228534675.stgit@dwillia2-desk3.amr.corp.intel.com/


Joao Martins (9):
mm: Add pmd support for _PAGE_SPECIAL
mm: Handle pmd entries in follow_pfn()
mm: Add pud support for _PAGE_SPECIAL
mm: Handle pud entries in follow_pfn()
device-dax: Do not enforce MADV_DONTFORK on mmap()
device-dax: Introduce pfn_flags helper
device-dax: Add support for PFN_SPECIAL flags
dax/pmem: Add device-dax support for PFN_MODE_NONE
nvdimm/e820: add multiple namespaces support

Nikita Leshenko (1):
vfio/type1: Use follow_pfn for VM_PFNMAP VMAs

arch/x86/include/asm/pgtable.h | 34 ++++-
drivers/dax/bus.c | 3 +-
drivers/dax/device.c | 78 ++++++++----
drivers/dax/pmem/core.c | 36 +++++-
drivers/nvdimm/e820.c | 212 ++++++++++++++++++++++++++++----
drivers/vfio/vfio_iommu_type1.c | 6 +-
mm/gup.c | 6 +
mm/huge_memory.c | 15 ++-
mm/memory.c | 67 ++++++++--
9 files changed, 382 insertions(+), 75 deletions(-)

8<----------------
Subject: [PATCH] ndctl: add 'devdax' support for NDCTL_PFN_LOC_NONE

diff --git a/ndctl/namespace.c b/ndctl/namespace.c
index 7fb00078646b..2568943eb207 100644
--- a/ndctl/namespace.c
+++ b/ndctl/namespace.c
@@ -206,6 +206,8 @@ static int set_defaults(enum device_action mode)
/* pass */;
else if (strcmp(param.map, "dev") == 0)
/* pass */;
+ else if (strcmp(param.map, "none") == 0)
+ /* pass */;
else {
error("invalid map location '%s'\n", param.map);
rc = -EINVAL;
@@ -755,9 +757,17 @@ static int validate_namespace_options(struct ndctl_region *region,
if (param.map) {
if (!strcmp(param.map, "mem"))
p->loc = NDCTL_PFN_LOC_RAM;
+ else if (!strcmp(param.map, "none"))
+ p->loc = NDCTL_PFN_LOC_NONE;
else
p->loc = NDCTL_PFN_LOC_PMEM;

+ if (p->loc == NDCTL_PFN_LOC_NONE
+ && p->mode != NDCTL_NS_MODE_DAX) {
+ debug("--map=none only valid for devdax mode namespace\n");
+ return -EINVAL;
+ }
+
if (ndns && p->mode != NDCTL_NS_MODE_MEMORY
&& p->mode != NDCTL_NS_MODE_DAX) {
debug("%s: --map= only valid for fsdax mode namespace\n",


--
2.17.1


2020-01-10 19:09:11

by Joao Martins

Subject: [PATCH RFC 01/10] mm: Add pmd support for _PAGE_SPECIAL

Currently vmf_insert_pfn_pmd() only works with devmap
and hits a BUG_ON() otherwise. Add support for handling
page special when the pfn_t is marked with PFN_SPECIAL.

Usages of page special aren't expected to do GUP,
hence return no pages on gup_huge_pmd(), much like how
it is done for PTEs on gup_pte_range().

This allows a DAX driver to handle 2M hugepages without
struct pages.

Signed-off-by: Joao Martins <[email protected]>
---
arch/x86/include/asm/pgtable.h | 16 +++++++++++++++-
mm/gup.c | 3 +++
mm/huge_memory.c | 7 ++++---
mm/memory.c | 3 ++-
4 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index ad97dc155195..60351c0c15fe 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -255,7 +255,7 @@ static inline int pmd_large(pmd_t pte)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static inline int pmd_trans_huge(pmd_t pmd)
{
- return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+ return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP|_PAGE_SPECIAL)) == _PAGE_PSE;
}

#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
@@ -293,6 +293,15 @@ static inline int pgd_devmap(pgd_t pgd)
{
return 0;
}
+#endif
+
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
+static inline int pmd_special(pmd_t pmd)
+{
+ return !!(pmd_flags(pmd) & _PAGE_SPECIAL);
+}
+#endif
+
#endif
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

@@ -414,6 +423,11 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
return pmd_set_flags(pmd, _PAGE_DEVMAP);
}

+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+ return pmd_set_flags(pmd, _PAGE_SPECIAL);
+}
+
static inline pmd_t pmd_mkhuge(pmd_t pmd)
{
return pmd_set_flags(pmd, _PAGE_PSE);
diff --git a/mm/gup.c b/mm/gup.c
index 7646bf993b25..ba5f10535392 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2079,6 +2079,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
}

+ if (pmd_special(orig))
+ return 0;
+
refs = 0;
page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
do {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 41a0fbddc96b..06ad4d6f7477 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -791,6 +791,8 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
if (pfn_t_devmap(pfn))
entry = pmd_mkdevmap(entry);
+ else if (pfn_t_special(pfn))
+ entry = pmd_mkspecial(entry);
if (write) {
entry = pmd_mkyoung(pmd_mkdirty(entry));
entry = maybe_pmd_mkwrite(entry, vma);
@@ -823,8 +825,7 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
* but we need to be consistent with PTEs and architectures that
* can't support a 'special' bit.
*/
- BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
- !pfn_t_devmap(pfn));
+ BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
(VM_PFNMAP|VM_MIXEDMAP));
BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
@@ -2013,7 +2014,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
spinlock_t *ptl;
ptl = pmd_lock(vma->vm_mm, pmd);
if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
- pmd_devmap(*pmd)))
+ pmd_devmap(*pmd) || pmd_special(*pmd)))
return ptl;
spin_unlock(ptl);
return NULL;
diff --git a/mm/memory.c b/mm/memory.c
index 45442d9a4f52..cfc3668bddeb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1165,7 +1165,8 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+ if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
+ pmd_devmap(*pmd) || pmd_special(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
__split_huge_pmd(vma, pmd, addr, false, NULL);
else if (zap_huge_pmd(tlb, vma, pmd, addr))
--
2.17.1

2020-01-10 19:09:17

by Joao Martins

Subject: [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support

Users can define regions with 'memmap=size!offset', which in turn
creates PMEM legacy devices. But because it is a label-less
NVDIMM device, we only have one namespace for the whole device.

Add support for multiple namespaces by adding ndctl control
support and exposing a minimal set of features
(ND_CMD_GET_CONFIG_SIZE, ND_CMD_GET_CONFIG_DATA,
ND_CMD_SET_CONFIG_DATA), alongside NDD_ALIASING, because we can
store labels.

Initialization is a little different: we allocate and register an
nvdimm bus with an @nvdimm_descriptor which we use to locate
where we are keeping our label storage area. The config data
get/set/size operations are then simply memcpy()s to this area.

An equivalent approach can also be found in the NFIT tests, which
emulate the same thing.

Signed-off-by: Joao Martins <[email protected]>
---
drivers/nvdimm/e820.c | 212 +++++++++++++++++++++++++++++++++++++-----
1 file changed, 191 insertions(+), 21 deletions(-)

diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
index e02f60ad6c99..36fbff3d7110 100644
--- a/drivers/nvdimm/e820.c
+++ b/drivers/nvdimm/e820.c
@@ -7,14 +7,21 @@
#include <linux/memory_hotplug.h>
#include <linux/libnvdimm.h>
#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/ndctl.h>
+#include <linux/nd.h>

-static int e820_pmem_remove(struct platform_device *pdev)
-{
- struct nvdimm_bus *nvdimm_bus = platform_get_drvdata(pdev);
+#define LABEL_SIZE SZ_128K

- nvdimm_bus_unregister(nvdimm_bus);
- return 0;
-}
+struct e820_descriptor {
+ struct nd_interleave_set nd_set;
+ struct nvdimm_bus_descriptor nd_desc;
+ void *label;
+ unsigned char cookie1[16];
+ unsigned char cookie2[16];
+ struct nvdimm_bus *nvdimm_bus;
+ struct nvdimm *nvdimm;
+};

#ifdef CONFIG_MEMORY_HOTPLUG
static int e820_range_to_nid(resource_size_t addr)
@@ -28,43 +35,206 @@ static int e820_range_to_nid(resource_size_t addr)
}
#endif

+static int e820_get_config_size(struct nd_cmd_get_config_size *nd_cmd,
+ unsigned int buf_len)
+{
+ if (buf_len < sizeof(*nd_cmd))
+ return -EINVAL;
+
+ nd_cmd->status = 0;
+ nd_cmd->config_size = LABEL_SIZE;
+ nd_cmd->max_xfer = SZ_4K;
+
+ return 0;
+}
+
+static int e820_get_config_data(struct nd_cmd_get_config_data_hdr
+ *nd_cmd, unsigned int buf_len, void *label)
+{
+ unsigned int len, offset = nd_cmd->in_offset;
+ int rc;
+
+ if (buf_len < sizeof(*nd_cmd))
+ return -EINVAL;
+ if (offset >= LABEL_SIZE)
+ return -EINVAL;
+ if (nd_cmd->in_length + sizeof(*nd_cmd) > buf_len)
+ return -EINVAL;
+
+ nd_cmd->status = 0;
+ len = min(nd_cmd->in_length, LABEL_SIZE - offset);
+ memcpy(nd_cmd->out_buf, label + offset, len);
+ rc = buf_len - sizeof(*nd_cmd) - len;
+
+ return rc;
+}
+
+static int e820_set_config_data(struct nd_cmd_set_config_hdr *nd_cmd,
+ unsigned int buf_len, void *label)
+{
+ unsigned int len, offset = nd_cmd->in_offset;
+ u32 *status;
+ int rc;
+
+ if (buf_len < sizeof(*nd_cmd))
+ return -EINVAL;
+ if (offset >= LABEL_SIZE)
+ return -EINVAL;
+ if (nd_cmd->in_length + sizeof(*nd_cmd) + 4 > buf_len)
+ return -EINVAL;
+
+ status = (void *)nd_cmd + nd_cmd->in_length + sizeof(*nd_cmd);
+ *status = 0;
+ len = min(nd_cmd->in_length, LABEL_SIZE - offset);
+ memcpy(label + offset, nd_cmd->in_buf, len);
+ rc = buf_len - sizeof(*nd_cmd) - (len + 4);
+
+ return rc;
+}
+
+static struct e820_descriptor *to_e820_desc(struct nvdimm_bus_descriptor *desc)
+{
+ return container_of(desc, struct e820_descriptor, nd_desc);
+}
+
+static int e820_ndctl(struct nvdimm_bus_descriptor *nd_desc,
+ struct nvdimm *nvdimm, unsigned int cmd, void *buf,
+ unsigned int buf_len, int *cmd_rc)
+{
+ struct e820_descriptor *t = to_e820_desc(nd_desc);
+ int rc = -EINVAL;
+
+ switch (cmd) {
+ case ND_CMD_GET_CONFIG_SIZE:
+ rc = e820_get_config_size(buf, buf_len);
+ break;
+ case ND_CMD_GET_CONFIG_DATA:
+ rc = e820_get_config_data(buf, buf_len, t->label);
+ break;
+ case ND_CMD_SET_CONFIG_DATA:
+ rc = e820_set_config_data(buf, buf_len, t->label);
+ break;
+ default:
+ return rc;
+ }
+
+ return rc;
+}
+
+static void e820_desc_free(struct e820_descriptor *desc)
+{
+ if (!desc)
+ return;
+
+ nvdimm_bus_unregister(desc->nvdimm_bus);
+ kfree(desc->label);
+ kfree(desc);
+}
+
+static struct e820_descriptor *e820_desc_alloc(struct platform_device *pdev)
+{
+ struct nvdimm_bus_descriptor *nd_desc;
+ unsigned int cmd_mask, dimm_flags;
+ struct device *dev = &pdev->dev;
+ struct nvdimm_bus *nvdimm_bus;
+ struct e820_descriptor *desc;
+ struct nvdimm *nvdimm;
+
+ desc = kzalloc(sizeof(*desc), GFP_KERNEL);
+ if (!desc)
+ goto err;
+
+ desc->label = kzalloc(LABEL_SIZE, GFP_KERNEL);
+ if (!desc->label)
+ goto err;
+
+ nd_desc = &desc->nd_desc;
+ nd_desc->provider_name = "e820";
+ nd_desc->module = THIS_MODULE;
+ nd_desc->ndctl = e820_ndctl;
+ nvdimm_bus = nvdimm_bus_register(&pdev->dev, nd_desc);
+ if (!nvdimm_bus) {
+ dev_err(dev, "nvdimm bus registration failure\n");
+ goto err;
+ }
+ desc->nvdimm_bus = nvdimm_bus;
+
+ cmd_mask = (1UL << ND_CMD_GET_CONFIG_SIZE |
+ 1UL << ND_CMD_GET_CONFIG_DATA |
+ 1UL << ND_CMD_SET_CONFIG_DATA);
+ dimm_flags = (1UL << NDD_ALIASING);
+ nvdimm = nvdimm_create(nvdimm_bus, pdev, NULL,
+ dimm_flags, cmd_mask, 0, NULL);
+ if (!nvdimm) {
+ dev_err(dev, "nvdimm creation failure\n");
+ goto err;
+ }
+ desc->nvdimm = nvdimm;
+ return desc;
+
+err:
+ e820_desc_free(desc);
+ return NULL;
+}
+
static int e820_register_one(struct resource *res, void *data)
{
+ struct platform_device *pdev = data;
struct nd_region_desc ndr_desc;
- struct nvdimm_bus *nvdimm_bus = data;
+ struct nd_mapping_desc mapping;
+ struct e820_descriptor *desc;
+
+ desc = e820_desc_alloc(pdev);
+ if (!desc)
+ return -ENOMEM;
+
+ mapping.nvdimm = desc->nvdimm;
+ mapping.start = res->start;
+ mapping.size = resource_size(res);
+ mapping.position = 0;
+
+ generate_random_uuid(desc->cookie1);
+ desc->nd_set.cookie1 = (u64) desc->cookie1;
+ generate_random_uuid(desc->cookie2);
+ desc->nd_set.cookie2 = (u64) desc->cookie2;

memset(&ndr_desc, 0, sizeof(ndr_desc));
ndr_desc.res = res;
ndr_desc.numa_node = e820_range_to_nid(res->start);
ndr_desc.target_node = ndr_desc.numa_node;
+ ndr_desc.mapping = &mapping;
+ ndr_desc.num_mappings = 1;
+ ndr_desc.nd_set = &desc->nd_set;
set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
- if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
+ if (!nvdimm_pmem_region_create(desc->nvdimm_bus, &ndr_desc)) {
+ e820_desc_free(desc);
+ dev_err(&pdev->dev, "nvdimm region creation failure\n");
return -ENXIO;
+ }
+
+ platform_set_drvdata(pdev, desc);
+ return 0;
+}
+
+static int e820_pmem_remove(struct platform_device *pdev)
+{
+ struct e820_descriptor *desc = platform_get_drvdata(pdev);
+
+ e820_desc_free(desc);
return 0;
}

static int e820_pmem_probe(struct platform_device *pdev)
{
- static struct nvdimm_bus_descriptor nd_desc;
- struct device *dev = &pdev->dev;
- struct nvdimm_bus *nvdimm_bus;
int rc = -ENXIO;

- nd_desc.provider_name = "e820";
- nd_desc.module = THIS_MODULE;
- nvdimm_bus = nvdimm_bus_register(dev, &nd_desc);
- if (!nvdimm_bus)
- goto err;
- platform_set_drvdata(pdev, nvdimm_bus);
-
rc = walk_iomem_res_desc(IORES_DESC_PERSISTENT_MEMORY_LEGACY,
- IORESOURCE_MEM, 0, -1, nvdimm_bus, e820_register_one);
+ IORESOURCE_MEM, 0, -1, pdev, e820_register_one);
if (rc)
goto err;
return 0;
err:
- nvdimm_bus_unregister(nvdimm_bus);
- dev_err(dev, "failed to register legacy persistent memory ranges\n");
+ dev_err(&pdev->dev, "failed to register legacy persistent memory ranges\n");
return rc;
}

--
2.17.1

2020-02-03 21:35:58

by Matthew Wilcox

Subject: Re: [PATCH RFC 01/10] mm: Add pmd support for _PAGE_SPECIAL

On Fri, Jan 10, 2020 at 07:03:04PM +0000, Joao Martins wrote:
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -293,6 +293,15 @@ static inline int pgd_devmap(pgd_t pgd)
> {
> return 0;
> }
> +#endif
> +
> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> +static inline int pmd_special(pmd_t pmd)
> +{
> + return !!(pmd_flags(pmd) & _PAGE_SPECIAL);
> +}
> +#endif

The ifdef/endif don't make much sense here; x86 does have PTE_SPECIAL,
and this is an x86 header file, so that can be assumed.

> +++ b/mm/gup.c
> @@ -2079,6 +2079,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
> }
>
> + if (pmd_special(orig))
> + return 0;

Here, you're calling it unconditionally. I think you need a pmd_special()
conditionally defined in include/asm-generic/pgtable.h

+#ifndef CONFIG_ARCH_HAS_PTE_SPECIAL
+static inline bool pmd_special(pmd_t pmd)
+{
+ return false;
+}
+#endif

(oh, and please use bool instead of int; I know that's different from
pte_special(), but pte_special() predates bool and nobody's done the work
to convert it yet)

> +++ b/mm/huge_memory.c
> @@ -791,6 +791,8 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
> if (pfn_t_devmap(pfn))
> entry = pmd_mkdevmap(entry);
> + else if (pfn_t_special(pfn))
> + entry = pmd_mkspecial(entry);

Again, we'll need a generic one.

> @@ -823,8 +825,7 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
> * but we need to be consistent with PTEs and architectures that
> * can't support a 'special' bit.
> */
> - BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
> - !pfn_t_devmap(pfn));
> + BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));

Should that rather be ...

+ BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
+ !pfn_t_devmap(pfn) && !pfn_t_special(pfn));

I also think this comment needs adjusting:

/*
* There is no pmd_special() but there may be special pmds, e.g.
* in a direct-access (dax) mapping, so let's just replicate the
* !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
*/


2020-02-04 01:26:01

by Dan Williams

Subject: Re: [PATCH RFC 00/10] device-dax: Support devices without PFN metadata

On Fri, Jan 10, 2020 at 11:06 AM Joao Martins <[email protected]> wrote:
>
> Hey,

Hi Joao,

>
> Presented herewith a small series which allows device-dax to work without
> struct page to be used to back KVM guests memory. It's an RFC, and there's
> still some items we're looking at (see TODO below);

So it turns out I already have some patches in flight that address the
discontiguous allocation item. Here's a WIP branch that I'll be
sending out after the merge window closes.

https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=libnvdimm-pending

> but wondering if folks
> would be OK carving some time out of their busy schedules to provide feedback
> direction-wise on this work.

...apologies I did not get to this sooner. Please feel free to ping me
after a week if you're awaiting comment on anything in the nvdimm or
device-dax area.

> In virtualized environments (especially those with no kernel-backed PV
> interfaces, and just SR-IOV), memory is largely assigned to guests: either
> persistent with NVDIMMs or volatile for regular RAM. The kernel
> (hypervisor) tracks it with 'struct page' (64b) for each 4K page. Overall
> we're spending 16GB for each 1TB of host memory tracked that the kernel won't
> need which could instead be used to create other guests. One of the motivations of
> this series is to then get that memory used for 'struct page', when it is meant
> to solely be used by userspace.

Do you mean reclaim it for the guest to use for 'struct page' capacity
since the host hypervisor has reduced need for it?

> This is also useful for the case of memory
> backing guests virtual NVDIMMs. The other neat side effect is that the
> hypervisor has no virtual mapping of the guest and hence code gadgets (if
> found) are limited in their effectiveness.

You mean just the direct-map? qemu still gets a valid virtual address,
or are you proposing it's not mapped there either?

> It is expected that a smaller (instead of total) amount of host memory is
> defined for the kernel (with mem=X or memmap=X!Y). For KVM userspace VMM (e.g.
> QEMU), the main thing that is needed is a device which open + mmap + close with
> a certain alignment (4K, 2M, 1G). That made us look at device-dax which does
> just that and so the work comprised here was improving what's there and the
> interfaces it uses.

In general I think this "'struct page'-less device-dax" capability
makes sense, why suffer 1.6% capacity loss when that memory is going
unused? My main concerns are:

1/ Use of memmap=nn!ss when none of the persistent memory
infrastructure is needed.

2/ Auditing the new usage of VM_PFNMAP and what that means for
memory-error handling and applications that expect 'struct page'
services to be present.

3/ "page-less" dreams have been dashed on the rocks in the past. The
slow drip of missing features (ptrace(), direct-I/O, fork()...) is why
the kernel now requires them by default for dax.

For 1/ I have a proposal, for 2/ I need to dig in to what you have
here, but maybe I can trade you some review on the discontiguous
allocation patches. For 3/ can you say a bit more about why the
overhead is intolerable?

The proposal for 1/ is: please let's not build more on top of
memmap=nn!ss. It's fragile because there's no facility to validate
that it is correct; the assumption is that the user overriding the
memmap knows to pick addresses that don't collide and have real memory
backing. Practice has shown that people get it wrong often enough that
we need something better. It's also confusing that applications start
seeing "Persistent Memory" in /proc/iomem when it's only there to
shunt memory over to device-dax.

The alternative is "efi_fake_mem=nn@ss:0x40000" and the related
'soft-reservation' enabling that bypasses the persistent memory
enabling and assigns memory to device-dax directly. EFI attribute
override is safer because this attribute is only honored when applied
to existing EFI conventional memory.

Have a look at that branch, just be warned it's not polished; or just
wait for me to send them out with a proper cover letter, and I'll take
a look at what you have below.

>
> The series is divided as follows:
>
> * Patch 1 , 3: Preparatory work for patch 7 for adding support for
> vmf_insert_{pmd,pud} with dax pfn flags PFN_DEV|PFN_SPECIAL
>
> * Patch 2 , 4: Preparatory work for patch 7 for adding support for
> follow_pfn() to work with 2M/1G huge pages, which is
> what KVM uses for VM_PFNMAP.
>
> * Patch 5 - 7: One bugfix and device-dax support for PFN_DEV|PFN_SPECIAL,
> which encompasses mainly dealing with the lack of devmap,
> and creating a VM_PFNMAP vma.
>
> * Patch 8: PMEM support for no PFN metadata only for device-dax namespaces.
> At the very end of the cover letter (after scissors mark),
> there's a patch for ndctl to be able to create namespaces
> with '--mode devdax --map none'.
>
> * Patch 9: Let VFIO handle VM_PFNMAP without relying on vm_pgoff being
> a PFN.
>
> * Patch 10: The actual end consumer example for RAM case. The patch just adds a
> label storage area which consequently allows namespaces to be
> created. We picked PMEM legacy for starters.
>
> Thoughts, coments appreciated.
> Joao
>
> P.S. As an example to try this out:
>
> 1) add 'memmap=48G!16G' to the kernel command line, on a host with 64G,
> and kernel has 16G.
>
> 2) create a devdax namespace with 1G hugepages:
>
> $ ndctl create-namespace --verbose --mode devdax --map none --size 32G --align 1G -r 0
> {
> "dev":"namespace0.0",
> "mode":"devdax",
> "map":"none",
> "size":"32.00 GiB (34.36 GB)",
> "uuid":"dfdd05cd-2611-46ac-8bcd-10b6194f32d4",
> "daxregion":{
> "id":0,
> "size":"32.00 GiB (34.36 GB)",
> "align":1073741824,
> "devices":[
> {
> "chardev":"dax0.0",
> "size":"32.00 GiB (34.36 GB)",
> "target_node":0,
> "mode":"devdax"
> }
> ]
> },
> "align":1073741824
> }
>
> 3) Add this to your qemu params:
> -m 32G
> -object memory-backend-file,id=mem,size=32G,mem-path=/dev/dax0.0,share=on,align=1G
> -numa node,memdev=mem
>
> TODO:
>
> * Discontiguous regions/namespaces: The work above is limited to max
> contiguous extent, coming from nvdimm dpa allocation heuristics -- which I take
> is because of what specs allow for persistent namespaces. But for volatile RAM
> case we would need handling of discontiguous extents (hence a region would represent
> more than a resource) to be less bound to how guests are placed on the system.
> I played around with multi-resource for device-dax, but I'm wondering about
> UABI: 1) whether nvdimm DPA allocation heuristics should be relaxed for RAM
> case (under certain nvdimm region bits); or if 2) device-dax would have it's
> own separate UABI to be used by daxctl (which would be also useful for hmem
> devices?).
>
> * MCE handling: For contiguous regions vm_pgoff could be set to the pfn in
> device-dax, which would allow collect_procs() to find the processes solely based
> on the PFN. But for discontiguous namespaces, not sure if this would work; perhaps
> looking at the dax-region pfn range for each DAX vma.

You mean, make the memory error handling device-dax aware?

>
> * NUMA: For now excluded setting the target_node; while these two patches
> are being worked on[1][2].
>
> [1] https://lore.kernel.org/lkml/157401276776.43284.12396353118982684546.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2] https://lore.kernel.org/lkml/157401277293.43284.3805106435228534675.stgit@dwillia2-desk3.amr.corp.intel.com/

I'll ping x86 folks again after the merge window. I expect they have
just not had time to ack them yet.

2020-02-04 15:30:49

by Barret Rhoden

Subject: Re: [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support

Hi -

On 1/10/20 2:03 PM, Joao Martins wrote:
> User can define regions with 'memmap=size!offset' which in turn
> creates PMEM legacy devices. But because it is a label-less
> NVDIMM device we only have one namespace for the whole device.
>
> Add support for multiple namespaces by adding ndctl control
> support, and exposing a minimal set of features:
> (ND_CMD_GET_CONFIG_SIZE, ND_CMD_GET_CONFIG_DATA,
> ND_CMD_SET_CONFIG_DATA) alongside NDD_ALIASING because we can
> store labels.

FWIW, I like this a lot. If we move away from using memmap in favor of
efi_fake_mem, ideally we'd have the same support for full-fledged
pmem/dax regions and namespaces that this patch brings.

Thanks,
Barret


>
> Initialization is a little different: We allocate and register an
> nvdimm bus with an @nvdimm_descriptor which we use to locate
> where we are keeping our label storage area. The config data
> get/set/size operations are then simply memcpying to this area.
>
> Equivalent approach can also be found in the NFIT tests which
> emulate the same thing.
>
> Signed-off-by: Joao Martins <[email protected]>
> ---
> drivers/nvdimm/e820.c | 212 +++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 191 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
> index e02f60ad6c99..36fbff3d7110 100644
> --- a/drivers/nvdimm/e820.c
> +++ b/drivers/nvdimm/e820.c
> @@ -7,14 +7,21 @@
> #include <linux/memory_hotplug.h>
> #include <linux/libnvdimm.h>
> #include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/ndctl.h>
> +#include <linux/nd.h>
>
> -static int e820_pmem_remove(struct platform_device *pdev)
> -{
> - struct nvdimm_bus *nvdimm_bus = platform_get_drvdata(pdev);
> +#define LABEL_SIZE SZ_128K
>
> - nvdimm_bus_unregister(nvdimm_bus);
> - return 0;
> -}
> +struct e820_descriptor {
> + struct nd_interleave_set nd_set;
> + struct nvdimm_bus_descriptor nd_desc;
> + void *label;
> + unsigned char cookie1[16];
> + unsigned char cookie2[16];
> + struct nvdimm_bus *nvdimm_bus;
> + struct nvdimm *nvdimm;
> +};
>
> #ifdef CONFIG_MEMORY_HOTPLUG
> static int e820_range_to_nid(resource_size_t addr)
> @@ -28,43 +35,206 @@ static int e820_range_to_nid(resource_size_t addr)
> }
> #endif
>
> +static int e820_get_config_size(struct nd_cmd_get_config_size *nd_cmd,
> + unsigned int buf_len)
> +{
> + if (buf_len < sizeof(*nd_cmd))
> + return -EINVAL;
> +
> + nd_cmd->status = 0;
> + nd_cmd->config_size = LABEL_SIZE;
> + nd_cmd->max_xfer = SZ_4K;
> +
> + return 0;
> +}
> +
> +static int e820_get_config_data(struct nd_cmd_get_config_data_hdr
> + *nd_cmd, unsigned int buf_len, void *label)
> +{
> + unsigned int len, offset = nd_cmd->in_offset;
> + int rc;
> +
> + if (buf_len < sizeof(*nd_cmd))
> + return -EINVAL;
> + if (offset >= LABEL_SIZE)
> + return -EINVAL;
> + if (nd_cmd->in_length + sizeof(*nd_cmd) > buf_len)
> + return -EINVAL;
> +
> + nd_cmd->status = 0;
> + len = min(nd_cmd->in_length, LABEL_SIZE - offset);
> + memcpy(nd_cmd->out_buf, label + offset, len);
> + rc = buf_len - sizeof(*nd_cmd) - len;
> +
> + return rc;
> +}
> +
> +static int e820_set_config_data(struct nd_cmd_set_config_hdr *nd_cmd,
> + unsigned int buf_len, void *label)
> +{
> + unsigned int len, offset = nd_cmd->in_offset;
> + u32 *status;
> + int rc;
> +
> + if (buf_len < sizeof(*nd_cmd))
> + return -EINVAL;
> + if (offset >= LABEL_SIZE)
> + return -EINVAL;
> + if (nd_cmd->in_length + sizeof(*nd_cmd) + 4 > buf_len)
> + return -EINVAL;
> +
> + status = (void *)nd_cmd + nd_cmd->in_length + sizeof(*nd_cmd);
> + *status = 0;
> + len = min(nd_cmd->in_length, LABEL_SIZE - offset);
> + memcpy(label + offset, nd_cmd->in_buf, len);
> + rc = buf_len - sizeof(*nd_cmd) - (len + 4);
> +
> + return rc;
> +}
> +
> +static struct e820_descriptor *to_e820_desc(struct nvdimm_bus_descriptor *desc)
> +{
> + return container_of(desc, struct e820_descriptor, nd_desc);
> +}
> +
> +static int e820_ndctl(struct nvdimm_bus_descriptor *nd_desc,
> + struct nvdimm *nvdimm, unsigned int cmd, void *buf,
> + unsigned int buf_len, int *cmd_rc)
> +{
> + struct e820_descriptor *t = to_e820_desc(nd_desc);
> + int rc = -EINVAL;
> +
> + switch (cmd) {
> + case ND_CMD_GET_CONFIG_SIZE:
> + rc = e820_get_config_size(buf, buf_len);
> + break;
> + case ND_CMD_GET_CONFIG_DATA:
> + rc = e820_get_config_data(buf, buf_len, t->label);
> + break;
> + case ND_CMD_SET_CONFIG_DATA:
> + rc = e820_set_config_data(buf, buf_len, t->label);
> + break;
> + default:
> + return rc;
> + }
> +
> + return rc;
> +}
> +
> +static void e820_desc_free(struct e820_descriptor *desc)
> +{
> + if (!desc)
> + return;
> +
> + nvdimm_bus_unregister(desc->nvdimm_bus);
> + kfree(desc->label);
> + kfree(desc);
> +}
> +
> +static struct e820_descriptor *e820_desc_alloc(struct platform_device *pdev)
> +{
> + struct nvdimm_bus_descriptor *nd_desc;
> + unsigned int cmd_mask, dimm_flags;
> + struct device *dev = &pdev->dev;
> + struct nvdimm_bus *nvdimm_bus;
> + struct e820_descriptor *desc;
> + struct nvdimm *nvdimm;
> +
> + desc = kzalloc(sizeof(*desc), GFP_KERNEL);
> + if (!desc)
> + goto err;
> +
> + desc->label = kzalloc(LABEL_SIZE, GFP_KERNEL);
> + if (!desc->label)
> + goto err;
> +
> + nd_desc = &desc->nd_desc;
> + nd_desc->provider_name = "e820";
> + nd_desc->module = THIS_MODULE;
> + nd_desc->ndctl = e820_ndctl;
> + nvdimm_bus = nvdimm_bus_register(&pdev->dev, nd_desc);
> + if (!nvdimm_bus) {
> + dev_err(dev, "nvdimm bus registration failure\n");
> + goto err;
> + }
> + desc->nvdimm_bus = nvdimm_bus;
> +
> + cmd_mask = (1UL << ND_CMD_GET_CONFIG_SIZE |
> + 1UL << ND_CMD_GET_CONFIG_DATA |
> + 1UL << ND_CMD_SET_CONFIG_DATA);
> + dimm_flags = (1UL << NDD_ALIASING);
> + nvdimm = nvdimm_create(nvdimm_bus, pdev, NULL,
> + dimm_flags, cmd_mask, 0, NULL);
> + if (!nvdimm) {
> + dev_err(dev, "nvdimm creation failure\n");
> + goto err;
> + }
> + desc->nvdimm = nvdimm;
> + return desc;
> +
> +err:
> + e820_desc_free(desc);
> + return NULL;
> +}
> +
> static int e820_register_one(struct resource *res, void *data)
> {
> + struct platform_device *pdev = data;
> struct nd_region_desc ndr_desc;
> - struct nvdimm_bus *nvdimm_bus = data;
> + struct nd_mapping_desc mapping;
> + struct e820_descriptor *desc;
> +
> + desc = e820_desc_alloc(pdev);
> + if (!desc)
> + return -ENOMEM;
> +
> + mapping.nvdimm = desc->nvdimm;
> + mapping.start = res->start;
> + mapping.size = resource_size(res);
> + mapping.position = 0;
> +
> + generate_random_uuid(desc->cookie1);
> + desc->nd_set.cookie1 = (u64) desc->cookie1;
> + generate_random_uuid(desc->cookie2);
> + desc->nd_set.cookie2 = (u64) desc->cookie2;
>
> memset(&ndr_desc, 0, sizeof(ndr_desc));
> ndr_desc.res = res;
> ndr_desc.numa_node = e820_range_to_nid(res->start);
> ndr_desc.target_node = ndr_desc.numa_node;
> + ndr_desc.mapping = &mapping;
> + ndr_desc.num_mappings = 1;
> + ndr_desc.nd_set = &desc->nd_set;
> set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> - if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
> + if (!nvdimm_pmem_region_create(desc->nvdimm_bus, &ndr_desc)) {
> + e820_desc_free(desc);
> + dev_err(&pdev->dev, "nvdimm region creation failure\n");
> return -ENXIO;
> + }
> +
> + platform_set_drvdata(pdev, desc);
> + return 0;
> +}
> +
> +static int e820_pmem_remove(struct platform_device *pdev)
> +{
> + struct e820_descriptor *desc = platform_get_drvdata(pdev);
> +
> + e820_desc_free(desc);
> return 0;
> }
>
> static int e820_pmem_probe(struct platform_device *pdev)
> {
> - static struct nvdimm_bus_descriptor nd_desc;
> - struct device *dev = &pdev->dev;
> - struct nvdimm_bus *nvdimm_bus;
> int rc = -ENXIO;
>
> - nd_desc.provider_name = "e820";
> - nd_desc.module = THIS_MODULE;
> - nvdimm_bus = nvdimm_bus_register(dev, &nd_desc);
> - if (!nvdimm_bus)
> - goto err;
> - platform_set_drvdata(pdev, nvdimm_bus);
> -
> rc = walk_iomem_res_desc(IORES_DESC_PERSISTENT_MEMORY_LEGACY,
> - IORESOURCE_MEM, 0, -1, nvdimm_bus, e820_register_one);
> + IORESOURCE_MEM, 0, -1, pdev, e820_register_one);
> if (rc)
> goto err;
> return 0;
> err:
> - nvdimm_bus_unregister(nvdimm_bus);
> - dev_err(dev, "failed to register legacy persistent memory ranges\n");
> + dev_err(&pdev->dev, "failed to register legacy persistent memory ranges\n");
> return rc;
> }
>
>

2020-02-04 16:17:47

by Joao Martins

Subject: Re: [PATCH RFC 01/10] mm: Add pmd support for _PAGE_SPECIAL

On 2/3/20 9:34 PM, Matthew Wilcox wrote:
> On Fri, Jan 10, 2020 at 07:03:04PM +0000, Joao Martins wrote:
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -293,6 +293,15 @@ static inline int pgd_devmap(pgd_t pgd)
>> {
>> return 0;
>> }
>> +#endif
>> +
>> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>> +static inline int pmd_special(pmd_t pmd)
>> +{
>> + return !!(pmd_flags(pmd) & _PAGE_SPECIAL);
>> +}
>> +#endif
>
> The ifdef/endif don't make much sense here; x86 does have PTE_SPECIAL,
> and this is an x86 header file, so that can be assumed.
>
Gotcha.

>> +++ b/mm/gup.c
>> @@ -2079,6 +2079,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>> return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
>> }
>>
>> + if (pmd_special(orig))
>> + return 0;
>
> Here, you're calling it unconditionally. I think you need a pmd_special()
> conditionally defined in include/asm-generic/pgtable.h
>
> +#ifndef CONFIG_ARCH_HAS_PTE_SPECIAL
> +static inline bool pmd_special(pmd_t pmd)
> +{
> + return false;
> +}
> +#endif
>
> (oh, and plese use bool instead of int; I know that's different from
> pte_special(), but pte_special() predates bool and nobody's done the work
> to convert it yet)
>
Got it.

>> +++ b/mm/huge_memory.c
>> @@ -791,6 +791,8 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>> entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
>> if (pfn_t_devmap(pfn))
>> entry = pmd_mkdevmap(entry);
>> + else if (pfn_t_special(pfn))
>> + entry = pmd_mkspecial(entry);
>
> Again, we'll need a generic one.
>
Will add it.

>> @@ -823,8 +825,7 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
>> * but we need to be consistent with PTEs and architectures that
>> * can't support a 'special' bit.
>> */
>> - BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
>> - !pfn_t_devmap(pfn));
>> + BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
>
> Should that rather be ...
>
> + BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
> + !pfn_t_devmap(pfn) && !pfn_t_special(pfn));
>
Yes. That is indeed a mistake I had already fixed for v2. Patch 3 does the exact
same, and the other comments you mentioned here apply there too, so I will
adjust accordingly.

> I also think this comment needs adjusting:
>
> /*
> * There is no pmd_special() but there may be special pmds, e.g.
> * in a direct-access (dax) mapping, so let's just replicate the
> * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
> */
>
>
I'll replace it with the equivalent of what vm_normal_page() has:

/* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */

Joao

2020-02-04 16:45:52

by Dan Williams

Subject: Re: [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support

On Tue, Feb 4, 2020 at 7:30 AM Barret Rhoden <[email protected]> wrote:
>
> Hi -
>
> On 1/10/20 2:03 PM, Joao Martins wrote:
> > User can define regions with 'memmap=size!offset' which in turn
> > creates PMEM legacy devices. But because it is a label-less
> > NVDIMM device we only have one namespace for the whole device.
> >
> > Add support for multiple namespaces by adding ndctl control
> > support, and exposing a minimal set of features:
> > (ND_CMD_GET_CONFIG_SIZE, ND_CMD_GET_CONFIG_DATA,
> > ND_CMD_SET_CONFIG_DATA) alongside NDD_ALIASING because we can
> > store labels.
>
> FWIW, I like this a lot. If we move away from using memmap in favor of
> efi_fake_mem, ideally we'd have the same support for full-fledged
> pmem/dax regions and namespaces that this patch brings.

No, efi_fake_mem only supports creating dax-regions. What's the use
case that can't be satisfied by just specifying multiple memmap=
ranges?
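For reference, carving out several fixed regions that way is just a matter of repeating the parameter on the kernel command line; each `memmap=size!offset` range typically becomes its own legacy PMEM device. The addresses below are purely illustrative and must correspond to real, non-colliding RAM on the target machine:

```
memmap=16G!16G memmap=16G!32G memmap=16G!48G
```

The split is still fixed at boot, which is the limitation discussed in the replies that follow.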

2020-02-04 18:22:37

by Barret Rhoden

Subject: Re: [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support

Hi -

On 2/4/20 11:44 AM, Dan Williams wrote:
> On Tue, Feb 4, 2020 at 7:30 AM Barret Rhoden <[email protected]> wrote:
>>
>> Hi -
>>
>> On 1/10/20 2:03 PM, Joao Martins wrote:
>>> User can define regions with 'memmap=size!offset' which in turn
>>> creates PMEM legacy devices. But because it is a label-less
>>> NVDIMM device we only have one namespace for the whole device.
>>>
>>> Add support for multiple namespaces by adding ndctl control
>>> support, and exposing a minimal set of features:
>>> (ND_CMD_GET_CONFIG_SIZE, ND_CMD_GET_CONFIG_DATA,
>>> ND_CMD_SET_CONFIG_DATA) alongside NDD_ALIASING because we can
>>> store labels.
>>
>> FWIW, I like this a lot. If we move away from using memmap in favor of
>> efi_fake_mem, ideally we'd have the same support for full-fledged
>> pmem/dax regions and namespaces that this patch brings.
>
> No, efi_fake_mem only supports creating dax-regions. What's the use
> case that can't be satisfied by just specifying multiple memmap=
> ranges?

I'd like to be able to create and destroy dax regions on the fly. In
particular, I want to run guest VMs using the dax files for guest
memory, but I don't know at boot time how many VMs I'll have, or what
their sizes are. Ideally, I'd have separate files for each VM, instead
of a single /dev/dax.

I currently do this with fs-dax with one big memmap region (ext4 on
/dev/pmem0), and I use the file system to handle the
creation/destruction/resizing and metadata management. But since fs-dax
won't work with device pass-through, I started looking at dev-dax, with
the expectation that I'd need some software to manage the memory (i.e.
allocation). That led me to ndctl, which seems to need namespace labels
to have the level of control I was looking for.

Thanks,

Barret

2020-02-04 19:09:59

by Joao Martins

Subject: Re: [PATCH RFC 00/10] device-dax: Support devices without PFN metadata

On 2/4/20 1:24 AM, Dan Williams wrote:
> On Fri, Jan 10, 2020 at 11:06 AM Joao Martins <[email protected]> wrote:
>>
>> Hey,
>
> Hi Joao,
>
>>
>> Presented herewith a small series which allows device-dax to work without
>> struct page to be used to back KVM guests memory. It's an RFC, and there's
>> still some items we're looking at (see TODO below);
>
> So it turns out I already have some patches in flight that address
> discontiguous allocation item. Here's a WIP branch that I'll be
> sending out after the merge window closes.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=libnvdimm-pending
>
Neat!

>> but wondering if folks
>> would be OK carving some time out of their busy schedules to provide feedback
>> direction-wise on this work.
>
> ...apologies I did not get to this sooner. Please feel free to ping me
> after a week if you're awaiting comment on anything in the nvdimm or
> device-dax area.
>
OK, got it.

>> In virtualized environments (specially those with no kernel-backed PV
>> interfaces, and just SR-IOV), memory is largelly assigned to guests: either
>> persistent with NVDIMMs or volatile for regular RAM. The kernel
>> (hypervisor) tracks it with 'struct page' (64b) for each 4K page. Overall
>> we're spending 16GB for each 1Tb of host memory tracked that the kernel won't
>> need which could instead be used to create other guests. One of motivations of
>> this series is to then get that memory used for 'struct page', when it is meant
>> to solely be used by userspace.
>
> Do you mean reclaim it for the guest to use for 'struct page' capacity
> since the host hypervisor has reduced need for it?
>
Yes.

>> This is also useful for the case of memory
>> backing guests virtual NVDIMMs. The other neat side effect is that the
>> hypervisor has no virtual mapping of the guest and hence code gadgets (if
>> found) are limited in their effectiveness.
>
> You mean just the direct-map?

Yes.

> qemu still gets a valid virtual address,
> or are you proposing it's not mapped there either?
>
Correct, Qemu still has a valid virtual address.

>> It is expected that a smaller (instead of total) amount of host memory is
>> defined for the kernel (with mem=X or memmap=X!Y). For KVM userspace VMM (e.g.
>> QEMU), the main thing that is needed is a device which open + mmap + close with
>> a certain alignment (4K, 2M, 1G). That made us look at device-dax which does
>> just that and so the work comprised here was improving what's there and the
>> interfaces it uses.
>
> In general I think this "'struct page'-less device-dax" capability
> makes sense, why suffer 1.6% capacity loss when that memory is going
> unused? My main concerns are:

Exactly!

>
> 1/ Use of memmap=nn!ss when none of the persistent memory
> infrastructure is needed.
>
> 2/ Auditing the new usage of VM_PFNMAP and what that means for
> memory-error handling and applications that expect 'struct page'
> services to be present.
>
> 3/ "page-less" dreams have been dashed on the rocks in the past. The
> slow drip of missing features (ptrace(), direct-I/O, fork()...) is why
> the kernel now requires them by default for dax.
>

wrt ptrace(), the only usage I could find was PEEKDATA/POKEDATA/etc
(essentially access_remote_vm()). If devdax defines an access() handler which
copies between the pfn and the user buffer, then it should work. Unless there is
something else that I am missing?

> For 1/ I have a proposal, for 2/ I need to dig in to what you have
> here, but maybe I can trade you some review on the discontiguous
> allocation patches.

Will definitely look at those patches. Before this series I had actually started
with hmem (based on your initial work on multi-resource dax) and had patches for
daxctl too (to create/destroy the devices backed with soft-reserved memory),
before switching to nd_e820. Let me know where I can help. I can throw those
away and build on top of the work that is on that branch.

For 1), part of the reason to start with memmap= was that the series was more
about devdax infra than the 'partitioning' itself. But it wasn't clear to me
which layer would be the best fit, and given that the devdax changes are
somewhat orthogonal, I figured I would settle on something simpler that reuses
a common facility and hear out opinions.

For 2) the idea is to keep usage of the services that require 'struct page' to
the minimum (or none) necessary. For example, if you have a guest with only VFIO
instances you really don't need much (aside from MCE, and optionally ptrace
attach). For kernel PV, the one that would require work would be vhost-scsi,
which would need the equivalent of vhost_net.experimental_zcopytx=0 (the default
since v5.3) and therefore copy as opposed to doing gup. I have a patch for that.

But as you can see in these patches I am trying to keep the same properties as
existing pte special based mappings. Should we need a 'struct page' I wonder if
we could dynamically create/delete the struct page (e.g. maybe via
find_special_page()).

> For 3/ can you say a bit more about why the
> overhead is intolerable?
>

It really is about the extra memory 'wasted' when the hypervisor's purpose is
to serve those same resources (e.g. memory, IO) back to guests.

In terms of percentage, calling it 1.6% overhead (or 16GB per TB) is somewhat
reductive. When we take into account that the memory required by most
hypervisor applications should be kept small[*], you quickly realize that the
16GB 'lost' to 'struct page' is a lot. The hugepages (which you also get with
devdax with a devmap) help further, as you deterministically require less
memory for page table entries.

[*] as an arbitrary example: say only up to 18GB for hypervisor
applications/orchestration in a 1TB machine.

This is also orthogonal to volatile RAM; I suppose we have the same problem with
persistent memory, which is also given to guests via e.g. QEMU-emulated NVDIMMs.

> The proposal for 1/ is please let's not build more on top of
> memmap=nn!ss. It's fragile because there's no facility to validate
> that it is correct, the assumption is that the user overriding the
> memmap knows to pick address that don't collide and have real memory
> backing. Practice has shown that people get it wrong often enough that
> we need something better.

/nods Agreed.

> It's also confusing that applications start
> seeing "Persistent Memory" in /proc/iomem when it's only there to
> shunt memory over to device-dax.
>

Right. The one side benefit of the existing nvdimm facilities, though, is that
we reuse namespace-based partitioning, and so this split of dax sort of ends up
replicating what's already in drivers/nvdimm. But I am not arguing one way or
another; at the moment I was already wondering whether I am abusing nvdimm or
whether this new layer should be created.

> The alternative is "efi_fake_mem=nn@ss:0x40000" and the related
> 'soft-reservation' enabling that bypasses the persistent memory
> enabling and assigns memory to device-dax directly. EFI attribute
> override is safer because this attribute is only honored when applied
> to existing EFI conventional memory.
>
/nods

One idea I had was to make this slightly more dynamic, as opposed to a hard
limit; it would also be less tied to the EFI memory map.

Say we would offline certain memory blocks and reassign them back to dax_hmem.
In a way it would be somewhat the reverse of dax_kmem (or rather of dax_kmem's
understanding of new_id with memory%d devices). Would that be sensible? I am
still sketching this; not yet sure to what extent I can get it done cleanly.

> Have a look at that branch just be warned it's not polished, or just
> wait for me to send them out with a proper cover letter, and I'll take
> a look at what you have below.
>
Cool, awesome.


>>
>> The series is divided as follows:
>>
>> * Patch 1 , 3: Preparatory work for patch 7 for adding support for
>> vmf_insert_{pmd,pud} with dax pfn flags PFN_DEV|PFN_SPECIAL
>>
>> * Patch 2 , 4: Preparatory work for patch 7 for adding support for
>> follow_pfn() to work with 2M/1G huge pages, which is
>> what KVM uses for VM_PFNMAP.
>>
>> * Patch 5 - 7: One bugfix and device-dax support for PFN_DEV|PFN_SPECIAL,
>> which encompasses mainly dealing with the lack of devmap,
>> and creating a VM_PFNMAP vma.
>>
>> * Patch 8: PMEM support for no PFN metadata only for device-dax namespaces.
>> At the very end of the cover letter (after scissors mark),
>> there's a patch for ndctl to be able to create namespaces
>> with '--mode devdax --map none'.
>>
>> * Patch 9: Let VFIO handle VM_PFNMAP without relying on vm_pgoff being
>> a PFN.
>>
>> * Patch 10: The actual end consumer example for RAM case. The patch just adds a
>> label storage area which consequently allows namespaces to be
>> created. We picked PMEM legacy for starters.
>>
>> Thoughts, comments appreciated.
>> Joao
>>
>> P.S. As an example to try this out:
>>
>> 1) add 'memmap=48G!16G' to the kernel command line, on a host with 64G,
>> and kernel has 16G.
>>
>> 2) create a devdax namespace with 1G hugepages:
>>
>> $ ndctl create-namespace --verbose --mode devdax --map none --size 32G --align 1G -r 0
>> {
>> "dev":"namespace0.0",
>> "mode":"devdax",
>> "map":"none",
>> "size":"32.00 GiB (34.36 GB)",
>> "uuid":"dfdd05cd-2611-46ac-8bcd-10b6194f32d4",
>> "daxregion":{
>> "id":0,
>> "size":"32.00 GiB (34.36 GB)",
>> "align":1073741824,
>> "devices":[
>> {
>> "chardev":"dax0.0",
>> "size":"32.00 GiB (34.36 GB)",
>> "target_node":0,
>> "mode":"devdax"
>> }
>> ]
>> },
>> "align":1073741824
>> }
>>
>> 3) Add this to your qemu params:
>> -m 32G
>> -object memory-backend-file,id=mem,size=32G,mem-path=/dev/dax0.0,share=on,align=1G
>> -numa node,memdev=mem
>>
>> TODO:
>>
>> * Discontiguous regions/namespaces: The work above is limited to max
>> contiguous extent, coming from nvdimm dpa allocation heuristics -- which I take
>> is because of what specs allow for persistent namespaces. But for volatile RAM
>> case we would need handling of discontiguous extents (hence a region would represent
>> more than a resource) to be less bound to how guests are placed on the system.
>> I played around with multi-resource for device-dax, but I'm wondering about
>> UABI: 1) whether nvdimm DPA allocation heuristics should be relaxed for RAM
>> case (under certain nvdimm region bits); or if 2) device-dax would have it's
>> own separate UABI to be used by daxctl (which would be also useful for hmem
>> devices?).
>>

Btw, one thing I couldn't tell from the specs is whether a persistent namespace
is allowed to be represented with 2 or more discontiguous resources. I know of
block-mode namespaces, but I am not sure whether a 'discontiguous' persistent
namespace is eligible to be represented with block-mode-style namespace labels
despite not having any btt interface. That seemed to work so far (though I am
not testing on an actual NVDIMM); but given that we would pursue the hmem
approach, this is also less relevant.

>> * MCE handling: For contiguous regions vm_pgoff could be set to the pfn in
>> device-dax, which would allow collect_procs() to find the processes solely based
>> on the PFN. But for discontiguous namespaces, not sure if this would work; perhaps
>> looking at the dax-region pfn range for each DAX vma.
>
> You mean, make the memory error handling device-dax aware?
>

/nods

The reason being that with discontiguous regions the assumption of 'vma PFN =
vm_pgoff + (addr - vm_start)' would no longer be sufficient to determine whether
a vma with VM_PFNMAP has the PFN that triggered the MCE. So far I am adding a
vma vm operation which finds the address for a given PFN (essentially the
reverse of follow_pfn()). Given that devdax knows which pfns it fills in, this
would be cheap. Memory failure then sends SIGBUS to the found process (the VMM),
which injects a vMCE into the guest. The guest then recovers the way it sees
fit. This is not in this series, but I was hoping to include it in v2.

>>
>> * NUMA: For now excluded setting the target_node; while these two patches
>> are being worked on[1][2].
>>
>> [1] https://lore.kernel.org/lkml/157401276776.43284.12396353118982684546.stgit@dwillia2-desk3.amr.corp.intel.com/
>> [2] https://lore.kernel.org/lkml/157401277293.43284.3805106435228534675.stgit@dwillia2-desk3.amr.corp.intel.com/
>
> I'll ping x86 folks again after the merge window. I expect they have
> just not had time to ack them yet.
>
Cool, Thanks for the feedback so far!

Joao

2020-02-04 19:27:35

by Joao Martins

Subject: Re: [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support

On 2/4/20 6:20 PM, Barret Rhoden wrote:
> Hi -
>
> On 2/4/20 11:44 AM, Dan Williams wrote:
>> On Tue, Feb 4, 2020 at 7:30 AM Barret Rhoden <[email protected]> wrote:
>>>
>>> Hi -
>>>
>>> On 1/10/20 2:03 PM, Joao Martins wrote:
>>>> User can define regions with 'memmap=size!offset' which in turn
>>>> creates PMEM legacy devices. But because it is a label-less
>>>> NVDIMM device we only have one namespace for the whole device.
>>>>
>>>> Add support for multiple namespaces by adding ndctl control
>>>> support, and exposing a minimal set of features:
>>>> (ND_CMD_GET_CONFIG_SIZE, ND_CMD_GET_CONFIG_DATA,
>>>> ND_CMD_SET_CONFIG_DATA) alongside NDD_ALIASING because we can
>>>> store labels.
>>>
>>> FWIW, I like this a lot. If we move away from using memmap in favor of
>>> efi_fake_mem, ideally we'd have the same support for full-fledged
>>> pmem/dax regions and namespaces that this patch brings.
>>
>> No, efi_fake_mem only supports creating dax-regions. What's the use
>> case that can't be satisfied by just specifying multiple memmap=
>> ranges?
>
> I'd like to be able to create and destroy dax regions on the fly. In
> particular, I want to run guest VMs using the dax files for guest
> memory, but I don't know at boot time how many VMs I'll have, or what
> their sizes are. Ideally, I'd have separate files for each VM, instead
> of a single /dev/dax.
>
> I currently do this with fs-dax with one big memmap region (ext4 on
> /dev/pmem0), and I use the file system to handle the
> creation/destruction/resizing and metadata management. But since fs-dax
> won't work with device pass-through, I started looking at dev-dax, with
> the expectation that I'd need some software to manage the memory (i.e.
> allocation). That led me to ndctl, which seems to need namespace labels
> to have the level of control I was looking for.
>

Indeed this is the intent of the patch.

As Barret mentioned, memmap= is limited to one namespace covering the entire
region, and this would fix that (regardless of namespace mode). Otherwise we
have to know in advance the number of guests and their exact sizes, which would
be somewhat inflexible.

But given that it's 'pmem emulation', I thought it was OK to twist the
label-less aspect of nd_e820 (unless there's hardware out there which does
this?).

If Dan agrees, I can continue with the patch.

2020-02-04 21:45:02

by Dan Williams

Subject: Re: [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support

On Tue, Feb 4, 2020 at 10:20 AM Barret Rhoden <[email protected]> wrote:
>
> Hi -
>
> On 2/4/20 11:44 AM, Dan Williams wrote:
> > On Tue, Feb 4, 2020 at 7:30 AM Barret Rhoden <[email protected]> wrote:
> >>
> >> Hi -
> >>
> >> On 1/10/20 2:03 PM, Joao Martins wrote:
> >>> User can define regions with 'memmap=size!offset' which in turn
> >>> creates PMEM legacy devices. But because it is a label-less
> >>> NVDIMM device we only have one namespace for the whole device.
> >>>
> >>> Add support for multiple namespaces by adding ndctl control
> >>> support, and exposing a minimal set of features:
> >>> (ND_CMD_GET_CONFIG_SIZE, ND_CMD_GET_CONFIG_DATA,
> >>> ND_CMD_SET_CONFIG_DATA) alongside NDD_ALIASING because we can
> >>> store labels.
> >>
> >> FWIW, I like this a lot. If we move away from using memmap in favor of
> >> efi_fake_mem, ideally we'd have the same support for full-fledged
> >> pmem/dax regions and namespaces that this patch brings.
> >
> > No, efi_fake_mem only supports creating dax-regions. What's the use
> > case that can't be satisfied by just specifying multiple memmap=
> > ranges?
>
> I'd like to be able to create and destroy dax regions on the fly. In
> particular, I want to run guest VMs using the dax files for guest
> memory, but I don't know at boot time how many VMs I'll have, or what
> their sizes are. Ideally, I'd have separate files for each VM, instead
> of a single /dev/dax.
>
> I currently do this with fs-dax with one big memmap region (ext4 on
> /dev/pmem0), and I use the file system to handle the
> creation/destruction/resizing and metadata management. But since fs-dax
> won't work with device pass-through, I started looking at dev-dax, with
> the expectation that I'd need some software to manage the memory (i.e.
> allocation). That led me to ndctl, which seems to need namespace labels
> to have the level of control I was looking for.

Ah, got it, you only ended up at wanting namespace labels because
there was no other way to carve up device-dax. That's changing as part
of the efi_fake_mem= enabling and I have a patch set in the works to
allow discontiguous sub-divisions of a device-dax range. Note that
this branch rebases frequently:

https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=libnvdimm-pending


>
> Thanks,
>
> Barret
>

2020-02-04 21:59:31

by Barret Rhoden

Subject: Re: [PATCH RFC 10/10] nvdimm/e820: add multiple namespaces support

On 2/4/20 4:43 PM, Dan Williams wrote:
> Ah, got it, you only ended up at wanting namespace labels because
> there was no other way to carve up device-dax. That's changing as part
> of the efi_fake_mem= enabling and I have a patch set in the works to
> allow discontiguous sub-divisions of a device-dax range. Note that
> this branch rebases frequently:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=libnvdimm-pending

Cool, thanks. I'll check it out!

Barret