Changes since v1 [1]:
- Fix build errors with the PowerPC override for memremap_compat_align()
- Move the memremap_compat_align() override definition to
arch/powerpc/mm/ioremap.c (Aneesh)
[1]: http://lore.kernel.org/r/158041475480.3889308.655103391935006598.stgit@dwillia2-desk3.amr.corp.intel.com
---
Explicit review requests, but any other feedback is of course
appreciated:
Patch1 needs an ack from ppc arch maintainers, and I'd like a tested-by
from Aneesh that this still works to solve the ppc issue. Jeff, does
this look good to you?
---
Aneesh reports that PowerPC requires 16MiB alignment for the address
range passed to devm_memremap_pages(), and Jeff reports that it is
possible to create a misaligned namespace which blocks future namespace
creation in that region. Both of these issues require namespace
alignment to be managed at the region level rather than padding at the
namespace level which has been a broken approach to date.
Introduce memremap_compat_align() to indicate the hard requirements of
an arch's memremap_pages() implementation. Use the maximum known
memremap_compat_align() to set the default namespace alignment for
libnvdimm. Consult that alignment when allocating free space. Finally,
allow the default region alignment to be overridden to maintain the same
namespace creation capability as previous kernels.
The ndctl unit tests, which have some misaligned namespace assumptions,
are updated to use the alignment override where necessary.
Thanks to Aneesh for early feedback and testing on this improved
alignment handling.
---
Dan Williams (4):
mm/memremap_pages: Introduce memremap_compat_align()
libnvdimm/namespace: Enforce memremap_compat_align()
libnvdimm/region: Introduce NDD_LABELING
libnvdimm/region: Introduce an 'align' attribute
arch/powerpc/Kconfig | 1
arch/powerpc/mm/ioremap.c | 12 +++
arch/powerpc/platforms/pseries/papr_scm.c | 2
drivers/acpi/nfit/core.c | 4 +
drivers/nvdimm/dimm.c | 2
drivers/nvdimm/dimm_devs.c | 95 +++++++++++++++++----
drivers/nvdimm/namespace_devs.c | 21 ++++-
drivers/nvdimm/nd.h | 3 -
drivers/nvdimm/pfn_devs.c | 2
drivers/nvdimm/region_devs.c | 132 ++++++++++++++++++++++++++---
include/linux/libnvdimm.h | 2
include/linux/memremap.h | 8 ++
include/linux/mmzone.h | 1
lib/Kconfig | 3 +
mm/memremap.c | 13 +++
15 files changed, 260 insertions(+), 41 deletions(-)
--
base-commit: 543506a2936aaced94bcc8731aae5d29d0442e90
The "sub-section memory hotplug" facility allows memremap_pages() users
like libnvdimm to compensate for hardware platforms like x86 that have a
section size larger than their hardware memory mapping granularity. The
compensation that sub-section support affords is being tolerant of
physical memory resources shifting by units smaller (64MiB on x86) than
the memory-hotplug section size (128 MiB). Where the platform
physical-memory mapping granularity is limited by the number and
capability of address-decode-registers in the memory controller.
While the sub-section support allows memremap_pages() to operate on
sub-section (2MiB) granularity, the Power architecture may still
require 16MiB alignment on "!radix_enabled()" platforms.
In order for libnvdimm to be able to detect and manage this per-arch
limitation, introduce memremap_compat_align() as a common minimum
alignment across all driver-facing memory-mapping interfaces, and let
Power override it to 16MiB in the "!radix_enabled()" case.
The assumption / requirement for 16MiB to be a viable
memremap_compat_align() value is that Power does not have platforms
where its equivalent of address-decode-registers never hardware remaps a
persistent memory resource on smaller than 16MiB boundaries. Note that I
tried my best to not add a new Kconfig symbol, but header include
entanglements defeated the #ifndef memremap_compat_align design pattern
and the need to export it defeats the __weak design pattern for arch
overrides.
Based on an initial patch by Aneesh.
Link: http://lore.kernel.org/r/CAPcyv4gBGNP95APYaBcsocEa50tQj9b5h__83vgngjq3ouGX_Q@mail.gmail.com
Reported-by: Aneesh Kumar K.V <[email protected]>
Reported-by: Jeff Moyer <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/powerpc/Kconfig | 1 +
arch/powerpc/mm/ioremap.c | 12 ++++++++++++
drivers/nvdimm/pfn_devs.c | 2 +-
include/linux/memremap.h | 8 ++++++++
include/linux/mmzone.h | 1 +
lib/Kconfig | 3 +++
mm/memremap.c | 13 +++++++++++++
7 files changed, 39 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 497b7d0b2d7e..e6ffe905e2b9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -122,6 +122,7 @@ config PPC
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_KCOV
select ARCH_HAS_HUGEPD if HUGETLB_PAGE
+ select ARCH_HAS_MEMREMAP_COMPAT_ALIGN
select ARCH_HAS_MMIOWB if PPC64
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PMEM_API
diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c
index fc669643ce6a..38b5ba7d3e2d 100644
--- a/arch/powerpc/mm/ioremap.c
+++ b/arch/powerpc/mm/ioremap.c
@@ -2,6 +2,7 @@
#include <linux/io.h>
#include <linux/slab.h>
+#include <linux/mmzone.h>
#include <linux/vmalloc.h>
#include <asm/io-workarounds.h>
@@ -97,3 +98,14 @@ void __iomem *do_ioremap(phys_addr_t pa, phys_addr_t offset, unsigned long size,
return NULL;
}
+
+#ifdef CONFIG_ZONE_DEVICE
+/* override of the generic version in mm/memremap.c */
+unsigned long memremap_compat_align(void)
+{
+ if (radix_enabled())
+ return SUBSECTION_SIZE;
+ return (1UL << mmu_psize_defs[mmu_linear_psize].shift);
+}
+EXPORT_SYMBOL_GPL(memremap_compat_align);
+#endif
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index b94f7a7e94b8..a5c25cb87116 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -750,7 +750,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
start = nsio->res.start;
size = resource_size(&nsio->res);
npfns = PHYS_PFN(size - SZ_8K);
- align = max(nd_pfn->align, (1UL << SUBSECTION_SHIFT));
+ align = max(nd_pfn->align, SUBSECTION_SIZE);
end_trunc = start + size - ALIGN_DOWN(start + size, align);
if (nd_pfn->mode == PFN_MODE_PMEM) {
/*
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 6fefb09af7c3..8af1cbd8f293 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -132,6 +132,7 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
+unsigned long memremap_compat_align(void);
#else
static inline void *devm_memremap_pages(struct device *dev,
struct dev_pagemap *pgmap)
@@ -165,6 +166,12 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
unsigned long nr_pfns)
{
}
+
+/* when memremap_pages() is disabled all archs can remap a single page */
+static inline unsigned long memremap_compat_align(void)
+{
+ return PAGE_SIZE;
+}
#endif /* CONFIG_ZONE_DEVICE */
static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
@@ -172,4 +179,5 @@ static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
if (pgmap)
percpu_ref_put(pgmap->ref);
}
+
#endif /* _LINUX_MEMREMAP_H_ */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 462f6873905a..6b77f7239af5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1170,6 +1170,7 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec)
#define SECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SECTION_MASK)
#define SUBSECTION_SHIFT 21
+#define SUBSECTION_SIZE (1UL << SUBSECTION_SHIFT)
#define PFN_SUBSECTION_SHIFT (SUBSECTION_SHIFT - PAGE_SHIFT)
#define PAGES_PER_SUBSECTION (1UL << PFN_SUBSECTION_SHIFT)
diff --git a/lib/Kconfig b/lib/Kconfig
index 0cf875fd627c..17dbc7bd3895 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -618,6 +618,9 @@ config ARCH_HAS_PMEM_API
config MEMREGION
bool
+config ARCH_HAS_MEMREMAP_COMPAT_ALIGN
+ bool
+
# use memcpy to implement user copies for nommu architectures
config UACCESS_MEMCPY
bool
diff --git a/mm/memremap.c b/mm/memremap.c
index 09b5b7adc773..a6905d28fe91 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -7,6 +7,7 @@
#include <linux/mm.h>
#include <linux/pfn_t.h>
#include <linux/swap.h>
+#include <linux/mmzone.h>
#include <linux/swapops.h>
#include <linux/types.h>
#include <linux/wait_bit.h>
@@ -14,6 +15,18 @@
static DEFINE_XARRAY(pgmap_array);
+/*
+ * Minimum compatible alignment of the resource (start, end) across
+ * memremap interfaces (i.e. memremap + memremap_pages)
+ */
+#ifndef CONFIG_ARCH_HAS_MEMREMAP_COMPAT_ALIGN
+unsigned long memremap_compat_align(void)
+{
+ return SUBSECTION_SIZE;
+}
+EXPORT_SYMBOL_GPL(memremap_compat_align);
+#endif
+
#ifdef CONFIG_DEV_PAGEMAP_OPS
DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
EXPORT_SYMBOL(devmap_managed_key);
The pmem driver on PowerPC crashes with the following signature when
instantiating misaligned namespaces that map their capacity via
memremap_pages().
BUG: Unable to handle kernel data access at 0xc001000406000000
Faulting instruction address: 0xc000000000090790
NIP [c000000000090790] arch_add_memory+0xc0/0x130
LR [c000000000090744] arch_add_memory+0x74/0x130
Call Trace:
arch_add_memory+0x74/0x130 (unreliable)
memremap_pages+0x74c/0xa30
devm_memremap_pages+0x3c/0xa0
pmem_attach_disk+0x188/0x770
nvdimm_bus_probe+0xd8/0x470
With the assumption that only memremap_pages() has alignment
constraints, enforce memremap_compat_align() for
pmem_should_map_pages(), nd_pfn, or nd_dax cases.
Reported-by: Aneesh Kumar K.V <[email protected]>
Cc: Jeff Moyer <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Link: https://lore.kernel.org/r/158041477336.3889308.4581652885008605170.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/namespace_devs.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 032dc61725ff..aff1f32fdb4f 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -1739,6 +1739,16 @@ struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev)
return ERR_PTR(-ENODEV);
}
+ if (pmem_should_map_pages(dev) || nd_pfn || nd_dax) {
+ struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
+ resource_size_t start = nsio->res.start;
+
+ if (!IS_ALIGNED(start | size, memremap_compat_align())) {
+ dev_dbg(&ndns->dev, "misaligned, unable to map\n");
+ return ERR_PTR(-EOPNOTSUPP);
+ }
+ }
+
if (is_namespace_pmem(&ndns->dev)) {
struct nd_namespace_pmem *nspm;
The NDD_ALIASING flag is used to indicate where pmem capacity might
alias with blk capacity and require labeling. It is also used to
indicate whether the DIMM supports labeling. Separate this latter
capability into its own flag so that the NDD_ALIASING flag is scoped to
true aliased configurations.
To my knowledge aliased configurations only exist in the ACPI spec,
there are no known platforms that ship this support in production.
This clarity allows namespace-capacity alignment constraints around
interleave-ways to be relaxed.
Cc: Vishal Verma <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Oliver O'Halloran <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Link: https://lore.kernel.org/r/158041477856.3889308.4212605617834097674.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <[email protected]>
---
arch/powerpc/platforms/pseries/papr_scm.c | 2 +-
drivers/acpi/nfit/core.c | 4 +++-
drivers/nvdimm/dimm.c | 2 +-
drivers/nvdimm/dimm_devs.c | 9 +++++----
drivers/nvdimm/namespace_devs.c | 2 +-
drivers/nvdimm/nd.h | 2 +-
drivers/nvdimm/region_devs.c | 10 +++++-----
include/linux/libnvdimm.h | 2 ++
8 files changed, 19 insertions(+), 14 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 0b4467e378e5..589858cb3203 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -328,7 +328,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
}
dimm_flags = 0;
- set_bit(NDD_ALIASING, &dimm_flags);
+ set_bit(NDD_LABELING, &dimm_flags);
p->nvdimm = nvdimm_create(p->bus, p, NULL, dimm_flags,
PAPR_SCM_DIMM_CMD_MASK, 0, NULL);
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index a3320f93616d..71d7f2aa1b12 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2026,8 +2026,10 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
continue;
}
- if (nfit_mem->bdw && nfit_mem->memdev_pmem)
+ if (nfit_mem->bdw && nfit_mem->memdev_pmem) {
set_bit(NDD_ALIASING, &flags);
+ set_bit(NDD_LABELING, &flags);
+ }
/* collate flags across all memdevs for this dimm */
list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index 64776ed15bb3..7d4ddc4d9322 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -99,7 +99,7 @@ static int nvdimm_probe(struct device *dev)
if (ndd->ns_current >= 0) {
rc = nd_label_reserve_dpa(ndd);
if (rc == 0)
- nvdimm_set_aliasing(dev);
+ nvdimm_set_labeling(dev);
}
nvdimm_bus_unlock(dev);
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 94ea6dba6b4f..64159d4d4b8f 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -32,7 +32,7 @@ int nvdimm_check_config_data(struct device *dev)
if (!nvdimm->cmd_mask ||
!test_bit(ND_CMD_GET_CONFIG_DATA, &nvdimm->cmd_mask)) {
- if (test_bit(NDD_ALIASING, &nvdimm->flags))
+ if (test_bit(NDD_LABELING, &nvdimm->flags))
return -ENXIO;
else
return -ENOTTY;
@@ -173,11 +173,11 @@ int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, size_t offset,
return rc;
}
-void nvdimm_set_aliasing(struct device *dev)
+void nvdimm_set_labeling(struct device *dev)
{
struct nvdimm *nvdimm = to_nvdimm(dev);
- set_bit(NDD_ALIASING, &nvdimm->flags);
+ set_bit(NDD_LABELING, &nvdimm->flags);
}
void nvdimm_set_locked(struct device *dev)
@@ -312,8 +312,9 @@ static ssize_t flags_show(struct device *dev,
{
struct nvdimm *nvdimm = to_nvdimm(dev);
- return sprintf(buf, "%s%s\n",
+ return sprintf(buf, "%s%s%s\n",
test_bit(NDD_ALIASING, &nvdimm->flags) ? "alias " : "",
+ test_bit(NDD_LABELING, &nvdimm->flags) ? "label" : "",
test_bit(NDD_LOCKED, &nvdimm->flags) ? "lock " : "");
}
static DEVICE_ATTR_RO(flags);
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index aff1f32fdb4f..30cda9f235de 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -2531,7 +2531,7 @@ static int init_active_labels(struct nd_region *nd_region)
if (!ndd) {
if (test_bit(NDD_LOCKED, &nvdimm->flags))
/* fail, label data may be unreadable */;
- else if (test_bit(NDD_ALIASING, &nvdimm->flags))
+ else if (test_bit(NDD_LABELING, &nvdimm->flags))
/* fail, labels needed to disambiguate dpa */;
else
return 0;
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index c9f6a5b5253a..ca39abe29c7c 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -252,7 +252,7 @@ int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, size_t offset,
void *buf, size_t len);
long nvdimm_clear_poison(struct device *dev, phys_addr_t phys,
unsigned int len);
-void nvdimm_set_aliasing(struct device *dev);
+void nvdimm_set_labeling(struct device *dev);
void nvdimm_set_locked(struct device *dev);
void nvdimm_clear_locked(struct device *dev);
int nvdimm_security_setup_events(struct device *dev);
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index a19e535830d9..a5fc6e4c56ff 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -195,16 +195,16 @@ EXPORT_SYMBOL_GPL(nd_blk_region_set_provider_data);
int nd_region_to_nstype(struct nd_region *nd_region)
{
if (is_memory(&nd_region->dev)) {
- u16 i, alias;
+ u16 i, label;
- for (i = 0, alias = 0; i < nd_region->ndr_mappings; i++) {
+ for (i = 0, label = 0; i < nd_region->ndr_mappings; i++) {
struct nd_mapping *nd_mapping = &nd_region->mapping[i];
struct nvdimm *nvdimm = nd_mapping->nvdimm;
- if (test_bit(NDD_ALIASING, &nvdimm->flags))
- alias++;
+ if (test_bit(NDD_LABELING, &nvdimm->flags))
+ label++;
}
- if (alias)
+ if (label)
return ND_DEVICE_NAMESPACE_PMEM;
else
return ND_DEVICE_NAMESPACE_IO;
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 9df091bd30ba..18da4059be09 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -37,6 +37,8 @@ enum {
NDD_WORK_PENDING = 4,
/* ignore / filter NSLABEL_FLAG_LOCAL for this DIMM, i.e. no aliasing */
NDD_NOBLK = 5,
+ /* dimm supports namespace labels */
+ NDD_LABELING = 6,
/* need to set a limit somewhere, but yes, this is likely overkill */
ND_IOCTL_MAX_BUFLEN = SZ_4M,
The align attribute applies an alignment constraint for namespace
creation in a region. Whereas the 'align' attribute of a namespace
applied alignment padding via an info block, the 'align' attribute
applies alignment constraints to the free space allocation.
The default for 'align' is the maximum known memremap_compat_align()
across all archs (16MiB from PowerPC at time of writing) multiplied by
the number of interleave ways if there is blk-aliasing. The minimum is
PAGE_SIZE and allows for the creation of cross-arch incompatible
namespaces, just as previous kernels allowed, but the expectation is
cross-arch and mode-independent compatibility by default.
The regression risk with this change is limited to cases that were
dependent on the ability to create unaligned namespaces, *and* for some
reason are unable to opt-out of aligned namespaces by writing to
'regionX/align'. If such a scenario arises the default can be flipped
from opt-out to opt-in of compat-aligned namespace creation, but that is
a last resort. The kernel will otherwise continue to support existing
defined misaligned namespaces.
Unfortunately this change needs to touch several parts of the
implementation at once:
- region/available_size: expand busy extents to current align
- region/max_available_extent: expand busy extents to current align
- namespace/size: trim free space to current align
...to keep the free space accounting conforming to the dynamic align
setting.
Reported-by: Aneesh Kumar K.V <[email protected]>
Reported-by: Jeff Moyer <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Link: https://lore.kernel.org/r/158041478371.3889308.14542630147672668068.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <[email protected]>
---
drivers/nvdimm/dimm_devs.c | 86 +++++++++++++++++++++++----
drivers/nvdimm/namespace_devs.c | 9 ++-
drivers/nvdimm/nd.h | 1
drivers/nvdimm/region_devs.c | 122 ++++++++++++++++++++++++++++++++++++---
4 files changed, 192 insertions(+), 26 deletions(-)
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 64159d4d4b8f..b4994abb655f 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -563,6 +563,21 @@ int nvdimm_security_freeze(struct nvdimm *nvdimm)
return rc;
}
+static unsigned long dpa_align(struct nd_region *nd_region)
+{
+ struct device *dev = &nd_region->dev;
+
+ if (dev_WARN_ONCE(dev, !is_nvdimm_bus_locked(dev),
+ "bus lock required for capacity provision\n"))
+ return 0;
+ if (dev_WARN_ONCE(dev, !nd_region->ndr_mappings || nd_region->align
+ % nd_region->ndr_mappings,
+ "invalid region align %#lx mappings: %d\n",
+ nd_region->align, nd_region->ndr_mappings))
+ return 0;
+ return nd_region->align / nd_region->ndr_mappings;
+}
+
int alias_dpa_busy(struct device *dev, void *data)
{
resource_size_t map_end, blk_start, new;
@@ -571,6 +586,7 @@ int alias_dpa_busy(struct device *dev, void *data)
struct nd_region *nd_region;
struct nvdimm_drvdata *ndd;
struct resource *res;
+ unsigned long align;
int i;
if (!is_memory(dev))
@@ -608,13 +624,21 @@ int alias_dpa_busy(struct device *dev, void *data)
* Find the free dpa from the end of the last pmem allocation to
* the end of the interleave-set mapping.
*/
+ align = dpa_align(nd_region);
+ if (!align)
+ return 0;
+
for_each_dpa_resource(ndd, res) {
+ resource_size_t start, end;
+
if (strncmp(res->name, "pmem", 4) != 0)
continue;
- if ((res->start >= blk_start && res->start < map_end)
- || (res->end >= blk_start
- && res->end <= map_end)) {
- new = max(blk_start, min(map_end + 1, res->end + 1));
+
+ start = ALIGN_DOWN(res->start, align);
+ end = ALIGN(res->end + 1, align) - 1;
+ if ((start >= blk_start && start < map_end)
+ || (end >= blk_start && end <= map_end)) {
+ new = max(blk_start, min(map_end, end) + 1);
if (new != blk_start) {
blk_start = new;
goto retry;
@@ -654,6 +678,7 @@ resource_size_t nd_blk_available_dpa(struct nd_region *nd_region)
.res = NULL,
};
struct resource *res;
+ unsigned long align;
if (!ndd)
return 0;
@@ -661,10 +686,20 @@ resource_size_t nd_blk_available_dpa(struct nd_region *nd_region)
device_for_each_child(&nvdimm_bus->dev, &info, alias_dpa_busy);
/* now account for busy blk allocations in unaliased dpa */
+ align = dpa_align(nd_region);
+ if (!align)
+ return 0;
for_each_dpa_resource(ndd, res) {
+ resource_size_t start, end, size;
+
if (strncmp(res->name, "blk", 3) != 0)
continue;
- info.available -= resource_size(res);
+ start = ALIGN_DOWN(res->start, align);
+ end = ALIGN(res->end + 1, align) - 1;
+ size = end - start + 1;
+ if (size >= info.available)
+ return 0;
+ info.available -= size;
}
return info.available;
@@ -683,19 +718,31 @@ resource_size_t nd_pmem_max_contiguous_dpa(struct nd_region *nd_region,
struct nvdimm_bus *nvdimm_bus;
resource_size_t max = 0;
struct resource *res;
+ unsigned long align;
/* if a dimm is disabled the available capacity is zero */
if (!ndd)
return 0;
+ align = dpa_align(nd_region);
+ if (!align)
+ return 0;
+
nvdimm_bus = walk_to_nvdimm_bus(ndd->dev);
if (__reserve_free_pmem(&nd_region->dev, nd_mapping->nvdimm))
return 0;
for_each_dpa_resource(ndd, res) {
+ resource_size_t start, end;
+
if (strcmp(res->name, "pmem-reserve") != 0)
continue;
- if (resource_size(res) > max)
- max = resource_size(res);
+ /* trim free space relative to current alignment setting */
+ start = ALIGN(res->start, align);
+ end = ALIGN_DOWN(res->end + 1, align) - 1;
+ if (end < start)
+ continue;
+ if (end - start + 1 > max)
+ max = end - start + 1;
}
release_free_pmem(nvdimm_bus, nd_mapping);
return max;
@@ -723,24 +770,33 @@ resource_size_t nd_pmem_available_dpa(struct nd_region *nd_region,
struct nvdimm_drvdata *ndd = to_ndd(nd_mapping);
struct resource *res;
const char *reason;
+ unsigned long align;
if (!ndd)
return 0;
+ align = dpa_align(nd_region);
+ if (!align)
+ return 0;
+
map_start = nd_mapping->start;
map_end = map_start + nd_mapping->size - 1;
blk_start = max(map_start, map_end + 1 - *overlap);
for_each_dpa_resource(ndd, res) {
- if (res->start >= map_start && res->start < map_end) {
+ resource_size_t start, end;
+
+ start = ALIGN_DOWN(res->start, align);
+ end = ALIGN(res->end + 1, align) - 1;
+ if (start >= map_start && start < map_end) {
if (strncmp(res->name, "blk", 3) == 0)
blk_start = min(blk_start,
- max(map_start, res->start));
- else if (res->end > map_end) {
+ max(map_start, start));
+ else if (end > map_end) {
reason = "misaligned to iset";
goto err;
} else
- busy += resource_size(res);
- } else if (res->end >= map_start && res->end <= map_end) {
+ busy += end - start + 1;
+ } else if (end >= map_start && end <= map_end) {
if (strncmp(res->name, "blk", 3) == 0) {
/*
* If a BLK allocation overlaps the start of
@@ -749,8 +805,8 @@ resource_size_t nd_pmem_available_dpa(struct nd_region *nd_region,
*/
blk_start = map_start;
} else
- busy += resource_size(res);
- } else if (map_start > res->start && map_start < res->end) {
+ busy += end - start + 1;
+ } else if (map_start > start && map_start < end) {
/* total eclipse of the mapping */
busy += nd_mapping->size;
blk_start = map_start;
@@ -760,7 +816,7 @@ resource_size_t nd_pmem_available_dpa(struct nd_region *nd_region,
*overlap = map_end + 1 - blk_start;
available = blk_start - map_start;
if (busy < available)
- return available - busy;
+ return ALIGN_DOWN(available - busy, align);
return 0;
err:
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index 30cda9f235de..4720ad69e1c5 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -541,6 +541,11 @@ static void space_valid(struct nd_region *nd_region, struct nvdimm_drvdata *ndd,
{
bool is_reserve = strcmp(label_id->id, "pmem-reserve") == 0;
bool is_pmem = strncmp(label_id->id, "pmem", 4) == 0;
+ unsigned long align;
+
+ align = nd_region->align / nd_region->ndr_mappings;
+ valid->start = ALIGN(valid->start, align);
+ valid->end = ALIGN_DOWN(valid->end + 1, align) - 1;
if (valid->start >= valid->end)
goto invalid;
@@ -980,10 +985,10 @@ static ssize_t __size_store(struct device *dev, unsigned long long val)
return -ENXIO;
}
- div_u64_rem(val, PAGE_SIZE * nd_region->ndr_mappings, &remainder);
+ div_u64_rem(val, nd_region->align, &remainder);
if (remainder) {
dev_dbg(dev, "%llu is not %ldK aligned\n", val,
- (PAGE_SIZE * nd_region->ndr_mappings) / SZ_1K);
+ nd_region->align / SZ_1K);
return -EINVAL;
}
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index ca39abe29c7c..c4d69c1cce55 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -146,6 +146,7 @@ struct nd_region {
struct device *btt_seed;
struct device *pfn_seed;
struct device *dax_seed;
+ unsigned long align;
u16 ndr_mappings;
u64 ndr_size;
u64 ndr_start;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index a5fc6e4c56ff..bf239e783940 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -216,21 +216,25 @@ int nd_region_to_nstype(struct nd_region *nd_region)
}
EXPORT_SYMBOL(nd_region_to_nstype);
-static ssize_t size_show(struct device *dev,
- struct device_attribute *attr, char *buf)
+static unsigned long long region_size(struct nd_region *nd_region)
{
- struct nd_region *nd_region = to_nd_region(dev);
- unsigned long long size = 0;
-
- if (is_memory(dev)) {
- size = nd_region->ndr_size;
+ if (is_memory(&nd_region->dev)) {
+ return nd_region->ndr_size;
} else if (nd_region->ndr_mappings == 1) {
struct nd_mapping *nd_mapping = &nd_region->mapping[0];
- size = nd_mapping->size;
+ return nd_mapping->size;
}
- return sprintf(buf, "%llu\n", size);
+ return 0;
+}
+
+static ssize_t size_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct nd_region *nd_region = to_nd_region(dev);
+
+ return sprintf(buf, "%llu\n", region_size(nd_region));
}
static DEVICE_ATTR_RO(size);
@@ -529,6 +533,55 @@ static ssize_t read_only_store(struct device *dev,
}
static DEVICE_ATTR_RW(read_only);
+static ssize_t align_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct nd_region *nd_region = to_nd_region(dev);
+
+ return sprintf(buf, "%#lx\n", nd_region->align);
+}
+
+static ssize_t align_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t len)
+{
+ struct nd_region *nd_region = to_nd_region(dev);
+ unsigned long val, dpa;
+ u32 remainder;
+ int rc;
+
+ rc = kstrtoul(buf, 0, &val);
+ if (rc)
+ return rc;
+
+ if (!nd_region->ndr_mappings)
+ return -ENXIO;
+
+ /*
+ * Ensure space-align is evenly divisible by the region
+ * interleave-width because the kernel typically has no facility
+ * to determine which DIMM(s), dimm-physical-addresses, would
+ * contribute to the tail capacity in system-physical-address
+ * space for the namespace.
+ */
+ dpa = val;
+ remainder = do_div(dpa, nd_region->ndr_mappings);
+ if (!is_power_of_2(dpa) || dpa < PAGE_SIZE
+ || val > region_size(nd_region) || remainder)
+ return -EINVAL;
+
+ /*
+ * Given that space allocation consults this value multiple
+ * times ensure it does not change for the duration of the
+ * allocation.
+ */
+ nvdimm_bus_lock(dev);
+ nd_region->align = val;
+ nvdimm_bus_unlock(dev);
+
+ return len;
+}
+static DEVICE_ATTR_RW(align);
+
static ssize_t region_badblocks_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -571,6 +624,7 @@ static DEVICE_ATTR_RO(persistence_domain);
static struct attribute *nd_region_attributes[] = {
&dev_attr_size.attr,
+ &dev_attr_align.attr,
&dev_attr_nstype.attr,
&dev_attr_mappings.attr,
&dev_attr_btt_seed.attr,
@@ -626,6 +680,19 @@ static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
return a->mode;
}
+ if (a == &dev_attr_align.attr) {
+ int i;
+
+ for (i = 0; i < nd_region->ndr_mappings; i++) {
+ struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+ struct nvdimm *nvdimm = nd_mapping->nvdimm;
+
+ if (test_bit(NDD_LABELING, &nvdimm->flags))
+ return a->mode;
+ }
+ return 0;
+ }
+
if (a != &dev_attr_set_cookie.attr
&& a != &dev_attr_available_size.attr)
return a->mode;
@@ -935,6 +1002,42 @@ void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane)
}
EXPORT_SYMBOL(nd_region_release_lane);
+/*
+ * PowerPC requires this alignment for memremap_pages(). All other archs
+ * should be ok with SUBSECTION_SIZE (see memremap_compat_align()).
+ */
+#define MEMREMAP_COMPAT_ALIGN_MAX SZ_16M
+
+static unsigned long default_align(struct nd_region *nd_region)
+{
+ unsigned long align, per_mapping;
+ int i, mappings;
+ u32 remainder;
+
+ if (is_nd_blk(&nd_region->dev))
+ align = PAGE_SIZE;
+ else
+ align = MEMREMAP_COMPAT_ALIGN_MAX;
+
+ for (i = 0; i < nd_region->ndr_mappings; i++) {
+ struct nd_mapping *nd_mapping = &nd_region->mapping[i];
+ struct nvdimm *nvdimm = nd_mapping->nvdimm;
+
+ if (test_bit(NDD_ALIASING, &nvdimm->flags)) {
+ align = MEMREMAP_COMPAT_ALIGN_MAX;
+ break;
+ }
+ }
+
+ mappings = max_t(u16, 1, nd_region->ndr_mappings);
+ per_mapping = align;
+ remainder = do_div(per_mapping, mappings);
+ if (remainder)
+ align *= mappings;
+
+ return align;
+}
+
static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
struct nd_region_desc *ndr_desc,
const struct device_type *dev_type, const char *caller)
@@ -1039,6 +1142,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
dev->of_node = ndr_desc->of_node;
nd_region->ndr_size = resource_size(ndr_desc->res);
nd_region->ndr_start = ndr_desc->res->start;
+ nd_region->align = default_align(nd_region);
if (ndr_desc->flush)
nd_region->flush = ndr_desc->flush;
else
Dan Williams <[email protected]> writes:
> The "sub-section memory hotplug" facility allows memremap_pages() users
> like libnvdimm to compensate for hardware platforms like x86 that have a
> section size larger than their hardware memory mapping granularity. The
> compensation that sub-section support affords is being tolerant of
> physical memory resources shifting by units smaller (64MiB on x86) than
> the memory-hotplug section size (128 MiB). Where the platform
> physical-memory mapping granularity is limited by the number and
> capability of address-decode-registers in the memory controller.
>
> While the sub-section support allows memremap_pages() to operate on
> sub-section (2MiB) granularity, the Power architecture may still
> require 16MiB alignment on "!radix_enabled()" platforms.
>
> In order for libnvdimm to be able to detect and manage this per-arch
> limitation, introduce memremap_compat_align() as a common minimum
> alignment across all driver-facing memory-mapping interfaces, and let
> Power override it to 16MiB in the "!radix_enabled()" case.
>
> The assumption / requirement for 16MiB to be a viable
> memremap_compat_align() value is that Power does not have platforms
> where its equivalent of address-decode-registers never hardware remaps a
> persistent memory resource on smaller than 16MiB boundaries. Note that I
> tried my best to not add a new Kconfig symbol, but header include
> entanglements defeated the #ifndef memremap_compat_align design pattern
> and the need to export it defeats the __weak design pattern for arch
> overrides.
>
> Based on an initial patch by Aneesh.
I have just a couple of questions.
First, can you please add a comment above the generic implementation of
memremap_compat_align describing its purpose, and why a platform might
want to override it?
Second, I will take it at face value that the power architecture
requires a 16MB alignment, but it's not clear to me why mmu_linear_psize
was chosen to represent that. What's the relationship, there, and can
we please have a comment explaining it?
Thanks!
Jeff
>
> Link: http://lore.kernel.org/r/CAPcyv4gBGNP95APYaBcsocEa50tQj9b5h__83vgngjq3ouGX_Q@mail.gmail.com
> Reported-by: Aneesh Kumar K.V <[email protected]>
> Reported-by: Jeff Moyer <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> arch/powerpc/Kconfig | 1 +
> arch/powerpc/mm/ioremap.c | 12 ++++++++++++
> drivers/nvdimm/pfn_devs.c | 2 +-
> include/linux/memremap.h | 8 ++++++++
> include/linux/mmzone.h | 1 +
> lib/Kconfig | 3 +++
> mm/memremap.c | 13 +++++++++++++
> 7 files changed, 39 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 497b7d0b2d7e..e6ffe905e2b9 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -122,6 +122,7 @@ config PPC
> select ARCH_HAS_GCOV_PROFILE_ALL
> select ARCH_HAS_KCOV
> select ARCH_HAS_HUGEPD if HUGETLB_PAGE
> + select ARCH_HAS_MEMREMAP_COMPAT_ALIGN
> select ARCH_HAS_MMIOWB if PPC64
> select ARCH_HAS_PHYS_TO_DMA
> select ARCH_HAS_PMEM_API
> diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c
> index fc669643ce6a..38b5ba7d3e2d 100644
> --- a/arch/powerpc/mm/ioremap.c
> +++ b/arch/powerpc/mm/ioremap.c
> @@ -2,6 +2,7 @@
>
> #include <linux/io.h>
> #include <linux/slab.h>
> +#include <linux/mmzone.h>
> #include <linux/vmalloc.h>
> #include <asm/io-workarounds.h>
>
> @@ -97,3 +98,14 @@ void __iomem *do_ioremap(phys_addr_t pa, phys_addr_t offset, unsigned long size,
>
> return NULL;
> }
> +
> +#ifdef CONFIG_ZONE_DEVICE
> +/* override of the generic version in mm/memremap.c */
> +unsigned long memremap_compat_align(void)
> +{
> + if (radix_enabled())
> + return SUBSECTION_SIZE;
> + return (1UL << mmu_psize_defs[mmu_linear_psize].shift);
> +}
> +EXPORT_SYMBOL_GPL(memremap_compat_align);
> +#endif
> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> index b94f7a7e94b8..a5c25cb87116 100644
> --- a/drivers/nvdimm/pfn_devs.c
> +++ b/drivers/nvdimm/pfn_devs.c
> @@ -750,7 +750,7 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
> start = nsio->res.start;
> size = resource_size(&nsio->res);
> npfns = PHYS_PFN(size - SZ_8K);
> - align = max(nd_pfn->align, (1UL << SUBSECTION_SHIFT));
> + align = max(nd_pfn->align, SUBSECTION_SIZE);
> end_trunc = start + size - ALIGN_DOWN(start + size, align);
> if (nd_pfn->mode == PFN_MODE_PMEM) {
> /*
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 6fefb09af7c3..8af1cbd8f293 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -132,6 +132,7 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>
> unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
> void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
> +unsigned long memremap_compat_align(void);
> #else
> static inline void *devm_memremap_pages(struct device *dev,
> struct dev_pagemap *pgmap)
> @@ -165,6 +166,12 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap,
> unsigned long nr_pfns)
> {
> }
> +
> +/* when memremap_pages() is disabled all archs can remap a single page */
> +static inline unsigned long memremap_compat_align(void)
> +{
> + return PAGE_SIZE;
> +}
> #endif /* CONFIG_ZONE_DEVICE */
>
> static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
> @@ -172,4 +179,5 @@ static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
> if (pgmap)
> percpu_ref_put(pgmap->ref);
> }
> +
> #endif /* _LINUX_MEMREMAP_H_ */
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 462f6873905a..6b77f7239af5 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1170,6 +1170,7 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec)
> #define SECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SECTION_MASK)
>
> #define SUBSECTION_SHIFT 21
> +#define SUBSECTION_SIZE (1UL << SUBSECTION_SHIFT)
>
> #define PFN_SUBSECTION_SHIFT (SUBSECTION_SHIFT - PAGE_SHIFT)
> #define PAGES_PER_SUBSECTION (1UL << PFN_SUBSECTION_SHIFT)
> diff --git a/lib/Kconfig b/lib/Kconfig
> index 0cf875fd627c..17dbc7bd3895 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -618,6 +618,9 @@ config ARCH_HAS_PMEM_API
> config MEMREGION
> bool
>
> +config ARCH_HAS_MEMREMAP_COMPAT_ALIGN
> + bool
> +
> # use memcpy to implement user copies for nommu architectures
> config UACCESS_MEMCPY
> bool
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 09b5b7adc773..a6905d28fe91 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -7,6 +7,7 @@
> #include <linux/mm.h>
> #include <linux/pfn_t.h>
> #include <linux/swap.h>
> +#include <linux/mmzone.h>
> #include <linux/swapops.h>
> #include <linux/types.h>
> #include <linux/wait_bit.h>
> @@ -14,6 +15,18 @@
>
> static DEFINE_XARRAY(pgmap_array);
>
> +/*
> + * Minimum compatible alignment of the resource (start, end) across
> + * memremap interfaces (i.e. memremap + memremap_pages)
> + */
> +#ifndef CONFIG_ARCH_HAS_MEMREMAP_COMPAT_ALIGN
> +unsigned long memremap_compat_align(void)
> +{
> + return SUBSECTION_SIZE;
> +}
> +EXPORT_SYMBOL_GPL(memremap_compat_align);
> +#endif
> +
> #ifdef CONFIG_DEV_PAGEMAP_OPS
> DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
> EXPORT_SYMBOL(devmap_managed_key);
On Thu, Feb 13, 2020 at 8:58 AM Jeff Moyer <[email protected]> wrote:
>
> Dan Williams <[email protected]> writes:
>
> > The "sub-section memory hotplug" facility allows memremap_pages() users
> > like libnvdimm to compensate for hardware platforms like x86 that have a
> > section size larger than their hardware memory mapping granularity. The
> > compensation that sub-section support affords is being tolerant of
> > physical memory resources shifting by units smaller (64MiB on x86) than
> > the memory-hotplug section size (128 MiB). Where the platform
> > physical-memory mapping granularity is limited by the number and
> > capability of address-decode-registers in the memory controller.
> >
> > While the sub-section support allows memremap_pages() to operate on
> > sub-section (2MiB) granularity, the Power architecture may still
> > require 16MiB alignment on "!radix_enabled()" platforms.
> >
> > In order for libnvdimm to be able to detect and manage this per-arch
> > limitation, introduce memremap_compat_align() as a common minimum
> > alignment across all driver-facing memory-mapping interfaces, and let
> > Power override it to 16MiB in the "!radix_enabled()" case.
> >
> > The assumption / requirement for 16MiB to be a viable
> > memremap_compat_align() value is that Power does not have platforms
> > where its equivalent of address-decode-registers never hardware remaps a
> > persistent memory resource on smaller than 16MiB boundaries. Note that I
> > tried my best to not add a new Kconfig symbol, but header include
> > entanglements defeated the #ifndef memremap_compat_align design pattern
> > and the need to export it defeats the __weak design pattern for arch
> > overrides.
> >
> > Based on an initial patch by Aneesh.
>
> I have just a couple of questions.
>
> First, can you please add a comment above the generic implementation of
> memremap_compat_align describing its purpose, and why a platform might
> want to override it?
Sure, how about:
/*
* The memremap() and memremap_pages() interfaces are alternately used
* to map persistent memory namespaces. These interfaces place different
* constraints on the alignment and size of the mapping (namespace).
* memremap() can map individual PAGE_SIZE pages. memremap_pages() can
* only map subsections (2MB), and at least one architecture (PowerPC)
* the minimum mapping granularity of memremap_pages() is 16MB.
*
* The role of memremap_compat_align() is to communicate the minimum
* arch supported alignment of a namespace such that it can freely
* switch modes without violating the arch constraint. Namely, do not
* allow a namespace to be PAGE_SIZE aligned since that namespace may be
* reconfigured into a mode that requires SUBSECTION_SIZE alignment.
*/
> Second, I will take it at face value that the power architecture
> requires a 16MB alignment, but it's not clear to me why mmu_linear_psize
> was chosen to represent that. What's the relationship, there, and can
> we please have a comment explaining it?
Aneesh, can you help here?
Dan Williams <[email protected]> writes:
> @@ -312,8 +312,9 @@ static ssize_t flags_show(struct device *dev,
> {
> struct nvdimm *nvdimm = to_nvdimm(dev);
>
> - return sprintf(buf, "%s%s\n",
> + return sprintf(buf, "%s%s%s\n",
> test_bit(NDD_ALIASING, &nvdimm->flags) ? "alias " : "",
> + test_bit(NDD_LABELING, &nvdimm->flags) ? "label" : "",
^
Missing a space.
The rest looks sane.
Reviewed-by: Jeff Moyer <[email protected]>
Dan Williams <[email protected]> writes:
> The pmem driver on PowerPC crashes with the following signature when
> instantiating misaligned namespaces that map their capacity via
> memremap_pages().
>
> BUG: Unable to handle kernel data access at 0xc001000406000000
> Faulting instruction address: 0xc000000000090790
> NIP [c000000000090790] arch_add_memory+0xc0/0x130
> LR [c000000000090744] arch_add_memory+0x74/0x130
> Call Trace:
> arch_add_memory+0x74/0x130 (unreliable)
> memremap_pages+0x74c/0xa30
> devm_memremap_pages+0x3c/0xa0
> pmem_attach_disk+0x188/0x770
> nvdimm_bus_probe+0xd8/0x470
>
> With the assumption that only memremap_pages() has alignment
> constraints, enforce memremap_compat_align() for
> pmem_should_map_pages(), nd_pfn, or nd_dax cases.
>
> Reported-by: Aneesh Kumar K.V <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Reviewed-by: Aneesh Kumar K.V <[email protected]>
> Link: https://lore.kernel.org/r/158041477336.3889308.4581652885008605170.stgit@dwillia2-desk3.amr.corp.intel.com
> Signed-off-by: Dan Williams <[email protected]>
Reviewed-by: Jeff Moyer <[email protected]>
Dan Williams <[email protected]> writes:
> The pmem driver on PowerPC crashes with the following signature when
> instantiating misaligned namespaces that map their capacity via
> memremap_pages().
>
> BUG: Unable to handle kernel data access at 0xc001000406000000
> Faulting instruction address: 0xc000000000090790
> NIP [c000000000090790] arch_add_memory+0xc0/0x130
> LR [c000000000090744] arch_add_memory+0x74/0x130
> Call Trace:
> arch_add_memory+0x74/0x130 (unreliable)
> memremap_pages+0x74c/0xa30
> devm_memremap_pages+0x3c/0xa0
> pmem_attach_disk+0x188/0x770
> nvdimm_bus_probe+0xd8/0x470
>
> With the assumption that only memremap_pages() has alignment
> constraints, enforce memremap_compat_align() for
> pmem_should_map_pages(), nd_pfn, or nd_dax cases.
>
> Reported-by: Aneesh Kumar K.V <[email protected]>
> Cc: Jeff Moyer <[email protected]>
> Reviewed-by: Aneesh Kumar K.V <[email protected]>
> Link: https://lore.kernel.org/r/158041477336.3889308.4581652885008605170.stgit@dwillia2-desk3.amr.corp.intel.com
> Signed-off-by: Dan Williams <[email protected]>
> ---
> drivers/nvdimm/namespace_devs.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
> index 032dc61725ff..aff1f32fdb4f 100644
> --- a/drivers/nvdimm/namespace_devs.c
> +++ b/drivers/nvdimm/namespace_devs.c
> @@ -1739,6 +1739,16 @@ struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev)
> return ERR_PTR(-ENODEV);
> }
>
> + if (pmem_should_map_pages(dev) || nd_pfn || nd_dax) {
> + struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
> + resource_size_t start = nsio->res.start;
> +
> + if (!IS_ALIGNED(start | size, memremap_compat_align())) {
> + dev_dbg(&ndns->dev, "misaligned, unable to map\n");
> + return ERR_PTR(-EOPNOTSUPP);
> + }
> + }
> +
> if (is_namespace_pmem(&ndns->dev)) {
> struct nd_namespace_pmem *nspm;
>
Actually, I take back my ack. :) This prevents a previously working
namespace from being successfully probed/setup. I thought we were only
going to enforce the alignment for a newly created namespace? This should
only check whether the alignment works for the current platform.
-Jeff
On Thu, Feb 13, 2020 at 1:55 PM Jeff Moyer <[email protected]> wrote:
>
> Dan Williams <[email protected]> writes:
>
> > The pmem driver on PowerPC crashes with the following signature when
> > instantiating misaligned namespaces that map their capacity via
> > memremap_pages().
> >
> > BUG: Unable to handle kernel data access at 0xc001000406000000
> > Faulting instruction address: 0xc000000000090790
> > NIP [c000000000090790] arch_add_memory+0xc0/0x130
> > LR [c000000000090744] arch_add_memory+0x74/0x130
> > Call Trace:
> > arch_add_memory+0x74/0x130 (unreliable)
> > memremap_pages+0x74c/0xa30
> > devm_memremap_pages+0x3c/0xa0
> > pmem_attach_disk+0x188/0x770
> > nvdimm_bus_probe+0xd8/0x470
> >
> > With the assumption that only memremap_pages() has alignment
> > constraints, enforce memremap_compat_align() for
> > pmem_should_map_pages(), nd_pfn, or nd_dax cases.
> >
> > Reported-by: Aneesh Kumar K.V <[email protected]>
> > Cc: Jeff Moyer <[email protected]>
> > Reviewed-by: Aneesh Kumar K.V <[email protected]>
> > Link: https://lore.kernel.org/r/158041477336.3889308.4581652885008605170.stgit@dwillia2-desk3.amr.corp.intel.com
> > Signed-off-by: Dan Williams <[email protected]>
> > ---
> > drivers/nvdimm/namespace_devs.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
> > index 032dc61725ff..aff1f32fdb4f 100644
> > --- a/drivers/nvdimm/namespace_devs.c
> > +++ b/drivers/nvdimm/namespace_devs.c
> > @@ -1739,6 +1739,16 @@ struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev)
> > return ERR_PTR(-ENODEV);
> > }
> >
> > + if (pmem_should_map_pages(dev) || nd_pfn || nd_dax) {
> > + struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
> > + resource_size_t start = nsio->res.start;
> > +
> > + if (!IS_ALIGNED(start | size, memremap_compat_align())) {
> > + dev_dbg(&ndns->dev, "misaligned, unable to map\n");
> > + return ERR_PTR(-EOPNOTSUPP);
> > + }
> > + }
> > +
> > if (is_namespace_pmem(&ndns->dev)) {
> > struct nd_namespace_pmem *nspm;
> >
>
> Actually, I take back my ack. :) This prevents a previously working
> namespace from being successfully probed/setup.
Do you have a test case handy? I can see a potential gap with a
namespace that used internal padding to fix up the alignment. The goal
of this check is to catch cases that are just going to fail
devm_memremap_pages(), and the expectation is that it could not have
worked before unless it was ported from another platform, or someone
flipped the page-size switch on PowerPC.
> I thought we were only
> going to enforce the alignment for a newly created namespace? This should
> only check whether the alignment works for the current platform.
The model is a new default 16MB alignment is enforced at creation
time, but if you need to support previously created namespaces then
you can manually trim that alignment requirement to no less than
memremap_compat_align() because that's the point at which
devm_memremap_pages() will start failing or crashing.
Dan Williams <[email protected]> writes:
> On Thu, Feb 13, 2020 at 8:58 AM Jeff Moyer <[email protected]> wrote:
>>
>> Dan Williams <[email protected]> writes:
>>
>> > The "sub-section memory hotplug" facility allows memremap_pages() users
>> > like libnvdimm to compensate for hardware platforms like x86 that have a
>> > section size larger than their hardware memory mapping granularity. The
>> > compensation that sub-section support affords is being tolerant of
>> > physical memory resources shifting by units smaller (64MiB on x86) than
>> > the memory-hotplug section size (128 MiB). Where the platform
>> > physical-memory mapping granularity is limited by the number and
>> > capability of address-decode-registers in the memory controller.
>> >
>> > While the sub-section support allows memremap_pages() to operate on
>> > sub-section (2MiB) granularity, the Power architecture may still
>> > require 16MiB alignment on "!radix_enabled()" platforms.
>> >
>> > In order for libnvdimm to be able to detect and manage this per-arch
>> > limitation, introduce memremap_compat_align() as a common minimum
>> > alignment across all driver-facing memory-mapping interfaces, and let
>> > Power override it to 16MiB in the "!radix_enabled()" case.
>> >
>> > The assumption / requirement for 16MiB to be a viable
>> > memremap_compat_align() value is that Power does not have platforms
>> > where its equivalent of address-decode-registers never hardware remaps a
>> > persistent memory resource on smaller than 16MiB boundaries. Note that I
>> > tried my best to not add a new Kconfig symbol, but header include
>> > entanglements defeated the #ifndef memremap_compat_align design pattern
>> > and the need to export it defeats the __weak design pattern for arch
>> > overrides.
>> >
>> > Based on an initial patch by Aneesh.
>>
>> I have just a couple of questions.
>>
>> First, can you please add a comment above the generic implementation of
>> memremap_compat_align describing its purpose, and why a platform might
>> want to override it?
>
> Sure, how about:
>
> /*
> * The memremap() and memremap_pages() interfaces are alternately used
> * to map persistent memory namespaces. These interfaces place different
> * constraints on the alignment and size of the mapping (namespace).
> * memremap() can map individual PAGE_SIZE pages. memremap_pages() can
> * only map subsections (2MB), and at least one architecture (PowerPC)
> * the minimum mapping granularity of memremap_pages() is 16MB.
> *
> * The role of memremap_compat_align() is to communicate the minimum
> * arch supported alignment of a namespace such that it can freely
> * switch modes without violating the arch constraint. Namely, do not
> * allow a namespace to be PAGE_SIZE aligned since that namespace may be
> * reconfigured into a mode that requires SUBSECTION_SIZE alignment.
> */
>
>> Second, I will take it at face value that the power architecture
>> requires a 16MB alignment, but it's not clear to me why mmu_linear_psize
>> was chosen to represent that. What's the relationship, there, and can
>> we please have a comment explaining it?
>
> Aneesh, can you help here?
With hash translation, we map the direct-map range with just one page
size. Based on different restrictions as described in htab_init_page_sizes
we can end up choosing 16M, 64K or even 4K. We use the variable
mmu_linear_psize to indicate which page size we used for direct-map
range.
ie we should do.
+unsigned long arch_namespace_align_size(void)
+{
+ unsigned long sub_section_size = (1UL << SUBSECTION_SHIFT);
+
+ if (radix_enabled())
+ return sub_section_size;
+ return max(sub_section_size, (1UL << mmu_psize_defs[mmu_linear_psize].shift));
+
+}
+EXPORT_SYMBOL_GPL(arch_namespace_align_size);
as done here
https://lore.kernel.org/linux-nvdimm/[email protected]/
Dan can you update the powerpc definition?
-aneesh
Dan Williams <[email protected]> writes:
> On Thu, Feb 13, 2020 at 1:55 PM Jeff Moyer <[email protected]> wrote:
>>
>> Dan Williams <[email protected]> writes:
>>
>> > The pmem driver on PowerPC crashes with the following signature when
>> > instantiating misaligned namespaces that map their capacity via
>> > memremap_pages().
>> >
>> > BUG: Unable to handle kernel data access at 0xc001000406000000
>> > Faulting instruction address: 0xc000000000090790
>> > NIP [c000000000090790] arch_add_memory+0xc0/0x130
>> > LR [c000000000090744] arch_add_memory+0x74/0x130
>> > Call Trace:
>> > arch_add_memory+0x74/0x130 (unreliable)
>> > memremap_pages+0x74c/0xa30
>> > devm_memremap_pages+0x3c/0xa0
>> > pmem_attach_disk+0x188/0x770
>> > nvdimm_bus_probe+0xd8/0x470
>> >
>> > With the assumption that only memremap_pages() has alignment
>> > constraints, enforce memremap_compat_align() for
>> > pmem_should_map_pages(), nd_pfn, or nd_dax cases.
>> >
>> > Reported-by: Aneesh Kumar K.V <[email protected]>
>> > Cc: Jeff Moyer <[email protected]>
>> > Reviewed-by: Aneesh Kumar K.V <[email protected]>
>> > Link: https://lore.kernel.org/r/158041477336.3889308.4581652885008605170.stgit@dwillia2-desk3.amr.corp.intel.com
>> > Signed-off-by: Dan Williams <[email protected]>
>> > ---
>> > drivers/nvdimm/namespace_devs.c | 10 ++++++++++
>> > 1 file changed, 10 insertions(+)
>> >
>> > diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
>> > index 032dc61725ff..aff1f32fdb4f 100644
>> > --- a/drivers/nvdimm/namespace_devs.c
>> > +++ b/drivers/nvdimm/namespace_devs.c
>> > @@ -1739,6 +1739,16 @@ struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev)
>> > return ERR_PTR(-ENODEV);
>> > }
>> >
>> > + if (pmem_should_map_pages(dev) || nd_pfn || nd_dax) {
>> > + struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
>> > + resource_size_t start = nsio->res.start;
>> > +
>> > + if (!IS_ALIGNED(start | size, memremap_compat_align())) {
>> > + dev_dbg(&ndns->dev, "misaligned, unable to map\n");
>> > + return ERR_PTR(-EOPNOTSUPP);
>> > + }
>> > + }
>> > +
>> > if (is_namespace_pmem(&ndns->dev)) {
>> > struct nd_namespace_pmem *nspm;
>> >
>>
>> Actually, I take back my ack. :) This prevents a previously working
>> namespace from being successfully probed/setup.
>
> Do you have a test case handy? I can see a potential gap with a
> namespace that used internal padding to fix up the alignment.
# ndctl list -v -n namespace0.0
[
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":52846133248,
"uuid":"b99f6f6a-2909-4189-9bfa-6eeebd95d40e",
"raw_uuid":"aff43777-015b-493f-bbf9-7c7b0fe33519",
"sector_size":512,
"align":4096,
"blockdev":"pmem0",
"numa_node":0
}
]
# cat /sys/bus/nd/devices/region0/mappings
6
# grep namespace0.0 /proc/iomem
1860000000-24e0003fff : namespace0.0
> The goal of this check is to catch cases that are just going to fail
> devm_memremap_pages(), and the expectation is that it could not have
> worked before unless it was ported from another platform, or someone
> flipped the page-size switch on PowerPC.
On x86, creation and probing of the namespace worked fine before this
patch. What *doesn't* work is creating another fsdax namespace after
this one. sector mode namespaces can still be created, though:
[
{
"dev":"namespace0.1",
"mode":"sector",
"size":53270768640,
"uuid":"67ea2c74-d4b1-4fc9-9c1a-a7d2a6c2a4a7",
"sector_size":512,
"blockdev":"pmem0.1s"
},
# grep namespace0.1 /proc/iomem
24e0004000-3160007fff : namespace0.1
>> I thought we were only going to enforce the alignment for a newly
>> created namespace? This should only check whether the alignment
>> works for the current platform.
>
> The model is a new default 16MB alignment is enforced at creation
> time, but if you need to support previously created namespaces then
> you can manually trim that alignment requirement to no less than
> memremap_compat_align() because that's the point at which
> devm_memremap_pages() will start failing or crashing.
The problem is that older kernels did not enforce alignment to
SUBSECTION_SIZE. We shouldn't prevent those namespaces from being
accessed. The probe itself will not cause the WARN_ON to trigger.
Creating new namespaces at misaligned addresses could, but you've
altered the free space allocation such that we won't hit that anymore.
If I drop this patch, the probe will still work, and allocating new
namespaces will also work:
# ndctl list
[
{
"dev":"namespace0.1",
"mode":"sector",
"size":53270768640,
"uuid":"67ea2c74-d4b1-4fc9-9c1a-a7d2a6c2a4a7",
"sector_size":512,
"blockdev":"pmem0.1s"
},
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":52846133248,
"uuid":"b99f6f6a-2909-4189-9bfa-6eeebd95d40e",
"sector_size":512,
"align":4096,
"blockdev":"pmem0"
}
]
ndctl create-namespace -m fsdax -s 36g -r 0
{
"dev":"namespace0.2",
"mode":"fsdax",
"map":"dev",
"size":"35.44 GiB (38.05 GB)",
"uuid":"7893264c-c7ef-4cbe-95e1-ccf2aff041fb",
"sector_size":512,
"align":2097152,
"blockdev":"pmem0.2"
}
proc/iomem:
1860000000-d55fffffff : Persistent Memory
1860000000-24e0003fff : namespace0.0
24e0004000-3160007fff : namespace0.1
3162000000-3a61ffffff : namespace0.2
So, maybe the right thing is to make memremap_compat_align return
PAGE_SIZE for x86 instead of SUBSECTION_SIZE?
-Jeff
On 2/14/20 10:14 PM, Jeff Moyer wrote:
> Dan Williams <[email protected]> writes:
>
>> On Thu, Feb 13, 2020 at 1:55 PM Jeff Moyer <[email protected]> wrote:
>>>
>>> Dan Williams <[email protected]> writes:
>>>
>>>> The pmem driver on PowerPC crashes with the following signature when
>>>> instantiating misaligned namespaces that map their capacity via
>>>> memremap_pages().
>>>>
>>>> BUG: Unable to handle kernel data access at 0xc001000406000000
>>>> Faulting instruction address: 0xc000000000090790
>>>> NIP [c000000000090790] arch_add_memory+0xc0/0x130
>>>> LR [c000000000090744] arch_add_memory+0x74/0x130
>>>> Call Trace:
>>>> arch_add_memory+0x74/0x130 (unreliable)
>>>> memremap_pages+0x74c/0xa30
>>>> devm_memremap_pages+0x3c/0xa0
>>>> pmem_attach_disk+0x188/0x770
>>>> nvdimm_bus_probe+0xd8/0x470
>>>>
>>>> With the assumption that only memremap_pages() has alignment
>>>> constraints, enforce memremap_compat_align() for
>>>> pmem_should_map_pages(), nd_pfn, or nd_dax cases.
>>>>
>>>> Reported-by: Aneesh Kumar K.V <[email protected]>
>>>> Cc: Jeff Moyer <[email protected]>
>>>> Reviewed-by: Aneesh Kumar K.V <[email protected]>
>>>> Link: https://lore.kernel.org/r/158041477336.3889308.4581652885008605170.stgit@dwillia2-desk3.amr.corp.intel.com
>>>> Signed-off-by: Dan Williams <[email protected]>
>>>> ---
>>>> drivers/nvdimm/namespace_devs.c | 10 ++++++++++
>>>> 1 file changed, 10 insertions(+)
>>>>
>>>> diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
>>>> index 032dc61725ff..aff1f32fdb4f 100644
>>>> --- a/drivers/nvdimm/namespace_devs.c
>>>> +++ b/drivers/nvdimm/namespace_devs.c
>>>> @@ -1739,6 +1739,16 @@ struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev)
>>>> return ERR_PTR(-ENODEV);
>>>> }
>>>>
>>>> + if (pmem_should_map_pages(dev) || nd_pfn || nd_dax) {
>>>> + struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
>>>> + resource_size_t start = nsio->res.start;
>>>> +
>>>> + if (!IS_ALIGNED(start | size, memremap_compat_align())) {
>>>> + dev_dbg(&ndns->dev, "misaligned, unable to map\n");
>>>> + return ERR_PTR(-EOPNOTSUPP);
>>>> + }
>>>> + }
>>>> +
>>>> if (is_namespace_pmem(&ndns->dev)) {
>>>> struct nd_namespace_pmem *nspm;
>>>>
>>>
>>> Actually, I take back my ack. :) This prevents a previously working
>>> namespace from being successfully probed/setup.
>>
>> Do you have a test case handy? I can see a potential gap with a
>> namespace that used internal padding to fix up the alignment.
>
> # ndctl list -v -n namespace0.0
> [
> {
> "dev":"namespace0.0",
> "mode":"fsdax",
> "map":"dev",
> "size":52846133248,
> "uuid":"b99f6f6a-2909-4189-9bfa-6eeebd95d40e",
> "raw_uuid":"aff43777-015b-493f-bbf9-7c7b0fe33519",
> "sector_size":512,
> "align":4096,
> "blockdev":"pmem0",
> "numa_node":0
> }
> ]
>
> # cat /sys/bus/nd/devices/region0/mappings
> 6
>
> # grep namespace0.0 /proc/iomem
> 1860000000-24e0003fff : namespace0.0
>
>> The goal of this check is to catch cases that are just going to fail
>> devm_memremap_pages(), and the expectation is that it could not have
>> worked before unless it was ported from another platform, or someone
>> flipped the page-size switch on PowerPC.
>
> On x86, creation and probing of the namespace worked fine before this
> patch. What *doesn't* work is creating another fsdax namespace after
> this one. sector mode namespaces can still be created, though:
>
> [
> {
> "dev":"namespace0.1",
> "mode":"sector",
> "size":53270768640,
> "uuid":"67ea2c74-d4b1-4fc9-9c1a-a7d2a6c2a4a7",
> "sector_size":512,
> "blockdev":"pmem0.1s"
> },
>
> # grep namespace0.1 /proc/iomem
> 24e0004000-3160007fff : namespace0.1
>
>>> I thought we were only going to enforce the alignment for a newly
>>> created namespace? This should only check whether the alignment
>>> works for the current platform.
>>
>> The model is a new default 16MB alignment is enforced at creation
>> time, but if you need to support previously created namespaces then
>> you can manually trim that alignment requirement to no less than
>> memremap_compat_align() because that's the point at which
>> devm_memremap_pages() will start failing or crashing.
>
> The problem is that older kernels did not enforce alignment to
> SUBSECTION_SIZE. We shouldn't prevent those namespaces from being
> accessed. The probe itself will not cause the WARN_ON to trigger.
> Creating new namespaces at misaligned addresses could, but you've
> altered the free space allocation such that we won't hit that anymore.
>
> If I drop this patch, the probe will still work, and allocating new
> namespaces will also work:
>
> # ndctl list
> [
> {
> "dev":"namespace0.1",
> "mode":"sector",
> "size":53270768640,
> "uuid":"67ea2c74-d4b1-4fc9-9c1a-a7d2a6c2a4a7",
> "sector_size":512,
> "blockdev":"pmem0.1s"
> },
> {
> "dev":"namespace0.0",
> "mode":"fsdax",
> "map":"dev",
> "size":52846133248,
> "uuid":"b99f6f6a-2909-4189-9bfa-6eeebd95d40e",
> "sector_size":512,
> "align":4096,
> "blockdev":"pmem0"
> }
> ]
> ndctl create-namespace -m fsdax -s 36g -r 0
> {
> "dev":"namespace0.2",
> "mode":"fsdax",
> "map":"dev",
> "size":"35.44 GiB (38.05 GB)",
> "uuid":"7893264c-c7ef-4cbe-95e1-ccf2aff041fb",
> "sector_size":512,
> "align":2097152,
> "blockdev":"pmem0.2"
> }
>
> proc/iomem:
>
> 1860000000-d55fffffff : Persistent Memory
> 1860000000-24e0003fff : namespace0.0
> 24e0004000-3160007fff : namespace0.1
> 3162000000-3a61ffffff : namespace0.2
>
> So, maybe the right thing is to make memremap_compat_align return
> PAGE_SIZE for x86 instead of SUBSECTION_SIZE?
>
I did that as part of
https://lore.kernel.org/linux-nvdimm/[email protected]
and applied the subsection details only when creating new namespace
https://lore.kernel.org/linux-nvdimm/[email protected]
But I do agree with the approach that in-order to create a compatible
namespace we need enforce max possible align value across all supported
architectures.
On POWER we should still be able to enforce SUBSECTION_SIZE
restrictions. We did put that as document w.r.t. distributions like Suse
https://www.suse.com/support/kb/doc/?id=7024300
-aneesh
Dan Williams <[email protected]> writes:
> The align attribute applies an alignment constraint for namespace
> creation in a region. Whereas the 'align' attribute of a namespace
> applied alignment padding via an info block, the 'align' attribute
> applies alignment constraints to the free space allocation.
>
> The default for 'align' is the maximum known memremap_compat_align()
> across all archs (16MiB from PowerPC at time of writing) multiplied by
> the number of interleave ways if there is blk-aliasing. The minimum is
> PAGE_SIZE and allows for the creation of cross-arch incompatible
> namespaces, just as previous kernels allowed, but the expectation is
> cross-arch and mode-independent compatibility by default.
>
> The regression risk with this change is limited to cases that were
> dependent on the ability to create unaligned namespaces, *and* for some
> reason are unable to opt-out of aligned namespaces by writing to
> 'regionX/align'. If such a scenario arises the default can be flipped
> from opt-out to opt-in of compat-aligned namespace creation, but that is
> a last resort. The kernel will otherwise continue to support existing
> defined misaligned namespaces.
>
> Unfortunately this change needs to touch several parts of the
> implementation at once:
>
> - region/available_size: expand busy extents to current align
> - region/max_available_extent: expand busy extents to current align
> - namespace/size: trim free space to current align
>
> ...to keep the free space accounting conforming to the dynamic align
> setting.
>
> Reported-by: Aneesh Kumar K.V <[email protected]>
> Reported-by: Jeff Moyer <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> Reviewed-by: Aneesh Kumar K.V <[email protected]>
> Link: https://lore.kernel.org/r/158041478371.3889308.14542630147672668068.stgit@dwillia2-desk3.amr.corp.intel.com
> Signed-off-by: Dan Williams <[email protected]>
This looks good to me.
Reviewed-by: Jeff Moyer <[email protected]>
Dan Williams <[email protected]> writes:
> On Thu, Feb 13, 2020 at 8:58 AM Jeff Moyer <[email protected]> wrote:
>> I have just a couple of questions.
>>
>> First, can you please add a comment above the generic implementation of
>> memremap_compat_align describing its purpose, and why a platform might
>> want to override it?
>
> Sure, how about:
>
> /*
> * The memremap() and memremap_pages() interfaces are alternately used
> * to map persistent memory namespaces. These interfaces place different
> * constraints on the alignment and size of the mapping (namespace).
> * memremap() can map individual PAGE_SIZE pages. memremap_pages() can
> * only map subsections (2MB), and at least one architecture (PowerPC)
> * the minimum mapping granularity of memremap_pages() is 16MB.
> *
> * The role of memremap_compat_align() is to communicate the minimum
> * arch supported alignment of a namespace such that it can freely
> * switch modes without violating the arch constraint. Namely, do not
> * allow a namespace to be PAGE_SIZE aligned since that namespace may be
> * reconfigured into a mode that requires SUBSECTION_SIZE alignment.
> */
Well, if we modify the x86 variant to be PAGE_SIZE, I think that text
won't work. How about:
/*
* memremap_compat_align should return the minimum alignment for
* mapping memory via memremap() and memremap_pages(). For x86, this
* is the system PAGE_SIZE. Other architectures may impose different
* restrictions, as is seen on powerpc where the minimum alignment is
* tied to the linear mapping page size.
*
* When creating persistent memory namespaces, the alignment is forced
* to the least common denominator (MEMREMAP_COMPAT_ALIGN_MAX,
* currently 16MB). However, older kernels did not enforce this
* behavior, so we allow mapping namespaces with smaller alignments,
* so long as the platform supports it. See nvdimm_namespace_common_probe.
*/
-Jeff
Dan Williams <[email protected]> writes:
> ---
>
> Explicit review requests, but any other feedback is of course
> appreciated:
>
> Patch1 needs an ack from ppc arch maintainers, and I'd like a tested-by
> from Aneesh that this still works to solve the ppc issue. Jeff, does
> this look good to you?
OK, I've reviewed everything. Testing looks good with the change I
mentioned (memremap_compat_align returning PAGE_SIZE). I made sure a
4k-aligned namespace created under an unpatched kernel would be
accessible under a patched kernel. I also made sure that manually
setting align would allow for creating of poorly aligned namespaces, and
that those namespaces were then accessible on the unpatched kernel.
Thanks, Dan!
-Jeff
On Fri, Feb 14, 2020 at 12:59 PM Jeff Moyer <[email protected]> wrote:
>
> Dan Williams <[email protected]> writes:
>
> > On Thu, Feb 13, 2020 at 8:58 AM Jeff Moyer <[email protected]> wrote:
>
> >> I have just a couple of questions.
> >>
> >> First, can you please add a comment above the generic implementation of
> >> memremap_compat_align describing its purpose, and why a platform might
> >> want to override it?
> >
> > Sure, how about:
> >
> > /*
> > * The memremap() and memremap_pages() interfaces are alternately used
> > * to map persistent memory namespaces. These interfaces place different
> > * constraints on the alignment and size of the mapping (namespace).
> > * memremap() can map individual PAGE_SIZE pages. memremap_pages() can
> > * only map subsections (2MB), and at least one architecture (PowerPC)
> > * the minimum mapping granularity of memremap_pages() is 16MB.
> > *
> > * The role of memremap_compat_align() is to communicate the minimum
> > * arch supported alignment of a namespace such that it can freely
> > * switch modes without violating the arch constraint. Namely, do not
> > * allow a namespace to be PAGE_SIZE aligned since that namespace may be
> > * reconfigured into a mode that requires SUBSECTION_SIZE alignment.
> > */
>
> Well, if we modify the x86 variant to be PAGE_SIZE, I think that text
> won't work. How about:
...but I'm not looking to change it to PAGE_SIZE, I'm going to fix the
alignment check to skip if the namespace has "inner" alignment
padding, i.e. "start_pad" and/or "end_trunc" are non-zero.