2020-07-29 03:35:52

by Justin He

Subject: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

When enabling dax pmem as a RAM device on arm64, I noticed that the
kmem_start address in dev_dax_kmem_probe() must be aligned with
SECTION_SIZE_BITS (30), i.e. a 1G memory block size. Even though Dan
Williams' sub-section patch series [1] has been merged upstream, it does
not help here because of the hard alignment limitation on kmem_start:
$ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f -a 2M
$echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
$echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
$cat /proc/iomem
...
23c000000-23fffffff : System RAM
23dd40000-23fecffff : reserved
23fed0000-23fffffff : reserved
240000000-33fdfffff : Persistent Memory
240000000-2403fffff : namespace0.0
280000000-2bfffffff : dax0.0 <- aligned with 1G boundary
280000000-2bfffffff : System RAM
Hence there is a big gap between 0x2403fffff and 0x280000000 due to the 1G
alignment.

Without this series, if qemu creates a 4G nvdimm device, we can only use
2G of it for dax pmem (kmem) in the worst case, e.g.:
240000000-33fdfffff : Persistent Memory
We can only use the memory block between [240000000, 2ffffffff] due to the
hard alignment limitation, which wastes too much memory space.

Decreasing SECTION_SIZE_BITS on arm64 might be an alternative, but there
are too many constraints to reconcile, e.g. PAGE_SIZE, hugetlb,
SPARSEMEM_VMEMMAP, the section bits available in page->flags, ...

Besides decreasing SECTION_SIZE_BITS, we can instead relax the kmem
alignment relative to memory_block_size_bytes().

Tested on arm64 and x86 guests with a qemu-created 4G pmem device: dax
pmem can be used as RAM with a smaller gap, and kmem hotplug add/remove
both work on arm64/x86 guests.

This patch series (mainly patch 6/6) is based on the fix patch for ~v5.8-rc5 [2].

[1] https://lkml.org/lkml/2019/6/19/67
[2] https://lkml.org/lkml/2020/7/8/1546
Jia He (6):
mm/memory_hotplug: remove redundant memory block size alignment check
resource: export find_next_iomem_res() helper
mm/memory_hotplug: allow pmem kmem not to align with memory_block_size
mm/page_alloc: adjust the start,end in dax pmem kmem case
device-dax: relax the memblock size alignment for kmem_start
arm64: fall back to vmemmap_populate_basepages if not aligned with
PMD_SIZE

arch/arm64/mm/mmu.c | 4 ++++
drivers/base/memory.c | 24 ++++++++++++++++--------
drivers/dax/kmem.c | 22 +++++++++++++---------
include/linux/ioport.h | 3 +++
kernel/resource.c | 3 ++-
mm/memory_hotplug.c | 39 ++++++++++++++++++++++++++++++++++++++-
mm/page_alloc.c | 14 ++++++++++++++
7 files changed, 90 insertions(+), 19 deletions(-)

--
2.17.1


2020-07-29 03:36:08

by Justin He

Subject: [RFC PATCH 1/6] mm/memory_hotplug: remove redundant memory block size alignment check

The alignment check has already been done by check_hotplug_memory_range(),
hence the redundant one in create_memory_block_devices() can be removed.

A similar redundant check is removed from remove_memory_block_devices().

Signed-off-by: Jia He <[email protected]>
---
drivers/base/memory.c | 8 --------
1 file changed, 8 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 2b09b68b9f78..4a1691664c6c 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -642,10 +642,6 @@ int create_memory_block_devices(unsigned long start, unsigned long size)
unsigned long block_id;
int ret = 0;

- if (WARN_ON_ONCE(!IS_ALIGNED(start, memory_block_size_bytes()) ||
- !IS_ALIGNED(size, memory_block_size_bytes())))
- return -EINVAL;
-
for (block_id = start_block_id; block_id != end_block_id; block_id++) {
ret = init_memory_block(&mem, block_id, MEM_OFFLINE);
if (ret)
@@ -678,10 +674,6 @@ void remove_memory_block_devices(unsigned long start, unsigned long size)
struct memory_block *mem;
unsigned long block_id;

- if (WARN_ON_ONCE(!IS_ALIGNED(start, memory_block_size_bytes()) ||
- !IS_ALIGNED(size, memory_block_size_bytes())))
- return;
-
for (block_id = start_block_id; block_id != end_block_id; block_id++) {
mem = find_memory_block_by_id(block_id);
if (WARN_ON_ONCE(!mem))
--
2.17.1

2020-07-29 03:36:24

by Justin He

Subject: [RFC PATCH 2/6] resource: export find_next_iomem_res() helper

This helper finds the lowest iomem resource that covers any part of
[@start..@end].

It is useful when relaxing the alignment check for dax pmem kmem.

Signed-off-by: Jia He <[email protected]>
---
include/linux/ioport.h | 3 +++
kernel/resource.c | 3 ++-
2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 6c2b06fe8beb..203fd16c9f45 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -247,6 +247,9 @@ extern struct resource * __request_region(struct resource *,

extern void __release_region(struct resource *, resource_size_t,
resource_size_t);
+extern int find_next_iomem_res(resource_size_t start, resource_size_t end,
+ unsigned long flags, unsigned long desc,
+ bool first_lvl, struct resource *res);
#ifdef CONFIG_MEMORY_HOTREMOVE
extern int release_mem_region_adjustable(struct resource *, resource_size_t,
resource_size_t);
diff --git a/kernel/resource.c b/kernel/resource.c
index 841737bbda9e..57e6a6802a3d 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -338,7 +338,7 @@ EXPORT_SYMBOL(release_resource);
* @first_lvl: walk only the first level children, if set
* @res: return ptr, if resource found
*/
-static int find_next_iomem_res(resource_size_t start, resource_size_t end,
+int find_next_iomem_res(resource_size_t start, resource_size_t end,
unsigned long flags, unsigned long desc,
bool first_lvl, struct resource *res)
{
@@ -391,6 +391,7 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
read_unlock(&resource_lock);
return p ? 0 : -ENODEV;
}
+EXPORT_SYMBOL(find_next_iomem_res);

static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
unsigned long flags, unsigned long desc,
--
2.17.1

2020-07-29 03:37:02

by Justin He

Subject: [RFC PATCH 3/6] mm/memory_hotplug: allow pmem kmem not to align with memory_block_size

When dax pmem is probed as a RAM device on arm64, kmem_start in
dev_dax_kmem_probe() previously had to be aligned with the 1G memory block
size implied by SECTION_SIZE_BITS (30).

There can be some metadata at the beginning/end of the iomem space, e.g.
the namespace info and nvdimm label:
240000000-33fdfffff : Persistent Memory
240000000-2403fffff : namespace0.0
280000000-2bfffffff : dax0.0
280000000-2bfffffff : System RAM

This leaves the whole kmem space unaligned with memory_block_size at both
the start and the end address, so a big gap appears when kmem is added as
a memory block, wasting a lot of memory.

Relax the alignment check for dax pmem kmem in the online/offline memory
block paths.

Signed-off-by: Jia He <[email protected]>
---
drivers/base/memory.c | 16 ++++++++++++++++
mm/memory_hotplug.c | 39 ++++++++++++++++++++++++++++++++++++++-
2 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 4a1691664c6c..3d2a94f3b1d9 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -334,6 +334,22 @@ static ssize_t valid_zones_show(struct device *dev,
* online nodes otherwise the page_zone is not reliable
*/
if (mem->state == MEM_ONLINE) {
+#ifdef CONFIG_ZONE_DEVICE
+ struct resource res;
+ int ret;
+
+ /* adjust start_pfn for dax pmem kmem */
+ ret = find_next_iomem_res(start_pfn << PAGE_SHIFT,
+ ((start_pfn + nr_pages) << PAGE_SHIFT) - 1,
+ IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+ IORES_DESC_PERSISTENT_MEMORY,
+ false, &res);
+ if (!ret && PFN_UP(res.start) > start_pfn) {
+ nr_pages -= PFN_UP(res.start) - start_pfn;
+ start_pfn = PFN_UP(res.start);
+ }
+#endif
+
/*
* The block contains more than one zone can not be offlined.
* This can happen e.g. for ZONE_DMA and ZONE_DMA32
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a53103dc292b..25745f67b680 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -999,6 +999,20 @@ int try_online_node(int nid)

static int check_hotplug_memory_range(u64 start, u64 size)
{
+#ifdef CONFIG_ZONE_DEVICE
+ struct resource res;
+ int ret;
+
+ /* Allow pmem kmem not to align with block size */
+ ret = find_next_iomem_res(start, start + size - 1,
+ IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+ IORES_DESC_PERSISTENT_MEMORY,
+ false, &res);
+ if (!ret) {
+ return 0;
+ }
+#endif
+
/* memory range must be block size aligned */
if (!size || !IS_ALIGNED(start, memory_block_size_bytes()) ||
!IS_ALIGNED(size, memory_block_size_bytes())) {
@@ -1481,19 +1495,42 @@ static int __ref __offline_pages(unsigned long start_pfn,
mem_hotplug_begin();

/*
- * Don't allow to offline memory blocks that contain holes.
+ * Don't allow to offline memory blocks that contain holes except
+ * for pmem.
* Consequently, memory blocks with holes can never get onlined
* via the hotplug path - online_pages() - as hotplugged memory has
* no holes. This way, we e.g., don't have to worry about marking
* memory holes PG_reserved, don't need pfn_valid() checks, and can
* avoid using walk_system_ram_range() later.
+ * When dax pmem is used as RAM (kmem), holes at the beginning is
+ * allowed.
*/
walk_system_ram_range(start_pfn, end_pfn - start_pfn, &nr_pages,
count_system_ram_pages_cb);
if (nr_pages != end_pfn - start_pfn) {
+#ifdef CONFIG_ZONE_DEVICE
+ struct resource res;
+
+ /* Allow pmem kmem not to align with block size */
+ ret = find_next_iomem_res(start_pfn << PAGE_SHIFT,
+ (end_pfn << PAGE_SHIFT) - 1,
+ IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+ IORES_DESC_PERSISTENT_MEMORY,
+ false, &res);
+ if (ret) {
+ ret = -EINVAL;
+ reason = "memory holes";
+ goto failed_removal;
+ }
+
+ /* adjust start_pfn for dax pmem kmem */
+ start_pfn = PFN_UP(res.start);
+ end_pfn = PFN_DOWN(res.end + 1);
+#else
ret = -EINVAL;
reason = "memory holes";
goto failed_removal;
+#endif
}

/* This makes hotplug much easier...and readable.
--
2.17.1

2020-07-29 03:37:45

by Justin He

Subject: [RFC PATCH 5/6] device-dax: relax the memblock size alignment for kmem_start

Previously, kmem_start in dev_dax_kmem_probe() had to be aligned with
SECTION_SIZE_BITS (30), i.e. the 1G memory block size on arm64. Even with
Dan Williams' sub-section patch series, this did not help when adding dax
pmem kmem to a memory block:
$ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f -a 2M
$echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
$echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
$cat /proc/iomem
...
23c000000-23fffffff : System RAM
23dd40000-23fecffff : reserved
23fed0000-23fffffff : reserved
240000000-33fdfffff : Persistent Memory
240000000-2403fffff : namespace0.0
280000000-2bfffffff : dax0.0 <- boundary are aligned with 1G
280000000-2bfffffff : System RAM (kmem)
$ lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000040000000-0x000000023fffffff 8G online yes 1-8
0x0000000280000000-0x00000002bfffffff 1G online yes 10

Memory block size: 1G
Total online memory: 9G
Total offline memory: 0B
...
Hence there is a big gap between 0x2403fffff and 0x280000000 due to the 1G
alignment on arm64. Worse, only 1G of memory is usable while 2G was
requested.

On x86, the gap is relatively small due to SECTION_SIZE_BITS (27).

Besides decreasing SECTION_SIZE_BITS on arm64, we can instead relax the
alignment when adding the kmem.
After this patch:
240000000-33fdfffff : Persistent Memory
240000000-2421fffff : namespace0.0
242400000-2bfffffff : dax0.0
242400000-2bfffffff : System RAM (kmem)
$ lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000040000000-0x00000002bfffffff 10G online yes 1-10

Memory block size: 1G
Total online memory: 10G
Total offline memory: 0B

Note: blocks 9-10 are the newly hot-added ones.

This patch removes the tight alignment constraint of
memory_block_size_bytes(), but still keeps the constraint from
online_pages_range().

Signed-off-by: Jia He <[email protected]>
---
drivers/dax/kmem.c | 22 +++++++++++++---------
1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index d77786dc0d92..849d0706dfe0 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -30,9 +30,20 @@ int dev_dax_kmem_probe(struct device *dev)
const char *new_res_name;
int numa_node;
int rc;
+ int order;

- /* Hotplug starting at the beginning of the next block: */
- kmem_start = ALIGN(res->start, memory_block_size_bytes());
+ /* kmem_start needn't be aligned with memory_block_size_bytes().
+ * But given the constraint in online_pages_range(), adjust the
+ * alignment of kmem_start and kmem_size
+ */
+ kmem_size = resource_size(res);
+ order = min_t(int, MAX_ORDER - 1, get_order(kmem_size));
+ kmem_start = ALIGN(res->start, 1ul << (order + PAGE_SHIFT));
+ /* Adjust the size down to compensate for moving up kmem_start: */
+ kmem_size -= kmem_start - res->start;
+ /* Align the size down to cover only complete blocks: */
+ kmem_size &= ~((1ul << (order + PAGE_SHIFT)) - 1);
+ kmem_end = kmem_start + kmem_size;

/*
* Ensure good NUMA information for the persistent memory.
@@ -48,13 +59,6 @@ int dev_dax_kmem_probe(struct device *dev)
numa_node, res);
}

- kmem_size = resource_size(res);
- /* Adjust the size down to compensate for moving up kmem_start: */
- kmem_size -= kmem_start - res->start;
- /* Align the size down to cover only complete blocks: */
- kmem_size &= ~(memory_block_size_bytes() - 1);
- kmem_end = kmem_start + kmem_size;
-
new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
if (!new_res_name)
return -ENOMEM;
--
2.17.1

2020-07-29 03:39:12

by Justin He

Subject: [RFC PATCH 4/6] mm/page_alloc: adjust the start,end in dax pmem kmem case

There are 3 cases when onlining pages:
- normal RAM, which must be aligned with the memory block size
- persistent memory with ZONE_DEVICE
- persistent memory used as normal RAM (kmem) with ZONE_NORMAL; this patch
adjusts start_pfn/end_pfn after finding the corresponding resource range.

Without this patch, the __init_single_page() check fails when onlining the
memory, because those pages have not been mapped by the MMU (they are not
present from the MMU's point of view).

Signed-off-by: Jia He <[email protected]>
---
mm/page_alloc.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e028b87ce294..13216ab3623f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5971,6 +5971,20 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
if (start_pfn == altmap->base_pfn)
start_pfn += altmap->reserve;
end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
+ } else {
+ struct resource res;
+ int ret;
+
+ /* adjust the start,end in dax pmem kmem case */
+ ret = find_next_iomem_res(start_pfn << PAGE_SHIFT,
+ (end_pfn << PAGE_SHIFT) - 1,
+ IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+ IORES_DESC_PERSISTENT_MEMORY,
+ false, &res);
+ if (!ret) {
+ start_pfn = PFN_UP(res.start);
+ end_pfn = PFN_DOWN(res.end + 1);
+ }
}
#endif

--
2.17.1

2020-07-29 03:39:48

by Justin He

Subject: [RFC PATCH 6/6] arm64: fall back to vmemmap_populate_basepages if not aligned with PMD_SIZE

In the dax pmem kmem case (dax pmem used as a RAM device), the start
address might not be aligned with PMD_SIZE, e.g.:
240000000-33fdfffff : Persistent Memory
240000000-2421fffff : namespace0.0
242400000-2bfffffff : dax0.0
242400000-2bfffffff : System RAM (kmem)
pfn_to_page(0x242400000) is fffffe0007e90000.

Without this patch, vmemmap_populate(fffffe0007e90000, ...) incorrectly
creates a pmd mapping [fffffe0007e00000, fffffe0008000000] which contains
fffffe0007e90000.

Add the check and fall back to vmemmap_populate_basepages() in that case.

Signed-off-by: Jia He <[email protected]>
---
arch/arm64/mm/mmu.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d69feb2cfb84..3b21bd47e801 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1102,6 +1102,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,

do {
next = pmd_addr_end(addr, end);
+ if (next - addr < PMD_SIZE) {
+ vmemmap_populate_basepages(start, next, node, altmap);
+ continue;
+ }

pgdp = vmemmap_pgd_populate(addr, node);
if (!pgdp)
--
2.17.1

2020-07-29 06:46:05

by David Hildenbrand

Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment



> Am 29.07.2020 um 05:35 schrieb Jia He <[email protected]>:
>
> When enabling dax pmem as RAM device on arm64, I noticed that kmem_start
> addr in dev_dax_kmem_probe() should be aligned w/ SECTION_SIZE_BITS(30),i.e.
> 1G memblock size. Even Dan Williams' sub-section patch series [1] had been
> upstream merged, it was not helpful due to hard limitation of kmem_start:
> $ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f -a 2M
> $echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> $echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> $cat /proc/iomem
> ...
> 23c000000-23fffffff : System RAM
> 23dd40000-23fecffff : reserved
> 23fed0000-23fffffff : reserved
> 240000000-33fdfffff : Persistent Memory
> 240000000-2403fffff : namespace0.0
> 280000000-2bfffffff : dax0.0 <- aligned with 1G boundary
> 280000000-2bfffffff : System RAM
> Hence there is a big gap between 0x2403fffff and 0x280000000 due to the 1G
> alignment.
>
> Without this series, if qemu creates a 4G bytes nvdimm device, we can only
> use 2G bytes for dax pmem(kmem) in the worst case.
> e.g.
> 240000000-33fdfffff : Persistent Memory
> We can only use the memblock between [240000000, 2ffffffff] due to the hard
> limitation. It wastes too much memory space.
>
> Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but there
> are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
> SPARSEMEM_VMEMMAP, page bits in struct page ...
>
> Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem alignment
> with memory_block_size_bytes().
>
> Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax pmem
> can be used as ram with smaller gap. Also the kmem hotplug add/remove are both
> tested on arm64/x86 guest.
>

Hi,

I am not convinced this use case is worth such hacks (that's what they are) for now. On real machines pmem is big; your example (losing 50%) is extreme.

I would much rather want to see the section size on arm64 reduced. I remember there were patches and that at least with a base page size of 4k it can be reduced drastically (64k base pages are more problematic due to the ridiculous THP size of 512M). But it could be that a section size of 512M is possible on all configs right now.

In the long term we might want to rework the memory block device model (eventually supporting old/new as discussed with Michal some time ago using a kernel parameter), dropping the fixed sizes
- allowing sizes / addresses aligned with subsection size
- drastically reducing the number of devices for boot memory to only a handful (e.g., one per resource / DIMM we can actually unplug again).

Long story short, I don’t like this hack.


> This patch series (mainly patch6/6) is based on the fixing patch, ~v5.8-rc5 [2].
>
> [1] https://lkml.org/lkml/2019/6/19/67
> [2] https://lkml.org/lkml/2020/7/8/1546
> Jia He (6):
> mm/memory_hotplug: remove redundant memory block size alignment check
> resource: export find_next_iomem_res() helper
> mm/memory_hotplug: allow pmem kmem not to align with memory_block_size
> mm/page_alloc: adjust the start,end in dax pmem kmem case
> device-dax: relax the memblock size alignment for kmem_start
> arm64: fall back to vmemmap_populate_basepages if not aligned with
> PMD_SIZE
>
> arch/arm64/mm/mmu.c | 4 ++++
> drivers/base/memory.c | 24 ++++++++++++++++--------
> drivers/dax/kmem.c | 22 +++++++++++++---------
> include/linux/ioport.h | 3 +++
> kernel/resource.c | 3 ++-
> mm/memory_hotplug.c | 39 ++++++++++++++++++++++++++++++++++++++-
> mm/page_alloc.c | 14 ++++++++++++++
> 7 files changed, 90 insertions(+), 19 deletions(-)
>
> --
> 2.17.1
>

2020-07-29 08:30:55

by Justin He

Subject: RE: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

Hi David

> -----Original Message-----
> From: David Hildenbrand <[email protected]>
> Sent: Wednesday, July 29, 2020 2:37 PM
> To: Justin He <[email protected]>
> Cc: Dan Williams <[email protected]>; Vishal Verma
> <[email protected]>; Mike Rapoport <[email protected]>; David
> Hildenbrand <[email protected]>; Catalin Marinas <[email protected]>;
> Will Deacon <[email protected]>; Greg Kroah-Hartman
> <[email protected]>; Rafael J. Wysocki <[email protected]>; Dave
> Jiang <[email protected]>; Andrew Morton <[email protected]>;
> Steve Capper <[email protected]>; Mark Rutland <[email protected]>;
> Logan Gunthorpe <[email protected]>; Anshuman Khandual
> <[email protected]>; Hsin-Yi Wang <[email protected]>; Jason
> Gunthorpe <[email protected]>; Dave Hansen <[email protected]>; Kees
> Cook <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected]; Wei
> Yang <[email protected]>; Pankaj Gupta
> <[email protected]>; Ira Weiny <[email protected]>; Kaly Xin
> <[email protected]>
> Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem
> alignment
>
>
>
> > Am 29.07.2020 um 05:35 schrieb Jia He <[email protected]>:
> >
> > When enabling dax pmem as RAM device on arm64, I noticed that kmem_start
> > addr in dev_dax_kmem_probe() should be aligned w/
> SECTION_SIZE_BITS(30),i.e.
> > 1G memblock size. Even Dan Williams' sub-section patch series [1] had
> been
> > upstream merged, it was not helpful due to hard limitation of kmem_start:
> > $ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f
> -a 2M
> > $echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> > $echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> > $cat /proc/iomem
> > ...
> > 23c000000-23fffffff : System RAM
> > 23dd40000-23fecffff : reserved
> > 23fed0000-23fffffff : reserved
> > 240000000-33fdfffff : Persistent Memory
> > 240000000-2403fffff : namespace0.0
> > 280000000-2bfffffff : dax0.0 <- aligned with 1G boundary
> > 280000000-2bfffffff : System RAM
> > Hence there is a big gap between 0x2403fffff and 0x280000000 due to the
> 1G
> > alignment.
> >
> > Without this series, if qemu creates a 4G bytes nvdimm device, we can
> only
> > use 2G bytes for dax pmem(kmem) in the worst case.
> > e.g.
> > 240000000-33fdfffff : Persistent Memory
> > We can only use the memblock between [240000000, 2ffffffff] due to the
> hard
> > limitation. It wastes too much memory space.
> >
> > Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but
> there
> > are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
> > SPARSEMEM_VMEMMAP, page bits in struct page ...
> >
> > Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem
> alignment
> > with memory_block_size_bytes().
> >
> > Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax
> pmem
> > can be used as ram with smaller gap. Also the kmem hotplug add/remove
> are both
> > tested on arm64/x86 guest.
> >
>
> Hi,
>
> I am not convinced this use case is worth such hacks (that’s what it is)
> for now. On real machines pmem is big - your example (losing 50% is
> extreme).
>
> I would much rather want to see the section size on arm64 reduced. I
> remember there were patches and that at least with a base page size of 4k
> it can be reduced drastically (64k base pages are more problematic due to
> the ridiculous THP size of 512M). But could be a section size of 512 is
> possible on all configs right now.

Yes, I once investigated thoroughly how to reduce the section size on arm64.
There are many constraints on reducing SECTION_SIZE_BITS:
1. Given that the bits in page->flags are limited, SECTION_SIZE_BITS can't
be reduced too much.
2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, the section id is no longer
encoded in page->flags.
3. MAX_ORDER depends on SECTION_SIZE_BITS
- 3.1 mmzone.h
#if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
#error Allocator MAX_ORDER exceeds SECTION_SIZE
#endif
- 3.2 hugepage_init()
MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);

Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
SECTION_SIZE_BITS can be reduced to 27.
But with ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
Given 3.1, SECTION_SIZE_BITS >= MAX_ORDER+15 > 28, so SECTION_SIZE_BITS
cannot be reduced to 27.

In short, if we considered reducing SECTION_SIZE_BITS on arm64, the
Kconfig might get very complicated, e.g. we still need to consider the
ARM64_16K_PAGES case.

>
> In the long term we might want to rework the memory block device model
> (eventually supporting old/new as discussed with Michal some time ago
> using a kernel parameter), dropping the fixed sizes

Has this been posted to the linux-mm mailing list? Sorry, I searched and didn't find it.


--
Cheers,
Justin (Jia He)



> - allowing sizes / addresses aligned with subsection size
> - drastically reducing the number of devices for boot memory to only a
> hand full (e.g., one per resource / DIMM we can actually unplug again.
>
> Long story short, I don’t like this hack.
>
>
> > This patch series (mainly patch6/6) is based on the fixing patch, ~v5.8-
> rc5 [2].
> >
> > [1] https://lkml.org/lkml/2019/6/19/67
> > [2] https://lkml.org/lkml/2020/7/8/1546
> > Jia He (6):
> > mm/memory_hotplug: remove redundant memory block size alignment check
> > resource: export find_next_iomem_res() helper
> > mm/memory_hotplug: allow pmem kmem not to align with memory_block_size
> > mm/page_alloc: adjust the start,end in dax pmem kmem case
> > device-dax: relax the memblock size alignment for kmem_start
> > arm64: fall back to vmemmap_populate_basepages if not aligned with
> > PMD_SIZE
> >
> > arch/arm64/mm/mmu.c | 4 ++++
> > drivers/base/memory.c | 24 ++++++++++++++++--------
> > drivers/dax/kmem.c | 22 +++++++++++++---------
> > include/linux/ioport.h | 3 +++
> > kernel/resource.c | 3 ++-
> > mm/memory_hotplug.c | 39 ++++++++++++++++++++++++++++++++++++++-
> > mm/page_alloc.c | 14 ++++++++++++++
> > 7 files changed, 90 insertions(+), 19 deletions(-)
> >
> > --
> > 2.17.1
> >

2020-07-29 08:45:29

by David Hildenbrand

Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

On 29.07.20 10:27, Justin He wrote:
> Hi David
>
>> -----Original Message-----
>> From: David Hildenbrand <[email protected]>
>> Sent: Wednesday, July 29, 2020 2:37 PM
>> To: Justin He <[email protected]>
>> Cc: Dan Williams <[email protected]>; Vishal Verma
>> <[email protected]>; Mike Rapoport <[email protected]>; David
>> Hildenbrand <[email protected]>; Catalin Marinas <[email protected]>;
>> Will Deacon <[email protected]>; Greg Kroah-Hartman
>> <[email protected]>; Rafael J. Wysocki <[email protected]>; Dave
>> Jiang <[email protected]>; Andrew Morton <[email protected]>;
>> Steve Capper <[email protected]>; Mark Rutland <[email protected]>;
>> Logan Gunthorpe <[email protected]>; Anshuman Khandual
>> <[email protected]>; Hsin-Yi Wang <[email protected]>; Jason
>> Gunthorpe <[email protected]>; Dave Hansen <[email protected]>; Kees
>> Cook <[email protected]>; [email protected]; linux-
>> [email protected]; [email protected]; [email protected]; Wei
>> Yang <[email protected]>; Pankaj Gupta
>> <[email protected]>; Ira Weiny <[email protected]>; Kaly Xin
>> <[email protected]>
>> Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem
>> alignment
>>
>>
>>
>>> Am 29.07.2020 um 05:35 schrieb Jia He <[email protected]>:
>>>
>>> When enabling dax pmem as RAM device on arm64, I noticed that kmem_start
>>> addr in dev_dax_kmem_probe() should be aligned w/
>> SECTION_SIZE_BITS(30),i.e.
>>> 1G memblock size. Even Dan Williams' sub-section patch series [1] had
>> been
>>> upstream merged, it was not helpful due to hard limitation of kmem_start:
>>> $ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f
>> -a 2M
>>> $echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>>> $echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>>> $cat /proc/iomem
>>> ...
>>> 23c000000-23fffffff : System RAM
>>> 23dd40000-23fecffff : reserved
>>> 23fed0000-23fffffff : reserved
>>> 240000000-33fdfffff : Persistent Memory
>>> 240000000-2403fffff : namespace0.0
>>> 280000000-2bfffffff : dax0.0 <- aligned with 1G boundary
>>> 280000000-2bfffffff : System RAM
>>> Hence there is a big gap between 0x2403fffff and 0x280000000 due to the
>> 1G
>>> alignment.
>>>
>>> Without this series, if qemu creates a 4G bytes nvdimm device, we can
>> only
>>> use 2G bytes for dax pmem(kmem) in the worst case.
>>> e.g.
>>> 240000000-33fdfffff : Persistent Memory
>>> We can only use the memblock between [240000000, 2ffffffff] due to the
>> hard
>>> limitation. It wastes too much memory space.
>>>
>>> Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but
>> there
>>> are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
>>> SPARSEMEM_VMEMMAP, page bits in struct page ...
>>>
>>> Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem
>> alignment
>>> with memory_block_size_bytes().
>>>
>>> Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax
>> pmem
>>> can be used as ram with smaller gap. Also the kmem hotplug add/remove
>> are both
>>> tested on arm64/x86 guest.
>>>
>>
>> Hi,
>>
>> I am not convinced this use case is worth such hacks (that’s what it is)
>> for now. On real machines pmem is big - your example (losing 50% is
>> extreme).
>>
>> I would much rather want to see the section size on arm64 reduced. I
>> remember there were patches and that at least with a base page size of 4k
>> it can be reduced drastically (64k base pages are more problematic due to
>> the ridiculous THP size of 512M). But could be a section size of 512 is
>> possible on all configs right now.
>
> Yes, I once investigated how to reduce section size on arm64 thoughtfully:
> There are many constraints for reducing SECTION_SIZE_BITS
> 1. Given page->flags bits is limited, SECTION_SIZE_BITS can't be reduced too
> much.
> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, section id will not be counted
> into page->flags.

Yep.

> 3. MAX_ORDER depends on SECTION_SIZE_BITS
> - 3.1 mmzone.h
> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
> #error Allocator MAX_ORDER exceeds SECTION_SIZE
> #endif

Yep, with 4k base pages it's 4 MB. However, with 64k base pages it's
512MB ( :( ).

> - 3.2 hugepage_init()
> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
>
> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
> SECTION_SIZE_BITS can be reduced to 27.
> But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
> Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS can not
> be reduced to 27.

I think there were plans to eventually switch to 2MB THP with 64k base
pages as well (which can be emulated using some sort of consecutive PTE
entries under arm64, don't ask me what this feature is called),
theoretically also allowing smaller section sizes (when also reducing
MAX_ORDER properly). I would highly appreciate that switch. Having max
allocation/THP in the size of gigantic pages sounds very weird to me
(and creates issues e.g., to support hot(un)plug of small memory blocks
for virtio-mem). But I guess this is not under our control :)

>
> In one word, if we considered to reduce SECTION_SIZE_BITS on arm64, the Kconfig
> might be very complicated,e.g. we still need to consider the case for
> ARM64_16K_PAGES.

Haven't looked into 16k base pages yet. But I remember it's in general
more similar to 4k than to 64k (speaking about sane THP sizes and
similar ...).

>
>>
>> In the long term we might want to rework the memory block device model
>> (eventually supporting old/new as discussed with Michal some time ago
>> using a kernel parameter), dropping the fixed sizes
>
> Has this been posted to the linux-mm mailing list? Sorry, I searched and didn't find it.

Yeah, but I might not be able to dig it out anymore ...

Anyhow, the idea would be to have some magic switch that converts
between old and new world, to not break userspace that relies on that.

With old, everything would continue to work as it is. With *new* we
would have a reduced number of memory blocks for boot memory,
decoupled from a strict, static memory block size.


There would be another option in corner cases right now. If you would
*know* that the metadata memory has no memmap/identity mapping and you have
1G alignment for your pmem device (including the metadata part):

1. add_memory_driver_managed() the whole memory, including the metadata part
2. use generic_online_page() to not expose metadata pages to the buddy
3. Mark metadata pages in a special way, such that you can e.g., allow
offlining the memory again, including the metadata pages (e.g., PG_offline +
a memory notifier like virtio-mem does)

3. would only be relevant to support offlining of memory again.

If the metadata part is, however, already ZONE_DEVICE with a memmap,
then that's not an option. (I have no idea how that metadata part is
used, sorry)


--
Thanks,

David / dhildenb

2020-07-29 09:35:25

by Mike Rapoport

Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

Hi Justin,

On Wed, Jul 29, 2020 at 08:27:58AM +0000, Justin He wrote:
> Hi David
> > >
> > > Without this series, if qemu creates a 4G bytes nvdimm device, we can
> > only
> > > use 2G bytes for dax pmem(kmem) in the worst case.
> > > e.g.
> > > 240000000-33fdfffff : Persistent Memory
> > > We can only use the memblock between [240000000, 2ffffffff] due to the
> > hard
> > > limitation. It wastes too much memory space.
> > >
> > > Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but
> > there
> > > are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
> > > SPARSEMEM_VMEMMAP, page bits in struct page ...
> > >
> > > Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem
> > alignment
> > > with memory_block_size_bytes().
> > >
> > > Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax
> > pmem
> > > can be used as ram with smaller gap. Also the kmem hotplug add/remove
> > are both
> > > tested on arm64/x86 guest.
> > >
> >
> > Hi,
> >
> > I am not convinced this use case is worth such hacks (that’s what it is)
> > for now. On real machines pmem is big - your example (losing 50% is
> > extreme).
> >
> > I would much rather want to see the section size on arm64 reduced. I
> > remember there were patches and that at least with a base page size of 4k
> > it can be reduced drastically (64k base pages are more problematic due to
> > the ridiculous THP size of 512M). But it could be that a section size of
> > 512 MB is possible on all configs right now.
>
> Yes, I once investigated how to reduce the section size on arm64 thoroughly:
> There are many constraints on reducing SECTION_SIZE_BITS
> 1. Given page->flags bits are limited, SECTION_SIZE_BITS can't be reduced too
> much.
> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, section id will not be counted
> into page->flags.
> 3. MAX_ORDER depends on SECTION_SIZE_BITS
> - 3.1 mmzone.h
> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
> #error Allocator MAX_ORDER exceeds SECTION_SIZE
> #endif
> - 3.2 hugepage_init()
> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
>
> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
> SECTION_SIZE_BITS can be reduced to 27.
> But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
> Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS can not
> be reduced to 27.
>
> In short, if we consider reducing SECTION_SIZE_BITS on arm64, the Kconfig
> might become very complicated, e.g. we still need to consider the case of
> ARM64_16K_PAGES.

It is not necessary to pollute Kconfig with that.
arch/arm64/include/asm/sparsemem.h can have something like

#ifdef CONFIG_ARM64_64K_PAGES
#define SPARSE_SECTION_SIZE 29
#elif defined(CONFIG_ARM64_16K_PAGES)
#define SPARSE_SECTION_SIZE 28
#elif defined(CONFIG_ARM64_4K_PAGES)
#define SPARSE_SECTION_SIZE 27
#else
#error
#endif

There is still a large gap with ARM64_64K_PAGES, though.

As for SPARSEMEM without VMEMMAP, are there actual benefits to using it?

> >
> > In the long term we might want to rework the memory block device model
> > (eventually supporting old/new as discussed with Michal some time ago
> > using a kernel parameter), dropping the fixed sizes
>
> Has this been posted to the linux-mm mailing list? Sorry, I searched and didn't find it.
>
>
> --
> Cheers,
> Justin (Jia He)
>
>
>
> > - allowing sizes / addresses aligned with subsection size
> > - drastically reducing the number of devices for boot memory to only a
> > hand full (e.g., one per resource / DIMM we can actually unplug again.
> >
> > Long story short, I don’t like this hack.
> >
> >
> > > This patch series (mainly patch6/6) is based on the fixing patch, ~v5.8-
> > rc5 [2].
> > >
> > > [1] https://lkml.org/lkml/2019/6/19/67
> > > [2] https://lkml.org/lkml/2020/7/8/1546
> > > Jia He (6):
> > > mm/memory_hotplug: remove redundant memory block size alignment check
> > > resource: export find_next_iomem_res() helper
> > > mm/memory_hotplug: allow pmem kmem not to align with memory_block_size
> > > mm/page_alloc: adjust the start,end in dax pmem kmem case
> > > device-dax: relax the memblock size alignment for kmem_start
> > > arm64: fall back to vmemmap_populate_basepages if not aligned with
> > > PMD_SIZE
> > >
> > > arch/arm64/mm/mmu.c | 4 ++++
> > > drivers/base/memory.c | 24 ++++++++++++++++--------
> > > drivers/dax/kmem.c | 22 +++++++++++++---------
> > > include/linux/ioport.h | 3 +++
> > > kernel/resource.c | 3 ++-
> > > mm/memory_hotplug.c | 39 ++++++++++++++++++++++++++++++++++++++-
> > > mm/page_alloc.c | 14 ++++++++++++++
> > > 7 files changed, 90 insertions(+), 19 deletions(-)
> > >
> > > --
> > > 2.17.1
> > >
>

--
Sincerely yours,
Mike.

2020-07-29 09:37:23

by David Hildenbrand

Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

On 29.07.20 11:31, Mike Rapoport wrote:
> Hi Justin,
>
> On Wed, Jul 29, 2020 at 08:27:58AM +0000, Justin He wrote:
>> Hi David
>>>>
>>>> Without this series, if qemu creates a 4G bytes nvdimm device, we can
>>> only
>>>> use 2G bytes for dax pmem(kmem) in the worst case.
>>>> e.g.
>>>> 240000000-33fdfffff : Persistent Memory
>>>> We can only use the memblock between [240000000, 2ffffffff] due to the
>>> hard
>>>> limitation. It wastes too much memory space.
>>>>
>>>> Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but
>>> there
>>>> are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
>>>> SPARSEMEM_VMEMMAP, page bits in struct page ...
>>>>
>>>> Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem
>>> alignment
>>>> with memory_block_size_bytes().
>>>>
>>>> Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax
>>> pmem
>>>> can be used as ram with smaller gap. Also the kmem hotplug add/remove
>>> are both
>>>> tested on arm64/x86 guest.
>>>>
>>>
>>> Hi,
>>>
>>> I am not convinced this use case is worth such hacks (that’s what it is)
>>> for now. On real machines pmem is big - your example (losing 50% is
>>> extreme).
>>>
>>> I would much rather want to see the section size on arm64 reduced. I
>>> remember there were patches and that at least with a base page size of 4k
>>> it can be reduced drastically (64k base pages are more problematic due to
>>> the ridiculous THP size of 512M). But it could be that a section size of
>>> 512 MB is possible on all configs right now.
>>
>> Yes, I once investigated how to reduce the section size on arm64 thoroughly:
>> There are many constraints on reducing SECTION_SIZE_BITS
>> 1. Given page->flags bits are limited, SECTION_SIZE_BITS can't be reduced too
>> much.
>> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, section id will not be counted
>> into page->flags.
>> 3. MAX_ORDER depends on SECTION_SIZE_BITS
>> - 3.1 mmzone.h
>> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
>> #error Allocator MAX_ORDER exceeds SECTION_SIZE
>> #endif
>> - 3.2 hugepage_init()
>> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
>>
>> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
>> SECTION_SIZE_BITS can be reduced to 27.
>> But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
>> Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS can not
>> be reduced to 27.
>>
>> In short, if we consider reducing SECTION_SIZE_BITS on arm64, the Kconfig
>> might become very complicated, e.g. we still need to consider the case of
>> ARM64_16K_PAGES.
>
> It is not necessary to pollute Kconfig with that.
> arch/arm64/include/asm/sparsemem.h can have something like
>
> #ifdef CONFIG_ARM64_64K_PAGES
> #define SPARSE_SECTION_SIZE 29
> #elif defined(CONFIG_ARM64_16K_PAGES)
> #define SPARSE_SECTION_SIZE 28
> #elif defined(CONFIG_ARM64_4K_PAGES)
> #define SPARSE_SECTION_SIZE 27
> #else
> #error
> #endif

ack

>
> There is still large gap with ARM64_64K_PAGES, though.
>
> As for SPARSEMEM without VMEMMAP, are there actual benefits to use it?

I was asking myself the same question a while ago and didn't really find
a compelling one.

I think it's always enabled as default (SPARSEMEM_VMEMMAP_ENABLE) and
would require config tweaks to even disable it.

--
Thanks,

David / dhildenb

2020-07-29 13:04:25

by David Hildenbrand

Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

On 29.07.20 15:00, Mike Rapoport wrote:
> On Wed, Jul 29, 2020 at 11:35:20AM +0200, David Hildenbrand wrote:
>> On 29.07.20 11:31, Mike Rapoport wrote:
>>> Hi Justin,
>>>
>>> On Wed, Jul 29, 2020 at 08:27:58AM +0000, Justin He wrote:
>>>> Hi David
>>>>>>
>>>>>> Without this series, if qemu creates a 4G bytes nvdimm device, we can
>>>>> only
>>>>>> use 2G bytes for dax pmem(kmem) in the worst case.
>>>>>> e.g.
>>>>>> 240000000-33fdfffff : Persistent Memory
>>>>>> We can only use the memblock between [240000000, 2ffffffff] due to the
>>>>> hard
>>>>>> limitation. It wastes too much memory space.
>>>>>>
>>>>>> Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but
>>>>> there
>>>>>> are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
>>>>>> SPARSEMEM_VMEMMAP, page bits in struct page ...
>>>>>>
>>>>>> Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem
>>>>> alignment
>>>>>> with memory_block_size_bytes().
>>>>>>
>>>>>> Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax
>>>>> pmem
>>>>>> can be used as ram with smaller gap. Also the kmem hotplug add/remove
>>>>> are both
>>>>>> tested on arm64/x86 guest.
>>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am not convinced this use case is worth such hacks (that’s what it is)
>>>>> for now. On real machines pmem is big - your example (losing 50% is
>>>>> extreme).
>>>>>
>>>>> I would much rather want to see the section size on arm64 reduced. I
>>>>> remember there were patches and that at least with a base page size of 4k
>>>>> it can be reduced drastically (64k base pages are more problematic due to
>>>>> the ridiculous THP size of 512M). But it could be that a section size of
>>>>> 512 MB is possible on all configs right now.
>>>>
>>>> Yes, I once investigated how to reduce the section size on arm64 thoroughly:
>>>> There are many constraints on reducing SECTION_SIZE_BITS
>>>> 1. Given page->flags bits are limited, SECTION_SIZE_BITS can't be reduced too
>>>> much.
>>>> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, section id will not be counted
>>>> into page->flags.
>>>> 3. MAX_ORDER depends on SECTION_SIZE_BITS
>>>> - 3.1 mmzone.h
>>>> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
>>>> #error Allocator MAX_ORDER exceeds SECTION_SIZE
>>>> #endif
>>>> - 3.2 hugepage_init()
>>>> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
>>>>
>>>> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
>>>> SECTION_SIZE_BITS can be reduced to 27.
>>>> But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
>>>> Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS can not
>>>> be reduced to 27.
>>>>
>>>> In short, if we consider reducing SECTION_SIZE_BITS on arm64, the Kconfig
>>>> might become very complicated, e.g. we still need to consider the case of
>>>> ARM64_16K_PAGES.
>>>
>>> It is not necessary to pollute Kconfig with that.
>>> arch/arm64/include/asm/sparsemem.h can have something like
>>>
>>> #ifdef CONFIG_ARM64_64K_PAGES
>>> #define SPARSE_SECTION_SIZE 29
>>> #elif defined(CONFIG_ARM64_16K_PAGES)
>>> #define SPARSE_SECTION_SIZE 28
>>> #elif defined(CONFIG_ARM64_4K_PAGES)
>>> #define SPARSE_SECTION_SIZE 27
>>> #else
>>> #error
>>> #endif
>>
>> ack
>>
>>>
>>> There is still large gap with ARM64_64K_PAGES, though.
>>>
>>> As for SPARSEMEM without VMEMMAP, are there actual benefits to use it?
>>
>> I was asking myself the same question a while ago and didn't really find
>> a compelling one.
>
> Memory overhead for VMEMMAP is larger, especially for arm64 that knows
> how to free empty parts of the memory map with "classic" SPARSEMEM.

You mean the hole punching within section memmap? (which is why their
pfn_valid() implementation is special)

(I do wonder why that shouldn't work with VMEMMAP, or is it simply not
implemented?)

>
>> I think it's always enabled as default (SPARSEMEM_VMEMMAP_ENABLE) and
>> would require config tweaks to even disable it.
>
> Nope, it's right there in menuconfig,
>
> "Memory Management options" -> "Sparse Memory virtual memmap"

Ah, good to know.


--
Thanks,

David / dhildenb

2020-07-29 13:05:24

by Mike Rapoport

Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

On Wed, Jul 29, 2020 at 11:35:20AM +0200, David Hildenbrand wrote:
> On 29.07.20 11:31, Mike Rapoport wrote:
> > Hi Justin,
> >
> > On Wed, Jul 29, 2020 at 08:27:58AM +0000, Justin He wrote:
> >> Hi David
> >>>>
> >>>> Without this series, if qemu creates a 4G bytes nvdimm device, we can
> >>> only
> >>>> use 2G bytes for dax pmem(kmem) in the worst case.
> >>>> e.g.
> >>>> 240000000-33fdfffff : Persistent Memory
> >>>> We can only use the memblock between [240000000, 2ffffffff] due to the
> >>> hard
> >>>> limitation. It wastes too much memory space.
> >>>>
> >>>> Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but
> >>> there
> >>>> are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
> >>>> SPARSEMEM_VMEMMAP, page bits in struct page ...
> >>>>
> >>>> Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem
> >>> alignment
> >>>> with memory_block_size_bytes().
> >>>>
> >>>> Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax
> >>> pmem
> >>>> can be used as ram with smaller gap. Also the kmem hotplug add/remove
> >>> are both
> >>>> tested on arm64/x86 guest.
> >>>>
> >>>
> >>> Hi,
> >>>
> >>> I am not convinced this use case is worth such hacks (that’s what it is)
> >>> for now. On real machines pmem is big - your example (losing 50% is
> >>> extreme).
> >>>
> >>> I would much rather want to see the section size on arm64 reduced. I
> >>> remember there were patches and that at least with a base page size of 4k
> >>> it can be reduced drastically (64k base pages are more problematic due to
> >>> the ridiculous THP size of 512M). But it could be that a section size of
> >>> 512 MB is possible on all configs right now.
> >>
> >> Yes, I once investigated how to reduce the section size on arm64 thoroughly:
> >> There are many constraints on reducing SECTION_SIZE_BITS
> >> 1. Given page->flags bits are limited, SECTION_SIZE_BITS can't be reduced too
> >> much.
> >> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, section id will not be counted
> >> into page->flags.
> >> 3. MAX_ORDER depends on SECTION_SIZE_BITS
> >> - 3.1 mmzone.h
> >> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
> >> #error Allocator MAX_ORDER exceeds SECTION_SIZE
> >> #endif
> >> - 3.2 hugepage_init()
> >> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
> >>
> >> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
> >> SECTION_SIZE_BITS can be reduced to 27.
> >> But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
> >> Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS can not
> >> be reduced to 27.
> >>
> >> In short, if we consider reducing SECTION_SIZE_BITS on arm64, the Kconfig
> >> might become very complicated, e.g. we still need to consider the case of
> >> ARM64_16K_PAGES.
> >
> > It is not necessary to pollute Kconfig with that.
> > arch/arm64/include/asm/sparsemem.h can have something like
> >
> > #ifdef CONFIG_ARM64_64K_PAGES
> > #define SPARSE_SECTION_SIZE 29
> > #elif defined(CONFIG_ARM64_16K_PAGES)
> > #define SPARSE_SECTION_SIZE 28
> > #elif defined(CONFIG_ARM64_4K_PAGES)
> > #define SPARSE_SECTION_SIZE 27
> > #else
> > #error
> > #endif
>
> ack
>
> >
> > There is still large gap with ARM64_64K_PAGES, though.
> >
> > As for SPARSEMEM without VMEMMAP, are there actual benefits to use it?
>
> I was asking myself the same question a while ago and didn't really find
> a compelling one.

Memory overhead for VMEMMAP is larger, especially for arm64 that knows
how to free empty parts of the memory map with "classic" SPARSEMEM.

> I think it's always enabled as default (SPARSEMEM_VMEMMAP_ENABLE) and
> would require config tweaks to even disable it.

Nope, it's right there in menuconfig,

"Memory Management options" -> "Sparse Memory virtual memmap"

> --
> Thanks,
>
> David / dhildenb
>

--
Sincerely yours,
Mike.

2020-07-29 14:14:21

by Mike Rapoport

Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

On Wed, Jul 29, 2020 at 03:03:04PM +0200, David Hildenbrand wrote:
> On 29.07.20 15:00, Mike Rapoport wrote:
> > On Wed, Jul 29, 2020 at 11:35:20AM +0200, David Hildenbrand wrote:
> >>>
> >>> There is still large gap with ARM64_64K_PAGES, though.
> >>>
> >>> As for SPARSEMEM without VMEMMAP, are there actual benefits to use it?
> >>
> >> I was asking myself the same question a while ago and didn't really find
> >> a compelling one.
> >
> > Memory overhead for VMEMMAP is larger, especially for arm64 that knows
> > how to free empty parts of the memory map with "classic" SPARSEMEM.
>
> You mean the hole punching within section memmap? (which is why their
> pfn_valid() implementation is special)

Yes, arm (both 32 and 64) does this. And for smaller systems with a few
memory banks it is very reasonable to trade a slight (if any) slowdown
in pfn_valid() for several megs of memory.

> (I do wonder why that shouldn't work with VMEMMAP, or is it simply not
> implemented?)

It's not implemented. There was a patch [1] recently to implement this.

[1] https://lore.kernel.org/lkml/[email protected]/

> --
> Thanks,
>
> David / dhildenb
>

--
Sincerely yours,
Mike.

2020-07-30 02:18:51

by Justin He

Subject: RE: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment


> -----Original Message-----
> From: David Hildenbrand <[email protected]>
> Sent: Wednesday, July 29, 2020 5:35 PM
> To: Mike Rapoport <[email protected]>; Justin He <[email protected]>
> Cc: Dan Williams <[email protected]>; Vishal Verma
> <[email protected]>; Catalin Marinas <[email protected]>;
> Will Deacon <[email protected]>; Greg Kroah-Hartman
> <[email protected]>; Rafael J. Wysocki <[email protected]>; Dave
> Jiang <[email protected]>; Andrew Morton <[email protected]>;
> Steve Capper <[email protected]>; Mark Rutland <[email protected]>;
> Logan Gunthorpe <[email protected]>; Anshuman Khandual
> <[email protected]>; Hsin-Yi Wang <[email protected]>; Jason
> Gunthorpe <[email protected]>; Dave Hansen <[email protected]>; Kees
> Cook <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected]; Wei
> Yang <[email protected]>; Pankaj Gupta
> <[email protected]>; Ira Weiny <[email protected]>; Kaly Xin
> <[email protected]>
> Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem
> alignment
>
> On 29.07.20 11:31, Mike Rapoport wrote:
> > Hi Justin,
> >
> > On Wed, Jul 29, 2020 at 08:27:58AM +0000, Justin He wrote:
> >> Hi David
> >>>>
> >>>> Without this series, if qemu creates a 4G bytes nvdimm device, we can
> >>> only
> >>>> use 2G bytes for dax pmem(kmem) in the worst case.
> >>>> e.g.
> >>>> 240000000-33fdfffff : Persistent Memory
> >>>> We can only use the memblock between [240000000, 2ffffffff] due to
> the
> >>> hard
> >>>> limitation. It wastes too much memory space.
> >>>>
> >>>> Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative,
> but
> >>> there
> >>>> are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb,
> >>>> SPARSEMEM_VMEMMAP, page bits in struct page ...
> >>>>
> >>>> Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem
> >>> alignment
> >>>> with memory_block_size_bytes().
> >>>>
> >>>> Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device.
> dax
> >>> pmem
> >>>> can be used as ram with smaller gap. Also the kmem hotplug add/remove
> >>> are both
> >>>> tested on arm64/x86 guest.
> >>>>
> >>>
> >>> Hi,
> >>>
> >>> I am not convinced this use case is worth such hacks (that’s what it
> is)
> >>> for now. On real machines pmem is big - your example (losing 50% is
> >>> extreme).
> >>>
> >>> I would much rather want to see the section size on arm64 reduced. I
> >>> remember there were patches and that at least with a base page size of
> 4k
> >>> it can be reduced drastically (64k base pages are more problematic due
> to
> >>> the ridiculous THP size of 512M). But it could be that a section size
> >>> of 512 MB is possible on all configs right now.
> >>
> >> Yes, I once investigated how to reduce the section size on arm64
> >> thoroughly:
> >> There are many constraints on reducing SECTION_SIZE_BITS
> >> 1. Given page->flags bits are limited, SECTION_SIZE_BITS can't be
> >> reduced too
> >> much.
> >> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, section id will not be
> counted
> >> into page->flags.
> >> 3. MAX_ORDER depends on SECTION_SIZE_BITS
> >> - 3.1 mmzone.h
> >> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
> >> #error Allocator MAX_ORDER exceeds SECTION_SIZE
> >> #endif
> >> - 3.2 hugepage_init()
> >> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
> >>
> >> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
> >> SECTION_SIZE_BITS can be reduced to 27.
> >> But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
> >> Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS
> can not
> >> be reduced to 27.
> >>
> >> In short, if we consider reducing SECTION_SIZE_BITS on arm64, the
> >> Kconfig
> >> might become very complicated, e.g. we still need to consider the case
> >> of ARM64_16K_PAGES.
> >
> > It is not necessary to pollute Kconfig with that.
> > arch/arm64/include/asm/sparsemem.h can have something like
> >
> > #ifdef CONFIG_ARM64_64K_PAGES
> > #define SPARSE_SECTION_SIZE 29
> > #elif defined(CONFIG_ARM64_16K_PAGES)
> > #define SPARSE_SECTION_SIZE 28
> > #elif defined(CONFIG_ARM64_4K_PAGES)
> > #define SPARSE_SECTION_SIZE 27
> > #else
> > #error
> > #endif
>
> ack
Thanks, David and Mike. I will discuss this further with Arm internally
regarding a thorough section size change.

--
Cheers,
Justin (Jia He)