Hello,
I've been trying to use the 'movablecore=' kernel command line option to create
a ZONE_MOVABLE memory zone on my x86_64 systems, and have noticed that
offlining the resulting ZONE_MOVABLE area consistently fails in my setups
because that zone contains unmovable pages. My testing has been in an x86_64
QEMU VM with a single NUMA node and 4G, 8G or 16G of memory, all of which fail
100% of the time.
Digging into it a bit, these unmovable pages are Reserved pages which were
allocated in early boot as part of the memblock allocator. Many of these
allocations are for data structures for the SPARSEMEM memory model, including
'struct mem_section' objects. These memblock allocations can be tracked by
setting the 'memblock=debug' kernel command line parameter, and are marked as
reserved in:
memmap_init_reserved_pages()
reserve_bootmem_region()
With the command line params 'movablecore=256M memblock=debug' and a v6.5.0-rc2
kernel I get the following on my 4G system:
# lsmem --split ZONES --output-all
RANGE                                  SIZE  STATE REMOVABLE BLOCK NODE   ZONES
0x0000000000000000-0x0000000007ffffff  128M online       yes     0    0    None
0x0000000008000000-0x00000000bfffffff  2.9G online       yes  1-23    0   DMA32
0x0000000100000000-0x000000012fffffff  768M online       yes 32-37    0  Normal
0x0000000130000000-0x000000013fffffff  256M online       yes 38-39    0 Movable
Memory block size: 128M
Total online memory: 4G
Total offline memory: 0B
And when I try to offline memory block 39, I get:
# echo 0 > /sys/devices/system/memory/memory39/online
bash: echo: write error: Device or resource busy
with dmesg saying:
[ 57.439849] page:0000000076a3e320 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13ff00
[ 57.444073] flags: 0x1fffffc0001000(reserved|node=0|zone=3|lastcpupid=0x1fffff)
[ 57.447301] page_type: 0xffffffff()
[ 57.448754] raw: 001fffffc0001000 ffffdd6384ffc008 ffffdd6384ffc008 0000000000000000
[ 57.450383] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
[ 57.452011] page dumped because: unmovable page
Looking back at the memblock allocations, I can see that the physical address for
pfn:0x13ff00 was used in a memblock allocation:
[ 0.395180] memblock_reserve: [0x000000013ff00000-0x000000013ffbffff] memblock_alloc_range_nid+0xe0/0x150
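As a quick cross-check (my own arithmetic, not from the logs), converting that pfn to a physical address shows it landing inside both the memblock reservation above and the Movable range from the lsmem output:

```shell
# Sanity check: pfn 0x13ff00 -> physical address, assuming the x86_64
# 4 KiB base page size (PAGE_SHIFT == 12).
pfn=0x13ff00
phys=$(( pfn << 12 ))   # 0x13ff00000

# inclusive range check: in_range VALUE START END
in_range() { [ $(( $2 <= $1 && $1 <= $3 )) -eq 1 ]; }

# memblock reservation from the memblock=debug line above
in_range "$phys" 0x13ff00000 0x13ffbffff && echo "inside the memblock reservation"
# Movable range (memory blocks 38-39) from the lsmem output above
in_range "$phys" 0x130000000 0x13fffffff && echo "inside ZONE_MOVABLE"
```

Both checks pass, which matches the "unmovable page" dump seen when offlining.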
The full dmesg output can be found here: https://pastebin.com/cNztqa4u
The 'movablecore=' command line parameter is handled in
'find_zone_movable_pfns_for_nodes()', which decides where ZONE_MOVABLE should
start and end. Currently ZONE_MOVABLE is always located at the end of a NUMA
node.
The issue is that the memblock allocator and the processing of the movablecore=
command line parameter don't know about one another, and in my x86_64 testing
they both always use memory at the end of the NUMA node and have collisions.
From several comments in the code I believe that this is a known issue:
https://elixir.bootlin.com/linux/v6.5-rc2/source/mm/page_isolation.c#L59
/*
* Both, bootmem allocations and memory holes are marked
* PG_reserved and are unmovable. We can even have unmovable
* allocations inside ZONE_MOVABLE, for example when
* specifying "movablecore".
*/
https://elixir.bootlin.com/linux/v6.5-rc2/source/include/linux/mmzone.h#L765
* 2. memblock allocations: kernelcore/movablecore setups might create
* situations where ZONE_MOVABLE contains unmovable allocations
* after boot. Memory offlining and allocations fail early.
We check for these unmovable pages by scanning for 'PageReserved()' in the area
we are trying to offline, which happens in has_unmovable_pages().
Interestingly, the boot timing works out like this:
1. Allocate memblock areas to set up the SPARSEMEM model.
[ 0.369990] Call Trace:
[ 0.370404] <TASK>
[ 0.370759] ? dump_stack_lvl+0x43/0x60
[ 0.371410] ? sparse_init_nid+0x2dc/0x560
[ 0.372116] ? sparse_init+0x346/0x450
[ 0.372755] ? paging_init+0xa/0x20
[ 0.373349] ? setup_arch+0xa6a/0xfc0
[ 0.373970] ? slab_is_available+0x5/0x20
[ 0.374651] ? start_kernel+0x5e/0x770
[ 0.375290] ? x86_64_start_reservations+0x14/0x30
[ 0.376109] ? x86_64_start_kernel+0x71/0x80
[ 0.376835] ? secondary_startup_64_no_verify+0x167/0x16b
[ 0.377755] </TASK>
2. Process movablecore= kernel command line parameter and set up memory zones
[ 0.489382] Call Trace:
[ 0.489818] <TASK>
[ 0.490187] ? dump_stack_lvl+0x43/0x60
[ 0.490873] ? free_area_init+0x115/0xc80
[ 0.491588] ? __printk_cpu_sync_put+0x5/0x30
[ 0.492354] ? dump_stack_lvl+0x48/0x60
[ 0.493002] ? sparse_init_nid+0x2dc/0x560
[ 0.493697] ? zone_sizes_init+0x60/0x80
[ 0.494361] ? setup_arch+0xa6a/0xfc0
[ 0.494981] ? slab_is_available+0x5/0x20
[ 0.495674] ? start_kernel+0x5e/0x770
[ 0.496312] ? x86_64_start_reservations+0x14/0x30
[ 0.497123] ? x86_64_start_kernel+0x71/0x80
[ 0.497847] ? secondary_startup_64_no_verify+0x167/0x16b
[ 0.498768] </TASK>
3. Mark memblock areas as Reserved.
[ 0.761136] Call Trace:
[ 0.761534] <TASK>
[ 0.761876] dump_stack_lvl+0x43/0x60
[ 0.762474] reserve_bootmem_region+0x1e/0x170
[ 0.763201] memblock_free_all+0xe3/0x250
[ 0.763862] ? swiotlb_init_io_tlb_mem.constprop.0+0x11a/0x130
[ 0.764812] ? swiotlb_init_remap+0x195/0x2c0
[ 0.765519] mem_init+0x19/0x1b0
[ 0.766047] mm_core_init+0x9c/0x3d0
[ 0.766630] start_kernel+0x264/0x770
[ 0.767229] x86_64_start_reservations+0x14/0x30
[ 0.767987] x86_64_start_kernel+0x71/0x80
[ 0.768666] secondary_startup_64_no_verify+0x167/0x16b
[ 0.769534] </TASK>
So, during ZONE_MOVABLE setup we currently can't do the same
has_unmovable_pages() scan looking for PageReserved() to check for overlap
because the pages have not yet been marked as Reserved.
I do think that we need to fix this collision between ZONE_MOVABLE and memmap
allocations, because this issue essentially makes the movablecore= kernel
command line parameter useless in many cases, as the ZONE_MOVABLE region it
creates will often actually be unmovable.
Here are the options I currently see for resolution:
1. Change the way ZONE_MOVABLE memory is allocated so that it is allocated from
the beginning of the NUMA node instead of the end. This should fix my use case,
but again is prone to breakage in other configurations (# of NUMA nodes, other
architectures) where ZONE_MOVABLE and memblock allocations might overlap. I
think that this should be relatively straightforward and low risk, though.
2. Make the code which processes the movablecore= command line option aware of
the memblock allocations, and have it choose a region for ZONE_MOVABLE which
does not have these allocations. This might be done by checking for
PageReserved() as we do with offlining memory, though that will take some boot
time reordering, or we'll have to figure out the overlap in another way. This
may also result in us having two ZONE_NORMAL zones for a given NUMA node, with
a ZONE_MOVABLE section in between them. I'm not sure if this is allowed? If
we can get it working, this seems like the most correct solution to me, but
also the most difficult and risky because it involves significant changes in
the code for memory setup at early boot.
Am I missing anything, are there other solutions we should consider, or do you
have an opinion on which solution we should pursue?
Thanks,
- Ross
Hi,
On Tue, Jul 18, 2023 at 04:01:06PM -0600, Ross Zwisler wrote:
> Hello,
>
> I've been trying to use the 'movablecore=' kernel command line option to create
> a ZONE_MOVABLE memory zone on my x86_64 systems, and have noticed that
> offlining the resulting ZONE_MOVABLE area consistently fails in my setups
> because that zone contains unmovable pages. My testing has been in an x86_64
> QEMU VM with a single NUMA node and 4G, 8G or 16G of memory, all of which fail
> 100% of the time.
>
> Digging into it a bit, these unmovable pages are Reserved pages which were
> allocated in early boot as part of the memblock allocator. Many of these
> allocations are for data structures for the SPARSEMEM memory model, including
> 'struct mem_section' objects. These memblock allocations can be tracked by
> setting the 'memblock=debug' kernel command line parameter, and are marked as
> reserved in:
>
> memmap_init_reserved_pages()
> reserve_bootmem_region()
>
> With the command line params 'movablecore=256M memblock=debug' and a v6.5.0-rc2
> kernel I get the following on my 4G system:
>
> # lsmem --split ZONES --output-all
> RANGE                                  SIZE  STATE REMOVABLE BLOCK NODE   ZONES
> 0x0000000000000000-0x0000000007ffffff  128M online       yes     0    0    None
> 0x0000000008000000-0x00000000bfffffff  2.9G online       yes  1-23    0   DMA32
> 0x0000000100000000-0x000000012fffffff  768M online       yes 32-37    0  Normal
> 0x0000000130000000-0x000000013fffffff  256M online       yes 38-39    0 Movable
>
> Memory block size: 128M
> Total online memory: 4G
> Total offline memory: 0B
>
> And when I try to offline memory block 39, I get:
>
> # echo 0 > /sys/devices/system/memory/memory39/online
> bash: echo: write error: Device or resource busy
>
> with dmesg saying:
>
> [ 57.439849] page:0000000076a3e320 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13ff00
> [ 57.444073] flags: 0x1fffffc0001000(reserved|node=0|zone=3|lastcpupid=0x1fffff)
> [ 57.447301] page_type: 0xffffffff()
> [ 57.448754] raw: 001fffffc0001000 ffffdd6384ffc008 ffffdd6384ffc008 0000000000000000
> [ 57.450383] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> [ 57.452011] page dumped because: unmovable page
>
> Looking back at the memblock allocations, I can see that the physical address for
> pfn:0x13ff00 was used in a memblock allocation:
>
> [ 0.395180] memblock_reserve: [0x000000013ff00000-0x000000013ffbffff] memblock_alloc_range_nid+0xe0/0x150
>
> The full dmesg output can be found here: https://pastebin.com/cNztqa4u
>
> The 'movablecore=' command line parameter is handled in
> 'find_zone_movable_pfns_for_nodes()', which decides where ZONE_MOVABLE should
> start and end. Currently ZONE_MOVABLE is always located at the end of a NUMA
> node.
>
> The issue is that the memblock allocator and the processing of the movablecore=
> command line parameter don't know about one another, and in my x86_64 testing
> they both always use memory at the end of the NUMA node and have collisions.
>
> From several comments in the code I believe that this is a known issue:
>
> https://elixir.bootlin.com/linux/v6.5-rc2/source/mm/page_isolation.c#L59
> /*
> * Both, bootmem allocations and memory holes are marked
> * PG_reserved and are unmovable. We can even have unmovable
> * allocations inside ZONE_MOVABLE, for example when
> * specifying "movablecore".
> */
>
> https://elixir.bootlin.com/linux/v6.5-rc2/source/include/linux/mmzone.h#L765
> * 2. memblock allocations: kernelcore/movablecore setups might create
> * situations where ZONE_MOVABLE contains unmovable allocations
> * after boot. Memory offlining and allocations fail early.
>
> We check for these unmovable pages by scanning for 'PageReserved()' in the area
> we are trying to offline, which happens in has_unmovable_pages().
>
> Interestingly, the boot timing works out like this:
>
> 1. Allocate memblock areas to set up the SPARSEMEM model.
> [ 0.369990] Call Trace:
> [ 0.370404] <TASK>
> [ 0.370759] ? dump_stack_lvl+0x43/0x60
> [ 0.371410] ? sparse_init_nid+0x2dc/0x560
> [ 0.372116] ? sparse_init+0x346/0x450
> [ 0.372755] ? paging_init+0xa/0x20
> [ 0.373349] ? setup_arch+0xa6a/0xfc0
> [ 0.373970] ? slab_is_available+0x5/0x20
> [ 0.374651] ? start_kernel+0x5e/0x770
> [ 0.375290] ? x86_64_start_reservations+0x14/0x30
> [ 0.376109] ? x86_64_start_kernel+0x71/0x80
> [ 0.376835] ? secondary_startup_64_no_verify+0x167/0x16b
> [ 0.377755] </TASK>
>
> 2. Process movablecore= kernel command line parameter and set up memory zones
> [ 0.489382] Call Trace:
> [ 0.489818] <TASK>
> [ 0.490187] ? dump_stack_lvl+0x43/0x60
> [ 0.490873] ? free_area_init+0x115/0xc80
> [ 0.491588] ? __printk_cpu_sync_put+0x5/0x30
> [ 0.492354] ? dump_stack_lvl+0x48/0x60
> [ 0.493002] ? sparse_init_nid+0x2dc/0x560
> [ 0.493697] ? zone_sizes_init+0x60/0x80
> [ 0.494361] ? setup_arch+0xa6a/0xfc0
> [ 0.494981] ? slab_is_available+0x5/0x20
> [ 0.495674] ? start_kernel+0x5e/0x770
> [ 0.496312] ? x86_64_start_reservations+0x14/0x30
> [ 0.497123] ? x86_64_start_kernel+0x71/0x80
> [ 0.497847] ? secondary_startup_64_no_verify+0x167/0x16b
> [ 0.498768] </TASK>
>
> 3. Mark memblock areas as Reserved.
> [ 0.761136] Call Trace:
> [ 0.761534] <TASK>
> [ 0.761876] dump_stack_lvl+0x43/0x60
> [ 0.762474] reserve_bootmem_region+0x1e/0x170
> [ 0.763201] memblock_free_all+0xe3/0x250
> [ 0.763862] ? swiotlb_init_io_tlb_mem.constprop.0+0x11a/0x130
> [ 0.764812] ? swiotlb_init_remap+0x195/0x2c0
> [ 0.765519] mem_init+0x19/0x1b0
> [ 0.766047] mm_core_init+0x9c/0x3d0
> [ 0.766630] start_kernel+0x264/0x770
> [ 0.767229] x86_64_start_reservations+0x14/0x30
> [ 0.767987] x86_64_start_kernel+0x71/0x80
> [ 0.768666] secondary_startup_64_no_verify+0x167/0x16b
> [ 0.769534] </TASK>
>
> So, during ZONE_MOVABLE setup we currently can't do the same
> has_unmovable_pages() scan looking for PageReserved() to check for overlap
> because the pages have not yet been marked as Reserved.
>
> I do think that we need to fix this collision between ZONE_MOVABLE and memmap
> allocations, because this issue essentially makes the movablecore= kernel
> command line parameter useless in many cases, as the ZONE_MOVABLE region it
> creates will often actually be unmovable.
>
> Here are the options I currently see for resolution:
>
> 1. Change the way ZONE_MOVABLE memory is allocated so that it is allocated from
> the beginning of the NUMA node instead of the end. This should fix my use case,
> but again is prone to breakage in other configurations (# of NUMA nodes, other
> architectures) where ZONE_MOVABLE and memblock allocations might overlap. I
> think that this should be relatively straightforward and low risk, though.
>
> 2. Make the code which processes the movablecore= command line option aware of
> the memblock allocations, and have it choose a region for ZONE_MOVABLE which
> does not have these allocations. This might be done by checking for
> PageReserved() as we do with offlining memory, though that will take some boot
> time reordering, or we'll have to figure out the overlap in another way. This
> may also result in us having two ZONE_NORMAL zones for a given NUMA node, with
> a ZONE_MOVABLE section in between them. I'm not sure if this is allowed? If
> we can get it working, this seems like the most correct solution to me, but
> also the most difficult and risky because it involves significant changes in
> the code for memory setup at early boot.
>
> Am I missing anything, are there other solutions we should consider, or do you
> have an opinion on which solution we should pursue?
I'd add
3. Switch memblock to use bottom-up allocations. Historically memblock
allocated memory from the top to avoid corrupting the kernel image and to
avoid exhausting precious ZONE_DMA. I believe we can use bottom-up
allocations with the lower limit of memblock allocations set to 16M.
With the hack below no memblock allocations will end up in ZONE_MOVABLE:
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 16babff771bd..5e940f057dd4 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1116,6 +1116,7 @@ void __init setup_arch(char **cmdline_p)
memblock_set_current_limit(ISA_END_ADDRESS);
e820__memblock_setup();
+ memblock_set_bottom_up(true);
/*
* Needs to run after memblock setup because it needs the physical
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2aadb2019b4f..ed1e14a2a62d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -660,16 +660,6 @@ static int __init numa_init(int (*init_func)(void))
if (ret < 0)
return ret;
- /*
- * We reset memblock back to the top-down direction
- * here because if we configured ACPI_NUMA, we have
- * parsed SRAT in init_func(). It is ok to have the
- * reset here even if we did't configure ACPI_NUMA
- * or acpi numa init fails and fallbacks to dummy
- * numa init.
- */
- memblock_set_bottom_up(false);
-
ret = numa_cleanup_meminfo(&numa_meminfo);
if (ret < 0)
return ret;
diff --git a/mm/memblock.c b/mm/memblock.c
index 3feafea06ab2..7ba040bf8da2 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1388,6 +1388,7 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
bool exact_nid)
{
enum memblock_flags flags = choose_memblock_flags();
+ phys_addr_t min = SZ_16M;
phys_addr_t found;
if (WARN_ONCE(nid == MAX_NUMNODES, "Usage of MAX_NUMNODES is deprecated. Use NUMA_NO_NODE instead\n"))
@@ -1400,13 +1401,13 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
}
again:
- found = memblock_find_in_range_node(size, align, start, end, nid,
+ found = memblock_find_in_range_node(size, align, min, end, nid,
flags);
if (found && !memblock_reserve(found, size))
goto done;
if (nid != NUMA_NO_NODE && !exact_nid) {
- found = memblock_find_in_range_node(size, align, start,
+ found = memblock_find_in_range_node(size, align, min,
end, NUMA_NO_NODE,
flags);
if (found && !memblock_reserve(found, size))
@@ -1420,6 +1421,11 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
goto again;
}
+ if (start < min) {
+ min = start;
+ goto again;
+ }
+
return 0;
done:
> Thanks,
> - Ross
--
Sincerely yours,
Mike.
On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
[...]
> I do think that we need to fix this collision between ZONE_MOVABLE and memmap
> allocations, because this issue essentially makes the movablecore= kernel
> command line parameter useless in many cases, as the ZONE_MOVABLE region it
> creates will often actually be unmovable.
movablecore is kinda hack and I would be more inclined to get rid of it
rather than build more into it. Could you be more specific about your
use case?
> Here are the options I currently see for resolution:
>
> 1. Change the way ZONE_MOVABLE memory is allocated so that it is allocated from
> the beginning of the NUMA node instead of the end. This should fix my use case,
> but again is prone to breakage in other configurations (# of NUMA nodes, other
> architectures) where ZONE_MOVABLE and memblock allocations might overlap. I
> think that this should be relatively straightforward and low risk, though.
>
> 2. Make the code which processes the movablecore= command line option aware of
> the memblock allocations, and have it choose a region for ZONE_MOVABLE which
> does not have these allocations. This might be done by checking for
> PageReserved() as we do with offlining memory, though that will take some boot
> time reordering, or we'll have to figure out the overlap in another way. This
> may also result in us having two ZONE_NORMAL zones for a given NUMA node, with
> a ZONE_MOVABLE section in between them. I'm not sure if this is allowed?
Yes, this is no problem. Zones are allowed to be sparse.
> If
> we can get it working, this seems like the most correct solution to me, but
> also the most difficult and risky because it involves significant changes in
> the code for memory setup at early boot.
>
> Am I missing anything, are there other solutions we should consider, or do you
> have an opinion on which solution we should pursue?
If this really needs to be addressed then 2) is certainly a more robust
approach.
--
Michal Hocko
SUSE Labs
On Wed 19-07-23 10:59:52, Mike Rapoport wrote:
> On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote:
> > On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
> > [...]
> > > I do think that we need to fix this collision between ZONE_MOVABLE and memmap
> > > allocations, because this issue essentially makes the movablecore= kernel
> > > command line parameter useless in many cases, as the ZONE_MOVABLE region it
> > > creates will often actually be unmovable.
> >
> > movablecore is kinda hack and I would be more inclined to get rid of it
> > rather than build more into it. Could you be more specific about your
> > use case?
> >
> > > Here are the options I currently see for resolution:
> > >
> > > 1. Change the way ZONE_MOVABLE memory is allocated so that it is allocated from
> > > the beginning of the NUMA node instead of the end. This should fix my use case,
> > > but again is prone to breakage in other configurations (# of NUMA nodes, other
> > > architectures) where ZONE_MOVABLE and memblock allocations might overlap. I
> > > think that this should be relatively straightforward and low risk, though.
> > >
> > > 2. Make the code which processes the movablecore= command line option aware of
> > > the memblock allocations, and have it choose a region for ZONE_MOVABLE which
> > > does not have these allocations. This might be done by checking for
> > > PageReserved() as we do with offlining memory, though that will take some boot
> > > time reordering, or we'll have to figure out the overlap in another way. This
> > > may also result in us having two ZONE_NORMAL zones for a given NUMA node, with
> > > a ZONE_MOVABLE section in between them. I'm not sure if this is allowed?
> >
> > Yes, this is no problem. Zones are allowed to be sparse.
>
> The current initialization order is roughly
>
> * very early initialization with some memblock allocations
> * determine zone locations and sizes
> * initialize memory map
> - memblock_alloc(lots of memory)
> * lots of unrelated initializations that may allocate memory
> * release free pages from memblock to the buddy allocator
>
> With 2) we can make sure the memory map and early allocations won't be in
> the ZONE_MOVABLE, but we'll still may have reserved pages there.
Yes, this will always be fragile. If the specific placement of the movable
memory is not important and the only thing that matters is the size and
numa locality then an easier to maintain solution would be to simply
offline enough memory blocks very early in the userspace bring up and
online it back as movable. If offlining fails just try another
memblock. This doesn't require any kernel code change.
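For what it's worth, that userspace flow could look roughly like this (a sketch of my own, untested on real hardware; the sysfs directory is parameterized so it can be exercised against a fake tree):

```shell
#!/bin/bash
# Sketch of the userspace-only approach: offline memory blocks early in
# bring-up and bring them back online as movable. Blocks that refuse to
# offline (e.g. -EBUSY due to unmovable pages) are simply skipped.
#
# make_movable NBLOCKS [SYSFS_DIR] -> prints how many blocks were converted
make_movable() {
    local wanted=$1 sysfs=${2:-/sys/devices/system/memory}
    local converted=0 state

    for state in "$sysfs"/memory*/state; do
        [ "$converted" -ge "$wanted" ] && break
        # The write fails if the block cannot be offlined; try the next one.
        if echo offline 2>/dev/null > "$state"; then
            echo online_movable > "$state"
            converted=$(( converted + 1 ))
        fi
    done
    echo "$converted"
}
```

e.g. `make_movable 2` from an early boot script would give back 256M of movable memory on the 4G configuration above.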
--
Michal Hocko
SUSE Labs
On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote:
> On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
> [...]
> > I do think that we need to fix this collision between ZONE_MOVABLE and memmap
> > allocations, because this issue essentially makes the movablecore= kernel
> > command line parameter useless in many cases, as the ZONE_MOVABLE region it
> > creates will often actually be unmovable.
>
> movablecore is kinda hack and I would be more inclined to get rid of it
> rather than build more into it. Could you be more specific about your
> use case?
>
> > Here are the options I currently see for resolution:
> >
> > 1. Change the way ZONE_MOVABLE memory is allocated so that it is allocated from
> > the beginning of the NUMA node instead of the end. This should fix my use case,
> > but again is prone to breakage in other configurations (# of NUMA nodes, other
> > architectures) where ZONE_MOVABLE and memblock allocations might overlap. I
> > think that this should be relatively straightforward and low risk, though.
> >
> > 2. Make the code which processes the movablecore= command line option aware of
> > the memblock allocations, and have it choose a region for ZONE_MOVABLE which
> > does not have these allocations. This might be done by checking for
> > PageReserved() as we do with offlining memory, though that will take some boot
> > time reordering, or we'll have to figure out the overlap in another way. This
> > may also result in us having two ZONE_NORMAL zones for a given NUMA node, with
> > a ZONE_MOVABLE section in between them. I'm not sure if this is allowed?
>
> Yes, this is no problem. Zones are allowed to be sparse.
The current initialization order is roughly
* very early initialization with some memblock allocations
* determine zone locations and sizes
* initialize memory map
- memblock_alloc(lots of memory)
* lots of unrelated initializations that may allocate memory
* release free pages from memblock to the buddy allocator
With 2) we can make sure the memory map and early allocations won't be in
the ZONE_MOVABLE, but we'll still may have reserved pages there.
> > If
> > we can get it working, this seems like the most correct solution to me, but
> > also the most difficult and risky because it involves significant changes in
> > the code for memory setup at early boot.
> >
> > Am I missing anything, are there other solutions we should consider, or do you
> > have an opinion on which solution we should pursue?
>
> If this really needs to be addressed then 2) is certainly a more robust
> approach.
> --
> Michal Hocko
> SUSE Labs
--
Sincerely yours,
Mike.
On 19.07.23 10:06, Michal Hocko wrote:
> On Wed 19-07-23 10:59:52, Mike Rapoport wrote:
>> On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote:
>>> On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
>>> [...]
>>>> I do think that we need to fix this collision between ZONE_MOVABLE and memmap
>>>> allocations, because this issue essentially makes the movablecore= kernel
>>>> command line parameter useless in many cases, as the ZONE_MOVABLE region it
>>>> creates will often actually be unmovable.
>>>
>>> movablecore is kinda hack and I would be more inclined to get rid of it
>>> rather than build more into it. Could you be more specific about your
>>> use case?
>>>
>>>> Here are the options I currently see for resolution:
>>>>
>>>> 1. Change the way ZONE_MOVABLE memory is allocated so that it is allocated from
>>>> the beginning of the NUMA node instead of the end. This should fix my use case,
>>>> but again is prone to breakage in other configurations (# of NUMA nodes, other
>>>> architectures) where ZONE_MOVABLE and memblock allocations might overlap. I
>>>> think that this should be relatively straightforward and low risk, though.
>>>>
>>>> 2. Make the code which processes the movablecore= command line option aware of
>>>> the memblock allocations, and have it choose a region for ZONE_MOVABLE which
>>>> does not have these allocations. This might be done by checking for
>>>> PageReserved() as we do with offlining memory, though that will take some boot
>>>> time reordering, or we'll have to figure out the overlap in another way. This
>>>> may also result in us having two ZONE_NORMAL zones for a given NUMA node, with
>>>> a ZONE_MOVABLE section in between them. I'm not sure if this is allowed?
>>>
>>> Yes, this is no problem. Zones are allowed to be sparse.
>>
>> The current initialization order is roughly
>>
>> * very early initialization with some memblock allocations
>> * determine zone locations and sizes
>> * initialize memory map
>> - memblock_alloc(lots of memory)
>> * lots of unrelated initializations that may allocate memory
>> * release free pages from memblock to the buddy allocator
>>
>> With 2) we can make sure the memory map and early allocations won't be in
>> the ZONE_MOVABLE, but we'll still may have reserved pages there.
>
> Yes, this will always be fragile. If the specific placement of the movable
> memory is not important and the only thing that matters is the size and
> numa locality then an easier to maintain solution would be to simply
> offline enough memory blocks very early in the userspace bring up and
> online it back as movable. If offlining fails just try another
> memblock. This doesn't require any kernel code change.
As an alternative, we might use the "memmap=nn[KMG]!ss[KMG]" [1]
parameter to mark some memory as protected.
That memory can then be configured as a devdax device and onlined to
ZONE_MOVABLE (dev/dax).
[1]
https://docs.pmem.io/persistent-memory/getting-started-guide/creating-development-environments/linux-environments/linux-memmap
--
Cheers,
David / dhildenb
On Wed, Jul 19, 2023 at 08:44:34AM +0300, Mike Rapoport wrote:
> 3. Switch memblock to use bottom-up allocations. Historically memblock
> allocated memory from the top to avoid corrupting the kernel image and to
> avoid exhausting precious ZONE_DMA. I believe we can use bottom-up
> allocations with the lower limit of memblock allocations set to 16M.
>
> With the hack below no memblock allocations will end up in ZONE_MOVABLE:
Yep, I've confirmed that for my use cases at least this does the trick, thank
you! I had thought about moving the memblock allocations, but had no idea it
was (basically) already supported and thought it'd be much riskier than just
adjusting where ZONE_MOVABLE lived.
Is there a reason for this to not be a real option for users, maybe via a
kernel config knob or something? I'm happy to explore other options in this
thread, but this is doing the trick so far.
On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote:
> On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
> [...]
> > I do think that we need to fix this collision between ZONE_MOVABLE and memmap
> > allocations, because this issue essentially makes the movablecore= kernel
> > command line parameter useless in many cases, as the ZONE_MOVABLE region it
> > creates will often actually be unmovable.
>
> movablecore is kinda hack and I would be more inclined to get rid of it
> rather than build more into it. Could you be more specific about your
> use case?
The problem that I'm trying to solve is that I'd like to be able to get kernel
core dumps off machines (chromebooks) so that we can debug crashes. Because
the memory used by the crash kernel ("crashkernel=" kernel command line
option) is consumed the entire time the machine is booted, there is a strong
motivation to keep the crash kernel as small and as simple as possible. To
this end I'm trying to get away without SSD drivers, not having to worry about
encryption on the SSDs, etc.
So, the rough plan right now is:
1) During boot set aside some memory that won't contain kernel allocations.
I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways.
We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE
region (or whatever non-kernel region) will be presented as PMEM in the crash
kernel. This is done with the memmap=nn[KMG]!ss[KMG] kernel command line
parameter passed to the crash kernel.
So, in my sample 4G VM system, I see:
# lsmem --split ZONES --output-all
RANGE                                  SIZE  STATE REMOVABLE BLOCK NODE   ZONES
0x0000000000000000-0x0000000007ffffff  128M online       yes     0    0    None
0x0000000008000000-0x00000000bfffffff  2.9G online       yes  1-23    0   DMA32
0x0000000100000000-0x000000012fffffff  768M online       yes 32-37    0  Normal
0x0000000130000000-0x000000013fffffff  256M online       yes 38-39    0 Movable
Memory block size: 128M
Total online memory: 4G
Total offline memory: 0B
so I'll pass "memmap=256M!0x130000000" to the crash kernel.
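(The memmap= argument here can be derived mechanically from the lsmem range; a quick sketch of my own arithmetic:)

```shell
# Derive the crash kernel memmap= argument from the lsmem Movable range
# 0x0000000130000000-0x000000013fffffff shown above.
start=0x130000000
end=0x13fffffff
size_m=$(( (end - start + 1) >> 20 ))          # bytes -> MiB: 256
printf 'memmap=%dM!0x%x\n' "$size_m" "$start"  # memmap=256M!0x130000000
```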
2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set
aside only contains user data, which we don't want to store anyway. We make a
filesystem in there, and create a kernel crash dump using 'makedumpfile':
mkfs.ext4 /dev/pmem0
mount /dev/pmem0 /mnt
makedumpfile -c -d 31 /proc/vmcore /mnt/kdump
We then set up the next full kernel boot to also have this same PMEM region,
using the same memmap kernel parameter. We reboot back into a full kernel.
3) The next full kernel will be a normal boot with a full networking stack,
SSD drivers, disk encryption, etc. We mount up our PMEM filesystem, pull out
the kdump and either store it somewhere persistent or upload it somewhere. We
can then unmount the PMEM and reconfigure it back to system ram so that the
live system isn't missing memory.
ndctl create-namespace --reconfig=namespace0.0 -m devdax -f
daxctl reconfigure-device --mode=system-ram dax0.0
This is the flow I'm trying to support, and have mostly working in a VM,
except up until now makedumpfile would crash because all the memblock
structures it needed were in the PMEM area that I had just wiped out by making
a new filesystem. :)
Do you see any blockers that would make this infeasible?
For the non-kernel memory, is the ZONE_MOVABLE path that I'm currently
pursuing the best option, or would we be better off with your suggestion
elsewhere in this thread:
> If the specific placement of the movable memory is not important and the only
> thing that matters is the size and numa locality then an easier to maintain
> solution would be to simply offline enough memory blocks very early in the
> userspace bring up and online it back as movable. If offlining fails just
> try another memblock. This doesn't require any kernel code change.
If this 2nd way is preferred, can you point me to how I can offline the memory
blocks & then get them back later in boot?
Thanks for the help!
- Ross
On Wed, Jul 19, 2023 at 10:14:59AM +0200, David Hildenbrand wrote:
> On 19.07.23 10:06, Michal Hocko wrote:
> > On Wed 19-07-23 10:59:52, Mike Rapoport wrote:
> > > On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote:
> > > > On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
> > > > [...]
> > > > > I do think that we need to fix this collision between ZONE_MOVABLE and memmap
> > > > > allocations, because this issue essentially makes the movablecore= kernel
> > > > > command line parameter useless in many cases, as the ZONE_MOVABLE region it
> > > > > creates will often actually be unmovable.
> > > >
> > > > movablecore is kind of a hack and I would be more inclined to get rid of it
> > > > rather than build more into it. Could you be more specific about your
> > > > use case?
> > > >
> > > > > Here are the options I currently see for resolution:
> > > > >
> > > > > 1. Change the way ZONE_MOVABLE memory is allocated so that it is allocated from
> > > > > the beginning of the NUMA node instead of the end. This should fix my use case,
> > > > > but again is prone to breakage in other configurations (# of NUMA nodes, other
> > > > > architectures) where ZONE_MOVABLE and memblock allocations might overlap. I
> > > > > think that this should be relatively straightforward and low risk, though.
> > > > >
> > > > > 2. Make the code which processes the movablecore= command line option aware of
> > > > > the memblock allocations, and have it choose a region for ZONE_MOVABLE which
> > > > > does not have these allocations. This might be done by checking for
> > > > > PageReserved() as we do with offlining memory, though that will take some boot
> > > > > time reordering, or we'll have to figure out the overlap in another way. This
> > > > > may also result in us having two ZONE_NORMAL zones for a given NUMA node, with
> > > > > a ZONE_MOVABLE section in between them. I'm not sure if this is allowed?
> > > >
> > > > Yes, this is no problem. Zones are allowed to be sparse.
> > >
> > > The current initialization order is roughly
> > >
> > > * very early initialization with some memblock allocations
> > > * determine zone locations and sizes
> > > * initialize memory map
> > > - memblock_alloc(lots of memory)
> > > * lots of unrelated initializations that may allocate memory
> > > * release free pages from memblock to the buddy allocator
> > >
> > > With 2) we can make sure the memory map and early allocations won't be in
> > > the ZONE_MOVABLE, but we may still have reserved pages there.
> >
> > Yes this will always be fragile. If the specific placement of the movable
> > memory is not important and the only thing that matters is the size and
> > numa locality then an easier to maintain solution would be to simply
> > offline enough memory blocks very early in the userspace bring up and
> > online it back as movable. If offlining fails just try another
> > memblock. This doesn't require any kernel code change.
>
> As an alternative, we might use the "memmap=nn[KMG]!ss[KMG]" [1] parameter
> to mark some memory as protected.
>
> That memory can then be configured as devdax device and online to
> ZONE_MOVABLE (dev/dax).
>
> [1] https://docs.pmem.io/persistent-memory/getting-started-guide/creating-development-environments/linux-environments/linux-memmap
I've previously been reconfiguring devdax memory like this:
ndctl create-namespace --reconfig=namespace0.0 -m devdax -f
daxctl reconfigure-device --mode=system-ram dax0.0
Is this how you've been doing it, or is there something else I should
consider?
I just sent mail to Michal outlining my use case, hopefully it makes sense.
I had thought about using 'memmap=' in the first kernel and the worry was that
I'd have to support many different machines with different memory
configurations, and have to hard-code memory offsets and lengths for the
various memmap= kernel command line parameters. If I can make ZONE_MOVABLE
work that's preferable because the kernel will choose the correct usermem-only
region for me, and then I can just use that region for the crash kernel and
3rd kernel boots.
[CC Jiri Bohac]
On Wed 19-07-23 16:48:21, Ross Zwisler wrote:
> On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote:
> > On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
> > [...]
> > > I do think that we need to fix this collision between ZONE_MOVABLE and memmap
> > > allocations, because this issue essentially makes the movablecore= kernel
> > > command line parameter useless in many cases, as the ZONE_MOVABLE region it
> > > creates will often actually be unmovable.
> >
> > movablecore is kind of a hack and I would be more inclined to get rid of it
> > rather than build more into it. Could you be more specific about your
> > use case?
>
> The problem that I'm trying to solve is that I'd like to be able to get kernel
> core dumps off machines (chromebooks) so that we can debug crashes. Because
> the memory used by the crash kernel ("crashkernel=" kernel command line
> option) is consumed the entire time the machine is booted, there is a strong
> motivation to keep the crash kernel as small and as simple as possible. To
> this end I'm trying to get away without SSD drivers, not having to worry about
> encryption on the SSDs, etc.
This is something Jiri is also looking into.
> So, the rough plan right now is:
>
> 1) During boot set aside some memory that won't contain kernel allocations.
> I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways.
>
> We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE
> region (or whatever non-kernel region) will be set aside as PMEM in the crash
> kernel. This is done with the memmap=nn[KMG]!ss[KMG] kernel command line
> parameter passed to the crash kernel.
>
> So, in my sample 4G VM system, I see:
>
> # lsmem --split ZONES --output-all
> RANGE SIZE STATE REMOVABLE BLOCK NODE ZONES
> 0x0000000000000000-0x0000000007ffffff 128M online yes 0 0 None
> 0x0000000008000000-0x00000000bfffffff 2.9G online yes 1-23 0 DMA32
> 0x0000000100000000-0x000000012fffffff 768M online yes 32-37 0 Normal
> 0x0000000130000000-0x000000013fffffff 256M online yes 38-39 0 Movable
>
> Memory block size: 128M
> Total online memory: 4G
> Total offline memory: 0B
>
> so I'll pass "memmap=256M!0x130000000" to the crash kernel.
>
> 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set
> aside only contains user data, which we don't want to store anyway. We make a
> filesystem in there, and create a kernel crash dump using 'makedumpfile':
>
> mkfs.ext4 /dev/pmem0
> mount /dev/pmem0 /mnt
> makedumpfile -c -d 31 /proc/vmcore /mnt/kdump
>
> We then set up the next full kernel boot to also have this same PMEM region,
> using the same memmap kernel parameter. We reboot back into a full kernel.
>
> 3) The next full kernel will be a normal boot with a full networking stack,
> SSD drivers, disk encryption, etc. We mount up our PMEM filesystem, pull out
> the kdump and either store it somewhere persistent or upload it somewhere. We
> can then unmount the PMEM and reconfigure it back to system ram so that the
> live system isn't missing memory.
>
> ndctl create-namespace --reconfig=namespace0.0 -m devdax -f
> daxctl reconfigure-device --mode=system-ram dax0.0
>
> This is the flow I'm trying to support, and have mostly working in a VM,
> except up until now makedumpfile would crash because all the memblock
> structures it needed were in the PMEM area that I had just wiped out by making
> a new filesystem. :)
>
> Do you see any blockers that would make this infeasible?
>
> For the non-kernel memory, is the ZONE_MOVABLE path that I'm currently
> pursuing the best option, or would we be better off with your suggestion
> elsewhere in this thread:
The main problem I would see with this approach is that the small
Movable zone you set aside would be easily consumed and reclaimed. That
could generate some unexpected performance artifacts. We used to see
those with small zones or large differences in zone sizes in the past.
But functionally this should work; at least I do not see any fundamental
problems.
Jiri is looking at this from a slightly different angle. Very broadly,
he would like to have a dedicated CMA pool and reuse that for the
kernel memory (dropping anything sitting there) when crashing.
GFP_MOVABLE allocations can use CMA pools.
> > If the specific placement of the movable memory is not important and the only
> > thing that matters is the size and numa locality then an easier to maintain
> > solution would be to simply offline enough memory blocks very early in the
> > userspace bring up and online it back as movable. If offlining fails just
> > try another memblock. This doesn't require any kernel code change.
>
> If this 2nd way is preferred, can you point me to how I can offline the memory
> blocks & then get them back later in boot?
/bin/echo offline > /sys/devices/system/memory/memory$NUM/state && \
echo online_movable > /sys/devices/system/memory/memory$NUM/state
more in Documentation/admin-guide/mm/memory-hotplug.rst
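The two sysfs writes above, plus the "if offlining fails just try another
block" retry, can be scripted. A sketch in Python (the function name is
hypothetical; the sysfs layout follows memory-hotplug.rst, and the base path
is parameterized only so the logic can be exercised outside a live system):

```python
from pathlib import Path

SYSFS_MEM = Path("/sys/devices/system/memory")

def reserve_movable_blocks(count: int, base: Path = SYSFS_MEM) -> list[str]:
    """Offline up to 'count' memory blocks and bring them back online as
    movable. Blocks that refuse to offline (e.g. EBUSY because they hold
    unmovable pages) are simply skipped, as suggested above."""
    converted = []
    # Sort numerically: 'memory4' before 'memory38', unlike lexical order.
    blocks = sorted(base.glob("memory[0-9]*"),
                    key=lambda p: int(p.name[len("memory"):]))
    for block in blocks:
        if len(converted) == count:
            break
        state = block / "state"
        try:
            state.write_text("offline")
            if state.read_text().strip() != "offline":
                continue  # kernel rejected the transition without an error
            state.write_text("online_movable")
            converted.append(block.name)
        except OSError:
            continue  # offlining failed; try the next block
    return converted
```

Run early in userspace bring-up, before the blocks fill up with unmovable
allocations, per the suggestion quoted above.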
--
Michal Hocko
SUSE Labs
On Wed 19-07-23 16:48:21, Ross Zwisler wrote:
> On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote:
> > On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
> > [...]
> > > I do think that we need to fix this collision between ZONE_MOVABLE and memmap
> > > allocations, because this issue essentially makes the movablecore= kernel
> > > command line parameter useless in many cases, as the ZONE_MOVABLE region it
> > > creates will often actually be unmovable.
> >
> > movablecore is kind of a hack and I would be more inclined to get rid of it
> > rather than build more into it. Could you be more specific about your
> > use case?
>
> The problem that I'm trying to solve is that I'd like to be able to get kernel
> core dumps off machines (chromebooks) so that we can debug crashes. Because
> the memory used by the crash kernel ("crashkernel=" kernel command line
> option) is consumed the entire time the machine is booted, there is a strong
> motivation to keep the crash kernel as small and as simple as possible. To
> this end I'm trying to get away without SSD drivers, not having to worry about
> encryption on the SSDs, etc.
>
> So, the rough plan right now is:
>
> 1) During boot set aside some memory that won't contain kernel allocations.
> I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways.
>
> We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE
> region (or whatever non-kernel region) will be set aside as PMEM in the crash
> kernel. This is done with the memmap=nn[KMG]!ss[KMG] kernel command line
> parameter passed to the crash kernel.
>
> So, in my sample 4G VM system, I see:
>
> # lsmem --split ZONES --output-all
> RANGE SIZE STATE REMOVABLE BLOCK NODE ZONES
> 0x0000000000000000-0x0000000007ffffff 128M online yes 0 0 None
> 0x0000000008000000-0x00000000bfffffff 2.9G online yes 1-23 0 DMA32
> 0x0000000100000000-0x000000012fffffff 768M online yes 32-37 0 Normal
> 0x0000000130000000-0x000000013fffffff 256M online yes 38-39 0 Movable
>
> Memory block size: 128M
> Total online memory: 4G
> Total offline memory: 0B
>
> so I'll pass "memmap=256M!0x130000000" to the crash kernel.
>
> 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set
> aside only contains user data, which we don't want to store anyway. We make a
> filesystem in there, and create a kernel crash dump using 'makedumpfile':
>
> mkfs.ext4 /dev/pmem0
> mount /dev/pmem0 /mnt
> makedumpfile -c -d 31 /proc/vmcore /mnt/kdump
>
> We then set up the next full kernel boot to also have this same PMEM region,
> using the same memmap kernel parameter. We reboot back into a full kernel.
Btw. How do you ensure that the address range doesn't get reinitialized
by POST? Do you rely on kexec boot here?
--
Michal Hocko
SUSE Labs
On Wed, Jul 19, 2023 at 04:26:04PM -0600, Ross Zwisler wrote:
> On Wed, Jul 19, 2023 at 08:44:34AM +0300, Mike Rapoport wrote:
> > 3. Switch memblock to use bottom up allocations. Historically memblock
> > allocated memory from the top to avoid corrupting the kernel image and to
> > avoid exhausting precious ZONE_DMA. I believe we can use bottom-up
> > allocations with lower limit of memblock allocations set to 16M.
> >
> > With the hack below no memblock allocations will end up in ZONE_MOVABLE:
>
> Yep, I've confirmed that for my use cases at least this does the trick, thank
> you! I had thought about moving the memblock allocations, but had no idea it
> was (basically) already supported and thought it'd be much riskier than just
> adjusting where ZONE_MOVABLE lived.
>
> Is there a reason for this to not be a real option for users, maybe per a
> kernel config knob or something? I'm happy to explore other options in this
> thread, but this is doing the trick so far.
I think we can make x86 always use bottom up.
To do this properly we'd need to set lower limit for memblock allocations
to MAX_DMA32_PFN and allow fallback below it so that early allocations
won't eat memory from ZONE_DMA32.
Aside from x86 boot being fragile in general I don't see why this wouldn't
work.
--
Sincerely yours,
Mike.
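(The patch referred to as "the hack below" is not reproduced in this excerpt.
As a toy model only, nothing like the real memblock internals, the difference
between top-down and bottom-up placement can be illustrated as follows; the
collision Ross hit is the top-down case landing inside the Movable region at
the end of the node:)

```python
def toy_memblock_alloc(free_start, free_end, size, bottom_up, min_addr=0):
    """First-fit toy of early-boot placement within one free range
    [free_start, free_end): top-down returns the highest fitting address,
    bottom-up the lowest one at or above min_addr. A crude model for
    illustration, not kernel code."""
    lo = max(free_start, min_addr)
    if free_end - lo < size:
        return None
    return lo if bottom_up else free_end - size

MiB = 1 << 20
free_start, free_end = 16 * MiB, 4096 * MiB   # toy single-node memory map
movable_start = free_end - 256 * MiB          # movablecore=256M at the top

# Top-down (the historical default): the allocation lands in ZONE_MOVABLE.
top = toy_memblock_alloc(free_start, free_end, 4 * MiB, bottom_up=False)
assert top >= movable_start

# Bottom-up with a lower limit (sparing precious low memory): it does not.
bot = toy_memblock_alloc(free_start, free_end, 4 * MiB, bottom_up=True,
                         min_addr=16 * MiB)
assert bot + 4 * MiB <= movable_start
```

Mike's suggestion of MAX_DMA32_PFN as the lower limit corresponds to min_addr
here, with a fallback below it when the constrained allocation fails.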
On Thu, Jul 20, 2023 at 02:13:25PM +0200, Michal Hocko wrote:
> On Wed 19-07-23 16:48:21, Ross Zwisler wrote:
> > On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote:
> > > On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
> > > [...]
> > > > I do think that we need to fix this collision between ZONE_MOVABLE and memmap
> > > > allocations, because this issue essentially makes the movablecore= kernel
> > > > command line parameter useless in many cases, as the ZONE_MOVABLE region it
> > > > creates will often actually be unmovable.
> > >
> > > movablecore is kind of a hack and I would be more inclined to get rid of it
> > > rather than build more into it. Could you be more specific about your
> > > use case?
> >
> > The problem that I'm trying to solve is that I'd like to be able to get kernel
> > core dumps off machines (chromebooks) so that we can debug crashes. Because
> > the memory used by the crash kernel ("crashkernel=" kernel command line
> > option) is consumed the entire time the machine is booted, there is a strong
> > motivation to keep the crash kernel as small and as simple as possible. To
> > this end I'm trying to get away without SSD drivers, not having to worry about
> > encryption on the SSDs, etc.
> >
> > So, the rough plan right now is:
> >
> > 1) During boot set aside some memory that won't contain kernel allocations.
> > I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways.
> >
> > We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE
> > region (or whatever non-kernel region) will be set aside as PMEM in the crash
> > kernel. This is done with the memmap=nn[KMG]!ss[KMG] kernel command line
> > parameter passed to the crash kernel.
> >
> > So, in my sample 4G VM system, I see:
> >
> > # lsmem --split ZONES --output-all
> > RANGE SIZE STATE REMOVABLE BLOCK NODE ZONES
> > 0x0000000000000000-0x0000000007ffffff 128M online yes 0 0 None
> > 0x0000000008000000-0x00000000bfffffff 2.9G online yes 1-23 0 DMA32
> > 0x0000000100000000-0x000000012fffffff 768M online yes 32-37 0 Normal
> > 0x0000000130000000-0x000000013fffffff 256M online yes 38-39 0 Movable
> >
> > Memory block size: 128M
> > Total online memory: 4G
> > Total offline memory: 0B
> >
> > so I'll pass "memmap=256M!0x130000000" to the crash kernel.
> >
> > 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set
> > aside only contains user data, which we don't want to store anyway. We make a
> > filesystem in there, and create a kernel crash dump using 'makedumpfile':
> >
> > mkfs.ext4 /dev/pmem0
> > mount /dev/pmem0 /mnt
> > makedumpfile -c -d 31 /proc/vmcore /mnt/kdump
> >
> > We then set up the next full kernel boot to also have this same PMEM region,
> > using the same memmap kernel parameter. We reboot back into a full kernel.
>
> Btw. How do you ensure that the address range doesn't get reinitialized
> by POST? Do you rely on kexec boot here?
I've been working under the assumption that I do need to do a full reboot (not
just another kexec boot) so that the devices in the system (NICs, disks, etc)
are all reinitialized and don't carry over bad state from the crash.
I do know about the 'reset_devices' kernel command line parameter, but wasn't
sure that would be enough. From looking around it seems like this is very
driver + device dependent, so maybe I just need to test more.
In any case, you're right: if we do a full reboot and go through POST, it's
system-dependent whether BIOS/UEFI/Coreboot/etc. will zero memory, and if it
does, this feature won't work unless we kexec to the 3rd kernel.
I've also heard concerns around whether a full reboot will cause the memory
controller to reinitialize and potentially cause memory bit flips or similar,
though I haven't yet seen this myself. Has anyone seen such bit flips /
memory corruption due to system reboot, or is this a non-issue in your
experience?
Lots to figure out, thanks for the help. :)
On Fri 21-07-23 14:20:09, Mike Rapoport wrote:
> On Wed, Jul 19, 2023 at 04:26:04PM -0600, Ross Zwisler wrote:
> > On Wed, Jul 19, 2023 at 08:44:34AM +0300, Mike Rapoport wrote:
> > > 3. Switch memblock to use bottom up allocations. Historically memblock
> > > allocated memory from the top to avoid corrupting the kernel image and to
> > > avoid exhausting precious ZONE_DMA. I believe we can use bottom-up
> > > allocations with lower limit of memblock allocations set to 16M.
> > >
> > > With the hack below no memblock allocations will end up in ZONE_MOVABLE:
> >
> > Yep, I've confirmed that for my use cases at least this does the trick, thank
> > you! I had thought about moving the memblock allocations, but had no idea it
> > was (basically) already supported and thought it'd be much riskier than just
> > adjusting where ZONE_MOVABLE lived.
> >
> > Is there a reason for this to not be a real option for users, maybe per a
> > kernel config knob or something? I'm happy to explore other options in this
> > thread, but this is doing the trick so far.
>
> I think we can make x86 always use bottom up.
>
> To do this properly we'd need to set lower limit for memblock allocations
> to MAX_DMA32_PFN and allow fallback below it so that early allocations
> won't eat memory from ZONE_DMA32.
>
> Aside from x86 boot being fragile in general I don't see why this wouldn't
> work.
This would add a very subtle dependency of the functionality on the specific
boot allocator behavior, and that is bad for long-term maintenance.
--
Michal Hocko
SUSE Labs
>> As an alternative, we might use the "memmap=nn[KMG]!ss[KMG]" [1] parameter
>> to mark some memory as protected.
>>
>> That memory can then be configured as devdax device and online to
>> ZONE_MOVABLE (dev/dax).
>>
>> [1] https://docs.pmem.io/persistent-memory/getting-started-guide/creating-development-environments/linux-environments/linux-memmap
>
> I've previously been reconfiguring devdax memory like this:
>
> ndctl create-namespace --reconfig=namespace0.0 -m devdax -f
> daxctl reconfigure-device --mode=system-ram dax0.0
>
> Is this how you've been doing it, or is there something else I should
> consider?
No, exactly like that.
>
> I just sent mail to Michal outlining my use case, hopefully it makes sense.
Yes, thanks for sharing, I'll dig deeper into that next.
>
> I had thought about using 'memmap=' in the first kernel and the worry was that
> I'd have to support many different machines with different memory
> configurations, and have to hard-code memory offsets and lengths for the
> various memmap= kernel command line parameters.
Indeed.
> If I can make ZONE_MOVABLE
> work that's preferable because the kernel will choose the correct usermem-only
> region for me, and then I can just use that region for the crash kernel and
> 3rd kernel boots.
It really sounds like you might be better off using CMA or
alloc_contig_pages().
The latter is unreliable, though, and the memory cannot be used for
other purposes once alloc_contig_pages() succeeded.
See arch/powerpc/platforms/powernv/memtrace.c for one user that needs to
set aside a lot of memory to store HW traces.
--
Cheers,
David / dhildenb
On 20.07.23 00:48, Ross Zwisler wrote:
> On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote:
>> On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
>> [...]
>>> I do think that we need to fix this collision between ZONE_MOVABLE and memmap
>>> allocations, because this issue essentially makes the movablecore= kernel
>>> command line parameter useless in many cases, as the ZONE_MOVABLE region it
>>> creates will often actually be unmovable.
>>
>> movablecore is kind of a hack and I would be more inclined to get rid of it
>> rather than build more into it. Could you be more specific about your
>> use case?
>
> The problem that I'm trying to solve is that I'd like to be able to get kernel
> core dumps off machines (chromebooks) so that we can debug crashes. Because
> the memory used by the crash kernel ("crashkernel=" kernel command line
> option) is consumed the entire time the machine is booted, there is a strong
> motivation to keep the crash kernel as small and as simple as possible. To
> this end I'm trying to get away without SSD drivers, not having to worry about
> encryption on the SSDs, etc.
Okay, so you intend to keep the crashkernel area as small as possible.
>
> So, the rough plan right now is:
>
> 1) During boot set aside some memory that won't contain kernel allocations.
> I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways.
>
> We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE
> region (or whatever non-kernel region) will be set aside as PMEM in the crash
> kernel. This is done with the memmap=nn[KMG]!ss[KMG] kernel command line
> parameter passed to the crash kernel.
>
> So, in my sample 4G VM system, I see:
>
> # lsmem --split ZONES --output-all
> RANGE SIZE STATE REMOVABLE BLOCK NODE ZONES
> 0x0000000000000000-0x0000000007ffffff 128M online yes 0 0 None
> 0x0000000008000000-0x00000000bfffffff 2.9G online yes 1-23 0 DMA32
> 0x0000000100000000-0x000000012fffffff 768M online yes 32-37 0 Normal
> 0x0000000130000000-0x000000013fffffff 256M online yes 38-39 0 Movable
>
> Memory block size: 128M
> Total online memory: 4G
> Total offline memory: 0B
>
> so I'll pass "memmap=256M!0x130000000" to the crash kernel.
>
> 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set
> aside only contains user data, which we don't want to store anyway.
I raised that in a different context already, but such assumptions are not
100% future proof IMHO. For example, we might at one point be able to
make user page tables movable and place them on there.
But yes, most kernel data structures (which you care about) will
probably never be movable and never end up on these regions.
> We make a
> filesystem in there, and create a kernel crash dump using 'makedumpfile':
>
> mkfs.ext4 /dev/pmem0
> mount /dev/pmem0 /mnt
> makedumpfile -c -d 31 /proc/vmcore /mnt/kdump
>
> We then set up the next full kernel boot to also have this same PMEM region,
> using the same memmap kernel parameter. We reboot back into a full kernel.
>
> 3) The next full kernel will be a normal boot with a full networking stack,
> SSD drivers, disk encryption, etc. We mount up our PMEM filesystem, pull out
> the kdump and either store it somewhere persistent or upload it somewhere. We
> can then unmount the PMEM and reconfigure it back to system ram so that the
> live system isn't missing memory.
>
> ndctl create-namespace --reconfig=namespace0.0 -m devdax -f
> daxctl reconfigure-device --mode=system-ram dax0.0
>
> This is the flow I'm trying to support, and have mostly working in a VM,
> except up until now makedumpfile would crash because all the memblock
> structures it needed were in the PMEM area that I had just wiped out by making
> a new filesystem. :)
Thinking out loud (and remembering that some architectures relocate the
crashkernel during kexec, if I am not wrong), maybe the following would
also work and make your setup eventually easier:
1) Don't reserve a crashkernel area in the traditional way, instead
reserve that area using CMA. It can be used for MOVABLE allocations.
2) Let kexec load the crashkernel+initrd into ordinary memory only
(consuming as much as you would need there).
3) On kexec, relocate the crashkernel+initrd into the CMA area
(overwriting any movable data in there).
4) In makedumpfile, don't dump any memory that falls into the
crashkernel area. It might already have been overwritten by the second
kernel.
Maybe that would allow you to make the crashkernel+initrd slightly
bigger (to include SSD drivers etc.) and have a bigger crashkernel area,
because while the crashkernel is armed it will only consume the
crashkernel+initrd size and not the overall crashkernel area size.
If that makes any sense :)
--
Cheers,
David / dhildenb
On Wed, Jul 26, 2023 at 09:49:12AM +0200, Michal Hocko wrote:
> On Fri 21-07-23 14:20:09, Mike Rapoport wrote:
> > On Wed, Jul 19, 2023 at 04:26:04PM -0600, Ross Zwisler wrote:
> > > On Wed, Jul 19, 2023 at 08:44:34AM +0300, Mike Rapoport wrote:
> > > > 3. Switch memblock to use bottom up allocations. Historically memblock
> > > > allocated memory from the top to avoid corrupting the kernel image and to
> > > > avoid exhausting precious ZONE_DMA. I believe we can use bottom-up
> > > > allocations with lower limit of memblock allocations set to 16M.
> > > >
> > > > With the hack below no memblock allocations will end up in ZONE_MOVABLE:
> > >
> > > Yep, I've confirmed that for my use cases at least this does the trick, thank
> > > you! I had thought about moving the memblock allocations, but had no idea it
> > > was (basically) already supported and thought it'd be much riskier than just
> > > adjusting where ZONE_MOVABLE lived.
> > >
> > > Is there a reason for this to not be a real option for users, maybe per a
> > > kernel config knob or something? I'm happy to explore other options in this
> > > thread, but this is doing the trick so far.
> >
> > I think we can make x86 always use bottom up.
> >
> > To do this properly we'd need to set lower limit for memblock allocations
> > to MAX_DMA32_PFN and allow fallback below it so that early allocations
> > won't eat memory from ZONE_DMA32.
> >
> > Aside from x86 boot being fragile in general I don't see why this wouldn't
> > work.
>
> This would add a very subtle dependency of the functionality on the specific
> boot allocator behavior, and that is bad for long-term maintenance.
What do you mean by "specific boot allocator behavior"?
Using a limit for allocations and then falling back to the entire available
memory if allocation fails within the limits?
> --
> Michal Hocko
> SUSE Labs
--
Sincerely yours,
Mike.
On Wed 26-07-23 13:48:45, Mike Rapoport wrote:
> On Wed, Jul 26, 2023 at 09:49:12AM +0200, Michal Hocko wrote:
> > On Fri 21-07-23 14:20:09, Mike Rapoport wrote:
> > > On Wed, Jul 19, 2023 at 04:26:04PM -0600, Ross Zwisler wrote:
> > > > On Wed, Jul 19, 2023 at 08:44:34AM +0300, Mike Rapoport wrote:
> > > > > 3. Switch memblock to use bottom up allocations. Historically memblock
> > > > > allocated memory from the top to avoid corrupting the kernel image and to
> > > > > avoid exhausting precious ZONE_DMA. I believe we can use bottom-up
> > > > > allocations with lower limit of memblock allocations set to 16M.
> > > > >
> > > > > With the hack below no memblock allocations will end up in ZONE_MOVABLE:
> > > >
> > > > Yep, I've confirmed that for my use cases at least this does the trick, thank
> > > > you! I had thought about moving the memblock allocations, but had no idea it
> > > > was (basically) already supported and thought it'd be much riskier than just
> > > > adjusting where ZONE_MOVABLE lived.
> > > >
> > > > Is there a reason for this to not be a real option for users, maybe per a
> > > > kernel config knob or something? I'm happy to explore other options in this
> > > > thread, but this is doing the trick so far.
> > >
> > > I think we can make x86 always use bottom up.
> > >
> > > To do this properly we'd need to set lower limit for memblock allocations
> > > to MAX_DMA32_PFN and allow fallback below it so that early allocations
> > > won't eat memory from ZONE_DMA32.
> > >
> > > Aside from x86 boot being fragile in general I don't see why this wouldn't
> > > work.
> >
> > This would add a very subtle dependency of the functionality on the specific
> > boot allocator behavior, and that is bad for long-term maintenance.
>
> What do you mean by "specific boot allocator behavior"?
I mean that building functionality on the expectation that the boot
allocator starts from low addresses is too fragile. This has
already caused some problems in the past, IIRC.
--
Michal Hocko
SUSE Labs
On 26.07.23 10:44, David Hildenbrand wrote:
> On 20.07.23 00:48, Ross Zwisler wrote:
>> On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote:
>>> On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
>>> [...]
>>>> I do think that we need to fix this collision between ZONE_MOVABLE and memmap
>>>> allocations, because this issue essentially makes the movablecore= kernel
>>>> command line parameter useless in many cases, as the ZONE_MOVABLE region it
>>>> creates will often actually be unmovable.
>>>
>>> movablecore is kind of a hack and I would be more inclined to get rid of it
>>> rather than build more into it. Could you be more specific about your
>>> use case?
>>
>> The problem that I'm trying to solve is that I'd like to be able to get kernel
>> core dumps off machines (chromebooks) so that we can debug crashes. Because
>> the memory used by the crash kernel ("crashkernel=" kernel command line
>> option) is consumed the entire time the machine is booted, there is a strong
>> motivation to keep the crash kernel as small and as simple as possible. To
>> this end I'm trying to get away without SSD drivers, not having to worry about
>> encryption on the SSDs, etc.
>
> Okay, so you intend to keep the crashkernel area as small as possible.
>
>>
>> So, the rough plan right now is:
>>
>> 1) During boot set aside some memory that won't contain kernel allocations.
>> I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways.
>>
>> We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE
>> region (or whatever non-kernel region) will be set aside as PMEM in the crash
>> kernel. This is done with the memmap=nn[KMG]!ss[KMG] kernel command line
>> parameter passed to the crash kernel.
>>
>> So, in my sample 4G VM system, I see:
>>
>> # lsmem --split ZONES --output-all
>> RANGE SIZE STATE REMOVABLE BLOCK NODE ZONES
>> 0x0000000000000000-0x0000000007ffffff 128M online yes 0 0 None
>> 0x0000000008000000-0x00000000bfffffff 2.9G online yes 1-23 0 DMA32
>> 0x0000000100000000-0x000000012fffffff 768M online yes 32-37 0 Normal
>> 0x0000000130000000-0x000000013fffffff 256M online yes 38-39 0 Movable
>>
>> Memory block size: 128M
>> Total online memory: 4G
>> Total offline memory: 0B
>>
>> so I'll pass "memmap=256M!0x130000000" to the crash kernel.
>>
>> 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set
>> aside only contains user data, which we don't want to store anyway.
>
> I raised that in a different context already, but such assumptions are not
> 100% future proof IMHO. For example, we might at one point be able to
> make user page tables movable and place them on there.
>
> But yes, most kernel data structures (which you care about) will
> probably never be movable and never end up on these regions.
>
>> We make a
>> filesystem in there, and create a kernel crash dump using 'makedumpfile':
>>
>> mkfs.ext4 /dev/pmem0
>> mount /dev/pmem0 /mnt
>> makedumpfile -c -d 31 /proc/vmcore /mnt/kdump
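For reference, the '-d 31' dump level passed to makedumpfile is a bitmask of
page types to exclude, per makedumpfile(8); a quick sketch decoding it:

```python
# makedumpfile's dump level is a bitmask of page types to exclude
# (bit meanings per makedumpfile(8)). Level 31 excludes all five,
# which is what keeps the dump small enough for the PMEM area.
EXCLUDE_BITS = {
    1: "zero pages",
    2: "cache pages (non-private)",
    4: "cache pages (private)",
    8: "user process data",
    16: "free pages",
}
level = 31
excluded = [name for bit, name in EXCLUDE_BITS.items() if level & bit]
print(excluded)
```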
>>
>> We then set up the next full kernel boot to also have this same PMEM region,
>> using the same memmap kernel parameter. We reboot back into a full kernel.
>>
>> 3) The next full kernel will be a normal boot with a full networking stack,
>> SSD drivers, disk encryption, etc. We mount up our PMEM filesystem, pull out
>> the kdump and either store it somewhere persistent or upload it somewhere. We
>> can then unmount the PMEM and reconfigure it back to system ram so that the
>> live system isn't missing memory.
>>
>> ndctl create-namespace --reconfig=namespace0.0 -m devdax -f
>> daxctl reconfigure-device --mode=system-ram dax0.0
>>
>> This is the flow I'm trying to support, and have mostly working in a VM,
>> except up until now makedumpfile would crash because all the memblock
>> structures it needed were in the PMEM area that I had just wiped out by making
>> a new filesystem. :)
>
>
> Thinking out loud (and remembering that some architectures relocate the
> crashkernel during kexec, if I am not wrong), maybe the following would
> also work and eventually make your setup easier:
>
> 1) Don't reserve a crashkernel area in the traditional way, instead
> reserve that area using CMA. It can be used for MOVABLE allocations.
>
> 2) Let kexec load the crashkernel+initrd into ordinary memory only
> (consuming as much as you would need there).
Oh, I realized that one might be able to place the kernel+initrd
directly in the area by allocating via CMA.
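A rough pseudocode sketch of that direction, modeled on the in-kernel CMA API
(cma_declare_contiguous() and cma_alloc() are the real kernel functions; the
glue around them is illustrative only and untested):

```
/* Early boot: reserve the area via CMA instead of crashkernel=, so it
 * keeps serving MOVABLE allocations until a crash kernel needs it. */
cma_declare_contiguous(0, SZ_256M, 0, 0, 0, false,
                       "crashkernel", &crash_cma);

/* At kexec time: claim the pages (migrating movable data out), then
 * relocate the crashkernel+initrd image into them. */
pages = cma_alloc(crash_cma, SZ_256M >> PAGE_SHIFT, 0, false);
relocate_crashkernel_into(pages);   /* hypothetical helper */
```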
--
Cheers,
David / dhildenb
On Wed, Jul 26, 2023 at 02:57:55PM +0200, Michal Hocko wrote:
> On Wed 26-07-23 13:48:45, Mike Rapoport wrote:
> > On Wed, Jul 26, 2023 at 09:49:12AM +0200, Michal Hocko wrote:
> > > On Fri 21-07-23 14:20:09, Mike Rapoport wrote:
> > > > On Wed, Jul 19, 2023 at 04:26:04PM -0600, Ross Zwisler wrote:
> > > > > On Wed, Jul 19, 2023 at 08:44:34AM +0300, Mike Rapoport wrote:
> > > > > > 3. Switch memblock to use bottom up allocations. Historically memblock
> > > > > > allocated memory from the top to avoid corrupting the kernel image and to
> > > > > > avoid exhausting precious ZONE_DMA. I believe we can use bottom-up
> > > > > > allocations with lower limit of memblock allocations set to 16M.
> > > > > >
> > > > > > With the hack below no memblock allocations will end up in ZONE_MOVABLE:
> > > > >
> > > > > Yep, I've confirmed that for my use cases at least this does the trick, thank
> > > > > you! I had thought about moving the memblock allocations, but had no idea it
> > > > > was (basically) already supported and thought it'd be much riskier than just
> > > > > adjusting where ZONE_MOVABLE lived.
> > > > >
> > > > > Is there a reason for this to not be a real option for users, maybe per a
> > > > > kernel config knob or something? I'm happy to explore other options in this
> > > > > thread, but this is doing the trick so far.
> > > >
> > > > I think we can make x86 always use bottom up.
> > > >
> > > > To do this properly we'd need to set lower limit for memblock allocations
> > > > to MAX_DMA32_PFN and allow fallback below it so that early allocations
> > > > won't eat memory from ZONE_DMA32.
> > > >
> > > > Aside from x86 boot being fragile in general I don't see why this wouldn't
> > > > work.
> > >
> > > This would add a very subtle dependency of a functionality on the specific
> > > boot allocator behavior and that is bad for long term maintenance.
> >
> > What do you mean by "specific boot allocator behavior"?
>
> I mean that the expectation that the boot allocator starts from low
> addresses and functionality depending on that is too fragile. This has
> already caused some problems in the past IIRC.
Well, any change in x86 boot sequence may cause all sorts of problems :)
We do some of the boot time allocations from low addresses when
movable_node is enabled and that is entirely implicit and buried deep
inside the code.
What I'm suggesting is to switch the allocations to bottom-up once and for
all, with an explicitly set lower limit and defined fallback semantics.
This might cause some bumps in the beginning, but I don't expect it to be a
maintenance problem in the long run.
And it will free higher memory from early allocations for all use cases, not
just this one.
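In pseudocode, something like the following (memblock_set_bottom_up() exists
in the kernel today; the explicit lower limit with fallback is the new part,
shown here with a hypothetical name):

```
/* x86 early boot, once and for all: */
memblock_set_bottom_up(true);

/* hypothetical knob: prefer allocations above MAX_DMA32_PFN, but fall
 * back below it rather than fail, so early allocations don't eat
 * ZONE_DMA32 needlessly */
memblock_set_low_limit(PFN_PHYS(MAX_DMA32_PFN));
```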
> --
> Michal Hocko
> SUSE Labs
--
Sincerely yours,
Mike.
On Wed 26-07-23 16:23:17, Mike Rapoport wrote:
> On Wed, Jul 26, 2023 at 02:57:55PM +0200, Michal Hocko wrote:
> > On Wed 26-07-23 13:48:45, Mike Rapoport wrote:
> > > On Wed, Jul 26, 2023 at 09:49:12AM +0200, Michal Hocko wrote:
> > > > On Fri 21-07-23 14:20:09, Mike Rapoport wrote:
> > > > > On Wed, Jul 19, 2023 at 04:26:04PM -0600, Ross Zwisler wrote:
> > > > > > On Wed, Jul 19, 2023 at 08:44:34AM +0300, Mike Rapoport wrote:
> > > > > > > 3. Switch memblock to use bottom up allocations. Historically memblock
> > > > > > > allocated memory from the top to avoid corrupting the kernel image and to
> > > > > > > avoid exhausting precious ZONE_DMA. I believe we can use bottom-up
> > > > > > > allocations with lower limit of memblock allocations set to 16M.
> > > > > > >
> > > > > > > With the hack below no memblock allocations will end up in ZONE_MOVABLE:
> > > > > >
> > > > > > Yep, I've confirmed that for my use cases at least this does the trick, thank
> > > > > > you! I had thought about moving the memblock allocations, but had no idea it
> > > > > > was (basically) already supported and thought it'd be much riskier than just
> > > > > > adjusting where ZONE_MOVABLE lived.
> > > > > >
> > > > > > Is there a reason for this to not be a real option for users, maybe per a
> > > > > > kernel config knob or something? I'm happy to explore other options in this
> > > > > > thread, but this is doing the trick so far.
> > > > >
> > > > > I think we can make x86 always use bottom up.
> > > > >
> > > > > To do this properly we'd need to set lower limit for memblock allocations
> > > > > to MAX_DMA32_PFN and allow fallback below it so that early allocations
> > > > > won't eat memory from ZONE_DMA32.
> > > > >
> > > > > Aside from x86 boot being fragile in general I don't see why this wouldn't
> > > > > work.
> > > >
> > > > This would add a very subtle dependency of a functionality on the specific
> > > > boot allocator behavior and that is bad for long term maintenance.
> > >
> > > What do you mean by "specific boot allocator behavior"?
> >
> > I mean that the expectation that the boot allocator starts from low
> > addresses and functionality depending on that is too fragile. This has
> > already caused some problems in the past IIRC.
>
> Well, any change in x86 boot sequence may cause all sorts of problems :)
>
> We do some of the boot time allocations from low addresses when
> movable_node is enabled and that is entirely implicit and buried deep
> inside the code.
>
> What I'm suggesting is to switch the allocations to bottom-up once and for
> all, with an explicitly set lower limit and defined fallback semantics.
>
> This might cause some bumps in the beginning, but I don't expect it to be a
> maintenance problem in the long run.
>
> And it will free higher memory from early allocations for all use cases, not
> just this one.
Higher memory is usually not a problem AFAIK. It is lowmem that is the
more scarce resource, because some HW might be constrained in which phys
address ranges are visible to it.
--
Michal Hocko
SUSE Labs
On Wed 26-07-23 10:44:21, David Hildenbrand wrote:
> On 20.07.23 00:48, Ross Zwisler wrote:
> > On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote:
> > > On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
> > > [...]
> > > > I do think that we need to fix this collision between ZONE_MOVABLE and memmap
> > > > allocations, because this issue essentially makes the movablecore= kernel
> > > > command line parameter useless in many cases, as the ZONE_MOVABLE region it
> > > > creates will often actually be unmovable.
> > >
> > > movablecore is kinda hack and I would be more inclined to get rid of it
> > > rather than build more into it. Could you be more specific about your
> > > use case?
> >
> > The problem that I'm trying to solve is that I'd like to be able to get kernel
> > core dumps off machines (chromebooks) so that we can debug crashes. Because
> > the memory used by the crash kernel ("crashkernel=" kernel command line
> > option) is consumed the entire time the machine is booted, there is a strong
> > motivation to keep the crash kernel as small and as simple as possible. To
> > this end I'm trying to get away without SSD drivers, not having to worry about
> > encryption on the SSDs, etc.
>
> Okay, so you intend to keep the crashkernel area as small as possible.
>
> >
> > So, the rough plan right now is:
> >
> > 1) During boot set aside some memory that won't contain kernel
> > allocations.
> > I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways.
> >
> > We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE
> > region (or whatever non-kernel region) will be set aside as PMEM in the crash
> > kernel. This is done with the memmap=nn[KMG]!ss[KMG] kernel command line
> > parameter passed to the crash kernel.
> >
> > So, in my sample 4G VM system, I see:
> >
> > # lsmem --split ZONES --output-all
> > RANGE                                   SIZE STATE REMOVABLE BLOCK NODE  ZONES
> > 0x0000000000000000-0x0000000007ffffff   128M online      yes     0    0   None
> > 0x0000000008000000-0x00000000bfffffff   2.9G online      yes  1-23    0  DMA32
> > 0x0000000100000000-0x000000012fffffff   768M online      yes 32-37    0 Normal
> > 0x0000000130000000-0x000000013fffffff   256M online      yes 38-39    0 Movable
> > Memory block size: 128M
> > Total online memory: 4G
> > Total offline memory: 0B
> >
> > so I'll pass "memmap=256M!0x130000000" to the crash kernel.
> >
> > 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set
> > aside only contains user data, which we don't want to store anyway.
>
> I raised that in a different context already, but such assumptions are not
> 100% future proof IMHO. For example, we might at one point be able to make
> user page tables movable and place them there.
>
> But yes, most kernel data structures (which you care about) will probably
> never be movable and never end up on these regions.
>
> > We make a
> > filesystem in there, and create a kernel crash dump using 'makedumpfile':
> >
> > mkfs.ext4 /dev/pmem0
> > mount /dev/pmem0 /mnt
> > makedumpfile -c -d 31 /proc/vmcore /mnt/kdump
> >
> > We then set up the next full kernel boot to also have this same PMEM region,
> > using the same memmap kernel parameter. We reboot back into a full kernel.
> >
> > 3) The next full kernel will be a normal boot with a full networking stack,
> > SSD drivers, disk encryption, etc. We mount up our PMEM filesystem, pull out
> > the kdump and either store it somewhere persistent or upload it somewhere. We
> > can then unmount the PMEM and reconfigure it back to system ram so that the
> > live system isn't missing memory.
> >
> > ndctl create-namespace --reconfig=namespace0.0 -m devdax -f
> > daxctl reconfigure-device --mode=system-ram dax0.0
> >
> > This is the flow I'm trying to support, and have mostly working in a VM,
> > except up until now makedumpfile would crash because all the memblock
> > structures it needed were in the PMEM area that I had just wiped out by making
> > a new filesystem. :)
>
>
> Thinking out loud (and remembering that some architectures relocate the
> crashkernel during kexec, if I am not wrong), maybe the following would
> also work and eventually make your setup easier:
>
> 1) Don't reserve a crashkernel area in the traditional way, instead reserve
> that area using CMA. It can be used for MOVABLE allocations.
>
> 2) Let kexec load the crashkernel+initrd into ordinary memory only
> (consuming as much as you would need there).
>
> 3) On kexec, relocate the crashkernel+initrd into the CMA area (overwriting
> any movable data in there)
>
> 4) In makedumpfile, don't dump any memory that falls into the crashkernel
> area. It might already have been overwritten by the second kernel
This is more or less what Jiri is looking into.
--
Michal Hocko
SUSE Labs
On 27.07.23 10:18, Michal Hocko wrote:
> On Wed 26-07-23 10:44:21, David Hildenbrand wrote:
>> On 20.07.23 00:48, Ross Zwisler wrote:
>>> On Wed, Jul 19, 2023 at 08:14:48AM +0200, Michal Hocko wrote:
>>>> On Tue 18-07-23 16:01:06, Ross Zwisler wrote:
>>>> [...]
>>>>> I do think that we need to fix this collision between ZONE_MOVABLE and memmap
>>>>> allocations, because this issue essentially makes the movablecore= kernel
>>>>> command line parameter useless in many cases, as the ZONE_MOVABLE region it
>>>>> creates will often actually be unmovable.
>>>>
>>>> movablecore is kinda hack and I would be more inclined to get rid of it
>>>> rather than build more into it. Could you be more specific about your
>>>> use case?
>>>
>>> The problem that I'm trying to solve is that I'd like to be able to get kernel
>>> core dumps off machines (chromebooks) so that we can debug crashes. Because
>>> the memory used by the crash kernel ("crashkernel=" kernel command line
>>> option) is consumed the entire time the machine is booted, there is a strong
>>> motivation to keep the crash kernel as small and as simple as possible. To
>>> this end I'm trying to get away without SSD drivers, not having to worry about
>>> encryption on the SSDs, etc.
>>
>> Okay, so you intend to keep the crashkernel area as small as possible.
>>
>>>
>>> So, the rough plan right now is:
>>>
>>> 1) During boot set aside some memory that won't contain kernel
>>> allocations.
>>> I'm trying to do this now with ZONE_MOVABLE, but I'm open to better ways.
>>>
>>> We set aside memory for a crash kernel & arm it so that the ZONE_MOVABLE
>>> region (or whatever non-kernel region) will be set aside as PMEM in the crash
>>> kernel. This is done with the memmap=nn[KMG]!ss[KMG] kernel command line
>>> parameter passed to the crash kernel.
>>>
>>> So, in my sample 4G VM system, I see:
>>>
>>> # lsmem --split ZONES --output-all
>>> RANGE                                   SIZE STATE REMOVABLE BLOCK NODE  ZONES
>>> 0x0000000000000000-0x0000000007ffffff   128M online      yes     0    0   None
>>> 0x0000000008000000-0x00000000bfffffff   2.9G online      yes  1-23    0  DMA32
>>> 0x0000000100000000-0x000000012fffffff   768M online      yes 32-37    0 Normal
>>> 0x0000000130000000-0x000000013fffffff   256M online      yes 38-39    0 Movable
>>> Memory block size: 128M
>>> Total online memory: 4G
>>> Total offline memory: 0B
>>>
>>> so I'll pass "memmap=256M!0x130000000" to the crash kernel.
>>>
>>> 2) When we hit a kernel crash, we know (hope?) that the PMEM region we've set
>>> aside only contains user data, which we don't want to store anyway.
>>
>> I raised that in a different context already, but such assumptions are not
>> 100% future proof IMHO. For example, we might at one point be able to make
>> user page tables movable and place them there.
>>
>> But yes, most kernel data structures (which you care about) will probably
>> never be movable and never end up on these regions.
>>
>>> We make a
>>> filesystem in there, and create a kernel crash dump using 'makedumpfile':
>>>
>>> mkfs.ext4 /dev/pmem0
>>> mount /dev/pmem0 /mnt
>>> makedumpfile -c -d 31 /proc/vmcore /mnt/kdump
>>>
>>> We then set up the next full kernel boot to also have this same PMEM region,
>>> using the same memmap kernel parameter. We reboot back into a full kernel.
>>>
>>> 3) The next full kernel will be a normal boot with a full networking stack,
>>> SSD drivers, disk encryption, etc. We mount up our PMEM filesystem, pull out
>>> the kdump and either store it somewhere persistent or upload it somewhere. We
>>> can then unmount the PMEM and reconfigure it back to system ram so that the
>>> live system isn't missing memory.
>>>
>>> ndctl create-namespace --reconfig=namespace0.0 -m devdax -f
>>> daxctl reconfigure-device --mode=system-ram dax0.0
>>>
>>> This is the flow I'm trying to support, and have mostly working in a VM,
>>> except up until now makedumpfile would crash because all the memblock
>>> structures it needed were in the PMEM area that I had just wiped out by making
>>> a new filesystem. :)
>>
>>
>> Thinking out loud (and remembering that some architectures relocate the
>> crashkernel during kexec, if I am not wrong), maybe the following would
>> also work and eventually make your setup easier:
>>
>> 1) Don't reserve a crashkernel area in the traditional way, instead reserve
>> that area using CMA. It can be used for MOVABLE allocations.
>>
>> 2) Let kexec load the crashkernel+initrd into ordinary memory only
>> (consuming as much as you would need there).
>>
>> 3) On kexec, relocate the crashkernel+initrd into the CMA area (overwriting
>> any movable data in there)
>>
>> 4) In makedumpfile, don't dump any memory that falls into the crashkernel
>> area. It might already have been overwritten by the second kernel
>
> This is more or less what Jiri is looking into.
>
Ah, very nice.
--
Cheers,
David / dhildenb