Both kernelcore= and movablecore= can be used to define the amount of
ZONE_NORMAL and ZONE_MOVABLE on a system, respectively. This requires
the system memory capacity to be known when specifying the command line,
however.
This introduces the ability to define both kernelcore= and movablecore=
as a percentage of total system memory. This is convenient for systems
software that wants to define the amount of ZONE_MOVABLE, for example, as
a proportion of a system's memory rather than a hardcoded byte value.
To define the percentage, the final character of the parameter should be
a '%'.
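
As a rough illustration (values are hypothetical), the following are now
all accepted on the command line:

	kernelcore=20%          (20% of RAM reserved for non-movable allocations)
	movablecore=50%         (half of RAM placed in ZONE_MOVABLE)
	kernelcore=4G           (the existing byte-based form, unchanged)
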
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 44 ++++++++++++-------------
mm/page_alloc.c | 43 +++++++++++++++++++-----
2 files changed, 57 insertions(+), 30 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1825,30 +1825,30 @@
keepinitrd [HW,ARM]
kernelcore= [KNL,X86,IA-64,PPC]
- Format: nn[KMGTPE] | "mirror"
- This parameter
- specifies the amount of memory usable by the kernel
- for non-movable allocations. The requested amount is
- spread evenly throughout all nodes in the system. The
- remaining memory in each node is used for Movable
- pages. In the event, a node is too small to have both
- kernelcore and Movable pages, kernelcore pages will
- take priority and other nodes will have a larger number
- of Movable pages. The Movable zone is used for the
- allocation of pages that may be reclaimed or moved
- by the page migration subsystem. This means that
- HugeTLB pages may not be allocated from this zone.
- Note that allocations like PTEs-from-HighMem still
- use the HighMem zone if it exists, and the Normal
- zone if it does not.
-
- Instead of specifying the amount of memory (nn[KMGTPE]),
- you can specify "mirror" option. In case "mirror"
+ Format: nn[KMGTPE] | nn% | "mirror"
+ This parameter specifies the amount of memory usable by
+ the kernel for non-movable allocations. The requested
+ amount is spread evenly throughout all nodes in the
+ system as ZONE_NORMAL. The remaining memory is used for
+ movable memory in its own zone, ZONE_MOVABLE. In the
+ event that a node is too small to have both ZONE_NORMAL and
+ ZONE_MOVABLE, kernelcore memory will take priority and
+ other nodes will have a larger ZONE_MOVABLE.
+
+ ZONE_MOVABLE is used for the allocation of pages that
+ may be reclaimed or moved by the page migration
+ subsystem. This means that HugeTLB pages may not be
+ allocated from this zone. Note that allocations like
+ PTEs-from-HighMem still use the HighMem zone if it
+ exists, and the Normal zone if it does not.
+
+ It is possible to specify the exact amount of memory in
+ the form of "nn[KMGTPE]", a percentage of total system
+ memory in the form of "nn%", or "mirror". If "mirror"
option is specified, mirrored (reliable) memory is used
for non-movable allocations and remaining memory is used
- for Movable pages. nn[KMGTPE] and "mirror" are exclusive,
- so you can NOT specify nn[KMGTPE] and "mirror" at the same
- time.
+ for Movable pages. "nn[KMGTPE]", "nn%", and "mirror"
+ are exclusive, so you cannot specify multiple forms.
kgdbdbgp= [KGDB,HW] kgdb over EHCI usb debug port.
Format: <Controller#>[,poll interval]
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -272,7 +272,9 @@ static unsigned long __meminitdata dma_reserve;
static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
static unsigned long __initdata required_kernelcore;
+static unsigned long required_kernelcore_percent __initdata;
static unsigned long __initdata required_movablecore;
+static unsigned long required_movablecore_percent __initdata;
static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
static bool mirrored_kernelcore;
@@ -6477,7 +6479,18 @@ static void __init find_zone_movable_pfns_for_nodes(void)
}
 	/*
-	 * If movablecore=nn[KMG] was specified, calculate what size of
+	 * If kernelcore=nn% or movablecore=nn% was specified, calculate the
+	 * amount of necessary memory.
+	 */
+	if (required_kernelcore_percent)
+		required_kernelcore = (totalpages * 100 * required_kernelcore_percent) /
+				      10000UL;
+	if (required_movablecore_percent)
+		required_movablecore = (totalpages * 100 * required_movablecore_percent) /
+				      10000UL;
+
+	/*
+	 * If movablecore= was specified, calculate what size of
 	 * kernelcore that corresponds so that memory usable for
 	 * any allocation type is evenly spread. If both kernelcore
 	 * and movablecore are specified, then the value of kernelcore
@@ -6717,18 +6730,30 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
zero_resv_unavail();
}
-static int __init cmdline_parse_core(char *p, unsigned long *core)
+static int __init cmdline_parse_core(char *p, unsigned long *core,
+				     unsigned long *percent)
 {
 	unsigned long long coremem;
+	char *endptr;
+
 	if (!p)
 		return -EINVAL;
 
-	coremem = memparse(p, &p);
-	*core = coremem >> PAGE_SHIFT;
-
-	/* Paranoid check that UL is enough for the coremem value */
-	WARN_ON((coremem >> PAGE_SHIFT) > ULONG_MAX);
+	/* Value may be a percentage of total memory, otherwise bytes */
+	coremem = simple_strtoull(p, &endptr, 0);
+	if (*endptr == '%') {
+		/* Paranoid check for percent values greater than 100 */
+		WARN_ON(coremem > 100);
+
+		*percent = coremem;
+	} else {
+		coremem = memparse(p, &p);
+
+		/* Paranoid check that UL is enough for the coremem value */
+		WARN_ON((coremem >> PAGE_SHIFT) > ULONG_MAX);
+
+		*core = coremem >> PAGE_SHIFT;
+		*percent = 0UL;
+	}
 	return 0;
 }
@@ -6744,7 +6769,8 @@ static int __init cmdline_parse_kernelcore(char *p)
return 0;
}
- return cmdline_parse_core(p, &required_kernelcore);
+ return cmdline_parse_core(p, &required_kernelcore,
+ &required_kernelcore_percent);
}
/*
@@ -6753,7 +6779,8 @@ static int __init cmdline_parse_kernelcore(char *p)
*/
static int __init cmdline_parse_movablecore(char *p)
{
- return cmdline_parse_core(p, &required_movablecore);
+ return cmdline_parse_core(p, &required_movablecore,
+ &required_movablecore_percent);
}
early_param("kernelcore", cmdline_parse_kernelcore);
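
As a standalone illustration of the percentage computation above, here is a
userspace sketch (not kernel code; the machine size is made up and a 64-bit
unsigned long is assumed). Note how the intermediate factor of 100 leaves
room to later accept hundredths of a percent in the same expression:

	#include <stdio.h>

	int main(void)
	{
		unsigned long totalpages = 1UL << 30;	/* a 4TB machine in 4k pages */
		unsigned long percent = 20;		/* kernelcore=20% */

		/* Same integer math as the patch. */
		unsigned long required_kernelcore =
				(totalpages * 100 * percent) / 10000UL;

		/* 2^18 4k pages per GB */
		printf("kernelcore = %lu pages (~%lu GB)\n",
		       required_kernelcore, required_kernelcore >> 18);
		return 0;
	}

This prints "kernelcore = 214748364 pages (~819 GB)", i.e. 20% of 4TB.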
mirrored_kernelcore can be in __meminitdata, so move it there.
At the same time, fix up the section specifiers to come after the variable
names, per checkpatch.
---
mm/page_alloc.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -264,19 +264,19 @@ int min_free_kbytes = 1024;
int user_min_free_kbytes = -1;
int watermark_scale_factor = 10;
-static unsigned long __meminitdata nr_kernel_pages;
-static unsigned long __meminitdata nr_all_pages;
-static unsigned long __meminitdata dma_reserve;
+static unsigned long nr_kernel_pages __meminitdata;
+static unsigned long nr_all_pages __meminitdata;
+static unsigned long dma_reserve __meminitdata;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
-static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
-static unsigned long __initdata required_kernelcore;
+static unsigned long arch_zone_lowest_possible_pfn[MAX_NR_ZONES] __meminitdata;
+static unsigned long arch_zone_highest_possible_pfn[MAX_NR_ZONES] __meminitdata;
+static unsigned long required_kernelcore __initdata;
static unsigned long required_kernelcore_percent __initdata;
-static unsigned long __initdata required_movablecore;
+static unsigned long required_movablecore __initdata;
static unsigned long required_movablecore_percent __initdata;
-static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
-static bool mirrored_kernelcore;
+static unsigned long zone_movable_pfn[MAX_NUMNODES] __meminitdata;
+static bool mirrored_kernelcore __meminitdata;
/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
int movable_zone;
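
For reference, the shape of the convention being applied (an illustrative
declaration, not taken from the patch):

	static unsigned long nr_pages __initdata;	/* preferred: specifier after the name */
	static unsigned long __initdata nr_pages;	/* what checkpatch complains about */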
On Mon, 12 Feb 2018 16:24:25 -0800 (PST) David Rientjes <[email protected]> wrote:
> Both kernelcore= and movablecore= can be used to define the amount of
> ZONE_NORMAL and ZONE_MOVABLE on a system, respectively. This requires
> the system memory capacity to be known when specifying the command line,
> however.
>
> This introduces the ability to define both kernelcore= and movablecore=
> as a percentage of total system memory. This is convenient for systems
> software that wants to define the amount of ZONE_MOVABLE, for example, as
> a proportion of a system's memory rather than a hardcoded byte value.
>
> To define the percentage, the final character of the parameter should be
> a '%'.
Is this fine-grained enough? We've had percentage-based tunables in
the past, and 10 years later when systems are vastly larger, 1% is too
much.
On Tue, 13 Feb 2018, Andrew Morton wrote:
> > Both kernelcore= and movablecore= can be used to define the amount of
> > ZONE_NORMAL and ZONE_MOVABLE on a system, respectively. This requires
> > the system memory capacity to be known when specifying the command line,
> > however.
> >
> > This introduces the ability to define both kernelcore= and movablecore=
> > as a percentage of total system memory. This is convenient for systems
> > software that wants to define the amount of ZONE_MOVABLE, for example, as
> > a proportion of a system's memory rather than a hardcoded byte value.
> >
> > To define the percentage, the final character of the parameter should be
> > a '%'.
>
> Is this fine-grained enough? We've had percentage-based tunables in
> the past, and 10 years later when systems are vastly larger, 1% is too
> much.
>
They still have the (current) ability to define the exact amount of bytes
down to page-sized granularity, whereas 1% would yield 40GB on a 4TB
system. I'm not sure that people will want any finer-grained control if
defining the proportion of the system for kernelcore. They do have the
ability with the existing interface, though, if they want to be that
precise.
(This is a cop-out for not implementing some fractional percentage parser,
although that would be possible as a more complete solution.)
On 02/12/2018 04:24 PM, David Rientjes wrote:
> Both kernelcore= and movablecore= can be used to define the amount of
> ZONE_NORMAL and ZONE_MOVABLE on a system, respectively. This requires
> the system memory capacity to be known when specifying the command line,
> however.
>
> This introduces the ability to define both kernelcore= and movablecore=
> as a percentage of total system memory. This is convenient for systems
> software that wants to define the amount of ZONE_MOVABLE, for example, as
> a proportion of a system's memory rather than a hardcoded byte value.
>
> To define the percentage, the final character of the parameter should be
> a '%'.
>
> Signed-off-by: David Rientjes <[email protected]>
> ---
> Documentation/admin-guide/kernel-parameters.txt | 44 ++++++++++++-------------
> mm/page_alloc.c | 43 +++++++++++++++++++-----
> 2 files changed, 57 insertions(+), 30 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1825,30 +1825,30 @@
> keepinitrd [HW,ARM]
>
> kernelcore= [KNL,X86,IA-64,PPC]
> - Format: nn[KMGTPE] | "mirror"
> - This parameter
> - specifies the amount of memory usable by the kernel
> - for non-movable allocations. The requested amount is
> - spread evenly throughout all nodes in the system. The
> - remaining memory in each node is used for Movable
> - pages. In the event, a node is too small to have both
> - kernelcore and Movable pages, kernelcore pages will
> - take priority and other nodes will have a larger number
> - of Movable pages. The Movable zone is used for the
> - allocation of pages that may be reclaimed or moved
> - by the page migration subsystem. This means that
> - HugeTLB pages may not be allocated from this zone.
> - Note that allocations like PTEs-from-HighMem still
> - use the HighMem zone if it exists, and the Normal
> - zone if it does not.
> -
> - Instead of specifying the amount of memory (nn[KMGTPE]),
> - you can specify "mirror" option. In case "mirror"
> + Format: nn[KMGTPE] | nn% | "mirror"
> + This parameter specifies the amount of memory usable by
> + the kernel for non-movable allocations. The requested
> + amount is spread evenly throughout all nodes in the
> + system as ZONE_NORMAL. The remaining memory is used for
> + movable memory in its own zone, ZONE_MOVABLE. In the
> + event that a node is too small to have both ZONE_NORMAL and
> + ZONE_MOVABLE, kernelcore memory will take priority and
> + other nodes will have a larger ZONE_MOVABLE.
> +
> + ZONE_MOVABLE is used for the allocation of pages that
> + may be reclaimed or moved by the page migration
> + subsystem. This means that HugeTLB pages may not be
> + allocated from this zone. Note that allocations like
> + PTEs-from-HighMem still use the HighMem zone if it
> + exists, and the Normal zone if it does not.
I know you are just updating the documentation for the new ability to
specify a percentage. However, while looking at this I noticed that
the existing description is out of date. HugeTLB pages CAN be treated
as movable and allocated from ZONE_MOVABLE.
If you have to respin, could you drop that line while making this change?
> +
> + It is possible to specify the exact amount of memory in
> + the form of "nn[KMGTPE]", a percentage of total system
> + memory in the form of "nn%", or "mirror". If "mirror"
> option is specified, mirrored (reliable) memory is used
> for non-movable allocations and remaining memory is used
> - for Movable pages. nn[KMGTPE] and "mirror" are exclusive,
> - so you can NOT specify nn[KMGTPE] and "mirror" at the same
> - time.
> + for Movable pages. "nn[KMGTPE]", "nn%", and "mirror"
> + are exclusive, so you cannot specify multiple forms.
>
> kgdbdbgp= [KGDB,HW] kgdb over EHCI usb debug port.
> Format: <Controller#>[,poll interval]
Don't you need to make the same type of percentage changes for 'movablecore='?
--
Mike Kravetz
On Tue, 13 Feb 2018, Mike Kravetz wrote:
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -1825,30 +1825,30 @@
> > keepinitrd [HW,ARM]
> >
> > kernelcore= [KNL,X86,IA-64,PPC]
> > - Format: nn[KMGTPE] | "mirror"
> > - This parameter
> > - specifies the amount of memory usable by the kernel
> > - for non-movable allocations. The requested amount is
> > - spread evenly throughout all nodes in the system. The
> > - remaining memory in each node is used for Movable
> > - pages. In the event, a node is too small to have both
> > - kernelcore and Movable pages, kernelcore pages will
> > - take priority and other nodes will have a larger number
> > - of Movable pages. The Movable zone is used for the
> > - allocation of pages that may be reclaimed or moved
> > - by the page migration subsystem. This means that
> > - HugeTLB pages may not be allocated from this zone.
> > - Note that allocations like PTEs-from-HighMem still
> > - use the HighMem zone if it exists, and the Normal
> > - zone if it does not.
> > -
> > - Instead of specifying the amount of memory (nn[KMGTPE]),
> > - you can specify "mirror" option. In case "mirror"
> > + Format: nn[KMGTPE] | nn% | "mirror"
> > + This parameter specifies the amount of memory usable by
> > + the kernel for non-movable allocations. The requested
> > + amount is spread evenly throughout all nodes in the
> > + system as ZONE_NORMAL. The remaining memory is used for
> > + movable memory in its own zone, ZONE_MOVABLE. In the
> > + event that a node is too small to have both ZONE_NORMAL and
> > + ZONE_MOVABLE, kernelcore memory will take priority and
> > + other nodes will have a larger ZONE_MOVABLE.
> > +
> > + ZONE_MOVABLE is used for the allocation of pages that
> > + may be reclaimed or moved by the page migration
> > + subsystem. This means that HugeTLB pages may not be
> > + allocated from this zone. Note that allocations like
> > + PTEs-from-HighMem still use the HighMem zone if it
> > + exists, and the Normal zone if it does not.
>
> I know you are just updating the documentation for the new ability to
> specify a percentage. However, while looking at this I noticed that
> the existing description is out of date. HugeTLB pages CAN be treated
> as movable and allocated from ZONE_MOVABLE.
>
> If you have to respin, could you drop that line while making this change?
>
Hi Mike,
It's merged in -mm, so perhaps no respin is necessary. I think a general
cleanup to this area regarding your work with hugetlb pages would be good.
> > +
> > + It is possible to specify the exact amount of memory in
> > + the form of "nn[KMGTPE]", a percentage of total system
> > + memory in the form of "nn%", or "mirror". If "mirror"
> > option is specified, mirrored (reliable) memory is used
> > for non-movable allocations and remaining memory is used
> > - for Movable pages. nn[KMGTPE] and "mirror" are exclusive,
> > - so you can NOT specify nn[KMGTPE] and "mirror" at the same
> > - time.
> > + for Movable pages. "nn[KMGTPE]", "nn%", and "mirror"
> > + are exclusive, so you cannot specify multiple forms.
> >
> > kgdbdbgp= [KGDB,HW] kgdb over EHCI usb debug port.
> > Format: <Controller#>[,poll interval]
>
> Don't you need to make the same type of percentage changes for 'movablecore='?
>
The majority of the movablecore= documentation simply refers to the
kernelcore= option as its complement, so I'm not sure that we need to go
in-depth into what the percentage specifiers mean for both options.
Specify that movablecore= can use a percent value.
Remove comment about hugetlb pages not being movable per Mike.
Cc: Mike Kravetz <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 22 +++++++++----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1837,10 +1837,9 @@
ZONE_MOVABLE is used for the allocation of pages that
may be reclaimed or moved by the page migration
- subsystem. This means that HugeTLB pages may not be
- allocated from this zone. Note that allocations like
- PTEs-from-HighMem still use the HighMem zone if it
- exists, and the Normal zone if it does not.
+ subsystem. Note that allocations like PTEs-from-HighMem
+ still use the HighMem zone if it exists, and the Normal
+ zone if it does not.
It is possible to specify the exact amount of memory in
the form of "nn[KMGTPE]", a percentage of total system
@@ -2353,13 +2352,14 @@
mousedev.yres= [MOUSE] Vertical screen resolution, used for devices
reporting absolute coordinates, such as tablets
- movablecore=nn[KMG] [KNL,X86,IA-64,PPC] This parameter
- is similar to kernelcore except it specifies the
- amount of memory used for migratable allocations.
- If both kernelcore and movablecore is specified,
- then kernelcore will be at *least* the specified
- value but may be more. If movablecore on its own
- is specified, the administrator must be careful
+ movablecore= [KNL,X86,IA-64,PPC]
+ Format: nn[KMGTPE] | nn%
+ This parameter is the complement to kernelcore=; it
+ specifies the amount of memory used for migratable
+ allocations. If both kernelcore and movablecore are
+ specified, then kernelcore will be at *least* the
+ specified value but may be more. If movablecore on its
+ own is specified, the administrator must be careful
that the amount of memory usable for all allocations
is not too small.
On 02/13/2018 05:00 PM, David Rientjes wrote:
> Specify that movablecore= can use a percent value.
>
> Remove comment about hugetlb pages not being movable per Mike.
>
> Cc: Mike Kravetz <[email protected]>
> Signed-off-by: David Rientjes <[email protected]>
Thanks! FWIW,
Reviewed-by: Mike Kravetz <[email protected]>
And, that is for all of patch 1.
--
Mike Kravetz
> ---
> .../admin-guide/kernel-parameters.txt | 22 +++++++++----------
> 1 file changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1837,10 +1837,9 @@
>
> ZONE_MOVABLE is used for the allocation of pages that
> may be reclaimed or moved by the page migration
> - subsystem. This means that HugeTLB pages may not be
> - allocated from this zone. Note that allocations like
> - PTEs-from-HighMem still use the HighMem zone if it
> - exists, and the Normal zone if it does not.
> + subsystem. Note that allocations like PTEs-from-HighMem
> + still use the HighMem zone if it exists, and the Normal
> + zone if it does not.
>
> It is possible to specify the exact amount of memory in
> the form of "nn[KMGTPE]", a percentage of total system
> @@ -2353,13 +2352,14 @@
> mousedev.yres= [MOUSE] Vertical screen resolution, used for devices
> reporting absolute coordinates, such as tablets
>
> - movablecore=nn[KMG] [KNL,X86,IA-64,PPC] This parameter
> - is similar to kernelcore except it specifies the
> - amount of memory used for migratable allocations.
> - If both kernelcore and movablecore is specified,
> - then kernelcore will be at *least* the specified
> - value but may be more. If movablecore on its own
> - is specified, the administrator must be careful
> + movablecore= [KNL,X86,IA-64,PPC]
> + Format: nn[KMGTPE] | nn%
> + This parameter is the complement to kernelcore=; it
> + specifies the amount of memory used for migratable
> + allocations. If both kernelcore and movablecore are
> + specified, then kernelcore will be at *least* the
> + specified value but may be more. If movablecore on its
> + own is specified, the administrator must be careful
> that the amount of memory usable for all allocations
> is not too small.
>
On Tue, 13 Feb 2018 15:55:11 -0800 (PST) David Rientjes <[email protected]> wrote:
> >
> > Is this fine-grained enough? We've had percentage-based tunables in
> > the past, and 10 years later when systems are vastly larger, 1% is too
> > much.
> >
>
> They still have the (current) ability to define the exact amount of bytes
> down to page-sized granularity, whereas 1% would yield 40GB on a 4TB
> system. I'm not sure that people will want any finer-grained control if
> defining the proportion of the system for kernelcore. They do have the
> ability with the existing interface, though, if they want to be that
> precise.
>
> (This is a cop-out for not implementing some fractional percentage parser,
> although that would be possible as a more complete solution.)
And the interface which you've proposed can be seamlessly extended to
accept 0.07%, so not a problem.
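
A sketch of what such a parser could look like, as a userspace model (the
function name and the hundredths-of-a-percent unit are inventions here, not
part of the patch); the zone sizing could then consume the result as
totalpages * percent_hundredths / 10000:

	#include <stdio.h>
	#include <stdlib.h>

	/* "0.07%" -> 7, "1%" -> 100, "12.5%" -> 1250; -1 on malformed input */
	static long parse_percent_hundredths(const char *s)
	{
		char *end;
		long whole, frac = 0, val;
		int digits = 0;

		whole = strtol(s, &end, 10);
		if (*end == '.') {
			const char *p = end + 1;

			/* accept at most two fractional digits */
			while (*p >= '0' && *p <= '9' && digits < 2) {
				frac = frac * 10 + (*p++ - '0');
				digits++;
			}
			if (digits == 1)
				frac *= 10;	/* ".5" means 50 hundredths */
			end = (char *)p;
		}
		if (*end != '%')
			return -1;
		val = whole * 100 + frac;
		return (val < 0 || val > 10000) ? -1 : val;
	}

	int main(void)
	{
		printf("%ld %ld %ld\n",
		       parse_percent_hundredths("0.07%"),
		       parse_percent_hundredths("1%"),
		       parse_percent_hundredths("12.5%"));
		return 0;	/* prints: 7 100 1250 */
	}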
On Mon 12-02-18 16:24:25, David Rientjes wrote:
> Both kernelcore= and movablecore= can be used to define the amount of
> ZONE_NORMAL and ZONE_MOVABLE on a system, respectively. This requires
> the system memory capacity to be known when specifying the command line,
> however.
>
> This introduces the ability to define both kernelcore= and movablecore=
> as a percentage of total system memory. This is convenient for systems
> software that wants to define the amount of ZONE_MOVABLE, for example, as
> a proportion of a system's memory rather than a hardcoded byte value.
>
> To define the percentage, the final character of the parameter should be
> a '%'.
I do not have any objections regarding the extension. What I am more
interested in is _why_ people are still using this command line
parameter at all these days. Why would anybody want to introduce lowmem
issues from 32b days. I can see the CMA/Hotplug usecases for
ZONE_MOVABLE but those have their own ways to define zone movable. I was
tempted to simply remove the kernelcore already. Could you be more
specific what is your usecase which triggered a need of an easier
scaling of the size?
--
Michal Hocko
SUSE Labs
On Wed, 14 Feb 2018, Michal Hocko wrote:
> I do not have any objections regarding the extension. What I am more
> interested in is _why_ people are still using this command line
> parameter at all these days. Why would anybody want to introduce lowmem
> issues from 32-bit days? I can see the CMA/Hotplug usecases for
> ZONE_MOVABLE but those have their own ways to define zone movable. I was
> tempted to simply remove the kernelcore already. Could you be more
> specific what is your usecase which triggered a need of an easier
> scaling of the size?
Fragmentation of non-__GFP_MOVABLE pages due to low-on-memory situations
can pollute most pageblocks on the system: as much as 1GB of slab can end
up fragmented over 128GB of memory, for example. When the amount of kernel
memory is well bounded for certain systems, it is better to aggressively
reclaim from existing MIGRATE_UNMOVABLE pageblocks rather than eagerly
fallback to others.
We have additional patches that help with this fragmentation if you're
interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE
pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and
draining of pcp lists back to the zone free area to prevent stranding.
On Thu, Feb 15, 2018 at 03:45:25PM +0100, Michal Hocko wrote:
> > When the amount of kernel
> > memory is well bounded for certain systems, it is better to aggressively
> > reclaim from existing MIGRATE_UNMOVABLE pageblocks rather than eagerly
> > fallback to others.
> >
> > We have additional patches that help with this fragmentation if you're
> > interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE
> > pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and
> > draining of pcp lists back to the zone free area to prevent stranding.
>
> Yes, I think we need a proper fix. (Ab)using zone_movable for this
> usecase is just sad.
What if ... on startup, slab allocated a MAX_ORDER page for itself.
It would then satisfy its own page allocation requests from this giant
page. If we start to run low on memory in the rest of the system, slab
can be induced to return some of it via its shrinker. If slab runs low
on memory, it tries to allocate another MAX_ORDER page for itself.
I think even this should reduce fragmentation. We could enhance the
fragmentation reduction by noticing when somebody else releases a page
that was previously part of a slab MAX_ORDER page and handing that page
back to slab. When slab notices that it has an entire MAX_ORDER page free
(and sufficient other memory on hand that it's unlikely to need it soon),
it can hand that MAX_ORDER page back to the page allocator.
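
A toy userspace model of that arena idea (all names are hypothetical and
free() stands in for the page allocator): one 2MB block is grabbed lazily,
order-0 pages are carved out of it, and the whole block goes back once
every page has been returned.

	#define _POSIX_C_SOURCE 200112L
	#include <stdlib.h>
	#include <string.h>

	#define PAGE_SZ		4096UL
	#define ARENA_SZ	(2UL * 1024 * 1024)	/* one MAX_ORDER block */
	#define NPAGES		(ARENA_SZ / PAGE_SZ)	/* 512 */

	struct arena {
		char *base;			/* NULL when no block is held */
		unsigned char in_use[NPAGES];
		unsigned int nr_used;
	};

	/* Hand out one order-0 page, grabbing a fresh 2MB block if needed. */
	static void *arena_alloc_page(struct arena *a)
	{
		unsigned int i;

		if (!a->base) {
			if (posix_memalign((void **)&a->base, ARENA_SZ, ARENA_SZ))
				return NULL;
			memset(a->in_use, 0, sizeof(a->in_use));
			a->nr_used = 0;
		}
		for (i = 0; i < NPAGES; i++) {
			if (!a->in_use[i]) {
				a->in_use[i] = 1;
				a->nr_used++;
				return a->base + i * PAGE_SZ;
			}
		}
		return NULL;	/* arena full: a real slab would grab another */
	}

	/* Return a page; once the whole block is free, give it all back. */
	static void arena_free_page(struct arena *a, void *p)
	{
		unsigned int i = (unsigned int)(((char *)p - a->base) / PAGE_SZ);

		a->in_use[i] = 0;
		if (--a->nr_used == 0) {
			free(a->base);
			a->base = NULL;
		}
	}

	int main(void)
	{
		struct arena a = { 0 };
		void *p = arena_alloc_page(&a);

		arena_free_page(&a, p);	/* drains the arena: 2MB goes back whole */
		return 0;
	}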
On Wed 14-02-18 02:28:38, David Rientjes wrote:
> On Wed, 14 Feb 2018, Michal Hocko wrote:
>
> > I do not have any objections regarding the extension. What I am more
> > interested in is _why_ people are still using this command line
> > parameter at all these days. Why would anybody want to introduce lowmem
> > issues from 32-bit days? I can see the CMA/Hotplug usecases for
> > ZONE_MOVABLE but those have their own ways to define zone movable. I was
> > tempted to simply remove the kernelcore already. Could you be more
> > specific what is your usecase which triggered a need of an easier
> > scaling of the size?
>
> Fragmentation of non-__GFP_MOVABLE pages due to low-on-memory situations
> can pollute most pageblocks on the system: as much as 1GB of slab can end
> up fragmented over 128GB of memory, for example.
OK, I was assuming something like that.
> When the amount of kernel
> memory is well bounded for certain systems, it is better to aggressively
> reclaim from existing MIGRATE_UNMOVABLE pageblocks rather than eagerly
> fallback to others.
>
> We have additional patches that help with this fragmentation if you're
> interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE
> pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and
> draining of pcp lists back to the zone free area to prevent stranding.
Yes, I think we need a proper fix. (Ab)using zone_movable for this
usecase is just sad.
--
Michal Hocko
SUSE Labs
On Thu, 15 Feb 2018, Matthew Wilcox wrote:
> What if ... on startup, slab allocated a MAX_ORDER page for itself.
> It would then satisfy its own page allocation requests from this giant
> page. If we start to run low on memory in the rest of the system, slab
> can be induced to return some of it via its shrinker. If slab runs low
> on memory, it tries to allocate another MAX_ORDER page for itself.
The inducing of releasing memory back is not there but you can run SLUB
with MAX_ORDER allocations by passing "slub_min_order=9" or so on bootup.
> I think even this should reduce fragmentation. We could enhance the
> fragmentation reduction by noticing when somebody else releases a page
> that was previously part of a slab MAX_ORDER page and handing that page
> back to slab. When slab notices that it has an entire MAX_ORDER page free
> (and sufficient other memory on hand that it's unlikely to need it soon),
> it can hand that MAX_ORDER page back to the page allocator.
SLUB will release MAX_ORDER pages if they are completely free with the
above configuration.
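
For anyone who wants to try that experiment, it is a kernel command line
setting, e.g. (assuming 4k base pages, so order 9 means 2MB slabs):

	slub_min_order=9 slub_max_order=9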
On Thu, Feb 15, 2018 at 09:49:00AM -0600, Christopher Lameter wrote:
> On Thu, 15 Feb 2018, Matthew Wilcox wrote:
>
> > What if ... on startup, slab allocated a MAX_ORDER page for itself.
> > It would then satisfy its own page allocation requests from this giant
> > page. If we start to run low on memory in the rest of the system, slab
> > can be induced to return some of it via its shrinker. If slab runs low
> > on memory, it tries to allocate another MAX_ORDER page for itself.
>
> The inducing of releasing memory back is not there but you can run SLUB
> with MAX_ORDER allocations by passing "slub_min_order=9" or so on bootup.
Maybe we should try this patch in order to automatically scale the slub
page size with the amount of memory in the machine?
diff --git a/mm/internal.h b/mm/internal.h
index e6bd35182dae..7059a8389194 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -167,6 +167,7 @@ extern void prep_compound_page(struct page *page, unsigned int order);
extern void post_alloc_hook(struct page *page, unsigned int order,
gfp_t gfp_flags);
extern int user_min_free_kbytes;
+extern unsigned long __meminitdata nr_kernel_pages;
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ef9c259db041..3c51bb22403f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -264,7 +264,7 @@ int min_free_kbytes = 1024;
int user_min_free_kbytes = -1;
int watermark_scale_factor = 10;
-static unsigned long __meminitdata nr_kernel_pages;
+unsigned long __meminitdata nr_kernel_pages;
static unsigned long __meminitdata nr_all_pages;
static unsigned long __meminitdata dma_reserve;
diff --git a/mm/slub.c b/mm/slub.c
index e381728a3751..abca4a6e9b6c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4194,6 +4194,23 @@ void __init kmem_cache_init(void)
 	if (debug_guardpage_minorder())
 		slub_max_order = 0;
 
+	if (slub_min_order == 0) {
+		unsigned long numentries = nr_kernel_pages << PAGE_SHIFT; /* bytes */
+
+		/*
+		 * Above 4GB, we start to care more about fragmenting large
+		 * pages than about using the minimum amount of memory.
+		 * Scale the slub page size at half the rate that we scale
+		 * the memory size; at 4GB we double the page size to 8k,
+		 * 16GB to 16k, 64GB to 32k, 256GB to 64k.
+		 */
+		while (numentries > (4UL << 30)) {
+			if (slub_min_order >= slub_max_order)
+				break;
+			slub_min_order++;
+			numentries /= 4;
+		}
+	}
 
 	kmem_cache_node = &boot_kmem_cache_node;
 	kmem_cache = &boot_kmem_cache;
On Thu, 15 Feb 2018, Michal Hocko wrote:
> > When the amount of kernel
> > memory is well bounded for certain systems, it is better to aggressively
> > reclaim from existing MIGRATE_UNMOVABLE pageblocks rather than eagerly
> > fallback to others.
> >
> > We have additional patches that help with this fragmentation if you're
> > interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE
> > pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and
> > draining of pcp lists back to the zone free area to prevent stranding.
>
> Yes, I think we need a proper fix. (Ab)using zone_movable for this
> usecase is just sad.
>
It's a hard balance to achieve between a fast page allocator with per-cpu
pagesets, reducing fragmentation of unmovable memory, and the performance
impact of any fix to reduce that fragmentation for users currently
unaffected. Our patches to kick kcompactd for MIGRATE_UNMOVABLE
pageblocks on fallback would be a waste unless you have a ton of anonymous
memory you want backed by thp.
If hugepages are the main motivation for reducing the fragmentation,
hugetlbfs could be suggested because it would give us more runtime control
and we could leave surplus pages sitting in the free pool unless reclaimed
under memory pressure. That works fine in dedicated environments where we
know how much hugetlb to reserve; if we give it back under memory pressure
it becomes hard to reallocate the high number of hugepages we want (>95%
of system memory). It's much more sloppy in shared environments where the
amount of hugepages are unknown.
And of course this doesn't address when a pin prevents memory from being
migrated during memory compaction that is __GFP_MOVABLE at allocation but
later pinned in place, which can still be a problem with ZONE_MOVABLE. It
would be nice to have a solution where this memory can be annotated to want
to come from a non-MIGRATE_MOVABLE pageblock, if possible.
On Thu, Feb 15, 2018 at 09:49:00AM -0600, Christopher Lameter wrote:
> On Thu, 15 Feb 2018, Matthew Wilcox wrote:
> > What if ... on startup, slab allocated a MAX_ORDER page for itself.
> > It would then satisfy its own page allocation requests from this giant
> > page. If we start to run low on memory in the rest of the system, slab
> > can be induced to return some of it via its shrinker. If slab runs low
> > on memory, it tries to allocate another MAX_ORDER page for itself.
>
> The inducing of releasing memory back is not there but you can run SLUB
> with MAX_ORDER allocations by passing "slub_min_order=9" or so on bootup.
This is subtly different from the idea that I had. If you set
slub_min_order to 9, then slub will allocate 2MB pages for each slab,
so allocating one object from kmalloc-32 and one object from dentry will
cause 4MB to be taken from the system.
What I was proposing was an intermediate page allocator where slab would
request 2MB for its own uses all at once, then allocate pages from that to
individual slabs, so allocating a kmalloc-32 object and a dentry object
would result in 510 pages of memory still being available for any slab
that needed it.
On Thu, 15 Feb 2018, Matthew Wilcox wrote:
> What I was proposing was an intermediate page allocator where slab would
> request 2MB for its own uses all at once, then allocate pages from that to
> individual slabs, so allocating a kmalloc-32 object and a dentry object
> would result in 510 pages of memory still being available for any slab
> that needed it.
>
A type of memory arena built between the page allocator and slab
allocator.
The issue that I see with this is that eventually there are going to be
low-on-memory situations where memory needs to be reclaimed from these arena
pages. We can free individual pages back to the buddy allocator, but have
no control over whether an entire pageblock can be freed back. So now we
have MIGRATE_MOVABLE or MIGRATE_UNMOVABLE pageblocks with some user pages
and some slab pages, and we've reached the same fragmentation issue in a
different way. After that, it will become more difficult for the slab
allocator to request a page of pageblock_order.
Other than the stranding issue of MIGRATE_UNMOVABLE pages on pcps, the
page allocator currently does well in falling back to other migratetypes
but there isn't any type of slab reclaim or defragmentation done in the
background to try to free up as much memory from that
now-MIGRATE_UNMOVABLE pageblock as possible. We have patches that do
that, but as I mentioned before it can affect the performance of the page
allocator because it drains pcps on fallback and it does kcompactd
compaction in the background even if you don't need order-9 memory later
(or you've defragmented needlessly when more slab is just going to be
allocated anyway).
On Thu, 15 Feb 2018, Matthew Wilcox wrote:
> > The inducing of releasing memory back is not there but you can run SLUB
> > with MAX_ORDER allocations by passing "slub_min_order=9" or so on bootup.
>
> This is subtly different from the idea that I had. If you set
> slub_min_order to 9, then slub will allocate 2MB pages for each slab,
> so allocating one object from kmalloc-32 and one object from dentry will
> cause 4MB to be taken from the system.
Right.
> What I was proposing was an intermediate page allocator where slab would
> request 2MB for its own uses all at once, then allocate pages from that to
> individual slabs, so allocating a kmalloc-32 object and a dentry object
> would result in 510 pages of memory still being available for any slab
> that needed it.
Well, that's not really going to work, since you would be mixing objects of
different sizes, which may present more fragmentation problems within the
2M later if they are freed and more objects are allocated.
What we could do is add a readonly allocation mode for those objects that
we know are never freed. Those could be combined into 2M
blocks.
Or an allocation mode where we would free a whole bunch of objects in one
go. If we can mark those allocs then they could be satisfied from the same
block.
On Fri, Feb 16, 2018 at 09:44:25AM -0600, Christopher Lameter wrote:
> On Thu, 15 Feb 2018, Matthew Wilcox wrote:
> > What I was proposing was an intermediate page allocator where slab would
> > request 2MB for its own uses all at once, then allocate pages from that to
> > individual slabs, so allocating a kmalloc-32 object and a dentry object
> > would result in 510 pages of memory still being available for any slab
> > that needed it.
>
> Well, that's not really going to work, since you would be mixing objects of
> different sizes, which may present more fragmentation problems within the
> 2M later if they are freed and more objects are allocated.
I don't understand this response. I'm not suggesting mixing objects
of different sizes within the same page. The vast majority of slabs
use order-0 pages, a few use order-1 pages and larger sizes are almost
unheard of. I'm suggesting the slab have its own private arena of pages
that it uses for allocating pages to slabs; when an entire page comes
free in a slab, it is returned to the arena. When the arena is empty,
slab requests another arena from the page allocator.
If you're concerned about order-0 allocations fragmenting the arena
for order-1 slabs, then we could have separate arenas for order-0 and
order-1. But there should be no more fragmentation caused by sticking
within an arena for page allocations than there would be by spreading
slab allocations across all memory.
On Fri, 16 Feb 2018, Matthew Wilcox wrote:
> On Fri, Feb 16, 2018 at 09:44:25AM -0600, Christopher Lameter wrote:
> > On Thu, 15 Feb 2018, Matthew Wilcox wrote:
> > > What I was proposing was an intermediate page allocator where slab would
> > > request 2MB for its own uses all at once, then allocate pages from that to
> > > individual slabs, so allocating a kmalloc-32 object and a dentry object
> > > would result in 510 pages of memory still being available for any slab
> > > that needed it.
> >
> > Well, that's not really going to work, since you would be mixing objects of
> > different sizes, which may present more fragmentation problems within the
> > 2M later if they are freed and more objects are allocated.
>
> I don't understand this response. I'm not suggesting mixing objects
> of different sizes within the same page. The vast majority of slabs
> use order-0 pages, a few use order-1 pages and larger sizes are almost
> unheard of. I'm suggesting the slab have its own private arena of pages
> that it uses for allocating pages to slabs; when an entire page comes
> free in a slab, it is returned to the arena. When the arena is empty,
> slab requests another arena from the page allocator.
This just shifts the fragmentation problem because the 2M page cannot be
released until all 4k or 8k pages within that 2M page are freed. How is
that different from the page allocator, which cannot coalesce a 2M page
until all fragments have been released?
The kernelcore already does something similar by limiting the
general unmovable allocs to a section of memory.
> If you're concerned about order-0 allocations fragmenting the arena
> for order-1 slabs, then we could have separate arenas for order-0 and
> order-1. But there should be no more fragmentation caused by sticking
> within an arena for page allocations than there would be by spreading
> slab allocations across all memory.
We avoid large frames at this point, but they are beneficial for packing
objects tighter and also for increasing performance.
Maybe what we should do is raise the lowest allocation size instead and
allocate 2^x groups of pages to certain purposes?
I.e., have a base allocation size of 16k, and if the alloc was a page cache
page, then use the remainder for the neighboring pages.
Similar things could be done for the page allocator.
Raising the minimum allocation size may allow us to reduce the sizes
necessary to be allocated, at the price of losing some memory. On large
systems this may not matter much.
On Thu, 15 Feb 2018, Matthew Wilcox wrote:
> On Thu, Feb 15, 2018 at 09:49:00AM -0600, Christopher Lameter wrote:
> > On Thu, 15 Feb 2018, Matthew Wilcox wrote:
> >
> > > What if ... on startup, slab allocated a MAX_ORDER page for itself.
> > > It would then satisfy its own page allocation requests from this giant
> > > page. If we start to run low on memory in the rest of the system, slab
> > > can be induced to return some of it via its shrinker. If slab runs low
> > > on memory, it tries to allocate another MAX_ORDER page for itself.
> >
> > The inducing of releasing memory back is not there but you can run SLUB
> > with MAX_ORDER allocations by passing "slub_min_order=9" or so on bootup.
>
> Maybe we should try this patch in order to automatically scale the slub
> page size with the amount of memory in the machine?
Well setting slub_min_order may cause allocation failures. You would leave
that at 0 for a prod configuration. Setting slub_max_order higher would
work.
On Fri, Feb 16, 2018 at 10:08:28AM -0600, Christopher Lameter wrote:
> On Fri, 16 Feb 2018, Matthew Wilcox wrote:
> > I don't understand this response. I'm not suggesting mixing objects
> > of different sizes within the same page. The vast majority of slabs
> > use order-0 pages, a few use order-1 pages and larger sizes are almost
> > unheard of. I'm suggesting the slab have its own private arena of pages
> > that it uses for allocating pages to slabs; when an entire page comes
> > free in a slab, it is returned to the arena. When the arena is empty,
> > slab requests another arena from the page allocator.
>
> This just shifts the fragmentation problem because the 2M page cannot be
> released until all 4k or 8k pages within that 2M page are freed. How is
> that different from the page allocator, which cannot coalesce a 2M page
> until all fragments have been released?
I'm not proposing releasing this 2MB page, unless it naturally frees up.
I'm saying that by restricting allocations to be within this 2MB page,
we prevent allocating from the adjacent 2MB page.
The workload I'm thinking of looks like this ... maybe the result of
running 'file' on every inode in a directory:
do {
	Allocate an inode
	Allocate a page of pagecache
} while (lots of times);
naively, we allocate a page for the inode slab, then 3-6 pages for page
cache (depending on the filesystem), then we allocate another page for
the inode slab, then another 3-6 pages of page cache, and so on. So the
pages end up looking like this:
IPPPPPIP|PPPPIPPP|PPIPPPPP|IPPPPPIP|...
Now we need an order-3 allocation. We can't get there just by releasing
page cache pages because there are inode slab pages in there, so we need to
shrink the inode caches as well. I'm proposing:
IIIIII00|PPPPPPPP|PPPPPPPP|PPPPPPPP|PP...
and we can get our order-3 allocation just by releasing page cache pages.
> The kernelcore already does something similar by limiting the
> general unmovable allocs to a section of memory.
Right! But Michal's unhappy about kernelcore (see the beginning of this
thread), and so I'm proposing an alternative.
> Maybe what we should do is raise the lowest allocation size instead and
> allocate 2^x groups of pages to certain purposes?
>
> I.e., have a base allocation size of 16k, and if the alloc was a page cache
> page, then use the remainder for the neighboring pages.
Yes, there are a lot of ideas like this floating around; I know Kirill's
interested in this kind of thing not just for THP but also for faultaround.