This is a rework of the crashkernel=auto patches back to 2009 although
I'm not sure if below is the last version of the old effort:
https://lkml.org/lkml/2009/8/12/61
https://lwn.net/Articles/345344/
I changed the original design, instead of adding the auto reserve logic
in code, in this patch just introduce two kernel config options for
the default crashkernel value in MB and the threshold of system memory
in MB so that only reserve default when system memory is equal or
above the threshold.
With the kernel configs distributions can easily change the default
values so that people do not need to manually set kernel cmdline
for common use cases and one can still overwrite the default value
with manual setup or disable it by using crashkernel=0
Signed-off-by: Dave Young <[email protected]>
---
Another difference is with original design the crashkernel size scales
with system memory, according to test, large machine may need more
memory in kdump kernel because of several factors:
1. cpu numbers, because of the percpu memory allocated for cpus.
(kdump can use nr_cpus=1 to workaround this, but some
arches do not support nr_cpus=X for example powerpc)
2. IO devices, large system can have a lot of io devices, although we
can try to only add those device drivers we needed, it is still a
problem because of some built-in drivers, some stacked logical devices
eg. device mapper devices, acpi etc. Even if only considering the
meta data for driver model it will still be a big number eg. sysfs
files etc.
3. The minimum memory requirement for some device drivers are big, even
if some of them have implemented low meory profile. It is usual to see
10M memory use for a storage driver.
4. user space initramfs size growing. Busybox is not usable if we need
to add udev support and some complicate storage support. Use dracut
with systemd, especially networking stuff need more memory.
So probably add another kernel config option to scale the memory size
eg. CRASHKERNEL_DEFAULT_SCALE_RATIO is also good to have, in RHEL we
use base_value + system_mem >> (2^14) for x86. I'm still hesatating
how to describe and add this option. Any suggestions will be appreciated.
arch/Kconfig | 16 ++++++++++++++++
kernel/crash_core.c | 23 ++++++++++++++++++++++-
2 files changed, 38 insertions(+), 1 deletion(-)
--- linux-x86.orig/arch/Kconfig
+++ linux-x86/arch/Kconfig
@@ -10,6 +10,22 @@ config KEXEC_CORE
select CRASH_CORE
bool
+config CRASHKERNEL_DEFAULT_THRESHOLD_MB
+ int "System memory size threshold for kdump memory default reserving"
+ depends on CRASH_CORE
+ default 0
+ help
+ CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
+ the system memory size is equal or bigger than the threshold.
+
+config CRASHKERNEL_DEFAULT_MB
+ int "Default crashkernel memory size reserved for kdump"
+ depends on CRASH_CORE
+ default 0
+ help
+ This is used as the default kdump reserved memory size in MB.
+ crashkernel=X kernel cmdline can overwrite this value.
+
config HAVE_IMA_KEXEC
bool
--- linux-x86.orig/kernel/crash_core.c
+++ linux-x86/kernel/crash_core.c
@@ -9,6 +9,7 @@
#include <linux/crash_core.h>
#include <linux/utsname.h>
#include <linux/vmalloc.h>
+#include <linux/sizes.h>
#include <asm/page.h>
#include <asm/sections.h>
@@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
return 0;
}
+static int __init get_crashkernel_default(unsigned long long system_ram,
+ unsigned long long *size)
+{
+ unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
+ unsigned long long thres = CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
+
+ thres *= SZ_1M;
+ sz *= SZ_1M;
+
+ if (sz >= system_ram || system_ram < thres) {
+ pr_debug("crashkernel default size can not be used.\n");
+ return -EINVAL;
+ }
+ *size = sz;
+
+ return 0;
+}
+
#define SUFFIX_HIGH 0
#define SUFFIX_LOW 1
#define SUFFIX_NULL 2
@@ -240,8 +259,10 @@ static int __init __parse_crashkernel(ch
*crash_size = 0;
*crash_base = 0;
- ck_cmdline = get_last_crashkernel(cmdline, name, suffix);
+ if (!strstr(cmdline, "crashkernel="))
+ return get_crashkernel_default(system_ram, crash_size);
+ ck_cmdline = get_last_crashkernel(cmdline, name, suffix);
if (!ck_cmdline)
return -EINVAL;
On Mon, 21 May 2018 10:53:37 +0800 Dave Young <[email protected]> wrote:
> This is a rework of the crashkernel=auto patches back to 2009 although
> I'm not sure if below is the last version of the old effort:
> https://lkml.org/lkml/2009/8/12/61
> https://lwn.net/Articles/345344/
>
> I changed the original design, instead of adding the auto reserve logic
> in code, in this patch just introduce two kernel config options for
> the default crashkernel value in MB and the threshold of system memory
> in MB so that only reserve default when system memory is equal or
> above the threshold.
>
> With the kernel configs distributions can easily change the default
> values so that people do not need to manually set kernel cmdline
> for common use cases and one can still overwrite the default value
> with manual setup or disable it by using crashkernel=0
>
> Signed-off-by: Dave Young <[email protected]>
> ---
> Another difference is with original design the crashkernel size scales
> with system memory, according to test, large machine may need more
> memory in kdump kernel because of several factors:
> 1. cpu numbers, because of the percpu memory allocated for cpus.
> (kdump can use nr_cpus=1 to workaround this, but some
> arches do not support nr_cpus=X for example powerpc)
> 2. IO devices, large system can have a lot of io devices, although we
> can try to only add those device drivers we needed, it is still a
> problem because of some built-in drivers, some stacked logical devices
> eg. device mapper devices, acpi etc. Even if only considering the
> meta data for driver model it will still be a big number eg. sysfs
> files etc.
> 3. The minimum memory requirement for some device drivers are big, even
> if some of them have implemented low meory profile. It is usual to see
> 10M memory use for a storage driver.
> 4. user space initramfs size growing. Busybox is not usable if we need
> to add udev support and some complicate storage support. Use dracut
> with systemd, especially networking stuff need more memory.
>
> So probably add another kernel config option to scale the memory size
> eg. CRASHKERNEL_DEFAULT_SCALE_RATIO is also good to have, in RHEL we
> use base_value + system_mem >> (2^14) for x86. I'm still hesatating
> how to describe and add this option. Any suggestions will be appreciated.
>
> ...
>
> --- linux-x86.orig/arch/Kconfig
> +++ linux-x86/arch/Kconfig
> @@ -10,6 +10,22 @@ config KEXEC_CORE
> select CRASH_CORE
> bool
>
> +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
> + int "System memory size threshold for kdump memory default reserving"
> + depends on CRASH_CORE
> + default 0
> + help
> + CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
> + the system memory size is equal or bigger than the threshold.
"the threshold" is rather vague. Can it be clarified?
In fact I'm really struggling to understand the logic here....
> +config CRASHKERNEL_DEFAULT_MB
> + int "Default crashkernel memory size reserved for kdump"
> + depends on CRASH_CORE
> + default 0
> + help
> + This is used as the default kdump reserved memory size in MB.
> + crashkernel=X kernel cmdline can overwrite this value.
> +
> config HAVE_IMA_KEXEC
> bool
>
> @@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
> return 0;
> }
>
> +static int __init get_crashkernel_default(unsigned long long system_ram,
> + unsigned long long *size)
> +{
> + unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
> + unsigned long long thres = CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
> +
> + thres *= SZ_1M;
> + sz *= SZ_1M;
> +
> + if (sz >= system_ram || system_ram < thres) {
> + pr_debug("crashkernel default size can not be used.\n");
> + return -EINVAL;
In other words,
if (system_ram <= CONFIG_CRASHKERNEL_DEFAULT_MB ||
system_ram < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB)
fail;
yes?
How come? What's happening here? Perhaps a (good) explanatory comment
is needed. And clearer Kconfig text.
All confused :(
On 05/21/18 at 12:02pm, Andrew Morton wrote:
> On Mon, 21 May 2018 10:53:37 +0800 Dave Young <[email protected]> wrote:
>
> > This is a rework of the crashkernel=auto patches back to 2009 although
> > I'm not sure if below is the last version of the old effort:
> > https://lkml.org/lkml/2009/8/12/61
> > https://lwn.net/Articles/345344/
> >
> > I changed the original design, instead of adding the auto reserve logic
> > in code, in this patch just introduce two kernel config options for
> > the default crashkernel value in MB and the threshold of system memory
> > in MB so that only reserve default when system memory is equal or
> > above the threshold.
> >
> > With the kernel configs distributions can easily change the default
> > values so that people do not need to manually set kernel cmdline
> > for common use cases and one can still overwrite the default value
> > with manual setup or disable it by using crashkernel=0
> >
> > Signed-off-by: Dave Young <[email protected]>
> > ---
> > Another difference is with original design the crashkernel size scales
> > with system memory, according to test, large machine may need more
> > memory in kdump kernel because of several factors:
> > 1. cpu numbers, because of the percpu memory allocated for cpus.
> > (kdump can use nr_cpus=1 to workaround this, but some
> > arches do not support nr_cpus=X for example powerpc)
> > 2. IO devices, large system can have a lot of io devices, although we
> > can try to only add those device drivers we needed, it is still a
> > problem because of some built-in drivers, some stacked logical devices
> > eg. device mapper devices, acpi etc. Even if only considering the
> > meta data for driver model it will still be a big number eg. sysfs
> > files etc.
> > 3. The minimum memory requirement for some device drivers are big, even
> > if some of them have implemented low meory profile. It is usual to see
> > 10M memory use for a storage driver.
> > 4. user space initramfs size growing. Busybox is not usable if we need
> > to add udev support and some complicate storage support. Use dracut
> > with systemd, especially networking stuff need more memory.
> >
> > So probably add another kernel config option to scale the memory size
> > eg. CRASHKERNEL_DEFAULT_SCALE_RATIO is also good to have, in RHEL we
> > use base_value + system_mem >> (2^14) for x86. I'm still hesatating
> > how to describe and add this option. Any suggestions will be appreciated.
> >
> > ...
> >
> > --- linux-x86.orig/arch/Kconfig
> > +++ linux-x86/arch/Kconfig
> > @@ -10,6 +10,22 @@ config KEXEC_CORE
> > select CRASH_CORE
> > bool
> >
> > +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
> > + int "System memory size threshold for kdump memory default reserving"
> > + depends on CRASH_CORE
> > + default 0
> > + help
> > + CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
> > + the system memory size is equal or bigger than the threshold.
>
> "the threshold" is rather vague. Can it be clarified?
>
> In fact I'm really struggling to understand the logic here....
>
>
> > +config CRASHKERNEL_DEFAULT_MB
> > + int "Default crashkernel memory size reserved for kdump"
> > + depends on CRASH_CORE
> > + default 0
> > + help
> > + This is used as the default kdump reserved memory size in MB.
> > + crashkernel=X kernel cmdline can overwrite this value.
> > +
> > config HAVE_IMA_KEXEC
> > bool
> >
> > @@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
> > return 0;
> > }
> >
> > +static int __init get_crashkernel_default(unsigned long long system_ram,
> > + unsigned long long *size)
> > +{
> > + unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
> > + unsigned long long thres = CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
> > +
> > + thres *= SZ_1M;
> > + sz *= SZ_1M;
> > +
> > + if (sz >= system_ram || system_ram < thres) {
> > + pr_debug("crashkernel default size can not be used.\n");
> > + return -EINVAL;
>
> In other words,
>
> if (system_ram <= CONFIG_CRASHKERNEL_DEFAULT_MB ||
> system_ram < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB)
> fail;
>
> yes?
the first comparison is a sanity check for the default reserved
size, if it is bigger than system ram size it is apparently bad:
if ( CONFIG_CRASHKERNEL_DEFAULT_MB >= system_ram )
fail;
The second comparison is for the threshold setting, it is a designed
logic like:
if ( system_ram >= CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB ) then
go ahead to use the default value of CONFIG_CRASHKERNEL_DEFAULT_MB
>
> How come? What's happening here? Perhaps a (good) explanatory comment
> is needed. And clearer Kconfig text.
>
> All confused :(
Hmm, scratch head~, will think about how to describe it better. If you
have any suggestions just let me know :)
Thanks
Dave
On 05/21/18 at 12:02pm, Andrew Morton wrote:
> On Mon, 21 May 2018 10:53:37 +0800 Dave Young <[email protected]> wrote:
>
> > This is a rework of the crashkernel=auto patches back to 2009 although
> > I'm not sure if below is the last version of the old effort:
> > https://lkml.org/lkml/2009/8/12/61
> > https://lwn.net/Articles/345344/
> >
> > I changed the original design, instead of adding the auto reserve logic
> > in code, in this patch just introduce two kernel config options for
> > the default crashkernel value in MB and the threshold of system memory
> > in MB so that only reserve default when system memory is equal or
> > above the threshold.
> >
> > With the kernel configs distributions can easily change the default
> > values so that people do not need to manually set kernel cmdline
> > for common use cases and one can still overwrite the default value
> > with manual setup or disable it by using crashkernel=0
> >
> > Signed-off-by: Dave Young <[email protected]>
> > ---
> > Another difference is with original design the crashkernel size scales
> > with system memory, according to test, large machine may need more
> > memory in kdump kernel because of several factors:
> > 1. cpu numbers, because of the percpu memory allocated for cpus.
> > (kdump can use nr_cpus=1 to workaround this, but some
> > arches do not support nr_cpus=X for example powerpc)
> > 2. IO devices, large system can have a lot of io devices, although we
> > can try to only add those device drivers we needed, it is still a
> > problem because of some built-in drivers, some stacked logical devices
> > eg. device mapper devices, acpi etc. Even if only considering the
> > meta data for driver model it will still be a big number eg. sysfs
> > files etc.
> > 3. The minimum memory requirement for some device drivers are big, even
> > if some of them have implemented low meory profile. It is usual to see
> > 10M memory use for a storage driver.
> > 4. user space initramfs size growing. Busybox is not usable if we need
> > to add udev support and some complicate storage support. Use dracut
> > with systemd, especially networking stuff need more memory.
> >
> > So probably add another kernel config option to scale the memory size
> > eg. CRASHKERNEL_DEFAULT_SCALE_RATIO is also good to have, in RHEL we
> > use base_value + system_mem >> (2^14) for x86. I'm still hesatating
> > how to describe and add this option. Any suggestions will be appreciated.
> >
> > ...
> >
> > --- linux-x86.orig/arch/Kconfig
> > +++ linux-x86/arch/Kconfig
> > @@ -10,6 +10,22 @@ config KEXEC_CORE
> > select CRASH_CORE
> > bool
> >
> > +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
> > + int "System memory size threshold for kdump memory default reserving"
> > + depends on CRASH_CORE
> > + default 0
> > + help
> > + CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
> > + the system memory size is equal or bigger than the threshold.
>
> "the threshold" is rather vague. Can it be clarified?
>
> In fact I'm really struggling to understand the logic here....
Sorry I missed this comment, the threshold is like this:
if system memory size is lower than the threshold then we will do
nothing and do not reserve crashkernel memory by default. Eg. if the
threshold is 1024M then default reservation is only used when system has
more than 1024M memory, and for those low mem machine we do not reserve by
default.
Thanks
Dave
[snip]
> >
> > +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
> > + int "System memory size threshold for kdump memory default reserving"
> > + depends on CRASH_CORE
> > + default 0
> > + help
> > + CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
> > + the system memory size is equal or bigger than the threshold.
>
> "the threshold" is rather vague. Can it be clarified?
>
> In fact I'm really struggling to understand the logic here....
>
>
> > +config CRASHKERNEL_DEFAULT_MB
> > + int "Default crashkernel memory size reserved for kdump"
> > + depends on CRASH_CORE
> > + default 0
> > + help
> > + This is used as the default kdump reserved memory size in MB.
> > + crashkernel=X kernel cmdline can overwrite this value.
> > +
> > config HAVE_IMA_KEXEC
> > bool
> >
> > @@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
> > return 0;
> > }
> >
> > +static int __init get_crashkernel_default(unsigned long long system_ram,
> > + unsigned long long *size)
> > +{
> > + unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
> > + unsigned long long thres = CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
> > +
> > + thres *= SZ_1M;
> > + sz *= SZ_1M;
> > +
> > + if (sz >= system_ram || system_ram < thres) {
> > + pr_debug("crashkernel default size can not be used.\n");
> > + return -EINVAL;
>
> In other words,
>
> if (system_ram <= CONFIG_CRASHKERNEL_DEFAULT_MB ||
> system_ram < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB)
> fail;
>
> yes?
>
> How come? What's happening here? Perhaps a (good) explanatory comment
> is needed. And clearer Kconfig text.
>
> All confused :(
Andrew, I tuned it a bit, removed the check of sz >= system_ram, so if
the size is too large and kernel can not find enough memory it will
still fail in latter code.
Is below version looks clearer?
---
This is a rework of the crashkernel=auto patches back to 2009 although
I'm not sure if below is the last version of the old effort:
https://lkml.org/lkml/2009/8/12/61
https://lwn.net/Articles/345344/
I changed the original design, instead of adding the auto reserve logic
in code, in this patch just introduce two kernel config options for
the default crashkernel value in MB and the threshold of system memory
in MB so that only reserve default when system memory is equal or
above the threshold.
Signed-off-by: Dave Young <[email protected]>
---
Another difference is with original design the crashkernel size scales
with system memory, according to test, large machine may need more
memory in kdump kernel because of several factors:
1. cpu numbers, because of the percpu memory allocated for cpus.
(kdump can use nr_cpus=1 to workaround this, but some
arches do not support nr_cpus=X for example powerpc)
2. IO devices, large system can have a lot of io devices, although we
can try to only add those device drivers we needed, it is still a
problem because of some built-in drivers, some stacked logical devices
eg. device mapper devices, acpi etc. Even if only considering the
meta data for driver model it will still be a big number eg. sysfs
files etc.
3. The minimum memory requirement for some device drivers are big, even
if some of them have implemented low meory profile. It is usual to see
10M memory use for a storage driver.
4. user space initramfs size growing. Busybox is not usable if we need
to add udev support and some complicate storage support. Use dracut
with systemd, especially networking stuff need more memory.
So probably add another kernel config option to scale the memory size
eg. CRASHKERNEL_DEFAULT_SCALE_RATIO is also good to have, in RHEL we
use base_value + system_mem >> (2^14) for x86. I'm still hesatating
how to describe and add this option. Any suggestions will be appreciated.
arch/Kconfig | 17 +++++++++++++++++
kernel/crash_core.c | 19 ++++++++++++++++++-
2 files changed, 35 insertions(+), 1 deletion(-)
--- linux-x86.orig/arch/Kconfig
+++ linux-x86/arch/Kconfig
@@ -10,6 +10,23 @@ config KEXEC_CORE
select CRASH_CORE
bool
+config CRASHKERNEL_DEFAULT_THRESHOLD_MB
+ int "System memory size threshold for using CRASHKERNEL_DEFAULT_MB"
+ depends on CRASH_CORE
+ default 0
+ help
+ CRASHKERNEL_DEFAULT_MB will be reserved for kdump if the system
+ memory is above or equal to CRASHKERNEL_DEFAULT_THRESHOLD_MB MB.
+ It is only effective in case no crashkernel=X parameter is used.
+
+config CRASHKERNEL_DEFAULT_MB
+ int "Default crashkernel memory size reserved for kdump"
+ depends on CRASH_CORE
+ default 0
+ help
+ This is used as the default kdump reserved memory size in MB.
+ crashkernel=X kernel cmdline can overwrite this value.
+
config HAVE_IMA_KEXEC
bool
--- linux-x86.orig/kernel/crash_core.c
+++ linux-x86/kernel/crash_core.c
@@ -143,6 +143,21 @@ static int __init parse_crashkernel_simp
return 0;
}
+static int __init get_crashkernel_default(unsigned long long system_ram,
+ unsigned long long *size)
+{
+ unsigned long long system_ram_mb = system_ram >> 20;
+
+ if (system_ram_mb < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB) {
+ pr_debug("crashkernel: system memory size is lower than %d\n",
+ CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB);
+ return -EINVAL;
+ }
+ *size = (unsigned long long)CONFIG_CRASHKERNEL_DEFAULT_MB << 20;
+
+ return 0;
+}
+
#define SUFFIX_HIGH 0
#define SUFFIX_LOW 1
#define SUFFIX_NULL 2
@@ -240,8 +255,10 @@ static int __init __parse_crashkernel(ch
*crash_size = 0;
*crash_base = 0;
- ck_cmdline = get_last_crashkernel(cmdline, name, suffix);
+ if (!strstr(cmdline, "crashkernel="))
+ return get_crashkernel_default(system_ram, crash_size);
+ ck_cmdline = get_last_crashkernel(cmdline, name, suffix);
if (!ck_cmdline)
return -EINVAL;
Dave Young <[email protected]> writes:
> [snip]
>
>> >
>> > +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
>> > + int "System memory size threshold for kdump memory default reserving"
>> > + depends on CRASH_CORE
>> > + default 0
>> > + help
>> > + CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
>> > + the system memory size is equal or bigger than the threshold.
>>
>> "the threshold" is rather vague. Can it be clarified?
>>
>> In fact I'm really struggling to understand the logic here....
>>
>>
>> > +config CRASHKERNEL_DEFAULT_MB
>> > + int "Default crashkernel memory size reserved for kdump"
>> > + depends on CRASH_CORE
>> > + default 0
>> > + help
>> > + This is used as the default kdump reserved memory size in MB.
>> > + crashkernel=X kernel cmdline can overwrite this value.
>> > +
>> > config HAVE_IMA_KEXEC
>> > bool
>> >
>> > @@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
>> > return 0;
>> > }
>> >
>> > +static int __init get_crashkernel_default(unsigned long long system_ram,
>> > + unsigned long long *size)
>> > +{
>> > + unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
>> > + unsigned long long thres = CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
>> > +
>> > + thres *= SZ_1M;
>> > + sz *= SZ_1M;
>> > +
>> > + if (sz >= system_ram || system_ram < thres) {
>> > + pr_debug("crashkernel default size can not be used.\n");
>> > + return -EINVAL;
>>
>> In other words,
>>
>> if (system_ram <= CONFIG_CRASHKERNEL_DEFAULT_MB ||
>> system_ram < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB)
>> fail;
>>
>> yes?
>>
>> How come? What's happening here? Perhaps a (good) explanatory comment
>> is needed. And clearer Kconfig text.
>>
>> All confused :(
>
> Andrew, I tuned it a bit, removed the check of sz >= system_ram, so if
> the size is too large and kernel can not find enough memory it will
> still fail in latter code.
>
> Is below version looks clearer?
What is the advantage of providing this in a kconfig option rather
than on the kernel command line as we can now?
Eric
> ---
>
> This is a rework of the crashkernel=auto patches back to 2009 although
> I'm not sure if below is the last version of the old effort:
> https://lkml.org/lkml/2009/8/12/61
> https://lwn.net/Articles/345344/
>
> I changed the original design, instead of adding the auto reserve logic
> in code, in this patch just introduce two kernel config options for
> the default crashkernel value in MB and the threshold of system memory
> in MB so that only reserve default when system memory is equal or
> above the threshold.
>
> Signed-off-by: Dave Young <[email protected]>
> ---
> Another difference is with original design the crashkernel size scales
> with system memory, according to test, large machine may need more
> memory in kdump kernel because of several factors:
> 1. cpu numbers, because of the percpu memory allocated for cpus.
> (kdump can use nr_cpus=1 to workaround this, but some
> arches do not support nr_cpus=X for example powerpc)
> 2. IO devices, large system can have a lot of io devices, although we
> can try to only add those device drivers we needed, it is still a
> problem because of some built-in drivers, some stacked logical devices
> eg. device mapper devices, acpi etc. Even if only considering the
> meta data for driver model it will still be a big number eg. sysfs
> files etc.
> 3. The minimum memory requirement for some device drivers are big, even
> if some of them have implemented low meory profile. It is usual to see
> 10M memory use for a storage driver.
> 4. user space initramfs size growing. Busybox is not usable if we need
> to add udev support and some complicate storage support. Use dracut
> with systemd, especially networking stuff need more memory.
>
> So probably add another kernel config option to scale the memory size
> eg. CRASHKERNEL_DEFAULT_SCALE_RATIO is also good to have, in RHEL we
> use base_value + system_mem >> (2^14) for x86. I'm still hesatating
> how to describe and add this option. Any suggestions will be appreciated.
>
> arch/Kconfig | 17 +++++++++++++++++
> kernel/crash_core.c | 19 ++++++++++++++++++-
> 2 files changed, 35 insertions(+), 1 deletion(-)
>
> --- linux-x86.orig/arch/Kconfig
> +++ linux-x86/arch/Kconfig
> @@ -10,6 +10,23 @@ config KEXEC_CORE
> select CRASH_CORE
> bool
>
> +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
> + int "System memory size threshold for using CRASHKERNEL_DEFAULT_MB"
> + depends on CRASH_CORE
> + default 0
> + help
> + CRASHKERNEL_DEFAULT_MB will be reserved for kdump if the system
> + memory is above or equal to CRASHKERNEL_DEFAULT_THRESHOLD_MB MB.
> + It is only effective in case no crashkernel=X parameter is used.
> +
> +config CRASHKERNEL_DEFAULT_MB
> + int "Default crashkernel memory size reserved for kdump"
> + depends on CRASH_CORE
> + default 0
> + help
> + This is used as the default kdump reserved memory size in MB.
> + crashkernel=X kernel cmdline can overwrite this value.
> +
> config HAVE_IMA_KEXEC
> bool
>
> --- linux-x86.orig/kernel/crash_core.c
> +++ linux-x86/kernel/crash_core.c
> @@ -143,6 +143,21 @@ static int __init parse_crashkernel_simp
> return 0;
> }
>
> +static int __init get_crashkernel_default(unsigned long long system_ram,
> + unsigned long long *size)
> +{
> + unsigned long long system_ram_mb = system_ram >> 20;
> +
> + if (system_ram_mb < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB) {
> + pr_debug("crashkernel: system memory size is lower than %d\n",
> + CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB);
> + return -EINVAL;
> + }
> + *size = (unsigned long long)CONFIG_CRASHKERNEL_DEFAULT_MB << 20;
> +
> + return 0;
> +}
> +
> #define SUFFIX_HIGH 0
> #define SUFFIX_LOW 1
> #define SUFFIX_NULL 2
> @@ -240,8 +255,10 @@ static int __init __parse_crashkernel(ch
> *crash_size = 0;
> *crash_base = 0;
>
> - ck_cmdline = get_last_crashkernel(cmdline, name, suffix);
> + if (!strstr(cmdline, "crashkernel="))
> + return get_crashkernel_default(system_ram, crash_size);
>
> + ck_cmdline = get_last_crashkernel(cmdline, name, suffix);
> if (!ck_cmdline)
> return -EINVAL;
>
On Wed, 23 May 2018 10:53:55 -0500
[email protected] (Eric W. Biederman) wrote:
> Dave Young <[email protected]> writes:
>
> > [snip]
> >
> >> >
> >> > +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
> >> > + int "System memory size threshold for kdump memory default reserving"
> >> > + depends on CRASH_CORE
> >> > + default 0
> >> > + help
> >> > + CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
> >> > + the system memory size is equal or bigger than the threshold.
> >>
> >> "the threshold" is rather vague. Can it be clarified?
> >>
> >> In fact I'm really struggling to understand the logic here....
> >>
> >>
> >> > +config CRASHKERNEL_DEFAULT_MB
> >> > + int "Default crashkernel memory size reserved for kdump"
> >> > + depends on CRASH_CORE
> >> > + default 0
> >> > + help
> >> > + This is used as the default kdump reserved memory size in MB.
> >> > + crashkernel=X kernel cmdline can overwrite this value.
> >> > +
> >> > config HAVE_IMA_KEXEC
> >> > bool
> >> >
> >> > @@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
> >> > return 0;
> >> > }
> >> >
> >> > +static int __init get_crashkernel_default(unsigned long long system_ram,
> >> > + unsigned long long *size)
> >> > +{
> >> > + unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
> >> > + unsigned long long thres = CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
> >> > +
> >> > + thres *= SZ_1M;
> >> > + sz *= SZ_1M;
> >> > +
> >> > + if (sz >= system_ram || system_ram < thres) {
> >> > + pr_debug("crashkernel default size can not be used.\n");
> >> > + return -EINVAL;
> >>
> >> In other words,
> >>
> >> if (system_ram <= CONFIG_CRASHKERNEL_DEFAULT_MB ||
> >> system_ram < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB)
> >> fail;
> >>
> >> yes?
> >>
> >> How come? What's happening here? Perhaps a (good) explanatory comment
> >> is needed. And clearer Kconfig text.
> >>
> >> All confused :(
> >
> > Andrew, I tuned it a bit, removed the check of sz >= system_ram, so if
> > the size is too large and kernel can not find enough memory it will
> > still fail in latter code.
> >
> > Is below version looks clearer?
>
> What is the advantage of providing this in a kconfig option rather
> than on the kernel command line as we can now?
Yeah, I was about to ask the very same question.
Having spent quite some time on estimating RAM required to save a crash
dump, I can tell you that there is no silver bullet. My main objection
is that core dumps are saved from user space, and the kernel cannot
have a clue what it is going to be.
First, the primary kernel cannot know how much memory will be needed
for the panic kernel (not necessarily same as the primary kernel) and
the panic initrd. If you build a minimal initrd for your system, then
at least it depends on which modules must be included, which in turn
depends on where you want to store the resulting dump. Mounting a local
ext2 partition will require less software than mounting an LVM logical
volume in a PV accessed through iSCSI over two bonded Ethernet NICs.
Second, run-time requirements may vary wildly. While sending the data
over a simple TCP connection (e.g. using FTP) consumes just a few
megabytes even on 10G Ethernet, dm block devices tend to consume much
more, because of the additional buffers allocated by device mapper.
Third, systems should be treated as "big" not so much because of the
amount of RAM, but more so because of the amount of attached devices.
I've seen a machine with devices from /dev/sda to /dev/sdvm; try to
calculate how much kernel memory is taken just by their in-kernel
representation...
Fourth, quite often there is a trade-off between how much memory is
reserved for the panic environment, and how long dumping will take. For
example, you may take advantage of multi-threading in makedumpfile, but
obviously, the additional threads need more memory (or makedumpfile
will have to do its job in more cycles, reducing speed again). Oh, did
I mention that even bringing up more CPUs has an impact on kernel
runtime memory requirements?
In short, if one size fits none, what good is it to hardcode that "one
size" into the kernel image?
Petr T
Hi Eric,
On 05/23/18 at 10:53am, Eric W. Biederman wrote:
> Dave Young <[email protected]> writes:
>
> > [snip]
> >
> >> >
> >> > +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
> >> > + int "System memory size threshold for kdump memory default reserving"
> >> > + depends on CRASH_CORE
> >> > + default 0
> >> > + help
> >> > + CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
> >> > + the system memory size is equal or bigger than the threshold.
> >>
> >> "the threshold" is rather vague. Can it be clarified?
> >>
> >> In fact I'm really struggling to understand the logic here....
> >>
> >>
> >> > +config CRASHKERNEL_DEFAULT_MB
> >> > + int "Default crashkernel memory size reserved for kdump"
> >> > + depends on CRASH_CORE
> >> > + default 0
> >> > + help
> >> > + This is used as the default kdump reserved memory size in MB.
> >> > + crashkernel=X kernel cmdline can overwrite this value.
> >> > +
> >> > config HAVE_IMA_KEXEC
> >> > bool
> >> >
> >> > @@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
> >> > return 0;
> >> > }
> >> >
> >> > +static int __init get_crashkernel_default(unsigned long long system_ram,
> >> > + unsigned long long *size)
> >> > +{
> >> > + unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
> >> > + unsigned long long thres = CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
> >> > +
> >> > + thres *= SZ_1M;
> >> > + sz *= SZ_1M;
> >> > +
> >> > + if (sz >= system_ram || system_ram < thres) {
> >> > + pr_debug("crashkernel default size can not be used.\n");
> >> > + return -EINVAL;
> >>
> >> In other words,
> >>
> >> if (system_ram <= CONFIG_CRASHKERNEL_DEFAULT_MB ||
> >> system_ram < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB)
> >> fail;
> >>
> >> yes?
> >>
> >> How come? What's happening here? Perhaps a (good) explanatory comment
> >> is needed. And clearer Kconfig text.
> >>
> >> All confused :(
> >
> > Andrew, I tuned it a bit, removed the check of sz >= system_ram, so if
> > the size is too large and kernel can not find enough memory it will
> > still fail in latter code.
> >
> > Is below version looks clearer?
>
> What is the advantage of providing this in a kconfig option rather
> than on the kernel command line as we can now?
It is not a replacement of the cmdline, this can be a supplement to
the crashkernel command line. For a lot of common use cases if we have
the auto reservation user just do not need to manually set the cmdline
for example on a virtual machine and usual setup (except of the
comlicate storage and very large machines). The crashkernel=auto
has been used for long time, Red Hat QE tested it on a lot of different
lab machines and proved it works well. Kdump usually just works so admin
do little work to enable kdump.
But the crashkernel=auto implementation has some drawbacks that is it
is more like embed policy in the code and it is not flexible like a
config option.
Thanks
Dave
Hi Petr,
On 05/23/18 at 10:22pm, Petr Tesarik wrote:
> On Wed, 23 May 2018 10:53:55 -0500
> [email protected] (Eric W. Biederman) wrote:
>
> > Dave Young <[email protected]> writes:
> >
> > > [snip]
> > >
> > >> >
> > >> > +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
> > >> > + int "System memory size threshold for kdump memory default reserving"
> > >> > + depends on CRASH_CORE
> > >> > + default 0
> > >> > + help
> > >> > + CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
> > >> > + the system memory size is equal or bigger than the threshold.
> > >>
> > >> "the threshold" is rather vague. Can it be clarified?
> > >>
> > >> In fact I'm really struggling to understand the logic here....
> > >>
> > >>
> > >> > +config CRASHKERNEL_DEFAULT_MB
> > >> > + int "Default crashkernel memory size reserved for kdump"
> > >> > + depends on CRASH_CORE
> > >> > + default 0
> > >> > + help
> > >> > + This is used as the default kdump reserved memory size in MB.
> > >> > + crashkernel=X kernel cmdline can overwrite this value.
> > >> > +
> > >> > config HAVE_IMA_KEXEC
> > >> > bool
> > >> >
> > >> > @@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
> > >> > return 0;
> > >> > }
> > >> >
> > >> > +static int __init get_crashkernel_default(unsigned long long system_ram,
> > >> > + unsigned long long *size)
> > >> > +{
> > >> > + unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
> > >> > + unsigned long long thres = CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
> > >> > +
> > >> > + thres *= SZ_1M;
> > >> > + sz *= SZ_1M;
> > >> > +
> > >> > + if (sz >= system_ram || system_ram < thres) {
> > >> > + pr_debug("crashkernel default size can not be used.\n");
> > >> > + return -EINVAL;
> > >>
> > >> In other words,
> > >>
> > >> if (system_ram <= CONFIG_CRASHKERNEL_DEFAULT_MB ||
> > >> system_ram < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB)
> > >> fail;
> > >>
> > >> yes?
> > >>
> > >> How come? What's happening here? Perhaps a (good) explanatory comment
> > >> is needed. And clearer Kconfig text.
> > >>
> > >> All confused :(
> > >
> > > Andrew, I tuned it a bit, removed the check of sz >= system_ram, so if
> > > the size is too large and kernel can not find enough memory it will
> > > still fail in latter code.
> > >
> > > Is below version looks clearer?
> >
> > What is the advantage of providing this in a kconfig option rather
> > than on the kernel command line as we can now?
>
> Yeah, I was about to ask the very same question.
>
> Having spent quite some time on estimating RAM required to save a crash
> dump, I can tell you that there is no silver bullet. My main objection
> is that core dumps are saved from user space, and the kernel cannot
> have a clue what it is going to be.
>
> First, the primary kernel cannot know how much memory will be needed
> for the panic kernel (not necessarily same as the primary kernel) and
> the panic initrd. If you build a minimal initrd for your system, then
> at least it depends on which modules must be included, which in turn
> depends on where you want to store the resulting dump. Mounting a local
> ext2 partition will require less software than mounting an LVM logical
> volume in a PV accessed through iSCSI over two bonded Ethernet NICs.
>
> Second, run-time requirements may vary wildly. While sending the data
> over a simple TCP connection (e.g. using FTP) consumes just a few
> megabytes even on 10G Ethernet, dm block devices tend to consume much
> more, because of the additional buffers allocated by device mapper.
>
> Third, systems should be treated as "big" not so much because of the
> amount of RAM, but more so because of the amount of attached devices.
> I've seen a machine with devices from /dev/sda to /dev/sdvm; try to
> calculate how much kernel memory is taken just by their in-kernel
> representation...
>
> Fourth, quite often there is a trade-off between how much memory is
> reserved for the panic environment, and how long dumping will take. For
> example, you may take advantage of multi-threading in makedumpfile, but
> obviously, the additional threads need more memory (or makedumpfile
> will have to do its job in more cycles, reducing speed again). Oh, did
> I mention that even bringing up more CPUs has an impact on kernel
> runtime memory requirements?
>
> In short, if one size fits none, what good is it to hardcode that "one
> size" into the kernel image?
I agreed with all the things that we can not know the exact memory
requirement for 100% use cases. But that does not means this is useless
it is still useful for common use cases of no special and memory hog
requirements as I mentioned in another reply it can simplify the kdump
deployment for those people who do not need the special setup.
For example, if this is a workstation I just want to break into a shell
to collect some panic info, then I just need a very minimal initrd, then
the Kconfig will work just fine.
Thanks
Dave
On Thu, 24 May 2018 09:49:05 +0800
Dave Young <[email protected]> wrote:
> Hi Petr,
>
> On 05/23/18 at 10:22pm, Petr Tesarik wrote:
>[...]
> > In short, if one size fits none, what good is it to hardcode that "one
> > size" into the kernel image?
>
> I agreed with all the things that we can not know the exact memory
> requirement for 100% use cases. But that does not means this is useless
> it is still useful for common use cases of no special and memory hog
> requirements as I mentioned in another reply it can simplify the kdump
> deployment for those people who do not need the special setup.
I still tend to disagree. This "common-case" reservation depends on
things that are defined by user space. It surely does not make it
easier to build a distribution kernel. Today, I get bug reports that
the number calculated and added to the boot loader configuration by the
installer is inaccurate. If I put a fixed number into a kernel config
option, I will start getting bugs that this number is incorrect (for
some systems).
> For example, if this is a workstation I just want to break into a shell
> to collect some panic info, then I just need a very minimal initrd, then
> the Kconfig will work just fine.
What is "a very minimal initrd"? Last time I had to make a significant
adjustment to the estimation for openSUSE, this was caused by growing
user-space requirements (systemd in this case, but I don't want to
start flamewars on that topic, please).
Anyway, if you want to improve the "common case", then look how IBM
tries to solve it for firmware-assisted dump (fadump) on powerpc:
https://patchwork.ozlabs.org/patch/905026/
The main idea is:
> Instead of setting aside a significant chunk of memory nobody can use,
> [...] reserve a significant chunk of memory that the kernel is prevented
> from using [...], but applications are free to use it.
That works great, because user space pages are filtered out in the
common case, so they can be used freely by the panic kernel.
Just my two cents,
Petr T
On 05/24/18 at 08:57am, Petr Tesarik wrote:
> On Thu, 24 May 2018 09:49:05 +0800
> Dave Young <[email protected]> wrote:
>
> > Hi Petr,
> >
> > On 05/23/18 at 10:22pm, Petr Tesarik wrote:
> >[...]
> > > In short, if one size fits none, what good is it to hardcode that "one
> > > size" into the kernel image?
> >
> > I agreed with all the things that we can not know the exact memory
> > requirement for 100% use cases. But that does not means this is useless
> > it is still useful for common use cases of no special and memory hog
> > requirements as I mentioned in another reply it can simplify the kdump
> > deployment for those people who do not need the special setup.
>
> I still tend to disagree. This "common-case" reservation depends on
> things that are defined by user space. It surely does not make it
> easier to build a distribution kernel. Today, I get bug reports that
> the number calculated and added to the boot loader configuration by the
> installer is inaccurate. If I put a fixed number into a kernel config
> option, I will start getting bugs that this number is incorrect (for
> some systems).
The value is a best effort, it will never be 100% correct. We did not
guarantee that. The kernel config option value is just up to user.
So I'm thinking it as a good to have benefit.
>
> > For example, if this is a workstation I just want to break into a shell
> > to collect some panic info, then I just need a very minimal initrd, then
> > the Kconfig will work just fine.
>
> What is "a very minimal initrd"? Last time I had to make a significant
> adjustment to the estimation for openSUSE, this was caused by growing
> user-space requirements (systemd in this case, but I don't want to
> start flamewars on that topic, please).
Still I think we have agreement and same feeling about the userspace
memory requirement. I think although it is hard, we have been still
trying to shrink the initramfs memory use.
Besides of distribution use, why people can not use some minimal
initrd? For example only a basic shell and some necessary tools and
basic storage eg. raw disks supported, and he/she can just collect the
panic infomation by himself in a shell.
>
> Anyway, if you want to improve the "common case", then look how IBM
> tries to solve it for firmware-assisted dump (fadump) on powerpc:
>
> https://patchwork.ozlabs.org/patch/905026/
>
> The main idea is:
>
> > Instead of setting aside a significant chunk of memory nobody can use,
> > [...] reserve a significant chunk of memory that the kernel is prevented
> > from using [...], but applications are free to use it.
>
> That works great, because user space pages are filtered out in the
> common case, so they can be used freely by the panic kernel.
Good suggestion. I have been reading that posts already at the same time before I saw
this reply from you :)
That could be a good idea and worth to discuss more. I cced Hari
already in the thread. Hari, is it possible for you to extend your
idea to general use, ie. shared by both kdump and fadump? Anyway I
think that is another topic we can discuss separately.
>
> Just my two cents,
> Petr T
Thanks
Dave
On 05/24/18 at 08:57am, Petr Tesarik wrote:
> On Thu, 24 May 2018 09:49:05 +0800
> Dave Young <[email protected]> wrote:
>
> > Hi Petr,
> >
> > On 05/23/18 at 10:22pm, Petr Tesarik wrote:
> >[...]
> > > In short, if one size fits none, what good is it to hardcode that "one
> > > size" into the kernel image?
> >
> > I agreed with all the things that we can not know the exact memory
> > requirement for 100% use cases. But that does not means this is useless
> > it is still useful for common use cases of no special and memory hog
> > requirements as I mentioned in another reply it can simplify the kdump
> > deployment for those people who do not need the special setup.
>
> I still tend to disagree. This "common-case" reservation depends on
> things that are defined by user space. It surely does not make it
> easier to build a distribution kernel. Today, I get bug reports that
> the number calculated and added to the boot loader configuration by the
> installer is inaccurate. If I put a fixed number into a kernel config
> option, I will start getting bugs that this number is incorrect (for
> some systems).
>
> > For example, if this is a workstation I just want to break into a shell
> > to collect some panic info, then I just need a very minimal initrd, then
> > the Kconfig will work just fine.
>
> What is "a very minimal initrd"? Last time I had to make a significant
> adjustment to the estimation for openSUSE, this was caused by growing
> user-space requirements (systemd in this case, but I don't want to
> start flamewars on that topic, please).
>
> Anyway, if you want to improve the "common case", then look how IBM
> tries to solve it for firmware-assisted dump (fadump) on powerpc:
>
> https://patchwork.ozlabs.org/patch/905026/
>
> The main idea is:
>
> > Instead of setting aside a significant chunk of memory nobody can use,
> > [...] reserve a significant chunk of memory that the kernel is prevented
> > from using [...], but applications are free to use it.
>
> That works great, because user space pages are filtered out in the
> common case, so they can be used freely by the panic kernel.
This seems a good idea, just makedumpfile need be adjusted since it allows
user to decide if dump user space data or not.
On 05/24/18 at 03:26pm, Dave Young wrote:
> On 05/24/18 at 08:57am, Petr Tesarik wrote:
> > On Thu, 24 May 2018 09:49:05 +0800
> > Dave Young <[email protected]> wrote:
> >
> > > Hi Petr,
> > >
> > > On 05/23/18 at 10:22pm, Petr Tesarik wrote:
> > >[...]
> > > > In short, if one size fits none, what good is it to hardcode that "one
> > > > size" into the kernel image?
> > >
> > > I agreed with all the things that we can not know the exact memory
> > > requirement for 100% use cases. But that does not means this is useless
> > > it is still useful for common use cases of no special and memory hog
> > > requirements as I mentioned in another reply it can simplify the kdump
> > > deployment for those people who do not need the special setup.
> >
> > I still tend to disagree. This "common-case" reservation depends on
> > things that are defined by user space. It surely does not make it
> > easier to build a distribution kernel. Today, I get bug reports that
> > the number calculated and added to the boot loader configuration by the
> > installer is inaccurate. If I put a fixed number into a kernel config
> > option, I will start getting bugs that this number is incorrect (for
> > some systems).
>
> The value is a best effort, it will never be 100% correct. We did not
> guarantee that. The kernel config option value is just up to user.
> So I'm thinking it as a good to have benefit.
I means this patch is not trying to force add a fixed value for crashkernel
in kernel code. It provides another way one can use on kernel build time
the value just works.
>
> >
> > > For example, if this is a workstation I just want to break into a shell
> > > to collect some panic info, then I just need a very minimal initrd, then
> > > the Kconfig will work just fine.
> >
> > What is "a very minimal initrd"? Last time I had to make a significant
> > adjustment to the estimation for openSUSE, this was caused by growing
> > user-space requirements (systemd in this case, but I don't want to
> > start flamewars on that topic, please).
>
> Still I think we have agreement and same feeling about the userspace
> memory requirement. I think although it is hard, we have been still
> trying to shrink the initramfs memory use.
>
> Besides of distribution use, why people can not use some minimal
> initrd? For example only a basic shell and some necessary tools and
> basic storage eg. raw disks supported, and he/she can just collect the
> panic infomation by himself in a shell.
>
> >
> > Anyway, if you want to improve the "common case", then look how IBM
> > tries to solve it for firmware-assisted dump (fadump) on powerpc:
> >
> > https://patchwork.ozlabs.org/patch/905026/
> >
> > The main idea is:
> >
> > > Instead of setting aside a significant chunk of memory nobody can use,
> > > [...] reserve a significant chunk of memory that the kernel is prevented
> > > from using [...], but applications are free to use it.
> >
> > That works great, because user space pages are filtered out in the
> > common case, so they can be used freely by the panic kernel.
>
> Good suggestion. I have been reading that posts already at the same time before I saw
> this reply from you :)
>
> That could be a good idea and worth to discuss more. I cced Hari
> already in the thread. Hari, is it possible for you to extend your
> idea to general use, ie. shared by both kdump and fadump? Anyway I
> think that is another topic we can discuss separately.
>
> >
> > Just my two cents,
> > Petr T
>
> Thanks
> Dave
> > > Instead of setting aside a significant chunk of memory nobody can use,
> > > [...] reserve a significant chunk of memory that the kernel is prevented
> > > from using [...], but applications are free to use it.
> >
> > That works great, because user space pages are filtered out in the
> > common case, so they can be used freely by the panic kernel.
>
> Good suggestion. I have been reading that posts already at the same time before I saw
> this reply from you :)
>
> That could be a good idea and worth to discuss more. I cced Hari
> already in the thread. Hari, is it possible for you to extend your
> idea to general use, ie. shared by both kdump and fadump? Anyway I
> think that is another topic we can discuss separately.
BTW, I remember we had some Red hat internal discussion about CMA previously
there is a problem, that is we have crashkernel=,high for reserving high
memory and ,low for low memory, we were not sure if CMA can handle this
case.
Thanks
Dave
Thanks
Dave
On 05/24/18 at 03:56pm, Dave Young wrote:
> > > > Instead of setting aside a significant chunk of memory nobody can use,
> > > > [...] reserve a significant chunk of memory that the kernel is prevented
> > > > from using [...], but applications are free to use it.
> > >
> > > That works great, because user space pages are filtered out in the
> > > common case, so they can be used freely by the panic kernel.
> >
> > Good suggestion. I have been reading that posts already at the same time before I saw
> > this reply from you :)
> >
> > That could be a good idea and worth to discuss more. I cced Hari
> > already in the thread. Hari, is it possible for you to extend your
> > idea to general use, ie. shared by both kdump and fadump? Anyway I
> > think that is another topic we can discuss separately.
>
> BTW, I remember we had some Red hat internal discussion about CMA previously
> there is a problem, that is we have crashkernel=,high for reserving high
> memory and ,low for low memory, we were not sure if CMA can handle this
> case.
Pratyush ever investigated this too see if CMA can be used so that we
can dynamically adjust the crashkernel memory after boot. The result is
CMA is not good. I doubt fadump can really make use of CMA to reserve
crashkernel.
Thanks
Baoquan
On Thu, 24 May 2018 15:26:27 +0800
Dave Young <[email protected]> wrote:
> On 05/24/18 at 08:57am, Petr Tesarik wrote:
>[...]
> > What is "a very minimal initrd"? Last time I had to make a significant
> > adjustment to the estimation for openSUSE, this was caused by growing
> > user-space requirements (systemd in this case, but I don't want to
> > start flamewars on that topic, please).
>
> Still I think we have agreement and same feeling about the userspace
> memory requirement. I think although it is hard, we have been still
> trying to shrink the initramfs memory use.
>
> Besides of distribution use, why people can not use some minimal
> initrd? For example only a basic shell and some necessary tools and
> basic storage eg. raw disks supported, and he/she can just collect the
> panic infomation by himself in a shell.
Again, I'm having trouble with the definition of a "minimal initrd" and
also with the definition of a "workstation". I have already seen a sad
case where kdump started going OOM after connecting a 4K monitor,
because, well, it needed a bigger framebuffer...
OTOH you wrote in another mail that RH has tested some values on a
variety of hardware, so you seem to have a clue. Good for you. I still
believe it is moving policy into the kernel.
Based on past experience, I expect that certain users will argue that
"crashkernel=auto" should work out of the box on their HPE Superdome
with 600+ LUNs attached...
As you wrote elsewhere in the thread:
> I means this patch is not trying to force add a fixed value for crashkernel
> in kernel code. It provides another way one can use on kernel build time
> the value just works.
I don't mind if it is added, although I don't find it very useful.
Petr T
Petr Tesarik <[email protected]> writes:
2> On Thu, 24 May 2018 09:49:05 +0800
> Dave Young <[email protected]> wrote:
>
>> Hi Petr,
>>
>> On 05/23/18 at 10:22pm, Petr Tesarik wrote:
>>[...]
>> > In short, if one size fits none, what good is it to hardcode that "one
>> > size" into the kernel image?
>>
>> I agreed with all the things that we can not know the exact memory
>> requirement for 100% use cases. But that does not means this is useless
>> it is still useful for common use cases of no special and memory hog
>> requirements as I mentioned in another reply it can simplify the kdump
>> deployment for those people who do not need the special setup.
>
> I still tend to disagree. This "common-case" reservation depends on
> things that are defined by user space. It surely does not make it
> easier to build a distribution kernel. Today, I get bug reports that
> the number calculated and added to the boot loader configuration by the
> installer is inaccurate. If I put a fixed number into a kernel config
> option, I will start getting bugs that this number is incorrect (for
> some systems).
>
>> For example, if this is a workstation I just want to break into a shell
>> to collect some panic info, then I just need a very minimal initrd, then
>> the Kconfig will work just fine.
>
> What is "a very minimal initrd"? Last time I had to make a significant
> adjustment to the estimation for openSUSE, this was caused by growing
> user-space requirements (systemd in this case, but I don't want to
> start flamewars on that topic, please).
>
> Anyway, if you want to improve the "common case", then look how IBM
> tries to solve it for firmware-assisted dump (fadump) on powerpc:
>
> https://patchwork.ozlabs.org/patch/905026/
>
> The main idea is:
>
>> Instead of setting aside a significant chunk of memory nobody can use,
>> [...] reserve a significant chunk of memory that the kernel is prevented
>> from using [...], but applications are free to use it.
>
> That works great, because user space pages are filtered out in the
> common case, so they can be used freely by the panic kernel.
They absolutely can not be used in the kdump case.
The kdump requirement is that they are pages no-one initiates any I/O
to. To avoid the problem of devices doing DMA as the new kernel starts
and runs. Secondarily to avoid problems with cpus that refused to halt.
Eric
Dave Young <[email protected]> writes:
> Hi Eric,
> On 05/23/18 at 10:53am, Eric W. Biederman wrote:
>> Dave Young <[email protected]> writes:
>>
>> > [snip]
>> >
>> >> >
>> >> > +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
>> >> > + int "System memory size threshold for kdump memory default reserving"
>> >> > + depends on CRASH_CORE
>> >> > + default 0
>> >> > + help
>> >> > + CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
>> >> > + the system memory size is equal or bigger than the threshold.
>> >>
>> >> "the threshold" is rather vague. Can it be clarified?
>> >>
>> >> In fact I'm really struggling to understand the logic here....
>> >>
>> >>
>> >> > +config CRASHKERNEL_DEFAULT_MB
>> >> > + int "Default crashkernel memory size reserved for kdump"
>> >> > + depends on CRASH_CORE
>> >> > + default 0
>> >> > + help
>> >> > + This is used as the default kdump reserved memory size in MB.
>> >> > + crashkernel=X kernel cmdline can overwrite this value.
>> >> > +
>> >> > config HAVE_IMA_KEXEC
>> >> > bool
>> >> >
>> >> > @@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
>> >> > return 0;
>> >> > }
>> >> >
>> >> > +static int __init get_crashkernel_default(unsigned long long system_ram,
>> >> > + unsigned long long *size)
>> >> > +{
>> >> > + unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
>> >> > + unsigned long long thres = CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
>> >> > +
>> >> > + thres *= SZ_1M;
>> >> > + sz *= SZ_1M;
>> >> > +
>> >> > + if (sz >= system_ram || system_ram < thres) {
>> >> > + pr_debug("crashkernel default size can not be used.\n");
>> >> > + return -EINVAL;
>> >>
>> >> In other words,
>> >>
>> >> if (system_ram <= CONFIG_CRASHKERNEL_DEFAULT_MB ||
>> >> system_ram < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB)
>> >> fail;
>> >>
>> >> yes?
>> >>
>> >> How come? What's happening here? Perhaps a (good) explanatory comment
>> >> is needed. And clearer Kconfig text.
>> >>
>> >> All confused :(
>> >
>> > Andrew, I tuned it a bit, removed the check of sz >= system_ram, so if
>> > the size is too large and kernel can not find enough memory it will
>> > still fail in latter code.
>> >
>> > Is below version looks clearer?
>>
>> What is the advantage of providing this in a kconfig option rather
>> than on the kernel command line as we can now?
>
> It is not a replacement of the cmdline, this can be a supplement to
> the crashkernel command line. For a lot of common use cases if we have
> the auto reservation user just do not need to manually set the cmdline
> for example on a virtual machine and usual setup (except of the
> comlicate storage and very large machines). The crashkernel=auto
> has been used for long time, Red Hat QE tested it on a lot of different
> lab machines and proved it works well. Kdump usually just works so admin
> do little work to enable kdump.
>
> But the crashkernel=auto implementation has some drawbacks that is it
> is more like embed policy in the code and it is not flexible like a
> config option.
Have you considered using the builtin command line aka CONFIG_CMDLINE?
If as you are reserving a fixed amount of memory as your patch does that
should be sufficient, and doable without any kernel changes.
Eric
Hi Eric,
On 05/24/18 at 11:41am, Eric W. Biederman wrote:
> Dave Young <[email protected]> writes:
>
> > Hi Eric,
> > On 05/23/18 at 10:53am, Eric W. Biederman wrote:
> >> Dave Young <[email protected]> writes:
> >>
> >> > [snip]
> >> >
> >> >> >
> >> >> > +config CRASHKERNEL_DEFAULT_THRESHOLD_MB
> >> >> > + int "System memory size threshold for kdump memory default reserving"
> >> >> > + depends on CRASH_CORE
> >> >> > + default 0
> >> >> > + help
> >> >> > + CRASHKERNEL_DEFAULT_MB is used as default crashkernel value if
> >> >> > + the system memory size is equal or bigger than the threshold.
> >> >>
> >> >> "the threshold" is rather vague. Can it be clarified?
> >> >>
> >> >> In fact I'm really struggling to understand the logic here....
> >> >>
> >> >>
> >> >> > +config CRASHKERNEL_DEFAULT_MB
> >> >> > + int "Default crashkernel memory size reserved for kdump"
> >> >> > + depends on CRASH_CORE
> >> >> > + default 0
> >> >> > + help
> >> >> > + This is used as the default kdump reserved memory size in MB.
> >> >> > + crashkernel=X kernel cmdline can overwrite this value.
> >> >> > +
> >> >> > config HAVE_IMA_KEXEC
> >> >> > bool
> >> >> >
> >> >> > @@ -143,6 +144,24 @@ static int __init parse_crashkernel_simp
> >> >> > return 0;
> >> >> > }
> >> >> >
> >> >> > +static int __init get_crashkernel_default(unsigned long long system_ram,
> >> >> > + unsigned long long *size)
> >> >> > +{
> >> >> > + unsigned long long sz = CONFIG_CRASHKERNEL_DEFAULT_MB;
> >> >> > + unsigned long long thres = CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB;
> >> >> > +
> >> >> > + thres *= SZ_1M;
> >> >> > + sz *= SZ_1M;
> >> >> > +
> >> >> > + if (sz >= system_ram || system_ram < thres) {
> >> >> > + pr_debug("crashkernel default size can not be used.\n");
> >> >> > + return -EINVAL;
> >> >>
> >> >> In other words,
> >> >>
> >> >> if (system_ram <= CONFIG_CRASHKERNEL_DEFAULT_MB ||
> >> >> system_ram < CONFIG_CRASHKERNEL_DEFAULT_THRESHOLD_MB)
> >> >> fail;
> >> >>
> >> >> yes?
> >> >>
> >> >> How come? What's happening here? Perhaps a (good) explanatory comment
> >> >> is needed. And clearer Kconfig text.
> >> >>
> >> >> All confused :(
> >> >
> >> > Andrew, I tuned it a bit, removed the check of sz >= system_ram, so if
> >> > the size is too large and kernel can not find enough memory it will
> >> > still fail in latter code.
> >> >
> >> > Is below version looks clearer?
> >>
> >> What is the advantage of providing this in a kconfig option rather
> >> than on the kernel command line as we can now?
> >
> > It is not a replacement of the cmdline, this can be a supplement to
> > the crashkernel command line. For a lot of common use cases if we have
> > the auto reservation user just do not need to manually set the cmdline
> > for example on a virtual machine and usual setup (except of the
> > comlicate storage and very large machines). The crashkernel=auto
> > has been used for long time, Red Hat QE tested it on a lot of different
> > lab machines and proved it works well. Kdump usually just works so admin
> > do little work to enable kdump.
> >
> > But the crashkernel=auto implementation has some drawbacks that is it
> > is more like embed policy in the code and it is not flexible like a
> > config option.
>
> Have you considered using the builtin command line aka CONFIG_CMDLINE?
> If as you are reserving a fixed amount of memory as your patch does that
> should be sufficient, and doable without any kernel changes.
Hmm, even in builtin cmdline it is same as a explict used crashkernel=.
If we think from a distribution point of view, it will be hard to
differentiate the builtin provided param and bootloader provided
params. It looks odd to see two crashkernel= when `cat /proc/cmdline`,
it will confuse people and there could cause compatibility problems
because it is explict value visible in kernel cmdline.
Thanks
Dave
V Thu, 24 May 2018 11:34:05 -0500
[email protected] (Eric W. Biederman) napsáno:
> Petr Tesarik <[email protected]> writes:
>
> 2> On Thu, 24 May 2018 09:49:05 +0800
> > Dave Young <[email protected]> wrote:
> >
> >> Hi Petr,
> >>
> >> On 05/23/18 at 10:22pm, Petr Tesarik wrote:
> >>[...]
> >> > In short, if one size fits none, what good is it to hardcode that "one
> >> > size" into the kernel image?
> >>
> >> I agreed with all the things that we can not know the exact memory
> >> requirement for 100% use cases. But that does not means this is useless
> >> it is still useful for common use cases of no special and memory hog
> >> requirements as I mentioned in another reply it can simplify the kdump
> >> deployment for those people who do not need the special setup.
> >
> > I still tend to disagree. This "common-case" reservation depends on
> > things that are defined by user space. It surely does not make it
> > easier to build a distribution kernel. Today, I get bug reports that
> > the number calculated and added to the boot loader configuration by the
> > installer is inaccurate. If I put a fixed number into a kernel config
> > option, I will start getting bugs that this number is incorrect (for
> > some systems).
> >
> >> For example, if this is a workstation I just want to break into a shell
> >> to collect some panic info, then I just need a very minimal initrd, then
> >> the Kconfig will work just fine.
> >
> > What is "a very minimal initrd"? Last time I had to make a significant
> > adjustment to the estimation for openSUSE, this was caused by growing
> > user-space requirements (systemd in this case, but I don't want to
> > start flamewars on that topic, please).
> >
> > Anyway, if you want to improve the "common case", then look how IBM
> > tries to solve it for firmware-assisted dump (fadump) on powerpc:
> >
> > https://patchwork.ozlabs.org/patch/905026/
> >
> > The main idea is:
> >
> >> Instead of setting aside a significant chunk of memory nobody can use,
> >> [...] reserve a significant chunk of memory that the kernel is prevented
> >> from using [...], but applications are free to use it.
> >
> > That works great, because user space pages are filtered out in the
> > common case, so they can be used freely by the panic kernel.
>
> They absolutely can not be used in the kdump case.
>
> The kdump requirement is that they are pages no-one initiates any I/O
> to. To avoid the problem of devices doing DMA as the new kernel starts
> and runs.
Good point. This means that memory reserved for this purpose would also
have to be excluded from allocations that may be eventually used for
DMA transfers.
> Secondarily to avoid problems with cpus that refused to halt.
Let's face it - if some CPUs refused to halt, all bets are off. The
code running on such a CPU can break many other things besides memory,
most importantly, it may meddle with the HW registers of crucial
devices in the system. To be less abstract, I have seen a failure to
stop a CPU in the crashed kernel a few times, and the panic kernel
could never successfully save anything; it always crashed at boot or a
little bit later.
Anyway, of course we would still have to keep the current method,
because user pages are not always filtered. For example, a major SUSE
account runs a database in user space and also inspects its data
structures in case of a system crash.
Petr T
Petr Tesarik <[email protected]> writes:
> V Thu, 24 May 2018 11:34:05 -0500
> [email protected] (Eric W. Biederman) napsáno:
>
>> Petr Tesarik <[email protected]> writes:
>>
>> 2> On Thu, 24 May 2018 09:49:05 +0800
>> > Dave Young <[email protected]> wrote:
>> >
>> >> Hi Petr,
>> >>
>> >> On 05/23/18 at 10:22pm, Petr Tesarik wrote:
>> >>[...]
>> >> > In short, if one size fits none, what good is it to hardcode that "one
>> >> > size" into the kernel image?
>> >>
>> >> I agreed with all the things that we can not know the exact memory
>> >> requirement for 100% use cases. But that does not means this is useless
>> >> it is still useful for common use cases of no special and memory hog
>> >> requirements as I mentioned in another reply it can simplify the kdump
>> >> deployment for those people who do not need the special setup.
>> >
>> > I still tend to disagree. This "common-case" reservation depends on
>> > things that are defined by user space. It surely does not make it
>> > easier to build a distribution kernel. Today, I get bug reports that
>> > the number calculated and added to the boot loader configuration by the
>> > installer is inaccurate. If I put a fixed number into a kernel config
>> > option, I will start getting bugs that this number is incorrect (for
>> > some systems).
>> >
>> >> For example, if this is a workstation I just want to break into a shell
>> >> to collect some panic info, then I just need a very minimal initrd, then
>> >> the Kconfig will work just fine.
>> >
>> > What is "a very minimal initrd"? Last time I had to make a significant
>> > adjustment to the estimation for openSUSE, this was caused by growing
>> > user-space requirements (systemd in this case, but I don't want to
>> > start flamewars on that topic, please).
>> >
>> > Anyway, if you want to improve the "common case", then look how IBM
>> > tries to solve it for firmware-assisted dump (fadump) on powerpc:
>> >
>> > https://patchwork.ozlabs.org/patch/905026/
>> >
>> > The main idea is:
>> >
>> >> Instead of setting aside a significant chunk of memory nobody can use,
>> >> [...] reserve a significant chunk of memory that the kernel is prevented
>> >> from using [...], but applications are free to use it.
>> >
>> > That works great, because user space pages are filtered out in the
>> > common case, so they can be used freely by the panic kernel.
>>
>> They absolutely can not be used in the kdump case.
>>
>> The kdump requirement is that they are pages no-one initiates any I/O
>> to. To avoid the problem of devices doing DMA as the new kernel starts
>> and runs.
>
> Good point. This means that memory reserved for this purpose would also
> have to be excluded from allocations that may be eventually used for
> DMA transfers.
Think of a network card. The DMA's for incomming packets can be
indefinitely delayed into the future unless that network card is
reprogrammed. If the dump kernel does not load the driver that won't
happen.
>> Secondarily to avoid problems with cpus that refused to halt.
>
> Let's face it - if some CPUs refused to halt, all bets are off. The
> code running on such a CPU can break many other things besides memory,
> most importantly, it may meddle with the HW registers of crucial
> devices in the system. To be less abstract, I have seen a failure to
> stop a CPU in the crashed kernel a few times, and the panic kernel
> could never successfully save anything; it always crashed at boot or a
> little bit later.
Crashing at boot is comparatively good. That is part of the design
criteria. It is better to fail to startup the kernel than to start a
corrupted kernel and mangle a users data.
But I do see how it can be a crap shoot when dealing with another cpu.
The ultimate point is that the absolute best we can do is to run a
kernel in memory that we never use for anything else and then we have a
fighting chance of getting the system working and getting a report of
the failure out to somewhere.
> Anyway, of course we would still have to keep the current method,
> because user pages are not always filtered. For example, a major SUSE
> account runs a database in user space and also inspects its data
> structures in case of a system crash.
And I understand the memory pressures that will encourage people to use
user pages for extra memory to run the dump capture kernel in. Short of
the presence of an IOMMU that all DMA transfers must go through I don't
see how those user pages could reliably be used.
Eric
On Fri, 25 May 2018 15:00:13 -0500
[email protected] (Eric W. Biederman) wrote:
>[...]
> The ultimate point is that the absolute best we can do is to run a
> kernel in memory that we never use for anything else and then we have a
> fighting chance of getting the system working and getting a report of
> the failure out to somewhere.
>
> > Anyway, of course we would still have to keep the current method,
> > because user pages are not always filtered. For example, a major SUSE
> > account runs a database in user space and also inspects its data
> > structures in case of a system crash.
>
> And I understand the memory pressures that will encourage people to use
> user pages for extra memory to run the dump capture kernel in. Short of
> the presence of an IOMMU that all DMA transfers must go through I don't
> see how those user pages could reliably be used.
Absolutely. I fully understand that a system which reuses first
kernel's memory in some way must be less reliable than the current
state. However, some people are willing to trade less reliability for
reduced resource consumption.
Note that we're not talking about reserving a few gigs on a single
machine with some terabytes of memory (i.e. less than 1% of total RAM),
rather a few hundred megs of each 4-gig VM on an s390x machine (i.e.
about 10% of total RAM).
Petr T
Petr Tesarik <[email protected]> writes:
> On Fri, 25 May 2018 15:00:13 -0500
> [email protected] (Eric W. Biederman) wrote:
>
>>[...]
>> The ultimate point is that the absolute best we can do is to run a
>> kernel in memory that we never use for anything else and then we have a
>> fighting chance of getting the system working and getting a report of
>> the failure out to somewhere.
>>
>> > Anyway, of course we would still have to keep the current method,
>> > because user pages are not always filtered. For example, a major SUSE
>> > account runs a database in user space and also inspects its data
>> > structures in case of a system crash.
>>
>> And I understand the memory pressures that will encourage people to use
>> user pages for extra memory to run the dump capture kernel in. Short of
>> the presence of an IOMMU that all DMA transfers must go through I don't
>> see how those user pages could reliably be used.
>
> Absolutely. I fully understand that a system which reuses first
> kernel's memory in some way must be less reliable than the current
> state. However, some people are willing to trade less reliability for
> reduced resource consumption.
That is the kind of tradeoff that can easily result in the crash kernel
eating your data. I will nack any patch that I see that goes anywhere
near that kind of solution for the kernel that takes the crash.
> Note that we're not talking about reserving a few gigs on a single
> machine with some terabytes of memory (i.e. less than 1% of total RAM),
> rather a few hundred megs of each 4-gig VM on an s390x machine (i.e.
> about 10% of total RAM).
You should be able to get away with tens of gigs instead of hundreds.
The biggest reservation I remember anyone ever making is about 100Meg.
And that was a general purpose configuration not tuned at all. With the
maximum size dealing with large machines.
kexec on panic grew up on machines with 4Gig or less as it arrived
before everyone was 64bit. It should be possible to tune your crash
dump taking kernel so things run in a reasonable amount of memory for
the configuration you are talking about. The usual trade-off is time
vs generality. Usually I simply have not seen people with non-embedded
configurations take the time to tune things.
Eric