2014-06-11 01:04:52

by Luis Chamberlain

[permalink] [raw]
Subject: [RFC] printk: allow increasing the ring buffer depending on the number of CPUs

From: "Luis R. Rodriguez" <[email protected]>

The default size of the ring buffer is too small for machines
with a large amount of CPUs under heavy load. What ends up
happening when debugging is the ring buffer overlaps and chews
up old messages making debugging impossible unless the size is
passed as a kernel parameter. An idle system upon boot up will
on average spew out only about one or two extra lines but where
this really matters is on heavy load and that will vary widely
depending on the system and environment.

There are mechanisms to help increase the kernel ring buffer
for tracing through debugfs, and those interfaces even allow growing
the kernel ring buffer per CPU. We also have a static value which
can be passed upon boot. Relying on debugfs however is not ideal
for production, and relying on the value passed upon bootup is
can only used *after* an issue has creeped up. Instead of being
reactive this adds a proactive measure which lets you scale the
amount of contributions you'd expect to the kernel ring buffer
under load by each CPU in the worst case scenerio.

We use num_possible_cpus() to avoid complexities which could be
introduced by dynamically changing the ring buffer size at run
time, num_possible_cpus() lets us use the upper limit on possible
number of CPUs therefore avoiding having to deal with hotplugging
CPUs on and off. This option is diabled by default, and if used
the kernel ring buffer size then can be computed as follows:

size = __LOG_BUF_LEN + (num_possible_cpus() - 1 ) * __LOG_CPU_BUF_LEN

Cc: Michal Hocko <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Joe Perches <[email protected]>
Cc: Arun KS <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: [email protected]
Signed-off-by: Luis R. Rodriguez <[email protected]>
---
init/Kconfig | 28 ++++++++++++++++++++++++++++
kernel/printk/printk.c | 6 ++++--
2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 9d3585b..1814436 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -806,6 +806,34 @@ config LOG_BUF_SHIFT
13 => 8 KB
12 => 4 KB

+config LOG_CPU_BUF_SHIFT
+ int "CPU kernel log buffer size contribution (13 => 8 KB, 17 => 128KB)"
+ range 0 21
+ default 0
+ help
+ The kernel ring buffer will get additional data logged onto it
+ when multiple CPUs are supported. Typically the contributions is a
+ few lines when idle however under under load this can vary and in the
+ worst case it can mean loosing logging information. You can use this
+ to set the maximum expected mount of amount of logging contribution
+ under load by each CPU in the worst case scenerio. Select a size as
+ a power of 2. For example if LOG_BUF_SHIFT is 18 and if your
+ LOG_CPU_BUF_SHIFT is 12 your kernel ring buffer size will be as
+ follows having 16 CPUs as possible.
+
+ ((1 << 18) + ((16 - 1) * (1 << 12))) / 1024 = 316 KB
+
+ Where as typically you'd only end up with 256 KB. This is disabled
+ by default with a value of 0.
+
+ Examples:
+ 17 => 128 KB
+ 16 => 64 KB
+ 15 => 32 KB
+ 14 => 16 KB
+ 13 => 8 KB
+ 12 => 4 KB
+
#
# Architectures with an unreliable sched_clock() should select this:
#
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 7228258..2023424 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -246,6 +246,7 @@ static u32 clear_idx;
#define LOG_ALIGN __alignof__(struct printk_log)
#endif
#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
+#define __LOG_CPU_BUF_LEN (1 << CONFIG_LOG_CPU_BUF_SHIFT)
static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
static char *log_buf = __log_buf;
static u32 log_buf_len = __LOG_BUF_LEN;
@@ -752,9 +753,10 @@ void __init setup_log_buf(int early)
unsigned long flags;
char *new_log_buf;
int free;
+ int cpu_extra = (num_possible_cpus() - 1) * __LOG_CPU_BUF_LEN;

- if (!new_log_buf_len)
- return;
+ if (!new_log_buf_len && cpu_extra > 1)
+ new_log_buf_len = __LOG_BUF_LEN + cpu_extra;

if (early) {
new_log_buf =
--
2.0.0.rc3.18.g00a5b79


2014-06-11 09:34:54

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC] printk: allow increasing the ring buffer depending on the number of CPUs

On Tue 2014-06-10 18:04:45, Luis R. Rodriguez wrote:
> From: "Luis R. Rodriguez" <[email protected]>
>
> The default size of the ring buffer is too small for machines
> with a large amount of CPUs under heavy load. What ends up
> happening when debugging is the ring buffer overlaps and chews
> up old messages making debugging impossible unless the size is
> passed as a kernel parameter. An idle system upon boot up will
> on average spew out only about one or two extra lines but where
> this really matters is on heavy load and that will vary widely
> depending on the system and environment.

Thanks for looking at this. It is a pity to lose stracktrace when a huge
machine Oopses just because the default ring buffer is too small.


> There are mechanisms to help increase the kernel ring buffer
> for tracing through debugfs, and those interfaces even allow growing
> the kernel ring buffer per CPU. We also have a static value which
> can be passed upon boot. Relying on debugfs however is not ideal
> for production, and relying on the value passed upon bootup is
> can only used *after* an issue has creeped up. Instead of being
> reactive this adds a proactive measure which lets you scale the
> amount of contributions you'd expect to the kernel ring buffer
> under load by each CPU in the worst case scenerio.
>
> We use num_possible_cpus() to avoid complexities which could be
> introduced by dynamically changing the ring buffer size at run
> time, num_possible_cpus() lets us use the upper limit on possible
> number of CPUs therefore avoiding having to deal with hotplugging
> CPUs on and off. This option is diabled by default, and if used
> the kernel ring buffer size then can be computed as follows:
>
> size = __LOG_BUF_LEN + (num_possible_cpus() - 1 ) * __LOG_CPU_BUF_LEN
>
> Cc: Michal Hocko <[email protected]>
> Cc: Petr Mladek <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Joe Perches <[email protected]>
> Cc: Arun KS <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: [email protected]
> Signed-off-by: Luis R. Rodriguez <[email protected]>
> ---
> init/Kconfig | 28 ++++++++++++++++++++++++++++
> kernel/printk/printk.c | 6 ++++--
> 2 files changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 9d3585b..1814436 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -806,6 +806,34 @@ config LOG_BUF_SHIFT
> 13 => 8 KB
> 12 => 4 KB
>
> +config LOG_CPU_BUF_SHIFT
> + int "CPU kernel log buffer size contribution (13 => 8 KB, 17 => 128KB)"
> + range 0 21
> + default 0
> + help
> + The kernel ring buffer will get additional data logged onto it
> + when multiple CPUs are supported. Typically the contributions is a
> + few lines when idle however under under load this can vary and in the
> + worst case it can mean loosing logging information. You can use this
> + to set the maximum expected mount of amount of logging contribution
> + under load by each CPU in the worst case scenerio. Select a size as
> + a power of 2. For example if LOG_BUF_SHIFT is 18 and if your
> + LOG_CPU_BUF_SHIFT is 12 your kernel ring buffer size will be as
> + follows having 16 CPUs as possible.
> +
> + ((1 << 18) + ((16 - 1) * (1 << 12))) / 1024 = 316 KB

It might be better to use the CPU_NUM-specific value as a minimum of
the needed space. Linux distributions might want to distribute kernel
with non-zero value and still use the static "__log_buf" on reasonable
small systems.

> + Where as typically you'd only end up with 256 KB. This is disabled
> + by default with a value of 0.

I would add:

This value is ignored when "log_buf_len" commandline parameter
is used. It forces the exact size of the ring buffer.

> + Examples:
> + 17 => 128 KB
> + 16 => 64 KB
> + 15 => 32 KB
> + 14 => 16 KB
> + 13 => 8 KB
> + 12 => 4 KB

I think that we should make it more cleat that it is per-CPU here,
for example:

17 => 128 KB for each CPU
16 => 64 KB for each CPU
15 => 32 KB for each CPU
14 => 16 KB for each CPU
13 => 8 KB for each CPU
12 => 4 KB for each CPU


> #
> # Architectures with an unreliable sched_clock() should select this:
> #
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 7228258..2023424 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -246,6 +246,7 @@ static u32 clear_idx;
> #define LOG_ALIGN __alignof__(struct printk_log)
> #endif
> #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
> +#define __LOG_CPU_BUF_LEN (1 << CONFIG_LOG_CPU_BUF_SHIFT)
> static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
> static char *log_buf = __log_buf;
> static u32 log_buf_len = __LOG_BUF_LEN;
> @@ -752,9 +753,10 @@ void __init setup_log_buf(int early)
> unsigned long flags;
> char *new_log_buf;
> int free;
> + int cpu_extra = (num_possible_cpus() - 1) * __LOG_CPU_BUF_LEN;
>
> - if (!new_log_buf_len)
> - return;
> + if (!new_log_buf_len && cpu_extra > 1)
> + new_log_buf_len = __LOG_BUF_LEN + cpu_extra;

We still should return when both new_log_buf_len and cpu_extra are
zero and call here:

if (!new_log_buf_len)
return;

Also I would feel more comfortable if we somehow limit the maximum
size of cpu_extra. I wonder if there might be a crazy setup with a lot
of possible CPUs and possible memory but with some minimal amount of
CPUs and memory at the boot time.

The question is how to do it. I am still not much familiar with the
memory subsystem. I wonder if 10% of memory defined by the
"total_rampages" variable would be a reasonable limit.

> if (early) {
> new_log_buf =
> --
> 2.0.0.rc3.18.g00a5b79
>

> LocalWords: buf len cpu boottime

2014-06-11 21:47:45

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC] printk: allow increasing the ring buffer depending on the number of CPUs

On Wed, Jun 11, 2014 at 11:34:47AM +0200, Petr Ml?dek wrote:
> On Tue 2014-06-10 18:04:45, Luis R. Rodriguez wrote:
> > From: "Luis R. Rodriguez" <[email protected]>
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 9d3585b..1814436 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -806,6 +806,34 @@ config LOG_BUF_SHIFT
> > 13 => 8 KB
> > 12 => 4 KB
> >
> > +config LOG_CPU_BUF_SHIFT
> > + int "CPU kernel log buffer size contribution (13 => 8 KB, 17 => 128KB)"
> > + range 0 21
> > + default 0
> > + help
> > + The kernel ring buffer will get additional data logged onto it
> > + when multiple CPUs are supported. Typically the contributions is a
> > + few lines when idle however under under load this can vary and in the
> > + worst case it can mean loosing logging information. You can use this
> > + to set the maximum expected mount of amount of logging contribution
> > + under load by each CPU in the worst case scenerio. Select a size as
> > + a power of 2. For example if LOG_BUF_SHIFT is 18 and if your
> > + LOG_CPU_BUF_SHIFT is 12 your kernel ring buffer size will be as
> > + follows having 16 CPUs as possible.
> > +
> > + ((1 << 18) + ((16 - 1) * (1 << 12))) / 1024 = 316 KB
>
> It might be better to use the CPU_NUM-specific value as a minimum of
> the needed space. Linux distributions might want to distribute kernel
> with non-zero value and still use the static "__log_buf" on reasonable
> small systems.

Not sure if I follow what you mean by CPU_NUM-specific, can you elaborate?
The default in this patch is to ignore this, do you mean that upstream
should probably default to a non-zero value here and then let distributions
select 0 for some kernel builds ? If so then perhaps adding a sysctl
override value might be good to allow only small systems to override
this to 0?

> > + Where as typically you'd only end up with 256 KB. This is disabled
> > + by default with a value of 0.
>
> I would add:
>
> This value is ignored when "log_buf_len" commandline parameter
> is used. It forces the exact size of the ring buffer.

Good point, I've amended this in.

> > + Examples:
> > + 17 => 128 KB
> > + 16 => 64 KB
> > + 15 => 32 KB
> > + 14 => 16 KB
> > + 13 => 8 KB
> > + 12 => 4 KB
>
> I think that we should make it more cleat that it is per-CPU here,
> for example:
>
> 17 => 128 KB for each CPU
> 16 => 64 KB for each CPU
> 15 => 32 KB for each CPU
> 14 => 16 KB for each CPU
> 13 => 8 KB for each CPU
> 12 => 4 KB for each CPU

Thanks, amended as well.

> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > index 7228258..2023424 100644
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -246,6 +246,7 @@ static u32 clear_idx;
> > #define LOG_ALIGN __alignof__(struct printk_log)
> > #endif
> > #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
> > +#define __LOG_CPU_BUF_LEN (1 << CONFIG_LOG_CPU_BUF_SHIFT)
> > static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
> > static char *log_buf = __log_buf;
> > static u32 log_buf_len = __LOG_BUF_LEN;
> > @@ -752,9 +753,10 @@ void __init setup_log_buf(int early)
> > unsigned long flags;
> > char *new_log_buf;
> > int free;
> > + int cpu_extra = (num_possible_cpus() - 1) * __LOG_CPU_BUF_LEN;
> >
> > - if (!new_log_buf_len)
> > - return;
> > + if (!new_log_buf_len && cpu_extra > 1)
> > + new_log_buf_len = __LOG_BUF_LEN + cpu_extra;
>
> We still should return when both new_log_buf_len and cpu_extra are
> zero and call here:
>
> if (!new_log_buf_len)
> return;

The check for cpu_extra > 1 does that -- the default in the patch was 0
and 1 << 0 is 1, so if in the case that the default is used we'd bail
just like before. Or did I perhaps miss what you were saying here?

> Also I would feel more comfortable if we somehow limit the maximum
> size of cpu_extra.

Michal had similar concerns and I thought up to limit it to 1024 max
CPUs, but after my second implementation I did some math on the values
that would be used if say LOG_CPU_BUF_SHIFT was 12, it turns out to not
be *that* bad for even huge num_possible_cpus(). For example for 4096
num_possible_cpus() this comes out to with LOG_BUF_SHIFT of 18:


((1 << 18) + ((4096 - 1) * (1 << 12))) / 1024 = 16636 KB

~16 MB doesn't seem that bad for such a monster box which I'd presume
would have an insane amount of memory. If this logic however does
seems unreasonable and we should cap it -- then by all means lets
pick a sensible number, its just not clear to me what that number
should be. Another reason why I stayed away from capping this was
that we'd then likely end up capping this in the future, and I was
trying to find a solution that would not require mucking as
technology evolves. The reasoning above is also why I had opted to
make the default to 0, only distributions would have a good sense
of what might be reasonable, which I guess begs more for a sysctl
value here.

> I wonder if there might be a crazy setup with a lot
> of possible CPUs and possible memory but with some minimal amount of
> CPUs and memory at the boot time.

When I tested disabling smp I saw the log was still amended to include
information about the disabled CPUs, I however hadn't tested on a machine
with hot pluggable CPUs and with tons of CPUs disabled, so not sure if
that adds more info as well. This also though points more to this being
more a system specific thing, which is another reason to perhaps keep this
disabled and leave this instead as a system config?

> The question is how to do it. I am still not much familiar with the
> memory subsystem. I wonder if 10% of memory defined by the
> "total_rampages" variable would be a reasonable limit.

Not sure either, curious if Mel might have a suggestion?

>
> > if (early) {
> > new_log_buf =
> > --
> > 2.0.0.rc3.18.g00a5b79
> >
>
> > LocalWords: buf len cpu boottime

What's this? :)

Luis


Attachments:
(No filename) (5.86 kB)
(No filename) (835.00 B)
Download all attachments

2014-06-12 13:05:49

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC] printk: allow increasing the ring buffer depending on the number of CPUs

On Wed 2014-06-11 23:47:41, Luis R. Rodriguez wrote:
> On Wed, Jun 11, 2014 at 11:34:47AM +0200, Petr Ml?dek wrote:
> > On Tue 2014-06-10 18:04:45, Luis R. Rodriguez wrote:
> > > From: "Luis R. Rodriguez" <[email protected]>
> > > diff --git a/init/Kconfig b/init/Kconfig
> > > index 9d3585b..1814436 100644
> > > --- a/init/Kconfig
> > > +++ b/init/Kconfig
> > > @@ -806,6 +806,34 @@ config LOG_BUF_SHIFT
> > > 13 => 8 KB
> > > 12 => 4 KB
> > >
> > > +config LOG_CPU_BUF_SHIFT
> > > + int "CPU kernel log buffer size contribution (13 => 8 KB, 17 => 128KB)"
> > > + range 0 21
> > > + default 0
> > > + help
> > > + The kernel ring buffer will get additional data logged onto it
> > > + when multiple CPUs are supported. Typically the contributions is a
> > > + few lines when idle however under under load this can vary and in the
> > > + worst case it can mean loosing logging information. You can use this
> > > + to set the maximum expected mount of amount of logging contribution
> > > + under load by each CPU in the worst case scenerio. Select a size as
> > > + a power of 2. For example if LOG_BUF_SHIFT is 18 and if your
> > > + LOG_CPU_BUF_SHIFT is 12 your kernel ring buffer size will be as
> > > + follows having 16 CPUs as possible.
> > > +
> > > + ((1 << 18) + ((16 - 1) * (1 << 12))) / 1024 = 316 KB
> >
> > It might be better to use the CPU_NUM-specific value as a minimum of
> > the needed space. Linux distributions might want to distribute kernel
> > with non-zero value and still use the static "__log_buf" on reasonable
> > small systems.
>
> Not sure if I follow what you mean by CPU_NUM-specific, can you
> elaborate?

I wanted to say that the space requested by LOG_CPU_BUF_SHIFT depends
on the number of CPUs. If LOG_CPU_BUF_SHIFT is not zero, your
patch always allocates new ringbuffer and leave the static "__log_buf"
unused. I think that this is not necessary for machines with small
amount of CPUs and probably also with small amount of memory.

I would rename the variable to LOG_CPU_BUF_MIN_SHIFT or so. It would
represent minimal size that is needed to print CPU-specific
messages. If they take only "small" part of the default ring buffer
size, we could still use the default rind buffer.

For example, if we left 50% of the default buffer for CPU-specific
messages, the code might look like:

#define __LOG_CPU_MIN_BUF_LEN (1 << CONFIG_LOG_CPU_MIN_BUF_SHIFT)

int cpu_extra = (num_possible_cpus() - 1) * __LOG_CPU_MIN_BUF_LEN;

if (!new_log_buf_len && (cpu_extra > __LOG_BUF_LEN / 2))
new_log_buf_len = __LOG_BUF_LEN + cpu_extra;

if (!new_log_buf_len)
return;

allocate the new ring buffer...


> The default in this patch is to ignore this, do you mean that upstream
> should probably default to a non-zero value here and then let distributions
> select 0 for some kernel builds ?

If the change has effect only for huge systems, the default value
might be non-zero everywhere.

> If so then perhaps adding a sysctl override value might be good to
> allow only small systems to override this to 0?

I think that it won't help to lover the value using sysctl because the
huge buffer would be already allocated during boot. If I did not miss anything.

[...]
>
> > > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > > index 7228258..2023424 100644
> > > --- a/kernel/printk/printk.c
> > > +++ b/kernel/printk/printk.c
> > > @@ -246,6 +246,7 @@ static u32 clear_idx;
> > > #define LOG_ALIGN __alignof__(struct printk_log)
> > > #endif
> > > #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
> > > +#define __LOG_CPU_BUF_LEN (1 << CONFIG_LOG_CPU_BUF_SHIFT)
> > > static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
> > > static char *log_buf = __log_buf;
> > > static u32 log_buf_len = __LOG_BUF_LEN;
> > > @@ -752,9 +753,10 @@ void __init setup_log_buf(int early)
> > > unsigned long flags;
> > > char *new_log_buf;
> > > int free;
> > > + int cpu_extra = (num_possible_cpus() - 1) * __LOG_CPU_BUF_LEN;
> > >
> > > - if (!new_log_buf_len)
> > > - return;
> > > + if (!new_log_buf_len && cpu_extra > 1)
> > > + new_log_buf_len = __LOG_BUF_LEN + cpu_extra;
> >
> > We still should return when both new_log_buf_len and cpu_extra are
> > zero and call here:
> >
> > if (!new_log_buf_len)
> > return;
>
> The check for cpu_extra > 1 does that -- the default in the patch was 0
> and 1 << 0 is 1, so if in the case that the default is used we'd bail
> just like before. Or did I perhaps miss what you were saying here?

The problem is that we do not bail out because you removed the "return".
If "new_log_buf_len=0" and "cpu_extra=1" then we keep
"new_log_buf_len" as is. Then we continue, try to allocate zero memory
and print error: "log_buf_len: 0 bytes not available". Do I get it right?


> > Also I would feel more comfortable if we somehow limit the maximum
> > size of cpu_extra.
>
> Michal had similar concerns and I thought up to limit it to 1024 max
> CPUs, but after my second implementation I did some math on the values
> that would be used if say LOG_CPU_BUF_SHIFT was 12, it turns out to not
> be *that* bad for even huge num_possible_cpus(). For example for 4096
> num_possible_cpus() this comes out to with LOG_BUF_SHIFT of 18:
>
>
> ((1 << 18) + ((4096 - 1) * (1 << 12))) / 1024 = 16636 KB
>
> ~16 MB doesn't seem that bad for such a monster box which I'd presume
> would have an insane amount of memory. If this logic however does
> seems unreasonable and we should cap it -- then by all means lets
> pick a sensible number, its just not clear to me what that number
> should be. Another reason why I stayed away from capping this was
> that we'd then likely end up capping this in the future, and I was
> trying to find a solution that would not require mucking as
> technology evolves. The reasoning above is also why I had opted to
> make the default to 0, only distributions would have a good sense
> of what might be reasonable, which I guess begs more for a sysctl
> value here.

I am not sure but I think that the huge buffer would be allocated
before any sysctl value could be modified. So, I think that sysctl
would not really help here.

I think that the 10% or 20% of the total memory size is a good limit.
Nobody would want to use more than 20% of memory for logging. So, it
needs not be higher. The main purpose of the limit is that the system
does not die immediately after allocating the ring buffer. The 80%
reserve for the rest of the system sounds fine as well. Note that
the limit won't be needed on 99,9% of systems but it would help
with debugging the last 0.1% :-)

> > I wonder if there might be a crazy setup with a lot
> > of possible CPUs and possible memory but with some minimal amount of
> > CPUs and memory at the boot time.
>
> When I tested disabling smp I saw the log was still amended to include
> information about the disabled CPUs, I however hadn't tested on a machine
> with hot pluggable CPUs and with tons of CPUs disabled, so not sure if
> that adds more info as well. This also though points more to this being
> more a system specific thing, which is another reason to perhaps keep this
> disabled and leave this instead as a system config?
>
> > The question is how to do it. I am still not much familiar with the
> > memory subsystem. I wonder if 10% of memory defined by the
> > "total_rampages" variable would be a reasonable limit.
>
> Not sure either, curious if Mel might have a suggestion?
>
> >
> > > if (early) {
> > > new_log_buf =
> > > --
> > > 2.0.0.rc3.18.g00a5b79
> > >
> >
> > > LocalWords: buf len cpu boottime
>
> What's this? :)

Heh, emacs added this when doing spell check. It is strage, it does it
only from time to time :-)

Best Regards,
Petr

2014-06-12 18:01:13

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [RFC] printk: allow increasing the ring buffer depending on the number of CPUs


On Wed, 2014-06-11 at 11:34 +0200, Petr Mládek wrote:
> On Tue 2014-06-10 18:04:45, Luis R. Rodriguez wrote:
> > From: "Luis R. Rodriguez" <[email protected]>
> >
> > The default size of the ring buffer is too small for machines
> > with a large amount of CPUs under heavy load. What ends up
> > happening when debugging is the ring buffer overlaps and chews
> > up old messages making debugging impossible unless the size is
> > passed as a kernel parameter. An idle system upon boot up will
> > on average spew out only about one or two extra lines but where
> > this really matters is on heavy load and that will vary widely
> > depending on the system and environment.
>
> Thanks for looking at this. It is a pity to lose stracktrace when a huge
> machine Oopses just because the default ring buffer is too small.

Agreed, I would very much welcome something like this.

> > There are mechanisms to help increase the kernel ring buffer
> > for tracing through debugfs, and those interfaces even allow growing
> > the kernel ring buffer per CPU. We also have a static value which
> > can be passed upon boot. Relying on debugfs however is not ideal
> > for production, and relying on the value passed upon bootup is
> > can only used *after* an issue has creeped up. Instead of being
> > reactive this adds a proactive measure which lets you scale the
> > amount of contributions you'd expect to the kernel ring buffer
> > under load by each CPU in the worst case scenerio.
> >
> > We use num_possible_cpus() to avoid complexities which could be
> > introduced by dynamically changing the ring buffer size at run
> > time, num_possible_cpus() lets us use the upper limit on possible
> > number of CPUs therefore avoiding having to deal with hotplugging
> > CPUs on and off. This option is diabled by default, and if used
> > the kernel ring buffer size then can be computed as follows:
> >
> > size = __LOG_BUF_LEN + (num_possible_cpus() - 1 ) * __LOG_CPU_BUF_LEN
> >
> > Cc: Michal Hocko <[email protected]>
> > Cc: Petr Mladek <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Joe Perches <[email protected]>
> > Cc: Arun KS <[email protected]>
> > Cc: Kees Cook <[email protected]>
> > Cc: [email protected]
> > Signed-off-by: Luis R. Rodriguez <[email protected]>
> > ---
> > init/Kconfig | 28 ++++++++++++++++++++++++++++
> > kernel/printk/printk.c | 6 ++++--
> > 2 files changed, 32 insertions(+), 2 deletions(-)
> >
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 9d3585b..1814436 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -806,6 +806,34 @@ config LOG_BUF_SHIFT
> > 13 => 8 KB
> > 12 => 4 KB
> >
> > +config LOG_CPU_BUF_SHIFT
> > + int "CPU kernel log buffer size contribution (13 => 8 KB, 17 => 128KB)"
> > + range 0 21
> > + default 0
> > + help
> > + The kernel ring buffer will get additional data logged onto it
> > + when multiple CPUs are supported. Typically the contributions is a
> > + few lines when idle however under under load this can vary and in the
> > + worst case it can mean loosing logging information. You can use this
> > + to set the maximum expected mount of amount of logging contribution
> > + under load by each CPU in the worst case scenerio. Select a size as
> > + a power of 2. For example if LOG_BUF_SHIFT is 18 and if your
> > + LOG_CPU_BUF_SHIFT is 12 your kernel ring buffer size will be as
> > + follows having 16 CPUs as possible.
> > +
> > + ((1 << 18) + ((16 - 1) * (1 << 12))) / 1024 = 316 KB
>
> It might be better to use the CPU_NUM-specific value as a minimum of
> the needed space. Linux distributions might want to distribute kernel
> with non-zero value and still use the static "__log_buf" on reasonable
> small systems.

It should also depend on SMP and !BASE_SMALL.
I was wondering about disabling this by default as it would defeat the
purpose of being a proactive feature. Similarly, I worry about distros
choosing a correct default value on their own.

> > + Where as typically you'd only end up with 256 KB. This is disabled
> > + by default with a value of 0.
>
> I would add:
>
> This value is ignored when "log_buf_len" commandline parameter
> is used. It forces the exact size of the ring buffer.

... and update Documentation/kernel-parameters.txt to be more
descriptive about this new functionality.

> > + Examples:
> > + 17 => 128 KB
> > + 16 => 64 KB
> > + 15 => 32 KB
> > + 14 => 16 KB
> > + 13 => 8 KB
> > + 12 => 4 KB
>
> I think that we should make it more cleat that it is per-CPU here,
> for example:
>
> 17 => 128 KB for each CPU
> 16 => 64 KB for each CPU
> 15 => 32 KB for each CPU
> 14 => 16 KB for each CPU
> 13 => 8 KB for each CPU
> 12 => 4 KB for each CPU
>

Agreed.

> > #
> > # Architectures with an unreliable sched_clock() should select this:
> > #
> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > index 7228258..2023424 100644
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -246,6 +246,7 @@ static u32 clear_idx;
> > #define LOG_ALIGN __alignof__(struct printk_log)
> > #endif
> > #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
> > +#define __LOG_CPU_BUF_LEN (1 << CONFIG_LOG_CPU_BUF_SHIFT)
> > static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
> > static char *log_buf = __log_buf;
> > static u32 log_buf_len = __LOG_BUF_LEN;
> > @@ -752,9 +753,10 @@ void __init setup_log_buf(int early)
> > unsigned long flags;
> > char *new_log_buf;
> > int free;
> > + int cpu_extra = (num_possible_cpus() - 1) * __LOG_CPU_BUF_LEN;

If depending on SMP, you can remove the - 1 here.

> > - if (!new_log_buf_len)
> > - return;
> > + if (!new_log_buf_len && cpu_extra > 1)
> > + new_log_buf_len = __LOG_BUF_LEN + cpu_extra;
>
> We still should return when both new_log_buf_len and cpu_extra are
> zero and call here:
>
> if (!new_log_buf_len)
> return;

Yep.

> Also I would feel more comfortable if we somehow limit the maximum
> size of cpu_extra. I wonder if there might be a crazy setup with a lot
> of possible CPUs and possible memory but with some minimal amount of
> CPUs and memory at the boot time.

Maybe. But considering that systems with a lot of CPUs *do* have a lot
of memory, I wouldn't worry much about this, just like we don't worry
about it now. Considering a _large_ 1024 core system and using the max
value 21 for CONFIG_LOG_BUF_SHIFT, we would only allocate just over 2Gb
of extra space -- trivial for such a system. And if it does break
something, then heck, go fix you box and/or just reduce the percpu
value. I guess that's a good reason to keep the default to 0 and let
users play with it as they wish without compromising uninterested
parties. afaict only x86 would be exposed to systems not booting if we
fail to allocate.

Thanks,
Davidlohr


2014-06-12 18:45:25

by Joe Perches

[permalink] [raw]
Subject: Re: [RFC] printk: allow increasing the ring buffer depending on the number of CPUs

(adding Chris Metcalf for arch/tile,
I think this change might impact that arch)

On Thu, 2014-06-12 at 11:01 -0700, Davidlohr Bueso wrote:
> On Wed, 2014-06-11 at 11:34 +0200, Petr Ml?dek wrote:
> > On Tue 2014-06-10 18:04:45, Luis R. Rodriguez wrote:
> > > From: "Luis R. Rodriguez" <[email protected]>
> > >
> > > The default size of the ring buffer is too small for machines
> > > with a large amount of CPUs under heavy load. What ends up
> > > happening when debugging is the ring buffer overlaps and chews
> > > up old messages making debugging impossible unless the size is
> > > passed as a kernel parameter. An idle system upon boot up will
> > > on average spew out only about one or two extra lines but where
> > > this really matters is on heavy load and that will vary widely
> > > depending on the system and environment.
> >
> > Thanks for looking at this. It is a pity to lose stracktrace when a huge
> > machine Oopses just because the default ring buffer is too small.
>
> Agreed, I would very much welcome something like this.
>
> > > There are mechanisms to help increase the kernel ring buffer
> > > for tracing through debugfs, and those interfaces even allow growing
> > > the kernel ring buffer per CPU. We also have a static value which
> > > can be passed upon boot. Relying on debugfs however is not ideal
> > > for production, and relying on the value passed upon bootup is
> > > can only used *after* an issue has creeped up. Instead of being
> > > reactive this adds a proactive measure which lets you scale the
> > > amount of contributions you'd expect to the kernel ring buffer
> > > under load by each CPU in the worst case scenerio.
> > >
> > > We use num_possible_cpus() to avoid complexities which could be
> > > introduced by dynamically changing the ring buffer size at run
> > > time, num_possible_cpus() lets us use the upper limit on possible
> > > number of CPUs therefore avoiding having to deal with hotplugging
> > > CPUs on and off. This option is diabled by default, and if used
> > > the kernel ring buffer size then can be computed as follows:
> > >
> > > size = __LOG_BUF_LEN + (num_possible_cpus() - 1 ) * __LOG_CPU_BUF_LEN
> > >
> > > Cc: Michal Hocko <[email protected]>
> > > Cc: Petr Mladek <[email protected]>
> > > Cc: Andrew Morton <[email protected]>
> > > Cc: Joe Perches <[email protected]>
> > > Cc: Arun KS <[email protected]>
> > > Cc: Kees Cook <[email protected]>
> > > Cc: [email protected]
> > > Signed-off-by: Luis R. Rodriguez <[email protected]>
> > > ---
> > > init/Kconfig | 28 ++++++++++++++++++++++++++++
> > > kernel/printk/printk.c | 6 ++++--
> > > 2 files changed, 32 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/init/Kconfig b/init/Kconfig
> > > index 9d3585b..1814436 100644
> > > --- a/init/Kconfig
> > > +++ b/init/Kconfig
> > > @@ -806,6 +806,34 @@ config LOG_BUF_SHIFT
> > > 13 => 8 KB
> > > 12 => 4 KB
> > >
> > > +config LOG_CPU_BUF_SHIFT
> > > + int "CPU kernel log buffer size contribution (13 => 8 KB, 17 => 128KB)"
> > > + range 0 21
> > > + default 0
> > > + help
> > > + The kernel ring buffer will get additional data logged onto it
> > > + when multiple CPUs are supported. Typically the contributions is a
> > > + few lines when idle however under under load this can vary and in the
> > > + worst case it can mean loosing logging information. You can use this

trivia: s/loosing/losing/

> > > + to set the maximum expected mount of amount of logging contribution
> > > + under load by each CPU in the worst case scenerio. Select a size as
> > > + a power of 2. For example if LOG_BUF_SHIFT is 18 and if your
> > > + LOG_CPU_BUF_SHIFT is 12 your kernel ring buffer size will be as
> > > + follows having 16 CPUs as possible.
> > > +
> > > + ((1 << 18) + ((16 - 1) * (1 << 12))) / 1024 = 316 KB
> >
> > It might be better to use the CPU_NUM-specific value as a minimum of
> > the needed space. Linux distributions might want to distribute kernel
> > with non-zero value and still use the static "__log_buf" on reasonable
> > small systems.
>
> It should also depend on SMP and !BASE_SMALL.
> I was wondering about disabling this by default as it would defeat the
> purpose of being a proactive feature. Similarly, I worry about distros
> choosing a correct default value on their own.
>
> > > + Where as typically you'd only end up with 256 KB. This is disabled
> > > + by default with a value of 0.
> >
> > I would add:
> >
> > This value is ignored when "log_buf_len" commandline parameter
> > is used. It forces the exact size of the ring buffer.
>
> ... and update Documentation/kernel-parameters.txt to be more
> descriptive about this new functionality.
>
> > > + Examples:
> > > + 17 => 128 KB
> > > + 16 => 64 KB
> > > + 15 => 32 KB
> > > + 14 => 16 KB
> > > + 13 => 8 KB
> > > + 12 => 4 KB
> >
> > I think that we should make it more cleat that it is per-CPU here,
> > for example:
> >
> > 17 => 128 KB for each CPU
> > 16 => 64 KB for each CPU
> > 15 => 32 KB for each CPU
> > 14 => 16 KB for each CPU
> > 13 => 8 KB for each CPU
> > 12 => 4 KB for each CPU
> >
>
> Agreed.
>
> > > #
> > > # Architectures with an unreliable sched_clock() should select this:
> > > #
> > > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > > index 7228258..2023424 100644
> > > --- a/kernel/printk/printk.c
> > > +++ b/kernel/printk/printk.c
> > > @@ -246,6 +246,7 @@ static u32 clear_idx;
> > > #define LOG_ALIGN __alignof__(struct printk_log)
> > > #endif
> > > #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
> > > +#define __LOG_CPU_BUF_LEN (1 << CONFIG_LOG_CPU_BUF_SHIFT)
> > > static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
> > > static char *log_buf = __log_buf;
> > > static u32 log_buf_len = __LOG_BUF_LEN;
> > > @@ -752,9 +753,10 @@ void __init setup_log_buf(int early)
> > > unsigned long flags;
> > > char *new_log_buf;
> > > int free;
> > > + int cpu_extra = (num_possible_cpus() - 1) * __LOG_CPU_BUF_LEN;
>
> If depending on SMP, you can remove the - 1 here.
>
> > > - if (!new_log_buf_len)
> > > - return;
> > > + if (!new_log_buf_len && cpu_extra > 1)
> > > + new_log_buf_len = __LOG_BUF_LEN + cpu_extra;
> >
> > We still should return when both new_log_buf_len and cpu_extra are
> > zero and call here:
> >
> > if (!new_log_buf_len)
> > return;
>
> Yep.
>
> > Also I would feel more comfortable if we somehow limit the maximum
> > size of cpu_extra. I wonder if there might be a crazy setup with a lot
> > of possible CPUs and possible memory but with some minimal amount of
> > CPUs and memory at the boot time.
>
> Maybe. But considering that systems with a lot of CPUs *do* have a lot
> of memory, I wouldn't worry much about this, just like we don't worry
> about it now. Considering a _large_ 1024 core system and using the max
> value 21 for CONFIG_LOG_BUF_SHIFT, we would only allocate just over 2Gb
> of extra space -- trivial for such a system. And if it does break
> something, then heck, go fix you box and/or just reduce the percpu
> value. I guess that's a good reason to keep the default to 0 and let
> users play with it as they wish without compromising uninterested
> parties. afaict only x86 would be exposed to systems not booting if we
> fail to allocate.
>
> Thanks,
> Davidlohr


2014-06-12 20:22:51

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC] printk: allow increasing the ring buffer depending on the number of CPUs

On Thu, Jun 12, 2014 at 03:05:32PM +0200, Petr Ml?dek wrote:
> On Wed 2014-06-11 23:47:41, Luis R. Rodriguez wrote:
> > On Wed, Jun 11, 2014 at 11:34:47AM +0200, Petr Ml?dek wrote:
> > > On Tue 2014-06-10 18:04:45, Luis R. Rodriguez wrote:
> > > > From: "Luis R. Rodriguez" <[email protected]>
> > > > diff --git a/init/Kconfig b/init/Kconfig
> > > > index 9d3585b..1814436 100644
> > > > --- a/init/Kconfig
> > > > +++ b/init/Kconfig
> > > > @@ -806,6 +806,34 @@ config LOG_BUF_SHIFT
> > > > 13 => 8 KB
> > > > 12 => 4 KB
> > > >
> > > > +config LOG_CPU_BUF_SHIFT
> > > > + int "CPU kernel log buffer size contribution (13 => 8 KB, 17 => 128KB)"
> > > > + range 0 21
> > > > + default 0
> > > > + help
> > > > + The kernel ring buffer will get additional data logged onto it
> > > > + when multiple CPUs are supported. Typically the contributions is a
> > > > + few lines when idle however under under load this can vary and in the
> > > > + worst case it can mean loosing logging information. You can use this
> > > > + to set the maximum expected mount of amount of logging contribution
> > > > + under load by each CPU in the worst case scenerio. Select a size as
> > > > + a power of 2. For example if LOG_BUF_SHIFT is 18 and if your
> > > > + LOG_CPU_BUF_SHIFT is 12 your kernel ring buffer size will be as
> > > > + follows having 16 CPUs as possible.
> > > > +
> > > > + ((1 << 18) + ((16 - 1) * (1 << 12))) / 1024 = 316 KB
> > >
> > > It might be better to use the CPU_NUM-specific value as a minimum of
> > > the needed space. Linux distributions might want to distribute kernel
> > > with non-zero value and still use the static "__log_buf" on reasonable
> > > small systems.
> >
> > Not sure if I follow what you mean by CPU_NUM-specific, can you
> > elaborate?
>
> I wanted to say that the space requested by LOG_CPU_BUF_SHIFT depends
> on the number of CPUs. If LOG_CPU_BUF_SHIFT is not zero, your
> patch always allocates new ringbuffer and leave the static "__log_buf"
> unused. I think that this is not necessary for machines with small
> amount of CPUs and probably also with small amount of memory.

True, which is why I disabled it by default if we want to only leave
this disabled for < certain amount of num CPU systems, what is that
number, I see below a recommendation and I do like it.

> I would rename the variable to LOG_CPU_BUF_MIN_SHIFT or so. It would
> represent minimal size that is needed to print CPU-specific
> messages. If they take only "small" part of the default ring buffer
> size, we could still use the default rind buffer.

True, and will rename this, that still leaves open the question of a
number of CPUs that is sensible to keep but you resolve that below.

> For example, if we left 50% of the default buffer for CPU-specific
> messages, the code might look like:
>
> #define __LOG_CPU_MIN_BUF_LEN (1 << CONFIG_LOG_CPU_MIN_BUF_SHIFT)
>
> int cpu_extra = (num_possible_cpus() - 1) * __LOG_CPU_MIN_BUF_LEN;
>
> if (!new_log_buf_len && (cpu_extra > __LOG_BUF_LEN / 2))
> new_log_buf_len = __LOG_BUF_LEN + cpu_extra;
>
> if (!new_log_buf_len)
> return;
>
> allocate the new ring buffer...

Yeah I like these heuristics a lot, will fold them in and send a v2 now in patch form.
To be clear with this CONFIG_LOG_CPU_MIN_BUF_SHIFT could actually be left now to
something other than non zeo by default and only if that contribution is seen to
go above 1/2 of __LOG_BUF_LEN will we allocate more for the ring buffer. With
default values of LOG_BUF_SHIFT at 18 and say a default value of 12 for
LOG_CPU_MIN_BUF_SHIFT this would mean we'd need if we remove the -1 we'd require
64 CPUs in order to trigger an allocation for more memory. If we keep the -1 we'd
require anything over 64 number of CPUs. Do we want to keep the -1 and the > 64
CPU requirement as default? Is the LOG_CPU_MIN_BUF_SHIFT default of 12 resonable
to start with (assumes 4KB in the worst case before the kernel ring buffer flips over).

> > The default in this patch is to ignore this, do you mean that upstream
> > should probably default to a non-zero value here and then let distributions
> > select 0 for some kernel builds ?
>
> If the change has effect only for huge systems, the default value
> might be non-zero everywhere.

Sure.

> > If so then perhaps adding a sysctl override value might be good to
> > allow only small systems to override this to 0?
>
> I think that it won't help to lover the value using sysctl because the
> huge buffer would be already allocated during boot. If I did not miss anything.
>
> [...]

Yeah true, a sensible default would be best, with the systctl we'd also have
to handle dynamic re-allocations and while the tracing code already added
code to make this easier I'd prefer we don't make this a popular path.

> >
> > > > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > > > index 7228258..2023424 100644
> > > > --- a/kernel/printk/printk.c
> > > > +++ b/kernel/printk/printk.c
> > > > @@ -246,6 +246,7 @@ static u32 clear_idx;
> > > > #define LOG_ALIGN __alignof__(struct printk_log)
> > > > #endif
> > > > #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
> > > > +#define __LOG_CPU_BUF_LEN (1 << CONFIG_LOG_CPU_BUF_SHIFT)
> > > > static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
> > > > static char *log_buf = __log_buf;
> > > > static u32 log_buf_len = __LOG_BUF_LEN;
> > > > @@ -752,9 +753,10 @@ void __init setup_log_buf(int early)
> > > > unsigned long flags;
> > > > char *new_log_buf;
> > > > int free;
> > > > + int cpu_extra = (num_possible_cpus() - 1) * __LOG_CPU_BUF_LEN;
> > > >
> > > > - if (!new_log_buf_len)
> > > > - return;
> > > > + if (!new_log_buf_len && cpu_extra > 1)
> > > > + new_log_buf_len = __LOG_BUF_LEN + cpu_extra;
> > >
> > > We still should return when both new_log_buf_len and cpu_extra are
> > > zero and call here:
> > >
> > > if (!new_log_buf_len)
> > > return;
> >
> > The check for cpu_extra > 1 does that -- the default in the patch was 0
> > and 1 << 0 is 1, so if in the case that the default is used we'd bail
> > just like before. Or did I perhaps miss what you were saying here?
>
> The problem is that we do not bail out because you removed the "return".
> If "new_log_buf_len=0" and "cpu_extra=1" then we keep
> "new_log_buf_len" as is. Then we continue, try to allocate zero memory
> and print error: "log_buf_len: 0 bytes not available". Do I get it right?

Yeah sorry, I meant to add the else.. and bail with a return if the default
was of 0 was not used or if the kernel parameter to increase the size was
not passed.

> > > Also I would feel more comfortable if we somehow limit the maximum
> > > size of cpu_extra.
> >
> > Michal had similar concerns and I thought up to limit it to 1024 max
> > CPUs, but after my second implementation I did some math on the values
> > that would be used if say LOG_CPU_BUF_SHIFT was 12, it turns out to not
> > be *that* bad for even huge num_possible_cpus(). For example for 4096
> > num_possible_cpus() this comes out to with LOG_BUF_SHIFT of 18:
> >
> >
> > ((1 << 18) + ((4096 - 1) * (1 << 12))) / 1024 = 16636 KB
> >
> > ~16 MB doesn't seem that bad for such a monster box which I'd presume
> > would have an insane amount of memory. If this logic however does
> > seems unreasonable and we should cap it -- then by all means lets
> > pick a sensible number, its just not clear to me what that number
> > should be. Another reason why I stayed away from capping this was
> > that we'd then likely end up capping this in the future, and I was
> > trying to find a solution that would not require mucking as
> > technology evolves. The reasoning above is also why I had opted to
> > make the default to 0, only distributions would have a good sense
> > of what might be reasonable, which I guess begs more for a sysctl
> > value here.
>
> I am not sure but I think that the huge buffer would be allocated
> before any sysctl value could be modified. So, I think that sysctl
> would not really help here.

Sure.

> I think that the 10% or 20% of the total memory size is a good limit.
> Nobody would want to use more than 20% of memory for logging. So, it
> needs not be higher. The main purpose of the limit is that the system
> does not die immediately after allocating the ring buffer. The 80%
> reserve for the rest of the system sounds fine as well. Note that
> the limit won't be needed on 99,9% of systems but it would help
> with debugging the last 0.1% :-)

Oh, what we do for the the 0.1%.

Luis


Attachments:
(No filename) (8.37 kB)
(No filename) (835.00 B)
Download all attachments

2014-06-12 21:12:25

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC] printk: allow increasing the ring buffer depending on the number of CPUs

On Thu, Jun 12, 2014 at 11:01 AM, Davidlohr Bueso <[email protected]> wrote:
> On Wed, 2014-06-11 at 11:34 +0200, Petr Mládek wrote:
>> On Tue 2014-06-10 18:04:45, Luis R. Rodriguez wrote:
>> > From: "Luis R. Rodriguez" <[email protected]>
>> >
>> > The default size of the ring buffer is too small for machines
>> > with a large amount of CPUs under heavy load. What ends up
>> > happening when debugging is the ring buffer overlaps and chews
>> > up old messages making debugging impossible unless the size is
>> > passed as a kernel parameter. An idle system upon boot up will
>> > on average spew out only about one or two extra lines but where
>> > this really matters is on heavy load and that will vary widely
>> > depending on the system and environment.
>>
>> Thanks for looking at this. It is a pity to lose stracktrace when a huge
>> machine Oopses just because the default ring buffer is too small.
>
> Agreed, I would very much welcome something like this.

Great!

>> > There are mechanisms to help increase the kernel ring buffer
>> > for tracing through debugfs, and those interfaces even allow growing
>> > the kernel ring buffer per CPU. We also have a static value which
>> > can be passed upon boot. Relying on debugfs however is not ideal
>> > for production, and relying on the value passed upon bootup is
>> > can only used *after* an issue has creeped up. Instead of being
>> > reactive this adds a proactive measure which lets you scale the
>> > amount of contributions you'd expect to the kernel ring buffer
>> > under load by each CPU in the worst case scenerio.
>> >
>> > We use num_possible_cpus() to avoid complexities which could be
>> > introduced by dynamically changing the ring buffer size at run
>> > time, num_possible_cpus() lets us use the upper limit on possible
>> > number of CPUs therefore avoiding having to deal with hotplugging
>> > CPUs on and off. This option is diabled by default, and if used
>> > the kernel ring buffer size then can be computed as follows:
>> >
>> > size = __LOG_BUF_LEN + (num_possible_cpus() - 1 ) * __LOG_CPU_BUF_LEN
>> >
>> > Cc: Michal Hocko <[email protected]>
>> > Cc: Petr Mladek <[email protected]>
>> > Cc: Andrew Morton <[email protected]>
>> > Cc: Joe Perches <[email protected]>
>> > Cc: Arun KS <[email protected]>
>> > Cc: Kees Cook <[email protected]>
>> > Cc: [email protected]
>> > Signed-off-by: Luis R. Rodriguez <[email protected]>
>> > ---
>> > init/Kconfig | 28 ++++++++++++++++++++++++++++
>> > kernel/printk/printk.c | 6 ++++--
>> > 2 files changed, 32 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/init/Kconfig b/init/Kconfig
>> > index 9d3585b..1814436 100644
>> > --- a/init/Kconfig
>> > +++ b/init/Kconfig
>> > @@ -806,6 +806,34 @@ config LOG_BUF_SHIFT
>> > 13 => 8 KB
>> > 12 => 4 KB
>> >
>> > +config LOG_CPU_BUF_SHIFT
>> > + int "CPU kernel log buffer size contribution (13 => 8 KB, 17 => 128KB)"
>> > + range 0 21
>> > + default 0
>> > + help
>> > + The kernel ring buffer will get additional data logged onto it
>> > + when multiple CPUs are supported. Typically the contributions is a
>> > + few lines when idle however under under load this can vary and in the
>> > + worst case it can mean loosing logging information. You can use this
>> > + to set the maximum expected mount of amount of logging contribution
>> > + under load by each CPU in the worst case scenerio. Select a size as
>> > + a power of 2. For example if LOG_BUF_SHIFT is 18 and if your
>> > + LOG_CPU_BUF_SHIFT is 12 your kernel ring buffer size will be as
>> > + follows having 16 CPUs as possible.
>> > +
>> > + ((1 << 18) + ((16 - 1) * (1 << 12))) / 1024 = 316 KB
>>
>> It might be better to use the CPU_NUM-specific value as a minimum of
>> the needed space. Linux distributions might want to distribute kernel
>> with non-zero value and still use the static "__log_buf" on reasonable
>> small systems.
>
> It should also depend on SMP and !BASE_SMALL.
> I was wondering about disabling this by default as it would defeat the
> purpose of being a proactive feature. Similarly, I worry about distros
> choosing a correct default value on their own.

True, it seems Petr's recommendations would address these concerns for
systems under a certain amount of limit of number of CPUs, as is right
now we require the contribution by CPU in worst case scenario to be >
1/2 of the default kernel ring buffer size that's > 64 number of CPUs.

>> > + Where as typically you'd only end up with 256 KB. This is disabled
>> > + by default with a value of 0.
>>
>> I would add:
>>
>> This value is ignored when "log_buf_len" commandline parameter
>> is used. It forces the exact size of the ring buffer.
>
> ... and update Documentation/kernel-parameters.txt to be more
> descriptive about this new functionality.

Will do!

>> > + Examples:
>> > + 17 => 128 KB
>> > + 16 => 64 KB
>> > + 15 => 32 KB
>> > + 14 => 16 KB
>> > + 13 => 8 KB
>> > + 12 => 4 KB
>>
>> I think that we should make it more cleat that it is per-CPU here,
>> for example:
>>
>> 17 => 128 KB for each CPU
>> 16 => 64 KB for each CPU
>> 15 => 32 KB for each CPU
>> 14 => 16 KB for each CPU
>> 13 => 8 KB for each CPU
>> 12 => 4 KB for each CPU
>>
>
> Agreed.

Amended.

>> > #
>> > # Architectures with an unreliable sched_clock() should select this:
>> > #
>> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
>> > index 7228258..2023424 100644
>> > --- a/kernel/printk/printk.c
>> > +++ b/kernel/printk/printk.c
>> > @@ -246,6 +246,7 @@ static u32 clear_idx;
>> > #define LOG_ALIGN __alignof__(struct printk_log)
>> > #endif
>> > #define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
>> > +#define __LOG_CPU_BUF_LEN (1 << CONFIG_LOG_CPU_BUF_SHIFT)
>> > static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
>> > static char *log_buf = __log_buf;
>> > static u32 log_buf_len = __LOG_BUF_LEN;
>> > @@ -752,9 +753,10 @@ void __init setup_log_buf(int early)
>> > unsigned long flags;
>> > char *new_log_buf;
>> > int free;
>> > + int cpu_extra = (num_possible_cpus() - 1) * __LOG_CPU_BUF_LEN;
>
> If depending on SMP, you can remove the - 1 here.

Great point, however it still does have an implication of the minimum
amount of CPUs that will trigger an increase with the heuristics
suggested by Petr, with the -1 we'd require > 64 CPUs, without it 64
CPUs would trigger an increase. We need to decide on this, I will add
the Kconfig requirement suggestions though.

>> > - if (!new_log_buf_len)
>> > - return;
>> > + if (!new_log_buf_len && cpu_extra > 1)
>> > + new_log_buf_len = __LOG_BUF_LEN + cpu_extra;
>>
>> We still should return when both new_log_buf_len and cpu_extra are
>> zero and call here:
>>
>> if (!new_log_buf_len)
>> return;
>
> Yep.

Fixed, thanks.

>> Also I would feel more comfortable if we somehow limit the maximum
>> size of cpu_extra. I wonder if there might be a crazy setup with a lot
>> of possible CPUs and possible memory but with some minimal amount of
>> CPUs and memory at the boot time.
>
> Maybe. But considering that systems with a lot of CPUs *do* have a lot
> of memory, I wouldn't worry much about this, just like we don't worry
> about it now. Considering a _large_ 1024 core system and using the max
> value 21 for CONFIG_LOG_BUF_SHIFT, we would only allocate just over 2Gb
> of extra space -- trivial for such a system. And if it does break
> something, then heck, go fix you box and/or just reduce the percpu
> value. I guess that's a good reason to keep the default to 0 and let
> users play with it as they wish without compromising uninterested
> parties. afaict only x86 would be exposed to systems not booting if we
> fail to allocate.

Picking hard limit values is certainly subjective but if we can pick
some heuristic that can scale without revisiting this much it'd be
great, I think Petr's new suggestion of having the contribution be
more than the default kernel ring buffer could help mitigate most
issues on smaller systems, a default of 12 (4KB contribution per CPU)
is also reasonably small I think based on the computations I've made
even for crazy large beasts.

Luis

2014-06-12 21:28:55

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC] printk: allow increasing the ring buffer depending on the number of CPUs

On Thu, Jun 12, 2014 at 11:45 AM, Joe Perches <[email protected]> wrote:
>
> trivia: s/loosing/losing/

Amended, thanks.

Luis

2014-06-16 20:17:39

by Chris Metcalf

[permalink] [raw]
Subject: Re: [RFC] printk: allow increasing the ring buffer depending on the number of CPUs

On 6/12/2014 2:45 PM, Joe Perches wrote:
> (adding Chris Metcalf for arch/tile,
> I think this change might impact that arch)

Thanks for the Cc. I've been following the discussion and I think it's on the right track for arch/tile.

--
Chris Metcalf, Tilera Corp.
http://www.tilera.com