2015-08-11 11:25:51

by Pan Xinhui

[permalink] [raw]
Subject: [PATCH] printk: rebalance printk

From: Pan Xinhui <[email protected]>

printk can be called in any context, It's very useful to output debug
info.

But it might cause very bad issues on some special cases. For example,
some driver hit errors, and it dumps many messages like reg values, etc.

Sometimes, printk is called when irqs disabled. This is OKay if there is
a few messages. But What would happen if many messages outputted by other
drivers at same time.

Here is the scenario.
CPUA CPUB
local_irq_save(flags);
printk()
while(..) { --> console_unlock
printk(...);
//hundreds or thousands loops
} //all messages flushed out to consoles
local_irq_restore(flags);

printk runs on CPUA just store the messages in the buf and return.
printk runs on CPUB(who owns the console_sem lock) would take the duty
to flush all messages to consoles. It would take a long time to flush
messages out, IOW, irq would be disabled for a long time. Such case is
too bad. We hit many interrupt related panics, for example, cpu did not
respond to IPI.

Here is the solution, if we detect such case above, try to rebalance it.
Let CPUA take the duty to flush messages to consoles.

The idea is simple, but the implementation is a little difficult.
Introduce many help functions to fix it.

Signed-off-by: Pan Xinhui <[email protected]>
---
kernel/printk/printk.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 100 insertions(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index cf8c242..3dc2f60 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1652,6 +1652,65 @@ static size_t cont_print_text(char *text, size_t size)
return textlen;
}

+static struct {
+ unsigned int request:1;/* set it with logbuf_lock and console_sem held
+ * clear it with logbuf_lock held.
+ */
+ unsigned int reply:1; /* set/clear it with logbuf_lock held*/
+ struct semaphore help_sem;
+} help_t = {
+ .request = 0,
+ .reply = 0,
+ .help_sem = __SEMAPHORE_INITIALIZER(help_t.help_sem, 0),
+};
+
+/* protected by logbuf_lock*/
+static int check_help(void)
+{
+ return !!help_t.request;
+}
+
+/* protected by logbuf_lock and console_sem*/
+static void request_help(void)
+{
+ help_t.request = 1;
+}
+
+/* protected by logbuf_lock*/
+static void cancel_help(void)
+{
+ help_t.request = 0;
+}
+
+/* protected by logbuf_lock*/
+static void reply_help(void)
+{
+ help_t.request = 0;
+ help_t.reply = 1;
+}
+
+/* protected by logbuf_lock*/
+static int ack_help(void)
+{
+ return !!help_t.reply;
+}
+
+/* protected by logbuf_lock*/
+static void close_help(void)
+{
+ help_t.reply = 0;
+}
+
+static void thanks_help(void)
+{
+ up(&help_t.help_sem);
+}
+
+static int done_help(void)
+{
+ return !down_trylock(&help_t.help_sem);
+}
+
asmlinkage int vprintk_emit(int facility, int level,
const char *dict, size_t dictlen,
const char *fmt, va_list args)
@@ -1667,6 +1726,8 @@ asmlinkage int vprintk_emit(int facility, int level,
bool in_sched = false;
/* cpu currently holding logbuf_lock in this function */
static unsigned int logbuf_cpu = UINT_MAX;
+ bool help = 0;
+ bool can_help = !(irqs_disabled() || in_sched || in_interrupt());

if (level == LOGLEVEL_SCHED) {
level = LOGLEVEL_DEFAULT;
@@ -1701,6 +1762,10 @@ asmlinkage int vprintk_emit(int facility, int level,

lockdep_off();
raw_spin_lock(&logbuf_lock);
+ if (can_help && check_help()) {
+ reply_help();
+ help = true;
+ }
logbuf_cpu = this_cpu;

if (unlikely(recursion_bug)) {
@@ -1811,6 +1876,9 @@ asmlinkage int vprintk_emit(int facility, int level,
*/
preempt_disable();

+ if (help)
+ while (!done_help())
+ cpu_relax();
/*
* Try to acquire and then immediately release the console
* semaphore. The release will print out buffers and wake up
@@ -2230,6 +2298,8 @@ void console_unlock(void)
unsigned long flags;
bool wake_klogd = false;
bool retry;
+ bool help = !!(irqs_disabled() || in_interrupt());
+ bool skip_ack = true;

if (console_suspended) {
up_console_sem();
@@ -2248,6 +2318,21 @@ again:
int level;

raw_spin_lock_irqsave(&logbuf_lock, flags);
+ if (help) {
+ if (!skip_ack) {
+ if (ack_help()) {
+ /*
+ * If request got replyed by another
+ * printk-on-going, close the request.
+ */
+ close_help();
+ break;
+ }
+ cancel_help();
+ }
+ request_help();
+ skip_ack = 0;
+ }
if (seen_seq != log_next_seq) {
wake_klogd = true;
seen_seq = log_next_seq;
@@ -2265,8 +2350,13 @@ again:
len = 0;
}
skip:
- if (console_seq == log_next_seq)
+ if (console_seq == log_next_seq) {
+ if (help) {
+ cancel_help();
+ help = 0;
+ }
break;
+ }

msg = log_from_idx(console_idx);
if (msg->flags & LOG_NOCONS) {
@@ -2318,6 +2408,14 @@ skip:

up_console_sem();

+ if (help) {
+ /*
+ * Need call up_console_sem first.
+ */
+ thanks_help();
+ goto wake_up;
+ }
+
/*
* Someone could have filled up the buffer again, so re-check if there's
* something to flush. In case we cannot trylock the console_sem again,
@@ -2331,6 +2429,7 @@ skip:
if (retry && console_trylock())
goto again;

+wake_up:
if (wake_klogd)
wake_up_klogd();
}
--
1.9.1


2015-08-11 11:37:39

by Pan Xinhui

[permalink] [raw]
Subject: Re: [PATCH] printk: rebalance printk

I think you have already known how to reproduce it. Let me still share the debug patch to hit this printk issue in several seconds.

+static int auto_printk(void *data)
+{
+ int irq = (int)data;
+ char b[64] = {
+ [0 ... 61] = 'F',
+ '\n',
+ 0,
+ };
+
+ int bytes_per_second = 115200/8; //minicom console ratelimit
+ int lines_per_second = bytes_per_second/(sizeof(b)+16);//16 is count of extra bytes including log_prefix
+ int lines_with_5sec = lines_per_second * 5;
+ int i = lines_with_5sec;
+
+ if (irq)
+ local_irq_disable();
+ unsigned long j = jiffies;
+ unsigned long k = jiffies;
+ while (i--) {
+ k = jiffies;
+ if (k > j + 5 * HZ) {
+ printk("XINHUI: long time %d,%d\n", irq, irqs_disabled());
+ }
+ j = k;
+ }
+ if (irq)
+ local_irq_enable();
+ return 0;
+}
+static int xh_th(const char *val, struct kernel_param *kp)
+{
+ kthread_run(auto_printk, 1, "XH_T1");
+ kthread_run(auto_printk, 0, "XH_T0");
+}
+module_param_call(th, NULL, xh_th, NULL, 0664);
+

thanks
xinhui

On 2015年08月11日 19:23, Pan Xinhui wrote:
> From: Pan Xinhui <[email protected]>
>
> printk can be called in any context, It's very useful to output debug
> info.
>
> But it might cause very bad issues on some special cases. For example,
> some driver hit errors, and it dumps many messages like reg values, etc.
>
> Sometimes, printk is called when irqs disabled. This is OKay if there is
> a few messages. But What would happen if many messages outputted by other
> drivers at same time.
>
> Here is the scenario.
> CPUA CPUB
> local_irq_save(flags);
> printk()
> while(..) { --> console_unlock
> printk(...);
> //hundreds or thousands loops
> } //all messages flushed out to consoles
> local_irq_restore(flags);
>
> printk runs on CPUA just store the messages in the buf and return.
> printk runs on CPUB(who owns the console_sem lock) would take the duty
> to flush all messages to consoles. It would take a long time to flush
> messages out, IOW, irq would be disabled for a long time. Such case is
> too bad. We hit many interrupt related panics, for example, cpu did not
> respond to IPI.
>
> Here is the solution, if we detect such case above, try to rebalance it.
> Let CPUA take the duty to flush messages to consoles.
>
> The idea is simple, but the implementation is a little difficult.
> Introduce many help functions to fix it.
>
> Signed-off-by: Pan Xinhui <[email protected]>
> ---
> kernel/printk/printk.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 100 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index cf8c242..3dc2f60 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1652,6 +1652,65 @@ static size_t cont_print_text(char *text, size_t size)
> return textlen;
> }
>
> +static struct {
> + unsigned int request:1;/* set it with logbuf_lock and console_sem held
> + * clear it with logbuf_lock held.
> + */
> + unsigned int reply:1; /* set/clear it with logbuf_lock held*/
> + struct semaphore help_sem;
> +} help_t = {
> + .request = 0,
> + .reply = 0,
> + .help_sem = __SEMAPHORE_INITIALIZER(help_t.help_sem, 0),
> +};
> +
> +/* protected by logbuf_lock*/
> +static int check_help(void)
> +{
> + return !!help_t.request;
> +}
> +
> +/* protected by logbuf_lock and console_sem*/
> +static void request_help(void)
> +{
> + help_t.request = 1;
> +}
> +
> +/* protected by logbuf_lock*/
> +static void cancel_help(void)
> +{
> + help_t.request = 0;
> +}
> +
> +/* protected by logbuf_lock*/
> +static void reply_help(void)
> +{
> + help_t.request = 0;
> + help_t.reply = 1;
> +}
> +
> +/* protected by logbuf_lock*/
> +static int ack_help(void)
> +{
> + return !!help_t.reply;
> +}
> +
> +/* protected by logbuf_lock*/
> +static void close_help(void)
> +{
> + help_t.reply = 0;
> +}
> +
> +static void thanks_help(void)
> +{
> + up(&help_t.help_sem);
> +}
> +
> +static int done_help(void)
> +{
> + return !down_trylock(&help_t.help_sem);
> +}
> +
> asmlinkage int vprintk_emit(int facility, int level,
> const char *dict, size_t dictlen,
> const char *fmt, va_list args)
> @@ -1667,6 +1726,8 @@ asmlinkage int vprintk_emit(int facility, int level,
> bool in_sched = false;
> /* cpu currently holding logbuf_lock in this function */
> static unsigned int logbuf_cpu = UINT_MAX;
> + bool help = 0;
> + bool can_help = !(irqs_disabled() || in_sched || in_interrupt());
>
> if (level == LOGLEVEL_SCHED) {
> level = LOGLEVEL_DEFAULT;
> @@ -1701,6 +1762,10 @@ asmlinkage int vprintk_emit(int facility, int level,
>
> lockdep_off();
> raw_spin_lock(&logbuf_lock);
> + if (can_help && check_help()) {
> + reply_help();
> + help = true;
> + }
> logbuf_cpu = this_cpu;
>
> if (unlikely(recursion_bug)) {
> @@ -1811,6 +1876,9 @@ asmlinkage int vprintk_emit(int facility, int level,
> */
> preempt_disable();
>
> + if (help)
> + while (!done_help())
> + cpu_relax();
> /*
> * Try to acquire and then immediately release the console
> * semaphore. The release will print out buffers and wake up
> @@ -2230,6 +2298,8 @@ void console_unlock(void)
> unsigned long flags;
> bool wake_klogd = false;
> bool retry;
> + bool help = !!(irqs_disabled() || in_interrupt());
> + bool skip_ack = true;
>
> if (console_suspended) {
> up_console_sem();
> @@ -2248,6 +2318,21 @@ again:
> int level;
>
> raw_spin_lock_irqsave(&logbuf_lock, flags);
> + if (help) {
> + if (!skip_ack) {
> + if (ack_help()) {
> + /*
> + * If request got replyed by another
> + * printk-on-going, close the request.
> + */
> + close_help();
> + break;
> + }
> + cancel_help();
> + }
> + request_help();
> + skip_ack = 0;
> + }
> if (seen_seq != log_next_seq) {
> wake_klogd = true;
> seen_seq = log_next_seq;
> @@ -2265,8 +2350,13 @@ again:
> len = 0;
> }
> skip:
> - if (console_seq == log_next_seq)
> + if (console_seq == log_next_seq) {
> + if (help) {
> + cancel_help();
> + help = 0;
> + }
> break;
> + }
>
> msg = log_from_idx(console_idx);
> if (msg->flags & LOG_NOCONS) {
> @@ -2318,6 +2408,14 @@ skip:
>
> up_console_sem();
>
> + if (help) {
> + /*
> + * Need call up_console_sem first.
> + */
> + thanks_help();
> + goto wake_up;
> + }
> +
> /*
> * Someone could have filled up the buffer again, so re-check if there's
> * something to flush. In case we cannot trylock the console_sem again,
> @@ -2331,6 +2429,7 @@ skip:
> if (retry && console_trylock())
> goto again;
>
> +wake_up:
> if (wake_klogd)
> wake_up_klogd();
> }
>

2015-08-11 12:01:14

by Peter Hurley

[permalink] [raw]
Subject: Re: [PATCH] printk: rebalance printk

On 08/11/2015 07:23 AM, Pan Xinhui wrote:
> From: Pan Xinhui <[email protected]>
>
> printk can be called in any context, It's very useful to output debug
> info.
>
> But it might cause very bad issues on some special cases. For example,
> some driver hit errors, and it dumps many messages like reg values, etc.
>
> Sometimes, printk is called when irqs disabled. This is OKay if there is
> a few messages. But What would happen if many messages outputted by other
> drivers at same time.
>
> Here is the scenario.
> CPUA CPUB
> local_irq_save(flags);
> printk()
> while(..) { --> console_unlock
> printk(...);
> //hundreds or thousands loops
> } //all messages flushed out to consoles
> local_irq_restore(flags);
>
> printk runs on CPUA just store the messages in the buf and return.
> printk runs on CPUB(who owns the console_sem lock) would take the duty
> to flush all messages to consoles. It would take a long time to flush
> messages out, IOW, irq would be disabled for a long time. Such case is
> too bad. We hit many interrupt related panics, for example, cpu did not
> respond to IPI.
>
> Here is the solution, if we detect such case above, try to rebalance it.
> Let CPUA take the duty to flush messages to consoles.
>
> The idea is simple, but the implementation is a little difficult.
> Introduce many help functions to fix it.

Please describe the rebalance state machine/message passing in detail.

Regards,
Peter Hurley

> Signed-off-by: Pan Xinhui <[email protected]>
> ---
> kernel/printk/printk.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 100 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index cf8c242..3dc2f60 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1652,6 +1652,65 @@ static size_t cont_print_text(char *text, size_t size)
> return textlen;
> }
>
> +static struct {
> + unsigned int request:1;/* set it with logbuf_lock and console_sem held
> + * clear it with logbuf_lock held.
> + */
> + unsigned int reply:1; /* set/clear it with logbuf_lock held*/
> + struct semaphore help_sem;
> +} help_t = {
> + .request = 0,
> + .reply = 0,
> + .help_sem = __SEMAPHORE_INITIALIZER(help_t.help_sem, 0),
> +};
> +
> +/* protected by logbuf_lock*/
> +static int check_help(void)
> +{
> + return !!help_t.request;
> +}
> +
> +/* protected by logbuf_lock and console_sem*/
> +static void request_help(void)
> +{
> + help_t.request = 1;
> +}
> +
> +/* protected by logbuf_lock*/
> +static void cancel_help(void)
> +{
> + help_t.request = 0;
> +}
> +
> +/* protected by logbuf_lock*/
> +static void reply_help(void)
> +{
> + help_t.request = 0;
> + help_t.reply = 1;
> +}
> +
> +/* protected by logbuf_lock*/
> +static int ack_help(void)
> +{
> + return !!help_t.reply;
> +}
> +
> +/* protected by logbuf_lock*/
> +static void close_help(void)
> +{
> + help_t.reply = 0;
> +}
> +
> +static void thanks_help(void)
> +{
> + up(&help_t.help_sem);
> +}
> +
> +static int done_help(void)
> +{
> + return !down_trylock(&help_t.help_sem);
> +}
> +
> asmlinkage int vprintk_emit(int facility, int level,
> const char *dict, size_t dictlen,
> const char *fmt, va_list args)
> @@ -1667,6 +1726,8 @@ asmlinkage int vprintk_emit(int facility, int level,
> bool in_sched = false;
> /* cpu currently holding logbuf_lock in this function */
> static unsigned int logbuf_cpu = UINT_MAX;
> + bool help = 0;
> + bool can_help = !(irqs_disabled() || in_sched || in_interrupt());
>
> if (level == LOGLEVEL_SCHED) {
> level = LOGLEVEL_DEFAULT;
> @@ -1701,6 +1762,10 @@ asmlinkage int vprintk_emit(int facility, int level,
>
> lockdep_off();
> raw_spin_lock(&logbuf_lock);
> + if (can_help && check_help()) {
> + reply_help();
> + help = true;
> + }
> logbuf_cpu = this_cpu;
>
> if (unlikely(recursion_bug)) {
> @@ -1811,6 +1876,9 @@ asmlinkage int vprintk_emit(int facility, int level,
> */
> preempt_disable();
>
> + if (help)
> + while (!done_help())
> + cpu_relax();
> /*
> * Try to acquire and then immediately release the console
> * semaphore. The release will print out buffers and wake up
> @@ -2230,6 +2298,8 @@ void console_unlock(void)
> unsigned long flags;
> bool wake_klogd = false;
> bool retry;
> + bool help = !!(irqs_disabled() || in_interrupt());
> + bool skip_ack = true;
>
> if (console_suspended) {
> up_console_sem();
> @@ -2248,6 +2318,21 @@ again:
> int level;
>
> raw_spin_lock_irqsave(&logbuf_lock, flags);
> + if (help) {
> + if (!skip_ack) {
> + if (ack_help()) {
> + /*
> + * If request got replyed by another
> + * printk-on-going, close the request.
> + */
> + close_help();
> + break;
> + }
> + cancel_help();
> + }
> + request_help();
> + skip_ack = 0;
> + }
> if (seen_seq != log_next_seq) {
> wake_klogd = true;
> seen_seq = log_next_seq;
> @@ -2265,8 +2350,13 @@ again:
> len = 0;
> }
> skip:
> - if (console_seq == log_next_seq)
> + if (console_seq == log_next_seq) {
> + if (help) {
> + cancel_help();
> + help = 0;
> + }
> break;
> + }
>
> msg = log_from_idx(console_idx);
> if (msg->flags & LOG_NOCONS) {
> @@ -2318,6 +2408,14 @@ skip:
>
> up_console_sem();
>
> + if (help) {
> + /*
> + * Need call up_console_sem first.
> + */
> + thanks_help();
> + goto wake_up;
> + }
> +
> /*
> * Someone could have filled up the buffer again, so re-check if there's
> * something to flush. In case we cannot trylock the console_sem again,
> @@ -2331,6 +2429,7 @@ skip:
> if (retry && console_trylock())
> goto again;
>
> +wake_up:
> if (wake_klogd)
> wake_up_klogd();
> }
>

2015-08-11 18:16:11

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH] printk: rebalance printk

On Tue, Aug 11, 2015 at 07:23:01PM +0800, Pan Xinhui wrote:
> From: Pan Xinhui <[email protected]>
>
> printk can be called in any context, It's very useful to output debug
> info.
>
> But it might cause very bad issues on some special cases. For example,
> some driver hit errors, and it dumps many messages like reg values, etc.
>
> Sometimes, printk is called when irqs disabled. This is OKay if there is
> a few messages. But What would happen if many messages outputted by other
> drivers at same time.
>
> Here is the scenario.
> CPUA CPUB
> local_irq_save(flags);
> printk()
> while(..) { --> console_unlock
> printk(...);
> //hundreds or thousands loops
> } //all messages flushed out to consoles
> local_irq_restore(flags);
>

Where are you seeing this type of scenario "in the wild"? Or is this
just a "debug/bringup hardware" issue? We shouldn't be ever stuck in a
printk that prints hundreds or thousands of loops, if so, we need to fix
the kernel code that does that, as we do have control over this.

thanks,

greg k-h

2015-08-12 01:38:25

by Pan Xinhui

[permalink] [raw]
Subject: Re: [PATCH] printk: rebalance printk

hi, Peter
thanks for your reply!
On 2015年08月11日 20:00, Peter Hurley wrote:
> On 08/11/2015 07:23 AM, Pan Xinhui wrote:
>> From: Pan Xinhui <[email protected]>
>>
>> printk can be called in any context, It's very useful to output debug
>> info.
>>
>> But it might cause very bad issues on some special cases. For example,
>> some driver hit errors, and it dumps many messages like reg values, etc.
>>
>> Sometimes, printk is called when irqs disabled. This is OKay if there is
>> a few messages. But What would happen if many messages outputted by other
>> drivers at same time.
>>
>> Here is the scenario.
>> CPUA CPUB
>> local_irq_save(flags);
>> printk()
>> while(..) { --> console_unlock
>> printk(...);
>> //hundreds or thousands loops
>> } //all messages flushed out to consoles
>> local_irq_restore(flags);
>>
>> printk runs on CPUA just store the messages in the buf and return.
>> printk runs on CPUB(who owns the console_sem lock) would take the duty
>> to flush all messages to consoles. It would take a long time to flush
>> messages out, IOW, irq would be disabled for a long time. Such case is
>> too bad. We hit many interrupt related panics, for example, cpu did not
>> respond to IPI.
>>
>> Here is the solution, if we detect such case above, try to rebalance it.
>> Let CPUA take the duty to flush messages to consoles.
>>
>> The idea is simple, but the implementation is a little difficult.
>> Introduce many help functions to fix it.
>
> Please describe the rebalance state machine/message passing in detail.
>
> Regards,
> Peter Hurley
>
yes, I should describe it more detailedly. Sorry for that.
I will send out V2.

thanks
xinhui

2015-08-12 01:56:25

by Pan Xinhui

[permalink] [raw]
Subject: Re: [PATCH] printk: rebalance printk

Hi, Greg
thanks for your reply!
On 2015年08月12日 02:16, Greg Kroah-Hartman wrote:
> On Tue, Aug 11, 2015 at 07:23:01PM +0800, Pan Xinhui wrote:
>> From: Pan Xinhui <[email protected]>
>>
>> printk can be called in any context, It's very useful to output debug
>> info.
>>
>> But it might cause very bad issues on some special cases. For example,
>> some driver hit errors, and it dumps many messages like reg values, etc.
>>
>> Sometimes, printk is called when irqs disabled. This is OKay if there is
>> a few messages. But What would happen if many messages outputted by other
>> drivers at same time.
>>
>> Here is the scenario.
>> CPUA CPUB
>> local_irq_save(flags);
>> printk()
>> while(..) { --> console_unlock
>> printk(...);
>> //hundreds or thousands loops
>> } //all messages flushed out to consoles
>> local_irq_restore(flags);
>>
>
> Where are you seeing this type of scenario "in the wild"? Or is this
> just a "debug/bringup hardware" issue? We shouldn't be ever stuck in a
> printk that prints hundreds or thousands of loops, if so, we need to fix
> the kernel code that does that, as we do have control over this.
>

Sorry for misleading. these loops are just an example. In real world, there are many pr_info/pr_err when some drivers
hit device errors to dump debug information.

There is camera and sound test(multimedia test) in our side, we enable dynamic_debug to collect logs to fix some weird issues.
sound driver will dump many reg/memory values, and camera will output a few logs with irqs disabled. So we hit this printk issues
very frequently, and it blocks our tests indeed.

Although we have other log-store methods, for example, pstore. But some boards are from customers, thanks to the security, every time
boards powered up, memory are erased....

As printk is announced that it could be used in most context, So I prefer to enhance the stability of printk to handle such case.

thanks
xinhui


> thanks,
>
> greg k-h
>

2015-08-12 02:04:10

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH] printk: rebalance printk

On Wed, Aug 12, 2015 at 09:53:58AM +0800, Pan Xinhui wrote:
> Hi, Greg
> thanks for your reply!
> On 2015年08月12日 02:16, Greg Kroah-Hartman wrote:
> > On Tue, Aug 11, 2015 at 07:23:01PM +0800, Pan Xinhui wrote:
> >> From: Pan Xinhui <[email protected]>
> >>
> >> printk can be called in any context, It's very useful to output debug
> >> info.
> >>
> >> But it might cause very bad issues on some special cases. For example,
> >> some driver hit errors, and it dumps many messages like reg values, etc.
> >>
> >> Sometimes, printk is called when irqs disabled. This is OKay if there is
> >> a few messages. But What would happen if many messages outputted by other
> >> drivers at same time.
> >>
> >> Here is the scenario.
> >> CPUA CPUB
> >> local_irq_save(flags);
> >> printk()
> >> while(..) { --> console_unlock
> >> printk(...);
> >> //hundreds or thousands loops
> >> } //all messages flushed out to consoles
> >> local_irq_restore(flags);
> >>
> >
> > Where are you seeing this type of scenario "in the wild"? Or is this
> > just a "debug/bringup hardware" issue? We shouldn't be ever stuck in a
> > printk that prints hundreds or thousands of loops, if so, we need to fix
> > the kernel code that does that, as we do have control over this.
> >
>
> Sorry for misleading. these loops are just an example. In real world,
> there are many pr_info/pr_err when some drivers hit device errors to
> dump debug information.

That's not good, they should use dev_dbg() for that instead.

> There is camera and sound test(multimedia test) in our side, we enable
> dynamic_debug to collect logs to fix some weird issues. sound driver
> will dump many reg/memory values, and camera will output a few logs
> with irqs disabled. So we hit this printk issues very frequently, and
> it blocks our tests indeed.

Maybe those memory value dumps should be done through debugfs instead of
printk?

> As printk is announced that it could be used in most context, So I
> prefer to enhance the stability of printk to handle such case.

Stability is nice, but using printk for things it was never ment to be
used for, isn't good. Especially if you are forced to add additional
complexity to the printk logic just for these corner cases that no one
uses :)

thanks,

greg k-h

2015-08-12 02:22:37

by Pan Xinhui

[permalink] [raw]
Subject: Re: [PATCH] printk: rebalance printk



On 2015年08月12日 10:04, Greg Kroah-Hartman wrote:
> On Wed, Aug 12, 2015 at 09:53:58AM +0800, Pan Xinhui wrote:
>> Hi, Greg
>> thanks for your reply!
>> On 2015年08月12日 02:16, Greg Kroah-Hartman wrote:
>>> On Tue, Aug 11, 2015 at 07:23:01PM +0800, Pan Xinhui wrote:
>>>> From: Pan Xinhui <[email protected]>
>>>>
>>>> printk can be called in any context, It's very useful to output debug
>>>> info.
>>>>
>>>> But it might cause very bad issues on some special cases. For example,
>>>> some driver hit errors, and it dumps many messages like reg values, etc.
>>>>
>>>> Sometimes, printk is called when irqs disabled. This is OKay if there is
>>>> a few messages. But What would happen if many messages outputted by other
>>>> drivers at same time.
>>>>
>>>> Here is the scenario.
>>>> CPUA CPUB
>>>> local_irq_save(flags);
>>>> printk()
>>>> while(..) { --> console_unlock
>>>> printk(...);
>>>> //hundreds or thousands loops
>>>> } //all messages flushed out to consoles
>>>> local_irq_restore(flags);
>>>>
>>>
>>> Where are you seeing this type of scenario "in the wild"? Or is this
>>> just a "debug/bringup hardware" issue? We shouldn't be ever stuck in a
>>> printk that prints hundreds or thousands of loops, if so, we need to fix
>>> the kernel code that does that, as we do have control over this.
>>>
>>
>> Sorry for misleading. these loops are just an example. In real world,
>> there are many pr_info/pr_err when some drivers hit device errors to
>> dump debug information.
>
> That's not good, they should use dev_dbg() for that instead.
>
Agree, We changed some codes to use dev_dbg(). But some errors still need be outputted immediately.
But when we enabled dynamic debug. there is not much difference... printk issue still hit.

>> There is camera and sound test(multimedia test) in our side, we enable
>> dynamic_debug to collect logs to fix some weird issues. sound driver
>> will dump many reg/memory values, and camera will output a few logs
>> with irqs disabled. So we hit this printk issues very frequently, and
>> it blocks our tests indeed.
>
> Maybe those memory value dumps should be done through debugfs instead of
> printk?
>
These reg/memory runtime values must be outputted in time, they have changed at any time..

>> As printk is announced that it could be used in most context, So I
>> prefer to enhance the stability of printk to handle such case.
>
> Stability is nice, but using printk for things it was never ment to be
> used for, isn't good. Especially if you are forced to add additional
> complexity to the printk logic just for these corner cases that no one
> uses :)
>

printk is very good, it does work in very complex context, including in interrupt, etc.
However if we could make it better, then why not :)

My patch is a little complex. Sorry for that.
I would try to work out V2.

thanks
xinhui

> thanks,
>
> greg k-h
>

2015-08-12 12:31:48

by Peter Hurley

[permalink] [raw]
Subject: Re: [PATCH] printk: rebalance printk

On 08/11/2015 02:16 PM, Greg Kroah-Hartman wrote:
> On Tue, Aug 11, 2015 at 07:23:01PM +0800, Pan Xinhui wrote:
>> From: Pan Xinhui <[email protected]>
>>
>> printk can be called in any context, It's very useful to output debug
>> info.
>>
>> But it might cause very bad issues on some special cases. For example,
>> some driver hit errors, and it dumps many messages like reg values, etc.
>>
>> Sometimes, printk is called when irqs disabled. This is OKay if there is
>> a few messages. But What would happen if many messages outputted by other
>> drivers at same time.
>>
>> Here is the scenario.
>> CPUA CPUB
>> local_irq_save(flags);
>> printk()
>> while(..) { --> console_unlock
>> printk(...);
>> //hundreds or thousands loops
>> } //all messages flushed out to consoles
>> local_irq_restore(flags);
>>
>
> Where are you seeing this type of scenario "in the wild"? Or is this
> just a "debug/bringup hardware" issue?

There have been problem reports of big machines getting soft-lockup/RCU stall
warnings with serial console attached. I think SLES is carrying patches
from Jan Kara to try to workaround this issue.


> We shouldn't be ever stuck in a
> printk that prints hundreds or thousands of loops, if so, we need to fix
> the kernel code that does that, as we do have control over this.

The loop referred to here is the loop in console_unlock(). Essentially
what happens is one cpu can get trapped in the console_unlock() output
loop; printk()'s from other cpus are only appending to the logbuf since
they can't acquire the console_lock (which is owned by the one cpu trapped
in the output loop).

Regards,
Peter Hurley

2015-08-13 02:18:59

by Pan Xinhui

[permalink] [raw]
Subject: Re: [PATCH] printk: rebalance printk



On 2015年08月12日 20:31, Peter Hurley wrote:
> On 08/11/2015 02:16 PM, Greg Kroah-Hartman wrote:
>> On Tue, Aug 11, 2015 at 07:23:01PM +0800, Pan Xinhui wrote:
>>> From: Pan Xinhui <[email protected]>
>>>
>>> printk can be called in any context, It's very useful to output debug
>>> info.
>>>
>>> But it might cause very bad issues on some special cases. For example,
>>> some driver hit errors, and it dumps many messages like reg values, etc.
>>>
>>> Sometimes, printk is called when irqs disabled. This is OKay if there is
>>> a few messages. But What would happen if many messages outputted by other
>>> drivers at same time.
>>>
>>> Here is the scenario.
>>> CPUA CPUB
>>> local_irq_save(flags);
>>> printk()
>>> while(..) { --> console_unlock
>>> printk(...);
>>> //hundreds or thousands loops
>>> } //all messages flushed out to consoles
>>> local_irq_restore(flags);
>>>
>>
>> Where are you seeing this type of scenario "in the wild"? Or is this
>> just a "debug/bringup hardware" issue?
>
> There have been problem reports of big machines getting soft-lockup/RCU stall
> warnings with serial console attached. I think SLES is carrying patches
> from Jan Kara to try to workaround this issue.
>
>
commit 5874af2003b1aaaa053128d655710140e3187226 ("printk: enable interrupts before calling console_trylock_for_printk()")
does help much in most cases. Thanks Jan Kara for that patch!

However there are still some corner cases like that I mentioned above.
As preempt is disabled in console_unlock(), Even applying my patch, softlockup/RCU warning still goes out.

I am very looking forward to Jan Kara's new patch to walk around it. Of course I will also try to work out a pre-review patch with my new idea
to fix these softlockpup/RCU warning.

Thanks
xinhui

>> We shouldn't be ever stuck in a
>> printk that prints hundreds or thousands of loops, if so, we need to fix
>> the kernel code that does that, as we do have control over this.
>
> The loop referred to here is the loop in console_unlock(). Essentially
> what happens is one cpu can get trapped in the console_unlock() output
> loop; printk()'s from other cpus are only appending to the logbuf since
> they can't acquire the console_lock (which is owned by the one cpu trapped
> in the output loop).
>
> Regards,
> Peter Hurley
>