Date: Wed, 10 Jun 2015 16:57:37 +0200
From: Petr Mladek
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel@vger.kernel.org, jkosina@suse.cz,
	paulmck@linux.vnet.ibm.com, Ingo Molnar, Thomas Gleixner
Subject: Re: [RFC][PATCH] printk: Fixup the nmi printk mess
Message-ID: <20150610145737.GE9409@pathway.suse.cz>
References: <20150610125509.GO19282@twins.programming.kicks-ass.net>
	<20150610143155.GD9409@pathway.suse.cz>
In-Reply-To: <20150610143155.GD9409@pathway.suse.cz>

On Wed 2015-06-10 16:31:55, Petr Mladek wrote:
> On Wed 2015-06-10 14:55:09, Peter Zijlstra wrote:
> > Hi,
> >
> > I just stumbled upon:
> >
> >   a9edc8809328 ("x86/nmi: Perform a safe NMI stack trace on all CPUs")
> >
> > Which is not dissimilar to what I've proposed in the past. Except it's
> > squirreled away in some far and dark corner of one arch.
> >
> > Let's fix that.
>
> It is a very interesting approach, but I see two possible races; see
> below.
>
> > diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> > index dfd59d6bc6f0..a477e0766d2e 100644
> > --- a/include/linux/hardirq.h
> > +++ b/include/linux/hardirq.h
> > @@ -67,10 +67,12 @@ extern void irq_exit(void);
> >  	preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
> >  	rcu_nmi_enter();				\
> >  	trace_hardirq_enter();				\
> > +	printk_nmi_enter();				\
> >  } while (0)
> >
> >  #define nmi_exit()					\
> >  do {							\
> > +	printk_nmi_exit();				\
> >  	trace_hardirq_exit();				\
> >  	rcu_nmi_exit();					\
> >  	BUG_ON(!in_nmi());				\
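The printk_nmi_enter()/printk_nmi_exit() bodies are not shown in the
quoted hunks. Presumably they only switch the per-CPU printk_func
between vprintk_nmi() and vprintk_default(). A minimal sketch of that
assumption (untested, not part of the patch):

/*
 * Untested sketch only: these bodies are not part of the quoted patch.
 * It assumes the hooks merely redirect the per-CPU printk_func pointer.
 */
void printk_nmi_enter(void)
{
        this_cpu_write(printk_func, vprintk_nmi);
}

void printk_nmi_exit(void)
{
        this_cpu_write(printk_func, vprintk_default);
}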
> > diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> > index caebf2a758dc..04c68b9f56f8 100644
> > --- a/include/linux/percpu.h
> > +++ b/include/linux/percpu.h
> > @@ -135,7 +135,4 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
> >  	(typeof(type) __percpu *)__alloc_percpu(sizeof(type),	\
> >  						__alignof__(type))
> >
> > -/* To avoid include hell, as printk can not declare this, we declare it here */
> > -DECLARE_PER_CPU(printk_func_t, printk_func);
> > -
> >  #endif /* __LINUX_PERCPU_H */
> > diff --git a/init/main.c b/init/main.c
> > index 2115055faeac..4a28accaaa98 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -586,6 +586,7 @@ asmlinkage __visible void __init start_kernel(void)
> >  	timekeeping_init();
> >  	time_init();
> >  	sched_clock_postinit();
> > +	printk_init();
> >  	perf_event_init();
> >  	profile_init();
> >  	call_function_init();
> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > index c099b082cd02..0149c41be95b 100644
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -1821,13 +1821,125 @@ int vprintk_default(const char *fmt, va_list args)
> >  }
> >  EXPORT_SYMBOL_GPL(vprintk_default);
> >
> > +#ifdef CONFIG_PRINTK_NMI
> > +
> > +typedef int(*printk_func_t)(const char *fmt, va_list args);
> >  /*
> >   * This allows printk to be diverted to another function per cpu.
> >   * This is useful for calling printk functions from within NMI
> >   * without worrying about race conditions that can lock up the
> >   * box.
> >   */
> > -DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default;
> > +static DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default;
> > +
> > +#include <linux/seq_buf.h>
> > +
> > +struct nmi_seq_buf {
> > +	struct seq_buf		seq;
> > +	struct irq_work		work;
> > +	unsigned char		buffer[PAGE_SIZE -
> > +				       sizeof(struct seq_buf) -
> > +				       sizeof(struct irq_work)];
> > +};
> > +
> > +/* Safe printing in NMI context */
> > +static DEFINE_PER_CPU(struct nmi_seq_buf, nmi_print_seq);
> > +
> > +static void print_seq_line(struct nmi_seq_buf *s, int start, int end)
> > +{
> > +	const char *buf = s->buffer + start;
> > +
> > +	printk("%.*s", (end - start) + 1, buf);
> > +}
> > +
> > +static void __printk_nmi_flush(struct irq_work *work)
> > +{
> > +	struct nmi_seq_buf *s = container_of(work, struct nmi_seq_buf, work);
> > +	int len, last_i = 0, i = 0;
> > +
> > +again:
> > +	len = seq_buf_used(&s->seq);
> > +	if (!len)
> > +		return;
> > +
> > +	/* Print line by line. */
> > +	for (; i < len; i++) {
> > +		if (s->buffer[i] == '\n') {
> > +			print_seq_line(s, last_i, i);
> > +			last_i = i + 1;
> > +		}
> > +	}
> > +	/* Check if there was a partial line. */
> > +	if (last_i < len) {
> > +		print_seq_line(s, last_i, len - 1);
> > +		pr_cont("\n");
> > +	}
> > +
> > +	/*
> > +	 * Another NMI could have come in while we were printing;
> > +	 * check that nothing has been added to the buffer.
> > +	 */
> > +	if (cmpxchg_local(&s->seq.len, len, 0) != len)
> > +		goto again;
>
> If another NMI comes at this point, it will start filling the buffer
> from the beginning. If it is fast enough, it might overwrite the text
> that we print above.
>
> Note that it is easy to be faster, because the normal printk() is more
> complex and might wait for the console.

Also note that show_regs() calls many separate printk()s and the
irq_work is scheduled by the first one, so it is quite likely that some
backtraces will get messed up.

Another problem is that __printk_nmi_flush() is per-CPU, so more
instances might be running in parallel. I haven't tested this, but I
guess that it will mix backtraces from different CPUs in the main ring
buffer. Note that the backtraces used to be serialized via

	static arch_spinlock_t lock = __ARCH_SPIN_LOCK_UNLOCKED

in the past; see the commit a9edc8809328 ("x86/nmi: Perform a safe NMI
stack trace on all CPUs").

A solution would be to block __printk_nmi_flush() until
printk_nmi_exit() is called, via some bit and cmpxchg. Also, we could
add one more bit so that vprintk_nmi() drops messages until the buffer
is printed, to avoid any mess. Huh, it is getting more and more
complicated.
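Roughly like the following, completely untested, sketch; the ->flags
field and both bit names are made up here and do not exist in the
patch:

/* set between printk_nmi_enter() and printk_nmi_exit() */
#define NMI_BUF_WRITER	0x1UL
/* set while the buffer is being printed */
#define NMI_BUF_FLUSH	0x2UL

static void __printk_nmi_flush(struct irq_work *work)
{
        struct nmi_seq_buf *s = container_of(work, struct nmi_seq_buf, work);

        /*
         * Back off while the NMI writer still owns the buffer.
         * printk_nmi_exit() would clear NMI_BUF_WRITER and queue this
         * irq_work again, so returning here is enough.
         */
        if (cmpxchg(&s->flags, 0, NMI_BUF_FLUSH) != 0)
                return;

        /* ... print the lines and reset the seq_buf as above ... */

        WRITE_ONCE(s->flags, 0);
}

static int vprintk_nmi(const char *fmt, va_list args)
{
        struct nmi_seq_buf *s = this_cpu_ptr(&nmi_print_seq);

        /* Drop messages until the pending output has been printed. */
        if (READ_ONCE(s->flags) & NMI_BUF_FLUSH)
                return 0;

        /* ... append to the seq_buf as before ... */
        return 0;
}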
Best Regards,
Petr

> > +}
> > +
> > +void printk_init(void)
> > +{
> > +	int cpu;
> > +
> > +	for_each_possible_cpu(cpu) {
> > +		struct nmi_seq_buf *s = &per_cpu(nmi_print_seq, cpu);
> > +
> > +		init_irq_work(&s->work, __printk_nmi_flush);
> > +		seq_buf_init(&s->seq, s->buffer, sizeof(s->buffer));
> > +	}
> > +}
> > +
> > +/*
> > + * It is not safe to call printk() directly from NMI handlers.
> > + * It may be fine if the NMI detected a lockup and we have no choice
> > + * but to do so, but doing an NMI on all other CPUs to get a back trace
> > + * can be done with a sysrq-l. We don't want that to lock up, which
> > + * can happen if the NMI interrupts a printk in progress.
> > + *
> > + * Instead, we redirect the vprintk() to this nmi_vprintk() that writes
> > + * the content into a per cpu seq_buf buffer. Then when the NMIs are
> > + * all done, we can safely dump the contents of the seq_buf to a printk()
> > + * from a non NMI context.
> > + */
> > +static int vprintk_nmi(const char *fmt, va_list args)
> > +{
> > +	struct nmi_seq_buf *s = this_cpu_ptr(&nmi_print_seq);
> > +	unsigned int len = seq_buf_used(&s->seq);
> > +
> > +	irq_work_queue(&s->work);
> > +	seq_buf_vprintf(&s->seq, fmt, args);
>
> This is more critical. seq_buf_vprintf() has the following code:
>
>	len = vsnprintf(s->buffer + s->len, s->size - s->len, fmt, args);
>
> The race might look like this:
>
>	CPU0				CPU1
>
>	__printk_nmi_flush()
>	  ...
>					seq_buf_vprintf()
>					  s->buffer + s->len
>	  cmpxchg_local(&s->seq.len,
>			len, 0)
>					  s->size - s->len
>
> => that seq_buf_vprintf() will start writing at the end of the seq_buf,
> but it will think that the entire buffer is available, which might
> cause a buffer overflow.
>
> Well, it is rather theoretical: s->len will most likely be read only
> once (compiler optimization). Anyway, we must make sure that all the
> used seq_buf operations are safe against such races. It is not obvious
> and might be prone to regressions.
>
> To be honest, I am not familiar with cmpxchg_local(). But I think that
> it can't make much difference here. The value has to be synced to the
> other CPUs one day anyway.
>
> Note that Steven's solution is safe because the prints are serialized
> via @backtrace_flag in arch_trigger_all_cpu_backtrace().
>
> Best Regards,
> Petr
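PS: For the second race, one way out might be to do the length handling
in vprintk_nmi() itself: read ->len exactly once and publish the new
value with cmpxchg(), retrying when the flush side has reset the buffer
underneath us. A completely untested sketch; it bypasses
seq_buf_vprintf() on purpose, and the details (types, barriers) are
glossed over:

static int vprintk_nmi(const char *fmt, va_list args)
{
        struct nmi_seq_buf *s = this_cpu_ptr(&nmi_print_seq);
        unsigned int len, add;
        va_list tmp;

again:
        len = READ_ONCE(s->seq.len);            /* one consistent snapshot */
        if (len >= sizeof(s->buffer))
                return 0;                       /* buffer full: drop */

        /* Format into the free tail; nothing is visible to readers yet. */
        va_copy(tmp, args);
        add = vsnprintf(s->buffer + len, sizeof(s->buffer) - len, fmt, tmp);
        va_end(tmp);
        if (add >= sizeof(s->buffer) - len)     /* output was truncated */
                add = sizeof(s->buffer) - len - 1;

        /*
         * Publish the new length. If __printk_nmi_flush() reset ->len
         * in the meantime, the text sits at a stale offset: redo it.
         */
        if (cmpxchg(&s->seq.len, len, len + add) != len)
                goto again;

        irq_work_queue(&s->work);
        return add;
}

The retry redoes the vsnprintf() from the new offset, so nothing is
ever published at a stale position.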