Date: Wed, 10 Jun 2015 16:57:37 +0200
From: Petr Mladek
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel@vger.kernel.org, jkosina@suse.cz,
	paulmck@linux.vnet.ibm.com, Ingo Molnar, Thomas Gleixner
Subject: Re: [RFC][PATCH] printk: Fixup the nmi printk mess
Message-ID: <20150610145737.GE9409@pathway.suse.cz>
References: <20150610125509.GO19282@twins.programming.kicks-ass.net>
	<20150610143155.GD9409@pathway.suse.cz>
In-Reply-To: <20150610143155.GD9409@pathway.suse.cz>

On Wed 2015-06-10 16:31:55, Petr Mladek wrote:
> On Wed 2015-06-10 14:55:09, Peter Zijlstra wrote:
> > Hi,
> >
> > I just stumbled upon:
> >
> >   a9edc8809328 ("x86/nmi: Perform a safe NMI stack trace on all CPUs")
> >
> > Which is not dissimilar to what I've proposed in the past. Except it's
> > squirreled away in some far and dark corner of one arch.
> >
> > Let's fix that.
>
> It is a very interesting approach, but I see two possible races; see
> below.
>
> > diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> > index dfd59d6bc6f0..a477e0766d2e 100644
> > --- a/include/linux/hardirq.h
> > +++ b/include/linux/hardirq.h
> > @@ -67,10 +67,12 @@ extern void irq_exit(void);
> >  	preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
> >  	rcu_nmi_enter();				\
> >  	trace_hardirq_enter();				\
> > +	printk_nmi_enter();				\
> >  } while (0)
> >
> >  #define nmi_exit()					\
> >  do {							\
> > +	printk_nmi_exit();				\
> >  	trace_hardirq_exit();				\
> >  	rcu_nmi_exit();					\
> >  	BUG_ON(!in_nmi());				\
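The printk_nmi_enter()/printk_nmi_exit() bodies are not shown in the
quoted hunks. Presumably they only switch the per-CPU printk_func
between vprintk_nmi() and vprintk_default(). A minimal sketch of that
assumption (untested, not part of the patch):

/*
 * Untested sketch only: these bodies are not part of the quoted patch.
 * It assumes the hooks merely redirect the per-CPU printk_func pointer.
 */
void printk_nmi_enter(void)
{
        this_cpu_write(printk_func, vprintk_nmi);
}

void printk_nmi_exit(void)
{
        this_cpu_write(printk_func, vprintk_default);
}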
> > diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> > index caebf2a758dc..04c68b9f56f8 100644
> > --- a/include/linux/percpu.h
> > +++ b/include/linux/percpu.h
> > @@ -135,7 +135,4 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
> >  	(typeof(type) __percpu *)__alloc_percpu(sizeof(type),	\
> >  						__alignof__(type))
> >
> > -/* To avoid include hell, as printk can not declare this, we declare it here */
> > -DECLARE_PER_CPU(printk_func_t, printk_func);
> > -
> >  #endif /* __LINUX_PERCPU_H */
> > diff --git a/init/main.c b/init/main.c
> > index 2115055faeac..4a28accaaa98 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -586,6 +586,7 @@ asmlinkage __visible void __init start_kernel(void)
> >  	timekeeping_init();
> >  	time_init();
> >  	sched_clock_postinit();
> > +	printk_init();
> >  	perf_event_init();
> >  	profile_init();
> >  	call_function_init();
> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > index c099b082cd02..0149c41be95b 100644
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -1821,13 +1821,125 @@ int vprintk_default(const char *fmt, va_list args)
> >  }
> >  EXPORT_SYMBOL_GPL(vprintk_default);
> >
> > +#ifdef CONFIG_PRINTK_NMI
> > +
> > +typedef int(*printk_func_t)(const char *fmt, va_list args);
> >  /*
> >   * This allows printk to be diverted to another function per cpu.
> >   * This is useful for calling printk functions from within NMI
> >   * without worrying about race conditions that can lock up the
> >   * box.
> >   */
> > -DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default;
> > +static DEFINE_PER_CPU(printk_func_t, printk_func) = vprintk_default;
> > +
> > +#include <linux/seq_buf.h>
> > +
> > +struct nmi_seq_buf {
> > +	struct seq_buf		seq;
> > +	struct irq_work		work;
> > +	unsigned char		buffer[PAGE_SIZE -
> > +				       sizeof(struct seq_buf) -
> > +				       sizeof(struct irq_work)];
> > +};
> > +
> > +/* Safe printing in NMI context */
> > +static DEFINE_PER_CPU(struct nmi_seq_buf, nmi_print_seq);
> > +
> > +static void print_seq_line(struct nmi_seq_buf *s, int start, int end)
> > +{
> > +	const char *buf = s->buffer + start;
> > +
> > +	printk("%.*s", (end - start) + 1, buf);
> > +}
> > +
> > +static void __printk_nmi_flush(struct irq_work *work)
> > +{
> > +	struct nmi_seq_buf *s = container_of(work, struct nmi_seq_buf, work);
> > +	int len, last_i = 0, i = 0;
> > +
> > +again:
> > +	len = seq_buf_used(&s->seq);
> > +	if (!len)
> > +		return;
> > +
> > +	/* Print line by line. */
> > +	for (; i < len; i++) {
> > +		if (s->buffer[i] == '\n') {
> > +			print_seq_line(s, last_i, i);
> > +			last_i = i + 1;
> > +		}
> > +	}
> > +	/* Check if there was a partial line. */
> > +	if (last_i < len) {
> > +		print_seq_line(s, last_i, len - 1);
> > +		pr_cont("\n");
> > +	}
> > +
> > +	/*
> > +	 * Another NMI could have come in while we were printing;
> > +	 * check that nothing has been added to the buffer.
> > +	 */
> > +	if (cmpxchg_local(&s->seq.len, len, 0) != len)
> > +		goto again;
>
> If another NMI comes at this point, it will start filling the buffer
> from the beginning. If it is fast enough, it might overwrite the text
> that we print above.
>
> Note that it is easy to be faster, because the normal printk() is more
> complex and might wait for the console.

Also note that show_regs() calls many separate printk()s and the
irq_work is scheduled by the first one, so it is quite likely that some
backtraces will get messed up.

Another problem is that __printk_nmi_flush() is per-CPU, so more
instances might be running in parallel. I haven't tested this, but I
guess that it will mix backtraces from different CPUs in the main ring
buffer. Note that the backtraces used to be serialized via

	static arch_spinlock_t lock = __ARCH_SPIN_LOCK_UNLOCKED

in the past; see the commit a9edc8809328 ("x86/nmi: Perform a safe NMI
stack trace on all CPUs").

A solution would be to block __printk_nmi_flush() until
printk_nmi_exit() is called, via some bit and cmpxchg. Also, we could
add one more bit so that vprintk_nmi() drops messages until the buffer
is printed, to avoid any mess. Huh, it is getting more and more
complicated.
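Roughly like the following, completely untested, sketch; the ->flags
field and both bit names are made up here and do not exist in the
patch:

/* set between printk_nmi_enter() and printk_nmi_exit() */
#define NMI_BUF_WRITER	0x1UL
/* set while the buffer is being printed */
#define NMI_BUF_FLUSH	0x2UL

static void __printk_nmi_flush(struct irq_work *work)
{
        struct nmi_seq_buf *s = container_of(work, struct nmi_seq_buf, work);

        /*
         * Back off while the NMI writer still owns the buffer.
         * printk_nmi_exit() would clear NMI_BUF_WRITER and queue this
         * irq_work again, so returning here is enough.
         */
        if (cmpxchg(&s->flags, 0, NMI_BUF_FLUSH) != 0)
                return;

        /* ... print the lines and reset the seq_buf as above ... */

        WRITE_ONCE(s->flags, 0);
}

static int vprintk_nmi(const char *fmt, va_list args)
{
        struct nmi_seq_buf *s = this_cpu_ptr(&nmi_print_seq);

        /* Drop messages until the pending output has been printed. */
        if (READ_ONCE(s->flags) & NMI_BUF_FLUSH)
                return 0;

        /* ... append to the seq_buf as before ... */
        return 0;
}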
Best Regards,
Petr

> > +}
> > +
> > +void printk_init(void)
> > +{
> > +	int cpu;
> > +
> > +	for_each_possible_cpu(cpu) {
> > +		struct nmi_seq_buf *s = &per_cpu(nmi_print_seq, cpu);
> > +
> > +		init_irq_work(&s->work, __printk_nmi_flush);
> > +		seq_buf_init(&s->seq, s->buffer, sizeof(s->buffer));
> > +	}
> > +}
> > +
> > +/*
> > + * It is not safe to call printk() directly from NMI handlers.
> > + * It may be fine if the NMI detected a lockup and we have no choice
> > + * but to do so, but doing an NMI on all other CPUs to get a back trace
> > + * can be done with a sysrq-l. We don't want that to lock up, which
> > + * can happen if the NMI interrupts a printk in progress.
> > + *
> > + * Instead, we redirect the vprintk() to this nmi_vprintk() that writes
> > + * the content into a per cpu seq_buf buffer. Then when the NMIs are
> > + * all done, we can safely dump the contents of the seq_buf to a printk()
> > + * from a non NMI context.
> > + */
> > +static int vprintk_nmi(const char *fmt, va_list args)
> > +{
> > +	struct nmi_seq_buf *s = this_cpu_ptr(&nmi_print_seq);
> > +	unsigned int len = seq_buf_used(&s->seq);
> > +
> > +	irq_work_queue(&s->work);
> > +	seq_buf_vprintf(&s->seq, fmt, args);
>
> This is more critical. seq_buf_vprintf() has the following code:
>
>	len = vsnprintf(s->buffer + s->len, s->size - s->len, fmt, args);
>
> The race might look like this:
>
>	CPU0				CPU1
>
>	__printk_nmi_flush()
>	  ...
>					seq_buf_vprintf()
>					  s->buffer + s->len
>	  cmpxchg_local(&s->seq.len,
>			len, 0)
>					  s->size - s->len
>
> => that seq_buf_vprintf() will start writing at the end of the seq_buf,
> but it will think that the entire buffer is available, which might
> cause a buffer overflow.
>
> Well, it is rather theoretical: s->len will most likely be read only
> once (compiler optimization). Anyway, we must make sure that all the
> used seq_buf operations are safe against such races. It is not obvious
> and might be prone to regressions.
>
> To be honest, I am not familiar with cmpxchg_local(). But I think that
> it can't make much difference here. The value has to be synced to the
> other CPUs one day anyway.
>
> Note that Steven's solution is safe because the prints are serialized
> via @backtrace_flag in arch_trigger_all_cpu_backtrace().
>
> Best Regards,
> Petr
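PS: For the second race, one way out might be to do the length handling
in vprintk_nmi() itself: read ->len exactly once and publish the new
value with cmpxchg(), retrying when the flush side has reset the buffer
underneath us. A completely untested sketch; it bypasses
seq_buf_vprintf() on purpose, and the details (types, barriers) are
glossed over:

static int vprintk_nmi(const char *fmt, va_list args)
{
        struct nmi_seq_buf *s = this_cpu_ptr(&nmi_print_seq);
        unsigned int len, add;
        va_list tmp;

again:
        len = READ_ONCE(s->seq.len);            /* one consistent snapshot */
        if (len >= sizeof(s->buffer))
                return 0;                       /* buffer full: drop */

        /* Format into the free tail; nothing is visible to readers yet. */
        va_copy(tmp, args);
        add = vsnprintf(s->buffer + len, sizeof(s->buffer) - len, fmt, tmp);
        va_end(tmp);
        if (add >= sizeof(s->buffer) - len)     /* output was truncated */
                add = sizeof(s->buffer) - len - 1;

        /*
         * Publish the new length. If __printk_nmi_flush() reset ->len
         * in the meantime, the text sits at a stale offset: redo it.
         */
        if (cmpxchg(&s->seq.len, len, len + add) != len)
                goto again;

        irq_work_queue(&s->work);
        return add;
}

The retry redoes the vsnprintf() from the new offset, so nothing is
ever published at a stale position.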