Date: Thu, 11 Jun 2015 16:55:47 +0200
From: Petr Mladek
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel@vger.kernel.org, jkosina@suse.cz,
	paulmck@linux.vnet.ibm.com, Ingo Molnar, Thomas Gleixner
Subject: Re: [RFC][PATCH] printk: Fixup the nmi printk mess
Message-ID: <20150611145547.GA3234@dhcp128.suse.cz>
In-Reply-To: <20150610192304.GY18673@twins.programming.kicks-ass.net>
References: <20150610125509.GO19282@twins.programming.kicks-ass.net>
 <20150610143155.GD9409@pathway.suse.cz>
 <20150610152917.GI3644@twins.programming.kicks-ass.net>
 <20150610192304.GY18673@twins.programming.kicks-ass.net>

On Wed 2015-06-10 21:23:04, Peter Zijlstra wrote:
> Below a version which does x-cpu stuff to allow the
> trigger_all*_cpu_backtrace() initiator to flush buffers on behalf of
> other CPUs.
>
> Compile tested only.

The output from "echo l >/proc/sysrq-trigger" looks reasonable. It does
not mix output from different CPUs. This is expected because of the
@lock.

The other observation is that it prints the CPUs in _random_ order:
28, 24, 25, 1, 26, 2, 27, 3, ...

The order is fine when I disable the irq_work. It means that the
irq_works are usually faster than printk_nmi_flush()

  => printk_nmi_flush() is not that useful
  => all the complexity with the three atomic variables (head, tail,
     read) does not buy us much

Anyway, I think that the current solution is racy and cannot easily be
fixed, see below.

> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index c099b082cd02..99bfc1e3f32a 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1821,13 +1821,200 @@ int vprintk_default(const char *fmt, va_list args)
> +static void __printk_nmi_flush(struct irq_work *work)
> +{
> +	static raw_spinlock_t lock = __RAW_SPIN_LOCK_INITIALIZER(lock);
> +	struct nmi_seq_buf *s = container_of(work, struct nmi_seq_buf, work);
> +	int len, head, size, i, last_i;
> +
> +again:
> +	/*
> +	 * vprintk_nmi()	truncate
> +	 *
> +	 * [S] head		[S] head
> +	 *     wmb		    mb
> +	 * [S] tail		[S] read

BTW, this ordering diagram is quite cryptic to me. Coffee did not
help :-)

> +	 *
> +	 * therefore:
> +	 */
> +	i = atomic_read(&s->read);
> +	len = atomic_read(&s->tail); /* up to the tail is stable */
> +	smp_rmb();
> +	head = atomic_read(&s->head);
> +
> +	/*
> +	 * We cannot truncate tail because it could overwrite a store from
> +	 * vprintk_nmi(), however vprintk_nmi() will always update tail to the
> +	 * correct value.
> +	 *
> +	 * Therefore if head < tail, we missed a truncate and should do so now.
> +	 */
> +	if (head < len)
> +		len = 0;

This is a bit confusing. It is a complicated way of making the
"nothing to do" check just below return.

If I get it correctly, this can only happen inside __printk_nmi_flush()
called on another CPU (from printk_nmi_flush()) when it interferes with
the queued irq_work. The irq_work is faster and truncates the buffer.
So the early return is fine after all, because the irq_work has already
printed everything.
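Just to check that I decode the diagram correctly: the way I read it,
the loads in the flusher pair with the stores in the writer, so the
head value we see is at least as new as the tail value we see. Below is
a minimal userspace model with C11 atomics of how I understand it; the
names (writer_append, reader_snapshot, read_idx) are mine, not from the
patch, and the real code of course uses the kernel primitives instead:

#include <stdatomic.h>

/* Model of the per-CPU buffer indexes from the patch. */
static atomic_int head;      /* advanced by vprintk_nmi(), reset by truncate */
static atomic_int tail;      /* advanced by vprintk_nmi() once data is copied */
static atomic_int read_idx;  /* consumed so far, updated by flush / truncate */

/* vprintk_nmi() side: reserve space first, publish it afterwards. */
static void writer_append(int bytes)
{
	int new_head = atomic_load_explicit(&head, memory_order_relaxed) + bytes;

	atomic_store_explicit(&head, new_head, memory_order_relaxed);
	/* ... the message bytes are copied into the buffer here ... */
	atomic_thread_fence(memory_order_release);	/* the "wmb" */
	atomic_store_explicit(&tail, new_head, memory_order_relaxed);
}

/*
 * __printk_nmi_flush() side: tail is read before head, so the head
 * value we observe is at least as new as the tail value.
 */
static int reader_snapshot(int *i, int *len)
{
	int h;

	*i   = atomic_load_explicit(&read_idx, memory_order_relaxed);
	*len = atomic_load_explicit(&tail, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);	/* the "rmb" */
	h    = atomic_load_explicit(&head, memory_order_relaxed);

	/*
	 * If a truncate (head = 0, read = 0) ran in between, we observe
	 * head < tail and know that our tail snapshot is stale, so there
	 * is nothing safe to print.
	 */
	if (h < *len)
		*len = 0;

	return *len - *i > 0;
}

Please correct me if this reading is wrong; it is only how I managed to
make sense of the diagram.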
> +	if (len - i <= 0) /* nothing to do */
> +		return;
> +
> +	/*
> +	 * 'Consume' this chunk, avoids concurrent callers printing the same
> +	 * stuff.
> +	 */
> +	if (atomic_cmpxchg(&s->read, i, len) != i)
> +		goto again;

I think that this is racy:

CPU0						CPU7

printk_nmi_flush()
  __printk_nmi_flush(for CPU7)

    i = atomic_read(&s->read);		(100)
    len = atomic_read(&s->tail);	(200)
    head = atomic_read(&s->head);	(200)

    if (atomic_cmpxchg(&s->read, i, len) != i)

    we pass, but we get interrupted or
    rescheduled on a preemptive kernel

						another vprintk_nmi() leaves:
						  head(400), tail(400)

						__printk_nmi_flush() in the
						irq_work prints the string
						between 200-400 and truncates
						the buffer: head(0), read(0)

						another vprintk_nmi() returns:
						  head(150), tail(150)

    prints the string between (100-200)
      => part of a new and part of an
	 old message

    and modifies @head and @read in
    a wrong way

I think that such races are hard to avoid without indexing the printed
messages. But that would make the approach too complicated.

Also, I think that ordering the CPUs is not worth it. I would go back
to the first solution, add the @lock there, and double check the races
with seq_buf().

I will stop commenting on the code here for now.

Best Regards,
Petr

PS: I had two cups of coffee and hope that my comments are a smaller
fiasco than yesterday's.
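PPS: To make it clearer what I mean by "the first solution with the
@lock", roughly something like the sketch below. Only struct
nmi_seq_buf is taken from your patch; the other names and the buffer
size are made up for illustration, the per-CPU seq_buf_init() /
init_irq_work() setup is left out, and the remaining race between the
NMI writer and the flush clearing the buffer is exactly the part that
would still need the double checking:

#include <stdarg.h>
#include <linux/irq_work.h>
#include <linux/percpu.h>
#include <linux/printk.h>
#include <linux/seq_buf.h>
#include <linux/spinlock.h>

#define NMI_BUF_SIZE	4096

struct nmi_seq_buf {
	struct seq_buf		seq;			/* per-CPU text buffer */
	struct irq_work		work;			/* deferred flush */
	unsigned char		buffer[NMI_BUF_SIZE];
};
static DEFINE_PER_CPU(struct nmi_seq_buf, nmi_print_seq);

/* NMI side: only this CPU's NMIs write into s->seq, so no lock is needed. */
static int vprintk_nmi(const char *fmt, va_list args)
{
	struct nmi_seq_buf *s = this_cpu_ptr(&nmi_print_seq);
	int ret;

	ret = seq_buf_vprintf(&s->seq, fmt, args);
	irq_work_queue(&s->work);

	return ret;
}

/*
 * irq_work side: the single @lock only serializes the final printk(),
 * so the output from different CPUs does not get interleaved.
 */
static void __printk_nmi_flush(struct irq_work *work)
{
	static DEFINE_RAW_SPINLOCK(lock);
	struct nmi_seq_buf *s = container_of(work, struct nmi_seq_buf, work);
	unsigned long flags;

	raw_spin_lock_irqsave(&lock, flags);
	printk("%.*s", (int)seq_buf_used(&s->seq), s->buffer);
	seq_buf_clear(&s->seq);
	raw_spin_unlock_irqrestore(&lock, flags);
}

The point is that the NMI path stays lockless (only the local CPU ever
writes its own buffer), and the raw spinlock is taken only in the
irq_work, where it merely keeps the output of different CPUs from being
mixed.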