Date: Fri, 18 Nov 2016 20:23:38 +0000
From: Brian Starkey <brian.starkey@arm.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Eric Dumazet <edumazet@google.com>,
        LKML <linux-kernel@vger.kernel.org>,
        Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@kernel.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        Alexander Potapenko <glider@google.com>,
        Steven Rostedt <rostedt@goodmis.org>,
        Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: Re: Regression: Failed boots bisected to 4cd13c21b207 "softirq: Let
 ksoftirqd do its job"
Message-ID: <20161118183633.GA25157@e106950-lin.cambridge.arm.com>
References: <20161116135527.GA5833@e106950-lin.cambridge.arm.com>
 <CANn89iJ_GmhKzq-yzPNFkxNfoqNJQ_uSUVGX=iJO7reQAHedNA@mail.gmail.com>
 <20161116180156.GA21156@e106950-lin.cambridge.arm.com>
 <CANn89iK=yTkVKi9Wnx3ZXWZSOQc6KJT-FcE7H-5+QB85GE4=Vw@mail.gmail.com>
 <20161116210139.GB21156@e106950-lin.cambridge.arm.com>
 <CANn89iKu+=7eD3MenkpfiwqkerwKkJJXonzHi=yiKc3o0A3p9w@mail.gmail.com>
 <20161117164200.GA24653@e106950-lin.cambridge.arm.com>
 <alpine.DEB.2.20.1611180125400.3640@nanos>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Disposition: inline
In-Reply-To: <alpine.DEB.2.20.1611180125400.3640@nanos>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2014
Lines: 60

Hi Thomas,

On Fri, Nov 18, 2016 at 01:40:43AM +0100, Thomas Gleixner wrote:
>Brian,
>
>On Thu, 17 Nov 2016, Brian Starkey wrote:
>> No joy with this patch :-(
>>
>> I had to add an ioaddr argument because apparently that macro depends
>> on local context (yuck...), but it doesn't help my issue.
>>
>> FWIW I don't see any timeouts, either with or without the patch.
>> (I don't know for sure, but I would guess that the model of the
>> network card doesn't model whatever stall that loop is checking for.
>> It probably just completes all MMU operations immediately)
>
>Is there a chance that you enable trace points at the kernel command line?
>
>  trace_event=sched_wakeup,sched_switch,irq_handler_entry,irq_handler_exit,softirq_raise,softirq_entry,softirq_exit
>
>should be enough for a start. All we need aside of that is a trigger to
>stop the trace so we can actually see the events around the time where
>things go stale.
>
>I assume that the whole issue is visible throughout the slow progress of
>init towards a working system, so for a start it would be sufficient to add
>something like this into the startup sequence at some point:
>
> mount -t debugfs debugfs /sys/kernel/debug
> echo 0 >/sys/kernel/debug/tracing/tracing_on
>
>The only interesting challenge is to get the trace data out of the
>system. The trace is accessible via:
>
> cat /sys/kernel/debug/tracing/trace
>

Thanks for the pointers on tracing. I haven't used it before so that
was very helpful.

>So if your ssh works at some point, that might be an option or you just try
>to store it over NFS (which will be slow, but better than nothing). Maybe
>you have a better idea :)

I've tried a whole bunch of different ways to reproduce the problem
and get the logs out, but so far they've all been unsuccessful
(reproducing is easy, getting data out is not).

I have a few more ideas to try, but it's pretty slow work - it's
taking at least 30 minutes per attempt. I'll let you know if I manage
something.

Thanks!
-Brian

>
>Thanks,
>
>	tglx
>