MIME-Version: 1.0
In-Reply-To: <20161122103351.GA25080@e106950-lin.cambridge.arm.com>
References: <20161116135527.GA5833@e106950-lin.cambridge.arm.com>
 <CANn89iJ_GmhKzq-yzPNFkxNfoqNJQ_uSUVGX=iJO7reQAHedNA@mail.gmail.com>
 <20161116180156.GA21156@e106950-lin.cambridge.arm.com> <CANn89iK=yTkVKi9Wnx3ZXWZSOQc6KJT-FcE7H-5+QB85GE4=Vw@mail.gmail.com>
 <20161116210139.GB21156@e106950-lin.cambridge.arm.com> <CANn89iKu+=7eD3MenkpfiwqkerwKkJJXonzHi=yiKc3o0A3p9w@mail.gmail.com>
 <20161117164200.GA24653@e106950-lin.cambridge.arm.com> <alpine.DEB.2.20.1611180125400.3640@nanos>
 <20161122103351.GA25080@e106950-lin.cambridge.arm.com>
From: Eric Dumazet <edumazet@google.com>
Date: Tue, 22 Nov 2016 06:29:33 -0800
Message-ID: <CANn89i+dBOafrP9TjJgPDifjycQvoPET6RUCuNXKu4tiJ_HJRQ@mail.gmail.com>
Subject: Re: Regression: Failed boots bisected to 4cd13c21b207 "softirq: Let
 ksoftirqd do its job"
To: Brian Starkey <brian.starkey@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
        LKML <linux-kernel@vger.kernel.org>,
        Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@kernel.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        Alexander Potapenko <glider@google.com>,
        Steven Rostedt <rostedt@goodmis.org>,
        Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2525
Lines: 75

On Tue, Nov 22, 2016 at 2:33 AM, Brian Starkey <brian.starkey@arm.com> wrote:
>
> Hi,
>
> On Fri, Nov 18, 2016 at 01:40:43AM +0100, Thomas Gleixner wrote:
>>
>> Brian,
>>
>> On Thu, 17 Nov 2016, Brian Starkey wrote:
>>>
>>> No joy with this patch :-(
>>>
>>> I had to add an ioaddr argument because apparently that macro depends
>>> on local context (yuck...), but it doesn't help my issue.
>>>
>>> FWIW I don't see any timeouts, either with or without the patch.
>>> (I don't know for sure, but I would guess that the model of the
>>> network card doesn't model whatever stall that loop is checking for.
>>> It probably just completes all MMU operations immediately)
>>
>>
>> Is there a chance that you could enable trace points on the kernel command line?
>>
>>  trace_event=sched_wakeup,sched_switch,irq_handler_entry,irq_handler_exit,softirq_raise,softirq_entry,softirq_exit
>>
>> should be enough for a start. All we need aside from that is a trigger to
>> stop the trace so we can actually see the events around the time where
>> things go stale.
>>
>> I assume that the whole issue is visible throughout the slow progress of
>> init towards a working system, so for a start it would be sufficient to add
>> something like this into the startup sequence at some point:
>>
>> mount -t debugfs debugfs /sys/kernel/debug
>> echo 0 >/sys/kernel/debug/tracing/tracing_on
>>
>> The only interesting challenge is to get the trace data out of the
>> system. The trace is accessible via:
>>
>> cat /sys/kernel/debug/tracing/trace
>>
>> So if your ssh works at some point, that might be an option or you just try
>> to store it over NFS (which will be slow, but better than nothing). Maybe
>> you have a better idea :)
>
>
> I finally managed to pry some traces out this morning. It seems like
> the system struggles to even invoke echo when it's doing badly.
>
> Trace before 4cd13c21b207: https://drive.google.com/open?id=0B8siaK6ZjvEwU21wNTdZS29kVXc
> Trace after 4cd13c21b207: https://drive.google.com/open?id=0B8siaK6ZjvEwbXVzcnpieVkzWFU
> (btw, if there's a preferred way to send the logs let me know. I
> wasn't sure large or non-text attachments would be well received)
>
> I'm not sure how much help the trace is, but it does look like the
> system is spending far too much time in the ethernet device's IRQ
> handler to be healthy.
>

Thanks a lot, Brian.

Can you confirm that the interrupt handler is smc911x_interrupt()?

(i.e., is SMC_USE_PXA_DMA / SMC_USE_DMA defined or not?)
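
To put a number on how much time goes into the IRQ handler, the two traces
could be summarized offline. Below is a rough sketch in Python, not a
polished tool. It assumes the default ftrace text layout, where each line
carries a "[cpu]" field, a "seconds.microseconds:" timestamp, and
irq_handler_entry / irq_handler_exit events printed as "irq=N name=X" and
"irq=N ret=...". The function name irq_handler_time is purely illustrative,
and the regex may need adjusting if the trace was captured with different
trace options.

```python
import re
from collections import defaultdict

# Assumed line shape (default ftrace text output; values are illustrative):
#   <idle>-0     [000] d.h1   100.000010: irq_handler_entry: irq=29 name=eth0
#   <idle>-0     [000] d.h1   100.000250: irq_handler_exit: irq=29 ret=handled
EVENT_RE = re.compile(
    r"\[(?P<cpu>\d+)\].*?\s(?P<sec>\d+)\.(?P<usec>\d{6}): "
    r"(?P<event>irq_handler_entry|irq_handler_exit): "
    r"irq=(?P<irq>\d+)(?: name=(?P<name>\S+))?"
)


def irq_handler_time(lines):
    """Return {irq: {"name": str, "count": int, "usec": int}}.

    "count" is the number of complete entry/exit pairs seen for that IRQ,
    "usec" is the total time between entry and exit, in microseconds.
    """
    started = {}  # (cpu, irq) -> entry timestamp in microseconds
    stats = defaultdict(lambda: {"name": "?", "count": 0, "usec": 0})
    for line in lines:
        m = EVENT_RE.search(line)
        if m is None:
            continue  # header, comment, or some other event
        cpu, irq = int(m["cpu"]), int(m["irq"])
        now = int(m["sec"]) * 1_000_000 + int(m["usec"])
        if m["event"] == "irq_handler_entry":
            started[(cpu, irq)] = now
            stats[irq]["name"] = m["name"] or "?"
        else:
            begin = started.pop((cpu, irq), None)
            if begin is not None:  # skip an exit whose entry was not captured
                stats[irq]["count"] += 1
                stats[irq]["usec"] += now - begin
    return dict(stats)
```

Feeding it an open file object for each saved trace, e.g.
irq_handler_time(open("trace.txt")), and comparing the "usec" totals
before and after 4cd13c21b207 should show whether the handler time itself
grew or only the number of invocations did. The file name here is a
placeholder.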


>
> Thanks,
> Brian
>>
>>
>> Thanks,
>>
>>         tglx
>>