MIME-Version: 1.0
In-Reply-To: <1435649004-379-1-git-send-email-rui.y.wang@intel.com>
References: <1435649004-379-1-git-send-email-rui.y.wang@intel.com>
Date: Tue, 30 Jun 2015 17:23:30 +0200
Message-ID: <CAKMK7uFL+J7ewbhsW+MSxs0gkmf9O1YRWW=RFgPsEqC2x=UMQA@mail.gmail.com>
Subject: Re: drm/mgag200: doesn't work in panic context
From: Daniel Vetter <daniel.vetter@ffwll.ch>
To: Rui Wang <rui.y.wang@intel.com>
Cc: Borislav Petkov <bp@alien8.de>, "Luck, Tony" <tony.luck@intel.com>,
        Dave Airlie <airlied@redhat.com>, "Clark, Rob" <robdclark@gmail.com>,
        Matthew D Roper <matthew.d.roper@intel.com>,
        "Chen, Gong" <gong.chen@intel.com>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3242
Lines: 64

On Tue, Jun 30, 2015 at 9:23 AM, Rui Wang <rui.y.wang@intel.com> wrote:
> On Tuesday, June 30, 2015 2:37 PM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>> On Tue, Jun 30, 2015 at 4:53 AM, Rui Wang <rui.y.wang@intel.com> wrote:
>> >
>> > I think testing can be done by injecting a fatal machine check
>> > exception via einj's debugfs interface. I can reproduce the hard hang
>> > every time.
>> > I think it can be a simple script or C program to do the automated testing.
>> > If anyone has any patch I'll be happy to help test it out.
>>
>> Testing shouldn't kill the machine ;-)
>
> Yes :) What I assumed was that after applying a future patch the machine should
> be able to reboot instead of hanging itself, so the testing can repeat.
>
>>
>> The idea I had is to just exercise the drm panic code (since we'd need to
>> shunt everything else), and that can be done by calling the relevant
>> functions from a hardirq context. And hardirq context is simplest to get with
>> an IPI to the local cpu. This way we don't depend upon the entire panic path
>> to be recoverable, but only upon the drm bits being sane.
>
> Yes, if it can be tested without rebooting then it'll be more efficient.
> But einj does something more than what an IPI can do, it injects hardware
> errors which trigger exceptions in NMI context... and the exception handler
> usually panics on fatal errors. And the display may be the only way to catch
> what has happened. I'm just hoping that the future version may work in NMI
> context.

NMI sounds ... ambiguous ;-) But yeah, if we can somehow inject
something as an NMI too then that would be even better. What I want to
avoid is forcing reboots, since that means you can't run a basic
modeset test afterwards to make sure nothing was trampled too badly.
Of course we'd have to replace the screen contents, but the important
part is that the panic handler doesn't touch anything if the driver is
in modeset code right now (because it'll massively increase the risk
of dying completely), and an easy way to check that it didn't step all
over modeset state unduly is to do a modeset afterwards. If that works
we'll be fine.

Also, with that approach we can make sure that no real errors get into
dmesg (as opposed to a real panic), which means we can capture dmesg
afterwards, and if there is a serious log message (or even a backtrace)
then drm panic handling has a bug.

All that isn't possible when we force a real panic to happen.

Actually, thinking more about NMI, that shouldn't be a problem. The
important thing with NMI vs. hardirq is that you can't even reliably
grab an irqsave spinlock, it's trylocks all the way down. But that
also holds for the panic handler, it's trylocks only. Could we somehow
just check that using lockdep - is there an NMI lockdep context
somewhere we could fake-grab? That's another upside of using an IPI
btw: Real panics kill lockdep ;-)
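
To make the trylock-only rule concrete, here is a rough userspace
sketch. It is not drm code: a pthread mutex stands in for a driver's
modeset lock, and every name in it (modeset_lock, scanout_writes,
fake_panic_handler, panic_selftest) is made up for illustration. The
only point is the shape: the handler never blocks, and it backs off
without touching any state if the lock is already held.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a driver's modeset lock. */
static pthread_mutex_t modeset_lock = PTHREAD_MUTEX_INITIALIZER;

/* Counts how often the fake handler touched "hardware" state. */
static int scanout_writes;

/*
 * Panic-style handler: trylock only, never block. If the lock is
 * taken, something is in the middle of a modeset, so do nothing and
 * report that the panic output was skipped.
 */
static bool fake_panic_handler(void)
{
	if (pthread_mutex_trylock(&modeset_lock) != 0)
		return false;

	scanout_writes++;	/* pretend to draw the panic message */
	pthread_mutex_unlock(&modeset_lock);
	return true;
}

/*
 * Non-destructive self-test in the spirit of the IPI idea: call the
 * handler once while idle and once while a "modeset" holds the lock,
 * then check that only the idle call touched any state.
 */
static int panic_selftest(void)
{
	int before = scanout_writes;
	bool idle_ok, busy_ok;

	idle_ok = fake_panic_handler();

	pthread_mutex_lock(&modeset_lock);	/* simulate modeset in progress */
	busy_ok = fake_panic_handler();		/* must back off */
	pthread_mutex_unlock(&modeset_lock);

	assert(idle_ok);
	assert(!busy_ok);
	assert(scanout_writes == before + 1);
	return 0;
}
```

Calling panic_selftest() from a small main() (build with -pthread)
should return 0. In the kernel the equivalent check would use spinlock
trylocks from hardirq or NMI context, which this sketch does not
attempt to model.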
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch