Subject: Re: drm/mgag200: doesn't work in panic context
From: Daniel Vetter
To: Rui Wang
Cc: Borislav Petkov, "Luck, Tony", Dave Airlie, "Clark, Rob", Matthew D Roper, "Chen, Gong", Linux Kernel Mailing List
Date: Tue, 30 Jun 2015 17:23:30 +0200
In-Reply-To: <1435649004-379-1-git-send-email-rui.y.wang@intel.com>
References: <1435649004-379-1-git-send-email-rui.y.wang@intel.com>

On Tue, Jun 30, 2015 at 9:23 AM, Rui Wang wrote:
> On Tuesday, June 30, 2015 2:37 PM, Daniel Vetter wrote:
>> On Tue, Jun 30, 2015 at 4:53 AM, Rui Wang wrote:
>> >
>> > I think testing can be done by injecting a fatal machine check
>> > exception via einj's debugfs interface. I can reproduce the hard hang
>> > every time.
>> > I think it can be a simple script or C program to do the automated testing.
>> > If anyone has any patch I'll be happy to help test it out.
>>
>> Testing shouldn't kill the machine ;-)
>
> Yes :) What I assumed was that after applying a future patch the machine should
> be able to reboot instead of hanging itself, so the testing can repeat.
>
>>
>> The idea I had is to just exercise the drm panic code (since we'd need to
>> shunt everything else), and that can be done by calling the relevant
>> functions from a hardirq context. And hardirq context is simplest to get
>> with an IPI to the local cpu.
>> This way we don't depend upon the entire panic path to
>> be recoverable, but only upon the drm bits being sane.
>
> Yes, if it can be tested without rebooting then it'll be more efficient.
> But einj does something more than what an IPI can do, it injects hardware
> errors which trigger exceptions in NMI context... and the exception handler
> usually panics on fatal errors. And the display may be the only way to catch
> what has happened. I'm just hoping that the future version may work in NMI
> context.

NMI sounds ... ambiguous ;-) But yeah, if we can somehow inject something as
an NMI too then that would be even better. What I want to avoid is forcing
reboots, since that means you can't run a basic modeset test afterwards to
make sure nothing was trampled too badly. Of course we'd have to replace the
screen contents, but the important part is that the panic handler doesn't
touch anything if the driver is in modeset code right now (because it'll
massively increase the risk of dying completely), and an easy way to check
that it didn't step all over modeset state unduly is to do a modeset
afterwards. If that works we'll be fine.

Also, with that approach we can make sure that no real errors get into dmesg
(as opposed to a real panic), which means we can capture dmesg afterwards,
and if there is a serious log message (or even a backtrace) then drm panic
handling has a bug. All that isn't possible when we force a real panic to
happen.

Actually, thinking more about NMI, that shouldn't be a problem. The important
thing with NMI vs. hardirq is that you can't even reliably grab an irqsave
spinlock, it's trylocks all the way down. But that also holds for the panic
handler, it's trylocks only. Could we somehow just check that using lockdep -
is there an NMI lockdep context somewhere we could fake-grab?
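
As a rough illustration of the "trylocks only" rule, here is a small
userspace sketch. It is not kernel code and not an existing drm interface:
pthread mutexes stand in for the driver's locks, and every name in it
(fake_crtc, fake_crtc_init, fake_panic_show) is made up for this example.
The only point is that the panic-style path never blocks, and touches
nothing if the modeset lock is already held.

```c
/*
 * Userspace illustration only. pthread mutexes stand in for kernel locks,
 * and all names below are hypothetical, not part of any drm API.
 */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_crtc {
	pthread_mutex_t modeset_lock;	/* stand-in for the driver's modeset lock */
	char screen[64];		/* stand-in for the visible screen contents */
};

/* Set up the lock and some initial "screen contents". */
void fake_crtc_init(struct fake_crtc *crtc)
{
	int rc = pthread_mutex_init(&crtc->modeset_lock, NULL);

	assert(rc == 0);
	(void)rc;	/* keeps the build quiet when NDEBUG disables assert */
	snprintf(crtc->screen, sizeof(crtc->screen), "%s", "desktop");
}

/*
 * Panic-style path: never wait for a lock. If the modeset lock is busy,
 * assume a modeset is in progress, leave all state alone and report
 * failure. Only on a successful trylock is the screen replaced.
 */
bool fake_panic_show(struct fake_crtc *crtc, const char *msg)
{
	if (pthread_mutex_trylock(&crtc->modeset_lock) != 0)
		return false;	/* lock held: do not touch anything */

	snprintf(crtc->screen, sizeof(crtc->screen), "%s", msg);
	pthread_mutex_unlock(&crtc->modeset_lock);
	return true;
}
```

A test in the spirit of the above would take modeset_lock, call
fake_panic_show() and check that it returns false with the screen
untouched, then drop the lock and check that a second call succeeds.
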
That's another upside of using an IPI btw: Real panics kill lockdep ;-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch