Date: Sat, 3 Mar 2018 16:13:08 +0530
From: Rahul Lakkireddy
To: "Eric W. Biederman"
Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
    kexec@lists.infradead.org, davem@davemloft.net,
    akpm@linux-foundation.org, torvalds@linux-foundation.org,
    Ganesh GR, Nirranjan Kirubaharan, Indranil Choudhury
Subject: Re: [RFC 0/2] kernel: add support to collect hardware logs in panic
Message-ID: <20180303104307.GA17150@chelsio.com>
References: <87lgfad32y.fsf@xmission.com>
In-Reply-To: <87lgfad32y.fsf@xmission.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Friday, March 03/02/18, 2018 at 18:52:45 +0530, Eric W.
Biederman wrote:
> Rahul Lakkireddy writes:
>
> > On production servers running a variety of workloads over time, kernel
> > panic can happen sporadically after days or even months. It is
> > important to collect as many debug logs as possible to root cause
> > and fix the problem, which may not be easy to reproduce. A snapshot of
> > underlying hardware/firmware state (like register dump, firmware
> > logs, adapter memory, etc.) at the time of kernel panic will be very
> > helpful while debugging the culprit device driver.
> >
> > This series of patches adds a new generic framework that enables device
> > drivers to collect a device-specific snapshot of the hardware/firmware
> > state of the underlying device at the time of kernel panic. The
> > collected logs are appended to vmcore along with details, such as
> > start address and length of the logs, which are required for
> > extraction during post-analysis.
> >
> > Device drivers can use crash_driver_dump_register() to register their
> > callback that collects underlying device-specific hardware/firmware
> > logs during kernel panic (i.e. before booting into the second kernel).
> > Drivers can unregister with crash_driver_dump_unregister().
> >
> > To extract the device-specific hardware/firmware logs using crash:
> >
> > crash> help -D | grep DRIVERDUMP
> > DRIVERDUMP=(cxgb4_0000:02:00.4, ffffb131090bd000, 37782968)
> >
> > crash> rd ffffb131090bd000 37782968 -r hardware.log
> > 37782968 bytes copied from 0xffffb131090bd000 to hardware.log
> >
> > Patch 1 adds an API to allow drivers to register a callback to
> > collect the device-specific hardware/firmware logs.
> >
> > Patch 2 shows a cxgb4 driver example using the API to collect
> > hardware/firmware logs during kernel panic.
> >
> > Suggestions and feedback will be much appreciated.
>
> I strongly suggest you figure out how to run this code in the
> crash recovery kernel before your hardware is initialized.
> That will give you a known good kernel to perform your collection from.
>
> Every line of code we add to the kexec on panic code path tends to add
> to its fragility and increase the chance you won't get any information
> at all.
>
> When the assumption is it is something wrong with your driver/hardware
> that caused the crash, calling into your driver is a very bad idea.
> Especially running code that does callbacks and all kinds of other cute
> things.
>
> Doing this as the crash recovery kernel boots up before much if any
> hardware is initialized seems like a fine thing to do, and just
> needs a little coordination with userspace to ensure the information
> gets saved when a vmcore is computed.
>

Thanks for the feedback and suggestions. I will work on achieving this
from the crash recovery kernel.

Thanks,
Rahul
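
For reference, the DRIVERDUMP line in the crash session quoted above
carries a name, a start address and a length, which is all that is
needed to build the follow-up "rd" command. Below is a minimal userspace
sketch of that step in C. It assumes the layout is exactly as in the
quoted example (address in bare hex, length in decimal bytes); the
struct and both function names are hypothetical and are not part of the
patch series or of the crash utility.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one DRIVERDUMP entry; the field names are
 * illustrative and do not come from the patch series. */
struct driver_dump {
    char name[64];   /* e.g. "cxgb4_0000:02:00.4" */
    uint64_t addr;   /* start address of the log, printed as bare hex */
    uint64_t len;    /* length of the log in bytes, printed as decimal */
};

/* Parse a line such as
 *   DRIVERDUMP=(cxgb4_0000:02:00.4, ffffb131090bd000, 37782968)
 * Returns 0 on success, -1 if the line does not match that layout. */
int parse_driverdump(const char *line, struct driver_dump *out)
{
    const char *p = strstr(line, "DRIVERDUMP=(");
    char tail;

    if (!p)
        return -1;
    /* %63[^,] reads the name up to the first comma; the trailing %c
     * must capture the closing parenthesis for the match to count. */
    if (sscanf(p, "DRIVERDUMP=(%63[^,], %" SCNx64 ", %" SCNu64 "%c",
               out->name, &out->addr, &out->len, &tail) != 4)
        return -1;
    return tail == ')' ? 0 : -1;
}

/* Build the matching crash "rd" command for the parsed entry.
 * Returns 0 on success, -1 if the buffer is too small. */
int format_rd_command(const struct driver_dump *d, const char *file,
                      char *buf, size_t buflen)
{
    int n = snprintf(buf, buflen, "rd %" PRIx64 " %" PRIu64 " -r %s",
                     d->addr, d->len, file);

    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Feeding the quoted DRIVERDUMP line to parse_driverdump() and then
calling format_rd_command() with "hardware.log" reproduces the "rd"
invocation shown in the quoted session.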