From: ebiederm@xmission.com (Eric W. Biederman)
To: Rahul Lakkireddy
Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
    kexec@lists.infradead.org, davem@davemloft.net,
    akpm@linux-foundation.org, torvalds@linux-foundation.org,
    ganeshgr@chelsio.com, nirranjan@chelsio.com, indranil@chelsio.com
Date: Fri, 02 Mar 2018 07:22:45 -0600
In-Reply-To: (Rahul Lakkireddy's message of "Fri, 2 Mar 2018 17:49:56 +0530")
Message-ID: <87lgfad32y.fsf@xmission.com>
Subject: Re: [RFC 0/2] kernel: add support to collect hardware logs in panic

Rahul Lakkireddy writes:

> On production servers running variety of workloads over time, kernel
> panic can happen sporadically after days or even months. It is
> important to collect as much debug logs as possible to root cause
> and fix the problem, that may not be easy to reproduce. Snapshot of
> underlying hardware/firmware state (like register dump, firmware
> logs, adapter memory, etc.), at the time of kernel panic will be very
> helpful while debugging the culprit device driver.
>
> This series of patches add new generic framework that enable device
> drivers to collect device specific snapshot of the hardware/firmware
> state of the underlying device at the time of kernel panic.
> The collected logs are appended to vmcore along with details, such as
> start address and length of the logs, which are required for
> extraction during post-analysis.
>
> Device drivers can use crash_driver_dump_register() to register their
> callback that collects underlying device specific hardware/firmware
> logs during kernel panic (i.e. before booting into the second kernel).
> Drivers can unregister with crash_driver_dump_unregister().
>
> To extract the device specific hardware/firmware logs using crash:
>
>     crash> help -D | grep DRIVERDUMP
>     DRIVERDUMP=(cxgb4_0000:02:00.4, ffffb131090bd000, 37782968)
>
>     crash> rd ffffb131090bd000 37782968 -r hardware.log
>     37782968 bytes copied from 0xffffb131090bd000 to hardware.log
>
> Patch 1 adds API to allow drivers to register callback to
> collect the device specific hardware/firmware logs.
>
> Patch 2 shows a cxgb4 driver example using the API to collect
> hardware/firmware logs during kernel panic.
>
> Suggestions and feedback will be much appreciated.

I strongly suggest you figure out how to run this code in the crash
recovery kernel before your hardware is initialized. That will give
you a known good kernel to perform your collection from.

Every line of code we add to the kexec on panic code path tends to add
to its fragility and increase the chance you won't get any information
at all.

When the assumption is that something wrong with your driver/hardware
caused the crash, calling into your driver is a very bad idea,
especially when that means running code that does callbacks and all
kinds of other cute things.

Doing this as the crash recovery kernel boots up, before much if any
hardware is initialized, seems like a fine thing to do. It just needs a
little coordination with userspace to ensure the information gets
saved when a vmcore is computed.

Eric