From: Rahul Lakkireddy
To: netdev@vger.kernel.org, linux-fsdevel@vger.kernel.org, kexec@lists.infradead.org, linux-kernel@vger.kernel.org
Cc: davem@davemloft.net, viro@zeniv.linux.org.uk, ebiederm@xmission.com, stephen@networkplumber.org, akpm@linux-foundation.org, torvalds@linux-foundation.org, ganeshgr@chelsio.com, nirranjan@chelsio.com, indranil@chelsio.com, Rahul Lakkireddy
Subject: [PATCH net-next 0/2] kernel: add support to collect hardware logs in crash recovery kernel
Date: Fri, 23 Mar 2018 14:00:59 +0530

On production servers running a variety of workloads, a kernel panic can happen sporadically after days or even months of uptime. It is important to collect as many debug logs as possible in order to root-cause and fix a problem that may not be easy to reproduce.
A snapshot of the underlying hardware/firmware state (register dump, firmware logs, adapter memory, etc.) at the time of the kernel panic is very helpful when debugging the culprit device driver.

This series of patches adds a new generic framework that enables device drivers to collect a device-specific snapshot of the hardware/firmware state of the underlying device in the crash recovery kernel. In the crash recovery kernel, the collected logs are exposed via the /sys/kernel/crashdd/ directory, which is copied by user space scripts for post-analysis.

A new kernel module, crashdd, is added. In the crash recovery kernel, crashdd exposes the /sys/kernel/crashdd/ directory containing device-specific hardware/firmware logs.

The sequence of actions by which device drivers append their device-specific hardware/firmware logs to the /sys/kernel/crashdd/ directory is as follows:

1. During probe (before hardware is initialized), the device driver registers with the crashdd module (via crashdd_add_dump()), passing a callback function along with the buffer size and log name needed for firmware/hardware log collection.

2. Crashdd creates a directory for the driver under /sys/kernel/crashdd/. It then allocates a buffer of the requested size and invokes the device driver's registered callback function.

3. The device driver collects all hardware/firmware logs into the buffer and returns control to crashdd.

4. Crashdd exposes the buffer as a file via /sys/kernel/crashdd//.

5. A user space script (/usr/lib/kdump/kdump-lib-initramfs.sh) copies the entire /sys/kernel/crashdd/ directory to the /var/crash/ directory.

Patch 1 adds the crashdd module, which allows drivers to register a callback to collect device-specific hardware/firmware logs. The module also exports the /sys/kernel/crashdd/ directory containing the hardware/firmware logs.

Patch 2 shows a cxgb4 driver example that uses the API to collect hardware/firmware logs in the crash recovery kernel, before hardware is initialized.
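
As an illustration only, the registration sequence in steps 1 to 4 can be sketched as a small user space mock. This is not the code from the patches: struct dump_request, struct dump_entry, mock_add_dump(), demo_collect() and run_demo() are hypothetical names, the real crashdd_add_dump() signature is not reproduced here, and the driver-directory/log-name path layout is an assumption based on the description above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical registration record; field names are assumptions and do not
 * mirror the structures defined in the patch. */
struct dump_request {
	const char *driver;                     /* directory under /sys/kernel/crashdd/ */
	const char *log_name;                   /* file name inside that directory */
	size_t size;                            /* buffer size the driver asks for */
	int (*collect)(void *buf, size_t size); /* driver callback */
};

/* Hypothetical result: what the framework would expose as a file. */
struct dump_entry {
	char path[128];
	void *buf;
	size_t size;
};

/* Mock of the registration path (steps 1 to 4); not the real crashdd_add_dump(). */
static int mock_add_dump(const struct dump_request *req, struct dump_entry *out)
{
	void *buf;
	int ret;

	assert(out != NULL);
	if (!req || !req->driver || !req->log_name || !req->collect || req->size == 0)
		return -1;

	/* Step 2: allocate the buffer with the requested size. */
	buf = calloc(1, req->size);
	if (!buf)
		return -1;

	/* Steps 2-3: invoke the driver's callback, which fills the buffer. */
	ret = req->collect(buf, req->size);
	if (ret != 0) {
		free(buf);
		return ret;
	}

	/* Step 4: record where the buffer would appear (assumed layout). */
	snprintf(out->path, sizeof(out->path), "/sys/kernel/crashdd/%s/%s",
		 req->driver, req->log_name);
	out->buf = buf;
	out->size = req->size;
	return 0;
}

/* Example driver callback: writes a fake register value into the buffer. */
static int demo_collect(void *buf, size_t size)
{
	snprintf(buf, size, "reg0=0x%08x", 0xdeadbeefu);
	return 0;
}

/* Runs the whole flow once; returns 0 on success. */
static int run_demo(void)
{
	struct dump_request req = { "demo_drv", "regs", 64, demo_collect };
	struct dump_entry entry;
	int ret = mock_add_dump(&req, &entry);

	if (ret != 0)
		return ret;
	printf("%s: %s\n", entry.path, (const char *)entry.buf);
	free(entry.buf);
	return 0;
}
```

Calling run_demo() from a main() exercises the flow once. The mock invokes the callback before recording the file path, matching the ordering noted in the changelog (driver callback is called before the binary file is created).
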
The logs for the cxgb4 devices are made available under the /sys/kernel/crashdd/cxgb4/ directory.

Thanks,
Rahul

RFC v1: https://lkml.org/lkml/2018/3/2/542
RFC v2: https://lkml.org/lkml/2018/3/16/326

---
Changes since rfc v2:
- Moved exporting crashdd from procfs to sysfs. Suggested by
  Stephen Hemminger
- Moved code from fs/proc/crashdd.c to fs/crashdd/ directory.
- Replaced all proc API with sysfs API and updated comments.
- Calling driver callback before creating the binary file under
  crashdd sysfs.
- Changed binary dump file permission from S_IRUSR to S_IRUGO.
- Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP.

rfc v2:
- Collecting logs in 2nd kernel instead of during kernel panic.
  Suggested by Eric Biederman.
- Added new crashdd module that exports /proc/crashdd/ containing
  driver's registered hardware/firmware logs in patch 1.
- Replaced the API to allow drivers to register their hardware/firmware
  log collect routine in crash recovery kernel in patch 1.
- Updated patch 2 to use the new API in patch 1.

Rahul Lakkireddy (2):
  fs/crashdd: add API to collect hardware dump in second kernel
  cxgb4: collect hardware dump in second kernel

 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h       |   4 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c |  25 +++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h |   3 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  |  12 ++
 fs/Kconfig                                       |   1 +
 fs/Makefile                                      |   1 +
 fs/crashdd/Kconfig                               |  10 +
 fs/crashdd/Makefile                              |   3 +
 fs/crashdd/crashdd.c                             | 234 +++++++++++++++++++++++
 fs/crashdd/crashdd_internal.h                    |  24 +++
 include/linux/crashdd.h                          |  24 +++
 11 files changed, 341 insertions(+)
 create mode 100644 fs/crashdd/Kconfig
 create mode 100644 fs/crashdd/Makefile
 create mode 100644 fs/crashdd/crashdd.c
 create mode 100644 fs/crashdd/crashdd_internal.h
 create mode 100644 include/linux/crashdd.h

--
2.14.1