From: Rahul Lakkireddy
To: netdev@vger.kernel.org, kexec@lists.infradead.org,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: davem@davemloft.net, viro@zeniv.linux.org.uk, ebiederm@xmission.com,
    stephen@networkplumber.org, akpm@linux-foundation.org,
    torvalds@linux-foundation.org, ganeshgr@chelsio.com,
    nirranjan@chelsio.com, indranil@chelsio.com, Rahul Lakkireddy
Subject: [PATCH net-next v5 0/3] kernel: add support to collect hardware logs in crash recovery kernel
Date: Sat, 21 Apr 2018 22:35:52 +0530
X-Mailing-List: linux-kernel@vger.kernel.org

On production servers running a variety of workloads over time, a
kernel panic can happen sporadically after days or even months. It is
important to collect as many debug logs as possible to root-cause and
fix a problem that may not be easy to reproduce.
A snapshot of the underlying hardware/firmware state (register dump,
firmware logs, adapter memory, etc.) at the time of the kernel panic
is very helpful when debugging the culprit device driver.

This patch series adds a new generic framework that enables device
drivers to collect a device-specific snapshot of the hardware/firmware
state of the underlying device in the crash recovery kernel. In the
crash recovery kernel, the collected logs are added as ELF notes to
/proc/vmcore, which is copied by user space scripts for post-analysis.

The sequence of actions performed by a device driver to append its
device-specific hardware/firmware logs to /proc/vmcore is as follows:

1. During probe (before hardware is initialized), the device driver
   registers with the vmcore module (via vmcore_add_device_dump()),
   passing a callback function along with the buffer size and log name
   needed for firmware/hardware log collection.

2. The vmcore module allocates a buffer of the requested size. It adds
   an ELF note and invokes the device driver's registered callback
   function.

3. The device driver collects all hardware/firmware logs into the
   buffer and returns control to the vmcore module.
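
The three steps above can be sketched as a small userspace mock. This
is only an illustration of the control flow, not the kernel code from
this series: the struct layouts, the names dd_request, dd_note,
mock_add_device_dump() and demo_collect(), the DD_NAME_MAX limit and
the -1 error return are all assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DD_NAME_MAX 44 /* hypothetical cap on the log name length */

/* Hypothetical registration request: log name, buffer size, callback. */
struct dd_request {
    char name[DD_NAME_MAX];
    size_t size;
    int (*collect)(struct dd_request *req, void *buf);
};

/* Hypothetical stand-in for the note the vmcore module would keep. */
struct dd_note {
    char name[DD_NAME_MAX];
    size_t size;
    unsigned char *buf;
};

/*
 * Mock of step 2: validate the request, allocate the buffer, invoke the
 * driver's callback, and record the result as a "note". Returns 0 on
 * success and a nonzero value on failure.
 */
int mock_add_device_dump(struct dd_request *req, struct dd_note *note)
{
    unsigned char *buf;
    int ret;

    if (!req || !note || !req->collect || !req->size || !req->name[0])
        return -1;

    buf = calloc(1, req->size);
    if (!buf)
        return -1;

    ret = req->collect(req, buf); /* step 3 runs inside the driver */
    if (ret) {
        free(buf);
        return ret;
    }

    snprintf(note->name, sizeof(note->name), "%s", req->name);
    note->size = req->size;
    note->buf = buf;
    return 0;
}

/*
 * Mock of step 3: a driver callback that fills the buffer with a fixed
 * pattern standing in for register contents read from hardware.
 */
int demo_collect(struct dd_request *req, void *buf)
{
    assert(buf != NULL);
    memset(buf, 0xA5, req->size);
    return 0;
}
```

In this sketch, a driver's probe path (step 1) would fill in a
dd_request with its log name, buffer size and callback and pass it to
mock_add_device_dump(). On success the note owns the buffer and the
caller is responsible for freeing it.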

The device-specific hardware/firmware logs can be seen as ELF notes:

# readelf -n /proc/vmcore

Displaying notes found at file offset 0x00001000 with length 0x04003288:
  Owner                        Data size   Description
  VMCOREDD_cxgb4_0000:02:00.4  0x02000fd8  Unknown note type: (0x00000700)
  VMCOREDD_cxgb4_0000:04:00.4  0x02000fd8  Unknown note type: (0x00000700)
  CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
  CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
  CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
  CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
  CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
  CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
  CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
  CORE                         0x00000150  NT_PRSTATUS (prstatus structure)
  VMCOREINFO                   0x0000074f  Unknown note type: (0x00000000)

Patch 1 adds an API to the vmcore module that allows drivers to
register a callback to collect device-specific hardware/firmware logs.
The logs are added to /proc/vmcore as ELF notes.

Patch 2 updates the read and mmap logic to append device-specific
hardware/firmware logs as ELF notes.

Patch 3 shows a cxgb4 driver example that uses the API to collect
hardware/firmware logs in the crash recovery kernel, before hardware
is initialized.

Thanks,
Rahul

---
v5:
- Removed enabling CONFIG_PROC_VMCORE_DEVICE_DUMP by default and
  updated help message.

v4:
- Made __vmcore_add_device_dump() static.
- Moved compile check to define vmcore_add_device_dump() to
  crash_dump.h to fix compilation when vmcore.c is not compiled in.
- Converted ---help--- to help in Kconfig as indicated by checkpatch.
- Rebased to tip.

v3:
- Dropped sysfs crashdd module.
- Exported dumps as elf notes. Suggested by Eric Biederman.
  Added as patch 2 in this version.
- Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device
  dump support.
- Moved logic related to adding dumps from crashdd to vmcore module.
- Renamed all crashdd* to vmcoredd*.
- Updated comments.

v2:
- Added ABI Documentation for crashdd.
- Directly use octal permission instead of macro.

Changes since rfc v2:
- Moved exporting crashdd from procfs to sysfs. Suggested by
  Stephen Hemminger.
- Moved code from fs/proc/crashdd.c to fs/crashdd/ directory.
- Replaced all proc API with sysfs API and updated comments.
- Calling driver callback before creating the binary file under
  crashdd sysfs.
- Changed binary dump file permission from S_IRUSR to S_IRUGO.
- Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP.

rfc v2:
- Collecting logs in 2nd kernel instead of during kernel panic.
  Suggested by Eric Biederman.
- Added new crashdd module that exports /proc/crashdd/ containing
  driver's registered hardware/firmware logs in patch 1.
- Replaced the API to allow drivers to register their
  hardware/firmware log collect routine in crash recovery kernel in
  patch 1.
- Updated patch 2 to use the new API in patch 1.

Rahul Lakkireddy (3):
  vmcore: add API to collect hardware dump in second kernel
  vmcore: append device dumps to vmcore as elf notes
  cxgb4: collect hardware dump in second kernel

 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h       |   4 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c |  25 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h |   3 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  |  10 +
 fs/proc/Kconfig                                  |  15 +
 fs/proc/vmcore.c                                 | 399 ++++++++++++++++++++++-
 include/linux/crash_core.h                       |   4 +
 include/linux/crash_dump.h                       |  17 +
 include/linux/kcore.h                            |   6 +
 include/uapi/linux/elf.h                         |   1 +
 10 files changed, 471 insertions(+), 13 deletions(-)

--
2.14.1