Date: Wed, 18 Apr 2018 20:37:07 +0530
From: Rahul Lakkireddy
To: "Eric W. Biederman"
Cc: Dave Young, netdev@vger.kernel.org, kexec@lists.infradead.org,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
    Indranil Choudhury, Nirranjan Kirubaharan, stephen@networkplumber.org,
    Ganesh GR, akpm@linux-foundation.org, torvalds@linux-foundation.org,
    davem@davemloft.net, viro@zeniv.linux.org.uk
Subject: Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
Message-ID: <20180418150707.GA27638@chelsio.com>
References: <20180418061546.GA4551@dhcp-128-65.nay.redhat.com> <20180418123114.GA19159@chelsio.com> <871sfcy4ge.fsf@xmission.com>
In-Reply-To: <871sfcy4ge.fsf@xmission.com>

On Wednesday, April 18, 2018 at 19:58:01 +0530, Eric W. Biederman wrote:
> Rahul Lakkireddy writes:
>
> > On Wednesday, April 18, 2018 at 11:45:46 +0530, Dave Young wrote:
> >> Hi Rahul,
> >> On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> >> > On production servers running a variety of workloads over time, a
> >> > kernel panic can happen sporadically after days or even months. It
> >> > is important to collect as many debug logs as possible to
> >> > root-cause and fix a problem that may not be easy to reproduce. A
> >> > snapshot of the underlying hardware/firmware state (register dump,
> >> > firmware logs, adapter memory, etc.) at the time of the kernel
> >> > panic is very helpful while debugging the culprit device driver.
> >> >
> >> > This series of patches adds a new generic framework that enables
> >> > device drivers to collect a device-specific snapshot of the
> >> > hardware/firmware state of the underlying device in the crash
> >> > recovery kernel. In the crash recovery kernel, the collected logs
> >> > are added as ELF notes to /proc/vmcore, which is copied by user
> >> > space scripts for post-analysis.
> >> >
> >> > The sequence of actions device drivers take to append their
> >> > device-specific hardware/firmware logs to /proc/vmcore is as
> >> > follows:
> >> >
> >> > 1. During probe (before the hardware is initialized), device
> >> >    drivers register with the vmcore module (via
> >> >    vmcore_add_device_dump()), providing a callback function, along
> >> >    with the buffer size and log name needed for firmware/hardware
> >> >    log collection.
> >>
> >> I assumed the ELF note info should be prepared during the
> >> kexec_[file_]load phase. But I did not read the old comments, so I am
> >> not sure whether this has been discussed before.
> >>
> >
> > We must not collect dumps in the crashing kernel. Adding more work to
> > the crash dump path risks not collecting the vmcore at all. Eric
> > discussed this in more detail at:
> >
> > https://lkml.org/lkml/2018/3/24/319
> >
> > We are safe to collect dumps in the second kernel. Each device dump
> > will be exported as an ELF note in /proc/vmcore.
>
> It just occurred to me there is one variation that is worth
> considering.
>
> Is the area you are looking at dumping part of a huge MMIO area?
> I think someone said 2GB?
>
> If that is the case, it could be worth it to simply add the needed
> addresses to the range of memory we need to dump, and add an ELF note
> saying that is what happened.
>

We are _not_ dumping an MMIO area. However, one part of the dump
collection involves reading 2 GB of on-chip memory via PIO access,
which is then compressed and stored.

Thanks,
Rahul
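
For illustration, here is a minimal sketch of the registration flow
described in the cover-letter summary above. The struct and function
names (struct vmcoredd_data, vmcore_add_device_dump(),
is_kdump_kernel()) follow the interface as it was eventually merged
into mainline under CONFIG_PROC_VMCORE_DEVICE_DUMP; everything
prefixed with mydrv_ is hypothetical, and the v4 series under
discussion may differ in detail.

    #include <linux/crash_dump.h>
    #include <linux/string.h>

    #define MYDRV_DUMP_SIZE (2 * 1024 * 1024)	/* illustrative buffer size */

    /*
     * Invoked by the vmcore module in the crash recovery (second)
     * kernel while building the device-dump ELF note. The driver fills
     * buf with up to data->size bytes of hardware/firmware state.
     */
    static int mydrv_vmcoredd_collect(struct vmcoredd_data *data, void *buf)
    {
    	return mydrv_collect_hw_logs(buf, data->size);	/* hypothetical */
    }

    static struct vmcoredd_data mydrv_vmcoredd = {
    	.size = MYDRV_DUMP_SIZE,
    	.vmcoredd_callback = mydrv_vmcoredd_collect,
    };

    /* Called from probe; a no-op unless running in the kdump kernel. */
    static int mydrv_register_dump(void)
    {
    	if (!is_kdump_kernel())
    		return 0;

    	strscpy(mydrv_vmcoredd.dump_name, "mydrv_dump",
    		sizeof(mydrv_vmcoredd.dump_name));

    	/* On success, the log appears as an ELF note in /proc/vmcore. */
    	return vmcore_add_device_dump(&mydrv_vmcoredd);
    }

Registering and collecting only in the second kernel keeps the crash
path itself untouched, which is the point made in the thread above:
nothing extra runs while the kernel is panicking.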