Received: by 2002:a25:4158:0:0:0:0:0 with SMTP id o85csp548288yba; Sat, 4 May 2019 07:48:31 -0700 (PDT) X-Google-Smtp-Source: APXvYqyhGrEPPmeZWWxUWpz6KBQqYvhX8QvutyDG0llJeh3GVEDCeJCpRcOe13wHo4oCJqKNEYtk X-Received: by 2002:a65:528b:: with SMTP id y11mr7983128pgp.341.1556981311453; Sat, 04 May 2019 07:48:31 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1556981311; cv=none; d=google.com; s=arc-20160816; b=SDwQvquZ4snojehRnx6sPRf0/SlaFvqMZn+MpchilYqFJTtnOlZVs7eljqytOvtTtf Jzg8rRjhU6R/X1yvpK2K/SMEdtgHdYAKCo7nmT6mmmUq6v4R+WpXzlgCSKBdbwa7VH4c 9/H8W1zmMg7kUift1ipy0QQVb3x/bRBj9CNF2PtSRLIJZX+irQgXJ52F/GGIi5vOc+mD aGNCAtZWpK0hDEwqH3rWpqXEeflsovhez9dphI9Rg4IOIBLjvt5uRxu0evdS2dau9skP OxJAOf3Pu3iRfs8rrCkKHusptV20++NdpAi9FxrKeiUbppG5pUQ3VM9M4XvSaKRg+ZXY wt0g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=3/UhKsCGkW4s6b3CKLSpWg03k2UImtt0PnOB2N02e3E=; b=JHsjNJJTg+Gb8XZZSFlnHjd+KSW1cfBqsNI7yKLPRmC5p8uvIIKgTKq1NqNgezFtKC 99psBo03OnoVim1TtV/1AAUJzSSfYWaozy1q6MyNnDsfPiorKX7RkkbdDlODEp2/8X3V pltMPD37OY9Q+BDIIAwdyI8AxPrrtqMrHxEh3exeI86ZOcVOqX/cj3vLjznyfkswjxBf WkARLLVduq3noVI+B043CNkGMnT8P/blfnfr3bvZlMD8A9awGMCWppm4xhB2WJAGqqVu soo1dSBwLSph09MznYVqd1drr5s79nB5i8SvcHYcoGGfti2dU3bdxUp5qR+aMAxAii7k vIFQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=rVTjYVBl; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id r5si6797901pgp.29.2019.05.04.07.48.15; Sat, 04 May 2019 07:48:31 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=rVTjYVBl; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727176AbfEDOg4 (ORCPT + 99 others); Sat, 4 May 2019 10:36:56 -0400 Received: from mail-pg1-f196.google.com ([209.85.215.196]:39583 "EHLO mail-pg1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726070AbfEDOgz (ORCPT ); Sat, 4 May 2019 10:36:55 -0400 Received: by mail-pg1-f196.google.com with SMTP id w22so2865827pgi.6 for ; Sat, 04 May 2019 07:36:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=3/UhKsCGkW4s6b3CKLSpWg03k2UImtt0PnOB2N02e3E=; b=rVTjYVBl38TZ1yBXffS6DjJV4YNwnABswrZ+TCLs6fD5UhADPj7DgToHW0h/bXLgqQ kfcrY4jK1VlXfRQaGiHJhoMq+5vgoq6yba1hnjV/OfPeVpcxei6RT/lBQvgON/KaYTs2 hL8KXipjeKNmuLjlacRIvlGCdVABXceYeVfkzdmtxXtQE8cuoTwnChoWX/iIgCGz7xUD j1HV9RevkOT+BKVwvqBOh3znIvMThd0uHxVy9+lc04uxvbEHmkW8WbAPluPycVNFkMye 75XOIbTePGnaDHmO9SPEIOOuyDOWutWnUvuJ4HxRq10qa9mQuYyaNThjRCaM3CdSsSiF pXkg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=3/UhKsCGkW4s6b3CKLSpWg03k2UImtt0PnOB2N02e3E=; b=CRWJbGm89C20N4sk2wb38nFINbPUyyBzlVpR3nKEhTWOkz+lQ0bi6Zy8rxthemMOQY RzXvzTAA1gD/77YMohSgqQF3dVMp1I9Ge0JrEk+3eazz9aNOOQHDqj1Vfbq6KSvEKZ/t CdJcoRkeFrbhTMpR8AKihsE4u/KDAnlRMKuGA/BdsM0V+WJR8vgBVFr5V3FLr5BYp5xq Y38xi/Pf9ETlqnHP/rF44+S1uGsMbM/gR1dttJOmSinw/ld8jPrTT4yV7oIO7ERoT3/d 5hk1+1qMszdvyTx9nAqMA37F491KYKl0uTPTRyE4ltXCsPp0ZSNDdTEhuODv7pG9HWtc 0sDw== X-Gm-Message-State: APjAAAVT3LQBDo+4RQn4t6s3F5vs8/EdPuFKRN6jYWWkysHmEkJXlGFA I0n1SIfBtgKHyhbAdCcML5lySwVRzMLXaREx+N8= X-Received: by 2002:aa7:9214:: with SMTP id 20mr19486575pfo.202.1556980615110; Sat, 04 May 2019 07:36:55 -0700 (PDT) MIME-Version: 1.0 References: <1556787561-5113-1-git-send-email-akinobu.mita@gmail.com> <20190502125722.GA28470@localhost.localdomain> <20190503121232.GB30013@localhost.localdomain> <20190503122035.GA21501@lst.de> <61bf6f0b-4087-cfb3-1ae6-539f18b5b6ea@gmail.com> In-Reply-To: <61bf6f0b-4087-cfb3-1ae6-539f18b5b6ea@gmail.com> From: Akinobu Mita Date: Sat, 4 May 2019 23:36:44 +0900 Message-ID: Subject: Re: [PATCH 0/4] nvme-pci: support device coredump To: Minwoo Im Cc: Christoph Hellwig , Jens Axboe , Sagi Grimberg , LKML , linux-nvme@lists.infradead.org, Keith Busch , Keith Busch , Johannes Berg Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org 2019=E5=B9=B45=E6=9C=884=E6=97=A5(=E5=9C=9F) 18:40 Minwoo Im : > > Hi Akinobu, > > On 5/4/19 1:20 PM, Akinobu Mita wrote: > > 2019=E5=B9=B45=E6=9C=883=E6=97=A5(=E9=87=91) 21:20 Christoph Hellwig : > >> > >> On Fri, May 03, 2019 at 06:12:32AM -0600, Keith Busch wrote: > >>> Could you actually explain how the rest is useful? I personally have > >>> never encountered an issue where knowing these values would have help= ed: > >>> every device timeout always needed device specific internal firmware > >>> logs in my experience. > > > > I agree that the device specific internal logs like telemetry are the m= ost > > useful. The memory dump of command queues and completion queues is not > > that powerful but helps to know what commands have been submitted befor= e > > the controller goes wrong (IOW, it's sometimes not enough to know > > which commands are actually failed), and it can be parsed without vendo= r > > specific knowledge. > > I'm not pretty sure I can say that memory dump of queues are useless at a= ll. > > As you mentioned, sometimes it's not enough to know which command has > actually been failed because we might want to know what happened before a= nd > after the actual failure. > > But, the information of commands handled from device inside would be much > more useful to figure out what happened because in case of multiple queue= s, > the arbitration among them could not be represented by this memory dump. Correct. > > If the issue is reproducible, the nvme trace is the most powerful for t= his > > kind of information. The memory dump of the queues is not that powerfu= l, > > but it can always be enabled by default. > > If the memory dump is a key to reproduce some issues, then it will be > powerful > to hand it to a vendor to solve it. But I'm afraid of it because the > dump might > not be able to give relative submitted times among the commands in queues= . I agree that only the memory dump of queues don't help much to reproduce issues. However when analyzing the customer-side issues, we would like to know whether unusual commands have been issued before crash, especially on admin queue. > >> Yes. Also not that NVMe now has the 'device initiated telemetry' > >> feauture, which is just a wired name for device coredump. Wiring that > >> up so that we can easily provide that data to the device vendor would > >> actually be pretty useful. > > > > This version of nvme coredump captures controller registers and each qu= eue. > > So before resetting controller is a suitable time to capture these. > > If we'll capture other log pages in this mechanism, the coredump proced= ure > > will be splitted into two phases (before resetting controller and after > > resetting as soon as admin queue is available). > > I agree with that it would be nice if we have a information that might no= t > be that powerful rather than nothing. > > But, could we request controller-initiated telemetry log page if > supported by > the controller to get the internal information at the point of failure > like reset? > If the dump is generated with the telemetry log page, I think it would > be great > to be a clue to solve the issue. OK. Let me try it in the next version.