Received: by 2002:a05:6a10:16a7:0:0:0:0 with SMTP id gp39csp879288pxb; Thu, 5 Nov 2020 15:54:15 -0800 (PST) X-Google-Smtp-Source: ABdhPJxMOGBvDMac+WoXRoeJgNPACnOHEzSe5Qzwh4tItpSbC1hKd+1Ot43K7Suz5sVJsmkRSAPR X-Received: by 2002:a17:906:ca83:: with SMTP id js3mr5068483ejb.42.1604620455477; Thu, 05 Nov 2020 15:54:15 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1604620455; cv=none; d=google.com; s=arc-20160816; b=udgwgF8uKCRJkaz486BEWv29mphI0HP2uhG7Mr23V0ckvaYKiHTsXAoFAmowGHE0+U mNo4Kzqmxrnoo6t6obDHOTsoz70DTS6Rmp65ZOv29dzYEFUDmr790LMqEDGeINsHyGy9 8iRiq4GM68+wqCnI+1HHpJDTFXDH8ZHaqrM5YoGwhEjw8/w4+75PTn0iuPB2cHCnt0+c 4qPOcAExYx7V0s3G2KIuKGaGMk6a7hkDk5zk418Ay4DqQm4Nb8hCZIOXZHyu45/f6JTp xiE53q6a6LcbSu0UKtS7JgH3GBnfMFXB51En4ZVTjg19vGgvVJ5njDkjiQ+Whrkco0Bx M7FA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :user-agent:references:in-reply-to:date:cc:to:from:subject :message-id:dkim-signature; bh=g+pv080mfaDWNC3bI7wgy94GEdIlxkv9mLywplA5Wjw=; b=uvYfO2ddEYot+AuuOMMSo2aSm0doRbXUJohynKudvBDBPGxSIEcVGNXJje3AFT43ti Wee6jPLi9g6H9PAcGwAudVGiEsQBmX8xyLeQzEbLmjStp5PfEt/0i9vpskeQXh7kuNA2 MyeiZDeuHW1r14dkJXV0GphEZ4YXueWjq4omqWiMgKyv3OHb8d/2+Bd5P8zrY5WIcBku mD69I1zwY13AN9XHlB286uwKN6Tll3Re+Hutp2QCxoj6Bsire3ZgryGmA7HEVNHjnsOy mY2pOYilwEtrnRcgIkH9l3xTlLajCs5uNUxek3DZc1Bwj7TZoNImHtoHjJ0Wge7UsxQ1 0/6g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=FVkYF3p4; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id q1si2412846ejf.509.2020.11.05.15.53.52; Thu, 05 Nov 2020 15:54:15 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=FVkYF3p4; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732414AbgKEXwg (ORCPT + 99 others); Thu, 5 Nov 2020 18:52:36 -0500 Received: from mail.kernel.org ([198.145.29.99]:49724 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730895AbgKEXwf (ORCPT ); Thu, 5 Nov 2020 18:52:35 -0500 Received: from lt-jalone-7480.mtl.com (c-24-6-56-119.hsd1.ca.comcast.net [24.6.56.119]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 0512C20739; Thu, 5 Nov 2020 23:52:33 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1604620354; bh=zVN6qcOZyRn/6G2pAEK9dZgGcVU1Xsvgk0SwNKZ3JdY=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=FVkYF3p4Xaw853vnm3cExPWB3x42IussfibeA4j5AYEdPrp4sBagcrNoYbmTbFo7T agQ3iSIHEoCtos56qI27q2D3bwQFdF1ZnOvqXj7bAzN2B+qQArefknhR0Ms0nRffdx AvI7Hlvxk/nJtr+iy23HbHRZy093iRsPdorf9UJ4= Message-ID: <1b96abb1da9bca4d9f962babad9a0724c1188437.camel@kernel.org> Subject: Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink health reporters for NIX From: Saeed Mahameed To: Jakub Kicinski Cc: George Cherian , "netdev@vger.kernel.org" , "linux-kernel@vger.kernel.org" , Jiri Pirko , "davem@davemloft.net" , Sunil Kovvuri Goutham , Linu Cherian , Geethasowjanya Akula , "masahiroy@kernel.org" , "willemdebruijn.kernel@gmail.com" Date: Thu, 05 Nov 2020 15:52:32 -0800 In-Reply-To: <20201105124204.4dbea042@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com> References: <20201105090724.761a033d@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com> <011c4d4e2227df793f615b7638165c266763e24a.camel@kernel.org> <20201105124204.4dbea042@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com> Content-Type: text/plain; charset="UTF-8" User-Agent: Evolution 3.36.5 (3.36.5-1.fc32) MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 2020-11-05 at 12:42 -0800, Jakub Kicinski wrote: > On Thu, 05 Nov 2020 11:23:54 -0800 Saeed Mahameed wrote: > > If you report an error without recovering, devlink health will > > report a > > bad device state > > > > $ ./devlink health > > pci/0002:01:00.0: > > reporter npa > > state error error 1 recover 0 > > Actually, the counter in the driver is unnecessary, right? Devlink > counts errors. > if you mean error and recover counters, then yes. they are managed by devlink health every call to dl-health-report will do: devlink_health_report(reporter, err_ctx, msg) { reproter.error++; devlink_trigger_event(reporter, msg); reporter.dump(err_ctx, msg); reporter.diag(err_ctx); if (!reporter.recover(err_ctx)) reporter.recover++; } so dl-health reports without a recover op will confuse the user if user sees error count > recover count. error count should only be grater than recover count when recover procedure fails which now will indicate the device is not in a healthy state. also i want to clarify one small note about devlink dump. devlink health dump semantics: on devlink health dump, the devlink health will check if previous dump exists and will just return it without actually calling the driver, if not then it will call the driver to perform a new dump and will cache it. user has to explicitly clear the devlink health dump of that reporter in order to allow for newer dump to get generated. this is done this way because we want the driver to store the dump of the previously reported errors at the moment the erorrs are reported by driver, so when a user issue a dump command the dump of the previous error will be reported to user form memory without the need to access driver/hw who might be in a bad state. so this is why using devlink dump for reporting counters doesn't really work, it will only report the first time the counters are accessed via devlink health dump, after that it will report the same cached values over and over until the user clears it up. > > So you will need to implement an empty recover op. > > so if these events are informational only and they don't indicate > > device health issues, why would you report them via devlink health > > ? > > I see devlink health reporters a way of collecting errors reports > which > for the most part are just shared with the vendor. IOW firmware (or > hardware) bugs. > > Obviously as you say without recover and additional context in the > report the value is quite diminished. But _if_ these are indeed > "report > me to the vendor" kind of events then at least they should use our > current mechanics for such reports - which is dl-health. > > Without knowing what these events are it's quite hard to tell if > devlink health is an overkill or counter is sufficient. > > Either way - printing these to the logs is definitely the worst > choice > :) Sure, I don't mind using devlink health for dump only, I don't really have strong feelings against this, they can always extend it in the future. it just doesn't make sense to me to have it mainly used for dumping counters and without using devlik helath utilities, like events, reports and recover. so maybe Sunil et al. could polish this patchset and provide more devlink health support, like diagnose for these errors, dump HW information and contexts related to these errors so they could debug root causes, etc .. Then the use for dl health in this series can be truly justified.