Date: Wed, 12 May 2021 23:05:38 +0200
From: Borislav Petkov
To: "Joshi, Mukul"
Cc: "amd-gfx@lists.freedesktop.org", "Kasiviswanathan, Harish", x86-ml, lkml
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
References: <20210512013058.6827-1-mukul.joshi@amd.com>

On Wed, May 12, 2021 at 07:00:58PM +0000, Joshi, Mukul wrote:
> SMCA UMCv2 corresponds to GPU's UMC MCA bank and the GPU driver is
> only interested in errors on GPU UMC.

So that thing should be called SMCA_GPU_UMC not SMCA_UMC_V2.

> We cannot know this without is_smca_umc_v2.

You don't need it - just export smca_get_bank_type() and test the bank
type at the call site.

> Maybe. I hope its not too much of a concern if it stays the way it is.

That was just a suggestion anyway - it is not code I maintain so not my
call.

> I wasn't really sure if I should use the EDAC priority here or create a new one for Accelerator devices.
> I thought using EDAC priority might not be accepted by the maintainers as EDAC and GPU (Accelerator) devices
> are two different class of devices.
> That is the reason I create a new one.
> I am OK to use EDAC priority if that is acceptable.

I don't know what's acceptable because I still am unclear as to what
that thing is supposed to do.

It seems you are interested only in uncorrectable errors. How are those
errors reported? #MC exception, deferred interrupt, simply logged in the
bank and we find them by polling?

Then, the commit message is talking about some "bad page retirement".
What does that do?

What can the user do when she sees the "Uncorrectable error detected in
UMC..." message?

It depends on what "retiring" of GPU pages means...

In any case, dmesg should issue a human-understandable message about the
recovery action being done and what that means for the user: should she
replace the GPU, should she ignore, etc, etc.

> A system can have multiple GPUs and we only want a single notifier
> registered. I will change the comment to explicitly state this.

Actually, the notifier registration should be able to return a different
retval to state that a callback has already been registered but that
warns only currently so I'm guessing we're stuck with such ugly
"workarounds" for their shortcomings.

I'm gonna take a look whether they can be fixed though so that you don't
have to do this notifier_registered thing.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
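
To make the two suggestions above concrete, here is a rough userspace
sketch, not kernel code. It assumes a toy priority-ordered chain whose
registration returns -EEXIST for a callback that is already on the chain,
and a handler that checks the bank type itself instead of relying on a
dedicated is_smca_umc_v2-style helper. Every toy_* name, the bank type
enum and the error struct are made up for illustration; the real chain
and smca_get_bank_type() may look and behave differently.

```c
#include <errno.h>
#include <stddef.h>

/* Toy userspace model. None of the toy_* names are kernel APIs. */
enum toy_bank_type { TOY_BANK_OTHER, TOY_BANK_GPU_UMC };

struct toy_err {
	enum toy_bank_type type;	/* what a bank type lookup would return */
	int uncorrectable;
};

struct toy_notifier {
	int (*call)(struct toy_notifier *nb, const struct toy_err *err);
	int priority;			/* higher priority runs first */
	struct toy_notifier *next;
};

static struct toy_notifier *toy_chain;
static int toy_gpu_hits;

/* Insert nb in descending priority order; report duplicates to the caller. */
static int toy_chain_register(struct toy_notifier **head,
			      struct toy_notifier *nb)
{
	struct toy_notifier **pos = head;
	struct toy_notifier **slot = NULL;

	/* Walk the whole chain so a duplicate is found wherever it sits. */
	for (; *pos; pos = &(*pos)->next) {
		if (*pos == nb)
			return -EEXIST;
		if (!slot && nb->priority > (*pos)->priority)
			slot = pos;
	}
	if (!slot)
		slot = pos;		/* lowest priority so far: append */

	nb->next = *slot;
	*slot = nb;
	return 0;
}

/* Run every callback on the chain for one error record. */
static void toy_chain_call(struct toy_notifier *head,
			   const struct toy_err *err)
{
	for (; head; head = head->next)
		head->call(head, err);
}

/* The handler tests the bank type at the call site and ignores the rest. */
static int toy_gpu_handler(struct toy_notifier *nb, const struct toy_err *err)
{
	(void)nb;
	if (err->type != TOY_BANK_GPU_UMC || !err->uncorrectable)
		return 0;
	toy_gpu_hits++;
	return 1;
}

static struct toy_notifier toy_gpu_nb = {
	.call = toy_gpu_handler,
	.priority = 1,
};

/* Called once per GPU; only the first call actually adds the callback. */
static int toy_gpu_init(void)
{
	int ret = toy_chain_register(&toy_chain, &toy_gpu_nb);

	return ret == -EEXIST ? 0 : ret;
}
```

With a retval like that, a driver probing several GPUs could call its
init path for each device and treat "already registered" as success, so
a separate notifier_registered flag would not be needed in this model.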