Subject: Re: [PATCH RFC 0/4] ACPI: APEI: Add support to notify the vendor specific HW errors
To: Shiju Jose
Cc: "linux-acpi@vger.kernel.org", "linux-edac@vger.kernel.org", "linux-kernel@vger.kernel.org", "rjw@rjwysocki.net", "lenb@kernel.org", "tony.luck@intel.com", "bp@alien8.de", "baicar@os.amperecomputing.com", Linuxarm, Jonathan Cameron, tanxiaofei
References: <20190812101149.26036-1-shiju.jose@huawei.com> <72f44e4d-a20b-df1c-ddfe-55219e0ed429@arm.com> <86258A5CC0A3704780874CF6004BA8A6584C6BA0@lhreml523-mbx.china.huawei.com>
From: James Morse
Date: Thu, 3 Oct 2019 18:21:55 +0100
In-Reply-To:
 <86258A5CC0A3704780874CF6004BA8A6584C6BA0@lhreml523-mbx.china.huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Shiju,

On 22/08/2019 17:56, Shiju Jose wrote:
> James Morse wrote:
>> On 12/08/2019 11:11, Shiju Jose wrote:
>>> Presently kernel does not support reporting the vendor specific HW
>>> errors, in the non-standard format, to the vendor drivers for the recovery.
>>
>> 'non standard' here is probably a little jarring to the casual reader. You're
>> referring to the UEFI spec's "N.2.3 Non-standard Section Body", which refers to
>> any section type published somewhere other than the UEFI spec.

>>> This patch set add this support and also move the existing handler
>>> functions for the standard errors to the new callback method.
>>
>> Could you give an example of where this would be useful? You're adding an API
>> with no caller to justify its existence.

> One such example is handling the local errors occurred in a device controller, such as PCIe.

Could we have the example in the form of patches? (sorry, I wasn't clear)

I don't think it's realistic that a PCIe device driver would want to know about errors on
other devices in the system. (SAS-HBA meet the GPU).
PCIe has AER for handling errors that (may have) occurred on a PCIe link, and this has its
own CPER records.


>> GUIDs should only belong to one driver.

> UEFI spec's N.2.3 Non-standard Section Body mentioned, "The type (e.g. format) of a
> non-standard section is identified by the GUID populated in the Section Descriptor's
> Section Type field."
> There is a possibility to define common non-standard error section format

I agree the GUID describes the format of the error record,

> which will
> be used for more than one driver if the error data to be reported is in the same format.
> Then can the same GUID belong to multiple drivers?
... but here we disagree.

CPER has a component/block-diagram view of the system. It describes a Memory error or an
error with a PCIe endpoint. An error record affects one component.

If you wanted to describe an error caused by a failed transaction between a PCIe device and
memory, you would need two of these records, and it's guesswork as to what happened between
them. But the PCIe device has no business poking around in the memory error. Even if it
did, APEI would be the wrong place to do this as it's not the only caller of memory_failure().


>>> Also the CCIX RAS patches could be move to the proposed callback method.
>>
>> Presumably for any vendor-specific stuff?

> This information was related to the proposal to replace the number of if(guid_equal(...)) else
> if(guid_equal(...)) checks in the ghes_do_proc() for the existing UEFI spec defined error
> sections(such as PCIe, Memory, ARM HW error)

'the standard ones'

> by registering the corresponding handler functions to the proposed notification method.

I really don't like this. Registering a handler for 'memory corruption' would require
walking a list of dynamically allocated pointers. Can there be more than one entry? Can
random drivers block memory_failure() while they allocate more memory to send packets over
USB? What if it loops?

For the standard error sources the kernel needs to run 'the' handler as quickly as possible,
with a minimum of code/memory-access in the meantime. It already takes too long.


Thanks,

James

> The same apply to the CCIX error sections and any other
> error sections defined by the UEFI spec in the future.
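
A rough userspace C sketch of the split argued for in the reply above: the standard
section types keep a fixed if/else chain with direct calls, and only non-standard GUIDs
go through a registration table that accepts one owner per GUID. Every name here
(`guid16`, `vendor_register`, `dispatch`, the `SEC_*` values, the handlers) is
hypothetical, the GUID bytes are placeholders rather than real CPER GUIDs, and none of
this is the kernel's actual ghes code. It only illustrates the shape of the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* 16-byte section-type identifier, a stand-in for the kernel's guid_t. */
typedef struct { unsigned char b[16]; } guid16;

/* Placeholder values for illustration only; these are NOT real CPER GUIDs. */
const guid16 SEC_MEM         = {{ 0x01 }};
const guid16 SEC_PCIE        = {{ 0x02 }};
const guid16 SEC_VENDOR_DEMO = {{ 0x10 }};

typedef int (*sec_handler)(const void *payload);

static int guid16_equal(const guid16 *a, const guid16 *b)
{
    return memcmp(a->b, b->b, sizeof(a->b)) == 0;
}

/* Stand-ins for the built-in handlers of the standard section types. */
static int handle_mem(const void *payload)  { (void)payload; return 1; }
static int handle_pcie(const void *payload) { (void)payload; return 2; }

/* Example vendor handler, used only to exercise the table. */
int handle_vendor_demo(const void *payload) { (void)payload; return 42; }

/* Fixed-size table for non-standard sections: no allocation, one owner per GUID. */
#define MAX_VENDOR 4
static struct { guid16 type; sec_handler fn; } vendor_tbl[MAX_VENDOR];
static size_t vendor_cnt;

/* Returns 0 on success, -1 if the GUID is a standard one, already owned,
 * or the table is full. */
int vendor_register(const guid16 *type, sec_handler fn)
{
    assert(type != NULL && fn != NULL);
    if (guid16_equal(type, &SEC_MEM) || guid16_equal(type, &SEC_PCIE))
        return -1;
    for (size_t i = 0; i < vendor_cnt; i++)
        if (guid16_equal(&vendor_tbl[i].type, type))
            return -1;
    if (vendor_cnt == MAX_VENDOR)
        return -1;
    vendor_tbl[vendor_cnt].type = *type;
    vendor_tbl[vendor_cnt].fn = fn;
    vendor_cnt++;
    return 0;
}

/* Returns the handler's result, or 0 if nobody claims the section. */
int dispatch(const guid16 *type, const void *payload)
{
    /* Standard sections: direct calls, no table walk. */
    if (guid16_equal(type, &SEC_MEM))
        return handle_mem(payload);
    if (guid16_equal(type, &SEC_PCIE))
        return handle_pcie(payload);

    /* Only non-standard sections reach the vendor table. */
    for (size_t i = 0; i < vendor_cnt; i++)
        if (guid16_equal(&vendor_tbl[i].type, type))
            return vendor_tbl[i].fn(payload);
    return 0;
}
```

In this sketch a second `vendor_register()` call for `SEC_VENDOR_DEMO`, or any attempt
to claim `SEC_MEM`, is refused, so the memory-error path never depends on what drivers
have registered.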