Received: by 2002:ac0:a594:0:0:0:0:0 with SMTP id m20-v6csp1668030imm; Tue, 22 May 2018 07:34:37 -0700 (PDT) X-Google-Smtp-Source: AB8JxZrbp/HcUgshV0DURNBnNGKPSzl2eRA2dqb5jPYzith1yram2b2q3bpJZSL8/9LGKuN5o/F0 X-Received: by 2002:a17:902:a60d:: with SMTP id u13-v6mr25250172plq.40.1526999677292; Tue, 22 May 2018 07:34:37 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1526999677; cv=none; d=google.com; s=arc-20160816; b=SF6Kx3maR6z6oO7HBekRecrV3AOc4UjeFvE4afjeLylbKwea0+8w7xU5CS4QzD5dmh lM3YHyY95zs/0PWslzVvzhZptjNTCU49D4Lf3Z63qOxVzDiMH93OhZz0xsLo6dLN5I9S IF2NGuwrQi43XdotBQLqbNtdxTOvuSea99Xrw0agv0+P1CC2N5FQIM/Ep32UsoqzcYXT QmO1N0j0CgDP3llRUaAwr5AuSsHT4PQUK0yD6ex7u7Gr7XOo/f37s2W4V7z5b/jOvg06 2g5CsXfpMMXFoAMvSF1Vyxtp9uvykO3H0AURdnd9K+yNyNsKvc0wOAE+ABoRw2EQft74 mezw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject:dkim-signature :arc-authentication-results; bh=obOwu4fsrZygWhB7iGymjW4gwnZpf3E+Br/uOTsc0AA=; b=Q3EC4zSAI/0Fr3KcT0s4tKoqYbEtP+SfVjfb127Obc2AkZHOQT6xZcKODb7Peh1cMC xoEtH9IVynV4YfU6i8N7Xy9Uob0D7m7pClNftXTXelvpiOwj31YTENWiIgUpEFmi6QXO mBoXRTfB6HskKfPtSLpz0juBKkTjd4GELNWXH9HnFH219+zuR0MEJpp/e7382meq7+dC /MfJcjIuK05Dyutir0+C36Igaft3q5jR9Y8LvGPn8VqO2yT/hcrZ+0RdMTLC6R3fhb3q jjdo/HmUnDqghxzGoQim+Z/NuoI5i4qH7+h/LToCt1aZCVnIsfh1HumK29NEDi3f+Pnc XoTA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=ICpLBjd+; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id c10-v6si13515757pge.596.2018.05.22.07.34.20; Tue, 22 May 2018 07:34:37 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=ICpLBjd+; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751579AbeEVOdA (ORCPT + 99 others); Tue, 22 May 2018 10:33:00 -0400 Received: from mail-ot0-f193.google.com ([74.125.82.193]:41059 "EHLO mail-ot0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751291AbeEVOc6 (ORCPT ); Tue, 22 May 2018 10:32:58 -0400 Received: by mail-ot0-f193.google.com with SMTP id t1-v6so21200036oth.8; Tue, 22 May 2018 07:32:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=subject:to:cc:references:from:message-id:date:user-agent :mime-version:in-reply-to:content-language:content-transfer-encoding; bh=obOwu4fsrZygWhB7iGymjW4gwnZpf3E+Br/uOTsc0AA=; b=ICpLBjd+FWGXayx6pHko2jAD1UJoeB4v7bc78uhb5/bwLy+hKlj3uLGCd5a3Kn5sVV DKFClTRIlpVWmhlNlmT7aBigJknwTaewkhAa8a1t4r4j2qJAqSIP+1X54ccEdGR+kpUi cGpTQaCZiD0Lvh4mdvThXH9zFRUM+M+SZhBeaU4anYqVv2fI0aaAa808rhmwaZyM8Cak IZKlRKaQ1JictCCaRcrb/3BmAQCnS6/4+fWmwTsjuMaHvOK2Fg/VtjgKzopZR88UY1dt z1K6C3alLzb9kjK/q+rSdK8QRK+v6JN3xHnadB8DPchd7WH4nNJxpqKobPHTtC6kb8Xm bywA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=obOwu4fsrZygWhB7iGymjW4gwnZpf3E+Br/uOTsc0AA=; b=EjYwa9yLEbHYBhN2rwe0Kkkcvwg4dl665WRPQol3L4W9hszd1OJmCvW2ENlFmpoDRD wjM14yX4nGTCsWkGW6SWpf37qtK3cH19WZEvxJEWa+HHmOJgO6y6hTXYrLkZHdFJwOXO Lm5LuOFE54cgkGS2ErRBRMNTuGgSA7saf0KEsiffZ6A9zM00rJPizOS6+bVJJMqoQx+n /ZRHDj12iG9wxUU5B0gTglycZFBGvQFo8gVwkPqtx47rRzGW790pddo2yUvwu/NSOXCh /04A9DKeRSDCOMCPlxEJ18j2FGImAKWeFrMDmX7RAjmleNnC2HFFpGA55lt3YUr3fGDB rB5Q== X-Gm-Message-State: ALKqPwduVLrCuFP0LZB8yRDgOhPh21dHio2SiF7Hxn18KTEvGrknNBoX Ghh9tJ6LqKuWlUdXx+2SyJyiPf7BQlE= X-Received: by 2002:a9d:2fa8:: with SMTP id r37-v6mr2531295otb.383.1526999577157; Tue, 22 May 2018 07:32:57 -0700 (PDT) Received: from nuclearis2_1.gtech (c-98-201-114-184.hsd1.tx.comcast.net. [98.201.114.184]) by smtp.gmail.com with ESMTPSA id f4-v6sm9095888oth.77.2018.05.22.07.32.56 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 22 May 2018 07:32:56 -0700 (PDT) Subject: Re: [PATCH v6 2/2] acpi: apei: Do not panic() on PCIe errors reported through GHES To: "Rafael J. Wysocki" Cc: Borislav Petkov , alex_gagniuc@dellteam.com, austin_bolen@dell.com, shyam_iyer@dell.com, "Rafael J. Wysocki" , Len Brown , Tony Luck , Tyler Baicar , Will Deacon , James Morse , Shiju Jose , "Jonathan (Zhixiong) Zhang" , Dongjiu Geng , ACPI Devel Maling List , Linux Kernel Mailing List References: <20180521135003.32459-1-mr.nuke.me@gmail.com> <20180521135003.32459-3-mr.nuke.me@gmail.com> From: "Alex G." Message-ID: <5a72a503-e8d5-c317-89b4-86e574c48064@gmail.com> Date: Tue, 22 May 2018 09:32:55 -0500 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 05/22/2018 04:02 AM, Rafael J. Wysocki wrote: > On Mon, May 21, 2018 at 3:49 PM, Alexandru Gagniuc wrote: >> The policy was to panic() when GHES said that an error is "Fatal". >> This logic is wrong for several reasons, as it doesn't account for the >> cause of the error. >> >> PCIe fatal errors indicate that the link to a device is either >> unstable or unusable. They don't indicate that the machine is on fire, > > But they very well may indicate just that AFAICS. I guess it's possible to set a machine on fire, and get a PCIe error as one of the links melts. Although in that case, I doubt how we try to handle the error makes a difference. Sarcasm aside, my point is that it makes little sense to crash a machine when we lose a PCIe link. >> and they are not severe enough to justify a panic(). Do not blindly >> rely on firmware to evaluate the severity for us. Instead, look at >> the error severity based on what caused the error (GHES subsections). > > Which bit also comes from the firmware, right? So why is it regarded > as a better source of information? It's less bad (not using 'better') because it relates more closely to the error than the specific mechanism through which it is reported. The header severity is an artificial concept that firmware has to make up, whereas the subsection severity usually comes directly from hardware. > Or are you trying to say that both of the pieces of information in > question should be consistent with each other? But if they aren't, > which one should we trust more and why? The header severity is letting someone else make the decisions for you. (snip) >> +/* PCIe errors should not cause a panic. */ > > This comment is not sufficient and it should go inside of the function. What would make a "sufficient" comment? >> +static int ghes_sec_pcie_severity(struct acpi_hest_generic_data *gdata) >> +{ >> + struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata); >> + >> + if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID && >> + pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO && >> + IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER)) >> + return GHES_SEV_RECOVERABLE; > > You have not explained convincingly enough why the above condition > makes sense at all. Is this a test? I think it's self-explanatory: How can you invoke a handler when you don't have a source for the error? Or how can you invoke a handler when you don't have that handler? >> + >> + return ghes_cper_severity(gdata->error_severity); >> +} >> + >> +/* >> + * The severity field in the status block is an unreliable metric for the >> + * severity. A more reliable way is to look at each subsection and see how safe >> + * it is to call the approproate error handler. >> + * We're not conerned with handling the error. We're concerned with being able >> + * to notify an error handler by crossing the NMI/IRQ boundary, being able to >> + * schedule_work, and so forth. >> + * - SEC_PCIE: All PCIe errors can be handled by AER. > > Make this comment a proper kerneldoc or move it inside of the function. I don't like moving long comments inside a function, as it breaks code flow. Above-function explanation is also consistent with how ghes_handle_aer() is documented. Rafael, thank you very much for taking the time to review these patches. Although, after reading through your emails, I'm at a loss on how you want to to solve the problem. It appears everyone has a very strong and different opinion how to proceed. I think the biggest problem is having a policy to panic on "fatal" errors, instead of letting the error handler make that decision. I'd much rather kill that stupid policy, but people seem to like it for some reason. Alex