Subject: Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()
To: "Rafael J. Wysocki", "Luck, Tony"
Cc: Borislav Petkov, alex_gagniuc@dellteam.com, austin_bolen@dell.com,
    shyam_iyer@dell.com, "Rafael J. Wysocki", Len Brown, Tyler Baicar,
    Will Deacon, James Morse, Shiju Jose, "Jonathan (Zhixiong) Zhang",
    Dongjiu Geng, ACPI Devel Maling List, Linux Kernel Mailing List
References: <20180521135003.32459-1-mr.nuke.me@gmail.com>
    <20180521135003.32459-2-mr.nuke.me@gmail.com>
    <53d0ba88-6929-a7cf-6c3e-4ca389f7249a@gmail.com>
    <20180522135015.GF5512@pd.tnic>
    <0b758a1c-90e3-6f76-4f83-1e22c8fc9cd6@gmail.com>
    <20180522145426.GG5512@pd.tnic>
    <20180522175742.GA3543@agluck-desk>
From: "Alex G."
Message-ID: <5dc58180-d3c0-a9f0-282f-4be433c94052@gmail.com>
Date: Tue, 22 May 2018 13:19:34 -0500
X-Mailing-List: linux-kernel@vger.kernel.org

On 05/22/2018 01:10 PM, Rafael J. Wysocki wrote:
> On Tue, May 22, 2018 at 7:57 PM, Luck, Tony wrote:
>> On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:
>>> I especially don't want to have the case where a PCIe error is *really*
>>> fatal and then we noodle in some handlers debating about the severity
>>> because it got marked as recoverable intermittently and end up causing
>>> data corruption on the storage device. Here's a real no-no for ya.
>>
>> All that we have is a message from the BIOS that this is a "fatal"
>> error. When did we start trusting the BIOS to give us accurate
>> information?
>
> Some time ago, actually.
>
> This is about changing the existing behavior which has been to treat
> "fatal" errors reported by the BIOS as good enough reasons for a panic
> for quite a while AFAICS.

Yes, you are correct. I'd actually like to go deeper, and remove the
policy to panic() on fatal errors. Now whether we blacklist or whitelist
which errors can go through is up for debate, but the current policy
seems broken.

>> PCIe fatal means that the link or the device is broken.
>
> And that may really mean that the component in question is on fire.
> We just don't know.

Should there be a physical fire, we have much bigger issues. At the same
time, we could retrain the link, call the driver, and release freon gas
to put out the fire. That's something we don't currently have the option
to do.

>> But that seems a poor reason to take down a large server that may have
>> dozens of devices (some of them set up specifically to handle
>> errors ... e.g. mirrored disks on separate controllers, or NIC
>> devices that have been "bonded" together).
>>
>> So, as long as the action for a "fatal" error is to mark a link
>> down and offline the device, that seems a pretty reasonable course
>> of action.
>>
>> The argument gets a lot more marginal if you simply reset the
>> link and re-enable the device to "fix" it. That might be enough,
>> but I don't think the OS has enough data to make the call.
>
> Again, that's about changing the existing behavior or the existing policy even.
>
> What exactly has changed to make us consider this now?

Firmware started passing "fatal" GHES headers with the explicit intent
of crashing an OS. At the same time, we've learnt how to handle these
errors in a number of cases. With DPC (coming soon to firmware-first)
the error is contained, and a non-issue.

As specs and hardware implementations evolve, we have to adapt. I'm here
until November, and one of my goals is to involve linux upstream in the
development of these features so that when the hardware hits the market,
we're ready. That does mean we have to drop some of the silly things
we're doing.

Alex

> Thanks,
> Rafael
>
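
For illustration only, the whitelist idea debated in this thread could be sketched roughly as below. This is plain user-space C, not kernel code and not the patch under review. Every name in it (the sketch_* enums and functions) is hypothetical, and treating PCIe as the only containable source is an assumption based on the DPC remark above. The default stays at today's behavior: a "fatal" report from firmware panics unless the error source is explicitly whitelisted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical types: none of these names exist in the kernel. */
enum sketch_sev { SKETCH_SEV_CORRECTED, SKETCH_SEV_RECOVERABLE, SKETCH_SEV_FATAL };
enum sketch_src { SKETCH_SRC_PCIE, SKETCH_SRC_MEMORY, SKETCH_SRC_UNKNOWN };
enum sketch_action {
	SKETCH_ACT_LOG,            /* just report it */
	SKETCH_ACT_RECOVER,        /* hand off to the normal recovery path */
	SKETCH_ACT_OFFLINE_DEVICE, /* mark the link down, offline the device */
	SKETCH_ACT_PANIC           /* existing policy for "fatal" */
};

/*
 * Whitelist of sources where a "fatal" error is assumed to be containable.
 * Assumption: only PCIe qualifies, e.g. when the error is contained by DPC.
 */
static bool sketch_fatal_is_containable(enum sketch_src src)
{
	return src == SKETCH_SRC_PCIE;
}

/* Map a firmware-reported severity plus error source to an action. */
static enum sketch_action sketch_decide(enum sketch_sev sev, enum sketch_src src)
{
	switch (sev) {
	case SKETCH_SEV_CORRECTED:
		return SKETCH_ACT_LOG;
	case SKETCH_SEV_RECOVERABLE:
		return SKETCH_ACT_RECOVER;
	case SKETCH_SEV_FATAL:
		/* Whitelisted sources get offlined; everything else panics. */
		return sketch_fatal_is_containable(src)
			? SKETCH_ACT_OFFLINE_DEVICE
			: SKETCH_ACT_PANIC;
	}

	/* Not reachable with a valid severity; fail safe if it ever is. */
	assert(!"unknown severity");
	return SKETCH_ACT_PANIC;
}
```

With this shape, sketch_decide(SKETCH_SEV_FATAL, SKETCH_SRC_PCIE) yields the "offline the device" action, while a fatal error from any non-whitelisted source still panics. A blacklist would invert the default, which is the part that is up for debate.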