Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.
To: Borislav Petkov
Cc: linux-acpi@vger.kernel.org, linux-edac@vger.kernel.org, rjw@rjwysocki.net, lenb@kernel.org, tony.luck@intel.com, tbaicar@codeaurora.org, will.deacon@arm.com, james.morse@arm.com, shiju.jose@huawei.com, zjzhang@codeaurora.org, gengdongjiu@huawei.com, linux-kernel@vger.kernel.org, alex_gagniuc@dellteam.com, austin_bolen@dell.com, shyam_iyer@dell.com, devel@acpica.org, mchehab@kernel.org, robert.moore@intel.com, erik.schmauss@intel.com
References: <20180416215903.7318-1-mr.nuke.me@gmail.com> <20180416215903.7318-4-mr.nuke.me@gmail.com> <20180418175415.GJ4795@pd.tnic> <20180419154006.GE3600@pd.tnic>
From: "Alex G."
Message-ID: <977608e6-9f5d-c523-a78a-993ac5bfd55f@gmail.com>
Date: Thu, 19 Apr 2018 11:26:57 -0500
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0
MIME-Version: 1.0
In-Reply-To: <20180419154006.GE3600@pd.tnic>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

On 04/19/2018 10:40 AM, Borislav Petkov wrote:
> On Thu, Apr 19, 2018 at 09:57:07AM -0500, Alex G. wrote:
>> ghes_severity() is a one-to-one mapping from a set of unsorted
>> severities to monotonically increasing numbers. The "one-to-one" mapping
>> part of the sentence is obvious from the function name. To change it to
>> parse the entire GHES would completely destroy this, and I think it
>> would apply policy in the wrong place.
>
> So do a wrapper or whatever. Do a ghes_compute_severity() or however you
> would wanna call it and do the iteration there.

That doesn't sound right. There isn't a formula to compute. What we're
doing is looking at individual error sources, and deciding which errors
we can handle based both on the error and on our ability to handle the
error.

>> Should I do that, I might have to call it something like
>> ghes_parse_and_apply_policy_to_severity(). But that misses the whole
>> point of these changes.
>
> What policy? You simply compute the severity like we do in the mce code.

As explained above, our ability to resolve an error depends on the
interaction between the error and the error handler. This is very closely
tied to the capabilities of each individual handler. I'll do it your way,
but I don't think ignoring this tight coupling is the right thing to do.

>
>> I would like to get to the handlers first, and then decide if things are
>> okay or not,
>
> Why? Give me an example why you'd handle an error first and then decide
> whether we're ok or not?
>
> Usually, the error handler decides that in one place. So what exactly
> are you trying to do differently that doesn't fit that flow?

In the NMI case you don't make it to the error handler. James and I beat
this subject to the afterlife in v1.

>> I don't want to leave people scratching their heads, but I also don't
>> want to make AER a special case without having a generic way to handle
>> these cases. People are just as susceptible to scratch their heads
>> wondering why AER is a special case and everything else crashes.
>
> Not if it is properly done *and* documented why we're applying the
> respective policy for the error type.
>
>> Maybe it's better to move the AER handling to NMI/IRQ context, since
>> ghes_handle_aer() is only scheduling the real AER handler, and is irq
>> safe. I'm scratching my head about why we're messing with IRQ work from
>> NMI context, instead of just scheduling a regular handler to take care
>> of things.
>
> No, first pls explain what exactly you're trying to do

I realize v1 was quite a while back, so I'll take this opportunity to
restate:

At a very high level, I'm working with Dell on improving server
reliability, with a focus on NVME hotplug and surprise removal. One of
the features we don't support is surprise removal of NVME drives; hotplug
is supported with 'prepare to remove'. This is one of the reasons NVME is
not on feature parity with SAS and SATA.

My role is to solve this issue on Linux, and to not worry about other
OSes. This puts me in a position to have a Linux-centric view of the
problem, as opposed to the more common firmware-centric view.

Part of solving the surprise removal issue involves improving FFS error
handling. This is required because the servers which are shipping use FFS
instead of native error notifications. As part of extensive testing, I
have found the NMI handler to be the most common cause of crashes, and
hence this series.

> and then we can talk about how to do it.

Your move.

> Btw, a real-life example to accompany that intention goes a long way.

I'm not sure if this is the example you're looking for, but take an
r740xd server, and slowly unplug an Intel NVME drive at an angle. You're
likely to crash the machine.

Alex
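
For readers following the thread, the "one-to-one mapping from a set of
unsorted severities to monotonically increasing numbers" that this message
attributes to ghes_severity() can be sketched roughly as below. This is a
standalone illustration, not the kernel source: every identifier carries a
_SKETCH or _sketch suffix to mark it as hypothetical, and the numeric
values are assumptions chosen only to show an unordered input set being
mapped onto an ordered one.

```c
/*
 * Standalone sketch, NOT the kernel implementation.
 * All names and numeric values here are illustrative assumptions.
 */

/* Input severities as reported in the error record: not sorted by badness. */
enum cper_sev_sketch {
	CPER_SEV_RECOVERABLE_SKETCH   = 0,
	CPER_SEV_FATAL_SKETCH         = 1,
	CPER_SEV_CORRECTED_SKETCH     = 2,
	CPER_SEV_INFORMATIONAL_SKETCH = 3,
};

/* Output severities: larger number means more severe. */
enum ghes_sev_sketch {
	GHES_SEV_NO_SKETCH          = 0,
	GHES_SEV_CORRECTED_SKETCH   = 1,
	GHES_SEV_RECOVERABLE_SKETCH = 2,
	GHES_SEV_PANIC_SKETCH       = 3,
};

/*
 * Map one reported severity to one ordered severity. No record parsing
 * and no handler-specific policy happens here; each input value has
 * exactly one output value.
 */
int ghes_severity_sketch(int cper_severity)
{
	switch (cper_severity) {
	case CPER_SEV_INFORMATIONAL_SKETCH:
		return GHES_SEV_NO_SKETCH;
	case CPER_SEV_CORRECTED_SKETCH:
		return GHES_SEV_CORRECTED_SKETCH;
	case CPER_SEV_RECOVERABLE_SKETCH:
		return GHES_SEV_RECOVERABLE_SKETCH;
	case CPER_SEV_FATAL_SKETCH:
		return GHES_SEV_PANIC_SKETCH;
	default:
		/* Assumption for this sketch: unknown values get the worst level. */
		return GHES_SEV_PANIC_SKETCH;
	}
}
```

The point of keeping the function this small is the one made in the
message: it only reorders severities so they can be compared, and says
nothing about whether a particular handler can actually recover from the
error.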