Subject: Re: [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages
From: "Alex G."
To: James Morse
Cc: linux-acpi@vger.kernel.org, rjw@rjwysocki.net, lenb@kernel.org,
    tony.luck@intel.com, bp@alien8.de, tbaicar@codeaurora.org,
    will.deacon@arm.com, shiju.jose@huawei.com, zjzhang@codeaurora.org,
    gengdongjiu@huawei.com, linux-kernel@vger.kernel.org,
    alex_gagniuc@dellteam.com, austin_bolen@dell.com, shyam_iyer@dell.com
Date: Wed, 4 Apr 2018 14:49:48 -0500
Message-ID: <9e29e5c6-b942-617e-f92e-728627799506@gmail.com>
References: <20180403170830.29282-1-mr.nuke.me@gmail.com>
 <20180403170830.29282-4-mr.nuke.me@gmail.com>
 <338e9bb4-a837-69f9-36e5-5ee2ddcaaa38@arm.com>

On 04/04/2018 11:53 AM, James Morse wrote:
> Hi Alex,

(snip)

>>> How do we know we will survive this trip?
>>
>> We don't.
>
> Isn't that even worse than panic()ing? (No output, followed by a
> watchdog reboot if we're lucky)

On the contrary. No output followed by a watchdog reboot is usually
what happens when the panic() is in NMI. One of the very few ways that
can be debugged is over a null-modem cable.

>>> On arm64 systems it may not be possible to return to the context we
>>> took the NMI notification from: we may bounce back into firmware
>>> with the same "world is on fire" error. Firmware can rightly assume
>>> the OS has made no attempt to handle the error.
>>
>> I'm not aware of the NMI path being used on arm64:
>> $ git grep 'select HAVE_ACPI_APEI_NMI'
>> arch/x86/Kconfig: select HAVE_ACPI_APEI_NMI if ACPI
>> $
>
> (I agree its not obvious!), arm64 has NOTIFY_SEA, NOTIFY_SEI and
> NOTIFY_SDEI that all behave like NOTIFY_NMI (in that they can
> interrupt IRQ-masked code).
>
> CONFIG_HAVE_ACPI_APEI_NMI has come to mean NOTIFY_NMI, which requires
> the arch code to implement register_nmi_handler() and (from memory)
> behave as a notifier chain, which doesn't fit with how these things
> behave.
>
> NOTIFY_SEA lives behind CONFIG_ACPI_APEI_SEA in driver/acpi/apei/Kconfig.
> The series to add SDEI support is on the list here:
> https://www.spinics.net/lists/arm-kernel/msg642946.html
> (NOTIFY_SEI is very tricky for firmware to get right, I don't think
> we're there yet.)

From my understanding, your patch series tries to use ghes_notify_nmi()
as a common entry point. But if, on arm64, you can return to firmware,
then we have wildly different constraints.

On x86, you can't return to SMM. You can have a new SMI triggered while
servicing the NMI, but you can't re-enter an SMI that you've returned
from. SMM holds all the cores, and they all leave SMM at roughly the
same time, so you don't have to deal with coexistence issues. This
probably also avoids the multiple notifications that you are trying to
implement on arm.

On quick thought, I think the best way to go for both series is to
leave the entry points arch-specific, and prevent hax86 and harm64 from
stepping on each other's toes. Thoughts?

(snip)

> Wouldn't reasonably-intelligent firmware be able to spot the same fault
> repeatedly in a loop? (last-fault-address == this-fault-address,
> last-fault-time==now)

I'm told that the BIOS on Dell machines will detect interrupt storms
and disable the offending error source in such cases. On an x86 server,
50us spent in SMM eats up several core-milliseconds of CPU time, so
they try to be a little careful.

>>> Your 'not re-arming the error' example makes this worrying.
>>
>> Those are one class of bugs you get when you let firmware run the show.
>> It's been an issue since day one of EFI. The best you can do in such
>> cases is accept you may later lose the link to a device.
>> Side-note: We have a BIOS defect filed with against this particular
>> issue, and it only seems to affect PCIe SMIs.
>
> This sounds like policy applied too early

That's firmware-first in one sentence.

> fixing the firmware to at least make this configurable would be
> preferable.

I can ask for "fixing the firmware" and we might even get the fix, but
the fact remains that there are machines in the field with the buggy
BIOSes, and there will continue to be such machines even after a fix
is released.

> My point was there is a class of reported-as-fatal error that really
> is fatal, and for these trying to defer the work to irq_work is worse
> than panic()ing as we get less information about what happened.

How do we get less information? We save the GHES structure on the
estatus_queue and use it as the error source in either case.
Is there a class of errors where we can reasonably expect things to be
okay in NMI, but to go horribly wrong on return from NMI?
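
For context, the queue-then-return pattern we're both describing looks
roughly like the sketch below. This is not the actual ghes.c code: the
real driver keeps records in ghes_estatus_pool (a gen_pool) and a
per-CPU llist, and nmi_pool_alloc()/ghes_defer_sketch()/estatus_iw are
made-up names for illustration only.

#include <linux/irq_work.h>
#include <linux/llist.h>
#include <linux/string.h>
#include <linux/types.h>

struct deferred_estatus {
	struct llist_node node;
	u8 data[];			/* raw CPER estatus bytes */
};

/* Hypothetical lock-free allocator standing in for the gen_pool. */
extern void *nmi_pool_alloc(size_t len);

static LLIST_HEAD(estatus_llist);
static struct irq_work estatus_iw;
/* somewhere in init: init_irq_work(&estatus_iw, estatus_iw_func); */

/* Runs from the self-IPI, with interrupts enabled. */
static void estatus_iw_func(struct irq_work *iw)
{
	struct llist_node *n, *t;

	llist_for_each_safe(n, t, llist_del_all(&estatus_llist)) {
		struct deferred_estatus *e =
			llist_entry(n, struct deferred_estatus, node);
		/* parse e->data with the CPER helpers, then free it */
	}
}

/* NMI side: copy the record and get out; no locks, no sleeping. */
static int ghes_defer_sketch(const void *estatus, size_t len)
{
	struct deferred_estatus *e = nmi_pool_alloc(sizeof(*e) + len);

	if (!e)
		return -ENOMEM;

	memcpy(e->data, estatus, len);
	llist_add(&e->node, &estatus_llist);
	irq_work_queue(&estatus_iw);	/* NMI-safe; raises the self-IPI */
	return 0;
}

The only thing the NMI side does is copy bytes and call
irq_work_queue(), which is NMI-safe; everything that can sleep runs
later, from the self-IPI onwards.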

(snip)

> Are all AER errors recoverable? (is an OS:'fatal' AER error ever valid?)

PCIe "fatal" means that your link is gone and you might not get it
back. You lose all the downstream devices. You may be able to reset
the link and get it back. This sort of "fatal" has no impact on a
machine's ability to continue operation.

From ACPI's perspective, FFS should only send a "fatal" error if the
machine needs a reset [1], so it's clearly a firmware bug to mark any
PCIe error "fatal". That won't stop BIOS vendors from taking shortcuts
and deciding to crash the OS by sending the error as "fatal".

[1] ACPI 6.2, section 18.1: Hardware Errors and Error Sources

>> The alternative is to keep ghes_do_proc() and ghes_notify_nmi() in
>> sync, duplicate code between the two. But we can also get
>> ghes_proc_irq_work() as an entry point, and this easily turns in a
>> maintenance nightmare.
>
> We can't call all the recovery-helpers from ghes_notify_nmi() as they
> sleep and allocate memory. I think the trick would be to make the CPER
> parsing routines NMI-safe so that we can check these 'fatal' records
> for cases where we're pretty sure linux can do a better job.

can_i_handle_this_nonblocking(struct four_letter_acronym_err *err) ?

That's the approach I originally tried, but I didn't like it. The
parsing gets deeper and deeper, and by the time you realize whether
the error can be handled or not, you're already halfway through
recovery. This is a perfectly viable approach. However, is there a
palpable (non-theoretical) concern that warrants complicating things
this way?
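
For the sake of argument, the shallow version of that check would look
something like the sketch below. ghes_can_defer() is a hypothetical
name and the PCIe-only policy is a placeholder; the iterator is
apei_estatus_for_each_section() from drivers/acpi/apei/apei-internal.h,
and the section-type GUIDs come from <linux/cper.h>.

#include <linux/cper.h>
#include <linux/uuid.h>
#include <acpi/ghes.h>
#include "apei-internal.h"	/* apei_estatus_for_each_section() */

/* NMI-safe: a pure walk over the estatus buffer, no locks, no
 * allocation. Returns true if every section is something Linux
 * could plausibly recover from after deferring.
 */
static bool ghes_can_defer(struct acpi_hest_generic_status *estatus)
{
	struct acpi_hest_generic_data *gdata;

	apei_estatus_for_each_section(estatus, gdata) {
		guid_t *sec_type = (guid_t *)gdata->section_type;

		/* Placeholder policy: treat only PCIe sections as
		 * recoverable, per the AER discussion above; anything
		 * else still panic()s.
		 */
		if (!guid_equal(sec_type, &CPER_SEC_PCIE))
			return false;
	}

	return true;
}

Anything smarter than a section-type whitelist ends up re-implementing
half of the parsing in NMI context, which is exactly the depth problem
described above.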

>>> For the PCI*/AER errors we should be able to try and handle it ...
>>> if we can get back to process/IRQ context:
>>> What happens if a PCIe driver has interrupts masked and is doing
>>> something to cause this error? We can take the NMI and setup the
>>> irq-work, but it may never run as we return to interrupts-masked
>>> context.
>>
>> The way the recovery from NMI is set up is confusing. On NMIs,
>> ghes_proc_in_irq() is queued up to do further work. Despite the name,
>> in this case it runs as a task. That further queues up AER work.
>> Either way, PCIe endpoint interrupts are irrelevant here. AER recovery
>> runs independent of drivers or interrupts:
>
> Hmm, we may be looking at different things, (and I'm not familiar with
> x86): ghes_notify_nmi() uses irq_work_queue() to add the work to a
> percpu llist and call arch_irq_work_raise(). arch_irq_work_raise()
> sends a self IPI, which, once taken causes the arch code to call
> irq_work_run() to consume the llist.

I abstract things at irq_work_queue(): "queue this work and let me
return from this interrupt".

> My badly made point is we could take the NMI that triggers all this
> from a context that has interrupts masked. We can't take the self-IPI
> until we return from the NMI, and we return to a context that still
> has interrupts masked. The original context may cause the NMI to
> trigger again before it unmasks interrupts. In this case we've
> livelocked, the recovery work never runs.

In order to get another NMI from firmware, you need SMM to be triggered
again. Remember, SMM has to exit before the NMI can be serviced. I
doubt the livelock is a concern here, because we're dealing with NMIs
invoked by firmware (SMM). NMIs for other reasons would be handled
elsewhere. In the event that FFS fails to prevent an SMI storm and
keeps spamming the CPU with NMIs, that is a clear firmware bug. It's
also the sort of bug I haven't observed in the wild.

> I think something like polling a status register over a broken-link
> would do this. (has my config write been acknowledge?)

That would not be a valid reason to disable interrupts. Non-posted PCIe
requests can take hundreds of milliseconds (memory-mapped or
otherwise). I can't think of any sort of code that can cause errors to
come in via NMI and that would be a suitable atomic context for
disabling interrupts. Maybe on arm it's normal to stuff anything
error-related in via NMI? On x86, most error sources have their own
(maskable) interrupt vector. I am not concerned about livelocking the
system.

(snip)

>> I can see an issue if we somehow lose the ability to run queued
>> tasks, but in that case, we've already lost control of the machine.
>
> about to loose control of the machine: firmware has saved the day
> giving us a clean slate and some standardised messages that describe
> the fault.

In theory, yes. In practice, by the time the OS gets control, things
are usually worse than they would have been without FFS. "Standardized"
is really a pipe dream, especially on x86, where each vendor interprets
things differently, so the OS still has to figure it all out on its
own.

> I agree we can do better for AER errors, but I don't think we should
> make the handling of other classes of error worse. These classes match
> nicely to the CPER record types, which we already have mapped in
> ghes_notify_nmi().

My approach is to _not_ trust the firmware. I hope arm does it better,
but on x86, each hardware vendor does what it thinks is best in
firmware. The result is wildly different FFS behavior between vendors,
which is something an OS cannot rely upon.

Alex