Subject: Re: [PATCH V2] ACPI / APEI: restore interrupt before panic in sdei flow
To: 乱石
Cc: linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org, Tony Luck,
    linux-arm-kernel@lists.infradead.org, Borislav Petkov, Len Brown,
    "Rafael J. Wysocki", huangming@linux.alibaba.com
References: <20211012142910.9688-1-zhangliguang@linux.alibaba.com>
    <5951ad5b-d755-0150-0f2a-c567eb454dac@arm.com>
From: James Morse
Date: Mon, 18 Oct 2021 18:21:29 +0100

Hi Liguang,

On 14/10/2021 15:18, 乱石 wrote:
> On 2021/10/14 1:44, James Morse wrote:
>> On 12/10/2021 15:29, Liguang Zhang wrote:
>>> When hest acpi table configure Hardware Error Notification type as
>>> Software Delegated Exception(0x0B) for RAS event, OS RAS interacts with
>>> ATF by SDEI mechanism. On the firmware first system, OS was notified by
>>> ATF sdei call.
>>> If fatal RAS error occurred, panic was called in sdei_asm_handle()
>>> without ehf_deactivate_priority executed, which lead interrupt masked.
>> So far the story is:
>> Firmware generated an SDEI event (a kind of software NMI) because of a
>> firmware interrupt, but it hasn't completely handled the interrupt.
>>
>>
>>> If interrupt masked, system would be halted in kdump flow like this:
>>>
>>> arm-smmu-v3 arm-smmu-v3.3.auto: allocated 65536 entries for cmdq
>>> arm-smmu-v3 arm-smmu-v3.3.auto: allocated 32768 entries for evtq
>>> arm-smmu-v3 arm-smmu-v3.3.auto: allocated 65536 entries for priq
>>> arm-smmu-v3 arm-smmu-v3.3.auto: SMMU currently enabled! Resetting...
>> How and why do firmware interrupts affect the IOMMU?

[...]

>> Could you debug why firmware interrupts being active prevent the SMMU
>> from being reset. As far as I can tell, those should be totally
>> independent.
> If ehf_deactivate_priority() was not executed, pmr_el1 register was not
> resumed to >0x80, which leads non-secure interrupts masked.
> arm_smmu_device_probe() finally called usleep_range() which based on
> hrtimer. Because non-secure timer interrupts was masked, usleep_range
> would not respond.

Aha! So nothing to do with the SMMU at all. Your firmware has 'disabled'
the interrupt by moving the CPU's priority mask so that no interrupts at
all can be taken.

I still think this is best fixed in firmware. Papering over the problem
here is not enough as the handler may encounter memory corruption, take an
exception, and panic() from some other part of the kernel. It's RAS - we
know something has gone wrong before we get to this point. The OS needs to
be able to call panic() at any point in time.

Your firmware should not deny the normal-world interrupts like this.
Please either complete the interrupt handling before calling into the
normal world, or disable it if you need the interrupt to not fire again.
If the device that triggers the interrupt doesn't have a disable, there
are hardware registers in the GIC to do this.

(I don't know how TFA works here, it may be a bug in the upstream code)

Thanks,

James
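
For readers following the thread, the priority-mask behaviour described
above can be sketched as a toy C model. This is only an illustration, not
kernel or TF-A code: the function names and the three priority values are
assumptions, chosen to line up with the ">0x80" figure quoted above. The
one architectural fact it relies on is that in the GIC a numerically lower
priority value is more urgent, and only interrupts whose priority value is
below the mask are signalled to the CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model of the GIC CPU-interface priority mask (PMR).
 * An interrupt is signalled only if its priority value is strictly
 * lower (that is, more urgent) than the current mask value.
 */
static bool irq_can_be_taken(uint8_t pmr, uint8_t irq_prio)
{
	return irq_prio < pmr;
}

/* Hypothetical values, for illustration only. */
enum {
	PMR_LEFT_BY_FIRMWARE = 0x80, /* assumed: mask never moved back above 0x80 */
	PMR_RESTORED         = 0xf0, /* assumed: some value above 0x80 */
	TIMER_IRQ_PRIO       = 0xa0, /* assumed: a non-secure timer interrupt */
};

/* Self-check: the timer is masked until the mask is restored. */
static void check_priority_mask_model(void)
{
	/* Mask left at 0x80: the timer never fires, so a sleep never ends. */
	assert(!irq_can_be_taken(PMR_LEFT_BY_FIRMWARE, TIMER_IRQ_PRIO));

	/* Mask restored above 0x80: the timer interrupt is delivered again. */
	assert(irq_can_be_taken(PMR_RESTORED, TIMER_IRQ_PRIO));
}
```

With the mask stuck at the lower value, anything that waits on a timer
interrupt, such as the usleep_range() call in the SMMU probe path, has
nothing to wake it, which matches the hang seen in the kdump kernel.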