Date: Mon, 25 Sep 2023 09:43:52 +0800
From: Shuai Xue
To: David Laight, Bjorn Helgaas
Cc: "Rafael J. Wysocki", "gregkh@linuxfoundation.org", Linux PCI,
 "mahesh@linux.ibm.com", "linux-kernel@vger.kernel.org",
 "linux-acpi@vger.kernel.org", "bp@alien8.de", Baolin Wang,
 Jonathan Cameron, "bhelgaas@google.com", "james.morse@arm.com",
 "linuxppc-dev@lists.ozlabs.org", "lenb@kernel.org"
Subject: Re: Questions: Should kernel panic when PCIe fatal error occurs?
References: <20230920230257.GA280837@bhelgaas>
 <2e5870e416f84e8fad8340061ec303e2@AcuMS.aculab.com>
In-Reply-To: <2e5870e416f84e8fad8340061ec303e2@AcuMS.aculab.com>

On 2023/9/21 21:20, David Laight wrote:
> ...
> I've got a target to generate AER errors by generating read cycles
> that are inside the address range that the bridge forwards but
> outside of any BAR because there are 2 different sized BARs.
> (Pretty easy to setup.)
> On the system I was using they didn't get propagated all the way
> to the root bridge - but were visible in the lower bridge.

So how did you observe it? If the error message does not propagate to
the root bridge, I think no AER interrupt will be triggered.

> It would be nice for a driver to be able to detect/clear such
> a flag if it gets an unexpected ~0u read value.
> (I'm not sure an error callback helps.)

IMHO, a general model is that an error detected at an endpoint should be
routed to its upstream port (for example, an RCiEP routes its error
message to the RCEC) so that the AER port service can handle the error.
The device driver then only has to implement the error handler callbacks.

>
> OTOH a 'nebs compliant' server routed any kind of PCIe link error
> through to some 'system management' logic that then raised an NMI.
> I'm not sure who thought an NMI was a good idea - they are pretty
> impossible to handle in the kernel and too late to be of use to
> the code performing the access.

I think it is the responsibility of the device to prevent the spread of
errors while reporting that errors have been detected. For example,
drop the current, (drain submit queue) and report the error in the
completion record. Both NMI and MSI are asynchronous interrupts.

>
> In any case we were getting one after 'echo 1 >xxx/remove' and
> then taking the PCIe link down by reprogramming the fpga.
> So the link going down was entirely expected, but there seemed
> to be nothing we could do to stop the kernel crashing.
>
> I'm sure 'nebs compliant' ought to contain some requirements for
> resilience to hardware failures!

How did the kernel crash after the link went down? Did the system detect
a Surprise Down error?

Best Regards,
Shuai
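
To make the routing model above concrete, here is a rough userspace sketch in Python. It is only a toy, not kernel code: the `Device` class, the `forwards_errors` switch, `report_error()` and the string return values are all made up for illustration and do not correspond to any kernel API. It shows the two cases discussed in this thread: an error that reaches the root port and ends up in the driver's callback, and an error that stays latched in a lower bridge, so nothing is ever called.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class Severity(Enum):
    CORRECTABLE = "correctable"
    NONFATAL = "nonfatal"
    FATAL = "fatal"


@dataclass
class Device:
    """Hypothetical stand-in for a PCIe function in the hierarchy."""
    name: str
    parent: Optional["Device"] = None
    is_root_port: bool = False       # root port or RCEC: where the AER service listens
    forwards_errors: bool = True     # assumption: one switch for "sends error messages upstream"
    logged: List[Severity] = field(default_factory=list)        # errors latched locally
    error_detected: Optional[Callable[[Severity], str]] = None  # driver callback, if any


def report_error(source: Device, sev: Severity) -> Optional[str]:
    """Latch the error at the source, then try to route it to the root port.

    Returns the driver callback's result, "no_handler" if the driver has no
    callback, or None if the message never reached the root port.
    """
    source.logged.append(sev)
    node = source
    while not node.is_root_port:
        if node.parent is None or not node.forwards_errors:
            return None              # stays visible locally, no interrupt upstream
        node = node.parent
    # The message reached the root port: the "AER service" calls the driver.
    if source.error_detected is None:
        return "no_handler"
    return source.error_detected(sev)


def nic_error_detected(sev: Severity) -> str:
    # Hypothetical driver policy: ask for a reset only on fatal errors.
    return "need_reset" if sev is Severity.FATAL else "can_recover"


root = Device("root-port", is_root_port=True)
good_bridge = Device("bridge-a", parent=root)
bad_bridge = Device("bridge-b", parent=root, forwards_errors=False)
nic = Device("nic", parent=good_bridge, error_detected=nic_error_detected)

routed = report_error(nic, Severity.FATAL)          # reaches the root port
stuck = report_error(bad_bridge, Severity.NONFATAL)  # latched in the bridge only
print(routed, stuck, bad_bridge.logged)
```

In the second case the only trace of the error is `bad_bridge.logged`, which is why I asked how the error was observed: without the message reaching the root port, something has to poll the bridge's status by hand.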