Message-ID: <2540b570-1c1a-7d1b-59e9-6c32d9947c44@linux.alibaba.com>
Date: Mon, 4 Sep 2023 18:40:48 +0800
Subject: Re: [PATCH] HWPOISON: add a pr_err message when forcibly send a sigbus
To: Helge Deller, Will Deacon, "Luck, Tony"
Cc: catalin.marinas@arm.com, James.Bottomley@HansenPartnership.com, dave.hansen@linux.intel.com,
    luto@kernel.org, peterz@infradead.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, x86@kernel.org, hpa@zytor.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-parisc@vger.kernel.org
References: <20230819102212.21103-1-xueshuai@linux.alibaba.com> <20230821105025.GB19469@willie-the-truck> <44c4d801-3e21-426b-2cf0-a7884d2bf5ff@linux.alibaba.com> <54114b64-4726-da46-8ffa-16749ec0887a@linux.alibaba.com> <20230830221814.GB30121@willie-the-truck>
From: Shuai Xue
X-Mailing-List: linux-kernel@vger.kernel.org

On 2023/8/31 17:06, Helge Deller wrote:
> On 8/31/23 05:29, Shuai Xue wrote:
>> On 2023/8/31 06:18, Will Deacon wrote:
>>> On Mon, Aug 28, 2023 at 09:41:55AM +0800, Shuai Xue wrote:
>>>> On 2023/8/22 09:15, Shuai Xue wrote:
>>>>> On 2023/8/21 18:50, Will Deacon wrote:
>>>>>>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>>>>>>> index 3fe516b32577..38e2186882bd 100644
>>>>>>> --- a/arch/arm64/mm/fault.c
>>>>>>> +++ b/arch/arm64/mm/fault.c
>>>>>>> @@ -679,6 +679,8 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
>>>>>>>       } else if (fault & (VM_FAULT_HWPOISON_LARGE | VM_FAULT_HWPOISON)) {
>>>>>>>           unsigned int lsb;
>>>>>>>
>>>>>>> +        pr_err("MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
>>>>>>> +               current->comm, current->pid, far);
>>>>>>>           lsb = PAGE_SHIFT;
>>>>>>>           if (fault & VM_FAULT_HWPOISON_LARGE)
>>>>>>>               lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
>>>>>>
>>>>>> Hmm, I'm not convinced by this. We have 'show_unhandled_signals' already,
>>>>>> and there's plenty of code in memory-failure.c for handling poisoned pages
>>>>>> reported by e.g. GHES. I don't think dumping extra messages in dmesg from
>>>>>> the arch code really adds anything.
>>>>>
>>>>> I see the show_unhandled_signals() will dump the stack but it rely on
>>>>> /proc/sys/debug/exception-trace be set.
>>>>>
>>>>> The memory failure is the top issue in our production cloud and also other hyperscalers.
>>>>> We have received complaints from our operations engineers and end users that processes
>>>>> are being inexplicably killed :(. Could you please consider add a message?
>>>
>>> I don't have any objection to logging this stuff somehow, I'm just not
>>> convinced that the console is the best place for that information in 2023.
>>> Is there really nothing better?
>
>> I agree that console might not the better place, but it still plays an important role.
>> IMO the most direct idea for end user to check what happened is to check by viewing
>> the dmesg. In addition, we deployed some log store service collects all cluster dmesg
>> from /var/log/kern.
>
> Right, pr_err() is not just console.
> It ends up in the syslog, which ends up in a lot of places, e.g. through syslog forwarding.
> Most monitoring tools monitor the syslog as well.
>
> So, IMHO pr_err() is the right thing.
>
> Helge
>

Totally agreed. Thank you.

Best Regards,
Shuai
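
To illustrate how a log collector reading the forwarded kernel log could use the proposed message, here is a minimal userspace sketch. It assumes the line keeps exactly the format string from the patch above ("MCE: Killing %s:%d due to hardware memory corruption fault at %lx"). The function name parse_mce_kill_line and the two macros are hypothetical helpers made up for this example; they are not part of the kernel or of the patch.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Fixed parts of the message, copied from the pr_err() format in the patch. */
#define MCE_PREFIX "MCE: Killing "
#define MCE_MIDDLE " due to hardware memory corruption fault at "

/*
 * Hypothetical helper: extract task name, pid and faulting address from one
 * kernel log line. Anything before the prefix (e.g. a dmesg timestamp) is
 * skipped. Returns 0 on success, -1 if the line does not match.
 */
int parse_mce_kill_line(const char *line, char *comm, size_t comm_len,
                        long *pid, unsigned long long *addr)
{
    const char *start, *mid, *colon = NULL, *scan, *hex;
    char *end;
    size_t n;

    if (!line || !comm || comm_len == 0)
        return -1;

    start = strstr(line, MCE_PREFIX);
    if (!start)
        return -1;
    start += strlen(MCE_PREFIX);

    mid = strstr(start, MCE_MIDDLE);
    if (!mid)
        return -1;

    /* A task name may itself contain ':', so split on the last colon. */
    for (scan = start; scan < mid; scan++)
        if (*scan == ':')
            colon = scan;
    if (!colon || colon == start)
        return -1;

    /* The pid must fill the whole gap between the colon and the fixed text. */
    errno = 0;
    *pid = strtol(colon + 1, &end, 10);
    if (end == colon + 1 || end != mid || errno != 0 || *pid < 0)
        return -1;

    /* %lx prints the address as hex without a "0x" prefix. */
    hex = mid + strlen(MCE_MIDDLE);
    errno = 0;
    *addr = strtoull(hex, &end, 16);
    if (end == hex || errno != 0)
        return -1;

    /* Copy the task name, truncating if the caller's buffer is too small. */
    n = (size_t)(colon - start);
    if (n >= comm_len)
        n = comm_len - 1;
    memcpy(comm, start, n);
    comm[n] = '\0';
    return 0;
}
```

A collector would call this on each line it reads from the kernel log and, on a return value of 0, record the task name, pid and address so that an otherwise unexplained SIGBUS kill can be attributed to a hardware memory error. If the final wording of the message changes during review, the two macros would need to follow it.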