Date: Wed, 10 May 2023 10:17:18 +0800
Subject: Re: [PATCH] x86/mce/amd: init mce severity to handle deferred memory failure
To: Yazen Ghannam, bp@alien8.de, tony.luck@intel.com
Cc: tglx@linutronix.de, mingo@redhat.com, dave.hansen@linux.intel.com,
 x86@kernel.org, hpa@zytor.com, baolin.wang@linux.alibaba.com,
 linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20230425121829.61755-1-xueshuai@linux.alibaba.com>
From: Shuai Xue
X-Mailing-List: linux-kernel@vger.kernel.org

On 2023/5/9 22:25, Yazen Ghannam wrote:
> On 4/25/23 8:18 AM, Shuai Xue wrote:
>> When a deferred UE error is detected, e.g. by the background patrol
>> scrubber, it will be handled in the APIC interrupt handler
>> amd_deferred_error_interrupt(). The handler will collect MCA banks, init
>> the mce struct and process it by notifying the registered MCE decode chain.
>>
>> The uc_decode_notifier, one of the MCE decode chain notifiers, will process
>> memory failures, but only for MCE_AO_SEVERITY and MCE_DEFERRED_SEVERITY.
>> However, the APIC interrupt handler does not init the mce severity, and the
>> uninitialized severity is 0 (MCE_NO_SEVERITY).
>>
>> To handle the deferred memory failure case, init mce severity when logging
>> MCA banks.
>>
>> Signed-off-by: Shuai Xue
>>
>
> Hi Shuai Xue,
>
> I think this patch is fair to do. But it won't have the intended effect
> in practice.
>
> The value in MCA_ADDR for DRAM ECC errors will be a memory controller
> "normalized address". This is not a system physical address that the OS
> can use to take action.
>
> The mce_usable_address() function needs to be updated to handle this.
> I'll send a patchset this week to do so. Afterwards, the
> uc_decode_notifier will not attempt to handle these errors.

From the experience of other platforms (e.g.
ARM64 RAS and Intel MCA), uc_decode_notifier should handle these errors and
hard-offline the corrupted page. If the corrupted page is a free buddy page,
we can isolate it and avoid using it in the future.

In my test case, the error is detected by the patrol scrubber in the memory
controller. The scrubber may lack a system address space perspective and
only report a "normalized address". But we can decode the "normalized
address" to a system address in EDAC (umc_normaddr_to_sysaddr), right?

(I am not quite familiar with AMD RAS, please correct me if I am wrong.)

>
> Thanks,
> Yazen

Thank you.

Best Regards,
Shuai
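
For illustration only, the severity filter described in the quoted patch
can be modeled in user space as below. This is a sketch under stated
assumptions, not kernel code: struct toy_mce, the TOY_* constants,
toy_notifier_handles() and toy_log_deferred_error() are hypothetical
stand-ins. Only the fact that the "no severity" value is 0 comes from the
thread; the other numeric values are arbitrary. The sketch also ignores the
separate address-usability question raised in the reply.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the severity levels named in the thread.
 * Only "no severity == 0" is taken from the thread; the rest is arbitrary. */
enum toy_severity {
	TOY_NO_SEVERITY = 0,
	TOY_DEFERRED_SEVERITY,
	TOY_AO_SEVERITY,
};

/* Hypothetical, heavily reduced stand-in for the mce record. */
struct toy_mce {
	unsigned long long addr;
	enum toy_severity severity;
};

/* Models the filter described in the patch text: only AO and deferred
 * severities are passed on to memory-failure handling. */
static bool toy_notifier_handles(const struct toy_mce *m)
{
	return m->severity == TOY_AO_SEVERITY ||
	       m->severity == TOY_DEFERRED_SEVERITY;
}

/* Models the proposed change: set the severity when the deferred error
 * is logged, so the record no longer carries the zero default. */
static void toy_log_deferred_error(struct toy_mce *m, unsigned long long addr)
{
	m->addr = addr;
	m->severity = TOY_DEFERRED_SEVERITY;
}
```

A record whose severity is left at its zero default is skipped by
toy_notifier_handles(), while one filled in by toy_log_deferred_error()
passes the filter.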