Date: Fri, 14 Apr 2023 11:26:27 +0200
Subject: Re: AMD EPYC 25 (19h): Hardware Error: Machine Check: 0 Bank 17: d42040000000011b
To: Borislav Petkov
Cc: Thomas Gleixner, Ingo Molnar, Dave Hansen, x86@kernel.org, LKML, Yazen Ghannam
References: <21a09968-296b-5b21-8079-6d9d4e0769d4@molgen.mpg.de> <20230412163240.GAZDbdKHjmQcxqkeDQ@fat_crate.local>
From: Paul Menzel
In-Reply-To:
<20230412163240.GAZDbdKHjmQcxqkeDQ@fat_crate.local>
X-Mailing-List: linux-kernel@vger.kernel.org

Dear Borislav,

Thank you for your quick and helpful reply.

On 12.04.23 at 18:32, Borislav Petkov wrote:
> On Wed, Apr 12, 2023 at 05:11:26PM +0200, Paul Menzel wrote:
>> On a Dell PowerEdge R7525 with AMD EPYC 7763 64-Core Processor, Linux
>> 5.15.94 logs the machine check exceptions (MCE) below:
>>
>> ```
>> [5154053.127240] mce: [Hardware Error]: Machine check events logged
>> [5154053.133711] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 17: d42040000000011b
>> [5154053.141948] mce: [Hardware Error]: TSC 0 ADDR b3cbdbbc0 PPIN 2b615bef7f48098 SYND 6bd210000a801002 IPID 9600650f00
>
> Build the latest kernel with CONFIG_X86_MCE_INJECT and
> CONFIG_EDAC_DECODE_MCE enabled and CONFIG_RAS_CEC *disabled*. Then boot
> it on that machine and do the following.
>
> The files are in debugfs:
>
> /sys/kernel/debug/mce-inject/
> ├── addr
> ├── bank
> ├── cpu
> ├── flags
> ├── ipid
> ├── misc
> ├── README
> ├── status
> └── synd
>
> so you go and do
>
> echo 0xd42040000000011b > status
> echo 0xb3cbdbbc0 > addr
> echo 3 > cpu
> echo "sw" > flags
> echo 0x6bd210000a801002 > synd
> echo 0x9600650f00 > ipid
> echo 17 > bank
>
> Remember to keep the bank write last because this one injects the error.
>
> It should dump the decoded error in dmesg.

Yes, that worked:

```
[ 436.584741] mce: [Hardware Error]: Machine check events logged
[ 436.590638] [Hardware Error]: Corrected error, no action required.
[ 436.596869] [Hardware Error]: CPU:3 (19:1:1) MC17_STATUS[Over|CE|-|AddrV|-|SyndV|CECC|-|-|-]: 0xd42040000000011b
[ 436.607083] [Hardware Error]: Error Addr: 0x0000000b3cbdbbc0
[ 436.612763] [Hardware Error]: IPID: 0x0000009600650f00, Syndrome: 0x6bd210000a801002
[ 436.620569] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[ 436.628942] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
```

It says “no action required”, but of the 14 identical servers running the same workload, this is the only one that has shown this error three times. Maybe the DIMM at bank 17 should just be replaced.

[…]

Kind regards,

Paul
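
PS: As a cross-check of the flags in the MC17_STATUS line, here is a minimal sketch that picks the status value apart by hand. The bit positions are an assumption on my side (the AMD Scalable MCA layout of MCA_STATUS as I understand the kernel's decoder to use it), and `STATUS_BITS` and `decode_status` are names made up for this illustration, so please check them against the processor's documentation before relying on the result.

```python
# Assumed bit positions in MCA_STATUS (AMD Scalable MCA layout).
# These are not taken from this thread; verify against the PPR.
STATUS_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60, "MiscV": 59, "AddrV": 58,
    "PCC": 57, "TCC": 55, "SyndV": 53, "CECC": 46, "UECC": 45,
    "Deferred": 44, "Poison": 43, "Scrub": 40,
}


def decode_status(status: int) -> dict:
    """Split a 64-bit MCA status value into set flags and error-code fields."""
    flags = [name for name, bit in STATUS_BITS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        # Assumed: extended error code lives in bits 21:16.
        "ext_error_code": (status >> 16) & 0x3F,
        # Low 16 bits hold the architectural error code.
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    decoded = decode_status(0xD42040000000011B)
    print("flags:", "|".join(decoded["flags"]))
    print("ext error code:", decoded["ext_error_code"])
    print("error code:", hex(decoded["error_code"]))
```

Under these assumptions, UC is clear while CECC is set, which would be consistent with the "Corrected error" line, and the extended error code comes out as 0 as in the decoded output.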