Subject: Re: Machine check recovery broken in v6.9-rc1
To: "Luck, Tony", Oscar Salvador
CC: David Hildenbrand, Borislav Petkov, Yazen Ghannam, Naoya Horiguchi, "linux-mm@kvack.org", "linux-kernel@vger.kernel.org"
References: <1e943439-6044-4aa4-8c41-747e9e4dca27@redhat.com>
From: Miaohe Lin
Message-ID: <3e49dd21-0aea-c7ac-1633-91764e759bf7@huawei.com>
Date: Sun, 7 Apr 2024 11:59:33 +0800

On 2024/4/7 8:08, Luck, Tony wrote:
>> This one is against 6.1 (previous one was against v6.9-rc2):
>> Again, compile tested only
>
> Oscar.
>
> Both the 6.1 and 6.9-rc2 patches make the BUG (and subsequent issues) go away.
>
> Here's what's happening.
>
> When the machine check occurs there's a scramble from various subsystems
> to report the memory error.
>
> ghes_do_memory_failure() calls memory_failure_queue() which later
> calls memory_failure() from a kernel thread. Side note: this happens TWICE
> for each error. Not sure yet if this is a BIOS issue logging more than once.
> or some Linux issues in acpi/apei/ghes.c code.
>
> uc_decode_notifier() [called from a different kernel thread] also calls
> do_memory_failure()
>
> Finally kill_me_maybe() [called from task_work on return to the application
> when returning from the machine check handler] also calls memory_failure()
>
> do_memory_failure() is somewhat prepared for multiple reports of the same
> error. It uses an atomic test and set operation to mark the page as poisoned.
>
> First called to report the error does all the real work.
> Late arrivals take a shorter path, but may still take some action(s)
> depending on the "flags" passed in:
>
>         if (TestSetPageHWPoison(p)) {
>                 pr_err("%#lx: already hardware poisoned\n", pfn);
>                 res = -EHWPOISON;
>                 if (flags & MF_ACTION_REQUIRED)
>                         res = kill_accessing_process(current, pfn, flags);
>                 if (flags & MF_COUNT_INCREASED)
>                         put_page(p);
>                 goto unlock_mutex;
>         }
>
> In this case the last to arrive has MF_ACTION_REQUIRED set, so calls
> kill_accessing_process() ... which is in the stack trace that led to the:
>
>    kernel BUG at include/linux/swapops.h:88!
>
> I'm not sure that I fully understand your patch. I guess that it is making sure to
> handle the case that the page has already been marked as poisoned?
>
>
> Anyway ... thanks for the quick fix. I hope the above helps write a good
> commit message to get this applied and backported to stable.

Sorry for the late reply. I have just returned from vacation.

>
> Tested-by: Tony Luck

Thanks to you both.

This appears to be an issue introduced by commit:

  0d206b5d2e0d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")

That commit replaced hwpoison_entry_to_pfn() with swp_offset_pfn(), which might
not be intended for use with hwpoison entries:

/*
 * A pfn swap entry is a special type of swap entry that always has a pfn stored
 * in the swap offset. *They are used to represent unaddressable device memory*
 * *and to restrict access to a page undergoing migration*
 */
static inline bool is_pfn_swap_entry(swp_entry_t entry)
{
        /* Make sure the swp offset can always store the needed fields */
        BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);

        return is_migration_entry(entry) || is_device_private_entry(entry) ||
               is_device_exclusive_entry(entry);
}

I think Oscar's patch is the right fix, and it would be better to amend the
corresponding comment too.

Thanks.

>
> -Tony
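
For illustration only, here is a small user-space sketch of why a hwpoison entry would trip a helper that insists on a pfn swap entry. It is not kernel code. All demo_* names and the struct layout are hypothetical, and it assumes the BUG at swapops.h:88 comes from swp_offset_pfn() checking is_pfn_swap_entry() on its argument. The real kernel packs type and offset into a single word; this model keeps them as separate fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified swap entry types (not the kernel's encoding). */
enum demo_swp_type {
	DEMO_SWP_MIGRATION,
	DEMO_SWP_DEVICE_PRIVATE,
	DEMO_SWP_DEVICE_EXCLUSIVE,
	DEMO_SWP_HWPOISON,
};

/* Hypothetical entry: the offset holds a pfn for "pfn swap entries". */
struct demo_swp_entry {
	enum demo_swp_type type;
	uint64_t offset;
};

/* Mirrors the check quoted above: hwpoison is not in the list. */
static bool demo_is_pfn_swap_entry(struct demo_swp_entry e)
{
	return e.type == DEMO_SWP_MIGRATION ||
	       e.type == DEMO_SWP_DEVICE_PRIVATE ||
	       e.type == DEMO_SWP_DEVICE_EXCLUSIVE;
}

/*
 * Model of a pfn lookup that only accepts pfn swap entries.
 * Returns false at the point where (by assumption) the kernel
 * helper would hit its BUG check; otherwise stores the pfn.
 */
static bool demo_swp_offset_pfn(struct demo_swp_entry e, uint64_t *pfn)
{
	if (!demo_is_pfn_swap_entry(e))
		return false;
	*pfn = e.offset;
	return true;
}
```

In this model, calling demo_swp_offset_pfn() on a DEMO_SWP_MIGRATION entry succeeds and yields the stored pfn, while the same call on a DEMO_SWP_HWPOISON entry is rejected, even though its offset holds a perfectly valid pfn.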