From: David Hildenbrand
Organization: Red Hat
To: Miaohe Lin, Oscar Salvador
Cc: HORIGUCHI NAOYA(堀口 直也), Naoya Horiguchi, "linux-mm@kvack.org", Andrew Morton, Mike Kravetz, Yang Shi, Muchun Song, "linux-kernel@vger.kernel.org"
Subject: Re: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug
Date: Wed, 11 May 2022 17:11:17 +0200
Message-ID: <0389eac1-af68-56b5-696d-581bb56878b9@redhat.com>
In-Reply-To: <465902dc-d3bf-7a93-da04-839faddcd699@huawei.com>

On 09.05.22 12:53, Miaohe Lin wrote:
> On 2022/5/9 17:58, Oscar Salvador wrote:
>> On Mon, May 09, 2022 at 05:04:54PM +0800, Miaohe Lin wrote:
>>>>> So that leaves us with either
>>>>>
>>>>> 1) Fail offlining -> no need to care about reonlining
>>>
>>> Maybe
>>> fail offlining will be a better alternative as we can get rid of many races
>>> between memory failure and memory offline? But no strong opinion. :)
>>
>> If taking care of those races is not an herculean effort, I'd go with
>> allowing offlining + disallow re-onlining.
>> Mainly because memory RAS stuff.
>
> This does make sense to me. Thanks. We can try to solve those races if
> offlining + disallow re-onlining is applied. :)
>
>>
>> Now, to the re-onlining thing, we'll have to come up with a way to check
>> whether a section contains hwpoisoned pages, so we do not have to go
>> and check every single page, as that will be really suboptimal.
>
> Yes, we need a stable and cheap way to do that.

My simplistic approach would be a simple flag/indicator in the memory block
devices that indicates that any page in the memory block was hwpoisoned. It's
easy to check that during memory onlining and fail it.

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 084d67fd55cc..3d0ef812e901 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -183,6 +183,9 @@ static int memory_block_online(struct memory_block *mem)
 	struct zone *zone;
 	int ret;
 
+	if (mem->hwpoisoned)
+		return -EHWPOISON;
+
 	zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group,
 				  start_pfn, nr_pages);

Once the problematic DIMM actually gets unplugged, the memory block devices
would get removed as well. So when hotplugging a new DIMM in the same
location, we could online that memory again.

Another place to store that would be the memory section; we'd then have to
check all underlying sections here. We're a bit short on flags in the memory
section I think, but they are easier to look up from other code eventually
than memory block devices.

-- 
Thanks,

David / dhildenb
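
For illustration only, the lifecycle of the block-level flag described above (poison, offline, refused re-online, DIMM replaced, online again) can be modeled in plain userspace C. This is a minimal sketch under stated assumptions, not kernel code: struct memory_block_sketch and all four helpers are hypothetical names that do not exist in the kernel, and EHWPOISON is defined locally with the asm-generic Linux value (133).

```c
/* Userspace model only; every name below except EHWPOISON is hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef EHWPOISON
#define EHWPOISON 133	/* asm-generic Linux value: memory page has hardware error */
#endif

/* Simplified stand-in for a memory block device. */
struct memory_block_sketch {
	bool online;		/* block is currently online */
	bool hwpoisoned;	/* sticky: some page in this block was hwpoisoned */
};

/* Assumed hook the memory-failure path would call for the affected block. */
void memory_block_mark_hwpoisoned(struct memory_block_sketch *mem)
{
	assert(mem != NULL);
	mem->hwpoisoned = true;
}

/* Offlining stays allowed even when the block contains poisoned pages. */
int memory_block_offline_sketch(struct memory_block_sketch *mem)
{
	assert(mem != NULL);
	mem->online = false;
	return 0;
}

/* Onlining is refused with a single flag check; no per-page scan is needed. */
int memory_block_online_sketch(struct memory_block_sketch *mem)
{
	assert(mem != NULL);
	if (mem->hwpoisoned)
		return -EHWPOISON;
	mem->online = true;
	return 0;
}

/*
 * Models unplugging the bad DIMM and hotplugging a new one in the same
 * location: the old block device goes away, so the new one starts clean.
 */
void memory_block_replace_dimm(struct memory_block_sketch *mem)
{
	assert(mem != NULL);
	mem->online = false;
	mem->hwpoisoned = false;
}
```

In this model, a block that was marked and then offlined returns -EHWPOISON from memory_block_online_sketch() until memory_block_replace_dimm() runs. A per-section variant would keep the same check but loop over every section the block covers.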