Message-ID: <9f05470b3188c2a81696841a3a3e007e99caecea.camel@intel.com>
Subject: Re: [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers
From: Ying Huang
To: "Aneesh Kumar K.V", linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
    Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
    Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
    Jagdish Gediya, Baolin Wang, David Rientjes
Date: Fri, 10 Jun 2022 14:04:26 +0800
In-Reply-To: <87sfoffcfz.fsf@linux.ibm.com>
References: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com>
    <20220603134237.131362-10-aneesh.kumar@linux.ibm.com>
    <6e94b7e2a6192e4cacba1db3676b5b5cf9b98eac.camel@intel.com>
    <11f94e0c50f17f4a6a2f974cb69a1ae72853e2be.camel@intel.com>
    <232817e0-24fd-e022-6c92-c260f7f01f8a@linux.ibm.com>
    <87sfoffcfz.fsf@linux.ibm.com>

On Wed, 2022-06-08 at 20:07 +0530, Aneesh Kumar K.V wrote:
> Ying Huang writes:
>
> ....
>
> > > > > is this good (not tested)?
> > > > >
> > > > > /*
> > > > >  * Build the allowed promotion mask. Promotion is allowed
> > > > >  * from a higher memory tier to a lower memory tier only if
> > > > >  * the lower memory tier doesn't include compute. We want to
> > > > >  * skip promotion from a memory tier if any node which is
> > > > >  * part of that memory tier has CPUs. Once we detect such
> > > > >  * a memory tier, we consider that tier as the top tier from
> > > > >  * which promotion is not allowed.
> > > > >  */
> > > > > list_for_each_entry_reverse(memtier, &memory_tiers, list) {
> > > > > 	nodes_and(allowed, node_states[N_CPU], memtier->nodelist);
> > > > > 	if (nodes_empty(allowed))
> > > > > 		nodes_or(promotion_mask, promotion_mask, memtier->nodelist);
> > > > > 	else
> > > > > 		break;
> > > > > }
> > > > >
> > > > > and then
> > > > >
> > > > > static inline bool node_is_toptier(int node)
> > > > > {
> > > > > 	return !node_isset(node, promotion_mask);
> > > > > }
> > > >
> > > > This should work. But it appears unnatural. So I don't think we
> > > > should keep adding more and more node masks to work around the
> > > > design decision that we cannot access memory tier information
> > > > directly. All of this becomes simple and natural if we can
> > > > access memory tier information directly.
> > >
> > > How do you derive whether a node is toptier if we have memtier
> > > details in pgdat?
> >
> > pgdat -> memory tier -> rank
> >
> > Then we can compare this rank with the fast memory rank. The fast
> > memory rank can be calculated dynamically at appropriate places.
>
> This is what I am testing now. We still need to closely audit the
> lock-free access to NODE_DATA()->memtier. For v6 I will keep this as a
> separate patch, and once we all agree that it is safe, I will fold it
> back.

Thanks for doing this. We finally have a way to access memory_tier in
the hot path.

[snip]

> +/*
> + * Called with memory_tier_lock.
> + * Hence the device references cannot
> + * be dropped during this function.
> + */
> +static void memtier_node_clear(int node, struct memory_tier *memtier)
> +{
> +	pg_data_t *pgdat;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return;
> +
> +	rcu_assign_pointer(pgdat->memtier, NULL);
> +	/*
> +	 * Make sure the read side sees the NULL value before we clear
> +	 * the node from the nodelist.
> +	 */
> +	synchronize_rcu();
> +	node_clear(node, memtier->nodelist);
> +}
> +
> +static void memtier_node_set(int node, struct memory_tier *memtier)
> +{
> +	pg_data_t *pgdat;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return;
> +	/*
> +	 * Make sure we mark the memtier NULL before we assign the new
> +	 * memory tier to the NUMA node. This makes sure that anybody
> +	 * looking at NODE_DATA finds a NULL memtier or one that is
> +	 * still valid.
> +	 */
> +	rcu_assign_pointer(pgdat->memtier, NULL);
> +	synchronize_rcu();

Per my understanding, in your code, when we change pgdat->memtier, we
will call synchronize_rcu() twice. IMHO, once should be OK. That is,
something like below,

	rcu_assign_pointer(pgdat->memtier, NULL);
	node_clear(node, memtier->nodelist);
	synchronize_rcu();
	node_set(node, new_memtier->nodelist);
	rcu_assign_pointer(pgdat->memtier, new_memtier);

In this way, there will be 3 states:

1. prev

	pgdat->memtier == old_memtier
	node_isset(node, old_memtier->node_list)
	!node_isset(node, new_memtier->node_list)

2. transitioning

	pgdat->memtier == NULL
	!node_isset(node, old_memtier->node_list)
	!node_isset(node, new_memtier->node_list)

3. after

	pgdat->memtier == new_memtier
	!node_isset(node, old_memtier->node_list)
	node_isset(node, new_memtier->node_list)

The real state observed by a reader may be 1, 2, 3, 1+2, or 2+3, but it
will never be 1+3. I think this satisfies our requirements.
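[Editor's note: the single-grace-period sequence and its three states can be sketched in plain userspace C. This is only an illustrative model, not the kernel code: `rcu_assign_pointer()`/`synchronize_rcu()` are reduced to ordinary stores, `nodemask_t` to a `uint64_t`, and all type and function names here (`struct pgdat`, `move_node`, the local `node_is_toptier`) are stand-ins. The reader side mirrors the quoted `node_is_toptier()`: a NULL memtier observed mid-transition is conservatively treated as top tier.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins; the real kernel types and RCU primitives differ. */
struct memory_tier {
    uint64_t nodelist;          /* models nodemask_t */
    int rank;
};

struct pgdat {
    struct memory_tier *memtier;    /* models pgdat->memtier (RCU-protected) */
};

static int top_tier_rank = 200;     /* assumed threshold for this sketch */

/*
 * Reader side, mirroring the quoted node_is_toptier(): a NULL memtier
 * (seen while the node is transitioning, state 2) is treated as top
 * tier, so promotion is conservatively skipped for that node.
 */
static bool node_is_toptier(const struct pgdat *pgdat)
{
    const struct memory_tier *memtier = pgdat->memtier;

    if (!memtier)
        return true;
    return memtier->rank >= top_tier_rank;
}

/*
 * Writer side: the single-grace-period sequence from the mail.  A
 * reader can observe state 2 (NULL memtier, node in neither nodelist),
 * but never old and new tier membership at the same time (1+3).
 */
static void move_node(struct pgdat *pgdat, int node,
                      struct memory_tier *old_tier,
                      struct memory_tier *new_tier)
{
    pgdat->memtier = NULL;                  /* rcu_assign_pointer(.., NULL) */
    old_tier->nodelist &= ~(1ULL << node);  /* node_clear() */
    /* synchronize_rcu() would run here -- once, not twice */
    new_tier->nodelist |= 1ULL << node;     /* node_set() */
    pgdat->memtier = new_tier;              /* rcu_assign_pointer(.., new) */
}
```

Between the store of NULL and the final publish, a concurrent `node_is_toptier()` returns true, which matches the "transitioning" state 2 above.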
[snip]

>  static int __node_create_and_set_memory_tier(int node, int tier)
>  {
>  	int ret = 0;
> @@ -253,7 +318,7 @@ static int __node_create_and_set_memory_tier(int node, int tier)
>  			goto out;
>  		}
>  	}
> -	node_set(node, memtier->nodelist);
> +	memtier_node_set(node, memtier);
>  out:
>  	return ret;
>  }
> @@ -275,12 +340,12 @@ int node_create_and_set_memory_tier(int node, int tier)
>  	if (current_tier->dev.id == tier)
>  		goto out;
> -	node_clear(node, current_tier->nodelist);
> +	memtier_node_clear(node, current_tier);
>
> +	node_set(node, memtier->nodelist);
> +	rcu_assign_pointer(pgdat->memtier, memtier);
> +}
> +
> +bool node_is_toptier(int node)
> +{
> +	bool toptier;
> +	pg_data_t *pgdat;
> +	struct memory_tier *memtier;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return false;
> +
> +	rcu_read_lock();
> +	memtier = rcu_dereference(pgdat->memtier);
> +	if (!memtier) {
> +		toptier = true;
> +		goto out;
> +	}
> +	if (memtier->rank >= top_tier_rank)
> +		toptier = true;
> +	else
> +		toptier = false;
> +out:
> +	rcu_read_unlock();
> +	return toptier;
> +}
>
>  	ret = __node_create_and_set_memory_tier(node, tier);
>  	if (ret) {
>  		/* reset it back to older tier */
> -		node_set(node, current_tier->nodelist);
> +		memtier_node_set(node, current_tier);
>  		goto out;
>  	}
>
> [snip]
>
>  static int __init memory_tier_init(void)
>  {
> -	int ret;
> +	int ret, node;
>  	struct memory_tier *memtier;
>
>  	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> @@ -766,7 +829,13 @@ static int __init memory_tier_init(void)
>  		panic("%s() failed to register memory tier: %d\n", __func__, ret);
>
>  	/* CPU only nodes are not part of memory tiers. */
> -	memtier->nodelist = node_states[N_MEMORY];
> +	for_each_node_state(node, N_MEMORY) {
> +		/*
> +		 * Should be safe to do this early in the boot.
> +		 */
> +		NODE_DATA(node)->memtier = memtier;

Not absolutely required.
But IMHO it's more consistent to use rcu_assign_pointer() here.

> +		node_set(node, memtier->nodelist);
> +	}
>  	migrate_on_reclaim_init();
>
>  	return 0;

Best Regards,
Huang, Ying