From: "Huang, Ying"
To: "Ho-Ren (Jack) Chuang"
Cc: "Gregory Price", aneesh.kumar@linux.ibm.com, mhocko@suse.com,
 tj@kernel.org, john@jagalactic.com, "Eishan Mirakhur",
 "Vinicius Tavares Petrucci", "Ravis OpenSrc", "Alistair Popple",
 "Srinivasulu Thanneeru", Dan Williams, Vishal Verma, Dave Jiang,
 Andrew Morton, nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 "Ho-Ren (Jack) Chuang", "Ho-Ren (Jack) Chuang", qemu-devel@nongnu.org,
 Hao Xiang
Subject: Re: [PATCH v4 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
In-Reply-To: <20240322070356.315922-3-horenchuang@bytedance.com> (Ho-Ren Chuang's message of "Fri, 22 Mar 2024 07:03:55 +0000")
References: <20240322070356.315922-1-horenchuang@bytedance.com>
 <20240322070356.315922-3-horenchuang@bytedance.com>
Date: Fri, 22 Mar 2024 16:39:44 +0800
Message-ID: <87cyrmr773.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
X-Mailing-List: linux-kernel@vger.kernel.org

"Ho-Ren (Jack) Chuang" writes:

> The current implementation treats emulated memory devices, such as
> CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
> (E820_TYPE_RAM). However, these emulated devices have different
> characteristics than traditional DRAM, making it important to
> distinguish them.
> Thus, we modify the tiered memory initialization process
> to introduce a delay specifically for CPUless NUMA nodes. This delay
> ensures that the memory tier initialization for these nodes is deferred
> until HMAT information is obtained during the boot process. Finally,
> demotion tables are recalculated at the end.
>
> * late_initcall(memory_tier_late_init);
> Some device drivers may have initialized memory tiers between
> `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> online memory nodes and configuring memory tiers. They should be excluded
> in the late init.
>
> * Handle cases where there is no HMAT when creating memory tiers
> There is a scenario where a CPUless node does not provide HMAT information.
> If no HMAT is specified, it falls back to using the default DRAM tier.
>
> * Introduce another new lock `default_dram_perf_lock` for adist calculation
> In the current implementation, iterating through CPUless nodes requires
> holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> trying to acquire the same lock, leading to a potential deadlock.
> Therefore, we propose introducing a standalone `default_dram_perf_lock` to
> protect `default_dram_perf_*`. This approach not only avoids deadlock
> but also prevents holding a large lock simultaneously.
>
> * Upgrade `set_node_memory_tier` to support additional cases, including
> default DRAM, late CPUless, and hot-plugged initializations.
> To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> handle cases where memtype is not initialized and where HMAT information is
> available.
>
> * Introduce `default_memory_types` for those memory types that are not
> initialized by device drivers.
> Because late initialized memory and default DRAM memory need to be managed,
> a default memory type is created for storing all memory types that are
> not initialized by device drivers and as a fallback.
>
> Signed-off-by: Ho-Ren (Jack) Chuang
> Signed-off-by: Hao Xiang
> ---
>  mm/memory-tiers.c | 73 ++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 63 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 974af10cfdd8..9396330fa162 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -36,6 +36,11 @@ struct node_memory_type_map {
>
>  static DEFINE_MUTEX(memory_tier_lock);
>  static LIST_HEAD(memory_tiers);
> +/*
> + * The list is used to store all memory types that are not created
> + * by a device driver.
> + */
> +static LIST_HEAD(default_memory_types);
>  static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
>  struct memory_dev_type *default_dram_type;
>
> @@ -108,6 +113,7 @@ static struct demotion_nodes *node_demotion __read_mostly;
>
>  static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
>
> +static DEFINE_MUTEX(default_dram_perf_lock);

It would be better to add a comment describing what this lock protects.

>  static bool default_dram_perf_error;
>  static struct access_coordinate default_dram_perf;
>  static int default_dram_perf_ref_nid = NUMA_NO_NODE;
> @@ -505,7 +511,8 @@ static inline void __init_node_memory_type(int node, struct memory_dev_type *mem
>  static struct memory_tier *set_node_memory_tier(int node)
>  {
>  	struct memory_tier *memtier;
> -	struct memory_dev_type *memtype;
> +	struct memory_dev_type *mtype;

mtype may now be used uninitialized below.
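
When node_memory_types[node].memtype is already set, the new if block
is skipped and mtype reaches __init_node_memory_type() without ever
being assigned. One possible fix is to initialize mtype where it is
declared. Below is a minimal userspace sketch of that idea, not kernel
code: the struct layouts, init_node_memory_type(), set_node_type(),
cxl_type, MAX_NODES and the adistance values are simplified stand-ins
assumed for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4	/* stand-in for MAX_NUMNODES */

/* Simplified stand-ins; the real structures live in mm/memory-tiers.c. */
struct memory_dev_type {
	int adistance;
};

struct node_memory_type_map {
	struct memory_dev_type *memtype;
	int map_count;
};

static struct memory_dev_type dram_type = { 100 };	/* arbitrary value */
static struct memory_dev_type cxl_type = { 200 };	/* arbitrary value */
static struct memory_dev_type *default_dram_type = &dram_type;
static struct node_memory_type_map node_memory_types[MAX_NODES];

/* Assumed model of __init_node_memory_type(): set the type once, count users. */
static void init_node_memory_type(int node, struct memory_dev_type *memtype)
{
	if (!node_memory_types[node].memtype)
		node_memory_types[node].memtype = memtype;
	if (node_memory_types[node].memtype == memtype)
		node_memory_types[node].map_count++;
}

/*
 * Models the type-selection part of set_node_memory_tier(). "found" stands
 * in for the result of mt_find_alloc_memory_type(); NULL means the lookup
 * failed (the kernel helper reports failure with an ERR_PTR() instead).
 */
static struct memory_dev_type *set_node_type(int node,
					     struct memory_dev_type *found)
{
	/* Initialized at declaration, so no path reads an indeterminate value. */
	struct memory_dev_type *mtype = default_dram_type;

	assert(node >= 0 && node < MAX_NODES);

	if (node_memory_types[node].memtype == NULL && found != NULL)
		mtype = found;

	init_node_memory_type(node, mtype);
	return node_memory_types[node].memtype;
}
```

With this default, a node whose type was already set by a driver keeps
that type, which matches what the removed call
__init_node_memory_type(node, default_dram_type) did before the patch.
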
> +	int adist = MEMTIER_ADISTANCE_DRAM;
>  	pg_data_t *pgdat = NODE_DATA(node);
>
>
> @@ -514,11 +521,20 @@ static struct memory_tier *set_node_memory_tier(int node)
>  	if (!node_state(node, N_MEMORY))
>  		return ERR_PTR(-EINVAL);
>
> -	__init_node_memory_type(node, default_dram_type);
> +	mt_calc_adistance(node, &adist);
> +	if (node_memory_types[node].memtype == NULL) {
> +		mtype = mt_find_alloc_memory_type(adist, &default_memory_types);
> +		if (IS_ERR(mtype)) {
> +			mtype = default_dram_type;
> +			pr_info("Failed to allocate a memory type. Fall back.\n");
> +		}
> +	}
>
> -	memtype = node_memory_types[node].memtype;
> -	node_set(node, memtype->nodes);
> -	memtier = find_create_memory_tier(memtype);
> +	__init_node_memory_type(node, mtype);
> +
> +	mtype = node_memory_types[node].memtype;
> +	node_set(node, mtype->nodes);
> +	memtier = find_create_memory_tier(mtype);
>  	if (!IS_ERR(memtier))
>  		rcu_assign_pointer(pgdat->memtier, memtier);
>  	return memtier;
> @@ -655,6 +671,34 @@ void mt_put_memory_types(struct list_head *memory_types)
>  }
>  EXPORT_SYMBOL_GPL(mt_put_memory_types);
>
> +/*
> + * This is invoked via `late_initcall()` to initialize memory tiers for
> + * CPU-less memory nodes after driver initialization, which is
> + * expected to provide `adistance` algorithms.
> + */
> +static int __init memory_tier_late_init(void)
> +{
> +	int nid;
> +
> +	mutex_lock(&memory_tier_lock);
> +	for_each_node_state(nid, N_MEMORY)
> +		if (!node_state(nid, N_CPU) &&
> +		    node_memory_types[nid].memtype == NULL)
> +			/*
> +			 * Some device drivers may have initialized memory tiers
> +			 * between `memory_tier_init()` and `memory_tier_late_init()`,
> +			 * potentially bringing online memory nodes and
> +			 * configuring memory tiers. Exclude them here.
> +			 */
> +			set_node_memory_tier(nid);
> +
> +	establish_demotion_targets();
> +	mutex_unlock(&memory_tier_lock);
> +
> +	return 0;
> +}
> +late_initcall(memory_tier_late_init);
> +
>  static void dump_hmem_attrs(struct access_coordinate *coord, const char *prefix)
>  {
>  	pr_info(
> @@ -668,7 +712,7 @@ int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
>  {
>  	int rc = 0;
>
> -	mutex_lock(&memory_tier_lock);
> +	mutex_lock(&default_dram_perf_lock);
>  	if (default_dram_perf_error) {
>  		rc = -EIO;
>  		goto out;
> @@ -716,7 +760,7 @@ int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
>  	}
>
>  out:
> -	mutex_unlock(&memory_tier_lock);
> +	mutex_unlock(&default_dram_perf_lock);
>  	return rc;
>  }
>
> @@ -732,7 +776,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
>  	    perf->read_bandwidth + perf->write_bandwidth == 0)
>  		return -EINVAL;
>
> -	mutex_lock(&memory_tier_lock);
> +	mutex_lock(&default_dram_perf_lock);
>  	/*
>  	 * The abstract distance of a memory node is in direct proportion to
>  	 * its memory latency (read + write) and inversely proportional to its
> @@ -745,7 +789,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
>  		(default_dram_perf.read_latency + default_dram_perf.write_latency) *
>  		(default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) /
>  		(perf->read_bandwidth + perf->write_bandwidth);
> -	mutex_unlock(&memory_tier_lock);
> +	mutex_unlock(&default_dram_perf_lock);
>
>  	return 0;
>  }
> @@ -858,7 +902,8 @@ static int __init memory_tier_init(void)
>  	 * For now we can have 4 faster memory tiers with smaller adistance
>  	 * than default DRAM tier.
>  	 */
> -	default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> +	default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
> +						      &default_memory_types);
>  	if (IS_ERR(default_dram_type))
>  		panic("%s() failed to allocate default DRAM tier\n", __func__);
>
> @@ -868,6 +913,14 @@ static int __init memory_tier_init(void)
>  	 * types assigned.
>  	 */
>  	for_each_node_state(node, N_MEMORY) {
> +		if (!node_state(node, N_CPU))
> +			/*
> +			 * Defer memory tier initialization on CPUless numa nodes.
> +			 * These will be initialized after firmware and devices are
> +			 * initialized.
> +			 */
> +			continue;
> +
>  		memtier = set_node_memory_tier(node);
>  		if (IS_ERR(memtier))
>  			/*

--
Best Regards,
Huang, Ying