From: "Huang, Ying"
To: Hao Xiang
Cc: "aneesh.kumar@linux.ibm.com", Jonathan Cameron, Gregory Price,
 Srinivasulu Thanneeru, Srinivasulu Opensrc, "linux-cxl@vger.kernel.org",
 "linux-mm@kvack.org", "dan.j.williams@intel.com", "mhocko@suse.com",
 "tj@kernel.org", "john@jagalactic.com", Eishan Mirakhur,
 Vinicius Tavares Petrucci, Ravis OpenSrc, "linux-kernel@vger.kernel.org",
 Johannes Weiner, Wei Xu, "Ho-Ren (Jack) Chuang"
Subject: Re: [External] Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers
In-Reply-To: (Hao Xiang's message of "Fri, 12 Jan 2024 00:14:04 -0800")
References: <87fs00njft.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87edezc5l1.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87a5pmddl5.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87wmspbpma.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o7dv897s.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <20240109155049.00003f13@Huawei.com>
 <20240110141821.0000370d@Huawei.com>
 <87il3z2g03.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 15 Jan 2024 09:24:13 +0800
Message-ID: <871qaj2xtu.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
X-Mailing-List: linux-kernel@vger.kernel.org

Hao Xiang writes:

> On Thu, Jan 11, 2024 at 11:02 PM Huang, Ying wrote:
>>
>> Hao Xiang writes:
>>
>> > On Wed, Jan 10, 2024 at 6:18 AM Jonathan Cameron
>> > wrote:
>> >>
>> >> On Tue, 9 Jan 2024 16:28:15 -0800
>> >> Hao Xiang wrote:
>> >>
>> >> > On Tue, Jan 9, 2024 at 9:59 AM Gregory Price wrote:
>> >> > >
>> >> > > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
>> >> > > > On Tue, 09 Jan 2024 11:41:11 +0800
>> >> > > > "Huang, Ying" wrote:
>> >> > > > > Gregory Price writes:
>> >> > > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
>> >> > > > > It's possible for the performance of a NUMA node to change if we
>> >> > > > > hot-remove a memory device, then hot-add another different memory
>> >> > > > > device.  It's hoped that the CDAT changes too.
>> >> > > >
>> >> > > > Not supported, but ACPI has _HMA methods to in theory allow changing
>> >> > > > HMAT values based on firmware notifications...  So we 'could' make
>> >> > > > it work for HMAT based description.
>> >> > > >
>> >> > > > Ultimately my current thinking is we'll end up emulating CXL type3
>> >> > > > devices (hiding topology complexity) and you can update CDAT but
>> >> > > > IIRC that is only meant to be for degraded situations - so if you
>> >> > > > want multiple performance regions, CDAT should describe them from the start.
>> >> > > >
>> >> > >
>> >> > > That was my thought.
>> >> > > I don't think it's particularly *realistic* for
>> >> > > HMAT/CDAT values to change at runtime, but I can imagine a case where
>> >> > > it could be valuable.
>> >> > >
>> >> > > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
>> >> > > > > >
>> >> > > > > > This group wants to enable passing CXL memory through to KVM/QEMU
>> >> > > > > > (i.e. host CXL expander memory passed through to the guest), and
>> >> > > > > > allow the guest to apply memory tiering.
>> >> > > > > >
>> >> > > > > > There are multiple issues with this, presently:
>> >> > > > > >
>> >> > > > > > 1. The QEMU CXL virtual device is not and probably never will be
>> >> > > > > >    performant enough to be a commodity class virtualization.
>> >> > > >
>> >> > > > I'd flex that a bit - we will end up with a solution for virtualization but
>> >> > > > it isn't the emulation that is there today because it's not possible to
>> >> > > > emulate some of the topology in a performant manner (interleaving with sub
>> >> > > > page granularity / interleaving at all (to a lesser degree)). There are
>> >> > > > ways to do better than we are today, but they start to look like
>> >> > > > software disaggregated memory setups (think lots of page faults in the host).
>> >> > > >
>> >> > >
>> >> > > Agreed, the emulated device as-is can't be the virtualization device,
>> >> > > but it doesn't mean it can't be the basis for it.
>> >> > >
>> >> > > My thought is, if you want to pass host CXL *memory* through to the
>> >> > > guest, you don't actually care to pass CXL *control* through to the
>> >> > > guest.  That control lies pretty squarely with the host/hypervisor.
>> >> > >
>> >> > > So, at least in theory, you can just cut the type3 device out of the
>> >> > > QEMU configuration entirely and just pass it through as a distinct numa
>> >> > > node with specific hmat qualities.
>> >> > >
>> >> > > Barring that, if we must go through the type3 device, the question is
>> >> > > how difficult would it be to just make a stripped down type3 device
>> >> > > to provide the informational components, but hack off anything
>> >> > > topology/interleave related? Then you just do direct passthrough as you
>> >> > > described below.
>> >> > >
>> >> > > qemu/kvm would report errors if you tried to touch the naughty bits.
>> >> > >
>> >> > > The second question is... is that device "compliant" or does it need
>> >> > > super special handling from the kernel driver :D?  If what I described
>> >> > > is not "compliant", then it's probably a bad idea, and KVM/QEMU should
>> >> > > just hide the CXL device entirely from the guest (for this use case)
>> >> > > and just pass the memory through as a numa node.
>> >> > >
>> >> > > Which gets us back to: The memory-tiering component needs a way to
>> >> > > place nodes in different tiers based on HMAT/CDAT/User Whim. All three
>> >> > > of those seem like totally valid ways to go about it.
>> >> > >
>> >> > > > > >
>> >> > > > > > 2. When passing memory through as an explicit NUMA node, but not as
>> >> > > > > >    part of a CXL memory device, the nodes are lumped together in the
>> >> > > > > >    DRAM tier.
>> >> > > > > >
>> >> > > > > > None of this has to do with firmware.
>> >> > > > > >
>> >> > > > > > Memory-type is an awful way of denoting membership of a tier, but we
>> >> > > > > > have HMAT information that can be passed through via QEMU:
>> >> > > > > >
>> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node0 \
>> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node1 \
>> >> > > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
>> >> > > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
>> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
>> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
>> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
>> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
>> >> > > > > >
>> >> > > > > > Not only would it be nice if we could change tier membership based on
>> >> > > > > > this data, it's realistically the only way to allow guests to accomplish
>> >> > > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
>> >> > > >
>> >> > > > This I fully agree with.  There will be systems with a bunch of normal DDR with different
>> >> > > > access characteristics irrespective of CXL. + likely HMAT solutions will be used
>> >> > > > before we get anything more complex in place for CXL.
>> >> > > >
>> >> > >
>> >> > > Had not even considered this, but that's completely accurate as well.
>> >> > >
>> >> > > And more discretely: What of devices that don't provide HMAT/CDAT? That
>> >> > > isn't necessarily a violation of any standard.  There probably could be
>> >> > > a release valve for us to still make those devices useful.
>> >> > >
>> >> > > The concern I have with not implementing a movement mechanism *at all*
>> >> > > is that a one-size-fits-all initial-placement heuristic feels gross
>> >> > > when we're, at least ideologically, moving toward "software defined memory".
>> >> > >
>> >> > > Personally I think the movement mechanism is a good idea that gets folks
>> >> > > where they're going sooner, and it doesn't hurt anything by existing. We
>> >> > > can change the initial placement mechanism too.
>> >> >
>> >> > I think providing users a way to "FIX" the memory tiering is a backup
>> >> > option. Given that DDRs with different access characteristics provide
>> >> > the relevant CDAT/HMAT information, the kernel should be able to
>> >> > correctly establish memory tiering on boot.
>> >>
>> >> Include hotplug and I'll be happier!  I know that's messy though.
>> >>
>> >> > Current memory tiering code has
>> >> > 1) memory_tier_init() to iterate through all boot onlined memory
>> >> > nodes. All nodes are assumed to be fast tier (adistance
>> >> > MEMTIER_ADISTANCE_DRAM is used).
>> >> > 2) dev_dax_kmem_probe to iterate through all devdax controlled memory
>> >> > nodes. This is the place the kernel reads the memory attributes from
>> >> > HMAT and recognizes the memory nodes into the correct tier (devdax
>> >> > controlled CXL, pmem, etc).
>> >> > If we want DDRs with different memory characteristics to be put into
>> >> > the correct tier (as in the guest VM memory tiering case), we probably
>> >> > need a third path to iterate the boot onlined memory nodes and also be
>> >> > able to read their memory attributes. I don't think we can do that in
>> >> > 1) because the ACPI subsystem is not yet initialized.
>> >>
>> >> Can we move it later in general?  Or drag HMAT parsing earlier?
>> >> ACPI table availability is pretty early, it's just that we don't bother
>> >> with HMAT because nothing early uses it.
>> >> IIRC SRAT parsing occurs way before memory_tier_init() will be called.
>> >
>> > I tested the call sequence under a debugger earlier. hmat_init() is
>> > called after memory_tier_init(). Let me poke around and see what our
>> > options are.
>>
>> This sounds reasonable.
>>
>> Please keep in mind that we need a way to identify the baseline memory
>> type (default_dram_type).  A simple method is to use NUMA nodes with CPU
>> attached.  But I remember that Aneesh said that some NUMA nodes without
>> CPU will need to be put in default_dram_type too on their systems.  We
>> need a way to identify that.
>
> Yes, I am doing some prototyping the way you described. In
> memory_tier_init(), we will just set the memory tier for the NUMA
> nodes with CPU. In hmat_init(), I am trying to call back to mm to
> finish the memory tier initialization for the CPUless NUMA nodes. If a
> CPUless numa node can't get the effective adistance from
> mt_calc_adistance(), we will fall back to adding that node to
> default_dram_type.

Sounds reasonable to me.

> The other thing I want to experiment with is to call mt_calc_adistance() on
> a memory node with CPU and see what kind of adistance will be
> returned.

Anyway, we need a baseline to start from.  The abstract distance is
calculated from the ratio of the performance of a node to that of the
default DRAM node.

--
Best Regards,
Huang, Ying
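
For illustration only, the ratio-based calculation and the fallback for
CPUless nodes discussed above can be sketched in a few lines of Python.
This is not the kernel code. The base value 4608 for the default DRAM
abstract distance, the use of a single latency and a single bandwidth
value per node, and the helper names calc_adistance() and
assign_adistance() are all assumptions made for this example. The
performance numbers are the ones from the QEMU hmat-lb options quoted in
the thread.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed base abstract distance for default DRAM (illustrative value only).
ADISTANCE_DRAM = 4608


@dataclass(frozen=True)
class NodePerf:
    """Simplified per-node performance: one latency and one bandwidth value."""
    latency: int    # access latency, as passed to hmat-lb
    bandwidth: int  # access bandwidth, as passed to hmat-lb


def calc_adistance(perf: NodePerf, dram: NodePerf) -> int:
    """Scale the DRAM base distance by the latency and inverse bandwidth ratios.

    Higher latency or lower bandwidth than the default DRAM node yields a
    larger abstract distance, i.e. a slower tier.
    """
    adist = ADISTANCE_DRAM * perf.latency // dram.latency
    return adist * dram.bandwidth // perf.bandwidth


def assign_adistance(has_cpu: bool, perf: Optional[NodePerf],
                     dram: NodePerf) -> int:
    """Hypothetical policy: nodes with CPUs, and CPUless nodes without
    performance data, fall back to the default DRAM distance."""
    if has_cpu or perf is None:
        return ADISTANCE_DRAM
    return calc_adistance(perf, dram)


# Values from the QEMU example: node 0 is the baseline, node 1 is slower.
node0 = NodePerf(latency=10, bandwidth=10485760)
node1 = NodePerf(latency=20, bandwidth=5242880)

print(assign_adistance(True, node0, node0))   # 4608
print(assign_adistance(False, node1, node0))  # 18432
print(assign_adistance(False, None, node0))   # 4608
```

With twice the latency and half the bandwidth of the baseline, node 1
ends up at four times the base distance under these assumptions, while a
CPUless node with no performance data stays at the DRAM default.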