Date: Tue, 31 Oct 2023 11:21:42 -0400
From: Johannes Weiner
To: Michal Hocko
Cc: Gregory Price, linux-kernel@vger.kernel.org,
    linux-cxl@vger.kernel.org, linux-mm@kvack.org, ying.huang@intel.com,
    akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com,
    weixugc@google.com, apopple@nvidia.com, tim.c.chen@intel.com,
    dave.hansen@intel.com, shy828301@gmail.com,
    gregkh@linuxfoundation.org, rafael@kernel.org, Gregory Price
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
Message-ID: <20231031152142.GA3029315@cmpxchg.org>
References: <20231031003810.4532-1-gregory.price@memverge.com>

On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
> On Mon 30-10-23 20:38:06, Gregory Price wrote:
> > This patchset implements weighted interleave and adds a new sysfs
> > entry: /sys/devices/system/node/nodeN/accessM/il_weight.
> >
> > The il_weight of a node is used by mempolicy to implement weighted
> > interleave when `numactl --interleave=...` is invoked. By default
> > il_weight for a node is always 1, which preserves the default round
> > robin interleave behavior.
> >
> > Interleave weights may be set from 0-100, and denote the number of
> > pages that should be allocated from the node when interleaving
> > occurs.
> >
> > For example, if a node's interleave weight is set to 5, 5 pages
> > will be allocated from that node before the next node is scheduled
> > for allocations.
>
> I find this semantic rather weird TBH. First of all why do you think it
> makes sense to have those weights global for all users? What if
> different applications have different view on how to spred their
> interleaved memory?
>
> I do get that you might have a different tiers with largerly different
> runtime characteristics but why would you want to interleave them into a
> single mapping and have hard to predict runtime behavior?
>
> [...]
> > In this way it becomes possible to set an interleaving strategy
> > that fits the available bandwidth for the devices available on
> > the system. An example system:
> >
> > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
> > Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex
> >
> > In this setup, the effective weights for nodes 0-3 for a task
> > running on Node 0 may be [60, 20, 10, 10].
> >
> > This spreads memory out across devices which all have different
> > latency and bandwidth attributes at a way that can maximize the
> > available resources.
>
> OK, so why is this any better than not using any memory policy rely
> on demotion to push out cold memory down the tier hierarchy?
>
> What is the actual real life usecase and what kind of benefits you can
> present?

There are two things CXL gives you: additional capacity and additional
bus bandwidth.

The promotion/demotion mechanism is good for the capacity usecase,
where you have a nice hot/cold gradient in the workingset and want
placement accordingly across faster and slower memory.

The interleaving is useful when you have a flatter workingset
distribution and poorer access locality. In that case, the CPU caches
are less effective and the workload can be bus-bound. The workload
might fit entirely into DRAM, but concentrating it there is
suboptimal. Fanning it out in proportion to the relative performance
of each memory tier gives better results.

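As a rough sketch of what "fanning out in proportion to relative
performance" could look like, the Python below derives integer weights
from per-node bandwidth and then walks pages across nodes the way the
patchset description puts it (a node with weight N gets N pages before
the next node is scheduled). The bandwidth figures reuse the quoted
example system as seen from Node 0. The function names, the rounding
rule, and the assumption that the far-socket CXL node still delivers
64GB/s are illustrative and not taken from the patchset.

```python
from collections import Counter
from itertools import islice


def weights_from_bandwidth(bandwidth, scale=100):
    """Turn per-node bandwidth into integer interleave weights.

    Hypothetical helper: each weight is the node's share of the total
    bandwidth, scaled to roughly `scale` and clamped to at least 1.
    """
    total = sum(bandwidth.values())
    return {node: max(1, round(scale * bw / total))
            for node, bw in bandwidth.items()}


def weighted_interleave(weights):
    """Yield the node that receives each successive page.

    A node with weight N receives N consecutive pages before the next
    node is scheduled, following the patchset description.
    """
    if not any(w > 0 for w in weights.values()):
        raise ValueError("at least one node needs a positive weight")
    while True:
        for node, weight in weights.items():
            for _ in range(weight):
                yield node


# Bandwidth in GB/s as seen by a task on Node 0 (assumed figures).
bandwidth_from_node0 = {0: 400, 1: 200, 2: 64, 3: 64}
weights = weights_from_bandwidth(bandwidth_from_node0)

# Distribution of the first 1000 page allocations across nodes.
pages = Counter(islice(weighted_interleave(weights), 1000))
```

With these inputs the weights come out as 55/27/9/9 for nodes 0-3,
which is in the same ballpark as the [60, 20, 10, 10] from the quoted
example.
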
We experimented with datacenter workloads on such machines last year
and found significant performance benefits:

https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/

This hopefully also explains why it's a global setting. The usecase is
different from conventional NUMA interleaving, which is used as a
locality measure: spread shared data evenly between compute
nodes. This one isn't about locality - the CXL tier doesn't have local
compute. Instead, the optimal spread is based on hardware parameters,
which is a global property rather than a per-workload one.