Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
From: Tim Chen
To: Shakeel Butt
Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Dave Hansen, Ying Huang,
    Dan Williams, David Rientjes, Linux MM, Cgroups, LKML, Greg Thelen, Wei Xu
Date: Wed, 14 Apr 2021 17:42:39 -0700
References: <58e5dcc9-c134-78de-6965-7980f8596b57@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 4/12/21 12:20 PM, Shakeel Butt wrote:
>>
>> memory_t0.current	Current usage of tier 0 memory by the cgroup.
>>
>> memory_t0.min	If tier 0 memory used by the cgroup falls below this low
>> 		boundary, the memory will not be subjected to demotion
>> 		to lower tiers to free up memory at tier 0.
>>
>> memory_t0.low	Above this boundary, the tier 0 memory will be subjected
>> 		to demotion. The demotion pressure will be proportional
>> 		to the overage.
>>
>> memory_t0.high	If tier 0 memory used by the cgroup exceeds this high
>> 		boundary, allocation of tier 0 memory by the cgroup will
>> 		be throttled. The tier 0 memory used by this cgroup
>> 		will also be subjected to heavy demotion.
>>
>> memory_t0.max	This will be a hard usage limit of tier 0 memory on the cgroup.
>>
>> If needed, memory_t[12...].current/min/low/high for additional tiers can be added.
>> This closely follows the design of the general memory controller interface.
>>
>> Will such an interface look sane and acceptable to everyone?
>>
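(To make the proposal above a bit more concrete, here is how I picture an
admin driving these knobs.  This is only a hypothetical sketch: none of these
files exist yet, and the cgroup name and the values below are made up.  The
assumption is that the files would show up under the cgroup v2 hierarchy next
to the existing memory.* files.

	# hypothetical example: protect and cap tier 0 (DRAM) usage for a cgroup
	mkdir /sys/fs/cgroup/workload
	echo 4G  > /sys/fs/cgroup/workload/memory_t0.min   # below 4G, no demotion
	echo 6G  > /sys/fs/cgroup/workload/memory_t0.low   # demotion pressure above 6G
	echo 8G  > /sys/fs/cgroup/workload/memory_t0.high  # throttle + heavy demotion above 8G
	echo 10G > /sys/fs/cgroup/workload/memory_t0.max   # hard cap on tier 0 usage
	cat /sys/fs/cgroup/workload/memory_t0.current      # current tier 0 usage

The semantics are meant to mirror the existing memory.min/low/high/max files,
just scoped to tier 0 memory.)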
> I have a couple of questions. Let's suppose we have a two socket
> system. Node 0 (DRAM+CPUs), Node 1 (DRAM+CPUs), Node 2 (PMEM on socket
> 0 along with Node 0) and Node 3 (PMEM on socket 1 along with Node 1).
> Based on the tier definition of this patch series, tier_0: {node_0,
> node_1} and tier_1: {node_2, node_3}.
>
> My questions are:
>
> 1) Can we assume that the cost of access within a tier will always be
> less than the cost of access from the tier? (node_0 <-> node_1 vs
> node_0 <-> node_2)

I do assume that higher tier memory offers better performance (i.e. lower
access latency) than lower tier memory.  Otherwise, it defeats the whole
purpose of promoting hot memory from a lower tier to a higher tier and
demoting cold memory to a lower tier.

The tier assumption is embedded once we define the promotion/demotion
relationship between the NUMA nodes.  So if

	node_m  ----demotes---->  node_n
	node_m  <---promotes----  node_n

then node_m is one tier higher than node_n.  This promotion/demotion
relationship between the nodes is the underpinning of Dave and Ying's
demotion and promotion patch sets.

> 2) If yes to (1), is that assumption future proof? Will the future
> systems with DRAM over CXL support have the same characteristics?

I think if you configure a promotion/demotion relationship between DRAM
over CXL and local-socket connected DRAM, you could divide them up into
separate tiers.  Or, if you don't care about the difference, you can
configure them without a promotion/demotion relationship and they will
be in the same tier.  Balancing within the same tier will be handled by
the autonuma mechanism.

> 3) Will the cost of access from tier_0 to tier_1 be uniform? (node_0
> <-> node_2 vs node_0 <-> node_3). For jobs running on node_0, node_3
> might be third tier and similarly for jobs running on node_1, node_2
> might be third tier.

Tier definition is an admin's choice of where the admin thinks the hot
memory should reside, after looking at the memory performance.  It falls
out of how the admin constructs the promotion/demotion relationship
between the nodes; the OS does not infer the tier relationship from
memory performance directly.

> The reason I am asking these questions is that statically partitioning
> memory nodes into tiers will inherently add platform specific
> assumptions in the user API.
>
> Assumptions like:
> 1) Access within tier is always cheaper than across tier.
> 2) Access from tier_i to tier_i+1 has uniform cost.
>
> The reason I am more inclined towards having numa centric control is
> that we don't have to make these assumptions. Though the usability
> will be more difficult. Greg (CCed) has some ideas on making it better
> and we will share our proposal after polishing it a bit more.

I am still trying to understand how a NUMA centric control would
actually work.  Putting limits on every NUMA node for each cgroup seems
to make the system configuration quite complicated.  Looking forward to
your proposal so I can better understand that perspective.

Tim