Date: Wed, 14 Apr 2021 09:59:58 +0100
From: Jonathan Cameron
To: Shakeel Butt
CC: Tim Chen, Michal Hocko, Johannes Weiner, Andrew Morton, Dave Hansen,
    Ying Huang, Dan Williams, David Rientjes, Linux MM, Cgroups, LKML,
    Greg Thelen, Wei Xu
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
Message-ID: <20210414095958.000008c4@Huawei.com>
References: <58e5dcc9-c134-78de-6965-7980f8596b57@linux.intel.com>
Organization: Huawei Technologies Research and Development (UK) Ltd.

On Mon, 12 Apr 2021 12:20:22 -0700
Shakeel Butt wrote:

> On Fri, Apr 9, 2021 at 4:26 PM Tim Chen wrote:
> >
> >
> > On 4/8/21 4:52 AM, Michal Hocko wrote:
> >
> > >> The top tier memory used is reported in
> > >>
> > >> memory.toptier_usage_in_bytes
> > >>
> > >> The amount of top tier memory usable by each cgroup without
> > >> triggering page reclaim is controlled by the
> > >>
> > >> memory.toptier_soft_limit_in_bytes
> > >
> >
> > Michal,
> >
> > Thanks for your comments. I would like to take a step back and
> > look at the eventual goal we envision: a mechanism to partition the
> > tiered memory between the cgroups.
> >
> > A typical use case may be a system with two sets of tasks.
> > One set of tasks is very latency sensitive and we desire instantaneous
> > response from them. Another set of tasks will be running batch jobs
> > where latency and performance are not critical. In this case,
> > we want to carve out enough top tier memory such that the working set
> > of the latency sensitive tasks can fit entirely in the top tier memory.
> > The rest of the top tier memory can be assigned to the background tasks.
> >
> > To achieve such cgroup based tiered memory management, we probably want
> > something like the following.
> >
> > For generalization let's say that there are N tiers of memory
> > t_0, t_1 ... t_N-1, where tier t_0 sits at the top and demotes to the
> > lower tiers. We envision for this top tier memory t_0 the following
> > knobs and counters in the cgroup memory controller:
> >
> > memory_t0.current  Current usage of tier 0 memory by the cgroup.
> >
> > memory_t0.min      If tier 0 memory used by the cgroup falls below this
> >                    low boundary, the memory will not be subjected to
> >                    demotion to lower tiers to free up memory at tier 0.
> >
> > memory_t0.low      Above this boundary, the tier 0 memory will be
> >                    subjected to demotion. The demotion pressure will be
> >                    proportional to the overage.
> >
> > memory_t0.high     If tier 0 memory used by the cgroup exceeds this high
> >                    boundary, allocation of tier 0 memory by the cgroup
> >                    will be throttled. The tier 0 memory used by this
> >                    cgroup will also be subjected to heavy demotion.
> >
> > memory_t0.max      This will be a hard usage limit of tier 0 memory on
> >                    the cgroup.
> >
> > If needed, memory_t[12...].current/min/low/high for additional tiers
> > can be added. This closely follows the design of the general memory
> > controller interface.
> >
> > Will such an interface look sane and acceptable to everyone?
> >
>
> I have a couple of questions. Let's suppose we have a two socket
> system: Node 0 (DRAM+CPUs), Node 1 (DRAM+CPUs), Node 2 (PMEM on socket
> 0 along with Node 0) and Node 3 (PMEM on socket 1 along with Node 1).
> Based on the tier definition of this patch series, tier_0: {node_0,
> node_1} and tier_1: {node_2, node_3}.
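
As an aside, the per-node access costs behind a layout like this are
visible from user space: the kernel exposes the firmware's SLIT distance
table under /sys/devices/system/node/. A minimal sketch, assuming the
hypothetical four-node example above (the sysfs path is standard, the
node count and numbering are only illustrative):

#include <stdio.h>

int main(void)
{
	char path[64], buf[256];
	int node;

	for (node = 0; node < 4; node++) {	/* nodes 0-3 from the example */
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/distance", node);
		f = fopen(path, "r");
		if (!f)
			continue;	/* node not present on this system */
		if (fgets(buf, sizeof(buf), f))
			printf("node%d -> %s", node, buf);
		fclose(f);
	}
	return 0;
}

On real multi-socket systems these distances are typically not uniform
within or across tiers, which is where the questions below come from.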
>
> My questions are:
> 1) Can we assume that the cost of access within a tier will always be
> less than the cost of access across tiers? (node_0 <-> node_1 vs
> node_0 <-> node_2)

No, not in large systems, even if we can make that assumption in 2 socket
ones.

> 2) If yes to (1), is that assumption future proof? Will future
> systems with DRAM over CXL support have the same characteristics?
> 3) Will the cost of access from tier_0 to tier_1 be uniform? (node_0
> <-> node_2 vs node_0 <-> node_3). For jobs running on node_0, node_3
> might be third tier, and similarly for jobs running on node_1, node_2
> might be third tier.
>
> The reason I am asking these questions is that statically
> partitioning memory nodes into tiers will inherently add platform
> specific assumptions to the user API.

Absolutely agree.

>
> Assumptions like:
> 1) Access within a tier is always cheaper than across tiers.
> 2) Access from tier_i to tier_i+1 has uniform cost.
>
> The reason I am more inclined towards numa centric control is that we
> don't have to make these assumptions, though the usability will be
> more difficult. Greg (CCed) has some ideas on making it better and we
> will share our proposal after polishing it a bit more.
>

Sounds good, will look out for that.

Jonathan
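
For anyone wanting to prototype against the interface discussed above:
the memory_t0.* knobs are only Tim's proposal and do not exist in any
current kernel, and the cgroup path below is made up, but if they were
added they would presumably be driven like the existing cgroup v2 memory
files (byte values written into per-cgroup control files). A rough
sketch under those assumptions:

#include <stdio.h>

/*
 * Sketch only: memory_t0.min/max are the knobs proposed in this thread,
 * not an existing kernel interface, and the cgroup path is hypothetical.
 */
static int write_knob(const char *knob, const char *val)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/latency-sensitive/%s", knob);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	/* Protect 16G of tier 0 for the latency-sensitive group ... */
	write_knob("memory_t0.min", "17179869184");
	/* ... and hard-cap its tier 0 usage at 32G. */
	write_knob("memory_t0.max", "34359738368");
	return 0;
}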