From: Shakeel Butt
Date: Mon, 12 Apr 2021 12:20:22 -0700
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
To: Tim Chen
Cc: Michal Hocko, Johannes Weiner, Andrew Morton, Dave Hansen, Ying Huang,
    Dan Williams, David Rientjes, Linux MM, Cgroups, LKML, Greg Thelen, Wei Xu
In-Reply-To: <58e5dcc9-c134-78de-6965-7980f8596b57@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Apr 9, 2021 at 4:26 PM Tim Chen wrote:
>
> On 4/8/21 4:52 AM, Michal Hocko wrote:
>
> >> The top tier memory used is reported in
> >>
> >> memory.toptier_usage_in_bytes
> >>
> >> The amount of top tier memory usable by each cgroup without
> >> triggering page reclaim is controlled by the
> >>
> >> memory.toptier_soft_limit_in_bytes
>
> Michal,
>
> Thanks for your comments. I would like to take a step back and
> look at the eventual goal we envision: a mechanism to partition the
> tiered memory between the cgroups.
>
> A typical use case may be a system with two sets of tasks.
> One set of tasks is very latency sensitive and we desire instantaneous
> response from them. Another set of tasks will be running batch jobs
> where latency and performance are not critical. In this case,
> we want to carve out enough top tier memory such that the working set
> of the latency sensitive tasks can fit entirely in the top tier memory.
> The rest of the top tier memory can be assigned to the background tasks.
>
> To achieve such cgroup based tiered memory management, we probably want
> something like the following.
>
> For generalization, let's say that there are N tiers of memory
> t_0, t_1, ... t_N-1, where tier t_0 sits at the top and demotes to the
> lower tiers. For this top tier memory t_0 we envision the following
> knobs and counters in the cgroup memory controller:
>
> memory_t0.current    Current usage of tier 0 memory by the cgroup.
>
> memory_t0.min        If tier 0 memory used by the cgroup falls below this
>                      low boundary, the memory will not be subjected to
>                      demotion to lower tiers to free up memory at tier 0.
>
> memory_t0.low        Above this boundary, the tier 0 memory will be
>                      subjected to demotion. The demotion pressure will be
>                      proportional to the overage.
>
> memory_t0.high       If tier 0 memory used by the cgroup exceeds this high
>                      boundary, allocation of tier 0 memory by the cgroup
>                      will be throttled. The tier 0 memory used by this
>                      cgroup will also be subjected to heavy demotion.
>
> memory_t0.max        A hard usage limit of tier 0 memory on the cgroup.
>
> If needed, memory_t[12...].current/min/low/high for additional tiers
> can be added. This closely follows the design of the general memory
> controller interface.
>
> Does such an interface look sane and acceptable to everyone?
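For concreteness, here is a minimal userspace sketch of how the per-tier
knobs proposed above might be exercised, assuming they surface as ordinary
cgroup v2 control files. The memory_t0.* names are taken from the proposal
and do not exist in any current kernel; the cgroup names and byte values
below are made up for illustration.

/*
 * Hypothetical sketch: protect a latency-sensitive group's tier 0 working
 * set and cap a batch group's tier 0 usage, using the proposed
 * (non-existent) memory_t0.* cgroup files.
 */
#include <stdio.h>

static int write_knob(const char *cgroup, const char *knob, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/%s", cgroup, knob);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* latency-sensitive job: keep 16G of tier 0 memory safe from demotion */
	write_knob("latency", "memory_t0.min", "17179869184");
	/* batch job: throttle and demote once it exceeds 4G of tier 0 memory */
	write_knob("batch", "memory_t0.high", "4294967296");
	return 0;
}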
I have a couple of questions. Let's suppose we have a two-socket
system: Node 0 (DRAM+CPUs), Node 1 (DRAM+CPUs), Node 2 (PMEM on socket
0 along with Node 0) and Node 3 (PMEM on socket 1 along with Node 1).
Based on the tier definition of this patch series, tier_0: {node_0,
node_1} and tier_1: {node_2, node_3}.

My questions are:

1) Can we assume that the cost of access within a tier will always be
less than the cost of access across tiers? (node_0 <-> node_1 vs
node_0 <-> node_2)

2) If yes to (1), is that assumption future proof? Will future systems
with DRAM-over-CXL support have the same characteristics?

3) Will the cost of access from tier_0 to tier_1 be uniform? (node_0
<-> node_2 vs node_0 <-> node_3) For jobs running on node_0, node_3
might be a third tier, and similarly for jobs running on node_1,
node_2 might be a third tier.

The reason I am asking these questions is that statically partitioning
memory nodes into tiers will inherently add platform-specific
assumptions to the user API. Assumptions like:

1) Access within a tier is always cheaper than access across tiers.
2) Access from tier_i to tier_i+1 has uniform cost.

The reason I am more inclined towards a NUMA-centric control is that
we don't have to make these assumptions, though the usability will be
more difficult. Greg (CCed) has some ideas on making it better, and we
will share our proposal after polishing it a bit more.
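One way to sanity-check assumptions (1) and (2) on a concrete platform is
to read the firmware-reported SLIT distances, for example through libnuma.
A sketch is below; the node numbers follow the hypothetical two-socket
topology above, and real machines may of course report different values.

/*
 * Sketch: print SLIT distances for the example node pairs to see whether
 * within-tier access is really cheaper than cross-tier access and whether
 * cross-tier cost is uniform. Build with -lnuma.
 */
#include <stdio.h>
#include <numa.h>

int main(void)
{
	int pairs[][2] = {
		{ 0, 1 },	/* DRAM <-> DRAM, within tier_0       */
		{ 0, 2 },	/* DRAM <-> local PMEM, across tiers  */
		{ 0, 3 },	/* DRAM <-> remote PMEM, across tiers */
	};
	unsigned int i;

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA is not available on this system\n");
		return 1;
	}

	for (i = 0; i < sizeof(pairs) / sizeof(pairs[0]); i++)
		printf("node_%d <-> node_%d distance: %d\n",
		       pairs[i][0], pairs[i][1],
		       numa_distance(pairs[i][0], pairs[i][1]));
	return 0;
}

If the reported 0<->2 and 0<->3 distances differ, assumption (2) already
does not hold for that box.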