From: "Huang, Ying"
To: Yang Shi
Cc: Shakeel Butt, Tim Chen, Michal Hocko, Johannes Weiner,
 Andrew Morton, Dave Hansen, Dan Williams, David Rientjes,
 Linux MM, Cgroups, LKML, Feng Tang
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
Date: Fri, 09 Apr 2021 10:58:03 +0800
In-Reply-To: (Yang Shi's message of "Thu, 8 Apr 2021 11:00:54 -0700")
Message-ID: <87eefkxiys.fsf@yhuang6-desk1.ccr.corp.intel.com>

Yang Shi writes:

> On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt wrote:
>>
>> Hi Tim,
>>
>> On Mon, Apr 5, 2021 at 11:08 AM Tim Chen wrote:
>> >
>> > Traditionally, all memory is DRAM. Some DRAM might be closer/faster
>> > than others NUMA-wise, but a byte of media has about the same cost
>> > whether it is close or far. But with new memory tiers such as
>> > Persistent Memory (PMEM), there is a choice between fast/expensive
>> > DRAM and slow/cheap PMEM.
>> >
>> > The fast/expensive memory lives in the top tier of the memory
>> > hierarchy.
>> >
>> > Previously, the patchset
>> > [PATCH 00/10] [v7] Migrate Pages in lieu of discard
>> > https://lore.kernel.org/linux-mm/20210401183216.443C4443@viggo.jf.intel.com/
>> > provided a mechanism to demote cold pages from DRAM nodes into PMEM.
>> >
>> > And the patchset
>> > [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory tiering system
>> > https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.huang@intel.com/
>> > provided a mechanism to promote hot pages in PMEM to the DRAM nodes,
>> > leveraging autonuma.
>> >
>> > The two patchsets together keep the hot pages in DRAM and the colder
>> > pages in PMEM.
>>
>> Thanks for working on this, as it is becoming more and more important,
>> particularly in data centers where memory is a big portion of the
>> cost.
>>
>> I see you have responded to Michal, and I will add my more specific
>> response there. Here I wanted to give my high-level concern regarding
>> using v1's soft-limit-like semantics for top tier memory.
>>
>> This patch series aims to distribute/partition top tier memory between
>> jobs of different priorities. We want high priority jobs to have
>> preferential access to the top tier memory, and we don't want low
>> priority jobs to hog the top tier memory.
>>
>> Using v1's soft-limit-like behavior can potentially cause high
>> priority jobs to stall to make enough space on top tier memory on
>> their allocation path, and I think this patchset is aiming to reduce
>> that impact by making kswapd do that work. However, I think the more
>> concerning issue is a low priority job hogging the top tier memory.
>>
>> The possible ways a low priority job can hog the top tier memory are
>> by allocating non-movable memory or by mlocking the memory. (Oh,
>> there is also pinning the memory, but I don't know if there is a user
>> API to pin memory.) For the mlocked memory, you need to either modify
>> the reclaim code or use a different mechanism for demoting cold
>> memory.
>
> Do you mean long-term pins? RDMA should be able to simply pin the
> memory for weeks. A lot of transient pins come from Direct I/O; those
> should be of less concern.
>
> The low priority jobs should be able to be restricted by cpuset, for
> example by just keeping them on second tier memory nodes. Then all of
> the above problems are gone.
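For concreteness, the cpuset-based restriction could look roughly like
the minimal sketch below. It assumes cgroup v2 mounted at
/sys/fs/cgroup with the cpuset controller enabled; the "lowprio"
cgroup name, the PMEM node numbers 2-3, and the pid 1234 are all made
up for illustration.

/*
 * Sketch: confine a low priority job to (hypothetical) PMEM-only NUMA
 * nodes via cpuset.  Assumes cgroup v2 at /sys/fs/cgroup with the
 * cpuset controller enabled in the parent's cgroup.subtree_control.
 */
#include <stdio.h>
#include <sys/stat.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%s", val) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f);
}

int main(void)
{
	/* Create the cgroup for the low priority job (may already exist). */
	mkdir("/sys/fs/cgroup/lowprio", 0755);

	/* Allow allocations from the (hypothetical) PMEM nodes 2-3 only. */
	if (write_str("/sys/fs/cgroup/lowprio/cpuset.mems", "2-3"))
		perror("cpuset.mems");

	/* Move the low priority task (pid 1234 here) into the cgroup. */
	if (write_str("/sys/fs/cgroup/lowprio/cgroup.procs", "1234"))
		perror("cgroup.procs");

	return 0;
}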
To optimize the page placement of a process between DRAM and PMEM, we
want to place the hot pages in DRAM and the cold pages in PMEM. But
the memory access pattern changes over time, so we need to migrate
pages between DRAM and PMEM to adapt to those changes. To avoid hot
pages being pinned in PMEM forever, one way is to online the PMEM as
movable zones. If so, and if the low priority jobs are restricted by
cpuset to allocate from PMEM only, we may fail to run quite a few
workloads, as discussed in the following thread:

https://lore.kernel.org/linux-mm/1604470210-124827-1-git-send-email-feng.tang@intel.com/

>>
>> Basically I am saying we should put an upfront control (limit) on the
>> usage of top tier memory by the jobs.
>
> This sounds similar to what I talked about at LSFMM 2019
> (https://lwn.net/Articles/787418/). We used to have a potential use
> case that divided the DRAM:PMEM ratio between different jobs or
> memcgs when I was with Alibaba.
>
> In the first place I thought about a per-NUMA-node limit, but it was
> very hard for users to configure correctly unless they knew exactly
> their memory usage and hot/cold memory distribution.
>
> I'm wondering, just off the top of my head, if we could extend the
> semantics of the low and min limits. For example, just redefine low
> and min to be "the limit on top tier memory". Then we could have low
> priority jobs have a 0 low/min limit.

Per my understanding, memory.low/min are for memory protection, not
for memory limiting; memory.high is for memory limiting.
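As a concrete illustration of that difference (again only a sketch
against the cgroup v2 interface; the "jobA" cgroup and the byte values
are arbitrary examples):

/*
 * Sketch: memory.low asks the kernel to avoid reclaiming the protected
 * amount, while memory.high throttles the job once its usage exceeds
 * the limit.  The "jobA" cgroup and the sizes are arbitrary examples.
 */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s", val);
	return fclose(f);
}

int main(void)
{
	/* Protection: try to keep jobA's first 4 GiB away from reclaim. */
	write_str("/sys/fs/cgroup/jobA/memory.low", "4294967296");

	/* Limiting: throttle jobA's allocations once it passes 8 GiB. */
	write_str("/sys/fs/cgroup/jobA/memory.high", "8589934592");

	return 0;
}

Best Regards,
Huang, Ying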