Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
To: David Rientjes
Cc: Dave Hansen, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 kbusch@kernel.org, ying.huang@intel.com, dan.j.williams@intel.com
References: <20200629234503.749E5340@viggo.jf.intel.com>
 <20200629234509.8F89C4EF@viggo.jf.intel.com>
 <039a5704-4468-f662-d660-668071842ca3@linux.alibaba.com>
From: Yang Shi
Message-ID: <33028a57-24fd-e618-7d89-5f35a35a6314@linux.alibaba.com>
Date: Wed, 1 Jul 2020 10:21:24 -0700

On 6/30/20 10:41 PM, David Rientjes wrote:
> On Tue, 30 Jun 2020, Yang Shi wrote:
>
>>>> From: Dave Hansen
>>>>
>>>> If a memory node has a preferred migration path to demote cold pages,
>>>> attempt to move those inactive pages to that migration node before
>>>> reclaiming. This will better utilize available memory, provide a faster
>>>> tier than swapping or discarding, and allow such pages to be reused
>>>> immediately without IO to retrieve the data.
>>>>
>>>> When handling anonymous pages, this will be considered before swap if
>>>> enabled. Should the demotion fail for any reason, the page reclaim
>>>> will proceed as if the demotion feature was not enabled.
>>>>
>>> Thanks for sharing these patches and kick-starting the conversation, Dave.
>>>
>>> Could this cause us to break a user's mbind() or allow a user to
>>> circumvent their cpuset.mems?
>>>
>>> Because we don't have a mapping of the page back to its allocation
>>> context (or the process context in which it was allocated), it seems like
>>> both are possible.
>> Yes, this could break the memory placement policy enforced by mbind and
>> cpuset. I discussed this with Michal on the mailing list and tried to find
>> a way to solve it, but unfortunately it does not seem easy, as you
>> mentioned above. The memory policy and cpuset are stored in task_struct
>> rather than mm_struct, and it is not easy to trace back to the task_struct
>> from a page (the owner field of mm_struct might be helpful, but it depends
>> on CONFIG_MEMCG and is not the preferred way).
>>
> Yeah, and Ying made a similar response to this message.
>
> We can do this if we consider pmem not to be a separate memory tier from
> the system perspective, however, but rather the socket perspective. In
> other words, a node can only demote to a series of exclusive pmem ranges
> and promote to the same series of ranges in reverse order. So DRAM node 0
> can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM
> node 3 -- a pmem range cannot be demoted to, or promoted from, more than
> one DRAM node.
>
> This naturally takes care of mbind() and cpuset.mems if we consider pmem
> just to be slower volatile memory and we don't need to deal with the
> latency concerns of cross socket migration. A user page will never be
> demoted to a pmem range across the socket and will never be promoted to a
> different DRAM node that it doesn't have access to.

But I don't see much benefit in limiting the migration target to the
so-called *paired* pmem node. IMHO it is fine to migrate to a remote (on a
different socket) pmem node, since even cross-socket access should be much
faster than a refault or swap from disk.
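
To make the difference concrete, here is a toy userspace model of the two
target-selection policies. This is not the patch's actual code; the helper
names, the node numbering (DRAM nodes 0/1, pmem nodes 2/3 as in David's
example), and the "node 2 is already full" condition are all made up for
illustration:

/*
 * Toy model (userspace C, not kernel code) of strict "paired" demotion
 * vs. falling back to any pmem node when the paired one has no room.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_NODES        4
#define NO_TARGET       (-1)

static const bool is_pmem[NR_NODES]    = { false, false, true, true };
static const int paired_pmem[NR_NODES] = { 2, 3, NO_TARGET, NO_TARGET };

/* Stand-in for a watermark check; pretend node 2 is already full. */
static bool node_has_space(int nid)
{
        return nid != 2;
}

/* Strict pairing: demote only to the pmem node on the same socket. */
static int demotion_target_paired(int nid)
{
        int target = paired_pmem[nid];

        if (target != NO_TARGET && node_has_space(target))
                return target;
        return NO_TARGET;
}

/* Relaxed policy: prefer the paired node, fall back to any pmem node. */
static int demotion_target_fallback(int nid)
{
        int target = demotion_target_paired(nid);

        if (target != NO_TARGET)
                return target;
        for (target = 0; target < NR_NODES; target++)
                if (is_pmem[target] && node_has_space(target))
                        return target;
        return NO_TARGET;
}

int main(void)
{
        printf("paired policy, cold page on node 0 -> %d\n",
               demotion_target_paired(0));
        printf("fallback policy, cold page on node 0 -> %d\n",
               demotion_target_fallback(0));
        return 0;
}

With its paired node full, the strict policy gives up (-1) and the page is
reclaimed, while the relaxed policy still keeps it in memory on node 3.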

> That can work with the NUMA abstraction for pmem, but it could also
> theoretically be a new memory zone instead. If all memory living on pmem
> is migratable (the natural way that memory hotplug is done, so we can
> offline), this zone would live above ZONE_MOVABLE. Zonelist ordering
> would determine whether we can allocate directly from this memory based on
> system config or a new gfp flag that could be set for users of a mempolicy
> that allows allocations directly from pmem. If abstracted as a NUMA node
> instead, interleaving over nodes {0,2,3} or a cpuset.mems of {0,2,3}
> doesn't make much sense.
>
> Kswapd would need to be enlightened for proper pgdat and pmem balancing,
> but in theory it should be simpler because it only has its own node to
> manage. Existing per-zone watermarks might be easy to use to fine-tune
> the policy from userspace: the scale factor determines how much memory we
> try to keep free on DRAM for migration from pmem, for example. We also
> wouldn't have to deal with node hotplug or updating of demotion/promotion
> node chains.
>
> Maybe the strongest advantage of the node abstraction is the ability to
> use autonuma and the migrate_pages()/move_pages() APIs for moving pages
> explicitly? Mempolicies could be used for migration to "top-tier" memory,
> i.e. ZONE_NORMAL or ZONE_MOVABLE, instead.

I think using pmem as a node is more natural than a zone and less intrusive,
since we can just reuse all the NUMA APIs. If we treat pmem as a new zone,
I think the implementation would be more intrusive and complicated (e.g. it
would need a new gfp flag), and users couldn't control the memory placement.
Actually there was such a proposal before, please see
https://www.spinics.net/lists/linux-mm/msg151788.html
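
To illustrate the "reuse all the NUMA APIs" point: if pmem simply shows up
as another node, placement control stays with syscalls that already exist
today. A rough, hypothetical sketch (node numbers are made up -- it assumes
DRAM node 0 and pmem node 2; link with -lnuma; error handling is minimal):

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        long page_size = sysconf(_SC_PAGESIZE);
        void *buf = aligned_alloc(page_size, page_size);
        void *pages[1] = { buf };
        int pmem_node[1] = { 2 };               /* assumed pmem node */
        int status[1] = { -1 };
        unsigned long dram_mask = 1UL << 0;     /* nodemask: DRAM node 0 */

        if (!buf)
                return 1;
        memset(buf, 0, page_size);              /* fault the page in */

        /* Explicitly "demote" the page to the pmem node. */
        if (move_pages(0 /* self */, 1, pages, pmem_node, status,
                       MPOL_MF_MOVE))
                perror("move_pages");
        else
                printf("page now on node %d\n", status[0]);

        /*
         * "Promote" it back by binding the range to the DRAM node and
         * asking the kernel to migrate what is already resident there.
         */
        if (mbind(buf, page_size, MPOL_BIND, &dram_mask,
                  sizeof(dram_mask) * 8, MPOL_MF_MOVE))
                perror("mbind");

        free(buf);
        return 0;
}

Nothing new is needed in the user-facing ABI for this; the same mbind() and
move_pages() calls work today for moving pages between DRAM nodes.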