Date: Fri, 22 Apr 2022 16:30:17 +0530
From: Jagdish Gediya
To: "ying.huang@intel.com"
Cc: Wei Xu, Yang Shi, Linux MM, Linux Kernel Mailing List, Andrew Morton,
    "Aneesh Kumar K.V", Baolin Wang, Dave Hansen, Dan Williams, Greg Thelen
Subject: Re: [PATCH v2 0/5] mm: demotion: Introduce new node state N_DEMOTION_TARGETS
In-Reply-To: <78b5f4cfd86efda14c61d515e4db9424e811c5be.camel@intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Apr 22, 2022 at 02:21:47PM +0800, ying.huang@intel.com wrote:
> On Thu, 2022-04-21 at 23:13 -0700, Wei Xu wrote:
> > On Thu, Apr 21, 2022 at 10:40 PM ying.huang@intel.com wrote:
> > >
> > > On Thu, 2022-04-21 at 21:46 -0700, Wei Xu wrote:
> > > > On Thu, Apr 21, 2022 at 5:58 PM ying.huang@intel.com wrote:
> > > > >
> > > > > On Thu, 2022-04-21 at 11:26 -0700, Wei Xu wrote:
> > > > > > On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com wrote:
> > > > > > >
> > > > > > > On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote:
> > > > > > > > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com wrote:
> > > > > > > > >
> > > > > > > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote:
> > > > > > > > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote:
> > > > > > > > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > > > > > > > > > > > > > Current implementation to find the demotion targets works
> > > > > > > > > > > > > > > based on node state N_MEMORY, however some systems may have
> > > > > > > > > > > > > > > dram only memory numa node which are N_MEMORY but not the
> > > > > > > > > > > > > > > right choices as demotion targets.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This patch series introduces the new node state
> > > > > > > > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > > > > > > > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > > > > > > > > > > > > > > is used to hold the list of nodes which can be used as demotion
> > > > > > > > > > > > > > > targets, support is also added to set the demotion target
> > > > > > > > > > > > > > > list from user space so that default behavior can be overridden.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It appears that your proposed user space interface cannot solve all
> > > > > > > > > > > > > > problems. For example, for system as follows,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> > > > > > > > > > > > > > node 0,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > available: 3 nodes (0-2)
> > > > > > > > > > > > > > node 0 cpus: 0 1
> > > > > > > > > > > > > > node 0 size: n MB
> > > > > > > > > > > > > > node 0 free: n MB
> > > > > > > > > > > > > > node 1 cpus:
> > > > > > > > > > > > > > node 1 size: n MB
> > > > > > > > > > > > > > node 1 free: n MB
> > > > > > > > > > > > > > node 2 cpus: 2 3
> > > > > > > > > > > > > > node 2 size: n MB
> > > > > > > > > > > > > > node 2 free: n MB
> > > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > > node   0   1   2
> > > > > > > > > > > > > >   0:  10  40  20
> > > > > > > > > > > > > >   1:  40  10  80
> > > > > > > > > > > > > >   2:  20  80  10
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Demotion order 1:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > node    demotion_target
> > > > > > > > > > > > > >  0              1
> > > > > > > > > > > > > >  1              X
> > > > > > > > > > > > > >  2              X
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Demotion order 2:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > node    demotion_target
> > > > > > > > > > > > > >  0              1
> > > > > > > > > > > > > >  1              X
> > > > > > > > > > > > > >  2              1
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket
> > > > > > > > > > > > > > traffic. While the demotion order 2 is preferred if we want to take
> > > > > > > > > > > > > > full advantage of the slow memory node.
> > > > > > > > > > > > > > We can take any choice as
> > > > > > > > > > > > > > automatic-generated order, while make the other choice possible via user
> > > > > > > > > > > > > > space overridden.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I don't know how to implement this via your proposed user space
> > > > > > > > > > > > > > interface. How about the following user space interface?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. Add a file "demotion_order_override" in
> > > > > > > > > > > > > >         /sys/devices/system/node/
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been
> > > > > > > > > > > > > > overridden; "0" is output if not.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 3. When write "1", the demotion order of the system will become the
> > > > > > > > > > > > > > overridden mode. When write "0", the demotion order of the system will
> > > > > > > > > > > > > > become the automatic mode and the demotion order will be re-generated.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 4. Add a file "demotion_targets" for each node in
> > > > > > > > > > > > > >         /sys/devices/system/node/nodeX/
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX
> > > > > > > > > > > > > > will be set to the written nodes. And the demotion order of the system
> > > > > > > > > > > > > > will become the overridden mode.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > TBH I don't think having override demotion targets in userspace is
> > > > > > > > > > > > > quite useful in real life for now (it might become useful in the
> > > > > > > > > > > > > future, I can't tell).
> > > > > > > > > > > > > Imagine you manage hundred thousands of
> > > > > > > > > > > > > machines, which may come from different vendors, have different
> > > > > > > > > > > > > generations of hardware, have different versions of firmware, it would
> > > > > > > > > > > > > be a nightmare for the users to configure the demotion targets
> > > > > > > > > > > > > properly. So it would be great to have the kernel properly configure
> > > > > > > > > > > > > it *without* intervening from the users.
> > > > > > > > > > > > >
> > > > > > > > > > > > > So we should pick up a proper default policy and stick with that
> > > > > > > > > > > > > policy unless it doesn't work well for the most workloads. I do
> > > > > > > > > > > > > understand it is hard to make everyone happy. My proposal is having
> > > > > > > > > > > > > every node in the fast tier has a demotion target (at least one) if
> > > > > > > > > > > > > the slow tier exists sounds like a reasonable default policy. I think
> > > > > > > > > > > > > this is also the current implementation.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > This is reasonable. I agree that with a decent default policy,
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I agree that a decent default policy is important. As that was enhanced
> > > > > > > > > > > in [1/5] of this patchset.
> > > > > > > > > > >
> > > > > > > > > > > > the
> > > > > > > > > > > > overriding of per-node demotion targets can be deferred. The most
> > > > > > > > > > > > important problem here is that we should allow the configurations
> > > > > > > > > > > > where memory-only nodes are not used as demotion targets, which this
> > > > > > > > > > > > patch set has already addressed.
> > > > > > > > > > >
> > > > > > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset?
> > > > > > > > > >
> > > > > > > > > > Yes.
> > > > > > > > > >
> > > > > > > > > > > IMHO, if we want to add a user space interface, I think that it should
> > > > > > > > > > > be powerful enough to address all existing issues and some potential
> > > > > > > > > > > future issues, so that it can be stable. I don't think it's a good idea
> > > > > > > > > > > to define a partial user space interface that works only for a specific
> > > > > > > > > > > use case and cannot be extended for other use cases.
> > > > > > > > > > >
> > > > > > > > > > I actually think that they can be viewed as two separate problems: one
> > > > > > > > > > is to define which nodes can be used as demotion targets (this patch
> > > > > > > > > > set), and the other is how to initialize the per-node demotion path
> > > > > > > > > > (node_demotion[]). We don't have to solve both problems at the same
> > > > > > > > > > time.
> > > > > > > > > >
> > > > > > > > > > If we decide to go with a per-node demotion path customization
> > > > > > > > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there
> > > > > > > > > > is a single global control to turn off all demotion targets (for the
> > > > > > > > > > machines that don't use memory-only nodes for demotion).
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > There's one already. In commit 20b51af15e01 ("mm/migrate: add sysfs
> > > > > > > > > interface to enable reclaim migration"), a sysfs interface
> > > > > > > > >
> > > > > > > > >         /sys/kernel/mm/numa/demotion_enabled
> > > > > > > > >
> > > > > > > > > is added to turn off all demotion targets.
> > > > > > > > >
> > > > > > > > IIUC, this sysfs interface only turns off demotion-in-reclaim. It
> > > > > > > > will be even cleaner if we have an easy way to clear node_demotion[]
> > > > > > > > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not
> > > > > > > > init scripts) can know that the machine doesn't even have memory
> > > > > > > > tiering hardware enabled.
> > > > > > > >
> > > > > > >
> > > > > > > What is the difference? Now we have no interface to show demotion
> > > > > > > targets of a node. That is in-kernel only. What is memory tiering
> > > > > > > hardware? The Optane PMEM? Some information for it is available via
> > > > > > > ACPI HMAT table.
> > > > > > >
> > > > > > > Except demotion-in-reclaim, what else do you care about?
> > > > > > >
> > > > > > There is a difference: one is to indicate the availability of the
> > > > > > memory tiering hardware and the other is to indicate whether
> > > > > > transparent kernel-driven demotion from the reclaim path is activated.
> > > > > > With /sys/devices/system/node/demote_targets or the per-node demotion
> > > > > > target interface, the userspace can figure out the memory tiering
> > > > > > topology abstracted by the kernel. It is possible to use
> > > > > > application-guided demotion without having to enable reclaim-based
> > > > > > demotion in the kernel. Logically it is also cleaner to me to
> > > > > > decouple the tiering node representation from the actual demotion
> > > > > > mechanism enablement.
> > > > > >
> > > > > I am confused here. It appears that you need a way to expose the
> > > > > automatic generated demotion order from kernel to user space interface.
> > > > > We can talk about that if you really need it.
> > > > >
> > > > > But [2-5/5] of this patchset is to override the automatic generated
> > > > > demotion order from user space to kernel interface.
> > > > >
> > > > As a side effect of allowing user space to override the default set of
> > > > demotion target nodes, it also provides a sysfs interface to allow
> > > > userspace to read which nodes are currently being designated as
> > > > demotion targets.
> > > >
> > > > The initialization of demotion targets is expected to complete during
> > > > boot (either by kernel or via an init script). After that, the
> > > > userspace processes (e.g.
> > > > proactive tiering daemon or tiering-aware
> > > > applications) can query this sysfs interface to know if there are any
> > > > tiering nodes present and act accordingly.
> > > >
> > > > It would be even better to expose the per-node demotion order
> > > > (node_demotion[]) via the sysfs interface (e.g.
> > > > /sys/devices/system/node/nodeX/demotion_targets as you have
> > > > suggested). It can be read-only until there are good use cases to
> > > > require overriding the per-node demotion order.
> > > >
> > > I am OK to expose the system demotion order to user space. For example,
> > > via /sys/devices/system/node/nodeX/demotion_targets, but read-only.
> > >
> > Sounds good. We can send out a patch for such a read-only interface.
> >
> > > But if we want to add functionality to override system demotion order,
> > > we need to consider the user space interface carefully, at least after
> > > collecting all requirement so far. I don't think the interface proposed
> > > in [2-5/5] of this patchset is sufficient or extensible enough.
> > >
> > The current proposed interface should be sufficient to override which
> > nodes can serve as demotion targets. I agree that it is not
> > sufficient if userspace wants to redefine the per-node demotion
> > targets and a suitable user space interface for that purpose needs to
> > be designed carefully.
> >
>
> IMHO, it's better to define both together. That is, collect all
> requirement, and design it carefully, keeping extensible in mind. If
> it's not the good timing yet, we can defer it to collect more
> requirement. That's not urgent even for authors' system, because they
> can just don't enable demotion-in-reclaim.
>
> Best Regards,
> Huang, Ying

I think it is necessary to either have per node demotion targets
configuration or the user space interface supported by this patch series.
As we don't have a clear consensus on what the user interface should look
like, we can defer the per-node demotion target set interface until a
real need arises.

The current patch series sets N_DEMOTION_TARGETS from the dax device kmem
driver. It is possible that some memory node desired as a demotion target
is not detected in the system from the dax-device kmem probe path. It is
also possible that some dax devices are not preferred as demotion
targets, e.g. HBM; for such devices, the node shouldn't be set in
N_DEMOTION_TARGETS. In the future, support should be added to distinguish
such dax devices and not mark them as N_DEMOTION_TARGETS from the kernel,
but for now this user space interface will be useful to avoid such
devices as demotion targets.

We can add a read-only interface to view per-node demotion targets at
/sys/devices/system/node/nodeX/demotion_targets, remove the duplicated
/sys/kernel/mm/numa/demotion_target interface, and instead make
/sys/devices/system/node/demotion_targets writable.

Huang, Wei, Yang,
What do you suggest?
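
For illustration only, here is a rough sketch of how a userspace agent
might consume the read-only per-node files discussed above. It assumes
that /sys/devices/system/node/nodeX/demotion_targets exists as proposed
in this thread (it is not an interface any given kernel is known to
provide) and that it holds a kernel-style node list such as "1" or
"1,3-4". The helper names parse_nodelist and read_demotion_targets are
hypothetical.

```python
import os
import re


def parse_nodelist(text):
    """Parse a kernel-style node list such as "0-2,5" into sorted node IDs."""
    nodes = set()
    for part in text.strip().split(","):
        part = part.strip()
        if not part:
            continue  # an empty file is treated as "no targets"
        if "-" in part:
            lo, hi = part.split("-", 1)
            nodes.update(range(int(lo), int(hi) + 1))
        else:
            nodes.add(int(part))
    return sorted(nodes)


def read_demotion_targets(root="/sys/devices/system/node"):
    """Return {node_id: [target node IDs]} for each nodeX directory under root.

    ASSUMPTION: nodeX/demotion_targets is the file proposed in this thread.
    Nodes where the file is absent or unreadable are skipped.
    """
    result = {}
    if not os.path.isdir(root):
        return result
    for entry in sorted(os.listdir(root)):
        match = re.fullmatch(r"node(\d+)", entry)
        if not match:
            continue
        path = os.path.join(root, entry, "demotion_targets")
        try:
            with open(path) as f:
                result[int(match.group(1))] = parse_nodelist(f.read())
        except OSError:
            continue  # file not provided by this kernel
    return result


if __name__ == "__main__":
    for node, targets in read_demotion_targets().items():
        print(f"node{node}: {targets if targets else 'X'}")
```

Under these assumptions, an agent would see an empty mapping on a kernel
without the file, so it could tell "no tiering information exposed" apart
from "node has no demotion target" (an empty list).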