From: "Aneesh Kumar K.V"
To: Wei Xu, Hesham Almatary
Cc: Yang Shi, Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Linux MM, Greg Thelen, Jagdish Gediya, Linux Kernel Mailing List, Alistair Popple, Davidlohr Bueso, Michal Hocko, Baolin Wang, Brice Goglin, Feng Tang, Tim Chen
Subject: Re: RFC: Memory Tiering Kernel Interfaces
References: <1642ab64-7957-e1e6-71c5-ceab9c23bf41@huawei.com>
Date: Tue, 10 May 2022 17:14:23 +0530
Message-ID: <87czglhaso.fsf@linux.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Wei Xu writes:

> On Mon, May 9, 2022 at 7:32 AM Hesham Almatary wrote:
>> ....
> > nearest lower tier before demoting to lower lower tiers.
>> There might still be simple cases/topologies where we might want to "skip"
>> the very next lower tier. For example, assume we have a 3-tiered memory
>> system as follows:
>>
>> node 0 has a CPU and DDR memory in tier 0, node 1 has GPU and DDR memory
>> in tier 0, node 2 has NVMM memory in tier 1, node 3 has some sort of bigger
>> memory (could be a bigger DDR or something) in tier 2.
>> The distances are as follows:
>>
>>  --------------         --------------
>> |   Node 0     |       |   Node 1     |
>> |  -------     |       |  -------     |
>> | |  DDR  |    |       | |  DDR  |    |
>> |  -------     |       |  -------     |
>> |              |       |              |
>>  --------------         --------------
>>        | 20               | 120     |
>>        v                  v         |
>>  ----------------------------       |
>> |        Node 2 PMEM         |      | 100
>>  ----------------------------       |
>>              | 100                  |
>>              v                      v
>>  --------------------------------------
>> |           Node 3 Large mem           |
>>  --------------------------------------
>>
>> node distances:
>> node    0    1    2    3
>>    0   10   20   20  120
>>    1   20   10  120  100
>>    2   20  120   10  100
>>    3  120  100  100   10
>>
>> /sys/devices/system/node/memory_tiers
>> 0-1
>> 2
>> 3
>>
>> N_TOPTIER_MEMORY: 0-1
>>
>> In this case, we want to be able to "skip" the demotion path from Node 1
>> to Node 2 and make demotion go directly to Node 3, as it is closer
>> distance-wise. How can we accommodate this scenario (or at least not rule
>> it out as future work) with the current RFC?
>
> This is an interesting example. I think one way to support this is to
> allow all the lower tier nodes to be the demotion targets of a node in
> the higher tier. We can then use the allocation fallback order to
> select the best demotion target.
>
> For this example, we will have the demotion targets of each node as:
>
> node 0: allowed=2-3, order (based on allocation fallback order): 2, 3
> node 1: allowed=2-3, order (based on allocation fallback order): 3, 2
> node 2: allowed = 3, order (based on allocation fallback order): 3
> node 3: allowed = empty
>
> What do you think?
>

Can we simplify this further with the following tiers?

tier 0 -> empty (no HBM/GPU)
tier 1 -> Node0, Node1
tier 2 -> Node2, Node3

Hence:

node 0: allowed=2-3, order (based on allocation fallback order): 2, 3
node 1: allowed=2-3, order (based on allocation fallback order): 3, 2
node 2: allowed = empty
node 3: allowed = empty

-aneesh
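The demotion-target rule discussed in this thread (every node in any lower tier is an allowed target, ranked by allocation fallback order) can be modelled with a short Python sketch. This is an illustration only, not kernel code: the function name `demotion_targets` is hypothetical, and the sketch assumes the allocation fallback order can be approximated by sorting candidates on node distance from the source node (the kernel's real fallback order is distance-based but adds its own tie-breaking heuristics). It uses the distance table from Hesham's example and evaluates both tier layouts.

```python
# Node distance table from the example topology (row = source node).
DISTANCE = [
    [10, 20, 20, 120],
    [20, 10, 120, 100],
    [20, 120, 10, 100],
    [120, 100, 100, 10],
]


def demotion_targets(tiers, distance):
    """Return {node: ordered list of demotion targets} (hypothetical helper).

    `tiers` is a list of node sets, highest (fastest) tier first.
    Every node in any lower tier is an allowed target. Targets are
    ordered by distance from the source node, which stands in for the
    allocation fallback order (an assumption of this sketch).
    """
    targets = {}
    for i, tier in enumerate(tiers):
        lower = set().union(*tiers[i + 1:])  # all nodes in lower tiers
        for node in sorted(tier):
            # Nearest first; ties broken by node id for determinism.
            targets[node] = sorted(lower, key=lambda n: (distance[node][n], n))
    return targets


# Tiering from the original example: {0, 1}, {2}, {3}
print(demotion_targets([{0, 1}, {2}, {3}], DISTANCE))
# -> {0: [2, 3], 1: [3, 2], 2: [3], 3: []}

# Simplified tiering: empty top tier, then {0, 1}, {2, 3}
print(demotion_targets([set(), {0, 1}, {2, 3}], DISTANCE))
# -> {0: [2, 3], 1: [3, 2], 2: [], 3: []}
```

In both layouts node 1 prefers node 3 over node 2 purely because of the ordering, so no explicit "skip" is needed. The layouts differ only for nodes 2 and 3: with three tiers node 2 can still demote to node 3, while with the simplified layout both sit in the lowest tier and have no demotion target.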