From: Huang Ying
To: Peter Zijlstra, Mel Gorman, Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Feng Tang,
    Huang Ying, Michal Hocko, Rik van Riel, Dave Hansen, Yang Shi,
    Zi Yan, Wei Xu, Oscar Salvador, Shakeel Butt, Johannes Weiner
Subject: [PATCH -V14 0/3] NUMA balancing: optimize memory placement for memory tiering system
Date: Tue, 1 Mar 2022 16:53:26 +0800
Message-Id: <20220301085329.3210428-1-ying.huang@intel.com>

The changes since the last post are as follows:

- Improved the patch description of [2/3] per Oscar's comments.
  Thanks!
- Added Oscar's Reviewed-by for [2/3] and [3/3].

--

With the advent of various new memory types, some machines will have
multiple types of memory, e.g. DRAM and PMEM (persistent memory). The
memory subsystem of these machines can be called a memory tiering
system, because the performance of the different types of memory
differs.

After commit c221c0b0308f ("device-dax: "Hotplug" persistent memory
for use like normal RAM"), PMEM can be used as cost-effective
volatile memory in separate NUMA nodes.
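As a rough illustration of PMEM showing up as separate NUMA nodes, the
sketch below lists the NUMA nodes under sysfs and marks the ones that
have no CPUs. It is not part of this patchset. The function names
parse_cpulist and classify_nodes are made up for this example, and
treating a CPU-less node as a likely PMEM node is an assumption: a
memory-only node can also be backed by other kinds of memory.

```python
from pathlib import Path


def parse_cpulist(text):
    """Parse a sysfs cpulist string such as "0-3,8" into a sorted list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty cpulist means the node has no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)


def classify_nodes(sysfs_root="/sys/devices/system/node"):
    """Map each nodeN directory to "cpu+memory" or "memory-only".

    Assumption: a node whose cpulist is empty is a memory-only node,
    which is how hotplugged PMEM would typically appear.
    """
    result = {}
    for node_dir in sorted(Path(sysfs_root).glob("node[0-9]*")):
        cpulist = node_dir / "cpulist"
        cpus = parse_cpulist(cpulist.read_text()) if cpulist.exists() else []
        result[node_dir.name] = "cpu+memory" if cpus else "memory-only"
    return result


if __name__ == "__main__":
    for name, kind in classify_nodes().items():
        print(f"{name}: {kind}")
```

On a machine without the sysfs node directory the script simply prints
nothing. The sysfs root is a parameter so that the classification can
be tried against a mock directory tree.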
In a typical memory tiering system, there are CPUs, DRAM and PMEM in
each physical NUMA node. The CPUs and the DRAM will be put in one
logical node, while the PMEM will be put in another (faked) logical
node.

To optimize overall system performance, the hot pages should be
placed in the DRAM node. To do that, we need to identify the hot
pages in the PMEM node and migrate them to the DRAM node via NUMA
migration.

In the original NUMA balancing, there is already a set of mechanisms
to identify the pages recently accessed by the CPUs in a node and
migrate the pages to that node. So we can reuse these mechanisms to
optimize the page placement in the memory tiering system. This is
implemented in this patchset.

On the other hand, the cold pages should be placed in the PMEM node.
So, we also need to identify the cold pages in the DRAM node and
migrate them to the PMEM node.

In commit 26aa2d199d6f ("mm/migrate: demote pages during reclaim"), a
mechanism to demote the cold DRAM pages to the PMEM node under memory
pressure is implemented. Based on that, the cold DRAM pages can be
demoted to the PMEM node proactively to free some memory space on the
DRAM node to accommodate the promoted hot PMEM pages. This is
implemented in this patchset too.

We have tested the solution with the pmbench memory accessing
benchmark with an 80:20 read/write ratio and a Gaussian access
address distribution on a 2-socket Intel server with Optane DC
Persistent Memory Modules. The test results show that the pmbench
score improves by up to 95.9%.

Changelog:

v14:

- Improved the patch description of [2/3] per Oscar's comments.
  Thanks!
- Added Oscar's Reviewed-by for [2/3] and [3/3].

v13:

- Fix nr_succeeded type in migrate_misplaced_page per Oscar's
  comments.
- Make NUMA_BALANCING_MEMORY_TIERING work independently of the
  demotion knob per Johannes' comments.

v12:

- Rebased on v5.17-rc4
- Change promotion watermark implementation per Johannes' comments
- Fixed several sysctl ABI documentation bugs. Thanks, Andrew.

v11:

- Rebased on v5.17-rc1
- Remove [4-6] from the original patchset to make it easier to
  review.
- Change the additional promotion watermark to be the high
  watermark / 4.

v10:

- Rebased on v5.16-rc1
- Revise error handling for [1/6] (promotion counter) per Yang's
  comments
- Add sysctl documentation for [2/6] (optimize page placement)
- Reset threshold adjustment state when tiering mode is
  disabled/enabled
- Reset threshold when a workload transition is detected.

v9:

- Rebased on v5.15-rc4
- Make "add promotion counter" the first patch per Yang's comments

v8:

- Rebased on v5.15-rc1
- Make user-specified threshold take effect sooner

v7:

- Rebased on the mmots tree of 2021-07-15.
- Some minor fixes.

v6:

- Rebased on the latest page demotion patchset (which is based on
  v5.11).

v5:

- Rebased on the latest page demotion patchset (which is based on
  v5.10).

v4:

- Rebased on the latest page demotion patchset (which is based on
  v5.9-rc6).
- Add page promotion counter.

v3:

- Move the rate limit control as late as possible per Mel Gorman's
  comments.
- Revise the hot page selection implementation to store page scan
  time in struct page.
- Code cleanup.
- Rebased on the latest page demotion patchset.

v2:

- Addressed comments for V1.
- Rebased on v5.5.

Best Regards,
Huang, Ying