Subject: Re: [PATCH] mm: swap: determine swap device by using page nid
From: "ying.huang@intel.com"
To: Aaron Lu
Cc: Yang Shi, Michal Hocko, Andrew Morton, Linux MM,
    Linux Kernel Mailing List
Date: Fri, 22 Apr 2022 07:57:29 +0800
References: <20220407020953.475626-1-shy828301@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 2022-04-21 at 22:11 +0800, Aaron Lu wrote:
> On Thu, Apr 21, 2022 at 03:49:21PM +0800, ying.huang@intel.com wrote:
> > On Wed, 2022-04-20 at 16:33 +0800, Aaron Lu wrote:
> > > On Thu, Apr 07, 2022 at 10:36:54AM -0700, Yang Shi wrote:
> > > > On Thu, Apr 7, 2022 at 1:12 AM Aaron Lu wrote:
> > > > >
> > > > > On Wed, Apr 06, 2022 at 07:09:53PM -0700, Yang Shi wrote:
> > > > > > The swap devices are linked to per-node priority lists; the swap
> > > > > > device closer to a node has higher priority on that node's list.
> > > > > > This is supposed to improve I/O latency, particularly for some
> > > > > > fast devices. But the current code gets the nid by calling
> > > > > > numa_node_id(), which actually returns the nid of the node the
> > > > > > reclaimer is running on instead of the nid of the node the page
> > > > > > belongs to.
> > > > >
> > > > > Right.
> > > > >
> > > > > > Pass the page's nid down to get_swap_pages() in order to pick the
> > > > > > right swap device. But this doesn't work for the swap slots
> > > > > > cache, which is per-CPU. We could skip the swap slots cache if
> > > > > > the current node is not the page's node, but that may be
> > > > > > overkill, so keep using the current node's swap slots cache. The
> > > > > > issue was found by visual code inspection, so it is not clear how
> > > > > > much improvement can be achieved, due to the lack of a suitable
> > > > > > testing device. But in any case the current code does violate
> > > > > > the design.
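[For illustration, a minimal sketch of the intent, in kernel style.
alloc_swap_nearest() is a hypothetical helper name; the actual patch
plumbs the nid through the existing get_swap_pages() path.]

    /*
     * Sketch only: choose the swap device from the priority list of the
     * node the page belongs to, rather than the node the reclaimer
     * happens to be running on.
     */
    static swp_entry_t alloc_swap_for_page(struct page *page)
    {
            /* before: int nid = numa_node_id();   (reclaimer's node) */
            int nid = page_to_nid(page);        /* the page's own node */

            return alloc_swap_nearest(nid);     /* hypothetical helper */
    }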
> > > > >
> > > > > I intentionally used the reclaimer's nid because I think that when
> > > > > swapping out to a device, it is faster when the device is on the
> > > > > same node as the cpu.
> > > >
> > > > OK, the offline discussion with Huang Ying showed the design was to
> > > > use the page's nid in order to achieve better I/O performance (more
> > > > noticeable on faster devices), since the reclaimer may be running on
> > > > a different node from the reclaimed page.
> > > >
> > > > > Anyway, I think I can make a test case where the workload
> > > > > allocates all its memory on the remote node and its workingset
> > > > > memory is larger than the available memory, so swap is triggered;
> > > > > then we can see which way achieves better performance. Sounds
> > > > > reasonable to you?
> > > >
> > > > Yeah, definitely, thank you so much. I don't have a fast enough
> > > > device at hand to show the difference right now. If you could get
> > > > some data it would be perfect.
> > >
> > > I failed to find a test box that has two NVMe disks attached to
> > > different nodes, and since Shanghai is locked down right now we
> > > couldn't install another NVMe on the box, so I figured it might be OK
> > > to test on a box that has a single NVMe attached to node 0, like this
> > > [see the placement sketch below]:
> > >
> > > 1) restrict the test processes to run on node 0 and allocate on node 1;
> > > 2) restrict the test processes to run on node 1 and allocate on node 0.
> > >
> > > In case 1), the reclaimer's node id is the same as the swap device's,
> > > so it matches the current behaviour; in case 2), the page's node id is
> > > the same as the swap device's, so it matches what your patch proposes.
> > >
> > > The test I used is vm-scalability/case-swap-w-rand:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swap-w-seq
> > > It spawns $nr_task processes, each of which mmaps $size and then
> > > randomly writes to that area. I set nr_task=32 and size=4G, so a
> > > total of 128G of memory is needed, and I used memory.limit_in_bytes
> > > to restrict the available memory to 64G, to make sure swap is
> > > triggered.
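[As an illustration of the "run on one node, allocate on the other"
placement above, a minimal user-space sketch using libnuma (link with
-lnuma). The actual runs used cpuset/cgroup restrictions, not this code.]

    #include <numa.h>
    #include <stddef.h>

    /* Case 2): pin execution to node 1, allocate the buffer on node 0. */
    static void *setup_case2(size_t size)
    {
            if (numa_available() < 0)
                    return NULL;                   /* no NUMA support */
            numa_run_on_node(1);                   /* run on node 1 */
            return numa_alloc_onnode(size, 0);     /* memory on node 0 */
    }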
> > >
> > > The reason why a cgroup is used is to avoid waking up the per-node
> > > kswapd, which can trigger swapping with the reclaimer, the page and
> > > the swap device all having the same node id.
> > >
> > > And I don't see a measurable difference in the results:
> > > case 1 (using reclaimer's node id): vm-scalability.throughput: 10574 KB/s
> > > case 2 (using page's node id):      vm-scalability.throughput: 10567 KB/s
> > >
> > > My interpretation of the result is that when reclaiming a remote
> > > page, it doesn't matter much which swap device is used if the swap
> > > device is an IO device.
> > >
> > > Later, Ying reminded me that we have a test box with Optane installed
> > > on different nodes, so I also tested there: a 2-socket Icelake server
> > > with one Optane device installed per node. I ran the test there like
> > > this:
> > > 1) restrict the test processes to run on node 0 and allocate on node 1,
> > >    and only swapon pmem0, the Optane-backed swap device on node 0;
> > > 2) restrict the test processes to run on node 0 and allocate on node 1,
> > >    and only swapon pmem1, the Optane-backed swap device on node 1.
> > >
> > > So case 1) is the current behaviour and case 2) is what your patch
> > > proposes.
> > >
> > > With the same test and the same nr_task/size, the results are:
> > > case 1 (using reclaimer's node id): vm-scalability.throughput: 71033 KB/s
> > > case 2 (using page's node id):      vm-scalability.throughput: 58753 KB/s
> >
> > The per-node swap device support is more about swap-in latency than
> > swap-out throughput. I suspect the test case is more about swap-out
> > throughput. perf profiling can show this.
>
> On another thought, swap out can very well affect swap in latency:
> since swap is involved, available memory is in short supply, so a swap
> in may very likely need to reclaim a page, and that reclaim can involve
> a swap out. So swap-out performance can also affect swap-in latency.

I think you are talking about thrashing. Thrashing will kill
performance. With proactive reclaiming, or something similar (e.g.
killing low-priority workloads), we can introduce swapping almost
without thrashing. I don't want to say that swap-out performance isn't
important. But swap out and swap in are different: swap-out performance
is more about throughput, while swap-in performance is more about
latency.

Best Regards,
Huang, Ying

> > For swap-in latency, we can use pmbench, which can output latency
> > information.
> >
> > Best Regards,
> > Huang, Ying
> >
> [snip]
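[For completeness, a rough user-space sketch of observing swap-in
latency directly. This is not pmbench; it simply times single-byte
touches of a buffer assumed to be larger than the memory available to
the process, so some touches fault and measure a swap-in.]

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t size = 8UL << 30;   /* assumed > available memory */
            char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED)
                    return 1;
            memset(buf, 1, size);      /* populate; may push pages to swap */

            for (int i = 0; i < 1000; i++) {
                    size_t off = (((size_t)rand() << 31) ^ rand()) % size;
                    struct timespec t0, t1;

                    clock_gettime(CLOCK_MONOTONIC, &t0);
                    buf[off]++;        /* may fault and swap the page in */
                    clock_gettime(CLOCK_MONOTONIC, &t1);
                    printf("%ld ns\n",
                           (long)(t1.tv_sec - t0.tv_sec) * 1000000000L
                           + (t1.tv_nsec - t0.tv_nsec));
            }
            return 0;
    }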