Date: Tue, 15 Aug 2017 13:49:45 +0800
From: Aaron Lu
To: Andrew Morton
Cc: linux-mm, lkml, "Chen, Tim C", Huang Ying, "Kleen, Andi", Michal Hocko, Minchan Kim
Subject: Re: [PATCH] swap: choose swap device according to numa node
Message-ID: <20170815054944.GF2369@aaronlu.sh.intel.com>
References: <20170814053130.GD2369@aaronlu.sh.intel.com> <20170814163337.92c9f07666645366af82aba2@linux-foundation.org>
In-Reply-To: <20170814163337.92c9f07666645366af82aba2@linux-foundation.org>

On Mon, Aug 14, 2017 at 04:33:37PM -0700, Andrew Morton wrote:
> On Mon, 14 Aug 2017 13:31:30 +0800 Aaron Lu wrote:
>
> > --- /dev/null
> > +++ b/Documentation/vm/swap_numa.txt
> > @@ -0,0 +1,18 @@
> > +If the system has more than one swap device and swap device has the node
> > +information, we can make use of this information to decide which swap
> > +device to use in get_swap_pages() to get better performance.
> > +
> > +The current code uses a priority based list, swap_avail_list, to decide
> > +which swap device to use and if multiple swap devices share the same
> > +priority, they are used round robin. This change here replaces the single
> > +global swap_avail_list with a per-numa-node list, i.e. for each numa node,
> > +it sees its own priority based list of available swap devices. Swap
> > +device's priority can be promoted on its matching node's swap_avail_list.
> > +
> > +The current swap device's priority is set as: user can set a >=0 value,
> > +or the system will pick one starting from -1 then downwards. The priority
> > +value in the swap_avail_list is the negated value of the swap device's
> > +due to plist being sorted from low to high. The new policy doesn't change
> > +the semantics for priority >=0 cases, the previous starting from -1 then
> > +downwards now becomes starting from -2 then downwards and -1 is reserved
> > +as the promoted value.
>
> Could we please add a little "user guide" here? Tell people how to set
> up their system to exploit this? Sample /etc/fstab entries, perhaps?

That's a good idea.

How about this:

Automatically bind swap device to numa node
-------------------------------------------

If the system has more than one swap device and each swap device has node
information, we can make use of this information to decide which swap
device to use in get_swap_pages() to get better performance.


How to use this feature
-----------------------

Each swap device has a priority, and that decides the order in which the
devices are used. To make use of automatic binding, there is no need to
manipulate the priority settings of swap devices. E.g. on a 2 node machine,
assume 2 swap devices, swapA and swapB, are going to be swapped on, with
swapA attached to node 0 and swapB attached to node 1. Simply swap them on
by doing:
# swapon /dev/swapA
# swapon /dev/swapB

Then node 0 will use the two swap devices in the order of swapA then swapB and
node 1 will use the two swap devices in the order of swapB then swapA. Note
that the order in which they are swapped on doesn't matter.

A more complex example on a 4 node machine: assume 6 swap devices are going to
be swapped on. swapA and swapB are attached to node 0, swapC is attached to
node 1, swapD and swapE are attached to node 2 and swapF is attached to node 3.
The way to swap them on is the same as above:
# swapon /dev/swapA
# swapon /dev/swapB
# swapon /dev/swapC
# swapon /dev/swapD
# swapon /dev/swapE
# swapon /dev/swapF

Then node 0 will use them in the order of:
swapA/swapB -> swapC -> swapD -> swapE -> swapF
swapA and swapB will be used in a round robin mode before any other swap
device.

node 1 will use them in the order of:
swapC -> swapA -> swapB -> swapD -> swapE -> swapF

node 2 will use them in the order of:
swapD/swapE -> swapA -> swapB -> swapC -> swapF
Similarly, swapD and swapE will be used in a round robin mode before any
other swap devices.

node 3 will use them in the order of:
swapF -> swapA -> swapB -> swapC -> swapD -> swapE


Implementation details
----------------------

The current code uses a priority based list, swap_avail_list, to decide
which swap device to use and if multiple swap devices share the same
priority, they are used round robin. This change here replaces the single
global swap_avail_list with a per-numa-node list, i.e. each numa node
sees its own priority based list of available swap devices. A swap
device's priority can be promoted on its matching node's swap_avail_list.

The current swap device's priority is set as: user can set a >=0 value,
or the system will pick one starting from -1 then downwards. The priority
value in the swap_avail_list is the negated value of the swap device's
due to plist being sorted from low to high. The new policy doesn't change
the semantics for priority >=0 cases; the previous starting from -1 then
downwards now becomes starting from -2 then downwards, and -1 is reserved
as the promoted value. So if multiple swap devices are attached to the same
node, they will all be promoted to priority -1 on that node's plist and will
be used round robin before any other swap devices.

>
> >
> > ...
> >
> > +static int __init swapfile_init(void)
> > +{
> > +	int nid;
> > +
> > +	swap_avail_heads = kmalloc(nr_node_ids * sizeof(struct plist_head), GFP_KERNEL);
> > +	if (!swap_avail_heads)
> > +		return -ENOMEM;
>
> Well, a kmalloc failure at __init time is generally considered "can't
> happen", but if it _does_ happen, the system will later oops, I think.

Agree.

> Can we do something nicer here?

I'm not sure what to do...any hint?
Adding a pr_err() perhaps?

> > +	for_each_node(nid)
> > +		plist_head_init(&swap_avail_heads[nid]);
> > +
> > +	return 0;
> > +}
> > +subsys_initcall(swapfile_init);
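
Going back to the per-node ordering described under "Implementation
details": the following is a small user-space sketch, not the kernel code,
that reproduces the 4 node example. The names struct swap_dev,
prio_on_node(), order_for_node() and example_devs are made up for
illustration only. It assumes the devices get auto-assigned priorities -2,
-3, ... in swapon order, and that promotion to -1 only applies to
auto-assigned (negative) priorities, which is how I read "doesn't change
the semantics for priority >=0 cases".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEVS	16
#define PROMOTED_PRIO	(-1)	/* reserved value for devices on their own node */

/* Illustrative stand-in for a swap device; not a kernel structure. */
struct swap_dev {
	const char *name;
	int node;	/* NUMA node the device is attached to */
	int prio;	/* >= 0: set by user; < 0: auto-assigned from -2 downwards */
};

/* The six devices of the 4 node example, in swapon order. */
const struct swap_dev example_devs[] = {
	{ "swapA", 0, -2 }, { "swapB", 0, -3 }, { "swapC", 1, -4 },
	{ "swapD", 2, -5 }, { "swapE", 2, -6 }, { "swapF", 3, -7 },
};

/* Priority of a device as seen from one node's list (assumed policy). */
int prio_on_node(const struct swap_dev *dev, int node)
{
	if (dev->prio < 0 && dev->node == node)
		return PROMOTED_PRIO;
	return dev->prio;
}

/*
 * Build a string describing the order in which `node` uses the devices.
 * Devices sharing a priority are joined by "/" (round robin), others by
 * " -> ".
 */
void order_for_node(const struct swap_dev *devs, int n, int node,
		    char *out, size_t outlen)
{
	int idx[MAX_DEVS];
	int i, j;

	assert(n <= MAX_DEVS && outlen > 0);

	/* Stable insertion sort on the negated priority, low to high,
	 * mirroring how a plist orders its entries. */
	for (i = 0; i < n; i++) {
		int key = -prio_on_node(&devs[i], node);

		for (j = i; j > 0 &&
		     -prio_on_node(&devs[idx[j - 1]], node) > key; j--)
			idx[j] = idx[j - 1];
		idx[j] = i;
	}

	out[0] = '\0';
	for (i = 0; i < n; i++) {
		if (i > 0) {
			int same = prio_on_node(&devs[idx[i]], node) ==
				   prio_on_node(&devs[idx[i - 1]], node);

			strncat(out, same ? "/" : " -> ",
				outlen - strlen(out) - 1);
		}
		strncat(out, devs[idx[i]].name, outlen - strlen(out) - 1);
	}
}
```

Calling order_for_node(example_devs, 6, 0, buf, sizeof(buf)) with a
sufficiently large buf should give the node 0 order listed above,
"swapA/swapB -> swapC -> swapD -> swapE -> swapF", and likewise for nodes
1 to 3.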