Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node
From: Buddy Lumpkin
Date: Tue, 3 Apr 2018 13:49:25 -0700
To: Matthew Wilcox
Cc: Michal Hocko, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    hannes@cmpxchg.org, riel@surriel.com, mgorman@suse.de,
    akpm@linux-foundation.org
In-Reply-To: <20180403190759.GB6779@bombadil.infradead.org>
References: <1522661062-39745-1-git-send-email-buddy.lumpkin@oracle.com>
    <1522661062-39745-2-git-send-email-buddy.lumpkin@oracle.com>
    <20180403133115.GA5501@dhcp22.suse.cz>
    <20180403190759.GB6779@bombadil.infradead.org>
X-Mailing-List: linux-kernel@vger.kernel.org

> On Apr 3, 2018, at 12:07 PM, Matthew Wilcox wrote:
>
> On Tue, Apr 03, 2018 at 03:31:15PM +0200, Michal Hocko wrote:
>> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
>>> The
>>> presence of direct reclaims 10 years ago was a fairly reliable
>>> indicator that too much was being asked of a Linux system. Kswapd was
>>> likely wasting time scanning pages that were ineligible for eviction.
>>> Adding RAM or reducing the working set size would usually make the
>>> problem go away. Since then, hardware has evolved to bring a new
>>> struggle for kswapd. Storage speeds have increased by orders of
>>> magnitude while CPU clock speeds stayed the same or even slowed down
>>> in exchange for more cores per package. This presents a throughput
>>> problem for a single threaded kswapd that will get worse with each
>>> generation of new hardware.
>>
>> AFAIR we used to scale the number of kswapd workers many years ago. It
>> just turned out to be not all that great. We have had a kswapd reclaim
>> window for quite some time, and that can allow tuning how proactive
>> kswapd should be.
>>
>> Also please note that direct reclaim is a way to throttle overly
>> aggressive memory consumers. The more we do in the background context,
>> the easier it will be for them to allocate faster. So I am not really
>> sure that more background threads will solve the underlying problem.
>> It is just a matter of memory hogs tuning to end up in the very same
>> situation, AFAICS. Moreover, the more they allocate, the less CPU time
>> _other_ (non-allocating) tasks will get.
>>
>>> Test Details
>>
>> I will have to study this more to comment.
>>
>> [...]
>>> By increasing the number of kswapd threads, throughput increased by
>>> ~50% while kernel mode CPU utilization decreased or stayed the same,
>>> likely due to a decrease in the number of parallel tasks doing page
>>> replacement at any given time.
>>
>> Well, isn't that just an effect of more work being done on behalf of
>> other workload that might run along with your tests (and which doesn't
>> really need to allocate a lot of memory)? In other words, how
>> does the patch behave with non-artificial, mixed workloads?
>>
>> Please note that I am not saying that we absolutely have to stick with
>> the current single-thread-per-node implementation, but I would really
>> like to see more background on why we should be allowing heavy memory
>> hogs to allocate faster, or how to prevent that. I would also be very
>> interested to see how to scale the number of threads based on how CPUs
>> are utilized by other workloads.
>
> Yes, very much this. If you have a single-threaded workload which is
> using the entirety of memory and would like to use even more, then it
> makes sense to use as many CPUs as necessary getting memory out of its
> way. If you have N CPUs and N-1 threads happily occupying themselves in
> their own reasonably-sized working sets with one monster process trying
> to use as much RAM as possible, then I'd be pretty unimpressed to see
> the N-1 well-behaved threads preempted by kswapd.

The default value provides one kswapd thread per NUMA node, the same as
it was without the patch. Also, I would point out that just because you
devote more threads to kswapd doesn't mean they are busy. If multiple
kswapd threads are busy, they are almost certainly doing work that would
otherwise have resulted in direct reclaims, which are often substantially
more expensive than a couple of extra context switches due to preemption.

Also, the code still uses wake_up_interruptible to wake kswapd threads,
so after starting the first kswapd thread, free pages minus the size of
the allocation would still need to be below the low watermark at the time
of a page allocation for another kswapd thread to wake up.

When I first decided to try this out, I figured a lot of tuning would be
needed to see good behavior. But what I found in practice was that it
actually works quite well.
When you look closely, you see that there is very little difference
between a direct reclaim and kswapd. In fact, direct reclaims work a
little harder than kswapd, and they should continue to do so, because
that prevents the number of parallel scanning tasks from increasing
unnecessarily.

Please try it out, you might be surprised at how well it works.

>
> My biggest problem with the patch-as-presented is that it's yet one
> more thing for admins to get wrong. We should spawn more threads
> automatically if system conditions are right to do that.

I totally agree with this. In my previous response to Michal Hocko, I
described how I think we could scale watermarks in response to direct
reclaims, and launch more kswapd threads when kswapd peaks at 100% CPU
usage.