Subject: Re: [RFC PATCH v1 13/13] mm: splice local lists onto the front of the LRU
From: Daniel Jordan
To: Tim Chen, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: aaron.lu@intel.com, ak@linux.intel.com, akpm@linux-foundation.org,
    Dave.Dice@oracle.com, dave@stgolabs.net, khandual@linux.vnet.ibm.com,
    ldufour@linux.vnet.ibm.com, mgorman@suse.de, mhocko@kernel.org,
    pasha.tatashin@oracle.com, steven.sistare@oracle.com, yossi.lev@oracle.com
Date: Fri, 2 Feb 2018 00:17:02 -0500
References: <20180131230413.27653-1-daniel.m.jordan@oracle.com>
 <20180131230413.27653-14-daniel.m.jordan@oracle.com>

On 02/01/2018 06:30 PM, Tim Chen wrote:
> On 01/31/2018 03:04 PM, daniel.m.jordan@oracle.com wrote:
>> Now that release_pages is scaling better with concurrent removals from
>> the LRU, the performance results (included below) showed increased
>> contention on lru_lock in the add-to-LRU path.
>>
>> To alleviate some of this contention, do more work outside the LRU lock.
>> Prepare a local list of pages to be spliced onto the front of the LRU,
>> including setting PageLRU in each page, before taking lru_lock.  Since
>> other threads use this page flag in certain checks outside lru_lock,
>> ensure each page's LRU links have been properly initialized before
>> setting the flag, and use memory barriers accordingly.
>>
>> Performance Results
>>
>> This is a will-it-scale run of page_fault1 using 4 different kernels.
>>
>>   kernel             kern #
>>   4.15-rc2           1
>>   large-zone-batch   2
>>   lru-lock-base      3
>>   lru-lock-splice    4
>>
>> Each kernel builds on the last.  The first is a baseline, the second
>> makes zone->lock more scalable by increasing an order-0 per-cpu
>> pagelist's 'batch' and 'high' values to 310 and 1860 respectively
>> (courtesy of Aaron Lu's patch), the third scales lru_lock without
>> splicing pages (the previous patch in this series), and the fourth adds
>> page splicing (this patch).
>>
>> N tasks mmap, fault, and munmap anonymous pages in a loop until the test
>> time has elapsed.
>>
>> The process case generally does better than the thread case most likely
>> because of mmap_sem acting as a bottleneck.  There's ongoing work
>> upstream[*] to scale this lock, however, and once it goes in, my
>> hypothesis is the thread numbers here will improve.
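
(To make the splice-onto-the-front step described above concrete, a minimal
sketch might look like the following.  The helper name and the plain
list/lock parameters are stand-ins chosen for illustration, not the actual
code from the patch.)

#include <linux/mm.h>
#include <linux/spinlock.h>

/*
 * Sketch only: a local batch of pages is prepared outside lru_lock,
 * PageLRU is set on each page, and then the whole batch is spliced
 * onto the front of the LRU under a single lock acquisition.
 */
static void splice_batch_onto_lru(struct list_head *batch,
                                  struct list_head *lru,
                                  spinlock_t *lru_lock)
{
        struct page *page;

        /*
         * Each page's ->lru links were initialized when it was added
         * to the local batch.  Publish those link stores before the
         * PageLRU stores, since other threads test the flag without
         * holding lru_lock.
         */
        smp_wmb();
        list_for_each_entry(page, batch, lru)
                SetPageLRU(page);

        spin_lock_irq(lru_lock);
        list_splice(batch, lru);        /* splice onto the front of the LRU */
        spin_unlock_irq(lru_lock);
}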

Neglected to mention my hardware: 2-socket system, 44 cores, 503G memory,
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz

>> kern #  ntask   proc     thr      proc        stdev    thr         stdev
>>                 speedup  speedup  pgf/s                pgf/s
>> 1       1                         705,533     1,644    705,227     1,122
>> 2       1       2.5%     2.8%     722,912     453      724,807     728
>> 3       1       2.6%     2.6%     724,215     653      723,213     941
>> 4       1       2.3%     2.8%     721,746     272      724,944     728
>>
>> kern #  ntask   proc     thr      proc        stdev    thr         stdev
>>                 speedup  speedup  pgf/s                pgf/s
>> 1       4                         2,525,487   7,428    1,973,616   12,568
>> 2       4       2.6%     7.6%     2,590,699   6,968    2,123,570   10,350
>> 3       4       2.3%     4.4%     2,584,668   12,833   2,059,822   10,748
>> 4       4       4.7%     5.2%     2,643,251   13,297   2,076,808   9,506
>>
>> kern #  ntask   proc     thr      proc        stdev    thr         stdev
>>                 speedup  speedup  pgf/s                pgf/s
>> 1       16                        6,444,656   20,528   3,226,356   32,874
>> 2       16      1.9%     10.4%    6,566,846   20,803   3,560,437   64,019
>> 3       16      18.3%    6.8%     7,624,749   58,497   3,447,109   67,734
>> 4       16      28.2%    2.5%     8,264,125   31,677   3,306,679   69,443
>>
>> kern #  ntask   proc     thr      proc        stdev    thr         stdev
>>                 speedup  speedup  pgf/s                pgf/s
>> 1       32                        11,564,988  32,211   2,456,507   38,898
>> 2       32      1.8%     1.5%     11,777,119  45,418   2,494,064   27,964
>> 3       32      16.1%    -2.7%    13,426,746  94,057   2,389,934   40,186
>> 4       32      26.2%    1.2%     14,593,745  28,121   2,486,059   42,004
>>
>> kern #  ntask   proc     thr      proc        stdev    thr         stdev
>>                 speedup  speedup  pgf/s                pgf/s
>> 1       64                        12,080,629  33,676   2,443,043   61,973
>> 2       64      3.9%     9.9%     12,551,136  206,202  2,684,632   69,483
>> 3       64      15.0%    -3.8%    13,892,933  351,657  2,351,232   67,875
>> 4       64      21.9%    1.8%     14,728,765  64,945   2,485,940   66,839
>>
>> [*] https://lwn.net/Articles/724502/  Range reader/writer locks
>>     https://lwn.net/Articles/744188/  Speculative page faults
>
> The speedup looks pretty nice and seems to peak at 16 tasks.  Do you
> have an explanation of what causes the drop from 28.2% to 21.9% going
> from 16 to 64 tasks?

The system I was testing on had 44 cores, so part of the decrease in %
speedup is just saturating the hardware (e.g. memory bandwidth).  At 64
processes, we start having to share cores.  Page faults per second did
continue to increase each time we added more processes, though, so
there's no anti-scaling going on.

> Was the loss in performance due to increased contention on LRU lock
> when more tasks running results in a higher likelihood of hitting the
> sentinel?

That seems to be another factor, yes.  I used lock_stat to measure it,
and it showed that wait time on lru_lock nearly tripled when going from
32 to 64 processes, but I also take lock_stat with a grain of salt as it
changes the timing/interaction between processes.

> If I understand your patchset correctly, you will need to acquire LRU
> lock for sentinel page.  Perhaps an increase in batch size could help?

Actually, I did try doing that.  In this series the batch size is
PAGEVEC_SIZE (14).  When I did a run with PAGEVEC_SIZE*4, the
performance stayed nearly the same for all but the 64 process case,
where it dropped by ~10%.  One explanation is that as a process runs
through one batch, it holds the batch lock longer before it has to
switch batches, creating more opportunity for contention.

By the way, we're also working on another approach to scaling this lock:

  https://marc.info/?l=linux-mm&m=151746028405581

We plan to implement that idea and see how it compares performance-wise
and diffstat-wise with this.
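
(Returning to the batch-size point above: the trade-off can be pictured
with a small caller-side sketch.  LRU_ADD_BATCH, struct lru_add_batch and
lru_add_batch_page below are hypothetical names, not code from the series;
they only illustrate that a larger batch means fewer lock acquisitions but
a longer run through each batch before switching.)

#include <linux/pagevec.h>

/* Hypothetical batch size; the series uses PAGEVEC_SIZE (14). */
#define LRU_ADD_BATCH   PAGEVEC_SIZE

struct lru_add_batch {
        struct list_head pages;
        unsigned int nr;
};

/*
 * Accumulate pages locally; once the batch fills, flush it onto the
 * LRU with the splice helper sketched earlier, i.e. one lock hold per
 * LRU_ADD_BATCH pages.
 */
static void lru_add_batch_page(struct lru_add_batch *b, struct page *page,
                               struct list_head *lru, spinlock_t *lru_lock)
{
        list_add(&page->lru, &b->pages);
        if (++b->nr >= LRU_ADD_BATCH) {
                splice_batch_onto_lru(&b->pages, lru, lru_lock);
                INIT_LIST_HEAD(&b->pages);
                b->nr = 0;
        }
}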