Received: by 2002:a25:23cc:0:0:0:0:0 with SMTP id j195csp197573ybj; Mon, 4 May 2020 19:13:20 -0700 (PDT) X-Google-Smtp-Source: APiQypKDIIBBdhHL+WGuXlgduvrbwCz4ImKGjmKUYUf8+1LpesnCsPfLJ7AXGXaMjfbUECJsNp/d X-Received: by 2002:a17:906:adb:: with SMTP id z27mr588373ejf.263.1588644800420; Mon, 04 May 2020 19:13:20 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1588644800; cv=none; d=google.com; s=arc-20160816; b=ntVAVeyvF/LwyFsKSyZ5CzFbwcKnkH7mo2jAevnWwy5qyDz6CjWhBdvBDRIMy3UXNI rfb5eJAky60ybMYg5z8JdicydRffUHqFJ3Nbl8zYux7zCIRT/Y7IfqFn225+TFehotz3 APrqXhLCRF1bfIbmvR30AVwR/aT3UT0SttrtyMcQrbPCBzo82rcEnFLvHI2Qt7VXN9cK DJ8Y1ORl2jnahBgIPH367cBWJfU7NxJk01KG3KPa9MiGyq2foPJHuiv+l3fesiOSvi12 JT6ZPFLZtdq8gwGrPhV+3ZNKba3f2jbSPS2D7LM2CAVtSzm/7NeIHiFmRYUnsLJzmkHd PDpQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature; bh=lVKBNME8NJj6ZpopdgFkQ+iNkRORu46exO/01/KF0Vs=; b=An0kUYU8DwoisQVqtPr4ol/Dmhvce8lwNco1FjvxhHqPofiCxVjy3E8vyxVYTvcop/ XgaF+aux+mcQCnjEhZaDW09QiHBZdkzdDNgfmXVvQ72cfoVyYfssYxnnMVanjd/mhMd0 ufCDMmMn1/JYc6PTXBF+IlgsCcKvZ1YTd1PJqtRLUArxfhTZxpqMFxCdnH4LEd0QPoc/ wKkSTT5kyptjg3sWEgrOqCrsAZnmVXk3/FmhFp7X3rkX0WItQCIdLeBbClfs5+0BS0fw D+gkGtmU9CpMK7YipwZ2bwzN7xJCpRVxlxHmQmJGYnnhFha9R3IvW6rXhaXg5/SI4trW ixcg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@oracle.com header.s=corp-2020-01-29 header.b=Otm4amx9; spf=pass (google.com: domain of linux-crypto-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-crypto-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=oracle.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id bu22si429303edb.507.2020.05.04.19.12.46; Mon, 04 May 2020 19:13:20 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-crypto-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@oracle.com header.s=corp-2020-01-29 header.b=Otm4amx9; spf=pass (google.com: domain of linux-crypto-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-crypto-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=oracle.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726551AbgEECMo (ORCPT + 99 others); Mon, 4 May 2020 22:12:44 -0400 Received: from userp2130.oracle.com ([156.151.31.86]:59010 "EHLO userp2130.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726549AbgEECMn (ORCPT ); Mon, 4 May 2020 22:12:43 -0400 Received: from pps.filterd (userp2130.oracle.com [127.0.0.1]) by userp2130.oracle.com (8.16.0.42/8.16.0.42) with SMTP id 04529enq048881; Tue, 5 May 2020 02:11:06 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=date : from : to : cc : subject : message-id : references : mime-version : content-type : in-reply-to; s=corp-2020-01-29; bh=lVKBNME8NJj6ZpopdgFkQ+iNkRORu46exO/01/KF0Vs=; b=Otm4amx90nPq2D8dA3V63mup9ZoB/4gYpNZ5nE1a8E2myKw6a7doDycHN3v/mKYb5pEA xn39xZSzRQcapN1PuQdEDNvaxxZZZIGEHDZ0vE/mdOfFTOgZCHqdsZx3HQhWSCsRIIlg xsb0KO/yIoRh8yvqB8RdqLsDTD1V0wi7Yvxx40VLuAoz2dxZ5syQq2Mt0JNQbk/ZVO97 6hmnmz+tTDYs9H9jRgOz/X7dbTL++BKLLaH4hZgogzaad7fPxrZZONaZG5sCDZyEht8V DLNzP5rVfK7O1MLDXakCf5sM+BgVQdJmC5zNO45SCVzKvMWkMHQT+P4MNbHLx6a5wOI+ 1A== Received: from aserp3030.oracle.com (aserp3030.oracle.com [141.146.126.71]) by userp2130.oracle.com with ESMTP id 30s09r26vq-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Tue, 05 May 2020 02:11:05 +0000 Received: from pps.filterd (aserp3030.oracle.com [127.0.0.1]) by aserp3030.oracle.com (8.16.0.42/8.16.0.42) with SMTP id 04526o22176015; Tue, 5 May 2020 02:09:05 GMT Received: from aserv0121.oracle.com (aserv0121.oracle.com [141.146.126.235]) by aserp3030.oracle.com with ESMTP id 30sjdrwuqs-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Tue, 05 May 2020 02:09:05 +0000 Received: from abhmp0002.oracle.com (abhmp0002.oracle.com [141.146.116.8]) by aserv0121.oracle.com (8.14.4/8.13.8) with ESMTP id 04528tjn007418; Tue, 5 May 2020 02:08:56 GMT Received: from ca-dmjordan1.us.oracle.com (/10.211.9.48) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Mon, 04 May 2020 19:08:55 -0700 Date: Mon, 4 May 2020 22:09:16 -0400 From: Daniel Jordan To: Alexander Duyck Cc: Josh Triplett , Daniel Jordan , Andrew Morton , Herbert Xu , Steffen Klassert , Alex Williamson , Alexander Duyck , Dan Williams , Dave Hansen , David Hildenbrand , Jason Gunthorpe , Jonathan Corbet , Kirill Tkhai , Michal Hocko , Pavel Machek , Pavel Tatashin , Peter Zijlstra , Randy Dunlap , Shile Zhang , Tejun Heo , Zi Yan , linux-crypto@vger.kernel.org, linux-mm , LKML Subject: Re: [PATCH 6/7] mm: parallelize deferred_init_memmap() Message-ID: <20200505020916.mve4ijrg4z5h7eh5@ca-dmjordan1.us.oracle.com> References: <20200430201125.532129-1-daniel.m.jordan@oracle.com> <20200430201125.532129-7-daniel.m.jordan@oracle.com> <3C3C62BE-6363-41C3-834C-C3124EB3FFAB@joshtriplett.org> <20200505014844.ulp4rtih7adtcicm@ca-dmjordan1.us.oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200505014844.ulp4rtih7adtcicm@ca-dmjordan1.us.oracle.com> User-Agent: NeoMutt/20180716 X-Proofpoint-Virus-Version: vendor=nai engine=6000 definitions=9611 signatures=668687 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 suspectscore=0 mlxscore=0 bulkscore=0 adultscore=0 phishscore=0 mlxlogscore=999 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2003020000 definitions=main-2005050013 X-Proofpoint-Virus-Version: vendor=nai engine=6000 definitions=9611 signatures=668687 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 bulkscore=0 mlxscore=0 lowpriorityscore=0 spamscore=0 adultscore=0 clxscore=1015 suspectscore=0 priorityscore=1501 malwarescore=0 mlxlogscore=999 phishscore=0 impostorscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2003020000 definitions=main-2005050013 Sender: linux-crypto-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-crypto@vger.kernel.org On Mon, May 04, 2020 at 09:48:44PM -0400, Daniel Jordan wrote: > On Mon, May 04, 2020 at 05:40:19PM -0700, Alexander Duyck wrote: > > On Mon, May 4, 2020 at 4:44 PM Josh Triplett wrote: > > > > > > On May 4, 2020 3:33:58 PM PDT, Alexander Duyck wrote: > > > >On Thu, Apr 30, 2020 at 1:12 PM Daniel Jordan > > > > wrote: > > > >> /* > > > >> - * Initialize and free pages in MAX_ORDER sized increments so > > > >> - * that we can avoid introducing any issues with the buddy > > > >> - * allocator. > > > >> + * More CPUs always led to greater speedups on tested > > > >systems, up to > > > >> + * all the nodes' CPUs. Use all since the system is > > > >otherwise idle now. > > > >> */ > > > > > > > >I would be curious about your data. That isn't what I have seen in the > > > >past. Typically only up to about 8 or 10 CPUs gives you any benefit, > > > >beyond that I was usually cache/memory bandwidth bound. > > On Skylake it took more than 8 or 10 CPUs, though on other machines the benefit > of using all versus half or 3/4 of the CPUs is less significant. > > Given that the rest of the system is idle at this point, my main concern is > whether other archs regress past a certain thread count. Reposting the data to be consistent with the way the percentages are reported in the changelog. Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal) 2 nodes * 26 cores * 2 threads = 104 CPUs 384G/node = 768G memory kernel boot deferred init ------------------------ ------------------------ node% (thr) speedup time_ms (stdev) speedup time_ms (stdev) ( 0) -- 4056.7 ( 5.5) -- 1763.3 ( 4.2) 2% ( 1) -2.4% 4153.3 ( 2.5) -5.6% 1861.7 ( 5.5) 12% ( 6) 35.0% 2637.7 ( 38.7) 80.3% 346.7 ( 37.5) 25% ( 13) 38.4% 2497.3 ( 38.5) 88.1% 210.0 ( 41.8) 37% ( 19) 38.9% 2477.0 ( 19.0) 89.5% 185.3 ( 21.5) 50% ( 26) 39.1% 2471.7 ( 21.4) 89.8% 179.7 ( 25.8) 75% ( 39) 39.5% 2455.7 ( 33.2) 90.8% 161.7 ( 29.3) 100% ( 52) 39.9% 2436.7 ( 2.1) 91.8% 144.3 ( 5.9) Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, bare metal) 1 node * 16 cores * 2 threads = 32 CPUs 192G/node = 192G memory kernel boot deferred init ------------------------ ------------------------ node% (thr) speedup time_ms (stdev) speedup time_ms (stdev) ( 0) -- 1957.3 ( 14.0) -- 1093.7 ( 12.9) 3% ( 1) 1.4% 1930.7 ( 10.0) 3.7% 1053.3 ( 7.6) 12% ( 4) 41.2% 1151.7 ( 9.0) 74.5% 278.7 ( 0.6) 25% ( 8) 46.3% 1051.0 ( 7.8) 83.7% 178.0 ( 2.6) 38% ( 12) 48.7% 1003.3 ( 7.6) 87.0% 141.7 ( 3.8) 50% ( 16) 48.2% 1014.3 ( 20.0) 87.8% 133.3 ( 3.2) 75% ( 24) 49.5% 989.3 ( 6.7) 88.4% 126.3 ( 1.5) 100% ( 32) 49.1% 996.0 ( 7.2) 88.4% 127.3 ( 5.1) Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal) 2 nodes * 18 cores * 2 threads = 72 CPUs 128G/node = 256G memory kernel boot deferred init ------------------------ ------------------------ node% (thr) speedup time_ms (stdev) speedup time_ms (stdev) ( 0) -- 1666.0 ( 3.5) -- 618.0 ( 3.5) 3% ( 1) 1.0% 1649.7 ( 1.5) 2.9% 600.0 ( 1.0) 11% ( 4) 25.9% 1234.7 ( 21.4) 70.4% 183.0 ( 22.5) 25% ( 9) 29.6% 1173.0 ( 10.0) 80.7% 119.3 ( 9.6) 36% ( 13) 30.8% 1153.7 ( 17.0) 84.0% 99.0 ( 15.6) 50% ( 18) 31.0% 1150.3 ( 15.5) 84.3% 97.3 ( 16.2) 75% ( 27) 31.0% 1150.3 ( 2.5) 84.6% 95.0 ( 5.6) 100% ( 36) 31.3% 1145.3 ( 1.5) 85.6% 89.0 ( 1.7) AMD EPYC 7551 32-Core Processor (Zen, kvm guest) 1 node * 8 cores * 2 threads = 16 CPUs 64G/node = 64G memory kernel boot deferred init ------------------------ ------------------------ node% (thr) speedup time_ms (stdev) speedup time_ms (stdev) ( 0) -- 1029.7 ( 42.3) -- 253.7 ( 3.1) 6% ( 1) 3.3% 995.3 ( 21.4) 4.3% 242.7 ( 5.5) 12% ( 2) 14.0% 885.7 ( 24.4) 46.4% 136.0 ( 5.2) 25% ( 4) 18.9% 835.0 ( 21.5) 66.1% 86.0 ( 1.7) 38% ( 6) 21.9% 804.7 ( 15.7) 71.4% 72.7 ( 2.1) 50% ( 8) 20.8% 815.3 ( 11.7) 74.4% 65.0 ( 3.5) 75% ( 12) 23.5% 787.7 ( 2.1) 74.0% 66.0 ( 3.6) 100% ( 16) 23.3% 789.3 ( 15.0) 76.3% 60.0 ( 5.6) AMD EPYC 7551 32-Core Processor (Zen, kvm guest) 1 node * 2 cores * 2 threads = 4 CPUs 16G/node = 16G memory kernel boot deferred init ------------------------ ------------------------ node% (thr) speedup time_ms (stdev) speedup time_ms (stdev) ( 0) -- 757.7 ( 17.1) -- 57.0 ( 0.0) 25% ( 1) -1.0% 765.3 ( 5.5) 3.5% 55.0 ( 0.0) 50% ( 2) 4.7% 722.3 ( 21.5) 42.7% 32.7 ( 4.6) 75% ( 3) 3.7% 729.7 ( 4.9) 54.4% 26.0 ( 0.0) 100% ( 4) 6.2% 710.3 ( 15.0) 63.2% 21.0 ( 0.0) Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest) 1 node * 2 cores * 2 threads = 4 CPUs 14G/node = 14G memory kernel boot deferred init ------------------------ ------------------------ node% (thr) speedup time_ms (stdev) speedup time_ms (stdev) ( 0) -- 656.3 ( 7.1) -- 57.3 ( 1.5) 25% ( 1) 1.8% 644.7 ( 3.1) 0.6% 57.0 ( 0.0) 50% ( 2) 6.5% 613.7 ( 5.1) 40.7% 34.0 ( 5.3) 75% ( 3) 6.9% 611.3 ( 6.7) 57.6% 24.3 ( 0.6) 100% ( 4) 8.6% 599.7 ( 5.9) 62.8% 21.3 ( 1.2) > > > I've found pretty much linear performance up to memory bandwidth, and on the systems I was testing, I didn't saturate memory bandwidth until about the full number of physical cores. From number of cores up to number of threads, the performance stayed about flat; it didn't get any better or worse. > > > > That doesn't sound right though based on the numbers you provided. The > > system you had was 192GB spread over 2 nodes with 48thread/24core per > > node, correct? Your numbers went from ~290ms to ~28ms so a 10x > > decrease, that doesn't sound linear when you spread the work over 24 > > cores to get there. I agree that the numbers largely stay flat once > > you hit the peak, I have seen similar behavior when I was working on > > the deferred init code previously. One concern I have though is that > > we may end up seeing better performance with a subset of cores instead > > of running all of the cores/threads, especially if features such as > > turbo come into play. In addition we are talking x86 only so far. I > > would be interested in seeing if this has benefits or not for other > > architectures. > > > > Also what is the penalty that is being paid in order to break up the > > work before-hand and set it up for the parallel work? I would be > > interested in seeing what the cost is on a system with fewer cores per > > node, maybe even down to 1. That would tell us how much additional > > overhead is being added to set things up to run in parallel. > > The numbers above have the 1-thread case. It seems close to the noise. > > > If I get > > a chance tomorrow I might try applying the patches and doing some > > testing myself. > > If you end up doing that, you might find this helpful: > https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=patch;h=afc72bf8478b95a1d6d174c269ff3693c60630e0 > > and maybe this: > https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=patch;h=dff6537eab281e5a9917682c4adf9059c0574223 > > Thanks for looking this over. > > [ By the way, I'm going to be out Tuesday but back the rest of the week. ] >