Date: Tue, 11 Dec 2018 16:37:22 -0800 (PST)
From: David Rientjes
To: Andrea Arcangeli
Cc: Linus Torvalds, mgorman@techsingularity.net, Vlastimil Babka,
    Michal Hocko, ying.huang@intel.com, s.priebe@profihost.ag,
    Linux List Kernel Mailing, alex.williamson@redhat.com, lkp@01.org,
    kirill@shutemov.name, Andrew Morton, zi.yan@cs.rutgers.edu
Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
In-Reply-To: <20181210044916.GC24097@redhat.com>
References: <64a4aec6-3275-a716-8345-f021f6186d9b@suse.cz>
 <20181204104558.GV23260@techsingularity.net>
 <20181205204034.GB11899@redhat.com>
 <20181205233632.GE11899@redhat.com>
 <20181210044916.GC24097@redhat.com>

On Sun, 9 Dec 2018, Andrea Arcangeli wrote:

> You didn't release the proprietary software that depends on
> __GFP_THISNODE behavior and that you're afraid is getting a
> regression.
>
> Could you at least release with an open source license the benchmark
> software that you must have used to do the above measurement to
> understand why it gives such a weird result on remote THP?
>

Hi Andrea,

As I said in response to Linus, I'm in the process of writing a more
complete benchmarking test across all of our platforms for access and
allocation latency for x86 (both Intel and AMD), POWER8/9, and arm64,
and doing so on a kernel with minimum overhead (for the allocation
latency, I want to remove things like mem cgroup overhead from the
result).

> On skylake and on the threadripper I can't confirm that there isn't a
> significant benefit from cross socket hugepage over cross socket small
> page.
>
> Skylake Xeon(R) Gold 5115:
>
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
> node 0 size: 15602 MB
> node 0 free: 14077 MB
> node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
> node 1 size: 16099 MB
> node 1 free: 15949 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
> # numactl -m 0 -C 0 ./numa-thp-bench
> random writes MADV_HUGEPAGE 10109753 usec
> random writes MADV_NOHUGEPAGE 13682041 usec
> random writes MADV_NOHUGEPAGE 13704208 usec
> random writes MADV_HUGEPAGE 10120405 usec
> # numactl -m 0 -C 10 ./numa-thp-bench
> random writes MADV_HUGEPAGE 15393923 usec
> random writes MADV_NOHUGEPAGE 19644793 usec
> random writes MADV_NOHUGEPAGE 19671287 usec
> random writes MADV_HUGEPAGE 15495281 usec
> # grep Xeon /proc/cpuinfo |head -1
> model name      : Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz
>
> local 4k -> local 2m: +35%
> local 4k -> remote 2m: -11%
> remote 4k -> remote 2m: +26%
>
> threadripper 1950x:
>
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> node 0 size: 15982 MB
> node 0 free: 14422 MB
> node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> node 1 size: 16124 MB
> node 1 free: 5357 MB
> node distances:
> node   0   1
>   0:  10  16
>   1:  16  10
> # numactl -m 0 -C 0 /tmp/numa-thp-bench
> random writes MADV_HUGEPAGE 12902667 usec
> random writes MADV_NOHUGEPAGE 17543070 usec
> random writes MADV_NOHUGEPAGE 17568858 usec
> random writes MADV_HUGEPAGE 12896588 usec
> # numactl -m 0 -C 8 /tmp/numa-thp-bench
> random writes MADV_HUGEPAGE 19663515 usec
> random writes MADV_NOHUGEPAGE 27819864 usec
> random writes MADV_NOHUGEPAGE 27844066 usec
> random writes MADV_HUGEPAGE 19662706 usec
> # grep Threadripper /proc/cpuinfo |head -1
> model name      : AMD Ryzen Threadripper 1950X 16-Core Processor
>
> local 4k -> local 2m: +35%
> local 4k -> remote 2m: -10%
> remote 4k -> remote 2m: +41%
>
> Or if you prefer reversed in terms of compute time (negative
> percentage is better in this case):
>
> local 4k -> local 2m: -26%
> local 4k -> remote 2m: +12%
> remote 4k -> remote 2m: -29%
>
> It's true that local 4k is generally a win vs remote THP when the
> workload is memory bound also for the threadripper, the threadripper
> seems even more favorable to remote THP than skylake Xeon is.
>

My results are organized slightly differently since they consider local
hugepages as the baseline, which is what we optimize for: on Broadwell,
I've obtained more accurate results that show local small pages at +3.8%,
remote hugepages at +12.8%, and remote small pages at +18.8%.  I think we
both agree that the locality preference for workloads that fit within a
single node is local hugepage -> local small page -> remote hugepage ->
remote small page, and that has been unchanged in any of the benchmarking
results for either of us.

> The above is the host bare metal result. Now let's try guest mode on
> the threadripper. The last two lines seems more reliable (the first
> two lines also needs to fault in the guest RAM because the guest
> was fresh booted).
>
> guest backed by local 2M pages:
>
> random writes MADV_HUGEPAGE 16025855 usec
> random writes MADV_NOHUGEPAGE 21903002 usec
> random writes MADV_NOHUGEPAGE 19762767 usec
> random writes MADV_HUGEPAGE 15189231 usec
>
> guest backed by remote 2M pages:
>
> random writes MADV_HUGEPAGE 25434251 usec
> random writes MADV_NOHUGEPAGE 32404119 usec
> random writes MADV_NOHUGEPAGE 31455592 usec
> random writes MADV_HUGEPAGE 22248304 usec
>
> guest backed by local 4k pages:
>
> random writes MADV_HUGEPAGE 28945251 usec
> random writes MADV_NOHUGEPAGE 32217690 usec
> random writes MADV_NOHUGEPAGE 30664731 usec
> random writes MADV_HUGEPAGE 22981082 usec
>
> guest backed by remote 4k pages:
>
> random writes MADV_HUGEPAGE 43772939 usec
> random writes MADV_NOHUGEPAGE 52745664 usec
> random writes MADV_NOHUGEPAGE 51632065 usec
> random writes MADV_HUGEPAGE 40263194 usec
>
> I haven't yet tried the guest mode on the skylake nor
> haswell/broadwell. I can do that too but I don't expect a significant
> difference.
>
> On a threadripper guest, the remote 2m is practically identical to
> local 4k. So shutting down compaction to try to generate local 4k
> memory looks a sure loss.
>

I'm assuming your results above are with a defrag setting of "madvise"
or "defer+madvise".

> Even if we ignore the guest mode results completely, if we don't make
> assumption on the workload to be able to fit in the node, if I use
> MADV_HUGEPAGE I think I'd prefer the risk of a -10% slowdown if the
> THP page ends up in a remote node, than not getting the +41% THP
> speedup on remote memory if the pagetable ends up being remote or the
> 4k page itself ends up being remote over time.
>

I agree with you that the preference for remote hugepages over local
small pages depends on the configuration and the workload you are
running, and that there are clear advantages and disadvantages to both.
This is different from what the long-standing NUMA preference has been
for thp allocations.  I think we can optimize for *both* usecases without
causing an unnecessary regression for the other, and doing so is not
extremely complex.

Since it depends on the workload, specifically workloads that fit within
a single node, I think the reasonable approach would be to have a sane
default regardless of the use of MADV_HUGEPAGE or thp defrag settings,
and then optimize for the minority of cases where the workload does not
fit in a single node.  I'm assuming there is no debate about these larger
workloads being in the minority, although we have single machines where
they encompass the totality of their workloads.

Regarding the role of direct reclaim in the allocator, I think we need to
work on the feedback from compaction to determine whether it's
worthwhile.
That's difficult because of the point I continue to bring up:
isolate_freepages() is not necessarily always able to access this freed
memory.  But for cases where we get COMPACT_SKIPPED because the order-0
watermarks are failing, reclaim *is* likely to have an impact on the
success of compaction; otherwise we fail and defer because it wasn't able
to make a hugepage available.

[ If we run compaction regardless of the order-0 watermark check and find
  a pageblock where we can likely free a hugepage because it consists of
  fragmented movable pages, that is a pretty good indication that reclaim
  is worthwhile iff the reclaimed memory is beyond the migration scanner. ]

Let me try to list out what I think is a reasonable design for the
various configs, assuming we are able to address the reclaim concern
above.  Note that this is for the majority of users, where workloads do
not span multiple nodes:

 - defrag=always: always compact, obviously

 - defrag=madvise/defer+madvise:
   - MADV_HUGEPAGE: always compact locally, fall back to small pages
     locally (the small pages remain eligible for khugepaged to collapse
     locally later, so there is no chance of additional access latency)
   - neither MADV_HUGEPAGE nor MADV_NOHUGEPAGE: kick kcompactd locally,
     fall back to small pages locally

 - defrag=defer: kick kcompactd locally, fall back to small pages locally

 - defrag=never: fall back to small pages locally

And that, AFAICT, has been the implementation for almost four years.

For workloads that *can* span multiple nodes, this doesn't make much
sense, as you point out and have reported in your bug.  Considering the
reclaim problem separately (where we thrash a node unnecessarily), if we
consider only hugepages and NUMA locality:

 - defrag=always: always compact for all allowed zones, zonelist ordered
   according to NUMA locality

 - defrag=madvise/defer+madvise:
   - MADV_HUGEPAGE: always compact for all allowed zones, try to allocate
     hugepages in zonelist order, and only fall back to small pages when
     compaction fails
   - neither MADV_HUGEPAGE nor MADV_NOHUGEPAGE: kick kcompactd for all
     allowed zones, fall back to small pages locally

 - defrag=defer: kick kcompactd for all allowed zones, fall back to small
   pages locally

 - defrag=never: fall back to small pages locally

For this policy to be possible, we must clear __GFP_THISNODE.  How do we
determine when to do that?  I think we have three options: heuristics
(rss vs zone managed pages), a per-process prctl(), or a global thp
setting for machine-wide behavior.

I've been suggesting a per-process prctl() that can be set and carried
across fork so that no changes are needed to any workload: it would
simply special-case the thp allocation policy to use __GFP_THISNODE,
which is the default for bare metal, and to not use it when we've said
the workload will span multiple nodes.  Depending on the size of the
workload, it may choose to use this setting on certain systems and not
others.
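
To make the prctl() idea concrete, here is a minimal sketch of a launcher
that would tag a workload as spanning nodes before exec.  The
PR_SET_THP_NODE_POLICY option and its value are hypothetical, they do not
exist in any kernel today and are purely illustrative of the proposal; on
current kernels the call simply fails with EINVAL.

#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>

/* Hypothetical ABI -- not part of any kernel, illustrative only. */
#define PR_SET_THP_NODE_POLICY  1000
#define THP_NODE_POLICY_SPAN       1  /* workload spans nodes: drop __GFP_THISNODE */

int main(int argc, char **argv)
{
        if (argc < 2) {
                fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
                return 1;
        }

        /*
         * The policy is set once in the launcher and, per the proposal,
         * carried across fork/exec, so the workload itself is unchanged.
         */
        if (prctl(PR_SET_THP_NODE_POLICY, THP_NODE_POLICY_SPAN, 0, 0, 0))
                perror("prctl");

        execvp(argv[1], &argv[1]);
        perror("execvp");
        return 1;
}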