Received: by 2002:a25:824b:0:0:0:0:0 with SMTP id d11csp7870686ybn; Mon, 30 Sep 2019 22:44:40 -0700 (PDT) X-Google-Smtp-Source: APXvYqzXNr+FWRHWpg8K/oAJPxU8Ob401dHNxwm9C97ui4rFUvJBOpS2eTuhBlIAKoLC25vQfWHF X-Received: by 2002:a17:906:52d8:: with SMTP id w24mr21907986ejn.6.1569908680384; Mon, 30 Sep 2019 22:44:40 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1569908680; cv=none; d=google.com; s=arc-20160816; b=WedswnsjJZ1Rn2wuiN3XBRkLHTTRw0+T6rQh/C+sJDNr/Bem4ZBYUy7ubppL43KiUE lw1pp/e/rs5zH3iChJZVHUpaoFjyhJYROYUdSnOYfZDbDAVS27tP6S0xMSv5BL3sKMaf eSvIG7/rr81r4gRooEelYaKHCqRmNYPzlIhUzY++idHjIpyzTDD7qW848Q2aBTBIlzHO etNZLqINoqyVcO7MY3bJZJi5pll3iY6DJtbg78pGVM//YlDuQiaqqGQ+NMg7NdL/Yvwb CcC2YEUeVHYRVrXySaMjYclxyRFE91MeyvmM98Qs1FXasS8qfGAZo28xwtRmcEukelhL z9MA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date; bh=2hsbHoJgXkHj4WYF7eOF+KfeAD+L/3N62NWJVrHQAmo=; b=HiXaPYwVqOrosJMUBoxdTYxF7VKZCBiKXnFXybjjc/sRreKMse/lWl8iVWAqKeKQRN jCYPllX+YVgaAWblly0M/30A6+LE63LPz+yGNsBCh+i+x69Ey7lF3BJbVtkzFhZ46FJT HTJUu+iSPrHrWRY3rDWTpvn15PG1kH32Xmti63+VjzbYdvTIdb+VPSVImyPe86baexI8 1hIHVKAt4RTheBLyQtaMH3B4aiLzMiTX4iAEyvAaZeifoAdFlWZfu27dJ0zk17NPNeuB AouQgRIgwpH5+sxkKoWU2m9MqDRADdq60xuT0kGuG0XauQhbLJVVkUamxuSzza3bzQjq YWWw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id l6si8708786edc.67.2019.09.30.22.44.15; Mon, 30 Sep 2019 22:44:40 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726822AbfJAFnw (ORCPT + 99 others); Tue, 1 Oct 2019 01:43:52 -0400 Received: from mx2.suse.de ([195.135.220.15]:46796 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1725777AbfJAFnw (ORCPT ); Tue, 1 Oct 2019 01:43:52 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 44934AEE9; Tue, 1 Oct 2019 05:43:46 +0000 (UTC) Date: Tue, 1 Oct 2019 07:43:43 +0200 From: Michal Hocko To: Linus Torvalds Cc: David Rientjes , Andrea Arcangeli , Andrew Morton , Mel Gorman , Vlastimil Babka , "Kirill A. Shutemov" , Linux Kernel Mailing List , Linux-MM Subject: Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages Message-ID: <20191001054343.GA15624@dhcp22.suse.cz> References: <20190904205522.GA9871@redhat.com> <20190909193020.GD2063@dhcp22.suse.cz> <20190925070817.GH23050@dhcp22.suse.cz> <20190927074803.GB26848@dhcp22.suse.cz> <20190930112817.GC15942@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20190930112817.GC15942@dhcp22.suse.cz> User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon 30-09-19 13:28:17, Michal Hocko wrote: [...] > Do not get me wrong, but we have a quite a long history of fine tuning > for THP by adding kludges here and there and they usually turnout to > break something else. I really want to get to understand the underlying > problem and base a solution on it rather than "__GFP_THISNODE can cause > overreclaim so pick up a break out condition and hope for the best". I didn't really get to any testing earlier but I was really suspecting that hugetlb will be the first one affected because it uses __GFP_RETRY_MAYFAIL - aka it really wants to succeed as much as possible because this is a direct admin request to preallocate a specific number of huge pages. The patch 3 doesn't really make any difference for those requests. Here is a very simple test I've done. Single NUMA node with 1G of memory. Populate the memory with a clean page cache. That is both easy to compact and reclaim and then try to allocate 512M worth of hugetlb pages. root@test1:~# cat hugetlb_test.sh #!/bin/sh set -x echo 0 > /proc/sys/vm/nr_hugepages echo 3 > /proc/sys/vm/drop_caches echo 1 > /proc/sys/vm/compact_memory dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10)) TS=$(date +%s) cp /proc/vmstat vmstat.$TS.before echo 256 > /proc/sys/vm/nr_hugepages cat /proc/sys/vm/nr_hugepages cp /proc/vmstat vmstat.$TS.after The results for 2 consecutive runs on clean 5.3 root@test1:~# sh hugetlb_test.sh + echo 0 + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s + date +%s + TS=1569905284 + cp /proc/vmstat vmstat.1569905284.before + echo 256 + cat /proc/sys/vm/nr_hugepages 256 + cp /proc/vmstat vmstat.1569905284.after root@test1:~# sh hugetlb_test.sh + echo 0 + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s + date +%s + TS=1569905311 + cp /proc/vmstat vmstat.1569905311.before + echo 256 + cat /proc/sys/vm/nr_hugepages 256 + cp /proc/vmstat vmstat.1569905311.after so we get all the requested huge pages to the pool. Now with first 3 patches of this series applied (the last one doesn't make any difference for hugetlb allocations). root@test1:~# sh hugetlb_test.sh + echo 0 + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s + date +%s + TS=1569905516 + cp /proc/vmstat vmstat.1569905516.before + echo 256 + cat /proc/sys/vm/nr_hugepages 11 + cp /proc/vmstat vmstat.1569905516.after root@test1:~# sh hugetlb_test.sh + echo 0 + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s + date +%s + TS=1569905541 + cp /proc/vmstat vmstat.1569905541.before + echo 256 + cat /proc/sys/vm/nr_hugepages 12 + cp /proc/vmstat vmstat.1569905541.after so we do not get more that 12 huge pages which is really poor. Although hugetlb pages tend to be allocated early after the boot they are still an explicit admin request and having less than 5% success rate is really bad. If anything the __GFP_RETRY_MAYFAIL needs to be reflected in the code. I can provide vmstat files if anybody is interested. Then I have tried another test for THP. It is essentially the same thing. Populate the page cache to simulate a quite common case of memory being used for the cache and then populate 512M anonymous area with MADV_HUGEPAG set on it: $ cat thp_test.sh #!/bin/sh set -x echo 3 > /proc/sys/vm/drop_caches echo 1 > /proc/sys/vm/compact_memory dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10)) TS=$(date +%s) cp /proc/vmstat vmstat.$TS.before ./mem_eater nowait 500M cp /proc/vmstat vmstat.$TS.after mem_eater is essentially (mmap, madvise, and touch page by page dummy app). Again clean 5.3 kernel root@test1:~# sh thp_test.sh + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 20.8575 s, 51.5 MB/s + date +%s + TS=1569906274 + cp /proc/vmstat vmstat.1569906274.before + ./mem_eater nowait 500M 7f55e8282000-7f5607682000 rw-p 00000000 00:00 0 Size: 512000 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Rss: 512000 kB Pss: 512000 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 512000 kB Referenced: 260616 kB Anonymous: 512000 kB LazyFree: 0 kB AnonHugePages: 509952 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB Locked: 0 kB THPeligible: 1 + cp /proc/vmstat vmstat.1569906274.after root@test1:~# sh thp_test.sh + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.8648 s, 49.1 MB/s + date +%s + TS=1569906333 + cp /proc/vmstat vmstat.1569906333.before + ./mem_eater nowait 500M 7f26625cd000-7f26819cd000 rw-p 00000000 00:00 0 Size: 512000 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Rss: 512000 kB Pss: 512000 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 512000 kB Referenced: 259892 kB Anonymous: 512000 kB LazyFree: 0 kB AnonHugePages: 509952 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB Locked: 0 kB THPeligible: 1 + cp /proc/vmstat vmstat.1569906333.after We are getting quite consistent 99% THP utilization. grep "pgsteal_direct\|pgsteal_kswapd\|allocstall_movable\|compact_stall" vmstat.1569906333.{before,after} vmstat.1569906333.before:allocstall_movable 29 vmstat.1569906333.before:pgsteal_kswapd 206760 vmstat.1569906333.before:pgsteal_direct 30162 vmstat.1569906333.before:compact_stall 29 vmstat.1569906333.after:allocstall_movable 65 vmstat.1569906333.after:pgsteal_kswapd 298688 vmstat.1569906333.after:pgsteal_direct 67645 vmstat.1569906333.after:compact_stall 66 Hit the direct compaction 37 times and reclaimed 146M in direct 359M by kswapd which is 505M in total which is not bad for 512M request. 5.3 + 3 patches (This is a non-NUMA machine so the only difference the 4th patch would make is timing because it just retries 2 times on the same node). root@test1:~# sh thp_test.sh + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.0732 s, 51.0 MB/s + date +%s + TS=1569906542 + cp /proc/vmstat vmstat.1569906542.before + ./mem_eater nowait 500M 7f2799e08000-7f27b9208000 rw-p 00000000 00:00 0 Size: 512000 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Rss: 512000 kB Pss: 512000 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 512000 kB Referenced: 294944 kB Anonymous: 512000 kB LazyFree: 0 kB AnonHugePages: 477184 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB Locked: 0 kB THPeligible: 1 + cp /proc/vmstat vmstat.1569906542.after root@test1:~# sh thp_test.sh + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.9239 s, 49.0 MB/s + date +%s + TS=1569906569 + cp /proc/vmstat vmstat.1569906569.before + ./mem_eater nowait 500M 7fa29a234000-7fa2b9634000 rw-p 00000000 00:00 0 Size: 512000 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Rss: 512000 kB Pss: 512000 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 512000 kB Referenced: 253480 kB Anonymous: 512000 kB LazyFree: 0 kB AnonHugePages: 460800 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB Locked: 0 kB THPeligible: 1 + cp /proc/vmstat vmstat.1569906569.after The drop down in the utilization is not that rapid here but it shows that the results are not very stable as well. grep "pgsteal_direct\|pgsteal_kswapd\|allocstall_movable\|compact_stall" vmstat.1569906569.{before,after} vmstat.1569906569.before:allocstall_movable 52 vmstat.1569906569.before:pgsteal_kswapd 182617 vmstat.1569906569.before:pgsteal_direct 54281 vmstat.1569906569.before:compact_stall 85 vmstat.1569906569.after:allocstall_movable 64 vmstat.1569906569.after:pgsteal_kswapd 296840 vmstat.1569906569.after:pgsteal_direct 66778 vmstat.1569906569.after:compact_stall 191 We have hit the direct compaction 106 times and reclaimed 48M from direct and 446M from kswapd totaling 494M reclaimed in total. Slightly less than with clean 5.3 but I would consider it within noise. I didn't really get to analyze numbers very deeply but from a very preliminary look it seems that the bailout based on the watermark check is causing volatility because it depends on the kswapd activity in the background. Please note that this is pretty much the ideal case when the reclaimable memory is essentially free to drop. If kswapd starts fighting to get memory reclaimed then the THP utilization is likely to drop down as well. On the other hand it is fair to say that it is really hard to tell what would compaction do under those conditions. I also didn't really get to test any NUMA aspect of the change yet. I still do hope that David can share something I can play with because I do not want to create something completely artificial. -- Michal Hocko SUSE Labs