Date: Thu, 04 Oct 2012 20:23:50 +0400
From: Konstantin Khlebnikov
To: Hugh Dickins
Cc: Shaohua Li, Rik van Riel, Minchan Kim, Andrew Morton, Wu Fengguang,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH RFC] mm/swap: automatic tuning for swapin readahead

Here are the results of my test. The workload isn't very realistic, but at
least it is threaded: compiling linux-3.6 with defconfig in 16 threads on
tmpfs, 512 MB RAM, dual-core CPU, ordinary hard disk. (Test script attached.)

Average results for ten runs:

                  RA=3      RA=0      RA=1      RA=2      RA=4      Hugh   Shaohua
real time          500       542       528       519       500       523       522
user time          738       737       735       737       739       737       739
sys time            93        93        91        92        96        92        93
pgmajfault       62918    110533     92454     78221     54342     86601     77229
pgpgin         2070372    795228   1034046   1471010   3177192   1154532   1599388
pgpgout        2597278   2022037   2110020   2350380   2802670   2286671   2526570
pswpin          462747    138873    202148    310969    739431    232710    341320
pswpout         646363    502599    524613    584731    697797    568784    628677

So the last two columns show mostly equal results: +4.6% and +4.4% real time
compared to the vanilla kernel with RA=3, but your version shows more stable
results (std-error 2.7% against 4.8%). (All these numbers are in the big
table in the attachment.)

Numbers from your tests, formatted into tables for better readability
(elapsed centisecs; the RA=N columns are your patch with page_cluster=N):

HDD            Vanilla   Shaohua      RA=3      RA=0      RA=4
SEQ, ANON        73921     76210     75611    121542     77950
SEQ, SHMEM       73601     73176     73855    118322     73534
RND, ANON       895392    831243    871569    841680    863871
RND, SHMEM     1058375   1053486    827935    756489    834804

SSD            Vanilla   Shaohua      RA=3      RA=0      RA=4
SEQ, ANON        24634     24198     24673     70018     21125
SEQ, SHMEM       24959     24932     25052     69678     21387
RND, ANON        43014     26146     28075     25901     28686
RND, SHMEM       45349     45215     28249     24332     28226
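For reference, a minimal sketch of driving one such configuration (assuming
RA=N above means the otherwise vanilla kernel with vm.page-cluster set to N;
the log names are made up and this is not necessarily the exact harness used;
test-linux-build.sh is the script attached below):

	# sketch: ten runs of one configuration (here RA=3, the default page_cluster)
	sysctl -w vm.page-cluster=3
	for i in $(seq 1 10); do
		./test-linux-build.sh > ra3-run-$i.log 2>&1
	done

Both streams are captured because time -p reports on stderr while the vmstat
deltas go to stdout.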
Hugh Dickins wrote:
> On Tue, 2 Oct 2012, Konstantin Khlebnikov wrote:
>> Hugh Dickins wrote:
>>>
>>> If I boot with mem=900M (and 1G swap: either on hard disk sda, or
>>> on Vertex II SSD sdb), and mmap anonymous 1000M (either MAP_PRIVATE,
>>> or MAP_SHARED for a shmem object), and either cycle sequentially round
>>> that making 5M touches (spaced a page apart), or make 5M random touches,
>>> then here are the times in centisecs that I see (but it's only elapsed
>>> that I've been worrying about).
>>>
>>> 3.6-rc7 swapping to hard disk:
>>> 124 user 6154 system 73921 elapsed -rc7 sda seq
>>> 102 user 8862 system 895392 elapsed -rc7 sda random
>>> 130 user 6628 system 73601 elapsed -rc7 sda shmem seq
>>> 194 user 8610 system 1058375 elapsed -rc7 sda shmem random
>>>
>>> 3.6-rc7 swapping to SSD:
>>> 116 user 5898 system 24634 elapsed -rc7 sdb seq
>>> 96 user 8166 system 43014 elapsed -rc7 sdb random
>>> 110 user 6410 system 24959 elapsed -rc7 sdb shmem seq
>>> 208 user 8024 system 45349 elapsed -rc7 sdb shmem random
>>>
>>> 3.6-rc7 + Shaohua's patch (and FAULT_FLAG_RETRY check in do_swap_page),
>>> HDD:
>>> 116 user 6258 system 76210 elapsed shli sda seq
>>> 80 user 7716 system 831243 elapsed shli sda random
>>> 128 user 6640 system 73176 elapsed shli sda shmem seq
>>> 212 user 8522 system 1053486 elapsed shli sda shmem random
>>>
>>> 3.6-rc7 + Shaohua's patch (and FAULT_FLAG_RETRY check in do_swap_page),
>>> SSD:
>>> 126 user 5734 system 24198 elapsed shli sdb seq
>>> 90 user 7356 system 26146 elapsed shli sdb random
>>> 128 user 6396 system 24932 elapsed shli sdb shmem seq
>>> 192 user 8006 system 45215 elapsed shli sdb shmem random
>>>
>>> 3.6-rc7 + my patch, swapping to hard disk:
>>> 126 user 6252 system 75611 elapsed hugh sda seq
>>> 70 user 8310 system 871569 elapsed hugh sda random
>>> 130 user 6790 system 73855 elapsed hugh sda shmem seq
>>> 148 user 7734 system 827935 elapsed hugh sda shmem random
>>>
>>> 3.6-rc7 + my patch, swapping to SSD:
>>> 116 user 5996 system 24673 elapsed hugh sdb seq
>>> 76 user 7568 system 28075 elapsed hugh sdb random
>>> 132 user 6468 system 25052 elapsed hugh sdb shmem seq
>>> 166 user 7220 system 28249 elapsed hugh sdb shmem random
>>>
>>
>> Hmm, it would be nice to gather numbers without swapin readahead at all,
>> just to see the worst possible case for sequential read and the best for
>> random.
>
> Right, and also interesting to see what happens if we raise page_cluster
> (more of an option than it was, with your or my patch scaling it down).
> Run on the same machine under the same conditions:
>
> 3.6-rc7 + my patch, swapping to hard disk with page_cluster 0 (no readahead):
> 136 user 34038 system 121542 elapsed hugh cluster0 sda seq
> 102 user 7928 system 841680 elapsed hugh cluster0 sda random
> 130 user 34770 system 118322 elapsed hugh cluster0 sda shmem seq
> 160 user 7362 system 756489 elapsed hugh cluster0 sda shmem random
>
> 3.6-rc7 + my patch, swapping to SSD with page_cluster 0 (no readahead):
> 138 user 32230 system 70018 elapsed hugh cluster0 sdb seq
> 88 user 7296 system 25901 elapsed hugh cluster0 sdb random
> 154 user 33150 system 69678 elapsed hugh cluster0 sdb shmem seq
> 166 user 6936 system 24332 elapsed hugh cluster0 sdb shmem random
>
> 3.6-rc7 + my patch, swapping to hard disk with page_cluster 4 (default + 1):
> 144 user 4262 system 77950 elapsed hugh cluster4 sda seq
> 74 user 8268 system 863871 elapsed hugh cluster4 sda random
> 140 user 4880 system 73534 elapsed hugh cluster4 sda shmem seq
> 160 user 7788 system 834804 elapsed hugh cluster4 sda shmem random
>
> 3.6-rc7 + my patch, swapping to SSD with page_cluster 4 (default + 1):
> 124 user 4242 system 21125 elapsed hugh cluster4 sdb seq
> 72 user 7680 system 28686 elapsed hugh cluster4 sdb random
> 122 user 4622 system 21387 elapsed hugh cluster4 sdb shmem seq
> 172 user 7238 system 28226 elapsed hugh cluster4 sdb shmem random
>
> I was at first surprised to see random significantly faster than sequential
> on SSD with readahead off, thinking they ought to come out the same. But
> no, that's a warning on the limitations of the test: with an mmap of 1000M
> on a machine with mem=900M, the page-by-page sequential is never going to
> rehit cache, whereas the random has a good chance of finding it in memory.
>
> Which I presume also accounts for the lower user times throughout
> for random - but then why not the same for shmem random?
>
> I did start off measuring on the laptop with SSD, mmap 1000M mem=500M;
> but once I transferred to the desktop, I rediscovered just how slow
> swapping to hard disk can be, couldn't wait days, so made mem=900M.
>
>> I'll run some tests too, especially I want to see how it works for less
>> synthetic workloads.
>
> Thank you, that would be valuable. I expect there to be certain midway
> tests on which Shaohua's patch would show up as significantly faster,
> where his per-vma approach would beat the global approach; then the
> global to improve with growing contention between processes. But I
> didn't devise any such test, and hoped Shaohua might have one.
>
> Hugh

[Attachment: test-linux-build.sh]

#!/bin/sh
# Build linux-3.6 (defconfig, -j16) on tmpfs and report vmstat deltas.
T=/mnt/tmp
L=linux-3.6.tar.bz2
J=16

# Start from clean swap and drop caches.
swapoff -a
swapon -a
sysctl -q vm.drop_caches=3

date
uname -a
free

mount -t tmpfs tmpfs $T -o size=4G
tar xf $L -C $T
make -s -C $T/linux-* defconfig
cat /proc/vmstat > $T/vmstat.a
time -p make -s -C $T/linux-* -j$J all
cat /proc/vmstat > $T/vmstat.b
# Print the per-counter difference between the two vmstat snapshots.
paste $T/vmstat.a $T/vmstat.b | awk '{print $1, $4-$2}'
umount $T

[Attachment: test-linux-build-results.txt]

mean of ten runs [std-error], and change relative to 0-orig.log:

             0-orig.log         1-nora.log              1-one.log               2-two.log               4-four.log              5-hugh.log              6-shaohua.log
real time          500 [1.9%]         542 [4.5%]  +8.3%        528 [4.7%]  +5.7%        519 [2.6%]  +3.8%        500 [1.2%]  +0.1%        523 [2.7%]  +4.6%        522 [4.4%]  +4.4%
user time          738 [0.5%]         737 [0.7%]  -0.2%        735 [0.4%]  -0.4%        737 [0.4%]  -0.1%        739 [0.4%]  +0.1%        737 [0.4%]  -0.1%        739 [0.3%]  +0.2%
sys time            93 [1.2%]          93 [1.8%]  +0.6%         91 [0.8%]  -1.4%         92 [1.2%]  -1.3%         96 [0.8%]  +3.8%         92 [0.8%]  -1.0%         93 [1.0%]  +0.5%
pgmajfault       62918 [4.2%]      110533 [6.6%] +75.7%      92454 [4.6%] +46.9%      78221 [3.3%] +24.3%      54342 [2.7%] -13.6%      86601 [3.8%] +37.6%      77229 [6.4%] +22.7%
pgpgin         2070372 [4.0%]      795228 [6.7%] -61.6%    1034046 [2.9%] -50.1%    1471010 [4.2%] -28.9%    3177192 [2.1%] +53.5%    1154532 [3.4%] -44.2%    1599388 [6.2%] -22.7%
pgpgout        2597278 [7.9%]     2022037 [9.4%] -22.1%    2110020 [4.3%] -18.8%    2350380 [8.1%]  -9.5%    2802670 [6.8%]  +7.9%    2286671 [7.0%] -12.0%    2526570 [8.1%]  -2.7%
pswpin          462747 [4.5%]      138873 [6.1%] -70.0%     202148 [3.4%] -56.3%     310969 [3.7%] -32.8%     739431 [2.0%] +59.8%     232710 [3.6%] -49.7%     341320 [6.9%] -26.2%
pswpout         646363 [7.9%]      502599 [9.5%] -22.2%     524613 [4.4%] -18.8%     584731 [8.2%]  -9.5%     697797 [6.8%]  +8.0%     568784 [7.0%] -12.0%     628677 [8.1%]  -2.7%
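A note on how the averages above can be derived from the per-run logs: the
script prints one "counter delta" line per vmstat field, so averaging across
runs only needs a small filter. The helper below is an illustrative sketch
(the name avg-vmstat.sh and the log names are made up, not part of the
original harness); the real/user/sys rows come from the time -p output.

	#!/bin/sh
	# avg-vmstat.sh (hypothetical helper): average the "counter delta" lines
	# emitted by test-linux-build.sh across several run logs.
	#   usage: sh avg-vmstat.sh ra3-run-*.log
	awk 'NF == 2 && $2 ~ /^-?[0-9]+$/ { sum[$1] += $2; cnt[$1]++ }
	     END { for (k in sum) printf "%s %.0f\n", k, sum[k] / cnt[k] }' "$@"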