Message-ID: <51E4054F.8040706@gmail.com>
Date: Mon, 15 Jul 2013 22:21:03 +0800
From: Hush Bensen
To: Mel Gorman, Andrew Morton
CC: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
    Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner,
    Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML
Subject: Re: [PATCH 0/8] Reduce system disruption due to kswapd followup V3
References: <1369869457-22570-1-git-send-email-mgorman@suse.de>
In-Reply-To: <1369869457-22570-1-git-send-email-mgorman@suse.de>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2013/5/30 7:17, Mel Gorman wrote:
> tldr; Overall the system is getting less kicked in the face. Scan rates
> between zones are often more balanced than they used to be. There are
> now fewer writes from reclaim context and a reduction in IO wait
> times.
>
> This series replaces all of the previous follow-up series. It was clear
> that more of the stall logic needed to be in the same place so it is
> comprehensible and easier to predict.
>
> Changelog since V2
> o Consolidate stall decisions into one place
> o Add is_dirty_writeback for NFS
> o Move accounting around
>
> Further testing of the "Reduce system disruption due to kswapd" discovered
> a few problems. First and foremost, it's possible for pages under writeback
> to be freed which will lead to badness. Second, as pages were not being
> swapped the file LRU was being scanned faster and clean file pages were
> being reclaimed. In some cases this results in increased read IO to re-read
> data from disk. Third, more pages were being written from kswapd context
> which can adversely affect IO performance. Lastly, it was observed that
> PageDirty pages are not necessarily dirty on all filesystems (buffers can be
> clean while PageDirty is set and ->writepage generates no IO) and not all
> filesystems set PageWriteback when the page is being written (e.g. ext3).
> This disconnect confuses the reclaim stalling logic. This follow-up series
> is aimed at these problems.
>
> The tests were based on three kernels
>
> vanilla:           kernel 3.9 as that is what the current mmotm uses as a baseline
> mmotm-20130522     is mmotm as of 22nd May with "Reduce system disruption due to
>                    kswapd" applied on top as per what should be in Andrew's tree
>                    right now
> lessdisrupt-v7r10  is this follow-up series on top of the mmotm kernel
>
> The first test used memcached+memcachetest while some background IO
> was in progress as implemented by the parallel IO tests in MM Tests.
> memcachetest benchmarks how many operations/second memcached
> can service. It starts with no background IO on a freshly created ext4
> filesystem and then re-runs the test with larger amounts of IO in the
> background to roughly simulate a large copy in progress. The expectation
> is that the IO should have little or no impact on memcachetest which is
> running entirely in memory.
>
> parallelio
>                                        3.9.0                 3.9.0                 3.9.0
>                                      vanilla    mm1-mmotm-20130522 mm1-lessdisrupt-v7r10
> Ops memcachetest-0M       23117.00 (  0.00%)    22780.00 ( -1.46%)    22763.00 ( -1.53%)
> Ops memcachetest-715M     23774.00 (  0.00%)    23299.00 ( -2.00%)    22934.00 ( -3.53%)
> Ops memcachetest-2385M     4208.00 (  0.00%)    24154.00 (474.00%)    23765.00 (464.76%)
> Ops memcachetest-4055M     4104.00 (  0.00%)    25130.00 (512.33%)    24614.00 (499.76%)
> Ops io-duration-0M            0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
> Ops io-duration-715M         12.00 (  0.00%)        7.00 ( 41.67%)        6.00 ( 50.00%)
> Ops io-duration-2385M       116.00 (  0.00%)       21.00 ( 81.90%)       21.00 ( 81.90%)
> Ops io-duration-4055M       160.00 (  0.00%)       36.00 ( 77.50%)       35.00 ( 78.12%)
> Ops swaptotal-0M              0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
> Ops swaptotal-715M       140138.00 (  0.00%)       18.00 ( 99.99%)       18.00 ( 99.99%)
> Ops swaptotal-2385M      385682.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
> Ops swaptotal-4055M      418029.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
> Ops swapin-0M                 0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
> Ops swapin-715M             144.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
> Ops swapin-2385M         134227.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
> Ops swapin-4055M         125618.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
> Ops minorfaults-0M      1536429.00 (  0.00%)  1531632.00 (  0.31%)  1533541.00 (  0.19%)
> Ops minorfaults-715M    1786996.00 (  0.00%)  1612148.00 (  9.78%)  1608832.00 (  9.97%)
> Ops minorfaults-2385M   1757952.00 (  0.00%)  1614874.00 (  8.14%)  1613541.00 (  8.21%)
> Ops minorfaults-4055M   1774460.00 (  0.00%)  1633400.00 (  7.95%)  1630881.00 (  8.09%)
> Ops majorfaults-0M            1.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
> Ops majorfaults-715M        184.00 (  0.00%)      167.00 (  9.24%)      166.00 (  9.78%)
> Ops majorfaults-2385M     24444.00 (  0.00%)      155.00 ( 99.37%)       93.00 ( 99.62%)
> Ops majorfaults-4055M     21357.00 (  0.00%)      147.00 ( 99.31%)      134.00 ( 99.37%)
>
> memcachetest is the transactions/second reported by memcachetest. In
> the vanilla kernel note that performance drops from around
> 23K/sec to just over 4K/second when there is 2385M of IO going
> on in the background. With current mmotm, there is no collapse
> in performance and with this follow-up series there is little
> change.
>
> swaptotal is the total amount of swap traffic. With mmotm and the follow-up
> series, the total amount of swapping is much reduced.
>
>
>                                   3.9.0               3.9.0                  3.9.0
>                                 vanilla  mm1-mmotm-20130522  mm1-lessdisrupt-v7r10
> Minor Faults                   11160152            10706748               10622316
> Major Faults                      46305                 755                    678
> Swap Ins                         260249                   0                      0
> Swap Outs                        683860                  18                     18
> Direct pages scanned                  0                 678                   2520
> Kswapd pages scanned            6046108             8814900                1639279
> Kswapd pages reclaimed          1081954             1172267                1094635
> Direct pages reclaimed                0                 566                   2304
> Kswapd efficiency                   17%                 13%                    66%
> Kswapd velocity                5217.560            7618.953               1414.879
> Direct efficiency                  100%                 83%                    91%
> Direct velocity                   0.000               0.586                  2.175
> Percentage direct scans              0%                  0%                     0%
> Zone normal velocity           5105.086            6824.681                671.158
> Zone dma32 velocity             112.473             794.858                745.896
> Zone dma velocity                 0.000               0.000                  0.000
> Page writes by reclaim      1929612.000         6861768.000              32821.000
> Page writes file                1245752             6861750                  32803
> Page writes anon                 683860                  18                     18
> Page reclaim immediate             7484                  40                    239
> Sector Reads                    1130320               93996                  86900
> Sector Writes                  13508052            10823500               11804436
> Page rescued immediate                0                   0                      0
> Slabs scanned                     33536               27136                  18560
> Direct inode steals                   0                   0                      0
> Kswapd inode steals                8641                1035                      0
> Kswapd skipped wait                   0                   0                      0
> THP fault alloc                       8                  37                     33
> THP collapse alloc                  508                 552                    515
> THP splits                           24                   1                      1
> THP fault fallback                    0                   0                      0
> THP collapse fail                     0                   0                      0

Which mmtests config did you use for this one?

>
> There are a number of observations to make here
>
> 1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
>    pages swapped were really unused anonymous pages. Related to that,
>    major faults are much reduced.
>
> 2. kswapd efficiency was impacted by the initial series but with these
>    follow-up patches, the efficiency is now at 66% indicating that far
>    fewer pages were skipped during scanning due to dirty or writeback
>    pages.
>
> 3. kswapd velocity is reduced indicating that fewer pages are being scanned
>    with the follow-up series as kswapd now stalls when the tail of the
>    LRU queue is full of unqueued dirty pages. The stall gives flushers a
>    chance to catch-up so kswapd can reclaim clean pages when it wakes.
>
> 4. In light of Zlatko's recent reports about zone scanning imbalances,
>    mmtests now reports scanning velocity on a per-zone basis. With mainline,
>    you can see that the scanning activity is dominated by the Normal
>    zone with over 45 times more scanning in Normal than the DMA32 zone.
>    With the series currently in mmotm, the ratio is slightly better but it
>    is still the case that the bulk of scanning is in the highest zone. With
>    this follow-up series, the ratio of scanning between the Normal and
>    DMA32 zone is roughly equal.
>
> 5. As Dave Chinner observed, the current patches in mmotm increased the
>    number of pages written from kswapd context which is expected to adversely
>    impact IO performance. With the follow-up patches, far fewer pages are
>    written from kswapd context than the mainline kernel.
>
> 6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
>    the follow-up series, there is less slab shrinking activity and no inodes
>    were reclaimed.
>
> 7. Note that "Sector Reads" is drastically reduced implying that the source
>    data being used for the IO is not being aggressively discarded due to
>    page reclaim skipping over dirty pages and reclaiming clean pages. Note
>    that the reduction in reads could also be due to inode data not being
>    re-read from disk after a slab shrink.
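
For reference, the derived figures quoted in observations 2 and 4 can be re-computed from the raw counters in the table above. This is only a minimal sketch: it assumes "efficiency" is pages reclaimed as a truncated percentage of pages scanned, and that the per-zone comparison is a plain ratio of the reported velocities. The function names are made up for illustration and are not mmtests code.

```python
# Illustrative helpers (hypothetical names, not part of mmtests).

def efficiency(reclaimed, scanned):
    """Pages reclaimed as a whole-number percentage of pages scanned.

    Assumption: the report truncates rather than rounds, and shows 100%
    when nothing was scanned (as in the vanilla "Direct efficiency" row).
    """
    if scanned == 0:
        return 100
    return reclaimed * 100 // scanned


def zone_ratio(normal_velocity, dma32_velocity):
    """How many times faster the Normal zone is scanned than DMA32."""
    return normal_velocity / dma32_velocity


# (Kswapd pages reclaimed, Kswapd pages scanned) from the local-disk run.
kswapd = {
    "vanilla": (1081954, 6046108),
    "mmotm": (1172267, 8814900),
    "lessdisrupt": (1094635, 1639279),
}
# (Zone normal velocity, Zone dma32 velocity) from the same table.
zones = {
    "vanilla": (5105.086, 112.473),
    "mmotm": (6824.681, 794.858),
    "lessdisrupt": (671.158, 745.896),
}

for name in kswapd:
    eff = efficiency(*kswapd[name])
    ratio = zone_ratio(*zones[name])
    print(f"{name:12s} kswapd efficiency {eff:3d}%  Normal/DMA32 {ratio:6.2f}")
```

Under those assumptions the efficiencies come out at 17%, 13% and 66%, matching the table, and the Normal/DMA32 scan ratio falls from a little over 45 on vanilla to just under 1 with the follow-up series.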
>
>                                   3.9.0               3.9.0                  3.9.0
>                                 vanilla  mm1-mmotm-20130522  mm1-lessdisrupt-v7r10
> Mean sda-avgqz                   166.99               32.09                  33.44
> Mean sda-await                   853.64              192.76                 185.43
> Mean sda-r_await                   6.31                9.24                   5.97
> Mean sda-w_await                2992.81              202.65                 192.43
> Max sda-avgqz                   1409.91              718.75                 698.98
> Max sda-await                   6665.74             3538.00                3124.23
> Max sda-r_await                   58.96              111.95                  58.00
> Max sda-w_await                28458.94             3977.29                3148.61
>
> In light of the changes in writes from reclaim context, the number of
> reads and Dave Chinner's concerns about IO performance I took a closer
> look at the IO stats for the test disk. A few observations
>
> 1. The average queue size is reduced by the initial series and roughly
>    the same with this follow up.
>
> 2. Average wait times for writes are reduced and as the IO
>    is completing faster it at least implies that the gain is because
>    flushers are writing the files efficiently instead of page reclaim
>    getting in the way.
>
> 3. The reduction in maximum write latency is staggering. 28 seconds down
>    to 3 seconds.
>
>
> Jan Kara asked how NFS is affected by all of this. Unstable pages can
> be taken into account as one of the patches in the series shows but it
> is still the case that filesystems with unusual handling of dirty or
> writeback could still be treated better.
>
> Tests like postmark, fsmark and largedd showed up nothing useful. On my test
> setup, pages are simply not being written back from reclaim context with or
> without the patches and there are no changes in performance. My test setup
> probably is just not strong enough network-wise to be really interesting.
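
Going back to the write-latency observation for the local disk ("28 seconds down to 3 seconds"), the sketch below converts the sda-w_await figures to seconds and works out the relative reduction. It assumes the values are in milliseconds, which is how iostat reports await times; `reduction` is an illustrative helper, not something from the series or from mmtests.

```python
def reduction(before, after):
    """Relative reduction from before to after, as a percentage."""
    return (before - after) / before * 100.0


# sda-w_await in milliseconds: (vanilla, mmotm, lessdisrupt),
# copied from the local-disk IO table.
w_await_ms = {
    "mean": (2992.81, 202.65, 192.43),
    "max": (28458.94, 3977.29, 3148.61),
}

for stat, (vanilla, mmotm, followup) in w_await_ms.items():
    print(f"{stat:4s} write wait: {vanilla / 1000:.1f}s vanilla, "
          f"{mmotm / 1000:.1f}s mmotm, {followup / 1000:.1f}s follow-up "
          f"({reduction(vanilla, followup):.1f}% lower than vanilla)")
```

With the numbers above, the maximum write wait drops by roughly 89% and the mean by roughly 94% relative to vanilla.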
>
> I ran a longer-lived memcached test with IO going to NFS instead of a local disk
>
> parallelio
>                                        3.9.0                 3.9.0                 3.9.0
>                                      vanilla    mm1-mmotm-20130522 mm1-lessdisrupt-v7r10
> Ops memcachetest-0M       23323.00 (  0.00%)    23241.00 ( -0.35%)    23321.00 ( -0.01%)
> Ops memcachetest-715M     25526.00 (  0.00%)    24763.00 ( -2.99%)    23242.00 ( -8.95%)
> Ops memcachetest-2385M     8814.00 (  0.00%)    26924.00 (205.47%)    23521.00 (166.86%)
> Ops memcachetest-4055M     5835.00 (  0.00%)    26827.00 (359.76%)    25560.00 (338.05%)
> Ops io-duration-0M            0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
> Ops io-duration-715M         65.00 (  0.00%)       71.00 ( -9.23%)       11.00 ( 83.08%)
> Ops io-duration-2385M       129.00 (  0.00%)       94.00 ( 27.13%)       53.00 ( 58.91%)
> Ops io-duration-4055M       301.00 (  0.00%)      100.00 ( 66.78%)      108.00 ( 64.12%)
> Ops swaptotal-0M              0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
> Ops swaptotal-715M        14394.00 (  0.00%)      949.00 ( 93.41%)       63.00 ( 99.56%)
> Ops swaptotal-2385M      401483.00 (  0.00%)    24437.00 ( 93.91%)    30118.00 ( 92.50%)
> Ops swaptotal-4055M      554123.00 (  0.00%)    35688.00 ( 93.56%)    63082.00 ( 88.62%)
> Ops swapin-0M                 0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
> Ops swapin-715M            4522.00 (  0.00%)      560.00 ( 87.62%)       63.00 ( 98.61%)
> Ops swapin-2385M         169861.00 (  0.00%)     5026.00 ( 97.04%)    13917.00 ( 91.81%)
> Ops swapin-4055M         192374.00 (  0.00%)    10056.00 ( 94.77%)    25729.00 ( 86.63%)
> Ops minorfaults-0M      1445969.00 (  0.00%)  1520878.00 ( -5.18%)  1454024.00 ( -0.56%)
> Ops minorfaults-715M    1557288.00 (  0.00%)  1528482.00 (  1.85%)  1535776.00 (  1.38%)
> Ops minorfaults-2385M   1692896.00 (  0.00%)  1570523.00 (  7.23%)  1559622.00 (  7.87%)
> Ops minorfaults-4055M   1654985.00 (  0.00%)  1581456.00 (  4.44%)  1596713.00 (  3.52%)
> Ops majorfaults-0M            0.00 (  0.00%)        1.00 (-99.00%)        0.00 (  0.00%)
> Ops majorfaults-715M        763.00 (  0.00%)      265.00 ( 65.27%)       75.00 ( 90.17%)
> Ops majorfaults-2385M     23861.00 (  0.00%)      894.00 ( 96.25%)     2189.00 ( 90.83%)
> Ops majorfaults-4055M     27210.00 (  0.00%)     1569.00 ( 94.23%)     4088.00 ( 84.98%)
>
> 1. Performance does not collapse due to IO which is good. IO is also completing
>    faster. Note with mmotm, IO completes in a third of the time and faster again
>    with this series applied.
>
> 2. Swapping is reduced, although not eliminated. The figures for the follow-up
>    look bad but it does vary a bit as the stalling is not perfect for nfs
>    or filesystems like ext3 with unusual handling of dirty and writeback
>    pages.
>
> 3. There are swapins, particularly with larger amounts of IO indicating
>    that active pages are being reclaimed. However, the number is much
>    reduced.
>
>                                   3.9.0               3.9.0                  3.9.0
>                                 vanilla  mm1-mmotm-20130522  mm1-lessdisrupt-v7r10
> Minor Faults                   36339175            35025445               35219699
> Major Faults                     310964               27108                  51887
> Swap Ins                        2176399              173069                 333316
> Swap Outs                       3344050              357228                 504824
> Direct pages scanned               8972               77283                  43242
> Kswapd pages scanned           20899983             8939566               14772851
> Kswapd pages reclaimed          6193156             5172605                5231026
> Direct pages reclaimed             8450               73802                  39514
> Kswapd efficiency                   29%                 57%                    35%
> Kswapd velocity                3929.743            1847.499               3058.840
> Direct efficiency                   94%                 95%                    91%
> Direct velocity                   1.687              15.972                  8.954
> Percentage direct scans              0%                  0%                     0%
> Zone normal velocity           3721.907             939.103               2185.142
> Zone dma32 velocity             209.522             924.368                882.651
> Zone dma velocity                 0.000               0.000                  0.000
> Page writes by reclaim      4082185.000          526319.000             537114.000
> Page writes file                 738135              169091                  32290
> Page writes anon                3344050              357228                 504824
> Page reclaim immediate             9524                 170                5595843
> Sector Reads                    8909900              861192                1483680
> Sector Writes                  13428980             1488744                2076800
> Page rescued immediate                0                   0                      0
> Slabs scanned                     38016               31744                  28672
> Direct inode steals                   0                   0                      0
> Kswapd inode steals                 424                   0                      0
> Kswapd skipped wait                   0                   0                      0
> THP fault alloc                      14                  15                    119
> THP collapse alloc                 1767                1569                   1618
> THP splits                           30                  29                     25
> THP fault fallback                    0                   0                      0
> THP collapse fail                     8                   5                      0
> Compaction stalls                    17                  41                    100
> Compaction success                    7                  31                     95
> Compaction failures                  10                  10                      5
> Page migrate success               7083               22157                  62217
> Page migrate failure                  0                   0                      0
> Compaction pages isolated         14847               48758                 135830
> Compaction migrate scanned        18328               48398                 138929
> Compaction free scanned         2000255              355827                1720269
> Compaction cost                       7                  24                     68
>
> I guess the main takeaway again is the much reduced page writes
> from reclaim context and reduced reads.
>
>                                   3.9.0               3.9.0                  3.9.0
>                                 vanilla  mm1-mmotm-20130522  mm1-lessdisrupt-v7r10
> Mean sda-avgqz                    23.58                0.35                   0.44
> Mean sda-await                   133.47               15.72                  15.46
> Mean sda-r_await                   4.72                4.69                   3.95
> Mean sda-w_await                 507.69               28.40                  33.68
> Max sda-avgqz                    680.60               12.25                  23.14
> Max sda-await                   3958.89              221.83                 286.22
> Max sda-r_await                   63.86               61.23                  67.29
> Max sda-w_await                11710.38              883.57                1767.28
>
> And as before, write wait times are much reduced.
>
>  fs/block_dev.c              |   1 +
>  fs/buffer.c                 |  34 +++++++++
>  fs/ext3/inode.c             |   1 +
>  fs/nfs/file.c               |  30 ++++++++
>  include/linux/buffer_head.h |   3 +
>  include/linux/fs.h          |   1 +
>  mm/vmscan.c                 | 164 ++++++++++++++++++++++++++++++++------------
>  7 files changed, 189 insertions(+), 45 deletions(-)
>