From: Alexey Lyahkov
Subject: Re: page eviction from the buddy cache
Date: Thu, 4 Apr 2013 07:51:24 +0300
Message-ID: <239AD30A-2A31-4346-A4C7-8A6EB8247990@gmail.com>
References: <51504A40.6020604@ya.ru> <20130327150743.GC14900@thunk.org> <3C8EEEF8-C1EB-4E3D-8DE6-198AB1BEA8C0@gmail.com> <515CD665.9000300@gmail.com>
In-Reply-To: <515CD665.9000300@gmail.com>
Cc: Hugh Dickins, Theodore Ts'o, Andrew Perepechko, linux-ext4@vger.kernel.org, akpm@linux-foundation.org, linux-mm@kvack.org
To: Will Huck
Sender: owner-linux-mm@kvack.org
List-Id: linux-ext4.vger.kernel.org

Hi Will,

I added a few tracepoints in mark_page_accessed, find_or_create_page and
add_to_page_cache_lru, and forced ftrace to log just these events:

    echo -n 150000 > /sys/kernel/debug/tracing/buffer_size_kb
    echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_vmscan_mark_accessed/enable
    echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_find_page/enable
    echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_find_create_page/enable
    echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_vmscan_lru_move/enable
    echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_add_page_lru/enable
    echo 1 > /sys/kernel/debug/tracing/tracing_on

A kprobe module is attached to __isolate_lru_page and __remove_from_page_cache,
and it hits a BUG_ON if __isolate_lru_page is asked to remove a buddy page from
the LRU lists.
The ftrace log buffer is extracted from the crashdump, together with the
backtrace where the BUG_ON was hit.
The log shows the page allocation via find_or_create_page, one or two
mark_page_accessed calls, and then the isolate call.
The backtrace is always similar to:

found buddy ffffea00022383d8 ffff88004d7015f0
------------[ cut here ]------------
kernel BUG at /Users/shadow/work/lustre/work/BUGS/MRP-691/jprobe/jprobe.c:40!
..
Call Trace:
 [] my__isolate_lru_page+0xe/0x18 [jprobe]
 [] isolate_pages_global+0xd0/0x380
 [] ? shrink_inactive_list+0xb9/0x730
 [] shrink_inactive_list+0x162/0x730
 [] ? cfs_hash_rw_unlock+0x1d/0x30 [libcfs]
 [] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
 [] ? mem_cgroup_soft_limit_reclaim+0x270/0x2a0
 [] ? cl_env_fetch+0x25/0x80 [obdclass]
 [] shrink_zone+0x38f/0x510
 [] balance_pgdat+0x719/0x810
 [] ? isolate_pages_global+0x0/0x380
 [] kswapd+0x144/0x3a0
 [] ? lock_release_holdtime+0x3d/0x190
 [] ? _spin_unlock_irqrestore+0x40/0x80
 [] ? autoremove_wake_function+0x0/0x40
 [] ? kswapd+0x0/0x3a0
 [] kthread+0x96/0xa0
 [] child_rip+0xa/0x20
 [] ? restore_args+0x0/0x30
 [] ? kthread+0x0/0xa0
 [] ? child_rip+0x0/0x20
....

On Apr 4, 2013, at 04:24, Will Huck wrote:

> Hi Alexey,
> On 03/28/2013 01:34 PM, Alexey Lyahkov wrote:
>> Hi Hugh,
>>
>> "immediately" say in ~1s after allocation /via krobes/ftrace logs/,
>> and you are correct - that is in case large streaming io in Lustre - like 3-4GB/s in read.
>> ftrace logs (with additional trace points) say page allocated, mark page accessed..
>> and nothing until that page will found in isolate_lru_page in shrink_inactive_list
>> /that point to set kprobe/
>> if someone need a logs i may provide it's as it's easy to collect.
>
> I don't need the log, but could you show me how you trace?
>
>>
>> But may be that is more generic question when ext4 code, some important metadata exist
>> in block device page cache in that case calling lru_page_drain() here move these pages
>> in active LRU so will accessible easy.
>>
>>
>> On Mar 27, 2013, at 21:24, Hugh Dickins wrote:
>>
>>> [Cc'ing linux-mm: "buddy cache" here is cache of some ext4 metadata]
>>>
>>> On Wed, 27 Mar 2013, Theodore Ts'o wrote:
>>>> Hi Andrew,
>>>>
>>>> Thanks for your analysis!
>>>> Since I'm not a mm developer, I'm not sure
>>>> what's the best way to more aggressively mark a page as one that we'd
>>>> really like to keep in the page cache --- whether it's calling
>>>> lru_add_drain(), or calling activate_page(page), etc.
>>>>
>>>> So I've added Andrew Morton and Hugh Dickens to the cc list as mm
>>>> experts in the hopes they could give us some advice about the best way
>>>> to achieve this goal. Andrew, Hugh, could you give us some quick
>>>> words of wisdom?
>>> Hardly from me: I'm dissatisfied with answer below, Cc'ed linux-mm.
>>>
>>>> Thanks,
>>>>
>>>> - Ted
>>>> On Mon, Mar 25, 2013 at 04:59:44PM +0400, Andrew Perepechko wrote:
>>>>> Hello!
>>>>>
>>>>> Our recent investigation has found that pages from
>>>>> the buddy cache are evicted too often as compared
>>>>> to the expectation from their usage pattern. This
>>>>> introduces additional reads during large writes under
>>>>> our workload and really hurts overall performance.
>>>>>
>>>>> ext4 uses find_get_page() and find_or_create_page()
>>>>> to look for buddy cache pages, but these pages don't
>>>>> get a chance to become activated until the following
>>>>> lru_add_drain() call, because mark_page_accessed()
>>>>> does not activate pages which are not PageLRU().
>>>>>
>>>>> As can be found from a kprobe-based test, these pages
>>>>> are often moved on the inactive LRU as a result of
>>>>> shrink_inactive_list()->lru_add_drain() and immediately
>>>>> evicted.
>>> Not quite like that, I think.
>>>
>>> Cache pages are intentionally put on the inactive list initially,
>>> so that streaming I/O does not push out more useful pages: it is
>>> intentional that the first call to mark_page_accessed() merely
>>> marks the page referenced, but does not move it to active LRU.
>>>
>>> You're right that the pagevec confuses things here, but I'm
>>> surprised if these pages are "immediately evicted": they won't
>>> be evicted while they remain on a pagevec, and can only be evicted
>>> after reaching the LRU. And they should be put on the hot end of
>>> the inactive LRU, and only evicted once they reach the cold end.
>>>
>>> But maybe you have lots of dirty or otherwise-un-immediately-evictable
>>> data pages in between, so that page reclaim reaches these ones too soon.
>>>
>>> IIUC the pages you are discussing here are important metadata pages,
>>> which you would much prefer to retain longer than streaming data.
>>>
>>> While I question "immediately evicted", I don't doubt that they
>>> get evicted sooner than you wish: one way or another, they arrive
>>> at the cold end of the inactive LRU too soon.
>>>
>>> You would like a way to mark these as more important to retain than
>>> data pages: you would like to put them directly on the active list,
>>> but are frustrated by the pagevec.
>>>
>>>>> From a quick look into linux-2.6.git, the issue seems
>>>>> to exist in the current code as well.
>>>>>
>>>>> A possible and, perhaps, non-optimal solution would be
>>>>> to call lru_add_drain() each time a buddy cache page
>>>>> is used.
>>> mark_page_accessed() should be enough each time one is actually used,
>>> but yes, it looks like you need more than that when first added to cache.
>>>
>>> It appears that at the moment you need to do:
>>>
>>>     mark_page_accessed(page);   /* to SetPageReferenced */
>>>     lru_add_drain();            /* to SetPageLRU */
>>>     mark_page_accessed(page);   /* to SetPageActive */
>>>
>>> but I agree that we would really prefer a filesystem not to have to
>>> call lru_add_drain().
>>>
>>> I quite like the idea of
>>>     mark_page_accessed(page);
>>>     mark_page_accessed(page);
>>> as a sequence to use on important metadata (nicely reminiscent of
>>> "sync; sync;"), but maybe not everybody will agree with me on that!
>>>
>>> As currently implemented, a page is put on to a pagevec specific to
>>> the LRU it is destined for, and we cannot change that destination
>>> before it is flushed to that LRU. But at this moment I cannot see
>>> a fundamental reason why we should not allow PageActive to be set
>>> while in the pagevec, and destination LRU adjusted accordingly.
>>>
>>> However, I could easily be missing something (probably some VM_BUG_ONs
>>> at the least); and changing this might uncover unwanted side-effects -
>>> perhaps some code paths which already call mark_page_accessed() twice
>>> in quick succession unintentionally, and would now be given an Active
>>> page when Inactive has actually been more appropriate.
>>>
>>> Though I'd like to come back to this, I am very unlikely to find time
>>> for it in the near future: perhaps someone else might take it further.
>>>
>>> Hugh
>>>
>>>>> Any other suggestions?
>>>>>
>>>>> Thank you,
>>>>> Andrew
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org. For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: email@kvack.org
>
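
The flag transitions Hugh describes in the quoted text (mark, drain, mark) can be illustrated with a small user-space sketch. It assumes a heavily simplified model: a page carries only three flags, and draining the pagevec does nothing except put the page on the LRU. This is not kernel code. `struct toy_page` and every `toy_*` function are hypothetical names invented for this illustration, and the real mark_page_accessed() and lru_add_drain() handle more cases (unevictable pages and per-CPU pagevecs, for example).

```c
/* Toy user-space model of the behavior discussed in the thread.
 * NOT kernel code: all toy_* names are hypothetical. */
#include <assert.h>
#include <stdbool.h>

struct toy_page {
    bool referenced;  /* stands in for PageReferenced */
    bool on_lru;      /* stands in for PageLRU; false while on a pagevec */
    bool active;      /* stands in for PageActive */
};

/* Simplified model of mark_page_accessed(): a page is promoted only if
 * it is already referenced AND already on the LRU; otherwise the call
 * at most sets the referenced flag. */
static void toy_mark_page_accessed(struct toy_page *p)
{
    if (!p->active && p->referenced && p->on_lru) {
        p->active = true;
        p->referenced = false;
    } else if (!p->referenced) {
        p->referenced = true;
    }
}

/* Simplified model of lru_add_drain(): the page leaves the pagevec
 * and lands on the (inactive) LRU. */
static void toy_lru_add_drain(struct toy_page *p)
{
    p->on_lru = true;
}

/* Runs mark, optional drain, mark on a freshly added page and reports
 * whether the page ended up active. */
static bool toy_sequence_activates(bool drain_between)
{
    struct toy_page p = { false, false, false };

    toy_mark_page_accessed(&p);      /* sets referenced */
    if (drain_between)
        toy_lru_add_drain(&p);       /* sets on_lru */
    toy_mark_page_accessed(&p);      /* promotes only if on_lru */
    return p.active;
}

/* Checks both orderings; aborts via assert() if the model disagrees. */
static void toy_self_check(void)
{
    assert(toy_sequence_activates(true));
    assert(!toy_sequence_activates(false));
}
```

In this model the page becomes active only when the drain happens between the two marks. Two back-to-back marks leave it merely referenced, which matches the complaint that a page still sitting on a pagevec cannot be activated.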