From: Bernd Schubert
Subject: Re: page eviction from the buddy cache
Date: Sat, 20 Apr 2013 23:18:17 +0200
Message-ID: <51730619.3030204@fastmail.fm>
References: <51504A40.6020604@ya.ru> <20130327150743.GC14900@thunk.org> <3C8EEEF8-C1EB-4E3D-8DE6-198AB1BEA8C0@gmail.com> <515CD665.9000300@gmail.com> <239AD30A-2A31-4346-A4C7-8A6EB8247990@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Cc: Will Huck, Hugh Dickins, Theodore Ts'o, Andrew Perepechko, linux-ext4@vger.kernel.org, akpm@linux-foundation.org, linux-mm@kvack.org
To: Alexey Lyahkov
Return-path:
In-Reply-To: <239AD30A-2A31-4346-A4C7-8A6EB8247990@gmail.com>
Sender: owner-linux-mm@kvack.org
List-Id: linux-ext4.vger.kernel.org

Alex, Andrew,

did you notice the patch Ted just sent?
("ext4: mark all metadata I/O with REQ_META")

I would like to see a way to mark pages read in with REQ_META to be kept
in cache preferred over other pages. I guess that would solve LU-15
(https://jira.hpdd.intel.com/browse/LU-15) and also the direntry-block
issue I tried to solve about 2 years ago
(http://patchwork.ozlabs.org/patch/101200/). But using REQ_META to tag
pages would probably also solve the same issue for other file systems.
Is there anything already in the mm layer that could be used for that?

Thanks,
Bernd

On 04/04/2013 06:51 AM, Alexey Lyahkov wrote:
> Hi Will,
>
> i added a few tracepoints in mark_page_accessed, find_or_create_page, add_to_page_cache_lru and force ftrace to use just these events to logs.
>>>
> echo -n 150000 > /sys/kernel/debug/tracing/buffer_size_kb
> echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_vmscan_mark_accessed/enable
> echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_find_page/enable
> echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_find_create_page/enable
> echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_vmscan_lru_move/enable
> echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_add_page_lru/enable
> echo 1 > /sys/kernel/debug/tracing/tracing_on
>>>>
> kprobe module attached to __isolate_lru_page, __remove_from_page_cache and BUG_ON hit if __isolate_lru_page requested to remove budy page from lru lists.
>
> ftrace log buffer extracted from crashdump with backtrace where it's hit.
>
> log show page allocation via find_or_create_page, one or two mark_page_accessed call's, and isolate called.
> backtrace always similar to
> found buddy ffffea00022383d8 ffff88004d7015f0
> ------------[ cut here ]------------
> kernel BUG at /Users/shadow/work/lustre/work/BUGS/MRP-691/jprobe/jprobe.c:40!
> ..
> Call Trace:
> [] my__isolate_lru_page+0xe/0x18 [jprobe]
> [] isolate_pages_global+0xd0/0x380
> [] ? shrink_inactive_list+0xb9/0x730
> [] shrink_inactive_list+0x162/0x730
> [] ? cfs_hash_rw_unlock+0x1d/0x30 [libcfs]
> [] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
> [] ? mem_cgroup_soft_limit_reclaim+0x270/0x2a0
> [] ? cl_env_fetch+0x25/0x80 [obdclass]
> [] shrink_zone+0x38f/0x510
> [] balance_pgdat+0x719/0x810
> [] ? isolate_pages_global+0x0/0x380
> [] kswapd+0x144/0x3a0
> [] ? lock_release_holdtime+0x3d/0x190
> [] ? _spin_unlock_irqrestore+0x40/0x80
> [] ? autoremove_wake_function+0x0/0x40
> [] ? kswapd+0x0/0x3a0
> [] kthread+0x96/0xa0
> [] child_rip+0xa/0x20
> [] ? restore_args+0x0/0x30
> [] ? kthread+0x0/0xa0
> [] ? child_rip+0x0/0x20
> ....
>
>
>
> On Apr 4, 2013, at 04:24, Will Huck wrote:
>
>> Hi Alexey,
>> On 03/28/2013 01:34 PM, Alexey Lyahkov wrote:
>>> Hi Hugh,
>>>
>>> "immediately" say in ~1s after allocation /via krobes/ftrace logs/,
>>> and you are correct - that is in case large streaming io in Lustre - like 3-4GB/s in read.
>>> ftrace logs (with additional trace points) say page allocated, mark page accessed..
>>> and nothing until that page will found in isolate_lru_page in shrink_inactive_list
>>> /that point to set kprobe/
>>> if someone need a logs i may provide it's as it's easy to collect.
>>
>> I don't need the log, but could you show me how you trace?
>>
>>>
>>> But may be that is more generic question when ext4 code, some important metadata exist
>>> in block device page cache in that case calling lru_page_drain() here move these pages
>>> in active LRU so will accessible easy.
>>>
>>>
>>> On Mar 27, 2013, at 21:24, Hugh Dickins wrote:
>>>
>>>> [Cc'ing linux-mm: "buddy cache" here is cache of some ext4 metadata]
>>>>
>>>> On Wed, 27 Mar 2013, Theodore Ts'o wrote:
>>>>> Hi Andrew,
>>>>>
>>>>> Thanks for your analysis! Since I'm not a mm developer, I'm not sure
>>>>> what's the best way to more aggressively mark a page as one that we'd
>>>>> really like to keep in the page cache --- whether it's calling
>>>>> lru_add_drain(), or calling activate_page(page), etc.
>>>>>
>>>>> So I've added Andrew Morton and Hugh Dickens to the cc list as mm
>>>>> experts in the hopes they could give us some advice about the best way
>>>>> to achieve this goal. Andrew, Hugh, could you give us some quick
>>>>> words of wisdom?
>>>> Hardly from me: I'm dissatisfied with answer below, Cc'ed linux-mm.
>>>>
>>>>> Thanks,
>>>>>
>>>>> - Ted
>>>>> On Mon, Mar 25, 2013 at 04:59:44PM +0400, Andrew Perepechko wrote:
>>>>>> Hello!
>>>>>>
>>>>>> Our recent investigation has found that pages from
>>>>>> the buddy cache are evicted too often as compared
>>>>>> to the expectation from their usage pattern.
>>>>>> This introduces additional reads during large writes under
>>>>>> our workload and really hurts overall performance.
>>>>>>
>>>>>> ext4 uses find_get_page() and find_or_create_page()
>>>>>> to look for buddy cache pages, but these pages don't
>>>>>> get a chance to become activated until the following
>>>>>> lru_add_drain() call, because mark_page_accessed()
>>>>>> does not activate pages which are not PageLRU().
>>>>>>
>>>>>> As can be found from a kprobe-based test, these pages
>>>>>> are often moved on the inactive LRU as a result of
>>>>>> shrink_inactive_list()->lru_add_drain() and immediately
>>>>>> evicted.
>>>> Not quite like that, I think.
>>>>
>>>> Cache pages are intentionally put on the inactive list initially,
>>>> so that streaming I/O does not push out more useful pages: it is
>>>> intentional that the first call to mark_page_accessed() merely
>>>> marks the page referenced, but does not move it to active LRU.
>>>>
>>>> You're right that the pagevec confuses things here, but I'm
>>>> surprised if these pages are "immediately evicted": they won't
>>>> be evicted while they remain on a pagevec, and can only be evicted
>>>> after reaching the LRU. And they should be put on the hot end of
>>>> the inactive LRU, and only evicted once they reach the cold end.
>>>>
>>>> But maybe you have lots of dirty or otherwise-un-immediately-evictable
>>>> data pages in between, so that page reclaim reaches these ones too soon.
>>>>
>>>> IIUC the pages you are discussing here are important metadata pages,
>>>> which you would much prefer to retain longer than streaming data.
>>>>
>>>> While I question "immediately evicted", I don't doubt that they
>>>> get evicted sooner than you wish: one way or another, they arrive
>>>> at the cold end of the inactive LRU too soon.
>>>>
>>>> You would like a way to mark these as more important to retain than
>>>> data pages: you would like to put them directly on the active list,
>>>> but are frustrated by the pagevec.
>>>>
>>>>>> From a quick look into linux-2.6.git, the issue seems
>>>>>> to exist in the current code as well.
>>>>>>
>>>>>> A possible and, perhaps, non-optimal solution would be
>>>>>> to call lru_add_drain() each time a buddy cache page
>>>>>> is used.
>>>> mark_page_accessed() should be enough each time one is actually used,
>>>> but yes, it looks like you need more than that when first added to cache.
>>>>
>>>> It appears that at the moment you need to do:
>>>>
>>>>     mark_page_accessed(page);    /* to SetPageReferenced */
>>>>     lru_add_drain();             /* to SetPageLRU */
>>>>     mark_page_accessed(page);    /* to SetPageActive */
>>>>
>>>> but I agree that we would really prefer a filesystem not to have to
>>>> call lru_add_drain().
>>>>
>>>> I quite like the idea of
>>>>     mark_page_accessed(page);
>>>>     mark_page_accessed(page);
>>>> as a sequence to use on important metadata (nicely reminiscent of
>>>> "sync; sync;"), but maybe not everybody will agree with me on that!
>>>>
>>>> As currently implemented, a page is put on to a pagevec specific to
>>>> the LRU it is destined for, and we cannot change that destination
>>>> before it is flushed to that LRU. But at this moment I cannot see
>>>> a fundamental reason why we should not allow PageActive to be set
>>>> while in the pagevec, and destination LRU adjusted accordingly.
>>>>
>>>> However, I could easily be missing something (probably some VM_BUG_ONs
>>>> at the least); and changing this might uncover unwanted side-effects -
>>>> perhaps some code paths which already call mark_page_accessed() twice
>>>> in quick succession unintentionally, and would now be given an Active
>>>> page when Inactive has actually been more appropriate.
>>>>
>>>> Though I'd like to come back to this, I am very unlikely to find time
>>>> for it in the near future: perhaps someone else might take it further.
>>>>
>>>> Hugh
>>>>
>>>>>> Any other suggestions?
>>>>>>
>>>>>> Thank you,
>>>>>> Andrew
>>> --
>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>> the body to majordomo@kvack.org. For more info on Linux MM,
>>> see: http://www.linux-mm.org/ .
>>> Don't email: email@kvack.org
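
The three-call sequence Hugh lists in the thread (mark_page_accessed(), lru_add_drain(), mark_page_accessed()) can be illustrated with a small userspace model. This is only a sketch of the behaviour as described in the thread, not kernel code: struct toy_page and every toy_* function are hypothetical names invented for illustration, and the model ignores unevictable pages, locking, and per-CPU pagevecs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the three page flags discussed in the thread. */
struct toy_page {
    bool referenced; /* models PageReferenced */
    bool on_lru;     /* models PageLRU; false while the page sits on a pagevec */
    bool active;     /* models PageActive */
};

/*
 * Models the behaviour described in the thread: a page is activated only
 * if it is already referenced AND already on the LRU; otherwise the call
 * at most sets the referenced flag.
 */
static void toy_mark_page_accessed(struct toy_page *p)
{
    assert(p != NULL);
    if (!p->active && p->referenced && p->on_lru) {
        p->active = true;
        p->referenced = false;
    } else if (!p->referenced) {
        p->referenced = true;
    }
}

/* Models the pagevec being flushed, so that the page gains PageLRU. */
static void toy_lru_add_drain(struct toy_page *p)
{
    assert(p != NULL);
    p->on_lru = true;
}

/*
 * Touches a freshly added page twice, optionally draining in between,
 * and reports whether the page ended up active.
 */
static bool toy_touch_twice(bool drain_between)
{
    struct toy_page page = { false, false, false };

    toy_mark_page_accessed(&page);      /* sets referenced */
    if (drain_between)
        toy_lru_add_drain(&page);       /* page reaches the LRU */
    toy_mark_page_accessed(&page);      /* activates only if on the LRU */
    return page.active;
}
```

In this model toy_touch_twice(true) reports an active page, while toy_touch_twice(false) does not: the second access is effectively lost because the page is still on the pagevec. That is the gap Andrew reports for buddy cache pages, and the reason Hugh suggests allowing PageActive to be set while the page is still in the pagevec.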