Hello!
Our recent investigation has found that pages from
the buddy cache are evicted too often as compared
to the expectation from their usage pattern. This
introduces additional reads during large writes under
our workload and really hurts overall performance.
ext4 uses find_get_page() and find_or_create_page()
to look for buddy cache pages, but these pages don't
get a chance to become activated until the following
lru_add_drain() call, because mark_page_accessed()
does not activate pages which are not PageLRU().
As can be found from a kprobe-based test, these pages
are often moved on the inactive LRU as a result of
shrink_inactive_list()->lru_add_drain() and immediately
evicted.
From a quick look into linux-2.6.git, the issue seems
to exist in the current code as well.
A possible and, perhaps, non-optimal solution would be
to call lru_add_drain() each time a buddy cache page
is used.
Any other suggestions?
Thank you,
Andrew
Hi Andrew,
Thanks for your analysis! Since I'm not a mm developer, I'm not sure
what's the best way to more aggressively mark a page as one that we'd
really like to keep in the page cache --- whether it's calling
lru_add_drain(), or calling activate_page(page), etc.
So I've added Andrew Morton and Hugh Dickens to the cc list as mm
experts in the hopes they could give us some advice about the best way
to achieve this goal. Andrew, Hugh, could you give us some quick
words of wisdom?
Thanks,
- Ted
On Mon, Mar 25, 2013 at 04:59:44PM +0400, Andrew Perepechko wrote:
> Hello!
>
> Our recent investigation has found that pages from
> the buddy cache are evicted too often as compared
> to the expectation from their usage pattern. This
> introduces additional reads during large writes under
> our workload and really hurts overall performance.
>
> ext4 uses find_get_page() and find_or_create_page()
> to look for buddy cache pages, but these pages don't
> get a chance to become activated until the following
> lru_add_drain() call, because mark_page_accessed()
> does not activate pages which are not PageLRU().
>
> As can be found from a kprobe-based test, these pages
> are often moved on the inactive LRU as a result of
> shrink_inactive_list()->lru_add_drain() and immediately
> evicted.
>
> From a quick look into linux-2.6.git, the issue seems
> to exist in the current code as well.
>
> A possible and, perhaps, non-optimal solution would be
> to call lru_add_drain() each time a buddy cache page
> is used.
>
> Any other suggestions?
>
> Thank you,
> Andrew
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
[Cc'ing linux-mm: "buddy cache" here is cache of some ext4 metadata]
On Wed, 27 Mar 2013, Theodore Ts'o wrote:
> Hi Andrew,
>
> Thanks for your analysis! Since I'm not a mm developer, I'm not sure
> what's the best way to more aggressively mark a page as one that we'd
> really like to keep in the page cache --- whether it's calling
> lru_add_drain(), or calling activate_page(page), etc.
>
> So I've added Andrew Morton and Hugh Dickens to the cc list as mm
> experts in the hopes they could give us some advice about the best way
> to achieve this goal. Andrew, Hugh, could you give us some quick
> words of wisdom?
Hardly from me: I'm dissatisfied with answer below, Cc'ed linux-mm.
>
> Thanks,
>
> - Ted
> On Mon, Mar 25, 2013 at 04:59:44PM +0400, Andrew Perepechko wrote:
> > Hello!
> >
> > Our recent investigation has found that pages from
> > the buddy cache are evicted too often as compared
> > to the expectation from their usage pattern. This
> > introduces additional reads during large writes under
> > our workload and really hurts overall performance.
> >
> > ext4 uses find_get_page() and find_or_create_page()
> > to look for buddy cache pages, but these pages don't
> > get a chance to become activated until the following
> > lru_add_drain() call, because mark_page_accessed()
> > does not activate pages which are not PageLRU().
> >
> > As can be found from a kprobe-based test, these pages
> > are often moved on the inactive LRU as a result of
> > shrink_inactive_list()->lru_add_drain() and immediately
> > evicted.
Not quite like that, I think.
Cache pages are intentionally put on the inactive list initially,
so that streaming I/O does not push out more useful pages: it is
intentional that the first call to mark_page_accessed() merely
marks the page referenced, but does not move it to active LRU.
You're right that the pagevec confuses things here, but I'm
surprised if these pages are "immediately evicted": they won't
be evicted while they remain on a pagevec, and can only be evicted
after reaching the LRU. And they should be put on the hot end of
the inactive LRU, and only evicted once they reach the cold end.
But maybe you have lots of dirty or otherwise-un-immediately-evictable
data pages in between, so that page reclaim reaches these ones too soon.
IIUC the pages you are discussing here are important metadata pages,
which you would much prefer to retain longer than streaming data.
While I question "immediately evicted", I don't doubt that they
get evicted sooner than you wish: one way or another, they arrive
at the cold end of the inactive LRU too soon.
You would like a way to mark these as more important to retain than
data pages: you would like to put them directly on the active list,
but are frustrated by the pagevec.
> >
> > From a quick look into linux-2.6.git, the issue seems
> > to exist in the current code as well.
> >
> > A possible and, perhaps, non-optimal solution would be
> > to call lru_add_drain() each time a buddy cache page
> > is used.
mark_page_accessed() should be enough each time one is actually used,
but yes, it looks like you need more than that when first added to cache.
It appears that at the moment you need to do:
mark_page_accessed(page); /* to SetPageReferenced */
lru_add_drain(); /* to SetPageLRU */
mark_page_accessed(page); /* to SetPageActive */
but I agree that we would really prefer a filesystem not to have to
call lru_add_drain().
I quite like the idea of
mark_page_accessed(page);
mark_page_accessed(page);
as a sequence to use on important metadata (nicely reminiscent of
"sync; sync;"), but maybe not everybody will agree with me on that!
As currently implemented, a page is put on to a pagevec specific to
the LRU it is destined for, and we cannot change that destination
before it is flushed to that LRU. But at this moment I cannot see
a fundamental reason why we should not allow PageActive to be set
while in the pagevec, and destination LRU adjusted accordingly.
However, I could easily be missing something (probably some VM_BUG_ONs
at the least); and changing this might uncover unwanted side-effects -
perhaps some code paths which already call mark_page_accessed() twice
in quick succession unintentionally, and would now be given an Active
page when Inactive has actually been more appropriate.
Though I'd like to come back to this, I am very unlikely to find time
for it in the near future: perhaps someone else might take it further.
Hugh
> >
> > Any other suggestions?
> >
> > Thank you,
> > Andrew
Hi Hugh,
"immediately" say in ~1s after allocation /via krobes/ftrace logs/,
and you are correct - that is in case large streaming io in Lustre - like 3-4GB/s in read.
ftrace logs (with additional trace points) say page allocated, mark page accessed..
and nothing until that page will found in isolate_lru_page in shrink_inactive_list
/that point to set kprobe/
if someone need a logs i may provide it's as it's easy to collect.
But may be that is more generic question when ext4 code, some important metadata exist
in block device page cache in that case calling lru_page_drain() here move these pages
in active LRU so will accessible easy.
On Mar 27, 2013, at 21:24, Hugh Dickins wrote:
> [Cc'ing linux-mm: "buddy cache" here is cache of some ext4 metadata]
>
> On Wed, 27 Mar 2013, Theodore Ts'o wrote:
>> Hi Andrew,
>>
>> Thanks for your analysis! Since I'm not a mm developer, I'm not sure
>> what's the best way to more aggressively mark a page as one that we'd
>> really like to keep in the page cache --- whether it's calling
>> lru_add_drain(), or calling activate_page(page), etc.
>>
>> So I've added Andrew Morton and Hugh Dickens to the cc list as mm
>> experts in the hopes they could give us some advice about the best way
>> to achieve this goal. Andrew, Hugh, could you give us some quick
>> words of wisdom?
>
> Hardly from me: I'm dissatisfied with answer below, Cc'ed linux-mm.
>
>>
>> Thanks,
>>
>> - Ted
>> On Mon, Mar 25, 2013 at 04:59:44PM +0400, Andrew Perepechko wrote:
>>> Hello!
>>>
>>> Our recent investigation has found that pages from
>>> the buddy cache are evicted too often as compared
>>> to the expectation from their usage pattern. This
>>> introduces additional reads during large writes under
>>> our workload and really hurts overall performance.
>>>
>>> ext4 uses find_get_page() and find_or_create_page()
>>> to look for buddy cache pages, but these pages don't
>>> get a chance to become activated until the following
>>> lru_add_drain() call, because mark_page_accessed()
>>> does not activate pages which are not PageLRU().
>>>
>>> As can be found from a kprobe-based test, these pages
>>> are often moved on the inactive LRU as a result of
>>> shrink_inactive_list()->lru_add_drain() and immediately
>>> evicted.
>
> Not quite like that, I think.
>
> Cache pages are intentionally put on the inactive list initially,
> so that streaming I/O does not push out more useful pages: it is
> intentional that the first call to mark_page_accessed() merely
> marks the page referenced, but does not move it to active LRU.
>
> You're right that the pagevec confuses things here, but I'm
> surprised if these pages are "immediately evicted": they won't
> be evicted while they remain on a pagevec, and can only be evicted
> after reaching the LRU. And they should be put on the hot end of
> the inactive LRU, and only evicted once they reach the cold end.
>
> But maybe you have lots of dirty or otherwise-un-immediately-evictable
> data pages in between, so that page reclaim reaches these ones too soon.
>
> IIUC the pages you are discussing here are important metadata pages,
> which you would much prefer to retain longer than streaming data.
>
> While I question "immediately evicted", I don't doubt that they
> get evicted sooner than you wish: one way or another, they arrive
> at the cold end of the inactive LRU too soon.
>
> You would like a way to mark these as more important to retain than
> data pages: you would like to put them directly on the active list,
> but are frustrated by the pagevec.
>
>>>
>>> From a quick look into linux-2.6.git, the issue seems
>>> to exist in the current code as well.
>>>
>>> A possible and, perhaps, non-optimal solution would be
>>> to call lru_add_drain() each time a buddy cache page
>>> is used.
>
> mark_page_accessed() should be enough each time one is actually used,
> but yes, it looks like you need more than that when first added to cache.
>
> It appears that at the moment you need to do:
>
> mark_page_accessed(page); /* to SetPageReferenced */
> lru_add_drain(); /* to SetPageLRU */
> mark_page_accessed(page); /* to SetPageActive */
>
> but I agree that we would really prefer a filesystem not to have to
> call lru_add_drain().
>
> I quite like the idea of
> mark_page_accessed(page);
> mark_page_accessed(page);
> as a sequence to use on important metadata (nicely reminiscent of
> "sync; sync;"), but maybe not everybody will agree with me on that!
>
> As currently implemented, a page is put on to a pagevec specific to
> the LRU it is destined for, and we cannot change that destination
> before it is flushed to that LRU. But at this moment I cannot see
> a fundamental reason why we should not allow PageActive to be set
> while in the pagevec, and destination LRU adjusted accordingly.
>
> However, I could easily be missing something (probably some VM_BUG_ONs
> at the least); and changing this might uncover unwanted side-effects -
> perhaps some code paths which already call mark_page_accessed() twice
> in quick succession unintentionally, and would now be given an Active
> page when Inactive has actually been more appropriate.
>
> Though I'd like to come back to this, I am very unlikely to find time
> for it in the near future: perhaps someone else might take it further.
>
> Hugh
>
>>>
>>> Any other suggestions?
>>>
>>> Thank you,
>>> Andrew
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href
ilto:"[email protected]"> [email protected] </a>
Hi Alexey,
On 03/28/2013 01:34 PM, Alexey Lyahkov wrote:
> Hi Hugh,
>
> "immediately" say in ~1s after allocation /via krobes/ftrace logs/,
> and you are correct - that is in case large streaming io in Lustre - like 3-4GB/s in read.
> ftrace logs (with additional trace points) say page allocated, mark page accessed..
> and nothing until that page will found in isolate_lru_page in shrink_inactive_list
> /that point to set kprobe/
> if someone need a logs i may provide it's as it's easy to collect.
I don't need the log, but could you show me how you trace?
>
> But may be that is more generic question when ext4 code, some important metadata exist
> in block device page cache in that case calling lru_page_drain() here move these pages
> in active LRU so will accessible easy.
>
>
> On Mar 27, 2013, at 21:24, Hugh Dickins wrote:
>
>> [Cc'ing linux-mm: "buddy cache" here is cache of some ext4 metadata]
>>
>> On Wed, 27 Mar 2013, Theodore Ts'o wrote:
>>> Hi Andrew,
>>>
>>> Thanks for your analysis! Since I'm not a mm developer, I'm not sure
>>> what's the best way to more aggressively mark a page as one that we'd
>>> really like to keep in the page cache --- whether it's calling
>>> lru_add_drain(), or calling activate_page(page), etc.
>>>
>>> So I've added Andrew Morton and Hugh Dickens to the cc list as mm
>>> experts in the hopes they could give us some advice about the best way
>>> to achieve this goal. Andrew, Hugh, could you give us some quick
>>> words of wisdom?
>> Hardly from me: I'm dissatisfied with answer below, Cc'ed linux-mm.
>>
>>> Thanks,
>>>
>>> - Ted
>>> On Mon, Mar 25, 2013 at 04:59:44PM +0400, Andrew Perepechko wrote:
>>>> Hello!
>>>>
>>>> Our recent investigation has found that pages from
>>>> the buddy cache are evicted too often as compared
>>>> to the expectation from their usage pattern. This
>>>> introduces additional reads during large writes under
>>>> our workload and really hurts overall performance.
>>>>
>>>> ext4 uses find_get_page() and find_or_create_page()
>>>> to look for buddy cache pages, but these pages don't
>>>> get a chance to become activated until the following
>>>> lru_add_drain() call, because mark_page_accessed()
>>>> does not activate pages which are not PageLRU().
>>>>
>>>> As can be found from a kprobe-based test, these pages
>>>> are often moved on the inactive LRU as a result of
>>>> shrink_inactive_list()->lru_add_drain() and immediately
>>>> evicted.
>> Not quite like that, I think.
>>
>> Cache pages are intentionally put on the inactive list initially,
>> so that streaming I/O does not push out more useful pages: it is
>> intentional that the first call to mark_page_accessed() merely
>> marks the page referenced, but does not move it to active LRU.
>>
>> You're right that the pagevec confuses things here, but I'm
>> surprised if these pages are "immediately evicted": they won't
>> be evicted while they remain on a pagevec, and can only be evicted
>> after reaching the LRU. And they should be put on the hot end of
>> the inactive LRU, and only evicted once they reach the cold end.
>>
>> But maybe you have lots of dirty or otherwise-un-immediately-evictable
>> data pages in between, so that page reclaim reaches these ones too soon.
>>
>> IIUC the pages you are discussing here are important metadata pages,
>> which you would much prefer to retain longer than streaming data.
>>
>> While I question "immediately evicted", I don't doubt that they
>> get evicted sooner than you wish: one way or another, they arrive
>> at the cold end of the inactive LRU too soon.
>>
>> You would like a way to mark these as more important to retain than
>> data pages: you would like to put them directly on the active list,
>> but are frustrated by the pagevec.
>>
>>>> From a quick look into linux-2.6.git, the issue seems
>>>> to exist in the current code as well.
>>>>
>>>> A possible and, perhaps, non-optimal solution would be
>>>> to call lru_add_drain() each time a buddy cache page
>>>> is used.
>> mark_page_accessed() should be enough each time one is actually used,
>> but yes, it looks like you need more than that when first added to cache.
>>
>> It appears that at the moment you need to do:
>>
>> mark_page_accessed(page); /* to SetPageReferenced */
>> lru_add_drain(); /* to SetPageLRU */
>> mark_page_accessed(page); /* to SetPageActive */
>>
>> but I agree that we would really prefer a filesystem not to have to
>> call lru_add_drain().
>>
>> I quite like the idea of
>> mark_page_accessed(page);
>> mark_page_accessed(page);
>> as a sequence to use on important metadata (nicely reminiscent of
>> "sync; sync;"), but maybe not everybody will agree with me on that!
>>
>> As currently implemented, a page is put on to a pagevec specific to
>> the LRU it is destined for, and we cannot change that destination
>> before it is flushed to that LRU. But at this moment I cannot see
>> a fundamental reason why we should not allow PageActive to be set
>> while in the pagevec, and destination LRU adjusted accordingly.
>>
>> However, I could easily be missing something (probably some VM_BUG_ONs
>> at the least); and changing this might uncover unwanted side-effects -
>> perhaps some code paths which already call mark_page_accessed() twice
>> in quick succession unintentionally, and would now be given an Active
>> page when Inactive has actually been more appropriate.
>>
>> Though I'd like to come back to this, I am very unlikely to find time
>> for it in the near future: perhaps someone else might take it further.
>>
>> Hugh
>>
>>>> Any other suggestions?
>>>>
>>>> Thank you,
>>>> Andrew
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=ilto:"[email protected]"> [email protected] </a>
Hi Will,
i added a few tracepoints in mark_page_accessed, find_or_create_page, add_to_page_cache_lru and force ftrace to use just these events to logs.
>>
echo -n 150000 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_vmscan_mark_accessed/enable
echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_find_page/enable
echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_find_create_page/enable
echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_vmscan_lru_move/enable
echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_add_page_lru/enable
echo 1 > /sys/kernel/debug/tracing/tracing_on
>>>
kprobe module attached to __isolate_lru_page, __remove_from_page_cache and BUG_ON hit if __isolate_lru_page requested to remove budy page from lru lists.
ftrace log buffer extracted from crashdump with backtrace where it's hit.
log show page allocation via find_or_create_page, one or two mark_page_accessed call's, and isolate called.
backtrace always similar to
found buddy ffffea00022383d8 ffff88004d7015f0
------------[ cut here ]------------
kernel BUG at /Users/shadow/work/lustre/work/BUGS/MRP-691/jprobe/jprobe.c:40!
..
Call Trace:
[<ffffffffa00150be>] my__isolate_lru_page+0xe/0x18 [jprobe]
[<ffffffff81139d10>] isolate_pages_global+0xd0/0x380
[<ffffffff81136d99>] ? shrink_inactive_list+0xb9/0x730
[<ffffffff81136e42>] shrink_inactive_list+0x162/0x730
[<ffffffffa04e90fd>] ? cfs_hash_rw_unlock+0x1d/0x30 [libcfs]
[<ffffffffa04e7ac4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
[<ffffffff81179190>] ? mem_cgroup_soft_limit_reclaim+0x270/0x2a0
[<ffffffffa06a20f5>] ? cl_env_fetch+0x25/0x80 [obdclass]
[<ffffffff8113821f>] shrink_zone+0x38f/0x510
[<ffffffff811397a9>] balance_pgdat+0x719/0x810
[<ffffffff81139c40>] ? isolate_pages_global+0x0/0x380
[<ffffffff811399e4>] kswapd+0x144/0x3a0
[<ffffffff810a7cfd>] ? lock_release_holdtime+0x3d/0x190
[<ffffffff814fb880>] ? _spin_unlock_irqrestore+0x40/0x80
[<ffffffff810919e0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff811398a0>] ? kswapd+0x0/0x3a0
[<ffffffff81091696>] kthread+0x96/0xa0
[<ffffffff8100c28a>] child_rip+0xa/0x20
[<ffffffff8100bbd0>] ? restore_args+0x0/0x30
[<ffffffff81091600>] ? kthread+0x0/0xa0
[<ffffffff8100c280>] ? child_rip+0x0/0x20
....
On Apr 4, 2013, at 04:24, Will Huck wrote:
> Hi Alexey,
> On 03/28/2013 01:34 PM, Alexey Lyahkov wrote:
>> Hi Hugh,
>>
>> "immediately" say in ~1s after allocation /via krobes/ftrace logs/,
>> and you are correct - that is in case large streaming io in Lustre - like 3-4GB/s in read.
>> ftrace logs (with additional trace points) say page allocated, mark page accessed..
>> and nothing until that page will found in isolate_lru_page in shrink_inactive_list
>> /that point to set kprobe/
>> if someone need a logs i may provide it's as it's easy to collect.
>
> I don't need the log, but could you show me how you trace?
>
>>
>> But may be that is more generic question when ext4 code, some important metadata exist
>> in block device page cache in that case calling lru_page_drain() here move these pages
>> in active LRU so will accessible easy.
>>
>>
>> On Mar 27, 2013, at 21:24, Hugh Dickins wrote:
>>
>>> [Cc'ing linux-mm: "buddy cache" here is cache of some ext4 metadata]
>>>
>>> On Wed, 27 Mar 2013, Theodore Ts'o wrote:
>>>> Hi Andrew,
>>>>
>>>> Thanks for your analysis! Since I'm not a mm developer, I'm not sure
>>>> what's the best way to more aggressively mark a page as one that we'd
>>>> really like to keep in the page cache --- whether it's calling
>>>> lru_add_drain(), or calling activate_page(page), etc.
>>>>
>>>> So I've added Andrew Morton and Hugh Dickens to the cc list as mm
>>>> experts in the hopes they could give us some advice about the best way
>>>> to achieve this goal. Andrew, Hugh, could you give us some quick
>>>> words of wisdom?
>>> Hardly from me: I'm dissatisfied with answer below, Cc'ed linux-mm.
>>>
>>>> Thanks,
>>>>
>>>> - Ted
>>>> On Mon, Mar 25, 2013 at 04:59:44PM +0400, Andrew Perepechko wrote:
>>>>> Hello!
>>>>>
>>>>> Our recent investigation has found that pages from
>>>>> the buddy cache are evicted too often as compared
>>>>> to the expectation from their usage pattern. This
>>>>> introduces additional reads during large writes under
>>>>> our workload and really hurts overall performance.
>>>>>
>>>>> ext4 uses find_get_page() and find_or_create_page()
>>>>> to look for buddy cache pages, but these pages don't
>>>>> get a chance to become activated until the following
>>>>> lru_add_drain() call, because mark_page_accessed()
>>>>> does not activate pages which are not PageLRU().
>>>>>
>>>>> As can be found from a kprobe-based test, these pages
>>>>> are often moved on the inactive LRU as a result of
>>>>> shrink_inactive_list()->lru_add_drain() and immediately
>>>>> evicted.
>>> Not quite like that, I think.
>>>
>>> Cache pages are intentionally put on the inactive list initially,
>>> so that streaming I/O does not push out more useful pages: it is
>>> intentional that the first call to mark_page_accessed() merely
>>> marks the page referenced, but does not move it to active LRU.
>>>
>>> You're right that the pagevec confuses things here, but I'm
>>> surprised if these pages are "immediately evicted": they won't
>>> be evicted while they remain on a pagevec, and can only be evicted
>>> after reaching the LRU. And they should be put on the hot end of
>>> the inactive LRU, and only evicted once they reach the cold end.
>>>
>>> But maybe you have lots of dirty or otherwise-un-immediately-evictable
>>> data pages in between, so that page reclaim reaches these ones too soon.
>>>
>>> IIUC the pages you are discussing here are important metadata pages,
>>> which you would much prefer to retain longer than streaming data.
>>>
>>> While I question "immediately evicted", I don't doubt that they
>>> get evicted sooner than you wish: one way or another, they arrive
>>> at the cold end of the inactive LRU too soon.
>>>
>>> You would like a way to mark these as more important to retain than
>>> data pages: you would like to put them directly on the active list,
>>> but are frustrated by the pagevec.
>>>
>>>>> From a quick look into linux-2.6.git, the issue seems
>>>>> to exist in the current code as well.
>>>>>
>>>>> A possible and, perhaps, non-optimal solution would be
>>>>> to call lru_add_drain() each time a buddy cache page
>>>>> is used.
>>> mark_page_accessed() should be enough each time one is actually used,
>>> but yes, it looks like you need more than that when first added to cache.
>>>
>>> It appears that at the moment you need to do:
>>>
>>> mark_page_accessed(page); /* to SetPageReferenced */
>>> lru_add_drain(); /* to SetPageLRU */
>>> mark_page_accessed(page); /* to SetPageActive */
>>>
>>> but I agree that we would really prefer a filesystem not to have to
>>> call lru_add_drain().
>>>
>>> I quite like the idea of
>>> mark_page_accessed(page);
>>> mark_page_accessed(page);
>>> as a sequence to use on important metadata (nicely reminiscent of
>>> "sync; sync;"), but maybe not everybody will agree with me on that!
>>>
>>> As currently implemented, a page is put on to a pagevec specific to
>>> the LRU it is destined for, and we cannot change that destination
>>> before it is flushed to that LRU. But at this moment I cannot see
>>> a fundamental reason why we should not allow PageActive to be set
>>> while in the pagevec, and destination LRU adjusted accordingly.
>>>
>>> However, I could easily be missing something (probably some VM_BUG_ONs
>>> at the least); and changing this might uncover unwanted side-effects -
>>> perhaps some code paths which already call mark_page_accessed() twice
>>> in quick succession unintentionally, and would now be given an Active
>>> page when Inactive has actually been more appropriate.
>>>
>>> Though I'd like to come back to this, I am very unlikely to find time
>>> for it in the near future: perhaps someone else might take it further.
>>>
>>> Hugh
>>>
>>>>> Any other suggestions?
>>>>>
>>>>> Thank you,
>>>>> Andrew
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to [email protected]. For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=ilto:"[email protected]"> [email protected] </a>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href
ilto:"[email protected]"> [email protected] </a>
Alex, Andrew,
did you notice the patch Ted just sent?
("ext4: mark all metadata I/O with REQ_META")
I would like to see a way to mark pages read in with REQ_META to be kept
in cache preferred over other pages. I guess that would solve LU-15
(https://jira.hpdd.intel.com/browse/LU-15) and also the direntry-block
issue I tried to solve about 2 years ago
(http://patchwork.ozlabs.org/patch/101200/). But using REQ_META to tag
pages would probably also solve the same issue for other file systems.
Is there anything already in the mm layer that could be used for that?
Thanks,
Bernd
On 04/04/2013 06:51 AM, Alexey Lyahkov wrote:
> Hi Will,
>
> i added a few tracepoints in mark_page_accessed, find_or_create_page, add_to_page_cache_lru and force ftrace to use just these events to logs.
>>>
> echo -n 150000 > /sys/kernel/debug/tracing/buffer_size_kb
> echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_vmscan_mark_accessed/enable
> echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_find_page/enable
> echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_find_create_page/enable
> echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_vmscan_lru_move/enable
> echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_add_page_lru/enable
> echo 1 > /sys/kernel/debug/tracing/tracing_on
>>>>
> kprobe module attached to __isolate_lru_page, __remove_from_page_cache and BUG_ON hit if __isolate_lru_page requested to remove budy page from lru lists.
>
> ftrace log buffer extracted from crashdump with backtrace where it's hit.
>
> log show page allocation via find_or_create_page, one or two mark_page_accessed call's, and isolate called.
> backtrace always similar to
> found buddy ffffea00022383d8 ffff88004d7015f0
> ------------[ cut here ]------------
> kernel BUG at /Users/shadow/work/lustre/work/BUGS/MRP-691/jprobe/jprobe.c:40!
> ..
> Call Trace:
> [<ffffffffa00150be>] my__isolate_lru_page+0xe/0x18 [jprobe]
> [<ffffffff81139d10>] isolate_pages_global+0xd0/0x380
> [<ffffffff81136d99>] ? shrink_inactive_list+0xb9/0x730
> [<ffffffff81136e42>] shrink_inactive_list+0x162/0x730
> [<ffffffffa04e90fd>] ? cfs_hash_rw_unlock+0x1d/0x30 [libcfs]
> [<ffffffffa04e7ac4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
> [<ffffffff81179190>] ? mem_cgroup_soft_limit_reclaim+0x270/0x2a0
> [<ffffffffa06a20f5>] ? cl_env_fetch+0x25/0x80 [obdclass]
> [<ffffffff8113821f>] shrink_zone+0x38f/0x510
> [<ffffffff811397a9>] balance_pgdat+0x719/0x810
> [<ffffffff81139c40>] ? isolate_pages_global+0x0/0x380
> [<ffffffff811399e4>] kswapd+0x144/0x3a0
> [<ffffffff810a7cfd>] ? lock_release_holdtime+0x3d/0x190
> [<ffffffff814fb880>] ? _spin_unlock_irqrestore+0x40/0x80
> [<ffffffff810919e0>] ? autoremove_wake_function+0x0/0x40
> [<ffffffff811398a0>] ? kswapd+0x0/0x3a0
> [<ffffffff81091696>] kthread+0x96/0xa0
> [<ffffffff8100c28a>] child_rip+0xa/0x20
> [<ffffffff8100bbd0>] ? restore_args+0x0/0x30
> [<ffffffff81091600>] ? kthread+0x0/0xa0
> [<ffffffff8100c280>] ? child_rip+0x0/0x20
> ....
>
>
>
> On Apr 4, 2013, at 04:24, Will Huck wrote:
>
>> Hi Alexey,
>> On 03/28/2013 01:34 PM, Alexey Lyahkov wrote:
>>> Hi Hugh,
>>>
>>> "immediately" say in ~1s after allocation /via krobes/ftrace logs/,
>>> and you are correct - that is in case large streaming io in Lustre - like 3-4GB/s in read.
>>> ftrace logs (with additional trace points) say page allocated, mark page accessed..
>>> and nothing until that page will found in isolate_lru_page in shrink_inactive_list
>>> /that point to set kprobe/
>>> if someone need a logs i may provide it's as it's easy to collect.
>>
>> I don't need the log, but could you show me how you trace?
>>
>>>
>>> But may be that is more generic question when ext4 code, some important metadata exist
>>> in block device page cache in that case calling lru_page_drain() here move these pages
>>> in active LRU so will accessible easy.
>>>
>>>
>>> On Mar 27, 2013, at 21:24, Hugh Dickins wrote:
>>>
>>>> [Cc'ing linux-mm: "buddy cache" here is cache of some ext4 metadata]
>>>>
>>>> On Wed, 27 Mar 2013, Theodore Ts'o wrote:
>>>>> Hi Andrew,
>>>>>
>>>>> Thanks for your analysis! Since I'm not a mm developer, I'm not sure
>>>>> what's the best way to more aggressively mark a page as one that we'd
>>>>> really like to keep in the page cache --- whether it's calling
>>>>> lru_add_drain(), or calling activate_page(page), etc.
>>>>>
>>>>> So I've added Andrew Morton and Hugh Dickens to the cc list as mm
>>>>> experts in the hopes they could give us some advice about the best way
>>>>> to achieve this goal. Andrew, Hugh, could you give us some quick
>>>>> words of wisdom?
>>>> Hardly from me: I'm dissatisfied with answer below, Cc'ed linux-mm.
>>>>
>>>>> Thanks,
>>>>>
>>>>> - Ted
>>>>> On Mon, Mar 25, 2013 at 04:59:44PM +0400, Andrew Perepechko wrote:
>>>>>> Hello!
>>>>>>
>>>>>> Our recent investigation has found that pages from
>>>>>> the buddy cache are evicted too often as compared
>>>>>> to the expectation from their usage pattern. This
>>>>>> introduces additional reads during large writes under
>>>>>> our workload and really hurts overall performance.
>>>>>>
>>>>>> ext4 uses find_get_page() and find_or_create_page()
>>>>>> to look for buddy cache pages, but these pages don't
>>>>>> get a chance to become activated until the following
>>>>>> lru_add_drain() call, because mark_page_accessed()
>>>>>> does not activate pages which are not PageLRU().
>>>>>>
>>>>>> As can be found from a kprobe-based test, these pages
>>>>>> are often moved on the inactive LRU as a result of
>>>>>> shrink_inactive_list()->lru_add_drain() and immediately
>>>>>> evicted.
>>>> Not quite like that, I think.
>>>>
>>>> Cache pages are intentionally put on the inactive list initially,
>>>> so that streaming I/O does not push out more useful pages: it is
>>>> intentional that the first call to mark_page_accessed() merely
>>>> marks the page referenced, but does not move it to active LRU.
>>>>
>>>> You're right that the pagevec confuses things here, but I'm
>>>> surprised if these pages are "immediately evicted": they won't
>>>> be evicted while they remain on a pagevec, and can only be evicted
>>>> after reaching the LRU. And they should be put on the hot end of
>>>> the inactive LRU, and only evicted once they reach the cold end.
>>>>
>>>> But maybe you have lots of dirty or otherwise-un-immediately-evictable
>>>> data pages in between, so that page reclaim reaches these ones too soon.
>>>>
>>>> IIUC the pages you are discussing here are important metadata pages,
>>>> which you would much prefer to retain longer than streaming data.
>>>>
>>>> While I question "immediately evicted", I don't doubt that they
>>>> get evicted sooner than you wish: one way or another, they arrive
>>>> at the cold end of the inactive LRU too soon.
>>>>
>>>> You would like a way to mark these as more important to retain than
>>>> data pages: you would like to put them directly on the active list,
>>>> but are frustrated by the pagevec.
>>>>
>>>>>> From a quick look into linux-2.6.git, the issue seems
>>>>>> to exist in the current code as well.
>>>>>>
>>>>>> A possible and, perhaps, non-optimal solution would be
>>>>>> to call lru_add_drain() each time a buddy cache page
>>>>>> is used.
>>>> mark_page_accessed() should be enough each time one is actually used,
>>>> but yes, it looks like you need more than that when first added to cache.
>>>>
>>>> It appears that at the moment you need to do:
>>>>
>>>> mark_page_accessed(page); /* to SetPageReferenced */
>>>> lru_add_drain(); /* to SetPageLRU */
>>>> mark_page_accessed(page); /* to SetPageActive */
>>>>
>>>> but I agree that we would really prefer a filesystem not to have to
>>>> call lru_add_drain().
>>>>
>>>> I quite like the idea of
>>>> mark_page_accessed(page);
>>>> mark_page_accessed(page);
>>>> as a sequence to use on important metadata (nicely reminiscent of
>>>> "sync; sync;"), but maybe not everybody will agree with me on that!
>>>>
>>>> As currently implemented, a page is put on to a pagevec specific to
>>>> the LRU it is destined for, and we cannot change that destination
>>>> before it is flushed to that LRU. But at this moment I cannot see
>>>> a fundamental reason why we should not allow PageActive to be set
>>>> while in the pagevec, and destination LRU adjusted accordingly.
>>>>
>>>> However, I could easily be missing something (probably some VM_BUG_ONs
>>>> at the least); and changing this might uncover unwanted side-effects -
>>>> perhaps some code paths which already call mark_page_accessed() twice
>>>> in quick succession unintentionally, and would now be given an Active
>>>> page when Inactive has actually been more appropriate.
>>>>
>>>> Though I'd like to come back to this, I am very unlikely to find time
>>>> for it in the near future: perhaps someone else might take it further.
>>>>
>>>> Hugh
>>>>
>>>>>> Any other suggestions?
>>>>>>
>>>>>> Thank you,
>>>>>> Andrew
>>> --
>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>> the body to [email protected]. For more info on Linux MM,
>>> see: http://www.linux-mm.org/ .
>>> Don't email: <a href=ilto:"[email protected]"> [email protected] </a>
>>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=ilto:"[email protected]"> [email protected] </a>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Sat, Apr 20, 2013 at 11:18:17PM +0200, Bernd Schubert wrote:
> Alex, Andrew,
>
> did you notice the patch Ted just sent?
> ("ext4: mark all metadata I/O with REQ_META")
This patch was sent to fix another issue that was brought up at Linux
Storage, Filesystem, and MM workshop. I did bring up this issue with
Mel Gorman while at LSF/MM, and as a result, tThe mm folks are going
to look into making mark_page_accessed() do the right thing, or
perhaps provide us with new interface. The problem with forcing the
page to be marked as activated is this would cause a TLB flush, which
would be pointless since this these buddy bitmap pages aren't actually
mapped in anywhere.
- Ted
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
Theodore,
May you provide more details about discussion and new interface?
someone from mm think we need to implement mark_acccessed_force() with avoiding is in LRU checks?
and move single page directly without waiting a LRU drain ?
On Apr 21, 2013, at 02:57, Theodore Ts'o wrote:
> On Sat, Apr 20, 2013 at 11:18:17PM +0200, Bernd Schubert wrote:
>> Alex, Andrew,
>>
>> did you notice the patch Ted just sent?
>> ("ext4: mark all metadata I/O with REQ_META")
>
> This patch was sent to fix another issue that was brought up at Linux
> Storage, Filesystem, and MM workshop. I did bring up this issue with
> Mel Gorman while at LSF/MM, and as a result, tThe mm folks are going
> to look into making mark_page_accessed() do the right thing, or
> perhaps provide us with new interface. The problem with forcing the
> page to be marked as activated is this would cause a TLB flush, which
> would be pointless since this these buddy bitmap pages aren't actually
> mapped in anywhere.
>
> - Ted
Bernd,
I fact marking REQ_META will help with io scheduler().
I and Andrew will discuss it's some time ago, but don't expect a too much performance improvements
anyway - i will pickup patch from Intel jira and ask performance engineer to evaluate it.
bw. My problem is in memory caching.
I may rewrite a buddy cache to avoid using a page cache at all as these pages but LRU aging will lost.
On Apr 21, 2013, at 00:18, Bernd Schubert wrote:
> Alex, Andrew,
>
> did you notice the patch Ted just sent?
> ("ext4: mark all metadata I/O with REQ_META")
>
> I would like to see a way to mark pages read in with REQ_META to be kept
> in cache preferred over other pages. I guess that would solve LU-15
> (https://jira.hpdd.intel.com/browse/LU-15) and also the direntry-block
> issue I tried to solve about 2 years ago
> (http://patchwork.ozlabs.org/patch/101200/). But using REQ_META to tag
> pages would probably also solve the same issue for other file systems.
> Is there anything already in the mm layer that could be used for that?
>
> Thanks,
> Bernd
>
On 04/21/2013 01:57 AM, Theodore Ts'o wrote:
> On Sat, Apr 20, 2013 at 11:18:17PM +0200, Bernd Schubert wrote:
>> Alex, Andrew,
>>
>> did you notice the patch Ted just sent?
>> ("ext4: mark all metadata I/O with REQ_META")
>
> This patch was sent to fix another issue that was brought up at Linux
> Storage, Filesystem, and MM workshop. I did bring up this issue with
> Mel Gorman while at LSF/MM, and as a result, tThe mm folks are going
> to look into making mark_page_accessed() do the right thing, or
Yeah, I know that REQ_META is a hint/flag for the IO scheduler.
> perhaps provide us with new interface. The problem with forcing the
> page to be marked as activated is this would cause a TLB flush, which
> would be pointless since this these buddy bitmap pages aren't actually
> mapped in anywhere.
I just thought we can (mis)use that flag and and add another information
to the page that it holds meta data. The mm system then could use that
flag and evict those pages with a lower priority compared to other pages.
I'm curious about outcome of the mm folks. Please let me know if there
is anything I can help with.
Thanks,
Bernd
On Tue, Apr 23, 2013 at 02:02:37PM +0200, Bernd Schubert wrote:
>
> I just thought we can (mis)use that flag and and add another
> information to the page that it holds meta data. The mm system then
> could use that flag and evict those pages with a lower priority
> compared to other pages.
Well, the flag I added was to the buffer_head, not to the page, and my
understanding is that the mm folks are very hesitant to add new page
flags, since they are bumping up against the 32 bit limit (on the i386
platform), and they are trying to keep the struct page structure trim
and svelte. :-)
- Ted
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Tue, 23 Apr 2013, Theodore Ts'o wrote:
> On Tue, Apr 23, 2013 at 02:02:37PM +0200, Bernd Schubert wrote:
> >
> > I just thought we can (mis)use that flag and and add another
> > information to the page that it holds meta data. The mm system then
> > could use that flag and evict those pages with a lower priority
> > compared to other pages.
>
> Well, the flag I added was to the buffer_head, not to the page, and my
> understanding is that the mm folks are very hesitant to add new page
> flags, since they are bumping up against the 32 bit limit (on the i386
> platform), and they are trying to keep the struct page structure trim
> and svelte. :-)
Yes indeed. But luckily this issue seems not to need another page flag.
If metadata is useful, it gets used: so mark_page_accessed when it's used
should generally do the job; perhaps it's already called on that path,
perhaps calls need to be added.
Rarely used but nonetheless useful metadata might get pushed out by data.
But I don't think we want another page flag, and yet more complicated
reclaim policy in mm for that case. If the filesystem knows of such
cases, I hope it can find a way to use mark_page_accessed more often
on such pages than they are actually accessed, to help retain them.
What this thread did bring up was the failure of mark_page_accessed
to be fully effective until the page is flushed from pagevec to lru.
That remains a good point: something that several would like to fix.
For now I stand by what I said before (if you find it effective
in practice - I haven't heard back): at the moment you need to
mark_page_accessed(page); /* to SetPageReferenced */
lru_add_drain(); /* to SetPageLRU */
mark_page_accessed(page); /* to SetPageActive */
when such a metadata page is first brought in.
We all hate that lru_add_drain in the middle, which will exacerbate
lru_lock contention. We would love to eliminate the need for most
lru_add_drains: I have some ideas which I'm pursuing in odd moments,
but I promise nothing.
Hugh
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Tue, 23 Apr 2013 12:57:45 -0700 (PDT) Hugh Dickins <[email protected]> wrote:
> For now I stand by what I said before (if you find it effective
> in practice - I haven't heard back): at the moment you need to
>
> mark_page_accessed(page); /* to SetPageReferenced */
> lru_add_drain(); /* to SetPageLRU */
> mark_page_accessed(page); /* to SetPageActive */
>
> when such a metadata page is first brought in.
That should fix things for now. Although it might be better to just do
mark_page_accessed(page); /* to SetPageReferenced */
lru_add_drain(); /* to SetPageLRU */
Because a) this was too early to decide that the page is
super-important and b) the second touch of this page should have a
mark_page_accessed() in it already.
I do agree that we should be able to set both PageReferenced and
PageActive on a lru_add_pvecs page and have those hints honoured when
lru_add_pvecs is spilled onto the LRU.
At present the code decides up-front which LRU the lru_add_pvecs page
will eventually be spilled onto. That's a bit strange and I wonder why
we did it that way. Why not just have a single (per-cpu) magazine of
pages which are to go onto the LRUs, and decide *which* LRU that will
be at the last possible moment?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Tue, 23 Apr 2013, Andrew Morton wrote:
> On Tue, 23 Apr 2013 12:57:45 -0700 (PDT) Hugh Dickins <[email protected]> wrote:
>
> I do agree that we should be able to set both PageReferenced and
> PageActive on a lru_add_pvecs page and have those hints honoured when
> lru_add_pvecs is spilled onto the LRU.
>
> At present the code decides up-front which LRU the lru_add_pvecs page
> will eventually be spilled onto. That's a bit strange and I wonder why
> we did it that way. Why not just have a single (per-cpu) magazine of
> pages which are to go onto the LRUs, and decide *which* LRU that will
> be at the last possible moment?
Yes, it is strange, and I'm wanting to get away from that: though I
won't be surprised if we discover that it's actually important for
avoiding races in the current scheme - which won't cope well with a
page being marked PageActive at the wrong moment (an instant after
it has been placed on the Inactive list).
What I want is for pages on the the per-cpu lru_add_pvecs to be already
marked PageLRU, and without raised page count, so that they are eligible
for isolation and migration without needing a drain. (But I may be too
ambitious in trying to avoid the raised page count for lru_rotate_pvecs
and the others too.)
Hugh
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Tue, Apr 23, 2013 at 03:00:08PM -0700, Andrew Morton wrote:
> That should fix things for now. Although it might be better to just do
>
> mark_page_accessed(page); /* to SetPageReferenced */
> lru_add_drain(); /* to SetPageLRU */
>
> Because a) this was too early to decide that the page is
> super-important and b) the second touch of this page should have a
> mark_page_accessed() in it already.
The question is do we really want to put lru_add_drain() into the ext4
file system code? That seems to pushing some fairly mm-specific
knowledge into file system code. I'll do this if I have to do, but
wouldn't be better if this was pushed into mark_page_accessed(), or
some other new API was exported by the mm subsystem?
> At present the code decides up-front which LRU the lru_add_pvecs page
> will eventually be spilled onto. That's a bit strange and I wonder why
> we did it that way. Why not just have a single (per-cpu) magazine of
> pages which are to go onto the LRUs, and decide *which* LRU that will
> be at the last possible moment?
And this is why it seems strange that fs code should need or should
want to put something as mm-implementation dependent into their code
paths. At minimum, if we do this, we'll want to put some explanatory
comments so that later, people won't be asking, what the !@#@?!? are
the ext4 people calling lru_add_drain() here?
- Ted
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Wed, 24 Apr 2013 10:26:50 -0400 "Theodore Ts'o" <[email protected]> wrote:
> On Tue, Apr 23, 2013 at 03:00:08PM -0700, Andrew Morton wrote:
> > That should fix things for now. Although it might be better to just do
> >
> > mark_page_accessed(page); /* to SetPageReferenced */
> > lru_add_drain(); /* to SetPageLRU */
> >
> > Because a) this was too early to decide that the page is
> > super-important and b) the second touch of this page should have a
> > mark_page_accessed() in it already.
>
> The question is do we really want to put lru_add_drain() into the ext4
> file system code? That seems to pushing some fairly mm-specific
> knowledge into file system code. I'll do this if I have to do, but
> wouldn't be better if this was pushed into mark_page_accessed(), or
> some other new API was exported by the mm subsystem?
Sure, that would be daft. We'd add a new
mark_page_accessed_right_now_dont_use_this() to mm/swap.c
So, right direction add ability to mark a page as active in lru pagevec array ?
just a bypass a IS_IN_LRU(page) check and fix moving to LRU to ability to put into active LRU list from pagevec ?
I may prepare a patch for it.
On Apr 25, 2013, at 00:41, Andrew Morton wrote:
> On Wed, 24 Apr 2013 10:26:50 -0400 "Theodore Ts'o" <[email protected]> wrote:
>
>> On Tue, Apr 23, 2013 at 03:00:08PM -0700, Andrew Morton wrote:
>>> That should fix things for now. Although it might be better to just do
>>>
>>> mark_page_accessed(page); /* to SetPageReferenced */
>>> lru_add_drain(); /* to SetPageLRU */
>>>
>>> Because a) this was too early to decide that the page is
>>> super-important and b) the second touch of this page should have a
>>> mark_page_accessed() in it already.
>>
>> The question is do we really want to put lru_add_drain() into the ext4
>> file system code? That seems to pushing some fairly mm-specific
>> knowledge into file system code. I'll do this if I have to do, but
>> wouldn't be better if this was pushed into mark_page_accessed(), or
>> some other new API was exported by the mm subsystem?
>
> Sure, that would be daft. We'd add a new
> mark_page_accessed_right_now_dont_use_this() to mm/swap.c
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href
ilto:"[email protected]"> [email protected] </a>
On Wed, Apr 24, 2013 at 10:26:50AM -0400, Theodore Ts'o wrote:
> On Tue, Apr 23, 2013 at 03:00:08PM -0700, Andrew Morton wrote:
> > That should fix things for now. Although it might be better to just do
> >
> > mark_page_accessed(page); /* to SetPageReferenced */
> > lru_add_drain(); /* to SetPageLRU */
> >
> > Because a) this was too early to decide that the page is
> > super-important and b) the second touch of this page should have a
> > mark_page_accessed() in it already.
>
> The question is do we really want to put lru_add_drain() into the ext4
> file system code? That seems to pushing some fairly mm-specific
> knowledge into file system code. I'll do this if I have to do, but
> wouldn't be better if this was pushed into mark_page_accessed(), or
> some other new API was exported by the mm subsystem?
>
I don't think we want to push lru_add_drain() into the ext4 code. It's
too specific of knowledge just to work around pagevecs. Before we rework
how pagevecs select what LRU to place a page, can we make sure that fixing
that will fix the problem?
Andrew, can you try the following patch please? Also, is there any chance
you can describe in more detail what the workload does? If it fails to boot,
remove the second that calls lru_add_drain_all() and try again.
The patch looks deceptively simple, a downside from is is that workloads that
call mark_page_accessed() frequently will contend more on the zone->lru_lock
than it did previously. Moving lru_add_drain() to the ext4 could would
suffer the same contention problem.
Thanks.
---8<---
mm: pagevec: Move inactive pages to active lists even if on a pagevec
If a page is on a pagevec aimed at the inactive list then two subsequent
calls to mark_page_acessed() will still not move it to the active list.
This can cause a page to be reclaimed sooner than is expected. This
patch detects if an inactive page is not on the LRU and drains the
pagevec before promoting it.
Not-signed-off
diff --git a/mm/swap.c b/mm/swap.c
index 8a529a0..eac64fe 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -437,7 +437,18 @@ void activate_page(struct page *page)
void mark_page_accessed(struct page *page)
{
if (!PageActive(page) && !PageUnevictable(page) &&
- PageReferenced(page) && PageLRU(page)) {
+ PageReferenced(page)) {
+ /* Page could be in pagevec */
+ if (!PageLRU(page))
+ lru_add_drain();
+
+ /*
+ * Weeeee, using in_atomic() like this is a hand-grenade.
+ * Patch is for debugging purposes only, do not merge this.
+ */
+ if (!PageLRU(page) && !in_atomic())
+ lru_add_drain_all();
+
activate_page(page);
ClearPageReferenced(page);
} else if (!PageReferenced(page)) {
Mel,
On Apr 25, 2013, at 17:30, Mel Gorman wrote:
> On Wed, Apr 24, 2013 at 10:26:50AM -0400, Theodore Ts'o wrote:
>> On Tue, Apr 23, 2013 at 03:00:08PM -0700, Andrew Morton wrote:
>>> That should fix things for now. Although it might be better to just do
>>>
>>> mark_page_accessed(page); /* to SetPageReferenced */
>>> lru_add_drain(); /* to SetPageLRU */
>>>
>>> Because a) this was too early to decide that the page is
>>> super-important and b) the second touch of this page should have a
>>> mark_page_accessed() in it already.
>>
>> The question is do we really want to put lru_add_drain() into the ext4
>> file system code? That seems to pushing some fairly mm-specific
>> knowledge into file system code. I'll do this if I have to do, but
>> wouldn't be better if this was pushed into mark_page_accessed(), or
>> some other new API was exported by the mm subsystem?
>>
>
> I don't think we want to push lru_add_drain() into the ext4 code. It's
> too specific of knowledge just to work around pagevecs. Before we rework
> how pagevecs select what LRU to place a page, can we make sure that fixing
> that will fix the problem?
>
what is "that"? puting lru_add_drain() in ext4 core? sure that is fixes problem with many small reads during large write.
originally i have put shake_page() in ext4 code, but that have call lru_add_drain_all() so to exaggerated.
Index: linux-stage/fs/ext4/mballoc.c
===================================================================
--- linux-stage.orig/fs/ext4/mballoc.c 2013-03-19 10:55:52.000000000 +0200
+++ linux-stage/fs/ext4/mballoc.c 2013-03-19 10:59:02.000000000 +0200
@@ -900,8 +900,11 @@ static int ext4_mb_init_cache(struct pag
incore = data;
}
}
- if (likely(err == 0))
+ if (likely(err == 0)) {
SetPageUptodate(page);
+ /* make sure it's in active list */
+ mark_page_accessed(page);
+ }
out:
if (bh) {
@@ -957,6 +960,8 @@ int ext4_mb_init_group(struct super_bloc
page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
if (page) {
BUG_ON(page->mapping != inode->i_mapping);
+ /* move to lru - should be lru_add_drain() */
+ shake_page(page, 0);
ret = ext4_mb_init_cache(page, NULL);
if (ret) {
unlock_page(page);
@@ -986,6 +991,8 @@ int ext4_mb_init_group(struct super_bloc
unlock_page(page);
} else if (page) {
BUG_ON(page->mapping != inode->i_mapping);
+ /* move to lru - should be lru_add_drain() */
+ shake_page(page, 0);
ret = ext4_mb_init_cache(page, bitmap);
if (ret) {
unlock_page(page);
@@ -1087,6 +1094,7 @@ repeat_load_buddy:
if (page) {
BUG_ON(page->mapping != inode->i_mapping);
if (!PageUptodate(page)) {
+ shake_page(page, 0);
ret = ext4_mb_init_cache(page, NULL);
if (ret) {
unlock_page(page);
@@ -1118,6 +1126,7 @@ repeat_load_buddy:
if (page) {
BUG_ON(page->mapping != inode->i_mapping);
if (!PageUptodate(page)) {
+ shake_page(page, 0);
ret = ext4_mb_init_cache(page, e4b->bd_bitmap);
if (ret) {
unlock_page(page);
@@ -2500,6 +2509,8 @@ static int ext4_mb_init_backend(struct s
* not in the inode hash, so it should never be found by iget(), but
* this will avoid confusion if it ever shows up during debugging. */
sbi->s_buddy_cache->i_ino = EXT4_BAD_INO;
+ sbi->s_buddy_cache->i_state = I_NEW;
+// mapping_set_unevictable(sbi->s_buddy_cache->i_mapping);
EXT4_I(sbi->s_buddy_cache)->i_disksize = 0;
for (i = 0; i < ngroups; i++) {
desc = ext4_get_group_desc(sb, i, NULL);
additional i_state = I_NEW need to prevent kill page cache from sysctl -w vm.drop_caches=3
> Andrew, can you try the following patch please? Also, is there any chance
> you can describe in more detail what the workload does?
lustre OSS node + IOR with file size twice more then OSS memory.
> If it fails to boot,
> remove the second that calls lru_add_drain_all() and try again.
well, i will try.
>
> The patch looks deceptively simple, a downside from is is that workloads that
> call mark_page_accessed() frequently will contend more on the zone->lru_lock
> than it did previously. Moving lru_add_drain() to the ext4 could would
> suffer the same contention problem.
NO, isn't. we have call lru_add_drain() in new page allocation case, but mark_page_accessed called without differences - is page in page cache already or it's new allocated - so we have very small zone->lru_lock contention.
>
> Thanks.
>
> ---8<---
> mm: pagevec: Move inactive pages to active lists even if on a pagevec
>
> If a page is on a pagevec aimed at the inactive list then two subsequent
> calls to mark_page_acessed() will still not move it to the active list.
> This can cause a page to be reclaimed sooner than is expected. This
> patch detects if an inactive page is not on the LRU and drains the
> pagevec before promoting it.
>
> Not-signed-off
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 8a529a0..eac64fe 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -437,7 +437,18 @@ void activate_page(struct page *page)
> void mark_page_accessed(struct page *page)
> {
> if (!PageActive(page) && !PageUnevictable(page) &&
> - PageReferenced(page) && PageLRU(page)) {
> + PageReferenced(page)) {
> + /* Page could be in pagevec */
> + if (!PageLRU(page))
> + lru_add_drain();
> +
> + /*
> + * Weeeee, using in_atomic() like this is a hand-grenade.
> + * Patch is for debugging purposes only, do not merge this.
> + */
> + if (!PageLRU(page) && !in_atomic())
> + lru_add_drain_all();
> +
> activate_page(page);
> ClearPageReferenced(page);
> } else if (!PageReferenced(page)) {
On Thu, Apr 25, 2013 at 09:37:07PM +0300, Alexey Lyahkov wrote:
> Mel,
>
>
> On Apr 25, 2013, at 17:30, Mel Gorman wrote:
>
> > On Wed, Apr 24, 2013 at 10:26:50AM -0400, Theodore Ts'o wrote:
> >> On Tue, Apr 23, 2013 at 03:00:08PM -0700, Andrew Morton wrote:
> >>> That should fix things for now. Although it might be better to just do
> >>>
> >>> mark_page_accessed(page); /* to SetPageReferenced */
> >>> lru_add_drain(); /* to SetPageLRU */
> >>>
> >>> Because a) this was too early to decide that the page is
> >>> super-important and b) the second touch of this page should have a
> >>> mark_page_accessed() in it already.
> >>
> >> The question is do we really want to put lru_add_drain() into the ext4
> >> file system code? That seems to pushing some fairly mm-specific
> >> knowledge into file system code. I'll do this if I have to do, but
> >> wouldn't be better if this was pushed into mark_page_accessed(), or
> >> some other new API was exported by the mm subsystem?
> >>
> >
> > I don't think we want to push lru_add_drain() into the ext4 code. It's
> > too specific of knowledge just to work around pagevecs. Before we rework
> > how pagevecs select what LRU to place a page, can we make sure that fixing
> > that will fix the problem?
> >
> what is "that"? puting lru_add_drain() in ext4 core? sure that is fixes problem with many small reads during large write.
> originally i have put shake_page() in ext4 code, but that have call lru_add_drain_all() so to exaggerated.
>
No, I would prefer if this was not fixed within ext4. I need confirmation
that fixing mark_page_accessed() addresses the performance problem you
encounter. The two-line check for PageLRU() followed by a lru_add_drain()
is meant to check that. That is still not my preferred fix because even
if you do not encounter higher LRU contention, other workloads would be
at risk. The likely fix will involve converting pagevecs to using a single
list and then selecting what LRU to put a page on at drain time but I
want to know that it's worthwhile.
Using shake_page() in ext4 is certainly overkill.
> > Andrew, can you try the following patch please? Also, is there any chance
> > you can describe in more detail what the workload does?
>
> lustre OSS node + IOR with file size twice more then OSS memory.
>
Ok, no way I'll be reproducing that workload. Thanks.
--
Mel Gorman
SUSE Labs
On Apr 26, 2013, at 01:40, Mel Gorman wrote:
> No, I would prefer if this was not fixed within ext4. I need confirmation
> that fixing mark_page_accessed() addresses the performance problem you
> encounter. The two-line check for PageLRU() followed by a lru_add_drain()
> is meant to check that. That is still not my preferred fix because even
> if you do not encounter higher LRU contention, other workloads would be
> at risk. The likely fix will involve converting pagevecs to using a single
> list and then selecting what LRU to put a page on at drain time but I
> want to know that it's worthwhile.
>
> Using shake_page() in ext4 is certainly overkill.
agree, but it's was my prof of concept patch :) just to verify founded
>
>>> Andrew, can you try the following patch please? Also, is there any chance
>>> you can describe in more detail what the workload does?
>>
>> lustre OSS node + IOR with file size twice more then OSS memory.
>>
>
> Ok, no way I'll be reproducing that workload. Thanks.
>
I think you should be try several processes with DIO (so don't put any pages in lru_pagevec as that is heap), each have a filesize twice or more of available memory.
Main idea you should be have a read a new pages in budy cache (to allocate) and have large memory allocation in same time.
DIO chunk should be enough to start streaming allocation.
also you may use attached jprobe module to hit an BUG() if buddy page removed from a memory by shrinker.