Hi,
It seems rotate_reclaimable_page fails most of the
time due to the page not being on the LRU when kswapd
calls writepage(). The filesystem in my tests is
ext3. The attached patch against 2.6.16-rc2 moves the
page to the LRU before calling writepage(). Below are
results for a write test with:
dd if=/dev/zero of=test bs=1024k count=1024
To trigger the writeback path with the default dirty
ratios, I set swappiness to 55 and mapped memory to
about 80%.
w/o patch (/proc/sys/vm/wb_put_lru = 0):
pgrotcalls 25852
pgrotnonlru 25834
pgrotated 18
with patch (/proc/sys/vm/wb_put_lru = 1):
pgrotcalls 26616
pgrotated 26616
Thanks,
Shantanu
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
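For readers following the counters above, here is a minimal userspace sketch of the early-exit checks that rotate_reclaimable_page() appears to make in kernels of this era. It is a toy model, not the kernel code: struct page_model, struct rot_stats and rotate_model() are hypothetical names, and the meaning of pgrotcalls and pgrotnonlru (counters that presumably come from the attached patch) is assumed from the numbers, where pgrotcalls = pgrotnonlru + pgrotated. The real function also moves the page to the tail of the inactive list under lru_lock, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the page flags the check looks at. */
struct page_model {
	bool locked;
	bool dirty;
	bool active;
	bool on_lru;
};

/* Counters named after the statistics quoted in the message (assumed meaning). */
struct rot_stats {
	unsigned long pgrotcalls;   /* every call */
	unsigned long pgrotnonlru;  /* calls that bailed out because !PageLRU */
	unsigned long pgrotated;    /* calls that actually rotated the page */
};

/* Returns 0 if the page would be rotated, 1 if the call bails out early. */
int rotate_model(const struct page_model *page, struct rot_stats *st)
{
	st->pgrotcalls++;

	if (page->locked || page->dirty || page->active)
		return 1;

	if (!page->on_lru) {
		/* The case reported in the message: page still isolated. */
		st->pgrotnonlru++;
		return 1;
	}

	st->pgrotated++;
	assert(st->pgrotnonlru + st->pgrotated <= st->pgrotcalls);
	return 0;
}
```

With the reported numbers, 25834 of 25852 calls (over 99.9%) would take the !on_lru branch without the patch.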
On Sun, 5 Feb 2006, Shantanu Goel wrote:
> It seems rotate_reclaimable_page fails most of the
> time due to the page not being on the LRU when kswapd
> calls writepage().
The question is, why is the page not yet back on the
LRU by the time the data write completes ?
Surely a disk IO is slow enough that the page will
have been put on the LRU milliseconds before the IO
completes ?
In what kind of configuration do you run into this
problem ?
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
Shantanu Goel wrote:
>Hi,
>
>It seems rotate_reclaimable_page fails most of the
>time due to the page not being on the LRU when kswapd
>calls writepage(). The filesystem in my tests is
>ext3. The attached patch against 2.6.16-rc2 moves the
>page to the LRU before calling writepage(). Below are
>results for a write test with:
>
>dd if=/dev/zero of=test bs=1024k count=1024
>
>To trigger the writeback path with the default dirty
>ratios, I set swappiness to 55 and mapped memory to
>about 80%.
>
>w/o patch (/proc/sys/vm/wb_put_lru = 0):
>
>pgrotcalls 25852
>pgrotnonlru 25834
>pgrotated 18
>
>with patch (/proc/sys/vm/wb_put_lru = 1):
>
>pgrotcalls 26616
>pgrotated 26616
>
>Thanks,
>Shantanu
I think this BUGs easily because shrink_cache doesn't expect to see
unfreeable pages put back to LRU.
--Mika
--- Mika Penttilä <[email protected]> wrote:
> I think this BUGs easily because shrink_cache
> doesn't expect to see
> unfreeable pages put back to LRU.
Not quite. In shrink_list(), we never return pages
that were put back on the LRU by omitting their
addition to `ret_pages'. shrink_cache() only releases
pages on this list.
Shantanu
> The question is, why is the page not yet back on the
> LRU by the time the data write completes ?
>
One possibility is that dirtiness is being tracked by
buffers which are clean. When writepage() notices
that it simply marks the page clean and calls
end_page_writeback() which then calls
rotate_reclaimable_page() before the page scanner has
had the chance to put the page back on the LRU.
> Surely a disk IO is slow enough that the page will
> have been put on the LRU milliseconds before the IO
> completes ?
>
Agreed but if the scenario I described above is
possible, there would essentially be no delay. I have
not examined the ext3 code paths closely. Perhaps
someone on the list can verify if this can happen.
The statistics seem to clearly indicate that writeback
can complete before the scanner gets a chance to put
the page back.
> In what kind of configuration do you run into this
> problem ?
Not sure what you are looking for here but there is
nothing unusual on this machine that I can think of.
The machine runs Ubuntu Breezy with Gnome. To force
that particular VM code path, I wrote a simple program
that gobbled a lot of mmap'ed memory and then ran the
dd test. The only VM parameter I adjusted was
swappiness which I set to 55 instead of the default
60.
Shantanu
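The ordering described in this message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct toy_page and the toy_* functions are hypothetical names, "buffers_clean" stands for the conjectured case where writepage() finds nothing to write and completes at once, and "put_lru_first" stands for the behaviour of the proposed patch.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified page state. */
struct toy_page {
	bool on_lru;     /* models PageLRU */
	bool writeback;  /* models PageWriteback */
	bool rotated;    /* set when rotation succeeded */
};

/* Models end_page_writeback(): rotation only works for pages on the LRU. */
static void toy_end_writeback(struct toy_page *p)
{
	p->writeback = false;
	if (p->on_lru)
		p->rotated = true;
}

/* Models writepage(): with clean buffers it completes immediately. */
static void toy_writepage(struct toy_page *p, bool buffers_clean)
{
	p->writeback = true;
	if (buffers_clean)
		toy_end_writeback(p);
}

/* One pass of the scanner over a single page; returns true if rotated. */
bool toy_reclaim_pass(bool put_lru_first, bool buffers_clean)
{
	struct toy_page p = { .on_lru = false };  /* isolated by the scanner */

	if (put_lru_first)
		p.on_lru = true;          /* patched ordering */

	toy_writepage(&p, buffers_clean);

	p.on_lru = true;                  /* scanner puts the page back */

	if (p.writeback)
		toy_end_writeback(&p);    /* real I/O completes much later */

	return p.rotated;
}
```

In this model only the combination of unpatched ordering and immediate completion fails to rotate, which would be consistent with the statistics if most pages seen by kswapd in the test fall into that case.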
Hi Shantanu,
On Sun, Feb 05, 2006 at 07:02:59AM -0800, Shantanu Goel wrote:
> Hi,
>
> It seems rotate_reclaimable_page fails most of the
> time due to the page not being on the LRU when kswapd
> calls writepage(). The filesystem in my tests is
> ext3. The attached patch against 2.6.16-rc2 moves the
> page to the LRU before calling writepage(). Below are
> results for a write test with:
>
> dd if=/dev/zero of=test bs=1024k count=1024
I guess the big issue here is that the pgrotate logic is completely
useless for common cases (and no one stepped up to fix it, here's a
chance).
You had to modify the default dirty limits to watch writeout happen via
the VM reclaim path. Usually most writeout happens via pdflush and the
dirty limits at the write() path.
Surely the question you raise about why writeback ends before the
shrinker adds such pages back to LRU is important, but getting pgrotate
to _work at all_ for common scenarios is broader and more crucial.
Marking PG_writeback pages as PG_rotated once they're chosen candidates
for eviction increases the number of rotated pages dramatically, but
that does not necessarily increase performance (I was unable to see any
performance increase under the limited testing I've done, even though
the pgrotated numbers were _way_ higher).
Another issue is that increasing the number of rotated pages increases
lru_lock contention, which might not be an advantage for certain
workloads.
So, any change in this area needs careful study under a varied,
meaningful set of workloads and configurations (which has not been
happening very often).
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5a61080..26319eb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -447,8 +447,14 @@ static int shrink_list(struct list_head
 		if (page_mapped(page) || PageSwapCache(page))
 			sc->nr_scanned++;
 
-		if (PageWriteback(page))
+		if (PageWriteback(page)) {
+			/* mark writeback, candidate for eviction pages as
+			 * PG_reclaim to free them immediately once they're
+			 * laundered.
+			 */
+			SetPageReclaim(page);
 			goto keep_locked;
+		}
 
 		referenced = page_referenced(page, 1);
 		/* In active use or really unfreeable? Activate it. */
Rik van Riel <[email protected]> wrote:
>
> On Sun, 5 Feb 2006, Shantanu Goel wrote:
>
> > It seems rotate_reclaimable_page fails most of the
> > time due to the page not being on the LRU when kswapd
> > calls writepage().
>
> The question is, why is the page not yet back on the
> LRU by the time the data write completes ?
Could be they're ext3 pages which were written out by kjournald. Such
pages are marked dirty but have clean buffers. ext3_writepage() will
discover that the page is actually clean and will mark it thus without
performing any I/O.
In which case this code in shrink_list():
			/*
			 * A synchronous write - probably a ramdisk.  Go
			 * ahead and try to reclaim the page.
			 */
			if (TestSetPageLocked(page))
				goto keep;
			if (PageDirty(page) || PageWriteback(page))
				goto keep_locked;
			mapping = page_mapping(page);
		case PAGE_CLEAN:
			; /* try to free the page below */
should just go and reclaim the page immediately.
Shantanu, I suggest you add some instrumentation there too, see if it's
working. (That'll be non-trivial. Just because we hit PAGE_CLEAN: here
doesn't necessarily mean that the page will be reclaimed).
--- Andrew Morton <[email protected]> wrote:
> Rik van Riel <[email protected]> wrote:
> > The question is, why is the page not yet back on the
> > LRU by the time the data write completes ?
>
> Could be they're ext3 pages which were written out
> by kjournald. Such
> pages are marked dirty but have clean buffers.
> ext3_writepage() will
> discover that the page is actually clean and will
> mark it thus without
> performing any I/O.
>
I had conjectured that something like this might be
happening without knowing the details of how ext3
implements writepage. The filesystem tested on here
is ext3.
> Shantanu, I suggest you add some instrumentation
> there too, see if it's
> working. (That'll be non-trivial. Just because we
> hit PAGE_CLEAN: here
> doesn't necessarily mean that the page will be
> reclaimed).
I'll do so and report back the results.
Shantanu
--- Marcelo Tosatti <[email protected]> wrote:
> Hi Shantanu,
>
Hi Marcelo.
> I guess the big issue here is that the pgrotate logic is completely
> useless for common cases (and no one stepped up to fix it, here's a
> chance).
>
I am all for this. The motivation in my case is that
the VM scanner seems to be rather severe on mapped memory
in the case where the inactive list is full of dirty
pages. For instance, with the default values of
swappiness (60), dirty_background_ratio (10) and
dirty_ratio (40), if the mapped memory is just under
the 80% mark, the unmapped_ratio logic in
page-writeback.c does not kick in with the dd test I
described in my original email. Now most pages
encountered by kswapd will be dirty. One scan will
require pushing these pages to the backing store.
However, generic_buffered_write() marks all dirty
pages as referenced, with the result that it takes 2
iterations before any I/O is performed, since the
scanner skips inactive/dirty/referenced pages. This
causes the priority to drop enough that we start
reclaiming mapped memory. What's worse is we scan
mapped memory at a higher priority. Reducing
swappiness does not help completely because that
effectively increases the priority at which we do the
first mapped scan.
Ideally, for workloads that want to avoid paging as
much as possible, we should perhaps have a mode where
we never activate unmapped pages and let them all
reside on the inactive list. mark_page_accessed()
would simply move an unmapped page to the head of the
inactive list on the 2nd reference. The page scanner
would then reclaim unmapped pages in a strict LRU
fashion regardless of whether the page is dirty. I
have a patch that implements this but it does not
perform well for dbench type workloads so a special
/proc/sys/vm option enables/disables it. If there is
any interest I can post it.
> Marking PG_writeback pages as PG_rotated once
> they're chosen candidates
> for eviction increases the number of rotated pages
> dramatically, but
> that does not necessarily increase performance (I
> was unable to see any
> performance increase under the limited testing I've
> done, even though
> the pgrotated numbers were _way_ higher).
It is a win for the "most memory is mapped with
occasional large file copying" scenario in my
experience. The patch I mentioned above does this as
well.
>
> Another issue is that increasing the number of
> rotated pages increases
> lru_lock contention, which might not be an advantage
> for certain
> workloads.
The code as posted is certainly sub-optimal in this
regard. I have not given this much thought yet but
you do raise a good point.
Shantanu
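The interaction between swappiness, the mapped ratio and scan priority discussed in this message can be worked through numerically. The sketch below follows the swap_tendency heuristic as I recall it from refill_inactive_zone() in 2.6-era kernels (mapped_ratio / 2 + distress + swappiness, with mapped pages reclaimed once the sum reaches 100); treat the formula and the starting priority of 12 as assumptions to check against the tree you are testing. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Pressure term: grows as the previous scan priority number drops. */
static int distress(int prev_priority)
{
	assert(prev_priority >= 0 && prev_priority <= 12);
	return 100 >> prev_priority;
}

/* Assumed heuristic: half the mapped percentage, plus distress, plus swappiness. */
int swap_tendency(int mapped_pct, int swappiness, int prev_priority)
{
	return mapped_pct / 2 + distress(prev_priority) + swappiness;
}

/* Mapped pages become reclaim candidates once the tendency reaches 100. */
bool reclaim_mapped(int mapped_pct, int swappiness, int prev_priority)
{
	return swap_tendency(mapped_pct, swappiness, prev_priority) >= 100;
}

/*
 * Walk from the least aggressive priority number (12) towards 0 and
 * return the first one at which mapped pages would be reclaimed,
 * or -1 if that never happens.
 */
int first_mapped_priority(int mapped_pct, int swappiness)
{
	for (int prio = 12; prio >= 0; prio--)
		if (reclaim_mapped(mapped_pct, swappiness, prio))
			return prio;
	return -1;
}
```

Under these assumptions, 80% mapped memory with swappiness 60 gives 40 + 0 + 60 = 100, so mapped pages are eligible from the very first scan; with swappiness 55 the sum starts at 95 and only crosses 100 once the priority number has dropped to 4. That would match the observation that lowering swappiness delays, but does not prevent, reclaim of mapped memory when the scanner keeps failing to free dirty pages.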
Marcelo Tosatti wrote:
> Marking PG_writeback pages as PG_rotated once they're chosen candidates
> for eviction increases the number of rotated pages dramatically, but
> that does not necessarily increase performance (I was unable to see any
> performance increase under the limited testing I've done, even though
> the pgrotated numbers were _way_ higher).
>
Just FYI, this change can end up leaking the PageReclaim bit
which IIRC can make bad noises in the free pages check, and
is also a tiny bit sloppy unless we also do a precautionary
ClearPageReclaim in writeback paths.
However I don't think it is a bad idea in theory.
> Another issue is that increasing the number of rotated pages increases
> lru_lock contention, which might not be an advantage for certain
> workloads.
>
> So, any change in this area needs careful study under a varied,
> meaningful set of workloads and configurations (which has not been
> happening very often).
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5a61080..26319eb 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -447,8 +447,14 @@ static int shrink_list(struct list_head
> if (page_mapped(page) || PageSwapCache(page))
> sc->nr_scanned++;
>
> - if (PageWriteback(page))
> + if (PageWriteback(page)) {
> + /* mark writeback, candidate for eviction pages as
> + * PG_reclaim to free them immediately once they're
> + * laundered.
> + */
> + SetPageReclaim(page);
> goto keep_locked;
> + }
>
> referenced = page_referenced(page, 1);
> /* In active use or really unfreeable? Activate it. */
>
--
SUSE Labs, Novell Inc.
On Sun, 5 Feb 2006, Shantanu Goel wrote:
> Ideally, for workloads that want to avoid paging as
> much as possible, we should perhaps have a mode where
> we never activate unmapped pages and let them all
> reside on the inactive list. mark_page_accessed()
> would simply move an unmapped page to the head of the
> inactive list on the 2nd reference.
Clock-pro (Peter's implementation, I still need to fix mine),
should do the right thing automatically in situations like
this...
--
All Rights Reversed
At Mon, 6 Feb 2006 19:37:08 -0500 (EST),
Rik van Riel wrote:
>
> On Sun, 5 Feb 2006, Shantanu Goel wrote:
>
> > Ideally, for workloads that want to avoid paging as
> > much as possible, we should perhaps have a mode where
> > we never activate unmapped pages and let them all
> > reside on the inactive list. mark_page_accessed()
> > would simply move an unmapped page to the head of the
> > inactive list on the 2nd reference.
>
> Clock-pro (Peter's implementation, I still need to fix mine),
> should do the right thing automatically in situations like
> this...
Really?
IIRC, his patch used to have a logic to keep mapped pages hot, which
is similar to vanilla linux's reclaim_mapped thing, but it seems to be
commented out now.
--
IWAMOTO Toshihiro