2005-12-03 07:10:18

by Wu Fengguang

Subject: [PATCH 01/16] mm: delayed page activation

When a page is referenced a second time on the inactive_list, mark it with
PG_activate instead of moving it into the active_list immediately. The actual
moving work is delayed until vmscan time (a sketch of the resulting
mark_page_accessed() follows the diff).

This implies two essential changes:
- keeps the adjacency of pages in the LRU;
- lifts the page reference counter max from 1 to 3.

And leads to the following improvements:
- read-ahead for a leading reader will not be disturbed by a following reader;
- enables the thrashing protection logic to save pages for following readers;
- keeping relevant pages together helps improve I/O efficiency;
- and also helps decrease VM fragmentation;
- increased refcnt space might help page replacement algorithms.

Signed-off-by: Wu Fengguang <[email protected]>
---

include/linux/page-flags.h | 31 +++++++++++++++++++++++++++++++
mm/page_alloc.c | 1 +
mm/swap.c | 9 ++++-----
mm/vmscan.c | 6 ++++++
4 files changed, 42 insertions(+), 5 deletions(-)

--- linux.orig/include/linux/page-flags.h
+++ linux/include/linux/page-flags.h
@@ -76,6 +76,7 @@
#define PG_reclaim 17 /* To be reclaimed asap */
#define PG_nosave_free 18 /* Free, should not be written */
#define PG_uncached 19 /* Page has been mapped as uncached */
+#define PG_activate 20 /* delayed activate */

/*
* Global page accounting. One instance per CPU. Only unsigned longs are
@@ -314,6 +315,12 @@ extern void __mod_page_state(unsigned lo
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)

+#define PageActivate(page) test_bit(PG_activate, &(page)->flags)
+#define SetPageActivate(page) set_bit(PG_activate, &(page)->flags)
+#define ClearPageActivate(page) clear_bit(PG_activate, &(page)->flags)
+#define TestClearPageActivate(page) test_and_clear_bit(PG_activate, &(page)->flags)
+#define TestSetPageActivate(page) test_and_set_bit(PG_activate, &(page)->flags)
+
struct page; /* forward declaration */

int test_clear_page_dirty(struct page *page);
@@ -339,4 +346,28 @@ static inline void set_page_writeback(st
#define ClearPageFsMisc(page) clear_bit(PG_fs_misc, &(page)->flags)
#define TestClearPageFsMisc(page) test_and_clear_bit(PG_fs_misc, &(page)->flags)

+#if PG_activate < PG_referenced
+#error unexpected page flags order
+#endif
+
+#define PAGE_REFCNT_0 0
+#define PAGE_REFCNT_1 (1 << PG_referenced)
+#define PAGE_REFCNT_2 (1 << PG_activate)
+#define PAGE_REFCNT_3 ((1 << PG_activate) | (1 << PG_referenced))
+#define PAGE_REFCNT_MASK PAGE_REFCNT_3
+
+/*
+ * STATUS REFERENCE COUNT
+ * __ 0
+ * _R PAGE_REFCNT_1
+ * A_ PAGE_REFCNT_2
+ * AR PAGE_REFCNT_3
+ *
+ * A/R: Active / Referenced
+ */
+static inline unsigned long page_refcnt(struct page *page)
+{
+ return page->flags & PAGE_REFCNT_MASK;
+}
+
#endif /* PAGE_FLAGS_H */
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -543,6 +543,7 @@ static int prep_new_page(struct page *pa

page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
1 << PG_referenced | 1 << PG_arch_1 |
+ 1 << PG_activate |
1 << PG_checked | 1 << PG_mappedtodisk);
set_page_private(page, 0);
set_page_refs(page, order);
--- linux.orig/mm/swap.c
+++ linux/mm/swap.c
@@ -29,7 +29,6 @@
#include <linux/percpu.h>
#include <linux/cpu.h>
#include <linux/notifier.h>
-#include <linux/init.h>

/* How many pages do we try to swap or page in/out together? */
int page_cluster;
@@ -115,13 +114,13 @@ void fastcall activate_page(struct page
* Mark a page as having seen activity.
*
* inactive,unreferenced -> inactive,referenced
- * inactive,referenced -> active,unreferenced
- * active,unreferenced -> active,referenced
+ * inactive,referenced -> activate,unreferenced
+ * activate,unreferenced -> activate,referenced
*/
void fastcall mark_page_accessed(struct page *page)
{
- if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
- activate_page(page);
+ if (!PageActivate(page) && PageReferenced(page) && PageLRU(page)) {
+ SetPageActivate(page);
ClearPageReferenced(page);
} else if (!PageReferenced(page)) {
SetPageReferenced(page);
--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -454,6 +454,12 @@ static int shrink_list(struct list_head
if (PageWriteback(page))
goto keep_locked;

+ if (PageActivate(page)) {
+ ClearPageActivate(page);
+ ClearPageReferenced(page);
+ goto activate_locked;
+ }
+
referenced = page_referenced(page, 1);
/* In active use or really unfreeable? Activate it. */
if (referenced && page_mapping_inuse(page))

--


2005-12-04 12:11:16

by Nikita Danilov

Subject: Re: [PATCH 01/16] mm: delayed page activation

Wu Fengguang writes:
> When a page is referenced a second time on the inactive_list, mark it with
> PG_activate instead of moving it into the active_list immediately. The actual
> moving work is delayed until vmscan time.
>
> This implies two essential changes:
> - keeps the adjacency of pages in the LRU;

But this change destroys LRU ordering: by the time shrink_list()
inspects the PG_activate bit, the information about the order in which
mark_page_accessed() was called against the pages is lost. E.g., suppose
the inactive list initially contained the pages

/* head */ (P1, P2, P3) /* tail */

all of them referenced. Then mark_page_accessed() is called against P1,
P2, and P3 (in that order). With the old code the active list would end up

/* head */ (P3, P2, P1) /* tail */

which corresponds to LRU. With delayed page activation, pages are moved
to the head of the active list in the order they are analyzed by
shrink_list(), which gives

/* head */ (P1, P2, P3) /* tail */

on the active list, that is _inverse_ LRU order.
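
The inversion is easy to reproduce with a toy userspace model (a hypothetical
illustration of the two policies, not kernel code):

/* Pages are single characters; lists are strings ordered head -> tail. */
#include <stdio.h>
#include <string.h>

static void push_head(char *list, char page)
{
	memmove(list + 1, list, strlen(list) + 1);
	list[0] = page;
}

int main(void)
{
	const char inactive[] = "123";	/* inactive list, head -> tail */
	char active_old[8] = "", active_new[8] = "";
	int i;

	/* Old code: mark_page_accessed() against P1, P2, P3 (in that
	 * order) moves each page to the head of the active list at
	 * once, so the most recently used page ends up at the head. */
	for (i = 0; inactive[i]; i++)
		push_head(active_old, inactive[i]);

	/* Delayed activation: pages are merely flagged, and shrink_list()
	 * later scans the inactive list from the tail (P3 first), moving
	 * flagged pages to the head in scan order. */
	for (i = strlen(inactive) - 1; i >= 0; i--)
		push_head(active_new, inactive[i]);

	printf("immediate: P%c P%c P%c\n",
	       active_old[0], active_old[1], active_old[2]);	/* P3 P2 P1 */
	printf("delayed:   P%c P%c P%c\n",
	       active_new[0], active_new[1], active_new[2]);	/* P1 P2 P3 */
	return 0;
}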

Nikita.

2005-12-04 13:35:13

by Wu Fengguang

Subject: Re: [PATCH 01/16] mm: delayed page activation

On Sun, Dec 04, 2005 at 03:11:28PM +0300, Nikita Danilov wrote:
> Wu Fengguang writes:
> > When a page is referenced a second time on the inactive_list, mark it with
> > PG_activate instead of moving it into the active_list immediately. The actual
> > moving work is delayed until vmscan time.
> >
> > This implies two essential changes:
> > - keeps the adjacency of pages in the LRU;
>
> But this change destroys LRU ordering: by the time shrink_list()
> inspects the PG_activate bit, the information about the order in which
> mark_page_accessed() was called against the pages is lost. E.g., suppose

Thanks.
But this re-access ordering may be pointless. In fact the original
mark_page_accessed() is doing another inversion: an inversion of page lifetime.
In the world of CLOCK-Pro, a page that is re-accessed first has a lower
inter-reference distance, and therefore should be better protected (ignoring
possible read-ahead effects). If we move re-accessed pages immediately into
the active_list, we are pushing them closer to the danger of eviction.

BTW, the current vmscan code clears the PG_referenced flag when moving pages to
the active_list. I followed that convention in the patch:

--- linux-2.6.15-rc2-mm1.orig/mm/vmscan.c
+++ linux-2.6.15-rc2-mm1/mm/vmscan.c
@@ -454,6 +454,12 @@ static int shrink_list(struct list_head
if (PageWriteback(page))
goto keep_locked;

+ if (PageActivate(page)) {
+ ClearPageActivate(page);
+ ClearPageReferenced(page);
+ goto activate_locked;
+ }
+
referenced = page_referenced(page, 1, sc->priority <= 0);
/* In active use or really unfreeable? Activate it. */
if (referenced && page_mapping_inuse(page))

Though I have a strong feeling that with the extra PG_activate bit, the
+ ClearPageReferenced(page);
line should be removed. That is, let the extra reference record live through it.
The point is to smooth out the inter-reference distance. Imagine the following
situation:

- + - + + - - + -
1   2     3 4   5
+: reference time
-: shrink_list time

One page has an average inter-reference distance that is smaller than the
inter-scan distance. But the distances vary a bit. Here we'd better let the
reference count accumulate, or at the 3rd shrink_list time it will be evicted.
Though it has a side effect of favoring non-mmapped files a bit more than
before, and I was not quite sure about it.
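
Concretely, dropping that line would leave the shrink_list() hunk looking like
this (a sketch of the alternative only, not a tested patch):

+	if (PageActivate(page)) {
+		ClearPageActivate(page);
+		/* keep PG_referenced: let the extra reference record
+		 * live through the activation, so references can
+		 * accumulate across shrink_list() passes */
+		goto activate_locked;
+	}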

> the inactive list initially contained the pages
>
> /* head */ (P1, P2, P3) /* tail */
>
> all of them referenced. Then mark_page_accessed() is called against P1,
> P2, and P3 (in that order). With the old code the active list would end up
>
> /* head */ (P3, P2, P1) /* tail */
>
> which corresponds to LRU. With delayed page activation, pages are moved
> to the head of the active list in the order they are analyzed by
> shrink_list(), which gives
>
> /* head */ (P1, P2, P3) /* tail */
>
> on the active list, that is _inverse_ LRU order.

Thanks,
Wu

2005-12-04 15:03:06

by Nikita Danilov

Subject: Re: [PATCH 01/16] mm: delayed page activation

Wu Fengguang writes:
> On Sun, Dec 04, 2005 at 03:11:28PM +0300, Nikita Danilov wrote:
> > Wu Fengguang writes:
> > > When a page is referenced a second time on the inactive_list, mark it with
> > > PG_activate instead of moving it into the active_list immediately. The actual
> > > moving work is delayed until vmscan time.
> > >
> > > This implies two essential changes:
> > > - keeps the adjacency of pages in the LRU;
> >
> > But this change destroys LRU ordering: by the time shrink_list()
> > inspects the PG_activate bit, the information about the order in which
> > mark_page_accessed() was called against the pages is lost. E.g., suppose
>
> Thanks.
> But this re-access ordering may be pointless. In fact the original
> mark_page_accessed() is doing another inversion: an inversion of page lifetime.
> In the world of CLOCK-Pro, a page that is re-accessed first has a lower

The brave new world of CLOCK-Pro is yet to happen, right?

> inter-reference distance, and therefore should be better protected (ignoring
> possible read-ahead effects). If we move re-accessed pages immediately into
> the active_list, we are pushing them closer to the danger of eviction.

Huh? Pages in the active list are closer to eviction? If that is
really so, then CLOCK-pro hijacks the meaning of the active list in a very
unintuitive way. In the current MM, the active list is supposed to contain
hot pages that will be evicted last.

Anyway, these issues should be addressed in a CLOCK-pro
implementation. The current MM tries hard to maintain an LRU approximation in
both the active and inactive lists.

[...]

>
> Though I have a strong feeling that with the extra PG_activate bit, the
> + ClearPageReferenced(page);
> line should be removed. That is, let the extra reference record live through it.
> The point is to smooth out the inter-reference distance. Imagine the following
> situation:
>
> - + - + + - - + -
> 1   2     3 4   5
> +: reference time
> -: shrink_list time
>
> One page has an average inter-reference distance that is smaller than the
> inter-scan distance. But the distances vary a bit. Here we'd better let the
> reference count accumulate, or at the 3rd shrink_list time it will be evicted.

I think this is pretty normal and acceptable variance. The idea is that when
the system is short on memory, the scan rate increases together with the
precision of reference tracking.

[...]

>
> Thanks,

Che-che,

> Wu

Nikita.

2005-12-04 15:40:59

by tony

Subject: Help!Unable to handle kernel NULL pointer...

Hi,

I recently ran into some trouble and need your help. My system is Red Hat
Linux with kernel 2.4.20, and it crashes every few days. There are some
messages in '/var/log/messages', as shown below:

I have searched the archives and found that others have met this problem
before, but I didn't find a way to fix it.

Is there anybody who would like to give me a hand?

Messages:
Nov 27 21:34:54 SHZX-WG04 kernel: <1>Unable to handle kernel NULL pointer
dereference at virtual address 00000080
Nov 27 21:34:54 SHZX-WG04 kernel: printing eip:
Nov 27 21:34:54 SHZX-WG04 kernel: c012cd07
Nov 27 21:34:54 SHZX-WG04 kernel: *pde = 06a47001
Nov 27 21:34:54 SHZX-WG04 kernel: *pte = 00000000
Nov 27 21:34:54 SHZX-WG04 kernel: Oops: 0000
Nov 27 21:34:54 SHZX-WG04 kernel: ide-cd cdrom lp parport autofs e1000
microcode keybdev mousedev hid input usb-uhci usbcore ext3 jbd aic79xx
sd_mod scsi_mod
Nov 27 21:34:54 SHZX-WG04 kernel: CPU: 2
Nov 27 21:34:54 SHZX-WG04 kernel: EIP: 0060:[<c012cd07>] Not tainted
Nov 27 21:34:54 SHZX-WG04 kernel: EFLAGS: 00010202
Nov 27 21:34:54 SHZX-WG04 kernel:
Nov 27 21:34:54 SHZX-WG04 kernel: EIP is at access_process_vm [kernel] 0x27
(2.4.20-8bigmem)
Nov 27 21:34:54 SHZX-WG04 kernel: eax: 00000000 ebx: d8d93280 ecx:
00000017 edx: e9910000
Nov 27 21:34:54 SHZX-WG04 kernel: esi: 00000000 edi: e992c000 ebp:
e992c000 esp: edde9ef0
Nov 27 21:34:54 SHZX-WG04 kernel: ds: 0068 es: 0068 ss: 0068
Nov 27 21:34:54 SHZX-WG04 kernel: Process ps (pid: 18326,
stackpage=edde9000)
Nov 27 21:34:54 SHZX-WG04 kernel: Stack: c0160996 e0b9f180 edde9f10 00000202
00000001 00000000 edde9f84 cac05580
Nov 27 21:34:54 SHZX-WG04 kernel: f421800c 00000202 00000000 e992c000
00000000 00000500 000001f0 d8d93280
Nov 27 21:34:54 SHZX-WG04 kernel: 00000000 e992c000 e9910000 c017b334
e9910000 bffffac8 e992c000 00000017
Nov 27 21:34:54 SHZX-WG04 kernel: Call Trace: [<c0160996>] link_path_walk
[kernel] 0x656 (0xedde9ef0))
Nov 27 21:34:54 SHZX-WG04 kernel: [<c017b334>] proc_pid_cmdline [kernel]
0x74 (0xedde9f3c))
Nov 27 21:34:54 SHZX-WG04 kernel: [<c017b747>] proc_info_read [kernel] 0x77
(0xedde9f6c))
Nov 27 21:34:54 SHZX-WG04 kernel: [<c0153453>] sys_read [kernel] 0xa3
(0xedde9f94))
Nov 27 21:34:54 SHZX-WG04 kernel: [<c01527d2>] sys_open [kernel] 0xa2
(0xedde9fa8))
Nov 27 21:34:54 SHZX-WG04 kernel: [<c01098bf>] system_call [kernel] 0x33
(0xedde9fc0))
Nov 27 21:34:54 SHZX-WG04 kernel:
Nov 27 21:34:54 SHZX-WG04 kernel:
Nov 27 21:34:54 SHZX-WG04 kernel: Code: f6 80 80 00 00 00 01 74 2d 81 7c 24
30 40 b7 33 c0 74 23 f0

Tony howe
2005-12-04

2005-12-04 17:11:03

by Parag Warudkar

Subject: Re: Help!Unable to handle kernel NULL pointer...

You are running an ancient kernel (with possible security issues), and unless you have some paid support for it I doubt you will get serious help here.

Your choices are to upgrade to the latest 2.4 series release (2.4.32) and hope that the problem goes away, or to upgrade to the latest 2.6 series and then repost if a similar problem occurs there.

Also, if the problem started happening recently and without any kernel-related changes, you might want to test the hardware.

HTH
Parag



2005-12-04 19:10:33

by Peter Zijlstra

Subject: Re: [PATCH 01/16] mm: delayed page activation

On Sun, 2005-12-04 at 18:03 +0300, Nikita Danilov wrote:
> Wu Fengguang writes:
> > On Sun, Dec 04, 2005 at 03:11:28PM +0300, Nikita Danilov wrote:
> > > Wu Fengguang writes:
> > > > When a page is referenced a second time on the inactive_list, mark it with
> > > > PG_activate instead of moving it into the active_list immediately. The actual
> > > > moving work is delayed until vmscan time.
> > > >
> > > > This implies two essential changes:
> > > > - keeps the adjacency of pages in the LRU;
> > >
> > > But this change destroys LRU ordering: by the time shrink_list()
> > > inspects the PG_activate bit, the information about the order in which
> > > mark_page_accessed() was called against the pages is lost. E.g., suppose
> >
> > Thanks.
> > But this re-access ordering may be pointless. In fact the original
> > mark_page_accessed() is doing another inversion: an inversion of page lifetime.
> > In the world of CLOCK-Pro, a page that is re-accessed first has a lower
>
> The brave new world of CLOCK-Pro is yet to happen, right?

Well, I have an implementation that is showing very promising results. I
plan to polish the code a bit and post it somewhere this week.
(current state available at: http://linux-mm.org/PeterZClockPro2)

> > inter-reference distance, and therefore should be better protected (ignoring
> > possible read-ahead effects). If we move re-accessed pages immediately into
> > the active_list, we are pushing them closer to the danger of eviction.
>
> Huh? Pages in the active list are closer to eviction? If that is
> really so, then CLOCK-pro hijacks the meaning of the active list in a very
> unintuitive way. In the current MM, the active list is supposed to contain
> hot pages that will be evicted last.

Actually, CLOCK-pro does not have an active list: pure CLOCK-pro has but
one clock. It is possible to create approximations that have more
lists/clocks, and in those the meaning of the active list is indeed
somewhat different, but I agree with Nikita here, this is odd.

> Anyway, these issues should be addressed in a CLOCK-pro
> implementation. The current MM tries hard to maintain an LRU approximation in
> both the active and inactive lists.

nod.


Peter Zijlstra
(he who has dedicated his spare time to the eradication of LRU ;-)



2005-12-05 01:34:30

by Wu Fengguang

Subject: Re: [PATCH 01/16] mm: delayed page activation

On Sun, Dec 04, 2005 at 06:03:15PM +0300, Nikita Danilov wrote:
> > inter-reference distance, and therefore should be better protected (ignoring
> > possible read-ahead effects). If we move re-accessed pages immediately into
> > the active_list, we are pushing them closer to the danger of eviction.
>
> Huh? Pages in the active list are closer to eviction? If that is
> really so, then CLOCK-pro hijacks the meaning of the active list in a very
> unintuitive way. In the current MM, the active list is supposed to contain
> hot pages that will be evicted last.

The page is going to the active list anyway, so its remaining lifetime in the
inactive list is killed by the early move.

Thanks,
Wu

2005-12-06 17:54:57

by Nikita Danilov

Subject: Re: [PATCH 01/16] mm: delayed page activation

Wu Fengguang writes:
> On Sun, Dec 04, 2005 at 06:03:15PM +0300, Nikita Danilov wrote:
> > > inter-reference distance, and therefore should be better protected (ignoring
> > > possible read-ahead effects). If we move re-accessed pages immediately into
> > > the active_list, we are pushing them closer to the danger of eviction.
> >
> > Huh? Pages in the active list are closer to eviction? If that is
> > really so, then CLOCK-pro hijacks the meaning of the active list in a very
> > unintuitive way. In the current MM, the active list is supposed to contain
> > hot pages that will be evicted last.
>
> The page is going to the active list anyway, so its remaining lifetime in the
> inactive list is killed by the early move.

But this change increases the lifetimes of _all_ pages, so this is
irrelevant. Consequently, it has a chance of increasing scanning
activity, because there will be more referenced pages at the cold tail
of the inactive list.

And --again-- this erases information about the relative order of
references, and this is important. In the past, several VM modifications
(like the split inactive_clean and inactive_dirty lists) were tried that had
various advantages over the current scanner but maintained a weaker LRU, and
they were all found to degrade horribly under certain easily triggerable
conditions.

>
> Thanks,
> Wu

Nikita.

2005-12-07 01:18:57

by Wu Fengguang

Subject: Re: [PATCH 01/16] mm: delayed page activation

On Tue, Dec 06, 2005 at 08:55:13PM +0300, Nikita Danilov wrote:
> Wu Fengguang writes:
> > On Sun, Dec 04, 2005 at 06:03:15PM +0300, Nikita Danilov wrote:
> > > > inter-reference distance, and therefore should be better protected (ignoring
> > > > possible read-ahead effects). If we move re-accessed pages immediately into
> > > > the active_list, we are pushing them closer to the danger of eviction.
> > >
> > > Huh? Pages in the active list are closer to eviction? If that is
> > > really so, then CLOCK-pro hijacks the meaning of the active list in a very
> > > unintuitive way. In the current MM, the active list is supposed to contain
> > > hot pages that will be evicted last.
> >
> > The page is going to the active list anyway, so its remaining lifetime in the
> > inactive list is killed by the early move.
>
> But this change increases the lifetimes of _all_ pages, so this is

Yes, but it also increases the lifetimes by meaningful amounts: pages that are
re-accessed first have their lifetimes prolonged more. Immediately removing them
from the inactive_list is basically doing MRU eviction.

> irrelevant. Consequently, it has a chance of increasing scanning
> activity, because there will be more referenced pages at the cold tail
> of the inactive list.

Delayed activation increases scanning activity, while immediate activation
increases locking activity. Early profiling data on a 2-CPU Xeon box showed
that delayed activation actually costs less time.

> And --again-- this erases information about the relative order of
> references, and this is important. In the past, several VM modifications
> (like the split inactive_clean and inactive_dirty lists) were tried that had
> various advantages over the current scanner but maintained a weaker LRU, and
> they were all found to degrade horribly under certain easily triggerable
> conditions.

Yeah, the patch does need more testing.
It has been in the -ck tree for a while, and there have been no negative
reports about it.

Andrew, and anyone on lkml, do you feel OK about testing it in the -mm tree?
It is there because some readahead code tests the PG_activate bit explicitly.
If the answer is 'not for now', I'll strip it out of the readahead patchset
in the next version.

Thanks,
Wu

2005-12-07 09:47:34

by Andrew Morton

Subject: Re: [PATCH 01/16] mm: delayed page activation

Wu Fengguang <[email protected]> wrote:
>
> Andrew, and anyone on lkml, do you feel OK about testing it in the -mm tree?

Nope, sorry. I am wildly uninterested in large changes to page reclaim.
Or to readahead, come to that.

That code has had years of testing, tweaking, tuning and poking. Large
changes such as these will take as long as a year to get settled into the
same degree of maturity. Both of these parts of the kernel are similar in
that they are hit with an extraordinarily broad range of usage patterns and
they both implement various predict-the-future heuristics. They are subtle
and there is a lot of historical knowledge embedded in there.

What I would encourage you to do is to stop developing and testing new
code. Instead, devote more time to testing, understanding and debugging
the current code. If you find and fix a problem and can help us gain a
really really really good understanding of the problem and the fix then
great, we can run with that minimal-sized, minimal-impact, well-understood,
well-tested fix.

See where I'm coming from? Experience teaches us to be super-cautious
here. In these code areas especially we cannot afford to go making
larger-than-needed changes because those changes will probably break things
in ways which will take a long time to discover, and longer to re-fix.

2005-12-07 10:10:59

by Wu Fengguang

Subject: Re: [PATCH 01/16] mm: delayed page activation

On Wed, Dec 07, 2005 at 01:46:59AM -0800, Andrew Morton wrote:
> Wu Fengguang <[email protected]> wrote:
> >
> > Andrew, and anyone on lkml, do you feel OK about testing it in the -mm tree?
>
> Nope, sorry. I am wildly uninterested in large changes to page reclaim.
> Or to readahead, come to that.
>
> That code has had years of testing, tweaking, tuning and poking. Large
> changes such as these will take as long as a year to get settled into the
> same degree of maturity. Both of these parts of the kernel are similar in
> that they are hit with an extraordinarily broad range of usage patterns and
> they both implement various predict-the-future heuristics. They are subtle
> and there is a lot of historical knowledge embedded in there.
>
> What I would encourage you to do is to stop developing and testing new
> code. Instead, devote more time to testing, understanding and debugging
> the current code. If you find and fix a problem and can help us gain a
> really really really good understanding of the problem and the fix then
> great, we can run with that minimal-sized, minimal-impact, well-understood,
> well-tested fix.
>
> See where I'm coming from? Experience teaches us to be super-cautious
> here. In these code areas especially we cannot afford to go making
> larger-than-needed changes because those changes will probably break things
> in ways which will take a long time to discover, and longer to re-fix.

OK, thanks for the advice.
My main concern is read-ahead. New development stopped roughly at V8;
various parts have been improving based on user feedback since V6. The future
work will be more testing/tuning and user interaction. So far I have
received many user reports on both servers and desktops, and things are going
well :)

Regards,
Wu

2005-12-07 12:44:13

by Nikita Danilov

Subject: Re: [PATCH 01/16] mm: delayed page activation

Wu Fengguang writes:
> On Tue, Dec 06, 2005 at 08:55:13PM +0300, Nikita Danilov wrote:

[...]

> >
> > But this change increases the lifetimes of _all_ pages, so this is
>
> Yes, but it also increases the lifetimes by meaningful amounts: pages that are
> re-accessed first have their lifetimes prolonged more. Immediately removing them
> from the inactive_list is basically doing MRU eviction.

Are you talking about CLOCK-pro here? I don't understand your statement
in the context of the current VM: if the "first re-accessed" page was close
to the cold tail of the inactive list, and the "second re-accessed" page was
close to the head of the inactive list, then the lifetime of the second one is
increased by a larger amount.

>
> > irrelevant. Consequently, it has a chance of increasing scanning
> > activity, because there will be more referenced pages at the cold tail
> > of the inactive list.
>
> Delayed activation increases scanning activity, while immediate activation
> increases locking activity. Early profiling data on a 2-CPU Xeon box showed
> that delayed activation actually costs less time.

That's great, but the current mark_page_accessed() has an important
advantage: the work is done by the process that accessed the page, in the
read/write path or at page fault time. By delegating activation to the VM
scanner, the burden of the work is shifted to an innocent thread that
happens to trigger scanning during page allocation.

It's basically always useful to follow the principle that the thread
that uses the resources pays the overhead. This leads to more balanced
behavior with a better worst case.

Compare this with balance_dirty_pages(), which throttles heavy writers by
forcing them to do the write-out. In the same vein, mark_page_accessed()
throttles the thread (a bit) by forcing it to book-keep the VM lists. Ideally,
threads doing a lot of page allocation should be forced to do some
scanning, even if there is no memory shortage, etc.

As for locking overhead, mark_page_accessed() can be batched (I even
have a patch to do this, but it doesn't show any improvement).
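
Such batching could reuse the per-CPU pagevec machinery that lru_cache_add()
already uses, e.g. something along these lines (a sketch of the idea only; the
pagevec name and the drain helper are made up here, and this is not the patch
mentioned above):

#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/percpu.h>

/* Hypothetical drain helper: takes zone->lru_lock once, moves all
 * pages in the vector to the active list, then releases them. */
void __pagevec_activate(struct pagevec *pvec);

static DEFINE_PER_CPU(struct pagevec, activate_pvecs) = { 0, };

void fastcall mark_page_accessed_batched(struct page *page)
{
	struct pagevec *pvec = &get_cpu_var(activate_pvecs);

	page_cache_get(page);
	/* pagevec_add() returns the space left; 0 means the vector
	 * is full and should be drained under one lru_lock hold. */
	if (!pagevec_add(pvec, page))
		__pagevec_activate(pvec);
	put_cpu_var(activate_pvecs);
}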

>
> Thanks,
> Wu

Nikita.

2005-12-07 13:26:39

by Wu Fengguang

Subject: Re: [PATCH 01/16] mm: delayed page activation

On Wed, Dec 07, 2005 at 03:44:25PM +0300, Nikita Danilov wrote:
> Wu Fengguang writes:
> > Yes, but it also increases the lifetimes by meaningful amounts: pages that are
> > re-accessed first have their lifetimes prolonged more. Immediately removing them
> > from the inactive_list is basically doing MRU eviction.
>
> Are you talking about CLOCK-pro here? I don't understand your statement
> in the context of the current VM: if the "first re-accessed" page was close
> to the cold tail of the inactive list, and the "second re-accessed" page was
> close to the head of the inactive list, then the lifetime of the second one is
> increased by a larger amount.

Sorry, I failed to mention that I'm comparing two pages that are read in at the
same time and are therefore in the same place in the inactive_list, but whose
re-access times can be quite different.

There are roughly two kinds of reads: almost-instant and slow-forward. For
the former, read-in time = first-access time, except for initial cache misses.
The latter is the original purpose of the patch: to keep one chunk of
read-ahead pages together, instead of letting them litter the whole LRU
list.

> > Delayed activation increases scanning activity, while immediate activation
> > increases locking activity. Early profiling data on a 2-CPU Xeon box showed
> > that delayed activation actually costs less time.
>
> That's great, but the current mark_page_accessed() has an important
> advantage: the work is done by the process that accessed the page, in the
> read/write path or at page fault time. By delegating activation to the VM
> scanner, the burden of the work is shifted to an innocent thread that
> happens to trigger scanning during page allocation.

Thanks for noticing that. It will happen in the direct page reclaim path. But I
have just run some interesting tests of the patch, in which direct page reclaims
were reduced to zero. So far I have no hint of why this is happening :)

Wu