LinuxLists.cc - [PATCH] mm:workingset use real time to judge activity of the file page

2019-04-04 03:31:35

Subject: [PATCH] mm:workingset use real time to judge activity of the file page

From: Zhaoyang Huang <[email protected]>

In the previous implementation, the number of refaulted pages is used
to judge the refault period of each page. This is imprecise, because
eviction of other files strongly affects the current cache.
We introduce a timestamp into the workingset entry, plus a refault
ratio, to measure a file page's activity. This reduces the influence of
other files (the average refault ratio reflects the state of the whole
system's memory).
The patch was tested on an Android system by comparing the launch time
of an application under heavy memory consumption. As a result, launch
time decreased by 50% and page faults during the test decreased by 80%.

Signed-off-by: Zhaoyang Huang <[email protected]>
---
include/linux/mmzone.h | 2 ++
mm/workingset.c | 24 +++++++++++++++++-------
2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2..c38ba0a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -240,6 +240,8 @@ struct lruvec {
atomic_long_t inactive_age;
/* Refaults at the time of last reclaim cycle */
unsigned long refaults;
+ atomic_long_t refaults_ratio;
+ atomic_long_t prev_fault;
#ifdef CONFIG_MEMCG
struct pglist_data *pgdat;
#endif
diff --git a/mm/workingset.c b/mm/workingset.c
index 40ee02c..6361853 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -159,7 +159,7 @@
NODES_SHIFT + \
MEM_CGROUP_ID_SHIFT)
#define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
-
+#define EVICTION_JIFFIES (BITS_PER_LONG >> 3)
/*
* Eviction timestamps need to be able to cover the full range of
* actionable refaults. However, bits are tight in the radix tree
@@ -175,18 +175,22 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
eviction >>= bucket_order;
eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
+ eviction = (eviction << EVICTION_JIFFIES) | (jiffies >> EVICTION_JIFFIES);
eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);

return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
}

static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
- unsigned long *evictionp)
+ unsigned long *evictionp, unsigned long *prev_jiffp)
{
unsigned long entry = (unsigned long)shadow;
int memcgid, nid;
+ unsigned long prev_jiff;

entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+ entry >>= EVICTION_JIFFIES;
+ prev_jiff = (entry & ((1UL << EVICTION_JIFFIES) - 1)) << EVICTION_JIFFIES;
nid = entry & ((1UL << NODES_SHIFT) - 1);
entry >>= NODES_SHIFT;
memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
@@ -195,6 +199,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
*memcgidp = memcgid;
*pgdat = NODE_DATA(nid);
*evictionp = entry << bucket_order;
+ *prev_jiffp = prev_jiff;
}

/**
@@ -242,8 +247,12 @@ bool workingset_refault(void *shadow)
unsigned long refault;
struct pglist_data *pgdat;
int memcgid;
+ unsigned long refault_ratio;
+ unsigned long prev_jiff;
+ unsigned long avg_refault_time;
+ unsigned long refault_time;

- unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
+ unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &prev_jiff);

rcu_read_lock();
/*
@@ -288,10 +297,11 @@ bool workingset_refault(void *shadow)
* list is not a problem.
*/
refault_distance = (refault - eviction) & EVICTION_MASK;
-
inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
-
- if (refault_distance <= active_file) {
+ lruvec->refaults_ratio = atomic_long_read(&lruvec->inactive_age) / jiffies;
+ refault_time = jiffies - prev_jiff;
+ avg_refault_time = refault_distance / lruvec->refaults_ratio;
+ if (refault_time <= avg_refault_time) {
inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
rcu_read_unlock();
return true;
@@ -521,7 +531,7 @@ static int __init workingset_init(void)
* some more pages at runtime, so keep working with up to
* double the initial memory by using totalram_pages as-is.
*/
- timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
+ timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT - EVICTION_JIFFIES;
max_order = fls_long(totalram_pages - 1);
if (max_order > timestamp_bits)
bucket_order = max_order - timestamp_bits;
--
1.9.1
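
Illustration (not part of the posted message): the patch reserves
EVICTION_JIFFIES = BITS_PER_LONG >> 3 bits of the shadow entry for a
coarse timestamp and stores jiffies shifted right by the same amount.
The sketch below only works out what that implies for resolution and
wrap-around, assuming exactly EVICTION_JIFFIES bits are kept (as the
mask in unpack_shadow() suggests). The sk_* names are hypothetical
helpers for this example, not kernel functions.

```c
#include <stdint.h>

/* Width of the stored timestamp, as in the patch: BITS_PER_LONG >> 3. */
static unsigned sk_stamp_bits(unsigned bits_per_long)
{
    return bits_per_long >> 3;
}

/* Jiffies discarded by the right shift, i.e. the timestamp resolution. */
static uint64_t sk_stamp_resolution(unsigned bits_per_long)
{
    return UINT64_C(1) << sk_stamp_bits(bits_per_long);
}

/* Jiffies until the stored value repeats, if only that many bits are kept. */
static uint64_t sk_stamp_wrap(unsigned bits_per_long)
{
    return UINT64_C(1) << (2 * sk_stamp_bits(bits_per_long));
}
```

Under these assumptions a 64-bit build gets 8 timestamp bits with a
resolution of 256 jiffies, and the stored value repeats every 65536
jiffies; a 32-bit build gets 4 bits, 16 jiffies and 256 jiffies
respectively.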

2019-04-04 07:16:39

by Michal Hocko

Subject: Re: [PATCH] mm:workingset use real time to judge activity of the file page

[Fixup email for Pavel and add Johannes]

On Thu 04-04-19 11:30:17, Zhaoyang Huang wrote:
> From: Zhaoyang Huang <[email protected]>
>
> In the previous implementation, the number of refaulted pages is used
> to judge the refault period of each page. This is imprecise, because
> eviction of other files strongly affects the current cache.
> We introduce a timestamp into the workingset entry, plus a refault
> ratio, to measure a file page's activity. This reduces the influence of
> other files (the average refault ratio reflects the state of the whole
> system's memory).
> The patch was tested on an Android system by comparing the launch time
> of an application under heavy memory consumption. As a result, launch
> time decreased by 50% and page faults during the test decreased by 80%.
>
> Signed-off-by: Zhaoyang Huang <[email protected]>
> ---
> include/linux/mmzone.h | 2 ++
> mm/workingset.c | 24 +++++++++++++++++-------
> 2 files changed, 19 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 32699b2..c38ba0a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -240,6 +240,8 @@ struct lruvec {
> atomic_long_t inactive_age;
> /* Refaults at the time of last reclaim cycle */
> unsigned long refaults;
> + atomic_long_t refaults_ratio;
> + atomic_long_t prev_fault;
> #ifdef CONFIG_MEMCG
> struct pglist_data *pgdat;
> #endif
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 40ee02c..6361853 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -159,7 +159,7 @@
> NODES_SHIFT + \
> MEM_CGROUP_ID_SHIFT)
> #define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
> -
> +#define EVICTION_JIFFIES (BITS_PER_LONG >> 3)
> /*
> * Eviction timestamps need to be able to cover the full range of
> * actionable refaults. However, bits are tight in the radix tree
> @@ -175,18 +175,22 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
> eviction >>= bucket_order;
> eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
> eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
> + eviction = (eviction << EVICTION_JIFFIES) | (jiffies >> EVICTION_JIFFIES);
> eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
>
> return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
> }
>
> static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
> - unsigned long *evictionp)
> + unsigned long *evictionp, unsigned long *prev_jiffp)
> {
> unsigned long entry = (unsigned long)shadow;
> int memcgid, nid;
> + unsigned long prev_jiff;
>
> entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
> + entry >>= EVICTION_JIFFIES;
> + prev_jiff = (entry & ((1UL << EVICTION_JIFFIES) - 1)) << EVICTION_JIFFIES;
> nid = entry & ((1UL << NODES_SHIFT) - 1);
> entry >>= NODES_SHIFT;
> memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
> @@ -195,6 +199,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
> *memcgidp = memcgid;
> *pgdat = NODE_DATA(nid);
> *evictionp = entry << bucket_order;
> + *prev_jiffp = prev_jiff;
> }
>
> /**
> @@ -242,8 +247,12 @@ bool workingset_refault(void *shadow)
> unsigned long refault;
> struct pglist_data *pgdat;
> int memcgid;
> + unsigned long refault_ratio;
> + unsigned long prev_jiff;
> + unsigned long avg_refault_time;
> + unsigned long refault_time;
>
> - unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
> + unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &prev_jiff);
>
> rcu_read_lock();
> /*
> @@ -288,10 +297,11 @@ bool workingset_refault(void *shadow)
> * list is not a problem.
> */
> refault_distance = (refault - eviction) & EVICTION_MASK;
> -
> inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
> -
> - if (refault_distance <= active_file) {
> + lruvec->refaults_ratio = atomic_long_read(&lruvec->inactive_age) / jiffies;
> + refault_time = jiffies - prev_jiff;
> + avg_refault_time = refault_distance / lruvec->refaults_ratio;
> + if (refault_time <= avg_refault_time) {
> inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
> rcu_read_unlock();
> return true;
> @@ -521,7 +531,7 @@ static int __init workingset_init(void)
> * some more pages at runtime, so keep working with up to
> * double the initial memory by using totalram_pages as-is.
> */
> - timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
> + timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT - EVICTION_JIFFIES;
> max_order = fls_long(totalram_pages - 1);
> if (max_order > timestamp_bits)
> bucket_order = max_order - timestamp_bits;
> --
> 1.9.1

--
Michal Hocko
SUSE Labs

2019-04-04 16:40:46

by Johannes Weiner

Subject: Re: [PATCH] mm:workingset use real time to judge activity of the file page

On Thu, Apr 04, 2019 at 11:30:17AM +0800, Zhaoyang Huang wrote:
> From: Zhaoyang Huang <[email protected]>
>
> In the previous implementation, the number of refaulted pages is used
> to judge the refault period of each page. This is imprecise, because
> eviction of other files strongly affects the current cache.
> We introduce a timestamp into the workingset entry, plus a refault
> ratio, to measure a file page's activity. This reduces the influence of
> other files (the average refault ratio reflects the state of the whole
> system's memory).

I don't understand what exactly you're saying here, can you please
elaborate?

The reason it's using distances instead of absolute time is because
the ordering of the LRU is relative and not based on absolute time.

E.g. if a page is accessed every 500ms, it depends on all other pages
to determine whether this page is at the head or the tail of the LRU.

So when you refault, in order to determine the relative position of
the refaulted page in the LRU, you have to compare it to how fast that
LRU is moving. The absolute refault time, or the average time between
refaults, is not comparable to what's already in memory.
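
Illustration (not part of the posted message): the distance-based
test that the patch removes can be sketched in user space as follows.
The comparison mirrors the deleted lines in the diff
(refault_distance <= active_file); the mask width and the sk_* names
are assumptions made for this example, since the kernel derives
EVICTION_MASK from its own configuration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask width for illustration only. */
#define SK_EVICTION_MASK (~UINT64_C(0) >> 24)

/*
 * eviction:    value of the LRU's aging counter when the page was evicted
 * refault:     value of the same counter when the page faults back in
 * active_file: current size of the active file list, in pages
 */
static bool sk_should_activate(uint64_t eviction, uint64_t refault,
                               uint64_t active_file)
{
    /* Unsigned subtraction plus masking tolerates counter wrap-around. */
    uint64_t refault_distance = (refault - eviction) & SK_EVICTION_MASK;

    /* Both sides are measured in pages, so they are directly comparable. */
    return refault_distance <= active_file;
}
```

The point of the sketch is that both operands count pages moving
through the same LRU, which is what makes the comparison meaningful
without any reference to wall-clock time.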

2019-04-05 00:16:30

by Zhaoyang Huang

Subject: Re: [PATCH] mm:workingset use real time to judge activity of the file page

On Fri, Apr 5, 2019 at 12:39 AM Johannes Weiner <[email protected]> wrote:
>
> On Thu, Apr 04, 2019 at 11:30:17AM +0800, Zhaoyang Huang wrote:
> > From: Zhaoyang Huang <[email protected]>
> >
> > In the previous implementation, the number of refaulted pages is used
> > to judge the refault period of each page. This is imprecise, because
> > eviction of other files strongly affects the current cache.
> > We introduce a timestamp into the workingset entry, plus a refault
> > ratio, to measure a file page's activity. This reduces the influence of
> > other files (the average refault ratio reflects the state of the whole
> > system's memory).
>
> I don't understand what exactly you're saying here, can you please
> elaborate?
>
> The reason it's using distances instead of absolute time is because
> the ordering of the LRU is relative and not based on absolute time.
>
> E.g. if a page is accessed every 500ms, it depends on all other pages
> to determine whether this page is at the head or the tail of the LRU.
>
> So when you refault, in order to determine the relative position of
> the refaulted page in the LRU, you have to compare it to how fast that
> LRU is moving. The absolute refault time, or the average time between
> refaults, is not comparable to what's already in memory.
How do you know how long it took to drop these pages? Actually, a
quick drop of a large number of pages will be wrongly treated the same
as a slow drop, rather than as the severe situation it is. That is to
say, 100 pages dropped per millisecond and 100 pages dropped per second
have the same effect on the calculated refault distance, which may
give this page cache less protection in the former scenario and
introduce page thrashing. This matters especially under global reclaim:
a round of kswapd reclaim woken by a high-order allocation, or by a
large number of single-page allocations, can have this effect because
all pages within the node are counted in the same LRU. This commit
mitigates the above by comparing the refault time of a single page with
avg_refault_time = delta_lru_reclaimed_pages / avg_refault_ratio
(refault_ratio = lru->inactive_age / time).
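
Illustration (not part of the posted message): the formula above,
written out as a small user-space function. The arithmetic follows the
patch (refault_ratio = inactive_age / jiffies, avg_refault_time =
refault_distance / refault_ratio). The zero-ratio guard is an addition
of this sketch: with integer division the ratio truncates to zero
whenever fewer than one page per tick has aged, and the posted code
would then divide by zero. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

static bool sk_time_based_activate(uint64_t refault_distance,
                                   uint64_t inactive_age,
                                   uint64_t now,        /* current tick */
                                   uint64_t evicted_at) /* tick at eviction */
{
    /* Average number of pages aged per tick since boot. */
    uint64_t refault_ratio = now ? inactive_age / now : 0;

    /* Sketch-only guard: refuse to activate rather than divide by zero. */
    if (refault_ratio == 0)
        return false;

    /* Time the LRU would need, at the average rate, to age this distance. */
    uint64_t avg_refault_time = refault_distance / refault_ratio;

    /* Time the page actually spent evicted. */
    uint64_t refault_time = now - evicted_at;

    return refault_time <= avg_refault_time;
}
```

For example, with inactive_age = 100000 at tick 10000 the ratio is 10
pages per tick, so a distance of 1000 pages gives an average time of
100 ticks; a page that was out for 50 ticks is activated, one that was
out for 1000 ticks is not.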

2019-04-05 03:14:41

by Zhaoyang Huang

Subject: Re: [PATCH] mm:workingset use real time to judge activity of the file page

Resending via the right mailing list, with the comments rewritten by ZY.

On Thu, Apr 4, 2019 at 3:15 PM Michal Hocko <[email protected]> wrote:
>
> [Fixup email for Pavel and add Johannes]
>
> On Thu 04-04-19 11:30:17, Zhaoyang Huang wrote:
> > From: Zhaoyang Huang <[email protected]>
> >
> > In the previous implementation, the number of refaulted pages is used
> > to judge the refault period of each page. This is imprecise, because
> > eviction of other files strongly affects the current cache.
> > We introduce a timestamp into the workingset entry, plus a refault
> > ratio, to measure a file page's activity. This reduces the influence of
> > other files (the average refault ratio reflects the state of the whole
> > system's memory).
> > The patch was tested on an Android system by comparing the launch time
> > of an application under heavy memory consumption. As a result, launch
> > time decreased by 50% and page faults during the test decreased by 80%.
> >
I don't understand what exactly you're saying here, can you please elaborate?

The reason it's using distances instead of absolute time is because
the ordering of the LRU is relative and not based on absolute time.

E.g. if a page is accessed every 500ms, it depends on all other pages
to determine whether this page is at the head or the tail of the LRU.

So when you refault, in order to determine the relative position of
the refaulted page in the LRU, you have to compare it to how fast that
LRU is moving. The absolute refault time, or the average time between
refaults, is not comparable to what's already in memory.

comment by ZY:
The current implementation has difficulty evaluating the refault
period when a huge number of file pages is dropped within a short
time, which may be caused by a high-order allocation or by continuous
single-page allocations in kswapd. In that case, a page with a big
refault_distance is wrongly deemed INACTIVE, is reclaimed earlier than
it should be, and this leads to page thrashing. So we introduce
'avg_refault_time' and 'refault_ratio' to judge whether the refault
distance accumulated gradually or was caused by a burst of reclaim.
That is to say, a big refault_distance accumulated over a long time
would still be treated as inactive, as the result of comparing it with
the ideal time: avg_refault_time = delta_lru_reclaimed_pages /
avg_refault_ratio (refault_ratio = lru->inactive_age / time).
> > Signed-off-by: Zhaoyang Huang <[email protected]>
> > ---
> > include/linux/mmzone.h | 2 ++
> > mm/workingset.c | 24 +++++++++++++++++-------
> > 2 files changed, 19 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 32699b2..c38ba0a 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -240,6 +240,8 @@ struct lruvec {
> > atomic_long_t inactive_age;
> > /* Refaults at the time of last reclaim cycle */
> > unsigned long refaults;
> > + atomic_long_t refaults_ratio;
> > + atomic_long_t prev_fault;
> > #ifdef CONFIG_MEMCG
> > struct pglist_data *pgdat;
> > #endif
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index 40ee02c..6361853 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -159,7 +159,7 @@
> > NODES_SHIFT + \
> > MEM_CGROUP_ID_SHIFT)
> > #define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
> > -
> > +#define EVICTION_JIFFIES (BITS_PER_LONG >> 3)
> > /*
> > * Eviction timestamps need to be able to cover the full range of
> > * actionable refaults. However, bits are tight in the radix tree
> > @@ -175,18 +175,22 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
> > eviction >>= bucket_order;
> > eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
> > eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
> > + eviction = (eviction << EVICTION_JIFFIES) | (jiffies >> EVICTION_JIFFIES);
> > eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
> >
> > return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
> > }
> >
> > static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
> > - unsigned long *evictionp)
> > + unsigned long *evictionp, unsigned long *prev_jiffp)
> > {
> > unsigned long entry = (unsigned long)shadow;
> > int memcgid, nid;
> > + unsigned long prev_jiff;
> >
> > entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
> > + entry >>= EVICTION_JIFFIES;
> > + prev_jiff = (entry & ((1UL << EVICTION_JIFFIES) - 1)) << EVICTION_JIFFIES;
> > nid = entry & ((1UL << NODES_SHIFT) - 1);
> > entry >>= NODES_SHIFT;
> > memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
> > @@ -195,6 +199,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
> > *memcgidp = memcgid;
> > *pgdat = NODE_DATA(nid);
> > *evictionp = entry << bucket_order;
> > + *prev_jiffp = prev_jiff;
> > }
> >
> > /**
> > @@ -242,8 +247,12 @@ bool workingset_refault(void *shadow)
> > unsigned long refault;
> > struct pglist_data *pgdat;
> > int memcgid;
> > + unsigned long refault_ratio;
> > + unsigned long prev_jiff;
> > + unsigned long avg_refault_time;
> > + unsigned long refault_time;
> >
> > - unpack_shadow(shadow, &memcgid, &pgdat, &eviction);
> > + unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &prev_jiff);
> >
> > rcu_read_lock();
> > /*
> > @@ -288,10 +297,11 @@ bool workingset_refault(void *shadow)
> > * list is not a problem.
> > */
> > refault_distance = (refault - eviction) & EVICTION_MASK;
> > -
> > inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
> > -
> > - if (refault_distance <= active_file) {
> > + lruvec->refaults_ratio = atomic_long_read(&lruvec->inactive_age) / jiffies;
> > + refault_time = jiffies - prev_jiff;
> > + avg_refault_time = refault_distance / lruvec->refaults_ratio;
> > + if (refault_time <= avg_refault_time) {
> > inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
> > rcu_read_unlock();
> > return true;
> > @@ -521,7 +531,7 @@ static int __init workingset_init(void)
> > * some more pages at runtime, so keep working with up to
> > * double the initial memory by using totalram_pages as-is.
> > */
> > - timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
> > + timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT - EVICTION_JIFFIES;
> > max_order = fls_long(totalram_pages - 1);
> > if (max_order > timestamp_bits)
> > bucket_order = max_order - timestamp_bits;
> > --
> > 1.9.1
>
> --
> Michal Hocko
> SUSE Labs

2019-04-05 03:25:14

by Matthew Wilcox

Subject: Re: [PATCH] mm:workingset use real time to judge activity of the file page

On Thu, Apr 04, 2019 at 11:30:17AM +0800, Zhaoyang Huang wrote:
> +++ b/mm/workingset.c
> @@ -159,7 +159,7 @@
> NODES_SHIFT + \
> MEM_CGROUP_ID_SHIFT)
> #define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
> -
> +#define EVICTION_JIFFIES (BITS_PER_LONG >> 3)
> /*
> * Eviction timestamps need to be able to cover the full range of
> * actionable refaults. However, bits are tight in the radix tree
> @@ -175,18 +175,22 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
> eviction >>= bucket_order;
> eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
> eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
> + eviction = (eviction << EVICTION_JIFFIES) | (jiffies >> EVICTION_JIFFIES);
> eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);

... this isn't against current, or even 5.0.

> entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
> + entry >>= EVICTION_JIFFIES;
> + prev_jiff = (entry & ((1UL << EVICTION_JIFFIES) - 1)) << EVICTION_JIFFIES;

These two lines are in the wrong order. So you're getting (effectively) a
random answer in your 'prev_jiff', which means your testing isn't thorough
enough. I suspect you're only testing cases you're expecting to improve,
and you aren't testing to make sure that other cases don't regress.
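
Illustration (not part of the posted message): a user-space sketch of
the ordering issue. Fields must be unpacked in the reverse of the order
in which they were packed, and each field has to be masked out before
it is shifted away. In the patch the shift comes first, so the mask
reads bits that belong to the node id field. The field widths and sk_*
names below are assumptions for this example; the radix tree
exceptional-entry shift and bucket_order are left out for brevity. The
sketch also masks the timestamp before OR-ing it in, which the posted
pack_shadow() change does not do.

```c
#include <stdint.h>

/* Hypothetical field widths, chosen only for this sketch. */
#define SK_TIME_BITS   8   /* BITS_PER_LONG >> 3 on a 64-bit build */
#define SK_NODE_BITS   6
#define SK_MEMCG_BITS  16

static uint64_t sk_mask(unsigned bits)
{
    return (UINT64_C(1) << bits) - 1;
}

static uint64_t sk_pack(uint64_t eviction, unsigned memcgid, unsigned nid,
                        uint64_t now)
{
    uint64_t e = eviction;

    e = (e << SK_MEMCG_BITS) | (memcgid & sk_mask(SK_MEMCG_BITS));
    e = (e << SK_NODE_BITS) | (nid & sk_mask(SK_NODE_BITS));
    /* Mask the coarse timestamp so its high bits cannot spill upward. */
    e = (e << SK_TIME_BITS) | ((now >> SK_TIME_BITS) & sk_mask(SK_TIME_BITS));
    return e;
}

static void sk_unpack(uint64_t e, uint64_t *eviction, unsigned *memcgid,
                      unsigned *nid, uint64_t *stamp)
{
    /* Last field packed is the first one out: mask first, then shift. */
    *stamp = (e & sk_mask(SK_TIME_BITS)) << SK_TIME_BITS;
    e >>= SK_TIME_BITS;
    *nid = (unsigned)(e & sk_mask(SK_NODE_BITS));
    e >>= SK_NODE_BITS;
    *memcgid = (unsigned)(e & sk_mask(SK_MEMCG_BITS));
    e >>= SK_MEMCG_BITS;
    *eviction = e;
}
```

A round trip through sk_pack() and sk_unpack() returns the original
eviction counter, memcg id and node id, and a timestamp truncated to
the stored resolution.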

2019-04-05 19:35:37

by Johannes Weiner

Subject: Re: [PATCH] mm:workingset use real time to judge activity of the file page

On Fri, Apr 05, 2019 at 07:23:46AM +0800, Zhaoyang Huang wrote:
> On Fri, Apr 5, 2019 at 12:39 AM Johannes Weiner <[email protected]> wrote:
> >
> > On Thu, Apr 04, 2019 at 11:30:17AM +0800, Zhaoyang Huang wrote:
> > > From: Zhaoyang Huang <[email protected]>
> > >
> > > In the previous implementation, the number of refaulted pages is used
> > > to judge the refault period of each page. This is imprecise, because
> > > eviction of other files strongly affects the current cache.
> > > We introduce a timestamp into the workingset entry, plus a refault
> > > ratio, to measure a file page's activity. This reduces the influence of
> > > other files (the average refault ratio reflects the state of the whole
> > > system's memory).
> >
> > I don't understand what exactly you're saying here, can you please
> > elaborate?
> >
> > The reason it's using distances instead of absolute time is because
> > the ordering of the LRU is relative and not based on absolute time.
> >
> > E.g. if a page is accessed every 500ms, it depends on all other pages
> > to determine whether this page is at the head or the tail of the LRU.
> >
> > So when you refault, in order to determine the relative position of
> > the refaulted page in the LRU, you have to compare it to how fast that
> > LRU is moving. The absolute refault time, or the average time between
> > refaults, is not comparable to what's already in memory.
> How do you know how long it took to drop these pages? Actually, a
> quick drop of a large number of pages will be wrongly treated the same
> as a slow drop, rather than as the severe situation it is. That is to
> say, 100 pages dropped per millisecond and 100 pages dropped per second
> have the same effect on the calculated refault distance, which may
> give this page cache less protection in the former scenario and
> introduce page thrashing. This matters especially under global reclaim:
> a round of kswapd reclaim woken by a high-order allocation, or by a
> large number of single-page allocations, can have this effect because
> all pages within the node are counted in the same LRU. This commit
> mitigates the above by comparing the refault time of a single page with
> avg_refault_time = delta_lru_reclaimed_pages / avg_refault_ratio
> (refault_ratio = lru->inactive_age / time).

When something like a higher-order allocation drops a large number of
file pages, it's *intentional* that the pages that were evicted before
them become less valuable and less likely to be activated on
refault. There is a finite amount of in-memory LRU space and the pages
that have been evicted the most recently have precedence because they
have the highest proven access frequency.

Of course, when a large amount of the cache that was pushed out in
between is not re-used again, and doesn't claim its space in memory,
it would be great if we could then activate the older pages that *are*
re-used again in their stead.

But that would require us being able to look into the future. When an
old page refaults, we don't know if a younger page is still going to
refault with a shorter refault distance or not. If it won't, then we
were right to activate it. If it will refault, then we put something
on the active list whose reuse frequency is too low to be able to fit
into memory, and we thrash the hottest pages in the system.

As Matthew says, you are fairly randomly making refault activations
more aggressive (especially with that timestamp unpacking bug), and
while that expectedly boosts workload transition / startup, it comes
at the cost of disrupting stable states because you can flood a very
active in-ram workingset with completely cold cache pages simply
because they refault uniformly wrt each other.
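
Illustration (not part of the posted message): one way to read the
last point is through an idealized model in which the LRU has aged at a
constant rate since boot. In that model the time-based test passes for
any refault distance, because the elapsed time and the "average" time
are derived from the same rate, while the distance-based test still
depends on the size of the active list. The model, the constant-rate
assumption and the sk_* names are all assumptions of this sketch, not
taken from the thread.

```c
#include <stdbool.h>
#include <stdint.h>

struct sk_verdict {
    bool by_distance; /* existing test: refault_distance <= active_file */
    bool by_time;     /* proposed test: refault_time <= avg_refault_time */
};

/*
 * Idealized model: the LRU has aged at a constant `rate` pages per tick
 * since boot. Both `rate` and `now` must be nonzero.
 */
static struct sk_verdict sk_compare(uint64_t refault_distance,
                                    uint64_t active_file,
                                    uint64_t rate, uint64_t now)
{
    struct sk_verdict v;
    uint64_t inactive_age = rate * now;           /* constant-rate assumption */
    uint64_t refault_ratio = inactive_age / now;  /* equals rate */
    uint64_t refault_time = refault_distance / rate;  /* ticks spent evicted */
    uint64_t avg_refault_time = refault_distance / refault_ratio;

    v.by_distance = refault_distance <= active_file;
    v.by_time = refault_time <= avg_refault_time;
    return v;
}
```

With an active list of 1000 pages and a rate of 10 pages per tick, a
page refaulting at a distance of 50000 pages is rejected by the
distance test but accepted by the time test, which is the kind of
cold-page activation the reply warns about.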