2019-04-28 07:48:34

by Zhaoyang Huang

Subject: [[repost]RFC PATCH] mm/workingset : judge file page activity via timestamp

From: Zhaoyang Huang <[email protected]>

This patch introduces a timestamp into the workingset's shadow entry and judges
whether a page is active or inactive via active_file/refault_ratio instead of
the refault distance.

The idea comes from the logs we got from trace_printk in this patch: about 1/5
of the file pages' refaults fall under scenario [1], where the page is counted
as inactive because it has a long refault distance between accesses. However,
the time information shows that such a page refaults quickly compared to the
average refault time, which is calculated from the number of active file pages
and the refault ratio. We want to save these pages from being evicted early, as
they used to be, by setting them to ACTIVE instead. The refault ratio reflects
the LRU's average file access frequency in the past and provides the criterion
for a page's activation.

The patch was tested on an Android system and reduced page faults by 30%, while
60% of the pages kept their original status as indicated by
(refault_distance < active_file). The page statuses obtained from ftrace during
the test are listed in [2].

[1]
system_server workingset_refault: WKST_ACT[0]:rft_dis 265976, act_file 34268 rft_ratio 3047 rft_time 0 avg_rft_time 11 refault 295592 eviction 29616 secs 97 pre_secs 97
HwBinder:922 workingset_refault: WKST_ACT[0]:rft_dis 264478, act_file 35037 rft_ratio 3070 rft_time 2 avg_rft_time 11 refault 310078 eviction 45600 secs 101 pre_secs 99

[2]
WKST_ACT[0]: original--INACTIVE commit--ACTIVE
WKST_ACT[1]: original--ACTIVE commit--ACTIVE
WKST_INACT[0]: original--INACTIVE commit--INACTIVE
WKST_INACT[1]: original--ACTIVE commit--INACTIVE

Signed-off-by: Zhaoyang Huang <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/workingset.c | 129 ++++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 113 insertions(+), 17 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fba7741..ca4ced6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -242,6 +242,7 @@ struct lruvec {
atomic_long_t inactive_age;
/* Refaults at the time of last reclaim cycle */
unsigned long refaults;
+ atomic_long_t refaults_ratio;
#ifdef CONFIG_MEMCG
struct pglist_data *pgdat;
#endif
diff --git a/mm/workingset.c b/mm/workingset.c
index 0bedf67..fd2e5af 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -167,10 +167,19 @@
* refault distance will immediately activate the refaulting page.
*/

+#ifdef CONFIG_64BIT
+#define EVICTION_SECS_POS_SHIFT 18
+#define EVICTION_SECS_SHRINK_SHIFT 4
+#define EVICTION_SECS_POS_MASK ((1UL << EVICTION_SECS_POS_SHIFT) - 1)
+#else
+#define EVICTION_SECS_POS_SHIFT 0
+#define EVICTION_SECS_SHRINK_SHIFT 0
+#define NO_SECS_IN_WORKINGSET
+#endif
#define EVICTION_SHIFT ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \
- 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT)
+ 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT + \
+ EVICTION_SECS_POS_SHIFT + EVICTION_SECS_SHRINK_SHIFT)
#define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
-
/*
* Eviction timestamps need to be able to cover the full range of
* actionable refaults. However, bits are tight in the xarray
@@ -180,12 +189,48 @@
* evictions into coarser buckets by shaving off lower timestamp bits.
*/
static unsigned int bucket_order __read_mostly;
-
+#ifdef NO_SECS_IN_WORKINGSET
+static void pack_secs(unsigned long *peviction) { }
+static unsigned int unpack_secs(unsigned long entry) {return 0; }
+#else
+static void pack_secs(unsigned long *peviction)
+{
+ unsigned int secs;
+ unsigned long eviction;
+ int order;
+ int secs_shrink_size;
+ struct timespec64 ts;
+
+ ktime_get_boottime_ts64(&ts);
+ secs = (unsigned int)ts.tv_sec ? (unsigned int)ts.tv_sec : 1;
+ order = get_count_order(secs);
+ secs_shrink_size = (order <= EVICTION_SECS_POS_SHIFT)
+ ? 0 : (order - EVICTION_SECS_POS_SHIFT);
+
+ eviction = *peviction;
+ eviction = (eviction << EVICTION_SECS_POS_SHIFT)
+ | ((secs >> secs_shrink_size) & EVICTION_SECS_POS_MASK);
+ eviction = (eviction << EVICTION_SECS_SHRINK_SHIFT) | (secs_shrink_size & 0xf);
+ *peviction = eviction;
+}
+static unsigned int unpack_secs(unsigned long entry)
+{
+ unsigned int secs;
+ int secs_shrink_size;
+
+ secs_shrink_size = entry & ((1 << EVICTION_SECS_SHRINK_SHIFT) - 1);
+ entry >>= EVICTION_SECS_SHRINK_SHIFT;
+ secs = entry & EVICTION_SECS_POS_MASK;
+ secs = secs << secs_shrink_size;
+ return secs;
+}
+#endif
static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
bool workingset)
{
eviction >>= bucket_order;
eviction &= EVICTION_MASK;
+ pack_secs(&eviction);
eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
eviction = (eviction << 1) | workingset;
@@ -194,11 +239,12 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
}

static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
- unsigned long *evictionp, bool *workingsetp)
+ unsigned long *evictionp, bool *workingsetp, unsigned int *prev_secs)
{
unsigned long entry = xa_to_value(shadow);
int memcgid, nid;
bool workingset;
+ unsigned int secs;

workingset = entry & 1;
entry >>= 1;
@@ -206,11 +252,14 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
entry >>= NODES_SHIFT;
memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
entry >>= MEM_CGROUP_ID_SHIFT;
+ secs = unpack_secs(entry);
+ entry >>= (EVICTION_SECS_POS_SHIFT + EVICTION_SECS_SHRINK_SHIFT);

*memcgidp = memcgid;
*pgdat = NODE_DATA(nid);
*evictionp = entry << bucket_order;
*workingsetp = workingset;
+ *prev_secs = secs;
}

/**
@@ -257,8 +306,19 @@ void workingset_refault(struct page *page, void *shadow)
unsigned long refault;
bool workingset;
int memcgid;
+#ifndef NO_SECS_IN_WORKINGSET
+ unsigned long avg_refault_time;
+ unsigned long refaults_ratio;
+ unsigned long refault_time;
+ int tradition;
+ unsigned int prev_secs;
+ unsigned int secs;
+#endif
+ struct timespec64 ts;
+ ktime_get_boottime_ts64(&ts);
+ secs = (unsigned int)ts.tv_sec ? (unsigned int)ts.tv_sec : 1;

- unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
+ unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset, &prev_secs);

rcu_read_lock();
/*
@@ -303,23 +363,57 @@ void workingset_refault(struct page *page, void *shadow)
refault_distance = (refault - eviction) & EVICTION_MASK;

inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
-
+#ifndef NO_SECS_IN_WORKINGSET
+ refaults_ratio = (atomic_long_read(&lruvec->inactive_age) + 1) / secs;
+ atomic_long_set(&lruvec->refaults_ratio, refaults_ratio);
+ refault_time = secs - prev_secs;
+ avg_refault_time = active_file / refaults_ratio;
+ tradition = !!(refault_distance < active_file);
/*
- * Compare the distance to the existing workingset size. We
- * don't act on pages that couldn't stay resident even if all
- * the memory was available to the page cache.
+ * What we are trying to solve here is
+ * 1. extremely fast refault as refault_time == 0.
+ * 2. quick file drop scenario, which has a big refault_distance but
+ * small refault_time comparing with the past refault ratio, which
+ * will be deemed as inactive in previous implementation.
*/
- if (refault_distance > active_file)
+ if (refault_time && (((refault_time < avg_refault_time)
+ && (avg_refault_time < 2 * refault_time))
+ || (refault_time >= avg_refault_time))) {
+ trace_printk("WKST_INACT[%d]:rft_dis %ld, act %ld\
+ rft_ratio %ld rft_time %ld avg_rft_time %ld\
+ refault %ld eviction %ld secs %d pre_secs %d page %p\n",
+ tradition, refault_distance, active_file,
+ refaults_ratio, refault_time, avg_refault_time,
+ refault, eviction, secs, prev_secs, page);
goto out;
+ } else {
+#else
+ if (refault_distance < active_file) {
+#endif

- SetPageActive(page);
- atomic_long_inc(&lruvec->inactive_age);
- inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+ /*
+ * Compare the distance to the existing workingset size. We
+ * don't act on pages that couldn't stay resident even if all
+ * the memory was available to the page cache.
+ */

- /* Page was active prior to eviction */
- if (workingset) {
- SetPageWorkingset(page);
- inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
+ SetPageActive(page);
+ atomic_long_inc(&lruvec->inactive_age);
+ inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+
+ /* Page was active prior to eviction */
+ if (workingset) {
+ SetPageWorkingset(page);
+ inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
+ }
+#ifndef NO_SECS_IN_WORKINGSET
+ trace_printk("WKST_ACT[%d]:rft_dis %ld, act %ld\
+ rft_ratio %ld rft_time %ld avg_rft_time %ld\
+ refault %ld eviction %ld secs %d pre_secs %d page %p\n",
+ tradition, refault_distance, active_file,
+ refaults_ratio, refault_time, avg_refault_time,
+ refault, eviction, secs, prev_secs, page);
+#endif
}
out:
rcu_read_unlock();
@@ -548,6 +642,7 @@ static int __init workingset_init(void)
* double the initial memory by using totalram_pages as-is.
*/
timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
+
max_order = fls_long(totalram_pages() - 1);
if (max_order > timestamp_bits)
bucket_order = max_order - timestamp_bits;
--
1.9.1


2019-05-05 05:35:02

by kernel test robot

Subject: [RFC PATCH] mm/workingset ] dea0795270: BUG:kernel_hang_in_boot_stage

FYI, we noticed the following commit (built with gcc-7):

commit: dea079527088cffccd014c815ecaf5b0f0506c59 ("[[repost]RFC PATCH] mm/workingset : judge file page activity via timestamp")
url: https://github.com/0day-ci/linux/commits/Zhaoyang-Huang/RFC-PATCH-mm-workingset-judge-file-page-activity-via-timestamp/20190429-080234


in testcase: trinity
with following parameters:

runtime: 300s

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/


on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 2G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+-------------------------------+------------+------------+
| | 9520b5324b | dea0795270 |
+-------------------------------+------------+------------+
| boot_successes | 4 | 0 |
| boot_failures | 0 | 4 |
| BUG:kernel_hang_in_boot_stage | 0 | 4 |
+-------------------------------+------------+------------+


If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


[ 0.067096] ACPI: Early table checksum verification disabled
[ 0.068955] ACPI: RSDP 0x00000000000F6840 000014 (v00 BOCHS )
[ 0.071039] ACPI: RSDT 0x000000007FFE15C9 000030 (v01 BOCHS BXPCRSDT 00000001 BXPC 00000001)
[ 0.073473] ACPI: FACP 0x000000007FFE149D 000074 (v01 BOCHS BXPCFACP 00000001 BXPC 00000001)
[ 0.075983] ACPI: DSDT 0x000000007FFE0040 00145D (v01 BOCHS BXPCDSDT 00000001 BXPC 00000001)
BUG: kernel hang in boot stage

Elapsed time: 140

qemu-img create -f qcow2 disk-vm-snb-2G-406-0 256G
qemu-img create -f qcow2 disk-vm-snb-2G-406-1 256G


To reproduce:

# build kernel
cd linux
cp config-5.1.0-rc6-00158-gdea07952 .config
make HOSTCC=gcc-7 CC=gcc-7 ARCH=x86_64 olddefconfig
make HOSTCC=gcc-7 CC=gcc-7 ARCH=x86_64 prepare
make HOSTCC=gcc-7 CC=gcc-7 ARCH=x86_64 modules_prepare
make HOSTCC=gcc-7 CC=gcc-7 ARCH=x86_64 SHELL=/bin/bash
make HOSTCC=gcc-7 CC=gcc-7 ARCH=x86_64 bzImage


git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



Thanks,
lkp


Attachments:
(No filename) (2.31 kB)
config-5.1.0-rc6-00158-gdea07952 (122.27 kB)
job-script (4.46 kB)
dmesg.xz (2.37 kB)

2019-05-06 14:59:52

by Johannes Weiner

Subject: Re: [[repost]RFC PATCH] mm/workingset : judge file page activity via timestamp

On Sun, Apr 28, 2019 at 03:44:34PM +0800, Zhaoyang Huang wrote:
> From: Zhaoyang Huang <[email protected]>
>
> This patch introduces a timestamp into the workingset's shadow entry and judges
> whether a page is active or inactive via active_file/refault_ratio instead of
> the refault distance.
>
> The idea comes from the logs we got from trace_printk in this patch: about 1/5
> of the file pages' refaults fall under scenario [1], where the page is counted
> as inactive because it has a long refault distance between accesses. However,
> the time information shows that such a page refaults quickly compared to the
> average refault time, which is calculated from the number of active file pages
> and the refault ratio. We want to save these pages from being evicted early, as
> they used to be, by setting them to ACTIVE instead. The refault ratio reflects
> the LRU's average file access frequency in the past and provides the criterion
> for a page's activation.
>
> The patch was tested on an Android system and reduced page faults by 30%, while
> 60% of the pages kept their original status as indicated by
> (refault_distance < active_file). The page statuses obtained from ftrace during
> the test are listed in [2].
>
> [1]
> system_server workingset_refault: WKST_ACT[0]:rft_dis 265976, act_file 34268 rft_ratio 3047 rft_time 0 avg_rft_time 11 refault 295592 eviction 29616 secs 97 pre_secs 97
> HwBinder:922 workingset_refault: WKST_ACT[0]:rft_dis 264478, act_file 35037 rft_ratio 3070 rft_time 2 avg_rft_time 11 refault 310078 eviction 45600 secs 101 pre_secs 99
>
> [2]
> WKST_ACT[0]: original--INACTIVE commit--ACTIVE
> WKST_ACT[1]: original--ACTIVE commit--ACTIVE
> WKST_INACT[0]: original--INACTIVE commit--INACTIVE
> WKST_INACT[1]: original--ACTIVE commit--INACTIVE
>
> Signed-off-by: Zhaoyang Huang <[email protected]>

Nacked-by: Johannes Weiner <[email protected]>

You haven't addressed any of the questions raised during previous
submissions.

2019-05-07 06:07:26

by Zhaoyang Huang

Subject: Re: [[repost]RFC PATCH] mm/workingset : judge file page activity via timestamp

On Mon, May 6, 2019 at 10:57 PM Johannes Weiner <[email protected]> wrote:
>
> On Sun, Apr 28, 2019 at 03:44:34PM +0800, Zhaoyang Huang wrote:
> > From: Zhaoyang Huang <[email protected]>
> >
> > This patch introduces a timestamp into the workingset's shadow entry and judges
> > whether a page is active or inactive via active_file/refault_ratio instead of
> > the refault distance.
> >
> > The idea comes from the logs we got from trace_printk in this patch: about 1/5
> > of the file pages' refaults fall under scenario [1], where the page is counted
> > as inactive because it has a long refault distance between accesses. However,
> > the time information shows that such a page refaults quickly compared to the
> > average refault time, which is calculated from the number of active file pages
> > and the refault ratio. We want to save these pages from being evicted early, as
> > they used to be, by setting them to ACTIVE instead. The refault ratio reflects
> > the LRU's average file access frequency in the past and provides the criterion
> > for a page's activation.
> >
> > The patch was tested on an Android system and reduced page faults by 30%, while
> > 60% of the pages kept their original status as indicated by
> > (refault_distance < active_file). The page statuses obtained from ftrace during
> > the test are listed in [2].
> >
Hi Johannes,
Thank you for your feedback. I have answered the previous comments many
times in different contexts. I don't expect you to accept this patch, but
I want to draw your attention to the phenomenon reported in [1]: pages
that have a big refault distance but refaulted very quickly after being
evicted. Do you think this kind of page should be set to INACTIVE?
> > [1]
> > system_server workingset_refault: WKST_ACT[0]:rft_dis 265976, act_file 34268 rft_ratio 3047 rft_time 0 avg_rft_time 11 refault 295592 eviction 29616 secs 97 pre_secs 97
> > HwBinder:922 workingset_refault: WKST_ACT[0]:rft_dis 264478, act_file 35037 rft_ratio 3070 rft_time 2 avg_rft_time 11 refault 310078 eviction 45600 secs 101 pre_secs 99
> >
> > [2]
> > WKST_ACT[0]: original--INACTIVE commit--ACTIVE
> > WKST_ACT[1]: original--ACTIVE commit--ACTIVE
> > WKST_INACT[0]: original--INACTIVE commit--INACTIVE
> > WKST_INACT[1]: original--ACTIVE commit--INACTIVE
> >
> > Signed-off-by: Zhaoyang Huang <[email protected]>
>
> Nacked-by: Johannes Weiner <[email protected]>
>
> You haven't addressed any of the questions raised during previous
> submissions.
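
The two sample traces in [1] can be replayed against both criteria discussed
in this thread. The following is a user-space sketch under stated assumptions,
not code from the patch: struct refault_sample and both function names are
hypothetical, the distance check follows the pre-patch
"if (refault_distance > active_file) goto out" logic, and the fallback for a
zero refault ratio is an addition made here only to keep the sketch free of a
division by zero (the posted patch divides unconditionally).

```c
#include <stdbool.h>

/* One refault event, using the field names printed in trace [1]. */
struct refault_sample {
	unsigned long refault_distance; /* rft_dis */
	unsigned long active_file;      /* act_file */
	unsigned long refaults_ratio;   /* rft_ratio, refaults per second */
	unsigned long refault_time;     /* rft_time, seconds since eviction */
};

/* Pre-patch rule: activate unless the distance exceeds the active list. */
static bool activate_by_distance(const struct refault_sample *s)
{
	return s->refault_distance <= s->active_file;
}

/* Rule from the patch: compare refault time with the average refault time. */
static bool activate_by_time(const struct refault_sample *s)
{
	unsigned long avg, rt = s->refault_time;

	/* Assumption of this sketch: fall back when the ratio rounds to 0. */
	if (s->refaults_ratio == 0)
		return activate_by_distance(s);

	avg = s->active_file / s->refaults_ratio; /* avg_rft_time */

	/* Same condition as the patch's "goto out" branch. */
	if (rt && ((rt < avg && avg < 2 * rt) || rt >= avg))
		return false;
	return true;
}

/* The two events quoted in [1]. */
static const struct refault_sample trace_samples[2] = {
	{ 265976, 34268, 3047, 0 }, /* system_server */
	{ 264478, 35037, 3070, 2 }, /* HwBinder:922 */
};
```

For both samples avg_rft_time works out to 11 (34268 / 3047 and 35037 / 3070
with integer division), which matches the trace. The distance rule leaves both
pages inactive and the time rule activates both, which is the WKST_ACT[0] case
in [2]. For a nonzero refault_time the patch's compound condition reduces to
avg_rft_time < 2 * refault_time.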