2015-04-28 12:24:57

by Vladimir Davydov

Subject: [PATCH v3 0/3] idle memory tracking

Hi,

This patch set introduces a new user API for tracking user memory pages
that have not been used for a given period of time. The purpose of this
is to provide the userspace with the means of tracking a workload's
working set, i.e. the set of pages that are actively used by the
workload. Knowing the working set size can be useful for partitioning
the system more efficiently, e.g. by tuning memory cgroup limits
appropriately, or for job placement within a compute cluster.

---- USE CASES ----

The unified cgroup hierarchy has memory.low and memory.high knobs, which
are defined as the low and high boundaries for the workload working set
size. However, the working set size of a workload may be unknown or
change in time. With this patch set, one can periodically estimate the
amount of memory unused by each cgroup and tune their memory.low and
memory.high parameters accordingly, therefore optimizing the overall
memory utilization.

Another use case is balancing workloads within a compute cluster.
Knowing how much memory is not really used by a workload unit may help
take a more optimal decision when considering migrating the unit to
another node within the cluster.

---- USER API ----

The user API consists of two new proc files:

* /proc/kpageidle. For each page this file contains a 64-bit number, which
equals 1 if the page is idle or 0 otherwise, indexed by PFN. A page is
considered idle if it has not been accessed since it was marked idle. To
mark a page idle, one should write 1 to this file at the offset corresponding
to the page. Only user memory pages can be marked idle; for other page types
the input is silently ignored. Writing to this file beyond the max PFN results
in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is set.

This file can be used to estimate the number of pages that are not
used by a particular workload as follows:

1. mark all pages of interest idle by writing 1 at the corresponding
offsets to /proc/kpageidle
2. wait until the workload accesses its working set
3. read /proc/kpageidle and count the number of idle pages
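The offset arithmetic behind these steps is one 64-bit slot per PFN. Below is a minimal sketch of steps 1 and 3 for a single PFN; the helper names are mine, and a seekable byte stream stands in for /proc/kpageidle:

```python
import struct

ENTRY_SIZE = 8  # /proc/kpageidle holds one 64-bit value per PFN

def idle_offset(pfn):
    # Byte offset of a PFN's entry in /proc/kpageidle
    return pfn * ENTRY_SIZE

def mark_idle(f, pfn):
    # Step 1: write 1 at the PFN's slot to mark the page idle
    f.seek(idle_offset(pfn))
    f.write(struct.pack("Q", 1))

def check_idle(f, pfn):
    # Step 3: a nonzero value means the page has not been accessed
    f.seek(idle_offset(pfn))
    return struct.unpack("Q", f.read(ENTRY_SIZE))[0] != 0
```

In real use the stream would be /proc/kpageidle opened by root; since the kernel silently ignores writes for non-user-memory pages, check_idle may legitimately report 0 even right after mark_idle.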

* /proc/kpagecgroup. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.

This file can be used to find all pages (including unmapped file
pages) accounted to a particular cgroup. Using /proc/kpageidle, one
can then estimate the cgroup working set size.
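Once both files have been read, the counting step is a pairwise scan of two parallel u64 arrays. A sketch of that step over buffers already read from the two files (the function name is mine):

```python
import struct

def idle_pages_per_cgroup(cgroup_bytes, idle_bytes):
    # cgroup_bytes / idle_bytes: parallel buffers read from
    # /proc/kpagecgroup and /proc/kpageidle, one u64 per PFN.
    # Returns a dict mapping cgroup inode -> number of idle pages.
    nidle = {}
    for off in range(0, len(cgroup_bytes), 8):
        cgino = struct.unpack_from("Q", cgroup_bytes, off)[0]
        idle = struct.unpack_from("Q", idle_bytes, off)[0]
        if idle:
            nidle[cgino] = nidle.get(cgino, 0) + 1
    return nidle
```

The script attached below does the same thing, additionally weighting huge pages and skipping unevictable ones using /proc/kpageflags.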

For an example of using these files to estimate the number of unused
memory pages for each memory cgroup, please see the script attached
below.

---- REASONING ----

The reason for introducing a new user API instead of using
/proc/PID/{clear_refs,smaps} is that the latter has two serious
drawbacks:

- it does not count unmapped file pages
- it affects the reclaimer logic

The new API attempts to overcome them both. For more details on how it
is achieved, please see the comment to patch 3.

---- CHANGE LOG ----

Changes in v3:

- Enable CONFIG_IDLE_PAGE_TRACKING for 32 bit. Since this feature
requires two extra page flags and there is no space for them on 32
bit, page ext is used (thanks to Minchan Kim).

- Minor code cleanups and comments improved.

- Rebase on top of 4.1-rc1.

Changes in v2:

- The main difference from v1 is the API change. In v1 the user could
only set the idle flag for all pages at once, and had to use
/proc/PID/clear_refs to clear the Idle flag on pages accessed via
page tables.

The main drawback of the v1 approach, as noted by Minchan, is that on
big machines setting the idle flag for each page can result in CPU
bursts, which would be especially frustrating if the user only wanted
to estimate the amount of idle pages for a particular process or VMA.
With the new API a more fine-grained approach is possible: one can
read a process's /proc/PID/pagemap and set/check the Idle flag only
for those pages of the process's address space they are interested
in.
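A pagemap entry is itself a 64-bit value whose low 55 bits hold the PFN when the page is present, so translating a virtual page into a kpageidle offset is a matter of bit masking. A sketch, with the field layout taken from Documentation/vm/pagemap.txt (the helper name is mine):

```python
PM_PFN_MASK = (1 << 55) - 1  # bits 0-54: page frame number
PM_SWAP = 1 << 62            # bit 62: page is swapped out
PM_PRESENT = 1 << 63         # bit 63: page present in RAM

def pagemap_pfn(entry):
    # Return the PFN encoded in a /proc/PID/pagemap entry,
    # or None if the page is not resident in RAM.
    if entry & PM_PRESENT and not (entry & PM_SWAP):
        return entry & PM_PFN_MASK
    return None
```

The PFN returned can then be multiplied by the 8-byte entry size to seek in /proc/kpageidle.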

Another advantage of the v2 API is that it makes it possible to limit
the /proc/kpage* scanning rate when the user wants to estimate the
total number of idle pages, which was unachievable with the v1
approach.

- Make /proc/kpagecgroup return the ino of the closest online ancestor
in case the cgroup a page is charged to is offline.

- Fix /proc/PID/clear_refs not clearing Young page flag.

- Rebase on top of v4.0-rc6-mmotm-2015-04-01-14-54

v2: https://lkml.org/lkml/2015/4/7/260
v1: https://lkml.org/lkml/2015/3/18/794

---- PATCH SET STRUCTURE ----

The patch set is organized as follows:

- patch 1 adds page_cgroup_ino() helper for the sake of
/proc/kpagecgroup
- patch 2 adds /proc/kpagecgroup, which reports cgroup ino each page is
charged to
- patch 3 implements the idle page tracking feature, including the
userspace API, /proc/kpageidle

---- SIMILAR WORKS ----

Originally, the patch for tracking idle memory was proposed back in 2011
by Michel Lespinasse (see http://lwn.net/Articles/459269/). The main
difference between Michel's patch and this one is that Michel
implemented a kernel-space daemon for estimating idle memory size per
cgroup, while this patch only provides userspace with a minimal API
for doing the job, leaving the rest up to userspace. However, both
share the same idea of using Idle/Young page flags to avoid affecting
the reclaimer logic.

---- SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ----
#! /usr/bin/python
#

CGROUP_MOUNT = "/sys/fs/cgroup/memory"

import os
import stat
import errno
import struct

def set_idle():
    pgidle = open("/proc/kpageidle", "wb")
    while True:
        try:
            pgidle.write(struct.pack("Q", 1))
        except IOError as e:
            if e.errno == errno.ENXIO: break
            raise
    pgidle.close()

def count_idle():
    pgflags = open("/proc/kpageflags", "rb")
    pgcgroup = open("/proc/kpagecgroup", "rb")
    pgidle = open("/proc/kpageidle", "rb")
    nidle = {}
    while True:
        s = pgflags.read(8)
        if len(s) != 8: break
        flags = struct.unpack('Q', s)[0]
        cgino = struct.unpack('Q', pgcgroup.read(8))[0]
        idle = struct.unpack('Q', pgidle.read(8))[0]
        if not idle: continue
        if (flags >> 18) & 1: continue # unevictable?
        npages = 512 if (flags >> 22) & 1 else 1 # huge?
        nidle[cgino] = nidle.get(cgino, 0) + npages
    pgflags.close()
    pgcgroup.close()
    pgidle.close()
    return nidle

print "Setting the idle flag for each page..."
set_idle()

raw_input("Wait until the workload accesses its working set, then press Enter")

print "Counting idle pages..."
nidle = count_idle()

for dir, subdirs, files in os.walk(CGROUP_MOUNT):
    ino = os.stat(dir)[stat.ST_INO]
    print dir + ": " + str(nidle.get(ino, 0))
---- END SCRIPT ----

Comments are more than welcome.

Thanks,

Vladimir Davydov (3):
memcg: add page_cgroup_ino helper
proc: add kpagecgroup file
proc: add kpageidle file

Documentation/vm/pagemap.txt | 14 ++-
fs/proc/Kconfig | 5 +-
fs/proc/page.c | 207 ++++++++++++++++++++++++++++++++++++++++++
fs/proc/task_mmu.c | 4 +-
include/linux/memcontrol.h | 8 +-
include/linux/mm.h | 88 ++++++++++++++++++
include/linux/page-flags.h | 9 ++
include/linux/page_ext.h | 4 +
mm/Kconfig | 12 +++
mm/debug.c | 4 +
mm/hwpoison-inject.c | 5 +-
mm/memcontrol.c | 73 +++++++--------
mm/memory-failure.c | 16 +---
mm/page_ext.c | 3 +
mm/rmap.c | 7 ++
mm/swap.c | 2 +
16 files changed, 397 insertions(+), 64 deletions(-)

--
1.7.10.4


2015-04-28 12:25:05

by Vladimir Davydov

Subject: [PATCH v3 1/3] memcg: add page_cgroup_ino helper

Hwpoison allows filtering pages by the memory cgroup inode number. To
achieve that, it calls try_get_mem_cgroup_from_page(), then
mem_cgroup_css(), and finally cgroup_ino() on the cgroup returned. This
looks bulky. Since the next patch also needs the inode number of the
memory cgroup a page is charged to, this patch introduces the
page_cgroup_ino() helper.

Note that page_cgroup_ino() only considers pages that are charged to
mem_cgroup->memory (i.e. page->mem_cgroup != NULL) and returns 0 for
others, while try_get_mem_cgroup_from_page(), used by hwpoison before,
may extract the cgroup from a swapcache readahead page too. Ignoring
swapcache readahead pages makes it possible to call page_cgroup_ino()
on unlocked pages, which is nice. Hwpoison users will hardly see any
difference.

Another difference between try_get_mem_cgroup_from_page() and
page_cgroup_ino() is that the latter works on pages charged to offline
memory cgroups, returning the inode number of the closest online
ancestor in this case, while the former does not, which is crucial for
the next patch.

Since try_get_mem_cgroup_from_page() is not used by anyone else, this
patch removes it. Also, it makes the hwpoison memcg filter depend on
CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP (I've no idea why it was made
dependent on CONFIG_MEMCG_SWAP initially).

Signed-off-by: Vladimir Davydov <[email protected]>
---
include/linux/memcontrol.h | 8 ++---
mm/hwpoison-inject.c | 5 +--
mm/memcontrol.c | 73 ++++++++++++++++++++++----------------------
mm/memory-failure.c | 16 ++--------
4 files changed, 42 insertions(+), 60 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 72dff5fb0d0c..9262a8407af7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -91,7 +91,6 @@ bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
struct mem_cgroup *root);
bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);

-extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
@@ -192,6 +191,8 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
void mem_cgroup_split_huge_fixup(struct page *head);
#endif

+unsigned long page_cgroup_ino(struct page *page);
+
#else /* CONFIG_MEMCG */
struct mem_cgroup;

@@ -252,11 +253,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
return &zone->lruvec;
}

-static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
- return NULL;
-}
-
static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
{
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 329caf56df22..df63c3133d70 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -45,12 +45,9 @@ static int hwpoison_inject(void *data, u64 val)
/*
* do a racy check with elevated page count, to make sure PG_hwpoison
* will only be set for the targeted owner (or on a free page).
- * We temporarily take page lock for try_get_mem_cgroup_from_page().
* memory_failure() will redo the check reliably inside page lock.
*/
- lock_page(hpage);
err = hwpoison_filter(hpage);
- unlock_page(hpage);
if (err)
return 0;

@@ -123,7 +120,7 @@ static int pfn_inject_init(void)
if (!dentry)
goto fail;

-#ifdef CONFIG_MEMCG_SWAP
+#ifdef CONFIG_MEMCG
dentry = debugfs_create_u64("corrupt-filter-memcg", 0600,
hwpoison_dir, &hwpoison_filter_memcg);
if (!dentry)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 14c2f2017e37..87c7f852d45b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2349,40 +2349,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
css_put_many(&memcg->css, nr_pages);
}

-/*
- * try_get_mem_cgroup_from_page - look up page's memcg association
- * @page: the page
- *
- * Look up, get a css reference, and return the memcg that owns @page.
- *
- * The page must be locked to prevent racing with swap-in and page
- * cache charges. If coming from an unlocked page table, the caller
- * must ensure the page is on the LRU or this can race with charging.
- */
-struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
- struct mem_cgroup *memcg;
- unsigned short id;
- swp_entry_t ent;
-
- VM_BUG_ON_PAGE(!PageLocked(page), page);
-
- memcg = page->mem_cgroup;
- if (memcg) {
- if (!css_tryget_online(&memcg->css))
- memcg = NULL;
- } else if (PageSwapCache(page)) {
- ent.val = page_private(page);
- id = lookup_swap_cgroup_id(ent);
- rcu_read_lock();
- memcg = mem_cgroup_from_id(id);
- if (memcg && !css_tryget_online(&memcg->css))
- memcg = NULL;
- rcu_read_unlock();
- }
- return memcg;
-}
-
static void lock_page_lru(struct page *page, int *isolated)
{
struct zone *zone = page_zone(page);
@@ -2774,6 +2740,31 @@ void mem_cgroup_split_huge_fixup(struct page *head)
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

+/**
+ * page_cgroup_ino - return inode number of page's memcg
+ * @page: the page
+ *
+ * Look up the closest online ancestor of the memory cgroup @page is charged to
+ * and return its inode number. It is safe to call this function without taking
+ * a reference to the page.
+ */
+unsigned long page_cgroup_ino(struct page *page)
+{
+ struct mem_cgroup *memcg;
+ unsigned long ino = 0;
+
+ rcu_read_lock();
+ memcg = READ_ONCE(page->mem_cgroup);
+ while (memcg && !css_tryget_online(&memcg->css))
+ memcg = parent_mem_cgroup(memcg);
+ rcu_read_unlock();
+ if (memcg) {
+ ino = cgroup_ino(memcg->css.cgroup);
+ css_put(&memcg->css);
+ }
+ return ino;
+}
+
#ifdef CONFIG_MEMCG_SWAP
static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
bool charge)
@@ -5482,8 +5473,18 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
}

- if (do_swap_account && PageSwapCache(page))
- memcg = try_get_mem_cgroup_from_page(page);
+ if (do_swap_account && PageSwapCache(page)) {
+ swp_entry_t ent = { .val = page_private(page), };
+ unsigned short id = lookup_swap_cgroup_id(ent);
+
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_id(id);
+ if (memcg && !css_tryget_online(&memcg->css))
+ memcg = NULL;
+ rcu_read_unlock();
+ }
if (!memcg)
memcg = get_mem_cgroup_from_mm(mm);

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d9359b770cd9..64cd565fd4f8 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -128,27 +128,15 @@ static int hwpoison_filter_flags(struct page *p)
* can only guarantee that the page either belongs to the memcg tasks, or is
* a freed page.
*/
-#ifdef CONFIG_MEMCG_SWAP
+#ifdef CONFIG_MEMCG
u64 hwpoison_filter_memcg;
EXPORT_SYMBOL_GPL(hwpoison_filter_memcg);
static int hwpoison_filter_task(struct page *p)
{
- struct mem_cgroup *mem;
- struct cgroup_subsys_state *css;
- unsigned long ino;
-
if (!hwpoison_filter_memcg)
return 0;

- mem = try_get_mem_cgroup_from_page(p);
- if (!mem)
- return -EINVAL;
-
- css = mem_cgroup_css(mem);
- ino = cgroup_ino(css->cgroup);
- css_put(css);
-
- if (ino != hwpoison_filter_memcg)
+ if (page_cgroup_ino(p) != hwpoison_filter_memcg)
return -EINVAL;

return 0;
--
1.7.10.4

2015-04-28 12:25:11

by Vladimir Davydov

Subject: [PATCH v3 2/3] proc: add kpagecgroup file

/proc/kpagecgroup contains a 64-bit inode number of the memory cgroup
each page is charged to, indexed by PFN. Having this information is
useful for estimating a cgroup working set size.

The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.

Signed-off-by: Vladimir Davydov <[email protected]>
---
Documentation/vm/pagemap.txt | 6 ++++-
fs/proc/Kconfig | 5 ++--
fs/proc/page.c | 53 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 6bfbc172cdb9..a9b7afc8fbc6 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in /proc.

-There are three components to pagemap:
+There are four components to pagemap:

* /proc/pid/pagemap. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit
@@ -65,6 +65,10 @@ There are three components to pagemap:
23. BALLOON
24. ZERO_PAGE

+ * /proc/kpagecgroup. This file contains a 64-bit inode number of the
+ memory cgroup each page is charged to, indexed by PFN. Only available when
+ CONFIG_MEMCG is set.
+
Short descriptions to the page flags:

0. LOCKED
diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 2183fcf41d59..5021a2935bb9 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -69,5 +69,6 @@ config PROC_PAGE_MONITOR
help
Various /proc files exist to monitor process memory utilization:
/proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap,
- /proc/kpagecount, and /proc/kpageflags. Disabling these
- interfaces will reduce the size of the kernel by approximately 4kb.
+ /proc/kpagecount, /proc/kpageflags, and /proc/kpagecgroup.
+ Disabling these interfaces will reduce the size of the kernel
+ by approximately 4kb.
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7eee2d8b97d9..70d23245dd43 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -9,6 +9,7 @@
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/hugetlb.h>
+#include <linux/memcontrol.h>
#include <linux/kernel-page-flags.h>
#include <asm/uaccess.h>
#include "internal.h"
@@ -225,10 +226,62 @@ static const struct file_operations proc_kpageflags_operations = {
.read = kpageflags_read,
};

+#ifdef CONFIG_MEMCG
+static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ u64 __user *out = (u64 __user *)buf;
+ struct page *ppage;
+ unsigned long src = *ppos;
+ unsigned long pfn;
+ ssize_t ret = 0;
+ u64 ino;
+
+ pfn = src / KPMSIZE;
+ count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
+ if (src & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ while (count > 0) {
+ if (pfn_valid(pfn))
+ ppage = pfn_to_page(pfn);
+ else
+ ppage = NULL;
+
+ if (ppage)
+ ino = page_cgroup_ino(ppage);
+ else
+ ino = 0;
+
+ if (put_user(ino, out)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ pfn++;
+ out++;
+ count -= KPMSIZE;
+ }
+
+ *ppos += (char __user *)out - buf;
+ if (!ret)
+ ret = (char __user *)out - buf;
+ return ret;
+}
+
+static const struct file_operations proc_kpagecgroup_operations = {
+ .llseek = mem_lseek,
+ .read = kpagecgroup_read,
+};
+#endif /* CONFIG_MEMCG */
+
static int __init proc_page_init(void)
{
proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
+#ifdef CONFIG_MEMCG
+ proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
+#endif
return 0;
}
fs_initcall(proc_page_init);
--
1.7.10.4

2015-04-28 12:25:21

by Vladimir Davydov

Subject: [PATCH v3 3/3] proc: add kpageidle file

Knowing the portion of memory that is not used by a certain application
or memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means the kernel provides for estimating the amount
of idle memory is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped by a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced.
However, this method has two serious shortcomings:

- it does not count unmapped file pages
- it affects the reclaimer logic

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new proc file, /proc/kpageidle. A page's Idle flag
can only be set from userspace by writing 1 to /proc/kpageidle at the
offset corresponding to the page, and it is cleared whenever the page is
accessed either through page tables (it is cleared in page_referenced()
in this case) or using the read(2) system call (mark_page_accessed()).
Thus, by setting the Idle flag for the pages of a particular workload,
which can be found e.g. by reading /proc/PID/pagemap, waiting for some
time to let the workload access its working set, and then reading the
kpageidle file, one can estimate the number of pages that are not used
by the workload.

The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to kpageidle. If
page_referenced() is called on a Young page, it will add 1 to its return
value, therefore concealing the fact that the Access bit was cleared.

Note, since there is no room for extra page flags on 32 bit, this
feature uses extended page flags when compiled on 32 bit.

Signed-off-by: Vladimir Davydov <[email protected]>
---
Documentation/vm/pagemap.txt | 10 ++-
fs/proc/page.c | 154 ++++++++++++++++++++++++++++++++++++++++++
fs/proc/task_mmu.c | 4 +-
include/linux/mm.h | 88 ++++++++++++++++++++++++
include/linux/page-flags.h | 9 +++
include/linux/page_ext.h | 4 ++
mm/Kconfig | 12 ++++
mm/debug.c | 4 ++
mm/page_ext.c | 3 +
mm/rmap.c | 7 ++
mm/swap.c | 2 +
11 files changed, 295 insertions(+), 2 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index a9b7afc8fbc6..ac6fd32a9296 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
userspace programs to examine the page tables and related information by
reading files in /proc.

-There are four components to pagemap:
+There are five components to pagemap:

* /proc/pid/pagemap. This file lets a userspace process find out which
physical frame each virtual page is mapped to. It contains one 64-bit
@@ -69,6 +69,14 @@ There are four components to pagemap:
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.

+ * /proc/kpageidle. For each page this file contains a 64-bit number, which
+ equals 1 if the page is idle or 0 otherwise, indexed by PFN. A page is
+ considered idle if it has not been accessed since it was marked idle. To
+ mark a page idle one should write 1 to this file at the offset corresponding
+ to the page. Only user memory pages can be marked idle, for other page types
+ input is silently ignored. Writing to this file beyond max PFN results in
+ the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is set.
+
Short descriptions to the page flags:

0. LOCKED
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 70d23245dd43..cfc55ba7fee6 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -275,6 +275,156 @@ static const struct file_operations proc_kpagecgroup_operations = {
};
#endif /* CONFIG_MEMCG */

+#ifdef CONFIG_IDLE_PAGE_TRACKING
+static struct page *kpageidle_get_page(unsigned long pfn)
+{
+ struct page *page;
+
+ if (!pfn_valid(pfn))
+ return NULL;
+ page = pfn_to_page(pfn);
+ /*
+ * We are only interested in user memory pages, i.e. pages that are
+ * allocated and on an LRU list.
+ */
+ if (!page || page_count(page) == 0 || !PageLRU(page))
+ return NULL;
+ if (!get_page_unless_zero(page))
+ return NULL;
+ if (unlikely(!PageLRU(page))) {
+ put_page(page);
+ return NULL;
+ }
+ return page;
+}
+
+static void kpageidle_clear_refs(struct page *page)
+{
+ unsigned long dummy;
+
+ if (page_referenced(page, 0, NULL, &dummy))
+ /*
+ * This page was referenced. To avoid interference with the
+ * reclaimer, mark it young so that the next call will also
+ * return > 0 (see page_referenced_one)
+ */
+ set_page_young(page);
+}
+
+static ssize_t kpageidle_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ u64 __user *out = (u64 __user *)buf;
+ struct page *page;
+ unsigned long src = *ppos;
+ unsigned long pfn;
+ ssize_t ret = 0;
+ u64 val;
+
+ pfn = src / KPMSIZE;
+ count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
+ if (src & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ while (count > 0) {
+ val = 0;
+ page = kpageidle_get_page(pfn);
+ if (page) {
+ if (page_is_idle(page)) {
+ /*
+ * The page might have been referenced via a
+ * pte, in which case it is not idle. Clear
+ * refs and recheck.
+ */
+ kpageidle_clear_refs(page);
+ if (page_is_idle(page))
+ val = 1;
+ }
+ put_page(page);
+ }
+
+ if (put_user(val, out)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ pfn++;
+ out++;
+ count -= KPMSIZE;
+ }
+
+ *ppos += (char __user *)out - buf;
+ if (!ret)
+ ret = (char __user *)out - buf;
+ return ret;
+}
+
+static ssize_t kpageidle_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ const u64 __user *in = (u64 __user *)buf;
+ struct page *page;
+ unsigned long src = *ppos;
+ unsigned long pfn;
+ ssize_t ret = 0;
+ u64 val;
+
+ pfn = src / KPMSIZE;
+ if (src & KPMMASK || count & KPMMASK)
+ return -EINVAL;
+
+ while (count > 0) {
+ if (pfn >= max_pfn) {
+ if ((char __user *)in == buf)
+ ret = -ENXIO;
+ break;
+ }
+
+ if (get_user(val, in)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ if (val == 1) {
+ page = kpageidle_get_page(pfn);
+ if (page) {
+ kpageidle_clear_refs(page);
+ set_page_idle(page);
+ put_page(page);
+ }
+ } else if (val) {
+ ret = -EINVAL;
+ break;
+ }
+
+ pfn++;
+ in++;
+ count -= KPMSIZE;
+ }
+
+ *ppos += (char __user *)in - buf;
+ if (!ret)
+ ret = (char __user *)in - buf;
+ return ret;
+}
+
+static const struct file_operations proc_kpageidle_operations = {
+ .llseek = mem_lseek,
+ .read = kpageidle_read,
+ .write = kpageidle_write,
+};
+
+#ifndef CONFIG_64BIT
+static bool need_page_idle(void)
+{
+ return true;
+}
+struct page_ext_operations page_idle_ops = {
+ .need = need_page_idle,
+};
+#endif
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
static int __init proc_page_init(void)
{
proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
@@ -282,6 +432,10 @@ static int __init proc_page_init(void)
#ifdef CONFIG_MEMCG
proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
#endif
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+ proc_create("kpageidle", S_IRUSR | S_IWUSR, NULL,
+ &proc_kpageidle_operations);
+#endif
return 0;
}
fs_initcall(proc_page_init);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6dee68d013ff..ab04846f7dd5 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -458,7 +458,7 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,

mss->resident += size;
/* Accumulate the size in pages that have been accessed. */
- if (young || PageReferenced(page))
+ if (young || page_is_young(page) || PageReferenced(page))
mss->referenced += size;
mapcount = page_mapcount(page);
if (mapcount >= 2) {
@@ -808,6 +808,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,

/* Clear accessed and referenced bits. */
pmdp_test_and_clear_young(vma, addr, pmd);
+ clear_page_young(page);
ClearPageReferenced(page);
out:
spin_unlock(ptl);
@@ -835,6 +836,7 @@ out:

/* Clear accessed and referenced bits. */
ptep_test_and_clear_young(vma, addr, pte);
+ clear_page_young(page);
ClearPageReferenced(page);
}
pte_unmap_unlock(pte - 1, ptl);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0755b9fd03a7..794d29aa2317 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2200,5 +2200,93 @@ void __init setup_nr_node_ids(void);
static inline void setup_nr_node_ids(void) {}
#endif

+#ifdef CONFIG_IDLE_PAGE_TRACKING
+#ifdef CONFIG_64BIT
+static inline bool page_is_young(struct page *page)
+{
+ return PageYoung(page);
+}
+
+static inline void set_page_young(struct page *page)
+{
+ SetPageYoung(page);
+}
+
+static inline void clear_page_young(struct page *page)
+{
+ ClearPageYoung(page);
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+ return PageIdle(page);
+}
+
+static inline void set_page_idle(struct page *page)
+{
+ SetPageIdle(page);
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+ ClearPageIdle(page);
+}
+#else /* !CONFIG_64BIT */
+/*
+ * If there is not enough space to store Idle and Young bits in page flags, use
+ * page ext flags instead.
+ */
+extern struct page_ext_operations page_idle_ops;
+
+static inline bool page_is_young(struct page *page)
+{
+ return test_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline void set_page_young(struct page *page)
+{
+ set_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline void clear_page_young(struct page *page)
+{
+ clear_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+ return test_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+
+static inline void set_page_idle(struct page *page)
+{
+ set_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+ clear_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
+}
+#endif /* CONFIG_64BIT */
+#else /* !CONFIG_IDLE_PAGE_TRACKING */
+static inline bool page_is_young(struct page *page)
+{
+ return false;
+}
+
+static inline void clear_page_young(struct page *page)
+{
+}
+
+static inline bool page_is_idle(struct page *page)
+{
+ return false;
+}
+
+static inline void clear_page_idle(struct page *page)
+{
+}
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f34e040b34e9..5e7c4f50a644 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,10 @@ enum pageflags {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
PG_compound_lock,
#endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+ PG_young,
+ PG_idle,
+#endif
__NR_PAGEFLAGS,

/* Filesystems */
@@ -289,6 +293,11 @@ PAGEFLAG_FALSE(HWPoison)
#define __PG_HWPOISON 0
#endif

+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+PAGEFLAG(Young, young)
+PAGEFLAG(Idle, idle)
+#endif
+
/*
* On an anonymous page mapped into a user virtual memory area,
* page->mapping points to its anon_vma, not to a struct address_space;
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index c42981cd99aa..17f118a82854 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -26,6 +26,10 @@ enum page_ext_flags {
PAGE_EXT_DEBUG_POISON, /* Page is poisoned */
PAGE_EXT_DEBUG_GUARD,
PAGE_EXT_OWNER,
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+ PAGE_EXT_YOUNG,
+ PAGE_EXT_IDLE,
+#endif
};

/*
diff --git a/mm/Kconfig b/mm/Kconfig
index 390214da4546..3600eace4774 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -635,3 +635,15 @@ config MAX_STACK_SIZE_MB
changed to a smaller value in which case that is used.

A sane initial value is 80 MB.
+
+config IDLE_PAGE_TRACKING
+ bool "Enable idle page tracking"
+ select PROC_PAGE_MONITOR
+ select PAGE_EXTENSION if !64BIT
+ help
+ This feature allows to estimate the amount of user pages that have
+ not been touched during a given period of time. This information can
+ be useful to tune memory cgroup limits and/or for job placement
+ within a compute cluster.
+
+ See Documentation/vm/pagemap.txt for more details.
diff --git a/mm/debug.c b/mm/debug.c
index 3eb3ac2fcee7..bb66f9ccec03 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -48,6 +48,10 @@ static const struct trace_print_flags pageflag_names[] = {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
{1UL << PG_compound_lock, "compound_lock" },
#endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+ {1UL << PG_young, "young" },
+ {1UL << PG_idle, "idle" },
+#endif
};

static void dump_flags(unsigned long flags,
diff --git a/mm/page_ext.c b/mm/page_ext.c
index d86fd2f5353f..e4b3af054bf2 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -59,6 +59,9 @@ static struct page_ext_operations *page_ext_ops[] = {
#ifdef CONFIG_PAGE_OWNER
&page_owner_ops,
#endif
+#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+ &page_idle_ops,
+#endif
};

static unsigned long total_usage;
diff --git a/mm/rmap.c b/mm/rmap.c
index 24dd3f9fee27..12e73b758d9e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -784,6 +784,13 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
if (referenced) {
pra->referenced++;
pra->vm_flags |= vma->vm_flags;
+ if (page_is_idle(page))
+ clear_page_idle(page);
+ }
+
+ if (page_is_young(page)) {
+ clear_page_young(page);
+ pra->referenced++;
}

pra->mapcount--;
diff --git a/mm/swap.c b/mm/swap.c
index a7251a8ed532..6bf6f293a9ea 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -623,6 +623,8 @@ void mark_page_accessed(struct page *page)
} else if (!PageReferenced(page)) {
SetPageReferenced(page);
}
+ if (page_is_idle(page))
+ clear_page_idle(page);
}
EXPORT_SYMBOL(mark_page_accessed);

--
1.7.10.4

2015-04-29 03:58:13

by Minchan Kim

Subject: Re: [PATCH v3 0/3] idle memory tracking

Hello Vladimir,

On Tue, Apr 28, 2015 at 03:24:39PM +0300, Vladimir Davydov wrote:
> Hi,
>
> This patch set introduces a new user API for tracking user memory pages
> that have not been used for a given period of time. The purpose of this
> is to provide the userspace with the means of tracking a workload's
> working set, i.e. the set of pages that are actively used by the
> workload. Knowing the working set size can be useful for partitioning
> the system more efficiently, e.g. by tuning memory cgroup limits
> appropriately, or for job placement within a compute cluster.
>
> ---- USE CASES ----
>
> The unified cgroup hierarchy has memory.low and memory.high knobs, which
> are defined as the low and high boundaries for the workload working set
> size. However, the working set size of a workload may be unknown or
> change in time. With this patch set, one can periodically estimate the
> amount of memory unused by each cgroup and tune their memory.low and
> memory.high parameters accordingly, therefore optimizing the overall
> memory utilization.
>
> Another use case is balancing workloads within a compute cluster.
> Knowing how much memory is not really used by a workload unit may help
> take a more optimal decision when considering migrating the unit to
> another node within the cluster.
>
> ---- USER API ----
>
> The user API consists of two new proc files:
>
> * /proc/kpageidle. For each page this file contains a 64-bit number, which
> equals 1 if the page is idle or 0 otherwise, indexed by PFN. A page is

Why do we need 64 bits per page to indicate just idle or not?
Do you imagine some future use that would make the other 63 bits worthwhile?

--
Kind regards,
Minchan Kim

2015-04-29 04:35:56

by Minchan Kim

Subject: Re: [PATCH v3 3/3] proc: add kpageidle file

On Tue, Apr 28, 2015 at 03:24:42PM +0300, Vladimir Davydov wrote:
> Knowing the portion of memory that is not used by a certain application
> or memory cgroup (idle memory) can be useful for partitioning the system
> efficiently, e.g. by setting memory cgroup limits appropriately.
> Currently, the only means to estimate the amount of idle memory provided
> by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
> access bit for all pages mapped to a particular process by writing 1 to
> clear_refs, wait for some time, and then count smaps:Referenced.
> However, this method has two serious shortcomings:
>
> - it does not count unmapped file pages
> - it affects the reclaimer logic
>
> To overcome these drawbacks, this patch introduces two new page flags,
> Idle and Young, and a new proc file, /proc/kpageidle. A page's Idle flag
> can only be set from userspace by writing 1 to /proc/kpageidle at the
> offset corresponding to the page, and it is cleared whenever the page is
> accessed either through page tables (it is cleared in page_referenced()
> in this case) or using the read(2) system call (mark_page_accessed()).
> Thus by setting the Idle flag for pages of a particular workload, which
> can be found e.g. by reading /proc/PID/pagemap, waiting for some time to
> let the workload access its working set, and then reading the kpageidle
> file, one can estimate the amount of pages that are not used by the
> workload.
>
> The Young page flag is used to avoid interference with the memory
> reclaimer. A page's Young flag is set whenever the Access bit of a page
> table entry pointing to the page is cleared by writing to kpageidle. If
> page_referenced() is called on a Young page, it will add 1 to its return
> value, therefore concealing the fact that the Access bit was cleared.
>
> Note, since there is no room for extra page flags on 32 bit, this
> feature uses extended page flags when compiled on 32 bit.

Thanks for considering 32bit.

Anyway, I believe it's a good feature, but I'm not sure it's worth consuming
two page flag bits; that said, I have no idea how to save the page flags.

>
> Signed-off-by: Vladimir Davydov <[email protected]>
> ---
> Documentation/vm/pagemap.txt | 10 ++-
> fs/proc/page.c | 154 ++++++++++++++++++++++++++++++++++++++++++
> fs/proc/task_mmu.c | 4 +-
> include/linux/mm.h | 88 ++++++++++++++++++++++++
> include/linux/page-flags.h | 9 +++
> include/linux/page_ext.h | 4 ++
> mm/Kconfig | 12 ++++
> mm/debug.c | 4 ++
> mm/page_ext.c | 3 +
> mm/rmap.c | 7 ++
> mm/swap.c | 2 +
> 11 files changed, 295 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
> index a9b7afc8fbc6..ac6fd32a9296 100644
> --- a/Documentation/vm/pagemap.txt
> +++ b/Documentation/vm/pagemap.txt
> @@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
> userspace programs to examine the page tables and related information by
> reading files in /proc.
>
> -There are four components to pagemap:
> +There are five components to pagemap:
>
> * /proc/pid/pagemap. This file lets a userspace process find out which
> physical frame each virtual page is mapped to. It contains one 64-bit
> @@ -69,6 +69,14 @@ There are four components to pagemap:
> memory cgroup each page is charged to, indexed by PFN. Only available when
> CONFIG_MEMCG is set.
>
> + * /proc/kpageidle. For each page this file contains a 64-bit number, which
> + equals 1 if the page is idle or 0 otherwise, indexed by PFN. A page is

As I replied to the cover letter, we need justification for consuming 64 bits per page.

> + considered idle if it has not been accessed since it was marked idle. To
> + mark a page idle one should write 1 to this file at the offset corresponding
> + to the page. Only user memory pages can be marked idle, for other page types
> + input is silently ignored. Writing to this file beyond max PFN results in
> + the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is set.
> +
> Short descriptions to the page flags:
>
> 0. LOCKED
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index 70d23245dd43..cfc55ba7fee6 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -275,6 +275,156 @@ static const struct file_operations proc_kpagecgroup_operations = {
> };
> #endif /* CONFIG_MEMCG */
>
> +#ifdef CONFIG_IDLE_PAGE_TRACKING
> +static struct page *kpageidle_get_page(unsigned long pfn)
> +{
> + struct page *page;
> +
> + if (!pfn_valid(pfn))
> + return NULL;
> + page = pfn_to_page(pfn);
> + /*
> + * We are only interested in user memory pages, i.e. pages that are
> + * allocated and on an LRU list.
> + */
> + if (!page || page_count(page) == 0 || !PageLRU(page))

Why do you check (page_count == 0) when we already check it with
get_page_unless_zero below?

> + return NULL;
> + if (!get_page_unless_zero(page))
> + return NULL;
> + if (unlikely(!PageLRU(page))) {

What lock protects the PageLRU check?
What happens if it races with ClearPageLRU?

> + put_page(page);
> + return NULL;
> + }
> + return page;
> +}
> +
> +static void kpageidle_clear_refs(struct page *page)
> +{
> + unsigned long dummy;
> +
> + if (page_referenced(page, 0, NULL, &dummy))
> + /*
> + * This page was referenced. To avoid interference with the
> + * reclaimer, mark it young so that the next call will also

The next call to what?

Also, it only works with mapped pages, so kpageidle_clear_pte_refs would be
a clearer function name.

One more thing: kpageidle_clear_refs clears PG_idle via page_referenced,
which is an important property of the function. Please document it so we can
understand why PG_idle must be rechecked after calling kpageidle_clear_refs
to clear the pte access bits.

> + * return > 0 (see page_referenced_one)
> + */
> + set_page_young(page);
> +}
> +
> +static ssize_t kpageidle_read(struct file *file, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + u64 __user *out = (u64 __user *)buf;
> + struct page *page;
> + unsigned long src = *ppos;
> + unsigned long pfn;
> + ssize_t ret = 0;
> + u64 val;
> +
> + pfn = src / KPMSIZE;
> + count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
> + if (src & KPMMASK || count & KPMMASK)
> + return -EINVAL;
> +
> + while (count > 0) {
> + val = 0;
> + page = kpageidle_get_page(pfn);
> + if (page) {
> + if (page_is_idle(page)) {
> + /*
> + * The page might have been referenced via a
> + * pte, in which case it is not idle. Clear
> + * refs and recheck.
> + */
> + kpageidle_clear_refs(page);
> + if (page_is_idle(page))
> + val = 1;
> + }
> + put_page(page);
> + }
> +
> + if (put_user(val, out)) {
> + ret = -EFAULT;
> + break;
> + }
> +
> + pfn++;
> + out++;
> + count -= KPMSIZE;
> + }
> +
> + *ppos += (char __user *)out - buf;
> + if (!ret)
> + ret = (char __user *)out - buf;
> + return ret;
> +}
> +
> +static ssize_t kpageidle_write(struct file *file, const char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + const u64 __user *in = (u64 __user *)buf;
> + struct page *page;
> + unsigned long src = *ppos;
> + unsigned long pfn;
> + ssize_t ret = 0;
> + u64 val;
> +
> + pfn = src / KPMSIZE;
> + if (src & KPMMASK || count & KPMMASK)
> + return -EINVAL;
> +
> + while (count > 0) {
> + if (pfn >= max_pfn) {
> + if ((char __user *)in == buf)
> + ret = -ENXIO;
> + break;
> + }
> +
> + if (get_user(val, in)) {
> + ret = -EFAULT;
> + break;
> + }
> +
> + if (val == 1) {
> + page = kpageidle_get_page(pfn);
> + if (page) {
> + kpageidle_clear_refs(page);
> + set_page_idle(page);
> + put_page(page);
> + }
> + } else if (val) {
> + ret = -EINVAL;
> + break;
> + }
> +
> + pfn++;
> + in++;
> + count -= KPMSIZE;
> + }
> +
> + *ppos += (char __user *)in - buf;
> + if (!ret)
> + ret = (char __user *)in - buf;
> + return ret;
> +}
> +
> +static const struct file_operations proc_kpageidle_operations = {
> + .llseek = mem_lseek,
> + .read = kpageidle_read,
> + .write = kpageidle_write,
> +};
> +
> +#ifndef CONFIG_64BIT
> +static bool need_page_idle(void)
> +{
> + return true;
> +}
> +struct page_ext_operations page_idle_ops = {
> + .need = need_page_idle,
> +};
> +#endif
> +#endif /* CONFIG_IDLE_PAGE_TRACKING */
> +
> static int __init proc_page_init(void)
> {
> proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
> @@ -282,6 +432,10 @@ static int __init proc_page_init(void)
> #ifdef CONFIG_MEMCG
> proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
> #endif
> +#ifdef CONFIG_IDLE_PAGE_TRACKING
> + proc_create("kpageidle", S_IRUSR | S_IWUSR, NULL,
> + &proc_kpageidle_operations);
> +#endif
> return 0;
> }
> fs_initcall(proc_page_init);
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 6dee68d013ff..ab04846f7dd5 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -458,7 +458,7 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
>
> mss->resident += size;
> /* Accumulate the size in pages that have been accessed. */
> - if (young || PageReferenced(page))
> + if (young || page_is_young(page) || PageReferenced(page))
> mss->referenced += size;
> mapcount = page_mapcount(page);
> if (mapcount >= 2) {
> @@ -808,6 +808,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>
> /* Clear accessed and referenced bits. */
> pmdp_test_and_clear_young(vma, addr, pmd);
> + clear_page_young(page);
> ClearPageReferenced(page);
> out:
> spin_unlock(ptl);
> @@ -835,6 +836,7 @@ out:
>
> /* Clear accessed and referenced bits. */
> ptep_test_and_clear_young(vma, addr, pte);
> + clear_page_young(page);
> ClearPageReferenced(page);
> }
> pte_unmap_unlock(pte - 1, ptl);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0755b9fd03a7..794d29aa2317 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2200,5 +2200,93 @@ void __init setup_nr_node_ids(void);
> static inline void setup_nr_node_ids(void) {}
> #endif
>
> +#ifdef CONFIG_IDLE_PAGE_TRACKING
> +#ifdef CONFIG_64BIT
> +static inline bool page_is_young(struct page *page)
> +{
> + return PageYoung(page);
> +}
> +
> +static inline void set_page_young(struct page *page)
> +{
> + SetPageYoung(page);
> +}
> +
> +static inline void clear_page_young(struct page *page)
> +{
> + ClearPageYoung(page);
> +}
> +
> +static inline bool page_is_idle(struct page *page)
> +{
> + return PageIdle(page);
> +}
> +
> +static inline void set_page_idle(struct page *page)
> +{
> + SetPageIdle(page);
> +}
> +
> +static inline void clear_page_idle(struct page *page)
> +{
> + ClearPageIdle(page);
> +}
> +#else /* !CONFIG_64BIT */
> +/*
> + * If there is not enough space to store Idle and Young bits in page flags, use
> + * page ext flags instead.
> + */
> +extern struct page_ext_operations page_idle_ops;
> +
> +static inline bool page_is_young(struct page *page)
> +{
> + return test_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
> +}
> +
> +static inline void set_page_young(struct page *page)
> +{
> + set_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
> +}
> +
> +static inline void clear_page_young(struct page *page)
> +{
> + clear_bit(PAGE_EXT_YOUNG, &lookup_page_ext(page)->flags);
> +}
> +
> +static inline bool page_is_idle(struct page *page)
> +{
> + return test_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
> +}
> +
> +static inline void set_page_idle(struct page *page)
> +{
> + set_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
> +}
> +
> +static inline void clear_page_idle(struct page *page)
> +{
> + clear_bit(PAGE_EXT_IDLE, &lookup_page_ext(page)->flags);
> +}
> +#endif /* CONFIG_64BIT */
> +#else /* !CONFIG_IDLE_PAGE_TRACKING */
> +static inline bool page_is_young(struct page *page)
> +{
> + return false;
> +}
> +
> +static inline void clear_page_young(struct page *page)
> +{
> +}
> +
> +static inline bool page_is_idle(struct page *page)
> +{
> + return false;
> +}
> +
> +static inline void clear_page_idle(struct page *page)
> +{
> +}
> +#endif /* CONFIG_IDLE_PAGE_TRACKING */
> +
> #endif /* __KERNEL__ */
> #endif /* _LINUX_MM_H */
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index f34e040b34e9..5e7c4f50a644 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -109,6 +109,10 @@ enum pageflags {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> PG_compound_lock,
> #endif
> +#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
> + PG_young,
> + PG_idle,
> +#endif
> __NR_PAGEFLAGS,
>
> /* Filesystems */
> @@ -289,6 +293,11 @@ PAGEFLAG_FALSE(HWPoison)
> #define __PG_HWPOISON 0
> #endif
>
> +#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
> +PAGEFLAG(Young, young)
> +PAGEFLAG(Idle, idle)
> +#endif
> +
> /*
> * On an anonymous page mapped into a user virtual memory area,
> * page->mapping points to its anon_vma, not to a struct address_space;
> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> index c42981cd99aa..17f118a82854 100644
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -26,6 +26,10 @@ enum page_ext_flags {
> PAGE_EXT_DEBUG_POISON, /* Page is poisoned */
> PAGE_EXT_DEBUG_GUARD,
> PAGE_EXT_OWNER,
> +#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
> + PAGE_EXT_YOUNG,
> + PAGE_EXT_IDLE,
> +#endif
> };
>
> /*
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 390214da4546..3600eace4774 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -635,3 +635,15 @@ config MAX_STACK_SIZE_MB
> changed to a smaller value in which case that is used.
>
> A sane initial value is 80 MB.
> +
> +config IDLE_PAGE_TRACKING
> + bool "Enable idle page tracking"
> + select PROC_PAGE_MONITOR
> + select PAGE_EXTENSION if !64BIT
> + help
> + This feature allows to estimate the amount of user pages that have
> + not been touched during a given period of time. This information can
> + be useful to tune memory cgroup limits and/or for job placement
> + within a compute cluster.
> +
> + See Documentation/vm/pagemap.txt for more details.
> diff --git a/mm/debug.c b/mm/debug.c
> index 3eb3ac2fcee7..bb66f9ccec03 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -48,6 +48,10 @@ static const struct trace_print_flags pageflag_names[] = {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> {1UL << PG_compound_lock, "compound_lock" },
> #endif
> +#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
> + {1UL << PG_young, "young" },
> + {1UL << PG_idle, "idle" },
> +#endif
> };
>
> static void dump_flags(unsigned long flags,
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index d86fd2f5353f..e4b3af054bf2 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -59,6 +59,9 @@ static struct page_ext_operations *page_ext_ops[] = {
> #ifdef CONFIG_PAGE_OWNER
> &page_owner_ops,
> #endif
> +#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
> + &page_idle_ops,
> +#endif
> };
>
> static unsigned long total_usage;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 24dd3f9fee27..12e73b758d9e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -784,6 +784,13 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
> if (referenced) {
> pra->referenced++;
> pra->vm_flags |= vma->vm_flags;
> + if (page_is_idle(page))
> + clear_page_idle(page);
> + }
> +
> + if (page_is_young(page)) {
> + clear_page_young(page);
> + pra->referenced++;

If a page is young and was also referenced recently,
pra->referenced is incremented twice, which changes the current
behavior for file-backed page promotion. Look at page_check_references.

> }
>
> pra->mapcount--;
> diff --git a/mm/swap.c b/mm/swap.c
> index a7251a8ed532..6bf6f293a9ea 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -623,6 +623,8 @@ void mark_page_accessed(struct page *page)
> } else if (!PageReferenced(page)) {
> SetPageReferenced(page);
> }
> + if (page_is_idle(page))
> + clear_page_idle(page);
> }
> EXPORT_SYMBOL(mark_page_accessed);
>
> --
> 1.7.10.4
>

--
Kind regards,
Minchan Kim

2015-04-29 04:58:15

by Minchan Kim

Subject: Re: [PATCH v3 3/3] proc: add kpageidle file

On Tue, Apr 28, 2015 at 03:24:42PM +0300, Vladimir Davydov wrote:
> Knowing the portion of memory that is not used by a certain application
> or memory cgroup (idle memory) can be useful for partitioning the system
> efficiently, e.g. by setting memory cgroup limits appropriately.
> Currently, the only means to estimate the amount of idle memory provided
> by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
> access bit for all pages mapped to a particular process by writing 1 to
> clear_refs, wait for some time, and then count smaps:Referenced.
> However, this method has two serious shortcomings:
>
> - it does not count unmapped file pages
> - it affects the reclaimer logic
>
> To overcome these drawbacks, this patch introduces two new page flags,
> Idle and Young, and a new proc file, /proc/kpageidle. A page's Idle flag
> can only be set from userspace by writing 1 to /proc/kpageidle at the
> offset corresponding to the page, and it is cleared whenever the page is
> accessed either through page tables (it is cleared in page_referenced()
> in this case) or using the read(2) system call (mark_page_accessed()).
> Thus by setting the Idle flag for pages of a particular workload, which
> can be found e.g. by reading /proc/PID/pagemap, waiting for some time to
> let the workload access its working set, and then reading the kpageidle
> file, one can estimate the amount of pages that are not used by the
> workload.
>
> The Young page flag is used to avoid interference with the memory
> reclaimer. A page's Young flag is set whenever the Access bit of a page
> table entry pointing to the page is cleared by writing to kpageidle. If
> page_referenced() is called on a Young page, it will add 1 to its return
> value, therefore concealing the fact that the Access bit was cleared.
>
> Note, since there is no room for extra page flags on 32 bit, this
> feature uses extended page flags when compiled on 32 bit.
>
> Signed-off-by: Vladimir Davydov <[email protected]>
> ---
> Documentation/vm/pagemap.txt | 10 ++-
> fs/proc/page.c | 154 ++++++++++++++++++++++++++++++++++++++++++
> fs/proc/task_mmu.c | 4 +-
> include/linux/mm.h | 88 ++++++++++++++++++++++++
> include/linux/page-flags.h | 9 +++
> include/linux/page_ext.h | 4 ++
> mm/Kconfig | 12 ++++
> mm/debug.c | 4 ++
> mm/page_ext.c | 3 +
> mm/rmap.c | 7 ++
> mm/swap.c | 2 +
> 11 files changed, 295 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
> index a9b7afc8fbc6..ac6fd32a9296 100644
> --- a/Documentation/vm/pagemap.txt
> +++ b/Documentation/vm/pagemap.txt
> @@ -5,7 +5,7 @@ pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
> userspace programs to examine the page tables and related information by
> reading files in /proc.
>
> -There are four components to pagemap:
> +There are five components to pagemap:
>
> * /proc/pid/pagemap. This file lets a userspace process find out which
> physical frame each virtual page is mapped to. It contains one 64-bit
> @@ -69,6 +69,14 @@ There are four components to pagemap:
> memory cgroup each page is charged to, indexed by PFN. Only available when
> CONFIG_MEMCG is set.
>
> + * /proc/kpageidle. For each page this file contains a 64-bit number, which
> + equals 1 if the page is idle or 0 otherwise, indexed by PFN. A page is
> + considered idle if it has not been accessed since it was marked idle. To
> + mark a page idle one should write 1 to this file at the offset corresponding
> + to the page. Only user memory pages can be marked idle, for other page types
> + input is silently ignored. Writing to this file beyond max PFN results in
> + the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is set.
> +

How about using kpageflags for the reading part?

I mean, PG_idle is one of the page flags, and we already have a facility for
parsing the flags of each PFN, so we could reuse the existing feature for
reading idleness.

--
Kind regards,
Minchan Kim

2015-04-29 05:03:01

by Minchan Kim

Subject: Re: [PATCH v3 0/3] idle memory tracking

On Tue, Apr 28, 2015 at 03:24:39PM +0300, Vladimir Davydov wrote:
> Hi,
>
> This patch set introduces a new user API for tracking user memory pages
> that have not been used for a given period of time. The purpose of this
> is to provide the userspace with the means of tracking a workload's
> working set, i.e. the set of pages that are actively used by the
> workload. Knowing the working set size can be useful for partitioning
> the system more efficiently, e.g. by tuning memory cgroup limits
> appropriately, or for job placement within a compute cluster.
>
> ---- USE CASES ----
>
> The unified cgroup hierarchy has memory.low and memory.high knobs, which
> are defined as the low and high boundaries for the workload working set
> size. However, the working set size of a workload may be unknown or
> change in time. With this patch set, one can periodically estimate the
> amount of memory unused by each cgroup and tune their memory.low and
> memory.high parameters accordingly, therefore optimizing the overall
> memory utilization.
>
> Another use case is balancing workloads within a compute cluster.
> Knowing how much memory is not really used by a workload unit may help
> take a more optimal decision when considering migrating the unit to
> another node within the cluster.

Another usecase I have a interest is working with per-process reclaim.
https://lwn.net/Articles/545668/
With idle tracking, we could reclaim idle pages only by smart user
memory-manager.

Thanks.

--
Kind regards,
Minchan Kim

2015-04-29 07:59:10

by Vladimir Davydov

Subject: Re: [PATCH v3 0/3] idle memory tracking

Hi Minchan,

Thank you for taking a look at this patch set.

On Wed, Apr 29, 2015 at 12:57:22PM +0900, Minchan Kim wrote:
> On Tue, Apr 28, 2015 at 03:24:39PM +0300, Vladimir Davydov wrote:
> > * /proc/kpageidle. For each page this file contains a 64-bit number, which
> > equals 1 if the page is idle or 0 otherwise, indexed by PFN. A page is
>
> Why do we need 64bit per page to indicate just idle or not?

I don't think we need this, actually. I made this file 64 bit per page
only to conform to the other /proc/kpage* files. Currently, I can't think
of any potential use for the remaining 63 bits, so personally I'm fine with
1 bit per page. If nobody comes up with an idea for how more than 1 bit
could be used, I'll switch /proc/kpageidle to a bitmask in the next
iteration.
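
For reference, the addressing math of the two encodings can be sketched in
userspace C. This is only an illustration: the 64-bit-per-page layout is the
one in this patch set, while the bitmap layout is an assumption about what a
1-bit-per-page interface might look like.

```c
#include <assert.h>
#include <stdint.h>

/* Current format: one u64 per page, so the file offset for a PFN
 * is simply pfn * 8. */
static uint64_t kpageidle_u64_offset(uint64_t pfn)
{
	return pfn * sizeof(uint64_t);
}

/* Hypothetical 1-bit-per-page format: pack 64 PFNs into each u64.
 * The word holding a PFN sits at offset (pfn / 64) * 8 ... */
static uint64_t kpageidle_bitmap_offset(uint64_t pfn)
{
	return (pfn / 64) * sizeof(uint64_t);
}

/* ... and the page's bit within that word is pfn % 64. */
static int kpageidle_bitmap_test(uint64_t word, uint64_t pfn)
{
	return (word >> (pfn % 64)) & 1;
}
```

With the current format, checking PFN 0x1000 means a pread() of 8 bytes at
offset 0x8000 of /proc/kpageidle; the bitmap format would shrink the file by
a factor of 64.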

Thanks,
Vladimir

2015-04-29 08:32:08

by Vladimir Davydov

Subject: Re: [PATCH v3 3/3] proc: add kpageidle file

On Wed, Apr 29, 2015 at 01:57:59PM +0900, Minchan Kim wrote:
> On Tue, Apr 28, 2015 at 03:24:42PM +0300, Vladimir Davydov wrote:
> > @@ -69,6 +69,14 @@ There are four components to pagemap:
> > memory cgroup each page is charged to, indexed by PFN. Only available when
> > CONFIG_MEMCG is set.
> >
> > + * /proc/kpageidle. For each page this file contains a 64-bit number, which
> > + equals 1 if the page is idle or 0 otherwise, indexed by PFN. A page is
> > + considered idle if it has not been accessed since it was marked idle. To
> > + mark a page idle one should write 1 to this file at the offset corresponding
> > + to the page. Only user memory pages can be marked idle, for other page types
> > + input is silently ignored. Writing to this file beyond max PFN results in
> > + the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is set.
> > +
>
> How about using kpageflags for reading part?
>
> I mean PG_idle is one of the page flags and we already have a feature to
> parse of each PFN flag so we could reuse existing feature for reading
> idleness.

Reading PG_idle implies clearing all pte references to make sure the
page was not referenced via a pte. This means that exporting it via
/proc/kpageflags would increase the cost of reading this file, even for
users that don't care about PG_idle. I'm not sure all users of
/proc/kpageflags will be fine with it.

Thanks,
Vladimir

2015-04-29 09:13:05

by Vladimir Davydov

Subject: Re: [PATCH v3 3/3] proc: add kpageidle file

On Wed, Apr 29, 2015 at 01:35:36PM +0900, Minchan Kim wrote:
> On Tue, Apr 28, 2015 at 03:24:42PM +0300, Vladimir Davydov wrote:
> > diff --git a/fs/proc/page.c b/fs/proc/page.c
> > index 70d23245dd43..cfc55ba7fee6 100644
> > --- a/fs/proc/page.c
> > +++ b/fs/proc/page.c
> > @@ -275,6 +275,156 @@ static const struct file_operations proc_kpagecgroup_operations = {
> > };
> > #endif /* CONFIG_MEMCG */
> >
> > +#ifdef CONFIG_IDLE_PAGE_TRACKING
> > +static struct page *kpageidle_get_page(unsigned long pfn)
> > +{
> > + struct page *page;
> > +
> > + if (!pfn_valid(pfn))
> > + return NULL;
> > + page = pfn_to_page(pfn);
> > + /*
> > + * We are only interested in user memory pages, i.e. pages that are
> > + * allocated and on an LRU list.
> > + */
> > + if (!page || page_count(page) == 0 || !PageLRU(page))
>
> Why do you check (page_count == 0) even if we check it with get_page_unless_zero
> below?

I intended to avoid the overhead of cmpxchg when page_count is 0, but
diving deeper into get_page_unless_zero, I see that it already handles
that scenario, so this check is useless. I'll remove it.

>
> > + return NULL;
> > + if (!get_page_unless_zero(page))
> > + return NULL;
> > + if (unlikely(!PageLRU(page))) {
>
> What lock protect the check PageLRU?
> If it is racing ClearPageLRU, what happens?

If we hold a reference to a page and see that it's on an LRU list, it
will surely remain a user memory page at least until we release the
reference to it, so it must be safe to play with idle/young flags. If we
race with isolate_lru_page, or any similar function temporarily clearing
PG_lru, we will silently skip the page w/o touching its idle/young
flags. We could consider isolated pages too, but that would increase the
cost of this function.

If you find this explanation OK, I'll add it to the comment to this
function.
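
The speculative reference pattern described above can be modeled in
userspace C with C11 atomics. This is a sketch, not the kernel code: the
names are made up, a plain bool stands in for PG_lru, and the kernel's
get_page_unless_zero() is mimicked by a cmpxchg loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Model of a refcounted page: live while count > 0. */
struct ref {
	atomic_int count;
	bool on_lru;		/* stands in for PG_lru */
};

/* Take a reference only if the object is known to be live, i.e. the
 * count is already non-zero (what get_page_unless_zero() does). */
static bool get_ref_unless_zero(struct ref *r)
{
	int old = atomic_load(&r->count);

	while (old != 0) {
		/* On failure, 'old' is reloaded and the loop retries. */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;
}

/* The kpageidle_get_page() pattern: take the speculative reference
 * first, then recheck the LRU state.  A page that lost its LRU status
 * in the meantime (e.g. racing with isolate_lru_page()) is silently
 * skipped after dropping the reference again. */
static bool try_pin_lru(struct ref *r)
{
	if (!r->on_lru || !get_ref_unless_zero(r))
		return false;
	if (!r->on_lru) {			/* recheck under the reference */
		atomic_fetch_sub(&r->count, 1);	/* put_page() */
		return false;
	}
	return true;
}
```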

>
> > + put_page(page);
> > + return NULL;
> > + }
> > + return page;
> > +}
> > +
> > +static void kpageidle_clear_refs(struct page *page)
> > +{
> > + unsigned long dummy;
> > +
> > + if (page_referenced(page, 0, NULL, &dummy))
> > + /*
> > + * This page was referenced. To avoid interference with the
> > + * reclaimer, mark it young so that the next call will also
>
> next what call?
>
> It just works with mapped page so kpageidle_clear_pte_refs as function name
> is more clear.
>
> One more, kpageidle_clear_refs removes PG_idle via page_referenced which
> is important feature for the function. Please document it so we could
> understand why we need double check for PG_idle after calling
> kpageidle_clear_refs for pte access bit.

Sounds reasonable, will do.

> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 24dd3f9fee27..12e73b758d9e 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -784,6 +784,13 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
> > if (referenced) {
> > pra->referenced++;
> > pra->vm_flags |= vma->vm_flags;
> > + if (page_is_idle(page))
> > + clear_page_idle(page);
> > + }
> > +
> > + if (page_is_young(page)) {
> > + clear_page_young(page);
> > + pra->referenced++;
>
> If a page was page_is_young and referenced recenlty,
> pra->referenced is increased doubly and it changes current
> behavior for file-backed page promotion. Look at page_check_references.

Yeah, you're quite right, I missed that. Something like this should get
rid of this extra reference:

diff --git a/mm/rmap.c b/mm/rmap.c
index 24dd3f9fee27..eca7416f55d7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -781,6 +781,14 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
pte_unmap_unlock(pte, ptl);
}

+ if (referenced && page_is_idle(page))
+ clear_page_idle(page);
+
+ if (page_is_young(page)) {
+ clear_page_young(page);
+ referenced++;
+ }
+
if (referenced) {
pra->referenced++;
pra->vm_flags |= vma->vm_flags;
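
The counting difference can be modeled in a few lines of userspace C. This
is only a sketch: the function names are hypothetical, and in the kernel
pra->referenced is accumulated across the whole rmap walk rather than per
call.

```c
#include <assert.h>

/* Original hunk: the Young flag bumps pra->referenced in its own
 * branch, so a page that is both pte-referenced and young counts
 * twice for a single vma. */
static int pra_refs_original(int pte_young, int page_young)
{
	int refs = 0;

	if (pte_young)
		refs++;		/* referenced via the pte */
	if (page_young)
		refs++;		/* Young flag counted separately */
	return refs;
}

/* Reordered version: fold the Young flag into the local 'referenced'
 * value before the final test, so pra->referenced is incremented at
 * most once per vma. */
static int pra_refs_fixed(int pte_young, int page_young)
{
	int referenced = pte_young;	/* result of the pte check */

	if (page_young)
		referenced++;
	return referenced ? 1 : 0;	/* single pra->referenced++ */
}
```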

Thanks,
Vladimir

2015-04-30 06:55:45

by Minchan Kim

Subject: Re: [PATCH v3 3/3] proc: add kpageidle file

Hi,

On Wed, Apr 29, 2015 at 11:31:49AM +0300, Vladimir Davydov wrote:
> On Wed, Apr 29, 2015 at 01:57:59PM +0900, Minchan Kim wrote:
> > On Tue, Apr 28, 2015 at 03:24:42PM +0300, Vladimir Davydov wrote:
> > > @@ -69,6 +69,14 @@ There are four components to pagemap:
> > > memory cgroup each page is charged to, indexed by PFN. Only available when
> > > CONFIG_MEMCG is set.
> > >
> > > + * /proc/kpageidle. For each page this file contains a 64-bit number, which
> > > + equals 1 if the page is idle or 0 otherwise, indexed by PFN. A page is
> > > + considered idle if it has not been accessed since it was marked idle. To
> > > + mark a page idle one should write 1 to this file at the offset corresponding
> > > + to the page. Only user memory pages can be marked idle, for other page types
> > > + input is silently ignored. Writing to this file beyond max PFN results in
> > > + the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is set.
> > > +
> >
> > How about using kpageflags for reading part?
> >
> > I mean PG_idle is one of the page flags and we already have a feature to
> > parse of each PFN flag so we could reuse existing feature for reading
> > idleness.
>
> Reading PG_idle implies clearing all pte references to make sure the
> page was not referenced via a pte. This means that exporting it via
> /proc/kpageflags would increase the cost of reading this file, even for
> users that don't care about PG_idle. I'm not sure all users of
> /proc/kpageflags will be fine with it.

It triggers an rmap traversal, so it could be a horrible overhead at
times; I agree that not every kpageflags user would want that. But I
didn't mean that reading PG_idle via kpageflags should clear all pte
references. Resetting should still be part of kpageidle, but we could
read idleness without resetting via kpageflags (IOW, reset and read are
orthogonal).

A benefit of reading via kpageflags is that we could tell that a page is
idle and not dirty, so we could reclaim it easily.
Anyway, it could be a further improvement.

>
> Thanks,
> Vladimir

--
Kind regards,
Minchan Kim

2015-04-30 08:25:48

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] proc: add kpageidle file

On Wed, Apr 29, 2015 at 12:12:48PM +0300, Vladimir Davydov wrote:
> On Wed, Apr 29, 2015 at 01:35:36PM +0900, Minchan Kim wrote:
> > On Tue, Apr 28, 2015 at 03:24:42PM +0300, Vladimir Davydov wrote:
> > > diff --git a/fs/proc/page.c b/fs/proc/page.c
> > > index 70d23245dd43..cfc55ba7fee6 100644
> > > --- a/fs/proc/page.c
> > > +++ b/fs/proc/page.c
> > > @@ -275,6 +275,156 @@ static const struct file_operations proc_kpagecgroup_operations = {
> > > };
> > > #endif /* CONFIG_MEMCG */
> > >
> > > +#ifdef CONFIG_IDLE_PAGE_TRACKING
> > > +static struct page *kpageidle_get_page(unsigned long pfn)
> > > +{
> > > +	struct page *page;
> > > +
> > > +	if (!pfn_valid(pfn))
> > > +		return NULL;
> > > +	page = pfn_to_page(pfn);
> > > +	/*
> > > +	 * We are only interested in user memory pages, i.e. pages that are
> > > +	 * allocated and on an LRU list.
> > > +	 */
> > > +	if (!page || page_count(page) == 0 || !PageLRU(page))
> >
> > Why do you check (page_count == 0) even if we check it with get_page_unless_zero
> > below?
>
> I intended to avoid overhead of cmpxchg in case page_count is 0, but
> diving deeper into get_page_unless_zero, I see that it already handles
> such a scenario, so this check is useless. I'll remove it.
>
> >
> > > +		return NULL;
> > > +	if (!get_page_unless_zero(page))
> > > +		return NULL;
> > > +	if (unlikely(!PageLRU(page))) {
> >
> > What lock protect the check PageLRU?
> > If it is racing ClearPageLRU, what happens?
>
> If we hold a reference to a page and see that it's on an LRU list, it
> will surely remain a user memory page at least until we release the
> reference to it, so it must be safe to play with idle/young flags. If we

The problem is that you pass the page into the rmap reverse logic (ie,
page_referenced) once you judge it is an LRU page, so what happens if
that judgment is a false positive? The question is whether SetPageLRU,
PageLRU, and ClearPageLRU preserve memory ordering. IOW, all of the
struct page fields that rmap can access should be set up completely
before the LRU check. Otherwise, something will be broken.

Thanks.

> race with isolate_lru_page, or any similar function temporarily clearing
> PG_lru, we will silently skip the page w/o touching its idle/young
> flags. We could consider isolated pages too, but that would increase the
> cost of this function.
>
> If you find this explanation OK, I'll add it to the comment to this
> function.
>
> >
> > > +		put_page(page);
> > > +		return NULL;
> > > +	}
> > > +	return page;
> > > +}
> > > +
> > > +static void kpageidle_clear_refs(struct page *page)
> > > +{
> > > +	unsigned long dummy;
> > > +
> > > +	if (page_referenced(page, 0, NULL, &dummy))
> > > +		/*
> > > +		 * This page was referenced. To avoid interference with the
> > > +		 * reclaimer, mark it young so that the next call will also
> >
> > next what call?
> >
> > It just works with mapped page so kpageidle_clear_pte_refs as function name
> > is more clear.
> >
> > One more, kpageidle_clear_refs removes PG_idle via page_referenced which
> > is important feature for the function. Please document it so we could
> > understand why we need double check for PG_idle after calling
> > kpageidle_clear_refs for pte access bit.
>
> Sounds reasonable, will do.
>
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 24dd3f9fee27..12e73b758d9e 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -784,6 +784,13 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
> > >  	if (referenced) {
> > >  		pra->referenced++;
> > >  		pra->vm_flags |= vma->vm_flags;
> > > +		if (page_is_idle(page))
> > > +			clear_page_idle(page);
> > > +	}
> > > +
> > > +	if (page_is_young(page)) {
> > > +		clear_page_young(page);
> > > +		pra->referenced++;
> >
> > If a page was page_is_young and referenced recenlty,
> > pra->referenced is increased doubly and it changes current
> > behavior for file-backed page promotion. Look at page_check_references.
>
> Yeah, you're quite right, I missed that. Something like this should get
> rid of this extra reference:
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 24dd3f9fee27..eca7416f55d7 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -781,6 +781,14 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>  		pte_unmap_unlock(pte, ptl);
>  	}
> 
> +	if (referenced && page_is_idle(page))
> +		clear_page_idle(page);
> +
> +	if (page_is_young(page)) {
> +		clear_page_young(page);
> +		referenced++;
> +	}
> +
>  	if (referenced) {
>  		pra->referenced++;
>  		pra->vm_flags |= vma->vm_flags;
>
> Thanks,
> Vladimir
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [email protected]

--
Kind regards,
Minchan Kim

2015-04-30 14:51:18

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] proc: add kpageidle file

On Thu, Apr 30, 2015 at 05:25:31PM +0900, Minchan Kim wrote:
> On Wed, Apr 29, 2015 at 12:12:48PM +0300, Vladimir Davydov wrote:
> > On Wed, Apr 29, 2015 at 01:35:36PM +0900, Minchan Kim wrote:
> > > On Tue, Apr 28, 2015 at 03:24:42PM +0300, Vladimir Davydov wrote:
> > > > +#ifdef CONFIG_IDLE_PAGE_TRACKING
> > > > +static struct page *kpageidle_get_page(unsigned long pfn)
> > > > +{
> > > > + struct page *page;
> > > > +
> > > > + if (!pfn_valid(pfn))
> > > > + return NULL;
> > > > + page = pfn_to_page(pfn);
> > > > + /*
> > > > + * We are only interested in user memory pages, i.e. pages that are
> > > > + * allocated and on an LRU list.
> > > > + */
> > > > + if (!page || page_count(page) == 0 || !PageLRU(page))
> > > > + return NULL;
> > > > + if (!get_page_unless_zero(page))
> > > > + return NULL;
> > > > + if (unlikely(!PageLRU(page))) {
> > >
> > > What lock protect the check PageLRU?
> > > If it is racing ClearPageLRU, what happens?
> >
> > If we hold a reference to a page and see that it's on an LRU list, it
> > will surely remain a user memory page at least until we release the
> > reference to it, so it must be safe to play with idle/young flags. If we
>
> The problem is that you pass the page in rmap reverse logic(ie, page_referenced)
> once you judge it's LRU page so if it is false-positive, what happens?
> A question is SetPageLRU, PageLRU, ClearPageLRU keeps memory ordering?
> IOW, all of fields from struct page rmap can acccess should be set up completely
> before LRU checking. Otherwise, something will be broken.

So, basically you are concerned about the case where we encounter a
freshly allocated page that has the PG_lru bit set and is going to
become anonymous, but is still in the process of rmap initialization,
i.e. its ->mapping or ->mapcount may still be uninitialized, right?

AFAICS, page_referenced should handle such pages fine. Look, it only
needs ->index, ->mapping, and ->mapcount.

If ->mapping is unset, then it is NULL and rmap_walk_anon_lock ->
page_lock_anon_vma_read will return NULL so that rmap_walk will be a
no-op.

If ->index is not initialized, then at worst we will go to
anon_vma_interval_tree_foreach over a wrong interval, in which case we
will see that the page is actually not mapped in page_referenced_one ->
page_check_address and again do nothing.

If ->mapcount is not initialized it is -1, and page_lock_anon_vma_read
will return NULL, just as it does in case ->mapping = NULL.

For file pages, we always take PG_locked before checking ->mapping, so
it must be valid.

Thanks,
Vladimir

2015-05-04 03:17:52

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] proc: add kpageidle file

On Thu, Apr 30, 2015 at 05:50:55PM +0300, Vladimir Davydov wrote:
> On Thu, Apr 30, 2015 at 05:25:31PM +0900, Minchan Kim wrote:
> > On Wed, Apr 29, 2015 at 12:12:48PM +0300, Vladimir Davydov wrote:
> > > On Wed, Apr 29, 2015 at 01:35:36PM +0900, Minchan Kim wrote:
> > > > On Tue, Apr 28, 2015 at 03:24:42PM +0300, Vladimir Davydov wrote:
> > > > > +#ifdef CONFIG_IDLE_PAGE_TRACKING
> > > > > +static struct page *kpageidle_get_page(unsigned long pfn)
> > > > > +{
> > > > > + struct page *page;
> > > > > +
> > > > > + if (!pfn_valid(pfn))
> > > > > + return NULL;
> > > > > + page = pfn_to_page(pfn);
> > > > > + /*
> > > > > + * We are only interested in user memory pages, i.e. pages that are
> > > > > + * allocated and on an LRU list.
> > > > > + */
> > > > > + if (!page || page_count(page) == 0 || !PageLRU(page))
> > > > > + return NULL;
> > > > > + if (!get_page_unless_zero(page))
> > > > > + return NULL;
> > > > > + if (unlikely(!PageLRU(page))) {
> > > >
> > > > What lock protect the check PageLRU?
> > > > If it is racing ClearPageLRU, what happens?
> > >
> > > If we hold a reference to a page and see that it's on an LRU list, it
> > > will surely remain a user memory page at least until we release the
> > > reference to it, so it must be safe to play with idle/young flags. If we
> >
> > The problem is that you pass the page in rmap reverse logic(ie, page_referenced)
> > once you judge it's LRU page so if it is false-positive, what happens?
> > A question is SetPageLRU, PageLRU, ClearPageLRU keeps memory ordering?
> > IOW, all of fields from struct page rmap can acccess should be set up completely
> > before LRU checking. Otherwise, something will be broken.
>
> So, basically you are concerned about the case when we encounter a
> freshly allocated page, which has PG_lru bit set and it's going to
> become anonymous, but it is still in the process of rmap initialization,
> i.e. its ->mapping or ->mapcount may still be uninitialized, right?
>
> AFAICS, page_referenced should handle such pages fine. Look, it only
> needs ->index, ->mapping, and ->mapcount.
>
> If ->mapping is unset, than it is NULL and rmap_walk_anon_lock ->
> page_lock_anon_vma_read will return NULL so that rmap_walk will be a
> no-op.
>
> If ->index is not initialized, than at worst we will go to
> anon_vma_interval_tree_foreach over a wrong interval, in which case we
> will see that the page is actually not mapped in page_referenced_one ->
> page_check_address and again do nothing.
>
> If ->mapcount is not initialized it is -1, and page_lock_anon_vma_read
> will return NULL, just as it does in case ->mapping = NULL.
>
> For file pages, we always take PG_locked before checking ->mapping, so
> it must be valid.
>
> Thanks,
> Vladimir


do_anonymous_page
  page_add_new_anon_rmap
    atomic_set(&page->_mapcount, 0);
    __page_set_anon_rmap
      anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
      page->mapping = (struct address_space *) anon_vma;
      page->index = linear_page_index(vma, address);
  lru_cache_add
    __pagevec_lru_add_fn
      SetPageLRU(page);

During this procedure, there is no lock to prevent a race. So, at the
least, we need a write memory barrier to guarantee that the other fields
are set up before SetPageLRU. (Of course, PageLRU would need a paired
read memory barrier to work well.) But I can't find any barrier, either.

IOW, any of the fields you mentioned could be stored out of order
without a lock or memory barrier. You might argue that an atomic op is a
barrier on x86, but that doesn't guarantee other arches work like that,
so we need an explicit memory barrier or lock.

Let's have a theoretical example.

CPU 0                                   CPU 1

do_anonymous_page
  __page_set_anon_rmap
  /* reordered, so SetPageLRU
     is done ahead */
  SetPageLRU(page)
  /* compiler split the store
     like below */
  page->mapping =
    (struct address_space *) anon_vma;
  /* big stall happens */
                                        /* idle tracking judged it an LRU
                                           page, so it passes the page to
                                           page_referenced */
                                        page_referenced
                                          page_rmapping returns true
                                          because page->mapping has some
                                          value but is not complete, so
                                          it calls rmap_walk_file.
                                          Is it okay to pass a
                                          half-initialized anon page to
                                          rmap_walk_file?
  page->mapping =
    (struct address_space *)
    ((void *)page->mapping + PAGE_MAPPING_ANON);

It's too theoretical, so it might be hard to hit in real practice.
My point is that there is nothing to explicitly prevent the race.
Even if it causes no problem thanks to other locks, it's fragile.
Do I miss something?

I think the general ways to handle PageLRU are isolating the page beforehand or taking zone->lru_lock.

--
Kind regards,
Minchan Kim

2015-05-04 09:50:04

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] proc: add kpageidle file

On Mon, May 04, 2015 at 12:17:22PM +0900, Minchan Kim wrote:
> On Thu, Apr 30, 2015 at 05:50:55PM +0300, Vladimir Davydov wrote:
> > On Thu, Apr 30, 2015 at 05:25:31PM +0900, Minchan Kim wrote:
> > > On Wed, Apr 29, 2015 at 12:12:48PM +0300, Vladimir Davydov wrote:
> > > > On Wed, Apr 29, 2015 at 01:35:36PM +0900, Minchan Kim wrote:
> > > > > On Tue, Apr 28, 2015 at 03:24:42PM +0300, Vladimir Davydov wrote:
> > > > > > +#ifdef CONFIG_IDLE_PAGE_TRACKING
> > > > > > +static struct page *kpageidle_get_page(unsigned long pfn)
> > > > > > +{
> > > > > > + struct page *page;
> > > > > > +
> > > > > > + if (!pfn_valid(pfn))
> > > > > > + return NULL;
> > > > > > + page = pfn_to_page(pfn);
> > > > > > + /*
> > > > > > + * We are only interested in user memory pages, i.e. pages that are
> > > > > > + * allocated and on an LRU list.
> > > > > > + */
> > > > > > + if (!page || page_count(page) == 0 || !PageLRU(page))
> > > > > > + return NULL;
> > > > > > + if (!get_page_unless_zero(page))
> > > > > > + return NULL;
> > > > > > + if (unlikely(!PageLRU(page))) {
> > > > >
> > > > > What lock protect the check PageLRU?
> > > > > If it is racing ClearPageLRU, what happens?
> > > >
> > > > If we hold a reference to a page and see that it's on an LRU list, it
> > > > will surely remain a user memory page at least until we release the
> > > > reference to it, so it must be safe to play with idle/young flags. If we
> > >
> > > The problem is that you pass the page in rmap reverse logic(ie, page_referenced)
> > > once you judge it's LRU page so if it is false-positive, what happens?
> > > A question is SetPageLRU, PageLRU, ClearPageLRU keeps memory ordering?
> > > IOW, all of fields from struct page rmap can acccess should be set up completely
> > > before LRU checking. Otherwise, something will be broken.
> >
> > So, basically you are concerned about the case when we encounter a
> > freshly allocated page, which has PG_lru bit set and it's going to
> > become anonymous, but it is still in the process of rmap initialization,
> > i.e. its ->mapping or ->mapcount may still be uninitialized, right?
> >
> > AFAICS, page_referenced should handle such pages fine. Look, it only
> > needs ->index, ->mapping, and ->mapcount.
> >
> > If ->mapping is unset, than it is NULL and rmap_walk_anon_lock ->
> > page_lock_anon_vma_read will return NULL so that rmap_walk will be a
> > no-op.
> >
> > If ->index is not initialized, than at worst we will go to
> > anon_vma_interval_tree_foreach over a wrong interval, in which case we
> > will see that the page is actually not mapped in page_referenced_one ->
> > page_check_address and again do nothing.
> >
> > If ->mapcount is not initialized it is -1, and page_lock_anon_vma_read
> > will return NULL, just as it does in case ->mapping = NULL.
> >
> > For file pages, we always take PG_locked before checking ->mapping, so
> > it must be valid.
> >
> > Thanks,
> > Vladimir
>
>
> do_anonymous_page
> page_add_new_anon_rmap
> atomic_set(&page->_mapcount, 0);
> __page_set_anon_rmap
> anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> page->mapping = (struct address_space *) anon_vma;
> page->index = linear_page_index(vma, address);
> lru_cache_add
> __pagevec_lru_add_fn
> SetPageLRU(page);
>
> During the procedure, there is no lock to prevent race. Then, at least,
> we need a write memory barrier to guarantee other fields set up before
> SetPageLRU. (Of course, PageLRU should have read-memory barrier to work
> well) But I can't find any barrier, either.
>
> IOW, any fields you said could be out of order store without any lock or
> memory barrier. You might argue atomic op is a barrier on x86 but it
> doesn't guarantee other arches work like that so we need explict momory
> barrier or lock.
>
> Let's have a theoretical example.
>
> CPU 0 CPU 1
>
> do_anonymous_page
> __page_set_anon_rmap
> /* out of order happened so SetPageLRU is done ahead */
> SetPageLRU(page)
> /* Compilr changed store operation like below */

But it couldn't. Quoting Documentation/atomic_ops.txt:

Properly aligned pointers, longs, ints, and chars (and unsigned
equivalents) may be atomically loaded from and stored to in the same
sense as described for atomic_read() and atomic_set().

__page_set_anon_rmap sets page->mapping using the following expression:

anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
page->mapping = (struct address_space *) anon_vma;

and it can't be split, i.e. if one concurrently reads page->mapping
he/she will see either NULL or (anon_vma+PAGE_MAPPING_ANON), and there
can't be any intermediate result in page->mapping, such as anon_vma or
PAGE_MAPPING_ANON, because one doesn't expect

atomic_set(&p, a + b);

to behave like

atomic_set(&p, a);
atomic_set(&p, atomic_read(&p) + b);

Thanks,
Vladimir

> page->mapping = (struct address_space *) anon_vma;
> /* Big stall happens */
> /* idletacking judged it as LRU page so pass the page
> in page_reference */
> page_refernced
> page_rmapping return true because
> page->mapping has some vaule but not complete
> so it calls rmap_walk_file.
> it's okay to pass non-completed anon page in rmap_walk_file?
>
> page->mapping = (struct address_space *)
> ((void *)page_mapping + PAGE_MAPPING_ANON);
>
> It's too theoretical so it might be hard to happen in real practice.
> My point is there is nothing to prevent explict race.
> Even if there is no problem with other lock, it's fragile.
> Do I miss something?
>
> I think general way to handle PageLRU are ahead isolation or zone->lru_lock.

2015-05-04 10:55:18

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] proc: add kpageidle file

On Mon, May 04, 2015 at 12:49:39PM +0300, Vladimir Davydov wrote:
> On Mon, May 04, 2015 at 12:17:22PM +0900, Minchan Kim wrote:
> > On Thu, Apr 30, 2015 at 05:50:55PM +0300, Vladimir Davydov wrote:
> > > On Thu, Apr 30, 2015 at 05:25:31PM +0900, Minchan Kim wrote:
> > > > On Wed, Apr 29, 2015 at 12:12:48PM +0300, Vladimir Davydov wrote:
> > > > > On Wed, Apr 29, 2015 at 01:35:36PM +0900, Minchan Kim wrote:
> > > > > > On Tue, Apr 28, 2015 at 03:24:42PM +0300, Vladimir Davydov wrote:
> > > > > > > +#ifdef CONFIG_IDLE_PAGE_TRACKING
> > > > > > > +static struct page *kpageidle_get_page(unsigned long pfn)
> > > > > > > +{
> > > > > > > + struct page *page;
> > > > > > > +
> > > > > > > + if (!pfn_valid(pfn))
> > > > > > > + return NULL;
> > > > > > > + page = pfn_to_page(pfn);
> > > > > > > + /*
> > > > > > > + * We are only interested in user memory pages, i.e. pages that are
> > > > > > > + * allocated and on an LRU list.
> > > > > > > + */
> > > > > > > + if (!page || page_count(page) == 0 || !PageLRU(page))
> > > > > > > + return NULL;
> > > > > > > + if (!get_page_unless_zero(page))
> > > > > > > + return NULL;
> > > > > > > + if (unlikely(!PageLRU(page))) {
> > > > > >
> > > > > > What lock protect the check PageLRU?
> > > > > > If it is racing ClearPageLRU, what happens?
> > > > >
> > > > > If we hold a reference to a page and see that it's on an LRU list, it
> > > > > will surely remain a user memory page at least until we release the
> > > > > reference to it, so it must be safe to play with idle/young flags. If we
> > > >
> > > > The problem is that you pass the page in rmap reverse logic(ie, page_referenced)
> > > > once you judge it's LRU page so if it is false-positive, what happens?
> > > > A question is SetPageLRU, PageLRU, ClearPageLRU keeps memory ordering?
> > > > IOW, all of fields from struct page rmap can acccess should be set up completely
> > > > before LRU checking. Otherwise, something will be broken.
> > >
> > > So, basically you are concerned about the case when we encounter a
> > > freshly allocated page, which has PG_lru bit set and it's going to
> > > become anonymous, but it is still in the process of rmap initialization,
> > > i.e. its ->mapping or ->mapcount may still be uninitialized, right?
> > >
> > > AFAICS, page_referenced should handle such pages fine. Look, it only
> > > needs ->index, ->mapping, and ->mapcount.
> > >
> > > If ->mapping is unset, than it is NULL and rmap_walk_anon_lock ->
> > > page_lock_anon_vma_read will return NULL so that rmap_walk will be a
> > > no-op.
> > >
> > > If ->index is not initialized, than at worst we will go to
> > > anon_vma_interval_tree_foreach over a wrong interval, in which case we
> > > will see that the page is actually not mapped in page_referenced_one ->
> > > page_check_address and again do nothing.
> > >
> > > If ->mapcount is not initialized it is -1, and page_lock_anon_vma_read
> > > will return NULL, just as it does in case ->mapping = NULL.
> > >
> > > For file pages, we always take PG_locked before checking ->mapping, so
> > > it must be valid.
> > >
> > > Thanks,
> > > Vladimir
> >
> >
> > do_anonymous_page
> > page_add_new_anon_rmap
> > atomic_set(&page->_mapcount, 0);
> > __page_set_anon_rmap
> > anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> > page->mapping = (struct address_space *) anon_vma;
> > page->index = linear_page_index(vma, address);
> > lru_cache_add
> > __pagevec_lru_add_fn
> > SetPageLRU(page);
> >
> > During the procedure, there is no lock to prevent race. Then, at least,
> > we need a write memory barrier to guarantee other fields set up before
> > SetPageLRU. (Of course, PageLRU should have read-memory barrier to work
> > well) But I can't find any barrier, either.
> >
> > IOW, any fields you said could be out of order store without any lock or
> > memory barrier. You might argue atomic op is a barrier on x86 but it
> > doesn't guarantee other arches work like that so we need explict momory
> > barrier or lock.
> >
> > Let's have a theoretical example.
> >
> > CPU 0 CPU 1
> >
> > do_anonymous_page
> > __page_set_anon_rmap
> > /* out of order happened so SetPageLRU is done ahead */
> > SetPageLRU(page)
> > /* Compilr changed store operation like below */
>
> But it couldn't. Quoting Documentation/atomic_ops.txt:
>
> Properly aligned pointers, longs, ints, and chars (and unsigned
> equivalents) may be atomically loaded from and stored to in the same
> sense as described for atomic_read() and atomic_set().
>
> __page_set_anon_rmap sets page->mapping using the following expression:
>
> anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> page->mapping = (struct address_space *) anon_vma;
>
> and it can't be split, i.e. if one concurrently reads page->mapping
> he/she will see either NULL or (anon_vma+PAGE_MAPPING_ANON), and there
> can't be any intermediate result in page->mapping, such as anon_vma or
> PAGE_MAPPING_ANON, because one doesn't expect
>
> atomic_set(&p, a + b);
>
> to behave like
>
> atomic_set(&p, a);
> atomic_set(&p, atomic_read(&p) + b);

When I parsed the documentation, I understood it to mean that each
word-sized store/load operation is itself atomic, not that the compiler
is prevented from splitting a store.

Hmm, but this is a really important part of this patchset's
implementation, so I want to confirm it during the review process.

I'm not worried even though I am not an expert on this part, because I
know other experts. ;-) Ccing them.

What I want to know is as follows.

In do_anonymous_page, there is the following piece of code:

page_add_new_anon_rmap
  __page_set_anon_rmap
    anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
    page->mapping = (struct address_space *) anon_vma;
lru_cache_add_active_or_unevictable
  __lru_cache_add
    __pagevec_lru_add_fn
      SetPageLRU(page);

From page->mapping to SetPageLRU, there is no explicit lock or memory
barrier.

As the counterpart, kpageidle passes the page to page_referenced once it
judges via the PageLRU check that the page has PG_lru, but my concern is
that it makes this judgment without any lock or barrier, like below.

static struct page *kpageidle_get_page(unsigned long pfn)
{
	struct page *page;

	if (!pfn_valid(pfn))
		return NULL;
	page = pfn_to_page(pfn);
	/*
	 * We are only interested in user memory pages, i.e. pages that are
	 * allocated and on an LRU list.
	 */
	if (!page || page_count(page) == 0 || !PageLRU(page))
		return NULL;
	if (!get_page_unless_zero(page))
		return NULL;
	if (unlikely(!PageLRU(page))) {
		put_page(page);
		return NULL;
	}
	return page;
}

So, I guess that once the compiler optimization below happens in
__page_set_anon_rmap, things could get corrupted in page_referenced.

__page_set_anon_rmap:
	page->mapping = (struct address_space *) anon_vma;
	page->mapping = (struct address_space *)((void *)page->mapping + PAGE_MAPPING_ANON);

Because page_referenced checks the page with PageAnon, which involves no
memory barrier, if the above compiler optimization happens,
page_referenced can pass the anon page to rmap_walk_file instead of
rmap_walk_anon. It's my theory. :)

But Vladimir said the above compiler optimization cannot happen because
it's an aligned pointer, and Documentation/atomic_ops.txt says it
couldn't happen.

Thanks for looking into this.


>
> Thanks,
> Vladimir
>
> > page->mapping = (struct address_space *) anon_vma;
> > /* Big stall happens */
> > /* idletacking judged it as LRU page so pass the page
> > in page_reference */
> > page_refernced
> > page_rmapping return true because
> > page->mapping has some vaule but not complete
> > so it calls rmap_walk_file.
> > it's okay to pass non-completed anon page in rmap_walk_file?
> >
> > page->mapping = (struct address_space *)
> > ((void *)page_mapping + PAGE_MAPPING_ANON);
> >
> > It's too theoretical so it might be hard to happen in real practice.
> > My point is there is nothing to prevent explict race.
> > Even if there is no problem with other lock, it's fragile.
> > Do I miss something?
> >
> > I think general way to handle PageLRU are ahead isolation or zone->lru_lock.

--
Kind regards,
Minchan Kim

2015-05-08 09:56:24

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] proc: add kpageidle file

On Mon, May 04, 2015 at 07:54:59PM +0900, Minchan Kim wrote:
> So, I guess once below compiler optimization happens in __page_set_anon_rmap,
> it could be corrupt in page_refernced.
>
> __page_set_anon_rmap:
> page->mapping = (struct address_space *) anon_vma;
> page->mapping = (struct address_space *)((void *)page_mapping + PAGE_MAPPING_ANON);
>
> Because page_referenced checks it with PageAnon which has no memory barrier.
> So if above compiler optimization happens, page_referenced can pass the anon
> page in rmap_walk_file, not ramp_walk_anon. It's my theory. :)

FWIW

If such splits were possible, we would have bugs all over the kernel
IMO. An example is do_wp_page() vs shrink_active_list(). In do_wp_page()
we can call page_move_anon_rmap(), which sets page->mapping in exactly
the same fashion as above-mentioned __page_set_anon_rmap():

anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
page->mapping = (struct address_space *) anon_vma;

The page in question may be on an LRU list, because nowhere in
do_wp_page() we remove it from the list, neither do we take any LRU
related locks. The page is locked, that's true, but shrink_active_list()
calls page_referenced() on an unlocked page, so according to your logic
they can race with the latter receiving a page with page->mapping equal
to anon_vma w/o PAGE_MAPPING_ANON bit set:

CPU0                                    CPU1
----                                    ----
do_wp_page                              shrink_active_list
  lock_page                               page_referenced
                                            PageAnon->yes, so skip trylock_page
  page_move_anon_rmap
    page->mapping = anon_vma
                                          rmap_walk
                                            PageAnon->no
                                              rmap_walk_file
                                                BUG
    page->mapping = page->mapping+PAGE_MAPPING_ANON

However, this does not happen.

Thanks,
Vladimir

2015-05-09 15:12:54

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] proc: add kpageidle file

Hello, Vladimir

On Fri, May 08, 2015 at 12:56:04PM +0300, Vladimir Davydov wrote:
> On Mon, May 04, 2015 at 07:54:59PM +0900, Minchan Kim wrote:
> > So, I guess once below compiler optimization happens in __page_set_anon_rmap,
> > it could be corrupt in page_refernced.
> >
> > __page_set_anon_rmap:
> > page->mapping = (struct address_space *) anon_vma;
> > page->mapping = (struct address_space *)((void *)page_mapping + PAGE_MAPPING_ANON);
> >
> > Because page_referenced checks it with PageAnon which has no memory barrier.
> > So if above compiler optimization happens, page_referenced can pass the anon
> > page in rmap_walk_file, not ramp_walk_anon. It's my theory. :)
>
> FWIW
>
> If such splits were possible, we would have bugs all over the kernel
> IMO. An example is do_wp_page() vs shrink_active_list(). In do_wp_page()
> we can call page_move_anon_rmap(), which sets page->mapping in exactly
> the same fashion as above-mentioned __page_set_anon_rmap():
>
> anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> page->mapping = (struct address_space *) anon_vma;
>
> The page in question may be on an LRU list, because nowhere in
> do_wp_page() we remove it from the list, neither do we take any LRU
> related locks. The page is locked, that's true, but shrink_active_list()
> calls page_referenced() on an unlocked page, so according to your logic
> they can race with the latter receiving a page with page->mapping equal
> to anon_vma w/o PAGE_MAPPING_ANON bit set:
>
> CPU0 CPU1
> ---- ----
> do_wp_page shrink_active_list
> lock_page page_referenced
> PageAnon->yes, so skip trylock_page
> page_move_anon_rmap
> page->mapping = anon_vma
> rmap_walk
> PageAnon->no
> rmap_walk_file
> BUG
> page->mapping = page->mapping+PAGE_MAPPING_ANON
>
> However, this does not happen.

Good spot.

However, that doesn't mean it's right or that it's okay to rely on it.
Normally, store tearing is not common and such a race would be hard to
hit, but I want to call it a BUG.

Rik wrote that code and added the comment:

"Protected against the rmap code by the page lock"

But unfortunately, page_referenced in shrink_active_list doesn't hold
the page lock, so isn't it a bug? Rik?

Please read the store tearing section in Documentation/memory-barriers.txt.
If you are confused because the memory is aligned, please read this link:

https://lkml.org/lkml/2014/7/16/262

Another quote, from Paul in https://lkml.org/lkml/2015/5/1/229:
"
..
If the thing read/written does fit into a machine word and if the location
read/written is properly aligned, I would be quite surprised if either
READ_ONCE() or WRITE_ONCE() resulted in any sort of tearing.
"

I read it as saying "even store tearing can happen on a machine word at
an aligned address, and that's why WRITE_ONCE is there to prevent it".

If you want to claim GCC doesn't do it, please read the links below:

https://lkml.org/lkml/2015/4/16/527
http://yarchive.net/comp/linux/ACCESS_ONCE.html

Quote from Linus
"
The thing is, you can't _prove_ that the compiler won't do it, especially
if you end up changing the code later (without thinking about the fact
that you're loading things without locking).

So the rule is: if you access unlocked values, you use ACCESS_ONCE(). You
don't say "but it can't matter". Because you simply don't know.
"

Yes, I might be paranoid, but my point is that it might work now on most
arches, yet it seems buggy/fragile/subtle because we cannot prove that
no arch/compiler will make any trouble. So, instead of adding more logic
on top of something fragile, please use the right locking model. If the
lock becomes a big problem due to overhead, let's fix that (for
instance, use WRITE_ONCE on the update side and READ_ONCE on the read
side), if I'm not missing something.

>
> Thanks,
> Vladimir

--
Kind regards,
Minchan Kim

2015-05-10 10:34:53

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] proc: add kpageidle file

On Sun, May 10, 2015 at 12:12:38AM +0900, Minchan Kim wrote:
> On Fri, May 08, 2015 at 12:56:04PM +0300, Vladimir Davydov wrote:
> > On Mon, May 04, 2015 at 07:54:59PM +0900, Minchan Kim wrote:
> > > So, I guess once the below compiler optimization happens in __page_set_anon_rmap,
> > > it could be corrupted in page_referenced.
> > >
> > > __page_set_anon_rmap:
> > >     page->mapping = (struct address_space *) anon_vma;
> > >     page->mapping = (struct address_space *)((void *)page->mapping + PAGE_MAPPING_ANON);
> > >
> > > Because page_referenced checks it with PageAnon, which has no memory barrier.
> > > So if the above compiler optimization happens, page_referenced can pass the anon
> > > page to rmap_walk_file, not rmap_walk_anon. It's my theory. :)
> >
> > FWIW
> >
> > If such splits were possible, we would have bugs all over the kernel
> > IMO. An example is do_wp_page() vs shrink_active_list(). In do_wp_page()
> > we can call page_move_anon_rmap(), which sets page->mapping in exactly
> > the same fashion as above-mentioned __page_set_anon_rmap():
> >
> >     anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> >     page->mapping = (struct address_space *) anon_vma;
> >
> > The page in question may be on an LRU list, because nowhere in
> > do_wp_page() do we remove it from the list, nor do we take any LRU
> > related locks. The page is locked, that's true, but shrink_active_list()
> > calls page_referenced() on an unlocked page, so according to your logic
> > they can race, with the latter receiving a page whose page->mapping equals
> > anon_vma w/o the PAGE_MAPPING_ANON bit set:
> >
> > CPU0                                      CPU1
> > ----                                      ----
> > do_wp_page                                shrink_active_list
> >   lock_page                                 page_referenced
> >                                               PageAnon->yes, so skip trylock_page
> >   page_move_anon_rmap
> >     page->mapping = anon_vma
> >                                             rmap_walk
> >                                               PageAnon->no
> >                                                 rmap_walk_file
> >                                                   BUG
> >     page->mapping = page->mapping+PAGE_MAPPING_ANON
> >
> > However, this does not happen.
>
> Good spot.
>
> However, it doesn't mean it's right, or that you are okay to rely on it.
> Normally, store tearing is not common and such a race would be hard to
> hit, but I want to call it a BUG.

But then we should call atomic64_set/atomic_long_set a big fat bug,
because it does not use ACCESS_ONCE/volatile stuff on its argument, so
it is prone to write tearing and therefore it is not atomic at all.

>
> Rik wrote the code and added this comment:
>
> "Protected against the rmap code by the page lock"
>
> But unfortunately, page_referenced in shrink_active_list doesn't hold
> the page lock, so isn't it a bug? Rik?
>
> Please read the store tearing section in Documentation/memory-barriers.txt.
> If you are confused because the access is aligned, please read this link.
>
> https://lkml.org/lkml/2014/7/16/262

I've read it. It describes tearing of

	p = 0x00010002;

into

	*(u16 *)&p = 0x2;
	*((u16 *)&p + 1) = 0x1;

to avoid materializing the constant 0x00010002, by using two 16-bit
immediate stores instead.

AFAIU that is not at all the case in __page_set_anon_rmap:

	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
	page->mapping = (struct address_space *) anon_vma;

The compiler doesn't know the value of anon_vma, so there is absolutely
no benefit in tearing it - it would only result in two stores instead of
one. I admit we cannot rule out that some mad compiler might do that,
but IMO that would be a compiler bug, which would result in the kernel
tearing apart.

>
> Other quote from Paul in https://lkml.org/lkml/2015/5/1/229
> "
> ..
> If the thing read/written does fit into a machine word and if the location
> read/written is properly aligned, I would be quite surprised if either
> READ_ONCE() or WRITE_ONCE() resulted in any sort of tearing.
> "
>
> I parsed it as "even store tearing can happen on a machine word at an
> aligned address, and that's why WRITE_ONCE is there to prevent it".

That's reading between the lines; I can't see it actually written there.

>
> If you want to claim GCC doesn't do it, please read the links below:
>
> https://lkml.org/lkml/2015/4/16/527
> http://yarchive.net/comp/linux/ACCESS_ONCE.html
>
> Quote from Linus
> "
> The thing is, you can't _prove_ that the compiler won't do it, especially
> if you end up changing the code later (without thinking about the fact
> that you're loading things without locking).
>
> So the rule is: if you access unlocked values, you use ACCESS_ONCE(). You
> don't say "but it can't matter". Because you simply don't know.
> "

You took this citation out of context; it has nothing to do with
read/store tearing. It's about value consistency within a statement.
E.g. in the following statement

	int i = x;
	if (i > y)
		y = i;

we do need ACCESS_ONCE around x, because the compiler is free to fetch
its value twice, in the comparison and in the assignment. But that's
not about read/write tearing.

>
> Yes, I might be paranoid, but my point is that it might work now on most
> arches, yet it seems buggy/fragile/subtle because we cannot prove that
> no arch/compiler will make any trouble. So, instead of adding more logic
> on top of something fragile, please use the right locking model. If the
> lock becomes a big problem due to overhead, let's fix that (for
> instance, use WRITE_ONCE on the update side and READ_ONCE on the read
> side), if I'm not missing something.

IMO, locking would be an overkill. READ_ONCE is OK, because it has no
performance implications, but I would prefer to be convinced that it is
100% necessary before adding it just in case.

Thanks,
Vladimir

2015-05-12 09:41:58

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH v3 3/3] proc: add kpageidle file

On Sun, May 10, 2015 at 01:34:29PM +0300, Vladimir Davydov wrote:
> On Sun, May 10, 2015 at 12:12:38AM +0900, Minchan Kim wrote:
> > Yes, I might be paranoid, but my point is that it might work now on most
> > arches, yet it seems buggy/fragile/subtle because we cannot prove that
> > no arch/compiler will make any trouble. So, instead of adding more logic
> > on top of something fragile, please use the right locking model. If the
> > lock becomes a big problem due to overhead, let's fix that (for
> > instance, use WRITE_ONCE on the update side and READ_ONCE on the read
> > side), if I'm not missing something.
>
> IMO, locking would be an overkill. READ_ONCE is OK, because it has no
> performance implications, but I would prefer to be convinced that it is
> 100% necessary before adding it just in case.

Finally, I'm convinced we do need synchronization here :-) Sorry for
being so stubborn and thank you for your patience.

After examining page_referenced() with the knowledge that the compiler
may be completely unreliable and split page->mapping reads/writes as it
wants, I've drawn the conclusion that it is safer to take
zone->lru_lock for checking whether the page is on an LRU list, just as
you proposed initially, because otherwise we would need to insert
READ_ONCE/WRITE_ONCE in every nook and cranny, which would look
confusing given that we only need them for this idle page tracking
feature, which might not even be compiled in.

I'll fix it and resend.

Thanks,
Vladimir