2013-10-03 00:52:03

by John Stultz

Subject: [PATCH 00/14] Volatile Ranges v9

So it's been a while since the last release of the volatile ranges
patches, and while Minchan and I have been busy with other things,
we have been slowly chipping away at issues and differences, trying
to get a patchset that we both agree on.

There are still a few smaller issues, but we figured any further
polishing of the patch series in private would be unproductive,
and it would be much better to send the patches out for review
and comment and get some wider opinions.

What's new in v9:
* Updated to v3.11
* Added vrange purging logic to purge anonymous pages on
swapless systems
* Added logic to allocate the vroot structure dynamically
to avoid added overhead to mm and address_space structures
* Lots of minor tweaks, changes and cleanups

Still TODO:
* Sort out a better solution for clearing volatility on new mmaps
- Minchan has a different approach here
* Sort out an apparent shrinker livelock that occasionally crops
up under severe pressure

Feedback or thoughts here would be particularly helpful!

As is apparent from the author list, Minchan has really been the
one doing the heavy lifting here, and I've only been finding and
fixing a few bugs, refactoring the code for readability, and
trying to clarify commit messages. So many many thanks to Minchan
here for all his great work, and putting up with my sometimes
misguided "editing".

Also, thanks to Dhaval for maintaining and vastly improving
the volatile ranges test suite, which can be found here:
https://github.com/volatile-ranges-test/vranges-test
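
For anyone who wants to kick the tires without digging through the
test suite, below is a minimal sketch of how a user is expected to
drive the interface. The syscall number and flag values are
assumptions copied from this series (double-check syscall_64.tbl and
mman-common.h in the tree you actually build), so treat it as
illustrative rather than authoritative:

/* Illustrative only: __NR_vrange and the VRANGE_* values below are
 * assumptions taken from this patch series, not from a released kernel.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_vrange
#define __NR_vrange 314         /* assumed x86_64 slot from this series */
#endif
#define VRANGE_VOLATILE   0     /* assumed values from mman-common.h */
#define VRANGE_NOVOLATILE 1

int main(void)
{
        size_t len = 16 * 4096;
        int purged = 0;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
                return 1;
        memset(buf, 0xaa, len);         /* fill a cache we can afford to lose */

        /* Tell the kernel it may purge these pages under memory pressure */
        if (syscall(__NR_vrange, buf, len, VRANGE_VOLATILE, &purged) < 0)
                perror("vrange(VOLATILE)");

        /* ... later, before reusing the cache, make it stable again ... */
        if (syscall(__NR_vrange, buf, len, VRANGE_NOVOLATILE, &purged) < 0)
                perror("vrange(NOVOLATILE)");
        if (purged)
                memset(buf, 0xaa, len); /* contents were discarded; rebuild */

        munmap(buf, len);
        return 0;
}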


These patches can also be pulled from git here:
git://git.linaro.org/people/jstultz/android-dev.git dev/vrange-v9

We'd really welcome any feedback and comments on the patch series.

thanks
-john

Cc: Andrew Morton <[email protected]>
Cc: Android Kernel Team <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dmitry Adamushko <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Mike Hommey <[email protected]>
Cc: Taras Glek <[email protected]>
Cc: Dhaval Giani <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: [email protected] <[email protected]>


John Stultz (2):
vrange: Clear volatility on new mmaps
vrange: Add support for volatile ranges on file mappings

Minchan Kim (12):
vrange: Add basic data structure and functions
vrange: Add vrange support to mm_structs
vrange: Add new vrange(2) system call
vrange: Add basic functions to purge volatile pages
vrange: Purge volatile pages when memory is tight
vrange: Send SIGBUS when user try to access purged page
vrange: Add vrange LRU list for purging
vrange: Add core shrinking logic for swapless system
vrange: Purging vrange-anon pages from shrinker
vrange: Support background purging for vrange-file
vrange: Allocate vroot dynamically
vrange: Add vmstat counter about purged page

arch/x86/syscalls/syscall_64.tbl | 1 +
fs/inode.c | 4 +
include/linux/fs.h | 4 +
include/linux/mm_types.h | 4 +
include/linux/rmap.h | 11 +-
include/linux/swap.h | 6 +-
include/linux/syscalls.h | 2 +
include/linux/vm_event_item.h | 2 +
include/linux/vrange.h | 84 +++
include/linux/vrange_types.h | 28 +
include/uapi/asm-generic/mman-common.h | 3 +
kernel/fork.c | 12 +
kernel/sys_ni.c | 1 +
lib/Makefile | 2 +-
mm/Makefile | 2 +-
mm/internal.h | 2 -
mm/ksm.c | 2 +-
mm/memory.c | 27 +
mm/mincore.c | 5 +-
mm/mmap.c | 5 +
mm/rmap.c | 28 +-
mm/vmscan.c | 17 +-
mm/vmstat.c | 2 +
mm/vrange.c | 1196 ++++++++++++++++++++++++++++++++
24 files changed, 1429 insertions(+), 21 deletions(-)
create mode 100644 include/linux/vrange.h
create mode 100644 include/linux/vrange_types.h
create mode 100644 mm/vrange.c

--
1.8.1.2


2013-10-03 00:52:07

by John Stultz

Subject: [PATCH 02/14] vrange: Add vrange support to mm_structs

From: Minchan Kim <[email protected]>

This patch adds a vroot to the mm_struct so a process can set
volatile ranges on anonymous memory.

This is somewhat wasteful, as it enlarges the mm_struct even if
the process never uses the vrange syscall. So a later patch will
provide dynamically allocated vroots.

One item of note in this patch is vrange_fork. Since we do allocations
while holding the vrange lock, it's possible we could deadlock with
direct reclaim's purging logic. For this reason, vrange_fork uses
GFP_NOIO for its allocations.

If vrange_fork fails, it isn't a critical problem. The result is
that the child process's pages won't be volatile/purgeable, which
could cause additional memory pressure, but won't cause problematic
application behavior (since volatile pages are only purged at the
kernel's discretion). This is thought to be more desirable than
having fork fail.

Cc: Andrew Morton <[email protected]>
Cc: Android Kernel Team <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dmitry Adamushko <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Mike Hommey <[email protected]>
Cc: Taras Glek <[email protected]>
Cc: Dhaval Giani <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: [email protected] <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
[jstultz: Bit of refactoring. Comment cleanups]
Signed-off-by: John Stultz <[email protected]>
---
include/linux/mm_types.h | 4 ++++
include/linux/vrange.h | 7 ++++++-
kernel/fork.c | 11 +++++++++++
mm/vrange.c | 40 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index faf4b7c..5d8cdc3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -13,6 +13,7 @@
#include <linux/page-debug-flags.h>
#include <linux/uprobes.h>
#include <linux/page-flags-layout.h>
+#include <linux/vrange_types.h>
#include <asm/page.h>
#include <asm/mmu.h>

@@ -349,6 +350,9 @@ struct mm_struct {
*/


+#ifdef CONFIG_MMU
+ struct vrange_root vroot;
+#endif
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */

diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 0d378a5..2b96ee1 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -37,12 +37,17 @@ static inline int vrange_type(struct vrange *vrange)
}

extern void vrange_root_cleanup(struct vrange_root *vroot);
-
+extern int vrange_fork(struct mm_struct *new,
+ struct mm_struct *old);
#else

static inline void vrange_root_init(struct vrange_root *vroot,
int type, void *object) {};
static inline void vrange_root_cleanup(struct vrange_root *vroot) {};
+static inline int vrange_fork(struct mm_struct *new, struct mm_struct *old)
+{
+ return 0;
+}

#endif
#endif /* _LINIUX_VRANGE_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index bf46287..ceb38bf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -71,6 +71,7 @@
#include <linux/signalfd.h>
#include <linux/uprobes.h>
#include <linux/aio.h>
+#include <linux/vrange.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -377,6 +378,14 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
retval = khugepaged_fork(mm, oldmm);
if (retval)
goto out;
+ /*
+ * Note: vrange_fork can fail in the case of ENOMEM, but
+ * this only results in the child not having any active
+ * volatile ranges. This is not harmful. Thus in this case
+ * the child will not see any pages purged unless it remarks
+ * them as volatile.
+ */
+ vrange_fork(mm, oldmm);

prev = NULL;
for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
@@ -538,6 +547,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
mm->nr_ptes = 0;
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
spin_lock_init(&mm->page_table_lock);
+ vrange_root_init(&mm->vroot, VRANGE_MM, mm);
mm_init_aio(mm);
mm_init_owner(mm, p);

@@ -609,6 +619,7 @@ void mmput(struct mm_struct *mm)

if (atomic_dec_and_test(&mm->mm_users)) {
uprobe_clear_state(mm);
+ vrange_root_cleanup(&mm->vroot);
exit_aio(mm);
ksm_exit(mm);
khugepaged_exit(mm); /* must run before exit_mmap */
diff --git a/mm/vrange.c b/mm/vrange.c
index 866566c..4ddcc3e9 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -181,3 +181,43 @@ void vrange_root_cleanup(struct vrange_root *vroot)
vrange_unlock(vroot);
}

+/*
+ * It's okay for vrange_fork to fail: the worst case is that the child
+ * doesn't get a copy of the parent's vrange data structure, so pages
+ * in the vrange simply can't be purged. That is better than having
+ * fork fail.
+ */
+int vrange_fork(struct mm_struct *new_mm, struct mm_struct *old_mm)
+{
+ struct vrange_root *new, *old;
+ struct vrange *range, *new_range;
+ struct rb_node *next;
+
+ new = &new_mm->vroot;
+ old = &old_mm->vroot;
+
+ vrange_lock(old);
+ next = rb_first(&old->v_rb);
+ while (next) {
+ range = vrange_entry(next);
+ next = rb_next(next);
+ /*
+ * We can't use GFP_KERNEL because direct reclaim's
+ * vrange purging logic could deadlock with us via
+ * vrange_lock.
+ */
+ new_range = __vrange_alloc(GFP_NOIO);
+ if (!new_range)
+ goto fail;
+ __vrange_set(new_range, range->node.start,
+ range->node.last, range->purged);
+ __vrange_add(new_range, new);
+
+ }
+ vrange_unlock(old);
+ return 0;
+fail:
+ vrange_unlock(old);
+ vrange_root_cleanup(new);
+ return -ENOMEM;
+}
--
1.8.1.2

2013-10-03 00:52:19

by John Stultz

Subject: [PATCH 03/14] vrange: Clear volatility on new mmaps

At LSF/MM, the issue was brought up that there is a precedent with
interfaces like mlock, such that new mappings in a pre-existing range
do not inherit the mlock state.

This is mostly because mlock only modifies the existing vmas, and so
any new mmaps create new vmas, which won't be mlocked.

Since volatility is not stored in the vma (for good cause: we'd have
to manage file volatility differently from anonymous memory, and we're
likely to manage volatility on small chunks of memory, which would
cause lots of vma splitting and churn), this patch clears volatility
on new mappings, to ensure that we don't inherit volatility if memory
in an existing volatile range is unmapped and then re-mapped with
something else.

Thus, this patch forces any volatility to be cleared on mmap.

XXX: We expect this patch to be not well loved by mm folks, and are open
to alternative methods here. It's more of a placeholder to address
the issue from LSF/MM and hopefully will spur some further discussion.

Minchan does have an alternative solution, but I'm not a big fan of it
yet, so this simpler approach is a placeholder for now.
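
To make the intended semantic concrete, here is a hedged userspace
sketch of the behavior this patch enforces (reusing the assumed
__NR_vrange number and VRANGE_VOLATILE value from the cover-letter
example, so again illustrative only):

#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_vrange
#define __NR_vrange 314         /* assumed x86_64 slot */
#endif
#define VRANGE_VOLATILE 0       /* assumed value */

int main(void)
{
        size_t len = 8 * 4096;
        int purged = 0;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Mark the whole region volatile... */
        syscall(__NR_vrange, buf, len, VRANGE_VOLATILE, &purged);

        /* ...then map something new over the first half. With this patch,
         * mmap_region() calls vrange_clear() over [buf, buf + len/2), so
         * the new mapping starts out non-volatile and won't be purged,
         * while the second half of the original range stays volatile. */
        mmap(buf, len / 2, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

        munmap(buf, len);
        return 0;
}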

Cc: Andrew Morton <[email protected]>
Cc: Android Kernel Team <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dmitry Adamushko <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Mike Hommey <[email protected]>
Cc: Taras Glek <[email protected]>
Cc: Dhaval Giani <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: [email protected] <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
include/linux/vrange.h | 2 ++
mm/mmap.c | 5 +++++
mm/vrange.c | 8 ++++++++
3 files changed, 15 insertions(+)

diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 2b96ee1..ef153c8 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -36,6 +36,8 @@ static inline int vrange_type(struct vrange *vrange)
return vrange->owner->type;
}

+extern int vrange_clear(struct vrange_root *vroot,
+ unsigned long start, unsigned long end);
extern void vrange_root_cleanup(struct vrange_root *vroot);
extern int vrange_fork(struct mm_struct *new,
struct mm_struct *old);
diff --git a/mm/mmap.c b/mm/mmap.c
index f9c97d1..ed7056f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -36,6 +36,7 @@
#include <linux/sched/sysctl.h>
#include <linux/notifier.h>
#include <linux/memory.h>
+#include <linux/vrange.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -1502,6 +1503,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
/* Clear old maps */
error = -ENOMEM;
munmap_back:
+
+ /* zap any volatile ranges */
+ vrange_clear(&mm->vroot, addr, addr + len);
+
if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent)) {
if (do_munmap(mm, addr, len))
return -ENOMEM;
diff --git a/mm/vrange.c b/mm/vrange.c
index 4ddcc3e9..f2d1588 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -166,6 +166,14 @@ static int vrange_remove(struct vrange_root *vroot,
return 0;
}

+int vrange_clear(struct vrange_root *vroot,
+ unsigned long start, unsigned long end)
+{
+ int purged;
+
+ return vrange_remove(vroot, start, end - 1, &purged);
+}
+
void vrange_root_cleanup(struct vrange_root *vroot)
{
struct vrange *range;
--
1.8.1.2

2013-10-03 00:52:28

by John Stultz

Subject: [PATCH 09/14] vrange: Add vrange LRU list for purging

From: Minchan Kim <[email protected]>

This patch adds a vrange LRU list for managing which vranges to purge
(in this implementation, purging will be driven by the slab shrinker
introduced by upcoming patches).

This is necessary for purging vranges on swapless systems, because
currently the VM only ages anonymous pages if the system has a swap
device.

Because we would otherwise be duplicating the page LRU's tracking of
hot/cold pages, we use a vrange LRU to manage the shrinking order:
the shrinker discards an entire vrange at once, and vranges are
purged in the order they were marked volatile.

Cc: Andrew Morton <[email protected]>
Cc: Android Kernel Team <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dmitry Adamushko <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Mike Hommey <[email protected]>
Cc: Taras Glek <[email protected]>
Cc: Dhaval Giani <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: [email protected] <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
include/linux/vrange_types.h | 2 ++
mm/vrange.c | 61 ++++++++++++++++++++++++++++++++++++++++----
2 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/include/linux/vrange_types.h b/include/linux/vrange_types.h
index 0d48b42..d7d451c 100644
--- a/include/linux/vrange_types.h
+++ b/include/linux/vrange_types.h
@@ -20,6 +20,8 @@ struct vrange {
struct interval_tree_node node;
struct vrange_root *owner;
int purged;
+ struct list_head lru;
+ atomic_t refcount;
};
#endif

diff --git a/mm/vrange.c b/mm/vrange.c
index c19a966..33e3ac1 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -14,8 +14,21 @@

static struct kmem_cache *vrange_cachep;

+static struct vrange_list {
+ struct list_head list;
+ unsigned long size;
+ struct mutex lock;
+} vrange_list;
+
+static inline unsigned int vrange_size(struct vrange *range)
+{
+ return range->node.last + 1 - range->node.start;
+}
+
static int __init vrange_init(void)
{
+ INIT_LIST_HEAD(&vrange_list.list);
+ mutex_init(&vrange_list.lock);
vrange_cachep = KMEM_CACHE(vrange, SLAB_PANIC);
return 0;
}
@@ -27,19 +40,56 @@ static struct vrange *__vrange_alloc(gfp_t flags)
if (!vrange)
return vrange;
vrange->owner = NULL;
+ INIT_LIST_HEAD(&vrange->lru);
+ atomic_set(&vrange->refcount, 1);
+
return vrange;
}

static void __vrange_free(struct vrange *range)
{
WARN_ON(range->owner);
+ WARN_ON(atomic_read(&range->refcount) != 0);
+ WARN_ON(!list_empty(&range->lru));
+
kmem_cache_free(vrange_cachep, range);
}

+static inline void __vrange_lru_add(struct vrange *range)
+{
+ mutex_lock(&vrange_list.lock);
+ WARN_ON(!list_empty(&range->lru));
+ list_add(&range->lru, &vrange_list.list);
+ vrange_list.size += vrange_size(range);
+ mutex_unlock(&vrange_list.lock);
+}
+
+static inline void __vrange_lru_del(struct vrange *range)
+{
+ mutex_lock(&vrange_list.lock);
+ if (!list_empty(&range->lru)) {
+ list_del_init(&range->lru);
+ vrange_list.size -= vrange_size(range);
+ WARN_ON(range->owner);
+ }
+ mutex_unlock(&vrange_list.lock);
+}
+
static void __vrange_add(struct vrange *range, struct vrange_root *vroot)
{
range->owner = vroot;
interval_tree_insert(&range->node, &vroot->v_rb);
+
+ WARN_ON(atomic_read(&range->refcount) <= 0);
+ __vrange_lru_add(range);
+}
+
+static inline void __vrange_put(struct vrange *range)
+{
+ if (atomic_dec_and_test(&range->refcount)) {
+ __vrange_lru_del(range);
+ __vrange_free(range);
+ }
}

static void __vrange_remove(struct vrange *range)
@@ -64,6 +114,7 @@ static inline void __vrange_resize(struct vrange *range,
bool purged = range->purged;

__vrange_remove(range);
+ __vrange_lru_del(range);
__vrange_set(range, start_idx, end_idx, purged);
__vrange_add(range, vroot);
}
@@ -100,7 +151,7 @@ static int vrange_add(struct vrange_root *vroot,
range = vrange_from_node(node);
/* old range covers new range fully */
if (node->start <= start_idx && node->last >= end_idx) {
- __vrange_free(new_range);
+ __vrange_put(new_range);
goto out;
}

@@ -109,7 +160,7 @@ static int vrange_add(struct vrange_root *vroot,
purged |= range->purged;

__vrange_remove(range);
- __vrange_free(range);
+ __vrange_put(range);

node = next;
}
@@ -150,7 +201,7 @@ static int vrange_remove(struct vrange_root *vroot,
if (start_idx <= node->start && end_idx >= node->last) {
/* argumented range covers the range fully */
__vrange_remove(range);
- __vrange_free(range);
+ __vrange_put(range);
} else if (node->start >= start_idx) {
/*
* Argumented range covers over the left of the
@@ -181,7 +232,7 @@ static int vrange_remove(struct vrange_root *vroot,
vrange_unlock(vroot);

if (!used_new)
- __vrange_free(new_range);
+ __vrange_put(new_range);

return 0;
}
@@ -204,7 +255,7 @@ void vrange_root_cleanup(struct vrange_root *vroot)
while ((node = rb_first(&vroot->v_rb))) {
range = vrange_entry(node);
__vrange_remove(range);
- __vrange_free(range);
+ __vrange_put(range);
}
vrange_unlock(vroot);
}
--
1.8.1.2

2013-10-03 00:52:37

by John Stultz

Subject: [PATCH 14/14] vrange: Add vmstat counter about purged page

From: Minchan Kim <[email protected]>

This patch adds counters for the number of purged pages to vmstat,
so admins can see how many volatile pages the VM has discarded so far.
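
As a small illustration, here is a hedged sketch of how an admin or a
test might read the new counters; it assumes a kernel with this series
applied (on anything else the loop simply prints nothing):

#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[128];
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f) {
                perror("/proc/vmstat");
                return 1;
        }
        /* pgdiscard_direct / pgdiscard_kswapd, added by this patch */
        while (fgets(line, sizeof(line), f))
                if (!strncmp(line, "pgdiscard_", 10))
                        fputs(line, stdout);
        fclose(f);
        return 0;
}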

Cc: Andrew Morton <[email protected]>
Cc: Android Kernel Team <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dmitry Adamushko <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Mike Hommey <[email protected]>
Cc: Taras Glek <[email protected]>
Cc: Dhaval Giani <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: [email protected] <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
include/linux/vm_event_item.h | 2 ++
mm/vmstat.c | 2 ++
mm/vrange.c | 10 ++++++++++
3 files changed, 14 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index bd6cf61..c4aea92 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,6 +25,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
PGFREE, PGACTIVATE, PGDEACTIVATE,
PGFAULT, PGMAJFAULT,
+ PGDISCARD_DIRECT,
+ PGDISCARD_KSWAPD,
FOR_ALL_ZONES(PGREFILL),
FOR_ALL_ZONES(PGSTEAL_KSWAPD),
FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 20c2ef4..4f35f46 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -756,6 +756,8 @@ const char * const vmstat_text[] = {

"pgfault",
"pgmajfault",
+ "pgdiscard_direct",
+ "pgdiscard_kswapd",

TEXTS_FOR_ZONES("pgrefill")
TEXTS_FOR_ZONES("pgsteal_kswapd")
diff --git a/mm/vrange.c b/mm/vrange.c
index c30e3dd..8931fab 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -894,6 +894,10 @@ int discard_vpage(struct page *page)

if (page_freeze_refs(page, 1)) {
unlock_page(page);
+ if (current_is_kswapd())
+ count_vm_event(PGDISCARD_KSWAPD);
+ else
+ count_vm_event(PGDISCARD_DIRECT);
return 0;
}
}
@@ -1144,6 +1148,12 @@ static int discard_vrange(struct vrange *vrange)
ret = __discard_vrange_file(mapping, vrange, &nr_discard);
}

+ if (!ret) {
+ if (current_is_kswapd())
+ count_vm_events(PGDISCARD_KSWAPD, nr_discard);
+ else
+ count_vm_events(PGDISCARD_DIRECT, nr_discard);
+ }
out:
__vroot_put(vroot);
return nr_discard;
--
1.8.1.2

2013-10-03 00:52:35

by John Stultz

Subject: [PATCH 13/14] vrange: Allocate vroot dynamically

From: Minchan Kim <[email protected]>

This patch allocates the vroot dynamically when the vrange syscall is
called, so processes that never use the vrange syscall don't waste
the memory a vroot would occupy.

The vroot is allocated from a SLAB_DESTROY_BY_RCU cache, so we can't
guarantee the vroot's validity when we are about to access the vroot
of a different process. The access rules are therefore as follows:

1. rcu_read_lock
2. check vroot == NULL
3. increment vroot's refcount
4. rcu_read_unlock
5. vrange_lock(vroot)
6. get vrange from tree
7. check vrange->owner == vroot again, because the vroot could have
been reallocated for another object within the same RCU period.

If we're accessing the vroot from our own context, we can skip
the rcu & extra checking, since we know the vroot won't disappear
from under us while we're running.

Cc: Andrew Morton <[email protected]>
Cc: Android Kernel Team <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dmitry Adamushko <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Mike Hommey <[email protected]>
Cc: Taras Glek <[email protected]>
Cc: Dhaval Giani <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: [email protected] <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
[jstultz: Commit rewording, renamed functions, added helper functions]
Signed-off-by: John Stultz <[email protected]>
---
fs/inode.c | 4 +-
include/linux/fs.h | 2 +-
include/linux/mm_types.h | 2 +-
include/linux/vrange_types.h | 1 +
kernel/fork.c | 5 +-
mm/mmap.c | 2 +-
mm/vrange.c | 257 +++++++++++++++++++++++++++++++++++++++++--
7 files changed, 255 insertions(+), 18 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 5364f91..f5b8990 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -353,7 +353,6 @@ void address_space_init_once(struct address_space *mapping)
spin_lock_init(&mapping->private_lock);
mapping->i_mmap = RB_ROOT;
INIT_LIST_HEAD(&mapping->i_mmap_nonlinear);
- vrange_root_init(&mapping->vroot, VRANGE_FILE, mapping);
}
EXPORT_SYMBOL(address_space_init_once);

@@ -1421,7 +1420,8 @@ static void iput_final(struct inode *inode)
inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);

- vrange_root_cleanup(&inode->i_mapping->vroot);
+ vrange_root_cleanup(inode->i_mapping->vroot);
+ inode->i_mapping->vroot = NULL;

evict(inode);
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6ec2953..32ef488 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -415,7 +415,7 @@ struct address_space {
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
struct mutex i_mmap_mutex; /* protect tree, count, list */
#ifdef CONFIG_MMU
- struct vrange_root vroot;
+ struct vrange_root *vroot;
#endif
/* Protected by tree_lock together with the radix tree */
unsigned long nrpages; /* number of total pages */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5d8cdc3..ad7e2fc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -351,7 +351,7 @@ struct mm_struct {


#ifdef CONFIG_MMU
- struct vrange_root vroot;
+ struct vrange_root *vroot;
#endif
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
diff --git a/include/linux/vrange_types.h b/include/linux/vrange_types.h
index d7d451c..c4ef8b6 100644
--- a/include/linux/vrange_types.h
+++ b/include/linux/vrange_types.h
@@ -14,6 +14,7 @@ struct vrange_root {
struct mutex v_lock; /* Protect v_rb */
enum vrange_type type; /* range root type */
void *object; /* pointer to mm_struct or mapping */
+ atomic_t refcount;
};

struct vrange {
diff --git a/kernel/fork.c b/kernel/fork.c
index ceb38bf..16d58ca 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -545,9 +545,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
(current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
mm->core_state = NULL;
mm->nr_ptes = 0;
+ mm->vroot = NULL;
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
spin_lock_init(&mm->page_table_lock);
- vrange_root_init(&mm->vroot, VRANGE_MM, mm);
mm_init_aio(mm);
mm_init_owner(mm, p);

@@ -619,7 +619,8 @@ void mmput(struct mm_struct *mm)

if (atomic_dec_and_test(&mm->mm_users)) {
uprobe_clear_state(mm);
- vrange_root_cleanup(&mm->vroot);
+ vrange_root_cleanup(mm->vroot);
+ mm->vroot = NULL;
exit_aio(mm);
ksm_exit(mm);
khugepaged_exit(mm); /* must run before exit_mmap */
diff --git a/mm/mmap.c b/mm/mmap.c
index ed7056f..cb2f9e0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1505,7 +1505,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
munmap_back:

/* zap any volatile ranges */
- vrange_clear(&mm->vroot, addr, addr + len);
+ vrange_clear(mm->vroot, addr, addr + len);

if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent)) {
if (do_munmap(mm, addr, len))
diff --git a/mm/vrange.c b/mm/vrange.c
index 3f21dc9..c30e3dd 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -16,6 +16,7 @@
#include <linux/pagevec.h>

static struct kmem_cache *vrange_cachep;
+static struct kmem_cache *vroot_cachep;

static struct vrange_list {
struct list_head list;
@@ -44,12 +45,169 @@ static int __init vrange_init(void)
{
INIT_LIST_HEAD(&vrange_list.list);
mutex_init(&vrange_list.lock);
+ vroot_cachep = kmem_cache_create("vrange_root",
+ sizeof(struct vrange_root), 0,
+ SLAB_DESTROY_BY_RCU|SLAB_PANIC, NULL);
vrange_cachep = KMEM_CACHE(vrange, SLAB_PANIC);
register_shrinker(&vrange_shrinker);
return 0;
}
module_init(vrange_init);

+static struct vrange_root *__vroot_alloc(gfp_t flags)
+{
+ struct vrange_root *vroot = kmem_cache_alloc(vroot_cachep, flags);
+ if (!vroot)
+ return vroot;
+
+ atomic_set(&vroot->refcount, 1);
+ return vroot;
+}
+
+static inline int __vroot_get(struct vrange_root *vroot)
+{
+ if (!atomic_inc_not_zero(&vroot->refcount))
+ return 0;
+
+ return 1;
+}
+
+static inline void __vroot_put(struct vrange_root *vroot)
+{
+ if (atomic_dec_and_test(&vroot->refcount)) {
+ enum vrange_type type = vroot->type;
+ if (type == VRANGE_MM) {
+ struct mm_struct *mm = vroot->object;
+ mmdrop(mm);
+ } else if (type == VRANGE_FILE) {
+ /* TODO : */
+ } else
+ BUG();
+
+ WARN_ON(!RB_EMPTY_ROOT(&vroot->v_rb));
+ kmem_cache_free(vroot_cachep, vroot);
+ }
+}
+
+static bool __vroot_init_mm(struct vrange_root *vroot, struct mm_struct *mm)
+{
+ bool ret = false;
+
+ spin_lock(&mm->page_table_lock);
+ if (!mm->vroot) {
+ mm->vroot = vroot;
+ vrange_root_init(mm->vroot, VRANGE_MM, mm);
+ atomic_inc(&mm->mm_count);
+ ret = true;
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ return ret;
+}
+
+static bool __vroot_init_mapping(struct vrange_root *vroot,
+ struct address_space *mapping)
+{
+ bool ret = false;
+
+ mutex_lock(&mapping->i_mmap_mutex);
+ if (!mapping->vroot) {
+ mapping->vroot = vroot;
+ vrange_root_init(mapping->vroot, VRANGE_FILE, mapping);
+ /* XXX - inc ref count on mapping? */
+ ret = true;
+ }
+ mutex_unlock(&mapping->i_mmap_mutex);
+
+ return ret;
+}
+
+static struct vrange_root *vroot_alloc_mm(struct mm_struct *mm)
+{
+ struct vrange_root *ret, *allocated;
+
+ ret = NULL;
+ allocated = __vroot_alloc(GFP_NOFS);
+ if (!allocated)
+ return NULL;
+
+ if (__vroot_init_mm(allocated, mm)) {
+ ret = allocated;
+ allocated = NULL;
+ }
+
+ if (allocated)
+ __vroot_put(allocated);
+
+ return ret;
+}
+
+static struct vrange_root *vroot_alloc_vma(struct vm_area_struct *vma)
+{
+ struct vrange_root *ret, *allocated;
+ bool val;
+
+ ret = NULL;
+ allocated = __vroot_alloc(GFP_NOFS);
+ if (!allocated)
+ return NULL;
+
+ if (vma->vm_file && (vma->vm_flags & VM_SHARED))
+ val = __vroot_init_mapping(allocated, vma->vm_file->f_mapping);
+ else
+ val = __vroot_init_mm(allocated, vma->vm_mm);
+
+ if (val) {
+ ret = allocated;
+ allocated = NULL;
+ }
+
+ if (allocated)
+ __vroot_put(allocated);
+
+ return ret;
+}
+
+static struct vrange_root *vrange_get_vroot(struct vrange *vrange)
+{
+ struct vrange_root *vroot;
+ struct vrange_root *ret = NULL;
+
+ rcu_read_lock();
+ /*
+ * Prevent compiler from re-fetching vrange->owner while others
+ * clears vrange->owner.
+ */
+ vroot = ACCESS_ONCE(vrange->owner);
+ if (!vroot)
+ goto out;
+
+ /*
+ * vroot couldn't be destroyed while we're holding rcu_read_lock
+ * so it's okay to access vroot
+ */
+ if (!__vroot_get(vroot))
+ goto out;
+
+
+ /* If we reach here, the vroot may be ours or someone else's,
+ * because the slab object could have been reallocated for another
+ * owner within the same RCU period, so we must check carefully.
+ * Before a vroot is freed (and possibly reallocated), all vranges
+ * are detached from its tree, so if vrange->owner is still
+ * non-NULL here, the vroot is ours.
+ */
+ smp_rmb();
+ if (!vrange->owner) {
+ __vroot_put(vroot);
+ goto out;
+ }
+ ret = vroot;
+out:
+ rcu_read_unlock();
+ return ret;
+}
+
static struct vrange *__vrange_alloc(gfp_t flags)
{
struct vrange *vrange = kmem_cache_alloc(vrange_cachep, flags);
@@ -209,6 +367,9 @@ static int vrange_remove(struct vrange_root *vroot,
struct interval_tree_node *node, *next;
bool used_new = false;

+ if (!vroot)
+ return 0;
+
if (!purged)
return -EINVAL;

@@ -279,6 +440,9 @@ void vrange_root_cleanup(struct vrange_root *vroot)
struct vrange *range;
struct rb_node *node;

+ if (vroot == NULL)
+ return;
+
vrange_lock(vroot);
/* We should remove node by post-order traversal */
while ((node = rb_first(&vroot->v_rb))) {
@@ -287,6 +451,12 @@ void vrange_root_cleanup(struct vrange_root *vroot)
__vrange_put(range);
}
vrange_unlock(vroot);
+ /*
+ * Before dropping the vroot, make sure the ranges' owner fields
+ * are visible as NULL. See the smp_rmb in vrange_get_vroot.
+ */
+ smp_wmb();
+ __vroot_put(vroot);
}

/*
@@ -294,6 +464,7 @@ void vrange_root_cleanup(struct vrange_root *vroot)
* can't have copied own vrange data structure so that pages in the
* vrange couldn't be purged. It would be better rather than failing
* fork.
+ * The down_write of both mm->mmap_sem protects mm->vroot race.
*/
int vrange_fork(struct mm_struct *new_mm, struct mm_struct *old_mm)
{
@@ -301,8 +472,14 @@ int vrange_fork(struct mm_struct *new_mm, struct mm_struct *old_mm)
struct vrange *range, *new_range;
struct rb_node *next;

- new = &new_mm->vroot;
- old = &old_mm->vroot;
+ if (!old_mm->vroot)
+ return 0;
+
+ new = vroot_alloc_mm(new_mm);
+ if (!new)
+ return -ENOMEM;
+
+ old = old_mm->vroot;

vrange_lock(old);
next = rb_first(&old->v_rb);
@@ -323,6 +500,7 @@ int vrange_fork(struct mm_struct *new_mm, struct mm_struct *old_mm)

}
vrange_unlock(old);
+
return 0;
fail:
vrange_unlock(old);
@@ -335,9 +513,27 @@ static inline struct vrange_root *__vma_to_vroot(struct vm_area_struct *vma)
struct vrange_root *vroot = NULL;

if (vma->vm_file && (vma->vm_flags & VM_SHARED))
- vroot = &vma->vm_file->f_mapping->vroot;
+ vroot = vma->vm_file->f_mapping->vroot;
else
- vroot = &vma->vm_mm->vroot;
+ vroot = vma->vm_mm->vroot;
+
+ return vroot;
+}
+
+static inline struct vrange_root *__vma_to_vroot_get(struct vm_area_struct *vma)
+{
+ struct vrange_root *vroot = NULL;
+
+ rcu_read_lock();
+ vroot = __vma_to_vroot(vma);
+
+ if (!vroot)
+ goto out;
+
+ if (!__vroot_get(vroot))
+ vroot = NULL;
+out:
+ rcu_read_unlock();
return vroot;
}

@@ -383,6 +579,11 @@ static ssize_t do_vrange(struct mm_struct *mm, unsigned long start_idx,
tmp = end_idx;

vroot = __vma_to_vroot(vma);
+ if (!vroot)
+ vroot = vroot_alloc_vma(vma);
+ if (!vroot)
+ goto out;
+
vstart_idx = __vma_addr_to_index(vma, start_idx);
vend_idx = __vma_addr_to_index(vma, tmp);

@@ -495,17 +696,31 @@ out:
bool vrange_addr_volatile(struct vm_area_struct *vma, unsigned long addr)
{
struct vrange_root *vroot;
+ struct vrange *vrange;
unsigned long vstart_idx, vend_idx;
bool ret = false;

- vroot = __vma_to_vroot(vma);
+ vroot = __vma_to_vroot_get(vma);
+
+ if (!vroot)
+ return ret;
+
vstart_idx = __vma_addr_to_index(vma, addr);
vend_idx = vstart_idx + PAGE_SIZE - 1;

vrange_lock(vroot);
- if (__vrange_find(vroot, vstart_idx, vend_idx))
- ret = true;
+ vrange = __vrange_find(vroot, vstart_idx, vend_idx);
+ if (vrange) {
+ /*
+ * vroot can be allocated for another process in
+ * same period so let's check vroot's stability
+ */
+ if (likely(vroot == vrange->owner))
+ ret = true;
+ }
vrange_unlock(vroot);
+ __vroot_put(vroot);
+
return ret;
}

@@ -517,6 +732,8 @@ bool vrange_addr_purged(struct vm_area_struct *vma, unsigned long addr)
bool ret = false;

vroot = __vma_to_vroot(vma);
+ if (!vroot)
+ return false;
vstart_idx = __vma_addr_to_index(vma, addr);

vrange_lock(vroot);
@@ -550,6 +767,7 @@ static void try_to_discard_one(struct vrange_root *vroot, struct page *page,
pte_t pteval;
spinlock_t *ptl;

+ VM_BUG_ON(!vroot);
VM_BUG_ON(!PageLocked(page));

pte = page_check_address(page, mm, addr, &ptl, 0);
@@ -608,9 +826,11 @@ static int try_to_discard_anon_vpage(struct page *page)
anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
vma = avc->vma;
mm = vma->vm_mm;
- vroot = &mm->vroot;
- address = vma_address(page, vma);
+ vroot = __vma_to_vroot(vma);
+ if (!vroot)
+ continue;

+ address = vma_address(page, vma);
vrange_lock(vroot);
if (!__vrange_find(vroot, address, address + PAGE_SIZE - 1)) {
vrange_unlock(vroot);
@@ -634,10 +854,14 @@ static int try_to_discard_file_vpage(struct page *page)
mutex_lock(&mapping->i_mmap_mutex);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
unsigned long address = vma_address(page, vma);
- struct vrange_root *vroot = &mapping->vroot;
+ struct vrange_root *vroot;
long vstart_idx;

+ vroot = __vma_to_vroot(vma);
+ if (!vroot)
+ continue;
vstart_idx = __vma_addr_to_index(vma, address);
+
vrange_lock(vroot);
if (!__vrange_find(vroot, vstart_idx,
vstart_idx + PAGE_SIZE - 1)) {
@@ -901,7 +1125,16 @@ static int discard_vrange(struct vrange *vrange)
int ret = 0;
struct vrange_root *vroot;
unsigned int nr_discard = 0;
- vroot = vrange->owner;
+ vroot = vrange_get_vroot(vrange);
+ if (!vroot)
+ return 0;
+
+ /*
+ * A race on vrange->owner can happen with __vrange_remove,
+ * but that's okay because the subfunctions will check it again.
+ */
+ if (vrange->owner == NULL)
+ goto out;

if (vroot->type == VRANGE_MM) {
struct mm_struct *mm = vroot->object;
@@ -911,6 +1144,8 @@ static int discard_vrange(struct vrange *vrange)
ret = __discard_vrange_file(mapping, vrange, &nr_discard);
}

+out:
+ __vroot_put(vroot);
return nr_discard;
}

--
1.8.1.2

2013-10-03 00:53:07

by John Stultz

Subject: [PATCH 12/14] vrange: Support background purging for vrange-file

From: Minchan Kim <[email protected]>

Add support to purge vrange file pages via the shrinker interface.

This is useful, since some filesystems like shmem/tmpfs use anonymous
pages, which won't be aged off the page LRU if swap is disabled.
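
As a hedged illustration of what this enables, here is a sketch of
marking a shared tmpfs mapping volatile so the file-vrange shrinker
can reclaim it even without swap. The path, syscall number and flag
value are assumptions (see the cover-letter example), not something
this patch defines:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_vrange
#define __NR_vrange 314         /* assumed x86_64 slot */
#endif
#define VRANGE_VOLATILE 0       /* assumed value */

int main(void)
{
        size_t len = 64 * 4096;
        int purged = 0;
        int fd = open("/dev/shm/vrange-demo", O_CREAT | O_RDWR, 0600);
        char *map;

        if (fd < 0)
                return 1;
        ftruncate(fd, len);
        map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        memset(map, 0x5a, len); /* shmem pages: anonymous, won't age w/o swap */

        /* A shared file mapping lands in the mapping's vroot, so this
         * range becomes purgeable by the shrinker added in this patch. */
        if (syscall(__NR_vrange, map, len, VRANGE_VOLATILE, &purged) < 0)
                perror("vrange");

        munmap(map, len);
        close(fd);
        unlink("/dev/shm/vrange-demo");
        return 0;
}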

Cc: Andrew Morton <[email protected]>
Cc: Android Kernel Team <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dmitry Adamushko <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Mike Hommey <[email protected]>
Cc: Taras Glek <[email protected]>
Cc: Dhaval Giani <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: [email protected] <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
[jstultz: Commit message tweaks]
Signed-off-by: John Stultz <[email protected]>
---
mm/vrange.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 49 insertions(+), 7 deletions(-)

diff --git a/mm/vrange.c b/mm/vrange.c
index c6bc32f..3f21dc9 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -13,6 +13,7 @@
#include <linux/mmu_notifier.h>
#include <linux/mm_inline.h>
#include <linux/migrate.h>
+#include <linux/pagevec.h>

static struct kmem_cache *vrange_cachep;

@@ -854,21 +855,62 @@ out:
return ret;
}

+static int __discard_vrange_file(struct address_space *mapping,
+ struct vrange *vrange, unsigned int *ret_discard)
+{
+ struct pagevec pvec;
+ pgoff_t index;
+ int i;
+ unsigned int nr_discard = 0;
+ unsigned long start_idx = vrange->node.start;
+ unsigned long end_idx = vrange->node.last;
+ const pgoff_t start = start_idx >> PAGE_CACHE_SHIFT;
+ pgoff_t end = end_idx >> PAGE_CACHE_SHIFT;
+ LIST_HEAD(pagelist);
+
+ pagevec_init(&pvec, 0);
+ index = start;
+ while (index <= end && pagevec_lookup(&pvec, mapping, index,
+ min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+ for (i = 0; i < pagevec_count(&pvec); i++) {
+ struct page *page = pvec.pages[i];
+ index = page->index;
+ if (index > end)
+ break;
+ if (isolate_lru_page(page))
+ continue;
+ list_add(&page->lru, &pagelist);
+ inc_zone_page_state(page, NR_ISOLATED_ANON);
+ }
+ pagevec_release(&pvec);
+ cond_resched();
+ index++;
+ }
+
+ if (!list_empty(&pagelist))
+ nr_discard = discard_vrange_pagelist(&pagelist);
+
+ *ret_discard = nr_discard;
+ putback_lru_pages(&pagelist);
+
+ return 0;
+}
+
static int discard_vrange(struct vrange *vrange)
{
int ret = 0;
- struct mm_struct *mm;
struct vrange_root *vroot;
unsigned int nr_discard = 0;
vroot = vrange->owner;

- /* TODO : handle VRANGE_FILE */
- if (vroot->type != VRANGE_MM)
- goto out;
+ if (vroot->type == VRANGE_MM) {
+ struct mm_struct *mm = vroot->object;
+ ret = __discard_vrange_anon(mm, vrange, &nr_discard);
+ } else if (vroot->type == VRANGE_FILE) {
+ struct address_space *mapping = vroot->object;
+ ret = __discard_vrange_file(mapping, vrange, &nr_discard);
+ }

- mm = vroot->object;
- ret = __discard_vrange_anon(mm, vrange, &nr_discard);
-out:
return nr_discard;
}

--
1.8.1.2

2013-10-03 00:52:26

by John Stultz

Subject: [PATCH 10/14] vrange: Add core shrinking logic for swapless system

From: Minchan Kim <[email protected]>

This patch adds the core volatile range shrinking logic
needed to allow volatile range purging to function on
swapless systems.

This patch does not wire in the specific range purging logic,
but that will be added in the following patches.

The reason I use a shrinker is that Dave and Glauber are working on
making the slab shrinker node/memcg aware, so if that patchset reaches
mainline, we can easily support node/memcg in vrange as well.

Another reason I selected the slab shrinker is that it is normally
called after reclaiming file-backed pages (e.g. page cache), so I
expect the reclaim preference to end up as follows (TODO: investigate;
the reclaim path might need more tuning):

page cache -> vrange by slab shrinking -> anon page

This makes sense because the page cache can hold streaming data, so
there is no point in shrinking vrange pages while there are lots of
streaming pages in the page cache.

In this version, I haven't verified that it works well, but that's
the design concept, and we can make it work by modifying the page
reclaim path. I will experiment further.

One disadvantage of using the slab shrinker is that it isn't invoked
for memcg reclaim, so a memcg-noswap system cannot take advantage of
it. Maybe I will hook into the reclaim code somewhere to control
vrange page shrinking more freely.

Cc: Andrew Morton <[email protected]>
Cc: Android Kernel Team <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dmitry Adamushko <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Mike Hommey <[email protected]>
Cc: Taras Glek <[email protected]>
Cc: Dhaval Giani <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: [email protected] <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
[jstultz: Renamed some functions and minor cleanups]
Signed-off-by: John Stultz <[email protected]>
---
mm/vrange.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 86 insertions(+), 3 deletions(-)

diff --git a/mm/vrange.c b/mm/vrange.c
index 33e3ac1..e7c5a25 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -25,11 +25,19 @@ static inline unsigned int vrange_size(struct vrange *range)
return range->node.last + 1 - range->node.start;
}

+static int shrink_vrange(struct shrinker *s, struct shrink_control *sc);
+
+static struct shrinker vrange_shrinker = {
+ .shrink = shrink_vrange,
+ .seeks = DEFAULT_SEEKS
+};
+
static int __init vrange_init(void)
{
INIT_LIST_HEAD(&vrange_list.list);
mutex_init(&vrange_list.lock);
vrange_cachep = KMEM_CACHE(vrange, SLAB_PANIC);
+ register_shrinker(&vrange_shrinker);
return 0;
}
module_init(vrange_init);
@@ -58,9 +66,14 @@ static void __vrange_free(struct vrange *range)
static inline void __vrange_lru_add(struct vrange *range)
{
mutex_lock(&vrange_list.lock);
- WARN_ON(!list_empty(&range->lru));
- list_add(&range->lru, &vrange_list.list);
- vrange_list.size += vrange_size(range);
+ /*
+ * We need this check because we could race with
+ * shrink_vrange and vrange_resize
+ */
+ if (list_empty(&range->lru)) {
+ list_add(&range->lru, &vrange_list.list);
+ vrange_list.size += vrange_size(range);
+ }
mutex_unlock(&vrange_list.lock);
}

@@ -84,6 +97,14 @@ static void __vrange_add(struct vrange *range, struct vrange_root *vroot)
__vrange_lru_add(range);
}

+static inline int __vrange_get(struct vrange *vrange)
+{
+ if (!atomic_inc_not_zero(&vrange->refcount))
+ return 0;
+
+ return 1;
+}
+
static inline void __vrange_put(struct vrange *range)
{
if (atomic_dec_and_test(&range->refcount)) {
@@ -647,3 +668,65 @@ int discard_vpage(struct page *page)

return 1;
}
+
+static struct vrange *vrange_isolate(void)
+{
+ struct vrange *vrange = NULL;
+ mutex_lock(&vrange_list.lock);
+ while (!list_empty(&vrange_list.list)) {
+ vrange = list_entry(vrange_list.list.prev,
+ struct vrange, lru);
+ list_del_init(&vrange->lru);
+ vrange_list.size -= vrange_size(vrange);
+
+ /* if we can't get a reference, the vrange is being destroyed */
+ if (__vrange_get(vrange))
+ break;
+
+ vrange = NULL;
+ }
+
+ mutex_unlock(&vrange_list.lock);
+ return vrange;
+}
+
+static unsigned int discard_vrange(struct vrange *vrange)
+{
+ return 0;
+}
+
+static int shrink_vrange(struct shrinker *s, struct shrink_control *sc)
+{
+ struct vrange *range = NULL;
+ long nr_to_scan = sc->nr_to_scan;
+ long size = vrange_list.size;
+
+ if (!nr_to_scan)
+ return size;
+
+ if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_IO))
+ return -1;
+
+ while (size > 0 && nr_to_scan > 0) {
+ range = vrange_isolate();
+ if (!range)
+ break;
+
+ /* range is removing so don't bother */
+ if (!range->owner) {
+ __vrange_put(range);
+ size -= vrange_size(range);
+ nr_to_scan -= vrange_size(range);
+ continue;
+ }
+
+ if (discard_vrange(range) < 0)
+ __vrange_lru_add(range);
+ __vrange_put(range);
+
+ size -= vrange_size(range);
+ nr_to_scan -= vrange_size(range);
+ }
+
+ return size;
+}
--
1.8.1.2

2013-10-03 00:53:43

by John Stultz

Subject: [PATCH 11/14] vrange: Purging vrange-anon pages from shrinker

From: Minchan Kim <[email protected]>

This patch provides the logic to discard anonymous
vranges from the shrinker, by generating the page list
for the volatile ranges, setting the ptes volatile, and
discarding the pages.

Cc: Andrew Morton <[email protected]>
Cc: Android Kernel Team <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dmitry Adamushko <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Mike Hommey <[email protected]>
Cc: Taras Glek <[email protected]>
Cc: Dhaval Giani <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: [email protected] <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
[jstultz: Code tweaks and commit log rewording]
Signed-off-by: John Stultz <[email protected]>
---
mm/vrange.c | 179 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 178 insertions(+), 1 deletion(-)

diff --git a/mm/vrange.c b/mm/vrange.c
index e7c5a25..c6bc32f 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -11,6 +11,8 @@
#include <linux/hugetlb.h>
#include "internal.h"
#include <linux/mmu_notifier.h>
+#include <linux/mm_inline.h>
+#include <linux/migrate.h>

static struct kmem_cache *vrange_cachep;

@@ -20,6 +22,11 @@ static struct vrange_list {
struct mutex lock;
} vrange_list;

+struct vrange_walker {
+ struct vm_area_struct *vma;
+ struct list_head *pagelist;
+};
+
static inline unsigned int vrange_size(struct vrange *range)
{
return range->node.last + 1 - range->node.start;
@@ -690,11 +697,181 @@ static struct vrange *vrange_isolate(void)
return vrange;
}

-static unsigned int discard_vrange(struct vrange *vrange)
+static unsigned int discard_vrange_pagelist(struct list_head *page_list)
+{
+ struct page *page;
+ unsigned int nr_discard = 0;
+ LIST_HEAD(ret_pages);
+ LIST_HEAD(free_pages);
+
+ while (!list_empty(page_list)) {
+ int err;
+ page = list_entry(page_list->prev, struct page, lru);
+ list_del(&page->lru);
+ if (!trylock_page(page)) {
+ list_add(&page->lru, &ret_pages);
+ continue;
+ }
+
+ /*
+ * discard_vpage returns unlocked page if it
+ * is successful
+ */
+ err = discard_vpage(page);
+ if (err) {
+ unlock_page(page);
+ list_add(&page->lru, &ret_pages);
+ continue;
+ }
+
+ ClearPageActive(page);
+ list_add(&page->lru, &free_pages);
+ dec_zone_page_state(page, NR_ISOLATED_ANON);
+ nr_discard++;
+ }
+
+ free_hot_cold_page_list(&free_pages, 1);
+ list_splice(&ret_pages, page_list);
+ return nr_discard;
+}
+
+static void vrange_pte_entry(pte_t pteval, unsigned long address,
+ unsigned ptent_size, struct mm_walk *walk)
+{
+ struct page *page;
+ struct vrange_walker *vw = walk->private;
+ struct vm_area_struct *vma = vw->vma;
+ struct list_head *pagelist = vw->pagelist;
+
+ if (pte_none(pteval))
+ return;
+
+ if (!pte_present(pteval))
+ return;
+
+ page = vm_normal_page(vma, address, pteval);
+ if (unlikely(!page))
+ return;
+
+ if (!PageLRU(page) || PageLocked(page))
+ return;
+
+ /* TODO : Support THP */
+ if (unlikely(PageCompound(page)))
+ return;
+
+ if (isolate_lru_page(page))
+ return;
+
+ list_add(&page->lru, pagelist);
+
+ VM_BUG_ON(page_is_file_cache(page));
+ inc_zone_page_state(page, NR_ISOLATED_ANON);
+}
+
+static int vrange_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
{
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ for (; addr != end; pte++, addr += PAGE_SIZE)
+ vrange_pte_entry(*pte, addr, PAGE_SIZE, walk);
+ pte_unmap_unlock(pte - 1, ptl);
+ cond_resched();
+
return 0;
}

+static unsigned int discard_vma_pages(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long start,
+ unsigned long end)
+{
+ unsigned int ret = 0;
+ LIST_HEAD(pagelist);
+ struct vrange_walker vw;
+ struct mm_walk vrange_walk = {
+ .pmd_entry = vrange_pte_range,
+ .mm = vma->vm_mm,
+ .private = &vw,
+ };
+
+ vw.pagelist = &pagelist;
+ vw.vma = vma;
+
+ walk_page_range(start, end, &vrange_walk);
+
+ if (!list_empty(&pagelist))
+ ret = discard_vrange_pagelist(&pagelist);
+
+ putback_lru_pages(&pagelist);
+ return ret;
+}
+
+/*
+ * vrange->owner isn't stable because caller doesn't hold vrange_lock
+ * so avoid touching vrange->owner.
+ */
+static int __discard_vrange_anon(struct mm_struct *mm, struct vrange *vrange,
+ unsigned int *ret_discard)
+{
+ struct vm_area_struct *vma;
+ unsigned int nr_discard = 0;
+ unsigned long start = vrange->node.start;
+ unsigned long end = vrange->node.last + 1;
+ int ret = 0;
+
+ /* Pin mm_users so the vmas aren't destroyed while the process exits */
+ if (!atomic_inc_not_zero(&mm->mm_users))
+ return ret;
+
+ if (!down_read_trylock(&mm->mmap_sem)) {
+ mmput(mm);
+ ret = -EBUSY;
+ goto out; /* this vrange could be retried */
+ }
+
+ vma = find_vma(mm, start);
+ if (!vma || (vma->vm_start >= end))
+ goto out_unlock;
+
+ for (; vma; vma = vma->vm_next) {
+ if (vma->vm_start >= end)
+ break;
+ BUG_ON(vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
+ VM_HUGETLB));
+ cond_resched();
+ nr_discard += discard_vma_pages(mm, vma,
+ max_t(unsigned long, start, vma->vm_start),
+ min_t(unsigned long, end, vma->vm_end));
+ }
+out_unlock:
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ *ret_discard = nr_discard;
+out:
+ return ret;
+}
+
+static int discard_vrange(struct vrange *vrange)
+{
+ int ret = 0;
+ struct mm_struct *mm;
+ struct vrange_root *vroot;
+ unsigned int nr_discard = 0;
+ vroot = vrange->owner;
+
+ /* TODO : handle VRANGE_FILE */
+ if (vroot->type != VRANGE_MM)
+ goto out;
+
+ mm = vroot->object;
+ ret = __discard_vrange_anon(mm, vrange, &nr_discard);
+out:
+ return nr_discard;
+}
+
static int shrink_vrange(struct shrinker *s, struct shrink_control *sc)
{
struct vrange *range = NULL;
--
1.8.1.2

2013-10-03 00:53:59

by John Stultz

Subject: [PATCH 08/14] vrange: Send SIGBUS when user try to access purged page

From: Minchan Kim <[email protected]>

By vrange(2) semantics, a user should see SIGBUS if they try to
access a purged page without first marking the memory as non-volatile
(ie, vrange(...VRANGE_NOVOLATILE)).

This allows for optimistic traversal of volatile pages without having
to mark them non-volatile first, and the SIGBUS allows applications
to trap and fix up the purged range before accessing it again.

This patch implements it by adding SWP_VRANGE, which consumes one
entry from MAX_SWAPFILES. That means the worst case for MAX_SWAPFILES
on 32 bit is 32 - 2 - 1 - 1 = 28, which I think is still enough for
everybody. If someone complains and thinks we shouldn't consume an
entry, I will change it to use (swp_type 0, pgoffset 0), which is the
swap header and so can never be allocated as a swp_pte for swapout,
meaning it is free for us to use.
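
To illustrate the usage pattern this enables, here is a hedged
userspace sketch of the optimistic-access flow: touch possibly-purged
volatile pages directly and only fall back to regeneration when SIGBUS
reports a purged page. The syscall number and flag values are the same
assumptions as in the cover-letter example:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_vrange
#define __NR_vrange 314         /* assumed x86_64 slot */
#endif
#define VRANGE_VOLATILE   0     /* assumed values */
#define VRANGE_NOVOLATILE 1

static sigjmp_buf purged_jmp;

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
        (void)sig; (void)info; (void)ctx;
        siglongjmp(purged_jmp, 1);      /* jump back to the fixup path */
}

int main(void)
{
        struct sigaction sa;
        size_t len = 4 * 4096;
        int purged = 0;
        volatile char sum = 0;
        char *cache = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = sigbus_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);

        memset(cache, 1, len);
        syscall(__NR_vrange, cache, len, VRANGE_VOLATILE, &purged);

        if (sigsetjmp(purged_jmp, 1)) {
                /* A purged page faulted: make the range stable and rebuild */
                syscall(__NR_vrange, cache, len, VRANGE_NOVOLATILE, &purged);
                memset(cache, 1, len);
        }
        sum += cache[0];        /* optimistic read; may raise SIGBUS if purged */

        printf("read ok, sum=%d\n", sum);
        munmap(cache, len);
        return 0;
}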

Cc: Andrew Morton <[email protected]>
Cc: Android Kernel Team <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dmitry Adamushko <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Mike Hommey <[email protected]>
Cc: Taras Glek <[email protected]>
Cc: Dhaval Giani <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: [email protected] <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
include/linux/swap.h | 6 +++++-
include/linux/vrange.h | 20 ++++++++++++++++++++
mm/memory.c | 27 +++++++++++++++++++++++++++
mm/mincore.c | 5 ++++-
mm/vrange.c | 20 +++++++++++++++++++-
5 files changed, 75 insertions(+), 3 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d95cde5..7fd1006 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -49,6 +49,9 @@ static inline int current_is_kswapd(void)
* actions on faults.
*/

+#define SWP_VRANGE_NUM 1
+#define SWP_VRANGE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
+
/*
* NUMA node memory migration support
*/
@@ -71,7 +74,8 @@ static inline int current_is_kswapd(void)
#endif

#define MAX_SWAPFILES \
- ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+ ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM \
+ - SWP_VRANGE_NUM)

/*
* Magic header for a swap area. The first part of the union is
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 778902d..50b9131 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -3,6 +3,8 @@

#include <linux/vrange_types.h>
#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>

#define vrange_from_node(node_ptr) \
container_of(node_ptr, struct vrange, node)
@@ -12,6 +14,16 @@

#ifdef CONFIG_MMU

+static inline swp_entry_t make_vrange_entry(void)
+{
+ return swp_entry(SWP_VRANGE, 0);
+}
+
+static inline int is_vrange_entry(swp_entry_t entry)
+{
+ return swp_type(entry) == SWP_VRANGE;
+}
+
static inline void vrange_root_init(struct vrange_root *vroot, int type,
void *object)
{
@@ -44,6 +56,9 @@ extern int vrange_fork(struct mm_struct *new,
int discard_vpage(struct page *page);
bool vrange_addr_volatile(struct vm_area_struct *vma, unsigned long addr);

+extern bool vrange_addr_purged(struct vm_area_struct *vma,
+ unsigned long address);
+
#else

static inline void vrange_root_init(struct vrange_root *vroot,
@@ -60,5 +75,10 @@ static inline bool vrange_addr_volatile(struct vm_area_struct *vma,
return false;
}
static inline int discard_vpage(struct page *page) { return 0 };
+static inline bool vrange_addr_purged(struct vm_area_struct *vma,
+ unsigned long address)
+{
+ return false;
+};
#endif
#endif /* _LINIUX_VRANGE_H */
diff --git a/mm/memory.c b/mm/memory.c
index af84bc0..e33dbce 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -59,6 +59,7 @@
#include <linux/gfp.h>
#include <linux/migrate.h>
#include <linux/string.h>
+#include <linux/vrange.h>

#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -831,6 +832,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (unlikely(!pte_present(pte))) {
if (!pte_file(pte)) {
swp_entry_t entry = pte_to_swp_entry(pte);
+ if (is_vrange_entry(entry))
+ goto out_set_pte;

if (swap_duplicate(entry) < 0)
return entry.val;
@@ -1174,6 +1177,8 @@ again:
print_bad_pte(vma, addr, ptent, NULL);
} else {
swp_entry_t entry = pte_to_swp_entry(ptent);
+ if (is_vrange_entry(entry))
+ goto out;

if (!non_swap_entry(entry))
rss[MM_SWAPENTS]--;
@@ -1190,6 +1195,7 @@ again:
if (unlikely(!free_swap_and_cache(entry)))
print_bad_pte(vma, addr, ptent, NULL);
}
+out:
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
} while (pte++, addr += PAGE_SIZE, addr != end);

@@ -3715,15 +3721,36 @@ int handle_pte_fault(struct mm_struct *mm,

entry = *pte;
if (!pte_present(entry)) {
+ swp_entry_t vrange_entry;
+
if (pte_none(entry)) {
if (vma->vm_ops) {
if (likely(vma->vm_ops->fault))
return do_linear_fault(mm, vma, address,
pte, pmd, flags, entry);
}
+anon:
return do_anonymous_page(mm, vma, address,
pte, pmd, flags);
}
+
+ vrange_entry = pte_to_swp_entry(entry);
+ if (unlikely(is_vrange_entry(vrange_entry))) {
+ if (!vrange_addr_purged(vma, address)) {
+ /* zap pte */
+ ptl = pte_lockptr(mm, pmd);
+ spin_lock(ptl);
+ if (unlikely(!pte_same(*pte, entry)))
+ goto unlock;
+ flush_cache_page(vma, address, pte_pfn(*pte));
+ ptep_clear_flush(vma, address, pte);
+ pte_unmap_unlock(pte, ptl);
+ goto anon;
+ }
+
+ return VM_FAULT_SIGBUS;
+ }
+
if (pte_file(entry))
return do_nonlinear_fault(mm, vma, address,
pte, pmd, flags, entry);
diff --git a/mm/mincore.c b/mm/mincore.c
index da2be56..2a95eef 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -15,6 +15,7 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/hugetlb.h>
+#include <linux/vrange.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -129,7 +130,9 @@ static void mincore_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
} else { /* pte is a swap entry */
swp_entry_t entry = pte_to_swp_entry(pte);

- if (is_migration_entry(entry)) {
+ if (is_vrange_entry(entry))
+ *vec = 0;
+ else if (is_migration_entry(entry)) {
/* migration entries are always uptodate */
*vec = 1;
} else {
diff --git a/mm/vrange.c b/mm/vrange.c
index c72e72d..c19a966 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -10,7 +10,6 @@
#include <linux/rmap.h>
#include <linux/hugetlb.h>
#include "internal.h"
-#include <linux/swap.h>
#include <linux/mmu_notifier.h>

static struct kmem_cache *vrange_cachep;
@@ -430,6 +429,24 @@ bool vrange_addr_volatile(struct vm_area_struct *vma, unsigned long addr)
return ret;
}

+bool vrange_addr_purged(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct vrange_root *vroot;
+ struct vrange *range;
+ unsigned long vstart_idx;
+ bool ret = false;
+
+ vroot = __vma_to_vroot(vma);
+ vstart_idx = __vma_addr_to_index(vma, addr);
+
+ vrange_lock(vroot);
+ range = __vrange_find(vroot, vstart_idx, vstart_idx + PAGE_SIZE - 1);
+ if (range && range->purged)
+ ret = true;
+ vrange_unlock(vroot);
+ return ret;
+}
+
/* Caller should hold vrange_lock */
static void do_purge(struct vrange_root *vroot,
unsigned long start_idx, unsigned long end_idx)
@@ -473,6 +490,7 @@ static void try_to_discard_one(struct vrange_root *vroot, struct page *page,
page_remove_rmap(page);
page_cache_release(page);

+ set_pte_at(mm, addr, pte, swp_entry_to_pte(make_vrange_entry()));
pte_unmap_unlock(pte, ptl);
mmu_notifier_invalidate_page(mm, addr);

--
1.8.1.2

2013-10-03 00:52:17

by John Stultz

[permalink] [raw]
Subject: [PATCH 04/14] vrange: Add support for volatile ranges on file mappings

Like with the mm_struct, this patch adds basic support for
volatile ranges on file address_space structures. This allows
volatile ranges to be set on mmapped files that can be
shared between processes.

The semantics of volatile range sharing are that the
volatility is shared, just as the data is shared. Thus
if one process marks the range as volatile, the data is
volatile in all processes that have those pages mapped.

It is advised that processes coordinate when using volatile
ranges on shared mappings (much as they must coordinate when
writing to shared data).
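
As a rough illustration (not part of this patch): because the
volatility lives on the file pages themselves, marking a MAP_SHARED
mapping volatile affects every process that maps the file. The sketch
below assumes the vrange(2) syscall number and mode constants added
later in this series (patch 05); the file path is arbitrary.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_vrange 314		/* from the x86-64 table in patch 05 */
#define VRANGE_VOLATILE 0
#define VRANGE_NONVOLATILE 1

int main(void)
{
	size_t len = 4096;
	int fd = open("/tmp/vrange-demo", O_RDWR | O_CREAT, 0600);
	int purged = 0;
	char *p;

	if (fd < 0 || ftruncate(fd, len) < 0)
		return 1;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Marking the shared file pages volatile here makes them volatile
	 * in every other process that has this file mapped as well. */
	syscall(__NR_vrange, (unsigned long)p, len, VRANGE_VOLATILE,
		(int *)NULL);

	/* Any of the sharing processes can later clear the volatility and
	 * learn whether the shared data was purged in the meantime. */
	syscall(__NR_vrange, (unsigned long)p, len, VRANGE_NONVOLATILE,
		&purged);
	if (purged)
		memset(p, 0, len);	/* regenerate the shared contents */
	return 0;
}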

Cc: Andrew Morton <[email protected]>
Cc: Android Kernel Team <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dmitry Adamushko <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Mike Hommey <[email protected]>
Cc: Taras Glek <[email protected]>
Cc: Dhaval Giani <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: [email protected] <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
fs/inode.c | 4 ++++
include/linux/fs.h | 4 ++++
2 files changed, 8 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index d6dfb09..5364f91 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -17,6 +17,7 @@
#include <linux/prefetch.h>
#include <linux/buffer_head.h> /* for inode_has_buffers */
#include <linux/ratelimit.h>
+#include <linux/vrange.h>
#include "internal.h"

/*
@@ -352,6 +353,7 @@ void address_space_init_once(struct address_space *mapping)
spin_lock_init(&mapping->private_lock);
mapping->i_mmap = RB_ROOT;
INIT_LIST_HEAD(&mapping->i_mmap_nonlinear);
+ vrange_root_init(&mapping->vroot, VRANGE_FILE, mapping);
}
EXPORT_SYMBOL(address_space_init_once);

@@ -1419,6 +1421,8 @@ static void iput_final(struct inode *inode)
inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);

+ vrange_root_cleanup(&inode->i_mapping->vroot);
+
evict(inode);
}

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9818747..6ec2953 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -28,6 +28,7 @@
#include <linux/lockdep.h>
#include <linux/percpu-rwsem.h>
#include <linux/blk_types.h>
+#include <linux/vrange_types.h>

#include <asm/byteorder.h>
#include <uapi/linux/fs.h>
@@ -413,6 +414,9 @@ struct address_space {
struct rb_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
struct mutex i_mmap_mutex; /* protect tree, count, list */
+#ifdef CONFIG_MMU
+ struct vrange_root vroot;
+#endif
/* Protected by tree_lock together with the radix tree */
unsigned long nrpages; /* number of total pages */
pgoff_t writeback_index;/* writeback starts here */
--
1.8.1.2

2013-10-03 00:54:28

by John Stultz

[permalink] [raw]
Subject: [PATCH 07/14] vrange: Purge volatile pages when memory is tight

From: Minchan Kim <[email protected]>

This patch adds purging logic for volatile pages to the direct
reclaim path, so that if vrange pages are selected as victims by
the VM, they can be discarded rather than swapped out.

Direct purging doesn't consider a volatile page's age, because it
is better to free such a page than to swap out other working-set
pages. This makes sense because userspace has said "please free
these pages when memory is tight" via the vrange syscall.

This however is an in-kernel behavior and the purging logic
could later change. Applications should not assume anything
about the volatile page purging order, much as they shouldn't
assume anything about the page swapout order.

Cc: Andrew Morton <[email protected]>
Cc: Android Kernel Team <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dmitry Adamushko <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Mike Hommey <[email protected]>
Cc: Taras Glek <[email protected]>
Cc: Dhaval Giani <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: [email protected] <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
include/linux/rmap.h | 11 +++++++----
mm/ksm.c | 2 +-
mm/rmap.c | 28 ++++++++++++++++++++--------
mm/vmscan.c | 17 +++++++++++++++--
4 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 6dacb93..f38185d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -181,10 +181,11 @@ static inline void page_dup_rmap(struct page *page)
/*
* Called from mm/vmscan.c to handle paging out
*/
-int page_referenced(struct page *, int is_locked,
- struct mem_cgroup *memcg, unsigned long *vm_flags);
+int page_referenced(struct page *, int is_locked, struct mem_cgroup *memcg,
+ unsigned long *vm_flags, int *is_vrange);
int page_referenced_one(struct page *, struct vm_area_struct *,
- unsigned long address, unsigned int *mapcount, unsigned long *vm_flags);
+ unsigned long address, unsigned int *mapcount,
+ unsigned long *vm_flags, int *is_vrange);

#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)

@@ -249,9 +250,11 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,

static inline int page_referenced(struct page *page, int is_locked,
struct mem_cgroup *memcg,
- unsigned long *vm_flags)
+ unsigned long *vm_flags,
+ int *is_vrange)
{
*vm_flags = 0;
+ *is_vrange = 0;
return 0;
}

diff --git a/mm/ksm.c b/mm/ksm.c
index b6afe0c..debc20c 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1932,7 +1932,7 @@ again:
continue;

referenced += page_referenced_one(page, vma,
- rmap_item->address, &mapcount, vm_flags);
+ rmap_item->address, &mapcount, vm_flags, NULL);
if (!search_new_forks || !mapcount)
break;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index b2e29ac..f929f22 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -57,6 +57,7 @@
#include <linux/migrate.h>
#include <linux/hugetlb.h>
#include <linux/backing-dev.h>
+#include <linux/vrange.h>

#include <asm/tlbflush.h>

@@ -662,7 +663,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
*/
int page_referenced_one(struct page *page, struct vm_area_struct *vma,
unsigned long address, unsigned int *mapcount,
- unsigned long *vm_flags)
+ unsigned long *vm_flags, int *is_vrange)
{
struct mm_struct *mm = vma->vm_mm;
int referenced = 0;
@@ -724,6 +725,11 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
referenced++;
}
pte_unmap_unlock(pte, ptl);
+ if (is_vrange && vrange_addr_volatile(vma, address)) {
+ *is_vrange = 1;
+ *mapcount = 0; /* break early from loop */
+ goto out;
+ }
}

(*mapcount)--;
@@ -736,7 +742,7 @@ out:

static int page_referenced_anon(struct page *page,
struct mem_cgroup *memcg,
- unsigned long *vm_flags)
+ unsigned long *vm_flags, int *is_vrange)
{
unsigned int mapcount;
struct anon_vma *anon_vma;
@@ -761,7 +767,8 @@ static int page_referenced_anon(struct page *page,
if (memcg && !mm_match_cgroup(vma->vm_mm, memcg))
continue;
referenced += page_referenced_one(page, vma, address,
- &mapcount, vm_flags);
+ &mapcount, vm_flags,
+ is_vrange);
if (!mapcount)
break;
}
@@ -785,7 +792,7 @@ static int page_referenced_anon(struct page *page,
*/
static int page_referenced_file(struct page *page,
struct mem_cgroup *memcg,
- unsigned long *vm_flags)
+ unsigned long *vm_flags, int *is_vrange)
{
unsigned int mapcount;
struct address_space *mapping = page->mapping;
@@ -826,7 +833,8 @@ static int page_referenced_file(struct page *page,
if (memcg && !mm_match_cgroup(vma->vm_mm, memcg))
continue;
referenced += page_referenced_one(page, vma, address,
- &mapcount, vm_flags);
+ &mapcount, vm_flags,
+ is_vrange);
if (!mapcount)
break;
}
@@ -841,6 +849,7 @@ static int page_referenced_file(struct page *page,
* @is_locked: caller holds lock on the page
* @memcg: target memory cgroup
* @vm_flags: collect encountered vma->vm_flags who actually referenced the page
+ * @is_vrange: Is @page in vrange?
*
* Quick test_and_clear_referenced for all mappings to a page,
* returns the number of ptes which referenced the page.
@@ -848,7 +857,8 @@ static int page_referenced_file(struct page *page,
int page_referenced(struct page *page,
int is_locked,
struct mem_cgroup *memcg,
- unsigned long *vm_flags)
+ unsigned long *vm_flags,
+ int *is_vrange)
{
int referenced = 0;
int we_locked = 0;
@@ -867,10 +877,12 @@ int page_referenced(struct page *page,
vm_flags);
else if (PageAnon(page))
referenced += page_referenced_anon(page, memcg,
- vm_flags);
+ vm_flags,
+ is_vrange);
else if (page->mapping)
referenced += page_referenced_file(page, memcg,
- vm_flags);
+ vm_flags,
+ is_vrange);
if (we_locked)
unlock_page(page);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2cff0d4..ab377b6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -43,6 +43,7 @@
#include <linux/sysctl.h>
#include <linux/oom.h>
#include <linux/prefetch.h>
+#include <linux/vrange.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -610,17 +611,19 @@ enum page_references {
PAGEREF_RECLAIM,
PAGEREF_RECLAIM_CLEAN,
PAGEREF_KEEP,
+ PAGEREF_DISCARD,
PAGEREF_ACTIVATE,
};

static enum page_references page_check_references(struct page *page,
struct scan_control *sc)
{
+ int is_vrange = 0;
int referenced_ptes, referenced_page;
unsigned long vm_flags;

referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
- &vm_flags);
+ &vm_flags, &is_vrange);
referenced_page = TestClearPageReferenced(page);

/*
@@ -630,6 +633,13 @@ static enum page_references page_check_references(struct page *page,
if (vm_flags & VM_LOCKED)
return PAGEREF_RECLAIM;

+ /*
+ * If a volatile page reaches the LRU's tail, we discard the
+ * page without considering whether to recycle it.
+ */
+ if (is_vrange)
+ return PAGEREF_DISCARD;
+
if (referenced_ptes) {
if (PageSwapBacked(page))
return PAGEREF_ACTIVATE;
@@ -859,6 +869,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto activate_locked;
case PAGEREF_KEEP:
goto keep_locked;
+ case PAGEREF_DISCARD:
+ if (may_enter_fs && !discard_vpage(page))
+ goto free_it;
case PAGEREF_RECLAIM:
case PAGEREF_RECLAIM_CLEAN:
; /* try to reclaim the page below */
@@ -1614,7 +1627,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
}

if (page_referenced(page, 0, sc->target_mem_cgroup,
- &vm_flags)) {
+ &vm_flags, NULL)) {
nr_rotated += hpage_nr_pages(page);
/*
* Identify referenced, file-backed active pages and
--
1.8.1.2

2013-10-03 00:52:15

by John Stultz

[permalink] [raw]
Subject: [PATCH 05/14] vrange: Add new vrange(2) system call

From: Minchan Kim <[email protected]>

This patch adds new system call sys_vrange.

NAME
vrange - Mark or unmark range of memory as volatile

SYNOPSIS
int vrange(unsigned_long start, size_t length, int mode,
int *purged);

DESCRIPTION
Applications can use vrange(2) to advise the kernel how it should
handle paging I/O in this VM area. The idea is to help the kernel
discard pages in a vrange instead of reclaiming them when memory
pressure happens. The kernel will not discard any pages in a vrange
if there is no memory pressure.

mode:
VRANGE_VOLATILE
hint to the kernel that the VM may discard pages in the vrange
when memory pressure happens.
VRANGE_NONVOLATILE
hint to the kernel that the VM should no longer discard pages
in the vrange.

If a user tries to access purged memory without first making a
VRANGE_NONVOLATILE call, they can encounter SIGBUS if the page was
discarded by the kernel.

purged: Pointer to an integer which will return 1 if
mode == VRANGE_NONVOLATILE and any page in the affected range
was purged. If purged returns zero during a mode ==
VRANGE_NONVOLATILE call, it means all of the pages in the range
are intact.

RETURN VALUE
On success vrange returns the number of bytes marked or unmarked.
Similar to write(), it may return fewer bytes than specified
if it ran into a problem.

If an error is returned, no changes were made.

ERRORS
EINVAL This error can occur for the following reasons:
* The value of length is negative or not a multiple of the page size.
* start is not page-aligned
* mode is not a valid value.

ENOMEM Not enough memory

EFAULT purged pointer is invalid
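
For illustration, here is a minimal userspace sketch of the explicit
mark/unmark flow described above. It is not part of this patch; the
syscall number 314 matches the x86-64 table below, while the wrapper
name and the cache contents are just assumptions for the example.

#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_vrange 314
#define VRANGE_VOLATILE 0
#define VRANGE_NONVOLATILE 1

static long vrange(void *start, size_t len, int mode, int *purged)
{
	return syscall(__NR_vrange, (unsigned long)start, len, mode, purged);
}

int main(void)
{
	size_t len = 16 * 4096;
	int purged = 0;
	char *cache = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (cache == MAP_FAILED)
		return 1;
	memset(cache, 0xaa, len);	/* fill with regenerable data */

	/* The cache may be discarded if memory gets tight. */
	if (vrange(cache, len, VRANGE_VOLATILE, NULL) < 0)
		return 1;

	/* ... later, before using the cache again ... */
	if (vrange(cache, len, VRANGE_NONVOLATILE, &purged) < 0)
		return 1;
	if (purged)
		memset(cache, 0xaa, len);	/* rebuild discarded contents */

	return 0;
}

For brevity the sketch treats any non-negative return as covering the
whole range; a careful caller would honor the partial-completion
semantics described under RETURN VALUE.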

Cc: Andrew Morton <[email protected]>
Cc: Android Kernel Team <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dmitry Adamushko <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Mike Hommey <[email protected]>
Cc: Taras Glek <[email protected]>
Cc: Dhaval Giani <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: [email protected] <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/mman-common.h | 3 +
kernel/sys_ni.c | 1 +
mm/vrange.c | 164 +++++++++++++++++++++++++++++++++
5 files changed, 171 insertions(+)

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 38ae65d..dc332bd 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -320,6 +320,7 @@
311 64 process_vm_writev sys_process_vm_writev
312 common kcmp sys_kcmp
313 common finit_module sys_finit_module
+314 common vrange sys_vrange

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 84662ec..0997165 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -846,4 +846,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
unsigned long idx1, unsigned long idx2);
asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_vrange(unsigned long start, size_t len, int mode,
+ int __user *purged);
#endif
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 4164529..9be120b 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -66,4 +66,7 @@
#define MAP_HUGE_SHIFT 26
#define MAP_HUGE_MASK 0x3f

+#define VRANGE_VOLATILE 0 /* unpin pages so VM can discard them */
+#define VRANGE_NONVOLATILE 1 /* pin pages so VM can't discard them */
+
#endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7078052..f40070e 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -175,6 +175,7 @@ cond_syscall(sys_mremap);
cond_syscall(sys_remap_file_pages);
cond_syscall(compat_sys_move_pages);
cond_syscall(compat_sys_migrate_pages);
+cond_syscall(sys_vrange);

/* block-layer dependent */
cond_syscall(sys_bdflush);
diff --git a/mm/vrange.c b/mm/vrange.c
index f2d1588..17be51c 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -4,6 +4,8 @@

#include <linux/vrange.h>
#include <linux/slab.h>
+#include <linux/syscalls.h>
+#include <linux/mman.h>

static struct kmem_cache *vrange_cachep;

@@ -229,3 +231,165 @@ fail:
vrange_root_cleanup(new);
return -ENOMEM;
}
+
+static inline struct vrange_root *__vma_to_vroot(struct vm_area_struct *vma)
+{
+ struct vrange_root *vroot = NULL;
+
+ if (vma->vm_file && (vma->vm_flags & VM_SHARED))
+ vroot = &vma->vm_file->f_mapping->vroot;
+ else
+ vroot = &vma->vm_mm->vroot;
+ return vroot;
+}
+
+static inline unsigned long __vma_addr_to_index(struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ if (vma->vm_file && (vma->vm_flags & VM_SHARED))
+ return (vma->vm_pgoff << PAGE_SHIFT) + addr - vma->vm_start;
+ return addr;
+}
+
+static ssize_t do_vrange(struct mm_struct *mm, unsigned long start_idx,
+ unsigned long end_idx, int mode, int *purged)
+{
+ struct vm_area_struct *vma;
+ unsigned long orig_start = start_idx;
+ ssize_t count = 0, ret = 0;
+
+ down_read(&mm->mmap_sem);
+
+ vma = find_vma(mm, start_idx);
+ for (;;) {
+ struct vrange_root *vroot;
+ unsigned long tmp, vstart_idx, vend_idx;
+
+ if (!vma)
+ goto out;
+
+ if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
+ VM_HUGETLB))
+ goto out;
+
+ /* make sure start is at the front of the current vma*/
+ if (start_idx < vma->vm_start) {
+ start_idx = vma->vm_start;
+ if (start_idx > end_idx)
+ goto out;
+ }
+
+ /* bound tmp to closer of vm_end & end */
+ tmp = vma->vm_end - 1;
+ if (end_idx < tmp)
+ tmp = end_idx;
+
+ vroot = __vma_to_vroot(vma);
+ vstart_idx = __vma_addr_to_index(vma, start_idx);
+ vend_idx = __vma_addr_to_index(vma, tmp);
+
+ /* mark or unmark */
+ if (mode == VRANGE_VOLATILE)
+ ret = vrange_add(vroot, vstart_idx, vend_idx);
+ else if (mode == VRANGE_NONVOLATILE)
+ ret = vrange_remove(vroot, vstart_idx, vend_idx,
+ purged);
+
+ if (ret)
+ goto out;
+
+ /* update count to distance covered so far*/
+ count = tmp - orig_start + 1;
+
+ /* move start up to the end of the vma*/
+ start_idx = vma->vm_end;
+ if (start_idx > end_idx)
+ goto out;
+ /* move to the next vma */
+ vma = vma->vm_next;
+ }
+out:
+ up_read(&mm->mmap_sem);
+
+ /* report bytes successfully marked, even if we're exiting on error */
+ if (count)
+ return count;
+
+ return ret;
+}
+
+/*
+ * The vrange(2) system call.
+ *
+ * Applications can use vrange() to advise the kernel how it should
+ * handle paging I/O in this VM area. The idea is to help the kernel
+ * discard pages in a vrange instead of swapping them out when memory
+ * pressure happens. The information provided is advisory only, and can be
+ * safely disregarded by the kernel if the system has enough free memory.
+ *
+ * mode values:
+ * VRANGE_VOLATILE - hint to the kernel that the VM may discard vrange
+ * pages when memory pressure happens.
+ * VRANGE_NONVOLATILE - Removes any volatile hints previously specified in
+ * that range.
+ *
+ * purged ptr:
+ * Returns 1 if any page in the range being marked nonvolatile has been purged.
+ *
+ * Return values:
+ * On success vrange returns the number of bytes marked or unmarked.
+ * Similar to write(), it may return fewer bytes than specified if
+ * it ran into a problem.
+ *
+ * If an error is returned, no changes were made.
+ *
+ * Errors:
+ * -EINVAL - len is negative or zero, start is not page-aligned, start is
+ * greater than TASK_SIZE, or "mode" is not a valid value.
+ * -ENOMEM - Short of free memory in system for successful system call.
+ * -EFAULT - Purged pointer is invalid.
+ * -ENOSUP - Feature not yet supported.
+ */
+SYSCALL_DEFINE4(vrange, unsigned long, start,
+ size_t, len, int, mode, int __user *, purged)
+{
+ unsigned long end;
+ struct mm_struct *mm = current->mm;
+ ssize_t ret = -EINVAL;
+ int p = 0;
+
+ if (start & ~PAGE_MASK)
+ goto out;
+
+ len &= PAGE_MASK;
+ if (!len)
+ goto out;
+
+ end = start + len;
+ if (end < start)
+ goto out;
+
+ if (start >= TASK_SIZE)
+ goto out;
+
+ if (purged) {
+ /* Test pointer is valid before making any changes */
+ if (put_user(p, purged))
+ return -EFAULT;
+ }
+
+ ret = do_vrange(mm, start, end - 1, mode, &p);
+
+ if (purged) {
+ if (put_user(p, purged)) {
+ /*
+ * This would be bad, since we've modified volatility
+ * and the change in purged state would be lost.
+ */
+ BUG();
+ }
+ }
+
+out:
+ return ret;
+}
--
1.8.1.2

2013-10-03 00:54:53

by John Stultz

[permalink] [raw]
Subject: [PATCH 06/14] vrange: Add basic functions to purge volatile pages

From: Minchan Kim <[email protected]>

This patch adds discard_vpage and related functions to purge
anonymous and file volatile pages.

It is in preparation for purging volatile pages when memory is tight.
The logic that triggers purging of volatile pages will be introduced
in the next patch.

Cc: Andrew Morton <[email protected]>
Cc: Android Kernel Team <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dmitry Adamushko <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Mike Hommey <[email protected]>
Cc: Taras Glek <[email protected]>
Cc: Dhaval Giani <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: [email protected] <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
[jstultz: Reworked to add purging of file pages, commit log tweaks]
Signed-off-by: John Stultz <[email protected]>
---
include/linux/vrange.h | 9 +++
mm/internal.h | 2 -
mm/vrange.c | 185 +++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 194 insertions(+), 2 deletions(-)

diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index ef153c8..778902d 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -41,6 +41,9 @@ extern int vrange_clear(struct vrange_root *vroot,
extern void vrange_root_cleanup(struct vrange_root *vroot);
extern int vrange_fork(struct mm_struct *new,
struct mm_struct *old);
+int discard_vpage(struct page *page);
+bool vrange_addr_volatile(struct vm_area_struct *vma, unsigned long addr);
+
#else

static inline void vrange_root_init(struct vrange_root *vroot,
@@ -51,5 +54,11 @@ static inline int vrange_fork(struct mm_struct *new, struct mm_struct *old)
return 0;
}

+static inline bool vrange_addr_volatile(struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ return false;
+}
+static inline int discard_vpage(struct page *page) { return 0; }
#endif
#endif /* _LINUX_VRANGE_H */
diff --git a/mm/internal.h b/mm/internal.h
index 4390ac6..c2c6a93 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -223,10 +223,8 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)

extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
extern unsigned long vma_address(struct page *page,
struct vm_area_struct *vma);
-#endif
#else /* !CONFIG_MMU */
static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page *p)
{
diff --git a/mm/vrange.c b/mm/vrange.c
index 17be51c..c72e72d 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -6,6 +6,12 @@
#include <linux/slab.h>
#include <linux/syscalls.h>
#include <linux/mman.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/hugetlb.h>
+#include "internal.h"
+#include <linux/swap.h>
+#include <linux/mmu_notifier.h>

static struct kmem_cache *vrange_cachep;

@@ -63,6 +69,19 @@ static inline void __vrange_resize(struct vrange *range,
__vrange_add(range, vroot);
}

+static struct vrange *__vrange_find(struct vrange_root *vroot,
+ unsigned long start_idx,
+ unsigned long end_idx)
+{
+ struct vrange *range = NULL;
+ struct interval_tree_node *node;
+
+ node = interval_tree_iter_first(&vroot->v_rb, start_idx, end_idx);
+ if (node)
+ range = vrange_from_node(node);
+ return range;
+}
+
static int vrange_add(struct vrange_root *vroot,
unsigned long start_idx, unsigned long end_idx)
{
@@ -393,3 +412,169 @@ SYSCALL_DEFINE4(vrange, unsigned long, start,
out:
return ret;
}
+
+bool vrange_addr_volatile(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct vrange_root *vroot;
+ unsigned long vstart_idx, vend_idx;
+ bool ret = false;
+
+ vroot = __vma_to_vroot(vma);
+ vstart_idx = __vma_addr_to_index(vma, addr);
+ vend_idx = vstart_idx + PAGE_SIZE - 1;
+
+ vrange_lock(vroot);
+ if (__vrange_find(vroot, vstart_idx, vend_idx))
+ ret = true;
+ vrange_unlock(vroot);
+ return ret;
+}
+
+/* Caller should hold vrange_lock */
+static void do_purge(struct vrange_root *vroot,
+ unsigned long start_idx, unsigned long end_idx)
+{
+ struct vrange *range;
+ struct interval_tree_node *node;
+
+ node = interval_tree_iter_first(&vroot->v_rb, start_idx, end_idx);
+ while (node) {
+ range = container_of(node, struct vrange, node);
+ range->purged = true;
+ node = interval_tree_iter_next(node, start_idx, end_idx);
+ }
+}
+
+static void try_to_discard_one(struct vrange_root *vroot, struct page *page,
+ struct vm_area_struct *vma, unsigned long addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pte_t *pte;
+ pte_t pteval;
+ spinlock_t *ptl;
+
+ VM_BUG_ON(!PageLocked(page));
+
+ pte = page_check_address(page, mm, addr, &ptl, 0);
+ if (!pte)
+ return;
+
+ BUG_ON(vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|VM_HUGETLB));
+
+ flush_cache_page(vma, addr, page_to_pfn(page));
+ pteval = ptep_clear_flush(vma, addr, pte);
+
+ update_hiwater_rss(mm);
+ if (PageAnon(page))
+ dec_mm_counter(mm, MM_ANONPAGES);
+ else
+ dec_mm_counter(mm, MM_FILEPAGES);
+
+ page_remove_rmap(page);
+ page_cache_release(page);
+
+ pte_unmap_unlock(pte, ptl);
+ mmu_notifier_invalidate_page(mm, addr);
+
+ addr = __vma_addr_to_index(vma, addr);
+
+ do_purge(vroot, addr, addr + PAGE_SIZE - 1);
+}
+
+static int try_to_discard_anon_vpage(struct page *page)
+{
+ struct anon_vma *anon_vma;
+ struct anon_vma_chain *avc;
+ pgoff_t pgoff;
+ struct vm_area_struct *vma;
+ struct mm_struct *mm;
+ struct vrange_root *vroot;
+
+ unsigned long address;
+
+ anon_vma = page_lock_anon_vma_read(page);
+ if (!anon_vma)
+ return -1;
+
+ pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ /*
+ * While iterating this loop, some processes could see a page as
+ * purged while others see it as not purged, because we have no
+ * global lock between parent and child protecting the vrange system
+ * call during this loop. But it's not a problem because the page is
+ * not a *SHARED* page but a *COW* page, so parent and child may see
+ * different data at any time. The worst case of this race is that a
+ * page was marked purged but couldn't be discarded, causing an
+ * unnecessary page fault, which isn't severe.
+ */
+ anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+ vma = avc->vma;
+ mm = vma->vm_mm;
+ vroot = &mm->vroot;
+ address = vma_address(page, vma);
+
+ vrange_lock(vroot);
+ if (!__vrange_find(vroot, address, address + PAGE_SIZE - 1)) {
+ vrange_unlock(vroot);
+ continue;
+ }
+
+ try_to_discard_one(vroot, page, vma, address);
+ vrange_unlock(vroot);
+ }
+
+ page_unlock_anon_vma_read(anon_vma);
+ return 0;
+}
+
+static int try_to_discard_file_vpage(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ struct vm_area_struct *vma;
+
+ mutex_lock(&mapping->i_mmap_mutex);
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+ unsigned long address = vma_address(page, vma);
+ struct vrange_root *vroot = &mapping->vroot;
+ long vstart_idx;
+
+ vstart_idx = __vma_addr_to_index(vma, address);
+ vrange_lock(vroot);
+ if (!__vrange_find(vroot, vstart_idx,
+ vstart_idx + PAGE_SIZE - 1)) {
+ vrange_unlock(vroot);
+ continue;
+ }
+ try_to_discard_one(vroot, page, vma, address);
+ vrange_unlock(vroot);
+ }
+
+ mutex_unlock(&mapping->i_mmap_mutex);
+ return 0;
+}
+
+static int try_to_discard_vpage(struct page *page)
+{
+ if (PageAnon(page))
+ return try_to_discard_anon_vpage(page);
+ return try_to_discard_file_vpage(page);
+}
+
+int discard_vpage(struct page *page)
+{
+ VM_BUG_ON(!PageLocked(page));
+ VM_BUG_ON(PageLRU(page));
+
+ if (!try_to_discard_vpage(page)) {
+ if (PageSwapCache(page))
+ try_to_free_swap(page);
+
+ if (page_freeze_refs(page, 1)) {
+ unlock_page(page);
+ return 0;
+ }
+ }
+
+ return 1;
+}
--
1.8.1.2

2013-10-03 00:55:39

by John Stultz

[permalink] [raw]
Subject: [PATCH 01/14] vrange: Add basic data structure and functions

From: Minchan Kim <[email protected]>

This patch adds vrange data structure and core management
functions.

The vrange uses the generic interval tree as main data
structure because it handles address range, which fits well
for this purpose.

vrange_add and vrange_remove are the core functions for the vrange()
system call that will be introduced in a following patch.

vrange_add inserts a new address range into the interval tree.
If the new address range overlaps an existing volatile range,
the existing range is expanded to cover the new one.

Thus, if the existing volatile range had been purged, the new extended
range inherits that purged state. If the new address range falls
entirely inside an existing range, it is ignored.

vrange_remove removes the address range, returning the purged
state of the removed ranges.

Cc: Andrew Morton <[email protected]>
Cc: Android Kernel Team <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dmitry Adamushko <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Mike Hommey <[email protected]>
Cc: Taras Glek <[email protected]>
Cc: Dhaval Giani <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: [email protected] <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
[jstultz: Heavy rework and cleanups to make this infrastructure more
easily reused for both file and anonymous pages]
Signed-off-by: John Stultz <[email protected]>
---
include/linux/vrange.h | 48 ++++++++++++
include/linux/vrange_types.h | 25 ++++++
lib/Makefile | 2 +-
mm/Makefile | 2 +-
mm/vrange.c | 183 +++++++++++++++++++++++++++++++++++++++++++
5 files changed, 258 insertions(+), 2 deletions(-)
create mode 100644 include/linux/vrange.h
create mode 100644 include/linux/vrange_types.h
create mode 100644 mm/vrange.c

diff --git a/include/linux/vrange.h b/include/linux/vrange.h
new file mode 100644
index 0000000..0d378a5
--- /dev/null
+++ b/include/linux/vrange.h
@@ -0,0 +1,48 @@
+#ifndef _LINUX_VRANGE_H
+#define _LINUX_VRANGE_H
+
+#include <linux/vrange_types.h>
+#include <linux/mm.h>
+
+#define vrange_from_node(node_ptr) \
+ container_of(node_ptr, struct vrange, node)
+
+#define vrange_entry(ptr) \
+ container_of(ptr, struct vrange, node.rb)
+
+#ifdef CONFIG_MMU
+
+static inline void vrange_root_init(struct vrange_root *vroot, int type,
+ void *object)
+{
+ vroot->type = type;
+ vroot->v_rb = RB_ROOT;
+ mutex_init(&vroot->v_lock);
+ vroot->object = object;
+}
+
+static inline void vrange_lock(struct vrange_root *vroot)
+{
+ mutex_lock(&vroot->v_lock);
+}
+
+static inline void vrange_unlock(struct vrange_root *vroot)
+{
+ mutex_unlock(&vroot->v_lock);
+}
+
+static inline int vrange_type(struct vrange *vrange)
+{
+ return vrange->owner->type;
+}
+
+extern void vrange_root_cleanup(struct vrange_root *vroot);
+
+#else
+
+static inline void vrange_root_init(struct vrange_root *vroot,
+ int type, void *object) {};
+static inline void vrange_root_cleanup(struct vrange_root *vroot) {};
+
+#endif
+#endif /* _LINUX_VRANGE_H */
diff --git a/include/linux/vrange_types.h b/include/linux/vrange_types.h
new file mode 100644
index 0000000..0d48b42
--- /dev/null
+++ b/include/linux/vrange_types.h
@@ -0,0 +1,25 @@
+#ifndef _LINUX_VRANGE_TYPES_H
+#define _LINUX_VRANGE_TYPES_H
+
+#include <linux/mutex.h>
+#include <linux/interval_tree.h>
+
+enum vrange_type {
+ VRANGE_MM,
+ VRANGE_FILE,
+};
+
+struct vrange_root {
+ struct rb_root v_rb; /* vrange rb tree */
+ struct mutex v_lock; /* Protect v_rb */
+ enum vrange_type type; /* range root type */
+ void *object; /* pointer to mm_struct or mapping */
+};
+
+struct vrange {
+ struct interval_tree_node node;
+ struct vrange_root *owner;
+ int purged;
+};
+#endif
+
diff --git a/lib/Makefile b/lib/Makefile
index 7baccfd..c8739ee 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
- earlycpio.o percpu-refcount.o
+ earlycpio.o percpu-refcount.o interval_tree.o

obj-$(CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS) += usercopy.o
lib-$(CONFIG_MMU) += ioremap.o
diff --git a/mm/Makefile b/mm/Makefile
index f008033..54928af 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -5,7 +5,7 @@
mmu-y := nommu.o
mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
- vmalloc.o pagewalk.o pgtable-generic.o
+ vmalloc.o pagewalk.o pgtable-generic.o vrange.o

ifdef CONFIG_CROSS_MEMORY_ATTACH
mmu-$(CONFIG_MMU) += process_vm_access.o
diff --git a/mm/vrange.c b/mm/vrange.c
new file mode 100644
index 0000000..866566c
--- /dev/null
+++ b/mm/vrange.c
@@ -0,0 +1,183 @@
+/*
+ * mm/vrange.c
+ */
+
+#include <linux/vrange.h>
+#include <linux/slab.h>
+
+static struct kmem_cache *vrange_cachep;
+
+static int __init vrange_init(void)
+{
+ vrange_cachep = KMEM_CACHE(vrange, SLAB_PANIC);
+ return 0;
+}
+module_init(vrange_init);
+
+static struct vrange *__vrange_alloc(gfp_t flags)
+{
+ struct vrange *vrange = kmem_cache_alloc(vrange_cachep, flags);
+ if (!vrange)
+ return vrange;
+ vrange->owner = NULL;
+ return vrange;
+}
+
+static void __vrange_free(struct vrange *range)
+{
+ WARN_ON(range->owner);
+ kmem_cache_free(vrange_cachep, range);
+}
+
+static void __vrange_add(struct vrange *range, struct vrange_root *vroot)
+{
+ range->owner = vroot;
+ interval_tree_insert(&range->node, &vroot->v_rb);
+}
+
+static void __vrange_remove(struct vrange *range)
+{
+ interval_tree_remove(&range->node, &range->owner->v_rb);
+ range->owner = NULL;
+}
+
+static inline void __vrange_set(struct vrange *range,
+ unsigned long start_idx, unsigned long end_idx,
+ bool purged)
+{
+ range->node.start = start_idx;
+ range->node.last = end_idx;
+ range->purged = purged;
+}
+
+static inline void __vrange_resize(struct vrange *range,
+ unsigned long start_idx, unsigned long end_idx)
+{
+ struct vrange_root *vroot = range->owner;
+ bool purged = range->purged;
+
+ __vrange_remove(range);
+ __vrange_set(range, start_idx, end_idx, purged);
+ __vrange_add(range, vroot);
+}
+
+static int vrange_add(struct vrange_root *vroot,
+ unsigned long start_idx, unsigned long end_idx)
+{
+ struct vrange *new_range, *range;
+ struct interval_tree_node *node, *next;
+ int purged = 0;
+
+ new_range = __vrange_alloc(GFP_KERNEL);
+ if (!new_range)
+ return -ENOMEM;
+
+ vrange_lock(vroot);
+
+ node = interval_tree_iter_first(&vroot->v_rb, start_idx, end_idx);
+ while (node) {
+ next = interval_tree_iter_next(node, start_idx, end_idx);
+ range = vrange_from_node(node);
+ /* old range covers new range fully */
+ if (node->start <= start_idx && node->last >= end_idx) {
+ __vrange_free(new_range);
+ goto out;
+ }
+
+ start_idx = min_t(unsigned long, start_idx, node->start);
+ end_idx = max_t(unsigned long, end_idx, node->last);
+ purged |= range->purged;
+
+ __vrange_remove(range);
+ __vrange_free(range);
+
+ node = next;
+ }
+
+ __vrange_set(new_range, start_idx, end_idx, purged);
+ __vrange_add(new_range, vroot);
+out:
+ vrange_unlock(vroot);
+ return 0;
+}
+
+static int vrange_remove(struct vrange_root *vroot,
+ unsigned long start_idx, unsigned long end_idx,
+ int *purged)
+{
+ struct vrange *new_range, *range;
+ struct interval_tree_node *node, *next;
+ bool used_new = false;
+
+ if (!purged)
+ return -EINVAL;
+
+ *purged = 0;
+
+ new_range = __vrange_alloc(GFP_KERNEL);
+ if (!new_range)
+ return -ENOMEM;
+
+ vrange_lock(vroot);
+
+ node = interval_tree_iter_first(&vroot->v_rb, start_idx, end_idx);
+ while (node) {
+ next = interval_tree_iter_next(node, start_idx, end_idx);
+ range = vrange_from_node(node);
+
+ *purged |= range->purged;
+
+ if (start_idx <= node->start && end_idx >= node->last) {
+ /* the passed-in range covers this range fully */
+ __vrange_remove(range);
+ __vrange_free(range);
+ } else if (node->start >= start_idx) {
+ /*
+ * The passed-in range covers the left part of
+ * this range
+ */
+ __vrange_resize(range, end_idx + 1, node->last);
+ } else if (node->last <= end_idx) {
+ /*
+ * The passed-in range covers the right part of
+ * this range
+ */
+ __vrange_resize(range, node->start, start_idx - 1);
+ } else {
+ /*
+ * The passed-in range is in the middle of this range
+ */
+ unsigned long last = node->last;
+ used_new = true;
+ __vrange_resize(range, node->start, start_idx - 1);
+ __vrange_set(new_range, end_idx + 1, last,
+ range->purged);
+ __vrange_add(new_range, vroot);
+ break;
+ }
+
+ node = next;
+ }
+ vrange_unlock(vroot);
+
+ if (!used_new)
+ __vrange_free(new_range);
+
+ return 0;
+}
+
+void vrange_root_cleanup(struct vrange_root *vroot)
+{
+ struct vrange *range;
+ struct rb_node *node;
+
+ vrange_lock(vroot);
+ /* We should remove node by post-order traversal */
+ while ((node = rb_first(&vroot->v_rb))) {
+ range = vrange_entry(node);
+ __vrange_remove(range);
+ __vrange_free(range);
+ }
+ vrange_unlock(vroot);
+}
+
--
1.8.1.2

2013-10-03 10:35:06

by Krzysztof Kozlowski

[permalink] [raw]
Subject: Re: [PATCH 06/14] vrange: Add basic functions to purge volatile pages

On Wed, 2013-10-02 at 17:51 -0700, John Stultz wrote:
> +static void try_to_discard_one(struct vrange_root *vroot, struct page *page,
> + struct vm_area_struct *vma, unsigned long addr)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pte_t *pte;
> + pte_t pteval;
> + spinlock_t *ptl;
> +
> + VM_BUG_ON(!PageLocked(page));
> +
> + pte = page_check_address(page, mm, addr, &ptl, 0);
> + if (!pte)
> + return;
> +
> + BUG_ON(vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|VM_HUGETLB));
> +
> + flush_cache_page(vma, addr, page_to_pfn(page));

It seems that this patch is different in your GIT repo
(git://git.linaro.org/people/jstultz/android-dev.git dev/vrange-v9). In
GIT it is missing the fix: s/address/addr.

Best regards,
Krzysztof


2013-10-03 23:56:17

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 00/14] Volatile Ranges v9

On 10/02/2013 05:51 PM, John Stultz wrote:
> So its been awhile since the last release of the volatile ranges
> patches, and while Minchan and I have been busy with other things,
> we have been slowly chipping away at issues and differences
> trying to get a patchset that we both agree on.
>
> There's still a few smaller issues, but we figured any further
> polishing of the patch series in private would be unproductive
> and it would be much better to send the patches out for review
> and comment and get some wider opinions.
>
> Whats new in v9:
> * Updated to v3.11
> * Added vrange purging logic to purge anonymous pages on
> swapless systems
> * Added logic to allocate the vroot structure dynamically
> to avoid added overhead to mm and address_space structures
> * Lots of minor tweaks, changes and cleanups
>
> Still TODO:
> * Sort out better solution for clearing volatility on new mmaps
> - Minchan has a different approach here
> * Sort out apparent shrinker livelock that occasionally crops
> up under severe pressure
>
> Feedback or thoughts here would be particularly helpful!

Andrew noted that I've forgotten to provide a sufficient overview of what
volatile ranges does, and given it's been a while, folks may want a quick
introduction/reminder.

Volatile ranges provide a method for userland to inform the kernel that
a range of memory is safe to discard (ie: can be regenerated), but
userspace may want to try to access it in the future. It can be thought
of as similar to MADV_DONTNEED, except that the actual freeing of the
memory is delayed and only done under memory pressure, and the user can
try to cancel the action and quickly access any unpurged pages. The
idea originated from Android's ashmem, but I've since learned that other
OSes provide similar functionality.

This functionality allows for a number of interesting uses:
* Userland caches that have kernel-triggered eviction under memory
pressure. This allows the kernel to "rightsize" userspace caches for the
current system-wide workload. Things like image bitmap caches, or
rendered HTML in a hidden browser tab, where the data is not visible and
can be regenerated if needed, are good examples.

* Opportunistic freeing of memory that may be quickly reused. Minchan
has done a malloc implementation where free() marks the pages as
volatile, allowing the kernel to reclaim under pressure. This avoids the
unmapping and remapping of anonymous pages on free/malloc. So if
userland wants to malloc memory quickly after the free, it just needs to
mark the pages as non-volatile, and only purged pages will have to be
faulted back in (a rough sketch of this pattern follows below).
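
A conceptual sketch of that allocator idea, not Minchan's actual
implementation: the helper names are made up, and the syscall number
and mode constants come from patch 05 of this series.

#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_vrange 314
#define VRANGE_VOLATILE 0
#define VRANGE_NONVOLATILE 1

/* Called when the allocator retires a page-aligned chunk: keep the
 * mapping but let the kernel reclaim it only under memory pressure. */
static void chunk_retire(void *chunk, size_t len)
{
	syscall(__NR_vrange, (unsigned long)chunk, len, VRANGE_VOLATILE,
		(int *)NULL);
}

/* Called when the allocator reuses the chunk: pin the pages again.
 * Only pages that were actually purged fault back in (zero-filled),
 * which is fine for memory being handed out as "uninitialized".
 * Returns whether anything was purged, or -1 on error. */
static int chunk_reuse(void *chunk, size_t len)
{
	int purged = 0;

	if (syscall(__NR_vrange, (unsigned long)chunk, len,
		    VRANGE_NONVOLATILE, &purged) < 0)
		return -1;
	return purged;
}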

The syscall interface is defined in patch 5/14 in this series, but
briefly there are two ways to utilze the functionality:

Explicit marking method:
1) Userland marks a range of memory that can be regenerated if necessary
as volatile
2) Before accessing the memory again, userland marks the memory as
nonvolatile, and the kernel will provide notification if any pages in the
range have been purged.

Optimistic method:
1) Userland marks a large range of data as volatile
2) Userland continues to access the data as it needs.
3) If userland accesses a page that has been purged, the kernel will
send a SIGBUS
4) Userspace can trap the SIGBUS, mark the affected pages as
non-volatile, and refill the data as needed before continuing on (a
rough sketch of this pattern follows below)
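
A rough sketch of that optimistic flow; this is illustration only, not
part of the series. The syscall number comes from patch 05, and
regenerate() stands in for whatever the application does to rebuild a
purged page.

#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_vrange 314
#define VRANGE_VOLATILE 0
#define VRANGE_NONVOLATILE 1
#define PAGE_SZ 4096UL

static void regenerate(void *page)
{
	memset(page, 0xaa, PAGE_SZ);	/* application-specific rebuild */
}

static void sigbus_handler(int sig, siginfo_t *si, void *uctx)
{
	unsigned long addr = (unsigned long)si->si_addr & ~(PAGE_SZ - 1);
	int purged = 0;

	(void)sig;
	(void)uctx;
	/* Re-pin just the faulting page, rebuild it, and return; the
	 * interrupted access is then retried and succeeds. */
	syscall(__NR_vrange, addr, PAGE_SZ, VRANGE_NONVOLATILE, &purged);
	regenerate((void *)addr);
}

int main(void)
{
	struct sigaction sa = { .sa_sigaction = sigbus_handler,
				.sa_flags = SA_SIGINFO };
	size_t len = 64 * PAGE_SZ;
	volatile char c;
	char *cache;

	sigaction(SIGBUS, &sa, NULL);

	cache = mmap(NULL, len, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (cache == MAP_FAILED)
		return 1;
	memset(cache, 0xaa, len);

	/* Mark the whole cache volatile and keep using it; a purged page
	 * raises SIGBUS and is repaired by the handler above. */
	syscall(__NR_vrange, (unsigned long)cache, len, VRANGE_VOLATILE,
		(int *)NULL);

	c = cache[0];
	(void)c;
	return 0;
}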


Other details:
The interface takes a range of memory, which can cover anonymous pages
as well as mmapped file pages. In the case that the pages are from a
shared mmapped file, the volatility set on those file pages is global.
Thus, much as writes to those pages are shared with other processes, pages
marked volatile will be volatile to any other processes that have the
file mapped as well. It is advised that processes coordinate when using
volatile ranges on shared mappings (much as they must coordinate when
writing to shared data). Any uncleared volatility on mmapped files will
last until the file is closed by all users (ie: volatility isn't
persistent on disk).

Volatility on anonymous pages is inherited across forks, but cleared on
exec.

You can read more about the history of volatile ranges here:
http://permalink.gmane.org/gmane.linux.kernel.mm/98848
http://permalink.gmane.org/gmane.linux.kernel.mm/98676
https://lwn.net/Articles/522135/
https://lwn.net/Kernel/Index/#Volatile_ranges


thanks
-john


2013-10-07 22:57:22

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

On 10/02/2013 05:51 PM, John Stultz wrote:
> From: Minchan Kim <[email protected]>
>
> This patch adds new system call sys_vrange.
>
> NAME
> vrange - Mark or unmark range of memory as volatile
>

vrange() is about as nondescriptive as one can get -- there is exactly
one letter that has any connection with what this does.

> SYNOPSIS
> int vrange(unsigned_long start, size_t length, int mode,
> int *purged);
>
> DESCRIPTION
> Applications can use vrange(2) to advise the kernel how it should
> handle paging I/O in this VM area. The idea is to help the kernel
> discard pages of vrange instead of reclaiming when memory pressure
> happens. It means kernel doesn't discard any pages of vrange if
> there is no memory pressure.
>
> mode:
> VRANGE_VOLATILE
> hint to kernel so VM can discard in vrange pages when
> memory pressure happens.
> VRANGE_NONVOLATILE
> hint to kernel so VM doesn't discard vrange pages
> any more.
>
> If user try to access purged memory without VRANGE_NOVOLATILE call,
> he can encounter SIGBUS if the page was discarded by kernel.
>
> purged: Pointer to an integer which will return 1 if
> mode == VRANGE_NONVOLATILE and any page in the affected range
> was purged. If purged returns zero during a mode ==
> VRANGE_NONVOLATILE call, it means all of the pages in the range
> are intact.

I'm a bit confused about the "purged"

From an earlier version of the patch:

> - What's different with madvise(DONTNEED)?
>
> System call semantic
>
> DONTNEED makes sure user always can see zero-fill pages after
> he calls madvise while vrange can see data or encounter SIGBUS.

This difference doesn't seem to be a huge one. The other one seems to
be the blocking status of MADV_DONTNEED, which may be better
handled by adding an option (MADV_LAZY), perhaps?

That way we would have lazy vs. immediate, and zero versus SIGBUS.

I see from the change history of the patch that this was an madvise()
call at some point, but was changed into a separate system call along
the way; does anyone remember why that was? A quick look through my LKML
archives doesn't really make it clear.

-hpa

2013-10-07 23:14:28

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

On 10/07/2013 03:56 PM, H. Peter Anvin wrote:
> On 10/02/2013 05:51 PM, John Stultz wrote:
>> From: Minchan Kim <[email protected]>
>>
>> This patch adds new system call sys_vrange.
>>
>> NAME
>> vrange - Mark or unmark range of memory as volatile
>>
> vrange() is about as nondescriptive as one can get -- there is exactly
> one letter that has any connection with that this does.


Hrm. Any suggestions? Would volatile_range() be better?


>
>> SYNOPSIS
>> int vrange(unsigned_long start, size_t length, int mode,
>> int *purged);
>>
>> DESCRIPTION
>> Applications can use vrange(2) to advise the kernel how it should
>> handle paging I/O in this VM area. The idea is to help the kernel
>> discard pages of vrange instead of reclaiming when memory pressure
>> happens. It means kernel doesn't discard any pages of vrange if
>> there is no memory pressure.
>>
>> mode:
>> VRANGE_VOLATILE
>> hint to kernel so VM can discard in vrange pages when
>> memory pressure happens.
>> VRANGE_NONVOLATILE
>> hint to kernel so VM doesn't discard vrange pages
>> any more.
>>
>> If user try to access purged memory without VRANGE_NOVOLATILE call,
>> he can encounter SIGBUS if the page was discarded by kernel.
>>
>> purged: Pointer to an integer which will return 1 if
>> mode == VRANGE_NONVOLATILE and any page in the affected range
>> was purged. If purged returns zero during a mode ==
>> VRANGE_NONVOLATILE call, it means all of the pages in the range
>> are intact.
> I'm a bit confused about the "purged"
>
> From an earlier version of the patch:
>
>> - What's different with madvise(DONTNEED)?
>>
>> System call semantic
>>
>> DONTNEED makes sure user always can see zero-fill pages after
>> he calls madvise while vrange can see data or encounter SIGBUS.
> This difference doesn't seem to be a huge one. The other one seems to
> be the blocking status of MADV_DONTNEED, which perhaps may be better
> handled by adding an option (MADV_LAZY) perhaps?
>
> That way we would have lazy vs. immediate, and zero versus SIGBUS.

And some sort of lazy-cancling call as well.


>
> I see from the change history of the patch that this was an madvise() at
> some point, but was changed into a separate system call at some point,
> does anyone remember why that was? A quick look through my LKML
> archives doesn't really make it clear.

The reason we can't use madvise is that, to properly handle error cases
and report the purge state, we need an extra argument.

In much earlier versions, we just returned an error when setting
NONVOLATILE if the data was purged. However, since we have to possibly
do allocations when marking a range as non-volatile, we needed a way to
properly handle that allocation failing. We can't just return ENOMEM, as
we may have already marked purged memory as non-volatile.

Thus, that's why with vrange, we return the number of bytes modified,
along with the purge state. That way, if an error does occur we can
return the purge state of the bytes successfully modified, and only
return an error if nothing was changed, much like when a write fails.
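
A sketch of what that looks like from the caller's side; this is only
an illustration, not part of the series. The syscall number 314 comes
from patch 05, and the helper name is hypothetical.

#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_vrange 314
#define VRANGE_NONVOLATILE 1

/* Clear volatility on [start, start + len); returns 0 if the range was
 * handled, -1 if a call failed with nothing further changed.  *purged
 * accumulates whether any successfully unmarked page had been purged
 * while it was volatile. */
static int mark_nonvolatile_all(unsigned long start, size_t len, int *purged)
{
	while (len) {
		int p = 0;
		long done = syscall(__NR_vrange, start, len,
				    VRANGE_NONVOLATILE, &p);

		if (done < 0)
			return -1;	/* nothing in this chunk was changed */
		*purged |= p;
		if (done == 0 || (size_t)done >= len)
			break;		/* done, or nothing left to handle */
		/* Fewer bytes than requested were processed; decide whether
		 * to retry the remainder or give up (here we retry). */
		start += done;
		len -= done;
	}
	return 0;
}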

thanks
-john


2013-10-07 23:27:36

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

On 10/07/2013 04:14 PM, John Stultz wrote:
>>
>> I see from the change history of the patch that this was an madvise() at
>> some point, but was changed into a separate system call at some point,
>> does anyone remember why that was? A quick look through my LKML
>> archives doesn't really make it clear.
>
> The reason we can't use madvise, is that to properly handle error cases
> and report the pruge state, we need an extra argument.
>
> In much earlier versions, we just returned an error when setting
> NONVOLATILE if the data was purged. However, since we have to possibly
> do allocations when marking a range as non-volatile, we needed a way to
> properly handle that allocation failing. We can't just return ENOMEM, as
> we may have already marked purged memory as non-volatile.
>
> Thus, that's why with vrange, we return the number of bytes modified,
> along with the purge state. That way, if an error does occur we can
> return the purge state of the bytes successfully modified, and only
> return an error if nothing was changed, much like when a write fails.
>

I am not clear at all what the "purge state" is in this case.

-hpa

2013-10-07 23:41:17

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

On 10/07/2013 04:26 PM, H. Peter Anvin wrote:
> On 10/07/2013 04:14 PM, John Stultz wrote:
>>> I see from the change history of the patch that this was an madvise() at
>>> some point, but was changed into a separate system call at some point,
>>> does anyone remember why that was? A quick look through my LKML
>>> archives doesn't really make it clear.
>> The reason we can't use madvise, is that to properly handle error cases
>> and report the pruge state, we need an extra argument.
>>
>> In much earlier versions, we just returned an error when setting
>> NONVOLATILE if the data was purged. However, since we have to possibly
>> do allocations when marking a range as non-volatile, we needed a way to
>> properly handle that allocation failing. We can't just return ENOMEM, as
>> we may have already marked purged memory as non-volatile.
>>
>> Thus, that's why with vrange, we return the number of bytes modified,
>> along with the purge state. That way, if an error does occur we can
>> return the purge state of the bytes successfully modified, and only
>> return an error if nothing was changed, much like when a write fails.
>>
> I am not clear at all what the "purge state" is in this case.


You mark a chunk of memory as volatile, then at some point later, mark
it as non-volatile. The purge state tells you if the memory is still
there, or if we threw it out due to memory pressure. This lets the
application regenerate the purged data before continuing on.

thanks
-john

2013-10-07 23:47:14

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

On 10/07/2013 04:41 PM, John Stultz wrote:
>
> You mark a chunk of memory as volatile, then at some point later, mark
> its as non-volatile. The purge state tells you if the memory is still
> there, or if we threw it out due to memory pressure. This lets the
> application regnerate the purged data before continuing on.
>

And wouldn't this apply to MADV_DONTNEED just as well? Perhaps what we
should do is an enhanced madvise() call?

-hpa

2013-10-07 23:54:36

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

On 10/07/2013 04:46 PM, H. Peter Anvin wrote:
> On 10/07/2013 04:41 PM, John Stultz wrote:
>> You mark a chunk of memory as volatile, then at some point later, mark
>> its as non-volatile. The purge state tells you if the memory is still
>> there, or if we threw it out due to memory pressure. This lets the
>> application regnerate the purged data before continuing on.
>>
> And wouldn't this apply to MADV_DONTNEED just as well? Perhaps what we
> should do is an enhanced madvise() call?
Well, I think MADV_DONTNEED doesn't *have* to do anything at all. It's
advisory after all. So it may immediately wipe out any data, but it may not.

Those advisory semantics work fine w/ VRANGE_VOLATILE. However,
VRANGE_NONVOLATILE is not quite advisory; it's telling the system that it
requires the memory in the specified range to not be volatile, and we
need to correctly inform userland how much was changed and if any of the
memory we did change to non-volatile was purged since being set volatile.

In that way it is sort of different from madvise. Some sort of an
madvise2 could be done, but then the extra purge state argument would be
oddly defined for any other mode.

Is your main concern here just wanting to have a zero-fill mode with
volatile ranges? Or do you really want to squeeze this into the madvise
call interface?

thanks
-john
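
A sketch of how userland might consume those write()-like semantics when
clearing volatility, reusing the illustrative vrange() wrapper from the
earlier sketch; it assumes, as described above, that the call returns the
number of bytes it actually changed and only returns an error if nothing
was changed:

/* Clear volatility on [p, p + remaining); report whether any of the
 * successfully changed bytes had been purged. */
int make_nonvolatile(unsigned char *p, size_t remaining, int *any_purged)
{
        *any_purged = 0;

        while (remaining) {
                int purged = 0;
                long done = vrange((unsigned long)p, remaining,
                                   VRANGE_NONVOLATILE, &purged);
                if (done <= 0)
                        return -1;      /* this chunk is untouched (e.g. ENOMEM);
                                         * bytes before p are already non-volatile
                                         * and *any_purged reflects them */
                *any_purged |= purged;  /* purge state of the bytes that changed */
                p += done;
                remaining -= done;
        }
        return 0;       /* caller regenerates the data if *any_purged is set */
}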

2013-10-08 00:00:50

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

On 10/07/2013 04:54 PM, John Stultz wrote:
>>>
>> And wouldn't this apply to MADV_DONTNEED just as well? Perhaps what we
>> should do is an enhanced madvise() call?
> Well, I think MADV_DONTNEED doesn't *have* do to anything at all. Its
> advisory after all. So it may immediately wipe out any data, but it may not.
>
> Those advisory semantics work fine w/ VRANGE_VOLATILE. However,
> VRANGE_NONVOLATILE is not quite advisory, its telling the system that it
> requires the memory at the specified range to not be volatile, and we
> need to correctly inform userland how much was changed and if any of the
> memory we did change to non-volatile was purged since being set volatile.
>
> In that way it is sort of different from madvise. Some sort of an
> madvise2 could be done, but then the extra purge state argument would be
> oddly defined for any other mode.
>
> Is your main concern here just wanting to have a zero-fill mode with
> volatile ranges? Or do you really want to squeeze this in to the madvise
> call interface?

The point is that MADV_DONTNEED is very similar in that sense,
especially if allowed to be lazy. It makes a lot of sense to permit
both scrubbing modes orthogonally.

The point you're making has to do with withdrawal of permission to flush
on demand, which is a result of having the lazy mode (ongoing
permission) and having to be able to withdraw such permission.

-0hpa

2013-10-08 00:02:43

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

Hello, John and Peter

On Mon, Oct 07, 2013 at 04:14:21PM -0700, John Stultz wrote:
> On 10/07/2013 03:56 PM, H. Peter Anvin wrote:
> > On 10/02/2013 05:51 PM, John Stultz wrote:
> >> From: Minchan Kim <[email protected]>
> >>
> >> This patch adds new system call sys_vrange.
> >>
> >> NAME
> >> vrange - Mark or unmark range of memory as volatile
> >>
> > vrange() is about as nondescriptive as one can get -- there is exactly
> > one letter that has any connection with that this does.
>
>
> Hrm. Any suggestions? Would volatile_range() be better?
>
>
> >
> >> SYNOPSIS
> >> int vrange(unsigned_long start, size_t length, int mode,
> >> int *purged);
> >>
> >> DESCRIPTION
> >> Applications can use vrange(2) to advise the kernel how it should
> >> handle paging I/O in this VM area. The idea is to help the kernel
> >> discard pages of vrange instead of reclaiming when memory pressure
> >> happens. It means kernel doesn't discard any pages of vrange if
> >> there is no memory pressure.
> >>
> >> mode:
> >> VRANGE_VOLATILE
> >> hint to kernel so VM can discard in vrange pages when
> >> memory pressure happens.
> >> VRANGE_NONVOLATILE
> >> hint to kernel so VM doesn't discard vrange pages
> >> any more.
> >>
> >> If user try to access purged memory without VRANGE_NOVOLATILE call,
> >> he can encounter SIGBUS if the page was discarded by kernel.
> >>
> >> purged: Pointer to an integer which will return 1 if
> >> mode == VRANGE_NONVOLATILE and any page in the affected range
> >> was purged. If purged returns zero during a mode ==
> >> VRANGE_NONVOLATILE call, it means all of the pages in the range
> >> are intact.
> > I'm a bit confused about the "purged"
> >
> > From an earlier version of the patch:
> >
> >> - What's different with madvise(DONTNEED)?
> >>
> >> System call semantic
> >>
> >> DONTNEED makes sure user always can see zero-fill pages after
> >> he calls madvise while vrange can see data or encounter SIGBUS.
> > This difference doesn't seem to be a huge one. The other one seems to
> > be the blocking status of MADV_DONTNEED, which perhaps may be better
> > handled by adding an option (MADV_LAZY) perhaps?
> >
> > That way we would have lazy vs. immediate, and zero versus SIGBUS.
>
> And some sort of lazy-canceling call as well.
>
>
> >
> > I see from the change history of the patch that this was an madvise() at
> > some point, but was changed into a separate system call at some point,
> > does anyone remember why that was? A quick look through my LKML
> > archives doesn't really make it clear.
>
> The reason we can't use madvise, is that to properly handle error cases
> and report the pruge state, we need an extra argument.
>
> In much earlier versions, we just returned an error when setting
> NONVOLATILE if the data was purged. However, since we have to possibly
> do allocations when marking a range as non-volatile, we needed a way to
> properly handle that allocation failing. We can't just return ENOMEM, as
> we may have already marked purged memory as non-volatile.
>
> Thus, that's why with vrange, we return the number of bytes modified,
> along with the purge state. That way, if an error does occur we can
> return the purge state of the bytes successfully modified, and only
> return an error if nothing was changed, much like when a write fails.

As well, we might need an additional argument, VRANGE_FULL/VRANGE_PARTIAL,
for the vrange system call. I discussed it a long time ago but omitted it
to keep the early review phase simple. It was requested by the Mozilla
folks and of course it makes sense to me.

https://lkml.org/lkml/2013/3/22/20

In short, if you mark a range with VRANGE_FULL, the kernel can discard all
of the pages within the range if memory is tight, while the kernel can
discard just part of the pages in the vrange if you mark the range with
VRANGE_PARTIAL.



>
> thanks
> -john
>

--
Kind regards,
Minchan Kim
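
A sketch of how that proposed extension might look from userland --
VRANGE_FULL and VRANGE_PARTIAL are only being proposed above, not part of
this series, and OR-ing them into the mode argument is an assumption of
this sketch:

/* Hypothetical: hint whether purging only part of the range is useful. */
vrange(start, len, VRANGE_VOLATILE | VRANGE_FULL, &purged);    /* purge all of it or none of it */
vrange(start, len, VRANGE_VOLATILE | VRANGE_PARTIAL, &purged); /* purging any subset is useful */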

2013-10-08 00:07:35

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

On 10/07/2013 05:03 PM, Minchan Kim wrote:
> Hello, John and Peter
>
> On Mon, Oct 07, 2013 at 04:14:21PM -0700, John Stultz wrote:
>> On 10/07/2013 03:56 PM, H. Peter Anvin wrote:
>>> I see from the change history of the patch that this was an madvise() at
>>> some point, but was changed into a separate system call at some point,
>>> does anyone remember why that was? A quick look through my LKML
>>> archives doesn't really make it clear.
>> The reason we can't use madvise, is that to properly handle error cases
>> and report the pruge state, we need an extra argument.
>>
>> In much earlier versions, we just returned an error when setting
>> NONVOLATILE if the data was purged. However, since we have to possibly
>> do allocations when marking a range as non-volatile, we needed a way to
>> properly handle that allocation failing. We can't just return ENOMEM, as
>> we may have already marked purged memory as non-volatile.
>>
>> Thus, that's why with vrange, we return the number of bytes modified,
>> along with the purge state. That way, if an error does occur we can
>> return the purge state of the bytes successfully modified, and only
>> return an error if nothing was changed, much like when a write fails.
> As well, we might need addtional argument VRANGE_FULL/VRANGE_PARTIAL
> for vrange system call. I discussed it long time ago but omitted it
> for early easy review phase. It is requested by Mozilla fork and of course
> I think it makes sense to me.
>
> https://lkml.org/lkml/2013/3/22/20
>
> In short, if you mark a range with VRANGE_FULL, kernel can discard all
> of pages within the range if memory is tight while kernel can discard
> part of pages in the vrange if you mark the range with VRANGE_PARTIAL.

Yea, I'm still not particularly fond of userland being able to specify
the purging semantics, but as we discussed earlier, this can be debated
in finer detail as an extension to the merged interface. :)

thanks
-john

2013-10-08 00:11:53

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

Hello Peter,

On Mon, Oct 07, 2013 at 04:59:40PM -0700, H. Peter Anvin wrote:
> On 10/07/2013 04:54 PM, John Stultz wrote:
> >>>
> >> And wouldn't this apply to MADV_DONTNEED just as well? Perhaps what we
> >> should do is an enhanced madvise() call?
> > Well, I think MADV_DONTNEED doesn't *have* do to anything at all. Its
> > advisory after all. So it may immediately wipe out any data, but it may not.
> >
> > Those advisory semantics work fine w/ VRANGE_VOLATILE. However,
> > VRANGE_NONVOLATILE is not quite advisory, its telling the system that it
> > requires the memory at the specified range to not be volatile, and we
> > need to correctly inform userland how much was changed and if any of the
> > memory we did change to non-volatile was purged since being set volatile.
> >
> > In that way it is sort of different from madvise. Some sort of an
> > madvise2 could be done, but then the extra purge state argument would be
> > oddly defined for any other mode.
> >
> > Is your main concern here just wanting to have a zero-fill mode with
> > volatile ranges? Or do you really want to squeeze this in to the madvise
> > call interface?
>
> The point is that MADV_DONTNEED is very similar in that sense,
> especially if allowed to be lazy. It makes a lot of sense to permit
> both scrubbing modes orthogonally.
>
> The point you're making has to do with withdrawal of permission to flush
> on demand, which is a result of having the lazy mode (ongoing
> permission) and having to be able to withdraw such permission.

I'm sorry I could not understand what you wanted to say.
Could you elaborate a bit?

Thanks.

>
> -0hpa
>

--
Kind regards,
Minchan Kim

2013-10-08 00:18:47

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

On 10/07/2013 05:13 PM, Minchan Kim wrote:
> Hello Peter,
>
> On Mon, Oct 07, 2013 at 04:59:40PM -0700, H. Peter Anvin wrote:
>> On 10/07/2013 04:54 PM, John Stultz wrote:
>>>> And wouldn't this apply to MADV_DONTNEED just as well? Perhaps what we
>>>> should do is an enhanced madvise() call?
>>> Well, I think MADV_DONTNEED doesn't *have* do to anything at all. Its
>>> advisory after all. So it may immediately wipe out any data, but it may not.
>>>
>>> Those advisory semantics work fine w/ VRANGE_VOLATILE. However,
>>> VRANGE_NONVOLATILE is not quite advisory, its telling the system that it
>>> requires the memory at the specified range to not be volatile, and we
>>> need to correctly inform userland how much was changed and if any of the
>>> memory we did change to non-volatile was purged since being set volatile.
>>>
>>> In that way it is sort of different from madvise. Some sort of an
>>> madvise2 could be done, but then the extra purge state argument would be
>>> oddly defined for any other mode.
>>>
>>> Is your main concern here just wanting to have a zero-fill mode with
>>> volatile ranges? Or do you really want to squeeze this in to the madvise
>>> call interface?
>> The point is that MADV_DONTNEED is very similar in that sense,
>> especially if allowed to be lazy. It makes a lot of sense to permit
>> both scrubbing modes orthogonally.
>>
>> The point you're making has to do with withdrawal of permission to flush
>> on demand, which is a result of having the lazy mode (ongoing
>> permission) and having to be able to withdraw such permission.
> I'm sorry I could not understand what you wanted to say.
> Could you elaborate a bit?
My understanding of his point is that VRANGE_VOLATILE is like a lazy
MADV_DONTNEED (with SIGBUS, rather than zero-fill, on fault), which
suggests that we should find a way to express VRANGE_VOLATILE as something
like MADV_DONTNEED|MADV_LAZY|MADV_SIGBUS_FAULT, instead of adding a new
syscall. This would provide more options, since one could instead just
do MADV_DONTNEED|MADV_LAZY if they wanted zero-fill faults.

And indeed, for the VRANGE_VOLATILE case, we could do something like
that, but the unresolved problem I see is that we still need to handle
the VRANGE_NONVOLATILE case, and the madvise() interface doesn't seem to
accommodate the needed semantics well.

thanks
-john
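
For reference, the madvise-flavored spelling being debated would look
roughly like this -- purely hypothetical, since neither MADV_LAZY nor
MADV_SIGBUS_FAULT exists anywhere; it only illustrates the interface
shape, and why the purge-state report has nowhere to go:

/* Hypothetical flags, shown only to illustrate the shape under discussion. */
madvise(addr, len, MADV_DONTNEED | MADV_LAZY | MADV_SIGBUS_FAULT); /* ~ VRANGE_VOLATILE */
madvise(addr, len, MADV_DONTNEED | MADV_LAZY);                     /* lazy, zero-fill faults */

/* madvise() only returns 0 or -1, so there is no slot for reporting how
 * much of the range was changed or whether anything was purged, which is
 * exactly what the VRANGE_NONVOLATILE path needs to hand back to userland. */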

2013-10-08 00:33:15

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

On Mon, Oct 07, 2013 at 05:18:40PM -0700, John Stultz wrote:
> On 10/07/2013 05:13 PM, Minchan Kim wrote:
> > Hello Peter,
> >
> > On Mon, Oct 07, 2013 at 04:59:40PM -0700, H. Peter Anvin wrote:
> >> On 10/07/2013 04:54 PM, John Stultz wrote:
> >>>> And wouldn't this apply to MADV_DONTNEED just as well? Perhaps what we
> >>>> should do is an enhanced madvise() call?
> >>> Well, I think MADV_DONTNEED doesn't *have* do to anything at all. Its
> >>> advisory after all. So it may immediately wipe out any data, but it may not.
> >>>
> >>> Those advisory semantics work fine w/ VRANGE_VOLATILE. However,
> >>> VRANGE_NONVOLATILE is not quite advisory, its telling the system that it
> >>> requires the memory at the specified range to not be volatile, and we
> >>> need to correctly inform userland how much was changed and if any of the
> >>> memory we did change to non-volatile was purged since being set volatile.
> >>>
> >>> In that way it is sort of different from madvise. Some sort of an
> >>> madvise2 could be done, but then the extra purge state argument would be
> >>> oddly defined for any other mode.
> >>>
> >>> Is your main concern here just wanting to have a zero-fill mode with
> >>> volatile ranges? Or do you really want to squeeze this in to the madvise
> >>> call interface?
> >> The point is that MADV_DONTNEED is very similar in that sense,
> >> especially if allowed to be lazy. It makes a lot of sense to permit
> >> both scrubbing modes orthogonally.
> >>
> >> The point you're making has to do with withdrawal of permission to flush
> >> on demand, which is a result of having the lazy mode (ongoing
> >> permission) and having to be able to withdraw such permission.
> > I'm sorry I could not understand what you wanted to say.
> > Could you elaborate a bit?
> My understanding of his point is that VRANGE_VOLATILE is like a lazy
> MADV_DONTNEED (with sigbus, rather then zero fill on fault), suggests
> that we should find a way to have VRANGE_VOLATILE be something like
> MADV_DONTNEED|MADV_LAZY|MADV_SIGBUS_FAULT, instead of adding a new
> syscall. This would provide more options, since one could instead just
> do MADV_DONTNEED|MADV_LAZY if they wanted zero-fill faults.

Hmm, actually, I have thought about a VRANGE_SIGBUS option, because the
Address/Thread sanitizer people wanted it, as you know, and someone else
might want it, too.

I agree it's orthogonal, but I'm not sure MADV_LAZY and MADV_SIGBUS_FAULT
could be used with any advice other than MADV_DONTNEED, so it might
confuse userland without benefit.

>
> And indeed, for the VRANGE_VOLATILE case, we could do something like
> that, but the unresolved problem I see is that that we still need to
> handle the VRANGE_NONVOLATILE case, and the madvise() interface doesn't
> seem to accomodate the needed semantics well.

The VRANGE_VOLATILE case could also be a problem. In my mind, I had an
idea to return the purged state when we call vrange(VRANGE_VOLATILE),
because the kernel could purge pages as soon as vrange(VRANGE_VOLATILE)
is called if memory is really tight, so userland can notice the purging
earlier and the kernel can discard pages more efficiently.


>
> thanks
> -john
>

--
Kind regards,
Minchan Kim

2013-10-08 00:37:21

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

On Tue, Oct 08, 2013 at 09:34:30AM +0900, Minchan Kim wrote:
> On Mon, Oct 07, 2013 at 05:18:40PM -0700, John Stultz wrote:
> > On 10/07/2013 05:13 PM, Minchan Kim wrote:
> > > Hello Peter,
> > >
> > > On Mon, Oct 07, 2013 at 04:59:40PM -0700, H. Peter Anvin wrote:
> > >> On 10/07/2013 04:54 PM, John Stultz wrote:
> > >>>> And wouldn't this apply to MADV_DONTNEED just as well? Perhaps what we
> > >>>> should do is an enhanced madvise() call?
> > >>> Well, I think MADV_DONTNEED doesn't *have* do to anything at all. Its
> > >>> advisory after all. So it may immediately wipe out any data, but it may not.
> > >>>
> > >>> Those advisory semantics work fine w/ VRANGE_VOLATILE. However,
> > >>> VRANGE_NONVOLATILE is not quite advisory, its telling the system that it
> > >>> requires the memory at the specified range to not be volatile, and we
> > >>> need to correctly inform userland how much was changed and if any of the
> > >>> memory we did change to non-volatile was purged since being set volatile.
> > >>>
> > >>> In that way it is sort of different from madvise. Some sort of an
> > >>> madvise2 could be done, but then the extra purge state argument would be
> > >>> oddly defined for any other mode.
> > >>>
> > >>> Is your main concern here just wanting to have a zero-fill mode with
> > >>> volatile ranges? Or do you really want to squeeze this in to the madvise
> > >>> call interface?
> > >> The point is that MADV_DONTNEED is very similar in that sense,
> > >> especially if allowed to be lazy. It makes a lot of sense to permit
> > >> both scrubbing modes orthogonally.
> > >>
> > >> The point you're making has to do with withdrawal of permission to flush
> > >> on demand, which is a result of having the lazy mode (ongoing
> > >> permission) and having to be able to withdraw such permission.
> > > I'm sorry I could not understand what you wanted to say.
> > > Could you elaborate a bit?
> > My understanding of his point is that VRANGE_VOLATILE is like a lazy
> > MADV_DONTNEED (with sigbus, rather then zero fill on fault), suggests
> > that we should find a way to have VRANGE_VOLATILE be something like
> > MADV_DONTNEED|MADV_LAZY|MADV_SIGBUS_FAULT, instead of adding a new
> > syscall. This would provide more options, since one could instead just
> > do MADV_DONTNEED|MADV_LAZY if they wanted zero-fill faults.
>
> Hmm, actually, I have thought VRANGE_SIGBUS option because Address/Thread
> sanitizer people wanted it as you know and someone might want it, too.
>
> I agree it's orthogonal but not sure MADV_LAZY and MADV_SIGBUS_FAULT can be
> used for other combination of advise except MADV_DONTNEED so it might
> confuse userland without benefit.
>
> >
> > And indeed, for the VRANGE_VOLATILE case, we could do something like
> > that, but the unresolved problem I see is that that we still need to
> > handle the VRANGE_NONVOLATILE case, and the madvise() interface doesn't
> > seem to accomodate the needed semantics well.
>
> VRANGE_VOLATILE case could be a problem. In my mind, I had an idea to
> return purged state when we call vrange(VRANGE_VOLATILE) because kernel
> could purge them as soon as vrange(VRANGE_VOLATILE) called if memory is
> really tight so userland can notice "purging" earlier and kernel can
> discard them more efficiently.
>

And we should return the number of bytes marked, whereas madvise only
returns an error code.

--
Kind regards,
Minchan Kim

2013-10-08 01:26:45

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

On 10/07/2013 05:13 PM, Minchan Kim wrote:
>>
>> The point is that MADV_DONTNEED is very similar in that sense,
>> especially if allowed to be lazy. It makes a lot of sense to permit
>> both scrubbing modes orthogonally.
>>
>> The point you're making has to do with withdrawal of permission to flush
>> on demand, which is a result of having the lazy mode (ongoing
>> permission) and having to be able to withdraw such permission.
>
> I'm sorry I could not understand what you wanted to say.
> Could you elaborate a bit?
>

Basically, you need this because of MADV_LAZY or the equivalent, so it
would be applicable to a similar variant of madvise().

As such I would suggest that a madvise4() call would be appropriate.

-hpa

2013-10-08 02:07:32

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

On Mon, Oct 07, 2013 at 06:24:49PM -0700, H. Peter Anvin wrote:
> On 10/07/2013 05:13 PM, Minchan Kim wrote:
> >>
> >> The point is that MADV_DONTNEED is very similar in that sense,
> >> especially if allowed to be lazy. It makes a lot of sense to permit
> >> both scrubbing modes orthogonally.
> >>
> >> The point you're making has to do with withdrawal of permission to flush
> >> on demand, which is a result of having the lazy mode (ongoing
> >> permission) and having to be able to withdraw such permission.
> >
> > I'm sorry I could not understand what you wanted to say.
> > Could you elaborate a bit?
> >
>
> Basically, you need this because of MADV_LAZY or the equivalent, so it
> would be applicable to a similar variant of madvise().
>
> As such I would suggest that an madvise4() call would be appropriate.
>
> -hpa

Maybe, int madvise5(addr, length, MADV_DONTNEED|MADV_LAZY|MADV_SIGBUS,
&purged, &ret);

Another reason this is hard is that madvise(2) is tightly coupled with
vma split/merge. It needs mmap_sem's write-side lock, which hurt the
anon-vrange test performance quite heavily, and userland might want to
mark volatile ranges in small units like page size, so it's undesirable
to implement this with vmas. Then, should we special-case the
implementation to avoid vma split/merge only for the MADV_LAZY case?
Doable, but it could make the code complicated and lose consistency with
the other madvise variants.

I think it would be better to implement MADV_FREE if you really want
MADV_LAZY (http://www.unix.com/man-page/FreeBSD/2/madvise/), which is
different from volatile ranges; vrange is the more advanced facility,
IMHO, because MADV_FREE's cost would be proportional to the range size
due to page table/page descriptor operations.

--
Kind regards,
Minchan Kim

2013-10-08 02:51:22

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

> Maybe, int madvise5(addr, length, MADV_DONTNEED|MADV_LAZY|MADV_SIGBUS,
> &purged, &ret);
>
> Another reason to make it hard is that madvise(2) is tight coupled with
> with vmas split/merge. It needs mmap_sem's write-side lock and it hurt
> anon-vrange test performance much heavily and userland might want to
> make volatile range with small unit like "page size" so it's undesireable
> to make it with vma. Then, we should filter out to avoid vma split/merge
> in implementation if only MADV_LAZY case? Doable but it could make code
> complicated and lost consistency with other variant of madvise.

I haven't seen your performance test results. Could you please point out
the URLs?

2013-10-08 03:06:20

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

Hi KOSAKI,

On Mon, Oct 07, 2013 at 10:51:18PM -0400, KOSAKI Motohiro wrote:
> >Maybe, int madvise5(addr, length, MADV_DONTNEED|MADV_LAZY|MADV_SIGBUS,
> > &purged, &ret);
> >
> >Another reason to make it hard is that madvise(2) is tight coupled with
> >with vmas split/merge. It needs mmap_sem's write-side lock and it hurt
> >anon-vrange test performance much heavily and userland might want to
> >make volatile range with small unit like "page size" so it's undesireable
> >to make it with vma. Then, we should filter out to avoid vma split/merge
> >in implementation if only MADV_LAZY case? Doable but it could make code
> >complicated and lost consistency with other variant of madvise.
>
> I haven't seen your performance test result. Could please point out URLs?

https://lkml.org/lkml/2013/3/12/105

--
Kind regards,
Minchan Kim

2013-10-08 03:28:40

by Jianyu Zhan

[permalink] [raw]
Subject: Re: [PATCH 07/14] vrange: Purge volatile pages when memory is tight

On Thu, Oct 3, 2013 at 8:51 AM, John Stultz <[email protected]> wrote:
> static inline int page_referenced(struct page *page, int is_locked,
> struct mem_cgroup *memcg,
> - unsigned long *vm_flags)
> + unsigned long *vm_flags,
> + int *is_vrange)
> {
> *vm_flags = 0;
> + *is_vrange = 0;
> return 0;
> }

I don't know if it is appropriate to add a parameter to such a core
function for an optional feature. Maybe the is_vrange flag should be
squashed into vm_flags? I am not sure.




--

Regards,
Zhan Jianyu

2013-10-08 04:35:37

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

(10/7/13 11:07 PM), Minchan Kim wrote:
> Hi KOSAKI,
>
> On Mon, Oct 07, 2013 at 10:51:18PM -0400, KOSAKI Motohiro wrote:
>>> Maybe, int madvise5(addr, length, MADV_DONTNEED|MADV_LAZY|MADV_SIGBUS,
>>> &purged, &ret);
>>>
>>> Another reason to make it hard is that madvise(2) is tight coupled with
>>> with vmas split/merge. It needs mmap_sem's write-side lock and it hurt
>>> anon-vrange test performance much heavily and userland might want to
>>> make volatile range with small unit like "page size" so it's undesireable
>>> to make it with vma. Then, we should filter out to avoid vma split/merge
>>> in implementation if only MADV_LAZY case? Doable but it could make code
>>> complicated and lost consistency with other variant of madvise.
>>
>> I haven't seen your performance test result. Could please point out URLs?
>
> https://lkml.org/lkml/2013/3/12/105

That's not a comparison with and without vma merge. I'm interested in
how much benefit avoiding the vma operations provides.

2013-10-08 07:10:48

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

On Tue, Oct 08, 2013 at 12:35:33AM -0400, KOSAKI Motohiro wrote:
> (10/7/13 11:07 PM), Minchan Kim wrote:
> >Hi KOSAKI,
> >
> >On Mon, Oct 07, 2013 at 10:51:18PM -0400, KOSAKI Motohiro wrote:
> >>>Maybe, int madvise5(addr, length, MADV_DONTNEED|MADV_LAZY|MADV_SIGBUS,
> >>> &purged, &ret);
> >>>
> >>>Another reason to make it hard is that madvise(2) is tight coupled with
> >>>with vmas split/merge. It needs mmap_sem's write-side lock and it hurt
> >>>anon-vrange test performance much heavily and userland might want to
> >>>make volatile range with small unit like "page size" so it's undesireable
> >>>to make it with vma. Then, we should filter out to avoid vma split/merge
> >>>in implementation if only MADV_LAZY case? Doable but it could make code
> >>>complicated and lost consistency with other variant of madvise.
> >>
> >>I haven't seen your performance test result. Could please point out URLs?
> >
> >https://lkml.org/lkml/2013/3/12/105
>
> It's not comparison with and without vma merge. I'm interest how much benefit
> vmas operation avoiding have.

I had a number but lost it, so I had to set it up again on my KVM
machine :( And I needed the old 3.7.0 kernel for testing the vma-based
approach.

DRAM:2G, CPU : 12

kernel 3.7.0

jemalloc: 20527 records/s
jemalloc vma based approach : 5360 records/s

The vrange call made things worse because every thread got stuck on mmap_sem.

kernel 3.11.0

jemalloc: 21176 records/s
jemalloc vroot tree approach: 103637 records/s

That is roughly a 5x improvement.

--
Kind regards,
Minchan Kim

2013-10-08 07:15:45

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 05/14] vrange: Add new vrange(2) system call

On Tue, Oct 08, 2013 at 04:12:02PM +0900, Minchan Kim wrote:
> On Tue, Oct 08, 2013 at 12:35:33AM -0400, KOSAKI Motohiro wrote:
> > (10/7/13 11:07 PM), Minchan Kim wrote:
> > >Hi KOSAKI,
> > >
> > >On Mon, Oct 07, 2013 at 10:51:18PM -0400, KOSAKI Motohiro wrote:
> > >>>Maybe, int madvise5(addr, length, MADV_DONTNEED|MADV_LAZY|MADV_SIGBUS,
> > >>> &purged, &ret);
> > >>>
> > >>>Another reason to make it hard is that madvise(2) is tight coupled with
> > >>>with vmas split/merge. It needs mmap_sem's write-side lock and it hurt
> > >>>anon-vrange test performance much heavily and userland might want to
> > >>>make volatile range with small unit like "page size" so it's undesireable
> > >>>to make it with vma. Then, we should filter out to avoid vma split/merge
> > >>>in implementation if only MADV_LAZY case? Doable but it could make code
> > >>>complicated and lost consistency with other variant of madvise.
> > >>
> > >>I haven't seen your performance test result. Could please point out URLs?
> > >
> > >https://lkml.org/lkml/2013/3/12/105
> >
> > It's not comparison with and without vma merge. I'm interest how much benefit
> > vmas operation avoiding have.
>
> I had an number but lost it so I should set up it in my KVM machine :(
> And I needed old kernel 3.7.0 for testing vma-based approach.
>
> DRAM:2G, CPU : 12
>
> kernel 3.7.0
>
> jemalloc: 20527 records/s
> jemalloc vma based approach : 5360 records/s
>
> vrange call made worse because every thread stuck with mmap_sem.
>
> kernel 3.11.0
>
> jemalloc: 21176 records/s
> jemalloc vroot tree approach: 103637 records/s
>
> It could enhance 5 times.

And please keep in mind that vrange users might want small vranges, like
PAGE_SIZE. If we go with the vma-based approach, we would consume memory
with lots of vm_area_structs.

--
Kind regards,
Minchan Kim

2013-10-08 16:23:19

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 07/14] vrange: Purge volatile pages when memory is tight

On 10/07/2013 08:27 PM, Zhan Jianyu wrote:
> On Thu, Oct 3, 2013 at 8:51 AM, John Stultz <[email protected]> wrote:
>> static inline int page_referenced(struct page *page, int is_locked,
>> struct mem_cgroup *memcg,
>> - unsigned long *vm_flags)
>> + unsigned long *vm_flags,
>> + int *is_vrange)
>> {
>> *vm_flags = 0;
>> + *is_vrange = 0;
>> return 0;
>> }
> I don't know if it is appropriate to add a parameter in such a core
> function for an optional functionality. Maybe the is_vrange flag
> should be squashed into the vm_flags ? I am not sure .
Yea, this wasn't something either Minchan or I were particularly fond
of, but with the vm_flags bits exhausted, there wasn't a clear way to do
this without doing the rmap traversal again.

Other suggestions? Extending vm_flags to 64 bits is something many better
mm devs have tried to merge unsuccessfully, so I'm hesitant to try
pushing it myself.

thanks
-john